In this article we will see how to create a Spark Java application and submit it to a Spark cluster for execution. We will create a Maven Java application using the Spark Java API.

1) Install Java

Spark processes run in the JVM, so Java must be pre-installed on every machine that will run the Spark application. Make sure the development machine has Java 8+ installed; if not, Java 8 can be installed easily using the commands below:

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

To verify the installation, use the following command:

$ java -version

java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

2) Install Maven

We will create a Maven application for this purpose and then add the Spark Java dependency to it. If Maven is not installed already, the following commands can be used to install it:

$ sudo apt-get update
$ sudo apt-get install maven

To verify the installation, use the following command:

$ mvn -version

Apache Maven 3.3.9
Maven home: /usr/share/maven
Java version: 1.8.0_151, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_IN, platform encoding: UTF-8
OS name: "linux", version: "4.4.0-98-generic", arch: "amd64", family: "unix"

3) Create Java Maven project

Now let's create a Maven Java project from the command line and import it into Eclipse to start writing the application code:

$ mvn archetype:generate -DgroupId=com.example -DartifactId=SparkJavaProject -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

$ cd SparkJavaProject/
$ mvn eclipse:eclipse -Dwtpversion=2.0
Now import the project into Eclipse to get started.

4) Adding Spark dependencies to project

Let's add the Spark dependency to the pom.xml file so the application can connect and communicate with the Spark cluster. We will build a .jar file for this application and then submit it to the Spark cluster as a job to be executed.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<!-- existing groupId/artifactId/version generated by the archetype -->
	<dependencies>
		<!-- Spark dependency -->
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-sql_2.11</artifactId>
			<version>2.2.0</version>
		</dependency>
	</dependencies>
</project>

5) Application Code

In this simple application we will start a SparkSession and read a text file (/home/techie/fruits.txt) to count the lines in which the words "Apple" and "Orange" occur.

The content of input text file (/home/techie/fruits.txt) is:
Apple Orange Apple Orange
Apple Orange Apple Orange
Apple Orange Apple Orange
Apple Orange Apple Orange

Create an application file (Application.java) in: src/main/java


import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class Application {
	public static void main(String[] args) {
		// since Spark 2.0, SparkSession is the unified entry point
		SparkSession spark = SparkSession.builder().appName("FruitApp").getOrCreate();

		// read the file as a typed Dataset<String> of lines
		Dataset<String> logData = spark.read().textFile("/home/techie/fruits.txt");

		// count the lines that contain "Apple"
		long appleCount = logData.filter(s -> s.contains("Apple")).count();

		// count the lines that contain "Orange"
		long orangeCount = logData.filter(s -> s.contains("Orange")).count();

		System.out.println("appleCount: " + appleCount + ", orangeCount: " + orangeCount);

		spark.stop();
	}
}

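Before going near a cluster, the filter-and-count logic can be sanity-checked with plain Java streams on a local file. This is a minimal sketch (the class name and temp file are illustrative, not part of the Spark application):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class LocalFruitCount {
	// count lines of the file that contain the given word,
	// mirroring Dataset.filter(s -> s.contains(word)).count()
	static long countLinesContaining(Path file, String word) throws Exception {
		return Files.lines(file).filter(s -> s.contains(word)).count();
	}

	public static void main(String[] args) throws Exception {
		// recreate the article's sample input in a temp file
		Path input = Files.createTempFile("fruits", ".txt");
		Files.write(input, Arrays.asList(
				"Apple Orange Apple Orange",
				"Apple Orange Apple Orange",
				"Apple Orange Apple Orange",
				"Apple Orange Apple Orange"));

		long appleCount = countLinesContaining(input, "Apple");
		long orangeCount = countLinesContaining(input, "Orange");

		// prints "appleCount: 4, orangeCount: 4" -- note these are
		// line counts, not word counts (each line holds two of each word)
		System.out.println("appleCount: " + appleCount + ", orangeCount: " + orangeCount);
	}
}
```

Note that filter(...).count() counts matching lines, not individual word occurrences; that is exactly what the Spark version above computes as well.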

6) Compile and get .jar file

Compile and package the code with Maven:

$ mvn package

This generates the application .jar file in the target folder (with archetype defaults the file is named SparkJavaProject-1.0-SNAPSHOT.jar; the exact name depends on the artifactId and version in pom.xml). This .jar file will be used to submit the Spark application as a job to the cluster master.

7) Submitting application to Spark

The application, in the form of a precompiled .jar file, can be submitted to the Spark cluster master with the following command:

$ sudo ./bin/spark-submit --class Application --master spark://<master-host>:6066 --deploy-mode cluster /opt/SparkJavaProject.jar

Note: While supplying '--master' in cluster mode, make sure you use the Spark master's "REST URL: spark://<master-host>:6066 (cluster mode)"; this is different from the regular "URL: spark://<master-host>:7077". Both addresses can be found on the master's web UI (http://localhost:8080/).

The --class parameter requires the fully qualified name of the class containing main().
The --deploy-mode parameter takes one of two values, 'cluster' or 'client', and determines where the driver process runs.

In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.

8) Application status on Master UI

The submitted application's name, assigned worker, cores and memory used, and the lists of running and completed applications are shown on the master's web UI: http://localhost:8080/

9) Application status on Worker UI

Details of running drivers, finished drivers and running executors are shown on the worker UI: http://localhost:8082/

10) Application status on Application UI

Spark also provides a web view with details about each submitted application and its processes.

In this article we have seen how to create a Spark Java application and submit it to a Spark cluster for execution. In coming articles we will see more about Spark coding practices and optimisation.
  • Nov 12, 2017
  • Big Data