In this article we will see how to set up an Apache Spark cluster on Ubuntu machines using Spark's standalone cluster manager: we will build Spark from source, start a master and worker nodes, and connect to the cluster from the Spark shell.
1) Install Java
Spark processes run in the JVM, so Java must be installed on every machine that will run Spark jobs. Make sure
each machine has Java 8 or later installed; if not, Java 8 can be installed using the commands below:
$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
To check the installation, use the following command:
$ java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
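The same check can be scripted when several machines have to be prepared. The following is a minimal sketch, not part of the original setup; the function names are illustrative, and it assumes the version string has the form printed above ("1.8.0_151") or the newer "11.0.2" form.

```shell
#!/bin/sh
# Extract the major Java version from a version string.
java_major_version() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;   # legacy scheme: 1.8.0_151 -> 8
    *)   echo "$1" | cut -d. -f1 ;;   # newer scheme: 11.0.2 -> 11
  esac
}

# Read the installed version; `java -version` writes to stderr.
installed_java_version() {
  java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Succeeds only when a Java 8 or newer runtime is on the PATH.
check_java() {
  command -v java >/dev/null 2>&1 || return 1
  [ "$(java_major_version "$(installed_java_version)")" -ge 8 ]
}

# Usage:
# check_java && echo "Java is ready" || echo "Install Java 8 or later first"
```

Keeping the parsing in its own function makes it easy to try against version strings from other machines without Java being installed locally.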
2) Install Scala
Apache Spark is written in Scala, so we need Scala tooling to build it. If it is not installed already, run the following commands, which install sbt (the Scala build tool):
$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
$ sudo apt-get update
$ sudo apt-get install sbt
3) Install Git
Apache Spark requires Git. If it is not installed already, run the following command:
$ sudo apt-get install git
4) Download Spark
Apache Spark can be downloaded from the official Spark website as
prebuilt bundles or in source code form. For this article we will set up the cluster from source code:
$ sudo wget http://d3kbcqa49mib13.cloudfront.net/spark-2.2.0.tgz
$ sudo tar xvf spark-2.2.0.tgz
5) Build Spark
Spark can be built using sbt (Simple Build Tool), which is bundled with it. From the extracted spark-2.2.0 directory, run the following command to build the code:
$ sudo build/sbt assembly
Using /usr/lib/jvm/java-8-oracle as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
Attempting to fetch sbt
Launching sbt from build/sbt-launch-0.13.13.jar
Getting org.scala-sbt sbt 0.13.13 ...
Building Spark will take some time, so be patient!
6) Starting Master
Assuming Spark has been built by following steps 1 to 5 above, we can now start the Spark
master with the following command:
$ sudo ./sbin/start-master.sh -h 127.0.0.1 -p 7077 --webui-port 8080
This will start a master node on local IP 127.0.0.1 and port 7077. Spark comes with a web UI; with the options above,
details of the master can be checked at http://localhost:8080/ (the port given with --webui-port).
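The master takes a moment to come up, so a script that goes on to start workers may want to wait for it. A minimal sketch, assuming curl is installed and that the web UI listens on the port passed with --webui-port (8080 in step 6); wait_for_http is a hypothetical helper name, not a Spark tool.

```shell
#!/bin/sh
# Poll a URL until it answers or the timeout (in seconds) expires.
wait_for_http() {
  url="$1"
  timeout="${2:-30}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # curl exits 0 as soon as the server returns any HTTP response.
    if curl -s -o /dev/null --connect-timeout 2 "$url"; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Usage, with the web UI port chosen in step 6:
# wait_for_http http://127.0.0.1:8080/ 60 && echo "master is up"
```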
7) Starting Worker nodes
Follow steps 1 to 5 above on each machine where you want to start a worker for the master started
in step 6.
Any number of workers can be attached to a Spark master. To start a worker, use the following command:
$ sudo ./sbin/start-slave.sh spark://127.0.0.1:7077 -h 127.0.0.1 -p 7079 --webui-port 8082 -c 2 -m 1024m
This will start a worker node on local IP 127.0.0.1 and port 7079; "spark://127.0.0.1:7077" is the URL of the master node.
The -c and -m options set the number of CPU cores and the amount of RAM assigned to the worker.
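To make the role of each option explicit, the worker command can be assembled from named parameters. This is only a dry-run sketch: worker_command is a hypothetical helper that prints the command rather than running it, and its defaults are simply the values used in this step.

```shell
#!/bin/sh
# Print the start-slave.sh command for one worker (nothing is started).
worker_command() {
  master_url="$1"          # URL of the master, e.g. spark://127.0.0.1:7077
  cores="$2"               # value for -c (CPU cores given to the worker)
  memory="$3"              # value for -m (RAM given to the worker), e.g. 1024m
  host="${4:-127.0.0.1}"   # value for -h
  port="${5:-7079}"        # value for -p
  ui_port="${6:-8082}"     # value for --webui-port
  echo "./sbin/start-slave.sh $master_url -h $host -p $port --webui-port $ui_port -c $cores -m $memory"
}

# Usage:
# worker_command spark://127.0.0.1:7077 2 1024m
```

Printing the command first makes it easy to review the core and memory values for each machine before running it with sudo.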
A web UI is also available to monitor Spark workers; with the command above it is served on port 8082 of the worker machine (the --webui-port value).
8) Test status
The attached workers and submitted applications can now be viewed on the master web UI: http://localhost:8080/
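The status can also be checked from the command line. The sketch below rests on an assumption that should be verified against your Spark version: that the master web UI serves a JSON summary under /json/ in which each worker entry carries a "state" field such as "ALIVE". count_alive_workers is a hypothetical helper that only counts those entries in whatever JSON it reads from stdin.

```shell
#!/bin/sh
# Count worker entries reported as ALIVE in a JSON summary read from stdin.
# Assumes each worker has a field of the form "state" : "ALIVE".
count_alive_workers() {
  grep -o '"state" *: *"ALIVE"' | wc -l | tr -d ' '
}

# Usage against the master from step 6 (assumes curl is installed):
# curl -s http://localhost:8080/json/ | count_alive_workers
```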
9) Connecting to spark cluster
We can connect to the Spark master from an application, or open a shell against it using the following command:
$ ./bin/spark-shell --master spark://127.0.0.1:7077
A Spark shell for interacting with the cluster will open, as shown below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/11/11 16:10:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/11 16:10:32 WARN Utils: Your hostname, techie-Satellite-Pro-R50-B resolves to a loopback address: 127.0.1.1; using 192.168.0.103 instead (on interface wlan0)
17/11/11 16:10:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://192.168.0.103:4040
Spark context available as 'sc' (master = spark://127.0.0.1:7077, app id = app-20171111161034-0000).
Spark session available as 'spark'.
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
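To confirm that the cluster actually executes work, a small job can be piped into the shell non-interactively. This is a sketch under stated assumptions: SPARK_HOME and MASTER_URL are illustrative variables (defaulting to the current directory and the master from step 6), and run_smoke_test is a hypothetical helper name. The one-line job sums the numbers 1 to 100 on the cluster.

```shell
#!/bin/sh
# Run a one-line Scala job through spark-shell against the standalone master.
run_smoke_test() {
  spark_home="${SPARK_HOME:-.}"
  master_url="${MASTER_URL:-spark://127.0.0.1:7077}"
  if [ ! -x "$spark_home/bin/spark-shell" ]; then
    echo "spark-shell not found under $spark_home" >&2
    return 1
  fi
  # The shell reads the expression from stdin, evaluates it, and exits at end of input.
  echo 'println(sc.parallelize(1 to 100).sum())' |
    "$spark_home/bin/spark-shell" --master "$master_url"
}

# Usage, from the Spark source directory:
# run_smoke_test
```

If the workers are attached correctly, the value 5050.0 should appear among the shell output, and the application should be listed on the master web UI while it runs.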
In this article we have seen how to set up an Apache Spark cluster on Ubuntu machines using Spark's simple standalone cluster manager. In coming articles we will see how to set up a Spark cluster with Hadoop YARN and Apache Mesos, and how to submit a Spark application to a Spark cluster.