Apache Spark is a popular, widely used distributed in-memory system for big data processing, as it dramatically reduces execution time compared to MapReduce jobs on Hadoop. If you want to learn more about Apache Spark, check out the following papers:
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica. "Spark: Cluster Computing with Working Sets." HotCloud 2010.
- M. Zaharia, R. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, I. Stoica. "Apache Spark: A Unified Engine for Big Data Processing." Communications of the ACM, November 2016.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012.
- M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, M. Zaharia. "Spark SQL: Relational Data Processing in Spark." SIGMOD 2015.
You can run Spark in its standalone mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can also access data in HDFS, Cassandra, HBase, Hive, Alluxio, and any other Hadoop data source. In this blog, I will explain how to configure Spark on a multi-node Hadoop cluster on Ubuntu and run a simple word count program.
Prerequisites:
- JDK: JDK 1.7 or later. If you have already installed it, add the following lines to ~/.bashrc on each of the nodes; otherwise, you can check https://goo.gl/TVjzpY
$ gedit ~/.bashrc
export JAVA_HOME=/path/to/jdk_installation_dir/
export PATH=$JAVA_HOME/bin:$PATH
- Scala: see the section below
- Hadoop: You should have Hadoop configured on a multi-node cluster. You can check this link: http://emahbub.blogspot.ca/2017/01/hadoop.html
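Before moving on, it may help to confirm that Java is on the PATH and that the Hadoop daemons are running; a quick check (assuming HDFS and YARN are already started) could look like this:
~$ java -version
~$ jps
~$ hdfs dfsadmin -report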
Scala: (on each node)
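1 > Download and extract Scala. The version, mirror, and install directory below are only an example (any Scala 2.11.x works with Spark 2.1.x, and /usr/local/share/scala matches the settings in the next step):
~$ wget https://downloads.lightbend.com/scala/2.11.11/scala-2.11.11.tgz
~$ tar -xzf scala-2.11.11.tgz
~$ sudo mv scala-2.11.11 /usr/local/share/scala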
2 > Set the following environment variables in ~/.bashrc
# Scala
export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin
3 > Test
~$ scala -version
Scala code runner version 2.11.11 -- Copyright 2002-2017, LAMP/EPFL
~$ which scala
/usr/local/share/scala/bin/scala
Apache Spark
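1 > Download a prebuilt Spark package. The version and mirror below are only an example (spark-2.1.1-bin-hadoop2.7 matches the paths used in the rest of this post):
~$ cd $HOME/hadoop
~$ wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz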
2 > Unzip the downloaded package (a command sketch follows the environment variables below) and set the Spark environment variables in the ~/.bashrc file
#Spark
export SPARK_HOME=$HOME/hadoop/spark-2.1.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
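The unpacking itself is a one-liner; the target directory is an assumption that matches SPARK_HOME above:
~$ cd $HOME/hadoop
~$ tar -xzf spark-2.1.1-bin-hadoop2.7.tgz
~$ source ~/.bashrc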
3 > Edit spark-env.sh and slaves in the spark/conf directory
~$ cd $SPARK_HOME/conf
~/conf$ cp spark-env.sh.template spark-env.sh
~/conf$ gedit spark-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_65
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_WORKER_CORES=4
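You can also cap the memory each worker may use or pin the master hostname in spark-env.sh; these extra lines are optional and the values are only examples:
export SPARK_WORKER_MEMORY=4g
export SPARK_MASTER_HOST=master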
~/conf$ cp slaves.template slaves
~/conf$ gedit slaves
master
slave1
slave2
slave3
4 > Copy this configured Spark folder to each worker node (for example with scp, as sketched below) and set the environment variables
#Spark
export SPARK_HOME=/home/bigdata/hadoop/spark-2.1.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
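A simple way to push the folder out is scp; the worker hostnames and the destination path are assumptions that match the rest of this post:
~$ scp -r $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave1:/home/bigdata/hadoop/
~$ scp -r $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave2:/home/bigdata/hadoop/
~$ scp -r $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave3:/home/bigdata/hadoop/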
5 > Start Spark
~$ cd $SPARK_HOME/sbin
~SPARK_HOME/sbin$ ./start-all.sh
~$ jps (Master)
Master
Worker
~$ jps (Slave)
Worker
Also, you can check the Spark master web UI at http://master:8080/
6 > Stop Spark
~$ cd $SPARK_HOME/sbin
~SPARK_HOME/sbin$ ./stop-all.sh
7 > Test
~$ spark-submit --version
~$ spark-shell
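As a slightly bigger smoke test, you can submit the bundled SparkPi example to the standalone master; the master URL and the examples jar path below are assumptions based on the spark-2.1.1-bin-hadoop2.7 layout:
~$ spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.1.jar 10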
Testing Spark with a Java WordCount Program (JDK 1.8 Required)
1> Create a Java Maven project and add the following lines to pom.xml
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
  <!-- Spark -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.1</version>
  </dependency>
  <!-- Hadoop -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
  </dependency>
</dependencies>
2> Add a class to the project with the following code
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCountApp {
    public static void main(String[] args) {
        // args[0] = input path on HDFS, args[1] = output directory on HDFS
        SparkConf conf = new SparkConf().setAppName("WordCount Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file as an RDD of lines
        JavaRDD<String> textFile = sc.textFile(args[0]);

        // Split each line into words, map each word to (word, 1), and sum the counts per word
        JavaPairRDD<String, Integer> counts = textFile
                .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // Write the (word, count) pairs to the output directory
        counts.saveAsTextFile(args[1]);

        sc.stop();
        System.out.println("\n\nSUCCESSFUL");
    }
}
3> Build the project and create a jar from it.
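With a standard Maven layout this is just the following (the jar name sparkTest-0.0.1-SNAPSHOT.jar used below comes from the artifactId and version in your pom.xml):
~$ mvn clean package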
4> Start the Hadoop and Spark clusters
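A sketch of the start-up commands, assuming the usual script locations:
~$ $HADOOP_HOME/sbin/start-dfs.sh
~$ $HADOOP_HOME/sbin/start-yarn.sh
~$ $SPARK_HOME/sbin/start-all.sh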
5> Copy the documents whose words you want to count onto HDFS
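For example, to copy Spark's README (the HDFS path matches the spark-submit command below; adjust it to your own layout):
~$ hdfs dfs -put $SPARK_HOME/README.md /user/bigdata/README.md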
6> Run the WordCount program from the terminal in cluster mode on YARN
spark-submit --class WordCountApp --master yarn --deploy-mode cluster --executor-memory 1g sparkTest-0.0.1-SNAPSHOT.jar hdfs://bigdatamaster:8020/user/bigdata/README.md hdfs://bigdatamaster:8020/user/bigdata/result.out
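Once the YARN application finishes, you can inspect the output directory (the part-file names are Hadoop's defaults):
~$ hdfs dfs -cat /user/bigdata/result.out/part-*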
Thank you...Mahbub.