Apache Spark is a popular, widely used distributed in-memory system for big data processing, as it dramatically reduces execution time compared to MapReduce jobs on Hadoop. If you want to learn more about Apache Spark, check out the following papers:
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica. "Spark: Cluster Computing with Working Sets." HotCloud 2010.
- M. Zaharia, R. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, I. Stoica. "Apache Spark: A Unified Engine for Big Data Processing." Communications of the ACM, November 2016.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012.
- M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, M. Zaharia. "Spark SQL: Relational Data Processing in Spark." SIGMOD 2015.
You can run Spark in its standalone mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can also access data in HDFS, Cassandra, HBase, Hive, Alluxio, and any other Hadoop data source. In this blog, I will explain how to configure Spark on a multi-node Hadoop cluster on Ubuntu and run a simple word count program.
Prerequisites:
- JDK: JDK 1.7 or later. If you have already installed it, add the following lines to ~/.bashrc on each of the nodes; otherwise, you can check https://goo.gl/TVjzpY
$ gedit ~/.bashrc
export JAVA_HOME=/path/to/jdk_installation_dir/
export PATH=$JAVA_HOME/bin:$PATH
- Scala: see the section below
- Hadoop: You should have Hadoop configured on a multi-node cluster. You can check this link: http://emahbub.blogspot.ca/2017/01/hadoop.html
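Before moving on, it may help to confirm that Java is on the PATH and that the Hadoop daemons are running; a quick check (assuming HDFS and YARN are already started) could look like this:
~$ java -version
~$ jps
~$ hdfs dfsadmin -report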
Scala: (on each node)
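1 > Download and extract Scala. The version, mirror, and install directory below are only an example (any Scala 2.11.x works with Spark 2.1.x, and /usr/local/share/scala matches the settings in the next step):
~$ wget https://downloads.lightbend.com/scala/2.11.11/scala-2.11.11.tgz
~$ tar -xzf scala-2.11.11.tgz
~$ sudo mv scala-2.11.11 /usr/local/share/scala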
2 > Set the following environment variables in ~/.bashrc
# Scala
export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin
3 > Test
~$ scala -version
Scala code runner version 2.11.11 -- Copyright 2002-2017, LAMP/EPFL
~$ which scala
/usr/local/share/scala/bin/scala
Apache Spark
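1 > Download a prebuilt Spark package. The version and mirror below are only an example (spark-2.1.1-bin-hadoop2.7 matches the paths used in the rest of this post):
~$ cd $HOME/hadoop
~$ wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz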
2 > Unzip the downloaded package (a command sketch follows the environment variables below) and set the Spark environment variables in the ~/.bashrc file
#Spark
export SPARK_HOME=$HOME/hadoop/spark-2.1.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
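The unpacking itself is a one-liner; the target directory is an assumption that matches SPARK_HOME above:
~$ cd $HOME/hadoop
~$ tar -xzf spark-2.1.1-bin-hadoop2.7.tgz
~$ source ~/.bashrc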
3 > Edit spark-env.sh and slaves in the spark/conf directory
~$ cd $SPARK_HOME/conf
~/conf$ cp spark-env.sh.template spark-env.sh
~/conf$ gedit spark-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_65
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_WORKER_CORES=4
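You can also cap the memory each worker may use or pin the master hostname in spark-env.sh; these extra lines are optional and the values are only examples:
export SPARK_WORKER_MEMORY=4g
export SPARK_MASTER_HOST=master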
~/conf$ cp slaves.template slaves
~/conf$ gedit slaves
master
slave1
slave2
slave3
4 > Copy this configured Spark folder to each worker node (for example with scp, as sketched below) and set the environment variables
#Spark
export SPARK_HOME=/home/bigdata/hadoop/spark-2.1.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
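A simple way to push the folder out is scp; the worker hostnames and the destination path are assumptions that match the rest of this post:
~$ scp -r $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave1:/home/bigdata/hadoop/
~$ scp -r $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave2:/home/bigdata/hadoop/
~$ scp -r $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave3:/home/bigdata/hadoop/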
5 > Start Spark
~$ cd $SPARK_HOME/sbin
~SPARK_HOME/sbin$ ./start-all.sh
~$ jps (Master)
Master
Worker
~$ jps (Slave)
Worker
Also, you can check the Spark master web UI at http://master:8080/
6 > Stop Spark
~$ cd $SPARK_HOME/sbin
~SPARK_HOME/sbin$ ./stop-all.sh
7 > Test
~$ spark-submit --version
~$ spark-shell
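As a slightly bigger smoke test, you can submit the bundled SparkPi example to the standalone master; the master URL and the examples jar path below are assumptions based on the spark-2.1.1-bin-hadoop2.7 layout:
~$ spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.1.jar 10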
Testing Spark with a Java WordCount Program (JDK 1.8 Required)
1> Create a Java Maven project and add the following lines to pom.xml
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
  <!-- Spark -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.1</version>
  </dependency>
  <!-- Hadoop -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
  </dependency>
</dependencies>
2> Add a class to the project with the following code
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCountApp {
    public static void main(String[] args) {
        // args[0] = input path on HDFS, args[1] = output directory on HDFS
        SparkConf conf = new SparkConf().setAppName("WordCount Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file as an RDD of lines
        JavaRDD<String> textFile = sc.textFile(args[0]);

        // Split each line into words, map each word to (word, 1), and sum the counts per word
        JavaPairRDD<String, Integer> counts = textFile
                .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // Write the (word, count) pairs to the output directory
        counts.saveAsTextFile(args[1]);

        sc.stop();
        System.out.println("\n\nSUCCESSFUL");
    }
}
3> Build the project and create a jar from it.
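With a standard Maven layout this is just the following (the jar name sparkTest-0.0.1-SNAPSHOT.jar used below comes from the artifactId and version in your pom.xml):
~$ mvn clean package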
4> Start the Hadoop and Spark clusters
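A sketch of the start-up commands, assuming the usual script locations:
~$ $HADOOP_HOME/sbin/start-dfs.sh
~$ $HADOOP_HOME/sbin/start-yarn.sh
~$ $SPARK_HOME/sbin/start-all.sh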
5> Copy the documents whose words you want to count onto HDFS
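For example, to copy Spark's README (the HDFS path matches the spark-submit command below; adjust it to your own layout):
~$ hdfs dfs -put $SPARK_HOME/README.md /user/bigdata/README.md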
6> Run the WordCount program from the terminal in cluster mode on YARN
spark-submit --class WordCountApp --master yarn --deploy-mode cluster --executor-memory 1g sparkTest-0.0.1-SNAPSHOT.jar hdfs://bigdatamaster:8020/user/bigdata/README.md hdfs://bigdatamaster:8020/user/bigdata/result.out
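Once the YARN application finishes, you can inspect the output directory (the part-file names are Hadoop's defaults):
~$ hdfs dfs -cat /user/bigdata/result.out/part-*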
Thank you...Mahbub.