Tuesday, June 13, 2017

SpatialHadoop Installation on Multi-Node Cluster

SpatialHadoop is a framework that adds spatial data processing support to each layer of Hadoop, namely the Storage, MapReduce, Operations, and Language layers. In this blog, I will explain the configuration of SpatialHadoop on a 4-node Hadoop cluster. If you want to learn more about what SpatialHadoop is and how it works, check out the following:

  1. A. Eldawy, M. F. Mokbel, and C. Jonathan. "HadoopViz: A MapReduce Framework for Extensible Visualization of Big Spatial Data". IEEE ICDE 2016.
  2. A. Eldawy and M. F. Mokbel. "SpatialHadoop: A MapReduce Framework for Spatial Data". IEEE ICDE 2015.
  3. A. Eldawy and M. F. Mokbel. "Pigeon: A Spatial MapReduce Language". IEEE ICDE 2014.

Prerequisite: a running multi-node Hadoop cluster.

SpatialHadoop

(1) Download the latest version of SpatialHadoop from http://spatialhadoop.cs.umn.edu/
(2) Extract the downloaded compressed file into the home directory of Hadoop, i.e. merge the SpatialHadoop files with your existing Hadoop installation.
(3) Set JAVA_HOME in etc/hadoop/hadoop-env.sh under the Hadoop home directory (if you have not set it already).
(4) You can test your installation by running the examples given at http://spatialhadoop.cs.umn.edu/ (a quick sketch follows).
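For a quick sanity check, you can generate a small random spatial file and build an index over it. The commands below are only a sketch following the examples on the SpatialHadoop site; the operation names (generate, index) and the key:value parameters (mbr, size, shape, sindex) may differ slightly between releases, so adjust them to your version:
$ shadoop generate test mbr:0,0,1000000,1000000 size:100.mb shape:point
$ shadoop index test test.grid shape:point sindex:grid
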
Pig
(1) Download a recent stable release of Pig from https://pig.apache.org/
(2) Unpack the downloaded Pig distribution and add the following environment variables to ~/.bashrc
export PIG_HOME=/path/to/hadoop/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
(3) Test the pig installation:
pig -version
pig -help
(4) Test run: run a Pig script using Hadoop MapReduce.
  • Suppose we have a text file (student.txt) containing the following information:
001,Rajiv,Reddy,21,984802233,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
  • And a Pig script (student.pig) with the following commands:
std = LOAD './pig/student.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
name = FOREACH std GENERATE firstname;
DUMP name;
  • Start the Hadoop cluster and test by running the following commands:
$ start-all.sh
$ hadoop dfs -mkdir /path/to/pig
$ hadoop dfs -copyFromLocal /path/to/pig/student.txt /path/to/pig
$ cd pig (go to the folder where you keep your Pig script)
~/pig$ pig student.pig
  • The DUMP command prints the first names of all students:
(Rajiv)
(siddarth)
(Rajesh)
(Preethi)
(Trupthi)
(Archana)
(Komal)
(Bharathi)
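If you only want to check the script logic without touching the cluster, the same script can also be run in Pig's local mode (this assumes student.txt is available on the local file system, so the LOAD path may need adjusting):
~/pig$ pig -x local student.pig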

Pigeon

  • Pigeon provides spatial functions for Pig (e.g. ST_MakePoint, ST_MakeLine). Build the Pigeon JAR (pigeon-0.2.2.jar) from its source with Maven:
mvn assembly:assembly
  • Also, you need the following two JAR files (the SpatialHadoop package already includes them under spatialhadoop-2.4.2-bin/share/hadoop/common/lib):
jts-1.13.jar;
esri-geometry-api-1.2.1.jar;
  • Create a folder (say pigeon) and keep these JARs in it.
  • Also keep all the data and Pig scripts in this folder.
  • The trajectory.pig script contains the following lines:
REGISTER 'pigeon-0.2.2.jar';
REGISTER 'esri-geometry-api-1.2.1.jar';
REGISTER 'jts-1.13.jar';

IMPORT 'pigeon_import.pig';

points = LOAD './pigeon/trajectory.tsv' AS (type, time:datetime, lat:double, lon:double);

-- turn each (lat, lon) pair into a point geometry, keeping the timestamp
s_points = FOREACH points GENERATE ST_MakePoint(lat, lon) AS point, time;
-- sort the points chronologically so the line follows the trajectory
points_by_time = ORDER s_points BY time;

-- collect all points into a single bag
points_grouped = GROUP points_by_time ALL;

-- connect the ordered points into one line and output it as WKT text
lines = FOREACH points_grouped GENERATE ST_AsText(ST_MakeLine(points_by_time));

STORE lines INTO 'line';
  • Start the Hadoop cluster and do the following:
$ start-all.sh
$ hadoop dfs -mkdir /path/to/pigeon
$ hadoop dfs -copyFromLocal /path/to/pigeon/trajectory.tsv /user/bigdata/pigeon
$ cd pigeon (go to the folder where you keep your Pig script and the other JARs)
~/pigeon$ pig trajectory.pig
~/pigeon$ hadoop dfs -cat /path/to/line/part-r-00000

Thanks...Mahbub

Friday, May 5, 2017

Spark Installation and Configuration with Multi-Node Hadoop Cluster


Apache Spark is a very popular and widely used distributed in-memory system for big data processing, as it reduces execution time dramatically compared to MapReduce jobs on Hadoop. If you want to learn more about Apache Spark, check out the following:
  1. Apache Spark. http://spark.apache.org/
  2. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica. "Spark: Cluster Computing with Working Sets". HotCloud 2010.
  3. M. Zaharia, R. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, I. Stoica. "Apache Spark: A Unified Engine for Big Data Processing". Communications of the ACM, November 2016.
  4. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". NSDI 2012.
  5. M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, M. Zaharia. "Spark SQL: Relational Data Processing in Spark". SIGMOD 2015.
You can run Spark in its standalone mode, on EC2, on Hadoop YARN, or on Apache Mesos, and you can access data in HDFS, Cassandra, HBase, Hive, Alluxio, and any other Hadoop data source. In this blog, I will explain how to configure Spark with a multi-node Hadoop cluster on Ubuntu and run a simple word count program.
Prerequisite:
  • JDK: JDK 1.7 or later. If you have already installed it, add the following lines to ~/.bashrc on each of the nodes; otherwise, you can check https://goo.gl/TVjzpY
$ gedit ~/.bashrc
export JAVA_HOME=/path/to/jdk_installation_dir/
export PATH=$JAVA_HOME/bin:$PATH
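You can verify the setting on each node with:
$ java -version
$ echo $JAVA_HOME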
Scala: (on each node)
1 > Download the scala binaries and unpack the archive into /usr/local/share
2 > set the following environment variables in ~/.bashrc
# Scala
export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin
3> test
~$ scala -version
Scala code runner version 2.11.11 -- Copyright 2002-2017, LAMP/EPFL
~$ which scala
/usr/local/share/scala/bin/scala
Apache Spark
1 > Download the compatible Spark version from http://spark.apache.org/downloads.html
2 > Unzip the downloaded version and set the Spark environment variables in ~/.bashrc file
#Spark
export SPARK_HOME=$HOME/hadoop/spark-2.1.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
3 > Edit spark-env.sh and slaves in the spark/conf directory
~$ cd $SPARK_HOME/conf
~/conf$ cp spark-env.sh.template spark-env.sh
~/conf$ gedit spark-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_65
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_WORKER_CORES=4
~/conf$ cp slaves.template slaves
~/conf$ gedit slaves
master
slave1
slave2
slave3
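Optionally, you can also pin the master host name and cap the memory each worker offers by adding two more lines to spark-env.sh. Both variables are standard Spark settings; the host name master and the 4g value are just examples for this cluster:
export SPARK_MASTER_HOST=master
export SPARK_WORKER_MEMORY=4g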
4 > Copy this configured Spark folder to each worker node and set the environment variables on each worker (one way to copy is sketched after the variables below)
#Spark
export SPARK_HOME=/home/bigdata/hadoop/spark-2.1.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
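One way to do the copy, assuming password-less SSH between the nodes and the same directory layout on every machine (the host names slave1-slave3 mirror the slaves file above), is:
$ rsync -az $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave1:$HOME/hadoop/
$ rsync -az $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave2:$HOME/hadoop/
$ rsync -az $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave3:$HOME/hadoop/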
5> Start Spark
~$ cd $SPARK_HOME/sbin
~SPARK_HOME/sbin$ ./start-all.sh
~$ jps (Master)
  Master
  Worker
~$ jps (Slave)
  Worker
    Also, you can check the Spark master web UI at http://master:8080/
6> Stop Spark
~$ cd $SPARK_HOME/sbin
~SPARK_HOME/sbin$ ./stop-all.sh
7> test
spark-submit --version
spark-shell
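As a further smoke test, you can submit the SparkPi example that ships with Spark to YARN. The jar path below assumes the Spark 2.1.1 binary distribution layout, so adjust the file name to your version:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.1.jar 10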
Testing Spark using a Java WordCount Program (JDK 1.8 required)
1> Create a Java Maven project and add the following lines to pom.xml
<properties>
   <maven.compiler.source>1.8</maven.compiler.source>
   <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
   <!-- Spark -->
   <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_2.11</artifactId>
       <version>2.1.1</version>
   </dependency>
       
   <!-- Hadoop -->
   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
       <version>2.7.1</version>
   </dependency>
</dependencies>

2> Add a class to the project with the following lines of code:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;

import scala.Tuple2;

public class WordCountApp {
    public static void main(String[] args) {
        // args[0]: HDFS input path, args[1]: HDFS output directory
        SparkConf conf = new SparkConf().setAppName("WordCount Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> textFile = sc.textFile(args[0]);

        // Split each line into words, pair each word with 1, and sum the counts per word
        JavaPairRDD<String, Integer> counts = textFile
            .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile(args[1]);
        sc.stop();
        System.out.println("\n\nSUCCESSFUL");
    }
}
3> Build the project and create a jar of the project (e.g. with mvn clean package, which places the jar under target/).
4> Start the Hadoop and Spark clusters.
5> Copy some documents to HDFS to count the words in them.
6> Run the WordCount program from the terminal in cluster mode on YARN:
spark-submit --class WordCountApp --master yarn --deploy-mode cluster --executor-memory 1g sparkTest-0.0.1-SNAPSHOT.jar hdfs://bigdatamaster:8020/user/bigdata/README.md hdfs://bigdatamaster:8020/user/bigdata/result.out
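When the job finishes, the word counts are written as part files inside the output directory given as the second argument; you can inspect them with something like the command below (the part file name may differ if there are several partitions):
$ hadoop dfs -cat hdfs://bigdatamaster:8020/user/bigdata/result.out/part-00000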
Thank you...Mahbub.