Tuesday, June 13, 2017

SpatialHadoop Installation on Multi-Node Cluster

SpatialHadoop is a framework that adds spatial data processing support to each layer of Hadoop, namely the Storage, MapReduce, Operations, and Language layers. In this blog, I will explain the configuration of SpatialHadoop on a 4-node Hadoop cluster. If you want to learn more about what SpatialHadoop is and how it works, check out the following:

  1. A. Eldawy, M. F. Mokbel, and C. Jonathan. "HadoopViz: A MapReduce Framework for Extensible Visualization of Big Spatial Data". IEEE ICDE 2016.
  2. A. Eldawy and M. F. Mokbel. "SpatialHadoop: A MapReduce Framework for Spatial Data". IEEE ICDE 2015.
  3. A. Eldawy and M. F. Mokbel. "Pigeon: A Spatial MapReduce Language". IEEE ICDE 2014.

Prerequisite: a running multi-node Hadoop cluster.

SpatialHadoop

(1) Download the latest version of SpatialHadoop from http://spatialhadoop.cs.umn.edu/
(2) Extract the downloaded compressed file into the home directory of Hadoop, i.e. merge the SpatialHadoop files with your existing Hadoop installation.
(3) Set JAVA_HOME in etc/hadoop/hadoop-env.sh under the Hadoop home directory (if you have not set it already).
(4) You can test your installation by running the examples given at http://spatialhadoop.cs.umn.edu/ (a quick sketch follows).
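For a quick sanity check, you can generate a small random spatial file and build an index over it. The commands below are only a sketch following the examples on the SpatialHadoop site; the operation names (generate, index) and the key:value parameters (mbr, size, shape, sindex) may differ slightly between releases, so adjust them to your version:
$ shadoop generate test mbr:0,0,1000000,1000000 size:100.mb shape:point
$ shadoop index test test.grid shape:point sindex:grid
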
Pig
(1) Download a recent stable release of Pig from https://pig.apache.org/
(2) Unpack the downloaded Pig distribution and add the following environment variables to ~/.bashrc
export PIG_HOME=/path/to/hadoop/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
(3) Test the pig installation:
pig -version
pig -help
(4) Test run: run a Pig script using Hadoop MapReduce.
  • Suppose we have a text file (student.txt) containing the following information:
001,Rajiv,Reddy,21,984802233,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
  • And a Pig script (student.pig) with the following commands:
std = LOAD './pig/student.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
name = FOREACH std GENERATE firstname;
DUMP name;
  • Start the Hadoop cluster and test by running the following commands:
$ start-all.sh
$ hadoop dfs -mkdir /path/to/pig
$ hadoop dfs -copyFromLocal /path/to/pig/student.txt /path/to/pig
$ cd pig (go to the folder where you keep your Pig script)
~/pig$ pig student.pig
  • The DUMP command prints the first names of all students:
(Rajiv)
(siddarth)
(Rajesh)
(Preethi)
(Trupthi)
(Archana)
(Komal)
(Bharathi)
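If you only want to check the script logic without touching the cluster, the same script can also be run in Pig's local mode (this assumes student.txt is available on the local file system, so the LOAD path may need adjusting):
~/pig$ pig -x local student.pig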

Pigeon

  • Pigeon provides spatial functions for Pig (e.g. ST_MakePoint, ST_MakeLine). Build the Pigeon JAR (pigeon-0.2.2.jar) from its source with Maven:
mvn assembly:assembly
  • Also, you need the following two JAR files (the SpatialHadoop package already includes them under spatialhadoop-2.4.2-bin/share/hadoop/common/lib):
jts-1.13.jar;
esri-geometry-api-1.2.1.jar;
  • Create a folder (say pigeon) and keep these JARs in it.
  • Also keep all the data and Pig scripts in this folder.
  • The trajectory.pig script contains the following lines:
REGISTER 'pigeon-0.2.2.jar';
REGISTER 'esri-geometry-api-1.2.1.jar';
REGISTER 'jts-1.13.jar';

IMPORT 'pigeon_import.pig';

points = LOAD './pigeon/trajectory.tsv' AS (type, time:datetime, lat:double, lon:double);

-- turn each (lat, lon) pair into a point geometry, keeping the timestamp
s_points = FOREACH points GENERATE ST_MakePoint(lat, lon) AS point, time;
-- sort the points chronologically so the line follows the trajectory
points_by_time = ORDER s_points BY time;

-- collect all points into a single bag
points_grouped = GROUP points_by_time ALL;

-- connect the ordered points into one line and output it as WKT text
lines = FOREACH points_grouped GENERATE ST_AsText(ST_MakeLine(points_by_time));

STORE lines INTO 'line';
  • Start the Hadoop cluster and do the following:
$ start-all.sh
$ hadoop dfs -mkdir /path/to/pigeon
$ hadoop dfs -copyFromLocal /path/to/pigeon/trajectory.tsv /user/bigdata/pigeon
$ cd pigeon (go to the folder where you keep your Pig script and the other JARs)
~/pigeon$ pig trajectory.pig
~/pigeon$ hadoop dfs -cat /path/to/line/part-r-00000

Thanks...Mahbub

Friday, May 5, 2017

Spark Installation and Configuration with Multi-Node Hadoop Cluster


Apache Spark is a very popular and widely used distributed in-memory system for big data processing, as it reduces execution time dramatically compared to MapReduce jobs on Hadoop. If you want to learn more about Apache Spark, check out the following:
  1. Apache Spark. http://spark.apache.org/
  2. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica. "Spark: Cluster Computing with Working Sets". HotCloud 2010.
  3. M. Zaharia, R. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, I. Stoica. "Apache Spark: A Unified Engine for Big Data Processing". Communications of the ACM, November 2016.
  4. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". NSDI 2012.
  5. M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, M. Zaharia. "Spark SQL: Relational Data Processing in Spark". SIGMOD 2015.
You can run Spark in its standalone mode, on EC2, on Hadoop YARN, or on Apache Mesos, and you can access data in HDFS, Cassandra, HBase, Hive, Alluxio, and any other Hadoop data source. In this blog, I will explain how to configure Spark with a multi-node Hadoop cluster on Ubuntu and run a simple word count program.
Prerequisite:
  • JDK: JDK 1.7 or later. If you have already installed it, add the following lines to ~/.bashrc on each of the nodes; otherwise, you can check https://goo.gl/TVjzpY
$ gedit ~/.bashrc
export JAVA_HOME=/path/to/jdk_installation_dir/
export PATH=$JAVA_HOME/bin:$PATH
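You can verify the setting on each node with:
$ java -version
$ echo $JAVA_HOME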
Scala: (on each node)
1 > Download the scala binaries and unpack the archive into /usr/local/share
2 > set the following environment variables in ~/.bashrc
# Scala
export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin
3> test
~$ scala -version
Scala code runner version 2.11.11 -- Copyright 2002-2017, LAMP/EPFL
~$ which scala
/usr/local/share/scala/bin/scala
Apache Spark
1 > Download the compatible Spark version from http://spark.apache.org/downloads.html
2 > Unzip the downloaded version and set the Spark environment variables in ~/.bashrc file
#Spark
export SPARK_HOME=$HOME/hadoop/spark-2.1.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
3 > Edit spark-env.sh and slaves in the spark/conf directory
~$ cd $SPARK_HOME/conf
~/conf$ cp spark-env.sh.template spark-env.sh
~/conf$ gedit spark-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_65
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_WORKER_CORES=4
~/conf$ cp slaves.template slaves
~/conf$ gedit slaves
master
slave1
slave2
slave3
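Optionally, you can also pin the master host name and cap the memory each worker offers by adding two more lines to spark-env.sh. Both variables are standard Spark settings; the host name master and the 4g value are just examples for this cluster:
export SPARK_MASTER_HOST=master
export SPARK_WORKER_MEMORY=4g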
4 > Copy this configured Spark folder to each worker node and set the environment variables on each worker (one way to copy is sketched after the variables below)
#Spark
export SPARK_HOME=/home/bigdata/hadoop/spark-2.1.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
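One way to do the copy, assuming password-less SSH between the nodes and the same directory layout on every machine (the host names slave1-slave3 mirror the slaves file above), is:
$ rsync -az $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave1:$HOME/hadoop/
$ rsync -az $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave2:$HOME/hadoop/
$ rsync -az $HOME/hadoop/spark-2.1.1-bin-hadoop2.7 slave3:$HOME/hadoop/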
5> Start Spark
~$ cd $SPARK_HOME/sbin
~SPARK_HOME/sbin$ ./start-all.sh
~$ jps (Master)
  Master
  Worker
~$ jps (Slave)
  Worker
    Also, you can check the Spark master web UI at http://master:8080/
6> Stop Spark
~$ cd $SPARK_HOME/sbin
~SPARK_HOME/sbin$ ./stop-all.sh
7> test
spark-submit --version
spark-shell
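As a further smoke test, you can submit the SparkPi example that ships with Spark to YARN. The jar path below assumes the Spark 2.1.1 binary distribution layout, so adjust the file name to your version:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.1.jar 10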
Testing Spark using a Java WordCount Program (JDK 1.8 required)
1> Create a Java Maven project and add the following lines to pom.xml
<properties>
   <maven.compiler.source>1.8</maven.compiler.source>
   <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
   <!-- Spark -->
   <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_2.11</artifactId>
       <version>2.1.1</version>
   </dependency>
       
   <!-- Hadoop -->
   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
       <version>2.7.1</version>
   </dependency>
</dependencies>

2> Add a class to the project with the following lines of code:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;

import scala.Tuple2;

public class WordCountApp {
    public static void main(String[] args) {
        // args[0]: HDFS input path, args[1]: HDFS output directory
        SparkConf conf = new SparkConf().setAppName("WordCount Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> textFile = sc.textFile(args[0]);

        // Split each line into words, pair each word with 1, and sum the counts per word
        JavaPairRDD<String, Integer> counts = textFile
            .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile(args[1]);
        sc.stop();
        System.out.println("\n\nSUCCESSFUL");
    }
}
3> Build the project and create a jar of the project (e.g. with mvn clean package, which places the jar under target/).
4> Start the Hadoop and Spark clusters.
5> Copy some documents to HDFS to count the words in them.
6> Run the WordCount program from the terminal in cluster mode on YARN:
spark-submit --class WordCountApp --master yarn --deploy-mode cluster --executor-memory 1g sparkTest-0.0.1-SNAPSHOT.jar hdfs://bigdatamaster:8020/user/bigdata/README.md hdfs://bigdatamaster:8020/user/bigdata/result.out
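When the job finishes, the word counts are written as part files inside the output directory given as the second argument; you can inspect them with something like the command below (the part file name may differ if there are several partitions):
$ hadoop dfs -cat hdfs://bigdatamaster:8020/user/bigdata/result.out/part-00000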
Thank you...Mahbub.