Tuesday, June 13, 2017

SpatialHadoop Installation on Multi-Node Cluster

SpatialHadoop is a framework that adds spatial data processing support to each layer of Hadoop, namely the Storage, MapReduce, Operations, and Language layers. In this blog post, I will explain the configuration of SpatialHadoop on a 4-node Hadoop cluster. If you want to learn more about what SpatialHadoop is and how it works, check out the following:

  1. A. Eldawy, M. F. Mokbel, and C. Jonathan. "HadoopViz: A MapReduce Framework for Extensible Visualization of Big Spatial Data". IEEE ICDE 2016.
  2. A. Eldawy and M. F. Mokbel. "SpatialHadoop: A MapReduce Framework for Spatial Data". IEEE ICDE 2015.
  3. A. Eldawy and M. F. Mokbel. "Pigeon: A Spatial MapReduce Language". IEEE ICDE 2014.

Prerequisite: a running Hadoop cluster.

SpatialHadoop

(1) Download the latest version of SpatialHadoop from http://spatialhadoop.cs.umn.edu/
(2) Extract the downloaded archive into the Hadoop home directory, i.e., merge the SpatialHadoop files with Hadoop.
(3) Set JAVA_HOME in etc/hadoop/hadoop-env.sh (if you have not set it already).
(4) You can test your installation by running the examples given at http://spatialhadoop.cs.umn.edu/
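For step (3) above, the JAVA_HOME setting in hadoop-env.sh typically looks like the following; the JDK path shown here is only an example, so adjust it to wherever Java is installed on your nodes:

```shell
# In $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# (example path for OpenJDK on Ubuntu; use your own JDK location)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```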
Pig
(1) Download a recent stable release of Pig from https://pig.apache.org/
(2) Unpack the downloaded Pig distribution and add the following environment variables to ~/.bashrc:
export PIG_HOME=/path/to/hadoop/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
(3) Test the Pig installation:
pig -version
pig -help
(4) Test run: run a Pig script using Hadoop MapReduce.
  • Suppose we have a text file (student.txt) containing the following information:
001,Rajiv,Reddy,21,984802233,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
  • And a Pig script (student.pig) with the following commands:
std = LOAD './pig/student.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
name = FOREACH std GENERATE firstname;
DUMP name;
  • Start the Hadoop cluster and test by running the following commands:
$ start-all.sh
$ hadoop dfs -mkdir /path/to/pig
$ hadoop dfs -copyFromLocal /path/to/pig/student.txt /path/to/pig
$ cd pig   (go to the folder where you keep your Pig script)
~/pig$ pig student.pig
(Rajiv)
(siddarth)
(Rajesh)
(Preethi)
(Trupthi)
(Archana)
(Komal)
(Bharathi)
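For readers newer to Pig Latin, the LOAD / FOREACH...GENERATE / DUMP pipeline above is just a column projection. The same logic can be sketched in plain Python (the two sample rows below are taken from student.txt; the field order matches the script's schema):

```python
import csv
import io

# Two sample rows in the same comma-delimited layout as student.txt:
# id, firstname, lastname, age, phone, city
data = """001,Rajiv,Reddy,21,984802233,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata"""

# LOAD ... USING PigStorage(',') -> split each line on commas
rows = list(csv.reader(io.StringIO(data)))

# FOREACH std GENERATE firstname -> project the second column
names = [row[1] for row in rows]

# DUMP name -> Pig prints one parenthesized tuple per line
for n in names:
    print(f"({n})")
```

Pig runs the same projection as a MapReduce job over HDFS instead of a local loop, which is why the DUMP output above appears as parenthesized tuples.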

Pigeon

  • Download the Pigeon source code and build the JAR with Maven:
mvn assembly:assembly
  • Also, you need to download the following two JAR files (the SpatialHadoop package already includes them under spatialhadoop-2.4.2-bin/share/hadoop/common/lib):
jts-1.13.jar
esri-geometry-api-1.2.1.jar
  • Create a folder (say, pigeon) and keep these JARs in it.
  • Also keep all the data and Pig scripts in this folder.
  • The trajectory.pig script contains the following lines:
REGISTER 'pigeon-0.2.2.jar';
REGISTER 'esri-geometry-api-1.2.1.jar';
REGISTER 'jts-1.13.jar';

IMPORT 'pigeon_import.pig';

points = LOAD './pigeon/trajectory.tsv' AS (type, time: datetime, lat:double, lon:double);

s_points = FOREACH points GENERATE ST_MakePoint(lat, lon) AS point, time;
points_by_time = ORDER s_points BY time;

points_grouped = GROUP points_by_time ALL;

lines = FOREACH points_grouped GENERATE ST_AsText(ST_MakeLine(points_by_time));

STORE lines INTO 'line';
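Conceptually, the script sorts the points chronologically and folds them into a single linestring. The grouping logic can be sketched in Python as follows (the sample coordinates and timestamps below are made up for illustration, and the WKT string is hand-built rather than produced by Pigeon's actual ST_MakeLine/ST_AsText internals):

```python
# Each record: (type, time, lat, lon), mirroring the trajectory.tsv schema
points = [
    ("gps", "2017-06-13T10:02:00", 24.90, 91.87),
    ("gps", "2017-06-13T10:00:00", 24.89, 91.86),
    ("gps", "2017-06-13T10:01:00", 24.91, 91.88),
]

# ORDER s_points BY time -> sort the records chronologically
points_by_time = sorted(points, key=lambda p: p[1])

# GROUP ... ALL, then ST_MakeLine + ST_AsText -> join the ordered
# coordinates into a single WKT LINESTRING
coords = ", ".join(f"{lat} {lon}" for _, _, lat, lon in points_by_time)
line = f"LINESTRING ({coords})"
print(line)
```

The GROUP ... ALL step in the Pig script is what collapses all points into one bag so that ST_MakeLine can emit a single geometry; without it, each point would stay in its own record.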
  • Start the Hadoop cluster and do the following:
$ start-all.sh
$ hadoop dfs -mkdir /path/to/pigeon
$ hadoop dfs -copyFromLocal /path/to/pigeon/trajectory.tsv /path/to/pigeon
$ cd pigeon   (go to the folder where you keep your Pig script and the JARs)
~/pigeon$ pig trajectory.pig
~/pigeon$ hadoop dfs -cat /path/to/line/part-r-00000

Thanks...Mahbub