SpatialHadoop is a framework that adds spatial data processing support to each layer of Hadoop, namely the Storage, MapReduce, Operations, and Language layers. In this blog, I will explain the configuration of SpatialHadoop on a 4-node Hadoop cluster. If you want to learn more about what SpatialHadoop is and how it works, check out the following:
- A. Eldawy, M. F. Mokbel, and C. Jonathan. "HadoopViz: A MapReduce Framework for Extensible Visualization of Big Spatial Data". ICDE 2016, 2016.
- A. Eldawy and M. F. Mokbel. "SpatialHadoop: A MapReduce Framework for Spatial Data". IEEE ICDE 2015.
- A. Eldawy and M. F. Mokbel. "Pigeon: A Spatial MapReduce Language" IEEE ICDE 2014.
Prerequisites:
- You need an installed and configured Hadoop cluster (if you don't have one already). See the following link: http://emahbub.blogspot.ca/2017/01/hadoop.html
- JDK 1.6+
SpatialHadoop
(1) Download the SpatialHadoop binary distribution (this blog uses spatialhadoop-2.4.2-bin) from http://spatialhadoop.cs.umn.edu/
(2) Extract the downloaded compressed file into the home directory of Hadoop, i.e. merge the SpatialHadoop files with Hadoop.
(3) Set JAVA_HOME in etc/hadoop/hadoop-env.sh (if you haven't set it already)
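As a sketch, the JAVA_HOME line in hadoop-env.sh looks like the following (the JDK path below is an assumption; point it at your own installation):

```shell
# Append to $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# The JDK path is an assumption -- adjust it to your own installation.
export JAVA_HOME=${JAVA_HOME:-/usr/lib/jvm/java-8-openjdk-amd64}
```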
Pig
(1) Download Pig (this blog uses pig-0.16.0) from the Apache Pig website (http://pig.apache.org/)
(2) Unpack the downloaded Pig distribution and add the following environment variables to ~/.bashrc
export PIG_HOME=/path/to/hadoop/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
(3) Test the Pig installation:
pig -version
pig -help
(4) Test run: run a Pig script using Hadoop MapReduce.
- Suppose we have a text file (student.txt) containing the following information:
001,Rajiv,Reddy,21,984802233,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
- And a Pig script (student.pig) with the following commands:
std = LOAD './pig/student.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
name = FOREACH std GENERATE firstname;
DUMP name;
- Start the Hadoop cluster and test by running the following commands:
$ start-all.sh
$ hadoop dfs -mkdir /path/to/pig
$ hadoop dfs -copyFromLocal /path/to/pig/student.txt /path/to/pig
$ cd pig (go to folder where you keep your pig script)
~/pig$ pig student.pig
- Output:
(Rajiv)
(siddarth)
(Rajesh)
(Preethi)
(Trupthi)
(Archana)
(Komal)
(Bharathi)
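As a small extension of this test, here is a sketch (assuming the same student.txt schema as above) that counts students per city with the standard GROUP and COUNT operators:

```pig
-- Sketch: count students per city, assuming the student.txt schema above.
std = LOAD './pig/student.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
by_city = GROUP std BY city;                              -- one group per distinct city
counts = FOREACH by_city GENERATE group AS city, COUNT(std) AS n;
DUMP counts;
```

You can also sanity-check either script against the local filesystem first with `pig -x local student.pig` before running on the cluster.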
Pigeon
- Download the latest Pigeon JAR file from http://spatialhadoop.cs.umn.edu/pigeon/
- Or build the latest version of Pigeon from source (https://github.com/aseldawy/pigeon): download and unzip the source, then run the following command
mvn assembly:assembly
- You also need the following two JAR files (the SpatialHadoop package already ships them in spatialhadoop-2.4.2-bin/share/hadoop/common/lib):
jts-1.13.jar
esri-geometry-api-1.2.1.jar
- Create a folder (say pigeon) and keep these JARs in it
- Also keep all your data files and Pig scripts in this folder
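The folder setup above can be sketched as follows (the JAR locations are assumptions; adjust them to where your JARs actually live):

```shell
# Gather Pigeon and its dependency JARs into one working folder.
WORKDIR=pigeon
mkdir -p "$WORKDIR"
# Copy each required JAR if it is present in the current directory;
# otherwise fetch it from your SpatialHadoop lib folder by hand.
for jar in pigeon-0.2.2.jar jts-1.13.jar esri-geometry-api-1.2.1.jar; do
  if [ -f "$jar" ]; then
    cp "$jar" "$WORKDIR/"
  fi
done
```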
- The trajectory.pig script contains the following lines:
REGISTER 'pigeon-0.2.2.jar';
REGISTER 'esri-geometry-api-1.2.1.jar';
REGISTER 'jts-1.13.jar';
IMPORT 'pigeon_import.pig';
points = LOAD './pigeon/trajectory.tsv' AS (type, time: datetime, lat:double, lon:double);
s_points = FOREACH points GENERATE ST_MakePoint(lat, lon) AS point, time;
points_by_time = ORDER s_points BY time;
points_grouped = GROUP points_by_time ALL;
lines = FOREACH points_grouped GENERATE ST_AsText(ST_MakeLine(points_by_time));
STORE lines INTO 'line';
- Start the Hadoop cluster and do the following:
$ start-all.sh
$ hadoop dfs -mkdir /path/to/pigeon
$ hadoop dfs -copyFromLocal /path/to/pigeon/trajectory.tsv /path/to/pigeon
$ cd pigeon (go to folder where you keep your pig script and other JARs)
~/pigeon$ pig trajectory.pig
~/pigeon$ hadoop dfs -cat /path/to/line/part-r-00000
Thanks...Mahbub