Apache Pig is an analysis tool used to process datasets stored in the Hadoop Distributed File System (HDFS).

In this tutorial, we will see how to load data into HDFS and then how to load that dataset from HDFS into Pig Latin for processing using the LOAD operator.

Load Operation

The LOAD operator is used to load data from HDFS or the local file system into Apache Pig.

Syntax:
grunt> LOAD 'data' [USING function] [AS schema];

  • 'data': The name of the file or directory, in single quotes. If a directory name is given, all the files in that directory are loaded.
  • USING function: The load function to use. If this clause is omitted, the built-in PigStorage function is used by default.
  • AS schema: Defines the structure of the dataset, such as column names and their data types (see the example below).
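For example, a minimal sketch (assuming the comma-delimited employee file that we copy to HDFS later in this tutorial) looks like this:

grunt> employees = LOAD '/pigexample/employee.txt' USING PigStorage(',') AS (emp_id:int, first_name:chararray, last_name:chararray, city:chararray, county:chararray);
grunt> raw_employees = LOAD '/pigexample/employee.txt';   -- no USING clause: PigStorage with a tab delimiter is used by default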

Let us see the step-by-step process to load data from the local system into Hadoop HDFS and then from HDFS into Pig.


Step 1: For this activity we have the following comma-delimited employee dataset.

1001,James,Butt,New Orleans,Orleans
1002,Josephine,Darakjy,Brighton,Livingston
1003,Art,Venere,Bridgeport,Gloucester
1004,Lenna,Paprocki,Anchorage,Anchorage
1005,Donette,Foller,Hamilton,Butler
1006,Simona,Morasca,Ashland,Ashland
1007,Mitsue,Tollner,Chicago,Cook
1008,Leota,Dilliard,San Jose,Santa
1009,Sage,Wieser,Sioux Falls,Minnehaha
1010,Kris,Marrier,Baltimore,Baltimore
1011,Minna,Amigon,Kulpsville,Montgomery
1012,Abel,Maclead,Phoenix,Suffolk
1013,Gladys,Rim,Taylor,Wayne
1014,Yuki,Whobrey,Rockford,Winnebago
1015,Fletcher,Flosi,Aston,Delaware
1016,Bette,Nicka,San Jose,Santa Clara
1017,Veronika,Inouye,Irving,Dallas
1018,Willard,Kolmetz,Albany,Albany
1019,Maryann,Royster,Middlesex,Middlesex

Step 2: We will create an “employee.txt” file on the local system and put this data into it.

Command:
cloudduggu@ubuntu:~/pig/tutorial$ nano employee.txt

Output:
[Screenshot: data set]

To save the file, press CTRL+O, and to exit the editor, press CTRL+X.


Step 3: Now we will start the Hadoop services from the sbin directory and verify them using the jps command.


Command:
cloudduggu@ubuntu:~/hadoop$ sbin/start-all.sh
cloudduggu@ubuntu:~/hadoop$ jps
cloudduggu@ubuntu:~/hadoop$ sbin/mr-jobhistory-daemon.sh start historyserver

Output:
[Screenshot: starting Hadoop services]


Step 4: We will create a directory named “pigexample” in HDFS and place the “employee.txt” file under it.


Command:
cloudduggu@ubuntu:~/hadoop$ hadoop fs -mkdir /pigexample
cloudduggu@ubuntu:~/hadoop$ hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/employee.txt /pigexample/
cloudduggu@ubuntu:~/hadoop$ hadoop fs -ls /pigexample/

Output:
[Screenshot: load data in Hadoop]
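Optionally, you can confirm the contents of the file on HDFS before loading it into Pig:

cloudduggu@ubuntu:~/hadoop$ hadoop fs -cat /pigexample/employee.txt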


Step 5: Next, we will start the Pig Grunt shell in MapReduce mode.


Command:
cloudduggu@ubuntu:~/pig$ pig -x mapreduce

Output:
[Screenshot: start Pig Grunt shell]
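Note: In MapReduce mode Pig reads and writes HDFS paths and runs its jobs on the Hadoop cluster. For quick testing against files on the local file system, Pig can also be started in local mode, which does not require the Hadoop services:

cloudduggu@ubuntu:~/pig$ pig -x local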


Step 6: Now load the file “employee.txt” into Pig using the statement below.


Command:
grunt> employees = LOAD '/pigexample/employee.txt' USING PigStorage(',') AS (emp_id:int, first_name:chararray, last_name:chararray, city:chararray, county:chararray);

Output:
[Screenshot: load data in Pig example]
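To confirm that the relation has the expected schema without running a MapReduce job, the DESCRIBE operator can be used:

grunt> DESCRIBE employees;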


Store Operation

The STORE operator is used to store the result set of a Pig Latin relation on Hadoop HDFS or the local file system.

Syntax:
grunt> STORE alias INTO 'directory' [USING function];

  • alias: The name of the relation to be stored.
  • INTO 'directory': The name of the storage directory where the result set will be written.
  • USING function: You can use store functions such as BinStorage for a machine-readable binary format or JsonStorage for JSON data (see the sketch after this list). If no function is specified, the PigStorage store function is used by default.
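As a rough sketch of the USING clause (the output paths here are only examples), the same relation could be stored with the default store function, an explicit delimiter, or a binary store function:

grunt> STORE employees INTO '/pigexample/out_tab';                          -- default PigStorage, tab-delimited
grunt> STORE employees INTO '/pigexample/out_csv' USING PigStorage(',');   -- comma-delimited text
grunt> STORE employees INTO '/pigexample/out_bin' USING BinStorage();      -- machine-readable binary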

We will load the “employee.txt” file from HDFS into Pig and then store the result set of the Pig Latin relation at the HDFS location ‘/pigexample/’.

Let us see this process in the steps below.


Step 1: Load the file “employee.txt” into Pig from HDFS using the statements below and verify the output using the DUMP operator.


Command:
grunt> employees = LOAD '/pigexample/employee.txt' USING PigStorage(',') AS (empid:int, firstname:chararray, lastname:chararray, city:chararray, county:chararray);

grunt> DUMP employees;

Output:
[Screenshot: load data in Pig example]

[Screenshot: verify output]
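Note that DUMP prints the entire relation, which can be slow for large datasets. An optional alternative is to dump only a few records using the LIMIT operator:

grunt> few_employees = LIMIT employees 5;
grunt> DUMP few_employees;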


Step 2: Now we will store the result set of the Pig Latin relation in “/pigexample/newemployeedata.txt” on HDFS using the STORE operator.


Command:
grunt> STORE employees INTO '/pigexample/newemployeedata.txt' USING PigStorage(',');

Output:
[Screenshot: store result]


Step 3: Now verify the location ‘/pigexample/newemployeedata.txt/’ in HDFS to make sure the data has been stored. You will see that a result file named “part-m-00000” has been created under the “/pigexample/newemployeedata.txt/” directory. Now cat that file to confirm that it contains the same data.


Command:
cloudduggu@ubuntu:~$ hadoop fs -ls /pigexample/
cloudduggu@ubuntu:~$ hadoop fs -ls /pigexample/newemployeedata.txt/
cloudduggu@ubuntu:~$ hadoop fs -cat /pigexample/newemployeedata.txt/part-m-00000

Output:
[Screenshot: store output verification]
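Optionally, if you want a single local copy of the stored result, the files under the output directory can be merged back to the local file system (the local path below is just an example):

cloudduggu@ubuntu:~$ hadoop fs -getmerge /pigexample/newemployeedata.txt /home/cloudduggu/pig/tutorial/newemployeedata.csv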