Apache Pig Load & Store Built-In Functions

Load & Store Functions

Apache Pig provides several built-in load and store functions, such as PigStorage, TextLoader, and BinStorage, along with support for compressed files. They determine how data is read into Pig and how it is written back out. These functions are used with the LOAD and STORE operators.

The following table lists the load and store functions that Apache Pig supports.

S.No.  Function              Description
1      PigStorage()          Loads and stores structured (delimited) text files.
2      TextLoader()          Loads unstructured text data.
3      BinStorage()          Loads and stores data in Pig's machine-readable binary format.
4      Handling Compression  Not a separate function: Pig can load and store compressed data through its load/store functions.
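
Compression is handled by the load/store function rather than by a function of its own. According to the Pig documentation, PigStorage and TextLoader read and write gzip and bzip2 files, recognized by the .gz and .bz/.bz2 file extensions. The Python sketch below only illustrates that idea of choosing a codec from the extension; it is not Pig's implementation, and the helper name open_text and the temporary file are hypothetical.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical helper (not part of Pig): pick a decompressor from the extension.
def open_text(path):
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith((".bz2", ".bz")):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")

# Write one gzip-compressed record to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "employee.txt.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("1001,James,Butt,New Orleans,Orleans\n")
    with open_text(path) as f:
        first_line = f.readline().rstrip("\n")

print(first_line)
```

In Pig itself nothing extra is needed in the script: pointing LOAD at a file whose name ends in .gz or .bz2 is enough.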

Let us see a couple of examples.


PigStorage()

The PigStorage() function loads and stores data as structured text files, in compressed or uncompressed form. It is Pig's default load and store function. If no delimiter is given, fields are assumed to be separated by tabs.

Syntax:
PigStorage( [field_delimiter] , ['options'] )

For this example we use the dataset “employee.txt”, located in the HDFS directory /pigexample/.

Content of “employee.txt”:

1001,James,Butt,New Orleans,Orleans
1002,Josephine,Darakjy,Brighton,Livingston
1003,Art,Venere,Bridgeport,Gloucester
1004,Lenna,Paprocki,Anchorage,Anchorage
1005,Donette,Foller,Hamilton,Butler
1006,Simona,Morasca,Ashland,Ashland
1007,Mitsue,Tollner,Chicago,Cook
1008,Leota,Dilliard,San Jose,Santa
1009,Sage,Wieser,Sioux Falls,Minnehaha
1010,Kris,Marrier,Baltimore,Baltimore

We will load the “employee.txt” data from the HDFS directory “/pigexample/” into Pig using the PigStorage() function. The fields of each record are separated by a comma (,), so we pass the comma as the delimiter.

Command:
grunt> empdetail = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );
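
To make the effect of the delimiter and the schema concrete, the following Python sketch models what the LOAD statement above does to one line of the file: split on the delimiter, then cast each field to its declared type. It is a simplified illustration, not Pig's implementation. The names SCHEMA and parse_record are hypothetical, and the sketch assumes Pig's documented behavior that a missing field, or a field that cannot be cast, becomes null (None here).

```python
# Hypothetical, simplified model of
#   LOAD ... USING PigStorage(',') AS (empid:int, firstname:chararray, ...);
# It is not Pig's actual code.
SCHEMA = [
    ("empid", int),
    ("firstname", str),
    ("lastname", str),
    ("city", str),
    ("county", str),
]

def parse_record(line, delimiter=","):
    """Split one text line into fields and cast them according to SCHEMA."""
    parts = line.rstrip("\n").split(delimiter)
    record = []
    for i, (_name, cast) in enumerate(SCHEMA):
        if i >= len(parts):
            record.append(None)  # assumption: missing field becomes null
            continue
        try:
            record.append(cast(parts[i]))
        except ValueError:
            record.append(None)  # assumption: failed cast becomes null
    return tuple(record)

first = parse_record("1001,James,Butt,New Orleans,Orleans")
print(first)
```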

We can also store data in HDFS using the PigStorage() function. In this example, we store the relation “empdetail” in the HDFS location ‘/pigoutput/outputdata’.

Command:
grunt> STORE empdetail INTO '/pigoutput/outputdata' USING PigStorage (',');

We can verify the stored data with the following commands.

Command:
$ hadoop fs -ls /pigoutput/outputdata/
$ hadoop fs -cat /pigoutput/outputdata/part-m-00000

Output:
[Image: output of store command]
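
Storing with PigStorage(',') is the reverse of loading: each tuple of the relation is written as one comma-delimited line in a part file under the output directory. The Python sketch below models that step only as an illustration. The function name format_record is hypothetical, and the sketch assumes that a null field is written as an empty string.

```python
# Hypothetical, simplified model of STORE ... USING PigStorage(',');
# It is not Pig's actual code.
def format_record(record, delimiter=","):
    """Join the fields of one tuple into a single delimited text line."""
    # Assumption: a null field (None) is written as an empty string.
    return delimiter.join("" if field is None else str(field) for field in record)

line = format_record((1001, "James", "Butt", "New Orleans", "Orleans"))
print(line)
```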


TextLoader()

The TextLoader() function loads unstructured text data in UTF-8 format. Each line of the input becomes a tuple with a single chararray field.

Syntax:
TextLoader()

For this example we use the dataset “department.txt”, located in the HDFS directory /pigexample/.

Content of “department.txt”:

1001,Bette,Nicka,LA,70116
1002,Veronika,Inouye,MI,48116
1003,Willard,Kolmetz,NJ,8014
1004,Maryann,Royster,AK,99501
1005,Alisha,Slusarski,OH,45011
1006,Allene,Iturbide,OH,44805
1007,Chanel,Caudy,IL,60632
1008,Ezekiel,Chui,CA,95111
1009,Willow,Kusko,SD,57105
1010,Bernardo,Figeroa,MD,21224

We will load the “department.txt” data from the HDFS directory “/pigexample/” into Pig using the TextLoader() function, then print the result on the terminal with the DUMP operator.

Command:
grunt> deptdata = LOAD '/pigexample/department.txt' USING TextLoader();
grunt> DUMP deptdata;

Output:
[Image: TextLoader command example]
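
Unlike PigStorage(), TextLoader() does not split a line into fields: each tuple holds the whole line as one value, commas included. The Python sketch below models that behavior for two lines of the file. It is an illustration only, not Pig's implementation, and the function name text_loader is hypothetical.

```python
# Hypothetical, simplified model of LOAD ... USING TextLoader();
# It is not Pig's actual code.
def text_loader(lines):
    """Turn each input line into a one-field tuple holding the whole line."""
    return [(line.rstrip("\n"),) for line in lines]

sample = [
    "1001,Bette,Nicka,LA,70116\n",
    "1002,Veronika,Inouye,MI,48116\n",
]
deptdata = text_loader(sample)
for row in deptdata:
    print(row)
```

To work with individual columns after a TextLoader() load, the line has to be split in a later step of the Pig script.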