Apache Pig Combining & Splitting

Apache Pig provides the UNION operator to merge two or more relations and the SPLIT operator to partition a relation into two or more relations.

  1. UNION Operator
  2. SPLIT Operator

To demonstrate both operators, we use two datasets, “finance_data.txt” and “finance_bkp.txt”. We will copy both files from the local file system to the HDFS location “/pigexample/”.

Content of “finance_data.txt”:

1,Chanel,Shawnee,KS,9133882079
2,Ezekiel,Easton,MD,4106691642
3,Willow,New York,NY,2125824976
4,Bernardo,Conroe,TX,9363363951
5,Ammie,Columbus,OH,6148019788
6,Francine,Las Cruces,NM,5059773911
7,Ernie,Ridgefield Park,NJ,2017096245
8,Albina,Dunellen,NJ,7329247882
9,Alishia,New York,NY,2128601579
10,Solange,Metairie,LA,5049799175

Content of “finance_bkp.txt”:

1,Kati,Dunellen,NJ,7329247882
2,Youlanda,New York,NY,2128601579
3,Dyan,Metairie,LA,5049799175

We will copy both data files from the local file system to the HDFS directory “/pigexample/” using the following commands.

Command:

$ hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/finance_data.txt /pigexample/
$ hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/finance_bkp.txt /pigexample/
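
The upload can also be checked from inside the Grunt shell, which passes “fs” commands through to the Hadoop file system shell. The lines below are a minimal sketch, assuming the “/pigexample/” directory already existed before the copy and that both copy commands succeeded.

```pig
-- Enter these at the grunt> prompt (or place them in a Pig script).
-- List the target directory; both .txt files should appear.
fs -ls /pigexample/

-- Print the smaller file to confirm its contents arrived intact.
fs -cat /pigexample/finance_bkp.txt
```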

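Now we will create relations for both files and load the data from HDFS into Pig. The phone field is declared as long because ten-digit values such as 9133882079 exceed the maximum of Pig's 32-bit int type (2147483647) and would not load correctly as int (Pig typically returns null for such values).

Command:

grunt> financedata = LOAD '/pigexample/finance_data.txt' USING PigStorage(',') as (empid:int,empname:chararray,city:chararray,state:chararray,phone:long);
grunt> financebkp = LOAD '/pigexample/finance_bkp.txt' USING PigStorage(',') as (empid:int,empname:chararray,city:chararray,state:chararray,phone:long);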
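
Before using the relations, it can help to confirm that the schema and the first few rows look as expected. The following is a minimal sketch rather than a required step. The alias “preview” is a hypothetical name chosen for illustration, and phone is declared as long on the assumption that ten-digit values such as 9133882079 do not fit in a 32-bit int.

```pig
-- Load the file with phone as long (assumption: values exceed the int range).
financedata = LOAD '/pigexample/finance_data.txt' USING PigStorage(',')
    AS (empid:int, empname:chararray, city:chararray, state:chararray, phone:long);

-- Print the schema Pig has attached to the relation.
DESCRIBE financedata;

-- Materialize only a few tuples instead of dumping the whole file.
-- "preview" is a hypothetical alias used only in this sketch.
preview = LIMIT financedata 3;
DUMP preview;
```

LIMIT keeps the check cheap on large files. Without an ORDER BY, the three tuples it returns are not guaranteed to be the first three lines of the file.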

Let us see each operator in detail.


1. UNION Operator

The UNION operator is used to merge the contents of two or more relations. It does not preserve the order of tuples and does not remove duplicate tuples.

Syntax:

grunt> alias = UNION [ONSCHEMA] alias, alias [, alias …] [PARALLEL n];

We will merge both relations “financedata” and “financebkp” into “unionexample” and print the final result using the DUMP operator.

Command:

grunt> unionexample = UNION financedata, financebkp;
grunt> DUMP unionexample;

Output:

[Screenshot: output of the UNION operator example]
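
The optional ONSCHEMA keyword from the syntax above matches fields by name instead of by position, which is useful when the input relations list the same fields in a different order. The following is a minimal sketch, assuming both files are loaded with the schema shown. The aliases “unionbyname” and “uniquerows” are hypothetical names used only for illustration.

```pig
-- Load both inputs with identical field names (phone as long is an assumption
-- made because ten-digit values exceed the int range).
financedata = LOAD '/pigexample/finance_data.txt' USING PigStorage(',')
    AS (empid:int, empname:chararray, city:chararray, state:chararray, phone:long);
financebkp = LOAD '/pigexample/finance_bkp.txt' USING PigStorage(',')
    AS (empid:int, empname:chararray, city:chararray, state:chararray, phone:long);

-- ONSCHEMA aligns fields by name rather than by position.
unionbyname = UNION ONSCHEMA financedata, financebkp;

-- UNION keeps duplicate tuples; DISTINCT removes exact duplicates if required.
uniquerows = DISTINCT unionbyname;
DUMP uniquerows;
```

With the two sample files, no two rows are identical in every field, so DISTINCT leaves all 13 rows in place. It only matters when the inputs can contain the same record more than once.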


2. SPLIT Operator

The SPLIT operator is used to partition the content of a relation into two or more relations based on some expression.

Syntax:

grunt> SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …] [, alias OTHERWISE];

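We will use the relation “financedata”, which we have already created, and split its data into two relations, “financedata1” and “financedata12”. Relation “financedata1” receives the records where empid is greater than 5. Relation “financedata12” receives the records that satisfy (4<empid and empid>7), that is, where empid is greater than both 4 and 7, which reduces to empid greater than 7. The two conditions overlap, so a record such as empid 8 appears in both relations, while a record that matches neither condition (for example, empid 3) is dropped.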

Command:

grunt> SPLIT financedata into financedata1 if empid>5, financedata12 if (4<empid and empid>7);
grunt> DUMP financedata1;

Output:

[Screenshot: output of DUMP financedata1]

Command:

grunt> DUMP financedata12;

Output:

[Screenshot: output of DUMP financedata12]
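
The optional OTHERWISE clause from the syntax above collects the tuples that match none of the IF conditions, so no record is silently dropped. The following is a minimal sketch, assuming a Pig release that supports OTHERWISE (0.10 or later). The aliases “lowids” and “highids” are hypothetical names used only for illustration.

```pig
-- Load the input (phone as long is an assumption made because
-- ten-digit values exceed the int range).
financedata = LOAD '/pigexample/finance_data.txt' USING PigStorage(',')
    AS (empid:int, empname:chararray, city:chararray, state:chararray, phone:long);

-- Tuples with empid <= 5 go to lowids; every remaining tuple goes to highids.
SPLIT financedata INTO lowids IF empid <= 5, highids OTHERWISE;

DUMP lowids;
DUMP highids;
```

With the ten sample rows, “lowids” should hold empid 1 through 5 and “highids” empid 6 through 10, so the two relations together cover the input exactly once.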