Apache Pig provides the UNION operator to merge two or more relations and the SPLIT operator to partition a relation into two or more relations.
To demonstrate both operators, we use two datasets, “finance_data.txt” and “finance_bkp.txt”. We will copy both files from the local file system to the HDFS location “/pigexample/”.
Content of “finance_data.txt”:
1,Chanel,Shawnee,KS,9133882079
2,Ezekiel,Easton,MD,4106691642
3,Willow,New York,NY,2125824976
4,Bernardo,Conroe,TX,9363363951
5,Ammie,Columbus,OH,6148019788
6,Francine,Las Cruces,NM,5059773911
7,Ernie,Ridgefield Park,NJ,2017096245
8,Albina,Dunellen,NJ,7329247882
9,Alishia,New York,NY,2128601579
10,Solange,Metairie,LA,5049799175
Content of “finance_bkp.txt”:
1,Kati,Dunellen,NJ,7329247882
2,Youlanda,New York,NY,2128601579
3,Dyan,Metairie,LA,5049799175
Copy both data files from the local file system into the HDFS directory “/pigexample/” using the commands below.
Command:
$ hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/finance_data.txt /pigexample/
$ hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/finance_bkp.txt /pigexample/
Now we will create relations for both files and load data from HDFS to Pig.
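Command:
grunt> financedata = LOAD '/pigexample/finance_data.txt' USING PigStorage(',') AS (empid:int, empname:chararray, city:chararray, state:chararray, phone:long);
grunt> financebkp = LOAD '/pigexample/finance_bkp.txt' USING PigStorage(',') AS (empid:int, empname:chararray, city:chararray, state:chararray, phone:long);
The phone field is declared as long because ten-digit values such as 9133882079 exceed the range of Pig's 32-bit int type and would otherwise be loaded as null.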
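Before using the relations, it can help to confirm that the files are in place and that Pig recorded the intended schema. The following is a minimal sketch, assuming it is typed in the same Grunt session as the LOAD statements above; the alias `preview` is hypothetical and used only for this check.

```pig
-- List the HDFS directory from inside Grunt to confirm both files were copied.
fs -ls /pigexample/

-- Print the schema Pig has recorded for each relation.
DESCRIBE financedata;
DESCRIBE financebkp;

-- Peek at a few tuples instead of dumping the whole relation.
-- 'preview' is a hypothetical alias, not part of the original tutorial.
preview = LIMIT financedata 3;
DUMP preview;
```

LOAD is not executed until an operator such as DUMP or STORE asks for output, so a wrong path typically shows up only at the DUMP step.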
Let us see each operator in detail.
1. UNION Operator
The UNION operator is used to merge the contents of two or more relations.
Syntax:
grunt> alias = UNION [ONSCHEMA] alias, alias [, alias …] [PARALLEL n];
We will merge both relations “financedata” and “financebkp” into “unionexample” and print the final result using the DUMP operator.
Command:
grunt> unionexample = UNION financedata, financebkp;
grunt> DUMP unionexample;
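UNION neither preserves tuple order nor removes duplicates, so the order of the dumped tuples may vary between runs. The sketch below shows a few common follow-ups under the assumption that `financedata`, `financebkp`, and `unionexample` exist as defined above; the aliases `unionbyname`, `uniondistinct`, `uniongroup`, and `unioncount` are hypothetical names chosen for illustration.

```pig
-- UNION ONSCHEMA matches fields by name instead of by position.
-- Both relations share one schema here, so it yields the same tuples as the plain UNION.
unionbyname = UNION ONSCHEMA financedata, financebkp;

-- UNION keeps duplicate tuples; DISTINCT drops tuples that are identical in every field.
uniondistinct = DISTINCT unionexample;

-- Count the merged tuples: 10 from finance_data.txt plus 3 from finance_bkp.txt.
uniongroup = GROUP unionexample ALL;
unioncount = FOREACH uniongroup GENERATE COUNT(unionexample);
DUMP unioncount;
```

For the sample files shown earlier, the final DUMP should print (13). The two files reuse empid values 1 to 3 with different names, so DISTINCT does not remove any tuples from this particular result.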
2. SPLIT Operator
The SPLIT operator is used to partition the content of a relation into two or more relations based on some expression.
Syntax:
grunt> SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …] [, alias OTHERWISE];
We will use the relation “financedata” created earlier and split it into two relations, “financedata1” and “financedata12”. Relation “financedata1” receives the tuples whose empid is greater than 5. Relation “financedata12” receives the tuples that satisfy (4<empid and empid>7); both comparisons must hold, so the condition reduces to empid greater than 7. Because each SPLIT condition is evaluated independently, a tuple can land in both relations (empid 8, 9, and 10 here) or in neither (empid 1 to 5).
Command:
grunt> SPLIT financedata INTO financedata1 IF empid>5, financedata12 IF (4<empid AND empid>7);
grunt> DUMP financedata1;
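When every tuple should end up in exactly one output relation, the OTHERWISE clause from the syntax above can act as a catch-all. The following is a minimal sketch, assuming a Pig release that supports OTHERWISE and the `financedata` relation defined earlier; the aliases `financelow`, `financehigh`, and `financelowfilter` are hypothetical.

```pig
-- Tuples with empid 1 to 5 go to financelow; tuples that match no
-- other condition (empid 6 to 10 in the sample data) go to financehigh.
SPLIT financedata INTO financelow IF empid <= 5, financehigh OTHERWISE;
DUMP financelow;
DUMP financehigh;

-- A single branch of a SPLIT can also be written as a FILTER
-- when only one of the output relations is needed.
financelowfilter = FILTER financedata BY empid <= 5;
DUMP financelowfilter;
```

SPLIT is usually the more convenient choice when several output relations are derived from the same input in one statement, while FILTER reads more naturally for a single condition.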