Apache Pig Latin provides the following group and join operators.

  1. GROUP Operator
  2. COGROUP Operator
  3. JOIN Operator
  4. CROSS Operator

Let us see each operator in detail.


1. GROUP Operator

The GROUP operator groups together the tuples that share the same group key. The output is a relation that contains one tuple per group; each tuple holds the group key and a bag with all the input tuples for that key.

Syntax:
grunt> alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];

To perform the GROUP operation, we will first load the file “employee.txt” from the HDFS location “/pigexample/”. The content of the file is shown below.


1001,James,Butt,New Orleans,Orleans
1002,Josephine,Darakjy,Brighton,Livingston
1003,Art,Venere,Bridgeport,Gloucester
1004,Lenna,Paprocki,Anchorage,Anchorage
1005,Donette,Foller,Hamilton,Butler
1006,Simona,Morasca,Ashland,Ashland
1007,Mitsue,Tollner,Chicago,Cook
1008,Leota,Dilliard,San Jose,Santa
1009,Sage,Wieser,Sioux Falls,Minnehaha

Command:
grunt> employees = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );
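
The ALL option shown in the syntax above places every tuple into a single group, which is handy for computing totals over the whole relation. Below is a minimal sketch; the aliases allgrp and totalemp are hypothetical names chosen for this example:

grunt> allgrp = GROUP employees ALL;
grunt> totalemp = FOREACH allgrp GENERATE COUNT(employees);
grunt> DUMP totalemp;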


a. GROUP by Single Column

We will use the GROUP operator to group the relation “employees” by city.

After that, we will use the DUMP operator to print the records on the terminal.

Command:
grunt> groupdata = GROUP employees by city;
grunt> DUMP groupdata;

Output:

We can see from the output that two columns are generated.

  • The first column is the group key, that is, the city we used to group the data.
  • The second column is a bag that contains all the employee tuples belonging to that city.
[Screenshot: GROUP by city output]
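
The grouped relation can be fed into FOREACH to aggregate each bag. As a minimal sketch (the aliases citycount and emp_cnt are our own), the following counts the employees in each city:

grunt> citycount = FOREACH groupdata GENERATE group AS city, COUNT(employees) AS emp_cnt;
grunt> DUMP citycount;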


b. GROUP by Multi-Column

We will use the GROUP operator to group the relation “employees” by city and county.

After that, we will use the DUMP operator to print the records on the terminal.

Command:
grunt> groupdata = GROUP employees by (city,county);
grunt> DUMP groupdata;

Output:
[Screenshot: GROUP by city and county output]
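
When grouping by multiple columns, the group key is itself a tuple of (city, county). FLATTEN can be used to unpack it into separate columns; a minimal sketch, with citycounty and emp_cnt as hypothetical aliases:

grunt> citycounty = FOREACH groupdata GENERATE FLATTEN(group) AS (city, county), COUNT(employees) AS emp_cnt;
grunt> DUMP citycounty;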


2. COGROUP Operator

The COGROUP operator is similar to the GROUP operator. The difference is that GROUP is generally used with a single relation, whereas COGROUP is used with two or more relations.

Syntax:
grunt> alias = COGROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];

To perform the COGROUP operation, we have taken two datasets, “employee.txt” and “department.txt”. We will put both files in the HDFS location “/pigexample/”.

Content of “employee.txt”:

1001,James,Butt,New Orleans,Orleans
1002,Josephine,Darakjy,Brighton,Livingston
1003,Art,Venere,Bridgeport,Gloucester
1004,Lenna,Paprocki,Anchorage,Anchorage
1005,Donette,Foller,Hamilton,Butler
1006,Simona,Morasca,Ashland,Ashland
1007,Mitsue,Tollner,Chicago,Cook
1008,Leota,Dilliard,San Jose,Santa
1009,Sage,Wieser,Sioux Falls,Minnehaha

Content of “department.txt”:

1001,James,Butt,New Orleans,Orleans
1002,Josephine,Darakjy,Brighton,Livingston
1003,Art,Venere,Bridgeport,Gloucester
1004,Lenna,Paprocki,Anchorage,Anchorage
1005,Donette,Foller,Hamilton,Butler
1006,Simona,Morasca,Ashland,Ashland
1007,Mitsue,Tollner,Chicago,Cook
1008,Leota,Dilliard,San Jose,Santa
1009,Sage,Wieser,Sioux Falls,Minnehaha

We will copy both data files from the local filesystem to the HDFS location “/pigexample/” using the commands below.

Command:
$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/employee.txt /pigexample/
$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/department.txt /pigexample/

Now we will create relations for both files and load data from HDFS to Pig.

Command:
grunt> empdata = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );
grunt> deptdata = LOAD '/pigexample/department.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,state:chararray,zip:int );

We will group the tuples of the relations “empdata” and “deptdata” by city and state respectively, and print the result on the terminal using the DUMP operator.

Command:
grunt> cogroupdata = COGROUP empdata by city, deptdata by state;
grunt> DUMP cogroupdata;

Output:

The COGROUP operator groups the tuples of “empdata” by city and the tuples of “deptdata” by state. For each key, the output tuple contains one bag per input relation; a bag is empty when that relation has no tuples with the key.

[Screenshot: COGROUP output]
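
Each COGROUP result tuple can be summarized with FOREACH. As a minimal sketch (cogroupcnt, emp_cnt, and dept_cnt are hypothetical names), the following counts how many tuples each relation contributed for every key:

grunt> cogroupcnt = FOREACH cogroupdata GENERATE group, COUNT(empdata) AS emp_cnt, COUNT(deptdata) AS dept_cnt;
grunt> DUMP cogroupcnt;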


3. JOIN Operator

The JOIN operator is used to join two or more relations based on common field values. To perform the JOIN operation, we declare one or more keys (fields) from each relation; tuples are joined when their key values match.

Pig supports the following types of JOIN operations.

  1. Self-Join
  2. Inner-Join
  3. Left Outer Join
  4. Right Outer Join
  5. Full Outer Join

Let us see each JOIN operator in detail.

To perform the JOIN operations, we have taken two datasets, “employee.txt” and “invoice.txt”. We will copy both files from the local file system to the HDFS location “/pigexample/”.

Content of “employee.txt”:

1001,James,Butt,New Orleans,Orleans
1002,Josephine,Darakjy,Brighton,Livingston
1003,Art,Venere,Bridgeport,Gloucester
1004,Lenna,Paprocki,Anchorage,Anchorage
1005,Donette,Foller,Hamilton,Butler
1006,Simona,Morasca,Ashland,Ashland
1007,Mitsue,Tollner,Chicago,Cook
1008,Leota,Dilliard,San Jose,Santa
1009,Sage,Wieser,Sioux Falls,Minnehaha

Content of “invoice.txt”:

1002,2019-10-08 00:00:00,23000
1001,2012-10-08 00:00:00,23500
1001,2012-11-20 00:00:00,13260
1002,2011-05-20 00:00:00,56030
1002,2011-10-08 00:00:00,34200
1001,2010-10-08 00:00:00,32300
1005,2001-11-20 00:00:00,12360
1006,2001-05-20 00:00:00,32450

We will copy both data files from the local filesystem to the HDFS location “/pigexample/” using the commands below.

Command:
$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/employee.txt /pigexample/
$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/invoice.txt /pigexample/

Now we will create relations for both files and load data from HDFS to Pig.

Command:
grunt> empdata = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );
grunt> invdata = LOAD '/pigexample/invoice.txt' USING PigStorage(',') as (custid:int,data:chararray,amount:int );

Now we are ready to perform JOIN operations on these two datasets.


3.1 Self-Join

A self-join joins a relation with itself. Because Pig requires distinct aliases for the relations in a JOIN, we will load “employee.txt” twice under different aliases.

Syntax:
grunt> alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} ) [USING 'replicated' | 'bloom' | 'skewed' | 'merge' | 'merge-sparse'] [PARTITION BY partitioner] [PARALLEL n];

Command:
grunt> empdata1 = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );
grunt> empdata2 = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );

Output:
[Screenshot: empdata1 and empdata2 loaded in the Grunt shell]

Now we will perform the Self-JOIN operation by joining two relations (empdata1 and empdata2) and print the output on the terminal using the DUMP operator.

Command:
grunt> empdata3 = JOIN empdata1 BY empid, empdata2 BY empid;
grunt> DUMP empdata3;

Output:
[Screenshot: self-join output]
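
Because both sides of the self-join carry the same field names, individual columns must be referenced with the disambiguation operator (::). A minimal sketch, where selfproj is a hypothetical alias:

grunt> selfproj = FOREACH empdata3 GENERATE empdata1::empid, empdata1::firstname, empdata2::city;
grunt> DUMP selfproj;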


3.2 Inner-Join

An inner join (also called an equi-join) is used to join two or more relations based on common field values. Inner joins ignore null keys.

Syntax:
grunt> alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} ) [USING 'replicated' | 'bloom' | 'skewed' | 'merge' | 'merge-sparse'] [PARTITION BY partitioner] [PARALLEL n];

Now we will perform the inner join on the two relations “empdata” and “invdata” and print the output on the terminal using the DUMP operator.

Command:
grunt> empdata = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );
grunt> invdata = LOAD '/pigexample/invoice.txt' USING PigStorage(',') as (custid:int,data:chararray,amount:int );
grunt> empinvoice = JOIN empdata BY empid, invdata BY custid;
grunt> DUMP empinvoice;

Output:
[Screenshot: inner join output]
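
A joined relation can be grouped and aggregated like any other. As a hedged sketch (the aliases byemp and totals are our own), the following sums the invoice amounts per employee; we assume the unqualified field name amount resolves without ambiguity because it occurs only in invdata:

grunt> byemp = GROUP empinvoice BY empdata::empid;
grunt> totals = FOREACH byemp GENERATE group AS empid, SUM(empinvoice.amount) AS total_amount;
grunt> DUMP totals;

When one of the relations is small enough to fit in memory, the USING 'replicated' clause shown in the syntax above can be added to request a fragment-replicate (map-side) join.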


3.3 Left Outer-Join

A left outer join returns all the records from the left relation even if there are no matching records in the right relation; in that case the fields from the right relation are null.

Syntax:
grunt> alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column [USING 'replicated' | 'bloom' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];

Now we will perform the left outer join on the two relations “empdata” and “invdata” and print the output on the terminal using the DUMP operator.

Command:
grunt> leftouterexample = JOIN empdata BY empid LEFT OUTER, invdata BY custid;
grunt> DUMP leftouterexample;

Output:
[Screenshot: left outer join output]
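
The null fields produced by the outer join can be used to find employees that have no invoices at all. A minimal sketch, with noinvoice as a hypothetical alias:

grunt> noinvoice = FILTER leftouterexample BY invdata::custid IS NULL;
grunt> DUMP noinvoice;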


3.4 Right Outer-Join

A right outer join returns all the records from the right relation even if there are no matching records in the left relation; in that case the fields from the left relation are null.

Syntax:
grunt> alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column [USING 'replicated' | 'bloom' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];

Now we will perform the right outer join on the two relations “empdata” and “invdata” and print the output on the terminal using the DUMP operator.

Command:
grunt> rightouterexample = JOIN empdata BY empid RIGHT OUTER, invdata BY custid;
grunt> DUMP rightouterexample;

Output:

[Screenshot: right outer join output]
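
Symmetrically, the right outer join result can be filtered on the null left-side key to find invoices whose custid does not match any employee. A minimal sketch, with orphaninv as a hypothetical alias (with the sample data above, this relation may well be empty):

grunt> orphaninv = FILTER rightouterexample BY empdata::empid IS NULL;
grunt> DUMP orphaninv;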


3.5 Full Outer-Join

A full outer join returns the records from both relations whether or not there is a match; unmatched records have null values for the fields of the other relation.

Syntax:
grunt> alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column [USING 'replicated' | 'bloom' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];

Now we will perform the full outer join on the two relations “empdata” and “invdata” and print the output on the terminal using the DUMP operator.

Command:
grunt> fullouterexample = JOIN empdata BY empid FULL OUTER, invdata BY custid;
grunt> DUMP fullouterexample;

Output:
[Screenshot: full outer join output]
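
The full outer join result mixes matched and unmatched records, and it can be separated by testing the join keys for null. A hedged sketch using the SPLIT operator with its OTHERWISE clause (matched and unmatched are hypothetical aliases):

grunt> SPLIT fullouterexample INTO matched IF (empdata::empid IS NOT NULL AND invdata::custid IS NOT NULL), unmatched OTHERWISE;
grunt> DUMP unmatched;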


4. CROSS Operator

The CROSS operator is used to compute the cross product (Cartesian product) of two or more relations.

Syntax:
grunt> alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];

Command:
grunt> crossjoinexample = CROSS empdata , invdata;
grunt> DUMP crossjoinexample;

Output:
[Screenshot: CROSS output]
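
Because the size of a cross product grows as the product of the input sizes, it is often convenient to inspect only a handful of rows. A minimal sketch, with crosslimited as a hypothetical alias:

grunt> crosslimited = LIMIT crossjoinexample 10;
grunt> DUMP crosslimited;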