Apache Pig Latin is a dataflow language that lets users describe how data should be read, processed, and stored in Hadoop. It provides data types and several families of operators, including arithmetic, comparison, type construction, and relational operators.

Apache Pig Latin Statements

Pig Latin statements are the basic constructs you use to process data with Pig. In general, a statement applies an operator that takes a relation as input and produces another relation as output. Statements can include expressions and schemas, can span multiple lines, and must end with a semicolon (;). By default, Pig processes statements using multi-query execution.

A Pig Latin program is typically structured as follows.

  1. Load
  2. Transformations
  3. Dump
  4. Store
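
The four steps above can be sketched as one short script. This is only an illustrative outline: the input file 'data.txt' (assumed to be tab-delimited with three integer columns), the output location 'output_dir', and the relation names are hypothetical.

```pig
-- 1. Load: read the (hypothetical) input file and declare a schema.
--    A statement may span several lines; the semicolon ends it.
records = LOAD 'data.txt'
          AS (a:int, b:int, c:int);

-- 2. Transformation: keep only the rows where a is greater than 1.
selected = FILTER records BY a > 1;

-- 3. Dump: print the result on the console (useful while debugging).
DUMP selected;

-- 4. Store: write the result to the file system.
STORE selected INTO 'output_dir';
```

In practice a script usually ends with either DUMP (for inspection) or STORE (to persist results); both appear here only to show where each step fits.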

Let us see each step in detail.

1. Load

A LOAD statement reads data from a file system, such as HDFS or the local file system, into Pig. PigStorage is the default load function.

Suppose we have a tab-delimited file named “num.txt” stored on the local file system (in our case, under /home/cloudduggu/pig).

1	2	3
5	4	7
8	6	4
4	2	1
8	3	4

We will use the default load function, PigStorage, to load the data from “num.txt” into the numexp relation.

Command:
$ pig -x local
grunt> numexp = LOAD 'num.txt';

Use DUMP to display the records. Because no schema was defined, the fields have no names and default to the bytearray type.

Command:
grunt> DUMP numexp;

Output:
(1,2,3)
(5,4,7)
(8,6,4)
(4,2,1)
(8,3,4)

Next, define a schema with the AS keyword. Use the DESCRIBE and ILLUSTRATE operators to view the schema, and DUMP to see the result.

Command:
grunt> numexp = LOAD 'num.txt' USING PigStorage('\t') AS (a:int,b:int,c:int);
grunt> DESCRIBE numexp;
grunt> ILLUSTRATE numexp;
grunt> DUMP numexp;

Output:
[Screenshots: output of DESCRIBE, ILLUSTRATE, and DUMP for numexp]


2. Transformations

Transformations apply operations to the loaded dataset. Pig lets us transform data in various ways, for example by filtering or grouping.

Using Filter Operator

The FILTER operator selects the tuples (rows) of a relation that satisfy a condition.

Syntax:
grunt> alias = FILTER alias BY expression;

In this example, we load the 'num.txt' file in local mode (pig -x local) and select the records where the third column, c, equals 3.

Command:
grunt> numexp = LOAD 'num.txt' as (a:int,b:int,c:int);
grunt> filterdata = FILTER numexp BY c == 3;
grunt> DUMP filterdata;

Output:
[Screenshot: filter example]


Using Group Operator

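The GROUP operator collects tuples that share the same group key. The result is a relation with one tuple per group. Each tuple has two fields: the group key (named group) and a bag, named after the original relation, that holds the tuples belonging to that group.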

Syntax:
grunt> alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];

In this example, we load the 'num.txt' file in local mode (pig -x local) and group the records by column c.

Command:
grunt> numexp = LOAD 'num.txt' as (a:int,b:int,c:int);
grunt> groupdata = GROUP numexp BY c;
grunt> DUMP groupdata;

Output:
[Screenshot: group example]
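
A grouped relation is usually passed to FOREACH to compute something per group. The following is a minimal sketch, assuming the same 'num.txt' file; the relation name counts and the field name n are chosen here only for illustration. COUNT is a Pig built-in function.

```pig
-- Load the data and group it by column c, as in the example above.
numexp    = LOAD 'num.txt' AS (a:int, b:int, c:int);
groupdata = GROUP numexp BY c;

-- Each tuple of groupdata has two fields:
--   group  : the value of c for this group
--   numexp : a bag with all tuples of numexp that share that value
-- COUNT returns the number of tuples in the bag.
counts = FOREACH groupdata GENERATE group AS c, COUNT(numexp) AS n;

DUMP counts;
```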


Intermediate Results

By default, Pig stores the intermediate data generated between MapReduce jobs in the “/tmp” directory on HDFS. The location can be changed with the pig.temp.dir property.


3. Dump

The DUMP operator displays results on the screen only. It is intended for interactive mode: the statements are executed immediately and the results are not saved (persisted). DUMP is useful as a debugging aid to check that the results are what you expect.

Syntax:
grunt> DUMP alias;

Command:
grunt> numexp = LOAD 'num.txt' as (a:int,b:int,c:int);
grunt> DUMP numexp;

Output:
[Screenshot: load example]


4. Store

The STORE operator runs Pig Latin statements and saves the results to the file system.

In this example, we load 'num.txt' into the numexp relation and then store its contents at the location “numstore.txt” using the STORE operator, with '*' as the field delimiter. Pig creates “numstore.txt” as a directory that holds the output part files.

Syntax:
grunt> STORE alias INTO 'directory' [USING function];

Command:
grunt> numexp = LOAD 'num.txt' as (a:int,b:int,c:int);
grunt> STORE numexp INTO 'numstore.txt' USING PigStorage ('*');
grunt> cat numstore.txt;

Output:
[Screenshot: store data example]


Apache Pig Latin Simple Data Types


Apache Pig Latin supports the following simple data types.

Simple Data Type | Description | Example
int | Signed 32-bit integer | 100
long | Signed 64-bit integer | Data: 100L or 100l; Display: 100L
float | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray | Character array (string) in Unicode UTF-8 format | hello cloudduggu
bytearray | Byte array (blob) |
boolean | Boolean | true/false (case insensitive)
datetime | Date and time | 1970-01-01T00:00:00.000+00:00
biginteger | Java BigInteger | 200000000000
bigdecimal | Java BigDecimal | 33.45678332
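
Simple types are most often seen in the schema of a LOAD statement. The sketch below assumes a hypothetical tab-delimited file 'users.txt'; the file name and field names are not from the original example and are used only to show how the type names are written.

```pig
-- Hypothetical input file with one field per simple type of interest.
users = LOAD 'users.txt' AS (
    id:long,            -- signed 64-bit integer
    name:chararray,     -- UTF-8 string
    score:double,       -- 64-bit floating point
    active:boolean,     -- true/false
    joined:datetime     -- e.g. 1970-01-01T00:00:00.000+00:00
);

-- DESCRIBE prints the schema, including the declared types.
DESCRIBE users;
```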


Apache Pig Latin Complex Data Types


Apache Pig Latin supports the following complex data types.

Complex Data Type | Description | Example
tuple | An ordered set of fields | (19,2)
bag | A collection of tuples | {(19,2), (18,1)}
map | A set of key-value pairs | [open#apache]
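
Complex types can also be declared in a schema and then dereferenced in expressions. The following is a small sketch under stated assumptions: the file 'visits.txt' and all field names are hypothetical, and the file is assumed to already contain values written in the tuple, bag, and map notation shown in the table.

```pig
-- Hypothetical input whose three fields hold a tuple, a bag, and a map.
visits = LOAD 'visits.txt' AS (
    pos:tuple(x:int, y:int),
    hits:bag{t:tuple(day:int, n:int)},
    props:map[]
);

-- A tuple field is read with a dot, a map value with # and the key.
result = FOREACH visits GENERATE pos.x, props#'open';

DUMP result;
```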


Apache Pig Latin Arithmetic Operators


Apache Pig Latin supports the following arithmetic operators. The examples assume x = 40 and y = 60.

Operator | Description | Example
+ | Addition | x + y = 100
- | Subtraction | x - y = -20
* | Multiplication | x * y = 2400
/ | Division | x / y = 0 when both operands are int (integer division); approximately 0.667 when they are float or double
% | Modulo | x % y = 40 (the remainder of x divided by y)
? : | Bincond | (x == 1 ? 40 : 60) evaluates to 40 if x is 1, and to 60 otherwise
CASE WHEN THEN ELSE END | Case | CASE expression [ WHEN value THEN value ]+ [ ELSE value ]? END
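
Arithmetic operators are typically used inside FOREACH ... GENERATE. The sketch below reuses the 'num.txt' file from the earlier examples; the relation name calc and the output field names are chosen here only for illustration.

```pig
numexp = LOAD 'num.txt' AS (a:int, b:int, c:int);

calc = FOREACH numexp GENERATE
           a + b                    AS total,
           a * b                    AS product,
           a / b                    AS int_quotient,  -- both int: integer division
           (double)a / b            AS quotient,      -- cast first for a fractional result
           a % b                    AS remainder,
           (c == 3 ? 'three' : 'other') AS label;     -- bincond, written in parentheses

DUMP calc;
```

Casting one operand to double before dividing is a common way to avoid the truncation that integer division would otherwise cause.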


Apache Pig Latin Comparison Operators


Apache Pig Latin supports the following comparison operators.

Comparison Operator | Description | Example
== | Equal | (x == y)
!= | Not equal | (x != y)
< | Less than | (x < y)
> | Greater than | (x > y)
<= | Less than or equal to | (x <= y)
>= | Greater than or equal to | (x >= y)
matches | Pattern matching | expression matches string-constant: takes an expression on the left and a string constant (a regular expression) on the right
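
Comparison operators usually appear in the condition of a FILTER statement. The first statement group below reuses 'num.txt'; the second assumes a hypothetical file 'names.txt' with one string per line, used only to illustrate matches. The pattern is a Java regular expression and must match the whole value, which is why it is wrapped in '.*'.

```pig
-- Numeric comparisons combined with AND.
numexp = LOAD 'num.txt' AS (a:int, b:int, c:int);
big_a  = FILTER numexp BY a >= 5 AND c != 7;
DUMP big_a;

-- Pattern matching on a chararray field (hypothetical input file).
names        = LOAD 'names.txt' AS (name:chararray);
apache_names = FILTER names BY name matches '.*apache.*';
DUMP apache_names;
```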


Apache Pig Latin Type Construction Operators


Apache Pig Latin supports the following type construction operators.

Operator | Description | Example
( ) | Tuple constructor. Constructs a tuple from the specified elements. Equivalent to TOTUPLE. | (name, age) yields (joe smith,40)
{ } | Bag constructor. Constructs a bag from the specified elements. Equivalent to TOBAG. | {(name, age)} yields {(joe smith,40)}
[ ] | Map constructor. Constructs a map from the specified elements. Equivalent to TOMAP. | [name, age] yields [joe smith#40]
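
The constructors are used inside FOREACH ... GENERATE. The following sketch assumes a hypothetical tab-delimited file 'students.txt' with a name and an age per line; the file and relation names are illustrative only.

```pig
students = LOAD 'students.txt' AS (name:chararray, age:int);

-- Tuple constructor: one tuple per input row.
tuples = FOREACH students GENERATE (name, age);

-- Bag constructor: one bag per input row, holding a single tuple.
bags = FOREACH students GENERATE {(name, age)};

-- Map constructor: the first element is the key, the second the value.
maps = FOREACH students GENERATE [name, age];

DUMP tuples;
DUMP bags;
DUMP maps;
```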


Apache Pig Latin Relational Operators

Apache Pig Latin supports the following relational operators. The last four entries (DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE) are diagnostic operators used to inspect relations and plans.

Operator | Description
LOAD | Loads data from the file system (local/HDFS) into a relation.
STORE | Saves a relation to the file system (local/HDFS).
FILTER | Removes unwanted rows from a relation.
DISTINCT | Removes duplicate rows from a relation.
FOREACH, GENERATE | Generates data transformations based on columns of data.
STREAM | Transforms a relation using an external program.
JOIN | Joins two or more relations.
COGROUP | Groups the data in two or more relations.
GROUP | Groups the data in a single relation.
CROSS | Creates the cross product of two or more relations.
ORDER BY | Sorts a relation in ascending or descending order.
LIMIT | Returns a limited number of tuples from a relation.
UNION | Combines two or more relations into a single relation.
SPLIT | Splits a single relation into two or more relations.
DUMP | Prints the contents of a relation on the console.
DESCRIBE | Prints the schema of a relation.
EXPLAIN | Shows the logical, physical, and MapReduce execution plans.
ILLUSTRATE | Shows the step-by-step execution of a sequence of statements on sample data.
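
Several of these operators can be chained into a small pipeline. The sketch below reuses the 'num.txt' file from the earlier examples; the intermediate relation names are chosen here only for illustration.

```pig
numexp = LOAD 'num.txt' AS (a:int, b:int, c:int);

-- FOREACH ... GENERATE: keep only columns a and c.
cols = FOREACH numexp GENERATE a, c;

-- DISTINCT: drop duplicate rows, here the second (8,4).
uniq = DISTINCT cols;

-- ORDER BY: sort by a, largest first.
sorted = ORDER uniq BY a DESC;

-- LIMIT: keep the first three tuples of the sorted relation.
top3 = LIMIT sorted 3;

DUMP top3;
-- With the sample data this prints:
-- (8,4)
-- (5,7)
-- (4,1)
```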