Eval Functions

Apache Pig supports various Eval functions, such as AVG, CONCAT, COUNT, and COUNT_STAR, to perform different types of operations.

The following is the list of Eval functions supported by Apache Pig; a short sketch illustrating a few of them follows the list.

1. AVG() - Computes the average of the numeric values in a single-column bag.
2. BagToString() - Concatenates the elements of a bag into a chararray string, with an optional delimiter between values.
3. CONCAT() - Concatenates two or more expressions of the same type.
4. COUNT() - Counts the number of tuples in a bag, ignoring tuples whose first field is null.
5. COUNT_STAR() - Similar to COUNT(), but includes tuples with null values in the count.
6. DIFF() - Compares two bags (fields) in a tuple and returns the tuples that appear in one bag but not the other.
7. IsEmpty() - Checks whether a bag or map is empty.
8. MAX() - Returns the highest value (numeric or chararray) of a column in a single-column bag.
9. MIN() - Returns the lowest value (numeric or chararray) of a column in a single-column bag.
10. PluckTuple() - Takes a string prefix and keeps only the columns of a relation whose names begin with that prefix.
11. SIZE() - Computes the number of elements of any Pig data type.
12. SUBTRACT() - Subtracts one bag from another, returning the tuples of the first bag that do not appear in the second.
13. SUM() - Computes the sum of the numeric values of a column in a single-column bag.
14. TOKENIZE() - Splits a string (containing a group of words) in a single tuple and returns a bag holding the results of the split.
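
As a quick illustration of how a few of these functions are written, here is a minimal Pig Latin script sketch. The file name "names.txt", the relation names, and the schema are hypothetical assumptions for illustration only; they are not part of the datasets used later in this section.

-- Hypothetical input: comma-separated records such as "John,Smith,loves apache pig"
names = LOAD '/pigexample/names.txt' USING PigStorage(',')
   as (firstname:chararray, lastname:chararray, bio:chararray);

-- CONCAT() joins two expressions of the same type into one chararray
fullnames = FOREACH names GENERATE CONCAT(firstname, lastname);

-- SIZE() on a chararray returns the number of characters in it
namelengths = FOREACH names GENERATE SIZE(firstname);

-- TOKENIZE() splits the string on whitespace and returns a bag of words
words = FOREACH names GENERATE TOKENIZE(bio);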

Let us see a couple of examples.

AVG()

The AVG() function computes the average of the numeric values in a single-column bag. It ignores null values.

Syntax:
grunt> AVG(expression)

We will use the “studentdata.txt” dataset for this operation.

Content of “studentdata.txt”:

1,Chanel,Shawnee,KS,39
2,Ezekiel,Easton,MD,37
3,Willow,New York,NY,40
4,Bernardo,Conroe,TX,38
5,Ammie,Columbus,OH,38
6,Francine,Las Cruces,NM,38
7,Ernie,Ridgefield Park,NJ,38
8,Albina,Dunellen,NJ,56
9,Alishia,New York,NY,34
10,Solange,Metairie,LA,54

We will copy “studentdata.txt” from the local filesystem into the HDFS directory “/pigexample/” using the command below.

Command:

$ hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/studentdata.txt /pigexample/
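
To confirm the copy succeeded, we can list the target HDFS directory (the listing details, such as permissions and timestamps, will vary with your cluster):

$ hadoop fs -ls /pigexample/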

Now we will create the relation "student_data" and load the data from HDFS into Pig.

Command:
grunt> student_data = LOAD '/pigexample/studentdata.txt' USING PigStorage(',')
   as (studentid:int,firstname:chararray,lastname:chararray,city:chararray,gpa:int);
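
We can optionally verify the schema of the new relation with the DESCRIBE operator. Assuming the load above succeeded, it should report something like:

grunt> DESCRIBE student_data;
student_data: {studentid: int,firstname: chararray,lastname: chararray,city: chararray,gpa: int}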


To perform the AVG() operation, we will first group the relation “student_data” using the GROUP ALL operator and store the result in a relation named “studentgroup”.

grunt> studentgroup = GROUP student_data ALL;
grunt> DUMP studentgroup;

Output:
(all,{(10,Solange,Metairie,LA,54), (9,Alishia,New York,NY,34), (8,Albina,Dunellen,NJ,56), (7,Ernie,Ridgefield Park,NJ,38), (6,Francine,Las Cruces,NM,38), (5,Ammie,Columbus,OH,38), (4,Bernardo,Conroe,TX,38), (3,Willow,New York,NY,40), (2,Ezekiel,Easton,MD,37), (1,Chanel,Shawnee,KS,39)})

Now let us calculate the overall average GPA of all the students using the AVG() function, and print the output on the terminal using the DUMP operator.

Command:
grunt> studentavggpa = FOREACH studentgroup GENERATE
   (student_data.firstname, student_data.gpa), AVG(student_data.gpa);

grunt> DUMP studentavggpa;

Output:
(({(Solange),(Alishia),(Albina),(Ernie),(Francine),(Ammie),(Bernardo),(Willow),(Ezekiel),(Chanel)},{(54),(34),(56),(38),(38),(38),(38),(40),(37),(39)}),41.2)
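
If only the average itself is needed, the tuple of bags can be dropped from the projection. A minimal sketch (the relation name "studentavg" is our own):

grunt> studentavg = FOREACH studentgroup GENERATE AVG(student_data.gpa);
grunt> DUMP studentavg;

Output:
(41.2)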

BagToString()

The BagToString() function is used to concatenate the elements of a bag into a chararray string, placing an optional delimiter between each value.

Syntax:
grunt> BagToString(vals:bag [, delimiter:chararray]);

We will use the “birthdate.txt” dataset for this operation.

Content of “birthdate.txt”:

10,6,1989
15,12,1998
23,4,1987
27,9,1985
23,5,1984
12,7,1990
16,6,1987

We will copy “birthdate.txt” from the local filesystem into the HDFS directory “/pigexample/” using the command below.

Command:
$ hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/birthdate.txt /pigexample/

Now we will create the relation "birthdate_detail" and load the data from HDFS into Pig.

Command:
grunt> birthdate_detail = LOAD '/pigexample/birthdate.txt' USING PigStorage(',') as (date:int,month:int,year:int);

We will group the relation “birthdate_detail” using the GROUP ALL operator and store the result in a relation named “birthdategroup”.

Command:
grunt> birthdategroup = GROUP birthdate_detail ALL;
grunt> DUMP birthdategroup;
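
From here, BagToString() can be applied to the grouped data. The sketch below concatenates all the birth years with an underscore delimiter; the relation name "yearstring" is our own, and the order of values inside the bag is not guaranteed:

grunt> yearstring = FOREACH birthdategroup GENERATE BagToString(birthdate_detail.year, '_');
grunt> DUMP yearstring;

Output (bag order may vary):
(1989_1998_1987_1985_1984_1990_1987)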