String Functions

Apache Pig provides built-in string functions such as ENDSWITH, STARTSWITH, SUBSTRING, and INDEXOF for comparing, searching, and transforming strings.

The following is the list of String functions supported by Apache Pig.

Sr No Functions Description
1 ENDSWITH(string, testdata) This function is used to verify whether a given string ends with a particular substring.
2 STARTSWITH(string, substring) This function accepts two string parameters and verifies whether the first string starts with the second.
3 SUBSTRING(string, startIndex, stopIndex) This function will return a substring from a given string.
4 EqualsIgnoreCase(stringdata1, stringdata2) This function is used to compare two strings ignoring the case.
5 INDEXOF(string, 'chardata', startIndex) This function will return the index of the first occurrence of a character in a string, searching forward from a start index.
6 LAST_INDEX_OF(string, 'chardata') This function will return the index of the last occurrence of a character in a string, searching backward from the end of the string.
7 LCFIRST(expression) This function will convert the first character in a string to lower case.
8 UCFIRST(expression) This function will return a string with the first character converted to upper case.
9 UPPER(expression) This function will return a string converted to upper case.
10 LOWER(expression) This function will convert all characters in a string to lower case.
11 REPLACE(string, 'olddata', 'newdata') This function is used to replace existing characters in a string with new characters. The 'olddata' argument is interpreted as a regular expression.
12 STRSPLIT(string, regex, limit) This function is used to split a string around matches of a given regular expression.
13 STRSPLITTOBAG(string, regex, limit) This function is similar to STRSPLIT(): it splits a string around matches of a given regular expression, but returns the result in a bag.
14 TRIM(expression) This function will return a copy of a string with leading and trailing whitespace removed.
15 LTRIM(expression) This function will return a copy of a string with leading whitespace removed.
16 RTRIM(expression) This function will return a copy of a string with trailing whitespace removed.
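
As a rough sketch of how a few of the functions in this list are called, the snippet below applies UPPER, LOWER, REPLACE, and STARTSWITH to the “citydata.txt” file that is loaded in the SUBSTRING example later in this section. The relation names (people, cased, replaced, s_cities) are illustrative only, and the snippet assumes a Pig release that ships these built-ins.

```pig
-- Illustrative relation name; same file and schema as the SUBSTRING example.
people = LOAD '/pigexample/citydata.txt' USING PigStorage(',')
   AS (firstname:chararray, lastname:chararray, city:chararray);

-- Case conversion: UPPER('Marion') gives MARION, LOWER('Marion') gives marion.
cased = FOREACH people GENERATE UPPER(city) AS city_upper, LOWER(city) AS city_lower;

-- REPLACE treats its second argument as a regular expression.
replaced = FOREACH people GENERATE city, REPLACE(city, 'a', 'A') AS city_replaced;

-- STARTSWITH returns a boolean, so it can be used directly in a FILTER condition.
s_cities = FILTER people BY STARTSWITH(city, 'S');
DUMP s_cities;
```

With the sample data, s_cities should contain only the records whose city is Suffolk or Sacramento.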

Let us see a couple of examples.


SUBSTRING()

The SUBSTRING function returns the part of a given string that starts at startIndex and ends just before stopIndex. Both indexes are zero-based.

Syntax:
grunt> SUBSTRING(string, startIndex, stopIndex)

This example uses the “citydata.txt” dataset, which is copied from the local file system to the HDFS location “/pigexample/”.

Content of “citydata.txt”:

Yolando,Luczki,Marion
Lizette,Stem,Onondaga
Gregoria,Pawlowicz,Camden
Carin,Deleo,Nassau
Chantell,Maynerich,Pulaski
Dierdre,Yum,Ramsey
Larae,Gudroe,Philadelphia
Latrice,Tolfree,Terrebonne
Kerry,Theodorov,Suffolk
Dorthy,Hidvegi,Sacramento
Fannie,Lungren,Ada
Evangelina,Radde,Suffolk

We will copy “citydata.txt” from the local file system into the HDFS directory “/pigexample/” using the command below.

Command:
$ hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/citydata.txt /pigexample/

Now we will create the relation "cityexample" and load the data from HDFS into Pig.

Command:
grunt> cityexample = LOAD '/pigexample/citydata.txt' USING PigStorage(',')
   as (firstname:chararray,lastname:chararray,city:chararray);


Now we will use the SUBSTRING() function to fetch, from each first name, the substring that starts at index 0 and stops before index 2, that is, the first two characters.

Command:
grunt> substrdata = FOREACH cityexample GENERATE (firstname), SUBSTRING (firstname, 0, 2);
grunt> DUMP substrdata;

Output:
[Image: output of the SUBSTRING command]
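
To make the index handling concrete, here is a minimal sketch that assumes the relation "cityexample" loaded above. The relation name parts and the field names first_two and next_three are illustrative and not part of the original example.

```pig
-- Assumes the relation "cityexample" loaded earlier in this section.
-- startIndex is inclusive and stopIndex is exclusive; both are zero-based.
-- SUBSTRING(firstname, 0, 2) keeps the characters at indexes 0 and 1.
-- SUBSTRING(firstname, 2, 5) keeps the characters at indexes 2, 3 and 4.
parts = FOREACH cityexample GENERATE
    firstname,
    SUBSTRING(firstname, 0, 2) AS first_two,
    SUBSTRING(firstname, 2, 5) AS next_three;
DUMP parts;
-- For the first record, 'Yolando', this yields (Yolando,Yo,lan).
```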


INDEXOF()

INDEXOF function is used to determine the index of the first occurrence of a character in a string. The forward search for the character begins at the designated start index.

Syntax:
grunt> INDEXOF(string, 'character', startIndex)

Now we will use the INDEXOF() function to find the first occurrence of 'g' in the firstname column of the relation “cityexample”. If the character 'g' is found, the function returns the zero-based index of its first occurrence; if it is not found, the function returns -1. The match is case-sensitive.

We will use the DUMP operator to print the output on the terminal.

Command:
grunt> indexdata = FOREACH cityexample GENERATE (firstname), INDEXOF(firstname, 'g',0);
grunt> DUMP indexdata;

Output:
[Image: output of the INDEXOF command]
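
Because INDEXOF returns -1 when the character is absent, its result can also drive a filter. The sketch below assumes the relation "cityexample" loaded earlier; the names positions, with_g, and pos are illustrative.

```pig
-- Assumes the relation "cityexample" loaded earlier in this section.
-- INDEXOF is case-sensitive and returns a zero-based position, or -1 if absent.
positions = FOREACH cityexample GENERATE firstname, INDEXOF(firstname, 'g', 0) AS pos;

-- Keep only the names that actually contain a lowercase 'g'.
with_g = FILTER positions BY pos >= 0;
DUMP with_g;
-- With the sample data this should leave (Gregoria,3) and (Evangelina,4).
```

The capital 'G' at the start of "Gregoria" does not match, which is why the reported index is 3 rather than 0.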