Top 50 Apache Pig Questions and Answers

1. What is Apache Pig?

Apache Pig is an open-source platform that runs on Hadoop. It provides the Pig Latin language for data manipulation on Hadoop, such as reading, writing, and processing datasets. Pig uses HDFS for storage and MapReduce for processing.

2. What is the difference between Pig and SQL?

  • Pig Latin is a high-level data-flow language in which the user describes, step by step, how the input data is processed, whereas SQL is a declarative query language in which the user states what result is wanted and leaves the execution strategy to the system.
  • Apache Pig is designed to run on Hadoop, where data is often unstructured and the schema may be inconsistent, whereas SQL works with RDBMS systems in which the schema is fixed and the data is mostly normalized.
  • Apache Pig can process data as soon as it is copied into HDFS; the data does not have to be organized into tables. In SQL, the data must be stored in a table before it can be processed.

3. What is the role of MapReduce when we program in Apache Pig?

Programs for Apache Pig are written in Pig Latin, but they need an execution engine to run. MapReduce serves as that execution engine for Pig Latin programs.

4. What is a bag in Apache Pig?

In Apache Pig, a bag is a collection of tuples.

5. What are the different data types in Apache Pig?

Apache Pig supports the following complex data types.

  • Map: It is a representation of key-value pairs.
  • Bag: It represents the collection of tuples.
  • Tuple: It is a representation of an ordered set of fields.

6. What is PigStorage in Apache Pig?

PigStorage is the default function for loading and storing relations in field-delimited text format. Each line is split into fields on its delimiter (a tab character by default), and the fields are stored as a tuple.
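
As a minimal sketch of how PigStorage is typically used, the statements below load and store delimited text. The file names and the schema (id, name, salary) are assumptions made for illustration only.

```pig
-- Hypothetical input: 'emp.csv' holds comma-separated lines of id,name,salary.
emp = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray, salary:double);

-- With no USING clause, LOAD falls back to PigStorage with a tab delimiter.
emp_tab = LOAD 'emp.tsv' AS (id:int, name:chararray, salary:double);

-- Write the relation back out, this time pipe-delimited ('emp_out' is a hypothetical path).
STORE emp INTO 'emp_out' USING PigStorage('|');
```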

7. What is Apache Pig Grunt Shell?

Apache Pig Grunt Shell is an interactive shell in which we can write and run Pig Latin statements.

8. What is UDF in Apache Pig?

UDF stands for user-defined function. When Apache Pig’s built-in functions do not provide some functionality, the developer can write their own function in a programming language such as Java or Python and use that function in a Pig Latin script.
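
The sketch below shows one way a Java UDF might be wired into a script. The jar name myudfs.jar, the class com.example.pig.ToUpper, and the input file are all hypothetical; only the REGISTER and DEFINE statements are standard Pig Latin.

```pig
-- Make the (hypothetical) jar that contains the UDF class available to Pig.
REGISTER myudfs.jar;

-- Give the (hypothetical) UDF class a short alias for use in the script.
DEFINE TO_UPPER com.example.pig.ToUpper();

-- Hypothetical input with an id and a name per line.
emp = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray);

-- Call the UDF like any built-in function.
upper_names = FOREACH emp GENERATE id, TO_UPPER(name);
```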

9. What is the latest release of Apache Pig?

The latest release of Apache Pig is 0.17.0, which was released on 19-06-2017.

10. What are the components of Apache Pig?

Apache Pig provides the below list of components.

  • Parser
  • Optimizer
  • Compiler
  • Execution Engine

11. What is the Local execution mode of Apache Pig?

We can start the Apache Pig Grunt shell in local mode using the “$ pig -x local” command. Local mode runs on a single machine against the local file system and is used for testing and development.

12. What is the MapReduce execution mode of Apache Pig?

MapReduce mode is the default execution mode of the Apache Pig Grunt shell. In this mode the dataset must be loaded into HDFS before any operation can be performed on it. When we run Pig Latin statements on that dataset, MapReduce jobs are started in the background.

We can start MapReduce mode using the “$ pig” or “$ pig -x mapreduce” command.

13. Which types of Java UDFs does Apache Pig support?

Apache Pig supports the following types of Java UDF.

  • Eval functions (including Algebraic functions)
  • Filter functions

14. Can we combine the contents of two or more relations, and divide a single relation into two or more relations?

Yes. The UNION operator combines the contents of two or more relations, and the SPLIT operator divides a single relation into two or more relations.
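
A minimal sketch of both operators, assuming two hypothetical sales files that share the same schema:

```pig
-- Hypothetical inputs with identical schemas.
sales_a = LOAD 'sales_2016.csv' USING PigStorage(',') AS (region:chararray, amount:double);
sales_b = LOAD 'sales_2017.csv' USING PigStorage(',') AS (region:chararray, amount:double);

-- UNION merges the tuples of both relations into one relation.
all_sales = UNION sales_a, sales_b;

-- SPLIT routes each tuple to every output relation whose condition it satisfies.
-- Tuples with a null amount satisfy neither condition and are dropped.
SPLIT all_sales INTO small_sales IF amount < 1000.0, large_sales IF amount >= 1000.0;
```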

15. How to perform diagnosis and exception handling in Apache Pig?

We can do this using the following diagnostic operators.

  • DUMP
  • DESCRIBE
  • EXPLAIN
  • ILLUSTRATE
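
As an illustration, DUMP is usually used alongside Pig's other diagnostic operators (DESCRIBE, EXPLAIN, and ILLUSTRATE). The input file and schema below are assumptions.

```pig
-- Hypothetical input relation.
emp = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray, salary:double);

DESCRIBE emp;    -- print the schema of the relation
EXPLAIN emp;     -- print the logical, physical, and MapReduce plans
ILLUSTRATE emp;  -- show step-by-step execution on a small data sample
DUMP emp;        -- execute the statements and print the tuples to the terminal
```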

16. Does Pig support multi-line commands?

Yes, Pig supports multiline commands.

17. What are the operators for loading and storing data in Pig?

For loading and storing data in HDFS, Pig uses the following two operators.

  • LOAD
  • STORE

18. Which built-in Math functions are available in Pig?

The following Math functions are available in Apache Pig.

  • ABS(expression)
  • ACOS(expression)
  • ASIN(expression)
  • ATAN(expression)
  • CBRT(expression)
  • CEIL(expression)
  • COS(expression)
  • COSH(expression)
  • EXP(expression)
  • FLOOR(expression)
  • LOG(expression)
  • RANDOM()
  • ROUND(expression)
  • SIN(expression)
  • SINH(expression)
  • SQRT(expression)
  • TAN(expression)
  • TANH(expression)

19. Which built-in Eval functions are available in Pig?

The following Eval functions are available in Apache Pig.

  • AVG()
  • BagToString()
  • CONCAT()
  • COUNT()
  • DIFF()
  • IsEmpty()
  • MAX()
  • MIN()
  • PluckTuple()
  • SIZE()
  • SUM()

20. Which built-in String functions are available in Pig?

The following String functions are available in Apache Pig.

  • ENDSWITH(string, testdata)
  • STARTSWITH(string, substring)
  • SUBSTRING(string, startIndex, stopIndex)
  • EqualsIgnoreCase(stringdata1, stringdata2)
  • INDEXOF(string, ‘chardata’, startIndex)
  • LAST_INDEX_OF(string, ‘chardata’)
  • LCFIRST(expression)
  • UCFIRST(expression)
  • UPPER(expression)
  • LOWER(expression)
  • REPLACE(string, ‘olddata’, ‘newdata’)
  • STRSPLIT(string, regex, limit)
  • STRSPLITTOBAG(string, regex, limit)
  • TRIM(expression)
  • LTRIM(expression)
  • RTRIM(expression)

21. What is the Physical plan in Apache Pig?

In the physical plan, the logical operators of a Pig script are converted into physical operators. Execution of the script in its physical form starts from this stage.

22. What is the use of the DUMP operator?

Apache Pig's DUMP operator executes the statements and prints the result on the terminal.

23. What are Relational Operators present in Apache Pig?

Relational operators available in Apache Pig include the following. (DUMP is classed as a diagnostic operator rather than a relational one.)

  • LOAD
  • STORE
  • FILTER
  • FOREACH
  • GROUP
  • JOIN

24. What is the difference between the STORE and DUMP commands?

The DUMP operator displays the result on the terminal, whereas the STORE operator writes the output to a file.

25. What is COGROUP in Apache Pig?

The COGROUP operator groups the data of two or more relations by a common key.
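
A short sketch of COGROUP on two relations, assuming hypothetical files that both carry a pet field:

```pig
-- Hypothetical inputs that share the 'pet' field.
owners  = LOAD 'owners.csv'  USING PigStorage(',') AS (owner:chararray, pet:chararray);
friends = LOAD 'friends.csv' USING PigStorage(',') AS (friend:chararray, pet:chararray);

-- Each output tuple has three fields: the group key, a bag of matching
-- tuples from owners, and a bag of matching tuples from friends.
grouped = COGROUP owners BY pet, friends BY pet;

DESCRIBE grouped;
```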

26. What is the use of the DISTINCT keyword in Apache Pig scripts?

The DISTINCT keyword removes duplicate records. It works on complete records, not on individual fields.
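
Because DISTINCT compares whole records, removing duplicates on a single field means projecting that field first. A sketch, with an assumed input file and schema:

```pig
-- Hypothetical input: one line per page visit.
visits = LOAD 'visits.csv' USING PigStorage(',') AS (user_id:chararray, url:chararray);

-- Removes tuples that are identical in every field.
unique_visits = DISTINCT visits;

-- To de-duplicate on one field, project it first and then apply DISTINCT.
user_ids = FOREACH visits GENERATE user_id;
unique_users = DISTINCT user_ids;
```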

27. What are the primitive data types present in Apache Pig?

The following primitive data types are available in Apache Pig.

  • int
  • long
  • float
  • double
  • chararray
  • bytearray

28. What is a relation in Apache Pig?

A relation in Apache Pig is similar to a table in a relational database management system, and the tuples of a relation are similar to the rows of a table. Unlike a table in an RDBMS, the tuples of a Pig relation are not required to have the same number of fields.

29. What is the need for Apache Pig?

Apache Pig is a high-level Big Data framework used to process and analyze large data sets. It uses the Pig Latin language to work on the user's data: loading, storing, processing, and so on. At a high level, the Apache Pig engine converts a Pig Latin script into MapReduce jobs, which then do the actual processing.

30. What is the difference between MapReduce and Apache Pig?

  • In MapReduce, we need to write the entire logic for operations such as join, group, filter, and sum, whereas Pig provides built-in operators and functions for them.
  • In MapReduce, even simple functionality requires many lines of code, whereas in Pig it can be done in a few lines.
  • As a rough rule of thumb, about ten lines of Pig Latin do the work of about 200 lines of MapReduce code.
  • Apache Pig gives higher developer productivity than hand-written MapReduce.

31. What is the use of FOREACH?

The FOREACH operator applies a transformation to each tuple of a relation and generates a new relation.
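
A minimal FOREACH sketch; the input file, schema, and the 10% raise are assumptions chosen for illustration:

```pig
-- Hypothetical input relation.
emp = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray, salary:double);

-- For every tuple, keep the id, upper-case the name, and compute a raised salary.
raised = FOREACH emp GENERATE id, UPPER(name) AS name, salary * 1.10 AS new_salary;
```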

32. What is BloomMapFile used for?

BloomMapFile is a class that extends MapFile. It uses dynamic Bloom filters to provide a quick membership test for the keys.

33. What are the Apache Pig use cases?

Apache Pig is used to process data for search platforms. For example, Yahoo uses Apache Pig to analyze data gathered from the Yahoo search engine and Yahoo news feeds. Pig is also used to process weblogs, streaming online data, and so on.

34. What are the ways to execute Pig script?

There are three ways to execute a Pig script.

  • Grunt Shell
  • Script File
  • Embedded Script

35. What is an inner bag and outer bag in Pig?

An inner bag is a bag contained inside a tuple. An outer bag is a bag of tuples that forms a relation, which is similar to a table in an RDBMS.

36. What is Filter Operator in Apache Pig?

The FILTER operator selects tuples from a relation based on some condition.

grunt> la_dept = FILTER dept BY state == 'LA';

37. What is Apache Pig Statistics?

Apache Pig Statistics is a framework that collects and stores statistics for Pig Latin scripts. Once the job is completed, these statistics can be retrieved for review.

38. Is Pig Latin language case-sensitive?

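Pig Latin is partly case-sensitive. Keywords such as LOAD and FOREACH are not case-sensitive, but the names of relations, fields, and functions are. For example, DATA = load ‘emp’ and data = load ‘emp’ define two different relations.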

39. Does Pig throw a warning when there is a type mismatch or field mismatch?

Apache Pig does not stop with an error when there is a type mismatch or field mismatch; at most it records a warning in the log. A value that does not match the expected type is treated as null.

40. What is GROUP Operator in Pig?

The GROUP operator groups together the tuples that have the same group key (key field).
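
A sketch of GROUP followed by a per-group aggregation, assuming a hypothetical employee file with a dept field:

```pig
-- Hypothetical input relation.
emp = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:double);

-- Each output tuple is (group, {bag of emp tuples that share that dept}).
by_dept = GROUP emp BY dept;

-- Aggregate over the bag produced for each group.
dept_totals = FOREACH by_dept GENERATE group AS dept, COUNT(emp) AS headcount, SUM(emp.salary) AS total_salary;
```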

41. What is JOIN Operator in Pig?

The JOIN operator combines records from two or more relations on a common field. Apache Pig supports the following kinds of JOIN.

  • Self-Join
  • Inner-Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join
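
The join types listed above can be sketched as follows. The file names and schemas are assumptions; for the self-join, the same data is loaded under two aliases.

```pig
-- Hypothetical inputs that share the dept_id field.
emp  = LOAD 'emp.csv'  USING PigStorage(',') AS (id:int, name:chararray, dept_id:int);
dept = LOAD 'dept.csv' USING PigStorage(',') AS (dept_id:int, dept_name:chararray);

inner_j = JOIN emp BY dept_id, dept BY dept_id;              -- inner join (default)
left_j  = JOIN emp BY dept_id LEFT OUTER,  dept BY dept_id;  -- keep unmatched emp tuples
right_j = JOIN emp BY dept_id RIGHT OUTER, dept BY dept_id;  -- keep unmatched dept tuples
full_j  = JOIN emp BY dept_id FULL OUTER,  dept BY dept_id;  -- keep unmatched tuples from both

-- Self-join: load the same file a second time under a different alias.
emp_copy = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray, dept_id:int);
self_j   = JOIN emp BY id, emp_copy BY id;
```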

42. What are the stats classes available in the org.apache.pig.tools.pigstats package?

The following stats classes are available in the org.apache.pig.tools.pigstats package.

  • PigStats
  • JobStats
  • OutputStats
  • InputStats

43. How to start Pig Grunt Shell in local mode?

We can start the Apache Pig Grunt shell in local mode using the command below.

$ pig -x local

44. How to start Pig Grunt Shell in MapReduce (HDFS) mode?

We can start the Apache Pig Grunt shell in MapReduce mode, which works on data stored in HDFS, using the command below.

$ pig -x mapreduce

45. What is the Apache Pig engine?

The Pig engine converts Pig Latin operations into MapReduce jobs. It provides the environment in which Pig Latin scripts are executed.

46. What is the way to interact with Hadoop using Pig?

We can start the Pig Grunt shell in MapReduce mode ( $ pig -x mapreduce ) and access Hadoop HDFS from it.

47. How to get the top 50 tuples from a relation?

We can use the TOP() function to get the top tuples from a bag. Alternatively, we can sort the relation with ORDER BY and then apply LIMIT 50.
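
Two possible ways to get 50 tuples ranked by a field are sketched below. The input file, the schema, and the choice of salary as the ranking field are assumptions.

```pig
-- Hypothetical input relation; salary is the field at index 2 (0-based).
emp = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray, salary:double);

-- Option 1: sort the whole relation, then keep the first 50 tuples.
sorted_emp = ORDER emp BY salary DESC;
top50_sorted = LIMIT sorted_emp 50;

-- Option 2: TOP(n, column_index, bag) works on a bag, so gather all tuples
-- into one bag first and flatten the bag that TOP returns.
all_emp = GROUP emp ALL;
top50_bag = FOREACH all_emp GENERATE FLATTEN(TOP(50, 2, emp));
```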

48. What is the difference between COUNT and COUNT_STAR?

The major difference between COUNT and COUNT_STAR is that COUNT ignores NULL values (it skips tuples whose first field is NULL), whereas COUNT_STAR includes them in the count.
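
A sketch that applies both functions to the same bag, assuming a hypothetical input in which some name values are missing:

```pig
-- Hypothetical input; some lines are assumed to have an empty (null) name.
emp = LOAD 'emp.csv' USING PigStorage(',') AS (name:chararray, dept:chararray);

-- Project the field of interest so it becomes the first field of each tuple.
names = FOREACH emp GENERATE name;
all_names = GROUP names ALL;

-- COUNT skips tuples whose first field is null; COUNT_STAR counts every tuple.
counts = FOREACH all_names GENERATE COUNT(names) AS non_null_names, COUNT_STAR(names) AS all_rows;
```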

49. What is flatten in Pig?

FLATTEN removes a level of nesting from the data in a tuple or bag.
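
As an illustration, the sketch below un-nests the bag that GROUP produces. The input file and schema are assumptions.

```pig
-- Hypothetical input relation.
emp = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray);

-- After GROUP, each tuple is (group, {bag of emp tuples}).
by_dept = GROUP emp BY dept;

-- FLATTEN on the bag emits one output tuple per element of the bag,
-- turning (dept, {(n1), (n2)}) into (dept, n1) and (dept, n2).
dept_names = FOREACH by_dept GENERATE group AS dept, FLATTEN(emp.name) AS name;
```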

50. What are the limitations of Apache Pig?

  • Apache Pig is designed for ETL-type use cases; it is not a good choice for real-time scenarios.
  • Apache Pig is not a good choice for looking up a single record in a huge data set.
  • The Apache Pig engine uses Hadoop MapReduce for processing, which is itself a batch-processing framework.