Apache Pig Introduction

What is Apache Pig?

Apache Pig is a high-level platform for processing large data sets on Hadoop. All data manipulation is expressed in Pig Latin, a dataflow language that lets users describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel.

Pig Latin provides a range of operators that developers use to write programs that read, write, and process data. When a user submits a Pig script, the Pig engine converts it into one or more MapReduce jobs for execution.
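The read, process, store flow described above can be illustrated with a small sketch. The Python below is not Pig and does not run on Hadoop; it only mimics, on an in-memory list, the shape of a Pig Latin script, with the roughly corresponding Pig Latin statement shown in a comment above each step. The relation names, field names, file names, and sample records are all hypothetical.

```python
from collections import defaultdict

# Hypothetical input: "user,url,seconds" lines, standing in for a file in HDFS.
RAW_LINES = [
    "alice,/home,12",
    "bob,/home,3",
    "alice,/cart,40",
    "carol,/home,25",
    "bob,/cart,8",
]

# visits = LOAD 'visits.csv' USING PigStorage(',')
#          AS (user:chararray, url:chararray, seconds:int);
def load(lines):
    for line in lines:
        user, url, seconds = line.split(",")
        yield (user, url, int(seconds))

# long_visits = FILTER visits BY seconds >= 10;
def filter_long(visits, min_seconds=10):
    return (v for v in visits if v[2] >= min_seconds)

# by_url = GROUP long_visits BY url;
def group_by_url(visits):
    groups = defaultdict(list)
    for visit in visits:
        groups[visit[1]].append(visit)
    return groups

# counts = FOREACH by_url GENERATE group AS url, COUNT(long_visits) AS n;
def count_groups(groups):
    return {url: len(rows) for url, rows in groups.items()}

# STORE counts INTO 'long_visit_counts';
def store(counts):
    for url, n in sorted(counts.items()):
        print(f"{url}\t{n}")

counts = count_groups(group_by_url(filter_long(load(RAW_LINES))))
store(counts)
```

Each step takes the output of the previous one as its input, which is what makes the script a dataflow. On Hadoop, the Pig engine would translate such a script into MapReduce jobs; the grouping step is the kind of operation that typically calls for a reduce phase.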

Apache Pig provides the following features.

1. Ease of Programming

Pig Latin is easy to learn and resembles SQL in many of its operations. Developers who already know SQL can pick it up quickly.

2. Extensibility

Users can create their own functions (user-defined functions, or UDFs) to read, process, and store data.

3. Optimization Opportunities

Apache Pig optimizes execution automatically, which allows users to focus on semantics rather than efficiency.

4. Data Types

Apache Pig can process both structured and unstructured data. After processing, it stores the results in HDFS.
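As a sketch of the extensibility feature listed above: Pig supports UDFs written in several languages, including Python run through Jython. The example below assumes that setup. The function name `email_domain`, the file name `udfs.py`, the alias `myudfs`, and the input file are hypothetical. The `outputSchema` decorator is expected to be supplied by Pig when it loads the script; the fallback definition is an assumption added only so the file can also be run in a plain Python interpreter.

```python
# udfs.py (hypothetical file name)
# When Pig loads this file through Jython, it is expected to supply the
# outputSchema decorator. The fallback below is only for running the file
# outside Pig and is not part of Pig itself.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema  # kept for inspection only
            return func
        return decorator


@outputSchema("domain:chararray")
def email_domain(address):
    """Return the lower-cased domain of an e-mail address, or None."""
    if address is None or "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()


# Hypothetical use from a Pig Latin script:
#   REGISTER 'udfs.py' USING jython AS myudfs;
#   users   = LOAD 'users.csv' USING PigStorage(',')
#             AS (name:chararray, email:chararray);
#   domains = FOREACH users GENERATE name, myudfs.email_domain(email);
```

Returning `None` for missing or malformed input is a design choice for this sketch, so that one bad record yields a null field rather than an exception that stops the whole job.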

Apache Pig History

Let us look at the year-by-year evolution of Apache Pig.

2006: Pig was developed for researchers to have an ad-hoc way of creating and executing MapReduce jobs on very large data sets.

2007: Pig was open-sourced through the Apache Incubator.

2008: The first Pig release came out, and Pig became a subproject of Apache Hadoop.

2009: Companies started to use Pig for their data processing. Amazon also used Pig as a part of the Elastic MapReduce service.

2010: Pig became an Apache top-level project.

The following table lists the Apache Pig releases and their release dates.

Version   Release Date (DD-MM-YYYY)
0.1.1     11-09-2008
0.2.0     08-04-2009
0.3.0     25-06-2009
0.4.0     29-08-2009
0.5.0     29-09-2009
0.6.0     01-03-2010
0.7.0     13-05-2010
0.8.1     17-12-2010
0.9.2     29-07-2011
0.10.1    22-01-2012
0.11.1    21-02-2013
0.12.1    14-10-2013
0.13.0    04-07-2014
0.14.0    20-11-2014
0.15.0    06-06-2015
0.16.0    08-06-2016
0.17.0    19-06-2017

SQL vs Pig Latin

The main differences between SQL and Pig Latin are as follows.

SQL | Pig Latin
SQL is a query language used to fetch the required data from a database. | Pig Latin is a high-level dataflow language in which the user describes exactly how to process the input data.
SQL is oriented toward answering a question. | Pig Latin is designed around a series of data operations, which makes it a natural fit for writing data pipelines.
SQL queries are run against an RDBMS, where data is stored in normalized form. | Pig was designed for the Hadoop big data processing framework.
SQL requires data to be loaded into tables first. | Pig can operate on data as soon as it is loaded into the Hadoop file system (HDFS).
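To make the contrast above concrete, the sketch below computes the same result twice in Python: once declaratively with SQL (using the standard-library `sqlite3` module as a stand-in for an RDBMS), and once as an explicit sequence of steps in the style of a Pig Latin pipeline. It is only an illustration. The table, columns, and sample rows are hypothetical, and the step-by-step half is ordinary Python with the roughly corresponding Pig Latin statement in comments, not Pig itself.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, amount) order records.
ORDERS = [("alice", 30.0), ("bob", 5.0), ("alice", 20.0),
          ("carol", 7.5), ("bob", 50.0)]

# --- SQL style: load into a table first, then state WHAT result is wanted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", ORDERS)
sql_result = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "WHERE amount >= 10 GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

# --- Dataflow style: spell out HOW the data moves, one step at a time.
# big = FILTER orders BY amount >= 10;
big = [order for order in ORDERS if order[1] >= 10]

# by_customer = GROUP big BY customer;
by_customer = defaultdict(list)
for customer, amount in big:
    by_customer[customer].append(amount)

# totals = FOREACH by_customer GENERATE group, SUM(big.amount);
totals = [(customer, sum(amounts)) for customer, amounts in by_customer.items()]

# ordered = ORDER totals BY $0;
pipeline_result = sorted(totals)

print(sql_result)
print(pipeline_result)
```

Both halves produce the same totals. The difference is that the SQL version needs a table to exist before the query can run and leaves the execution order to the database, while the dataflow version works directly on the raw records and names every intermediate result.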

Apache Pig Applications

Some notable uses of Pig are listed below.

  • Yahoo estimates that between 40% and 60% of its Hadoop workloads are generated from Pig Latin scripts.
  • Twitter uses Pig for processing logs and mining tweet data.
  • LinkedIn uses Pig to discover people you may know.
  • Pig is used for processing web log data.
  • Pig has been used to compare the performance of a commercial RDBMS and Hadoop on astronomy simulation analysis tasks.