Apache Hive is a data-warehousing solution built on top of the Hadoop framework. Hive stores its data in Hadoop HDFS and relies on Hadoop features such as massive scale-out and fault tolerance to process large datasets. It supports ad hoc queries, large-scale data analysis, and summarization. Users write queries in Hive's query language, HQL (also called HiveQL), which is similar to SQL.
In this tutorial, we will see the major components of Apache Hive.
Apache Hive Architecture
The following are the major components of Apache Hive and how they interact with Hadoop.
Major components of Hive Architecture
- Thrift Server, CLI, and UI: These are the entry points through which clients interact with Apache Hive. The command-line interface lets users run Hive queries, monitor processes, and so on. The Thrift server accepts connections from clients that use the JDBC and ODBC protocols.
- Driver: The driver starts a session to process the client request and acts as the controller for Hive queries. It monitors the execution life cycle of each query and stores the metadata generated during execution.
- Execution Engine: The execution engine processes the steps that the compiler produces in the form of a DAG of stages. It executes the stages in dependency order on the relevant nodes.
- Optimizer: The optimizer transforms the execution plan to produce an optimized DAG. Typical transformations include converting a series of joins into a single join and splitting a task, for example by applying a transformation to the data before it reaches the reduce stage.
- Compiler: The compiler takes the user's query and converts it into an execution plan. The plan contains the steps that Hadoop MapReduce must execute to produce the desired output.
- Metastore: The metastore stores the metadata of Hive objects, such as their definitions, locations, and schemas. Hive keeps this metadata in a traditional RDBMS. By default it uses an embedded Apache Derby database, but it also supports other RDBMSs such as SQL Server.
- HCatalog: HCatalog is a table and storage management layer for Hadoop that lets end users with different data processing tools read and write data on the grid more easily.
- WebHCat: This component provides a service that end users can use to run Hadoop jobs. It also provides an HTTP interface for performing Hive metadata operations.
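To make the idea of a DAG of stages more concrete, here is a minimal Python sketch of how an engine might order stages so that each one runs only after the stages it depends on. This is not Hive code: the stage names are hypothetical, and the standard-library graphlib module (Python 3.9 or later) stands in for Hive's own scheduling logic.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
# In Hive the compiler produces the real plan; these names are illustrative.
plan = {
    "scan_emp": set(),
    "scan_dept": set(),
    "join": {"scan_emp", "scan_dept"},
    "aggregate": {"join"},
    "move_result": {"aggregate"},
}


def execution_order(dag):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(plan)
print(order)
```

The two scan stages have no dependency on each other, so they may appear in either order, but the join always comes after both of them.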
Query Execution flow in Hive
Let us walk through the query execution flow in Hive (refer to the snapshot above).
Step 1: The user submits a request to the driver.
Step 2: The driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.
Step 3: The compiler requests the required metadata from the metastore.
Step 4: The compiler uses this metadata to type check the expressions in the query tree and to prune partitions based on the query predicates.
Step 5: The compiler generates the plan, which is a DAG of stages. Each stage is either a map/reduce job, a metadata operation, or an operation on HDFS.
Step 6: The execution engine processes the stages. In each map/reduce stage, the mappers and reducers read rows from HDFS and write their output to a temporary HDFS file. That temporary file then provides the input data for the subsequent map/reduce stages in the plan.
Step 7: The user makes a fetch call to the driver.
Step 8: The driver interacts with the execution engine to fetch the result.
Step 9: The execution engine sends the result to the driver, and the driver sends the data to the user.
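The flow above can be sketched as a toy Python model that shows which component calls which. None of these classes are Hive APIs. Every class, method, and table name below is hypothetical, and the "query" is reduced to selecting one column from one table.

```python
class Metastore:
    """Holds table metadata (here: table name -> list of column names)."""

    def __init__(self, tables):
        self.tables = tables

    def get_schema(self, table):
        return self.tables[table]


class Compiler:
    def __init__(self, metastore):
        self.metastore = metastore

    def compile(self, table, column):
        # Steps 3-4: fetch metadata and check the query against it.
        schema = self.metastore.get_schema(table)
        if column not in schema:
            raise ValueError(f"unknown column {column!r} in table {table!r}")
        # Step 5: return the plan as an ordered list of stages.
        return [("scan", table), ("project", column)]


class ExecutionEngine:
    def __init__(self, data):
        self.data = data  # table name -> list of row dicts (stands in for HDFS)

    def run(self, plan):
        # Step 6: run each stage, feeding its output to the next one.
        rows = []
        for op, arg in plan:
            if op == "scan":
                rows = self.data[arg]
            elif op == "project":
                rows = [row[arg] for row in rows]
        return rows


class Driver:
    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine
        self.results = None

    def submit(self, table, column):
        # Steps 1-2: accept the request and ask the compiler for a plan.
        plan = self.compiler.compile(table, column)
        self.results = self.engine.run(plan)

    def fetch(self):
        # Steps 7-9: hand the results back to the user.
        return self.results


metastore = Metastore({"emp": ["name", "city"]})
engine = ExecutionEngine(
    {"emp": [{"name": "Ana", "city": "Pune"}, {"name": "Raj", "city": "Delhi"}]}
)
driver = Driver(Compiler(metastore), engine)
driver.submit("emp", "name")
print(driver.fetch())
```

The compiler never touches the data, and the execution engine never touches the metastore. This separation is the main point of the sketch.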
Apache Hive Data Model
The Apache Hive data model consists of the following components: tables, partitions, and buckets.
A table is a collection of records stored in a database in the form of rows and columns. We can use HQL to perform different types of operations on Hive tables. A table's data is stored in a directory in HDFS, and Hive is also compatible with other file systems such as Amazon S3 and Alluxio.
Apache Hive supports the following two types of tables.
a. Managed Tables
Managed tables are owned by Apache Hive, and all write operations on them are performed using Hive SQL commands. Hive manages the life cycle of managed tables: if a managed table is dropped, both its data and its metadata are deleted. By default, a managed table is stored under the directory set by the hive.metastore.warehouse.dir property, and the location can be changed during table creation.
b. External Tables
The files of an external table are managed by processes outside of Hive. The data for an external table can live in an external source, for example a remote Hadoop HDFS or an external storage volume. We can still perform write operations on external tables using Hive SQL commands. If an external table or partition is dropped, only the metadata associated with the table or partition is deleted, and the underlying data files remain in place.
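The difference in drop behavior between the two table types can be illustrated with a small Python model. This is only a sketch: the catalog and storage dictionaries are hypothetical stand-ins for the metastore and HDFS, and the paths are made up.

```python
def drop_table(catalog, storage, name):
    """Drop a table from a toy catalog.

    catalog: table name -> {"type": "managed" | "external", "location": path}
    storage: path -> list of data files (stands in for HDFS)
    """
    table = catalog.pop(name)  # the metadata is removed in both cases
    if table["type"] == "managed":
        # Managed table: the data is removed together with the metadata.
        storage.pop(table["location"], None)
    # External table: the data files are left untouched.


catalog = {
    "sales": {"type": "managed", "location": "/warehouse/sales"},
    "logs": {"type": "external", "location": "/data/logs"},
}
storage = {
    "/warehouse/sales": ["part-0"],
    "/data/logs": ["log-0", "log-1"],
}

drop_table(catalog, storage, "sales")
drop_table(catalog, storage, "logs")
print(catalog)  # both tables are gone from the catalog
print(storage)  # only the external table's files remain
```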
A partition divides a table into multiple parts based on the values of its partition columns, such as date, department, city, or country. Partitioning lets a query fetch only a portion of the data, which reduces resource utilization. A table can have one or more partition keys, and these keys determine how the data is stored. For example, in a table emp with a date partition column ds, the files holding the data for a particular date are stored in a separate directory in HDFS. Partitions allow the system to prune the data to be inspected based on query predicates. For example, a query that is interested in rows from emp that satisfy the predicate emp.ds = '2010-10-09' would only have to look at the files in the /ds=2010-10-09/ directory in HDFS.
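Partition pruning can be sketched in a few lines of Python. The example assumes the partition column ds from the text and the common column=value directory naming. The /warehouse/emp prefix and the function name are hypothetical, and real pruning happens inside the Hive compiler rather than by string matching.

```python
def prune_partitions(partition_dirs, column, value):
    """Keep only the partition directories whose name is column=value."""
    wanted = f"{column}={value}"
    return [
        d for d in partition_dirs
        if d.rstrip("/").split("/")[-1] == wanted
    ]


# Hypothetical partition directories of the emp table.
emp_partitions = [
    "/warehouse/emp/ds=2010-10-08",
    "/warehouse/emp/ds=2010-10-09",
    "/warehouse/emp/ds=2010-10-10",
]

# A predicate such as emp.ds = '2010-10-09' needs only one directory.
print(prune_partitions(emp_partitions, "ds", "2010-10-09"))
```

Only the matching directory is read, so the other two partitions are never scanned.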
Bucketing is similar to partitioning, except that the data in each partition is divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing is intended to produce an even distribution of rows across buckets, which makes it useful for jobs that need to sample data from the table.
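As a rough illustration of hash-based bucketing, the Python sketch below assigns integer keys to buckets by taking the key modulo the number of buckets. This is an assumption made for the example: the function names are hypothetical, and Hive's actual hash function depends on the column type and the Hive version.

```python
from collections import defaultdict


def bucket_for(key, num_buckets):
    """Pick a bucket for an integer key (simplified: the key is its own hash)."""
    return (key & 0x7FFFFFFF) % num_buckets


def assign_buckets(keys, num_buckets):
    """Group keys by the bucket they fall into."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[bucket_for(key, num_buckets)].append(key)
    return dict(buckets)


# Eight employee ids spread over four buckets.
print(assign_buckets(range(1, 9), 4))
```

With evenly spread keys, each bucket receives about the same number of rows, so reading a single bucket gives a sample of roughly one quarter of the data in this example.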
Apache Hive Metastore
The Apache Hive metastore is an important component of the Hive architecture. It stores the metadata of all objects in Hive and provides two features: data abstraction and data discovery. With data abstraction, information such as data formats and extractors is supplied when a table is created and is reused whenever the table is referenced during execution. Data discovery helps users find the relevant data in the data warehouse.
Apache Hive Query Language
HiveQL is very similar to Structured Query Language (SQL). It is used to perform operations such as create, load, select, and export on data stored in Hadoop HDFS. Hive also lets users plug custom MapReduce scripts into queries, and it supports inserting into multiple tables from a single query.
Apache Hive Compiler
The following tasks are performed during the compilation phase of a query.
- Parser: This stage transforms the query string into a parse tree.
- Semantic Analyser: In this stage of the compiler, column names are verified, expressions such as * are expanded, and type checking and implicit type conversions are performed. For a partitioned table, the expressions on partition columns are collected so that they can be used to identify which partitions are needed.
- Logical Plan Generator: In this stage, the query is converted into a logical plan made up of relational algebra operators such as "join" and "filter". The optimizer then transforms this plan to tune performance.
- Query Plan Generator: In this stage, the logical plan is converted into MapReduce tasks, which are then submitted to Hadoop for execution.
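The four compilation stages can be mimicked with a toy Python pipeline. This is a sketch under heavy simplifying assumptions, not how the Hive compiler is implemented: it only understands queries of the form SELECT columns FROM table, and the schema dictionary, function names, and stage layout are all hypothetical.

```python
import re

# Hypothetical metastore contents: table name -> column names.
SCHEMAS = {"emp": ["name", "city", "ds"]}


def parse(query):
    """Parser: turn the query string into a small parse tree (a dict)."""
    pattern = r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)\s*;?\s*"
    match = re.fullmatch(pattern, query, re.IGNORECASE)
    if match is None:
        raise SyntaxError(f"cannot parse: {query!r}")
    columns = [c.strip() for c in match.group(1).split(",")]
    return {"select": columns, "from": match.group(2)}


def analyze(tree, schemas):
    """Semantic analyser: expand * and verify the column names."""
    schema = schemas[tree["from"]]
    columns = schema if tree["select"] == ["*"] else tree["select"]
    unknown = [c for c in columns if c not in schema]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    return {"select": list(columns), "from": tree["from"]}


def logical_plan(tree):
    """Logical plan generator: express the query as a list of operators."""
    return [("TableScan", tree["from"]), ("Select", tree["select"])]


def physical_plan(plan):
    """Query plan generator: wrap the operators into one map-only stage."""
    return [{"stage": "Stage-1", "type": "map", "operators": plan}]


def compile_query(query, schemas=SCHEMAS):
    return physical_plan(logical_plan(analyze(parse(query), schemas)))


print(compile_query("SELECT * FROM emp"))
```

Each function consumes the output of the previous one, which mirrors the order of the stages listed above. An unknown column is rejected during semantic analysis, before any plan is produced.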