Apache Hive Query Optimization

As we know Apache Hive is SQL-like and data-warehousing infrastructure on top of Apache Hadoop. Apache Hive leverage the Hadoop features of fault tolerance, scale-out and provide better performance on commodity hardware. Users can use Apache Hive to perform data analysis, run ad-hoc queries, and so on.

Apache Hive includes unique tools, which can be used to perform data queries and analysis effectively and to make full use of these tools users need to follow best practices to implement Apache Hive.

The following are some of the important points that are used to optimize Apache Hive’s performance.

1. Partition Tables

Apache Hive Partition tables are used to improve the performance of queries. It allows the user to store data in separate subdirectories under table location so when a user submits the query against the partition key, the performance is improved because Hive fetches the specific rows despite all row scans but for the user, it is a very challenging task to choose a partition key because the partition should be a low cardinal attribute.

2. De-normalizing data:

Normalization is a process used to model the table’s data using certain rules to reduce data redundancy and improve data integrity. In the real term, if we normalize the data set that means we are joining multiple tables to create a relation and fetch the data but from performance prospective joins are expensive and difficult operations to perform, and one of the major reasons for performance issues. So we should avoid highly normalized table structures to maintain good performance.

3. Map/Reduce Output Compress:

If compression is used then it will reduce the intermediate data volume and due to this internal data transfer between mappers and reducers is reduced over a network. We can apply compression on mapper and reducer output individually. We should note that gzip-compressed files are not splittable which means this should be applied with caution. Ideally, the compressed file size should not be larger than a few hundred MB otherwise it will create an imbalanced job.

We can set mapper output compression using “set mapred.compress.map.output to true”.

We can set job output compression using “set mapred.compress.map.output to true”.

4. Bucketing:

By using Bucketing the performances of queries are improved if the Bucket key and join keys are similar. Bucketing distributes the data in different buckets based on the hash results of bucket key also I/O gets reduced if the query is using join process on the same keys(column). Before writing data to the bucketed table it is important to set a bucketing flag (SET hive.enforce.bucketing=true ;) and for best join performance we can set (SET hive.optimize.bucketmapjoin=true) so that it will hints Hive to do bucket level join during the map stage join.

5. Input Format Selection:

Choosing the correct input file in Apache Hive is critical for performance. If we take JSON or text file for input format then it is not a good choice for a high volume production system because these types of readable format take lots of space and create overhead during processing. We can resolve such issues by choosing the correct input format such as columnar input formats (RCFile, ORC). There are some other binary format files out there such as Avro, sequence files, Thrift, and ProtoBuf, which will help in other issues.

6. Vectorization:

With the help of Vectorization, we can process the batch of rows together instead of processing one row at a time. We can enable the Vectorization by setting the configuration parameter hive.vectorized.execution.enabled=true.

7. Tez-Execution Engine:

Using Tez as an execution engine, performance will be improved for the Apache Hive query. TEZ provides an expressive-dataflow-definition API using that we can describe the Direct Acyclic Graph (DAG) of computation that we want to run.

8. Indexing:

An index is used to improve the performance drastically. When we define an index on a table a separate index table gets created, which is used during query processing. Without an index, if a user is running query then it will perform all row scan which is a costly and time taking job and it takes a lot of system resources. But with index framework will check the index table and jump on specific data despite searching for all rows.