Top 50 Apache Hive Question and Answers
1. What is Apache Hive?
Apache Hive is Big data warehousing framework that is created on top of Hadoop and that is the reason Hive leverage the features of Hadoop such as fault tolerance, scale-out massively, and data storage on commodity hardware. Users can use Apache Hive to perform data analysis using Hive's HQL language which is similar to the SQL language.
2. What Apache Hive is NOT?
Apache Hive is developed to perform a data warehouse task that includes storing data from multiple sources. Apache Hive is not the best choice for online transaction processing. It does not use complex indexes like many RDBMS which can answer queries in seconds. Apache Hive's HQL query can take some time to execute maybe one minute or some hour.
3. What is the difference between Apache Hive and Apache Pig?
|Hive||Pig||Hive is used for Structured Data.||Pig is majorly supported for semi-structured data.||Hive requires a well-defined Schema.||In Pig schema is optional.||Hive is a declarative language that has a very similar syntax to SQL.||The nature of the Pig is a procedural language.||Hive is mainly used for reporting.||Pig is mainly used for programming.||Hive is usually used on the server-side of the Hadoop cluster.||Pig is usually used on the client-side of the Hadoop cluster.||Hive is more like SQL.||Pig is more like Verbose.|
4. What are the commonly used Apache Hive services?
The following are some of the commonly used services in Apache Hive.
- Command Line Interface (CLI)
- Hive Web Interface
- HiveServer (hiveserver)
5. What is Hcatalog in Apache Hive?
In Apache Hadoop, HCatalog is a storage layer that allows users to read and write their data on the Hadoop grid.
6. What is the default HDFS storage for a table in Apache Hive?
The default storage for an Apache Hive table is the following.
7. What is Apache Hive Partitioning?
Apache Hive Partitioning is a very important feature of the Hive in terms of performance. If a user has a partition table then the data will be divided into separate parts based on the partition column and stored on the storage system. Using Apache Hive partitioning the performance of queries is increased because only the selected data is fetched.
8. What is Apache Hive Bucketing?
Apache Hive bucketing is used to store users' data in a more manageable way. A bucket is a range of data in part that is determined by the hash value of one or more columns in a table. Often these columns are called clustered by or bucketing columns. We should use a bucket when we have to manage a large dataset that needs to be divided into the cluster to provide an optimum result.
9. Explain the difference between Apache Hive's Partition & Buckets?
|Partition||Bucket||Partition is a directory in Apache Hive.||A Bucket is a file in Apache Hive.||Partitioning helps in executing queries faster if we define a common range filtering.||Bucketing does not work by default.||Partitioning filter the data if it is used in the WHERE clause.||Bucketing arranges users' data into multiple files in the partition.||Based on the column unique value partition is created in Apache Hive.||For Bucketing, one can limit it to a specific number and the data can then be decomposed in those buckets.|
10. Explain the components of an Apache Hive query processor?
Apache Hive query processor is used to convert SQL into Map Reduce graph for further execution.
It has the below components
- Semantic Analyser
- Type Checking
- Logical Plan Generation
- Physical Plan Generation
- Execution Engine
- UDF’s and UDAF’s.
11. What is an ACID?
ACID stands for four features which are Atomicity, Consistency, Isolation, and Durability the Atomicity represents the operation that is a success or fails, Consistency represents the operation that is visible to its succeeding operation, Isolation represents an incomplete operation and the Durability represents the operation that is completed once will be stored even in case of system failure.
12. What is Apache Hive Metadata?
Apache Hive metastore is used by Hive to store the metadata information of all objects which are present in Hive. It is an important part of Apache Hive architecture. Apache Hive uses derby as its default database to store its metadata however we can use other RDBMS to store Hive's metadata such as MySQL.
13. What are the major components of Apache Hive Architecture?
The following are the major components of Apache Hive Architecture.
- Thrift Server and CLI, UI
- Execution Engine
14. What is Apache Hive Data Model?
Data in Apache Hive is organized in the below components.
- Managed Tables
- External Tables
Hive supports two types of tables.
15. What is Vectorization?
In Apache Hive. the Vectorization helps in processing a batch of rows despite processing one row at a time. It can be enabled by configuring the parameter hive.vectorized.execution.enabled=true.
16. What is the name of the default database in Apache Hive to store Metadata?
Apache Hive provides a Derby database as a default database to store metadata.
17. How many types of Join, Apache Hive provides?
Apache Hive provides the following list of joins.
- Inner Join
- Left Outer Join
- Right Outer Join
- Full Outer Join
18. What are the Apache Hive Managed(Internal) Tables?
The Internal table is created in Apache Hive without specifying the "external" keyword and saved at the hive.metastore.warehouse.dir path. If we drop an internal table then its data and metadata both are dropped.
19. What is Apache Hive External Tables?
The external tables are those tables that are created by specifying the "external" keywords. The management of the external table is done by external sources. In the external table, the data is stored on external devices such as a remote Hadoop file system or the cloud system. If we drop an external table then only metadata information is deleted from Apache Hive.
20. What is the difference between Internal and External Tables?
|Internal Table||External Table||In the case of the Internal table, the user's data is transferred to the “/user/hive/warehouse” directory.||In the case of the External table, the user's data does not transfer to the “/user/hive/warehouse” directory.||If we drop the Internal table then its data along with metadata also gets deleted.||If we drop the External table then its metadata only gets deleted but the data remains the same.||Internal table supports TRUNCATE operation.||The External table does not support the TRUNCATE operation.||Internal table supports ACID transactions.||The external table does not support ACID transactions.||Query Results Caching only works for Internal tables.||Query Results Caching does not work for Internal tables.|
21. What is the order of precedence for Hive configuration?
The following is the order of precedence to configure Apache Hive properties.
- SET command
- Command-line –hiveconf option
22. Why MapReduce job does not run when a user submits Select * from the statement?
The MapReduce job doesn’t start for the Select * command because this Hive property hive.fetch.task.conversion skips MapReduce function.
23. Why a new metastore_db is created when users run a Hive query?
When a user runs Hive SQL, the framework will check if metastore DB is existing or not, if it does not exist then it will be created. Parameter for metastore_db is set in the Hive-Site.xml configuration file.
24. What are three different modes in which Apache Hive can run?
The following are the three different modes in which the Apache Hive can work.
- Local mode
- Distributed mode
- Pseudodistributed mode
25. What are different ways to connect with Apache Hive when we run it as a server?
- Thrift Client: It provides an interface to connect with Apache Hive from a different programming language such as Java, PHP, C++, Python, and so on.
- ODBC Driver: The application which supports ODBC complaint can connect with Apache Hive.
- JDBC Driver: This interface helps to connect Apache Hive by translating the request from the application in SQL language and then it passes those SQL to Hive for execution.
26. What are the data types supported by Apache Hive?
Apache Hive supports the following data types.
27. How to run Apache Hive script from Hive terminal?
We can run the Hive script from the Hive terminal as mentioned below.
Hive> source /cloudduggu/hive/script/data_load.hql
28. How to access subdirectories recursively in Apache Hive?
We can access subdirectories recursively by setting below parameters.
hive> Set mapred.input.dir.recursive=true;
hive> Set hive.mapred.supports.subdirectories=true;
29. What is the use of .hiverc File?
The Hiverc file contains a list of commands which run when the CLI starts such as setting strict mode to true.
30. What is the “USE” command?
USE command is used to set the database on which all a user will run Hive queries.
31. What is the use of setting this parameter “Set Hive. mapred.mode = Strict”?
By setting this parameter we tell MapReduce to run in restrict mode and due to this feature queries on the partition column cannot run without a where clause. It prevents jobs from running for a long time.
32. Explain the command to check the Apache Hive Partition?
The following command will show the presence of Apache Hive Partition.
SHOW PARTITIONS tabname PARTITION(partitioned_column=’partvalue’);
33. What is Apache Hive View?
A view virtual representation of a base table. It is stored as the definition in the dictionary. When a user fire query on view then it goes on the base table and operates. The view is used to maintain the security of data.
34. How to create Apache Hive View?
A view can be created by the create view command as mentioned below.
SELECT * FROM cloudduggudb.dept_detail
35. What are data types available in Hive?
Apache Hive provides the below list of data types.
- Numeric Types
- Date/Time Types
- String Types
- Miscellaneous Types
- Complex Types
36. How to use (if Exists clause) in case of drop table in Apache Hive?
This clause will check if the table is present in Apache Hive, in case the table is not present then an error message is thrown by the Apache Hive.
37. How to check the index detail which is created on a table?
We can check index detail by using the below command.
SHOW INDEX ON table_name
38. How to provide query hints to stream a table in memory?
In Apache Hive the hints are used for fast execution of query because the user's data is streamed in memory. Use this parameter to use hints /*streamtable(table_name)*/.
39. Why Hive store its metadata on RDBMS and not on HDFS?
Apache Hive stores metadata information in metastore using RDBMS such as Derby, Mysql, and so on. The main reason to choose RDBMS to store metadata is to achieve low latency because HDFS takes time to process read and write operations.
40. How to change the default location of a managed table?
We can use the LOCATION 'HDFS path' parameter to change the default location.
41. What is the difference between local metastore and remote metastore?
In the Local metastore setup, the metastore services run in the same JVM process in which Hive also running its process and connect to the database.
In Remote metastore setup. The metastore services run on their own JVM and not in Hive service JVM. In this case, you can have more than one metastore server to provide more availability. Other processes can connect with metastore using thrift network APIs.
42. What is Static Partitioning?
Static partitioning is used when we load big files in Hive tables. It inserts input data files individually into a partition table. To use Static Partition we can set the property “hive.mapred.mode = strict” in the hive-site.xml configuration file.
43. What is Dynamic Partitioning?
A Dynamic Partitioning can be used when we load data from a non-Partition table. In dynamic partitioning values for partition, columns are known in the runtime (during loading of the data). We should use Dynamic Partitioning when we don’t know the values of Partition from the huge data set.
44. How many defaults Dynamic Partitions can be created by Mapper/Reducer?
By default 100 Partitions can be created by Mapper- Reducer tasks. It is set in the below parameter and its value can be changed.
SET hive.exec.max.dynamic.partitions.pernode = 'value'
45. How are rows distributed in buckets?
Apache Hive uses Hashing Formula to determine bucket numbers for a row.
Formula: hash_function (bucketing_column) modulo (numof_buckets).
The value of hash_function depends on the column data type. If column data type is integer then hash_function will be hash_function (int_column)= value of int_column.
46. Why we use index?.
An index is used to improve the access of columns or set of the column in Hive database because with the use of index system does not read all rows to find requested data.
47. Can we write multiline comments in Apache Hive?
No, we can’t write multiple comments in Apache Hive.
48. How can we rename a table?
We can rename a table using the below command.
ALTER TABLE tablename RENAME TO newtablename;
49. How to add comments in the Apache Hive table?
We can add comments in a table using the below command.
ALTER TABLE tabname SET TBLPROPERTIES ('comment' = new_comment);
50. How to alter Partition in Hive table?
We can alter Partition in a table using the below command.
ALTER TABLE emp_records ADD PARTITION (dt='2020-10-01', country='ind') location '/path/to/us/part201001' PARTITION (dt='2020-08-09', country='ind') location '/path/to/us/part200809';