Top 50 Apache Sqoop Questions and Answers
1. What is Apache Sqoop?
Apache Sqoop is a data transfer tool used to move bulk data between Hadoop and relational database management systems. Sqoop supports bidirectional transfer through its IMPORT and EXPORT tools: the IMPORT tool copies data from an RDBMS into Hadoop HDFS, and the EXPORT tool does the opposite, transferring data files from Hadoop HDFS back to the database.
2. What is the basic use of Apache Sqoop?
The basic use of Apache Sqoop is to transfer an organization's data from a database system to Hadoop HDFS, and back again from Hadoop HDFS to the database system.
3. What is the use of the Apache Sqoop Import tool?
The Apache Sqoop Import tool is an important tool of Sqoop that is used to import data from external databases into the Hadoop HDFS file system, in text, Avro, or SequenceFile format.
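As a minimal sketch, importing a single table could look like the following; the connection string, credentials, table name, and target directory (userdata, root, employee_data, /user/hadoop/employee_data) are illustrative placeholders, not values from this article.

```shell
# Import one table from MySQL into HDFS as delimited text.
# Host, database, table, and paths below are hypothetical.
sqoop import \
  --connect jdbc:mysql://localhost/userdata \
  --username root -P \
  --table employee_data \
  --target-dir /user/hadoop/employee_data \
  -m 1
```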
4. What is the use of the Apache Sqoop Export tool?
Apache Sqoop Export is another important data transfer tool of Sqoop that is used to send data files from Hadoop HDFS to external databases. In the case of an Export transfer, the target table must already exist in the database.
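A hedged sketch of an export; as noted above, the target table must already exist in the database, and all names and paths here are hypothetical.

```shell
# Export an HDFS directory into an existing MySQL table.
# The table, database, and paths below are illustrative.
sqoop export \
  --connect jdbc:mysql://localhost/userdata \
  --username root -P \
  --table employee_data \
  --export-dir /user/hadoop/employee_data \
  --input-fields-terminated-by ','
```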
5. What is the difference between Sqoop and Flume?
| Sqoop | Flume |
| --- | --- |
| Sqoop is used to import data from an RDBMS. | Flume is used to move bulk streaming data into HDFS. |
| Sqoop uses a connector to connect to a particular database. | Flume uses an agent to fetch data. |
| Sqoop is not an event-driven tool. | Flume is an event-driven tool. |
6. Can we control the number of mappers in the Sqoop command?
We can control the number of mappers using the (-m <num-mappers>) argument. Mapper tasks provide the degree of parallelism, and we can choose the number of mappers depending on the size of the table. For example, passing (-m 10) starts ten parallel sessions against the database, provided the database allows it.
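A sketch of the option in context; the connection details and table name are hypothetical placeholders.

```shell
# Run the import with 10 parallel map tasks (ten database sessions).
sqoop import \
  --connect jdbc:mysql://localhost/userdata \
  --username root -P \
  --table employee_data \
  -m 10
```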
7. Write a sqoop command to show all the databases present in the MySQL server.
The below command will show all the databases in the MySQL server.
$ sqoop list-databases --connect jdbc:mysql://localhost/
8. What is Apache Sqoop metastore?
In Apache Sqoop, the metastore is a central place to store job information. One user can create a job, and other users can access the same job and run it on the cluster. Users can connect to the metastore by configuring sqoop-site.xml or by using the --meta-connect argument.
9. How to login in MySQL database?
To log in to the MySQL database, we can use "mysql -u root -p", where root is the username and -p prompts for the password.
10. How to create a table in Mysql and how to insert the values into the table?
We can use the commands below to create a table in MySQL and insert values into it. Suppose we have to create an "employee_data" table and insert some rows; the column names and values here (id, name) are illustrative.

mysql> CREATE TABLE employee_data (id INT PRIMARY KEY, name VARCHAR(50));

mysql> INSERT INTO employee_data VALUES (1, 'John');
11. What is the difference between Sqoop and distcp in Hadoop?
DistCp is used to copy files between Hadoop clusters (HDFS to HDFS), whereas Sqoop is used for transferring data records between an RDBMS and the Hadoop ecosystem.
12. How many MapReduce jobs are submitted for each Apache Sqoop copy?
By default, 4 map tasks are submitted and no reduce tasks are scheduled.
13. Is Jdbc Driver sufficient To Connect Apache Sqoop to the Databases?
Sqoop needs both the JDBC driver and a connector to connect to a database. Depending on the flavor of the database (MySQL, Oracle, Teradata, etc.), we download the corresponding connector.
14. What is the name of the databases on which Apache Sqoop metastore runs on?
A shared HSQLDB database instance is launched on the current machine when we start Sqoop Metastore.
15. What is Apache Sqoop Job?
An Apache Sqoop job contains a set of commands that can be executed multiple times. A user can create an Apache Sqoop job and run it repeatedly to perform the same activity instead of writing the commands every time.
16. How to create Apache Sqoop Jobs?
We can create a Sqoop job using the below command.
$ sqoop job --create importjob -- import --connect jdbc:mysql://localhost/userdata
17. How to check already created jobs?
We can use the (--list) argument to get the list of already created jobs.
$ sqoop job --list
18. How to verify an Apache Sqoop Job?
We can verify a particular job and its details using the (--show) argument ("importjob" is the job name).
$ sqoop job --show importjob
19. How to launch an Apache Sqoop Job?
We can execute a Sqoop job using the (--exec) option. Let us see the below example in which we are executing the “importjob” Sqoop job.
$ sqoop job --exec importjob -- --username root -P
20. What is the difference between Target-dir and Warehouse-dir?
The --target-dir argument specifies the exact HDFS directory into which data is imported, whereas the --warehouse-dir argument specifies a parent directory under which Sqoop creates a subdirectory named after the imported table.
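A sketch contrasting the two arguments; all paths and connection details are hypothetical.

```shell
# --target-dir: imported data lands directly in /data/employees.
sqoop import --connect jdbc:mysql://localhost/userdata \
  --username root -P --table employee_data \
  --target-dir /data/employees

# --warehouse-dir: imported data lands in /warehouse/employee_data,
# a subdirectory named after the table, created automatically.
sqoop import --connect jdbc:mysql://localhost/userdata \
  --username root -P --table employee_data \
  --warehouse-dir /warehouse
```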
21. Can you import a subset of the table’s data?
We can import a subset of a table's rows using the --where argument of the Sqoop import command.
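For example, a hedged sketch where only one department's rows are imported; the column name dept and its value are illustrative.

```shell
# Import only the rows matching the WHERE predicate.
sqoop import --connect jdbc:mysql://localhost/userdata \
  --username root -P --table employee_data \
  --where "dept = 'sales'"
```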
22. What is the advantage of using the Password-file over –P option?
The --password-file option reads the password from a file, which keeps it out of the command line and shell history and is therefore well suited to Sqoop scripts, whereas the -P option prompts for the password interactively on the terminal.
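A sketch of the password-file approach; the file path is a hypothetical example.

```shell
# Store the password in a file readable only by its owner,
# then reference it instead of typing it interactively.
echo -n "your_password" > /home/user/.db.password
chmod 400 /home/user/.db.password

sqoop import --connect jdbc:mysql://localhost/userdata \
  --username root \
  --password-file file:///home/user/.db.password \
  --table employee_data
```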
23. What is the purpose of “import-mainframe”?
The import-mainframe tool imports all sequential datasets in a partitioned dataset (PDS) from a mainframe system to Hadoop HDFS; a PDS is analogous to a directory on open systems.
24. What is the use of "sqoop-codegen"?
The Apache Sqoop codegen tool is used to generate the Java classes that encapsulate and interpret imported records. If the generated source code is lost, it can be recreated with this tool.
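A minimal sketch of regenerating the record class for a table; database and table names are hypothetical.

```shell
# Regenerate the Java class that represents employee_data records.
sqoop codegen --connect jdbc:mysql://localhost/userdata \
  --username root -P --table employee_data
```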
25. How to import large objects such as BLOB and CLOB in Apache Sqoop?
Sqoop's direct import mode does not support BLOB and CLOB objects. We can use JDBC-based imports for these large objects, i.e. we run the import without the --direct argument.
26. What is the default database for Apache Sqoop?
The default database for Apache Sqoop is the MySQL database.
27. What is the way to execute free-form SQL queries to import rows?
Using the --query option together with a single map task (-m 1), we can execute a free-form SQL query to import rows. This option creates only one MapReduce task and imports the query results directly.
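A sketch of a free-form query import; note that the $CONDITIONS token is required in the query even with a single mapper, and the column and table names here are illustrative.

```shell
# Free-form query import with one mapper; single quotes prevent the
# shell from expanding $CONDITIONS, which Sqoop substitutes itself.
sqoop import --connect jdbc:mysql://localhost/userdata \
  --username root -P \
  --query 'SELECT id, name FROM employee_data WHERE $CONDITIONS' \
  --target-dir /data/employee_subset \
  -m 1
```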
28. What is the use of this (–compress-codec) parameter?
The --compression-codec parameter specifies the Hadoop compression codec used to compress imported data, which reduces its storage size in HDFS.
29. What is the use of Eval tool in Sqoop?
The Apache Sqoop eval tool runs SQL queries on a database and prints the results on the console. We can use this tool to check and verify the Import queries before executing them.
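For example, a quick preview of a table before running a full import; all names are hypothetical.

```shell
# Run a SQL query against the database and print the rows on the console.
sqoop eval --connect jdbc:mysql://localhost/userdata \
  --username root -P \
  --query "SELECT * FROM employee_data LIMIT 10"
```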
30. Does Sqoop use the maps to reduce function?
Apache Sqoop uses Hadoop's MapReduce framework (as map-only jobs) to fetch data from relational databases. A user can set the number of mappers when initiating the Sqoop import process.
31. What is the meaning of Input Split in Hadoop?
An Input Split divides the input files into chunks, and these chunks are then assigned to individual mappers for processing.
32. Explain the default file format of Apache Sqoop?
Apache Sqoop allows two data file formats for import operation.
- Delimited Text File Format: In Apache Sqoop this is the default file format for importing data. In this format, records are stored as text, with fields separated by a delimiter character.
- Sequence File Format: This is a binary file format in which records are stored in custom record-specific data types which are shown as Java classes. Sqoop automatically creates these data types and manifests them as java classes.
33. What are some basic Apache Sqoop commands and their usage?
The basic commands for Apache Sqoop are mentioned below.
- Codegen: It is used to generate code to interact with database records.
- Create-hive-table: It is used to Import a table definition into a hive.
- Eval: It is used to evaluate SQL statement and display the results.
- Export: It is used to export an HDFS directory into a database table.
- Help: It is used to list the available commands.
- Import: It is used to import a table from a database to HDFS.
- Import-all-tables: This command imports the table data from a relational database to Hadoop HDFS.
- List-databases: It is used to list available databases on a server.
- List-tables: It is used to list tables in a database.
- Version: This command will display the version detail.
34. How will you sync your data in HDFS if source data gets updated frequently?
We can perform Incremental load using the below options.
- Append: This option is used when a table is frequently updated with new rows whose row-id values keep increasing. In an incremental import with the append option, Sqoop checks a column (specified using --check-column) and imports only rows whose value in that column is greater than the last imported value.
- Lastmodified: This option is used when the source has a date/timestamp column recording when each row was last updated. Sqoop imports the rows whose timestamp in that column is newer than the value recorded at the last import.
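Both modes can be sketched as follows; the column names (id, updated_at), the last-value settings, and the connection details are hypothetical.

```shell
# Append mode: import only rows whose id exceeds the last imported value.
sqoop import --connect jdbc:mysql://localhost/userdata \
  --username root -P --table employee_data \
  --incremental append --check-column id --last-value 1000

# Lastmodified mode: import rows updated after the given timestamp,
# merging updates into the existing data on the id key.
sqoop import --connect jdbc:mysql://localhost/userdata \
  --username root -P --table employee_data \
  --incremental lastmodified --check-column updated_at \
  --last-value "2021-01-01 00:00:00" --merge-key id
```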
35. Can we use Apache Sqoop with Java programs?
We can include the Sqoop jar in the Java classpath and build the required parameters programmatically, just as for the CLI (command line interface). The Sqoop.runTool() method is then invoked from the Java code.
36. What is Apache Sqoop Driver?
Apache Sqoop driver is used to connect with the databases and perform the required data transfer operation. Apache Sqoop provides compatibility with various databases and to connect with those databases we can download the specific driver and install it.
37. What is Apache Sqoop Connectors?
Apache Sqoop provides the facility to connect with external databases with the help of connectors. Using connectors Sqoop can connect with those databases that don't support native JDBC. In some cases, a user needs to download the JDBC driver for connectors for proper functioning. You can obtain JDBC drivers from the client distribution for your external system's operating system or from the manufacturer of your external system.
38. What is Apache Sqoop Partitioner?
The partitioner generates the conditions that the extractor uses to transfer data, splitting on the partition column. The size of each partition is roughly:

(upper boundary - lower boundary) / (max partitions)
39. How to resolve this error “ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: Access denied for user”?
This issue can be resolved by passing the user's password on the terminal, as in the command below.
sqoop import --connect jdbc:mysql://localhost/DB-name --username username --password your_password --table tablename
40. How to fetch more than 1000 entries from the Sqoop import command?
By default, the fetch size is 1000, meaning Sqoop reads 1000 entries from the database per round trip. Using the "--fetch-size" parameter, we can change the number of entries fetched at a time.
The syntax for "--fetch-size" is mentioned below.
$ sqoop import <parameters> --fetch-size=<number of entries>
41. How to securely transfer data from RDBMS to HDFS using sqoop?
Sqoop stores metadata in a repository and also allows you to encrypt the data that is going to be added to the repository. This feature helps to transfer data securely.
42. How to exclude some tables from a database when importing a larger number of tables?
We can do so by using the (--exclude-tables) argument with the Sqoop import-all-tables command.
$ sqoop import-all-tables --connect jdbc:mysql://localhost/databasename --username username --password password_of_user --exclude-tables table1,table2
43. How to resolve this error “Exception in thread "main" java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.readerFor” ?
Replacing the Jackson jar in the Sqoop lib directory with a higher version (2.9 or above) resolves this issue.
44. How to resolve this error “Import failed: No primary key could be found for table”?
This issue can be resolved by creating a primary key on the table, by specifying a split column with --split-by, or by using a single mapper (-m 1).
45. How to resolve this error “Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver”?
To resolve this issue, download the MySQL connector, untar it, and then put the jar in the $SQOOP_HOME/lib path using the commands below.
$ tar -xvf mysql-connector-java-8.0.20.tar.gz
$ mv mysql-connector-java-8.0.20.jar /home/cloudduggu/sqoop/lib/
46. What is the default extension of the files generated from a sqoop import using the –compress parameter?
The default extension of the file is “.gz”.
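A sketch of an import with default gzip compression; paths and connection details are hypothetical.

```shell
# With --compress and no codec specified, the generated part files
# end in .gz (gzip is the default codec).
sqoop import --connect jdbc:mysql://localhost/userdata \
  --username root -P --table employee_data \
  --compress --target-dir /data/employees_gz
```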
47. Can we add a parameter while running a saved job?
We can add a parameter at run time using the (--exec) argument, placing the extra options after a "--" separator.
$ sqoop job --exec jobname -- --newparameter
48. What is the use of (--validate) argument in Apache Sqoop?
The --validate argument in Apache Sqoop is used to check the data copied by an Import or Export operation. It compares the row counts of the source and the destination systems.
There are 3 basic interfaces for validation.
- ValidationThreshold: Determines whether the difference between source and target is acceptable (Absolute, Percentage Tolerant, etc.). By default it uses AbsoluteValidationThreshold, which requires the source and target row counts to match exactly.
- ValidationFailureHandler: Responsible for handling failures (log an error/warning, abort, etc.). By default it uses LogOnFailureHandler, which writes a warning message to the logger.
- Validator: Drives the validation logic using the ValidationThreshold and, on failure, delegates failure handling to the ValidationFailureHandler.
49. What is the use of the Apache Sqoop-merge tool?
The Apache Sqoop merge tool combines two datasets, where entries in the newer dataset overwrite entries in the older one. In a typical incremental load running in lastmodified mode, multiple files are generated on Hadoop HDFS; the merge tool matches the two datasets on a key (such as the primary key) and places the new dataset's records over the old ones.
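A sketch of the merge tool; the directories, jar, and class name (EmployeeData) are hypothetical, with the jar and class assumed to come from a prior codegen run.

```shell
# Merge a newer incremental dataset over an older one on the id key.
sqoop merge \
  --new-data /data/employees_new \
  --onto /data/employees_old \
  --target-dir /data/employees_merged \
  --jar-file EmployeeData.jar \
  --class-name EmployeeData \
  --merge-key id
```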
50. What is HCatalog in Apache Sqoop?
HCatalog is a table and storage management service for Hadoop that enables users of different data processing tools, such as Pig, MapReduce, and Hive, to more easily read and write data on the grid. Sqoop can import into and export from HCatalog-managed tables.