Apache Hadoop Advantages and Disadvantages

Apache Hadoop was designed to store and process large data sets. It has many advantages: it is open source, cost-effective, and fault-tolerant, among others. It also has some disadvantages.

Apache Hadoop Advantages

The following are some of the Apache Hadoop advantages.

1. Open Source

Apache Hadoop is open-source software developed at the Apache Software Foundation. It is freely available: we can download it from the Apache Software Foundation website and start using it.

2. Data Sources

Apache Hadoop can store structured, semi-structured, and unstructured data generated by different sources, such as emails and social media, in formats such as log files, XML, and plain text.

3. Performance

Apache Hadoop is a distributed storage and processing system that handles large datasets in the range of terabytes to petabytes. It achieves high performance by dividing data into blocks and storing them across several nodes. When a user submits a job, Hadoop divides it into sub-tasks that run in parallel on the worker nodes holding those blocks.
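The split-and-parallelize idea can be illustrated with a small, self-contained Python sketch. It is not Hadoop code: the function names, the tiny block size, and the thread pool standing in for worker nodes are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Split a list of records into fixed-size blocks (the last may be shorter)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def count_words(block):
    """Sub-task: count the words in one block, independently of other blocks."""
    counts = Counter()
    for line in block:
        counts.update(line.split())
    return counts


def run_job(records, block_size=2, workers=3):
    """Split the input, run one sub-task per block in parallel, merge the results."""
    blocks = split_into_blocks(records, block_size)
    # The thread pool is a hypothetical stand-in for the cluster's worker nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, blocks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    lines = ["big data big cluster", "data node", "name node", "big job", "data block"]
    print(run_job(lines))
```

Because each sub-task touches only its own block, adding more workers lets more blocks be processed at the same time, which is the property the paragraph above describes.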

4. Scalable

Apache Hadoop scales horizontally: as the workload grows, nodes can be added to a running cluster on the fly.

5. High Availability

Apache Hadoop supports multiple standby NameNodes. If the active NameNode goes down, a standby takes over and the cluster continues functioning. This is how Hadoop achieves high availability.

6. Language Support

In addition to Java, Apache Hadoop supports languages such as Python, C, C++, and Perl, so programmers can write jobs in the language they prefer.
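One common route for non-Java code is Hadoop Streaming, which runs any executable that reads records from standard input and writes tab-separated key/value lines to standard output. The following is a minimal sketch of a word-count mapper in that style; the function names are hypothetical, and how the script is submitted depends on the cluster setup.

```python
import sys


def map_line(line):
    """Yield a (word, 1) pair for every word on one input line."""
    for word in line.split():
        yield word.lower(), 1


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read lines from stdin and emit 'key<TAB>value' lines on stdout."""
    for line in stdin:
        for word, count in map_line(line):
            stdout.write(f"{word}\t{count}\n")


if __name__ == "__main__":
    main()
```

The same script can be tried locally by piping a text file into it, since it depends only on standard input and output.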

7. Compatibility

Apache Hadoop is compatible with other fast-growing technologies such as Spark. Spark has its own processing engine, so it can use Hadoop as its data storage platform.

Apache Hadoop Disadvantages

The following are some of the disadvantages of Apache Hadoop.

1. Batch Processing

Apache Hadoop is a batch-processing engine. In batch mode, the data is already stored on the system before processing starts rather than arriving as a real-time stream, so Hadoop is not efficient at processing real-time data.

2. Processing Overhead

When we deal with terabytes or petabytes of data, reading that much data from disk and writing the results back to disk after processing becomes an overhead, because Hadoop MapReduce does not keep data in memory between processing steps.

3. Small File Overhead

Apache Hadoop is designed to store a small number of large files. It handles a large number of small files (below 100 MB) poorly, because HDFS stores data in blocks of 128 MB by default (256 MB is a common setting) and the NameNode must track every file and block. Many files smaller than a block create metadata overhead for the NameNode.
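A rough back-of-the-envelope calculation shows why small files hurt. The sketch below compares the NameNode metadata needed to store the same 1 TB as 1 MB files versus 1 GB files. The figure of 150 bytes per file or block object is an assumed rule of thumb, not a value from this article, and the function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128     # default HDFS block size
BYTES_PER_OBJECT = 150  # ASSUMPTION: rule-of-thumb heap cost per file or block object


def namenode_objects(num_files, file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Count metadata objects: one per file plus one per block of that file."""
    blocks_per_file = max(1, math.ceil(file_size_mb / block_size_mb))
    return num_files * (1 + blocks_per_file)


def namenode_heap_mb(num_files, file_size_mb):
    """Estimate NameNode heap, in MB, under the per-object assumption above."""
    return namenode_objects(num_files, file_size_mb) * BYTES_PER_OBJECT / (1024 * 1024)


if __name__ == "__main__":
    total_mb = 1024 * 1024  # 1 TB of data in both scenarios
    for file_size_mb in (1, 1024):
        num_files = total_mb // file_size_mb
        print(file_size_mb, "MB files:", namenode_objects(num_files, file_size_mb),
              "objects,", round(namenode_heap_mb(num_files, file_size_mb), 2), "MB heap")
```

Under these assumptions, 1 TB stored as 1 MB files needs about 300 MB of NameNode heap, while the same data stored as 1 GB files needs a little over 1 MB.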

4. Security Concern

Apache Hadoop uses Kerberos for authentication, but encryption at the storage and network layers is not enabled by default, which is a security concern.