Apache HBase Introduction

What is Apache HBase?

Apache HBase is an open-source, distributed, column-oriented, non-relational database management system modeled on Google's Bigtable. It is written in Java and is used for real-time processing and for random read and write operations on very large datasets. It runs on top of Hadoop and is straightforward to deploy there.

Apache HBase is not a relational database management system and therefore does not support Structured Query Language (SQL). HBase applications can be written in Java, much like Apache MapReduce applications, and HBase can also be accessed through other frameworks and interfaces such as Apache Avro, REST, and so on.

Apache HBase stores data in tables made up of rows and columns, much like a traditional database management system. Each table has a primary key, called the row key, that is used to access its rows. Apache HBase uses ZooKeeper for coordination across the cluster. For production environments, a dedicated ZooKeeper cluster integrated with the Apache HBase cluster is recommended.

Apache HBase History

Let us look at the year-by-year evolution of Apache HBase.

2006: In November, Google released a paper on Bigtable.
2007: In February, the HBase prototype was developed as a Hadoop contribution.
2007: In October, HBase was released with Hadoop 0.15.
2008: In January, HBase became a Hadoop subproject.
2008: In October, HBase 0.18.1 was released.
2009: In January, HBase 0.19.0 was released.
2009: In September, HBase 0.20.0 was released.
2010: In May, HBase became an Apache top-level project.
2010: In June, HBase 0.89.20100621, the first developer release, was published.
2011: In January, HBase 0.90.0 was released.
2011: In mid-2011, HBase 0.92.0 was released.

Characteristics of Apache HBase

The following are some of the important characteristics of Apache HBase.

1. Distributed

Apache HBase can run in two distributed modes: pseudo-distributed and fully distributed. Pseudo-distributed mode runs on a single node and is used for testing, whereas fully distributed mode runs across a cluster of nodes and is used in production environments.

2. Big Data Store

Apache HBase is designed to store very large amounts of data in tables with billions of rows and millions of columns. It runs on top of HDFS and provides low-latency, real-time reads and writes. It performs particularly well for read operations on very large tables.

3. Non-Relational

As noted above, Apache HBase is a non-relational database, so it does not follow the relational database model. In a relational database management system, data is stored in tables with a fixed layout of rows and columns and is accessed with SQL. Apache HBase instead uses a storage-and-query mechanism in which data is not stored in a fixed format. Its schema is flexible and can be extended as requirements change.

4. Flexible Data Model

Apache HBase provides a flexible data model for storing data in tables. A table has one or more column families. User data is stored in rows, and each row is a collection of key/value pairs. Within a table, each row is uniquely identified by its row key.
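The data model described above can be pictured with a small in-memory sketch. This is a toy model written in plain Python, not the HBase client API: the class name Table, its put and get methods, and the sample row keys are all hypothetical and exist only for illustration. It assumes that column families are declared when the table is created, while the columns inside a family may differ from row to row.

```python
from collections import defaultdict


class Table:
    """Toy in-memory model of an HBase-style table (not the HBase API)."""

    def __init__(self, column_families):
        # Assumption: column families are declared up front with the table.
        self.column_families = set(column_families)
        # row key -> {(family, qualifier): value}
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are free-form, so each row may carry different columns.
        self.rows[row_key][(family, qualifier)] = value

    def get(self, row_key):
        # Return a copy of the row's cells, or an empty dict if the row is absent.
        return dict(self.rows.get(row_key, {}))


users = Table(column_families=["info", "stats"])
users.put("user#001", "info", "name", "Ada")
users.put("user#001", "info", "email", "ada@example.com")
users.put("user#002", "info", "name", "Linus")
users.put("user#002", "stats", "logins", 42)
```

Here the two rows hold different sets of columns, yet both are retrieved the same way, by row key, for example users.get("user#001").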

5. Scalable

Apache HBase is horizontally scalable: the rows of a table are sharded across regions. A table can be stored in multiple regions, and when a region grows too large it is split into two regions at its middle row key, so that each new region holds roughly half of the data.
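The split at the middle row key can be sketched as follows. This is a simplified illustration, not HBase's actual split logic: it assumes a region is just a sorted list of row keys and uses the number of rows as a stand-in for data size. The function names split_region and locate_region are hypothetical.

```python
from bisect import bisect_right


def split_region(sorted_row_keys):
    """Split a region's sorted row keys into two halves at the middle key.

    Returns (split_key, lower_half, upper_half); the split key becomes
    the first key of the upper half.
    """
    mid = len(sorted_row_keys) // 2
    split_key = sorted_row_keys[mid]
    return split_key, sorted_row_keys[:mid], sorted_row_keys[mid:]


def locate_region(region_start_keys, row_key):
    """Return the index of the region whose key range contains row_key.

    region_start_keys must be sorted, and the first region starts at "".
    """
    return bisect_right(region_start_keys, row_key) - 1


keys = ["row-01", "row-02", "row-03", "row-04", "row-05", "row-06"]
split_key, lower, upper = split_region(keys)

# After the split, the table consists of two regions starting at "" and split_key.
start_keys = ["", split_key]
```

Because the regions cover contiguous, sorted key ranges, finding the region for a given row key is a binary search over the region start keys.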

Features of Apache HBase

The features of Apache HBase are listed below.

It supports linear and modular scalability.
It provides strictly consistent read and write operations.
It provides automatic and configurable sharding of tables.
It provides automatic failover support between RegionServers.
It provides convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
It provides an easy-to-use Java API for client access.
It provides a block cache and Bloom filters for real-time queries.
It provides query predicate push-down via server-side filters.
It provides an extensible JRuby-based (JIRB) shell.
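To give a feel for why Bloom filters help real-time queries, here is a minimal sketch of the general technique. It is not HBase's implementation: the class BloomFilter, its parameters num_bits and num_hashes, and the use of SHA-256 are assumptions chosen for readability. The idea is that a store can ask the filter whether a row key might be present and skip the read entirely when the answer is a definite no.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple (a real filter packs bits).
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was definitely never added;
        # True means it may have been added.
        return all(self.bits[pos] for pos in self._positions(key))


file_filter = BloomFilter()
for row_key in ["row-01", "row-02", "row-03"]:
    file_filter.add(row_key)
```

A lookup for a key that was added always returns True. A lookup for any other key usually returns False, which is what allows unnecessary reads to be skipped.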

Difference Between RDBMS and HBase

Let us see the major differences between relational databases and HBase.

RDBMS	HBase
It uses tables as databases.	It uses regions as databases.
File systems supported are FAT, NTFS, and EXT.	The file system supported is HDFS.
It uses commit logs to store logs.	It uses Write-Ahead Logs (WAL) to store logs.
The reference system used is a coordinate system.	The reference system used is ZooKeeper.
Uses the primary key.	Uses the row key.
Partitioning is supported.	Sharding is supported.
Use of rows, columns, and cells.	Use of rows, column families, columns, and cells.

Difference Between HDFS and HBase

Let us see the major differences between HDFS and HBase.

HDFS	HBase
HDFS provides a file system for distributed storage.	HBase provides tabular column-oriented data storage.
HDFS provides optimized storage for big files.	HBase provides optimization for tabular data.
It uses flat files.	It uses key-value pairs of data.
The data model is not flexible.	It provides a flexible data model.
It uses a file system and processing framework.	It uses tabular storage with built-in Hadoop MapReduce support.
It is mostly optimized for write-once-read-many access.	It is optimized for many reads and many writes.

Difference Between Row-oriented and Column-oriented data stores

Let us see the major differences between row-oriented and column-oriented data stores.

Row-Oriented Data Stores	Column-Oriented Data Stores
Row-oriented data stores are efficient for the addition/modification of records.	Column-oriented data stores are efficient for reading data.
They read pages containing entire rows.	They read only required columns.
It is best suited to OLTP systems.	It is not well suited to OLTP systems.
It serializes the whole value in a row.	It serializes the whole value in a column.
It stores the rows in contiguous pages in memory.	It stores the columns in pages in memory.
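The difference between the two layouts can be made concrete with a small sketch. It is an illustrative model only: the sample records, the field names, and the helper functions are hypothetical, and "values scanned" is a rough stand-in for the amount of data read from pages. It assumes that a row-oriented scan has to read whole rows, while a column-oriented scan reads only the requested column.

```python
FIELDS = ("id", "city", "amount")
records = [(1, "Oslo", 10), (2, "Pune", 25), (3, "Lima", 7)]

# Row-oriented layout: all values of one record are stored together.
row_store = [value for record in records for value in record]

# Column-oriented layout: all values of one column are stored together.
column_store = {
    name: [record[i] for record in records]
    for i, name in enumerate(FIELDS)
}


def sum_from_row_store(store, field):
    """Sum one field; model the scan as reading every value of every row."""
    width = len(FIELDS)
    offset = FIELDS.index(field)
    total = sum(store[offset::width])
    values_scanned = len(store)
    return total, values_scanned


def sum_from_column_store(store, field):
    """Sum one field; only that column's values are read."""
    column = store[field]
    return sum(column), len(column)


row_result = sum_from_row_store(row_store, "amount")           # (42, 9)
column_result = sum_from_column_store(column_store, "amount")  # (42, 3)
```

Both layouts give the same total, but the column layout reads a third of the values in this example, which is why aggregation over a single column favors column-oriented stores.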

Pros and Cons of Column Oriented Databases

Let us see the pros and cons of column-oriented databases.

Pros

It has built-in support for efficient data compression.
It supports fast data retrieval.
Administration and configuration are simplified in column-oriented databases.
It performs well on aggregation queries (such as COUNT, SUM, AVG, MIN, and MAX).
It partitions data efficiently, because an automatic sharding mechanism splits larger regions into smaller ones.

Cons

JOIN queries and queries over data from multiple tables are not optimized.
Frequent deletes and updates force it to create splits, which reduces storage efficiency.
Designing partitions and indexes is very difficult because of its non-relational nature.

Use Cases of Apache HBase

Let us see some use cases of Apache HBase.

Monitoring systems.
Tracking user actions.
Audit logging systems.
Real-time analytics.
Message-centered systems (Twitter-like messages and statuses).
Content management systems serving content out of HBase.
Canonical use cases such as storing web pages during a web crawl.