Apache HBase stores large volumes of data in tables whose rows are sorted lexicographically by row key. A table is divided into regions, and the regions are distributed across RegionServers. When a table is created in Apache HBase, it starts with a single region by default.
A region is the basic unit of horizontal scalability in Apache HBase. Each region is defined by a start key and an end key, and the rows within that range are stored in sorted, contiguous order. HBase never serves the same row key from more than one region, which is how it provides strong consistency. Spreading regions across multiple RegionServers distributes the load, and regions can be moved between RegionServers for load balancing and failover as required. As the data grows, regions are split either manually or automatically.
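Because every region owns one contiguous key range, finding the region for a row key comes down to a search over sorted start keys. The following is a minimal sketch of that idea, not HBase client code; the boundaries, the region names, and the `find_region` helper are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical region boundaries, sorted lexicographically.
# Each region covers [start_key, next_start_key); the empty start key
# means "from the beginning of the table", and the last region is open-ended.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]
REGION_NAMES = ["region-1", "region-2", "region-3", "region-4"]


def find_region(row_key: bytes) -> str:
    """Return the name of the single region whose key range contains row_key."""
    # bisect_right counts the start keys that are <= row_key; the owning
    # region is the last one whose start key is <= row_key.
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[index]
```

Under these assumed boundaries, `find_region(b"apple")` resolves to `region-1` and `find_region(b"g")` to `region-2`, since a start key is inclusive. Each key maps to exactly one region, which is the property the consistency guarantee relies on.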
Number of Regions
The recommendation is to run a small number (20-200) of medium-to-large (5-20 GB) regions per RegionServer. About 100 regions per RegionServer is a typical target.
The following factors affect the choice of region count.
- The main limiting factor is available heap space. About 2 MB of MemStore is required per column family per region, so 100 regions with 3 column families each need about 600 MB of MemStore heap. Fewer regions mean a smaller MemStore heap requirement.
- A large number of regions produces a large number of small flushes. Each flush creates a StoreFile, so many StoreFiles are created, which in turn requires more compactions. The MemStores and StoreFile indexes also require more heap space.
- A high number of regions puts load on the Master, which has to assign and reassign regions to RegionServers and move regions around for load balancing.
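The heap arithmetic in the first point can be checked with a few lines of code. This is only a back-of-the-envelope sketch: the 2 MB figure comes from the text above, and the function name `memstore_heap_mb` is a hypothetical helper, not an HBase API.

```python
# About 2 MB of MemStore per column family per region (figure from the text).
MEMSTORE_MB_PER_FAMILY = 2


def memstore_heap_mb(regions: int, column_families: int,
                     mb_per_family: int = MEMSTORE_MB_PER_FAMILY) -> int:
    """Estimate the baseline MemStore heap (in MB) needed on one RegionServer."""
    return regions * column_families * mb_per_family


# 100 regions with 3 column families each: 100 * 3 * 2 = 600 MB
print(memstore_heap_mb(100, 3))
```

Halving the region count to 50 halves the estimate to 300 MB, which is why fewer, larger regions reduce heap pressure.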
In Apache HBase, the Master assigns regions as follows.
- The Master starts the assignment manager.
- The assignment manager checks the existing assignments in hbase:meta.
- If the RegionServer named in an existing assignment is still online, the assignment is kept.
- If that RegionServer is not online, the load balancer is invoked to assign the region to another RegionServer.
- The hbase:meta metadata is updated with the new assignment.
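The assignment steps above can be sketched as a small function. This is a simplified model and not the Master's implementation: the dictionary standing in for hbase:meta, the `assign_regions` name, and the "fewest regions wins" rule used in place of a real load balancer are all assumptions made for illustration.

```python
def assign_regions(meta: dict, online_servers: list) -> dict:
    """Return an updated copy of meta after one assignment pass.

    meta maps region name -> RegionServer name (a stand-in for hbase:meta).
    """
    if not online_servers:
        raise ValueError("no RegionServers are online")

    # Count regions per online server so the toy balancer can pick the least loaded.
    load = {server: 0 for server in online_servers}
    for server in meta.values():
        if server in load:
            load[server] += 1

    updated = dict(meta)
    for region, server in meta.items():
        if server in load:
            continue  # RegionServer is online: keep the existing assignment
        # Toy "load balancer" (assumption): choose the online server with the fewest regions.
        target = min(online_servers, key=lambda s: load[s])
        load[target] += 1
        updated[region] = target  # record the new assignment in the meta copy
    return updated
```

For example, with `meta = {"r1": "rs1", "r2": "rs2", "r3": "rs2"}` and only `rs1` and `rs3` online, `r1` stays on `rs1` while `r2` and `r3` are reassigned to online servers.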
When a RegionServer fails, the regions it hosts become unavailable immediately, because each region is served by only one RegionServer at a time. ZooKeeper detects the RegionServer failure, and the Master initiates failover by reassigning the affected regions to other RegionServers.
Region locality measures how close a region's data is to the RegionServer that serves it. It is achieved through HDFS block replication across the cluster.
Replica placement follows the default HDFS policy:
- The first replica is written to the local node.
- The second replica is written to a random node on another rack.
- The third replica is written to a different node on the same rack as the second replica.
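The three placement rules can be illustrated with a short sketch. It is not HDFS code: the `place_replicas` function and the topology dictionary are hypothetical, and the sketch assumes at least two racks with at least two nodes on every remote rack.

```python
import random


def place_replicas(local_node, topology, rng=random):
    """Pick three nodes for a block, following the three rules listed above.

    topology maps rack name -> list of node names (hypothetical layout).
    """
    # Find the rack that contains the writer's node.
    local_rack = next(rack for rack, nodes in topology.items() if local_node in nodes)

    # Replica 1: the local node.
    first = local_node

    # Replica 2: a random node on a different rack.
    remote_rack = rng.choice([rack for rack in topology if rack != local_rack])
    second = rng.choice(topology[remote_rack])

    # Replica 3: another node on the same rack as the second replica.
    third = rng.choice([node for node in topology[remote_rack] if node != second])

    return [first, second, third]
```

Because a RegionServer writes through the DataNode on its own host, the first replica of each new block lands locally, which is what gives a region its locality.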
Benefits of Regions
Regions provide a distributed datastore, partitioning, auto-sharding and scalability, and region splitting.
Each is described below.
1. Distributed Datastore
Using multiple regions per table follows directly from HBase's design as a distributed data store. Distributing a large table's regions across a cluster of nodes makes high availability achievable.
2. Partitioning
A table's data is stored in multiple regions, and the regions partition that data by row key range. Reading the table therefore means reading from different regions. Because those regions can be served by different RegionServers, having multiple regions allows data to be delivered quickly.
3. Auto Sharding and Scalability
Auto-sharding splits a region into two roughly equal halves when the region grows too large. The region is the basic unit of horizontal scalability in HBase: a table's rows are sharded across its regions.
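A toy model can make the "two roughly equal halves" idea concrete. The sketch below is an assumption-laden simplification, not HBase's split logic: it measures a region by the total byte size of its rows, uses a hypothetical `split_region` helper, and cuts at the row where the running size reaches half of the total.

```python
def split_region(rows, max_bytes):
    """Split a region into two daughters once it exceeds max_bytes.

    rows is a list of (row_key, size_in_bytes) tuples sorted by row_key.
    Returns [rows] when no split is needed, otherwise [left, right].
    """
    total = sum(size for _, size in rows)
    if total <= max_bytes or len(rows) < 2:
        return [rows]

    # Walk the sorted rows until roughly half of the data is on the left.
    running = 0
    for index, (_, size) in enumerate(rows):
        running += size
        if running >= total / 2:
            break

    # Keep at least one row on each side of the split point.
    split_at = min(index + 1, len(rows) - 1)
    return [rows[:split_at], rows[split_at:]]
```

With four rows of 10 bytes each and a 25-byte threshold, the region is split into two daughters of two rows each; with a 100-byte threshold it is left intact. Both daughters stay sorted and contiguous, so each still covers a single key range.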
4. Region Splitting
A region is split when it exceeds a configured threshold. The split is handled by the RegionServer, which splits the region and takes the parent region offline. The two daughter regions are then added to hbase:meta, opened on the same RegionServer, and reported to the Master. Region splitting is automatic by default but can also be triggered manually. The split policy is configured with the hbase.regionserver.region.split.policy property.