Horizontal Scalability with HBase
In my last blog, “HBase Performance Tuners”, I answered the question ”Can Hbase be used as a viable solution for interactive aggregation of massive data sets?”
In this blog, I ask “Can we ensure Data Locality in HBase to achieve consistent latency?”Data locality, here, refers to the ability to move the computation close to where the data is.
Data Locality has become an important feature in Multi-Site Big Data Organizations-such as Ericsson, that maintains data centers at multiple sites. They have to make certain that the time involved in the request-response from any site is minimal and at the same time, ensuring data availability across the sites. This actually leads us to solve an interesting problem: how do we make sure that the data is present at the site, when there are an overwhelming number of data requests?
The short answer is: with Big Data techniques.
THE CORE OF THE PROBLEM:
In the telecom world, the system charges customers based on the calls he or she makes. When a customer makes a call or downloads data, the system checks the customer’s account information, which resides on a node called a Service Data Point, SDP. After each second, or minute, depending on the scheme, the customer balance has to be calculated to check if the call is allowed to continue. After the call is completed, the particular call information is returned to the SDP. The SDP currently uses an in-memory relational database.
In a Simple real-word scenario, the above mentioned example resembles a typical search problem where you search for one record out of millions of records as shown in the below figure.
Figure 1: a generic illustration of the Data Request Process.
Our challenge is that this is a very large set of data. We are always challenged to find ways to speed up this process, satisfying increased customer usage and demands for faster and better service.
IS HBASE THE ONLY OPTION?
What better option to handle huge customer information than HBase? Yes, we chose HBase to handle Big Data, as HBase is the universally accepted solution for handling customer information.
Everything appeared to be fine as of now. We got a read/write latency of less than 5ms for more than 70% Hit Ratio, as explained in detail, in my lab setup below.
As we moved further into our research, we wondered “Is there any option other than HBase?” Our requirement was to retrieve the user information in less than 5 ms for 100% of the requests. HBase is known to retrieve data faster, but to achieve 100% of requests, even HBase, with our normal configuration settings, struggled.
I was pretty sure that there was still something I was clearly missing to achieve my quality goal.
PROBLEMS WITH HBASE
When we use HBase as a data storage layer, the main problem is that there is no control over the data placement. There are various reasons why Data Locality is disturbed in a system, for example the RegionServer going down, any configuration changes like replication changes, etc… in HDFS or the whole Cluster Setup of HBase is restarted.
THE LAB SET-UP:
Then, I had a lightbulb moment.
I tried to find out exactly where the data resides, in both the physical locations and the sites where the request is launched.
Now, maybe it’s time for me to introduce you to my lab setup:
I had a Cluster Setup with 10 slave machines - 7 connected to one network router and 3 connected to a separate network router. The two network routers were connected through a MPLS (Multiprotocol Label Switching) router. We later realized that this made the setup similar to having servers in two different sites.
We then began our experiment, conducting it for four months, in 24-hour intervals. We inserted 100 million records of customer data into HBase spread across the 10 slaves. Since there is no control over data placement, some data pertaining to zone1 got placed in zone2's server and vice versa. So when a customer, whose account balance resides in zone2, makes a call from zone1, each account balance request has to be fetched from zone2's server. This resulted in higher latency.
SOLUTION 1: KEY-BASED BALANCER
We came up with a solution using HBase Balancer to solve our problem. We included their zone information as a part of the key. We extended the HBase balancer to assign the regions to make sure all of the keys corresponding to a zone resides in the same RegionServers in that zone, closer to the request zone. Here, we improved our Hit Ratio of Requests from 70% to around 90%. We believe the remaining 10% is due to the requests that came up during internal HBase operations-like compactions. We still have to research more on that particular theory.
But, our jubilation was short-lived since new big and more interesting problems arose; we almost diluted the default balancer. This ended up receiving a lot of data in one particular zone which:
- Defeated the basic requirements of distributed processing. However, don’t confuse geographic redundancy here.
- Since we made zone information as the initial part of the HBase RowKey, we hit the conventional HOT Region problem, where all the writes or reads end up at the exact same place.
SOLUTION 2: REQUEST LOAD BASED BALANCER
We then decided to remove the zone information from the key and do the balancing based on the number of requests a region gets from a zone. Finally, the solution we chose was this: when the number of request hits from a location crosses a certain threshold value, and then the regions that are pertaining to that request area are moved to the region’s servers near the location, the request region was in close proximity. So, the next time a customer from zone1 makes a request, it will be fetched from zone1 rather than zone2, since data pertaining to zone1 from zone2's server is moved to zone1. The threshold value we set was either 50% or the largest percentage among request hits.
This looks like an implementation of a traditional load balancing solution, but there was a problem with this approach. We observed that too often, there were region movements and it kept HBase continuously busy.
SOLUTION 3: BALANCER BASED ON MACHINE LEARNING
There was lot of regions shifting due to the fact that this was happening between the RegionServers within a zone itself. And also, for some regions, the higher request hit ratio was shifting between various RegionServers on a regular basis. So, we designed a machine learning module, through which we are able to manage this shift. Our module considers the last number values of the request pattern and we built a regression model on top of it. We found that the last 6 values were better suited for our test. The regression model then predicts the next 3 values of requests count and then, if all three values are above the maximum threshold value among the zones, the region is shifted. This kind of machine learning balancer helped us to achieve the balance and speed we desired.
So In Summary, Data Locality looks like a grain of sand spun around after a sand storm. With machine-learning techniques, I was able to ensure my SLA of 5ms for all requests, but I also realized that I needed to look into other techniques when the request pattern changed.
I am signing off now, but in my next blog post, I will explain how data duplication helped solve the pesky problem of when equal amount of requests, from different zones, end up in the same region. So, stay tuned!
Hari Kumar, supported by his Interns Amaladhithyan K, Kavitha S and Arun Kumar V in his research