HBase: Performance Tuners

In this world of Big Data, scaling up is not a mystery to stare at but a problem to solve.

Running Race
#BigData #HBase


These days everyone talks about Hadoop as a solution to our Big Data problem. If I have massive amounts of data that I need to process, the standard advice people give is: “Use Hadoop, it will solve your problem”. The question is obvious: is it really the silver bullet that many claim it to be? To spare you the pain of reading the entire blog, the answer is: yes, the Hadoop stack is an affordable and scalable data processing solution but not out-of-the-box.

In a series of posts I will share my findings on working with HBase – the Hadoop data base. The question that I tried to answer was: “Can HBase be used as a viable solution for interactive aggregation of massive data sets?” Put another way: is it possible to use HBase for OLAP? By carefully tuning HBase and using some of its more advanced features, I managed to go from processing 5,000 rows per second to 750,000 rows per second.

Let me explain what I had to do to reach those numbers.


First and foremost, you need to understand your data and what you want to do with it. In my case, I had to:

  • work with data records that contained 91 attributes
  • be able to scan through at least 100 Million records
  • to evaluate queries which had a random access pattern over all the 91 attributes

My query pattern turned out to be specifically troublesome since it made it difficult to group my attributes into column families. To solve this problem, I had only a couple of meaningful choices: My first choice was to keep all attributes as separate columns in a single column family. I will explain the second choice in my next blog.

I had a cluster setup of one master and 7 slaves. All are X86 machines equipped with SETA disks. I didn’t use a private IP to connect the cluster. The setup was such that the aggregation program was running on the same machine as the HMaster. The program scanned through the entire table data and aggregated a set of columns based on other values of another set of columns. For those of you that are familiar with SQL, the pattern was a typical SELECT sum(A), WHERE B = C GROUP BY(D).


To establish a baseline, I started by running my tests using the plain HBase API without any parameter tuning. This gave me a performance of roughly 5,000 records per second. Even though this was a vanilla setup, the results puzzled me since other blogs (like this one and this one) indicate a much higher throughput—around 100,000 records/sec (today I have achieved more than that). Because of this, I started to investigate the bottlenecks and features of HBase with the hope of reaching the performance figures I was aiming for. I quickly learnt the following:

Lesson 1: By default, HBase is tuned for random access.

Since my requirement was to do a full table scan, I learned quite a bit about using HBase tuning parameters in different situations. I started analyzing the HBase architecture and found a lot of interesting parameters that I could tune to achieve my goal. Let’s start with block cache.


The block cache acts as a cache to keep data in memory. Blocks themselves are stored on disks. In my experiment, I started with an empty block cache and ran my program twice. Both times the result was the same. The cache became full and it ran into problems with the garbage collector. I increased the block size from 1,048,576 to 67,108,864. This was done to check the increase in block cache size when new data is retrieved from that disk as well as to check the garbage collector behavior. Again the problem was the same; the garbage collector ended up being a bottleneck.

CPU Capture

The problem lies within the CPU utilization and the garbage collector call time. If the garbage collector kicks when the CPU is running hot, we will run short of resources. Ultimately this will lead to shutdown of your region server (which has a serious impact on performance), so, it’s important to maintain a “cool environment” around these two parameters.

Cluster Network

After a long thought process, I decided to disable the block cache, thus avoiding a lot of calls to the garbage collector. Obviously, this has the side effect of reading all data from the disc, so I needed to do a couple of more things to get me where I wanted.


The first thing you should play with is the setCaching parameter which will determine how many rows are sent from a region server to a client at the same time. In my use-case, the ideal scenario occurred when I kept it at 500 or 10,000. Also, try to change the setBatch parameter since it will give you better control over your network bandwidth. A final suggestion is to increase the rpchandler. By tuning these three settings, I managed to go from processing 5,000 to 20,000 records per second. This was a 400% improvement but still far away from where I aimed.


Even after tuning the HBase parameters, there was still something I was clearly missing to achieve my performance goal. At this point, I decided to spend some time on checking out how HBase scans work in the background. This is when I discovered the second valuable lesson.

Lesson 2: The scan in HBase executes each region in a serial manner.

This triggered me to design my own aggregation application, such that it starts reading rows in parallel from each region using the HBase API on regions rather than using the default scan. I saw a huge increase in performance, but ended up with a new problem with the network bandwidth because all the data was being transferred from all the servers to a single one simultaneously. Luckily, HBase offers a very neat solution to this problem. Let’s look at it.

Parallel Scan


The problem with my initial solution was that I transferred all data to my application whether it was needed or not. What I needed was a way of doing partial processing of data on the slaves so that only relevant information was sent to my aggregation application. This will reduce the data transferred to the master ten-fold. The HBase coprocessor feature is perfect for this. It allowed me to execute user defined code on each region server in parallel, thus solving my bandwidth problem. The co-processor will be instrumental to anyone that needs to aggregate large amounts of rows in HBase without having to resort to map-reduce.


So in summary, by solving these three bottlenecks—CPU, bandwidth and memory—along with incorporating high-end HBase features like parallel scans and coprocessors, I was able to improve the performance to 500,000 records per second.

At the beginning of the blog I promised performance figures in the 750,000 range, but these final performance improvements is something I leave for my next post. In that post, I will also share my thoughts on Horizontal Scalability with HBase, so stayed tuned.

-- N. Hari Kumar, Ericsson Research

The Ericsson blog

In a world that is increasingly complex, we are on a quest for easy. At the Ericsson blog, we provide insight, news and opinion to help make complex ideas on technology, business and innovation simple. If you want to hear from us directly, please head over to our contact page.

Contact us