NoSQL for Telco
Can No-SQL technologies hold for the specific requirements that apply to the Telco domain?
During the 2013 Ericsson Data Research day, an internal company event, I had the opportunity to talk about NoSQL and some of our NoSQL/BigData experiments. Using my experience from that talk, I’ll share those same results and conclusions with you.
No-SQL encompasses a wide variety of database technologies that are an "internet" solution for handling the rise in the volume of data stored, the frequency in which this data is accessed and performance needs. We have done research experiments to find out if those approaches hold for the specific requirements that apply to the Telco domain.
Experiment 1) Looking at replacing a subscription database such as the Home Location Register (HLR) database with HBase. It is technically feasible to use HBase as an HLR database replacement, but it is a poor fit.
Experiment 2) Stressing an HBase/Hadoop File System (HDFS) cluster by persisting a stream of data events as they would be generated from a mobile network.
HBase and Hadoop will not perform well out of the box, both requiring a lot of tuning. For superior performance, parts of the configuration can be specific to the use case. Even though we did all of this in our setup, the data batches recurring at constant intervals had the undesired effect of triggering a compaction storm. This can be addressed by manually managing compaction and/or improvements in the HBase software.
Are you interested in how I came to these conclusions? Please continue reading as I discuss the "why" and “what is” NoSQL and explain our experiments and results. Hopefully, this post will bring forth how scale and agility challenges faced by modern internet applications have disrupted the traditional relational database technologies.
Let's begin with a quick introduction to Not-Only SQL (NoSQL) data stores as well as some of the industry trends driving their adoption. Not in any particular order, the main contributors are the increase in data sets, schemaless store, decentralization, parallelization, and cloud.
But the most critical change is the inversion of the cost/speed relation that has existed for the last 30 years in the processor race. For 30 years, each generation was faster AND more cost-efficient i.e. less power-hungry. This trend is now broken, and the newer generations cannot be sped up without incurring huge increases in costs/power consumption.
This has lead the industry to design multiple cores per processor chip (see Why Intel is designing multi-core processors by Geoff Lowney) as multiple cores deliver more performance per watt. The result is that scaling up, as in replacing an aging machine with a bigger/faster one, previously the norm, is not cost-efficient anymore. Scaling out, for example horizontally other multiple machines and multiple cores, becomes a necessity. Applying this principle to databases, the data becomes distributed over many machines (and cores) over the network (and a bus) and thus becomes subject to network partition.
In general, distributed data has its own challenges that are covered by the Consistency Availability Partition (CAP) theorem. CAP states that in a distributed system, only two out of the three (C, A, and P) guarantees can be satisfied at any given moment in time. The implications being that, while in general the system will be satisfying C, A, and P, there will be some extreme cases where it can not achieve all three and where as per CAP only two of these guarantees can be satisfied. (For more information, see the proof of the CAP theorem by Gilbert and Lynch). This theorem led to the introduction of a new form of consistency (C) known as eventual consistency: The notion that if a system relaxes the C guarantee, it can guarantee that at a later time, C will be restored.
The application of CAP splits databases (NoSQL and others) in three categories based on the two guarantees supported at any given time by the system:
- CA, Consistent-Available: Request will complete only if all nodes are consistent and available. No partition tolerance.
- CP, Consistent-Partition: Requests will complete at nodes that have quorum.
- AP, Available-Partition: Requests will complete at any node possibly violating consistency.
The above classification regroups most existing "traditional" Relational Data Base Management Systems (RDBMS) along the CA edge, such as systems not allowing for network partitioning, while most NoSQL are positioned either along the CP, or AP edges. NoSQL stores are, in general, partition tolerant-meant to be used for distributed systems-or, in other words, built for horizontal scalability-allowing network partition.
Without going into more details, the main characteristics of NoSQL stores can be summarized by the following:
- Horizontal scaling, partition tolerant (CP, AP), cluster friendly
- Internet size
- Solution oriented
- No one size fits all (this leads to heterogeneous data stores and polyglot persistence which is not covered in this post).
- Our experiments
After this quick NoSQL introduction, now let's peek at two NoSQL research activities. One is looking at replacing an Home Location Repository (HLR) database with HBase and another is stressing an HBase/Hadoop File System (HDFS) cluster by persisting a stream of data events.
Replacing an HLR database with HBase
Let's start by describing what is an HLR and what is HBase. First, an HLR is a central database that contains details of each mobile phone subscriber and every details of every SIM card issued by the mobile phone operator. Various unique identifiers (IMSI, MSISDN, etc.) act as primary keys to each HLR record. An HLR can be categorized as a document store where different keys refer to a "telco" document.
Second, HBase is an open-source, distributed, versioned, column-oriented NoSQL store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. paper. It is built on top of the Hadoop File System (HDFS).
To summarize our HBase study and experiments, we conclude that while it is technically feasible to use HBase as an HLR database replacement, it is a poor fit. While HBase scales linearly, is resilient, and highly-available, it is targeted at handling huge amounts of data-billions of rows, millions of columns-while a typical HLR only handles in the millions of rows.
In addition, the type of data stored in an HLR, as well as some of its inherent real time aspects-like low latency-would favor a NoSQL solution geared towards speed, supporting simple queries and with stronger Atomic Consistent Isolated Durability (ACID) guarantees. Note that HBase offers only strong consistency at the row level.
What we demonstrate is that HBase can shine on use cases where complex queries are ran over large data sets. The caveat being that getting good performance requires a lot of tuning. Many configuration and schema changes were needed to achieve good performance results.
Note that choosing the right computational approach such as picking one of Map/Reduce (MR), parallel scans, or co-processors for parallelism is also quite important. Computationally speaking, parallel scans seem to be best as operations are done within the HBase region server, holding the data thus guaranteeing data locality while MR accesses data remotely.
In case of node failures or network partitioning, HBase is resilient and will keep on serving clients. Under degraded situations, a slight read latency increase is observed while write latency remains constant. When failing nodes are brought back into the cluster, the system takes time to return to optimum. Depending on the amount of data per node this can take minutes to hours. The recovery can be sped up by triggering a major compaction. This approach can however have drawbacks as we will see later.
Persisting events from a large data stream
As part of our experiments, we attempted to stress a HBase/HDFS cluster using a large, almost, constant inflow of data events to persist. This situation is typical in Telco when, for example, the data store is used to collect logs or performance counters from many nodes in an operator's network. Typically in a network managing 10 millions users, the amount of performance counters to collect is in the range of 1000+K events per second. Using a relative size of 1KB per event, this is close to a 1 GB/s data stream. This stream comes uninterrupted with slight rate variance. Common deployment places an event aggregator in front of this data stream to filter, aggregate, and batch these events into smaller, more compact structures, thus reducing the stream to a more manageable rate of 10+ K events (x1KB) per seconds or 10MB/s.
So, we reproduced this use case and investigated its impact on our cluster in two different contexts: First by investigating persistence of the stream as raw unstructured files; second by inserting the data events from the stream as HBase records (1 event 1 record). For thiw, we ran our tests on two different Hadoop clusters:
The "BigData" cluster using two servers with a total of 48 cores (two six-core processors per machine with hyperthreading), 96 GB memory and 1 TB of disks using SSDs.
The "Storklustret" using various heterogeneous machines ranging from dual-core pentiums with 2GB of RAM to quad-core machines with 16GB of RAM for a total number of 34 cores, and 86 GB of RAM, and 12 TBs using only mechanical SATA disks (no SSD).
Let's start by saying that neither HBase nor Hadoop will perform well out of the box. Both require a lot of tuning and some of the configuration can be specific to the use case for superior performance. That consideration is despite using the Cloudera Hadoop Distribution (CDH) which simplifies a lot the configuration efforts via a clean UI. Before CDH or a similar distribution, setting up an HBase cluster is a time consuming task.
Our experiments showed that under our simulated data stream stress both HDFS and HBase scale linearly: The more machines and/or IO capacity in the cluster, the more it could handle (i.e. globally faster writes and/or reads).
Since HDFS replicates data, expect a lot of network traffic. They also showed that the bottlenecks vary depending on the cluster hardware. For example, when the load is spread over many machines, it is easier to avoid network bottlenecks. The two machines "BigData" cluster quickly experienced a network bottleneck since all the data was replicated from one host to the other at ~100MB/s the network link got saturated.
As a general rule, disks with large bandwidth (like SSDs) or nodes using many JBOD disks help avoiding IO bottlenecks. Some "Storklustret" nodes with a single mechanical disk performed very poorly in our tests as these machines also suffered from low RAM and thus little caching which is a problem for reads.
However, getting good performance as data grows is not as simple as adding more hardware plus tuning. In our setup, the data batches are recurring at constant interval of aggregated data had the undesired effect of triggering a compaction storm. This, in turn, pushed CPU, disk and network IO close to 100% on all machines in the cluster simultaneously reducing overall cluster performance to almost nil. To complete the picture, often to troubleshoot Hadoop/HBase, a simple detection of the bottleneck is not sufficient to find a solution, most of the time, it is necessary to take a look at the internals. This situation is slowly improving as the documentation and the code mature.
Let's investigate this HBase compaction storm. First HBase stores all its internal files into HDFS. As data increases, each HBase region may require multiple files. Too many files is not good for read performance as each operation then requires many files to be opened.
To solve this, HBase periodically gathers small files and rewrites them into bigger ones thus reducing the number of files (aka minor compaction). In addition, HBase periodically drops all deleted or expired cells. This triggers a rewrite of all data files into a single one for each Region Server (node). That second operation is called a major compaction and will "usually" improve performance. Major compaction results in high CPUs loads, and IO overloads that reduces write speed to (almost) nil on the affected node.
However when batches of new data are submitted regularly, as is our case, data keeps on increasing at an almost constant rate, and that data ends up split evenly across all regions. This leads to uniform data growth across the whole cluster. In other words, each node in the cluster is roughly hitting the same size of data and the same number of files to clean up at roughly the same time.
As a result, a major compaction operation would be triggered on every one of the nodes within a small time interval-a few minutes at most. This phenomenon is known as a compaction storm and would result in cluster wide degrade performance. In addition, as new batches of data keep on arriving while the files are re-written, an infinite compaction cycle would start from which the cluster does not recover (performance remains degraded infinitely).
In retrospect, this was a known HBase issue (see HBASE-7842, HBASE-7516, HBASE-7678, HBASE-7603, HBASE-7236) which is caused by the default compaction algorithms used by HBase. Rather than let HBase auto-split/compact data, it is recommended to manually split and compact in order to spread staggered those operations over the cluster in a node-by-node fashion. Note also that automatic splitting can be turned off by setting the configuration value hbase.hregion.max.filesize (a suggested setting is 100GB.) If no such precaution are taken, the cluster can be prone to compaction storms as the algorithm can decide to run major compaction on many or all regions at once.
This issue of compaction has led to many fixes in newer version of HBase (see HBASE-7842, HBASE-7516, HBASE-7678, HBASE-7603, HBASE-7236), where for example strip compaction is introduced (HBASE-7667) and/or tier-based compaction (HBASE-7055).
As I reflect after conducting these experiments, database technologies are now undergoing a rapid evolution with new approaches being actively explored after decades of relative stability. The NoSQL data stores are an interesting breed delivering on promises at the cost of added complexity and, in general, weak consistency guarantees.
Large web applications dished the notion of "one size fits all" as promoted by the traditional RDBMS vendors, and are now scaling out using very large and distributed systems. This had had a large impact on database development technologies and architecture
In 2012, Brewer wrote that the CAP theorem has been widely misunderstood and that a distributed database can be designed to remain highly available over a broad range of failures without supporting perfect availability in the CAP sense. Google has gone further with their Spanner database, a globally distributed database providing ACID transactions: "We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions." Finally, very recently, MIT Prof. Michael Stonebraker predicted that "The traditional RDBMS Wisdom is (Almost Certainly) All Wrong".
As a result, a second generation of NoSQL is slowly emerging that is employing shared-nothing, distributed architecture with fault tolerance and scalability but rather than using designs with weak consistency guarantee, this new generation is aggressively exploring the strong-consistency region.
Watch my the original Slideshare from Ericsson Data Research day.
Note: Interesting slide set on compaction by Cloudera.
Note that I recommend watching Martin Fowler's talk which was held as the introductory keynotes of the 2013 NoSQL Matters conference. It is a clear introduction to the NoSQL technologies, movement, and its various implementations (key-value store vs columnar store vs document store, etc.).
Note: Additional links to some of Michael Stonebraker's papers:
One Size Fits All: An Idea Whose Time Has Come and Gone. This paper makes the argument that the relational database cannot be extended ad infinitum, demonstrates how RDBMSs are inappropriate for several new applications, and argues that the DBMS market will fragment into a series of special-purpose engines, perhaps unified by a common front-end parser.
One Size Fits All: Part 2, Benchmarking Results. Benchmark results for relational vs. special-purpose databases in several applications. Interestingly and pragmatically, Stonebraker argues that most people won’t even consider a special-purpose database (largely due to inertia) unless it is at least 10x faster than relational for a given application. He then demonstrates several applications where you can see 10 – 100x gains in performance.