Apache Storm vs Spark Streaming
What when and how to choose a real-time processing framework in a telecom scenario. Identified issues, potential problems and proposed solutions.
In two previous blog posts - "Comparing Apache Storm and Trident" and "Real time processing frameworks" - I compared Apache Storm and Apache S4. In this post, I will present my comparison between Apache Storm and Spark Streaming.
Apache Storm is a stream processing framework, which can do micro-batching using Trident (an abstraction on Storm to perform stateful stream processing in batches).
Spark is a framework to perform batch processing. It can also do micro-batching using Spark Streaming (an abstraction on Spark to perform stateful stream processing).
I described the architecture of Apache storm in my previous post. A detailed description of the architecture of Spark & Spark Streaming is available here.
One key difference between these two frameworks is that Spark performs Data-Parallel computations while Storm performs Task-Parallel computations. More similarities and differences are given in the table below.
In Spark Streaming, if a worker node fails, then the system can re-compute from the left over copy of input data. But, if the node where the network receiver runs is failing, then the data which is not yet replicated to other nodes might be lost. In short, only HDFS backed data source is safe.
In Apache Storm/Trident, if a worker fails, the nimbus assigns the worker’s tasks to other nodes in the system. All tuples sent to the failed node will be timed out and hence replayed automatically. In Storm as well, delivery guarantee depends on a safe data source.
Both Trident and Spark offer micro-batches that can be constrained by time. Functionality-wise, they're very alike, but implementation-wise, there are different semantics. So, if the question is: "Which framework should one choose for an application?", the answer is that it depends on the application's requirements. Some common scenarios and the corresponding choice of frameworks are given in the table below.
Spark Streaming in Telecom
In Telecom OSS & BSS Systems, there are cases when a continuous stream of massive amount of data needs to be processed in real time with very low latency. For example in a Network Operation Centre (NoC), detecting alarms and triggering fallback mechanisms have such strict demands in processing continuous streams of data at real time. In these cases, there are few problems that may arise when Spark streaming is put to use.
A serialized task might be very large due to a closure. This might occur especially when processing massive amounts of continuous streams of data with data structures like a hashmap.
For example, in a NoC, an engineer could program a use case to monitor network logs as given below:
hash_map = large_hash_map_of_network_logs()
rdd.map(lambda x: hash_map(x)) .count_by_value()
Current versions of spark streaming greater than 0.9.x detect such cases and warn the user. The ideal fix for this problem is to convert large objects to RDD’s.
Also, in cases where serialized tasks are very large, there could be further issues as the time it takes to write data between stages is large. Spark writes shuffle output to an OS buffer cache. If the task is large, as mentioned above, then it spends a lot of time writing output to the OS buffer cache. The solution for this problem is to allow several GB’s of buffer cache, especially when operating on large shuffles on large heaps.
In telecom charging and billing, many times there is a need to process a selected subset of data, for instance billing the monthly usage of a subscriber. In such cases, Spark streaming could generate a large number of empty tasks due to the usage of filters.
A billing system could use a date filter as given below:
rdd = source.textFile(“occ://mbb/usage/imsi-2015-usage-data”).map(lambda x: x.split(“\t”))
.filter(lambda parts: parts == “2015-05-31”) .filter(lambda parts: parts == “19:00”)
rdd.map(lambda parts: (parts, parts).reduceBy...
The ideal solution for this is to use repartitioning to shrink RDD number of partitions after filtering.
A common mistake made by programmers is to overlook setting the number of reducers for a task. In such cases, Spark streaming inherits the number of reducers from the parent RDD’s. If parent RDD’s have too many reducers, then there is significant overhead in launching the task. Too few reducers might end up in limited parallelism in the cluster. Setting the apt number of reducers is very important especially in transactional systems like telecom systems.
Spark and Apache Storm/Trident both offer their application master, so one can essentially co-locate both of these applications on a cluster that runs YARN.
Storm has run in production much longer than Spark Streaming. However, Spark Streaming has a small advantage in that it has a dedicated company – Databricks – for support
A famous question from newbies to Spark is: “Is Spark+SparkStreaming=Lambda?”. The answer is no. Lambda is lot more than batch+Stream processing. Lambda is more powerful when used correctly, but is not the best for every use case.
There are lot more new frameworks coming out in Stream processing. More comparisons and discussions in the upcoming blogs. Happy Reading!
Reference . http://ganglia.sourceforge.net