Hadoop on OpenStack
Is Hadoop cloud ready? Is Cloud Hadoop ready? Find out what to expect when running Hadoop on OpenStack. In this post, I share results from tests we ran in our labs on small data sets.
In short, my conclusions & questions are:
- Co-locating VMs does not yet have a big impact for small data, but as the data volume grows, there is a need to shuffle VMs across servers.
- Hadoop’s resource scheduler must be adapted for a cloud set-up to account for some of the overheads of cloud.
- For enterprise-grade big data analytics applications, does the cloud architecture need a dedicated FCoE-based disk network?
These tests also raised a question that I will leave for later. First, let me explain how I conducted them.
A Case under Test
The case under test is a linear regression based prediction algorithm consisting of two Hadoop jobs, Data Pre-Processing and Iterative Model Training, run on a data set of size 1 GB.
By their nature, one of these jobs is data I/O intensive and the other is computation intensive, which together cover the areas I wanted to observe.
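To make the two stages concrete, here is a minimal single-machine sketch in plain Python. It is not the Hadoop code used in the test: the CSV layout (features first, target last), the function names, and the choice of batch gradient descent for the iterative training are all assumptions made for illustration.

```python
import csv
import io


def preprocess(raw_csv):
    """Stage 1 (I/O bound): parse rows and drop anything malformed.

    Assumed layout: every column is a feature except the last, which is the target.
    """
    rows = []
    for record in csv.reader(io.StringIO(raw_csv)):
        try:
            values = [float(v) for v in record]
        except ValueError:
            continue  # header line or non-numeric row
        if len(values) < 2:
            continue  # blank line or row without a target
        rows.append((values[:-1], values[-1]))
    return rows


def predict(weights, bias, features):
    """Linear model: bias + sum(w_j * x_j)."""
    return bias + sum(w * x for w, x in zip(weights, features))


def train(rows, learning_rate=0.05, iterations=2000):
    """Stage 2 (CPU bound): batch gradient descent on mean squared error."""
    n = len(rows)
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        # One full pass over the data per iteration, which is why this
        # stage is dominated by computation rather than by reading input.
        for features, target in rows:
            error = predict(weights, bias, features) - target
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias
```

In a Hadoop setting, the first function would correspond to a job that scans the raw input once, while each pass of the training loop would correspond to one iteration over the cleaned data.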
The Test Environment
The basis is a 3-node regular Hadoop cluster and a 4-node OpenStack Icehouse cloud set-up. I used 4 different variants, shown in the figure below. Each box is a physical server/compute node.
The basis for selecting these variants is to ensure that the execution of the computation logic and its associated data are distributed & combined in unique ways.
1. Hadoop on physical machines: a Hadoop cluster with three nodes that each have their own local storage.
2. Hadoop on VMs located on the same physical machine, each with virtual storage provisioned on that same machine. This is in fact what we observed to be the normal provisioning behavior of OpenStack (without Hadoop-based plugins).
3. Hadoop on VMs provisioned on different physical machines, each with a virtual volume provisioned from a single machine’s own physical local storage.
4. Hadoop on VMs on different physical machines, with virtual storage volumes provisioned from the local storage of the same physical machine that hosts each VM.
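The four placements can also be written down as data, which makes it easier to see what actually differs between them. This is only a sketch: the server names are hypothetical, and for Variant 3 I assume the single storage machine is one of the servers that also hosts a VM, which the description above does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HadoopNode:
    name: str
    compute_host: str   # physical server that runs the node (or its VM)
    volume_host: str    # physical server whose disk backs the node's storage
    virtualized: bool = True


def co_located(nodes):
    """True when every Hadoop node runs on the same physical server."""
    return len({n.compute_host for n in nodes}) == 1


def remote_volumes(nodes):
    """Count nodes whose backing disk sits on a different physical server."""
    return sum(1 for n in nodes if n.compute_host != n.volume_host)


# Hypothetical server names; Variant 3's storage host is an assumption.
VARIANTS = {
    1: [HadoopNode(f"node{i}", f"server{i}", f"server{i}", virtualized=False)
        for i in (1, 2, 3)],
    2: [HadoopNode(f"vm{i}", "server1", "server1") for i in (1, 2, 3)],
    3: [HadoopNode(f"vm{i}", f"server{i}", "server1") for i in (1, 2, 3)],
    4: [HadoopNode(f"vm{i}", f"server{i}", f"server{i}") for i in (1, 2, 3)],
}

for number, nodes in VARIANTS.items():
    print(number, "co-located:", co_located(nodes),
          "remote volumes:", remote_volumes(nodes))
```

The model only describes where compute and storage sit; it says nothing about performance, which depends on how the volumes are exposed to the VMs.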
Many observations about Hadoop on cloud already exist on the internet, covering topics like:
- Hardware technologies – midplane crossbar switch of the Superdome.
- Software technologies – resource pooling, scheduling, optimization of containers and more general observations of cost savings.
For this blog, I took the initial step of monitoring four basic metrics (job completion time, CPU usage, memory usage and disk I/O) and recording my observations and inferences.
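As an illustration of how one of these metrics can be derived, the sketch below computes disk read and write throughput from two snapshots of Linux's /proc/diskstats. This is not the tooling used for the tests above, which is not described here; the function names and the device name are assumptions.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text, device):
    """Return (sectors_read, sectors_written) for one device.

    Each line is: major minor name reads merged sectors_read ms_reading
    writes merged sectors_written ... so the counters of interest are
    whitespace-separated fields 5 and 9 (0-based).
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[5]), int(fields[9])
    raise ValueError(f"device {device!r} not found")


def throughput_mb_per_s(before, after, device, interval_s):
    """Read and write throughput in MB/s between two snapshots."""
    read0, written0 = parse_diskstats(before, device)
    read1, written1 = parse_diskstats(after, device)
    to_mb = SECTOR_BYTES / 1_000_000
    return ((read1 - read0) * to_mb / interval_s,
            (written1 - written0) * to_mb / interval_s)
```

On a Linux node, the two snapshots would come from reading /proc/diskstats twice, a known number of seconds apart, while the Hadoop job is running.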
The Results are shown below:
The disk I/O results stand out the most. The I/O throughput is relatively high for the physical machine set-up compared to the cloud-based set-ups.
With the data size being small, the job completion time was longer for Variants 3 & 4 than for Variants 1 & 2. It has to be noted here that, according to VMware’s white paper on a similar attempt with “big data” sizes, this trend is reversed.
I infer that, because OpenStack exposes the virtual volumes to the VMs as iSCSI targets, this overloads the network interface, which also has to handle data transfers from mappers to reducers.
This raises the question of whether a dedicated disk I/O network is needed in the OpenStack architecture for Hadoop.
And further, what technology will that exclusive Disk I/O network employ? Will it be the conventional IP based iSCSI technology or will it be Fibre channel based FCoE technology?
Since “Big Data” transfers need very low latency, I believe FCoE vs. iSCSI deserves a separate benchmarking exercise.
There are more reasons to emphasize the disk I/O latencies, and here is why:
Because Hadoop’s resource plug-in schedules and terminates tasks based on the latencies in task execution, prominent task re-scheduling was observed in the cloud variants. Typically 50-100 tasks were re-scheduled due to high disk I/O latencies, even for the small data size.
For the same reasons, the co-located VMs of Variant 2 performed far better than the distributed VMs of Variants 3 & 4. This opens up a rather intriguing question: do cloud providers have to re-position their Hadoop VMs from co-located to distributed as the data volume grows?
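To show why slow disk I/O can trigger re-scheduling, here is a deliberately simplified straggler check. It is not Hadoop's actual scheduler logic: the function name, the progress-based rule, and both thresholds are assumptions chosen for illustration.

```python
from statistics import mean


def find_stragglers(progress, elapsed_s, min_runtime_s=60.0, lag=0.2):
    """Return ids of tasks that look slow enough to be re-scheduled.

    progress:  task id -> fraction of work completed, in [0, 1]
    elapsed_s: task id -> seconds the task has been running

    A task is flagged when it has run for at least `min_runtime_s` and its
    progress trails the average progress of all tasks by more than `lag`.
    Both thresholds are illustrative values, not Hadoop defaults.
    """
    if not progress:
        return []
    average = mean(progress.values())
    return sorted(
        task_id
        for task_id, done in progress.items()
        if elapsed_s[task_id] >= min_runtime_s and done < average - lag
    )
```

A task that waits on a slow virtual volume makes little progress while its peers move ahead, so under a rule like this it gets flagged even though nothing has failed. With many tasks sharing the same slow storage path, repeated re-scheduling follows.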
Now it’s fair to let you decide: is Hadoop cloud ready? Or is cloud Hadoop ready?