Massively distributed analytics
The Internet of Things and the Networked Society, with billions of devices connected and generating data, imply massive numbers of data sources and analytics receivers distributed across the globe. This translates into a scenario where data will literally come from everywhere. At the device level, this may not seem to be a problem. However, if we step back and analyze the impact of this "massively" distributed scenario, we find that existing analytics architectures and algorithms need to evolve to meet its requirements.
“Massively distributed analytics” was the topic of the presentation I gave at the Big Data Stockholm Meet-up about a month ago. At the core was a discussion about different analytics architectures for solving the challenges and exploiting the opportunities in a “massively” distributed environment.
Three main architectural principles can be applied to analytics: centralized, decentralized and distributed.
The centralized architecture is the most common today. The principle is simple: data coming from everywhere is moved to a large, powerful, co-located cluster of physical nodes. This cluster then ingests and processes the data to extract insights. Data is thus centralized for analytics. The guiding principle could be summed up as: "Move everything here, I have the capacity to analyze everything!"
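As a minimal sketch of that principle, the snippet below models each source and the central cluster as plain Python lists; the source names and numbers are purely illustrative, not taken from any real deployment:

```python
# Hypothetical sketch of the centralized principle: every source ships
# its raw data to one cluster, which then computes over the full dataset.

sources = {
    "web-server": [120, 95, 130],
    "database": [80, 110],
    "mobile-app": [60, 70, 90],
}

# "Move everything here": the central cluster ingests all raw data...
central_store = [v for values in sources.values() for v in values]

# ...and then has full freedom to run any analysis over it.
average_latency = sum(central_store) / len(central_store)
```

The strength of this model is exactly that freedom: once everything sits in one place, any query or model can run against the complete dataset.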
By contrast, in a decentralized architecture, multiple large clusters are arranged hierarchically. The result is a tree of clusters in which the leaves, being close to the data sources, can process the data earlier or distribute it more efficiently for analysis. This architecture can use some sort of logical grouping (per country or continent, for example) or a hierarchy set up to divide the analytics tasks. The underlying principle would be: "You can do this task, but whenever you need to perform that one, let me handle it!"
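The tree idea above can be sketched in a few lines. In this hypothetical example, leaf clusters reduce their raw data to small summaries (count, sum) and a parent node merges the summaries without ever seeing the raw data; the per-continent grouping and the values are illustrative assumptions:

```python
# Hierarchical (decentralized) aggregation sketch: leaves summarize,
# the parent merges the summaries.

def local_summary(values):
    """A leaf cluster reduces its raw data to (count, sum)."""
    return (len(values), sum(values))

def merge_summaries(summaries):
    """A parent node combines child summaries without raw data access."""
    total_count = sum(c for c, _ in summaries)
    total_sum = sum(s for _, s in summaries)
    return total_count, total_sum

# Leaves: e.g. per-continent clusters close to their sources
europe = local_summary([10.0, 12.0, 11.0])
asia = local_summary([9.0, 8.0])

# Root: the parent computes the global result from the summaries alone
count, total = merge_summaries([europe, asia])
global_mean = total / count
```

Note that the parent gets exactly the same answer it would get from the raw data, while only a couple of numbers travel up each branch of the tree.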
Finally, in a distributed architecture, the clusters are everywhere. Some are small, some large. Any cluster can utilize the data, and computations can be moved even closer to the data sources. Therefore, analytics tasks are pushed towards the edge, or even onto the device where possible. This model can be of great interest in, for example, the IoT scenario. Analytics and statistical models can then be built on the device or close to it. A quote that captures this architecture would be: "We live in harmony with no bosses."
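To make the on-device case concrete, here is a small sketch of analytics running at the very edge: a sensor maintains a running mean and variance in constant memory using Welford's online algorithm, so only the resulting summary ever needs to leave the device. The class name and the sample readings are illustrative assumptions:

```python
# Hypothetical on-device analytics: a running mean/variance kept with
# O(1) memory (Welford's online algorithm), updated per reading.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the readings seen so far
        return self.m2 / self.n if self.n > 1 else 0.0

stats = RunningStats()
for reading in [20.1, 20.3, 19.8, 20.0]:
    stats.update(reading)
# The device can now report stats.mean and stats.variance upstream,
# rather than streaming every raw reading.
```

The design point is that the device never stores or transmits its history: each reading is folded into the summary and discarded.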
Constraints and requirements are much higher in the distributed architecture than in the centralized one. The distributed approach is clearly needed, for example, in a scenario where the number of data sources is counted in millions. The traditional approach to analyzing this data would be to send everything to a central location and then perform the analysis or build the statistical model there. However, sending all this information to a central location requires a lot of bandwidth and implies slower responses. It is therefore critical to compute the analytics in a geo-distributed manner, where the raw data is kept locally rather than streamed, while the model is built by coordinating data from multiple locations.
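The bandwidth argument can be made concrete with some back-of-the-envelope arithmetic. All the numbers below (sensor count, readings per site, value size) are assumptions chosen purely to illustrate the scale of the difference:

```python
# Rough illustration of the bandwidth argument: a million sensor sites,
# each holding 1,000 readings. Centralizing ships every raw reading;
# the geo-distributed approach ships one small summary per site.

SITES = 1_000_000
READINGS_PER_SITE = 1_000
BYTES_PER_VALUE = 8  # one float64

centralized = SITES * READINGS_PER_SITE * BYTES_PER_VALUE
distributed = SITES * 2 * BYTES_PER_VALUE  # each site sends (count, sum)

print(f"centralized: {centralized / 1e9:.0f} GB on the wire")
print(f"distributed: {distributed / 1e9:.3f} GB on the wire")
```

Under these assumptions, the geo-distributed approach moves several hundred times less data, while a statistic such as the global mean computed from the summaries is identical to the one computed over all the raw data.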
It is important to stress here that the centralized model has been, is, and will remain extremely useful for scenarios where the data sources are available in a centralized location, such as a typical web server or a database in general. Therefore, it will still be the preferred choice on many occasions.
An important conclusion from this discussion is to really put effort into analyzing the constraints and requirements of the expected analytics system, as they will determine what kind of approach is suitable. There is no single approach that solves everything. Each of the presented architectures offers advantages and disadvantages, so it is important to identify the real needs of the desired analytics system in order to select the most suitable approach.
The Big Data Stockholm Meetup, held April 1st in Stockholm, Sweden, was organized by Dataconomy, a big data event organizer bringing the big data community around Europe together on a weekly basis. Invited speakers were Karina Bunyik, Spotify; Moustafa Soliman, HP; and myself, Ignacio Mulas Viela, Ericsson Research. Around 200 people attended the event.
After the presentation, there were lots of interesting discussions with people working with data across various industries. The participants brought their different points of view into the picture and shared valuable insights and approaches to tackling these problems. I would definitely like to repeat this experience, both as a speaker and as a participant in more events like this, and I encourage anyone interested to get involved: attend, present and discuss your work!
Ignacio Mulas Viela, Ericsson Research