Cloud 3.0: Making Cloud Easy
We envision Cloud 3.0 as a platform that makes it easy for developers to develop and deploy directly in the datacenter, and for datacenter managers to administer it, delivering services in compute clusters of all sizes – from large centralized datacenters to clusters of 3 to 5 machines in a radio base station. Such increased usability and scalability should encourage wider adoption of the cloud computing model. Our vision is for cloud computing to become an affordable, secure, and performant information technology utility that is ubiquitously available and can be successfully deployed to digitalize governments, industries, and society. We believe Cloud 3.0 will play a critical role in realizing this vision.
Pain points for developers and datacenter managers
Over the past year, the international cloud research community has begun to define the properties of the next generation cloud system platform, generically designated as Cloud 3.0. At Ericsson Research, we have been working on a next generation cloud platform over the same period, with a primary motivation to address pain points for two groups: developers and datacenter managers.
We believe that the next generation cloud platform must cater primarily to the needs of the application developer. From the developer's standpoint, infrastructure programming means extra work unrelated to the application itself. The next generation of cloud platform should eliminate infrastructure programming.
In the current cloud platform, networking is also constantly in the developer’s face. Developers need to set up and manage a virtual network, even as performance in terms of latency often disappoints, forcing them back into their code to modify it. For the next generation, the network inside the datacenter should be like a bus in a server: fast and invisible, and network management should be automated as much as possible. Finally, while DevOps is a nice concept, like infrastructure programming it puts more of a burden on the developer. We think the right approach to simplifying the developer’s task is to present an abstraction of the datacenter as One Big Machine, or OBM, with the developer’s laptop as an extension, so that development and deployment happen seamlessly.
In addition, datacenter management has its own set of pain points. Adding or removing servers from a cloud platform such as OpenStack, and managing the network, require too much hand editing of databases and scripts, leaving open the possibility of errors. Upgrading cloud platform software is often difficult, requiring the datacenter management staff to run their production deployment alongside a “development” deployment they are testing. Finally, because the current generation of cloud platform software is designed around a centralized approach, there is a mismatch with the requirements of the distributed cloud, especially with respect to downward scalability.
Consequently, we’ve come up with a few architectural principles around which we believe Cloud 3.0 should be designed. Like the original architectural principles for the internet, these architectural principles are intended to outlive the changing layer cakes of functional architecture and become the ground on which we base our Cloud 3.0 system building. Our architectural principles are divided into three groups: operating system, datacenter management, and developer expectations.
OS Principle #1: All cloud services are fully distributed.
By “operating system” here, we mean the datacenter operating system. We believe that the next generation of cloud system software must be designed along the lines of distributed operating systems such as Amoeba and Saguaro from the early 1990s, in which operating system services like process creation and resource management are distributed. Unlike those systems, however, operating system services that are performance-critical, such as virtual memory, should remain local. And because thread management depends on sharing virtual memory, thread management must remain local too. Thus, the Linux kernel becomes a kind of per-server module providing the base performance-sensitive services for the operating system, while other services are implemented in a distributed fashion across the datacenter. This distributed operating system design approach is the foundation of the OBM.
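The split between global and local services can be sketched in a few lines. This is a toy illustration only, not a real implementation: the class and server names are invented, and a real system would fork processes under the local Linux kernel rather than record a placement string.

```python
# Hypothetical sketch: a datacenter-wide process-creation service that
# delegates performance-sensitive work (memory, threads) to the local
# kernel on whichever server is chosen. All names are illustrative.

class LocalKernel:
    """Per-server module: performance-sensitive services stay local."""
    def __init__(self, node_id, free_cores):
        self.node_id = node_id
        self.free_cores = free_cores

    def spawn(self, image):
        # In a real system this would fork/exec under the local Linux
        # kernel; here we only record the placement decision.
        self.free_cores -= 1
        return f"{image}@{self.node_id}"

class DistributedProcessService:
    """Datacenter-wide service: picks a server, then defers to its kernel."""
    def __init__(self, kernels):
        self.kernels = kernels

    def create_process(self, image):
        # Placement is a global decision; execution is local.
        target = max(self.kernels, key=lambda k: k.free_cores)
        return target.spawn(image)

obm = DistributedProcessService([LocalKernel("srv-1", 8), LocalKernel("srv-2", 16)])
print(obm.create_process("analytics-worker"))  # lands on the least-loaded server
```

The developer calls one datacenter-wide `create_process`; which server's kernel actually runs the process is an internal detail, which is exactly the OBM effect.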
OS Principle #2: Resources belong to the resource manager, not the services.
By “resource”, we mean a physical or logical component in limited supply that is necessary to provide a service to a tenant. By “service” we mean a bundle of resources and other services that allows a tenant to perform some computational or communication task of value. One of the problems with existing cloud management systems such as OpenStack is that services own resources, so other services must negotiate with them in order to get access to resources they need, and as a result the management of resources becomes complicated. In Cloud 3.0, we believe there should be a single point of contact for resources, the resource manager, and that services should be responsible for managing bundles of resources obtained from the resource manager. The lines of control in the resulting architecture should be cleaner, and should allow better control of resource usage and deeper automation.
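The principle can be made concrete with a minimal sketch, under the assumption of a single `ResourceManager` as the sole owner of resources; the API names and resource types here are invented for illustration, not taken from any existing system.

```python
# Illustrative sketch: one ResourceManager owns all resources; services
# lease bundles from it instead of negotiating with each other.

class ResourceManager:
    def __init__(self, inventory):
        self.free = dict(inventory)   # e.g. {"vcpu": 64, "gib_ram": 256}
        self.leases = {}              # service name -> list of bundles

    def acquire(self, service, bundle):
        # Grant the bundle only if every resource in it is available.
        if any(self.free.get(r, 0) < n for r, n in bundle.items()):
            return False
        for r, n in bundle.items():
            self.free[r] -= n
        self.leases.setdefault(service, []).append(bundle)
        return True

    def release(self, service):
        # Returning a service's bundles makes them available to others.
        for bundle in self.leases.pop(service, []):
            for r, n in bundle.items():
                self.free[r] += n

rm = ResourceManager({"vcpu": 64, "gib_ram": 256})
assert rm.acquire("db-service", {"vcpu": 16, "gib_ram": 64})
assert not rm.acquire("batch-service", {"vcpu": 60})  # would oversubscribe
rm.release("db-service")
assert rm.acquire("batch-service", {"vcpu": 60})      # now it fits
```

Because every grant and release flows through one point of contact, usage accounting and policy automation sit in one place rather than being scattered across services.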
DCM Principle #1: No unit of management abstraction below the datacenter.
Management abstractions provide the basis on which tools are built to manage the datacenter. We believe that Cloud 3.0 should not present VMs, pods, or frameworks as abstractions to human operators. Since our intent is to make the datacenter look like OBM, our architecture needs to provide management abstractions that present the datacenter as OBM. That does not mean the management system cannot take a modularized approach internally to managing datacenter operation, only that these internal mechanisms must expose the various datacenter components in a way that preserves the human view of the datacenter as OBM. This approach will act as a forcing function on management automation, encouraging the application of automation techniques such as machine intelligence to the maximum extent, thereby driving down the cost and complexity of running the datacenter.
DCM Principle #2: Datacenter capacity grows organically and self-configures.
The current generation of cloud management software, exemplified by OpenStack, requires a minimum number of servers to run the datacenter management system, and extensive editing of databases and configuration files to incorporate or remove a server. In Cloud 3.0, the datacenter management system should require no human intervention to incorporate or remove a server. It also should not require a minimum number of nodes to run, but should scale down to a single server. This model has three important advantages over the current generation: it eliminates a major source of human error, it removes centralized databases as single points of failure, and it removes the lower limit on the number of servers a cluster needs before it can be managed as a cloud. We believe this model can be implemented through OS Principle #1, namely by making the datacenter management services a distributed system just like the rest of the Cloud 3.0 services.
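The zero-touch membership model above can be sketched as follows. This is a deliberately simplified, in-memory illustration; a real implementation would discover and track members with a gossip or consensus protocol rather than a Python set, and the names are assumptions.

```python
# Hedged sketch: servers join or leave the management plane by announcing
# themselves, with no database or config-file edits by an operator.

class Cluster:
    def __init__(self):
        self.members = set()

    def incorporate(self, server_id):
        # Zero-touch add: the server announces itself and is absorbed.
        self.members.add(server_id)

    def remove(self, server_id):
        # Zero-touch removal: no scripts or databases to hand-edit.
        self.members.discard(server_id)

    def is_operational(self):
        # Scales down to a single server: no minimum management quorum.
        return len(self.members) >= 1

dc = Cluster()
dc.incorporate("srv-1")
assert dc.is_operational()   # one server is already a cloud
dc.incorporate("srv-2")
dc.remove("srv-1")
assert dc.is_operational()   # shrinking back down still works
```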
DE Principle #1: Abstractions provided along a latency gradient.
Our expectation is that next generation datacenters will have 100 Gb/s networking as standard, so we do not expect bandwidth within the datacenter to be an issue. Latency (including tail latency), however, will still be an issue, and we believe that the abstraction mechanisms provided to developers must expose, rather than hide, latency. The reason is that developers often need to accommodate higher-latency connections with various implementation schemes (e.g. caching), whereas for a low-latency connection such schemes are unnecessary. Abstractions should not hide system characteristics that affect how developers interact with other parts of the system, if those interactions would result in a substantial difference in the way the code is implemented.
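One way such a latency gradient might surface in an API is sketched below. The latency classes, endpoint type, and `make_client` helper are all hypothetical names invented for this illustration; the point is only that the platform labels each connection so the developer can decide when a cache is worth the complexity.

```python
# Illustrative sketch of DE Principle #1: the platform exposes a latency
# class per connection instead of hiding it behind a uniform abstraction.

from enum import Enum

class LatencyClass(Enum):
    SAME_RACK = 1   # microseconds: treat like a local call
    SAME_DC = 2     # tens of microseconds: still cheap
    REMOTE_DC = 3   # milliseconds: caching likely pays off

class Endpoint:
    def __init__(self, name, latency_class):
        self.name = name
        self.latency_class = latency_class

def make_client(endpoint, fetch):
    # Only wrap high-latency endpoints in a cache; low-latency ones
    # go straight through, keeping the common-case code simple.
    if endpoint.latency_class is LatencyClass.REMOTE_DC:
        cache = {}
        def cached(key):
            if key not in cache:
                cache[key] = fetch(key)
            return cache[key]
        return cached
    return fetch

local = make_client(Endpoint("catalog", LatencyClass.SAME_DC), lambda k: f"v:{k}")
remote = make_client(Endpoint("billing", LatencyClass.REMOTE_DC), lambda k: f"v:{k}")
```

Because the latency class is visible, the caching decision is made once at client construction time instead of being scattered through application logic.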
DE Principle #2: Different communication mechanisms for different “distances”.
As a corollary to DE Principle #1, our Cloud 3.0 datacenter operating system will provide developers with inter-process communication or remote procedure call as abstractions for network communication within the datacenter, over a next generation low latency networking protocol such as RDMA. For communicating to services on the Internet and to other datacenters in a distributed cloud, standard IP networking and sockets will be used through a proxy. The communication between a client process and the proxy will use a low latency, intra-DC protocol and the proxy will handle communication with services on the Internet using IP addresses, ports, and sockets. The developer will not need to use sockets for microservices that live in the same datacenter.
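The dispatch logic described above can be sketched minimally. The class names, transport strings, and service names below are assumptions made for illustration; in particular, the RPC path stands in for whatever low-latency, RDMA-style transport the platform actually provides.

```python
# Hedged sketch of DE Principle #2: a dispatcher picks intra-DC RPC for
# local microservices and hands Internet-bound traffic to a proxy that
# owns the sockets on the client's behalf.

class IntraDCRPC:
    """Low-latency, socket-free path inside the datacenter."""
    def call(self, service, payload):
        return f"rpc:{service}:{payload}"

class InternetProxy:
    """Terminates IP/sockets on behalf of clients; clients reach the
    proxy itself over the same low-latency intra-DC protocol."""
    def call(self, host, payload):
        return f"proxy->{host}:{payload}"

class Dispatcher:
    def __init__(self, local_services):
        self.local_services = set(local_services)
        self.rpc = IntraDCRPC()
        self.proxy = InternetProxy()

    def send(self, destination, payload):
        # Same datacenter: RPC, no sockets in developer code.
        if destination in self.local_services:
            return self.rpc.call(destination, payload)
        # Anywhere else: delegate socket handling to the proxy.
        return self.proxy.call(destination, payload)

d = Dispatcher({"inventory", "checkout"})
print(d.send("inventory", "list"))         # intra-DC RPC path
print(d.send("api.example.com", "GET /"))  # proxied Internet path
```

The developer writes one `send` call either way; the "distance" decides the mechanism, which is the corollary to exposing latency in the first place.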
Use cases and feature sets
The primary use case for Cloud 3.0 is to provide a platform where development environments for cloud native applications can be built, so that developers don’t just deploy into the cloud, they also develop there. For our earlier reflections on that topic, see the blog posts Time for a new cloud operating system and CloudOS: A Gen3 Cloud System Platform. We would like to move the “cloud native” definition back from “born in the cloud” to “gestated in the cloud” too, so that the difference between the developer’s laptop and the cloud is erased. Infrastructure programming and networking must disappear to remove the requirement for developers to deal with complexity unrelated to the development task.

On the datacenter management side, the management system should be fully distributed, to avoid single points of failure and to let management scale incrementally down as well as up, and highly automated, to reduce the cost and complexity of running a datacenter. Resource management needs to be handled in a disaggregated fashion, and policy control needs to be easy to use and well integrated with analytics, so that the results of analytics can influence policy. Finally, the platform should support software defined hardware accelerators, like FPGAs and GPUs.