CloudOS: A Gen3 Cloud System Platform
Just like cellular wireless networks, cloud system platforms have gone through several generations. Gen 1 lifted server images off hardware servers and plopped them down on virtualized enterprise data centers as VMs. Gen 2 provided complex cloud management stacks with lots of services. Recently, in Gen 2.5, container management systems have been layered on top of the cloud management stacks. But one more step is needed to make development and deployment truly easy and data center management radically better.
With Gen 1, enterprise customers liked the result: by oversubscribing server capacity, they could reduce their CAPEX spending and still achieve acceptable performance for their applications. Custom scripts or proprietary software handled control, orchestration, and management of the virtualized platform, but there was no autoscaling, multitenancy, load balancing, or other cloud services.
In Gen 2, public cloud operators and open source cloud management suites such as OpenStack feature complex cloud management stacks with poor integration between the system software on the server and the cloud management software, particularly for policy and analytics. Cloud resource usage and network performance, particularly tail latency, are mediocre. For example, M. Paulish, A. S. Varde, and S. A. Robila showed that public cloud operators rarely achieve more than 25% resource usage, and Y. Xu, Z. Musgrave, B. Noble, and M. Bailey showed that tail latencies of up to 40 ms are common. Metrics collection and analytics need to be programmed on an application-by-application basis.
Gen 2.5 adds disconnections in storage, networking, policy, and analytics between the cloud management systems and the container management systems.
Cloud System Software Evolution
What should a Gen3 cloud system platform look like? The data center management system needs to right-size applications to execution and deployment units with the proper security and isolation. Whereas in Gen 1 through 2.5, large GB-sized VMs were the minimum execution unit (with VMs hosting containers in Gen 2.5), today’s microservices hosted natively in containers can be in the 100 MB range, and serverless computing models such as AWS Lambda and Google Cloud Functions can be considerably smaller. Smaller execution units make resource utilization more efficient because the size of any stranded resource decreases, but the data center management system must support allocating execution and deployment in those smaller units.
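To make the stranded-resource argument concrete, here is a minimal sketch. It assumes identical execution units packed into one server's free memory, and the 10 GB figure and the unit sizes are made-up illustrations, not measurements. The point it shows is that the leftover memory is always smaller than one unit, so shrinking the unit shrinks the worst-case waste.

```python
def stranded_mb(free_mb: int, unit_mb: int) -> int:
    """Memory left over after packing as many identical units as will fit."""
    return free_mb % unit_mb


# Hypothetical server with 10 GB of free memory.
FREE_MB = 10 * 1024

# Hypothetical unit sizes: a VM, a containerized microservice, a function.
UNITS = [("4 GB VM", 4096), ("100 MB container", 100), ("16 MB function", 16)]

for label, unit_mb in UNITS:
    count = FREE_MB // unit_mb
    left = stranded_mb(FREE_MB, unit_mb)
    print(f"{label}: {count} units fit, {left} MB stranded")
```

Real workloads are not identical in size, so actual stranding depends on the placement algorithm, but the bound of "less than one unit per server" still explains why smaller units help.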
In addition, Gen 2.5 data center management systems require a developer to go through a step of “infrastructure programming”. Most developers tend to view this step as unnecessary overhead. Removing it means automating the deployment step completely, so that developers can, if they choose, develop in the cloud as easily as on their own laptop, provided IDE support is available. “Born in the cloud” then means developed in the cloud as well as deployed into it. Removing “infrastructure programming” gives the developer an abstraction of the data center that looks like one gigantic computer, moving developers closer to the DevOps ideal of fully integrating development and operations.
The cloud system platform should support a collection of abstractions and primitives that make distributed development tools, such as loggers and debuggers, straightforward to build. Policy and analytics abstractions in the server operating system should support hierarchical policy and generic metrics collection. Libraries of common analytics algorithms and policy templates, together with generic abstractions and primitives for hooking analytics, policy, and control together, should drive increased automation of workload control without a lot of complicated custom programming. Developers should be able to plug together an analytics pipeline and couple it to a policy decision-making engine as easily as they can connect today to distributed key-value stores such as etcd. The result should give developers a more seamless development experience, rather like developing on top of a single computer operating system, so that they can focus on developing application code rather than configuring analytics and policy.
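As a rough illustration of what "plugging together" analytics and policy could feel like, here is a small sketch. Every name in it (`pipeline`, `drop_above`, `moving_average`, `ThresholdPolicy`) is hypothetical and invented for this example; none of it is an existing platform API, and the CPU samples and thresholds are made up.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

# A stage takes a list of metric samples and returns a transformed list.
Stage = Callable[[List[float]], List[float]]


def pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right into a single analytics stage."""
    def run(samples: List[float]) -> List[float]:
        for stage in stages:
            samples = stage(samples)
        return samples
    return run


def drop_above(limit: float) -> Stage:
    """Discard samples above a plausibility limit (e.g. a bad sensor read)."""
    return lambda xs: [x for x in xs if x <= limit]


def moving_average(window: int) -> Stage:
    """Smooth samples with a sliding mean; returns [] if too few samples."""
    return lambda xs: [mean(xs[i:i + window])
                       for i in range(len(xs) - window + 1)]


@dataclass
class ThresholdPolicy:
    """Hypothetical policy template: act on the latest smoothed value."""
    scale_up_above: float
    scale_down_below: float

    def decide(self, smoothed: List[float]) -> str:
        if not smoothed:
            return "hold"
        latest = smoothed[-1]
        if latest > self.scale_up_above:
            return "scale_up"
        if latest < self.scale_down_below:
            return "scale_down"
        return "hold"


# Made-up CPU utilization samples; 7.0 stands in for a corrupt reading.
cpu_samples = [0.62, 0.71, 0.93, 0.88, 0.95, 7.0]
analytics = pipeline(drop_above(1.0), moving_average(3))
policy = ThresholdPolicy(scale_up_above=0.8, scale_down_below=0.3)
decision = policy.decide(analytics(cpu_samples))
print(decision)
```

The design choice worth noting is that the analytics stages and the policy only share a plain data type (a list of samples), so either side can be swapped from a library without touching the other.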
On the data center management side, multitenancy, policy, and analytics support should be available and well integrated into the container management and serverless computing development environment, not layered on top of VMs. This in turn means providing mechanisms that implement the right level of security and isolation for smaller execution units. Resource allocation and job placement algorithms should be available to improve data center resource utilization efficiency by a factor of at least 3. In addition, networking needs to improve to reduce tail latency. The Gen3 multitenant cloud system platform should encourage the blossoming of a microservice ecosystem, where developers can offer microservices, discover the APIs and service endpoint URLs of microservices offered by other developers, and charge for the microservices they offer.
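The microservice ecosystem described above needs three operations: offer, discover, and charge. The sketch below shows one minimal way those could fit together in a single in-memory registry. The `Listing` and `Registry` names, the example.com URLs, and the per-call pricing model are all assumptions made for illustration; a real platform would add authentication, persistence, and tenant isolation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Listing:
    """Hypothetical catalog entry for one offered microservice."""
    name: str
    owner: str
    endpoint_url: str
    api_spec_url: str
    price_per_call_microusd: int  # integer micro-dollars avoid float rounding


class Registry:
    """In-memory sketch of offer / discover / charge."""

    def __init__(self) -> None:
        self._listings: Dict[str, Listing] = {}
        self._calls: Counter = Counter()  # (caller, service name) -> calls

    def offer(self, listing: Listing) -> None:
        self._listings[listing.name] = listing

    def discover(self, name: str) -> Listing:
        return self._listings[name]

    def record_call(self, caller: str, name: str) -> None:
        if name not in self._listings:
            raise KeyError(name)
        self._calls[(caller, name)] += 1

    def amount_owed(self, caller: str) -> int:
        """Total micro-dollars the caller owes across all services used."""
        return sum(count * self._listings[name].price_per_call_microusd
                   for (who, name), count in self._calls.items()
                   if who == caller)


registry = Registry()
registry.offer(Listing(
    name="thumbnailer",
    owner="alice",
    endpoint_url="https://thumbnailer.example.com/v1",        # placeholder
    api_spec_url="https://thumbnailer.example.com/openapi",   # placeholder
    price_per_call_microusd=250,
))

found = registry.discover("thumbnailer")
for _ in range(3):
    registry.record_call("bob", found.name)
print(registry.amount_owed("bob"))
```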
The overall objective of tightly integrating the cloud management system with the server OS should be to provide the developer with an abstraction of the data center as one giant computer. Such a platform would be a true cloud operating system, a CloudOS, in which the data center is the computer.
You might want to also check out my earlier post on the subject "Time for a new cloud operating system".