Are today’s clouds really cheap?
Can the operational cost of a cloud infrastructure be reduced further by employing more sophisticated measurement and management tools? The answer is yes! Integrating a cost analysis tool with the cloud management system, coupled with advanced analytics, can have a significant impact on the costs associated with operating a data center.
As reported by big cloud players (Microsoft and Amazon), cloud computing is a low-margin business even at scale. Competition between cloud players has resulted in a “race-to-zero”, meaning big cloud providers charge next to nothing for cloud infrastructure as a service and make money on massive sales volume. Hence, decreasing the operational cost of data centers is of high interest, as it can lead to higher profit margins at the same offered price.
We believe reduction of operational cost can only be achieved by employing more sophisticated measurement and management tools. Using relatively simple cloud management system architectural changes coupled with advanced analytical techniques could have a significant impact on the costs associated with operating a data center.
Given the complexity and dynamic nature of cloud solutions, especially with the new disaggregated architecture paradigm offered by Intel’s Rack Scale Design (RSD), cloud providers could adopt a more structured way to operate (read more about Cloud Disaggregation). Success lies in careful planning and continuous execution of optimization processes. Today’s hardware-defined infrastructure (data centers with monolithic servers) is inflexible and inefficient, and cannot keep pace with the frequency of change and the complexity of managing at scale. Moreover, a lack of automation has led to very low utilization, over-provisioning and slow, error-prone operations.
Therefore, a new degree of agility, speed, and flexibility is needed for the best operational practices and the best competitive pricing. The benefits of agile planning and continuous optimization become even more obvious in data centers built with a disaggregated architecture. In this architecture, instead of purchasing fixed and pre-configured servers with a defined amount of compute, memory, and storage, the data center operator buys hardware that can be configured as separate pools of compute, memory (e.g. RAM, NVMe), storage and networking resources. This gives cloud providers a new dimension along which to optimize, because it enables dynamically reshuffling the hardware resources of various types in order to increase resource utilization.
The siloed operating model of separating business and technology used currently in data center management systems is not well suited for cost effectively delivering cloud services, as it leads to inefficient operational processes. Once in an operational state, gaps in planning are often exposed, characterized by either high cost or low operational assurance. Therefore, operational cycles must be business focused and performance driven. Since technology and business are ever evolving, the journey should be a continuous and iterative process. Such an approach can help ensure that the infrastructure and its operation never get out of step.
To address this challenge, we are working on integrating an advanced “cost engine” into the data center operations automation platform, where business policy considerations can be introduced. The data center management platform provides a single point of integration across IT, facilities and operations. Advanced monitoring, purpose-built analytics and an associated cost engine allow autonomous operations, continuous optimization and accelerated application lifecycle management, all at scale.
Having a cost engine integrated into the data center management system enables proactive hardware planning using run-time performance metrics obtained via analytics. The integrated cost engine allows cloud infrastructure providers to move towards more agile and micro-planned operational phases, which can bring a continuous replanning/re-shuffling of infrastructure resources and automatic cost reduction during the data centers’ lifetime. This approach can help save on Total Cost of Ownership (TCO) by increasing resource utilization with application-optimized hardware, and thereby can introduce more cost transparency into the data center.
The data center automation platform constantly monitors the performance of cloud services and collects performance metrics such as power consumption, network congestion, CPU utilization, delays and response time. The metrics are analyzed as they arrive by our advanced analytics engine, based on pre-defined rules and thresholds. If the analysis detects the need for changes in the system, such as new hardware to support an increasing workload, one or more events/alarms are generated and sent to the cost engine with any associated information. The cost engine calculates the TCO for the various options and decides whether immediate action is required or whether the alarm should be ignored at this stage (considering the expenses of the system or any other business policy). If action is required, a proposal is sent to the Hardware Dimensioning Engine, which estimates the exact list of new hardware and/or required actions to be sent to the system administrator.
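The flow described above (metrics, rule-based analysis, alarms, cost decision) can be sketched roughly as follows. This is only an illustration in Python, not the platform's actual implementation: the threshold values, the `Alarm` type, the function names and the two cost estimates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical thresholds; real values would come from operator policy.
THRESHOLDS: Dict[str, float] = {
    "cpu_utilization": 0.85,
    "network_congestion": 0.70,
    "response_time_ms": 200.0,
}


@dataclass
class Alarm:
    metric: str
    value: float
    threshold: float


def analyze(metrics: Dict[str, float]) -> List[Alarm]:
    """Rule-based analytics: raise an alarm for each metric above its threshold."""
    return [
        Alarm(name, value, THRESHOLDS[name])
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]


def cost_engine_decide(alarms: List[Alarm],
                       cost_of_ignoring: float,
                       cost_of_acting: float) -> str:
    """Act only if there is an alarm and acting has the lower estimated TCO."""
    if not alarms:
        return "no_action"
    if cost_of_acting < cost_of_ignoring:
        return "send_to_hardware_dimensioning"
    return "ignore_for_now"


# Example with made-up numbers: only CPU utilization exceeds its threshold.
sample = {"cpu_utilization": 0.92, "network_congestion": 0.40, "response_time_ms": 150.0}
alarms = analyze(sample)
decision = cost_engine_decide(alarms, cost_of_ignoring=120_000.0, cost_of_acting=90_000.0)
print(decision)  # send_to_hardware_dimensioning
```

In a real deployment the two cost estimates would themselves be produced by the TCO calculation rather than passed in as constants; they are parameters here only to keep the sketch short.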
In summary, adding cost optimization processes to the daily operation and management systems of data centers can be considered a logical way forward to bring large cost savings for cloud providers, allowing them to increase their profit margin without extra investment.
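The benefits of this system are clarified through the following examples, in which the cost engine is referred to as the TCO Engine and the monitoring and analytics functions as the Monitoring Engine.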
Example 1: “Increased operational cost due to aging of hardware components”.
Suppose that the Monitoring Engine detects a 10% increase in the failure rate of servers in a certain segment of the data center. After three years, the server failure rate can increase (see this IDC White Paper), causing maintenance and support costs to increase. The Monitoring Engine sends an event to the TCO Engine notifying it of the increase in failure rate. The TCO Engine finds that a hardware refresh is already scheduled for six months in the future. However, it determines that replacing the servers now (six months earlier than planned) can decrease the TCO by saving on failure recovery for workloads and on annual maintenance.
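To make the trade-off concrete, here is a small worked comparison in Python of the six-month cost of keeping the aging servers versus refreshing them now. Every figure (server count, failure rates, recovery and maintenance costs, and the penalty for buying six months early) is invented for illustration, and the cost model is a deliberately simplified assumption rather than the TCO Engine's actual model.

```python
MONTHS = 6  # time remaining until the originally scheduled refresh


def cost_keep(servers: int, monthly_failure_rate: float,
              recovery_cost: float, monthly_maintenance: float) -> float:
    """Expected cost of running the old servers until the planned refresh."""
    expected_failures = servers * monthly_failure_rate * MONTHS
    return expected_failures * recovery_cost + servers * monthly_maintenance * MONTHS


def cost_refresh_now(servers: int, monthly_failure_rate: float,
                     recovery_cost: float, monthly_maintenance: float,
                     early_purchase_penalty: float) -> float:
    """Same period on new servers, plus the cost of bringing the purchase forward."""
    running = cost_keep(servers, monthly_failure_rate, recovery_cost, monthly_maintenance)
    return running + early_purchase_penalty


# Hypothetical figures for a segment of 100 servers.
keep = cost_keep(100, monthly_failure_rate=0.05,
                 recovery_cost=2_000.0, monthly_maintenance=50.0)
refresh = cost_refresh_now(100, monthly_failure_rate=0.01,
                           recovery_cost=2_000.0, monthly_maintenance=20.0,
                           early_purchase_penalty=40_000.0)

print(f"keep old servers: {keep:,.0f}")
print(f"refresh now:      {refresh:,.0f}")
print("recommend early refresh" if refresh < keep else "keep the planned schedule")
```

With these made-up inputs the early refresh comes out cheaper (roughly 64,000 versus 90,000 over the six months), but a higher early-purchase penalty or a smaller rise in failure rate would reverse the recommendation, which is why the decision has to be recomputed from measured data.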
The TCO Engine sends a recommendation to the Hardware Dimensioning Engine, which calculates the type and number of new servers to be purchased. However, the recommended hardware needs to be assessed from a cost viewpoint to make sure that it minimizes TCO. Therefore, the new hardware list is fed back into the TCO Engine to find the cheapest hardware replacement option. This process can be repeated until no further improvement in TCO is possible, resulting in a so-called Pareto-optimal solution. It should be noted that the planning performed by the TCO Engine is better informed than the initial planning, since performance parameters measured in the operational data center are now available and are used by the TCO Engine while modifying the hardware proposal. The results are presented to the system administrator for a final decision.
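The feedback loop between the two engines can be sketched as a simple iterative refinement: dimension a proposal, cost it, and keep any alternative that lowers TCO until nothing improves. The Python below is a minimal sketch under stated assumptions; the server catalogue, its prices and capacities, the three-year horizon and all function names are hypothetical, and a real system would search a much larger configuration space.

```python
import math
from typing import Tuple

# Hypothetical catalogue: model -> (capacity units per server,
#                                   purchase price, yearly running cost)
CATALOGUE = {
    "small": (10, 3_000.0, 900.0),
    "medium": (25, 6_500.0, 1_800.0),
    "large": (60, 14_000.0, 4_500.0),
}


def dimension(model: str, required_capacity: int) -> int:
    """Hardware dimensioning step: servers of this model needed for the load."""
    capacity, _, _ = CATALOGUE[model]
    return math.ceil(required_capacity / capacity)


def tco(model: str, units: int, years: int = 3) -> float:
    """TCO step: purchase price plus running cost over the planning horizon."""
    _, price, running = CATALOGUE[model]
    return units * (price + running * years)


def refine(required_capacity: int, start_model: str) -> Tuple[str, int, float]:
    """Repeat dimensioning and costing until no candidate lowers the TCO."""
    best_model = start_model
    best_units = dimension(best_model, required_capacity)
    best_cost = tco(best_model, best_units)
    improved = True
    while improved:
        improved = False
        for model in CATALOGUE:
            units = dimension(model, required_capacity)
            cost = tco(model, units)
            if cost < best_cost:
                best_model, best_units, best_cost = model, units, cost
                improved = True
    return best_model, best_units, best_cost


print(refine(500, "small"))  # ('medium', 20, 238000.0)
```

Because the loop only accepts strict improvements, it always terminates; with a larger search space it would stop at a proposal that no single change can make cheaper, which is the stopping condition the text describes.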
Example 2: “Resource allocation policy based on instance energy cost”.
Many policy engines are built on the assumption that data center power cost is minimized by packing as many workloads of different types as possible onto the same server and keeping the rest of the servers on standby. However, such a policy might not always result in a total cost reduction. Cooling system power consumption does not follow a linear trend, meaning that many servers running at low temperature could consume less cooling power than a few fully utilized servers running at high temperature, as the latter need more cooling power to keep the server temperature at an acceptable level. In addition, running a few servers at a higher temperature might result in shorter hardware lifetime and more frequent hardware replacement cycles, increasing TCO. A cost-integrated data center management system can detect and mitigate these risks.
The Monitoring Engine continuously measures power usage for both the IT infrastructure (servers, switches, storage units) and the cooling system. This information is fed to the TCO Engine, where the running cost is calculated and a decision is made on how to handle the cooling load, based on the calculated TCO. The decision could be either to turn on another chilling tower to cool down fully utilized servers, or to distribute the workload among more servers running at lower temperature, depending on which option reduces total cost. The two policies need to be balanced, which is difficult without a holistic view of the data center's operational parameters and their implications for operational cost. Instead of having a fixed resource allocation policy, the TCO Engine can decide on the best policy at each instant in time based on the current operational cost of the data center.
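A toy model in Python shows why neither full consolidation nor full spreading is automatically cheapest. It assumes, purely for illustration, that server power grows linearly with utilization while the cooling power needed per server grows quadratically; the coefficients and function names are hypothetical, and effects such as hardware lifetime are left out.

```python
import math


def server_power_w(util: float, idle_w: float = 100.0, peak_w: float = 300.0) -> float:
    """Assumed linear IT power draw between idle and peak."""
    return idle_w + (peak_w - idle_w) * util


def cooling_power_w(util: float, coeff_w: float = 400.0) -> float:
    """Assumed non-linear (quadratic) cooling demand per server."""
    return coeff_w * util ** 2


def total_power_w(total_load: float, active_servers: int) -> float:
    """IT plus cooling power when the load is spread evenly over the active servers."""
    util = total_load / active_servers
    if util > 1.0:
        raise ValueError("not enough active servers for this load")
    return active_servers * (server_power_w(util) + cooling_power_w(util))


def best_server_count(total_load: float, max_servers: int) -> int:
    """Pick the number of active servers that minimizes total power."""
    candidates = range(math.ceil(total_load), max_servers + 1)
    return min(candidates, key=lambda n: total_power_w(total_load, n))


load = 4.0  # total work, in units of one fully utilized server
for n in (4, 8, 10):
    print(n, "servers:", round(total_power_w(load, n)), "W")
print("best:", best_server_count(load, max_servers=10))  # best: 8
```

Under these assumptions, packing the load onto 4 servers costs about 2800 W, spreading it over all 10 costs about 2440 W, and the minimum (about 2400 W) lies in between at 8 servers. The optimum shifts whenever the load or the cooling behavior changes, which is the argument for re-evaluating the policy continuously instead of fixing it in advance.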