The fundamentals of highly available cloud services
Companies claim that their cloud services are accessible by users anywhere and anytime. But are they really available 24 hours a day, 365 days per year? A study from the Ponemon Institute shows that the average cost of a datacenter outage is around USD 740,357, a significant loss for both cloud providers and users. How much interruption of this sort can users tolerate? What are the main sources of failure in a datacenter? How much extra redundancy is needed? What are the existing mechanisms for offering the desired level of service availability?
The study referenced above can be found here.
Cloud service providers need to understand the main sources of failure, have enough redundant hardware, have mitigation plans, and be able to predict, detect and fix failures in every operational layer. And despite the many advantages of cloud services, a study by Peak 10 shows that cloud reliability is still one of the top 10 challenges service providers encounter.
This post addresses concerns about service availability by providing a brief overview of the most important factors to consider when offering a highly available cloud.
The main sources of failure in datacenters
Service interruptions in datacenters can happen due to planned or non-planned outages. Planned outages are related to system tests and preventive maintenance events that are scheduled during specific periods of the year to ensure system performance. Though planned outages are known in advance, without redundancy and proper planning the service will still be unavailable for the duration of the interruption. Non-planned failures, on the other hand, can happen at any time and anywhere in the datacenter and might disrupt service for a long time. Addressing non-planned failures is therefore a very important issue for cloud providers.
The main sources of outages in datacenters are as follows (see figure 1): infrastructure failures, planning mistakes, human errors (misconfiguration), software failures, cyber-attacks, preventive maintenance and tests.
Figure 1: Main failure sources in a datacenter.
Each type of outage requires a specific planning and mitigation strategy and has a different impact on service availability. For example, a major failure in the power system can lead to an interruption in the whole datacenter if no redundant components are installed, while a CPU failure impacts only applications executing on top of the server hosting the CPU. In short, in order to have a highly available system in which no single failure or mistake can cause an unacceptable service disruption, a combination of various protection mechanisms needs to be implemented.
There are three broad solution categories for mitigating unplanned outages in datacenters: hardware redundancy, proactive failure management, and reactive failure management.
- Hardware redundancy:
Reliable hardware architecture planning and infrastructure design is the basis of any service: without redundant network paths and hardware resources, it is not possible to offer highly available services at the software level alone. However, hardware redundancy requires extra investment. An important challenge for datacenter providers is therefore to understand exactly how much redundant hardware, combined with software-level mechanisms, is needed to meet the availability requirements while keeping investment low.
- Proactive failure management:
In addition to provisioning redundancy, you can try to anticipate failures just ahead of their occurrence and perform proactive countermeasures. This is done mostly in the software layer at runtime. Some of the available mechanisms in this category are software rejuvenation, self-healing and preemptive migration, among other solutions.
- Reactive failure management:
In cases where it is not possible to predict failures, the only option is to deal with failures when they occur. Some well-known mechanisms in this category are checkpointing, replication, job migration and task resubmission.
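Task resubmission, one of the software-level mechanisms named in the list above, can be illustrated with a minimal Python sketch. The function name, the attempt limit and the linear backoff are assumptions made for this illustration; a real scheduler would typically resubmit the task to a different node rather than simply call it again.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_resubmission(
    task: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> T:
    """Run `task`; if it raises, resubmit it, up to `max_attempts` attempts in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as error:  # any failure triggers a resubmission
            last_error = error
            if attempt < max_attempts:
                # Wait a little longer before each new attempt (hypothetical policy).
                time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```

The sketch only masks transient faults. A fault that persists across all attempts still surfaces as an error, which is why resubmission is usually combined with other mechanisms such as replication or migration.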
Major steps in planning highly available cloud services
To offer truly highly available cloud services, a combination of several of the aforementioned mechanisms is needed. For example, you need some level of redundancy in power and cooling infrastructure, but it should be complemented with a set of software mechanisms based on the availability requirements, type of applications and services that are going to be served by the datacenter.
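As a rough illustration of how the amount of redundancy relates to an availability target, the following Python sketch computes how many identical copies of a component are needed. It assumes that copies fail independently and that one working copy is enough, which is a simplification; the function names and the example figures are hypothetical.

```python
from typing import Optional


def redundant_availability(availability: float, copies: int) -> float:
    """Availability of `copies` independent, identical units when any one suffices."""
    return 1 - (1 - availability) ** copies


def copies_needed(availability: float, target: float, max_copies: int = 10) -> Optional[int]:
    """Smallest number of copies that reaches `target`, or None if `max_copies` is not enough."""
    for copies in range(1, max_copies + 1):
        if redundant_availability(availability, copies) >= target:
            return copies
    return None


# Hypothetical component that is up 99 percent of the time.
print(copies_needed(0.99, 0.99995))
```

Under these assumptions each extra copy multiplies the unavailability by the unavailability of a single unit, so the gains are large at first and shrink quickly. This is one reason why full duplication everywhere is rarely the cheapest way to reach a target.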
Here is a set of major steps towards offering highly available services:
- Define the desired total datacenter availability/tier level.
- Find the suitable system, architecture and components for power and cooling infrastructure that fulfill defined availability and cost requirements.
- Find the amount of required redundancy, as well as the availability level, for IT hardware (answering questions such as: Is it beneficial to have expensive carrier-grade components with a low failure rate, or a large number of commodity hardware components that can do the work more cheaply?)
- Decide on fault-tolerant software schemes to deploy on top of the hardware, both in software platforms and applications.
- Deploy, integrate, test, monitor and run.
While following the above steps, it is important to keep in mind the amount of investment needed to have the service up and running. It is always easy to deploy a highly available system by fully duplicating all resources and adding various types of fault-tolerant software with large overheads, but the cost is prohibitive. Hence there should be a tradeoff between the investment cost and the level of planned redundancy in the datacenter. Reaching such a goal requires a good understanding of the major failure points and design factors (with the help of models and available tools). In this way, cloud providers can avoid either overspending on fully duplicated systems where they are not necessary or losing money on systems with lower availability levels than expected. Below we provide further insight into this tradeoff.
Datacenter infrastructure design based on availability requirements
Uptime Institute has developed standards (see also the Telecommunications Industry Association standard TIA-942) that define levels of redundancy within datacenters through four tiers. Tier I has the most basic datacenter architecture with no built-in redundancy, providing 99.671 percent availability. Tier II requires some redundant components, but still has a single path for cooling and power distribution. Tier III offers a concurrently maintainable infrastructure and an availability of 99.982 percent. Tier IV datacenters are fully fault-tolerant, offering 99.995 percent availability, which is needed for some critical applications such as telco cloud.
From the availability point of view, it is desirable to implement a Tier IV datacenter. However, the expense of reaching that level of uptime is a critical factor that often forces datacenter owners to sacrifice some of the redundancy for cost. Based on a study published by the Data Center Knowledge website, a Tier IV deployment can cost twice as much as a Tier II datacenter. It is therefore important to design a datacenter with sufficient redundancy that matches the requirements of the cloud offering, but without overspending.
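To put these percentages in perspective, the sketch below converts an availability figure into expected downtime per year. It assumes a 365-day year and treats availability as the long-run fraction of time the datacenter is up. The function name is illustrative, and only the tier figures quoted above are used.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


tier_availability = {"Tier I": 99.671, "Tier III": 99.982, "Tier IV": 99.995}
for tier, percent in tier_availability.items():
    print(f"{tier}: {annual_downtime_hours(percent):.2f} hours/year")
# Tier I: 28.82 hours/year
# Tier III: 1.58 hours/year
# Tier IV: 0.44 hours/year
```

Seen this way, the step from Tier I to Tier IV is the difference between more than a day of downtime per year and less than half an hour.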
To address this challenge, we have developed a set of models and a simulation tool that can help datacenter owners understand the dynamics of various infrastructure failures and allow them to estimate the availability of a given architecture and service.
We have used Uptime Institute recommendations to model different levels of redundancy in datacenter subsystems (Tier I to Tier IV), assuming three main subsystems: power infrastructure, cooling infrastructure, and IT infrastructure. Each subsystem can follow a distinct tier design according to the TIA 942 standard (see figure 2).
Figure 2: Datacenter infrastructure with three subsystems.
- Power system:
IT infrastructure needs power facilities with enough capacity to operate properly. Hence, faults in the power system directly affect overall datacenter availability. The main components of such power systems include an alternate power supply, a transfer switchgear (also called an ATS, or Automatic Transfer Switch), a UPS (Uninterruptible Power Supply) system, and a PDU (Power Distribution Unit). Both primary and secondary power sources are connected to an ATS. The ATS provides input for the cooling and UPS systems. The UPS system routes power to the PDU (rack socket for cabinets). Lastly, the PDU distributes electrical energy to the IT infrastructure.
In this study, we modeled a generic datacenter power system for Tier I to Tier IV using stochastic Petri nets (SPN) to evaluate the availability of power systems with different levels of redundancy, considering the architectures shown below (see figure 3).
Figure 3a: Power system Tier I architecture
Figure 3b: Tier IV architecture
The simulation results showed very little difference in availability when moving from Tier I (99.94 percent) to Tier II (99.97 percent), where redundant generators and UPS are installed. This means that higher-tier implementations and more redundancy are needed for critical applications with high availability requirements.
- Cooling subsystem:
Like the power system, cooling system availability and maintainability are fundamental to proper datacenter operation. A typical datacenter cooling system relies on a variety of cooling components, including CRAC (Computer Room Air Conditioning) units, chillers, cooling towers, piping, pumps, heat exchangers, and water treatment systems.
As a continuation of our study of datacenter availability, we also modeled the cooling system (see figure 4) using SPN and assessed its availability. In contrast to the power system, partial redundancy brings a considerable improvement in availability for cooling systems: deploying a Tier II model, with a duplicated chiller and cooling tower and one extra CRAC for every five CRACs, improves availability by two nines compared with Tier I, from 99.43 percent to 99.99 percent. This means that combining a Tier II cooling system with a Tier III/IV power system is enough to provide four nines of overall availability in a datacenter, if the availability level of the IT system allows.
Figure 4: Cooling system Tier I (left) and Tier IV (right) architecture.
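The effect of adding one spare CRAC for every five can be approximated with a k-out-of-n calculation, sketched below in Python. This is not the SPN model used in the study: it assumes that units fail independently and that five working CRACs are enough, and the per-unit availability is a hypothetical figure chosen only for illustration.

```python
from math import comb


def k_out_of_n_availability(k: int, n: int, availability: float) -> float:
    """Probability that at least k of n independent units are up at the same time."""
    return sum(
        comb(n, up) * availability**up * (1 - availability) ** (n - up)
        for up in range(k, n + 1)
    )


crac_availability = 0.99  # hypothetical availability of a single CRAC unit

without_spare = k_out_of_n_availability(5, 5, crac_availability)
with_spare = k_out_of_n_availability(5, 6, crac_availability)
print(f"5 of 5 CRACs: {without_spare:.4f}")
print(f"5 of 6 CRACs: {with_spare:.4f}")
```

Even with this simplified model, a single spare unit removes most of the unavailability of the CRAC group, which is consistent with the observation that partial redundancy pays off well in the cooling system.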
- IT subsystem:
The IT subsystem is the main component of every datacenter and mostly consists of servers, storage, and networking devices, as well as the software layer. Servers are mounted within racks and consist of hardware resources (such as CPUs, NICs, and RAM) that host applications. All the data generated by these applications is stored in storage systems. Networking equipment manages the internal communication between servers and storage systems as well as all the data flow to and from the datacenter. In a highly available datacenter, some amount of redundancy is necessary in all components. For example, in a Tier IV implementation, every single component in the networking layer, as well as the servers, is duplicated.
We have further extended our study by proposing SPN models for the IT subsystem implementing Tier I to Tier IV (see examples of the architectures in figure 5). The evaluation results show that although a Tier IV hardware implementation makes it possible to reach four nines of availability, adding the software layer on top of it without any fault-tolerant mechanism for the application means 99.9 percent availability at best, even with full duplication of the power and cooling systems (see Table 1). This result was also confirmed through a sensitivity analysis of IT system availability, which showed that the software layer has a large impact on service availability.
Figure 5: IT subsystem Tier I (left) and Tier IV (right) architecture.
We have integrated all the proposed models of three subsystems into one tool with which it is possible to estimate the total availability of a datacenter with distinct tier levels for each subsystem (see some values in Table 1). This tool allows datacenter operators to play with input parameters for each component such as server, UPS and CRAC, and see the impact of changes on the overall availability of a service running in the cloud. It is also possible to do a sensitivity analysis to detect the components that have a high impact on service availability. These results can be useful for datacenter owners to gain insight into how to increase service availability with minimum possible investment.
Table 1: Selected simulation results showing the availability.
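The idea of combining subsystem availabilities and then checking which one matters most can be sketched in a few lines of Python. This is not the SPN-based tool described above. It assumes a simple series model in which the service is up only when power, cooling, IT hardware and software are all up, and that the subsystems fail independently. The power, cooling and IT hardware figures are taken from the text, while the software figure and all names are assumptions made for illustration.

```python
from math import prod
from typing import Dict


def total_availability(subsystems: Dict[str, float]) -> float:
    """Series model: the service is up only when every subsystem is up."""
    return prod(subsystems.values())


def gain_if_hardened(subsystems: Dict[str, float], factor: float = 10.0) -> Dict[str, float]:
    """Gain in total availability when one subsystem's unavailability shrinks by `factor`."""
    baseline = total_availability(subsystems)
    gains = {}
    for name, availability in subsystems.items():
        hardened = dict(subsystems)
        hardened[name] = 1 - (1 - availability) / factor
        gains[name] = total_availability(hardened) - baseline
    return gains


datacenter = {
    "power": 0.9997,        # Tier II power figure quoted in the text
    "cooling": 0.9999,      # Tier II cooling figure quoted in the text
    "it_hardware": 0.9999,  # four nines, as reported for Tier IV IT hardware
    "software": 0.999,      # assumed value for software without fault tolerance
}

print(f"total: {total_availability(datacenter):.5f}")
for name, gain in sorted(gain_if_hardened(datacenter).items(), key=lambda item: -item[1]):
    print(f"{name}: +{gain:.5f}")
```

In a series model the total can never exceed the weakest element, so the subsystem with the largest unavailability shows the largest gain when it is hardened.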
Our studies show that in order to have a highly available datacenter, redundancy in the power system is necessary (at least Tier III), while having less reliable components in the cooling (Tier II) and IT subsystem can be tolerated by deploying fault-tolerant software, which allows datacenter owners to avoid extra investment in hardware redundancy. This implies that, even though having a proper infrastructure design is the first important step towards having a highly available cloud service, it is not enough, as the applications and software running on top of hardware also have a high failure rate. Therefore, offering fault-tolerant mechanisms in the software layer is as important as hardware redundancy for end-to-end reliability of cloud services.
Our studies confirmed the need for closer study of servers, including both the hardware and software layers. Especially with new types of hardware architectures (for example, disaggregated non-volatile memory that can be accessed remotely and shared across different systems), there is even more need for detailed models of end-to-end system availability.
In the current server-based architecture, the failure of any component within a server leads to the failure of the whole server. However, if servers can be composed of devices spread across different chassis (for example, disaggregated NVM), their failures are independent of each other. With disaggregated hardware devices, redundancy can be provided at the component level, in contrast to the full duplication of the whole server in conventional hardware models. However, precise modeling of the system is required to fully analyze and understand the possible benefits and impacts of this new infrastructure paradigm on the availability of cloud services.
The same concept applies to fault-tolerant mechanisms in the software layer, meaning that they need to be updated or adapted to benefit from the disaggregated hardware architecture underneath. Some examples of such modifications:
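A back-of-the-envelope comparison of the two redundancy styles can be written as a short Python sketch. It is not the precise model called for above: it assumes independent component failures and perfect failover, and the component availabilities are hypothetical.

```python
from math import prod


def duplicated(availability: float) -> float:
    """Availability of two independent copies when either one is enough."""
    return 1 - (1 - availability) ** 2


# Hypothetical per-component availabilities for one server.
components = {"cpu": 0.999, "memory": 0.999, "nic": 0.995}

# Conventional server: any component failure takes the whole server down.
single_server = prod(components.values())

# Conventional redundancy: duplicate the whole server.
server_level = duplicated(single_server)

# Disaggregated redundancy: duplicate each component independently.
component_level = prod(duplicated(a) for a in components.values())

print(f"single server:    {single_server:.6f}")
print(f"server-level:     {server_level:.6f}")
print(f"component-level:  {component_level:.6f}")
```

With the same number of spare parts, component-level redundancy comes out ahead in this idealized model, because a working CPU in one copy can be combined with a working NIC from the other. Real systems add interconnect and orchestration failure modes that the sketch ignores.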
- Partial virtual machine (VM) migration: instead of full VM migration in case of a failure, it might be beneficial to do a partial migration in a disaggregated environment by only replacing the failed component with a new one from the same or neighboring pool.
- Resource allocation based on availability: how you allocate resources to a specific job request in a disaggregated environment can have a high impact on service availability and should be integrated as part of the resource allocation algorithms.
- Primary and backup VM planning: while allocating resources to primary and backup VMs, it could be beneficial to know how to combine resources from different pools/chassis to reach maximum uptime with lower resource overhead.
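As one possible reading of the last two items, the Python sketch below picks a primary and a backup resource from different pools so that a single pool failure cannot take out both. The data structure, the scoring rule and the example values are all hypothetical; an actual allocation algorithm would also have to consider capacity, latency and cost.

```python
from itertools import combinations
from typing import List, NamedTuple, Optional, Tuple


class Resource(NamedTuple):
    name: str
    pool: str            # pool or chassis the resource belongs to
    availability: float  # hypothetical availability of the resource


def pick_primary_and_backup(
    candidates: List[Resource],
) -> Optional[Tuple[Resource, Resource]]:
    """Return the pair from different pools with the highest combined availability."""
    best_pair = None
    best_score = -1.0
    for first, second in combinations(candidates, 2):
        if first.pool == second.pool:
            continue  # a shared pool is a common point of failure
        # Assumes independent failures: the pair is down only if both are down.
        score = 1 - (1 - first.availability) * (1 - second.availability)
        if score > best_score:
            best_pair, best_score = (first, second), score
    if best_pair is None:
        return None
    # Use the more available resource of the pair as the primary.
    primary, backup = sorted(best_pair, key=lambda r: r.availability, reverse=True)
    return primary, backup


candidates = [
    Resource("r1", "pool-a", 0.999),
    Resource("r2", "pool-a", 0.998),
    Resource("r3", "pool-b", 0.990),
]
primary, backup = pick_primary_and_backup(candidates)
print(primary.name, backup.name)
```

In this example the two most available resources share a pool, so the sketch pairs the best one with a less available resource from another pool instead.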
With the growing interest of industries in migrating to cloud services as their primary IT source, increasing service availability is more vital than ever for cloud providers. This means that more and more effort should be directed towards developing suitable methods, tools, and mechanisms that support high availability in the cloud, especially with the appearance of new IT architectures such as hardware disaggregation, where the challenge is twofold.
This study was done in collaboration with a team led by Professor Djamel Sadok from Universidade Federal de Pernambuco in Brazil.
For further details about the models and this study, there is more to read:
- Highly available clouds: system modeling, evaluations and open challenges, book chapter in Springer, 2017.
- Analyzing the IT subsystem failure impact on availability of cloud services, IEEE symposium on computers and communications (ISCC), 2017.
- Evaluating the cooling subsystem availability on a Cloud data center, IEEE symposium on computers and communications (ISCC), 2017.
- Minimizing and managing cloud failures, IEEE Journals & Magazines, Volume: 50, Issue: 11, Pages: 86 – 90, 2017.
- Modeling and analyzing power system failures on cloud services, IEEE international conference on network and service management (CNSM) 2017.
- How to improve cloud services availability? Investigating the impact of power and IT subsystems failures, Hawaii international conference on system sciences (HICSS), 2018.