Maximizing COTS hardware performance with CPU memory management
What can be done to squeeze maximum performance out of commercial off-the-shelf (COTS) hardware? We've scrutinized this question together with researchers from the Royal Institute of Technology (KTH) in Stockholm and have some breakthrough results.
Cloud computing performance
Industry digitalization, the Internet of Things (IoT), and new services enabled by 5G are a few examples of application areas that put great demands on cloud computing. Since many of these applications also require low latency and predictable service times, the cloud infrastructure must become more efficient, deliver faster response times, and offer more predictable services to meet these requirements. Aligned with this increase in demand, we are seeing the dawn of hundred-gigabit-per-second link speeds, which makes it easier for large enterprises to handle the exponential growth in their data center (DC) traffic by using state-of-the-art networking equipment. However, introducing faster link speeds does not necessarily guarantee faster and more predictable service times. Because networking is tightly coupled with processing, faster links expose processing elements, such as servers and networking equipment, to packets at a higher rate. For instance, a server receiving 64 B packets at a link rate of 100 Gbps has only 5.12 nanoseconds to process each packet before the next one arrives. This is very little time for current COTS-based hardware, which means that offering low latency and predictable service times also requires faster processing elements.
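The 5.12-nanosecond budget above follows directly from the packet size and the link rate. A back-of-the-envelope sketch in C (counting only the 64 bytes of frame data; Ethernet preamble and inter-frame gap would add a few more nanoseconds per packet):

```c
/* Per-packet time budget: packet size in bits divided by the link rate.
 * bytes * 8 = bits; bits / (Gbit/s) conveniently yields nanoseconds. */
static double per_packet_ns(double packet_bytes, double link_gbps) {
    return packet_bytes * 8.0 / link_gbps;
}

/* Example: per_packet_ns(64, 100) yields 5.12 ns, the figure above;
 * at 40 Gbps the same packet would leave 12.8 ns. */
```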
Looking at the evolution of processors, we observe that CPU speed doubled roughly every two years for decades, in line with Moore's law. Since the early 21st century, however, this growth has slowed significantly. In addition, the memory wall, that is, the gap between the speed of the CPU and the speed of memory accesses outside the CPU chip, has emerged as a major bottleneck for achieving high performance in computer systems.
Putting these factors together, we see it’s hard for processor performance to keep up with the increase in networking link speeds.
Maximizing the performance of COTS-based hardware
There have been several efforts to address this issue, for example by introducing and employing new hardware features. However, we believe much can be done to get better performance out of existing COTS-based hardware. It is essential to carefully assess the hardware and exploit every opportunity to optimize both hardware and software. Therefore, in collaboration with researchers from the Royal Institute of Technology (KTH) in Stockholm, Sweden, we have scrutinized the architecture of widely used processors and come up with a memory management solution, called slice-aware memory management, which takes advantage of CPU characteristics to improve the performance of existing processors. We applied our solution to time-critical Network Function Virtualization (NFV) service chains running at 100 Gbps and improved their performance by up to 21.5%. We believe this is a great accomplishment, as it makes current COTS-based hardware faster and more predictable. Read on for a short overview of this research, and head over to our paper, Make the Most out of Last Level Cache in Intel Processors, for full details.
Last level cache slice-aware memory management
To tackle slow memory accesses, one can focus on the memory hierarchy. Current processors are equipped with a hierarchy of fast on-die memories, known as the cache hierarchy. It is usually implemented in three levels: the first two levels are private to each core, while the last level is shared among all cores. Due to the fabrication process of CPUs, the architecture of the Last Level Cache (LLC) has recently become non-uniform: the LLC is realized as multiple slices, interconnected by some topology. To increase the effective memory bandwidth of the LLC, CPU vendors map different parts of the memory address space uniformly across these slices. However, the latency of accessing a slice depends on its distance from the requesting core. In our paper, we quantify the characteristics of this non-uniform cache architecture (NUCA) for two recent processor architectures and show that accessing data stored in a closer slice can reduce access latency by almost a factor of two. Building on this, we introduce a new memory management scheme, called slice-aware memory management, which carefully maps allocated memory to slices based on their access latency, rather than the de facto scheme that maps it uniformly. Using our scheme, applications can improve their performance by up to 20%. The following figure shows the difference between slice-aware memory management and normal memory management, in which the colored boxes show the position of data in the LLC.
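The slice-selection function itself is not documented by CPU vendors; reverse-engineering studies model it as one parity bit per slice-index bit, each computed over a subset of the physical-address bits. A minimal sketch of a function of that shape in C, with illustrative bit masks that are not authoritative for any particular CPU:

```c
#include <stdint.h>

/* Illustrative LLC slice-selection hash. The real function is
 * undocumented; published reverse-engineering models it as parities over
 * subsets of physical-address bits. The masks below are illustrative
 * values of that shape only, not the mapping of any specific processor. */
static unsigned slice_of(uint64_t paddr, unsigned n_index_bits) {
    static const uint64_t mask[3] = {
        0x1B5F575440ULL, /* parity of these address bits -> index bit 0 */
        0x2EB5FAA880ULL, /* -> index bit 1 */
        0x3CCCC93100ULL  /* -> index bit 2 (up to 8 slices) */
    };
    unsigned slice = 0;
    for (unsigned b = 0; b < n_index_bits && b < 3; b++)
        slice |= (unsigned)__builtin_parityll(paddr & mask[b]) << b;
    return slice;
}
```

Under a model like this, slice-aware memory management becomes a filtering problem: scan candidate cache lines, keep those whose hash equals a low-latency slice for the core in question, and serve allocations from them.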
Furthermore, we employ this scheme to propose a more efficient network I/O solution, CacheDirector, which reduces I/O latency. CacheDirector places packet headers in the LLC slice that is closest to the relevant processing core. By doing so, it can reduce the tail latency of time-critical NFV service chains running at 100 Gbps by up to 21.5%. The effectiveness of slice-aware memory management is, however, not limited to CacheDirector. Other applications, for example in-memory key-value stores, can use this memory management scheme to improve their performance. Additionally, slice-aware memory management can be used to mitigate the noisy neighbor effect and to realize isolation as well as cache partitioning. All in all, slice-aware memory management unlocks a hidden potential of the LLC, which can be used to improve the performance of applications and bring greater predictability to the cloud.
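CacheDirector's placement idea can be sketched as follows: among the cache lines of a packet buffer, pick one whose slice mapping equals the core's nearest slice, and start the packet header there. The sketch below uses a stand-in hash (`toy_slice`), since the real mapping is CPU-specific and undocumented; `header_offset` and `demo` are hypothetical helpers for illustration only:

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in slice hash over cache-line addresses (8 slices). The real
 * mapping is CPU-specific and undocumented; this is illustrative only. */
static unsigned toy_slice(uintptr_t addr) {
    return (unsigned)(((addr >> 6) ^ (addr >> 9)) & 7);
}

/* Return an offset into buf (a multiple of 64 B) whose cache line maps
 * to target_slice, or -1 if none exists. The packet header would be
 * written starting at that offset so it lands in the nearest slice. */
static ptrdiff_t header_offset(const char *buf, size_t len,
                               unsigned target_slice) {
    uintptr_t base = (uintptr_t)buf;
    for (size_t off = 0; off + 64 <= len; off += 64)
        if (toy_slice(base + off) == target_slice)
            return (ptrdiff_t)off;
    return -1;
}

static char pkt_buf[4096];

/* Check that a header placed at the chosen offset really lands in the
 * requested slice (under the toy hash). */
static int demo(unsigned target_slice) {
    ptrdiff_t off = header_offset(pkt_buf, sizeof pkt_buf, target_slice);
    return off >= 0 &&
           toy_slice((uintptr_t)pkt_buf + (size_t)off) == target_slice;
}
```

The trade-off is a small amount of wasted buffer space before the chosen offset in exchange for the header being served from the lowest-latency slice.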