A positive move from static to adaptive and dynamic system observability
- Dynamic system observability intelligently collects more of the useful data while actively avoiding the collection of unnecessary data.
- This translates into a compelling service that is near-invisible in normal conditions and saves both costs and resources.
- To achieve optimal observability, a clear data-collection strategy should be designed, tailored to the system being observed.
Observability is described as the ability to measure the internal state of a system by examining its outputs. There are three main ways to do this:
1. Monitoring: Generally a “black box” method which collects measurements from the system without affecting its behavior.
2. Observability: Goes one step further by peeking under the hood to collect trace and log information as well.
While monitoring and observability can collect information on a 24/7 basis, the collection process should not change the system's behavior, such as its performance.
3. Debugging: A “white box” method that usually interferes with the application or system, causing it to slow down or change its run-time behavior in some way.
Monitoring can be considered a subset of observability because it involves the collection of monitored data. However, it is more difficult to draw a line between observability and debugging. Observability can implement debugging features to collect detailed information about the running system, but it should not cause the system to change its behavior or lose performance. This is only possible when the data collection is limited in time and scope.
Dynamic system observability enables the collected dataset to vary over time and allows different datasets to be collected from different sources. It also allows the observability to adjust the collection process based on the problem in question and the available resources. The fewer resources available, the harder it is to collect details without interfering with the running system.
For example, to collect detailed system observability data from an edge node without changing the node behavior, the amount of collected data and the duration of collection need to be limited. Being able to define the exact node or a network connection to be observed allows for the collection of detailed information without observability flooding the network or causing too much extra resource contention.
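As a minimal sketch of this idea (all names here are hypothetical, not from any real collector API), a detailed collection request might carry an explicit budget that bounds both its duration and its data volume:

```python
from dataclasses import dataclass

@dataclass
class CollectionScope:
    """Bounds a detailed collection so it cannot flood the network."""
    node: str          # the exact node (or connection) to observe
    max_bytes: int     # hard cap on the volume of collected data
    duration_s: float  # collection stops after this many seconds

def within_budget(scope: CollectionScope, elapsed_s: float, collected_bytes: int) -> bool:
    """Return True while the detailed collection may continue."""
    return elapsed_s < scope.duration_s and collected_bytes < scope.max_bytes

# Example: observe one edge node for at most 30 seconds or 64 kB of data.
scope = CollectionScope(node="edge-7", max_bytes=64_000, duration_s=30.0)
```

The collector would check the budget on every sample and stop as soon as either limit is reached, which keeps the worst-case resource cost of a detailed collection predictable.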
The ideal goal for observability is to see all the details in a system and report only those incidents that need to be addressed by maintenance. What would make the business case even more compelling is if no resources were used to achieve these results, making observability cost free while also leaving the observed system unaffected.
Of course, implementing this goal is as difficult as building a perpetual motion machine. However, even if it is virtually impossible to reach, it is worthwhile to explore the concept of ideal system observability.
Since the “get all for free” approach is not realistic, observability cannot collect everything all the time. As a result, we are faced with a typical “one size never fits all” problem where the set of most useful metrics is system dependent, and observability needs to be adapted to the system being observed. Going a bit further with this train of thought, the optimal set of observability metrics may vary over time depending on the usage or the state of the system.
To get closer to the goal, observability needs to be tailored to the system it is observing. A medical instructor once said: “If you don’t know what to do with measured values, don’t bother to measure at all.” To collect only “useful” data, a strategy needs to be designed for system observability, one that answers questions like: Based on the collected dataset, what decisions can be made? What, when and from where do we need to collect data to enable fact-based decision making?
A good analogy is a person gathering firewood in a forest who has already picked up as much as they can carry. To take a new, better piece of wood, the only solution is to drop some of what they are already carrying to make room for it.
The same idea can be used for dynamic system observability. Turning off the collection for part of the dataset would release resources to collect more relevant, detailed information without increasing resource usage. If dynamic observability enables extension – for example using plugins – it would allow the collection of application or hardware specific information.
Collecting such detailed information would very likely result in a lot of data and the collection process would ultimately decrease the performance of the measured system. These detailed measurements should thus only be run when they are needed according to the defined strategy and in a meaningful scope. For example, a collector following CPU/process scheduling, CPU power state, etc. would provide far too much information unless data collection is limited to the specific CPUs that are of particular interest.
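A hedged sketch of that limiting step (the event format and function names are illustrative, not a real tracing API): the collector discards everything outside the CPUs of interest before the data ever leaves the node.

```python
# Hypothetical event filter: keep only scheduler events from the CPUs of
# interest, discarding everything else at the collector itself.
def filter_events(events, cpus_of_interest):
    return [e for e in events if e["cpu"] in cpus_of_interest]

events = [
    {"cpu": 0, "event": "sched_switch"},
    {"cpu": 3, "event": "sched_switch"},
    {"cpu": 3, "event": "cpu_idle"},
]

# Only events from CPU 3 survive the filter.
kept = filter_events(events, cpus_of_interest={3})
```

Filtering this early means the per-event cost of the detailed collector is paid only for the small set of CPUs actually under investigation.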
Flexible and tailored observability enables observability as a service that can be used for more purposes than just visualizing the system state. Since observability needs to have data collectors all around the system, it could be offered “as a service” to provide on-demand metrics data to support load balancing, routing, or scaling in/out decisions. Observability can also provide information to the fault-management system and provide an Application Programming Interface (API) for system testing and verification.
Dynamic system observability
Collecting data that is not used is an obvious waste of resources. A configurable observability service should only collect the data that is requested by the user. If run-time configuration is possible, observability can enable temporary dataset collection. This could be very useful when collecting data for fault handling purposes, where it would be excessive to run such a collection continuously.
It is important to keep in mind, however, that collecting observability data has a trade-off: the more data is collected, the more resources are consumed in collection process, network, storage, and analysis. On the other hand, the more data is collected, the more useful the data can be, and the better understanding of the system state can be formed.
This would be especially effective in fault handling cases since the need for detailed data collection exists only for a short period. If the scope from which the data is collected is relatively small, the resource consumption stays low even when observability is collecting very detailed data.
“The best place to hide a key is to put it among other keys.” Collecting data for specific faults and error conditions requires very detailed tracing, as opposed to generic network monitoring. If the specific fault is as yet unknown, collection would produce a lot of data without revealing the fault or even a symptom of it. In other words, storing or viewing all the collected data for fault isolation would consume resources in vain.
Observability could utilize the same method as a car dashcam that records video when a car is moving. It uses a rolling buffer and will only keep a few minutes of footage before it overwrites the old recording. However, if the motion sensor detects a collision, the recorded video will not be overwritten. This means the dashcam keeps the video containing important information and, by deleting the rest, the amount of needed storage space is reasonable.
Let’s apply this analogy to system observability: following TCP sessions at the node or service level, the collector could track every connection attempt. Session information is stored in a database when a failure is detected, but deleted in the case of a successful connection. This way observability would be able to collect details of all unsuccessful sessions without sending and storing the data of successful connections. Collecting the data and temporarily storing it for decision making would still consume some resources.
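The dashcam-style keep-or-discard logic can be sketched in a few lines (class and field names are hypothetical): session details are buffered while the connection is in flight; on success the buffer is dropped, on failure it is persisted.

```python
class SessionRecorder:
    """Dashcam-style recorder: keep only the traces of failed sessions."""

    def __init__(self):
        self.in_flight = {}   # session id -> buffered trace lines
        self.persisted = {}   # only failed sessions end up here

    def record(self, sid, line):
        """Buffer a trace line for an in-flight session."""
        self.in_flight.setdefault(sid, []).append(line)

    def close(self, sid, success):
        """On session close, keep the 'crash footage' only on failure."""
        trace = self.in_flight.pop(sid, [])
        if not success:
            self.persisted[sid] = trace

rec = SessionRecorder()
rec.record("s1", "SYN"); rec.record("s1", "SYN-ACK")
rec.close("s1", success=True)          # successful session: trace discarded
rec.record("s2", "SYN"); rec.record("s2", "timeout")
rec.close("s2", success=False)         # failed session: trace persisted
```

Only the failed session’s buffer survives, mirroring the dashcam that overwrites everything except the footage around a collision.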
To benefit from the dashcam approach, the decision to store or discard data must be made very close to the data source. The logic should be very simple, and it could be realized, for example, with eBPF or in a SmartNIC (smart network interface card) to minimize the increase in system load. In the case of a negative decision, the resource cost would consist only of writing the information to temporary storage and running the simple decision logic. In the case of a positive decision, which should be rare, the network and permanent storage would also be used.
Adaptive system observability
In the dashcam approach, observability has a logic that decides whether the collected trace is stored in a database or discarded. Enhancing the collector logic with functionality to turn traces on and off would further enable observability to collect data only when needed. For example, in the case of high service latency, observability could turn on the detailed trace to see what kind of requests are causing the problem or which network the requests are coming from.
In adaptive observability, the collector, which is placed near the monitored resources, decides to increase or decrease the dataset collected. This decision could be based on a contract coming from an observability user request.
Adaptive observability saves resources by enabling detailed data collection only when it is deemed necessary. The collection is time limited, and the location should be limited to a small area. Instead of obtaining general data from all over the system, observability provides a very targeted and detailed data collection containing only information on the recognized problem.
One additional benefit of adaptive observability is speed. If the decision logic is near the resource being watched, the reaction time to an event will be very short. Should observability identify a network packet that triggers the activation of a more detailed trace, the adaptive observability would not lose a single packet. In contrast, if the logic were at a central site, the decision latency would cause several packets to be missed.
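The on/off switching described above can be sketched as a small state machine (the thresholds and names are illustrative): detailed tracing is enabled when observed latency crosses a high-water mark and disabled again only once latency drops well below it, so the trace stays on for the duration of the symptom.

```python
class AdaptiveCollector:
    """Toggle detailed tracing based on observed service latency."""

    def __init__(self, high_ms=200.0, low_ms=50.0):
        self.high_ms = high_ms    # turn detailed trace on above this
        self.low_ms = low_ms      # turn it off again below this
        self.detailed = False

    def observe(self, latency_ms):
        if not self.detailed and latency_ms > self.high_ms:
            self.detailed = True      # symptom appeared: collect details
        elif self.detailed and latency_ms < self.low_ms:
            self.detailed = False     # symptom gone: release resources
        return self.detailed

c = AdaptiveCollector()
c.observe(10.0)    # normal latency: detailed trace stays off
c.observe(500.0)   # latency spike: detailed trace switched on
```

The two-threshold hysteresis avoids flapping the expensive collection on and off when latency hovers around a single limit.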
AI and observability
An old joke tells the tale of a discussion between a married couple:
- wife: “Why don’t you ever tell me that you love me?”
- husband: “I said ‘I do’ 20 years ago. I will let you know if I change my mind.”
This joke, while an old one, can be applied to observability. Should observability be constantly reporting the metrics or simply follow consistently and only report when necessary? As in the joke, this is a matter of trust. If observability can be trusted to see the difference between correct and incorrect behavior of a system, it only needs to inform when an unwanted behavior exists.
If observability is trusted to make decisions, the benefit is that all the collected data does not need to be sent to a central database and shown to the observability management team. If the decision logic is very close to the resource being traced, the resource consumption is relatively small compared to sending all the traces to a central database. This lower resource usage allows observability to collect even more data without disturbing the running system.
The closer the logic is to the observed resource, the faster observability can react, and the less network and storage resources are used.
Summarizing the benefits
The takeaway from this article can be summarized as “getting more with less”. This means intelligently collecting more useful data while actively avoiding the collection of unnecessary data. Dynamic observability can limit its use of resources to avoid interference with the observed system. By actively and intelligently controlling the available resources, observability can provide deep and detailed information on the system's behavior in a fault or abnormal scenario while remaining almost invisible in normal conditions.
System observability as a service enables the use of metrics in an application without the need for the application to implement its own collectors. Observability can have an authentication mechanism to control which dataset, and in which scope, an API user can access. Using observability instead of application-specific collectors can also remove the need to run the application with superuser privileges.
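A hedged sketch of such an access check (the token names, datasets, and scopes are invented for illustration): each API token is granted access only to specific datasets within specific scopes, so a load balancer and a fault-management system see different slices of the same observability service.

```python
# Hypothetical grant table: token -> set of (dataset, scope) pairs it may read.
GRANTS = {
    "token-lb": {("latency", "cluster-a"), ("throughput", "cluster-a")},
    "token-fm": {("traces", "cluster-a"), ("traces", "cluster-b")},
}

def may_access(token, dataset, scope):
    """Return True if the token is allowed to read this dataset in this scope."""
    return (dataset, scope) in GRANTS.get(token, set())
```

The observability API would evaluate this check before serving any request, so each consumer gets on-demand metrics without the service exposing more data than the consumer's contract allows.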
At the Ericsson Blog, we provide insight to make complex ideas on technology, innovation and business simple.