Improved network fault handling with Hybrid AI

Detecting and efficiently resolving faults in operational networks are the main tasks of a network operating center (NOC). The current approach of resolving faults through documented step-by-step procedures designed for specific conditions is inadequate when undocumented conditions arise. It is also inefficient because it cannot exploit network states and available resources to resolve multiple faults together. This post shows that fault-handling procedures can be synthesized automatically using hybrid AI techniques, leading to more efficient and adaptive NOCs.

Jun 18, 2024 | 6 min.

Leonid Mokrushin

Principal Researcher, Cognitive technologies

Swarup Kumar Mohalik

Principal Researcher, AI, and formal methods

Sathiyanarayan Sampath

Senior Data Scientist

Pratyush Kiran Uppuluri

Senior Data Scientist

Marin Orlić

Master researcher, Machine reasoning and Hybrid AI

Network operating centers (NOCs) are tasked with detecting, predicting, diagnosing, and resolving problems that cause disruptions in the network. State-of-the-art NOCs rely on operating instructions (OPIs) issued by equipment manufacturers and methods of operations (MOPs), which are standardized management procedures. Service personnel use OPIs and MOPs as guidelines to accomplish these tasks. Both are sequences of inspections and actions designed for specific conditions such as alarm patterns, equipment types, and so on. Algorithm 1 below shows a slightly abstracted OPI for a “loss of mains” (power supply) alarm. Different versions of such instructions can exist for different types of equipment, with minor variations in the alarm format or in supplementary inspections. In the rest of the post, we refer to OPIs and MOPs together simply as MOPs.

Algorithm 1: Loss of Mains Operating Instruction

Check the power source fuses! (If there are any faulty fuses, do the following:)
Change all faulty fuses!
Check the incoming power supply! (If the incoming power supply is absent, do the following:)
Contact the incoming power supplier or the next level of maintenance support! (If the alarm remains, do the following:)
Consult the next level of maintenance support
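
To make the fixed nature of such an instruction concrete, Algorithm 1 can be written as a hard-coded procedure. This is only a sketch: the function name and the three Boolean inputs are hypothetical stand-ins for the results of manual inspections, and the sketch assumes a flat reading in which the final escalation depends only on whether the alarm remains.

```python
def loss_of_mains_opi(faulty_fuses: bool, supply_present: bool, alarm_remains: bool) -> list:
    """Return the steps Algorithm 1 prescribes for the given inspection results."""
    steps = ["Check the power source fuses"]
    if faulty_fuses:
        steps.append("Change all faulty fuses")
    steps.append("Check the incoming power supply")
    if not supply_present:
        steps.append("Contact the incoming power supplier or the next level of maintenance support")
    if alarm_remains:
        # Assumption: escalation is triggered whenever the alarm persists.
        steps.append("Consult the next level of maintenance support")
    return steps
```

Every branch is decided when the instruction is written, so any condition that is not one of the three inputs cannot influence the procedure.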

We note some disadvantages with the existing method of documenting MOPs.

MOPs are focused on the management task, and not on the underlying cause of an observed system behavior.
This means that each MOP should consider all known possible causes of the alarms. However, the same alarms may be issued by multiple managed objects, and each managed object may have several versions. This makes the MOPs difficult to maintain, since the telecommunication system is constantly changing and evolving, and new types of equipment with new potential faults are added frequently.
MOPs are “fixed” and formulated in advance.
This means they are not tailored to the specific requirements of different installations, the current state of the network, and presently available resources. For example, which checks to perform in a site, and in what order may depend on what actions have been performed recently, and whether there are service personnel already at the site. Such complications are handled either by deploying several versions of the MOPs, each applicable to only a small number of many different cases, or manually by qualified personnel.

Dynamic MOPs

The disadvantages of “static” MOPs described above can be mitigated by addressing the root causes of the faults and generating the MOPs on demand, considering the current state and available resources. We call the latter “dynamic MOPs”. By representing the information in the current static MOPs in an alternate way, and by using hybrid AI techniques comprising machine learning, machine reasoning, data analysis, and optimization, dynamic MOPs make fault handling more efficient and the management of the MOPs more automated.

Solution outline

The main idea behind the dynamic MOPs solution is to focus on faults rather than alarms. Here we imitate the approach that human experts take when resolving alarms. The alarms themselves do not represent the actual root cause of the problem; they are only indicators or symptoms of some faults. Based on MOPs and their own experience, the expert forms a set of hypotheses about faults that might be the root causes of the observed alarms. The expert then tries to address these faults through remedial actions and performs further checks to ascertain that the alarms are cleared. Therefore, instead of describing what actions to take when an alarm occurs, one should describe:

the causal relation between faults and their observable symptoms, for example, alarms, KPI degradation, and so on, and
the causal relation between actions and their expected effects in terms of system states – observable symptoms and available resources.

Formally represented in a “causal domain model”, the above causal relations, along with the current network state, enable the dynamic MOPs-based solution (see Figure 1), whose elements are explained in the following sections.

Figure 1. NOC process based on machine learning (ML) and causal modeling of faults and actions.

Detecting the faults

Faults are defined as any undesirable state of the network. The observable symptoms of the faults are primarily alarms but could be any other observable aspects such as various counters and key performance indicators (KPIs). In most systems, the alarm data, which has its root cause in faults, is collected as temporal data in which the faults are not mentioned explicitly.

The NOC process uses ML techniques to derive possible faults from the logs of the observables (mainly alarms data).

It first uses an offline technique called double clustering, an extension of the expectation-maximization (EM) based clustering algorithm, to create a fault model. The fault model maps groups of alarms to “fault type” clusters, or “syndromes”, each representing a group of faults with the same or similar symptoms.

The second technique, implemented in the “incident generation” component, uses Jaccard clustering to aggregate alarms occurring close in time at the same site into incidents, that is, groups of alarms that probably originate from the same root cause. Using the fault model from the offline process, the incident clusters are then classified to determine the possible fault types and the location of the fault.
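
The incident generation and classification steps can be sketched as follows. The post does not detail how Jaccard clustering is applied, so this sketch makes its own assumptions: alarms are grouped per site whenever the gap to the previous alarm stays within a time window, and the Jaccard index is used only to compare an incident's alarm set with each syndrome. The 10-minute window, the function names, and the “Loss of mains” syndrome with its “Mains failure” alarm are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def jaccard(a: set, b: set) -> float:
    """Jaccard index |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def generate_incidents(alarms, window=timedelta(minutes=10)):
    """Group alarms of one site whose gap to the previous alarm is within `window`."""
    by_site = defaultdict(list)
    for ts, site, name in sorted(alarms):
        by_site[site].append((ts, name))
    incidents = []
    for site, events in by_site.items():
        current, last_ts = set(), None
        for ts, name in events:
            if last_ts is not None and ts - last_ts > window:
                incidents.append((site, current))  # gap too large: close the incident
                current = set()
            current.add(name)
            last_ts = ts
        incidents.append((site, current))
    return incidents

def classify(incident_alarms, fault_model):
    """Return (fault_type, score) for the syndrome most similar to the incident."""
    scored = ((ft, jaccard(incident_alarms, syndrome)) for ft, syndrome in fault_model.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical fault model; only the first syndrome is taken from Table 1.
FAULT_MODEL = {
    "Overheated battery": {"Temperature abnormal", "Battery level low"},
    "Loss of mains": {"Mains failure", "Battery level low"},
}

t0 = datetime(2022, 12, 16, 17, 55)
ALARMS = [
    (t0, "SITE2", "Temperature abnormal"),
    (t0 + timedelta(minutes=2), "SITE2", "Battery level low"),
    (t0 + timedelta(hours=3), "SITE2", "Mains failure"),
]
INCIDENTS = generate_incidents(ALARMS)
```

With this data the first two alarms form one incident that matches the “Overheated battery” syndrome exactly, while the late alarm becomes a separate incident with only a partial match.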

Causal models of faults and actions

As outlined earlier, the causal domain model in the NOC formally captures the causal relation between

fault types (or, simply “faults”) and symptoms, and
actions and system state.

These causal relations are captured through detailed forms, as exemplified in Table 1 and Table 2 respectively. The fields “Precondition” and “Symptoms” in the fault form in Table 1 are the most important, because they provide the definitive causal information for a fault: the condition that must be met for the fault to occur, and the alarms that are triggered when it occurs. Similarly, the fields “Precondition” and “Effects” in the action forms in Table 2 capture the enabling condition for an action and the resulting system state when the action is executed. Note that the field “Severity” in the fault form can be used to prioritize the faults. The fields “Delay” and “Cost” in the action form can be used to compute the optimal resolving procedures.

Table 1. Sample form for the fault “Overheated battery”

Fault mode name	Overheated battery
Equipment type	BTS 4G
Description	...
Precondition	PropOf(SiteOf) = power_battery
Symptoms	Temperature abnormal, Battery level low
Consequences	...
Severity	5

Table 2. Sample forms for the actions “Restart climate control” and “Goto”

Action name	Object type	Precondition	Effects	Delay	Cost
Restart climate control	BTS 4G	Overheated battery	Not overheated battery	10	6
Goto	Site		At(site)	60	999
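
One way to hold such forms in code is as plain records. The sketch below is an assumption about representation, not the format used in the project: the class names, field names, and the helper `cheapest_action_for` are hypothetical, and the units of “Delay” and “Cost” are left open because the tables do not state them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FaultForm:
    name: str
    equipment_type: str
    precondition: str       # condition that must hold for the fault to occur
    symptoms: frozenset     # alarms triggered when the fault occurs
    severity: int

@dataclass(frozen=True)
class ActionForm:
    name: str
    object_type: str
    precondition: frozenset  # predicates that must hold before the action
    effects: frozenset       # predicates that hold after the action
    delay: int
    cost: int

# Values copied from Table 1 and Table 2.
OVERHEATED_BATTERY = FaultForm(
    name="Overheated battery",
    equipment_type="BTS 4G",
    precondition="PropOf(SiteOf) = power_battery",
    symptoms=frozenset({"Temperature abnormal", "Battery level low"}),
    severity=5,
)
ACTIONS = [
    ActionForm("Restart climate control", "BTS 4G",
               frozenset({"Overheated battery"}), frozenset({"Not overheated battery"}), 10, 6),
    ActionForm("Goto", "Site", frozenset(), frozenset({"At(site)"}), 60, 999),
]

def cheapest_action_for(effect, actions):
    """Pick the lowest-cost action whose effects include `effect`, or None."""
    candidates = [a for a in actions if effect in a.effects]
    return min(candidates, key=lambda a: a.cost, default=None)
```

Keeping each fault and action as an independent record is what allows new equipment types to be added without rewriting existing procedures.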

Inventory model

The inventory model is essentially a snapshot of the system state, encompassing the knowledge of all the relevant objects within the system and their respective properties. This knowledge is stored and updated in a knowledge base (KB) as Boolean predicates, for example At(site), OverheatedBattery(), and not (OverheatedBattery()), and it determines which faults and actions are applicable.
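
A minimal sketch of such a knowledge base is shown below. It assumes a closed-world reading, in which a predicate that is not stored is treated as false, and it writes negation as a leading "not " on the literal. The class and method names are hypothetical and are not taken from the project.

```python
class KnowledgeBase:
    """Stores the ground predicates currently believed true (closed-world assumption)."""

    def __init__(self, facts=()):
        self.facts = set(facts)

    def holds(self, literal: str) -> bool:
        """Evaluate a literal; a leading 'not ' negates the predicate."""
        if literal.startswith("not "):
            return literal[4:] not in self.facts
        return literal in self.facts

    def applicable(self, preconditions) -> bool:
        """A fault or action is applicable when all of its preconditions hold."""
        return all(self.holds(p) for p in preconditions)

    def apply(self, effects) -> None:
        """Update the snapshot with the effects of an executed action."""
        for effect in effects:
            if effect.startswith("not "):
                self.facts.discard(effect[4:])
            else:
                self.facts.add(effect)

kb = KnowledgeBase({"OverheatedBattery()"})
```

For example, `kb.applicable(["At(site)"])` is false until an action with the effect `At(site)` has been applied.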

Abductive reasoning and dynamic MOPs

Whenever alarms occur and the fault types have been identified by the “incident generation” component, the objective is to rectify the faults, which in turn clears the alarms. First, the causal model of faults is used to rule out certain fault types by matching against the precondition and symptoms. The remaining faults are then prioritized, and goals are generated. Formally, if there is a fault state where a symptom predicate is true, the goal to be achieved is one that contains the negation of the symptom predicate. As a simple example, if the symptom of the fault is “overheated battery”, then the goal state is “not overheated battery”.
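
The filtering, prioritization, and goal generation described above can be sketched as follows. The sketch assumes that a fault stays plausible only if its precondition is a known fact and all of its symptoms are observed, and that a higher severity number means higher priority; the post states neither rule explicitly. The `Fault` class, the simplified precondition names, and the “Fan failure” and “Loss of mains” entries are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fault:
    name: str
    precondition: str
    symptoms: frozenset
    severity: int

def generate_goals(observed, facts, faults):
    """Return (fault name, goal literals) pairs, most severe fault first."""
    plausible = [f for f in faults
                 if f.precondition in facts and f.symptoms <= observed]
    plausible.sort(key=lambda f: f.severity, reverse=True)
    # Each goal negates a symptom predicate of the fault.
    return [(f.name, sorted(f"not {s}" for s in f.symptoms)) for f in plausible]

FAULTS = [
    Fault("Overheated battery", "power_battery",
          frozenset({"Temperature abnormal", "Battery level low"}), 5),
    Fault("Fan failure", "has_fan", frozenset({"Temperature abnormal"}), 3),    # hypothetical
    Fault("Loss of mains", "power_mains", frozenset({"Mains failure"}), 4),     # hypothetical
]
observed = {"Temperature abnormal", "Battery level low"}
facts = {"power_battery", "has_fan"}
goals = generate_goals(observed, facts, FAULTS)
```

Here “Loss of mains” is ruled out because its precondition is not among the known facts, and the two remaining faults are ordered by severity.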

To achieve a goal state from the current faulty state, a form of logical reasoning called abductive reasoning is used. It can dynamically produce MOPs like Algorithm 1 from the causal models of actions and the inventory model. The following sample plan, obtained through abductive reasoning, is a sequence of operations on a router named “SITE2”. The operations include detecting alarms (step “verified”), identifying the root-cause faults (step “fault”), and setting appropriate goals (step “intent”). At this point, the MOPs are derived on the fly and executed (step “action”), achieving the goals (step “resolved”).

Time	Step	Title	MO
2022-12-16 17:55	verified	External Link Failure	SITE2
2022-12-16 17:55	fault	Incorrect power config	SITE2
2022-12-16 17:55	action	Reconfigure	SITE2
2022-12-16 18:10	resolved	Incorrect power config	SITE2
2022-12-16 18:10	verified	External Link Failure	SITE2
2022-12-16 18:10	fault	Broken optical link	SITE2
2022-12-16 18:10	intent	at	SITE1
2022-12-16 18:10	action	goto	SITE1
2022-12-16 20:10	resolved	at	SITE1
2022-12-16 20:10	action	Repair optical link	SITE2
2022-12-16 20:20	resolved	Broken optical link	SITE2
2022-12-16 20:20	done	Symptoms subsided	SITE2
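
The post does not describe the internals of the abductive reasoner. As a stand-in, the sketch below uses a plain uniform-cost forward search over actions with preconditions and effects, which is enough to reproduce the action sequences in the trace above. The cost of “Goto” is taken from Table 2; the costs of “Reconfigure” and “Repair optical link”, the tuple layout, and the function name `plan` are hypothetical.

```python
import heapq
from itertools import count

# (name, preconditions, added facts, removed facts, cost)
ACTIONS = [
    ("Reconfigure", {"Incorrect power config"}, set(), {"Incorrect power config"}, 5),
    ("Goto", set(), {"At(site)"}, set(), 999),
    ("Repair optical link", {"Broken optical link", "At(site)"}, set(),
     {"Broken optical link"}, 20),
]

def plan(state, goal_absent, actions):
    """Return (steps, cost) of the cheapest sequence after which no fact in
    `goal_absent` holds, or (None, None) if no such sequence exists."""
    tie = count()  # tie-breaker so the heap never has to compare sets
    frontier = [(0, next(tie), frozenset(state), [])]
    seen = set()
    while frontier:
        cost, _, current, steps = heapq.heappop(frontier)
        if not (current & goal_absent):
            return steps, cost
        if current in seen:
            continue
        seen.add(current)
        for name, pre, add, delete, action_cost in actions:
            if pre <= current:  # action is applicable in the current state
                nxt = frozenset((current - delete) | add)
                heapq.heappush(frontier, (cost + action_cost, next(tie), nxt, steps + [name]))
    return None, None
```

Calling `plan({"Broken optical link"}, {"Broken optical link"}, ACTIONS)` yields the “Goto” then “Repair optical link” sequence, whereas starting from a state that already contains `At(site)` skips the travel step. That is the sense in which a dynamically generated procedure can exploit the current state.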

Implementation and demo

The end-to-end NOC process in Figure 1 is implemented in the machine reasoning (MARE) project in collaboration with RISE labs, Sweden. The MARE implementation and demonstration are integrated with Ericsson’s cognitive framework, which helps create technical solutions for autonomous intent-driven management of complex telecommunication networks and service infrastructures across a variety of domains (such as radio, communication transport, cloud, operations, etc.). The cognitive framework is based on a blackboard model in which several specialized agents collaborate and communicate by exchanging facts through the knowledge base.
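
The blackboard pattern itself can be illustrated with a few lines of code. This is a generic sketch of the pattern, not the API of Ericsson’s cognitive framework: the `Blackboard` class, the two agents, and the fact tuples are all hypothetical.

```python
class Blackboard:
    """Shared knowledge base; every agent is notified when a new fact is posted."""

    def __init__(self):
        self.facts = []
        self.agents = []

    def register(self, agent):
        self.agents.append(agent)

    def post(self, fact):
        if fact in self.facts:
            return  # already known, nothing to propagate
        self.facts.append(fact)
        for agent in self.agents:
            for new_fact in agent(fact):
                self.post(new_fact)

def diagnosis_agent(fact):
    """Hypothetical agent: maps an alarm to a fault hypothesis."""
    kind, value = fact
    if kind == "alarm" and value == "External Link Failure":
        return [("fault", "Incorrect power config")]
    return []

def planning_agent(fact):
    """Hypothetical agent: maps a fault to a remedial action."""
    kind, value = fact
    if kind == "fault" and value == "Incorrect power config":
        return [("action", "Reconfigure")]
    return []

board = Blackboard()
board.register(diagnosis_agent)
board.register(planning_agent)
board.post(("alarm", "External Link Failure"))
```

The agents never call each other directly; each one only reacts to facts on the board, which is what lets specialized components be added or replaced independently.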

The MARE project is demonstrated through a graphical user interface (GUI) (Figure 2) with data from a real site. The GUI integrates an alarm generator. From the generated alarms, the fault type and location (external link failure at BTS 2G) are derived and displayed. The causal model of the fault is used to derive the possible fault “incorrect power configuration”. From the symptoms of the fault and the causal model of the available actions, abductive reasoning derives the action “reconfigure”, which resolves the fault. In another situation, the derived fault is a broken optical link. The causal model of actions indicates that this fault needs manual intervention, so the reasoner generates an action “goTo” for an engineer to visit the site, followed by an action “repair optical link” to resolve the fault. Because MARE generates MOPs dynamically, it can produce a different procedure on the fly if the current procedure fails. It can also leverage the current state, for example the presence of an engineer at the base station site, to resolve other pending faults.

The faults and alarms in the demonstrations are primarily chosen from the power management domain, but the illustrated mechanisms are general, and should adapt well to any similar domain.

Summary and future work

A closed-loop, intelligent, adaptive, and autonomous NOC can be achieved through the concept of dynamic MOPs implemented using hybrid AI, a mix of machine reasoning and machine learning techniques. The prototype and demo of the MARE project show the feasibility of such an approach. Moreover, this approach enables efficient adaptation to rapid changes in a telecommunication system through modular and incremental addition of causal domain knowledge, which makes the dynamic MOPs framework easier to maintain.

Recently, the proliferation of generative AI, particularly LLM-based approaches, has shown significant potential in both machine learning and machine reasoning tasks, combining low investment in training ML models with good performance. These techniques can be explored to make the tasks in dynamic MOPs more efficient, particularly the machine reasoning-based tasks, which produce high-quality and definitive outputs but usually suffer from higher computational complexity.

Learn more

Read about Ericsson’s Network support services powered by AI and ML.

Learn about our early work on Cognitive processes for adaptive intent-based networking
