Machine Intelligence at the NOC
What are some of the specific challenges for Machine Intelligence at the NOC?
Some of the major challenges with NOC management today, outlined in the diagram below, are:
- Troubleshooting billions of service alarms
- Handling ~20 million workflow management notifications by NOC experts
- Handling millions of service desk emails
- Increased costs due to low rate of workflow management utilization
Incident management is an area where we already use expert systems-based frameworks. However, the constantly changing nature of networks – both from a technology and from a deployment point of view – makes it very challenging to maintain the human-written rules in such expert systems. Automated processing of incidents in a data-driven, domain-agnostic manner, without the need for expert rules, would significantly enhance automation in NOCs. As an example, a fault in one node can lead to cascading faults in other nodes, resulting in a slew of alarms. Machine learning techniques enable us to discover co-occurring patterns in such a stream of alarms and other events, which helps to quickly identify the root cause in most fault scenarios. This frees up the NOC operations teams so they can focus on more complex challenges.
What kind of complexities does this involve?
Typical NOC alarm processing involves mapping of incoming alarms to incidents using enrichment, aggregation, de-duplication and correlation techniques. It is challenging due to heterogeneity of alarm information caused by the multi-technology, multi-vendor solutions used in today’s telecom networks. This heterogeneity makes it difficult to create a harmonized view of the network and significantly increases the complexity associated with fault detection and resolution.
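As a rough illustration of the de-duplication and aggregation steps mentioned above, here is a minimal Python sketch. The alarm fields (`ts`, `node`, `type`), the 60-second suppression window and the function names are assumptions made for this example, not details of any particular NOC system.

```python
from collections import defaultdict


def deduplicate(alarms, window_s=60):
    """Drop repeats of the same (node, type) that arrive within window_s
    seconds of the previous occurrence of that same alarm."""
    last_seen = {}
    kept = []
    for alarm in sorted(alarms, key=lambda a: a["ts"]):
        key = (alarm["node"], alarm["type"])
        prev = last_seen.get(key)
        if prev is None or alarm["ts"] - prev > window_s:
            kept.append(alarm)
        # Updated on every occurrence, so a continuously repeating alarm
        # stays suppressed until it has been quiet for window_s seconds.
        last_seen[key] = alarm["ts"]
    return kept


def aggregate_by_node(alarms):
    """Group alarm types per node, preserving arrival order."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm["node"]].append(alarm["type"])
    return dict(groups)


# Hypothetical alarms from two nodes (timestamps in seconds).
alarms = [
    {"ts": 0, "node": "rbs-1", "type": "LINK_DOWN"},
    {"ts": 10, "node": "rbs-1", "type": "LINK_DOWN"},  # repeat, suppressed
    {"ts": 20, "node": "rbs-1", "type": "POWER_LOW"},
    {"ts": 30, "node": "rtr-7", "type": "LINK_DOWN"},
    {"ts": 200, "node": "rbs-1", "type": "LINK_DOWN"},  # outside the window
]
per_node = aggregate_by_node(deduplicate(alarms))
```

In a multi-vendor network the hard part is usually agreeing on what counts as "the same alarm"; here that decision is reduced to the `(node, type)` key, which a real deployment would likely need to enrich.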
Can we afford to encode domain knowledge in the long run?
The short answer is ‘no’.
Today’s NOC solutions include rule-based processing of alarms from different sources, namely nodes, service management systems and network/element management systems. Some rules convert the domain-specific information into a general view of the network at the NOC, while other hardcoded rules process and correlate alarms for appropriate grouping.
Such rule development is time consuming and manually intensive. Continuous changes in the network, with new types of network nodes and the resulting new types of alarms, make rule development and maintenance even more complex. Further, rules need to be generated and updated frequently; otherwise the rule database will be incomplete or even inaccurate.
Does that mean we stop domain-oriented rule development?
This does not mean traditional rule development goes away. Rather, it will be augmented by domain-independent, data-driven approaches. Additionally, automatic detection of possible correlations among alarms can augment the rule-based approach where rules are not complete or where the specific domain knowledge is yet to be acquired.
The data-driven approach will enable identification of cross-domain correlations and data-driven insight generation. Gradually, the system can evolve towards a fully automated solution.
Data-driven automation in the NOC
Let us share with you a case study on automatic incident formation, root cause identification and self-healing that we have worked on as part of our research.
We have applied Machine Intelligence principles – data mining and data science – to discover patterns of behavior from large historical datasets. These patterns are essentially correlations between alarms and co-occurrences among them. One interesting aspect of our approach is that we did not treat the data only as time series, but also considered how to process the largely symbolic or categorical information collected from the network and identify latent behaviors from it.
This approach aids domain experts in learning unknown and evolving patterns of behavior when the environment is multi-technology and multi-vendor. Such correlated and grouped patterns enable automatic grouping of alarms which sets the stage for automated network incident detection, root causing and self-healing.
Using this approach, we can achieve intelligent grouping of alarms and tickets with minimal manual involvement; we can reduce or altogether avoid manual rule development by automatically identifying important and missing groupings; and we can reduce the overall number of trouble tickets.
Automatic incident detection
Fault condition detection and alarm grouping are made possible by
- Embedding network information, such as alarms and events, into a telco knowledge graph which includes raw as well as insightful, derived information that forms the basis for enabling automated intelligent behaviors at the NOC
- Automatically capturing the behaviors in the network data – alarms and events – in a data-driven manner into digitalized versions which we will refer to as machine learning (ML) generated rules
Using this approach, we can automatically identify amalgamated and enriched conditions instead of looking at individual alarms one at a time. The data-driven capabilities automatically create composite conditions from historical information. In other words, pattern mining techniques are used to perform intelligent grouping of cross-domain alarms. These composite conditions are expressed as ML generated rules, which aid in detecting groups of alarms. We call such a group an incident.
Frequent patterns are item sets, subsequences or substructures that appear in a dataset with a certain level of frequency. Frequent pattern mining algorithms include frequent item set mining, sequential pattern mining, structured pattern mining, correlation mining, associative classification and frequent pattern-based clustering. Finding patterns in data helps with mining associations, correlations and other interesting relationships among the input data. We apply extended variants of these pattern mining algorithms to telecom network data to produce machine learning generated rules.
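To make frequent item set mining concrete, the sketch below runs a small Apriori-style search over alarm types that co-occur in the same time window. It is a generic textbook technique, not the extended variants referred to above; the function name, the alarm names and the 50% support threshold are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(windows, min_support=0.5, max_size=3):
    """Return {itemset: support} for alarm sets that appear together in at
    least a min_support fraction of the time windows."""
    windows = [frozenset(w) for w in windows]
    n = len(windows)
    if n == 0:
        return {}
    # Level 1: single alarm types that are frequent on their own.
    counts = Counter(alarm for w in windows for alarm in w)
    current = {frozenset([a]) for a, c in counts.items() if c / n >= min_support}
    result = {s: counts[next(iter(s))] / n for s in current}
    size = 1
    while current and size < max_size:
        size += 1
        alarms = sorted({a for s in current for a in s})
        # Apriori pruning: keep a candidate only if all its subsets of
        # size - 1 were themselves frequent.
        candidates = [
            frozenset(c)
            for c in combinations(alarms, size)
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        ]
        current = set()
        for cand in candidates:
            support = sum(1 for w in windows if cand <= w) / n
            if support >= min_support:
                result[cand] = support
                current.add(cand)
    return result


# Hypothetical alarm types observed in four historical time windows.
history = [
    {"LINK_DOWN", "BGP_DOWN", "CELL_OUT"},
    {"LINK_DOWN", "BGP_DOWN"},
    {"LINK_DOWN", "BGP_DOWN", "FAN_FAIL"},
    {"FAN_FAIL"},
]
patterns = frequent_itemsets(history, min_support=0.5)
```

With this toy history, `LINK_DOWN` and `BGP_DOWN` appear together in three of four windows, so the pair survives as a candidate grouping, while pairs involving `FAN_FAIL` fall below the threshold.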
The above picture depicts the iNOC (Intelligent Network Operations Center) functional view. Incident detection involves two high-level processes. First, we explore the historical alarm data to capture ML generated rules using the Pattern generator. Next, we use these ML generated rules to detect recurring patterns in a live stream of alarms using the Alarms grouper. We refer to these as the Pattern/model generation pipeline and the Incident grouping pipeline for data processing.
As in the above picture, the Pattern generation pipeline uses pattern mining algorithms to extract behavior patterns and translates them into ML generated rules. The Incident grouping process compares these rules against the live stream of alarms, looking for complete or partial matches. This essentially automates the alarm grouping problem, reducing the need for manual grouping. We also support legacy hand-coded rules, which work alongside the ML generated rules to enhance the outcomes.
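One way to picture the incident grouping step is a sliding window over the live alarm stream that reports how much of each rule is currently present. The sketch below is an assumption-laden illustration: the `AlarmGrouper` class, the rule format (a named set of alarm types), the window length and the partial-match threshold are all hypothetical and are not taken from iNOC.

```python
from collections import deque


class AlarmGrouper:
    """Match a live alarm stream against rules given as sets of alarm types."""

    def __init__(self, rules, window_s=300, min_match=0.5):
        # Empty rules are skipped; they could never be meaningfully matched.
        self.rules = {name: frozenset(r) for name, r in rules.items() if r}
        self.window_s = window_s
        self.min_match = min_match
        self.buffer = deque()  # (timestamp, alarm_type), oldest first

    def observe(self, ts, alarm_type):
        """Add one alarm (timestamps must be non-decreasing) and return
        a list of (rule_name, matched_fraction, is_complete) tuples."""
        self.buffer.append((ts, alarm_type))
        # Expire alarms that have fallen out of the sliding window.
        while self.buffer and ts - self.buffer[0][0] > self.window_s:
            self.buffer.popleft()
        active = {a for _, a in self.buffer}
        matches = []
        for name, rule in self.rules.items():
            fraction = len(rule & active) / len(rule)
            if fraction >= self.min_match:
                matches.append((name, fraction, rule <= active))
        return matches


# Hypothetical rule: three alarm types that tend to occur together.
grouper = AlarmGrouper(
    {"transport_fault": {"LINK_DOWN", "BGP_DOWN", "CELL_OUT"}},
    window_s=300,
    min_match=0.6,
)
grouper.observe(0, "LINK_DOWN")            # 1 of 3 alarms: below threshold
partial = grouper.observe(10, "BGP_DOWN")  # 2 of 3 alarms: partial match
full = grouper.observe(20, "CELL_OUT")     # 3 of 3 alarms: complete match
```

Reporting partial matches lets an operator, or a downstream step, see an incident forming before every alarm in the pattern has arrived.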
Automatic root cause identification
Root cause identification is made possible by
- Domain knowledge of experts on root causes
- Meaningful insights into composite conditions
- Topology which enhances ML rule generated incidents
- Labelled root causes to map against detected incidents
Experts help us understand the original fault or faults from the list of alarms generated during a detected incident at the NOC. After incidents have been detected automatically, we use various machine learning techniques – multi-polynomial regression and stochastic gradient methods – and expert knowledge on various attributes of data from Ericsson’s Fault Management, Ticket Management and Work Flow Management systems to generate a classification model. iNOC generated incidents are subjected to root-cause identification using the generated classification model.
As shown in the picture below, the automated root cause identification and self-healing agent takes additional information from various data sources such as ticket root cause analysis (RCA) history and KPI information from performance data. A distributed data store is maintained with MapReduce and Spark Streaming capabilities. An Operation and Maintenance GUI is used to present the RCA, self-healing and predictions so that the user can choose the required services.
The system also includes an additional alarm prediction module, which predicts the number of occurrences of a particular alarm or group of alarms in an upcoming time window using cross-correlation and regression techniques. Cross-correlation is applied to each pair of alarms. Pairs whose correlation crosses a given threshold are selected and modelled with Poisson regression, with each alarm represented as a time-series window. The model is then used to predict the number of alarms that would occur in the next time window. This helps the operator categorize the alarm and assign workforce to troubleshoot the problem in advance, based on the priority and severity of the issue, and thus prevent critical network failures.
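The pair-selection step can be sketched as follows. This covers only the cross-correlation screening, not the Poisson regression that would then be fitted on the selected pairs. The per-window alarm counts, the 0.8 threshold, the maximum lag of three windows and both function names are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def lagged_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag].
    x and y are equal-length lists of alarm counts per time window."""
    if lag < 0:
        return lagged_correlation(y, x, -lag)
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    n = len(x)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def correlated_pairs(series, threshold=0.8, max_lag=3):
    """Return (leader, follower, lag, r) for every pair of alarms whose
    best lagged correlation reaches the threshold."""
    selected = []
    for a, b in combinations(sorted(series), 2):
        r, lag = max(
            ((lagged_correlation(series[a], series[b], k), k)
             for k in range(-max_lag, max_lag + 1)),
            key=lambda pair: pair[0],
        )
        if r >= threshold:
            # A negative best lag means b tends to precede a.
            leader, follower = (a, b) if lag >= 0 else (b, a)
            selected.append((leader, follower, abs(lag), r))
    return selected


# Hypothetical alarm counts per time window.
counts = {
    "link_down": [0, 3, 0, 0, 4, 0, 0, 2, 0, 0],
    "bgp_down":  [0, 0, 3, 0, 0, 4, 0, 0, 2, 0],  # follows one window later
    "fan_fail":  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # constant background
}
pairs = correlated_pairs(counts)
```

In this toy data `bgp_down` mirrors `link_down` one window later, so that pair is selected with `link_down` as the leader, while the constant `fan_fail` series is never selected. The leader's counts would then be a natural input feature for a count model of the follower.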