Graph machine learning in distributed systems: What you need to know
Just as graphs make it easier for us to understand and act on complex data, graph machine learning can take graph theory a giant step further. But can it even help today’s service providers to improve reliability and predict anomalous behaviors in complicated distributed systems? Find out below…
How much do you know about graph representation of data? Over the last two decades, graph theory has become increasingly popular in both research and industry. Among other areas, it has been used in epidemiology, medicine genetics, healthcare, banking and engineering to solve challenges such as routing, finding relation, path etc.
So let’s begin with the basics. As you can see in the example below, a graph is a structured data type that has nodes and edges: nodes usually hold information about an entity, edges indicate connections, and weights may be used to further define the connections (e.g. probability of moving from an entity to another, the strength of a connection).
A graph can be directed or un-directed, as illustrated in the two examples below. In the case of a directed graph, the edge has an orientation.
A directed acyclic graph (DAG) is a directed graph that has no cycles. The DAGs represent a topological ordering that can be useful for defining complicated systems. It is often used to represent a sequence of events, their probabilities (e.g. a Bayesian network) and influences among each other (e.g. causal inference).
In the case of Bayesian networks, probabilistic graphical models are built using conditional dependencies. As for causal inference, statistical models (e.g. matching) are used to analyze confounding and bias etc., and derive consequence of actions (e.g. causality).
How does graph machine learning work?
Although full of potential, using graphs for machine learning (graph machine learning) can sometimes be challenging. Representing and manipulating a sparse graph as a matrix is more difficult than doing the same operations on vectors, text or images.
Across Ericsson’s Global AI Accelerator, we have been experimenting with various approaches to graph machine learbing based on telco data. These approaches have shown promising results in terms of better performance, predictability, and efficiency.
A general process to apply graph machine learning follows a few common steps. It always starts with representing the data as a graph: this could be done at the time of data ingestion, when incoming data is stored as graphs in a graph database (e.g. Neo4j, JanusGraph etc.) or by transforming the original data into graph representation. Depending on the application, the graph data could be partitioned or embedded for the downstream graph machine learning. Finally, model predictions or outcomes will be served.
Why use graph machine learning for distributed systems?
Unlike other data representations, graph exists in 3D, which makes it easier to represent temporal information on distributed systems, such as communication networks and IT infrastructure. Traditional representation of data has fixed schemas that make it difficult to store connections between different entities. In graph representation, critical information such as logical dependencies, connections, relationships among many entities can be represented with little loss of information and great flexibility.
Graph machine learning in telco use cases
Below, we illustrate a telco use case whereby we are using graph machine learning to predict anomalous behaviors of a complicated distributed system.
A distributed service system
Many telco service providers deliver their services in a cloud-native server-client API interaction based environment that is often distributed by nature, for example, a service request to activate a mobile plan, adding additional wearable device to an existing plan etc.
A distributed service system topology consists of large amount of services linked to the load balancer(s). One or several of the services could be running on the same virtual machine with a specific IP.
The distributed service system is also time sensitive. After a system component with a specific IP requests a service from an API, the response should be received in a specific time window. An excessively long response time (i.e. an anomalous behavior) could be caused, for example, by the extra usage of CPU when that API is invoked at the IP at that specific time.
By predicting these problems before they occur, service providers can find solutions to avoid or mitigate the issues (e.g. blocking other requests, adding more resources for that request or reprogram the scheduling routine).
Time series data analysis
Each API response and other system metrics over time can be represented as time series data.
In the case of complicated distributed systems, there could be hundreds or even thousands of (correlated or uncorrelated) time series to be analyzed. Approaches such as statistics-based anomaly detection, clustering, sequence-2-sequence models (e.g. LSTM) are often used to identify or predict these anomalies. Finding out the relationships among all the hundreds or thousands of time series by analyzing their relations across the time axis at any given time is non trivial. Challenging and time-consuming feature engineering, tuning, statistical analyses, and data inspection are required in the process to identify the patterns.
Predicting anomolous system behavior with graph machine learning
The 3D nature of graph representation allows us to encode temporal relational information among entities (nodes) with various granularity and focus. The same graph can be encoded with its nodes, edges, subgraph, or as the whole graph, which allows us to be flexible in the trade-off between complexity and efficiency of operations.
In our experiments, to discover enough useful features for time series data, we need to at least include hundreds of feature columns, whereas when using graph representation, we could achieve similar performance with a dimension of 20-50 for the feature vectors.
Above: Flexible embedding approaches with graph data representation (courtesy of H. Cai, V. W. Zheng, and K. C.-C. Chang, available here)
In this example, we define the virtual machine nodes and the APIs that are running through those machines as graph nodes. An edge between the API and the machine is established when a particular API is invoked on a specific machine at a given time. The weight on the edge indicates response time delays (e.g., 1 for long response time above a threshold, 0 otherwise). The APIs and the corresponding IPs are then represented in the graph. Below is a graph of the API responses and IPs in a 6 minute window.
Once represented in graph, the system status can be analyzed as a unique dynamic graph representation at any given time (a snapshot), as shown below. When there is no API calls between 2 nodes, the edge will be missing. Thus this dynamic change can illustrate the original graph representation changes with situation and time before being embedded.
Then, the next steps are encoding the graph and feeding it into downstream machine learning algorithms for predictions.
Changes in the system captured in the graph (courtesy of L. Akoglu, H. Tong, and D. Koutr, available here)
In our use case, we used an approach called node2vec embedding to encode the graph. The process has two steps: random walk and word2vec. Random walk is used to sample the graph and create the corpus (traversal paths that indicate the sequence of events). Random walk is a sampling approach to discover potential possibilities of the graph traversal path given certain constraints, which are defined by the hyper parameters p (probability of being at node v) and q (probability that the random walk would discover the undiscovered part of the graph). The random walk process can be controlled by setting these 2 parameters and the number of walks to achieve the desired balance between performance and complexity. This is a desirable option especially when we need to scale the solution.
Above: Random walk illustration (coutesy of A. Grover and J. Leskovec, available here)
Then we used word2vec (often used in Natural Language Processing [NLP] for word embedding) to encode the graph in the latent space. In this case, each node in the sampled sequence from random walk is treated as word. The relation between the word and the sampled sequence is reflected with word2vec embedding just like how word2vec embedding is used to reflect the relation between word and the sentence. Word2vec is a shallow neural network that generates probability distribution of what words are likely to be seen in the input word’s context. Word2vec projects the one hot encoded inputs into the latent space to capture their similarities among each other in the latent space. The embedding is then used as input for a downstream machine learning classifier algorithm to predict whether an anomalous behavior will happen in the next 6 minutes.
Our initial result has shown around 20 percent improvement on the area under precision-recall curve (AUC) with graph ML comparing with conventional approach (improved predictability) and around 50 percent improvement on training and inference efficiency (total time excluding embedding process).
What next for graph machine learning in telco?
Graph representation of data has shown promising results in predicting anomalous behavior for complicated distributed systems. It indicates that graphs with well-designed representation will help improve predictability by leveraging the additional embedded temporal and dynamic information, which is not available with other data representations.
In the future, we envision that we could use graph representation to infer causality, assuming a proper Directed Acyclic Graph can be derived based on domain knowledge. In addition, edge and graph embedding, and prediction can be further explored to identify the system dynamics from different perspectives. They can be interesting areas for future work.
In writing this post, I would like to thank Sheyda Kiani Mehr, Xuancheng Fan, Nikita Botakov, and Nicolas Ferland for their contributions in related work and Simone Vincenzi for the proof reading and revision suggestions.
In this overview of causal inference in machine learning, my GAIA colleague explores the importance of taking artificial intelligence from the how to the why.
Find more expert insights about AI and machine learning in the Ericsson Technology Review article library.
Learn more about the future of autonomous networks.