Next generation machine learning powered by graph analytics

Machine learning is computationally demanding, not just in terms of processing power but also the underlying graph query language and architecture of the system. We look at how these challenges can be addressed with native graph databases. By Richard Henderson, Solution Architect, TigerGraph EMEA.


New developments in machine learning are being powered by deep link graph analytics which support unsupervised learning of graph patterns, feature enrichment for supervised learning and explainable models and results. It’s a potent combination that will serve enterprises well for years to come.

We see machine learning (ML) being used for a range of complex computing tasks, including fraud detection, personalised recommendations, predictive analytics, identification of user groups and influential users, reporting weaknesses or bottlenecks in operations and supply chains, and more.

But ML is computationally demanding, and graph-based machine learning no less so. With every hop, or level of connected data, the size of the search expands exponentially, requiring massively parallel computation to traverse the data. Computationally this is too expensive for key-value databases, which need a large number of separate lookups, and for relational database management systems (RDBMS), which must create table joins for every query. Even a standard graph database may not be able to handle deep-link analytics on large graphs.
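To make the hop-by-hop blow-up concrete, here is a minimal sketch: a breadth-first traversal that records how many new nodes appear at each hop. The toy graph (a tree with branching factor 3) is purely illustrative, but the pattern it shows – each hop multiplying the frontier by the average degree – is exactly what makes deep traversals so costly as chained lookups or joins.

```python
# Illustration: the k-hop frontier of a node grows roughly geometrically
# with the average degree, which is why each extra hop is so expensive
# when emulated with per-row lookups or table joins.
def khop_frontier_sizes(adj, start, max_hops):
    """Breadth-first search recording how many new nodes appear at each hop."""
    seen = {start}
    frontier = [start]
    sizes = []
    for _ in range(max_hops):
        nxt = []
        for node in frontier:
            for nbr in adj.get(node, ()):
                if nbr not in seen:
                    seen.add(nbr)
                    nxt.append(nbr)
        sizes.append(len(nxt))
        frontier = nxt
    return sizes

# Hypothetical toy graph: a balanced tree with branching factor 3.
adj = {}
level = [0]
next_id = 1
for _ in range(3):  # three levels deep
    new_level = []
    for n in level:
        children = list(range(next_id, next_id + 3))
        next_id += 3
        adj[n] = children
        new_level.extend(children)
    level = new_level

print(khop_frontier_sizes(adj, 0, 3))  # [3, 9, 27]
```

With branching factor 3, three hops already touch 39 nodes; on a call graph or social network with hundreds of connections per node, the same depth touches millions, which is why the traversal itself must be parallelised.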

One solution is a native graph database featuring massively parallel and distributed processing.

Unsupervised machine learning

Applying graph database capabilities to ML is a relatively new, but ultimately not very surprising, development: the Google Knowledge Graph, which popularised the concept of extracting actionable information based on patterns of relationships in data, was introduced in 2012, and graphs are known to be ideal for storing, connecting and drawing inferences from complex data.

That it didn’t happen sooner is down to the fact that, until recently, graph databases didn’t support the algorithms for deep-link analytics and struggled with very large datasets. But using ML algorithms in a native graph database opens new doors to these unsupervised methods, making possible the use of whole classes of graph algorithms to extract meaningful business intelligence including:

        community detection

        PageRank

        label propagation

        betweenness centrality

        closeness centrality

        similarity of neighbourhoods

These algorithms share a common requirement: the ability to gather data and analyse it while traversing large numbers of nodes and edges. This is a powerful feature of modern graph databases. Without it, many of these classes of algorithms would simply not be feasible to run.
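As a concrete taste of one algorithm from the list, here is a minimal PageRank sketch in pure Python using power iteration over a hypothetical four-page link graph. A native graph database would run an equivalent in-database, in parallel, over billions of edges; this sketch only shows the idea.

```python
# Minimal PageRank sketch (power iteration) over a toy directed graph.
# The graph and damping factor are illustrative, not from the article.
def pagerank(adj, damping=0.85, iters=50):
    """adj maps each node to its list of out-neighbours; returns node -> score."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                share = damping * rank[v] / len(out)
                for u in out:
                    new[u] += share
            else:
                # Dangling node: spread its rank evenly across all nodes.
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

# Hypothetical link graph: "c" is linked to by everyone else.
adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(adj)
print(max(scores, key=scores.get))  # 'c' attracts the most links
```

The other listed algorithms – community detection, label propagation, centrality measures, neighbourhood similarity – follow the same pattern: scores are accumulated while traversing nodes and edges, so they all depend on the fast, parallel traversal described above.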

Now these algorithms are finding uses in business to tackle a range of ‘difficult’ problems including fraud detection, identifying user groups and communities and reporting weaknesses or bottlenecks in operations and supply chains.

Supervised machine learning

Graph is also giving a boost to supervised machine learning because of its ability to support the analysis of a much richer set of data features – allowing you to deploy more sophisticated ML algorithms.

Consider the problem of detecting spam phone calls on a massive mobile phone network. This was precisely the problem that China Mobile wanted to solve. It has more than 900 million subscribers who make over two billion phone calls a week, but a tiny percentage of those are unwanted or fraudulent phone calls which the operator was keen to disrupt.

The approach was to analyse the data features of the phone that was initiating the call to determine whether it met the risk criteria for being fraudulent and then send a warning to the recipient’s phone – while it was still ringing – to warn them that the caller might be a scammer. The recipient could then decide whether to answer or not.

One could use a simple set of data features to detect phones associated with fraudulent calls, but China Mobile found that relying on duration of phone call and percentage of rejected calls to raise a warning flag resulted in too many legitimate calls being flagged as fraudulent – i.e., false positives.

China Mobile chose to broaden the scope considerably and monitor 118 data features to identify ‘good’ and ‘bad’ phones. The ML algorithms had to be powerful enough to analyse all of these data features and fast enough to do it in the time it took the network to connect a new call. Using ML, it would be possible to classify a caller as good or bad based on their relationships to other phones on the network which could be summarised in three key properties: 

        Stable group – based on how many phones a given phone calls and receives calls from on a regular basis. Relevant factors include the number of phones it regularly connects with, the frequency of interactions to and from each phone and the duration of the relationship with each phone.

        In-group connections – the degree of connectedness between the phones that the target phone is in regular contact with.

        3-step friend relationships – the degree of extended connectedness between the target phone and other phones. Does the given phone have connections with other phones that have connections with other phones that in turn initiate calls to the first phone (forming a sort of friendship loop)?
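The three properties above can be sketched over a toy call graph. This is a hedged illustration, not China Mobile's implementation: the phone names, the graph, and the exact definitions (mutual contacts as the 'stable group', pairwise linkage among contacts, and a phone→x→y→phone loop) are assumptions chosen to mirror the description.

```python
# Hedged sketch of the three relationship features over a toy call graph.
# All names and definitions are illustrative.
from itertools import combinations

calls = {  # directed call graph: caller -> set of callees
    "p1": {"p2", "p3"},
    "p2": {"p1", "p3"},
    "p3": {"p1", "p2"},
    "spam": {"p1", "p2", "p3"},  # calls many phones, nobody calls back
}

def contacts(phone):
    """Stable group proxy: phones this phone both calls and is called by."""
    called = calls.get(phone, set())
    callers = {p for p, out in calls.items() if phone in out}
    return called & callers

def in_group_density(phone):
    """In-group connections: fraction of contact pairs also in contact with each other."""
    group = contacts(phone)
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0
    linked = sum(1 for a, b in pairs if b in contacts(a))
    return linked / len(pairs)

def has_3step_loop(phone):
    """3-step friend relationship: is there a path phone -> x -> y -> phone?"""
    for x in calls.get(phone, ()):
        for y in calls.get(x, ()):
            if y != phone and phone in calls.get(y, ()):
                return True
    return False

for p in ("p1", "spam"):
    print(p, len(contacts(p)), in_group_density(p), has_3step_loop(p))
```

On this toy graph the legitimate phone scores well on all three metrics while the spam phone – which rings many numbers but is never called back – scores zero on each, matching the article's observation that bad phones score consistently low.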

It turns out that bad phones score consistently and reliably low on these metrics, and it is difficult for scammers to disguise or hide these features of the phones they are using. Using data on known good and bad phones, the ML algorithms can be trained to recognise suspicious patterns of behaviour with a high level of confidence.

But it's one thing to model these metrics and another challenge altogether to implement them across a network of nearly one billion phones in real time. The real-time element was critical: there is no point warning the recipient of a phone call that it might be fraudulent if you can't do it while the phone is still ringing – an important consideration in China Mobile's choice of graph database.

A native graph database not only has the query language to traverse many connections and filter and aggregate the results but also the computational power and underlying system architecture to do this in real time.

Explainable models

A criticism of neural networks and deep-learning systems is that they don't provide insight into causal factors – how did you get this output from those inputs? Without that, you don't know what factors the ML system is associating with a given output, which erodes confidence in the system's ability to maintain consistent results over time.

This underscores the importance of explainable models in ML. The objective is to be able to highlight the key variables associated with a result, and it turns out that graph analytics is well suited to compute and show the evidence behind ML decisions.

Explainable ML boosts user confidence in the result. In an online retail environment, the likelihood that a consumer will respond to a product recommendation is higher if the recommendation is accompanied by a reason, such as 'people like you also liked this product' or 'this product is similar to a previous purchase'.

In the more business critical scenario of fraud detection, explainable ML can be a regulatory or audit requirement, and it is also more helpful to fraud investigators if they can see the connections that caused a transaction to be flagged as suspicious rather than just receiving a numerical fraud score.

Graph databases represent data on networked objects in a way that closely mirrors reality, opening new doors to supervised and unsupervised machine learning techniques. They also expose the underlying decision-making process in a way that neural networks do not, satisfying the need for explainable ML models. It's no surprise, then, that businesses are turning to graph to solve deep-link data analysis challenges.

 

 
