I attended an interesting presentation at the EnergySec Pacific Rim summit discussing the role of machine learning and artificial intelligence (ML/AI) in network security and ICS operations. The talk was mostly an overview of potential applications and niches for ML/AI within these spaces, which was in itself refreshing, as ML/AI is frequently touted as a dramatic, all-encompassing solution for numerous security problems rather than just another tool in the information security toolbox. More importantly, the talk and subsequent discussion provided an arena to discuss some of the pitfalls of ML/AI as a security solution and what adopters must be cognizant of in deploying such solutions.

ML/AI as an approach, from my understanding of the process,[1] builds on a combination of algorithms or learning approaches and data sets that “train” the underlying model for how to respond to real events. In certain respects, one can look at ML/AI as a form of initial inductive reasoning – building general concepts from specific observations – that results in fundamental principles as the output of learning, after which the model turns to a deductive approach – using learned principles to classify new data. Based on the initial assumptions or mathematical models governing the learning portion, the overall model builds a representation of the world from which future information can be understood and classified.
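To make that flow concrete, here is a deliberately tiny sketch of the inductive/deductive split in Python: derive baseline parameters from historical observations, then apply them to classify new ones. The feature (packets per minute), the sample values, and the three-sigma cutoff are all illustrative assumptions on my part, not a description of any particular product or model.

```python
import statistics

# "Inductive" phase: derive general parameters from specific observations.
# These samples stand in for a hypothetical packets-per-minute baseline.
training_data = [102, 98, 97, 105, 101, 99, 103, 100, 96, 104]
baseline_mean = statistics.mean(training_data)
baseline_stdev = statistics.pstdev(training_data)

def classify(observation: float, z_threshold: float = 3.0) -> str:
    """'Deductive' phase: apply the learned parameters to new data."""
    z_score = abs(observation - baseline_mean) / baseline_stdev
    return "anomalous" if z_score > z_threshold else "normal"

print(classify(101))   # -> normal
print(classify(240))   # -> anomalous
```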

Within the context of ICS and cyber security, ML/AI is often touted as a powerful solution to multiple problems due to an assumption about the underlying states and natures of these two fields. Namely, both ICS operations and IT networks are assumed to be reasonably “steady state” environments, from which a sufficiently powerful classification engine (leveraging ML/AI) can detect deviations from “normal” that merit investigation or response. This type of anomaly detection appears on its face quite powerful and well-tailored to the problem sets in question, but the assumption that such environments truly are “steady state” can be deadly. Among other considerations, these environments are more dynamic than they seem, and intelligent adversaries can leverage well-executed “living off the land” techniques to “blend in” with normal operational activity below alerting thresholds.
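As a quick illustration of that last point, consider the same sort of baseline detector fed a patient adversary. Everything here is a made-up, simplified example (the detector, the byte counts, the exfiltration rate), but the mechanic is the one that matters: activity that stays inside the learned “normal” band never trips an alert, no matter how long it continues.

```python
import statistics

# Hypothetical daily outbound byte counts the model "learned" as normal.
normal_daily_bytes = [4.9e6, 5.1e6, 5.0e6, 4.8e6, 5.2e6, 5.0e6, 4.95e6]
mean = statistics.mean(normal_daily_bytes)
stdev = statistics.pstdev(normal_daily_bytes)
alert_threshold = mean + 3 * stdev       # classic three-sigma alerting rule

# A "living off the land" adversary tucks 250 KB/day into ordinary transfers.
stolen_per_day = 2.5e5
flagged_days = 0
for day in range(30):
    observed = 5.0e6 + stolen_per_day    # stays comfortably under the threshold
    if observed > alert_threshold:
        flagged_days += 1

print(f"alert threshold: {alert_threshold:,.0f} bytes/day")
print(f"days flagged: {flagged_days} of 30")
print(f"total exfiltrated: {30 * stolen_per_day:,.0f} bytes")
```

The detector is doing exactly what it was built to do; the problem is the assumption baked into it that “normal” and “benign” are the same thing.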

While that is my usual response to ML/AI solutions,[2] it is far from the only consideration, and in some cases may in fact be the least concerning aspect of a full and exclusive adoption of ML/AI approaches for operational needs. When thinking about other risks, my thoughts drift toward extreme left/right tail events, so-called “black swans” or other events outside the boundaries of “normal” models for which an algorithm (especially one that “learned” responses via a non-transparent method) may not have a response – or, more worrying still, may have a response, but one that reinforces underlying problems.

To illustrate the above, I look to algorithmic trading – the so-called “quant” methodology – that now dominates so much of finance. While many organizations have minted millions via algorithmic trading, taking advantage of minuscule market inefficiencies by enabling computers to buy and sell at speeds well beyond those possible for human traders, such an approach comes with a potential (if seldom realized) cost: the algorithm going “out of control”. Such was the case in the 2010 “Flash Crash”, a near-9% intraday drop in the Dow Jones Industrial Average. Triggering and amplifying this event were trading algorithms trained to behave in certain ways in response to observations – at a very basic level, to sell in a declining market to minimize losses. The issue with such an approach is that it quickly turns into bandwagoning and a self-reinforcing cycle of continual loss: as quoted prices drop further, algorithms attempt to staunch their managers’ bleeding by selling before losing more money, driving prices still further downward. Until automated trading is halted, the competing algorithms essentially run a race to the bottom, wiping out significant gains by the former financial masters of the universe.
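For the curious, the dynamic is easy to reproduce in a toy simulation. The sketch below is not a model of real market microstructure, and every parameter (stop-loss levels, price impact, the initial dip) is invented for illustration; the point is simply that each algorithm following its own internally “rational” rule produces a collectively irrational cascade.

```python
# Each "algorithm" follows one rule: if the price has fallen past its
# stop-loss threshold, sell. Selling pushes the price down further,
# tripping the next algorithm's threshold.
price = 100.0
stop_loss_levels = [99.0, 98.0, 97.0, 96.0, 95.0]   # one per trading algorithm
sold = [False] * len(stop_loss_levels)
price_impact_per_sale = 1.5                          # how far each sale moves the price

price -= 1.2  # an ordinary dip starts the cascade
tick = 0
while True:
    tick += 1
    triggered = [i for i, level in enumerate(stop_loss_levels)
                 if not sold[i] and price <= level]
    if not triggered:
        break
    for i in triggered:
        sold[i] = True
        price -= price_impact_per_sale
    print(f"tick {tick}: {len(triggered)} algorithm(s) sold, price now {price:.2f}")

print(f"final price after cascade: {price:.2f}")
```

In this toy run, an ordinary 1.2% dip cascades into a drop of nearly 9% before the selling stops, with no algorithm ever deviating from its own rule.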

One response to this is quite simple: the algorithms were bad, and with improvement events such as the “Flash Crash” can be avoided. While true, this is a response to symptoms rather than a resolution of the fundamental problem. The underlying issue is not poor algorithm design, but rather not knowing the implications of a specific ML/AI process until it produces results that are suboptimal or outright damaging. From the perspective of ICS operations management, an algorithm may do a very adequate or even superlative job in 95%+ of instances – but a freak storm, physical interruption, or other “black swan” event (or even a simple glitch) can produce a situation outside the scope of standard learning sets. In response, the underlying ML/AI algorithm may take actions that are internally consistent and logical but, to outside observers, completely irrational and unwanted – amplifying a disastrous situation rather than staunching the wound.
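A trivial sketch of that failure mode, using a hypothetical cooling-pump controller and made-up numbers: a model fit only on normal operating conditions will happily extrapolate when handed an out-of-distribution reading, and its answer is perfectly consistent with what it learned even though no operator would call it sane.

```python
import numpy as np

# Hypothetical learned relationship: ambient temperature (C) vs. required
# cooling-pump speed (%). The training data covers only 15-35 C.
temps = np.array([15.0, 20.0, 25.0, 30.0, 35.0])
pump_speed = np.array([40.0, 50.0, 60.0, 70.0, 80.0])
slope, intercept = np.polyfit(temps, pump_speed, 1)

def recommended_speed(temp_c: float) -> float:
    """Apply the learned model to a new sensor reading."""
    return slope * temp_c + intercept

print(recommended_speed(28.0))    # ~66%: sensible, inside the training range
print(recommended_speed(-40.0))   # a frozen or faulted sensor yields roughly -70%,
                                  # an internally consistent but nonsensical command
```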

Essentially, the “black box” nature of ML/AI learning after initial algorithm deployment results in a couple of worrying situations: first, “baking in” potentially erroneous or incorrect assumptions during initial algorithm development; second, producing a classification or decision-making model that simply cannot be explained, understood, or documented following the learning period. As a result, operators can make assumptions about likely responses given the past data fed in to train the ML/AI model, but are unable to determine with any degree of confidence how such a model will respond to novel, truly anomalous situations beyond what was covered in the learning period. When combined with automated response or operations management based on model results, potentially harmful (or at least undesirable) outcomes can emerge due to the inability to build responses around rare but still realistic events.

The above is not meant to say that ML/AI has no place within the fields of ICS operations or information security overall. Rather, these statements are meant to illustrate a refrain I hold near and dear to my heart: that all generalizations are stupid. In this sense, ML/AI does (or will) have a place within these environments, but to exclusively turn matters over to such a methodology introduces risks that simply cannot be understood until the results of “all possible worlds” are gathered and understood – a data-gathering task which is fundamentally impossible. Thus, in evaluating the applicability or desirability of an ML/AI approach to solving critical operational issues – whether in industrial operations, information security, or other areas – decision-makers must not only incorporate fairly standard ideas of where such models might fail (e.g., significant false positives) but also those truly rare yet still quite plausible instances where the models may be unable to understand or cope with events adequately.

Ultimately, ML/AI remains a “tool in the toolbox”, suitable for some elements of operations management or security monitoring but, like every other approach, incapable of responding to all contingencies in a desirable fashion on its own. Rather than jumping on a bandwagon of technical hype, asset owners and practitioners should instead look to the technologies that are available and determine how they fit within an overall application landscape. At this time, wholly relying on a single technical approach, such as ML/AI, seems not just undesirable, but potentially disastrous.

[1] There is a definite difference between academic definitions of ML/AI and marketing conceptions of ML/AI. For the sake of this piece, I concern myself only with the concept of ML/AI pushed by companies in the information security space – which dramatically simplifies matters relative to the extensive array of academic implementations or understandings of various ML/AI models.

[2] I addressed this in my CS3STHLM talk in 2018 and will bring this up again in a general information security perspective at RSAC USA 2019.

Categories: ICS, Infosec