In my RSA Conference '21 presentation I discussed a Threat Hunting methodology that uses Machine Learning to automate, to a certain extent, the detection of malicious activity via anomaly analysis. In this blog post series, of which this is Part 1, I will discuss this methodology together with the low-level details of the proposed Machine Learning models and the corresponding (Python) code, implemented as one more function of the ds4n6_lib library, under the D4ML project (Machine Learning extensions).
Before jumping into the “fanciness” of new AI-based Threat Hunting methodologies, let's first look at the current state-of-the-art in this area.
Threat Hunting is a very broad term that, in the real world, encompasses many different ways to actively look for malicious activity on a network at scale.
There are 3 traditional approaches to Threat Hunting[1,2]: IOC-based Hunting, TTP-based Hunting and Anomaly-based Hunting.
IOC-based Hunting is one of the most common and effective Threat Hunting approaches. The idea is to use Indicators of Compromise (IOCs) associated with a certain Campaign or Threat Actor. At the low level, these IOCs can be of many different types: process names, IPs, file names, mutexes, strings, etc., which are searched for in various forensic artifacts (logs, process lists, services, registry, network connections, etc.). If the IOCs have the right quality and context (which unfortunately is not often the case), we will be able to identify the malicious activity in our networks relatively easily and quickly, and therefore detect the intrusion in a timely manner.
However, IOC-based Hunting requires that someone in the Community has identified the attack, researched it, and shared the corresponding IOCs with the rest of the Community. But what happens when a certain attack (such as the SolarWinds/Sunburst one) occurs and stays under the radar for some time? During that period the Community does not have any IOCs to search for, and the related intrusion typically goes unnoticed for months.
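As a rough illustration of the idea, an IOC sweep can be sketched as a lookup of artifact values against a set of indicators. This is a minimal sketch, not part of ds4n6_lib: the `IOCS` dictionary, the `RECORDS` list, the `Hit` class and the `sweep` function are all hypothetical names with made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    host: str
    artifact: str
    ioc_type: str
    value: str


# Hypothetical IOC set, keyed by IOC type (illustrative values only).
IOCS = {
    "process_name": {"evil.exe", "dropper.exe"},
    "ip": {"203.0.113.10"},
}

# Hypothetical artifact records collected from the computer base.
RECORDS = [
    {"host": "ws01", "artifact": "process_list", "process_name": "explorer.exe"},
    {"host": "ws02", "artifact": "process_list", "process_name": "EVIL.exe"},
    {"host": "srv01", "artifact": "network_connections", "ip": "203.0.113.10"},
]


def sweep(records, iocs):
    """Return one Hit per record field that matches a known IOC."""
    hits = []
    for rec in records:
        for ioc_type, values in iocs.items():
            observed = rec.get(ioc_type)
            # Case-insensitive comparison; the IOC values are stored in lowercase.
            if observed is not None and observed.lower() in values:
                hits.append(Hit(rec["host"], rec["artifact"], ioc_type, observed))
    return hits


hits = sweep(RECORDS, IOCS)
for hit in hits:
    print(hit)
```

The sweep is only as good as the indicator set it is given, which is exactly the limitation of this approach: with no IOCs, there is nothing to match.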
We would obviously like to be able to put in place other methodologies that are able to detect intrusions without the use of IOCs. As mentioned before, there are two other approaches that Hunters use when they want to detect malicious activity on a network: TTP-based analysis and Anomaly-based analysis.
Let's go with TTP-based analysis first. How effective is it? Well, the answer is, as usual, “it depends!”.
Detecting malicious activity at small scale (on a single computer or a small group of computers) is typically not a problem. A knowledgeable/capable Forensicator is able to grab a computer and, after a couple of days of in-depth analysis, in most cases figure out if there is something wrong with that computer. How does she do this magic?
An experienced analyst will typically search first for obvious signs of compromise, including commonly used TTPs, based on Community knowledge and her own experience. Sometimes the signs will be very easy to identify, but other times (“Stealth Mode”) they may be harder to find. If that's the case, the analyst will start looking for anomalies, things that are “out of place” when compared with the rest of the computers of the same type. There will be false positives along the way, which are typically clarified by looking at the problem from different angles or, in Forensics terms, by looking at different types of forensic artifacts.
So, well, we can conclude that, if we have knowledgeable Hunters / Forensicators we will be able to detect the intrusion, right? What's the problem then?
Assuming that you do have those knowledgeable Forensicators in your team, the first problem is that in the real world most organizations do not have an adequate list of assets and do not have easy access to their whole computer base. Computers most of the time don't have the right configuration to preserve the evidence for a long-enough period of time, they are often not even configured to generate the right evidence, the Log Aggregation / EDR technology in place is limited, etc.
But even if we ignore all those real world problems, there is still a bigger problem: scalability.
Scaling this in-depth forensic analysis to hundreds, thousands, tens of thousands or even hundreds of thousands of computers is not feasible. Remember that an in-depth analysis of a single computer takes a couple of days (say 2-3)? If you multiply that by thousands of computers you get many thousands of analysis days, which organizations obviously cannot afford (and which would not make any sense anyway). We must conclude then that the effective manual process that we were discussing clearly does not scale.
Since our goal is to find malicious activity which is not easily identified as malicious via IOC/malicious pattern searches, we will focus on finding a way to search for anomalies at scale.
Ok, so how do Hunters approach this in the real world?
Normally what Hunters do is sweep the computer base (via EDRs, Logs, etc.) for the most commonly used Adversary Techniques, like malicious PowerShell, scheduled tasks, Windows shares, etc., and then analyze the output produced.
What the Hunter will do first is start looking for signs of malicious activity. Finding malicious patterns will depend on the Hunter/Forensicator's experience and knowledge, but also on all those real world factors mentioned in the previous section. And not only that: the problem is that in most cases those techniques are also used by system administrators and/or are part of standard computer activity, especially if the Adversary is in “Stealth Mode” (e.g. using LOLBAS).
So how can the Hunter identify malicious activity if there is nothing obviously malicious in the available evidence?
This is when she would start looking for anomalies, deviations from “standard” computer behavior.
A knowledgeable/experienced Forensicator will be familiar with what is “normal” in a standard computer. And, if she works for a specific organization, she will probably also know what is “normal” in her organization's computers (software, running processes, accounts, services, etc.). This means that if you provide her with a set of computers to look at, after careful analysis she will most likely be able to identify anomalies. The problem is, as you can imagine, that one-by-one analysis does not scale well.
There is a common approach to finding anomalies when you are analyzing many different machines of a similar type that helps: baseline analysis.
In today's world, most organizations don't install computer systems one by one. They normally deploy them by making copies of a standard template, commonly known as the “Gold Image” (GI). There typically are multiple different GIs in an organization, for different types of computers (desktops, servers) or for different roles (Database Servers, Web Servers, etc.). After the deployment of a GI, sysadmins will typically fine-tune the computer by installing new software, changing some configuration settings, etc. And then the life of the computer starts, during which further changes will happen to the computer.
What we can do in this case in order to find anomalies is to create a baseline of what is “normal” in our specific computer set and perform a Baseline Analysis, i.e. find deviations from the GI. Examples of these may be binaries, scheduled tasks or Windows shares that are found in a few systems or a single system only and not in the rest of the computers of the same group. There are many different tools that allow you to perform baseline analysis for different forensic artifacts (e.g. Volatility), so this should be doable from the tools point of view.
As a result, we can say that Baseline Analysis allows us to find anomalies “at scale”.
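At its core, a Baseline Analysis is a set comparison between what the Gold Image contains and what each deployed computer contains. The following is a minimal sketch under that assumption; the task names, host names and the `baseline_deviations` function are hypothetical and only serve as illustration.

```python
# Hypothetical scheduled tasks present in the Gold Image.
GOLD_IMAGE_TASKS = {"GoogleUpdate", "Defrag", "BackupJob"}

# Hypothetical scheduled tasks collected from each deployed computer.
HOST_TASKS = {
    "ws01": {"GoogleUpdate", "Defrag", "BackupJob"},
    "ws02": {"GoogleUpdate", "Defrag", "BackupJob", "Updater32"},
    "ws03": {"GoogleUpdate", "Defrag"},
}


def baseline_deviations(baseline, hosts):
    """Return, per host, the items added to or missing from the baseline."""
    deviations = {}
    for host, items in hosts.items():
        added = items - baseline      # present on the host, not in the GI
        missing = baseline - items    # present in the GI, not on the host
        if added or missing:
            deviations[host] = {"added": sorted(added), "missing": sorted(missing)}
    return deviations


deviations = baseline_deviations(GOLD_IMAGE_TASKS, HOST_TASKS)
for host, diff in deviations.items():
    print(host, diff)
```

Note that the output only says that something deviates from the GI, not whether the deviation is malicious, so every entry still needs to be reviewed by an analyst.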
In the real world, however, you will see a problem coming at this point. Once these “anomalies” (deviations from the GI) have been identified, the analyst will still have to review all of them to see if they are in fact malicious or just false positives. While this analysis certainly is lighter than a full-blown computer analysis, in the real world it is still a daunting task when the number of computers grows beyond a few dozen, when you have multiple different groups or when the computers have been online for a long period of time. It is still doable, but the number of analysis hours/days multiplies.
I guess at this point you can see that baseline analysis, while a useful technique, is certainly not a simple task, especially as the life of the computers gets longer and the computer set gets bigger.
And there is an additional problem with baseline analysis: even if you identify what is normal and what is not normal, you really don't know how anomalous those not-normal entries are, so you cannot prioritize your analysis.
Another commonly used technique to try to find anomalies in a computer base is to do statistical analysis on the computers under analysis.
Let's say that, as part of the intrusion, a certain Scheduled Task has been introduced in a small set of computers in the network. We could analyze how many times each task appears in the computer base and then take a look at those scheduled tasks that appear in a small number of computers. This is known in Forensics as the Least Frequency of Occurrence (LFO) and is indeed a useful technique.
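An LFO analysis boils down to counting in how many computers each item appears and sorting from rarest to most common. Here is a minimal sketch of that idea; the data and the `least_frequent` function (including its `max_hosts` threshold) are hypothetical.

```python
from collections import Counter

# Hypothetical scheduled tasks collected from each computer.
HOST_TASKS = {
    "ws01": ["GoogleUpdate", "Defrag"],
    "ws02": ["GoogleUpdate", "Defrag"],
    "ws03": ["GoogleUpdate", "Defrag", "Updater32"],
    "ws04": ["GoogleUpdate", "Defrag", "Sync"],
    "ws05": ["GoogleUpdate", "Sync"],
}


def least_frequent(host_items, max_hosts=1):
    """Return (item, host_count) pairs seen on at most max_hosts computers."""
    # set(items) makes sure each item is counted once per host.
    counts = Counter(item for items in host_items.values() for item in set(items))
    rare = [(item, n) for item, n in counts.items() if n <= max_hosts]
    # Rarest first; ties are broken alphabetically for a stable output.
    return sorted(rare, key=lambda pair: (pair[1], pair[0]))


rare_tasks = least_frequent(HOST_TASKS, max_hosts=2)
print(rare_tasks)
```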
The problem appears, once again, when you try to implement it in the real world. When you are Hunting at scale you typically have a non-homogeneous set of computers, so the number of scheduled tasks that appear only in a small number of computers tends to grow until the problem becomes unmanageable.
You can address this problem by including more variables in the mix. For instance, instead of looking only at the name of the scheduled task, you can also look at when it was created, how big the associated file is, etc. These variables would certainly help in terms of providing better accuracy, but performing statistical analysis on multiple variables is not trivial. Additionally, some variables, such as the creation time, would have to be flexible, since a set of computers may have a new legitimate scheduled task deployed within the same day, but not exactly in the same second.
So, while Statistical Analysis is indeed a useful technique for finding anomalies, it once again becomes complicated at scale and when you start enriching the content with additional variables.
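One simple way to make the creation time “flexible” is to truncate it to the day before grouping, so that tasks deployed on the same day land in the same group. The sketch below assumes that approach, which is only one of several possible choices; the records, the `profile_key` function and the `hosts_per_profile` function are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (host, task name, creation time, file size in bytes).
TASKS = [
    ("ws01", "BackupJob", "2021-03-01T09:00:05", 2048),
    ("ws02", "BackupJob", "2021-03-01T09:00:41", 2048),
    ("ws03", "BackupJob", "2021-03-01T09:02:13", 2048),
    ("ws04", "BackupJob", "2021-04-17T03:12:00", 9216),
]


def profile_key(name, created, size):
    """Combine several variables into one key, truncating the time to the day."""
    day = datetime.fromisoformat(created).date()
    return (name, day.isoformat(), size)


def hosts_per_profile(tasks):
    """Group hosts by (name, creation day, size) profile."""
    groups = defaultdict(set)
    for host, name, created, size in tasks:
        groups[profile_key(name, created, size)].add(host)
    return {key: sorted(hosts) for key, hosts in groups.items()}


profiles = hosts_per_profile(TASKS)
# Profiles seen on a single host are the candidates for review.
rare = {key: hosts for key, hosts in profiles.items() if len(hosts) == 1}
print(rare)
```

In this example the task name alone would not stand out, since “BackupJob” exists on every computer. Only the combination of name, creation day and file size isolates the copy on ws04.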
Let's jump into an interesting topic: Anomaly Classification.
When a Baseline Analysis tells the analyst “this is an anomaly”, it doesn't actually tell her how much of an anomaly it is. Is it very anomalous or just a little anomalous?
Similarly, although Statistical Analysis provides a certain classification when you sort by number of occurrences, which may allow you to identify similar or different groups of activity, in most cases it doesn't tell you whether something is very anomalous or just a little anomalous either.
As a consequence, our capability to classify anomalies is very limited or nonexistent. And having the ability to classify the anomalies would be extremely useful, since it would allow us to organize our (limited) analysis resources by prioritizing the analysis of the most anomalous entries.
This therefore brings up the need for an Anomaly Classification or Scoring mechanism.
As discussed previously, we need some type of Anomaly Score metric that defines, for each anomaly, how anomalous it is when compared with the rest of the data, and that allows us to prioritize the analysis of the most anomalous entries.
For instance, if a specific scheduled task or process is flagged as anomalous, I would like to know if it is considered very anomalous or just a little anomalous. Then the analyst could analyze, for instance, the Top 100 most anomalous scheduled tasks/processes.
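To make the idea of a score and a “Top N” concrete, the sketch below ranks items by a very simple rarity score, the base-2 logarithm of (total hosts / hosts where the item appears). This is a stand-in chosen for illustration only, not the Machine Learning score discussed in this series; the data and the `rarity_scores` and `top_n` functions are hypothetical.

```python
import math
from collections import Counter

# Hypothetical (host, scheduled task) observations.
OBSERVATIONS = [
    ("ws01", "Defrag"), ("ws02", "Defrag"), ("ws03", "Defrag"), ("ws04", "Defrag"),
    ("ws03", "Sync"), ("ws04", "Sync"),
    ("ws04", "Updater32"),
]


def rarity_scores(observations):
    """Score each item: 0 if present on every host, higher the rarer it is."""
    hosts = {host for host, _ in observations}
    # set() removes duplicate (host, item) pairs so each host counts once.
    counts = Counter(item for _, item in set(observations))
    return {item: math.log2(len(hosts) / n) for item, n in counts.items()}


def top_n(scores, n):
    """Return the n highest-scoring items, most anomalous first."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


scores = rarity_scores(OBSERVATIONS)
print(top_n(scores, 2))
```

With any numeric score in place, the analyst can cut the list at whatever N the available analysis time allows.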
This will help us, in turn, to assign resources and define ETA projections for our Threat Hunting processes: for instance, if we want to analyze the Top 100 anomalies detected and we want to do it within 12 hours, we will have to devote 3 analysts to the job. That sounds useful from the Service Delivery / Operations point of view, right?
But how can we actually create this Anomaly Classification / Score? It seems obvious that, in order to compute the anomaly score, we need to compare the anomaly with the rest of the data. Well, this is where Data Science / Machine Learning comes to the rescue!
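The staffing projection is simple arithmetic once you fix an average review time per anomaly. The sketch below assumes 20 minutes per anomaly, a figure that is not stated in this post and was chosen only because it is consistent with the 100 anomalies / 12 hours / 3 analysts example; the `analysts_needed` function is hypothetical.

```python
import math


def analysts_needed(n_anomalies, minutes_per_anomaly, deadline_hours):
    """Number of analysts required to review all anomalies before the deadline."""
    total_minutes = n_anomalies * minutes_per_anomaly
    return math.ceil(total_minutes / (deadline_hours * 60))


# Assumption: each anomaly takes about 20 minutes to review.
needed = analysts_needed(n_anomalies=100, minutes_per_anomaly=20, deadline_hours=12)
print(needed)
```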
There are different algorithms and models to detect anomalies via Data Science, both using Statistical Methods (such as the Elliptic Envelope Algorithm) and via Machine Learning Models[5,6,7] (Autoencoders, Isolation Forests, Random Forests, One-Class SVM Algorithm, Local Outlier Factor (LOF) Algorithm).
Many of these algorithms and models will not only find the anomalies but will also give you a metric in terms of how different the anomaly is from the rest of the data.
We will go the Machine Learning way, which will give us more flexibility as we move forward to more complex scenarios, and we will choose a model that is unsupervised, which means that you don't need to tell the model in advance, via labeled data, what is normal and what is not normal. This is very important for several reasons:
Out of the available Machine Learning models we will choose Autoencoders because they are very convenient. Besides being good at anomaly detection, they provide the Anomaly Score metric that we were looking for, and they are a very flexible model that, as you will see later, can be architected as Shallow or Deep Learning Models, and can also be combined with other types of models to provide additional features (such as the Long Short Term Memory (LSTM) model, which provides the capability to take Time into account in the analysis).
We will leave it here for now. Later in this Blog Post series we will analyze 2 variations of the Autoencoder (Simple / LSTM) and see how they can help in Anomaly Detection, but before that we will dig deeper into how to detect specific MITRE ATT&CK techniques with different Forensic artifacts.
Hope you enjoyed it!
Stay tuned and contact us if you have any comments or questions!