In previous parts of this blog post series we introduced our objective: trying to detect malicious activity through anomaly detection via machine learning. We also defined a specific use case: detecting the Solarwinds/Sunburst Campaign intrusion without IOCs, specifically via the detection of its malicious scheduled task. Finally, we explained the forensic artifacts associated with scheduled tasks activity.
In this post we will introduce the Autoencoder, a neural network architecture that is very effective at detecting anomalies.
As discussed in Part 1 of this blog post series, the Autoencoder model is one of the most efficient and convenient for Anomaly Detection. But what is an Autoencoder?
An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learned, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. [1]
Or, to express it in a simpler, more graphical way:
The most useful thing for us is that the reconstruction error, the loss, is a value we can use to measure how different (anomalous) each input object is.
There is a loss value per input entry (per input cat), so:
In summary, an Autoencoder has the capability to detect anomalies in a dataset. For each input object we present to the Autoencoder, it gives us a measure of how anomalous that object is (i.e. how different it is from the rest of the objects the Autoencoder has seen before).
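To make the idea of a per-object anomaly score concrete, here is a minimal sketch in Python of the reconstruction error computation (mean squared error per entry). The input vectors and "reconstructions" below are made up for illustration; in practice the reconstructions come from the Autoencoder itself:

```python
import numpy as np

# Made-up input vectors: the last entry is the odd one out
inputs = np.array([[0.10, 0.90, 0.40],
                   [0.20, 0.80, 0.50],
                   [0.90, 0.10, 0.90]])

# Made-up reconstructions: the anomalous entry is poorly reconstructed
reconstructions = np.array([[0.12, 0.88, 0.41],
                            [0.21, 0.79, 0.52],
                            [0.55, 0.45, 0.60]])

# One loss value per input entry: the higher it is, the more anomalous
losses = np.mean((inputs - reconstructions) ** 2, axis=1)
print(losses)  # the third entry has the largest reconstruction error
```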
If, instead of finding anomalous cats, you want to find anomalous Scheduled Tasks events, you would provide the different evtx fields as input:
The neural network will learn the “essence” of Scheduled Tasks events, and it will return the loss (reconstruction error) of each input event, in other words, how anomalous each event is when compared with the rest of the dataset previously presented to the neural network.
It is important to note that in this case you will not provide all the event logs, but only the “unique” ones. That is, if an event log entry repeats over and over again in the same system (e.g. the execution of an hourly scheduled task), it is considered a single entry. In other words, we are not taking the time variable into account in this analysis, only the event logs which differ from each other.
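In pandas terms, keeping only the unique event log entries is a one-liner. The field names and values below are made up for illustration, not the real evtx schema:

```python
import pandas as pd

# Hypothetical event log entries: an hourly task repeats over and over
events = pd.DataFrame({
    "TaskName": ["\\Hourly", "\\Hourly", "\\Hourly", "\\Updater"],
    "Command":  ["cmd.exe",  "cmd.exe",  "cmd.exe",  "update.exe"],
})

# The repeated entries collapse into one: time is ignored, and only
# events that differ from each other are kept
unique_events = events.drop_duplicates().reset_index(drop=True)
print(len(unique_events))  # 2
```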
Now that you've seen it in action, you can probably appreciate the characteristics that make the Autoencoder very useful for detecting anomalies:
Without further ado, we will go straight to using the Autoencoder to process our Scheduled Tasks event logs. We will discuss its benefits later, and in a later post we will cover the gory details of how it works at a low level and how to code it.
What we will do now is simply run the model with the input data and see how it performs in terms of detecting 3 increasingly anomalous Scheduled Tasks event log entries (that is, entries increasingly different from the rest of the event log entries presented to the Autoencoder).
We have created a function called find_anomalies() (which we will also explain and analyze in depth in a later post) that hides all the complexity of the underlying Autoencoder Machine Learning process, and we will now apply it to our input data (i.e. our unique input event logs).
The find_anomalies() function is very simple to use: you provide your input data, that is, the Scheduled Tasks event logs (in the form of a pandas DataFrame), and you get back the reconstruction error for each event log entry in the input. Or, in other words, how anomalous each event log entry is when compared with the rest of the event logs.
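The real find_anomalies() implementation will be analyzed in a later post; purely to illustrate its interface (a DataFrame in, one loss per row out), here is a hypothetical stand-in built with scikit-learn, where a small MLP trained to reproduce its own one-hot-encoded input plays the Autoencoder role:

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import OneHotEncoder

def find_anomalies(df: pd.DataFrame) -> pd.Series:
    """Return one reconstruction error per input row (higher = more anomalous).

    Stand-in sketch only: the actual find_anomalies() uses a proper
    Autoencoder and will be covered in depth in a later post.
    """
    # One-hot encode the categorical evtx fields into a numeric matrix
    X = OneHotEncoder().fit_transform(df).toarray()
    # A small MLP trained to reproduce its own input acts as the model
    model = MLPRegressor(hidden_layer_sizes=(2,), max_iter=500, random_state=0)
    model.fit(X, X)
    reconstructed = model.predict(X)
    # Per-row mean squared reconstruction error
    return pd.Series(np.mean((X - reconstructed) ** 2, axis=1), index=df.index)

# Hypothetical unique Scheduled Tasks event log entries
events = pd.DataFrame({
    "TaskName": ["\\Hourly", "\\Updater", "\\Odd"],
    "Command":  ["cmd.exe",  "update.exe", "evil.exe"],
})
scores = find_anomalies(events)
```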
We will now show you how an Autoencoder works in a small demo video in which we will do the following:
Note: Some of the fields of the above events have been sanitized after running the prediction process because the data used throughout this process is real server data from a production environment. You may also notice an unfamiliar field, UserNC_, which is the result of merging 2 Scheduled Tasks event log fields to make predictions more accurate (this is called Feature Engineering, and we will discuss it in a later post).
Take a couple of minutes to get familiar with how an Autoencoder works by watching this short demo video.
So out of 8,465 different event log entries in the dataset, the “anomaly order” in which they appear (0 → most anomalous / 8,464 → least anomalous) is:
As you can see, the more unique the event is, the higher it appears in the anomaly ranking. The Autoencoder behaves the way we expect it to.
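Producing such an anomaly ranking from the per-event losses is then straightforward. The event names and loss values below are made up for illustration:

```python
import pandas as pd

# Hypothetical per-event reconstruction errors returned by the Autoencoder
losses = pd.Series({"hourly_backup": 0.01,
                    "vendor_updater": 0.08,
                    "sunburst_like": 0.93})

# Position 0 = most anomalous, last position = least anomalous
ranking = losses.sort_values(ascending=False)
print(ranking.index[0])  # prints: sunburst_like
```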
We are now ready to use this Machine Learning model in our Solarwinds/Sunburst case study and see how it performs.
Stay tuned, and contact us if you have any comments or questions!