[30/08/20] Jess Garcia - Twitter: j3ssgarcia - LinkedIn: garciajess
Hi DS-Forensicators!
I know I owe you A LOT of introductory content about DS4N6 and the ds4n6.py library (I promised blog entries explaining in more depth each of the functions and techniques I presented at the SANS DFIR Summit in July, and they are coming soon), but between the DS4N6 Team's holidays and a couple of big investigations over the last couple of months, neither the team nor I have had time for much, to be honest. We are sorry for that.
In any case, since we've had to work on the analysis and correlation of some massive logs during this time, I don't want to miss the opportunity to share with you how to do that efficiently in pandas.
The problem (simplified & “educational-ized”) that we need to address: we have around 20GB of logs loaded into an input DF (indf) with DATE, USER and LOG columns, plus a list of 50 search strings, and we want to determine, for every log line and every search string, whether that string appears in the line.
So we have translated the problem of analyzing the original logs into something equally useful for us in this particular case: determining which specific log lines match a set of specific strings. The idea, beyond the scope of this case study, is that we can afterwards perform multiple types of analysis and correlation on the resulting outdf and generate additional DFs as needed.
Some quick calculations for the resulting output DF (outdf):
Also, as a reference, let's say we have 1,000 users in these logs.
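As a rough sanity check on the size of outdf (assuming, consistently with the 1-billion-row figure used later in this post, that the input DF holds on the order of 20 million log lines): every log line is evaluated against every search string, so outdf ends up with roughly 20,000,000 lines × 50 search strings = 1,000,000,000 rows.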
The format that we would like to have for this outdf DF would be something like this:
DATE | USER | SEARCHSTR | HIT |
---|---|---|---|
2018-01-01 | user_x | Search String 1 | True |
2018-01-01 | user_y | Search String 1 | False |
2018-01-01 | user_z | Search String 1 | True |
[…] | | | |
2018-01-01 | user_r | Search String 2 | True |
2018-01-01 | user_s | Search String 2 | True |
2018-01-01 | user_t | Search String 2 | False |
[…] | | | |
2018-01-01 | user_a | Search String 50 | False |
2018-01-01 | user_b | Search String 50 | False |
2018-01-01 | user_c | Search String 50 | False |
[…] | | | |
As you may already know, pandas maps strings to the “object” dtype (data type), so USER and SEARCHSTR would by default be defined as “object” dtype.
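You can check this quickly for yourself (a tiny illustration with made-up values):

import pandas as pd

s = pd.Series(['user_x', 'user_y', 'user_z'])
print(s.dtype)   # object: every element is stored as a full Python string object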
The definition of an empty DF that suits our needs would be:
outdf = pd.DataFrame({
    "DATE":      pd.Series([], dtype='datetime64[ns]'),
    "USER":      pd.Series([], dtype=object),
    "SEARCHSTR": pd.Series([], dtype=object),
    "HIT":       pd.Series([], dtype=bool)
})
And therefore you would get the following format:
outdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000000 entries, 0 to 19999999
Data columns (total 4 columns):
 #   Column     Dtype
---  ------     -----
 0   DATE       datetime64[ns]
 1   USER       object
 2   SEARCHSTR  object
 3   HIT        bool
dtypes: bool(1), datetime64[ns](1), object(2)
memory usage: 228.2 MB
None
As you can see, we have defined DATE with dtype “datetime64[ns]”, which is efficient, and HIT with dtype “bool”, which also is.
But the “object” dtype is very inefficient in pandas (both in terms of memory consumption and of operation speed), and it doesn't make a lot of sense to waste memory repeating the same USER strings and the same SEARCHSTR strings over and over again on each of the 1 billion rows of the output DF, right?
This is relevant because it looks like having a DF with 1 billion rows in memory is going to require a machine with lots of memory, especially if the strings are long. It also looks like making queries on that huge amount of data is going to be super-slow. So, what can we do to improve efficiency?
The answer is a not-so-well-known and somewhat poorly documented pandas data type called “category”.
A category is a data type that, simply put, allows you to define a fixed list of elements (in this case the 50 search strings or the 1,000 usernames) and use that list as a new data type.
So, let's define a category with the users we are going to search for:
users = indf['USER'].unique()
users_dtype = pd.api.types.CategoricalDtype(categories=users.tolist(), ordered=False)
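As a quick illustration of what a categorical Series looks like under the hood (a minimal sketch with hypothetical usernames; the real list of categories comes from indf['USER'].unique() as above):

import pandas as pd

# Hypothetical mini example, not the real users_dtype
demo_dtype = pd.api.types.CategoricalDtype(categories=['user_a', 'user_b', 'user_c'], ordered=False)
s = pd.Series(['user_a', 'user_c', 'user_a']).astype(demo_dtype)
print(s.cat.categories)   # Index(['user_a', 'user_b', 'user_c'], dtype='object') - strings stored once
print(s.cat.codes)        # [0, 2, 0] - each row only stores a small integer code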
And another one with the strings we are going to search for:
searchstr = ['srchstring1', 'srchstring2', ... , 'srchstring50']
searchstr_dtype = pd.api.types.CategoricalDtype(categories=searchstr, ordered=False)
So the definition of my output DF (outdf) would now be:
outdf = pd.DataFrame({
    "DATE":      pd.Series([], dtype='datetime64[ns]'),
    "USER":      pd.Categorical([], dtype=users_dtype),
    "SEARCHSTR": pd.Categorical([], dtype=searchstr_dtype),
    "HIT":       pd.Series([], dtype=bool)
})
And now, the corresponding output would be:
outdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000000 entries, 0 to 999999999
Data columns (total 4 columns):
 #   Column     Dtype
---  ------     -----
 0   DATE       datetime64[ns]
 1   USER       category         <-----------------------
 2   SEARCHSTR  category         <-----------------------
 3   HIT        bool
dtypes: bool(1), category(2), datetime64[ns](1)
memory usage: 11.3 GB             <-----------------------
None
The advantage of this approach is that the outdf DF no longer stores the strings themselves in its memory space, but only small integer codes that point to the strings defined in the users_dtype & searchstr_dtype categories. This drastically reduces the amount of memory you need to store this DF and boosts the performance of your queries.
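The 11.3 GB figure also matches a quick back-of-the-envelope calculation (assuming pandas picks the smallest integer type for the category codes, as it normally does): DATE takes 8 bytes per row, USER needs int16 codes (1,000 categories) at 2 bytes, SEARCHSTR needs int8 codes (50 categories) at 1 byte, and HIT takes 1 byte. That is 12 bytes per row, and 12 bytes × 1,000,000,000 rows ≈ 11.2 GiB. With the two object columns instead, the same DF would need roughly 25 bytes per row just for the pointers (8 bytes each for USER and SEARCHSTR), around 25 GB, before even counting the string data they point to.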
Results: with the category-based approach, the full 1-billion-row outdf fits in roughly 10GB of memory and string searches on it run in roughly 15 seconds (more on this in the summary at the end of the post).
And this is not the end, but the beginning. We can now start playing with our outdf DF and do lots of different types of analysis. And, since for each search string the mapping between indf and outdf is 1-to-1 (every input row produces exactly one output row per search string), you can easily identify which line(s) in outdf correspond to which lines in indf (there is a short sketch of this right after the code below).
This could be a simple piece of code to solve this problem:
# First, let's define the Categories ------------------------------------------------------------
users = indf['USER'].unique()
users_dtype = pd.api.types.CategoricalDtype(categories=users.tolist(), ordered=False)

searchstr = ['srchstring1', 'srchstring2', ... , 'srchstring50']
searchstr_dtype = pd.api.types.CategoricalDtype(categories=searchstr, ordered=False)

# Then, let's define the DFs --------------------------------------------------------------------
outdf = pd.DataFrame({
    "DATE":      pd.Series([], dtype='datetime64[ns]'),
    "USER":      pd.Categorical([], dtype=users_dtype),
    "SEARCHSTR": pd.Categorical([], dtype=searchstr_dtype),
    "HIT":       pd.Series([], dtype=bool)
})

# We will be using a temporary DF for each string search
tmpdf = pd.DataFrame({
    "DATE":      pd.Series([], dtype='datetime64[ns]'),
    "USER":      pd.Categorical([], dtype=users_dtype),
    "SEARCHSTR": pd.Categorical([], dtype=searchstr_dtype),
    "HIT":       pd.Series([], dtype=bool)
})

# Now the search loop ----------------------------------------------------------------------------
for sstr in searchstr:
    # Let's read the input DATE / USER data, and set the SEARCHSTR
    tmpdf[['DATE','USER']] = indf[['DATE','USER']]
    tmpdf['SEARCHSTR'] = sstr
    # Let's adjust USER & SEARCHSTR dtypes to categories
    tmpdf['USER'] = tmpdf['USER'].astype(users_dtype)
    tmpdf['SEARCHSTR'] = tmpdf['SEARCHSTR'].astype(searchstr_dtype)
    # Search for the Search String sstr in the LOG column of the original data
    tmpdf['HIT'] = indf['LOG'].str.contains(sstr)
    # Append tmpdf to outdf
    outdf = outdf.append(tmpdf, ignore_index=True)
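Once outdf has been built this way, the 1-to-1 mapping mentioned earlier is easy to exploit. Here is a minimal sketch for one of the search strings, assuming indf has a default RangeIndex and outdf was built with the loop above (one block of len(indf) rows per search string, appended in order with ignore_index=True), so that indf row = outdf index % len(indf):

# Pull the original log lines that produced hits for a given search string
hits = outdf[(outdf['SEARCHSTR'] == 'srchstring1') & (outdf['HIT'])]
hit_lines = indf.loc[hits.index % len(indf)]
print(hit_lines[['DATE', 'USER', 'LOG']].head())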
Note: Loops and temporary DFs are not an elegant way to approach problems in Data Science, and you would normally do this somewhat differently. However, for the purpose of this article, I think it is easier to understand if we do it this way.
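For reference, one (of several) more idiomatic ways to express the same computation is to build one small DF per search string and concatenate them all in a single call, instead of appending to outdf on every iteration (a sketch under the same assumptions as the loop above, not the only possible approach):

# Build one block per search string and concatenate once at the end
outdf = pd.concat(
    [pd.DataFrame({
        'DATE':      indf['DATE'],
        'USER':      indf['USER'].astype(users_dtype),
        'SEARCHSTR': pd.Categorical([sstr] * len(indf), dtype=searchstr_dtype),
        'HIT':       indf['LOG'].str.contains(sstr)})
     for sstr in searchstr],
    ignore_index=True)

The single pd.concat avoids the repeated copying that outdf.append() does on every iteration, which matters when the final DF is this large.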
Processing the original DataFrame indf to obtain your output outdf may take some time with the code above, so you may want to save your results to disk so you don't have to repeat the processing if your Jupyter kernel crashes.
The easiest way to save a DataFrame is using the feather format:
outdf.to_feather('outdf.feather')
The code to read it back (in case of crash) would be:
outdf=pd.read_feather('outdf.feather')
feather is a file format optimized for saving DataFrames; it's very easy to deal with and quite efficient, so it's probably your best bet (there are other formats like Pickle or HDF5 if you want portability to/from, for instance, Big Data platforms like Hadoop; see the References for more information, but we don't need that now).
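One nice property worth checking for yourself: with a pyarrow-backed feather file (the default in recent pandas), the category dtypes survive the round trip, so you don't need to redefine them after a reload (a quick sanity check, assuming the outdf built above):

# Save, reload and confirm the dtypes (including the categories) are preserved
outdf.to_feather('outdf.feather')
outdf2 = pd.read_feather('outdf.feather')
print(outdf2.dtypes)
# DATE         datetime64[ns]
# USER               category
# SEARCHSTR          category
# HIT                    bool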
Can't be any simpler, right?
In this article I explained how to efficiently translate a problem that was originally difficult to deal with (doing analysis and correlation on data contained in 20GB of logs) into a friendlier one: dealing with a DF that contains hits for some keywords we are interested in. And even though it is a huge 1-billion-row DF, we've used the pandas category dtype to shape it in a way that it barely uses 10GB of memory and can be searched in roughly 15 seconds.
I hope you enjoyed this post and I trust it will help you optimize those monstrous datasets we sometimes have to work with. I'm pretty sure you will find multiple scenarios where these techniques can be useful.
Happy DS4N6'ing!
JESS