[30/08/20] Jess Garcia - Twitter: j3ssgarcia - LinkedIn: garciajess
Hi DS-Forensicators!
I know I owe you A LOT of introductory content about DS4N6 and the ds4n6.py library (I promised blog entries explaining in more depth each of the functions and techniques I presented at the SANS DFIR Summit in July, and they are coming soon), but between the DS4N6 Team's holidays and a couple of big investigations over the last couple of months, neither the team nor I have had time for much, to be honest. We are sorry for that.
In any case, since we've had to work on the analysis and correlation of some massive logs during this time, I don't want to miss the opportunity to share with you how to do that efficiently in pandas.
The problem (simplified & “educational-ized”) that we need to address: we have around 20GB of logs loaded into an input DF (indf) with DATE, USER and LOG columns, plus a list of 50 search strings, and we want to determine, for every log line and every search string, whether that string appears in the line.
So we have translated the problem of analyzing the original logs into something equally useful for us in this particular case: determining which specific log lines match a set of specific strings. The idea, beyond the scope of this case study, is that we can afterwards perform multiple types of analysis and correlation on the resulting outdf and generate additional DFs as needed.
Some quick calculations for the resulting output DF (outdf):
Also, as a reference, let's say we have 1,000 users in these logs.
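As a rough sanity check on the size of outdf (assuming, consistently with the 1-billion-row figure used later in this post, that the input DF holds on the order of 20 million log lines): every log line is evaluated against every search string, so outdf ends up with roughly 20,000,000 lines × 50 search strings = 1,000,000,000 rows.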
The format that we would like to have for this outdf DF would be something like this:
DATE | USER | SEARCHSTR | HIT |
---|---|---|---|
2018-01-01 | user_x | Search String 1 | True |
2018-01-01 | user_y | Search String 1 | False |
2018-01-01 | user_z | Search String 1 | True |
[…] | | | |
2018-01-01 | user_r | Search String 2 | True |
2018-01-01 | user_s | Search String 2 | True |
2018-01-01 | user_t | Search String 2 | False |
[…] | | | |
2018-01-01 | user_a | Search String 50 | False |
2018-01-01 | user_b | Search String 50 | False |
2018-01-01 | user_c | Search String 50 | False |
[…] | | | |
As you may already know, pandas maps strings to the “object” dtype (data type), so USER and SEARCHSTR would by default be defined as “object” dtype.
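You can check this quickly for yourself (a tiny illustration with made-up values):

import pandas as pd

s = pd.Series(['user_x', 'user_y', 'user_z'])
print(s.dtype)   # object: every element is stored as a full Python string object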
The definition of an empty DF that suits our needs would be:
outdf = pd.DataFrame({
    "DATE":      pd.Series([], dtype='datetime64[ns]'),
    "USER":      pd.Series([], dtype=object),
    "SEARCHSTR": pd.Series([], dtype=object),
    "HIT":       pd.Series([], dtype=bool)
})
And therefore you would get the following format:
outdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000000 entries, 0 to 19999999
Data columns (total 4 columns):
 #   Column     Dtype
---  ------     -----
 0   DATE       datetime64[ns]
 1   USER       object
 2   SEARCHSTR  object
 3   HIT        bool
dtypes: bool(1), datetime64[ns](1), object(2)
memory usage: 228.2 MB
None
As you can see, we have defined DATE with dtype “datetime64[ns]”, which is efficient, and HIT with dtype “bool”, which also is.
But the “object” dtype is very inefficient in pandas (both in terms of memory consumption and of operation speed), and it doesn't make a lot of sense to waste memory repeating the same USER strings and the same SEARCHSTR strings over and over again on each of the 1 billion rows of the output DF, right?
This is relevant because it looks like having a DF with 1 billion rows in memory is going to require a machine with lots of memory, especially if the strings are long. It also looks like making queries on that huge amount of data is going to be super-slow. So, what can we do to improve efficiency?
The answer is a not-so-well-known and somewhat poorly documented pandas data type called “category”.
A category is a data type that, simply put, allows you to define a fixed list of elements (in this case the 50 search strings or the 1,000 usernames) and use that list as a new data type.
So, let's define a category with the users we are going to search for:
users = indf['USER'].unique()
users_dtype = pd.api.types.CategoricalDtype(categories=users.tolist(), ordered=False)
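As a quick illustration of what a categorical Series looks like under the hood (a minimal sketch with hypothetical usernames; the real list of categories comes from indf['USER'].unique() as above):

import pandas as pd

# Hypothetical mini example, not the real users_dtype
demo_dtype = pd.api.types.CategoricalDtype(categories=['user_a', 'user_b', 'user_c'], ordered=False)
s = pd.Series(['user_a', 'user_c', 'user_a']).astype(demo_dtype)
print(s.cat.categories)   # Index(['user_a', 'user_b', 'user_c'], dtype='object') - strings stored once
print(s.cat.codes)        # [0, 2, 0] - each row only stores a small integer code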
And another one with the strings we are going to search for:
searchstr = ['srchstring1', 'srchstring2', ... , 'srchstring50']
searchstr_dtype = pd.api.types.CategoricalDtype(categories=searchstr, ordered=False)
So the definition of my output DF (outdf) would now be:
outdf = pd.DataFrame({
    "DATE":      pd.Series([], dtype='datetime64[ns]'),
    "USER":      pd.Categorical([], dtype=users_dtype),
    "SEARCHSTR": pd.Categorical([], dtype=searchstr_dtype),
    "HIT":       pd.Series([], dtype=bool)
})
And now, the corresponding output would be:
outdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000000 entries, 0 to 999999999
Data columns (total 4 columns):
 #   Column     Dtype
---  ------     -----
 0   DATE       datetime64[ns]
 1   USER       category         <-----------------------
 2   SEARCHSTR  category         <-----------------------
 3   HIT        bool
dtypes: bool(1), category(2), datetime64[ns](1)
memory usage: 11.3 GB             <-----------------------
None
The advantage of this approach is that the outdf DF no longer stores the strings themselves in its memory space, but only small integer codes that point to the strings defined in the users_dtype & searchstr_dtype categories. This drastically reduces the amount of memory you need to store this DF and boosts the performance of your queries.
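The 11.3 GB figure also matches a quick back-of-the-envelope calculation (assuming pandas picks the smallest integer type for the category codes, as it normally does): DATE takes 8 bytes per row, USER needs int16 codes (1,000 categories) at 2 bytes, SEARCHSTR needs int8 codes (50 categories) at 1 byte, and HIT takes 1 byte. That is 12 bytes per row, and 12 bytes × 1,000,000,000 rows ≈ 11.2 GiB. With the two object columns instead, the same DF would need roughly 25 bytes per row just for the pointers (8 bytes each for USER and SEARCHSTR), around 25 GB, before even counting the string data they point to.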
Results: with the category-based approach, the full 1-billion-row outdf fits in roughly 10GB of memory and string searches on it run in roughly 15 seconds (more on this in the summary at the end of the post).
And this is not the end, but the beginning. We can now start playing with our outdf DF and do lots of different types of analysis. And, since for each search string the mapping between indf and outdf is 1-to-1 (every input row produces exactly one output row per search string), you can easily identify which line(s) in outdf correspond to which lines in indf (there is a short sketch of this right after the code below).
This could be a simple piece of code to solve this problem:
# First, let's define the Categories ------------------------------------------------------------
users = indf['USER'].unique()
users_dtype = pd.api.types.CategoricalDtype(categories=users.tolist(), ordered=False)

searchstr = ['srchstring1', 'srchstring2', ... , 'srchstring50']
searchstr_dtype = pd.api.types.CategoricalDtype(categories=searchstr, ordered=False)

# Then, let's define the DFs --------------------------------------------------------------------
outdf = pd.DataFrame({
    "DATE":      pd.Series([], dtype='datetime64[ns]'),
    "USER":      pd.Categorical([], dtype=users_dtype),
    "SEARCHSTR": pd.Categorical([], dtype=searchstr_dtype),
    "HIT":       pd.Series([], dtype=bool)
})

# We will be using a temporary DF for each string search
tmpdf = pd.DataFrame({
    "DATE":      pd.Series([], dtype='datetime64[ns]'),
    "USER":      pd.Categorical([], dtype=users_dtype),
    "SEARCHSTR": pd.Categorical([], dtype=searchstr_dtype),
    "HIT":       pd.Series([], dtype=bool)
})

# Now the search loop ----------------------------------------------------------------------------
for sstr in searchstr:
    # Let's read the input DATE / USER data, and set the SEARCHSTR
    tmpdf[['DATE','USER']] = indf[['DATE','USER']]
    tmpdf['SEARCHSTR'] = sstr
    # Let's adjust USER & SEARCHSTR dtypes to categories
    tmpdf['USER'] = tmpdf['USER'].astype(users_dtype)
    tmpdf['SEARCHSTR'] = tmpdf['SEARCHSTR'].astype(searchstr_dtype)
    # Search for the Search String sstr in the LOG column of the original data
    tmpdf['HIT'] = indf['LOG'].str.contains(sstr)
    # Append tmpdf to outdf
    outdf = outdf.append(tmpdf, ignore_index=True)
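Once outdf has been built this way, the 1-to-1 mapping mentioned earlier is easy to exploit. Here is a minimal sketch for one of the search strings, assuming indf has a default RangeIndex and outdf was built with the loop above (one block of len(indf) rows per search string, appended in order with ignore_index=True), so that indf row = outdf index % len(indf):

# Pull the original log lines that produced hits for a given search string
hits = outdf[(outdf['SEARCHSTR'] == 'srchstring1') & (outdf['HIT'])]
hit_lines = indf.loc[hits.index % len(indf)]
print(hit_lines[['DATE', 'USER', 'LOG']].head())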
Note: Loops and temporary DFs are not an elegant way to approach problems in Data Science, and you would normally do this somewhat differently. However, for the purpose of this article, I think it is easier to understand if we do it this way.
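For reference, one (of several) more idiomatic ways to express the same computation is to build one small DF per search string and concatenate them all in a single call, instead of appending to outdf on every iteration (a sketch under the same assumptions as the loop above, not the only possible approach):

# Build one block per search string and concatenate once at the end
outdf = pd.concat(
    [pd.DataFrame({
        'DATE':      indf['DATE'],
        'USER':      indf['USER'].astype(users_dtype),
        'SEARCHSTR': pd.Categorical([sstr] * len(indf), dtype=searchstr_dtype),
        'HIT':       indf['LOG'].str.contains(sstr)})
     for sstr in searchstr],
    ignore_index=True)

The single pd.concat avoids the repeated copying that outdf.append() does on every iteration, which matters when the final DF is this large.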
Processing the original DataFrame indf to obtain your output outdf may take some time with the code above, so you may want to save your results to disk so you don't have to repeat the processing if your Jupyter kernel crashes.
The easiest way to save a DataFrame is using the feather format:
outdf.to_feather('outdf.feather')
The code to read it back (in case of crash) would be:
outdf=pd.read_feather('outdf.feather')
feather is a file format optimized for saving DataFrames; it's very easy to deal with and quite efficient, so it's probably your best bet (there are other formats like Pickle or HDF5 if you want portability to/from, for instance, Big Data platforms like Hadoop; see the References for more information, but we don't need that now).
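One nice property worth checking for yourself: with a pyarrow-backed feather file (the default in recent pandas), the category dtypes survive the round trip, so you don't need to redefine them after a reload (a quick sanity check, assuming the outdf built above):

# Save, reload and confirm the dtypes (including the categories) are preserved
outdf.to_feather('outdf.feather')
outdf2 = pd.read_feather('outdf.feather')
print(outdf2.dtypes)
# DATE         datetime64[ns]
# USER               category
# SEARCHSTR          category
# HIT                    bool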
Can't be any simpler, right?
In this article I explained how to efficiently translate a problem that was originally difficult to deal with (doing analysis and correlation on data contained in 20GB of logs) into a friendlier one: dealing with a DF that contains hits for some keywords we are interested in. And even though it is a huge 1-billion-row DF, we've used the pandas category dtype to shape it in a way that it barely uses 10GB of memory and can be searched in roughly 15 seconds.
I hope you enjoyed this post and I trust it will help you optimize those monstrous datasets we sometimes have to work with. I'm pretty sure you will find multiple scenarios where these techniques can be useful.
Happy DS4N6'ing!
JESS