The ds4n6_lib will typically store the data in two types of objects: pandas DataFrames (what we call df) and Collections of pandas DataFrames (what we call dfs).
When we mention the word “DataFrame” most people think of the “pandas DataFrame” object but, while the underlying DataFrame concept does not change, there are other DataFrames implemented by other frameworks, such as ElasticSearch or Apache Spark.
The most popular DataFrame in Data Science is the pandas one, but pandas DataFrames have an important limitation: they are fully loaded in memory. This means that the number and size of the pandas DFs you can use is determined by the amount of RAM in your computer and that, once you shut down your analysis environment (Jupyter kernel, script, etc.), the DataFrame vanishes. Also, some types of DS analysis may be better suited to other environments (distributed, structured, etc.).
To address the abovementioned pandas limitations, the DS4N6 framework will provide the possibility to save your DFs to file (in feather or pickle format) or to an ElasticSearch system.
Also, whenever it makes sense, we will use ES DFs directly to perform analysis types that are better suited for ElasticSearch.
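The file-based persistence described above can be sketched with plain pandas. This is a minimal illustration, not the ds4n6_lib implementation: the sample data and the file name `mydf.pkl` are made up for the example, and only standard pandas calls are used.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data standing in for a parsed artifact.
df = pd.DataFrame(
    {
        "EventID": [4624, 4625, 4624],
        "Computer": ["laptop", "laptop", "server"],
    }
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "mydf.pkl"  # arbitrary file name for this sketch
    df.to_pickle(path)                # persist the DF so it survives the session
    restored = pd.read_pickle(path)   # load it back, e.g. in a later session

# df.to_feather(...) / pd.read_feather(...) follow the same pattern,
# but require the optional pyarrow dependency to be installed.
print(restored.equals(df))
```

Pickle needs no extra dependency, which is why it is used here; feather is usually the better choice when the file has to be read from other tools.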
In DFIR analysis we commonly deal with multiple files (typically .csv) which correspond to the same evidence set. For instance:
In this context, it is especially useful to use a construct that we will call dfs, which is a collection of DataFrames and corresponds at the low level to a dictionary (python dict) of DataFrames.
We will use this construct also in other cases, such as for dividing a specific .evtx file into multiple DFs, one per EventID, or dividing a plaso database into multiple DFs, one per plaso parsed artifact.
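As a rough sketch of the idea, a dfs can be built by hand from a single DataFrame by splitting it on one column. The data and the variable names below are hypothetical, and the library's own reading functions may build the dict differently.

```python
import pandas as pd

# Hypothetical event log data standing in for a parsed .evtx file.
events = pd.DataFrame(
    {
        "EventID": [4624, 4625, 4624, 4688],
        "Computer": ["laptop", "laptop", "server", "laptop"],
    }
)

# dfs: a plain python dict with one DataFrame per EventID.
dfs = {
    int(event_id): group.reset_index(drop=True)
    for event_id, group in events.groupby("EventID")
}

# Each entry is an ordinary pandas DataFrame.
for event_id, df in dfs.items():
    print(event_id, len(df))
```

Because a dfs is just a dict, a single DF can be picked out with `dfs[4624]` and analyzed like any other pandas DataFrame.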
The xmenu() GUI function will facilitate the process of choosing and analyzing a dfs, so whenever you encounter a dfs object, you can do xmenu(mydfs) in order to choose one DF out of the bundle and analyze it (using qgrid, ag-grid, etc.).
D4 is short for DS4N6. When we need to save space (e.g. in column names), and we want to make it clear to the user that a specific field has been introduced by the ds4n6_lib, that field will typically have D4/d4 in its name.
Examples:
When your tool output is read with the ds4n6_lib (e.g. with xread), the reading function will add the following 5 columns to the DF before your data:
Column | Description |
---|---|
D4_DataType_ | Type of file |
D4_Orchestrator_ | Orchestrator tool (if any) |
D4_Tool_ | Name of the tool used to process the artifact |
D4_Plugin_ | Plugin name |
D4_Hostname_ | Name of the device where the evidence has been found |
Example:
Column | Value |
---|---|
D4_DataType_ | flist |
D4_Orchestrator_ | kape |
D4_Tool_ | kape |
D4_Plugin_ | FileSystem-MFTECmd_$MFT |
D4_Hostname_ | laptop |
A (legitimate) question you may ask on seeing these values repeated over and over again in every row of the DataFrame is whether repeating those strings in memory so many times (once per row of the DF) is a waste of memory. The answer is (mostly) no. The pandas data type associated with those fields is “categorical”, which is managed very efficiently. So while repeating those entries does consume a small amount of memory, it is negligible compared to the rest of the data.
The ds4n6_lib provides a number of GUI functions, such as GUI File Choosers or DataFrame analysis environments (e.g. xmenu() + qgrid).
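The memory argument can be checked with a small experiment in plain pandas. This is a sketch with made-up data: one of the example values above is repeated 100,000 times, stored once as ordinary Python strings and once as a categorical.

```python
import pandas as pd

n_rows = 100_000

# The same string repeated in every row, stored as regular Python objects.
as_object = pd.Series(
    ["FileSystem-MFTECmd_$MFT"] * n_rows, name="D4_Plugin_", dtype="object"
)

# The same data as a categorical: one copy of the string plus small integer codes.
as_category = as_object.astype("category")

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The categorical version stores each distinct string only once, so its cost per row is roughly the size of one small integer code.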
In order to easily recover the output of a GUI tool during interactive analysis, the ds4n6_lib provides a global variable called d4.out.
For instance, when you perform a file reading operation using the graphical File Chooser, or you export a DF after some filtering analysis with qgrid, the output results will be provided through d4.out.
What you should do then is to assign a variable of your choosing to this global variable to retain its value (mydf = d4.out), immediately after performing the GUI operation.
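The reason for copying the value right away can be sketched with a stand-in object. Nothing below is ds4n6_lib code: `d4` is simulated with a `SimpleNamespace`, `fake_gui_operation` is a hypothetical helper, and the example assumes that each GUI operation overwrites d4.out with its own result.

```python
from types import SimpleNamespace

# Stand-in for the library's global output variable (assumption for this sketch).
d4 = SimpleNamespace(out=None)


def fake_gui_operation(result):
    """Hypothetical GUI action that publishes its result through d4.out."""
    d4.out = result


fake_gui_operation("first result")
mydf = d4.out                        # keep your own reference immediately

fake_gui_operation("second result")  # the next GUI action replaces d4.out

print(mydf)    # still the first result
print(d4.out)  # now the second result
```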
In summary, things you need to know about d4.out:
Data Formats:
File Extensions:
In Data Science you will commonly see several file formats (feather, pickle, HDF, etc.), but different people/programs may use different extensions. We will commonly use the following conventions: