[ds4n6_lib] User Manual (v0.5)

[ds4n6_lib] Essentials

Notable Objects

The ds4n6_lib will typically store the data in two types of objects: pandas DataFrames (what we call df) and Collections of pandas DataFrames (what we call dfs).

df - DataFrames

When we mention the word “DataFrame” most people think of the “pandas DataFrame” object but, while the underlying DataFrame concept does not change, there are other DataFrames implemented by other frameworks, such as ElasticSearch or Apache Spark.

The most popular DataFrame in Data Science is the pandas one, but pandas DataFrames have an important limitation: they are fully loaded in memory. This means that the number and size of the pandas DFs you can use is limited by the amount of RAM in your computer and that, once you shut down your analysis environment (Jupyter Kernel, script, etc.), the DataFrame vanishes. Also, some types of DS analysis may be better suited to other environments (distributed, structured, etc.).

To address the above-mentioned pandas limitations, the DS4N6 framework will let you save your DFs to file (in feather or pickle format) or to an ElasticSearch system.

Also, whenever it makes sense, we will use ES DFs directly to perform the types of analysis that are better suited to ElasticSearch.
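As a rough illustration of persisting a DF to file so that it survives a kernel restart, the sketch below uses plain pandas rather than any ds4n6_lib helper. The DataFrame contents and the file name events.pkl are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for real tool output
df = pd.DataFrame({"EventID": [4624, 4625], "Hostname": ["laptop", "server"]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "events.pkl")
    df.to_pickle(path)               # write the DF to disk in pickle format
    restored = pd.read_pickle(path)  # read it back, e.g. in a later session

print(restored.equals(df))
```

The feather equivalent in pandas is df.to_feather(...) and pd.read_feather(...), which additionally require the pyarrow package to be installed.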

dfs - Collections (dict) of DataFrames

In DFIR analysis we commonly work with multiple files (typically .csv) that correspond to the same evidence set. For instance:

Multiple evtx files
Multiple kansa .csv files for different artifacts
Multiple kape .csv files for different artifacts
Multiple autoruns .csv files corresponding to different hosts

In this context, it is especially useful to have a construct that we will call dfs: a collection of DataFrames which, at the low level, is a dictionary (python dict) of DataFrames.

We will use this construct also in other cases, such as for dividing a specific .evtx file into multiple DFs, one per EventID, or dividing a plaso database into multiple DFs, one per plaso parsed artifact.
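To make the dfs idea concrete, here is a minimal sketch in plain pandas that splits one DataFrame into a dict of DataFrames, one per EventID. It is not the library's own splitting code. The sample rows and the variable names events and dfs are assumptions for illustration.

```python
import pandas as pd

# Hypothetical event log data; a real DF would come from parsed .evtx output
events = pd.DataFrame({
    "EventID": [4624, 4625, 4624, 4688],
    "Hostname": ["laptop", "laptop", "server", "server"],
})

# Build the dfs: one DataFrame per EventID, keyed by the EventID value
dfs = {
    event_id: group.reset_index(drop=True)
    for event_id, group in events.groupby("EventID")
}

# A dfs is an ordinary dict, so the usual dict operations apply
for event_id, df in dfs.items():
    print(event_id, len(df))
```

Because a dfs is just a dict, a single DF can be picked with dfs[4624], and the set of available DFs is simply dfs.keys().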

The xmenu() GUI function facilitates choosing and analyzing a dfs, so whenever you encounter a dfs object, you can run xmenu(mydfs) to choose one DF out of the bundle and analyze it (using qgrid, ag-grid, etc.).

D4 Conventions

What is D4?

D4 is short for DS4N6. When we need to save space (e.g. in column names) and want the user to notice that a specific field has been introduced by the ds4n6_lib, that field will typically have D4/d4 in its name.

Examples:

D4_* column names after reading your tool output data with xread.
Library imports:
- import ds4n6_lib.kansa as d4ksa

What are those five D4_* Columns at the beginning of my data DF?

When your tool output is read with the ds4n6_lib (e.g. with xread), the reading function will add the following 5 columns to the DF before your data:

D4_DataType_	Type of file
D4_Orchestrator_	Orchestrator tool (if any)
D4_Tool_	The tool used to process the artifact name
D4_Plugin_	Plugin name
D4_Hostname_	Name of the device where the evidence has been found

Example:

D4_DataType_		flist
D4_Orchestrator_	kape
D4_Tool_		kape
D4_Plugin_		FileSystem-MFTECmd_$MFT
D4_Hostname_		laptop

A legitimate question, on seeing these values repeated in every row of the DataFrame, is whether repeating those strings in memory so many times (once per row) is a waste of memory. The answer is (mostly) no. The pandas data type associated with those fields is “categorical”, which is managed very efficiently. So while repeating those entries does consume a small amount of memory, it is negligible compared with the rest of the data.
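The memory argument can be checked with plain pandas. The sketch below compares the same repeated string stored as ordinary Python objects and as a categorical column. The row count is arbitrary, and the value is borrowed from the example above.

```python
import pandas as pd

n_rows = 100_000

# The same string repeated in every row, stored as Python string objects
as_text = pd.Series(["FileSystem-MFTECmd_$MFT"] * n_rows, dtype="object")

# The same data as a categorical: one stored string plus a small integer code per row
as_category = as_text.astype("category")

text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {text_bytes:>12,} bytes")
print(f"category: {category_bytes:>12,} bytes")
```

The categorical version keeps each distinct string only once, so its size grows with the number of rows by roughly one small integer per row instead of one full string per row.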

DF Column Names

_: At the end of a DF column name, denotes that the column has been generated artificially and is not part of the original data. In some cases, the original column may have a similar name without the “_”. This is ok and simply means that the Harmonized (HAM) column name is the same as the original column name.
%: At the end of a DF column name, denotes that the column has been added artificially to enrich the context of the column with the same name but without the “%”.
_K: At the end of a DF column name, denotes that the column has been added artificially to enrich, with related knowledge, the column with the same name but without the “_K”.
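The suffix conventions above can be applied mechanically. The helper below is a hypothetical sketch, not part of ds4n6_lib: the function name classify_column, the labels it returns, and all sample column names except D4_Hostname_ are invented for illustration.

```python
def classify_column(name: str) -> str:
    """Classify a DF column name by its trailing marker (illustrative only)."""
    if name.endswith("_K"):
        return "knowledge"   # enriches the same-named column with related knowledge
    if name.endswith("%"):
        return "context"     # enriches the context of the same-named column
    if name.endswith("_"):
        return "generated"   # generated artificially, not part of the original data
    return "original"        # no marker: column comes from the original data


# Hypothetical column names, apart from D4_Hostname_ which appears above
for column in ["D4_Hostname_", "SourceIP%", "EventID_K", "FileName"]:
    print(f"{column:<14} {classify_column(column)}")
```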

Global Variables You Need to Know

The d4.out Global Variable

The ds4n6_lib provides a number of GUI functions, such as GUI File Choosers or DataFrame analysis environments (e.g. xmenu() + qgrid).

In order to easily recover the output of a GUI tool during interactive analysis, the ds4n6_lib provides a global variable called d4.out.

For instance, when you perform a file reading operation using the graphical File Chooser, or you export a DF after some filtering analysis with qgrid, the output results will be provided through d4.out.

What you should do then is assign the value of this global variable to a variable of your choosing (mydf = d4.out), immediately after performing the GUI operation, so that the value is retained.

In summary, things you need to know about d4.out:

d4.out is a global variable that is used by multiple utilities, so it will be overwritten quickly (the next time you use another GUI). Save it as soon as you finish your GUI operation (mydf = d4.out).
The d4.out variable may hold different types of data depending on the specific operation that you are doing. It will commonly hold a DF, but it may also hold other data types (such as a dict of DFs, i.e. a dfs).
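The save-it-immediately pattern can be simulated without the library. In the sketch below, d4 is a stand-in namespace object, not the real ds4n6_lib module, and fake_gui_operation is a hypothetical placeholder for any GUI tool that writes its result to d4.out.

```python
from types import SimpleNamespace

import pandas as pd

# Stand-in for the library module (assumption: the real one exposes an "out" attribute)
d4 = SimpleNamespace(out=None)


def fake_gui_operation(result):
    """Hypothetical GUI tool: leaves its result in the shared d4.out variable."""
    d4.out = result


# First GUI operation produces a DF; save it right away
fake_gui_operation(pd.DataFrame({"EventID": [4624]}))
mydf = d4.out

# A later GUI operation overwrites d4.out, here with a dict of DFs
fake_gui_operation({"4624": pd.DataFrame(), "4625": pd.DataFrame()})
mydfs = d4.out

# mydf still holds the first result even though d4.out has changed
print(type(mydf).__name__, type(mydfs).__name__)
```

Checking the type of the saved value, as in the last line, is a simple way to tell whether a given operation returned a single DF or a dfs.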

Commonly Used Regular Expressions

d4.ipregex: IPv4 Regular Expression
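The exact pattern stored in d4.ipregex is not shown here, so the sketch below defines its own IPv4 pattern to illustrate how such a regular expression is typically used. The names octet and ipv4_regex and the sample log line are assumptions, and the pattern should not be taken as the library's actual value.

```python
import re

# One octet: 0-255 (illustrative pattern, not the value of d4.ipregex)
octet = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets separated by dots, delimited by word boundaries
ipv4_regex = rf"\b{octet}(?:\.{octet}){{3}}\b"

# Hypothetical log line containing two valid addresses and one invalid one
line = "Accepted connection from 192.168.1.10 to 10.0.0.5, ignored 999.1.1.1"

matches = re.findall(ipv4_regex, line)
print(matches)
```

The same pattern string can be passed to pandas string methods such as Series.str.contains or Series.str.extract (the latter needs a capturing group around the pattern) to filter or extract addresses from a DF column.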

Data & File Conventions

Data Formats:

raw: Original Data, without Adjustments, Harmonization, etc.
ham: Harmonized Artifact Model (HAM). This format corresponds to data that has been harmonized following the HAM model in terms of column names and data types.

File Extensions:

In Data Science you will commonly see several file formats (feather, pickle, HDF, etc.), but different people/programs may use different extensions. We will commonly use the following conventions:

.pkl: pickle
.fth: feather