Insider Threat Detection with AI Using Tensorflow and RapidMiner Studio | By Dennis Chow

Summary

This technical article will teach you how to pre-process data, create your own neural networks, and train and evaluate models using the US-CERT's simulated insider threat dataset. The methods and solutions are designed for non-domain experts, particularly cyber security professionals. We will start with the raw data provided by the dataset and show examples of different pre-processing methods to get it "ready" for the AI solution to ingest. We will ultimately create models that can be re-used for additional predictions based on security events. Throughout the article, I will also point out the applicability and return on investment depending on your existing Information Security program in the enterprise.

Note: To use and replicate the pre-processed data and steps we use, prepare to spend 1-2 hours on this page. Stay with me and try not to fall asleep during the data pre-processing portion. What many tutorials don't state is that if you're starting from scratch, data pre-processing can take up to 90% of your time on projects like these.

At the end of this hybrid article and tutorial, you should be able to:

  • Pre-process the data provided from US-CERT into an AI solution ready format (Tensorflow in particular)
  • Use RapidMiner Studio and Tensorflow 2.0 + Keras to create and train a model using a pre-processed sample CSV dataset
  • Perform basic analysis of your data and the fields chosen for AI evaluation, and understand how practical the described methods are for your organization

Disclaimer

The author provides these methods, insights, and recommendations *as is* and makes no claim of warranty. Please do not use the models you create in this tutorial in a production environment without sufficient tuning and analysis before making them a part of your security program.

Tools Setup

If you wish to follow along and perform these activities yourself, please download and install the following tools from their respective locations:

Overview of the Process

It's important for newcomers to any data science discipline to know that the majority of your time will be spent pre-processing and analyzing the data you have. That includes cleaning up the data, normalizing it, extracting any additional meta insights, and then encoding the data so that it is ready for an AI solution to ingest.

  1. We need to extract and process the dataset so that it is structured with the fields we may need as 'features', which are simply the columns to be included in the AI model we create. We will need to ensure all the text strings are encoded into numbers so the engine we use can ingest them. We will also have to mark which rows are insider threat and which are non-threat (true positives and true negatives).
  2. Next, after data pre-processing, we'll need to select, set up, and create the functions we will use to build the model and the neural network layers themselves.
  3. Generate the model, examine its accuracy and applicability, and identify additional modifications or tuning needed in any part of the data pipeline.

Examining the Dataset Hands on and Manual Pre-Processing

Examining the raw US-CERT data requires you to download compressed files that must be extracted. Note just how large the sets are compared to how much we will use and reduce at the end of the data pre-processing.

In our article, we saved a bunch of time by going directly to answers.tar.bz2, which has the insiders.csv file for matching which datasets and individual extracted records are of value. It is worth stating that the index provided has correlated record numbers in extended data, such as the file and psychometric related data. We didn't use the extended meta in this tutorial because of the extra time needed to correlate and consolidate all of it into a single CSV.

To see a more comprehensive set of feature sets extracted from this same data, consider checking out this research paper called "Image-Based Feature Representation for Insider Threat Classification." We'll be referring to that paper later in the article when we examine our model accuracy.

Before getting the data encoded and ready for a function to read it, we need to get the data extracted and categorized into the columns we need, including the one we want to predict. Let's use good old Excel to insert a column into the CSV. Prior to the screenshot, we took all the rows from the datasets referenced in "insiders.csv" for scenario 2 and added them.

The scenario (2) is described in scenarios.txt: "User begins surfing job websites and soliciting employment from a competitor. Before leaving the company, they use a thumb drive (at markedly higher rates than their previous activity) to steal data."

Examine our pre-processed data, which includes its intermediary and final forms, as shown below:

The above screenshot is a snippet of all the different record types essentially appended to each other and properly sorted by date. Note that different vectors (http vs. email vs. device) do not align easily and have different contexts in the columns. This is not optimal by any means, but since the insider threat scenario includes multiple event types, this is what we'll have to work with for now. This is the usual case with data that you get when trying to correlate based on time and multiple events tied to a specific attribute or user, like a SIEM does.

Mis-matched column data above that we need to normalize

In the aggregation set, we combined the relevant CSVs after moving all of the items mentioned in insiders.csv for scenario 2 into the same folder. To build the entire 'true positive' only portion of the dataset, we used PowerShell.
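
The exact PowerShell commands are not reproduced in the text, so here is a rough Python sketch of the same idea using only the standard library. It appends rows from several CSV sources whose columns don't line up, fills the gaps with empty strings, and sorts by date. The file contents, column names, and the `combine_csv_sources` helper are assumptions for illustration, not the actual US-CERT layout.

```python
import csv
import io

def combine_csv_sources(sources):
    """Append rows from several CSV sources that may have different columns.

    `sources` maps a vector name (e.g. "http") to an open text handle.
    Columns missing from a source are filled with empty strings.
    """
    rows, columns = [], ["vector"]
    for vector, handle in sources.items():
        for row in csv.DictReader(handle):
            row["vector"] = vector
            for name in row:
                if name not in columns:
                    columns.append(name)
            rows.append(row)
    # ISO-style timestamps sort correctly as strings; other formats need parsing first.
    rows.sort(key=lambda r: r.get("date", ""))
    return columns, [{c: r.get(c, "") for c in columns} for r in rows]

# Hypothetical in-memory stand-ins for the real http and device extracts.
http_csv = io.StringIO(
    "id,date,user,pc,url\nh1,2010-01-02 09:00:00,ABC0001,PC-1,http://example.com\n"
)
device_csv = io.StringIO(
    "id,date,user,pc,activity\nd1,2010-01-01 08:00:00,ABC0001,PC-1,Connect\n"
)

columns, combined = combine_csv_sources({"http": http_csv, "device": device_csv})
```

The empty strings this produces are the same "mis-matched column" gaps we have to normalize later.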

Right now we have a completely imbalanced dataset with only true positives. We'll also have to add true negatives, and the best approach is a 50/50 split, with an equal number of non-threat records for each record type. This is almost never the case with security data, so we'll do what we can, as you'll find below. I also want to point out that if you're doing manual data processing in an OS shell, whatever you import into a variable stays in memory and is not released or garbage collected by itself. After a bunch of data manipulation and CSV wrangling, my PowerShell memory consumption had climbed to 5.6 GB.

Let's look at the R1 dataset files. We'll need to pull records that we know are confirmed true negatives (non-threats) for each of the 3 types, from the same filenames we used in the true positive dataset extracts (again, the R1 dataset has benign events).

We'll merge a number of records from all 3 of the R1 true negative data sets: the logon, http, and device files. Note that in the R1 true negative set we did not find an emails CSV, which adds to the imbalance in our aggregate data set.

Using PowerShell, we count the number of lines in each file. Since we had roughly 14K rows on the true positive side, I arbitrarily took the first 4500 applicable rows from each true negative file and appended them to the training dataset so that we have both true positives and true negatives mixed in. We'll have to add a column to mark which rows are insider threats and which aren't.
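
As a minimal sketch of that step in Python (not the PowerShell we actually ran), the helper below takes the first N data rows from a CSV and tags each with a label. The `first_rows` name, the sample columns, and the tiny in-memory file are assumptions for illustration; in the article the limit was 4500 rows per file.

```python
import csv
import io
from itertools import islice

def first_rows(handle, limit, label):
    """Return up to `limit` data rows from a CSV handle, each tagged with a label."""
    reader = csv.DictReader(handle)
    return [dict(row, insiderthreat=label) for row in islice(reader, limit)]

# Hypothetical stand-in for one R1 (benign) file with 10 rows.
sample = io.StringIO(
    "id,date,user\n" + "".join(f"r{i},2010-01-01,U{i}\n" for i in range(10))
)
negatives = first_rows(sample, limit=4, label=0)
```

Because `islice` stops reading once the limit is reached, the whole file never has to sit in memory.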

In pre-processing our data, we've already added all the records of interest and selected various other true-negative, non-threat records from the R1 dataset. Now we have our baseline of threats and non-threats concatenated in a single CSV. To the left, we've added a new column that denotes true/false (1 or 0), filled in using find and replace.

Above, you can also see that we started changing true/false strings to numerical categories. This is the start of encoding the data through manual pre-processing. We could save ourselves the hassle, as we'll see in later steps with RapidMiner Studio and with the Pandas DataFrame library in Python for Tensorflow, but we wanted to illustrate some of the steps and considerations involved. Following this, we will continue processing our data for a bit. Let's highlight what we can do using Excel functions before going the fully automated route.

We're also going to manually convert the date field into Unix Epoch Time for the sake of demonstration. As you can see, it becomes a large integer in a new column. To remove the old column in Excel and rename the new one, create a new sheet such as 'scratch' and cut the old date (non-epoch timestamp) values into that sheet. Reference the sheet along with the formula you see in the cell to achieve this effect. The formula is: "=(C2-DATE(1970,1,1))*86400" without quotes.
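
For comparison, the same conversion can be sketched in Python. Like the Excel formula, it ignores time zones by treating the timestamp as UTC. The `to_epoch` helper and the month/day/year format string are assumptions; adjust the format to match how your date column is actually written.

```python
from datetime import datetime, timezone

def to_epoch(text, fmt="%m/%d/%Y %H:%M:%S"):
    """Convert a timestamp string to Unix epoch seconds, treating it as UTC."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

# Hypothetical sample value in the assumed format.
epoch = to_epoch("01/02/2010 07:30:00")
```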

In our last manual pre-processing example, you need to 'categorize' the CSV by label encoding the data. You can automate this, or use one-hot encoding instead, via a data dictionary in a script. In our case we show the manual method of mapping this in Excel, since we have a finite set of vectors for the records of interest (http is 0, email is 1, and device is 2).
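
Here is a small sketch of that data dictionary idea, using the same mapping (http is 0, email is 1, device is 2). The function names are made up for illustration. Label encoding gives each vector a single integer, while one-hot encoding gives it its own 0/1 slot so the model doesn't read an ordering into the numbers.

```python
# Same mapping as the manual Excel step.
VECTOR_CODES = {"http": 0, "email": 1, "device": 2}

def label_encode(vector):
    """Map a vector name to its integer code."""
    return VECTOR_CODES[vector]

def one_hot_encode(vector):
    """Map a vector name to a list with a single 1 at its code position."""
    target = VECTOR_CODES[vector]
    return [1 if code == target else 0 for code in sorted(VECTOR_CODES.values())]
```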

You'll notice that we have not done the user, source, or action columns, as they have a very large number of unique values that need label encoding and it's just impractical by hand. We were able to accomplish this without all the manual wrangling above using the 'Turbo Prep' feature of RapidMiner Studio, and likewise for the remaining columns via Python's Pandas in our script snippet below. Don't worry about this for now; we will showcase the steps in each AI tool and end up doing the same thing the easy way.

#print(pd.unique(dataframe['user']))
#https://pbpython.com/categorical-encoding.html
dataframe["user"] = dataframe["user"].astype('category')
dataframe["source"] = dataframe["source"].astype('category')
dataframe["action"] = dataframe["action"].astype('category')
dataframe["user_cat"] = dataframe["user"].cat.codes
dataframe["source_cat"] = dataframe["source"].cat.codes
dataframe["action_cat"] = dataframe["action"].cat.codes


#print(dataframe.info())
#print(dataframe.head())


#save dataframe with new columns for future data mapping
dataframe.to_csv('dataframe-export-allcolumns.csv')


#remove old columns
del dataframe["user"]
del dataframe["source"]
del dataframe["action"]
#restore original names of columns
dataframe.rename(columns={"user_cat": "user", "source_cat": "source", "action_cat": "action"}, inplace=True)

The above snippet is an example of using Python's Pandas library to manipulate and label encode the columns into numerical values unique to each string value in the original data set. Try not to get caught up in this yet. We're going to show you the easy and comprehensive approach to all this data science work in RapidMiner Studio.

Important step for defenders: We're using a simulated dataset that US-CERT has already formatted, and not every SOC is going to have access to the same uniform data for their own security events. Many times your SOC will have only raw logs to export. From an ROI perspective, before pursuing your own DIY project like this, consider the level of effort and whether you can automate exporting the meta of your logs into a CSV format; an enterprise solution such as Splunk or another SIEM might be able to do this for you. You would have to correlate your events and add as many columns as possible for enriched data formatting. You would also have to examine how consistently you can automate exporting this data in a format like the one US-CERT uses, so that you can apply similar methods for pre-processing or ingestion. Make use of your SIEM's API features to export reports into a CSV format whenever possible.

Walking through RapidMiner Studio with our Dataset

It's time to use some GUI based and streamlined approaches. The desktop edition of RapidMiner is Studio, and the latest editions as of 9.6.x have Turbo Prep and Auto Model built in as part of your workflows. Since we're not domain experts, we are definitely going to take advantage of this. Let's dig right in.

Note: If your trial expired before getting to this tutorial and you are using the community edition, you will be limited to 10,000 rows. Further pre-processing is required to limit your dataset to 5K true positives and 5K true negatives, including the header. If applicable, use an educational license, which is unlimited and renewable each year that you are enrolled in a qualifying institution with a .edu email.
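
If you do need to trim the data to fit the row limit, one way to sketch it in Python is to randomly sample a capped number of rows per class. The `cap_per_class` helper, the seed, and the toy rows are assumptions for illustration; the `insiderthreat` key matches the label column we use in this article. Set `per_class` slightly under 5000 if the header counts against your limit.

```python
import random

def cap_per_class(rows, label_key="insiderthreat", per_class=5000, seed=42):
    """Randomly keep at most `per_class` rows for each label value (1 and 0)."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    capped = []
    for label in (1, 0):
        group = [r for r in rows if r[label_key] == label]
        capped.extend(rng.sample(group, min(per_class, len(group))))
    return capped

# Toy data: 14 threat rows and 9 non-threat rows.
rows = [{"id": i, "insiderthreat": 1 if i < 14 else 0} for i in range(23)]
balanced = cap_per_class(rows, per_class=5)
```

Random sampling scrambles the date order, so re-sort by date afterwards if ordering matters to you.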

Upon starting we're going to start a new project and utilize the Turbo Prep feature. You can use other methods or the manual way of selecting operators via the GUI in the bottom left for the community edition. However, we're going to use the enterprise trial because it's easy to walk through for first-time users.

We'll import our aggregate CSV of true-positive-only, non-processed data. We'll also remove the first row of headers and use our own, because the original header row relates to the HTTP vector and does not apply to the USB device connection and email records that are also in the dataset, as shown below.

Note: Unlike our manual pre-processing steps, which included label encoding and reduction, we have not done this yet in RapidMiner Studio, to show the full extent of what we can easily do in the 'Turbo Prep' feature. We're going to enable the use of quotes as well and leave the other defaults for proper string escapes.

Next, we set our column header types to their appropriate data types.

Stepping through the wizard, we arrive at the Turbo Prep tab for review. It shows us the distribution and any errors, such as missing values that need to be adjusted, and which columns might be problematic. Let's start by making sure we identify all of these true positives as insider threats. Click on generate and we'll transform this dataset by inserting a new column in all the rows with a logical 'true' statement, as shown below.

We'll save the column details and export it for further processing later or we'll use it as a base template set for when we begin to pre-process for the Tensorflow method following this to make things a little easier.

After the export, don't forget that we need to balance the data with true negatives. We'll repeat the same process to import the true negatives. Now we should see multiple datasets in our Turbo Prep screen.

In the above, even though we've only imported 2 datasets, remember that we transformed the true positives by adding a column called insiderthreat, which is a true/false boolean. We do the same with the true negatives, and you'll eventually end up with 4 of these listings.

We'll need to merge the true positives and true negatives into a 'training set' before we get to do anything fun with it. But first, we also need to drop columns that we don't think are relevant or useful, such as the transaction ID and the description column of scraped website keywords. None of the other record types have these, so they would contain a bunch of empty (null) values that aren't useful for calculating weights.

Important thought: As we've mentioned regarding other research papers, columns chosen for calculation (aka 'feature sets') that include complex strings have to be tokenized using natural language processing (NLP). This adds to your pre-processing requirements in addition to label encoding. In the Tensorflow + Pandas Python method, it would usually require wrangling multiple data frames and merging them together based on column keys for each record. While this is automated for you in RapidMiner, in Tensorflow you'll have to include it in your pre-processing script. More documentation about this can be found here.

Take note that we did not do this in our datasets, because you'll see much later in an optimized RapidMiner Studio recommendation that putting heavier weight and emphasis on the date and time gave more efficient feature sets with less complexity. You, on the other hand, may need NLP for sentiment analysis on different datasets and applications to add to the insider threat modeling.

Finishing your training set: Although we do not illustrate this, after you have imported both true negatives and true positives, click the "merge" button within the Turbo Prep menu, select both transformed datasets, and choose the "Append" option, since both have been pre-sorted by date.
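
Outside of RapidMiner, the same append can be sketched in a few lines of Python. This is only an illustration with made-up rows: a plain append stacks one set after the other, and because both inputs are already sorted by date, `heapq.merge` can interleave them into a single date-ordered list if you prefer that.

```python
from heapq import merge

# Hypothetical pre-sorted rows; dates are Unix epoch seconds.
positives = [
    {"date": 1262304000, "insiderthreat": 1},
    {"date": 1262390400, "insiderthreat": 1},
]
negatives = [{"date": 1262307600, "insiderthreat": 0}]

# Plain append: true positives followed by true negatives.
appended = positives + negatives

# Interleave by date, relying on both inputs already being sorted.
training = list(merge(positives, negatives, key=lambda row: row["date"]))
```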

Continue to the Auto Model feature

Within RapidMiner Studio we continue to the 'Auto Model' tab and use our selected aggregate 'training' data (remember, training data includes true positives and true negatives) to predict on the insiderthreat column (true or false).

We also notice what our actual balance is. We are still imbalanced, with only 9,001 non-threat records vs. ~14K threat records. That can always be padded with additional records should you choose. For now, we'll live with it and see what we can accomplish with not-so-perfect data.
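
To put a number on that imbalance, here is a quick sketch that computes the share of each class. The threat count is only given as "~14K" above, so the 14,000 figure below is an approximation, not the exact row count. The largest share is also the accuracy a model would get by always guessing the bigger class, which is a handy yardstick when reading accuracy figures later; with these approximate counts it comes out to roughly 0.61.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of rows belonging to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Approximate counts: ~14K threats (assumed 14,000) and 9,001 non-threats.
labels = [1] * 14000 + [0] * 9001
shares = class_shares(labels)

# Accuracy of always predicting the larger class.
majority_baseline = max(shares.values())
```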

Here the auto modeler recommends different feature columns in green and yellow, along with their respective correlation. The interesting thing is that it estimates date to be of high correlation but less stable than action and vector.

Important thought: In our heads, as defenders, we would think every column in the feature set applies, since we've already reduced what we could as far as relevance and complexity. It's also worth mentioning that this is based on a single event. Remember that insider threats often involve multiple events, as we saw in the answers portion of insiders.csv. What the green indicators show us is the identification of unique, single-event records.

We're going to use all the columns anyway because we think they are all relevant. We then move to the next screen on model types. Because we're not domain experts, we're going to try almost all of them, and we want the computer to re-run each model multiple times to find the optimized set of inputs and feature columns.

Remember that feature sets can include meta information based on insights from existing columns. We leave the default values for tokenization, and we want to extract date and text information. The obvious column with free-form text is 'Action', with all the different URLs and event activity that we want NLP applied to. We also want to correlate between columns, rank the importance of columns, and explain predictions.

Note that in the above we've selected a bunch of heavy processing parameters for our batch job. On an 8 core single threaded processor running Windows 10, with 24 GB of memory, a Radeon RX570 value series GPU, and SSDs, all of these models took about 6 hours to run in total with all the options set. After everything completed, we had 8000+ models and 2600+ feature set combinations tested in our comparison screen.

According to RapidMiner Studio, the deep learning neural network methods aren't the best ROI fit compared to the generalized linear model. There are no errors, though, and that's worrisome: it might mean that we have poor quality data or an overfit issue with the model. Let's take a look at deep learning, as it also states a potential 100% accuracy, just to compare.

The Deep Learning model above was tested against 187 different combinations of feature sets. The optimized model differs from our own thinking about which features would be good, which was mostly the vector and action. We see even more weight put on the tokens of interesting words in Action and on the dates. Surprisingly, we did not see anything related to "email" or the word "device" in the actions as part of the optimized model.

Not to worry, as this doesn't mean we're dead wrong. It just means the feature sets it selected in its training (columns and extracted meta columns) produced fewer errors in the training set. This could be because we don't have enough diverse or high quality data in our set. In the previous screen above you saw an orange circle and a translucent square.

The orange circle indicates the model's suggested optimizer function and the square is our original selected feature set. If you examine the scale, our human selected feature set had an error rate of around 0.9 to 1%, which puts our accuracy closer to the 99% mark, but only with a much more complex model (more layers and connections required in the neural net). That makes me feel a little better, and it goes to show that caution is needed when interpreting all of these at face value.

Tuning Considerations

Let's say you don't fully trust such a "100% accurate" model. We can try to re-run it using our feature sets in a vanilla manner, as pure label tokens. We're *not* going to extract date information or tokenize text via NLP, and we don't want it to automatically create new feature set meta based on our original selections. Basically, we're going to use a plain vanilla set of columns for the calculations.

So in the above, let's re-run it looking at 3 different models, including the original best fit model and deep learning, with absolutely no optimization or additional NLP applied. It's as if we used only the encoded label values in the calculations and not much else.

In the above, we get even worse results: an error rate of 39%, which is 61% accuracy, across pretty much all the models. Our selection is so reduced, and so lacking in complexity without text token extraction, that even a more "primitive" Bayesian model (commonly used in basic email spam filter engines) seems to be just as accurate and has a fast compute time. This all looks bad, but let's dig a little deeper:

When we select the details of the deep learning model again, we see the accuracy climb in linear fashion as more of the training set population is discovered and validated against. From an interpretation standpoint, this shows us a few things:

  • Our original primitive idea of a feature set, focusing on the vector and action frequency using only unique encoded values, is only about as good as the toss-up probability of an analyst finding a threat in the data. On the surface it appears that we gain at best 10% in our chances of detecting an insider threat.
  • It also shows that even though action and vector were first flagged 'green' as better input selections for unique record events, the opposite was true for insider threat scenarios, where we need to think about multiple events for each incident/alert. In the optimized model, many of the weights and tokens used were time correlated and action token words.
  • This also tells us that our base data quality for this set is rather low. We would need additional context and possibly sentiment analysis of each user for each unique event, which is also covered by the HR data metric 'OCEAN' in the psychometric.csv file. Using tokens through NLP, we could possibly tune further by including the columns with a mixture of nulls, such as the website descriptor words from the original data sets and maybe files.csv, which would have to be merged into our training set using time and transaction ID as keys when performing those joins in our data pre-processing.
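
One way to start thinking in terms of multiple events rather than single rows is to roll events up per user per day. The sketch below is not something we ran in this article; it is a hypothetical starting point with made-up rows that counts events by user, day, and vector, which could then become extra feature columns (for example, a jump in daily device events for one user, as in scenario 2).

```python
from collections import Counter
from datetime import datetime, timezone

def daily_event_counts(events):
    """Count events per (user, UTC day, vector); dates are Unix epoch seconds."""
    counts = Counter()
    for event in events:
        day = datetime.fromtimestamp(event["date"], tz=timezone.utc).date().isoformat()
        counts[(event["user"], day, event["vector"])] += 1
    return counts

# Hypothetical rows in the shape of our encoded training data.
events = [
    {"date": 1262304000, "user": "ABC0001", "vector": "device"},
    {"date": 1262307600, "user": "ABC0001", "vector": "device"},
    {"date": 1262390400, "user": "ABC0001", "vector": "http"},
]
counts = daily_event_counts(events)
```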

Deploying your model optimized (or not)

While this section does not show screenshots, the last step in the RapidMiner studio is to deploy the optimized or non-optimized model of your choosing. Deploying locally in the context of studio won't do much for you other than to re-use a model that you really like and to load new data through the interactions of the Studio application. You would need RapidMiner Server to make local or remote deployments automated to integrate with production applications. We do not illustrate such steps here, but there is great documentation on their site at: https://docs.rapidminer.com/latest/studio/guided/deployments/

But what about Tensorflow 2.0 and Keras?

Maybe RapidMiner Studio wasn't for us, and everyone talks about Tensorflow (TF) as one of the leading solutions. But TF does not have a GUI. The new TF v2.0 includes the Keras API as part of the installation, which makes creating the neural net layers much easier, along with getting your data ingested from a Python Pandas DataFrame into model execution. Let's get started.

As you recall from our manual steps, we start with data pre-processing. We re-use the same scenario 2 data set and will use basic label encoding, like we did with our non-optimized model in RapidMiner Studio, to show you the comparison in methods and the fact that it's all statistics at the end of the day, based on algorithmic functions converted into libraries. Reusing the screenshot, remember that we did some manual pre-processing work and already converted the insiderthreat, vector, and date columns into numerical category values.

I've placed a copy of the semi-scrubbed data on GitHub if you wish to review the intermediate dataset before we run the Python script to pre-process it further.

Let's examine the Python code that helps us get to the final state we want.

The code can be copied below:

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from pandas.api.types import CategoricalDtype


#Use Pandas to create a dataframe
#In windows to get file from path other than same run directory see:
#https://stackoverflow.com/questions/16952632/read-a-csv-into-pandas-from-f-drive-on-windows-7


URL = 'https://raw.githubusercontent.com/dc401/tensorflow-insiderthreat/master/scenario2-training-dataset-transformed-tf.csv'
dataframe = pd.read_csv(URL)
#print(dataframe.head())


#show dataframe details for column types
#print(dataframe.info())


#print(pd.unique(dataframe['user']))
#https://pbpython.com/categorical-encoding.html
dataframe["user"] = dataframe["user"].astype('category')
dataframe["source"] = dataframe["source"].astype('category')
dataframe["action"] = dataframe["action"].astype('category')
dataframe["user_cat"] = dataframe["user"].cat.codes
dataframe["source_cat"] = dataframe["source"].cat.codes
dataframe["action_cat"] = dataframe["action"].cat.codes


#print(dataframe.info())
#print(dataframe.head())


#save dataframe with new columns for future data mapping
dataframe.to_csv('dataframe-export-allcolumns.csv')


#remove old columns
del dataframe["user"]
del dataframe["source"]
del dataframe["action"]
#restore original names of columns
dataframe.rename(columns={"user_cat": "user", "source_cat": "source", "action_cat": "action"}, inplace=True)
print(dataframe.head())
print(dataframe.info())


#save dataframe cleaned up
dataframe.to_csv('dataframe-export-int-cleaned.csv')




#Split the dataframe into train, validation, and test
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')


#Create an input pipeline using tf.data
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('insiderthreat')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds




#choose columns needed for calculations (features)
feature_columns = []
for header in ["vector", "date", "user", "source", "action"]:
    feature_columns.append(feature_column.numeric_column(header))


#create feature layer
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)


#set batch size pipeline
batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)


#create compile and train model
model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dense(1)
])


model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])


model.fit(train_ds,
          validation_data=val_ds,
          epochs=5)


loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

In our scenario we're going to ingest the data from GitHub. The comments include a link on how to load it from a file local to your disk instead. One thing to point out is that we use the Pandas dataframe construct and methods to manipulate the columns, using label encoding for the input. Note that this is not the optimized approach that RapidMiner Studio reported to us.

We're still using the same feature set columns as in the second round of modeling we re-ran in the previous screens, but this time in Tensorflow for method demonstration.

Note in the above there is an error: vector still shows 'object' as the Dtype. I was pulling my hair out looking, and I found I needed to update the dataset because I had not converted all the values in the vector column into numerical categories like I originally thought. Apparently, I was missing one. Once this was corrected and the errors were gone, the model training ran without a problem.

Unlike RapidMiner Studio, we can't just hand over one large training set and let the system do the rest for us. We must divide the data into smaller pieces: a training subset used to fit the model with known correct true/false insider threat labels, and reserved portions that are split off for validation and testing only. These are then fed to the model in batches.
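
To make the proportions concrete: the script holds out 20% of the rows for testing, then 20% of what remains for validation, which works out to roughly a 64/16/20 split. The plain-Python sketch below only illustrates that arithmetic. It is not how scikit-learn's train_test_split is implemented, and its rounding may differ by a row or so from the real function.

```python
import random

def three_way_split(rows, test_size=0.2, val_size=0.2, seed=0):
    """Shuffle rows, hold out a test share, then a validation share of the rest."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(rest) * val_size)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

# 1000 stand-in rows give 640 train, 160 validation, and 200 test rows.
train, val, test = three_way_split(list(range(1000)))
```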

Next we need to choose our feature columns, which again are the 'non optimized' set of our 5 encoded columns of data. We use a batch size of 32 rows per step in the pipeline, as we define it early on, and each full pass over the training data is one round (epoch).
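
As a quick worked example of how batch size and epochs relate, the sketch below computes how many batches it takes to cover the training rows once. The 640-row figure is just an illustrative number, not our real training set size, and `steps_per_epoch` is a made-up helper name.

```python
import math

def steps_per_epoch(num_rows, batch_size=32):
    """Number of batches needed to pass over all rows once (last batch may be partial)."""
    return math.ceil(num_rows / batch_size)

# Hypothetical 640 training rows at batch size 32, trained for 5 epochs.
steps = steps_per_epoch(640)
total_updates = steps * 5
```

So the batch size controls how many rows go into each weight update, while the epoch count controls how many times the whole training set is replayed.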

Keep in mind that we have not executed anything related to the tensor or even created a model yet. This is all just data prep and building the 'pipeline' that feeds the tensor. Below is where we create the model using layers in sequential format with Keras. We compile the model using the optimizer and loss function from Google's TF tutorial demo, with accuracy as the metric. We then fit and validate the model for 5 rounds (epochs) and print the resulting accuracy.
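
One detail worth knowing if you plan to re-use the model for predictions: the final Dense(1) layer in the script has no activation and the loss is configured with from_logits=True, so the model outputs raw scores (logits) rather than probabilities. A minimal sketch of turning such a score into a probability and a true/false call is below. The helper names and the 0.5 threshold are assumptions for illustration.

```python
import math

def logit_to_probability(logit):
    """Apply the sigmoid function to map a raw score to a 0..1 probability."""
    return 1.0 / (1.0 + math.exp(-logit))

def classify(logit, threshold=0.5):
    """Flag a record as an insider threat when its probability meets the threshold."""
    return logit_to_probability(logit) >= threshold
```

A logit of 0 corresponds to a probability of exactly 0.5, so positive scores lean towards "threat" and negative scores lean towards "non-threat".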

Welcome to the remaining 10% of your journey in applying AI to your insider threat data set!

Let's run it again and, well, now we see accuracy of around 61% like last time! So again, this suggests that the majority of your outcome depends on the data science process itself and on the quality of the pre-processing, tuning, and data, not so much on which core software solution you go with. Without optimizing and experimenting with multiple models across varying feature sets, our primitive models will be at best about 10% better than the random chance sampling that a human analyst may or may not catch when reviewing the same data.
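
One quick sanity check on a number like 61% is to compare it against trivial baselines. The sketch below uses invented labels and predictions (not our dataset) to show how an accuracy of 60% can be no better than always guessing the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / len(y_true)

# Hypothetical labels (1 = insider threat) and model predictions.
y_true = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 1, 0, 0]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline(y_true)
print("Model accuracy:   ", model_acc)
print("Majority baseline:", baseline_acc)
print("Gain over baseline:", model_acc - baseline_acc)
```

On imbalanced data the majority baseline can sit well above 50%, so it is usually the fairer yardstick than a coin flip.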

Where is the ROI for AI in Cyber Security?

For simple project tasks that non-domain experts can accomplish on individual events (an alert rather than an incident), AI-enabled defenders in SOCs or threat hunting teams can achieve ROI faster by deciding whether things are anomalous or not using baseline data. Examples include anomalous user agent strings that may indicate C2 infections, or K-means clustering or KNN classification based on cyber threat intelligence IOCs that may show specific APT similarities. There are some great curated lists on GitHub that may give your team ideas on what else they can pursue with simple methods like the ones we've demonstrated in this article. Whatever software solution you elect to use, chances are that our alert payloads really need NLP applied and an appropriately sized neural network created to achieve more accurate modeling. Feel free to modify our base Python template and try it out yourself.
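
As a very simple illustration of the baseline idea, the sketch below flags user agent strings that are absent or rare in a baseline of known-normal traffic. The `rare_user_agents` helper, the 1% threshold, and the sample strings are all hypothetical; a real deployment would need far more data and tuning.

```python
from collections import Counter

def rare_user_agents(baseline, observed, max_baseline_share=0.01):
    """Return observed user agents that are absent or rare in the baseline."""
    counts = Counter(baseline)
    total = len(baseline)
    flagged = []
    for ua in sorted(set(observed)):
        share = counts[ua] / total if total else 0.0
        if share <= max_baseline_share:
            flagged.append(ua)
    return flagged

# Hypothetical baseline of normal traffic and a few newly observed events.
baseline = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64)"] * 95 + ["curl/7.68.0"] * 5
observed = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "curl/7.68.0",
    "python-requests/2.23.0",
]
flagged = rare_user_agents(baseline, observed)
print(flagged)
```

A flagged string is only a lead for an analyst, not proof of a C2 infection.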

Comparing the ~60% accuracy vs. other real-world cyber use cases

I have to admit, I was pretty disappointed in myself at first, even though we knew this was not a tuned model given the labels and input selection I had. But when we compare it with other, more complex datasets and models in communities such as Kaggle, it really isn't as bad as we first thought. Microsoft hosted a malware detection competition for the community and provided enriched datasets. The competition's highest scores show 67% prediction accuracy, and this was in 2019 with over 2400 teams competing. One member shared their code, which scored 63% and was released free to the public, as a great template if you want to investigate further. The entry is titled LightGBM.

Compared to the top of the leaderboard, the public-facing solution was only 5% "worse." Is a 5% difference a huge amount in the world of data science? Yes (though it also depends on how you measure confidence levels). So out of 2400+ teams, the best model achieved an accuracy of ~68%. But from a budgeting and ROI standpoint, when a CISO asks for their next FY's CAPEX, 68% isn't going to cut it for most security programs.

While somewhat discouraging, it's important to remember that there are dedicated data science and dev ops professionals who spend their entire careers getting models up to the 95% or better range. Achieving this requires tons of model testing, additional data, and additional feature set extraction (as we saw RapidMiner Studio doing automatically for us).

Where do we go from here for applying AI to insider threats?

Obviously, this is a complex task. Researchers at Deakin University published a paper called "Image-Based Feature Representation for Insider Threat Classification," which was mentioned briefly in the earlier portion of the article. They discuss the measures they took to create a feature set from an extended amount of data provided by the same US-CERT CMU dataset, and they created 'images' out of it that can be used for prediction classification, with which they achieved 98% accuracy.

Within the paper, the researchers also examined prior models for insider threat such as 'BAIT', which at best achieved 70% accuracy, also using imbalanced data. Security programs with enough budget can have in-house models made from scratch with the help of data scientists and dev ops engineers who can turn this research paper into applicable code.

How can cyber defenders get better at AI and begin to develop skills in house?

Focus less on the solution and more on the data science and pre-processing. I took the EdX Data8x Courseware (3 courses in total), and the book it references (also free) provides great details and methods anyone can use to properly examine data and know what they're looking at during the process. This course set, among others, can really augment and enhance existing cyber security skills to prepare us to do things like:

  • Evaluate vendors providing 'AI' enabled services and solutions on their actual effectiveness, for example by asking what data pre-processing, feature sets, model architecture, and optimization functions are used
  • Build use cases and augment our SOC or threat hunt programs with more informed choices of AI-specific modeling of what is considered anomalous
  • Pipeline and automate high-quality data into tried-and-true models for highly effective alerting and response

Closing

I hope you've enjoyed this article and brief tutorial on feeding insider threat data, or really any data set, into a neural network for cyber security applications using two different solutions. If you're interested in professional services or an MSSP to bolster your organization's cyber security, please feel free to contact us at www.scissecurity.com

Original post: https://www.linkedin.com/pulse/getting-hands-n-ai-cyber-security-professionals-dennis-chow-mba/?trackingId=

Please note: All future articles will be on Medium. Please follow https://medium.com/@dw.chow for updates.

June 17, 2020
© HAKIN9 MEDIA SP. Z O.O. SP. K. 2013