Tutorial
In this tutorial, we craft a brand new architecture using existing components in the framework and evaluate it on node-level intrusion detection.
Architecture
Our goal is to implement a new system that satisfies the following requirements:
- Compute text embeddings from the textual attributes of entities, such as file paths, process commands, and socket IP addresses.
- Learn behavior-specific representations for both source and destination nodes.
- Leverage GraphSAGE layers to capture structural patterns within the provenance graph.
- Use the node embeddings generated by the encoder as input to a two-layer MLP decoder, training the model in a self-supervised manner to predict edge types—following an approach similar to that of the Kairos and Orthrus systems.
- In the final step, classify nodes as malicious if their predicted score exceeds a threshold, defined as the maximum loss observed on the validation set.
Approach: We will write a new encoder and integrate it into the framework so that it can be selected through configuration arguments. We will then create a YAML config describing our system's pipeline and run it.
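The self-supervised objective listed in the requirements scores an edge by how poorly the model predicts its type. The following is a minimal sketch of that idea, assuming a standard softmax cross-entropy loss; the logits are made-up values and the framework's actual loss may differ.

```python
import math


def cross_entropy(logits, true_type):
    """Cross-entropy loss of one edge given its predicted edge-type logits."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_type]


# Hypothetical logits over 3 edge types: the model strongly expects type 0.
logits = [4.0, 0.0, 0.0]
expected_edge = cross_entropy(logits, true_type=0)    # observed type matches the prediction
unexpected_edge = cross_entropy(logits, true_type=2)  # observed type was deemed unlikely

print(f"expected: {expected_edge:.3f}, unexpected: {unexpected_edge:.3f}")
```

An edge whose type the model did not anticipate receives a much larger loss, which is what makes the loss usable as an anomaly score.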
Requirements
- a GPU (min 5 GB memory required), or a CPU
- RAM >20GB
- storage >10GB
- follow the installation guidelines and open a shell in the pids container
Integrate a new encoder
In this example, we implement a new encoder that captures whether nodes are source or destination, then uses a GraphSAGE model to capture structural patterns in the provenance graph.
import torch.nn as nn

from pidsmaker.encoders import SAGE


class CustomEncoder(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim, graph_reindexer, activation, dropout, num_layers, device):
        super().__init__()
        self.src_proj = nn.Linear(in_dim, hid_dim)
        self.dst_proj = nn.Linear(in_dim, hid_dim)
        self.sage = SAGE(
            in_dim=hid_dim,
            hid_dim=hid_dim,
            out_dim=out_dim,
            activation=activation,
            dropout=dropout,
            num_layers=num_layers,
        )
        self.graph_reindexer = graph_reindexer

    def forward(self, x_src, x_dst, edge_index, **kwargs):
        # Project source and destination nodes into separate embedding spaces
        h_src = self.src_proj(x_src)  # (E, d)
        h_dst = self.dst_proj(x_dst)  # (E, d)

        # Reshape features to (N, d)
        h_src_N, h_dst_N = self.graph_reindexer.node_features_reshape(edge_index, h_src, h_dst, x_is_tuple=True)
        h = h_src_N + h_dst_N  # (N, d)

        # Pass them through a SAGE GNN
        return self.sage(h, edge_index)
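The step from (E, d) to (N, d) can be hard to picture. The sketch below shows one plausible reading of it in plain Python, not the framework's node_features_reshape implementation. It assumes that a node carries the same features on every edge, so the last write for a node wins. The helper edges_to_nodes and the toy values are hypothetical.

```python
def edges_to_nodes(edge_index, h_src, h_dst, num_nodes):
    """Scatter per-edge endpoint features (E rows) into per-node tables (N rows)."""
    dim = len(h_src[0])
    h_src_n = [[0.0] * dim for _ in range(num_nodes)]
    h_dst_n = [[0.0] * dim for _ in range(num_nodes)]
    src_ids, dst_ids = edge_index
    for e, (s, d) in enumerate(zip(src_ids, dst_ids)):
        h_src_n[s] = list(h_src[e])  # node s acts as a source on edge e
        h_dst_n[d] = list(h_dst[e])  # node d acts as a destination on edge e
    return h_src_n, h_dst_n


# Toy graph with 3 nodes and 2 edges: 0 -> 1 and 1 -> 2.
edge_index = [[0, 1], [1, 2]]
h_src = [[1.0, 1.0], [2.0, 2.0]]      # projected source features, one row per edge
h_dst = [[10.0, 10.0], [20.0, 20.0]]  # projected destination features, one row per edge

h_src_n, h_dst_n = edges_to_nodes(edge_index, h_src, h_dst, num_nodes=3)
# Sum both views so each node mixes its "source" and "destination" behavior.
h = [[a + b for a, b in zip(rs, rd)] for rs, rd in zip(h_src_n, h_dst_n)]
print(h)
```

Node 1 is the only node that appears both as a source and as a destination, so it is the only one whose row combines the two projections.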
In this example, only two arguments are specific to the encoder and are not shared globally across all encoders: the activation function and the number of GNN layers (num_layers). We want to let users specify these parameters in the configuration file so they can easily experiment with different values.
To achieve this, we need to define a new set of arguments specific to this encoder.
ENCODERS_CFG = {
    ...
    "custom_encoder": {
        "activation": str,
        "num_layers": int,
    },
}
Note
If an encoder doesn't rely on any custom arguments, simply leave the dict empty: "custom_encoder": {}. Every encoder must still be defined here, or the framework will not recognize it.
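To illustrate what a name-to-type mapping like this makes possible, here is a minimal sketch of validating user-supplied arguments against such a schema. The names ENCODERS_CFG_SKETCH and validate_encoder_args are hypothetical, and the framework's own validation logic may work differently.

```python
# Hypothetical schema mirroring the shape of the ENCODERS_CFG entry above.
ENCODERS_CFG_SKETCH = {
    "custom_encoder": {
        "activation": str,
        "num_layers": int,
    },
}


def validate_encoder_args(name, args, schema=ENCODERS_CFG_SKETCH):
    """Return a list of problems found in `args` for encoder `name`."""
    if name not in schema:
        return [f"unknown encoder: {name}"]
    problems = []
    expected = schema[name]
    for key, typ in expected.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], typ):
            problems.append(f"{key} should be {typ.__name__}")
    for key in args:
        if key not in expected:
            problems.append(f"unexpected argument: {key}")
    return problems


ok = validate_encoder_args("custom_encoder", {"activation": "relu", "num_layers": 3})
bad = validate_encoder_args("custom_encoder", {"activation": "relu", "num_layers": "3"})
unknown = validate_encoder_args("other_encoder", {})
print(ok, bad, unknown)
```

An encoder that is absent from the schema is rejected outright, which mirrors the note above: every encoder needs an entry, even an empty one.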
All logic related to component instantiation is located in factory.py. To integrate our new encoder, we add a new case in the encoder_factory() function. The existing graph_reindexer instance can be reused. Here, custom_encoder refers to the name of the encoder, which we will later specify in the configuration file.
...
elif method == "custom_encoder":
    encoder = CustomEncoder(
        in_dim=in_dim,
        hid_dim=node_hid_dim,
        out_dim=node_out_dim,
        dropout=dropout,
        graph_reindexer=graph_reindexer,
        activation=activation_fn_factory(
            cfg.detection.gnn_training.encoder.custom_encoder.activation
        ),
        num_layers=cfg.detection.gnn_training.encoder.custom_encoder.num_layers,
        device=device,
    )
Our new argument activation can be accessed from the cfg object via cfg.detection.gnn_training.encoder.custom_encoder.activation.
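The dotted access path and the string-to-function lookup can be sketched with the standard library alone. This is only an illustration: to_namespace, ACTIVATIONS and activation_fn_factory_sketch are hypothetical stand-ins, the values "relu" and 3 are made up, and the real activation_fn_factory presumably returns framework-specific objects rather than plain Python functions.

```python
import math
from types import SimpleNamespace


def to_namespace(obj):
    """Recursively convert nested dicts into attribute-accessible namespaces."""
    if isinstance(obj, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in obj.items()})
    return obj


# Hypothetical stand-in for an activation factory: maps a name to a callable.
ACTIVATIONS = {"relu": lambda x: max(0.0, x), "tanh": math.tanh}


def activation_fn_factory_sketch(name):
    if name not in ACTIVATIONS:
        raise ValueError(f"unsupported activation: {name}")
    return ACTIVATIONS[name]


cfg = to_namespace({
    "detection": {"gnn_training": {"encoder": {
        "custom_encoder": {"activation": "relu", "num_layers": 3},
    }}},
})

enc = cfg.detection.gnn_training.encoder.custom_encoder
act = activation_fn_factory_sketch(enc.activation)
print(enc.num_layers, act(-2.0), act(1.5))
```

Keeping the activation as a string in the config and resolving it in one place means the encoder class itself never has to parse configuration values.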
Then add the encoder to the list of available encoders in __init__.py.
...
from .custom_encoder import CustomEncoder
Integrate a new system
To integrate a new system, first create a new YAML file: config/custom_system.yml. This file describes all the logic of our new PIDS pipeline.
In this example, we take orthrus as the base configuration. We only override some arguments for simplicity.
Question
All available arguments for each component and task can be found in the pages under the Configuration section of the documentation.
Config (config/custom_system.yml), annotated below:
- In this example, we take the orthrus system as the base configuration.
- We partition the graphs into time windows of 15 minutes.
- By default, orthrus partitions each time window into even smaller batches of 1024 edges. To disable this behavior, we set used_methods: None.
- We set some custom hyperparameters for training.
- This is where we specify which encoder to use, in this case our new custom_encoder.
- We set the values for our two defined arguments: activation and num_layers.
- Our objective is to predict edge types, as recent research shows it is the best approach so far.
- Here, edge_mlp is an MLP designed for edge-level tasks like edge type prediction. It first projects src and dst nodes with different linear layers, then applies the MLP specified in architecture_str.
- This argument sets the output size of the src/dst linear layer as a multiple of its input size. Here we want to project to an embedding with double the input size.
- Our final neural network prior to prediction is a two-layer MLP with a relu activation. An additional linear layer is added after the relu to match the output size expected by the objective, here the number of edge types.
- Refers to node-level detection.
- We compute the threshold based on the maximum loss seen on the validation set.
- We use simple thresholding here, without clustering.
- We describe here the features assigned to each type of entity.
- We train a word2vec model and embed each node's features (node_label_features) into a vector of size emb_dim.
- The features to use as node features during GNN training. Here we concatenate the word2vec embedding and the one-hot encoded entity type.
- Our model doesn't integrate edge features, so we do not use any in this example.
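The node-level detection and thresholding notes above can be sketched as follows. This is a simplified illustration with made-up losses. It assumes that a node's score is the maximum loss over the edges it takes part in, which is an assumption of this sketch rather than a documented behavior of the framework.

```python
def node_scores(edge_losses):
    """Aggregate per-edge losses into per-node scores (max over incident edges)."""
    scores = {}
    for (src, dst), loss in edge_losses.items():
        for node in (src, dst):
            scores[node] = max(scores.get(node, 0.0), loss)
    return scores


# Hypothetical losses, keyed by (source node, destination node).
val_losses = {("a", "b"): 0.4, ("b", "c"): 1.2}
test_losses = {("a", "b"): 0.3, ("c", "d"): 2.5, ("e", "a"): 0.9}

# Threshold: the highest score observed on the validation set.
threshold = max(node_scores(val_losses).values())

# A test node is flagged when its score exceeds that threshold.
flagged = sorted(n for n, s in node_scores(test_losses).items() if s > threshold)
print(threshold, flagged)
```

Because the threshold comes from a single extreme validation value, one noisy validation edge can shift it substantially, which is worth keeping in mind when reading the results.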
Run the pipeline
In the pids container, you can now run the pipeline. We highly recommend using VSCode with the dev container extension to open your whole workspace in the container, which avoids rebuilding the image after each code change. Using a dev container also lets you run the framework in debug mode with VSCode's debugger.
For tiny experiments, you can run the framework locally like so:
cd scripts
python pidsmaker/main.py custom_system CADETS_E3 --project=test_custom_system
For more practical experimentation, prefer running the framework in the background and monitoring the logs, figures and metrics in Weights & Biases (W&B).
To do so, ensure you are logged in to W&B with wandb login, then run:
./run.sh custom_system CADETS_E3 --project=test_custom_system
Note
Use --cpu to run the framework on CPU instead of GPU.
Analyze results
In the W&B interface, go to the test_custom_system project, where you can follow the logs and metrics of your ongoing run in real time.
Once the run has finished, we provide some figures to illustrate the ability of the model to differentiate attack and benign nodes.
Results show a noticeable separation between benign and some malicious nodes on all three attacks in the E3-CADETS dataset.
However, the threshold (vertical line) is poorly placed in the space of anomaly scores, leading to 21 TPs and 33 FPs, whereas some nodes could be detected without any FP.
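The effect of threshold placement on TPs and FPs can be reproduced on toy data. The sketch below uses made-up scores and labels, not the results of this run. It counts detections at a given threshold and shows that placing the threshold at the highest benign score removes all false positives.

```python
def confusion_at(threshold, scores, malicious):
    """Count true and false positives among nodes scoring above `threshold`."""
    flagged = {n for n, s in scores.items() if s > threshold}
    tp = len(flagged & malicious)
    fp = len(flagged - malicious)
    return tp, fp


# Hypothetical anomaly scores and ground truth.
scores = {"n1": 0.2, "n2": 0.8, "n3": 1.5, "n4": 2.0, "n5": 3.1}
malicious = {"n3", "n5"}

# A low threshold catches every malicious node but also flags benign ones.
low = confusion_at(0.5, scores, malicious)

# Highest benign score: any threshold at or above it yields no false positive.
max_benign = max(s for n, s in scores.items() if n not in malicious)
strict = confusion_at(max_benign, scores, malicious)
print(low, strict)
```

The stricter threshold trades recall for precision: it misses the malicious node that scores below the highest benign node.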
System metrics such as GPU memory, RAM, and CPU utilization are automatically captured by W&B and can be visualized on the interface.
Try variants
While this first version of the model yields relatively satisfactory results, we have no guarantee that it uses the best-performing set of arguments on this dataset. We can easily experiment with multiple variants using CLI args. Depending on your hardware, you can launch several runs in parallel even on a single GPU (usually up to 3-4 parallel runs with simple architectures on an A100 GPU without major runtime overhead).
# Remove node type from node features, keep only the word2vec embedding
./run.sh custom_system CADETS_E3 --project=test_custom_system \
--detection.graph_preprocessing.node_features=node_emb
# Increase node embedding size
./run.sh custom_system CADETS_E3 --project=test_custom_system \
--detection.gnn_training.node_hid_dim=256 \
--detection.gnn_training.node_out_dim=256
# Reduce the number of GNN layers
./run.sh custom_system CADETS_E3 --project=test_custom_system \
--detection.gnn_training.encoder.custom_encoder.num_layers=2
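The dotted CLI arguments above map naturally onto nested configuration keys. Here is a minimal sketch of that mapping, assuming a plain nested dict as the config; apply_override is a hypothetical helper, not the framework's argument parser, and the initial value 64 is made up.

```python
def apply_override(cfg, arg):
    """Apply one `--a.b.c=value` style override to a nested dict in place."""
    path, raw = arg.lstrip("-").split("=", 1)
    *parents, leaf = path.split(".")
    node = cfg
    for key in parents:
        # Create intermediate levels on the fly if they do not exist yet.
        node = node.setdefault(key, {})
    # Keep non-negative integers as integers; leave everything else as a string.
    node[leaf] = int(raw) if raw.isdigit() else raw
    return cfg


cfg = {"detection": {"gnn_training": {"node_hid_dim": 64}}}
apply_override(cfg, "--detection.gnn_training.node_hid_dim=256")
apply_override(cfg, "--detection.graph_preprocessing.node_features=node_emb")
print(cfg)
```

Each override touches only the leaf it names, so the rest of the base configuration stays as defined in the YAML file.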
For more advanced hyperparameter exploration, consider using the hyperparameter tuning feature.