Tutorial

In this tutorial, we craft a brand new architecture using existing components in the framework and evaluate it on node-level intrusion detection.

Architecture

Our goal is to implement a new system that satisfies the following requirements:

Compute text embeddings from the textual attributes of entities, such as file paths, process commands, and socket IP addresses.
Learn behavior-specific representations for both source and destination nodes.
Leverage GraphSAGE layers to capture structural patterns within the provenance graph.
Use the node embeddings generated by the encoder as input to a two-layer MLP decoder, training the model in a self-supervised manner to predict edge types—following an approach similar to that of the Kairos and Orthrus systems.
In the final step, classify nodes as malicious if their predicted score exceeds a threshold, defined as the maximum loss observed on the validation set.
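
The last requirement can be pictured with a minimal sketch. It assumes each node already has a scalar anomaly score (for example, its prediction loss). The function name `classify_nodes` and the sample values are hypothetical and are not part of the framework:

```python
def classify_nodes(val_losses, test_scores):
    """Flag nodes whose score is strictly above the maximum validation loss."""
    threshold = max(val_losses)
    flagged = {node: score > threshold for node, score in test_scores.items()}
    return threshold, flagged


# Hypothetical losses, for illustration only
val_losses = [0.12, 0.30, 0.25]
test_scores = {"proc_a": 0.10, "file_b": 0.45, "sock_c": 0.30}

threshold, flagged = classify_nodes(val_losses, test_scores)
print(threshold, flagged)
```

Because the comparison is strict, a node whose score equals the threshold is not flagged in this sketch.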

Approach: We will craft a new encoder and integrate it into the framework so that it can be selected and configured through arguments. We will then create a YAML config describing our system's pipeline and execute it.

Requirements

A GPU (at least 5 GB of memory) or a CPU
More than 20 GB of RAM
More than 10 GB of storage
Follow the installation guidelines and open a shell in the pids container

Integrate a new encoder

In this example, we implement a new encoder that captures whether nodes are source or destination, then uses a GraphSAGE model to capture structural patterns in the provenance graph.

encoders/custom_encoder.py

import torch.nn as nn

from pidsmaker.encoders import SAGE


class CustomEncoder(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim, graph_reindexer, activation, dropout, num_layers, device):
        super().__init__()

        self.src_proj = nn.Linear(in_dim, hid_dim)
        self.dst_proj = nn.Linear(in_dim, hid_dim)

        self.sage = SAGE(
            in_dim=hid_dim,
            hid_dim=hid_dim,
            out_dim=out_dim,
            activation=activation,
            dropout=dropout,
            num_layers=num_layers,
        )

        self.graph_reindexer = graph_reindexer

    def forward(self, x_src, x_dst, edge_index, **kwargs):
        # Project source and destination features into separate embedding spaces
        h_src = self.src_proj(x_src)  # (E, d)
        h_dst = self.dst_proj(x_dst)  # (E, d)

        # Reshape features to (N, d)
        h_src_N, h_dst_N = self.graph_reindexer.node_features_reshape(edge_index, h_src, h_dst, x_is_tuple=True)
        h = h_src_N + h_dst_N  # (N, d)

        # Pass them through a SAGE GNN
        return self.sage(h, edge_index)
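
The step from edge-level features of shape (E, d) to node-level features of shape (N, d) can be pictured with the plain-Python sketch below. The tutorial does not show what node_features_reshape does internally, so the mean aggregation here is only an assumption, and `edge_to_node_features` is a hypothetical helper, not a framework function:

```python
def edge_to_node_features(edge_index, h_src, h_dst, num_nodes):
    """Average edge-level rows into one row per node, separately for each role.

    Assumption: mean aggregation; the framework may aggregate differently.
    """
    src_ids, dst_ids = edge_index
    dim = len(h_src[0])

    def scatter_mean(ids, rows):
        sums = [[0.0] * dim for _ in range(num_nodes)]
        counts = [0] * num_nodes
        for node, row in zip(ids, rows):
            counts[node] += 1
            for k, value in enumerate(row):
                sums[node][k] += value
        # Nodes that never appear in this role keep a zero vector
        return [[v / c if c else 0.0 for v in s] for s, c in zip(sums, counts)]

    return scatter_mean(src_ids, h_src), scatter_mean(dst_ids, h_dst)


# Three edges: 0->1, 0->2, 1->2, with d = 2
edge_index = ([0, 0, 1], [1, 2, 2])
h_src = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]  # (E, d)
h_dst = [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]]  # (E, d)

h_src_n, h_dst_n = edge_to_node_features(edge_index, h_src, h_dst, num_nodes=3)
# Sum the two role-specific views, as in the encoder's forward pass
h = [[a + b for a, b in zip(rs, rd)] for rs, rd in zip(h_src_n, h_dst_n)]
print(h)
```

Summing the two views gives every node a single vector that mixes its behavior as a source and as a destination.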

In this example, only two arguments are specific to the encoder and are not shared globally across all encoders: the activation function and the number of GNN layers (num_layers). We want to allow users to specify these parameters via the configuration file to facilitate experimentation with different values. To achieve this, we need to define a new set of arguments tailored specifically for this encoder.

config/config.py

ENCODERS_CFG = {
    ...
    "custom_encoder": {
        "activation": str,
        "num_layers": int,
    },
}
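
To see why a name-to-type mapping is useful, here is a minimal sketch of how such a schema could be used to check user-supplied values. The `validate_encoder_args` helper is hypothetical, and the framework's own validation logic may differ:

```python
ENCODERS_CFG = {
    "custom_encoder": {
        "activation": str,
        "num_layers": int,
    },
}


def validate_encoder_args(name, args, schema=ENCODERS_CFG):
    """Check that every argument is declared and has the declared type."""
    if name not in schema:
        raise KeyError(f"unknown encoder: {name}")
    expected = schema[name]
    for key, value in args.items():
        if key not in expected:
            raise KeyError(f"unknown argument for {name}: {key}")
        if not isinstance(value, expected[key]):
            raise TypeError(f"{key} should be of type {expected[key].__name__}")
    return True


print(validate_encoder_args("custom_encoder", {"activation": "relu", "num_layers": 3}))
```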

Note

If an encoder doesn't rely on any custom arguments, simply leave the dict empty: "custom_encoder": {}. Every encoder must still be declared here; otherwise the framework will not recognize it.

All logic related to component instantiation is located in factory.py. To integrate our new encoder, we will add a new case in the encoder_factory() function. The existing graph_reindexer instance can be reused. Here, custom_encoder refers to the name of the encoder, which we will later specify in the configuration file.

factory.py: encoder_factory()

    ...
    elif method == "custom_encoder":
        encoder = CustomEncoder(
            in_dim=in_dim,
            hid_dim=node_hid_dim,
            out_dim=node_out_dim,
            dropout=dropout,
            graph_reindexer=graph_reindexer,
            activation=activation_fn_factory(
                cfg.detection.gnn_training.encoder.custom_encoder.activation),
            num_layers=cfg.detection.gnn_training.encoder.custom_encoder.num_layers,
            device=device,
        )

Our new argument activation can be accessed from the cfg object via cfg.detection.gnn_training.encoder.custom_encoder.activation.

Then add the encoder to the list of available encoders in __init__.py.

encoders/__init__.py

...
from .custom_encoder import CustomEncoder

Integrate a new system

To integrate a new system, first create a new YAML file: config/custom_system.yml. This file describes all the logic of our new PIDS pipeline. In this example, we use orthrus as the base configuration and override only a few arguments for simplicity.
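
Overriding a few arguments on top of a base configuration amounts to a recursive dictionary merge. The sketch below shows the idea with a hypothetical `deep_merge` helper and made-up base values. It is not the framework's implementation, and the values are not the real orthrus defaults:

```python
import copy


def deep_merge(base, override):
    """Return a new dict in which values from override replace those in base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base values, for illustration only
base = {"detection": {"gnn_training": {"lr": 0.001, "node_hid_dim": 64}}}
override = {"detection": {"gnn_training": {"lr": 0.0001}}}

merged = deep_merge(base, override)
print(merged)
```

Keys that the override does not mention keep their base value, which is why the new config file only needs to list the arguments that change.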

Question

All available arguments for each component and task can be found in the pages under the Configuration section of the documentation.

Config:

config/custom_system.yml
_include_yml: orthrus # (1)!

preprocessing:
  build_graphs:
    time_window_size: 15.0 # (2)!
    node_label_features: # (14)!
      subject: type, path, cmd_line
      file: type, path
      netflow: type, remote_ip, remote_port

featurization:
  feat_training:
    emb_dim: 128
    epochs: 50
    used_method: word2vec # (15)!
    word2vec:
      alpha: 0.025
      window_size: 5
      min_count: 1
      use_skip_gram: True
      num_workers: 15
      compute_loss: True
      negative: 5
      decline_rate: 30

detection:
  graph_preprocessing:
    node_features: node_emb,node_type # (16)!
    edge_features: none # (17)!
    intra_graph_batching:
      used_methods: none # (3)!

  gnn_training:
    lr: 0.0001 # (4)!
    node_hid_dim: 128
    node_out_dim: 128

    encoder:
      dropout: 0.3
      used_methods: custom_encoder # (5)!
      custom_encoder:
        activation: relu # (6)!
        num_layers: 3

    decoder:
      used_methods: predict_edge_type # (7)!
      predict_edge_type:
        decoder: edge_mlp # (8)!
        edge_mlp:
          src_dst_projection_coef: 2 # (9)!
          architecture_str: linear(0.5) | relu # (10)!

  evaluation:
    used_method: node_evaluation # (11)!
    node_evaluation:
      threshold_method: max_val_loss # (12)!
      use_kmeans: False # (13)!

(1) In this example, we take the orthrus system as the base configuration.
(2) We partition the graphs into time windows of 15 minutes.
(3) By default, orthrus partitions each time window into even smaller batches of 1024 edges. To disable this behavior, we set used_methods: none.
(4) We set some custom hyperparameters for training.
(5) This is where we select the encoder to use, in this case our new custom_encoder.
(6) We set the values for our two encoder-specific arguments: activation and num_layers.
(7) Our objective is to predict edge types, as recent research shows it is the best approach yet.
(8) Here, edge_mlp is an MLP designed for edge-level tasks like edge type prediction. It first projects src and dst nodes with different linear layers, then applies the MLP specified in architecture_str.
(9) This argument sets the output size of the src/dst linear layers as a multiple of their input size. Here we project to an embedding with double the input size.
(10) Our final neural network prior to prediction is a two-layer MLP with a relu activation. An additional linear layer is added after the relu to match the output size expected by the objective, here the number of edge types.
(11) Refers to node-level detection.
(12) We compute the threshold based on the maximum loss seen on the validation set.
(13) We use simple thresholding here, without clustering.
(14) We describe here the features assigned to each type of entity.
(15) We train a word2vec model and embed each node's features (node_label_features) into a vector of size emb_dim.
(16) The features to use as node features during GNN training. Here we concatenate the word2vec embedding and the one-hot encoded entity type.
(17) Our model doesn't integrate edge features, so we do not use any in this example.
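
The 15-minute partitioning set by time_window_size can be pictured as bucketing edges by timestamp. This is only a sketch: it assumes timestamps are expressed in seconds, and `partition_by_time_window` is a hypothetical helper rather than the framework's graph builder:

```python
from collections import defaultdict


def partition_by_time_window(edges, window_minutes=15.0):
    """Group (src, dst, timestamp) edges into consecutive time windows."""
    window_seconds = window_minutes * 60
    windows = defaultdict(list)
    for src, dst, timestamp in edges:
        windows[int(timestamp // window_seconds)].append((src, dst, timestamp))
    # Return the non-empty windows in chronological order
    return [windows[index] for index in sorted(windows)]


# Hypothetical edges with timestamps in seconds
edges = [
    ("p1", "f1", 0),
    ("p1", "f2", 899),
    ("p2", "s1", 900),
    ("p2", "f1", 2000),
]

graphs = partition_by_time_window(edges)
print([len(g) for g in graphs])
```

Each resulting group would then be turned into one graph, so a larger window yields fewer but bigger graphs.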

Run the pipeline

In the pids container, you can now run the pipeline. We highly recommend using VSCode with the dev container extension to open your whole workspace in the container, which avoids re-building the image after updating the code. Using a dev container also lets you run the framework in debug mode with VSCode's debugger.

For tiny experiments, you can run the framework locally like so:

cd scripts
python pidsmaker/main.py custom_system CADETS_E3 --project=test_custom_system

For more practical experimentation, prefer running the framework in the background and monitoring the logs, figures, and metrics in Weights & Biases (W&B). To do so, make sure you are logged in to W&B with wandb login, then run:

./run.sh custom_system CADETS_E3 --project=test_custom_system

Note

Use --cpu to run the framework on CPU instead of GPU.

Analyze results

In the W&B interface, open the test_custom_system project to follow the logs and metrics of your ongoing run in real time.

Once finished, we provide some figures to illustrate the ability of the model to differentiate attack and benign nodes.

Results show a noticeable separation between benign nodes and some malicious nodes on all three attacks in the E3-CADETS dataset. However, the threshold (vertical line) is not well placed in the space of anomaly scores: it leads to 21 TPs and 33 FPs, whereas a better-placed threshold could detect some malicious nodes without any FP.
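
To put these counts in perspective, precision can be computed directly from the reported TPs and FPs. The short sketch below only does this arithmetic; recall is left out because the number of false negatives is not reported here:

```python
def precision(tp, fp):
    """Fraction of flagged nodes that are truly malicious."""
    return tp / (tp + fp)


p = precision(tp=21, fp=33)
print(f"precision = {p:.3f}")  # prints "precision = 0.389"
```

In other words, a little over one in three flagged nodes is actually malicious at this threshold.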

System metrics such as GPU memory, RAM, and CPU utilization are automatically captured by W&B and can be visualized on the interface.

Try variants

While this first version of the model yields relatively satisfactory results, we have no guarantee that it uses the best-performing set of arguments on this dataset. We can easily experiment with multiple variants using CLI args. Depending on your hardware, you can launch multiple runs in parallel even on a single GPU (usually up to 3-4 parallel runs with simple architectures on an A100 GPU without major runtime overhead).

# Remove node type from node features, keep only the word2vec embedding
./run.sh custom_system CADETS_E3 --project=test_custom_system \
    --detection.graph_preprocessing.node_features=node_emb

# Increase node embedding size
./run.sh custom_system CADETS_E3 --project=test_custom_system \
    --detection.gnn_training.node_hid_dim=256 \
    --detection.gnn_training.node_out_dim=256

# Reduce the number of GNN layers
./run.sh custom_system CADETS_E3 --project=test_custom_system \
    --detection.gnn_training.encoder.custom_encoder.num_layers=2
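
The dotted CLI arguments above mirror the nesting of the YAML file. As a rough sketch of how such an argument maps onto a nested configuration, consider the hypothetical `apply_override` helper below. The framework's actual argument parser is not shown in this tutorial and may handle types differently:

```python
def parse_value(raw):
    """Convert integer-looking strings to int and keep everything else as str."""
    try:
        return int(raw)
    except ValueError:
        return raw


def apply_override(cfg, arg):
    """Apply one '--a.b.c=value' argument to a nested dict, in place."""
    stripped = arg[2:] if arg.startswith("--") else arg
    path, raw = stripped.split("=", 1)
    keys = path.split(".")
    node = cfg
    for key in keys[:-1]:
        # Walk down the sections, creating them if they do not exist yet
        node = node.setdefault(key, {})
    node[keys[-1]] = parse_value(raw)
    return cfg


cfg = {"detection": {"gnn_training": {"encoder": {"custom_encoder": {"num_layers": 3}}}}}
apply_override(cfg, "--detection.gnn_training.encoder.custom_encoder.num_layers=2")
apply_override(cfg, "--detection.graph_preprocessing.node_features=node_emb")
print(cfg)
```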

For more advanced hyperparameter exploration, consider using the hyperparameter tuning feature.