Tasks

Tasks are the steps that compose the pipeline, from graph construction (build_graphs) to detection (evaluation) and, optionally, triage (tracing). Each task takes the output of the previous task as input and writes its own output to disk so that the next task can use it. This enables "checkpointing" across the pipeline and avoids duplicated compute. More information on tasks and the pipeline here.
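
The checkpointing idea can be sketched as follows. This is a minimal illustration, not the framework's implementation: `run_task`, the JSON file layout, and the `count_edges` task are hypothetical names introduced only for this example.

```python
import json
import tempfile
from pathlib import Path


def run_task(name, compute, inputs, out_dir):
    """Run `compute(inputs)` unless a checkpoint for `name` already exists."""
    out_path = Path(out_dir) / f"{name}.json"  # hypothetical on-disk layout
    if out_path.exists():
        # Checkpoint hit: reuse the stored output instead of recomputing.
        return json.loads(out_path.read_text())
    result = compute(inputs)
    out_path.write_text(json.dumps(result))
    return result


calls = []  # records which tasks were actually computed


def build_graphs(_):
    calls.append("build_graphs")
    return {"edges": [[0, 1], [1, 2]]}


def count_edges(graphs):  # hypothetical downstream task
    calls.append("count_edges")
    return {"num_edges": len(graphs["edges"])}


with tempfile.TemporaryDirectory() as out_dir:
    for _ in range(2):  # the second pass reads both outputs from disk
        graphs = run_task("build_graphs", build_graphs, None, out_dir)
        stats = run_task("count_edges", count_edges, graphs, out_dir)
```

Each task runs once even though the pipeline is executed twice, which is the compute saving described above.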

Preprocessing

build_graphs
- used_method: str (1)
- use_all_files: bool
- mimicry_edge_num: int
- time_window_size: float (2)
- use_hashed_label: bool (3)
- fuse_edge: bool (4)
- node_label_features
  - subject: str (5)
  - file: str (6)
  - netflow: str (7)
- multi_dataset: str (8)
transformation
- used_methods: str (9)
- rcaid_pseudo_graph
  - use_pruning: bool
- synthetic_attack_naive
  - num_attacks: int
  - num_malicious_process: int
  - num_unauthorized_file_access: int
  - process_selection_method: str

(1) The method used to build time window graphs.

Available options (one selection):
default
magic
(2) The size of each graph in minutes. Always write the value as a float (e.g. 10.0). Sizes below 1.0 are supported.
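
As an illustration of how a window size in minutes partitions events, here is a minimal sketch. It assumes events are (timestamp in seconds, source, destination) tuples and that empty windows are skipped; both are assumptions made for this example, not statements about the framework's behavior.

```python
from collections import defaultdict


def split_into_time_windows(events, time_window_size):
    """Group (timestamp_seconds, src, dst) events into windows of
    `time_window_size` minutes, counted from the earliest event."""
    window_seconds = time_window_size * 60.0
    start = min(t for t, _, _ in events)
    windows = defaultdict(list)
    for t, src, dst in events:
        index = int((t - start) // window_seconds)
        windows[index].append((t, src, dst))
    # Only non-empty windows are returned, in chronological order.
    return [windows[i] for i in sorted(windows)]


events = [(0, "a", "b"), (20, "b", "c"), (45, "a", "c"), (100, "c", "d")]
graphs = split_into_time_windows(events, 0.5)  # 0.5 minutes = 30-second windows
```
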
(3) Whether to hash the textual features.
(4) Whether to fuse duplicate sequential edges into a single edge.
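
Edge fusion can be sketched as below. The sketch assumes that "duplicate" means identical (source, destination, edge type) tuples appearing consecutively and that timestamps are ignored; the framework may use a different criterion.

```python
from itertools import groupby


def fuse_sequential_edges(edges):
    """Collapse each run of identical consecutive edges into a single edge."""
    return [edge for edge, _ in groupby(edges)]


edges = [
    ("p1", "f1", "read"),
    ("p1", "f1", "read"),   # duplicate of the previous edge: fused
    ("p1", "f2", "write"),
    ("p1", "f1", "read"),   # same as the first edge but not sequential: kept
]
fused = fuse_sequential_edges(edges)
```
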
(5) Which features to use for process nodes. Features will be concatenated.

Available options (multi selection):
type
path
cmd_line
(6) Which features to use for file nodes.

Available options (multi selection):
type
path
(7) Which features to use for netflow nodes.

Available options (multi selection):
type
remote_ip
remote_port
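
The concatenation of selected node features into a single label can be sketched as follows. The comma-separated selection string, the space separator, and the `build_node_label` helper are assumptions made for illustration.

```python
def build_node_label(node, features):
    """Concatenate the selected attributes of `node` into one textual label."""
    names = [name.strip() for name in features.split(",")]
    return " ".join(str(node[name]) for name in names)


subject = {"type": "subject", "path": "/usr/bin/ssh", "cmd_line": "ssh user@host"}
netflow = {"type": "netflow", "remote_ip": "10.0.0.5", "remote_port": 22}

subject_label = build_node_label(subject, "type,path,cmd_line")
netflow_label = build_node_label(netflow, "type,remote_ip,remote_port")
```
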
(8) A comma-separated list of datasets on which training is performed. Evaluation is done only on the primary dataset given in the CLI.

Available options (one selection):
THEIA_E5
THEIA_E3
CADETS_E5
CADETS_E3
CLEARSCOPE_E5
CLEARSCOPE_E3
optc_h201
optc_h501
optc_h051
none
(9) Applies transformations to graphs after their construction. Multiple transformations can be applied sequentially. Example: used_methods=undirected,dag

Available options (multi selection):
undirected
dag
rcaid_pseudo_graph
none
synthetic_attack_naive
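
Sequential application of transformations can be sketched as below. Only `undirected` and `none` are modeled after the option names listed above, and their behavior here is an assumed reading of those names; `drop_self_loops` is a hypothetical transformation that does not exist in the option list and is included only to show chaining.

```python
def to_undirected(edges):
    """Add the reverse of every edge, without duplicates, in first-seen order."""
    seen, out = set(), []
    for src, dst in edges:
        for edge in ((src, dst), (dst, src)):
            if edge not in seen:
                seen.add(edge)
                out.append(edge)
    return out


def drop_self_loops(edges):
    """Hypothetical transformation, for illustration only."""
    return [(src, dst) for src, dst in edges if src != dst]


TRANSFORMATIONS = {
    "undirected": to_undirected,
    "drop_self_loops": drop_self_loops,
    "none": lambda edges: list(edges),
}


def apply_transformations(edges, used_methods):
    """Apply each named transformation in order, left to right."""
    for name in used_methods.split(","):
        edges = TRANSFORMATIONS[name.strip()](edges)
    return edges


edges = [(0, 1), (1, 1), (1, 2)]
result = apply_transformations(edges, "drop_self_loops,undirected")
```

Because the transformations are chained, their order in the comma-separated string matters.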

Featurization

feat_training
- emb_dim: int (1)
- epochs: int (2)
- use_seed: bool
- training_split: str (3)
- multi_dataset_training: bool (4)
- used_method: str (5)
feat_inference
- to_remove: bool

(1) Size of the text embedding. Not used by featurization methods that do not build embeddings.
(2) Number of epochs used to train the embedding method. Not used by some methods.
(3) The partition of data used to train the featurization method.

Available options (one selection):
train
all
(4) Whether the featurization method should be trained on all datasets in multi_dataset.
(5) Algorithm used to create node and edge features.

Available options (one selection):
word2vec
doc2vec
fasttext
alacarte
temporal_rw
flash
hierarchical_hashing
magic
only_type
only_ones

Detection

graph_preprocessing
- save_on_disk: bool (1)
- node_features: str (2)
- edge_features: str (3)
- multi_dataset_training: bool (4)
- fix_buggy_graph_reindexer: bool (5)
- global_batching
  - used_method: str (6)
  - global_batching_batch_size: int (7)
  - global_batching_batch_size_inference: int (8)
- intra_graph_batching
  - used_methods: str (9)
  - edges
    - intra_graph_batch_size: int (10)
  - tgn_last_neighbor
    - tgn_neighbor_size: int (11)
    - tgn_neighbor_n_hop: int (12)
    - fix_buggy_orthrus_TGN: bool (13)
    - fix_tgn_neighbor_loader: bool (14)
    - directed: bool (15)
    - insert_neighbors_before: bool (16)
- inter_graph_batching
  - used_method: str (17)
  - inter_graph_batch_size: int (18)
gnn_training
- use_seed: bool
- deterministic: bool (19)
- num_epochs: int
- patience: int
- lr: float
- weight_decay: float
- node_hid_dim: int (20)
- node_out_dim: int (21)
- grad_accumulation: int (22)
- inference_device: str (23)
- used_method: str (24)
- encoder
  - dropout: float
  - used_methods: str (25)
- decoder
  - used_methods: str (26)
  - use_few_shot: bool (27)
  - few_shot
    - include_attacks_in_ssl_training: bool
    - freeze_encoder: bool
    - num_epochs_few_shot: int
    - patience_few_shot: int
    - lr_few_shot: float
    - weight_decay_few_shot: float
    - decoder
      - used_methods: str
evaluation
- viz_malicious_nodes: bool (28)
- ground_truth_version: str (29)
- best_model_selection: str (30)
- used_method: str
- node_evaluation
  - threshold_method: str (31)
  - use_dst_node_loss: bool (32)
  - use_kmeans: bool (33)
  - kmeans_top_K: int (34)
- tw_evaluation
  - threshold_method: str (35)
- node_tw_evaluation
  - threshold_method: str (36)
  - use_dst_node_loss: bool
  - use_kmeans: bool
  - kmeans_top_K: int
- queue_evaluation
  - used_method: str (37)
  - queue_threshold: int
  - kairos_idf_queue
    - include_test_set_in_IDF: bool
  - provnet_lof_queue
    - queue_arg: str
- edge_evaluation
  - malicious_edge_selection: str (38)
  - threshold_method: str (39)

(1) Whether to store the graphs on disk once they are built. This avoids recomputing very complex, time-consuming batching operations. Can take up to 300GB of storage for CADETS_E5.
(2) Node features to use during GNN training. node_type is a one-hot encoded entity type vector, node_emb refers to the embedding generated during the featurization task, only_ones is a vector of ones with the same length as node_type, edges_distribution counts emitted and received edges.

Available options (multi selection):
node_type
node_emb
only_ones
edges_distribution
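
How the selected node features combine into one vector can be sketched as follows. The three entity types, the comma-separated selection string, and the concatenation order are assumptions; edges_distribution is left out because its exact layout is not described here.

```python
NODE_TYPES = ["subject", "file", "netflow"]  # assumed entity types


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_node_features(node_type, node_emb, node_features):
    """Concatenate the selected feature blocks in the order they are listed."""
    type_vec = one_hot(NODE_TYPES.index(node_type), len(NODE_TYPES))
    blocks = {
        "node_type": type_vec,
        "node_emb": list(node_emb),
        "only_ones": [1.0] * len(type_vec),  # same length as node_type
    }
    vector = []
    for name in node_features.split(","):
        vector.extend(blocks[name.strip()])
    return vector


x = build_node_features("file", [0.2, -0.1], "node_type,node_emb")
ones = build_node_features("file", [0.2, -0.1], "only_ones")
```
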
(3) Edge features to use during GNN training. edge_type refers to the system call type, edge_type_triplet treats the same edge type as a new type if the source or destination node types differ, msg is the message vector used in the TGN, time_encoding encodes the temporal order of events with their timestamps in the TGN, none uses no features.

Available options (multi selection):
edge_type
edge_type_triplet
msg
time_encoding
none
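
The edge_type_triplet idea can be sketched as an id assignment over (source type, edge type, destination type) triplets. The helper name and the first-seen id ordering are assumptions made for this example.

```python
def edge_type_triplet_ids(edges):
    """Map each (src_type, edge_type, dst_type) triplet to an integer id,
    assigned in first-seen order."""
    ids = {}
    encoded = []
    for triplet in edges:
        # setdefault evaluates len(ids) before inserting, so new triplets
        # receive the next free id.
        encoded.append(ids.setdefault(tuple(triplet), len(ids)))
    return encoded, ids


edges = [
    ("subject", "read", "file"),
    ("subject", "read", "netflow"),  # same edge type, different destination type
    ("subject", "read", "file"),
]
encoded, ids = edge_type_triplet_ids(edges)
```

With plain edge_type, all three edges would share one type; with the triplet encoding, the read on a netflow node gets its own id.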
(4) Whether the GNN should be trained on all datasets in multi_dataset.
(5) A bug was found in the first version of the framework, where reindexing graphs in shape (N, d) slightly modifies node features. Setting this to true fixes the bug.
(6) Flattens the time window-based graphs into a single large temporal graph and recreates graphs based on the given method. edges creates contiguous graphs of global_batching_batch_size edges each; minutes does the same with a size expressed in minutes; unique_edge_types builds graphs where each pair of connected nodes shares edges with distinct edge types; none uses the default time window-based batching defined in minutes by the time_window_size arg.

Available options (one selection):
edges
minutes
unique_edge_types
none
(7) Controls the value associated with global_batching.used_method (training+inference).
(8) Controls the value associated with global_batching.used_method (inference only).
(9) Breaks each previously computed graph into even smaller graphs. edges creates contiguous graphs of intra_graph_batch_size edges (for example, a graph with 2000 edges and intra_graph_batch_size=1500 yields two graphs, one with 1500 edges and the other with 500 edges). tgn_last_neighbor computes for each graph its associated graph based on the TGN last neighbor loader, namely a new graph where each node is connected with its last tgn_neighbor_size incoming edges. none does not alter any graph.

Available options (multi selection):
edges
tgn_last_neighbor
none
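
The edges option described above (2000 edges split with intra_graph_batch_size=1500) can be reproduced with a short sketch. The `split_by_edges` helper is a hypothetical name; edges are represented as plain integers for brevity.

```python
def split_by_edges(edges, intra_graph_batch_size):
    """Cut a chronologically ordered edge list into contiguous chunks."""
    return [
        edges[i:i + intra_graph_batch_size]
        for i in range(0, len(edges), intra_graph_batch_size)
    ]


edges = list(range(2000))          # stand-in for 2000 ordered edges
chunks = split_by_edges(edges, 1500)
sizes = [len(chunk) for chunk in chunks]
```
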
(10) Controls the value associated with intra_graph_batching.used_methods.
(11) Number of last neighbors to store for each node.
(12) If greater than one, will also gather the last neighbors of neighbors.
(13) A bug was found in the first version of the framework, where the features of last neighbors not appearing in the input graph have zero node feature vectors. Setting this arg to true includes the features of all nodes in the TGN graph.
(14) We found a minor bug in the original TGN code (https://github.com/pyg-team/pytorch_geometric/issues/10100). This is an unofficial fix.
(15) The original TGN's loader builds graphs in an undirected way. This makes the graphs purely directed.
(16) Whether to insert the edges of the current graph before loading last neighbors.
(17) Batches multiple graphs into a single large one for parallel training. Does not support TGN. graph_batching batches inter_graph_batch_size graphs together, none doesn't batch graphs.

Available options (one selection):
graph_batching
none
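
Batching several graphs into one large graph is commonly done by taking their disjoint union and offsetting node indices. The sketch below assumes that approach and represents each graph as a (number of nodes, edge list) pair; it is not taken from the framework's code.

```python
def batch_graphs(graphs):
    """Merge (num_nodes, edges) graphs into one disjoint graph."""
    offset, merged_edges = 0, []
    for num_nodes, edges in graphs:
        # Shift node ids so that graphs do not share nodes after merging.
        merged_edges.extend((src + offset, dst + offset) for src, dst in edges)
        offset += num_nodes
    return offset, merged_edges


def inter_graph_batching(graphs, inter_graph_batch_size):
    """Group consecutive graphs and merge each group into a single graph."""
    return [
        batch_graphs(graphs[i:i + inter_graph_batch_size])
        for i in range(0, len(graphs), inter_graph_batch_size)
    ]


graphs = [(2, [(0, 1)]), (3, [(0, 2), (1, 2)]), (2, [(1, 0)])]
batches = inter_graph_batching(graphs, 2)
```
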
(18) Controls the value associated with inter_graph_batching.used_method.
(19) Whether to force PyTorch to use deterministic algorithms.
(20) Number of neurons in the middle layers of the encoder.
(21) Number of neurons in the last layer of the encoder.
(22) Number of epochs to gather gradients before backprop.
(23) Device used during testing.

Available options (one selection):
cpu
cuda
(24) Which training pipeline to use.

Available options (one selection):
default
(25) First part of the neural network. Usually GNN encoders to capture complex patterns.

Available options (multi selection):
tgn
graph_attention
sage
gat
gin
sum_aggregation
rcaid_gat
magic_gat
glstm
custom_mlp
none
(26) Second part of the neural network. Usually MLPs specific to the downstream task (e.g. reconstruction or prediction).

Available options (multi selection):
predict_edge_type
predict_node_type
predict_masked_struct
detect_edge_few_shot
predict_edge_contrastive
reconstruct_node_features
reconstruct_node_embeddings
reconstruct_edge_embeddings
reconstruct_masked_features
(27) Old feature that needs some work to be updated.
(28) Whether to generate images of malicious nodes' neighborhoods (not stable).
(29) Available options (one selection):
orthrus
(30) Strategy to select the best model across epochs. best_adp selects the model with the highest ADP score, best_discrimination selects the model that best separates top-score TPs from top-score FPs.

Available options (one selection):
best_adp
best_discrimination
(31) Method to calculate the threshold value used to detect anomalies.

Available options (one selection):
max_val_loss
mean_val_loss
threatrace
magic
flash
nodlink
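
Two of the threshold options can be illustrated with a minimal sketch. It assumes, from the option names only, that max_val_loss and mean_val_loss take the maximum and the mean of the validation losses and that a node is flagged when its loss exceeds the threshold; the other methods are not covered.

```python
from statistics import mean


def compute_threshold(val_losses, threshold_method):
    """Derive an anomaly threshold from validation losses (assumed semantics)."""
    if threshold_method == "max_val_loss":
        return max(val_losses)
    if threshold_method == "mean_val_loss":
        return mean(val_losses)
    raise ValueError(f"unsupported threshold_method: {threshold_method}")


def detect(test_losses, threshold):
    """Return the nodes whose loss is strictly above the threshold."""
    return [node for node, loss in test_losses.items() if loss > threshold]


val_losses = [0.1, 0.3, 0.2]
test_losses = {"n1": 0.25, "n2": 0.9, "n3": 0.05}

flagged_max = detect(test_losses, compute_threshold(val_losses, "max_val_loss"))
flagged_mean = detect(test_losses, compute_threshold(val_losses, "mean_val_loss"))
```

Under these assumptions, the mean-based threshold is lower and therefore flags more nodes than the max-based one.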
(32) Whether to consider the loss of destination nodes when computing the node-level scores (maximum loss of a node).
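
A node-level score taken as the maximum loss over a node's edges can be sketched as below. The (source, destination, loss) edge representation and the rule that a node is scored only as a source when use_dst_node_loss is false are assumptions made for this example.

```python
def node_scores(edge_losses, use_dst_node_loss):
    """Score each node with the maximum loss among the edges it is scored on."""
    scores = {}
    for src, dst, loss in edge_losses:
        nodes = (src, dst) if use_dst_node_loss else (src,)
        for node in nodes:
            scores[node] = max(scores.get(node, loss), loss)
    return scores


edge_losses = [("a", "b", 0.2), ("a", "c", 0.7), ("b", "c", 0.4)]

src_only = node_scores(edge_losses, use_dst_node_loss=False)
with_dst = node_scores(edge_losses, use_dst_node_loss=True)
```

In this sketch node c only ever appears as a destination, so it receives a score only when destination losses are included.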
(33) Whether to cluster nodes after thresholding, as done in Orthrus.
(34) Number of top-score nodes selected before clustering.
(35) Time-window detection. The code is broken and needs work to be updated.

Available options (one selection):
max_val_loss
mean_val_loss
threatrace
magic
flash
nodlink
(36) Node-level detection where the same node appearing in multiple time windows is considered as multiple unique nodes. More realistic evaluation for near real-time detection. The code is broken and needs work to be updated.

Available options (one selection):
max_val_loss
mean_val_loss
threatrace
magic
flash
nodlink
(37) Queue-level detection as in Kairos. The code is broken and needs work to be updated.

Available options (one selection):
kairos_idf_queue
provnet_lof_queue
(38) The ground truth only contains node-level labels. This arg controls the strategy used to label edges. src_node and dst_node label an edge as malicious based only on its source node or only on its destination node, respectively. both_nodes labels an edge as malicious if both end nodes are malicious.

Available options (one selection):
src_node
dst_node
both_nodes
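
The three edge-labeling strategies can be sketched as follows. The `label_edges` helper and the set-based ground truth are hypothetical; the rules follow the reading of the options given above.

```python
def label_edges(edges, malicious_nodes, malicious_edge_selection):
    """Derive edge labels from node-level ground truth."""
    rules = {
        "src_node": lambda src, dst: src in malicious_nodes,
        "dst_node": lambda src, dst: dst in malicious_nodes,
        "both_nodes": lambda src, dst: src in malicious_nodes and dst in malicious_nodes,
    }
    rule = rules[malicious_edge_selection]
    return [rule(src, dst) for src, dst in edges]


edges = [("a", "b"), ("b", "c"), ("c", "d")]
malicious_nodes = {"a", "b"}

by_src = label_edges(edges, malicious_nodes, "src_node")
by_dst = label_edges(edges, malicious_nodes, "dst_node")
by_both = label_edges(edges, malicious_nodes, "both_nodes")
```
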
(39) Available options (one selection):
max_val_loss
mean_val_loss
threatrace
magic
flash
nodlink

Triage

tracing
- used_method: str (1)
- depimpact
  - used_method: str (2)
  - score_method: str (3)
  - workers: int
  - visualize: bool

(1) Post-processing step to reconstruct attack paths or reduce false positives. depimpact is used in Orthrus.

Available options (one selection):
depimpact
(2) Available options (one selection):
component
shortest_path
1-hop
2-hop
3-hop
(3) Available options (one selection):
degree
recon_loss
degree_recon