Tasks
Tasks are steps composing the pipeline, starting from graph construction (build_graphs) to detection (evaluation) or optionally triage (tracing).
Each task takes the output of the previous task as input and writes its own output to disk so that the next task can use it. This process enables "checkpointing" across the pipeline and avoids duplicated computation. More information on tasks and the pipeline here.
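The checkpointing idea can be illustrated with a minimal sketch. It is not the framework's implementation: the JSON file layout, the helper names (run_task, run_pipeline) and the toy computations are assumptions made for illustration only. Each task reads the file written by the previous task, writes its own file, and is skipped when that file already exists.

```python
import json
import tempfile
from pathlib import Path


def run_task(name, compute, input_path, out_dir, executed):
    """Run one task, or reuse its output if a checkpoint already exists."""
    out_path = Path(out_dir) / f"{name}.json"
    if out_path.exists():
        return out_path  # checkpoint hit: nothing is recomputed
    # Load the previous task's output from disk (None for the first task).
    previous = None
    if input_path is not None:
        previous = json.loads(Path(input_path).read_text())
    out_path.write_text(json.dumps(compute(previous)))
    executed.append(name)
    return out_path


def run_pipeline(tasks, out_dir):
    """Chain tasks through files; return the final output and the tasks that ran."""
    executed, path = [], None
    for name, compute in tasks:
        path = run_task(name, compute, path, out_dir, executed)
    return json.loads(path.read_text()), executed


# Toy stand-ins for real tasks (hypothetical computations).
TOY_TASKS = [
    ("build_graphs", lambda _: {"edges": [[0, 1], [1, 2]]}),
    ("feat_training", lambda g: {"edges": g["edges"], "emb_dim": 2}),
    ("evaluation", lambda g: {"num_edges": len(g["edges"])}),
]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        print(run_pipeline(TOY_TASKS, out_dir))  # first run executes every task
        print(run_pipeline(TOY_TASKS, out_dir))  # second run reuses all checkpoints
```

Note that this sketch keys checkpoints on the task name alone, so it would not notice a changed argument; a real pipeline needs some way to invalidate stale outputs.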
Preprocessing
- build_graphs
- used_method: str (1)
- use_all_files: bool
- mimicry_edge_num: int
- time_window_size: float (2)
- use_hashed_label: bool (3)
- fuse_edge: bool (4)
- node_label_features
- subject: str (5)
- file: str (6)
- netflow: str (7)
- multi_dataset: str (8)
- transformation
- used_methods: str (9)
- rcaid_pseudo_graph
- use_pruning: bool
- synthetic_attack_naive
- num_attacks: int
- num_malicious_process: int
- num_unauthorized_file_access: int
- process_selection_method: str
- (1) The method used to build time window graphs.
  Available options (one selection): default, magic
- (2) The size of each graph in minutes. The value must always be written as a float (e.g. 10.0). Sizes below 1.0 are supported.
- (3) Whether to hash the textual features.
- (4) Whether to fuse duplicate sequential edges into a single edge.
- (5) Which features to use for process nodes. Features will be concatenated.
  Available options (multi selection): type, path, cmd_line
- (6) Available options (multi selection): type, path
- (7) Available options (multi selection): type, remote_ip, remote_port
- (8) A comma-separated list of datasets on which training is performed. Evaluation is done only on the primary dataset given in the CLI.
  Available options (one selection): THEIA_E5, THEIA_E3, CADETS_E5, CADETS_E3, CLEARSCOPE_E5, CLEARSCOPE_E3, optc_h201, optc_h501, optc_h051, none
- (9) Applies transformations to graphs after their construction. Multiple transformations can be applied sequentially. Example: used_methods=undirected,dag
  Available options (multi selection): undirected, dag, rcaid_pseudo_graph, none, synthetic_attack_naive
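To make time_window_size and fuse_edge more concrete, here is a small sketch of the two ideas. It is not the framework's code: the event tuple layout (timestamp in seconds, source, destination, edge type), the alignment of windows on timestamp 0, and the choice to keep the first edge of each duplicate run are all assumptions, and both function names are hypothetical.

```python
from itertools import groupby


def build_time_windows(events, time_window_size):
    """Group events into consecutive windows of time_window_size minutes.

    events: iterable of (timestamp_seconds, src, dst, edge_type) tuples.
    time_window_size: float, in minutes (values below 1.0 are allowed).
    """
    window_seconds = time_window_size * 60.0
    windows = {}
    for event in sorted(events):  # sort by timestamp first
        index = int(event[0] // window_seconds)
        windows.setdefault(index, []).append(event)
    return [windows[index] for index in sorted(windows)]


def fuse_sequential_duplicates(edges):
    """Collapse each run of consecutive edges with the same (src, dst, edge_type).

    Assumption: the first edge of the run (and its timestamp) is the one kept.
    """
    return [next(run) for _, run in groupby(edges, key=lambda e: e[1:])]


events = [
    (0, "p1", "f1", "read"),
    (5, "p1", "f1", "read"),   # duplicate of the previous edge, will be fused
    (20, "p1", "f2", "write"),
    (40, "p1", "f1", "read"),
    (45, "p2", "f1", "read"),
]

# 0.5 minutes = 30-second windows.
graphs = [fuse_sequential_duplicates(w) for w in build_time_windows(events, 0.5)]
for graph in graphs:
    print(graph)
```

With these toy events, the first window keeps two edges after fusing and the second window keeps both of its edges, since they differ in their source node.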
Featurization
- feat_training
- emb_dim: int (1)
- epochs: int (2)
- use_seed: bool
- training_split: str (3)
- multi_dataset_training: bool (4)
- used_method: str (5)
- feat_inference
- to_remove: bool
- (1) Size of the text embedding. Not used by featurization methods that do not build embeddings.
- (2) Number of epochs used to train the embedding method. Not used by some methods.
- (3) The partition of data used to train the featurization method.
  Available options (one selection): train, all
- (4) Whether the featurization method should be trained on all datasets in multi_dataset.
- (5) Algorithm used to create node and edge features.
  Available options (one selection): word2vec, doc2vec, fasttext, alacarte, temporal_rw, flash, hierarchical_hashing, magic, only_type, only_ones
Detection
- graph_preprocessing
- save_on_disk: bool (1)
- node_features: str (2)
- edge_features: str (3)
- multi_dataset_training: bool (4)
- fix_buggy_graph_reindexer: bool (5)
- global_batching
- used_method: str (6)
- global_batching_batch_size: int (7)
- global_batching_batch_size_inference: int (8)
- intra_graph_batching
- used_methods: str (9)
- edges
- intra_graph_batch_size: int (10)
- tgn_last_neighbor
- tgn_neighbor_size: int (11)
- tgn_neighbor_n_hop: int (12)
- fix_buggy_orthrus_TGN: bool (13)
- fix_tgn_neighbor_loader: bool (14)
- directed: bool (15)
- insert_neighbors_before: bool (16)
- inter_graph_batching
- used_method: str (17)
- inter_graph_batch_size: int (18)
- gnn_training
- use_seed: bool
- deterministic: bool (19)
- num_epochs: int
- patience: int
- lr: float
- weight_decay: float
- node_hid_dim: int (20)
- node_out_dim: int (21)
- grad_accumulation: int (22)
- inference_device: str (23)
- used_method: str (24)
- encoder
- dropout: float
- used_methods: str (25)
- decoder
- used_methods: str (26)
- use_few_shot: bool (27)
- few_shot
- include_attacks_in_ssl_training: bool
- freeze_encoder: bool
- num_epochs_few_shot: int
- patience_few_shot: int
- lr_few_shot: float
- weight_decay_few_shot: float
- decoder
- used_methods: str
- evaluation
- viz_malicious_nodes: bool (28)
- ground_truth_version: str (29)
- best_model_selection: str (30)
- used_method: str
- node_evaluation
- threshold_method: str (31)
- use_dst_node_loss: bool (32)
- use_kmeans: bool (33)
- kmeans_top_K: int (34)
- tw_evaluation
- threshold_method: str (35)
- node_tw_evaluation
- threshold_method: str (36)
- use_dst_node_loss: bool
- use_kmeans: bool
- kmeans_top_K: int
- queue_evaluation
- used_method: str (37)
- queue_threshold: int
- kairos_idf_queue
- include_test_set_in_IDF: bool
- provnet_lof_queue
- queue_arg: str
- edge_evaluation
- malicious_edge_selection: str (38)
- threshold_method: str (39)
- (1) Whether to store the graphs on disk after building them. This avoids recomputing very complex, time-consuming batching operations. Can take up to 300GB of storage for CADETS_E5.
- (2) Node features to use during GNN training. node_type is a one-hot encoded entity type vector, node_emb refers to the embedding generated during the featurization task, only_ones is a vector of ones with the same length as node_type, edges_distribution counts emitted and received edges.
  Available options (multi selection): node_type, node_emb, only_ones, edges_distribution
- (3) Edge features to use during GNN training. edge_type refers to the system call type, edge_type_triplet treats the same edge type as a new type if the source or destination node types differ, msg is the message vector used in the TGN, time_encoding encodes the temporal order of events with their timestamps in the TGN, none uses no features.
  Available options (multi selection): edge_type, edge_type_triplet, msg, time_encoding, none
- (4) Whether the GNN should be trained on all datasets in multi_dataset.
- (5) A bug was found in the first version of the framework, where reindexing graphs in shape (N, d) slightly modifies node features. Setting this to true fixes the bug.
- (6) Flattens the time window-based graphs into a single large temporal graph and recreates graphs based on the given method. edges creates contiguous graphs of size global_batching_batch_size edges (the same applies for minutes), unique_edge_types builds graphs where each pair of connected nodes shares edges with distinct edge types, none uses the default time window-based batching defined in minutes with the arg time_window_size.
  Available options (one selection): edges, minutes, unique_edge_types, none
- (7) Controls the value associated with global_batching.used_method (training+inference).
- (8) Controls the value associated with global_batching.used_method (inference only).
- (9) Breaks each previously computed graph into even smaller graphs. edges creates contiguous graphs of size intra_graph_batch_size edges (e.g. a graph with 2000 edges and intra_graph_batch_size=1500 yields two graphs: one with 1500 edges, the other with 500 edges), tgn_last_neighbor computes for each graph its associated graph based on the TGN last neighbor loader, namely a new graph where each node is connected with its last tgn_neighbor_size incoming edges. none does not alter any graph.
  Available options (multi selection): edges, tgn_last_neighbor, none
- (10) Controls the value associated with global_batching.used_method.
- (11) Number of last neighbors to store for each node.
- (12) If greater than one, will also gather the last neighbors of neighbors.
- (13) A bug was found in the first version of the framework, where the last neighbors not appearing in the input graph have zero node feature vectors. Setting this arg to true includes the features of all nodes in the TGN graph.
- (14) We found a minor bug in the original TGN code (https://github.com/pyg-team/pytorch_geometric/issues/10100). This is an unofficial fix.
- (15) The original TGN's loader builds graphs in an undirected way. This makes the graphs purely directed.
- (16) Whether to insert the edges of the current graph before loading last neighbors.
- (17) Batches multiple graphs into a single large one for parallel training. Does not support TGN. graph_batching batches inter_graph_batch_size graphs together, none doesn't batch graphs.
  Available options (one selection): graph_batching, none
- (18) Controls the value associated with inter_graph_batching.used_method.
- (19) Whether to force PyTorch to use deterministic algorithms.
- (20) Number of neurons in the middle layers of the encoder.
- (21) Number of neurons in the last layer of the encoder.
- (22) Number of epochs to gather gradients before backprop.
- (23) Device used during testing.
  Available options (one selection): cpu, cuda
- (24) Which training pipeline to use.
  Available options (one selection): default
- (25) First part of the neural network. Usually GNN encoders to capture complex patterns.
  Available options (multi selection): tgn, graph_attention, sage, gat, gin, sum_aggregation, rcaid_gat, magic_gat, glstm, custom_mlp, none
- (26) Second part of the neural network. Usually MLPs specific to the downstream task (e.g. reconstruction or prediction).
  Available options (multi selection): predict_edge_type, predict_node_type, predict_masked_struct, detect_edge_few_shot, predict_edge_contrastive, reconstruct_node_features, reconstruct_node_embeddings, reconstruct_edge_embeddings, reconstruct_masked_features
- (27) Old feature: needs some work to be updated.
- (28) Whether to generate images of malicious nodes' neighborhoods (not stable).
- (29) Available options (one selection): orthrus
- (30) Strategy to select the best model across epochs. best_adp selects the best model based on the highest ADP score, best_discrimination selects the model that best separates top-score TPs from top-score FPs.
  Available options (one selection): best_adp, best_discrimination
- (31) Method to calculate the threshold value used to detect anomalies.
  Available options (one selection): max_val_loss, mean_val_loss, threatrace, magic, flash, nodlink
- (32) Whether to consider the loss of destination nodes when computing the node-level scores (maximum loss of a node).
- (33) Whether to cluster nodes after thresholding, as done in Orthrus.
- (34) Number of top-score nodes selected before clustering.
- (35) Time-window detection. The code is broken and needs work to be updated.
  Available options (one selection): max_val_loss, mean_val_loss, threatrace, magic, flash, nodlink
- (36) Node-level detection where the same node in multiple time windows is considered as multiple unique nodes. More realistic evaluation for near real-time detection. The code is broken and needs work to be updated.
  Available options (one selection): max_val_loss, mean_val_loss, threatrace, magic, flash, nodlink
- (37) Queue-level detection as in Kairos. The code is broken and needs work to be updated.
  Available options (one selection): kairos_idf_queue, provnet_lof_queue
- (38) The ground truth only contains node-level labels. This arg controls the strategy to label edges. src_node and dst_node consider an edge as malicious if its source node (respectively, its destination node) is malicious, regardless of the other end. both_nodes labels an edge as malicious if both end nodes are malicious.
  Available options (one selection): src_node, dst_node, both_nodes
- (39) Available options (one selection): max_val_loss, mean_val_loss, threatrace, magic, flash, nodlink
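The edges option of intra_graph_batching described above can be sketched in a few lines. This is an illustrative snippet, not the framework's implementation: the function name split_into_edge_batches is hypothetical, and a graph is simplified to a plain list of edges.

```python
def split_into_edge_batches(edges, intra_graph_batch_size):
    """Split a graph's edge list into contiguous chunks of at most
    intra_graph_batch_size edges, preserving the original order."""
    if intra_graph_batch_size <= 0:
        raise ValueError("intra_graph_batch_size must be positive")
    return [
        edges[start:start + intra_graph_batch_size]
        for start in range(0, len(edges), intra_graph_batch_size)
    ]


# A toy graph with 2000 edges, split with intra_graph_batch_size=1500.
edges = [(i, i + 1) for i in range(2000)]
batches = split_into_edge_batches(edges, 1500)
print([len(batch) for batch in batches])  # [1500, 500]
```

This reproduces the example from the description: a graph with 2000 edges and intra_graph_batch_size=1500 yields one graph with 1500 edges and one with 500.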
Triage
- tracing
- used_method: str (1)
- depimpact
- used_method: str (2)
- score_method: str (3)
- workers: int
- visualize: bool
- (1) Post-processing step to reconstruct attack paths or reduce false positives. depimpact is used in Orthrus.
  Available options (one selection): depimpact
- (2) Available options (one selection): component, shortest_path, 1-hop, 2-hop, 3-hop
- (3) Available options (one selection): degree, recon_loss, degree_recon