Tasks
Tasks are the steps that make up the pipeline, from graph construction (build_graphs) through detection (evaluation) and, optionally, triage (tracing).
Each task takes the output of the previous task as input and writes its own output to disk so that the next task can use it. This enables "checkpointing" across the pipeline and avoids duplicated computation. More information on tasks and the pipeline here.
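The checkpointing idea described above can be illustrated with a minimal sketch. This is not the framework's implementation: the helper names `run_task` and `run_pipeline`, the JSON file layout, and the toy task functions are all assumptions made for illustration, and the real pipeline may key its checkpoints differently (for example on the configuration).

```python
import json
import tempfile
from pathlib import Path


def run_task(name, fn, upstream, out_dir):
    """Run `fn` on `upstream` unless a checkpoint for `name` already exists.

    Hypothetical helper: returns (output, ran), where `ran` is False when
    the output was loaded from disk instead of being recomputed.
    """
    out_path = Path(out_dir) / f"{name}.json"
    if out_path.exists():
        # Checkpoint hit: reuse the stored output.
        return json.loads(out_path.read_text()), False
    result = fn(upstream)
    out_path.write_text(json.dumps(result))  # checkpoint for the next run
    return result, True


def run_pipeline(tasks, initial, out_dir):
    """Chain tasks so that each one consumes the previous task's output."""
    data, executed = initial, []
    for name, fn in tasks:
        data, ran = run_task(name, fn, data, out_dir)
        if ran:
            executed.append(name)
    return data, executed


# Toy stand-ins for real tasks (assumed, not the framework's functions).
tasks = [
    ("build_graphs", lambda events: {"num_edges": len(events)}),
    ("evaluation", lambda graphs: {"score": graphs["num_edges"] * 2}),
]

with tempfile.TemporaryDirectory() as tmp:
    first_result, first_run = run_pipeline(tasks, [1, 2, 3], tmp)
    second_result, second_run = run_pipeline(tasks, [1, 2, 3], tmp)
```

On the second call every task finds its checkpoint on disk, so nothing is recomputed and the result is identical.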
Preprocessing
- build_graphs
- used_method: str (1)
- use_all_files: bool
- mimicry_edge_num: int
- time_window_size: float (2)
- use_hashed_label: bool (3)
- fuse_edge: bool (4)
- node_label_features
- subject: str (5)
- file: str (6)
- netflow: str (7)
- multi_dataset: str (8)
- transformation
- used_methods: str (9)
- rcaid_pseudo_graph
- use_pruning: bool
- synthetic_attack_naive
- num_attacks: int
- num_malicious_process: int
- num_unauthorized_file_access: int
- process_selection_method: str
- (1) The method used to build time window graphs.
  Available options (one selection): default, magic
- (2) The size of each graph in minutes. The value should always be written as a float (e.g. 10.0). Supports sizes < 1.0.
- (3) Whether to hash the textual features.
- (4) Whether to fuse duplicate sequential edges into a single edge.
- (5) Which features to use for process nodes. Features will be concatenated.
  Available options (multi selection): type, path, cmd_line
- (6) Which features to use for file nodes. Features will be concatenated.
  Available options (multi selection): type, path
- (7) Which features to use for netflow nodes. Features will be concatenated.
  Available options (multi selection): type, remote_ip, remote_port
- (8) A comma-separated list of datasets on which training is performed. Evaluation is done only on the primary dataset run in the CLI.
  Available options (one selection): THEIA_E5, THEIA_E3, CADETS_E5, CADETS_E3, CLEARSCOPE_E5, CLEARSCOPE_E3, optc_h201, optc_h501, optc_h051, none
- (9) Applies transformations to graphs after their construction. Multiple transformations can be applied sequentially. Example:
  used_methods=undirected,dag
  Available options (multi selection): undirected, dag, rcaid_pseudo_graph, none, synthetic_attack_naive
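To make the effect of time_window_size and fuse_edge more concrete, here is a minimal sketch of both ideas. It is an illustration under stated assumptions, not the framework's code: the event layout `(src, dst, edge_type, timestamp_in_seconds)`, the function names, and the choice to keep the first edge of each duplicate run are all hypothetical.

```python
from collections import defaultdict


def split_into_time_windows(events, time_window_size):
    """Group events into consecutive windows of `time_window_size` minutes.

    Assumes a non-empty list of (src, dst, edge_type, timestamp_seconds).
    """
    window_seconds = time_window_size * 60.0  # float sizes below 1.0 also work
    start = min(event[3] for event in events)
    windows = defaultdict(list)
    for event in sorted(events, key=lambda e: e[3]):
        index = int((event[3] - start) // window_seconds)
        windows[index].append(event)
    return [windows[i] for i in sorted(windows)]


def fuse_sequential_edges(window):
    """Collapse runs of identical consecutive (src, dst, edge_type) edges."""
    fused = []
    for src, dst, edge_type, ts in window:
        if fused and fused[-1][:3] == (src, dst, edge_type):
            continue  # same edge as the previous one: keep only the first
        fused.append((src, dst, edge_type, ts))
    return fused


events = [
    ("p1", "f1", "read", 0.0),
    ("p1", "f1", "read", 5.0),   # duplicate of the previous edge
    ("p1", "f2", "write", 20.0),
    ("p1", "f1", "read", 25.0),  # same triple, but not sequential
    ("p2", "n1", "connect", 40.0),
]

# 0.5 minutes = 30-second windows.
windows = split_into_time_windows(events, time_window_size=0.5)
fused = [fuse_sequential_edges(window) for window in windows]
```

Only back-to-back duplicates are fused here: the read at 25.0 s survives because a write separates it from the earlier reads.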
Featurization
- feat_training
- emb_dim: int (1)
- epochs: int (2)
- use_seed: bool
- training_split: str (3)
- multi_dataset_training: bool (4)
- used_method: str (5)
- feat_inference
- to_remove: bool
- (1) Size of the text embedding. Arg not used by some featurization methods that do not build embeddings.
- (2) Epochs to train the embedding method. Arg not used by some methods.
- (3) The partition of data used to train the featurization method.
  Available options (one selection): train, all
- (4) Whether the featurization method should be trained on all datasets in multi_dataset.
- (5) Algorithms used to create node and edge features.
  Available options (one selection): word2vec, doc2vec, fasttext, alacarte, temporal_rw, flash, hierarchical_hashing, magic, only_type, only_ones
Detection
- graph_preprocessing
- save_on_disk: bool (1)
- node_features: str (2)
- edge_features: str (3)
- multi_dataset_training: bool (4)
- fix_buggy_graph_reindexer: bool (5)
- global_batching
- used_method: str (6)
- global_batching_batch_size: int (7)
- global_batching_batch_size_inference: int (8)
- intra_graph_batching
- used_methods: str (9)
- edges
- intra_graph_batch_size: int (10)
- tgn_last_neighbor
- tgn_neighbor_size: int (11)
- tgn_neighbor_n_hop: int (12)
- fix_buggy_orthrus_TGN: bool (13)
- fix_tgn_neighbor_loader: bool (14)
- directed: bool (15)
- insert_neighbors_before: bool (16)
- inter_graph_batching
- used_method: str (17)
- inter_graph_batch_size: int (18)
- gnn_training
- use_seed: bool
- deterministic: bool (19)
- num_epochs: int
- patience: int
- lr: float
- weight_decay: float
- node_hid_dim: int (20)
- node_out_dim: int (21)
- grad_accumulation: int (22)
- inference_device: str (23)
- used_method: str (24)
- encoder
- dropout: float
- used_methods: str (25)
- x_is_tuple: bool (26)
- decoder
- used_methods: str (27)
- use_few_shot: bool (28)
- few_shot
- include_attacks_in_ssl_training: bool
- freeze_encoder: bool
- num_epochs_few_shot: int
- patience_few_shot: int
- lr_few_shot: float
- weight_decay_few_shot: float
- decoder
- used_methods: str
- evaluation
- viz_malicious_nodes: bool (29)
- ground_truth_version: str (30)
- best_model_selection: str (31)
- used_method: str
- node_evaluation
- threshold_method: str (32)
- use_dst_node_loss: bool (33)
- use_kmeans: bool (34)
- kmeans_top_K: int (35)
- tw_evaluation
- threshold_method: str (36)
- node_tw_evaluation
- threshold_method: str (37)
- use_dst_node_loss: bool
- use_kmeans: bool
- kmeans_top_K: int
- queue_evaluation
- used_method: str (38)
- queue_threshold: int
- kairos_idf_queue
- include_test_set_in_IDF: bool
- provnet_lof_queue
- queue_arg: str
- edge_evaluation
- malicious_edge_selection: str (39)
- threshold_method: str (40)
- (1) Whether to store the graphs on disk once they are built. Used to avoid re-computing very complex batching operations that take time. Can take up to 300GB of storage for CADETS_E5.
- (2) Node features to use during GNN training.
  node_type is a one-hot encoded entity type vector, node_emb refers to the embedding generated during the featurization task, only_ones is a vector of ones with length node_type, edges_distribution counts emitted and received edges.
  Available options (multi selection): node_type, node_emb, only_ones, edges_distribution
- (3) Edge features to use during GNN training.
  edge_type refers to the system call type, edge_type_triplet considers a same edge type as a new type if source or destination node types are different, msg is the message vector used in the TGN, time_encoding encodes the temporal order of events with their timestamps in the TGN, none uses no features.
  Available options (multi selection): edge_type, edge_type_triplet, msg, time_encoding, none
- (4) Whether the GNN should be trained on all datasets in multi_dataset.
- (5) A bug was found in the first version of the framework, where reindexing graphs in shape (N, d) slightly modifies node features. Setting this to true fixes the bug.
- (6) Flattens the time window-based graphs into a single large temporal graph and recreates graphs based on the given method.
  edges creates contiguous graphs of size global_batching_batch_size edges, the same applies for minutes, unique_edge_types builds graphs where each pair of connected nodes share edges with distinct edge types, none uses the default time window-based batching defined in minutes with arg time_window_size.
  Available options (one selection): edges, minutes, unique_edge_types, none
- (7) Controls the value associated with global_batching.used_method (training+inference).
- (8) Controls the value associated with global_batching.used_method (inference only).
- (9) Breaks each previously computed graph into even smaller graphs.
  edges creates contiguous graphs of size intra_graph_batch_size edges (if a graph has 2000 edges and intra_graph_batch_size=1500, it creates two graphs: one with 1500 edges, the other with 500 edges), tgn_last_neighbor computes for each graph its associated graph based on the TGN last neighbor loader, namely a new graph where each node is connected with its last tgn_neighbor_size incoming edges. none does not alter any graph.
  Available options (multi selection): edges, tgn_last_neighbor, none
- (10) Controls the value associated with global_batching.used_method.
- (11) Number of last neighbors to store for each node.
- (12) If greater than one, will also gather the last neighbors of neighbors.
- (13) A bug was found in the first version of the framework, where the features of last neighbors not appearing in the input graph have zero node feature vectors. Setting this arg to true includes the features of all nodes in the TGN graph.
- (14) We found a minor bug in the original TGN code (https://github.com/pyg-team/pytorch_geometric/issues/10100). This is an unofficial fix.
- (15) The original TGN's loader builds graphs in an undirected way. This makes the graphs purely directed.
- (16) Whether to insert the edges of the current graph before loading last neighbors.
- (17) Batches multiple graphs into a single large one for parallel training. Does not support TGN.
  graph_batching batches inter_graph_batch_size graphs together, none doesn't batch graphs.
  Available options (one selection): graph_batching, none
- (18) Controls the value associated with inter_graph_batching.used_method.
- (19) Whether to force PyTorch to use deterministic algorithms.
- (20) Number of neurons in the middle layers of the encoder.
- (21) Number of neurons in the last layer of the encoder.
- (22) Number of epochs to gather gradients before backprop.
- (23) Device used during testing.
  Available options (one selection): cpu, cuda
- (24) Which training pipeline to use.
  Available options (one selection): default
- (25) First part of the neural network. Usually GNN encoders to capture complex patterns.
  Available options (multi selection): tgn, graph_attention, sage, gat, gin, sum_aggregation, rcaid_gat, magic_gat, glstm, custom_mlp, none
- (26) Whether to consider nodes differently when being source or destination.
- (27) Second part of the neural network. Usually MLPs specific to the downstream task (e.g. reconstruction or prediction).
  Available options (multi selection): predict_edge_type, predict_node_type, predict_masked_struct, detect_edge_few_shot, predict_edge_contrastive, reconstruct_node_features, reconstruct_node_embeddings, reconstruct_edge_embeddings, reconstruct_masked_features
- (28) Old feature: needs some work to update it.
- (29) Whether to generate images of malicious nodes' neighborhoods (not stable).
- (30) Available options (one selection): orthrus, reapr
- (31) Strategy to select the best model across epochs.
  best_adp selects the best model based on the highest ADP score, best_discrimination selects the model that best separates top-score TPs from top-score FPs.
  Available options (one selection): best_adp, best_discrimination
- (32) Method to calculate the threshold value used to detect anomalies.
  Available options (one selection): max_val_loss, mean_val_loss, threatrace, magic, flash, nodlink
- (33) Whether to consider the loss of destination nodes when computing the node-level scores (maximum loss of a node).
- (34) Whether to cluster nodes after thresholding, as done in Orthrus.
- (35) Number of top-score nodes selected before clustering.
- (36) Time-window detection. The code is broken and needs work to be updated.
  Available options (one selection): max_val_loss, mean_val_loss, threatrace, magic, flash, nodlink
- (37) Node-level detection where a same node in multiple time windows is considered as multiple unique nodes. More realistic evaluation for near real-time detection. The code is broken and needs work to be updated.
  Available options (one selection): max_val_loss, mean_val_loss, threatrace, magic, flash, nodlink
- (38) Queue-level detection as in Kairos. The code is broken and needs work to be updated.
  Available options (one selection): kairos_idf_queue, provnet_lof_queue
- (39) The ground truth only contains node-level labels. This arg controls the strategy to label edges.
  src_node and dst_node consider an edge as malicious if only its source or only its destination node is malicious. both_nodes labels an edge as malicious if both end nodes are malicious.
  Available options (one selection): src_node, dst_node, both_nodes
- (40) Available options (one selection): max_val_loss, mean_val_loss, threatrace, magic, flash, nodlink
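Two of the options described above lend themselves to a short illustration: the edges method of intra_graph_batching and the malicious_edge_selection strategies. The sketch below is only one plausible reading of those descriptions, not the framework's code. The function names, the representation of edges as `(src, dst)` pairs, and the interpretation that src_node looks only at the source and dst_node only at the destination are assumptions.

```python
def split_edges(edges, intra_graph_batch_size):
    """Cut an ordered edge list into contiguous chunks of at most
    `intra_graph_batch_size` edges (the last chunk may be smaller)."""
    return [
        edges[i:i + intra_graph_batch_size]
        for i in range(0, len(edges), intra_graph_batch_size)
    ]


def label_edges(edges, malicious_nodes, malicious_edge_selection):
    """Derive edge labels from node-level ground truth (assumed semantics)."""
    labels = []
    for src, dst in edges:
        src_bad = src in malicious_nodes
        dst_bad = dst in malicious_nodes
        if malicious_edge_selection == "src_node":
            labels.append(src_bad)
        elif malicious_edge_selection == "dst_node":
            labels.append(dst_bad)
        elif malicious_edge_selection == "both_nodes":
            labels.append(src_bad and dst_bad)
        else:
            raise ValueError(f"unknown strategy: {malicious_edge_selection}")
    return labels


# The example from the text: 2000 edges with intra_graph_batch_size=1500.
edges = [(i, i + 1) for i in range(2000)]
chunk_sizes = [len(chunk) for chunk in split_edges(edges, 1500)]

# Toy ground truth: nodes 2 and 3 are malicious.
sample_edges = [(1, 2), (2, 3), (3, 4)]
malicious = {2, 3}
by_src = label_edges(sample_edges, malicious, "src_node")
by_dst = label_edges(sample_edges, malicious, "dst_node")
by_both = label_edges(sample_edges, malicious, "both_nodes")
```

The both_nodes strategy is the strictest of the three: in the toy example only the edge (2, 3) is labeled malicious.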
Triage
- tracing
- used_method: str (1)
- depimpact
- used_method: str (2)
- score_method: str (3)
- workers: int
- visualize: bool
- (1) Post-processing step to reconstruct attack paths or reduce false positives.
  depimpact is used in Orthrus.
  Available options (one selection): depimpact
- (2) Available options (one selection): component, shortest_path, 1-hop, 2-hop, 3-hop
- (3) Available options (one selection): degree, recon_loss, degree_recon