Tasks

Tasks are the steps that compose the pipeline, from graph construction (build_graphs) to detection (evaluation) and, optionally, triage (tracing). Each task takes the output of the previous task as input and writes its own output to disk so that the next task can use it. This enables "checkpointing" across the pipeline and avoids duplicated computation. More information on tasks and the pipeline here.
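
The checkpointing idea can be illustrated with a minimal sketch. This is not the framework's implementation: `run_task`, the pickle storage, and the hashing scheme are all hypothetical, chosen only to show how a task can reuse an output already written to disk and how a change in an early task invalidates the later ones.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def run_task(name, cfg, upstream, compute, root):
    """Run one task, or reload its output if it was already computed.

    The cache key covers the task name, its config, and the upstream key,
    so changing an early task also changes the key of every later task.
    (Hypothetical helper, not part of the framework.)
    """
    upstream_key = upstream["key"] if upstream else ""
    payload = json.dumps([name, cfg, upstream_key], sort_keys=True)
    key = hashlib.sha256(payload.encode()).hexdigest()[:16]
    path = Path(root) / name / f"{key}.pkl"

    if path.exists():
        # Checkpoint hit: skip the computation entirely.
        data = pickle.loads(path.read_bytes())
        return {"key": key, "data": data, "cached": True}

    data = compute(upstream["data"] if upstream else None, cfg)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(data))
    return {"key": key, "data": data, "cached": False}


# Toy stand-ins for two consecutive tasks.
def build(_, cfg):
    return list(range(cfg["n"]))


def feats(graphs, cfg):
    return [g * cfg["scale"] for g in graphs]


root = tempfile.mkdtemp()
first = run_task("build_graphs", {"n": 3}, None, build, root)
second = run_task("feat_training", {"scale": 2}, first, feats, root)
again = run_task("feat_training", {"scale": 2}, first, feats, root)  # reloaded from disk
```

Here the third call finds the file written by the second call and returns it without recomputing.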

Preprocessing

  • build_graphs
    • used_method: str (1)
    • use_all_files: bool
    • mimicry_edge_num: int
    • time_window_size: float (2)
    • use_hashed_label: bool (3)
    • fuse_edge: bool (4)
    • node_label_features
      • subject: str (5)
      • file: str (6)
      • netflow: str (7)
    • multi_dataset: str (8)
  • transformation
    • used_methods: str (9)
    • rcaid_pseudo_graph
      • use_pruning: bool
    • synthetic_attack_naive
      • num_attacks: int
      • num_malicious_process: int
      • num_unauthorized_file_access: int
      • process_selection_method: str
  1. The method to build time window graphs.

    Available options (one selection):
    default
    magic
  2. The size of each graph in minutes. The notation should always be float (e.g. 10.0). Supports sizes < 1.0.
  3. Whether to hash the textual features.
  4. Whether to fuse duplicate sequential edges into a single edge.
  5. Which features to use for process nodes. Features will be concatenated.

    Available options (multi selection):
    type
    path
    cmd_line

  6. Available options (multi selection):
    type
    path

  7. Available options (multi selection):
    type
    remote_ip
    remote_port
  8. A comma-separated list of datasets on which training is performed. Evaluation is done only on the primary dataset passed in the CLI.

    Available options (one selection):
    THEIA_E5
    THEIA_E3
    CADETS_E5
    CADETS_E3
    CLEARSCOPE_E5
    CLEARSCOPE_E3
    optc_h201
    optc_h501
    optc_h051
    none
  9. Applies transformations to graphs after their construction. Multiple transformations can be applied sequentially. Example: used_methods=undirected,dag

    Available options (multi selection):
    undirected
    dag
    rcaid_pseudo_graph
    none
    synthetic_attack_naive
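
As an illustration of time_window_size and fuse_edge, here is a minimal sketch of time-window graph construction. It rests on several assumptions that the text above does not confirm: events are `(src, dst, edge_type, timestamp)` tuples with timestamps in seconds, "duplicate sequential edges" means consecutive events with the same source, destination and type, and empty windows are simply skipped. The function name is hypothetical.

```python
from collections import defaultdict


def build_time_window_graphs(events, time_window_size, fuse_edge=False):
    """Group (src, dst, edge_type, t_seconds) events into time-window graphs.

    time_window_size is given in minutes as a float, so values below 1.0
    (e.g. 0.5 for 30 seconds) work as well.
    """
    if not events:
        return []
    window_s = time_window_size * 60.0  # minutes -> seconds
    events = sorted(events, key=lambda e: e[3])
    t0 = events[0][3]

    windows = defaultdict(list)
    for src, dst, etype, t in events:
        index = int((t - t0) // window_s)
        edges = windows[index]
        # Fuse only when the previous edge in this window is an exact repeat.
        if fuse_edge and edges and edges[-1][:3] == (src, dst, etype):
            continue
        edges.append((src, dst, etype, t))
    # Windows without any event are not materialized.
    return [windows[i] for i in sorted(windows)]


events = [
    ("p1", "f1", "read", 0),
    ("p1", "f1", "read", 5),    # repeat of the previous edge
    ("p1", "f2", "write", 20),
    ("p1", "f1", "read", 25),   # not fused: the previous edge differs
    ("p2", "f1", "read", 40),
]
plain = build_time_window_graphs(events, time_window_size=0.5)
fused = build_time_window_graphs(events, time_window_size=0.5, fuse_edge=True)
```

With 30-second windows, the first four events fall into the first graph and the last event into the second. Fusing drops only the second `read`.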

Featurization

  • feat_training
    • emb_dim: int (1)
    • epochs: int (2)
    • use_seed: bool
    • training_split: str (3)
    • multi_dataset_training: bool (4)
    • used_method: str (5)
  • feat_inference
    • to_remove: bool
  1. Size of the text embedding. Arg not used by some featurization methods that do not build embeddings.
  2. Epochs to train the embedding method. Arg not used by some methods.
  3. The partition of data used to train the featurization method.

    Available options (one selection):
    train
    all
  4. Whether the featurization method should be trained on all datasets in multi_dataset.
  5. Algorithms used to create node and edge features.

    Available options (one selection):
    word2vec
    doc2vec
    fasttext
    alacarte
    temporal_rw
    flash
    hierarchical_hashing
    magic
    only_type
    only_ones
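
The "one selection" and "multi selection" arguments listed on this page are plain strings, with multiple values separated by commas (e.g. used_methods=undirected,dag). The sketch below shows one way such values could be checked against the documented options. `parse_selection` is a hypothetical helper, not the framework's parser.

```python
# Options copied from the lists on this page.
FEAT_METHODS = {
    "word2vec", "doc2vec", "fasttext", "alacarte", "temporal_rw",
    "flash", "hierarchical_hashing", "magic", "only_type", "only_ones",
}
TRAINING_SPLITS = {"train", "all"}
TRANSFORMATIONS = {
    "undirected", "dag", "rcaid_pseudo_graph", "none", "synthetic_attack_naive",
}


def parse_selection(value, options, multi=False):
    """Validate a config string against its allowed options.

    Returns a single string for "one selection" args and a list,
    in the order given, for "multi selection" args.
    """
    chosen = [v.strip() for v in value.split(",") if v.strip()]
    if not chosen:
        raise ValueError("no option selected")
    if not multi and len(chosen) != 1:
        raise ValueError(f"expected exactly one option, got {chosen}")
    unknown = [v for v in chosen if v not in options]
    if unknown:
        raise ValueError(f"unknown option(s): {unknown}")
    return chosen if multi else chosen[0]


method = parse_selection("word2vec", FEAT_METHODS)
split = parse_selection("train", TRAINING_SPLITS)
transforms = parse_selection("undirected,dag", TRANSFORMATIONS, multi=True)
```

Keeping the order of a multi selection matters for arguments such as the graph transformations, which are applied sequentially.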

Detection

  • graph_preprocessing
    • save_on_disk: bool (1)
    • node_features: str (2)
    • edge_features: str (3)
    • multi_dataset_training: bool (4)
    • fix_buggy_graph_reindexer: bool (5)
    • global_batching
      • used_method: str (6)
      • global_batching_batch_size: int (7)
      • global_batching_batch_size_inference: int (8)
    • intra_graph_batching
      • used_methods: str (9)
      • edges
        • intra_graph_batch_size: int (10)
      • tgn_last_neighbor
        • tgn_neighbor_size: int (11)
        • tgn_neighbor_n_hop: int (12)
        • fix_buggy_orthrus_TGN: bool (13)
        • fix_tgn_neighbor_loader: bool (14)
        • directed: bool (15)
        • insert_neighbors_before: bool (16)
    • inter_graph_batching
      • used_method: str (17)
      • inter_graph_batch_size: int (18)
  • gnn_training
    • use_seed: bool
    • deterministic: bool (19)
    • num_epochs: int
    • patience: int
    • lr: float
    • weight_decay: float
    • node_hid_dim: int (20)
    • node_out_dim: int (21)
    • grad_accumulation: int (22)
    • inference_device: str (23)
    • used_method: str (24)
    • encoder
      • dropout: float
      • used_methods: str (25)
    • decoder
      • used_methods: str (26)
      • use_few_shot: bool (27)
      • few_shot
        • include_attacks_in_ssl_training: bool
        • freeze_encoder: bool
        • num_epochs_few_shot: int
        • patience_few_shot: int
        • lr_few_shot: float
        • weight_decay_few_shot: float
        • decoder
          • used_methods: str
  • evaluation
    • viz_malicious_nodes: bool (28)
    • ground_truth_version: str (29)
    • best_model_selection: str (30)
    • used_method: str
    • node_evaluation
      • threshold_method: str (31)
      • use_dst_node_loss: bool (32)
      • use_kmeans: bool (33)
      • kmeans_top_K: int (34)
    • tw_evaluation
      • threshold_method: str (35)
    • node_tw_evaluation
      • threshold_method: str (36)
      • use_dst_node_loss: bool
      • use_kmeans: bool
      • kmeans_top_K: int
    • queue_evaluation
      • used_method: str (37)
      • queue_threshold: int
      • kairos_idf_queue
        • include_test_set_in_IDF: bool
      • provnet_lof_queue
        • queue_arg: str
    • edge_evaluation
      • malicious_edge_selection: str (38)
      • threshold_method: str (39)
  1. Whether to store the graphs on disk upon building the graphs. Used to avoid re-computation of very complex batching operations that take time. Can take up to 300GB storage for CADETS_E5.
  2. Node features to use during GNN training. node_type is a one-hot encoded entity type vector, node_emb refers to the embedding generated during the featurization task, only_ones is a vector of ones with the same length as node_type, edges_distribution counts emitted and received edges.

    Available options (multi selection):
    node_type
    node_emb
    only_ones
    edges_distribution
  3. Edge features to use during GNN training. edge_type refers to the system call type, edge_type_triplet treats the same edge type as a new type if the source or destination node types are different, msg is the message vector used in the TGN, time_encoding encodes the temporal order of events with their timestamps in the TGN, none uses no features.

    Available options (multi selection):
    edge_type
    edge_type_triplet
    msg
    time_encoding
    none
  4. Whether the GNN should be trained on all datasets in multi_dataset.
  5. A bug was found in the first version of the framework, where reindexing graphs in shape (N, d) slightly modifies node features. Setting this to true fixes the bug.
  6. Flattens the time window-based graphs into a single large temporal graph and recreates graphs based on the given method. edges creates contiguous graphs of global_batching_batch_size edges, minutes does the same with graphs of global_batching_batch_size minutes, unique_edge_types builds graphs where each pair of connected nodes shares edges with distinct edge types, none uses the default time window-based batching, defined in minutes by the time_window_size arg.

    Available options (one selection):
    edges
    minutes
    unique_edge_types
    none
  7. Controls the value associated with global_batching.used_method (training+inference).
  8. Controls the value associated with global_batching.used_method (inference only).
  9. Breaks each previously computed graph into even smaller graphs. edges creates contiguous graphs of intra_graph_batch_size edges (e.g. a graph with 2000 edges and intra_graph_batch_size=1500 yields two graphs: one with 1500 edges, the other with 500 edges), tgn_last_neighbor computes for each graph its associated graph based on the TGN last neighbor loader, namely a new graph where each node is connected with its last tgn_neighbor_size incoming edges, none does not alter any graph.

    Available options (multi selection):
    edges
    tgn_last_neighbor
    none
  10. Controls the value associated with intra_graph_batching.used_methods (edges).
  11. Number of last neighbors to store for each node.
  12. If greater than one, will also gather the last neighbors of neighbors.
  13. A bug was found in the first version of the framework, where last neighbors not appearing in the input graph had zero node feature vectors. Setting this arg to true includes the features of all nodes in the TGN graph.
  14. We found a minor bug in the original TGN code (https://github.com/pyg-team/pytorch_geometric/issues/10100). This is an unofficial fix.
  15. The original TGN's loader builds graphs in an undirected way. This makes the graphs purely directed.
  16. Whether to insert the edges of the current graph before loading last neighbors.
  17. Batches multiple graphs into a single large one for parallel training. Does not support TGN. graph_batching batches inter_graph_batch_size graphs together, none doesn't batch graphs.

    Available options (one selection):
    graph_batching
    none
  18. Controls the value associated with inter_graph_batching.used_method.
  19. Whether to force PyTorch to use deterministic algorithms.
  20. Number of neurons in the middle layers of the encoder.
  21. Number of neurons in the last layer of the encoder.
  22. Number of epochs to gather gradients before backprop.
  23. Device used during testing.

    Available options (one selection):
    cpu
    cuda
  24. Which training pipeline to use.

    Available options (one selection):
    default
  25. First part of the neural network. Usually GNN encoders to capture complex patterns.

    Available options (multi selection):
    tgn
    graph_attention
    sage
    gat
    gin
    sum_aggregation
    rcaid_gat
    magic_gat
    glstm
    custom_mlp
    none
  26. Second part of the neural network. Usually MLPs specific to the downstream task (e.g. reconstruction or prediction).

    Available options (multi selection):
    predict_edge_type
    predict_node_type
    predict_masked_struct
    detect_edge_few_shot
    predict_edge_contrastive
    reconstruct_node_features
    reconstruct_node_embeddings
    reconstruct_edge_embeddings
    reconstruct_masked_features
  27. Old feature: need some work to update it.
  28. Whether to generate images of malicious nodes' neighborhoods (not stable).

  29. Available options (one selection):
    orthrus
  30. Strategy to select the best model across epochs. best_adp selects the model with the highest ADP score, best_discrimination selects the model that best separates top-score TPs from top-score FPs.

    Available options (one selection):
    best_adp
    best_discrimination
  31. Method to calculate the threshold value used to detect anomalies.

    Available options (one selection):
    max_val_loss
    mean_val_loss
    threatrace
    magic
    flash
    nodlink
  32. Whether to consider the loss of destination nodes when computing the node-level scores (maximum loss of a node).
  33. Whether to cluster nodes after thresholding, as done in Orthrus.
  34. Number of top-score nodes selected before clustering.
  35. Time-window detection. The code is broken and needs work to be updated.

    Available options (one selection):
    max_val_loss
    mean_val_loss
    threatrace
    magic
    flash
    nodlink
  36. Node-level detection where the same node in multiple time windows is considered as multiple unique nodes. More realistic evaluation for near real-time detection. The code is broken and needs work to be updated.

    Available options (one selection):
    max_val_loss
    mean_val_loss
    threatrace
    magic
    flash
    nodlink
  37. Queue-level detection as in Kairos. The code is broken and needs work to be updated.

    Available options (one selection):
    kairos_idf_queue
    provnet_lof_queue
  38. The ground truth only contains node-level labels. This arg controls the strategy used to label edges. src_node and dst_node label an edge as malicious based only on its source node or only on its destination node, respectively. both_nodes labels an edge as malicious if both end nodes are malicious.

    Available options (one selection):
    src_node
    dst_node
    both_nodes

  39. Available options (one selection):
    max_val_loss
    mean_val_loss
    threatrace
    magic
    flash
    nodlink
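
Two of the options above lend themselves to a short sketch: the edges method of intra_graph_batching and the malicious_edge_selection strategies. The helpers below are hypothetical and simplified. Edges are plain `(src, dst)` tuples, and src_node / dst_node are read as "look at the source only" / "look at the destination only", which is an interpretation of the description rather than a confirmed behavior.

```python
def split_by_edges(edges, intra_graph_batch_size):
    """Cut one graph's edge list into contiguous chunks of at most
    intra_graph_batch_size edges; the last chunk may be smaller."""
    return [
        edges[i:i + intra_graph_batch_size]
        for i in range(0, len(edges), intra_graph_batch_size)
    ]


def label_edges(edges, malicious_nodes, malicious_edge_selection):
    """Derive edge labels from node-level ground truth."""
    labels = []
    for src, dst in edges:
        src_bad = src in malicious_nodes
        dst_bad = dst in malicious_nodes
        if malicious_edge_selection == "src_node":
            labels.append(src_bad)
        elif malicious_edge_selection == "dst_node":
            labels.append(dst_bad)
        elif malicious_edge_selection == "both_nodes":
            labels.append(src_bad and dst_bad)
        else:
            raise ValueError(f"unknown strategy: {malicious_edge_selection}")
    return labels


# The 2000-edge example from the description of intra_graph_batching.
graph = [(i, i + 1) for i in range(2000)]
chunks = split_by_edges(graph, intra_graph_batch_size=1500)
chunk_sizes = [len(c) for c in chunks]

edges = [("a", "b"), ("b", "c"), ("c", "a")]
malicious = {"a", "b"}
by_src = label_edges(edges, malicious, "src_node")
by_dst = label_edges(edges, malicious, "dst_node")
by_both = label_edges(edges, malicious, "both_nodes")
```

The stricter both_nodes strategy marks fewer edges as malicious than either single-endpoint strategy, which directly affects edge-level metrics.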

Triage

  • tracing
    • used_method: str (1)
    • depimpact
      • used_method: str (2)
      • score_method: str (3)
      • workers: int
      • visualize: bool
  1. Post-processing step to reconstruct attack paths or reduce false positives. depimpact is used in Orthrus.

    Available options (one selection):
    depimpact

  2. Available options (one selection):
    component
    shortest_path
    1-hop
    2-hop
    3-hop

  3. Available options (one selection):
    degree
    recon_loss
    degree_recon
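
The 1-hop, 2-hop and 3-hop options are listed without a description. Assuming they refer to the k-hop neighborhood around flagged nodes (an assumption, not something this page states), the idea can be sketched with a breadth-first search. The function name is hypothetical, and edge direction is ignored here for simplicity.

```python
from collections import defaultdict, deque


def k_hop_neighborhood(edges, seeds, k):
    """Return all nodes within k hops of any seed node.

    edges is an iterable of (src, dst) pairs, treated as undirected.
    """
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)

    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand past the k-th hop
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
one_hop = k_hop_neighborhood(edges, {"a"}, 1)
two_hop = k_hop_neighborhood(edges, {"a"}, 2)
three_hop = k_hop_neighborhood(edges, {"a"}, 3)
```

Nodes in a different connected component, such as `x` and `y` here, are never reached regardless of k.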