Featurization

Featurization methods transform textual attributes of entities (e.g. file paths, process command lines, socket IP addresses and ports) into numerical vectors. Some methods, like word2vec and doc2vec, learn these vectors from the text corpus, while others, like hierarchical_hashing, compute them deterministically.
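To make the learned-versus-deterministic distinction concrete, here is a minimal, illustrative sketch of a deterministic featurizer built on the hashing trick. It is not the actual hierarchical_hashing implementation; the tokenizer, the hash function, and the 16-dimensional output are all assumptions chosen for the example:

```python
import hashlib
import re

def path_tokens(path):
    # Split a textual attribute such as a file path into components,
    # e.g. "/usr/bin/python3" -> ["usr", "bin", "python3"].
    return [t for t in re.split(r"[/\\.:]", path) if t]

def hashed_features(tokens, dim=16):
    # Deterministic bag-of-tokens vector via the hashing trick:
    # each token hashes to a fixed bucket, so the same input always
    # yields the same vector and no training corpus is required.
    vec = [0.0] * dim
    for tok in tokens:
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

vec = hashed_features(path_tokens("/usr/bin/python3"))
```

A learned method like word2vec would instead fit an embedding matrix over the whole corpus of such token sequences, so its vectors change with the training data, whereas the hashed vector above is fixed by the input alone.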

Other methods, like only_type and only_ones, skip the embedding step entirely and assign each entity either a one-hot encoding of its type or a vector of ones. These methods therefore take no arguments. In all methods, the resulting vectors are used as node features during the gnn_training task.
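A sketch of what these two trivial featurizers might compute, assuming a hypothetical three-way entity type set (the actual type vocabulary and vector dimension depend on the dataset):

```python
NODE_TYPES = ["process", "file", "socket"]  # assumed type set for illustration

def only_type_features(node_type, types=NODE_TYPES):
    # One-hot encoding of the entity's type; ignores all textual attributes.
    return [1.0 if t == node_type else 0.0 for t in types]

def only_ones_features(dim=3):
    # Constant all-ones vector; carries no per-entity information at all,
    # so the GNN must rely purely on graph structure.
    return [1.0] * dim
```

only_ones is effectively an ablation baseline: if it performs close to a learned embedding, the node text contributes little beyond the graph structure.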

  • word2vec
    • alpha: float
    • window_size: int
    • min_count: int
    • use_skip_gram: bool
    • num_workers: int
    • epochs: int
    • compute_loss: bool
    • negative: int
    • decline_rate: int
  • doc2vec
    • include_neighbors: bool
    • epochs: int
    • alpha: float
  • fasttext
    • min_count: int
    • alpha: float
    • window_size: int
    • negative: int
    • num_workers: int
    • use_pretrained_fb_model: bool
  • alacarte
    • walk_length: int
    • num_walks: int
    • epochs: int
    • context_window_size: int
    • min_count: int
    • use_skip_gram: bool
    • num_workers: int
    • compute_loss: bool
    • add_paths: bool
  • temporal_rw
    • walk_length: int
    • num_walks: int
    • trw_workers: int
    • time_weight: str
    • half_life: int
    • window_size: int
    • min_count: int
    • use_skip_gram: bool
    • wv_workers: int
    • epochs: int
    • compute_loss: bool
    • negative: int
    • decline_rate: int
  • flash
    • min_count: int
    • workers: int
  • hierarchical_hashing
  • magic
  • only_type
  • only_ones
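Walk-based methods such as temporal_rw first generate node sequences and then feed them to a word2vec-style model, which is why their parameter lists combine walk settings (walk_length, num_walks) with embedding settings (window_size, negative, epochs). A minimal sketch of the sequence-generation step, assuming a time-respecting walk over timestamped edges (the graph format and function names here are illustrative, not the tool's API):

```python
import random

def temporal_random_walk(edges, start, walk_length, rng):
    """One temporal walk: each hop must use an edge whose timestamp is
    no earlier than the previous hop's, so walks follow causal order.
    `edges` maps a node to a list of (neighbor, timestamp) pairs."""
    walk = [start]
    current, last_ts = start, float("-inf")
    for _ in range(walk_length - 1):
        candidates = [(n, t) for n, t in edges.get(current, []) if t >= last_ts]
        if not candidates:
            break  # dead end: no time-respecting edge to follow
        current, last_ts = rng.choice(candidates)
        walk.append(current)
    return walk

# Toy provenance-style graph with event timestamps.
edges = {
    "proc:bash": [("file:/etc/passwd", 1), ("proc:curl", 2)],
    "proc:curl": [("socket:10.0.0.1:443", 3)],
}
rng = random.Random(0)
walks = [temporal_random_walk(edges, "proc:bash", 4, rng) for _ in range(3)]
```

Each walk is then treated as a "sentence" of node identifiers for skip-gram training; time_weight and half_life would bias the candidate choice toward recent edges rather than picking uniformly as above.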