Datasets

PIDSMaker supports several public datasets commonly used in APT detection research. This page describes each dataset and its attack scenarios.

Overview

| Dataset | OS | Attacks | Size (GB) |
|---|---|---|---|
| CADETS_E3 | FreeBSD | 3 | 10 |
| THEIA_E3 | Linux | 2 | 12 |
| CLEARSCOPE_E3 | Android | 1 | 4.8 |
| FIVEDIRECTIONS_E3 | Linux | 2 | 22 |
| TRACE_E3 | Linux | 3 | 100 |
| CADETS_E5 | FreeBSD | 2 | 276 |
| THEIA_E5 | Linux | 1 | 36 |
| CLEARSCOPE_E5 | Android | 2 | 49 |
| FIVEDIRECTIONS_E5 | Linux | 4 | 280 |
| TRACE_E5 | Linux | 1 | 710 |
| optc_h201 | Windows | 1 | 9 |
| optc_h501 | Windows | 1 | 6.7 |
| optc_h051 | Windows | 1 | 7.7 |
| ATLASV2_EDR | Windows | 10 | 1 |
| CARBANAKV2_EDR | Windows + Linux | 1 | 6.6 |
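Because the datasets range from about 1 GB to 710 GB, it can be convenient to start with the small ones. The following is a minimal sketch, not part of PIDSMaker: the `DATASETS` list and the `commands_under` helper are hypothetical names, the rows are copied from the table above, and `SYSTEM` is the same placeholder used in the run commands on this page.

```python
# (name, OS, number of attacks, size in GB), copied from the overview table.
DATASETS = [
    ("CADETS_E3", "FreeBSD", 3, 10),
    ("THEIA_E3", "Linux", 2, 12),
    ("CLEARSCOPE_E3", "Android", 1, 4.8),
    ("FIVEDIRECTIONS_E3", "Linux", 2, 22),
    ("TRACE_E3", "Linux", 3, 100),
    ("CADETS_E5", "FreeBSD", 2, 276),
    ("THEIA_E5", "Linux", 1, 36),
    ("CLEARSCOPE_E5", "Android", 2, 49),
    ("FIVEDIRECTIONS_E5", "Linux", 4, 280),
    ("TRACE_E5", "Linux", 1, 710),
    ("optc_h201", "Windows", 1, 9),
    ("optc_h501", "Windows", 1, 6.7),
    ("optc_h051", "Windows", 1, 7.7),
    ("ATLASV2_EDR", "Windows", 10, 1),
    ("CARBANAKV2_EDR", "Windows + Linux", 1, 6.6),
]


def commands_under(max_gb: float, system: str = "SYSTEM") -> list[str]:
    """Build the run command for every dataset no larger than max_gb.

    `system` stays a placeholder here; replace it with the system to run.
    """
    return [
        f"python pidsmaker/main.py {system} {name}"
        for name, _os, _attacks, size_gb in DATASETS
        if size_gb <= max_gb
    ]


if __name__ == "__main__":
    for cmd in commands_under(5):
        print(cmd)
```

The helper only builds command strings; it does not run anything, so the size budget can be adjusted freely before launching an experiment.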

DARPA TC

The DARPA Transparent Computing program produced benchmark datasets for evaluating provenance-based security systems.

Engagement 3 (E3) - April 2018

CADETS_E3

FreeBSD host with Nginx server exploitation.

| Attack ID | Duration | Description |
|---|---|---|
| 0 | 49 min | Nginx exploited to deploy Drakon loader with root escalation. Netrecon executed after C2 connection, followed by failed libdrakon injection into sshd. Host crashed with kernel panic. |
| 1 | 40 min | Nginx re-exploited to deploy Drakon and MicroAPT implants under random names (tmux, minions, sendmail). Privilege escalation failed; MicroAPT ran unprivileged for port scanning. |
| 2 | 13 min | Nginx re-exploited to deploy new Drakon implant with root privileges. Multiple failed sshd injection attempts using renamed libdrakon copies. |

python pidsmaker/main.py SYSTEM CADETS_E3

THEIA_E3

Ubuntu host with Firefox exploitation.

| Attack ID | Duration | Description |
|---|---|---|
| 0 | 50 min | Malicious Firefox extension dropped Drakon implant. MicroAPT staged under /var/log/mail, connected to C2 for control and network scanning. |
| 1 | 30 min | Firefox exploited to drop Drakon implant as /home/admin/clean with root privileges, then copied as profile. Both connected to C2 server. |

python pidsmaker/main.py SYSTEM THEIA_E3

CLEARSCOPE_E3

Android device with Firefox exploitation.

| Attack ID | Duration | Description |
|---|---|---|
| 0 | 54 min | Firefox exploited via malicious website. Drakon implant installed and elevated, but module loading failed. Persistent C2 connection maintained. |

python pidsmaker/main.py SYSTEM CLEARSCOPE_E3

Engagement 5 (E5) - May 2019

THEIA_E5

Ubuntu host with Firefox exploitation.

| Attack ID | Duration | Description |
|---|---|---|
| 0 | 19 min | Firefox exploited via malicious website. Root gained with BinFmt-Elevate, Drakon shellcode injected into sshd, persistence file created, C2 access maintained. |

python pidsmaker/main.py SYSTEM THEIA_E5

CLEARSCOPE_E5

Android device with APK-based attacks.

| Attack ID | Duration | Description |
|---|---|---|
| 0 | 41 min | Malicious appstarter APK loaded MicroAPT. Elevate driver installed for privilege escalation. Sensitive databases exfiltrated (calllog, calendar, SMS) and screenshot captured. |
| 1 | 8 min | MicroAPT deployed directly via adb shell after APK dropper failed. Privilege escalation via BinFmt Elevate driver, then file exfiltration. |

python pidsmaker/main.py SYSTEM CLEARSCOPE_E5

DARPA OpTC

Windows enterprise environment with realistic APT scenarios.

optc_h201

| Attack ID | Duration | Description |
|---|---|---|
| 0 | 1 h 58 min | PowerShell Empire stager executed with elevated access. Mimikatz used for credential theft, registry persistence set, recon performed, then pivoted to other hosts via WMI. |

python pidsmaker/main.py SYSTEM optc_h201

optc_h501

| Attack ID | Duration | Description |
|---|---|---|
| 0 | 5 h 01 min | Phishing email launched PowerShell Empire stager. Escalated via DeathStar, WMI persistence established, RDP tunneling and file exfiltration performed, then pivoted to other hosts. |

python pidsmaker/main.py SYSTEM optc_h501

optc_h051

| Attack ID | Duration | Description |
|---|---|---|
| 0 | 3 h 56 min | Malicious Notepad++ update installed Meterpreter. Escalated to SYSTEM, migrated into LSASS for Mimikatz credential theft, established persistence, timestomped files, added admin account for RDP. |

python pidsmaker/main.py SYSTEM optc_h051

Note

Descriptions of the CADETS_E5, FIVEDIRECTIONS, and TRACE datasets have not been added yet.

ATLASv2

Two Windows 7 hosts (h1, h2) attacked from Kali Linux. The dataset covers four benign days followed by one attack day on which all ten scenarios are executed. The provided data is Carbon Black Cloud EDR telemetry; the original dataset also includes Sysmon, Windows Security Auditing, Firefox, and DNS logs.

ATLASV2_EDR

| Attack ID | Duration | Description |
|---|---|---|
| s1 | 40 min | CVE-2015-5122 Adobe Flash exploit via phishing email on h1. Meterpreter shell obtained, payload.exe dropped, PDFs exfiltrated over HTTPS. |
| s2 | 35 min | CVE-2015-3105 Adobe Flash exploit via phishing email on h1. Meterpreter shell, payload drop, and HTTPS PDF exfiltration. |
| s3 | 45 min | CVE-2017-11882 Microsoft Word memory corruption exploit via malicious attachment on h1. Meterpreter obtained, PDFs exfiltrated. |
| s4 | 44 min | CVE-2017-0199 Microsoft Word OLE2 link exploit via malicious document on h1. Meterpreter shell, payload drop, HTTPS exfiltration. |
| m1 | 1 h 50 min | CVE-2015-5122 Adobe Flash exploit on h1 via phishing. h1's SimpleHTTP server poisoned to deliver payload to h2. PDFs exfiltrated from both hosts. |
| m2 | 30 min | CVE-2015-5119 Adobe Flash exploit on h1 with lateral movement to h2 via SimpleHTTP server poisoning. PDFs exfiltrated from both hosts. |
| m3 | 34 min | CVE-2015-3105 Adobe Flash exploit on h1 with lateral movement to h2 via poisoned SimpleHTTP server. PDFs exfiltrated. |
| m4 | 33 min | CVE-2018-8174 Microsoft Word VBScript engine exploit on h1 with lateral movement to h2 via HTTP server. PDFs exfiltrated from both hosts. |
| m5 | 30 min | CVE-2017-0199 Microsoft Word OLE2 link exploit on h1 with lateral movement to h2. PDFs exfiltrated from both hosts. |
| m6 | 33 min | CVE-2017-11882 Microsoft Word memory corruption exploit on h1 with lateral movement to h2 via poisoned HTTP server. PDFs exfiltrated. |

python pidsmaker/main.py SYSTEM ATLASV2_EDR

CARBANAKv2

Multi-host testbed comprising four Windows 10 workstations (h1–h4), a CentOS 7 Linux fileserver (fs), and a Windows 10 Server AD domain controller (dc). Benign activity was generated by four graduate students using the machines as their primary workstations. The provided data is Carbon Black Cloud EDR telemetry.

CARBANAKV2_EDR

| Attack ID | Duration | Description |
|---|---|---|
| 0 | 10 days | MITRE Carbanak APT emulation. Attacker compromises h1 (Apr 30) via WScript-based Carbanak implant (TransBaseOdbcDriver.js). Pivots to Linux fileserver fs and then Windows AD domain controller dc via lateral movement (May 2). Subsequently spreads to h2, h3, and h4 (May 7–10), establishing persistence across all hosts. |

python pidsmaker/main.py SYSTEM CARBANAKV2_EDR

Data structure

Graph partitioning

Each dataset is partitioned into daily graphs, which are split into three sets:

  • Train graphs: Normal activity for model training
  • Validation graphs: Normal activity for threshold calibration
  • Test graphs: Normal activity mixed with attacks, used for evaluation
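The partitioning above can be pictured as a lookup from a day to a split. The sketch below is illustrative only and is not PIDSMaker's loader: `SPLITS`, `split_of`, and `daily_range` are hypothetical names, and the dates are borrowed from the example configuration later on this page.

```python
from datetime import date, timedelta

# Hypothetical split definition; a shortened version of the example config dates.
SPLITS = {
    "train": ["2018-04-02", "2018-04-03"],
    "val": ["2018-04-10"],
    "test": ["2018-04-06", "2018-04-11"],
}


def split_of(day: str, splits: dict[str, list[str]]) -> str:
    """Return the split that lists `day`, or "unused" if no split does."""
    for name, days in splits.items():
        if day in days:
            return name
    return "unused"


def daily_range(start: str, end: str) -> list[str]:
    """List ISO dates from start (inclusive) to end (exclusive)."""
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    return [(first + timedelta(days=i)).isoformat() for i in range((last - first).days)]


# One daily graph per day; each day maps to exactly one split.
assignment = {day: split_of(day, SPLITS) for day in daily_range("2018-04-02", "2018-04-07")}

if __name__ == "__main__":
    for day, split in assignment.items():
        print(day, split)
```

Days that appear in no list fall through to "unused", which mirrors the `unused_dates` key of the dataset configuration.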

Adding custom datasets

To add a new dataset, add an entry for it to DATASET_DEFAULT_CONFIG in pidsmaker/config/config.py:

DATASET_DEFAULT_CONFIG = {
    "MY_DATASET": {
        "raw_dir": "",
        "database": "my_database_name",
        "database_all_file": "my_database_name",
        "num_node_types": 3,  # Number of node types in the dataset format (see pidsmaker/utils/dataset_utils.py)
        "num_edge_types": 10,  # Number of edge types in the dataset format (see pidsmaker/utils/dataset_utils.py)
        "start_date": "2018-04-02", # Start date (Inclusive)
        "end_date": "2018-04-14", # End date (Exclusive)
        "train_dates": [
            # Dates/graphs used for training (i.e., benign activity)
            "2018-04-02",
            "2018-04-03",
            "2018-04-04",
            "2018-04-05",
            "2018-04-07",
            "2018-04-08",
            "2018-04-09",
        ],
        "val_dates": [
            # Dates/graphs used for validation/threshold calibration (i.e., benign activity)
            "2018-04-10"
        ],
        "test_dates": [
            # Dates/graphs used for testing (i.e., contains both benign and attack activity)
            "2018-04-06",
            "2018-04-11",
            "2018-04-12",
            "2018-04-13"
        ],
        "unused_dates": [
            # Any unused dates/graphs that should be ignored
            "2018-04-14"
        ],
        "ground_truth_relative_path": ["MY_DATASET/labels.csv"],
        "attack_to_time_window": [
            ["MY_DATASET/labels.csv", "2018-04-11 10:00:00", "2018-04-12 12:00:00"],
        ],
    },
}

Then follow the database creation guide to load your data.
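Before loading the data, it may be worth sanity-checking a new entry. The sketch below is a hypothetical helper (`check_dataset_config` is not a PIDSMaker function) and the rules it checks are assumptions drawn from the comments in the example: the train, validation, and test lists should not overlap, their days should fall in the range from `start_date` (inclusive) to `end_date` (exclusive), and each attack window should start before it ends and begin on a test day. Whether PIDSMaker enforces any of this itself is not stated on this page.

```python
from datetime import datetime

# Trimmed copy of the example entry above; only the keys checked here are kept.
cfg = {
    "start_date": "2018-04-02",
    "end_date": "2018-04-14",
    "train_dates": ["2018-04-02", "2018-04-03", "2018-04-04", "2018-04-05",
                    "2018-04-07", "2018-04-08", "2018-04-09"],
    "val_dates": ["2018-04-10"],
    "test_dates": ["2018-04-06", "2018-04-11", "2018-04-12", "2018-04-13"],
    "attack_to_time_window": [
        ["MY_DATASET/labels.csv", "2018-04-11 10:00:00", "2018-04-12 12:00:00"],
    ],
}


def check_dataset_config(cfg: dict) -> list[str]:
    """Hypothetical helper: return readable problems, or an empty list if none."""
    problems = []
    splits = {k: set(cfg[k]) for k in ("train_dates", "val_dates", "test_dates")}
    names = list(splits)
    # Assumed rule: no day belongs to two splits.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = splits[a] & splits[b]
            if shared:
                problems.append(f"{a} and {b} share {sorted(shared)}")
    # ISO dates compare correctly as strings; end_date is exclusive.
    for name, days in splits.items():
        for day in sorted(days):
            if not (cfg["start_date"] <= day < cfg["end_date"]):
                problems.append(f"{name}: {day} outside [start_date, end_date)")
    # Assumed rule: a window starts before it ends and begins on a test day.
    for label_file, start, end in cfg["attack_to_time_window"]:
        t0 = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
        t1 = datetime.strptime(end, "%Y-%m-%d %H:%M:%S")
        if t0 >= t1:
            problems.append(f"{label_file}: window start is not before end")
        if t0.date().isoformat() not in splits["test_dates"]:
            problems.append(f"{label_file}: window starts outside test_dates")
    return problems


problems = check_dataset_config(cfg)

if __name__ == "__main__":
    print(problems or "no problems found")
```

Returning a list of messages rather than raising on the first failure makes it easier to fix several mistakes in one pass.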