10-min Docker Install with DARPA TC/OpTC Datasets

Download datasets

DARPA TC and OpTC are very large datasets that are challenging to process. We provide pre-processed versions of these datasets: the data is stored in a postgres database, and we provide the database dumps for download.

The sizes of each database dump are as follows: compressed is the raw size of the dump file, uncompressed is the size it occupies once loaded into postgres.

Dataset         Compressed (GB)   Uncompressed (GB)
CLEARSCOPE_E3   0.6               4.8
CADETS_E3       1.4               10.1
THEIA_E3        1.1               12.0
CLEARSCOPE_E5   6.2               49.0
CADETS_E5       36.0              276.0
THEIA_E5        5.8               36.0
OPTC_H051       1.7               7.7
OPTC_H501       1.5               6.7
OPTC_H201       2.0               9.1
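
Before downloading, it's worth checking that the target disk has enough free space for both the dump files and, later, the uncompressed databases. A quick check with standard coreutils, run from the directory where the data folder will live:

# Free space on the filesystem that will hold the downloads
df -h .

# After downloading, compare what you got against the sizes in the table above
du -h --max-depth=1 data/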

Steps:

  1. First, download the archive(s) into a new data folder. Small datasets are grouped into archives containing multiple datasets, while larger datasets are provided as a single dump. From the CLI, you can use curl with a Google OAuth access token, obtained as follows:

    • Go to OAuth 2.0 Playground https://developers.google.com/oauthplayground/
    • In the Select the Scope box, paste https://www.googleapis.com/auth/drive.readonly
    • Click Authorize APIs and then Exchange authorization code for tokens
    • Copy the Access token
    • Run the following commands in a terminal

    Note: Each call to curl may download only part of a file. Re-run the same command (curl resumes with -C -) until each archive reaches 100%; a simple retry loop is sketched after these steps.

    mkdir data && cd data
    
    # optc_and_cadets_theia_clearscope_e3.tar
    curl -H "Authorization: Bearer ACCESS_TOKEN" -C - https://www.googleapis.com/drive/v3/files/11YVPAuWfeEqC_zV8KD0gNrnEPbHf2Y4M?alt=media -o optc_and_cadets_theia_clearscope_e3.tar
    
    # theia_clearscope_e5.tar
    curl -H "Authorization: Bearer ACCESS_TOKEN" -C - https://www.googleapis.com/drive/v3/files/1DfolzEa3PVz_6fGZUNEUm0sBP42LB7_1?alt=media -o theia_clearscope_e5.tar
    
    # cadets_e5.dump
    curl -H "Authorization: Bearer ACCESS_TOKEN" -C - https://www.googleapis.com/drive/v3/files/1Xiq7w0Ofz4jZG2PVFuNqi_i0fm28kRcT?alt=media -o cadets_e5.dump
    
  2. Then extract the archives (this won't increase disk usage):

    tar -xvf optc_and_cadets_theia_clearscope_e3.tar
    tar -xvf theia_clearscope_e5.tar
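
Since the downloads often stop before completion, the resumable curl commands above may need to be repeated. A minimal retry loop, shown as a sketch for the first archive (reuse it with the other file IDs and output names; ACCESS_TOKEN is the token from the OAuth playground):

# Sketch: resume the download until curl exits successfully, with a bounded number of attempts
for i in $(seq 1 20); do
  curl -H "Authorization: Bearer ACCESS_TOKEN" -C - \
    "https://www.googleapis.com/drive/v3/files/11YVPAuWfeEqC_zV8KD0gNrnEPbHf2Y4M?alt=media" \
    -o optc_and_cadets_theia_clearscope_e3.tar && break
  echo "Transfer interrupted, resuming (attempt $i)..."
  sleep 5
done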
    

Alternatively, here are the guidelines to manually create the databases from the official DARPA TC files.

Docker Install

  1. If not already installed, install Docker by following the steps on the official site, including the post-installation steps so that Docker can be run without sudo.

  2. Then, install the NVIDIA Container Toolkit to enable CUDA support with Docker:

# Add the NVIDIA package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Update and install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure the NVIDIA runtime for Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
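
To confirm that Docker containers can now access the GPU, you can run nvidia-smi inside a throwaway CUDA container. The image tag below is only an example; pick any CUDA base image compatible with your driver:

# Sketch: the GPU should be listed by nvidia-smi when run inside a container
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi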

Load databases

We create two containers: one runs the postgres database, the other runs the Python environment and the pipeline.

  1. Set your paths in .env

    cd ..
    cp .env.local .env
    

    • In .env, set INPUT_DIR to the data folder path. Optionally, set ARTIFACTS_DIR to the folder where all generated files will go (multiple GBs).

    • Run the following command and set HOST_UID, HOST_GID, and USER_NAME accordingly in .env.

      echo "HOST_UID=$(id -u)" && echo "HOST_GID=$(id -g)" && echo "USER_NAME=$(whoami)"
      

    • Then run:

      source .env
      

    • Then create the output artifacts folder if it doesn't exist yet and ensure it is owned by your user.

      mkdir ${ARTIFACTS_DIR} || chown ${USER_NAME} -R ${ARTIFACTS_DIR}
      

  2. Build and start the database container:

    docker compose -p postgres -f compose-postgres.yml up -d --build
    
    Note: each time you modify variables in .env, re-run source .env before running docker compose.

  3. In a terminal, get a shell into this container:

    docker compose -p postgres exec postgres bash
    

  4. If you have enough space to uncompress all datasets locally (135 GB), run this script to load all databases:
    ./scripts/load_dumps.sh
    
    If you have limited space and want to load databases one by one, do:
    pg_restore -U postgres -h localhost -p 5432 -d DATASET /data/DATASET.dump
    
  5. Once the databases are loaded, you won't need to touch this container anymore (a quick sanity check you can run before exiting is sketched after these steps):
    exit
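
If you want to confirm the restore before exiting, you can list the databases and their sizes from inside the postgres container (plain psql, reusing the connection flags from the pg_restore command above):

# List databases with their on-disk sizes to confirm the dumps were restored
psql -U postgres -h localhost -p 5432 -c "\l+"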
    

Get into the PIDSMaker container

It is within the pids container that coding and experiments take place.

  1. For VS Code users, we recommend the Dev Containers extension to open VS Code directly inside the container. To do so, install the extension, press Ctrl+Shift+P, and run Dev Containers: Open Folder in Container.

  2. Alternatively, start the container manually and open a shell in your terminal:

    docker compose -f compose-pidsmaker.yml up -d --build
    docker compose exec pids bash
    

The Python environment is installed in this container, and this is where the framework is run.
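
Once inside the pids container, a quick way to check that the GPU is visible to the Python environment is the snippet below. It assumes the environment ships with PyTorch; if the framework's stack differs, adapt the check accordingly.

# Sketch: confirm CUDA is reachable from Python (assumes PyTorch is installed in the container)
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"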

Weights & Biases interface

W&B is the default interface to visualize and keep a history of experiments, and we highly encourage using it. Create an account if you don't already have one, then log into your account from the CLI by pasting your API key, obtained via your W&B dashboard:

wandb login
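
For non-interactive setups (e.g., scripts or CI), the key can also be supplied directly instead of pasting it at the prompt; YOUR_API_KEY below is a placeholder:

# Pass the key on the command line...
wandb login YOUR_API_KEY

# ...or export it so wandb picks it up automatically
export WANDB_API_KEY=YOUR_API_KEY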

Then you can push the logs and results of your experiments to the interface by using the --wandb arg or when calling ./run.sh.