Full Reproducibility
scripts/run_fresh.sh bootstraps a uv virtualenv, then simulates
new data, preprocesses it, trains all models from scratch, and generates
all figures in a single isolated directory. It installs uv automatically
if it is missing.
Quick start (fresh run)
Run everything (simulations → preprocessing → training → figures) in a single isolated directory:
./scripts/run_fresh.sh
By default all outputs go to /sietch_colab/data_share/cxt_scratch/.
Override with the BASE_DIR environment variable:
BASE_DIR=/scratch/myuser/cxt_run ./scripts/run_fresh.sh
The script uses GPUs 0 and 1 and 80 CPU workers by default; both are
configured at the top of the script. A uv virtualenv is created at
BASE_DIR/.venv and reused on subsequent runs. To recreate it, delete the
directory and rerun (the path below assumes the default BASE_DIR):
rm -rf /sietch_colab/data_share/cxt_scratch/.venv
./scripts/run_fresh.sh
Run individual stages:
./scripts/run_fresh.sh simulate # only simulations
./scripts/run_fresh.sh preprocess # only preprocessing
./scripts/run_fresh.sh train # only training
./scripts/run_fresh.sh figures # only figures
./scripts/run_fresh.sh train figures # multiple stages
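When the script is driven from Python (for example from a scheduler wrapper), the invocations above could be assembled as in the sketch below. The helper name `build_run_fresh_command` is hypothetical; only the script path, the `BASE_DIR` variable, and the four stage names come from this page.

```python
import os

# Stage names accepted by run_fresh.sh, as listed above.
STAGES = ("simulate", "preprocess", "train", "figures")


def build_run_fresh_command(stages=(), base_dir=None, environ=None):
    """Return (argv, env) for one run_fresh.sh invocation (hypothetical helper)."""
    unknown = [stage for stage in stages if stage not in STAGES]
    if unknown:
        raise ValueError(f"unknown stage(s): {unknown}")
    # Copy the environment so the caller's mapping is never modified.
    env = dict(os.environ if environ is None else environ)
    if base_dir is not None:
        # Same effect as prefixing the shell command with BASE_DIR=...
        env["BASE_DIR"] = str(base_dir)
    return ["./scripts/run_fresh.sh", *stages], env
```

Passing the result to `subprocess.run(argv, env=env, check=True)` from the repository root would then start the run; an empty stage list corresponds to running everything.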
Pipeline overview
┌──────────────────────────────────────────────────────────────────────┐
│ │
│ STAGE 1: SIMULATE python cxt/simulation_ts_only.py │
│ ───────────────── │
│ 35+ scenarios (constant, sawtooth, island, LLM sweeps, │
│ 10 stdpopsim mammals, 15 stdpopsim other species) │
│ → .trees files in DATA_DIR/ │
│ │
│ STAGE 2: PREPROCESS python -m cxt.preprocess │
│ ────────────────── │
│ 6 preprocessed datasets: │
│ processed_narrow (w2000, n50, 200 pairs, constant only) │
│ processed (w2000, n50, 200 pairs) │
│ processed_n10 (w2000, n10, 20 pairs) │
│ processed_small_window (w200, n50, 200 pairs) │
│ processed_small_window_missing_data (w200, n50, bitmask) │
│ processed_small_window_missing_data_n10 (w200, n10, bitmask) │
│ → X.npy, y.npy, pairs.npy per simulation │
│ │
│ STAGE 3: TRAIN python -m cxt.train │
│ ────────────── │
│ 6 checkpoints in dependency order: │
│ narrow ← processed_narrow (constant only) │
│ broad ← processed │
│ broad_w200 ← processed_small_window + broad ckpt │
│ broad+adapter ← processed_n10 + broad ckpt │
│ w200_wmissing ← processed_sw_missing + broad_w200 ckpt │
│ w200_wmissing_adapter ← processed_sw_missing_n10 │
│ + broad+adapter (--resume-adapter) │
│ │
│ STAGE 4: FIGURES python -m figures.main.* │
│ ──────────────── │
│ 8 main figures (Fig 1-8) + 6 supplementary (S4, S5, S6, S9-S11) │
│ → figures/output/main/ and figures/output/supplementary/ │
│ │
└──────────────────────────────────────────────────────────────────────┘
Configuration
run_fresh.sh places everything under a single BASE_DIR:
BASE_DIR/
├── data/ # simulated .trees + preprocessed datasets
├── lightning_logs/ # PyTorch Lightning training logs
├── checkpoints/ # installed model checkpoints (for cxt.load_model)
└── figures/output/ # generated figure PNGs
├── main/
└── supplementary/
Hardware and path settings are at the top of the script:
| Variable | Description | Default |
|---|---|---|
| BASE_DIR | Root for all outputs | /sietch_colab/data_share/cxt_scratch/ |
| | GPU indices for training and figures | |
| | Parallel workers for simulation | |
| | Parallel workers for preprocessing | |
| | DataLoader workers for training | |
The script sets CXT_CHECKPOINT_CACHE=$BASE_DIR/checkpoints so that
cxt.load_model() uses the freshly trained checkpoints instead of the
global cache at ~/.cache/cxt/checkpoints/. It also sets
CUDA_VISIBLE_DEVICES so that figure scripts see the correct GPUs.
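The checkpoint override can be pictured as an environment lookup with a fallback. This is only a sketch of the behaviour described above, not the actual code inside cxt; the function name `resolve_checkpoint_cache` is hypothetical.

```python
import os
from pathlib import Path

# Global cache location named above.
DEFAULT_CACHE = Path("~/.cache/cxt/checkpoints")


def resolve_checkpoint_cache(environ=None):
    """Return the checkpoint directory, preferring CXT_CHECKPOINT_CACHE (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    override = environ.get("CXT_CHECKPOINT_CACHE")
    if override:
        # run_fresh.sh points this at $BASE_DIR/checkpoints.
        return Path(override)
    return DEFAULT_CACHE.expanduser()
```

Under this reading, a figure script started outside run_fresh.sh would fall back to the global cache unless the variable is exported by hand.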
Figure-specific external data (also configurable via env):
| Variable | Default |
|---|---|
| | |
| | |
| | |
Example: custom paths
BASE_DIR=/scratch/myuser/cxt_run ./scripts/run_fresh.sh
Simulation details
Stage 1 runs 35+ simulation batches across four categories (using
python cxt/simulation_ts_only.py):
Base dataset (3 scenarios):
| Scenario | Samples | Description |
|---|---|---|
| | 10,000 | Constant \(N_e = 20{,}000\) |
| | 1,000 | Oscillating \(N_e\) (Schiffels & Durbin zigzag) |
| | 1,000 | 3-population island model with migration |
LLM-style sweeps (4 scenarios):
| Scenario | Samples | Description |
|---|---|---|
| | 125 | 3 magnitudes × 3 \(N_e\) × 2 \(\mu\) × 2 \(r\) |
| | 50 | 3 \(N_e\) × 2 \(\mu\) × 2 \(r\) × 3 sel. coefficients |
| | 50 | 2 migration × 3 \(N_e\) × 2 \(\mu\) × 2 \(r\) |
| | 500 | 3 \(N_e\) × 2 \(\mu\) × 2 \(r\) |
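Reading each Description entry as a full factorial grid, the number of parameter combinations in a sweep is the product of its factor counts. A small sketch with `itertools.product` follows; the level labels are placeholders, since the actual parameter values are set in the simulation script and are not listed on this page.

```python
from itertools import product


def grid(**factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


# Placeholder levels: only the *number* of levels per factor comes from the table.
sweep = grid(
    magnitude=["m1", "m2", "m3"],
    Ne=["Ne1", "Ne2", "Ne3"],
    mu=["mu1", "mu2"],
    r=["r1", "r2"],
)
print(len(sweep))  # 3 * 3 * 2 * 2 = 36 combinations
```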
stdpopsim mammals (10 scenarios, 1,000 samples each):
homsap, homsap_map, bostau, canfam, canfam_map,
pantro, papanu, papanu_map, ponabe, ponabe_map
stdpopsim other species (15 scenarios, 5–1,000 samples):
aedaeg, anapla, anocar, anogam, apimel, aratha,
aratha_map, caeele, caeele_map, dromel, drosec,
gasacu, helann, helmel, musmus
Preprocessing details
Stage 2 creates six preprocessed datasets. See Preprocessing for the full schema. The datasets differ in window size, sample count, and whether a missingness bitmask is encoded:
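| Dataset | Window | Pairs | Samples | Bitmask |
|---|---|---|---|---|
| processed_narrow | 2,000 bp | 200 | 50 | No (constant scenario only) |
| processed | 2,000 bp | 200 | 50 | No |
| processed_n10 | 2,000 bp | 20 | 10 | No |
| processed_small_window | 200 bp | 200 | 50 | No |
| processed_small_window_missing_data | 200 bp | 200 | 50 | Yes |
| processed_small_window_missing_data_n10 | 200 bp | 20 | 10 | Yes |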
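A quick sanity check on a preprocessed dataset is to confirm that every simulation has its three arrays (X.npy, y.npy, pairs.npy). The sketch below assumes one subdirectory per simulation inside the dataset directory; that layout and the helper name `find_incomplete` are assumptions, not something this page specifies.

```python
from pathlib import Path

# File names written per simulation by Stage 2.
EXPECTED = ("X.npy", "y.npy", "pairs.npy")


def find_incomplete(dataset_dir):
    """Map each simulation subdirectory to the expected files it lacks (hypothetical helper)."""
    missing = {}
    for sim_dir in sorted(Path(dataset_dir).iterdir()):
        if not sim_dir.is_dir():
            continue  # ignore stray files at the top level
        absent = [name for name in EXPECTED if not (sim_dir / name).is_file()]
        if absent:
            missing[sim_dir.name] = absent
    return missing
```

An empty result means every simulation directory found contains all three files.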
Training details
Stage 3 trains six model checkpoints respecting the dependency chain.
run_fresh.sh installs each checkpoint into BASE_DIR/checkpoints/
after training so that figure generation uses the freshly trained models.
Which model trains on which dataset:
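| Model | Fine-tuning | Dataset | Source scenarios | Window | Samples | Pairs |
|---|---|---|---|---|---|---|
| narrow | No (from scratch) | processed_narrow | constant only | w2000 | 50 | 200 |
| broad | No (from scratch) | processed | all | w2000 | 50 | 200 |
| broad_w200 | Yes ← broad | processed_small_window | 13 high-Ne stdpopsim | w200 | 50 | 200 |
| broad+adapter | Yes ← broad | processed_n10 | all | w2000 | 10 | 20 |
| w200_wmissing | Yes ← broad_w200 | processed_small_window_missing_data | 13 high-Ne stdpopsim | w200 | 50 | 200 + bitmask |
| w200_wmissing_adapter | Yes ← broad+adapter | processed_small_window_missing_data_n10 | 13 high-Ne stdpopsim | w200 | 10 | 20 + bitmask |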
See Training for the full checkpoint commands and hyperparameters.
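The dependency chain is a small directed acyclic graph, so any topological order of it is a valid training order. The sketch below uses the standard-library `graphlib` module (Python 3.9+) with the checkpoint names from the pipeline overview; it illustrates the ordering constraint and is not taken from run_fresh.sh.

```python
from graphlib import TopologicalSorter

# Checkpoint -> checkpoints it is fine-tuned from, as in the pipeline overview.
DEPENDS_ON = {
    "narrow": set(),
    "broad": set(),
    "broad_w200": {"broad"},
    "broad+adapter": {"broad"},
    "w200_wmissing": {"broad_w200"},
    "w200_wmissing_adapter": {"broad+adapter"},
}

# static_order() yields every checkpoint after all of its prerequisites.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```

Several orders satisfy the constraints (narrow, for instance, is independent of everything else), so the printed order need not match the one the script uses.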
Figure details
Stage 4 generates all paper figures. Each figure script caches its intermediate results (simulated tree sequences, TMRCA predictions) so re-runs skip expensive computation.
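The caching behaviour follows the usual compute-once pattern. The sketch below shows that pattern with `pickle`; the on-disk format and file names the figure scripts really use are not documented here, so the helper `cached` and the pickle format are assumptions for illustration only.

```python
import pickle
from pathlib import Path


def cached(path, compute):
    """Load a result from path if present; otherwise compute and store it (hypothetical helper)."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()  # the expensive step, e.g. simulation or inference
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

With a pattern like this, deleting the cache file is what forces a figure to be recomputed from scratch.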
Main figures:
| Fig | Script | Models | Description |
|---|---|---|---|
| 1 | | narrow | Model architecture and batch inference demo |
| 2 | | narrow, broad | True vs predicted TMRCA (cxt, SMC++, SINGER) |
| 3 | | broad | KDE comparison across 16 stdpopsim v0.2 species |
| 4 | | broad | Out-of-distribution evaluation on 6 v0.3 species |
| 5 | | broad | IICR for H. sapiens, B. taurus, A. thaliana |
| 6 | | broad | TMRCA landscapes for GBR (chr2, chr6, LCT, HLA) |
| 7 | | w200_wmissing | |
| 8 | | (uses Fig 7 cache) | In(2L)a inversion coalescence patterns |
Supplementary figures:
| Fig | Script | Description |
|---|---|---|
| S4 | | Sample-size adapter (n=5 vs n=25) |
| S5 | | Window size effect (w=2000, 200, 20 bp) + residual model |
| S6 | | Runtime comparison: cxt vs SMC++ vs SINGER |
| S9 | | RDL region: cxt vs Singer+Polegon vs SMC++ |
| S10 | | Cross-population coalescence (OutOfAfrica_2T12) |
| S11 | | Mutation × recombination grid evaluation |
Outputs land in figures/output/main/ and figures/output/supplementary/.
External data dependencies
Some figures require external genomic datasets that are not generated by the simulation stage:
| Figures | Variable | Data |
|---|---|---|
| 6 | | Human 1000 Genomes tsinfer trees + masks |
| 7, S9 | | Ag1000G chr2L dated trees |
| 7, S9 | | Ag1000G per-site accessibility bitmask |
| S6 | (paths.py) | Benchmark timing logs (JSONL) |
| S9 | (paths.py) | SINGER and SMC++ revision caches |
These paths are resolved via figures/paths.py and can be overridden
with environment variables.
Logs
run_fresh.sh writes a timestamped log file
(run_fresh_YYYYMMDD_HHMMSS.log) under BASE_DIR. The log captures
stdout/stderr from all stages and ends with a summary of wall times and
any failures.
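Because the timestamp is part of the file name, the most recent log can be located by parsing names alone. A minimal sketch, assuming the logs sit directly under BASE_DIR as described; the helper name `latest_log` is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def latest_log(base_dir):
    """Return the newest run_fresh_*.log under base_dir, or None (hypothetical helper)."""
    stamped = []
    for path in Path(base_dir).glob("run_fresh_*.log"):
        try:
            when = datetime.strptime(path.name, "run_fresh_%Y%m%d_%H%M%S.log")
        except ValueError:
            continue  # skip files that do not match the timestamp pattern
        stamped.append((when, path))
    if not stamped:
        return None
    return max(stamped, key=lambda item: item[0])[1]
```

The summary of wall times and failures sits at the end of that file, so reading its last lines is usually enough to see how a run went.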