Full Reproducibility

scripts/run_fresh.sh bootstraps a uv virtualenv, then simulates new data, trains all models from scratch, and generates all figures in a single isolated directory. It installs uv automatically if missing.

Quick start (fresh run)

Run everything (simulations → preprocessing → training → figures) in a single isolated directory:

./scripts/run_fresh.sh

By default all outputs go to /sietch_colab/data_share/cxt_scratch/. Override with the BASE_DIR environment variable:

BASE_DIR=/scratch/myuser/cxt_run ./scripts/run_fresh.sh

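By default the script uses GPUs 0 and 1, 80 CPU workers each for simulation and preprocessing, and 16 DataLoader workers for training; all of these are set at the top of the script. A uv virtualenv is created at BASE_DIR/.venv and reused on subsequent runs. To recreate it, delete the directory and rerun (shown here for the default BASE_DIR):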

rm -rf /sietch_colab/data_share/cxt_scratch/.venv
./scripts/run_fresh.sh

Run individual stages:

./scripts/run_fresh.sh simulate        # only simulations
./scripts/run_fresh.sh preprocess      # only preprocessing
./scripts/run_fresh.sh train           # only training
./scripts/run_fresh.sh figures         # only figures
./scripts/run_fresh.sh train figures   # multiple stages
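
The same invocations can be driven from Python, for example from a job scheduler. The following is a minimal sketch that assumes only the interface documented above (optional stage names as positional arguments, BASE_DIR as an environment variable). The helper name build_run_fresh_command is hypothetical and not part of the repository.

```python
import os

# Stage names accepted by scripts/run_fresh.sh, as listed in this section.
STAGES = ("simulate", "preprocess", "train", "figures")


def build_run_fresh_command(stages=(), base_dir=None, script="./scripts/run_fresh.sh"):
    """Return (argv, env) for one run_fresh.sh invocation (hypothetical helper)."""
    unknown = [s for s in stages if s not in STAGES]
    if unknown:
        raise ValueError(f"unknown stage(s): {unknown}")
    argv = [script, *stages]  # no stage arguments means "run everything"
    env = dict(os.environ)
    if base_dir is not None:
        env["BASE_DIR"] = str(base_dir)  # documented override for the output root
    return argv, env


if __name__ == "__main__":
    argv, env = build_run_fresh_command(["train", "figures"], base_dir="/scratch/myuser/cxt_run")
    print(argv)
    # To launch the run: subprocess.run(argv, env=env, check=True)
```

Rejecting unknown stage names before launching avoids starting a long run with a typo in the stage list.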

Pipeline overview

┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  STAGE 1: SIMULATE      python cxt/simulation_ts_only.py            │
│  ─────────────────                                                   │
│  35+ scenarios (constant, sawtooth, island, LLM sweeps,             │
│  10 stdpopsim mammals, 15 stdpopsim other species)                  │
│  → .trees files in DATA_DIR/                                        │
│                                                                      │
│  STAGE 2: PREPROCESS    python -m cxt.preprocess                    │
│  ──────────────────                                                  │
│  6 preprocessed datasets:                                            │
│    processed_narrow         (w2000, n50, 200 pairs, constant only)  │
│    processed                 (w2000, n50, 200 pairs)                │
│    processed_n10             (w2000, n10, 20 pairs)                 │
│    processed_small_window    (w200, n50, 200 pairs)                 │
│    processed_small_window_missing_data        (w200, n50, bitmask)  │
│    processed_small_window_missing_data_n10    (w200, n10, bitmask)  │
│  → X.npy, y.npy, pairs.npy per simulation                          │
│                                                                      │
│  STAGE 3: TRAIN         python -m cxt.train                         │
│  ──────────────                                                      │
│  6 checkpoints in dependency order:                                  │
│    narrow           ← processed_narrow (constant only)              │
│    broad            ← processed                                     │
│    broad_w200       ← processed_small_window + broad ckpt           │
│    broad+adapter    ← processed_n10 + broad ckpt                    │
│    w200_wmissing    ← processed_sw_missing + broad_w200 ckpt        │
│    w200_wmissing_adapter ← processed_sw_missing_n10                 │
│                        + broad+adapter (--resume-adapter)           │
│                                                                      │
│  STAGE 4: FIGURES       python -m figures.main.*                     │
│  ────────────────                                                    │
│  8 main figures (Fig 1-8) + 6 supplementary (S4, S5, S6, S9-S11)   │
│  → figures/output/main/ and figures/output/supplementary/            │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Configuration

run_fresh.sh places everything under a single BASE_DIR:

BASE_DIR/
├── data/                  # simulated .trees + preprocessed datasets
├── lightning_logs/         # PyTorch Lightning training logs
├── checkpoints/           # installed model checkpoints (for cxt.load_model)
└── figures/output/        # generated figure PNGs
    ├── main/
    └── supplementary/
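
To check which parts of this layout exist after a partial run, something like the following sketch may help. It only mirrors the directory tree above; the function names output_layout and missing_outputs are hypothetical.

```python
from pathlib import Path


def output_layout(base_dir):
    """Map each documented output location to its path under base_dir."""
    base = Path(base_dir)
    return {
        "data": base / "data",
        "lightning_logs": base / "lightning_logs",
        "checkpoints": base / "checkpoints",
        "figures_main": base / "figures" / "output" / "main",
        "figures_supplementary": base / "figures" / "output" / "supplementary",
    }


def missing_outputs(base_dir):
    """Return the names of documented directories that do not exist yet."""
    return [name for name, path in output_layout(base_dir).items() if not path.is_dir()]


if __name__ == "__main__":
    print(missing_outputs("/scratch/myuser/cxt_run"))
```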

Hardware and path settings are at the top of the script:

| Variable | Description | Default |
| --- | --- | --- |
| BASE_DIR | Root for all outputs | /sietch_colab/data_share/cxt_scratch |
| GPUS | GPU indices for training and figures | 0 1 |
| SIM_WORKERS | Parallel workers for simulation | 80 |
| PREPROCESS_WORKERS | Parallel workers for preprocessing | 80 |
| TRAIN_WORKERS | DataLoader workers for training | 16 |

The script sets CXT_CHECKPOINT_CACHE=$BASE_DIR/checkpoints so that cxt.load_model() uses the freshly trained checkpoints instead of the global cache at ~/.cache/cxt/checkpoints/. It also sets CUDA_VISIBLE_DEVICES so that figure scripts see the correct GPUs.
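
For running a single figure script by hand outside run_fresh.sh, an equivalent environment can be assembled in Python. This is a sketch under two assumptions: that GPUS is a whitespace-separated list of indices (as in the default "0 1"), and that the script turns it into the comma-separated form that CUDA_VISIBLE_DEVICES expects. The function stage_environment is hypothetical.

```python
import os
from pathlib import Path


def stage_environment(base_dir, gpus="0 1"):
    """Build an environment similar to the one run_fresh.sh exports (assumed)."""
    env = dict(os.environ)
    # Point cxt.load_model() at the freshly trained checkpoints.
    env["CXT_CHECKPOINT_CACHE"] = str(Path(base_dir) / "checkpoints")
    # CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(gpus.split())
    return env


if __name__ == "__main__":
    env = stage_environment("/scratch/myuser/cxt_run")
    print(env["CXT_CHECKPOINT_CACHE"], env["CUDA_VISIBLE_DEVICES"])
```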

Figure-specific external data (also configurable via env):

| Variable | Default |
| --- | --- |
| AG1000G_DATA_DIR | /sietch_colab/data_share/Ag1000G/Ag3.0/args_trees/tsinfer_data_v2 |
| AG1000G_ACCESSIBILITY | /sietch_colab/data_share/Ag1000G/Ag3.0/args_trees/singer/agp3.is_accessible.txt.npz |
| HG1KG_TSZ_DIR | /sietch_colab/data_share/hg1kg/tsinfer-trees/working |

Example: custom paths

BASE_DIR=/scratch/myuser/cxt_run ./scripts/run_fresh.sh

Simulation details

Stage 1 runs 35+ simulation batches across four categories (using python cxt/simulation_ts_only.py):

Base dataset (3 scenarios):

| Scenario | Samples | Description |
| --- | --- | --- |
| constant | 10,000 | Constant \(N_e = 20{,}000\) |
| sawtooth | 1,000 | Oscillating \(N_e\) (Schiffels & Durbin zigzag) |
| island | 1,000 | 3-population island model with migration |

LLM-style sweeps (4 scenarios):

| Scenario | Samples | Description |
| --- | --- | --- |
| llm_ne_sawtooth | 125 | 3 magnitudes × 3 \(N_e\) × 2 \(\mu\) × 2 \(r\) |
| llm_hard_sweeps | 50 | 3 \(N_e\) × 2 \(\mu\) × 2 \(r\) × 3 sel. coefficients |
| llm_island_3pop | 50 | 2 migration × 3 \(N_e\) × 2 \(\mu\) × 2 \(r\) |
| llm_ne_constant | 500 | 3 \(N_e\) × 2 \(\mu\) × 2 \(r\) |

stdpopsim mammals (10 scenarios, 1,000 samples each): homsap, homsap_map, bostau, canfam, canfam_map, pantro, papanu, papanu_map, ponabe, ponabe_map

stdpopsim other species (15 scenarios, 5–1,000 samples): aedaeg, anapla, anocar, anogam, apimel, aratha, aratha_map, caeele, caeele_map, dromel, drosec, gasacu, helann, helmel, musmus

Preprocessing details

Stage 2 creates six preprocessed datasets. See Preprocessing for the full schema. The datasets differ in window size, number of pairs, sample count, and whether a missingness bitmask is encoded:

| Dataset | Window | Pairs | Samples | Bitmask |
| --- | --- | --- | --- | --- |
| processed_narrow | 2,000 bp | 200 | 50 | No (constant scenario only) |
| processed | 2,000 bp | 200 | 50 | No |
| processed_n10 | 2,000 bp | 20 | 10 | No |
| processed_small_window | 200 bp | 200 | 50 | No |
| processed_small_window_missing_data | 200 bp | 200 | 50 | Yes |
| processed_small_window_missing_data_n10 | 200 bp | 20 | 10 | Yes |
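
To put the Pairs column in context, the arithmetic below compares it with the number of possible sample pairs. It is a sketch that assumes "Samples" counts sampled sequences and that "Pairs" are unordered pairs drawn from them; neither assumption is stated in this section, and pair_coverage is a hypothetical helper.

```python
from math import comb

# (samples, pairs) for the two sample sizes listed in the table above.
CONFIGS = {"n50": (50, 200), "n10": (10, 20)}


def pair_coverage(samples, pairs):
    """Return (possible unordered pairs, fraction of them that is used)."""
    total = comb(samples, 2)  # n choose 2
    return total, pairs / total


if __name__ == "__main__":
    for name, (samples, pairs) in CONFIGS.items():
        total, fraction = pair_coverage(samples, pairs)
        print(f"{name}: {pairs} of {total} possible pairs ({fraction:.1%})")
```

Under these assumptions both configurations use a subset of the possible pairs rather than all of them.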

Training details

Stage 3 trains six model checkpoints respecting the dependency chain. run_fresh.sh installs each checkpoint into BASE_DIR/checkpoints/ after training so that figure generation uses the freshly trained models.

Which model trains on which dataset:

| Model | Fine-tuning | Dataset | Source scenarios | Window | Samples | Pairs |
| --- | --- | --- | --- | --- | --- | --- |
| narrow | No (from scratch) | processed_narrow | constant only | w2000 | 50 | 200 |
| broad | No (from scratch) | processed | all | w2000 | 50 | 200 |
| broad_w200 | Yes ← broad | processed_small_window | 13 high-Ne stdpopsim | w200 | 50 | 200 |
| broad+adapter | Yes ← broad | processed_n10 | all | w2000 | 10 | 20 |
| w200_wmissing | Yes ← broad_w200 | processed_small_window_missing_data | 13 high-Ne stdpopsim | w200 | 50 | 200 + bitmask |
| w200_wmissing_adapter | Yes ← broad+adapter (--resume-adapter) | processed_small_window_missing_data_n10 | 13 high-Ne stdpopsim | w200 | 10 | 20 + bitmask |

See Training for the full checkpoint commands and hyperparameters.
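
The dependency chain can be written down as a small graph and ordered with a topological sort. The sketch below uses the standard-library graphlib module (Python 3.9+) and takes the parent relationships from the table in this section; it illustrates the ordering constraint and is not how run_fresh.sh itself schedules training.

```python
from graphlib import TopologicalSorter

# checkpoint -> checkpoints it is fine-tuned from (empty set = trained from scratch)
PARENTS = {
    "narrow": set(),
    "broad": set(),
    "broad_w200": {"broad"},
    "broad+adapter": {"broad"},
    "w200_wmissing": {"broad_w200"},
    "w200_wmissing_adapter": {"broad+adapter"},
}


def training_order(parents=PARENTS):
    """Return one valid order in which every parent precedes its children."""
    return list(TopologicalSorter(parents).static_order())


if __name__ == "__main__":
    for step, name in enumerate(training_order(), start=1):
        print(step, name)
```

Any order returned this way trains broad before broad_w200 and broad+adapter, and those two before their respective missing-data variants.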

Figure details

Stage 4 generates all paper figures. Each figure script caches its intermediate results (simulated tree sequences, TMRCA predictions) so re-runs skip expensive computation.
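
The caching behaviour follows a common load-or-compute pattern, sketched below. The on-disk format used by the actual figure scripts is not described here, so pickle and the helper name cached are stand-ins chosen for illustration.

```python
import pickle
from pathlib import Path


def cached(path, compute):
    """Load a pickled result from path, or compute and store it on a cache miss."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()  # the expensive step, e.g. simulation or prediction
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

With this pattern, deleting a cache file forces only that intermediate result to be recomputed on the next run.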

Main figures:

| Fig | Script | Models | Description |
| --- | --- | --- | --- |
| 1 | fig1_model_schematic | narrow | Model architecture and batch inference demo |
| 2 | fig2_benchmark_comparison | narrow, broad | True vs predicted TMRCA (cxt, SMC++, SINGER) |
| 3 | fig3_stdpopsim_v2_coalescence | broad | KDE comparison across 16 stdpopsim v0.2 species |
| 4 | fig4_stdpopsim_v3_ood | broad | Out-of-distribution evaluation on 6 v0.3 species |
| 5 | fig5_demography_inference | broad | IICR for H. sapiens, B. taurus, A. thaliana |
| 6 | fig6_human_1kg | broad | TMRCA landscapes for GBR (chr2, chr6, LCT, HLA) |
| 7 | fig7_mosquito_rdl | w200_wmissing | A. gambiae RDL region across 5 populations |
| 8 | fig8_inversion_coalescence | (uses Fig 7 cache) | In(2L)a inversion coalescence patterns |

Supplementary figures:

| Fig | Script | Description |
| --- | --- | --- |
| S4 | figS4_sample_size_adapter | Sample-size adapter (n=5 vs n=25) |
| S5 | figS5_window_resolution | Window size effect (w=2000, 200, 20 bp) + residual model |
| S6 | figS6_runtime_benchmark | Runtime comparison: cxt vs SMC++ vs SINGER |
| S9 | figS9_mosquito_comparison | RDL region: cxt vs SINGER+Polegon vs SMC++ |
| S10 | figS10_cross_coalescence | Cross-population coalescence (OutOfAfrica_2T12) |
| S11 | figS11_interpolation_grid | Mutation × recombination grid evaluation |

Outputs land in BASE_DIR/figures/output/main/ and BASE_DIR/figures/output/supplementary/.

External data dependencies

Some figures require external genomic datasets that are not generated by the simulation stage:

| Figures | Variable | Data |
| --- | --- | --- |
| 6 | HG1KG_TSZ_DIR | Human 1000 Genomes tsinfer trees + masks |
| 7, S9 | AG1000G_DATA_DIR | Ag1000G chr2L dated trees |
| 7, S9 | AG1000G_ACCESSIBILITY | Ag1000G per-site accessibility bitmask |
| S6 | (paths.py) | Benchmark timing logs (JSONL) |
| S9 | (paths.py) | SINGER and SMC++ revision caches |

These paths are resolved via figures/paths.py and can be overridden with environment variables.
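
Before starting the figures stage, it can be useful to confirm that the external data is reachable. The following preflight check is a sketch only: it mirrors the documented variables and defaults, it is not the contents of figures/paths.py, and the names resolve_external_paths and missing_external_data are hypothetical.

```python
import os

# Defaults copied from the configuration section; each can be overridden via the environment.
EXTERNAL_DATA_DEFAULTS = {
    "AG1000G_DATA_DIR": "/sietch_colab/data_share/Ag1000G/Ag3.0/args_trees/tsinfer_data_v2",
    "AG1000G_ACCESSIBILITY": "/sietch_colab/data_share/Ag1000G/Ag3.0/args_trees/singer/agp3.is_accessible.txt.npz",
    "HG1KG_TSZ_DIR": "/sietch_colab/data_share/hg1kg/tsinfer-trees/working",
}

# Figures that depend on each variable, per the table in this section.
REQUIRED_BY = {
    "AG1000G_DATA_DIR": ("7", "S9"),
    "AG1000G_ACCESSIBILITY": ("7", "S9"),
    "HG1KG_TSZ_DIR": ("6",),
}


def resolve_external_paths(environ=None):
    """Return each variable's value, falling back to the documented default."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in EXTERNAL_DATA_DEFAULTS.items()}


def missing_external_data(environ=None):
    """Return {variable: path} for every resolved path that does not exist."""
    resolved = resolve_external_paths(environ)
    return {name: path for name, path in resolved.items() if not os.path.exists(path)}


if __name__ == "__main__":
    for name, path in missing_external_data().items():
        figs = ", ".join(REQUIRED_BY[name])
        print(f"{name} not found at {path} (needed for figures {figs})")
```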

Logs

run_fresh.sh writes a timestamped log file (run_fresh_YYYYMMDD_HHMMSS.log) under BASE_DIR. The log captures stdout/stderr from all stages and ends with a summary of wall times and any failures.
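
Because the timestamp in the file name is fixed-width, sorting the names alphabetically also sorts them chronologically. A small sketch for building such a name and locating the most recent log, assuming the logs sit directly under BASE_DIR (log_path and latest_log are hypothetical helpers):

```python
from datetime import datetime
from pathlib import Path


def log_path(base_dir, now=None):
    """Build a log path following the run_fresh_YYYYMMDD_HHMMSS.log pattern."""
    now = datetime.now() if now is None else now
    return Path(base_dir) / f"run_fresh_{now:%Y%m%d_%H%M%S}.log"


def latest_log(base_dir):
    """Return the newest run_fresh log under base_dir, or None if there is none."""
    logs = sorted(Path(base_dir).glob("run_fresh_*.log"))
    return logs[-1] if logs else None


if __name__ == "__main__":
    print(latest_log("/scratch/myuser/cxt_run"))
```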