Full Reproducibility

scripts/run_fresh.sh bootstraps a uv virtualenv (installing uv automatically if it is missing), then simulates new data, preprocesses it, trains all models from scratch, and generates all figures inside a single isolated directory.

Quick start (fresh run)

Run everything (simulations → preprocessing → training → figures) in a single isolated directory:

./scripts/run_fresh.sh

By default all outputs go to /sietch_colab/data_share/cxt_scratch/. Override with the BASE_DIR environment variable:

BASE_DIR=/scratch/myuser/cxt_run ./scripts/run_fresh.sh

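By default the script uses GPUs 0 and 1, 80 CPU workers for simulation and preprocessing, and 16 DataLoader workers for training. These values are set at the top of the script. A uv virtualenv is created at BASE_DIR/.venv and reused on subsequent runs. To recreate it, delete the directory and rerun the script (the path below assumes the default BASE_DIR):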

rm -rf /sietch_colab/data_share/cxt_scratch/.venv
./scripts/run_fresh.sh

Run individual stages:

./scripts/run_fresh.sh simulate       # only simulations
./scripts/run_fresh.sh preprocess      # only preprocessing
./scripts/run_fresh.sh train           # only training
./scripts/run_fresh.sh figures         # only figures
./scripts/run_fresh.sh train figures   # multiple stages

Pipeline overview

┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  STAGE 1: SIMULATE      python cxt/simulation_ts_only.py            │
│  ─────────────────                                                   │
│  35+ scenarios (constant, sawtooth, island, LLM sweeps,             │
│  10 stdpopsim mammals, 15 stdpopsim other species)                  │
│  → .trees files in DATA_DIR/                                        │
│                                                                      │
│  STAGE 2: PREPROCESS    python -m cxt.preprocess                    │
│  ──────────────────                                                  │
│  6 preprocessed datasets:                                            │
│    processed_narrow         (w2000, n50, 200 pairs, constant only)  │
│    processed                 (w2000, n50, 200 pairs)                │
│    processed_n10             (w2000, n10, 20 pairs)                 │
│    processed_small_window    (w200, n50, 200 pairs)                 │
│    processed_small_window_missing_data        (w200, n50, bitmask)  │
│    processed_small_window_missing_data_n10    (w200, n10, bitmask)  │
│  → X.npy, y.npy, pairs.npy per simulation                          │
│                                                                      │
│  STAGE 3: TRAIN         python -m cxt.train                         │
│  ──────────────                                                      │
│  6 checkpoints in dependency order:                                  │
│    narrow           ← processed_narrow (constant only)              │
│    broad            ← processed                                     │
│    broad_w200       ← processed_small_window + broad ckpt           │
│    broad+adapter    ← processed_n10 + broad ckpt                    │
│    w200_wmissing    ← processed_sw_missing + broad_w200 ckpt        │
│    w200_wmissing_adapter ← processed_sw_missing_n10                 │
│                        + broad+adapter (--resume-adapter)           │
│                                                                      │
│  STAGE 4: FIGURES       python -m figures.main.*                     │
│  ────────────────                                                    │
│  8 main figures (Fig 1-8) + 6 supplementary (S4, S5, S6, S9-S11)   │
│  → figures/output/main/ and figures/output/supplementary/            │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Configuration

run_fresh.sh places everything under a single BASE_DIR:

BASE_DIR/
├── data/                  # simulated .trees + preprocessed datasets
├── lightning_logs/         # PyTorch Lightning training logs
├── checkpoints/           # installed model checkpoints (for cxt.load_model)
└── figures/output/        # generated figure PNGs
    ├── main/
    └── supplementary/

Hardware and path settings are at the top of the script:

Variable	Description	Default
`BASE_DIR`	Root for all outputs	`/sietch_colab/data_share/cxt_scratch`
`GPUS`	GPU indices for training and figures	`0 1`
`SIM_WORKERS`	Parallel workers for simulation	`80`
`PREPROCESS_WORKERS`	Parallel workers for preprocessing	`80`
`TRAIN_WORKERS`	DataLoader workers for training	`16`

The script sets CXT_CHECKPOINT_CACHE=$BASE_DIR/checkpoints so that cxt.load_model() uses the freshly trained checkpoints instead of the global cache at ~/.cache/cxt/checkpoints/. It also sets CUDA_VISIBLE_DEVICES so that figure scripts see the correct GPUs.
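The lookup order described above can be pictured with a small sketch. This is not cxt's actual implementation: `resolve_checkpoint_cache` is a hypothetical helper, and only the variable name CXT_CHECKPOINT_CACHE and the default location ~/.cache/cxt/checkpoints/ come from this page.

```python
import os
from pathlib import Path

# Global cache location named in the text above.
DEFAULT_CACHE = Path.home() / ".cache" / "cxt" / "checkpoints"


def resolve_checkpoint_cache(env=None):
    """Hypothetical helper: prefer CXT_CHECKPOINT_CACHE, else the global cache."""
    env = os.environ if env is None else env
    override = env.get("CXT_CHECKPOINT_CACHE")
    # An unset or empty variable falls back to the default location.
    return Path(override) if override else DEFAULT_CACHE


# Inside a run_fresh.sh session this would print BASE_DIR/checkpoints.
print(resolve_checkpoint_cache())
```

Passing the environment as an argument keeps the sketch easy to check without modifying the real process environment.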

Figure-specific external data (also configurable via environment variables):

Variable	Default
`AG1000G_DATA_DIR`	`/sietch_colab/data_share/Ag1000G/Ag3.0/args_trees/tsinfer_data_v2`
`AG1000G_ACCESSIBILITY`	`/sietch_colab/data_share/Ag1000G/Ag3.0/args_trees/singer/agp3.is_accessible.txt.npz`
`HG1KG_TSZ_DIR`	`/sietch_colab/data_share/hg1kg/tsinfer-trees/working`

Example: custom paths

BASE_DIR=/scratch/myuser/cxt_run ./scripts/run_fresh.sh

Simulation details

Stage 1 runs 35+ simulation batches across four categories (using python cxt/simulation_ts_only.py):

Base dataset (3 scenarios):

Scenario	Samples	Description
`constant`	10,000	Constant $N_e = 20{,}000$
`sawtooth`	1,000	Oscillating $N_e$ (Schiffels & Durbin zigzag)
`island`	1,000	3-population island model with migration

LLM-style sweeps (4 scenarios):

Scenario	Samples	Description
`llm_ne_sawtooth`	125	3 magnitudes × 3 $N_e$ × 2 $\mu$ × 2 $r$
`llm_hard_sweeps`	50	3 $N_e$ × 2 $\mu$ × 2 $r$ × 3 sel. coefficients
`llm_island_3pop`	50	2 migration × 3 $N_e$ × 2 $\mu$ × 2 $r$
`llm_ne_constant`	500	3 $N_e$ × 2 $\mu$ × 2 $r$
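The Description column gives each sweep as a product of factor levels. As a rough illustration, the sketch below enumerates those grids and counts their cells. The actual parameter values are not listed on this page, so integer indices stand in for them; `SWEEP_LEVELS` and `grid` are names invented for this example, not part of cxt.

```python
from itertools import product

# Number of levels per swept factor, read off the table above.
# Real parameter values are not given here, so indices 0..n-1 are placeholders.
SWEEP_LEVELS = {
    "llm_ne_sawtooth": {"magnitude": 3, "Ne": 3, "mu": 2, "r": 2},
    "llm_hard_sweeps": {"Ne": 3, "mu": 2, "r": 2, "sel_coeff": 3},
    "llm_island_3pop": {"migration": 2, "Ne": 3, "mu": 2, "r": 2},
    "llm_ne_constant": {"Ne": 3, "mu": 2, "r": 2},
}


def grid(levels):
    """Yield every combination of level indices for one scenario."""
    names = list(levels)
    for combo in product(*(range(levels[name]) for name in names)):
        yield dict(zip(names, combo))


for scenario, levels in SWEEP_LEVELS.items():
    print(scenario, len(list(grid(levels))))
```

This gives 36, 36, 24, and 12 grid cells respectively. The Samples column is a separate quantity; this page does not say how samples are distributed over grid cells.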

stdpopsim mammals (10 scenarios, 1,000 samples each): homsap, homsap_map, bostau, canfam, canfam_map, pantro, papanu, papanu_map, ponabe, ponabe_map

stdpopsim other species (15 scenarios, 5–1,000 samples): aedaeg, anapla, anocar, anogam, apimel, aratha, aratha_map, caeele, caeele_map, dromel, drosec, gasacu, helann, helmel, musmus

Preprocessing details

Stage 2 creates six preprocessed datasets. See Preprocessing for the full schema. The datasets differ in window size, sample count, and whether a missingness bitmask is encoded:

Dataset	Window	Pairs	Samples	Bitmask
`processed_narrow`	2,000 bp	200	50	No (constant scenario only)
`processed`	2,000 bp	200	50	No
`processed_n10`	2,000 bp	20	10	No
`processed_small_window`	200 bp	200	50	No
`processed_small_window_missing_data`	200 bp	200	50	Yes
`processed_small_window_missing_data_n10`	200 bp	20	10	Yes
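Since preprocessing writes X.npy, y.npy, and pairs.npy per simulation, a quick completeness check before training can catch an interrupted run. The sketch below assumes each simulation has its own directory containing exactly those three files; that layout and the helper name `missing_outputs` are assumptions for illustration, not something this page specifies.

```python
import tempfile
from pathlib import Path

# File names produced per simulation, as listed in the pipeline overview.
EXPECTED_FILES = ("X.npy", "y.npy", "pairs.npy")


def missing_outputs(sim_dir):
    """Return the expected preprocessing outputs that are absent from sim_dir."""
    sim_dir = Path(sim_dir)
    return [name for name in EXPECTED_FILES if not (sim_dir / name).is_file()]


# Demo on a throwaway directory where only X.npy has been written.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "X.npy").touch()
    print(missing_outputs(tmp))  # the two files not yet written
```

An empty list means the directory looks complete under the assumed layout.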

Training details

Stage 3 trains six model checkpoints in dependency order, so that every fine-tuned model starts from its already-trained parent checkpoint. run_fresh.sh installs each checkpoint into BASE_DIR/checkpoints/ after training so that figure generation uses the freshly trained models.

Which model trains on which dataset:

Model	Fine-tuning	Dataset	Source scenarios	Window	Samples	Pairs
`narrow`	No (from scratch)	`processed_narrow`	constant only	w2000	50	200
`broad`	No (from scratch)	`processed`	all	w2000	50	200
`broad_w200`	Yes ← `broad`	`processed_small_window`	13 high-Ne stdpopsim	w200	50	200
`broad+adapter`	Yes ← `broad`	`processed_n10`	all	w2000	10	20
`w200_wmissing`	Yes ← `broad_w200`	`processed_small_window_missing_data`	13 high-Ne stdpopsim	w200	50	200 + bitmask
`w200_wmissing_adapter`	Yes ← `broad+adapter` (`--resume-adapter`)	`processed_small_window_missing_data_n10`	13 high-Ne stdpopsim	w200	10	20 + bitmask

See Training for the full checkpoint commands and hyperparameters.
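The dependency chain in the table can be expressed as a small graph and ordered with a topological sort. The sketch below uses Python's standard-library graphlib (Python 3.9+) and only illustrates the ordering constraint; it is not how run_fresh.sh itself schedules training, and `training_order` is a name made up for this example.

```python
from graphlib import TopologicalSorter

# Each checkpoint maps to the checkpoints it is fine-tuned from (table above).
DEPENDS_ON = {
    "narrow": set(),
    "broad": set(),
    "broad_w200": {"broad"},
    "broad+adapter": {"broad"},
    "w200_wmissing": {"broad_w200"},
    "w200_wmissing_adapter": {"broad+adapter"},
}


def training_order(depends_on):
    """Return one valid order in which every parent precedes its children."""
    return list(TopologicalSorter(depends_on).static_order())


order = training_order(DEPENDS_ON)
print(order)
```

Several orders satisfy the constraints (for example, narrow can be trained at any point), so the exact sequence printed is only one valid choice.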

Figure details

Stage 4 generates all paper figures. Each figure script caches its intermediate results (simulated tree sequences, TMRCA predictions) so re-runs skip expensive computation.
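The caching behaviour follows a common compute-or-load pattern, sketched below. The cache format and file locations used by the real figure scripts are not described on this page; pickle, the `cached` helper, and the file name predictions.pkl are assumptions chosen for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def cached(cache_path, compute):
    """Load a pickled result if it exists; otherwise compute and store it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def expensive():
    """Stand-in for a slow step such as simulation or TMRCA prediction."""
    calls.append(1)
    return {"tmrca": [1.0, 2.0]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fig_cache" / "predictions.pkl"
    first = cached(path, expensive)   # computed and written to disk
    second = cached(path, expensive)  # read back; expensive() is not called again
```

With this pattern, deleting the cache file is what forces a recomputation on the next run.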

Main figures:

Fig	Script	Models	Description
1	`fig1_model_schematic`	narrow	Model architecture and batch inference demo
2	`fig2_benchmark_comparison`	narrow, broad	True vs predicted TMRCA (cxt, SMC++, SINGER)
3	`fig3_stdpopsim_v2_coalescence`	broad	KDE comparison across 16 stdpopsim v0.2 species
4	`fig4_stdpopsim_v3_ood`	broad	Out-of-distribution evaluation on 6 v0.3 species
5	`fig5_demography_inference`	broad	IICR for H. sapiens, B. taurus, A. thaliana
6	`fig6_human_1kg`	broad	TMRCA landscapes for GBR (chr2, chr6, LCT, HLA)
7	`fig7_mosquito_rdl`	w200_wmissing	A. gambiae RDL region across 5 populations
8	`fig8_inversion_coalescence`	(uses Fig 7 cache)	In(2L)a inversion coalescence patterns

Supplementary figures:

Fig	Script	Description
S4	`figS4_sample_size_adapter`	Sample-size adapter (n=5 vs n=25)
S5	`figS5_window_resolution`	Window size effect (w=2000, 200, 20 bp) + residual model
S6	`figS6_runtime_benchmark`	Runtime comparison: cxt vs SMC++ vs SINGER
S9	`figS9_mosquito_comparison`	RDL region: cxt vs SINGER+Polegon vs SMC++
S10	`figS10_cross_coalescence`	Cross-population coalescence (OutOfAfrica_2T12)
S11	`figS11_interpolation_grid`	Mutation × recombination grid evaluation

Outputs land in figures/output/main/ and figures/output/supplementary/.

External data dependencies

Some figures require external genomic datasets that are not generated by the simulation stage:

Figures	Variable	Data
6	`HG1KG_TSZ_DIR`	Human 1000 Genomes tsinfer trees + masks
7, S9	`AG1000G_DATA_DIR`	Ag1000G chr2L dated trees
7, S9	`AG1000G_ACCESSIBILITY`	Ag1000G per-site accessibility bitmask
S6	(paths.py)	Benchmark timing logs (JSONL)
S9	(paths.py)	SINGER and SMC++ revision caches

These paths are resolved via figures/paths.py and can be overridden with environment variables.
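Because these datasets live outside BASE_DIR, it can help to check that they are reachable before starting the figures stage. The sketch below is a standalone preflight check, not the contents of figures/paths.py: the defaults are copied from the configuration tables on this page, while `missing_external_data` is a hypothetical helper.

```python
import os

# Defaults copied from the configuration section of this page.
EXTERNAL_DATA_DEFAULTS = {
    "AG1000G_DATA_DIR": "/sietch_colab/data_share/Ag1000G/Ag3.0/args_trees/tsinfer_data_v2",
    "AG1000G_ACCESSIBILITY": "/sietch_colab/data_share/Ag1000G/Ag3.0/args_trees/singer/agp3.is_accessible.txt.npz",
    "HG1KG_TSZ_DIR": "/sietch_colab/data_share/hg1kg/tsinfer-trees/working",
}


def missing_external_data(env=None, exists=os.path.exists):
    """Return the variables whose resolved path does not exist on this machine."""
    env = os.environ if env is None else env
    missing = []
    for name, default in EXTERNAL_DATA_DEFAULTS.items():
        path = env.get(name, default)  # an environment override wins over the default
        if not exists(path):
            missing.append(name)
    return missing


print(missing_external_data())
```

Any variable in the printed list needs to be pointed at a valid location before the figures that depend on it (6, 7, S9) can be generated.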

Logs

run_fresh.sh writes a timestamped log file (run_fresh_YYYYMMDD_HHMMSS.log) under BASE_DIR. The log captures stdout/stderr from all stages and ends with a summary of wall times and any failures.
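With several runs under the same BASE_DIR, the timestamp in the file name identifies the most recent log. The sketch below parses the run_fresh_YYYYMMDD_HHMMSS.log pattern described above; the helper names `log_timestamp` and `latest_log` are invented for this example.

```python
import re
from datetime import datetime

# Matches the log file name pattern described above.
LOG_PATTERN = re.compile(r"^run_fresh_(\d{8}_\d{6})\.log$")


def log_timestamp(name):
    """Return the datetime encoded in a log file name, or None if it does not match."""
    match = LOG_PATTERN.match(name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
    except ValueError:
        # Digits in the right shape but not a real date or time.
        return None


def latest_log(names):
    """Return the name of the most recent log among the given file names, or None."""
    stamped = [(log_timestamp(name), name) for name in names]
    stamped = [(ts, name) for ts, name in stamped if ts is not None]
    return max(stamped)[1] if stamped else None
```

For example, `latest_log(os.listdir(base_dir))` would select the newest run's log, where `base_dir` is whatever BASE_DIR was used for the runs.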
