Simulation
All training data for cxt is generated using python cxt/simulation_ts_only.py,
which consolidates every simulation scenario into a single CLI. This page
documents the exact commands that produce the training data used in the paper.
The simulation script generates tree sequences using msprime and
stdpopsim, saving them as .trees files organized by scenario. These
tree sequences are then preprocessed into training pairs (see
Preprocessing).
Note
The simulation CLI is the script cxt/simulation_ts_only.py, run by path
rather than as a module (that is, not with python -m). Core simulation
functions such as simulate_parameterized_tree_sequence and
create_sawtooth_demography live in cxt.utils.
Overview
The training data consists of four categories:
Base dataset – constant \(N_e\), sawtooth demography, island model
LLM-style datasets – parameter sweeps over \(N_e\), mutation rate, recombination rate, and selection
stdpopsim mammals – realistic demographic models for humans, other primates, cattle, and dogs, with and without genetic maps
stdpopsim other species – 15 additional scenarios covering 13 further species from the stdpopsim catalog
Each simulation produces 1 Mb tree sequences with 25 diploid individuals (50 haploid samples) by default.
CLI reference
python cxt/simulation_ts_only.py \
--scenario <scenario_name> \
--data_dir <output_directory> \
--num_samples <n_simulations> \
--num_processes <n_parallel> \
[--n_individuals 25] \
[--batch_size 1000]
Key arguments:
--scenario: simulation scenario (see below)
--data_dir: output directory for .trees files
--num_samples: total number of simulations to generate
--num_processes: parallel worker count
--n_individuals: diploid sample count per simulation (default: 25)
--batch_size: number of simulations per batch (default: 1000)
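When runs are launched from Python rather than from a shell script, the documented flags can be assembled into an argument list. The sketch below is a hypothetical helper (build_sim_command is not part of cxt). It uses only the flags and defaults listed above, and it builds the command without running it.

```python
import shlex


def build_sim_command(scenario, data_dir, num_samples, num_processes,
                      n_individuals=25, batch_size=1000):
    """Return the argv list for one simulation run (hypothetical helper)."""
    return [
        "python", "cxt/simulation_ts_only.py",
        "--scenario", scenario,
        "--data_dir", str(data_dir),
        "--num_samples", str(num_samples),
        "--num_processes", str(num_processes),
        "--n_individuals", str(n_individuals),
        "--batch_size", str(batch_size),
    ]


cmd = build_sim_command(
    scenario="constant",
    data_dir="/path/to/training_data/base_dataset",
    num_samples=10000,
    num_processes=50,
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

To launch the run, the list can be passed to subprocess.run(cmd, check=True) from the repository root.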
Full pipeline (paper)
The following script reproduces the complete training dataset. Adjust
DATA_DIR to point to your storage location.
#!/usr/bin/env bash
set -euo pipefail
DATA_DIR=/path/to/training_data
SIM="python cxt/simulation_ts_only.py"
mkdir -p ${DATA_DIR}
mkdir -p ${DATA_DIR}/llm
mkdir -p ${DATA_DIR}/stdpopsim/v0.2
# ================================================================
# 1. Base dataset
# ================================================================
# Constant Ne (10,000 simulations)
${SIM} --num_processes 50 --num_samples 10000 \
--data_dir ${DATA_DIR}/base_dataset --scenario constant
# Sawtooth demography (1,000 simulations)
${SIM} --num_processes 30 --num_samples 1000 \
--data_dir ${DATA_DIR}/ssd --scenario sawtooth
# Island model (1,000 simulations)
${SIM} --num_processes 30 --num_samples 1000 \
--data_dir ${DATA_DIR}/idd --scenario island
# ================================================================
# 2. LLM-style datasets (parameter sweeps)
# ================================================================
${SIM} --num_processes 100 --num_samples 125 \
--data_dir ${DATA_DIR}/llm --scenario llm_ne_sawtooth
${SIM} --num_processes 100 --num_samples 50 \
--data_dir ${DATA_DIR}/llm --scenario llm_hard_sweeps
${SIM} --num_processes 75 --num_samples 50 \
--data_dir ${DATA_DIR}/llm --scenario llm_island_3pop
${SIM} --num_processes 100 --num_samples 500 \
--data_dir ${DATA_DIR}/llm --scenario llm_ne_constant
# ================================================================
# 3. stdpopsim mammals (human, other primates, cattle, dog)
# ================================================================
for scenario in stdpopsim_homsap stdpopsim_homsap_map \
                stdpopsim_bostau stdpopsim_canfam stdpopsim_canfam_map \
                stdpopsim_pantro stdpopsim_papanu stdpopsim_papanu_map \
                stdpopsim_ponabe stdpopsim_ponabe_map; do
    ${SIM} --num_processes 75 --num_samples 1000 \
        --data_dir ${DATA_DIR}/stdpopsim/v0.2/${scenario} \
        --scenario ${scenario}
done
# ================================================================
# 4. stdpopsim other species
# ================================================================
${SIM} --num_processes 100 --num_samples 300 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_aedaeg --scenario stdpopsim_aedaeg
${SIM} --num_processes 100 --num_samples 25 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_anapla --scenario stdpopsim_anapla
${SIM} --num_processes 100 --num_samples 5 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_anocar --scenario stdpopsim_anocar
${SIM} --num_processes 100 --num_samples 100 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_anogam --scenario stdpopsim_anogam
${SIM} --num_processes 100 --num_samples 5 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_apimel --scenario stdpopsim_apimel
${SIM} --num_processes 100 --num_samples 500 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_aratha --scenario stdpopsim_aratha
${SIM} --num_processes 100 --num_samples 500 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_aratha_map --scenario stdpopsim_aratha_map
${SIM} --num_processes 100 --num_samples 1000 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_caeele --scenario stdpopsim_caeele
${SIM} --num_processes 100 --num_samples 1000 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_caeele_map --scenario stdpopsim_caeele_map
${SIM} --num_processes 100 --num_samples 5 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_dromel --scenario stdpopsim_dromel
${SIM} --num_processes 100 --num_samples 300 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_drosec --scenario stdpopsim_drosec
${SIM} --num_processes 100 --num_samples 1000 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_gasacu --scenario stdpopsim_gasacu
${SIM} --num_processes 100 --num_samples 300 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_helann --scenario stdpopsim_helann
${SIM} --num_processes 100 --num_samples 5 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_helmel --scenario stdpopsim_helmel
${SIM} --num_processes 100 --num_samples 1000 \
--data_dir ${DATA_DIR}/stdpopsim/v0.2/stdpopsim_musmus --scenario stdpopsim_musmus
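After the pipeline finishes, it can be useful to check how many tree sequences each run wrote. The following is a minimal sketch, assuming only that outputs carry the .trees extension somewhere beneath DATA_DIR. The helper name count_trees_files is hypothetical, and this page does not specify the layout inside each output directory, so the keys are whatever directories happen to contain the files.

```python
from collections import Counter
from pathlib import Path


def count_trees_files(data_dir):
    """Map each directory under data_dir to the number of .trees files in it."""
    root = Path(data_dir)
    if not root.is_dir():
        return {}
    counts = Counter()
    for path in root.rglob("*.trees"):
        # Key by the containing directory, relative to the root.
        counts[path.parent.relative_to(root).as_posix()] += 1
    return dict(counts)


if __name__ == "__main__":
    # Placeholder path; point this at the DATA_DIR used in the script above.
    for directory, n in sorted(count_trees_files("/path/to/training_data").items()):
        print(f"{directory}: {n}")
```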
Available scenarios
Run python cxt/simulation_ts_only.py --help for the full list. The main
categories are:
Parametric models (msprime):
constant – constant \(N_e = 20{,}000\)
sawtooth – oscillating \(N_e\) (Schiffels & Durbin 2014 zigzag)
island – 3-population island model with migration
random – 64 pre-drawn demographic trajectories with random \(\mu\) and \(r\)
LLM-style parameter sweeps:
llm_ne_constant – 3 \(N_e\) × 2 \(\mu\) × 2 \(r\)
llm_ne_sawtooth – 3 magnitudes × 3 \(N_e\) × 2 \(\mu\) × 2 \(r\)
llm_island_3pop – 2 migration rates × 3 \(N_e\) × 2 \(\mu\) × 2 \(r\)
llm_hard_sweeps – 3 \(N_e\) × 2 \(\mu\) × 2 \(r\) × 3 selection coefficients
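The size of each sweep grid follows directly from the factor counts listed above. The sketch below enumerates the combinations by level index only, since the actual parameter values are not given on this page. The dictionary keys and the helper grid_indices are illustrative names, not part of cxt.

```python
from itertools import product

# Number of levels per swept parameter, taken from the list above.
# Only the counts are used; the parameter values themselves are not
# documented here.
GRID_LEVELS = {
    "llm_ne_constant": {"Ne": 3, "mu": 2, "r": 2},
    "llm_ne_sawtooth": {"magnitude": 3, "Ne": 3, "mu": 2, "r": 2},
    "llm_island_3pop": {"migration_rate": 2, "Ne": 3, "mu": 2, "r": 2},
    "llm_hard_sweeps": {"Ne": 3, "mu": 2, "r": 2, "selection_coefficient": 3},
}


def grid_indices(levels):
    """Enumerate every parameter combination as a tuple of level indices."""
    return list(product(*(range(n) for n in levels.values())))


for scenario, levels in GRID_LEVELS.items():
    print(f"{scenario}: {len(grid_indices(levels))} combinations")
```

This gives 12, 36, 24, and 36 combinations respectively. How --num_samples is distributed across those combinations is determined by the script and is not assumed here.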
stdpopsim species (stdpopsim):
Simulations use species-specific demographic models from the stdpopsim
catalog and sample random chromosomal segments. The _map suffix indicates
that a species-specific genetic map is used.
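| Scenario | Species | Samples |
|---|---|---|
| stdpopsim_homsap, stdpopsim_homsap_map | H. sapiens | 1,000 |
| stdpopsim_bostau | B. taurus | 1,000 |
| stdpopsim_canfam, stdpopsim_canfam_map | C. familiaris | 1,000 |
| stdpopsim_pantro | P. troglodytes | 1,000 |
| stdpopsim_papanu, stdpopsim_papanu_map | P. anubis | 1,000 |
| stdpopsim_ponabe, stdpopsim_ponabe_map | P. abelii | 1,000 |
| stdpopsim_aedaeg | A. aegypti | 300 |
| stdpopsim_anogam | A. gambiae | 100 |
| stdpopsim_aratha, stdpopsim_aratha_map | A. thaliana | 500 |
| stdpopsim_caeele, stdpopsim_caeele_map | C. elegans | 1,000 |
| stdpopsim_dromel | D. melanogaster | 5 |
| stdpopsim_gasacu | G. aculeatus | 1,000 |
| stdpopsim_helann | H. annuus | 300 |
| stdpopsim_helmel | H. melpomene | 5 |
| stdpopsim_apimel | A. mellifera | 5 |
| stdpopsim_musmus | M. musculus | 1,000 |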
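The stdpopsim scenarios are described above as sampling random chromosomal segments. One simple way to do that is to draw a start coordinate uniformly so that the window fits on the chromosome. This is an illustrative sketch only: sample_segment is a hypothetical helper, and this page does not state how simulation_ts_only.py actually selects segments. The 1 Mb default mirrors the sequence length given in the overview.

```python
import random


def sample_segment(chrom_length, segment_length=1_000_000, rng=None):
    """Return (start, end) of a uniformly placed window on a chromosome.

    Hypothetical helper for illustration; coordinates are 0-based and
    the interval is half-open, [start, end).
    """
    if chrom_length < segment_length:
        raise ValueError("chromosome is shorter than the requested segment")
    rng = rng or random.Random()
    # randint is inclusive on both ends, so the last valid start is allowed.
    start = rng.randint(0, chrom_length - segment_length)
    return start, start + segment_length


# Example: a window on a 50 Mb chromosome, seeded for reproducibility.
start, end = sample_segment(50_000_000, rng=random.Random(42))
print(start, end)
```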
Python API
For programmatic use, the core simulation functions are available from
cxt.utils:
from cxt.utils import (
simulate_parameterized_tree_sequence,
create_sawtooth_demography,
)
# Constant Ne
ts = simulate_parameterized_tree_sequence(seed=42, samples=25)
# Sawtooth demography
dem = create_sawtooth_demography(Ne=20_000, magnitude=3)
ts = simulate_parameterized_tree_sequence(seed=42, demography=dem, samples=25)