Preprocessing
After simulation, tree sequences must be preprocessed into the (X, y)
format expected by the training pipeline. The preprocessing module
(python -m cxt.preprocess) extracts multi-scale windowed SFS features
and discretized log-TMRCA targets for each haplotype pair.
What preprocessing does
For each tree sequence and each sampled pair of haplotypes:
- Feature extraction (X): Compute the XOR and XNOR site-frequency spectra at four window scales (2×, 8×, 32×, 64× the base window), yielding a tensor of shape (2, 4, n_windows, n_samples).
- Target extraction (y): Compute the true pairwise TMRCA per window via span-weighted averaging from the simplified two-sample tree, then apply a log transform.
- Data splitting: Files are deterministically assigned to train/ or test/ splits using a grouped hash that ensures all pairs from the same simulation scenario stay in the same split.
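The target-extraction step can be illustrated with a minimal sketch. It is not the code in cxt.preprocess: the function name, its inputs (per-tree interval boundaries and the pair's TMRCA on each interval, as one might read them from a two-sample simplified tree sequence), and the use of the natural logarithm are all assumptions made for illustration.

```python
import numpy as np


def windowed_log_tmrca(lefts, rights, tmrcas, window_size, sequence_length):
    """Span-weighted mean TMRCA per window, followed by a natural log.

    Hypothetical helper. `lefts`/`rights` are tree interval boundaries in bp
    and `tmrcas` is the pair's TMRCA on each interval. The intervals are
    assumed to cover every window completely.
    """
    lefts = np.asarray(lefts, dtype=np.float64)
    rights = np.asarray(rights, dtype=np.float64)
    tmrcas = np.asarray(tmrcas, dtype=np.float64)

    n_windows = int(sequence_length // window_size)
    mean_tmrca = np.zeros(n_windows)
    for w in range(n_windows):
        w_left, w_right = w * window_size, (w + 1) * window_size
        # Overlap in bp between each tree interval and this window.
        overlap = np.clip(
            np.minimum(rights, w_right) - np.maximum(lefts, w_left), 0, None
        )
        # Weight each tree's TMRCA by the span it covers inside the window.
        mean_tmrca[w] = np.sum(overlap * tmrcas) / np.sum(overlap)
    return np.log(mean_tmrca)


# Two trees: [0, 1500) with TMRCA 100 and [1500, 4000) with TMRCA 400.
y = windowed_log_tmrca([0, 1500], [1500, 4000], [100.0, 400.0], 2000, 4000)
```

In this toy input the first window straddles both trees, so its mean is (1500 × 100 + 500 × 400) / 2000 = 175 before the log is taken; the second window lies entirely in the second tree and keeps a TMRCA of 400.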
Output structure:
<out_dir>/
├── train/
│   └── <scenario>/<file_id>/
│       ├── X.npy      # (n_pairs, 2, 4, n_windows, n_samples) float16
│       ├── y.npy      # (n_pairs, n_windows) float16
│       ├── pairs.npy  # (n_pairs, 2) int
│       └── meta.json
└── test/
    └── ...
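A short sketch of how one of these per-file directories could be read back and checked against the documented shapes. The helper names load_file_dir and iter_split are hypothetical and not part of cxt; the contents of meta.json are not described here, so it is simply parsed as JSON.

```python
import json
from pathlib import Path

import numpy as np


def load_file_dir(file_dir):
    """Load X, y, pairs and meta from one <scenario>/<file_id>/ directory."""
    file_dir = Path(file_dir)
    X = np.load(file_dir / "X.npy")
    y = np.load(file_dir / "y.npy")
    pairs = np.load(file_dir / "pairs.npy")
    meta = json.loads((file_dir / "meta.json").read_text())

    # Check the shapes against the layout documented above.
    if X.ndim != 5 or X.shape[1:3] != (2, 4):
        raise ValueError(f"unexpected X shape {X.shape}")
    if y.shape != (X.shape[0], X.shape[3]):
        raise ValueError(f"y shape {y.shape} does not match X {X.shape}")
    if pairs.shape != (X.shape[0], 2):
        raise ValueError(f"pairs shape {pairs.shape} does not match X {X.shape}")
    return X, y, pairs, meta


def iter_split(out_dir, split="train"):
    """Yield (directory, (X, y, pairs, meta)) for every file in a split."""
    for x_path in sorted(Path(out_dir, split).glob("*/*/X.npy")):
        yield x_path.parent, load_file_dir(x_path.parent)
```

Calling `iter_split(out_dir, "test")` walks the test split in the same way.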
CLI reference
python -m cxt.preprocess \
--base_dir <dir_with_tree_sequences> \
--out_subdir <output_name> \
--window_size 2000 \
--num_pairs 200 \
--train_ratio 0.9 \
--global_seed 12345 \
--num_workers 75 \
[--skip_existing] \
[--sequence_length <bp>] \
[--simplify_first_n_samples 50] \
[--bitmask /path/to/bitmask.npz]
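The --train_ratio and --global_seed flags control the deterministic, grouped train/test assignment. The sketch below shows one common way such a split can be built. It is an assumption for illustration only: the hash function (SHA-256), the key format, and the name assign_split are hypothetical and may differ from what cxt.preprocess does internally.

```python
import hashlib


def assign_split(group_key, train_ratio=0.9, global_seed=12345):
    """Map a group key (e.g. a scenario name) to 'train' or 'test'.

    Hypothetical sketch: the seed and key are hashed together, and the
    digest is turned into a number in [0, 1) that is compared with
    train_ratio. Every file that shares a group key lands in the same split.
    """
    digest = hashlib.sha256(f"{global_seed}:{group_key}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform-like in [0, 1)
    return "train" if u < train_ratio else "test"


# All pairs from one scenario share a key, so they cannot be separated.
split_a = assign_split("scenario_a")
split_a_again = assign_split("scenario_a")
```

Because the result depends only on the key and the seed, re-running preprocessing (for example with --skip_existing) would reproduce the same assignment under this scheme.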
Paper datasets
The following six preprocessed datasets are needed to reproduce all checkpoints used in the paper. Each corresponds to a different training regime.
Note
DATA_DIR refers to the directory containing simulated tree sequences
(see Simulation). DATA_DIR_LP is the same directory but may
contain additional large-population simulations used for fine-tuning the
w200 variants. In run_fresh.sh, a ts_large_pop symlink tree is
created that links only the high-Ne stdpopsim species.
0. processed_narrow – constant-only baseline (w2000)
Used to train: narrow
Contains only the constant-\(N_e\) scenario. This is a smaller dataset used for the 6-layer narrow model.
python -m cxt.preprocess \
--base_dir ${DATA_DIR}/base_dataset \
--out_subdir processed_narrow \
--window_size 2000 \
--num_pairs 200 \
--train_ratio 0.9 \
--global_seed 12345 \
--num_workers 75 \
--skip_existing
1. processed – full base dataset (w2000)
Used to train: broad
python -m cxt.preprocess \
--base_dir ${DATA_DIR} \
--out_subdir processed \
--window_size 2000 \
--num_pairs 200 \
--train_ratio 0.9 \
--global_seed 12345 \
--num_workers 75 \
--skip_existing
2. processed_n10 – adapter dataset (10 samples)
Used to train: broad+adapter
Simplifies each tree sequence to 10 haploid samples before feature extraction. Uses fewer pairs (20) since the combinatorial space is smaller.
python -m cxt.preprocess \
--base_dir ${DATA_DIR} \
--out_subdir processed_n10 \
--window_size 2000 \
--num_pairs 20 \
--simplify_first_n_samples 10 \
--train_ratio 0.9 \
--global_seed 12345 \
--num_workers 75
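The remark that fewer pairs are needed with 10 samples can be checked with a quick count. The sketch assumes that pairs are unordered pairs of distinct haplotypes, and takes 50 samples as the size of the other datasets, as listed in the dataset summary; pair_coverage is a hypothetical helper.

```python
from math import comb


def pair_coverage(n_samples, num_pairs):
    """Return (number of possible pairs, fraction covered by num_pairs)."""
    total = comb(n_samples, 2)
    return total, min(num_pairs, total) / total


total_50, frac_50 = pair_coverage(50, 200)  # 1225 possible pairs
total_10, frac_10 = pair_coverage(10, 20)   # 45 possible pairs
```

Under these assumptions, 20 of 45 pairs cover about 44% of the pair space, while 200 of 1,225 pairs cover about 16%.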
3. processed_small_window – w200 dataset
Used to train: broad_w200
Uses 200 bp windows and 100 kb sequence length for fine-scale resolution in large-\(N_e\) species.
python -m cxt.preprocess \
--base_dir ${DATA_DIR_LP} \
--out_subdir processed_small_window \
--window_size 200 \
--sequence_length 100000 \
--num_pairs 200 \
--train_ratio 0.9 \
--global_seed 12345 \
--num_workers 75 \
--skip_existing
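As a rough size check for this configuration, the sketch below derives the X.npy shape and its float16 footprint. It assumes that n_windows equals sequence_length divided by window_size and that n_samples is 50 (the value listed in the dataset summary); both are assumptions about the pipeline rather than documented behavior, and expected_x_shape is a hypothetical helper.

```python
from math import prod


def expected_x_shape(num_pairs, n_samples, window_size, sequence_length):
    """Shape of X.npy, assuming n_windows = sequence_length // window_size."""
    n_windows = sequence_length // window_size
    return (num_pairs, 2, 4, n_windows, n_samples)


shape = expected_x_shape(
    num_pairs=200, n_samples=50, window_size=200, sequence_length=100_000
)
size_bytes = prod(shape) * 2  # float16 uses 2 bytes per element
```

Here `shape` is (200, 2, 4, 500, 50) and `size_bytes` is 80,000,000, about 80 MB per input file under these assumptions.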
4. processed_small_window_missing_data – w200 + missingness
Used to train: w200_wmissing
Same as processed_small_window, but incorporates an accessibility bitmask (e.g. from Ag1000G) to encode per-window missingness into the source tensor.
python -m cxt.preprocess \
--base_dir ${DATA_DIR_LP} \
--out_subdir processed_small_window_missing_data \
--window_size 200 \
--sequence_length 100000 \
--num_pairs 200 \
--train_ratio 0.9 \
--global_seed 12345 \
--num_workers 75 \
--skip_existing \
--bitmask /path/to/bitmask.npz
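To make "per-window missingness" concrete, the sketch below reduces a per-base accessibility mask to a missing fraction per window. The layout of bitmask.npz is not described here, so the input is assumed to be a 1-D boolean array with True meaning accessible; how cxt.preprocess actually encodes the result into the tensor may differ, and window_missing_fraction is a hypothetical helper.

```python
import numpy as np


def window_missing_fraction(accessible, window_size):
    """Fraction of inaccessible bases in each non-overlapping window.

    `accessible` is assumed to be a 1-D mask (True = accessible base).
    Trailing bases that do not fill a whole window are dropped.
    """
    accessible = np.asarray(accessible, dtype=bool)
    n_windows = accessible.size // window_size
    trimmed = accessible[: n_windows * window_size]
    return 1.0 - trimmed.reshape(n_windows, window_size).mean(axis=1)


# Toy mask with three 4-bp windows: half missing, fully accessible, fully missing.
mask = [1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
missing = window_missing_fraction(mask, window_size=4)
```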
5. processed_small_window_missing_data_n10 – w200 + missingness + adapter
Used to train: w200_wmissing_adapter
Combines small-window missingness with 10-sample simplification.
python -m cxt.preprocess \
--base_dir ${DATA_DIR_LP} \
--out_subdir processed_small_window_missing_data_n10 \
--window_size 200 \
--sequence_length 100000 \
--num_pairs 20 \
--simplify_first_n_samples 10 \
--train_ratio 0.9 \
--global_seed 12345 \
--num_workers 75 \
--skip_existing \
--bitmask /path/to/bitmask.npz
Dataset summary
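| Dataset | Window | Pairs | Samples | Missingness |
|---|---|---|---|---|
| processed_narrow | 2,000 bp | 200 | 50 | No (constant only) |
| processed | 2,000 bp | 200 | 50 | No |
| processed_n10 | 2,000 bp | 20 | 10 | No |
| processed_small_window | 200 bp | 200 | 50 | No |
| processed_small_window_missing_data | 200 bp | 200 | 50 | Yes |
| processed_small_window_missing_data_n10 | 200 bp | 20 | 10 | Yes |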
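
The six configurations differ in only a few flags, so they can be written as data and expanded into command lines. The sketch below is a hypothetical convenience script, not part of cxt and not a replacement for run_fresh.sh. It uses only flags that appear in the commands above and takes the two data directories and the bitmask path as arguments.

```python
COMMON = {"train_ratio": 0.9, "global_seed": 12345, "num_workers": 75}
SMALL_WINDOW = {"window_size": 200, "sequence_length": 100000}
N10 = {"num_pairs": 20, "simplify_first_n_samples": 10}


def build_commands(data_dir, data_dir_lp, bitmask_path):
    """Return one argv list per paper dataset, in the order listed above."""
    bitmask = {"bitmask": bitmask_path}
    # (out_subdir, base_dir, overrides, pass --skip_existing?)
    specs = [
        ("processed_narrow", f"{data_dir}/base_dataset", {}, True),
        ("processed", data_dir, {}, True),
        ("processed_n10", data_dir, N10, False),
        ("processed_small_window", data_dir_lp, SMALL_WINDOW, True),
        ("processed_small_window_missing_data", data_dir_lp,
         {**SMALL_WINDOW, **bitmask}, True),
        ("processed_small_window_missing_data_n10", data_dir_lp,
         {**SMALL_WINDOW, **N10, **bitmask}, True),
    ]
    commands = []
    for out_subdir, base_dir, overrides, skip_existing in specs:
        options = {
            "base_dir": base_dir,
            "out_subdir": out_subdir,
            "window_size": 2000,
            "num_pairs": 200,
            **COMMON,
            **overrides,  # dataset-specific values replace the defaults
        }
        argv = ["python", "-m", "cxt.preprocess"]
        for flag, value in options.items():
            argv += [f"--{flag}", str(value)]
        if skip_existing:
            argv.append("--skip_existing")
        commands.append(argv)
    return commands


commands = build_commands("/data/ts", "/data/ts_large_pop", "/path/to/bitmask.npz")
```

Each entry can be passed to `subprocess.run(argv, check=True)`. The directory paths in the last line are placeholders.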