Preprocessing
=============

After simulation, tree sequences must be preprocessed into the ``(X, y)`` format
expected by the training pipeline. The preprocessing module
(``python -m cxt.preprocess``) extracts multi-scale windowed SFS features and
discretized log-TMRCA targets for each haplotype pair.

What preprocessing does
-----------------------

For each tree sequence and each sampled pair of haplotypes:

1. **Feature extraction** (``X``): Compute the XOR and XNOR site-frequency
   spectra at four window scales (2×, 8×, 32×, 64× the base window), yielding a
   tensor of shape ``(2, 4, n_windows, n_samples)``.
2. **Target extraction** (``y``): Compute the true pairwise TMRCA per window via
   span-weighted averaging from the simplified two-sample tree, then apply a
   log transform.
3. **Data splitting**: Files are deterministically assigned to ``train/`` or
   ``test/`` splits using a grouped hash that ensures all pairs from the same
   simulation scenario stay in the same split.

Output structure:

.. code-block:: text

   /
   ├── train/
   │   └── //
   │       ├── X.npy       # (n_pairs, 2, 4, n_windows, n_samples) float16
   │       ├── y.npy       # (n_pairs, n_windows) float16
   │       ├── pairs.npy   # (n_pairs, 2) int
   │       └── meta.json
   └── test/
       └── ...

CLI reference
-------------

.. code-block:: text

   python -m cxt.preprocess \
       --base_dir \
       --out_subdir \
       --window_size 2000 \
       --num_pairs 200 \
       --train_ratio 0.9 \
       --global_seed 12345 \
       --num_workers 75 \
       [--skip_existing] \
       [--simplify_first_n_samples 50] \
       [--bitmask /path/to/bitmask.npz]

Paper datasets
--------------

The following six preprocessed datasets are needed to reproduce all checkpoints
used in the paper. Each corresponds to a different training regime.

.. note::

   ``DATA_DIR`` refers to the directory containing simulated tree sequences
   (see :doc:`simulation`). ``DATA_DIR_LP`` is the same directory but may
   contain additional large-population simulations used for fine-tuning the
   w200 variants. In ``run_fresh.sh``, a ``ts_large_pop`` symlink tree is
   created that links only the high-Ne stdpopsim species.

0. ``processed_narrow`` -- constant-only baseline (w2000)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Used to train: **narrow**

Contains only the constant-:math:`N_e` scenario. This is a smaller dataset used
for the 6-layer narrow model.

.. code-block:: bash

   python -m cxt.preprocess \
       --base_dir ${DATA_DIR}/base_dataset \
       --out_subdir processed_narrow \
       --window_size 2000 \
       --num_pairs 200 \
       --train_ratio 0.9 \
       --global_seed 12345 \
       --num_workers 75 \
       --skip_existing

1. ``processed`` -- full base dataset (w2000)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Used to train: **broad**

.. code-block:: bash

   python -m cxt.preprocess \
       --base_dir ${DATA_DIR} \
       --out_subdir processed \
       --window_size 2000 \
       --num_pairs 200 \
       --train_ratio 0.9 \
       --global_seed 12345 \
       --num_workers 75 \
       --skip_existing

2. ``processed_n10`` -- adapter dataset (10 samples)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Used to train: **broad+adapter**

Simplifies each tree sequence to 10 haploid samples before feature extraction.
Uses fewer pairs (20) since the combinatorial space is smaller.

.. code-block:: bash

   python -m cxt.preprocess \
       --base_dir ${DATA_DIR} \
       --out_subdir processed_n10 \
       --window_size 2000 \
       --num_pairs 20 \
       --simplify_first_n_samples 10 \
       --train_ratio 0.9 \
       --global_seed 12345 \
       --num_workers 75

3. ``processed_small_window`` -- w200 dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Used to train: **broad_w200**

Uses 200 bp windows and 100 kb sequence length for fine-scale resolution in
large-\ :math:`N_e` species.

.. code-block:: bash

   python -m cxt.preprocess \
       --base_dir ${DATA_DIR_LP} \
       --out_subdir processed_small_window \
       --window_size 200 \
       --sequence_length 100000 \
       --num_pairs 200 \
       --train_ratio 0.9 \
       --global_seed 12345 \
       --num_workers 75 \
       --skip_existing

4. ``processed_small_window_missing_data`` -- w200 + missingness
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Used to train: **w200_wmissing**

Same as above but incorporates an accessibility bitmask (from e.g. Ag1000G) to
encode per-window missingness into the source tensor.

.. code-block:: bash

   python -m cxt.preprocess \
       --base_dir ${DATA_DIR_LP} \
       --out_subdir processed_small_window_missing_data \
       --window_size 200 \
       --sequence_length 100000 \
       --num_pairs 200 \
       --train_ratio 0.9 \
       --global_seed 12345 \
       --num_workers 75 \
       --skip_existing \
       --bitmask /path/to/bitmask.npz

5. ``processed_small_window_missing_data_n10`` -- w200 + missingness + adapter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Used to train: **w200_wmissing_adapter**

Combines small-window missingness with 10-sample simplification.

.. code-block:: bash

   python -m cxt.preprocess \
       --base_dir ${DATA_DIR_LP} \
       --out_subdir processed_small_window_missing_data_n10 \
       --window_size 200 \
       --sequence_length 100000 \
       --num_pairs 20 \
       --simplify_first_n_samples 10 \
       --train_ratio 0.9 \
       --global_seed 12345 \
       --num_workers 75 \
       --skip_existing \
       --bitmask /path/to/bitmask.npz

Dataset summary
---------------

.. list-table::
   :header-rows: 1
   :widths: 35 10 10 10 15

   * - Dataset
     - Window
     - Pairs
     - Samples
     - Missingness
   * - ``processed_narrow``
     - 2,000 bp
     - 200
     - 50
     - No (constant only)
   * - ``processed``
     - 2,000 bp
     - 200
     - 50
     - No
   * - ``processed_n10``
     - 2,000 bp
     - 20
     - 10
     - No
   * - ``processed_small_window``
     - 200 bp
     - 200
     - 50
     - No
   * - ``processed_small_window_missing_data``
     - 200 bp
     - 200
     - 50
     - Yes
   * - ``processed_small_window_missing_data_n10``
     - 200 bp
     - 20
     - 10
     - Yes
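
All six datasets share ``--train_ratio 0.9`` and ``--global_seed 12345``, the two values that drive the grouped train/test assignment described under "What preprocessing does". The sketch below illustrates one way a deterministic grouped hash split can work. It is not taken from ``cxt.preprocess``: the function name ``assign_split``, the use of SHA-256, the way the seed is mixed into the key, and the scenario names are all assumptions made for illustration.

```python
import hashlib


def assign_split(group_key: str, train_ratio: float = 0.9,
                 global_seed: int = 12345) -> str:
    """Deterministically map a group key to 'train' or 'test'.

    Hypothetical helper: the hashing scheme used by cxt.preprocess is not
    documented here, so SHA-256 over "seed:key" is an assumption.
    """
    digest = hashlib.sha256(f"{global_seed}:{group_key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    value = int.from_bytes(digest[:8], "big")
    # Keys whose hash falls below the threshold go to the training split.
    threshold = int(train_ratio * 2**64)
    return "train" if value < threshold else "test"


# Hypothetical scenario names; every pair from one scenario shares its key,
# so all of them receive the same split.
scenarios = [f"scenario_{i:03d}" for i in range(1000)]
splits = {name: assign_split(name) for name in scenarios}
train_fraction = sum(s == "train" for s in splits.values()) / len(splits)
print(f"train fraction: {train_fraction:.3f}")
```

Because the assignment depends only on the group key and the seed, rerunning preprocessing (for example with ``--skip_existing``) would place each scenario in the same split, and the realized train fraction only approximates ``train_ratio`` when the number of groups is small.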