Full Reproducibility
====================

``scripts/run_fresh.sh`` reproduces the full pipeline in a single isolated
directory. It bootstraps a ``uv`` virtualenv (installing ``uv`` first if it
is missing), simulates new data, preprocesses it, trains all models from
scratch, and generates all figures.


Quick start (fresh run)
-----------------------

Run everything (simulations → preprocessing → training → figures) in a
single isolated directory:

.. code-block:: bash

   ./scripts/run_fresh.sh

By default all outputs go to ``/sietch_colab/data_share/cxt_scratch/``.
Override with the ``BASE_DIR`` environment variable:

.. code-block:: bash

   BASE_DIR=/scratch/myuser/cxt_run ./scripts/run_fresh.sh

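By default the script uses GPUs 0 and 1, with 80 CPU workers each for
simulation and preprocessing and 16 DataLoader workers for training. These
settings are defined at the top of the script. A ``uv`` virtualenv is
created at ``$BASE_DIR/.venv`` and reused on subsequent runs. To recreate
it, delete the directory and rerun (shown here for the default
``BASE_DIR``):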

.. code-block:: bash

   rm -rf /sietch_colab/data_share/cxt_scratch/.venv
   ./scripts/run_fresh.sh

Run individual stages:

.. code-block:: bash

   ./scripts/run_fresh.sh simulate       # only simulations
   ./scripts/run_fresh.sh preprocess      # only preprocessing
   ./scripts/run_fresh.sh train           # only training
   ./scripts/run_fresh.sh figures         # only figures
   ./scripts/run_fresh.sh train figures   # multiple stages
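
When the script is wrapped in a job submission or a larger workflow, it can
help to check stage names before launching. The sketch below assumes only
the four stage names listed above. ``validate_stages`` is a hypothetical
helper and is not part of ``run_fresh.sh``.

.. code-block:: bash

   # Hypothetical helper: succeed only if every argument is a known stage.
   validate_stages() {
       local stage
       for stage in "$@"; do
           case "$stage" in
               simulate|preprocess|train|figures) ;;
               *)
                   echo "unknown stage: $stage" >&2
                   return 1
                   ;;
           esac
       done
   }

A wrapper could then call
``validate_stages train figures && ./scripts/run_fresh.sh train figures``.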


Pipeline overview
-----------------

.. code-block:: text

   ┌──────────────────────────────────────────────────────────────────────┐
   │                                                                      │
   │  STAGE 1: SIMULATE      python cxt/simulation_ts_only.py            │
   │  ─────────────────                                                   │
   │  35+ scenarios (constant, sawtooth, island, LLM sweeps,             │
   │  10 stdpopsim mammals, 15 stdpopsim other species)                  │
   │  → .trees files in DATA_DIR/                                        │
   │                                                                      │
   │  STAGE 2: PREPROCESS    python -m cxt.preprocess                    │
   │  ──────────────────                                                  │
   │  6 preprocessed datasets:                                            │
   │    processed_narrow         (w2000, n50, 200 pairs, constant only)  │
   │    processed                 (w2000, n50, 200 pairs)                │
   │    processed_n10             (w2000, n10, 20 pairs)                 │
   │    processed_small_window    (w200, n50, 200 pairs)                 │
   │    processed_small_window_missing_data        (w200, n50, bitmask)  │
   │    processed_small_window_missing_data_n10    (w200, n10, bitmask)  │
   │  → X.npy, y.npy, pairs.npy per simulation                          │
   │                                                                      │
   │  STAGE 3: TRAIN         python -m cxt.train                         │
   │  ──────────────                                                      │
   │  6 checkpoints in dependency order:                                  │
   │    narrow           ← processed_narrow (constant only)              │
   │    broad            ← processed                                     │
   │    broad_w200       ← processed_small_window + broad ckpt           │
   │    broad+adapter    ← processed_n10 + broad ckpt                    │
   │    w200_wmissing    ← processed_sw_missing + broad_w200 ckpt        │
   │    w200_wmissing_adapter ← processed_sw_missing_n10                 │
   │                        + broad+adapter (--resume-adapter)           │
   │                                                                      │
   │  STAGE 4: FIGURES       python -m figures.main.*                     │
   │  ────────────────                                                    │
   │  8 main figures (Fig 1-8) + 6 supplementary (S4, S5, S6, S9-S11)   │
   │  → figures/output/main/ and figures/output/supplementary/            │
   │                                                                      │
   └──────────────────────────────────────────────────────────────────────┘


Configuration
-------------

``run_fresh.sh`` places everything under a single ``BASE_DIR``:

.. code-block:: text

   BASE_DIR/
   ├── data/                  # simulated .trees + preprocessed datasets
   ├── lightning_logs/        # PyTorch Lightning training logs
   ├── checkpoints/           # installed model checkpoints (for cxt.load_model)
   └── figures/output/        # generated figure PNGs
       ├── main/
       └── supplementary/

Hardware and path settings are at the top of the script:

.. list-table::
   :header-rows: 1
   :widths: 25 45 30

   * - Variable
     - Description
     - Default
   * - ``BASE_DIR``
     - Root for all outputs
     - ``/sietch_colab/data_share/cxt_scratch``
   * - ``GPUS``
     - GPU indices for training and figures
     - ``0 1``
   * - ``SIM_WORKERS``
     - Parallel workers for simulation
     - ``80``
   * - ``PREPROCESS_WORKERS``
     - Parallel workers for preprocessing
     - ``80``
   * - ``TRAIN_WORKERS``
     - DataLoader workers for training
     - ``16``

The script sets ``CXT_CHECKPOINT_CACHE=$BASE_DIR/checkpoints`` so that
``cxt.load_model()`` uses the freshly trained checkpoints instead of the
global cache at ``~/.cache/cxt/checkpoints/``. It also sets
``CUDA_VISIBLE_DEVICES`` so that figure scripts see the correct GPUs.
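
To run a single figure script outside ``run_fresh.sh``, the same two
variables would have to be set by hand. The following is a minimal sketch,
assuming that ``GPUS`` is a list of indices separated by single spaces (as
in the table above) and that ``CUDA_VISIBLE_DEVICES`` takes the usual
comma-separated form. ``export_cxt_env`` is a hypothetical helper, not
something the script provides.

.. code-block:: bash

   # Hypothetical helper mirroring the environment described above.
   # Usage: export_cxt_env <base_dir> "<gpu indices separated by spaces>"
   export_cxt_env() {
       local base_dir="$1" gpus="$2"
       # Turn "0 1" into "0,1"; assumes no leading or trailing spaces.
       CUDA_VISIBLE_DEVICES="$(printf '%s' "$gpus" | tr -s ' ' ',')"
       CXT_CHECKPOINT_CACHE="$base_dir/checkpoints"
       export CUDA_VISIBLE_DEVICES CXT_CHECKPOINT_CACHE
   }

After ``export_cxt_env /scratch/myuser/cxt_run "0 1"``, a figure module
could be started with something like
``python -m figures.main.fig2_benchmark_comparison``. That module path is
inferred from the pipeline overview and may differ in the repository.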

Figure-specific external data paths can also be overridden with
environment variables:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Variable
     - Default
   * - ``AG1000G_DATA_DIR``
     - ``/sietch_colab/data_share/Ag1000G/Ag3.0/args_trees/tsinfer_data_v2``
   * - ``AG1000G_ACCESSIBILITY``
     - ``/sietch_colab/data_share/Ag1000G/Ag3.0/args_trees/singer/agp3.is_accessible.txt.npz``
   * - ``HG1KG_TSZ_DIR``
     - ``/sietch_colab/data_share/hg1kg/tsinfer-trees/working``


Example: custom paths
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   BASE_DIR=/scratch/myuser/cxt_run ./scripts/run_fresh.sh
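
External data locations can be redirected in the same way. The paths below
are placeholders rather than real locations, and the sketch assumes the
variables are exported in the shell before the script starts:

.. code-block:: bash

   # Placeholder paths; substitute the locations of your own copies.
   export BASE_DIR=/scratch/myuser/cxt_run
   export HG1KG_TSZ_DIR=/scratch/myuser/external/hg1kg
   export AG1000G_DATA_DIR=/scratch/myuser/external/ag1000g/trees
   export AG1000G_ACCESSIBILITY=/scratch/myuser/external/ag1000g/accessible.npz

With these set, ``./scripts/run_fresh.sh figures`` would be expected to
write under the new ``BASE_DIR`` and read external data from the new
locations.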


Simulation details
------------------

Stage 1 runs 35+ simulation batches across four categories (using
``python cxt/simulation_ts_only.py``):

**Base dataset** (3 scenarios):

.. list-table::
   :header-rows: 1
   :widths: 20 15 65

   * - Scenario
     - Samples
     - Description
   * - ``constant``
     - 10,000
     - Constant :math:`N_e = 20{,}000`
   * - ``sawtooth``
     - 1,000
     - Oscillating :math:`N_e` (Schiffels & Durbin zigzag)
   * - ``island``
     - 1,000
     - 3-population island model with migration

**LLM-style sweeps** (4 scenarios):

.. list-table::
   :header-rows: 1
   :widths: 25 10 65

   * - Scenario
     - Samples
     - Description
   * - ``llm_ne_sawtooth``
     - 125
     - 3 magnitudes × 3 :math:`N_e` × 2 :math:`\mu` × 2 :math:`r`
   * - ``llm_hard_sweeps``
     - 50
     - 3 :math:`N_e` × 2 :math:`\mu` × 2 :math:`r` × 3 sel. coefficients
   * - ``llm_island_3pop``
     - 50
     - 2 migration × 3 :math:`N_e` × 2 :math:`\mu` × 2 :math:`r`
   * - ``llm_ne_constant``
     - 500
     - 3 :math:`N_e` × 2 :math:`\mu` × 2 :math:`r`

**stdpopsim mammals** (10 scenarios, 1,000 samples each):
``homsap``, ``homsap_map``, ``bostau``, ``canfam``, ``canfam_map``,
``pantro``, ``papanu``, ``papanu_map``, ``ponabe``, ``ponabe_map``

**stdpopsim other species** (15 scenarios, 5--1,000 samples):
``aedaeg``, ``anapla``, ``anocar``, ``anogam``, ``apimel``, ``aratha``,
``aratha_map``, ``caeele``, ``caeele_map``, ``dromel``, ``drosec``,
``gasacu``, ``helann``, ``helmel``, ``musmus``


Preprocessing details
---------------------

Stage 2 creates six preprocessed datasets. See :doc:`preprocessing` for
the full schema. The datasets differ in window size, sample count, and
whether a missingness bitmask is encoded:

.. list-table::
   :header-rows: 1
   :widths: 40 10 10 10 10

   * - Dataset
     - Window
     - Pairs
     - Samples
     - Bitmask
   * - ``processed_narrow``
     - 2,000 bp
     - 200
     - 50
     - No (constant scenario only)
   * - ``processed``
     - 2,000 bp
     - 200
     - 50
     - No
   * - ``processed_n10``
     - 2,000 bp
     - 20
     - 10
     - No
   * - ``processed_small_window``
     - 200 bp
     - 200
     - 50
     - No
   * - ``processed_small_window_missing_data``
     - 200 bp
     - 200
     - 50
     - Yes
   * - ``processed_small_window_missing_data_n10``
     - 200 bp
     - 20
     - 10
     - Yes
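
After preprocessing, a quick completeness check can confirm that every
dataset was written. The sketch below assumes that each dataset is a
directory named as in the table (for example under ``$BASE_DIR/data/``) and
that each simulation keeps its ``X.npy``, ``y.npy``, and ``pairs.npy``
together in one directory. The exact on-disk layout is an assumption, and
both functions are hypothetical helpers.

.. code-block:: bash

   # Count directories under a dataset that hold X.npy, y.npy and
   # pairs.npy together. Prints 0 if the dataset directory is missing.
   count_complete_sims() {
       local dataset_dir="$1"
       find "$dataset_dir" -type f -name X.npy 2>/dev/null |
           while IFS= read -r x; do
               dir="$(dirname "$x")"
               if [ -f "$dir/y.npy" ] && [ -f "$dir/pairs.npy" ]; then
                   echo "$dir"
               fi
           done | wc -l | tr -d ' '
   }

   # Print one line per dataset from the table: name, then complete count.
   report_datasets() {
       local data_dir="$1" name
       for name in processed_narrow processed processed_n10 \
                   processed_small_window \
                   processed_small_window_missing_data \
                   processed_small_window_missing_data_n10; do
           printf '%-42s %s\n' "$name" "$(count_complete_sims "$data_dir/$name")"
       done
   }

Calling ``report_datasets "$BASE_DIR/data"`` prints six lines. A count of
zero suggests that a dataset is missing or that the layout differs from the
one assumed here.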


Training details
----------------

Stage 3 trains six model checkpoints in dependency order, so every
fine-tuned model starts from a parent checkpoint that has already been
trained. ``run_fresh.sh`` installs each checkpoint into
``BASE_DIR/checkpoints/`` after training so that figure generation uses the
freshly trained models.

Which model trains on which dataset:

.. list-table::
   :header-rows: 1
   :widths: 18 18 25 18 10 10 12

   * - Model
     - Fine-tuning
     - Dataset
     - Source scenarios
     - Window
     - Samples
     - Pairs
   * - ``narrow``
     - No (from scratch)
     - ``processed_narrow``
     - constant only
     - w2000
     - 50
     - 200
   * - ``broad``
     - No (from scratch)
     - ``processed``
     - all
     - w2000
     - 50
     - 200
   * - ``broad_w200``
     - Yes ← ``broad``
     - ``processed_small_window``
     - 13 high-Ne stdpopsim
     - w200
     - 50
     - 200
   * - ``broad+adapter``
     - Yes ← ``broad``
     - ``processed_n10``
     - all
     - w2000
     - 10
     - 20
   * - ``w200_wmissing``
     - Yes ← ``broad_w200``
     - ``processed_small_window_missing_data``
     - 13 high-Ne stdpopsim
     - w200
     - 50
     - 200 + bitmask
   * - ``w200_wmissing_adapter``
     - Yes ← ``broad+adapter`` (``--resume-adapter``)
     - ``processed_small_window_missing_data_n10``
     - 13 high-Ne stdpopsim
     - w200
     - 10
     - 20 + bitmask

See :doc:`training` for the full checkpoint commands and hyperparameters.
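
The dependency chain in the table can be written down as a small lookup,
which is handy when retraining one model by hand and checking which parent
checkpoints must exist first. ``parent_of`` and ``lineage`` are hypothetical
helpers that only restate the table. They are not part of ``cxt`` or
``run_fresh.sh``.

.. code-block:: bash

   # Parent checkpoint of each model, as listed in the table above.
   # Prints an empty line for models trained from scratch.
   parent_of() {
       case "$1" in
           narrow|broad)               echo "" ;;
           broad_w200|"broad+adapter") echo "broad" ;;
           w200_wmissing)              echo "broad_w200" ;;
           w200_wmissing_adapter)      echo "broad+adapter" ;;
           *)                          return 1 ;;
       esac
   }

   # Print the chain of checkpoints from the root model to the given one.
   lineage() {
       local model="$1" chain="$1" parent
       if ! parent_of "$model" >/dev/null; then
           echo "unknown model: $model" >&2
           return 1
       fi
       while parent="$(parent_of "$model")" && [ -n "$parent" ]; do
           chain="$parent -> $chain"
           model="$parent"
       done
       echo "$chain"
   }

For example, ``lineage w200_wmissing_adapter`` prints
``broad -> broad+adapter -> w200_wmissing_adapter``.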


Figure details
--------------

Stage 4 generates all paper figures. Each figure script caches its
intermediate results (simulated tree sequences, TMRCA predictions) so
re-runs skip expensive computation.

**Main figures:**

.. list-table::
   :header-rows: 1
   :widths: 10 25 25 40

   * - Fig
     - Script
     - Models
     - Description
   * - 1
     - ``fig1_model_schematic``
     - narrow
     - Model architecture and batch inference demo
   * - 2
     - ``fig2_benchmark_comparison``
     - narrow, broad
     - True vs predicted TMRCA (cxt, SMC++, SINGER)
   * - 3
     - ``fig3_stdpopsim_v2_coalescence``
     - broad
     - KDE comparison across 16 stdpopsim v0.2 species
   * - 4
     - ``fig4_stdpopsim_v3_ood``
     - broad
     - Out-of-distribution evaluation on 6 v0.3 species
   * - 5
     - ``fig5_demography_inference``
     - broad
     - IICR for H. sapiens, B. taurus, A. thaliana
   * - 6
     - ``fig6_human_1kg``
     - broad
     - TMRCA landscapes for GBR (chr2, chr6, LCT, HLA)
   * - 7
     - ``fig7_mosquito_rdl``
     - w200_wmissing
     - A. gambiae RDL region across 5 populations
   * - 8
     - ``fig8_inversion_coalescence``
     - (uses Fig 7 cache)
     - In(2L)a inversion coalescence patterns

**Supplementary figures:**

.. list-table::
   :header-rows: 1
   :widths: 10 30 60

   * - Fig
     - Script
     - Description
   * - S4
     - ``figS4_sample_size_adapter``
     - Sample-size adapter (n=5 vs n=25)
   * - S5
     - ``figS5_window_resolution``
     - Window size effect (w=2000, 200, 20 bp) + residual model
   * - S6
     - ``figS6_runtime_benchmark``
     - Runtime comparison: cxt vs SMC++ vs SINGER
   * - S9
     - ``figS9_mosquito_comparison``
     - RDL region: cxt vs SINGER+Polegon vs SMC++
   * - S10
     - ``figS10_cross_coalescence``
     - Cross-population coalescence (OutOfAfrica_2T12)
   * - S11
     - ``figS11_interpolation_grid``
     - Mutation × recombination grid evaluation

Outputs land in ``BASE_DIR/figures/output/main/`` and
``BASE_DIR/figures/output/supplementary/``.


External data dependencies
--------------------------

Some figures require external genomic datasets that are **not** generated by
the simulation stage:

.. list-table::
   :header-rows: 1
   :widths: 20 30 50

   * - Figures
     - Variable
     - Data
   * - 6
     - ``HG1KG_TSZ_DIR``
     - Human 1000 Genomes tsinfer trees + masks
   * - 7, S9
     - ``AG1000G_DATA_DIR``
     - Ag1000G chr2L dated trees
   * - 7, S9
     - ``AG1000G_ACCESSIBILITY``
     - Ag1000G per-site accessibility bitmask
   * - S6
     - (set in ``figures/paths.py``)
     - Benchmark timing logs (JSONL)
   * - S9
     - (set in ``figures/paths.py``)
     - SINGER and SMC++ revision caches

These paths are resolved via ``figures/paths.py`` and can be overridden
with environment variables.
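
Because the figure stage runs last, a missing external dataset might only
show up late in a run. The sketch below checks the three documented
variables up front and falls back to the defaults listed under
Configuration when a variable is unset. ``check_path`` and
``check_external_data`` are hypothetical helpers. The paths that are only
defined in ``figures/paths.py`` are not covered.

.. code-block:: bash

   # Report a missing path on stderr; succeed if the path exists.
   check_path() {
       local name="$1" path="$2"
       if [ ! -e "$path" ]; then
           echo "$name: missing path $path" >&2
           return 1
       fi
   }

   # Check every documented external input, using the documented default
   # when the environment variable is unset. Returns non-zero if any is
   # missing.
   check_external_data() {
       local status=0
       local ag=/sietch_colab/data_share/Ag1000G/Ag3.0/args_trees
       check_path HG1KG_TSZ_DIR \
           "${HG1KG_TSZ_DIR:-/sietch_colab/data_share/hg1kg/tsinfer-trees/working}" || status=1
       check_path AG1000G_DATA_DIR \
           "${AG1000G_DATA_DIR:-$ag/tsinfer_data_v2}" || status=1
       check_path AG1000G_ACCESSIBILITY \
           "${AG1000G_ACCESSIBILITY:-$ag/singer/agp3.is_accessible.txt.npz}" || status=1
       return "$status"
   }

A wrapper could run
``check_external_data && ./scripts/run_fresh.sh figures`` so that the figure
stage starts only when all three inputs are present.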


Logs
----

``run_fresh.sh`` writes a timestamped log file
(``run_fresh_YYYYMMDD_HHMMSS.log``) under ``BASE_DIR``. The log captures
stdout/stderr from all stages and ends with a summary of wall times and
any failures.
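
Since the timestamp in the file name sorts chronologically, the most recent
log can be located with a plain glob. This is a minimal sketch, assuming
logs sit directly under ``BASE_DIR`` as described above. ``latest_log`` is a
hypothetical helper.

.. code-block:: bash

   # Print the path of the most recent run_fresh log in the given
   # directory. Returns non-zero if no log is found.
   latest_log() {
       local base_dir="$1" f latest=""
       # Glob results are sorted, so the last match has the newest timestamp.
       for f in "$base_dir"/run_fresh_*.log; do
           if [ -e "$f" ]; then
               latest="$f"
           fi
       done
       [ -n "$latest" ] || return 1
       echo "$latest"
   }

For example, ``tail -n 40 "$(latest_log "$BASE_DIR")"`` shows the end of the
latest log, where the summary is written. The line count of 40 is arbitrary.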