.. cxt documentation master file

cxt: Coalescence x Translation
=============================================

``cxt`` is a transformer-based method for inferring pairwise coalescent times
(TMRCA) from genotype data using a language-modelling approach. For each pair
of haplotypes it computes a multi-scale site-frequency spectrum (SFS) in
sliding windows, feeds it through a token-free transformer decoder, and
outputs a discretized log-TMRCA profile across the genome.

.. figure:: figures/figure1.png
   :align: center
   :width: 100%
   :alt: cxt model architecture

   **Figure 1.** Left: cxt introduces the notion of next-coalescence
   prediction. It is a language model that conditions on a chosen "pivot"
   haplotype pair and predicts that pair's TMRCA in each window. The inputs
   are mutational tensors constructed from the pivot pair, together with SFS
   values computed in windows across a focal region. The model works
   autoregressively: after each window is predicted, that estimate is
   appended to the context and supplied to the next step, yielding a
   step-wise reconstruction of the entire pairwise coalescent history.
   Right: all :math:`\binom{50}{2} = 1225` pairwise coalescence curves for a
   sample of 50 haplotypes, inferred simultaneously in under five minutes on
   a single NVIDIA A100 GPU.
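
The autoregressive loop described in the caption can be illustrated with a
toy sketch. This is not the cxt implementation: ``decode_windows``,
``predict_next``, and ``toy_predictor`` are hypothetical names, the number of
bins is arbitrary, and the placeholder predictor merely stands in for the
transformer decoder.

.. code-block:: python

   import numpy as np


   def decode_windows(features, predict_next, rng):
       """Toy autoregressive loop: one discretized log-TMRCA bin per window.

       features     : array (n_windows, n_features), hypothetical inputs
       predict_next : hypothetical callable (window_features, history) ->
                      probabilities over bins (stands in for the decoder)
       """
       history = []
       for window_features in features:
           probs = predict_next(window_features, history)
           # Stochastic sampling: draw one bin from the predicted distribution.
           bin_index = int(rng.choice(len(probs), p=probs))
           # Append the estimate so the next window conditions on it.
           history.append(bin_index)
       return np.array(history)


   def toy_predictor(window_features, history, n_bins=8):
       """Placeholder model that favours the previous bin (not cxt's model)."""
       logits = np.zeros(n_bins)
       if history:
           logits[history[-1]] = 2.0
       probs = np.exp(logits - logits.max())
       return probs / probs.sum()


   rng = np.random.default_rng(0)
   profile = decode_windows(np.zeros((5, 3)), toy_predictor, rng)

Running the loop several times with different random seeds gives the kind of
replicate predictions from which per-window uncertainty can be estimated.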

Key features
------------

- **No tree-sequence inference required** -- works directly on genotype
  matrices, VCF files, or ``tskit`` tree sequences.
- **Stochastic sampling** -- multiple replicate predictions yield uncertainty
  estimates for each genomic window.
- **Bias correction** -- optional Bayesian diversity-based correction accounts
  for mutation-rate scaling and missingness.
- **Multiple model variants** -- ``narrow``, ``broad``, ``residual``,
  ``broad_w200``, ``w200_wmissing``, and adapter-based models for different
  sample sizes and data characteristics.
- **Multi-GPU inference** -- pairs are automatically sharded across GPUs for
  high throughput.
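
To make the idea of sharding pairs concrete, the sketch below enumerates all
unordered haplotype pairs and splits them into one work list per device. How
cxt itself assigns pairs to GPUs is not specified here; the round-robin
scheme and the helper name ``shard_pairs`` are assumptions made for
illustration only.

.. code-block:: python

   from itertools import combinations
   from math import comb


   def shard_pairs(n_haplotypes, n_shards):
       """Round-robin split of all unordered haplotype pairs into n_shards lists."""
       pairs = list(combinations(range(n_haplotypes), 2))
       return [pairs[i::n_shards] for i in range(n_shards)]


   # 50 haplotypes give comb(50, 2) = 1225 pairs, here spread over 4 shards.
   shards = shard_pairs(n_haplotypes=50, n_shards=4)
   total_pairs = sum(len(shard) for shard in shards)

Because each pair is processed independently in this sketch, the shards can
be dispatched to separate devices without any communication between them.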

Model variants
--------------

.. list-table::
   :header-rows: 1
   :widths: 20 20 10 50

   * - Name
     - Preset
     - Layers
     - Description
   * - ``narrow``
     - ``PRESETS["narrow"]``
     - 6
     - Smaller model, faster inference
   * - ``broad``
     - ``PRESETS["broad"]``
     - 10
     - Main model, best overall accuracy
   * - ``residual``
     - ``PRESETS["residual"]``
     - 10
     - Predicts log-residuals from the population mean
   * - ``broad_w200``
     - ``PRESETS["broad_w200"]``
     - 10
     - 200 bp windows for fine-scale resolution in large-\ :math:`N_e` species
   * - ``w200_wmissing``
     - ``PRESETS["w200_wmissing"]``
     - 10
     - 200 bp windows with explicit missingness support
   * - ``broad+adapter``
     - adapter on ``broad``
     - 10
     - 10-sample adapter on the broad backbone
   * - ``w200_wmissing_adapter``
     - adapter on ``w200_wmissing``
     - 10
     - 10-sample adapter with missingness support

.. figure:: decoding.gif
   :align: center
   :width: 80%
   :alt: Decoding process in cxt

   Window-wise autoregressive decoding of coalescence times. Multiple
   stochastic replicates are averaged to obtain a robust TMRCA estimate.
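
Averaging replicates can be sketched with NumPy as follows. The array layout
(replicates by windows), the choice to average in log space, and the helper
name ``summarize_replicates`` are assumptions for illustration; the input
values are made up and are not model output.

.. code-block:: python

   import numpy as np


   def summarize_replicates(log_tmrca):
       """Per-window mean and sample standard deviation across replicates.

       log_tmrca : array (n_replicates, n_windows) of sampled log-TMRCA values
       """
       log_tmrca = np.asarray(log_tmrca, dtype=float)
       mean = log_tmrca.mean(axis=0)
       spread = log_tmrca.std(axis=0, ddof=1)
       return mean, spread


   # Three made-up replicates over three windows.
   replicates = np.array([[8.0, 9.0, 10.0],
                          [8.5, 9.0, 11.0],
                          [7.5, 9.0, 12.0]])
   mean, spread = summarize_replicates(replicates)

The per-window spread gives a simple measure of how much the replicates
disagree, which is one way to read the uncertainty mentioned under
"Stochastic sampling" above.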


.. toctree::
   :maxdepth: 2
   :caption: Getting Started

   installation
   quick_start
   verification

.. toctree::
   :maxdepth: 2
   :caption: Usage Guide

   examples
   finetune_missingness
   demography
   human
   mosquito

.. toctree::
   :maxdepth: 2
   :caption: Reproducing the Paper

   reproduce
   simulation
   preprocessing
   training

.. toctree::
   :maxdepth: 2
   :caption: Reference

   cxt


Indices and tables
==================

* :ref:`genindex`
* :ref:`search`