cxt: Coalescence x Translation
cxt is a transformer-based method for inferring pairwise coalescent times
(TMRCA) from genotype data using a language-modelling approach. For each pair
of haplotypes it computes a multi-scale site-frequency spectrum (SFS) in
sliding windows, feeds it through a token-free transformer decoder, and
outputs a discretized log-TMRCA profile across the genome.
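To give a feel for the kind of per-window signal a pairwise method can draw on, here is a minimal sketch that counts the sites at which two haplotypes differ in each window. It is not cxt's feature pipeline: the function name `windowed_pairwise_diffs`, the non-overlapping window layout, and the toy data are all assumptions made for illustration, and the multi-scale SFS described above is considerably richer.

```python
def windowed_pairwise_diffs(hap_a, hap_b, positions, window_size, seq_length):
    """Count differing sites between two haplotypes per non-overlapping window.

    Hypothetical helper for illustration only; not part of cxt.
    hap_a, hap_b: allele states (0/1) at each variant site
    positions:    0-based genomic coordinate of each site
    """
    n_windows = -(-seq_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for a, b, pos in zip(hap_a, hap_b, positions):
        if a != b:
            counts[pos // window_size] += 1
    return counts


# Toy data: six variant sites on a 3 kb region, 1 kb windows.
hap_a = [0, 1, 0, 0, 1, 1]
hap_b = [0, 0, 0, 1, 1, 0]
positions = [120, 480, 950, 1300, 2100, 2750]
counts = windowed_pairwise_diffs(hap_a, hap_b, positions, 1000, 3000)
print(counts)
```

Under an infinite-sites model, the expected number of differences in a window of length \(L\) is \(2 \mu L T\), where \(\mu\) is the per-base mutation rate and \(T\) is the pair's TMRCA in generations. This is why windowed mutation counts carry information about local coalescence times.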
Figure 1. cxt introduces the notion of next-coalescence prediction. Left: cxt is a language model that conditions on a chosen “pivot” haplotype pair and predicts the pair’s TMRCA for each window. It ingests mutational tensors constructed from the pivot pair together with SFS values computed in windows across a focal region. The model works autoregressively: after each window is predicted, the estimate is appended to the context and supplied to the next step, yielding a step-wise reconstruction of the entire pairwise coalescent history. Right: all \(\binom{50}{2}\) pairwise coalescence curves for a sample of 50 haplotypes, inferred simultaneously in under five minutes on a single NVIDIA A100 GPU.
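The window-by-window decoding described in the caption can be sketched as a plain loop. This sketch assumes a stand-in "model": `toy_next_bin` is a hypothetical placeholder that ignores its input features and takes a small random step from the previous bin. The names `decode_pair` and `n_bins` are likewise illustrative and are not taken from the cxt API. Only the control flow, in which each prediction is appended to the context before the next step, mirrors the description.

```python
import math
import random


def toy_next_bin(features, context, n_bins, rng):
    """Hypothetical stand-in for the trained decoder (not cxt's model).

    Ignores `features` and samples a bin index within one step of the
    previous prediction, clipped to the valid range.
    """
    prev = context[-1] if context else n_bins // 2
    return min(max(prev + rng.choice([-1, 0, 1]), 0), n_bins - 1)


def decode_pair(window_features, n_bins, seed=0):
    """Predict one discretized log-TMRCA bin per window, left to right."""
    rng = random.Random(seed)
    context = []  # predictions made so far, fed back at every step
    for features in window_features:
        context.append(toy_next_bin(features, context, n_bins, rng))
    return context


# A sample of 50 haplotypes gives C(50, 2) pairs to decode.
n_pairs = math.comb(50, 2)
profile = decode_pair(window_features=[None] * 10, n_bins=32, seed=0)
print(n_pairs, len(profile))
```

Running the loop with different seeds yields different profiles for the same input, which is the sense in which replicate predictions can be drawn for a single pair.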
Key features
No tree-sequence inference required – works directly on genotype matrices, VCF files, or tskit tree sequences.
Stochastic sampling – multiple replicate predictions yield uncertainty estimates for each genomic window.
Bias correction – optional Bayesian diversity-based correction accounts for mutation-rate scaling and missingness.
Multiple model variants – narrow, broad, broad_w200, w200_wmissing, and adapter-based models for different sample sizes and data characteristics.
Multi-GPU inference – pairs are automatically sharded across GPUs for high throughput.
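As an illustration of how haplotype pairs might be split across devices, the sketch below enumerates all pairs and deals them out round-robin. The source does not describe the sharding strategy cxt uses, so the round-robin scheme and the helper name `shard_pairs` are assumptions.

```python
from itertools import combinations


def shard_pairs(n_haplotypes, n_devices):
    """Split all unordered haplotype pairs into one list per device.

    Hypothetical helper: round-robin assignment keeps shard sizes within
    one pair of each other. Not taken from the cxt code base.
    """
    pairs = list(combinations(range(n_haplotypes), 2))
    return [pairs[d::n_devices] for d in range(n_devices)]


shards = shard_pairs(n_haplotypes=50, n_devices=4)
print([len(s) for s in shards])
```

Because every pair is decoded independently of the others, any partition of the pair list would work. Balanced shards simply keep the devices busy for roughly the same amount of time.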
Model variants
| Name | Preset | Layers | Description |
|---|---|---|---|
|  |  | 6 | Smaller model, faster inference |
|  |  | 10 | Main model, best overall accuracy |
|  |  | 10 | Predicts log-residuals from the population mean |
|  |  | 10 | 200 bp windows for fine-scale resolution in large-\(N_e\) species |
|  |  | 10 | 200 bp windows with explicit missingness support |
|  | adapter on | 10 | 10-sample adapter on the broad backbone |
|  | adapter on | 10 | 10-sample adapter with missingness support |
Window-wise autoregressive decoding of coalescence times. Multiple stochastic replicates are averaged to obtain a robust TMRCA estimate.
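A minimal sketch of combining replicates, assuming each replicate is a list of per-window log-TMRCA values and that averaging is done in log space. The source does not say on which scale cxt averages, so that choice, the helper name `summarize_replicates`, and the toy numbers are assumptions.

```python
from statistics import mean, stdev


def summarize_replicates(replicates):
    """Return per-window mean and sample standard deviation.

    Hypothetical helper, not part of cxt.
    replicates: list of replicates, each a list of log-TMRCA values with
                one entry per window (all replicates the same length).
    """
    per_window = list(zip(*replicates))  # regroup values by window
    means = [mean(values) for values in per_window]
    spreads = [stdev(values) for values in per_window]
    return means, spreads


# Three toy replicates over two windows (made-up numbers).
replicates = [[8.0, 9.0], [8.4, 9.2], [7.6, 8.8]]
means, spreads = summarize_replicates(replicates)
print(means, spreads)
```

The per-window spread gives a simple uncertainty summary alongside the averaged estimate. Quantiles across replicates would be an alternative if the replicate distribution is skewed.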
Getting Started
Usage Guide
Reproducing the Paper