cxt: Coalescence x Translation

cxt is a transformer-based method for inferring pairwise coalescence times (time to the most recent common ancestor, TMRCA) from genotype data using a language-modelling approach. For each pair of haplotypes it computes a multi-scale site-frequency spectrum (SFS) in sliding windows, feeds it through a token-free transformer decoder, and outputs a discretized log-TMRCA profile across the genome.
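To make the two ends of this pipeline concrete, here is a minimal sketch of windowed pairwise features on the input side and a discretized log-time grid on the output side. It is a simplified illustration, not cxt's featurization: the function names, the non-overlapping windows, and the grid bounds are all assumptions made for the example, and the multi-scale SFS is reduced to a plain count of differing sites for one pair.

```python
import math

def pairwise_diffs_per_window(hap_a, hap_b, positions, window_size, seq_length):
    """Count sites where two haplotypes differ, per non-overlapping window.

    Hypothetical helper: assumes every position is smaller than seq_length.
    """
    n_windows = math.ceil(seq_length / window_size)
    counts = [0] * n_windows
    for a, b, pos in zip(hap_a, hap_b, positions):
        if a != b:
            counts[int(pos // window_size)] += 1
    return counts

def log_time_bin(tmrca, t_min, t_max, n_bins):
    """Map a TMRCA to a bin index on a log-spaced grid (assumed bounds)."""
    span = math.log(t_max) - math.log(t_min)
    frac = (math.log(tmrca) - math.log(t_min)) / span
    idx = int(frac * n_bins)
    # Clamp values that fall outside the grid to the first or last bin.
    return min(max(idx, 0), n_bins - 1)

# Toy data: five variant sites on a 600 bp region, 200 bp windows.
hap_a = [0, 1, 0, 1, 1]
hap_b = [0, 0, 0, 1, 0]
positions = [10, 150, 220, 390, 410]
window_counts = pairwise_diffs_per_window(hap_a, hap_b, positions, 200, 600)
```

With this toy input the pair differs at positions 150 and 410, so the first and third windows each receive one count.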

cxt model architecture

Figure 1. Left: cxt introduces the notion of next-coalescence prediction. cxt is a language model that conditions on a chosen “pivot” haplotype pair and predicts the pair’s TMRCA for each window. cxt ingests mutational tensors constructed using the pivot pair and SFS values computed in windows across a focal region. The model works autoregressively: after each window is predicted, that estimate is appended to the context and supplied to the next step, yielding a step-wise reconstruction of the entire pairwise coalescent history. Right: all \(\binom{50}{2} = 1225\) pairwise coalescence curves for a sample of 50 haplotypes, inferred simultaneously in under five minutes on a single NVIDIA A100 GPU.
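The autoregressive loop described in the caption can be sketched in a few lines. This is only a schematic under stated assumptions: `decode_autoregressively` and `toy_predictor` are hypothetical names, and the predictor is a trivial stand-in for the trained transformer, used here so the control flow can run on its own.

```python
from itertools import combinations

def decode_autoregressively(window_features, predict_next):
    """Predict one value per window, feeding earlier predictions back in."""
    context = []
    for features in window_features:
        # The predictor sees the current window plus all earlier estimates.
        estimate = predict_next(features, context)
        context.append(estimate)
    return context

def toy_predictor(features, context):
    # Hypothetical stand-in for the model: blend the current window's
    # feature with the previous estimate (or use it alone at the start).
    previous = context[-1] if context else features
    return 0.5 * features + 0.5 * previous

# Every unordered pair among 50 haplotypes, as in the right panel.
pairs = list(combinations(range(50), 2))

# One toy profile over three windows.
profile = decode_autoregressively([4.0, 2.0, 6.0], toy_predictor)
```

Each step depends on the estimates already produced, which is why windows within one pair are decoded in sequence while different pairs can be processed independently.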

Key features

  • No tree-sequence inference required – works directly on genotype matrices, VCF files, or tskit tree sequences.

  • Stochastic sampling – multiple replicate predictions yield uncertainty estimates for each genomic window.

  • Bias correction – optional Bayesian diversity-based correction accounts for mutation-rate scaling and missingness.

  • Multiple model variants – narrow, broad, residual, broad_w200, w200_wmissing, and adapter-based models for different sample sizes and data characteristics.

  • Multi-GPU inference – pairs are automatically sharded across GPUs for high throughput.
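As an illustration of the multi-GPU point in the list above, the sketch below splits all haplotype pairs across a number of devices. The round-robin scheme and the name `shard_pairs` are assumptions made for the example; the text does not say which sharding strategy cxt uses.

```python
from itertools import combinations

def shard_pairs(n_haplotypes, n_devices):
    """Round-robin assignment of haplotype pairs to devices (assumed scheme)."""
    shards = [[] for _ in range(n_devices)]
    for i, pair in enumerate(combinations(range(n_haplotypes), 2)):
        shards[i % n_devices].append(pair)
    return shards

# 50 haplotypes give 1225 pairs; spread them over 4 hypothetical GPUs.
shards = shard_pairs(50, 4)
shard_sizes = [len(shard) for shard in shards]
```

Because each pair is inferred independently of the others, any partition of the pair list works; round-robin simply keeps the shard sizes within one pair of each other.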

Model variants

| Name | Preset | Layers | Description |
|---|---|---|---|
| narrow | PRESETS["narrow"] | 6 | Smaller model, faster inference |
| broad | PRESETS["broad"] | 10 | Main model, best overall accuracy |
| residual | PRESETS["residual"] | 10 | Predicts log-residuals from the population mean |
| broad_w200 | PRESETS["broad_w200"] | 10 | 200 bp windows for fine-scale resolution in large-\(N_e\) species |
| w200_wmissing | PRESETS["w200_wmissing"] | 10 | 200 bp windows with explicit missingness support |
| broad+adapter | adapter on broad | 10 | 10-sample adapter on the broad backbone |
| w200_wmissing_adapter | adapter on w200_wmissing | 10 | 10-sample adapter with missingness support |

Decoding process in cxt

Window-wise autoregressive decoding of coalescence times. Multiple stochastic replicates are averaged to obtain a robust TMRCA estimate.
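A minimal sketch of how replicate profiles might be combined is shown below. It assumes each replicate is a list of per-window log-TMRCA values and that averaging happens in log space; neither detail is stated above, and `summarize_replicates` is a hypothetical name.

```python
import statistics

def summarize_replicates(replicates):
    """Per-window mean and sample standard deviation across replicates.

    Needs at least two replicates, since statistics.stdev requires
    two or more values.
    """
    means, spreads = [], []
    # zip(*replicates) groups the values of all replicates window by window.
    for window_values in zip(*replicates):
        means.append(statistics.fmean(window_values))
        spreads.append(statistics.stdev(window_values))
    return means, spreads

# Three toy replicates over two windows (log-TMRCA values).
replicates = [[8.0, 9.0], [10.0, 9.0], [9.0, 9.0]]
means, spreads = summarize_replicates(replicates)
```

The per-window spread gives a simple uncertainty measure: windows where the replicates disagree get a larger standard deviation than windows where they coincide.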
