Quick Start =========== This page demonstrates the core cxt workflow in under 20 lines of code: simulate a tree sequence, run inference, and compare inferred versus true coalescent times. Minimal example --------------- .. code-block:: python import cxt import msprime import numpy as np # 1. Simulate a 1 Mb tree sequence ts = msprime.sim_ancestry( 25, population_size=2e4, sequence_length=1e6, recombination_rate=1.28e-8, random_seed=42, ) ts = msprime.mutate(ts, rate=1.29e-8, random_seed=42) # 2. Load a pretrained model model = cxt.load_model("broad", device="cuda") # 3. Infer TMRCA for selected haplotype pairs tmrca, index_map = cxt.translate( ts, model, blocks=[(0, 1_000_000)], pivot_pairs=[(0, 1), (2, 3), (4, 5)], devices=["cuda:0"], n_reps=15, mutation_rate=1.29e-8, ) # tmrca shape: (n_items, n_reps, n_windows) -- log-TMRCA values print(f"Inferred TMRCA shape: {tmrca.shape}") Benchmark: constant :math:`N_e` ------------------------------- The following example reproduces the narrow-model constant-\ :math:`N_e` benchmark scatter plot from the paper. It compares inferred TMRCAs against the true values from the simulated tree sequence. .. code-block:: python import os import numpy as np from concurrent.futures import ProcessPoolExecutor import cxt from cxt.utils import simulate_parameterized_tree_sequence from cxt.preprocess import interpolate_tmrcas # Simulate ts = simulate_parameterized_tree_sequence(seed=103370001) # Load model model = cxt.load_model("narrow", device="cuda") # All pairwise combinations among 50 haploid samples pivot_pairs = [(i, j) for i in range(50) for j in range(i + 1, 50)] # Infer yhat_tmrca, index_map = cxt.translate( ts, model, pivot_pairs=pivot_pairs, blocks=[(0, 1_000_000)], devices=["cuda:0", "cuda:1", "cuda:2"], B=256, build_workers=8, mutation_rate=1.29e-8, ) # Compute true TMRCAs def _true(args): ts, a, b = args return interpolate_tmrcas( ts, window_size=2000, sample_a=a, sample_b=b, ) with ProcessPoolExecutor(max_workers=24) as ex: ytrues = list(ex.map(_true, [(ts, a, b) for a, b in pivot_pairs])) yhats = [np.exp(yhat_tmrca).mean(0)[i] for i in range(len(pivot_pairs))] # Scatter plot from cxt.utils import plot_inference_scatter plot_inference_scatter( np.array(yhats).flatten(), np.array(ytrues).flatten(), "cxt_constant.png", tool=r"$\mathbf{cxt}$-narrow: Constant $N_e$", ) .. figure:: figures/figure2_benchmark.png :align: center :width: 100% **Figure 2.** True versus predicted coalescence times for three inference approaches across two demographic scenarios: a constant population size and a fluctuating "sawtooth" demography. Top row: cxt-narrow evaluated on the constant-size scenario (left) and on the sawtooth scenario (middle), followed by cxt-broad evaluated on the sawtooth scenario (right). Bottom row: Singer+Polegon evaluated on the constant-size (left) and sawtooth (middle) scenarios, followed by SMC++ on the constant-size and sawtooth scenarios. MSE values are reported within each panel. Input formats ------------- cxt accepts three input types via the unified :func:`cxt.translate` interface: **Tree sequence** (default): .. code-block:: python tmrca, index_map = cxt.translate(ts, model, ...) **VCF file**: .. code-block:: python tmrca, index_map = cxt.translate("path/to/file.vcf", model, ...) **Genotype matrix** ``(gm, positions)`` tuple: .. code-block:: python import numpy as np gm = np.load("genotypes.npy") # (n_haplotypes, n_sites) pos = np.load("positions.npy") # (n_sites,) in bp tmrca, index_map = cxt.translate( (gm, pos), model, ... ) .. note:: Two usage patterns arise naturally depending on how blocks and pivot pairs are specified: - **Targeted**: many pivot pairs on few narrow blocks -- concentrates inference on specific loci. - **Long-range**: few pivot pairs across many contiguous blocks -- enables chromosome-scale TMRCA decoding.