Human 1000 Genomes: LCT and HLA
=================================

This example applies cxt to GBR samples from the 1000 Genomes Project to
decode chromosome-scale TMRCA landscapes and zoom into the LCT locus
(chromosome 2) and HLA region (chromosome 6).


.. figure:: figures/figure6_human_1kg.png
   :align: center
   :width: 100%

   **Figure 6 (top).** Inference of coalescent times from 25 focal pairs of
   British individuals from the 1000 Genomes Project. Chromosome-wide
   TMRCA landscapes for chromosomes 2 (left) and 6 (right).

.. figure:: figures/figure6_human_1kg_zoom.png
   :align: center
   :width: 100%

   **Figure 6 (bottom).** Zoomed view of the LCT region on chromosome 2
   (left) and the HLA region on chromosome 6 (right). Each pivot pair has
   been inferred 15 times to get reliable mean, median, and variance
   estimates. Light blue lines show the respective inferences of the 25
   pairs; the IQR (25--75%) band, mean, and median are overlaid with gene
   annotations.


Pipeline overview
-----------------

1. Load tsinfer-based tree sequences for chr2p/chr2q and chr6p/chr6q.
2. Run :func:`cxt.translate` in 1 Mb blocks on 25 pivot pairs.
3. Apply a post-hoc stochastic diversity-based bias correction.
4. Stitch p and q arms into whole-chromosome TMRCA tracks.
5. Plot chromosome-wide curves and zoom into LCT and HLA with gene
   annotations.

.. note::

   Unlike the mosquito analysis, we do **not** use a missingness-aware model
   variant here. Instead, a *post-hoc stochastic diversity bias correction*
   accounts for the moderate missingness in these human regions.


Model and devices
-----------------

.. code-block:: python

   import os
   import numpy as np
   import torch
   import tszip
   import cxt

   devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())]
   model = cxt.load_model("broad", device="cpu")

   cache_dir = "./cache"
   os.makedirs(cache_dir, exist_ok=True)


Loading GBR tree sequences
--------------------------

We load precomputed tsinfer-based trees for four chromosome arms, simplify
to 50 haploid samples, and cache locally:

.. code-block:: python

   TSZ_DIR = "/path/to/hg1kg/tsinfer-trees"

   arms = {}
   for arm_name in ["chr2p", "chr2q", "chr6p", "chr6q"]:
       tsz_cache = os.path.join(cache_dir, f"gbr_{arm_name}.tsz")
       if os.path.exists(tsz_cache):
           arms[arm_name] = tszip.load(tsz_cache)
       else:
           ts = tszip.load(os.path.join(TSZ_DIR, f"{arm_name}.tsz"))
           ts = ts.simplify(samples=np.arange(50))
           tszip.compress(ts, tsz_cache)
           arms[arm_name] = ts

       print(f"{arm_name}: {arms[arm_name].num_samples} samples, "
             f"{arms[arm_name].sequence_length:.0f} bp")


Inference on 1 Mb blocks
-------------------------

We tile each arm into 1 Mb blocks and use 25 consecutive pivot pairs:

.. code-block:: python

   pivot_pairs = [(i, i + 1) for i in range(0, 50, 2)]

   def infer_arm(ts, name):
       cache_path = os.path.join(cache_dir, f"gbr_{name}.npz")
       if os.path.exists(cache_path):
           data = np.load(cache_path)
           return data["tmrca"], data["index_map"]

       num_blocks = int(ts.sequence_length // 1e6)
       blocks = [(int(i * 1e6), int((i + 1) * 1e6)) for i in range(num_blocks)]

       tmrca, index_map = cxt.translate(
           ts, model,
           pivot_pairs=pivot_pairs,
           blocks=blocks,
           B_per_device=128, B=128,
           devices=devices,
           build_workers=32,
       )
       np.savez_compressed(cache_path, tmrca=tmrca, index_map=index_map)
       return tmrca, index_map

   results = {}
   for arm_name, ts in arms.items():
       results[arm_name] = infer_arm(ts, arm_name)


Post-hoc diversity correction
------------------------------

We estimate per-window missingness from bcftools masks and apply a
stochastic diversity-based correction per block:

.. code-block:: python

   from cxt.correction import stochastic_diversity_bias_correction

   mutation_rate = 1.29e-8

   def correct_arm(tmrca, index_map, ts, availability, name):
       cache_path = os.path.join(cache_dir, f"genome_{name}.npz")
       if os.path.exists(cache_path):
           return np.load(cache_path)["genome"]

       num_blocks = int(ts.sequence_length // 1e6)
       blocks = [(int(i * 1e6), int((i + 1) * 1e6)) for i in range(num_blocks)]

       corrected_blocks = []
       for b_idx in range(num_blocks):
           mask = index_map[:, 0] == b_idx
           block = blocks[b_idx]
           avail = availability[b_idx] / 100 if availability is not None else 1.0

           try:
               ts_block = ts.keep_intervals([block], simplify=True)
               rng = np.random.default_rng(20_000_001 + b_idx)
               out = stochastic_diversity_bias_correction(
                   tree_sequence=ts_block,
                   mutation_rate=mutation_rate / max(avail, 0.01),
                   predictions=tmrca[:, mask],
                   pivot_pairs=np.array(pivot_pairs),
                   rng=rng,
               )
               corrected_blocks.append(out.mean(0))
           except Exception:
               corrected_blocks.append(np.zeros([25, 500]))

       genome = np.stack(corrected_blocks).transpose(1, 0, 2).reshape(25, -1)
       np.savez_compressed(cache_path, genome=genome)
       return genome


Stitching chromosome-wide tracks
---------------------------------

.. code-block:: python

   genome_chr2 = (
       correct_arm(*results["chr2p"], arms["chr2p"], chr2_available, "chr2p")
       + correct_arm(*results["chr2q"], arms["chr2q"], chr2_available, "chr2q")
   )
   genome_chr6 = (
       correct_arm(*results["chr6p"], arms["chr6p"], chr6_available, "chr6p")
       + correct_arm(*results["chr6q"], arms["chr6q"], chr6_available, "chr6q")
   )


Chromosome-wide plot
--------------------

.. code-block:: python

   import matplotlib.pyplot as plt

   generation_time = 28
   BIN_BP = 2000

   lct_start, lct_end = 135_300_000, 136_400_000
   hla_start, hla_end = 29_600_000, 33_300_000

   fig, axes = plt.subplots(1, 2, figsize=(12, 3), sharex=False, sharey=True)

   for ax, genome, label in [
       (axes[0], genome_chr2, "Chromosome 2"),
       (axes[1], genome_chr6, "Chromosome 6"),
   ]:
       x = np.arange(genome.shape[1]) * BIN_BP
       y = np.exp(genome.mean(0)) * generation_time
       ax.plot(x, y, linewidth=0.8, color="dodgerblue", alpha=0.9)
       ax.set_yscale("log")
       ax.set_title(label, loc="left", fontsize=12)
       ax.grid(alpha=0.3, which="both", linestyle="--")

   axes[0].axvspan(lct_start, lct_end, color="crimson", alpha=0.3, label="LCT")
   axes[1].axvspan(hla_start, hla_end, color="crimson", alpha=0.3, label="HLA")
   axes[0].text(lct_start, 3e6, "LCT", color="crimson", fontsize=10, va="bottom")
   axes[1].text(hla_start, 3e6, "HLA", color="crimson", fontsize=10, va="bottom")

   fig.text(0.5, 0.03, "Position (bp)", ha="center", fontsize=12)
   fig.text(0.04, 0.5, "Mean TMRCA (years)", va="center",
            rotation="vertical", fontsize=12)

   plt.tight_layout(rect=[0.05, 0.05, 1, 0.95])
   plt.ylim(1e3, 1e7)
   fig.savefig("figure6_human_1kg.png", dpi=300, bbox_inches="tight")


Zoomed panels with gene annotations
-------------------------------------

The zoomed LCT and HLA panels overlay individual replicate traces, IQR
bands, and gene annotations. Key genes at each locus:

- **LCT region** (chr2, 133--138 Mb): *LCT*, *MCM6*, *DARS1*, *R3HDM1*,
  *ZRANB3*, *UBXN4*, *RAB3GAP1*, *CXCR4*
- **HLA region** (chr6, 29--34 Mb): *HLA-A*, *HLA-B*, *HLA-C*, *HLA-DRB1*,
  *HLA-DQA1*, *HLA-DQB1*, *HLA-DPA1*, *HLA-DPB1*

The plotting code is available in ``figures/main/fig6_human_1kg.py``.

The LCT region shows a sharp reduction in TMRCA consistent with the
well-documented signature of recent positive selection on lactase
persistence in European populations. The HLA region displays elevated and
highly variable TMRCA, reflecting ancient balancing selection maintaining
MHC diversity.