https://github.com/heart-gen/rfmix_reader

gpu-acceleration local-ancestry-inference software

Last synced: 4 months ago
JSON representation

Host: GitHub
URL: https://github.com/heart-gen/rfmix_reader
Owner: heart-gen
License: gpl-3.0
Created: 2024-05-28T11:52:35.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2026-01-14T14:12:01.000Z (5 months ago)
Last Synced: 2026-01-14T18:00:48.130Z (5 months ago)
Topics: gpu-acceleration, local-ancestry-inference, software
Language: Python
Size: 1.32 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff

Awesome Lists containing this project

README

# RFMix-reader
`RFMix-reader` is a Python package for efficiently reading and processing output
files generated by [`RFMix`](https://github.com/slowkoni/rfmix), a widely used tool
for estimating local ancestry in admixed populations.
It employs a **lazy loading approach** to minimize memory usage, and leverages **GPU acceleration**
for major speedups when available.

---

## Installation

`rfmix-reader` requires **Python 3.11+**. Install from PyPI:

```bash
pip install rfmix-reader
````

### Installation Options

* **Basic install** (CPU only):

```bash
pip install rfmix-reader
```

* **With GPU acceleration** (`cupy`, `cudf`, `dask-cudf`, `torch`):

```bash
pip install rfmix-reader[gpu]
```

The default GPU extra targets **CUDA 12** wheels (`cupy-cuda12x`,
`cudf-cu12`, `dask-cudf-cu12`). Use the appropriate CUDA build strings if
your environment requires a different version.

* **With documentation tools** (`sphinx`, `sphinx-rtd-theme`):

```bash
pip install rfmix-reader[docs]
```

* **With testing tools** (`pytest`):

```bash
pip install rfmix-reader[tests]
```

### GPU Notes

* PyTorch is **not** installed in the base package. It is pulled in only when you install the `[gpu]` extra (`pip install rfmix-reader[gpu]`) or add it yourself.
* Install a PyTorch wheel that matches your platform and CUDA version using the [official selector](https://pytorch.org/get-started/locally/). For example:
* **CUDA 12 (Linux/Windows):** `pip install torch --index-url https://download.pytorch.org/whl/cu121`
* **CUDA 11 (Linux/Windows):** `pip install torch --index-url https://download.pytorch.org/whl/cu118`
* **CPU-only:** `pip install torch --index-url https://download.pytorch.org/whl/cpu`
* RAPIDS (`cudf`, `cupy`) wheels are version- and CUDA-specific. See the [RAPIDS install guide](https://docs.rapids.ai/install).
* CPU-only installations will still run efficiently, just without GPU acceleration.

---

## Quickstart

```python
from rfmix_reader import read_rfmix

# Load RFMix outputs (two-population admixture example)
file_path = "examples/two_populations/out/"
loci_df, g_anc, local_array = read_rfmix(file_path)

print(loci_df.head())
print(g_anc.head())
print(local_array.shape)
"""See the phasing section below for how to phase per-chromosome outputs and
write them to Zarr with `phase_rfmix_chromosome_to_zarr`."""
```

---

## Key Features

* **Lazy Loading**: Reads data on-the-fly, reducing memory footprint.
* **Efficient Access**: Query specific loci or regions of interest.
* **Seamless Integration**: Works smoothly with `pandas`, `dask`, and other analysis tools.
* **Loci Imputation**: Impute local ancestry loci to dense genotype variant sites.
* **GPU Acceleration**: Automatic CUDA acceleration via PyTorch/CuPy when available.

---

## Simulation Data

Test datasets for two- and three-population admixture are available on Synapse:
[Synapse Project syn61691659](https://www.synapse.org/Synapse:syn61691659).

---

## Usage

### Binary Conversion

RFMix does not generate binary files directly.
Use `create_binaries` to generate them (also available as a CLI):

```bash
create-binaries two_pops/out/
```

```python
from rfmix_reader import create_binaries

create_binaries("two_pops/out/", binary_dir="./binary_files")
```

### Preparing Reference Data for Phasing

Use `prepare-reference` to convert bgzipped, indexed reference VCF/BCF files
into per-chromosome VCF-Zarr stores that the phasing pipeline consumes.
The command writes one `.zarr` directory per input file.

```
prepare-reference -h
```

```
usage: prepare-reference [-h] [--chunk-length CHUNK_LENGTH]
[--samples-chunk-size SAMPLES_CHUNK_SIZE]
[--worker-processes WORKER_PROCESSES]
[--verbose | --no-verbose] [--version]
output_dir vcf_paths [vcf_paths ...]

Convert one or more bgzipped reference VCF/BCF files into Zarr stores.

positional arguments:
output_dir Directory where the Zarr outputs will be written.
vcf_paths Paths to reference VCF/BCF files (bgzipped and
indexed).

options:
-h, --help show this help message and exit
--chunk-length CHUNK_LENGTH
Genomic chunk size for the output Zarr stores
(default: 100000).
--samples-chunk-size SAMPLES_CHUNK_SIZE
Chunk size for samples in the output Zarr stores
(default: library chosen).
--worker-processes WORKER_PROCESSES
Number of worker processes to use for conversion
(default: 0, use library default).
--verbose, --no-verbose
Print progress messages (default: enabled).
--version Show the version of the program and exit.
```

Example data preparation:

```bash
# Sample annotations: two columns (no header): sample_idgroup
cat > sample_annotations.tsv <<'EOF'
NA19700 AFR
NA19701 AFR
NA20847 EUR
EOF

# Convert chromosome VCFs into a reference store directory
prepare-reference refs/ 1kg_chr20.vcf.gz 1kg_chr21.vcf.gz \
--chunk-length 50000 --samples-chunk-size 512
```

### Main Function

Once binaries are available, process RFMix results:

```python
from rfmix_reader import read_rfmix

loci, g_anc, admix = read_rfmix("two_pops/out/")
```

### Three Population Example

Binaries can also be generated on-the-fly within `read_rfmix` with
`generate_binary` set to `True`.

```python
loci, g_anc, admix = read_rfmix("examples/three_populations/out/",
binary_dir="./binary_files",
generate_binary=True)
```

### Phasing data

For optimal memory and computational speed, phasing is done per
chromosome.

```python
from rfmix_reader import phase_rfmix_chromosome_to_zarr

# Use the reference store + annotations during phasing
admix = phase_rfmix_chromosome_to_zarr(
file_prefix="two_pops/out/",
ref_zarr_root="refs",
sample_annot_path="sample_annotations.tsv",
output_path="./phased_chr21.zarr",
chrom="21",
)
```

The chunking is suboptimal for phasing, so remember to
rechunk before using for optimal processing. This is only needed
when loading individual chromosomes. The merged data is already
optimized.

```python
local_array = admix["local_ancestry"].chunk({"variant": 20000, "sample": 100})
# Compute into memory, if needed
local_array = local_array.compute()
```

This also saves the data to Zarr for later merging or data processing.

```bash
merge-phased-zarrs ./phased_all.zarr ./phased_chr21.zarr ./phased_chr22.zarr
```
### Loci Imputation

The imputation workflow now lives in ``rfmix_reader.processing.imputation`` and
is exported as ``interpolate_array``. It interpolates the local ancestry matrix
onto a denser variant grid and writes the result to ``/local-ancestry.zarr``
as a Zarr array shaped ``(variants, samples, ancestries)``.

**Inputs**

* ``variant_loci_df``: a pandas DataFrame defining the variant grid. Provide at
least ``chrom``/``pos`` and an ``i`` column that points to the source RFMix
row index; rows with ``i`` set to ``NaN`` are treated as missing loci to
interpolate. Sort the frame by genomic coordinate, and include ``pos`` if you
plan to interpolate in base-pair space.
* ``admix``: the local ancestry Dask array returned by ``read_rfmix`` (shape
``(loci, samples, ancestries)``).
* ``zarr_outdir``: an output directory where the new ``local-ancestry.zarr``
store will be created.

**Key options**

* ``interpolation``: ``"linear"`` (default), ``"nearest"``, or ``"stepwise"``.
* ``use_bp_positions``: set to ``True`` to interpolate along ``variant_loci_df['pos']``
rather than treating loci as equally spaced indices.
* ``chunk_size``/``batch_size``: tune how many rows are materialized at a time
when filling and interpolating the Zarr array.

**Workflow example**

```python
import pandas as pd
from pathlib import Path
from rfmix_reader import interpolate_array, read_rfmix

# Load RFMix loci and local ancestry
loci_df, _, admix = read_rfmix("two_pops/out/", binary_dir="./binary_files")

# Build the variant grid by merging genotype sites with the RFMix loci index
variants = pd.read_parquet("genotypes/variants.parquet") # must include chrom/pos
variants = variants.drop_duplicates(subset=["chrom", "pos"]).sort_values("pos")
variant_loci_df = (
variants.merge(loci_df.to_pandas(), on=["chrom", "pos"], how="outer", indicator=True)
.loc[:, ["chrom", "pos", "i", "_merge"]]
)

z = interpolate_array(
variant_loci_df,
admix,
zarr_outdir=Path("./imputed_local_ancestry"),
interpolation="linear",
use_bp_positions=True,
chunk_size=50_000,
)
print(z)
```

The interpolator uses GPU acceleration transparently when ``cupy`` and a CUDA
PyTorch build are available; otherwise it falls back to NumPy. All methods other
than ``"hmm"`` operate on diploid-summed trajectories and preserve the original
ancestry dimension.

### Phasing workflow (`rfmix_reader.processing.phase`)

`rfmix_reader.processing.phase` implements gnomix-style tail-flip corrections
for local ancestry haplotypes. Phasing now lives outside `read_rfmix` so you can
process each chromosome independently and write outputs directly to Zarr.

#### Required reference inputs

* **VCF-Zarr reference** (`ref_zarr_root`): either a single `*.zarr` store or a
directory containing per-chromosome stores (e.g., `1.zarr`, `chr1.zarr`).
* **Sample annotations** (`sample_annot_path`): two-column file mapping
`sample_id` to `group` (ancestry label). One representative sample per group
is pulled to build reference haplotypes.

#### Per-chromosome pipeline

1. (Optional) generate RFMix binary caches with `create_binaries`.
2. Call `phase_rfmix_chromosome_to_zarr` for each chromosome you want to
process.
3. Optionally concatenate those per-chromosome Zarr stores with
`merge_phased_zarrs`.

```python
from rfmix_reader.processing.phase import (
PhasingConfig,
merge_phased_zarrs,
phase_rfmix_chromosome_to_zarr,
)

# Phase chromosome 21 and write to Zarr
dataset = phase_rfmix_chromosome_to_zarr(
file_prefix="examples/two_populations/out/",
ref_zarr_root="/refs/1kg_chr_zarr/",
sample_annot_path="/refs/1kg_annotations.tsv",
output_path="/tmp/phased_chr21.zarr",
chrom="21",
)

# Merge multiple per-chromosome Zarr stores
merged = merge_phased_zarrs(
["/tmp/phased_chr21.zarr", "/tmp/phased_chr22.zarr"],
output_path="/tmp/phased_all.zarr",
)
```

If you need fine-grained control, you can still start from unphased outputs and
call `phase_admix_dask_with_index` directly:

```python
from rfmix_reader import read_rfmix
from rfmix_reader.processing.phase import PhasingConfig, phase_admix_dask_with_index

loci_df, g_anc, admix, X_raw = read_rfmix(
"examples/two_populations/out/",
return_original=True,
chrom="21",
)

config = PhasingConfig(window_size=100, min_block_len=10, max_mismatch_frac=0.3)
phased = phase_admix_dask_with_index(
admix=admix,
X_raw=X_raw,
positions=loci_df.physical_position.to_numpy(),
chrom=str(loci_df.chromosome.iloc[0]),
ref_zarr_root="/refs/1kg_chr_zarr/",
sample_annot_path="/refs/1kg_annotations.tsv",
config=config,
)
```

### Reading Haptools simulations

Use `read_simu` to load BGZF-compressed VCF files created by
`haptools simgenotype --pop_field`:

```python
from rfmix_reader import read_simu

loci_df, g_anc, admix = read_simu("/path/to/simulations/")
```

Haptools does **not** include the chromosome length in the `##contig`
header lines, but `read_simu` requires that metadata to index each VCF.
Copy the `contigs.txt` file Haptools generates from the FASTA you used
for simulation and reheader every file with the appropriate contig entry
before calling `read_simu`. The following snippet shows one approach
using `bcftools` and `tabix`:

```bash
CONTIGS="../../three_populations/_m/contigs.txt"
VCFDIR="gt-files"
CHR="chr${SLURM_ARRAY_TASK_ID}"
OUT="${VCFDIR}/${CHR}.vcf.gz"
IN="${VCFDIR}/back/${CHR}.vcf.gz"

CONTIG_LINE=$(grep -w "ID=${CHR}" "$CONTIGS")
if [[ -z "$CONTIG_LINE" ]]; then
echo "ERROR: No contig line found for ${CHR} in $CONTIGS"
exit 1
fi

bcftools view -h "$IN" \
| sed "s/^##contig=.*/${CONTIG_LINE}/" > header.${CHR}.tmp
bcftools reheader -h header.${CHR}.tmp -o "$OUT" "$IN"
tabix -p vcf "$OUT"
```

### Visualization

`read_rfmix`, `read_flare`, and `read_simu` all return the same
`(loci_df, g_anc, admix)` tuple, so the plotting utilities in
`rfmix_reader._visualization` work identically for RFMix, FLARE, and
Haptools-simulated inputs. The snippet below shows the typical workflow
for each reader:

```python
from rfmix_reader import (
plot_ancestry_by_chromosome,
plot_global_ancestry,
read_flare,
read_rfmix,
read_simu,
)

# RFMix run directory
loci_df, g_anc, admix = read_rfmix("two_pops/out/")
plot_global_ancestry(g_anc, save_path="rfmix_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="rfmix_local.png")

# FLARE output directory (contains *.anc.vcf.gz + global.anc.gz)
loci_df, g_anc, admix = read_flare("flare_runs/chr1/")
plot_global_ancestry(g_anc, save_path="flare_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="flare_local.png")

# Haptools simulations (after reheadering contigs)
loci_df, g_anc, admix = read_simu("/path/to/simulations/")
plot_global_ancestry(g_anc, save_path="simu_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="simu_local.png")
```

`plot_global_ancestry` builds per-individual stacked bars of global
ancestry while `plot_ancestry_by_chromosome` summarizes local ancestry
along each chromosome, giving you quick visual QC for every supported
input format.

---

## Development Install

For contributors:

```bash
git clone https://github.com/heart-gen/rfmix_reader.git
cd rfmix_reader
pip install -e ".[gpu,docs,tests]"
```

---

## Citation

If you use this software, please cite:

[![DOI](https://zenodo.org/badge/807052842.svg)](https://zenodo.org/doi/10.5281/zenodo.12629787)

Benjamin, K. J. M. (2024). **RFMix-reader (Version 0.2.0)** \[Computer software].
[https://github.com/heart-gen/rfmix\_reader](https://github.com/heart-gen/rfmix_reader)

Kynon JM Benjamin. *"RFMix-reader: Accelerated reading and processing for local ancestry studies."*
**bioRxiv** (2024).
DOI: [10.1101/2024.07.13.603370](https://www.biorxiv.org/content/10.1101/2024.07.13.603370v2).

---

## Funding

This work was supported by the National Institutes of Health,
National Institute on Minority Health and Health Disparities (NIMHD)
K99MD016964 / R00MD016964.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/heart-gen/rfmix_reader

Awesome Lists containing this project

README