{"id":45897014,"url":"https://github.com/heart-gen/rfmix_reader","last_synced_at":"2026-02-27T21:04:32.707Z","repository":{"id":242108066,"uuid":"807052842","full_name":"heart-gen/rfmix_reader","owner":"heart-gen","description":null,"archived":false,"fork":false,"pushed_at":"2026-01-14T14:12:01.000Z","size":1384,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-14T18:00:48.130Z","etag":null,"topics":["gpu-acceleration","local-ancestry-inference","software"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/heart-gen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-05-28T11:52:35.000Z","updated_at":"2026-01-14T14:12:02.000Z","dependencies_parsed_at":"2025-12-07T08:06:22.912Z","dependency_job_id":null,"html_url":"https://github.com/heart-gen/rfmix_reader","commit_stats":null,"previous_names":["heart-gen/rfmix_reader"],"tags_count":13,"template":false,"template_full_name":null,"purl":"pkg:github/heart-gen/rfmix_reader","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heart-gen%2Frfmix_reader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heart-gen%2Frfmix_reader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heart-gen%2Frfmix_reader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/heart-gen%2Frfmix_reader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/heart-gen","download_url":"https://codeload.github.com/heart-gen/rfmix_reader/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heart-gen%2Frfmix_reader/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29913720,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-27T19:37:42.220Z","status":"ssl_error","status_checked_at":"2026-02-27T19:37:41.463Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpu-acceleration","local-ancestry-inference","software"],"created_at":"2026-02-27T21:04:31.873Z","updated_at":"2026-02-27T21:04:32.698Z","avatar_url":"https://github.com/heart-gen.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RFMix-reader\n`RFMix-reader` is a Python package for efficiently reading and processing output\nfiles generated by [`RFMix`](https://github.com/slowkoni/rfmix), a widely used tool\nfor estimating local ancestry in admixed populations.  \nIt employs a **lazy loading approach** to minimize memory usage, and leverages **GPU acceleration**\nfor major speedups when available.\n\n---\n\n## Installation\n\n`rfmix-reader` requires **Python 3.11+**. 
Install from PyPI:\n\n```bash\npip install rfmix-reader\n```\n\n### Installation Options\n\n* **Basic install** (CPU only):\n\n  ```bash\n  pip install rfmix-reader\n  ```\n\n* **With GPU acceleration** (`cupy`, `cudf`, `dask-cudf`, `torch`):\n\n  ```bash\n  pip install \"rfmix-reader[gpu]\"\n  ```\n\n  The default GPU extra targets **CUDA 12** wheels (`cupy-cuda12x`,\n  `cudf-cu12`, `dask-cudf-cu12`). Use the appropriate CUDA build strings if\n  your environment requires a different version.\n\n* **With documentation tools** (`sphinx`, `sphinx-rtd-theme`):\n\n  ```bash\n  pip install \"rfmix-reader[docs]\"\n  ```\n\n* **With testing tools** (`pytest`):\n\n  ```bash\n  pip install \"rfmix-reader[tests]\"\n  ```\n\n### GPU Notes\n\n* PyTorch is **not** installed in the base package. It is pulled in only when you install the `[gpu]` extra (`pip install \"rfmix-reader[gpu]\"`) or add it yourself.\n* Install a PyTorch wheel that matches your platform and CUDA version using the [official selector](https://pytorch.org/get-started/locally/). For example:\n  * **CUDA 12 (Linux/Windows):** `pip install torch --index-url https://download.pytorch.org/whl/cu121`\n  * **CUDA 11 (Linux/Windows):** `pip install torch --index-url https://download.pytorch.org/whl/cu118`\n  * **CPU-only:** `pip install torch --index-url https://download.pytorch.org/whl/cpu`\n* RAPIDS (`cudf`, `cupy`) wheels are version- and CUDA-specific. 
See the [RAPIDS install guide](https://docs.rapids.ai/install).\n* CPU-only installations will still run efficiently, just without GPU acceleration.\n\n---\n\n## Quickstart\n\n```python\nfrom rfmix_reader import read_rfmix\n\n# Load RFMix outputs (two-population admixture example)\nfile_path = \"examples/two_populations/out/\"\nloci_df, g_anc, local_array = read_rfmix(file_path)\n\nprint(loci_df.head())\nprint(g_anc.head())\nprint(local_array.shape)\n\n# See the phasing section below for how to phase per-chromosome outputs and\n# write them to Zarr with `phase_rfmix_chromosome_to_zarr`.\n```\n\n---\n\n## Key Features\n\n* **Lazy Loading**: Reads data on-the-fly, reducing memory footprint.\n* **Efficient Access**: Query specific loci or regions of interest.\n* **Seamless Integration**: Works smoothly with `pandas`, `dask`, and other analysis tools.\n* **Loci Imputation**: Impute local ancestry loci to dense genotype variant sites.\n* **GPU Acceleration**: Automatic CUDA acceleration via PyTorch/CuPy when available.\n\n---\n\n## Simulation Data\n\nTest datasets for two- and three-population admixture are available on Synapse:\n[Synapse Project syn61691659](https://www.synapse.org/Synapse:syn61691659).\n\n---\n\n## Usage\n\n### Binary Conversion\n\nRFMix does not generate binary files directly.\nUse `create_binaries` to generate them (also available as a CLI):\n\n```bash\ncreate-binaries two_pops/out/\n```\n\n```python\nfrom rfmix_reader import create_binaries\n\ncreate_binaries(\"two_pops/out/\", binary_dir=\"./binary_files\")\n```\n\n### Preparing Reference Data for Phasing\n\nUse `prepare-reference` to convert bgzipped, indexed reference VCF/BCF files\ninto per-chromosome VCF-Zarr stores that the phasing pipeline consumes.\nThe command writes one `\u003cchrom\u003e.zarr` directory per input file.\n\n```bash\nprepare-reference -h\n```\n\n```\nusage: prepare-reference [-h] [--chunk-length CHUNK_LENGTH]\n                         [--samples-chunk-size 
SAMPLES_CHUNK_SIZE]\n                         [--worker-processes WORKER_PROCESSES]\n                         [--verbose | --no-verbose] [--version]\n                         output_dir vcf_paths [vcf_paths ...]\n\nConvert one or more bgzipped reference VCF/BCF files into Zarr stores.\n\npositional arguments:\n  output_dir            Directory where the Zarr outputs will be written.\n  vcf_paths             Paths to reference VCF/BCF files (bgzipped and\n                        indexed).\n\noptions:\n  -h, --help            show this help message and exit\n  --chunk-length CHUNK_LENGTH\n                        Genomic chunk size for the output Zarr stores\n                        (default: 100000).\n  --samples-chunk-size SAMPLES_CHUNK_SIZE\n                        Chunk size for samples in the output Zarr stores\n                        (default: library chosen).\n  --worker-processes WORKER_PROCESSES\n                        Number of worker processes to use for conversion\n                        (default: 0, use library default).\n  --verbose, --no-verbose\n                        Print progress messages (default: enabled).\n  --version             Show the version of the program and exit.\n```\n\nExample data preparation:\n\n```bash\n# Sample annotations: two columns (no header): sample_id\u003cTAB\u003egroup\ncat \u003e sample_annotations.tsv \u003c\u003c'EOF'\nNA19700\tAFR\nNA19701\tAFR\nNA20847\tEUR\nEOF\n\n# Convert chromosome VCFs into a reference store directory\nprepare-reference refs/ 1kg_chr20.vcf.gz 1kg_chr21.vcf.gz \\\n  --chunk-length 50000 --samples-chunk-size 512\n```\n\n### Main Function\n\nOnce binaries are available, process RFMix results:\n\n```python\nfrom rfmix_reader import read_rfmix\n\nloci, g_anc, admix = read_rfmix(\"two_pops/out/\")\n```\n\n### Three Population Example\n\nBinaries can also be generated on-the-fly within `read_rfmix` with\n`generate_binary` set to `True`.\n\n```python\nloci, g_anc, admix = 
read_rfmix(\"examples/three_populations/out/\",\n                                binary_dir=\"./binary_files\",\n                                generate_binary=True)\n```\n\n### Phasing data\n\nTo limit memory use and run time, phasing is done per\nchromosome.\n\n```python\nfrom rfmix_reader import phase_rfmix_chromosome_to_zarr\n\n# Use the reference store + annotations during phasing\nadmix = phase_rfmix_chromosome_to_zarr(\n    file_prefix=\"two_pops/out/\",\n    ref_zarr_root=\"refs\",\n    sample_annot_path=\"sample_annotations.tsv\",\n    output_path=\"./phased_chr21.zarr\",\n    chrom=\"21\",\n)\n```\n\nThe per-chromosome result is not chunked optimally for downstream\nprocessing, so rechunk it before use. This is only needed\nwhen loading individual chromosomes; the merged data is already\noptimized.\n\n```python\nlocal_array = admix[\"local_ancestry\"].chunk({\"variant\": 20000, \"sample\": 100})\n# Compute into memory, if needed\nlocal_array = local_array.compute()\n```\n\n`phase_rfmix_chromosome_to_zarr` also saves the data to Zarr (at `output_path`)\nfor later merging or processing:\n\n```bash\nmerge-phased-zarrs ./phased_all.zarr ./phased_chr21.zarr ./phased_chr22.zarr\n```\n\n### Loci Imputation\n\nThe imputation workflow now lives in ``rfmix_reader.processing.imputation`` and\nis exported as ``interpolate_array``. It interpolates the local ancestry matrix\nonto a denser variant grid and writes the result to ``\u003czarr_outdir\u003e/local-ancestry.zarr``\nas a Zarr array shaped ``(variants, samples, ancestries)``.\n\n**Inputs**\n\n* ``variant_loci_df``: a pandas DataFrame defining the variant grid. Provide at\n  least ``chrom``/``pos`` and an ``i`` column that points to the source RFMix\n  row index; rows with ``i`` set to ``NaN`` are treated as missing loci to\n  interpolate. 
Sort the frame by genomic coordinate, and include ``pos`` if you\n  plan to interpolate in base-pair space.\n* ``admix``: the local ancestry Dask array returned by ``read_rfmix`` (shape\n  ``(loci, samples, ancestries)``).\n* ``zarr_outdir``: an output directory where the new ``local-ancestry.zarr``\n  store will be created.\n\n**Key options**\n\n* ``interpolation``: ``\"linear\"`` (default), ``\"nearest\"``, or ``\"stepwise\"``.\n* ``use_bp_positions``: set to ``True`` to interpolate along ``variant_loci_df['pos']``\n  rather than treating loci as equally spaced indices.\n* ``chunk_size``/``batch_size``: tune how many rows are materialized at a time\n  when filling and interpolating the Zarr array.\n\n**Workflow example**\n\n```python\nimport pandas as pd\nfrom pathlib import Path\nfrom rfmix_reader import interpolate_array, read_rfmix\n\n# Load RFMix loci and local ancestry\nloci_df, _, admix = read_rfmix(\"two_pops/out/\", binary_dir=\"./binary_files\")\n\n# Build the variant grid by merging genotype sites with the RFMix loci index\nvariants = pd.read_parquet(\"genotypes/variants.parquet\")  # must include chrom/pos\nvariants = variants.drop_duplicates(subset=[\"chrom\", \"pos\"]).sort_values(\"pos\")\nvariant_loci_df = (\n    variants.merge(loci_df.to_pandas(), on=[\"chrom\", \"pos\"], how=\"outer\", indicator=True)\n            .loc[:, [\"chrom\", \"pos\", \"i\", \"_merge\"]]\n)\n\nz = interpolate_array(\n    variant_loci_df,\n    admix,\n    zarr_outdir=Path(\"./imputed_local_ancestry\"),\n    interpolation=\"linear\",\n    use_bp_positions=True,\n    chunk_size=50_000,\n)\nprint(z)\n```\n\nThe interpolator uses GPU acceleration transparently when ``cupy`` and a CUDA\nPyTorch build are available; otherwise it falls back to NumPy. 
All methods other\nthan ``\"hmm\"`` operate on diploid-summed trajectories and preserve the original\nancestry dimension.\n\n### Phasing workflow (`rfmix_reader.processing.phase`)\n\n`rfmix_reader.processing.phase` implements gnomix-style tail-flip corrections\nfor local ancestry haplotypes. Phasing now lives outside `read_rfmix` so you can\nprocess each chromosome independently and write outputs directly to Zarr.\n\n#### Required reference inputs\n\n* **VCF-Zarr reference** (`ref_zarr_root`): either a single `*.zarr` store or a\n  directory containing per-chromosome stores (e.g., `1.zarr`, `chr1.zarr`).\n* **Sample annotations** (`sample_annot_path`): two-column file mapping\n  `sample_id` to `group` (ancestry label). One representative sample per group\n  is pulled to build reference haplotypes.\n\n#### Per-chromosome pipeline\n\n1. (Optional) generate RFMix binary caches with `create_binaries`.\n2. Call `phase_rfmix_chromosome_to_zarr` for each chromosome you want to\n   process.\n3. 
Optionally concatenate those per-chromosome Zarr stores with\n   `merge_phased_zarrs`.\n\n```python\nfrom rfmix_reader.processing.phase import (\n    PhasingConfig,\n    merge_phased_zarrs,\n    phase_rfmix_chromosome_to_zarr,\n)\n\n# Phase chromosome 21 and write to Zarr\ndataset = phase_rfmix_chromosome_to_zarr(\n    file_prefix=\"examples/two_populations/out/\",\n    ref_zarr_root=\"/refs/1kg_chr_zarr/\",\n    sample_annot_path=\"/refs/1kg_annotations.tsv\",\n    output_path=\"/tmp/phased_chr21.zarr\",\n    chrom=\"21\",\n)\n\n# Merge multiple per-chromosome Zarr stores\nmerged = merge_phased_zarrs(\n    [\"/tmp/phased_chr21.zarr\", \"/tmp/phased_chr22.zarr\"],\n    output_path=\"/tmp/phased_all.zarr\",\n)\n```\n\nIf you need fine-grained control, you can still start from unphased outputs and\ncall `phase_admix_dask_with_index` directly:\n\n```python\nfrom rfmix_reader import read_rfmix\nfrom rfmix_reader.processing.phase import PhasingConfig, phase_admix_dask_with_index\n\nloci_df, g_anc, admix, X_raw = read_rfmix(\n    \"examples/two_populations/out/\",\n    return_original=True,\n    chrom=\"21\",\n)\n\nconfig = PhasingConfig(window_size=100, min_block_len=10, max_mismatch_frac=0.3)\nphased = phase_admix_dask_with_index(\n    admix=admix,\n    X_raw=X_raw,\n    positions=loci_df.physical_position.to_numpy(),\n    chrom=str(loci_df.chromosome.iloc[0]),\n    ref_zarr_root=\"/refs/1kg_chr_zarr/\",\n    sample_annot_path=\"/refs/1kg_annotations.tsv\",\n    config=config,\n)\n```\n\n### Reading Haptools simulations\n\nUse `read_simu` to load BGZF-compressed VCF files created by\n`haptools simgenotype --pop_field`:\n\n```python\nfrom rfmix_reader import read_simu\n\nloci_df, g_anc, admix = read_simu(\"/path/to/simulations/\")\n```\n\nHaptools does **not** include the chromosome length in the `##contig`\nheader lines, but `read_simu` requires that metadata to index each VCF.\nCopy the `contigs.txt` file Haptools generates from the FASTA you used\nfor simulation and 
reheader every file with the appropriate contig entry\nbefore calling `read_simu`. The following snippet shows one approach\nusing `bcftools` and `tabix`:\n\n```bash\nCONTIGS=\"../../three_populations/_m/contigs.txt\"\nVCFDIR=\"gt-files\"\nCHR=\"chr${SLURM_ARRAY_TASK_ID}\"\nOUT=\"${VCFDIR}/${CHR}.vcf.gz\"\nIN=\"${VCFDIR}/back/${CHR}.vcf.gz\"\n\nCONTIG_LINE=$(grep -w \"ID=${CHR}\" \"$CONTIGS\")\nif [[ -z \"$CONTIG_LINE\" ]]; then\n    echo \"ERROR: No contig line found for ${CHR} in $CONTIGS\"\n    exit 1\nfi\n\nbcftools view -h \"$IN\" \\\n    | sed \"s/^##contig=\u003cID=${CHR}\u003e.*/${CONTIG_LINE}/\" \u003e header.${CHR}.tmp\nbcftools reheader -h header.${CHR}.tmp -o \"$OUT\" \"$IN\"\ntabix -p vcf \"$OUT\"\n```\n\n### Visualization\n\n`read_rfmix`, `read_flare`, and `read_simu` all return the same\n`(loci_df, g_anc, admix)` tuple, so the plotting utilities in\n`rfmix_reader._visualization` work identically for RFMix, FLARE, and\nHaptools-simulated inputs. The snippet below shows the typical workflow\nfor each reader:\n\n```python\nfrom rfmix_reader import (\n    plot_ancestry_by_chromosome,\n    plot_global_ancestry,\n    read_flare,\n    read_rfmix,\n    read_simu,\n)\n\n# RFMix run directory\nloci_df, g_anc, admix = read_rfmix(\"two_pops/out/\")\nplot_global_ancestry(g_anc, save_path=\"rfmix_global.png\")\nplot_ancestry_by_chromosome(loci_df, admix, save_path=\"rfmix_local.png\")\n\n# FLARE output directory (contains *.anc.vcf.gz + global.anc.gz)\nloci_df, g_anc, admix = read_flare(\"flare_runs/chr1/\")\nplot_global_ancestry(g_anc, save_path=\"flare_global.png\")\nplot_ancestry_by_chromosome(loci_df, admix, save_path=\"flare_local.png\")\n\n# Haptools simulations (after reheadering contigs)\nloci_df, g_anc, admix = read_simu(\"/path/to/simulations/\")\nplot_global_ancestry(g_anc, save_path=\"simu_global.png\")\nplot_ancestry_by_chromosome(loci_df, admix, save_path=\"simu_local.png\")\n```\n\n`plot_global_ancestry` builds per-individual stacked bars of 
global\nancestry while `plot_ancestry_by_chromosome` summarizes local ancestry\nalong each chromosome, giving you quick visual QC for every supported\ninput format.\n\n---\n\n## Development Install\n\nFor contributors:\n\n```bash\ngit clone https://github.com/heart-gen/rfmix_reader.git\ncd rfmix_reader\npip install -e \".[gpu,docs,tests]\"\n```\n\n---\n\n## Citation\n\nIf you use this software, please cite:\n\n[![DOI](https://zenodo.org/badge/807052842.svg)](https://zenodo.org/doi/10.5281/zenodo.12629787)\n\nBenjamin, K. J. M. (2024). **RFMix-reader (Version 0.2.0)** \\[Computer software].\n[https://github.com/heart-gen/rfmix\\_reader](https://github.com/heart-gen/rfmix_reader)\n\nKynon JM Benjamin. *\"RFMix-reader: Accelerated reading and processing for local ancestry studies.\"*\n**bioRxiv** (2024).\nDOI: [10.1101/2024.07.13.603370](https://www.biorxiv.org/content/10.1101/2024.07.13.603370v2).\n\n---\n\n## Funding\n\nThis work was supported by the National Institutes of Health,\nNational Institute on Minority Health and Health Disparities (NIMHD)\nK99MD016964 / R00MD016964.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheart-gen%2Frfmix_reader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fheart-gen%2Frfmix_reader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheart-gen%2Frfmix_reader/lists"}