https://github.com/mcvickerlab/GenVarLoader
Dataloader for applying sequence models to personalized genomics
- Host: GitHub
- URL: https://github.com/mcvickerlab/GenVarLoader
- Owner: mcvickerlab
- License: mit
- Created: 2022-04-06T02:07:58.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2025-05-02T23:25:46.000Z (2 months ago)
- Last Synced: 2025-05-03T00:20:04.873Z (2 months ago)
- Language: Python
- Homepage: https://genvarloader.readthedocs.io/en/latest/
- Size: 133 MB
- Stars: 25
- Watchers: 2
- Forks: 4
- Open Issues: 14
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-dl4g - genome-loader - Pipeline for efficient genomic data processing. (Software packages / Data wrangling)
README
## Features
GenVarLoader provides a fast, memory-efficient data structure for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. [Dalla-Torre et al.](https://www.biorxiv.org/content/10.1101/2023.01.11.523679)) or to train sequence-to-function models with genetic variation (e.g. [Celaj et al.](https://www.biorxiv.org/content/10.1101/2023.09.20.558508v1), [Drusinsky et al.](https://www.biorxiv.org/content/10.1101/2024.07.27.605449v1), [He et al.](https://www.biorxiv.org/content/10.1101/2024.10.15.618510v1), and [Rastogi et al.](https://www.biorxiv.org/content/10.1101/2024.09.23.614632v1)).
- Avoid writing any sequences to disk (can save >2,000x storage vs. writing personalized genomes with bcftools consensus)
- Generate haplotypes up to 1,000 times faster than reading a FASTA file
- Generate tracks up to 450 times faster than reading a BigWig
- **Supports indels** and re-aligns tracks to haplotypes that have them
- Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig

Documentation is available [here](https://genvarloader.readthedocs.io/). See our [preprint](https://www.biorxiv.org/content/10.1101/2025.01.15.633240) for benchmarking and implementation details.
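
To make the indel point above concrete, here is a minimal pure-Python sketch of the general idea: applying variants to a reference to get a haplotype, and shifting a per-base track so it stays aligned with that haplotype. This is not GenVarLoader's API or implementation. The function names `apply_variants` and `realign_track` are hypothetical, and the fill rule for inserted bases (repeat the value of the preceding shared base) is an assumption chosen for illustration; see the preprint for what the library actually does.

```python
def apply_variants(reference, variants):
    """Build a haplotype from a reference string.

    `variants` is a list of (pos, ref, alt) tuples with 0-based positions.
    Variants are assumed not to overlap (an assumption of this sketch).
    """
    out = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        # Sanity check: the REF allele must match the reference.
        assert reference[pos:pos + len(ref)] == ref
        out.append(reference[cursor:pos])  # unchanged bases before the variant
        out.append(alt)                    # substitute the ALT allele
        cursor = pos + len(ref)            # skip the REF allele
    out.append(reference[cursor:])
    return "".join(out)


def realign_track(track, variants):
    """Shift per-base track values into haplotype coordinates.

    Deleted bases lose their values; inserted bases repeat the value of the
    last base shared by REF and ALT (illustrative assumption).
    """
    out = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        out.extend(track[cursor:pos])
        shared = min(len(ref), len(alt))
        out.extend(track[pos:pos + shared])
        if len(alt) > len(ref):
            fill = track[pos + shared - 1] if shared > 0 else 0.0
            out.extend([fill] * (len(alt) - len(ref)))
        cursor = pos + len(ref)
    out.extend(track[cursor:])
    return out


reference = "ACGTACGT"
track = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
# One 2 bp insertion at position 1 and one 1 bp deletion at position 5.
variants = [(1, "C", "CTT"), (5, "CG", "C")]

haplotype = apply_variants(reference, variants)
aligned = realign_track(track, variants)
print(haplotype, aligned)
```

After re-alignment the track has the same length as the haplotype, so each value can be paired with a haplotype base during training.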
## Installation
```bash
pip install genvarloader
```

A PyTorch dependency is **not** included since it may require [special instructions](https://pytorch.org/get-started/locally/).
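
Because PyTorch has to be installed separately, it can be handy to confirm that both packages are importable before starting a training run. The snippet below is a small sketch that uses only the standard library; the helper name `has_module` is hypothetical and not part of GenVarLoader.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# Report which of the two packages are available in this environment.
for name in ("genvarloader", "torch"):
    status = "found" if has_module(name) else "missing"
    print(f"{name}: {status}")
```

If `torch` is reported as missing, follow the PyTorch installation instructions linked above for your platform.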
## Contributing
1. Clone the repo.
2. Assuming you have [Pixi](https://pixi.sh/latest/), install the pre-commit hooks with `pixi run -e dev pre-commit`.
3. Activate and use the appropriate Pixi environment for your needs. A decent catch-all is `dev`, but you might need a different environment if using a GPU.

All tests are written for pytest and live under `tests/`. They ensure the code works as intended, so they must all pass before any feature is merged into `main` and subsequently released.