https://github.com/mcvickerlab/GenVarLoader
Pipeline for efficient genomic data processing.
- Host: GitHub
- URL: https://github.com/mcvickerlab/GenVarLoader
- Owner: mcvickerlab
- License: MIT
- Created: 2022-04-06T02:07:58.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2024-04-13T01:56:58.000Z (about 1 year ago)
- Last Synced: 2024-04-13T21:46:39.813Z (about 1 year ago)
- Language: Python
- Size: 132 MB
- Stars: 8
- Watchers: 1
- Forks: 2
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-dl4g - genome-loader - Pipeline for efficient genomic data processing. (Software packages / Data wrangling)
README
GenVarLoader provides a fast, memory efficient data loader for training sequence models on genetic variation. For example, this can be used to train a DNA language model on human genetic variation (e.g. [Nucleotide Transformer](https://www.biorxiv.org/content/10.1101/2023.01.11.523679)) or train sequence to function models with genetic variation (e.g. [BigRNA](https://www.biorxiv.org/content/10.1101/2023.09.20.558508v1)).
## Features
- Avoids writing any sequences to disk
- Works with datasets that are larger than RAM
- Generates haplotypes up to 1,000 times faster than reading a FASTA file
- Generates tracks up to 450 times faster than reading a BigWig
- Supports indels and re-aligns tracks to haplotypes that have them
- Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig.

## Tutorial
### Installation
```bash
pip install genvarloader
```

A PyTorch dependency is not included since it may require [special instructions](https://pytorch.org/get-started/locally/).
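For example, at the time of writing a CUDA-enabled build can typically be installed with a command along these lines; the exact index URL depends on your platform and CUDA version, so check the linked instructions rather than copying this verbatim:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
```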
### Write a `gvl.Dataset`
GenVarLoader has both a CLI and Python API for writing datasets. The Python API provides some extra flexibility, for example for a multi-task objective.
```bash
genvarloader cool_dataset.gvl interesting_regions.bed --variants cool_variants.vcf --bigwig-table samples_to_bigwigs.csv --length 2048 --max-jitter 128
```

Here, `samples_to_bigwigs.csv` has columns `sample` and `path`, mapping each sample to its BigWig.
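For example, the table might look like this (the sample names match the tutorial below; the paths are hypothetical):
```
sample,path
David,/data/bigwigs/David.bw
Aaron,/data/bigwigs/Aaron.bw
```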
This could equivalently be done in Python as:
```python
import genvarloader as gvl

gvl.write(
path="cool_dataset.gvl",
bed="interesting_regions.bed",
variants="cool_variants.vcf",
bigwigs=gvl.BigWigs.from_table("bigwig", "samples_to_bigwigs.csv"),
length=2048,
max_jitter=128,
)
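
# The "extra flexibility" of the Python API could, for instance, be used for a
# multi-task objective by writing several named track sets at once. The call below
# is a hypothetical sketch only: it assumes `bigwigs` accepts a list of gvl.BigWigs
# readers and that a second table "samples_to_atac_bigwigs.csv" exists; check the
# GenVarLoader API docs for the exact signature.
#
# gvl.write(
#     path="cool_multitask_dataset.gvl",
#     bed="interesting_regions.bed",
#     variants="cool_variants.vcf",
#     bigwigs=[
#         gvl.BigWigs.from_table("rna", "samples_to_bigwigs.csv"),
#         gvl.BigWigs.from_table("atac", "samples_to_atac_bigwigs.csv"),
#     ],
#     length=2048,
#     max_jitter=128,
# )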
```

### Open a `gvl.Dataset` and get a PyTorch DataLoader
```python
import genvarloader as gvl

dataset = gvl.Dataset.open(path="cool_dataset.gvl", reference="hg38.fa")
train_samples = ["David", "Aaron"]
train_dataset = dataset.subset_to(regions="train_regions.bed", samples=train_samples)
train_dataloader = train_dataset.to_dataloader(batch_size=32, shuffle=True, num_workers=1)

# use it in your training loop
for haplotypes, tracks in train_dataloader:
    ...
```

### Inspect specific instances
```python
dataset[99] # 100th instance of the raveled dataset
dataset[0, 9] # first region, 10th sample
dataset.isel(regions=0, samples=9)
dataset.sel(regions=dataset.get_bed()[0], samples=dataset.samples[9])
dataset[:10] # first 10 instances
dataset[:10, :5] # first 10 regions and 5 samples
```

### Transform the data on-the-fly
```python
import seqpro as sp
from einops import rearrange

def transform(haplotypes, tracks):
    ohe = sp.DNA.ohe(haplotypes)
    ohe = rearrange(ohe, "batch length alphabet -> batch alphabet length")
    return ohe, tracks

transformed_dataset = dataset.with_settings(transform=transform)
```

### Pre-computing transformed tracks
Suppose we want to return tracks that are the z-scored, log(CPM + 1) version of the original. Sometimes it is better to write this to disk to avoid having to recompute it during training or inference.
```python
import numpy as np

# We'll assume we already have an array of total counts for each sample.
# This usually can't be derived from a gvl.Dataset since it only has data for specific regions.
total_counts = np.load('total_counts.npy')  # shape: (samples) float32
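# If you don't already have total_counts.npy, one way to produce it is to sum coverage
# genome-wide from each sample's BigWig. This is only a sketch: it assumes pyBigWig is
# installed and that the BigWigs store raw coverage, so the genome-wide sum is a
# reasonable stand-in for total counts.
#
#   import pandas as pd
#   import pyBigWig
#
#   table = pd.read_csv("samples_to_bigwigs.csv")
#   total_counts = np.array(
#       [pyBigWig.open(p).header()["sumData"] for p in table["path"]], dtype=np.float32
#   )
#   np.save("total_counts.npy", total_counts)

# We'll compute the mean and std log(CPM + 1) using the training split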
means = np.empty((train_dataset.n_regions, train_dataset.region_length), np.float32)
stds = np.empty_like(means)
just_tracks = train_dataset.with_settings(return_sequences=False, jitter=0)
for region in range(len(means)):
    cpm = np.log1p(just_tracks[region, :] / total_counts[:, None] * 1e6)
    means[region] = cpm.mean(0)
    stds[region] = cpm.std(0)

# Define our transformation
def z_log_cpm(dataset_indices, region_indices, sample_indices, tracks: gvl.Ragged[np.float32]):
    # In the event that the dataset only has SNPs, the full length tracks will all be the same length.
    # So, we can reshape the ragged data into a regular array.
    _tracks = tracks.data.reshape(-1, dataset.region_length)
    # Otherwise, we would have to leave `tracks` as a gvl.Ragged array to accommodate different lengths.
    # In that case, we could do the transformation with a Numba compiled function instead.

    # original tracks -> log(CPM + 1) -> z-score
    _tracks = np.log1p(_tracks / total_counts[sample_indices, None] * 1e6)
    _tracks = (_tracks - means[region_indices]) / stds[region_indices]
    return gvl.Ragged.from_offsets(_tracks.ravel(), tracks.shape, tracks.offsets)
# This can take about as long as writing the original tracks or longer, depending on the transformation.
dataset_with_zlogcpm = dataset.write_transformed_track("z-log-cpm", "bigwig", transform=z_log_cpm)

# The dataset now has both tracks available, "bigwig" and "z-log-cpm", and we can choose to return either one or both.
haps_and_zlogcpm = dataset_with_zlogcpm.with_settings(return_tracks="z-log-cpm")

# If we re-opened the dataset after running this, then we could write...
dataset = gvl.Dataset.open("cool_dataset.gvl", "hg38.fa", return_tracks="z-log-cpm")
```

## Performance tips
- GenVarLoader uses multithreading extensively, so it's best to use 0 or 1 workers with your PyTorch `DataLoader`.
- A GenVarLoader `Dataset` is most efficient when given batches of indices rather than one index at a time. A PyTorch `DataLoader` uses one index at a time by default, so if you want to use a ***custom*** PyTorch `Sampler`, wrap it in a PyTorch `BatchSampler` before passing it to `Dataset.to_dataloader()`, as sketched below.
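For instance (a sketch only: it assumes `to_dataloader()` forwards a `sampler` argument to the underlying PyTorch `DataLoader`, and the weights here are made up):

```python
import torch
from torch.utils.data import BatchSampler, WeightedRandomSampler

# Hypothetical per-instance weights for the training dataset.
weights = torch.ones(len(train_dataset))
sampler = WeightedRandomSampler(weights, num_samples=len(train_dataset))

# Wrap the custom sampler so the Dataset receives whole batches of indices at once.
batch_sampler = BatchSampler(sampler, batch_size=32, drop_last=False)

# Assumes to_dataloader() passes sampler-related kwargs through to torch.utils.data.DataLoader.
train_dataloader = train_dataset.to_dataloader(sampler=batch_sampler)
```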