{"id":15118906,"url":"https://github.com/pfizer-opensource/bigwig-loader","last_synced_at":"2025-12-24T16:34:06.003Z","repository":{"id":197466753,"uuid":"698534038","full_name":"pfizer-opensource/bigwig-loader","owner":"pfizer-opensource","description":"A fast dataloader for bigwig files made for machine learning","archived":false,"fork":false,"pushed_at":"2024-09-30T08:09:55.000Z","size":292,"stargazers_count":24,"open_issues_count":0,"forks_count":1,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-12-17T20:48:23.986Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pfizer-opensource.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-30T07:41:28.000Z","updated_at":"2024-09-30T08:09:34.000Z","dependencies_parsed_at":"2023-10-11T15:33:13.616Z","dependency_job_id":"38eaa51f-4129-474d-a2db-537f6b914730","html_url":"https://github.com/pfizer-opensource/bigwig-loader","commit_stats":null,"previous_names":["pfizer-opensource/bigwig-loader-opensource"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pfizer-opensource%2Fbigwig-loader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pfizer-opensource%2Fbigwig-loader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pfizer-opensource%2Fbigwig-loader/releases","manifests_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/pfizer-opensource%2Fbigwig-loader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pfizer-opensource","download_url":"https://codeload.github.com/pfizer-opensource/bigwig-loader/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234475315,"owners_count":18839358,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-26T01:53:40.088Z","updated_at":"2025-12-24T16:34:05.996Z","avatar_url":"https://github.com/pfizer-opensource.png","language":"Python","funding_links":[],"categories":["Ranked by starred repositories","Projects using Pixi"],"sub_categories":["Deep Learning / Machine Learning"],"readme":"# :lollipop: Epigenetics Dataloader for BigWig files\n\n[![Tests](https://github.com/pfizer-opensource/bigwig-loader/actions/workflows/tests.yml/badge.svg)](https://github.com/pfizer-opensource/bigwig-loader/actions/workflows/tests.yml)\n[![Code Quality](https://github.com/pfizer-opensource/bigwig-loader/actions/workflows/run-commit-hooks.yml/badge.svg)](https://github.com/pfizer-opensource/bigwig-loader/actions/workflows/run-commit-hooks.yml)\n\nFast, GPU-powered batched dataloading of BigWig files containing epigenetic track data and corresponding sequences\nfor deep learning applications.\n\n\u003e ⚠️ **BREAKING CHANGE (v0.3.0+)**: The output matrix dimensionality has changed from `(n_tracks, batch_size, sequence_length)` to `(batch_size, sequence_length, n_tracks)`. 
This change was long overdue and eliminates the need for (potentially memory expensive) transpose operations downstream. If you're upgrading from an earlier version, please update your code accordingly (you probably need to delete one transpose in your code).\n\n\u003e ✨ **NEW FEATURE (v0.3.0+)**: Full `bfloat16` support! You can now specify `dtype=\"bfloat16\"` to get output tensors in bfloat16 format, reducing memory usage by 50% compared to float32.\n\n\u003e ⚠️ **Cupy and bfloat16 support**\nBecause cupy does not support bfloat16 yet, the cupy array is typed as uint64, but the actual data behind it is in bfloat16. So converting the array to a tensor in a framework that DOES support bfloat16, like PyTorch, TensorFlow or JAX, should be followed by a \"view\" operation that only changes how the underlying bytes are interpreted (rather than actually casting to bfloat16, which would change the underlying data). In the *bigwig_loader.pytorch.PytorchBigWigDataset* this has already been done for you (when you set dtype=\"bfloat16\").\n\n\n\n\n## Quickstart\n\n### Installation with Pixi\nUsing [pixi](https://pixi.sh/) to install bigwig-loader is highly recommended.\nPlease take a look at this example pixi.toml:\n\n```toml\n[workspace]\nchannels = [\"rapidsai\", \"conda-forge\", \"nvidia\", \"bioconda\", \"dataloading\"]\nname = \"bigwig-loader\"\nplatforms = [\"linux-64\"]\nversion = \"0.1.0\"\n\n[tasks]\ndownload-example-data = { cmd = \"python -m bigwig_loader.download_example_data\"}\n\n[feature.bigwig-loader.system-requirements]\ncuda = \"12\"\n\n[dependencies]\npython = \"==3.11\"\npip = \"*\"\n\n[feature.bigwig-loader.dependencies]\ncuda-version = \"12.8.*\"\npytorch-gpu = \"\u003e=2.6\"\ncuda-nvcc = \"*\"\nkvikio = \"\u003c=25.08.00\"\nbigwig-loader = \"*\"\nnumpy = \"*\"\npandas = \"*\"\n\n[pypi-dependencies]\npython-dotenv = \"*\"\npydantic = \"*\"\npydantic-settings = \"*\"\nuniversal-pathlib = \"*\"\nfsspec = { version = \"*\" }\ns3fs = \"*\"\npyfaidx = \"*\"\nnumcodecs 
=\"*\"\n\n[environments]\ndefault = {features = [\"bigwig-loader\"]}\n```\n\n\nIf you just want to use bigwig-loader,\ncopy that into a pixi.toml file and add the other libraries you need\n(you don't need to clone this repo; pixi will download bigwig-loader from the\nconda \"dataloading\" channel):\n\n*   Install pixi, if not installed:\n    ```shell\n    curl -fsSL https://pixi.sh/install.sh | sh\n    ```\n\n*   Change directory to wherever you put the pixi.toml, and run:\n    ```shell\n    pixi run \u003cmy_training_command\u003e\n    ```\n\n\nThe pixi.toml I included in this repository works for both the released version and for development of bigwig-loader, but assumes you cloned this repo.\n\n\n### Installation with conda/mamba\n\nAlternatively, bigwig-loader can be installed using conda/mamba. To create a new environment with bigwig-loader\ninstalled:\n\n```shell\nmamba create -n my-env -c rapidsai -c conda-forge -c bioconda -c dataloading bigwig-loader\n```\n\nOr add this to your environment.yml file:\n\n```yaml\nname: my-env\nchannels:\n  - rapidsai\n  - conda-forge\n  - bioconda\n  - dataloading\ndependencies:\n    - bigwig-loader\n```\n\nand update:\n\n```shell\nmamba env update -f environment.yml\n```\n\n### Installation with pip\nBigwig-loader can also be installed using pip in an environment which has the rapidsai kvikio library\nand cupy installed already:\n\n```shell\npip install bigwig-loader\n```\n\n### PyTorch Example\nWe wrapped the BigWigDataset in a PyTorch iterable dataset that you can directly use:\n\n```python\n# examples/pytorch_example.py\nimport pandas as pd\nimport torch\nfrom torch.utils.data import DataLoader\nfrom bigwig_loader import config\nfrom bigwig_loader.pytorch import PytorchBigWigDataset\nfrom bigwig_loader.download_example_data import download_example_data\n\n# Download example data to play with\ndownload_example_data()\nexample_bigwigs_directory = config.bigwig_dir\nreference_genome_file = 
config.reference_genome\n\ntrain_regions = pd.DataFrame({\"chrom\": [\"chr1\", \"chr2\"], \"start\": [0, 0], \"end\": [1000000, 1000000]})\n\ndataset = PytorchBigWigDataset(\n    regions_of_interest=train_regions,\n    collection=example_bigwigs_directory,\n    reference_genome_path=reference_genome_file,\n    sequence_length=1_000_000,\n    center_bin_to_predict=500_000,\n    window_size=1,\n    batch_size=1,\n    super_batch_size=4,\n    batches_per_epoch=100,\n    maximum_unknown_bases_fraction=0.1,\n    sequence_encoder=\"onehot\",\n    n_threads=4,\n    return_batch_objects=True,\n    dtype=\"bfloat16\"\n)\n\n# Don't use num_workers \u003e 0 in DataLoader. The heavy\n# lifting/parallelism is done on cuda streams on the GPU.\ndataloader = DataLoader(dataset, num_workers=0, batch_size=None)\n\n\nclass MyTerribleModel(torch.nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.linear = torch.nn.Linear(4, 2)\n\n    def forward(self, batch):\n        return self.linear(batch)\n\n\nmodel = MyTerribleModel()\noptimizer = torch.optim.SGD(model.parameters(), lr=0.01)\n\ndef poisson_loss(pred, target):\n    return (pred - target * torch.log(pred.clamp(min=1e-8))).mean()\n\nfor batch in dataloader:\n    # batch.sequences.shape = n_batch x sequence_length x onehot encoding (4)\n    pred = model(batch.sequences)\n    # batch.values.shape = n_batch x center_bin_to_predict x n_tracks\n    loss = poisson_loss(pred[:, 250000:750000, :], batch.values)\n    print(loss)\n    optimizer.zero_grad()\n    loss.backward()\n    optimizer.step()\n```\n\n### Other frameworks\n\nA framework agnostic Dataset object can be imported from `bigwig_loader.dataset`. This dataset object\nreturns cupy tensors. 
Cupy tensors adhere to the cuda array interface and can be zero-copy transformed\nto JAX or tensorflow tensors.\n\n```python\nfrom bigwig_loader.dataset import BigWigDataset\n\ndataset = BigWigDataset(\n    regions_of_interest=train_regions,\n    collection=example_bigwigs_directory,\n    reference_genome_path=reference_genome_file,\n    sequence_length=1000,\n    center_bin_to_predict=500,\n    window_size=1,\n    batch_size=32,\n    super_batch_size=1024,\n    batches_per_epoch=20,\n    maximum_unknown_bases_fraction=0.1,\n    sequence_encoder=\"onehot\",\n)\n\n```\nSee the examples directory for more examples.\n\n## Background\n\nThis library is meant for loading batches of data with the same dimensionality, which allows for some assumptions that can\nspeed up the loading process. As can be seen from the plot below, when loading a small amount of data, pyBigWig is very fast,\nbut does not exploit the batched nature of data loading for machine learning.\n\nIn the benchmark below we also created PyTorch dataloaders (with set_start_method('spawn')) using pyBigWig to compare to\nthe realistic scenario where multiple CPUs would be used per GPU. We see that the throughput of the CPU dataloader does\nnot go up linearly with the number of CPUs, and therefore it becomes hard to get the needed throughput to keep the GPU,\nwhich trains the neural network, saturated during the learning steps.\n\n\n![benchmark.png](images%2Fbenchmark.png)\n\nThis is the problem bigwig-loader solves. This is an example of how to use bigwig-loader:\n\n### Installation\n\n1. `git clone git@github.com:pfizer-opensource/bigwig-loader`\n2. `cd bigwig-loader`\n3. create the conda environment: `conda env create -f environment.yml`\n\nIn this environment you should be able to run `pytest -v` and see the tests\nsucceed. NOTE: you need a GPU to use bigwig-loader!\n\n## Development\n\nThis section guides you through the steps needed to add new functionality. 
If\nanything is unclear, please open an issue.\n\n### Environment\n\nThe pixi.toml includes a dev environment that has bigwig-loader installed\nas an editable pypi dependency.\n\n1. `git clone git@github.com:pfizer-opensource/bigwig-loader`\n2. `cd bigwig-loader`\n3. optional: `pixi install -e dev`\n4. run `pre-commit install` to install the pre-commit hooks\n\n### Run Tests\nTests are in the tests directory. One of the most important tests is\ntest_against_pybigwig which makes sure that if there is a mistake in\npyBigWig, it is also in bigwig-loader.\n\nTo run these tests you need a GPU.\n\n```shell\npixi run -e dev test\n```\n\nWhen GitHub runners with GPUs become available, we would also\nlike to run these tests in the CI. But for now, you can run them locally.\n\n\n## Citing\n\nIf you use this library, consider citing:\n\nRetel, Joren Sebastian, Andreas Poehlmann, Josh Chiou, Andreas Steffen, and Djork-Arné Clevert. “A Fast Machine Learning Dataloader for Epigenetic Tracks from BigWig Files.” Bioinformatics 40, no. 1 (January 1, 2024): btad767. https://doi.org/10.1093/bioinformatics/btad767.\n\n```bibtex\n@article{\n    retel_fast_2024,\n    title = {A fast machine learning dataloader for epigenetic tracks from {BigWig} files},\n    volume = {40},\n    issn = {1367-4811},\n    url = {https://doi.org/10.1093/bioinformatics/btad767},\n    doi = {10.1093/bioinformatics/btad767},\n    abstract = {We created bigwig-loader, a data-loader for epigenetic profiles from BigWig files that decompresses and processes information for multiple intervals from multiple BigWig files in parallel. This is an access pattern needed to create training batches for typical machine learning models on epigenetics data. 
Using a new codec, the decompression can be done on a graphical processing unit (GPU) making it fast enough to create the training batches during training, mitigating the need for saving preprocessed training examples to disk. The bigwig-loader installation instructions and source code can be accessed at https://github.com/pfizer-opensource/bigwig-loader},\n    number = {1},\n    urldate = {2024-02-02},\n    journal = {Bioinformatics},\n    author = {Retel, Joren Sebastian and Poehlmann, Andreas and Chiou, Josh and Steffen, Andreas and Clevert, Djork-Arné},\n    month = jan,\n    year = {2024},\n    pages = {btad767},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpfizer-opensource%2Fbigwig-loader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpfizer-opensource%2Fbigwig-loader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpfizer-opensource%2Fbigwig-loader/lists"}