https://github.com/biodatageeks/polars-bio
Blazing-Fast Bioinformatic Operations on Python DataFrames
- Host: GitHub
- URL: https://github.com/biodatageeks/polars-bio
- Owner: biodatageeks
- License: apache-2.0
- Created: 2024-11-26T16:58:18.000Z (12 months ago)
- Default Branch: master
- Last Pushed: 2025-07-18T14:03:31.000Z (4 months ago)
- Last Synced: 2025-07-18T18:21:05.939Z (4 months ago)
- Topics: arrow, bioinformatics, dataframes, datafusion, genomic-intervals, genomic-ranges, genomics, pandas, polars, rust-lang
- Language: Python
- Homepage: http://biodatageeks.org/polars-bio/
- Size: 9.08 MB
- Stars: 68
- Watchers: 2
- Forks: 22
- Open Issues: 48
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-polars - polars-bio - Polars plugin for large-scale genomic analyses that is easy to use and considerably faster and more scalable than existing alternatives, by [@biodatageeks](https://github.com/biodatageeks). (Libraries/Packages/Scripts / Polars plugins)
- fucking-awesome-rust - polars-bio - Blazing-Fast Bioinformatic Operations on Python DataFrames  (Libraries / Bioinformatics)
- awesome-rust - polars-bio - Blazing-Fast Bioinformatic Operations on Python DataFrames  (Libraries / Bioinformatics)
README
# polars-bio - Next-gen Python DataFrame operations for genomics!




[Discord](https://discord.gg/bpxQ4Yxhk5)



[polars-bio](https://pypi.org/project/polars-bio/) is a Python library for genomics built on top of [polars](https://pola.rs/), [Apache Arrow](https://arrow.apache.org/) and [Apache DataFusion](https://datafusion.apache.org/).
It provides a DataFrame API for genomics data and is designed to be blazing fast, memory efficient and easy to use.
## Key Features
* optimized for [performance](https://biodatageeks.org/polars-bio/performance/) and memory [efficiency](https://biodatageeks.org/polars-bio/performance/#memory-characteristics) in analyses of large-scale genomic datasets, both when reading input data and when performing operations
* popular genomics [operations](https://biodatageeks.org/polars-bio/features/#genomic-ranges-operations) with a DataFrame API (both [Pandas](https://pandas.pydata.org/) and [polars](https://pola.rs/))
* [SQL](https://biodatageeks.org/polars-bio/features/#sql-powered-data-processing)-powered bioinformatic data querying or manipulation
* native parallel engine powered by Apache DataFusion and [sequila-native](https://github.com/biodatageeks/sequila-native)
* [out-of-core/streaming](https://biodatageeks.org/polars-bio/features/#streaming) processing (for data too large to fit into a computer's main memory) with [Apache DataFusion](https://datafusion.apache.org/) and [polars](https://pola.rs/)
* support for *federated* and *streamed* reading of data from [cloud storage](https://biodatageeks.org/polars-bio/features/#cloud-storage) (e.g. S3, GCS) with [Apache OpenDAL](https://github.com/apache/opendal), enabling processing of large-scale genomics data without materializing it in memory
* zero-copy data exchange with [Apache Arrow](https://arrow.apache.org/)
* support for bioinformatics file [formats](https://biodatageeks.org/polars-bio/features/#file-formats-support) with [noodles](https://github.com/zaeleus/noodles) and [exon](https://github.com/wheretrue/exon)
* fast overlap operations with [COITrees: Cache Oblivious Interval Trees](https://github.com/dcjones/coitrees)
* pre-built wheel packages for *Linux*, *Windows* and *macOS* (*arm64* and *x86_64*) available on [PyPI](https://pypi.org/project/polars-bio/#files)
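
To make the genomic range operations listed above concrete, the following standard-library sketch shows what an interval overlap join computes. It is an illustration only: it is not the polars-bio API, and it does not use COITrees or DataFusion. The names `Interval` and `overlap_pairs` are hypothetical, and the sketch assumes closed (inclusive) coordinates.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Interval(NamedTuple):
    """A genomic interval; closed (inclusive) coordinates are an assumption of this sketch."""
    chrom: str
    start: int
    end: int


def overlap_pairs(
    left: Iterable[Interval], right: Iterable[Interval]
) -> Iterator[Tuple[Interval, Interval]]:
    """Yield every (a, b) pair on the same chromosome whose ranges intersect."""
    # Group the right-hand intervals by chromosome and sort each group by start.
    by_chrom: Dict[str, List[Interval]] = {}
    for iv in right:
        by_chrom.setdefault(iv.chrom, []).append(iv)
    for group in by_chrom.values():
        group.sort(key=lambda iv: iv.start)

    for a in left:
        for b in by_chrom.get(a.chrom, []):
            if b.start > a.end:
                break  # groups are sorted by start, so no later interval can overlap
            if b.end >= a.start:
                yield a, b


# Hypothetical example data, not taken from any real dataset.
reads = [
    Interval("chr1", 100, 200),
    Interval("chr1", 500, 600),
    Interval("chr2", 50, 80),
]
targets = [
    Interval("chr1", 150, 250),
    Interval("chr1", 190, 510),
    Interval("chr2", 100, 120),
]

pairs = list(overlap_pairs(reads, targets))
for a, b in pairs:
    print(a, "overlaps", b)
```

This scan is linear in the size of each chromosome group for every query interval, which is fine for small inputs. Dedicated interval-tree structures such as the COITrees mentioned above exist precisely to avoid that cost on large datasets.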
## Single-thread performance 🏃




## Parallel performance 🏃🏃




## Citing
If you use **polars-bio** in your work, please cite:
```bibtex
@article {Wiewiorka2025.03.21.644629,
author = {Wiewiorka, Marek and Khamutou, Pavel and Zbysinski, Marek and Gambin, Tomasz},
title = {polars-bio - fast, scalable and out-of-core operations on large genomic interval datasets},
elocation-id = {2025.03.21.644629},
year = {2025},
doi = {10.1101/2025.03.21.644629},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/03/25/2025.03.21.644629},
eprint = {https://www.biorxiv.org/content/early/2025/03/25/2025.03.21.644629.full.pdf},
journal = {bioRxiv}
}
```
Read the [documentation](https://biodatageeks.github.io/polars-bio/).