https://github.com/anuradhawick/rsbio-seq
RSBio-Seq is a Python wrapper for the Rust `bio` crate that provides fast sequence reading.
- Host: GitHub
- URL: https://github.com/anuradhawick/rsbio-seq
- Owner: anuradhawick
- License: gpl-3.0
- Created: 2024-08-31T00:12:30.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-12-31T00:39:25.000Z (5 months ago)
- Last Synced: 2025-05-07T08:02:57.876Z (18 days ago)
- Language: Rust
- Size: 92.8 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# RSBio-Seq
RSBio-Seq provides reading and writing of common sequence formats (FASTA/FASTQ), both raw (`fasta`, `fa`, `fna`, `fastq`, `fq`) and gzip-compressed (`.gz`).
## Installation
### 1. From PyPI (Recommended)
Use the following command to install from PyPI.
```bash
pip install rsbio-seq
```
### 2. Build and install from source
To build from source, make sure you have the following programs installed.
- Rust - https://www.rust-lang.org/tools/install
- Maturin - https://www.maturin.rs/installation
- Python environment with Python >=3.9 - https://www.python.org/downloads/

To build and install the development version of the wheel, use one of the following commands.
```bash
maturin develop # this installs the development version in the env
maturin develop --release # this installs a release version in the env
```
To build a release-mode wheel for installation, use this command.
```bash
maturin build --release
```
The `whl` file is written to the `target/wheels` directory, with a name that reflects your Python environment and CPU architecture. Install the built wheel with this command.
```bash
pip install target/wheels/*.whl
```
## Usage
Once installed, import the library and use it as follows.
### Reading
```python
from rsbio_seq import SeqReader, Sequence, ascii_to_phred

# each seq entry is of type Sequence
seq: Sequence

for seq in SeqReader("path/to/seq.fasta.gz"):
    print(seq.id)
    print(seq.seq)
    # for fastq quality line
    print(seq.qual)  # prints IIII
    print(ascii_to_phred(seq.qual))  # prints [40, 40, 40, 40]
    # optional description attribute
    print(seq.desc)
```
### Reading from FASTA (`fai`/`fai+gzi`) index
The indexed reader supports FASTA in raw text and bgzipped formats.
```python
# SeqReaderIndexed is imported from the same package as SeqReader
from rsbio_seq import SeqReaderIndexed, Sequence

seqs = SeqReaderIndexed(
    "path/to/seq.fa",
    "path/to/seq.fa.fai",
)
seq: Sequence = seqs["Record_1"]
print(seq.id)
print(seq.seq)
print(seq.desc)

"Record_2" in seqs  # returns a boolean
```
For bgzipped FASTA files, a `gzi` file is also required.
```python
# SeqReaderIndexed is imported from the same package as SeqReader
from rsbio_seq import SeqReaderIndexed, Sequence

seqs = SeqReaderIndexed(
    "path/to/seq.fa.gz",
    "path/to/seq.fa.gz.fai",
    "path/to/seq.fa.gz.gzi",
)
seq: Sequence = seqs["Record_1"]
print(seq.id)
print(seq.seq)
print(seq.desc)

"Record_2" in seqs  # returns a boolean
```
Using an invalid key raises a `KeyError`.
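
For readers unfamiliar with the index format, here is a minimal pure-Python sketch of how a `fai` lookup on an uncompressed FASTA file can work. It assumes the standard five-column, tab-separated layout produced by `samtools faidx` (name, length, byte offset, bases per line, bytes per line). The names `FaiEntry`, `load_fai`, and `fetch` are hypothetical and exist only for illustration; this is not RSBio-Seq's implementation.

```python
from typing import NamedTuple


class FaiEntry(NamedTuple):
    length: int     # number of bases in the record
    offset: int     # byte offset of the first base in the FASTA file
    linebases: int  # bases per sequence line
    linewidth: int  # bytes per sequence line, including the line ending


def load_fai(fai_path: str) -> dict[str, FaiEntry]:
    """Parse a .fai file into a mapping from record name to its entry."""
    index = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            name, numbers = fields[0], fields[1:5]
            index[name] = FaiEntry(*(int(value) for value in numbers))
    return index


def fetch(fasta_path: str, index: dict[str, FaiEntry], name: str) -> str:
    """Return the full sequence of one record by seeking to its offset."""
    entry = index[name]  # an unknown name raises KeyError
    if entry.length == 0:
        return ""
    # Full lines cost linewidth bytes each; the last partial line costs
    # only its remaining bases.
    full_lines, remainder = divmod(entry.length, entry.linebases)
    span = full_lines * entry.linewidth + remainder
    with open(fasta_path, "rb") as handle:
        handle.seek(entry.offset)
        raw = handle.read(span)
    return raw.decode("ascii").replace("\r", "").replace("\n", "")
```

Because the lookup seeks directly to the stored offset, only the bytes of the requested record are read, which is what makes indexed access fast on large files.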
### Writing
```python
from rsbio_seq import SeqWriter, Sequence, phred_to_ascii

# writing fasta
seq = Sequence("id", "desc", "ACGT")  # id, description, sequence
writer = SeqWriter("out.fasta")
writer.write(seq)
writer.close()

# writing fastq
seq = Sequence("id", "desc", "ACGT", "IIII")  # id, description, sequence, quality
writer = SeqWriter("out.fastq")
writer.write(seq)
writer.close()

# writing gzipped
seq = Sequence("id", "desc", "ACGT", "IIII")  # id, description, sequence, quality
writer = SeqWriter("out.fq.gz")
writer.write(seq)
writer.close()

# writing gzipped with phred score translation
qual = phred_to_ascii([40, 40, 40, 40])
seq = Sequence("id", "desc", "ACGT", qual)  # id, description, sequence, quality
writer = SeqWriter("out.fq.gz")
writer.write(seq)
writer.close()
```
Note: `close()` is only required if you want to read the file again in the same function/code scope. Closing opened files is a good practice either way.
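
One way to make sure `close()` always runs is to wrap the writer in `contextlib.closing` from the Python standard library, which calls `close()` when the block exits, even if a write raises. The sketch below assumes only that the writer exposes `write()` and `close()` as in the examples above; the helper name `write_records` is hypothetical.

```python
from contextlib import closing


def write_records(writer, records) -> int:
    """Write every record, then close the writer even if a write fails.

    `writer` is any object with write() and close() methods.
    Returns the number of records written.
    """
    count = 0
    with closing(writer):
        for record in records:
            writer.write(record)
            count += 1
    return count
```

For example, `write_records(SeqWriter("out.fasta"), [Sequence("id", "desc", "ACGT")])` would write one record and close the file.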
We provide two utility functions for your convenience.
* `phred_to_ascii` - convert a list of Phred scores (numbers) to a quality string
* `ascii_to_phred` - convert a quality string to a list of Phred scores (numbers)

RSBio-Seq reads and writes quality strings in ASCII format only. Use these helper functions to translate if you need the numeric scores.
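
To illustrate what this translation involves, here is a pure-Python sketch of the conversion. It assumes the common Phred+33 encoding, which is consistent with the example above where `IIII` maps to `[40, 40, 40, 40]`; the offset and the function names `decode_quality`, `encode_quality`, and `error_probability` are assumptions for illustration, not the library's API.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def decode_quality(qual: str) -> list[int]:
    """Turn a quality string into numeric Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in qual]


def encode_quality(scores: list[int]) -> str:
    """Turn numeric Phred scores into a quality string."""
    return "".join(chr(score + PHRED_OFFSET) for score in scores)


def error_probability(score: int) -> float:
    """Convert a Phred score Q to the base-call error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


print(decode_quality("IIII"))            # [40, 40, 40, 40]
print(encode_quality([40, 40, 40, 40]))  # IIII
```

Under this encoding, a score of 40 corresponds to an error probability of 1 in 10,000 per base.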
### Writing to FASTA with an index (`fai`/`fai+gzi`)
FASTA files can be written with an index in plain text or compressed form. Compressed output uses `bgzip` and produces a `gzi` file in addition to the `fai` file. To enable compression, end the output path with a `gz` suffix. Index paths are inferred automatically.
```python
from rsbio_seq import SeqWriter, Sequence

# Plain text
seq = Sequence("id", "desc", "ACGT")  # id, description, sequence
writer = SeqWriter("out.fasta", True)  # set index true
writer.write(seq)
writer.close()

# Compressed
seq = Sequence("id", "desc", "ACGT")  # id, description, sequence
writer = SeqWriter("out.fa.gz", True)  # set index true
writer.write(seq)
writer.close()
```
## Planned soon for the major release v1.0.0
* Support for `fastq` Indexes
## Authors
- Anuradha Wickramarachchi [https://anuradhawick.com](https://anuradhawick.com)
- Vijini Mallawaarachchi [https://vijinimallawaarachchi.com](https://vijinimallawaarachchi.com)

## Support and contributions
Please get in touch via author websites or GitHub issues. Thanks!