https://github.com/anuradhawick/rsbio-seq

RSBio-Seq is a Python wrapper for the Rust `bio` crate that provides fast sequence reading.

# RSBio-Seq

[![Cargo tests](https://github.com/anuradhawick/rsbio-seq/actions/workflows/rust_test.yml/badge.svg)](https://github.com/anuradhawick/rsbio-seq/actions/workflows/rust_test.yml)
[![Downloads](https://static.pepy.tech/badge/rsbio-seq)](https://pepy.tech/project/rsbio-seq)
[![PyPI - Version](https://img.shields.io/pypi/v/rsbio-seq)](https://pypi.org/project/rsbio-seq/)
[![Upload to PyPI](https://github.com/anuradhawick/rsbio-seq/actions/workflows/pypi.yml/badge.svg)](https://github.com/anuradhawick/rsbio-seq/actions/workflows/pypi.yml)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)




RSBio-Seq reads and writes common sequence formats (FASTA/FASTQ), both as raw text (`fasta`, `fa`, `fna`, `fastq`, `fq`) and gzip-compressed (`.gz`).

## Installation

### 1. From PyPI (Recommended)

Use the following command to install from PyPI.

```bash
pip install rsbio-seq
```

### 2. Build and install from source

To build from source, make sure you have the following programs installed.

- Rust - https://www.rust-lang.org/tools/install
- Maturin - https://www.maturin.rs/installation
- Python environment with Python >=3.9 - https://www.python.org/downloads/

To build and install the development version of the wheel, use one of the following commands.

```bash
maturin develop # this installs the development version in the env
maturin develop --release # this installs a release version in the env
```

To build a release mode wheel for installation, use this command.

```bash
maturin build --release
```

You will find the `whl` file inside the `target/wheels` directory. Its name reflects your Python environment and CPU architecture. The built wheel can be installed using this command.

```bash
pip install target/wheels/*.whl
```

## Usage

Once installed, you can import the library and use it as follows.

### Reading

```python
from rsbio_seq import SeqReader, Sequence, ascii_to_phred

# each seq entry is of type Sequence
seq: Sequence

for seq in SeqReader("path/to/seq.fasta.gz"):
    print(seq.id)
    print(seq.seq)
    # for fastq quality line
    print(seq.qual)  # prints IIII
    print(ascii_to_phred(seq.qual))  # prints [40, 40, 40, 40]
    # optional description attribute
    print(seq.desc)
```
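As a small illustration of consuming the reader, the sketch below computes the length and GC fraction of each record. It only assumes that every record exposes `id` and `seq` as strings, as in the loop above. The `gc_summary` helper is hypothetical and not part of RSBio-Seq.

```python
from typing import Iterable, Iterator, Tuple


def gc_summary(records: Iterable) -> Iterator[Tuple[str, int, float]]:
    """Yield (id, length, GC fraction) for each record.

    `records` may be any iterable of objects with string `id` and `seq`
    attributes, for example a SeqReader (assumption based on the loop above).
    """
    for record in records:
        seq = record.seq.upper()
        length = len(seq)
        gc_count = seq.count("G") + seq.count("C")
        # Guard against empty sequences to avoid division by zero
        yield record.id, length, (gc_count / length if length else 0.0)
```

With the library installed, passing `SeqReader("path/to/seq.fasta.gz")` as `records` should stream the file once without loading it fully into memory.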

### Reading from FASTA (`fai`/`fai+gzi`) index

The indexed reader supports FASTA in plain-text and bgzipped formats.

```python
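# assumption: SeqReaderIndexed is exported from rsbio_seq like SeqReader
from rsbio_seq import SeqReaderIndexed, Sequence

seqs = SeqReaderIndexed(
    "path/to/seq.fa",
    "path/to/seq.fa.fai"
)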
seq: Sequence = seqs["Record_1"]
print(seq.id)
print(seq.seq)
print(seq.desc)

"Record_2" in seqs # returns a boolean
```

For bgzipped FASTA files, a `gzi` file is also required.

```python
seqs = SeqReaderIndexed(
    "path/to/seq.fa.gz",
    "path/to/seq.fa.gz.fai",
    "path/to/seq.fa.gz.gzi",
)
seq: Sequence = seqs["Record_1"]
print(seq.id)
print(seq.seq)
print(seq.desc)

"Record_2" in seqs # returns a boolean
```

Looking up a key that is not in the index raises a `KeyError`.
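When a list of record names may contain entries that are absent from the index, the `KeyError` can be caught per key. The sketch below works with any object that supports `index[key]` and raises `KeyError` on a miss, as described above. `fetch_records` is a hypothetical helper, not part of the library.

```python
from typing import Dict, Iterable, List, Tuple


def fetch_records(index, keys: Iterable[str]) -> Tuple[Dict[str, object], List[str]]:
    """Look up each key, collecting hits and misses separately."""
    found: Dict[str, object] = {}
    missing: List[str] = []
    for key in keys:
        try:
            found[key] = index[key]
        except KeyError:
            # Key is not present in the index; record it and continue
            missing.append(key)
    return found, missing
```

Calling `fetch_records(seqs, ["Record_1", "Record_2"])` would then return the records that exist together with the names that could not be found.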

### Writing

```python
from rsbio_seq import SeqWriter, Sequence, phred_to_ascii

# writing fasta
seq = Sequence("id", "desc", "ACGT") # id, description, sequence
writer = SeqWriter("out.fasta")
writer.write(seq)
writer.close()

# writing fastq
seq = Sequence("id", "desc", "ACGT", "IIII") # id, description, sequence, quality
writer = SeqWriter("out.fastq")
writer.write(seq)
writer.close()

# writing gzipped
seq = Sequence("id", "desc", "ACGT", "IIII") # id, description, sequence, quality
writer = SeqWriter("out.fq.gz")
writer.write(seq)
writer.close()

# writing gzipped with phred score translation
qual = phred_to_ascii([40, 40, 40, 40])
seq = Sequence("id", "desc", "ACGT", qual) # id, description, sequence, quality
writer = SeqWriter("out.fq.gz")
writer.write(seq)
writer.close()
```

Note: `close()` is only required if you want to read the file again in the same function/code scope. Closing opened files is a good practice either way.
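One way to make sure `close()` is always called, even if a write fails, is the standard library's `contextlib.closing`, which only requires that the object has a `close()` method. This is a sketch under that assumption; `write_all` is a hypothetical helper.

```python
from contextlib import closing
from typing import Iterable


def write_all(writer, records: Iterable) -> int:
    """Write every record, then close the writer even if a write raises.

    `writer` is any object with `write(record)` and `close()` methods,
    such as a SeqWriter as used in the examples above.
    """
    count = 0
    with closing(writer):
        for record in records:
            writer.write(record)
            count += 1
    return count
```

For example, `write_all(SeqWriter("out.fasta"), [Sequence("id", "desc", "ACGT")])` would write one record and close the file.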

We provide two utility functions for your convenience.

* `phred_to_ascii` - converts a list of Phred scores to a quality string
* `ascii_to_phred` - converts a quality string to a list of Phred scores

RSBio-Seq reads and writes quality strings in ASCII format only. Use these helper functions to translate if you need the numeric scores.
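To show what this translation involves, here is a pure-Python sketch assuming the Phred+33 (Sanger) offset, which is consistent with the `IIII` and `[40, 40, 40, 40]` pair in the reading example. The `_py` functions are illustrative equivalents, not the library's implementation.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) encoding


def phred_to_ascii_py(scores: List[int]) -> str:
    """Encode numeric Phred scores as an ASCII quality string."""
    return "".join(chr(score + PHRED_OFFSET) for score in scores)


def ascii_to_phred_py(quality: str) -> List[int]:
    """Decode an ASCII quality string into numeric Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality]


print(phred_to_ascii_py([40, 40, 40, 40]))  # prints IIII
print(ascii_to_phred_py("IIII"))            # prints [40, 40, 40, 40]
```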

### Writing to FASTA with an Index (`fai`/`fai+gzi`)

FASTA files can be written with an index in both plain-text and compressed forms. Compressed output uses `bgzip` and produces a `gzi` file in addition to the `fai` file. To request compression, end the output path with the `gz` suffix. Index paths are inferred automatically.

```python
# Plain text
seq = Sequence("id", "desc", "ACGT") # id, description, sequence
writer = SeqWriter("out.fasta", True) # set index true
writer.write(seq)
writer.close()

# Compressed
seq = Sequence("id", "desc", "ACGT") # id, description, sequence
writer = SeqWriter("out.fa.gz", True) # set index true
writer.write(seq)
writer.close()
```
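The `fai` index is a tab-separated text file. Assuming the writer follows the samtools `faidx` layout for FASTA (name, sequence length, byte offset, bases per line, bytes per line), which this README does not state explicitly, the contents of a generated index could be inspected with a sketch like the following. `FaiEntry` and `parse_fai` are hypothetical names.

```python
from typing import Dict, NamedTuple


class FaiEntry(NamedTuple):
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base in the FASTA file
    line_bases: int  # bases per line
    line_width: int  # bytes per line, including the newline


def parse_fai(text: str) -> Dict[str, FaiEntry]:
    """Parse the text of a FASTA .fai index into a name -> entry mapping."""
    entries: Dict[str, FaiEntry] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, *numbers = line.split("\t")
        # Only the first four numeric columns are used for FASTA indexes
        entries[name] = FaiEntry(*(int(n) for n in numbers[:4]))
    return entries
```

The function takes the index text rather than a path, so it makes no assumption about where the inferred index file is written.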

## Planned soon for the major release v1.0.0

* Support for `fastq` Indexes

## Authors

- Anuradha Wickramarachchi [https://anuradhawick.com](https://anuradhawick.com)
- Vijini Mallawaarachchi [https://vijinimallawaarachchi.com](https://vijinimallawaarachchi.com)

## Support and contributions

Please get in touch via author websites or GitHub issues. Thanks!