https://github.com/corneliusroemer/fasta_zstd_sqlite

Efficiently store FASTA sequences in sqlite compressed with sidecar zstd dictionary
bioinformatics fasta genomic-epidemiology sqlite virus-bioinformatics zstd zstd-dictionary

Host: GitHub
URL: https://github.com/corneliusroemer/fasta_zstd_sqlite
Owner: corneliusroemer
License: mit
Created: 2021-09-25T20:37:23.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2021-10-04T14:32:19.000Z (about 4 years ago)
Last Synced: 2025-07-04T11:05:37.427Z (3 months ago)
Topics: bioinformatics, fasta, genomic-epidemiology, sqlite, virus-bioinformatics, zstd, zstd-dictionary
Language: Python
Homepage:
Size: 530 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

CLI tool to store and retrieve individual FASTA records in an sqlite database, with each record zstd-compressed against a shared dictionary

## Features

- Fast retrieval of FASTA records from database (~GB/s)
- High compression ratio (100 GB -> 1 GB) for similar sequences (like SARS-CoV-2 genomes)
- Fast compression (~GB/s)
- Convenient CLI user interface (no need to use Python)
- User can choose between explicit file names or stdin/stdout
- Batteries included:
  - zstd dictionary is auto-generated from first fasta record
  - zstd dictionary is auto-included in sqlite database, read automagically
- sqlite database can be read by external tools
- Records can be queried by passing strain names via stdin or a strains.txt file
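
The dictionary-based storage described above can be illustrated with a minimal sketch. This is not the tool's actual implementation: it uses only the Python standard library, substituting `zlib` with a preset dictionary (`zdict`) for zstd, and the table names `dictionary` and `sequences` as well as the function names are hypothetical. Note that zlib only makes use of roughly the last 32 KiB of a dictionary, so it is a stand-in for the idea rather than a replacement for zstd.

```python
import sqlite3
import zlib


def compress(seq: str, zdict: bytes) -> bytes:
    # Compress one record against the shared dictionary.
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(seq.encode()) + c.flush()


def decompress(blob: bytes, zdict: bytes) -> str:
    # The same dictionary is required to decompress.
    d = zlib.decompressobj(zdict=zdict)
    return (d.decompress(blob) + d.flush()).decode()


def insert_records(con: sqlite3.Connection, records) -> None:
    # Hypothetical schema: one dictionary row plus one row per strain.
    con.execute(
        "CREATE TABLE IF NOT EXISTS dictionary "
        "(id INTEGER PRIMARY KEY CHECK (id = 1), data BLOB NOT NULL)"
    )
    con.execute(
        "CREATE TABLE IF NOT EXISTS sequences "
        "(strain TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    row = con.execute("SELECT data FROM dictionary WHERE id = 1").fetchone()
    zdict = row[0] if row else None
    for strain, seq in records:
        if zdict is None:
            # Assumption: the first record itself serves as the dictionary.
            zdict = seq.encode()
            con.execute("INSERT INTO dictionary (id, data) VALUES (1, ?)", (zdict,))
        con.execute(
            "INSERT OR REPLACE INTO sequences VALUES (?, ?)",
            (strain, compress(seq, zdict)),
        )
    con.commit()


def query(con: sqlite3.Connection, strains):
    # The dictionary is read from the database, so no sidecar file is needed.
    zdict = con.execute("SELECT data FROM dictionary WHERE id = 1").fetchone()[0]
    for strain in strains:
        row = con.execute(
            "SELECT data FROM sequences WHERE strain = ?", (strain,)
        ).fetchone()
        if row is not None:
            yield strain, decompress(row[0], zdict)
```

Because near-identical genomes differ from the dictionary in only a few positions, each compressed record mostly consists of back-references into the dictionary, which is where the high compression ratio for similar sequences comes from.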

## Roadmap

- Add support to store non-FASTA lines, e.g. metadata rows
- Extend querying capabilities beyond strain names:
  - by metadata
  - by mutations
- Support random sampling
- Store hashes of uncompressed records to allow fast checking of which records have changed (saves time in downstream processing if records are unchanged, e.g. in SARS-CoV-2 pipelines)
- Make conda installable
- Add tests
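
As a rough illustration of the hash-based change detection mentioned in the roadmap, the sketch below compares stored hashes against freshly computed ones. It is not part of the tool: the function names, the choice of SHA-256, and the `stored_hashes` mapping (strain name to hex digest) are assumptions made for the example.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def record_hash(sequence: str) -> str:
    # Hash of the uncompressed sequence; SHA-256 is an arbitrary choice here.
    return hashlib.sha256(sequence.encode()).hexdigest()


def changed_strains(
    stored_hashes: Dict[str, str], records: Iterable[Tuple[str, str]]
) -> List[str]:
    # Return strains that are new or whose sequence differs from the stored hash.
    changed = []
    for strain, seq in records:
        if stored_hashes.get(strain) != record_hash(seq):
            changed.append(strain)
    return changed
```

A downstream pipeline could then reprocess only the strains returned by `changed_strains` and skip the rest.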

## Installation

1. Clone the repository:
```bash
git clone https://github.com/corneliusroemer/fasta_zstd_sqlite.git
cd fasta_zstd_sqlite
```
2. Install using pip:
```bash
python3 -m venv env
source env/bin/activate
python3 -m pip install --editable .
```

## Usage

Compressing fasta records line by line in sqlite db:
```bash
xzcat in.fasta.xz | fzs insert --db-path in.fasta.db
fzs insert --fasta-path in.fasta --db-path in.fasta.db
head in.fasta | fzs insert --db-path in.fasta.db
```
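
For readers curious what record-by-record reading of a FASTA stream can look like, here is a minimal sketch. It is not taken from `fzs`; the function name `read_fasta` is hypothetical, and it assumes the full header line after `>` is used as the strain name.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    # Stream (name, sequence) pairs without loading the whole file into memory.
    name: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the whole header after ">" is the strain name.
            name = line[1:].strip()
            chunks = []
        elif name is not None and line:
            # Sequence lines of a multi-line record are concatenated.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

Passing `sys.stdin` as `lines` lets such a reader consume piped input, much like the `xzcat in.fasta.xz | fzs insert ...` invocation above.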

Querying records from sqlite db:
```bash
fzs query --db-path in.fasta.db --strains-path strains.txt --fasta-path out.fasta
echo Wuhan-Hu-1 | fzs query --strains-path - --db-path in.fasta.db | less
```

Uncompressing all records back to fasta:
```bash
fzs query --db-path in.fasta.db --fasta-path out.fasta
```
