Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shenwei356/seqkit
A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
- Host: GitHub
- URL: https://github.com/shenwei356/seqkit
- Owner: shenwei356
- License: mit
- Created: 2016-02-28T10:04:40.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2024-10-31T08:06:14.000Z (3 months ago)
- Last Synced: 2024-10-31T09:17:47.993Z (3 months ago)
- Topics: bioinformatics, cross-platform, fasta, fastq, golang, manipulation, sequence, tool, toolkit
- Language: Go
- Homepage: https://bioinf.shenwei.me/seqkit
- Size: 62.3 MB
- Stars: 1,312
- Watchers: 28
- Forks: 159
- Open Issues: 17
Metadata Files:
- Readme: README-v0.3.1.1.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
- awesome-bio-go - SeqKit - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. (Sequence Analysis and Manipulation)
- Awesome-Bioinformatics - SeqKit - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang. [ [paper-2016](https://pubmed.ncbi.nlm.nih.gov/27706213) | [web](https://bioinf.shenwei.me/seqkit) ] (Next Generation Sequencing / Sequence Processing)
- top-life-sciences - **shenwei356/seqkit** - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. Topics: `bioinformatics`, `cross-platform`, `fasta`, `fastq`, `golang`, `manipulation`, `sequence`, `tool`, `toolkit`. Stars: 1226, Forks: 157, Watchers: 26, Language: Go, License: MIT, Last updated: 2024-05-17 15:59:35. (Ranked by starred repositories)
README
## Introduction
FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and
protein sequences. Common manipulations of FASTA/Q files include converting,
searching, filtering, deduplication, splitting, shuffling, and sampling.
Existing tools implement only some of these manipulations,
and not particularly efficiently, and some are available only for certain
operating systems. Furthermore, the complicated installation process of
required packages and running environments can render these programs less
user-friendly.

This project describes a cross-platform, ultrafast, and comprehensive
toolkit for FASTA/Q processing. SeqKit provides executable binary files for
all major operating systems, including Windows, Linux, and macOS, and can
be used directly without any dependencies or pre-configuration.
SeqKit demonstrates competitive performance in execution time and memory
usage compared to similar tools. The efficiency and usability of SeqKit
enable researchers to rapidly accomplish common FASTA/Q file manipulations.

### Features comparison
|Categories |Features |seqkit |fasta_utilities|fastx_toolkit|pyfaidx|seqmagick|seqtk
|:-------------------|:----------------------|:------:|:-------------:|:-----------:|:-----:|:-------:|:---:
|**Formats support** |Multi-line FASTA |Yes |Yes |-- |Yes |Yes |Yes
| |FASTQ |Yes |Yes |Yes |-- |Yes |Yes
| |Multi-line FASTQ |Yes |Yes |-- |-- |Yes |Yes
| |Validating sequences |Yes |-- |Yes |Yes |-- |--
| |Supporting RNA |Yes |Yes |-- |-- |Yes |Yes
|**Functions** |Searching by motifs |Yes |Yes |-- |-- |Yes |--
| |Sampling |Yes |-- |-- |-- |Yes |Yes
| |Extracting sub-sequence|Yes |Yes |-- |Yes |Yes |Yes
| |Removing duplicates |Yes |-- |-- |-- |Partly |--
| |Splitting |Yes |Yes |-- |Partly |-- |--
| |Splitting by seq |Yes |-- |Yes |Yes |-- |--
| |Shuffling |Yes |-- |-- |-- |-- |--
| |Sorting |Yes |Yes |-- |-- |Yes |--
| |Locating motifs |Yes |-- |-- |-- |-- |--
| |Common sequences |Yes |-- |-- |-- |-- |--
| |Cleaning bases |Yes |Yes |Yes |Yes |-- |--
| |Transcription |Yes |Yes |Yes |Yes |Yes |Yes
| |Translation |Yes |Yes |Yes |Yes |Yes |--
| |Filtering by size |Yes |Yes |-- |Yes |Yes |--
| |Renaming header |Yes |Yes |-- |-- |Yes |Yes
|**Other features** |Cross-platform |Yes |Partly |Partly |Yes |Yes |Yes
| |Reading STDIN |Yes |Yes |Yes |-- |Yes |Yes
| |Reading gzipped file |Yes |Yes |-- |-- |Yes |Yes
| |Writing gzip file |Yes |-- |-- |-- |Yes |--

**Note 1**: See [version information](http://bioinf.shenwei.me/seqkit/benchmark/#softwares) for the software compared.
**Note 2**: See [usage](http://bioinf.shenwei.me/seqkit/usage/) for detailed options of seqkit.
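
To make one row of the table concrete, the following is a minimal Go sketch of what "Removing duplicates" can mean when duplicates are defined by sequence content. It is not SeqKit's implementation: the `Record` type and the `parseFASTA` and `dedupBySeq` functions are hypothetical names introduced only for this illustration, and the case-insensitive comparison is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Record is a hypothetical in-memory FASTA entry:
// the header line without '>' and the joined sequence.
type Record struct {
	Name string
	Seq  string
}

// parseFASTA reads (possibly multi-line) FASTA text into records.
// bufio.Scanner limits line length by default, so very long
// single-line sequences would need a larger buffer.
func parseFASTA(text string) []Record {
	var recs []Record
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		if strings.HasPrefix(line, ">") {
			recs = append(recs, Record{Name: line[1:]})
		} else if len(recs) > 0 {
			recs[len(recs)-1].Seq += line
		}
	}
	return recs
}

// dedupBySeq keeps the first record for each distinct sequence.
// Sequences are compared case-insensitively (an assumption of this sketch).
func dedupBySeq(recs []Record) []Record {
	seen := make(map[string]bool)
	var out []Record
	for _, r := range recs {
		key := strings.ToUpper(r.Seq)
		if seen[key] {
			continue
		}
		seen[key] = true
		out = append(out, r)
	}
	return out
}

func main() {
	input := ">a\nACGT\nacgt\n>b\nacgtACGT\n>c\nTTTT\n"
	for _, r := range dedupBySeq(parseFASTA(input)) {
		fmt.Printf(">%s\n%s\n", r.Name, r.Seq)
	}
}
```

Records `a` and `b` carry the same bases in different letter case, so only `a` and `c` are printed. Keeping every distinct sequence in a map is simple but memory-hungry on large inputs; hashing each sequence first would be one way to reduce that cost.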
## Benchmark

More details: [http://bioinf.shenwei.me/seqkit/benchmark/](http://bioinf.shenwei.me/seqkit/benchmark/)
Datasets:

    $ seqkit stat *.fa
    file          format  type   num_seqs        sum_len  min_len       avg_len      max_len
    dataset_A.fa  FASTA   DNA      67,748  2,807,643,808       56      41,442.5    5,976,145
    dataset_B.fa  FASTA   DNA         194  3,099,750,718      970  15,978,096.5  248,956,422
    dataset_C.fq  FASTQ   DNA   9,186,045    918,604,500      100           100          100

SeqKit version: v0.3.1.1
FASTA:
![benchmark-5tests.tsv.png](benchmark/benchmark.5tests.tsv.png)
FASTQ:
![benchmark-5tests.tsv.png](benchmark/benchmark.5tests.tsv.C.png)
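
For readers unfamiliar with the columns in the `seqkit stat` output above, the following Go sketch shows how `num_seqs`, `sum_len`, `min_len`, `avg_len`, and `max_len` can be derived from FASTA text in a single pass. It is an illustration only, not SeqKit's code: the `Stats` type and `fastaStats` function are hypothetical names, and the input is a small inline string rather than one of the benchmark datasets.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Stats holds the numeric summary columns (hypothetical type for this sketch).
type Stats struct {
	NumSeqs int
	SumLen  int
	MinLen  int
	MaxLen  int
}

// AvgLen returns the mean sequence length, or 0 for empty input.
func (s Stats) AvgLen() float64 {
	if s.NumSeqs == 0 {
		return 0
	}
	return float64(s.SumLen) / float64(s.NumSeqs)
}

// fastaStats scans FASTA text once and accumulates per-record lengths.
func fastaStats(text string) Stats {
	var s Stats
	cur := -1 // length of the current record; -1 means no record opened yet

	// flush folds the current record's length into the running summary.
	flush := func() {
		if cur < 0 {
			return
		}
		if s.NumSeqs == 0 || cur < s.MinLen {
			s.MinLen = cur
		}
		if cur > s.MaxLen {
			s.MaxLen = cur
		}
		s.NumSeqs++
		s.SumLen += cur
	}

	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "":
			// skip blank lines
		case strings.HasPrefix(line, ">"):
			flush() // a new header closes the previous record
			cur = 0
		case cur >= 0:
			cur += len(line) // sequence lines may wrap (multi-line FASTA)
		}
	}
	flush() // close the final record
	return s
}

func main() {
	s := fastaStats(">a\nACGT\nAC\n>b\nACGTACGTAC\n")
	fmt.Printf("num_seqs=%d sum_len=%d min_len=%d avg_len=%.1f max_len=%d\n",
		s.NumSeqs, s.SumLen, s.MinLen, s.AvgLen(), s.MaxLen)
}
```

Because only running totals are kept, memory use stays constant regardless of file size, which matters for inputs on the scale of the datasets listed above.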
## Acknowledgements
We thank [Lei Zhang](https://github.com/jameslz) for testing SeqKit,
and also thank [Jim Hester](https://github.com/jimhester/),
author of [fasta_utilities](https://github.com/jimhester/fasta_utilities),
for advice on early performance improvements to FASTA parsing,
and [Brian Bushnell](https://twitter.com/BBToolsBio),
author of [BBMap](https://sourceforge.net/projects/bbmap/),
for advice on naming SeqKit and adding accuracy evaluation to the benchmarks.
We also thank Nicholas C. Wu from the Scripps Research Institute,
USA, for commenting on the manuscript,
and [Guangchuang Yu](http://guangchuangyu.github.io/)
from the State Key Laboratory of Emerging Infectious Diseases,
The University of Hong Kong, HK, for advice on the manuscript.

We thank [Li Peng](https://github.com/penglbio) for reporting many bugs.