https://github.com/shenwei356/seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Host: GitHub
URL: https://github.com/shenwei356/seqkit
Owner: shenwei356
License: mit
Created: 2016-02-28T10:04:40.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2024-10-31T08:06:14.000Z (about 1 year ago)
Last Synced: 2024-10-31T09:17:47.993Z (about 1 year ago)
Topics: bioinformatics, cross-platform, fasta, fastq, golang, manipulation, sequence, tool, toolkit
Language: Go
Homepage: https://bioinf.shenwei.me/seqkit
Size: 62.3 MB
Stars: 1,312
Watchers: 28
Forks: 159
Open Issues: 17
Metadata Files:
- Readme: README-v0.3.1.1.md
- Changelog: CHANGELOG.md
- License: LICENSE

Awesome Lists containing this project

- awesome-bio-go - SeqKit - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. (Sequence Analysis and Manipulation)
- Awesome-Bioinformatics - SeqKit - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang. [ [paper-2016](https://pubmed.ncbi.nlm.nih.gov/27706213) | [web](https://bioinf.shenwei.me/seqkit) ] (Next Generation Sequencing / Sequence Processing)
- top-life-sciences - **shenwei356/seqkit** - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. Topics: `bioinformatics`, `cross-platform`, `fasta`, `fastq`, `golang`, `manipulation`, `sequence`, `tool`, `toolkit`. Stars: 1226, forks: 157, watchers: 26, language: Go, license: MIT, last updated: 2024-05-17 15:59:35. (Ranked by starred repositories)

README

## Introduction

FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and
protein sequences. Common manipulations of FASTA/Q files include converting,
searching, filtering, deduplication, splitting, shuffling, and sampling.
Existing tools implement only some of these manipulations, often not
particularly efficiently, and some are available only for certain
operating systems. Furthermore, the complicated installation process of
required packages and running environments can render these programs less
user-friendly.

This project describes SeqKit, a cross-platform, ultrafast, and comprehensive
toolkit for FASTA/Q processing. SeqKit provides executable binary files for
all major operating systems, including Windows, Linux, and macOS, and can
be used directly without any dependencies or pre-configuration.
SeqKit demonstrates competitive performance in execution time and memory
usage compared to similar tools. The efficiency and usability of SeqKit
enable researchers to rapidly accomplish common FASTA/Q file manipulations.

### Features comparison

|Categories          |Features                |seqkit  |fasta_utilities|fastx_toolkit|pyfaidx|seqmagick|seqtk
|:-------------------|:-----------------------|:------:|:-------------:|:-----------:|:-----:|:-------:|:---:
|**Formats support** |Multi-line FASTA        |Yes     |Yes            |--           |Yes    |Yes      |Yes
|                    |FASTQ                   |Yes     |Yes            |Yes          |--     |Yes      |Yes
|                    |Multi-line FASTQ        |Yes     |Yes            |--           |--     |Yes      |Yes
|                    |Validating sequences    |Yes     |--             |Yes          |Yes    |--       |--
|                    |Supporting RNA          |Yes     |Yes            |--           |--     |Yes      |Yes
|**Functions**       |Searching by motifs     |Yes     |Yes            |--           |--     |Yes      |--
|                    |Sampling                |Yes     |--             |--           |--     |Yes      |Yes
|                    |Extracting sub-sequence |Yes     |Yes            |--           |Yes    |Yes      |Yes
|                    |Removing duplicates     |Yes     |--             |--           |--     |Partly   |--
|                    |Splitting               |Yes     |Yes            |--           |Partly |--       |--
|                    |Splitting by seq        |Yes     |--             |Yes          |Yes    |--       |--
|                    |Shuffling               |Yes     |--             |--           |--     |--       |--
|                    |Sorting                 |Yes     |Yes            |--           |--     |Yes      |--
|                    |Locating motifs         |Yes     |--             |--           |--     |--       |--
|                    |Common sequences        |Yes     |--             |--           |--     |--       |--
|                    |Cleaning bases          |Yes     |Yes            |Yes          |Yes    |--       |--
|                    |Transcription           |Yes     |Yes            |Yes          |Yes    |Yes      |Yes
|                    |Translation             |Yes     |Yes            |Yes          |Yes    |Yes      |--
|                    |Filtering by size       |Yes     |Yes            |--           |Yes    |Yes      |--
|                    |Renaming header         |Yes     |Yes            |--           |--     |Yes      |Yes
|**Other features**  |Cross-platform          |Yes     |Partly         |Partly       |Yes    |Yes      |Yes
|                    |Reading STDIN           |Yes     |Yes            |Yes          |--     |Yes      |Yes
|                    |Reading gzipped file    |Yes     |Yes            |--           |--     |Yes      |Yes
|                    |Writing gzip file       |Yes     |--             |--           |--     |Yes      |--

**Note 1**: See [version information](http://bioinf.shenwei.me/seqkit/benchmark/#softwares) for the software compared.

**Note 2**: See [usage](http://bioinf.shenwei.me/seqkit/usage/) for detailed options of seqkit.
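
To make two rows of the table concrete ("Multi-line FASTA" and "Removing duplicates"), here is a minimal Python sketch. It is an illustration only, not SeqKit's implementation (SeqKit is written in Go); the function names `read_fasta` and `remove_duplicates` and the `ignore_case` parameter are hypothetical, and the sketch assumes duplicates are defined by identical sequence rather than by identical header.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence text may span several lines (multi-line FASTA).
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_duplicates(records, ignore_case=True):
    """Keep only the first record seen for each distinct sequence."""
    seen = set()
    for header, seq in records:
        key = seq.upper() if ignore_case else seq
        if key not in seen:
            seen.add(key)
            yield header, seq


# Tiny in-memory example: r1 is wrapped over two lines and, ignoring
# case, has the same sequence as r2, so r2 is dropped.
demo = io.StringIO(">r1\nACGT\nacgt\n>r2\nACGTACGT\n>r3\nGGCC\n")
unique = list(remove_duplicates(read_fasta(demo)))
print(unique)
```

The set of seen sequences grows with the number of distinct records, so for very large files a real tool would more likely store a hash of each sequence instead of the sequence itself.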

 

## Benchmark

More details: [http://bioinf.shenwei.me/seqkit/benchmark/](http://bioinf.shenwei.me/seqkit/benchmark/)

Datasets:

    $ seqkit stat *.fa
    file          format  type   num_seqs        sum_len  min_len       avg_len      max_len
    dataset_A.fa  FASTA   DNA      67,748  2,807,643,808       56      41,442.5    5,976,145
    dataset_B.fa  FASTA   DNA         194  3,099,750,718      970  15,978,096.5  248,956,422
    dataset_C.fq  FASTQ   DNA   9,186,045    918,604,500      100           100          100

SeqKit version: v0.3.1.1
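
The numeric columns in the `seqkit stat` output above are per-file summaries of record lengths. The following Python sketch shows how such columns could be computed for a FASTA file. It is an illustration under stated assumptions, not SeqKit's code: the helper names `fasta_lengths` and `summarize` are hypothetical, only FASTA input is handled, and the average is rounded to one decimal place to match the listing.

```python
import io


def fasta_lengths(handle):
    """Yield the length of each record in a (possibly multi-line) FASTA stream."""
    length = None  # stays None until the first '>' header is seen
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lengths):
    """Compute num_seqs, sum_len, min_len, avg_len and max_len in one pass."""
    num_seqs = sum_len = max_len = 0
    min_len = None
    for n in lengths:
        num_seqs += 1
        sum_len += n
        max_len = max(max_len, n)
        min_len = n if min_len is None else min(min_len, n)
    avg_len = round(sum_len / num_seqs, 1) if num_seqs else 0.0
    return {
        "num_seqs": num_seqs,
        "sum_len": sum_len,
        "min_len": min_len or 0,
        "avg_len": avg_len,
        "max_len": max_len,
    }


# Two records: "a" has 6 bases wrapped over two lines, "b" has 2 bases.
demo = io.StringIO(">a\nACGT\nAC\n>b\nGG\n")
stats = summarize(fasta_lengths(demo))
print(stats)
```

Because both functions stream over the input, memory use stays constant regardless of file size, which matters for inputs on the scale of the datasets listed above.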

FASTA:

![benchmark-5tests.tsv.png](benchmark/benchmark.5tests.tsv.png)

FASTQ:

![benchmark-5tests.tsv.png](benchmark/benchmark.5tests.tsv.C.png)

## Acknowledgements

We thank [Lei Zhang](https://github.com/jameslz) for testing SeqKit,
and also thank [Jim Hester](https://github.com/jimhester/),
author of [fasta_utilities](https://github.com/jimhester/fasta_utilities),
for advice on early performance improvements for FASTA parsing,
and [Brian Bushnell](https://twitter.com/BBToolsBio),
author of [BBMap](https://sourceforge.net/projects/bbmap/),
for advice on naming SeqKit and adding accuracy evaluation in benchmarks.
We also thank Nicholas C. Wu from the Scripps Research Institute,
USA, for commenting on the manuscript,
and [Guangchuang Yu](http://guangchuangyu.github.io/)
from the State Key Laboratory of Emerging Infectious Diseases,
The University of Hong Kong, HK, for advice on the manuscript.

We thank [Li Peng](https://github.com/penglbio) for reporting many bugs.
