Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/shenwei356/seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

bioinformatics cross-platform fasta fastq golang manipulation sequence tool toolkit

Host: GitHub
URL: https://github.com/shenwei356/seqkit
Owner: shenwei356
License: mit
Created: 2016-02-28T10:04:40.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2024-04-29T09:14:36.000Z (about 2 months ago)
Last Synced: 2024-04-29T10:33:16.625Z (about 2 months ago)
Topics: bioinformatics, cross-platform, fasta, fastq, golang, manipulation, sequence, tool, toolkit
Language: Go
Homepage: https://bioinf.shenwei.me/seqkit
Size: 62.2 MB
Stars: 1,204
Watchers: 27
Forks: 156
Open Issues: 18
Metadata Files:
- Readme: README-v0.3.1.1.md
- Changelog: CHANGELOG.md
- License: LICENSE

Lists

Awesome-Bioinformatics - SeqKit - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang. [ [paper-2016](https://pubmed.ncbi.nlm.nih.gov/27706213) | [web](https://bioinf.shenwei.me/seqkit) ] (Next Generation Sequencing / Sequence Processing)
awesome-scientific-go - shenwei356/seqkit - a cross-platform toolkit for FASTA/Q file manipulation (Field-specific projects / Biology)
awesome-bio-go - SeqKit - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. (Sequence Analysis and Manipulation)
Awesome-Bioinformatics-CN - SeqKit - A `Go`-based, cross-platform, ultrafast toolkit for processing FASTA/FASTQ files [ [paper-2016](https://pubmed.ncbi.nlm.nih.gov/27706213) | [web](https://bioinf.shenwei.me/seqkit) ] (Next-Generation Sequencing / Sequence Processing)
top-life-sciences - **shenwei356/seqkit** - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. Topics: `bioinformatics`, `cross-platform`, `fasta`, `fastq`, `golang`, `manipulation`, `sequence`, `tool`, `toolkit`. Stars: 1226 | Forks: 157 | Watchers: 26 | Language: Go | License: MIT | Last updated: 2024-05-17 15:59:35 (Ranked by starred repositories)

README

## Introduction

FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and
protein sequences. Common manipulations of FASTA/Q files include converting,
searching, filtering, deduplication, splitting, shuffling, and sampling.
Existing tools implement only some of these manipulations, often
inefficiently, and some are available only for certain operating systems.
Furthermore, the complicated installation of required packages and running
environments can make these programs less user-friendly.

SeqKit is a cross-platform, ultrafast, and comprehensive toolkit for FASTA/Q
processing. It provides executable binary files for all major operating
systems, including Windows, Linux, and macOS, and can be used directly
without any dependencies or pre-configuration.

SeqKit demonstrates competitive performance in execution time and memory
usage compared to similar tools. Its efficiency and usability enable
researchers to rapidly accomplish common FASTA/Q file manipulations.

### Features comparison

|Categories          |Features               |seqkit  |fasta_utilities|fastx_toolkit|pyfaidx|seqmagick|seqtk
|:-------------------|:----------------------|:------:|:-------------:|:-----------:|:-----:|:-------:|:---:
|**Formats support** |Multi-line FASTA       |Yes     |Yes            |--           |Yes    |Yes      |Yes
|                    |FASTQ                  |Yes     |Yes            |Yes          |--     |Yes      |Yes
|                    |Multi-line FASTQ       |Yes     |Yes            |--           |--     |Yes      |Yes
|                    |Validating sequences   |Yes     |--             |Yes          |Yes    |--       |--
|                    |Supporting RNA         |Yes     |Yes            |--           |--     |Yes      |Yes
|**Functions**       |Searching by motifs    |Yes     |Yes            |--           |--     |Yes      |--
|                    |Sampling               |Yes     |--             |--           |--     |Yes      |Yes
|                    |Extracting sub-sequence|Yes     |Yes            |--           |Yes    |Yes      |Yes
|                    |Removing duplicates    |Yes     |--             |--           |--     |Partly   |--
|                    |Splitting              |Yes     |Yes            |--           |Partly |--       |--
|                    |Splitting by seq       |Yes     |--             |Yes          |Yes    |--       |--
|                    |Shuffling              |Yes     |--             |--           |--     |--       |--
|                    |Sorting                |Yes     |Yes            |--           |--     |Yes      |--
|                    |Locating motifs        |Yes     |--             |--           |--     |--       |--
|                    |Common sequences       |Yes     |--             |--           |--     |--       |--
|                    |Cleaning bases         |Yes     |Yes            |Yes          |Yes    |--       |--
|                    |Transcription          |Yes     |Yes            |Yes          |Yes    |Yes      |Yes
|                    |Translation            |Yes     |Yes            |Yes          |Yes    |Yes      |--
|                    |Filtering by size      |Yes     |Yes            |--           |Yes    |Yes      |--
|                    |Renaming header        |Yes     |Yes            |--           |--     |Yes      |Yes
|**Other features**  |Cross-platform         |Yes     |Partly         |Partly       |Yes    |Yes      |Yes
|                    |Reading STDIN          |Yes     |Yes            |Yes          |--     |Yes      |Yes
|                    |Reading gzipped file   |Yes     |Yes            |--           |--     |Yes      |Yes
|                    |Writing gzip file      |Yes     |--             |--           |--     |Yes      |--

**Note 1**: See the [version information](http://bioinf.shenwei.me/seqkit/benchmark/#softwares) for the compared software.

**Note 2**: See [usage](http://bioinf.shenwei.me/seqkit/usage/) for detailed options of seqkit.
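
To make a few of the compared features concrete (multi-line FASTA, removing duplicates, filtering by size), here is a minimal Python sketch of what these operations mean on a tiny in-memory example. It is not SeqKit's implementation and does not use SeqKit's API: the function names `read_fasta`, `remove_duplicates`, and `filter_by_size` are hypothetical, and treating duplicates as case-insensitive sequence matches is an assumption made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence); sequence lines of one record are joined."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_duplicates(records: Iterable[Record]) -> List[Record]:
    """Keep the first record of each distinct sequence (assumed case-insensitive)."""
    seen = set()
    unique: List[Record] = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


def filter_by_size(records: Iterable[Record], min_len: int) -> List[Record]:
    """Keep records whose sequence has at least min_len residues."""
    return [(h, s) for h, s in records if len(s) >= min_len]


example = """>seq1 first
ACGT
ACGT
>seq2 duplicate of seq1
acgtacgt
>seq3 short
AC
"""

records = list(read_fasta(example))                # multi-line record is joined
unique = remove_duplicates(records)                # seq2 is dropped
long_enough = filter_by_size(unique, min_len=4)    # seq3 is dropped
print([header for header, _ in long_enough])
```

The sketch holds everything in memory for brevity; it says nothing about how any of the compared tools handle large files.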

 

## Benchmark

More details: [http://bioinf.shenwei.me/seqkit/benchmark/](http://bioinf.shenwei.me/seqkit/benchmark/)

Datasets:

    $ seqkit stat *.fa
    file          format  type   num_seqs        sum_len  min_len       avg_len      max_len
    dataset_A.fa  FASTA   DNA      67,748  2,807,643,808       56      41,442.5    5,976,145
    dataset_B.fa  FASTA   DNA         194  3,099,750,718      970  15,978,096.5  248,956,422
    dataset_C.fq  FASTQ   DNA   9,186,045    918,604,500      100           100          100
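
For readers unfamiliar with the columns above, the following Python sketch shows one way the numeric summary could be computed for FASTA text. It is an illustration under stated assumptions, not SeqKit's code: `sequence_lengths` and `summarize` are hypothetical helper names, and `avg_len` is assumed to be `sum_len / num_seqs` rounded to one decimal, which is consistent with the published rows.

```python
from typing import Dict, List


def sequence_lengths(fasta_text: str) -> List[int]:
    """Return the length of every record in FASTA text (multi-line allowed)."""
    lengths: List[int] = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)           # a header starts a new record
        elif line and lengths:
            lengths[-1] += len(line)    # sequence line extends the current record
    return lengths


def summarize(lengths: List[int]) -> Dict[str, float]:
    """Compute the numeric columns of the table; assumes at least one record."""
    total = sum(lengths)
    return {
        "num_seqs": len(lengths),
        "sum_len": total,
        "min_len": min(lengths),
        "avg_len": round(total / len(lengths), 1),
        "max_len": max(lengths),
    }


# Toy input with three records of lengths 6, 10, and 2.
toy = ">a\nACGT\nAC\n>b\nACGTACGTAC\n>c\nAC\n"
stats = summarize(sequence_lengths(toy))
print(stats)

# Consistency check against the published rows: avg_len == sum_len / num_seqs.
avg_a = round(2_807_643_808 / 67_748, 1)   # dataset_A.fa
avg_b = round(3_099_750_718 / 194, 1)      # dataset_B.fa
print(avg_a, avg_b)
```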

SeqKit version: v0.3.1.1

FASTA:

![benchmark-5tests.tsv.png](benchmark/benchmark.5tests.tsv.png)

FASTQ:

![benchmark-5tests.tsv.png](benchmark/benchmark.5tests.tsv.C.png)