{"id":13639111,"url":"https://github.com/shenwei356/seqkit","last_synced_at":"2025-12-29T22:05:48.127Z","repository":{"id":37493025,"uuid":"52715040","full_name":"shenwei356/seqkit","owner":"shenwei356","description":"A cross-platform and ultrafast toolkit for FASTA/Q file manipulation","archived":false,"fork":false,"pushed_at":"2024-10-31T08:06:14.000Z","size":65276,"stargazers_count":1312,"open_issues_count":17,"forks_count":159,"subscribers_count":28,"default_branch":"master","last_synced_at":"2024-10-31T09:17:47.993Z","etag":null,"topics":["bioinformatics","cross-platform","fasta","fastq","golang","manipulation","sequence","tool","toolkit"],"latest_commit_sha":null,"homepage":"https://bioinf.shenwei.me/seqkit","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shenwei356.png","metadata":{"files":{"readme":"README-v0.3.1.1.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-02-28T10:04:40.000Z","updated_at":"2024-10-31T08:06:17.000Z","dependencies_parsed_at":"2024-03-19T20:48:03.592Z","dependency_job_id":"98fe7be1-07e5-4efa-b896-cc15edfa24fd","html_url":"https://github.com/shenwei356/seqkit","commit_stats":{"total_commits":757,"total_committers":18,"mean_commits":42.05555555555556,"dds":"0.20211360634081899","last_synced_commit":"1aef84fbc2381db2e73639632bc44a5f7a0a8bcc"},"previous_names":[],"tags_count":108,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shenwei356%2Fseqkit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shenwei356%2Fseqkit/tags","rele
ases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shenwei356%2Fseqkit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shenwei356%2Fseqkit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shenwei356","download_url":"https://codeload.github.com/shenwei356/seqkit/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223810278,"owners_count":17206728,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","cross-platform","fasta","fastq","golang","manipulation","sequence","tool","toolkit"],"created_at":"2024-08-02T01:00:57.796Z","updated_at":"2025-12-29T22:05:48.099Z","avatar_url":"https://github.com/shenwei356.png","language":"Go","funding_links":[],"categories":["Next Generation Sequencing","Field-specific projects","Sequence Analysis and Manipulation","Ranked by starred repositories"],"sub_categories":["Sequence Processing","Biology"],"readme":"## Introduction\n\nFASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and\nprotein sequences. Common manipulations of FASTA/Q files include converting,\nsearching, filtering, deduplication, splitting, shuffling, and sampling.\nExisting tools implement only some of these manipulations,\nand not particularly efficiently, and some are only available for certain\noperating systems. 
Furthermore, the complicated installation process of\nrequired packages and running environments can render these programs less\nuser friendly.\n\nThis project describes a cross-platform ultrafast comprehensive\ntoolkit for FASTA/Q processing. SeqKit provides executable binary files for\nall major operating systems, including Windows, Linux, and macOS, and can\nbe directly used without any dependencies or pre-configurations.\nSeqKit demonstrates competitive performance in execution time and memory\nusage compared to similar tools. The efficiency and usability of SeqKit\nenable researchers to rapidly accomplish common FASTA/Q file manipulations.\n\n### Features comparison\n\n|Categories          |Features               |seqkit  |fasta_utilities|fastx_toolkit|pyfaidx|seqmagick|seqtk\n|:-------------------|:----------------------|:------:|:-------------:|:-----------:|:-----:|:-------:|:---:\n|**Formats support** |Multi-line FASTA       |Yes     |Yes            |--           |Yes    |Yes      |Yes\n|                    |FASTQ                  |Yes     |Yes            |Yes          |--     |Yes      |Yes\n|                    |Multi-line  FASTQ      |Yes     |Yes            |--           |--     |Yes      |Yes\n|                    |Validating sequences   |Yes     |--             |Yes          |Yes    |--       |--\n|                    |Supporting RNA         |Yes     |Yes            |--           |--     |Yes      |Yes\n|**Functions**       |Searching by motifs    |Yes     |Yes            |--           |--     |Yes      |--\n|                    |Sampling               |Yes     |--             |--           |--     |Yes      |Yes\n|                    |Extracting sub-sequence|Yes     |Yes            |--           |Yes    |Yes      |Yes\n|                    |Removing duplicates    |Yes     |--             |--           |--     |Partly   |--\n|                    |Splitting              |Yes     |Yes            |--           |Partly |--       |--\n|                    
|Splitting by seq       |Yes     |--             |Yes          |Yes    |--       |--\n|                    |Shuffling              |Yes     |--             |--           |--     |--       |--\n|                    |Sorting                |Yes     |Yes            |--           |--     |Yes      |--\n|                    |Locating motifs        |Yes     |--             |--           |--     |--       |--\n|                    |Common sequences       |Yes     |--             |--           |--     |--       |--\n|                    |Cleaning bases         |Yes     |Yes            |Yes          |Yes    |--       |--\n|                    |Transcription          |Yes     |Yes            |Yes          |Yes    |Yes      |Yes\n|                    |Translation            |Yes     |Yes            |Yes          |Yes    |Yes      |--\n|                    |Filtering by size      |Yes     |Yes            |--           |Yes    |Yes      |--\n|                    |Renaming header        |Yes     |Yes            |--           |--     |Yes      |Yes\n|**Other features**  |Cross-platform         |Yes     |Partly         |Partly       |Yes    |Yes      |Yes\n|                    |Reading STDIN          |Yes     |Yes            |Yes          |--     |Yes      |Yes\n|                    |Reading gzipped file   |Yes     |Yes            |--           |--     |Yes      |Yes\n|                    |Writing gzip file      |Yes     |--             |--           |--     |Yes      |--\n\n**Note 1**: See [version information](http://bioinf.shenwei.me/seqkit/benchmark/#softwares) of the softwares.\n\n**Note 2**: See [usage](http://bioinf.shenwei.me/seqkit/usage/) for detailed options of seqkit.\n \n## Benchmark\n\nMore details: [http://bioinf.shenwei.me/seqkit/benchmark/](http://bioinf.shenwei.me/seqkit/benchmark/)\n\nDatasets:\n\n    $ seqkit stat *.fa\n    file          format  type   num_seqs        sum_len  min_len       avg_len      max_len\n    dataset_A.fa  FASTA   DNA      67,748  
2,807,643,808       56      41,442.5    5,976,145\n    dataset_B.fa  FASTA   DNA         194  3,099,750,718      970  15,978,096.5  248,956,422\n    dataset_C.fq  FASTQ   DNA   9,186,045    918,604,500      100           100          100\n\nSeqKit version: v0.3.1.1\n\nFASTA:\n\n![benchmark-5tests.tsv.png](benchmark/benchmark.5tests.tsv.png)\n\nFASTQ:\n\n![benchmark-5tests.tsv.png](benchmark/benchmark.5tests.tsv.C.png)\n\n## Acknowledgements\n\nWe thank [Lei Zhang](https://github.com/jameslz) for testing SeqKit,\nand also thank [Jim Hester](https://github.com/jimhester/),\nauthor of [fasta_utilities](https://github.com/jimhester/fasta_utilities),\nfor advice on early performance improvements for FASTA parsing\nand [Brian Bushnell](https://twitter.com/BBToolsBio),\nauthor of [BBMap](https://sourceforge.net/projects/bbmap/),\nfor advice on naming SeqKit and adding accuracy evaluation in benchmarks.\nWe also thank Nicholas C. Wu from the Scripps Research Institute,\nUSA for commenting on the manuscript\nand [Guangchuang Yu](http://guangchuangyu.github.io/)\nfrom the State Key Laboratory of Emerging Infectious Diseases,\nThe University of Hong Kong, HK for advice on the manuscript.\n\nWe thank [Li Peng](https://github.com/penglbio) for reporting many bugs.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshenwei356%2Fseqkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshenwei356%2Fseqkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshenwei356%2Fseqkit/lists"}