{"id":25501953,"url":"https://github.com/opengene/uniquekmer","last_synced_at":"2025-04-10T09:46:42.848Z","repository":{"id":90140217,"uuid":"258440853","full_name":"OpenGene/UniqueKMER","owner":"OpenGene","description":"Generate unique KMERs for every contig in a FASTA file","archived":false,"fork":false,"pushed_at":"2022-08-17T10:38:24.000Z","size":183,"stargazers_count":47,"open_issues_count":5,"forks_count":8,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-03-24T08:42:21.708Z","etag":null,"topics":["bioinformatics","fasta","kmer","ngs","sequencing","unique","virus"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGene.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-04-24T07:36:35.000Z","updated_at":"2025-02-10T03:14:31.000Z","dependencies_parsed_at":null,"dependency_job_id":"80fa7a9e-1818-4a11-aadf-94498247b5e2","html_url":"https://github.com/OpenGene/UniqueKMER","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2FUniqueKMER","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2FUniqueKMER/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2FUniqueKMER/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGene%2FUniqueKMER/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGene","download_url":"https://codeload.github.com/OpenGene/UniqueKMER/tar.gz/refs/
heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248196606,"owners_count":21063467,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","fasta","kmer","ngs","sequencing","unique","virus"],"created_at":"2025-02-19T04:59:46.603Z","updated_at":"2025-04-10T09:46:42.824Z","avatar_url":"https://github.com/OpenGene.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# UniqueKMER\nGenerate unique k-mers for every contig in a FASTA file.  \n\nA contig's unique k-mers are the k-mer keys (e.g. ATCGATCCTTAAGG) that are present in that contig only and absent from every other contig (on both forward and reverse strands).  \n\nThis tool takes a FASTA file consisting of many contigs as input and extracts the unique k-mers for each contig.\n\nThe output unique k-mer files and Genome files can be used with fastv (https://github.com/OpenGene/fastv), an ultra-fast tool to identify and visualize microbial sequences from sequencing data.\n\n# what does UniqueKMER output?\nThis tool outputs a folder (folder name can be specified by `-o/--outdir`), which contains:\n* an `index.html` file.\n* a `kmercollection.fasta` file, which is a single file listing all the genome names along with their unique k-mers. 
Each k-mer key is on its own line.\n* a subfolder `genomes_kmers`, which contains a k-mer file and a Genome file for each contig, both in FASTA format.\n\nYou can open `index.html` in any browser, then click on a contig name to find its k-mer file and Genome file.\n* a small example: http://opengene.org/uniquekmer/test/index.html. This was generated from a small FASTA (http://opengene.org/test.fasta)\n* a big example: http://opengene.org/uniquekmer/viral/index.html. This was generated from a big FASTA (http://opengene.org/viral.genomic.fasta) containing the complete NCBI RefSeq release of viral sequences, which is available from https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/\n\n# get this tool\n## download binary\nThis binary is only for Linux systems: http://opengene.org/uniquekmer/uniquekmer\n```shell\n# this binary was compiled on CentOS, and tested on CentOS/Ubuntu\nwget http://opengene.org/uniquekmer/uniquekmer\nchmod a+x ./uniquekmer\n```\n## or compile from source\n```shell\n# step 1: get the code\ngit clone https://github.com/OpenGene/UniqueKMER.git\n\n# step 2: build\ncd UniqueKMER\nmake\n\n# step 3: install it system-wide, if you have sudo permission\nmake install\n```\n\n# simple example:\n```shell\nuniquekmer -f test.fasta\n```\nYou can get test.fasta from: http://opengene.org/test.fasta\n\n# more examples\n### set the k-mer key length\n```shell\n# 16-mer (e.g. ATCGATCGATCGATCG...)\nuniquekmer -f test.fasta -k 16\n```\n### filter the k-mer keys that can be mapped to a reference genome (e.g. 
human genome)\n```shell\n# k-mer sequences that can be mapped to hg38 with `edit distance \u003c= 2` will be removed\nuniquekmer -f test.fasta -r hg38.fasta -e 2\n```\n### set the spacing to avoid many consecutive k-mer keys\n```shell\n# the spacing will be 2, which means if `key(pos)` is stored, then `key(pos+1)` and `key(pos+2)` will be skipped\nuniquekmer -f test.fasta -s 2\n```\n\noptions:\n```shell\n  -f, --fasta            FASTA input file name (string)\n  -o, --outdir           Directory for output. Default is unique_kmers in the current directory. (string [=unique_kmers])\n  -k, --kmer             The length k of k-mer (3~32), default 25 (int [=25])\n  -s, --spacing          If a key with POS is recorded, then skip [POS+1...POS+spacing] to avoid too compact result (0~100). default 0 means no skipping. (int [=0])\n  -g, --genome_limit     Process up to genome_limit genomes in the FASTA input file. Default 0 means no limit. This option is for DEBUG. (int [=0])\n  -r, --ref              Reference genome FASTA file name. Specify this only when you want to filter out the unique k-mer that can be mapped to reference genome. (string [=])\n  -e, --edit_distance    k-mer mapped to reference genome with edit distance \u003c= edit_distance will be removed (0~16). 3 for default. (int [=3])\n  -?, --help             print this message\n```\n\n## get the pre-built k-mer file, genomes file or k-mer collection file for viruses\n* You can download `k-mer` files and `genomes` files of viruses from http://opengene.org/uniquekmer/viral/index.html. These were generated by extracting unique k-mers for all genomes in a big FASTA (http://opengene.org/viral.genomic.fasta), which contains the complete NCBI RefSeq release of viral sequences, available from https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/. 
The k-mers that can be mapped to the human reference genome (GRCh38) with `edit_distance \u003c= 3` have already been filtered out.\n* You can download the `k-mer collection` file for viral genomes from: http://opengene.org/viral.kc.fasta.gz\n\n## get the pre-built k-mer file, genomes file or k-mer collection file for viruses and human microorganisms\n* You can download `k-mer` files and `genomes` files of viruses and human microorganisms from http://opengene.org/uniquekmer/microbial/index.html. These were generated by extracting unique k-mers for all genomes in a big FASTA (http://opengene.org/microbial.genomic.fasta), which contains genomes for the viruses above and common human microorganisms. The k-mers that can be mapped to the human reference genome (GRCh38) with `edit_distance \u003c= 3` have already been filtered out.\n* You can download the `k-mer collection` file for viral and microbial genomes from: http://opengene.org/microbial.kc.fasta.gz\n\n# Citation\nIf you use `fastv`, `UniqueKMER` or the pre-generated resources provided by this repository, please cite our work as:\n\nShifu Chen, Changshou He, Yingqiang Li, Zhicheng Li, Charles E Melancon III. A Computational Toolset for Rapid Identification of SARS-CoV-2, other Viruses, and Microorganisms from Sequencing Data. bioRxiv 2020.05.12.092163; doi: https://doi.org/10.1101/2020.05.12.092163\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopengene%2Funiquekmer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopengene%2Funiquekmer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopengene%2Funiquekmer/lists"}