{"id":19434629,"url":"https://github.com/edawson/rkmh","last_synced_at":"2025-04-24T20:32:17.155Z","repository":{"id":150736312,"uuid":"62011740","full_name":"edawson/rkmh","owner":"edawson","description":"Classify sequencing reads using MinHash.","archived":false,"fork":false,"pushed_at":"2020-04-06T02:08:10.000Z","size":34870,"stargazers_count":48,"open_issues_count":8,"forks_count":4,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-03T10:38:04.085Z","etag":null,"topics":["bioinformatics","kmer","minhash","mutations","nanopore","openmp"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/edawson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-06-26T22:49:44.000Z","updated_at":"2024-10-06T07:22:23.000Z","dependencies_parsed_at":"2023-04-12T13:25:42.533Z","dependency_job_id":null,"html_url":"https://github.com/edawson/rkmh","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edawson%2Frkmh","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edawson%2Frkmh/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edawson%2Frkmh/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edawson%2Frkmh/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/edawson","download_url":"https://codeload.github.com/edawson/rkmh/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github",
"repositories_count":250704856,"owners_count":21473774,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","kmer","minhash","mutations","nanopore","openmp"],"created_at":"2024-11-10T14:47:03.501Z","updated_at":"2025-04-24T20:32:12.344Z","avatar_url":"https://github.com/edawson.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"rkmh\n--------------------------------------------\nEric T Dawson  \nJune 2016\n\n\n[![](https://images.microbadger.com/badges/image/erictdawson/rkmh.svg)](https://microbadger.com/images/erictdawson/rkmh \"Get your own image badge on microbadger.com\")\n\n![C/C++ CI](https://github.com/edawson/rkmh/workflows/C/C++%20CI/badge.svg)\n\n### What is it\nrkmh performs identification of *individual reads*, identity-based read filtering, and alignment-free variant calling\nusing MinHash (as implemented in [Mash](https://github.com/marbl/Mash)). It is compatible with Mash and sourmash via JSON exchange.\n\n\nWe're using rkmh to identify which strains are present in infections with multiple strains of the same virus.\nrkmh could also be used to remove reads from contaminants or call mutations in novel strains relative to a nearby reference.\nYou could even select out only reads from a pathogen sample contaminated with human DNA.\n\n### License\nMIT, but please cite the repository if you use it.\n\n### Dependencies and build process\nThe only external dependencies should be zlib and a compiler supporting OpenMP. 
To download and build:  \n\n                    git clone --recursive https://github.com/edawson/rkmh.git  \n                    cd rkmh  \n                    make  \n\nThis should build rkmh and its library dependencies (mkmh and murmur3).\n\n### HPV16 sublineage classification\nrkmh was designed to assess HPV16 lineage and sublineage coinfections. There is a dedicated command for identifying\nlineage- and sublineage-specific kmers and labeling reads with them. The necessary references are also included with rkmh.\n\nTo classify each read by its lineage and sublineage, run the following command from inside the rkmh directory:  \n\n```\n./rkmh hpv16 -f \u003cfastqToClassify.fq\u003e \u003e out.rk\n```\n\nPrevalence estimates for each lineage/sublineage can then be calculated by running a script that sums\nthe number of reads, corrects for common error modes, and outputs a summary of the infecting (sub)lineages\nand their estimated proportions:  \n\n```\npython scripts/score_real_classification.py \u003c out.rk \u003e out.cls\n```\n\nThe output file `out.cls` contains a single line describing the estimated (sub)lineages and their proportions. We\nassume a sample is coinfected if we see at least two lineages present at \u003e5% prevalence.\n\n### Stream\nrkmh can now stream reads through, using roughly constant memory.\nThe `stream` command behaves almost identically to `classify` and performs the same read classification task by default:\n\n```rkmh stream -r refs.fa -f reads.fa -k 12 -s 1000```  \n\n\nIt also accepts reads piped in on standard input:  \n\n```cat reads.fq | ./rkmh stream -i -r refs.fa -k 12 -s 1000```  \n\nwhich will use `64 * (  (number of refs * sketchsize) + sketchsize )` bits of memory after references are hashed. 
I'm working on\nreducing the amount of memory used during the initial hashing as well, though a human genome is feasible in roughly 32 GB of RAM.\n\nThe `-M` flag for stream uses a modified hash table counter which takes up only ~80MB of memory; however, it is prone to collisions if the\nsketch size and reference genome become very large and the kmer size very small. Its performance on most small genomes is identical to that\nof `classify`, but if you cannot tolerate collisions we suggest you use the `classify` command.\n\nThe `-I` flag is implemented the same way as the `-M` flag, and again matches the specificity of `classify` on small genomes while providing\na big boost in performance for less memory.\n\n\n### Filter\nImagine you have a bunch of reads sequenced from a viral infection and you want to select only those that are\nfrom the virus (i.e. remove host reads).\n\nNow you can:\n\n    rkmh filter -f reads.fq -r viral_refs.fa -t 4 -k 20 -s 2000\n\nYou can also pass the `-z` flag to `stream` to accomplish the same thing.\n\n\n### Classify \nrkmh requires a set of query sequences (\"reads\") and a set of references in the FASTA/FASTQ format. Reads may be in either FASTQ or FASTA.\n\n\nTo use a MinHash sketch of size 1000 and a kmer size of 10:  \n```./rkmh classify -r references.fa -f reads.fq -k 10 -s 1000```\n\nThere's also now a filter for minimum kmer occurrence in a read set, compatible with the MinHash sketch.\nTo only use kmers that occur at least 100 times in the reads:  \n```./rkmh classify -r references.fa -f reads.fq -k 10 -s 1000 -M 100```\n\nThere is also a filter that will fail reads with fewer than some number of matches to any reference.\nIt's available via the `-N` flag:  \n```./rkmh classify -r references.fa -f reads.fq -k 10 -s 1000 -M 100 -N 10```\n\n\n**A note on optimum kmer size**: we've had a lot of success with k \u003c= 15 on data from ONT's R7 pore. 
I don't have any R9 flowcells around the lab, but \nI expect we'll do a bit better on R9 given what others have been showing off.\n\n### Call\nOnce you've identified which reference a set of reads most closely matches, you may want to figure out the differences between your set of reads\nand your reference. `rkmh call` uses a brute-force approach to produce a list of candidate mutations / sequencing errors present in a read set.\n\n```rkmh call -r ref.fa -f reads.fq -k 12 -t 4```  \n\nWe advise using only one reference during call, as it's relatively slow (~10x longer than classification; 10 seconds for 1100 reads). For example, you might first classify your reads using `classify`, then\nrun `rkmh call` against the top classification in your set.\n\n### Hash\nYou might want to see the hashes generated by rkmh for debugging purposes. To do so, use the `hash` command.\n\n```rkmh hash -r ref.fa -f reads.fq -k 12 -s 1000``` \n\n### Filter\nThe `filter` command will only output reads which match any of the input references sufficiently well. This is useful for filtering\nout contaminants or selecting reads which map to only a single strain.\n\n\n### Other options\nThese are extra options for the `classify` and `hash` commands. Some of them are also applicable to `call`. 
For full usage, just\ntype `./rkmh` or `./rkmh \u003ccommand\u003e` at the command line to get the help message.\n\n\n```-t / --threads \u003cINT\u003e               number of OpenMP threads to use (default is 1)```  \n```-M / --min-kmer-occurence \u003cINT\u003e    minimum number of times a kmer must appear in the set of reads to be included in a read's MinHash sketch.```  \n```-N / --min-matches \u003cINT\u003e           minimum number of matches a read must have to any reference to be considered classified.```  \n```-I / --max-samples \u003cINT\u003e           remove kmers that appear in more than \u003cINT\u003e reference genomes.```  \n```-D / --min-difference \u003cINT\u003e        flag reads that have two matches within \u003cINT\u003e hashes of each other as failing.```   \n```-k / --kmer \u003cINT\u003e                  the kmer size to use for hashing. Multiple kmer sizes may be passed, but they must all use the -k \u003cINT\u003e format (i.e. -k 12 -k 14 -k 16...)```   \n```-s / --sketch-size                 the number of hashes to use when comparing reads / references.```    \n```-f / --fasta                       a FASTA/FASTQ file to use as a read set. Can be passed multiple times (i.e. -f first.fa -f second.fa...)``` \n```-r / --reference                   a FASTA/FASTQ file to use as a reference set. Can be passed multiple times (i.e. -r ref.fa -r ref_second.fa...)```   \n\n\n\n### Performance\nOn a set of 1000 MinION reads from a known HPV strain, rkmh is ~97% accurate (correctly placing the read in the right strain\nof 182 input reference strains) and runs in \u003c20 seconds. With the kmer depth and minimum match filters we're approaching 100% accuracy for about the same run time.\nPerformance for short reads is slightly lower because they have fewer kmers, but is still quite high.\nWe're working on ways to improve sensitivity with further filtering and correction.\n\n\nrkmh is threaded using OpenMP. 
Hashing can handle more than 400 long reads/second (400 * 7kb means we're running over 2,500,000 basepairs / second), with some room still left for improvement.\n\n\nWe've tested up to 100,000 6.5kb reads + 182 7kb references in a bit over 8GB of RAM, but we're working to scale to larger genomes and more reads. We've also run an E. coli\ndataset (Nick Loman's R7.3 ONT data against 6 E. coli references) on a desktop with 16GB of RAM. We think with a few tweaks we can do a lot better.\n\n\n### Getting help\nPlease post to the [GitHub repository](https://github.com/edawson/rkmh.git) for help.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedawson%2Frkmh","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fedawson%2Frkmh","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedawson%2Frkmh/lists"}