{"id":50718718,"url":"https://github.com/husonlab/diamer1","last_synced_at":"2026-06-09T21:30:54.423Z","repository":{"id":293803865,"uuid":"979253707","full_name":"husonlab/diamer1","owner":"husonlab","description":"Double-indexed k-mer taxonomic classifier","archived":false,"fork":false,"pushed_at":"2025-09-16T12:24:00.000Z","size":196916,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-16T14:40:34.146Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/husonlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-05-07T08:24:11.000Z","updated_at":"2025-09-16T12:24:04.000Z","dependencies_parsed_at":null,"dependency_job_id":"01eda039-efc8-44ce-9b26-c56c501f1b40","html_url":"https://github.com/husonlab/diamer1","commit_stats":null,"previous_names":["husonlab/diamer1"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/husonlab/diamer1","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/husonlab%2Fdiamer1","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/husonlab%2Fdiamer1/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/husonlab%2Fdiamer1/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/husonlab%2Fdiamer1/manifests
","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/husonlab","download_url":"https://codeload.github.com/husonlab/diamer1/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/husonlab%2Fdiamer1/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34127342,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-09T21:30:53.650Z","updated_at":"2026-06-09T21:30:54.414Z","avatar_url":"https://github.com/husonlab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DIAMER\nThe Double-Indexed k-mer Taxonomic Classifier DIAMER is a bioinformatics tool\nto taxonomically classify long reads of next generation sequencing technologies\nby comparison to a protein reference database.\n\nThis tool was created as part of the master thesis by Noel Kubach in 2025, which can be accessed [here](masterthesis_kubach_2025.pdf).\n\n![overview.png](src/test/resources/overview.png)\n\n# Download\nA precompiled .jar file with dependencies can be downloaded [here](out/diamer.jar).\n\n# Requirenments\nDIAMER needs a Java Runtime Environment (version \u003e= 23).\n\nAt least 16 GB of RAM and about 500 GB of free disk space are 
recommended.\n\n# Usage\nTo classify a FASTQ dataset of long reads with DIAMER, four steps are required:\n1) A reference database has to be prepared\n2) The reference database has to be indexed\n3) The reads have to be indexed\n4) The reads are assigned to the reference taxonomy\n\nSteps 1 and 2 do not have to be repeated for further datasets.\n\n## 1. Database preparation\nDIAMER uses the NCBI protein BLAST database as reference.\nEither the full NR database can be used or smaller, clustered versions provided by [MEGAN](https://software-ab.cs.uni-tuebingen.de/download/megan7/welcome.html).\n### Downloading the taxonomy\nFor either version of NR, the NCBI taxonomy is required. It can be downloaded from the FTP server of NCBI:\n* [Taxonomy](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/)\n  (the `nodes.dmp` and `names.dmp` files from the `taxdump` archive are required)\n\n### Full NR\n#### Downloading NR\nThe NR database and the required accession2taxid mapping files can be downloaded\nfrom the FTP server of NCBI.\n* [NR](https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/)\n* [accession2taxid](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/)\n(the `prot.accession2taxid.FULL` together with the `dead_prot.accession2taxid` are recommended)\n  * alternatively the [MEGAN mapping file](https://software-ab.cs.uni-tuebingen.de/download/megan7/welcome.html)\n  for the full NR can be downloaded and used as described [here](#prepare-the-database)\n#### Preparing NR\nThe `accession2taxid` mapping files are used to replace all headers in the NR FASTA with the taxId of the LCA\nof all organisms that contain the sequence.\n\nThe input database FASTA file can be gzipped or not; the output will always be gzipped.\n\n````shell\njava -jar diamer.jar --preprocess -no \u003cpath to nodes.dmp\u003e -na \u003cpath to names.dmp\u003e \u003cpath to nr.fsa.gz\u003e \u003coutput file\u003e \u003cpath to prot.accession2taxid.FULL\u003e \u003cpath to dead_prot.accession2taxid\u003e [\u003cpaths 
to further mapping files\u003e...]\n````\n\n### Clustered NR (MEGAN)\n[MEGAN](https://software-ab.cs.uni-tuebingen.de/download/megan7/welcome.html) provided clustered versions\nof the NR database, that are clustered at either 50% (NR50) or 90% (NR90) sequence identity.\nThe corresponding FASTA files and mapping database files can be found on the [homepage of MEGAN](https://software-ab.cs.uni-tuebingen.de/download/megan7/welcome.html).\n\nNote: The zipped MEGAN database files have to be extracted before they can be used.\n\n#### Prepare the Database\nTo annotate the NR90 or NR50 database with the associated NCBI taxIds, the following command is used:\n\n````shell\njava -jar diamer.jar --preprocess -no \u003cpath to nodes.dmp\u003e -na \u003cpath to names.dmp\u003e \u003cpath to database\u003e \u003c\u003e\n````\n\n## 2. Indexing the reference database\nThe prepared reference database has to be indexed. In this step, a sorted list of all k-mers from the reference database,\nmapped to the LCA of all organisms they occur in is created. The list is stored in 1024 separate bucket files.\n\n````shell\njava -jar diamer.jar --indexdb [optional arguments] -no \u003cpath to nodes.dmp\u003e -na \u003cpath to names.dmp\u003e \u003cpath to prepared database\u003e \u003coutput path\u003e\n````\n\n## 3. Indexing the reads\nSimilar to the database index, a read index is created that consists of a sorted list of all k-mer-sequenceID\npairs. This list is again stored in the form of 1024 bucket files that correspond to the bucket files of the\ndatabase index.\n\nThe input FASTQ file can be gzipped.\n\n````shell\njava -jar diamer.jar --indexreads [optional arguments] \u003cpath to FASTQ\u003e \u003coutput path\u003e\n````\n\n## 4. 
Assigning the reads\nIn this final step, the database and reads indexes are compared and the reads are assigned to a taxon depending on the k-mers\nthat match the reference database and the selected assignment algorithm.\n\n````shell\njava -jar diamer.jar --assignreads [optional arguments] \u003cpath to database index\u003e \u003cpath to reads index\u003e \u003coutput path\u003e\n````\n\n# Output files\n\nDIAMER produces four different output files:\n1) `raw_assignments.tsv`\n   * the first row holds the number of remaining rows\n   * one row per read\n   * tab-separated columns: read header, space-separated list of k-mer matches\n   * each entry in the space-separated list has the format taxId:k-mer count\n   * similar to the default Kraken 2 output file\n    ````\n    @4d4262d4-c...\t2:38 131567:37 1280:9 1239:3 1:3 2249226:2 2053627:2 ...\n    @cd4133d1-f...\t131567:22 1:8 1613:3 2:2 4567:2 203682:1 3379134:1 ...\n    ...\n    ````\n2) `per_read_assignment.tsv`\n   * the first row holds the number of data rows\n   * the second row contains column names\n     * first column for a running index of the reads (starting at 0)\n     * followed by one column per assignment algorithm\n   * from row 3 on: one row per read\n     * cell format for assignment algorithms: (rank) label (taxId)\n   ````\n   1160526\n   ReadID\tOVO (0.20) kmer count\tOVO (0.20) norm. kmers\tOVA (1.00) kmer count\tOVA (1.00) norm. 
kmers\t...\n   0\t(superkingdom) Bacteria (2)\t(no rank) cellular organisms (131567)\t(species) Staphylococcus aureus (1280)\t(species) Photobacterium kishitanii (318456)\t...\n   ...\n   ````\n3) `per_taxon_assignment.tsv`\n   * one row per taxon in the reference taxonomy\n     * the first row contains column names\n   * some columns to identify taxa\n     * taxId\n     * rank\n     * label\n   * four columns with k-mer counts\n     * k-mer count that a taxon has in the reference database\n     * k-mer count the taxon has in the reads\n     * the cumulative k-mer count of the reads\n     * a normalized k-mer count with (k-mer count reads)/(k-mer count db)\n   * one column per algorithm (with threshold and weight)\n   ````\n   node id\trank\tlabel\tkmers in database\tkmer count\tkmer count (cumulative)\tOVA (1.00) kmer count\tOVA (1.00) kmer count (cumulative)\tOVA (1.00) norm. kmer count\tOVA (1.00) norm. kmer count (cumulative)\tnorm. kmer count ...\n   1\tno rank\troot\t76325325\t45559852\t626282761\t2926\t1132107\t12478\t1132107\t1949 ...\n   ...\n   ````\n\n## Syntax of Algorithm columns\nSeveral of the DIAMER output files contain columns that list the results of different algorithms.\nThe column names contain the name of the algorithm used, the threshold parameter and the weights used.\nAdditionally, there can be a `(cumulative)` flag to distinguish between raw and accumulated values.\n\nExample: `OVA (1.00) norm. 
kmer count (cumulative)`\n* _one-vs-all_ algorithm\n* threshold: 1\n* the algorithm used normalized k-mer counts as weights for the subtree\n* the value in this column is accumulated over the taxonomic tree\n\n# Syntax \u0026 Options\nThe argument syntax of DIAMER follows this pattern:\n````shell\njava [-Xmx\u003cRAM\u003e] -jar diamer.jar \u003ccomputation task\u003e [options] [input] [output]\n````\nThe maximum memory to use can be specified in GB with the JVM parameter `-Xmx\u003cint\u003eg`.\nDepending on the computation task, some options are mandatory.\nThe number of input and output parameters is task-dependent too.\n## Computation Task (mandatory)\nDIAMER needs to know which task to perform. This has to be indicated with one of these five flags:\n1) `--preprocess`\n   * Preprocesses the reference database\n   * syntax:\n     * `--preprocess [options] \u003cdatabase input\u003e \u003cdatabase output\u003e \u003cmapping file\u003e [further mapping files ...]`\n   * mandatory options:\n     * `-no`, `-na` for the taxonomic tree\n   * database input:\n     * reference database in FASTA format\n   * database output:\n     * output file (will be gzipped)\n   * mapping file(s)\n     * paths to mapping files (NCBI or MEGAN)\n2) `--indexdb`\n   * Index a preprocessed reference database\n   * syntax:\n     * `--indexdb [options] \u003cdatabase input\u003e \u003cindex folder\u003e`\n   * mandatory options:\n     * `-no`, `-na` for the taxonomic tree\n   * optional options:\n     * `-t`, `--threads` number of threads to use\n     * `-b`, `--buckets` number of buckets per iteration\n     * `--keep-in-memory` cache input sequence in memory\n     * `--mask` specify mask for k-mers (default: `1111111111111`)\n     * `--alphabet` supply a custom reduced amino acid alphabet (default: `[L][A][GC][VWUBIZO*][SH][EMX][TY][RQ][DN][IF][PK]`)\n     * `--filtering` filter k-mers (default: `c 3`)\n     * `--debug` debug mode\n     * `--statistics` collect additional statistics (might not 
work)\n     * `--only-standard-ranks` reduce the input taxonomy to the 8 NCBI standard ranks (might not work)\n   * database input:\n     * preprocessed reference database\n   * index folder:\n     * path to a folder where the database index will be stored\n3) `--indexreads`\n   * Generate an index of the reads\n   * syntax:\n     * `--indexreads [options] \u003creads FASTQ\u003e \u003cindex folder\u003e`\n   * optional options:\n     * `-t`, `--threads` number of threads to use\n     * `-b`, `--buckets` number of buckets per iteration\n     * `--keep-in-memory` cache input sequence in memory\n     * `--mask` specify mask for k-mers (default: `1111111111111`)\n     * `--alphabet` supply a custom reduced amino acid alphabet (default: `[L][A][GC][VWUBIZO*][SH][EMX][TY][RQ][DN][IF][PK]`)\n     * `--filtering` filter k-mers (default: `c 3`, recommended `c 0` -\u003e no filtering (faster))\n     * `--debug` debug mode\n     * `--statistics` collect additional statistics (might not work)\n   * reads FASTQ:\n     * FASTQ file with the DNA reads\n   * index folder:\n     * path to the folder where the reads index will be stored\n4) `--assignreads`\n   * Assign reads to taxa\n   * syntax:\n     * `--assignreads [options] \u003creference DB index\u003e \u003creads index\u003e \u003coutput folder\u003e`\n   * optional options:\n     * `-t`, `--threads` number of threads to use\n     * `-b`, `--buckets` number of buckets to process in parallel (will be equal to the number of threads if unspecified)\n     * `--debug` debug mode\n     * `--statistics` collect additional statistics (might not work)\n   * reference DB index\n     * path to the folder of the reference database index\n   * reads index\n     * path to the folder of the reads index\n   * output folder\n     * path to the folder where the result files will be stored\n5) `--analyze-db-index`\n   * Calculate some statistics on the database index (might be broken)\n   * syntax:\n     * `--analyze-db-index \u003creference 
DB index\u003e \u003coutput folder\u003e`\n   * reference database index:\n     * path to the index of the reference database\n   * output folder:\n     * path to a folder where output files will be stored\n\n## Available Options\n* `-t`, `--threads`\n  * number of threads to use\n* `-b`, `--buckets`\n  * index generation: number of buckets per iteration\n  * read classification: number of buckets to be processed in parallel. Cannot exceed the number of threads in this case.\n* `--keep-in-memory`\n  * cache input sequence in memory\n  * the memory that is required for caching the sequence is not considered when estimating how many buckets\n  to process in parallel. Manually setting `-b` is recommended.\n* `--mask`\n  * specify a mask for k-mer extraction during indexing\n  * default: `1111111111111`\n  * spaces can be used to mask amino acids: `111010110100110111`\n* `--alphabet`\n  * supply a custom reduced amino acid alphabet\n  * default: `[L][A][GC][VWUBIZO*][SH][EMX][TY][RQ][DN][IF][PK]`\n  * the parser is case sensitive! 
If your input sequences contain lower case letters, they should be included in the\n  alphabet as well.\n  * all undefined characters (including `*`) will be interpreted as the end of a sequence\n  and split sequences at this position\n* `--filtering`\n  * filter k-mers during indexing\n  * default: `c 3`\n  * for read indexing `c 0` (no filtering) is recommended, since it is much faster\n  * complexity filtering:\n    * syntax: `--filtering c \u003cnumber\u003e`\n    * only keeps k-mers with a complexity higher than `number`\n  * probability filtering:\n    * syntax: `--filtering p \u003cnumber\u003e`\n    * only keeps k-mers with a probability lower than `number`\n    * e.g., `--filtering p 1e-12`\n  * complexity maximizer\n    * use the minimizer concept to only keep the k-mer with maximal complexity within a window of size `number`\n    * syntax: `--filtering cm \u003cnumber\u003e`\n    * the window size cannot be smaller than the k-mer length\n  * probability minimizer\n    * use the minimizer concept to only keep k-mers with a low probability within a window of size `number`\n    * syntax: `--filtering pm \u003cnumber\u003e`\n    * the window size cannot be smaller than the k-mer length\n\n# Build from source with maven\n\n#### Clean, compile, test, and package the project\nmvn clean package\n\n#### Just compile without running tests\nmvn clean compile\n\n#### Package without running tests\nmvn clean package -DskipTests\n\n#### Just create the assembly JAR (if already compiled)\nmvn assembly:single","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhusonlab%2Fdiamer1","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhusonlab%2Fdiamer1","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhusonlab%2Fdiamer1/lists"}