{"id":24143863,"url":"https://github.com/dnbaker/bonsai","last_synced_at":"2025-09-19T12:32:28.641Z","repository":{"id":46261022,"uuid":"71488531","full_name":"dnbaker/bonsai","owner":"dnbaker","description":"Bonsai: Fast, flexible taxonomic analysis and classification","archived":false,"fork":false,"pushed_at":"2024-04-09T22:14:10.000Z","size":77696,"stargazers_count":70,"open_issues_count":3,"forks_count":11,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-04-20T00:06:34.834Z","etag":null,"topics":["bioinformatics","database","metagenomics"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dnbaker.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2016-10-20T17:43:52.000Z","updated_at":"2024-01-28T22:02:32.000Z","dependencies_parsed_at":"2024-04-09T23:40:32.523Z","dependency_job_id":null,"html_url":"https://github.com/dnbaker/bonsai","commit_stats":null,"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fbonsai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fbonsai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fbonsai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnbaker%2Fbonsai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dnbaker","download_url":"https://codeload.github.com/dnbaker/bonsai/tar.gz/refs/heads/main","host":{"name":"Git
Hub","url":"https://github.com","kind":"github","repositories_count":233570525,"owners_count":18695859,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","database","metagenomics"],"created_at":"2025-01-12T05:45:44.050Z","updated_at":"2025-09-19T12:32:21.927Z","avatar_url":"https://github.com/dnbaker.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"Bonsai: Flexible Taxonomic Analysis and Extension [![Build Status](https://travis-ci.com/dnbaker/bonsai.svg?branch=main)](https://travis-ci.com/dnbaker/bonsai) [![Language grade: C/C++](https://img.shields.io/lgtm/grade/cpp/g/dnbaker/bonsai.svg?logo=lgtm\u0026logoWidth=18)](https://lgtm.com/projects/g/dnbaker/bonsai/context:cpp)\n===============\n\nBonsai contains varied utilities for taxonomic analysis and classification using exact subsequence matches. These include:\n* A high-performance, generic taxonomic classifier\n  * Efficient classification\n    * 20x as fast, single-threaded, as Kraken in our benchmarks, while demonstrating significantly better threadscaling.\n  * Arbitrary, user-defined spaced-seed encoding.\n    * *Reference compression* by windowing/minimization schemes.\n    * *Generic minimization* including by taxonomic depth, lexicographic value, subsequence specificity, or Shannon entropy.\n  * Parallelized pairwise Jaccard Distance estimation using HyperLogLog sketches, which has recently migrated to [dashing](https://github.com/dnbaker/dashing).\n* An unsupervised method for taxonomic structure discovery and correction. 
(metatree)\n* A threadsafe, SIMD-accelerated HyperLogLog implementation, which has migrated to [hll](https://github.com/dnbaker/hll).\n* Scripts for downloading reference genomes from new (post-2014) and old RefSeq.\n\nTools have been compiled using both zlib and zstd, which means that they can transparently consume zlib-, zstd-, and uncompressed files.\n\nAll of these tools are experimental. Use at your own risk.\n\n\nBuild Instructions\n=================\n\n`cd bonsai \u0026\u0026 make bonsai`\n\nUnit Tests\n=================\nWe use the Catch testing framework. You can build and run the tests with:\n\n`cd bonsai \u0026\u0026 make unit \u0026\u0026 ./unit`\n\n\nDependencies\n============\nPrimary dependency is `sketch`, stored in hll, which handles sketching + bit math requirements.\nIn addition, we require zlib, ntHash, and zstd.\n\nUsage\n================\n\nEncoding: Use `Encoder` from `include/bonsai/encoder.h` to directly encode k-mers or `RollingHasher` to encode k-mers with a rolling hash to enable unbounded length.\nThese are then called via `for_each` and `for_each_hash` functions.\n\n\nExecutables:\n\nUsage instructions are available in each executable by executing it with no options or providing the `-h` flag.\n\n\nFor classification purposes, the commands involved are `bonsai prebuild`, `bonsai build`, and `bonsai classify`.\nprebuild is only required for taxonomic or feature minimization strategies, in which case database building requires double the memory.\nUnless you're very sure you know what you're doing, we recommend simply `bonsai build` with either Entropy or Lexicographic minimization.\n\nTo build a database with k = 31, window size = 50, minimized by entropy, from a taxonomy in `ref/nodes.dmp` and a nameidmap in `ref/nameidmap.txt` and store it in `bns.db`:\n```\nbonsai build -e -w50 -k31 -p20 -T ref/nodes.dmp -M ref/nameidmap.txt bns.db `find ref/ -name '*.fna.gz'`\n```\n\nTo prepare the above, the script in 
`python/download_genomes.py` can be used. The default of downloading all available genomes can be run by `python python/download_genomes.py --threads 20 all`.\nThis places downloaded genomes by default into the paths listed above in the `bonsai build` command. These paths can be altered; see `python/download_genomes.py -h/--help` for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnbaker%2Fbonsai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdnbaker%2Fbonsai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnbaker%2Fbonsai/lists"}