{"id":20547412,"url":"https://github.com/ggonnella/fastsubtrees","last_synced_at":"2026-02-28T03:09:12.217Z","repository":{"id":44088473,"uuid":"432295861","full_name":"ggonnella/fastsubtrees","owner":"ggonnella","description":"Python library and command line script , for fast extraction of subtrees of fairly large trees, consisting of millions of nodes, such as the NCBI taxonomy tree.","archived":false,"fork":false,"pushed_at":"2025-02-17T11:23:20.000Z","size":854,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-02-22T19:45:06.737Z","etag":null,"topics":["bioinformatics","ncbi-taxonomy","python","subtree","subtree-extraction","subtree-query","taxonomy","tree"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"isc","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ggonnella.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.txt","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.txt","dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-11-26T20:38:53.000Z","updated_at":"2025-02-17T11:23:24.000Z","dependencies_parsed_at":"2025-04-11T11:26:02.055Z","dependency_job_id":null,"html_url":"https://github.com/ggonnella/fastsubtrees","commit_stats":{"total_commits":476,"total_committers":4,"mean_commits":119.0,"dds":"0.15756302521008403","last_synced_commit":"57b47fd6494b070c3ad02a300be41b48111ab877"},"previous_names":[],"tags_count":32,"template":false,"template_full_name":null,"purl":"pkg:github/ggonnella/fastsubtrees","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ggonnella%2Ffastsubtrees","tags
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ggonnella%2Ffastsubtrees/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ggonnella%2Ffastsubtrees/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ggonnella%2Ffastsubtrees/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ggonnella","download_url":"https://codeload.github.com/ggonnella/fastsubtrees/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ggonnella%2Ffastsubtrees/sbom","scorecard":{"id":425312,"data":{"date":"2025-08-11","repo":{"name":"github.com/ggonnella/fastsubtrees","commit":"42b5e2ac7498709b738480d33174b302441f267a"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.1,"checks":[{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are 
merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/draft-pdf.yml:1","Warn: no topLevel permission defined: .github/workflows/python-package.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source 
repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: ISC License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/draft-pdf.yml:9: update your workflow using https://app.stepsecurity.io/secureworkflow/ggonnella/fastsubtrees/draft-pdf.yml/main?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/draft-pdf.yml:11: update your workflow using 
https://app.stepsecurity.io/secureworkflow/ggonnella/fastsubtrees/draft-pdf.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/draft-pdf.yml:17: update your workflow using https://app.stepsecurity.io/secureworkflow/ggonnella/fastsubtrees/draft-pdf.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/python-package.yml:20: update your workflow using https://app.stepsecurity.io/secureworkflow/ggonnella/fastsubtrees/python-package.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/python-package.yml:22: update your workflow using https://app.stepsecurity.io/secureworkflow/ggonnella/fastsubtrees/python-package.yml/main?enable=pin","Warn: containerImage not pinned by hash: Dockerfile:2: pin your Docker image by updating ubuntu:22.04 to ubuntu:22.04@sha256:1aa979d85661c488ce030ac292876cf6ed04535d3a237e49f61542d8e5de5ae0","Warn: downloadThenRun not pinned by hash: Dockerfile:22","Warn: pipCommand not pinned by hash: Dockerfile:24","Warn: pipCommand not pinned by hash: Dockerfile:29","Warn: pipCommand not pinned by hash: Dockerfile:30","Warn: pipCommand not pinned by hash: Dockerfile:31","Warn: pipCommand not pinned by hash: Dockerfile:32","Warn: pipCommand not pinned by hash: Dockerfile:33","Warn: pipCommand not pinned by hash: Dockerfile:38","Warn: pipCommand not pinned by hash: docker/start-example-app:6","Warn: pipCommand not pinned by hash: .github/workflows/python-package.yml:27","Warn: pipCommand not pinned by hash: .github/workflows/python-package.yml:28","Warn: pipCommand not pinned by hash: .github/workflows/python-package.yml:29","Warn: pipCommand not pinned by hash: .github/workflows/python-package.yml:40","Info:   0 out of   4 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   1 third-party GitHubAction dependencies pinned","Info:   0 out of   1 downloadThenRun dependencies pinned","Info:   0 out of  12 pipCommand 
dependencies pinned","Info:   0 out of   1 containerImage dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Branch-Protection","score":-1,"reason":"internal error: error during branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration","details":null,"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":3,"reason":"7 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2022-14 / GHSA-39ph-wr67-j4xq","Warn: Project is vulnerable to: PYSEC-2021-142 / GHSA-8q59-q68h-6hv4","Warn: Project is vulnerable to: PYSEC-2018-49 / GHSA-rprw-h62v-c2w7","Warn: Project is vulnerable to: PYSEC-2019-124 / GHSA-38fc-9xqv-7f7q","Warn: Project is vulnerable to: PYSEC-2019-123 / GHSA-887w-45rq-vxgf","Warn: Project is vulnerable to: PYSEC-2012-9 / GHSA-hfg2-wf6j-x53p","Warn: Project is vulnerable to: GHSA-g7vv-2v7x-gj9p"],"documentation":{"short":"Determines if the project has open, known unfixed 
vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-19T02:07:32.070Z","repository_id":44088473,"created_at":"2025-08-19T02:07:32.070Z","updated_at":"2025-08-19T02:07:32.070Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29923442,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-27T19:37:42.220Z","status":"online","status_checked_at":"2026-02-28T02:00:07.010Z","response_time":90,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","ncbi-taxonomy","python","subtree","subtree-extraction","subtree-query","taxonomy","tree"],"created_at":"2024-11-16T02:08:08.399Z","updated_at":"2026-02-28T03:09:12.202Z","avatar_url":"https://github.com/ggonnella.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fastsubtrees\n\n[![License: ISC](https://img.shields.io/badge/License-ISC-blue.svg)](https://opensource.org/licenses/ISC)\n[![Latest Github 
tag](https://img.shields.io/github/v/tag/ggonnella/fastsubtrees)](https://github.com/ggonnella/fastsubtrees/tags)\n[![ReadTheDocs](https://readthedocs.org/projects/pip/badge/?version=stable)](https://fastsubtrees.readthedocs.io/)\n[![PyPI](https://img.shields.io/pypi/v/fastsubtrees)](https://pypi.org/project/fastsubtrees/)\n[![DOI](https://joss.theoj.org/papers/10.21105/joss.04755/status.svg)](https://doi.org/10.21105/joss.04755)\n\n_Fastsubtrees_ is a Python library and a command line script for handling fairly\nlarge trees (on the order of millions of nodes), in particular\nallowing the fast extraction of any subtree.\nThe main application domain of _fastsubtrees_ is working with the NCBI taxonomy\ntree; however, the code is implemented in a generic way, so that other\napplications are possible.\n\nThe library functionality can be accessed both from inside Python code\nand from the provided command line tool ``fastsubtrees``.\n\n## Introduction\n\nFor the use of _fastsubtrees_, nodes must be uniquely identified by non-negative IDs.\nFurthermore, the space of the IDs must be compact (i.e. the maximum ID should not be\nmuch larger than the number of IDs).\n\nThe first step when using _fastsubtrees_ is to construct a tree representation.\nThe operation requires a source of IDs of elements and their parents, which can be\na tabular file, or any Python function yielding the IDs.\n\nThis operation takes just a few seconds for a tree with millions of nodes, such as the NCBI taxonomy tree.\nIt must be done only once, as long as the tree does not change, since the resulting data\nis stored to file.\n\nThe IDs of the NCBI taxonomy tree fulfill the conditions stated above. However, the library\ncan be used for any tree. One way to use the library with IDs which do not fulfill the conditions\nis to map them to an ID space which does, and store the original IDs as an attribute.\n\nBesides the IDs, a tree can contain further information, e.g. 
integers, floats or other\ndata, here called attributes, associated with the nodes. Each node can contain zero, one or more values\nfor an attribute. To add values for an attribute, a tabular file or another data\nsource (a Python function) is selected.\n\nThe data for any subtree can then be easily and efficiently queried; thereby the node IDs and/or other\nselected attributes can be retrieved.\n\nThe tree representation is dynamic, i.e. both the tree topology and the attribute values can be\nedited and changed.\n\n## Working with the library\n\n### Installation\n\nThe package can be installed using ``pip install fastsubtrees``.\n\n### Command line interface\n\nThe command line tool ``fastsubtrees`` allows constructing and modifying a tree\n(subcommand ``tree``), adding and editing attributes (subcommand ``attribute``)\nand performing a subtree query (subcommand ``query``).\n\nThe command line interface is further described in the\n[CLI manual](https://github.com/ggonnella/fastsubtrees/blob/main/docs/cli.md).\n\n#### CLI example: working with the NCBI taxonomy tree\n\nThe example below uses the ``fastsubtrees`` command, as well as the ``ntdownload`` library\n(installed as a dependency, by ``pip``) for obtaining the NCBI taxonomy data.\n\n```\nntdownload ntdumps                                     # download NCBI taxonomy data\nfastsubtrees tree nt.tree --ncbi ntdumps/nodes.dmp -f  # create the tree\nfastsubtrees query nt.tree 562                         # query node 562\n\n# attributes\nATTRTAB=data/accession_taxid_attribute.tsv.gz          # data file\nTAXID=2; GENOME_SIZE=3; GC_CONTENT=4                   # column numbers, 1-based\n\nfastsubtrees attribute nt.tree genome_size $ATTRTAB -e $TAXID -v $GENOME_SIZE -t int\nfastsubtrees attribute nt.tree GC_content $ATTRTAB -e $TAXID -v $GC_CONTENT -t float\n\nfastsubtrees query nt.tree 562 genome_size GC_content  # query including attributes\n\n# taxonomy names\nntnames ntdumps \u003e| names.tsv                           # 
prepare data from names dump\nfastsubtrees attribute nt.tree taxname names.tsv       # add names as attribute\nfastsubtrees query nt.tree 562 taxname genome_size     # query including taxa names\n```\n\n#### Using NtSubtree\n\nThe package ``ntsubtree`` (installable by ``pip``) simplifies working with the NCBI taxonomy even more.\nThe tree and the taxonomic names table are automatically created and stored in a central location.\n\n```\n# first run after installing automatically downloads and constructs the tree\n\nntsubtree query 562               # taxonomic names displayed alongside the IDs\nntsubtree query -n \"Escherichia\"  # Query by taxonomic name\n\n# attributes\nATTRTAB=data/accession_taxid_attribute.tsv.gz          # data file\nTAXID=2; GENOME_SIZE=3; GC_CONTENT=4                   # column numbers\n\nntsubtree attribute genome_size $ATTRTAB -e $TAXID -v $GENOME_SIZE\nntsubtree attribute GC_content $ATTRTAB -e $TAXID -v $GC_CONTENT\nntsubtree query -n \"Escherichia\" genome_size GC_content\n\n# check if a newer version of the taxonomy data is available\n# and update the tree if necessary, keeping the attribute values:\nntsubtree update\n```\n\n### API\n\nThe library functionality can also be accessed directly in Python code using\nthe API, which is documented in the\n[API manual](https://github.com/ggonnella/fastsubtrees/blob/main/docs/api.md).\n\n#### API example: working with the NCBI taxonomy tree\n\nThe example below uses the ``fastsubtrees`` library, as well as the ``ntdownload`` library\n(installed as a dependency, by ``pip``) for obtaining the NCBI taxonomy data.\n\n```python\n# download the NCBI taxonomy data\nfrom ntdownload import Downloader\nd = Downloader(\"ntdumpsdir\")\nhas_downloaded = d.run()\n\nfrom fastsubtrees import Tree\ninfile = \"ntdumpsdir/nodes.dmp\"\ntree = Tree.construct_from_ncbi_dump(infile)     # create the tree\nresults = tree.subtree_ids(562)                   # retrieve subtree 
IDs\n\nattrtab=\"data/accession_taxid_attribute.tsv.gz\"         # data file\ntaxid_col=1; genome_size_col=2; gc_content_col=3        # column numbers, 0-based\n\ntree.to_file(\"nt.tree\")\ntree.create_attribute_from_tabular(\"genome_size\", attrtab, elem_field_num=taxid_col,\n                                   attr_field_num=genome_size_col, casting_fn=int)\ntree.create_attribute_from_tabular(\"GC_content\", attrtab, elem_field_num=taxid_col,\n                                   attr_field_num=gc_content_col, casting_fn=float)\nresults = tree.subtree_info(562, [\"genome_size\", \"GC_content\"])\n\n# taxonomy names\nfrom ntdownload import yield_scientific_names_from_dump as generator\ntree.create_attribute(\"taxname\", generator(\"ntdumpsdir\"))\nresults = tree.subtree_info(562, [\"taxname\", \"genome_size\"])\n```\n\n#### Using NtSubtree\n\nThe package ``ntsubtree`` (installable by ``pip``) simplifies working with the NCBI taxonomy even more.\nTree and the taxonomic names tables are automatically created and stored in a central location.\nThe first time the library is included these operations are done automatically.\n\n```python\nimport ntsubtree\n\ntree = ntsubtree.get_tree()\nresults = tree.subtree_ids(562)\n\ntaxid = ntsubtree.search_name(\"Escherichia\")\nresults = tree.subtree_info(taxid, [\"taxname\"])\n\nattrtab=\"data/accession_taxid_attribute.tsv.gz\"         # data file\ntaxid_col=1; genome_size_col=2; gc_content_col=3        # column numbers, 0-based\n\ntree.create_attribute_from_tabular(\"genome_size\", attrtab, elem_field_num=taxid_col,\n                                   attr_field_num=genome_size_col, casting_fn=int)\ntree.create_attribute_from_tabular(\"GC_content\", attrtab, elem_field_num=taxid_col,\n                                   attr_field_num=gc_content_col, casting_fn=float)\nresults = tree.subtree_info(562, [\"genome_size\", \"GC_content\"])\n\n# check if a newer version of the taxonomy data is available\n# and update the tree if 
necessary, keeping the attribute values:\nntsubtree.update()\n```\n\n### Docker\n\nTo try or test the package, ``fastsubtrees`` can be run\nusing the Docker image defined in ``Dockerfile``.\nThis does not require any external database installation and configuration.\n\n\u003cdetails\u003e\n    \u003csummary\u003eExample of the Docker command line:\u003c/summary\u003e\n\n```\n# create a Docker image\ndocker build --tag \"fastsubtrees\" .\n\n# create a container and run it\ndocker run -p 8050:8050 --detach --name fastsubtreesC fastsubtrees\n# or, if it was already created and stopped, restart it using:\n# docker start fastsubtreesC\n\n# run the tests\ndocker exec fastsubtreesC tests\n\n# run benchmarks\ndocker exec fastsubtreesC benchmarks\n\n# run the example application\ndocker exec fastsubtreesC start-example-app\n# now open it in the browser at https://0.0.0.0:8050\n```\n\u003c/details\u003e\n  \n### Tests\n\nTo run the test suite, you can use ``pytest`` (or ``make tests``).\nThe tests include tests of ``fastsubtrees`` and of the sub-package ``ntmirror``.\nThe latter partly depend on a database installation and configuration,\nwhich must be given in ``ntmirror/tests/config.yaml``;\ndatabase-dependent tests are skipped if this configuration file is not provided.\n\nThe entire test suite can also be run from the Docker container,\nwithout further configuration (see the _Docker_ section above).\n\n### Benchmarks\n\nBenchmarks can be run using the shell scripts provided under ``benchmarks``.\nThese require data downloaded from the NCBI taxonomy, as well as\nsome pre-computed example data provided in the ``data`` subdirectory\n(genome sizes and GC content).\n\nThe benchmarks can be conveniently run from the Docker container, without\nrequiring a database installation and setup (see the _Docker_ section above).\n\n### Example application: Genome attributes viewer\n\nAn interactive web application based on ``fastsubtrees`` was developed 
using\n_dash_. It graphically displays the distribution of values of\nattributes in subtrees of the NCBI taxonomic tree.\nIt is a separate Python package, which can\nbe installed using ``pip``, and depends on _fastsubtrees_.\n\nIt can also be installed using the Docker image of\n_fastsubtrees_ (see the _Docker_ section above).\n\nFor more information see also the ``genomes-attributes-viewer/README.md`` file.\n\n#### Local installation and startup\n\nThe application can be installed using ``pip install genomes_attributes_viewer``\nor from the ``genomes_attributes_viewer`` directory of the _fastsubtrees_\nrepository.\n\nTo start the application, use the ``genomes-attributes-viewer`` command.\nThe first time this command is run, the application data are downloaded and\nprepared, taking a few seconds. Subsequent\nstarts do not require these operations and are thus faster.\n\n### Other subpackages\n\n#### NtSubtree\n\nNtSubtree is a library which automatically downloads the NCBI taxonomy\ndump and constructs the ``fastsubtrees`` data for it. It makes it easy to\nkeep the data up-to-date. It is a separate Python package, which can\nbe installed using ``pip``, and depends on _fastsubtrees_.\n\nThe ``query`` command of the NtSubtree CLI tool also automatically\ndisplays taxonomic names alongside the IDs in query results and allows\nperforming queries by taxonomic name.\n\nFor more information see also the ``ntsubtree/README.md`` file.\n\n#### ntdownload\n\nWhen working with the NCBI taxonomy database, a local copy of the NCBI taxonomy\ndump can be obtained and kept up-to-date using the _ntdownload_ package, which\nis located in the directory ``ntdownload``. 
It is a separate\nPython package, which can be installed using ``pip``, independently\nfrom _fastsubtrees_.\n\nPlease refer to the user manual of _ntdownload_ located under ``ntdownload/README.md``\nfor more information.\n\n#### ntmirror\n\nA downloaded NCBI taxonomy database dump can be loaded into\na local SQL database, using the package _ntmirror_, which is located\nin the directory ``ntmirror``.\nIt is a separate Python package, which can\nbe installed using ``pip``, independently from _fastsubtrees_.\n\nIt also contains a script to extract subtrees\nfrom the local database mirror using hierarchical SQL queries.\n\nPlease refer to the user manual of _ntmirror_ located under ``ntmirror/README.md``\nfor more information.\n\n### Internals\n\nTo achieve an efficient running time and memory use, the nodes of the tree\nare represented compactly in depth-first traversal order.\nSubtrees are then extracted in _O(s)_ time, where _s_ is the size of the extracted\nsubtree (i.e. not depending on the size of the whole tree).\n\nThe IDs need not\nall be consecutive (i.e. some \"holes\" may be present), but the\nlargest node ID (_idmax_) should not be much larger than the total number of\nnodes, because the memory consumption is in _O(idmax)_.\n\nFor each attribute defined in a tree, a file is created, where the attribute\nvalues are stored. The attributes are also stored in the same depth-first traversal\norder as the tree IDs.\n\n#### Tree construction algorithm\n\nThe tree construction algorithm used here is the following,\nwhere the input data consists of 2-tuples ``(element_id, parent_id)``\nand the maximum node ID _m_ is not much larger than the number of IDs _n_.\n\n1. iteration over the input data to construct a table _P_ of parents by ID,\ni.e. 
``P[element_id]=parent_id`` if ``element_id`` is in the tree,\nand ``P[element_id]=UNDEF`` if not, where _UNDEF_ is a special value.\nThis requires _O(n)_ steps\nfor reading the IDs and _O(m)_ steps for writing either the ID or the _UNDEF_\nvalue to _P_; since _m\u003e=n_, the total time is in _O(m)_.\n2. iteration over table _P_ to construct a table _S_ of subtree sizes\nby ID; for each element the tree is climbed to the root, to add the\nelement to the counts of each ancestor. This\noperation requires _O(n*d)_ time, where _d_ is the depth of the node\n(its distance from the root), which is on average much lower than _m_; _d=m_ is the worst case.\n3. iteration over each node ID to construct the list _D_, consisting of the depth-first order of\nthe nodes, and the table _C_ of the coordinates of all nodes in the tree data, by ID.\nFor this operation, first the root is added to _D_ and _C_, then\nfor each other node _x_ in _P_, the tree is climbed and the not-yet-added nodes are pushed onto a stack until an already-added\nancestor is reached. The topmost node on the stack is then added at the next\nfree position in the subtree of its parent (which has already been added,\nby definition, thus the next free position in its subtree is known). After this,\nthe next stack element is added, and so on until _x_ is added.\nAlthough this operation also requires climbing the tree, it takes in total _O(n)_ time,\nsince each node is added only once.\n\n#### Parallelizing the tree construction\n\nCurrently the slowest step of the construction, detailed in the previous\nsection, is the second, i.e. 
the computation of _S_.\nSince each node must be counted in the subtree size of all its ancestors,\nthere is no easy way to reduce the time from _O(n*d)_.\n\nTo parallelize this step, one divides the parents table into _t_ slices,\nand assigns each to a different sub-process (not thread, because of the GIL).\nEach sub-process then counts the subtree sizes in its slice only.\nA version implemented with a shared table and a lock was too slow,\nsince the sub-processes contended for access to the table.\nIn the current version, instead, each sub-process builds its own subtree sizes\ntable _S'_. The _S'_ tables of the sub-processes are summed up after completion to\nobtain the _S_ table.\n\nThis option can be activated in the CLI using the ``--processes P`` option,\nor in the API by setting the ``nprocesses`` argument of ``Tree.construct`` and\nrelated methods. Benchmarks show that the parallel version did not significantly\nimprove the performance of constructing the NCBI taxonomy tree, likely\nbecause of the overhead of starting the processes, initializing the _S'_ arrays\nand summing up all _S'_ to _S_ after completion.\n\n## Community guidelines\n\nContributions to the software are welcome. 
Please clone this repository\nand send a pull request on GitHub, so that the changes can be integrated into\nthe original repository.\n\nPlease report bugs and issues through the GitHub Issues page\nof the repository.\n\n## Documentation\n\nThe complete documentation of Fastsubtrees is available on ReadTheDocs\nat https://fastsubtrees.readthedocs.io/ as a website and in\n[PDF format](https://fastsubtrees.readthedocs.io/_/downloads/en/latest/pdf/).\n\n## Licence\n\nAll code of Fastsubtrees is released under the ISC license\n(see LICENSE file).\nIt is functionally equivalent to a two-term BSD copyright with\nlanguage removed that is made unnecessary by the Berne convention.\nSee http://openbsd.org/policy.html for more information on copyrights.\n\n## Acknowledgements\nThis software was originally created in the context of the DFG project GO 3192/1-1\n“Automated characterization of microbial genomes and metagenomes by collection and verification of association rules”.\nThe funders had no role in study design, data collection and analysis.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fggonnella%2Ffastsubtrees","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fggonnella%2Ffastsubtrees","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fggonnella%2Ffastsubtrees/lists"}