{"id":16649635,"url":"https://github.com/luispedro/strobefilter","last_synced_at":"2025-03-12T13:23:31.917Z","repository":{"id":223786561,"uuid":"761501142","full_name":"luispedro/strobefilter","owner":"luispedro","description":"Benchmarking pre-filtering large databases before short-read mapping","archived":false,"fork":false,"pushed_at":"2024-08-09T03:27:41.000Z","size":93,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-19T06:43:35.185Z","etag":null,"topics":["bioinformatics","metagenomics","strobemers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luispedro.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.MIT","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-22T00:15:10.000Z","updated_at":"2024-08-09T03:27:44.000Z","dependencies_parsed_at":"2024-04-02T04:49:54.792Z","dependency_job_id":null,"html_url":"https://github.com/luispedro/strobefilter","commit_stats":null,"previous_names":["luispedro/strobefilter"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luispedro%2Fstrobefilter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luispedro%2Fstrobefilter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luispedro%2Fstrobefilter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luispedro%2Fstrobefilter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/owners/luispedro","download_url":"https://codeload.github.com/luispedro/strobefilter/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243223587,"owners_count":20256543,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","metagenomics","strobemers"],"created_at":"2024-10-12T09:11:26.882Z","updated_at":"2025-03-12T13:23:31.885Z","avatar_url":"https://github.com/luispedro.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Prefilter for mapping\n\n**Problem:** Mapping metagenomes to large databases, such as the\n[GMGCv1](https://gmgc.embl.de), takes too much memory. Partitioning the dataset\nis a common solution (and [supported by\nNGLess](https://ngless.embl.de/Mapping.html#low-memory-mode)), but it has\ndrawbacks (it's slow).\n\nThis repository explores the possibility of prefiltering the database by\nremoving sequences that are extremely unlikely to be matches.\n\n## Approach\n\n1. Parse all the reads and collect all randstrobes (or rather their hashes)\n2. Parse the database and select only unigenes that are expected to be present in the reads\n3. Map as usual to the pre-filtered database\n\nFor `2`, different strategies are possible. The simplest is to keep any unigene\nthat shares any hash with the set of hashes from the reads. 
The strategies currently being\nconsidered are:\n\n- `min1`: keep all references that match at least one hash\n- `min2`: keep all references that match at least two hashes\n\nWe also tested counting the exact value or using a hacky Bloom filter structure\nthat uses a single fixed-size array, but the hacky version gave bad estimates.\n\n### Requirements\n\n- Python, including NumPy and Pandas\n- [Jug](https://jug.rtfd.io/)\n- [NGLess](https://ngless.embl.de/)\n- [Strobealign](https://github.com/ksahlin/strobealign) ([Sahlin, 2022](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02831-7)), including the Python bindings\n- [tabulate](https://pypi.org/project/tabulate/) is used to print the final table\n\nTo install most dependencies (assuming you have conda-forge \u0026 bioconda set up):\n\n```\nconda install python=3.11 numpy pandas requests tabulate jug ngless\n```\n\nTo install `strobealign`'s Python bindings (which will **not** be installed by default with `conda`):\n\n```\n# To ensure you have a recent C++ compiler (not always needed)\nconda install gxx_linux-64 gcc_linux-64\nexport CC CXX\n\ngit clone https://github.com/ksahlin/strobealign\ncd strobealign\npip install .\n```\n\n### Data\n\n1. _Database_: GMGCv1 (from [Coelho et al., 2022](https://www.nature.com/articles/s41586-021-04233-4)). This can be downloaded by `jugfile.py`.\n2. _Metagenomes_: Dog dataset (from [Coelho et al., 2018](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0450-3)) and human gut dataset (from [Zeller et al., 2014](https://doi.org/10.15252/msb.20145645)). These can be downloaded with [ena-mirror](https://github.com/BigDataBiology/ena-mirror). More guidance on how to do this will be provided soon, but [get in touch](https://github.com/luispedro/strobefilter/issues) if you have questions.\n\nNote that running this benchmark will use a lot of disk storage!\n\n\n### Author\n\n- [Luis Pedro Coelho](https://luispedro.org) (Queensland University of Technology). 
[luis@luispedro.org](mailto:luis@luispedro.org)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluispedro%2Fstrobefilter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluispedro%2Fstrobefilter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluispedro%2Fstrobefilter/lists"}