{"id":22313690,"url":"https://github.com/ctb/2020-long-read-assembly-decontam","last_synced_at":"2025-08-20T20:07:08.125Z","repository":{"id":66659477,"uuid":"255101771","full_name":"ctb/2020-long-read-assembly-decontam","owner":"ctb","description":"Try 2 of detecting/removing microbial contamination from long-read assemblies.","archived":false,"fork":false,"pushed_at":"2020-04-18T16:51:42.000Z","size":4736,"stargazers_count":3,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2023-10-26T10:04:43.907Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ctb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-04-12T14:36:57.000Z","updated_at":"2020-12-31T03:08:50.000Z","dependencies_parsed_at":"2023-03-27T11:47:41.631Z","dependency_job_id":null,"html_url":"https://github.com/ctb/2020-long-read-assembly-decontam","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctb%2F2020-long-read-assembly-decontam","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctb%2F2020-long-read-assembly-decontam/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctb%2F2020-long-read-assembly-decontam/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ctb%2F2020-long-read-assembly-decontam/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ctb","download_url":"https:
//codeload.github.com/ctb/2020-long-read-assembly-decontam/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228006294,"owners_count":17854995,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-03T22:07:56.374Z","updated_at":"2024-12-03T22:07:57.069Z","avatar_url":"https://github.com/ctb.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 2020-long-read-assembly-decontam\n\nFind and extract components of long-read assemblies that match to a\ndatabase, for the purposes of decontamination.\n\n**Still early in development.** Buyer beware! Here be dragons!!\n\n## Installing!\n\nClone this repository and change into the top-level repo directory.\nThe file `environment.yml` contains the necessary conda packages\n(python and snakemake) to run charcoal; see the Quickstart section\nfor explicit instructions.\n\n### Quickstart:\n\nClone the repository, change into it, create the environment, and activate it:\n\n```\ngit clone https://github.com/ctb/2020-long-read-assembly-decontam\ncd ./2020-long-read-assembly-decontam/\nconda env create -f environment.yml -n lra-decontam\nconda activate lra-decontam\n```\n\n## Running!\n\nTo run, execute (in the top-level directory):\n\n```\nsnakemake --use-conda -p -j 1\n```\n\nThis should succeed :).\n\nOnce that works, you can configure it yourself by copying\n`test-data/conf-test.yml` to a new file and editing it. 
See\n`conf/conf-necator.yml` for a real example.\n\n## Explanation of output files.\n\nIn the output directory (e.g. `output.test`, or whatever is specified\nin the config file you use), there will be a few important files.\nThe main ones are:\n\n* `gather.csv` - the list of contaminants\n* `matching-contigs.fa` - all contigs with any matches to the database\n* `matching-fragments.fa` - all fragments with any matches to the database\n\n## Resources\n\nOn a ~300 MB assembly, this took about 2 hours and required about 2\nGB of RAM, using the\n[RefSeq microbial genomes SBT](https://sourmash.readthedocs.io/en/latest/databases.html#refseq-microbial-genomes-sbt). The disk space requirement is more\nsignificant, mainly because the SBTs are in the ~10-30 GB range when unpacked.\n\n## Need help?\n\nPlease ask questions and file issues on [the sourmash GitHub issue tracker](https://github.com/dib-lab/sourmash/issues).\n\n## Credits\n\nThanks to Erich Schwarz (for stubborn pursuit of contamination in\nlong-read assemblies) and Taylor Reiter (for stubborn pursuit of\ncontamination, period) for their inspiration!\n\nA first try at this approach is detailed\n[here](http://ivory.idyll.org/blog/2018-detecting-contamination-in-long-read-assemblies.html), and the discussion that led to this particular repo is in\n[sourmash issue #940](https://github.com/dib-lab/sourmash/issues/940).\n\n----\n\n[@ctb](https://github.com/ctb/)\nApril 2020\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fctb%2F2020-long-read-assembly-decontam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fctb%2F2020-long-read-assembly-decontam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fctb%2F2020-long-read-assembly-decontam/lists"}