https://github.com/ctb/2020-long-read-assembly-decontam
Try 2 of detecting/removing microbial contamination from long-read assemblies.
- Host: GitHub
- URL: https://github.com/ctb/2020-long-read-assembly-decontam
- Owner: ctb
- License: other
- Created: 2020-04-12T14:36:57.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2020-04-18T16:51:42.000Z (about 6 years ago)
- Last Synced: 2023-10-26T10:04:43.907Z (over 2 years ago)
- Language: Python
- Homepage:
- Size: 4.52 MB
- Stars: 3
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# 2020-long-read-assembly-decontam
Find and extract components of long-read assemblies that match to a
database, for the purposes of decontamination.
**Still early in development.** Buyer beware! Here be dragons!!
## Installing!
Clone this repository and change into the top-level repo directory.
The file `environment.yml` contains the necessary conda packages
(python and snakemake) to run the workflow; see the Quickstart section
for explicit instructions.
### Quickstart:
Clone the repository, change into it, create the environment, and activate it:
```
git clone https://github.com/ctb/2020-long-read-assembly-decontam
cd ./2020-long-read-assembly-decontam/
conda env create -f environment.yml -n lra-decontam
conda activate lra-decontam
```
## Running!
To run, execute (in the top-level directory):
```
snakemake --use-conda -p -j 1
```
This should succeed :).
Once that works, you can configure it yourself by copying
`test-data/conf-test.yml` to a new file and editing it. See
`conf/conf-necator.yml` for a real example.
## Explanation of output files.
In the output directory (e.g. `output.test`, or whatever is specified
in the config file you use), you will find a few important files.
The main ones are:
* `gather.csv` - the list of contaminants
* `matching-contigs.fa` - all contigs with any matches to the database
* `matching-fragments.fa` - all fragments with any matches to the database
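
If you want a quick overview of a finished run, the three files above can be summarized with a few lines of Python. This is only a sketch, not part of the workflow: the function names `read_gather`, `fasta_lengths`, and `summarize` are made up for illustration, and it assumes that `gather.csv` is a plain CSV with a header row and that the two `.fa` files are ordinary FASTA. It deliberately does not rely on any particular column names in `gather.csv`.

```python
import csv
from pathlib import Path


def read_gather(csv_path):
    """Return the rows of gather.csv as a list of dicts (one per match).

    Assumes a header row; no specific column names are required.
    """
    with open(csv_path, newline="") as fp:
        return list(csv.DictReader(fp))


def fasta_lengths(fasta_path):
    """Return {record_name: sequence_length} for a FASTA file."""
    lengths = {}
    name = None
    with open(fasta_path) as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token as the record name.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def summarize(outdir):
    """Count matches, contigs, and fragments in an output directory."""
    outdir = Path(outdir)
    matches = read_gather(outdir / "gather.csv")
    contigs = fasta_lengths(outdir / "matching-contigs.fa")
    fragments = fasta_lengths(outdir / "matching-fragments.fa")
    return {
        "n_matches": len(matches),
        "n_contigs": len(contigs),
        "contig_bp": sum(contigs.values()),
        "n_fragments": len(fragments),
        "fragment_bp": sum(fragments.values()),
    }
```

For example, `summarize("output.test")` would report how many database matches were found and how many base pairs of the assembly sit in matching contigs and fragments.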
## Resources
On a ~300 MB assembly, a run took about 2 hours and required about 2
GB of RAM, using the
[RefSeq microbial genomes SBT](https://sourmash.readthedocs.io/en/latest/databases.html#refseq-microbial-genomes-sbt). The disk space requirement is more
significant, mainly because the SBTs are in the ~10-30 GB range when unpacked.
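
Since disk space is the main constraint, it can be worth checking free space before unpacking a database. The snippet below is a minimal sketch using only the Python standard library; `has_free_space` is a hypothetical helper, not something this repo provides, and the 30 GiB threshold in the usage note is just the upper end of the range quoted above (in GiB rather than GB, so slightly conservative).

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def has_free_space(path, required_gib):
    """Return True if the filesystem holding `path` has at least
    `required_gib` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * GIB
```

For example, `has_free_space(".", 30)` checks whether the current directory's filesystem could hold the largest unpacked SBT mentioned here.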
## Need help?
Please ask questions and file issues on [the sourmash GitHub issue tracker](https://github.com/dib-lab/sourmash/issues).
## Credits
Thanks to Erich Schwarz (for stubborn pursuit of contamination in
long-read assemblies) and Taylor Reiter (for stubborn pursuit of
contamination, period) for their inspiration!
A first try at this approach is detailed
[here](http://ivory.idyll.org/blog/2018-detecting-contamination-in-long-read-assemblies.html), and the discussion that led to this particular repo is in
[sourmash issue #940](https://github.com/dib-lab/sourmash/issues/940).
----
[@ctb](https://github.com/ctb/)
April 2020