https://github.com/luispedro/strobefilter
Benchmarking pre-filtering large databases before short-read mapping
- Host: GitHub
- URL: https://github.com/luispedro/strobefilter
- Owner: luispedro
- License: mit
- Created: 2024-02-22T00:15:10.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-09T03:27:41.000Z (about 1 year ago)
- Last Synced: 2025-01-19T06:43:35.185Z (9 months ago)
- Topics: bioinformatics, metagenomics, strobemers
- Language: Python
- Homepage:
- Size: 90.8 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.MIT
README
# Prefilter for mapping
**Problem:** Mapping metagenomes to large databases, such as the
[GMGCv1](https://gmgc.embl.de), takes too much memory. Partitioning the dataset
is a common solution (and [supported by
NGLess](https://ngless.embl.de/Mapping.html#low-memory-mode)), but it has
drawbacks (it's slow).

This repository explores the possibility of prefiltering the database by
removing sequences that are extremely unlikely to be matches.

## Approach
1. Parse all the reads and collect all randstrobes (or rather their hashes)
2. Parse the database and select only unigenes that are expected to be present in the reads
3. Map as usual to the pre-filtered database

For `2`, different strategies are possible. The simplest is to keep any unigene
that shares any hash with the set of hashes from the reads. Currently being
considered:

- `min1`: keep all references that match at least one hash
- `min2`: keep all references that match at least two hashes

We also tested counting the exact value or using a hacky Bloom filter structure
that uses a single fixed-size array, but the hacky version gave bad estimates.

### Requirements
- Python, including NumPy and Pandas
- [Jug](https://jug.rtfd.io/)
- [NGLess](https://ngless.embl.de/)
- [Strobealign](https://github.com/ksahlin/strobealign) ([Sahlin, 2022](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02831-7)), including the Python bindings
- [tabulate](https://pypi.org/project/tabulate/) is used to print the final table

To install most dependencies (assuming you have conda-forge & bioconda set up):
```
conda install python=3.11 numpy pandas requests tabulate jug ngless
```

To install `strobealign`'s Python bindings (which will **not** be installed by default with `conda`):
```
# To ensure you have a recent C++ compiler (not always needed)
conda install gxx_linux-64 gcc_linux-64
export CC CXX

git clone https://github.com/ksahlin/strobealign
cd strobealign
pip install .
```

### Data
1. _Database_: GMGCv1 (from [Coelho et al., 2022](https://www.nature.com/articles/s41586-021-04233-4)). This can be downloaded by `jugfile.py`.
2. _Metagenomes_: Dog dataset (from [Coelho et al., 2018](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0450-3)) and human gut dataset (from [Zeller et al., 2014](https://doi.org/10.15252/msb.20145645)). These can be downloaded with [ena-mirror](https://github.com/BigDataBiology/ena-mirror). More guidance will be provided on how to do it soon, but [get in touch](https://github.com/luispedro/strobefilter/issues) if you have questions.

Note that running this benchmark will use a lot of disk storage!
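
To get a feel for the approach without downloading any of this data, here is a toy sketch of steps 1 and 2 from the Approach section. It is not the code in this repository: plain k-mers hashed with `zlib.crc32` stand in for the randstrobe hashes, and the sequences, the value of `K`, and the function names are all made up for illustration.

```python
import zlib

K = 8  # toy k-mer length; an assumption for this sketch only


def kmer_hashes(seq, k=K):
    """Yield a hash for every k-mer in seq (a stand-in for randstrobe hashes)."""
    for i in range(len(seq) - k + 1):
        yield zlib.crc32(seq[i:i + k].encode("ascii"))


def collect_read_hashes(reads):
    """Step 1: collect the hashes seen in any read."""
    seen = set()
    for read in reads:
        seen.update(kmer_hashes(read))
    return seen


def prefilter(references, read_hashes, min_hits=1):
    """Step 2: keep references sharing at least min_hits distinct hashes with the reads."""
    kept = {}
    for name, seq in references.items():
        hits = len(set(kmer_hashes(seq)) & read_hashes)
        if hits >= min_hits:
            kept[name] = seq
    return kept


# Made-up sequences, not taken from GMGCv1 or any real sample
references = {
    "geneA": "ACGTTGCAAGGCTTAACCGGTATCGA",
    "geneB": "TTTTTTTTTTTTTTTTTTTTTTTTTT",
    "geneC": "GGCATCGATCGGATTACAGGCTAGCT",
}
reads = ["TGCAAGGCTTAACC", "GGATTACA"]

read_hashes = collect_read_hashes(reads)
min1 = prefilter(references, read_hashes, min_hits=1)  # the `min1` strategy
min2 = prefilter(references, read_hashes, min_hits=2)  # the `min2` strategy

print(sorted(min1))  # prints ['geneA', 'geneC']
print(sorted(min2))  # prints ['geneA']
```

In this toy input, `geneC` shares only a single k-mer with the reads, so it survives `min1` but is dropped by `min2`. A stricter threshold can only shrink the filtered database, at the cost of possibly discarding references that few reads would hit.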
### Author
- [Luis Pedro Coelho](https://luispedro.org) (Queensland University of Technology). [luis@luispedro.org](mailto:luis@luispedro.org)