https://github.com/luispedro/strobefilter

Benchmarking pre-filtering large databases before short-read mapping

bioinformatics metagenomics strobemers


Host: GitHub
URL: https://github.com/luispedro/strobefilter
Owner: luispedro
License: mit
Created: 2024-02-22T00:15:10.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-08-09T03:27:41.000Z (about 1 year ago)
Last Synced: 2025-01-19T06:43:35.185Z (9 months ago)
Topics: bioinformatics, metagenomics, strobemers
Language: Python
Homepage:
Size: 90.8 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.MIT


# Prefilter for mapping

**Problem:** Mapping metagenomes to large databases, such as the
[GMGCv1](https://gmgc.embl.de), takes too much memory. Partitioning the dataset
is a common solution (and is [supported by
NGLess](https://ngless.embl.de/Mapping.html#low-memory-mode)), but it has
drawbacks (it's slow).

This repository explores the possibility of prefiltering the database by
removing sequences that are extremely unlikely to be matches.

## Approach

1. Parse all the reads and collect all randstrobes (or rather their hashes)

2. Parse the database and select only unigenes that are expected to be present in the reads

3. Map as usual to the pre-filtered database

For step `2`, different strategies are possible. The simplest is to keep any
unigene that shares at least one hash with the set of hashes from the reads.
The strategies currently being considered are:

- `min1`: keep all references that match at least one hash

- `min2`: keep all references that match at least two hashes
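The approach and the two strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: plain k-mer hashes stand in for randstrobe hashes, and the names `seq_hashes` and `select_references` are hypothetical.

```python
import zlib


def seq_hashes(seq, k=5):
    """Return the set of hashes of all k-mers in seq.

    Assumption: k-mers are a stand-in for randstrobes, for illustration only.
    """
    seq = seq.upper()
    return {zlib.crc32(seq[i:i + k].encode()) for i in range(len(seq) - k + 1)}


def select_references(references, read_hashes, min_hits=1):
    """Keep references sharing at least min_hits distinct hashes with the reads.

    min_hits=1 corresponds to `min1`, min_hits=2 to `min2`.
    """
    kept = []
    for name, seq in references.items():
        shared = len(seq_hashes(seq) & read_hashes)
        if shared >= min_hits:
            kept.append(name)
    return kept


# Step 1: collect the hashes from all reads
reads = ["ACGTACGTAC", "TTGACCAGTA"]
read_hashes = set()
for read in reads:
    read_hashes |= seq_hashes(read)

# Step 2: select the references to keep
references = {
    "gene_a": "GGACGTACGTACGG",   # shares several k-mers with the first read
    "gene_b": "CCCCCTTGACTCCCC",  # shares a single k-mer (TTGAC)
    "gene_c": "AAAAAAAAAAAAAA",   # shares nothing with the reads
}
kept_min1 = select_references(references, read_hashes, min_hits=1)
kept_min2 = select_references(references, read_hashes, min_hits=2)
```

In this toy example, `min1` keeps `gene_a` and `gene_b`, while `min2` keeps only `gene_a`. Step 3 (mapping) would then run against the reduced database.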

We also compared exact counting against a hacky Bloom-filter-like structure
that uses a single fixed-size array, but the hacky version gave bad estimates.
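To illustrate how a single fixed-size array can produce poor estimates, here is a minimal sketch of that kind of structure. It assumes one slot per hash position and a single position per item; the structure actually tested in this repository may differ, and `FixedArrayFilter` is a hypothetical name. The array is deliberately tiny so that it saturates.

```python
class FixedArrayFilter:
    """Approximate set membership backed by one fixed-size array."""

    def __init__(self, size):
        self.size = size
        self.slots = bytearray(size)  # all slots start at 0

    def add(self, h):
        self.slots[h % self.size] = 1

    def __contains__(self, h):
        # True for any hash that lands in an occupied slot,
        # whether or not that exact hash was ever added
        return self.slots[h % self.size] == 1


# Toy integer "hashes": the reference hashes are disjoint from the read hashes
read_hashes = set(range(0, 1000, 7))
ref_hashes = list(range(1000, 2000))

filt = FixedArrayFilter(size=64)
for h in read_hashes:
    filt.add(h)

exact = sum(h in read_hashes for h in ref_hashes)
approx = sum(h in filt for h in ref_hashes)
```

Here `exact` is 0, but every slot of the small array is occupied, so `approx` counts all 1000 reference hashes as matches. A larger array reduces the overestimate but does not eliminate it.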

### Requirements

- Python, including NumPy and Pandas

- [Jug](https://jug.rtfd.io/)

- [NGLess](https://ngless.embl.de/)

- [Strobealign](https://github.com/ksahlin/strobealign) ([Sahlin, 2022](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02831-7)), including the Python bindings

- [tabulate](https://pypi.org/project/tabulate/) is used to print the final table

To install most dependencies (assuming you have conda-forge & bioconda set up):

```
conda install python=3.11 numpy pandas requests tabulate jug ngless
```

To install `strobealign`'s Python bindings (which will **not** be installed by default with `conda`):

```
# To ensure you have a recent C++ compiler (not always needed)
conda install gxx_linux-64 gcc_linux-64
export CC CXX

git clone https://github.com/ksahlin/strobealign
cd strobealign
pip install .
```

### Data

1. _Database_: GMGCv1 (from [Coelho et al., 2022](https://www.nature.com/articles/s41586-021-04233-4)). This is downloaded by `jugfile.py`.

2. _Metagenomes_: Dog dataset (from [Coelho et al., 2018](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0450-3)) and human gut dataset (from [Zeller et al., 2014](https://doi.org/10.15252/msb.20145645)). These can be downloaded with [ena-mirror](https://github.com/BigDataBiology/ena-mirror). More guidance on how to do this will be provided soon, but [get in touch](https://github.com/luispedro/strobefilter/issues) if you have questions.

Note that running this benchmark will use a lot of disk storage!

### Author

- [Luis Pedro Coelho](https://luispedro.org) (Queensland University of Technology). [luis@luispedro.org](mailto:luis@luispedro.org)
