https://github.com/maxibor/sourcepredict
Prediction/source tracking of metagenomic samples source using machine learning
- Host: GitHub
- URL: https://github.com/maxibor/sourcepredict
- Owner: maxibor
- License: gpl-3.0
- Created: 2018-12-13T12:31:18.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-10-14T14:52:16.000Z (over 1 year ago)
- Last Synced: 2026-01-12T01:57:01.722Z (5 months ago)
- Topics: machine-learning, microbiome, source-tracking
- Language: Python
- Homepage:
- Size: 23.3 MB
- Stars: 9
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: contributing.md
- License: LICENSE
[Build status](https://travis-ci.com/maxibor/sourcepredict) [Coverage](https://coveralls.io/github/maxibor/sourcepredict?branch=master) [Conda channel](https://conda.anaconda.org/maxibor) [Documentation](https://sourcepredict.readthedocs.io/en/latest/?badge=latest) [Zenodo DOI](https://doi.org/10.5281/zenodo.3379603)
[JOSS DOI](https://doi.org/10.21105/joss.01540)
---

Sourcepredict is a Python package, distributed through Conda, that classifies and predicts the origin of metagenomic samples given a reference dataset of known origins, a problem also known as source tracking.
Sourcepredict solves this problem by using machine learning classification on dimensionally reduced datasets.
## Installation
With conda (recommended)
```bash
$ conda install -c conda-forge -c maxibor sourcepredict
```
With pip
```bash
$ pip install sourcepredict
```
## Example
### Input
- Sink taxonomic count file (see [example file](https://github.com/maxibor/sourcepredict/blob/master/data/test/dog_test_sink_sample.csv) and [documentation](https://sourcepredict.readthedocs.io/en/latest/usage.html#sink_table))
- Source taxonomic count file (see [example file](https://github.com/maxibor/sourcepredict/blob/master/data/modern_gut_microbiomes_sources.csv) and [documentation](https://sourcepredict.readthedocs.io/en/latest/usage.html#s-sources))
- Source label file (see [example file](https://github.com/maxibor/sourcepredict/blob/master/data/modern_gut_microbiomes_labels.csv) and [documentation](https://sourcepredict.readthedocs.io/en/latest/usage.html#l-labels))
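
To make the three inputs more concrete, the sketch below parses a tiny made-up source table and label table with Python's standard library. The layout is an assumption for illustration only (one row per taxon, a `TAXID` first column, one column per sample, and a label table with `sample` and `labels` columns); the linked example files and documentation remain the reference for the real format.

```python
import csv
import io

# Hypothetical miniature inputs. Real files are much larger, and the exact
# header names should be checked against the linked example files.
SOURCES_CSV = """TAXID,dog_1,human_1,soil_1
562,120,300,0
1351,40,10,2
1883,0,1,250
"""

LABELS_CSV = """sample,labels
dog_1,Canis_familiaris
human_1,Homo_sapiens
soil_1,Soil
"""


def read_counts(text):
    """Return {sample: {taxid: count}} from a taxon-by-sample count table."""
    reader = csv.DictReader(io.StringIO(text))
    samples = [name for name in reader.fieldnames if name != "TAXID"]
    counts = {name: {} for name in samples}
    for row in reader:
        for name in samples:
            counts[name][row["TAXID"]] = int(row[name])
    return counts


def read_labels(text):
    """Return {sample: label} from a two-column label table."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["sample"]: row["labels"] for row in reader}


counts = read_counts(SOURCES_CSV)
labels = read_labels(LABELS_CSV)

# Every source sample needs a label before it can serve as training data.
missing = sorted(set(counts) - set(labels))
print("samples:", sorted(counts))
print("unlabelled samples:", missing)
```

A check like the `missing` list is a cheap way to catch mismatched sample names between the source table and the label table before running the tool.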
### Usage
```bash
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/test/dog_test_sink_sample.csv -O dog_example.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_labels.csv -O sp_labels.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_sources.csv -O sp_sources.csv
$ sourcepredict -s sp_sources.csv -l sp_labels.csv dog_example.csv
Step 1: Checking for unknown proportion
== Sample: ERR1915662 ==
Adding unknown
Normalizing (GMPR)
Computing Bray-Curtis distance
Performing MDS embedding in 2 dimensions
KNN machine learning
Training KNN classifier on 2 cores...
-> Testing Accuracy: 1.0
----------------------
- Sample: ERR1915662
known:98.61%
unknown:1.39%
Step 2: Checking for source proportion
Computing weighted_unifrac distance on species rank
TSNE embedding in 2 dimensions
KNN machine learning
Performing 5 fold cross validation on 2 cores...
Trained KNN classifier with 10 neighbors
-> Testing Accuracy: 0.99
----------------------
- Sample: ERR1915662
Canis_familiaris:96.1%
Homo_sapiens:2.47%
Soil:1.43%
Sourcepredict result written to dog_test_sample.sourcepredict.csv
```
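
The log above mentions a Bray-Curtis distance and a KNN classifier. As a rough illustration of how those two pieces fit together, here is a minimal pure-Python sketch on made-up counts. It is not Sourcepredict's implementation: it skips the normalization, the unknown-proportion step, and the MDS/t-SNE embedding, and votes directly on distances. The function names and toy profiles are hypothetical.

```python
from collections import Counter


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {taxon: count} profiles."""
    taxa = set(a) | set(b)
    diff = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    total = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return diff / total if total else 0.0


def knn_proportions(sink, sources, labels, k=3):
    """Share of each label among the k source samples closest to the sink."""
    ranked = sorted(sources, key=lambda name: bray_curtis(sink, sources[name]))
    nearest = ranked[:k]
    votes = Counter(labels[name] for name in nearest)
    return {label: n / len(nearest) for label, n in votes.items()}


# Made-up source profiles (taxon -> read count) and their labels.
sources = {
    "dog_1": {"A": 90, "B": 10},
    "dog_2": {"A": 80, "B": 15, "C": 5},
    "human_1": {"A": 10, "B": 90},
    "human_2": {"A": 20, "B": 75, "C": 5},
    "soil_1": {"C": 100},
}
labels = {
    "dog_1": "Canis_familiaris",
    "dog_2": "Canis_familiaris",
    "human_1": "Homo_sapiens",
    "human_2": "Homo_sapiens",
    "soil_1": "Soil",
}
sink = {"A": 85, "B": 12, "C": 3}

proportions = knn_proportions(sink, sources, labels, k=3)
for label, share in sorted(proportions.items()):
    print(f"{label}: {share:.2%}")
```

With these toy numbers the two dog samples and one human sample are the three nearest neighbours, so the sink is reported as roughly two-thirds *Canis familiaris* and one-third *Homo sapiens*.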
### Output
Sourcepredict outputs the predicted source contribution to each sink sample, and the embedding of all samples in the lower-dimensional space. See the [documentation](https://sourcepredict.readthedocs.io/en/latest/results.html) for details.
### Runtime
Depending on the normalization method (`-n`), the embedding method (`-me`), the number of CPUs available for parallel processing (`-t`), and the data, the runtime should be between a few seconds and a few minutes per sink sample.
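
To compare runtimes across these options, a small wrapper along the following lines may help. It is only a sketch: `build_command` and `timed_run` are hypothetical helper names, and the values `GMPR` and `TSNE` are taken from the wording of the example log, so check `sourcepredict --help` for the choices the tool actually accepts.

```python
import shlex
import subprocess
import time


def build_command(sink, sources, labels,
                  normalization=None, embedding=None, threads=None):
    """Assemble a sourcepredict command line from the options named above."""
    cmd = ["sourcepredict", "-s", sources, "-l", labels]
    if normalization is not None:
        cmd += ["-n", normalization]   # normalization method
    if embedding is not None:
        cmd += ["-me", embedding]      # embedding method
    if threads is not None:
        cmd += ["-t", str(threads)]    # CPUs for parallel processing
    cmd.append(sink)
    return cmd


def timed_run(cmd):
    """Run the command and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


cmd = build_command("dog_example.csv", "sp_sources.csv", "sp_labels.csv",
                    normalization="GMPR", embedding="TSNE", threads=2)
print(shlex.join(cmd))

# Uncomment once sourcepredict and the example files are available locally:
# print(f"{timed_run(cmd):.1f} s")
```

Leaving an option as `None` omits the flag entirely, so the tool falls back to its own default for that setting.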
## Documentation
The Sourcepredict documentation is available at [sourcepredict.readthedocs.io](https://sourcepredict.readthedocs.io/en/latest/).
## Sourcepredict example files
- The sources were obtained with a simple [Nextflow pipeline](https://github.com/maxibor/kraken-nf), running Kraken2 with the [*MiniKraken2_v2_8GB*](https://ccb.jhu.edu/software/kraken2/dl/minikraken2_v2_8GB.tgz) database.
See the [documentation](https://sourcepredict.readthedocs.io/en/latest/custom_sources.html) for more information on how to build a custom source file.
- The example source file is [modern_gut_microbiomes_sources.csv](https://github.com/maxibor/sourcepredict/raw/master/data/modern_gut_microbiomes_sources.csv)
- The example label file is [modern_gut_microbiomes_labels.csv](https://github.com/maxibor/sourcepredict/raw/master/data/modern_gut_microbiomes_labels.csv)
### Environments included in the example source file
- *Homo sapiens* gut microbiome ([1](https://doi.org/10.1038/nature11234), [2](https://doi.org/10.1093/gigascience/giz004), [3](https://doi.org/10.1038/s41564-019-0409-6), [4](https://doi.org/10.1016/j.cell.2019.01.001), [5](https://doi.org/10.1038/ncomms7505), [6](http://doi.org/10.1016/j.cub.2015.04.055))
- *Canis familiaris* gut microbiome ([1](https://doi.org/10.1186/s40168-018-0450-3))
- Soil microbiome ([1](https://doi.org/10.1073/pnas.1215210110), [2](https://www.ncbi.nlm.nih.gov/bioproject/?term=322597), [3](https://dx.doi.org/10.1128%2FAEM.01646-17))
## Contributing Code, Documentation, or Feedback
Contributions to Sourcepredict are welcome and encouraged: open an issue or create a pull request. All contributions will be made under the GPLv3 license. More information can be found on the [contributing page](https://github.com/maxibor/sourcepredict/blob/master/contributing.md).
## How to cite
Sourcepredict has been published in [JOSS](https://joss.theoj.org/papers/10.21105/joss.01540).
```
@article{Borry2019Sourcepredict,
journal = {Journal of Open Source Software},
doi = {10.21105/joss.01540},
issn = {2475-9066},
number = {41},
publisher = {The Open Journal},
title = {Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification},
url = {http://dx.doi.org/10.21105/joss.01540},
volume = {4},
author = {Borry, Maxime},
pages = {1540},
date = {2019-09-04},
year = {2019},
month = {9},
day = {4}
}
```