https://github.com/bamescience/pepgm
Graphical model for taxonomic profiling
- Host: GitHub
- URL: https://github.com/bamescience/pepgm
- Owner: BAMeScience
- License: other
- Created: 2022-04-04T09:52:55.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2024-05-14T11:22:05.000Z (about 1 year ago)
- Last Synced: 2025-03-26T10:03:18.156Z (2 months ago)
- Language: Python
- Size: 159 MB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE.md
# PepGM
A probabilistic graphical model for taxonomic inference of viral proteome samples with associated confidence scores
## About The Project
### Our preprint is out now! You can read it [here](https://www.biorxiv.org/content/10.1101/2022.09.21.508832v1).
PepGM is a probabilistic graphical model embedded into a Snakemake workflow for taxonomic inference of viral proteome samples. PepGM was
developed by the eScience group at BAM (Federal Institute for Materials Research and Testing).

The PepGM workflow includes the following steps:
0. Optional host and cRAP filtering step.
1. SearchDB cleanup: the cRAP DB is added, the host is added (if wanted), and duplicate entries are removed using [seqkit](https://bioinf.shenwei.me/seqkit/). A target-decoy DB is generated using searchCLI, followed by a peptide search using searchCLI + PeptideShaker, which produces a peptide list.
2. All descendant strains of the target taxa are queried in the NCBI protein DB through the NCBI API. Scripts: GetTargets.py, CreatePepGMGraph.py and FactorGraphGeneration.py
3. Downloaded protein records are digested and queried against the protein ID list to generate a bipartite taxon-peptide graph. Scripts: CreatePepGMGraph.py and FactorGraphGeneration.py
4. The bipartite graph is transformed into a factor graph using convolution trees and conditional probability table factors (CPD). Scripts: CreatePepGMGraph.py and FactorGraphGeneration.py
5. For different sets of CPD parameters, the belief propagation algorithm is run until convergence to obtain the posterior probabilities of the taxa. Scripts: belief_propagation.py and PepGM.py
6. Through an empirically deduced metric, the ideal parameter set is inferred. Script: GridSearchAnalysis.py
7. For this ideal parameter set, we output a results bar chart and a phylogenetic tree view showcasing the 15 best scoring taxa. Scripts: BarPlotResults, PhyloTreeView.py
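To make step 3 more concrete, here is a minimal Python sketch of an in silico digestion followed by the construction of a bipartite taxon-peptide mapping. It is only an illustration, not the code in CreatePepGMGraph.py: the trypsin cleavage rule, the peptide length limits, the function names and the toy sequences are all assumptions made for this example.

```python
from collections import defaultdict


def tryptic_digest(protein, min_len=6, max_len=30):
    """Cleave after K or R unless the next residue is P (no missed cleavages).

    The cleavage rule and the length limits are assumptions for this sketch.
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptide = protein[start:i + 1]
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
            start = i + 1
    return peptides


def build_taxon_peptide_graph(taxon_proteins, observed_peptides):
    """Map each observed peptide to the set of taxa whose proteins contain it."""
    observed = set(observed_peptides)
    edges = defaultdict(set)
    for taxon, proteins in taxon_proteins.items():
        for protein in proteins:
            for peptide in tryptic_digest(protein):
                if peptide in observed:
                    edges[peptide].add(taxon)
    return dict(edges)


# Hypothetical toy data: two strains that share one peptide.
taxon_proteins = {
    "strain_A": ["MKTAYIAKQRQISFVK"],
    "strain_B": ["MKTAYIAKQRGGSFVK"],
}
observed = ["TAYIAK", "QISFVK", "NOTFOUND"]
graph = build_taxon_peptide_graph(taxon_proteins, observed)
```

In this toy case the shared peptide connects to both strains, while the unique peptide connects to one strain only. Shared peptides are what make the taxonomic assignment ambiguous and motivate a probabilistic model.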
If you find PepGM helpful for your research, please cite:
_PepGM: A probabilistic graphical model for taxonomic inference of viral proteome samples with associated confidence scores_
Tanja Holstein, Franziska Kistner, Lennart Martens, Thilo Muth
bioRxiv 2022.09.21.508832
doi: https://doi.org/10.1101/2022.09.21.508832

PepGM uses convolution trees. The code for the convolution trees was developed and is described in: [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0091507](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0091507)

PepGM uses a version of the belief propagation algorithm with a graphical network architecture previously described in [https://pubs.acs.org/doi/10.1021/acs.jproteome.9b00566](https://pubs.acs.org/doi/10.1021/acs.jproteome.9b00566)

## Input
* Your spectrum file in .mgf format
* A reference database in fasta format (see Preparation)
* A SearchGUI .par parameters file with the database search parameters, which can be generated using SearchGUI

Additionally, you need:
* An NCBI Entrez account

## Getting Started
### Prerequisites
Make sure you have git installed and clone the repo:
```sh
git clone https://github.com/BAMeScience/PepGM.git
```
PepGM is a snakemake workflow developed with snakemake 5.10.0.
Installing Snakemake requires mamba.

To install mamba:
```sh
conda install -n base -c conda-forge mamba
```
To install Snakemake:
```sh
conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake
```
In accordance with the Snakemake recommendations, we suggest saving your sample data
in the `resources` folder. All outputs will be saved in `results`.

Additional required dependencies are Java and GCC.
PepGM is tested on Linux and uses SearchGUI-4.1.14 and PeptideShaker-2.2.9, developed
by the CompOmics group at Ghent University.

Download the necessary files at the following links:
* SearchGUI: [http://compomics.github.io/projects/searchgui](http://compomics.github.io/projects/searchgui)
* PeptideShaker: [http://compomics.github.io/projects/peptide-shaker.html](http://compomics.github.io/projects/peptide-shaker.html)

We suggest creating a new directory `bin` inside your PepGM
working directory and saving the SearchGUI and PeptideShaker binaries there:
```shell
mkdir ./bin && cd bin
wget https://genesis.ugent.be/maven2/eu/isas/searchgui/SearchGUI/4.1.23/SearchGUI-4.1.23-mac_and_linux.tar.gz
wget https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.16/PeptideShaker-2.2.16.zip
tar -xvf SearchGUI-4.1.23-mac_and_linux.tar.gz && unzip PeptideShaker-2.2.16.zip
```
You can delete the downloaded archives afterwards:
```shell
rm *.tar.gz && rm *.zip
```

## Preparation
### Downloading reference database
We recommend using the RefSeq Viral database as a generic reference database. It can be downloaded from the NCBI FTP server:
```sh
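# Download all RefSeq viral protein archives (the escaped * is expanded by wget, not by the shell),
# decompress them and concatenate them into a single FASTA file.
cd ./resources/Database
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/\*.protein.faa.gz &&
gzip -d viral.*.protein.faa.gz &&
cat viral.*.protein.faa > refSeqViral.fasta &&
rm viral.*.protein.faa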
```
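The workflow later removes duplicate entries from the search database with seqkit (step 1 above). If you want to inspect the concatenated file yourself beforehand, the following Python sketch shows the idea on a small FASTA string. It is not part of PepGM; the function names are made up for this example, and it assumes that duplicates are identified by identical sequence, which may differ from the criterion the workflow passes to seqkit.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_duplicate_sequences(records):
    """Keep the first record for every distinct sequence (assumed criterion)."""
    seen, unique = set(), []
    for header, sequence in records:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((header, sequence))
    return unique


# Hypothetical three-record example; p1 and p2 carry the same sequence.
example = ">p1 protein one\nMKTA\nYIAK\n>p2 protein two\nMKTAYIAK\n>p3\nGGSFVK\n"
records = list(read_fasta(example))
unique = drop_duplicate_sequences(records)
```

To run this on the downloaded database, read `refSeqViral.fasta` into a string and pass it to `read_fasta`.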
### Using the NCBI Entrez API
PepGM uses the NCBI Entrez API.
We strongly advise you to create an NCBI account and use your own API key: NCBI allows more requests per second with a key (10 instead of 3 at the time of writing), which speeds up the queries considerably.
Find out how to obtain your NCBI API key [here](https://support.nlm.nih.gov/knowledgebase/article/KA-05317/en-us).

### Generating a SearchGUI parameters file
As PepGM relies on SearchGUI to perform the database search, a SearchGUI parameters file,
specifying the database search parameters, has to be provided.
The easiest way to generate this file is via the GUI provided by SearchGUI.
Alternatively, the CLI instructions to set SearchGUI parameters are described
[here](http://compomics.github.io/projects/searchgui#user-defined-modifications).

## Usage
### Configuration file
PepGM needs a configuration file in `yaml` format to set up the workflow.
An example configuration file is provided in `config/config.yaml`.
Please insert your NCBI account details (mail & key) and provide the required absolute paths to
* SamplePath
* ParametersFile
* SearchGUI & PeptideShaker binaries (SearchGUIDir & PeptideShakerDir)

Do not change the config file location.
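Because the workflow expects absolute paths, it can help to check the configuration before starting a run. The sketch below is a hypothetical helper, not part of PepGM. It assumes the four entries are top-level keys spelled exactly as in the list above (please check `config/config.yaml` for the actual names) and works on a plain dictionary, such as the one you would get from loading the config with a YAML parser. The example paths are invented.

```python
import os

# Key names taken from the list above; their exact spelling and nesting
# in config/config.yaml is an assumption of this sketch.
REQUIRED_PATH_KEYS = ("SamplePath", "ParametersFile", "SearchGUIDir", "PeptideShakerDir")


def find_invalid_paths(config):
    """Return the required keys that are missing or do not hold an absolute path."""
    problems = []
    for key in REQUIRED_PATH_KEYS:
        value = config.get(key)
        if not isinstance(value, str) or not os.path.isabs(value):
            problems.append(key)
    return problems


# Hypothetical configuration with one relative path and one missing entry.
example_config = {
    "SamplePath": "/home/user/PepGM/resources/SampleData/sample.mgf",
    "ParametersFile": "resources/SampleData/search.par",
    "SearchGUIDir": "/home/user/PepGM/bin/SearchGUI-4.1.23",
}
invalid = find_invalid_paths(example_config)
```

An empty list means all four entries are present and absolute; the check does not verify that the files exist.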
Details on the configuration parameters:

**Run panel**

Set up the workflow of your PepGM run by providing parameters that fill wildcards to locate input files
such as raw spectra or reference database files. Use the basenames your files already have (i.e., without
the file suffix), or rename the files accordingly.
* Run: Name of your run, used to create a subfolder in the results directory.
* Sample: Name of your sample, used to create a subfolder in the run directory.
* Reference: Name of the reference database (e.g. human).
* Host: Trivial host name.
* Scientific host: Scientific host name. Retain (scientific) host names from public libraries such as ProteomeXchange or PRIDE (e.g. homo sapiens).
* Add host and crap database: The search database is extended by a host and cRAP database. Mutually exclusive with Filter Spectra.

**Input panel**

Specify input file and directory paths.
* Sample spectra: Path to raw spectra file.
* Parameter: Path to SearchGUI parameter file.
* Sample data: Path to directory that contains sample raw spectra files.
* Database: Path to directory that contains the reference database.
* Peptide Shaker: Path to PeptideShaker binary (.jar).
* Search GUI (folder): Path to SearchGUI binary (.jar).

The following paths are part of the recommended project structure for Snakemake workflows. Find out more about
reproducible Snakemake workflows in the Snakemake documentation.
* Resources: Relative path to resources folder.
* Results: Relative path to results folder.
* TaxID mapping: Relative path to folder that contains mapped taxIDs.

**Search panel**

Choose the search engine that SearchGUI uses and the desired FDR levels.

**PepGM panel**

* Grid search: Choose increments for alpha, beta and prior that are to be included in the grid search to tune the graphical model parameters. Do not put a comma between values.
* Results plotting: Number of taxa in the final strain identification barplot.

**Config file panel**

Provide your NCBI API mail and key.
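The grid search setting determines how many parameter combinations the belief propagation algorithm is run for. As a rough illustration, the Python sketch below enumerates all combinations of alpha, beta and prior values. It assumes the values are entered separated by whitespace, as the note about commas suggests; the function names and the example values are hypothetical and do not reflect PepGM's defaults.

```python
from itertools import product


def parse_values(text):
    """Parse a whitespace-separated string of numbers, e.g. "0.1 0.5 0.9"."""
    return [float(value) for value in text.split()]


def parameter_grid(alphas, betas, priors):
    """Return one dict per (alpha, beta, prior) combination."""
    return [
        {"alpha": a, "beta": b, "prior": p}
        for a, b, p in product(alphas, betas, priors)
    ]


# Hypothetical increments: 3 alpha x 2 beta x 2 prior values.
grid = parameter_grid(
    parse_values("0.1 0.5 0.9"),
    parse_values("0.2 0.5"),
    parse_values("0.3 0.5"),
)
```

The number of runs grows as the product of the three list lengths (12 in this example), so adding increments to the grid increases the runtime accordingly.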
### Using the graphical user interface
The graphical user interface (GUI) is developed to run Snakemake workflows without modifying
the configuration file manually in a text editor.
You can write a config file from scratch or edit an existing config file.
When modifying the config file in between runs, make sure to press the Write button before running.

### Through the command line
PepGM can also be run from the command line. To run the snakemake workflow,
you need to be in your PepGM repository and have the Snakemake conda environment activated.
Run the following command:
```sh
snakemake --use-conda --conda-frontend conda --cores <n_cores>
```
where `n_cores` is the number of cores you want Snakemake to use.

### Output files
All PepGM output files are saved into the results folder and include the following:
Main results:
- PepGM_Results.csv: table with the columns ID, score and type (all taxids are listed under 'ID' and all probabilities attributed by PepGM under 'score')
- PepGM_ResultsPlot.png: posterior probabilities of the n (default: 15) highest scoring taxa
- PhyloTreeView.png: the n (default: 15) highest scoring taxa, including their scores, visualized in a taxonomic tree

Additional (intermediate):
- Intermediate results folder sorted by their prior value for all possible grid search parameter combinations
- mapped_taxids_weights.csv: csv file of all taxids that had at least one protein map to them and their weight
- PepGM_graph.graphml: graphml file of the graphical model (without convolution tree factors). Useful to visualize the graph structure and peptide-taxon connections
- paramcheck.png: barplot of the metric used to determine the graphical model parameters for the n (default: 15) best performing parameter combinations
- log files for bug fixing

## Toy example
We have provided a toy example (Cowpox virus Brighton Red) to ease the first steps with PepGM. In `/resources`, you will find a reduced
viral reference database containing only peptides from cowpox and cowpox-related strains,
a SearchGUI parameter file and the host and cRAP peptide sequence database. The cowpox MS2
spectra can be downloaded from the PRIDE FTP archive.
Download the spectra file to `/resources/SampleData/`:
```sh
cd ./resources/SampleData
wget https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/05/PXD014913/CPXV-0,1MOI-supernatant-HEp-24h.mgf
mv CPXV-0,1MOI-supernatant-HEp-24h.mgf spectrafile_PXD014913_cowpox_minimal_example.mgf
```
Then set the reference database basename in the corresponding configuration parameter to `minRefSeqViral`. Finally,
insert your API key and mail and replace the paths for SamplePath, ParameterFile, SearchGUI and PeptideShaker with your
own locations.

## Roadmap
- [ ] Damping oscillations
- [ ] Extension to metaproteomics+Unipept

See the [open issues](https://github.com/BAMeScience/PepGM/issues) for a full list of proposed features (and known issues).
## Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
Don't forget to give the project a star! Thanks again!

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## License
Distributed under the MIT License. See `LICENSE.md` for more information.
## Contact
Tanja Holstein - [@HolsteinTanja](https://twitter.com/HolsteinTanja) - [email protected]
Franziska Kistner - [LinkedIn](https://www.linkedin.com/in/franziska-kistner-58a57b18b) - [email protected]