https://github.com/mpusp/snakemake-ms-proteomics

Pipeline for automatic processing and quality control of mass spectrometry data
Host: GitHub
URL: https://github.com/mpusp/snakemake-ms-proteomics
Owner: MPUSP
License: mit
Created: 2022-12-14T13:39:47.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2025-01-30T14:06:23.000Z (over 1 year ago)
Last Synced: 2025-01-30T14:32:28.260Z (over 1 year ago)
Topics: bioinformatics, conda, mass-spectrometry, pipeline, proteomics, snakemake
Language: Python
Homepage:
Size: 1.38 MB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# snakemake-ms-proteomics

[![Snakemake](https://img.shields.io/badge/snakemake-≥8.0.0-brightgreen.svg)](https://snakemake.github.io)

[![GitHub actions](https://github.com/MPUSP/snakemake-ms-proteomics/actions/workflows/main.yml/badge.svg)](https://github.com/MPUSP/snakemake-ms-proteomics/actions/workflows/main.yml)

![GitHub issues](https://img.shields.io/github/issues/MPUSP/snakemake-ms-proteomics)

![GitHub last commit](https://img.shields.io/github/last-commit/MPUSP/snakemake-ms-proteomics)

[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)

[![workflow catalog](https://img.shields.io/badge/Snakemake%20workflow%20catalog-darkgreen)](https://snakemake.github.io/snakemake-workflow-catalog)

---

A Snakemake workflow for automatic processing and quality control of protein mass spectrometry data.

- [snakemake-ms-proteomics](#snakemake-ms-proteomics)

  - [Workflow overview](#workflow-overview)

  - [Installation](#installation)

    - [Snakemake](#snakemake)

    - [Fragpipe](#fragpipe)

  - [Running the workflow](#running-the-workflow)

    - [Input data](#input-data)

    - [Execution](#execution)

    - [Parameters](#parameters)

    - [Missing value imputation](#missing-value-imputation)

  - [Output](#output)

  - [Authors](#authors)

  - [License](#license)

  - [References](#references)

## Workflow overview



---

This workflow is a best-practice workflow for the automated analysis of mass spectrometry proteomics data. It currently supports data-dependent acquisition (DDA) data with label-free quantification. Support for additional workflows (DIA, isotope labeling) is planned. The workflow is mainly a wrapper for the excellent tools [fragpipe](https://fragpipe.nesvilab.org/) and [MSstats](https://www.bioconductor.org/packages/release/bioc/html/MSstats.html), with additional modules that supply and check the required input files and generate reports. The workflow is built using [snakemake](https://snakemake.readthedocs.io/en/stable/) and processes MS data using the following steps:

1. Prepare `workflow` file (`python` script)

2. Check user-supplied sample sheet (`python` script)

3. Fetch protein database from NCBI or use user-supplied fasta file (`python`, [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/))

4. Generate decoy proteins ([DecoyPyrat](https://github.com/wtsi-proteomics/DecoyPYrat))

5. Import raw files, search protein database ([fragpipe](https://fragpipe.nesvilab.org/))

6. Align feature maps using IonQuant ([fragpipe](https://fragpipe.nesvilab.org/))

7. Import quantified features, infer and quantify proteins ([R MSstats](https://www.bioconductor.org/packages/release/bioc/html/MSstats.html))

8. Compare different biological conditions, export results ([R MSstats](https://www.bioconductor.org/packages/release/bioc/html/MSstats.html))

9. Generate HTML report with embedded QC plots ([R markdown](https://rmarkdown.rstudio.com/))

10. Generate PDF report from HTML ([weasyprint](https://weasyprint.org/))

11. Send out report by email (`python` script)

12. Clean up temporary files after workflow execution (`bash` script)

If you want to contribute, report issues, or suggest features, please get in touch on [github](https://github.com/MPUSP/snakemake-ms-proteomics).

## Installation

### Snakemake

Step 1: Install snakemake with `conda`, `mamba`, `micromamba` (or any other `conda` flavor). This step creates a new conda environment called `snakemake-ms-proteomics`, which will be used for all further installations.

```bash

conda create -c conda-forge -c bioconda -n snakemake-ms-proteomics snakemake

```

Step 2: Activate conda environment with snakemake

```bash

source /path/to/conda/bin/activate

conda activate snakemake-ms-proteomics

```

Alternatively, install `snakemake` using pip:

```bash

pip install snakemake

```

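Or install `snakemake` globally from the Linux package archives (check that the packaged version satisfies the ≥8.0.0 requirement shown in the badge above):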

```bash

sudo apt install snakemake

```

### Fragpipe

Fragpipe is not available on `conda` or other package archives. To keep the workflow user-friendly, the latest [fragpipe release from github](https://github.com/Nesvilab/FragPipe/releases) (currently v22.0) is automatically installed into the respective `conda` environment when the workflow is used for the first time. After installation, the GUI (graphical user interface) will pop up and ask you to finish the installation by **downloading the missing modules MSFragger, IonQuant, and Philosopher**. This step is necessary to abide by license restrictions. From then on, fragpipe will run in `headless` mode through the command line only.

All other dependencies for the workflow are **automatically pulled as `conda` environments** by snakemake.

## Running the workflow

### Input data

The workflow requires the following input files:

1. mass spectrometry data, such as Thermo `*.raw` or `*.mzML` files

2. an (organism) database in `*.fasta` format _OR_ an NCBI RefSeq ID. Decoys (`rev_` prefix) will be added if necessary

3. a sample sheet in tab-separated format (aka `manifest` file)

4. a `workflow` file for fragpipe (see `resources` dir)
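The note about decoys in item 2 can be illustrated with a minimal sketch. It assumes the simplest possible strategy, plain sequence reversal with a `rev_` prefix; the workflow itself delegates this step to DecoyPyrat, which is more elaborate, so the function names `read_fasta` and `add_reversed_decoys` are hypothetical and not part of the workflow.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def add_reversed_decoys(records, prefix="rev_"):
    """Return the targets plus one reversed decoy per target.

    If any header already starts with the prefix, the database is assumed
    to contain decoys and is returned unchanged ("added if necessary").
    """
    if any(header.startswith(prefix) for header, _ in records):
        return list(records)
    decoys = [(prefix + header, seq[::-1]) for header, seq in records]
    return list(records) + decoys


# Hypothetical two-protein database, for illustration only.
example = ">P1 test protein\nMKTAYIAK\nQR\n>P2 other\nMSLLK\n"
for header, seq in add_reversed_decoys(read_fasta(example)):
    print(f">{header}\n{seq}")
```

Plain reversal is only a stand-in here; it shows where the `rev_` prefix ends up, not how DecoyPyrat builds its decoy sequences.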

The samplesheet file has the following structure with four mandatory columns and no header (example file: `test/input/samplesheet/samplesheet.tsv`).

- `sample`: names/paths to raw files

- `condition`: experimental group, treatments

- `replicate`: replicate number, consecutively numbered. Repeating numbers (e.g. 1,2,1,2) will be treated as paired samples!

- `type`: the type of MS data, will be used to determine the workflow

- `control`: reference condition for testing differential abundance

| sample   | condition   | replicate | type | control     |
| -------- | ----------- | --------- | ---- | ----------- |
| sample_1 | condition_1 | 1         | DDA  | condition_1 |
| sample_2 | condition_1 | 2         | DDA  | condition_1 |
| sample_3 | condition_2 | 3         | DDA  | condition_1 |
| sample_4 | condition_2 | 4         | DDA  | condition_1 |
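The kind of checks a samplesheet has to pass can be sketched as follows. This is not the workflow's own checking script; it is a minimal illustration that assumes a tab-separated file with a header row naming the five columns shown above. The functions `check_samplesheet` and `is_paired` are hypothetical names.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "condition", "replicate", "type", "control"]


def check_samplesheet(text):
    """Return a list of problems found in a tab-separated samplesheet."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    rows = list(reader)
    conditions = {row["condition"] for row in rows}
    problems = []
    for number, row in enumerate(rows, start=1):
        # Replicates must be whole numbers.
        if not (row["replicate"] or "").isdigit():
            problems.append(f"row {number}: replicate is not an integer")
        # The reference condition must be one of the listed conditions.
        if row["control"] not in conditions:
            problems.append(f"row {number}: control is not a known condition")
    return problems


def is_paired(text):
    """True if a replicate number repeats (e.g. 1,2,1,2), i.e. paired samples."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    replicates = [row["replicate"] for row in reader]
    return len(set(replicates)) < len(replicates)


header = "sample\tcondition\treplicate\ttype\tcontrol\n"
sheet = header + "".join(
    f"sample_{i}\tcondition_{1 if i < 3 else 2}\t{i}\tDDA\tcondition_1\n"
    for i in range(1, 5)
)
print(check_samplesheet(sheet), is_paired(sheet))
```

With consecutive replicate numbers as in the table, `is_paired` returns `False`; renumbering the replicates as 1, 2, 1, 2 would flip it to `True`.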

### Execution

To run the workflow from the command line, change the working directory to the cloned repository.

```bash

cd /path/to/snakemake-ms-proteomics

```

Adjust options in the default config file `config/config.yml`.

Before running the entire workflow, you can perform a dry run using:

```bash

snakemake --dry-run

```

To run the complete workflow with test files using **`conda`**, execute the following command. Specifying the number of compute cores is mandatory.

```bash

snakemake --cores 10 --sdm conda --directory .test

```

To supply options that override the defaults, run the workflow like this:

```bash
snakemake --cores 10 --sdm conda --directory .test \
  --configfile 'config/config.yml' \
  --config \
  samplesheet='my/sample_sheet.tsv'
```
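When the workflow is launched from a script rather than a terminal, the same command line can be assembled programmatically. The sketch below uses only the flags that appear in the commands above; the helper `build_snakemake_command` is a hypothetical convenience function, not part of the workflow.

```python
import shlex


def build_snakemake_command(cores, directory=None, configfile=None,
                            overrides=None, dry_run=False):
    """Assemble the snakemake call as an argument list."""
    command = ["snakemake", "--cores", str(cores), "--sdm", "conda"]
    if dry_run:
        command.append("--dry-run")
    if directory:
        command += ["--directory", directory]
    if configfile:
        command += ["--configfile", configfile]
    if overrides:
        # key=value pairs after --config override values from the config file
        command.append("--config")
        command += [f"{key}={value}" for key, value in overrides.items()]
    return command


command = build_snakemake_command(
    10,
    directory=".test",
    configfile="config/config.yml",
    overrides={"samplesheet": "my/sample_sheet.tsv"},
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`; passing a list instead of a single string avoids shell quoting problems with paths that contain spaces.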

### Parameters

This table lists all **global parameters** to the workflow.

| parameter   | type                   | details             | example                                                 |
| ----------- | ---------------------- | ------------------- | ------------------------------------------------------- |
| samplesheet | `*.tsv`                | tab-separated file  | `test/input/config/samplesheet.tsv`                     |
| database    | `*.fasta` OR refseq ID | plain text          | `test/input/database/database.fasta`, `GCF_000009045.1` |
| workflow    | `*.workflow` OR string | a fragpipe workflow | `workflows/LFQ-MBR.workflow`, `from_samplesheet`        |
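Because the `database` parameter accepts either a file path or an accession, a script has to tell the two apart. The following sketch shows one way to do this, assuming RefSeq assembly accessions of the form `GCF_` plus nine digits and a version suffix, as in the example `GCF_000009045.1`. The function `classify_database` is hypothetical and may differ from what the workflow does internally.

```python
import re

# RefSeq assembly accession: "GCF_", nine digits, a dot, and a version number.
REFSEQ_ACCESSION = re.compile(r"^GCF_\d{9}\.\d+$")


def classify_database(value):
    """Return 'refseq_id' or 'fasta_file' for a database parameter value."""
    if REFSEQ_ACCESSION.match(value):
        return "refseq_id"
    if value.lower().endswith(".fasta"):
        return "fasta_file"
    raise ValueError(f"neither a RefSeq ID nor a *.fasta file: {value!r}")


for value in ("GCF_000009045.1", "test/input/database/database.fasta"):
    print(value, "->", classify_database(value))
```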

This table lists all **module-specific parameters** and their default values, as included in the `config.yml` file.

| module     | parameter        | default                            | details                                                          |
| ---------- | ---------------- | ---------------------------------- | ---------------------------------------------------------------- |
| decoypyrat | `cleavage_sites` | `KR`                               | amino acid residues used for decoy peptide generation            |
|            | `decoy_prefix`   | `rev`                              | decoy prefix added to protein names                              |
| fragpipe   | `target_dir`     | `share`                            | default path in conda env to store fragpipe                      |
|            | `executable`     | `fragpipe/bin/fragpipe`            | path to fragpipe executable                                      |
|            | `download`       | FragPipe-22.0 (see config)         | download link to the Fragpipe GitHub repo                        |
| msstats    | `logTrans`       | `2`                                | base for log fold change transformation                          |
|            | `normalization`  | `equalizeMedians`                  | normalization strategy for feature intensity, see MSstats manual |
|            | `featureSubset`  | `all`                              | which features to use for quantification                         |
|            | `summaryMethod`  | `TMP`                              | how to calculate protein from feature intensity                  |
|            | `MBimpute`       | `True`                             | impute missing values with an accelerated failure time model     |
| report     | `html`           | `True`                             | generate HTML report                                             |
|            | `pdf`            | `True`                             | generate PDF report                                              |
| email      | `send`           | `False`                            | whether reports should be sent out by email                      |
|            | `port`           | `0`                                | default port for email server                                    |
|            | `smtp_server`    | `smtp.example.com`                 | smtp server address                                              |
|            | `smtp_user`      | `user`                             | smtp server user name                                            |
|            | `smtp_pw`        | `password`                         | smtp server user password                                        |
|            | `from`           | `sender@email.com`                 | sender's email address                                           |
|            | `to`             | `["receiver@email.com"]`           | receiver's email address(es), a list                             |
|            | `subject`        | `"Results MS proteomics workflow"` | subject line for email                                           |
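To show how the `email` parameters fit together, here is a minimal sketch that turns them into a message using Python's standard `email` package. It is an illustration under assumptions, not the workflow's own email script: the dictionary mirrors the defaults listed above, and the function `build_report_message` and the placeholder attachment are hypothetical.

```python
from email.message import EmailMessage

# Mirrors the default `email` settings from the table above.
email_config = {
    "send": False,
    "port": 0,
    "smtp_server": "smtp.example.com",
    "smtp_user": "user",
    "smtp_pw": "password",
    "from": "sender@email.com",
    "to": ["receiver@email.com"],
    "subject": "Results MS proteomics workflow",
}


def build_report_message(config, body, attachment=None):
    """Create an email message; `attachment` is an optional (filename, bytes) pair."""
    message = EmailMessage()
    message["From"] = config["from"]
    message["To"] = ", ".join(config["to"])  # `to` is a list of addresses
    message["Subject"] = config["subject"]
    message.set_content(body)
    if attachment is not None:
        filename, data = attachment
        message.add_attachment(
            data, maintype="application", subtype="pdf", filename=filename
        )
    return message


message = build_report_message(
    email_config,
    "The report is attached.",
    attachment=("report.pdf", b"%PDF-1.4 placeholder"),
)
print(message["Subject"], "->", message["To"])
```

Sending is deliberately left out. With `send` enabled, the message could be handed to `smtplib.SMTP(...).send_message(message)` after logging in with `smtp_user` and `smtp_pw`; whether the server expects TLS or SSL depends on the provider.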

### Missing value imputation

- missing value imputation happens at different stages
- first, the default strategy for `fragpipe` is "match between runs": features without identification in the MS1 spectra are cross-compared with other runs of the same experiment where an MS2 identification is available
  - this reduces the number of missing feature quantifications
  - this strategy is based on actual quantification data
- second, `MSstats` handles two kinds of missing values where no feature quantification is available at all
  - missing values at random: removed during summarization
  - missing values due to low abundance: imputed at the feature level via an accelerated failure time model
- missing value treatment can be controlled through the `MSstats` parameter `MBimpute` and others
- see the `MSstats` manual for more information
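The idea behind "match between runs" can be sketched with a toy example. It is a strong simplification and not how IonQuant works internally (IonQuant additionally controls the false discovery rate of transferred identifications). The tolerances `mz_tol` and `rt_tol`, the feature dictionaries, and the function `match_between_runs` are all assumptions made for illustration.

```python
def match_between_runs(identified, unidentified, mz_tol=0.01, rt_tol=0.5):
    """Transfer peptide labels to MS1-only features of another run.

    identified:   dicts with "peptide", "mz", "rt" (run with MS2 identifications)
    unidentified: dicts with "mz", "rt", "intensity" (features without MS2 ID)
    Returns a dict mapping peptide -> intensity for unambiguous matches.
    """
    transferred = {}
    for feature in unidentified:
        candidates = [
            ref for ref in identified
            if abs(ref["mz"] - feature["mz"]) <= mz_tol
            and abs(ref["rt"] - feature["rt"]) <= rt_tol
        ]
        # Only accept a match if exactly one identified feature fits.
        if len(candidates) == 1:
            transferred[candidates[0]["peptide"]] = feature["intensity"]
    return transferred


# Hypothetical features: run A has MS2 identifications, run B does not.
run_a = [
    {"peptide": "PEPTIDEK", "mz": 500.25, "rt": 20.0},
    {"peptide": "SAMPLER", "mz": 612.30, "rt": 35.0},
]
run_b = [
    {"mz": 500.255, "rt": 20.2, "intensity": 1.5e6},
    {"mz": 700.00, "rt": 10.0, "intensity": 2.0e5},
]
print(match_between_runs(run_a, run_b))
```

In this toy data only the first feature of run B lies within both tolerances of an identified feature, so one quantification is recovered and the other feature stays unassigned.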

## Output

The workflow generates the following output from its modules:

**samplesheet**

- `samplesheet.tsv`: Samplesheet after checking file paths and options
- `log.txt`: Log file for this module

**workflow**

- `workflow.txt`: Configuration file for `fragpipe`, determined from samplesheet.
- `log.txt`: Log file for this module

**database**

- `database.fasta`: The downloaded or user-supplied `.fasta` file. In the latter case, the file is identical to the input.
- `log.txt`: Log file for this module

**decoypyrat**

- `decoy_database.fasta`: Original `.fasta` file supplemented with randomized protein sequences.
- `log.txt`: Log file for this module

**fragpipe**

- `[sample_name]/`: Directory containing sample-specific output files for each run
- `combined_ion.tsv`: Quantification of ion intensity per peptide
- `combined_modified_peptide.tsv`: Quantification of peptide modifications
- `combined_peptide.tsv`: Quantification of peptides/features
- `combined_protein.tsv`: Quantification of proteins from peptides, inferred by `fragpipe`
- `MSstats.csv`: Quantification of peptides/features, exported by fragpipe in an `MSstats`-friendly format
- other files such as logs, file lists, etc.
- `log.txt`: Log file for this module

**msstats**

- `comparison_result.csv`: Main table with results of the comparison between different experimental conditions
- `feature_level_data.csv`: Feature-level quantification data processed by MSstats
- `model_qc.csv`: Table with data about the fitted quantification models from MSstats
- `protein_level_data.csv`: Protein-level quantification data processed by MSstats
- `uniprot.csv`: Optionally downloaded table with protein annotation from Uniprot
- `log.txt`: Log file for this module

**report**

- `report.html`: Report with figures and tables
- `report.pdf`: Report with figures and tables in PDF format. Converted from HTML
- `log.txt`: Log file for this module

**email**

- `log.txt`: Log file for this module

**clean_up**

- `log.txt`: Log file for this module
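For downstream analysis, the main result table `comparison_result.csv` can be filtered with a few lines of Python. The sketch below assumes that the table carries the column names `Protein`, `Label`, `log2FC`, and `adj.pvalue`, as produced by the MSstats group comparison; check the header of your own file before relying on them. The thresholds and the function `significant_proteins` are hypothetical.

```python
import csv
import io


def significant_proteins(csv_text, max_adj_pvalue=0.05, min_abs_log2fc=1.0):
    """Return (protein, comparison, log2FC) tuples that pass both thresholds."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            log2fc = float(row["log2FC"])
            adj_p = float(row["adj.pvalue"])
        except (TypeError, ValueError):
            continue  # skip rows with NA or otherwise non-numeric values
        if adj_p <= max_adj_pvalue and abs(log2fc) >= min_abs_log2fc:
            hits.append((row["Protein"], row["Label"], log2fc))
    return hits


# Hypothetical excerpt; real files contain more columns and rows.
example_csv = """Protein,Label,log2FC,adj.pvalue
P1,condition_2 vs condition_1,2.5,0.001
P2,condition_2 vs condition_1,0.2,0.8
P3,condition_2 vs condition_1,NA,NA
P4,condition_2 vs condition_1,-1.7,0.03
"""
for hit in significant_proteins(example_csv):
    print(hit)
```

To apply this to real output, read the file first, for example with `pathlib.Path("comparison_result.csv").read_text()`, and pass the text to the function.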

## Authors

- Dr. Michael Jahn

  - Affiliation: [Max-Planck-Unit for the Science of Pathogens](https://www.mpusp.mpg.de/) (MPUSP), Berlin, Germany

  - ORCID profile: https://orcid.org/0000-0002-3913-153X

  - github page: https://github.com/m-jahn

## License

- the contents of this repository are licensed under the [MIT License](https://choosealicense.com/licenses/mit/)

  - you are free to use the workflow for your purposes free of charge

  - you are free to modify the contents and create derivative work

  - the only condition is that you refer to the original license and copyright owners (MPUSP)

  - all contents come with absolutely no warranty to work for your or any other purposes

  - **important**: all third party dependencies are licensed under their own terms and _not covered_ by this license

## References

- Essential tools are linked in the top section of this document

- The core of this workflow consists of the two external packages **fragpipe** and **MSstats**

**fragpipe**

1. Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D., & Nesvizhskii, A. I. (2017). _MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics_. Nature Methods, 14(5), 513-520.

2. da Veiga Leprevost, F., Haynes, S. E., Avtonomov, D. M., Chang, H. Y., Shanmugam, A. K., Mellacheruvu, D., Kong, A. T., & Nesvizhskii, A. I. (2020). _Philosopher: a versatile toolkit for shotgun proteomics data analysis_. Nature Methods, 17(9), 869-870.

3. Yu, F., Haynes, S. E., & Nesvizhskii, A. I. (2021). _IonQuant enables accurate and sensitive label-free quantification with FDR-controlled match-between-runs_. Molecular & Cellular Proteomics, 20.

**MSstats**

1. Choi, M., et al. (2014). _MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments_. Bioinformatics, 30(17), 2524-2526.