{"id":44332377,"url":"https://github.com/mpusp/snakemake-ms-proteomics","last_synced_at":"2026-02-11T10:09:54.514Z","repository":{"id":157987466,"uuid":"578197150","full_name":"MPUSP/snakemake-ms-proteomics","owner":"MPUSP","description":"Pipeline for automatic processing and quality control of mass spectrometry data","archived":false,"fork":false,"pushed_at":"2025-01-30T14:06:23.000Z","size":1446,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-30T14:32:28.260Z","etag":null,"topics":["bioinformatics","conda","mass-spectrometry","pipeline","proteomics","snakemake"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MPUSP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-12-14T13:39:47.000Z","updated_at":"2025-01-30T14:06:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"6f105130-0843-4e89-a5f0-21e2358ee105","html_url":"https://github.com/MPUSP/snakemake-ms-proteomics","commit_stats":{"total_commits":32,"total_committers":1,"mean_commits":32.0,"dds":0.0,"last_synced_commit":"9e55f58f7c496e0834cc2292c11a89c0fb6a2897"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MPUSP/snakemake-ms-proteomics","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MPUSP%2Fsnakemake-ms-proteomics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MPUSP%2Fsnakemake-ms-proteomics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MPUSP%2Fsnakemake-ms-proteo
mics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MPUSP%2Fsnakemake-ms-proteomics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MPUSP","download_url":"https://codeload.github.com/MPUSP/snakemake-ms-proteomics/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MPUSP%2Fsnakemake-ms-proteomics/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29331743,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-11T06:13:03.264Z","status":"ssl_error","status_checked_at":"2026-02-11T06:12:55.843Z","response_time":97,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","conda","mass-spectrometry","pipeline","proteomics","snakemake"],"created_at":"2026-02-11T10:09:53.870Z","updated_at":"2026-02-11T10:09:54.506Z","avatar_url":"https://github.com/MPUSP.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# snakemake-ms-proteomics\n\n\u003c!-- badges start --\u003e\n\n[![Snakemake](https://img.shields.io/badge/snakemake-≥8.0.0-brightgreen.svg)](https://snakemake.github.io)\n[![GitHub actions](https://github.com/MPUSP/snakemake-ms-proteomics/actions/workflows/main.yml/badge.svg)](https://github.com/MPUSP/snakemake-ms-proteomics/actions/workflows/main.yml)\n![GitHub 
issues](https://img.shields.io/github/issues/MPUSP/snakemake-ms-proteomics)\n![GitHub last commit](https://img.shields.io/github/last-commit/MPUSP/snakemake-ms-proteomics)\n[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000\u0026logo=anaconda)](https://docs.conda.io/en/latest/)\n[![workflow catalog](https://img.shields.io/badge/Snakemake%20workflow%20catalog-darkgreen)](https://snakemake.github.io/snakemake-workflow-catalog)\n\n\u003c!-- badges end --\u003e\n\n---\n\nA Snakemake workflow for automatic processing and quality control of protein mass spectrometry data.\n\n- [snakemake-ms-proteomics](#snakemake-ms-proteomics)\n  - [workflow overview](#workflow-overview)\n  - [Installation](#installation)\n    - [Snakemake](#snakemake)\n    - [Fragpipe](#fragpipe)\n  - [Running the workflow](#running-the-workflow)\n    - [Input data](#input-data)\n    - [Execution](#execution)\n    - [Parameters](#parameters)\n    - [Missing value imputation\\*\\*](#missing-value-imputation)\n  - [Output](#output)\n  - [Authors](#authors)\n  - [License](#license)\n  - [References](#references)\n\n## workflow overview\n\n\u003c!-- include logo--\u003e\n\u003cimg src=\"docs/images/logo.png\" align=\"right\" /\u003e\n\n---\n\nThis workflow is a best-practice workflow for the automated analysis of mass spectrometry proteomics data. It currently supports automated analysis of data-dependent acquisition (DDA) data with label-free quantification. Extensions for other workflows (DIA, isotope labeling) are planned. The workflow is mainly a wrapper for the excellent tools [fragpipe](https://fragpipe.nesvilab.org/) and [MSstats](https://www.bioconductor.org/packages/release/bioc/html/MSstats.html), with additional modules that supply and check the required input files, and generate reports. The workflow is built using [snakemake](https://snakemake.readthedocs.io/en/stable/) and processes MS data using the following steps:\n\n1. 
Prepare `workflow` file (`python` script)\n2. Check user-supplied sample sheet (`python` script)\n3. Fetch protein database from NCBI or use user-supplied fasta file (`python`, [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/))\n4. Generate decoy proteins ([DecoyPyrat](https://github.com/wtsi-proteomics/DecoyPYrat))\n5. Import raw files, search protein database ([fragpipe](https://fragpipe.nesvilab.org/))\n6. Align feature maps using IonQuant ([fragpipe](https://fragpipe.nesvilab.org/))\n7. Import quantified features, infer and quantify proteins ([R MSstats](https://www.bioconductor.org/packages/release/bioc/html/MSstats.html))\n8. Compare different biological conditions, export results ([R MSstats](https://www.bioconductor.org/packages/release/bioc/html/MSstats.html))\n9. Generate HTML report with embedded QC plots ([R markdown](https://rmarkdown.rstudio.com/))\n10. Generate PDF report from HTML ([weasyprint](https://weasyprint.org/))\n11. Send out report by email (`python` script)\n12. Clean up temporary files after workflow execution (`bash` script)\n\nIf you want to contribute, report issues, or suggest features, please get in touch on [github](https://github.com/MPUSP/snakemake-ms-proteomics).\n\n## Installation\n\n### Snakemake\n\nStep 1: Install snakemake with `conda`, `mamba`, `micromamba` (or any other `conda` flavor). 
This step generates a new conda environment called `snakemake-ms-proteomics`, which will be used for all further installations.\n\n```bash\nconda create -c conda-forge -c bioconda -n snakemake-ms-proteomics snakemake\n```\n\nStep 2: Activate the conda environment containing snakemake\n\n```bash\nsource /path/to/conda/bin/activate\nconda activate snakemake-ms-proteomics\n```\n\nAlternatively, install `snakemake` using pip:\n\n```bash\npip install snakemake\n```\n\nOr install `snakemake` globally from Linux archives:\n\n```bash\nsudo apt install snakemake\n```\n\n### Fragpipe\n\nFragpipe is not available on `conda` or other package archives. However, to make the workflow as user-friendly as possible, the latest [fragpipe release from github](https://github.com/Nesvilab/FragPipe/releases) (currently v22.0) is automatically installed to the respective `conda` environment when the workflow is used for the first time. After installation, the GUI (graphical user interface) will pop up and ask you to finish the installation by **downloading the missing modules MSFragger, IonQuant, and Philosopher**. This step is necessary to comply with license restrictions. From then on, fragpipe will run in `headless` mode through the command line only.\n\nAll other dependencies for the workflow are **automatically pulled as `conda` environments** by snakemake.\n\n## Running the workflow\n\n### Input data\n\nThe workflow requires the following input files:\n\n1. mass spectrometry data, such as Thermo `*.raw` or `*.mzML` files\n2. an (organism) database in `*.fasta` format _OR_ an NCBI Refseq ID. Decoys (`rev_` prefix) will be added if necessary\n3. a sample sheet in tab-separated format (aka `manifest` file)\n4. 
a `workflow` file for fragpipe (see `resources` dir)\n\nThe samplesheet file has the following structure with five mandatory columns and no header (example file: `test/input/samplesheet/samplesheet.tsv`).\n\n- `sample`: names/paths to raw files\n- `condition`: experimental group, treatments\n- `replicate`: replicate number, consecutively numbered. Repeating numbers (e.g. 1,2,1,2) will be treated as paired samples!\n- `type`: the type of MS data, will be used to determine the workflow\n- `control`: reference condition for testing differential abundance\n\n| sample   | condition   | replicate | type | control     |\n| -------- | ----------- | --------- | ---- | ----------- |\n| sample_1 | condition_1 | 1         | DDA  | condition_1 |\n| sample_2 | condition_1 | 2         | DDA  | condition_1 |\n| sample_3 | condition_2 | 3         | DDA  | condition_1 |\n| sample_4 | condition_2 | 4         | DDA  | condition_1 |\n\n### Execution\n\nTo run the workflow from the command line, change to the workflow directory.\n\n```bash\ncd /path/to/snakemake-ms-proteomics\n```\n\nAdjust options in the default config file `config/config.yml`.\nBefore running the entire workflow, you can perform a dry run using:\n\n```bash\nsnakemake --dry-run\n```\n\nTo run the complete workflow with test files using **`conda`**, execute the following command. 
Specifying the number of compute cores is mandatory.\n\n```bash\nsnakemake --cores 10 --sdm conda --directory .test\n```\n\nTo supply options that override the defaults, run the workflow like this:\n\n```bash\nsnakemake --cores 10 --sdm conda --directory .test \\\n  --configfile 'config/config.yml' \\\n  --config \\\n  samplesheet='my/sample_sheet.tsv'\n```\n\n### Parameters\n\nThis table lists all **global parameters** of the workflow.\n\n| parameter   | type                   | details             | example                                                 |\n| ----------- | ---------------------- | ------------------- | ------------------------------------------------------- |\n| samplesheet | `*.tsv`                | tab-separated file  | `test/input/config/samplesheet.tsv`                     |\n| database    | `*.fasta` OR refseq ID | plain text          | `test/input/database/database.fasta`, `GCF_000009045.1` |\n| workflow    | `*.workflow` OR string | a fragpipe workflow | `workflows/LFQ-MBR.workflow`, `from_samplesheet`        |\n\nThis table lists all **module-specific parameters** and their default values, as included in the `config.yml` file.\n\n| module     | parameter        | default                            | details                                                          |\n| ---------- | ---------------- | ---------------------------------- | ---------------------------------------------------------------- |\n| decoypyrat | `cleavage_sites` | `KR`                               | amino acid residues used for decoy peptide generation            |\n|            | `decoy_prefix`   | `rev`                              | decoy prefix prepended to protein names                          |\n| fragpipe   | `target_dir`     | `share`                            | default path in conda env to store fragpipe                      |\n|            | `executable`     | `fragpipe/bin/fragpipe`            | path to fragpipe executable                         
             |\n|            | `download`       | FragPipe-22.0 (see config)         | download link to Fragpipe Github repo                            |\n| msstats    | `logTrans`       | `2`                                | base for log fold change transformation                          |\n|            | `normalization`  | `equalizeMedians`                  | normalization strategy for feature intensity, see MSstats manual |\n|            | `featureSubset`  | `all`                              | which features to use for quantification                         |\n|            | `summaryMethod`  | `TMP`                              | how to calculate protein from feature intensity                  |\n|            | `MBimpute`       | `True`                             | Imputes missing values with accelerated failure time model       |\n| report     | `html`           | `True`                             | Generate HTML report                                             |\n|            | `pdf`            | `True`                             | Generate PDF report                                              |\n| email      | `send`           | `False`                            | whether reports should be sent out by email                      |\n|            | `port`           | `0`                                | default port for email server                                    |\n|            | `smtp_server`    | `smtp.example.com`                 | smtp server address                                              |\n|            | `smtp_user`      | `user`                             | smtp server user name                                            |\n|            | `smtp_pw`        | `password`                         | smtp server user password                                        |\n|            | `from`           | `sender@email.com`                 | sender's email address                                           |\n|            | `to`             | 
`[\"receiver@email.com\"]`           | receiver's email address(es), a list                             |\n|            | `subject`        | `\"Results MS proteomics workflow\"` | subject line for email                                           |\n\n### Missing value imputation\\*\\*\n\n- missing value imputation happens at different stages\n- first, the default strategy for `fragpipe` is to use \"match between runs\", i.e. non-identified features in the MS1 spectra are cross-compared with other runs of the same experiment where MS2 identification is available\n- this reduces the number of missing feature quantifications\n- this strategy is based on actual quantification data\n- second, `MSstats` imputes two kinds of missing values where absolutely no feature quantification is available\n- missing values at random: removed during summarization\n- missing values due to low abundance: imputed at the feature level via accelerated failure time model\n- missing value treatment can be controlled through `MSstats` parameters `MBimpute` and others\n- see `MSstats` manual for more information\n\n## Output\n\nThe workflow generates the following output from its modules:\n\n\u003cdetails markdown=\"1\"\u003e\n\u003csummary\u003esamplesheet\u003c/summary\u003e\n\n- `samplesheet.tsv`: Samplesheet after checking file paths and options\n- `log.txt`: Log file for this module\n\n\u003c/details\u003e\n\n\u003cdetails markdown=\"1\"\u003e\n\u003csummary\u003eworkflow\u003c/summary\u003e\n\n- `workflow.txt`: Configuration file for `fragpipe`, determined from samplesheet.\n- `log.txt`: Log file for this module\n\n\u003c/details\u003e\n\n\u003cdetails markdown=\"1\"\u003e\n\u003csummary\u003edatabase\u003c/summary\u003e\n\n- `database.fasta`: The downloaded or user-supplied `.fasta` file. 
In the latter case, the file is identical to the input.\n- `log.txt`: Log file for this module\n\n\u003c/details\u003e\n\n\u003cdetails markdown=\"1\"\u003e\n\u003csummary\u003edecoypyrat\u003c/summary\u003e\n\n- `decoy_database.fasta`: Original `.fasta` file supplemented with randomized protein sequences.\n- `log.txt`: Log file for this module\n\n\u003c/details\u003e\n\n\u003cdetails markdown=\"1\"\u003e\n\u003csummary\u003efragpipe\u003c/summary\u003e\n\n- `[sample_name]/`: Directory containing sample-specific output files for each run\n- `combined_ion.tsv`: Quantification of ion intensity per peptide\n- `combined_modified_peptide.tsv`: Quantification of peptide modifications\n- `combined_peptide.tsv`: Quantification of peptides/features\n- `combined_protein.tsv`: Quantification of proteins from peptides, inferred by `fragpipe`\n- `MSstats.csv`: Quantification of peptides/features, output from fragpipe in `MSstats`-friendly format\n- other files such as logs, file lists, etc.\n- `log.txt`: Log file for this module\n\n\u003c/details\u003e\n\n\u003cdetails markdown=\"1\"\u003e\n\u003csummary\u003emsstats\u003c/summary\u003e\n\n- `comparison_result.csv`: Main table with results of the comparison between different experimental conditions\n- `feature_level_data.csv`: Feature-level quantification data processed by MSstats\n- `model_qc.csv`: Table with data about the fitted quantification models from MSstats\n- `protein_level_data.csv`: Protein-level quantification data processed by MSstats\n- `uniprot.csv`: Optionally downloaded table with protein annotation from Uniprot\n- `log.txt`: Log file for this module\n\n\u003c/details\u003e\n\n\u003cdetails markdown=\"1\"\u003e\n\u003csummary\u003ereport\u003c/summary\u003e\n\n- `report.html`: Report with figures and tables\n- `report.pdf`: Report with figures and tables in PDF format. 
Converted from HTML\n- `log.txt`: Log file for this module\n\n\u003c/details\u003e\n\n\u003cdetails markdown=\"1\"\u003e\n\u003csummary\u003eemail\u003c/summary\u003e\n\n- `log.txt`: Log file for this module\n\n\u003c/details\u003e\n\n\u003cdetails markdown=\"1\"\u003e\n\u003csummary\u003eclean_up\u003c/summary\u003e\n\n- `log.txt`: Log file for this module\n\n\u003c/details\u003e\n\n## Authors\n\n- Dr. Michael Jahn\n  - Affiliation: [Max-Planck-Unit for the Science of Pathogens](https://www.mpusp.mpg.de/) (MPUSP), Berlin, Germany\n  - ORCID profile: https://orcid.org/0000-0002-3913-153X\n  - github page: https://github.com/m-jahn\n\n## License\n\n- the contents of this repository are licensed under the [MIT License](https://choosealicense.com/licenses/mit/)\n  - you are free to use the workflow for your purposes free of charge\n  - you are free to modify the contents and create derivative work\n  - the only condition is that you refer to the original license and copyright owners (MPUSP)\n  - all contents come with absolutely no warranty to work for your or any other purposes\n  - **important**: all third-party dependencies are licensed under their own terms and are _not covered_ by this license\n\n## References\n\n- Essential tools are linked in the top section of this document\n- The core of this workflow consists of the two external packages **fragpipe** and **MSstats**\n\n**fragpipe**\n\n1. Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D., \u0026 Nesvizhskii, A. I. (2017). _MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics_. Nature Methods, 14(5), 513-520.\n2. da Veiga Leprevost, F., Haynes, S. E., Avtonomov, D. M., Chang, H. Y., Shanmugam, A. K., Mellacheruvu, D., Kong, A. T., \u0026 Nesvizhskii, A. I. (2020). _Philosopher: a versatile toolkit for shotgun proteomics data analysis_. Nature Methods, 17(9), 869-870.\n3. Yu, F., Haynes, S. E., \u0026 Nesvizhskii, A. 
I. (2021). _IonQuant enables accurate and sensitive label-free quantification with FDR-controlled match-between-runs_. Molecular \u0026 Cellular Proteomics, 20.\n\n**MSstats**\n\n1. Choi M (2014). _MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments._ Bioinformatics, 30.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmpusp%2Fsnakemake-ms-proteomics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmpusp%2Fsnakemake-ms-proteomics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmpusp%2Fsnakemake-ms-proteomics/lists"}