{"id":23045568,"url":"https://github.com/fgcz/prolfqua","last_synced_at":"2025-10-29T06:07:47.648Z","repository":{"id":37725230,"uuid":"148905397","full_name":"fgcz/prolfqua","owner":"fgcz","description":"Differential Expression Analysis tool box R lang package for omics data","archived":false,"fork":false,"pushed_at":"2025-05-27T12:46:15.000Z","size":808309,"stargazers_count":45,"open_issues_count":9,"forks_count":9,"subscribers_count":4,"default_branch":"Modelling2R6","last_synced_at":"2025-05-31T09:47:40.995Z","etag":null,"topics":["differential-expression-analysis","hypothesis-testing","protein-quantification","proteomics-data-analysis","quality-control","r-package","rstats-package"],"latest_commit_sha":null,"homepage":"https://pubs.acs.org/doi/pdf/10.1021/acs.jproteome.2c00441","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fgcz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-09-15T13:41:42.000Z","updated_at":"2025-05-27T12:46:18.000Z","dependencies_parsed_at":"2023-12-22T13:38:18.576Z","dependency_job_id":"27d7941b-4f6a-4ca2-b529-08593b64ae17","html_url":"https://github.com/fgcz/prolfqua","commit_stats":null,"previous_names":["wolski/prolfqua"],"tags_count":29,"template":false,"template_full_name":null,"purl":"pkg:github/fgcz/prolfqua","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fgcz%2Fprolfqua","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fgcz%2Fprolfqua/tags","releases_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/fgcz%2Fprolfqua/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fgcz%2Fprolfqua/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fgcz","download_url":"https://codeload.github.com/fgcz/prolfqua/tar.gz/refs/heads/Modelling2R6","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fgcz%2Fprolfqua/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270500008,"owners_count":24595150,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-14T02:00:10.309Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["differential-expression-analysis","hypothesis-testing","protein-quantification","proteomics-data-analysis","quality-control","r-package","rstats-package"],"created_at":"2024-12-15T21:27:02.804Z","updated_at":"2025-10-29T06:07:42.598Z","avatar_url":"https://github.com/fgcz.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![R-CMD-check-prolfqua](https://github.com/fgcz/prolfqua/actions/workflows/r.yaml/badge.svg)](https://github.com/fgcz/prolfqua/actions/workflows/r.yaml) 
![ReleaseDownloads](https://img.shields.io/github/downloads/fgcz/prolfqua/total)\n[![codecov](https://codecov.io/gh/fgcz/prolfqua/branch/Modelling2R6/graph/badge.svg?token=NP7IPP323C)](https://codecov.io/gh/fgcz/prolfqua)\n[![bioRxiv](https://img.shields.io/badge/bioRxiv-10.1101%2F2022.06.07.494524-ligtgreen)](https://www.biorxiv.org/content/early/2022/06/09/2022.06.07.494524)\n\n\u003cimg src=\"man/figures/imgfile.png\" width=\"200\"\u003e \n\n# prolfqua - a comprehensive R package for Proteomics Differential Expression Analysis\n\nThe R package contains functions for analyzing mass spectrometry-based experiments.\nThis package is developed at the [FGCZ](http://fgcz.ch/).\nThe package documentation, including vignettes, can be accessed at https://fgcz.github.io/prolfqua/index.html\n\n`prolfqua` makes easy things easy while remaining fully hackable.\n\n# How to install prolfqua?\n\nRequirements: a Windows, Linux, or macOS platform with R (\u003e= 4.1) installed.\n\n\nWe recommend installing the package from the latest [release](https://github.com/fgcz/prolfqua/releases).\nDownload the `prolfqua_X.Y.Z.tar.gz` from the [GitHub release page](https://github.com/fgcz/prolfqua/releases) into your working directory 
and then execute:\n\n```\ninstall.packages(\"./prolfqua_X.Y.Z.tar.gz\",repos = NULL, type=\"source\")\n```\n\n\nTo install the package without vignettes from GitHub, execute the following in R:\n\n```\ninstall.packages('remotes')\nremotes::install_github('fgcz/prolfqua', dependencies = TRUE)\n```\n\n\nIf you want to build the vignettes on your system:\n\n```\ninstall.packages('remotes')\nremotes::install_github('fgcz/prolfqua', build_vignettes = TRUE, dependencies = TRUE)\n\n```\n\n\nPlease let us know about any installation problems or errors when using the package:\nhttps://github.com/fgcz/prolfqua/issues\n\n\n\n# How to get started\n\nHow to build an `LFQData` object from a table with protein or peptide quantification results and a table with sample annotation is described in more detail in the [CreatingConfigurations vignette](https://fgcz.github.io/prolfqua/articles/CreatingConfigurations.html).\n\nA minimal example for a table with protein abundances is:\n\n```{r}\n#Table with abundances\ndf \u003c- data.frame(protein_Id = c(\"tr|A|HUMAN\",\"tr|B|HUMAN\",\"tr|C|HUMAN\",\"tr|D|HUMAN\"),\n                 Intensity_A = c(100,10000,10,NA),\n                 Intensity_B = c(NA, 9000, 20, 100),\n                 Intensity_C = c(200,8000,NA,150),\n                 Intensity_D = c(130,11000, 50, 50))\n# Table with sample annotation\nannot \u003c- data.frame(Sample = c(\"Intensity_A\", \"Intensity_B\", \"Intensity_C\", \"Intensity_D\"), Group = c(\"A\",\"A\",\"B\",\"C\"))\n\n# convert into long format\ntable_long \u003c- tidyr::pivot_longer(df, starts_with(\"Intensity_\"),names_to = \"Sample\", values_to = \"Intensity\")\n\ntable_long \u003c- dplyr::inner_join(annot, table_long)\n\n# create TableAnnotation and AnalysisConfiguration\n\natable \u003c- prolfqua::AnalysisTableAnnotation$new()\natable$fileName = \"Sample\"\natable$workIntensity = \"Intensity\"\natable$hierarchy[[\"protein_Id\"]]    \u003c-  \"protein_Id\"\natable$factors[[\"Group\"]] \u003c- \"Group\"\n\nconfig 
\u003c- prolfqua::AnalysisConfiguration$new(atable)\n\n# Build LFQData object\nanalysis_data \u003c- prolfqua::setup_analysis(table_long, config)\nlfqdata \u003c- prolfqua::LFQData$new(analysis_data, config)\nlfqdata$hierarchy_counts()\n\n```\n\nOnce you have created an `LFQData` object, you can use prolfqua like this:\n\n```{r}\nlibrary(prolfqua)\nR.version.string; packageVersion(\"prolfqua\")\n\n## here we simulate peptide level data\nstartdata \u003c- sim_lfq_data_peptide_config()\nlfqpep \u003c- LFQData$new(startdata$data, startdata$config)\n\n\n## transform intensities\nlfqpep \u003c- lfqpep$get_Transformer()$log2()$robscale()$lfq\nlfqpep$rename_response(\"log_peptide_abundance\")\nagr \u003c- lfqpep$get_Aggregator()\nlfqpro \u003c- agr$medpolish()\nlfqpro$rename_response(\"log_protein_abundance\")\n\n## plot Figure 3 panels A-D\npl \u003c- lfqpep$get_Plotter()\npanelA \u003c- pl$intensity_distribution_density() +\n  ggplot2::labs(tag = \"A\") + ggplot2::theme(legend.position = \"none\")\npanelB \u003c- agr$plot()$plots[[1]] + ggplot2::labs(tag = \"B\")\npanelC \u003c- lfqpro$get_Stats()$violin() + ggplot2::labs(tag = \"C\")\npl \u003c- lfqpro$get_Plotter()\npanelD \u003c- pl$boxplots()$boxplot[[1]] + ggplot2::labs(tag = \"D\")\nggpubr::ggarrange(panelA, panelB, panelC, panelD)\n\n```\n\n![image](https://github.com/fgcz/prolfqua/assets/1926513/4d5bb64b-6e45-4d00-b029-f08995ac3127)\n\n\n```{r}\n## specify model\nmodelFunction \u003c-\n strategy_lm(\"log_protein_abundance  ~ group_\")\n\n## fit models to lfqpro data\nmod \u003c- build_model(\n lfqpro,\n modelFunction\n)\n\n## specify contrasts\nContr \u003c- c(\"AvsCtrl\" = \"group_A - group_Ctrl\",\n     \"BvsCtrl\" = \"group_B - group_Ctrl\",\n     \"BvsA\" = \"group_B - group_A\"\n      )\n      \n## determine contrasts and plot\ncontrastX \u003c- prolfqua::Contrasts$new(mod, Contr)\npl \u003c- 
contrastX$get_Plotter()\npl$volcano()$FDR\n\n```\n\n![image](https://github.com/fgcz/prolfqua/assets/1926513/4ae8634b-ce6c-4fa2-8e42-c8bc64a12821)\n\n\n[![SIB in-silico talk](https://img.youtube.com/vi/acDiXq2xbOw/1.jpg)](https://www.youtube.com/watch?v=acDiXq2xbOw)\n\n- Watch the [in-silico talk](https://www.sib.swiss/in-silico-talks/prolfqua-a-comprehensive-r-package-for-protein-differential-expression-analysis)\n- See our article in the [Journal of Proteome Research](https://pubmed.ncbi.nlm.nih.gov/36939687/)\n- See the [Bioconductor 2021 Conference poster](https://fgcz-proteomics.uzh.ch/~wolski/PosterBioconductor.html). \n- Watch the lightning talk (8 min) at [EuroBioc2020](https://www.youtube.com/watch?v=jOXU4X7nV9I\u0026t) on YouTube or view the [slides](https://f1000research.com/slides/9-1476).\n- Read the pkgdown-generated website https://fgcz.github.io/prolfqua/index.html\n\n\n# Detailed documentation with R code:\n\nDocuments explaining how to run an analysis with prolfqua are available at [https://fgcz.github.io/prolfqua/index.html](https://fgcz.github.io/prolfqua/index.html).\n\n- [Comparing two Conditions](https://fgcz.github.io/prolfqua/articles/Comparing2Groups.html)\n- [QC and protein wise sample size estimation](https://fgcz.github.io/prolfqua/articles/QualityControlAndSampleSizeEstimation.html)\n- [Analysing factorial designs](https://fgcz.github.io/prolfqua/articles/Modelling2Factors.html)\n\nExample QC and sample size report\n\n- [QC and sample size Report](https://fgcz.github.io/prolfqua/articles/QCandSampleSize.html)\n\n# Related projects\n\n- prolfquabenchmark - a package to document the performance of prolfqua, MSstats, MSqRob, and proDA. 
See documentation: [https://prolfqua.github.io/prolfquabenchmark/](https://prolfqua.github.io/prolfquabenchmark/)\n- prolfquapp: Generating Dynamic DEA Reports with the prolfqua R Package [https://github.com/prolfqua/prolfquapp](https://github.com/prolfqua/prolfquapp)\n- prophosqua - (scripts for the analysis of phospho experiments) [https://github.com/prolfqua/prophosqua](https://github.com/prolfqua/prophosqua)\n\n\n# How to cite?\n\nPlease cite the [prolfqua article in the Journal of Proteome Research](https://pubmed.ncbi.nlm.nih.gov/36939687/)\n\n```\n\n@article{prolfquawolski2023,\nauthor = {Wolski, Witold E. and Nanni, Paolo and Grossmann, Jonas and d’Errico, Maria and Schlapbach, Ralph and Panse, Christian},\ntitle = {prolfqua: A Comprehensive R-Package for Proteomics Differential Expression Analysis},\njournal = {Journal of Proteome Research},\nvolume = {22},\nnumber = {4},\npages = {1092–1104},\nyear = {2023},\ndoi = {10.1021/acs.jproteome.2c00441},\n    note = {PMID: 36939687},\nURL = {https://doi.org/10.1021/acs.jproteome.2c00441},\neprint = {https://doi.org/10.1021/acs.jproteome.2c00441}\n}\n\n```\n\n## Motivation\n\nThe package for **pro**teomics **l**abel **f**ree **qua**ntification `prolfqua` (read: prolevka) evolved from a set of scripts and functions written in the R programming language to visualize and analyze mass spectrometric data; some of them are still available in R packages such as quantable, protViz or imsbInfer. For computing protein fold changes among treatment conditions, we first used t-tests or linear models, then started to use functions implemented in the package limma to obtain moderated p-values. We also tried other packages such as MSstats, ROPECA or MSqRob, all implemented in R, with the idea of integrating the various approaches to protein fold-change estimation. Although all these packages are written in R, model specification, input and output formats differ widely, which made our aim of using the original implementations challenging. 
Therefore, and also to understand the algorithms used, we attempted to reimplement those methods where possible. \n\nWhen developing _prolfqua_ we were inspired by packages such as _sf_ or _stars_, which use data in long table format, _dplyr_ for data transformation and _ggplot2_ for visualization.  In the long table format each column stores a different attribute, e.g. there is only a single column with the raw intensities. In the wide table format there might be several columns with the same attribute, e.g. a raw intensity column for each recorded sample.\nIn _prolfqua_ the data needed for analysis is represented using a single data frame in long format and a configuration object. The configuration annotates the table, i.e. it specifies which information is in which column. The results of the statistical modelling are stored in data frames.  Relying on the long data table format enabled us to access a large variety of useful visualizations as well as data preprocessing methods implemented in the R packages _dplyr_ and _ggplot2_.\n\nUsing an annotated table makes it simple to integrate new data, provided it comes in long format.  Hence, for Spectronaut or Skyline text output, all that is needed is a table annotation (see code snippet).  Since MSstats-formatted input is a table in long format, _prolfqua_ works with MSstats-formatted files. For software which writes the data in a wide table format, e.g. MaxQuant, we implemented methods which first transform the data into long format.  \n\nA further design decision that differentiates `prolfqua` is that it embraces and supports R's linear model formula interface as well as the lme4 formula interface. R's formula interface for linear models is flexible, widely used and documented. The linear model and linear mixed model interfaces allow specifying a wide range of essential models, including parallel designs, factorial designs, repeated measurements and many more. 
Since `prolfqua` uses R's modelling infrastructure directly, we can fit all these models to proteomics data.\nThis is not easily possible with any other package dedicated to proteomics data analysis. For instance, MSstats, although using the same modelling infrastructure, supports only a small subset of possible models. limma, on the other hand, supports R's formula interface but not for linear mixed models. Since the ROPECA package relies on _limma_, it is limited to the same subset of models. MSqRob is limited to random effects models, and it is unclear how to fit these models to factorial designs, and how interactions among factors can be computed and tested.\n\nThe use of R's formula interface does not limit _prolfqua_ to the output provided by the R modelling infrastructure. _prolfqua_ also implements p-value moderation, as in the limma publication, and computes probabilities of differential regulation, as suggested in the ROPECA publication. \nMoreover, the design decision to use the R formula interface allowed us to integrate Bayesian regression models provided by the R package _brms_. Because of that, we can benchmark all those methods (linear models, mixed effect models, p-value moderation, ROPECA and Bayesian regression models) within the same framework, which enabled us to evaluate the practical relevance of these methods.\n\nLast but not least, _prolfqua_ supports the LFQ data analysis workflow, e.g. computing coefficients of variation (CV) for peptides and proteins, sample size estimation, visualization and summarization of missing data and intensity distributions, multivariate analysis of the data, etc.\nIt also implements various protein intensity summarization and inference methods, e.g. top 3 or Tukey's median polish. Finally, ANOVA or model selection using the likelihood ratio test can be performed for thousands of proteins. \n\nTo use `prolfqua`, knowledge of the R regression model infrastructure is an advantage. 
Acknowledging the complexity of the formula interface, we provide an MSstats emulator, where the model specification is generated based on the annotation file structure. \n\n\n\n# Related resources\n\n- [proDA](https://www.bioconductor.org/packages/release/bioc/html/proDA.html)\n- [MSstats](https://www.bioconductor.org/packages/release/bioc/html/MSstats.html)\n- [MSqRob](https://github.com/statOmics/MSqRob)\n- [Triqler](https://github.com/statisticalbiotechnology/triqler)\n- [DAPAR](https://github.com/samWieczorek/DAPAR/)\n- [DAPARdata](https://github.com/samWieczorek/DAPARdata/)\n- [PECA/ROPECA](http://bioconductor.org/packages/release/bioc/html/PECA.html)\n\n# Relevant background information\n\n- [R Companion](https://rcompanion.org/rcompanion/h_01.html)\n- [Extending the Linear Model with R](http://www.maths.bath.ac.uk/~jjf23/ELM/)\n- [Bayesian Data Analysis](http://www.stat.columbia.edu/~gelman/book/)\n- [Bayesian essentials with R - R package](https://CRAN.R-project.org/package=bayess)\n- [Contrasts in R - an example vignette by Rose Maier](https://rstudio-pubs-static.s3.amazonaws.com/65059_586f394d8eb84f84b1baaf56ffb6b47f.html)\n- [Interactions and Contrasts PH525x series](http://genomicsclass.github.io/book/pages/interactions_and_contrasts.html)\n\n# R packages to compute contrasts from linear and other models\n\n- [marginaleffects](https://vincentarelbundock.github.io/marginaleffects/) Compute and plot predictions, slopes, marginal means, and comparisons (contrasts, risk ratios, odds ratios, etc.) 
for over 70 classes of statistical models in R.\n- [emmeans](https://CRAN.R-project.org/package=emmeans) Obtain estimated marginal means (EMMs) for many linear, generalized linear, and mixed models.\n- [lmerTest](https://CRAN.R-project.org/package=lmerTest) computes contrasts for [lme4](https://CRAN.R-project.org/package=lme4) models\n- [multcomp](https://CRAN.R-project.org/package=multcomp) computes contrasts for linear models and adjusts p-values (multiple comparison)\n\n# Future interesting topics or packages to look at\n\n- [modelsummary](https://vincentarelbundock.github.io/modelsummary/index.html)\n- [modelsummary tutorial](https://elbersb.com/public/pdf/web-7-regression-tables-graphs.pdf)\n- [edgeR tutorial](https://gist.github.com/jdblischak/11384914)\n- [another edgeR tutorial](https://web.stanford.edu/class/bios221/labs/rnaseq/lab_4_rnaseq.html)\n\n- https://fromthebottomoftheheap.net/2021/02/02/random-effects-in-gams/\n\n# Sample size estimation based on FDR\n\n- [ssize](https://www.bioconductor.org/packages/release/bioc/html/ssize.html)\n- [ssize.fdr](https://CRAN.R-project.org/package=ssize.fdr)\n  - related article [https://journal.r-project.org/archive/2009/RJ-2009-019/RJ-2009-019.pdf](https://journal.r-project.org/archive/2009/RJ-2009-019/RJ-2009-019.pdf)\n- [proper](https://bioconductor.org/packages/release/bioc/html/PROPER.html)\n\n# What package name?\n\nWhat name should we use?\n\nhttps://twitter.com/WitoldE/status/1338799648149041156\n\n- prolfqua - PROteomics Label Free QUAntification package (read prolewka)\n- LFQService - we do proteomics LFQ services at the FGCZ.\n- nalfqua - Not Another Label Free QUAntification package (read nalewka)\n- prodea - proteomics differential expression analysis ?\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffgcz%2Fprolfqua","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffgcz%2Fprolfqua","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffgcz%2Fprolfqua/lists"}