{"id":18501603,"url":"https://github.com/const-ae/proda","last_synced_at":"2025-04-09T18:33:21.522Z","repository":{"id":51416211,"uuid":"184236748","full_name":"const-ae/proDA","owner":"const-ae","description":"Protein Differential Abundance for Label-Free Mass Spectrometry https://const-ae.github.io/proDA/","archived":false,"fork":false,"pushed_at":"2023-10-30T08:46:14.000Z","size":1748,"stargazers_count":19,"open_issues_count":10,"forks_count":8,"subscribers_count":2,"default_branch":"devel","last_synced_at":"2025-03-23T20:37:03.699Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/const-ae.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-04-30T09:51:48.000Z","updated_at":"2025-03-04T16:35:36.000Z","dependencies_parsed_at":"2022-08-12T23:30:40.596Z","dependency_job_id":"3d2126e4-5f23-488f-85b9-c3f95e8ff6ac","html_url":"https://github.com/const-ae/proDA","commit_stats":{"total_commits":192,"total_committers":9,"mean_commits":"21.333333333333332","dds":"0.14583333333333337","last_synced_commit":"2355db7e03259042d11e365029d25d43ab011f0f"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/const-ae%2FproDA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/const-ae%2FproDA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/const-ae%2FproDA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/const-ae%2FproDA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/const-ae","download_url":"https://codeload.github.com/const-ae/proDA/tar.gz/refs/heads/devel","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248087990,"owners_count":21045625,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T13:54:23.162Z","updated_at":"2025-04-09T18:33:21.484Z","avatar_url":"https://github.com/const-ae.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: \n  github_document:\n    df_print: tibble\n---\n\n\u003c!-- README.md is generated from README.Rmd. Please edit that file --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"vignettes/figures/README-\",\n  out.width = \"100%\"\n)\nknitr::opts_knit$set(global.par = TRUE)\n```\n\n```{r, include=FALSE}\nset.seed(1)\noptions(width = 100)\npar(cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)\n```\n\n# proDA\n\n\u003c!-- badges: start --\u003e\n\u003c!-- badges: end --\u003e\n\nThe goal of `proDA` is to identify differentially abundant proteins in label-free\nmass spectrometry data. The main challenge of these data is the many missing values.\nThe missing values do not occur at random but predominantly at low intensities. This \nmeans that they cannot just be ignored. Existing methods have mostly focused on\nreplacing the missing values with some reasonable number (\"imputation\") and then\nrunning classical methods. 
But imputation is problematic because it obscures the\namount of available information, which in turn can lead to over-confident \npredictions.\n\n`proDA`, on the other hand, does not impute missing values, but constructs a \nprobabilistic dropout model. For each sample it fits a sigmoidal dropout \ncurve. This information can then be used to infer means across samples and the\nassociated uncertainty, without the intermediate imputation step. `proDA`\nsupports full linear models with variance and location moderation.\n\nFor full details, please see our **preprint**:\n\nConstantin Ahlmann-Eltze and Simon Anders: *proDA: Probabilistic Dropout Analysis for Identifying Differentially Abundant Proteins in Label-Free Mass Spectrometry*. [bioRxiv 661496](http://www.biorxiv.org/content/10.1101/661496v1) (Jun 2019)\n\n## Installation\n\nproDA is implemented as an [R](https://cran.r-project.org/) package.\n\nYou can install it from [Bioconductor](https://www.bioconductor.org/) by typing \nthe following commands into R:\n\n```{r eval=FALSE}\nif(!requireNamespace(\"BiocManager\", quietly = TRUE))\n    install.packages(\"BiocManager\")\nBiocManager::install(\"proDA\")\n```\n\nTo get the latest development version from\n[GitHub](https://github.com/const-ae/proDA), you can use\nthe [`devtools`](https://github.com/r-lib/devtools) package:\n\n```{r eval=FALSE}\n# install.packages(\"devtools\")\ndevtools::install_github(\"const-ae/proDA\")\n```\n\nThe pkgdown documentation for the package is available on\n\u003chttps://const-ae.github.io/proDA/reference\u003e.\n\n---\n\nIn the following section, I will give a very brief overview of the main\nfunctionality of the `proDA` package, aimed at experienced R users. 
\nNew users are advised to skip this \"quickstart\" and to go directly\nto the *proDA Walkthrough* section, where I give a complete walkthrough and explain in\ndetail what steps are necessary for the analysis of label-free mass\nspectrometry data.\n\n## Quickstart\n\nThe three steps that are necessary to analyze the data are\n\n1. Load the data (see vignette on loading MaxQuant output files)\n2. Fit the probabilistic dropout model (`proDA()`)\n3. Test in which proteins the coefficients of the model differ (`test_diff()`)\n\n```{r quickstart}\n# Load the package\nlibrary(proDA)\n# Generate a synthetic dataset with known structure\nsyn_dataset \u003c- generate_synthetic_data(n_proteins = 100, n_conditions = 2)\n\n# The abundance matrix\nsyn_dataset$Y[1:5, ]\n\n# Assignment of the samples to the two conditions\nsyn_dataset$groups\n\n# Fit the probabilistic dropout model\nfit \u003c- proDA(syn_dataset$Y, design = syn_dataset$groups)\n\n# Identify which proteins differ between Condition 1 and 2\ntest_diff(fit, `Condition_1` - `Condition_2`, sort_by = \"pval\", n_max = 5)\n```\n\nOther helpful functions for quality control are `median_normalization()` and \n`dist_approx()`.\n\n\n## proDA Walkthrough\n\n`proDA` is an R package that implements a powerful probabilistic dropout model\nto identify differentially abundant proteins. The package was specifically designed \nfor label-free mass spectrometry data and in particular to handle its many\nmissing values.\n\nBut all this is useless if you cannot load your data and get it into a shape that is usable.\nIn the next section, I will explain how to load the abundance matrix and bring it into\na useful form. The steps that I will go through are \n\n1. Load the `proteinGroups.txt` MaxQuant output table\n2. Extract the intensity columns and create the abundance matrix\n3. Replace the zeros with `NA`s and take the `log2()` of the data\n4. Normalize the data using `median_normalization()`\n5. 
Inspect sample structure with a heatmap of the distance matrix (`dist_approx()`)\n6. Fit the probabilistic dropout model with `proDA()`\n7. Identify differentially abundant proteins with `test_diff()`\n\n### Load Data\n\nI will now demonstrate how to load a MaxQuant output file. For more information\nabout other approaches for loading the data, please take a look at the vignette on loading\ndata.\n\nMaxQuant is one of the most popular tools for handling raw MS data. It produces\na number of files. The important file that contains the protein intensities is\ncalled `proteinGroups.txt`. It is a large table with detailed information about\nthe identification and quantification process for each protein group (which I will\nfrom now on just call \"protein\").\n\nThis package comes with an example `proteinGroups.txt` file, located in the \npackage folder. The file contains the reduced output from an experiment studying the different \nDHHCs in Drosophila melanogaster.\n\n```{r}\nsystem.file(\"extdata/proteinGroups.txt\", package = \"proDA\", mustWork = TRUE)\n```\n\nIn this example, I will use base R functions to load the data, because \nthey don't require any additional dependencies.\n\n```{r}\n# Load the table into memory\nmaxquant_protein_table \u003c- read.delim(\n    system.file(\"extdata/proteinGroups.txt\", package = \"proDA\", mustWork = TRUE),\n    stringsAsFactors = FALSE\n)\n```\n\nAs I have mentioned, the table contains a lot of information (359 columns!), but we\nare first of all interested in the columns which contain the measured intensities.\n\n```{r}\n# I use a regular expression (regex) to select the intensity columns\nintensity_colnames \u003c- grep(\"^LFQ\\\\.intensity\\\\.\", colnames(maxquant_protein_table), value=TRUE)\nhead(intensity_colnames)\n\n\n# Create the intensity matrix\nabundance_matrix \u003c- as.matrix(maxquant_protein_table[, intensity_colnames])\n# Adapt column and row names\ncolnames(abundance_matrix) \u003c- 
sub(\"^LFQ\\\\.intensity\\\\.\", \"\", intensity_colnames)\nrownames(abundance_matrix) \u003c- maxquant_protein_table$Protein.IDs\n# Print some rows of the matrix with short names so they fit on the screen\nabundance_matrix[46:48, 1:6]\n```\n\nAfter extracting the parts of the table we care about most, we have to modify them.\n\nFirstly, MaxQuant codes missing values as `0`. This is misleading, because the actual\nabundance probably was not zero, but just some value too small to be detected by the mass spectrometer.\nAccordingly, I will replace all `0` with `NA`. \n\nSecondly, the raw intensity values have a linear mean-variance relation. This is undesirable, because \na change of `x` units can be a large shift if the mean is small or irrelevant if the mean is large.\nLuckily, to make the variance approximately independent of the mean, we can just take the `log` of the intensities. Now a change\nof `x` units is as significant for highly abundant proteins as it is for low-abundance ones.\n\n```{r}\nabundance_matrix[abundance_matrix == 0] \u003c- NA\nabundance_matrix \u003c- log2(abundance_matrix)\nabundance_matrix[46:48, 1:6]\n\n```\n\n\n\n\n### Quality Control\n\nQuality control (QC) is essential for a successful bioinformatics analysis, because any dataset \nshows some unwanted variation or could even contain more serious errors, for example a sample\nswap.\n\nOften we start by normalizing the data to remove potential\nsample-specific effects. But already this step is challenging, because the missing values cannot\neasily be corrected for. Thus, a helpful first plot shows how many missing values are in each\nsample.\n\n```{r qc-mis_barplot,  out.width=\"60%\", fig.height=4, fig.align=\"center\"}\n\nbarplot(colSums(is.na(abundance_matrix)),\n        ylab = \"# missing values\",\n        xlab = \"Sample 1 to 36\")\n```\n\nWe can see that the number of missing values differs substantially between samples (between 30% and\n90%) in this dataset. 
If we take a look at the intensity distribution for each sample, we see that\nthey differ substantially as well. \n\n```{r qc-raw_boxplot, out.width=\"60%\", fig.height=4, fig.align=\"center\"}\nboxplot(abundance_matrix,\n        ylab = \"Intensity Distribution\",\n        xlab = \"Sample 1 to 36\")\n```\n\nNote that the intensity distribution is shifted upwards for samples that also have a large number\nof missing values (for example the last one). This agrees with our idea that small values are\nmore likely to be missing. On the other hand, this also demonstrates why normalization methods\nsuch as quantile normalization, which distort the data until all the distributions\nare equal, are problematic. I will apply the more \"conservative\" median normalization, which \nignores the missing values and transforms the values so that the median difference between the \nsample and the average across all other samples is zero.\n\n```{r}\nnormalized_abundance_matrix \u003c- median_normalization(abundance_matrix)\n```\n\nAn important tool for identifying sample swaps and outliers in the dataset is the sample\ndistance matrix. It shows the distances of samples A to B, A to C, B to C and so on.\n\nThe base R `dist()` function cannot handle input data that contains missing values, so we might be\ntempted to just replace the missing values with some realistic numbers and calculate the distance \non the completed dataset. But choosing a good replacement value is challenging and can also be misleading\nbecause the samples with many missing values would be considered too close.\n\nInstead, `proDA` provides the `dist_approx()` function that takes either a fitted model (i.e., the \noutput from `proDA()`) or a simple matrix (for which it internally calls `proDA()`) and \nestimates the expected distance without imputing the missing values. In addition, it reports\nthe uncertainty associated with every estimate. 
The estimates for samples with many missing\nvalues will be uncertain, allowing the data analyst to discount them.\n\n```{r}\nda \u003c- dist_approx(normalized_abundance_matrix)\n```\n\n\n`dist_approx()` returns two elements: the `mean` of the estimate and the associated `sd`.\nIn the next step, I will plot the heatmap for three different conditions, adding the 95% confidence\ninterval as text to each cell. \n\n```{r sample_dist, out.width=\"60%\", fig.align=\"center\"}\n# This chunk only works if pheatmap is installed\n# install.packages(\"pheatmap\")\nsel \u003c- c(1:3,  # CG1407\n         7:9,  # CG59163\n         22:24)# CG6618\n\nplot_mat \u003c- as.matrix(da$mean)[sel, sel]\n# Remove diagonal elements, so that the colorscale is not distorted\nplot_mat[diag(9) == 1] \u003c- NA\n# 95% conf interval is approx `sd * 1.96`\nuncertainty \u003c- matrix(paste0(\" ± \",round(as.matrix(da$sd * 1.96)[sel, sel], 1)), nrow=9)\npheatmap::pheatmap(plot_mat, \n                   cluster_rows = FALSE, cluster_cols = FALSE,\n                   display_numbers= uncertainty,\n                   number_color = \"black\")\n```\n\n\n\n### Fit the Probabilistic Dropout Model\n\nIn the next step, we will fit the actual linear probabilistic dropout model to the normalized\ndata. But before we start, I will create a data.frame that contains some additional information on\neach sample, in particular the condition to which each sample belongs.\n\n```{r}\n# The best way to create this data.frame depends on the column naming scheme\nsample_info_df \u003c- data.frame(name = colnames(normalized_abundance_matrix),\n                             stringsAsFactors = FALSE)\nsample_info_df$condition \u003c- substr(sample_info_df$name, 1, nchar(sample_info_df$name)  - 3)\nsample_info_df$replicate \u003c- as.numeric(\n  substr(sample_info_df$name, nchar(sample_info_df$name)  - 1, 20)\n)\nsample_info_df\n```\n\n\nNow we can call the `proDA()` function to actually fit the model. 
We specify the `design` using\nthe formula notation, referencing the `condition` column in the `sample_info_df` data.frame that\nwe have just created. In addition, I specify that I want to use the `S2R` condition as the reference\nbecause I know that it was the negative control; this way, all coefficients automatically\nmeasure how much each condition differs from the negative control.\n\n```{r}\nfit \u003c- proDA(normalized_abundance_matrix, design = ~ condition, \n             col_data = sample_info_df, reference_level = \"S2R\")\nfit\n```\n\nThe `proDAFit` object prints useful information about the convergence of the model,\nthe size of the dataset, the number of missing values, and the inferred hyperparameters.\n\nTo make it easy to find available methods on the `proDAFit` object, the `$`-operator is overloaded\nand shows a list of possible functions:\n\n![Screenshot from RStudio suggesting the available functions](vignettes/figures/README-screenshot_fit_functions.png)\n\n\n```{r}\n# Equivalent to feature_parameters(fit)\nfit$feature_parameters\n```\n\nInternally, the `proDAFit` object is implemented as a subclass of `SummarizedExperiment`.\nThis means it can be subset, which is useful, for example, for calculating the distance \nof a subset of proteins and samples.\n\n```{r protein_dist, out.width=\"60%\", fig.align=\"center\"}\n# This chunk only works if pheatmap is installed\n# install.packages(\"pheatmap\")\npheatmap::pheatmap(dist_approx(fit[1:20, 1:3], by_sample = FALSE)$mean)\n```\n\n\n\n### Identify Differential Abundance\n\nLastly, we will use a Wald test to identify in which proteins a coefficient is significantly different\nfrom zero. The `test_diff()` function takes the fit object produced by `proDA()` as its first argument, followed by a \ncontrast argument. This can either be a string or an expression if we want to test more complex\ncombinations. 
For example, `conditionCG1407 - (conditionCG6017 + conditionCG5880) / 2` would test\nfor the difference between CG1407 and the average of CG6017 and CG5880. \n\nAlternatively, `test_diff()` also supports likelihood ratio F-tests. In that case, specify the `reduced_model`\nargument instead of the `contrast` argument.\n\n```{r}\n# Test which proteins differ between condition CG1407 and S2R\n# S2R is the default contrast, because it was specified as the `reference_level`\ntest_res \u003c- test_diff(fit, \"conditionCG1407\")\ntest_res\n```\n\n\n\nThis walkthrough ends with the identification of the differentially abundant proteins. But for\na real dataset, the actual analysis only just begins at this point. A list of significant proteins is hardly\never a publishable result; one often needs to make sense of the relevant underlying biological \nmechanisms. But for this problem other tools are necessary, which depend on the precise \nquestion associated with the biological problem at hand.\n\n\n# Session Info\n\n```{r}\nsessionInfo()\n```\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconst-ae%2Fproda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fconst-ae%2Fproda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconst-ae%2Fproda/lists"}