{"id":18501587,"url":"https://github.com/const-ae/prodd","last_synced_at":"2025-07-19T07:09:42.394Z","repository":{"id":82391882,"uuid":"156582064","full_name":"const-ae/proDD","owner":"const-ae","description":"Differential Detection for Label-free (LFQ) Mass Spec Data","archived":false,"fork":false,"pushed_at":"2020-01-12T17:12:14.000Z","size":1343,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-17T01:42:14.940Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/const-ae.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-07T17:15:18.000Z","updated_at":"2024-10-17T00:30:01.000Z","dependencies_parsed_at":"2023-04-19T23:31:45.913Z","dependency_job_id":null,"html_url":"https://github.com/const-ae/proDD","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/const-ae%2FproDD","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/const-ae%2FproDD/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/const-ae%2FproDD/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/const-ae%2FproDD/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/const-ae","download_url":"https://codeload.github.com/const-ae/proDD/tar.gz/refs/heads/master","host":{"name":"GitHub","url":
"https://github.com","kind":"github","repositories_count":254095160,"owners_count":22013716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T13:54:21.268Z","updated_at":"2025-05-14T07:32:06.481Z","avatar_url":"https://github.com/const-ae.png","language":"R","readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. Please edit that file --\u003e\n\n```{r setup, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  fig.align = \"center\",\n  out.width = \"80%\",\n  dpi = 170\n)\n```\n\n# proDD\n\nDifferential Detection for Label-free (LFQ) Mass Spectrometry Data\n\nThe tool fits a probabilistic dropout model to an intensity matrix from label-free quantification (LFQ). Dropouts in\nLFQ data occur when a protein has a low intensity. Our model takes this non-random missingness into account by constructing\na Bayesian hierarchical model. After fitting the model you can sample from the posterior distribution of the mean for each\nprotein and condition. 
The posteriors are a useful basis for calculating all kinds of statistics/metrics, including the probability\nthat the intensity of a protein in one condition is smaller than in the control (similar to a one-sided p-value).\n\n# Installation\n\nInstall the latest version directly from GitHub (make sure that `devtools` is installed)\n\n```{r eval=FALSE, include=TRUE}\ndevtools::install_github(\"const-ae/proDD\")\n```\n\n# Disclaimer\n\nI am still actively working on the project and although the algorithm is working fine at this point, the API might still be subject to change.\n\n\n\n# Walkthrough\n\nIn the following section I will explain how to use the `proDD` package to identify\ndifferentially detected proteins in label-free mass spectrometry data. I will highlight\nall the important functions the package provides.\n\nFirst we load the `proDD` package and some additional packages\nthat we will use to plot our data.\n\n```{r}\n# My package\nlibrary(proDD)\n\n# Packages for plotting\nlibrary(ggplot2)\nlibrary(pheatmap)\nlibrary(viridisLite)\nset.seed(1)\n```\n\n\nNext we will load some data. To make our lives easier we will use\nsynthetic data, where we know which proteins have changed and which have not.\nFor this we will use the `generate_synthetic_data()` function. 
We define that 10% of the\nproteins differ between condition A and B.\n\n```{r}\n\n# The experimental_design is a vector that assigns each sample to one condition\nexperimental_design \u003c- c(\"A\", \"A\", \"A\", \"B\", \"B\", \"B\")\n\n# generate_synthetic_data() can be customized a lot, but here we will\n# use it in its most basic form\nsyn_data \u003c- generate_synthetic_data(n_rows=1234, experimental_design=experimental_design,\n                                    frac_changed = 0.1)\n\n# The data matrix, where non-observed values are coded as NA\nX \u003c- syn_data$X\n\n# Median normalization to remove sample effects\nX \u003c- median_normalization(X)\n\n# The columns are the samples and each row is a protein\nhead(X)\n```\n\n\nTo get a better impression of the raw data\nwe will make a heatmap (using the `pheatmap` package). Unfortunately\nthe `hclust` method that is used internally does not support missing values,\nso for this plot we simply replace all missing values with zero.\n\n\n\n```{r}\nX_for_plotting \u003c- X\nX_for_plotting[is.na(X_for_plotting)] \u003c- 0\npheatmap(X_for_plotting,\n         main=paste0(round(sum(is.na(X))/prod(dim(X)) * 100), \"% missing values\"),\n         annotation_row = data.frame(changed=as.character(syn_data$changed),\n                                     row.names = rownames(X_for_plotting)),\n         show_rownames = FALSE)\n```\n\n\nOne important observation is that the missing values do not occur randomly,\nbut predominantly at low intensities. 
This can be seen most clearly when\nlooking at proteins that have some observed and some missing values.\n\n```{r, fig.width=6, fig.height=2.8}\nhist_tmp_data \u003c- data.frame(intensity=c(X),\n           row_has_missing=c(t(apply(X, 1, function(x) rep(any(is.na(x)), ncol(X))))))\n\nggplot(hist_tmp_data, aes(x=intensity, fill=row_has_missing)) +\n    geom_histogram() +\n    xlim(12, 32)\n\n```\n\nWe conclude from this that there is a certain dropout probability associated\nwith each latent intensity. At low intensities (e.g. `\u003c15`) it is almost certain\nthat the value drops out, whereas at high intensities (e.g. `\u003e25`) almost\nno value is missing. We capture this idea using a sigmoid-shaped dropout\ncurve that looks roughly like this:\n\n```{r, fig.width=6, fig.height=2.8}\ndropout_curve \u003c- data.frame(intensity=seq(12, 32, length.out=1001))\ndropout_curve$dropout_probability \u003c- invprobit(dropout_curve$intensity, rho=18, zeta=-2.5)\n\nggplot(hist_tmp_data, aes(x=intensity)) +\n    geom_histogram(aes(fill=row_has_missing)) +\n    geom_line(data=dropout_curve, aes(y=dropout_probability * 600), color=\"red\") +\n    xlim(12, 32)\n\n```\n\n\nOur probabilistic dropout algorithm has two major steps. In the first step\nwe infer important hyper-parameters of the model using an EM algorithm. The\nhyper-parameters that we identify are\n\n* the location and scale of the dropout curve for each sample (called `rho` and `zeta`)\n* the overall location of the values (`mu0` and `sigma20`)\n* a prior for the protein variances (`nu` and `eta`).\n\n```{r}\n# To see the progress while fitting set verbose=TRUE\nparams \u003c- fit_parameters(X, experimental_design)\nparams\n```\n\nAs we can see, the method has successfully converged, so we can continue. If\nit had not converged, we would increase `max_iter`. 
In this example we are working\non a moderately sized data set. Usually a thousand proteins are enough to\nmake good estimates of the hyper-parameters, so if your dataset has many more proteins\nyou can easily speed up the inference by setting, for example, `n_subsample=1000`.\n\n\nKnowing the general distribution of our data, we might be interested in how the\nsamples are related. Naively we would just calculate the distance matrix\nusing `dist(t(X))`. But `dist` simply scales up vectors containing missing values,\nwhich is equivalent to a kind of mean imputation. This does not really\nmake sense, given what we have seen about where missing values actually occur.\n\n```{r, fig.width=4.8, fig.height=4.8, out.width=\"50%\"}\nnaive_dist \u003c- dist(t(X))\npheatmap(as.matrix(naive_dist), cluster_rows=FALSE, cluster_cols = FALSE,\n         color=viridisLite::plasma(n=100, direction=-1),\n         breaks = seq(30, 60, length.out=101),\n         display_numbers=round(as.matrix(naive_dist)),\n         number_color = \"black\")\n```\n\n\n\nInstead our package provides a function called `dist_approx` that estimates the\ndistances while properly taking the missing values into account. Due to the missing\nvalues we cannot be certain of the exact distances, so in addition to the best guess\nof each distance the function returns an estimate of how variable that\nguess is. 
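\n\nTo make the scaling-up behaviour of `dist` concrete (this is base R behaviour documented in `?dist`, not part of `proDD`): when only m of the p coordinates of a pair of vectors are pairwise complete, the sum of squared differences is scaled up by p/m, which amounts to imputing the average observed squared difference for every missing coordinate. A minimal sketch:\n\n```{r}\n# dist() sums the squared differences over the complete coordinate pairs and\n# scales that sum up by p/m before taking the square root (see ?dist)\nx \u003c- c(1, 2, NA, 4)\ny \u003c- c(1, 3, 5, 4)\ndist(rbind(x, y))\n# the same value, computed by scaling up the partial distance by hand\nsqrt(4/3 * sum((x - y)^2, na.rm = TRUE))\n```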
\n\n\n\n```{r, fig.width=4.8, fig.height=4.8, out.width=\"50%\"}\n# Removing condition information to get unbiased results\ndist_proDD \u003c- dist_approx(X, params, by_sample=TRUE, blind=TRUE)\n# The mean and standard deviation of the sample distance estimates\npheatmap(as.matrix(dist_proDD$mean), cluster_rows=FALSE, cluster_cols=FALSE,\n         color=viridisLite::plasma(n=100, direction=-1),\n         breaks = seq(30, 60, length.out=101),\n         display_numbers=matrix(paste0(round(as.matrix(dist_proDD$mean)), \" ± \",\n                              round(sqrt(as.matrix(dist_proDD$var)), 1)), nrow=6),\n         number_color = \"black\")\n```\n\n\nAfter making sure that there are no extreme outliers in our data and that the heatmap\nshows the group structure we expected, we continue by inferring\nthe posterior distribution of the mean for each protein and condition.\n\nThese posterior distributions form the basis of the subsequent steps for identifying\ndifferentially detected proteins.\n\n\n```{r}\n# Internally this function uses Stan to sample the posteriors.\n# Stan provides a lot of output which you can see by setting verbose=TRUE\nposteriors \u003c- sample_protein_means(X, params, verbose=FALSE)\n```\n\n\nNow that we have a good idea of the latent intensity of each protein,\nwe can go on to identify the differentially detected proteins.\n\n```{r}\nresult \u003c- test_diff(posteriors$A, posteriors$B)\n\n# The resulting data.frame\nhead(result)\n\n# The most significant changes\nhead(result[order(result$pval), ])\n```\n\n\n\nA popular way to look at such data is to make a volcano plot. 
Here we will\nuse the fact that we generated the data to highlight the proteins that\nwere actually changed.\n\n```{r}\nresult$changed \u003c- syn_data$changed\n\nggplot(result, aes(x=diff, y=-log10(pval), color=changed)) +\n    geom_point() +\n    ggtitle(\"Volcano plot highlighting truly changed values\")\n```\n\nWe know that 10% of the data was changed, and in the volcano plot we can see that\nour method does a good job of identifying many of those proteins.\n\nAn interesting way to look at the data is to explicitly consider how many values\nwere observed per condition. So we will make 16 plots, comparing the\ndifference between A and B when we have 3 vs. 3, 3 vs. 2, 3 vs. 1, etc.\nobserved values.\n\n```{r}\nresult$nobs_a \u003c- rowSums(! is.na(X[, experimental_design == \"A\"]))\nresult$nobs_b \u003c- rowSums(! is.na(X[, experimental_design == \"B\"]))\n\nggplot(result, aes(x=diff, y=-log10(pval), color=changed)) +\n    geom_point() +\n    facet_grid(nobs_a ~ nobs_b, labeller = label_both)\n```\n\n\n\n\nUsing this data we can also make an MA plot, where we color the points\nby the number of observations.\n\n```{r}\nresult$label \u003c- paste0(pmax(result$nobs_a, result$nobs_b), \"-\", pmin(result$nobs_a, result$nobs_b))\n\nggplot(result, aes(x=mean, y=diff, color=label, shape=changed)) +\n    geom_point(size=2) +\n    ggtitle(\"MA plot comparing the number of observed values\")\n\nggplot(result, aes(x=mean, y=diff, color=adj_pval \u003c 0.05)) +\n    geom_point(size=2) +\n    ggtitle(\"MA plot identifying significant and non-significant values\")\n\n```\n\n\n\n\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconst-ae%2Fprodd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fconst-ae%2Fprodd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconst-ae%2Fprodd/lists"}