Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/const-ae/prodd_old
Differential Detection for Label-free (LFQ) Mass Spec Data
https://github.com/const-ae/prodd_old
Last synced: 9 days ago
JSON representation
Differential Detection for Label-free (LFQ) Mass Spec Data
- Host: GitHub
- URL: https://github.com/const-ae/prodd_old
- Owner: const-ae
- Created: 2017-11-20T10:44:17.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2017-11-20T11:23:25.000Z (about 7 years ago)
- Last Synced: 2024-11-06T14:00:49.967Z (about 2 months ago)
- Language: C++
- Homepage:
- Size: 7.07 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
README
---
title: "proDD"
output: github_document
---```{r setup, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
cache = TRUE,
fig.path = "tools/README-fig/",
cache.path = "tools/README-cache/",
message = FALSE,
warning = FALSE
)
```Differential Detection with Label-free Mass Spec Data
## Overview
This package provides a framework to find proteins in mass spec data that are differentially detected between groups.
It is designed to deal with high number of missing values (i.e. zeros) and can nonetheless give reliable significance
estimates.It is thus applicable to data from affinity purification experiments such as BioID.
## Method
The algorithm is build around the fact that a missing values are more likely to occur if the intensity of protein
is low, which means that a missing observation can tell us something. In the first step the algorithm quantifies this dependency by
estimating a logistic regression of the chance to miss a value depending on the underlying intensity. This model
is fitted using Hamiltonian Monte Carlo method, because precise estimates of the sigmoid are necessary for reliable
downstream calculations. In the second step the group means for each condition and protein are estimated using a
maximum likelihood approach. To find which groups are actually significantly expressed in the last step a moderated
t-test is applied to each protein.Unlike other approaches that have been suggested in the literature that rely on imputing missing values using _ad hoc_
methods, such as just using half the global minimum, proDD exploits the information provided by the zeros in a
structured way and focuses on the MLE of the group means, which are sufficient to establish significance.## Workflow
Installation
```{r}
# Install directly from github
devtools::github("const-ae/proDD")
```Let's assume that `X` is a matrix where each row contains the intensity for one protein and each column is one
sample, which can be grouped into conditions.```{r, echo=FALSE}
library(proDD)
source("tests/testthat/helper_datageneration.R")
data <- generate_zero_inflated_data_with_effect(N_genes=100, N_rep=3, perc_changed = 0, mu0=8.5, nu0=5, sigma0=0.4, location=8, scale=-0.3)
X <- cbind(data$X, data$Y)
colnames(X) <- c(paste0("A_", 1:3), paste0("B_", 1:3))
X <- X[rowSums(X) != 0, ]
``````{r}
library(proDD)
head(X, n=10)
``````{r, echo=FALSE}
ComplexHeatmap::Heatmap((X != 0)*1.0, cluster_rows=FALSE, cluster_columns= FALSE,
col=c("black", "lightgrey"), name="Value Observed")
```For subsequent steps a description of the samples is necessary, i.e. which sample belongs to which condition.
For this we will create a dataframe containing that information:```{r}
data_description <- data.frame(Condition=as.factor(c(rep("A", 3), rep("B", 3))),
Replicate=c(1:3, 1:3))
data_description$Sample <- paste0(data_description$Condition, data_description$Replicate)
data_descriptiondesign <- model.matrix(Sample ~ Condition - 1, data_description)
design
```Now we can apply the algorithm that consists of three steps to that data
1. Estimate the parameters for the variance moderation:
```{r}
vm_est <- estimate_variance_moderation(X, design)
```2. Estimate the sigmoid that describes the chance to miss an observation:
```{r}
sig_est <- estimate_sigmoid(X, data_description, vm_est$nu_est, vm_est$sigma2_est, chains=1)
```3. Estimate the means of each condition per protein
```{r}
group_locations <- estimate_group_means(X, design, vm_est$nu_est, vm_est$sigma2_est, sig_est$location_est, sig_est$scale_est)
```4. Lastly, apply the moderated t-test to the group means to find differentially detected proteins:
```{r}
result <- detect_differences(X, design, data_description, d0=vm_est$nu_est, s0=vm_est$sigma2_est,
group_locations=group_locations, comparison=c("A", "B"))
head(result, n=10)
```## Note
This project is still work in progress and although the algorithm is working well, the API will probably change dramatically.