https://github.com/ltla/oiff

Optimizing an independent filter for the FDR
https://github.com/ltla/oiff

Last synced: about 1 year ago
JSON representation

Optimizing an independent filter for the FDR

Host: GitHub
URL: https://github.com/ltla/oiff
Owner: LTLA
License: mit
Created: 2022-12-05T11:25:00.000Z (over 3 years ago)
Default Branch: master
Last Pushed: 2022-12-08T06:59:10.000Z (over 3 years ago)
Last Synced: 2025-04-05T04:26:00.903Z (about 1 year ago)
Language: C++
Size: 273 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Optimizing an independent filter for FDR control

## Overview

Given a set of p-values and a filter statistic that is independent of the p-values under the null,

the **oiff** library identifies the filter threshold that maximizes the number of discoveries at a given FDR threshold.

Conceptually, it yields the same result as the following naive procedure:

1. Retain only those hypotheses where the filter statistic is below some filter threshold.

2. Apply the Benjamini-Hochberg (BH) method to the retained hypotheses.

3. Count the number of discoveries among the retained hypotheses at a given FDR threshold.

4. Repeat 1-3 to find the filter threshold that maximizes the number of discoveries.

This can provide a "sensible" choice for the filter threshold when no _a priori_ setting is available.

For example, we often filter out low-abundance features prior to differential analyses of genomic data,

on the basis that the abundance of a feature is usually independent of its p-value.

## Quick start

C++ users can just link to [the header](include/oiff/oiff.hpp) and run:

```cpp

#include "oiff/oiff.hpp"

std::vector pvalues; // fill with p-values

std::vector covariates; // fill with covariates

// Finds the optimal filter at a FDR threshold of 0.05.

oiff::OptimizeFilter runner;

runner.fdr_threshold = 0.05;

auto res = runner.run(pvalues.size(), pvalues.data(), covariates.data());

res.middle; // one choice of filter threshold

res.number; // number of discoveries

// Run with subsampling and take the average of subsample iterations.

auto res2 = runner.run_subsample(pvalues.size(), pvalues.data(), covariates.data());

double mean_threshold = 0;

for (auto x : res2) {

    mean_threshold += x.middle;

}

mean_threshold /= res2.size();

```

R users can install [the test package](R/) and run the example:

```r

library(oiff)

pvalues <- c(runif(9900), rbeta(100, 1, 50))

filter <- c(rnorm(9900), rnorm(100) - 2)

findOptimalFilter(pvalues, filter)

```

Check out the [reference documentation](https://ltla.github.io/oiff) for more details.

## Building projects 

If you're using CMake, you just need to add something like this to your `CMakeLists.txt`:

```

include(FetchContent)

FetchContent_Declare(

  oiff

  GIT_REPOSITORY https://github.com/LTLA/oiff

  GIT_TAG master # or any version of interest 

)

FetchContent_MakeAvailable(oiff)

```

Then you can link to **oiff** to make the headers available during compilation:

```

# For executables:

target_link_libraries(myexe oiff)

# For libaries

target_link_libraries(mylib INTERFACE oiff)

```

## Comments on performance

Computationally, **oiff** uses an interval tree to avoid repeated invocations of the BH method.

This means that the algorithm is very fast for large numbers of hypotheses:

```r

library(oiff)

pvalues <- c(runif(999000), rbeta(1000, 1, 50))

filter <- c(rnorm(999000), rnorm(1000) + 2)

system.time(expected <- findOptimalFilter(pvalues, filter, above=TRUE))

##    user  system elapsed

##   0.458   0.008   0.466

```

Statistically, this approach is flawed as it does not guarantee control of the FDR.

By allowing the filter threshold to vary in a manner that depends on the p-values,

**oiff** will systematically include more false discoveries than allowed for under the BH method.

Here is a simple demonstration of the problem:

```r

library(oiff)

num.discoveries <- numeric(1000)

ref.discoveries <- numeric(1000)

for (it in seq_along(num.discoveries)) {

    # Generating null hypotheses.

    pval <- runif(100) 

    filter <- rnorm(100)

    # Injecting a single true positive that is always retained.

    pval <- c(0, pval)

    filter <- c(100, filter)

    # Using an optimal filter threshold.

    expected <- findOptimalFilter(pval, filter, threshold=0.05, above=TRUE)

    num.discoveries[it] <- expected$number

    # Compared to a constant filter.

    above.zero <- pval[filter >= 0]

    ref.discoveries[it] <- sum(p.adjust(above.zero, method="BH") <= 0.05)

}

# Calculating the FDR after removing the lone true positive: 

mean((num.discoveries - 1) / num.discoveries)

## [1] 0.1866667

mean((ref.discoveries - 1) / ref.discoveries)

## [1] 0.04925

```

A practical mitigation is to derive the threshold from a small subsample of hypotheses.

This preserves any dependencies between the p-values and filter statistic _under the alternative hypothesis_,

thus ensuring that we still reap the benefits of filter optimization.

The use of a small subsample means that the chosen filter threshold is independent of the p-values for the remaining hypotheses,

limiting the severity of the loss of FDR control (assuming that the various hypotheses are independent of each other).

This is inspired by the cross-validation procedure in the [**IHW**](https://bioconductor.org/packages/IHW) package.

```r

library(oiff)

num.discoveries <- numeric(1000)

for (it in seq_along(num.discoveries)) {

    # Generating null hypotheses.

    pval <- runif(100) 

    filter <- rnorm(100)

    # Injecting a single true positive that is always retained.

    pval <- c(0, pval)

    filter <- c(100, filter)

    # Using an optimal filter threshold based on a subsample. 

    expected <- findOptimalFilter(pval, filter, threshold=0.05, above=TRUE, subsample=0.1)

    keep <- filter >= expected$middle

    num.discoveries[it] <- sum(p.adjust(pval[keep], method="BH") <= 0.05)

}

mean((num.discoveries - 1) / num.discoveries)

## [1] 0.05008333

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/ltla/oiff

Awesome Lists containing this project

README