{"id":18046652,"url":"https://github.com/ltla/oiff","last_synced_at":"2025-06-16T06:07:18.204Z","repository":{"id":82716817,"uuid":"574465150","full_name":"LTLA/oiff","owner":"LTLA","description":"Optimizing an independent filter for the FDR","archived":false,"fork":false,"pushed_at":"2022-12-08T06:59:10.000Z","size":280,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-05T04:26:00.903Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LTLA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-05T11:25:00.000Z","updated_at":"2022-12-05T11:25:17.000Z","dependencies_parsed_at":"2023-04-14T09:18:36.447Z","dependency_job_id":null,"html_url":"https://github.com/LTLA/oiff","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/LTLA/oiff","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2Foiff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2Foiff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2Foiff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2Foiff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LTLA","download_url":"https://codeload.github.com/LTLA/oiff/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/LTLA%2Foiff/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260109486,"owners_count":22960031,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-30T19:08:23.323Z","updated_at":"2025-06-16T06:07:18.185Z","avatar_url":"https://github.com/LTLA.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Optimizing an independent filter for FDR control\n\n## Overview\n\nGiven a set of p-values and a filter statistic that is independent of the p-values under the null,\nthe **oiff** library identifies the filter threshold that maximizes the number of discoveries at a given FDR threshold.\nConceptually, it yields the same result as the following naive procedure:\n\n1. Retain only those hypotheses where the filter statistic is below some filter threshold.\n2. Apply the Benjamini-Hochberg (BH) method to the retained hypotheses.\n3. Count the number of discoveries among the retained hypotheses at a given FDR threshold.\n4. 
Repeat 1-3 to find the filter threshold that maximizes the number of discoveries.\n\nThis can provide a \"sensible\" choice for the filter threshold when no _a priori_ setting is available.\nFor example, we often filter out low-abundance features prior to differential analyses of genomic data,\non the basis that the abundance of a feature is usually independent of its p-value.\n\n## Quick start\n\nC++ users can just link to [the header](include/oiff/oiff.hpp) and run:\n\n```cpp\n#include \u003cvector\u003e\n#include \"oiff/oiff.hpp\"\n\nstd::vector\u003cdouble\u003e pvalues; // fill with p-values\nstd::vector\u003cdouble\u003e covariates; // fill with covariates\n\n// Finds the optimal filter at an FDR threshold of 0.05.\noiff::OptimizeFilter runner;\nrunner.fdr_threshold = 0.05;\nauto res = runner.run(pvalues.size(), pvalues.data(), covariates.data());\nres.middle; // one choice of filter threshold\nres.number; // number of discoveries\n\n// Run with subsampling and take the average of subsample iterations.\nauto res2 = runner.run_subsample(pvalues.size(), pvalues.data(), covariates.data());\ndouble mean_threshold = 0;\nfor (auto x : res2) {\n    mean_threshold += x.middle;\n}\nmean_threshold /= res2.size();\n```\n\nR users can install [the test package](R/) and run the example:\n\n```r\nlibrary(oiff)\npvalues \u003c- c(runif(9900), rbeta(100, 1, 50))\nfilter \u003c- c(rnorm(9900), rnorm(100) - 2)\nfindOptimalFilter(pvalues, filter)\n```\n\nCheck out the [reference documentation](https://ltla.github.io/oiff) for more details.\n\n## Building projects\n\nIf you're using CMake, you just need to add something like this to your `CMakeLists.txt`:\n\n```\ninclude(FetchContent)\n\nFetchContent_Declare(\n  oiff\n  GIT_REPOSITORY https://github.com/LTLA/oiff\n  GIT_TAG master # or any version of interest\n)\n\nFetchContent_MakeAvailable(oiff)\n```\n\nThen you can link to **oiff** to make the headers available during compilation:\n\n```\n# For executables:\ntarget_link_libraries(myexe oiff)\n\n# For 
libraries:\ntarget_link_libraries(mylib INTERFACE oiff)\n```\n\n## Comments on performance\n\nComputationally, **oiff** uses an interval tree to avoid repeated invocations of the BH method.\nThis means that the algorithm is very fast for large numbers of hypotheses:\n\n```r\nlibrary(oiff)\npvalues \u003c- c(runif(999000), rbeta(1000, 1, 50))\nfilter \u003c- c(rnorm(999000), rnorm(1000) + 2)\nsystem.time(expected \u003c- findOptimalFilter(pvalues, filter, above=TRUE))\n##    user  system elapsed\n##   0.458   0.008   0.466\n```\n\nStatistically, this approach is flawed as it does not guarantee control of the FDR.\nBy allowing the filter threshold to vary in a manner that depends on the p-values,\n**oiff** will systematically include more false discoveries than allowed for under the BH method.\nHere is a simple demonstration of the problem:\n\n```r\nlibrary(oiff)\nnum.discoveries \u003c- numeric(1000)\nref.discoveries \u003c- numeric(1000)\n\nfor (it in seq_along(num.discoveries)) {\n    # Generating null hypotheses.\n    pval \u003c- runif(100) \n    filter \u003c- rnorm(100)\n\n    # Injecting a single true positive that is always retained.\n    pval \u003c- c(0, pval)\n    filter \u003c- c(100, filter)\n\n    # Using an optimal filter threshold.\n    expected \u003c- findOptimalFilter(pval, filter, threshold=0.05, above=TRUE)\n    num.discoveries[it] \u003c- expected$number\n\n    # Compared to a constant filter.\n    above.zero \u003c- pval[filter \u003e= 0]\n    ref.discoveries[it] \u003c- sum(p.adjust(above.zero, method=\"BH\") \u003c= 0.05)\n}\n\n# Calculating the FDR after removing the lone true positive: \nmean((num.discoveries - 1) / num.discoveries)\n## [1] 0.1866667\nmean((ref.discoveries - 1) / ref.discoveries)\n## [1] 0.04925\n```\n\nA practical mitigation is to derive the threshold from a small subsample of hypotheses.\nThis preserves any dependencies between the p-values and filter statistic _under the alternative hypothesis_,\nthus ensuring that we 
still reap the benefits of filter optimization.\nThe use of a small subsample means that the chosen filter threshold is independent of the p-values for the remaining hypotheses,\nlimiting the severity of the loss of FDR control (assuming that the various hypotheses are independent of each other).\nThis is inspired by the cross-validation procedure in the [**IHW**](https://bioconductor.org/packages/IHW) package.\n\n```r\nlibrary(oiff)\nnum.discoveries \u003c- numeric(1000)\n\nfor (it in seq_along(num.discoveries)) {\n    # Generating null hypotheses.\n    pval \u003c- runif(100) \n    filter \u003c- rnorm(100)\n\n    # Injecting a single true positive that is always retained.\n    pval \u003c- c(0, pval)\n    filter \u003c- c(100, filter)\n\n    # Using an optimal filter threshold based on a subsample. \n    expected \u003c- findOptimalFilter(pval, filter, threshold=0.05, above=TRUE, subsample=0.1)\n    keep \u003c- filter \u003e= expected$middle\n    num.discoveries[it] \u003c- sum(p.adjust(pval[keep], method=\"BH\") \u003c= 0.05)\n}\n\nmean((num.discoveries - 1) / num.discoveries)\n## [1] 0.05008333\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fltla%2Foiff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fltla%2Foiff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fltla%2Foiff/lists"}