
https://github.com/kenhanscombe/plink-custom-r

Setup and starter script for the PLINK R plugin.



Custom analysis with PLINK R plugin


If, like me, you thought this would be great but hadn’t actually got around to figuring out how to use it, here is a script to play with and some setup instructions.




It is possible to call R from PLINK. This facility allows you to keep genotype and phenotype data in PLINK binary format and perform a custom analysis. Below is an example of how this facility can be used to retrieve model fit statistics.


More information on PLINK’s R plugin functions is available in the PLINK 1.07 and 1.9 documentation, including details on changing the port, host, and socket.






Getting started


First, install the development version of PLINK and the latest version of R. Then open R and install the required packages: Rserve is always needed, while broom and a couple of tidyverse packages are needed only for the specific example below. Make a note of the Rserve installation location printed by install.packages; you will need to point to it when starting Rserve.
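The package setup described above could be sketched as follows. The helper name missing_packages is hypothetical (it is not part of any package), and the install call is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Return the packages from `pkgs` that are not yet installed.
# `missing_packages` is a hypothetical helper name used for illustration.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("Rserve", "broom", "tidyverse")
to_install <- missing_packages(needed)

# Uncomment to install; install.packages() prints where each package lands.
# if (length(to_install) > 0) install.packages(to_install)

# Once Rserve is installed, this prints its package directory. On Unix-like
# systems the Rserve binary usually sits in the libs/ subdirectory
# (an assumption; check your own installation).
# find.package("Rserve")
```

Checking first and installing only what is missing keeps the step safe to re-run.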


To copy the R script, clone this repository.


git clone https://github.com/kenhanscombe/plink-custom-r.git





Retrieve model fit statistics


In an R script (e.g. plink_custom_analysis.R), define a custom function. This script defines a pseudo-R-squared for a logistic regression analysis and uses the broom functions glance and tidy to collect fit statistics. (Note: before changing anything to suit your needs, see the Details section at the end.)


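Rplink <- function(PHENO, GENO, CLUSTER, COVAR) {

  library(tidyverse)
  library(broom)

  # Nagelkerke pseudo-R-squared: the Cox-Snell value rescaled by its maximum.
  pseudo_rsq <- function(model) {
    dev <- model$deviance
    null_dev <- model$null.deviance
    model_n <- length(model$fitted.values)
    r2_cox_snell <- 1 - exp(-(null_dev - dev) / model_n)
    r2_nagelkerke <- r2_cox_snell / (1 - exp(-null_dev / model_n))
    r2_nagelkerke
  }

  # Fit one logistic model per SNP (PLINK codes cases as 2, controls as 1).
  func <- function(snp) {
    m <- glm(PHENO == 2 ~ COVAR + snp, family = "binomial")
    rsq <- pseudo_rsq(m)
    # Model-level fit statistics as a named numeric vector.
    glance_m <- unlist(glance(m)[1, ])
    # Coefficient row for the SNP, which is the last term in the model.
    tidy_m <- tidy(m) %>% select(-term) %>% tail(n = 1) %>% unlist()
    summary_m <- c(tidy_m, glance_m, rsq)
    # PLINK expects the vector length followed by the values.
    c(length(summary_m), summary_m)
  }

  apply(GENO, 2, func)
}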
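Before involving PLINK, the pseudo-R-squared calculation can be tried on its own. The sketch below uses base R only and simulated data; the sample size, effect sizes, and variable names are all made up for illustration.

```r
set.seed(1)

# Simulated stand-ins (assumptions, not real data): 200 individuals,
# one covariate, one SNP coded 0/1/2, and a binary outcome.
n <- 200
covar <- rnorm(n)
snp <- rbinom(n, size = 2, prob = 0.3)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.4 * covar + 0.6 * snp))

# Same calculation as in the Rplink script: Cox-Snell R-squared divided by
# its maximum attainable value (Nagelkerke's rescaling).
pseudo_rsq <- function(model) {
  dev <- model$deviance
  null_dev <- model$null.deviance
  model_n <- length(model$fitted.values)
  r2_cox_snell <- 1 - exp(-(null_dev - dev) / model_n)
  r2_cox_snell / (1 - exp(-null_dev / model_n))
}

m <- glm(y ~ covar + snp, family = "binomial")
rsq <- pseudo_rsq(m)

# An intercept-only model explains nothing beyond the null, so its value
# should be (numerically) zero.
rsq_null <- pseudo_rsq(glm(y ~ 1, family = "binomial"))

print(c(full = rsq, null = rsq_null))
```

Because the residual deviance of a nested model cannot exceed the null deviance, the value falls between 0 and 1.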




To run the custom analysis, first start Rserve, giving R CMD the full path to the Rserve binary. All data input and filtering flags to PLINK remain the same: simply add --R [R script filename] to the PLINK call. The results of the custom analysis are written to plink.auto.R by default (as usual, you can change the file stem plink with --out).


# Start Rserve (use the installation path noted earlier).
R CMD /full/path/to/Rserve

# Add any other data input and filtering flags to this call as usual.
plink \
  --bfile {prefix} \
  --pheno [filename] \
  --covar [filename] \
  --logistic \
  --R plink_custom_analysis.R


NB. In the above example we’re collecting model fit statistics from a logistic regression (using the excellent package broom). --logistic is an optional sanity check: compare plink.assoc.logistic to plink.auto.R for effect size, signed statistic, and p-value. (Adding a header to the plink --R output helps. See the Output section below.)





Output


For each SNP in your analysis (i.e., each row in the output plink.auto.R), PLINK combines the vector of outputs v with the 4 values for CHR, SNP, BP, and A1. The R read command below adds a header to the custom output. You could of course do this in a bash one-liner, but if you’re going to use R to visualize your association results and model fit statistics, you can add column names on reading in the data.


library(tidyverse)

# These col_names correspond to the custom analysis above.
# Note: read_table2() is deprecated in readr 2.0 and later, where
# read_table() does the same job.
custom_plink_result <- read_table2(
  "plink.auto.R",
  col_names = c("chr", "snp", "bp", "a1", "estimate", "std_error", "statistic",
                "p_value", "null_deviance", "df_null", "logLik", "aic", "bic",
                "deviance", "df_residual", "pseudo_rsq"),
  col_types = cols(
    chr = col_integer(),
    snp = col_character(),
    bp = col_integer(),
    a1 = col_character(),
    estimate = col_double(),
    std_error = col_double(),
    statistic = col_double(),
    p_value = col_double(),
    null_deviance = col_double(),
    df_null = col_double(),
    logLik = col_double(),
    aic = col_double(),
    bic = col_double(),
    deviance = col_double(),
    df_residual = col_double(),
    pseudo_rsq = col_double()
  )
)
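If you would rather not load the tidyverse just to read the file, a base R version is sketched below. The single row written to the temporary file is invented purely to show the layout (4 PLINK columns followed by the 12 custom values); it is not real output.

```r
# Column names matching the custom analysis above.
col_names <- c("chr", "snp", "bp", "a1", "estimate", "std_error", "statistic",
               "p_value", "null_deviance", "df_null", "logLik", "aic", "bic",
               "deviance", "df_residual", "pseudo_rsq")

# Invented example row standing in for one line of plink.auto.R.
toy_file <- tempfile(fileext = ".R")
writeLines(
  "1 rs0001 12345 T 0.25 0.10 2.5 0.012 270.1 199 -130.2 266.4 276.3 260.4 197 0.065",
  toy_file
)

# Explicit column classes stop read.table() from guessing types, for example
# turning an A1 column of "T" alleles into logical TRUE.
custom_plink_result <- read.table(
  toy_file,
  header = FALSE,
  col.names = col_names,
  colClasses = c("integer", "character", "integer", "character",
                 rep("numeric", 12)),
  stringsAsFactors = FALSE
)

str(custom_plink_result)
```

For real output, replace toy_file with the path to your own plink.auto.R.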






Multi-SNP model


If you want to inspect the overall model fit of a multi-SNP model, or compare the relative fit of multiple genetic variants (e.g. your 3 favourite SNPs) against a null model (e.g. 10 PCs), you cannot include the SNPs with the --condition flag, because PLINK’s --R always runs the analysis defined in Rplink. There are a couple of workarounds.


One solution is to add the SNPs to the covariate file. First, convert the 3 SNPs to a 0/1/2 count of the reference allele with --recode A. The recoded SNPs appear in the last 3 columns of plink.raw. Add these 3 columns to the covariate file. Next, edit the model formula in Rplink so that it no longer includes the SNP term (i.e., delete + snp), then run your custom analysis twice: once with --covar-number 1-10 (the null model) and once with --covar-number 1-13 (the null model plus the 3 SNPs). Compare the 2 models.
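The step of adding the recoded SNP columns to the covariate file might look like the sketch below. Both data frames are toy stand-ins built in memory: the SNP names, IDs, and PC values are hypothetical, and in practice you would read plink.raw and your covariate file from disk instead.

```r
# Toy stand-in for plink.raw (written by --recode A): six leading columns,
# then one 0/1/2 column per SNP. The SNP names are hypothetical.
raw <- data.frame(
  FID = c("f1", "f2", "f3"),
  IID = c("i1", "i2", "i3"),
  PAT = 0, MAT = 0, SEX = c(1, 2, 1), PHENOTYPE = c(2, 1, 2),
  rs1_A = c(0, 1, 2), rs2_G = c(1, 1, 0), rs3_T = c(2, 0, 1)
)

# Toy stand-in for a covariate file: FID, IID, then 10 principal components.
set.seed(1)
pcs <- matrix(rnorm(3 * 10), nrow = 3,
              dimnames = list(NULL, paste0("PC", 1:10)))
covar <- data.frame(FID = raw$FID, IID = raw$IID, pcs)

# Keep the ID columns plus the last 3 (SNP) columns of the .raw table,
# then join them onto the covariates by individual.
snp_cols <- tail(names(raw), 3)
covar_snps <- merge(covar, raw[, c("FID", "IID", snp_cols)],
                    by = c("FID", "IID"), sort = FALSE)

# After FID and IID, covariates 1-10 are the PCs and 11-13 are the SNPs.
names(covar_snps)

# Uncomment to write a whitespace-delimited file to pass to --covar
# (the output filename is an arbitrary choice).
# write.table(covar_snps, "covar_with_snps.txt", quote = FALSE, row.names = FALSE)
```

Merging on FID and IID, rather than binding columns side by side, guards against the two files listing individuals in different orders.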








Details (summarised from PLINK 1.07 and 1.9 documentation)


For a sample of size n, genotyped at l genetic variants, including c covariates, all genotypes, phenotypes, covariates and cluster membership are accessible within the custom R script as:




PHENO A vector of phenotypes of length n.


GENO An n x l matrix of genotypes.


CLUSTER A vector of cluster membership codes of length n.


COVAR An n x c matrix of covariates.




The R script defines a function Rplink, with an obligatory header and return value, as follows:


Rplink <- function(PHENO, GENO, CLUSTER, COVAR) {

  # A function f is applied to the columns of GENO (i.e. to each genetic variant) and
  # must return a numeric vector v, combined with its length.
  f <- function(s) {

    # Function body

    c(length(v), v)
  }

  apply(GENO, 2, f)
}
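To make the template concrete, here is a minimal sketch that fills in the function body and calls Rplink directly in R, without PLINK or Rserve. The two statistics returned per variant and the simulated inputs are illustrative assumptions only.

```r
# Minimal Rplink: for each variant, return the mean genotype and the number
# of non-missing genotypes (illustrative choices, not a recommended analysis).
Rplink <- function(PHENO, GENO, CLUSTER, COVAR) {
  f <- function(s) {
    v <- c(mean(s, na.rm = TRUE), sum(!is.na(s)))
    c(length(v), v)
  }
  apply(GENO, 2, f)
}

# Simulated inputs with the shapes described above:
# n individuals, l variants, n_covar covariates.
set.seed(1)
n <- 50
l <- 4
n_covar <- 2
GENO <- matrix(rbinom(n * l, size = 2, prob = 0.3), nrow = n, ncol = l)
PHENO <- sample(1:2, n, replace = TRUE)
CLUSTER <- rep(0, n)
COVAR <- matrix(rnorm(n * n_covar), nrow = n, ncol = n_covar)

res <- Rplink(PHENO, GENO, CLUSTER, COVAR)

# One column per variant; the first row holds length(v) for that variant.
dim(res)
res
```

Running the function on simulated matrices like this is a quick way to catch errors in the function body before starting Rserve.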


