https://github.com/hdg204/DNANexus_GRS

An R Function for calculating genetic risk scores in the UK Biobank cohort using DNA Nexus

# DNANexus GRS

## Description

This repository contains two functions:

**Calculate_GRS.R** is a function built to evaluate genetic risk scores in the UK Biobank cohort, using the imputed genotypes and RStudio Workbench on the DNA Nexus platform. It takes one input, a file listing the chromosome, base pair, other allele, effect allele, and weight for each SNP, and returns a data frame with two columns, `eid` and `grs`. Please note that it takes about 20 minutes to calculate a GRS on the default DNA Nexus settings. Most of this time is spent extracting SNPs from the BGEN files, which is a slow process in R.

Important notes:

* If any SNPs are missing, they are excluded silently, with no warning. I'm working on it.
* SNPs must be entered in chr bp format, and must be in build 37. This is to match the index bgen files stored on the DNA Nexus RAP.
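
For orientation, a weighted genetic risk score is usually the sum, over SNPs, of the effect-allele dosage multiplied by that SNP's weight. The sketch below shows that arithmetic on a made-up dosage matrix in base R. It is not the code inside Calculate_GRS.R, and the names `dosage`, `weights`, `missing_snps` and `toy_grs` are hypothetical. It also shows one way a missing SNP could be reported rather than dropped silently.

```r
# Toy effect-allele dosages (0 to 2) for 3 participants and 3 SNPs.
# All values are invented for illustration.
dosage <- matrix(
  c(0, 1, 2,
    1, 1, 0,
    2, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1001", "1002", "1003"), c("snp_a", "snp_b", "snp_c"))
)

# Per-SNP weights; snp_d is requested but absent from the dosage matrix.
weights <- c(snp_a = 0.10, snp_b = 0.25, snp_c = -0.05, snp_d = 0.30)

# Report requested SNPs that were not found instead of dropping them silently.
missing_snps <- setdiff(names(weights), colnames(dosage))
if (length(missing_snps) > 0) {
  message("SNPs not found and excluded: ", paste(missing_snps, collapse = ", "))
}

# Weighted sum over the SNPs that are present.
found <- intersect(names(weights), colnames(dosage))
toy_grs <- data.frame(
  eid = rownames(dosage),
  grs = as.vector(dosage[, found, drop = FALSE] %*% weights[found])
)
print(toy_grs)
```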

**extract_snp.R** is not required for Calculate_GRS, but is a potentially useful related function that extracts the genotype information for one SNP from the imputed data and stores it in a data frame. The function takes two inputs, chromosome and base pair, and returns a list with two outputs, one with the genotype data and one with the SNP info. The genotype data is a data frame with two columns, `id` and `genotype`. The SNP info contains `chromosome`, `position`, `rsid`, `number_of_alleles`, `allele0`, and `allele1`.

`extract_snp` can be run using, for example, `extract_snp(8,128077146)`. It takes about one minute and is not recommended for extracting lots of SNPs. The speed of these functions is limited by the speed of `bgen.load`, and a future release will add an extra function to make this process quicker for multiple SNPs.
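
For a handful of SNPs, the per-SNP results can be collected into one wide table. The sketch below is only an illustration of that pattern: `stub_extract_snp` is a hypothetical stand-in so the code runs without access to the BGEN files, and the order of the two list elements is an assumption that should be checked against the real function with `str()`. The second position is a made-up placeholder.

```r
# Hypothetical stand-in for extract_snp() so the sketch runs without BGEN access.
# It mimics the documented return shape: a list holding a genotype data frame
# (id, genotype) and a SNP info data frame. The element order is an assumption.
stub_extract_snp <- function(chr, bp) {
  list(
    data.frame(id = c(1001, 1002, 1003), genotype = c(0, 1, 2)),
    data.frame(chromosome = chr, position = bp, rsid = NA,
               number_of_alleles = 2, allele0 = NA, allele1 = NA)
  )
}

# Positions to extract (build 37). The first is the example from this README;
# the second is an invented placeholder. Integers avoid scientific notation
# when the positions are pasted into column names.
snps <- data.frame(chr = c(8L, 8L), bp = c(128077146L, 128000000L))

# Call the extractor once per SNP and keep only the genotype data frame,
# renaming the genotype column so each SNP gets its own column.
per_snp <- lapply(seq_len(nrow(snps)), function(i) {
  res <- stub_extract_snp(snps$chr[i], snps$bp[i])
  geno <- res[[1]]
  names(geno)[names(geno) == "genotype"] <- paste0("chr", snps$chr[i], "_", snps$bp[i])
  geno
})

# Merge the per-SNP data frames on id to get one wide table.
wide <- Reduce(function(a, b) merge(a, b, by = "id"), per_snp)
print(wide)
```

With the real function, each call would take about a minute, so this pattern only suits short SNP lists.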

## Example script for Calculate_GRS

This script has been written to run on the RStudio Workbench on DNA Nexus. First, run

`library(devtools)`

`source_url("https://raw.githubusercontent.com/hdg204/DNANexus_GRS/main/Calculate_GRS.R")`

Then, a genetic risk score can be generated by running

`grs=generate_grs('snp_file')`

where `snp_file` is a tab-separated file, which looks like this

![image](https://user-images.githubusercontent.com/36624710/213706895-55a9471b-b85b-427d-997b-1306911b8c10.png)

See the file `contigrs` as an example.
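
As a rough sketch, an input file in the documented column order (chromosome, base pair, other allele, effect allele, weight) could be written and sanity-checked in base R as shown here. The alleles, weights and second position are made-up placeholders, the column names are hypothetical, and whether `generate_grs` expects a header row is an assumption to verify against `contigrs`.

```r
# Made-up example rows in the documented column order:
# chromosome, base pair (build 37), other allele, effect allele, weight.
# Alleles and weights here are placeholders, not real effect sizes.
snps <- data.frame(
  chr = c(8L, 8L),
  bp = c(128077146L, 128000000L),
  other = c("A", "C"),
  effect = c("G", "T"),
  weight = c(0.12, -0.05)
)

# Write a tab-separated file. The header row is an assumption; drop it with
# col.names = FALSE if the example file has none.
snp_path <- file.path(tempdir(), "snp_file")
write.table(snps, snp_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = TRUE)

# Read it back to confirm the layout before starting a long GRS run.
# Explicit colClasses stop an allele column of "T" being read as logical TRUE.
check <- read.table(
  snp_path, sep = "\t", header = TRUE,
  colClasses = c("integer", "integer", "character", "character", "numeric")
)
stopifnot(
  ncol(check) == 5,
  !anyDuplicated(check[, c("chr", "bp")])
)
print(check)
```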

## Complete Script

Users on the Exomes_450K Project at the University of Exeter can run the following example to calculate the Conti et al. GRS (https://pubmed.ncbi.nlm.nih.gov/33398198/) and test its predictive power against prostate cancer.

This script relies on https://github.com/hdg204/UKBB

```
library(devtools)
library('dplyr')
library('ggplot2')
install.packages( "http://www.well.ox.ac.uk/~gav/resources/rbgen_v1.1.5.tgz", repos = NULL, type = "source" )
library('rbgen')

system('dx download file-GP3GfZjJZ8kYP25V5VZ1BGfx') #this file has the conti et al snps in it
source_url("https://raw.githubusercontent.com/hdg204/DNANexus_GRS/main/Calculate_GRS.R") #makes Calculate_GRS available
grs=generate_grs('conti_et_al_prostate_snps') #this uses the downloaded file and makes a grs

source_url("https://raw.githubusercontent.com/hdg204/UKBB/main/UKBB_Health_Records_Public.R") # this script is used to derive the Prostate Cancer phenotype
prostate_cancer=first_occurence(cancer='C61',ICD10='C61')%>%mutate(prca=1) # this uses my own first-occurrence code; cases get prca=1

all_data=left_join(grs,prostate_cancer)
all_data$prca[is.na(all_data$prca)]=0 #participants with no record are controls, so all_data$prca is now a vector of 1s and 0s

# plot density plots of the GRS distribution in cases and controls
ggplot(all_data,aes(x=grs,fill=as.factor(prca),colour=as.factor(prca)))+
geom_density(alpha=0.3)

#build a ROC curve with the AUC
install.packages('pROC')
library('pROC')
logit <- glm(prca~grs, data = all_data, family = "binomial")
prob = predict(logit, newdata = all_data, type = "response")
roc(all_data$prca ~ prob, plot = TRUE, print.auc = TRUE, ci=TRUE)
```
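
The case/control coding and the AUC in the script above can be illustrated on simulated data without DNA Nexus access. The sketch below uses base R `merge` in place of `left_join` and computes the AUC from ranks rather than with `pROC`; all data are simulated, and joining on `eid` assumes both tables share that column.

```r
set.seed(1)

# Simulated scores for 200 participants; values are invented for illustration.
grs <- data.frame(eid = 1:200, grs = rnorm(200))

# Simulated case list: only cases appear, mirroring a first-occurrence table.
cases <- data.frame(eid = sample(grs$eid, 40), prca = 1)

# Base R equivalent of left_join(): keep every scored participant.
all_data <- merge(grs, cases, by = "eid", all.x = TRUE)
all_data$prca[is.na(all_data$prca)] <- 0  # participants without a record are controls

# AUC as the probability that a random case outranks a random control
# (the Mann-Whitney statistic scaled by the number of case-control pairs).
n_case <- sum(all_data$prca == 1)
n_ctrl <- sum(all_data$prca == 0)
ranks <- rank(all_data$grs)
auc <- (sum(ranks[all_data$prca == 1]) - n_case * (n_case + 1) / 2) / (n_case * n_ctrl)
print(auc)
```

Because the simulated scores are unrelated to case status, the AUC here should sit near 0.5. For a single-predictor logistic model with a positive coefficient, the fitted probability increases with the GRS, so ranking by GRS and ranking by fitted probability give the same AUC.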