https://github.com/hdg204/DNANexus_GRS
An R Function for calculating genetic risk scores in the UK Biobank cohort using DNA Nexus
- Host: GitHub
- URL: https://github.com/hdg204/DNANexus_GRS
- Owner: hdg204
- Created: 2023-01-20T13:19:10.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-02-01T17:33:40.000Z (over 1 year ago)
- Last Synced: 2024-01-17T01:05:55.255Z (6 months ago)
- Language: R
- Size: 34.2 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Lists
- awesome-uk-biobank - DNANexus_GRS
README
# DNANexus GRS
## Description
This repository contains two functions:
**Calculate_GRS.R** is a function built to evaluate genetic risk scores in the UK Biobank cohort, using the imputed genotypes and RStudio Workbench on the DNA Nexus platform. It takes one input, a file listing chromosome, base pair, other allele, effect allele, and weight for each SNP, and returns a data frame with two columns, eid and grs. Please note that it takes about 20 minutes to compute a GRS on the default DNA Nexus settings. Most of this time is spent extracting SNPs from the BGEN files, which is a slow process through R.
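As a rough illustration of what the returned `grs` column represents, here is a minimal sketch that assumes the score is the usual weighted sum: for each participant, the number of effect alleles carried at each SNP multiplied by that SNP's weight, summed over SNPs. The dosages, weights, SNP names and participant IDs are invented, and the code is not taken from Calculate_GRS.R, which may differ in its details.

```r
# Toy illustration only: the dosages and weights below are invented.
# Rows are participants, columns are SNPs; each entry is the number of
# effect alleles carried (0, 1 or 2; imputed data may give fractional values).
dosage <- matrix(
  c(0, 1, 2,
    1, 1, 0,
    2, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1001", "1002", "1003"), c("snp1", "snp2", "snp3"))
)

# One weight per SNP, in the same order as the columns of `dosage`
weights <- c(snp1 = 0.10, snp2 = 0.25, snp3 = -0.05)

# Weighted sum across SNPs for each participant
grs <- data.frame(
  eid = rownames(dosage),
  grs = as.numeric(dosage %*% weights)
)
print(grs)
```

A negative weight simply lowers the score for each copy of the effect allele carried.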
Important notes:
* If any SNPs are missing, it will just exclude them and not tell you about it. I'm working on it.
* SNPs must be entered in chr bp format, and must be in build 37. This is to match the index bgen files stored on the DNA Nexus RAP.

**extract_snp.R** is not required for Calculate_GRS, but is a potentially useful related function that extracts the genotype information for one SNP from the imputed data and stores it in a data frame. The function takes two inputs, chromosome and base pair, and returns a list with two outputs, one with the genotype data and one with the SNP info. The genotype data is a data frame with two columns, id and genotype. The SNP info contains chromosome, position, rsid, number_of_alleles, allele0, and allele1.
extract_snp can be run, e.g. using `extract_snp(8,128077146)`. It takes about one minute and is not recommended for outputting lots of SNPs. The speed for these functions is limited by the speed of `bgen.load`, and a future release will add an extra function to make this process quicker for multiple SNPs.
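To make the "genotype" column more concrete, the sketch below shows one common way of turning imputed genotype probabilities into a single per-sample value, the expected allele count (dosage). It is an assumption that extract_snp does something along these lines; it may instead return hard calls or another encoding. The array layout (variant x sample x genotype) mimics what `bgen.load` is understood to return in its `data` element, and all values and sample names are mock data.

```r
# Mock genotype probabilities for one SNP and four samples (invented values).
# Assumed layout: variants x samples x genotype (g=0, g=1, g=2).
probs <- array(
  c(1.0, 0.0, 0.1, 0.0,   # P(g = 0) for samples s1..s4
    0.0, 1.0, 0.7, 0.0,   # P(g = 1)
    0.0, 0.0, 0.2, 1.0),  # P(g = 2)
  dim = c(1, 4, 3),
  dimnames = list("snp1", c("s1", "s2", "s3", "s4"), c("g=0", "g=1", "g=2"))
)

# Expected count of the second allele: 1 * P(g = 1) + 2 * P(g = 2)
dosage <- probs[1, , "g=1"] + 2 * probs[1, , "g=2"]

# Two-column data frame mirroring the id/genotype shape described above
genotype <- data.frame(id = names(dosage), genotype = as.numeric(dosage))
print(genotype)
```

Sample s3 is uncertain between genotypes, so its dosage is fractional rather than a whole number.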
## Example script for Calculate_GRS
This script has been written to run on the RStudio Workbench on DNA Nexus. First, run
`library(devtools)`
`source_url("https://raw.githubusercontent.com/hdg204/DNANexus_GRS/main/Calculate_GRS.R")`
Then, a genetic risk score can be generated by running
`grs=generate_grs('snp_file')`
where `snp_file` is a tab-separated file that looks like this:
![image](https://user-images.githubusercontent.com/36624710/213706895-55a9471b-b85b-427d-997b-1306911b8c10.png)
See the file `contigrs` as an example.
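For readers who cannot view the image, a file of this general shape could be written from R as sketched below. The column names, the presence of a header row, the file name, and every value are assumptions made for illustration; the positions, alleles and weights are placeholders and not real GWAS results. Check `contigrs` for the exact layout the function expects.

```r
# Placeholder rows: positions, alleles and weights are invented.
# Column names are assumptions, not taken from the repository.
snps <- data.frame(
  chr = c(1, 8),
  bp = c(1000000, 2000000),
  other = c("A", "C"),
  effect = c("G", "T"),
  weight = c(0.12, -0.05)
)

# Write a tab-separated file without quotes or row names
snp_path <- file.path(tempdir(), "example_snps.tsv")
write.table(snps, snp_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm it parses as five tab-separated columns
check <- read.delim(snp_path, stringsAsFactors = FALSE)
print(check)
```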
## Complete Script
Users on the Exomes_450K Project at the University of Exeter can run the following example to calculate the Conti et al. GRS https://pubmed.ncbi.nlm.nih.gov/33398198/ and test its predictive power against prostate cancer.
This script relies on https://github.com/hdg204/UKBB
```
library(devtools)
library('dplyr')
library('ggplot2')
install.packages( "http://www.well.ox.ac.uk/~gav/resources/rbgen_v1.1.5.tgz", repos = NULL, type = "source" )
library('rbgen')

system('dx download file-GP3GfZjJZ8kYP25V5VZ1BGfx') # this file has the Conti et al. SNPs in it
source_url("https://raw.githubusercontent.com/hdg204/DNANexus_GRS/main/Calculate_GRS.R") # makes generate_grs available
grs=generate_grs('conti_et_al_prostate_snps') # this uses the downloaded file and makes a GRS

source_url("https://raw.githubusercontent.com/hdg204/UKBB/main/UKBB_Health_Records_Public.R") # this script is used to derive the prostate cancer phenotype
prostate_cancer=first_occurence(cancer='C61',ICD10='C61')%>%mutate(prca=1) # this uses my own first occurrence code

all_data=left_join(grs,prostate_cancer)
all_data$prca[is.na(all_data$prca)]=0 # all_data$prca is now a vector of 1s and 0s

# plot density plots of the distribution in cases and controls
ggplot(all_data,aes(x=grs,fill=as.factor(prca),colour=as.factor(prca)))+
geom_density(alpha=0.3)
#build a ROC curve with the AUC
install.packages('pROC')
library('pROC')
logit <- glm(prca~grs, data = all_data, family = "binomial")
prob = predict(logit, newdata = all_data, type = "response")
roc(all_data$prca ~ prob, plot = TRUE, print.auc = TRUE, ci=TRUE)
```