```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  fig.width = 6, fig.height = 4, dpi = 150
)
set.seed(1234)
```

# gwasRtools
Some useful R functions for processing GWAS output

[![](https://img.shields.io/badge/version-0.1.3-informational.svg)](https://github.com/lcpilling/gwasRtools)
[![](https://img.shields.io/github/last-commit/lcpilling/gwasRtools.svg)](https://github.com/lcpilling/gwasRtools/commits/master)
[![](https://img.shields.io/badge/lifecycle-experimental-orange)](https://www.tidyverse.org/lifecycle/#experimental)
[![DOI](https://zenodo.org/badge/655790727.svg)](https://zenodo.org/badge/latestdoi/655790727)

## List of functions
- [get_loci()](#get_loci)
- [get_nearest_gene()](#get_nearest_gene)
- [lambda_gc()](#lambda_gc)

## Installation
Install `gwasRtools` from [GitHub](https://github.com/) with:

```r
remotes::install_github("lcpilling/gwasRtools")
```

## Example dataset
The package includes a subset of variants from the Graham et al. 2021 GWAS of LDL in 1,320,016 Europeans (GWAS catalog GCST90239658). I will use this throughout.

``` {r}
library(gwasRtools)
head(gwas_example)
```

## get_loci()
Determine loci from a GWAS summary statistics file. Independent loci are estimated using the distance from the lead significant SNP (default distance = 500 kb). By default, the HLA region is treated as one continuous locus due to its complex LD. The function derives -log10(p) from BETA/SE, so P is not needed as input. Example with default input:

``` {r}
gwas_loci = get_loci(gwas_example)

head(gwas_loci)

gwas_loci |> dplyr::filter(lead==TRUE) |> head()
```

- Loci are numbered. A variant belongs to a locus if it is significant (p-value below `p_threshold`) and lies less than `n_bases` from the previous significant variant.
- The lead variant for each locus is flagged with `lead==TRUE` (i.e., the variant with the smallest p-value within the locus).
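
The distance-based grouping described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the function name `loci_sketch()`, the toy data, and the column name `neglog10p` are hypothetical, and HLA handling is omitted. It also shows one way to obtain -log10(p) from BETA/SE on the log scale, so that very small p-values do not underflow to zero.

```r
# Toy summary statistics (hypothetical values, single chromosome for brevity)
toy <- data.frame(
  SNP  = paste0("rs", 1:6),
  CHR  = 1,
  BP   = c(100000, 150000, 400000, 2000000, 2100000, 5000000),
  BETA = c(0.10, 0.12, 0.09, 0.20, 0.15, 0.01),
  SE   = c(0.015, 0.015, 0.015, 0.02, 0.02, 0.02)
)

# Two-sided -log10(p) from the z-score, computed via log probabilities
z <- toy$BETA / toy$SE
toy$neglog10p <- -(pnorm(abs(z), lower.tail = FALSE, log.p = TRUE) + log(2)) / log(10)

# Assumption: a new locus starts when a significant variant is on a new
# chromosome or at least `n_bases` from the previous significant variant
loci_sketch <- function(d, p_threshold = 5e-8, n_bases = 5e5) {
  d <- d[order(d$CHR, d$BP), ]
  d$locus <- NA_integer_
  d$lead <- FALSE
  sig <- which(d$neglog10p > -log10(p_threshold))
  locus <- 0L
  last_chr <- NA
  last_bp <- NA
  for (i in sig) {
    new_locus <- is.na(last_bp) || d$CHR[i] != last_chr ||
      d$BP[i] - last_bp >= n_bases
    if (new_locus) locus <- locus + 1L
    d$locus[i] <- locus
    last_chr <- d$CHR[i]
    last_bp <- d$BP[i]
  }
  # Lead variant = largest -log10(p) (smallest p-value) within each locus
  for (l in seq_len(locus)) {
    idx <- which(d$locus == l)
    d$lead[idx[which.max(d$neglog10p[idx])]] <- TRUE
  }
  d
}

res <- loci_sketch(toy)
res[, c("SNP", "BP", "neglog10p", "locus", "lead")]
```

In this toy example the first three variants chain into one locus, the next two form a second locus, and the last variant is not significant, so its `locus` stays `NA`.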

### Use LD clumping to identify independent SNPs at the same locus

Setting option `get_ld_indep=TRUE` uses the `ld_clump()` function from the {[ieugwasr](https://github.com/MRCIEU/ieugwasr)} package to run Plink LD clumping.

The default is to use a local Plink installation (this is faster) with the EUR reference panel. Setting option `ld_clump_local` to FALSE uses the online IEU API instead. See the {ieugwasr} docs for details. The default R2 threshold for LD pruning is 0.01 (modify with the `ld_pruning_r2` option).

``` {r}
gwas_loci = get_loci(gwas_example, get_ld_indep=TRUE)

head(gwas_loci)

gwas_loci |> dplyr::filter(lead==TRUE) |> head()
```

Previously, locus 3 would have had only one lead SNP (based on distance/lowest p-value); `ld_clump()` has now identified multiple independent variants in the region.

Note that there are now three `lead` columns:
- `lead_dist` is the original `lead` column, simply based on distance and p-values
- `lead_ld` is the direct result from `ld_clump()`
- The `lead` column combines the two. Some SNPs are missing from the LD panel, so when only one variant is identified at a locus it is not always the one with the lowest p-value.

Note that `ld_clump()` only considers R^2 when defining independent variants, not D'. Where relevant, you should perform additional checking/conditional analysis for `lead` variants in close proximity.
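
To illustrate why a low R^2 does not rule out a high D', the sketch below computes both statistics from haplotype frequencies using the standard definitions. The function name `ld_stats()` and the frequencies are hypothetical and are not part of gwasRtools or ieugwasr.

```r
# p_ab: frequency of the haplotype carrying allele A and allele B
# p_a, p_b: frequencies of allele A and allele B
ld_stats <- function(p_ab, p_a, p_b) {
  d <- p_ab - p_a * p_b
  # Largest |D| possible given the allele frequencies
  d_max <- if (d >= 0) {
    min(p_a * (1 - p_b), (1 - p_a) * p_b)
  } else {
    min(p_a * p_b, (1 - p_a) * (1 - p_b))
  }
  c(
    r2      = d^2 / (p_a * (1 - p_a) * p_b * (1 - p_b)),
    d_prime = abs(d) / d_max
  )
}

# A rare allele (0.5%) that always sits on the same common haplotype (50%)
ex <- ld_stats(p_ab = 0.005, p_a = 0.5, p_b = 0.005)
ex
```

Here R^2 is roughly 0.005, below the default 0.01 threshold, while D' is 1, so the two variants would pass an R^2-based filter even though they are not truly independent.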

## get_nearest_gene()
Get the nearest gene for a set of variants using GENCODE data. Provide a data.frame of variant IDs (e.g., rsids), CHR and POS. Default column names are the same as for `get_loci()`. The default maximum distance from variant to gene is 100 kb.

``` {r}
gwas_loci = get_nearest_gene(gwas_loci, build=37)

head(gwas_loci)

gwas_loci |> dplyr::filter(lead==TRUE) |> head()
```
- If `dist` is positive, the variant is intergenic, and this is the distance to the closest gene.
- If `dist` is negative, the variant is within a gene, and this is the distance to the start of the gene.
- If `dist` is NA, the variant is not within `n_bases` of a gene in GENCODE.
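
The sign convention for `dist` can be sketched with a toy gene table. This is an illustration under stated assumptions, not the package's implementation: `nearest_gene_sketch()` and the gene coordinates are hypothetical, a single chromosome is assumed, and "start of the gene" is taken to be the lower coordinate regardless of strand.

```r
# Toy gene table (hypothetical coordinates, not real GENCODE records)
genes <- data.frame(
  gene  = c("GENE_A", "GENE_B"),
  start = c(1000, 50000),
  end   = c(5000, 60000)
)

nearest_gene_sketch <- function(pos, genes, n_bases = 1e5) {
  inside <- pos >= genes$start & pos <= genes$end
  if (any(inside)) {
    # Within a gene: negative distance to the gene start
    i <- which(inside)[1]
    return(data.frame(gene = genes$gene[i], dist = -(pos - genes$start[i])))
  }
  # Intergenic: positive distance to the closest gene boundary
  gap <- pmin(abs(pos - genes$start), abs(pos - genes$end))
  i <- which.min(gap)
  if (gap[i] > n_bases) {
    # No gene within n_bases
    return(data.frame(gene = NA_character_, dist = NA_real_))
  }
  data.frame(gene = genes$gene[i], dist = gap[i])
}

within_gene <- nearest_gene_sketch(3000, genes)
intergenic  <- nearest_gene_sketch(7000, genes)
too_far     <- nearest_gene_sketch(500000, genes)
```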

## lambda_gc()
Estimate inflation of test statistics. Lambda GC compares the median test statistic against the expected median test statistic under the null hypothesis of no association. For well-powered GWAS of traits with known polygenic inheritance, we expect lambda GC to be inflated (above 1). For traits with no expected association, we expect lambda GC to be around 1.

``` {r}
lambda_gc(gwas_example$P)
```
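
For reference, the standard genomic-control calculation can be sketched in a few lines of base R. The name `lambda_gc_sketch()` is hypothetical, and the package's `lambda_gc()` may differ in details such as input handling; this version assumes two-sided p-values from 1-df tests.

```r
lambda_gc_sketch <- function(p) {
  p <- p[!is.na(p)]
  # Convert p-values to 1-df chi-squared statistics
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  # Observed median over the expected median under the null
  median(chisq) / qchisq(0.5, df = 1)
}

# Evenly spaced p-values mimic the null, so lambda should be close to 1
p_null <- ppoints(1000)
lambda_null <- lambda_gc_sketch(p_null)

# Squaring shifts p-values towards zero, which inflates lambda
lambda_inflated <- lambda_gc_sketch(p_null^2)
```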