Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/stephenturner/kgp

1000 Genomes Project Metadata R Package
https://github.com/stephenturner/kgp

1000genomes bioinformatics genetics genomics metadata population-genetics sequencing

Last synced: 3 months ago
JSON representation

1000 Genomes Project Metadata R Package

Host: GitHub
URL: https://github.com/stephenturner/kgp
Owner: stephenturner
License: other
Created: 2022-09-09T09:38:46.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2022-12-21T11:20:07.000Z (about 2 years ago)
Last Synced: 2024-04-29T20:05:34.788Z (9 months ago)
Topics: 1000genomes, bioinformatics, genetics, genomics, metadata, population-genetics, sequencing
Language: R
Homepage: https://stephenturner.github.io/kgp/
Size: 1000 KB
Stars: 18
Watchers: 3
Forks: 3
Open Issues: 5
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md
- Citation: CITATION.cff

Awesome Lists containing this project

README

        ---

output: github_document

---

```{r, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "100%"

)

```

# kgp

[![CRAN status](https://www.r-pkg.org/badges/version/kgp)](https://CRAN.R-project.org/package=kgp)

[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)

[![arXiv](https://img.shields.io/badge/arXiv-2210.00539-b31b1b.svg)](https://arxiv.org/abs/2210.00539)

This kgp data package provides metadata about populations and data about samples from the 1000 Genomes Project, including the 2,504 samples sequenced for the Phase 3 release and the expanded collection of 3,202 samples with 602 additional trios.

## Installation

You can install the released version of kgp from [CRAN](https://CRAN.R-project.org/package=kgp) with:

```r

install.packages("kgp")

```

You can install the development version of kgp from [GitHub](https://github.com/stephenturner/kgp) with:

```r

# install.packages("devtools")

devtools::install_github("stephenturner/kgp")

```

## About the data

The 1000 Genomes Project data Phase 3 data contains 2,504 samples with sequence data available, and was later expanded to 3,202 samples with high coverage adding 602 trios. Data is available through the [1000 Genomes FTP site](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) and [GitHub](https://github.com/igsr/1000Genomes_data_indexes/). 

- Pilot publication: [An integrated map of genetic variation from 1,092 human genomes](https://www.nature.com/articles/nature11632)

- Phase 1 publication: [A map of human genome variation from population scale sequencing](https://www.nature.com/articles/nature09534)

- Phase 3 publication: [A global reference for human genetic variation](https://www.nature.com/articles/nature15393)

- Expanded high-coverage publication: [High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios](https://pubmed.ncbi.nlm.nih.gov/36055201/)

There are three data sets available in the kgp package.

```{r example}

library(kgp)

data(kgp)

```

The `kgp3` data contains pedigree and population information for the 2,504 samples included in the Phase 3 release of the 1000 Genomes Project data.

```{r}

kgp3

```

The `kgpe` data contains pedigree and population information all 3,202 samples included in the expanded 1000 Genomes Project data, which includes 602 trios.

```{r}

kgpe

```

The `kgpmeta` contains population metadata for the 26 populations across five continental regions.

```{r}

kgpmeta

```

## Examples

```{r, message=FALSE, warning=FALSE}

library(dplyr)

library(ggplot2)

library(kgp)

data(kgp)

```

Count the number of samples in each region, or in each population: 

```{r}

kgp3 %>% 

  count(region) %>% 

  knitr::kable()

```

```{r}

kgp3 %>% 

  count(region, population) %>% 

  knitr::kable()

```

```{r kgp3barplot, fig.width=9, fig.height=12}

kgp3 %>% 

  count(region, population) %>% 

  arrange(region, n) %>% 

  mutate(population=forcats::fct_inorder(population)) %>% 

  ggplot(aes(population, n)) + 

  geom_col(aes(fill=region)) + 

  labs(fill=NULL, x=NULL, x="N") + 

  coord_flip() + 

  theme_bw() + 

  theme(legend.position="bottom")

```

The latitude and longitude coordinates in `kgpmeta` can be used to plot a map of the locations of the 1000 Genomes populations. There is also a column for region color, which provides a hexadecimal color code to enable reproduction of the population data map as shown on the IGSR population data page. The figure below shows a static map produced using ggplot2, but interactive maps such as that shown on the IGSR population data portal can be created with the leaflet package.

```{r kgpmap, fig.cap="Map showing locations of the 1000 Genomes Phase 3 populations.", fig.width=8, fig.height=6}

pal <- kgpmeta %>% distinct(reg, regcolor) %>% tibble::deframe()

ggplot() + 

  geom_polygon(data=map_data("world"), 

               aes(long, lat, group=group), 

               col="gray30", fill="gray95", lwd=.2, alpha=.5) + 

  geom_point(data=kgpmeta, aes(lng, lat, col=reg), size=4) + 

  scale_colour_manual(values=pal) +

  theme_minimal() + 

  theme(axis.ticks = element_blank(), 

        axis.text = element_blank(), 

        axis.title = element_blank(), 

        legend.title = element_blank(),

        panel.grid = element_blank(),

        legend.position = "bottom")

```

The table below shows a selection of samples from `kgpe` showing pedigree information for each sample. This pedigree information could be used in downstream analysis to filter out related individuals, select only trios, or to visualize family structure.

```{r kgpe}

kgpe %>% 

  filter(pid!="0" & mid!="0") %>% 

  group_by(pop) %>% 

  slice(1) %>% 

  head(12) %>% 

  arrange(reg, pop) %>% 

  select(fid:reg) %>% 

  select(-sexf) %>% 

  knitr::kable()

```

The figure below shows an example of a pedigree plot made by parsing the pedigree information using [skater](https://cran.r-project.org/package=skater) and plotting using [kinship2](https://cran.r-project.org/package=kinship2). The skater package provides documentation, examples, and a vignette demonstrating how to iteratively plot all pedigrees in a given data set.

```{r pedplot, fig.height=5, fig.width=8, fig.cap="Trios in 1000 Genomes Project family 13291."}

kgpe %>% 

  filter(fid=="13291") %>% 

  transmute(fid, id, dadid=pid, momid=mid, sex, affected=1) %>% 

  skater::fam2ped() %>% 

  pull(ped) %>% 

  purrr::pluck(1) %>% 

  kinship2::plot.pedigree(mar=c(4,2,4,2), cex=.8)

```