Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/stephenturner/kgp

1000 Genomes Project Metadata R Package
https://github.com/stephenturner/kgp

1000genomes bioinformatics genetics genomics metadata population-genetics sequencing

Last synced: 3 months ago
JSON representation

1000 Genomes Project Metadata R Package

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# kgp

[![CRAN status](https://www.r-pkg.org/badges/version/kgp)](https://CRAN.R-project.org/package=kgp)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![arXiv](https://img.shields.io/badge/arXiv-2210.00539-b31b1b.svg)](https://arxiv.org/abs/2210.00539)

This kgp data package provides metadata about populations and data about samples from the 1000 Genomes Project, including the 2,504 samples sequenced for the Phase 3 release and the expanded collection of 3,202 samples with 602 additional trios.

## Installation

You can install the released version of kgp from [CRAN](https://CRAN.R-project.org/package=kgp) with:

```r
install.packages("kgp")
```

You can install the development version of kgp from [GitHub](https://github.com/stephenturner/kgp) with:

```r
# install.packages("devtools")
devtools::install_github("stephenturner/kgp")
```

## About the data

The 1000 Genomes Project data Phase 3 data contains 2,504 samples with sequence data available, and was later expanded to 3,202 samples with high coverage adding 602 trios. Data is available through the [1000 Genomes FTP site](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) and [GitHub](https://github.com/igsr/1000Genomes_data_indexes/).

- Pilot publication: [An integrated map of genetic variation from 1,092 human genomes](https://www.nature.com/articles/nature11632)
- Phase 1 publication: [A map of human genome variation from population scale sequencing](https://www.nature.com/articles/nature09534)
- Phase 3 publication: [A global reference for human genetic variation](https://www.nature.com/articles/nature15393)
- Expanded high-coverage publication: [High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios](https://pubmed.ncbi.nlm.nih.gov/36055201/)

There are three data sets available in the kgp package.

```{r example}
library(kgp)
data(kgp)
```

The `kgp3` data contains pedigree and population information for the 2,504 samples included in the Phase 3 release of the 1000 Genomes Project data.

```{r}
kgp3
```

The `kgpe` data contains pedigree and population information all 3,202 samples included in the expanded 1000 Genomes Project data, which includes 602 trios.

```{r}
kgpe
```

The `kgpmeta` contains population metadata for the 26 populations across five continental regions.

```{r}
kgpmeta
```

## Examples

```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(kgp)
data(kgp)
```

Count the number of samples in each region, or in each population:

```{r}
kgp3 %>%
count(region) %>%
knitr::kable()
```

```{r}
kgp3 %>%
count(region, population) %>%
knitr::kable()
```

```{r kgp3barplot, fig.width=9, fig.height=12}
kgp3 %>%
count(region, population) %>%
arrange(region, n) %>%
mutate(population=forcats::fct_inorder(population)) %>%
ggplot(aes(population, n)) +
geom_col(aes(fill=region)) +
labs(fill=NULL, x=NULL, x="N") +
coord_flip() +
theme_bw() +
theme(legend.position="bottom")
```

The latitude and longitude coordinates in `kgpmeta` can be used to plot a map of the locations of the 1000 Genomes populations. There is also a column for region color, which provides a hexadecimal color code to enable reproduction of the population data map as shown on the IGSR population data page. The figure below shows a static map produced using ggplot2, but interactive maps such as that shown on the IGSR population data portal can be created with the leaflet package.

```{r kgpmap, fig.cap="Map showing locations of the 1000 Genomes Phase 3 populations.", fig.width=8, fig.height=6}
pal <- kgpmeta %>% distinct(reg, regcolor) %>% tibble::deframe()
ggplot() +
geom_polygon(data=map_data("world"),
aes(long, lat, group=group),
col="gray30", fill="gray95", lwd=.2, alpha=.5) +
geom_point(data=kgpmeta, aes(lng, lat, col=reg), size=4) +
scale_colour_manual(values=pal) +
theme_minimal() +
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
legend.title = element_blank(),
panel.grid = element_blank(),
legend.position = "bottom")
```

The table below shows a selection of samples from `kgpe` showing pedigree information for each sample. This pedigree information could be used in downstream analysis to filter out related individuals, select only trios, or to visualize family structure.

```{r kgpe}
kgpe %>%
filter(pid!="0" & mid!="0") %>%
group_by(pop) %>%
slice(1) %>%
head(12) %>%
arrange(reg, pop) %>%
select(fid:reg) %>%
select(-sexf) %>%
knitr::kable()
```

The figure below shows an example of a pedigree plot made by parsing the pedigree information using [skater](https://cran.r-project.org/package=skater) and plotting using [kinship2](https://cran.r-project.org/package=kinship2). The skater package provides documentation, examples, and a vignette demonstrating how to iteratively plot all pedigrees in a given data set.

```{r pedplot, fig.height=5, fig.width=8, fig.cap="Trios in 1000 Genomes Project family 13291."}
kgpe %>%
filter(fid=="13291") %>%
transmute(fid, id, dadid=pid, momid=mid, sex, affected=1) %>%
skater::fam2ped() %>%
pull(ped) %>%
purrr::pluck(1) %>%
kinship2::plot.pedigree(mar=c(4,2,4,2), cex=.8)
```