Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/stephenturner/kgp
1000 Genomes Project Metadata R Package
https://github.com/stephenturner/kgp
1000genomes bioinformatics genetics genomics metadata population-genetics sequencing
Last synced: 6 days ago
JSON representation
1000 Genomes Project Metadata R Package
- Host: GitHub
- URL: https://github.com/stephenturner/kgp
- Owner: stephenturner
- License: other
- Created: 2022-09-09T09:38:46.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2022-12-21T11:20:07.000Z (almost 2 years ago)
- Last Synced: 2024-04-29T20:05:34.788Z (6 months ago)
- Topics: 1000genomes, bioinformatics, genetics, genomics, metadata, population-genetics, sequencing
- Language: R
- Homepage: https://stephenturner.github.io/kgp/
- Size: 1000 KB
- Stars: 18
- Watchers: 3
- Forks: 3
- Open Issues: 5
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md
- Citation: CITATION.cff
Awesome Lists containing this project
README
---
output: github_document
---```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```# kgp
[![CRAN status](https://www.r-pkg.org/badges/version/kgp)](https://CRAN.R-project.org/package=kgp)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![arXiv](https://img.shields.io/badge/arXiv-2210.00539-b31b1b.svg)](https://arxiv.org/abs/2210.00539)This kgp data package provides metadata about populations and data about samples from the 1000 Genomes Project, including the 2,504 samples sequenced for the Phase 3 release and the expanded collection of 3,202 samples with 602 additional trios.
## Installation
You can install the released version of kgp from [CRAN](https://CRAN.R-project.org/package=kgp) with:
```r
install.packages("kgp")
```You can install the development version of kgp from [GitHub](https://github.com/stephenturner/kgp) with:
```r
# install.packages("devtools")
devtools::install_github("stephenturner/kgp")
```## About the data
The 1000 Genomes Project data Phase 3 data contains 2,504 samples with sequence data available, and was later expanded to 3,202 samples with high coverage adding 602 trios. Data is available through the [1000 Genomes FTP site](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) and [GitHub](https://github.com/igsr/1000Genomes_data_indexes/).
- Pilot publication: [An integrated map of genetic variation from 1,092 human genomes](https://www.nature.com/articles/nature11632)
- Phase 1 publication: [A map of human genome variation from population scale sequencing](https://www.nature.com/articles/nature09534)
- Phase 3 publication: [A global reference for human genetic variation](https://www.nature.com/articles/nature15393)
- Expanded high-coverage publication: [High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios](https://pubmed.ncbi.nlm.nih.gov/36055201/)There are three data sets available in the kgp package.
```{r example}
library(kgp)
data(kgp)
```The `kgp3` data contains pedigree and population information for the 2,504 samples included in the Phase 3 release of the 1000 Genomes Project data.
```{r}
kgp3
```The `kgpe` data contains pedigree and population information all 3,202 samples included in the expanded 1000 Genomes Project data, which includes 602 trios.
```{r}
kgpe
```The `kgpmeta` contains population metadata for the 26 populations across five continental regions.
```{r}
kgpmeta
```## Examples
```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(kgp)
data(kgp)
```Count the number of samples in each region, or in each population:
```{r}
kgp3 %>%
count(region) %>%
knitr::kable()
``````{r}
kgp3 %>%
count(region, population) %>%
knitr::kable()
``````{r kgp3barplot, fig.width=9, fig.height=12}
kgp3 %>%
count(region, population) %>%
arrange(region, n) %>%
mutate(population=forcats::fct_inorder(population)) %>%
ggplot(aes(population, n)) +
geom_col(aes(fill=region)) +
labs(fill=NULL, x=NULL, x="N") +
coord_flip() +
theme_bw() +
theme(legend.position="bottom")
```The latitude and longitude coordinates in `kgpmeta` can be used to plot a map of the locations of the 1000 Genomes populations. There is also a column for region color, which provides a hexadecimal color code to enable reproduction of the population data map as shown on the IGSR population data page. The figure below shows a static map produced using ggplot2, but interactive maps such as that shown on the IGSR population data portal can be created with the leaflet package.
```{r kgpmap, fig.cap="Map showing locations of the 1000 Genomes Phase 3 populations.", fig.width=8, fig.height=6}
pal <- kgpmeta %>% distinct(reg, regcolor) %>% tibble::deframe()
ggplot() +
geom_polygon(data=map_data("world"),
aes(long, lat, group=group),
col="gray30", fill="gray95", lwd=.2, alpha=.5) +
geom_point(data=kgpmeta, aes(lng, lat, col=reg), size=4) +
scale_colour_manual(values=pal) +
theme_minimal() +
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
legend.title = element_blank(),
panel.grid = element_blank(),
legend.position = "bottom")
```The table below shows a selection of samples from `kgpe` showing pedigree information for each sample. This pedigree information could be used in downstream analysis to filter out related individuals, select only trios, or to visualize family structure.
```{r kgpe}
kgpe %>%
filter(pid!="0" & mid!="0") %>%
group_by(pop) %>%
slice(1) %>%
head(12) %>%
arrange(reg, pop) %>%
select(fid:reg) %>%
select(-sexf) %>%
knitr::kable()
```The figure below shows an example of a pedigree plot made by parsing the pedigree information using [skater](https://cran.r-project.org/package=skater) and plotting using [kinship2](https://cran.r-project.org/package=kinship2). The skater package provides documentation, examples, and a vignette demonstrating how to iteratively plot all pedigrees in a given data set.
```{r pedplot, fig.height=5, fig.width=8, fig.cap="Trios in 1000 Genomes Project family 13291."}
kgpe %>%
filter(fid=="13291") %>%
transmute(fid, id, dadid=pid, momid=mid, sex, affected=1) %>%
skater::fam2ped() %>%
pull(ped) %>%
purrr::pluck(1) %>%
kinship2::plot.pedigree(mar=c(4,2,4,2), cex=.8)
```