https://github.com/stephenturner/kgp
  
  
    1000 Genomes Project Metadata R Package 
    https://github.com/stephenturner/kgp
  
1000genomes bioinformatics genetics genomics metadata population-genetics sequencing
        Last synced: 7 months ago 
        JSON representation
    
1000 Genomes Project Metadata R Package
- Host: GitHub
- URL: https://github.com/stephenturner/kgp
- Owner: stephenturner
- License: other
- Created: 2022-09-09T09:38:46.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-12-21T11:20:07.000Z (almost 3 years ago)
- Last Synced: 2024-04-29T20:05:34.788Z (over 1 year ago)
- Topics: 1000genomes, bioinformatics, genetics, genomics, metadata, population-genetics, sequencing
- Language: R
- Homepage: https://stephenturner.github.io/kgp/
- Size: 1000 KB
- Stars: 18
- Watchers: 3
- Forks: 3
- Open Issues: 5
- 
            Metadata Files:
            - Readme: README.Rmd
- License: LICENSE.md
- Citation: CITATION.cff
 
Awesome Lists containing this project
README
          ---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# kgp
[](https://CRAN.R-project.org/package=kgp)
[](https://lifecycle.r-lib.org/articles/stages.html#stable)
[](https://arxiv.org/abs/2210.00539)
This kgp data package provides metadata about populations and data about samples from the 1000 Genomes Project, including the 2,504 samples sequenced for the Phase 3 release and the expanded collection of 3,202 samples with 602 additional trios.
## Installation
You can install the released version of kgp from [CRAN](https://CRAN.R-project.org/package=kgp) with:
```r
install.packages("kgp")
```
You can install the development version of kgp from [GitHub](https://github.com/stephenturner/kgp) with:
```r
# install.packages("devtools")
devtools::install_github("stephenturner/kgp")
```
## About the data
The 1000 Genomes Project data Phase 3 data contains 2,504 samples with sequence data available, and was later expanded to 3,202 samples with high coverage adding 602 trios. Data is available through the [1000 Genomes FTP site](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) and [GitHub](https://github.com/igsr/1000Genomes_data_indexes/). 
- Pilot publication: [An integrated map of genetic variation from 1,092 human genomes](https://www.nature.com/articles/nature11632)
- Phase 1 publication: [A map of human genome variation from population scale sequencing](https://www.nature.com/articles/nature09534)
- Phase 3 publication: [A global reference for human genetic variation](https://www.nature.com/articles/nature15393)
- Expanded high-coverage publication: [High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios](https://pubmed.ncbi.nlm.nih.gov/36055201/)
There are three data sets available in the kgp package.
```{r example}
library(kgp)
data(kgp)
```
The `kgp3` data contains pedigree and population information for the 2,504 samples included in the Phase 3 release of the 1000 Genomes Project data.
```{r}
kgp3
```
The `kgpe` data contains pedigree and population information all 3,202 samples included in the expanded 1000 Genomes Project data, which includes 602 trios.
```{r}
kgpe
```
The `kgpmeta` contains population metadata for the 26 populations across five continental regions.
```{r}
kgpmeta
```
## Examples
```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(kgp)
data(kgp)
```
Count the number of samples in each region, or in each population: 
```{r}
kgp3 %>% 
  count(region) %>% 
  knitr::kable()
```
```{r}
kgp3 %>% 
  count(region, population) %>% 
  knitr::kable()
```
```{r kgp3barplot, fig.width=9, fig.height=12}
kgp3 %>% 
  count(region, population) %>% 
  arrange(region, n) %>% 
  mutate(population=forcats::fct_inorder(population)) %>% 
  ggplot(aes(population, n)) + 
  geom_col(aes(fill=region)) + 
  labs(fill=NULL, x=NULL, x="N") + 
  coord_flip() + 
  theme_bw() + 
  theme(legend.position="bottom")
```
The latitude and longitude coordinates in `kgpmeta` can be used to plot a map of the locations of the 1000 Genomes populations. There is also a column for region color, which provides a hexadecimal color code to enable reproduction of the population data map as shown on the IGSR population data page. The figure below shows a static map produced using ggplot2, but interactive maps such as that shown on the IGSR population data portal can be created with the leaflet package.
```{r kgpmap, fig.cap="Map showing locations of the 1000 Genomes Phase 3 populations.", fig.width=8, fig.height=6}
pal <- kgpmeta %>% distinct(reg, regcolor) %>% tibble::deframe()
ggplot() + 
  geom_polygon(data=map_data("world"), 
               aes(long, lat, group=group), 
               col="gray30", fill="gray95", lwd=.2, alpha=.5) + 
  geom_point(data=kgpmeta, aes(lng, lat, col=reg), size=4) + 
  scale_colour_manual(values=pal) +
  theme_minimal() + 
  theme(axis.ticks = element_blank(), 
        axis.text = element_blank(), 
        axis.title = element_blank(), 
        legend.title = element_blank(),
        panel.grid = element_blank(),
        legend.position = "bottom")
```
The table below shows a selection of samples from `kgpe` showing pedigree information for each sample. This pedigree information could be used in downstream analysis to filter out related individuals, select only trios, or to visualize family structure.
```{r kgpe}
kgpe %>% 
  filter(pid!="0" & mid!="0") %>% 
  group_by(pop) %>% 
  slice(1) %>% 
  head(12) %>% 
  arrange(reg, pop) %>% 
  select(fid:reg) %>% 
  select(-sexf) %>% 
  knitr::kable()
```
The figure below shows an example of a pedigree plot made by parsing the pedigree information using [skater](https://cran.r-project.org/package=skater) and plotting using [kinship2](https://cran.r-project.org/package=kinship2). The skater package provides documentation, examples, and a vignette demonstrating how to iteratively plot all pedigrees in a given data set.
```{r pedplot, fig.height=5, fig.width=8, fig.cap="Trios in 1000 Genomes Project family 13291."}
kgpe %>% 
  filter(fid=="13291") %>% 
  transmute(fid, id, dadid=pid, momid=mid, sex, affected=1) %>% 
  skater::fam2ped() %>% 
  pull(ped) %>% 
  purrr::pluck(1) %>% 
  kinship2::plot.pedigree(mar=c(4,2,4,2), cex=.8)
```