An open API service indexing awesome lists of open source software.

https://github.com/slowkow/allelefrequencies

📂 HLA allele frequencies in tab-delimited format, downloaded from AFND.
https://github.com/slowkow/allelefrequencies

bioinformatics genetics hla hla-database

Last synced: about 1 year ago
JSON representation

📂 HLA allele frequencies in tab-delimited format, downloaded from AFND.

Awesome Lists containing this project

README

          

---
output:
md_document:
variant: "gfm"
standalone: true
toc: false
toc_depth: 2
html_document:
toc: false
toc_depth: 2
self_contained: true
---

# HLA allele frequencies in tab-delimited format

DOI

Kamil Slowikowski

`r format(Sys.Date())`

```{r,include=FALSE}
options(width=100)
library(data.table)
library(dplyr)
library(glue)
library(readr)
library(magrittr)
library(ggplot2)
library(ggstance)
devtools::source_gist("c83e078bf8c81b035e32c3fc0cf04ee8", filename = 'render_toc.R')
file_size <- function(x) glue("{fs::file_size(x)}B")
d <- fread("afnd.tsv")
n_hla <- d %>% filter(group == "hla") %>% count(gene) %>% nrow
n_kir <- d %>% filter(group == "kir") %>% count(gene) %>% nrow
n_mic <- d %>% filter(group == "mic") %>% count(gene) %>% nrow
n_cyt <- d %>% filter(group == "cyt") %>% count(gene) %>% nrow
```

**Table of Contents**

```{r toc, echo=FALSE}
render_toc("README.Rmd", toc_depth = 2)
```

## Introduction

Here, we share a single file [afnd.tsv](afnd.tsv) (`r file_size("afnd.tsv")`) in tab-delimited format with all allele frequencies for `r n_hla` HLA genes, `r n_kir` KIR genes, `r n_mic` MIC genes, and `r n_cyt` cytokine genes from [Allele Frequency Net Database](http://allelefrequencies.net) (AFND).

The script [allelefrequencies.py](allelefrequencies.py) automatically downloads allele frequencies from the website.

[What is the Allele Frequency Net Database?](http://www.allelefrequencies.net/faqs.asp)

> The Allele Frequency Net Database (AFND) is a public database which contains
> frequency information of several immune genes such as Human Leukocyte
> Antigens (HLA), Killer-cell Immunoglobulin-like Receptors (KIR), Major
> histocompatibility complex class I chain-related (MIC) genes, and a number of
> cytokine gene polymorphisms.

The [afnd.tsv](afnd.tsv) file looks like this:

```{r}
d <- fread("afnd.tsv")
head(d)
```

Definitions:

- `alleles_over_2n` (Alleles / 2n)
Allele Frequency: total number of copies of
the allele in the population sample in three decimal format.

- `indivs_over_n` (100 \* Individuals / n)
Percentage of individuals who have the allele or gene.

- `n` (Individuals)
Number of individuals sampled from the population.

## Examples

Here are a few examples of how we can use R to analyze these data.

View the largest and smallest populations available in the data:

```{r}
d %>%
mutate(n = parse_number(n)) %>%
select(population, n) %>%
unique() %>%
arrange(-n)
```

Count the number of alleles for each gene:

```{r}
d %>%
count(group, gene, allele) %>%
count(group, gene) %>%
arrange(-n) %>%
head(15)
```

Sum the allele frequencies for each gene in each population. This allows us to
see which populations have a set of allele frequencies that adds up to 100
percent:

```{r}
d %>%
mutate(alleles_over_2n = parse_number(alleles_over_2n)) %>%
filter(alleles_over_2n > 0) %>%
group_by(group, gene, population) %>%
summarize(sum = sum(alleles_over_2n)) %>%
count(sum == 1)
```

```{r, include = FALSE}
theme_set(
theme_bw(base_size = 14) +
theme(
plot.caption.position = "plot"
)
)
```

Plot the frequency of a specific allele in populations with more than 1000
sampled individuals:

```{r, dpi = 300, fig.width = 9, fig.height = 7}

my_allele <- "DQB1*02:01"
my_d <- d %>% filter(allele == my_allele) %>%
mutate(
n = parse_number(n),
alleles_over_2n = parse_number(alleles_over_2n)
) %>%
filter(n > 1000) %>%
arrange(-alleles_over_2n)

ggplot(my_d) +
aes(x = alleles_over_2n, y = reorder(population, alleles_over_2n)) +
scale_y_discrete(position = "right") +
geom_colh() +
labs(
x = "Allele Frequency (Alleles / 2N)",
y = NULL,
title = glue("Frequency of {my_allele} across populations"),
caption = "Data from AFND http://allelefrequencies.net"
)
```

## Citation

If you use this data, please cite the latest manuscript about **Allele Frequency
Net Database**:

- Gonzalez-Galarza FF, McCabe A, Santos EJMD, Jones J, Takeshita L,
Ortega-Rivera ND, et al. [Allele frequency net database (AFND) 2020 update:
gold-standard data classification, open access genotype data and new query
tools.](https://pubmed.ncbi.nlm.nih.gov/31722398) Nucleic Acids Res. 2020;48:
D783–D788. doi:10.1093/nar/gkz1029

```
@ARTICLE{Gonzalez-Galarza2020,
title = "{Allele frequency net database (AFND) 2020 update: gold-standard
data classification, open access genotype data and new query
tools}",
author = "Gonzalez-Galarza, Faviel F and McCabe, Antony and Santos, Eduardo
J Melo Dos and Jones, James and Takeshita, Louise and
Ortega-Rivera, Nestor D and Cid-Pavon, Glenda M Del and
Ramsbottom, Kerry and Ghattaoraya, Gurpreet and Alfirevic, Ana
and Middleton, Derek and Jones, Andrew R",
journal = "Nucleic acids research",
volume = 48,
number = "D1",
pages = "D783--D788",
month = jan,
year = 2020,
language = "en",
issn = "0305-1048, 1362-4962",
pmid = "31722398",
doi = "10.1093/nar/gkz1029",
pmc = "PMC7145554"
}
```

## Related work

Here are all of the resources I could find that have information about HLA
allele frequencies in different populations.

### HLAfreq

https://github.com/Vaccitech/HLAfreq/

- Wells DA, McAuley M. [HLAfreq: Download and combine HLA allele frequency data.](https://www.biorxiv.org/content/10.1101/2023.09.15.557761v1) bioRxiv. 2023. doi:10.1101/2023.09.15.557761

### CIWD version 3.0.0

- Hurley CK, Kempenich J, Wadsworth K, Sauter J, Hofmann JA, Schefzyk D, et al.
[Common, intermediate and well-documented HLA alleles in world populations:
CIWD version 3.0.0.](https://www.ncbi.nlm.nih.gov/pubmed/31970929) Hladnikia.
2020;95: 516–531. doi:10.1111/tan.13811

The authors provide xlsx files on this website:

- https://www.ihiw18.org/component-immunogenetics/download-common-and-well-documented-alleles-3-0

But the frequency information is binned into categories:

- C, common
- I, intermediate
- WD, well-documented
- NA, not applicable

There is a tool called [HLA-Net](https://hla-net.eu/tools/cwd-viewer/results/)
that provides a visualization of the CIWD data.

### IEDB Tools

http://tools.iedb.org/population/download

At the IEDB Tools page, we can find a tool called **Population Coverage**. The
authors have downloaded the HLA frequency information from AFND and saved it in
a Python pickle file.

### dbMHC

https://www.ncbi.nlm.nih.gov/gv/mhc

The dbMHC database and website appears to be discontinued. But an archive of
old files is still available via FTP.

### NMDP

https://bioinformatics.bethematchclinical.org/hla-resources/haplotype-frequencies/high-resolution-hla-alleles-and-haplotypes-in-the-us-population/

- Maiers M, Gragert L, Klitz W. [High-resolution HLA alleles and haplotypes in
the United States population.](https://pubmed.ncbi.nlm.nih.gov/17869653) Hum
Immunol. 2007;68: 779–788. doi:10.1016/j.humimm.2007.04.005

## Acknowledgments

Thanks to David A. Wells for sharing [scrapeAF][1], which inspired me to work
on this project.

[1]: https://github.com/DAWells/scrapeAF