https://github.com/privefl/bigsnpr
R package for the analysis of massive SNP arrays.
- Host: GitHub
- URL: https://github.com/privefl/bigsnpr
- Owner: privefl
- Created: 2016-07-05T14:36:34.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2025-07-02T03:33:08.000Z (4 months ago)
- Last Synced: 2025-07-02T04:31:43.750Z (4 months ago)
- Topics: big-data, bioinformatics, memory-mapped-file, parallel-computing, polygenic-scores, population-structure-inference, r, r-package, snp-data, statistical-methods
- Language: R
- Homepage: https://privefl.github.io/bigsnpr/
- Size: 107 MB
- Stars: 208
- Watchers: 8
- Forks: 46
- Open Issues: 28
Metadata Files:
- Readme: README.md
- Changelog: NEWS.md
Awesome Lists containing this project
- awesome-complex-trait-genetics - bigsnpr
- jimsghstars - privefl/bigsnpr - R package for the analysis of massive SNP arrays. (R)
README
# bigsnpr
{bigsnpr} is an R package for the analysis of massive SNP arrays, primarily designed for human genetics. It enhances the features of [package {bigstatsr}](https://privefl.github.io/bigstatsr/) for the purpose of analyzing genotype data.
To get you started:
- [Quick demo](https://privefl.github.io/bigsnpr/articles/demo.html)
- List of functions [from bigsnpr](https://privefl.github.io/bigsnpr/reference/index.html) and [from bigstatsr](https://privefl.github.io/bigstatsr/reference/index.html)
- [Extended documentation with more examples](https://privefl.github.io/bigsnpr-extdoc/) + [course recording](https://youtu.be/7VxBT5A_AcA)
## Installation
In R, run
```r
# install.packages("remotes")
remotes::install_github("privefl/bigsnpr")
```
or, for the CRAN version:
```r
install.packages("bigsnpr")
```
## Input formats
This package reads *bed*/*bim*/*fam* files (PLINK's preferred format) using the functions `snp_readBed()` and `snp_readBed2()`. Before reading data into this package's special format, you can perform quality control and conversion with PLINK, which can be called directly from R using `snp_plinkQC()` and `snp_plinkKINGQC()`.
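As a minimal sketch of the reading step, the following uses the small example bed file that ships with the package. Writing the backing files to a temporary location is a choice made for this illustration only; in practice you would likely keep them next to your data.

```r
library(bigsnpr)

# Example bed file shipped with the package (its bim/fam files sit next to it)
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")

# Convert to the package's on-disk format; snp_readBed() returns the path
# to the .rds file it creates (a temporary backing file is used here)
rdsfile <- snp_readBed(bedfile, backingfile = tempfile())

# Attach the resulting bigSNP object in the current R session
obj <- snp_attach(rdsfile)
class(obj)
```

The conversion only needs to happen once; later sessions can call `snp_attach()` on the saved .rds file directly.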
This package can also read **UK Biobank BGEN files** using function `snp_readBGEN()`. This function takes around 40 minutes to read 1M variants for 400K individuals using 15 cores.
This package uses a class called `bigSNP` for representing SNP data. A `bigSNP` object is a list with some elements:
- `$genotypes`: A [`FBM.code256`](https://privefl.github.io/bigstatsr/reference/FBM.code256-class.html). Rows are samples and columns are variants. This stores genotype calls or **dosages** (rounded to 2 decimal places).
- `$fam`: A `data.frame` with some information on the individuals.
- `$map`: A `data.frame` with some information on the variants.

**Note that most of the algorithms of this package don't handle missing values.** You can use `snp_fastImpute()` (taking a few hours for a chip of 15K x 300K) and `snp_fastImputeSimple()` (taking a few minutes only) to impute missing values of *genotyped* variants.
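To make the structure concrete, here is a short sketch that inspects a `bigSNP` object and imputes its missing values with `snp_fastImputeSimple()`. It assumes the example dataset with missing genotypes that ships with the package; `method = "mode"` is just one of the available options, picked for illustration.

```r
library(bigsnpr)

# Example dataset containing missing genotypes, shipped with the package
obj <- snp_attachExtdata("example-missing.bed")

G <- obj$genotypes   # FBM.code256: samples in rows, variants in columns
dim(G)
head(obj$fam)        # information on the individuals
head(obj$map)        # information on the variants

# Number of missing genotypes before imputation
sum(is.na(G[]))

# Simple imputation; the result shares the same data on disk but uses a
# different decoding table, so use G2 (not G) in downstream analyses
G2 <- snp_fastImputeSimple(G, method = "mode")
sum(is.na(G2[]))
```

Reading the whole matrix with `G[]` is only reasonable here because the example data is tiny; for real datasets, work on subsets of columns.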
Package {bigsnpr} also provides functions that directly work on bed files with a few missing values (the `bed_*()` functions). See paper ["Efficient toolkit implementing.."](https://doi.org/10.1093/bioinformatics/btaa520).
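For instance, a bed file can be mapped and summarized without converting it first. This sketch assumes the example file with missing values that ships with the package and computes allele frequencies for its first five variants only.

```r
library(bigsnpr)

# Example bed file with some missing genotypes, shipped with the package
bedfile <- system.file("extdata", "example-missing.bed", package = "bigsnpr")

# Memory-map the bed file directly (no conversion to the bigSNP format)
obj.bed <- bed(bedfile)

# Allele counts and frequencies for the first 5 variants
maf <- bed_MAF(obj.bed, ind.col = 1:5)
maf
```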
## Polygenic scores
Polygenic scores are one of the main focuses of this package. There are 3 main methods currently available:
- Penalized regressions with individual-level data (see [paper](https://doi.org/10.1534/genetics.119.302019) and [tutorial](https://privefl.github.io/bigstatsr/articles/penalized-regressions.html))
- Clumping and Thresholding (C+T) and Stacked C+T (SCT) with summary statistics and individual-level data (see [paper](https://doi.org/10.1016/j.ajhg.2019.11.001) and [tutorial](https://privefl.github.io/bigsnpr/articles/SCT.html))
- LDpred2 with summary statistics (see [paper](https://doi.org/10.1093/bioinformatics/btaa1029) and [tutorial](https://privefl.github.io/bigsnpr/articles/LDpred2.html)), and lassosum2
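As a rough illustration of the C+T idea, the sketch below runs a GWAS on a training subset of the package's example data, clumps the variants, and computes scores at a few p-value thresholds. The train/test split, the `thr.r2` value, and the threshold grid are arbitrary assumptions made for this example, and no covariates are used for brevity; the linked tutorials describe the recommended workflows.

```r
library(bigsnpr)

obj <- snp_attachExtdata()       # example data shipped with the package
G   <- obj$genotypes
CHR <- obj$map$chromosome
POS <- obj$map$physical.pos
y01 <- obj$fam$affection - 1     # affection is coded 1/2 in this example

# Arbitrary train/test split (an assumption of this sketch)
set.seed(1)
ind.train <- sort(sample(nrow(G), 400))
ind.test  <- setdiff(rows_along(G), ind.train)

# GWAS on the training set (no covariates here, for brevity)
gwas <- big_univLogReg(G, y01.train = y01[ind.train], ind.train = ind.train)

# Clumping: keep the most significant variant among correlated ones
ind.keep <- snp_clumping(G, infos.chr = CHR, infos.pos = POS,
                         ind.row = ind.train, S = abs(gwas$score),
                         thr.r2 = 0.2)

# Thresholding on -log10(p-values): one score per threshold
lpS.keep <- -predict(gwas)[ind.keep]
thrs <- c(0, 1, 2)
prs <- snp_PRS(G, betas.keep = gwas$estim[ind.keep],
               ind.test = ind.test, ind.keep = ind.keep,
               lpS.keep = lpS.keep, thr.list = thrs)
dim(prs)  # one row per test individual, one column per threshold
```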
## Possible upcoming features
- Multiple imputation for GWAS (https://doi.org/10.1371/journal.pgen.1006091).
- More interactive (visual) QC.
You can request a feature by opening an issue.
## Bug report / Support
[How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/6103040)
Please open an issue if you find a bug.
If you want help using {bigstatsr} (the `big_*()` functions), please open an issue on [{bigstatsr}'s repo](https://github.com/privefl/bigstatsr/issues), or post on Stack Overflow with the tag *bigstatsr*.
I will always redirect you to GitHub issues if you email me, so that others can benefit from our discussion.
## References
- Privé, Florian, et al. ["Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr."](https://doi.org/10.1093/bioinformatics/bty185) *Bioinformatics* 34.16 (2018): 2781-2787.
- Privé, Florian, et al. ["Efficient implementation of penalized regression for genetic risk prediction."](https://doi.org/10.1534/genetics.119.302019) *Genetics* 212.1 (2019): 65-74.
- Privé, Florian, et al. ["Making the most of Clumping and Thresholding for polygenic scores."](https://doi.org/10.1016/j.ajhg.2019.11.001) *The American Journal of Human Genetics* 105.6 (2019): 1213-1221.
- Privé, Florian, et al. ["Efficient toolkit implementing best practices for principal component analysis of population genetic data."](https://doi.org/10.1093/bioinformatics/btaa520) *Bioinformatics* 36.16 (2020): 4449-4457.
- Privé, Florian, et al. ["LDpred2: better, faster, stronger."](https://doi.org/10.1093/bioinformatics/btaa1029) *Bioinformatics* 36.22-23 (2020): 5424-5431.
- Privé, Florian. ["Optimal linkage disequilibrium splitting."](https://doi.org/10.1093/bioinformatics/btab519) *Bioinformatics* 38.1 (2022): 255-256.
- Privé, Florian. ["Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics."](https://doi.org/10.1093/bioinformatics/btac348) *Bioinformatics* 38.13 (2022): 3477-3480.
- Privé, Florian, et al. ["Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores."](https://doi.org/10.1016/j.xhgg.2022.100136) *Human Genetics and Genomics Advances* 3.4 (2022).
- Privé, Florian, et al. ["Inferring disease architecture and predictive ability with LDpred2-auto."](https://doi.org/10.1101/2022.10.10.511629) *The American Journal of Human Genetics* 110.12 (2023): 2042-2055.