https://github.com/zhanxw/seqminer
Query sequence data (VCF/BCF1/BCF2, Tabix, BGEN, PLINK) in R
https://github.com/zhanxw/seqminer
annotation bcf bgen meta-analysis next-generation-sequencing plink sequencing tabix vcf workflow
Last synced: 2 months ago
JSON representation
Query sequence data (VCF/BCF1/BCF2, Tabix, BGEN, PLINK) in R
- Host: GitHub
- URL: https://github.com/zhanxw/seqminer
- Owner: zhanxw
- License: other
- Created: 2013-11-03T01:33:16.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2025-10-01T01:03:36.000Z (7 months ago)
- Last Synced: 2026-01-12T20:21:35.547Z (4 months ago)
- Topics: annotation, bcf, bgen, meta-analysis, next-generation-sequencing, plink, sequencing, tabix, vcf, workflow
- Language: C
- Homepage: http://zhanxw.github.io/seqminer/
- Size: 8 MB
- Stars: 32
- Watchers: 2
- Forks: 12
- Open Issues: 18
-
Metadata Files:
- Readme: README.md
- Changelog: ChangeLog
- License: LICENSE
Awesome Lists containing this project
README
SEQMINER2
========
[](https://github.com/zhanxw/seqminer/actions)
[](https://ci.appveyor.com/project/zhanxw/seqminer)

[](https://cran.r-project.org/package=seqminer)
**Table of Contents**
- [Introduction](#introduction)
- [Download](#download)
- [Showcase](#showcase)
- [Index VCF/BCF files](#Index-VCF/BCF-files)
- [Query VCF/BCF files](#Query-VCF/BCF-files)
- [Query BGEN/PLINK files](#query-BGEN/PLINK-files)
- [Command line linterface](#command-line-linterface)
# Introduction
Seqminer is a highly efficient R-package for retrieving sequence variants from biobank scale datasets of millions of individuals and billions of genetic variants. It supports all variant types, including multi-allelic variants and imputation dosages. It takes VCF/BCF/BGEN/PLINK format as input file, indexes, queries them based upon variant-based index and loads them as R data types such as list or matrix.
# Download
Install the development version ([devtools](https://github.com/r-lib/devtools) package is required):
devtools::install_github("zhanxw/seqminer")
# Showcase
Here are some examples of how to use seqminer to index and query files in real-life scenarios.
## Index VCF/BCF files
library(seqminer)
bcf.ref.file <- "input.bcf"
bcf.idx.file <- "input.bcf.scIdx"
out <- seqminer::createSingleChromosomeBCFIndex(bcf.ref.file, bcf.idx.file)
or
vcf.ref.file <- "input.vcf.gz"
vcf.idx.file <- "input.vcf.gz.scIdx"
out <- seqminer::createSingleChromosomeVCFIndex(vcf.ref.file, vcf.idx.file)
This would generate variant-based index that works with commonly used sequence variant file format, such as VCF/BCF files.
## Query VCF/BCF files
Query VCF file:
vcf.ref.file <- "input.vcf.gz"
vcf.idx.file <- "input.vcf.gz.scIdx"
tabix.range <- "1:123-1234"
geno <- seqminer::readSingleChromosomeVCFToMatrixByRange(vcf.ref.file, tabix.range, vcf.idx.file)
Query BCF file:
bcf.ref.file <- "input.bcf"
bcf.idx.file <- "input.bcf.scIdx"
tabix.range <- "1:123-1234"
geno <- seqminer::readSingleChromosomeBCFToMatrixByRange(bcf.ref.file, tabix.range, bcf.idx.file)
Querying multiple regions is also doable, simply specify multiple regions and separte them by a comma, e.g. `"1:123-124,1:1234-1235"`.
Output example (column represents variants, row represents individuals):

## Query BGEN/PLINK files
Query BGEN file:
bg.ref.file <- "input.bgen"
bg.range <- "1:123-1234"
geno.mat <- seqminer::readBGENToMatrixByRange(bg.ref.file, bg.range)
geno.list <- seqminer::readBGENToListByRange(bg.ref.file, bg.range)
Make sure that bgen file has an index file `*.bgi` in the same folder.
Query PLINK file:
plink.ref.file <- "input"
geno <- seqminer::readPlinkToMatrixByIndex(plink.ref.file, sampleIndex=1:20000, markerIndex=1:100)
## Command line linterface
We also developed a seqminer command line interface:
./queryVCFIndex.intel input.vcf.gz input.vcf.gz.scIdx 1:123-1234
Citation:
[Yang, L., Jiang, S., Jiang, B., Liu, D. J., & Zhan, X. (2020). Seqminer2: An Efficient Tool to Query and Retrieve Genotypes for Statistical Genetics Analyses from Biobank Scale Sequence Dataset. Bioinformatics](https://doi.org/10.1093/bioinformatics/btaa628)
[Zhan, X. and Liu, D. J. (2015), SEQMINER: An R-Package to Facilitate the Functional Interpretation of Sequence-Based Associations. Genet. Epidemiol., 39: 619–623. doi:10.1002/gepi.21918](https://doi.org/10.1002/gepi.21918)