{"id":25176041,"url":"https://github.com/quantgen/bgdata","last_synced_at":"2025-08-28T06:19:41.210Z","repository":{"id":28830766,"uuid":"32354330","full_name":"QuantGen/BGData","owner":"QuantGen","description":"A Suite of Packages for Analysis of Big Genomic Data","archived":false,"fork":false,"pushed_at":"2025-01-20T17:17:46.000Z","size":5139,"stargazers_count":34,"open_issues_count":1,"forks_count":14,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-07-30T07:31:32.697Z","etag":null,"topics":["cran","genetics","genomics","gwas","r","r-pkg"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"nashville-software-school/ng-todo-boilerplate","license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/QuantGen.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2015-03-16T21:33:22.000Z","updated_at":"2025-01-20T17:17:47.000Z","dependencies_parsed_at":"2022-07-12T16:09:22.147Z","dependency_job_id":"fda3fd4c-e035-4042-b96b-c8a91f93bdc7","html_url":"https://github.com/QuantGen/BGData","commit_stats":{"total_commits":947,"total_committers":6,"mean_commits":"157.83333333333334","dds":"0.20063357972544882","last_synced_commit":"c8819857f715406c36d71101026afae91500994a"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/QuantGen/BGData","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QuantGen%2FBGData","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QuantGen%2FBGData/tags","releases_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/QuantGen%2FBGData/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QuantGen%2FBGData/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/QuantGen","download_url":"https://codeload.github.com/QuantGen/BGData/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QuantGen%2FBGData/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272452969,"owners_count":24937467,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-28T02:00:10.768Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cran","genetics","genomics","gwas","r","r-pkg"],"created_at":"2025-02-09T13:15:54.187Z","updated_at":"2025-08-28T06:19:41.189Z","avatar_url":"https://github.com/QuantGen.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"BGData: A Suite of Packages for Analysis of Big Genomic Data\n============================================================\n\n[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/BGData)](https://CRAN.R-project.org/package=BGData)\n\nBGData ([Grueneberg \u0026 de los Campos, 2019](https://doi.org/10.1534/g3.119.400018)) is an R package that provides scalable and efficient computational methods for large genomic datasets, e.g., genome-wide association studies 
(GWAS) or genomic relationship matrices (G matrices). It also contains a container class called `BGData` that holds genotypes, sample information, and variant information.\n\nModern genomic datasets are big (large *n*), high-dimensional (large *p*), and multi-layered. The challenges that need to be addressed are memory requirements and computational demands. Our goal is to develop software that will enable researchers to carry out analyses with big genomic data within the R environment.\n\nWe have identified several approaches to tackle those challenges within R:\n\n- File-backed matrices: The data is stored on the hard drive, and users can read in smaller chunks as they are needed.\n- Linked arrays: For very large datasets a single file-backed array may not be enough or convenient. A linked array is an array whose content is distributed over multiple file-backed nodes.\n- Multiple dispatch: Methods are presented to users so that they can treat these arrays much as if they were in-memory (RAM) arrays.\n- Multi-level parallelism: Exploit multi-core and multi-node computing.\n- Inputs: Users can create these arrays from standard formats (e.g., PLINK .bed).\n\nThe BGData package is an umbrella package that comprises several packages: [BEDMatrix](https://CRAN.R-project.org/package=BEDMatrix), [LinkedMatrix](https://CRAN.R-project.org/package=LinkedMatrix), and [symDMatrix](https://CRAN.R-project.org/package=symDMatrix).\n\n\nExamples\n--------\n\n### Loading the package\n\nLoad the BGData package:\n\n```R\nlibrary(BGData)\n```\n\n### Inspecting the example dataset\n\nThe `inst/extdata` folder contains example files that were generated from the 250k SNP and phenotype data in [Atwell et al. (2010)](https://doi.org/10.1038/nature08800). Only the first 300 SNPs of chromosomes 1, 2, and 3 were included to keep the size of the example dataset small enough for CRAN. 
[PLINK](https://www.cog-genomics.org/plink2) was used to convert the data to [.bed](https://www.cog-genomics.org/plink2/input#bed) and [.raw](https://www.cog-genomics.org/plink2/input#raw) files. `FT10` has been chosen as a phenotype and is provided as an [alternate phenotype file](https://www.cog-genomics.org/plink2/input#pheno). The file is intentionally shuffled to demonstrate that the additional phenotypes are put in the same order as the rest of the phenotypes.\n\n```R\npath \u003c- system.file(\"extdata\", package = \"BGData\")\nlist.files(path)\n#\u003e  [1] \"chr1.bed\"  \"chr1.bim\"  \"chr1.fam\"  \"chr1.raw\"  \"chr2.bed\"  \"chr2.bim\"\n#\u003e  [7] \"chr2.fam\"  \"chr2.raw\"  \"chr3.bed\"  \"chr3.bim\"  \"chr3.fam\"  \"chr3.raw\"\n#\u003e [13] \"pheno.txt\"\n```\n\n### Loading example dataset\n\n#### Loading individual PLINK .bed files\n\nLoad the .bed file for chromosome 1 (chr1.bed) using the [BEDMatrix](https://CRAN.R-project.org/package=BEDMatrix) package:\n\n```R\nchr1 \u003c- BEDMatrix(paste0(path, \"/chr1.bed\"))\n#\u003e Extracting number of individuals and rownames from .fam file...\n#\u003e Extracting number of markers and colnames from .bim file...\n```\n\n`BEDMatrix` objects behave similarly to regular matrices:\n\n```R\ndim(chr1)\n#\u003e [1] 199 300\nrownames(chr1)[1:10]\n#\u003e [1] \"5837_5837\" \"6008_6008\" \"6009_6009\" \"6016_6016\" \"6040_6040\" \"6042_6042\"\n#\u003e [7] \"6043_6043\" \"6046_6046\" \"6064_6064\" \"6074_6074\"\ncolnames(chr1)[1:10]\n#\u003e [1] \"snp1_T\"  \"snp2_G\"  \"snp3_A\"  \"snp4_T\"  \"snp5_G\"  \"snp6_T\"  \"snp7_C\"\n#\u003e [8] \"snp8_C\"  \"snp9_C\"  \"snp10_G\"\nchr1[\"6008_6008\", \"snp5_G\"]\n#\u003e [1] 0\n```\n\n#### Linking multiple BEDMatrix objects together\n\nLoad the other two .bed files:\n\n```R\nchr2 \u003c- BEDMatrix(paste0(path, \"/chr2.bed\"))\n#\u003e Extracting number of individuals and rownames from .fam file...\n#\u003e Extracting number of markers and colnames from .bim file...\nchr3 
\u003c- BEDMatrix(paste0(path, \"/chr3.bed\"))\n#\u003e Extracting number of individuals and rownames from .fam file...\n#\u003e Extracting number of markers and colnames from .bim file...\n```\n\nCombine the BEDMatrix objects by columns using the [LinkedMatrix](https://CRAN.R-project.org/package=LinkedMatrix) package to avoid the inconvenience of having three separate matrices:\n\n```R\nwg \u003c- ColumnLinkedMatrix(chr1, chr2, chr3)\n```\n\nJust like `BEDMatrix` objects, `LinkedMatrix` objects also behave similarly to regular matrices:\n\n```R\ndim(wg)\n#\u003e [1] 199 900\nrownames(wg)[1:10]\n#\u003e [1] \"5837_5837\" \"6008_6008\" \"6009_6009\" \"6016_6016\" \"6040_6040\" \"6042_6042\"\n#\u003e [7] \"6043_6043\" \"6046_6046\" \"6064_6064\" \"6074_6074\"\ncolnames(wg)[1:10]\n#\u003e [1] \"snp1_T\"  \"snp2_G\"  \"snp3_A\"  \"snp4_T\"  \"snp5_G\"  \"snp6_T\"  \"snp7_C\"\n#\u003e [8] \"snp8_C\"  \"snp9_C\"  \"snp10_G\"\nwg[\"6008_6008\", \"snp5_G\"]\n#\u003e [1] 0\n```\n\n### Creating a BGData object\n\n`BGData` objects can be created from individual `BEDMatrix` objects or a collection of `BEDMatrix` objects as a `LinkedMatrix` object using the `as.BGData()` function. This will read the .fam and .bim files that come with the .bed files. 
The `alternatePhenotypeFile` parameter points to the file that contains the `FT10` phenotype:\n\n```R\nbg \u003c- as.BGData(wg, alternatePhenotypeFile = paste0(path, \"/pheno.txt\"))\n#\u003e Extracting phenotypes from .fam file, assuming that the .fam file of the first BEDMatrix instance is representative of all the other nodes...\n#\u003e Extracting map from .bim files...\n#\u003e Merging alternate phenotype file...\n```\n\nThe `bg` object will use the `LinkedMatrix` object as genotypes, the .fam file augmented by the `FT10` phenotype as sample information, and the .bim file as variant information.\n\n```R\nstr(bg)\n#\u003e Formal class 'BGData' [package \"BGData\"] with 3 slots\n#\u003e   ..@ geno :Formal class 'ColumnLinkedMatrix' [package \"LinkedMatrix\"] with 1 slot\n#\u003e   .. .. ..@ .Data:List of 3\n#\u003e   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr1.bed]\n#\u003e   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr2.bed]\n#\u003e   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr3.bed]\n#\u003e   ..@ pheno:'data.frame':       199 obs. of  7 variables:\n#\u003e   .. ..$ FID      : int [1:199] 5837 6008 6009 6016 6040 6042 6043 6046 6064 6074 ...\n#\u003e   .. ..$ IID      : int [1:199] 5837 6008 6009 6016 6040 6042 6043 6046 6064 6074 ...\n#\u003e   .. ..$ PAT      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...\n#\u003e   .. ..$ MAT      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...\n#\u003e   .. ..$ SEX      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...\n#\u003e   .. ..$ PHENOTYPE: int [1:199] -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 ...\n#\u003e   .. ..$ FT10     : num [1:199] 57 60 98 75 71 56 90 93 96 91 ...\n#\u003e   ..@ map  :'data.frame':       900 obs. of  6 variables:\n#\u003e   .. ..$ chromosome        : int [1:900] 1 1 1 1 1 1 1 1 1 1 ...\n#\u003e   .. ..$ snp_id            : chr [1:900] \"snp1\" \"snp2\" \"snp3\" \"snp4\" ...\n#\u003e   .. 
..$ genetic_distance  : int [1:900] 0 0 0 0 0 0 0 0 0 0 ...\n#\u003e   .. ..$ base_pair_position: int [1:900] 657 3102 4648 4880 5975 6063 6449 6514 6603 6768 ...\n#\u003e   .. ..$ allele_1          : chr [1:900] \"T\" \"G\" \"A\" \"T\" ...\n#\u003e   .. ..$ allele_2          : chr [1:900] \"C\" \"A\" \"C\" \"C\" ...\n```\n\n### Saving a BGData object\n\nA BGData object can be saved like any other R object using the `save` function:\n\n```R\nsave(bg, file = \"BGData.RData\")\n```\n\n### Loading a BGData object\n\nThe genotypes in a `BGData` object can be of various types, some of which need to be initialized in a particular way. The `load.BGData` function takes care of reloading a saved BGData object properly:\n\n```R\nload.BGData(\"BGData.RData\")\n#\u003e Loaded objects: bg\n```\n\n### Summarizing data\n\nUse `chunkedApply` to apply a function over the genotypes in chunks, e.g., to count missing values per variant:\n\n```R\ncountNAs \u003c- chunkedApply(X = geno(bg), MARGIN = 2, FUN = function(x) sum(is.na(x)))\n```\n\nUse the `summarize` function to calculate minor allele frequencies and the frequency of missing values:\n\n```R\nsummarize(geno(bg))\n```\n\n### Running GWASes with different regression methods\n\nA data structure for genomic data is useful when defining methods that act on both phenotype and genotype information. We have implemented a `GWAS` function that supports various regression methods. 
The formula takes phenotypes from the sample information of the `BGData` object and inserts one marker at a time.\n\n```R\ngwas \u003c- GWAS(formula = FT10 ~ 1, data = bg)\n```\n\n### Generating the G Matrix\n\n```R\nG \u003c- getG(geno(bg))\n```\n\n\nInstallation\n------------\n\nInstall the stable version from CRAN:\n\n```R\ninstall.packages(\"BGData\")\n```\n\nAlternatively, install the development version from GitHub:\n\n```R\n# install.packages(\"remotes\")\nremotes::install_github(\"QuantGen/BGData\")\n```\n\n\nDocumentation\n-------------\n\nFurther documentation can be found on [RDocumentation](https://www.rdocumentation.org/packages/BGData).\n\n\nContributing\n------------\n\n- Issue Tracker: https://github.com/QuantGen/BGData/issues\n- Source Code: https://github.com/QuantGen/BGData\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquantgen%2Fbgdata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquantgen%2Fbgdata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquantgen%2Fbgdata/lists"}