Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bjcairns/ukbschemas
Use R to generate a database containing the UK Biobank data schemas from http://biobank.ctsu.ox.ac.uk/crystal/schema.cgi
https://github.com/bjcairns/ukbschemas
r r-package rstats sqlite uk-biobank
Last synced: 2 months ago
JSON representation
Use R to generate a database containing the UK Biobank data schemas from http://biobank.ctsu.ox.ac.uk/crystal/schema.cgi
- Host: GitHub
- URL: https://github.com/bjcairns/ukbschemas
- Owner: bjcairns
- License: other
- Created: 2019-03-15T17:32:53.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2021-02-26T10:06:01.000Z (almost 4 years ago)
- Last Synced: 2024-08-02T16:47:20.290Z (5 months ago)
- Topics: r, r-package, rstats, sqlite, uk-biobank
- Language: R
- Homepage:
- Size: 218 KB
- Stars: 18
- Watchers: 5
- Forks: 3
- Open Issues: 8
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- awesome-uk-biobank - ukbschemas
README
---
output: github_document
---```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)# Download the files?
do_dl <- FALSE
save_load_file <- "~/ukbschemas-test-data/ukbschemas_db_test.sqlite"```
# ukbschemas[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
[![Build Status](https://travis-ci.com/bjcairns/ukbschemas.svg?token=tA2cYTLpigx5VuTgcHFd&branch=master)](https://travis-ci.com/bjcairns/ukbschemas)This R package can be used to create and/or load a database containing the [UK Biobank Data Showcase schemas](http://biobank.ctsu.ox.ac.uk/crystal/schema.cgi), which are data dictionaries describing the structure of the UK Biobank main dataset.
## Installation
You can install the current version of ukbschemas from [GitHub](https://github.com/) with:
```{r install, eval=FALSE}
# install.packages("devtools")
devtools::install_github("bjcairns/ukbschemas")library(ukbschemas)
```## Examples
```{r prelim, echo=FALSE, include=FALSE}
try(rm(list=c("db","sch")))
library(ukbschemas)
```The package supports two workflows.
#### Save-Load workflow (recommended)
The recommended approach is to use `ukbschemas_db()` to download the schema tables and save them to an SQLite database, then use `load_db()` to load the tables from the database and store them as tibbles in a named list:
```{r create, eval=do_dl}
db <- ukbschemas_db(path = tempdir())
sch <- load_db(db = db)
``````{r info, echo=FALSE, include=FALSE}
if (do_dl) {
file <- paste0(tempdir(), "\\ukb-schemas-", Sys.Date(), ".sqlite")
file.copy(file, save_load_file)
}finfo <- file.info(save_load_file)
fsize <- round(finfo$size/1e6, 1)
fmtime <- finfo$mtime
```By default, the database is named `ukb-schemas-YYYY-MM-DD.sqlite` (where `YYYY-MM-DD` is the current date) and placed in the current working directory. (`path = tempdir()` in the above example puts it in the current temporary directory instead.) At the most recent compilation of the database (`r format(fmtime, "%d %B %Y")`), the size of the .sqlite database file produced by `ukbschemas_db()` was approximately `r fsize`MB.
Note that without further arguments, `ukbschemas_db()` tidies up the database to give it a more consistent relational structure (the changes are summarised in the output of the first example, above). Alternatively the raw data can be loaded with the `as_is` argument:
```{r create_as_is, eval=FALSE}
db <- ukbschemas_db(path = tempdir(), overwrite = TRUE, as_is = TRUE)
```The `overwrite` option allows the database file to be overwritten (if `TRUE`), or prevents this (`FALSE`), or if not specified and the session is interactive (`interactive() == TRUE`) then the user is prompted to decide.
**Note:** If you have created a schemas database with an earlier version of ukbschemas, it should be possible to load that database with the latest version of `load_db()`, which (currently) should load any SQLite database, regardless of contents.
#### Load-Save workflow
The second approach is to download the schemas and store them in memory in a list, and save them to a database only as requried.
This is **not** recommended, because it is better (for everyone) not to download the schema files every time they are needed, and because the database assumes a certain structure that should be guaranteed when the database is saved. If you still want to take this approach, use:
```{r inmemory, eval=FALSE}
sch <- ukbschemas()
db <- save_db(sch, path = tempdir())
```## Why R?
This package was originally written in bash (a Unix shell scripting language). However, R is more accessible and all dependencies are loaded when you install the package; there is no need to install any secondary software (not even SQLite).
## Notes
#### Design notes
* All the encoding value tables (`esimpint`, `esimpstring`, `esimpreal`, `esimpdate`, `ehierint`, `ehierstring`) have been harmonised and combined into a single table `encvalues`. The `value` column in `encvalues` has type `TEXT`, but a `type` column has been added in case the value is not clear from context. The original type-specific tables have been deleted.
* To avoid redunancy, category parent-child relationships have been moved to table `categories`, as column `parent_id`, from table `catbrowse` (which has been deleted).
* Reference to the category to which a field belongs is in the `main_category` column in the `fields` schema, but has been renamed to `category_id` for consistency with the `categories` schema.
* Details of several of the field properties (`value_type`, `stability`, `item_type`, `strata` and `sexed`) are available elsewhere on the Data Showcase. These have been added manually to tables `valuetypes`, `stability`, `itemtypes`, `strata` and `sexed`, and appropriate ID references have been renamed with the `_id` suffix in tables `fields` and `encodings`.
* There are several columns in the tables which are not well-documented (e.g. `base_type` in fields, `availability` in `encodings` and `categories`, and others). Additional tables documenting these encoded values may be included in future versions (and suggestions are welcome).#### Known code issues
* The UK Biobank data schemas are regularly updated as new data are added to the system. ukbschemas does not currently include a facility for updating the database; it is necessary to create a new database.
* Because `readr::read_csv()` reads whole numbers as type `double`, not `integer` (allowing 64-bit integers without loss of information), column types in schemas loaded in R will differ depending on whether the schemas are loaded directly to R or first saved to a database. This should make little or no difference for most applications.
* Any [other issues](https://github.com/bjcairns/ukbschemas/issues).