https://github.com/bjcairns/ukbschemas

Use R to generate a database containing the UK Biobank data schemas from http://biobank.ctsu.ox.ac.uk/crystal/schema.cgi
r r-package rstats sqlite uk-biobank

Host: GitHub
URL: https://github.com/bjcairns/ukbschemas
Owner: bjcairns
License: other
Created: 2019-03-15T17:32:53.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2021-02-26T10:06:01.000Z (over 3 years ago)
Last Synced: 2024-01-17T01:10:15.157Z (6 months ago)
Topics: r, r-package, rstats, sqlite, uk-biobank
Language: R
Homepage:
Size: 218 KB
Stars: 18
Watchers: 5
Forks: 3
Open Issues: 8
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

Lists

awesome-uk-biobank - ukbschemas

README

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

# Download the files?
do_dl <- FALSE
save_load_file <- "~/ukbschemas-test-data/ukbschemas_db_test.sqlite"
```

# ukbschemas

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)

[![Build Status](https://travis-ci.com/bjcairns/ukbschemas.svg?token=tA2cYTLpigx5VuTgcHFd&branch=master)](https://travis-ci.com/bjcairns/ukbschemas)

This R package can be used to create and/or load a database containing the [UK Biobank Data Showcase schemas](http://biobank.ctsu.ox.ac.uk/crystal/schema.cgi), which are data dictionaries describing the structure of the UK Biobank main dataset.

## Installation

You can install the current version of ukbschemas from [GitHub](https://github.com/) with:

```{r install, eval=FALSE}
# install.packages("devtools")
devtools::install_github("bjcairns/ukbschemas")
library(ukbschemas)
```

## Examples

```{r prelim, echo=FALSE, include=FALSE}
try(rm(list = c("db", "sch")))
library(ukbschemas)
```

The package supports two workflows. 

#### Save-Load workflow (recommended)

The recommended approach is to use `ukbschemas_db()` to download the schema tables and save them to an SQLite database, then use `load_db()` to load the tables from the database and store them as tibbles in a named list:

```{r create, eval=do_dl}
db <- ukbschemas_db(path = tempdir())
sch <- load_db(db = db)
```

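```{r info, echo=FALSE, include=FALSE}
# Copy the freshly created database to the persistent test location
if (do_dl) {
  # file.path() keeps the path portable (the original used a Windows-only "\\")
  file <- file.path(tempdir(), paste0("ukb-schemas-", Sys.Date(), ".sqlite"))
  file.copy(file, save_load_file)
}

# Size (in MB) and modification time of the saved database, used in the text
finfo <- file.info(save_load_file)
fsize <- round(finfo$size / 1e6, 1)
fmtime <- finfo$mtime
```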

By default, the database is named `ukb-schemas-YYYY-MM-DD.sqlite` (where `YYYY-MM-DD` is the current date) and placed in the current working directory. (In the example above, `path = tempdir()` puts it in the session's temporary directory instead.) At the most recent compilation of the database (`r format(fmtime, "%d %B %Y")`), the .sqlite database file produced by `ukbschemas_db()` was approximately `r fsize` MB in size.
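
As a rough illustration of the naming convention just described, the default file name can be rebuilt in base R. This is a sketch based only on the pattern stated here, not the package's internal code; `default_db_path()` is a hypothetical helper introduced for the example.

```r
# Hypothetical helper: rebuild the default database path described above.
# Assumes the pattern "ukb-schemas-YYYY-MM-DD.sqlite"; not part of ukbschemas.
default_db_path <- function(path = ".", date = Sys.Date()) {
  file.path(path, paste0("ukb-schemas-", format(date, "%Y-%m-%d"), ".sqlite"))
}

# Where a database created today would be expected in the temporary directory
db_file <- default_db_path(tempdir())
file.exists(db_file)

# A fixed date gives a predictable name
fixed_name <- default_db_path(".", as.Date("2021-02-26"))
fixed_name
```

A helper like this can be handy for checking whether a database has already been created today before downloading the schemas again.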

Note that without further arguments, `ukbschemas_db()` tidies up the database to give it a more consistent relational structure (the changes are summarised in the output of the first example, above). Alternatively, the raw data can be kept as downloaded by setting the `as_is` argument:

```{r create_as_is, eval=FALSE}
db <- ukbschemas_db(path = tempdir(), overwrite = TRUE, as_is = TRUE)
```

The `overwrite` argument controls whether an existing database file is replaced: `TRUE` overwrites it, `FALSE` prevents overwriting, and if the argument is not specified in an interactive session (`interactive() == TRUE`), the user is prompted to decide.
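
The decision rule for overwriting can be sketched in base R as follows. This is only an illustration of the behaviour described here, not the package's code: `should_write()` is a hypothetical helper, `NULL` stands in for "not specified", and refusing to overwrite in a non-interactive session is an assumption.

```r
# Hypothetical sketch of the overwrite rule; not the package's actual code.
# overwrite = NULL stands for "not specified" (an assumption for illustration).
should_write <- function(file, overwrite = NULL) {
  if (!file.exists(file)) return(TRUE)               # nothing to overwrite
  if (!is.null(overwrite)) return(isTRUE(overwrite)) # explicit TRUE or FALSE
  if (interactive()) {
    answer <- readline(paste0("Overwrite ", file, "? [y/N] "))
    return(tolower(trimws(answer)) %in% c("y", "yes"))
  }
  FALSE  # assumption: do not overwrite when nobody can be asked
}

# Try the rule on an existing (empty) file
existing <- tempfile(fileext = ".sqlite")
file.create(existing)
write_if_true  <- should_write(existing, overwrite = TRUE)
write_if_false <- should_write(existing, overwrite = FALSE)
write_if_new   <- should_write(tempfile(fileext = ".sqlite"))
```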

**Note:** If you have created a schemas database with an earlier version of ukbschemas, it should be possible to load that database with the latest version of `load_db()`, which (currently) should load any SQLite database, regardless of contents.
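
To show what reading an arbitrary SQLite database into a named list can look like, here is a minimal sketch using the DBI and RSQLite packages (assumed to be installed). It is not the implementation of `load_db()`: `read_all_tables()` is a hypothetical helper, the toy table contents are made up, and it returns plain data frames rather than tibbles.

```r
# Sketch only: read every table of an SQLite file into a named list.
# read_all_tables() is a hypothetical helper, not the package's load_db().
read_all_tables <- function(file) {
  con <- DBI::dbConnect(RSQLite::SQLite(), file)
  on.exit(DBI::dbDisconnect(con), add = TRUE)  # always close the connection
  tbl_names <- DBI::dbListTables(con)
  tables <- lapply(tbl_names, function(nm) DBI::dbReadTable(con, nm))
  stats::setNames(tables, tbl_names)
}

# Toy database standing in for a schemas file (made-up contents)
toy_file <- tempfile(fileext = ".sqlite")
con <- DBI::dbConnect(RSQLite::SQLite(), toy_file)
DBI::dbWriteTable(
  con, "categories",
  data.frame(category_id = 1:2, title = c("A", "B"))
)
DBI::dbDisconnect(con)

toy <- read_all_tables(toy_file)
names(toy)
```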

#### Load-Save workflow

The second approach is to download the schemas and store them in memory in a list, and save them to a database only as required.

This is **not** recommended, because it is better (for everyone) not to download the schema files every time they are needed, and because the database assumes a certain structure that should be guaranteed when the database is saved. If you still want to take this approach, use:

```{r inmemory, eval=FALSE}
sch <- ukbschemas()
db <- save_db(sch, path = tempdir())
```

## Why R?

This package was originally written in bash (a Unix shell scripting language). However, R is more accessible and all dependencies are installed along with the package; there is no need to install any secondary software (not even SQLite).

## Notes

#### Design notes

* All the encoding value tables (`esimpint`, `esimpstring`, `esimpreal`, `esimpdate`, `ehierint`, `ehierstring`) have been harmonised and combined into a single table `encvalues`. The `value` column in `encvalues` has type `TEXT`, but a `type` column has been added in case the original type of a value is not clear from context. The original type-specific tables have been deleted.

* To avoid redundancy, category parent-child relationships have been moved to table `categories`, as column `parent_id`, from table `catbrowse` (which has been deleted).

* In the `fields` schema, the `main_category` column, which identifies the category to which a field belongs, has been renamed to `category_id` for consistency with the `categories` schema.

* Details of several of the field properties (`value_type`, `stability`, `item_type`, `strata` and `sexed`) are available elsewhere on the Data Showcase. These have been added manually to tables `valuetypes`, `stability`, `itemtypes`, `strata` and `sexed`, and appropriate ID references have been renamed with the `_id` suffix in tables `fields` and `encodings`.

* There are several columns in the tables which are not well-documented (e.g. `base_type` in `fields`, `availability` in `encodings` and `categories`, and others). Additional tables documenting these encoded values may be included in future versions (and suggestions are welcome).
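
As an illustration of the `parent_id` design described in the notes above, the following base R sketch walks up a category hierarchy. The table is a toy stand-in with made-up IDs and titles; only the `category_id` and `parent_id` column names come from the notes, while the `title` column, the use of `NA` for top-level categories, and the `category_path()` helper are assumptions.

```r
# Toy stand-in for the tidied `categories` table; IDs and titles are made up.
# Assumption: top-level categories have a missing (NA) parent_id.
categories <- data.frame(
  category_id = c(1, 2, 3),
  title       = c("Top level", "Middle", "Leaf"),
  parent_id   = c(NA, 1, 2)
)

# Hypothetical helper: follow parent_id upwards and return the path of titles.
# Assumes every id it meets exists in the table.
category_path <- function(id, categories) {
  path <- character(0)
  while (!is.na(id)) {
    row  <- categories[categories$category_id == id, ]
    path <- c(row$title, path)  # prepend so the root comes first
    id   <- row$parent_id
  }
  paste(path, collapse = " > ")
}

leaf_path <- category_path(3, categories)
leaf_path
```

Keeping the parent reference in the same table means a lookup like this needs only one table, which is the point of dropping `catbrowse`.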

#### Known code issues

* The UK Biobank data schemas are regularly updated as new data are added to the system. ukbschemas does not currently include a facility for updating the database; it is necessary to create a new database. 

* Because `readr::read_csv()` reads whole numbers as type `double`, not `integer` (which avoids overflow for values outside R's 32-bit integer range), column types in schemas loaded in R will differ depending on whether the schemas are loaded directly into R or first saved to a database. This should make little or no difference for most applications.

* Any [other issues](https://github.com/bjcairns/ukbschemas/issues).
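
The practical effect of the `double` versus `integer` difference noted above can be checked in base R. This is a small sketch with made-up values, not output from the package: the two types compare equal by value but are not identical, and a base R `merge()` on such columns still matches rows.

```r
x_dbl <- 20002   # whole number stored as double
x_int <- 20002L  # the same number stored as integer

types     <- c(typeof(x_dbl), typeof(x_int))
same_val  <- x_dbl == x_int          # comparison by value
same_type <- identical(x_dbl, x_int) # strict comparison, including type

# Toy tables keyed on a double and an integer column respectively
a <- data.frame(id = c(1, 2),   v = c("a", "b"))
b <- data.frame(id = c(1L, 2L), w = c("x", "y"))
joined <- merge(a, b, by = "id")
nrow(joined)
```

Code that tests types strictly (for example with `identical()` or `is.integer()`) is where the difference is most likely to matter.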