https://github.com/ccbs-stradl/ukb_healthoutcomes_db

Store and work with UK Biobank record-level health outcomes in a SQLite database.

Host: GitHub
URL: https://github.com/ccbs-stradl/ukb_healthoutcomes_db
Owner: ccbs-stradl
Created: 2020-05-18T10:13:33.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2021-11-23T14:34:30.000Z (almost 3 years ago)
Last Synced: 2024-08-02T16:46:59.944Z (3 months ago)
Language: Shell
Homepage:
Size: 492 KB
Stars: 8
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-uk-biobank - ukb_healthoutcomes_db - level health outcomes in a SQLite database | (Data processing / Optical coherence tomography and fundus)

README

# UK Biobank Record Level Health-related Outcomes Database Tools

Commands to load record-level access [Health-related outcomes](http://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=3001) tables into an [SQLite](https://www.sqlite.org) database. Data must be requested from UK Biobank and downloaded as database tables from the Data Portal.

# Introduction

The health outcomes data tables in UKB can each be upwards of 4GB in size and therefore are memory-intensive to work with. By storing the data tables in a database, it is possible to query them without loading all of the data into memory.

### Updates

- Nov 23 2021: Updated parsing of dates in the `hesin` table, which used to be coded as `DDMMYYYY` but are now coded as `DD/MM/YYYY`.

- Aug 11 2020: Added uniqueness checks during data import and transactional commits to main relational tables to handle the database creation being interrupted and restarted.

- Aug 10 2020: Database schema has been [normalised](https://en.wikipedia.org/wiki/Database_normalization) which should make many queries a lot faster, particularly those that involve searching text fields. It also decreases the size of the database by about 7GB.

## List of database tables

### [Hospital inpatient](http://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=2000)

- `hesin`: Master table of administrative records
- `hesin_diag`: Diagnosis codes
- `hesin_oper`: Operations and procedural codes
- `hesin_psych`: Administrative records relating to psychiatry
- `hesin_maternity`: Maternity records of care
- `hesin_delivery`: Children born as the result of a maternity record

### [Primary care](http://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=3001)

- `gp_clinical`: [GP clinical event records](http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=42040)
- `gp_scripts`: [GP prescription records](http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=42039)
- `gp_registrations`: [GP registration records](http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=42038)

## Installation

- Install [SQLite](https://www.sqlite.org/download.html) version > 3.9.0
- Clone the repository
```
git clone git@github.com:ccbs-stradl/ukb_healthoutcomes_db.git
cd ukb_healthoutcomes_db
```

## Downloading the database tables

Assuming you have requested the relevant fields for the record-level health outcomes data as part of an approved UKB application, the full database tables can be downloaded (in tab-separated text format) from the [UKB Data Showcase](http://biobank.ndph.ox.ac.uk/showcase/).

1. Log in to the [AMS Portal](https://bbams.ndph.ox.ac.uk/ams/)
2. Select the **Projects** tab
3. Click the "View/Update" button for the Application you are downloading data for
4. Select the **Data** tab.
5. Click the "Go to Showcase to refresh or download data" button.
6. Select the **Data Portal** tab.
7. Click the "Connect" button.
8. Select the **Table Download** tab.
9. Enter the name of a table (listed above) to download and click the "Fetch Table" button.
10. Use the listed `wget` command with the specified unique URL key to download the table, or click the download link. `wget` is preferred as it names the file correctly; the download link may open the table directly in a browser window, in which case you have to use _Save as..._ to save it.

Example `wget` commands:
```
wget -nd -Ohesin.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ohesin_diag.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ohesin_oper.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ohesin_psych.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ohesin_maternity.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ohesin_delivery.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ogp_clinical.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ogp_scripts.txt https://biota.ndph.ox.ac.uk/...
wget -nd -Ogp_registrations.txt https://biota.ndph.ox.ac.uk/...
```

# Database creation

Move the downloaded text files into the repository directory. They are expected to have the names given to them in the `wget` command (of the form `table_name.txt`). If the tables have different names, are in different locations, or are not available, modify the `import.sql` file as appropriate.
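Before running the creation script, it can help to confirm that every expected file is in place. The following is a minimal sketch and not part of the repository: the helper name `missing_tables` is hypothetical, and the table names are the ones listed earlier in this README.

```shell
# Table names from the "List of database tables" section.
TABLES="hesin hesin_diag hesin_oper hesin_psych hesin_maternity hesin_delivery gp_clinical gp_scripts gp_registrations"

# missing_tables DIR: print the file name of every expected table
# (TABLE.txt) that is not present in DIR. Prints nothing if all are found.
missing_tables() {
    dir="${1:-.}"
    for t in $TABLES; do
        [ -f "$dir/$t.txt" ] || echo "$t.txt"
    done
}

# Check the current directory (the repository directory is assumed).
missing_tables .
```

Any file names printed by this check correspond to tables that would need to be downloaded, or removed from `import.sql`, before the database is created.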

Create the database with
```
sh db.sh
```

The database creation script reads the raw data into the database and then [normalises](https://en.wikipedia.org/wiki/Database_normalization) it into a more compact set of tables to speed up querying.

The database is called `healthoutcomes.db` and can be opened using SQLite:

```
sqlite3 healthoutcomes.db
```

The database has end-user views of the data that conform to the table names and columns of the original data. Dates in the tables are standardised to the format `YYYY-MM-DD`.

The total size of the SQL database file is approximately 19GB.

# Working with the data

The data can be manipulated in R without loading the entire dataset into memory. There are several [R libraries for working with databases](https://db.rstudio.com) such as [RSQLite](https://cran.r-project.org/web/packages/RSQLite/index.html) and [dplyr](https://db.rstudio.com/dplyr/).

```
# install required packages
install.packages(c('dplyr', 'RSQLite', 'dbplyr'))

library(dplyr)

# make connection to database
con <- DBI::dbConnect(RSQLite::SQLite(), 'healthoutcomes.db')

# load hesin table
hesin <- tbl(con, 'hesin')
```

The `hesin` table can then be worked on using dplyr commands like any other `tibble`. Use `select()`, `filter()`, and `summarize()` to identify the subset of the data you need, or transform it by passing expressions with SQL functions to `mutate()`. Once the query is finalized, use `collect()` to import the result into the R workspace for further manipulation or modeling.

## Date information

The health outcomes data tables have date information formatted as either `YYYYMMDD` or `DD-MM-YYYY`. These have been normalized to `YYYY-MM-DD` so that they can be passed to [SQLite's date functions](https://www.sqlite.org/lang_datefunc.html).

```
gp_registrations <- tbl(con, 'gp_registrations')
gp_clinical <- tbl(con, 'gp_clinical')

# Return all registrations from October 2016
# (dates are compared as text, so the day must be zero-padded)
gp_registrations %>% filter(reg_date >= '2016-10-01' & reg_date <= '2016-10-31') %>% arrange(reg_date)

# count how many clinical records are available for each year
gp_clinical %>% group_by(date(event_dt, 'start of year')) %>% tally()
```
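To illustrate what the date normalization involves, here is a minimal shell sketch. It is not the repository's import code, which may work differently. The helper name `to_iso_date` is hypothetical, and it handles only the two `hesin` encodings mentioned in the Updates list (`DD/MM/YYYY` and `DDMMYYYY`). An eight-digit value is assumed to be `DDMMYYYY`, so the sketch would give wrong results for `YYYYMMDD` input.

```shell
# to_iso_date: read one date per line on stdin and rewrite
# DD/MM/YYYY or DDMMYYYY as YYYY-MM-DD.
# Lines in any other format are passed through unchanged.
to_iso_date() {
    sed -E \
        -e 's#^([0-9]{2})/([0-9]{2})/([0-9]{4})$#\3-\2-\1#' \
        -e 's#^([0-9]{2})([0-9]{2})([0-9]{4})$#\3-\2-\1#'
}

# Example input: one date in each encoding.
printf '23/11/2021\n11082020\n' | to_iso_date
```

The example prints `2021-11-23` and `2020-08-11`. Zero-padded `YYYY-MM-DD` strings sort in chronological order, which is why plain text comparisons on the date columns behave as expected.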

## Searching by Drug Name

Prescriptions in the `gp_scripts` table are variously coded with Read v2, BNF, or DMD codes depending on the data source. Many entries also have a plain text name of the drug in the `drug_name` column (along with dosage size information). The drug name can be searched using the `%LIKE%` operator, with the name of the drug surrounded by `%` wildcards (for example `"%Amitriptyline%"`).

```
gp_scripts <- tbl(con, 'gp_scripts')
gp_scripts %>% filter(drug_name %LIKE% "%Amitriptyline%")
```

## Issues

### Codings

Many columns are stored as coded integers rather than strings. The [HES Data Dictionary](http://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=141140) lists the structure of each table and the data coding for each column. Data codings can be searched for on the [UKB Showcase](http://biobank.ndph.ox.ac.uk/showcase/search.cgi) and inspected or downloaded as a text file. For example, the `source` column of the `hesin` table has [Data-Coding 263](http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=263).

# Schema

For each table `TABLE` there is an underlying data representation called `TABLE_data` with foreign key links between them using the `eids` table. Most `TEXT` columns (excepting date fields) are normalized to separate tables called `TABLE_FIELD` linked with a foreign key `FIELD_id`.

![Database schema](docs/schema.png)
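
The naming pattern described above can be sketched with a throwaway in-memory database. This is an illustration only and not the repository's actual schema: the column layout (including `issue_date`), the helper name `schema_demo`, and all rows are made up for the example. Only the `TABLE_data` / `TABLE_FIELD` / `FIELD_id` naming and the `eids` table come from the description.

```shell
# schema_demo: build a toy normalized layout for gp_scripts in memory
# and query it through a view named after the original table.
schema_demo() {
    sqlite3 -list -noheader :memory: <<'SQL'
-- Participant identifiers shared by all tables.
CREATE TABLE eids (eid INTEGER PRIMARY KEY);

-- TABLE_FIELD: each distinct drug name is stored once.
CREATE TABLE gp_scripts_drug_name (
    drug_name_id INTEGER PRIMARY KEY,
    drug_name    TEXT UNIQUE
);

-- TABLE_data: rows point at the text value through FIELD_id.
CREATE TABLE gp_scripts_data (
    eid          INTEGER REFERENCES eids(eid),
    issue_date   TEXT,
    drug_name_id INTEGER REFERENCES gp_scripts_drug_name(drug_name_id)
);

-- End-user view that restores the original column layout.
CREATE VIEW gp_scripts AS
SELECT d.eid, d.issue_date, n.drug_name
FROM gp_scripts_data AS d
LEFT JOIN gp_scripts_drug_name AS n USING (drug_name_id);

-- Made-up rows for illustration.
INSERT INTO eids VALUES (1), (2);
INSERT INTO gp_scripts_drug_name (drug_name)
VALUES ('Amitriptyline 10mg tablets'), ('Paracetamol 500mg tablets');
INSERT INTO gp_scripts_data
VALUES (1, '2016-10-03', 1), (2, '2016-10-05', 1), (2, '2016-11-01', 2);

SELECT eid, issue_date, drug_name
FROM gp_scripts
WHERE drug_name LIKE '%Amitriptyline%'
ORDER BY eid;
SQL
}

if command -v sqlite3 >/dev/null 2>&1; then
    schema_demo
else
    echo "sqlite3 not found on PATH" >&2
fi
```

Because each repeated string is stored once in the small `gp_scripts_drug_name` table, a text search scans far fewer rows than it would against the raw data, which is consistent with the speed and size improvements noted in the Updates list.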