Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/BenevolentAI/ukbiobank-loaders
- Host: GitHub
- URL: https://github.com/BenevolentAI/ukbiobank-loaders
- Owner: BenevolentAI
- License: mit
- Created: 2023-03-29T09:46:20.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-05-23T10:17:34.000Z (over 1 year ago)
- Last Synced: 2024-11-02T03:17:48.822Z (7 days ago)
- Language: Python
- Size: 9.69 MB
- Stars: 8
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-uk-biobank - ukbiobank-loaders
README
# ukbiobank-loaders
This repository provides an easy way to load UK Biobank data. It consists of a pre-processing script, which converts the UK Biobank data into Parquet files that are easier to read, and a library that provides methods to access the data.
## Installation
To install this package, simply run
```bash
pip install ukbiobank-loaders
```
Please note that Python 3.7 or newer is required.
## Usage
The sections below describe how to use this library. Data can be read from both local directories and AWS S3 directories.
### Pre-processing
The pre-processing requires the following UK Biobank files, all saved in the same directory:
```
death.txt
death_cause.txt
gp_clinical.txt
gp_scripts.txt
hesin.txt
hesin_diag.txt
hesin_oper.txt
```
Additionally, the withdrawn consent file is needed:
```
withdrawn_consent.txt
```
From the terminal, run
```bash
update_data.py --raw_dir --withdrawn_file --out_dir
```
The processed data will be saved in a folder named `/final`.
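Before launching the pre-processing, it can help to confirm that the raw directory is complete. The following is a minimal sketch and not part of this package: the helper name `missing_raw_files` is hypothetical, and it assumes the seven files listed above sit directly inside the raw directory under exactly those names.

```python
from pathlib import Path

# File names taken from the list above. withdrawn_consent.txt is not checked
# here because it is passed through its own argument.
REQUIRED_RAW_FILES = [
    "death.txt",
    "death_cause.txt",
    "gp_clinical.txt",
    "gp_scripts.txt",
    "hesin.txt",
    "hesin_diag.txt",
    "hesin_oper.txt",
]


def missing_raw_files(raw_dir):
    """Return the required file names absent from raw_dir (hypothetical helper)."""
    raw_path = Path(raw_dir)
    return [name for name in REQUIRED_RAW_FILES if not (raw_path / name).is_file()]
```

An empty list from `missing_raw_files("path/to/raw")` means all seven files were found.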
We found this process to take about 14 minutes on a pod with 4 CPUs and 32 GB of RAM. If the process is killed, it might be because there is not enough RAM available.
### Accessing the data
This is a simple example of how to use the library. Detailed documentation of the methods is given below.
```python
>>> from ukbb_loaders.loaders import load
>>> dl = load.DataLoader(data_dir = "/final")
>>> dl.get_hospital_data("icd10")
date_of_visit source feature value
eid
68 1986-04-22 icd10 N181 1
68 1945-05-03 icd10 N181 1
68 1950-04-03 icd10 N181 1
68 1966-08-07 icd10 N181 1
67 1991-03-12 icd10 N181 1
.. ... ... ... ...
73 NaT icd10 N181 1
48 1997-06-20 icd10 N181 1
48 1945-03-05 icd10 N181 1
48 1956-02-25 icd10 N181 1
48 1981-04-08 icd10 N181 1
```
### Documentation for ukbb\_loaders.loaders
### Table of Contents
* [ukbb\_loaders.utilities.util](#ukbb_loaders.utilities.util)
  * [load\_lookup](#ukbb_loaders.utilities.util.load_lookup)
  * [load\_mapper](#ukbb_loaders.utilities.util.load_mapper)
* [ukbb\_loaders.loaders.load](#ukbb_loaders.loaders.load)
  * [DataLoader](#ukbb_loaders.loaders.load.DataLoader)
    * [\_\_init\_\_](#ukbb_loaders.loaders.load.DataLoader.__init__)
    * [get\_hospital\_data](#ukbb_loaders.loaders.load.DataLoader.get_hospital_data)
    * [get\_death\_data](#ukbb_loaders.loaders.load.DataLoader.get_death_data)
    * [get\_gp\_clinical\_data](#ukbb_loaders.loaders.load.DataLoader.get_gp_clinical_data)
    * [get\_gp\_medication\_data](#ukbb_loaders.loaders.load.DataLoader.get_gp_medication_data)
### ukbb\_loaders.utilities.util
#### load\_lookup
```python
def load_lookup(lookup_name: str) -> pd.DataFrame
```
Loads lookup table.
**Arguments**:
- `lookup_name` _str_ - The name of the lookup table to be loaded.
**Returns**:
- `(pd.DataFrame)` - The lookup table of interest.
**Example**:
Load lookup of ICD10 diagnosis codes:
>>> load_lookup("ehr_diagnosis_icd10")
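A lookup table is typically joined onto the long dataframes returned by the loaders to attach human-readable descriptions. The sketch below uses small hand-made frames rather than real UK Biobank data. The lookup column names `code` and `description` are assumptions made for illustration, so check the actual columns of the table returned by `load_lookup` before adapting it.

```python
import pandas as pd

# Mock of the long format returned by the loaders (eid index, one row per code).
records = pd.DataFrame(
    {"source": ["icd10", "icd10"], "feature": ["N181", "I10"], "value": [1, 1]},
    index=pd.Index([68, 67], name="eid"),
)

# Hypothetical lookup table; the real column names may differ.
lookup = pd.DataFrame(
    {
        "code": ["N181", "I10"],
        "description": [
            "Chronic kidney disease, stage 1",
            "Essential (primary) hypertension",
        ],
    }
)

# A left join keeps every record. The index is reset before the merge and
# restored afterwards, because merge on columns discards the original index.
decoded = (
    records.reset_index()
    .merge(lookup, how="left", left_on="feature", right_on="code")
    .drop(columns="code")
    .set_index("eid")
)
```

With `how="left"`, codes missing from the lookup stay in the result with an empty description instead of being dropped.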
#### load\_mapper
```python
def load_mapper(mapper_name: str) -> pd.DataFrame
```
Loads ontology mapper.
**Arguments**:
- `mapper_name` _str_ - The name of the mapper to be loaded.
**Returns**:
- `(pd.DataFrame)` - The mapper of interest.
**Example**:
Load mapping from ICD10 codes to Phecodes:
>>> load_mapper("icd10_to_phecodes")
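One way to apply such a mapper is to translate the `feature` column of a long dataframe and keep track of codes that have no mapping. This is a sketch on hand-made data: the mapper column names `icd10` and `phecode`, and the target values `PHE_A` and `PHE_B`, are placeholders for illustration and are not taken from the package's mapping file.

```python
import pandas as pd

# Mock long-format records; "Z999" is deliberately absent from the mapper.
records = pd.DataFrame(
    {"feature": ["N181", "I10", "Z999"], "value": [1, 1, 1]},
    index=pd.Index([68, 67, 48], name="eid"),
)

# Hypothetical mapper with placeholder target codes.
mapper = pd.DataFrame({"icd10": ["N181", "I10"], "phecode": ["PHE_A", "PHE_B"]})

# Build a code -> phecode Series and map it onto the feature column;
# codes without a mapping become NaN.
code_to_phecode = mapper.set_index("icd10")["phecode"]
records["phecode"] = records["feature"].map(code_to_phecode)

# Collect the unmapped codes so they can be reviewed rather than silently lost.
unmapped = records.loc[records["phecode"].isna(), "feature"].tolist()
```

`Series.map` needs unique keys in the mapping, so if one source code maps to several targets a merge is the more appropriate tool.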
### ukbb\_loaders.loaders.load
Loaders for versioned UKBB data.
### DataLoader Objects
```python
class DataLoader()
```
#### \_\_init\_\_
```python
def __init__(data_dir: str)
```
Class for loading UKBB data.
**Arguments**:
- `data_dir` _str_ - The path to the directory containing the processed data.
  Note that on Windows the path must use forward slashes,
  e.g. "C:/Users/john/Documents/data_dir"
#### get\_hospital\_data
```python
def get_hospital_data(source: Union[str, List[str]],
                      level=None,
                      patient_list: np.ndarray = None) -> pd.DataFrame
```
Method that fetches hospital data for the UKBB population.
**Arguments**:
- `source` _str or list_ - The coding/representation/source we would like to fetch.
  It needs to be one or more of:
  - `icd10` - for fetching all icd10 related diagnoses.
  - `icd9` - for fetching all icd9 related diagnoses.
  - `opcs3` - for fetching all opcs3 related operation codes.
  - `opcs4` - for fetching all opcs4 related operation codes.
- `level` _list or string_ - The level/significance of diagnoses we would like to fetch.
  It needs to be one or more of:
  - `primary` - for fetching only the primary code related to one diagnosis.
  - `secondary` - for fetching all the secondary (complementary) codes for one
    diagnosis.
  - `external` - for fetching diagnosis codes from external sources.

  Defaults to all of them.
- `patient_list` _np.ndarray_ - The patients to fetch characteristics for. If this is empty,
  all UKBB patients will be used.

**Returns**:
- `df` _pd.DataFrame_ - A long canonical dataframe with patients as the index and the
  following columns:
  - date_of_visit: pandas datetime for each hospital visit
  - feature: the different codes used (e.g. the different icd10 codes)
  - source: the source the feature refers to (e.g. icd10)
  - value: the occurrence value for each row combination (initially 1)
#### get\_death\_data
```python
def get_death_data(level=None,
                   patient_list: np.ndarray = None) -> pd.DataFrame
```
Method that fetches death information for the UKBB population.
**Arguments**:
- `level` _list or string_ - The level/significance of the causes of death we would like to fetch.
  It needs to be one or both of: primary (main cause of death), secondary. Defaults to both.
- `patient_list` _np.ndarray_ - The patients to fetch characteristics for.
  If this is empty, all UKBB patients will be used.

**Returns**:
- `df` _pd.DataFrame_ - A long canonical dataframe with patients as the index and all
  recorded death information including death date in the right format.
#### get\_gp\_clinical\_data
```python
def get_gp_clinical_data(source=None, patient_list: np.ndarray = None)
```
Method that fetches GP diagnosis information for the UKBB population.
**Arguments**:
- `source` _str or list_ - Whether to load read_2, read_3 or both. Defaults to both.
- `patient_list` _np.ndarray_ - The patients to fetch characteristics for.
  If this is empty, all UKBB patients will be used.

**Returns**:
- `df` _pd.DataFrame_ - A long canonical dataframe with patients as the index and all
  recorded GP information including date in the right format.
#### get\_gp\_medication\_data
```python
def get_gp_medication_data(patient_list: np.ndarray = None) -> pd.DataFrame
```
Method that fetches GP medication data for the UKBB population.
**Arguments**:
- `patient_list` _np.ndarray_ - The patients to fetch medication data for.
  If this is empty, all UKBB patients will be used.

**Returns**:
- `df` _pd.DataFrame_ - A canonical long dataframe with patients as the index and
  features as columns.
## Acknowledgments
This package is developed using the UK Biobank Resource under Application Number 43138.