{"id":13593269,"url":"https://github.com/BenevolentAI/ukbiobank-loaders","last_synced_at":"2025-04-09T02:33:21.043Z","repository":{"id":149048972,"uuid":"620748303","full_name":"BenevolentAI/ukbiobank-loaders","owner":"BenevolentAI","description":null,"archived":false,"fork":false,"pushed_at":"2023-05-23T10:17:34.000Z","size":10158,"stargazers_count":8,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-07T02:51:13.218Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BenevolentAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-03-29T09:46:20.000Z","updated_at":"2024-03-20T06:33:28.000Z","dependencies_parsed_at":"2024-01-16T22:20:19.342Z","dependency_job_id":"4d3dfb7f-5c98-4352-98c0-21ae52e84c7a","html_url":"https://github.com/BenevolentAI/ukbiobank-loaders","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BenevolentAI%2Fukbiobank-loaders","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BenevolentAI%2Fukbiobank-loaders/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BenevolentAI%2Fukbiobank-loaders/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BenevolentAI%2Fukbiobank-loaders/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BenevolentAI","download_url":"https://codeload.github.com/Bene
volentAI/ukbiobank-loaders/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247965840,"owners_count":21025446,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T16:01:18.604Z","updated_at":"2025-04-09T02:33:16.034Z","avatar_url":"https://github.com/BenevolentAI.png","language":"Python","funding_links":[],"categories":["Data processing"],"sub_categories":["Optical coherence tomography and fundus"],"readme":"# ukbiobank-loaders\n\nThis repository provides an easy way to load UK Biobank data. It is composed of a pre-processing script, which converts the UK Biobank data into Parquet files that are easier to read,\nand a library that provides different methods to access the data.\n\n## Installation\nTo install this package, simply run\n```bash\npip install ukbiobank-loaders\n```\nPlease note that Python 3.7 or newer is required.\n\n## Usage\n\nWe will now describe how to use this library. 
Please note that data can be read from both local directories and AWS S3 directories.\n\n### Pre-processing\nThe following UK Biobank files are needed to run the pre-processing, all saved in the same directory \u003cDATA_FOLDER\u003e:\n```\ndeath.txt\ndeath_cause.txt\ngp_clinical.txt\ngp_scripts.txt\nhesin.txt\nhesin_diag.txt\nhesin_oper.txt\n```\n\nAdditionally, the withdrawn consent file is needed:\n```\nwithdrawn_consent.txt\n```\n\nFrom the terminal, run\n```bash\nupdate_data.py --raw_dir \u003cDATA_FOLDER\u003e --withdrawn_file \u003cWITHDRAWN_CONSENT_FILE_PATH\u003e --out_dir \u003cOUTPUT_DIR_FOLDER\u003e\n```\n\nThe processed data will be saved in a folder named `\u003cOUTPUT_DIR_FOLDER\u003e/final`.\n\nWe found this process to take about 14 minutes on a pod with 4 CPUs and 32 GB of RAM. If the process is killed, it might be\nbecause there is not enough RAM available.\n\n### Accessing the data\n\nThis is a simple example of how to use the library. Specific documentation about the methods is given below.\n```python\n\u003e\u003e\u003e from ukbb_loaders.loaders import load\n\u003e\u003e\u003e dl = load.DataLoader(data_dir = \"\u003cOUTPUT_DIR_FOLDER\u003e/final\")\n\u003e\u003e\u003e dl.get_hospital_data(\"icd10\")\n    date_of_visit source feature  value\neid\n68     1986-04-22  icd10    N181      1\n68     1945-05-03  icd10    N181      1\n68     1950-04-03  icd10    N181      1\n68     1966-08-07  icd10    N181      1\n67     1991-03-12  icd10    N181      1\n..            ...    ...     ...    
...\n73            NaT  icd10    N181      1\n48     1997-06-20  icd10    N181      1\n48     1945-03-05  icd10    N181      1\n48     1956-02-25  icd10    N181      1\n48     1981-04-08  icd10    N181      1\n```\n\n### Documentation for ukbb\\_loaders.loaders\n\n### Table of Contents\n\n* [ukbb\\_loaders.utilities.util](#ukbb_loaders.utilities.util)\n  * [load\\_lookup](#ukbb_loaders.utilities.util.load_lookup)\n  * [load\\_mapper](#ukbb_loaders.utilities.util.load_mapper)\n* [ukbb\\_loaders.loaders.load](#ukbb_loaders.loaders.load)\n  * [DataLoader](#ukbb_loaders.loaders.load.DataLoader)\n    * [\\_\\_init\\_\\_](#ukbb_loaders.loaders.load.DataLoader.__init__)\n    * [get\\_hospital\\_data](#ukbb_loaders.loaders.load.DataLoader.get_hospital_data)\n    * [get\\_death\\_data](#ukbb_loaders.loaders.load.DataLoader.get_death_data)\n    * [get\\_gp\\_clinical\\_data](#ukbb_loaders.loaders.load.DataLoader.get_gp_clinical_data)\n    * [get\\_gp\\_medication\\_data](#ukbb_loaders.loaders.load.DataLoader.get_gp_medication_data)\n\n\u003ca id=\"ukbb_loaders\"\u003e\u003c/a\u003e\n\n### ukbb\\_loaders.utilities.util\n\n\u003ca id=\"ukbb_loaders.utilities.util.load_lookup\"\u003e\u003c/a\u003e\n\n#### load\\_lookup\n\n```python\ndef load_lookup(lookup_name: str) -\u003e pd.DataFrame\n```\n\nLoads lookup table.\n\n**Arguments**:\n\n- `lookup_name` _str_ - The name of the lookup table to be loaded.\n  \n\n**Returns**:\n\n- `(pd.DataFrame)` - The lookup table of interest.\n  \n\n**Example**:\n  Load lookup of ICD10 diagnosis codes:\n  \u003e\u003e\u003e load_lookup(\"ehr_diagnosis_icd10\")\n  \n\n\u003ca id=\"ukbb_loaders.utilities.util.load_mapper\"\u003e\u003c/a\u003e\n\n#### load\\_mapper\n\n```python\ndef load_mapper(mapper_name: str) -\u003e pd.DataFrame\n```\nLoads ontology mapper.\n\n**Arguments**:\n\n- `mapper_name` _str_ - The name of the mapper to be loaded.\n\n**Returns**:\n\n- `(pd.DataFrame)` - The mapper of interest.\n  \n\n**Example**:\n  Load mapping from ICD10 
codes to Phecodes:\n  \u003e\u003e\u003e load_mapper(\"icd10_to_phecodes\")\n  \n\n### ukbb\\_loaders.loaders.load\n\nLoaders for versioned UKBB data.\n\n\u003ca id=\"ukbb_loaders.loaders.load.DataLoader\"\u003e\u003c/a\u003e\n\n### DataLoader Objects\n\n```python\nclass DataLoader()\n```\n\n\u003ca id=\"ukbb_loaders.loaders.load.DataLoader.__init__\"\u003e\u003c/a\u003e\n\n#### \\_\\_init\\_\\_\n\n```python\ndef __init__(data_dir: str)\n```\n\nClass for loading UKBB data.\n\n**Arguments**:\n\n- `data_dir` _str_ - The path to the directory containing the processed data.\n  Note that on Windows the path must have forward-slashes,\n  e.g.  \"C:/Users/john/Documents/data_dir\"\n\n\u003ca id=\"ukbb_loaders.loaders.load.DataLoader.get_hospital_data\"\u003e\u003c/a\u003e\n\n#### get\\_hospital\\_data\n\n```python\ndef get_hospital_data(source: Union[str, List[str]],\n                      level=None,\n                      patient_list: np.ndarray = None) -\u003e pd.DataFrame\n```\n\nMethod that fetches hospital data for the UKBB population.\n\n**Arguments**:\n\n- `source` _str or list_ - The coding/representation/source we would like to fetch.\n  It needs to be one or more of:\n- `icd10` - for fetching all icd10 related diagnoses.\n- `icd9` - for fetching all icd9 related diagnoses.\n- `opcs3` - for fetching all opcs3 related operation codes.\n- `opcs4` - for fetching all opcs4 related operation codes.\n- `level` _list or string_ - The level/significance of diagnoses we would like to fetch.\n  It needs to be one or more of:\n- `primary` - for fetching only the primary code related to one diagnosis.\n- `secondary` - for fetching all the secondary (complementary) codes for one\n  diagnosis.\n- `external` - for fetching diagnosis codes from external sources.\n  Defaults to all of them.\n- `patient_list` _np.ndarray_ - The patients to fetch characteristics for. 
If this is empty,\n  all UKBB patients will be used.\n\n**Returns**:\n\n- `df` _pd.DataFrame_ - A long canonical dataframe with patients as the index and the\n  following columns:\n  - date_of_visit: pandas datetime for each hospital visit\n  - feature: the different codes used (e.g. the different icd10 codes)\n  - source: the coding system the feature belongs to (e.g. icd10)\n  - value: the occurrence value for each row combination (initially 1.)\n\n\u003ca id=\"ukbb_loaders.loaders.load.DataLoader.get_death_data\"\u003e\u003c/a\u003e\n\n#### get\\_death\\_data\n\n```python\ndef get_death_data(level=None,\n                   patient_list: np.ndarray = None) -\u003e pd.DataFrame\n```\n\nMethod that fetches death information for the UKBB population.\n\n**Arguments**:\n\n- `level` _list or string_ - The level/significance of deaths we would like to fetch.\n  It needs to be one or both of: primary (main cause of death), secondary. Defaults to both.\n- `patient_list` _np.ndarray_ - The patients to fetch characteristics for.\n  If this is empty, all UKBB patients will be used.\n\n**Returns**:\n\n- `df` _pd.DataFrame_ - A long canonical dataframe with patients as the index and all\n  recorded death information including death date in the right format.\n\n\u003ca id=\"ukbb_loaders.loaders.load.DataLoader.get_gp_clinical_data\"\u003e\u003c/a\u003e\n\n#### get\\_gp\\_clinical\\_data\n\n```python\ndef get_gp_clinical_data(source=None, patient_list: np.ndarray = None)\n```\n\nMethod that fetches GP diagnosis information for the UKBB population.\n\n**Arguments**:\n\n- `source` _str or list_ - Whether to load read_2, read_3 or both. 
Defaults to both.\n- `patient_list` _np.ndarray_ - The patients to fetch characteristics for.\n  If this is empty, all UKBB patients will be used.\n\n**Returns**:\n\n- `df` _pd.DataFrame_ - A long canonical dataframe with patients as the index and all\n  recorded gp information including date in the right format.\n\n\u003ca id=\"ukbb_loaders.loaders.load.DataLoader.get_gp_medication_data\"\u003e\u003c/a\u003e\n\n#### get\\_gp\\_medication\\_data\n\n```python\ndef get_gp_medication_data(patient_list: np.ndarray = None) -\u003e pd.DataFrame\n```\n\nMethod that fetches GP medication data for the UKBB population.\n\n**Arguments**:\n\n- `patient_list` _np.ndarray_ - The patients to fetch medication data for.\n  If this is empty, all UKBB patients will be used.\n\n**Returns**:\n\n- `df` _pd.DataFrame_ - A canonical long dataframe with patients as the index and\n  features as columns.\n\n## Acknowledgments\nThis package is developed using the UK Biobank Resource under Application Number 43138.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBenevolentAI%2Fukbiobank-loaders","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FBenevolentAI%2Fukbiobank-loaders","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBenevolentAI%2Fukbiobank-loaders/lists"}