{"id":19303370,"url":"https://github.com/kenhanscombe/ukbproject","last_synced_at":"2025-04-22T11:32:03.763Z","repository":{"id":201940784,"uuid":"295145117","full_name":"kenhanscombe/ukbproject","owner":"kenhanscombe","description":"A python CLI to setup and manage a UKB project directory","archived":false,"fork":false,"pushed_at":"2022-05-12T15:50:53.000Z","size":84,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-01T22:47:00.260Z","etag":null,"topics":["kcl-sgu","python3","uk-biobank"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kenhanscombe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-09-13T12:19:19.000Z","updated_at":"2023-12-11T02:01:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"51f995fc-b665-48e9-8709-37c1161ab9ab","html_url":"https://github.com/kenhanscombe/ukbproject","commit_stats":null,"previous_names":["kenhanscombe/ukbproject"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhanscombe%2Fukbproject","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhanscombe%2Fukbproject/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhanscombe%2Fukbproject/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhanscombe%2Fukbproject/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kenhanscombe","download_url":"https://codeload.github.com/ken
hanscombe/ukbproject/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250232193,"owners_count":21396588,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kcl-sgu","python3","uk-biobank"],"created_at":"2024-11-09T23:26:10.653Z","updated_at":"2025-04-22T11:32:03.480Z","avatar_url":"https://github.com/kenhanscombe.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ukbproject\n\n![build](https://github.com/kenhanscombe/ukbproject/workflows/build/badge.svg)\n\nA Python CLI to set up a UK Biobank (UKB) project folder.\n\n**Important: This CLI is only useful for UKB-approved KCL\nresearchers and their collaborators, with an account on the Rosalind\nor CREATE HPC clusters.**\n\n\u003cbr\u003e\n\n\u003cspan style=\"color:dodgerblue;\"\u003e**Contents:**\u003c/span\u003e  \n1. [Installation](#installation)\n2. [Use](#use)  \n2.1 [Setup a project directory](#setup)  \n2.2 [Download UKB utilities](#download)  \n2.3 [Include project data](#include)  \n2.4 [Munge the UKB data](#munge)  \n2.5 [Add symlinks to sample information and relatedness files](#add)  \n3. [Access the data with ukbkings](#access)  \n4. [Additional withdrawals](#withdrawals)  \n5. [Updates to phenotype data](#updates)\n\n\u003cbr\u003e\n\n***\n\n\u003cbr\u003e\n\n\u003ca name=\"installation\"\u003e\u003c/a\u003e\n## 1. 
Installation\n\nClone the GitHub repo\n\n```{bash}\ngit clone https://github.com/kenhanscombe/ukbproject.git\n```\n\nChange into the ukbproject directory, make munge.py executable, and copy the snakemake SLURM profile (replace `\u003cusername\u003e` with your KCL username).\n\n```{bash}\ncd ukbproject\nchmod +x ukbproject/munge.py\nmkdir -p /users/\u003cusername\u003e/.config/snakemake\ncp -R ukbproject/conf/slurm /users/\u003cusername\u003e/.config/snakemake\n```\n\nKCL Rosalind users load the default python3 module\n\n```{bash}\nmodule avail python3\nmodule load \u003cpython3_module\u003e\n```\n\nKCL CREATE users load conda and python3\n\n```{bash}\nmodule spider conda\nmodule load \u003cconda_module\u003e\n\nmodule spider python\nmodule load \u003cpython3_module\u003e\n```\n\nYou may be prompted to do a one-time `conda init \u003cSHELL_NAME\u003e`. For bash\n\n```{bash}\nconda init bash\n```\n\nReload the terminal or run `source ~/.bashrc`. Create the conda\nenvironment, activate it, and install the ukbproject package into it.\n\n```{bash}\nconda env create -f ukbproject/conf/environment.yml\nconda activate ukbproject\npython3 -m pip install --editable .\n```\n\nAfter use (below), exit the environment with `conda deactivate`. To use\nthe `prj` CLI on subsequent occasions, simply activate the environment\n`conda activate ukbproject`.\n\n\u003cbr\u003e\n\n\u003ca name=\"use\"\u003e\u003c/a\u003e\n## 2. Use\n\nFor help\n\n```{bash}\nprj --help\n```\n\n\u003cpre\u003e\nUsage: prj [OPTIONS] COMMAND [ARGS]...\n\nSets up a UKB project on Rosalind/CREATE storing common data and utilities in the parent directory, at resources/ and bin/ respectively.\n\nOptions:\n  --version  Show the version and exit.\n  --hpc TEXT  Either \"ROSALIND\" (default) or \"CREATE\". 
Sets path to UKB data.\n  --help     Show this message and exit.\n\nCommands:\n  clean     Removes defunct file/dir(s) from projects, and sets permissions.\n  create    Creates a skeleton UKB project directory.\n  link      Makes links to sample information and relatedness files.\n  munge     Runs rules described in the Snakefile to munge UKB data.\n  util      Downloads UKB file handlers and utilities.\n  withdraw  Writes withdrawal IDs and corresponding indeces to be excluded.\n\u003c/pre\u003e\n\n\u003cbr\u003e\n\nNote. Usage is similar to `git` with general **options**, and\n**commands** that take further **arguments**. For help on commands (e.g.\n`prj create`)\n\n```{bash}\nprj create --help\n```\n\n\u003cbr\u003e\n\n\u003ca name=\"setup\"\u003e\u003c/a\u003e\n### 2.1 Setup a project directory\n\nAt /scratch/datasets/ukbiobank, create a project\ndirectory ukb\\\u003c*project_id*\\\u003e.\n\n```{bash}\nprj create -p \u003cproject_id\u003e\n```\n\nThis will create the project directory structure in **Figure 1**,\nadding symlinks to the genetic data in the project genotyped/ and imputed/\nfolders, and download the required UKB programs and utilities.\n\n\u003cbr\u003e\n\n\u003cpre\u003e\nukb\u0026ltproject_id\u0026gt  \n├ genotyped  \n  ├ ukb_binary_v2.bed  \n  └ ukb_binary_v2.bim  \n├ imputed  \n  ├ ukb_sqc.txt  \n  ├ ukb_sqc_fields.txt  \n  ├ ukb_imp_chr*.bgen  \n  ├ ukb_imp_chr*.bgen.bgi  \n  └ ukb_mfi_chr*.txt\n├ log  \n├ phenotypes  \n├ raw  \n├ returns  \n└ withdrawals\n\u003c/pre\u003e\n\n**Figure 1** Project directory structure\n\n\u003cbr\u003e\n\nFor most other operations, you should change into the project folder.\n\n\u003cbr\u003e\n\n\u003ca name=\"download\"\u003e\u003c/a\u003e\n### 2.2 Download UKB utilities\n\nAdd UKB file handlers and utilities to the parent directory\n/scratch/datasets/ukbiobank folders bin/ and resources/, with\n`prj util`. 
UKB data encodings (Codings_Showcase.csv, encoding.ukb) are\ndownloaded to resources/; UKB programs are downloaded to bin/.\n\n\u003cbr\u003e\n\n\u003ca name=\"include\"\u003e\u003c/a\u003e\n### 2.3 Include project data\n\n#### 2.3.1 Encrypted phenotype data, keys, withdrawals\n\nDownloaded project-specific encrypted files (\\*.enc), associated key\nfiles (\\*.key), and withdrawal files\n(w\\\u003c*project-id*\\\u003e_\\\u003c*yyyymmdd*\\\u003e.csv) must be copied into the project\nsubdirectory raw/. Change the key file names to match the encrypted\nfiles: ukb\\\u003c*project_id*\\\u003e.enc pairs with ukb\\\u003c*project_id*\\\u003e.key. The\nfirst line in each key file should be the project id; the second line\nshould be the decryption key.\n\n#### 2.3.2 Genetic sample information and relatedness\n\nAdd to raw/ the project-specific key associated with the genetic data\naccess, and rename it to ukb\u003cproject_id\u003e.key. Download the project-specific\ngenetic sample information files (.fam and .sample) and relatedness\nfile (*rel*.dat/.txt) into raw/.\n\n```{bash}\ncd /scratch/datasets/ukbiobank/ukb_\u003cproject_id\u003e/raw/\n\n/scratch/datasets/ukbiobank/bin/gfetch 22418 -c1 -m -a\u003ckey_name\u003e.key\n/scratch/datasets/ukbiobank/bin/gfetch 22828 -c1 -m -a\u003ckey_name\u003e.key\n/scratch/datasets/ukbiobank/bin/gfetch rel -a\u003ckey_name\u003e.key\n```\n\n\u003cbr\u003e\n\n\u003ca name=\"munge\"\u003e\u003c/a\u003e\n### 2.4 Munge the UKB data\n\nProcess the encrypted UKB files into formats to be read by ukbkings.\n\n```{bash}\nprj munge -p ukb\u003cproject_id\u003e\n```\n\nThe munged phenotype data are written to phenotypes/ and output\ninformation is written to log/, for every \u003c*dataset_id*\u003e (or UKB\nbasket) (**Figure 2**). 
For a dry run, in which no files are edited/\nwritten to disk and only details of what would be munged are printed to\nstandard output, use `prj munge -p ukb\u003cproject_id\u003e -n`.\n\n\u003cbr\u003e\n\n\u003cpre\u003e\nukb\u0026ltproject_id\u0026gt  \n├ phenotypes  \n  ├ ukb\u0026ltdataset_id\u0026gt.csv  \n  ├ ukb\u0026ltdataset_id\u0026gt.html  \n  └ ukb\u0026ltdataset_id\u0026gt_field_finder.text  \n├ log  \n  └ ...\n\u003c/pre\u003e\n\n**Figure 2** Munged phenotype data\n\n\u003cbr\u003e\n\n\u003ca name=\"add\"\u003e\u003c/a\u003e\n### 2.5 Add symlinks to sample information and relatedness files\n\nSample information files (.fam and .sample) and the relatedness file\n(*rel*.dat/.txt) should be in raw/. Create symlinks to these project-specific\nfiles in genotyped/ and imputed/ (**Figure 3**).\n\n```{bash}\nprj link \\\n-p ukb\u003cproject_id\u003e \\\n-f \u003cfam_file_name\u003e \\\n-s \u003csample_file_name\u003e \\\n-r \u003crelatedness_file_name\u003e\n```\n\nYou can link one or more of these files (they do not all need to be \npassed to the program simultaneously).\n\n\u003cbr\u003e\n\n\u003cpre\u003e\nukb\u0026ltproject_id\u0026gt \n├ genotyped\n  └ ukb\u0026ltproject_id\u0026gt_cal_chr1_v2_sN.fam \n├ imputed  \n  ├ ukb\u0026ltproject_id\u0026gt_imp_chr1_v3_sN.sample  \n  └ ukb\u0026ltproject_id\u0026gt_rel_sN.dat\n\u003c/pre\u003e\n\n**Figure 3** Project-specific sample information and relatedness symlinks.  
\nN = number of samples with non-negative IDs\n\n\u003cbr\u003e\n\n**UKB genetic data resources:**\n\n* [Accessing Genetic Data within UK Biobank](https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/ukbgene_instruct.html#aut)\n* [Resource 531: Description of genetic data types](http://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=531)\n* [Resource 664: Instructions for downloading genetic data using ukbgene](https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=664)\n* [Resource 667: UK Biobank Keyfile](http://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=667)\n\n\u003cbr\u003e\n\n\u003ca name=\"access\"\u003e\u003c/a\u003e\n### 3. Access the data with ukbkings\n\nThe data should now be available from anywhere on Rosalind through\nthe [ukbkings](https://kenhanscombe.github.io/ukbkings) R package. Read\n[Access UKB data on Rosalind](https://kenhanscombe.github.io/ukbkings/articles/kcl-ukb-access.html)\nfor a detailed description of usage. The same usage documentation is\nincluded in a package vignette. In R\n\n```{R}\ndevtools::install_github(\"kenhanscombe/ukbkings\", dependencies = TRUE, force = TRUE)\nvignette(\"Access UKB data on Rosalind\")\n```\n\n\u003cbr\u003e\n\n\u003ca name=\"withdrawals\"\u003e\u003c/a\u003e\n## 4. Additional withdrawals\n\nEach time an updated set of participant withdrawals is received, add the\nw\\\u003c*project-id*\\\u003e_\\\u003c*yyyymmdd*\\\u003e.csv file to raw/.\n\nTo exclude the latest withdrawals from the phenotype data, you have to\ngenerate your dataset with `ukbkings::bio_phen` again. Be aware that if\nany researcher on your project does run `ukbkings::bio_phen` again, to\ngrab some other data say, this would apply the latest set of\nwithdrawals. \n\nTo exclude the latest withdrawals from the genotype link files (.fam,\n.sample), there is a 2-step process: generate a list of withdrawals with\n`prj withdraw`, and then remove them from the link files with\n`prj remove`.\n\nNote. 
In both cases the row count remains the same: `ukbkings::bio_phen`\nreplaces phenotype data values with `NA`; `prj remove` replaces IDs with\nnegative integers. This preserves the alignment of files.\n\n\u003cbr\u003e\n\n\u003ca name=\"updates\"\u003e\u003c/a\u003e\n## 5. Updates to phenotype data\n\nIf you receive any new data you would like to incorporate, place the\nnew .enc and .key files (prepared as described in [Include\nproject data](#include)) into raw/ and re-run the data munging step\n(described in [Munge the UKB data](#munge)).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkenhanscombe%2Fukbproject","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkenhanscombe%2Fukbproject","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkenhanscombe%2Fukbproject/lists"}