{"id":24016396,"url":"https://github.com/ssmiler/idash2019_2","last_synced_at":"2025-10-12T20:05:54.125Z","repository":{"id":249291554,"uuid":"191530488","full_name":"ssmiler/idash2019_2","owner":"ssmiler","description":"Secure genotype imputation using homomorphic encryption - iDASH 2019 track 2","archived":false,"fork":false,"pushed_at":"2021-02-26T19:04:46.000Z","size":207,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-15T14:13:06.791Z","etag":null,"topics":["genome-imputation","genomics","homomorphic-encryption","idash","imputation","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ssmiler.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-12T08:30:57.000Z","updated_at":"2023-03-30T07:35:10.000Z","dependencies_parsed_at":"2024-07-19T22:55:20.762Z","dependency_job_id":null,"html_url":"https://github.com/ssmiler/idash2019_2","commit_stats":null,"previous_names":["ssmiler/idash2019_2"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ssmiler%2Fidash2019_2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ssmiler%2Fidash2019_2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ssmiler%2Fidash2019_2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ssmiler%2Fidash2019_2/manifests","owner_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ssmiler","download_url":"https://codeload.github.com/ssmiler/idash2019_2/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249085429,"owners_count":21210267,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["genome-imputation","genomics","homomorphic-encryption","idash","imputation","machine-learning"],"created_at":"2025-01-08T08:50:51.340Z","updated_at":"2025-10-12T20:05:49.086Z","avatar_url":"https://github.com/ssmiler.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Secure genome imputation using homomorphic encryption\n\nGenome imputation predicts missing variant genotypes (target SNPs) from available genotypes (tag SNPs).\nIn this project, the imputation is performed using multi-output logistic regression models.\nThe models are built upon one-hot-encoded tag SNPs and are trained to predict target SNP variants (probabilities).\nThe [vowpal-wabbit](https://github.com/VowpalWabbit/vowpal_wabbit) framework is used for model training.\n\nIn a typical use case of secure genome imputation, there are two parties, A and B.\nParty A has genome data with missing SNPs and party B has the imputation model.\nThe secure genome imputation consists of 3 steps:\n\n1. Encode/encrypt tag SNPs\n2. Securely impute target SNPs\n3. 
Decrypt/decode the imputed target SNP probabilities\n\nHere, steps 1 and 3 are executed by party A and step 2 is executed by party B.\n\nThe encryption, decryption and imputation use the homomorphic encryption library [TFHE](https://github.com/tfhe/tfhe).\nThe tag SNPs and the resulting probabilities for target SNPs are encrypted throughout the whole process.\nOnly party A has access to the tag and target SNPs in the clear.\nThe imputation models are available to the evaluation party B only.\n\nThis code is open-source software distributed under the terms of the Apache 2.0 license.\n\nIn what follows we describe how to learn imputation models and how to perform secure genome imputation.\n\n## Learn imputation models\n\nInput genome data files, downloaded from [here](https://github.com/K-miran/secure-imputation), must be located in the folder `orig_data` (relative to the repository root).\nThe `orig_data` folder should look like:\n```bash\n# ls ../orig_data | head\ntag_testing.txt\ntag_training.txt\ntarget_testing.txt\ntarget_training.txt\ntesting_sample_ids.list\ntraining_sample_ids.list\n```\nIf a stratified population is used, the `orig_data` folder will have additional `*_AFR.list`, `*_AMR.list` and `*_EUR.list` files with the indices of the respective population in the dataset. 
\n\nThe obtained models will be placed under the folder `models/hr/neighbors=\u003cneighbors\u003e\u003cpopulation\u003e` where `\u003cneighbors\u003e` and `\u003cpopulation\u003e` are model configuration parameters.\n\n### Prerequisites\n\nFirst, clone the repository and navigate into it:\n```bash\ngit clone https://github.com/ssmiler/idash2019_2.git --recursive\ncd idash2019_2\n```\n\nYou can either install the needed packages on your machine or use a docker container.\nPlease refer to the respective section below.\n\n#### Machine configuration (ubuntu 18.04)\n\nInstall required packages:\n```bash\napt-get install vowpal-wabbit python3 python3-pip parallel\npip3 install numpy pandas scikit-learn\n```\n\n#### Use docker container\n\nYou can build a docker image with the needed configuration and packages.\nExecute the following command in the root folder:\n```bash\ndocker build -t idash_2019_chimera:train -f Dockerfile.train .\n```\n\nStart the docker container using:\n```bash\ndocker run -it --rm -v $(pwd):/idash idash_2019_chimera:train bash\n```\n\n### Input data preprocessing\n\nThe following instructions are relative to the `train` folder (the docker container starts automatically in this folder).\n\nTransform the format of the input data and obtain auxiliary data configuration files.\n\n```bash\npython3 prepare_data.py ../orig_data ../data\n```\n\nIf all went well, the output data folder should contain something like this:\n```bash\n# ls ../data | head\ntag_test.pickle\ntag_train.pickle\ntarget_snp\ntarget_test.pickle\ntarget_train.pickle\n```\nIf stratified population indices are available in the `orig_data` folder, the output data will contain `*.pickle` files for each population.  
\n\n### Logistic regression models for genome imputation\n\nLogistic regression models for imputing target SNPs are created using script `learn_vw.sh`.\nFor example, in order to build the models using 5 nearest tag SNP neighbors (variable `neighbors`) for all population (variable `population`) use:\n\n```bash\nneighbors=5\npopulation=''\n\nbash learn_vw.sh $neighbors $population\n```\n\nOnce the learning process is finished folder `../models/vw/neighbors=$neighbors$population/` will contain about 240k model files.\nThe total number of obtained models is 3 times the number of target SNPs (3 models are built for each target SNP because of the one-hot-encoding).\n```bash\n# ls ../models/vw/neighbors=$neighbors$population | wc -l\n242646\n```\n\nThe number of neighboring tag SNPs to use in each model training is configurable (we have tested from 5 to 50 neighbors).\n For the population stratification you can choose one of the following values: `_AFR`, `_AMR`, `_EUR` or empty value for no stratification.\n\nThe micro-AUC score for the predictions obtained by these models is computed with command (keep double quotation marks around `*.hr` otherwise the 242646 model files will overflow the command line buffer :smile:):\n```bash\npython3 test_vw_hr.py -m ../models/vw/neighbors=$neighbors$population/\"*.hr\" --tag_file ../data/tag_test$population.pickle --target_file ../data/target_test$population.pickle\n```\n\nThe output shall look like:\n```\nMicro-AUC score: 0.98147822 pred max-min 89.093671 (../models/vw/neighbors=5/*.hr)\n```\n\nHere, value `89.093671` is the absolute norm of the obtained predictions and is used for model coefficient discretization (rescaling and mapping coefficients to integers).\nThis operation is performed using the same python script as before but with additional arguments:\n```bash\nrange=89.093671\nmodel_scale=$(python3 -c \"print(16384 / $range / 2)\") # leave a 100% margin (by 2 division)\n\npython3 test_vw_hr.py -m 
../models/vw/neighbors=$neighbors$population/\"*.hr\" --tag_file ../data/tag_test$population.pickle --target_file ../data/target_test$population.pickle --model_scale $model_scale --out_dir ../models/hr/neighbors=$neighbors$population\n```\n\nThe output shall look like (observe that the accuracy is somewhat worse when compared to the accuracy of the non-discretized model):\n```\nMicro-AUC score: 0.98147786 pred max-min 8191.0 (../models/vw/neighbors=5/*.hr)\n```\n\nDiscretized genome imputation models which will be used in the secure imputation phase (described in next [section](#secure-evaluation-of-imputation-models)) are placed in folder `../models/hr/neighbors=$neighbors$population`:\n```bash\n\u003e\u003e ls ../models/hr/neighbors=$neighbors$population | head\n17084716_0.hr\n17084716_1.hr\n17084716_2.hr\n17084761_0.hr\n...\n```\n\n## Secure evaluation of imputation models\n\nOnce the desired logistic regression models are obtained we can proceed to the imputation on encrypted tag SNPs.\n\n### Prerequisites\n\nFirstly, we need to clone the TFHE library as a submodule and apply a patch to it.\nRun the following instruction from the root folder of the repository:\n\n```bash\ngit submodule update --init\n\ncd tfhe\ngit apply ../eval/thread_local_rand_gen.patch\ncd ..\n```\n\nAs previously, we can either configure the host machine or use a docker container.\n\n#### Machine configuration (ubuntu 18.04)\n\nPython packages installed previously and usual C++ build tools (`cmake, make, g++`) are sufficient to build the project.\n\n#### Use docker container\n\nExecute the following command in the root folder of the repository:\n```bash\ndocker build -t idash_2019_chimera:eval -f Dockerfile.eval .\n```\n\nStart docker container:\n```bash\ndocker run -it --rm -v $(pwd):/idash idash_2019_chimera:eval bash\n```\n\n### Compile project\n\nInstructions in following sections are relative to folder `eval/run` (automatically set in docker container).\n\nStart by compiling the TFHE 
library and the secure genome imputation project:\n```bash\nmake build\n```\n\n#### Number of threads\n\nThe secure imputation binaries can use parallelization in order to increase execution performance.\nThe default number of threads is set to 4 and can be changed (see line 37 of file [idash.h](eval/idash.h)).\nThe project must be re-compiled in this case.\n\n### Execute secure imputation\n\nMakefile target `auc` executes the key generation, encryption, missing SNPs imputation and decryption steps.\nBesides, accuracy scores (micro-AUC, macro and macro non-reference accuracies) for the imputed target SNPs in output file `result_bypos.csv` are computed also.\n\n```bash\nmake auc\n```\n\nThe typical output on a mid-end laptop shall look like:\n```\n===============================================================\n./bin/keygen \"../../orig_data/target_geno_model_coordinates.txt\" \"../../orig_data/tag_testing.txt\" 1\nusing target file (headers): ../../orig_data/target_geno_model_coordinates.txt (only positions)\nusing tag file (challenge): ../../orig_data/tag_testing.txt\ntarget_file (headers): ../../orig_data/target_geno_model_coordinates.txt\ntag_file: ../../orig_data/tag_testing.txt\nNUM_SAMPLES: 1004\nNUM_REGIONS: 1\nREGION_SIZE: 1024\nNUM_TARGET_POSITIONS: 80882\nNUM_TAG_POSITIONS: 16130\n----------------- BENCHMARK -----------------\nNumber of threads ...............: 4\nKeygen time (seconds)............: 5.19753e-05\nTotal wall time (seconds)........: 0.200243\nRAM usage (MB)...................: 12.996\n===============================================================\n===============================================================\n./bin/encrypt \"../../orig_data/tag_testing.txt\"\nusing tag file (challenge): ../../orig_data/tag_testing.txt\n----------------- BENCHMARK -----------------\nencrypt wall time (seconds)......: 4.62767\nserialization wall time (seconds): 2.10074\ntotal wall time (seconds)........: 6.72841\nRAM usage (MB)...................: 
441.484\n-rw-r--r-- 1 root root 397185128 Jun 15 14:45 encrypted_data.bin\n===============================================================\n===============================================================\n./bin/cloud \"../../models/hr/neighbors=5\"\nusing model dir: ../../models/hr/neighbors=5\n----------------- BENCHMARK -----------------\nfhe wall time (seconds)..........: 3.06433\nserialization wall time (seconds): 26.6188\ntotal wall time (seconds)........: 29.6832\nRAM usage (MB)...................: 2735.53\n-rw-r--r-- 1 root root 1991638376 Jun 15 14:46 encrypted_prediction.bin\n===============================================================\n===============================================================\n./bin/decrypt bypos\n----------------- BENCHMARK -----------------\ndecrypt wall time (seconds)......: 2.54242\nserialization wall time (seconds): 395.26\ntotal wall time (seconds)........: 397.802\nRAM usage (MB)...................: 2957.85\n===============================================================\npython3 ../validate.py --pred_file result_bypos.csv --target_file \"../../orig_data/target_testing.txt\"\nMicro-AUC score: 0.9814777758605334\nMAP score: 0.8956327825366766\nMAP non-ref score: 0.7289859822447496\n```\n\nThe imputation model to use is set in the [Makefile-final.inc](eval/run/Makefile-final.inc) file (variable `MODEL_FILE`).\nThe default value of this variable corresponds to the model learned earlier (5 neighbors and no population stratification).\n\n## Paper experiments\n\nAll the experiments performed for [paper](https://www.biorxiv.org/content/10.1101/2020.07.02.183459v1) are grouped in the bash [script](train/experiment.sh) for model learning phase and bash [script](eval/experiment.sh) for model evaluation 
phase.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fssmiler%2Fidash2019_2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fssmiler%2Fidash2019_2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fssmiler%2Fidash2019_2/lists"}