{"id":20074699,"url":"https://github.com/greenelab/daps","last_synced_at":"2026-02-20T23:39:13.579Z","repository":{"id":79359342,"uuid":"51492256","full_name":"greenelab/DAPS","owner":"greenelab","description":"Denoising Autoencoders for Phenotype Stratification","archived":false,"fork":false,"pushed_at":"2018-11-09T02:43:04.000Z","size":20733,"stargazers_count":41,"open_issues_count":2,"forks_count":9,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-10-08T23:44:34.025Z","etag":null,"topics":["analysis","autoencoders","machine-learning","methodology","neural-networks"],"latest_commit_sha":null,"homepage":"https://doi.org/10.1101/039800","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/greenelab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-02-11T03:43:05.000Z","updated_at":"2025-01-13T09:59:26.000Z","dependencies_parsed_at":"2023-03-11T19:15:58.936Z","dependency_job_id":null,"html_url":"https://github.com/greenelab/DAPS","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/greenelab/DAPS","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2FDAPS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2FDAPS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2FDAPS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2FDAPS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/greenelab","download_url":"https://codeload.gith
ub.com/greenelab/DAPS/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2FDAPS/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29667861,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-20T23:24:07.480Z","status":"ssl_error","status_checked_at":"2026-02-20T23:24:06.202Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","autoencoders","machine-learning","methodology","neural-networks"],"created_at":"2024-11-13T14:53:40.261Z","updated_at":"2026-02-20T23:39:13.561Z","avatar_url":"https://github.com/greenelab.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"Denoising Autoencoders for Phenotype Stratification (DAPS) \n===========================================================\n\n![](\u003chttps://api.shippable.com/projects/56bc03af1895ca447473c87d/badge?branch=staging\u003e)\n\nDenoising Autoencoder for Phenotype Stratification (DAPS) is a semi-supervised\ntechnique for exploring phenotypes in the Electronic Health Record (EHR).\n\nUpon build, figures are regenerated and saved in:\n[Images](https://github.com/greenelab/DAPS/tree/master/images)\n\n![](\u003c./images/cluster.png\u003e)\n\nControls and 2 artificial subtypes of cases were simulated from 2 
different\nmodels. The labels are the number of hidden nodes in the trained DAs. Principal\ncomponent analysis and t-distributed stochastic neighbor embedding.\n\nCiting DAPS\n===========\n\nBeaulieu-Jones, BK. and Greene, CS. \"DAPS: Semi-Supervised Learning of the\nElectronic Health Record with Denoising Autoencoders for Phenotype\nStratification.\", *Under review*, 2016.\n\nINSTALL\n=======\n\nDAPS relies on several rapidly updating software packages. As such, we include\npackage information below, but we also have a Docker build available at:\nhttps://hub.docker.com/r/brettbj/daps/\n\nRequired\n--------\n\n-   [Python](https://www.python.org) (3.4).\n\n-   [Theano](https://github.com/Theano/Theano) (0.70).\n\n-   [Seaborn](http://stanford.edu/\\~mwaskom/software/seaborn/) \u0026\n    [Matplotlib](http://matplotlib.org/)\n\n-   [Scikit-Learn](https://scikit-learn.org)\n\nOptional\n--------\n\n-   [IPython](\u003chttp://ipython.org/\u003e) \u0026\n    [JupyterHub](https://github.com/jupyter/jupyterhub) - Required for visualization\n\n-   [CUDA](https://developer.nvidia.com/cuda-toolkit-65) (7.5). This is listed\n    as optional, but it's impractical to train more than 1000 samples without\n    CUDA.\n\nUSAGE\n=====\n\nRunning Simulations\n-------------------\n\nIf you change the number of patients per simulation, also change the\nmini-batch size to keep the same ratio. For example, with 100,000 patients you\ncould use mini-batches of 1,000 patients (a size of 100 will still train, but\nmore slowly); with 1,000 patients you should use 10 (training is not effective\nwith a mini-batch size of 100).\n\nBelow is a sampling of the simulations run. 
Full parameter sweeps required \\\u003e 72\nhrs on an array of TitanX GPUs and it is recommended to choose an interesting\nsubset.\n\nData used to generate figures is available at:\n\n\u003chttp://dx.doi.org/10.5281/zenodo.46082\u003e\n\n**Simulation Model 1:**\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\npython3 create_patients.py --run_name 1 --trials 10 --patient_count 10000 --num_effects 1 2 4 8 16 --observed_variables 100 200 400 --per_effect 10 --effect_mag 1 2 --sim_model 1 --systematic_bias 0.1 --input_variance 0.1 --missing_data 0\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nTHEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32,nvcc.fastmath=True python3 classify_patients.py --run_name 1 --patient_count 100 200 500 1000 2000 --da_patients 10000 --hidden_nodes 2 --missing_data 0\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n**Simulation Model 2:**\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\npython3 create_patients.py --run_name 2 --trials 10 --patient_count 10000 --num_effects 1 2 4 8 --observed_variables 100 --per_effect 10 --effect_mag 2 5 --sim_model 2 --systematic_bias 0.1 --input_variance 0.1 --missing_data 0\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nTHEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32,nvcc.fastmath=True python3 classify_patients.py --run_name 2 --patient_count 50 100 200 500 1000 2000 --da_patients 10000 --hidden_nodes 2 4 8 --missing_data 0\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n**Simulation Model 3:**\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\npython3 create_patients.py --run_name 3 
--trials 10 --patient_count 10000 --num_effects 1 2 4 8 --observed_variables 100 --per_effect 10 --effect_mag 5 --sim_model 3 --systematic_bias 0.1 --input_variance 0.1 --missing_data 0\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nTHEANO_FLAGS=mode=FAST_RUN,device=gpu1,floatX=float32,nvcc.fastmath=True python3 classify_patients.py --run_name 3 --patient_count 50 100 200 500 1000 2000 --da_patients 10000 --hidden_nodes 2 4 8 --missing_data 0\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n**Simulation Model 4:**\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\npython3 create_patients.py --run_name 4 --trials 10 --patient_count 10000 --num_effects 2 4 8 --observed_variables 100 --per_effect 10 --effect_mag 10 --sim_model 4 --systematic_bias 0.1 --input_variance 0.1 --missing_data 0\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nTHEANO_FLAGS=mode=FAST_RUN,device=gpu1,floatX=float32,nvcc.fastmath=True python3 classify_patients.py --run_name 4 --patient_count 50 100 200 500 1000 2000 10000 --da_patients 10000 --hidden_nodes 2 4 8 16 --missing_data 0\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n**Missing data:**\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\npython3 create_patients.py --run_name md --trials 10 --patient_count 10000 --num_effects 2 4 8 16 --observed_variables 100 --per_effect 10 --effect_mag 2 --sim_model 1 --systematic_bias 0.1 --input_variance 0.1 --missing_data 
0\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nTHEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32,nvcc.fastmath=True python3 classify_patients.py --run_name md --patient_count 100 200 500 1000 --da_patients 10000 --hidden_nodes 2 --missing_data 0 0.1\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nAnalyzing Results\n-----------------\n\nWe've included 3 ipython notebook files to help analyze the results.\n\n-   Script used to generate the classification figures shown in the paper -\n    Figures.ipynb\n\n-   Script used to generate the clustering images shown in the paper -\n    Clustering.ipynb\n\n-   Examine a wide array of visualizations for a particular sweep -\n    Visualize.ipynb\n\nSelected Results\n----------------\n\n![](\u003c./images/figure_3_patients_100.png\u003e)![](\u003c./images/figure_3_patients_200.png\u003e)![](\u003c./images/figure_3_patients_500.png\u003e)![](\u003c./images/figure_3_patients_1000.png\u003e)![](\u003c./images/figure_3_patients_2000.png\u003e)\n\nClassification AUC in relation to the number of labeled patients under\nsimulation model 1 (RF – Random Forest, NN – Nearest Neighbors, DA – 2-node DA +\nRandom Forest, SVM – Support vector machine).\n\n![](\u003c./images/fig2.png\u003e)\n\nCase vs. 
Control clustering via principal components analysis and t-distributed\nstochastic neighbor embedding throughout the training of the DA (raw input to\n10,000 training epochs) for simulation model 1.\n\nFeedback\n--------\n\nPlease feel free to email me - (brettbe) at med.upenn.edu with any feedback or\nraise a github issue with any comments or questions.\n\nAcknowledgements\n----------------\n\nThis work is supported by the Commonwealth Universal Research Enhancement (CURE)\nProgram grant from the Pennsylvania Department of Health as well as the Gordon\nand Betty Moore Foundation's Data-Driven Discovery Initiative through Grant\nGBMF4552 to C.S.G.\n\nWe're grateful for the support of [Computational Genetics\nLaboratory](\u003chttp://epistasis.org\u003e), the [Penn Institute of Biomedical\nSciences](\u003chttp://upibi.org/\u003e) and in particular Dr. Jason H. Moore for his\nsupport of this project.\n\nWe also wish to acknowledge the support of the NVIDIA Corporation with the\ndonation of one of the TitanX GPUs used for this research.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreenelab%2Fdaps","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgreenelab%2Fdaps","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreenelab%2Fdaps/lists"}