{"id":13752173,"url":"https://github.com/greenelab/tybalt","last_synced_at":"2026-03-06T01:44:36.058Z","repository":{"id":79359977,"uuid":"97131241","full_name":"greenelab/tybalt","owner":"greenelab","description":"Training and evaluating a variational autoencoder for pan-cancer gene expression data","archived":false,"fork":false,"pushed_at":"2019-01-31T22:57:14.000Z","size":104164,"stargazers_count":169,"open_issues_count":2,"forks_count":62,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-05-23T00:26:51.729Z","etag":null,"topics":["analysis","autoencoder","cancer","cancer-genomics","deep-learning","gene-expression","script","tool","unsupervised-learning","variational-autoencoder","variational-autoencoders"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/greenelab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-07-13T14:23:33.000Z","updated_at":"2025-04-29T19:02:26.000Z","dependencies_parsed_at":"2023-03-09T04:15:32.508Z","dependency_job_id":null,"html_url":"https://github.com/greenelab/tybalt","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/greenelab/tybalt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Ftybalt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Ftybalt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Ftybalt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Ftyba
lt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/greenelab","download_url":"https://codeload.github.com/greenelab/tybalt/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Ftybalt/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30157894,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T22:39:40.138Z","status":"ssl_error","status_checked_at":"2026-03-05T22:39:24.771Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","autoencoder","cancer","cancer-genomics","deep-learning","gene-expression","script","tool","unsupervised-learning","variational-autoencoder","variational-autoencoders"],"created_at":"2024-08-03T09:01:00.875Z","updated_at":"2026-03-06T01:44:36.035Z","avatar_url":"https://github.com/greenelab.png","language":"HTML","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"# Tybalt :smirk_cat:\n\n### *A Variational Autoencoder trained on Pan-Cancer Gene Expression*\n\n**Gregory Way and Casey Greene 2017**\n\n[![DOI](https://zenodo.org/badge/97131241.svg)](https://zenodo.org/badge/latestdoi/97131241)\n\nThe repository stores scripts to train, evaluate, and extract knowledge from\na variational autoencoder (VAE) 
trained on 33 different cancer-types from The\nCancer Genome Atlas (TCGA).\n\nThe specific VAE model is named [*Tybalt*](https://en.wikipedia.org/wiki/Tybalt)\nafter an instigative, cat-like character in Shakespeare's \"Romeo and Juliet\".\nJust as the character Tybalt sets off the series of events in the play, the\nmodel Tybalt begins the foray of VAE manifold learning in transcriptomics.\n[Also, deep unsupervised learning likes cats](https://arxiv.org/abs/1112.6209).\n\nWe discuss the training and evaluation of Tybalt in our PSB paper:\n\n[_Extracting a Biologically Relevant Latent Space from Cancer Transcriptomes with Variational Autoencoders_](http://www.biorxiv.org/content/early/2017/08/11/174474).\n\n## Citation\n\n\u003e Way, GP, Greene, CS. 2018. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders.\n_Pacific Symposium on Biocomputing_ 23:80-91. doi:10.1142/9789813235533_0008\n\n## Notes\n\n\u003e As discovered by @enricoferrero, the preprint text (`section 2.2`) states\nthat the top _median_ absolute deviation (MAD) genes were selected for subsetting,\nwhen the data processing code\n([`process_data.ipynb`](https://github.com/greenelab/tybalt/blob/master/process_data.ipynb))\nactually outputs the top _mean_ absolute deviation genes. We discuss this discrepancy\nand its potential impact in [issue #99](https://github.com/greenelab/tybalt/issues/99).\n\n\u003e git-lfs (https://git-lfs.github.com/) must be installed prior to cloning the repository.\nIf it is not installed, run `git lfs install`\n\n## The Data\n\nTCGA has collected numerous different genomic measurements from over 10,000\ndifferent tumors spanning 33 different cancer-types. In this repository, we\nextract cancer signatures from *gene expression* data (RNA-seq). \n\nThe RNA-seq data serves as a measurement describing the high-dimensional state\nof each tumor. 
As a highly heterogeneous disease, cancer exists in several\ndifferent combinations of states. Our goal is to extract these different states\nusing high-capacity models capable of identifying common signatures in gene\nexpression data across different cancer-types.\n\n## The Model\n\nWe present a variational autoencoder (VAE) applied to cancer gene expression\ndata. A VAE is a deep generative model introduced by\n[Kingma and Welling](https://arxiv.org/abs/1312.6114) in 2013. The model has\ntwo direct benefits for modeling cancer gene expression data:\n\n1. Automatically engineering non-linear features\n2. Learning the reduced-dimension manifold of cancer expression space\n\nBecause the VAE is a generative model, its reduced-dimension features can be sampled to\nsimulate data. The manifold can also be interpolated to interrogate trajectories\nand transitions between states.\n\nVAEs have typically been applied to image data and have demonstrated remarkable\ngenerative capacity and modeling flexibility. VAEs are different from\ndeterministic autoencoders because of the added constraint of normally\ndistributed feature activations per sample. This constraint not only\nregularizes the model, but also provides the interpretable manifold.\n\nBelow is a t-SNE visualization of the VAE encoded features (p = 100) for all\ntumors.\n\n![VAE t-SNE](figures/tsne_vae.png?raw=true)\n\n### Training\n\nThe current model training is explained in [this notebook](tybalt_vae.ipynb).\n\nTybalt dependencies are listed in [`environment.yml`](environment.yml). To create\nand activate this environment, run:\n\n```sh\n# conda version 4.4.10\nconda env create --force --file environment.yml\n\n# activate environment\nconda activate tybalt\n```\n\nTybalt is also configured to train on GPUs using\n[`gpu-environment.yml`](gpu-environment.yml). 
To create and activate this environment, run:\n\n```sh\n# conda version 4.4.10\nconda env create --force --file gpu-environment.yml\n\n# activate environment\nconda activate tybalt-gpu\n```\n\nFor a complete pipeline with reproducibility instructions, refer to\n[run_pipeline.sh](run_pipeline.sh). Note that scripts originally written in\nJupyter notebooks were ported to the scripts folder for pipeline purposes with:\n\n```sh\njupyter nbconvert --to=script --FilesWriter.build_directory=scripts/nbconverted *.ipynb\n```\n\n#### Architecture\n\nWe select the top 5,000 most variably expressed genes by median absolute\ndeviation (but see the Notes section on the median versus mean absolute deviation discrepancy). We compress this 5,000-dimensional vector of gene expression (for all samples)\ninto two vectors of length 100: one representing the mean and the other the\nvariance. Together, these vectors define a distribution that can be sampled to generate samples from an\napproximation of the data-generating function. This hidden layer is then\nreconstructed back to the original dimensions. We use batch normalization\nand ReLU activation layers in the compression steps to prevent dead nodes and\npositive weights. We use a sigmoid activation in the decoder. We use the Keras\nlibrary with a TensorFlow backend for training.\n\n![VAE Architecture](figures/onehidden_vae_architecture.png?raw=true)\n\n#### Parameter sweep\n\nTo select optimal parameters for the model, we ran a\nparameter search over a small grid of parameters. See\n[parameter_sweep.md](parameter_sweep.md) for more details.\n\nOverall, we selected optimal `learning rate = 0.0005`, `batch size = 50`, and\n`epochs = 100`. Because training did not improve much between 50 and 100 epochs,\nwe used a 50-epoch model. Training and validation losses across 50 training epochs\nfor the optimal model are shown below.\n\n![Training Performance](figures/onehidden_vae_training.png?raw=true)\n\n#### Model Evaluation\n\nAfter training with optimal hyperparameters, the unsupervised model can be\ninterpreted. 
For example, we can observe the distribution of activation\npatterns for all tumors across specific nodes. The first 10 nodes (of 100) are\nvisualized below.\n\n![Node Activation](figures/node_activation_distribution.png?raw=true)\n\nIn this scenario, each node activation pattern contributes uniquely to each\ntumor and may represent specific gene expression signatures of biological\nsignificance. The distribution is heavily right-skewed, with some nodes\ncapturing slightly bimodal attributes.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreenelab%2Ftybalt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgreenelab%2Ftybalt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreenelab%2Ftybalt/lists"}