{"id":19091956,"url":"https://github.com/allencellmodeling/cvae_testbed","last_synced_at":"2025-07-22T10:07:38.221Z","repository":{"id":40980632,"uuid":"200938346","full_name":"AllenCellModeling/CVAE_testbed","owner":"AllenCellModeling","description":"A research testbed on conditional variational autoencoders using Gaussian distributions as input","archived":false,"fork":false,"pushed_at":"2023-07-06T21:42:11.000Z","size":4143,"stargazers_count":3,"open_issues_count":2,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-30T12:17:27.470Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AllenCellModeling.png","metadata":{"files":{"readme":"README.rst","changelog":"HISTORY.rst","contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.rst","dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-08-06T23:36:34.000Z","updated_at":"2021-08-16T19:23:53.000Z","dependencies_parsed_at":"2025-04-18T14:00:35.104Z","dependency_job_id":"37200e86-772b-4ff0-9228-71266104780e","html_url":"https://github.com/AllenCellModeling/CVAE_testbed","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AllenCellModeling/CVAE_testbed","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenCellModeling%2FCVAE_testbed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenCellModeling%2FCVAE_testbed/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenCellModeling%2FCVAE_testbed
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenCellModeling%2FCVAE_testbed/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AllenCellModeling","download_url":"https://codeload.github.com/AllenCellModeling/CVAE_testbed/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenCellModeling%2FCVAE_testbed/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266473101,"owners_count":23934477,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-22T02:00:09.085Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T03:17:38.993Z","updated_at":"2025-07-22T10:07:38.165Z","avatar_url":"https://github.com/AllenCellModeling.png","language":"Jupyter Notebook","readme":"=====================\nCVAE research testbed\n=====================\n\n.. image:: https://travis-ci.org/AllenCellModeling/CVAE_testbed.svg?branch=master\n        :target: https://travis-ci.org/AllenCellModeling/CVAE_testbed\n        :alt: Build Status\n        \n.. image:: https://readthedocs.org/projects/gaussian-cvae/badge/?version=latest\n        :target: https://gaussian-cvae.readthedocs.io/en/latest/?badge=latest\n        :alt: Documentation Status\n\n.. 
image:: https://codecov.io/gh/AllenCellModeling/CVAE_testbed/branch/master/graph/badge.svg\n        :target: https://codecov.io/gh/AllenCellModeling/CVAE_testbed\n        :alt: Codecov Status\n\n\nA research testbed on conditional variational autoencoders using Gaussian distributions as input. We are interested in arbitrary conditioning of a CVAE and in finding the relationships between information passing through the latent-dimension bottleneck and the input dimensions. Our goal is to generate a fully factorizable probabilistic model of structures in a cell.\n\n* Free software: Allen Institute Software License\n\n* Documentation: https://Gaussian-CVAE.readthedocs.io.\n\nTests\n--------\n\n* Create a conda environment\n\n.. code-block:: bash\n\n    $ conda create --name cvae python=3.7\n\n* Activate the conda environment:\n\n.. code-block:: bash\n\n    $ conda activate cvae\n\n* Install the requirements from setup.py\n\n.. code-block:: bash\n\n    $ pip install -e .[all]\n\nUsage: Synthetic data\n---------------------\n\n* Run all synthetic data models:\n\n.. code-block:: bash\n\n    $ cd scripts\n\n.. code-block:: bash\n\n    $ ./run_all_synthetic_datasets.sh\n\nThis will run multiple synthetic experiments on 2 GPUs simultaneously. Each script can also be run individually. Here, we go through them one by one:\n\n* Run the baseline model. \n\n.. code-block:: bash\n\n    $ cd scripts\n\n.. code-block:: bash\n\n    $ ./baseline.sh\n\nThis model takes a set of independent Gaussian distributions as input. Specify the number of input dimensions 'x_dim' in baseline_kwargs.json.\n\nSpecifying 2 input dimensions gives\n\n.. image:: scripts/outputs/baseline_results/encoding_test_plots.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nThis plot can be viewed in outputs/baseline_results. The first panel shows the train and test loss. The other 3 plots are encoding tests of the model in the presence of different sets of conditions. 
0 (blue) implies that no conditions are provided, and thus the model uses 2 latent dimensions to encode the information. 1 (orange) implies that one condition is provided, meaning the model needs only 1 latent dimension to encode the information. Finally, 2 (green) means that both conditions are provided, implying that the model needs no dimensions to encode the information, i.e. all the information about the input data has been provided via the condition. \n\nSimilarly, specifying 4 input dimensions gives\n\n.. image:: https://user-images.githubusercontent.com/40371793/63390327-8e69fc80-c363-11e9-93e0-219b6044774d.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nspecifying 6 input dimensions gives\n\n.. image:: https://user-images.githubusercontent.com/40371793/63449614-4f848700-c3f5-11e9-842e-40b07271a5ed.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nand so on.\n\n* Run the projected baseline model. This model takes a set of independent Gaussian distributions as input and projects them to a higher dimension. Specify the number of input dimensions 'x_dim' and the number of projected dimensions 'projection_dim' in baseline_kwargs_proj.json.\n\n.. code-block:: bash\n\n    $ ./baseline_projected.sh\n\nProjecting 2 dimensions to 8 dimensions gives \n\n.. image:: scripts/outputs/baseline_results_projected/encoding_test_plots.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nThis plot can be viewed in outputs/baseline_results_projected. The model uses only 2 dimensions in the latent space to encode information from the 8-dimensional input dataset. \n\nSimilarly, projecting 2 dimensions to 4 dimensions gives\n\n.. image:: https://user-images.githubusercontent.com/40371793/63447464-eac72d80-c3f0-11e9-86c9-26df0b5ed8da.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nprojecting 4 dimensions to 8 dimensions gives \n\n.. 
image:: https://user-images.githubusercontent.com/40371793/63446173-9327c280-c3ee-11e9-95c9-ed04fdab0522.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nand so on. \n\n* Run the projected baseline model with a mask. This model takes a set of independent Gaussian distributions, projects them to a higher-dimensional space and then masks a percentage of the input data during training. \n\n.. code-block:: bash\n\n    $ ./baseline_projected_with_mask.sh\n\nHere we need to update the loss function so that it does not penalize masked data. Without doing this, projecting 2 dimensions to 8 dimensions with 50% of the input data masked gives \n\n.. image:: https://user-images.githubusercontent.com/40371793/63446885-dafb1980-c3ef-11e9-89cb-6389a38dfaca.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nAfter updating the loss, doing the same thing gives\n\n.. image:: https://user-images.githubusercontent.com/40371793/63446987-10076c00-c3f0-11e9-9b99-72b67c3592fa.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nDespite 50% of the data being masked, the model uses 2 dimensions in the latent space.\n\n* Run the swiss roll baseline model. This model takes the swiss roll dataset as input. \n\n.. code-block:: bash\n\n    $ ./baseline_swissroll.sh\n\nThe swiss roll dataset is parametrized as:\n\n.. math:: x = \\phi \\cos(\\phi)\n.. math:: y = \\phi \\sin(\\phi)\n.. math:: z = \\psi\n\nDespite having 3 dimensions, it is parametrized by only 2 dimensions. Running this script gives\n\n.. image:: scripts/outputs/baseline_results_swissroll/encoding_test_plots.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nThis plot can be viewed in outputs/baseline_results_swissroll. We observe that given 0 conditions (blue), the data gets embedded into only 2 dimensions in the latent space. Providing 1 condition (X) is no different than providing 2 conditions (X and Y) since both X and Y are parameterized by only 1 dimension. 
Finally, providing both conditions means that no information passes through the bottleneck and the model encodes no information. \n\n* Run the sklearn datasets model. This model takes sklearn datasets like circles, moons and blobs as input. \n\n.. code-block:: bash\n\n    $ ./baseline_circles_moons_blobs.sh\n\nThe type of dataset (i.e. circles, moons, blobs or an s_curve) is specified in \"sklearn_data\" in baseline_kwargs_circles_moons_blobs.json. Running this script for blobs gives \n\n.. image:: scripts/outputs/loop_models_blobs/encoding_test_plots.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nSimilarly, running it for moons gives \n\n.. image:: scripts/outputs/loop_models_moons/encoding_test_plots.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nThis is how the original data maps to the latent space\n\n.. image:: https://user-images.githubusercontent.com/40371793/63801095-61b66780-c8c4-11e9-9b59-d51be918211f.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nSimilarly, an s_curve gives \n\n.. image:: scripts/outputs/loop_models_s_curve/encoding_test_plots.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nAnd circles give\n\n.. image:: scripts/outputs/loop_models_circles/encoding_test_plots.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\n* Run compare_models.py to compare results across output folders\n\n* Visualize individual or multiple model runs using the notebooks in CVAE_testbed/notebooks\n\nUsage: AICS feature data\n------------------------\n\n* Run the AICS feature model. Here we pass 159 features (1 binary, 102 real and 56 one-hot features) through the CVAE.\n\n.. code-block:: bash\n\n    $ cd scripts\n\n.. code-block:: bash\n\n    $ ./aics_features_simple.sh\n\nHere is what the encoding looks like for a beta of 1\n\n.. 
image:: scripts/outputs/aics_159_features_beta_1/encoding_test_plots.png\n   :width: 750px\n   :scale: 100 %\n   :align: center\n\nThere is no information passing through the information bottleneck, i.e. the KL divergence term is near 0 and the model is close to the autoencoding limit. \n\nWe can vary beta and compare ELBO and FID scores in order to find the best model. \n\nOrganization\n------------\n\nThe project has the following structure::\n\n    CVAE_testbed/\n      |- README.rst\n      |- setup.py\n      |- requirements.txt\n      |- tox.ini\n      |- Makefile\n      |- MANIFEST.in\n      |- HISTORY.rst\n      |- CHANGES.rst\n      |- AUTHORS.rst\n      |- LICENSE\n      |- docs/\n         |- ...\n      |- CVAE_testbed/\n         |- __init__.py\n         |- main_train.py\n         |- baseline_kwargs.json\n         |- mnist_kwargs.json\n         |- tests/\n            |- __init__.py\n            |- test_function.py\n            |- example.sh\n         |- datasets/\n            |- __init__.py\n            |- dataloader.py\n            |- synthetic.py\n         |- losses/\n            |- __init__.py\n            |- ELBO.py\n         |- metrics/\n            |- __init__.py\n            |- blur.py\n            |- calculate_fid.py\n            |- inception.py\n            |- visualize_encoder.py\n         |- models/\n            |- __init__.py\n            |- CVAE_baseline.py\n            |- CVAE_first.py\n            |- sample.py\n         |- run_models/\n            |- __init__.py\n            |- generative_metric.py\n            |- run_synthetic.py\n            |- run_test_train.py\n            |- test.py\n            |- train.py\n         |- scripts/\n            |- __init__.py\n            |- baseline.sh\n            |- mnist.sh\n            |- compare_models.py\n         |- utils/\n            |- __init__.py\n            |- compare_plots.py\n\nSupport\n-------\n\nWe are not currently supporting this code; we are releasing it to the community AS IS and cannot provide any guarantees of support. The community is welcome to submit issues, but you should not expect an active response.\n\nCredits\n-------\n\nThis package was created with Cookiecutter_.\n\n.. _Cookiecutter: https://github.com/audreyr/cookiecutter\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallencellmodeling%2Fcvae_testbed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fallencellmodeling%2Fcvae_testbed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallencellmodeling%2Fcvae_testbed/lists"}