{"id":20628939,"url":"https://github.com/ishan-kumar2/molecular_vae_pytorch","last_synced_at":"2025-04-15T16:18:16.510Z","repository":{"id":209998773,"uuid":"322281620","full_name":"Ishan-Kumar2/Molecular_VAE_Pytorch","owner":"Ishan-Kumar2","description":"PyTorch implementation of the paper \"Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules\"","archived":false,"fork":false,"pushed_at":"2021-03-04T11:40:49.000Z","size":18962,"stargazers_count":28,"open_issues_count":1,"forks_count":10,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-15T16:18:11.099Z","etag":null,"topics":["automatics","cheminformatics","pytorch","qsar","smiles"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ishan-Kumar2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-12-17T12:00:43.000Z","updated_at":"2024-12-12T12:57:31.000Z","dependencies_parsed_at":"2023-11-30T08:27:12.671Z","dependency_job_id":"7bfb9ea3-20ac-4f4a-975b-cf195447e8cb","html_url":"https://github.com/Ishan-Kumar2/Molecular_VAE_Pytorch","commit_stats":null,"previous_names":["ishan-kumar2/molecular_vae_pytorch"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ishan-Kumar2%2FMolecular_VAE_Pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ishan-Kumar2%2FMolecular_VAE_Pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ishan-Kumar2%2FMolecular_VAE_Pytorch/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/Ishan-Kumar2%2FMolecular_VAE_Pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ishan-Kumar2","download_url":"https://codeload.github.com/Ishan-Kumar2/Molecular_VAE_Pytorch/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249105474,"owners_count":21213537,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automatics","cheminformatics","pytorch","qsar","smiles"],"created_at":"2024-11-16T13:23:49.264Z","updated_at":"2025-04-15T16:18:16.461Z","avatar_url":"https://github.com/Ishan-Kumar2.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A PyTorch implementation of the Molecular VAE paper\n\nPyTorch implementation of the paper **\"Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules\" by Gómez-Bombarelli, et al.**\\\nLink to Paper - [arXiv](https://arxiv.org/abs/1610.02415)\n\u003cbr /\u003e\n\n\u003cdiv style=\"text-align:center\"\u003e\u003cimg src=\"https://github.com/Ishan-Kumar2/Molecular_VAE_Pytorch/blob/master/Sample_imgs/cover_img.jpg\" /\u003e\u003c/div\u003e\n\n----\n\n## Getting the Repo\nTo clone the repo on your machine, run -\\\n`git clone https://github.com/Ishan-Kumar2/Molecular_VAE_Pytorch.git`\\\nThe structure of the repo is as follows -\\\n`data_prep.py` - Gets the data in CSV format and splits it into Train and Val sets of the specified size\\\n`main.py` - Runs the model\\\n`model.py` - Defines the architecture of the model\\\n`utils.py` - Various useful functions 
for encoding and decoding the data \u003cbr /\u003e\n\n\n## Getting the Dataset\nFor this work I used the ChEMBL dataset, which can be found [here](https://www.ebi.ac.uk/chembl/)\\\n\\\nSince the whole dataset has over 16M datapoints, I decided to use a subset of the data.\nTo get the subset you can either use the train and val data present in ``/data``\nor run the ``data_prep.py`` file as - \\\n`python data_prep.py /path/to/downloaded_data col_name_smiles /save/path 50000` \\\n\\\nThis will prepare two CSV files, `/save/path_train.csv` and `/save/path_val.csv`, each containing 50k randomly shuffled datapoints.\n\nExample of a SMILES string and the corresponding molecule\n\n\n\n## Training the Network\nTo train the network, use the `main.py` file.\n\nTo run the paper's model (Conv encoder and GRU decoder)\\\n`python main.py ./data/chembl_500k_train ./data/chembl_500k_val ./Save_Models/ --epochs 100 --model_type mol_vae --latent_dim 290 --batch_size 512 --lr 0.0001`\\\nThe latent dimension defaults to 292, which is the value used in the original paper.\n\nTo run a VAE with fully connected layers in both the encoder and the decoder\\\n``python main.py ./data/bbbp.csv ./Save_Models/ --epochs 1 --model_type fc --latent_dim 100 --batch_size 20 --lr 0.0001``\n\n\n## Results\n\nThe train and validation losses were tracked for each epoch.\n\n**Using Latent Dim = 292 (As in the Paper)** \\\n![Loss graphs](/Sample_imgs/graph_loss_1.png) \n\nThe model starts to overfit the train set after 20 epochs, so the weights saved at epoch 20 should be used for best results. \u003cbr /\u003e\n\n\n\nAlthough the training loss decreases further in the 392 case, the validation loss remains almost equal, which suggests the model starts to overfit for latent dimensions above 292.\n\n### Sample Outputs\n\n*Input* - \\CC(C)(C)C(=O)OCN1OC(=O)c2ccccc12 \\\n*Output* - \\CC(C)CC)C(=O)OC11CC(=O)C2ccccc12\n\n*Input* - \\CN\\C(=N\\S(=O)(=O)c1cc(CCNC(=O)c2cc(Cl)ccc2OC)ccc1OCCOC)\\S \\\n*Output* - 
\\CN\\C(=N/S)=O)(=O)c1ccccCNC(=O)c2cc(Cl)ccc2OC)ccc1OCC(C(\\C \n\n*Input* - \\O[C@@H]1[C@@H](O)[C@@H](Cc2ccccc2)N(CCCCCNC(=O)c3ccccc3)C(=O)N(CCCCCNC(=O)c4ccccc4)[C@@H]1Cc5ccccc5 \\\nOutput -  \\O[C@@H]1[C@@H](O)[C@@H](Cc2ccccc2)N(CcCCCN3(=O)c3ccccc3)C(=O)N4Cc44NC4C=O)c4cccc54)c1Cc5ccccc5\n\n*Input* - \\C\\C(=N/OC(c1ccccc1)c2ccccc2)\\C[C@H]3CCc4c(C3)cccc4OCC(=O)O \\\n*Output* - \\C\\C(=N/OC(c1ccccc1)\\2ccccc2)\\C33CNC4ccc))ccc44OCC=O)O\n\n*Input* - \\O[C@@H](CNCCc1ccc(NS(=O)(=O)c2ccc(cc2)c3coc(n3)c4ccc(cc4)C(F)(F)F)cc1)c5cccnc5 \\\n*Output*- \\O[C@@H](CNCCc1ccc(NS(=O)(=O)c2ccc(cc2)c3ncc(C3)C4cccccc4)C(F)(F)F)cc1)c5cccnc5 \n\n*Input*- \\CCCCCCCCCCc1cccc(O)c1C(=O)O \\\n*Output*- \\CCCCCCCCCCc1ccccccc)CC(O))O \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fishan-kumar2%2Fmolecular_vae_pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fishan-kumar2%2Fmolecular_vae_pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fishan-kumar2%2Fmolecular_vae_pytorch/lists"}