https://github.com/ishan-kumar2/molecular_vae_pytorch
PyTorch implementation of the paper "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules"
https://github.com/ishan-kumar2/molecular_vae_pytorch
automatics cheminformatics pytorch qsar smiles
Last synced: about 1 year ago
JSON representation
PyTorch implementation of the paper "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules"
- Host: GitHub
- URL: https://github.com/ishan-kumar2/molecular_vae_pytorch
- Owner: Ishan-Kumar2
- Created: 2020-12-17T12:00:43.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-03-04T11:40:49.000Z (over 5 years ago)
- Last Synced: 2025-04-15T16:18:11.099Z (about 1 year ago)
- Topics: automatics, cheminformatics, pytorch, qsar, smiles
- Language: Python
- Homepage:
- Size: 18.1 MB
- Stars: 28
- Watchers: 1
- Forks: 10
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# A PyTorch implementation of Molecular VAE paper
PyTorch implementation of the paper **"Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules" by Gómez-Bombarelli, et al.**\
Link to Paper - [arXiv](https://arxiv.org/abs/1610.02415)
----
## Getting the Repo
To clone the repo on your machine run -\
`git clone https://github.com/Ishan-Kumar2/Molecular_VAE_Pytorch.git`\
The Structure of the Repo is as follows -\
`data_prep.py`- For Getting the Data in CSV format and splitting into specifed sized Train and Val\
`main.py` - Running the model\
`model.py` - Defines the Architecture of the Model\
`utils.py` - Various useful functions for encoding and decoding the data
## Getting the Dataset
For this work I have used the ChEMBL Dataset which can be found [here](https://www.ebi.ac.uk/chembl/)\
\
Since the whole dataset has over 16M datapoints, I have decided to use a subset of that data.
To get the subset you can either use the train, val data present in ``/data``
or run the ``data_prep.py`` file as - \
`python data_prep.py /path/to/downloaded_data col_name_smiles /save/path 50000` \
\
This will prepare 2 CSV files `/save/path_train.csv` and `/save/path_val.csv` both of length 50k and having randomly shuffled datapoints.
Example of a Smiles string and corresponding Molecule
## Training the Network
To train the network use the `main.py` file
To Run the Papers Model (Conv Encoder and GRU Decoder)\
`python main.py ./data/chembl_500k_train ./data/chembl_500k_val ./Save_Models/ --epochs 100 --model_type mol_vae --latent_dim 290 --batch_size 512 --lr 0.0001`\
Latent Dim has default value 292 which is the value used in the original Paper
To Run a VAE with Fully Connected layers in both Encoder Decoder\
``python main.py ./data/bbbp.csv ./Save_Models/ --epochs 1 --model_type fc --latent_dim 100 --batch_size 20 --lr 0.0001``
## Results
The Train and Validation Losses where tracked for Training and Validation epochs
**Using Latent Dim = 292 (As in the Paper)** \

It starts to overfit the train set after 20 Epochs, so the saved weights at 20 should be used for best results
Although the Training Loss Reduces more in the 392 Case the Validation Loss remains almost equal which means it starts to overfit after 292.
### Sample Outputs
*Input* - \CC(C)(C)C(=O)OCN1OC(=O)c2ccccc12 \
*Output* - \CC(C)CC)C(=O)OC11CC(=O)C2ccccc12
*Input* - \CN\C(=N\S(=O)(=O)c1cc(CCNC(=O)c2cc(Cl)ccc2OC)ccc1OCCOC)\S \
*Output* - \CN\C(=N/S)=O)(=O)c1ccccCNC(=O)c2cc(Cl)ccc2OC)ccc1OCC(C(\C
*Input* - \O[C@@H]1[C@@H](O)[C@@H](Cc2ccccc2)N(CCCCCNC(=O)c3ccccc3)C(=O)N(CCCCCNC(=O)c4ccccc4)[C@@H]1Cc5ccccc5 \
Output - \O[C@@H]1[C@@H](O)[C@@H](Cc2ccccc2)N(CcCCCN3(=O)c3ccccc3)C(=O)N4Cc44NC4C=O)c4cccc54)c1Cc5ccccc5
*Input* - \C\C(=N/OC(c1ccccc1)c2ccccc2)\C[C@H]3CCc4c(C3)cccc4OCC(=O)O \
*Output* - \C\C(=N/OC(c1ccccc1)\2ccccc2)\C33CNC4ccc))ccc44OCC=O)O
*Input* - \O[C@@H](CNCCc1ccc(NS(=O)(=O)c2ccc(cc2)c3coc(n3)c4ccc(cc4)C(F)(F)F)cc1)c5cccnc5 \
*Output*- \O[C@@H](CNCCc1ccc(NS(=O)(=O)c2ccc(cc2)c3ncc(C3)C4cccccc4)C(F)(F)F)cc1)c5cccnc5
*Input*- \CCCCCCCCCCc1cccc(O)c1C(=O)O \
*Output*- \CCCCCCCCCCc1ccccccc)CC(O))O