https://github.com/drkenreid/vae-for-molecule-discovery
A Variational Autoencoder in Google Colab to generate and visualize novel molecular structures for potential drug discovery applications, using the QM9 dataset and SMILES representation.
drug-discovery molecule-generation molecule-visualization qm9 smiles smiles-strings vae variational-autoencoder
- Host: GitHub
- URL: https://github.com/drkenreid/vae-for-molecule-discovery
- Owner: DrKenReid
- License: mit
- Created: 2024-08-21T14:17:59.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-08-21T20:22:10.000Z (almost 2 years ago)
- Last Synced: 2024-12-31T15:54:10.887Z (over 1 year ago)
- Topics: drug-discovery, molecule-generation, molecule-visualization, qm9, smiles, smiles-strings, vae, variational-autoencoder
- Language: Jupyter Notebook
- Homepage:
- Size: 32.2 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Variational Autoencoder for Molecule Discovery
## Overview
This project implements a Variational Autoencoder (VAE) for generating novel molecular structures. It's particularly useful in drug discovery, where the goal is to generate new potential drug candidates. The project is designed to run in Google Colab, leveraging GPU acceleration for efficient training and generation.
## Features
- **VAE Architecture**: Utilizes a Variational Autoencoder to learn a compact representation of molecular structures and generate new ones.
- **SMILES Representation**: Uses SMILES (Simplified Molecular-Input Line-Entry System) strings for molecular representation.
- **QM9 Dataset**: Trains on the QM9 dataset, a standard benchmark in molecular machine learning.
- **Molecule Visualization**: Generates and visualizes molecular structures using RDKit.
- **Property Calculation**: Computes basic molecular properties for generated molecules.
- **Validity and Novelty Checks**: Assesses the validity of generated molecules and checks for novelty against the training set.
- **Google Colab Integration**: Designed to run in Google Colab for easy access to GPU resources.
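For readers unfamiliar with the VAE objective, the loss that such a model typically minimizes can be written as below. This is the standard textbook formulation, not a transcription of the notebook: the weighting factor $\beta$ (with $\beta = 1$ giving the plain VAE) and the symbol names are assumptions made here for illustration, and $d$ stands for the latent dimension.

```latex
% Negative ELBO for one SMILES string x with latent code z (standard form; beta is an assumed weight)
\mathcal{L}(\theta, \phi; x)
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  + \beta \, D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big)

% Closed-form KL term when the encoder outputs a diagonal Gaussian with mean mu and variance sigma^2
D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big)
  = -\tfrac{1}{2} \sum_{j=1}^{d} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)
```

The first term rewards accurate reconstruction of the input string; the second keeps the latent codes close to a standard normal, so that sampling $z \sim \mathcal{N}(0, I)$ at generation time yields codes the decoder has learned to handle.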
## Requirements
- Google Colab environment
- Required libraries (automatically installed in the notebook):
  - PyTorch
  - RDKit
  - Pandas
  - Pillow
  - IPython
## Usage
1. Open the notebook in Google Colab.
2. Run the cells in order, following the instructions in the notebook.
3. The notebook will guide you through:
   - Setting up the environment
   - Loading and preprocessing the QM9 dataset
   - Defining and training the VAE model
   - Generating new molecules
   - Visualizing and analyzing the generated molecules
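As an illustration of the preprocessing step, here is a minimal sketch of character-level SMILES encoding using only the standard library. It is not taken from the notebook: the special tokens and the helper names `build_vocab`, `encode`, and `decode` are hypothetical. Character-level splitting is a reasonable assumption for QM9, whose molecules contain only C, H, O, N, and F, so no two-letter atom symbols such as `Cl` get split.

```python
from typing import Dict, List

# Hypothetical special tokens; the notebook may use different ones.
PAD, START, END = "<pad>", "<s>", "</s>"


def build_vocab(smiles_list: List[str]) -> Dict[str, int]:
    """Map the special tokens and every character seen in the data to an integer id."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {tok: i for i, tok in enumerate([PAD, START, END] + chars)}


def encode(smiles: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Wrap a SMILES string in start/end tokens and right-pad it to max_len ids."""
    ids = [vocab[START]] + [vocab[ch] for ch in smiles] + [vocab[END]]
    if len(ids) > max_len:
        raise ValueError(f"{smiles!r} needs {len(ids)} tokens, max_len is {max_len}")
    return ids + [vocab[PAD]] * (max_len - len(ids))


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Turn ids back into a SMILES string, stopping at the end token."""
    inverse = {i: tok for tok, i in vocab.items()}
    chars = []
    for i in ids:
        tok = inverse[i]
        if tok == END:
            break
        if tok not in (PAD, START):
            chars.append(tok)
    return "".join(chars)


smiles = ["CCO", "C1CC1", "C#N"]
vocab = build_vocab(smiles)
max_len = max(len(s) for s in smiles) + 2  # room for start and end tokens
encoded = [encode(s, vocab, max_len) for s in smiles]
```

The resulting fixed-length id lists can be stacked into an integer tensor and fed to an embedding layer ahead of the GRU encoder.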
## Configuration
You can modify the following parameters in the notebook:
- `hidden_dim`: Dimension of the hidden state in GRU layers
- `latent_dim`: Dimension of the latent space
- `batch_size`: Batch size for training
- `num_epochs`: Number of training epochs
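One way to keep these four parameters together is a small configuration object. The sketch below is an assumption about how they could be organized, not the notebook's actual code: the class name `VAEConfig` is hypothetical, and the default values are placeholders rather than the notebook's settings.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class VAEConfig:
    """Hyperparameters named in the README; every default is a placeholder."""

    hidden_dim: int = 256  # GRU hidden state size (placeholder value)
    latent_dim: int = 64   # latent space size (placeholder value)
    batch_size: int = 128  # training batch size (placeholder value)
    num_epochs: int = 20   # number of training epochs (placeholder value)

    def __post_init__(self) -> None:
        # Reject non-positive values early instead of failing mid-training.
        for name, value in asdict(self).items():
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")


config = VAEConfig(latent_dim=32)
longer_run = replace(config, num_epochs=50)  # copy with one field changed
```

Freezing the dataclass means an experiment's settings cannot change silently between cells; variations are made as explicit copies with `replace`.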
## Output
The notebook generates several outputs:
1. Training loss plots
2. Generated SMILES strings
3. Visualizations of generated molecules
4. Analysis of molecular properties
5. Validity and novelty statistics
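The validity and novelty statistics can be sketched as follows. The notebook's exact definitions are not stated here, so the ones below are assumptions: validity is the fraction of generated strings that RDKit can parse, uniqueness is the fraction of valid molecules that are distinct after canonicalization, and novelty is the fraction of distinct valid molecules absent from the training set. The function names are hypothetical; `Chem.MolFromSmiles` and `Chem.MolToSmiles` are the RDKit calls used.

```python
from typing import Callable, Dict, Iterable, Optional


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return the canonical SMILES, or None if the string is empty or unparsable."""
    from rdkit import Chem  # imported lazily so the module loads without RDKit

    if not smiles:
        return None  # an empty string parses to an empty molecule, so reject it here
    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)


def generation_stats(
    generated: Iterable[str],
    training: Iterable[str],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonical,
) -> Dict[str, float]:
    """Compute validity, uniqueness, and novelty (assumed definitions, see lead-in)."""
    generated = list(generated)
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)
    # Canonicalize the training set too, so the same molecule written two ways matches.
    train = {c for c in (canonicalize(s) for s in training) if c is not None}
    novel = unique - train
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

In Colab with RDKit installed, `generation_stats(generated_smiles, training_smiles)` returns the three ratios, where both arguments are lists of SMILES strings. Comparing canonical forms matters because one molecule can be written as many different SMILES strings.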
## Limitations
- The model's performance is limited by the size and diversity of the training dataset (QM9).
- Generated molecules may not always be synthetically feasible or stable.
- The current implementation focuses on small organic molecules.
## Contributing
Contributions, issues, and feature requests are welcome. Feel free to open an issue or submit a pull request.
## License
This project is open-source and available under the MIT License.
## Disclaimer
This tool is for research and educational purposes only. Generated molecules should not be treated as drug candidates without extensive further testing and validation.