# AkAne: a bidirectional model that predicts molecular properties and generates molecular structures
![OS](https://img.shields.io/badge/OS-Windows%20|%20Linux%20|%20macOS-blue?color=00b166)
![python](https://img.shields.io/badge/Python-3.10%20|%203.12-blue.svg?color=dd9b65)
![torch](https://img.shields.io/badge/torch-2.2-blue?color=708ddd)
![black](https://img.shields.io/badge/code%20style-black-black)
[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/suenoomozawa/AkAne)

Proudly made in the [School of Chemistry, University of Southampton](https://www.southampton.ac.uk/about/faculties-schools-departments/school-of-chemistry) in 2023.
Presented in [The 20th Nano Bio Info Chemistry Symposium](https://nanobioinfo.chemistry.hiroshima-u.ac.jp/2023/program.html).
## Web APP
First, download the compiled models (`torchscript_model.7z`) from the [release](https://github.com/Augus1999/AkAne/releases) page and extract the folder `torchscript_model` to the same directory as `app.py`. Then run `$ python app.py` to launch the web app locally.

## Trained models
We provide a pre-trained autoencoder; prediction models trained on the MoleculeNet benchmark (ESOL, FreeSolv, Lipo, BBBP, BACE, ClinTox, and HIV), QM9, PhotoSwitch, AqSolDB, a CMC value dataset, and a range of deep eutectic solvent (DES) properties; and 2 generation models that generate protein ligands and DES pairs, respectively. You can download the trained models from the [release](https://github.com/Augus1999/AkAne/releases) page.
## Dataset format
The datasets we used and provide are stored in CSV files. We provide a Python class `CSVData` in [akane2/utils/dataset.py](akane2/utils/dataset.py) to handle these files, which require a header containing the following tags:
* __smiles__ (_mandatory_): the entries under this tag should be molecule SMILES strings. Multiple tags are acceptable.
* __temperature__ (_optional_): the temperature in kelvin. Providing this tag more than once won't cause an error, but only the last occurrence will be used.
* __ratio__ (_optional_): the molar ratio of each compound, in the format `x1:x2:...:xn`. Providing this tag more than once won't cause an error, but only the last occurrence will be used.
* __value__ (_optional_): entries under this tag should be molecular properties. Multiple tags are acceptable, in which case you can tell `CSVData` which value(s) to load by specifying `label_idx=[...]`. If a property is not defined, leave the entry empty and it will be automatically masked to `torch.inf`, telling the model that this property is unknown.
* __seq__ (_optional_): a FASTA-style protein sequence. Providing this tag more than once won't cause an error, but only the last occurrence will be used. NOTE THAT WHEN THIS TAG IS USED, MOLECULAR PROPERTIES (IF PRESENT IN THE FILE) WILL NOT BE LOADED.

The tags do not need to appear in any particular order, e.g.,
```csv
smiles,value,value,ratio,smiles
```
and
```csv
smiles,smiles,ratio,value,value
```
are both valid.

## Training thy own model
The following is a guide to training your own model.
#### _1. Create your dataset following the dataset format_
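For illustration, a minimal property dataset following the format above might look like the file below (the SMILES strings and values are made-up placeholders; empty `value` entries are masked as unknown, as described in the "Dataset format" section):
```csv
smiles,value,value
CCO,-0.77,1.23
c1ccccc1,-0.90,
CC(=O)O,,0.45
```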
#### _2. Split your dataset_
```python
from akane2.utils import split_dataset

split_ratio = 0.8  # you can use any training:testing ratio from 0 to 1
method = "random" # another choice is "scaffold"
split_dataset("YOUR_DATASET.csv", split_ratio, method)
```
This will split your dataset into `YOUR_DATASET_train.csv` and `YOUR_DATASET_test.csv`.
#### _3. Load your data_
```python
from akane2.utils import CSVData

limit = None  # you can specify how many data points you want to load, e.g., 1200
label_index = None # see the above "Dataset format" section
train_set = CSVData("YOUR_DATASET_train.csv", limit, label_index)
test_set = CSVData("YOUR_DATASET_test.csv", limit, label_index)
```
#### _4. Define your work space_
```python
from pathlib import Path

cwd = Path(__file__).parent
workdir = cwd / "YOUR_WORKDIR" # the directory where checkpoints (if any) will be stored
logdir = cwd / "YOUR_LOG.log"  # where to print the log (you can set it to None)
```
#### _5. Define your model_
We provide 2 types of models (that is where the _2_ in the package name comes from): `akane2.representation.AkAne` (the whole AkAne model) and `akane2.representation.Kamome` (the independent encoder part, without latent-space regularisation, connected directly to the readout block).
* If you are only interested in property prediction or molecule classification, we recommend using only the encoder model:
```python
from akane2.representation import Kamome

num_task = 1  # number of tasks in one output, i.e., if you want to predict [HOMO, LUMO, gap] together then set `num_task = 3`
model = Kamome(num_task=num_task) # DON'T FORGET TO SET OTHER IMPORTANT HYPERPARAMETERS
```
* If you are going to train a generative or bidirectional model, please use the whole model:
```python
from akane2.representation import AkAne

num_task = 2
label_mode = "class:2" # see the comments in `akane2/representation.py` about how to set a proper value
model = AkAne(num_task=num_task, label_mode=label_mode) # DON'T FORGET TO SET OTHER IMPORTANT HYPERPARAMETERS
```
__IMPORTANT__: Regarding the hyperparameters (e.g., `num_task` and `label_mode`) that DEFINE the functionality of the model, please refer to the comments under each model in [representation.py](akane2/representation.py).
#### _6. Train your model_
```python
import os
from akane2.utils import train, find_recent_checkpoint

os.environ["NUM_WORKER"] = "4"  # set `num_workers` of torch.utils.data.DataLoader (the default value is min(4, num_cpu_cores) if you remove this line)
chkpt = find_recent_checkpoint(workdir) # find latest checkpoint (if any)
mode = "predict" # training mode based on thy desire. Other options are "autoencoder", "classify", and "diffusion"
n_epochs = 1000 # training epochs
batch_size = 5 # define batch-size. Choose thy own value that won't cause `CUDA out of memory` error
save_every = 100  # save a checkpoint every `save_every` epochs (you can set it to None)
train(model, train_set, mode, n_epochs, batch_size, chkpt, logdir, workdir, save_every)
```
You will find the trained model weights `trained.pt` and (if any) checkpoint files `state-xxxx.pth` under _workdir_. You can safely delete any checkpoint files you no longer need. __NOTE__: To obtain a generative model, you must first train an autoencoder (or finetune a pre-trained one) and then train the diffusion model.
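As a rough sketch of that two-stage workflow, reusing the names defined in steps 5 and 6 (whether the diffusion stage picks up the autoencoder weights through the latest checkpoint is an assumption here):
```python
# stage 1: train the autoencoder part of the full AkAne model
train(model, train_set, "autoencoder", n_epochs, batch_size, chkpt, logdir, workdir, save_every)

# stage 2: train the diffusion model, starting from the autoencoder weights
chkpt = find_recent_checkpoint(workdir)  # pick up the stage-1 checkpoint
train(model, train_set, "diffusion", n_epochs, batch_size, chkpt, logdir, workdir, save_every)
```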
#### _7. Test your model (ignore this step if you are training an autoencoder or generation model)_
```python
import os

from akane2.utils import test

os.environ["INFERENCE_BATCH_SIZE"] = "20"  # set an inference batch-size that won't cause `CUDA out of memory` error (the default value is 20 if you remove this line)
mode = "prediction" # testing mode based on thy model. Another choice is "classification"
print(test(model, test_set, mode, workdir / "trained.pt", logdir))
```
#### _8. Visualise the training loss (optional)_
```python
import matplotlib.pyplot as plt
from akane2.utils import extract_log_info

info = extract_log_info(logdir)
plt.plot(info["epoch"], info["loss"])
plt.xlabel("epoch")
plt.ylabel("MSE loss")
plt.yscale("log")
plt.show()
```

## Inference
Here are some examples:
```python
import torch
from akane2.representation import AkAne, Kamome
from akane2.utils.graph import smiles2graph, gather
from akane2.utils.token import protein2vec

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
############## define the input to encoder ##############
smiles = "FC1=CC(C(OCC)=O)=CC(F)=C1/N=N/C2=C(F)C=C(C(OCC)=O)C=C2F"
mol = gather([smiles2graph(smiles)]) # get a molecular graph from SMILES
mol["node"] = mol["node"].to(device)
mol["edge"] = mol["edge"].to(device)############## define the labels to diffusion model ##############
with open("5lqv.fasta", "r") as f:
fasta = f.readlines()[1]
protein_label = torch.tensor([protein2vec(fasta)], device=device) # get embedded vectors from FASTA
class_label = torch.tensor([[1]], dtype=torch.long, device=device)

############## load models and run inference ##############
model = torch.jit.load("torchscript_model/moleculenet/freesolv.pt").to(device) # load a compiled Kamome model
result = model(mol)
print(result)

model = torch.jit.load("torchscript_model/protein_ligand.pt").to(device)  # load a compiled generative AkAne model
result = model.generate(size=[1, 20, 1], label=protein_label) # batch-size=1 mol-size=20 beam-size=1
print(result)

model = AkAne(num_task=2, label_mode="class:2").pretrained("model_akane/hiv_bidirectional.pt").to(device)  # load a bidirectional AkAne model from saved weights
result = model.inference(mol)
print(result)
result = model.generate(size=[1, 17, 1], label=class_label) # batch-size=1 mol-size=17 beam-size=1
print(result)
```

## Known issues
* You cannot compile 2 or more AkAne models (i.e., `akane2.representation.AkAne`) into TorchScript modules together in one file. We recommend saving the compiled models beforehand and loading them with `torch.jit.load(...)`.
* Directly loading a TorchScript model, or compiling a Python model to TorchScript via `model = torch.jit.script(model)`, slows inference down by roughly 10×. We recommend freezing the TorchScript model for evaluation by adding one additional line, `model = torch.jit.freeze(model.eval())`, to eliminate the warmup (see the sketch below).
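A minimal sketch of the recommended save-ahead-then-freeze pattern (the file name is a placeholder; hyperparameters as in step 5):
```python
import torch
from akane2.representation import AkAne

# compile and save ahead of time (only one AkAne model per script)
model = AkAne(num_task=2, label_mode="class:2")
torch.jit.script(model).save("compiled_akane.pt")  # placeholder file name

# at inference time: load the saved TorchScript module and freeze it
model = torch.jit.load("compiled_akane.pt")
model = torch.jit.freeze(model.eval())  # eliminates the ~10x warmup slowdown
```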
## Cite

```bibtex
@mastersthesis{AkAne2023,
title = {On The Way of Accurate Prediction of Complex Chemical System via General Graph Neural Networks},
author = {Nianze Tao},
year = {2023},
month = {September},
school = {The University of Southampton},
type = {Master's thesis},
note = {MSc Electrochemistry and Battery Technologies 2022-23},
}
```