{"id":20226667,"url":"https://github.com/augus1999/akane","last_synced_at":"2025-05-07T09:31:27.660Z","repository":{"id":193574772,"uuid":"671145395","full_name":"Augus1999/AkAne","owner":"Augus1999","description":"AsymmetriC AutoeNcodEr (ACANE → AkAne). This model is part of MSc Electrochemistry and Battery Technologies project (2022 - 2023), University of Southampton.","archived":false,"fork":false,"pushed_at":"2024-06-11T03:30:41.000Z","size":5374,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"original","last_synced_at":"2024-06-11T04:42:30.217Z","etag":null,"topics":["chemistry","deep-neural-networks","denovo-design","graph-neural-networks","machine-learning","multitask-learning","pytorch-implementation","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Augus1999.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-26T16:34:04.000Z","updated_at":"2024-06-11T03:30:45.000Z","dependencies_parsed_at":null,"dependency_job_id":"57379cb4-e613-46a1-bf43-5fb00f90f9e3","html_url":"https://github.com/Augus1999/AkAne","commit_stats":{"total_commits":28,"total_committers":1,"mean_commits":28.0,"dds":0.0,"last_synced_commit":"6a0f6ac3c3c9b4a67bdd282de3bf89d38f4552d8"},"previous_names":["augus1999/akane"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Augus1999%2FAkAne","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/Augus1999%2FAkAne/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Augus1999%2FAkAne/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Augus1999%2FAkAne/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Augus1999","download_url":"https://codeload.github.com/Augus1999/AkAne/tar.gz/refs/heads/original","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224580594,"owners_count":17334855,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chemistry","deep-neural-networks","denovo-design","graph-neural-networks","machine-learning","multitask-learning","pytorch-implementation","transformer"],"created_at":"2024-11-14T07:19:36.315Z","updated_at":"2024-11-14T07:19:36.908Z","avatar_url":"https://github.com/Augus1999.png","language":"Python","readme":"# A\u003cspan style='color:#CB4154'\u003ek\u003c/span\u003eAne: a bidirectional model that predicts molecular properties and generates molecular structures\n\n\n![OS](https://img.shields.io/badge/OS-Windows%20|%20Linux%20|%20macOS-blue?color=00b166)\n![python](https://img.shields.io/badge/Python-3.10%20|%203.12-blue.svg?color=dd9b65)\n![torch](https://img.shields.io/badge/torch-2.2-blue?color=708ddd)\n![black](https://img.shields.io/badge/code%20style-black-black)\n[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/suenoomozawa/AkAne)\n\n\n\nProudly made at [\u003cimg src=\"image/uos_blue.png\" 
alt=\"University of Southampton\" width=\"100\"/\u003e](https://www.southampton.ac.uk/about/faculties-schools-departments/school-of-chemistry) in 2023.\n\nPresented at [The 20\u003csup\u003eth\u003c/sup\u003e Nano Bio Info Chemistry Symposium](https://nanobioinfo.chemistry.hiroshima-u.ac.jp/2023/program.html).\n\n\u003cimg src=\"image/model_scheme.png\" alt=\"model scheme\" width=\"600\"/\u003e\n\n## Web app\nFirst download the compiled models (`torchscript_model.7z`) from the [release](https://github.com/Augus1999/AkAne/releases) page and extract the folder `torchscript_model` to the same directory as `app.py`. Then you can run `$ python app.py` to launch the web app locally.\n\n## Trained models\nWe provide a pre-trained autoencoder; prediction models trained on the MoleculeNet benchmark (including ESOL, FreeSolv, Lipo, BBBP, BACE, ClinTox, and HIV), QM9, PhotoSwitch, AqSolDB, a CMC value dataset, and a range of deep eutectic solvent (DES) properties; and two generation models that generate protein ligands and DES pairs, respectively.\n\nYou can download the trained models from the [release](https://github.com/Augus1999/AkAne/releases) page.\n\n## Dataset format\nThe datasets we used and provide are stored in CSV files. We provide a Python class `CSVData` in [akane2/utils/dataset.py](akane2/utils/dataset.py) to handle these files, which require a header with the following tags:\n* __smiles__ (_mandatory_): the entries under this tag should be molecular SMILES strings. Multiple tags are acceptable.\n* __temperature__ (_optional_): the temperature in kelvin. Providing this tag more than once won't cause an error, but only the last one will be used.\n* __ratio__ (_optional_): the molar ratio of each compound in the format `x1:x2:...:xn`. Providing this tag more than once won't cause an error, but only the last one will be used.\n* __value__ (_optional_): entries under this tag should be molecular properties. 
Multiple tags are acceptable, and in this case you can tell `CSVData` which value(s) to load by specifying `label_idx=[...]`. If a property is not defined, leave it empty and the entry will be automatically masked to `torch.inf`, telling the model that this property is unknown. \n* __seq__ (_optional_): a FASTA-style protein sequence. Providing this tag more than once won't cause an error, but only the last one will be used. NOTE THAT WHEN THIS TAG IS USED, MOLECULAR PROPERTIES (IF PRESENT IN THE FILE) WILL NOT BE LOADED.\n\nThese tags do not need to be in any particular order, e.g.,\n```csv\nsmiles,value,value,ratio,smiles\n```\nand\n```csv\nsmiles,smiles,ratio,value,value\n```\nare both okay.\n\n## Training thy own model\nThe following is a guide to training your own model.\n#### _1. Create your dataset following the dataset format_\n#### _2. Split your dataset_\n```python\nfrom akane2.utils import split_dataset\n\nsplit_ratio = 0.8  # you can use any training:testing split fraction between 0 and 1\nmethod = \"random\"  # another choice is \"scaffold\"\nsplit_dataset(\"YOUR_DATASET.csv\", split_ratio, method)\n```\nThis will split your dataset into `YOUR_DATASET_train.csv` and `YOUR_DATASET_test.csv`.\n#### _3. Load your data_\n```python\nfrom akane2.utils import CSVData\n\nlimit = None  # you can specify how many data points you want to load, e.g., 1200\nlabel_index = None  # see the \"Dataset format\" section above\ntrain_set = CSVData(\"YOUR_DATASET_train.csv\", limit, label_index)\ntest_set = CSVData(\"YOUR_DATASET_test.csv\", limit, label_index)\n```\n#### _4. Define your workspace_\n```python\nfrom pathlib import Path\n\ncwd = Path(__file__).parent\nworkdir = cwd / \"YOUR_WORKDIR\"  # the directory where checkpoints (if any) will be stored\nlogdir = cwd / \"YOUR_LOG.log\"  # where the log is written (you can set it to `None`)\n```\n#### _5. 
Define your model_\nWe provide 2 types of models (that is where _2_ comes from in the package name): `akane2.representation.AkAne` (the whole A\u003cspan style='color:#CB4154'\u003ek\u003c/span\u003eAne model) and `akane2.representation.Kamome` (the independent encoder part, without latent-space regularisation, connected directly to the readout block).\n* If you are only interested in property prediction or molecule classification, we recommend using only the encoder model:\n```python\nfrom akane2.representation import Kamome\n\nnum_task = 1  # number of tasks in one output, e.g., if you want to predict [HOMO, LUMO, gap] together then set `num_task = 3`\nmodel = Kamome(num_task=num_task)  #  DON'T FORGET TO SET OTHER IMPORTANT HYPERPARAMETERS\n```\n* If you are going to train a generative or bidirectional model, please use the whole model:\n```python\nfrom akane2.representation import AkAne\n\nnum_task = 2\nlabel_mode = \"class:2\"  # see the comments in `akane2/representation.py` about how to set a proper value\nmodel = AkAne(num_task=num_task, label_mode=label_mode)  #  DON'T FORGET TO SET OTHER IMPORTANT HYPERPARAMETERS\n```\n__IMPORTANT__: Regarding the hyperparameters (e.g., `num_task` and `label_mode`) that DEFINE the functionality of the model, please refer to the comments under each model in [representation.py](akane2/representation.py).\n#### _6. Train your model_\n```python\nimport os\nfrom akane2.utils import train, find_recent_checkpoint\n\nos.environ[\"NUM_WORKER\"] = \"4\"  # set `num_workers` of torch.utils.data.DataLoader (the default value is min(4, num_cpu_cores) if you remove this line)\nchkpt = find_recent_checkpoint(workdir)  # find the latest checkpoint (if any)\nmode = \"predict\"  # training mode based on thy desire. Other options are \"autoencoder\", \"classify\", and \"diffusion\"\nn_epochs = 1000  # training epochs\nbatch_size = 5  # define the batch size. 
Choose thy own value that won't cause a `CUDA out of memory` error\nsave_every = 100  # save a checkpoint every `save_every` epochs (you can set it to `None`)\ntrain(model, train_set, mode, n_epochs, batch_size, chkpt, logdir, workdir, save_every)\n```\nYou will find the trained model weights `trained.pt` and (if any) checkpoint file(s) `state-xxxx.pth` under _workdir_. You can safely delete any checkpoint files you no longer need. __NOTE__: To obtain a generative model, you must first train an autoencoder (or finetune a pre-trained one) and then train the diffusion model.\n#### _7. Test your model (skip this step if you are training an autoencoder or generative model)_\n```python\nimport os\nfrom akane2.utils import test\n\nos.environ[\"INFERENCE_BATCH_SIZE\"] = \"20\"  # set an inference batch size that won't cause a `CUDA out of memory` error (the default value is 20 if you remove this line)\nmode = \"prediction\"  # testing mode based on thy model. Another choice is \"classification\"\nprint(test(model, test_set, mode, workdir / \"trained.pt\", logdir))\n```\n#### _8. 
Visualise the training loss (optional)_\n```python\nimport matplotlib.pyplot as plt\nfrom akane2.utils import extract_log_info\n\ninfo = extract_log_info(logdir)\nplt.plot(info[\"epoch\"], info[\"loss\"])\nplt.xlabel(\"epoch\")\nplt.ylabel(\"MSE loss\")\nplt.yscale(\"log\")\nplt.show()\n```\n\n## Inference\nHere are some examples:\n```python\nimport torch\nfrom akane2.representation import AkAne, Kamome\nfrom akane2.utils.graph import smiles2graph, gather\nfrom akane2.utils.token import protein2vec\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n\n############## define the input to the encoder ##############\nsmiles = \"FC1=CC(C(OCC)=O)=CC(F)=C1/N=N/C2=C(F)C=C(C(OCC)=O)C=C2F\"\nmol = gather([smiles2graph(smiles)])  # get a molecular graph from SMILES\nmol[\"node\"] = mol[\"node\"].to(device)\nmol[\"edge\"] = mol[\"edge\"].to(device)\n\n############## define the labels for the diffusion model ##############\nwith open(\"5lqv.fasta\", \"r\") as f:\n    fasta = f.readlines()[1]\nprotein_label = torch.tensor([protein2vec(fasta)], device=device)  # get embedded vectors from FASTA\nclass_label = torch.tensor([[1]], dtype=torch.long, device=device)\n\n############## load models and run inference ##############\nmodel = torch.jit.load(\"torchscript_model/moleculenet/freesolv.pt\").to(device)  # load a compiled Kamome model\nresult = model(mol)\nprint(result)\n\nmodel = torch.jit.load(\"torchscript_model/protein_ligand.pt\").to(device)  # load a compiled generative AkAne model\nresult = model.generate(size=[1, 20, 1], label=protein_label)  # batch-size=1 mol-size=20 beam-size=1\nprint(result)\n\nmodel = AkAne(num_task=2, label_mode=\"class:2\").pretrained(\"model_akane/hiv_bidirectional.pt\").to(device)  # load a bidirectional AkAne model from saved model weights\nresult = model.inference(mol)\nprint(result)\nresult = model.generate(size=[1, 17, 1], label=class_label)  # batch-size=1 mol-size=17 beam-size=1\nprint(result)\n```\n\n## Known issues\n* You 
cannot compile two or more AkAne models (i.e., `akane2.representation.AkAne`) into TorchScript modules together in one file. We recommend saving the compiled models beforehand and loading them with `torch.jit.load(...)`.\n* Directly loading a TorchScript model, or compiling a Python model to TorchScript via `model = torch.jit.script(model)`, will slow down inference by roughly $10\\times$. We recommend freezing the TorchScript model for evaluation by adding an additional line, `model = torch.jit.freeze(model.eval())`, to eliminate the warm-up.\n\n## Cite\n```bibtex\n@mastersthesis{AkAne2023,\ntitle  = {On The Way of Accurate Prediction of Complex Chemical System via General Graph Neural Networks},\nauthor = {Nianze Tao},\nyear   = {2023},\nmonth  = {September},\nschool = {The University of Southampton},\ntype   = {Master's thesis},\nnote   = {MSc Electrochemistry and Battery Technologies 2022-23},\n}\n```","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faugus1999%2Fakane","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faugus1999%2Fakane","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faugus1999%2Fakane/lists"}