{"id":13698867,"url":"https://github.com/microsoft/molecule-generation","last_synced_at":"2025-04-04T20:08:51.905Z","repository":{"id":37475479,"uuid":"460574336","full_name":"microsoft/molecule-generation","owner":"microsoft","description":"Implementation of MoLeR: a generative model of molecular graphs which supports scaffold-constrained generation","archived":false,"fork":false,"pushed_at":"2024-01-04T18:10:00.000Z","size":1003,"stargazers_count":292,"open_issues_count":5,"forks_count":42,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-03-28T19:09:40.926Z","etag":null,"topics":["deep-learning","generative-model","graph-neural-networks","molecule-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null}},"created_at":"2022-02-17T19:16:29.000Z","updated_at":"2025-03-11T02:26:44.000Z","dependencies_parsed_at":"2023-11-23T13:44:23.130Z","dependency_job_id":"95a8af1c-b42a-4168-bba5-8940a85bf37e","html_url":"https://github.com/microsoft/molecule-generation","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fmolecule-generation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fmolecule-generation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fmolecule-generation/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fmolecule-generation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/molecule-generation/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247242678,"owners_count":20907134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","generative-model","graph-neural-networks","molecule-generation"],"created_at":"2024-08-02T19:00:54.027Z","updated_at":"2025-04-04T20:08:51.887Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":["Generative Models"],"sub_categories":[],"readme":"# MoLeR: A Model for Molecule Generation\n\n[![CI](https://github.com/microsoft/molecule-generation/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/microsoft/molecule-generation/actions/workflows/ci.yml)\n[![license](https://img.shields.io/github/license/microsoft/molecule-generation.svg)](https://github.com/microsoft/molecule-generation/blob/main/LICENSE)\n[![pypi](https://img.shields.io/pypi/v/molecule-generation.svg)](https://pypi.org/project/molecule-generation/)\n[![python](https://img.shields.io/pypi/pyversions/molecule_generation)](https://www.python.org/downloads/)\n[![code style](https://img.shields.io/badge/code%20style-black-202020.svg)](https://github.com/ambv/black)\n\nThis repository contains training and inference code for the MoLeR model introduced in [Learning to Extend Molecular Scaffolds with 
Structural Motifs](https://arxiv.org/abs/2103.03864). We also include our implementation of CGVAE, but without integration with the high-level model interface.\n\n## Quick start\n\n`molecule_generation` can be installed via `pip`, but it additionally depends on `rdkit` and (if one wants to use a GPU) on setting up CUDA libraries. One can get both through `conda`:\n\n```bash\nconda env create -f environment.yml\nconda activate moler-env\n```\n\nOur package was tested with `python\u003e=3.7`, `tensorflow\u003e=2.1.0` and `rdkit\u003e=2020.09.1`; see the `environment*.yml` files for the exact configurations tested in CI.\n\nTo then install the latest release of `molecule_generation`, run\n```bash\npip install molecule-generation\n```\n\nAlternatively, `pip install -e .` within the root folder installs the latest state of the code, including changes that were merged into `main` but not yet released.\n\nA MoLeR checkpoint trained using the default hyperparameters is available [here](https://figshare.com/ndownloader/files/34642724). This file needs to be saved in a fresh folder `MODEL_DIR` (e.g., `/tmp/MoLeR_checkpoint`) and be renamed to have the `.pkl` ending (e.g., to `GNN_Edge_MLP_MoLeR__2022-02-24_07-16-23_best.pkl`). Then you can sample 10 molecules by running\n\n```bash\nmolecule_generation sample MODEL_DIR 10\n```\n\nSee below for how to train your own model and run more advanced inference.\n\n### Troubleshooting\n\n\u003e Q: Installing `tensorflow` on my system does not work, or it works but GPU is not being used.\n\u003e\n\u003e A: Please refer to [the tensorflow website](https://www.tensorflow.org/install) for guidelines. 
In particular, with recent versions of `tensorflow` one may get a \"libdevice not found\" error; in that case please follow the instructions at the bottom of [this page](https://www.tensorflow.org/install/pip#step-by-step_instructions).\n\n\u003e Q: My particular combination of dependency versions does not work.\n\u003e\n\u003e A: Please submit an issue and default to using one of the pinned configurations from `environment-py*.yml` in the meantime.\n\n\u003e Q: I am in China and so the figshare checkpoint link does not work for me.\n\u003e\n\u003e A: You can try [this link](https://pan.baidu.com/s/1lkiWK9-d5MvNyzqRrusGXA?pwd=4hij) instead.\n\n## Workflow\n\nWorking with MoLeR can be roughly divided into four stages:\n- *data preprocessing*, where a plain text list of SMILES strings is turned into `*.pkl` files containing descriptions of the molecular graphs and generation traces;\n- *training*, where MoLeR is trained on the preprocessed data until convergence;\n- *inference*, where one loads the model and performs batched encoding, decoding or sampling; and (optionally)\n- *fine-tuning*, where a previously trained model is fine-tuned on new data.\n\nAdditionally, you can visualise the decoding traces and internal action probabilities of the model, which can be useful for debugging.\n\n### Data Preprocessing\n\nTo run preprocessing, your data has to follow a simple GuacaMol format (files `train.smiles`, `valid.smiles` and `test.smiles`, each containing SMILES strings, one per line). Then, you can preprocess the data by running\n\n```\nmolecule_generation preprocess INPUT_DIR OUTPUT_DIR TRACE_DIR\n```\n\nwhere `INPUT_DIR` is the directory containing the three `*.smiles` files, `OUTPUT_DIR` is used for intermediate results, and `TRACE_DIR` for final preprocessed files containing the generation traces. 
Additionally, the `preprocess` command accepts command-line arguments to override various preprocessing hyperparameters (notably, the size of the motif vocabulary).\nThis step roughly corresponds to applying Algorithm 2 from our paper to each molecule in the input data.\n\nAfter running the above, you should see an output similar to\n\n```\n2022-03-10 11:22:15,927 preprocess.py:239 INFO 1273104 train datapoints, 79568 validation datapoints, 238706 test datapoints loaded, beginning featurization.\n2022-03-10 11:22:15,927 preprocess.py:245 INFO Featurising data...\n2022-03-10 11:22:15,927 molecule_dataset_utils.py:261 INFO Turning smiles into mol\n2022-03-10 11:22:15,927 molecule_dataset_utils.py:79 INFO Initialising feature extractors and motif vocabulary.\n2022-03-10 11:44:17,864 motif_utils.py:158 INFO Motifs in total: 99751\n2022-03-10 11:44:25,755 motif_utils.py:182 INFO Removing motifs with less than 3 atoms\n2022-03-10 11:44:25,755 motif_utils.py:183 INFO Motifs remaining: 99653\n2022-03-10 11:44:25,764 motif_utils.py:190 INFO Truncating the list of motifs to 128 most common\n2022-03-10 11:44:25,764 motif_utils.py:192 INFO Motifs remaining: 128\n2022-03-10 11:44:25,764 motif_utils.py:199 INFO Finished creating the motif vocabulary\n2022-03-10 11:44:25,764 motif_utils.py:200 INFO | Number of motifs: 128\n2022-03-10 11:44:25,764 motif_utils.py:203 INFO | Min frequency: 3602\n2022-03-10 11:44:25,764 motif_utils.py:204 INFO | Max frequency: 1338327\n2022-03-10 11:44:25,764 motif_utils.py:205 INFO | Min num atoms: 3\n2022-03-10 11:44:25,764 motif_utils.py:206 INFO | Max num atoms: 10\n2022-03-10 11:44:25,862 preprocess.py:255 INFO Completed initializing feature extractors; featurising and saving data now.\n Wrote 1273104 datapoints to /guacamol/output/train.jsonl.gz.\n Wrote 79568 datapoints to /guacamol/output/valid.jsonl.gz.\n Wrote 238706 datapoints to /guacamol/output/test.jsonl.gz.\n Wrote metadata to /guacamol/output/metadata.pkl.gz.\n(...proceeds to compute 
generation traces...)\n```\n\nAfter the preprocessed graphs are saved into `OUTPUT_DIR`, they will be turned into concrete generation traces, which is typically the most compute-intensive part of preprocessing. During that part, the preprocessing code may print errors, noting molecules that could not have been parsed or failed other assertions; MoLeR's preprocessing is robust to such cases, and will simply skip any problematic samples.\n\n### Training\n\nHaving stored some preprocessed data under `TRACE_DIR`, MoLeR can be trained by running\n\n```\nmolecule_generation train MoLeR TRACE_DIR\n```\n\n\nThe `train` command accepts many command-line arguments to override training and architectural hyperparameters, most of which are accessed through passing `--model-params-override`. For example, the following trains a MoLeR model using `GGNN`-style message passing (instead of the default `GNN_Edge_MLP`) and using fewer layers in both the encoder and the decoder GNNs:\n\n```\nmolecule_generation train MoLeR TRACE_DIR \\\n    --model GGNN \\\n    --model-params-override '{\"gnn_num_layers\": 6, \"decoder_gnn_num_layers\": 6}'\n```\n\nAs [tf2-gnn](https://github.com/microsoft/tf2-gnn) is highly flexible, MoLeR supports a vast space of architectural configurations.\n\nAfter running `molecule_generation train`, you should see an output similar to\n\n```\n(...tensorflow messages, hyperparameter dump...)\nInitial valid metric:\nAvg weighted sum. of graph losses:  122.1728\nAvg weighted sum. of prop losses:   0.4712\nAvg node class. loss:                 35.9361\nAvg first node class. 
loss:           27.4681\nAvg edge selection loss:              1.7522\nAvg edge type loss:                   3.8963\nAvg attachment point selection loss:  1.1227\nAvg KL divergence:                    7335960.5000\nProperty results: sa_score: MAE 11.23, MSE 1416.26 (norm MAE: 13.89) | clogp: MAE 10.87, MSE 4620.69 (norm MAE: 5.98) | mol_weight: MAE 407.42, MSE 185524.38 (norm MAE: 3.70).\n   (Stored model metadata and weights to trained_model/GNN_Edge_MLP_MoLeR__2022-03-01_18-15-14_best.pkl).\n(...training proceeds...)\n```\n\nBy default, training proceeds until there is no improvement in validation loss for 3 consecutive mini-epochs, where a mini-epoch is defined as 5000 training steps; this can be controlled through the `--patience` flag and the `num_train_steps_between_valid` model parameter, respectively.\n\n### Inference\n\nAfter a model has been trained and saved under `MODEL_DIR`, we provide two ways to load it: from CLI or directly from Python.\nCurrently, CLI-based loading does not expose all useful functionalities, and is mostly meant for simple tests.\n\nTo sample molecules from the model using the CLI, simply run\n\n```\nmolecule_generation sample MODEL_DIR NUM_SAMPLES\n```\n\nand, similarly, to encode a list of SMILES stored under `SMILES_PATH` into latent vectors, and store them under `OUTPUT_PATH`\n\n```\nmolecule_generation encode MODEL_DIR SMILES_PATH OUTPUT_PATH\n```\n\nIn all cases `MODEL_DIR` denotes the directory containing the model checkpoint, not the path to the checkpoint itself.\nThe model loader will only look at `*.pkl` files under `MODEL_DIR`, and expect there is _exactly one_ such file, corresponding to the trained checkpoint.\n\nYou can load a model directly from Python via\n\n```python\nfrom molecule_generation import load_model_from_directory\n\nmodel_dir = \"./example_model_directory\"\nexample_smiles = [\"c1ccccc1\", \"CNC=O\"]\n\nwith load_model_from_directory(model_dir) as model:\n    embeddings = model.encode(example_smiles)\n  
  print(f\"Embedding shape: {embeddings[0].shape}\")\n\n    # Decode without a scaffold constraint.\n    decoded = model.decode(embeddings)\n\n    # The i-th scaffold will be used when decoding the i-th latent vector.\n    decoded_scaffolds = model.decode(embeddings, scaffolds=[\"CN\", \"CCC\"])\n\n    print(f\"Encoded: {example_smiles}\")\n    print(f\"Decoded: {decoded}\")\n    print(f\"Decoded with scaffolds: {decoded_scaffolds}\")\n```\n\nwhich should yield an output similar to\n\n```\nEmbedding shape: (512,)\nEncoded: ['c1ccccc1', 'CNC=O']\nDecoded: ['C1=CC=CC=C1', 'CNC=O']\nDecoded with scaffolds: ['C1=CC=C(CNC2=CC=CC=C2)C=C1', 'CNC(=O)C(C)C']\n```\n\nAs shown above, MoLeR is loaded through a context manager.\nBehind the scenes, the following things happen:\n- First, an appropriate wrapper class is chosen: if the provided directory contains a `MoLeRVae` checkpoint, the returned wrapper will support `encode`, `decode` and `sample`, while `MoLeRGenerator` will only support `sample`.\n- Next, parallel workers are spawned, which await queries for encoding/decoding; these processes continue to live as long as the context is active.\nThe degree of parallelism can be configured using a `num_workers` argument.\n\n### Fine-tuning\n\nFine-tuning proceeds similarly to training from scratch, with a few adjustments.\nFirst, data intended for fine-tuning has to be preprocessed accordingly, by running\n\n```\nmolecule_generation preprocess INPUT_DIR OUTPUT_DIR TRACE_DIR \\\n    --pretrained-model-path CHECKPOINT_PATH\n```\n\nwhere `CHECKPOINT_PATH` points to the file (not directory) corresponding to the model that will later be fine-tuned.\n\nThe `--pretrained-model-path` argument is necessary, as otherwise preprocessing would infer various metadata (e.g. 
set of atom/motif types) solely from the provided set of SMILES, whereas for fine-tuning this has to be aligned with the metadata that the model was originally trained with.\n\nAfter preprocessing, fine-tuning is run as\n```\nmolecule_generation train MoLeR TRACE_DIR \\\n    --load-saved-model CHECKPOINT_PATH \\\n    --load-weights-only\n```\n\nWhen fine-tuning on a small dataset, it may not be desirable to update the model until convergence.\nTraining duration can be capped by passing `--model-params-override '{\"num_train_steps_between_valid\": 100}'` (to shorten the mini-epochs) and `--max-epochs` (to limit the number of mini-epochs).\n\n### Visualisation\n\nWe support two subtly different modes of visualisation: decoding a given latent vector, and decoding a latent vector created by encoding a given SMILES string. In the former case, the decoder runs as normal during inference; in the latter case we know the ground-truth input, so we teacher-force the correct decoding decisions.\n\nTo enter the visualiser, run either\n\n```\nmolecule_generation visualise cli MODEL_DIR SMILES_OR_SAMPLES_PATH\n```\n\nto get the result printed as plain text in the CLI, or\n\n```\nmolecule_generation visualise html MODEL_DIR SMILES_OR_SAMPLES_PATH OUTPUT_DIR\n```\n\nto get the result saved under `OUTPUT_DIR` as a static HTML webpage.\n\n## Code Structure\n\nAll of our models are implemented in [Tensorflow 2](https://www.tensorflow.org/), and are meant to be easy to extend and build upon. We use [tf2-gnn](https://github.com/microsoft/tf2-gnn) for the core Graph Neural Network components.\n\nThe MoLeR model itself is implemented as a `MoLeRVae` class, inheriting from `GraphTaskModel` in `tf2-gnn`; that base class encapsulates the encoder GNN. 
The decoder GNN is instantiated as an external `MoLeRDecoder` layer; it also includes batched inference code, which forces the maximum likelihood choice at every step.\n\n## Authors\n\n* [Krzysztof Maziarz](mailto:krzysztof.maziarz@microsoft.com)\n* [Henry Jackson-Flux](mailto:hrjackson@gmail.com)\n* [Marc Brockschmidt](mailto:mabrocks@microsoft.com)\n* [Pashmina Cameron](mailto:Pashmina.Cameron@microsoft.com)\n* [Sarah Lewis](mailto:sarahlewis@microsoft.com)\n* [Marwin Segler](mailto:marwinsegler@microsoft.com)\n* [Megan Stanley](mailto:meganstanley@microsoft.com)\n* [Paweł Czyż](mailto:pawelpiotr.czyz@ai.ethz.ch)\n* [Ashok Thillaisundaram](mailto:ashok@cantab.net)\n\n_Note: as git history was truncated at the point of open-sourcing, GitHub's statistics do not reflect the degree of contribution from some of the authors. All listed above had an impact on the code, and are (approximately) ordered by decreasing contribution._\n\nThe code is maintained by the [Generative Chemistry](https://www.microsoft.com/en-us/research/project/generative-chemistry/)\ngroup at Microsoft Research, Cambridge, UK.\nWe are [hiring](https://www.microsoft.com/en-us/research/project/generative-chemistry/opportunities/).\n\nMoLeR was created as part of our collaboration with Novartis Research. In particular, its design was guided by [Nadine Schneider](mailto:nadine-1.schneider@novartis.com), [Finton Sirockin](mailto:finton.sirockin@novartis.com), [Nikolaus Stiefl](mailto:nikolaus.stiefl@novartis.com), as well as others from Novartis.\n\n## Contributing\n\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. 
For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n### Style Guide\n\n- For code style, use [black](https://pypi.org/project/black/) and [flake8](https://pypi.org/project/flake8/).\n- For commit messages, use imperative style and follow the [semantic commit messages](https://gist.github.com/joshbuchea/6f47e86d2510bce28f8e7f42ae84c716) template; e.g.\n    \u003e feat(moler_decoder): Improve masking of invalid actions\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft\ntrademarks or logos is subject to and must follow\n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos are subject to those third-party's policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fmolecule-generation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fmolecule-generation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fmolecule-generation/lists"}