{"id":13751994,"url":"https://github.com/MolecularAI/Chemformer","last_synced_at":"2025-05-09T18:32:59.419Z","repository":{"id":41807052,"uuid":"388016777","full_name":"MolecularAI/Chemformer","owner":"MolecularAI","description":null,"archived":false,"fork":false,"pushed_at":"2024-04-15T09:09:36.000Z","size":159,"stargazers_count":185,"open_issues_count":3,"forks_count":34,"subscribers_count":7,"default_branch":"main","last_synced_at":"2024-05-22T08:34:58.512Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MolecularAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-21T06:27:33.000Z","updated_at":"2024-05-29T13:37:17.297Z","dependencies_parsed_at":"2024-05-29T13:49:40.084Z","dependency_job_id":null,"html_url":"https://github.com/MolecularAI/Chemformer","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MolecularAI%2FChemformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MolecularAI%2FChemformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MolecularAI%2FChemformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MolecularAI%2FChemformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MolecularAI","download_url":"https://codeload.github.com/MolecularAI/Chemformer/tar.gz/refs/heads/main
","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253303194,"owners_count":21886904,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:00:57.955Z","updated_at":"2025-05-09T18:32:57.177Z","avatar_url":"https://github.com/MolecularAI.png","language":"Python","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"# Chemformer\nThis repository contains the code used to generate the results in the Chemformer papers [[1]](#1) [[2]](#2) [[3]](#3).\n\nThe Chemformer project aimed to pre-train a BART transformer language model [[4]](#4) on molecular SMILES strings [[5]](#5) by optimising a de-noising objective. We hypothesized that pre-training would lead to improved generalisation, performance, training speed and validity on downstream fine-tuned tasks. \nThe pre-trained model was tested on downstream tasks such as reaction prediction, retrosynthetic prediction, molecular optimisation and molecular property prediction in our original manuscript [[1]](#1). Our synthesis-prediction (seq2seq) Chemformer was evaluated for the purpose of single- and multi-step retrosynthesis [[2]](#2), and used for disconnection-aware retrosynthesis [[3]](#3).\n\nThe public models and datasets available [here](https://az.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq). 
To run these models with the new version, you first need to update the checkpoint, e.g.:\n```\nimport torch\n\nmodel = torch.load(\"model.ckpt\")\nmodel[\"hyper_parameters\"][\"vocabulary_size\"] = model[\"hyper_parameters\"].pop(\"vocab_size\")\ntorch.save(model, \"model_v2.ckpt\")\n```\n\n\n## Prerequisites\nBefore you begin, ensure you have met the following requirements:\n\n* Linux, Windows or macOS platforms are supported - as long as the dependencies are supported on these platforms.\n\n* You have installed [anaconda](https://www.anaconda.com/) or [miniconda](https://docs.conda.io/en/latest/miniconda.html) with Python 3.7\n\n## Installation\n\nFirst clone the repository using Git.\n\nThe project dependencies can be installed by executing the following commands in the root of \nthe repository:\n\n    conda env create -f env-dev.yml\n    conda activate chemformer\n    poetry install\n\nIf there is an error \"ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found\"\nit can be mitigated by adding the 'lib' directory from the Conda environment to LD_LIBRARY_PATH.\n\nFor example:\n`export LD_LIBRARY_PATH=/path/to/your/conda/envs/chemformer/lib`\n\nFor developers: Run the following to enable editable mode\n```\n    pip install -e .\n```\n\n## User guide\nThe following is an example of how to fine-tune Chemformer using the pre-trained models and datasets available [here](https://az.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq).\n\n1. Create a Chemformer conda environment, as above.\n1. Download the dataset of interest and store it locally (e.g. ../data/uspto_50.pickle).\n1. Download a pre-trained Chemformer model and store it locally (e.g. ../models/pre-trained/combined.ckpt).\n1. Update the `fine_tune.sh` shell script in the example_scripts directory (or create your own) with the paths to your model and dataset, as well as the values of hyperparameters you wish to pass to the script.\n1. 
Run the `fine_tune.sh` script.\n\nYou can of course run other scripts in the repository following a similar approach. The Scripts section below provides more details on what each script does.\n\n### Scripts\nThe `molbart` package includes the following scripts:\n* `molbart/pretrain.py` runs the pre-training \n* `molbart/fine_tune.py` runs fine-tuning on a specified task\n* `molbart/inference_score.py` predicts SMILES and evaluates the performance of a fine-tuned model\n* `molbart/predict.py` predicts products given input reactants\n* `molbart/build_tokenizer.py` creates a tokenizer from a dataset and stores it in a pickle file\n* `molbart/retrosynthesis/round_trip_inference.py` runs round-trip inference and scoring using the predicted SMILES from `molbart/inference_score.py`\n\nThe scripts use Hydra for reading parameters from config files. To run a script from `your/project/folder`, first create an experiment folder: `your/project/folder/experiment/`. In that folder, add a config file with the parameters for which you wish to override the defaults:\n`your/project/folder/experiment/project_config.yaml`.\n\nExample of project config yaml:\n```\n# @package _global_\n\nseed: 2\ndataset_part: test # Which dataset split to run inference on. [\"full\", \"train\", \"val\", \"test\"]\nn_beams: 5\nbatch_size: 64\n```\n\nThe script can then be run with\n```\npython -m molbart.\u003cscript_name\u003e 'hydra.searchpath=[file:///your/project/folder]' experiment=inference_score.yaml\n```\n\nSpecific parameters can also be overridden on the command line:\n```\npython -m molbart.\u003cscript_name\u003e param1=new_value1 param2.subparam=new_value2 \n```\nSee the default configuration files of each script under molbart/config/ for more details on each argument.\n\n### Notes on running retrosynthesis predictions and round-trip validation \nExample of running inference and calculating (1) top-N accuracy (stored in `metrics.csv`) and (2) round-trip accuracy (stored in `round_trip_metrics.csv`):\n1. 
Run backward inference\n```python -m molbart.inference_score data_path=data.csv output_score_data=metrics.csv output_sampled_smiles=sampled_smiles.json dataset_type=synthesis \u003cadditional_args\u003e```\n1. Run round-trip inference \n```python -m molbart.retrosynthesis.round_trip_inference input_data=data.csv backward_predictions=sampled_smiles.json output_score_data=round_trip_metrics.csv output_sampled_smiles=round_trip_sampled_smiles.json \u003cadditional_args\u003e```\n\nThe default datamodule is now the SynthesisDataModule (this can be changed in the config using the \"datamodule\" argument - see example_scripts). The input file given by `data_path` is assumed to be a tab-separated .csv file containing the columns `products` (SMILES), `reactants` (SMILES) and `set` (labels of each sample according to which dataset split it belongs to, i.e. \"train\", \"val\" or \"test\").\n\nSee the default configuration corresponding to each script in molbart/config/ for more details on each argument.\n\n## Specifying available and custom callbacks\nThere are default callbacks used when fine-tuning or training, as well as for inference and round-trip evaluations. You can also specify which callbacks to use in your config file. Callbacks in molbart.utils.callbacks can now be added to the config file as follows:\n```\ncallbacks:\n  - LearningRateMonitor\n  - ModelCheckpoint: # Select which parameter values should override the defaults\n    - period: 1\n    - monitor: val_loss\n  - ValidationScoreCallback\n  - OptLRMonitor\n  - StepCheckpoint\n```\nYou can also add your own custom callback with a relative import (CustomCallback from my_package/callbacks.py):\n```\ncallbacks:\n  - my_package.callbacks.CustomCallback\n```\n\n\n## Specifying available and custom scores\nThere are default scores which are used in all scripts (including in `molbart.inference_score`, `molbart.retrosynthesis.round_trip_inference`, `molbart.fine_tune`). 
You can also specify which scores to calculate in your config file. Scores in molbart.utils.scores can now be added to the config file as follows:\n```\nscorers:\n  - FractionInvalidScore\n  - FractionUniqueScore\n  - TanimotoSimilarityScore:\n    - statistics: mean\n  - TopKAccuracyScore\n```\nYou can also add your own custom scores with a relative import (CustomScore from my_package/scores.py):\n```\nscorers:\n  - my_package.scores.CustomScore\n```\nThe default is to use the internal callback ScoreCallback, which collects the computed scores listed under `scorers:` and writes them to the specified output files (`output_score_data` and `output_sampled_smiles`).\n\n## Specifying a custom datamodule\nSimilar to scorers and callbacks, the datamodule can also be specified dynamically in the config file. A custom datamodule (e.g. located at my_package/datamodules.py) can be used with:\n```\ndatamodule:\n  - my_package.datamodules.CustomDataModule:\n    - datamodule-specific-arg1\n    - datamodule-specific-arg2\n```\nSee molbart/data/datamodules.py for inspiration on how to construct the new datamodule.\n\n## Running with FastAPI service\n### Baseline Chemformer forward or backward synthesis prediction\nChemformer predictions and log-likelihood calculations can be executed with FastAPI.\n\nInstall the FastAPI libraries:\n```\n    python -m pip install fastapi\n    python -m pip install \"uvicorn[standard]\"\n```\nThen start the service:\n```\n    cd service\n    export CHEMFORMER_MODEL={PATH TO MODEL}\n    export CHEMFORMER_VOCAB={PATH TO VOCABULARY FILE}\n    export CHEMFORMER_TASK=backward_prediction\n    python chemformer_service.py\n```\nThe model URL can, for example, be used to run multi-step retrosynthesis with [AiZynthFinder](https://github.com/MolecularAI/aizynthfinder).\n\n### Disconnection-aware retrosynthesis prediction\nTo run the disconnection-aware Chemformer, use the following (RXN-mapper should be installed in the environment - see 
https://github.com/rxn4chemistry/rxnmapper):\n```\n    cd service\n    export CHEMFORMER_DISCONNECTION_MODEL={PATH TO DISCONNECTION CHEMFORMER MODEL}\n    export CHEMFORMER_VOCAB={PATH TO VOCABULARY FILE} # The vocabulary should include a \"!\" token\n    export CHEMFORMER_TASK=backward_prediction\n    export RXNUTILS_ENV_PATH={PATH TO rxnutils CONDA ENV} # See https://github.com/MolecularAI/reaction_utils on how to create an environment\n    python chemformer_disconnect_service.py\n```\n\n### Workflow for fine-tuning and running disconnection-aware Chemformer in AiZynthFinder\nExample workflow for running multi-step retrosynthesis with a disconnection-aware Chemformer [[3]](#3). First, create the training dataset (tag disconnection sites with [AiZynthTrain](https://github.com/MolecularAI/aizynthtrain)):\n```\npython -m aizynthtrain.pipelines.disconnection_chemformer_data_prep_pipeline run --config tag_products_config.yml --max-workers 25 --max-num-splits 100 \n```\nwhere `tag_products_config.yml` specifies the input file `uspto_50k.csv` and the output files in the format:\n```\nchemformer_data_prep:\n  chemformer_data_path: uspto_50k.csv\n  disconnection_aware_data_path: uspto_50k_disconnection.csv\n  autotag_data_path: uspto_50k_autotag.csv\n```\n1. Fine-tune Chemformer on `uspto_50k_disconnection.csv`.\n1. Run backward and round-trip inference.\n1. Start FastAPI service for disconnection-aware Chemformer.\n1. Run multi-step retrosynthesis search with [AiZynthFinder](https://github.com/MolecularAI/aizynthfinder) using the `expansion_strategies.DisconnectionAwareExpansionStrategy`. We refer the user to https://github.com/MolecularAI/aizynthfinder/tree/master/plugins for information on how to do this.\n\n## Code structure\n\nThe codebase is broadly split into the following parts:\n* Models\n* Data\n* Utils, including data helpers, scorers, callbacks, samplers, etc.\n* Scripts for running e.g. 
fine-tuning, prediction, etc.\n\n\n### Models\n\nThe `models/transformer_models.py` file contains a PyTorch Lightning implementation of the BART language model, as well as PyTorch Lightning implementations of models for downstream tasks.\n`models/chemformer.py` contains the synthesis prediction Chemformer model used for both forward and backward (seq2seq) predictions.\n\n### Data\n\nThe `data` folder contains DataModules for different tasks and datasets.\nThe classes which inherit from `_AbsDataset` are subclasses of PyTorch's `torch.utils.data.Dataset` and are simply used to store and split data (molecules, reactions, etc) into its relevant subset (train, val, test).\nOur `_AbsDataModule` class inherits from PyTorch Lightning's `LightningDataModule` class, and its subclasses are used to augment, tokenize and tensorize the data before it is passed to the model.\n\nFinally, we include a `TokenSampler` class which categorises sequences into buckets based on their length, and is able to sample a different batch size of sequences from each bucket. This helps to ensure that the model sees approximately the same number of tokens on each batch, as well as dramatically improving training speed.\n\n### Utils\n#### Tokenization\n\nThe `utils/tokenizers` module includes the `MolEncTokeniser` class, which is capable of random 'BERT-style' masking of tokens, as well as padding each batch of sequences to be the same length. The `ChemformerTokenizer`, which is used in the synthesis Chemformer, makes use of the `SMILESTokenizer` from the `pysmilesutils` library for tokenising SMILES into their constituent atoms.\n\n\n#### Decoding / sampling\n\nWe include implementations of greedy and beam search, as well as a GPU-optimized beam search decoding (BeamSearchSampler) in the `utils/samplers/beam_search_samplers.py` file. All implementations make use of batch decoding for improved evaluation speeds. 
They do not, however, cache results from previous decodes; rather, they simply pass the entire sequence of tokens produced so far through the transformer decoder. The BeamSearchSampler is used by the synthesis Chemformer model in molbart.models.chemformer.\n\n\n## Contributing\n\nWe welcome contributions, in the form of issues or pull requests.\n\nIf you have a question or want to report a bug, please submit an issue.\n\n\nTo contribute with code to the project, follow these steps:\n\n1. Fork this repository.\n2. Create a branch: `git checkout -b \u003cbranch_name\u003e`.\n3. Make your changes and commit them: `git commit -m '\u003ccommit_message\u003e'`\n4. Push to the remote branch: `git push`\n5. Create the pull request.\n\nPlease use the ``black`` package for formatting.\n\n\nThe contributors have limited time for support questions, but please do not hesitate to submit an issue.\n\n## License\nThe software is licensed under the MIT license (see LICENSE file), and is free and provided as-is.\n\n\n## Cite our work\n\nIf you find our work useful for your research, please cite our paper(s):\n\n\u003ca id=\"1\"\u003e[1]\u003c/a\u003e\nIrwin, R., Dimitriadis, S., He, J., Bjerrum, E.J., 2021. Chemformer: A Pre-Trained Transformer for Computational Chemistry. Mach. Learn. Sci. Technol. [https://doi.org/10.1088/2632-2153/ac3ffb](https://doi.org/10.1088/2632-2153/ac3ffb)\n\n\u003ca id=\"2\"\u003e[2]\u003c/a\u003e\nWesterlund, A.M., Manohar Koki, S., Kancharla, S., Tibo, A., Saigiridharan, L., Mercado, R., Genheden, S., 2023. \nDo Chemformers dream of organic matter? Evaluating a transformer model for multi-step retrosynthesis, J. Chem. Inf. Model.\n [https://pubs.acs.org/doi/10.1021/acs.jcim.3c01685](https://pubs.acs.org/doi/10.1021/acs.jcim.3c01685)\n\n\u003ca id=\"3\"\u003e[3]\u003c/a\u003e\nWesterlund, A.M., Saigiridharan, L., Genheden, S., 2024. 
\nConstrained synthesis planning with disconnection-aware transformer and multi-objective search, ChemRxiv\n [10.26434/chemrxiv-2024-c77p4](https://chemrxiv.org/engage/chemrxiv/article-details/664ee4c291aefa6ce1c4fc8d)\n\n## References\n\n\u003ca id=\"4\"\u003e[4]\u003c/a\u003e\nLewis, Mike, et al.\n\"Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.\"\narXiv preprint arXiv:1910.13461 (2019).\n\n\u003ca id=\"5\"\u003e[5]\u003c/a\u003e\nWeininger, David.\n\"SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules.\"\nJournal of chemical information and computer sciences 28.1 (1988): 31-36.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMolecularAI%2FChemformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMolecularAI%2FChemformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMolecularAI%2FChemformer/lists"}