{"id":15018291,"url":"https://github.com/thomas0809/molscribe","last_synced_at":"2025-04-08T15:02:53.928Z","repository":{"id":102423996,"uuid":"564151960","full_name":"thomas0809/MolScribe","owner":"thomas0809","description":"Robust Molecular Structure Recognition with Image-to-Graph Generation","archived":false,"fork":false,"pushed_at":"2025-01-09T18:36:51.000Z","size":73268,"stargazers_count":195,"open_issues_count":16,"forks_count":41,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-08T15:01:43.529Z","etag":null,"topics":["chemistry","deep-learning","molecule"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thomas0809.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-10T05:07:55.000Z","updated_at":"2025-04-08T07:45:10.000Z","dependencies_parsed_at":"2024-07-23T23:01:35.237Z","dependency_job_id":"36219faa-6d15-43c9-bff7-2a65340c78e5","html_url":"https://github.com/thomas0809/MolScribe","commit_stats":{"total_commits":154,"total_committers":13,"mean_commits":"11.846153846153847","dds":"0.48051948051948057","last_synced_commit":"d0f7b3d7959c871e2e957952737a571492b8d98c"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thomas0809%2FMolScribe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thomas0809%2FMolScribe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thomas0809%2FMolScribe/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thomas0809%2FMolScribe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thomas0809","download_url":"https://codeload.github.com/thomas0809/MolScribe/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247867304,"owners_count":21009240,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chemistry","deep-learning","molecule"],"created_at":"2024-09-24T19:51:47.536Z","updated_at":"2025-04-08T15:02:53.843Z","avatar_url":"https://github.com/thomas0809.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MolScribe\n\nThis is the repository for MolScribe, an image-to-graph model that translates a molecular image to its chemical\nstructure. Try our [demo](https://huggingface.co/spaces/yujieq/MolScribe) on HuggingFace!\n\n![MolScribe](assets/model.png)\n\nIf you use MolScribe in your research, please cite our [paper](https://pubs.acs.org/doi/10.1021/acs.jcim.2c01480).\n```\n@article{\n    MolScribe,\n    title = {{MolScribe}: Robust Molecular Structure Recognition with Image-to-Graph Generation},\n    author = {Yujie Qian and Jiang Guo and Zhengkai Tu and Zhening Li and Connor W. 
Coley and Regina Barzilay},\n    journal = {Journal of Chemical Information and Modeling},\n    publisher = {American Chemical Society ({ACS})},\n    doi = {10.1021/acs.jcim.2c01480},\n    year = 2023,\n}\n```\n\nPlease check out our subsequent works on parsing chemical diagrams:\n- [RxnScribe](https://github.com/thomas0809/RxnScribe) (reaction diagram parsing):\n[paper](https://pubs.acs.org/doi/10.1021/acs.jcim.3c00439),\n[code](https://github.com/thomas0809/RxnScribe), [demo](https://huggingface.co/spaces/yujieq/RxnScribe)\n- [OpenChemIE](https://github.com/CrystalEye42/OpenChemIE) (information extraction toolkit for chemistry literature): [paper](https://pubs.acs.org/doi/10.1021/acs.jcim.4c00572), [code](https://github.com/CrystalEye42/OpenChemIE), [demo](https://mit.openchemie.info)\n\n## Quick Start\n\n### Installation\nOption 1: Install MolScribe with pip\n```\npip install MolScribe\n```\n\nOption 2: Run the following command to install the package and its dependencies\n```\ngit clone git@github.com:thomas0809/MolScribe.git\ncd MolScribe\npython setup.py install\n```\n\n### Example\nDownload the MolScribe checkpoint from [HuggingFace Hub](https://huggingface.co/yujieq/MolScribe/tree/main) \nand predict molecular structures:\n```python\nimport torch\nfrom molscribe import MolScribe\nfrom huggingface_hub import hf_hub_download\n\nckpt_path = hf_hub_download('yujieq/MolScribe', 'swin_base_char_aux_1m.pth')\n\nmodel = MolScribe(ckpt_path, device=torch.device('cpu'))\noutput = model.predict_image_file('assets/example.png', return_atoms_bonds=True, return_confidence=True)\n```\n\nThe output is a dictionary, with the following format\n```\n{\n    'smiles': 'Fc1ccc(-c2cc(-c3ccccc3)n(-c3ccccc3)c2)cc1',\n    'molfile': '***', \n    'confidence': 0.9175,\n    'atoms': [{'atom_symbol': '[Ph]', 'x': 0.5714, 'y': 0.9523, 'confidence': 0.9127}, ... ],\n    'bonds': [{'bond_type': 'single', 'endpoint_atoms': [0, 1], 'confidence': 0.9999}, ... 
]\n}\n```\n\nPlease refer to [`molscribe/interface.py`](molscribe/interface.py) and [`notebook/predict.ipynb`](notebook/predict.ipynb) \nfor details and other available APIs.\n\nFor development or reproducing the experiments, please follow the instructions below.\n\n## Experiments\n\n### Requirements\nInstall the required packages\n```\npip install -r requirements.txt\n```\n\n### Data\nFor training or evaluation, please download the corresponding datasets to `data/`.\n\nTraining data:\n\n| Datasets                                                                            | Description                                                                                                                                   |\n|-------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|\n| USPTO \u003cbr\u003e [Download](https://huggingface.co/yujieq/MolScribe/blob/main/uspto_mol.zip) | Downloaded from [USPTO, Grant Red Book](https://bulkdata.uspto.gov/).                                                                         |\n| PubChem \u003cbr\u003e [Download](https://huggingface.co/yujieq/MolScribe/blob/main/pubchem.zip) | Molecules are downloaded from [PubChem](https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/), and images are dynamically rendered during training. 
|\n\nBenchmarks:\n\n| Category                                                                                   | Datasets                                      | Description                                                                                                                                                                                                                                |\n|--------------------------------------------------------------------------------------------|-----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| Synthetic \u003cbr\u003e [Download](https://huggingface.co/yujieq/MolScribe/blob/main/synthetic.zip) | Indigo \u003cbr\u003e ChemDraw                          | Images are rendered by Indigo and ChemDraw.                                                                                                                                                                                                |\n| Realistic \u003cbr\u003e [Download](https://huggingface.co/yujieq/MolScribe/blob/main/real.zip)      | CLEF \u003cbr\u003e UOB \u003cbr\u003e USPTO \u003cbr\u003e Staker \u003cbr\u003e ACS | CLEF, UOB, and USPTO are downloaded from https://github.com/Kohulan/OCSR_Review. \u003cbr\u003e Staker is downloaded from https://drive.google.com/drive/folders/16OjPwQ7bQ486VhdX4DWpfYzRsTGgJkSu. \u003cbr\u003e ACS is a new dataset collected by ourselves. 
|\n| Perturbed \u003cbr\u003e [Download](https://huggingface.co/yujieq/MolScribe/blob/main/perturb.zip)   | CLEF \u003cbr\u003e UOB \u003cbr\u003e USPTO \u003cbr\u003e Staker          | Downloaded from https://github.com/bayer-science-for-a-better-life/Img2Mol/                                                                                                                                                                |\n\n\n### Model\nOur model checkpoints can be downloaded from [Dropbox](https://www.dropbox.com/sh/91u508kf48cotv4/AACQden2waMXIqLwYSi8zO37a?dl=0) \nor [HuggingFace Hub](https://huggingface.co/yujieq/MolScribe/tree/main).\n\nModel architecture:\n- Encoder: [Swin Transformer](https://github.com/microsoft/Swin-Transformer), Swin-B.\n- Decoder: Transformer, 6 layers, hidden_size=256, attn_heads=8.\n- Input size: 384x384\n\nDownload the model checkpoint to reproduce our experiments:\n```\nmkdir -p ckpts\nwget -P ckpts https://huggingface.co/yujieq/MolScribe/resolve/main/swin_base_char_aux_1m680k.pth\n```\n\n### Prediction\n```\npython predict.py --model_path ckpts/swin_base_char_aux_1m680k.pth --image_path assets/example.png\n```\nThe MolScribe prediction interface is in [`molscribe/interface.py`](molscribe/interface.py).\nSee the Python script [`predict.py`](predict.py) or the Jupyter notebook [`notebook/predict.ipynb`](notebook/predict.ipynb)\nfor example usage.\n\n### Evaluate MolScribe\n```\nbash scripts/eval_uspto_joint_chartok_1m680k.sh\n```\nThe script uses one GPU and a batch size of 64 by default. If more GPUs are available, update `NUM_GPUS_PER_NODE` and \n`BATCH_SIZE` for faster evaluation.\n\n### Train MolScribe\n```\nbash scripts/train_uspto_joint_chartok_1m680k.sh\n```\nThe script uses four GPUs and a batch size of 256 by default. 
It takes about one day to train the model with four A100 GPUs.\nDuring training, we use a modified version of [Indigo](https://github.com/epam/Indigo) (included in `molscribe/indigo/`).\n\n\n### Evaluation Script\nWe provide a standalone evaluation script, [`evaluate.py`](evaluate.py). Example usage:\n```\npython evaluate.py \\\n    --gold_file data/real/acs.csv \\\n    --pred_file output/uspto/swin_base_char_aux_1m680k/prediction_acs.csv \\\n    --pred_field post_SMILES\n```\nThe predictions should be saved in a CSV file, with columns `image_id` for the index (must match the gold file),\nand `SMILES` for the predicted SMILES. If the prediction column has a different name, specify it with `--pred_field`.\n\nThe result contains three scores:\n- canon_smiles: our main metric, exact matching accuracy.\n- graph: graph exact matching accuracy, ignoring tetrahedral chirality.\n- chiral: exact matching accuracy on chiral molecules.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthomas0809%2Fmolscribe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthomas0809%2Fmolscribe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthomas0809%2Fmolscribe/lists"}