{"id":13752056,"url":"https://github.com/chao1224/MoleculeSTM","last_synced_at":"2025-05-09T18:33:13.897Z","repository":{"id":133803678,"uuid":"580927065","full_name":"chao1224/MoleculeSTM","owner":"chao1224","description":"Multi-modal Molecule Structure-text Model for Text-based Editing and Retrieval, Nat Mach Intell 2023 (https://www.nature.com/articles/s42256-023-00759-6)","archived":false,"fork":false,"pushed_at":"2025-01-06T03:08:14.000Z","size":41869,"stargazers_count":224,"open_issues_count":10,"forks_count":21,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-13T04:17:33.737Z","etag":null,"topics":["clip","computation-chemistry","drug-discovery","editing","foundation-model","molecule-editing","moleculeclip","moleculestm","pretraining","retrieval"],"latest_commit_sha":null,"homepage":"https://chao1224.github.io/MoleculeSTM","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chao1224.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-12-21T20:12:01.000Z","updated_at":"2025-04-06T16:15:11.000Z","dependencies_parsed_at":null,"dependency_job_id":"286908bf-6584-479d-9d4b-30300e4fafc3","html_url":"https://github.com/chao1224/MoleculeSTM","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chao1224%2FMoleculeSTM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chao1224%2FMoleculeSTM/tags","releases_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/chao1224%2FMoleculeSTM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chao1224%2FMoleculeSTM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chao1224","download_url":"https://codeload.github.com/chao1224/MoleculeSTM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253303238,"owners_count":21886912,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clip","computation-chemistry","drug-discovery","editing","foundation-model","molecule-editing","moleculeclip","moleculestm","pretraining","retrieval"],"created_at":"2024-08-03T09:00:58.759Z","updated_at":"2025-05-09T18:33:08.857Z","avatar_url":"https://github.com/chao1224.png","language":"Python","funding_links":[],"categories":["🧪 Scientific Pretraining, SFT, Reasoning, and Agent Datasets","Ranked by starred repositories"],"sub_categories":["🧬 Life Sciences"],"readme":"# MoleculeSTM: Multi-modal Molecule Structure-text Model for Text-based Editing and Retrieval\n\nAuthors: Shengchao Liu, Weili Nie, Chengpeng Wang, Jiarui Lu, Zhuoran Qiao, Ling Liu, Jian Tang\u003csup\u003e\\*\u003c/sup\u003e, Chaowei Xiao\u003csup\u003e\\*\u003c/sup\u003e, Anima Anandkumar\u003csup\u003e\\*\u003c/sup\u003e\n\n\u003csup\u003e\\*\u003c/sup\u003e jointly supervised\n\n[[Paper](https://www.nature.com/articles/s42256-023-00759-6)]\n[[Project Page](https://chao1224.github.io/MoleculeSTM)] [[ArXiv](https://arxiv.org/abs/2212.10789)]\n[[Datasets on Hugging 
Face](https://huggingface.co/datasets/chao1224/MoleculeSTM/tree/main)] [[Checkpoints on Hugging Face](https://huggingface.co/chao1224/MoleculeSTM/tree/main)]\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"pic/pipeline.png\" /\u003e \n\u003c/p\u003e\n\n\u003cp align=\"left\"\u003e\n  \u003cimg src=\"pic/final.gif\" width=\"100%\" /\u003e \n\u003c/p\u003e\n\n## 1 Environment\n\nFirst install conda:\n```\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh\n```\n\nThen create a virtual environment and install packages:\n```\nconda create -n MoleculeSTM python=3.7\nconda activate MoleculeSTM\n\nconda install -y -c rdkit rdkit=2020.09.1.0\nconda install -y -c conda-forge -c pytorch pytorch=1.9.1\nconda install -y -c pyg -c conda-forge pyg==2.0.3\n\npip install requests\npip install tqdm\npip install matplotlib\npip install spacy\npip install Levenshtein\n\n# for SciBERT\nconda install -y boto3\npip install transformers\n\n# for MoleculeNet\npip install ogb==1.2.0\n\n# install pysmilesutils\npython -m pip install git+https://github.com/MolecularAI/pysmilesutils.git\n\npip install deepspeed\n\n# install Megatron-LM\n# pip install megatron-lm==1.1.5\ngit clone https://github.com/MolecularAI/MolBART.git --branch megatron-molbart-with-zinc\ncd MolBART/megatron_molbart/Megatron-LM-v1.1.5-3D_parallelism\npip install .\ncd ../../..\n\n# install apex\n# wget https://github.com/NVIDIA/apex/archive/refs/tags/22.03.zip\n# unzip 22.03.zip\ngit clone https://github.com/chao1224/apex.git\ncd apex\npip install -v --disable-pip-version-check --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./\ncd ..\n```\n\nWe also provide a Docker setup in `Dockerfile`.\n\n## 2 Datasets and Preprocessing\n\nWe provide the raw dataset (after preprocessing) at [this Hugging Face link](https://huggingface.co/datasets/chao1224/MoleculeSTM). 
Or you can use the following Python script:\n```\nfrom huggingface_hub import HfApi, snapshot_download\napi = HfApi()\nsnapshot_download(repo_id=\"chao1224/MoleculeSTM\", repo_type=\"dataset\", local_dir='.')\n```\n\nThen move all the downloaded datasets under the `./data` folder.\n\n### 2.1 Pretraining Dataset: PubChemSTM\n\nUseful resources:\n- For molecular structure information (SMILES, 2D molecular graph, etc.), we can download it from PubChem in SDF format [here](https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/).\n- For textual data, we may first refer to this [PubChem RDF tutorial](https://ftp.ncbi.nlm.nih.gov/pubchem/presentations/pubchem_rdf_tutorial.pdf).\n- `The RDF data on the PubChem FTP site is arranged in such a way that you only need to download the type of information in which you are interested, thus allowing you to avoid downloading parts of PubChem data you will not use. For example, if you are just interested in computed chemical properties, you only need to download PubChemRDF data in the compound descriptor directory.` The link is [here](https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor/compound/).\n- Guidance on using the `RDF` and `REST` APIs can be found [here](https://ftp.ncbi.nlm.nih.gov/pubchem/presentations/pubchem_rdf_details.pdf).\n\nAs confirmed with the PubChem group, performing research on these data does not violate their license; however, PubChem does not hold the license for the textual data, which necessitates an extensive evaluation of the license for each structure-text pair in PubChemSTM. This task poses a substantial workload and has hindered the release of PubChemSTM. However, we have tried our best to upload the structure part of the PubChemSTM data on Hugging Face, and we also provide all the details to generate PubChemSTM as follows:\n1. Go to the `preprocessing/PubChemSTM` folder.\n2. `python step_01_description_extraction.py`. 
This step extracts and merges all the textual descriptions into a single JSON file. We ran this on May 30th, 2022. The APIs will keep updating, so you may have slightly different versions if you run this script yourself.\n3. `bash step_02.sh`. This will download all the SDF files, with SMILES, 2D graph, and computed molecular properties. This may take hours.\n4. `python step_03_filter_out_SDF.py`. This will filter all the molecules with textual descriptions and save them in the SDF file. This may take \u003c2 hours.\n5. `python step_04_merge_SDF.py`. This will gather all the molecules into a single SDF file.\n6. `python step_05_sample_extraction.py`. This will generate the `CID2SMILES.csv` file.\n\n### 2.2 Downstream Datasets\n\nWe have included them in [the Hugging Face link](https://huggingface.co/datasets/chao1224/MoleculeSTM). We briefly list the details below:\n\n- `DrugBank_data` for zero-shot structure-text retrieval\n- `ZINC250K_data` for space alignment (step 1 in editing)\n- `Editing_data` for zero-shot text-guided editing (step 2 in editing)\n    - `single_multi_property_SMILES.txt` for single-objective, multi-objective, binding-affinity-based, and drug relevance editing\n    - `neighbor2drug` for neighborhood searching for patent drug molecules\n    - `ChEMBL_data` for binding editing\n- `MoleculeNet_data` for molecular property prediction\n\n## 3 Checkpoints\n\n### 3.1 SciBERT\nThis can be done by simply calling the following for SciBERT:\n```\nfrom transformers import AutoModel, AutoTokenizer\n\nSciBERT_tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', cache_dir=pretrained_SciBERT_folder)\nSciBERT_model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased', cache_dir=pretrained_SciBERT_folder).to(device)\n```\n\n### 3.2 MegaMolBART\nRun `download_MegaMolBART.sh` (credit to [RetMol](https://github.com/NVlabs/RetMol/blob/main/download_scripts/download_models.sh)). 
The output structure is like:\n```\n├── bart_vocab.txt\n└── checkpoints\n    ├── iter_0134000\n    │   ├── mp_rank_00\n    │   │   └── model_optim_rng.pt\n    │   ├── mp_rank_00_model_states.pt\n    │   ├── zero_pp_rank_0_mp_rank_00optim_states.pt\n    │   ├── zero_pp_rank_1_mp_rank_00optim_states.pt\n    │   ├── zero_pp_rank_2_mp_rank_00optim_states.pt\n    │   ├── zero_pp_rank_3_mp_rank_00optim_states.pt\n    │   ├── zero_pp_rank_4_mp_rank_00optim_states.pt\n    │   ├── zero_pp_rank_5_mp_rank_00optim_states.pt\n    │   ├── zero_pp_rank_6_mp_rank_00optim_states.pt\n    │   └── zero_pp_rank_7_mp_rank_00optim_states.pt\n    └── latest_checkpointed_iteration.txt\n```\n\n### 3.3 GNN and GraphMVP\nFor GraphMVP, check this [repo](https://github.com/chao1224/GraphMVP), and the checkpoints on [Google Drive link](https://drive.google.com/drive/u/1/folders/1uPsBiQF3bfeCAXSDd4JfyXiTh-qxYfu6).\n```\npretrained_GraphMVP/\n├── GraphMVP_C\n│   └── model.pth\n└── GraphMVP_G\n    └── model.pth\n```\n\n### 3.4 Baseline KV-PLM\nFor KV-PLM, check this [repo](https://github.com/thunlp/KV-PLM) and checkpoints on [Google Drive link](https://drive.google.com/drive/folders/1xig3-3JG63kR-Xqj1b9wkPEdxtfD_4IX).\n\n### 3.5 Checkpoints for MoleculeSTM\nWe provide two sets of demo checkpoints at [this huggingface link](https://huggingface.co/chao1224/MoleculeSTM). 
Or you can use the following python script:\n```\nfrom huggingface_hub import HfApi, snapshot_download\napi = HfApi()\nsnapshot_download(repo_id=\"chao1224/MoleculeSTM\", repo_type=\"model\", cache_dir='.')\n```\n\nFor the optimal results reported in the paper, please use the following script:\n```\nfrom huggingface_hub import HfApi, snapshot_download\napi = HfApi()\nsnapshot_download(repo_id=\"chao1224/MoleculeSTM\", repo_type=\"model\", local_dir='.', allow_patterns=\"*MoleculeSTM*\")\n```\n\nWe further provide the optimal checkpoints for each downstream task under the `scripts` folder (README file).\n\n## 4 Scripts and Demos\n\nAll the running scripts and demos can be found under the `scripts` folder and `demos` folder, respectively.\n\n### 4.1 Pretraining\n\nMoleculeSTM-SMILES\n```\npython pretrain.py \\\n    --verbose --batch_size=8 \\\n    --molecule_type=SMILES\n```\n\nMoleculeSTM-Graph\n```\npython pretrain.py \\\n    --verbose --batch_size=8 \\\n    --molecule_type=Graph\n```\n\n### 4.2 Downstream: Zero-shot Structure-text Retrieval\n\n**For DrugBank-Description**\n\nMoleculeSTM-SMILES\n```\npython downstream_01_retrieval_Description_Pharmacodynamics.py \\\n    --task=molecule_description_removed_PubChem \\\n    --molecule_type=SMILES \\\n    --input_model_dir=../data/demo/demo_checkpoints_SMILES\n```\n\nMoleculeSTM-Graph\n```\npython downstream_01_retrieval_Description_Pharmacodynamics.py \\\n    --task=molecule_description_removed_PubChem \\\n    --molecule_type=Graph \\\n    --input_model_dir=../data/demo/demo_checkpoints_Graph\n```\n\n**For DrugBank-Pharmacodynamics**\n\nMoleculeSTM-SMILES\n```\npython downstream_01_retrieval_Description_Pharmacodynamics.py \\\n    --task=molecule_pharmacodynamics_removed_PubChem \\\n    --molecule_type=SMILES \\\n    --input_model_dir=../data/demo/demo_checkpoints_SMILES\n```\n\nMoleculeSTM-Graph\n```\npython downstream_01_retrieval_Description_Pharmacodynamics.py \\\n    
--task=molecule_pharmacodynamics_removed_PubChem \\\n    --molecule_type=Graph \\\n    --input_model_dir=../data/demo/demo_checkpoints_Graph\n```\n\n**For DrugBank-ATC**\n\n\nMoleculeSTM-SMILES\n```\npython downstream_01_retrieval_ATC.py \\\n    --molecule_type=SMILES \\\n    --input_model_dir=../data/demo/demo_checkpoints_SMILES\n```\n\nMoleculeSTM-Graph\n```\npython downstream_01_retrieval_ATC.py \\\n    --molecule_type=Graph \\\n    --input_model_dir=../data/demo/demo_checkpoints_Graph\n```\n\n### 4.3 Downstream: Zero-shot Text-based Molecule Editing\n\nFor description id list, you can find them in `MoleculeSTM/downstream_molecule_edit_utils.py`.\n\nMoleculeSTM-SMILES\n```\npython downstream_02_molecule_edit_step_01_MoleculeSTM_Space_Alignment.py \\\n    --MoleculeSTM_molecule_type=SMILES \\\n    --MoleculeSTM_model_dir=../data/demo/demo_checkpoints_SMILES\n\n\npython downstream_02_molecule_edit_step_02_MoleculeSTM_Latent_Optimization.py \\\n    --MoleculeSTM_molecule_type=SMILES \\\n    --MoleculeSTM_model_dir=../data/demo/demo_checkpoints_SMILES \\\n    --language_edit_model_dir=../data/demo/demo_checkpoints_SMILES \\\n    --input_description_id=101\n```\n\nMoleculeSTM-Graph\n```\npython downstream_02_molecule_edit_step_01_MoleculeSTM_Space_Alignment.py \\\n    --MoleculeSTM_molecule_type=Graph \\\n    --MoleculeSTM_model_dir=../data/demo/demo_checkpoints_Graph\n\n\npython downstream_02_molecule_edit_step_02_MoleculeSTM_Latent_Optimization.py \\\n    --MoleculeSTM_molecule_type=Graph \\\n    --MoleculeSTM_model_dir=../data/demo/demo_checkpoints_Graph \\\n    --language_edit_model_dir=../data/demo/demo_checkpoints_Graph \\\n    --input_description_id=101\n```\n\n### 4.4 Downstream: Molecular Property Prediction\n\nMoleculeSTM-SMILES\n```\npython downstream_03_property_prediction.py \\\n    --dataset=bace --molecule_type=SMILES \\\n```\n\nMoleculeSTM-Graph\n```\npython downstream_03_property_prediction.py \\\n    --dataset=bace --molecule_type=Graph\n```\n\n### 
4.5 Demo\nPlease check the `demos` folder. This may require you to download the dataset and checkpoints first:\n- raw dataset (after preprocessing) at [this Hugging Face link](https://huggingface.co/datasets/chao1224/MoleculeSTM).\n- checkpoints at [this Hugging Face link](https://huggingface.co/chao1224/MoleculeSTM).\n\n## Cite Us\nFeel free to cite this work if you find it useful!\n```\n@article{liu2023moleculestm,\n    author={Liu, Shengchao and Nie, Weili and Wang, Chengpeng and Lu, Jiarui and Qiao, Zhuoran and Liu, Ling and Tang, Jian and Xiao, Chaowei and Anandkumar, Anima},\n    title={Multi-modal molecule structure--text model for text-based retrieval and editing},\n    journal={Nature Machine Intelligence},\n    year={2023},\n    month={Dec},\n    day={01},\n    volume={5},\n    number={12},\n    pages={1447-1457},\n    issn={2522-5839},\n    doi={10.1038/s42256-023-00759-6},\n    url={https://doi.org/10.1038/s42256-023-00759-6}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchao1224%2FMoleculeSTM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchao1224%2FMoleculeSTM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchao1224%2FMoleculeSTM/lists"}