{"id":27267825,"url":"https://github.com/OSU-NLP-Group/LLM4Chem","last_synced_at":"2025-04-11T10:02:22.045Z","repository":{"id":222527436,"uuid":"757148762","full_name":"OSU-NLP-Group/LLM4Chem","owner":"OSU-NLP-Group","description":"Official code repo for the paper \"LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset\"","archived":false,"fork":false,"pushed_at":"2024-11-12T00:05:01.000Z","size":136466,"stargazers_count":82,"open_issues_count":0,"forks_count":11,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-09T12:04:10.883Z","etag":null,"topics":["ai4science","chemistry","llms","molecule"],"latest_commit_sha":null,"homepage":"https://osu-nlp-group.github.io/LLM4Chem/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OSU-NLP-Group.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-13T22:29:28.000Z","updated_at":"2025-04-07T02:48:01.000Z","dependencies_parsed_at":"2024-02-20T05:24:50.702Z","dependency_job_id":"a1d164f7-95d4-480b-ba58-2b7fe50d71a0","html_url":"https://github.com/OSU-NLP-Group/LLM4Chem","commit_stats":{"total_commits":9,"total_committers":1,"mean_commits":9.0,"dds":0.0,"last_synced_commit":"bbf5c94fa938f43902eb2fc20223d20d338922ca"},"previous_names":["osu-nlp-group/llm4chem"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OSU-NLP-Group%2FLLM4Chem","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/OSU-NLP-Group%2FLLM4Chem/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OSU-NLP-Group%2FLLM4Chem/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OSU-NLP-Group%2FLLM4Chem/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OSU-NLP-Group","download_url":"https://codeload.github.com/OSU-NLP-Group/LLM4Chem/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248372360,"owners_count":21093134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai4science","chemistry","llms","molecule"],"created_at":"2025-04-11T10:01:29.826Z","updated_at":"2025-04-11T10:02:22.037Z","avatar_url":"https://github.com/OSU-NLP-Group.png","language":"Python","funding_links":[],"categories":["Language Models","Datasets \u0026 Benchmarks"],"sub_categories":["Text + BioMulti"],"readme":"# LlaSMol\nThis is the official code repository for the paper *LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset*.\n\n- Paper: https://arxiv.org/abs/2402.09391\n- Page: https://osu-nlp-group.github.io/LLM4Chem\n- Dataset: https://huggingface.co/datasets/osunlp/SMolInstruct\n- Models:\n  - LlaSMol-Galactica-6.7B: [https://huggingface.co/osunlp/LlaSMol-Galactica-6.7B](https://huggingface.co/osunlp/LlaSMol-Galactica-6.7B)\n  - LlaSMol-Llama2-7B: [https://huggingface.co/osunlp/LlaSMol-Llama2-7B](https://huggingface.co/osunlp/LlaSMol-Llama2-7B)\n  - LlaSMol-CodeLlama-7B: 
[https://huggingface.co/osunlp/LlaSMol-CodeLlama-7B](https://huggingface.co/osunlp/LlaSMol-CodeLlama-7B)\n  - LlaSMol-Mistral-7B: [https://huggingface.co/osunlp/LlaSMol-Mistral-7B](https://huggingface.co/osunlp/LlaSMol-Mistral-7B)\n\n## Tasks and Examples\n\nOur models are trained on 14 essential tasks of 4 types, which are listed below along with examples.\n\n### Name Conversion\n**Name Conversion - IUPAC to Molecular Formula**\n```\nQuery: What is the molecular formula of the compound with this IUPAC name \u003cIUPAC\u003e 2,5-diphenyl-1,3-oxazole \u003c/IUPAC\u003e ?\n\nResponse: \u003cMOLFORMULA\u003e C15H11NO \u003c/MOLFORMULA\u003e\n```\n\n**Name Conversion - IUPAC to SMILES**\n```\nQuery: Could you provide the SMILES for \u003cIUPAC\u003e 4-ethyl-4-methyloxolan-2-one \u003c/IUPAC\u003e ?\n\nResponse: Of course. It's \u003cSMILES\u003e CCC1(C)COC(=O)C1 \u003c/SMILES\u003e .\n```\n\n**Name Conversion - SMILES to Molecular Formula**\n```\nQuery: Given the SMILES representation \u003cSMILES\u003e S=P1(N(CCCl)CCCl)NCCCO1 \u003c/SMILES\u003e, what would be its molecular formula?\n\nResponse: It is \u003cMOLFORMULA\u003e C7H15Cl2N2OPS \u003c/MOLFORMULA\u003e .\n```\n\n**Name Conversion - SMILES to IUPAC**\n```\nQuery: Translate the given SMILES formula of a molecule \u003cSMILES\u003e CCC(C)C1CNCCCNC1 \u003c/SMILES\u003e into its IUPAC name.\n\nResponse: \u003cIUPAC\u003e 3-butan-2-yl-1,5-diazocane \u003c/IUPAC\u003e\n```\n\n### Property Prediction\n**Property Prediction - ESOL**\n```\nQuery: How soluble is \u003cSMILES\u003e CC(C)Cl \u003c/SMILES\u003e ?\n\nResponse: Its log solubility is \u003cNUMBER\u003e -1.41 \u003c/NUMBER\u003e mol/L.\n```\n\n**Property Prediction - LIPO**\n```\nQuery: Predict the octanol/water distribution coefficient logD under the circumstance of pH 7.4 for \u003cSMILES\u003e NC(=O)C1=CC=CC=C1O \u003c/SMILES\u003e .\n\nResponse: \u003cNUMBER\u003e 1.090 \u003c/NUMBER\u003e\n```\n\n**Property Prediction - BBBP**\n```\nQuery: Is 
blood-brain barrier permeability (BBBP) a property of \u003cSMILES\u003e CCNC(=O)/C=C/C1=CC=CC(Br)=C1 \u003c/SMILES\u003e ?\n\nResponse: \u003cBOOLEAN\u003e Yes \u003c/BOOLEAN\u003e\n```\n\n**Property Prediction - Clintox**\n```\nQuery: Is \u003cSMILES\u003e COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 \u003c/SMILES\u003e toxic?\n\nResponse: \u003cBOOLEAN\u003e No \u003c/BOOLEAN\u003e\n```\n\n**Property Prediction - HIV**\n```\nQuery: Can \u003cSMILES\u003e CC1=CN(C2C=CCCC2O)C(=O)NC1=O \u003c/SMILES\u003e serve as an inhibitor of HIV replication?\n\nResponse: \u003cBOOLEAN\u003e No \u003c/BOOLEAN\u003e\n```\n\n**Property Prediction - SIDER**\n```\nQuery: Are there any known side effects of \u003cSMILES\u003e CC1=CC(C)=C(NC(=O)CN(CC(=O)O)CC(=O)O)C(C)=C1Br \u003c/SMILES\u003e affecting the heart?\n\nResponse: \u003cBOOLEAN\u003e No \u003c/BOOLEAN\u003e\n```\n\n### Molecule Description\n\n**Molecule Captioning**\n```\nQuery: Describe this molecule: \u003cSMILES\u003e CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1 \u003c/SMILES\u003e\n\nResponse: The molecule is an imidazole derivative with short-acting sedative, hypnotic, and general anesthetic properties. Etomidate appears to have gamma-aminobutyric acid (GABA) like effects, mediated through GABA-A receptor. The action enhances the inhibitory effect of GABA on the central nervous system by causing chloride channel opening events which leads to membrane hyperpolarization.\n```\n\n**Molecule Generation**\n\nIt is not required to use tags to wrap up the input description.\n\n```\nQuery: Give me a molecule that satisfies the conditions outlined in the description: The molecule is a member of the class of tripyrroles that is a red-coloured pigment with antibiotic properties produced by Serratia marcescens. It has a role as an antimicrobial agent, a biological pigment, a bacterial metabolite, an apoptosis inducer and an antineoplastic agent. 
It is a tripyrrole, an aromatic ether and a ring assembly.\n\nResponse: Here is a potential molecule: \u003cSMILES\u003e CCCCCC1=C(C)NC(/C=C2\\N=C(C3=CC=CN3)C=C2OC)=C1 \u003c/SMILES\u003e\n```\n\n### Chemical Reaction\n\n**Forward Synthesis**\n```\nQuery: \u003cSMILES\u003e NC1=CC=C2OCOC2=C1.O=CO \u003c/SMILES\u003e Based on the reactants and reagents given above, suggest a possible product.\n\nResponse: A possible product can be \u003cSMILES\u003e O=CNC1=CC=C2OCOC2=C1 \u003c/SMILES\u003e .\n```\n\n**Retrosynthesis**\n```\nQuery: Identify possible reactants that could have been used to create the specified product. \u003cSMILES\u003e CC1=CC=C(N)N=C1N \u003c/SMILES\u003e\n\nResponse: \u003cSMILES\u003e CC(C#N)CCC#N.N \u003c/SMILES\u003e\n```\n\n## Training\n\nTo fine-tune a base model on SMolInstruct, first clone this repo to your machine and `cd` into the folder, then run the following command.\n\n```bash\nMODELNAME=LlaSMol-Mistral-7B \u0026\u0026 CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py --data_path osunlp/SMolInstruct --base_model mistralai/Mistral-7B-v0.1 --wandb_project LlaSMol --wandb_run_name $MODELNAME --wandb_log_model true --output_dir checkpoint/$MODELNAME\n```\n\nThe above is an example command for fine-tuning Mistral with LoRA, using 4 GPUs. For other base models, the LoRA settings (e.g., `lora_target_modules`) may need to be modified accordingly.\n\n## Usage\n\nClone this repo to your machine, and `cd` into the folder.\n\n### Generation\n\nYou can use the following code to query the models with your questions.\n\n```python\nfrom generation import LlaSMolGeneration\n\ngenerator = LlaSMolGeneration('osunlp/LlaSMol-Mistral-7B')\ngenerator.generate('Can you tell me the IUPAC name of \u003cSMILES\u003e C1CCOC1 \u003c/SMILES\u003e ?')\n```\n\n**Note**:\n1. In the input query, wrap specific content in the corresponding tags.\n    - SMILES representation: `\u003cSMILES\u003e ... 
\u003c/SMILES\u003e`\n    - IUPAC name: `\u003cIUPAC\u003e ... \u003c/IUPAC\u003e`\n    \n    Other tags may appear in models' responses:\n    - Molecular formula: `\u003cMOLFORMULA\u003e ... \u003c/MOLFORMULA\u003e`\n    - Number: `\u003cNUMBER\u003e ... \u003c/NUMBER\u003e`\n    - Boolean: `\u003cBOOLEAN\u003e ... \u003c/BOOLEAN\u003e`\n\n    Please see the examples in [the above section](#tasks-and-examples).\n\n2. The code canonicalizes SMILES strings automatically, as long as they are wrapped in `\u003cSMILES\u003e ... \u003c/SMILES\u003e`.\n\n### Evaluation on SMolInstruct\n\n#### Step 1. Generate responses for samples\n\nUse the following command to apply LlaSMol models to generate responses for samples in SMolInstruct.\n\n```bash\npython generate_on_dataset.py --model_name osunlp/LlaSMol-Mistral-7B --output_dir eval/LlaSMol-Mistral-7B/output\n```\n\nBy default, the script generates responses for all the tasks in SMolInstruct. You can also specify tasks by adding an argument like `--tasks \"['forward_synthesis','retrosynthesis']\"`.\n\n#### Step 2. Extract predicted answer from model outputs\n\nUse the following command to extract predicted answers from the model's outputs and store them in the `pred` fields. By default, it extracts the part between the corresponding tags (e.g., `\u003cSMILES\u003e ... \u003c/SMILES\u003e`). If the tags are missing or incomplete, the extracted answer will be empty and regarded as \"no answer\" in metric calculation.\n\n```bash\npython extract_prediction.py --output_dir eval/LlaSMol-Mistral-7B/output --prediction_dir eval/LlaSMol-Mistral-7B/prediction\n```\n\nBy default, it extracts predicted answers for all the tasks. It skips a task if its output file is not found. You can also specify tasks like `--tasks \"['forward_synthesis','retrosynthesis']\"`.\n\n#### Step 3. 
Calculate metrics\n\nUse the following command to compute metrics for all the tasks.\n\n```bash\npython compute_metrics.py --prediction_dir eval/LlaSMol-Mistral-7B/prediction\n```\n\nBy default, it computes metrics for all the tasks. It skips a task if its prediction file is not found. You can also specify tasks like `--tasks \"['forward_synthesis','retrosynthesis']\"`.\n\n## Citation\nIf our paper or related resources prove valuable to your research, we kindly ask for a citation. Please feel free to contact us with any inquiries.\n```\n@article{yu2024llasmol,\n    title={LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset},\n    author={Botao Yu and Frazier N. Baker and Ziqi Chen and Xia Ning and Huan Sun},\n    journal={arXiv preprint arXiv:2402.09391},\n    year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOSU-NLP-Group%2FLLM4Chem","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOSU-NLP-Group%2FLLM4Chem","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOSU-NLP-Group%2FLLM4Chem/lists"}