Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/osu-nlp-group/llm4chem
Official code repo for the paper "LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset"
- Host: GitHub
- URL: https://github.com/osu-nlp-group/llm4chem
- Owner: OSU-NLP-Group
- License: mit
- Created: 2024-02-13T22:29:28.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-11-12T00:05:01.000Z (about 1 month ago)
- Last Synced: 2024-12-18T14:13:53.891Z (4 days ago)
- Topics: ai4science, chemistry, llms, molecule
- Language: Python
- Homepage: https://osu-nlp-group.github.io/LLM4Chem/
- Size: 130 MB
- Stars: 72
- Watchers: 8
- Forks: 10
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# LlaSMol
This is the official code repository for the paper *LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset*.

- Paper: https://arxiv.org/abs/2402.09391
- Page: https://osu-nlp-group.github.io/LLM4Chem
- Dataset: https://huggingface.co/datasets/osunlp/SMolInstruct
- Models:
  - LlaSMol-Galactica-6.7B: [https://huggingface.co/osunlp/LlaSMol-Galactica-6.7B](https://huggingface.co/osunlp/LlaSMol-Galactica-6.7B)
  - LlaSMol-Llama2-7B: [https://huggingface.co/osunlp/LlaSMol-Llama2-7B](https://huggingface.co/osunlp/LlaSMol-Llama2-7B)
  - LlaSMol-CodeLlama-7B: [https://huggingface.co/osunlp/LlaSMol-CodeLlama-7B](https://huggingface.co/osunlp/LlaSMol-CodeLlama-7B)
  - LlaSMol-Mistral-7B: [https://huggingface.co/osunlp/LlaSMol-Mistral-7B](https://huggingface.co/osunlp/LlaSMol-Mistral-7B)

## Tasks and Examples
Our models are trained on 14 essential tasks that fall into 4 categories. The tasks are listed below, each with an example.
### Name Conversion
**Name Conversion - IUPAC to Molecular Formula**
```
Query: What is the molecular formula of the compound with this IUPAC name <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC> ?

Response: <MOLFORMULA> C15H11NO </MOLFORMULA>
```

**Name Conversion - IUPAC to SMILES**
```
Query: Could you provide the SMILES for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC> ?

Response: Of course. It's <SMILES> CCC1(C)COC(=O)C1 </SMILES> .
```

**Name Conversion - SMILES to Molecular Formula**
```
Query: Given the SMILES representation <SMILES> S=P1(N(CCCl)CCCl)NCCCO1 </SMILES> , what would be its molecular formula?

Response: It is <MOLFORMULA> C7H15Cl2N2OPS </MOLFORMULA> .
```

**Name Conversion - SMILES to IUPAC**
```
Query: Translate the given SMILES formula of a molecule <SMILES> CCC(C)C1CNCCCNC1 </SMILES> into its IUPAC name.

Response: <IUPAC> 3-butan-2-yl-1,5-diazocane </IUPAC>
```

### Property Prediction
**Property Prediction - ESOL**
```
Query: How soluble is <SMILES> CC(C)Cl </SMILES> ?

Response: Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L.
```

**Property Prediction - LIPO**
```
Query: Predict the octanol/water distribution coefficient logD under the circumstance of pH 7.4 for <SMILES> NC(=O)C1=CC=CC=C1O </SMILES> .

Response: <NUMBER> 1.090 </NUMBER>
```

**Property Prediction - BBBP**
```
Query: Is blood-brain barrier permeability (BBBP) a property of <SMILES> CCNC(=O)/C=C/C1=CC=CC(Br)=C1 </SMILES> ?

Response: <BOOLEAN> Yes </BOOLEAN>
```

**Property Prediction - Clintox**
```
Query: Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?

Response: <BOOLEAN> No </BOOLEAN>
```

**Property Prediction - HIV**
```
Query: Can <SMILES> CC1=CN(C2C=CCCC2O)C(=O)NC1=O </SMILES> serve as an inhibitor of HIV replication?

Response: <BOOLEAN> No </BOOLEAN>
```

**Property Prediction - SIDER**
```
Query: Are there any known side effects of <SMILES> CC1=CC(C)=C(NC(=O)CN(CC(=O)O)CC(=O)O)C(C)=C1Br </SMILES> affecting the heart?

Response: <BOOLEAN> No </BOOLEAN>
```

### Molecule Description
**Molecule Captioning**
```
Query: Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1 </SMILES>

Response: The molecule is an imidazole derivative with short-acting sedative, hypnotic, and general anesthetic properties. Etomidate appears to have gamma-aminobutyric acid (GABA) like effects, mediated through GABA-A receptor. The action enhances the inhibitory effect of GABA on the central nervous system by causing chloride channel opening events which leads to membrane hyperpolarization.
```

**Molecule Generation**

The input description does not need to be wrapped in tags.
```
Query: Give me a molecule that satisfies the conditions outlined in the description: The molecule is a member of the class of tripyrroles that is a red-coloured pigment with antibiotic properties produced by Serratia marcescens. It has a role as an antimicrobial agent, a biological pigment, a bacterial metabolite, an apoptosis inducer and an antineoplastic agent. It is a tripyrrole, an aromatic ether and a ring assembly.

Response: Here is a potential molecule: <SMILES> CCCCCC1=C(C)NC(/C=C2\N=C(C3=CC=CN3)C=C2OC)=C1 </SMILES>
```

### Chemical Reaction
**Forward Synthesis**
```
Query: <SMILES> NC1=CC=C2OCOC2=C1.O=CO </SMILES> Based on the reactants and reagents given above, suggest a possible product.

Response: A possible product can be <SMILES> O=CNC1=CC=C2OCOC2=C1 </SMILES> .
```

**Retrosynthesis**
```
Query: Identify possible reactants that could have been used to create the specified product. <SMILES> CC1=CC=C(N)N=C1N </SMILES>

Response: <SMILES> CC(C#N)CCC#N.N </SMILES>
```

## Training
If you need to fine-tune a base model on SMolInstruct, please first clone this repo to your machine, and `cd` to the folder, then use the following command.
```bash
MODELNAME=LlaSMol-Mistral-7B && CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py --data_path osunlp/SMolInstruct --base_model mistralai/Mistral-7B-v0.1 --wandb_project LlaSMol --wandb_run_name $MODELNAME --wandb_log_model true --output_dir checkpoint/$MODELNAME
```

The above is an example command for fine-tuning Mistral with LoRA on 4 GPUs. For other base models, the LoRA settings (e.g., `lora_target_modules`) might need to be modified accordingly.
## Usage
Clone this repo to your machine, and `cd` to the folder.
### Generation
You could use the following code to query the models with your questions.
```python
from generation import LlaSMolGeneration

# Load the model, then query it; the SMILES string is wrapped in its tags.
generator = LlaSMolGeneration('osunlp/LlaSMol-Mistral-7B')
generator.generate('Can you tell me the IUPAC name of <SMILES> C1CCOC1 </SMILES> ?')
```

**Note**:
1. In the input query, please wrap specific content in the corresponding tags.
    - SMILES representation: `<SMILES> ... </SMILES>`
    - IUPAC name: `<IUPAC> ... </IUPAC>`

    Other tags may appear in the models' responses:
    - Molecular formula: `<MOLFORMULA> ... </MOLFORMULA>`
    - Number: `<NUMBER> ... </NUMBER>`
    - Boolean: `<BOOLEAN> ... </BOOLEAN>`

    Please see the examples in [the above section](#tasks-and-examples).
2. The code canonicalizes SMILES strings automatically, as long as they are wrapped in `<SMILES> ... </SMILES>`.
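
The tagging convention in the note above can be sketched with a small helper. This is only an illustration: `wrap` and `build_query` are hypothetical names that are not part of this repository, and the tag spellings (`<SMILES>`, `<IUPAC>`) are assumed here.

```python
# Assumed tag names for input content; adjust them if the models expect others.
INPUT_TAGS = {"smiles": "SMILES", "iupac": "IUPAC"}


def wrap(kind: str, content: str) -> str:
    """Wrap content in the tag pair for `kind`, e.g. <SMILES> ... </SMILES>."""
    tag = INPUT_TAGS[kind.lower()]
    return f"<{tag}> {content.strip()} </{tag}>"


def build_query(template: str, **fields: str) -> str:
    """Fill a template such as 'How soluble is {smiles} ?' with tagged values."""
    return template.format(**{name: wrap(name, value) for name, value in fields.items()})


query = build_query("Can you tell me the IUPAC name of {smiles} ?", smiles="C1CCOC1")
print(query)
```

The resulting string could then be passed to `generator.generate(...)` as in the snippet above.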
### Evaluation on SMolInstruct
#### Step 1. Generate responses for samples
Use the following command to apply LlaSMol models to generate responses for samples in SMolInstruct.
```bash
python generate_on_dataset.py --model_name osunlp/LlaSMol-Mistral-7B --output_dir eval/LlaSMol-Mistral-7B/output
```

By default, it generates responses for all the tasks in SMolInstruct. You can also restrict generation to specific tasks by adding an argument such as `--tasks "['forward_synthesis','retrosynthesis']"`.

#### Step 2. Extract predicted answer from model outputs
Use the following command to extract the predicted answers from the model's outputs and store them in the `pred` fields. By default, it extracts the part between the corresponding tags (e.g., `<SMILES> ... </SMILES>`). If the tags are missing or incomplete, the extracted answer will be empty and counted as "no answer" in metric calculation.
```bash
python extract_prediction.py --output_dir eval/LlaSMol-Mistral-7B/output --prediction_dir eval/LlaSMol-Mistral-7B/prediction
```

By default, it extracts predicted answers for all the tasks, skipping any task whose output file is not found. You can also specify tasks with an argument such as `--tasks "['forward_synthesis','retrosynthesis']"`.
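
The extraction rule described in this step (take the text between a complete tag pair, otherwise return an empty "no answer") can be sketched as follows. This is a minimal illustration and not the code in `extract_prediction.py`; the function names and the tag spellings used in the example calls are assumptions.

```python
import re


def extract_between_tags(text: str, tag: str) -> str:
    """Return the content of the first complete <tag> ... </tag> pair, or '' if none."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def summarize(outputs: list[str], tag: str) -> dict:
    """Extract one prediction per output and count the empty ("no answer") ones."""
    preds = [extract_between_tags(output, tag) for output in outputs]
    return {"pred": preds, "num_no_answer": sum(pred == "" for pred in preds)}


outputs = [
    "A possible product can be <SMILES> O=CNC1=CC=C2OCOC2=C1 </SMILES> .",
    "A possible product can be <SMILES> O=CNC1=CC=C2OCOC2=C1",  # closing tag missing
]
print(summarize(outputs, "SMILES"))
```

An output whose closing tag is missing yields an empty prediction, which matches the "no answer" handling described above.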
#### Step 3. Calculate metrics
Use the following command to compute metrics for all the tasks.
```bash
python compute_metrics.py --prediction_dir eval/LlaSMol-Mistral-7B/prediction
```

By default, it computes metrics for all the tasks, skipping any task whose prediction file is not found. You can also specify tasks with an argument such as `--tasks "['forward_synthesis','retrosynthesis']"`.
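
The three evaluation steps can be chained from one script. The sketch below only assembles the command lines shown above, using the same script names and flags. `build_pipeline` and `format_tasks` are hypothetical helpers, and the `eval/<model>/output` and `eval/<model>/prediction` layout simply mirrors the example paths.

```python
def format_tasks(tasks):
    """Render task names as the list literal passed to --tasks, e.g. ['a','b']."""
    return "[" + ",".join(repr(task) for task in tasks) + "]"


def build_pipeline(model_name, eval_root="eval", tasks=None):
    """Return the generate, extract and metrics commands as argument lists."""
    run_name = model_name.split("/")[-1]
    output_dir = f"{eval_root}/{run_name}/output"
    prediction_dir = f"{eval_root}/{run_name}/prediction"
    commands = [
        ["python", "generate_on_dataset.py", "--model_name", model_name, "--output_dir", output_dir],
        ["python", "extract_prediction.py", "--output_dir", output_dir, "--prediction_dir", prediction_dir],
        ["python", "compute_metrics.py", "--prediction_dir", prediction_dir],
    ]
    if tasks:
        # Restrict every step to the same subset of tasks.
        for command in commands:
            command += ["--tasks", format_tasks(tasks)]
    return commands


for command in build_pipeline("osunlp/LlaSMol-Mistral-7B", tasks=["forward_synthesis", "retrosynthesis"]):
    print(command)
    # To run a step from the repository root: subprocess.run(command, check=True)
```

Passing each command as an argument list avoids shell quoting problems with the `--tasks` value.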
## Citation
If our paper or related resources prove valuable to your research, we kindly ask that you cite our paper. Please feel free to contact us with any inquiries.
```
@article{yu2024llasmol,
title={LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset},
author={Botao Yu and Frazier N. Baker and Ziqi Chen and Xia Ning and Huan Sun},
journal={arXiv preprint arXiv:2402.09391},
year={2024}
}
```