https://github.com/osu-nlp-group/llm4chem

Official code repo for the paper "LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset"
# LlaSMol
This is the official code repository for the paper *LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset*.

- Paper: https://arxiv.org/abs/2402.09391
- Page: https://osu-nlp-group.github.io/LLM4Chem
- Dataset: https://huggingface.co/datasets/osunlp/SMolInstruct
- Models:
    - LlaSMol-Galactica-6.7B: [https://huggingface.co/osunlp/LlaSMol-Galactica-6.7B](https://huggingface.co/osunlp/LlaSMol-Galactica-6.7B)
    - LlaSMol-Llama2-7B: [https://huggingface.co/osunlp/LlaSMol-Llama2-7B](https://huggingface.co/osunlp/LlaSMol-Llama2-7B)
    - LlaSMol-CodeLlama-7B: [https://huggingface.co/osunlp/LlaSMol-CodeLlama-7B](https://huggingface.co/osunlp/LlaSMol-CodeLlama-7B)
    - LlaSMol-Mistral-7B: [https://huggingface.co/osunlp/LlaSMol-Mistral-7B](https://huggingface.co/osunlp/LlaSMol-Mistral-7B)

## Tasks and Examples

Our models are trained on 14 essential tasks that fall into 4 categories. The tasks are listed below, each with an example.

### Name Conversion
**Name Conversion - IUPAC to Molecular Formula**
```
Query: What is the molecular formula of the compound with this IUPAC name <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC> ?

Response: <MOLFORMULA> C15H11NO </MOLFORMULA>
```

**Name Conversion - IUPAC to SMILES**
```
Query: Could you provide the SMILES for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC> ?

Response: Of course. It's <SMILES> CCC1(C)COC(=O)C1 </SMILES> .
```

**Name Conversion - SMILES to Molecular Formula**
```
Query: Given the SMILES representation <SMILES> S=P1(N(CCCl)CCCl)NCCCO1 </SMILES> , what would be its molecular formula?

Response: It is <MOLFORMULA> C7H15Cl2N2OPS </MOLFORMULA> .
```

**Name Conversion - SMILES to IUPAC**
```
Query: Translate the given SMILES formula of a molecule <SMILES> CCC(C)C1CNCCCNC1 </SMILES> into its IUPAC name.

Response: <IUPAC> 3-butan-2-yl-1,5-diazocane </IUPAC>
```

### Property Prediction
**Property Prediction - ESOL**
```
Query: How soluble is <SMILES> CC(C)Cl </SMILES> ?

Response: Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L.
```

**Property Prediction - LIPO**
```
Query: Predict the octanol/water distribution coefficient logD under the circumstance of pH 7.4 for <SMILES> NC(=O)C1=CC=CC=C1O </SMILES> .

Response: <NUMBER> 1.090 </NUMBER>
```

**Property Prediction - BBBP**
```
Query: Is blood-brain barrier permeability (BBBP) a property of <SMILES> CCNC(=O)/C=C/C1=CC=CC(Br)=C1 </SMILES> ?

Response: <BOOLEAN> Yes </BOOLEAN>
```

**Property Prediction - Clintox**
```
Query: Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?

Response: <BOOLEAN> No </BOOLEAN>
```

**Property Prediction - HIV**
```
Query: Can <SMILES> CC1=CN(C2C=CCCC2O)C(=O)NC1=O </SMILES> serve as an inhibitor of HIV replication?

Response: <BOOLEAN> No </BOOLEAN>
```

**Property Prediction - SIDER**
```
Query: Are there any known side effects of <SMILES> CC1=CC(C)=C(NC(=O)CN(CC(=O)O)CC(=O)O)C(C)=C1Br </SMILES> affecting the heart?

Response: <BOOLEAN> No </BOOLEAN>
```

### Molecule Description

**Molecule Captioning**
```
Query: Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1 </SMILES>

Response: The molecule is an imidazole derivative with short-acting sedative, hypnotic, and general anesthetic properties. Etomidate appears to have gamma-aminobutyric acid (GABA) like effects, mediated through GABA-A receptor. The action enhances the inhibitory effect of GABA on the central nervous system by causing chloride channel opening events which leads to membrane hyperpolarization.
```

**Molecule Generation**

It is not required to use tags to wrap up the input description.

```
Query: Give me a molecule that satisfies the conditions outlined in the description: The molecule is a member of the class of tripyrroles that is a red-coloured pigment with antibiotic properties produced by Serratia marcescens. It has a role as an antimicrobial agent, a biological pigment, a bacterial metabolite, an apoptosis inducer and an antineoplastic agent. It is a tripyrrole, an aromatic ether and a ring assembly.

Response: Here is a potential molecule: <SMILES> CCCCCC1=C(C)NC(/C=C2\N=C(C3=CC=CN3)C=C2OC)=C1 </SMILES>
```

### Chemical Reaction

**Forward Synthesis**
```
Query: <SMILES> NC1=CC=C2OCOC2=C1.O=CO </SMILES> Based on the reactants and reagents given above, suggest a possible product.

Response: A possible product can be <SMILES> O=CNC1=CC=C2OCOC2=C1 </SMILES> .
```

**Retrosynthesis**
```
Query: Identify possible reactants that could have been used to create the specified product. <SMILES> CC1=CC=C(N)N=C1N </SMILES>

Response: <SMILES> CC(C#N)CCC#N.N </SMILES>
```

## Training

To fine-tune a base model on SMolInstruct, first clone this repo to your machine and `cd` into the folder, then run the following command.

```bash
MODELNAME=LlaSMol-Mistral-7B && CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py \
    --data_path osunlp/SMolInstruct \
    --base_model mistralai/Mistral-7B-v0.1 \
    --wandb_project LlaSMol \
    --wandb_run_name $MODELNAME \
    --wandb_log_model true \
    --output_dir checkpoint/$MODELNAME
```

The above is an example command for fine-tuning Mistral with LoRA on 4 GPUs. For other base models, the LoRA settings (e.g., `lora_target_modules`) might need to be modified accordingly.
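As a rough illustration of what adjusting `lora_target_modules` can involve, the sketch below maps base-model families to the linear-layer names exposed by their Hugging Face `transformers` implementations (Llama and Mistral share one naming scheme; Galactica uses the OPT architecture). The dictionary and the `pick_lora_target_modules` helper are hypothetical and not part of this repo, and which layers to adapt is an assumption here, not the setting used in the paper.

```python
from typing import Dict, List

# Hypothetical mapping (not from this repo): linear-layer names per model family,
# following the Hugging Face implementations of each architecture.
LORA_TARGETS_BY_FAMILY: Dict[str, List[str]] = {
    "llama": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    "mistral": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    # Galactica checkpoints use the OPT architecture, which names its layers differently.
    "galactica": ["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
}


def pick_lora_target_modules(base_model: str) -> List[str]:
    """Return candidate LoRA target modules for a base model identifier."""
    name = base_model.lower()
    for family, modules in LORA_TARGETS_BY_FAMILY.items():
        if family in name:
            return list(modules)
    raise ValueError(f"No LoRA target modules known for base model: {base_model}")


print(pick_lora_target_modules("facebook/galactica-6.7b"))
```

Passing a list with names that do not exist in the base model typically means no layers get adapted, so it is worth checking the names against the model's module tree before launching a long run.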

## Usage

Clone this repo to your machine, and `cd` to the folder.

### Generation

You can use the following code to query the models with your questions.

```python
from generation import LlaSMolGeneration

generator = LlaSMolGeneration('osunlp/LlaSMol-Mistral-7B')
generator.generate('Can you tell me the IUPAC name of <SMILES> C1CCOC1 </SMILES> ?')
```

**Note**:
1. In the input query, please wrap specific content in the corresponding tags.
    - SMILES representation: `<SMILES> ... </SMILES>`
    - IUPAC name: `<IUPAC> ... </IUPAC>`

    Other tags may appear in the models' responses:
    - Molecular formula: `<MOLFORMULA> ... </MOLFORMULA>`
    - Number: `<NUMBER> ... </NUMBER>`
    - Boolean: `<BOOLEAN> ... </BOOLEAN>`

    Please see the examples in [the above section](#tasks-and-examples).

2. The code canonicalizes SMILES strings automatically, as long as they are wrapped in `<SMILES> ... </SMILES>`.
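Building tagged queries by hand is error-prone, so a small helper can do the wrapping. The sketch below is hypothetical and not part of this repo: `wrap` and `smiles_to_iupac_query` are illustrative names, and the tag names `SMILES` and `IUPAC` are assumed to match the tags the models were trained with.

```python
def wrap(tag: str, content: str) -> str:
    """Wrap content in an opening and closing tag, padded with single spaces."""
    return f"<{tag}> {content.strip()} </{tag}>"


def smiles_to_iupac_query(smiles: str) -> str:
    """Build a SMILES-to-IUPAC question with the SMILES string wrapped in tags."""
    return f"Can you tell me the IUPAC name of {wrap('SMILES', smiles)} ?"


print(smiles_to_iupac_query("C1CCOC1"))
# Can you tell me the IUPAC name of <SMILES> C1CCOC1 </SMILES> ?
```

The resulting string can then be passed to `generator.generate(...)` as in the snippet above.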

### Evaluation on SMolInstruct

#### Step 1. Generate responses for samples

Use the following command to apply LlaSMol models to generate responses for samples in SMolInstruct.

```bash
python generate_on_dataset.py --model_name osunlp/LlaSMol-Mistral-7B --output_dir eval/LlaSMol-Mistral-7B/output
```

By default, the script generates responses for all the tasks in SMolInstruct. To restrict it to specific tasks, add an argument like `--tasks "['forward_synthesis','retrosynthesis']"`.
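The `--tasks` value is a Python list literal passed as a single shell string. The sketch below shows one way such a string can be turned into a list of task names. `parse_tasks` is a hypothetical helper, and how `generate_on_dataset.py` itself parses the argument is not confirmed here.

```python
import ast
from typing import List, Optional


def parse_tasks(raw: Optional[str]) -> Optional[List[str]]:
    """Parse a string such as "['forward_synthesis','retrosynthesis']" into a list.

    Returns None when no value is given, which stands for "all tasks".
    """
    if raw is None:
        return None
    # literal_eval only accepts Python literals, so it does not run arbitrary code.
    value = ast.literal_eval(raw)
    if not isinstance(value, (list, tuple)) or not all(isinstance(v, str) for v in value):
        raise ValueError(f"Expected a list of task names, got: {raw!r}")
    return list(value)


print(parse_tasks("['forward_synthesis','retrosynthesis']"))
# ['forward_synthesis', 'retrosynthesis']
```

The outer double quotes in the shell command matter: without them the shell would strip the inner quotes or split the value before Python sees it.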

#### Step 2. Extract predicted answer from model outputs

Use the following command to extract predicted answers from the models' outputs and store them in the `pred` field. By default, it extracts the part between the corresponding tags (e.g., `<SMILES> ... </SMILES>`). If the tags are missing or incomplete, the extracted answer is empty and is regarded as "no answer" in metric calculation.

```bash
python extract_prediction.py --output_dir eval/LlaSMol-Mistral-7B/output --prediction_dir eval/LlaSMol-Mistral-7B/prediction
```

By default, it extracts predicted answers for all the tasks and skips a task if its output file is not found. You can also specify tasks with an argument like `--tasks "['forward_synthesis','retrosynthesis']"`.
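The tag-based extraction rule described in this step can be sketched with a regular expression. This is a minimal illustration under stated assumptions, not the logic of `extract_prediction.py`: `extract_between_tags` is a hypothetical name, and the tag name `MOLFORMULA` is assumed.

```python
import re


def extract_between_tags(text: str, tag: str) -> str:
    """Return the text between <tag> and </tag>, or "" if either tag is missing."""
    pattern = re.compile(rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL)
    match = pattern.search(text)
    # An empty string stands for "no answer".
    return match.group(1).strip() if match else ""


complete = "It is <MOLFORMULA> C7H15Cl2N2OPS </MOLFORMULA> ."
truncated = "It is <MOLFORMULA> C7H15Cl2N2OPS"

print(repr(extract_between_tags(complete, "MOLFORMULA")))   # 'C7H15Cl2N2OPS'
print(repr(extract_between_tags(truncated, "MOLFORMULA")))  # ''
```

The second call returns an empty string because the closing tag is absent, which is the case treated as "no answer".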

#### Step 3. Calculate metrics

Use the following command to compute metrics for all the tasks.

```bash
python compute_metrics.py --prediction_dir eval/LlaSMol-Mistral-7B/prediction
```

By default, it computes metrics for all the tasks and skips a task if its prediction file is not found. You can also specify tasks with an argument like `--tasks "['forward_synthesis','retrosynthesis']"`.
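To make the role of empty predictions concrete, the sketch below computes an exact-match rate and a "no answer" rate over a handful of samples. It is a simplified, hypothetical example: `exact_match_summary` is not a function from this repo, and the actual metrics in `compute_metrics.py` may differ by task.

```python
from typing import Dict, List


def exact_match_summary(preds: List[str], golds: List[str]) -> Dict[str, float]:
    """Compute exact-match and no-answer rates; an empty prediction is "no answer"."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        raise ValueError("at least one sample is required")
    total = len(golds)
    no_answer = sum(1 for p in preds if p == "")
    # Empty predictions never count as correct.
    correct = sum(1 for p, g in zip(preds, golds) if p != "" and p == g)
    return {"exact_match": correct / total, "no_answer_rate": no_answer / total}


preds = ["C15H11NO", "", "Yes", "No"]
golds = ["C15H11NO", "C7H15Cl2N2OPS", "No", "No"]
print(exact_match_summary(preds, golds))
# {'exact_match': 0.5, 'no_answer_rate': 0.25}
```

Plain string equality is only a stand-in here; comparing SMILES strings, for instance, would normally require canonicalizing both sides first.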

## Citation
If our paper or related resources prove valuable to your research, we kindly ask that you cite the paper. Please feel free to contact us with any inquiries.
```
@article{yu2024llasmol,
title={LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset},
author={Botao Yu and Frazier N. Baker and Ziqi Chen and Xia Ning and Huan Sun},
journal={arXiv preprint arXiv:2402.09391},
year={2024}
}
```