# MPr2-Bench [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/ChemistryVision)
The official repository for the paper: "MPr2-Bench: Large Vision Language Models for Molecular Property Prediction"

- **Page**: https://chemistryvisionlanguage.github.io/ppvl-bench/
- **Dataset**: [https://huggingface.co/ChemistryVision](https://huggingface.co/ChemistryVision)
## Introduction
We introduce MPr2-Bench, a novel benchmark for Vision-Language Models (VLMs) that focuses on the critical task of Molecular Property Prediction. While traditional computational chemistry relies on complex mathematical models, MPr2-Bench reformats this fundamental challenge into a multimodal task that leverages both visual and textual representations of molecules. Our method uniquely integrates both visual data (molecular structures in bond-line/skeletal formats) and textual data (SMILES and SELFIES representations) to improve prediction accuracy.




Importantly, our study also extends to regression tasks. In tasks such as predicting ESOL, LD50, and QM9 properties, our VLM-based approach achieves performance nearly on par with traditional and state-of-the-art methods.
## Datasets
You can find all the datasets on Hugging Face:
[![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/ChemistryVision)

Each dataset contains molecular structures (images and SMILES), property values, and metadata.
Note: Datasets include in-context examples (k=2) for few-shot learning.
```python
from datasets import load_dataset
dataset = load_dataset("ChemistryVision/BBBP-V-SMILES")
```
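Records from these datasets can then be assembled into few-shot prompts like the ones shown below. A minimal sketch with hardcoded toy records (the field layout and the `(smiles, label)` pairs here are illustrative, not the dataset's actual schema; check the dataset card on Hugging Face):

```python
def build_bbbp_prompt(examples, query_smiles):
    """Assemble a k-shot BBBP-style prompt from (smiles, label) example pairs."""
    lines = ["Given the SMILES string of a molecule, predict blood-brain "
             "barrier penetration. Answer with only Yes or No."]
    for smiles, label in examples:
        lines.append(f"SMILES: {smiles}")
        lines.append(f"Penetration: {'Yes' if label else 'No'}")
    lines.append(f"SMILES: {query_smiles}")
    lines.append("Penetration:")
    return "\n".join(lines)

# Toy k=2 examples mirroring the datasets' in-context setup.
examples = [("CCO", 1), ("C(=O)O", 0)]
prompt = build_bbbp_prompt(examples, "c1ccccc1")
print(prompt)
```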

## Notebooks
Inference on the datasets with each model is provided as Jupyter notebooks in the Notebooks/ directory. These notebooks are designed for direct use: run each cell, and you will obtain the required results.

Coming soon: updated work on regression-based molecular property prediction using Vision-Language Models.

## Prompt

The following are the prompts used in our study. Adapting and experimenting with custom-designed prompts is straightforward: simply modify the prompt in the Jupyter notebook for each task and observe the resulting performance. Additionally, in-context examples are selected using Tanimoto similarity to improve the relevance of the prompts.
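The Tanimoto-based example selection can be sketched as follows. This is a toy illustration: fingerprints are represented here as plain sets of "on" bit indices, whereas a real pipeline would derive them from molecular structure (e.g., Morgan fingerprints via RDKit):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def top_k_examples(query_fp, pool, k=2):
    """Pick the k pool molecules most similar to the query (k=2 matches our datasets)."""
    return sorted(pool, key=lambda item: tanimoto(query_fp, item[1]), reverse=True)[:k]

# Toy fingerprint pool: (name, bit-index set). Names and bits are made up.
pool = [
    ("mol_a", {1, 2, 3, 4}),
    ("mol_b", {1, 2, 9, 10}),
    ("mol_c", {7, 8, 9, 10}),
]
query = {1, 2, 3, 5}
selected = top_k_examples(query, pool, k=2)
print([name for name, _ in selected])  # mol_a and mol_b overlap most with the query
```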



## Tasks and Examples
The experiments cover classification- and regression-based tasks, which are listed below along with examples. Prompt examples for all the datasets are available in the Prompts/ directory.

### BBBP Dataset Prompt Example
```
You are an expert chemist, your task is to predict the property of molecule using your experienced chemical property prediction knowledge.
Please strictly follow the format, no other information can be provided. Given the SMILES string of a molecule, the task focuses on predicting molecular properties, specifically penetration/non-penetration to the brain-blood barrier, based on the SMILES string representation of each molecule.
You will be provided with several examples molecules, each accompanied by a binary label indicating whether it has penetrative property (Yes) or not (No).
Please answer with only Yes or No.
SMILES: CN(C)CCOC(C)(c1ccccc1)c2ccccn2.OC(=O)CCC(O)=O
Penetration: Yes
SMILES: CC1(C)S[C@@H]2[C@H](NC(=O)C3(N)CCCCC3)C(=O)N2[C@H]1C(O)=O
Penetration: No

Below is the molecule who's property you have to predict. Along with is the image structure of the molecule.
SMILES: C2=C(C(C1CCCCC1)CCN(C)C)C=CC=C2
Penetration:
You have to predict whether it has Penetration with answer Yes or No.

Response: Yes
```
Ground Truth: Yes
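Free-text model responses to prompts like the one above must be mapped back to binary labels before scoring. A small sketch of a tolerant parser (the notebooks may handle this differently):

```python
import re

def parse_yes_no(response):
    """Map a free-text model response to 1 (Yes), 0 (No), or None if ambiguous."""
    match = re.search(r"\b(yes|no)\b", response.lower())
    if match is None:
        return None
    return 1 if match.group(1) == "yes" else 0

print(parse_yes_no("Response: Yes"))
print(parse_yes_no("No, it does not cross the barrier."))
```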

### ESOL Dataset Prompt Example
```
As an expert chemist specializing in molecular property prediction, your task is to accurately estimate the measured log solubility in mols per litre for various compounds. You have extensive knowledge of chemical structures, solubility principles, and structure-property relationships. Consider the following information about the molecule: its molecular weight is 238.455 g/mol, and it has 0 H-bond donors. Additionally, here are some example compounds with their measured log solubilities:

SMILES: Clc1cccc(Cl)c1, Log Solubility: -3.04
SMILES: Oc1cccc(Cl)c1, Log Solubility: -0.7

Using your expertise, analyze the given SMILES (Simplified Molecular Input Line Entry System) representation of the molecule, considering factors such as polarity, molecular weight, H-bond donors, and other functional groups that influence solubility. Based on this analysis, provide your best prediction of the measured log solubility in mols per litre for the following SMILES string: Clc1cccc(I)c1.

Response: The predicted log solubility for the SMILES string Clc1cccc(I)c1 is -3.04.
```
Ground Truth: -3.55
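For regression tasks like ESOL, the numeric prediction has to be pulled out of a sentence such as the response above. A hedged sketch using a regex (the actual evaluation scripts may parse differently):

```python
import re

def extract_prediction(response):
    """Return the last signed decimal number in a model response, or None."""
    numbers = re.findall(r"-?\d+\.?\d*", response)
    return float(numbers[-1]) if numbers else None

# Taking the last number skips digits embedded in the SMILES string itself.
pred = extract_prediction(
    "The predicted log solubility for the SMILES string Clc1cccc(I)c1 is -3.04."
)
print(pred)
```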
## Setup

- ### Prepare your dataset in your local environment
If you need to modify the dataset in your local environment, you can generate your own version of the dataset. To get started, follow these steps:

1. **Clone the repository:**

```sh
git clone https://github.com/ChemistryVisionlanguage/ppvl-bench.git
cd ppvl-bench
```

2. **Create a Conda environment:**

Ensure you have Conda installed. If not, you can install it from [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html).

```sh
conda create --name molecule-env python=3.10
conda activate molecule-env
```

3. **Install the required packages:**

```sh
pip install -r requirements.txt
```
4. **Create a folder named `Datasets` in your root directory and place all your CSV files there (e.g., `BACE.csv`, `esol.csv`). A `GenerateDataset` notebook is also included in the repo; you can find it in the `Notebooks/` directory.**
5. **Run the respective dataset script:**
```sh
python GenerateDataset/SMILES/bace.py
```
- ### Running Inference on a Dataset Using the Model

To perform inference on a dataset using different models, follow the instructions below:

1. **Run the Inference Script for the Selected Model**:
- Depending on the model you are using, run the inference script directly from the appropriate directory.

- **For BLIP, CoG, GPT-4, and Qwen Models**:
- These models are ready to use and can be run directly from the ICL directory.
- Example commands:
```bash
# Run whichever model applies:
python ICL/blip/icl.py
python ICL/cog/icl.py
python ICL/gpt4/icl.py
python ICL/qwen/icl.py
```

- **For Other Models**:
- Other models need to be configured within their respective GitHub repositories.
- Follow these steps:
1. Clone the model's repository:
```bash
git clone <model-repo-url>
```
2. Navigate to the cloned repository:
```bash
cd <model-repo>
```
3. Follow the repository’s setup instructions to configure the model.
4. Use the ICL script provided in the repository to run inference.

2. **Add File Names to the Script**:
- Ensure that the dataset file names are correctly referenced within the script files before running the inference.
- This step is crucial to ensure the model processes the correct dataset.

- ### Fine-Tuning the Model

To fine-tune models on your dataset, follow the instructions below:

1. **Fine-Tuning Using the BLIP Model**:
- The fine-tuning script for the BLIP model has been provided and can be run directly.
- Example command:
```bash
python ICL/blip/finetune.py
```

2. **Fine-Tuning for Other Models**:
- For other models, the fine-tuning process needs to be configured within their respective GitHub repositories.
- Follow these steps to fine-tune other models:
1. Clone the model's repository:
```bash
git clone <model-repo-url>
```
2. Navigate to the cloned repository:
```bash
cd <model-repo>
```
3. Follow the repository’s setup instructions to configure and fine-tune the model on your dataset.

- ### Evaluating the Model

To evaluate the model's performance on a specific dataset, follow the steps below:

1. **Navigate to the Evaluation Script**:
- Each dataset has a corresponding evaluation script designed to assess the model's performance. Ensure you know the correct script for the dataset you are working with.
- Use the following command to navigate to the evaluation script directory:
```bash
cd Eval
```

2. **Run the Evaluation Script**:
- Execute the evaluation script for your dataset. For example, if you're evaluating a classification model, you might use:
```bash
python evalclassification.py
```
- Replace `evalclassification.py` with the appropriate script name if you're working with a different dataset or evaluation type.

3. **Provide the Inference CSV to the Script**:
- After running the evaluation, the script will report the evaluation metric.
- Ensure that the CSV file predicted during model inference is placed in the `Results/` directory, where the evaluation script expects it. This keeps all related files organized and makes future reference and comparison easier.
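The kind of computation a classification evaluation script performs can be sketched as accuracy and F1 over the predicted vs. ground-truth labels from the results CSV. This is illustrative only; the actual metrics and column names used by `evalclassification.py` are not confirmed here:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy and binary F1 from parallel lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Toy labels standing in for the ground-truth and predicted CSV columns.
acc, f1 = classification_metrics([1, 0, 1, 1], [1, 0, 0, 1])
print(acc, f1)
```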

## Acknowledgements

The fine-tuning and ICL methods used in this project have been adapted from the following repositories:

- [Llava1.5](https://github.com/haotian-liu/LLaVA.git)
- [Llama-AdapterV2](https://github.com/OpenGVLab/LLaMA-Adapter.git)
- [mPlugOWL2](https://github.com/X-PLUG/mPLUG-Owl.git)
- [QwenVL](https://github.com/QwenLM/Qwen-VL.git)
- [CogVLM](https://github.com/THUDM/CogVLM)
- [BLIP](https://huggingface.co/Salesforce/blip-vqa-base)

## Dataset Information

For detailed information about the datasets used in MPr2-Bench, please refer to our [Datasheet.md](./Datasheet/Datasheet.md). This comprehensive datasheet provides key information about each dataset, including:

- Full names and abbreviations
- Task types (e.g., binary classification, regression)
- Target properties
- Dataset sizes
- Available features
- Data split information (where applicable)

The datasheet covers all datasets used in our benchmark, from BACE-V to PCQM4Mv2, offering a quick reference for researchers and users of MPr2-Bench. It's an essential resource for understanding the scope and characteristics of the molecular property prediction tasks included in our benchmark.
### Important Considerations

- The current In-Context Learning (ICL) and fine-tuning scripts provided are designed to work with datasets generated in your local environment. Ensure that you have prepared your dataset accordingly.

- For those who wish to use Hugging Face datasets for ICL and fine-tuning:
  - **ICL scripts for Hugging Face datasets** have been provided in the `Notebooks/` directory as Colab notebooks.
  - Python scripts to use Hugging Face datasets directly for ICL and fine-tuning will be provided soon.
## Contribution

Feel free to contribute to this project by opening issues or submitting pull requests.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a pull request

## Citation

If you use the BACE, BBBP, HIV, Clintox, and Tox21 datasets in your work, please cite the following source:

```bibtex
@article{wu2018moleculenet,
title={MoleculeNet: A benchmark for molecular machine learning},
author={Wu, Zhenqin and Ramsundar, Bharath and Feinberg, Evan N and others},
journal={Chemical Science},
volume={9},
number={2},
pages={513--530},
year={2018},
publisher={Royal Society of Chemistry},
doi={10.1039/C7SC02664A}
}
```