Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dgjun32/Graph-Instruction-Tuning
Instruction tuning LLM with molecular graph instruction dataset.
- Host: GitHub
- URL: https://github.com/dgjun32/Graph-Instruction-Tuning
- Owner: dgjun32
- Created: 2023-06-14T04:26:26.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-06-14T05:10:18.000Z (over 1 year ago)
- Last Synced: 2023-10-16T15:36:06.170Z (about 1 year ago)
- Language: Python
- Size: 2.93 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome_ai_agents - Graph-Instruction-Tuning - Instruction tuning LLM with molecular graph instruction dataset. (Building / Datasets)
README
# Graph Instruction Tuning
Recently, large language models (LLMs) have shown promising performance across many NLP tasks, while also being extended to understand the visual modality (Flamingo (2022), LLaVA (2023)). However, methods for LLMs to understand structured data modalities (e.g., tabular or graph) have not been actively explored yet. Therefore, in this project I propose Graph Instruction Tuning (GIT), which trains an LLM to understand molecular graph-structured data as a reference and to output the correct answer given a natural language instruction (e.g., "Predict the solvation free energies of this molecule in water.", "Is this chemical compound able to cross blood-brain barrier? Answer yes (1) or no (0).").
## Dataset
First, I synthesized natural language instructions corresponding to each dataset using the ```openai api```.
Based on a manually written seed instruction set, I synthesized 10 instructions for each dataset. To generate the .json file of the GIT dataset, run the script below.
```
python construct_dataset.py \
--bot_model [model version (e.g., gpt-3.5-turbo)] \
    --k [# of instructions per dataset] \
--api_key [your openai api key]
```
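Under the hood, the synthesis step boils down to prompting a chat model to paraphrase a hand-written seed instruction for each dataset. Below is a minimal, hypothetical sketch of that idea; it is not the repository's actual code, and the seed instructions, the ```synthesize_instructions``` helper, and the prompt wording are all assumptions (it also uses the current ```openai``` Python client, while ```construct_dataset.py``` may use an older API).
```
# Hypothetical sketch of the instruction-synthesis step; not the repo's actual code.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

# Seed instructions are hand-written per dataset (these two are illustrative).
SEED_INSTRUCTIONS = {
    "BBBP": "Is this chemical compound able to cross blood-brain barrier? "
            "Answer yes (1) or no (0).",
    "FreeSolv": "Predict the solvation free energies of this molecule in water.",
}

def synthesize_instructions(task: str, k: int = 10,
                            model: str = "gpt-3.5-turbo") -> list[str]:
    """Ask the chat model for k paraphrases of the seed instruction for `task`."""
    prompt = (
        f"Rewrite the following instruction in {k} different ways, one per line:\n"
        f"{SEED_INSTRUCTIONS[task]}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # One paraphrased instruction per non-empty line of the completion.
    return [line.strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]
```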
A single training instance looks like this:
```
{"smiles":"C(N1CCOCC1)CNC2=NNC(C=C2C)=C3C=CC(=O)C=C3","instruction":"Is this chemical compound able to cross blood-brain barrier? Answer yes (1) or no (0).","y":[[1.0]],"task":"BBBP"}
{"smiles":"C(N1CCOCC1)CNC2=NNC(C=C2C)=C3C=CC(=O)C=C3","instruction":"Please indicate whether this chemical compound is capable of crossing the blood-brain barrier by answering with a \"yes\" (1) or \"no\" (0).","y":[[1.0]],"task":"BBBP"}
```
The created dataset consists of 500K (instruction, molecule SMILES, target (label or float)) triplets, which can be used for LLM instruction tuning, thereby turning the LLM into an agent equipped with bio-molecular expertise!
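Since each line of the dataset file is a standalone JSON object, consuming it is straightforward. Here is a minimal sketch of how one might load the triplets; the file name ```git_dataset.json``` and the helper below are illustrative assumptions, not part of the repository.
```
# Illustrative loader for the JSON-lines dataset; file name is an assumption.
import json

def load_git_dataset(path: str = "git_dataset.json"):
    """Read one (instruction, SMILES, target) triplet per JSON line."""
    examples = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Flatten the nested target, e.g. [[1.0]] -> 1.0.
            examples.append(
                (record["instruction"], record["smiles"], record["y"][0][0])
            )
    return examples
```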
## Plan (Will be updated soon)
I am going to instruction-tune a pretrained ```LLaMA-7B``` with the 500K ```git_dataset``` synthesized by the procedure described above.
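As a rough illustration of what that tuning loop could look like, here is a hedged sketch using the Hugging Face stack. The README does not include any training code, so the checkpoint name, the prompt serialization, and every hyperparameter below are assumptions.
```
# Speculative sketch of the planned instruction tuning; not the author's code.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# "huggyllama/llama-7b" is an assumed checkpoint name for LLaMA-7B.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default.
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

def to_text(example):
    # Serialize one (instruction, SMILES, target) triplet into a training string.
    return {"text": f"{example['instruction']}\n"
                    f"Molecule: {example['smiles']}\n"
                    f"Answer: {example['y'][0][0]}"}

raw = Dataset.from_json("git_dataset.json").map(to_text)
tokenized = raw.map(
    lambda e: tokenizer(e["text"], truncation=True, max_length=512),
    remove_columns=raw.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="git-llama-7b",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    # mlm=False makes the collator set labels for causal-LM fine-tuning.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```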