https://github.com/aksw/lms4text2sparql

Finetuning LLMs for the Text2Sparql task

Host: GitHub
URL: https://github.com/aksw/lms4text2sparql
Owner: AKSW
Created: 2024-02-23T09:03:06.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-04-19T08:32:32.000Z (about 2 years ago)
Last Synced: 2025-09-09T02:26:51.674Z (9 months ago)
Language: Python
Size: 3.06 MB
Stars: 9
Watchers: 27
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# LMs 4 Text2SPARQL

This repository contains code to finetune a variety of LMs with fewer than one billion parameters on different datasets.

## How to use

Four evaluation datasets are currently available:
- An organizational demo graph
- Sample data from Coypu
- QALD10
- LC_QUAD

For example, to run the training for 20 epochs on the CoyPu dataset, execute the following:

```bash
python train.py --num-epochs 20 --dataset coypu
```

Training pauses every five iterations to check how the model performs. Each check writes a JSON file to `./results` that contains the generated SPARQL queries as well as the gold standard, using the following naming scheme:
```python
f"{model_checkpoint}_{dataset_name}_{run_id}_{evaluation_step}.json"
```

For example, `Babelscape_mrebel-base_coypu_R01_12.json` contains queries generated by the model Babelscape/mREBEL-base, trained on the CoyPu dataset. The run ID is R01 and it is the 12th evaluation step. The run ID identifies which JSON files belong together during evaluation and can be chosen arbitrarily by the user.
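The naming scheme above can be read back programmatically, for instance to collect all files of one run. The following is a minimal sketch, not code from this repository: the helper names `parse_result_name` and `group_by_run` are hypothetical, and it assumes that the dataset name and the run ID contain no underscores (a dataset name such as `lc_quad` would need special handling).

```python
from collections import defaultdict
from pathlib import Path


def parse_result_name(filename: str) -> dict:
    """Split a result file name into model, dataset, run ID and step.

    Hypothetical helper. Splitting from the right keeps underscores inside
    the model name intact; it assumes the dataset name and the run ID
    themselves contain no underscores.
    """
    stem = Path(filename).stem  # drop the ".json" suffix
    model, dataset, run_id, step = stem.rsplit("_", 3)
    return {"model": model, "dataset": dataset, "run_id": run_id, "step": int(step)}


def group_by_run(filenames):
    """Group file names by (model, dataset, run ID), ordered by evaluation step."""
    groups = defaultdict(list)
    for name in filenames:
        info = parse_result_name(name)
        key = (info["model"], info["dataset"], info["run_id"])
        groups[key].append((info["step"], name))
    # Sort numerically by step so that step 2 comes before step 12.
    return {key: [name for _, name in sorted(items)] for key, items in groups.items()}


if __name__ == "__main__":
    print(parse_result_name("Babelscape_mrebel-base_coypu_R01_12.json"))
```

Sorting on the integer step rather than on the file name avoids the lexicographic ordering in which `..._12.json` would come before `..._2.json`.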

To actually evaluate the results, run the following:

```bash
python eval.py --num-epochs 20 --dataset coypu
```

The dataset parameter tells the eval script which dataset to run the queries on. The number of epochs is specified here as well, so that the script knows how many JSON files to expect per model. The script prints some output that was only used during development; you can ignore it. After the script is done, you will find a file called `total_{dataset}.json` in the results folder, for example `./results/total_coypu.json`, which contains the number of correctly translated questions for each model.
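To compare models after the evaluation, the totals file can be loaded and ranked. This is only a sketch under a strong assumption: the internal layout of `total_{dataset}.json` is not documented here, so the code assumes a flat JSON object that maps each model name to an integer count. The function names `load_totals` and `rank_models` are hypothetical; adjust them if the actual file is structured differently.

```python
import json
from pathlib import Path


def load_totals(dataset: str, results_dir: str = "./results") -> dict:
    """Load ./results/total_{dataset}.json.

    Assumption: the file holds a flat mapping {model_name: correct_count}.
    """
    path = Path(results_dir) / f"total_{dataset}.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def rank_models(totals: dict) -> list:
    """Return (model, count) pairs, highest count first, ties broken by name."""
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for model, correct in rank_models(load_totals("coypu")):
        print(f"{correct:4d}  {model}")
```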

### Docker
Running `make` gives you an overview of all the available tasks.
First, you need to build the Docker image locally.

```bash
docker build -t akws/lms4text2sparql .
```

Then you can run the benchmark for a given dataset with:

```bash
make run-orga
```

## How to cite
If you want to reference our work, please cite the related paper "Leveraging small language models for Text2SPARQL tasks to improve the resilience of AI assistance" by Felix Brei et al., accepted for publication in the [workshop D2R2@ESWC2024](https://2024.d2r2.aksw.org/):
```bibtex
@InProceedings{Brei2024LeveragingSmallLanguage,
author = {Brei, Felix and Frey, Johannes and Meyer, Lars-Peter},
booktitle = {Proceedings of the Third International Workshop on Linked Data-driven Resilience Research 2024 (D2R2'24), colocated with ESWC 2024},
title = {Leveraging small language models for Text2SPARQL tasks to improve the resilience of AI assistance},
year = {2024},
comment = {accepted for publication},
}
```
