# SQLDRILLER: Automated Validating and Fixing of Text-to-SQL Translation with Execution Consistency

SQLDRILLER is a correctness-checking tool for Text-to-SQL translation.
It introduces and applies **Execution Consistency** to automatically detect and fix errors in Text-to-SQL datasets and in Text-to-SQL model inference results, thereby improving model accuracy.
It is presented in the paper "Automated Validating and Fixing of Text-to-SQL Translation with Execution Consistency" (SIGMOD '25, [paper link](https://dl.acm.org/doi/10.1145/3725271)).

Contact: yangyicun@sjtu.edu.cn

## Repaired Text-to-SQL Datasets Available
Preview `./data/bird/opt/{train/dev}.json` and `./data/spider/opt/{train/dev/test}.json` for the repaired Text-to-SQL datasets.
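For a quick look at these files, a minimal Python sketch along the following lines can load one of them and list the fields of its first entry; the field names themselves are not assumed here, only that the file is a JSON array of case objects:
```python
import json

# Path to one of the repaired dataset files; adjust the benchmark/split as needed.
path = "./data/spider/opt/dev.json"

with open(path, encoding="utf-8") as f:
    cases = json.load(f)

print(f"{len(cases)} cases in {path}")
# Show which fields each case carries (no particular schema assumed).
if isinstance(cases, list) and cases:
    print(sorted(cases[0].keys()))
```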

[//]: # (## CodeBase)

## Prerequisites
### 1. Environment Setup
Basic requirements:
- Ubuntu 20.04 is preferred; other Linux distributions should also work.
- Anaconda, Python 3.10 (any version above 3.9 should work)
- Docker and Docker Compose (to run VeriEQL for SQL counterexample generation)
- Java JDK 17 and Gradle 7.3.3 (optional; to run SQLSolver for SQL equivalence checking)

Then, run the following commands to set up:
```shell
# Create and activate Python environment
conda create -n SQLDriller python=3.10
source activate SQLDriller
# If this fails, try this command instead:
# `conda activate SQLDriller`
python -m pip install --upgrade pip
pip install -r requirements.txt

# Download nltk lib
python nltk_downloader.py

# Set up the Docker container for the VeriEQL solver
# (applies small modifications to two files of the original code)
export CURRENT_PATH=$(pwd)
cd ./third_party/ce_gen/solver/
git clone git@github.com:VeriEQL/VeriEQL.git verieql
cp ./modification/main.py ./verieql/main.py
cp ./modification/cli_within_timeout.py ./verieql/parallel/cli_within_timeout.py
sh verieql.sh
cd $CURRENT_PATH

# Clone the repo used for model accuracy evaluation
cd ./third_party/
git clone git@github.com:yicun0720/SD-test-suite-accuracy.git test-suite-accuracy
cd $CURRENT_PATH

# Grant execute permission to the shell scripts
chmod 755 ./click_to_run/*.sh
```

### 2. OpenAI Key Configuration
Create a `.env` file in the root directory of this repo and replace the placeholders below with your own API key information:
```shell
touch .env
echo 'OPENAI_API_KEY = "{your-api-key}"' >> .env
echo 'OPENAI_API_BASE = "{your-api-url}"' >> .env
```
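To verify that the keys are picked up, a small sketch like the following can read them back; it assumes the `.env` file is parsed with `python-dotenv` (which handles the `KEY = "value"` format above), not that this is how the repository itself loads it:
```python
import os

from dotenv import load_dotenv  # assumes the python-dotenv package is installed

# Read .env from the repo root and populate os.environ.
load_dotenv()

# Avoid echoing the secret itself; just confirm the variables are visible.
print("OPENAI_API_KEY set:", bool(os.getenv("OPENAI_API_KEY")))
print("OPENAI_API_BASE:", os.getenv("OPENAI_API_BASE"))
```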

### 3. Resource Downloading
We have prepared the following resources for the experiments; download them into the project root directory `./`:
- Download the prepared resource files for SQLDriller quick-start and result preview: [link](https://drive.google.com/file/d/1FuK0PCCccmjyJQAiE-nlyIKZ2NMmL61y/view?usp=sharing).
- Download the pre-generated SQLite files for counterexample generation, accuracy evaluation, and the NL execution microbenchmark: [link](https://drive.google.com/file/d/1k6JTWTWEqgFvF1x1VduWQR26IgTERYv_/view?usp=sharing).
- Download the SQLite files for each schema of each benchmark: [Spider link](https://drive.google.com/file/d/1-D-B389hBsr8ygzpoP5rr2_g8OohTsVS/view?usp=sharing), [BIRD link](https://drive.google.com/file/d/1XOXkmph5NYdMFkVLTC2BKASRfbIf4a2G/view?usp=sharing), [Beaver link](https://drive.google.com/file/d/1omf0uDAtXztwJceBPGMQC5PfdWkB-qN3/view?usp=sharing)

Unzip them with:
```shell
unzip prepared.zip -d .
unzip dbs.zip -d .
unzip schema_for_sample_values_spider.zip -d ./data/spider/
unzip schema_for_sample_values_bird.zip -d ./data/bird/
unzip schema_for_sample_values_beaver.zip -d ./data/beaver/
```

## Result Previewing
Since running SQLDriller on the full datasets and re-training and re-evaluating models are time-consuming and costly, we provide our evaluated results for quick previewing.
The results are in the downloaded `./prepared/` directory:
```
./prepared/
|-- spider/
|   |-- accuracy/
|   |   |-- eval_res/                    # Accuracy results of each model
|   |   |   |-- test/
|   |   |   |   |-- pred_refined_train/  # Accuracy of models trained on the fixed training set
|   |   |   |   |   |-- original_test/   # Accuracy on the original test set
|   |   |   |   |   |   |-- ...
|   |   |   |   |   |-- refined_test/    # Accuracy on the fixed test set
|   |   |   |   |   |   |-- {model_name}.tsv
|   |   |   |   |   |   |-- {model_name}_SQLDRILLER.tsv  # Accuracy when SQLDriller further optimizes model inference
|   |   |   |   |-- pred_original_train/ # Accuracy of models trained on the original training set
|   |   |   |   |   |-- ...
|   |   |-- pred/                        # SQL predictions of each model
|   |-- dataset_refine/
|   |   |-- issues/                      # Logs of checking each Text-to-SQL case in the dataset
|   |   |-- logs/                        # Logs of LLM output when checking each Text-to-SQL case
|   |   |-- sqls/                        # SQL candidate sets pre-generated by the base model, used for counterexample generation and candidate selection
|   |-- microbench/
|   |   |-- spider_test_{LLM_name}/      # NL execution accuracy and logs
|   |   |-- ...
|-- bird/
|   |-- ...
|-- stat.xlsx                            # Summary of the dataset refining results
```
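For a quick glance at the summary, a sketch along these lines can dump `stat.xlsx`; it assumes `pandas` and `openpyxl` are available in your environment and assumes nothing about the sheet layout:
```python
import pandas as pd  # assumed available, together with openpyxl for .xlsx support

# Load every sheet of the dataset-refining summary and preview its first rows.
sheets = pd.read_excel("./prepared/stat.xlsx", sheet_name=None)
for name, df in sheets.items():
    print(f"=== sheet: {name} ===")
    print(df.head())
```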

For detailed guides to reproducing these results, see the `Scripts to Run` section below.

## Scripts to Run

### 1. Text-to-SQL Error Study

For the sampled cases in the training datasets, view the files `data/spider/opt/train_sampled_all.json` and `data/bird/opt/train_sampled_all.json` to compare the original SQLs with our fixed SQLs. A `None` value for the key "fixed" means the original SQL is correct.
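As a rough illustration (assuming the file is a JSON array of case objects and relying only on the "fixed" convention described above), the number of fixed cases among the samples could be counted like this:
```python
import json

# Count how many sampled Spider training cases received a fixed SQL;
# a null "fixed" field marks a case whose original SQL is already correct.
with open("./data/spider/opt/train_sampled_all.json", encoding="utf-8") as f:
    cases = json.load(f)

fixed = sum(1 for c in cases if c.get("fixed") is not None)
print(f"{fixed} of {len(cases)} sampled cases were fixed")
```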

To reproduce the error-study statistics, run the following script:
```shell
./click_to_run/err_study.sh
```
The results are written to `./results/study/` (refer to **Figures 2 and 3 and Table 1** in the paper):
```
./results/study/error/
|-- Table1_error_rate_by_difficulty.txt
|-- Figure2_error_rate_per_schema.txt
|-- Figure2_error_rate_per_schema_cdf.pdf
|-- Figure3_Jaccard_similarity.pdf # Jaccard similarity of original and fixed SQLs in sampled error cases
```

### 2. Natural Language (NL) Execution

Run the following scripts to perform the NL execution tasks of Spider and BIRD on the generated database instances.
(The database instances for NL execution are provided in `./dbs/microbench/{benchmark}/`.)
```shell
./click_to_run/nl_exec.sh spider gpt-4o
./click_to_run/nl_exec.sh bird gpt-4o

# You can also try other LLMs, which show similar accuracy (Table 3 in the paper).
./click_to_run/nl_exec.sh spider gpt-4-turbo
./click_to_run/nl_exec.sh bird gpt-4-turbo

./click_to_run/nl_exec.sh spider o1-preview
./click_to_run/nl_exec.sh bird o1-preview
```
- The 1st parameter represents the benchmark name (Options: `spider`, `bird`).
- The 2nd parameter represents the LLM to use (Options: `gpt-4o`, `gpt-4-turbo`, `o1-preview`). It defaults to `gpt-4o` due to its comparable performance and low cost and latency.

All related results are written to `./results/nl_exec/{benchmark}_{LLM_name}/`. Check `nl_exec_accuracy.txt` in each subdirectory and compare with **Table 3** in the paper:
```
./results/nl_exec/{benchmark}_{LLM_name}/
|-- exec_res/ # Execution logs of each case
|-- nl_exec_accuracy.txt # Results of NL execution accuracy on this benchmark and LLM
```
Note that the script evaluates only `gpt-4o` by default to limit time and monetary overhead, since the other LLMs show comparable NL execution accuracy (**Table 3** in the paper) while `gpt-4o` has relatively low cost and latency.
You can also experiment by changing the 2nd parameter to one of the other LLMs listed above.
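After one or more runs, a small convenience sketch like the following prints every `nl_exec_accuracy.txt` found under `./results/nl_exec/`; it assumes nothing about the file's internal format:
```python
from pathlib import Path

# Print the NL execution accuracy summary of every finished benchmark/LLM run.
for acc_file in sorted(Path("./results/nl_exec").glob("*/nl_exec_accuracy.txt")):
    print(f"=== {acc_file.parent.name} ===")
    print(acc_file.read_text(encoding="utf-8").strip())
```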

### 3. Dataset Refine and Model Accuracy Evaluation

#### 3.1 Dataset Refine

##### 3.1.1 SQLDriller itself
Run the following commands to ***detect and fix*** errors in the datasets.
```shell
./click_to_run/dataset_refine.sh spider train
./click_to_run/dataset_refine.sh bird train
```
- The 1st parameter represents the benchmark name (Options: `spider`, `bird`).
- The 2nd parameter represents the dataset split (Options: `train`, `dev`, `test`). Here we only fix the train sets; the dev and test sets have already been manually checked for accuracy evaluation, as described in the paper.

This script launches 8 processes in parallel (taking about 10-15 hours to finish). You can check the progress of each process at any time by running:
```shell
./click_to_run/dataset_refine_view_progress.sh {spider/bird} train
```
If this script reports that some processes were aborted and need to be resumed, resume them by running the command it recommends, for example:
```shell
./click_to_run/dataset_refine.sh {spider/bird} train --resume {partition_id}
```

##### 3.1.2 LLM Consistency Baseline
Run the following commands to do the same using the LLM consistency-based baseline:
```shell
./click_to_run/dataset_refine_baseline.sh spider train
./click_to_run/dataset_refine_baseline.sh bird train
```

##### 3.1.3 Evaluate Them All
Then run the following commands to evaluate the effectiveness of SQLDriller and the baseline at error detection and fixing:
```shell
./click_to_run/dataset_refine_eval.sh spider
./click_to_run/dataset_refine_eval.sh bird
```

The results are shown in `./results/dataset_refine/{benchmark}_train/`:
```
./results/dataset_refine/{benchmark}_train/
|-- SQLDriller/                        # SQLDriller's results
|   |-- exec_res/                      # Logs of checking each Text-to-SQL case in the dataset
|   |   |-- 0/                         # A subdirectory for each case id
|   |   |   |-- ce.txt                 # Generated counterexample for the case (qtext form)
|   |   |   |-- gold_pred_ce_res.json  # Gold and predicted SQLs' execution results on the counterexamples
|   |   |   |-- pred_ce_res.json       # Predicted SQLs' execution results on counterexamples generated among them, used when their scores tie (see the paper for details)
|   |   |-- 1/
|   |   |-- ...
|   |-- modified_gold.tsv              # The fixed gold SQLs of the fixed cases
|   |-- modified_gold_tagged.tsv       # The fixed gold SQLs of the fixed cases, tagged with fix results
|   |-- train.json                     # The fixed dataset file
|-- LLMConsis_baseline/                # Baseline's results
|   |-- exec_res/                      # Logs of checking each Text-to-SQL case with the baseline
|   |   |-- ...
|   |-- modified_gold.tsv              # The fixed gold SQLs of the fixed cases
|   |-- modified_gold_tagged.tsv       # The fixed gold SQLs of the fixed cases, tagged with fix results
|-- statistics.txt                     # Statistics of the effectiveness of SQLDriller and the LLM consistency baseline on error detection and fixing
```
In `./results/dataset_refine/{benchmark}_train/`, view `statistics.txt` for a summary of the effectiveness evaluation (refer to **Tables 5 and 6** in the paper).

Then, `./results/dataset_refine/{benchmark}_train/SQLDriller/train.json` can be used to fine-tune existing Text-to-SQL models.

#### 3.2 Model Accuracy Evaluation

Run the following commands to ***use SQLDriller to improve model inference accuracy directly***:
```shell
./click_to_run/inference_optimize.sh spider test dail
./click_to_run/inference_optimize.sh spider test din
./click_to_run/inference_optimize.sh spider test resd
./click_to_run/inference_optimize.sh spider test graphix-T5

./click_to_run/inference_optimize.sh bird dev sftcodes
./click_to_run/inference_optimize.sh bird dev codes
```
- The 1st parameter represents the benchmark name (Options: `spider`, `bird`).
- The 2nd parameter represents the dataset split: the `test` set for Spider and the `dev` set for BIRD.
- The 3rd parameter represents the model. Each evaluated model has a prediction file with multiple candidate SQLs for each test case.

Run the following commands to do the same using the LLM consistency-based baseline:
```shell
./click_to_run/inference_optimize_baseline.sh spider test dail
./click_to_run/inference_optimize_baseline.sh spider test din
./click_to_run/inference_optimize_baseline.sh spider test resd
./click_to_run/inference_optimize_baseline.sh spider test graphix-T5

./click_to_run/inference_optimize_baseline.sh bird dev sftcodes
./click_to_run/inference_optimize_baseline.sh bird dev codes
```

Then run the following command to summarize the accuracy improvements of all the evaluated models:
```shell
./click_to_run/inference_accuracy_eval.sh
```

In `./results/inference/`, view `accuracy_improvement.tsv` for a summary of accuracy improvements (refer to **Table 4** in the paper) and `Figure11_accuracy_improvement_breakdown.pdf` for accuracy improvement breakdown results (refer to **Figure 11** in the paper).
View `./results/inference/{model_name}/*.tsv` for the accuracy evaluation results of each model (the last line of each .tsv file shows the overall accuracy):
```
./results/inference/
|-- {model_name}/                        # Accuracy results of each model
|   |-- {model_name}_original_train.tsv
|   |-- {model_name}_refined_train.tsv
|   |-- {model_name}_opt_SQLDriller.tsv
|   |-- {model_name}_opt_llmconsis.tsv
|-- accuracy_improvement.tsv
|-- Figure11_accuracy_improvement_breakdown.pdf
```

Note: for each model, four accuracy results are produced (a sketch for previewing them follows this list):
- `{model_name}_original_train.tsv`: accuracy with the original train set
- `{model_name}_refined_train.tsv`: accuracy with the refined train set
- `{model_name}_opt_SQLDriller.tsv`: accuracy with the refined train set plus SQLDriller's optimization of model inference
- `{model_name}_opt_llmconsis.tsv`: accuracy with the refined train set plus the LLM consistency baseline's optimization of model inference
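As a convenience, a minimal sketch like the following prints the last line (the overall accuracy) of each of the four files for one model; the model name below is just a placeholder for any evaluated model:
```python
from pathlib import Path

model = "dail"  # placeholder; substitute any evaluated model name
settings = ["original_train", "refined_train", "opt_SQLDriller", "opt_llmconsis"]

for setting in settings:
    tsv = Path(f"./results/inference/{model}/{model}_{setting}.tsv")
    if not tsv.exists():
        print(f"{tsv} not found; run the scripts above first")
        continue
    # The last non-empty line holds the overall accuracy for this setting.
    lines = [line for line in tsv.read_text(encoding="utf-8").splitlines() if line.strip()]
    print(f"{tsv.name}: {lines[-1]}")
```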

## Reference
```
@article{10.1145/3725271,
author = {Yang, Yicun and Wang, Zhaoguo and Xia, Yu and Wei, Zhuoran and Ding, Haoran and Piskac, Ruzica and Chen, Haibo and Li, Jinyang},
title = {Automated Validating and Fixing of Text-to-SQL Translation with Execution Consistency},
year = {2025},
issue_date = {June 2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {3},
number = {3},
url = {https://doi.org/10.1145/3725271},
doi = {10.1145/3725271},
journal = {Proc. ACM Manag. Data},
month = jun,
articleno = {134},
numpages = {28},
keywords = {SQL query equivalence, execution consistency, language model, natural language interface for databases, text-to-SQL}
}
```