https://github.com/ShayanTalaei/CHESS
Contextual Harnessing for Efficient SQL Synthesis
- Host: GitHub
- URL: https://github.com/ShayanTalaei/CHESS
- Owner: ShayanTalaei
- License: apache-2.0
- Created: 2024-06-20T17:55:26.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-11-13T21:13:39.000Z (5 months ago)
- Last Synced: 2024-11-13T22:22:44.922Z (5 months ago)
- Language: Python
- Size: 7.81 MB
- Stars: 115
- Watchers: 2
- Forks: 28
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-Text2SQL
README
# CHESS: Contextual Harnessing for Efficient SQL Synthesis
This repository contains the code and data for the paper "CHESS: Contextual Harnessing for Efficient SQL Synthesis."
Translating natural language questions into SQL queries, known as text-to-SQL, is a long-standing research problem. Effective text-to-SQL synthesis can become very challenging due to:
- (i) The extensive size of database catalogs (descriptions of tables and their columns) and database values,
- (ii) Reasoning over large database schemas,
- (iii) Ensuring the functional validity of the generated queries,
- (iv) Navigating the ambiguities of natural language questions.

We introduce **CHESS**, a Large Language Model (LLM) based multi-agent framework for efficient and scalable SQL synthesis, comprising four specialized agents, each targeting one of the aforementioned challenges:
1. **Information Retriever (IR)**: Extracts relevant data.
2. **Schema Selector (SS)**: Prunes large schemas.
3. **Candidate Generator (CG)**: Generates high-quality candidates and refines queries iteratively.
4. **Unit Tester (UT)**: Validates queries through LLM-based natural language unit tests.

Our framework offers configurable features that adapt to various deployment constraints:
### Key Features
- **Industrial-Scale Database Support**: Using the Schema Selector agent, CHESS efficiently narrows down very large database schemas into manageable sub-schemas, boosting system accuracy by approximately 2% and reducing LLM token usage by 5x.
- **Privacy-Preserving Performance**: Among methods using open-source models, CHESS achieves state-of-the-art performance, providing a high-performing, privacy-preserving system suitable for industrial deployment.
- **Scalability**: In settings with high computational budgets, CHESS reaches 71.10% accuracy on the BIRD test set, within 2% of the leading proprietary method, while reducing LLM calls by approximately 83%.
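
To make the idea of narrowing a large schema into a sub-schema more concrete, here is a minimal sketch. It is not the Schema Selector's actual logic: CHESS uses an LLM for this step, whereas the sketch substitutes plain keyword overlap between the question and table/column names. The function names, the `keep_tables` parameter, and the toy schema are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def prune_schema(question, schema, keep_tables=2):
    """Keep the tables whose names and column names overlap most with the question.

    `schema` maps a table name to its list of column names. Keyword overlap
    is only a stand-in (an assumption) for the LLM-based selection in CHESS.
    """
    q_tokens = tokenize(question)
    scores = {}
    for table, columns in schema.items():
        names = tokenize(table)
        for column in columns:
            names |= tokenize(column)
        scores[table] = len(q_tokens & names)
    # Rank tables by overlap and drop any table with no overlap at all.
    ranked = sorted(schema, key=lambda t: scores[t], reverse=True)
    kept = [t for t in ranked[:keep_tables] if scores[t] > 0]
    return {t: schema[t] for t in kept}


# Hypothetical toy schema, for illustration only.
schema = {
    "schools": ["school_id", "name", "county", "enrollment"],
    "scores": ["school_id", "avg_math", "avg_reading"],
    "staff": ["staff_id", "salary", "hire_date"],
    "buses": ["bus_id", "route", "capacity"],
}
sub_schema = prune_schema(
    "Which county has the highest enrollment per school?", schema
)
print(sub_schema)
```

Only the pruned sub-schema would then be passed to the later stages, which is where the reduction in prompt size comes from.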

## Setting up the Environment
1. **Clone the repository**:
```bash
git clone https://github.com/ShayanTalaei/CHESS.git
cd CHESS
```

2. **Create a `.env` file** in the root directory and add the following configuration:
```bash
DATA_MODE="dev"
DATA_PATH="./data/dev/dev.json"
DB_ROOT_DIRECTORY="./data/dev/dev_databases"
DATA_TABLES_PATH="./data/dev/dev_tables.json"
INDEX_SERVER_HOST='localhost'
INDEX_SERVER_PORT=12345
OPENAI_API_KEY=
GCP_PROJECT=''
GCP_REGION='us-central1'
GCP_CREDENTIALS=''
GOOGLE_CLOUD_PROJECT=''
```

3. **Install required packages**:
```bash
pip install -r requirements.txt
```

## Preprocessing
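
The preprocessing and main scripts depend on the `.env` values from the setup steps, so it can help to confirm that the configured paths exist before running anything. The following is a minimal sketch of such a check, not part of the repository: `parse_env`, `missing_paths`, and `REQUIRED_PATHS` are hypothetical names, and the parser only handles simple `KEY=VALUE` lines.

```python
import os


def parse_env(text):
    """Parse simple KEY=VALUE lines, stripping surrounding quotes."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Keys from the .env example whose values are expected to be paths.
REQUIRED_PATHS = ("DATA_PATH", "DB_ROOT_DIRECTORY", "DATA_TABLES_PATH")


def missing_paths(env, root="."):
    """Return the path-valued keys that are unset or do not exist under root."""
    missing = []
    for key in REQUIRED_PATHS:
        value = env.get(key, "")
        if not value or not os.path.exists(os.path.join(root, value)):
            missing.append(key)
    return missing


example = """
DATA_MODE="dev"
DATA_PATH="./data/dev/dev.json"
DB_ROOT_DIRECTORY="./data/dev/dev_databases"
DATA_TABLES_PATH="./data/dev/dev_tables.json"
INDEX_SERVER_PORT=12345
"""
env = parse_env(example)
# Lists whichever configured paths are absent from the current directory.
print(missing_paths(env))
```

In practice the text would come from the real file, for example `parse_env(open(".env").read())`, run from the repository root.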
To retrieve database catalogs and find the most similar database values to a question, preprocess the databases:
1. **Run the preprocessing script**:
```bash
sh run/run_preprocess.sh
```

This will create the minhash, LSH, and vector databases for each of the databases in the specified directory.
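
For readers unfamiliar with MinHash, the sketch below illustrates the general idea behind finding the database value most similar to a keyword from a question. It is a standard-library illustration under stated assumptions, not the repository's preprocessing code: the character 3-gram shingling, the 64 hash functions, and all function names are choices made here, and the LSH indexing step that avoids comparing against every value is omitted.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_perm=64):
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(
                    item.encode("utf-8"), digest_size=8, salt=salt
                ).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def most_similar(keyword, values, num_perm=64):
    """Return the value whose MinHash signature best matches the keyword's."""
    query = minhash_signature(shingles(keyword), num_perm)
    scored = [
        (estimated_jaccard(query, minhash_signature(shingles(v), num_perm)), v)
        for v in values
    ]
    return max(scored)[1]


# Hypothetical column values and a misspelled keyword from a question.
values = ["Los Angeles", "San Francisco", "Sacramento"]
print(most_similar("Los Angelas", values))
```

An LSH index would group signatures into bands so that only values sharing a band with the query are compared, which matters once a column holds many distinct values.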
## Running the Code
After preprocessing the databases, generate SQL queries for the BIRD dataset by choosing a configuration:
1. **Run the main script**:
```bash
sh run/run_main_ir_cg_ut.sh
```

or
```bash
sh run/run_main_ir_ss_cg.sh
```

## Sub-sampled Development Set (SDS)
The sub-sampled development set (SDS) is a subset of the BIRD dataset with 10% of samples from each database. It is used for ablation studies and is available in `sub_sampled_bird_dev_set.json`.
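
The sketch below shows one way a per-database 10% sample of this kind could be drawn. It is an illustration only and does not reproduce how `sub_sampled_bird_dev_set.json` was built: the `db_id` field name, the fixed seed, the rounding rule, and the minimum of one sample per database are all assumptions.

```python
import random
from collections import defaultdict


def subsample_by_database(samples, fraction=0.10, seed=0):
    """Draw roughly `fraction` of the samples from each database.

    Each sample is a dict with a "db_id" key (assumed field name). Every
    database contributes at least one sample.
    """
    rng = random.Random(seed)
    by_db = defaultdict(list)
    for sample in samples:
        by_db[sample["db_id"]].append(sample)

    subset = []
    for db_id in sorted(by_db):
        group = by_db[db_id]
        k = max(1, round(len(group) * fraction))
        subset.extend(rng.sample(group, k))
    return subset


# Hypothetical toy data: 50 questions on one database, 30 on another.
toy = [{"db_id": "db_a", "question_id": i} for i in range(50)]
toy += [{"db_id": "db_b", "question_id": i} for i in range(50, 80)]

subset = subsample_by_database(toy)
print(len(subset))
```

Sampling within each database, rather than across the whole set, keeps every database represented in proportion to its size, which is useful when ablation results are compared per database.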
## Supporting Other LLMs
To use your own LLM, modify the `get_llm_chain(engine, temperature, base_uri=None)` function and add your LLM in `run/langchain_utils.py`.
## Citation
If you find this repository helpful, please cite the following paper:
```bibtex
@article{talaei2024chess,
title={CHESS: Contextual Harnessing for Efficient SQL Synthesis},
author={Talaei, Shayan and Pourreza, Mohammadreza and Chang, Yu-Chen and Mirhoseini, Azalia and Saberi, Amin},
journal={arXiv preprint arXiv:2405.16755},
year={2024}
}
```