{"id":17820193,"url":"https://github.com/ShayanTalaei/CHESS","last_synced_at":"2025-03-18T07:30:48.929Z","repository":{"id":245330778,"uuid":"817915982","full_name":"ShayanTalaei/CHESS","owner":"ShayanTalaei","description":"Contextual Harnessing for Efficient SQL Synthesis","archived":false,"fork":false,"pushed_at":"2024-11-13T21:13:39.000Z","size":8193,"stargazers_count":115,"open_issues_count":2,"forks_count":28,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-13T22:22:44.922Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ShayanTalaei.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-20T17:55:26.000Z","updated_at":"2024-11-13T21:13:44.000Z","dependencies_parsed_at":null,"dependency_job_id":"c8ba70a3-618a-4a34-ac55-24cbe6f5da0e","html_url":"https://github.com/ShayanTalaei/CHESS","commit_stats":null,"previous_names":["shayantalaei/chess"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShayanTalaei%2FCHESS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShayanTalaei%2FCHESS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShayanTalaei%2FCHESS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShayanTalaei%2FCHESS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ShayanTalaei","download_url":"https://codeload.
github.com/ShayanTalaei/CHESS/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244177648,"owners_count":20410993,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-27T17:02:01.964Z","updated_at":"2025-03-18T07:30:48.923Z","avatar_url":"https://github.com/ShayanTalaei.png","language":"Python","funding_links":[],"categories":["💬 Classic Model"],"sub_categories":[],"readme":"# CHESS: Contextual Harnessing for Efficient SQL Synthesis\n\nThis repository contains the code and data for the paper \"CHESS: Contextual Harnessing for Efficient SQL Synthesis.\"\n\nTranslating natural language questions into SQL queries, known as text-to-SQL, is a long-standing research problem. Effective text-to-SQL synthesis can become very challenging due to:\n- (i) The extensive size of database catalogs (descriptions of tables and their columns) and database values,\n- (ii) Reasoning over large database schemas,\n- (iii) Ensuring the functional validity of the generated queries,\n- (iv) Navigating the ambiguities of natural language questions.\n\nWe introduce **CHESS**, a Large Language Model (LLM) based multi-agent framework for efficient and scalable SQL synthesis, comprising four specialized agents, each targeting one of the aforementioned challenges:\n\n1. **Information Retriever (IR)**: Extracts relevant data.\n2. **Schema Selector (SS)**: Prunes large schemas.\n3. **Candidate Generator (CG)**: Generates high-quality candidates and refines queries iteratively.\n4. 
**Unit Tester (UT)**: Validates queries through LLM-based natural language unit tests.\n\nOur framework offers configurable features that adapt to various deployment constraints:\n\n### Key Features\n\n- **Industrial-Scale Database Support**: Using the Schema Selector agent, CHESS efficiently narrows down very large database schemas into manageable sub-schemas, boosting system accuracy by approximately 2% and reducing LLM token usage by 5x.\n- **Privacy-Preserving Performance**: Among methods using open-source models, CHESS achieves state-of-the-art performance, providing a high-performing, privacy-preserving system suitable for industrial deployment.\n- **Scalability**: In settings with high computational budgets, CHESS reaches 71.10% accuracy on the BIRD test set, within 2% of the leading proprietary method, while reducing LLM calls by approximately 83%.\n\n## CHESS\n\n![CHESS Framework](images/chess.jpg)\n\n## Setting up the Environment\n\n1. **Clone the repository**:\n    ```bash\n    git clone https://github.com/ShayanTalaei/CHESS.git\n    cd CHESS\n    ```\n\n2. **Create a `.env` file** in the root directory and add the following configuration:\n    ```bash\n    DATA_MODE=\"dev\"\n    DATA_PATH=\"./data/dev/dev.json\"\n    DB_ROOT_DIRECTORY=\"./data/dev/dev_databases\"\n    DATA_TABLES_PATH=\"./data/dev/dev_tables.json\"\n    INDEX_SERVER_HOST='localhost'\n    INDEX_SERVER_PORT=12345\n\n    OPENAI_API_KEY=\n    GCP_PROJECT=''\n    GCP_REGION='us-central1'\n    GCP_CREDENTIALS=''\n    GOOGLE_CLOUD_PROJECT=''\n    ```\n\n3. **Install required packages**:\n    ```bash\n    pip install -r requirements.txt\n    ```\n\n## Preprocessing\n\nTo retrieve database catalogs and find the most similar database values to a question, preprocess the databases:\n\n1. 
**Run the preprocessing script**:\n    ```bash\n    sh run/run_preprocess.sh\n    ```\n\n    This will create the minhash, LSH, and vector databases for each of the databases in the specified directory.\n\n## Running the Code\n\nAfter preprocessing the databases, generate SQL queries for the BIRD dataset by choosing a configuration:\n\n1. **Run the main script**:\n    ```bash\n    sh run/run_main_ir_cg_ut.sh\n    ```\n\n    or\n\n    ```bash\n    sh run/run_main_ir_ss_ch.sh\n    ```\n\n## Sub-sampled Development Set (SDS)\n\nThe sub-sampled development set (SDS) is a subset of the BIRD dataset with 10% of samples from each database. It is used for ablation studies and is available in `sub_sampled_bird_dev_set.json`.\n\n## Supporting Other LLMs\n\nTo use your own LLM, modify the `get_llm_chain(engine, temperature, base_uri=None)` function and add your LLM in `run/langchain_utils.py`.\n\n## Citation\n\nIf you find this repository helpful, please cite the following paper:\n\n```bibtex\n@article{talaei2024chess,\n  title={CHESS: Contextual Harnessing for Efficient SQL Synthesis},\n  author={Talaei, Shayan and Pourreza, Mohammadreza and Chang, Yu-Chen and Mirhoseini, Azalia and Saberi, Amin},\n  journal={arXiv preprint arXiv:2405.16755},\n  year={2024}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FShayanTalaei%2FCHESS","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FShayanTalaei%2FCHESS","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FShayanTalaei%2FCHESS/lists"}