{"id":13368087,"url":"https://github.com/eosphoros-ai/DB-GPT-Hub","last_synced_at":"2025-03-12T20:31:11.866Z","repository":{"id":179469468,"uuid":"648474263","full_name":"eosphoros-ai/DB-GPT-Hub","owner":"eosphoros-ai","description":"A repository that contains models, datasets, and fine-tuning techniques for DB-GPT, with the purpose of enhancing model performance  in Text-to-SQL","archived":false,"fork":false,"pushed_at":"2025-02-19T01:51:32.000Z","size":63890,"stargazers_count":1628,"open_issues_count":68,"forks_count":205,"subscribers_count":22,"default_branch":"main","last_synced_at":"2025-03-06T17:53:51.196Z","etag":null,"topics":["database","datasets","fine-tuning","gpt","hacktoberfest","llm","nl2sql","sql","text-to-sql","text2sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eosphoros-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-02T03:58:07.000Z","updated_at":"2025-03-06T10:13:57.000Z","dependencies_parsed_at":"2024-04-19T08:39:07.439Z","dependency_job_id":"883b53c9-8026-44d5-926c-f74cd659d802","html_url":"https://github.com/eosphoros-ai/DB-GPT-Hub","commit_stats":null,"previous_names":["csunny/db-gpt-hub","eosphoros-ai/db-gpt-hub"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eosphoros-ai%2FDB-GPT-Hub","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eosphoros-ai%2FDB-GPT-Hub/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/eosphoros-ai%2FDB-GPT-Hub/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eosphoros-ai%2FDB-GPT-Hub/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eosphoros-ai","download_url":"https://codeload.github.com/eosphoros-ai/DB-GPT-Hub/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243290822,"owners_count":20267790,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["database","datasets","fine-tuning","gpt","hacktoberfest","llm","nl2sql","sql","text-to-sql","text2sql"],"created_at":"2024-07-30T01:00:50.894Z","updated_at":"2025-03-12T20:31:09.471Z","avatar_url":"https://github.com/eosphoros-ai.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Python","📦 Projects","💬 Classic Model","GitHub projects","4. 
Fine-Tuning"],"sub_categories":["大语言对话模型及数据","Fine-tuning","Frameworks"],"readme":"# DB-GPT-Hub: Text-to-SQL parsing with LLMs\n\n\u003cdiv align=\"center\"\u003e\n  \u003cp\u003e\n    \u003ca href=\"https://github.com/eosphoros-ai/DB-GPT\"\u003e\n        \u003cimg alt=\"stars\" src=\"https://img.shields.io/github/stars/eosphoros-ai/db-gpt-hub?style=social\" /\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/eosphoros-ai/DB-GPT-Hub\"\u003e\n        \u003cimg alt=\"forks\" src=\"https://img.shields.io/github/forks/eosphoros-ai/db-gpt-hub?style=social\" /\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://opensource.org/licenses/MIT\"\u003e\n      \u003cimg alt=\"License: MIT\" src=\"https://img.shields.io/badge/License-MIT-yellow.svg\" /\u003e\n    \u003c/a\u003e\n     \u003ca href=\"https://github.com/eosphoros-ai/DB-GPT-Hub/releases\"\u003e\n      \u003cimg alt=\"Release Notes\" src=\"https://img.shields.io/github/release/eosphoros-ai/DB-GPT-Hub\" /\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/eosphoros-ai/DB-GPT-Hub/issues\"\u003e\n      \u003cimg alt=\"Open Issues\" src=\"https://img.shields.io/github/issues-raw/eosphoros-ai/DB-GPT-Hub\" /\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://discord.gg/7uQnPuveTY\"\u003e\n      \u003cimg alt=\"Discord\" src=\"https://dcbadge.vercel.app/api/server/7uQnPuveTY?compact=true\u0026style=flat\" /\u003e\n    \u003c/a\u003e\n  \u003c/p\u003e\n\n\n[**简体中文**](README.zh.md) | [**Discord**](https://discord.gg/7uQnPuveTY) | [**Wechat**](https://github.com/eosphoros-ai/DB-GPT/blob/main/README.zh.md#%E8%81%94%E7%B3%BB%E6%88%91%E4%BB%AC) | [**Huggingface**](https://huggingface.co/eosphoros) | [**Community**](https://github.com/eosphoros-ai/community) | [**Paper**](https://arxiv.org/abs/2406.11434)\n\n\n[**Text2SQL**](README.md) | [**Text2NLU**](src/dbgpt-hub-nlu/README.zh.md) \n\u003c/div\u003e\n\n## 🔥🔥🔥 News\n- Support [Text2NLU](src/dbgpt-hub-nlu/README.zh.md) fine-tuning to improve 
semantic understanding accuracy.\n- Support [Text2GQL](src/dbgpt-hub-gql/README.zh.md) fine-tuning to generate graph query.\n\n## Baseline\n\nText2SQL eval execution accuracy (ex) metric, and we will move this to `src/dbgpt_hub_sql`\n- update time: 2023/12/08\n- metric: execution accuracy (ex)\n- more details refer to [docs/eval-llm-result.md](https://github.com/eosphoros-ai/DB-GPT-Hub/blob/main/docs/eval_llm_result.md)\n\n\u003ctable style=\"text-align: center;\"\u003e\n  \u003ctr\u003e\n    \u003cth style=\"text-align: center;\"\u003eModel\u003c/th\u003e\n    \u003cth\u003eMethod\u003c/th\u003e\n    \u003cth\u003eEasy\u003c/th\u003e\n    \u003cth\u003eMedium\u003c/th\u003e\n    \u003cth\u003eHard\u003c/th\u003e\n    \u003cth\u003eExtra\u003c/th\u003e\n    \u003cth\u003eAll\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr \u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003ebase\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eLlama2-7B-Chat\u003c/td\u003e\n    \u003ctd\u003elora\u003c/td\u003e\n    \u003ctd\u003e0.887\u003c/td\u003e\n    \u003ctd\u003e0.641\u003c/td\u003e\n    \u003ctd\u003e0.489\u003c/td\u003e\n    \u003ctd\u003e0.331\u003c/td\u003e\n    \u003ctd\u003e0.626\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eqlora\u003c/td\u003e\n    \u003ctd\u003e0.847\u003c/td\u003e\n    \u003ctd\u003e0.623\u003c/td\u003e\n    \u003ctd\u003e0.466\u003c/td\u003e\n    \u003ctd\u003e0.361\u003c/td\u003e\n    \u003ctd\u003e0.608\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003ebase\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    
\u003ctd\u003e0\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eLlama2-13B-Chat\u003c/td\u003e\n    \u003ctd\u003elora\u003c/td\u003e\n    \u003ctd\u003e0.907\u003c/td\u003e\n    \u003ctd\u003e0.729\u003c/td\u003e\n    \u003ctd\u003e0.552\u003c/td\u003e\n    \u003ctd\u003e0.343\u003c/td\u003e\n    \u003ctd\u003e0.68\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eqlora\u003c/td\u003e\n    \u003ctd\u003e0.911\u003c/td\u003e\n    \u003ctd\u003e0.7\u003c/td\u003e\n    \u003ctd\u003e0.552\u003c/td\u003e\n    \u003ctd\u003e0.319\u003c/td\u003e\n    \u003ctd\u003e0.664\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003ebase\u003c/td\u003e\n    \u003ctd\u003e0.214\u003c/td\u003e\n    \u003ctd\u003e0.177\u003c/td\u003e\n    \u003ctd\u003e0.092\u003c/td\u003e\n    \u003ctd\u003e0.036\u003c/td\u003e\n    \u003ctd\u003e0.149\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n  \u003ctd\u003eCodeLlama-7B-Instruct\u003c/td\u003e\n    \u003ctd\u003elora\u003c/td\u003e\n    \u003ctd\u003e0.923\u003c/td\u003e\n    \u003ctd\u003e0.756\u003c/td\u003e\n    \u003ctd\u003e0.586\u003c/td\u003e\n    \u003ctd\u003e0.349\u003c/td\u003e\n    \u003ctd\u003e0.702\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eqlora\u003c/td\u003e\n    \u003ctd\u003e0.911\u003c/td\u003e\n    \u003ctd\u003e0.751\u003c/td\u003e\n    \u003ctd\u003e0.598\u003c/td\u003e\n    \u003ctd\u003e0.331\u003c/td\u003e\n    \u003ctd\u003e0.696\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003ebase\u003c/td\u003e\n    \u003ctd\u003e0.698\u003c/td\u003e\n    \u003ctd\u003e0.601\u003c/td\u003e\n    \u003ctd\u003e0.408\u003c/td\u003e\n    \u003ctd\u003e0.271\u003c/td\u003e\n    \u003ctd\u003e0.539\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    
\u003ctd\u003eCodeLlama-13B-Instruct\u003c/td\u003e\n    \u003ctd\u003elora\u003c/td\u003e\n    \u003ctd\u003e0.94\u003c/td\u003e\n    \u003ctd\u003e0.789\u003c/td\u003e\n    \u003ctd\u003e0.684\u003c/td\u003e\n    \u003ctd\u003e0.404\u003c/td\u003e\n    \u003ctd\u003e0.746\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eqlora\u003c/td\u003e\n    \u003ctd\u003e0.94\u003c/td\u003e\n    \u003ctd\u003e0.774\u003c/td\u003e\n    \u003ctd\u003e0.626\u003c/td\u003e\n    \u003ctd\u003e0.392\u003c/td\u003e\n    \u003ctd\u003e0.727\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003ebase\u003c/td\u003e\n    \u003ctd\u003e0.577\u003c/td\u003e\n    \u003ctd\u003e0.352\u003c/td\u003e\n    \u003ctd\u003e0.201\u003c/td\u003e\n    \u003ctd\u003e0.066\u003c/td\u003e\n    \u003ctd\u003e0.335\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eBaichuan2-7B-Chat\u003c/td\u003e\n    \u003ctd\u003elora\u003c/td\u003e\n    \u003ctd\u003e0.871\u003c/td\u003e\n    \u003ctd\u003e0.63\u003c/td\u003e\n    \u003ctd\u003e0.448\u003c/td\u003e\n    \u003ctd\u003e0.295\u003c/td\u003e\n    \u003ctd\u003e0.603\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n  \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eqlora\u003c/td\u003e\n    \u003ctd\u003e0.891\u003c/td\u003e\n    \u003ctd\u003e0.637\u003c/td\u003e\n    \u003ctd\u003e0.489\u003c/td\u003e\n    \u003ctd\u003e0.331\u003c/td\u003e\n    \u003ctd\u003e0.624\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003ebase\u003c/td\u003e\n    \u003ctd\u003e0.581\u003c/td\u003e\n    \u003ctd\u003e0.413\u003c/td\u003e\n    \u003ctd\u003e0.264\u003c/td\u003e\n    \u003ctd\u003e0.187\u003c/td\u003e\n    \u003ctd\u003e0.392\u003c/td\u003e\n  \u003c/tr\u003e\n    \u003ctr\u003e\n    \u003ctd\u003eBaichuan2-13B-Chat\u003c/td\u003e\n    \u003ctd\u003elora\u003c/td\u003e\n    
\u003ctd\u003e0.903\u003c/td\u003e\n    \u003ctd\u003e0.702\u003c/td\u003e\n    \u003ctd\u003e0.569\u003c/td\u003e\n    \u003ctd\u003e0.392\u003c/td\u003e\n    \u003ctd\u003e0.678\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003ctr\u003e\n  \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eqlora\u003c/td\u003e\n    \u003ctd\u003e0.895\u003c/td\u003e\n    \u003ctd\u003e0.675\u003c/td\u003e\n    \u003ctd\u003e0.58\u003c/td\u003e\n    \u003ctd\u003e0.343\u003c/td\u003e\n    \u003ctd\u003e0.659\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n  \u003ctd\u003e\u003c/td\u003e\n  \u003ctd\u003ebase\u003c/td\u003e\n  \u003ctd\u003e0.395\u003c/td\u003e\n  \u003ctd\u003e0.256\u003c/td\u003e\n  \u003ctd\u003e0.138\u003c/td\u003e\n  \u003ctd\u003e0.042\u003c/td\u003e\n  \u003ctd\u003e0.235\u003c/td\u003e\n  \u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003eQwen-7B-Chat\u003c/td\u003e\n  \u003ctd\u003elora\u003c/td\u003e\n  \u003ctd\u003e0.855\u003c/td\u003e\n  \u003ctd\u003e0.688\u003c/td\u003e\n  \u003ctd\u003e0.575\u003c/td\u003e\n  \u003ctd\u003e0.331\u003c/td\u003e\n  \u003ctd\u003e0.652\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eqlora\u003c/td\u003e\n    \u003ctd\u003e0.911\u003c/td\u003e\n    \u003ctd\u003e0.675\u003c/td\u003e\n    \u003ctd\u003e0.575\u003c/td\u003e\n    \u003ctd\u003e0.343\u003c/td\u003e\n    \u003ctd\u003e0.662\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n  \u003ctd\u003e\u003c/td\u003e\n  \u003ctd\u003ebase\u003c/td\u003e\n  \u003ctd\u003e0.871\u003c/td\u003e\n  \u003ctd\u003e0.632\u003c/td\u003e\n  \u003ctd\u003e0.368\u003c/td\u003e\n  \u003ctd\u003e0.181\u003c/td\u003e\n  \u003ctd\u003e0.573\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eQwen-14B-Chat\u003c/td\u003e\n    \u003ctd\u003elora\u003c/td\u003e\n    \u003ctd\u003e0.895\u003c/td\u003e\n    \u003ctd\u003e0.702\u003c/td\u003e\n    \u003ctd\u003e0.552\u003c/td\u003e\n    
\u003ctd\u003e0.331\u003c/td\u003e\n    \u003ctd\u003e0.663\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eqlora\u003c/td\u003e\n    \u003ctd\u003e0.919\u003c/td\u003e\n    \u003ctd\u003e0.744\u003c/td\u003e\n    \u003ctd\u003e0.598\u003c/td\u003e\n    \u003ctd\u003e0.367\u003c/td\u003e\n  \u003ctd\u003e0.701\u003c/td\u003e\n  \u003c/tr\u003e\n    \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003ebase\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n    \u003ctd\u003e0\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eChatGLM3-6b\u003c/td\u003e\n    \u003ctd\u003elora\u003c/td\u003e\n    \u003ctd\u003e0.855\u003c/td\u003e\n    \u003ctd\u003e0.605\u003c/td\u003e\n    \u003ctd\u003e0.477\u003c/td\u003e\n    \u003ctd\u003e0.271\u003c/td\u003e\n    \u003ctd\u003e0.59\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003eqlora\u003c/td\u003e\n    \u003ctd\u003e0.843\u003c/td\u003e\n    \u003ctd\u003e0.603\u003c/td\u003e\n    \u003ctd\u003e0.506\u003c/td\u003e\n    \u003ctd\u003e0.211\u003c/td\u003e\n    \u003ctd\u003e0.581\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n \n\n## Contents\n- [DB-GPT-Hub: Text-to-SQL parsing with LLMs](#db-gpt-hub-text-to-sql-parsing-with-llms)\n  - [Baseline](#baseline)\n  - [Contents](#contents)\n  - [1. What is DB-GPT-Hub](#1-what-is-db-gpt-hub)\n  - [2. Fine-tuning Text-to-SQL](#2-fine-tuning-text-to-sql)\n    - [2.1. Dataset](#21-dataset)\n    - [2.2. Model](#22-model)\n  - [3. Usage](#3-usage)\n    - [3.1. Environment preparation](#31-environment-preparation)\n    - [3.2 Quick Start](#32-quick-start)\n    - [3.3. Data preparation](#33-data-preparation)\n    - [3.4. Model fine-tuning](#34-model-fine-tuning)\n    - [3.5. 
Model Predict](#35-model-predict)\n    - [3.6 Model Weights](#36-model-weights)\n      - [3.6.1 Model and fine-tuned weight merging](#361-model-and-fine-tuned-weight-merging)\n    - [3.7 Model Evaluation](#37-model-evaluation)\n  - [4. RoadMap](#4-roadmap)\n  - [5. Contributions](#5-contributions)\n  - [6. Acknowledgements](#6-acknowledgements)\n  - [7. Citation](#7-citation)\n  - [8. Licence](#8-licence)\n  - [9. Contact Information](#9-contact-information)\n\n## 1. What is DB-GPT-Hub\n\nDB-GPT-Hub is an experimental project that leverages Large Language Models (LLMs) to achieve Text-to-SQL parsing. The project encompasses various stages, including data collection, data preprocessing, model selection and construction, and fine-tuning of model weights. Through these processes, our aim is to enhance Text-to-SQL capabilities while reducing model training costs, thus enabling more developers to contribute to improving Text-to-SQL accuracy. Our ultimate goal is to realize automated question-answering capabilities based on databases, allowing users to execute complex database queries using natural language descriptions.\n\nTo date, we have successfully integrated multiple large models and established a comprehensive workflow that includes data processing, Supervised Fine-Tuning (SFT) model training, prediction output, and evaluation. The code developed for this project is easily reusable within the project itself.\n\nAs of October 10, 2023, we have used this project to fine-tune the open-source 13B-sized model, incorporating more relevant data. Under zero-shot prompts and utilizing [the Spider-based test-suite](https://github.com/taoyds/test-suite-sql-eval), we have achieved an execution accuracy rate of 0.764 for a database with a size of 1.27G. Additionally, the execution accuracy for the database pointed to by [the Spider official website](https://yale-lily.github.io/spider), with a size of 95M, stands at 0.825.\n\n\n## 2. 
Fine-tuning Text-to-SQL\n\nWe enhance Text-to-SQL performance by applying Supervised Fine-Tuning (SFT) to large language models.   \n\n### 2.1. Dataset\n\nThe primary dataset for this project's examples is the **Spider** dataset:\n\n- [SPIDER](https://yale-lily.github.io/spider): A complex cross-domain text2sql dataset containing 10,181 natural language queries and 5,693 SQL queries distributed across 200 separate databases, covering 138 different domains. [Download link](https://drive.google.com/uc?export=download\u0026id=1TqleXec_OykOYFREKKtschzY29dUcVAQ)  \n\nOther text2sql datasets available:   \n\n- [WikiSQL:](https://github.com/salesforce/WikiSQL) A large semantic parsing dataset consisting of 80,654 natural language questions with SQL annotations over 24,241 tables. Each query in WikiSQL is limited to a single table and does not include complex operations such as sorting, grouping, or subqueries.\n- [CHASE](https://xjtu-intsoft.github.io/chase/): A cross-domain, multi-round, interactive Chinese text2sql dataset containing 5,459 multi-round question sequences made up of 17,940 \u003cquery, SQL\u003e pairs across 280 databases from different domains.\n- [BIRD-SQL:](https://bird-bench.github.io/) A large-scale cross-domain text-to-SQL benchmark in English, with a particular focus on large database content. The dataset contains 12,751 text-to-SQL data pairs and 95 databases with a total size of 33.4 GB across 37 professional domains. The BIRD-SQL dataset bridges the gap between text-to-SQL research and real-world applications by exploring three additional challenges: dealing with large and messy database values, external knowledge inference, and optimising SQL execution efficiency.\n- [CoSQL:](https://yale-lily.github.io/cosql) A corpus for building cross-domain conversational text-to-SQL systems. 
It is a conversational version of the Spider and SParC tasks. CoSQL consists of 30k+ turns and 10k+ annotated SQL queries, collected in a Wizard-of-Oz setting from 3k conversations querying 200 complex databases across 138 domains. Each conversation simulates a realistic DB query scenario in which a staff member explores the database as a user and a SQL expert uses SQL to retrieve answers, clarify ambiguous questions, or otherwise inform.\n\n- Following the processing template of [NSQL](https://github.com/NumbersStationAI/NSQL), the datasets underwent basic processing, yielding a [dataset of approximately 200K examples](https://huggingface.co/datasets/Healthy13/Text2SQL/tree/main).\n\n\n\n### 2.2. Model\n\nDB-GPT-Hub currently supports the following base models:\n\n  - [x] CodeLlama\n  - [x] Baichuan2 \n  - [x] LLaMa/LLaMa2\n  - [x] Falcon\n  - [x] Qwen\n  - [x] XVERSE\n  - [x] ChatGLM2\n  - [x] ChatGLM3\n  - [x] internlm\n  - [x] sqlcoder-7b(mistral)\n  - [x] sqlcoder2-15b(starcoder)\n\n\n\n\n\nThe models are fine-tuned with 4-bit quantization using QLoRA (Quantized Low-Rank Adaptation). The minimum hardware requirements are as follows:   \n\n| Model Parameters | GPU RAM | CPU RAM | DISK   |\n| ---------------- | ------- | ------- | ------ |\n| 7b               | 6GB     | 3.6GB   | 36.4GB |\n| 13b              | 13.4GB  | 5.9GB   | 60.2GB |\n  \nThese figures assume all related parameters are set to their minimum, with a batch size of 1 and a max length of 512. Based on experience, setting the length values to 1024 or 2048 gives better performance.\n\n\n## 3. Usage\n\n### 3.1. 
Environment preparation\n\n```\ngit clone https://github.com/eosphoros-ai/DB-GPT-Hub.git\ncd DB-GPT-Hub\nconda create -n dbgpt_hub python=3.10 \nconda activate dbgpt_hub\n\ncd src/dbgpt_hub_sql\npip install -e .\n```\n### 3.2 Quick Start\n\nFirstly, install `dbgpt-hub` with the following command\n\n`pip install dbgpt-hub`\n\nThen, set up the arguments and run the whole process.\n```python\nfrom dbgpt_hub_sql.data_process import preprocess_sft_data\nfrom dbgpt_hub_sql.train import start_sft\nfrom dbgpt_hub_sql.predict import start_predict\nfrom dbgpt_hub_sql.eval import start_evaluate\n\n# Config the input datasets\ndata_folder = \"dbgpt_hub_sql/data\"\ndata_info = [\n        {\n            \"data_source\": \"spider\",\n            \"train_file\": [\"train_spider.json\", \"train_others.json\"],\n            \"dev_file\": [\"dev.json\"],\n            \"tables_file\": \"tables.json\",\n            \"db_id_name\": \"db_id\",\n            \"is_multiple_turn\": False,\n            \"train_output\": \"spider_train.json\",\n            \"dev_output\": \"spider_dev.json\",\n        }\n]\n\n# Config training parameters\ntrain_args = {\n            \"model_name_or_path\": \"codellama/CodeLlama-13b-Instruct-hf\",\n            \"do_train\": True,\n            \"dataset\": \"example_text2sql_train\",\n            \"max_source_length\": 2048,\n            \"max_target_length\": 512,\n            \"finetuning_type\": \"lora\",\n            \"lora_target\": \"q_proj,v_proj\",\n            \"template\": \"llama2\",\n            \"lora_rank\": 64,\n            \"lora_alpha\": 32,\n            \"output_dir\": \"dbgpt_hub_sql/output/adapter/CodeLlama-13b-sql-lora\",\n            \"overwrite_cache\": True,\n            \"overwrite_output_dir\": True,\n            \"per_device_train_batch_size\": 1,\n            \"gradient_accumulation_steps\": 16,\n            \"lr_scheduler_type\": \"cosine_with_restarts\",\n            \"logging_steps\": 50,\n            \"save_steps\": 2000,\n        
    \"learning_rate\": 2e-4,\n            \"num_train_epochs\": 8,\n            \"plot_loss\": True,\n            \"bf16\": True,\n}\n\n# Config predict parameters\npredict_args = {\n            \"model_name_or_path\": \"codellama/CodeLlama-13b-Instruct-hf\",\n            \"template\": \"llama2\",\n            \"finetuning_type\": \"lora\",\n            \"checkpoint_dir\": \"dbgpt_hub_sql/output/adapter/CodeLlama-13b-sql-lora\",\n            \"predict_file_path\": \"dbgpt_hub_sql/data/eval_data/dev_sql.json\",\n            \"predict_out_dir\": \"dbgpt_hub_sql/output/\",\n            \"predicted_out_filename\": \"pred_sql.sql\",\n}\n\n# Config evaluation parameters\nevaluate_args =  {\n            \"input\": \"./dbgpt_hub_sql/output/pred/pred_sql_dev_skeleton.sql\",\n            \"gold\": \"./dbgpt_hub_sql/data/eval_data/gold.txt\",\n            \"gold_natsql\": \"./dbgpt_hub_sql/data/eval_data/gold_natsql2sql.txt\",\n            \"db\": \"./dbgpt_hub_sql/data/spider/database\",\n            \"table\": \"./dbgpt_hub_sql/data/eval_data/tables.json\",\n            \"table_natsql\": \"./dbgpt_hub_sql/data/eval_data/tables_for_natsql2sql.json\",\n            \"etype\": \"exec\",\n            \"plug_value\": True,\n            \"keep_distict\": False,\n            \"progress_bar_for_each_datapoint\": False,\n            \"natsql\": False,\n}\n\n# Run the whole fine-tuning workflow\npreprocess_sft_data(\n      data_folder = data_folder,\n      data_info = data_info\n)\n\nstart_sft(train_args)\nstart_predict(predict_args)\nstart_evaluate(evaluate_args)\n```\n\n### 3.3. Data preparation\n\nDB-GPT-Hub uses the information matching generation method for data preparation, i.e. the SQL + Repository generation method that combines table information. This method combines data table information to better understand the structure and relationships of the data table, and is suitable for generating SQL statements that meet the requirements.  
\n\nDownload the [Spider dataset](https://drive.google.com/uc?export=download\u0026id=1TqleXec_OykOYFREKKtschzY29dUcVAQ). By default, after downloading and extracting the data, place it in the `dbgpt_hub_sql/data` directory, i.e., the path should be `dbgpt_hub_sql/data/spider`.  \n\nFor the data preprocessing part, simply **run the following script**:\n```bash\n## generate train and dev(eval) data\nsh dbgpt_hub_sql/scripts/gen_train_eval_data.sh\n```\n\nIn the directory `dbgpt_hub_sql/data/`, you will find the newly generated training file `example_text2sql_train.json` and testing file `example_text2sql_dev.json`, containing 8659 and 1034 entries respectively. For the data used in subsequent fine-tuning, set the `file_name` parameter in `dbgpt_hub_sql/data/dataset_info.json` to the file name of the training set, such as `example_text2sql_train.json`.\n\n\nThe data in the generated JSON looks something like this:\n```\n    {\n        \"db_id\": \"department_management\",\n        \"instruction\": \"I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\\n\\\"\\n##Instruction:\\ndepartment_management contains tables such as department, head, management. Table department has columns such as Department_ID, Name, Creation, Ranking, Budget_in_Billions, Num_Employees. Department_ID is the primary key.\\nTable head has columns such as head_ID, name, born_state, age. head_ID is the primary key.\\nTable management has columns such as department_ID, head_ID, temporary_acting. 
department_ID is the primary key.\\nThe head_ID of management is the foreign key of head_ID of head.\\nThe department_ID of management is the foreign key of Department_ID of department.\\n\\n\",\n        \"input\": \"###Input:\\nHow many heads of the departments are older than 56 ?\\n\\n###Response:\",\n        \"output\": \"SELECT count(*) FROM head WHERE age  \u003e  56\",\n        \"history\": []\n    }, \n```     \nThe data processing code for `chase`, `cosql` and `sparc` is already included in the project's data processing code. After downloading a dataset from the links above, you only need to uncomment the corresponding entry in `SQL_DATA_INFO` in `dbgpt_hub_sql/configs/config.py`.   \n\n### 3.4. Model fine-tuning\n\nModel fine-tuning supports both the LoRA and QLoRA methods. Run the following command to fine-tune the model. By default, the script passes the `--quantization_bit` parameter and therefore uses QLoRA. To switch to LoRA, simply remove that parameter from the script.\nRun the command:\n\n```bash\nsh dbgpt_hub_sql/scripts/train_sft.sh\n```\n\nAfter fine-tuning, the model weights are saved by default in the adapter folder, specifically in the `dbgpt_hub_sql/output/adapter` directory.   \n\nIf you're using **multi-GPU training and want to utilize DeepSpeed**, you should modify the default content in `train_sft.sh`. Change:\n\n```\nCUDA_VISIBLE_DEVICES=0 python dbgpt_hub_sql/train/sft_train.py \\\n    --quantization_bit 4 \\\n    ...\n```    \nto:\n```\ndeepspeed --num_gpus 2  dbgpt_hub_sql/train/sft_train.py \\\n    --deepspeed dbgpt_hub_sql/configs/ds_config.json \\\n    --quantization_bit 4 \\\n    ...\n```     \n\nIf you need to specify the GPU IDs to use:   \n```\ndeepspeed --include localhost:0,1  dbgpt_hub_sql/train/sft_train.py \\\n    --deepspeed dbgpt_hub_sql/configs/ds_config.json \\\n    --quantization_bit 4 \\\n    ...\n```    \n\nThe omitted parts (…) remain unchanged. 
If you want to change the default DeepSpeed configuration, edit `ds_config.json` in the `dbgpt_hub_sql/configs` directory as needed; the default is ZeRO stage 2.   \n\nThe key fine-tuning parameters `lora_target` and `template` in the script depend on the model, as shown in the following table:   \n\n| model name                                               | lora_target     | template  |\n| -------------------------------------------------------- | --------------- | --------- |\n| [LLaMA-2](https://huggingface.co/meta-llama)             | q_proj,v_proj   | llama2    |\n| [CodeLlama-2](https://huggingface.co/codellama/)         | q_proj,v_proj   | llama2    |\n| [Baichuan2](https://github.com/baichuan-inc/Baichuan2)   | W_pack          | baichuan2 |\n| [Qwen](https://github.com/QwenLM/Qwen-7B)                | c_attn          | chatml    |\n| [sqlcoder-7b](https://huggingface.co/defog/sqlcoder-7b)  | q_proj,v_proj   | mistral   |\n| [sqlcoder2-15b](https://huggingface.co/defog/sqlcoder2)  | c_attn          | default   |\n| [InternLM](https://github.com/InternLM/InternLM)         | q_proj,v_proj   | intern    |\n| [XVERSE](https://github.com/xverse-ai/XVERSE-13B)        | q_proj,v_proj   | xverse    |\n| [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B)         | query_key_value | chatglm2  |\n| [LLaMA](https://github.com/facebookresearch/llama)       | q_proj,v_proj   | -         |\n| [BLOOM](https://huggingface.co/bigscience/bloom)         | query_key_value | -         |\n| [BLOOMZ](https://huggingface.co/bigscience/bloomz)       | query_key_value | -         |\n| [Baichuan](https://github.com/baichuan-inc/baichuan-13B) | W_pack          | baichuan  |\n| [Falcon](https://huggingface.co/tiiuae/falcon-7b)        | query_key_value | -         |\n\n\n\nOther key parameters in `train_sft.sh` are as follows:\n\n\u003e quantization_bit: Indicates whether quantization is applied; valid values are 4 or 8.   
\n\u003e model_name_or_path: The path of the LLM (Large Language Model).   \n\u003e dataset: Specifies the name of the training dataset configuration, corresponding to the outer key value in dbgpt_hub_sql/data/dataset_info.json, such as example_text2sql.  \n\u003e max_source_length: The length of the text input into the model. If computing resources allow, it can be set as large as possible, like 1024 or 2048.      \n\u003e max_target_length: The length of the SQL content output by the model; 512 is generally sufficient.   \n\u003e output_dir: The output path of the Peft module during SFT (Supervised Fine-Tuning), set by default to `dbgpt_hub_sql/output/adapter/` .     \n\u003e per_device_train_batch_size: The size of the batch. If computing resources allow, it can be set larger; the default is 1.   \n\u003e gradient_accumulation_steps: The number of steps for accumulating gradients before an update.   \n\u003e save_steps: The interval, in training steps, at which model checkpoints are saved; it can be set to 100.  \n\u003e num_train_epochs: The number of epochs for training the dataset.   \n\n\n### 3.5. Model Predict\n\nModel predictions are written by default to the `./dbgpt_hub_sql/output/pred/` folder under the project directory (create it with `mkdir` if it does not exist).\n\n```bash\nsh ./dbgpt_hub_sql/scripts/predict_sft.sh\n```\n\nBy default, the script passes the `--quantization_bit` parameter and predicts using QLoRA. Removing it switches to the LoRA prediction method.\nThe value of the parameter `predicted_input_filename` is the test dataset file to predict on. `--predicted_out_filename` is the file name of the model's predicted results.\n\n### 3.6 Model Weights\nYou can find the second set of corresponding model weights on Hugging Face at [hg-eosphoros-ai](https://huggingface.co/Wangzaistone123/CodeLlama-13b-sql-lora). We uploaded the LoRA weights in October; their execution accuracy on the Spider evaluation set reached 0.789.    
\n\n#### 3.6.1 Model and fine-tuned weight merging \n\nIf you need to merge the weights of the trained base model and the fine-tuned Peft module to export a complete model, execute the following model export script:   \n\n```bash\nsh ./dbgpt_hub_sql/scripts/export_merge.sh\n```\n\nBe sure to replace the parameter path values in the script with the paths corresponding to your project.  \n\n### 3.7 Model Evaluation\nTo evaluate model performance on a dataset (the default is the Spider dev dataset), run the following command:\n```bash\npython dbgpt_hub_sql/eval/evaluation.py --plug_value --input Your_model_pred_file\n```\nYou can find our latest evaluation results and some of the experiment results [here](docs/eval_llm_result.md).  \n**Note**: The database that the default code points to is a 95M database downloaded from the [Spider official website](https://yale-lily.github.io/spider). If you need to use the Spider database (size 1.27G) from [test-suite](https://github.com/taoyds/test-suite-sql-eval), first download the database from that link to a custom directory, then run the evaluation command above with an added parameter and value such as `--db Your_download_db_path`.\n\n## 4. Roadmap \n\nWe divide the whole process into three stages:\n\n* Stage 1:\n  * Set up the foundational framework, enabling an end-to-end workflow that encompasses data processing, model SFT (Supervised Fine-Tuning) training, prediction output, and evaluation using multiple large language models (LLMs). 
As of August 4th, 2023, the entire pipeline has been successfully established.\n\n  Currently, we offer support for the following models:\n  - [x] CodeLlama\n  - [x] Baichuan2 \n  - [x] LLaMa/LLaMa2\n  - [x] Falcon\n  - [x] Qwen\n  - [x] XVERSE\n  - [x] ChatGLM2\n  - [x] ChatGLM3\n  - [x] internlm\n  - [x] sqlcoder-7b(mistral)\n  - [x] sqlcoder2-15b(starcoder)\n\n* Stage 2:\n  - [x] Optimize model performance, and support fine-tuning more models in various ways before `20231010`\n  - [x] Optimize `prompts`\n  - [x] Release evaluation results, and open the optimized models to peers.\n* Stage 3:\n  - [ ] Inference speed optimization and improvement   \n  - [ ] Targeted optimization and improvement for business scenarios and Chinese-language performance   \n  - [ ] Optimization based on more papers, such as RESDSQL, combined with our community's sibling project [Awesome-Text2SQL](https://github.com/eosphoros-ai/Awesome-Text2SQL) for further enhancements.  \n\n**If our work has provided even a small measure of assistance to you, please consider giving us a star. Your feedback and support serve as motivation for us to continue releasing more related work and improving our efforts. Thank you!**   \n\n## 5. Contributions\n\nWe warmly invite more individuals to join us and actively engage in various aspects of our project, such as datasets, model fine-tuning, performance evaluation, paper recommendations, and code reproduction. Please don't hesitate to open issues or pull requests (PRs), and we will be proactive in responding to your contributions.\n\nBefore submitting your code, please ensure that it is formatted according to the black style by using the following command: \n```\nblack dbgpt_hub\n```\n\nIf you have time for more detailed type checking and style checking of your code, please use the following commands:\n```\npyright dbgpt_hub\npylint dbgpt_hub\n```\n\nIf you have any questions or need further assistance, don't hesitate to reach out. 
We appreciate your involvement!\n\n## 6. Acknowledgements\n\nOur work builds on numerous open-source contributions. Thanks to the following open-source projects:\n\n* [Spider](https://github.com/ElementAI/spider)\n* [CoSQL](https://yale-lily.github.io/cosql)\n* [Chase](https://xjtu-intsoft.github.io/chase/)\n* [BIRD-SQL](https://bird-bench.github.io/)\n* [LLaMA](https://github.com/facebookresearch/llama/tree/main)\n* [BLOOM](https://huggingface.co/spaces/bigscience/license)\n* [Falcon](https://github.com/hiyouga/LLaMA-Efficient-Tuning/blob/main/LICENSE)\n* [ChatGLM](https://github.com/search?q=ChatGLM\u0026type=repositories)\n* [WizardLM](https://github.com/nlpxucan/WizardLM)\n* [text-to-sql-wizardcoder](https://github.com/cuplv/text-to-sql-wizardcoder)\n* [test-suite-sql-eval](https://github.com/taoyds/test-suite-sql-eval)\n* [LLaMa-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning) \n\nThanks to all the contributors, especially @[JBoRu](https://github.com/JBoRu), who raised the [issue](https://github.com/eosphoros-ai/DB-GPT-Hub/issues/119) that reminded us to add a promising new evaluation method, i.e. Test Suite. As the paper 《SQL-PALM: IMPROVED LARGE LANGUAGE MODEL ADAPTATION FOR TEXT-TO-SQL》 mentioned, \"We consider two commonly-used evaluation metrics: execution accuracy (EX) and test-suite accuracy (TS). EX measures whether the SQL execution outcome matches ground truth (GT), whereas TS measures whether the SQL passes all EX evaluations for multiple tests, generated by database augmentation. Since EX contains false positives, we consider TS as a more reliable evaluation metric\".\n\n## 7. 
Citation\nIf you find `DB-GPT-Hub` useful for your research or development, please cite the following \u003ca href=\"https://arxiv.org/abs/2406.11434\" target=\"_blank\"\u003epaper\u003c/a\u003e:\n\n```bibtex\n@misc{zhou2024dbgpthub,\n      title={DB-GPT-Hub: Towards Open Benchmarking Text-to-SQL Empowered by Large Language Models}, \n      author={Fan Zhou and Siqiao Xue and Danrui Qi and Wenhui Shi and Wang Zhao and Ganglin Wei and Hongyang Zhang and Caigai Jiang and Gangwei Jiang and Zhixuan Chu and Faqiang Chen},\n      year={2024},\n      eprint={2406.11434},\n      archivePrefix={arXiv},\n      primaryClass={cs.DB}\n}\n```\n\n## 8. Licence\n\nThe MIT License (MIT)\n\n## 9. Contact Information\nWe are collaborating as a community, and if you have any ideas regarding our community work, please don't hesitate to get in touch with us. If you're interested in delving into an in-depth experiment and optimizing the DB-GPT-Hub subproject, you can reach out to 'wangzai' within the WeChat group. We wholeheartedly welcome your contributions to making it even better together! \n[![](https://dcbadge.vercel.app/api/server/7uQnPuveTY?compact=true\u0026style=flat)](https://discord.gg/7uQnPuveTY)\n\n[![Star History Chart](https://api.star-history.com/svg?repos=eosphoros-ai/DB-GPT-Hub\u0026type=Date)](https://star-history.com/#eosphoros-ai/DB-GPT-Hub)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feosphoros-ai%2FDB-GPT-Hub","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feosphoros-ai%2FDB-GPT-Hub","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feosphoros-ai%2FDB-GPT-Hub/lists"}