{"id":27957466,"url":"https://github.com/ashishsalunkhe/text2sql","last_synced_at":"2026-04-15T18:32:01.218Z","repository":{"id":291665646,"uuid":"978339587","full_name":"ashishsalunkhe/Text2SQL","owner":"ashishsalunkhe","description":"Natural language interface for querying clinical data. This project uses Retrieval-Augmented Generation (RAG) with GPT-3.5 to translate user questions into SQL over a subset of the MIMIC-III dataset, enabling clinicians and researchers to extract insights without SQL knowledge.","archived":false,"fork":false,"pushed_at":"2025-05-07T16:00:24.000Z","size":27,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-07T18:13:05.441Z","etag":null,"topics":["clinical-data","datascience","gpt3","healthcare","llm","mimic-iii","rag","retrieval-augmented-generation","semantic-search","sql-generation","sqlite","streamlit","text-to-sql"],"latest_commit_sha":null,"homepage":"https://youtu.be/NzW3CZyuETg","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashishsalunkhe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-05T20:44:11.000Z","updated_at":"2025-05-07T16:00:27.000Z","dependencies_parsed_at":"2025-05-05T23:27:12.116Z","dependency_job_id":"9857820a-ae9e-4f4f-9da0-0c99a7517c9e","html_url":"https://github.com/ashishsalunkhe/Text2SQL","commit_stats":null,"previous_names":["ashishsalunkhe/text2sql"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/ashishsalunkhe%2FText2SQL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashishsalunkhe%2FText2SQL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashishsalunkhe%2FText2SQL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashishsalunkhe%2FText2SQL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashishsalunkhe","download_url":"https://codeload.github.com/ashishsalunkhe/Text2SQL/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252931553,"owners_count":21827112,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clinical-data","datascience","gpt3","healthcare","llm","mimic-iii","rag","retrieval-augmented-generation","semantic-search","sql-generation","sqlite","streamlit","text-to-sql"],"created_at":"2025-05-07T18:13:09.068Z","updated_at":"2026-04-15T18:32:01.164Z","avatar_url":"https://github.com/ashishsalunkhe.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🩺 Text-to-SQL System for MIMIC-III Dataset\n\n**Ashish Salunkhe**\nUniversity of Maryland, College Park\n**Aryaman Paigankar**\nUniversity of Maryland, College Park\n\n---\n## How to reproduce this project?\n\nThis guide will help you set up and run the Text-to-SQL system for querying the MIMIC-III dataset using natural language.\n\n---\n\n### Clone the Repository\n\n```bash\ngit clone https://github.com/your-username/mimic-llm-text2sql.git\ncd 
mimic-llm-text2sql\n```\n\n---\n\n### Set Up the Python Environment\n\n```bash\npython3 -m venv .venv\nsource .venv/bin/activate     # On Windows use: .venv\\Scripts\\activate\npip install -r requirements.txt\n```\n\n---\n\n### Download the Dataset\n\n* Go to [Kaggle – mimic-iii-10k dataset](https://www.kaggle.com/datasets/bilal1907/mimic-iii-10k)\n* Download only the `*_random.csv` files\n* Place them in the following directory:\n\n```bash\ndata/csv/\n```\n\n---\n\n### Set OpenAI API Key\n\n* Create a file named `.streamlit/secrets.toml` at the root of your repo\n* Add your OpenAI key as follows:\n\n```toml\nOPENAI_API_KEY = \"your-api-key-here\"\n```\n\n---\n\n### Build the SQLite DB and Schema Map\n\n```bash\npython app/main.py --question \"What are the most common diagnoses?\"\n```\n\nThis will create `mimic_iii.db` and `schema_map.json` in the `data/` directory.\n\n---\n\n### Run the Streamlit UI App\n\n```bash\nstreamlit run app/ui.py\n```\n\nThis will open a browser where you can ask clinical questions in plain English.\n\n---\n\n### Run via Command-Line (Optional)\n\nYou can also run the pipeline through CLI:\n\n```bash\npython app/main.py --question \"Which lab tests are common in diabetic patients?\"\n```\n\n\n---\n## 🚀 Repo Structure\n\n```\nmimic_text_to_sql/\n├── data/\n│   ├── mimic_iii.db         # SQLite DB from CSVs\n│   ├── schema_map.json      # JSON schema metadata\n│   └── query_log.csv        # Logged questions, SQL, results\n├── app/\n│   ├── main.py              # CLI interface\n│   ├── ui.py                # Streamlit interface\n├── .streamlit/\n│   └── secrets.toml         # API keys (ignored)\n├── requirements.txt\n└── README.md\n```\n---\n\n## 📌 Problem Formulation\n\nLarge Language Models (LLMs) have shown increasing capability in natural language understanding and structured data reasoning. 
One practical application is translating natural language questions into SQL queries to access complex medical datasets like MIMIC-III.\n\nOur project aims to develop a Retrieval-Augmented Generation (RAG) based Text-to-SQL system that enables healthcare professionals or researchers to interact with the MIMIC-III clinical database using plain English queries.\n\nKey challenges addressed:\n\n* Schema complexity\n* Ambiguity in natural language\n* Lack of join/contextual awareness in naive LLMs\n\nWe mitigate these issues via schema-aware metadata retrieval and GPT-based SQL generation.\n\n---\n\n## 🗃️ Dataset Description\n\nWe used the **mimic-III-10k** dataset — a curated subset of the full MIMIC-III clinical dataset containing \\~10,000 patients.\n\n* Source: Beth Israel Deaconess Medical Center (via PhysioNet)\n* Format: 25 CSV tables (\\~6 GB total)\n* Relational schema includes:\n\n  * `PATIENTS`: Demographics\n  * `ADMISSIONS`: Hospital admission logs\n  * `ICUSTAYS`: ICU-level data\n  * `DIAGNOSES_ICD`: ICD-9 medical codes\n\n### Data Ingestion Pipeline\n\n* Loaded CSVs into **SQLite** for fast, structured access\n* Optionally support **PostgreSQL** for scale\n* Explored initial joins using `subject_id`, `hadm_id`, and `icustay_id`\n\n---\n\n## 📊 Descriptive Analysis\n\nWe began by understanding key patient journeys using 4 main tables:\n\n* Explored relationships between `PATIENTS`, `ADMISSIONS`, `ICUSTAYS`, and `DIAGNOSES_ICD`\n* Identified key identifiers for joins: `subject_id`, `hadm_id`\n* Highlighted distribution of diagnoses and ICU visits\n\nWe also set up:\n\n* SQLite database from CSV\n* Initial EDA in Google Colab using Pandas\n* Schema inspection for metadata modeling\n\n---\n\n## 🧠 Methodology: RAG-based LLM System\n\n### 🔧 System Steps:\n\n1. **Metadata Extraction**\n\n   * Generate JSON summaries of table schemas (columns, types, join keys)\n2. 
**Embedding Generation**\n\n   * Use `all-MiniLM-L6-v2` from SentenceTransformers\n   * Encode schema metadata into dense vectors\n3. **Vector DB (ChromaDB)**\n\n   * Store embeddings and enable semantic retrieval\n4. **Retrieval Layer**\n\n   * Given a user question, retrieve top-k relevant table schemas\n5. **Prompt Construction**\n\n   * Inject schema context + user query into GPT-3.5-Turbo prompt\n6. **LLM SQL Generation**\n\n   * Parse GPT output to SQL, execute, and return results\n\nAll components were orchestrated within a modular Python architecture with `main.py` (CLI) and `ui.py` (Streamlit).\n\n\n---\n\n## 📈 Evaluation Strategy\n\n### 🎯 Ground Truth Creation\n\n* Defined 15 clinical questions with gold SQL and results\n* Example: *\"What procedures are most common among deceased patients?\"*\n* Evaluated SQL outputs for correctness and execution success\n\n### 📏 Metrics\n\n| Metric                     | Description                                    |\n| -------------------------- | ---------------------------------------------- |\n| Execution Accuracy         | % of SQL queries that executed without error   |\n| Result Overlap (Jaccard)   | Match between LLM vs. 
ground truth results     |\n| Schema Retrieval Precision | % of correct tables retrieved in top-k context |\n| Prompt Token Size          | Avg tokens used in prompt to GPT-3.5           |\n| Latency / Cost             | Time + API cost per query                      |\n\n---\n\n## 📊 Results Summary\n\n* ✅ Execution Accuracy: **87%** (13/15 queries successful)\n* ✅ Result Overlap (Jaccard): Avg **0.72**\n* ✅ Retrieval hit rate: **90%** relevant tables in top-k\n* ⚠️ Common failure: SQL hallucination in JOINs or WHERE clauses\n\n---\n\n## ⚠️ Challenges \u0026 Takeaways\n\n* Complex schema with repeated identifiers across tables\n* Token limit requires prompt compression / top-k filtering\n* LLMs occasionally hallucinate JOIN conditions\n* Some vague queries required schema-specific disambiguation\n\n---\n\n## 🔭 Future Work\n\n* Fine-tune with healthcare-specific SQL data (MimicSQL, Spider)\n* Add error-handling and user-guided corrections\n* Integrate with PostgreSQL for production-scale queries\n* Explore open-source LLMs with local inference (e.g., SQLCoder)\n\n---\n\n## 📚 Related Work\n\n* MimicSQL: Fine-tuned Text2SQL on MIMIC (Zhang et al., 2023)\n* RAG-based Question Answering (Lewis et al., 2020)\n* Spider Benchmark for cross-domain SQL generation (Yu et al., 2018)\n\n---\n\n## 🔗 References\n\n* Johnson, A., Pollard, T., \u0026 Mark, R. (2016). MIMIC-III Clinical Database. [https://doi.org/10.13026/C2XW26](https://doi.org/10.13026/C2XW26)\n* mimic-III-10k \\[Kaggle]. [https://www.kaggle.com/datasets/bilal1907/mimic-iii-10k](https://www.kaggle.com/datasets/bilal1907/mimic-iii-10k)\n* Lewis, P. et al. (2020). Retrieval-Augmented Generation. NeurIPS 33\n* Yu, T. et al. (2018). Spider Dataset. EMNLP\n* Zhang, H. et al. (2023). MimicSQL. ACL. 
[https://arxiv.org/abs/2305.11921](https://arxiv.org/abs/2305.11921)\n\n---\n\n## 👥 Authors\n\n* **Ashish Salunkhe** — [ashishsalunke.com](https://ashishsalunke.com)\n* **Aryaman Paigankar** — University of Maryland\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashishsalunkhe%2Ftext2sql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashishsalunkhe%2Ftext2sql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashishsalunkhe%2Ftext2sql/lists"}