https://github.com/shubhamahobia/rag-based-custom-medical-dataset-chatbot
A GenAI-powered chatbot that leverages Retrieval-Augmented Generation (RAG) to answer medical questions using information from PDF documents. Built with LangChain, OpenAI GPT, HuggingFace embeddings, and Pinecone vector database.
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/shubhamahobia/rag-based-custom-medical-dataset-chatbot
- Owner: ShubhaMahobia
- Created: 2025-07-15T15:14:20.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-07-21T09:37:10.000Z (3 months ago)
- Last Synced: 2025-07-21T10:29:58.181Z (3 months ago)
- Topics: chatbot, machine-learning, medical-application, openai, pinecone, rag, rag-chatbot, retrieval-augmented-generation, vector-database, vectordb
- Language: Jupyter Notebook
- Homepage: https://custom-medical-dataset-chatbot-xmujphacuklwbub5qjbhe8.streamlit.app/
- Size: 3.81 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🏥 Medical Knowledge Chatbot (RAG-based)
A GenAI-powered chatbot that leverages Retrieval-Augmented Generation (RAG) to answer medical questions using information from PDF documents. Built with LangChain, OpenAI GPT, HuggingFace embeddings, and Pinecone vector database.
---
## 🚀 Architecture Overview
```mermaid
graph TD
A["User Query"] --> B["Embed Query"]
B --> C["Vector Search (Pinecone)"]
C --> D["Retrieve Relevant Chunks"]
D --> E["Combine Chunks + Query"]
E --> F["OpenAI GPT (LLM)"]
F --> G["Generate Answer with Citations"]
G --> H["Display in Streamlit UI"]
```

- **Document Ingestion:** PDFs are split into chunks, embedded using HuggingFace models, and stored in Pinecone.
- **Retrieval:** User queries are embedded and matched against the vector store to fetch relevant context.
- **Generation:** Retrieved context + user query are sent to OpenAI GPT to generate a grounded, cited answer.
- **UI:** Streamlit app for real-time Q&A with source/page references.

---
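To make the ingestion step concrete, here is a minimal sketch of splitting page text into overlapping chunks while keeping the page number for later citations. It uses only the standard library and is not the repository's actual code (which presumably relies on a LangChain text splitter); `Chunk`, `split_pages`, and the default sizes are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int  # 1-based page number, kept so answers can cite it


def split_pages(pages, chunk_size=500, overlap=50):
    """Split each page's text into overlapping character windows.

    `pages` is a list of strings, one per PDF page (assumed to be
    extracted already). Sizes are illustrative, not the project's values.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        for start in range(0, len(text), step):
            piece = text[start:start + chunk_size]
            if piece.strip():
                chunks.append(Chunk(text=piece, page=page_number))
            # Stop once this window reaches the end of the page, so no
            # trailing chunk is fully contained in the previous one.
            if start + chunk_size >= len(text):
                break
    return chunks
```

Each `Chunk` would then be embedded and upserted into the vector index together with its `page` value as metadata.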
## ✨ Features
- **RAG Pipeline:** Combines retrieval and generation for accurate, context-aware answers.
- **Source Attribution:** Every answer includes page-level citations for transparency.
- **Fast Semantic Search:** Pinecone vector DB enables rapid, scalable retrieval.
- **Customizable:** Easily swap models, chunk sizes, or add new documents.
- **Modern UI:** Streamlit interface for interactive chat and index management.

---
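As an illustration of the source-attribution feature, the sketch below turns the metadata of retrieved chunks into a compact citation line. The metadata keys (`source`, `page`) and the `format_sources` helper are assumptions made for this example; the app may store and render citations differently.

```python
from collections import defaultdict


def format_sources(hits):
    """Build a citation line from retrieved-chunk metadata.

    `hits` is an iterable of dicts with a 'source' (file name) and a
    'page' (int). Both keys are hypothetical, chosen for this sketch.
    """
    pages_by_source = defaultdict(set)
    for hit in hits:
        pages_by_source[hit["source"]].add(hit["page"])
    parts = []
    for source in sorted(pages_by_source):
        pages = ", ".join(str(p) for p in sorted(pages_by_source[source]))
        parts.append(f"{source} (p. {pages})")
    return "Sources: " + "; ".join(parts) if parts else "Sources: none"
```

Duplicate pages are collapsed, so a chunk retrieved twice from the same page is cited once.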
## 🛠️ How It Works
1. **Index Creation:**
   - PDFs are processed, chunked, and embedded.
   - Embeddings are stored in Pinecone as a vector index.
2. **Chat Flow:**
   - User asks a question in the Streamlit app.
   - The system retrieves the most relevant document chunks.
   - The question + retrieved context are sent to OpenAI GPT.
   - The answer, with citations, is displayed to the user.

---
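The chat flow can be sketched without any external services. In the example below, a plain cosine-similarity ranking over toy vectors stands in for the Pinecone query, and a simple string template stands in for whatever prompt the app actually sends to the LLM. All function names, metadata keys, and the prompt wording are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Rank (vector, metadata) pairs by similarity; return the best k
    as (score, metadata) tuples. Stand-in for a vector DB query."""
    scored = [(cosine(query_vec, vec), meta) for vec, meta in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def build_prompt(question, hits):
    """Combine retrieved chunks and the question into one prompt string."""
    context = "\n".join(
        f"[{i}] (page {meta['page']}) {meta['text']}"
        for i, (_, meta) in enumerate(hits, start=1)
    )
    return (
        "Answer using only the context below and cite passages as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In the real pipeline, `query_vec` would come from the same embedding model used at indexing time, and the resulting prompt would be passed to the LLM.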
## 📦 Installation & Usage
### 1. Clone the Repository
```bash
git clone https://github.com/shubhamahobia/rag-based-custom-medical-dataset-chatbot.git
cd rag-based-custom-medical-dataset-chatbot
```

### 2. Install Dependencies
```bash
pip install -r requirements.txt
```

### 3. Set Up API Keys
- Create a `.env` file in the root directory:
```
PINECONE_API_KEY=your_pinecone_api_key
OPENAI_API_KEY=your_openai_api_key
```
- Or, for Streamlit Cloud, add them to `.streamlit/secrets.toml`.

### 4. Add Your Medical PDFs
- Place your PDF files in the project root directory.
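Before launching the app, a quick pre-flight check for steps 3 and 4 can catch missing keys or PDFs early. The script below is a hypothetical helper, not part of the repository. It assumes the two keys are already present in the process environment (for example, exported in the shell or loaded from `.env` by a loader such as python-dotenv, which is an assumption about the setup).

```python
import os
from pathlib import Path

# Key names taken from the .env example above.
REQUIRED_KEYS = ("PINECONE_API_KEY", "OPENAI_API_KEY")


def preflight(root=".", env=None):
    """Return (missing_keys, pdf_paths) for the given project root.

    `env` defaults to os.environ; pass a dict to check other settings.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    pdfs = sorted(Path(root).glob("*.pdf"))
    return missing, pdfs
```

Calling `preflight()` from the project root returns the names of any unset keys and the list of PDFs found there; an empty PDF list means there is nothing to index yet.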
### 5. Run the Application
```bash
# Option 1: Let the app handle index creation
streamlit run streamlit_app.py

# Option 2: Manually create the index first
python create_index.py
streamlit run streamlit_app.py
```

---
## 🤖 Tech Stack
- **LangChain** (RAG pipeline)
- **OpenAI GPT** (LLM)
- **HuggingFace Transformers** (Embeddings)
- **Pinecone** (Vector DB)
- **Streamlit** (UI)
- **Python**

---
## 📄 License
MIT License
---
## 🙋‍♂️ Contact
For questions or collaboration, reach out via [LinkedIn](https://www.linkedin.com/in/shubhammahobia/) or open an issue!