https://github.com/adityabhatt3010/universal-ai-chatbot

A domain-adaptable AI chatbot powered by RAG, FAISS, and LangChain to answer questions from your custom PDFs using HuggingFace LLMs.

# 🤖 Universal AI Chatbot (RAG + FAISS + LangChain)

A **domain-adaptable AI chatbot framework** built using **Retrieval-Augmented Generation (RAG)**, **FAISS**, and **LangChain**, capable of answering questions from **custom document-based knowledge** like cybersecurity books, medical encyclopedias, and more.

This project supports both research (via Jupyter Notebooks) and production deployment (via Python scripts).

![Universal AI ChatBot Cover](https://github.com/user-attachments/assets/ecec53e7-687f-411f-bd3f-b0858e324d11)

---

## 📌 Table of Contents

* [🔍 What is this Chatbot?](#-what-is-this-chatbot)
* [🧠 Key Concepts (RAG, FAISS, etc.)](#-key-concepts-rag-faiss-etc)
* [🛠️ Project Structure](#️-project-structure)
* [⚙️ How It Works](#️-how-it-works-behind-the-scenes)
* [📚 Models Used](#-models-used)
* [🚀 How to Run](#-how-to-run)
* [🪄 Setup Script](#-setup-script)
* [📁 Data & Vectorstore Info](#-data--vectorstore-info)
* [🐋 Docker Support](#-docker-support)
* [🎓 Use Cases](#-use-cases)
* [🙌 Credits](#-credits)

---

## 🔍 What is this Chatbot?

This is a **plug-and-play AI chatbot engine** that retrieves answers from your **own documents**. It currently includes:

* 🧑‍💻 **HackerBot**, grounded in Bug Bounty & Web Hacking books.
* 🏥 **MedicBot**, grounded in Medical Encyclopedias.
* 🧠 A base Python script (`ChatBot.py`) for creating more bots easily.

> The Jupyter notebooks preserve chat logs, which is useful for debugging and audit trails.

---

## 🧠 Key Concepts (RAG, FAISS, etc.)

### 🔁 Retrieval-Augmented Generation (RAG)

Combines **document retrieval** + **LLM generation**:

1. Retrieves the top-k relevant document chunks.
2. Passes them to a language model for generating the final answer.
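
The two stages can be wired together as in the minimal sketch below. Everything in it is illustrative: `rag_answer`, `keyword_retrieve`, `echo_generate`, and the `KNOWLEDGE` list are hypothetical stand-ins, not code from this project or from LangChain. A real pipeline would plug in an embedding-based retriever and an actual LLM call.

```python
from typing import Callable, List

Retriever = Callable[[str, int], List[str]]
Generator = Callable[[str, List[str]], str]


def rag_answer(query: str, retrieve: Retriever, generate: Generator, k: int = 3) -> str:
    """Run the two RAG stages in order: retrieval, then generation."""
    context_chunks = retrieve(query, k)       # 1. fetch the top-k relevant chunks
    return generate(query, context_chunks)    # 2. let the model answer from them


# Hypothetical stand-ins so the sketch runs without a vector store or an LLM.
KNOWLEDGE = [
    "SQL injection abuses unsanitised database queries.",
    "Aspirin is a common pain reliever.",
    "XSS injects scripts into web pages.",
]


def keyword_retrieve(query: str, k: int) -> List[str]:
    """Rank chunks by how many words they share with the query."""
    words = set(query.lower().split())
    ranked = sorted(KNOWLEDGE, key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return ranked[:k]


def echo_generate(query: str, context_chunks: List[str]) -> str:
    """Pretend model: reports what it was asked and the best chunk it received."""
    return f"Answering {query!r} from {len(context_chunks)} chunk(s): {context_chunks[0]}"


rag_result = rag_answer("what is sql injection", keyword_retrieve, echo_generate, k=2)
print(rag_result)
```

Passing the retriever and generator in as callables keeps the two stages independently replaceable, which is the property that makes the chatbot adaptable to new domains.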

### 🔍 FAISS (Facebook AI Similarity Search)

A high-performance library for **vector similarity search**, supporting both exact and approximate nearest-neighbor (ANN) indexes.

Used to:

* Store text chunks as embeddings.
* Retrieve the most relevant ones based on query similarity.
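
Conceptually, this store-then-retrieve cycle reduces to a nearest-neighbour search over vectors. The brute-force sketch below shows what an exact flat index computes. The names `search` and `toy_index` are hypothetical, and the 2-D vectors are hand-made stand-ins for real embeddings. FAISS does the same job with optimized, optionally approximate, index structures.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(index: List[Tuple[str, Sequence[float]]], query_vec: Sequence[float], k: int = 2) -> List[str]:
    """Return the texts of the k stored vectors closest to query_vec."""
    ranked = sorted(index, key=lambda item: l2_distance(item[1], query_vec))
    return [text for text, _ in ranked[:k]]


# Hand-made 2-D vectors standing in for real embeddings.
toy_index = [
    ("chunk about XSS", [0.9, 0.1]),
    ("chunk about SQL injection", [0.8, 0.3]),
    ("chunk about fever", [0.1, 0.9]),
]
nearest = search(toy_index, [1.0, 0.0], k=2)
print(nearest)
```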

### 💡 Semantic Search

Goes **beyond keyword matching**: it uses vector embeddings to find conceptually similar content even when it is phrased differently.

---

## 🛠️ Project Structure

```
Universal-AI-ChatBot/
│
├── data/                 # Place your PDF datasets here
│   └── Instructions.md   # Instructions for dataset placement
├── vectorstore/          # Stores FAISS + pickle index files
│   └── Instructions.md   # Instructions for vector DB
├── HackerBot.ipynb       # Chatbot built on Web Hacking books
├── MedicBot.ipynb        # Chatbot built on a Medical encyclopedia
├── ChatBot.py            # General chatbot template (script version)
├── Setup_env.ps1         # PowerShell script to auto-setup environment
├── requirements.txt
└── README.md
```

---

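## ⚙️ How It Works (Behind the Scenes)

### 🔸 Step 1: Load and Split PDFs

```text
DirectoryLoader → PyPDFLoader → RecursiveCharacterTextSplitter
```

* All `.pdf` files in `/data/` are extracted and split into chunks of size 500 with an overlap of 50, which helps preserve context across splits.
* Sizes are in the splitter's length units: characters by default for `RecursiveCharacterTextSplitter`, tokens only if a token-based length function is configured.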
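
The effect of chunk size and overlap can be seen in a plain sliding-window sketch. This is not `RecursiveCharacterTextSplitter`, which additionally prefers to break on separators such as paragraph and line breaks. `split_with_overlap` is a hypothetical helper that only illustrates the arithmetic, and it counts characters.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])
```

Each window starts `chunk_size - overlap` characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.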

---

### 🔸 Step 2: Create Embeddings & Store in FAISS

```text
text_chunks → MiniLM Embeddings → FAISS.from_documents()
```

* Each chunk is transformed into a vector using MiniLM.
* The FAISS index is saved to `/vectorstore/db_faiss/` as a `.faiss` file and a `.pkl` file.
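
To make "text in, fixed-length vector out" concrete, here is a toy stand-in that hashes words into buckets. It is purely illustrative: `toy_embed` is a hypothetical helper, it captures no meaning, and it is not how MiniLM works. The real model returns 384-dimensional vectors learned from data.

```python
import hashlib
import math
from typing import List


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Map text to a unit-length vector by hashing each word into one of `dim` buckets."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # md5 gives a hash that is stable across runs, unlike the built-in hash().
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


vectors = [toy_embed(chunk) for chunk in ["sql injection basics", "fever and headache"]]
print(len(vectors), len(vectors[0]))
```

The point is the interface: every chunk, whatever its length, becomes a vector of the same dimension, which is what lets a vector index compare them.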

---

### 🔸 Step 3: Query Retrieval & Prompt Assembly

```text
User Query → Embed → Top-3 Match → Inject into Prompt
```

* The user's input is embedded and compared against the FAISS index.
* The top 3 chunks are selected and formatted into a custom prompt.
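
A sketch of the selection and formatting step, assuming the retriever has already produced `(score, chunk)` pairs where a higher score means more relevant. The template wording, `PROMPT_TEMPLATE`, and `build_prompt` are hypothetical; the project's actual prompt may differ.

```python
import heapq
from typing import List, Tuple

# Hypothetical template; only the "don't make up answers" idea comes from this README.
PROMPT_TEMPLATE = (
    "Use only the context below to answer the question.\n"
    "If the answer is not in the context, say you don't know; don't make up answers.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, scored_chunks: List[Tuple[float, str]], k: int = 3) -> str:
    """Keep the k highest-scoring chunks and insert them into the template."""
    top = heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])
    context = "\n---\n".join(chunk for _, chunk in top)
    return PROMPT_TEMPLATE.format(context=context, question=question)


scored = [(0.91, "chunk A"), (0.15, "chunk B"), (0.78, "chunk C"), (0.66, "chunk D")]
prompt = build_prompt("What is XSS?", scored)
print(prompt)
```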

---

### 🔸 Step 4: Generate Answer via LLM

```text
PromptTemplate + Mistral LLM → Final Answer
```

* The prompt is passed to `mistralai/Mistral-7B-Instruct-v0.3` on HuggingFace.
* The prompt carries a strict instruction: "don't make up answers."

---

### 🔸 Step 5: Chat Loop (Script Mode)

```text
while True → input() → RetrievalQA → print()
```

* The interactive command-line chatbot runs until the user types `Exit the Chatbot`.
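
The loop can be sketched as follows. `chat_loop` and its `answer`, `read`, and `write` parameters are hypothetical names. In the real script the `answer` role would be played by the RetrievalQA chain. Exact, case-sensitive matching of the exit phrase is an assumption.

```python
from typing import Callable

EXIT_PHRASE = "Exit the Chatbot"


def chat_loop(answer: Callable[[str], str],
              read: Callable[[str], str] = input,
              write: Callable[[str], None] = print) -> int:
    """Answer questions until the user types the exit phrase; return the number answered."""
    answered = 0
    while True:
        query = read("You: ").strip()
        if query == EXIT_PHRASE:
            write("Goodbye!")
            return answered
        if not query:
            continue  # ignore empty input instead of querying the model
        write("Bot: " + answer(query))
        answered += 1


# Scripted demo so the sketch runs without a terminal or a model.
scripted = iter(["What is XSS?", "", "Exit the Chatbot"])
transcript = []
count = chat_loop(lambda q: f"(answer to {q})", read=lambda _: next(scripted), write=transcript.append)
print(count, transcript)
```

Injecting `read` and `write` keeps the loop testable; with the defaults it behaves like an ordinary `input()`/`print()` console chat.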

---

## 📚 Models Used

### 🧠 `mistralai/Mistral-7B-Instruct-v0.3`

> A lightweight, instruction-tuned 7B parameter model.

* Balances **speed and comprehension**.
* Follows custom prompt instructions like "No small talk."

**Usage:**

```python
from langchain_huggingface import HuggingFaceEndpoint  # assumes the langchain-huggingface package
HuggingFaceEndpoint(repo_id="mistralai/Mistral-7B-Instruct-v0.3", ...)
```

---

### 🧬 `sentence-transformers/all-MiniLM-L6-v2`

> Fast & efficient transformer model for semantic embeddings.

* Converts text into 384-dimensional vectors.
* Ideal for **document retrieval** and similarity scoring.

**Usage:**

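```python
from langchain_huggingface import HuggingFaceEmbeddings  # assumes the langchain-huggingface package
HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
```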

---

## 🚀 How to Run

### ▶️ Using Notebooks (Exploratory Mode)

```bash
jupyter notebook HackerBot.ipynb
```

or

```bash
jupyter notebook MedicBot.ipynb
```

### ▶️ Using Python Script (Production Mode)

```bash
python ChatBot.py
```

### ✅ Manual Environment Setup

```bash
python -m venv venv
.\venv\Scripts\activate        # Windows
# source venv/bin/activate     # macOS/Linux
pip install -r requirements.txt
```

---

## 🪄 Setup Script

To simplify setup on Windows, run the included PowerShell script:

```powershell
.\Setup_env.ps1
```

This script will:

* Create a virtual environment
* Activate it
* Install dependencies silently
* Display a success banner ✅

---

## 📁 Data & Vectorstore Info

**Note:** No copyrighted books or embeddings are provided.

Instead:

* `data/Instructions.md`: Add your own `.pdf` files here.
* `vectorstore/Instructions.md`: Explains how indexes will be **auto-created** when PDFs are processed.

Generated files:

* `index.faiss`: vector similarity data
* `index.pkl`: pickled document store (chunk text and metadata such as document sources)
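
A script can check for both files before deciding whether the PDFs need to be processed again. The sketch below assumes the `vectorstore/db_faiss` location mentioned earlier; `index_is_built` and the constants are hypothetical names, not part of the project.

```python
from pathlib import Path

DB_FAISS_PATH = Path("vectorstore/db_faiss")
INDEX_FILES = ("index.faiss", "index.pkl")


def index_is_built(db_path: Path = DB_FAISS_PATH) -> bool:
    """True only if both generated index files are present in db_path."""
    return all((db_path / name).is_file() for name in INDEX_FILES)


if index_is_built():
    print("Existing index found; it can be loaded instead of re-embedding the PDFs.")
else:
    print("No complete index yet; process the PDFs in data/ to create one.")
```

Because `index.pkl` is a pickle file, only load indexes you created yourself or obtained from a source you trust.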

---

## 🐋 Docker Support

You can now run the Universal-AI-ChatBot inside a Docker container!

### 🛠 Prerequisites

* Make sure Docker is installed and running.
* Verify with:

```bash
docker --version
```

### 🚀 Build and Run

```bash
# Build the Docker image
docker build -t ai-chatbot .

# Run the Docker container with environment variables
# (add -it for an interactive terminal if the container runs the chat loop)
docker run --env-file .env ai-chatbot
```

The `.env` file must contain your Hugging Face token as:

```env
HF_TOKEN=your-token-here
```
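
Inside the container the token arrives as an ordinary environment variable. A script can fail early with a clear message when it is missing, as in this minimal sketch; `get_hf_token` is a hypothetical helper, and how the project actually reads the token is not shown in this README.

```python
import os


def get_hf_token(env=os.environ) -> str:
    """Return the Hugging Face token from the environment or fail with a clear message."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to .env and pass --env-file .env to docker run")
    return token


# Demo with an explicit mapping so the sketch runs without a real token.
print(get_hf_token({"HF_TOKEN": "your-token-here"}))
```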

---

## 🎓 Use Cases

* 🩺 Medical Bots (grounded in medical PDFs)
* 🛡️ Cybersecurity Advisors (for bug bounty, web security)
* 🧠 Legal or Finance Q&A Assistants
* 📄 Compliance Documentation Bots (ISO, SOC2, GDPR, etc.)
* 📘 Educational Assistants (coursebooks, research guides)

---

## 🔁 Visual Pipeline

```mermaid
graph TD
A[PDF Files in /data] --> B[Text Chunking]
B --> C[Embedding Chunks with MiniLM-L6-v2]
C --> D[Store Embeddings in FAISS Vector DB]

E[User Query] --> F[Embed Query with MiniLM-L6-v2]
F --> G[Semantic Search in FAISS]
D --> G

G --> H[Retrieve Top-k Relevant Chunks]

H --> I[Insert Context into Prompt Template]
I --> J[Mistral-7B-Instruct-v0.3]
J --> K[Answer Generated]
K --> L[Display Answer in Chat Loop]
```

---

## 🙌 Credits

> Special Thanks & Shout-out to the community and devs whose work made this possible:

* 🎥 [AIwithHassan on YouTube](https://youtu.be/OP0FYjF-37c?si=HJOGBVR4Izgs_8RM)
* 💻 [GitHub - AIwithhassan/medical-chatbot](https://github.com/AIwithhassan/medical-chatbot)

---

## 🙋 Contribution & Feedback

Feel free to fork, star 🌟, open issues, or contribute new bot variants!

---