https://github.com/mjennings061/vgs-chatbot
Search engine for 2FTS documents.
- Host: GitHub
- URL: https://github.com/mjennings061/vgs-chatbot
- Owner: mjennings061
- Created: 2023-12-16T08:28:36.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-11-29T08:50:32.000Z (7 months ago)
- Last Synced: 2025-12-01T10:12:37.249Z (7 months ago)
- Topics: aviation, chatbot, llm, streamlit
- Language: Python
- Homepage: https://vgs-chatbot.streamlit.app/
- Size: 1.24 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# VGS Chatbot
A Streamlit application that lets RAF 2FTS instructors explore Viking training material through a knowledge-graph-assisted retrieval-augmented generation (RAG) pipeline. Admins can upload documents, the chatbot cites its answers, and MongoDB Atlas stores the full corpus.
## Features
- Streamlit login and registration backed by `chatbot.users` with hashed passwords.
- Document ingestion for PDF and DOCX files with progress reporting; sources live in GridFS.
- Automatic section detection, ~900 character chunking, and FastEmbed embeddings.
- Lightweight knowledge graph that links keyphrases to chunk candidates.
- Retrieval that fuses Atlas Vector Search, Atlas Search (BM25), and graph priors before answering.
- Optional OpenAI `gpt-4o-mini` generation with an extractive fallback when the API key is absent.
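The chunking step in the list above can be pictured with a minimal sketch. It assumes paragraph-aware packing into a budget of about 900 characters; the function name `chunk_text` and the splitting rule are illustrative and not taken from the repository.

```python
def chunk_text(text: str, max_chars: int = 900) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # Hard-split any single paragraph that exceeds the budget on its own.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        # Try to append the paragraph to the chunk being built.
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be passed to FastEmbed for embedding; that call is omitted here so the sketch stays dependency-free.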
## Prerequisites
- Python 3.13
- MongoDB Atlas cluster with Search and Vector Search enabled
## Setup
Create and activate a virtual environment, install dependencies, and prepare env vars.
macOS/Linux
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
```
Windows (PowerShell)
```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
Copy-Item .env.example .env
```
Edit `.streamlit/secrets.toml` (or `.env`) with:
- `MONGO_URI` – full Atlas connection string (for example `mongodb+srv://user:pass@cluster0.example.mongodb.net/?retryWrites=true&w=majority&appName=vgs-chatbot`)
- `OPENAI_API_KEY` – optional; enables generative answers instead of the extractive fallback
- `LOG_LEVEL` – optional; adjusts app logging (`INFO` by default)
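As a rough illustration of how these three settings might be consumed, the sketch below reads them from a mapping that defaults to `os.environ`. The `Settings` class and `load_settings` function are hypothetical names, and the sketch assumes the values have already been loaded into the environment (it does not parse `.env` or `secrets.toml` itself).

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    openai_api_key: Optional[str]
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read MONGO_URI, OPENAI_API_KEY and LOG_LEVEL from a mapping."""
    env = os.environ if env is None else env
    mongo_uri = env.get("MONGO_URI", "").strip()
    if not mongo_uri:
        # MONGO_URI is the only required value in the list above.
        raise RuntimeError("MONGO_URI is required")
    return Settings(
        mongo_uri=mongo_uri,
        openai_api_key=env.get("OPENAI_API_KEY") or None,  # optional
        log_level=env.get("LOG_LEVEL", "INFO").upper(),    # INFO by default
    )
```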
## Prepare MongoDB Atlas
Create a database (for example `vgs`) and let the app create the collections on first use:
- `documents` – metadata for each uploaded file
- `doc_chunks` – embedded text chunks with section and page references
- `kg_nodes` / `kg_edges` – knowledge graph nodes and associations
- GridFS collections `fs.files` / `fs.chunks` (the default `fs` bucket) – original documents
Configure the indexes before querying:
Vector Search (`doc_chunks`, name `vgs_vector`)
```json
{
  "name": "vgs_vector",
  "type": "vectorSearch",
  "definition": {
    "fields": [
      { "type": "vector", "path": "embedding", "numDimensions": 384, "similarity": "cosine" },
      { "type": "filter", "path": "_id" },
      { "type": "filter", "path": "doc_id" },
      { "type": "filter", "path": "section_id" },
      { "type": "filter", "path": "page_start" }
    ]
  }
}
```
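To show how an index like this is typically queried, here is a hedged sketch that builds a `$vectorSearch` aggregation pipeline as plain dictionaries. The helper name `vector_search_pipeline`, the 10x `numCandidates` oversampling, and the projected `text` field are assumptions for illustration; the repository's own query code may differ.

```python
from typing import Any


def vector_search_pipeline(
    query_vector: list[float],
    limit: int = 10,
    doc_id: Any = None,
) -> list[dict[str, Any]]:
    """Build an aggregation pipeline that queries the `vgs_vector` index."""
    if len(query_vector) != 384:
        raise ValueError("query vector must match numDimensions (384)")
    stage: dict[str, Any] = {
        "index": "vgs_vector",
        "path": "embedding",
        "queryVector": query_vector,
        "numCandidates": limit * 10,  # assumption: oversample 10x
        "limit": limit,
    }
    if doc_id is not None:
        # Only paths declared with type "filter" in the index can be used here.
        stage["filter"] = {"doc_id": {"$eq": doc_id}}
    return [
        {"$vectorSearch": stage},
        {"$project": {
            "text": 1, "doc_id": 1, "section_id": 1, "page_start": 1,
            "score": {"$meta": "vectorSearchScore"},
        }},
    ]
```

With PyMongo, the result would be passed to `collection.aggregate(...)` on `doc_chunks`.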
Atlas Search (`doc_chunks`, name `vgs_text`)
```json
{
  "name": "vgs_text",
  "mappings": {
    "dynamic": false,
    "fields": {
      "text": { "type": "string" },
      "section_title": { "type": "string" },
      "doc_title": { "type": "string" }
    }
  }
}
```
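A matching full-text query could look like the sketch below, which builds a `$search` pipeline over the three mapped fields. The helper name `text_search_pipeline` is hypothetical, and searching all three paths at once is an assumption rather than the app's confirmed behaviour.

```python
from typing import Any


def text_search_pipeline(query: str, limit: int = 10) -> list[dict[str, Any]]:
    """Build an aggregation pipeline that queries the `vgs_text` index."""
    return [
        {"$search": {
            "index": "vgs_text",
            "text": {
                "query": query,
                # The three string fields declared in the index mapping.
                "path": ["text", "section_title", "doc_title"],
            },
        }},
        {"$limit": limit},
        {"$project": {
            "text": 1, "section_title": 1, "doc_title": 1,
            "score": {"$meta": "searchScore"},
        }},
    ]
```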
## Running the app
```bash
streamlit run streamlit_app.py
```
1. Open the displayed local URL, then register with an eligible email or sign in with an existing account.
2. Use the sidebar to open **Chat** or **Admin**.
### Admin workflow
- Upload a PDF or DOCX file and trigger **Ingest document** to push it to GridFS, create chunks, embed them, and update the knowledge graph.
- The **Library** section lists stored documents with chunk counts and lets you delete an item (removing GridFS blobs, metadata, and related chunks).
- Progress bars report ingestion stages, and errors surface as friendly messages.
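The knowledge-graph update during ingestion can be thought of as maintaining an inverted index from keyphrases to chunks. The sketch below is one simple way to do that in memory; the function names, the substring matching, and the assumption that keyphrases are already extracted per chunk are all illustrative, and the app itself stores its graph in `kg_nodes` / `kg_edges`.

```python
from collections import defaultdict


def build_keyphrase_index(chunk_phrases: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalised keyphrase to the ids of chunks that mention it."""
    index: dict[str, set[str]] = defaultdict(set)
    for chunk_id, phrases in chunk_phrases.items():
        for phrase in phrases:
            index[phrase.strip().lower()].add(chunk_id)
    return dict(index)


def graph_candidates(query: str, index: dict[str, set[str]]) -> dict[str, int]:
    """Count, per chunk, how many indexed keyphrases occur in the query."""
    q = query.lower()
    hits: dict[str, int] = defaultdict(int)
    for phrase, chunk_ids in index.items():
        if phrase and phrase in q:
            for chunk_id in chunk_ids:
                hits[chunk_id] += 1
    return dict(hits)
```

The per-chunk counts could then serve as a prior when ranking retrieval results.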
### Chat workflow
- Ask a question such as “What are the canopy checks before launch?”.
- Retrieval expands the query to graph-linked chunks, runs Vector Search and text search, fuses the scores, and shows cited answers (`Document · Section · Page`).
- When `OPENAI_API_KEY` is unset, responses fall back to the highest scoring chunk extract.
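The fusion step is not specified in detail here, so the following is only a sketch of one common approach, reciprocal rank fusion (RRF), which combines ranked lists rather than raw scores. The function name and the constant `k = 60` are assumptions; the app's actual fusion formula may differ.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk ids into one scored ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the chunks it contains.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Here the three input lists would be the vector search results, the text search results, and the graph-linked candidates, each ordered best first.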
## Development notes
- Ruff, isort, and pre-commit are optional tools for local use.
### Enable OpenAI (optional)
By default, the app runs without an LLM and falls back to extractive answers. To enable OpenAI responses:
- Install the SDK: `pip install openai` (or add it to `requirements.txt`).
- Set `OPENAI_API_KEY` in `.env` (or `.streamlit/secrets.toml`).
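The switch between generative and extractive answers can be sketched as below. The function names, the chunk dictionary keys (`text`, `score`, `doc_title`, `section_title`, `page_start`), and the exact citation layout are assumptions modelled on the `Document · Section · Page` format described earlier; the OpenAI call itself is left out.

```python
import os
from collections.abc import Mapping
from typing import Any, Optional


def answer_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "generative" when an API key is present, else "extractive"."""
    env = os.environ if env is None else env
    return "generative" if env.get("OPENAI_API_KEY") else "extractive"


def format_citation(doc_title: str, section_title: str, page: int) -> str:
    return f"{doc_title} · {section_title} · {page}"


def extractive_answer(chunks: list[dict[str, Any]]) -> str:
    """Return the highest-scoring chunk's text followed by its citation."""
    if not chunks:
        return "No relevant passages found."
    best = max(chunks, key=lambda c: c["score"])
    citation = format_citation(
        best["doc_title"], best["section_title"], best["page_start"]
    )
    return f"{best['text']}\n\n({citation})"
```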
## Security notes
- Demo credentials are intentionally low-privilege. Restrict them to the chatbot database and rotate them frequently.
- Do not reuse production secrets in `.env`. Use Atlas network rules to limit inbound connections.