https://github.com/nazdridoy/docuchat
DocuChat is a document chat application that allows you to have conversations with your documents, powered by a serverless vector database for scalable, efficient retrieval. Upload your files and ask questions in natural language to get answers based on their content.
- Host: GitHub
- URL: https://github.com/nazdridoy/docuchat
- Owner: nazdridoy
- License: agpl-3.0
- Created: 2025-07-09T05:06:07.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-07-09T05:10:31.000Z (6 months ago)
- Last Synced: 2025-07-09T06:25:00.007Z (6 months ago)
- Topics: ai, docuchat, document-chatbot, libsql, llm, local-llm, oll, openai, pdf, vector-database
- Language: JavaScript
- Homepage:
- Size: 65.4 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# DocuChat
DocuChat is a document chat application that allows you to have conversations with your documents, powered by a serverless vector database for scalable, efficient retrieval. Upload your files and ask questions in natural language to get answers based on their content.
## Features
- Upload and process text documents (.txt, .pdf, etc.)
- Prevent duplicate file uploads using file hash comparison (SHA-256)
- Client-side file hash calculation before upload for fast duplicate detection
- Generate embeddings for document chunks
- Flexible database options: use a local SQLite database (via `better-sqlite3` with `sqlite-vss`) for quick development, or a serverless Turso database for scalable, production-ready vector storage
- Query documents using natural language
- Get contextual responses based on your documents' content, with citations
- Real-time chat responses with Server-Sent Events (SSE)
- Deep search with HyDE (Hypothetical Document Embeddings) for improved retrieval
- Maximal Marginal Relevance (MMR) for diverse search results
- Contextual chat history integration for better conversations
- Markdown rendering for chat responses and source citations
## Technologies
- Node.js
- Express
- `better-sqlite3` with `sqlite-vss` (for local and Turso-compatible serverless vector storage)
- OpenAI / Ollama compatible APIs (for chat and embeddings)
- `pdf-parse` for PDF document processing
- Modular OpenAI client
- Server-Sent Events (SSE)
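Server-Sent Events frame each message as `data:` lines terminated by a blank line. A minimal helper for writing such frames from an Express handler might look like this (a sketch; DocuChat's actual streaming code may differ):

```javascript
// Format a payload as one SSE frame: "data: <json>\n\n".
// The trailing blank line marks the end of the event.
function sseFrame(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Hypothetical usage inside an Express route handler:
//   res.setHeader('Content-Type', 'text/event-stream');
//   res.setHeader('Cache-Control', 'no-cache');
//   res.write(sseFrame({ token: 'Hello' }));
//   res.end();
```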
## Prerequisites
- Node.js
- npm
- Git
- Optionally, [Ollama](https://ollama.ai/) for local embeddings.
## Installation
1. Clone this repository:
```bash
git clone https://github.com/nazdridoy/docuchat.git
cd docuchat
```
2. Install dependencies:
```bash
npm install
```
3. Create a `.env` file in the root directory and populate it with the necessary environment variables as described in the **Configuration** section.
## Configuration
The application is configured through environment variables.
### Environment Variables
```
# Server Configuration
PORT=3000
# Database Configuration
# For local SQLite database:
DATABASE_URL="file:./data.db"
USE_LOCAL_DB="true"
# For remote Turso database (experimental with better-sqlite3):
# DATABASE_URL="libsql://your-database.turso.io"
# DATABASE_TOKEN="your-database-token"
# USE_LOCAL_DB="false"
# OpenAI API Configuration (for chat)
OPENAI_BASE_URL="https://api.openai.com/v1"
OPENAI_MODEL="gpt-3.5-turbo"
OPENAI_API_KEY="your-openai-api-key"
# RAG API Configuration (for embeddings)
# For OpenAI:
RAG_BASE_URL="https://api.openai.com/v1"
RAG_MODEL="text-embedding-3-small"
RAG_API_KEY="your-openai-api-key"
# For Ollama (recommended for local development):
# RAG_BASE_URL="http://127.0.0.1:11434/v1"
# RAG_MODEL="mxbai-embed-large:latest"
# RAG_API_KEY="ollama" # Any value works here
# EMBEDDING_DIMENSIONS=1024 # Required for local models to set correct chunking defaults and vector dimensions in DB
# Document Processing Configuration
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
# These values are dynamically determined based on the embedding model's dimensions
# if not explicitly set. Overriding them may be useful for fine-tuning performance.
# RAG Configuration
SIMILARITY_THRESHOLD=0.5 # Minimum score for a chunk to be included in context. (Default: 0.5)
DEEP_SEARCH_INITIAL_THRESHOLD=0.30
CONTEXT_MAX_LENGTH=4096 # Max characters for context sent to the LLM. (Default: 4096)
# Document Upload Configuration
MAX_FILE_SIZE=10485760 # 10MB
UPLOAD_DIRECTORY="./uploads"
```
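`CHUNK_SIZE` and `CHUNK_OVERLAP` control how documents are split before embedding. A sliding-window chunker driven by these two parameters might look like the sketch below; the app's actual splitter may differ (for example, by respecting word or sentence boundaries):

```javascript
// Illustrative sliding-window chunker: each chunk is up to `chunkSize`
// characters, and consecutive chunks share `overlap` characters of context.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  const step = chunkSize - overlap; // how far the window advances each iteration
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

With the defaults above, a 2,500-character document yields three chunks, and each chunk after the first repeats the final 200 characters of its predecessor so that no sentence is cut off without context.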
## CLI Tool (Debug): View Database Entries
A Python script `view_embedding.py` is provided to inspect entries in the SQLite database, including document chunks and their associated embeddings. This is particularly useful for debugging and understanding the stored data.
To use it:
1. Ensure you have Python and `numpy` installed (`pip install numpy`).
2. Run the script, specifying the database path and whether to view chunks or embeddings.
Example:
```bash
python view_embedding.py --database ./data.db --rowid 1
python view_embedding.py --database ./data.db --rowid 1 --embedding
```
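To inspect embeddings from Node rather than Python, the stored BLOB can be decoded into floats. The sketch below assumes the embeddings are stored as little-endian float32 values (the `numpy`-based viewer script suggests this layout, but verify it against your schema):

```javascript
// Decode an embedding BLOB (assumed little-endian float32) into a typed array.
// `blob` is the Buffer a driver such as better-sqlite3 returns for a BLOB column.
function decodeEmbedding(blob) {
  // A Buffer may be a view into a larger ArrayBuffer, so honor its offset.
  return new Float32Array(blob.buffer, blob.byteOffset, blob.byteLength / 4);
}
```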
## License
This project is licensed under the AGPL-3.0 License. See the `LICENSE` file for details.