https://github.com/hirannor/hexadocs
Event-driven document ingestion system built with Java and Spring Boot using Hexagonal Architecture. It processes documents into searchable knowledge bases via asynchronous pipelines (extract, chunk, embed, index) powered by domain events and pluggable vector storage with Spring AI.
event-driven hexagonal java messaging rag spring springai springboot
- Host: GitHub
- URL: https://github.com/hirannor/hexadocs
- Owner: hirannor
- Created: 2026-04-10T05:13:28.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-10T09:00:14.000Z (3 months ago)
- Last Synced: 2026-04-10T11:08:42.306Z (3 months ago)
- Topics: event-driven, hexagonal, java, messaging, rag, spring, springai, springboot
- Language: Java
- Homepage:
- Size: 25.4 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# HexaDocs
| CI Status | License |
|-----------|---------|
| [CI](https://github.com/hirannor/HexaDocs/actions/workflows/ci.yaml) | [MIT](https://opensource.org/licenses/MIT) [Commons Clause](https://commonsclause.com/) |
---
A **domain-driven, event-driven document intelligence system** built with **Hexagonal Architecture**, designed to
support semantic document ingestion, chunking, embedding, and AI-powered retrieval (RAG).
---
## Overview
HexaDocs is a backend system that allows users to:
- Upload documents into a **Knowledge Base**
- Automatically process and split documents into semantic chunks
- Generate embeddings for semantic search
- Store vectors in a dedicated vector store
- Enable AI-powered question answering over documents
The system is designed around **clean architecture principles**, strong domain modeling, and asynchronous event-driven
workflows.

---
## Core Idea
Instead of treating documents as static files, HexaDocs transforms them into:
> **Structured, searchable knowledge units powered by embeddings**
Each document becomes:
- A set of semantic chunks
- An embedding index over those chunks, used for semantic search
---
## Core Capabilities
### Document Ingestion
- Upload documents (PDF)
- Extract and normalize content
- Split into chunks
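The splitting step can be illustrated with a minimal sketch. It uses fixed-size character windows with overlap, which is a simpler stand-in for whatever semantic chunking HexaDocs actually performs; the class name `TextChunker` and its parameters are hypothetical and do not come from the project.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits text into fixed-size, overlapping chunks. */
final class TextChunker {

    /**
     * @param text      normalized document text
     * @param chunkSize maximum number of characters per chunk (must be > 0)
     * @param overlap   characters shared by neighbouring chunks (0 <= overlap < chunkSize)
     */
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap; // how far the window advances each time
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the window reached the end of the text
            }
        }
        return chunks;
    }
}
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of some duplicated text in the index.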
### Embeddings
- Generate embeddings using `nomic-embed-text` (Ollama)
- Store vectors in `pgvector`
### Semantic Search (RAG)
- Query embedding generation
- Vector similarity search
- Context retrieval for LLM
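To make the similarity step concrete, here is an in-memory sketch of cosine-similarity ranking. In HexaDocs the search runs inside `pgvector`, so this is only an illustration of the idea, not the project's code; `SimilaritySearch`, `cosine`, and `topK` are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory illustration of vector similarity search. */
final class SimilaritySearch {

    /** Cosine similarity of two vectors of equal dimension; 0.0 if either is all zeros. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the ids of the k chunks whose vectors are most similar to the query vector. */
    static List<String> topK(float[] query, Map<String, float[]> chunkVectors, int k) {
        return chunkVectors.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, float[]> e) -> cosine(query, e.getValue())).reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

The query is embedded with the same model as the chunks, so both vectors live in the same space and the ranking is meaningful.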
### AI Chat
- Uses `mistral` via Ollama
- Context-aware answers (RAG-based)
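A context-aware answer usually means the retrieved chunks are placed in the prompt ahead of the question. The sketch below shows one possible way to assemble such a prompt; the wording of the instruction and the class name `RagPrompt` are assumptions, not taken from HexaDocs.

```java
import java.util.List;

/** Hypothetical helper that combines retrieved chunks and a question into one prompt. */
final class RagPrompt {

    static String build(String question, List<String> contextChunks) {
        StringBuilder sb = new StringBuilder();
        sb.append("Answer the question using only the context below.\n\n");
        sb.append("Context:\n");
        for (int i = 0; i < contextChunks.size(); i++) {
            // Number each chunk so the model (and a reader of the prompt) can tell them apart.
            sb.append('[').append(i + 1).append("] ")
              .append(contextChunks.get(i).strip())
              .append('\n');
        }
        sb.append("\nQuestion: ").append(question.strip());
        return sb.toString();
    }
}
```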
### Event-Driven Pipeline
- RabbitMQ-based async processing
- Decoupled ingestion and embedding pipeline
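The decoupling can be sketched as a chain of events in which each handler reacts to one event and emits the next. The example below replaces RabbitMQ with an in-memory queue, and the event names are hypothetical; they mirror the extract, chunk, embed, index stages but are not the project's actual classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch: an in-memory queue stands in for the message broker. */
final class IngestionPipelineSketch {

    interface DocumentEvent {
        String documentId();
    }

    // One event per completed stage (names are assumptions).
    record Uploaded(String documentId) implements DocumentEvent {}
    record Extracted(String documentId) implements DocumentEvent {}
    record Chunked(String documentId) implements DocumentEvent {}
    record Embedded(String documentId) implements DocumentEvent {}
    record Indexed(String documentId) implements DocumentEvent {}

    /** Performs the stage triggered by the event and returns the follow-up event, or null at the end. */
    static DocumentEvent handle(DocumentEvent event) {
        String id = event.documentId();
        if (event instanceof Uploaded) return new Extracted(id);
        if (event instanceof Extracted) return new Chunked(id);
        if (event instanceof Chunked) return new Embedded(id);
        if (event instanceof Embedded) return new Indexed(id);
        return null; // Indexed is terminal
    }

    /** Drives one document through the pipeline and returns the event names in order. */
    static List<String> run(String documentId) {
        Queue<DocumentEvent> queue = new ArrayDeque<>();
        List<String> log = new ArrayList<>();
        queue.add(new Uploaded(documentId));
        while (!queue.isEmpty()) {
            DocumentEvent event = queue.poll();
            log.add(event.getClass().getSimpleName());
            DocumentEvent next = handle(event);
            if (next != null) {
                queue.add(next);
            }
        }
        return log;
    }
}
```

Because each stage only knows the event it consumes and the event it emits, a stage can be retried or replaced without touching the others.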
---
# Overview Flow
The system works in 3 main steps:
1. Create a Knowledge Base
2. Upload Documents into the Knowledge Base
3. Ask questions via Chat (RAG)
---
# 1. Create Knowledge Base
Creates a Knowledge Base, the workspace that uploaded documents belong to.
---
### Endpoint
POST /api/knowledge-bases
---
### Request Body
```json
{
  "name": "My Knowledge Base"
}
```
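As a sketch, the request above could be built with the JDK's `java.net.http` client. The base URL and the `application/json` content type are assumptions (the README only shows the path and the body), and `CreateKnowledgeBaseRequest` is a hypothetical helper.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that builds the "create knowledge base" request. */
final class CreateKnowledgeBaseRequest {

    /** Builds the JSON body, escaping backslashes and double quotes in the name. */
    static String toJson(String name) {
        String escaped = name.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"name\": \"" + escaped + "\"}";
    }

    /** baseUrl is an assumption, for example "http://localhost:8080". */
    static HttpRequest build(String baseUrl, String name) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/knowledge-bases"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(toJson(name)))
                .build();
    }
}
```

The request can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The response format is not documented here, so the sketch does not parse it.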
# 2. Upload Document
Uploads a document into a specific Knowledge Base for processing (text extraction + embeddings).
---
## Endpoint
POST /api/documents/upload
---
## Content-Type
multipart/form-data
---
## Request Fields
| Field | Type | Required | Description |
|------|------|----------|-------------|
| file | binary (PDF) | Yes | The document file to upload |
| name | string | Yes | The document name |
| knowledgeBaseId | string (UUID) | Yes | ID of the target knowledge base |
| language | string | Yes | Language of the document |
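The JDK HTTP client has no built-in multipart support, so a client has to assemble the `multipart/form-data` body itself. The sketch below shows one way to do that for the fields in the table; `MultipartBody` is a hypothetical helper, and the `application/pdf` part type is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical helper that encodes text fields plus one PDF file as multipart/form-data. */
final class MultipartBody {

    /**
     * @param boundary  separator string; must not occur inside any part
     * @param fields    text fields such as name, knowledgeBaseId and language
     * @param fileName  file name reported for the "file" part
     * @param fileBytes raw PDF content
     */
    static byte[] build(String boundary, Map<String, String> fields, String fileName, byte[] fileBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            write(out, "--" + boundary + "\r\n"
                    + "Content-Disposition: form-data; name=\"" + field.getKey() + "\"\r\n\r\n"
                    + field.getValue() + "\r\n");
        }
        write(out, "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n");
        out.write(fileBytes, 0, fileBytes.length);
        write(out, "\r\n--" + boundary + "--\r\n"); // closing boundary ends the body
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }
}
```

The resulting bytes would be sent to `POST /api/documents/upload` with `HttpRequest.BodyPublishers.ofByteArray(...)` and the header `Content-Type: multipart/form-data; boundary=<boundary>`. The accepted values for `language` are not specified in this README.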
---
# 3. Ask Question (Chat / RAG)
Ask questions about documents stored in a Knowledge Base using semantic search (vector DB) + LLM generation.
---
## Endpoint
POST /api/chat
---
### Request Body
```json
{
  "knowledgeBaseId": "550e8400-e29b-41d4-a716-446655440000",
  "question": "My question?"
}
```
---
## OCR (Tesseract) Setup
HexaDocs uses **Tesseract OCR** as a fallback mechanism for PDF text extraction when embedded text layers are missing or
corrupted.
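The fallback decision can be sketched as a simple quality check on the embedded text layer. The heuristic below (minimum length plus a ratio of readable characters) and its thresholds are assumptions chosen for illustration; how HexaDocs actually detects a missing or corrupted layer is not described here.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of "use the text layer if it looks sane, otherwise run OCR". */
final class TextExtractionFallback {

    /** Treats text as usable if it is long enough and mostly letters, digits or whitespace. */
    static boolean looksUsable(String text, int minLength, double minReadableRatio) {
        if (text == null) {
            return false;
        }
        String trimmed = text.strip();
        if (trimmed.length() < minLength) {
            return false;
        }
        long readable = trimmed.chars()
                .filter(c -> Character.isLetterOrDigit(c) || Character.isWhitespace(c))
                .count();
        return (double) readable / trimmed.length() >= minReadableRatio;
    }

    /** OCR is only invoked when the embedded text fails the check (thresholds are assumptions). */
    static String extract(Supplier<String> textLayer, Supplier<String> ocr) {
        String embedded = textLayer.get();
        return looksUsable(embedded, 20, 0.7) ? embedded : ocr.get();
    }
}
```

Passing the OCR step as a `Supplier` keeps the expensive Tesseract call lazy, so it only runs for documents that need it.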
### Required Installation
Tesseract must be installed on the host machine:
- Windows installer: https://github.com/UB-Mannheim/tesseract/wiki
- Linux: `sudo apt install tesseract-ocr`
- macOS: `brew install tesseract`
---
### Language Support
By default, only English is installed.
If you process non-English documents (e.g. Hungarian), you must install additional language packs.
Example for Hungarian:
- Download: https://github.com/tesseract-ocr/tessdata_best/blob/main/hun.traineddata
#### Environment Variable Configuration
Tesseract requires the `TESSDATA_PREFIX` environment variable to be set correctly.
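A quick way to verify the setup is to check that the directory named by `TESSDATA_PREFIX` contains the language file you need. The sketch below assumes the variable points directly at the directory holding the `.traineddata` files, which is the convention in recent Tesseract versions; older versions expected the parent directory instead. `TessdataCheck` is a hypothetical helper.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical startup check for Tesseract language data. */
final class TessdataCheck {

    /** True if <tessdataDir>/<language>.traineddata exists, for example hun.traineddata. */
    static boolean hasLanguage(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null || prefix.isBlank()) {
            System.err.println("TESSDATA_PREFIX is not set");
            return;
        }
        String language = args.length > 0 ? args[0] : "eng";
        System.out.println(language + " available: " + hasLanguage(Path.of(prefix), language));
    }
}
```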
#### Windows example:
## Running the system
Start all services:
```bash
docker-compose up -d
```
# Typical Workflow via UI
1. Open http://localhost:8080
2. Create a Knowledge Base
3. Upload a PDF document
4. Wait for the ingestion pipeline to finish processing the document
5. Start chatting with your documents