https://github.com/hirannor/hexadocs

Event-driven document ingestion system built with Java and Spring Boot using Hexagonal Architecture. It processes documents into searchable knowledge bases via asynchronous pipelines (extract, chunk, embed, index) powered by domain events and pluggable vector storage with Spring AI.

# 📚 HexaDocs

| CI Status | License |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [![CI](https://github.com/hirannor/HexaDocs/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/hirannor/HexaDocs/actions/workflows/ci.yaml) | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Commons Clause](https://img.shields.io/badge/Commons-Clause-red.svg)](https://commonsclause.com/) |

---

A **domain-driven, event-driven document intelligence system** built with **Hexagonal Architecture**, designed to
support semantic document ingestion, chunking, embedding, and AI-powered retrieval (RAG).

---

## 🚀 Overview

HexaDocs is a backend system that allows users to:

- Upload documents into a **Knowledge Base**
- Automatically process and split documents into semantic chunks
- Generate embeddings for semantic search
- Store vectors in a dedicated vector store
- Enable AI-powered question answering over documents

The system is designed around **clean architecture principles**, strong domain modeling, and asynchronous event-driven
workflows.

![APP_1](images/app.png)

---

## 🧠 Core Idea

Instead of treating documents as static files, HexaDocs transforms them into:

> **Structured, searchable knowledge units powered by embeddings**

Each document becomes:

- A set of semantic chunks
- Indexed in the vector store for semantic search

---

## 🧩 Core Capabilities

### 📄 Document Ingestion

- Upload documents (PDF)
- Extract and normalize content
- Split into chunks
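
This README does not say how HexaDocs splits text, so the following is only a minimal sketch of one common approach: fixed-size word windows with overlap. The class name `NaiveChunker` and the parameters `chunkSize` and `overlap` are hypothetical and not taken from the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HexaDocs: splits text into overlapping word windows.
class NaiveChunker {

    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        String normalized = text.strip();
        if (normalized.isEmpty()) {
            return chunks; // nothing to split
        }
        List<String> words = Arrays.asList(normalized.split("\\s+"));
        int step = chunkSize - overlap; // how far each window advances
        for (int start = 0; start < words.size(); start += step) {
            int end = Math.min(start + chunkSize, words.size());
            chunks.add(String.join(" ", words.subList(start, end)));
            if (end == words.size()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }
}
```

The overlap keeps a sentence that straddles a window boundary visible in both neighbouring chunks, at the cost of storing some text twice.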

### 🧠 Embeddings

- Generate embeddings using:
  - `nomic-embed-text` (Ollama)
- Store vectors in `pgvector`

### 🔎 Semantic Search (RAG)

- Query embedding generation
- Vector similarity search
- Context retrieval for LLM
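
In HexaDocs the similarity search is delegated to the vector store, so the snippet below is only an in-memory illustration of what a vector similarity search computes. It assumes cosine similarity as the metric (the README does not state which one is used), and `SimilaritySketch`, `cosine`, and `topK` are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory illustration; the real search runs inside the vector store.
class SimilaritySketch {

    // Cosine similarity between two vectors of equal length; 0 if either has zero norm.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the ids of the k chunks most similar to the query embedding.
    static List<String> topK(float[] query, Map<String, float[]> chunkVectors, int k) {
        return chunkVectors.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, float[]> e) -> cosine(query, e.getValue())).reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

The chunks returned by `topK` correspond to the context that would be passed to the LLM together with the question.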

### 💬 AI Chat

- Uses `mistral` via Ollama
- Context-aware answers (RAG-based)

### ⚡ Event-Driven Pipeline

- RabbitMQ-based async processing
- Decoupled ingestion and embedding pipeline
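
To illustrate how domain events can decouple the stages (extract, chunk, embed, index), here is a minimal sketch that uses a synchronous in-memory bus as a stand-in for RabbitMQ. All event and class names (`DocumentUploaded`, `TextExtracted`, `ChunksCreated`, `ChunksEmbedded`, `EventBus`) are hypothetical and do not come from the HexaDocs code base, and the sentence-splitting step is only a placeholder for real chunking.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for the RabbitMQ-backed pipeline.
class PipelineSketch {

    record DocumentUploaded(String documentId, String rawText) {}
    record TextExtracted(String documentId, String text) {}
    record ChunksCreated(String documentId, List<String> chunks) {}
    record ChunksEmbedded(String documentId, int vectorCount) {}

    // Minimal synchronous event bus: handlers are registered per event type.
    static class EventBus {
        private final Map<Class<?>, List<Consumer<Object>>> handlers = new HashMap<>();

        <T> void subscribe(Class<T> type, Consumer<T> handler) {
            handlers.computeIfAbsent(type, t -> new ArrayList<>())
                    .add(event -> handler.accept(type.cast(event)));
        }

        void publish(Object event) {
            for (Consumer<Object> handler : handlers.getOrDefault(event.getClass(), List.of())) {
                handler.accept(event);
            }
        }
    }

    // Wires the stages; each stage knows only the event it consumes and the one it emits.
    static List<String> run(String documentId, String rawText) {
        EventBus bus = new EventBus();
        List<String> log = new ArrayList<>();

        bus.subscribe(DocumentUploaded.class, e -> {
            log.add("extract");
            bus.publish(new TextExtracted(e.documentId(), e.rawText().strip()));
        });
        bus.subscribe(TextExtracted.class, e -> {
            log.add("chunk");
            // Placeholder chunking: one chunk per sentence.
            bus.publish(new ChunksCreated(e.documentId(), List.of(e.text().split("\\.\\s*"))));
        });
        bus.subscribe(ChunksCreated.class, e -> {
            log.add("embed");
            bus.publish(new ChunksEmbedded(e.documentId(), e.chunks().size()));
        });
        bus.subscribe(ChunksEmbedded.class, e -> log.add("index:" + e.vectorCount()));

        bus.publish(new DocumentUploaded(documentId, rawText));
        return log;
    }
}
```

Because each handler depends only on an event type, a stage can be replaced or moved behind a real message broker without touching the others.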

---

# 📌 Overview Flow

The system works in 3 main steps:

1. Create a Knowledge Base
2. Upload Documents into the Knowledge Base
3. Ask questions via Chat (RAG)

---

# 🧠 1. Create Knowledge Base

Creates a Knowledge Base, the workspace that documents are uploaded into.

---

### ➤ Endpoint
POST /api/knowledge-bases

---

### ➤ Request Body
```json
{
  "name": "My Knowledge Base"
}
```
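
As a rough illustration, the request could be sent from Java with the JDK's built-in `java.net.http` client. This is a sketch under assumptions: the base URL (for example `http://localhost:8080`, the address used for the UI later in this README) and the `Content-Type: application/json` header are not documented for this endpoint, and the class and method names are hypothetical. The response format is not described here, so the body is returned as a raw string.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client-side helper, not part of HexaDocs.
class CreateKnowledgeBaseExample {

    // Builds the POST request without sending it.
    static HttpRequest buildRequest(String baseUrl, String name) {
        // Minimal escaping of backslashes and quotes; use a JSON library for real input.
        String escaped = name.replace("\\", "\\\\").replace("\"", "\\\"");
        String json = "{\"name\": \"" + escaped + "\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/knowledge-bases"))
                .header("Content-Type", "application/json") // assumed, not documented
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // Sends the request and returns the raw response body.
    static String send(HttpRequest request) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

A call such as `send(buildRequest("http://localhost:8080", "My Knowledge Base"))` would perform the request against a locally running instance.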

# 📄 2. Upload Document

Uploads a document into a specific Knowledge Base for processing (text extraction + embeddings).

---

## ➤ Endpoint
POST /api/documents/upload

---

## ➤ Content-Type
multipart/form-data

---

## ➤ Request Fields

| Field | Type | Required | Description |
|------|------|----------|-------------|
| file | binary (PDF) | Yes | The document file to upload |
| name | string | Yes | The document name |
| knowledgeBaseId | string (UUID) | Yes | ID of the target knowledge base |
| language | string | Yes | Language of the document |
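
The JDK's `java.net.http` client has no built-in multipart support, so the sketch below assembles the `multipart/form-data` body by hand from the fields in the table above. The class name, the boundary value, and the base URL are assumptions for illustration only. The accepted values for `language` are not documented in this README.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical client-side helper, not part of HexaDocs.
class UploadDocumentExample {

    // Encodes the text fields plus one PDF file part as a multipart/form-data body.
    static byte[] multipartBody(String boundary, Map<String, String> fields,
                                String fileName, byte[] fileBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            write(out, "--" + boundary + "\r\n"
                    + "Content-Disposition: form-data; name=\"" + field.getKey() + "\"\r\n\r\n"
                    + field.getValue() + "\r\n");
        }
        write(out, "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n");
        out.writeBytes(fileBytes);
        write(out, "\r\n--" + boundary + "--\r\n"); // closing boundary
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String text) {
        out.writeBytes(text.getBytes(StandardCharsets.UTF_8));
    }

    // Builds the POST request; the boundary must match the one used in the body.
    static HttpRequest buildRequest(String baseUrl, String boundary, byte[] body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/documents/upload"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }
}
```

The `fields` map would carry `name`, `knowledgeBaseId`, and `language`, and the boundary should be a string that cannot occur inside the file content, for example a random UUID.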

---

# 🧾 3. Ask Question (Chat / RAG)

Ask questions about documents stored in a Knowledge Base using semantic search (vector DB) + LLM generation.

---

## ➤ Endpoint
POST /api/chat

---

### ➤ Request Body
```json
{
  "knowledgeBaseId": "550e8400-e29b-41d4-a716-446655440000",
  "question": "My question?"
}
```
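
A chat request can be built in a similar way with the JDK HTTP client. The sketch below is illustrative only: the class name, the base URL, the JSON content type, and the 120-second timeout (chosen because LLM generation can be slow) are assumptions, not documented behaviour.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.UUID;

// Hypothetical client-side helper, not part of HexaDocs.
class ChatRequestExample {

    // Minimal JSON string escaping; use a JSON library for arbitrary input.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    // Serializes the two documented request fields.
    static String toJson(UUID knowledgeBaseId, String question) {
        return "{\"knowledgeBaseId\": \"" + knowledgeBaseId + "\", "
                + "\"question\": \"" + escape(question) + "\"}";
    }

    // Builds the POST request without sending it.
    static HttpRequest buildRequest(String baseUrl, UUID knowledgeBaseId, String question) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/chat"))
                .header("Content-Type", "application/json") // assumed, not documented
                .timeout(Duration.ofSeconds(120))           // assumed generous limit
                .POST(HttpRequest.BodyPublishers.ofString(toJson(knowledgeBaseId, question)))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The structure of the answer is not described in this README.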

---

## 🧪 OCR (Tesseract) Setup

HexaDocs uses **Tesseract OCR** as a fallback mechanism for PDF text extraction when embedded text layers are missing or
corrupted.

### 📦 Required Installation

Tesseract must be installed on the host machine:

- Windows installer: https://github.com/UB-Mannheim/tesseract/wiki
- Linux: `sudo apt install tesseract-ocr`
- macOS: `brew install tesseract`

---

### 🌍 Language Support

By default, only English is installed.
If you process non-English documents (e.g. Hungarian), you must install additional language packs.

Example for Hungarian:

- Download: https://github.com/tesseract-ocr/tessdata_best/blob/main/hun.traineddata

#### ⚙️ Environment Variable Configuration

Tesseract requires the `TESSDATA_PREFIX` environment variable to be set correctly.
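
A quick way to verify the setup is to check that the language file is visible where `TESSDATA_PREFIX` points. The sketch below assumes Tesseract 4 or later, where the variable names the directory that directly contains the `.traineddata` files. `TessdataCheck` and its methods are hypothetical helpers, not part of HexaDocs.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical preflight check, not part of HexaDocs.
class TessdataCheck {

    // Reads TESSDATA_PREFIX from the environment; empty if it is unset or blank.
    static Optional<Path> tessdataDirFromEnv() {
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null || prefix.isBlank()) {
            return Optional.empty();
        }
        return Optional.of(Path.of(prefix));
    }

    // Returns the language model file (e.g. hun.traineddata) if it exists in the directory.
    static Optional<Path> findLanguageData(Path tessdataDir, String language) {
        Path candidate = tessdataDir.resolve(language + ".traineddata");
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }
}
```

For example, `tessdataDirFromEnv().flatMap(dir -> findLanguageData(dir, "hun"))` is empty when either the variable or the Hungarian language file is missing.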

#### Windows example:

## 🚀 Running the system

Start all services:

```bash
docker-compose up -d
```

# ⚙️ Typical Workflow via UI
1. Open http://localhost:8080
2. Create a Knowledge Base
3. Upload a PDF document
4. Wait for the ingestion pipeline to finish processing the document
5. Start chatting with your documents