{"id":51070628,"url":"https://github.com/hirannor/hexadocs","last_synced_at":"2026-06-23T10:01:36.983Z","repository":{"id":350427996,"uuid":"1206626100","full_name":"hirannor/HexaDocs","owner":"hirannor","description":"Event-driven document ingestion system built with Java and Spring Boot using Hexagonal Architecture. It processes documents into searchable knowledge bases via asynchronous pipelines (extract, chunk, embed, index) powered by domain events and pluggable vector storage with Spring AI.","archived":false,"fork":false,"pushed_at":"2026-04-10T09:00:14.000Z","size":26,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-10T11:08:42.306Z","etag":null,"topics":["event-driven","hexagonal","java","messaging","rag","spring","springai","springboot"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hirannor.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-10T05:13:28.000Z","updated_at":"2026-04-10T09:00:29.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hirannor/HexaDocs","commit_stats":null,"previous_names":["hirannor/hexadocs"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/hirannor/HexaDocs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hirannor%2FHexaDocs","tags_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/hirannor%2FHexaDocs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hirannor%2FHexaDocs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hirannor%2FHexaDocs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hirannor","download_url":"https://codeload.github.com/hirannor/HexaDocs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hirannor%2FHexaDocs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34684686,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-23T02:00:07.161Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["event-driven","hexagonal","java","messaging","rag","spring","springai","springboot"],"created_at":"2026-06-23T10:01:33.947Z","updated_at":"2026-06-23T10:01:36.975Z","avatar_url":"https://github.com/hirannor.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 📚 HexaDocs\n\n| CI Status                                                                                                                                                     | License                                                                                                                                    
                                                                      |\n|---------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| [![CI](https://github.com/hirannor/HexaDocs/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/hirannor/HexaDocs/actions/workflows/ci.yaml) | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Commons Clause](https://img.shields.io/badge/Commons-Clause-red.svg)](https://commonsclause.com/) |\n\n---\n\nA **domain-driven, event-driven document intelligence system** built with **Hexagonal Architecture**, designed to\nsupport semantic document ingestion, chunking, embedding, and AI-powered retrieval (RAG).\n\n---\n\n## 🚀 Overview\n\nHexaDocs is a backend system that allows users to:\n\n- Upload documents into a **Knowledge Base**\n- Automatically process and split documents into semantic chunks\n- Generate embeddings for semantic search\n- Store vectors in a dedicated vector store\n- Enable AI-powered question answering over documents\n\nThe system is designed around **clean architecture principles**, strong domain modeling, and asynchronous event-driven\nworkflows.\n\n![APP_1](images/app.png)\n\n---\n\n\n## 🧠 Core Idea\n\nInstead of treating documents as static files, HexaDocs transforms them into:\n\n\u003e **Structured, searchable knowledge units powered by embeddings**\n\nEach document becomes:\n\n- A set of semantic chunks\n- Indexed\n\n---\n\n## 🧩 Core Capabilities\n\n### 📄 Document Ingestion\n\n- Upload documents (PDF)\n- Extract and normalize content\n- Split into chunks\n\n### 🧠 Embeddings\n\n- Generate embeddings using:\n    - 
`nomic-embed-text` (Ollama)\n- Store vectors in `pgvector`\n\n### 🔎 Semantic Search (RAG)\n\n- Query embedding generation\n- Vector similarity search\n- Context retrieval for the LLM\n\n### 💬 AI Chat\n\n- Uses `mistral` via Ollama\n- Context-aware answers (RAG-based)\n\n### ⚡ Event-Driven Pipeline\n\n- RabbitMQ-based async processing\n- Decoupled ingestion and embedding pipeline\n\n---\n\n# 📌 Overview Flow\n\nThe system works in 3 main steps:\n\n1. Create a Knowledge Base\n2. Upload documents into the Knowledge Base\n3. Ask questions via Chat (RAG)\n\n---\n\n# 🧠 1. Create Knowledge Base\n\nCreates a new Knowledge Base workspace.\n\n---\n\n### ➤ Endpoint\nPOST /api/knowledge-bases\n\n---\n\n### ➤ Request Body\n```json\n{\n  \"name\": \"My Knowledge Base\"\n}\n```\n\n# 📄 2. Upload Document\n\nUploads a document into a specific Knowledge Base for processing (text extraction + embeddings).\n\n---\n\n## ➤ Endpoint\nPOST /api/documents/upload\n\n---\n\n## ➤ Content-Type\nmultipart/form-data\n\n---\n\n## ➤ Request Fields\n\n| Field | Type | Required | Description |\n|-------|------|----------|-------------|\n| file | binary (PDF) | Yes | The document file to upload |\n| name | string | Yes | The document name |\n| knowledgeBaseId | string (UUID) | Yes | ID of the target knowledge base |\n| language | string | Yes | Language of the document |\n\n---\n\n# 🧾 3. 
Ask Question (Chat / RAG)\n\nAsk questions about documents stored in a Knowledge Base using semantic search (vector DB) + LLM generation.\n\n---\n\n## ➤ Endpoint\nPOST /api/chat\n\n---\n\n### ➤ Request Body\n```json\n{\n  \"knowledgeBaseId\": \"550e8400-e29b-41d4-a716-446655440000\",\n  \"question\": \"My question?\"\n}\n```\n\n---\n\n## 🧪 OCR (Tesseract) Setup\n\nHexaDocs uses **Tesseract OCR** as a fallback mechanism for PDF text extraction when embedded text layers are missing or\ncorrupted.\n\n### 📦 Required Installation\n\nTesseract must be installed on the host machine:\n\n- Windows installer: https://github.com/UB-Mannheim/tesseract/wiki\n- Linux: `sudo apt install tesseract-ocr`\n- macOS: `brew install tesseract`\n\n---\n\n### 🌍 Language Support\n\nBy default, only English is installed.  \nIf you process non-English documents (e.g. Hungarian), you must install additional language packs.\n\nExample for Hungarian:\n\n- Download: https://github.com/tesseract-ocr/tessdata_best/blob/main/hun.traineddata\n\n#### ⚙️ Environment Variable Configuration\n\nTesseract requires the `TESSDATA_PREFIX` environment variable to be set correctly.\n\n#### Windows example:\n\n## 🚀 Running the system\n\nStart all services:\n\n```bash\ndocker-compose up -d\n```\n\n# ⚙️ Typical Workflow via UI\n1. Open http://localhost:8080\n2. Create a Knowledge Base\n3. Upload a PDF document\n4. Wait for the ingestion pipeline to process the document\n5. Start chatting with your documents\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhirannor%2Fhexadocs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhirannor%2Fhexadocs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhirannor%2Fhexadocs/lists"}