{"id":47237123,"url":"https://github.com/aborroy/alfresco-content-lake","last_synced_at":"2026-03-13T23:17:29.848Z","repository":{"id":336910713,"uuid":"1151560798","full_name":"aborroy/alfresco-content-lake","owner":"aborroy","description":"Alfresco AI App for Hyland Content Lake","archived":false,"fork":false,"pushed_at":"2026-03-09T15:19:49.000Z","size":299,"stargazers_count":3,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-09T19:33:38.221Z","etag":null,"topics":["ai","alfresco","content-lake","docker","rag"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aborroy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-06T16:09:54.000Z","updated_at":"2026-03-09T15:19:54.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/aborroy/alfresco-content-lake","commit_stats":null,"previous_names":["aborroy/alfresco-content-lake"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aborroy/alfresco-content-lake","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falfresco-content-lake","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falfresco-content-lake/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falfresco-content-lake/releases","manifests
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falfresco-content-lake/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aborroy","download_url":"https://codeload.github.com/aborroy/alfresco-content-lake/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falfresco-content-lake/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30479126,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-13T20:45:58.186Z","status":"ssl_error","status_checked_at":"2026-03-13T20:45:20.133Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","alfresco","content-lake","docker","rag"],"created_at":"2026-03-13T23:17:26.669Z","updated_at":"2026-03-13T23:17:29.842Z","avatar_url":"https://github.com/aborroy.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Alfresco Content Lake\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)\n[![Java](https://img.shields.io/badge/Java-21-orange.svg)](https://openjdk.org/projects/jdk/21/)\n[![Spring 
Boot](https://img.shields.io/badge/Spring%20Boot-3.4.3-brightgreen.svg)](https://spring.io/projects/spring-boot)\n[![Maven](https://img.shields.io/badge/Maven-3.9+-red.svg)](https://maven.apache.org/)\n[![Docker](https://img.shields.io/badge/Docker-Compose-blue.svg)](https://docs.docker.com/compose/)\n[![Status](https://img.shields.io/badge/Status-PoC-yellow.svg)]()\n\n**AI-powered semantic search and RAG for Alfresco using hxpr Content Lake**\n\n[Features](#features) • [Quick Start](#quick-start) • [Architecture](#architecture) • [Authentication](#authentication) • [API Usage](#api-usage) • [Configuration](#configuration)\n\n## Related Projects\n\n- [alfresco-content-lake-ui](https://github.com/aborroy/alfresco-content-lake-ui) - ACA-based frontend for semantic search and RAG over Content Lake.\n- [alfresco-content-lake-deploy](https://github.com/aborroy/alfresco-content-lake-deploy) - Docker Compose deployment for Alfresco, hxpr, Content Lake services, and the UI.\n\n## Overview\n\nProof of Concept for AI-powered semantic search and Retrieval-Augmented Generation (RAG) on Alfresco Content Services.\n\nLeverages **hxpr** as a Content Lake to enable high-quality AI search while:\n\n* Keeping Alfresco as the source of truth\n* Enforcing server-side permissions via ACLs\n* Supporting on-premises AI execution\n* Minimizing data duplication\n\n## Features\n\n- Two-Phase Sync Pipeline: Fast metadata ingestion + async content processing\n- Near Real-Time Sync: Alfresco Event2 listener over ActiveMQ using the Alfresco Java SDK\n- Semantic Search: Vector embeddings with permission-aware kNN search\n- RAG: LLM-powered question answering grounded in Alfresco document content\n- Permission-Aware: Server-side ACL enforcement via hxpr\n- Local AI: On-premises LLM and embedding models using Spring AI\n- Repository Scope Model: `cl:indexed` and `cl:excludeFromLake` for Alfresco-native scope control\n- REST API: Generic connector using Alfresco REST APIs\n- Secured Endpoints: 
Alfresco authentication (username/password or tickets)\n- Shared Ingestion Core: Common metadata, transform, chunking, embedding, ACL, and delete/update logic in `content-lake-common`\n- Idempotent Coexistence: `alfresco_modifiedAt` guard prevents stale batch/live writes from overwriting newer content\n\n## Architecture\n\n```text\n                 ┌──────────────────────────────────────┐\n                 │ Alfresco Repository + Event2         │\n                 │ REST API + ActiveMQ topic            │\n                 └──────────────────────────────────────┘\n                          │                     │\n                          │                     │\n                          ▼                     ▼\n┌──────────────────────────────────────┐   ┌──────────────────────────────────────┐\n│ batch-ingester                       │   │ live-ingester                        │\n│ Discovery → Metadata → Queue         │   │ SDK Handlers → Filter → Sync         │\n└──────────────────────────────────────┘   └──────────────────────────────────────┘\n                          │                     │\n                          └──────────┬──────────┘\n                                     ▼\n               ┌──────────────────────────────────────────┐\n               │ content-lake-common                      │\n               │ Node sync, Transform, Chunk, Embed, ACL  │\n               │ `alfresco_modifiedAt` idempotency guard  │\n               └──────────────────────────────────────────┘\n                                     ▼\n               ┌──────────────────────────────────────────┐\n               │ hxpr Content Lake                        │\n               └──────────────────────────────────────────┘\n                                     ▼\n               ┌──────────────────────────────────────────┐\n               │ rag-service                              │\n               │ Query → Embed → Search → Augment → LLM   │\n               
└──────────────────────────────────────────┘\n```\n\n### Modules\n\n| Module | Port | Description |\n|--------|------|-------------|\n| `content-lake-repo-model` | — | Alfresco repository JAR that bootstraps the `cl:indexed` content model for scope control |\n| `content-lake-common` | — | Shared clients and ingestion pipeline: metadata sync, transform, chunking, embedding, ACL updates, idempotency |\n| `batch-ingester` | 9090 | Folder discovery, batch scheduling, metadata enqueueing, and `/api/sync/*` controllers |\n| `live-ingester` | 9092 | Alfresco Event2 listener over ActiveMQ using Alfresco Java SDK handlers and filters |\n| `rag-service` | 9091 | Semantic search and RAG question answering |\n\n## Quick Start\n\n### Prerequisites\n\n- Java 21+ and Maven 3.9+\n- Docker and Docker Compose\n- Alfresco Content Services 25.x+\n  - Alfresco Transform Service (for text extraction)\n- hxpr Content Lake (with OAuth2 IDP)\n- Docker Model Runner (for embeddings and LLM)\n\n### Installation\n\n```bash\n# Clone repository\ngit clone https://github.com/aborroy/alfresco-content-lake.git\ncd alfresco-content-lake\n\n# Build all modules\nmvn clean package\n\n# Deploy the repository content model to ACS before starting the ingesters\n# Artifact:\n#   content-lake-repo-model/target/content-lake-repo-model-1.0.0-SNAPSHOT.jar\n# Deploy it to the Alfresco Repository classpath.\n\n# Configure (see Environment Variables below)\nexport ALFRESCO_URL=http://localhost:8080\nexport ALFRESCO_INTERNAL_USERNAME=admin\nexport ALFRESCO_INTERNAL_PASSWORD=admin\n# ... 
(see full configuration below)\n\n# Run batch ingestion\njava -jar batch-ingester/target/batch-ingester-1.0.0-SNAPSHOT.jar\n\n# Run live ingestion\njava -jar live-ingester/target/live-ingester-1.0.0-SNAPSHOT.jar\n\n# Run RAG service\njava -jar rag-service/target/rag-service-1.0.0-SNAPSHOT.jar\n\n# Or with Docker Compose (both services)\ndocker-compose up\n```\n\n### Alfresco Repo Model\n\nThe batch and live ingesters now rely on an Alfresco content model for scope control:\n\n- `cl:indexed` marks a folder subtree as in scope for Content Lake ingestion\n- `cl:excludeFromLake` lets a file opt out, or a folder subtree opt out, even when an ancestor folder is indexed\n\nBuild artifact:\n\n```bash\ncontent-lake-repo-model/target/content-lake-repo-model-1.0.0-SNAPSHOT.jar\n```\n\nDeploy that JAR to the Alfresco Repository classpath before enabling ingestion. Typical options are:\n\n- include it in an ACS SDK `modules/platform` build\n- copy or mount it into an Alfresco Repository image under `webapps/alfresco/WEB-INF/lib`\n\n### Starting From A Non-Indexed Repository\n\nIf your Alfresco Repository does not yet use `cl:indexed`, the recommended startup sequence is:\n\n1. Build the project and deploy the repository model JAR to Alfresco Repository.\n   After deployment, restart the repository so `cl:indexed` and `cl:excludeFromLake` are available.\n2. Start `batch-ingester`.\n3. Run a batch synchronization against the folder you want to onboard.\n   The ingester automatically adds `cl:indexed` to each root folder if it is not already present, then performs the initial backfill into Content Lake.\n4. Start `live-ingester`.\n   Live ingestion then keeps that indexed subtree up to date.\n\nExample for indexing all sites under `Company Home/Sites`:\n\n1. Resolve the Alfresco node id for `Company Home/Sites`.\n   You can obtain it from Alfresco UI tools or the Alfresco REST API.\n2. 
Run the batch sync against that folder:\n\n```bash\ncurl -X POST http://localhost:9090/api/sync/batch \\\n  -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"folders\":[\"SITES_FOLDER_NODE_ID\"],\"recursive\":true,\"types\":[\"cm:content\"]}'\n```\n\nThis single call marks `SITES_FOLDER_NODE_ID` with `cl:indexed` (if needed) and ingests all existing content beneath it.\n\n3. After the batch completes, start `live-ingester` so new or changed content under `Company Home/Sites` continues to sync automatically.\n\nImportant:\n\n- `cl:indexed` can also be set directly via the Alfresco Repository nodes API or the Content Lake UI extension; the batch ingester sets it automatically only for root folders passed in the request\n- `cl:excludeFromLake` on a folder removes that folder's full subtree from Content Lake scope; batch discovery skips it and live reconciliation deletes previously ingested descendants\n- if you later want to index only one site, pass that site folder to `/api/sync/batch` instead of `Company Home/Sites`\n\n### Environment Variables\n\n```bash\n# Alfresco (Internal Service Account)\nexport ALFRESCO_URL=http://localhost:8080\nexport ALFRESCO_INTERNAL_USERNAME=admin\nexport ALFRESCO_INTERNAL_PASSWORD=admin\n\n# hxpr Content Lake\nexport HXPR_URL=http://localhost:8080\nexport HXPR_REPOSITORY_ID=default\nexport HXPR_IDP_TOKEN_URL=http://localhost:5002/idp/connect/token\nexport HXPR_IDP_CLIENT_ID=nuxeo-client\nexport HXPR_IDP_CLIENT_SECRET=secret\nexport HXPR_IDP_USERNAME=testuser\nexport HXPR_IDP_PASSWORD=password\n\n# Transform Service (batch-ingester only)\nexport TRANSFORM_URL=http://localhost:10090\nexport TRANSFORM_ENABLED=true\n\n# ActiveMQ / Event2 (live-ingester only)\nexport ACTIVEMQ_URL=tcp://localhost:61616\nexport ACTIVEMQ_USER=admin\nexport ACTIVEMQ_PASSWORD=admin\nexport ALFRESCO_EVENT_TOPIC=alfresco.repo.event2\n\n# AI/Embeddings (both services)\n# Spring AI appends /v1 itself; use the Docker Model Runner root 
URL.\nexport MODEL_RUNNER_URL=http://localhost:12434\nexport EMBEDDING_MODEL=ai/mxbai-embed-large\n\n# LLM (rag-service only)\nexport LLM_MODEL=ai/gpt-oss\nexport LLM_TEMPERATURE=0.3\nexport LLM_MAX_TOKENS=1024\n\n# RAG defaults (rag-service only)\nexport RAG_DEFAULT_TOP_K=5\nexport RAG_DEFAULT_MIN_SCORE=0.5\nexport RAG_MAX_CONTEXT_LENGTH=12000\n\n# Performance (batch-ingester only)\nexport TRANSFORM_WORKERS=4\nexport EMBEDDING_CHUNK_SIZE=900\nexport EMBEDDING_CHUNK_OVERLAP=120\n```\n\n## Authentication\n\nAll REST API endpoints (`/api/**`) on both services require authentication validated against Alfresco.\n\n### Supported Methods\n\n| Method | Example |\n|--------|---------|\n| **Basic Auth** | `curl -u admin:password http://localhost:9090/api/sync/status` |\n| **Ticket (query)** | `curl \"http://localhost:9090/api/sync/status?alf_ticket=TICKET_xxx\"` |\n| **Ticket (header)** | `curl -H \"Authorization: Basic BASE64(TICKET_xxx)\" ...` |\n\n**Note:** Bearer token authentication (OAuth2/OIDC with Keycloak) is not yet supported.\n\n### Quick Example\n\n```bash\n# Authenticate and start sync\ncurl -X POST http://localhost:9090/api/sync/configured \\\n  -u admin:admin\n\n# Or use Alfresco ticket\nTICKET=$(curl -X POST http://localhost:8080/alfresco/api/-default-/public/authentication/versions/1/tickets \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"userId\":\"admin\",\"password\":\"admin\"}' | jq -r '.entry.id')\n\ncurl -X POST \"http://localhost:9090/api/sync/configured?alf_ticket=$TICKET\"\n```\n\n## API Usage\n\n### Batch Ingester (port 9090)\n\n#### Start Synchronization\n\n```bash\n# Sync configured folders\ncurl -X POST http://localhost:9090/api/sync/configured -u admin:admin\n\n# Sync specific folder\ncurl -X POST http://localhost:9090/api/sync/batch \\\n  -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"folders\": [\"node-id\"], \"recursive\": true, \"types\": [\"cm:content\"]}'\n```\n\n#### Monitor Progress\n\n```bash\n# 
Overall status\ncurl http://localhost:9090/api/sync/status -u admin:admin\n\n# Job-specific status\ncurl http://localhost:9090/api/sync/status/{jobId} -u admin:admin\n```\n\n#### Query Node Status\n\n```bash\n# Single node\ncurl http://localhost:9090/api/content-lake/nodes/{nodeId}/status -u admin:admin\n\n# Bulk node list\ncurl -X POST http://localhost:9090/api/content-lake/nodes/status \\\n  -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"nodeIds\":[\"node-id-1\",\"node-id-2\"]}'\n\n# Optional: include aggregated subtree status for folders\ncurl -X POST http://localhost:9090/api/content-lake/nodes/status \\\n  -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"nodeIds\":[\"folder-id\"],\"includeFolderAggregate\":true}'\n\n# Optional: same aggregation for single-folder lookup\ncurl \"http://localhost:9090/api/content-lake/nodes/{folderId}/status?includeFolderAggregate=true\" \\\n  -u admin:admin\n```\n\n### RAG Service (port 9091)\n\n#### RAG Prompt\n\nAsk a question and get an LLM-generated answer grounded in your Alfresco documents:\n\n```bash\ncurl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{ \"question\": \"What are the key findings in the Q4 report?\" }'\n```\n\nWith options:\n\n```bash\ncurl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"question\": \"Summarize the budget proposal\",\n    \"topK\": 10,\n    \"minScore\": 0.6,\n    \"includeContext\": true\n  }'\n```\n\nMulti-turn conversation (same `sessionId`):\n\n```bash\n# Turn 1\ncurl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"sessionId\": \"demo-session-1\",\n    \"question\": \"Summarize the Q4 report highlights\"\n  }'\n\n# Turn 2 (follow-up resolved with history)\ncurl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \\\n  
-H \"Content-Type: application/json\" \\\n  -d '{\n    \"sessionId\": \"demo-session-1\",\n    \"question\": \"Can you expand on the second point?\"\n  }'\n```\n\nResponse:\n\n```json\n{\n  \"answer\": \"The Q4 report highlights a 12% revenue increase...\",\n  \"question\": \"What are the key findings in the Q4 report?\",\n  \"sessionId\": \"demo-session-1\",\n  \"retrievalQuery\": \"what are the key findings in the q4 report\",\n  \"historyTurnsUsed\": 2,\n  \"model\": \"ai/gpt-oss\",\n  \"tokenCount\": 672,\n  \"searchTimeMs\": 245,\n  \"generationTimeMs\": 1830,\n  \"totalTimeMs\": 2075,\n  \"sourcesUsed\": 3,\n  \"sources\": [\n    {\n      \"documentId\": \"abc-123\",\n      \"nodeId\": \"e4f5a6b7-...\",\n      \"name\": \"Q4-Financial-Report.pdf\",\n      \"path\": \"/Company Home/Reports/Q4-Financial-Report.pdf\",\n      \"chunkText\": \"Revenue for Q4 increased by 12%...\",\n      \"score\": 0.87\n    }\n  ]\n}\n```\n\n| Field | Type | Default | Description |\n|-------|------|---------|-------------|\n| `question` | String | *required* | Natural-language question |\n| `sessionId` | String | user-scoped default | Conversation session id for multi-turn context |\n| `resetSession` | boolean | false | Clear conversation history for the target session before this prompt |\n| `topK` | int | 5 | Number of chunks to retrieve for context |\n| `minScore` | double | 0.5 | Minimum similarity threshold |\n| `filter` | String | — | Additional HXQL filter |\n| `systemPrompt` | String | — | Override the default LLM system prompt |\n| `includeContext` | boolean | false | Include retrieved chunks in response |\n\n| Response Field | Type | Description |\n|---------------|------|-------------|\n| `sessionId` | String | Effective session id used by server |\n| `retrievalQuery` | String | Query actually sent to retrieval (may be reformulated) |\n| `historyTurnsUsed` | Integer | Number of prior turns included in this generation |\n| `tokenCount` | Integer | Total token usage 
(prompt + completion) when provider reports it |\n\n#### Chat Stream (SSE)\n\nStreaming responses are available with Server-Sent Events (SSE).\n\n- Canonical endpoint: `GET /api/rag/chat/stream`\n- Backward-compatible endpoint: `POST /api/rag/chat/stream` (same JSON body as `/api/rag/prompt`)\n- Content type: `text/event-stream`\n- Authentication: same as other `/api/rag/**` endpoints (Basic Auth or Alfresco ticket)\n\n`GET` example:\n\n```bash\ncurl -N -G http://localhost:9091/api/rag/chat/stream -u admin:admin \\\n  --data-urlencode \"question=What changed in Q4?\" \\\n  --data-urlencode \"sessionId=demo-session-1\" \\\n  --data-urlencode \"resetSession=false\" \\\n  --data-urlencode \"topK=5\" \\\n  --data-urlencode \"minScore=0.5\"\n```\n\nCompatibility `POST` example:\n\n```bash\ncurl -N -X POST http://localhost:9091/api/rag/chat/stream -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"question\": \"What changed in Q4?\",\n    \"sessionId\": \"demo-session-1\",\n    \"topK\": 5,\n    \"minScore\": 0.5\n  }'\n```\n\nQuery params for `GET`:\n\n| Field | Type | Default | Description |\n|-------|------|---------|-------------|\n| `question` | String | *required* | Natural-language question |\n| `sessionId` | String | user-scoped default | Conversation session id for multi-turn context |\n| `resetSession` | boolean | false | Clear conversation history before this prompt |\n| `topK` | int | 5 | Number of chunks to retrieve for context |\n| `minScore` | double | 0.5 | Minimum similarity threshold |\n| `filter` | String | — | Additional HXQL filter |\n| `embeddingType` | String | model default | Embedding type to match |\n| `systemPrompt` | String | — | Override the default LLM system prompt |\n| `includeContext` | boolean | false | Include retrieved chunks in final metadata |\n\nSSE events:\n\n- `event: token` incremental token payload (`{\"token\":\"...\"}`)\n- `event: metadata` final payload with `RagPromptResponse` fields including 
`sources`, timing fields, `model`, and `tokenCount`\n- `event: done` terminal success event\n- `event: error` terminal failure event with error message\n\nExample stream:\n\n```text\nevent: token\ndata: {\"token\":\"Revenue \"}\n\nevent: token\ndata: {\"token\":\"grew 12% in Q4.\"}\n\nevent: metadata\ndata: {\"answer\":\"Revenue grew 12% in Q4.\",\"question\":\"What changed in Q4?\",\"model\":\"ai/gpt-oss\",\"tokenCount\":672,\"searchTimeMs\":245,\"generationTimeMs\":1830,\"totalTimeMs\":2075,\"sourcesUsed\":3,\"sources\":[{\"documentId\":\"abc-123\",\"nodeId\":\"e4f5a6b7-...\",\"name\":\"Q4-Financial-Report.pdf\",\"path\":\"/Company Home/Reports/Q4-Financial-Report.pdf\",\"chunkText\":\"Revenue for Q4 increased by 12%...\",\"score\":0.87}]}\n\nevent: done\ndata: {\"status\":\"ok\"}\n```\n\nError stream example:\n\n```text\nevent: error\ndata: {\"message\":\"Failed to prepare RAG stream: ...\"}\n```\n\n#### Semantic Search\n\nSearch directly against the embedded chunks without LLM generation:\n\n```bash\ncurl -X POST http://localhost:9091/api/rag/search/semantic -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{ \"query\": \"a girl falls in a crater\", \"topK\": 5, \"minScore\": 0.6 }'\n```\n\nSemantic search applies a minimum similarity score to suppress low-quality vector matches when no strong semantic relation exists.\n\n* Default value: `0.5`\n* Applied server-side after vector retrieval\n* Can be overridden per request\n\n#### Hybrid Search\n\nRun vector + keyword retrieval and fuse results with `rrf` (default) or `weighted` scoring:\n\n```bash\ncurl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"query\": \"budget approval process\",\n    \"strategy\": \"rrf\",\n    \"candidateCount\": 20,\n    \"maxResults\": 5,\n    \"metadata\": {\n      \"mimeType\": \"application/pdf\",\n      \"pathPrefix\": \"/Company Home/Sites/finance/documentLibrary\",\n      
\"modifiedAfter\": \"2026-01-01T00:00:00Z\",\n      \"modifiedBefore\": \"2026-12-31T23:59:59Z\",\n      \"properties\": {\n        \"cm:title\": \"Budget\"\n      }\n    }\n  }'\n```\n\nStructured metadata filters are optional. You can still pass a raw HXQL `filter` for advanced cases.\n\nResponse example:\n\n```json\n{\n  \"query\": \"budget approval process\",\n  \"strategy\": \"rrf\",\n  \"normalization\": \"max\",\n  \"model\": \"ai/mxbai-embed-large\",\n  \"resultCount\": 2,\n  \"vectorCandidates\": 20,\n  \"keywordCandidates\": 18,\n  \"searchTimeMs\": 143,\n  \"results\": [\n    {\n      \"rank\": 1,\n      \"score\": 0.0325,\n      \"chunkText\": \"The budget approval workflow starts with...\",\n      \"vectorScore\": 0.87,\n      \"keywordScore\": 1.0,\n      \"vectorRank\": 2,\n      \"keywordRank\": 1\n    }\n  ]\n}\n```\n\n| Field | Type | Default | Description |\n|-------|------|---------|-------------|\n| `query` | String | *required* | Query for both vector and keyword legs |\n| `strategy` | String | `rrf` | Fusion strategy: `rrf` or `weighted` |\n| `normalization` | String | `max` | Weighted score normalization: `max` or `minmax` |\n| `candidateCount` | int | `20` | Candidates retrieved from each leg before fusion |\n| `maxResults` | int | `5` | Final fused result limit |\n| `vectorWeight` | double | `0.7` | Weight when `strategy=weighted` |\n| `textWeight` | double | `0.3` | Weight when `strategy=weighted` |\n| `filter` | String | — | Additional raw HXQL filter |\n| `metadata.mimeType` | String | — | MIME type filter (for example `application/pdf`) |\n| `metadata.pathPrefix` | String | — | Path prefix filter (starts-with match) |\n| `metadata.modifiedAfter` | String | — | Inclusive lower bound for `alfresco_modifiedAt` |\n| `metadata.modifiedBefore` | String | — | Inclusive upper bound for `alfresco_modifiedAt` |\n| `metadata.properties` | Map\u003cString,String\u003e | — | Exact-match filters on `cin_ingestProperties.\u003ckey\u003e` |\n\n| 
Response Field | Type | Description |\n|---------------|------|-------------|\n| `query` | String | Original query |\n| `strategy` | String | Effective fusion strategy used |\n| `normalization` | String | Normalization mode used when `strategy=weighted` |\n| `model` | String | Embedding model used for vector search |\n| `resultCount` | int | Number of fused results returned |\n| `vectorCandidates` | int | Number of vector candidates retrieved |\n| `keywordCandidates` | int | Number of keyword candidates retrieved |\n| `searchTimeMs` | long | Total hybrid search execution time |\n| `results[].score` | double | Fused score (RRF or weighted) |\n| `results[].vectorScore` | Double | Raw vector score, if available |\n| `results[].keywordScore` | Double | Raw keyword score, if available |\n| `results[].sourceDocument` | object | Source document metadata |\n| `results[].chunkMetadata` | object | Chunk position/type metadata |\n\n##### Integration Smoke Test (local hxpr)\n\nUse this checklist to validate issue #14 end-to-end:\n\n1. Ensure at least one folder is ingested into hxpr via batch/live ingesters.\n2. Call hybrid search without metadata constraints and verify `resultCount \u003e 0`.\n3. Call hybrid search with a restrictive metadata filter (for example `mimeType: application/pdf`) and confirm results narrow.\n4. 
Switch strategy to `weighted` and confirm response field `strategy` is `weighted`.\n\nExample smoke-test requests:\n\n```bash\n# Baseline\ncurl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\":\"budget approval process\",\"strategy\":\"rrf\",\"candidateCount\":20,\"maxResults\":5}'\n\n# Restrictive metadata\ncurl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\":\"budget approval process\",\"strategy\":\"rrf\",\"metadata\":{\"mimeType\":\"application/pdf\"}}'\n\n# Weighted strategy\ncurl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\":\"budget approval process\",\"strategy\":\"weighted\",\"normalization\":\"minmax\",\"vectorWeight\":0.7,\"textWeight\":0.3}'\n```\n\n### Health Checks\n\n```bash\n# Batch ingester (no auth required)\ncurl http://localhost:9090/actuator/health\n\n# Live ingester (no auth required)\ncurl http://localhost:9092/actuator/health\n\n# RAG service (no auth required)\ncurl http://localhost:9091/actuator/health\n\n# RAG service detailed health (auth required)\ncurl http://localhost:9091/api/rag/health -u admin:admin\n```\n\n### Live Ingester (port 9092)\n\nThe live ingester consumes Alfresco Event2 messages from ActiveMQ using Alfresco Java SDK handler interfaces such as `OnNodeUpdatedEventHandler` and `OnPermissionUpdatedEventHandler`.\n\nIt reuses the same shared ingestion pipeline as the batch ingester:\n\n- Fetch the current node snapshot from Alfresco REST API\n- Apply scope and exclusion rules\n- Sync metadata to hxpr\n- Extract text with Transform Service\n- Chunk and embed with Spring AI\n- Update permissions or delete when nodes move out of scope\n\nThe live path is guarded by the same `alfresco_modifiedAt` staleness check used by batch ingestion, so batch and live runs can coexist 
safely.\n\nStatus endpoint:\n\n```bash\ncurl http://localhost:9092/api/live/status\n```\n\n## Configuration\n\n### Ingestion\n\nEdit `batch-ingester/src/main/resources/application.yml`:\n\n```yaml\ningestion:\n  sources:\n    - folder: your-folder-node-id\n      recursive: true\n      types: [cm:content]\n  exclude:\n    paths: [\"*/surf-config/*\", \"*/thumbnails/*\"]\n    aspects: [cm:workingcopy]\n```\n\n### Live Ingestion\n\nEdit `live-ingester/src/main/resources/application.yml`:\n\n```yaml\nspring:\n  activemq:\n    broker-url: ${ACTIVEMQ_URL:tcp://localhost:61616}\n    user: ${ACTIVEMQ_USER:admin}\n    password: ${ACTIVEMQ_PASSWORD:admin}\n  jms:\n    cache:\n      enabled: false\n\nalfresco:\n  events:\n    topic-name: ${ALFRESCO_EVENT_TOPIC:alfresco.repo.event2}\n    enable-handlers: true\n    enable-spring-integration: false\n\nlive-ingester:\n  filter:\n    exclude-paths: [\"*/surf-config/*\", \"*/thumbnails/*\"]\n    exclude-aspects: [cm:workingcopy]\n  scope:\n    include-paths: []\n    required-aspects: []\n  dedup:\n    window: ${LIVE_INGESTER_DEDUP_WINDOW:PT2M}\n    max-entries: ${LIVE_INGESTER_DEDUP_MAX_ENTRIES:10000}\n```\n\nNotes:\n\n- `spring.jms.cache.enabled=false` is required so the Alfresco Java SDK can use the native ActiveMQ connection factory.\n- By default, the live ingester behaves as an exclude-only listener. 
Set `include-paths` or `required-aspects` to narrow the scope.\n- Transform Service receives the original Alfresco filename when available, improving binary format detection during text extraction.\n\n### RAG\n\nEdit `rag-service/src/main/resources/application.yml`:\n\n```yaml\nspring:\n  ai:\n    openai:\n      chat:\n        options:\n          model: ${LLM_MODEL:ai/gpt-oss}\n          temperature: ${LLM_TEMPERATURE:0.3}\n          maxTokens: ${LLM_MAX_TOKENS:1024}\n\nrag:\n  default-top-k: 5\n  default-min-score: 0.5\n  max-context-length: 12000\n  default-system-prompt: \u003e\n    You are a document assistant that answers questions based strictly on\n    the provided context.\n\n    RULES:\n    1. Use ONLY information from the DOCUMENT CONTEXT below. Do not use prior knowledge.\n    2. When referencing information, cite the source using its label (e.g. \"According to Source 1...\").\n    3. If multiple sources contain relevant information, synthesize them and cite each.\n    4. If the context does not contain enough information to fully answer the question,\n    clearly state what you can answer and what is missing.\n    5. Be concise and direct. 
Do not repeat the question or add unnecessary preamble.\n  conversation:\n    enabled: true\n    max-history-turns: 10\n    session-ttl-minutes: 30\n    query-reformulation: true\n\nsemantic-search:\n  default-min-score: 0.5\n\nsearch:\n  hybrid:\n    enabled: true\n    strategy: rrf       # or weighted\n    normalization: max  # max or minmax (weighted strategy)\n    vector-weight: 0.7\n    text-weight: 0.3\n    initial-candidates: 20\n    final-results: 5\n    rrf-k: 60\n    default-min-score: 0.0\n```\n\nConversation memory storage:\n\n- Default implementation is in-memory.\n- To use Redis or a database, provide a custom Spring bean implementing `ConversationMemoryStore`; the default in-memory store is only created when no other `ConversationMemoryStore` bean exists.\n\n## Roadmap\n\n### Next (Q2 2026 - Open Source Release)\n\n- [ ] Harden live-ingester with end-to-end Event2 coverage and operational guidance\n- [ ] OAuth2/Keycloak integration\n- [ ] Comprehensive testing suite\n- [ ] Production deployment guide\n\n### Future\n\n- [ ] Streaming responses (SSE) for progressive answer generation\n- [ ] Conversation history / multi-turn chat sessions\n- [ ] Re-ranking with cross-encoder models\n- [ ] Multiple embedding models per document\n- [ ] Document versioning support\n- [ ] DocFilters integration (better text extraction)\n- [ ] Multilingual embeddings\n- [ ] Performance optimizations for 10K+ documents\n\n## Development\n\n### Build\n\n```bash\nmvn clean package\n```\n\n### Run Tests\n\n```bash\nmvn test\n```\n\n### Run Locally\n\n```bash\n# Batch Ingester\nmvn spring-boot:run -pl batch-ingester\n# or\njava -jar batch-ingester/target/batch-ingester-1.0.0-SNAPSHOT.jar\n\n# Live Ingester\nmvn spring-boot:run -pl live-ingester\n# or\njava -jar live-ingester/target/live-ingester-1.0.0-SNAPSHOT.jar\n\n# RAG Service\nmvn spring-boot:run -pl rag-service\n# or\njava -jar rag-service/target/rag-service-1.0.0-SNAPSHOT.jar\n```\n\n## Contributing\n\nContributions 
welcome! Please:\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit changes (`git commit -m 'feat: add amazing feature'`)\n4. Push to branch (`git push origin feature/amazing-feature`)\n5. Open a Pull Request\n\n## Acknowledgments\n\n- Built with [Spring AI](https://spring.io/projects/spring-ai)\n- Uses [Alfresco Java SDK](https://github.com/Alfresco/alfresco-java-sdk)\n- Powered by [hxpr Content Lake](https://www.hyland.com/)\n- Created for the Alfresco/Hyland community\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faborroy%2Falfresco-content-lake","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faborroy%2Falfresco-content-lake","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faborroy%2Falfresco-content-lake/lists"}