{"id":51144249,"url":"https://github.com/openpj/spring-manifold-next-gen","last_synced_at":"2026-06-28T03:00:39.435Z","repository":{"id":366550715,"uuid":"1276647632","full_name":"OpenPj/spring-manifold-next-gen","owner":"OpenPj","description":"Spring-Manifold Next-Gen is an enterprise data integration and ingestion platform modeled after Apache ManifoldCF. It leverages modern Java 25 features (such as Structured Concurrency and Virtual Threads), Spring Boot, Kafka and vector search infrastructure to orchestrate data flows from various repository connectors to vector search outputs.","archived":false,"fork":false,"pushed_at":"2026-06-24T14:20:38.000Z","size":640,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-27T02:11:01.424Z","etag":null,"topics":["apache-kafka","etl","java","kafka","manifoldcf","ollama","pgvector","rag","react","spring-boot","structured-concurrency","vector-database","virtual-threads","vite"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenPj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-22T07:02:13.000Z","updated_at":"2026-06-24T14:24:37.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/OpenPj/spring-manifold-next-gen","commit_stats":null,"previous_names":["openpj/spring-
manifold-next-gen"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/OpenPj/spring-manifold-next-gen","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenPj%2Fspring-manifold-next-gen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenPj%2Fspring-manifold-next-gen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenPj%2Fspring-manifold-next-gen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenPj%2Fspring-manifold-next-gen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenPj","download_url":"https://codeload.github.com/OpenPj/spring-manifold-next-gen/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenPj%2Fspring-manifold-next-gen/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34875360,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-28T02:00:05.809Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-kafka","etl","java","kafka","manifoldcf","ollama","pgvector","rag","react","spring-boot","structured-concurrency","vector-database","virtual-threads","vite"],"created_at":"2026-06-26T01:30:48.452Z","updated_at":"2026-06-28T03:00:39.392Z","avatar_url":"https://g
ithub.com/OpenPj.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spring-Manifold Next-Gen\n\n[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)\n[![Java Version](https://img.shields.io/badge/Java-25-orange.svg)](https://jdk.java.net/25/)\n[![Spring Boot](https://img.shields.io/badge/Spring_Boot-3.4+-green.svg)](https://spring.io/projects/spring-boot)\n[![Docker](https://img.shields.io/badge/Docker-Supported-blue.svg)](https://www.docker.com/)\n\n**Spring-Manifold Next-Gen** is an enterprise data integration and ingestion platform modeled after Apache ManifoldCF. It leverages modern Java 25 features (such as Structured Concurrency and Virtual Threads), Spring Boot, and vector search infrastructure to orchestrate data flows from various repository connectors to vector search outputs.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/logo.png\" alt=\"Spring-Manifold Next-Gen Logo\" width=\"200\" /\u003e\n\u003c/p\u003e\n\n---\n\n## Architecture Diagram\n\nThe diagram below shows the high-level architecture of Spring-Manifold Next-Gen:\n\n```mermaid\ngraph TD\n    subgraph UI\n        UI_App[Admin React UI - sm-admin-ui]\n    end\n\n    subgraph Platform Runtime [Spring-Manifold JVM Runtime]\n        Core[Core Ingestion Engine - sm-core]\n        Runtime[Bootstrap - sm-runtime]\n        FS_Conn[Filesystem Repository - sm-filesystem-repository-connector]\n        Vec_Conn[Vector Output - sm-vector-output-connector]\n        Kafka_Cons[Ingestion Consumer - IngestionConsumer]\n        \n        Runtime --\u003e Core\n        Core --\u003e FS_Conn\n        Core --\u003e|Publish IngestionMessage| Kafka[(Kafka Topic: manifold-documents)]\n        Kafka --\u003e|Consume Reference| Kafka_Cons\n        Kafka_Cons --\u003e|Resolve Content \u0026 Process| Vec_Conn\n    end\n\n    subgraph Infrastructure [Docker Containers]\n        PG[(PostgreSQL + pgvector)]\n        Redis[(Redis Cache \u0026 
Session)]\n        Ollama[Ollama AI Embeddings]\n        Kafka_Broker[Apache Kafka Broker]\n    end\n\n    UI_App --\u003e|REST API| Runtime\n    Vec_Conn --\u003e|Vectors| PG\n    Runtime --\u003e|Job Cache| Redis\n    Vec_Conn --\u003e|Generates Embeddings| Ollama\n    Kafka --\u003e Kafka_Broker\n```\n\n---\n\n## Core Technologies\n\n- **Java 25 Preview Features**: Structured Concurrency, Virtual Threads, and Pattern Matching.\n- **Spring Boot \u0026 Spring AI**: High-performance backend orchestrating ingestion jobs.\n- **Apache Kafka**: Decoupled, event-driven document processing using the **Claim Check Pattern**.\n- **pgvector**: High-dimensional vector similarity search in PostgreSQL.\n- **Redis Stack**: Lightweight caching and session management.\n- **Ollama**: Local AI embedding generation via open-source LLM models.\n- **Vite + React + TailwindCSS**: Modern frontend administration dashboard.\n\n---\n\n## Getting Started\n\n### Prerequisites\n\nEnsure you have the following installed on your machine:\n- **JDK 25** (Ensure `JAVA_HOME` points to your JDK 25 directory)\n- **Maven 3.9+**\n- **Docker \u0026 Docker Compose**\n- **Node.js 18+ \u0026 npm** (for the UI)\n\n---\n\n### Step-by-Step Setup\n\n#### 1. Start Infrastructure (Docker)\nSpin up the database, cache, message broker, and AI engine. Run from the project root:\n```bash\ndocker compose up -d\n```\n**Services started:**\n* **PostgreSQL (Port 5432)**: For job metadata, schema migrations, and pgvector storage.\n* **Redis (Port 6379 / Insight Port 8001)**: For caching and session management.\n* **Ollama (Port 11434)**: For local embeddings.\n* **Apache Kafka (Port 9092)**: KRaft-mode broker for decoupled, event-driven document processing.\n\n#### 2. Pull the Embedding Model (Ollama)\nThe platform is configured to use the `mxbai-embed-large` model for embeddings. 
You must pull it once:\n```bash\ndocker exec -it ollama ollama pull mxbai-embed-large\n```\n*(You can exit the prompt with `Ctrl+D` once the download starts; Ollama will keep downloading in the background).*\n\n#### 3. Build the Project (Maven)\nCompile all modules with JDK 25. The build relies on preview features, which are already enabled in `pom.xml`:\n```bash\nmvn clean install\n```\n\n#### 4. Run the Runtime Bootstrap\nStart the Spring Boot runtime application:\n```bash\nmvn spring-boot:run -pl sm-runtime -Dspring-boot.run.profiles=dev\n```\n\n##### Running a Sample Ingestion Job on Startup (Optional)\nBy default, the automatic startup crawl is disabled to prevent unnecessary scans. To trigger a demo crawl job on startup, pass the configuration properties:\n```bash\nmvn spring-boot:run -pl sm-runtime -Dspring-boot.run.profiles=dev \\\n  -Dspring-boot.run.arguments=\"--spring.manifold.crawl-on-startup=true --spring.manifold.scan-path=/your/local/directory/to/scan\"\n```\n\n#### 5. Run the Admin UI\nTo launch the administration dashboard:\n```bash\ncd sm-admin-ui\nnpm install\nnpm run dev\n```\nOpen [http://localhost:5173](http://localhost:5173) in your browser.\n\n---\n\n## Scaling Out \u0026 Performance\n\nSpring-Manifold Next-Gen is designed for high-throughput, horizontal scalability. Since the ingestion pipeline is decoupled using **Apache Kafka** and the **Claim Check Pattern**, you can scale components independently.\n\n### 1. Scaling the Ingestion / Processing (Output Connector)\nVector indexing and embedding generation are typically the primary performance bottleneck because of deep learning model inference (Ollama) and database indexing (pgvector).\n* **Kafka Consumer Group Partitioning**: The `manifold-documents` topic is consumed by the `IngestionConsumer` inside the `sm-runtime` service. 
By configuring the topic with multiple partitions, Kafka will distribute documents among active consumers.\n* **Horizontal Scaling of Runtime Instances**: You can run multiple instances of the `sm-runtime` application sharing the same `spring.application.name` and consumer group (`spring-manifold-vector-group`). Kafka automatically distributes partitions and load-balances the messages.\n* **Ollama Load Balancing**: Scale out embedding generation by pointing `spring.ai.ollama.base-url` to a load balancer (e.g., NGINX, HAProxy) backed by a cluster of Ollama instances running on GPU-enabled nodes.\n\n### 2. Scaling the Repository Connectors (Ingestion Source)\nThe scanning/crawling phase can be distributed by splitting large target sources:\n* **Partitioned Scans**: Run separate bootstrap crawl jobs targeting different sub-directories or repository prefixes.\n* **Distributed File Shares / Shared Storage**: In a multi-node setup, ensure the `IngestionConsumer` instances have access to the same shared filesystem (e.g., NFS, S3/MinIO bucket, SMB) as the repository crawlers, so the Claim Check reference (path/URI) can be successfully resolved by the consumer node.\n\n### 3. Claim Check Pattern\nTo ensure the messaging system remains fast and responsive:\n1. The **Repository Connector** crawls data, but instead of publishing the entire document content (which could be megabytes of binary data) to Kafka, it saves/references the file on a shared storage medium.\n2. It publishes a lightweight `IngestionMessage` (Claim Check record) to the Kafka topic containing the metadata (URI, file path, version).\n3. 
The **Consumer Workers** pull the reference, read the file directly from storage, run splitting/chunking, request embeddings, and save the resulting vectors in pgvector.\n\n---\n\n## Verification \u0026 Monitoring\n\n- **Database**: Access PostgreSQL at `localhost:5432` (User: `manifold`, DB: `manifold`).\n- **Redis Dashboard**: Open [http://localhost:8001](http://localhost:8001) in your browser to view the Redis Stack Insight dashboard.\n- **Logs**: Monitor console output for the Virtual Thread Executor and Structured Concurrency task logs.\n\n---\n\n## Troubleshooting\n\n- **Java Version Check**: Run `java -version` to confirm you are using Java 25.\n- **Preview Features**: If your IDE fails to compile structured concurrency code, verify that the `--enable-preview` JVM argument is configured for compiler and runtime settings. (It is already pre-configured in `pom.xml`).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenpj%2Fspring-manifold-next-gen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenpj%2Fspring-manifold-next-gen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenpj%2Fspring-manifold-next-gen/lists"}