https://github.com/openpj/spring-manifold-next-gen
Spring-Manifold Next-Gen is an enterprise data integration and ingestion platform modeled after Apache ManifoldCF. It leverages modern Java 25 features (such as Structured Concurrency and Virtual Threads), Spring Boot, Kafka and vector search infrastructure to orchestrate data flows from various repository connectors to vector search outputs.
- Host: GitHub
- URL: https://github.com/openpj/spring-manifold-next-gen
- Owner: OpenPj
- License: apache-2.0
- Created: 2026-06-22T07:02:13.000Z (7 days ago)
- Default Branch: main
- Last Pushed: 2026-06-24T14:20:38.000Z (4 days ago)
- Last Synced: 2026-06-27T02:11:01.424Z (2 days ago)
- Topics: apache-kafka, etl, java, kafka, manifoldcf, ollama, pgvector, rag, react, spring-boot, structured-concurrency, vector-database, virtual-threads, vite
- Language: TypeScript
- Homepage:
- Size: 625 KB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
README
# Spring-Manifold Next-Gen
**Spring-Manifold Next-Gen** is an enterprise data integration and ingestion platform modeled after Apache ManifoldCF. It leverages modern Java 25 features (such as Structured Concurrency and Virtual Threads), Spring Boot, and vector search infrastructure to orchestrate data flows from various repository connectors to vector search outputs.
---
## Architecture Diagram
The diagram below shows the high-level architecture of Spring-Manifold Next-Gen:
```mermaid
graph TD
subgraph UI
UI_App[Admin React UI - sm-admin-ui]
end
subgraph Platform Runtime [Spring-Manifold JVM Runtime]
Core[Core Ingestion Engine - sm-core]
Runtime[Bootstrap - sm-runtime]
FS_Conn[Filesystem Repository - sm-filesystem-repository-connector]
Vec_Conn[Vector Output - sm-vector-output-connector]
Kafka_Cons[Ingestion Consumer - IngestionConsumer]
Runtime --> Core
Core --> FS_Conn
Core -->|Publish IngestionMessage| Kafka[(Kafka Topic: manifold-documents)]
Kafka -->|Consume Reference| Kafka_Cons
Kafka_Cons -->|Resolve Content & Process| Vec_Conn
end
subgraph Infrastructure [Docker Containers]
PG[(PostgreSQL + pgvector)]
Redis[(Redis Cache & Session)]
Ollama[Ollama AI Embeddings]
Kafka_Broker[Apache Kafka Broker]
end
UI_App -->|REST API| Runtime
Vec_Conn -->|Vectors| PG
Runtime -->|Job Cache| Redis
Vec_Conn -->|Generates Embeddings| Ollama
Kafka --> Kafka_Broker
```
---
## Core Technologies
- **Java 25**: Structured Concurrency (still a preview API in Java 25, hence `--enable-preview`), Virtual Threads (final since Java 21), and Pattern Matching.
- **Spring Boot & Spring AI**: High-performance backend orchestrating ingestion jobs.
- **Apache Kafka**: Decoupled, event-driven document processing using the **Claim Check Pattern**.
- **pgvector**: High-dimensional vector similarity search in PostgreSQL.
- **Redis Stack**: Lightweight caching and session management.
- **Ollama**: Local embedding generation with open-source models.
- **Vite + React + TailwindCSS**: Modern frontend administration dashboard.
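
As a rough illustration of the concurrency model listed above, the following sketch fans per-document work out onto virtual threads. It is not code from the repository: the class `VirtualThreadFanOut` and the methods `process` and `processAll` are hypothetical names, and the sketch deliberately uses only the stable `Executors.newVirtualThreadPerTaskExecutor()` API (Java 21+) rather than the preview structured concurrency API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; not taken from the project sources.
class VirtualThreadFanOut {

    // Stand-in for real per-document work (parsing, chunking, embedding requests).
    static int process(String document) {
        return document.length();
    }

    // Submits one task per document, each on its own virtual thread,
    // and collects the results in input order.
    static List<Integer> processAll(List<String> documents) {
        // close() at the end of the try block waits for submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String doc : documents) {
                futures.add(executor.submit(() -> process(doc)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                try {
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(List.of("a.txt", "report.pdf", "notes.md")));
    }
}
```

Virtual threads are cheap enough that one thread per document is a reasonable default for I/O-bound work such as reading files or calling an embedding service.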
---
## Getting Started
### Prerequisites
Ensure you have the following installed on your machine:
- **JDK 25** (Ensure `JAVA_HOME` points to your JDK 25 directory)
- **Maven 3.9+**
- **Docker & Docker Compose**
- **Node.js 18+ & npm** (for the UI)
---
### Step-by-Step Setup
#### 1. Start Infrastructure (Docker)
Spin up the database, cache, message broker, and AI engine. Run from the project root:
```bash
docker compose up -d
```
**Services started:**
* **PostgreSQL (Port 5432)**: For job metadata, schema migrations, and pgvector storage.
* **Redis (Port 6379; RedisInsight UI on Port 8001)**: For caching and session management.
* **Ollama (Port 11434)**: For local embeddings.
* **Apache Kafka (Port 9092)**: KRaft-mode broker for decoupled, event-driven document processing.
#### 2. Pull the Embedding Model (Ollama)
The platform is configured to use the `mxbai-embed-large` model for embeddings. You must pull it once:
```bash
docker exec -it ollama ollama pull mxbai-embed-large
```
*(You can exit the prompt with `Ctrl+D` once the download starts; Ollama will keep downloading in the background).*
#### 3. Build the Project (Maven)
Compile all modules with Java 25. The build relies on preview features, which `pom.xml` already enables via `--enable-preview`, so no extra flags are needed:
```bash
mvn clean install
```
#### 4. Run the Runtime Bootstrap
Start the Spring Boot runtime application:
```bash
mvn spring-boot:run -pl sm-runtime -Dspring-boot.run.profiles=dev
```
##### Running a Sample Ingestion Job on Startup (Optional)
By default, the automatic startup crawl is disabled to prevent unnecessary scans. To trigger a demo crawl job on startup, pass the following configuration properties as application arguments:
```bash
mvn spring-boot:run -pl sm-runtime -Dspring-boot.run.profiles=dev \
-Dspring-boot.run.arguments="--spring.manifold.crawl-on-startup=true --spring.manifold.scan-path=/your/local/directory/to/scan"
```
#### 5. Run the Admin UI
To launch the administration dashboard:
```bash
cd sm-admin-ui
npm install
npm run dev
```
Open [http://localhost:5173](http://localhost:5173) in your browser.
---
## Scaling Out & Performance
Spring-Manifold Next-Gen is designed for high throughput and horizontal scalability. Because the ingestion pipeline is decoupled using **Apache Kafka** and the **Claim Check Pattern**, you can scale components independently.
### 1. Scaling the Ingestion / Processing (Output Connector)
Vector indexing and embedding generation are typically the primary performance bottleneck, because they involve model inference (Ollama) and database indexing (pgvector).
* **Kafka Consumer Group Partitioning**: The `manifold-documents` topic is consumed by the `IngestionConsumer` inside the `sm-runtime` service. When the topic has multiple partitions, Kafka distributes them, and therefore the documents, among the active consumers in the group. The partition count caps the useful parallelism: consumers beyond that number stay idle.
* **Horizontal Scaling of Runtime Instances**: You can run multiple instances of the `sm-runtime` application that share the same `spring.application.name` and consumer group (`spring-manifold-vector-group`). Kafka rebalances partitions across the instances automatically, which spreads the message load.
* **Ollama Load Balancing**: Scale out embedding generation by pointing `spring.ai.ollama.base-url` to a load balancer (e.g., NGINX, HAProxy) backed by a cluster of Ollama instances running on GPU-enabled nodes.
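
The relationship between partition count and runtime instances can be sketched with a small simulation. This is only an illustration under simplifying assumptions: it uses plain round-robin assignment as a stand-in for Kafka's real partition assignors, makes no Kafka client calls, and the class name `PartitionAssignmentSketch` and the instance names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; simplified stand-in for Kafka's partition assignment.
class PartitionAssignmentSketch {

    // Assigns partitions 0..partitions-1 to the given consumers round-robin.
    static Map<String, List<Integer>> assign(int partitions, List<String> consumers) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("at least one consumer is required");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            String owner = consumers.get(p % consumers.size());
            assignment.get(owner).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Four runtime instances but only three partitions: one instance receives nothing.
        System.out.println(assign(3, List.of("runtime-1", "runtime-2", "runtime-3", "runtime-4")));
    }
}
```

In this simulation the fourth instance ends up with an empty assignment, which is why adding runtime instances only helps once the topic has at least as many partitions as instances.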
### 2. Scaling the Repository Connectors (Ingestion Source)
The scanning/crawling phase can be distributed by splitting large target sources:
* **Partitioned Scans**: Run separate bootstrap crawl jobs targeting different sub-directories or repository prefixes.
* **Distributed File Shares / Shared Storage**: In a multi-node setup, ensure the `IngestionConsumer` instances have access to the same shared filesystem (e.g., NFS, S3/MinIO bucket, SMB) as the repository crawlers, so the Claim Check reference (path/URI) can be successfully resolved by the consumer node.
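
One possible way to derive partitioned scans is to list the immediate sub-directories of a root and start one crawl job per entry. The sketch below assumes that approach; `ScanPartitioner` is a hypothetical helper, not part of the project, and it only prints candidate `--spring.manifold.scan-path` arguments rather than launching anything.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical illustration; not taken from the project sources.
class ScanPartitioner {

    // Returns the immediate sub-directories of root, sorted for stable output.
    static List<Path> partitions(Path root) {
        try (Stream<Path> children = Files.list(root)) {
            return children.filter(Files::isDirectory).sorted().toList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        // One candidate crawl job per sub-directory.
        for (Path partition : partitions(root)) {
            System.out.println("--spring.manifold.scan-path=" + partition.toAbsolutePath());
        }
    }
}
```

Files that sit directly in the root are not covered by any partition in this sketch, so a real setup would need an extra job for them.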
### 3. Claim Check Pattern
To ensure the messaging system remains fast and responsive:
1. The **Repository Connector** crawls data, but instead of publishing the entire document content (which could be megabytes of binary data) to Kafka, it keeps the file on a shared storage medium and refers to it by location.
2. It publishes a lightweight `IngestionMessage` (Claim Check record) to the Kafka topic containing the metadata (URI, file path, version).
3. The **Consumer Workers** pull the reference, read the file directly from storage, run splitting/chunking, request embeddings, and save the resulting vectors in pgvector.
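
The three steps above can be sketched in plain Java without a broker. This is a minimal illustration, not the project's implementation: the record `ClaimCheck` and its field names are assumptions modeled on the metadata listed in step 2 (the real `IngestionMessage` may differ), fixed-size character chunking stands in for whatever splitter the platform uses, and the Kafka publish, embedding, and pgvector steps are omitted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not taken from the project sources.
class ClaimCheckSketch {

    // Lightweight reference published instead of the document body.
    // Field names are assumptions; the project's IngestionMessage may differ.
    record ClaimCheck(String uri, String filePath, String version) {}

    // Producer side: build the reference without reading the file content.
    static ClaimCheck toClaimCheck(Path file, String version) {
        Path absolute = file.toAbsolutePath();
        return new ClaimCheck(absolute.toUri().toString(), absolute.toString(), version);
    }

    // Consumer side: resolve the reference from shared storage,
    // then split the text into fixed-size chunks.
    static List<String> resolveAndChunk(ClaimCheck check, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        String content;
        try {
            content = Files.readString(Path.of(check.filePath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < content.length(); start += chunkSize) {
            int end = Math.min(content.length(), start + chunkSize);
            chunks.add(content.substring(start, end));
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("claim-check", ".txt");
        Files.writeString(file, "abcdefghij");
        ClaimCheck check = toClaimCheck(file, "v1");
        System.out.println(check);
        System.out.println(resolveAndChunk(check, 4));
    }
}
```

Only the small `ClaimCheck` record would travel through the topic, so message size stays roughly constant regardless of how large the underlying document is.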
---
## Verification & Monitoring
- **Database**: Access PostgreSQL at `localhost:5432` (User: `manifold`, DB: `manifold`).
- **Redis Dashboard**: Open [http://localhost:8001](http://localhost:8001) in your browser to view the Redis Stack Insight dashboard.
- **Logs**: Monitor console output for the Virtual Thread Executor and Structured Concurrency task logs.
---
## Troubleshooting
- **Java Version Check**: Run `java -version` to confirm you are using Java 25.
- **Preview Features**: If your IDE fails to compile structured concurrency code, verify that the `--enable-preview` argument is set for both the compiler and the runtime. (The Maven build already configures it in `pom.xml`.)