https://github.com/codelibs/fess-llm-ollama
An Ollama-based LLM integration for AI-powered Fess features.
- Host: GitHub
- URL: https://github.com/codelibs/fess-llm-ollama
- Owner: codelibs
- License: apache-2.0
- Created: 2026-03-04T22:50:23.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-05-01T05:24:18.000Z (about 2 months ago)
- Last Synced: 2026-05-01T07:19:58.179Z (about 2 months ago)
- Language: Java
- Homepage:
- Size: 69.3 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Ollama LLM Plugin for Fess
==========================
## Overview
This plugin provides Ollama integration for Fess's RAG (Retrieval-Augmented Generation) features. It enables Fess to use locally hosted Ollama models for AI-powered search capabilities including intent detection, answer generation, document summarization, and FAQ handling.
## Download
See [Maven Repository](https://repo1.maven.org/maven2/org/codelibs/fess/fess-llm-ollama/).
## Requirements
- Fess 15.x or later
- Java 21 or later
- Ollama server running locally or accessible via network
## Installation
1. Download the plugin JAR from the Maven Repository
2. Place it in your Fess plugin directory
3. Restart Fess
For detailed instructions, see the [Plugin Administration Guide](https://fess.codelibs.org/14.19/admin/plugin-guide.html).
## Configuration
Configure the following properties in `fess_config.properties`:
| Property | Default | Description |
|----------|---------|-------------|
| `rag.llm.name` | - | Set to `ollama` to use this plugin |
| `rag.chat.enabled` | `false` | Enable RAG chat feature |
| `rag.llm.ollama.api.url` | `http://localhost:11434` | Ollama server root URL. The plugin appends `/api/chat` and `/api/tags`, so a trailing `/` or `/api` (the form shown in the Ollama docs, e.g. `http://localhost:11434/api` or `https://ollama.com/api`) is stripped automatically. |
| `rag.llm.ollama.answer.context.max.chars` | `10000` | Maximum characters for document context in answer generation |
| `rag.llm.ollama.availability.check.interval` | `60` | Interval (seconds) for checking Ollama server availability |
| `rag.llm.ollama.chat.evaluation.max.relevant.docs` | `3` | Maximum number of relevant documents for evaluation |
| `rag.llm.ollama.connect.timeout` | `5000` | TCP connect timeout (ms). Separate from `timeout` (read/response). |
| `rag.llm.ollama.default.max.tokens` | (unset) | Fallback when `rag.llm.ollama.<promptType>.max.tokens` is not set. |
| `rag.llm.ollama.default.temperature` | (unset) | Fallback when `rag.llm.ollama.<promptType>.temperature` is not set. |
| `rag.llm.ollama.default.thinking.budget` | (unset) | Fallback when `rag.llm.ollama.<promptType>.thinking.budget` is not set. |
| `rag.llm.ollama.faq.context.max.chars` | `6000` | Maximum characters for document context in FAQ generation |
| `rag.llm.ollama.model` | `gemma4:e4b` | Model name (e.g., `llama3:latest`, `mistral`) |
| `rag.llm.ollama.retry.base.delay.ms` | `2000` | Base delay (ms) for exponential backoff with ±20% jitter. |
| `rag.llm.ollama.retry.max` | `3` | Maximum total attempts on retryable HTTP errors (429/500/502/503/504) and connect-time IOExceptions. |
| `rag.llm.ollama.summary.context.max.chars` | `10000` | Maximum characters for document context in summary generation |
| `rag.llm.ollama.timeout` | `60000` | Response/read timeout (ms). For TCP connect timeout see `rag.llm.ollama.connect.timeout`. |
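
The URL handling described for `rag.llm.ollama.api.url` can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from the plugin source; the sketch only shows one way a trailing `/` or `/api` could be stripped before `/api/chat` and `/api/tags` are appended.

```java
/** Hypothetical helper; the plugin's real class and method names may differ. */
final class OllamaUrlSketch {
    private OllamaUrlSketch() {
    }

    /** Strips trailing slashes and one trailing "/api" segment from the configured URL. */
    static String normalizeBaseUrl(String configured) {
        String url = stripTrailingSlashes(configured.trim());
        if (url.endsWith("/api")) {
            url = url.substring(0, url.length() - "/api".length());
        }
        return stripTrailingSlashes(url);
    }

    /** Endpoint used for chat completion. */
    static String chatEndpoint(String configured) {
        return normalizeBaseUrl(configured) + "/api/chat";
    }

    /** Endpoint used for the availability check. */
    static String tagsEndpoint(String configured) {
        return normalizeBaseUrl(configured) + "/api/tags";
    }

    private static String stripTrailingSlashes(String url) {
        int end = url.length();
        while (end > 0 && url.charAt(end - 1) == '/') {
            end--;
        }
        return url.substring(0, end);
    }
}
```

Under this sketch, `http://localhost:11434`, `http://localhost:11434/`, and `http://localhost:11434/api` all resolve to the same chat endpoint.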
### Recommended num_ctx Setting
For `gemma4:e4b` on a 16 GB GPU, set:
```properties
rag.llm.ollama.default.num.ctx=8192
```
### Per-Prompt-Type Parameters
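You can configure the `top_p` and `top_k` sampling parameters for each prompt type. In the property names below, `<promptType>` is the prompt type name, such as `intent`, `answer`, `summary`, or `faq`:
| Property | Description |
|----------|-------------|
| `rag.llm.ollama.<promptType>.top.p` | Top-p (nucleus) sampling parameter |
| `rag.llm.ollama.<promptType>.top.k` | Top-k sampling parameter |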
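
The configuration table describes `default.*` properties as fallbacks for per-prompt-type values such as `max.tokens`. The lookup order can be sketched as follows. This is an illustration only: `PromptTypeConfigSketch` and `resolve` are hypothetical names, and reading from a plain `java.util.Properties` object is an assumption rather than how Fess exposes its configuration.

```java
import java.util.Optional;
import java.util.Properties;

/** Hypothetical lookup helper; not the plugin's actual API. */
final class PromptTypeConfigSketch {
    private static final String PREFIX = "rag.llm.ollama.";

    private PromptTypeConfigSketch() {
    }

    /**
     * Returns the per-prompt-type value if present, otherwise the
     * "default" value, otherwise an empty Optional.
     */
    static Optional<String> resolve(Properties props, String promptType, String suffix) {
        String specific = props.getProperty(PREFIX + promptType + "." + suffix);
        if (specific != null && !specific.isBlank()) {
            return Optional.of(specific.trim());
        }
        String fallback = props.getProperty(PREFIX + "default." + suffix);
        if (fallback != null && !fallback.isBlank()) {
            return Optional.of(fallback.trim());
        }
        return Optional.empty();
    }
}
```

For example, with `rag.llm.ollama.answer.max.tokens=16384` and `rag.llm.ollama.default.max.tokens=4096`, `resolve(props, "answer", "max.tokens")` yields `16384`, while `resolve(props, "intent", "max.tokens")` yields `4096`.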
## Retry Behavior
Both `chat()` and `streamChat()` retry on:
- HTTP `429` (Too Many Requests; Ollama Cloud and rate-limited proxies)
- HTTP `500`, `502`, `503` (Ollama queue overload via `OLLAMA_MAX_QUEUE`), `504`
- `IOException` raised before a response is received (DNS, TCP, TLS, idle-socket failures)
Other `4xx` errors are surfaced as `LlmException` immediately.
Streaming retries only the initial HTTP request. Once NDJSON bytes start flowing,
in-stream errors (HTTP transport failures **or** NDJSON `{"error": "..."}` payloads)
propagate immediately to `LlmStreamCallback.onError(...)` — no replay.
The retry status set tracks the documented [Ollama errors](https://docs.ollama.com/api/errors).
Defaults can be overridden via `rag.llm.ollama.retry.max` and
`rag.llm.ollama.retry.base.delay.ms`.
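
The retry rules above can be sketched as a status check plus a delay calculation. The names here are hypothetical, and the doubling factor per attempt is an assumption: the README states only that the backoff is exponential with ±20% jitter, not which multiplier is used.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical retry helper; the multiplier of 2 per attempt is an assumption. */
final class RetryPolicySketch {
    private RetryPolicySketch() {
    }

    /** HTTP statuses listed as retryable in this README. */
    static boolean isRetryableStatus(int status) {
        return status == 429 || status == 500 || status == 502
                || status == 503 || status == 504;
    }

    /**
     * Delay before the retry that follows the given 1-based attempt.
     * jitterUnit must lie in [-1, 1]; it is scaled to the ±20% window.
     */
    static long backoffDelayMs(long baseDelayMs, int attempt, double jitterUnit) {
        double exponential = baseDelayMs * Math.pow(2, attempt - 1);
        return Math.round(exponential * (1.0 + 0.2 * jitterUnit));
    }

    /** Same as above, drawing the jitter from a uniform random source. */
    static long randomBackoffDelayMs(long baseDelayMs, int attempt) {
        double jitterUnit = ThreadLocalRandom.current().nextDouble(-1.0, 1.0);
        return backoffDelayMs(baseDelayMs, attempt, jitterUnit);
    }
}
```

With the default base delay of `2000` ms, this sketch waits between 1600 and 2400 ms after the first failed attempt and between 3200 and 4800 ms after the second.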
## Stream Completion Log
A single INFO line is emitted per `streamChat()` call:
```
[LLM:OLLAMA] Stream completed. chunkCount=N, objectCount=N, firstChunkMs=N,
elapsedTime=Nms, doneReason=stop, totalDurationMs=N, loadDurationMs=N,
promptEvalDurationMs=N, evalDurationMs=N, promptEvalCount=N, evalCount=N,
tokensPerSecond=N.NN, parseErrorCount=0
```
A sibling WARN line is emitted when `done_reason` is anything other than `stop`,
`load`, or `unload` — most commonly `length` (context window truncation):
```
[LLM:OLLAMA] Stream finished abnormally. doneReason=length, evalCount=N, ...
```
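
The WARN condition reduces to a small predicate over `done_reason`. The sketch below uses a hypothetical class name, and its handling of a missing `done_reason` is an assumption, since the README does not say what happens in that case.

```java
import java.util.Set;

/** Hypothetical predicate; not taken from the plugin source. */
final class DoneReasonSketch {
    // Values the README lists as not triggering the WARN line.
    private static final Set<String> NORMAL_REASONS = Set.of("stop", "load", "unload");

    private DoneReasonSketch() {
    }

    /** Returns true when the stream should be reported as finished abnormally. */
    static boolean isAbnormal(String doneReason) {
        if (doneReason == null) {
            // Assumption: a missing done_reason is not flagged.
            return false;
        }
        return !NORMAL_REASONS.contains(doneReason);
    }
}
```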
## Reasoning Model Configuration (e.g., qwen3.5)
Reasoning models like `qwen3.5` use internal thinking tokens that improve answer quality
but consume output tokens. Configure thinking per prompt type for optimal results.
```properties
rag.llm.ollama.model=qwen3.5:35b
rag.llm.ollama.timeout=120000
# Structured output / short responses - disable thinking
rag.llm.ollama.intent.thinking.budget=0
rag.llm.ollama.evaluation.thinking.budget=0
rag.llm.ollama.unclear.thinking.budget=0
rag.llm.ollama.noresults.thinking.budget=0
rag.llm.ollama.docnotfound.thinking.budget=0
# Answer generation - enable thinking with increased token limit
rag.llm.ollama.answer.thinking.budget=1
rag.llm.ollama.answer.max.tokens=16384
rag.llm.ollama.summary.thinking.budget=1
rag.llm.ollama.summary.max.tokens=16384
rag.llm.ollama.direct.thinking.budget=1
rag.llm.ollama.direct.max.tokens=8192
rag.llm.ollama.faq.thinking.budget=1
rag.llm.ollama.faq.max.tokens=8192
```
The `thinking.budget` parameter controls the Ollama `think` flag as a boolean:
- `0` — disable thinking (`think: false`)
- Any positive value — enable thinking (`think: true`)
- Not set — use model default (reasoning models default to thinking enabled)
When thinking is enabled, increase `max.tokens` to accommodate both thinking and content tokens.
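
The mapping from `thinking.budget` to the `think` flag can be sketched as below. The class and method names are hypothetical. Treating a negative budget as unset is an assumption, because the README covers only `0`, positive values, and the unset case.

```java
/** Hypothetical mapping helper; not the plugin's actual code. */
final class ThinkFlagSketch {
    private ThinkFlagSketch() {
    }

    /**
     * Maps a configured thinking budget to the boolean "think" flag.
     * Returns null when the flag should be omitted so that the model
     * default applies.
     */
    static Boolean thinkFromBudget(Integer thinkingBudget) {
        if (thinkingBudget == null) {
            return null; // not set: use the model default
        }
        if (thinkingBudget == 0) {
            return Boolean.FALSE; // think: false
        }
        if (thinkingBudget > 0) {
            return Boolean.TRUE; // think: true
        }
        return null; // assumption: negative values are treated as unset
    }
}
```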
### thinking.level (GPT-OSS and other models that ignore the boolean form)
Per [Ollama's thinking docs](https://docs.ollama.com/capabilities/thinking), the `think`
field also accepts the string values `high`, `medium`, and `low`. GPT-OSS models in
particular ignore the boolean form. Use `rag.llm.ollama.<promptType>.thinking.level`
(or `rag.llm.ollama.default.thinking.level`) to send a string instead of a boolean:
```properties
rag.llm.ollama.model=gpt-oss:20b
rag.llm.ollama.answer.thinking.level=high
rag.llm.ollama.intent.thinking.level=low
```
When `thinking.level` is set, it overrides the boolean derived from `thinking.budget`
for that prompt type. Allowed values: `high`, `medium`, `low` (case-insensitive).
Invalid values are ignored with a WARN log and fall back to `thinking.budget`.
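
The precedence between `thinking.level` and `thinking.budget` can be sketched as a single resolution step. Everything named here is hypothetical. Sending the level in lowercase and writing the warning to `System.err` are assumptions made for the illustration; the plugin uses its own logger.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical resolver for the value sent in the "think" field. */
final class ThinkingLevelSketch {
    private static final Set<String> LEVELS = Set.of("high", "medium", "low");

    private ThinkingLevelSketch() {
    }

    /**
     * Returns a String level, a Boolean derived from the budget, or null
     * when the "think" field should be omitted.
     */
    static Object resolveThink(String level, Integer budget) {
        if (level != null) {
            String normalized = level.trim().toLowerCase(Locale.ROOT);
            if (LEVELS.contains(normalized)) {
                return normalized; // a valid level overrides the budget
            }
            // Stand-in for the WARN log mentioned in the README.
            System.err.println("WARN invalid thinking.level ignored: " + level);
        }
        if (budget == null || budget < 0) {
            return null; // unset (negative treated as unset by assumption)
        }
        return Boolean.valueOf(budget > 0);
    }
}
```

In this sketch, `resolveThink("HIGH", 0)` returns the string `high`, while `resolveThink("extreme", 1)` logs a warning and falls back to `true`.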
## Features
- **Intent Detection** - Determines user intent (search, summary, FAQ, unclear) and generates Lucene queries
- **Answer Generation** - Generates answers based on search results with citation support
- **Document Summarization** - Summarizes specific documents
- **FAQ Handling** - Provides direct, concise answers to FAQ-type questions
- **Relevance Evaluation** - Identifies the most relevant documents for answer generation
- **Streaming Support** - Real-time response streaming via NDJSON format
- **Availability Checking** - Validates Ollama server and model availability at configurable intervals
## Ollama API Endpoints Used
- `GET /api/tags` - Lists available models for availability checking
- `POST /api/chat` - Performs chat completion (supports both standard and streaming modes)
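
For orientation, the two requests can be built with the JDK's `java.net.http` API as sketched below. This is not the plugin's implementation: the class name is hypothetical, the HTTP client the plugin actually uses is not stated in this README, and the request body is limited to the `model`, `messages`, and `stream` fields of the Ollama chat API.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/** Hypothetical request builder using the JDK HTTP client types. */
final class OllamaRequestSketch {
    private OllamaRequestSketch() {
    }

    /** GET /api/tags, used to list models for the availability check. */
    static HttpRequest tagsRequest(String baseUrl, Duration timeout) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/tags"))
                .timeout(timeout)
                .GET()
                .build();
    }

    /** POST /api/chat with a single user message. */
    static HttpRequest chatRequest(String baseUrl, String model, String userMessage,
            boolean stream, Duration timeout) {
        String body = "{\"model\":\"" + escapeJson(model) + "\","
                + "\"messages\":[{\"role\":\"user\",\"content\":\""
                + escapeJson(userMessage) + "\"}],"
                + "\"stream\":" + stream + "}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/chat"))
                .timeout(timeout)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    /** Minimal JSON string escaping for quotes, backslashes, and control characters. */
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

The resulting requests can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. With `stream` set to `true`, a line-oriented body handler such as `HttpResponse.BodyHandlers.ofLines()` is a better fit for the NDJSON response.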
## Development
### Building from Source
```bash
mvn clean package
```
### Running Tests
```bash
mvn test
```
## License
Apache License 2.0