An open API service indexing awesome lists of open source software.

https://github.com/codelibs/fess-llm-ollama

An Ollama-based LLM integration for AI-powered Fess features.
https://github.com/codelibs/fess-llm-ollama

Last synced: 11 days ago
JSON representation

An Ollama-based LLM integration for AI-powered Fess features.

Awesome Lists containing this project

README

          

Ollama LLM Plugin for Fess
==========================

## Overview

This plugin provides Ollama integration for Fess's RAG (Retrieval-Augmented Generation) features. It enables Fess to use locally hosted Ollama models for AI-powered search capabilities including intent detection, answer generation, document summarization, and FAQ handling.

## Download

See [Maven Repository](https://repo1.maven.org/maven2/org/codelibs/fess/fess-llm-ollama/).

## Requirements

- Fess 15.x or later
- Java 21 or later
- Ollama server running locally or accessible via network

## Installation

1. Download the plugin JAR from the Maven Repository
2. Place it in your Fess plugin directory
3. Restart Fess

For detailed instructions, see the [Plugin Administration Guide](https://fess.codelibs.org/14.19/admin/plugin-guide.html).

## Configuration

Configure the following properties in `fess_config.properties`:

| Property | Default | Description |
|----------|---------|-------------|
| `rag.llm.name` | - | Set to `ollama` to use this plugin |
| `rag.chat.enabled` | `false` | Enable RAG chat feature |
| `rag.llm.ollama.api.url` | `http://localhost:11434` | Ollama server root URL. The plugin appends `/api/chat` and `/api/tags`, so a trailing `/` or `/api` (the form shown in the Ollama docs, e.g. `http://localhost:11434/api` or `https://ollama.com/api`) is stripped automatically. |
| `rag.llm.ollama.answer.context.max.chars` | `10000` | Maximum characters for document context in answer generation |
| `rag.llm.ollama.availability.check.interval` | `60` | Interval (seconds) for checking Ollama server availability |
| `rag.llm.ollama.chat.evaluation.max.relevant.docs` | `3` | Maximum number of relevant documents for evaluation |
| `rag.llm.ollama.connect.timeout` | `5000` | TCP connect timeout (ms). Separate from `timeout` (read/response). |
| `rag.llm.ollama.default.max.tokens` | (unset) | Fallback when `.max.tokens` is not set. |
| `rag.llm.ollama.default.temperature` | (unset) | Fallback when `.temperature` is not set. |
| `rag.llm.ollama.default.thinking.budget` | (unset) | Fallback when `.thinking.budget` is not set. |
| `rag.llm.ollama.faq.context.max.chars` | `6000` | Maximum characters for document context in FAQ generation |
| `rag.llm.ollama.model` | `gemma4:e4b` | Model name (e.g., `llama3:latest`, `mistral`) |
| `rag.llm.ollama.retry.base.delay.ms` | `2000` | Base delay (ms) for exponential backoff with ±20% jitter. |
| `rag.llm.ollama.retry.max` | `3` | Maximum total attempts on retryable HTTP errors (429/500/502/503/504) and connect-time IOExceptions. |
| `rag.llm.ollama.summary.context.max.chars` | `10000` | Maximum characters for document context in summary generation |
| `rag.llm.ollama.timeout` | `60000` | Response/read timeout (ms). For TCP connect timeout see `rag.llm.ollama.connect.timeout`. |

### Recommended num_ctx Setting

For `gemma4:e4b` with 16GB GPU, set:

```properties
rag.llm.ollama.default.num.ctx=8192
```

### Per-Prompt-Type Parameters

You can configure `top_p` and `top_k` sampling parameters for each prompt type:

| Property | Description |
|----------|-------------|
| `rag.llm.ollama..top.p` | Top-p (nucleus) sampling parameter |
| `rag.llm.ollama..top.k` | Top-k sampling parameter |

## Retry behavior

Both `chat()` and `streamChat()` retry on:

- HTTP `429` (Too Many Requests; Ollama Cloud and rate-limited proxies)
- HTTP `500`, `502`, `503` (Ollama queue overload via `OLLAMA_MAX_QUEUE`), `504`
- `IOException` raised before a response is received (DNS, TCP, TLS, idle-socket failures)

Other `4xx` errors are surfaced as `LlmException` immediately.

Streaming retries only the initial HTTP request. Once NDJSON bytes start flowing,
in-stream errors (HTTP transport failures **or** NDJSON `{"error": "..."}` payloads)
propagate immediately to `LlmStreamCallback.onError(...)` — no replay.

The retry status set tracks the documented [Ollama errors](https://docs.ollama.com/api/errors).

Defaults can be overridden via `rag.llm.ollama.retry.max` and
`rag.llm.ollama.retry.base.delay.ms`.

## Stream completion log

A single INFO line is emitted per `streamChat()` call:

```
[LLM:OLLAMA] Stream completed. chunkCount=N, objectCount=N, firstChunkMs=N,
elapsedTime=Nms, doneReason=stop, totalDurationMs=N, loadDurationMs=N,
promptEvalDurationMs=N, evalDurationMs=N, promptEvalCount=N, evalCount=N,
tokensPerSecond=N.NN, parseErrorCount=0
```

A sibling WARN line is emitted when `done_reason` is anything other than `stop`,
`load`, or `unload` — most commonly `length` (context window truncation):

```
[LLM:OLLAMA] Stream finished abnormally. doneReason=length, evalCount=N, ...
```

## Reasoning Model Configuration (e.g., qwen3.5)

Reasoning models like `qwen3.5` use internal thinking tokens that improve answer quality
but consume output tokens. Configure thinking per prompt type for optimal results.

```properties
rag.llm.ollama.model=qwen3.5:35b
rag.llm.ollama.timeout=120000

# Structured output / short responses - disable thinking
rag.llm.ollama.intent.thinking.budget=0
rag.llm.ollama.evaluation.thinking.budget=0
rag.llm.ollama.unclear.thinking.budget=0
rag.llm.ollama.noresults.thinking.budget=0
rag.llm.ollama.docnotfound.thinking.budget=0

# Answer generation - enable thinking with increased token limit
rag.llm.ollama.answer.thinking.budget=1
rag.llm.ollama.answer.max.tokens=16384
rag.llm.ollama.summary.thinking.budget=1
rag.llm.ollama.summary.max.tokens=16384
rag.llm.ollama.direct.thinking.budget=1
rag.llm.ollama.direct.max.tokens=8192
rag.llm.ollama.faq.thinking.budget=1
rag.llm.ollama.faq.max.tokens=8192
```

The `thinking.budget` parameter controls the Ollama `think` flag as a boolean:
- `0` — disable thinking (`think: false`)
- Any positive value — enable thinking (`think: true`)
- Not set — use model default (reasoning models default to thinking enabled)

When thinking is enabled, increase `max.tokens` to accommodate both thinking and content tokens.

### thinking.level (GPT-OSS and other models that ignore the boolean form)

Per [Ollama's thinking docs](https://docs.ollama.com/capabilities/thinking), the `think`
field also accepts the string values `high`, `medium`, and `low`. GPT-OSS models in
particular ignore the boolean form. Use `rag.llm.ollama..thinking.level`
(or `rag.llm.ollama.default.thinking.level`) to send a string instead of a boolean:

```properties
rag.llm.ollama.model=gpt-oss:20b
rag.llm.ollama.answer.thinking.level=high
rag.llm.ollama.intent.thinking.level=low
```

When `thinking.level` is set, it overrides the boolean derived from `thinking.budget`
for that prompt type. Allowed values: `high`, `medium`, `low` (case-insensitive).
Invalid values are ignored with a WARN log and fall back to `thinking.budget`.

## Features

- **Intent Detection** - Determines user intent (search, summary, FAQ, unclear) and generates Lucene queries
- **Answer Generation** - Generates answers based on search results with citation support
- **Document Summarization** - Summarizes specific documents
- **FAQ Handling** - Provides direct, concise answers to FAQ-type questions
- **Relevance Evaluation** - Identifies the most relevant documents for answer generation
- **Streaming Support** - Real-time response streaming via NDJSON format
- **Availability Checking** - Validates Ollama server and model availability at configurable intervals

## Ollama API Endpoints Used

- `GET /api/tags` - Lists available models for availability checking
- `POST /api/chat` - Performs chat completion (supports both standard and streaming modes)

## Development

### Building from Source

```bash
mvn clean package
```

### Running Tests

```bash
mvn test
```

## License

Apache License 2.0