https://github.com/codelibs/fess-llm-ollama

An Ollama-based LLM integration for AI-powered Fess features.
Last synced: 11 days ago
Host: GitHub
URL: https://github.com/codelibs/fess-llm-ollama
Owner: codelibs
License: apache-2.0
Created: 2026-03-04T22:50:23.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-05-01T05:24:18.000Z (about 2 months ago)
Last Synced: 2026-05-01T07:19:58.179Z (about 2 months ago)
Language: Java
Homepage:
Size: 69.3 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Ollama LLM Plugin for Fess
==========================

## Overview

This plugin provides Ollama integration for Fess's RAG (Retrieval-Augmented Generation) features. It enables Fess to use locally hosted Ollama models for AI-powered search capabilities including intent detection, answer generation, document summarization, and FAQ handling.

## Download

See [Maven Repository](https://repo1.maven.org/maven2/org/codelibs/fess/fess-llm-ollama/).

## Requirements

- Fess 15.x or later

- Java 21 or later

- Ollama server running locally or accessible via network

## Installation

1. Download the plugin JAR from the Maven Repository

2. Place it in your Fess plugin directory

3. Restart Fess

For detailed instructions, see the [Plugin Administration Guide](https://fess.codelibs.org/14.19/admin/plugin-guide.html).

## Configuration

Configure the following properties in `fess_config.properties`:

| Property | Default | Description |
|----------|---------|-------------|
| `rag.llm.name` | - | Set to `ollama` to use this plugin |
| `rag.chat.enabled` | `false` | Enable RAG chat feature |
| `rag.llm.ollama.api.url` | `http://localhost:11434` | Ollama server root URL. The plugin appends `/api/chat` and `/api/tags`, so a trailing `/` or `/api` (the form shown in the Ollama docs, e.g. `http://localhost:11434/api` or `https://ollama.com/api`) is stripped automatically. |
| `rag.llm.ollama.answer.context.max.chars` | `10000` | Maximum characters for document context in answer generation |
| `rag.llm.ollama.availability.check.interval` | `60` | Interval (seconds) for checking Ollama server availability |
| `rag.llm.ollama.chat.evaluation.max.relevant.docs` | `3` | Maximum number of relevant documents for evaluation |
| `rag.llm.ollama.connect.timeout` | `5000` | TCP connect timeout (ms). Separate from `rag.llm.ollama.timeout` (read/response). |
| `rag.llm.ollama.default.max.tokens` | (unset) | Fallback used when the per-prompt-type `max.tokens` property is not set. |
| `rag.llm.ollama.default.temperature` | (unset) | Fallback used when the per-prompt-type `temperature` property is not set. |
| `rag.llm.ollama.default.thinking.budget` | (unset) | Fallback used when the per-prompt-type `thinking.budget` property is not set. |
| `rag.llm.ollama.faq.context.max.chars` | `6000` | Maximum characters for document context in FAQ generation |
| `rag.llm.ollama.model` | `gemma4:e4b` | Model name (e.g., `llama3:latest`, `mistral`) |
| `rag.llm.ollama.retry.base.delay.ms` | `2000` | Base delay (ms) for exponential backoff with ±20% jitter. |
| `rag.llm.ollama.retry.max` | `3` | Maximum total attempts on retryable HTTP errors (429/500/502/503/504) and connect-time IOExceptions. |
| `rag.llm.ollama.summary.context.max.chars` | `10000` | Maximum characters for document context in summary generation |
| `rag.llm.ollama.timeout` | `60000` | Response/read timeout (ms). For TCP connect timeout see `rag.llm.ollama.connect.timeout`. |
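
The URL handling described for `rag.llm.ollama.api.url` can be pictured with the following minimal sketch. It is not the plugin's code: the class and method names (`OllamaUrlSketch`, `normalizeBaseUrl`) are hypothetical, and the exact stripping order is an assumption based only on the description in the table.

```java
final class OllamaUrlSketch {

    /**
     * Hypothetical helper: removes trailing slashes and one trailing "/api"
     * segment so that "/api/chat" or "/api/tags" can be appended safely.
     */
    static String normalizeBaseUrl(String url) {
        String s = stripTrailingSlashes(url.trim());
        if (s.endsWith("/api")) {
            s = s.substring(0, s.length() - "/api".length());
        }
        return stripTrailingSlashes(s);
    }

    private static String stripTrailingSlashes(String s) {
        while (s.endsWith("/")) {
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        // Both documented input forms resolve to the same chat endpoint.
        System.out.println(normalizeBaseUrl("http://localhost:11434/") + "/api/chat");
        System.out.println(normalizeBaseUrl("http://localhost:11434/api") + "/api/chat");
    }
}
```

With this approach, the root form and the `/api` form of the URL both produce the same request path.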

### Recommended num_ctx Setting

For `gemma4:e4b` on a GPU with 16 GB of memory, set:

```properties
rag.llm.ollama.default.num.ctx=8192
```

### Per-Prompt-Type Parameters

You can configure the `top_p` and `top_k` sampling parameters for each prompt type. Replace `<prompt-type>` with a prompt type such as `intent`, `answer`, `summary`, or `faq`:

| Property | Description |
|----------|-------------|
| `rag.llm.ollama.<prompt-type>.top.p` | Top-p (nucleus) sampling parameter |
| `rag.llm.ollama.<prompt-type>.top.k` | Top-k sampling parameter |
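
As an illustration of how a per-prompt-type property might be resolved, the sketch below looks up the prompt-specific key first and then the `rag.llm.ollama.default.*` key. The class and method names are hypothetical and the property values are made up. The README documents a `default` fallback for `max.tokens`, `temperature`, and `thinking.budget`; whether `top.p` and `top.k` also have one is not stated, so the generic fallback here is an assumption.

```java
import java.util.Optional;
import java.util.Properties;

final class PromptParamSketch {

    private static final String PREFIX = "rag.llm.ollama.";

    /** Hypothetical lookup: prompt-specific key first, then the "default" key. */
    static Optional<String> resolve(Properties props, String promptType, String param) {
        String specific = props.getProperty(PREFIX + promptType + "." + param);
        if (specific != null) {
            return Optional.of(specific);
        }
        return Optional.ofNullable(props.getProperty(PREFIX + "default." + param));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("rag.llm.ollama.answer.top.p", "0.9");        // illustrative value
        props.setProperty("rag.llm.ollama.default.max.tokens", "4096"); // illustrative value

        System.out.println(resolve(props, "answer", "top.p"));      // prompt-specific value
        System.out.println(resolve(props, "intent", "max.tokens")); // falls back to default
        System.out.println(resolve(props, "intent", "top.k"));      // nothing configured
    }
}
```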

## Retry behavior

Both `chat()` and `streamChat()` retry on:

- HTTP `429` (Too Many Requests; Ollama Cloud and rate-limited proxies)

- HTTP `500`, `502`, `503` (Ollama queue overload via `OLLAMA_MAX_QUEUE`), `504`

- `IOException` raised before a response is received (DNS, TCP, TLS, idle-socket failures)

Other `4xx` errors are surfaced as `LlmException` immediately.

Streaming retries only the initial HTTP request. Once NDJSON bytes start flowing, in-stream errors (HTTP transport failures **or** NDJSON `{"error": "..."}` payloads) propagate immediately to `LlmStreamCallback.onError(...)` and are not replayed.

The retry status set tracks the documented [Ollama errors](https://docs.ollama.com/api/errors).

Defaults can be overridden via `rag.llm.ollama.retry.max` and `rag.llm.ollama.retry.base.delay.ms`.
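
The retry policy above can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's implementation: the names are hypothetical, and the doubling factor per retry is assumed (the README only says "exponential backoff with ±20% jitter"). The jitter is passed in as a value in `[-1, 1]` so the calculation stays deterministic and easy to check.

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

final class RetrySketch {

    /** Status codes the README lists as retryable. */
    static final Set<Integer> RETRYABLE = Set.of(429, 500, 502, 503, 504);

    static boolean isRetryable(int status) {
        return RETRYABLE.contains(status);
    }

    /**
     * Delay before retry number {@code retry} (1-based).
     * Assumption: the delay doubles on each retry; jitterUnit in [-1, 1]
     * scales the result by up to ±20%.
     */
    static long backoffMillis(long baseDelayMs, int retry, double jitterUnit) {
        double exponential = baseDelayMs * Math.pow(2, retry - 1);
        return Math.round(exponential * (1.0 + 0.2 * jitterUnit));
    }

    public static void main(String[] args) {
        long baseDelayMs = 2000; // default of rag.llm.ollama.retry.base.delay.ms
        int maxAttempts = 3;     // default of rag.llm.ollama.retry.max (total attempts)
        // With 3 total attempts there are at most 2 waits between attempts.
        for (int retry = 1; retry < maxAttempts; retry++) {
            double jitter = ThreadLocalRandom.current().nextDouble(-1.0, 1.0);
            System.out.println("wait before retry " + retry + ": "
                    + backoffMillis(baseDelayMs, retry, jitter) + " ms");
        }
    }
}
```

Under these assumptions and the default settings, the first wait falls between 1600 and 2400 ms and the second between 3200 and 4800 ms.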

## Stream completion log

A single INFO line is emitted per `streamChat()` call:

```
[LLM:OLLAMA] Stream completed. chunkCount=N, objectCount=N, firstChunkMs=N,
  elapsedTime=Nms, doneReason=stop, totalDurationMs=N, loadDurationMs=N,
  promptEvalDurationMs=N, evalDurationMs=N, promptEvalCount=N, evalCount=N,
  tokensPerSecond=N.NN, parseErrorCount=0
```

A sibling WARN line is emitted when `done_reason` is anything other than `stop`, `load`, or `unload`. The most common case is `length` (context window truncation):

```
[LLM:OLLAMA] Stream finished abnormally. doneReason=length, evalCount=N, ...
```
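
For readers interpreting these log fields, here is a small sketch of how `tokensPerSecond` and the abnormal-finish check could be derived. The Ollama API reports `eval_count` as a token count and `eval_duration` in nanoseconds. Whether the plugin computes the value exactly this way is an assumption, as are the class and method names and the treatment of a missing `done_reason`.

```java
import java.util.Set;

final class StreamStatsSketch {

    /** done_reason values the README treats as normal completion. */
    static final Set<String> NORMAL_REASONS = Set.of("stop", "load", "unload");

    /** Tokens per second from Ollama's eval_count and eval_duration (nanoseconds). */
    static double tokensPerSecond(long evalCount, long evalDurationNanos) {
        if (evalDurationNanos <= 0) {
            return 0.0; // avoid division by zero when no duration was reported
        }
        return evalCount * 1_000_000_000.0 / evalDurationNanos;
    }

    /** Assumption: a missing done_reason is not flagged as abnormal. */
    static boolean isAbnormal(String doneReason) {
        return doneReason != null && !NORMAL_REASONS.contains(doneReason);
    }

    public static void main(String[] args) {
        // 100 tokens generated in 2 seconds of evaluation time.
        System.out.printf("tokensPerSecond=%.2f%n", tokensPerSecond(100, 2_000_000_000L));
        System.out.println("length abnormal? " + isAbnormal("length"));
    }
}
```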

## Reasoning Model Configuration (e.g., qwen3.5)

Reasoning models like `qwen3.5` use internal thinking tokens that improve answer quality but consume output tokens. Configure thinking per prompt type for optimal results.

```properties
rag.llm.ollama.model=qwen3.5:35b
rag.llm.ollama.timeout=120000

# Structured output / short responses - disable thinking
rag.llm.ollama.intent.thinking.budget=0
rag.llm.ollama.evaluation.thinking.budget=0
rag.llm.ollama.unclear.thinking.budget=0
rag.llm.ollama.noresults.thinking.budget=0
rag.llm.ollama.docnotfound.thinking.budget=0

# Answer generation - enable thinking with increased token limit
rag.llm.ollama.answer.thinking.budget=1
rag.llm.ollama.answer.max.tokens=16384
rag.llm.ollama.summary.thinking.budget=1
rag.llm.ollama.summary.max.tokens=16384
rag.llm.ollama.direct.thinking.budget=1
rag.llm.ollama.direct.max.tokens=8192
rag.llm.ollama.faq.thinking.budget=1
rag.llm.ollama.faq.max.tokens=8192
```

The `thinking.budget` parameter controls the Ollama `think` flag as a boolean:

- `0` — disable thinking (`think: false`)

- Any positive value — enable thinking (`think: true`)

- Not set — use model default (reasoning models default to thinking enabled)

When thinking is enabled, increase `max.tokens` to accommodate both thinking and content tokens.

### thinking.level (GPT-OSS and other models that ignore the boolean form)

Per [Ollama's thinking docs](https://docs.ollama.com/capabilities/thinking), the `think` field also accepts the string values `high`, `medium`, and `low`. GPT-OSS models in particular ignore the boolean form. Use `rag.llm.ollama.<prompt-type>.thinking.level` (or `rag.llm.ollama.default.thinking.level`) to send a string instead of a boolean:

```properties
rag.llm.ollama.model=gpt-oss:20b
rag.llm.ollama.answer.thinking.level=high
rag.llm.ollama.intent.thinking.level=low
```

When `thinking.level` is set, it overrides the boolean derived from `thinking.budget` for that prompt type. Allowed values: `high`, `medium`, `low` (case-insensitive). Invalid values are ignored with a WARN log and fall back to `thinking.budget`.
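
The precedence between `thinking.level` and `thinking.budget` can be summarized in a short sketch. The names below are hypothetical and this is not the plugin's code. Lowercasing the level before sending it and treating a negative budget as "disabled" are assumptions, since the README only defines `0` and positive values.

```java
import java.util.Locale;
import java.util.Set;

final class ThinkSketch {

    static final Set<String> LEVELS = Set.of("high", "medium", "low");

    /**
     * Returns the value for Ollama's "think" field: a String level,
     * a Boolean, or null when the field should be omitted (model default).
     */
    static Object resolveThink(Integer budget, String level) {
        if (level != null) {
            String normalized = level.trim().toLowerCase(Locale.ROOT);
            if (LEVELS.contains(normalized)) {
                return normalized; // a valid level overrides the budget-derived boolean
            }
            // Invalid level: the plugin logs a WARN; fall through to thinking.budget.
        }
        if (budget == null) {
            return null; // not set: leave the decision to the model
        }
        return budget > 0; // 0 disables thinking, any positive value enables it
    }

    public static void main(String[] args) {
        System.out.println(resolveThink(0, null));    // budget 0, no level
        System.out.println(resolveThink(1, "HIGH"));  // valid level wins over budget
        System.out.println(resolveThink(1, "bogus")); // invalid level falls back to budget
        System.out.println(resolveThink(null, null)); // nothing configured
    }
}
```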

## Features

- **Intent Detection** - Determines user intent (search, summary, FAQ, unclear) and generates Lucene queries

- **Answer Generation** - Generates answers based on search results with citation support

- **Document Summarization** - Summarizes specific documents

- **FAQ Handling** - Provides direct, concise answers to FAQ-type questions

- **Relevance Evaluation** - Identifies the most relevant documents for answer generation

- **Streaming Support** - Real-time response streaming via NDJSON format

- **Availability Checking** - Validates Ollama server and model availability at configurable intervals

## Ollama API Endpoints Used

- `GET /api/tags` - Lists available models for availability checking

- `POST /api/chat` - Performs chat completion (supports both standard and streaming modes)
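
To make the two endpoints concrete, the sketch below builds (but does not send) equivalent requests with the JDK's `java.net.http` API. It is an assumption that this resembles what the plugin does internally; the class and helper names are hypothetical, and the JSON body contains only the basic `model`, `messages`, and `stream` fields of Ollama's chat API. The escaping helper handles quotes and backslashes only, so it assumes plain text without control characters.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

final class OllamaRequestSketch {

    /** Minimal JSON string escaping (quotes and backslashes only). */
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Body for POST /api/chat with a single user message. */
    static String chatBody(String model, String userMessage, boolean stream) {
        return "{\"model\":" + jsonString(model)
                + ",\"messages\":[{\"role\":\"user\",\"content\":" + jsonString(userMessage) + "}]"
                + ",\"stream\":" + stream + "}";
    }

    /** GET /api/tags, used here to list the models the server knows about. */
    static HttpRequest tagsRequest(String baseUrl, Duration timeout) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/tags"))
                .timeout(timeout)
                .GET()
                .build();
    }

    /** POST /api/chat; with stream=true the server answers with NDJSON lines. */
    static HttpRequest chatRequest(String baseUrl, String body, Duration timeout) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/chat"))
                .timeout(timeout)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofMillis(60000); // default of rag.llm.ollama.timeout
        String body = chatBody("gemma4:e4b", "What is Fess?", true);
        HttpRequest chat = chatRequest("http://localhost:11434", body, timeout);
        System.out.println(chat.method() + " " + chat.uri());
        System.out.println(body);
    }
}
```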

## Development

### Building from Source

```bash
mvn clean package
```

### Running Tests

```bash
mvn test
```

## License

Apache License 2.0