# Symaira-Seek

> Local-first, CGO-free document retrieval for AI agents with hybrid BM25+vector search.

[![CI](https://github.com/danieljustus/symaira-seek/actions/workflows/ci.yml/badge.svg)](https://github.com/danieljustus/symaira-seek/actions/workflows/ci.yml)
[![Go Version](https://img.shields.io/badge/go-1.26+-00ADD8?logo=go&logoColor=white)](https://go.dev/)
[![License: MIT](https://img.shields.io/badge/license-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Symaira-Seek is a local-first, CGO-free document retrieval tool designed for AI agents and developers. It provides hybrid search (BM25 keyword search combined with vector semantic search) and fuses results using Reciprocal Rank Fusion (RRF).

## Why Symaira-Seek?

- **100% CGO-free**: Pure Go SQLite driver (`modernc.org/sqlite`) — cross-compile anywhere without C dependencies
- **Hybrid search**: Combines BM25 keyword matching with vector semantic search for better relevance
- **Dual embedding modes**: Local Ollama integration for quality, deterministic fallback for offline usage
- **Multiple interfaces**: CLI, MCP server for AI agents, and HTTP REST daemon
- **Local-first**: Your data stays on your machine — no cloud dependencies required

Symaira-Seek exposes three interfaces:
1. **Command Line Interface (CLI)**: A Unix-friendly command utility.
2. **Model Context Protocol (MCP)**: Native stdio-based tool integration for AI agents (Claude, Cursor, ChatGPT, etc.).
3. **HTTP REST Daemon**: A lightweight localhost API with search and index endpoints.

## Tech Stack & Architecture

- **Language**: Pure Go (1.26+)
- **Database**: SQLite (via pure-Go `modernc.org/sqlite` to maintain **100% CGO-free compilation**)
- **Keyword Search**: SQLite FTS5 with BM25 ranking
- **Vector Search**: Cosine similarity calculations on normalized 768-dimensional float32 arrays
- **Result Fusion**: Reciprocal Rank Fusion (RRF) with parameter $k=60$
- **Embeddings**: Dual-mode generation:
  - **Local Ollama Integration**: Uses the `nomic-embed-text` model.
  - **Deterministic Local Fallback**: A word-hash vector generator that allows fully offline usage.
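
As a rough illustration of the vector search and fusion steps listed above, the sketch below computes cosine similarity as a plain dot product (valid only because the vectors are normalized) and fuses two ranked lists with RRF, scoring each document as the sum of `1/(k + rank)` over the lists it appears in. The function names `dot` and `fuseRRF`, the 1-based rank convention, the tie-breaking rule, and the sample file names are assumptions for illustration, not Symaira-Seek's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// dot returns the dot product of two equal-length vectors.
// For vectors already normalized to unit length, this equals
// their cosine similarity.
func dot(a, b []float32) float32 {
	var sum float32
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// fuseRRF merges ranked lists of document IDs (best first).
// Each list contributes 1/(k + rank) per document, with rank starting at 1
// (an assumed convention).
func fuseRRF(k float64, rankings ...[]string) []string {
	scores := make(map[string]float64)
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	// Sort by descending score; break ties by ID for deterministic output.
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j]
	})
	return ids
}

func main() {
	bm25 := []string{"a.md", "b.md", "c.md"}   // hypothetical keyword ranking
	vector := []string{"b.md", "d.md", "a.md"} // hypothetical vector ranking
	fmt.Println(fuseRRF(60, bm25, vector))
	fmt.Println(dot([]float32{1, 0}, []float32{0.6, 0.8}))
}
```

With these inputs, `b.md` ranks first because it sits near the top of both lists, while `c.md` and `d.md` each appear in only one list and fall to the bottom.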

---

## Installation & Setup

### Homebrew (macOS/Linux)

Install via Homebrew tap:

```bash
brew tap danieljustus/tap
brew install symseek
```

### Pre-built Binaries (Recommended)

Download the latest release for your platform from [GitHub Releases](https://github.com/danieljustus/symaira-seek/releases):

- **Linux**: `symaira-seek_Linux_x86_64.tar.gz` or `symaira-seek_Linux_arm64.tar.gz`
- **macOS**: `symaira-seek_Darwin_x86_64.tar.gz` or `symaira-seek_Darwin_arm64.tar.gz`
- **Windows**: `symaira-seek_Windows_x86_64.zip` or `symaira-seek_Windows_arm64.zip`

Extract and install:
```bash
# Linux/macOS
tar -xzf symaira-seek_*.tar.gz
chmod +x symseek
sudo mv symseek /usr/local/bin/

# Windows
# Extract the .zip and add to PATH
```

Verify the installation:
```bash
symseek version
```

### Build from Source

Ensure you have [Go](https://go.dev/) 1.26 or later installed.

```bash
go build -o symseek cmd/symseek/main.go
```

To inject a version string at build time, set `main.version` via `-ldflags`. The CI workflow derives the value from the current git tag (or a `0.0.0-dev+` fallback) and passes it automatically:
```bash
VERSION="0.2.0"
go build -ldflags "-s -w -X main.version=${VERSION}" -o symseek cmd/symseek/main.go
./symseek version
```
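
For context, the linker's `-X` flag can only overwrite a package-level string variable, so the command above implies a `version` variable in package `main`. A minimal sketch of that pattern follows; the default value and the output format of `versionString` are hypothetical, not taken from the project.

```go
package main

import "fmt"

// version is a hypothetical default; a build with
// -ldflags "-X main.version=0.2.0" replaces it at link time.
var version = "0.0.0-dev"

// versionString formats the value a `version` subcommand might print
// (the exact format here is an assumption).
func versionString() string {
	return "symseek " + version
}

func main() {
	fmt.Println(versionString())
}
```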

### Run Tests
```bash
go test -v ./...
```

---

## CLI Usage

### Index a Directory
Crawl and index all Markdown, text, code, JSON, and YAML files inside a folder:
```bash
./symseek index /path/to/my-documents
```

#### Watch Daemon
With `--watch`, `symseek` keeps running and re-synchronizes the index every 5 seconds, picking up created, modified, and deleted files:
```bash
./symseek index /path/to/my-documents --watch
```
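
The README does not say how `--watch` detects changes. One plausible approach, sketched here purely as an assumption, is to snapshot file modification times on each 5-second tick and diff consecutive snapshots. The `snapshot` type and `diff` function are hypothetical names.

```go
package main

import "fmt"

// snapshot maps a file path to its modification time in Unix nanoseconds.
type snapshot map[string]int64

// diff compares two snapshots and reports which paths were
// created, modified, or deleted between them.
func diff(prev, curr snapshot) (created, modified, deleted []string) {
	for path, mod := range curr {
		old, ok := prev[path]
		switch {
		case !ok:
			created = append(created, path)
		case mod != old:
			modified = append(modified, path)
		}
	}
	for path := range prev {
		if _, ok := curr[path]; !ok {
			deleted = append(deleted, path)
		}
	}
	return created, modified, deleted
}

func main() {
	prev := snapshot{"a.md": 100, "b.md": 100} // state at the previous tick
	curr := snapshot{"a.md": 200, "c.md": 100} // state at the current tick
	created, modified, deleted := diff(prev, curr)
	fmt.Println(created, modified, deleted)
}
```

In a real watcher, each snapshot could be filled by walking the directory with `filepath.WalkDir` and recording `ModTime().UnixNano()` for every file.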

### Search Documents
Perform a hybrid semantic and keyword search:
```bash
./symseek search "renewable energy optimization" --limit 5
```

Export structured search results directly to JSON:
```bash
./symseek search "renewable energy optimization" --json
```

### Get Database Stats
```bash
./symseek status
```

Export the same stats as JSON for monitoring pipelines:
```bash
./symseek status --json
```

### Configuration

`symseek` stores its configuration as TOML in `~/.config/symseek/config.toml` (overridable with `--config`). A legacy `config.json` from older versions is migrated to TOML automatically on first run (and via `./symseek migrate`).

View the active configuration (the path is printed to stderr):
```bash
./symseek config
```

Set a value without editing the file by hand:
```bash
./symseek config --set-key ollama_url --set-value http://localhost:11434/api/embeddings
./symseek config --set-key model --set-value mxbai-embed-large
```

The file is rewritten with mode `0600` on every write. Supported keys:

| Key | Description | Default |
| --- | --- | --- |
| `ollama_url` | Ollama embeddings endpoint URL | `http://localhost:11434/api/embeddings` |
| `model` | Embedding model name | `nomic-embed-text` |
| `timeout_seconds` | Per-request Ollama timeout (seconds) | `120` |
| `retry_count` | Number of Ollama retries on failure | `2` |
| `retry_backoff_ms` | Initial retry backoff (milliseconds) | `500` |
| `index_cooldown_seconds` | Cooldown between `/index` requests on the HTTP daemon | `5` |

---

## MCP Server Integration

To use Symaira-Seek as an MCP tool for AI clients (like Claude Desktop or Cursor), register it in your client's configuration file:

```json
{
  "mcpServers": {
    "symaira-seek": {
      "command": "/absolute/path/to/symaira-seek/symseek",
      "args": ["serve"]
    }
  }
}
```

### Exposed Tools
1. `search_documents(query, limit)`: Hybrid search over all indexed files.
2. `read_document(path)`: Retrieves the complete content of an indexed file.
3. `list_documents(folder)`: Lists indexed documents, optionally filtered by folder prefix.
4. `get_context(topic, max_chars)`: Aggregates relevant context blocks from multiple documents.
5. `index_document(path)`: Manually indexes a local file or directory.
6. `index_url(url)`: Indexes content from a URL.

### Tool Examples

#### search_documents
```json
{
  "query": "renewable energy optimization",
  "limit": 5
}
```

Returns formatted search results with file paths, chunk indices, and RRF scores.

#### read_document
```json
{
  "path": "/home/user/documents/report.md"
}
```

Returns the full text content of an indexed file.

#### list_documents
```json
{
  "folder": "/home/user/documents"
}
```

Lists all indexed documents, optionally filtered by folder prefix.

#### get_context
```json
{
  "topic": "machine learning",
  "max_chars": 4000
}
```

Aggregates relevant context blocks from multiple documents for a given topic.

#### index_document
```json
{
  "path": "/home/user/documents"
}
```

Indexes a local file or directory immediately.

#### index_url
```json
{
  "url": "https://example.com/article"
}
```

Fetches and indexes content from a URL.
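
MCP clients normally handle the wire format for you, but it can help to see where the argument objects above end up. In MCP, a tool invocation is a JSON-RPC 2.0 request with method `tools/call`, carrying the tool name and its arguments under `params`. The Go sketch below builds such a message for `search_documents`; the type and function names are illustrative, and a real session also needs the `initialize` handshake first.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a JSON-RPC 2.0 request envelope as used by MCP.
type rpcRequest struct {
	JSONRPC string     `json:"jsonrpc"`
	ID      int        `json:"id"`
	Method  string     `json:"method"`
	Params  callParams `json:"params"`
}

// callParams carries the tool name and its argument object.
type callParams struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

// newToolCall builds a tools/call request for the named tool.
func newToolCall(id int, tool string, args map[string]any) ([]byte, error) {
	return json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "tools/call",
		Params:  callParams{Name: tool, Arguments: args},
	})
}

func main() {
	msg, err := newToolCall(1, "search_documents", map[string]any{
		"query": "renewable energy optimization",
		"limit": 5,
	})
	if err != nil {
		panic(err)
	}
	// Over the stdio transport, each message is written as one line of JSON.
	fmt.Println(string(msg))
}
```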

---

## HTTP REST Daemon

Start the REST API on port `8788`:
```bash
./symseek serve --port 8788
```

### Endpoints
- **GET** `/health`: Check status (`{"status": "ok"}`).
- **GET** `/status`: Returns document counts, chunk counts, and database file size.
- **GET** `/search?q=query&limit=5`: Query the hybrid search engine.
- **POST** `/index` with body `{"path": "/absolute/path"}`: Synchronously crawl and index a folder.

### Endpoint Examples

#### Health Check
```bash
curl http://localhost:8788/health
# Response: {"status": "ok"}
```

#### Get Status
```bash
curl http://localhost:8788/status
# Response: {"documents": 10, "chunks": 50, "db_size": "1.2 MB"}
```

#### Search Documents
```bash
curl "http://localhost:8788/search?q=machine+learning&limit=5"
# Response: Array of search results with file paths and scores
```

#### Index Documents
```bash
curl -X POST http://localhost:8788/index \
  -H "Content-Type: application/json" \
  -d '{"path": "/home/user/documents"}'
# Response: {"status": "indexed", "path": "/home/user/documents"}
```
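
The same search request can be issued from Go. The sketch below is a minimal client under stated assumptions: the helper names `searchURL` and `search` are hypothetical, and because the fields of each result are not documented here, results are kept as raw JSON rather than decoded into a struct. A stand-in `httptest` server lets the example run without a daemon.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strconv"
)

// searchURL builds the /search URL with a properly escaped query string.
func searchURL(base, query string, limit int) string {
	v := url.Values{}
	v.Set("q", query)
	v.Set("limit", strconv.Itoa(limit))
	return base + "/search?" + v.Encode()
}

// search calls the endpoint and returns each result as raw JSON,
// since the result schema is not specified in this README.
func search(base, query string, limit int) ([]json.RawMessage, error) {
	resp, err := http.Get(searchURL(base, query, limit))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("search failed: %s", resp.Status)
	}
	var results []json.RawMessage
	if err := json.NewDecoder(resp.Body).Decode(&results); err != nil {
		return nil, err
	}
	return results, nil
}

func main() {
	// Stand-in server with a placeholder payload; set base to
	// http://localhost:8788 to query a running daemon instead.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `[{"placeholder": true}]`)
	}))
	defer srv.Close()

	results, err := search(srv.URL, "machine learning", 5)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(results), "result(s)")
}
```

Building the query with `url.Values` avoids hand-escaping spaces and special characters in the search text.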

---

## License

MIT License. Part of the Symaira tool suite.