https://github.com/pmarreck/docscan
Document indexing and semantic search for .md, .docx, .pdf, .doc — Zig core + C FFI + hybrid vector/lexical search
https://github.com/pmarreck/docscan
Last synced: about 2 months ago
JSON representation
Document indexing and semantic search for .md, .docx, .pdf, .doc — Zig core + C FFI + hybrid vector/lexical search
- Host: GitHub
- URL: https://github.com/pmarreck/docscan
- Owner: pmarreck
- License: mit
- Created: 2026-04-09T18:59:26.000Z (2 months ago)
- Default Branch: yolo
- Last Pushed: 2026-04-09T21:19:52.000Z (2 months ago)
- Last Synced: 2026-04-09T21:30:00.855Z (2 months ago)
- Language: Zig
- Size: 21.7 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# docscan
[](https://github.com/pmarreck/docscan/actions/workflows/ci.yml)
[](https://garnix.io/repo/pmarreck/docscan)
Document indexing and semantic search for `.md`, `.txt`, `.docx`, `.pdf`, `.doc`, `.rtf`, and `.epub` files.
## What it does
docscan indexes a collection of documents, extracts structured text with heading detection, chunks by document structure, embeds via a local model (Ollama or oMLX), and provides hybrid vector + lexical search from the CLI or via MCP for LLM integration.
## Architecture
```
C CLI (I/O, Ollama/oMLX, files) → C FFI → Zig Core (pure computation)
├── Parsers (md, docx, pdf, doc, rtf)
├── Chunker (structure-aware)
├── Storage (SQLite + sqlite-vec + FTS5)
└── Search (hybrid vector + BM25 with RRF)
```
- **Zig core** is pure computation — no I/O, receives byte slices, returns structured results
- **C FFI** is the public API boundary — flat C functions with opaque handles
- **C CLI** handles all I/O: file reading, embedding server calls, progress bars, output formatting
- **MCP server** over stdio for LLM integration (JSON-RPC 2.0)
## Quick start
```bash
# Build
nix build
# or
./build
# Index a directory
docscan index ~/Documents/contracts/
# Search (exact text match, no embedding server needed)
docscan search --exact "indemnification clause"
# Search (hybrid semantic + lexical, needs Ollama or oMLX)
docscan search "liability protection provisions"
# Check index status
docscan status
# Extract text from a document (no database needed)
docscan extract document.pdf
docscan extract --markdown report.docx
docscan extract --json contract.md
docscan extract --from 5 --to 10 book.pdf
cat file.md | docscan extract --format md -
# Normalize text (fix word splits, ligatures, hyphenation)
echo "kn own" | docscan normalize
```
## Embedding backends
docscan supports both Ollama and OpenAI-compatible embedding servers (oMLX, LM Studio, vLLM, etc.).
```bash
# Ollama (default)docscan index ~/docs/ --model nomic-embed-text
# oMLX / OpenAI-compatible
docscan index ~/docs/ \
--embedding-api openai \
--embedding-url http://localhost:8000 \
--embedding-api-key YOUR_KEY \
--model bge-m3-mlx-fp16
```
Settings are saved to `.docscan/config.ini` on first index, so subsequent commands pick them up automatically.
Environment variables: `DOCSCAN_MODEL`, `DOCSCAN_EMBEDDING_API`, `DOCSCAN_EMBEDDING_URL`, `DOCSCAN_EMBEDDING_API_KEY`, `DOCSCAN_DB`
Override precedence: CLI flags > env vars > `.docscan/config.ini` > global `~/.config/docscan/config.ini` > defaults
To see the effective configuration with the source of each value:
```bash
docscan config debug
docscan config debug --json # machine-readable
```
## MCP server
```bash
# Start MCP server for LLM integration
docscan mcp-serve --db path/to/index.db
```
Exposes 7 tools: `docscan_search`, `docscan_status`, `docscan_read_chunk`, `docscan_list_docs`, `docscan_config`, `docscan_index`, `docscan_update`
## Search modes
| Mode | Flag | Description |
|------|------|-------------|
| Hybrid | *(default)* | Vector similarity + BM25 lexical, fused with Reciprocal Rank Fusion |
| Exact | `--exact` | FTS5 full-text search only (no embedding server needed) |
| Similar | `--similar` | Vector similarity only |
## Document format support
| Format | Extraction | Structure detection | Location info |
|--------|-----------|-------------------|---------------|
| `.md` | Full text | ATX headings (`#` through `######`) | Line number |
| `.txt` | Full text | (same as markdown) | Line number |
| `.docx` | Full text + metadata | Word heading styles (Heading1-6, Title) | Page (from `lastRenderedPageBreak`) |
| `.pdf` | Full text via content stream operators | Font-size heuristics + ToUnicode CMap | Page number |
| `.doc` | Full text via OLE2/Piece Table | Heuristic (ALL CAPS, numbered sections) | — |
| `.rtf` | Full text with Unicode/Windows-1252 decoding | Font-size heuristics, `\page` breaks | Page (hard breaks) |
| `.epub` | Full text from XHTML chapters | HTML headings (h1-h6) | — |
## Building from source
Requires [Nix](https://nixos.org/download) with flakes enabled.
```bash
./build # Release build via nix build
./build --debug # Debug build
./build --test # Build + run tests
./test # Run all test suites (unit + CLI + MCP)
./build_all # Cross-compile for 5 targets
```
## Cross-platform targets
- macOS aarch64 (Apple Silicon)
- Linux aarch64
- Linux x86_64
- Windows aarch64
- Windows x86_64
## License
MIT