{"id":49360242,"url":"https://github.com/pmarreck/docscan","last_synced_at":"2026-04-27T16:01:23.259Z","repository":{"id":350316650,"uuid":"1206282253","full_name":"pmarreck/docscan","owner":"pmarreck","description":"Document indexing and semantic search for .md, .docx, .pdf, .doc — Zig core + C FFI + hybrid vector/lexical search","archived":false,"fork":false,"pushed_at":"2026-04-09T21:19:52.000Z","size":22805,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"yolo","last_synced_at":"2026-04-09T21:30:00.855Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Zig","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pmarreck.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-09T18:59:26.000Z","updated_at":"2026-04-09T21:20:00.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/pmarreck/docscan","commit_stats":null,"previous_names":["pmarreck/docscan"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/pmarreck/docscan","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fdocscan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fdocscan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fdocscan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarrec
k%2Fdocscan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pmarreck","download_url":"https://codeload.github.com/pmarreck/docscan/tar.gz/refs/heads/yolo","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fdocscan/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32343571,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T23:26:28.701Z","status":"online","status_checked_at":"2026-04-27T02:00:06.769Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-27T16:01:19.124Z","updated_at":"2026-04-27T16:01:23.253Z","avatar_url":"https://github.com/pmarreck.png","language":"Zig","funding_links":[],"categories":[],"sub_categories":[],"readme":"# docscan\n\n[![CI](https://github.com/pmarreck/docscan/actions/workflows/ci.yml/badge.svg?branch=yolo)](https://github.com/pmarreck/docscan/actions/workflows/ci.yml)\n[![Garnix](https://img.shields.io/endpoint.svg?url=https%3A%2F%2Fgarnix.io%2Fapi%2Fbadges%2Fpmarreck%2Fdocscan%3Fbranch%3Dyolo)](https://garnix.io/repo/pmarreck/docscan)\n\nDocument indexing and semantic search for `.md`, `.txt`, `.docx`, `.pdf`, `.doc`, `.rtf`, and `.epub` files.\n## What it does\n\ndocscan indexes a collection of documents, extracts structured text with heading detection, chunks by document structure, embeds via a local model (Ollama or oMLX), and provides hybrid 
vector + lexical search from the CLI or via MCP for LLM integration.\n\n## Architecture\n\n```\nC CLI (I/O, Ollama/oMLX, files) → C FFI → Zig Core (pure computation)\n                                             ├── Parsers (md, docx, pdf, doc, rtf)\n                                             ├── Chunker (structure-aware)\n                                             ├── Storage (SQLite + sqlite-vec + FTS5)\n                                             └── Search (hybrid vector + BM25 with RRF)\n```\n\n- **Zig core** is pure computation — no I/O, receives byte slices, returns structured results\n- **C FFI** is the public API boundary — flat C functions with opaque handles\n- **C CLI** handles all I/O: file reading, embedding server calls, progress bars, output formatting\n- **MCP server** over stdio for LLM integration (JSON-RPC 2.0)\n\n## Quick start\n\n```bash\n# Build\nnix build\n# or\n./build\n\n# Index a directory\ndocscan index ~/Documents/contracts/\n\n# Search (exact text match, no embedding server needed)\ndocscan search --exact \"indemnification clause\"\n\n# Search (hybrid semantic + lexical, needs Ollama or oMLX)\ndocscan search \"liability protection provisions\"\n\n# Check index status\ndocscan status\n\n# Extract text from a document (no database needed)\ndocscan extract document.pdf\ndocscan extract --markdown report.docx\ndocscan extract --json contract.md\ndocscan extract --from 5 --to 10 book.pdf\ncat file.md | docscan extract --format md -\n\n# Normalize text (fix word splits, ligatures, hyphenation)\necho \"kn own\" | docscan normalize\n```\n\n## Embedding backends\n\ndocscan supports both Ollama and OpenAI-compatible embedding servers (oMLX, LM Studio, vLLM, etc.).\n\n```bash\n# Ollama (default)\ndocscan index ~/docs/ --model nomic-embed-text\n\n# oMLX / OpenAI-compatible\ndocscan index ~/docs/ \\\n  --embedding-api openai \\\n  --embedding-url http://localhost:8000 \\\n  --embedding-api-key YOUR_KEY \\\n  --model 
bge-m3-mlx-fp16\n```\n\nSettings are saved to `.docscan/config.ini` on first index, so subsequent commands pick them up automatically.\n\nEnvironment variables: `DOCSCAN_MODEL`, `DOCSCAN_EMBEDDING_API`, `DOCSCAN_EMBEDDING_URL`, `DOCSCAN_EMBEDDING_API_KEY`, `DOCSCAN_DB`\n\nOverride precedence: CLI flags \u003e env vars \u003e `.docscan/config.ini` \u003e global `~/.config/docscan/config.ini` \u003e defaults\n\nTo see the effective configuration with the source of each value:\n\n```bash\ndocscan config debug\ndocscan config debug --json  # machine-readable\n```\n\n## MCP server\n\n```bash\n# Start MCP server for LLM integration\ndocscan mcp-serve --db path/to/index.db\n```\n\nExposes 7 tools: `docscan_search`, `docscan_status`, `docscan_read_chunk`, `docscan_list_docs`, `docscan_config`, `docscan_index`, `docscan_update`\n\n## Search modes\n\n| Mode | Flag | Description |\n|------|------|-------------|\n| Hybrid | *(default)* | Vector similarity + BM25 lexical, fused with Reciprocal Rank Fusion |\n| Exact | `--exact` | FTS5 full-text search only (no embedding server needed) |\n| Similar | `--similar` | Vector similarity only |\n\n## Document format support\n\n| Format | Extraction | Structure detection | Location info |\n|--------|-----------|-------------------|---------------|\n| `.md` | Full text | ATX headings (`#` through `######`) | Line number |\n| `.txt` | Full text | (same as markdown) | Line number |\n| `.docx` | Full text + metadata | Word heading styles (Heading1-6, Title) | Page (from `lastRenderedPageBreak`) |\n| `.pdf` | Full text via content stream operators | Font-size heuristics + ToUnicode CMap | Page number |\n| `.doc` | Full text via OLE2/Piece Table | Heuristic (ALL CAPS, numbered sections) | — |\n| `.rtf` | Full text with Unicode/Windows-1252 decoding | Font-size heuristics, `\\page` breaks | Page (hard breaks) |\n| `.epub` | Full text from XHTML chapters | HTML headings (h1-h6) | — |\n## Building from source\n\nRequires 
[Nix](https://nixos.org/download) with flakes enabled.\n\n```bash\n./build              # Release build via nix build\n./build --debug      # Debug build\n./build --test       # Build + run tests\n./test               # Run all test suites (unit + CLI + MCP)\n./build_all          # Cross-compile for 5 targets\n```\n\n## Cross-platform targets\n\n- macOS aarch64 (Apple Silicon)\n- Linux aarch64\n- Linux x86_64\n- Windows aarch64\n- Windows x86_64\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmarreck%2Fdocscan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpmarreck%2Fdocscan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmarreck%2Fdocscan/lists"}