{"id":49195555,"url":"https://github.com/sardor-m/lumen","last_synced_at":"2026-04-23T10:01:42.427Z","repository":{"id":352980449,"uuid":"1200320683","full_name":"Sardor-M/Lumen","owner":"Sardor-M","description":"Local-first knowledge compiler - ingest articles, papers, PDFs, YouTube into a knowledge graph your AI agent can search, compile and grow","archived":false,"fork":false,"pushed_at":"2026-04-22T01:22:50.000Z","size":670,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-22T01:30:29.532Z","etag":null,"topics":["ai-agent-tools","bm25-search","claude-code","codex","knowledge-compiler","pagerank","sqlite-vec","tfidf-vectors"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Sardor-M.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-03T09:21:04.000Z","updated_at":"2026-04-22T01:24:16.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Sardor-M/Lumen","commit_stats":null,"previous_names":["sardor-m/lumen"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Sardor-M/Lumen","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sardor-M%2FLumen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sardor-M%2FLumen/tags","releases_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/Sardor-M%2FLumen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sardor-M%2FLumen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Sardor-M","download_url":"https://codeload.github.com/Sardor-M/Lumen/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sardor-M%2FLumen/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32175040,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-23T02:19:40.750Z","status":"ssl_error","status_checked_at":"2026-04-23T02:17:55.737Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agent-tools","bm25-search","claude-code","codex","knowledge-compiler","pagerank","sqlite-vec","tfidf-vectors"],"created_at":"2026-04-23T10:01:41.578Z","updated_at":"2026-04-23T10:01:42.418Z","avatar_url":"https://github.com/Sardor-M.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lumen\n\n[![npm](https://img.shields.io/npm/v/lumen-kb)](https://www.npmjs.com/package/lumen-kb)\n[![License: 
MIT](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE.md)\n[![GitHub](https://img.shields.io/github/stars/Sardor-M/Lumen?style=social)](https://github.com/Sardor-M/Lumen)\n\nYou read, ship, and build constantly. Articles, papers, transcripts, YouTube talks, PDFs, code you wrote last month, the CSV your colleague sent, the architecture diagram pinned on your desktop. Then you forget most of it.\n\nYour AI assistant has the same problem — worse, actually. It doesn't know anything you've read or written. Every conversation starts from zero. You paste the same context, re-explain the same ideas, re-answer the same questions about your own domain. The model knows the world but doesn't know _your_ world.\n\nLumen fixes that. Drop everything you've read, shipped, or captured into it — articles, papers, YouTube talks, whole code repositories, datasets, screenshots of dashboards, and clippings from your Obsidian vault — and it builds a local knowledge graph that your AI assistant can search before it answers. `lumen install claude` wires it directly into Claude Code with a `CLAUDE.md` brain-first protocol: the assistant checks your brain before the internet, cites your sources, and captures new ideas after every response.\n\nEverything runs on your machine. One SQLite file. No cloud, no server, no syncing. 
The LLM is only called when you ask it to compile or synthesize — search, indexing, graph traversal, and compression all run locally.\n\n---\n\n## Architecture\n\n```\n    INGEST              CHUNK               STORE              SEARCH\n    ------              -----               -----              ------\n\n  URL      -+         +- Markdown         +- Sources         +- BM25 (FTS5)\n  PDF      -|         |                   |                  |\n  YouTube  -+         +- HTML             +- Chunks          +- TF-IDF\n  arXiv    -|- Extract|                -\u003e |                -\u003e |\n  File/Dir -|         +- Plain text       +- Concepts        +- Vector ANN\n  Code     -|         |                   +- Edges           |\n  Dataset  -|         +- Code + sigs      +- Links           +- Graph walk\n  Image    -|         |                   +- Embeddings             |\n  Obsidian -+         +- Schema tables                               v\n                                                              RRF Fusion\n    COMPILE             ENRICH              GRAPH          (3-signal merge)\n    -------             ------              -----                   |\n                                                                Budget cut\n  LLM extracts       Tier scoring        PageRank                   |\n  concepts +         escalates           Path finding          Ranked chunks\n  compiled truth     stubs -\u003e            Community                  |\n  + timeline         rich pages          detection                  v\n  per source         via LLM             Visualization        LLM synthesis\n  (3 parallel)                                                 (streaming)\n```\n\n---\n\n## What it looks like in practice\n\n```bash\nlumen init\nlumen add https://karpathy.github.io/2021/06/21/blockchain/\nlumen add ./papers/attention-is-all-you-need.pdf\nlumen add https://www.youtube.com/watch?v=kCc8FmEb1nY\nlumen add 1706.03762                          # arXiv 
ID\nlumen add ./saved-articles/                   # whole folder at once\nlumen add https://github.com/anthropics/claude-code  # whole repo\nlumen add ./benchmarks/results.csv            # dataset: schema + preview indexed\nlumen add ./screenshots/grafana-dashboard.png # image: OCR'd into searchable text\nlumen watch add obsidian ~/ObsidianVault      # auto-pull clippings via frontmatter\nlumen compile -c 3                            # extract concepts + build graph (3 parallel)\n```\n\nNow search it:\n\n```bash\nlumen search \"agent orchestration patterns\" -b 4000\n```\n\n```\n1. [8.2] Building Effective AI Agents \u003e Combining and customizing these patterns\n   ## Combining and customizing these patterns\n   signals: bm25:12% tfidf:146%\n\n2. [7.9] LLM Powered Autonomous Agents \u003e Agent System Overview\n   ## Agent System Overview\n   signals: tfidf:68%\n\n3. [7.8] Building LLM applications for production \u003e Testing an agent\n   #### Testing an agent\n   signals: tfidf:68%\n```\n\nOr ask a question and get a streamed answer:\n\n```bash\nlumen ask \"How do agent swarms compare to RAG for knowledge retrieval?\"\n```\n\nClaude reads the relevant chunks from your corpus and streams the answer token by token, drawing on what you've actually read rather than on training data.\n\n---\n\n## Install\n\n```bash\nnpm install -g lumen-kb\n```\n\nOr from source:\n\n```bash\ngit clone https://github.com/Sardor-M/Lumen.git\ncd Lumen \u0026\u0026 pnpm install \u0026\u0026 pnpm build\ncd apps/cli \u0026\u0026 npm link\n```\n\nSet your API key once:\n\n```bash\nmkdir -p ~/.lumen\necho 'ANTHROPIC_API_KEY=sk-ant-...' \u003e ~/.lumen/.env\n```\n\nLumen always reads `~/.lumen/.env` for API keys, regardless of which workspace you're using. Supports Anthropic (Claude), OpenRouter (multi-model), and Ollama (local). 
Default model: `claude-sonnet-4-6`.\n\n---\n\n## Wire it into your AI assistant\n\n### Claude Code\n\n```bash\nlumen install claude\n```\n\nThis generates five files:\n\n- **`CLAUDE.md`** — brain-first protocol (mandatory, loaded every message). Tells Claude: check the knowledge base before answering, cite sources as `[Source: title]`, only use web search after the brain returns nothing.\n- **`.mcp.json`** — MCP server config with `LUMEN_DIR` baked in so the server always connects to the right workspace.\n- **`.claude/skills/lumen/skill.md`** — supplementary skill with tool routing table, capture protocol, and session summary instructions.\n- **`.claude/hooks/lumen-pretool.sh`** — PreToolUse hook. Fires before every `Glob` / `Grep` and reminds Claude that MCP search tools exist.\n- **`.claude/hooks/lumen-signal.sh`** — Stop hook. Fires after every response and nudges Claude to call `capture` if new knowledge appeared.\n\nAfter installing, every conversation draws from and adds to your knowledge base automatically.\n\n### MCP server (Cursor, Codex, any MCP client)\n\n```bash\nlumen --mcp          # stdio server, 19 tools\n```\n\nAdd to your client's MCP config:\n\n```json\n{\n    \"mcpServers\": {\n        \"lumen\": { \"command\": \"lumen\", \"args\": [\"--mcp\"] }\n    }\n}\n```\n\n19 tools: `search`, `query`, `brain_ops`, `add`, `compile`, `capture`, `session_summary`, `status`, `profile`, `god_nodes`, `concept`, `path`, `neighbors`, `pagerank`, `communities`, `community`, `add_link`, `backlinks`, `links`.\n\n### Cursor / Aider / Copilot (no MCP)\n\nAdd this to your instruction file (`.cursorrules`, `CLAUDE.md`, etc.):\n\n```\nBefore answering research questions, run:\n  ! 
lumen search \"\u003cquestion\u003e\" -b 8000\nUse the returned chunks as primary context.\n```\n\n---\n\n## The agent loop\n\nThis is how it works end to end when an agent is connected:\n\n```\nUser sends a message\n        |\n        v\nCLAUDE.md fires: \"check brain BEFORE answering\"\n        |\n        v\nAgent calls brain_ops(query) via MCP\n        |\n        +-- concept found  --\u003e compiled truth + edges as context\n        +-- path found     --\u003e concept connection chain as context\n        +-- neighborhood   --\u003e related cluster as context\n        +-- search results --\u003e top ranked chunks as context\n        |\n        v\nAgent answers using KB context, cites [Source: title]\n        |\n        v\nStop hook fires: \"call capture if new knowledge appeared\"\n        |\n        v\nAgent calls capture(type, title, content, related_slugs)\n        |\n        v\nConcept upserted + timeline entry + backlinks created\n        |\n        v\nBrain is richer for the next conversation\n```\n\nEvery cycle adds knowledge. The agent enriches concepts after conversations. Next time the same topic comes up, `brain_ops` finds it. The difference compounds daily.\n\n---\n\n## How the brain compounds over time\n\nEvery concept starts at Tier 3 — a stub. As you add more sources that reference it, the tier climbs:\n\n- **Tier 3** — mentioned once. Stub with basic summary.\n- **Tier 2** — mentioned 3+ times across 2+ sources. Enriched with connections and context.\n- **Tier 1** — mentioned 6+ times across 3+ sources. Full compiled truth — the system's current best understanding of this concept, synthesized from everything you've read.\n\nRun `lumen enrich` to process the queue, or `lumen enrich --status` to see where things stand. After `compile`, any concepts that crossed a threshold are automatically queued.\n\nThe `capture` MCP tool writes the other direction — from conversation to graph. 
When the assistant is discussing something worth remembering, it calls `capture` with the exact phrasing. `session_summary` closes out a session with a digest of what was covered.\n\n---\n\n## How it works\n\n**Ingestion** — no LLM needed. URL scraping via `@extractus/article-extractor`, PDF via `pdf-parse`, YouTube transcripts via the Innertube captions API, arXiv via Atom + PDF. Code repos via shallow `git clone` with `.gitignore`-aware walk and per-language signature extraction. Datasets (CSV, TSV, JSONL, HuggingFace) produce a schema table plus a 20-row preview. Images use optional local Tesseract OCR when the binary is on PATH (`--no-ocr` skips). Obsidian Web Clipper vaults are watched as a connector — YAML frontmatter promotes the original URL so re-clippings dedup. SHA-256 deduplication throughout, so the same quote across five sources costs one row.\n\n**Compilation** — LLM pass. Extracts concepts and relations from stored chunks, writes them as nodes and weighted directed edges with compiled truth + timeline per concept. Delta-aware: `compile` only touches unprocessed sources. `compile --all` reprocesses everything. `compile -c 5` runs 5 sources in parallel. `compile --model claude-haiku-4-5-20251001` uses a faster/cheaper model.\n\n**Search** — local, no LLM. BM25 via SQLite FTS5 (Porter stemmed), TF-IDF via in-memory inverted index (cosine similarity), optional vector ANN via sqlite-vec (OpenAI or Ollama embeddings). Fused with Reciprocal Rank Fusion (`score = Σ weight / (k + rank)`, k=60), ranked by relevance density so small high-value chunks beat verbose low-value ones.\n\n**Synthesis** — LLM pass with prompt caching (`cache_control: ephemeral`, ~60-80% cost reduction on repeated calls within a session). `lumen ask` streams tokens to stdout as they arrive. 
Non-Anthropic providers fall back gracefully.\n\n---\n\n## CLI reference\n\n| Command                | What it does                                                             | LLM |\n| ---------------------- | ------------------------------------------------------------------------ | --- |\n| `init`                 | Create `~/.lumen` workspace                                              |     |\n| `add \u003cinput\u003e`          | Ingest URL, PDF, YouTube, arXiv, file, folder, code repo, dataset, image |     |\n| `compile`              | Extract concepts + edges from unprocessed sources                        | yes |\n| `enrich`               | Tier-score concepts and LLM-enrich queued ones                           | yes |\n| `embed`                | Generate vector embeddings for chunks                                    | API |\n| `search \u003cquery\u003e`       | Hybrid local search (BM25 + TF-IDF + vector + graph)                     |     |\n| `ask \u003cquestion\u003e`       | Search + streamed LLM-synthesized answer                                 | yes |\n| `graph \u003csubcommand\u003e`   | Overview, pagerank, path, neighbors, report, export                      |     |\n| `profile`              | Corpus summary — sources, density, frequent queries                      |     |\n| `status`               | DB statistics (text or JSON)                                             |     |\n| `memory export/import` | Portable JSONL or SQL backup                                             |     |\n| `serve`                | Start the web UI against your local knowledge base                       |     |\n| `install \u003cplatform\u003e`   | Wire into Claude Code (`claude`) or Codex (`codex`)                      |     |\n| `watch`                | Watch a folder and auto-ingest changes                                   |     |\n| `daemon`               | Install/uninstall as background launchd/systemd service                  |     |\n\nCompile options: 
`lumen compile -c 5` (5 parallel), `lumen compile --model claude-haiku-4-5-20251001` (faster model).\n\nAdd options: `--type \u003ctype\u003e` force source type; `--as-dataset` treat an ambiguous text file as tabular data; `--no-ocr` skip OCR when ingesting images; `--from \u003cfile\u003e` read inputs line-by-line.\n\nSearch options: `lumen search \"query\" -n 5` (limit results), `lumen search \"query\" -b 4000` (token budget).\n\nGraph subcommands: `lumen graph status`, `lumen graph pagerank`, `lumen graph path \u003ca\u003e \u003cb\u003e`, `lumen graph neighbors \u003cconcept\u003e -d 2`, `lumen graph report`, `lumen graph export -f json`.\n\n---\n\n## Repo layout\n\nMonorepo — Turborepo + pnpm workspaces.\n\n```\nlumen/\n+-- apps/\n|   +-- cli/         -- CLI and MCP server (the engine)\n|   +-- web/         -- Next.js 15 web UI (Better Auth, Zod, shadcn)\n|   +-- extension/   -- Browser extension (placeholder)\n+-- docs/            -- ALGORITHMS.md, architecture, test plans\n+-- test-benchmarks/ -- Side-by-side Mode 1 (bare) vs Mode 2 (agent wired)\n+-- turbo.json\n+-- pnpm-workspace.yaml\n```\n\n---\n\n## Web UI\n\n`lumen serve` starts a Next.js 15 app that reads directly from `~/.lumen/lumen.db`. 
No separate server, no duplicate query code.\n\nPages: overview (sources / concepts / edges / density / pending), hybrid search with per-signal score breakdown, concept browser, concept detail (neighborhood, edges), sources list, and graph dashboard (god nodes, communities, top concepts).\n\n```bash\nlumen serve                          # dev mode, http://localhost:3000\nlumen serve --port 4000 --mode prod  # after pnpm build in apps/web\n```\n\n---\n\n## Storage\n\nEverything in `~/.lumen/`:\n\n```\n~/.lumen/\n+-- lumen.db          # SQLite WAL -- sources, chunks, FTS5, concepts, edges,\n|                     #   links, embeddings, classifiers, query_log\n+-- config.json       # User config (LLM model, embedding provider, search weights)\n+-- .env              # API keys (always checked, even if LUMEN_DIR is set elsewhere)\n+-- audit.log         # Append-only JSON-lines operation log\n+-- output/           # Generated exports and reports\n```\n\nOne file. Back it up and you have everything.\n\n---\n\n## Algorithms\n\n| Algorithm                 | Use                 | Reference                            |\n| ------------------------- | ------------------- | ------------------------------------ |\n| BM25                      | Full-text ranking   | Robertson \u0026 Zaragoza, 2009           |\n| TF-IDF                    | Vector similarity   | Salton \u0026 Buckley, 1988               |\n| Reciprocal Rank Fusion    | Signal merging      | Cormack, Clarke \u0026 Butt, 2009         |\n| PageRank                  | Concept importance  | Page, Brin, Motwani \u0026 Winograd, 1998 |\n| Content-Addressed Storage | Deduplication       | Quinlan \u0026 Dorward, 2002              |\n| Extractive Summarization  | Compression         | Luhn, 1958                           |\n| Label Propagation         | Community detection | Raghavan, Albert \u0026 Kumara, 2007      |\n\nDetails in [docs/ALGORITHMS.md](./docs/ALGORITHMS.md).\n\n---\n\n## Tech stack\n\n```\nRuntime:     
Node.js 22+\nLanguage:    TypeScript 5\nStorage:     better-sqlite3 (WAL, FTS5)\nVectors:     sqlite-vec (ANN search, cosine similarity)\nLLM:         @anthropic-ai/sdk (+ OpenRouter, Ollama)\nEmbeddings:  OpenAI text-embedding-3-small / Ollama nomic-embed-text\nPDF:         pdf-parse\nURL:         @extractus/article-extractor\nYouTube:     Innertube captions API\narXiv:       Atom API + PDF extraction\nCLI:         Commander.js\nMCP:         @modelcontextprotocol/sdk (19 tools)\nWeb:         Next.js 15, Better Auth, Zod, shadcn/ui, Tailwind\nMonorepo:    Turborepo + pnpm workspaces\n```\n\n---\n\n## Privacy\n\nThe only network calls are: (a) fetching the URL or arXiv paper you asked to ingest, (b) model API calls during `compile`, `enrich`, and `ask` using your own API key, and (c) embedding API calls during `embed` if configured. No telemetry. No analytics. Search, graph traversal, compression, chunking, and deduplication all run locally against the SQLite file.\n\n---\n\n## Development\n\n```bash\npnpm install\npnpm dev                          # turbo dev -- all apps in parallel\npnpm --filter lumen-kb dev        # CLI only\npnpm --filter @lumen/web dev      # web only\npnpm build\npnpm lint \u0026\u0026 pnpm format:check    # pre-commit check\npnpm test                         # vitest\n```\n\nTests use a temp directory: `LUMEN_DIR=$(mktemp -d)`.\n\n---\n\n## Contributing\n\nOpen an issue before large changes. 
High-value areas:\n\n- Additional ingest formats (EPUB, DOCX, Parquet native)\n- Tree-sitter-based code parsing to replace the current regex signatures\n- Claude Vision pass on compile for image captions\n- In-house browser clipper extension (deferred until Obsidian flow is validated)\n- Web dashboard — live graph visualization\n- Mastra and LangChain adapter improvements\n\n---\n\n## Links\n\n- [npm package](https://www.npmjs.com/package/lumen-kb)\n- [GitHub](https://github.com/Sardor-M/Lumen)\n- [Changelog](./CHANGELOG.md)\n- [Contributing](./CONTRIBUTING.md)\n- [Security](./SECURITY.md)\n- [Benchmark plan](./docs/BENCHMARK-PLAN.md)\n\n## License\n\n[MIT](./LICENSE.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsardor-m%2Flumen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsardor-m%2Flumen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsardor-m%2Flumen/lists"}