{"id":50490753,"url":"https://github.com/uhop/vault-storage","last_synced_at":"2026-06-02T02:30:59.589Z","repository":{"id":354531384,"uuid":"1223858724","full_name":"uhop/vault-storage","owner":"uhop","description":"AI-agent-first knowledge base over markdown. SQLite + sqlite-vec for fast lookup, semantic search, and typed-edge traversal; BGE embeddings; REST server (MCP planned).","archived":false,"fork":false,"pushed_at":"2026-05-14T05:03:00.000Z","size":801,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-14T05:35:02.765Z","etag":null,"topics":["ai-agent","bge","claude","embeddings","knowledge-base","markdown","obsidian","rest-api","semantic-search","sqlite","sqlite-vec","typescript","vault","vector-search"],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/uhop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-28T18:13:03.000Z","updated_at":"2026-05-14T05:03:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/uhop/vault-storage","commit_stats":null,"previous_names":["uhop/vault-storage"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/uhop/vault-storage","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uhop%2Fvault-storage","tags_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/uhop%2Fvault-storage/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uhop%2Fvault-storage/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uhop%2Fvault-storage/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/uhop","download_url":"https://codeload.github.com/uhop/vault-storage/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/uhop%2Fvault-storage/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33803734,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agent","bge","claude","embeddings","knowledge-base","markdown","obsidian","rest-api","semantic-search","sqlite","sqlite-vec","typescript","vault","vector-search"],"created_at":"2026-06-02T02:30:58.773Z","updated_at":"2026-06-02T02:30:59.572Z","avatar_url":"https://github.com/uhop.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# vault-storage\n\nAn AI-agent-first persistent knowledge base. 
Markdown files are the source of truth; a SQLite + `sqlite-vec` index sits next to them and provides fast lookup, semantic search, and typed-edge traversal for AI agents.\n\n**Status:** v0.x, in active development. Working: importer, embedder, REST server (full path-based vault surface + insight reads + suggestions queue + Obsidian sync), MCP adapter (20 tools / 3 resources), file-watcher with auto-reindex, edge GC, auto-commit, migration tool (Obsidian → vault-storage tree), Docker packaging. Not yet: suggestions review surface, decay/maintenance jobs.\n\n## Architecture\n\n- **Content** lives in a separate **private** git repo: [`vault-data`](https://github.com/uhop/vault-data). Plain markdown with YAML frontmatter, organized into `topics/`, `projects/`, `queries/`, `logs/`, `raw/`. This is the source of truth.\n- **Index** is SQLite + [`sqlite-vec`](https://github.com/asg017/sqlite-vec), accessed via the built-in `node:sqlite`. The DB is fully derivable from the content repo; on total DB loss, a rebuild from `git clone` works.\n- **Server** is Node 25 + TypeScript on `node:http`, with bearer-token auth on every endpoint. Speaks REST; an MCP adapter (for Claude Code et al.) lives in the `mcp/` sub-package.\n- **Embeddings** are `Xenova/bge-small-en-v1.5` (384-dim float32, CLS pooling, paragraph-overlapped chunking, ONNX via `@huggingface/transformers`, runs on local CPU).\n- **Sync between machines** is `git pull` / `git push` against `vault-data`. 
Per-machine local DB; per-user state stays local; shared content syncs via git.\n\n## Repositories\n\n| Repo                                                                 | Visibility  | Purpose                           |\n| -------------------------------------------------------------------- | ----------- | --------------------------------- |\n| [`vault-storage`](https://github.com/uhop/vault-storage) (this repo) | public      | server code                       |\n| [`vault-data`](https://github.com/uhop/vault-data)                   | **private** | markdown content, source of truth |\n\nThe split keeps this code repo public (so it can be installed and inspected) without exposing personal notes.\n\n## Quick start (Docker)\n\n```bash\ngit clone https://github.com/uhop/vault-storage\ncd vault-storage\ncp .env.example .env\n# Edit .env: set VAULT_API_TOKEN and VAULT_DATA_PATH_HOST.\ndocker compose up -d\ndocker compose logs -f vault-storage   # watch the initial reindex\n```\n\nThat's it. The container watches `VAULT_DATA_PATH_HOST` for markdown changes and keeps the index in sync. By default it listens on `0.0.0.0:8123` so other machines on your network can reach it (bearer-token auth required on every request — generate one with `openssl rand -hex 32`).\n\nTo restrict to local-only or LAN access, set `VAULT_PUBLISH_HOST=127.0.0.1` (or your LAN IP) in `.env`. For TLS over the public internet, put a reverse proxy (Caddy / nginx / Cloudflare Tunnel) in front, or use Tailscale/WireGuard for private remote access.\n\n### Updating\n\n```bash\nbin/update.sh\n```\n\nPulls the latest code, warns about new keys in `.env.example` that you haven't added to `.env`, builds an image tagged with both `:latest` and the short commit SHA, and recreates the container. 
To roll back, retag a previous SHA as `latest`:\n\n```bash\ndocker tag vault-storage:\u003cprev-sha\u003e vault-storage:latest \u0026\u0026 docker compose up -d\n```\n\nSchema migrations apply automatically on container start.\n\n## Setup (without Docker)\n\nRequires Node ≥ 25.\n\n```bash\ngit clone https://github.com/uhop/vault-storage\ncd vault-storage\nnpm install\n\n# Clone the content repo somewhere; this becomes VAULT_DATA_PATH.\ngit clone git@github.com:uhop/vault-data /path/to/vault-data\n```\n\nEnvironment variables:\n\n| Variable                   | Required | Purpose                                                                      |\n| -------------------------- | -------- | ---------------------------------------------------------------------------- |\n| `VAULT_DATA_PATH`          | yes      | Markdown content tree (the `vault-data` clone). Source of truth.             |\n| `VAULT_API_TOKEN`          | yes      | Bearer token enforced on every server request.                               |\n| `VAULT_DB_PATH`            | no       | SQLite path. Default `${VAULT_DATA_PATH}/.vault-storage/vault.sqlite`.       |\n| `VAULT_HOST`               | no       | Bind address. Default `127.0.0.1` (use `0.0.0.0` for remote access).         |\n| `VAULT_PORT`               | no       | Listen port. Default `8123`.                                                 |\n| `VAULT_INGEST_PATH`        | no       | Default source path for `migrate` / `import` subcommands.                    |\n| `VAULT_EMBEDDER`           | no       | `bge` (default) or `fake` (skip model load — dev/test only).                 |\n| `VAULT_EMBEDDER_RETENTION_MS` | no    | Idle window before the BGE pipeline is disposed and its ~GB ONNX arena returned to the OS. Default `1800000` (30 min); minimum `1000`. Reload on next embed adds ~1-3 s. |\n| `VAULT_EMBEDDER_MAX_BATCH` | no    | Cap on per-ORT-inference batch size. Bounds active-peak RSS by sub-batching large inputs. 
Default `8` (~200-400 MB peak for BGE-small at S=512); minimum `1`. Trade-off: smaller = lower memory, more inferences per re-embed. |\n| `VAULT_AUTO_REINDEX`       | no       | Run a full reindex on startup. Default `true`.                               |\n| `VAULT_AUTO_WATCH`         | no       | Watch the vault tree and reindex incrementally. Default `true`.              |\n| `VAULT_WATCH_DEBOUNCE_MS`  | no       | Watcher debounce window. Default `1500`.                                     |\n| `VAULT_AUTO_COMMIT`        | no       | Periodic `git add \u0026\u0026 git commit` of the vault tree. Default `true`.          |\n| `VAULT_AUTO_PUSH`          | no       | `git push` after each auto-commit. Default `false` (manual push).            |\n| `VAULT_COMMIT_INTERVAL_MS` | no       | Poll interval for auto-commit. Default `60000`.                              |\n| `VAULT_GIT_AUTHOR_NAME`    | no       | Author name for auto-commits. Default `vault-storage`.                       |\n| `VAULT_GIT_AUTHOR_EMAIL`   | no       | Author email for auto-commits. Default `vault-storage@localhost`.            |\n| `VAULT_EMBED_ANOMALY_LOG`  | no       | JSONL path for transient-NaN embedding events. Default `${VAULT_DATA_PATH}/.vault-storage/embed-nan.jsonl`. Empty disables file logging (stderr-only). 
|\n\nPut these in `~/.env` (sourced by `.bashrc`) or pass on the command line.\n\n## Usage\n\n### Run the server\n\n```bash\nVAULT_DATA_PATH=/path/to/vault-data \\\nVAULT_API_TOKEN=\u003ctoken\u003e \\\n  npm start\n```\n\nServer listens on `${VAULT_HOST}:${VAULT_PORT}` (default `127.0.0.1:8123`).\n\n### REST endpoints (current surface)\n\nAll endpoints require `Authorization: Bearer \u003ctoken\u003e`.\n\n| Method | Path                       | Purpose                                                            |\n| ------ | -------------------------- | ------------------------------------------------------------------ |\n| GET    | `/system/status`           | Schema version, record / edge / suggestion counts, embedder state (`{model, retained}`), `process.memoryUsage()`. |\n| POST   | `/maintenance/release-embedder` | Force-release the BGE pipeline now (bypasses the retention timer). Returns before/after RSS and freed bytes. No-op when nothing is loaded. |\n| GET    | `/system/lint`             | Integrity checks (bug-finding): embedding hash drift, missing/orphan embeddings, temporal anomalies, dangling tag aliases. Returns `{ok, total_issues, checks}`. ~50ms; safe on session-start flows. |\n| GET    | `/sections`                | List records. Filters: `type`, `status`, `file_path`, `file_prefix`, `priority_min/max`, `updated_since`, `record_ids`. Pagination: `offset`, `limit` (max 100). |\n| GET    | `/sections/{record_id}`    | Read a record by ID. `?exclude=body` for a meta-only fetch.        |\n| GET    | `/sections/{record_id}/meta` | Frontmatter projection only (no body).                           |\n| PUT    | `/sections/{record_id}`    | Replace body (`Content-Type: text/markdown`). Frontmatter-aware: user keys merged; `created`/`updated` accepted but indexer-overridden; DB-only keys (`record_id`, `content_hash`, `last_referenced`, `decay_score`) rejected. 
|\n| PUT    | `/vault/{path}`            | Two modes: `Content-Type: text/markdown` accepts a `---\\n\u003cFM\u003e\\n---\\n\u003cbody\u003e` blob (server parses YAML); `Content-Type: application/json` accepts `{frontmatter: {...}, body: \"...\"}` and skips YAML parse entirely — recommended for programmatic callers (the JSON path dodges colon-space, leading-special-char, and shadow-keyword authoring traps). Same downstream FM merge / enum validation / auto-managed-key rejection in both modes. |\n\nMore endpoints (search, edges, suggestions) are coming with the MCP layer.\n\n### CLI subcommands\n\n```bash\nnode src/index.ts info                           # DB version + record count\nnode src/index.ts import \u003cvault-path\u003e            # import a directory + embed\nnode src/index.ts migrate \u003csource\u003e \u003ctarget\u003e     # transform Obsidian vault → vault-storage tree\nnode src/index.ts serve                          # start the REST server (= `npm start`)\n```\n\nThe `migrate` subcommand:\n\n- Remaps legacy status (14 values) → 5-value closed enum.\n- Remaps legacy type (`decision` → `design`, `learning` → `research`, etc.).\n- Canonicalizes tags (lowercase, kebab-case, ASCII; conservative singular/plural collapse).\n- Backfills frontmatter for files that lack it.\n- Atomizes oversized files (\u003e 30 KB AND \u003e 5 top-level sections) into per-section pieces.\n- Seeds `tags_taxonomy` + `tag_aliases` from the canonicalized tag corpus.\n\nAfter `migrate`, run `import` against the target tree to build records, edges, and embeddings.\n\n## Agent integration\n\nTwo complementary surfaces for driving the vault from inside a Claude Code\nsession:\n\n- **Claude Code skills** — `skills/vault*` are the slash-command skills\n  (`/vault resume`, `/vault check`, `/vault propose-related`, etc.) that hit\n  the REST API directly through the `bin/vault-curl` wrapper. 
Backup +\n  install instructions in [`skills/README.md`](skills/README.md).\n- **MCP adapter** — the `mcp/` sub-package exposes the REST surface to Claude\n  Code as ~20 tools and 3 resources with closed-enum input schemas. See\n  `.mcp.json.example` for project-scope activation; `skills/README.md` covers\n  user-scope setup. A standalone, checkout-free installer (release tarball +\n  `curl | sh`) is in progress.\n\nThe two stack: skills can call the MCP tools, or fall back to `vault-curl`\nwhen MCP isn't configured. Both share the same backend.\n\n## Backup\n\nTwo-tier strategy:\n\n**Tier 1 (default on):** every dirty markdown file is auto-committed by\nthe in-server git-sync loop (`VAULT_AUTO_COMMIT=true`, default). Optional\n`VAULT_AUTO_PUSH=true` to also push to the configured remote. The\ncontent tree (markdown + frontmatter) is fully recoverable from any clone.\n\n**Tier 2 (optional):** `vault.sqlite` snapshot for DB-only state — the\nsuggestions queue, embeddings, `last_referenced` timestamps. The server\nexposes a snapshot mechanic; the host wires the offsite shipment.\n\n```bash\n# In-container: produce a gzip-compressed snapshot. Default destination:\n# ${VAULT_DATA_PATH}/.snapshots/vault.sqlite.gz (under the bind-mount).\ncurl -X POST -H \"Authorization: Bearer $VAULT_API_TOKEN\" \\\n  http://localhost:8123/maintenance/snapshot\n```\n\n```bash\n# Host-side cron, daily 03:30: snapshot then ship via whatever upload\n# tool you have on hand. The vault-data tree is bind-mounted on the\n# host, so the snapshot file lands at a path the host can read directly.\n30 3 * * * \\\n  curl -fsS -X POST -H \"Authorization: Bearer $TOKEN\" \\\n       http://localhost:8123/maintenance/snapshot \u0026\u0026 \\\n  aws s3 cp /media/raid/Vault-Data/.snapshots/vault.sqlite.gz \\\n            s3://${YOUR_BUCKET}/vault-storage/vault.sqlite.gz\n```\n\nUse `rclone`, `rsync`, or any encryption-aware wrapper instead of\n`aws s3 cp` as preferred. 
Encryption keys, credentials, and the upload\ntool all live on the host — none enter the container. Bucket-level\nversioning preserves history at no application cost.\n\n```bash\n# Retention: list and prune. Server provides the mechanic; host orchestrates\n# the policy (age threshold, count cap, etc).\ncurl -fsS -H \"Authorization: Bearer $TOKEN\" \\\n     http://localhost:8123/maintenance/snapshot-list\n# → {snapshots: [{name, bytes, mtime}, …], totalBytes}\n\ncurl -fsS -X DELETE -H \"Authorization: Bearer $TOKEN\" \\\n     \"http://localhost:8123/maintenance/snapshot?name=vault-2025-12.sqlite.gz\"\n# → 204\n```\n\n`GET /maintenance/snapshot-download?name=…` streams a snapshot file for\noffline inspection. Bare filenames only — no path separators, no traversal.\n\n## Multi-writer (git-as-sync)\n\nThe vault-data tree is a normal git repo. Multiple machines can each\nrun their own vault-storage instance against the same shared remote;\nsynchronization is via `git pull` / `git push`, not the application\nlayer. Each machine maintains its own local SQLite (the DB is a\nderived index — reconstructable from files in O(records) embed time).\n\nAfter a `git pull` lands new commits on a non-primary machine, the\nlocal DB lags. Bring it up to date with the incremental reindex:\n\n```bash\ncurl -X POST -H \"Authorization: Bearer $TOKEN\" \\\n     http://localhost:8123/maintenance/incremental-reindex\n```\n\nIt diffs `meta.last_indexed_commit..HEAD`, dispatches per-file:\n- modified / added → re-imports through the normal pipeline (tags,\n  agent block, suggestions, edges)\n- deleted → drops the row\n- renamed → preserves `record_id` by updating the path key\n\nIf the recorded anchor is no longer in HEAD's ancestry (force-push,\nrebase) the call falls back to a full `importVault` and re-pins HEAD.\nForce a full reindex any time with `?full=true`.\n\nMerge conflicts are the user's responsibility — resolve via standard\ngit, then run incremental reindex. 
The model is \"git is the\nsynchronization layer; the DB is per-machine derivative state.\"\n\n## Tests\n\n```bash\nnpm test         # tape-six suite\nnpm run ts-check # tsc --noEmit\n```\n\nCurrently 425+ asserts across ~200 tests covering importer, classifier, server, migration, and atomization paths.\n\n## Design summary\n\nThe architectural decisions are recorded as numbered constraints C1–C16 in the design vault. The shapes that matter for using the project:\n\n- **Files = source of truth.** DB is a derived index. `cat`, `vim`, `grep`, Obsidian all keep working against the content repo.\n- **Atomization splits big running files into per-section pieces at migration time.** Each piece becomes its own record with inherited frontmatter and a folder-level `_about.md`.\n- **Frontmatter is indexer-managed.** Body is authored; indexer derives `tags`, `type`, `status`, `created`, `updated`. User-authored fields are reconciled, not overwritten.\n- **Agent-driven intelligence.** No LLM calls inside the indexer. Heuristics produce suggestions; agents review them through dedicated commands. Cost is paid by the agent loop, not by background pipelines.\n- **Closed enums.** `status` (5 values), `type` (14 values), edge types (10), suggestion kinds (8). Enforced by SQLite CHECK constraints.\n- **Two-tier backup.** Tier 1 (required): `git push` of `vault-data`. Tier 2 (optional, off by default): `vault.sqlite` snapshot to S3 with object versioning.\n\n## License\n\nBSD-3-Clause. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuhop%2Fvault-storage","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuhop%2Fvault-storage","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuhop%2Fvault-storage/lists"}