{"id":50711276,"url":"https://github.com/chelslava/md-semantic-search","last_synced_at":"2026-06-09T15:30:54.348Z","repository":{"id":362643257,"uuid":"1260050630","full_name":"chelslava/md-semantic-search","owner":"chelslava","description":"Local, private semantic (vector) search over any folder of Markdown files. Cross-lingual, offline, no API keys, no vector DB — just transformers.js + one JSON index.","archived":false,"fork":false,"pushed_at":"2026-06-05T06:58:25.000Z","size":32,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-05T08:15:24.808Z","etag":null,"topics":["bge","cli","e5","embeddings","markdown","offline","rag","semantic-search","transformers-js","vector-search"],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chelslava.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-05T05:36:40.000Z","updated_at":"2026-06-05T06:58:28.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/chelslava/md-semantic-search","commit_stats":null,"previous_names":["chelslava/md-semantic-search"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/chelslava/md-semantic-search","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chelslava%2Fmd-semantic-search","tags_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/chelslava%2Fmd-semantic-search/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chelslava%2Fmd-semantic-search/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chelslava%2Fmd-semantic-search/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chelslava","download_url":"https://codeload.github.com/chelslava/md-semantic-search/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chelslava%2Fmd-semantic-search/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34114426,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bge","cli","e5","embeddings","markdown","offline","rag","semantic-search","transformers-js","vector-search"],"created_at":"2026-06-09T15:30:53.579Z","updated_at":"2026-06-09T15:30:54.339Z","avatar_url":"https://github.com/chelslava.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# md-semantic-search\n\n[![npm version](https://img.shields.io/npm/v/md-semantic-search.svg)](https://www.npmjs.com/package/md-semantic-search)\n[![npm 
downloads](https://img.shields.io/npm/dm/md-semantic-search.svg)](https://www.npmjs.com/package/md-semantic-search)\n[![publish](https://github.com/chelslava/md-semantic-search/actions/workflows/publish.yml/badge.svg)](https://github.com/chelslava/md-semantic-search/actions/workflows/publish.yml)\n[![node](https://img.shields.io/node/v/md-semantic-search.svg)](https://nodejs.org)\n[![license](https://img.shields.io/npm/l/md-semantic-search.svg)](./LICENSE)\n\n**Local, private semantic (vector) search over any folder of Markdown files.**\n\nFind passages by *meaning*, not just keywords — and across languages (ask in\none language, match documents written in another). Runs fully on your machine\nvia [transformers.js](https://github.com/xenova/transformers.js): no API keys,\nno cloud calls, no vector database. Your notes never leave the disk.\n\n```bash\nnpx md-semantic-search index  --db ./docs\nnpx md-semantic-search search --db ./docs \"how do I rotate the API token\"\n```\n\n---\n\n## Why\n\nKeyword search misses paraphrases. A query like *\"как починить зависший ввод на\nwindows\"* will never match a page titled *\"win32 stdin re-wrap closes buffer\"* —\nzero shared words, different language. Semantic search embeds both into the same\nvector space and matches on meaning. This tool was extracted from a real wiki\nwhere exactly that gap kept biting; see **[RESEARCH.md](./RESEARCH.md)** for the\nmeasurements that shaped its defaults.\n\n## Features\n\n- 🔌 **Any folder, anywhere.** Point `--db` at any directory of `.md`/`.markdown`\n  files. It does **not** have to live inside this project. 
Recursive by default.\n- 🌍 **Cross-lingual.** Multilingual embeddings (default `multilingual-e5-base`).\n- 🧠 **Hybrid ranking.** Reciprocal Rank Fusion of vector similarity (meaning)\n  and lexical overlap (exact names like `win32`, `TextIOWrapper`).\n- ⚡ **Incremental.** Per-file md5 — re-indexing only re-embeds changed files.\n- 🔒 **Private \u0026 offline.** Model downloads once, then no network. Nothing is\n  uploaded anywhere.\n- 📦 **Zero infra.** One JSON index, brute-force cosine in memory. No Pinecone,\n  no Qdrant, no pgvector. Scales fine to thousands of chunks.\n\n## Requirements\n\n- Node.js ≥ 18\n- ~280 MB disk for the default model (downloaded once into a cache dir)\n\n## Install\n\nRun on demand with `npx` (no install):\n\n```bash\nnpx md-semantic-search --help\n```\n\nOr install globally for the short `mdss` alias:\n\n```bash\nnpm install -g md-semantic-search\nmdss --help\n```\n\nOr from source:\n\n```bash\ngit clone https://github.com/chelslava/md-semantic-search\ncd md-semantic-search\nnpm install\nnode bin/cli.mjs --help\n```\n\n## Usage\n\n### 1. Build the index\n\n```bash\nmdss index --db /path/to/your/markdown\n```\n\nFirst run downloads the model (~280 MB). The index is written to `\u003cdb\u003e/.mdss/`\nby default (override with `--index-dir`). Re-run after editing your notes — it's\nincremental, so only changed files are re-embedded.\n\n### 2. Search\n\n```bash\nmdss search --db /path/to/your/markdown \"your question in plain language\"\n```\n\nExample output:\n\n```\nTop 3 for: \"how do I add a new translation language\"\n\n1. [cos 0.833] i18n Application Analysis\n   i18n-analysis.md › Language status\n   | Language | File | Status | ... English | en/shared.json | complete | ...\n\n2. ...\n```\n\n### Options\n\n| Flag | Meaning |\n|------|---------|\n| `--db \u003cdir\u003e` | Folder of `.md` files (or set `MDSS_DB`). Can be anywhere on disk. |\n| `--index-dir \u003cdir\u003e` | Where to store the index (default: `\u003cdb\u003e/.mdss`). 
|\n| `--cache-dir \u003cdir\u003e` | Model cache dir (default: package `.cache`, or `MDSS_CACHE_DIR`). |\n| `--model \u003cname\\|id\u003e` | Embedding model (default `e5-base`). See `mdss models`. |\n| `--ignore \u003cglob\u003e` | Skip files/paths; repeatable, e.g. `--ignore \"log.md\" --ignore \"**/archive/**\"`. |\n| `--k \u003cn\u003e` | Number of results (default 6). |\n| `--json` | Machine-readable output. |\n| `--semantic` | Pure vector ranking, skip lexical fusion. |\n\n### The base can live outside the project\n\nThe tool does not need write access to your notes. If the folder is read-only,\npoint the index somewhere writable:\n\n```bash\nmdss index  --db /mnt/shared/team-wiki --index-dir ~/.cache/team-wiki-index\nmdss search --db /mnt/shared/team-wiki --index-dir ~/.cache/team-wiki-index \"incident runbook for db failover\"\n```\n\nOr drive everything from environment variables:\n\n```bash\nexport MDSS_DB=/mnt/shared/team-wiki\nexport MDSS_INDEX_DIR=~/.cache/team-wiki-index\nmdss index\nmdss search \"incident runbook for db failover\"\n```\n\n## Models\n\n```bash\nmdss models\n```\n\n| Alias | Model | Dim | Notes |\n|-------|-------|-----|-------|\n| `e5-small` | `Xenova/multilingual-e5-small` | 384 | Fastest (~120 MB). **Weak cross-lingual** — see RESEARCH. |\n| `e5-base` ⭐ | `Xenova/multilingual-e5-base` | 768 | Default. Best balance. |\n| `e5-large` | `Xenova/multilingual-e5-large` | 1024 | ~2.2 GB, higher quality. |\n| `bge-m3` | `Xenova/bge-m3` | 1024 | ~2.3 GB. Best cross-lingual separation in tests. |\n\nSwitching models invalidates the stored vectors automatically — the next\n`index` run does a full rebuild. You can also pass any raw `Xenova/*` id.\n\n## How it works\n\n1. **Walk** `--db` recursively for `.md`/`.markdown` (dotfiles \u0026 `--ignore`\n   globs skipped).\n2. **Chunk** each file by Markdown headings; oversized sections split on blank\n   lines (~1400 chars/chunk).\n3. 
**Embed** each chunk (`passage:` prefix for E5) → store `{file, heading,\n   text, vec}` in `vectors.json`, plus per-file md5 in `.hashes.json`.\n4. **Search**: embed the query (`query:` prefix), score every chunk by cosine,\n   score by lexical term-overlap, then **fuse with RRF**. Return top-k chunks.\n\nNo external services, no database — the whole index is one JSON file and search\nis an in-memory dot-product sweep.\n\n## License\n\nMIT © chelslava\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchelslava%2Fmd-semantic-search","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchelslava%2Fmd-semantic-search","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchelslava%2Fmd-semantic-search/lists"}