https://github.com/yurvon-screamo/smos
Semantic Memory OS — universal OpenAI-compatible memory proxy for AI agents. Native Rust, ort+ONNX NLI, SurrealDB embedded.
- Host: GitHub
- URL: https://github.com/yurvon-screamo/smos
- Owner: yurvon-screamo
- License: mit
- Created: 2026-06-21T22:40:07.000Z (6 days ago)
- Default Branch: main
- Last Pushed: 2026-06-25T17:45:15.000Z (2 days ago)
- Last Synced: 2026-06-25T19:32:32.494Z (2 days ago)
- Topics: agent-memory, ai-agents, ai-memory, ai-proxy, claude-code, coding-agent, cursor, llm-memory, local-first, long-term-memory, mem0-alternative, memory, openai-api, openai-compatible, persistent-context, proxy, rust, self-hosted, semantic-memory, surrealdb
- Language: Rust
- Homepage: https://semantic-memory-os.vercel.app
- Size: 1.02 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Agents: AGENTS.md
# SMOS — Semantic Memory Operating System
**An OpenAI-compatible memory proxy that gives any AI coding agent persistent long-term memory — without code changes, without an MCP server, without a framework.**
## Quick start
```bash
npm install -g @yurvon_screamo/smos # or: cargo binstall smos
smos init # one-time: downloads ~4 GB of local models
smos serve # starts on http://localhost:8888
```
Point Cursor (or Claude Code, opencode, Cline, Aider, Continue.dev) at
`http://localhost:8888/v1` and use `bob` as the model name. That assistant
now remembers across sessions.
**One prerequisite:** [`llama-server`](https://github.com/ggerganov/llama.cpp)
on your `PATH`. SMOS uses it to run three tiny models locally — extraction,
embeddings, reranking. The largest is 4B parameters. These run on a laptop
CPU with integrated graphics — no GPU, no API keys, no cloud bills, no data
leaving your machine. Prefer cloud providers instead? SMOS supports that
too — see [Configure](#configure).
---
Open a new chat in Cursor and your assistant starts from scratch. Switch to
Claude Code or opencode and you re-explain why the cache TTL is 10 seconds,
not 60 — your architecture, your conventions, every decision you already
made. The model is stateless. The tool is replaceable. The memory should
not be.
SMOS fixes this. It is a transparent proxy that sits between your AI client
and the upstream LLM. Every response is mined for facts automatically — the
agent does nothing, the agent forgets nothing. Point any OpenAI-compatible
client at SMOS and your assistant remembers across sessions, across tools,
across model swaps. Works with local llama.cpp, OpenAI, OpenRouter, vLLM —
any OpenAI-compatible upstream. Run fully local for privacy, or point it
at your existing cloud provider.
---
## How it works
```
Client ──▶ SMOS ──▶ upstream LLM (GPT-4o, Claude, local, …)
            │
            ├── 1. ENRICH    inject relevant facts into the request
            ├── 2. FORWARD   stream response back at full LLM speed
            ├── 3. EXTRACT   mine the response for facts (after delivery)
            └── 4. FINALIZE  DeBERTa NLI resolves merges and conflicts
                             (after delivery)
```
Steps 3 and 4 run **off the request path**: the client receives the streamed
response at upstream speed, and extraction and consolidation start only after
delivery, so they add no latency. If any step fails, the system degrades gracefully: the
request forwards unenriched, facts stay pending for the next cycle, HTTP
keeps serving.
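
A minimal sketch of the off-request-path idea, assuming a background worker fed through a `std::sync::mpsc` channel. The names `ExtractionJob`, `spawn_extractor`, and `forward` are hypothetical, and the "extraction" is a placeholder that splits on sentences; it is not how SMOS mines facts.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Work queued for the background stages (EXTRACT and FINALIZE).
struct ExtractionJob {
    person: String,
    response_text: String,
}

/// Spawn a worker that drains jobs until every sender is dropped.
/// Placeholder extraction: each non-empty sentence becomes one "fact".
fn spawn_extractor() -> (Sender<ExtractionJob>, JoinHandle<Vec<String>>) {
    let (tx, rx) = channel::<ExtractionJob>();
    let handle = thread::spawn(move || {
        let mut facts = Vec::new();
        for job in rx {
            for sentence in job.response_text.split('.') {
                let sentence = sentence.trim();
                if !sentence.is_empty() {
                    facts.push(format!("{}: {}", job.person, sentence));
                }
            }
        }
        facts
    });
    (tx, handle)
}

/// Queue the response for extraction without waiting for the result,
/// then hand it back to the caller. The channel is unbounded, so `send`
/// does not block, and a failed send (dead worker) is ignored so the
/// HTTP path keeps serving.
fn forward(person: &str, upstream_response: &str, queue: &Sender<ExtractionJob>) -> String {
    let _ = queue.send(ExtractionJob {
        person: person.to_string(),
        response_text: upstream_response.to_string(),
    });
    upstream_response.to_string()
}
```

The caller gets its response from `forward` regardless of whether the worker is alive, which mirrors the graceful-degradation behaviour described above.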
For the full pipeline, memory lifecycle, and NLI internals, see
[`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md).
---
## Why SMOS
- **Memory is part of the API, not a tool.** Every response is mined for
facts automatically. The agent cannot forget to save, because the agent
is not involved in saving. Extraction runs off the request path — zero
added latency.
- **No external database.** Embedded SurrealDB (RocksDB + HNSW vector
index). No Postgres, no Neo4j, no Qdrant, no Docker. One binary, one
directory.
- **Contradictions are detected, not overwritten.** A DeBERTa-v3 NLI model
evaluates each merge candidate. Both sides of a contradiction are
preserved and surfaced to the LLM — not silently overwritten. The
theoretical basis: ["The Price of Meaning"](https://arxiv.org/abs/2603.27116)
(2026) proves vector-only retrieval degrades through semantic
interference; external verification is necessary.
- **Multi-persona isolation.** Bob for Rust, Alice for ML, Charlie for
DevOps — each a separate memory namespace. One SMOS instance, N isolated
assistants.
- **Runs on any laptop.** Three tiny local models (4 GB total) handle
extraction, embeddings, and reranking on CPU. Tested on a laptop with
integrated Intel graphics — no GPU, no API keys, no cloud bills. Your
conversations never leave your machine.
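
The contradiction handling can be pictured as a small decision over NLI scores. This is a sketch only: the threshold values, the order of the checks, and the names `NliScores`, `Verdict`, and `resolve_verdict` are assumptions for illustration. The real thresholds live under `[nli]` in the config.

```rust
/// Probabilities an NLI classifier assigns to one (stored, new) fact pair.
#[derive(Debug, Clone, Copy)]
struct NliScores {
    entailment: f32,
    contradiction: f32,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    /// The new fact restates the stored one; merge them.
    Merge,
    /// The facts disagree; keep both and flag the pair.
    Conflict,
    /// Neither signal is strong enough; store independently.
    Separate,
}

// Hypothetical thresholds, not SMOS defaults.
const ENTAILMENT_MIN: f32 = 0.8;
const CONTRADICTION_MIN: f32 = 0.8;

/// Contradiction is checked first so a conflict is never merged away.
fn resolve_verdict(scores: NliScores) -> Verdict {
    if scores.contradiction >= CONTRADICTION_MIN {
        Verdict::Conflict
    } else if scores.entailment >= ENTAILMENT_MIN {
        Verdict::Merge
    } else {
        Verdict::Separate
    }
}
```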
---
## Persons: name your assistant
Every AI client sends a `model` field in the request. SMOS uses that field
as a **person name** — and each person is a memory namespace, a routing
target, and an optional persona.
When Cursor sends `{"model": "bob", ...}`, SMOS:
1. Uses `"bob"` as the memory isolation key
2. Rewrites `model` to the upstream model declared for Bob
3. Routes the request to Bob's provider
4. Injects Bob's persona as a system message
5. Enriches the request with facts from Bob's memory namespace
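
The five steps above can be sketched as one routing function. The types `Person` and `Routed`, the `recall` callback, and the "Known facts" message layout are hypothetical; only the behaviour (namespace key, model rewrite, persona and fact injection, HTTP 400 for an unknown name) comes from this README.

```rust
use std::collections::HashMap;

/// One `[persons.bob]`-style entry, reduced to what routing needs.
struct Person {
    provider_url: String,
    upstream_model: String,
    persona: Option<String>,
}

/// What gets sent upstream after the rewrite.
#[derive(Debug, PartialEq)]
struct Routed {
    memory_key: String,
    url: String,
    model: String,
    system_messages: Vec<String>,
}

/// Apply steps 1 to 5 to an incoming `model` value. `recall` stands in
/// for the memory lookup and returns facts for one namespace only.
/// An unknown person name yields HTTP status 400.
fn route(
    persons: &HashMap<String, Person>,
    model_field: &str,
    recall: impl Fn(&str) -> Vec<String>,
) -> Result<Routed, u16> {
    let person = persons.get(model_field).ok_or(400u16)?;

    let mut system_messages = Vec::new();
    if let Some(persona) = &person.persona {
        system_messages.push(persona.clone());
    }
    let facts = recall(model_field);
    if !facts.is_empty() {
        system_messages.push(format!("Known facts:\n- {}", facts.join("\n- ")));
    }

    Ok(Routed {
        memory_key: model_field.to_string(),
        url: person.provider_url.clone(),
        model: person.upstream_model.clone(),
        system_messages,
    })
}
```

Because `recall` only ever receives the person name, facts stored for one person cannot leak into another person's request in this sketch.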
Create **Alice** for ML engineering and **Charlie** for DevOps — each with
its own memory, provider, and persona. Alice never mixes your Rust types
with your Python pipelines. Swap GPT-4o for a local model and Bob stays
Bob — identity lives at the OS layer, not in a chat log you rebuild by
hand.
See [Configure → Agents (persons)](#agents-persons) for the TOML.
---
## What you need
- **~5 GB disk** for local models (one-time download: 4 GB GGUF + 643 MB
DeBERTa NLI).
- **`llama-server`** on your `PATH` — get it from
[llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases)
or build from source. Runs on any modern laptop; GPU is optional, not
required.
- **Any OpenAI-compatible AI client** — Cursor, Claude Code, opencode,
Cline, Continue.dev, Aider, Windsurf, or plain `curl`.
No Postgres. No Neo4j. No Docker. No cloud account. No API key (unless you
choose to use a cloud provider as your upstream).
---
## Install
### Prebuilt binary (recommended)
```bash
cargo binstall smos
```
The same binary runs on CPU and (when one is detected) on the host GPU.
SMOS probes the hardware at startup, downloads the matching ONNX Runtime
shared library into `~/.smos/models/ort/` on first use (~5–300 MB
depending on the device), and falls back to CPU if no GPU is available.
No feature flags, no per-vendor rebuild.
Detected devices:
- **Windows** — CUDA (NVIDIA only), DirectML (Intel Arc, AMD, NVIDIA via
DirectX 12), CPU fallback.
- **Linux** — CUDA (NVIDIA), CPU fallback.
- **macOS** — Metal / CoreML on Apple Silicon, CPU fallback.
Override the probe by setting `[nli_backend].device = "cpu" | "directml"
| "cuda" | "metal"` in `~/.smos/config.toml`.
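
One way to picture the probe with override and CPU fallback is sketched below. The preference order within each OS and the `is_available` callback are assumptions; how SMOS actually detects hardware is not described here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Device {
    Cpu,
    DirectMl,
    Cuda,
    Metal,
}

/// Parse the `[nli_backend].device` override; unknown strings are ignored.
fn parse_override(value: &str) -> Option<Device> {
    match value {
        "cpu" => Some(Device::Cpu),
        "directml" => Some(Device::DirectMl),
        "cuda" => Some(Device::Cuda),
        "metal" => Some(Device::Metal),
        _ => None,
    }
}

/// Accelerators worth probing per OS, in an assumed preference order.
fn candidates(os: &str) -> &'static [Device] {
    match os {
        "windows" => &[Device::Cuda, Device::DirectMl],
        "linux" => &[Device::Cuda],
        "macos" => &[Device::Metal],
        _ => &[],
    }
}

/// A valid override wins; otherwise take the first available
/// accelerator for this OS, and fall back to CPU if none is found.
fn select_device(
    os: &str,
    override_value: Option<&str>,
    is_available: impl Fn(Device) -> bool,
) -> Device {
    if let Some(device) = override_value.and_then(parse_override) {
        return device;
    }
    candidates(os)
        .iter()
        .copied()
        .find(|d| is_available(*d))
        .unwrap_or(Device::Cpu)
}
```

In a real program the `os` argument could come from `std::env::consts::OS`, which reports `"windows"`, `"linux"`, or `"macos"` on those platforms.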
### From source
```bash
cargo install smos
```
### npm
```bash
npm install -g @yurvon_screamo/smos
```
Verify:
```bash
smos --version
```
---
## Setup
### Step 1 — Get llama-server
SMOS uses [llama.cpp](https://github.com/ggerganov/llama.cpp) to run three
tiny models locally — a 4B extraction LLM, an embedding model, and a
reranker. These are small enough to run on a laptop CPU with integrated
graphics. No GPU required.
**Quickest path:**
- Download a prebuilt binary from
[llama.cpp releases](https://github.com/ggerganov/llama.cpp/releases)
(look for `llama-server` in the assets for your platform).
- Or build from source:
`git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build --config Release`
- Ensure `llama-server` is on your `PATH` (`llama-server --help` should
work from any directory).
The GGUF model weights for the three roles are downloaded automatically in
the next step — you do not need to fetch them by hand.
> **Prefer cloud?** Skip llama-server entirely. Set
> `[llama_cpp].auto_launch = false` in `~/.smos/config.toml` and point
> `[llm_extraction]`, `[embedding]`, and `[reranker]` at any
> OpenAI-compatible cloud provider. See [Configure](#configure).
### Step 2 — Initialize
```bash
smos init
```
This single command:
- Creates `~/.smos/` with a default `config.toml`, working directories
(`db/`, `models/`, `persons/`, `logs/`, `reports/`), and a stub persona at
`persons/bob.md`.
- Checks for `llama-server` on `PATH`.
- Downloads the GGUF models (~4 GB total) into `~/.smos/models/`:
  - `nemotron-3-nano-4b.gguf`: extraction + chat LLM.
  - `jina-embeddings-v5.gguf`: embedding model.
  - `qwen3-reranker.gguf`: cross-encoder reranker.
- Probes `/health` on the three configured ports (28081 embedding, 28082
extraction, 28181 reranker).
- Initializes the database (SurrealDB migrations).
- Reports what is ready and what still needs attention.
Already-downloaded models are skipped, so re-running `smos init` only retries
the failed ones. Fix any `✗` items shown, then run `smos init` again to verify.
For a deeper audit (NLI cache, stats, a Markdown report), run `smos doctor`.
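
To check the three ports by hand, a `/health` probe can be sketched with the standard library alone. The helper names are hypothetical, and treating only HTTP 200 as ready is an assumption about what counts as healthy.

```rust
use std::io::{Read, Write};
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Default ports listed above: embedding, extraction, reranker.
const ROLE_PORTS: [(&str, u16); 3] = [
    ("embedding", 28081),
    ("extraction", 28082),
    ("reranker", 28181),
];

/// Extract the status code from the first line of an HTTP/1.x response.
fn status_code(response: &str) -> Option<u16> {
    let line = response.lines().next()?;
    let mut parts = line.split_whitespace();
    if !parts.next()?.starts_with("HTTP/") {
        return None;
    }
    parts.next()?.parse().ok()
}

/// Send `GET /health` and report whether the server answered 200.
/// Any resolution, connection, or I/O error counts as "not ready".
fn is_healthy(host: &str, port: u16, timeout: Duration) -> bool {
    let Some(addr) = (host, port).to_socket_addrs().ok().and_then(|mut a| a.next()) else {
        return false;
    };
    let Ok(mut stream) = TcpStream::connect_timeout(&addr, timeout) else {
        return false;
    };
    let _ = stream.set_read_timeout(Some(timeout));
    let request = format!("GET /health HTTP/1.1\r\nHost: {host}:{port}\r\nConnection: close\r\n\r\n");
    if stream.write_all(request.as_bytes()).is_err() {
        return false;
    }
    let mut buf = [0u8; 256];
    let n = stream.read(&mut buf).unwrap_or(0);
    status_code(&String::from_utf8_lossy(&buf[..n])) == Some(200)
}

/// Probe every role and pair its name with the result.
fn probe_all(host: &str, timeout: Duration) -> Vec<(&'static str, bool)> {
    ROLE_PORTS
        .iter()
        .map(|&(role, port)| (role, is_healthy(host, port, timeout)))
        .collect()
}
```

Calling `probe_all("127.0.0.1", Duration::from_secs(2))` returns one entry per role, which is enough to see which server still needs attention.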
### Step 3 — Start
```bash
smos serve
```
With `auto_launch = true` (the default), SMOS spawns the three `llama-server`
processes itself on first start — an already-running server on the same port is
reused. The first start also downloads the DeBERTa NLI model (~643 MB) into
`~/.smos/models/`; subsequent starts are instant.
Verify it works:
```bash
curl http://localhost:8888/health
# → {"status":"ok","version":"0.1.7"}
```
### Step 4 — Install as a service (optional)
```bash
smos service install # auto-starts at boot
smos service start # start now
smos service status # current state
smos service stop # stop
smos service uninstall # remove
```
Registered as systemd (Linux), launchd (macOS), or a Windows Service.
---
## Configure
All configuration lives in `~/.smos/config.toml`. `smos init` creates it with
safe defaults; edit the file by hand from there. Any section omitted falls back
to the built-in default.
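
The fallback rule can be sketched with `Option` fields. The struct names are hypothetical, and per-field fallback inside a present section is an assumption (the text above only promises it for whole sections); the default values are the ones shown for `[llama_cpp]` later in this README.

```rust
/// Fully resolved `[llama_cpp]` settings.
#[derive(Debug, Clone, PartialEq)]
struct LlamaCpp {
    binary: String,
    auto_launch: bool,
    idle_timeout_seconds: u64,
}

impl Default for LlamaCpp {
    fn default() -> Self {
        LlamaCpp {
            binary: "llama-server".to_string(),
            auto_launch: true,
            idle_timeout_seconds: 300,
        }
    }
}

/// Values as read from the file; `None` means the key was absent.
#[derive(Default)]
struct LlamaCppOverrides {
    binary: Option<String>,
    auto_launch: Option<bool>,
    idle_timeout_seconds: Option<u64>,
}

/// A missing section yields the defaults; a present section overrides
/// only the keys it sets.
fn resolve_llama_cpp(section: Option<LlamaCppOverrides>) -> LlamaCpp {
    let defaults = LlamaCpp::default();
    let Some(overrides) = section else {
        return defaults;
    };
    LlamaCpp {
        binary: overrides.binary.unwrap_or(defaults.binary),
        auto_launch: overrides.auto_launch.unwrap_or(defaults.auto_launch),
        idle_timeout_seconds: overrides
            .idle_timeout_seconds
            .unwrap_or(defaults.idle_timeout_seconds),
    }
}
```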
### Inspect current configuration
```bash
smos config show # full resolved config as TOML (defaults merged in)
smos config providers # list providers: name → URL
smos config persons # list agents: name → provider / model
```
These commands are read-only. To change configuration, edit the TOML.
### Providers
A **provider** is one upstream OpenAI-compatible endpoint (`llama-server`,
OpenRouter, OpenAI, vLLM…). One entry per upstream; there is no round-robin or
failover — routing is per-agent.
```toml
[[providers]]
name = "llama-local"
url = "http://localhost:28082/v1/chat/completions"
api_key_env = "" # env var name; empty = no auth header
# Cloud example — uncomment and set OPENROUTER_API_KEY in the environment
# [[providers]]
# name = "openrouter"
# url = "https://openrouter.ai/api/v1/chat/completions"
# api_key_env = "OPENROUTER_API_KEY"
```
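
The `api_key_env` indirection can be sketched as follows. The function name and the `lookup` callback are hypothetical, and treating an unset or empty variable as "no header" is an assumption; the `Bearer` scheme is the one OpenAI-compatible endpoints expect.

```rust
/// Build the `Authorization` header value for a provider.
/// `api_key_env` is the name of an environment variable, never the key
/// itself. An empty name, or an unset or empty variable, means no
/// header is sent.
fn auth_header(api_key_env: &str, lookup: impl Fn(&str) -> Option<String>) -> Option<String> {
    if api_key_env.is_empty() {
        return None;
    }
    let key = lookup(api_key_env)?;
    if key.is_empty() {
        return None;
    }
    Some(format!("Bearer {key}"))
}
```

In production the lookup would be `|name| std::env::var(name).ok()`; passing it as a parameter keeps the function testable without touching the process environment.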
### Agents (persons)
A **person** bundles a memory namespace, a routing target, and an optional
persona. When a client sends `{"model": "bob", ...}`, SMOS uses `"bob"` as the
memory isolation key, rewrites `model` to the upstream model, and routes to the
declared provider.
```toml
[persons.bob]
provider = "llama-local" # must match a [[providers]].name
model = "nemotron-3-nano-4b" # upstream model id
persona = "~/.smos/persons/bob.md" # optional; ~ expands to user home
# [persons.alice]
# provider = "openrouter"
# model = "z-ai/glm-5.2"
# persona = "~/.smos/persons/alice.md"
```
A model name that is not a configured person returns HTTP 400 — every request
must name a real `[persons.*]` entry.
### Persona files
`~/.smos/persons/bob.md` is plain markdown, injected once per conversation as a
`system` message:
```markdown
You are Bob, a Rust systems programming assistant.
Focus on memory safety and performance.
Be concise. Prefer code over long explanations.
Reply in English.
```
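
A sketch of the injection step, assuming the persona is prepended as the first message and skipped when an identical `system` message is already present. How SMOS itself decides that a conversation already carries the persona is not documented here; `Message` and `inject_persona` are hypothetical names.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: String,
    content: String,
}

/// Prepend the persona as a `system` message unless the conversation
/// already contains it, so repeated turns do not stack copies.
fn inject_persona(mut messages: Vec<Message>, persona: &str) -> Vec<Message> {
    let persona = persona.trim();
    let already_present = messages
        .iter()
        .any(|m| m.role == "system" && m.content.trim() == persona);
    if !persona.is_empty() && !already_present {
        messages.insert(
            0,
            Message {
                role: "system".to_string(),
                content: persona.to_string(),
            },
        );
    }
    messages
}
```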
### Git memory sync (optional)
Dual-write every extracted fact to a local git repo as markdown files — backup,
versioning, and re-hydration onto another machine. Empty `repo_url` disables
sync.
```toml
[git]
repo_url = "git@github.com:user/smos-memory.git"
branch = "main"
auto_push = true
local_path = "~/.smos/git/memory"
disable_gpg_sign = true
```
On a second machine, re-hydrate the facts with `smos import-git `. Provider
API keys are read from the env var named in `api_key_env`, so secrets never land
in TOML.
### Advanced: llama.cpp auto-launch
By default, `smos serve` spawns the three `llama-server` processes itself and
reuses any server already bound to the configured port. Override the binary,
ports, model paths, or extra CLI args here; flip `auto_launch = false` if you
launch `llama-server` yourself or use a remote / cloud provider.
```toml
[llama_cpp]
binary = "llama-server"
auto_launch = true
# Unload models from VRAM after this many seconds idle (5 min default).
# Set to 0 to disable. Only appended when llama-server supports the flag.
idle_timeout_seconds = 300
[llama_cpp.embedding]
model_path = "~/.smos/models/jina-embeddings-v5.gguf"
port = 28081
extra_args = ["--ctx-size", "2048", "--embeddings"]
[llama_cpp.reranker]
model_path = "~/.smos/models/qwen3-reranker.gguf"
port = 28181
extra_args = ["--ctx-size", "8192"]
[llama_cpp.extraction]
model_path = "~/.smos/models/nemotron-3-nano-4b.gguf"
port = 28082
extra_args = ["--ctx-size", "4096"]
```
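
For readers launching the servers themselves, the config above maps to a command line roughly as sketched here. The `Role` struct and helper names are hypothetical; `--model` and `--port` are standard `llama-server` flags, and the port check is only one plausible way to detect an already-running server.

```rust
use std::path::PathBuf;

/// One `[llama_cpp.*]` role from the config.
struct Role {
    model_path: String,
    port: u16,
    extra_args: Vec<String>,
}

/// Expand a leading `~/` using the given home directory.
fn expand_home(path: &str, home: &str) -> PathBuf {
    match path.strip_prefix("~/") {
        Some(rest) => PathBuf::from(home).join(rest),
        None => PathBuf::from(path),
    }
}

/// Arguments for one `llama-server` process: model, port, then extras.
fn server_args(role: &Role, home: &str) -> Vec<String> {
    let mut args = vec![
        "--model".to_string(),
        expand_home(&role.model_path, home)
            .to_string_lossy()
            .into_owned(),
        "--port".to_string(),
        role.port.to_string(),
    ];
    args.extend(role.extra_args.iter().cloned());
    args
}

/// A port counts as taken when something accepts a TCP connection on it.
fn port_in_use(port: u16) -> bool {
    std::net::TcpStream::connect(("127.0.0.1", port)).is_ok()
}
```

A launcher built on this would skip the spawn when `port_in_use(role.port)` is true and otherwise run `std::process::Command::new("llama-server").args(server_args(&role, home)).spawn()`.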
### Full configuration reference
See [`smos.toml`](smos.toml) for the canonical, fully-commented example.
| Section | Purpose |
|---|---|
| `[[providers]]` | OpenAI-compatible chat-completion endpoints. One per upstream. |
| `[persons.]` | Person = memory key + provider + upstream model + optional persona. |
| `[git]` | Git-backed memory sync (`repo_url`, `branch`, `auto_push`). |
| `[llama_cpp]` | Auto-launch config for `llama-server` processes (ports, model paths). |
| `[llm_extraction]` | Fact-extraction LLM (model, temperature, seed, timeout). |
| `[embedding]` | Vector embedding model (model, dimensions, timeout). |
| `[reranker]` | Cross-encoder reranker URL (`/v1/rerank`). |
| `[retrieval]` | top-K initial/final, `min_topic_chars`, `min_confidence`. |
| `[merge]` | Cosine threshold for merge candidate selection. |
| `[confidence]` | Base + multi-source/no-contradiction bonuses, accept/pending cut. |
| `[nli]` | Verdict thresholds (contradiction/entailment). |
| `[nli_backend]` | Native ONNX model id + cache directory + device selection. |
| `[extraction]` | Semantic dedup cosine threshold. |
| `[heat]` | Decay rate, min threshold (boosts recently-active facts). |
| `[session]` | Timeout, pending overflow, watcher scan interval. |
| `[audit]` | Optional dreaming agent (schedule, model, mutation caps). |
| `[surreal]` | Embedded RocksDB path + namespace/database. |
| `[server]` | Bind host/port, shutdown grace, log format. |
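
As an illustration of what the `[merge]` cosine threshold controls, the sketch below selects merge candidates by similarity. The function names and the threshold used in the example are hypothetical; the real index is the HNSW index inside SurrealDB, not a linear scan.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either is zero.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Indices of stored embeddings similar enough to the new fact to be
/// handed to the NLI model as merge candidates.
fn merge_candidates(new_fact: &[f32], stored: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    stored
        .iter()
        .enumerate()
        .filter_map(|(i, v)| (cosine(new_fact, v) >= threshold).then_some(i))
        .collect()
}
```

Raising the threshold sends fewer pairs to NLI; lowering it catches more paraphrases at the cost of more NLI calls.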
---
## Connect your AI client
Any client that speaks the OpenAI Chat Completions API works — Cursor,
Claude Code, opencode, Cline, Continue.dev, Aider, Windsurf, and anything
else that lets you set a custom base URL. Point it at SMOS and use the
**person name** as the model.
### opencode
```bash
export OPENAI_BASE_URL=http://localhost:8888/v1
export OPENAI_API_KEY=smos
opencode --model bob
```
### Cursor
Settings → Models → OpenAI API Base URL: `http://localhost:8888/v1`
Model name: `bob`
### curl
```bash
curl http://localhost:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"bob","messages":[{"role":"user","content":"hello"}]}'
```
For other OpenAI-compatible clients, the pattern is the same: set the
base URL to `http://localhost:8888/v1`, set any API key (SMOS does not
validate it by default), and use the person name as the model.
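
For a client written from scratch, the request shape is small enough to sketch with the standard library. The helper names are hypothetical; a real client would more likely use an HTTP crate and a JSON library, which are left out here to keep the example self-contained.

```rust
/// Escape a string for use inside a JSON string literal.
fn json_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Chat Completions body with the person name in the `model` field.
fn chat_body(person: &str, user_message: &str) -> String {
    format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":"{}"}}]}}"#,
        json_escape(person),
        json_escape(user_message)
    )
}

/// Raw HTTP/1.1 request for the SMOS endpoint. The API key is a
/// placeholder, since SMOS does not validate it by default.
fn chat_request(host: &str, port: u16, body: &str) -> String {
    format!(
        "POST /v1/chat/completions HTTP/1.1\r\nHost: {host}:{port}\r\nContent-Type: application/json\r\nAuthorization: Bearer smos\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
        body.len()
    )
}
```

Writing `chat_request("localhost", 8888, &chat_body("bob", "hello"))` to a `TcpStream` connected to `localhost:8888` sends the same request as the `curl` example, plus a placeholder `Authorization` header.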
---
## Commands
| Command | Description |
|---|---|
| `smos init` | One-command setup: bootstrap `~/.smos`, download GGUF models, probe `llama-server`, run DB migrations. Idempotent. |
| `smos serve` | Start the HTTP proxy (auto-launches `llama-server` processes). |
| `smos doctor` | Validate environment + show SurrealDB stats. |
| `smos doctor --stats` | Quick memory stats (no model round-trips). |
| `smos doctor --report ` | Generate a Markdown health report. |
| `smos doctor --skip-llama` | Skip the `llama-server` + reranker probes. |
| `smos config show` | Print the full resolved configuration as TOML. |
| `smos config providers` | List configured providers (name → URL). |
| `smos config persons` | List configured agents (name → provider / model). |
| `smos import --from-file ` | Import an opencode transcript JSON into memory. |
| `smos import --list` | List discoverable opencode sessions. |
| `smos import-dir ` | Bulk import from a directory (`*.md`, `*.txt`, `*.json`, …). |
| `smos import-git ` | Re-hydrate facts from a git-synced memory repo. |
| `smos import raw ""` | Extract facts from arbitrary free-form text. |
| `smos import raw --stdin` | Same, reading the text body from stdin. |
| `smos finalize ` | Manually trigger memory consolidation for one session. |
| `smos audit` | Run the dreaming agent once (memory cleanup / merges / pruning). |
| `smos service install` | Install SMOS as a system service (auto-starts at boot). |
Global flag: `--config ` to point at a non-default config file.
---
## Known limitations
Honest scope, not marketing hedging:
- **643 MB DeBERTa-v3 ONNX download on first start.** Subsequent starts
are instant. The model is cached under `~/.smos/models/`.
- **`llama-server` on `PATH` for local inference.** SMOS auto-launches
the three `llama-server` processes (extraction, embedding, reranker)
when `auto_launch = true`. The models are tiny (4 GB total) and run on
CPU. To use cloud providers instead, set `auto_launch = false` and
point the extraction / embedding / reranker URLs at your provider.
- **Extraction model is English-optimized.** Nemotron-3-Nano-4B is
multilingual, but accuracy is highest on English. The DeBERTa NLI model
is English-only.
- **Single-process SurrealDB lock.** One SMOS instance per database path.
No built-in horizontal scaling. Multi-machine sync via the git backend.
- **Not benchmarked on LOCOMO.** NLI-based contradiction detection is an
architectural choice, not a measured benchmark result.
---
## Inspiration
SMOS builds on academic research in AI agent memory:
- **[MemoryOS: Memory OS of AI Agent](https://arxiv.org/abs/2506.06326)**
(Kang et al., 2025, EMNLP 2025 Oral) — hierarchical memory management
for AI agents. SMOS adopts a similar lifecycle
(`pending → accepted → conflict-flagged`) driven by natural-language
inference rather than hand-tuned heuristics.
- **[The Price of Meaning: Why Every Semantic Memory System Forgets](https://arxiv.org/abs/2603.27116)**
(Ray Barman et al., 2026) — interference is fundamental in semantic
memory: every store that decides what to keep also decides what to
lose, and pure vector retrieval is mathematically proven to degrade.
SMOS sidesteps this by preserving both sides of a contradiction and
flagging them, instead of picking a winner — and by layering DeBERTa
NLI on top of cosine retrieval as the external verification the paper
calls necessary.
---
## License
MIT — see [`LICENSE`](LICENSE).