{"id":49675507,"url":"https://github.com/openclaw/discrawl","last_synced_at":"2026-05-07T02:01:36.257Z","repository":{"id":342844843,"uuid":"1175225759","full_name":"openclaw/discrawl","owner":"openclaw","description":"cli for Discord with sqlite backend","archived":false,"fork":false,"pushed_at":"2026-05-05T09:27:13.000Z","size":832,"stargazers_count":695,"open_issues_count":1,"forks_count":66,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-05-05T11:20:26.120Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://discrawl.sh","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openclaw.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"docs/security.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":["moltbot"]}},"created_at":"2026-03-07T12:13:48.000Z","updated_at":"2026-05-05T09:27:16.000Z","dependencies_parsed_at":"2026-03-10T15:00:51.128Z","dependency_job_id":null,"html_url":"https://github.com/openclaw/discrawl","commit_stats":null,"previous_names":["steipete/discrawl","openclaw/discrawl"],"tags_count":14,"template":false,"template_full_name":null,"purl":"pkg:github/openclaw/discrawl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openclaw%2Fdiscrawl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openclaw%2Fdiscrawl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openclaw%2Fdiscrawl/releases","manifests_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/openclaw%2Fdiscrawl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openclaw","download_url":"https://codeload.github.com/openclaw/discrawl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openclaw%2Fdiscrawl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32719572,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-07T00:29:05.620Z","status":"online","status_checked_at":"2026-05-07T02:00:07.170Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-07T02:00:50.848Z","updated_at":"2026-05-07T02:01:36.250Z","avatar_url":"https://github.com/openclaw.png","language":"Go","funding_links":["https://github.com/sponsors/moltbot"],"categories":["Go"],"sub_categories":[],"readme":"# discrawl 🛰️ — Mirror Discord into SQLite; search server history locally\n\n`discrawl` mirrors Discord guild data into local SQLite so you can search, inspect, and query server history without depending on Discord search. It can also import classifiable Discord Desktop cache messages for local DM recovery/search without using a user token. 
Teams can publish the guild archive as a private Git snapshot repo, so readers get fresh org memory without Discord bot credentials.\n\nThere are two local archive sources:\n\n- Discord bot API sync for guilds, channels, members, threads, and message history the configured bot can access\n- Discord Desktop cache import for local, classifiable cached messages, including proven local-only DMs under `@me`\n\nDesktop wiretap mode reads local cache artifacts only. It does not extract credentials, use user tokens, call the Discord API as your user, or run a selfbot.\n\nWiretap DMs stay local and are never exported to the Git-backed snapshot mirror.\n\n## What It Does\n\n- discovers every guild the configured bot can access\n- syncs channels, threads, members, and message history into SQLite\n- maintains FTS5 search indexes for fast local text search\n- builds an offline member directory from archived profile payloads\n- extracts small text-like attachments into the local search index\n- records structured user and role mentions for direct querying\n- tails Gateway events for live updates, with periodic repair syncs\n- imports classifiable Discord Desktop cache messages with `wiretap`, including proven DMs under `@me`\n- publishes and imports private Git-backed archive snapshots for org-wide read access\n- browses stored messages and local DMs in a terminal archive UI\n- exposes `metadata --json`, `status --json`, and `doctor --json` for local\n  launchers, automation, and CI\n- supports Git-only read mode with no Discord credentials on reader machines\n- generates backup README activity reports, with optional AI-written field notes\n- exposes read-only SQL for ad hoc analysis\n- keeps schema multi-guild ready while preserving a simple single-guild default UX\n\nSearch defaults to all guilds. 
`sync` and `tail` default to the configured default guild when one exists, otherwise they fan out to all discovered guilds.\n\n## Requirements\n\n- Go `1.26+`\n- for publishing/syncing guilds: a Discord bot token the bot can use to read the target guilds\n- for DM wiretap import: local Discord Desktop cache files on the same machine\n- for read-only Git-backed access: access to a private snapshot repo, no Discord credentials required\n- bot permissions for the channels you want archived when running `sync` or `tail`\n\n### Discord Bot Setup\n\n`discrawl` needs a real bot token. Not a user token.\n\nMinimum practical setup:\n\n1. Create or reuse a Discord application in the Discord developer portal.\n2. Add a bot user to that application.\n3. Invite the bot to the target guilds.\n4. Enable these intents for the bot:\n   - `Server Members Intent`\n   - `Message Content Intent`\n5. Ensure the bot can at least:\n   - view channels\n   - read message history\n\nWithout those intents/permissions, `sync`, `tail`, member snapshots, or message content archiving will be partial or fail.\n\n### Bot Token Sources\n\nToken resolution:\n\n1. `DISCORD_BOT_TOKEN` or the configured `discord.token_env`\n2. OS keyring item `discrawl` / `discord_bot_token`, or the configured keyring service/account\n\n`discrawl` accepts either raw token text or a value prefixed with `Bot `. 
It normalizes that automatically.\n\nFastest path:\n\n```bash\nexport DISCORD_BOT_TOKEN=\"your-bot-token\"\ndiscrawl doctor\ndiscrawl init\n```\n\nIf you keep shell secrets in `~/.profile`, add:\n\n```bash\nexport DISCORD_BOT_TOKEN=\"your-bot-token\"\n```\n\nThen reload your shell before running `discrawl`.\n\nIf you prefer the OS keyring, keep the token out of config and store it in the default keyring item:\n\n```bash\n# macOS Keychain\nsecurity add-generic-password -U -s discrawl -a discord_bot_token -w \"$DISCORD_BOT_TOKEN\"\n\n# Linux Secret Service / libsecret\nprintf %s \"$DISCORD_BOT_TOKEN\" | secret-tool store --label=\"discrawl Discord bot token\" service discrawl username discord_bot_token\n\n# Windows Credential Manager\ncmdkey /generic:discrawl:discord_bot_token /user:discord_bot_token /pass:%DISCORD_BOT_TOKEN%\n```\n\nSet `discord.token_source = \"keyring\"` if you want to require keyring lookup instead of env-first fallback.\n\nDefault runtime paths:\n\n- config: `~/.discrawl/config.toml`\n- database: `~/.discrawl/discrawl.db`\n- cache: `~/.discrawl/cache/`\n- logs: `~/.discrawl/logs/`\n\n## Install\n\nHomebrew (recommended):\n\n```bash\nbrew install steipete/tap/discrawl  # auto-taps steipete/tap\ndiscrawl --version\n```\n\nBuild from source:\n\n```bash\ngit clone https://github.com/openclaw/discrawl.git\ncd discrawl\ngo build -o bin/discrawl ./cmd/discrawl\n./bin/discrawl --version\n```\n\nExamples below assume `discrawl` is on `PATH`. 
If you built from source without installing it, replace `discrawl` with `./bin/discrawl`.\n\n## Quick Start\n\nConfigure a Discord bot token and refresh both bot-visible guild data and local desktop cache data:\n\n```bash\nexport DISCORD_BOT_TOKEN=\"...\"\ndiscrawl init\ndiscrawl doctor\ndiscrawl sync --full\ndiscrawl sync\ndiscrawl search \"panic: nil pointer\"\ndiscrawl tail\n```\n\nUse `discrawl sync --source wiretap` when you only want the local Discord Desktop cache import and do not want bot-token API sync.\n\nGit-only reader setup:\n\n```bash\ndiscrawl subscribe https://github.com/example/discord-archive.git\ndiscrawl search \"launch checklist\"\ndiscrawl messages --channel general --hours 24\n```\n\n`init` discovers accessible guilds and writes `~/.discrawl/config.toml`. If exactly one guild is available, that guild becomes the default automatically.\n`subscribe` writes a token-free config, imports the private Git snapshot, and read commands auto-refresh when the local snapshot is older than `15m`.\n\n`doctor` is the fastest sanity check:\n\n- confirms config can be loaded\n- shows where the token was resolved from\n- verifies bot auth\n- shows how many guilds the bot can access\n- verifies DB + FTS wiring\n\n## Commands\n\n### `tui`\n\nOpens the local terminal archive browser for stored messages.\n\n```bash\ndiscrawl tui\ndiscrawl tui --guild 123456789012345678 --channel general\ndiscrawl tui --dm\ndiscrawl --json tui --limit 50\n```\n\nThe terminal browser uses the shared crawlkit explorer. The left pane groups\nchannels, people, or threads; the middle pane lists messages; the right pane\nshows the selected message, surrounding conversation, and thread detail. 
Mouse\nselection, right-click actions, sortable headers, and the local/remote footer\nfollow the same interaction model as `gitcrawl tui`.\n\n### `init`\n\nCreates the local config and discovers accessible guilds.\n\n```bash\ndiscrawl init\ndiscrawl init --guild 123456789012345678\ndiscrawl init --db ~/data/discrawl.db\n```\n\n### `sync`\n\nRefreshes SQLite from one or both archive sources.\n\nBy default, `sync` runs both live/local sources and does not import the Git snapshot first:\n\n- Discord bot-token sync for bot-visible guild data\n- local Discord Desktop cache import for classifiable cached messages and proven DMs\n\nUse `discrawl update` when you want to pull/import the shared Git snapshot. If you intentionally want a sync run to import the snapshot before live deltas, pass `--update=auto` to import only when stale or `--update=force` to pull/import before syncing. `--no-update` is accepted as an explicit no-op alias for the default.\n\nRun one explicit `--full` pass when you want a complete historical guild archive. 
Use plain `sync` afterward for frequent latest-message and desktop-cache refreshes.\n\n```bash\ndiscrawl sync\ndiscrawl sync --update=auto\ndiscrawl sync --update=force\ndiscrawl sync --no-update\ndiscrawl sync --full\ndiscrawl sync --full --all\ndiscrawl sync --guild 123456789012345678\ndiscrawl sync --guilds 123,456 --concurrency 8\ndiscrawl sync --source both      # default: bot API + desktop cache\ndiscrawl sync --source discord   # bot API only; aliases: key, bot, api\ndiscrawl sync --source wiretap   # desktop cache only; aliases: desktop, cache\ndiscrawl sync --guild 123456789012345678 --all-channels\ndiscrawl sync --channels 111,222 --since 2026-03-01T00:00:00Z\n```\n\nSync sources:\n\n| Source | Reads from | Stores |\n| --- | --- | --- |\n| `both` | Discord bot API and local Discord Desktop cache | bot-visible guild data plus classifiable cached desktop messages |\n| `discord` / `key` | Discord bot API | guilds, channels, threads, members, and messages the bot can access |\n| `wiretap` | local Discord Desktop cache files | classifiable cached messages; proven DMs are stored under `@me` |\n\nSync modes control the Discord bot API side of a run. 
When `wiretap` is selected, the desktop cache import runs once alongside the chosen bot sync mode.\n\nBot sync modes:\n\n| Command | Use when | Behavior |\n| --- | --- | --- |\n| `discrawl sync` | routine refresh | skips member refreshes, checks live top-level channels plus active threads, and only fetches new messages for channels with a stored latest cursor |\n| `discrawl sync --update=auto` | hybrid Git/live refresh | imports a stale Git snapshot first, then runs the routine live refresh |\n| `discrawl sync --all-channels` | repair pass | broad incremental sweep across every stored channel/thread, including archived threads |\n| `discrawl sync --full` | historical backfill | crawls older history until channels are complete; can take a long time on large servers |\n\n`sync` already uses parallel channel workers for bot API message crawling.\n`--concurrency` overrides the default, and the default is auto-sized from `GOMAXPROCS` with a floor of `8` and a cap of `32`.\n`--all` ignores `default_guild_id` and fans out across every discovered guild the bot can access.\n`--skip-members` refreshes guild/channel/message data without crawling the full member list, which is useful for frequent Git snapshot publishers that only need latest messages.\n`--latest-only` is still accepted for explicit latest-only runs; it is now the default for untargeted `sync`. Use `--all-channels` to opt out of the fast default without doing a full historical crawl.\nWhen `--channels` includes a forum channel id, `discrawl` expands that forum's threads and syncs their messages as part of the targeted run.\n`--since` limits initial history/bootstrap and full-history backfill to messages at or after the given RFC3339 timestamp. 
It does not mark older history as complete, so a later `sync --full` without `--since` can continue the backfill.\nLong runs now emit periodic progress logs to stderr so large backfills and Git snapshot imports do not look hung.\nIf in-flight channels stop completing for a while, `discrawl` now emits `message sync waiting` heartbeat logs with the oldest active channel, per-channel page activity, and skip/defer counters, and every run ends with a `message sync finished` summary.\nEach channel crawl also has a bounded runtime budget, so a pathological channel is deferred and retried on the next sync instead of pinning a worker forever.\nFull sync member refresh is best-effort and currently gives up after five minutes without a caller-supplied deadline, so message sync completion is not held hostage by a slow guild member crawl.\nWhen the archive is already complete, `sync --full` now reuses the stored backlog markers and limits steady-state refresh to live top-level channels plus active threads instead of revisiting every stored archived thread.\nIf a guild already has a local member snapshot, routine syncs reuse it and skip another full member crawl until that snapshot ages out.\n\n### `tail`\n\nRuns the live Gateway tail and periodic repair loop.\n\n```bash\ndiscrawl tail\ndiscrawl tail --guild 123456789012345678\ndiscrawl tail --repair-every 30m\n```\n\n### `wiretap`\n\nImports classifiable Discord Desktop message payloads into the same local SQLite archive.\n\nThis is the path for searchable DMs because bot tokens cannot read personal direct messages.\n\n`wiretap` is also available through `discrawl sync --source wiretap` and is included in the default `discrawl sync --source both` path.\n\n```bash\ndiscrawl wiretap\ndiscrawl wiretap --path \"$HOME/Library/Application Support/discord\"\ndiscrawl wiretap --dry-run\ndiscrawl wiretap --full-cache\ndiscrawl wiretap --watch-every 2m\n```\n\nNotes:\n\n- stores classifiable cache messages in the same `guilds`, 
`channels`, and `messages` tables used by bot sync\n- stores proven DMs under the synthetic guild id `@me`\n- keeps `@me` rows local-only: `publish`, Git snapshot import/export, and optional embedding snapshot export exclude DM guilds, channels, messages, events, attachments, mentions, wiretap sync state, and vectors for DM messages\n- preserves existing local `@me` guilds, channels, messages, and attachments when importing a Git snapshot, so a shared guild mirror refresh does not wipe local wiretap DM search\n- drops message payloads whose channel cannot be classified from cached channel metadata or Discord route URLs; dropped rows are counted as `skipped_messages`\n- imports what Discord Desktop has cached locally, not complete live DM history\n- scans local `.ldb`, `.log`, `.json`, and `.txt` artifacts for Discord message JSON, plus route-bearing Chromium HTTP cache entries by default\n- use `--full-cache` or `desktop.full_cache = true` for exhaustive Chromium cache import when you want slower historical guild-cache archaeology\n- does not extract, store, or print Discord auth tokens\n- `--max-file-bytes` skips unusually large files; default is 64 MiB\n\n### `search`\n\nSearches archived messages. FTS is the default mode and works without embeddings.\n\n```bash\ndiscrawl search \"panic: nil pointer\"\ndiscrawl search --mode fts \"panic: nil pointer\"\ndiscrawl search --mode semantic \"missing launch checklist\"\ndiscrawl search --mode hybrid \"database timeout\"\ndiscrawl search --guild 123456789012345678 \"payment failed\"\ndiscrawl search --dm \"launch checklist\"\ndiscrawl search --channel billing --author steipete --limit 50 \"invoice\"\ndiscrawl search --include-empty \"GitHub\"\ndiscrawl --json search \"websocket closed\"\n```\n\nBy default, `search` skips rows with no searchable content. Attachment text, attachment filenames, embeds, and replies still count as content. 
Use `--include-empty` to opt back in.\n\nModes:\n\n- `fts` searches the local FTS index and returns the newest matching messages first.\n- `semantic` embeds the query, searches locally stored message vectors, and returns a clear error if embeddings are disabled or no compatible vectors exist.\n- `hybrid` runs FTS and semantic search, deduplicates by message id, and falls back to FTS when semantic search is unavailable.\n\nFTS uses SQLite FTS5 with the default `unicode61` tokenizer. User query terms are parameterized and quoted before `MATCH`, so tokens like `AND`, `OR`, `NOT`, `NEAR`, and `*` are searched as input terms instead of FTS operators. Punctuation still follows FTS5 tokenization rules.\n\nSemantic and hybrid search require `[search.embeddings]` plus local `message_embeddings` rows for the configured provider, model, and input version. Run `discrawl sync --with-embeddings` to enqueue changed messages, then `discrawl embed` to generate vectors. The input version is currently `message_normalized_v1`, so vectors are tied to normalized message text rather than raw Discord payloads.\n\n### `messages`\n\nLists exact message slices by channel, author, and time range.\n\n```bash\ndiscrawl messages --channel maintainers --days 7 --all\ndiscrawl messages --channel maintainers --hours 6 --all\ndiscrawl messages --channel \"#maintainers\" --since 2026-03-01T00:00:00Z\ndiscrawl messages --channel 1456744319972282449 --author steipete --limit 50\ndiscrawl messages --channel maintainers --last 100 --sync\ndiscrawl messages --dm --channel Molty --last 20\ndiscrawl messages --channel maintainers --days 7 --all --include-empty\ndiscrawl --json messages --channel maintainers --days 3\n```\n\nNotes:\n\n- `--channel` accepts a channel id, exact name, `#name`, or partial name match\n- `--hours` is shorthand for \"since now minus N hours\"\n- `--days` is shorthand for \"since now minus N days\"\n- `--last` returns the newest `N` matching messages, then prints them 
oldest-to-newest\n- `--all` removes the safety limit; default is `200`\n- `--sync` runs a blocking pre-query sync for the matching channel or guild scope before reading the local DB\n- rows with no displayable/searchable content are skipped by default; `--include-empty` opts back in\n- at least one filter is required\n- `--dm` is shorthand for `--guild @me`, so DM searches and message slices do not need raw SQL\n\n### `dms`\n\nLists local wiretap DM conversations or reads one DM thread.\n\n```bash\ndiscrawl dms\ndiscrawl dms --with Molty --last 20\ndiscrawl dms --with 1456464433768300635 --all\ndiscrawl dms --search \"launch checklist\"\ndiscrawl dms --with Molty --search \"invoice\"\n```\n\n`discrawl dms` shows one row per local DM channel with message count, author count, and first/last cached message times. Passing `--with` switches to message output for that DM conversation unless `--list` is also set. `--search` searches only local DM messages. This is a convenience layer over the local-only synthetic guild id `@me`; it skips Git snapshot auto-update because DMs are never imported from the shared mirror, and it still only sees Discord Desktop cache data imported by `wiretap`.\n\n### `mentions`\n\nLists structured user and role mentions.\n\n```bash\ndiscrawl mentions --channel maintainers --days 7\ndiscrawl mentions --target steipete --type user --limit 50\ndiscrawl mentions --target 1456406468898197625\ndiscrawl --json mentions --type role --days 1\n```\n\nNotes:\n\n- `--target` accepts an id, exact name, or partial name match\n- `--type` can be `user` or `role`\n- same guild/time filters as `messages`\n\n### `sql`\n\nRuns read-only SQL against the local database.\n\n```bash\ndiscrawl sql 'select count(*) as messages from messages'\necho 'select guild_id, count(*) from messages group by guild_id' | discrawl sql -\n```\n\n### `members`\n\n```bash\ndiscrawl members list\ndiscrawl members show 123456789012345678\ndiscrawl members show --messages 10 
steipete\ndiscrawl members search \"peter\"\ndiscrawl members search \"github\"\ndiscrawl members search \"https://github.com/steipete\"\n```\n\nNotes:\n\n- `search` matches names plus any offline profile fields present in the archived member payload\n- `show` accepts a user id or query; if it resolves to one member, it also shows recent messages\n- extracted profile fields may include `bio`, `pronouns`, `location`, `website`, `x`, `github`, and discovered URLs\n- if the bot cannot see a field from Discord, `discrawl` cannot invent it; this is strictly archive-based offline data\n\nTypical workflow:\n\n```bash\ndiscrawl sync --full\ndiscrawl members search \"design engineer\"\ndiscrawl members search \"github\"\ndiscrawl members show --messages 25 steipete\ndiscrawl messages --author steipete --days 30 --all\n```\n\nTypical `members show` output:\n\n```text\nguild=1456350064065904867\nuser=37658261826043904\nusername=steipete\ndisplay=Peter Steinberger\njoined=2026-03-08T16:03:14Z\nbot=false\nx=steipete\ngithub=steipete\nwebsite=https://steipete.me\nbio=Builds native apps and tooling.\nurls=https://steipete.me, https://github.com/steipete\nmessage_count=1284\nfirst_message=2026-02-01T09:00:00Z\nlast_message=2026-03-08T15:59:58Z\n```\n\nSearchable member data comes from:\n\n- Discord member/user payload fields archived into `members.raw_json`\n- explicit profile fields when Discord exposes them\n- URLs and social handles inferred from archived profile text\n- current member snapshot data such as names, nick, roles, and join time\n\n### `channels`\n\n```bash\ndiscrawl channels list\ndiscrawl channels show 123456789012345678\n```\n\n### `status`\n\nShows local archive status.\n\n```bash\ndiscrawl status\n```\n\n### Git-backed sharing\n\n`discrawl` can publish the SQLite archive as sharded, compressed NDJSON snapshots in a private Git repo, then auto-import that repo before local read commands.\n\nPublisher:\n\n```bash\ndiscrawl publish --remote 
https://github.com/example/discord-archive.git --push\ndiscrawl publish --readme path/to/discord-backup/README.md --push\n```\n\nSubscriber:\n\n```bash\ndiscrawl subscribe https://github.com/example/discord-archive.git\ndiscrawl search \"launch checklist\"\ndiscrawl messages --channel general --hours 24\n```\n\n`subscribe` is the Git-only setup path. It writes a config with `discord.token_source = \"none\"`, imports the snapshot, and does not require a Discord bot token. `sync` and `tail` remain disabled in this mode because they need live Discord access.\n\nConfigure freshness:\n\n```bash\ndiscrawl subscribe --stale-after 15m https://github.com/example/discord-archive.git\ndiscrawl subscribe --no-auto-update https://github.com/example/discord-archive.git\n```\n\nOnce `share.remote` is configured, read commands auto-fetch and import when the local share import is older than `share.stale_after` (default `15m`). `discrawl update` forces the same pull/import step manually. `discrawl sync` does not auto-import the share unless `--update=auto` or `--update=force` is provided, so routine live refreshes stay fast.\n\nHybrid mode is supported too: keep normal Discord credentials configured and set `share.remote`. `discrawl sync --update=auto` and `discrawl messages --sync` import the Git snapshot first, then use live Discord for latest-message deltas. Use `sync --all-channels` or `sync --full` when you intentionally want a broader live repair/backfill pass.\n\nGit snapshots publish non-DM archive tables by default. 
Embedding queue state stays local to each machine, and Git-only readers can use FTS immediately without an embedding provider.\n\nGenerated vectors can be backed up explicitly:\n\n```bash\ndiscrawl publish --with-embeddings --push\ndiscrawl subscribe --with-embeddings https://github.com/example/discord-archive.git\ndiscrawl update --with-embeddings\n```\n\n`--with-embeddings` exports stored `message_embeddings` rows for the configured `[search.embeddings]` provider/model plus the current input version. The snapshot stores those vectors under `embeddings/\u003cprovider\u003e/\u003cmodel\u003e/\u003cinput_version\u003e/...` and records that identity in `manifest.json`. Only vectors for non-DM messages are exported. Import only restores matching embedding manifests, so an Ollama/nomic subscriber does not accidentally import OpenAI/text-embedding vectors into semantic search. `embedding_jobs` is never exported; subscribers that want fresh local vectors can run `discrawl embed --rebuild` to create their own queue and vectors. Publishing without `--with-embeddings` omits embedding manifests instead of carrying forward an older bundle.\n\nThe Docker smoke test installs `discrawl` in a clean Go container, subscribes to a Git snapshot repo, then checks `search`, `messages`, `sql`, and `report`:\n\n```bash\nDISCRAWL_DOCKER_TEST=1 go test ./internal/cli -run TestDockerGitSourceSmoke -count=1\n```\n\n### `report`\n\nGenerates the Markdown activity block used by the shared backup repo README.\n\n```bash\ndiscrawl report\ndiscrawl report --readme path/to/discord-backup/README.md\n```\n\nEvery scheduled snapshot publish updates deterministic README stats: latest update time, latest archived message, archive totals, and day/week/month activity.\n\nThe backup workflows restore and save `.discrawl-ci/discrawl.db` with `actions/cache`. On a warm runner cache, scheduled publishers skip the pre-sync snapshot import and go straight to the live latest-message delta before publishing. 
Cache misses still import the latest published snapshot first so `--latest-only` has channel cursors to resume from.\n\n### `digest`\n\nSummarizes per-channel activity for a lookback window.\n\n```bash\ndiscrawl digest\ndiscrawl digest --since 30d\ndiscrawl digest --guild 123456789012345678\ndiscrawl digest --channel general\ndiscrawl --json digest --since 7d --top-n 5\n```\n\nNotes:\n\n- `--since` accepts Go durations (`72h`, `30m`) and `Nd` shorthand (`7d`, `30d`)\n- `--guild` scopes to one guild; when omitted, `default_guild_id` is used if configured\n- `--channel` accepts a channel id or exact channel name\n- `--top-n` controls how many top posters and mention targets are shown per channel\n\n### `analytics`\n\nGroups activity-style queries under one namespace.\n\n```bash\ndiscrawl analytics\ndiscrawl analytics quiet --since 30d\ndiscrawl analytics quiet --guild 123456789012345678\ndiscrawl analytics trends --weeks 8\ndiscrawl analytics trends --weeks 12 --channel general\ndiscrawl --json analytics quiet --since 60d\ndiscrawl --json analytics trends --weeks 4\n```\n\nNotes:\n\n- `analytics quiet` shows top-level text/announcement channels with no messages in the lookback window, including never-active channels\n- `analytics quiet --guild` scopes the report to one guild; when omitted, `default_guild_id` is used if configured\n- `analytics trends` shows Monday-start UTC weekly message counts per message-capable channel\n- `analytics trends --channel` accepts a channel id or exact channel name\n\n### `doctor`\n\nChecks config, auth, DB, and FTS wiring.\n\n```bash\ndiscrawl doctor\n```\n\n## Configuration\n\n`init` writes a complete config, so most users should not hand-edit anything initially.\n\nTypical config shape:\n\n```toml\nversion = 1\ndefault_guild_id = \"\"\nguild_ids = []\ndb_path = \"~/.discrawl/discrawl.db\"\ncache_dir = \"~/.discrawl/cache\"\nlog_dir = \"~/.discrawl/logs\"\n\n[discord]\ntoken_source = \"env\" # use \"none\" for Git-only read 
access\ntoken_env = \"DISCORD_BOT_TOKEN\"\ntoken_keyring_service = \"discrawl\"\ntoken_keyring_account = \"discord_bot_token\"\n\n[sync]\nsource = \"both\" # use \"discord\" for bot-only sync or \"wiretap\" for desktop-cache-only import\nconcurrency = 16\nrepair_every = \"6h\"\nfull_history = true\nattachment_text = true\n\n[desktop]\npath = \"~/.config/discord\" # macOS default: \"~/Library/Application Support/discord\"\nmax_file_bytes = 67108864\nfull_cache = false\n\n[search]\ndefault_mode = \"fts\"\n\n[search.embeddings]\nenabled = false\nprovider = \"openai\"\nmodel = \"text-embedding-3-small\"\napi_key_env = \"OPENAI_API_KEY\"\nbatch_size = 64\n\n[share]\nremote = \"\"\nrepo_path = \"~/.discrawl/share\"\nbranch = \"main\"\nauto_update = true\nstale_after = \"15m\"\n```\n\nThe `sync.concurrency` value above is an example. `init` writes an auto-sized default based on the host: `min(32, max(8, GOMAXPROCS*2))`.\n\nConfig override rules:\n\n- `--config` beats everything\n- `DISCRAWL_CONFIG` overrides the default config path\n- `discord.token_source = \"none\"` disables live Discord access for Git-only readers\n- `discord.token_source = \"keyring\"` skips env lookup and reads only the configured OS keyring item\n- `DISCRAWL_NO_AUTO_UPDATE=1` disables Git snapshot auto-update for read commands in one process, useful for report jobs that already imported a fresh backup\n\n## Embeddings\n\nEmbeddings are optional. FTS is the default search path and the primary verification target.\n\nIf enabled, embeddings are intended to enrich recall in background batches, not block the hot sync path.\n\n```bash\nexport OPENAI_API_KEY=\"...\"\ndiscrawl init --with-embeddings\ndiscrawl sync --with-embeddings\ndiscrawl embed --limit 1000\ndiscrawl search --mode semantic \"launch checklist\"\ndiscrawl search --mode hybrid \"launch checklist\"\n```\n\nEmbedding creation has two phases:\n\n1. `sync --with-embeddings` queues changed messages by writing `embedding_jobs` rows. 
New messages, changed normalized text, and messages that do not already have a job are queued. This phase does not call the embedding provider.\n2. `discrawl embed` drains pending jobs in bounded batches, calls the configured provider, and writes vectors to `message_embeddings` with provider, model, input version, dimensions, and binary vector data.\n\nDuring drain, `discrawl` claims jobs with a short lock so overlapping runs do not process the same batch. Rate limits requeue the batch and stop that drain run cleanly. Provider or validation failures retry up to three attempts before the job is marked failed. Messages with no normalized text are marked done and any stale vector for that message is removed.\n\nThe provider/model/input-version identity is stored on each job and vector. If you change provider or model, pending jobs are retargeted to the new identity and prior attempts are reset. Existing vectors for another identity remain in SQLite, but semantic search only reads vectors compatible with the current config.\n\nUse `--rebuild` when changing provider, model, or input settings and you want to regenerate vectors for the existing archive:\n\n```bash\ndiscrawl embed --rebuild --limit 1000\n```\n\nLocal providers can keep message and query embedding on the same machine:\n\n```toml\n[search.embeddings]\nenabled = true\nprovider = \"ollama\"\nmodel = \"nomic-embed-text\"\n```\n\nWith remote providers, message text is sent during `discrawl embed`, and search query text is sent when using `--mode semantic` or `--mode hybrid`. 
Stored message text is not sent during local vector scoring.\n\n## Data Stored Locally\n\n- guild metadata\n- channels and threads in one table\n- current member snapshot\n- canonical message rows\n- append-only message event records\n- FTS index rows\n- optional local embedding queue metadata and vectors\n\nMessages imported from Discord Desktop use the same message, attachment, mention, and FTS paths as bot-synced messages.\n\nProven DMs use `@me` as their guild id. Unclassifiable desktop-cache payloads are skipped instead of being stored as unknown synthetic data.\n\nSQLite schema migrations are versioned with `PRAGMA user_version`. Startup now fails fast when a local DB schema is newer than the supported binary.\n\nAttachment binaries are not stored in SQLite.\n\nSet `sync.attachment_text = false` if you want to keep attachment metadata and filenames but disable attachment body fetches for text indexing.\n\n## Security\n\n- do not commit bot tokens or API keys\n- default config lives in your home directory, not inside the repo\n- prefer env vars or the OS keyring for bot tokens\n- CI runs secret scanning with `gitleaks`\n- `doctor` reports token source, not token contents\n\n## Development\n\nLocal gate:\n\n```bash\ngo run github.com/golangci/golangci-lint/v2/cmd/golangci-lint@v2.11.1 run\ngo test ./... 
-coverprofile=/tmp/discrawl.cover\ngo tool cover -func=/tmp/discrawl.cover | tail -n 1\ngo build ./cmd/discrawl\ngo run ./cmd/discrawl help | grep tui\n```\n\nTarget coverage is `\u003e= 85%`.\n\nCI runs:\n\n- `golangci-lint`\n- `go test` with coverage threshold enforcement\n- `go build ./cmd/discrawl`\n- `gitleaks` against git history and the working tree\n\n## Notes\n\n- the schema is multi-guild ready even when the common UX stays single-guild simple\n- threads are stored as channels because that matches the Discord model\n- archived threads are part of the sync surface\n- live sync is resumable; large guilds still take time because Discord rate limits history backfill\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenclaw%2Fdiscrawl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenclaw%2Fdiscrawl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenclaw%2Fdiscrawl/lists"}