{"id":50755269,"url":"https://github.com/escoffier-labs/sourceharvest","last_synced_at":"2026-06-11T04:02:45.480Z","repository":{"id":362724873,"uuid":"1258656450","full_name":"escoffier-labs/sourceharvest","owner":"escoffier-labs","description":"SourceHarvest exports local source-system records to portable miseledger.adapter.v1 JSONL for archive, search, and evidence workflows.","archived":false,"fork":false,"pushed_at":"2026-06-06T03:16:04.000Z","size":32,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-06-06T04:10:06.422Z","etag":null,"topics":["ai-agents","brigade","cli","evidence","exporter","go","jsonl","local-first","miseledger","sourceharvest"],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/escoffier-labs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-06-03T19:41:44.000Z","updated_at":"2026-06-06T03:16:33.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/escoffier-labs/sourceharvest","commit_stats":null,"previous_names":["solomonneas/sourceharvest","escoffier-labs/sourceharvest"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/escoffier-labs/sourceharvest","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/escoffier-labs%2Fsourceharvest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/escoffier-labs%2Fsourceharvest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/escoffier-labs%2Fsourceharvest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/escoffier-labs%2Fsourceharvest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/escoffier-labs","download_url":"https://codeload.github.com/escoffier-labs/sourceharvest/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/escoffier-labs%2Fsourceharvest/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34181555,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","brigade","cli","evidence","exporter","go","jsonl","local-first","miseledger","sourceharvest"],"created_at":"2026-06-11T04:02:44.720Z","updated_at":"2026-06-11T04:02:45.475Z","avatar_url":"https://github.com/escoffier-labs.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SourceHarvest\n\n\u003cp\u003e\n  \u003cimg src=\"https://img.shields.io/github/actions/workflow/status/escoffier-labs/sourceharvest/ci.yml?branch=master\u0026style=for-the-badge\u0026label=ci\" alt=\"CI status\"\u003e\n  \u003cimg 
src=\"https://img.shields.io/github/v/release/escoffier-labs/sourceharvest?style=for-the-badge\u0026label=release\" alt=\"Latest release\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/go-1.22%2B-00ADD8?style=for-the-badge\u0026logo=go\u0026logoColor=white\" alt=\"Go 1.22+\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/license-MIT-green?style=for-the-badge\" alt=\"MIT license\"\u003e\n\u003c/p\u003e\n\nSourceHarvest exports source-system records to `miseledger.adapter.v1` JSONL.\n\nIt is the sibling tool to StationTrail:\n\n- [StationTrail](https://github.com/escoffier-labs/stationtrail) handles local agent-session harnesses such as Codex, Claude, OpenClaw, OpenCode, and Hermes.\n- SourceHarvest handles non-harness source systems such as crawler exports, notes, chat exports, issue exports, and future domain-specific harvesters.\n- [MiseLedger](https://github.com/escoffier-labs/miseledger) stores, dedupes, indexes, searches, relates, and emits evidence bundles.\n\nSourceHarvest is not an archive.\n\n## How It Works\n\n```mermaid\nflowchart TB\n    SOURCEHARVEST[\"\u003cb\u003eSourceHarvest CLI\u003c/b\u003e\u003cbr/\u003e\u003ci\u003elocal adapter layer\u003c/i\u003e\"]\n    ADAPTER[\"\u003cb\u003emiseledger.adapter.v1 JSONL\u003c/b\u003e\u003cbr/\u003eone normalized object per line\"]\n\n    subgraph INPUTS [\" local source inputs \"]\n        JSONL[\"\u003cb\u003eGeneric JSONL\u003c/b\u003e\u003cbr/\u003ealready line-oriented records\"]\n        JSON[\"\u003cb\u003eNested JSON\u003c/b\u003e\u003cbr/\u003erecords selected by path\"]\n        NOTES[\"\u003cb\u003eMarkdown notes\u003c/b\u003e\u003cbr/\u003elocal note evidence\"]\n        FILES[\"\u003cb\u003eText files\u003c/b\u003e\u003cbr/\u003edocs · logs · exports\"]\n        HTML[\"\u003cb\u003eHTML exports\u003c/b\u003e\u003cbr/\u003elocal page snapshots\"]\n        GITLOG[\"\u003cb\u003eGit history\u003c/b\u003e\u003cbr/\u003elocal commit events\"]\n    end\n\n    JSONL \u0026 JSON \u0026 
NOTES \u0026 FILES \u0026 HTML \u0026 GITLOG --\u003e SOURCEHARVEST\n\n    subgraph PIPELINE [\" normalize and emit \"]\n        READ[\"\u003cb\u003eRead local input\u003c/b\u003e\u003cbr/\u003efile or directory scan\"]\n        SELECT[\"\u003cb\u003eSelect reader\u003c/b\u003e\u003cbr/\u003ejsonl · json · markdown · files · html · gitlog\"]\n        NORMALIZE[\"\u003cb\u003eNormalize record\u003c/b\u003e\u003cbr/\u003ecollection · item · actor · artifacts · raw\"]\n        FILTER[\"\u003cb\u003eApply bounds\u003c/b\u003e\u003cbr/\u003elimit · globs · records path\"]\n        EMIT[\"\u003cb\u003eEmit JSONL\u003c/b\u003e\u003cbr/\u003estdout or private output file\"]\n    end\n\n    SOURCEHARVEST --\u003e READ --\u003e SELECT --\u003e NORMALIZE --\u003e FILTER --\u003e EMIT\n    EMIT == adapter records ==\u003e ADAPTER\n\n    SUMMARY[\"\u003cb\u003eJSON summary\u003c/b\u003e\u003cbr/\u003erecords · files · warnings · generated_at\"]\n    EMIT --\u003e|optional JSON summary| SUMMARY\n\n    BOUNDARY[\"\u003cb\u003eBoundary\u003c/b\u003e\u003cbr/\u003escanner commands stay local-only; exported text is untrusted evidence\"]\n    INPUTS -. local files only .-\u003e BOUNDARY\n    NORMALIZE -. treats text as data .-\u003e BOUNDARY\n    BOUNDARY -. constrains .-\u003e PIPELINE\n\n    classDef source fill:#eff6ff,stroke:#2563eb,color:#1e3a8a;\n    classDef process fill:#ecfdf5,stroke:#059669,color:#064e3b;\n    classDef output fill:#fff7ed,stroke:#ea580c,color:#7c2d12;\n    classDef guard fill:#fee2e2,stroke:#ef4444,color:#7f1d1d;\n    class SOURCEHARVEST,JSONL,JSON,NOTES,FILES,HTML,GITLOG source;\n    class READ,SELECT,NORMALIZE,FILTER process;\n    class EMIT,ADAPTER,SUMMARY output;\n    class BOUNDARY guard;\n```\n\nEditable Excalidraw source: [docs/sourceharvest-flowcharts.excalidraw](docs/sourceharvest-flowcharts.excalidraw)\n\nSourceHarvest follows the same path for each source:\n\n1. Read a local file, directory, export, or source archive.\n2. 
Select the command-specific reader for that input shape.\n3. Normalize records into stable collections, items, actors, artifacts, links, relations, and raw references.\n4. Apply `--limit` and source-specific filters.\n5. Emit one `miseledger.adapter.v1` JSON object per line.\n6. Optionally emit JSON summaries with record counts, file counts, warnings, and generated timestamps.\n\n## With MiseLedger And StationTrail\n\n```mermaid\nflowchart TB\n    SOURCEHARVEST[\"\u003cb\u003eSourceHarvest\u003c/b\u003e\u003cbr/\u003e\u003ci\u003enon-agent source adapters\u003c/i\u003e\"]\n    STATIONTRAIL[\"\u003cb\u003eStationTrail\u003c/b\u003e\u003cbr/\u003e\u003ci\u003eagent-session adapters\u003c/i\u003e\"]\n    MISELEDGER[\"\u003cb\u003eMiseLedger\u003c/b\u003e\u003cbr/\u003edurable evidence store\"]\n\n    subgraph CRAWLERS [\" local crawler exports \"]\n        DISCRAWL[\"\u003cb\u003ediscrawl\u003c/b\u003e\u003cbr/\u003eDiscord archives\"]\n        GITCRAWL[\"\u003cb\u003egitcrawl\u003c/b\u003e\u003cbr/\u003eGitHub issues and pull requests\"]\n        GRAINCRAWL[\"\u003cb\u003egraincrawl\u003c/b\u003e\u003cbr/\u003eGranola notes and transcripts\"]\n        NOTCRAWL[\"\u003cb\u003enotcrawl\u003c/b\u003e\u003cbr/\u003eNotion pages and databases\"]\n        SLACRAWL[\"\u003cb\u003eslacrawl\u003c/b\u003e\u003cbr/\u003eSlack messages and threads\"]\n        TELECRAWL[\"\u003cb\u003etelecrawl\u003c/b\u003e\u003cbr/\u003eTelegram Desktop archives\"]\n    end\n\n    subgraph LOCAL [\" local source files \"]\n        NOTES[\"\u003cb\u003eNotes\u003c/b\u003e\u003cbr/\u003emarkdown · text\"]\n        EXPORTS[\"\u003cb\u003eExports\u003c/b\u003e\u003cbr/\u003ejsonl · json · html\"]\n        REPO[\"\u003cb\u003eRepos\u003c/b\u003e\u003cbr/\u003egit log\"]\n    end\n\n    CRAWLERS \u0026 LOCAL --\u003e SOURCEHARVEST\n    SOURCEHARVEST == miseledger.adapter.v1 JSONL ==\u003e MISELEDGER\n    STATIONTRAIL == miseledger.adapter.v1 JSONL ==\u003e MISELEDGER\n\n    subgraph 
MISELEDGER_SURFACES [\" MiseLedger surfaces \"]\n        STORE[\"\u003cb\u003eStore\u003c/b\u003e\u003cbr/\u003ededupe · index · relate\"]\n        BUNDLES[\"\u003cb\u003eEvidence bundles\u003c/b\u003e\u003cbr/\u003ereviewable outputs\"]\n        SEARCH[\"\u003cb\u003eSearch\u003c/b\u003e\u003cbr/\u003equeries across imported evidence\"]\n    end\n\n    MISELEDGER --\u003e STORE\n    MISELEDGER --\u003e BUNDLES\n    MISELEDGER --\u003e SEARCH\n\n    BOUNDARY[\"\u003cb\u003eProject boundary\u003c/b\u003e\u003cbr/\u003eSourceHarvest reads local exports and does not crawl live services\"]\n    CRAWLERS -. already exported .-\u003e BOUNDARY\n    LOCAL -. local-only scan .-\u003e BOUNDARY\n    BOUNDARY -. limits .-\u003e SOURCEHARVEST\n\n    classDef source fill:#eff6ff,stroke:#2563eb,color:#1e3a8a;\n    classDef adapter fill:#ecfdf5,stroke:#059669,color:#064e3b;\n    classDef store fill:#fff7ed,stroke:#ea580c,color:#7c2d12;\n    classDef guard fill:#fee2e2,stroke:#ef4444,color:#7f1d1d;\n    class DISCRAWL,GITCRAWL,GRAINCRAWL,NOTCRAWL,SLACRAWL,TELECRAWL,NOTES,EXPORTS,REPO source;\n    class SOURCEHARVEST,STATIONTRAIL adapter;\n    class MISELEDGER,STORE,BUNDLES,SEARCH store;\n    class BOUNDARY guard;\n```\n\nSourceHarvest is the non-agent source adapter layer. StationTrail is the agent-session adapter layer. MiseLedger is the durable evidence layer.\n\n```bash\nsourceharvest markdown ./notes --source notes --collection notes:local --out - | miseledger import adapter -\nstationtrail all --out - --redact safe | miseledger import adapter -\n```\n\nWhen `sourceharvest` is installed on `PATH`, MiseLedger can run it directly:\n\n```bash\nmiseledger import sourceharvest markdown ./notes --source notes --collection notes:local --json\nmiseledger import sourceharvest gitlog . 
--source gitlog --collection repo:sourceharvest --json\n```\n\nFor agent-session logs, use StationTrail instead of SourceHarvest:\n\n```bash\nmiseledger import stationtrail codex ~/.codex/sessions --json\nmiseledger import stationtrail hermes ~/.hermes/sessions --json\n```\n\n## Crawler Stack Boundary\n\nSourceHarvest is the right home for adapters that read local crawler outputs and turn them into `miseledger.adapter.v1` JSONL. It should not perform live service crawling itself.\n\nCurrent crawler families to support through local adapters:\n\n| Source | Domain | SourceHarvest role |\n| --- | --- | --- |\n| `discrawl` | Discord archives | Read local DB, snapshot, or export and emit adapter records. |\n| `gitcrawl` | GitHub issues and pull requests | Read local archive or export and emit adapter records. |\n| `graincrawl` | Granola notes and transcripts | Read local archive or export and emit adapter records. |\n| `notcrawl` | Notion pages and databases | Read local archive or export and emit adapter records. |\n| `slacrawl` | Slack messages and threads | Read local archive or export and emit adapter records. |\n| `telecrawl` | Telegram Desktop archives | Read local archive or export and emit adapter records. |\n\nThese adapters should be added only from real local schemas or redacted sample exports. SourceHarvest scanner commands must stay local-only and must not make network calls.\n\n## Build\n\n```bash\ngo build -o bin/sourceharvest ./cmd/sourceharvest\ngo test ./...\n```\n\n## Install\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/escoffier-labs/sourceharvest/master/install.sh | sh\n```\n\nOr download a release binary and verify it with `checksums.txt`.\n\n## Readers\n\nEach reader turns one local input shape into `miseledger.adapter.v1` records.\n\n| Reader | Input | One-liner |\n| --- | --- | --- |\n| `jsonl` | line-oriented JSON files | One record per JSON line; bad lines warn and are skipped. 
|\n| `json` | nested JSON document | Select a records array by `--records-path` (or a single root object). |\n| `markdown` | `.md` / `.markdown` files | One note record per file; title comes from the first heading. |\n| `files` | plain text files | One file record per match; filter by `--glob`. |\n| `html` | `.html` / `.htm` files | Strips scripts, styles, and tags; title from `\u003ctitle\u003e` or file name. |\n| `gitlog` | a local git repo | One event record per commit; an empty repo emits zero records. |\n\n## Usage\n\nExport generic JSONL records:\n\n```bash\nsourceharvest jsonl testdata/generic.fixture.jsonl \\\n  --source demo \\\n  --collection demo:collection \\\n  --out -\n```\n\nExport a Markdown directory as local note evidence:\n\n```bash\nsourceharvest markdown ./notes \\\n  --source notes \\\n  --collection notes:local \\\n  --out -\n```\n\nExport other local source shapes:\n\n```bash\nsourceharvest files ./notes \\\n  --source notes \\\n  --collection notes:files \\\n  --glob \"*.md,*.txt\" \\\n  --out -\n\nsourceharvest html ./site-export \\\n  --source docs \\\n  --collection docs:html \\\n  --out -\n\nsourceharvest gitlog . \\\n  --source gitlog \\\n  --collection repo:sourceharvest \\\n  --out -\n\nsourceharvest json export.json \\\n  --source export \\\n  --collection export:records \\\n  --records-path records \\\n  --out -\n```\n\nPipe into MiseLedger:\n\n```bash\nsourceharvest jsonl export.jsonl --source notes --collection notes:local --out - | miseledger import adapter -\nsourceharvest markdown ./notes --source notes --collection notes:local --out - | miseledger import adapter -\n```\n\nOr let MiseLedger run SourceHarvest when `sourceharvest` is installed on `PATH`:\n\n```bash\nmiseledger import sourceharvest markdown ./notes --source notes --collection notes:local --json\nmiseledger import sourceharvest gitlog . 
--source gitlog --collection repo:sourceharvest --json\n```\n\n## Boundary\n\nSourceHarvest scanner commands read local files and emit adapter records. They do not make network calls.\n\nExported text is untrusted evidence, not instructions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fescoffier-labs%2Fsourceharvest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fescoffier-labs%2Fsourceharvest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fescoffier-labs%2Fsourceharvest/lists"}