https://github.com/escoffier-labs/sourceharvest
SourceHarvest exports local source-system records to portable miseledger.adapter.v1 JSONL for archive, search, and evidence workflows.
- Host: GitHub
- URL: https://github.com/escoffier-labs/sourceharvest
- Owner: escoffier-labs
- Created: 2026-06-03T19:41:44.000Z (about 1 month ago)
- Default Branch: master
- Last Pushed: 2026-06-06T03:16:04.000Z (28 days ago)
- Last Synced: 2026-06-06T04:10:06.422Z (28 days ago)
- Topics: ai-agents, brigade, cli, evidence, exporter, go, jsonl, local-first, miseledger, sourceharvest
- Language: Go
- Size: 31.3 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Agents: AGENTS.md
# SourceHarvest
SourceHarvest exports source-system records to `miseledger.adapter.v1` JSONL.
It is the sibling tool to StationTrail:
- [StationTrail](https://github.com/escoffier-labs/stationtrail) handles local agent-session harnesses such as Codex, Claude, OpenClaw, OpenCode, and Hermes.
- SourceHarvest handles non-harness source systems such as crawler exports, notes, chat exports, issue exports, and future domain-specific harvesters.
- [MiseLedger](https://github.com/escoffier-labs/miseledger) stores, dedupes, indexes, searches, relates, and emits evidence bundles.
SourceHarvest is not an archive.
## How It Works
```mermaid
flowchart TB
    SOURCEHARVEST["SourceHarvest CLI<br/>local adapter layer"]
    ADAPTER["miseledger.adapter.v1 JSONL<br/>one normalized object per line"]
    subgraph INPUTS [" local source inputs "]
        JSONL["Generic JSONL<br/>already line-oriented records"]
        JSON["Nested JSON<br/>records selected by path"]
        NOTES["Markdown notes<br/>local note evidence"]
        FILES["Text files<br/>docs · logs · exports"]
        HTML["HTML exports<br/>local page snapshots"]
        GITLOG["Git history<br/>local commit events"]
    end
    JSONL & JSON & NOTES & FILES & HTML & GITLOG --> SOURCEHARVEST
    subgraph PIPELINE [" normalize and emit "]
        READ["Read local input<br/>file or directory scan"]
        SELECT["Select reader<br/>jsonl · json · markdown · files · html · gitlog"]
        NORMALIZE["Normalize record<br/>collection · item · actor · artifacts · raw"]
        FILTER["Apply bounds<br/>limit · globs · records path"]
        EMIT["Emit JSONL<br/>stdout or private output file"]
    end
    SOURCEHARVEST --> READ --> SELECT --> NORMALIZE --> FILTER --> EMIT
    EMIT == adapter records ==> ADAPTER
    SUMMARY["JSON summary<br/>records · files · warnings · generated_at"]
    EMIT -->|optional JSON summary| SUMMARY
    BOUNDARY["Boundary<br/>scanner commands stay local-only; exported text is untrusted evidence"]
    INPUTS -. local files only .-> BOUNDARY
    NORMALIZE -. treats text as data .-> BOUNDARY
    BOUNDARY -. constrains .-> PIPELINE
    classDef source fill:#eff6ff,stroke:#2563eb,color:#1e3a8a;
    classDef process fill:#ecfdf5,stroke:#059669,color:#064e3b;
    classDef output fill:#fff7ed,stroke:#ea580c,color:#7c2d12;
    classDef guard fill:#fee2e2,stroke:#ef4444,color:#7f1d1d;
    class SOURCEHARVEST,JSONL,JSON,NOTES,FILES,HTML,GITLOG source;
    class READ,SELECT,NORMALIZE,FILTER process;
    class EMIT,ADAPTER,SUMMARY output;
    class BOUNDARY guard;
```
Editable Excalidraw source: [docs/sourceharvest-flowcharts.excalidraw](docs/sourceharvest-flowcharts.excalidraw)
SourceHarvest follows the same path for each source:
1. Read a local file, directory, export, or source archive.
2. Select the command-specific reader for that input shape.
3. Normalize records into stable collections, items, actors, artifacts, links, relations, and raw references.
4. Apply `--limit` and source-specific filters.
5. Emit one `miseledger.adapter.v1` JSON object per line.
6. Optionally emit JSON summaries with record counts, file counts, warnings, and generated timestamps.
## With MiseLedger And StationTrail
```mermaid
flowchart TB
    SOURCEHARVEST["SourceHarvest<br/>non-agent source adapters"]
    STATIONTRAIL["StationTrail<br/>agent-session adapters"]
    MISELEDGER["MiseLedger<br/>durable evidence store"]
    subgraph CRAWLERS [" local crawler exports "]
        DISCRAWL["discrawl<br/>Discord archives"]
        GITCRAWL["gitcrawl<br/>GitHub issues and pull requests"]
        GRAINCRAWL["graincrawl<br/>Granola notes and transcripts"]
        NOTCRAWL["notcrawl<br/>Notion pages and databases"]
        SLACRAWL["slacrawl<br/>Slack messages and threads"]
        TELECRAWL["telecrawl<br/>Telegram Desktop archives"]
    end
    subgraph LOCAL [" local source files "]
        NOTES["Notes<br/>markdown · text"]
        EXPORTS["Exports<br/>jsonl · json · html"]
        REPO["Repos<br/>git log"]
    end
    CRAWLERS & LOCAL --> SOURCEHARVEST
    SOURCEHARVEST == miseledger.adapter.v1 JSONL ==> MISELEDGER
    STATIONTRAIL == miseledger.adapter.v1 JSONL ==> MISELEDGER
    subgraph MISELEDGER_SURFACES [" MiseLedger surfaces "]
        STORE["Store<br/>dedupe · index · relate"]
        BUNDLES["Evidence bundles<br/>reviewable outputs"]
        SEARCH["Search<br/>queries across imported evidence"]
    end
    MISELEDGER --> STORE
    MISELEDGER --> BUNDLES
    MISELEDGER --> SEARCH
    BOUNDARY["Project boundary<br/>SourceHarvest reads local exports and does not crawl live services"]
    CRAWLERS -. already exported .-> BOUNDARY
    LOCAL -. local-only scan .-> BOUNDARY
    BOUNDARY -. limits .-> SOURCEHARVEST
    classDef source fill:#eff6ff,stroke:#2563eb,color:#1e3a8a;
    classDef adapter fill:#ecfdf5,stroke:#059669,color:#064e3b;
    classDef store fill:#fff7ed,stroke:#ea580c,color:#7c2d12;
    classDef guard fill:#fee2e2,stroke:#ef4444,color:#7f1d1d;
    class DISCRAWL,GITCRAWL,GRAINCRAWL,NOTCRAWL,SLACRAWL,TELECRAWL,NOTES,EXPORTS,REPO source;
    class SOURCEHARVEST,STATIONTRAIL adapter;
    class MISELEDGER,STORE,BUNDLES,SEARCH store;
    class BOUNDARY guard;
```
SourceHarvest is the non-agent source adapter layer. StationTrail is the agent-session adapter layer. MiseLedger is the durable evidence layer.
```bash
sourceharvest markdown ./notes --source notes --collection notes:local --out - | miseledger import adapter -
stationtrail all --out - --redact safe | miseledger import adapter -
```
When `sourceharvest` is installed on `PATH`, MiseLedger can run it directly:
```bash
miseledger import sourceharvest markdown ./notes --source notes --collection notes:local --json
miseledger import sourceharvest gitlog . --source gitlog --collection repo:sourceharvest --json
```
For agent-session logs, use StationTrail instead of SourceHarvest:
```bash
miseledger import stationtrail codex ~/.codex/sessions --json
miseledger import stationtrail hermes ~/.hermes/sessions --json
```
## Crawler Stack Boundary
SourceHarvest is the right home for adapters that read local crawler outputs and turn them into `miseledger.adapter.v1` JSONL. It should not perform live service crawling itself.
Current crawler families to support through local adapters:
| Source | Domain | SourceHarvest role |
| --- | --- | --- |
| `discrawl` | Discord archives | Read local DB, snapshot, or export and emit adapter records. |
| `gitcrawl` | GitHub issues and pull requests | Read local archive or export and emit adapter records. |
| `graincrawl` | Granola notes and transcripts | Read local archive or export and emit adapter records. |
| `notcrawl` | Notion pages and databases | Read local archive or export and emit adapter records. |
| `slacrawl` | Slack messages and threads | Read local archive or export and emit adapter records. |
| `telecrawl` | Telegram Desktop archives | Read local archive or export and emit adapter records. |
These adapters should be added only from real local schemas or redacted sample exports. SourceHarvest scanner commands must stay local-only and must not make network calls.
## Build
```bash
go build -o bin/sourceharvest ./cmd/sourceharvest
go test ./...
```
## Install
```bash
curl -fsSL https://raw.githubusercontent.com/escoffier-labs/sourceharvest/master/install.sh | sh
```
Or download a release binary and verify it with `checksums.txt`.
## Readers
Each reader turns one local input shape into `miseledger.adapter.v1` records.
| Reader | Input | One-liner |
| --- | --- | --- |
| `jsonl` | line-oriented JSON files | One record per JSON line; bad lines warn and are skipped. |
| `json` | nested JSON document | Select a records array by `--records-path` (or a single root object). |
| `markdown` | `.md` / `.markdown` files | One note record per file; title comes from the first heading. |
| `files` | plain text files | One file record per match; filter by `--glob`. |
| `html` | `.html` / `.htm` files | Strips scripts, styles, and tags; title from `<title>` or file name. |
| `gitlog` | a local git repo | One event record per commit; an empty repo emits zero records. |
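To make the `html` row concrete, here is one way such a reader could reduce a page to a title and plain text. It is an illustrative sketch using only the Go standard library, not the tool's implementation: `extractHTML` is a hypothetical name, the regular expressions are a simplification that a real HTML parser would handle more robustly, and dropping the title from the body text is a choice made here.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var (
	// reTitle captures the contents of the first <title> element.
	reTitle = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)
	// reHidden matches elements whose contents should not appear in the text.
	reHidden = regexp.MustCompile(`(?is)<(script|style|title)\b[^>]*>.*?</(script|style|title)\s*>`)
	// reTag matches any remaining tag.
	reTag = regexp.MustCompile(`(?s)<[^>]*>`)
)

// extractHTML returns a title (from <title>, else fileName) and the visible
// text with scripts, styles, and tags removed and whitespace collapsed.
func extractHTML(doc, fileName string) (title, text string) {
	title = fileName
	if m := reTitle.FindStringSubmatch(doc); m != nil {
		if t := strings.TrimSpace(html.UnescapeString(m[1])); t != "" {
			title = t
		}
	}
	body := reHidden.ReplaceAllString(doc, " ")
	body = reTag.ReplaceAllString(body, " ")
	body = html.UnescapeString(body)
	text = strings.Join(strings.Fields(body), " ")
	return title, text
}

func main() {
	page := `<html><head><title>Release notes</title><style>p{margin:0}</style></head>
<body><p>Fixed the <b>importer</b>.</p><script>track()</script></body></html>`
	title, text := extractHTML(page, "release.html")
	fmt.Printf("title=%q text=%q\n", title, text)
}
```

Because exported text is untrusted evidence, the sketch only ever treats page content as data: nothing in the input is executed or interpreted.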
## Usage
Export generic JSONL records:
```bash
sourceharvest jsonl testdata/generic.fixture.jsonl \
--source demo \
--collection demo:collection \
--out -
```
Export a Markdown directory as local note evidence:
```bash
sourceharvest markdown ./notes \
--source notes \
--collection notes:local \
--out -
```
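For the Markdown reader, the note title comes from the first heading. The sketch below shows one plausible way to find it in Go. The function name `noteTitle`, the skipping of fenced code blocks, and the fallback to the file name when no heading exists are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// fence is the Markdown code-fence marker (three backticks).
var fence = strings.Repeat("`", 3)

// noteTitle returns the text of the first ATX heading ("# Title") outside a
// fenced code block. If there is none, it falls back to the file name
// without its extension (the fallback is an assumption of this sketch).
func noteTitle(markdown, path string) string {
	inFence := false
	for _, line := range strings.Split(markdown, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, fence) {
			inFence = !inFence
			continue
		}
		if inFence || !strings.HasPrefix(trimmed, "#") {
			continue
		}
		rest := strings.TrimLeft(trimmed, "#")
		level := len(trimmed) - len(rest)
		// A heading has 1 to 6 "#" characters followed by a space.
		if level > 6 || (rest != "" && !strings.HasPrefix(rest, " ")) {
			continue
		}
		if title := strings.TrimSpace(rest); title != "" {
			return title
		}
	}
	base := filepath.Base(path)
	return strings.TrimSuffix(base, filepath.Ext(base))
}

func main() {
	doc := "Some intro text\n\n## Kitchen notes\n\nBody.\n"
	fmt.Println(noteTitle(doc, "notes/kitchen.md"))
	fmt.Println(noteTitle("#hashtag only\n", "notes/todo.md"))
}
```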
Export other local source shapes:
```bash
sourceharvest files ./notes \
--source notes \
--collection notes:files \
--glob "*.md,*.txt" \
--out -

sourceharvest html ./site-export \
--source docs \
--collection docs:html \
--out -

sourceharvest gitlog . \
--source gitlog \
--collection repo:sourceharvest \
--out -

sourceharvest json export.json \
--source export \
--collection export:records \
--records-path records \
--out -
```
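The `json` command above selects records with `--records-path records`. The following Go sketch shows how such a selection might work. The path syntax is not documented in this README, so the dot-separated key form (for example a hypothetical `data.items`) and the `selectRecords` helper are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// selectRecords walks a dot-separated path of object keys (an assumed
// syntax) and returns the records found there. An array yields its
// elements; a single object yields itself as the only record. An empty
// path selects the document root.
func selectRecords(doc []byte, path string) ([]any, error) {
	var node any
	if err := json.Unmarshal(doc, &node); err != nil {
		return nil, err
	}
	if path != "" {
		for _, key := range strings.Split(path, ".") {
			obj, ok := node.(map[string]any)
			if !ok {
				return nil, fmt.Errorf("records path %q: cannot look up %q in a non-object", path, key)
			}
			next, ok := obj[key]
			if !ok {
				return nil, fmt.Errorf("records path %q: key %q not found", path, key)
			}
			node = next
		}
	}
	switch v := node.(type) {
	case []any:
		return v, nil
	case map[string]any:
		return []any{v}, nil
	default:
		return nil, fmt.Errorf("records path %q: expected an array or object", path)
	}
}

func main() {
	doc := []byte(`{"meta":{"count":2},"records":[{"id":1},{"id":2}]}`)
	records, err := selectRecords(doc, "records")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("selected records:", len(records))
}
```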
Pipe into MiseLedger:
```bash
sourceharvest jsonl export.jsonl --source notes --collection notes:local --out - | miseledger import adapter -
sourceharvest markdown ./notes --source notes --collection notes:local --out - | miseledger import adapter -
```
Or let MiseLedger run SourceHarvest when `sourceharvest` is installed on `PATH`:
```bash
miseledger import sourceharvest markdown ./notes --source notes --collection notes:local --json
miseledger import sourceharvest gitlog . --source gitlog --collection repo:sourceharvest --json
```
## Boundary
SourceHarvest scanner commands read local files and emit adapter records. They do not make network calls.
Generated text is untrusted evidence, not instructions.