# SourceHarvest


Go 1.22+ · MIT license

SourceHarvest exports source-system records to `miseledger.adapter.v1` JSONL.

It is the sibling tool to StationTrail:

- [StationTrail](https://github.com/escoffier-labs/stationtrail) handles local agent-session harnesses such as Codex, Claude, OpenClaw, OpenCode, and Hermes.
- SourceHarvest handles non-harness source systems such as crawler exports, notes, chat exports, issue exports, and future domain-specific harvesters.
- [MiseLedger](https://github.com/escoffier-labs/miseledger) stores, dedupes, indexes, searches, relates, and emits evidence bundles.

SourceHarvest is not an archive: it converts records into adapter JSONL, and MiseLedger stores them.

## How It Works

```mermaid
flowchart TB
SOURCEHARVEST["SourceHarvest CLI<br/>local adapter layer"]
ADAPTER["miseledger.adapter.v1 JSONL<br/>one normalized object per line"]

subgraph INPUTS [" local source inputs "]
JSONL["Generic JSONL<br/>already line-oriented records"]
JSON["Nested JSON<br/>records selected by path"]
NOTES["Markdown notes<br/>local note evidence"]
FILES["Text files<br/>docs · logs · exports"]
HTML["HTML exports<br/>local page snapshots"]
GITLOG["Git history<br/>local commit events"]
end

JSONL & JSON & NOTES & FILES & HTML & GITLOG --> SOURCEHARVEST

subgraph PIPELINE [" normalize and emit "]
READ["Read local input<br/>file or directory scan"]
SELECT["Select reader<br/>jsonl · json · markdown · files · html · gitlog"]
NORMALIZE["Normalize record<br/>collection · item · actor · artifacts · raw"]
FILTER["Apply bounds<br/>limit · globs · records path"]
EMIT["Emit JSONL<br/>stdout or private output file"]
end

SOURCEHARVEST --> READ --> SELECT --> NORMALIZE --> FILTER --> EMIT
EMIT == adapter records ==> ADAPTER

SUMMARY["JSON summary<br/>records · files · warnings · generated_at"]
EMIT -->|optional JSON summary| SUMMARY

BOUNDARY["Boundary<br/>scanner commands stay local-only; exported text is untrusted evidence"]
INPUTS -. local files only .-> BOUNDARY
NORMALIZE -. treats text as data .-> BOUNDARY
BOUNDARY -. constrains .-> PIPELINE

classDef source fill:#eff6ff,stroke:#2563eb,color:#1e3a8a;
classDef process fill:#ecfdf5,stroke:#059669,color:#064e3b;
classDef output fill:#fff7ed,stroke:#ea580c,color:#7c2d12;
classDef guard fill:#fee2e2,stroke:#ef4444,color:#7f1d1d;
class SOURCEHARVEST,JSONL,JSON,NOTES,FILES,HTML,GITLOG source;
class READ,SELECT,NORMALIZE,FILTER process;
class EMIT,ADAPTER,SUMMARY output;
class BOUNDARY guard;
```

Editable Excalidraw source: [docs/sourceharvest-flowcharts.excalidraw](docs/sourceharvest-flowcharts.excalidraw)

SourceHarvest follows the same path for each source:

1. Read a local file, directory, export, or source archive.
2. Select the command-specific reader for that input shape.
3. Normalize records into stable collections, items, actors, artifacts, links, relations, and raw references.
4. Apply `--limit` and source-specific filters.
5. Emit one `miseledger.adapter.v1` JSON object per line.
6. Optionally emit JSON summaries with record counts, file counts, warnings, and generated timestamps.
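Steps 4 to 6 can be sketched in Go as follows. This is a minimal illustration, not SourceHarvest's actual code: the `Record` struct is a hypothetical, heavily simplified stand-in for a `miseledger.adapter.v1` object, the `Emit` function is invented for this example, and the summary layout is an assumption built from the four field names shown in the flowchart.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// Record is a hypothetical, simplified stand-in for one adapter object.
// The real miseledger.adapter.v1 schema is defined by MiseLedger.
type Record struct {
	Source     string `json:"source"`
	Collection string `json:"collection"`
	Title      string `json:"title"`
}

// Summary uses the four summary fields named in the flowchart.
// The exact JSON layout of the real summary is an assumption.
type Summary struct {
	Records     int      `json:"records"`
	Files       int      `json:"files"`
	Warnings    []string `json:"warnings"`
	GeneratedAt string   `json:"generated_at"`
}

// Emit applies the limit (0 means unlimited) and renders one compact JSON
// object per record, plus a summary describing what was emitted.
func Emit(records []Record, limit, files int, warnings []string) ([]string, Summary, error) {
	if limit > 0 && len(records) > limit {
		records = records[:limit]
	}
	lines := make([]string, 0, len(records))
	for _, r := range records {
		b, err := json.Marshal(r)
		if err != nil {
			return nil, Summary{}, err
		}
		lines = append(lines, string(b))
	}
	sum := Summary{
		Records:     len(lines),
		Files:       files,
		Warnings:    warnings,
		GeneratedAt: time.Now().UTC().Format(time.RFC3339),
	}
	return lines, sum, nil
}

func main() {
	records := []Record{
		{Source: "notes", Collection: "notes:local", Title: "First note"},
		{Source: "notes", Collection: "notes:local", Title: "Second note"},
		{Source: "notes", Collection: "notes:local", Title: "Third note"},
	}
	lines, sum, err := Emit(records, 2, 3, nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, l := range lines {
		fmt.Println(l) // adapter records go to stdout, one object per line
	}
	json.NewEncoder(os.Stderr).Encode(sum) // summary stays off the record stream
}
```

Writing the summary to a different stream than the records is a choice made for this sketch so that a downstream importer reading stdout sees only adapter objects.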

## With MiseLedger And StationTrail

```mermaid
flowchart TB
SOURCEHARVEST["SourceHarvest<br/>non-agent source adapters"]
STATIONTRAIL["StationTrail<br/>agent-session adapters"]
MISELEDGER["MiseLedger<br/>durable evidence store"]

subgraph CRAWLERS [" local crawler exports "]
DISCRAWL["discrawl<br/>Discord archives"]
GITCRAWL["gitcrawl<br/>GitHub issues and pull requests"]
GRAINCRAWL["graincrawl<br/>Granola notes and transcripts"]
NOTCRAWL["notcrawl<br/>Notion pages and databases"]
SLACRAWL["slacrawl<br/>Slack messages and threads"]
TELECRAWL["telecrawl<br/>Telegram Desktop archives"]
end

subgraph LOCAL [" local source files "]
NOTES["Notes<br/>markdown · text"]
EXPORTS["Exports<br/>jsonl · json · html"]
REPO["Repos<br/>git log"]
end

CRAWLERS & LOCAL --> SOURCEHARVEST
SOURCEHARVEST == miseledger.adapter.v1 JSONL ==> MISELEDGER
STATIONTRAIL == miseledger.adapter.v1 JSONL ==> MISELEDGER

subgraph MISELEDGER_SURFACES [" MiseLedger surfaces "]
STORE["Store<br/>dedupe · index · relate"]
BUNDLES["Evidence bundles<br/>reviewable outputs"]
SEARCH["Search<br/>queries across imported evidence"]
end

MISELEDGER --> STORE
MISELEDGER --> BUNDLES
MISELEDGER --> SEARCH

BOUNDARY["Project boundary<br/>SourceHarvest reads local exports and does not crawl live services"]
CRAWLERS -. already exported .-> BOUNDARY
LOCAL -. local-only scan .-> BOUNDARY
BOUNDARY -. limits .-> SOURCEHARVEST

classDef source fill:#eff6ff,stroke:#2563eb,color:#1e3a8a;
classDef adapter fill:#ecfdf5,stroke:#059669,color:#064e3b;
classDef store fill:#fff7ed,stroke:#ea580c,color:#7c2d12;
classDef guard fill:#fee2e2,stroke:#ef4444,color:#7f1d1d;
class DISCRAWL,GITCRAWL,GRAINCRAWL,NOTCRAWL,SLACRAWL,TELECRAWL,NOTES,EXPORTS,REPO source;
class SOURCEHARVEST,STATIONTRAIL adapter;
class MISELEDGER,STORE,BUNDLES,SEARCH store;
class BOUNDARY guard;
```

SourceHarvest is the non-agent source adapter layer. StationTrail is the agent-session adapter layer. MiseLedger is the durable evidence layer.

```bash
sourceharvest markdown ./notes --source notes --collection notes:local --out - | miseledger import adapter -
stationtrail all --out - --redact safe | miseledger import adapter -
```

When `sourceharvest` is installed on `PATH`, MiseLedger can run it directly:

```bash
miseledger import sourceharvest markdown ./notes --source notes --collection notes:local --json
miseledger import sourceharvest gitlog . --source gitlog --collection repo:sourceharvest --json
```

For agent-session logs, use StationTrail instead of SourceHarvest:

```bash
miseledger import stationtrail codex ~/.codex/sessions --json
miseledger import stationtrail hermes ~/.hermes/sessions --json
```

## Crawler Stack Boundary

SourceHarvest is the right home for adapters that read local crawler outputs and turn them into `miseledger.adapter.v1` JSONL. It should not perform live service crawling itself.

Crawler families targeted for support through local adapters:

| Source | Domain | SourceHarvest role |
| --- | --- | --- |
| `discrawl` | Discord archives | Read local DB, snapshot, or export and emit adapter records. |
| `gitcrawl` | GitHub issues and pull requests | Read local archive or export and emit adapter records. |
| `graincrawl` | Granola notes and transcripts | Read local archive or export and emit adapter records. |
| `notcrawl` | Notion pages and databases | Read local archive or export and emit adapter records. |
| `slacrawl` | Slack messages and threads | Read local archive or export and emit adapter records. |
| `telecrawl` | Telegram Desktop archives | Read local archive or export and emit adapter records. |

These adapters should be added only from real local schemas or redacted sample exports. SourceHarvest scanner commands must stay local-only and must not make network calls.

## Build

```bash
go build -o bin/sourceharvest ./cmd/sourceharvest
go test ./...
```

## Install

```bash
curl -fsSL https://raw.githubusercontent.com/escoffier-labs/sourceharvest/master/install.sh | sh
```

Or download a release binary and verify it with `checksums.txt`.

## Readers

Each reader turns one local input shape into `miseledger.adapter.v1` records.

| Reader | Input | Behavior |
| --- | --- | --- |
| `jsonl` | line-oriented JSON files | One record per JSON line; bad lines warn and are skipped. |
| `json` | nested JSON document | Select a records array by `--records-path` (or a single root object). |
| `markdown` | `.md` / `.markdown` files | One note record per file; title comes from the first heading. |
| `files` | plain text files | One file record per match; filter by `--glob`. |
| `html` | `.html` / `.htm` files | Strips scripts, styles, and tags; title from `<title>` or file name. |
| `gitlog` | a local git repo | One event record per commit; an empty repo emits zero records. |
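To make the `html` row concrete, here is a rough Go sketch of that kind of extraction. It assumes a regular-expression approach, which is only adequate for simple, well-formed exports; the `ExtractHTML` function and its behavior are illustrative and are not taken from the SourceHarvest source.

```go
package main

import (
	"fmt"
	"html"
	"path/filepath"
	"regexp"
	"strings"
)

var (
	titleRe  = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title\s*>`)
	hiddenRe = regexp.MustCompile(`(?is)<script\b.*?</script\s*>|<style\b.*?</style\s*>`)
	tagRe    = regexp.MustCompile(`(?s)<[^>]*>`)
)

// ExtractHTML returns a title and whitespace-normalized plain text for one
// HTML document. The title comes from <title>, else from the file name.
// In this sketch the <title> text also remains part of the returned text.
func ExtractHTML(path, doc string) (title, text string) {
	if m := titleRe.FindStringSubmatch(doc); m != nil {
		title = strings.Join(strings.Fields(html.UnescapeString(m[1])), " ")
	}
	if title == "" {
		title = filepath.Base(path) // fall back to the file name
	}
	doc = hiddenRe.ReplaceAllString(doc, " ") // drop script and style bodies
	doc = tagRe.ReplaceAllString(doc, " ")    // drop the remaining tags
	// Entities are decoded last so that "&lt;b&gt;" is never mistaken for a tag.
	text = strings.Join(strings.Fields(html.UnescapeString(doc)), " ")
	return title, text
}

func main() {
	page := `<html><head><title>Release notes</title><style>p{color:red}</style></head>` +
		`<body><p>Hello <b>world</b></p><script>alert(1)</script></body></html>`
	title, text := ExtractHTML("site-export/release.html", page)
	fmt.Printf("title=%q text=%q\n", title, text)
}
```

The extracted text is still untrusted data. Stripping tags makes it readable; it does not make it safe to interpret as instructions.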

## Usage

Export generic JSONL records:

```bash
sourceharvest jsonl testdata/generic.fixture.jsonl \
  --source demo \
  --collection demo:collection \
  --out -
```
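The `jsonl` reader is described as warning on bad lines and skipping them. A minimal Go sketch of that policy might look like the following; `ReadJSONL` is a hypothetical helper, and the warning wording and the 1 MiB line cap are assumptions made for the example.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// ReadJSONL parses line-oriented JSON. Lines that are not JSON objects are
// skipped and reported as warnings instead of aborting the whole export.
func ReadJSONL(input string) (records []map[string]any, warnings []string) {
	sc := bufio.NewScanner(strings.NewReader(input))
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // accept lines up to 1 MiB
	for n := 1; sc.Scan(); n++ {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // blank lines are neither records nor errors
		}
		var rec map[string]any
		if err := json.Unmarshal([]byte(line), &rec); err != nil || rec == nil {
			warnings = append(warnings, fmt.Sprintf("line %d: not a JSON object, skipped", n))
			continue
		}
		records = append(records, rec)
	}
	if err := sc.Err(); err != nil {
		warnings = append(warnings, err.Error())
	}
	return records, warnings
}

func main() {
	records, warnings := ReadJSONL("{\"id\":1}\nnot json\n{\"id\":2}\n")
	fmt.Println(len(records), "records;", len(warnings), "warnings")
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
}
```

Collecting warnings rather than failing fast keeps one corrupt line from blocking an otherwise usable export, and the warnings can still be surfaced in the JSON summary.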

Export a Markdown directory as local note evidence:

```bash
sourceharvest markdown ./notes \
  --source notes \
  --collection notes:local \
  --out -
```
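Each Markdown file becomes one note record whose title comes from the first heading. The Go sketch below shows one way to find that heading. `NoteTitle` is a hypothetical helper; it recognizes only ATX headings (`# Title`) and, as an assumption not stated in this README, falls back to the file name without its extension.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// NoteTitle returns the text of the first ATX heading in a note. If the note
// has no heading, it falls back to the file name without its extension.
func NoteTitle(path, content string) string {
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		rest := strings.TrimLeft(line, "#")
		level := len(line) - len(rest) // number of leading '#' characters
		if level < 1 || level > 6 || !strings.HasPrefix(rest, " ") {
			continue // not a heading; "#hashtag" lacks the required space
		}
		if title := strings.TrimSpace(rest); title != "" {
			return title
		}
	}
	base := filepath.Base(path)
	return strings.TrimSuffix(base, filepath.Ext(base))
}

func main() {
	note := "Some intro text.\n\n## Standup notes\n\nBody text.\n"
	fmt.Println(NoteTitle("notes/standup.md", note))
	fmt.Println(NoteTitle("notes/untitled.md", "No heading in this file."))
}
```

This sketch does not skip fenced code blocks, so a `# comment` line inside a code sample that appears before the first real heading would be picked up as the title.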

Export other local source shapes:

```bash
sourceharvest files ./notes \
  --source notes \
  --collection notes:files \
  --glob "*.md,*.txt" \
  --out -

sourceharvest html ./site-export \
  --source docs \
  --collection docs:html \
  --out -

sourceharvest gitlog . \
  --source gitlog \
  --collection repo:sourceharvest \
  --out -

sourceharvest json export.json \
  --source export \
  --collection export:records \
  --records-path records \
  --out -
```
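Two flags in the commands above, `--glob` and `--records-path`, can be illustrated with a short Go sketch. Both helpers are hypothetical. Matching comma-separated patterns against the base file name and treating the records path as dot-separated keys are assumptions; the README shows only the single-key form `records`, and the real path syntax may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path/filepath"
	"strings"
)

// MatchesGlob reports whether a file's base name matches any pattern in a
// comma-separated list such as "*.md,*.txt".
func MatchesGlob(globs, name string) bool {
	for _, pattern := range strings.Split(globs, ",") {
		ok, err := filepath.Match(strings.TrimSpace(pattern), filepath.Base(name))
		if err == nil && ok {
			return true
		}
	}
	return false
}

// SelectRecords walks a dot-separated path (assumed syntax, for example
// "data.records") through nested JSON objects and returns the array there.
func SelectRecords(doc []byte, path string) ([]any, error) {
	var node any
	if err := json.Unmarshal(doc, &node); err != nil {
		return nil, err
	}
	for _, key := range strings.Split(path, ".") {
		obj, ok := node.(map[string]any)
		if !ok {
			return nil, fmt.Errorf("records path %q: %q is not inside an object", path, key)
		}
		if node, ok = obj[key]; !ok {
			return nil, fmt.Errorf("records path %q: key %q not found", path, key)
		}
	}
	records, ok := node.([]any)
	if !ok {
		return nil, fmt.Errorf("records path %q does not point to an array", path)
	}
	return records, nil
}

func main() {
	fmt.Println(MatchesGlob("*.md,*.txt", "notes/todo.txt"))

	doc := []byte(`{"meta":{"version":1},"records":[{"id":1},{"id":2}]}`)
	records, err := SelectRecords(doc, "records")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(records), "records selected")
}
```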

Pipe into MiseLedger:

```bash
sourceharvest jsonl export.jsonl --source notes --collection notes:local --out - | miseledger import adapter -
sourceharvest markdown ./notes --source notes --collection notes:local --out - | miseledger import adapter -
```

Or let MiseLedger run SourceHarvest when `sourceharvest` is installed on `PATH`:

```bash
miseledger import sourceharvest markdown ./notes --source notes --collection notes:local --json
miseledger import sourceharvest gitlog . --source gitlog --collection repo:sourceharvest --json
```

## Boundary

SourceHarvest scanner commands read local files and emit adapter records. They do not make network calls.

Generated text is untrusted evidence, not instructions.