{"id":48951923,"url":"https://github.com/moonyfringers/ladon","last_synced_at":"2026-04-17T21:01:23.573Z","repository":{"id":327650673,"uuid":"1106796171","full_name":"MoonyFringers/ladon","owner":"MoonyFringers","description":null,"archived":false,"fork":false,"pushed_at":"2026-04-03T20:43:10.000Z","size":1255,"stargazers_count":0,"open_issues_count":3,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-03T20:58:00.255Z","etag":null,"topics":["crawler","data-pipeline","ladon","ladon-framework","llm","python","training-data","web-crawler","web-scraping"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MoonyFringers.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":"CLA.md"}},"created_at":"2025-11-30T00:35:26.000Z","updated_at":"2026-04-03T20:42:48.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/MoonyFringers/ladon","commit_stats":null,"previous_names":["moonyfringers/ladon"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MoonyFringers/ladon","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MoonyFringers%2Fladon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MoonyFringers%2Fladon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MoonyFringers%2Fladon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MoonyFringers%2Fladon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MoonyFringers","download_url":"https://codeload.github.com/MoonyFringers/ladon/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MoonyFringers%2Fladon/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31945987,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-17T17:29:20.459Z","status":"ssl_error","status_checked_at":"2026-04-17T17:28:47.801Z","response_time":62,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","data-pipeline","ladon","ladon-framework","llm","python","training-data","web-crawler","web-scraping"],"created_at":"2026-04-17T21:00:55.077Z","updated_at":"2026-04-17T21:01:23.558Z","avatar_url":"https://github.com/MoonyFringers.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Ladon\n\n[![CI](https://github.com/MoonyFringers/ladon/actions/workflows/unittests.yaml/badge.svg)](https://github.com/MoonyFringers/ladon/actions/workflows/unittests.yaml)\n[![Lint](https://github.com/MoonyFringers/ladon/actions/workflows/lint.yaml/badge.svg)](https://github.com/MoonyFringers/ladon/actions/workflows/lint.yaml)\n[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue)](https://www.python.org/downloads/)\n[![License: AGPL-3.0-only](https://img.shields.io/badge/license-AGPL--3.0--only-blue)](LICENSE)\n\nA Python framework for building structured, resumable web crawlers — designed\nfor domains where data quality matters.\n\n## What is Ladon?\n\nLadon enforces typed domain objects at every stage of the crawl pipeline\nthrough the SES protocol (Source / Expander / Sink). The difference from\nScrapy — a proven, mature tool — is structural: instead of weakly typed\n`scrapy.Item` fields, you define typed dataclasses at the protocol level\n(e.g. a `CommentRecord` with enforced field types). The output is structured\nand typed without a post-processing step. This matters when the destination\nis an LLM training pipeline or any domain where schema correctness is not optional.\n\nThe built-in HTTP layer handles retries, exponential back-off, per-domain rate\nlimiting, circuit breaking, and robots.txt enforcement — so adapter authors\nfocus on domain logic, not infrastructure.\n\n## Quick start\n\nThe canonical example is\n[`ladon-hackernews`](https://github.com/MoonyFringers/ladon-hackernews) —\nan adapter that crawls the HN top-stories list and writes comments to DuckDB:\n\n```bash\npip install ladon-crawl ladon-hackernews\nladon-hackernews --top 30 --out hn.db\n```\n\nNo authentication. No external server. 30 stories and their comments in\nunder a minute.\n\n## The LLM training pipeline\n\n```\nladon-hackernews --top 500 --out hn.db\n    → export_parquet(\"hn.db\", \"hn.parquet\")\n        → training pipeline\n```\n\nHN comments are structured, human-authored, and high signal-to-noise. The\nfull pipeline from install to Parquet takes under five minutes. Each run\nwrites a `ladon_runs` audit table to the DuckDB file — re-running skips\nstories already marked `done`, giving you resumable crawls for free.\n\n```python\nfrom ladon_hackernews import export_parquet\nexport_parquet(\"hn.db\", \"hn.parquet\")\n```\n\n## Writing your own adapter\n\n`ladon-hackernews` is the canonical reference for building an adapter.\nAdapters implement the SES protocol **structurally** — no inheritance from\nany Ladon base class is required. The three components to implement are:\n\n- **Source** — discovers the list of root references to crawl\n- **Expander** — maps a reference to a domain record and child references\n- **Sink** — receives each leaf record for persistence or downstream use\n\nSee the [adapter authoring guide](https://moonyfringers.github.io/ladon/) and\n[ADR-003](https://github.com/MoonyFringers/ladon/blob/main/docs/decisions/adr-003-plugin-adapter-interface.md)\nfor the full protocol specification. The\n[`ladon-hackernews` source](https://github.com/MoonyFringers/ladon-hackernews)\nis the worked example.\n\n## CLI reference\n\n```\nladon info\nladon run --plugin MODULE:CLASS --ref URL [--respect-robots-txt]\nladon --version\n```\n\n| command | description |\n|---|---|\n| `ladon info` | Print Ladon version, Python version, and platform |\n| `ladon run` | Run a crawl using a dynamically loaded plugin class |\n| `ladon --version` | Print the installed version |\n\n`ladon run` flags:\n\n| flag | required | description |\n|---|---|---|\n| `--plugin MODULE:CLASS` | yes | Dotted import path to the `CrawlPlugin` class |\n| `--ref URL` | yes | Top-level reference URL passed to the plugin |\n| `--respect-robots-txt` | no | Honour `Disallow` rules and `Crawl-delay` directives |\n\nExit codes: `0` success · `1` fatal error · `2` partial failures · `3` data not ready (retry later)\n\n`ladon run` uses default `HttpClientConfig` settings. For retries, rate\nlimiting, circuit breaking, or a persistence layer, call `run_crawl()`\ndirectly from Python — see\n[`ladon-hackernews` — Use as a library](https://github.com/MoonyFringers/ladon-hackernews#use-as-a-library)\nfor a full example.\n\n## Status\n\n`v0.0.1` — alpha. The SES protocol and HTTP layer are stable. One reference\nadapter (`ladon-hackernews`) is available as open source and tested against\nthe real HN API.\n\nWhat is in v0.0.1:\n- SES protocol (Source / Expander / Sink) with structural typing\n- `run_crawl()` runner with leaf isolation and `RunResult` summary\n- `HttpClient` with retries, back-off, rate limiting, circuit breaker, robots.txt\n- `Storage` protocol with `LocalFileStorage`\n- `Repository` and `RunAudit` persistence protocols with `NullRepository`\n- `ladon run` / `ladon info` CLI\n\nWhat is coming in v0.1.0:\n- RunResult counter semantics redesign (issue [#62](https://github.com/MoonyFringers/ladon/issues/62))\n- Structured logging baseline (ADR-009)\n\n## Contributing\n\nThe plugin protocol is settled — contributions are welcome. Please read the\n[documentation](https://moonyfringers.github.io/ladon/) for design context\n(ADRs, plugin authoring guide) before sending a pull request.\n\nA [CLA signature](https://github.com/MoonyFringers/ladon/blob/main/CLA.md)\nis required for external contributors. The bot will prompt you on your first PR.\n\n## License\n\nLadon is released under the **GNU Affero General Public License v3.0 only\n(AGPL-3.0-only)**. See [`LICENSE`](LICENSE) for the full text.\n\nAGPL was chosen to ensure that improvements to the core framework — including\nwhen deployed as a networked service — remain open and available to the\ncommunity. A commercial licence is available for organisations that cannot\naccept the AGPL terms — see [`LICENSE-COMMERCIAL`](LICENSE-COMMERCIAL).\n\n`ladon-hackernews` is separately licensed under Apache-2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoonyfringers%2Fladon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoonyfringers%2Fladon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoonyfringers%2Fladon/lists"}