{"id":50786727,"url":"https://github.com/0x4d44/readex","last_synced_at":"2026-06-12T08:03:54.777Z","repository":{"id":360421488,"uuid":"1241032138","full_name":"0x4D44/readex","owner":"0x4D44","description":"HTML main-content extraction for Rust — ports of Mozilla Readability, Trafilatura, and htmldate.","archived":false,"fork":false,"pushed_at":"2026-05-26T09:41:31.000Z","size":19302,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-26T11:30:54.598Z","etag":null,"topics":["article-extraction","boilerplate-removal","content-extraction","html","html-parser","htmldate","metadata-extraction","readability","rust","text-extraction","trafilatura","web-scraping"],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/0x4D44.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-16T21:57:25.000Z","updated_at":"2026-05-26T09:41:34.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/0x4D44/readex","commit_stats":null,"previous_names":["0x4d44/readex"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/0x4D44/readex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0x4D44%2Freadex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0x4D44%2Freadex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0x4D44%2Freadex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0x4D44%2Freadex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/0x4D44","download_url":"https://codeload.github.com/0x4D44/readex/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0x4D44%2Freadex/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34234576,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-12T02:00:06.859Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["article-extraction","boilerplate-removal","content-extraction","html","html-parser","htmldate","metadata-extraction","readability","rust","text-extraction","trafilatura","web-scraping"],"created_at":"2026-06-12T08:03:48.547Z","updated_at":"2026-06-12T08:03:54.772Z","avatar_url":"https://github.com/0x4D44.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# readex\n\n[![Crates.io](https://img.shields.io/crates/v/readex.svg)](https://crates.io/crates/readex)\n[![docs.rs](https://img.shields.io/docsrs/readex)](https://docs.rs/readex)\n[![License: MIT OR Apache-2.0](https://img.shields.io/badge/license-MIT%20OR%20Apache--2.0-blue)](#license)\n\n**HTML main-content extraction for Rust.** Give `readex` a `\u0026str` of HTML and\nit returns the article body, title, byline, publish date, language, and\n~15 other metadata fields — no network I/O, no JavaScript rendering, no\nencoding detection. Pure synchronous string and DOM work, suitable for\nembedding anywhere from a desktop tool to a server pipeline.\n\n---\n\n## Quick start\n\n```toml\n[dependencies]\nreadex = \"0.19\"\n```\n\n```rust\nuse readex::{extract, Extracted};\n\nlet html = r#\"\n    \u003chtml\u003e\n      \u003chead\u003e\u003ctitle\u003eHello readex\u003c/title\u003e\u003c/head\u003e\n      \u003cbody\u003e\n        \u003carticle\u003e\n          \u003ch1\u003eHello readex\u003c/h1\u003e\n          \u003cp\u003eThis is the body of an article. It contains enough words\n             that the extractor will consider it substantive content.\u003c/p\u003e\n          \u003cp\u003eA second paragraph adds more text so the scorer has signal.\u003c/p\u003e\n        \u003c/article\u003e\n      \u003c/body\u003e\n    \u003c/html\u003e\n\"#;\n\nlet Extracted { title, text, .. } = extract(html, None).expect(\"extraction failed\");\n\nassert_eq!(title.as_deref(), Some(\"Hello readex\"));\nassert!(text.contains(\"body of an article\"));\n```\n\n## A more representative example\n\nReal web pages come wrapped in navigation, cookie banners, share widgets, and\ncomment sections. `readex` strips the chrome and returns just the body and\nthe metadata it can recover:\n\n```rust\nuse readex::extract;\n\nlet html = r#\"\n    \u003chtml lang=\"en\"\u003e\n      \u003chead\u003e\n        \u003ctitle\u003eWhy the bridge collapsed — The Daily Example\u003c/title\u003e\n        \u003cmeta property=\"og:site_name\" content=\"The Daily Example\"\u003e\n        \u003cmeta name=\"author\" content=\"Jane Reporter\"\u003e\n        \u003cmeta property=\"article:published_time\" content=\"2026-05-24T09:30:00Z\"\u003e\n      \u003c/head\u003e\n      \u003cbody\u003e\n        \u003cnav\u003e\u003ca href=\"/\"\u003eHome\u003c/a\u003e \u003ca href=\"/news\"\u003eNews\u003c/a\u003e\u003c/nav\u003e\n        \u003caside class=\"cookie-banner\"\u003eWe use cookies. \u003cbutton\u003eOK\u003c/button\u003e\u003c/aside\u003e\n        \u003carticle\u003e\n          \u003ch1\u003eWhy the bridge collapsed\u003c/h1\u003e\n          \u003cp class=\"byline\"\u003eBy Jane Reporter, 24 May 2026\u003c/p\u003e\n          \u003cp\u003eInvestigators arrived on site shortly after dawn and began\n             sampling the steelwork for fatigue cracks.\u003c/p\u003e\n          \u003cp\u003eThe bridge, opened in 1972, had been scheduled for inspection\n             next month. Engineers say the failure mode is consistent with\n             corrosion at the western anchorage.\u003c/p\u003e\n        \u003c/article\u003e\n        \u003csection class=\"comments\"\u003e\n          \u003ch3\u003eComments (412)\u003c/h3\u003e\n          \u003cp\u003e\"Knew this would happen\" — anonymous\u003c/p\u003e\n        \u003c/section\u003e\n        \u003cfooter\u003e© 2026 The Daily Example\u003c/footer\u003e\n      \u003c/body\u003e\n    \u003c/html\u003e\n\"#;\n\nlet result = extract(html, Some(\"https://example.com/news/bridge\")).unwrap();\n\nassert_eq!(result.title.as_deref(), Some(\"Why the bridge collapsed\"));\nassert_eq!(result.byline.as_deref(), Some(\"Jane Reporter\"));\nassert_eq!(result.site_name.as_deref(), Some(\"The Daily Example\"));\nassert_eq!(result.language.as_deref(), Some(\"en\"));\nassert!(result.published_time.is_some());\nassert!(result.text.contains(\"Investigators arrived on site\"));\nassert!(!result.text.contains(\"cookie\"));          // banner stripped\nassert!(!result.text.contains(\"Home\"));            // nav stripped\nassert!(!result.text.contains(\"Knew this would\")); // comments stripped\n```\n\n`readex` carries the lineage of three well-validated extractors:\n\n| Origin | Role inside `readex` |\n| --- | --- |\n| [Mozilla Readability](https://github.com/mozilla/readability) (JS) | Article-scoring core — the M2 port preserves the full `_grabArticle` / `_prepArticle` / flag-sieve pipeline. |\n| [Trafilatura](https://trafilatura.readthedocs.io) (Python) | The M3 cascade — own → readability fork → jusText — with the 7-branch arbiter, dedup gate, and sanitize post-pass. |\n| [htmldate](https://htmldate.readthedocs.io) (Python) | Publication-date extraction with the same precedence rules as upstream. |\n\nEach is a clean-room reimplementation in Rust; the upstream Python and\nJavaScript projects are the differential-test oracles, not vendored code.\n\n---\n\n## API reference (cheat sheet)\n\n| Function | Purpose |\n| --- | --- |\n| [`extract`] | Default extraction. Returns an [`Extracted`] with title, body text, canonical URL, language, byline, excerpt, site name, published time, categories, tags, image, license, hostname, and (optionally) sanitised HTML. |\n| [`extract_with`] | `extract(html, base_url)` plus a third `\u0026Options` parameter (so `extract_with(html, base_url, \u0026Options::default())` is exactly equivalent to `extract(html, base_url)`). Lets you opt into sanitised HTML output, set a minimum word-count threshold, or request a YAML metadata header. |\n| [`extract_to_markdown`] | Body as Markdown — Trafilatura's `output_format=\"markdown\"`. |\n| [`extract_to_txt`] | Plain-text body — Trafilatura's `output_format=\"txt\"`. |\n| `extract_to_json` / `extract_to_csv` / `extract_to_xml` / `extract_to_tei` | Structured output formats. |\n| [`extract_via_readability`] | Forces the M2 Mozilla-Readability path (older, simpler, no Trafilatura cascade). Useful when you specifically need that algorithm's output shape. |\n\n`extract` and `extract_with(.., .., \u0026Options::default())` are byte-identical by\nconstruction — `extract` is literally a one-line delegate, so the two cannot\ndrift apart.\n\n---\n\n## Why readex?\n\nThere are already a handful of HTML-extraction crates on crates.io. Honest\npositioning vs. the obvious alternatives:\n\n| | `readex` | [`readability`](https://crates.io/crates/readability) | [`dom_content_extraction`](https://crates.io/crates/dom_content_extraction) |\n| --- | --- | --- | --- |\n| Algorithms | Readability + Trafilatura cascade + htmldate | Readability only | DOM-centric (different family) |\n| Metadata fields | ~15 (title, byline, language, dates, OG/Twitter/JSON-LD, categories, tags, image, license, hostname …) | Title + summary | Body text only |\n| Output formats | text, sanitised HTML, Markdown, TXT, JSON, CSV, XML, TEI | text only | text only |\n| Differential parity testing | Yes — 51-URL corpus + 50K broad sweep, every release | No | No |\n| Hard pin on parser versions | Yes (`html5ever 0.39.0`, plus a documented \"parser-equivalence fence\") | No | No |\n| Edition / MSRV | 2024 / 1.85 | 2018 / older | 2021 / older |\n| Comments extraction (Reddit/vBulletin/etc.) | Yes (via Trafilatura) | No | No |\n| Date extraction | Yes (via htmldate) | No | No |\n\nIf your input is well-structured English-language articles and you want one\nalgorithm with no extra moving parts, `readability` may be all you need.\n`readex` exists because real-world corpora (SEC filings, regulator\npublications, multilingual news, low-template blogs, hub/index pages) defeat\nsingle-algorithm extractors — the Trafilatura cascade was designed\nspecifically for that long tail.\n\n---\n\n## Quality \u0026 differential testing\n\n`readex` is developed against a differential-test harness that runs every\nbenchmark URL through three extractors in parallel — `readex`, Mozilla\nReadability (via Node), and Trafilatura (via Python) — and scores agreement\nacross token sequences and metadata fields. The harness lives in\n`benchmark/` in the repo (not published as part of the crate) and is\nre-run on every release.\n\nLatest verdicts (as of 0.19.0):\n\n| Gate | Corpus | Result |\n| --- | --- | --- |\n| Trafilatura `extract_content` (Markdown path) | 51 URLs | **48 / 51** byte-equivalent (41 substantive + 7 documented allowlist) |\n| Trafilatura plain-text (TXT) path | 51 URLs | **45 / 51** substantive + 5 allowlist + 1 deferred |\n| Trafilatura TEI structured output | 51 URLs | **51 / 51** (39 substantive + 12 allowlist) |\n| Mozilla Readability `textContent` | 51 URLs | **50 / 51** byte-equivalent vs. jsdom |\n| Parser equivalence (rcdom vs. jsdom) | 51 URLs | **51 / 51** byte-equivalent DOM |\n| Broad-sweep confidence (Common Crawl) | **50,000 pages** | Tail-distribution scan vs. Python Trafilatura |\n\nThe \"allowlist\" entries are documented per-page divergences where `readex`\nand upstream genuinely disagree for traced reasons (e.g. upstream emits a\ncookie banner the page lacks chrome-class hints for; or upstream skips a\ntable the data-table heuristic rescues). They live under\n`wrk_docs/m{5,7}-allowlist/` in the repo with one Markdown file per\nfixture.\n\nIf you find a page where `readex` disagrees with both Readability and\nTrafilatura in a way that matters, please file an issue with the URL or\nHTML — the harness will pick it up.\n\n---\n\n## What is out of scope\n\n- **Network fetching.** `readex` takes a `\u0026str`. The caller owns HTTP, redirects, SSRF guarding, and encoding detection.\n- **JavaScript rendering.** `readex` parses the bytes as given. Pages that need JS to render their body need a headless browser upstream.\n- **PDF extraction.** HTML only.\n- **Streaming.** The whole document is parsed at once.\n\nThese boundaries keep the crate sync, dependency-light, and easy to embed.\n\n---\n\n## Status\n\n`0.19.0` is the first public crates.io release. The API surface is:\n\n- Stable: `extract`, `extract_with`, `extract_to_markdown`, `extract_to_txt`,\n  `extract_to_json`, `extract_to_csv`, `extract_to_xml`, `extract_to_tei`,\n  `extract_via_readability`, plus the `Extracted`, `Options`, and\n  `ExtractError` types they use.\n- `#[doc(hidden)]` internals: `readability::*`, `trafilatura::*`,\n  `htmldate::*`. These are reachable but **explicitly not part of the\n  semver contract** — they exist for the in-workspace differential test\n  harness. Treat them as private; they can change at any time.\n\n`readex` is at 0.x — additive minor bumps may add fields to `Extracted` or\n`Options`; breaking changes (renames, signature changes) will only land on\na 0.X.0 boundary with a clear changelog entry.\n\n### Minimum supported Rust version (MSRV)\n\n`readex` targets **Rust 1.85+** (Rust 2024 edition).\n\n---\n\n## Contributing\n\nIssues and PRs welcome at \u003chttps://github.com/0x4D44/readex\u003e. For\nnon-trivial changes, please open an issue first so we can discuss the\napproach — `readex` is gated by a parity-test harness against Readability\nand Trafilatura, and the cheapest path through that gate is usually a\nquick sketch of intent before code.\n\n---\n\n## License\n\nLicensed under either of:\n\n- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or\n  \u003chttps://www.apache.org/licenses/LICENSE-2.0\u003e)\n- MIT License ([LICENSE-MIT](LICENSE-MIT) or\n  \u003chttps://opensource.org/licenses/MIT\u003e)\n\nat your option.\n\nUnless you explicitly state otherwise, any contribution intentionally\nsubmitted for inclusion in `readex` by you, as defined in the Apache-2.0\nlicense, shall be dual-licensed as above, without any additional terms or\nconditions.\n\nSee [NOTICE](NOTICE) for attribution to the upstream Readability,\nTrafilatura, and htmldate projects whose algorithms `readex` ports.\n\n[`extract`]: https://docs.rs/readex/latest/readex/fn.extract.html\n[`extract_with`]: https://docs.rs/readex/latest/readex/fn.extract_with.html\n[`extract_to_markdown`]: https://docs.rs/readex/latest/readex/fn.extract_to_markdown.html\n[`extract_to_txt`]: https://docs.rs/readex/latest/readex/fn.extract_to_txt.html\n[`extract_via_readability`]: https://docs.rs/readex/latest/readex/fn.extract_via_readability.html\n[`Extracted`]: https://docs.rs/readex/latest/readex/struct.Extracted.html\n[`Options`]: https://docs.rs/readex/latest/readex/struct.Options.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0x4d44%2Freadex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F0x4d44%2Freadex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0x4d44%2Freadex/lists"}