{"id":50508512,"url":"https://github.com/firecrawl/html-extractor","last_synced_at":"2026-06-02T18:01:29.496Z","repository":{"id":358996654,"uuid":"1244037456","full_name":"firecrawl/html-extractor","owner":"firecrawl","description":"Fast HTML main-content extractor in Rust with Node bindings. Page-type-aware, outputs clean markdown.","archived":false,"fork":false,"pushed_at":"2026-05-19T23:26:28.000Z","size":147,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-20T02:35:34.826Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://www.npmjs.com/package/@firecrawl/html-extractor","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/firecrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-19T22:56:25.000Z","updated_at":"2026-05-20T00:07:09.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/firecrawl/html-extractor","commit_stats":null,"previous_names":["firecrawl/html-extractor"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/firecrawl/html-extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/firecrawl%2Fhtml-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/firecrawl%2Fhtml-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/firecrawl%2Fhtml-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/firecrawl%2Fhtml-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/firecrawl","download_url":"https://codeload.github.com/firecrawl/html-extractor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/firecrawl%2Fhtml-extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33833277,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-02T18:01:28.603Z","updated_at":"2026-06-02T18:01:29.481Z","avatar_url":"https://github.com/firecrawl.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# html-extractor\n\nA fast, streaming, page-type-aware HTML main-content extractor in Rust, with NAPI bindings for Node.js. A general-purpose library for pulling article-style content out of raw HTML pages.\n\nAlgorithm inspired by Python's [trafilatura](https://github.com/adbar/trafilatura). Implementation is from scratch — the algorithm comes from studying the Python source; the API surface, optimizations, and architecture are ours.\n\n## What this is\n\nA library you call with raw HTML and get back the article content, stripped of nav/footer/related-stories/ads/site chrome. Output is markdown by default (clean text with headings, lists, tables, code blocks preserved). Also returns a per-extraction confidence score, the detected page type, and metadata.\n\n## Why it exists\n\n- Heuristic CSS-selector blocklists are brittle: they break on hashed class names and don't generalize across non-article page types (product, listing, forum, documentation).\n- Existing Rust ports of trafilatura are pre-1.0 with limited maintenance.\n- A self-contained library we can ship as open source and progressively optimize.\n\n## High-level architecture\n\n```\n        ┌──────────────────────────────────────────────┐\n        │  HTML bytes in                                │\n        └──────────────────────────────────────────────┘\n                          │\n                          ▼\n        ┌──────────────────────────────────────────────┐\n        │  Stage 1 — pre-clean                          │\n        │  drop \u003cscript\u003e/\u003cstyle\u003e/\u003chead\u003e/comments        │\n        └──────────────────────────────────────────────┘\n                          │\n                          ▼\n        ┌──────────────────────────────────────────────┐\n        │  Stage 2 — page-type classification            │\n        │  pick a scoring profile per type              │\n        └──────────────────────────────────────────────┘\n                          │\n                          ▼\n        ┌──────────────────────────────────────────────┐\n        │  Stage 3 — score + select main subtree         │\n        │  text density / link density / tag weights /   │\n        │  class hints / position / parent chain         │\n        └──────────────────────────────────────────────┘\n                          │\n                          ▼\n        ┌──────────────────────────────────────────────┐\n        │  Stage 4 — fallback chain if Stage 3 degraded  │\n        │  justext-style, readability-style, raw text    │\n        └──────────────────────────────────────────────┘\n                          │\n                          ▼\n        ┌──────────────────────────────────────────────┐\n        │  Stage 5 — post-clean + markdown render        │\n        └──────────────────────────────────────────────┘\n                          │\n                          ▼\n        ┌──────────────────────────────────────────────┐\n        │  ExtractResult { markdown, page_type,         │\n        │                   extraction_quality, ...}    │\n        └──────────────────────────────────────────────┘\n```\n\n## Tech stack (high level)\n\n- **Rust** for the core library. Modern, idiomatic, no `unsafe`.\n- **NAPI bindings** via `napi-rs` for Node.js / Bun consumers. Pre-built binaries for Linux x64, macOS arm64, Windows x64.\n- **`criterion`** for benchmarks. Throughput numbers in CI.\n- **Golden corpus** of HTML fixtures with expected extractions, in the test suite.\n\n## Status\n\n- 34 Rust unit + integration + doctests, all passing\n- 54 golden-corpus fixtures across 8 categories, all passing\n- 7 NAPI binding tests, all passing\n\n## Use from Rust\n\n```toml\n[dependencies]\nhtml-extractor = \"0.1\"\n```\n\n```rust\nuse html_extractor::{extract, ExtractOptions};\n\nlet html = std::fs::read_to_string(\"page.html\")?;\nlet result = extract(\u0026html, \u0026ExtractOptions::default())?;\nprintln!(\"{}\", result.markdown);\nprintln!(\"page_type = {:?}\", result.page_type);\nprintln!(\"quality   = {:.2}\", result.extraction_quality);\n```\n\nRun the bundled example: `cargo run --example extract_one -p html-extractor`.\n\n## Use from Node\n\n```bash\nnpm install @firecrawl/html-extractor\n```\n\n```js\nimport { extract } from '@firecrawl/html-extractor'\n\nconst html = '\u003chtml\u003e…\u003c/html\u003e'\nconst result = await extract(html, { url: 'https://example.com/article' })\nconsole.log(result.markdown)\nconsole.log(result.metadata)\n```\n\nOr build the addon locally:\n\n```bash\ncd crates/html-extractor-napi\nnpm install\nnpm run build           # produces html-extractor.\u003ctriple\u003e.node for this host\n```\n\nRun the bundled example: `node examples/node-extract.mjs`.\n\n## Throughput\n\n`cargo bench -p html-extractor --bench throughput -- --quick` on an Apple M-series, release build:\n\n| Input               | Time    | Throughput   |\n|---------------------|---------|--------------|\n| Small (~10 KB)      | ~29 µs  | ~67 MiB/s    |\n| Medium (~148 KB)    | ~900 µs | ~156 MiB/s   |\n| Large (~1.7 MB)     | ~10 ms  | ~157 MiB/s   |\n\nThese are DOM-based numbers. A streaming backend (planned) is expected to lift these further.\n\n## License\n\n[Apache-2.0](./LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffirecrawl%2Fhtml-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffirecrawl%2Fhtml-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffirecrawl%2Fhtml-extractor/lists"}