{"id":16612749,"url":"https://github.com/untitaker/html5gum","last_synced_at":"2025-05-15T04:04:28.074Z","repository":{"id":37001295,"uuid":"429641630","full_name":"untitaker/html5gum","owner":"untitaker","description":"A WHATWG-compliant HTML5 tokenizer and tag soup parser","archived":false,"fork":false,"pushed_at":"2025-03-01T21:13:31.000Z","size":590,"stargazers_count":160,"open_issues_count":13,"forks_count":10,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-05-10T10:51:59.608Z","etag":null,"topics":["html","html5","lexer","parser","parsing","sax","tokenizer","whatwg","xml"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/untitaker.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-11-19T02:13:22.000Z","updated_at":"2025-03-11T01:08:34.000Z","dependencies_parsed_at":"2024-12-06T07:02:30.284Z","dependency_job_id":"927f5168-9e0f-4168-8634-ea4ce746deca","html_url":"https://github.com/untitaker/html5gum","commit_stats":{"total_commits":170,"total_committers":6,"mean_commits":"28.333333333333332","dds":"0.16470588235294115","last_synced_commit":"60f5f80a853990a7ecbee9b92aed6656c8a3aea2"},"previous_names":[],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/untitaker%2Fhtml5gum","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/untitaker%2Fhtml5gum/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/untitaker%2Fhtml5gum/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/untitaker%2Fhtml5gum/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/untitaker","download_url":"https://codeload.github.com/untitaker/html5gum/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254270641,"owners_count":22042858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["html","html5","lexer","parser","parsing","sax","tokenizer","whatwg","xml"],"created_at":"2024-10-12T01:43:17.979Z","updated_at":"2025-05-15T04:04:28.044Z","avatar_url":"https://github.com/untitaker.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# html5gum\n\n[![docs.rs](https://img.shields.io/docsrs/html5gum)](https://docs.rs/html5gum)\n[![crates.io](https://img.shields.io/crates/l/html5gum.svg)](https://crates.io/crates/html5gum)\n\n`html5gum` is a WHATWG-compliant HTML tokenizer.\n\n```rust\nuse std::fmt::Write;\nuse html5gum::{Tokenizer, Token};\n\nlet html = \"\u003ctitle   \u003ehello world\u003c/title\u003e\";\nlet mut new_html = String::new();\n\nfor Ok(token) in Tokenizer::new(html) {\n    match token {\n        Token::StartTag(tag) =\u003e {\n            write!(new_html, \"\u003c{}\u003e\", String::from_utf8_lossy(\u0026tag.name)).unwrap();\n        }\n        Token::String(hello_world) =\u003e {\n            write!(new_html, \"{}\", String::from_utf8_lossy(\u0026hello_world)).unwrap();\n        }\n        Token::EndTag(tag) =\u003e {\n          
  write!(new_html, \"\u003c/{}\u003e\", String::from_utf8_lossy(\u0026tag.name)).unwrap();\n        }\n        _ =\u003e panic!(\"unexpected input\"),\n    }\n}\n\nassert_eq!(new_html, \"\u003ctitle\u003ehello world\u003c/title\u003e\");\n```\n\n`html5gum` provides multiple kinds of APIs:\n\n* Iterating over tokens as shown above.\n* Implementing your own `Emitter` for maximum performance, see [the `custom_emitter.rs` example][examples/custom_emitter.rs].\n* A callbacks-based API for a middleground between convenience and performance, see [the `callback_emitter.rs` example][examples/callback_emitter.rs].\n* With the `tree-builder` feature, html5gum can be integrated with `html5ever` and `scraper`. See [the `scraper.rs` example][examples/scraper.rs].\n\n## What a tokenizer does and what it does not do\n\n`html5gum` fully implements [13.2.5 of the WHATWG HTML\nspec](https://html.spec.whatwg.org/#tokenization), i.e. is able to tokenize HTML documents and passes [html5lib's tokenizer\ntest suite](https://github.com/html5lib/html5lib-tests/tree/master/tokenizer). Since it is just a tokenizer, this means:\n\n* `html5gum` **does not** [implement charset\n  detection.](https://html.spec.whatwg.org/#determining-the-character-encoding)\n  This implementation takes and returns bytes, but assumes UTF-8. It recovers\n  gracefully from invalid UTF-8.\n* `html5gum` **does not** [correct mis-nested\n  tags.](https://html.spec.whatwg.org/#an-introduction-to-error-handling-and-strange-cases-in-the-parser)\n* `html5gum` doesn't implement the DOM, and unfortunately in the HTML spec,\n  constructing the DOM (\"tree construction\") influences how tokenization is\n  done. For an example of which problems this causes see [this example\n  code][examples/tokenize_with_state_switches.rs].\n* `html5gum` **does not** generally qualify as a browser-grade HTML *parser* as\n  per the WHATWG spec. 
This can change in the future, see [issue\n  21](https://github.com/untitaker/html5gum/issues/21).\n\nWith those caveats in mind, `html5gum` can pretty much ~parse~ _tokenize_\nanything that browsers can. However, using the experimental `tree-builder`\nfeature, html5gum can be integrated with `html5ever` and `scraper`. See [the\n`scraper.rs` example][examples/scraper.rs].\n\n## Other features\n\n* No unsafe Rust\n* The only dependency is `jetscii`, which can be disabled via crate features (see `Cargo.toml`)\n\n## Alternative HTML parsers\n\n`html5gum` was created out of a need to parse HTML tag soup efficiently. Previous options were to:\n\n* use [quick-xml](https://github.com/tafia/quick-xml/) or\n  [xmlparser](https://github.com/RazrFalcon/xmlparser) with some hacks to make\n  either one not choke on bad HTML. For some (rather large) set of HTML input\n  this works well (particularly `quick-xml` can be configured to be very\n  lenient about parsing errors) and parsing speed is stellar. But neither can\n  parse all HTML.\n\n  For my own use case `html5gum` is about 2x slower than `quick-xml`.\n\n* use [html5ever's own\n  tokenizer](https://docs.rs/html5ever/0.25.1/html5ever/tokenizer/index.html)\n  to avoid as much tree-building overhead as possible. This was functional but\n  had poor performance for my own use case (10-15x slower than `quick-xml`).\n\n* use [lol-html](https://github.com/cloudflare/lol-html), which would probably\n  perform at least as well as `html5gum`, but comes with a closure-based API\n  that I didn't manage to get working for my use case.\n\n## Etymology\n\nWhy is this library called `html5gum`?\n\n* G.U.M: **G**iant **U**nreadable **M**atch-statement\n\n* \\\u003cinsert \"how it feels to \u003cs\u003echew 5 gum\u003c/s\u003e _parse HTML_\" meme here\\\u003e\n\n## License\n\nLicensed under the MIT license, see [`./LICENSE`][LICENSE].\n\n\n\u003c!-- These link destinations are defined like this so that src/lib.rs can override them. 
--\u003e\n[LICENSE]: ./LICENSE\n[examples/tokenize_with_state_switches.rs]: ./examples/tokenize_with_state_switches.rs\n[examples/custom_emitter.rs]: ./examples/custom_emitter.rs\n[examples/callback_emitter.rs]: ./examples/callback_emitter.rs\n[examples/scraper.rs]: ./examples/scraper.rs\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funtitaker%2Fhtml5gum","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funtitaker%2Fhtml5gum","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funtitaker%2Fhtml5gum/lists"}