https://github.com/wezm/xhtmlchardet
Encoding detection for XML and HTML in Rust
- Host: GitHub
- URL: https://github.com/wezm/xhtmlchardet
- Owner: wezm
- License: mit
- Created: 2015-06-01T11:36:14.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2022-01-26T00:09:00.000Z (over 4 years ago)
- Last Synced: 2025-10-06T01:43:59.959Z (8 months ago)
- Topics: character-set, detection, html, rust, xml
- Language: Rust
- Homepage: http://docs.rs/xhtmlchardet/
- Size: 324 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md
# xhtmlchardet
Basic character set detection for XML and HTML in Rust.
[Build Status](https://cirrus-ci.com/github/wezm/xhtmlchardet)
[Documentation](https://docs.rs/xhtmlchardet)
[Crate](https://crates.io/crates/xhtmlchardet)
**Minimum Supported Rust Version:** 1.24.0
## Example
```rust
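extern crate xhtmlchardet;

use std::io::Cursor;

fn main() {
    // The XML declaration names the encoding that detection should report.
    let text = b"<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><channel><title>Example</title></channel>";
    let mut text_cursor = Cursor::new(text.to_vec());
    let detected_charsets: Vec<String> = xhtmlchardet::detect(&mut text_cursor, None).unwrap();
    assert_eq!(detected_charsets, vec!["iso-8859-1".to_string()]);
}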
```
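
A common next step after detection is converting the content to UTF-8. As a minimal sketch, assuming the detected charset is `iso-8859-1`, the conversion needs only the standard library, because every ISO-8859-1 byte maps to the Unicode code point with the same value. The helper name `latin1_to_utf8` is hypothetical and not part of this crate; other charsets would need a full transcoding library.

```rust
/// Hypothetical helper (not part of xhtmlchardet): transcode ISO-8859-1
/// bytes into a UTF-8 `String`.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    // Each byte 0x00..=0xFF is the code point U+0000..=U+00FF,
    // so a plain `as char` cast is a lossless conversion.
    bytes.iter().map(|&b| b as char).collect()
}
```

For example, `latin1_to_utf8(&[0x63, 0x61, 0x66, 0xE9])` yields the string `café`.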
## Rationale
I wrote a feed crawler that needed to determine the character set of fetched
content so that it could be normalised to UTF-8. Initially I used the
[uchardet] crate but I encountered some situations where it misdetected the
charset. I collected all these edge cases together and built a test suite. Then
I implemented this crate, which passes all of those tests. It uses a fairly
naïve approach derived from [section F of the XML specification][xmlspec].
[uchardet]: https://crates.io/crates/uchardet
[xmlspec]: http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing
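
The autodetection idea from section F of the XML specification, mentioned in the rationale above, can be sketched roughly as follows. This is an illustration under stated assumptions, not this crate's actual implementation: the function names `guess_encoding` and `declared_encoding` are hypothetical, only a subset of the byte patterns from the specification is covered, and the returned labels are chosen for the example.

```rust
/// Hypothetical sketch: guess an encoding from the first bytes of an XML
/// document, using byte patterns listed in section F of the XML specification.
fn guess_encoding(bytes: &[u8]) -> Option<String> {
    // Byte order marks. The UTF-32 marks are tested before the UTF-16 ones
    // because the UTF-32LE mark begins with the UTF-16LE mark.
    if bytes.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        return Some("utf-32be".to_string());
    }
    if bytes.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        return Some("utf-32le".to_string());
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return Some("utf-16be".to_string());
    }
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return Some("utf-16le".to_string());
    }
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return Some("utf-8".to_string());
    }
    // No byte order mark: check how the leading "<?" is laid out in memory.
    if bytes.starts_with(&[0x00, 0x3C, 0x00, 0x3F]) {
        return Some("utf-16be".to_string());
    }
    if bytes.starts_with(&[0x3C, 0x00, 0x3F, 0x00]) {
        return Some("utf-16le".to_string());
    }
    // ASCII-compatible encoding: the declaration can be read directly.
    if bytes.starts_with(b"<?xm") {
        return declared_encoding(bytes);
    }
    None
}

/// Hypothetical helper: extract the `encoding` pseudo-attribute from an
/// XML declaration written in an ASCII-compatible encoding.
fn declared_encoding(bytes: &[u8]) -> Option<String> {
    // Non-UTF-8 bytes become U+FFFD, which does not affect the ASCII search.
    let head = String::from_utf8_lossy(bytes);
    // Only the declaration itself (up to the first "?>") is inspected.
    let decl = &head[..head.find("?>")?];
    let after = &decl[decl.find("encoding=")? + "encoding=".len()..];
    let quote = after.chars().next()?;
    if quote != '"' && quote != '\'' {
        return None;
    }
    let value = &after[1..];
    let end = value.find(quote)?;
    Some(value[..end].to_ascii_lowercase())
}
```

With this sketch, input that starts with a UTF-8 byte order mark is reported as `utf-8`, while input with no byte order mark and no XML declaration returns `None`, leaving the caller to fall back on another source such as an HTTP header.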