https://github.com/dahlia/html-charset
Determine character encoding of HTML documents/fragments
- Host: GitHub
- URL: https://github.com/dahlia/html-charset
- Owner: dahlia
- License: lgpl-2.1
- Created: 2018-07-10T20:08:50.000Z (over 6 years ago)
- Default Branch: main
- Last Pushed: 2022-12-12T14:14:39.000Z (almost 2 years ago)
- Last Synced: 2024-10-03T16:35:40.850Z (about 1 month ago)
- Topics: character-encoding, chardet, haskell, html
- Language: Haskell
- Homepage: https://hackage.haskell.org/package/html-charset
- Size: 19.5 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- License: LICENSE
html-charset: Determine character encoding of HTML bytes
========================================================

[![Hackage](https://img.shields.io/hackage/v/html-charset.svg)][html-charset]

This provides a [Haskell library][html-charset] and a CLI executable to
determine the character encoding (i.e., the so-called "charset") of given HTML bytes.

The precedence order for determining the character encoding is:

1. A BOM (byte order mark) before any other data in the HTML document itself.
2. A `<meta>` declaration with a `charset` attribute, or with an `http-equiv`
attribute set to `Content-Type` and a value set for `charset`.
Note that only the first 1024 bytes are examined.
3. [Mozilla's Charset Detectors][chardet] heuristics. To be specific,
it delegates to the [charsetdetect-ae] package, a Haskell implementation
of that.

[html-charset]: https://hackage.haskell.org/package/html-charset
[chardet]: https://www-archive.mozilla.org/projects/intl/chardet.html
[charsetdetect-ae]: https://hackage.haskell.org/package/charsetdetect-ae

API
---

The package is available on Hackage: *[html-charset]*.

~~~~ haskell
>>> :set -XOverloadedStrings
>>> import Data.ByteString.Lazy
>>> import Text.Html.Encoding.Detection
>>> detect "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd..."
Just "UTF-8"
>>> detect "<meta charset=latin-1>..."
Just "latin-1"
>>> detect "\xbe\xee\xbc\xad\xbf\xc0\xbc\xbc\xbf\xe4..."
Just "EUC-KR"
~~~~

Note that the `detect` function takes a lazy bytestring, not a strict one.
Read the [API docs] for details.

[API docs]: https://hackage.haskell.org/package/html-charset/docs/Text-Html-Encoding-Detection.html
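
The three results in the example above line up with the three steps of the
precedence order: a BOM, a declaration in the markup, and finally heuristics.
As a rough illustration only, the fallback chain could be sketched as follows.
This is not the package's actual implementation: `Detector`, `sniffBom` and
`detectWith` are hypothetical names, the UTF-16 labels are illustrative, and
the markup scan and the heuristic step are passed in as parameters instead of
being implemented.

~~~~ haskell
import Control.Applicative ((<|>))
import qualified Data.ByteString.Lazy as BL

-- | A detection step: Nothing means "no answer, try the next step".
type Detector = BL.ByteString -> Maybe String

-- | Step 1 (hypothetical helper, not part of html-charset): recognise a
-- leading byte order mark.
sniffBom :: Detector
sniffBom bytes
  | BL.pack [0xEF, 0xBB, 0xBF] `BL.isPrefixOf` bytes = Just "UTF-8"
  | BL.pack [0xFE, 0xFF] `BL.isPrefixOf` bytes = Just "UTF-16BE"
  | BL.pack [0xFF, 0xFE] `BL.isPrefixOf` bytes = Just "UTF-16LE"
  | otherwise = Nothing

-- | Chain the three steps in precedence order; the first Just wins.
-- The markup step (assumed, supplied by the caller) only ever sees the
-- first 1024 bytes; the heuristic step sees the whole input.
detectWith :: Detector -> Detector -> Detector
detectWith metaStep heuristicStep bytes =
  sniffBom bytes
    <|> metaStep (BL.take 1024 bytes)
    <|> heuristicStep bytes
~~~~

Because `Maybe` short-circuits under `<|>`, a later step only runs when every
earlier one returned `Nothing`, and laziness means the heuristic step never
forces the rest of the input when a BOM or declaration already settles it.
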

CLI
---

We currently don't provide any official binaries.
The CLI program can be installed using Cabal or Stack: *[html-charset]*.

~~~~
$ curl https://www.haskell.org/onlinereport/ | html-charset
ASCII
$ curl http://www.bunka.go.jp/kokugo_nihongo/sisaku/joho/joho/ | html-charset
shift_jis
~~~~

Although it's unlikely, `html-charset` may fail to determine the character
encoding, in which case it prints nothing but a single line feed.
You can customize the string printed on failure with the
`-f`/`--on-failure` option.

Author and license
------------------

Written by [Hong Minhee]. Licensed under [LGPL 2.1] or later.

[Hong Minhee]: https://hongminhee.org/
[LGPL 2.1]: https://www.gnu.org/licenses/lgpl-2.1.html