https://github.com/dahlia/html-charset


html-charset: Determine character encoding of HTML bytes
========================================================

[![Hackage](https://img.shields.io/hackage/v/html-charset.svg)][html-charset]

This provides a [Haskell library][html-charset] and a CLI executable to
determine character encoding (i.e., so-called "charset") from given HTML bytes.

The precedence order for determining the character encoding is:

1. A BOM (byte order mark) before any other data in the HTML document itself.
2. A `<meta>` declaration with a `charset` attribute or an `http-equiv`
   attribute set to `Content-Type` and a value set for `charset`.
   Note that it looks at only the first 1024 bytes.
3. [Mozilla's Charset Detectors][chardet] heuristics. To be specific,
   it delegates to the [charsetdetect-ae] package, a Haskell implementation
   of that.

[html-charset]: https://hackage.haskell.org/package/html-charset
[chardet]: https://www-archive.mozilla.org/projects/intl/chardet.html
[charsetdetect-ae]: https://hackage.haskell.org/package/charsetdetect-ae
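
The precedence order above amounts to "try each step in turn and keep the
first answer". The following is only a minimal sketch of that idea, not the
package's actual implementation: the names `Detector`, `boms`, `sniffBom`, and
`firstMatch` are hypothetical, the BOM table covers just the UTF-8 and UTF-16
marks, and the encoding labels are chosen for illustration.

~~~~ haskell
import Data.Foldable (asum)
import Data.Maybe (listToMaybe)
import Data.Word (Word8)
import qualified Data.ByteString.Lazy as BL

-- | A detection step (hypothetical type): it may or may not find an encoding.
type Detector = BL.ByteString -> Maybe String

-- | Byte order marks this sketch recognizes (an assumption, not the
-- package's own table).
boms :: [([Word8], String)]
boms =
  [ ([0xEF, 0xBB, 0xBF], "UTF-8")
  , ([0xFE, 0xFF], "UTF-16BE")
  , ([0xFF, 0xFE], "UTF-16LE")
  ]

-- | Step 1: report the encoding of a BOM found at the very start of the input.
sniffBom :: Detector
sniffBom bytes =
  listToMaybe
    [ name | (mark, name) <- boms, BL.pack mark `BL.isPrefixOf` bytes ]

-- | Run the steps in order and return the first successful result.
-- Laziness ensures that later steps are not evaluated once one succeeds.
firstMatch :: [Detector] -> Detector
firstMatch detectors bytes = asum [ step bytes | step <- detectors ]
~~~~

With two further detectors for the `<meta>` scan and the heuristic fallback
(not shown here), the whole chain would be a three-element list passed to
`firstMatch`, with `sniffBom` first.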

API
---

The package is available on Hackage: *[html-charset]*.

~~~~ haskell
>>> :set -XOverloadedStrings
>>> import Data.ByteString.Lazy
>>> import Text.Html.Encoding.Detection
>>> detect "\xef\xbb\xbf\xe4\xbd\xa0\xe5\xa5\xbd..."
Just "UTF-8"
>>> detect "<meta charset=latin-1>..."
Just "latin-1"
>>> detect "\xbe\xee\xbc\xad\xbf\xc0\xbc\xbc\xbf\xe4..."
Just "EUC-KR"
~~~~

Note that the `detect` function takes a lazy `ByteString`, not a strict one.
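
If the bytes at hand are strict, they can be converted with `fromStrict` from
`Data.ByteString.Lazy`. The helper below is a small sketch; `onStrict` is a
hypothetical name, and it is written against any function over lazy bytes so
that it does not assume anything about the result type of `detect`.

~~~~ haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- | Adapt a function that expects lazy bytes so it accepts strict bytes.
-- 'BL.fromStrict' wraps the strict chunk without copying it.
onStrict :: (BL.ByteString -> r) -> B.ByteString -> r
onStrict f = f . BL.fromStrict
~~~~

Under that assumption, `onStrict detect` would accept a strict `ByteString`.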

Read the [API docs] for details.

[API docs]: https://hackage.haskell.org/package/html-charset/docs/Text-Html-Encoding-Detection.html

CLI
---

We currently don't provide any official binaries.
The CLI program can be installed using Cabal or Stack: *[html-charset]*.

~~~~
$ curl https://www.haskell.org/onlinereport/ | html-charset
ASCII
$ curl http://www.bunka.go.jp/kokugo_nihongo/sisaku/joho/joho/ | html-charset
shift_jis
~~~~

Although it's unlikely, `html-charset` may fail to determine the character
encoding, in which case it prints nothing except a single line feed.
You can customize the string printed on failure with the
`-f`/`--on-failure` option.
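
The same fallback can be reproduced when calling the library from Haskell.
This is a sketch of the behaviour described above, not the CLI's source code;
`formatResult` is a hypothetical helper whose first argument plays the role of
the `--on-failure` string.

~~~~ haskell
import Data.Maybe (fromMaybe)

-- | Turn a detection result into one line of output.
-- On failure the given fallback string is used instead; with an empty
-- fallback the output is just a line feed.
formatResult :: String -> Maybe String -> String
formatResult onFailure result = fromMaybe onFailure result ++ "\n"
~~~~

For example, `formatResult "" Nothing` yields `"\n"`, while
`formatResult "unknown" Nothing` yields `"unknown\n"`.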

Author and license
------------------

Written by [Hong Minhee]. Licensed under [LGPL 2.1] or higher.

[Hong Minhee]: https://hongminhee.org/
[LGPL 2.1]: https://www.gnu.org/licenses/lgpl-2.1.html