https://github.com/lorien/unicodec

Tools to detect encoding and convert HTML bytes content to Unicode.

Host: GitHub
URL: https://github.com/lorien/unicodec
Owner: lorien
License: mit
Created: 2022-12-18T16:50:17.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2022-12-20T00:42:31.000Z (over 2 years ago)
Last Synced: 2025-04-08T18:23:52.882Z (3 months ago)
Topics: charset, charset-detection, charset-detector, detect-encoding, encoding, encodings, html, html5, unicode, whatwg
Language: Python
Homepage:
Size: 70.3 KB
Stars: 3
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Unicodec Package Documentation

[![Test Status](https://github.com/lorien/unicodec/actions/workflows/test.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/test.yml)

[![Code Quality](https://github.com/lorien/unicodec/actions/workflows/check.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/check.yml)

[![Type Check](https://github.com/lorien/unicodec/actions/workflows/mypy.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/mypy.yml)

[![Test Coverage Status](https://coveralls.io/repos/github/lorien/unicodec/badge.svg)](https://coveralls.io/github/lorien/unicodec)

This package provides functions for:

- decoding the byte content of an HTML document into Unicode text

- detecting the encoding of the byte content of an HTML document

- normalizing an encoding name to its canonical form, according to the WHATWG HTML standard
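
To make the detection step more concrete, here is a minimal standard-library sketch of the kinds of signals an HTML encoding detector can look at: a byte order mark, the `charset` parameter of the `Content-Type` header, and a `<meta>` charset declaration near the start of the document. It is not how unicodec is implemented; the names `sniff_encoding` and `charset_from_header`, the 1024-byte scan window, and the `windows-1252` fallback are all assumptions made for illustration.

```python
import codecs
import re
from email.message import Message
from typing import Optional

# Byte order marks and the encodings they indicate.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
)

# Very rough pattern for <meta charset=...> and
# <meta http-equiv=... content="...; charset=...">.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([a-zA-Z0-9_:.-]+)",
    re.IGNORECASE,
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["content-type"] = content_type
    return msg.get_content_charset()


def sniff_encoding(
    raw: bytes,
    content_type: Optional[str] = None,
    default: str = "windows-1252",  # assumed fallback for this sketch
) -> str:
    """Hypothetical helper: guess an encoding label for HTML bytes."""
    # 1. A byte order mark wins over everything else.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # 2. Then the charset declared by the transport layer.
    from_header = charset_from_header(content_type)
    if from_header:
        return from_header
    # 3. Then a <meta> declaration in the first bytes of the document.
    match = _META_CHARSET.search(raw[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Otherwise fall back to the default.
    return default


print(sniff_encoding(b"<html>", "text/html; charset=KOI8-R"))
print(sniff_encoding(b'<meta charset="windows-1251">'))
```

The label returned by a sketch like this is still a raw, declared name; mapping it to a canonical encoding name is a separate step.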

Feel free to give feedback in Telegram groups: [@grablab](https://t.me/grablab) and [@grablab\_ru](https://t.me/grablab_ru).

## Installation

`pip install -U unicodec`

## Usage Example #1

Download a web document with urllib and decode its content to Unicode.

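```python
from urllib.request import urlopen

from unicodec import decode_content, detect_content_encoding

# Fetch the raw bytes of the page.
res = urlopen("http://lib.ru")
rawdata = res.read()

# Decode the bytes to text, using the Content-Type header as a hint.
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])

# Report which encoding was detected for the same bytes and header.
print(detect_content_encoding(rawdata, res.headers["content-type"]))
```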

Output:

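```
Lib.Ru: Библиотека Максима Мошкова
```

The text above is the page title found in the first 70 decoded characters (HTML markup not shown); the second `print` outputs the name of the detected encoding.

## Usage Example #2

Normalize encoding names to their canonical form.

```python
from unicodec import normalize_encoding_name

for name in ("iso8859-1", "utf8", "cp1251"):
    print("{} -> {}".format(name, normalize_encoding_name(name)))
```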


Output:

```

iso8859-1 -> windows-1252

utf8 -> utf-8

cp1251 -> windows-1251

```
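
The first mapping above may look surprising: the WHATWG rules treat the label `iso8859-1` as `windows-1252`. The two encodings differ only in the byte range 0x80 to 0x9F, where windows-1252 defines printable characters (curly quotes, dashes, and so on) and ISO 8859-1 defines control characters. A minimal sketch using only Python's built-in codecs, with a made-up byte string, shows the difference:

```python
# Bytes as they might appear in a page that declares "iso-8859-1"
# but was authored with windows-1252 punctuation (illustrative sample).
raw = b"\x93quoted\x94 \x96 caf\xe9"

# Strict ISO 8859-1: bytes 0x93, 0x94, 0x96 become invisible C1 control characters.
as_latin1 = raw.decode("latin-1")

# windows-1252: the same bytes become curly quotes and an en dash.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

Bytes outside that range, such as 0xE9 for `é`, decode identically under both encodings, which is why treating the label as windows-1252 is a safe choice for real-world HTML.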

## References

- https://docs.python.org/3/library/html.html

- https://docs.python.org/3/library/html.entities.html

- https://html.spec.whatwg.org/multipage/parsing.html

- https://encoding.spec.whatwg.org/#names-and-labels

- https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
