# Unicodec Package Documentation

[![Test Status](https://github.com/lorien/unicodec/actions/workflows/test.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/test.yml)
[![Code Quality](https://github.com/lorien/unicodec/actions/workflows/check.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/check.yml)
[![Type Check](https://github.com/lorien/unicodec/actions/workflows/mypy.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/mypy.yml)
[![Test Coverage Status](https://coveralls.io/repos/github/lorien/unicodec/badge.svg)](https://coveralls.io/github/lorien/unicodec)

This package provides functions for:

- decoding the byte content of an HTML document into Unicode text
- detecting the encoding of an HTML document's byte content
- normalizing an encoding name to its canonical form, according to the WHATWG HTML standard
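
To make the detection task concrete, here is a minimal standard-library sketch of the two most common encoding hints for an HTML response: the `charset` parameter of the `Content-Type` header and a `<meta>` charset declaration near the start of the document. This is not unicodec's implementation. The helper names `charset_from_header` and `charset_from_meta` are hypothetical, and the regular expression is a deliberate simplification of the prescan described in the WHATWG HTML standard.

```python
import re
from email.message import Message
from typing import Optional

# Simplified pattern (an assumption, not the WHATWG algorithm): finds
# charset=... inside a <meta> tag, covering both <meta charset="x"> and
# <meta http-equiv="Content-Type" content="text/html; charset=x">.
META_CHARSET_RE = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([a-zA-Z0-9_.:-]+)", re.IGNORECASE
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # the stdlib lowercases the value


def charset_from_meta(raw: bytes, limit: int = 1024) -> Optional[str]:
    """Return the charset declared in a <meta> tag within the first `limit` bytes."""
    match = META_CHARSET_RE.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


print(charset_from_header("text/html; charset=Windows-1251"))
print(charset_from_meta(b'<html><head><meta charset="koi8-r"></head>'))
```

The two hints can disagree, and the names they carry are often non-canonical labels, which is why decoding, detection, and name normalization are offered as separate functions.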

Feel free to give feedback in the Telegram groups [@grablab](https://t.me/grablab) and [@grablab\_ru](https://t.me/grablab_ru).

## Installation

`pip install -U unicodec`

## Usage Example #1

Download a web document with urllib and convert its content to Unicode.

```python
from urllib.request import urlopen

from unicodec import decode_content, detect_content_encoding

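# Fetch the page; the server's Content-Type header may carry a charset hint.
res = urlopen("http://lib.ru")
rawdata = res.read()
# Decode the raw bytes to str, passing the header along as a hint.
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
# Report which encoding is detected for the same bytes and header.
print(detect_content_encoding(rawdata, res.headers["content-type"]))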
```

Output:
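```
Lib.Ru: Библиотека Максима МошковаLib.Ru: Библиотека Максима Мошкова
```

## Usage Example #2

Normalize encoding names to their canonical form.

```python
from unicodec import normalize_encoding_name

for name in ["iso8859-1", "utf8", "cp1251"]:
    print("{} -> {}".format(name, normalize_encoding_name(name)))
```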

Output:

```
iso8859-1 -> windows-1252
utf8 -> utf-8
cp1251 -> windows-1251
```
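
The first mapping may look surprising. The WHATWG Encoding standard treats `iso8859-1` as a label for `windows-1252`, whereas Python's built-in codecs keep the two encodings distinct. The sketch below uses only the standard library (not unicodec) to show the practical difference: bytes in the range 0x80 to 0x9F decode to invisible C1 control characters under `latin-1` but to printable punctuation under `cp1252`. The sample bytes are made up for illustration.

```python
# Sample bytes containing 0x93, 0x94 and 0x80, which differ between the two encodings.
raw = b"\x93quoted\x94 \x80 5"

as_latin1 = raw.decode("latin-1")  # ISO-8859-1: 0x80-0x9F map to C1 control characters
as_cp1252 = raw.decode("cp1252")   # windows-1252: the same bytes are curly quotes and the euro sign

print(repr(as_latin1))
print(repr(as_cp1252))
```

Decoding a page labelled `iso8859-1` as windows-1252 therefore recovers the curly quotes and euro sign that a strict ISO-8859-1 decode would turn into control characters.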

## References

- https://docs.python.org/3/library/html.html
- https://docs.python.org/3/library/html.entities.html
- https://html.spec.whatwg.org/multipage/parsing.html
- https://encoding.spec.whatwg.org/#names-and-labels
- https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html