https://github.com/lorien/unicodec
Tools to detect encoding and convert HTML bytes content to Unicode.
- Host: GitHub
- URL: https://github.com/lorien/unicodec
- Owner: lorien
- License: mit
- Created: 2022-12-18T16:50:17.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2022-12-20T00:42:31.000Z (about 2 years ago)
- Last Synced: 2024-11-05T13:07:38.431Z (about 2 months ago)
- Topics: charset, charset-detection, charset-detector, detect-encoding, encoding, encodings, html, html5, unicode, whatwg
- Language: Python
- Homepage:
- Size: 70.3 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Unicodec Package Documentation
[![Test Status](https://github.com/lorien/unicodec/actions/workflows/test.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/test.yml)
[![Code Quality](https://github.com/lorien/unicodec/actions/workflows/check.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/check.yml)
[![Type Check](https://github.com/lorien/unicodec/actions/workflows/mypy.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/mypy.yml)
[![Test Coverage Status](https://coveralls.io/repos/github/lorien/unicodec/badge.svg)](https://coveralls.io/github/lorien/unicodec)

This package provides functions for:
- decoding the byte content of an HTML document into Unicode text
- detecting the encoding of an HTML document's byte content
- normalizing an encoding name to its canonical form, according to the WHATWG HTML standard

Feel free to give feedback in the Telegram groups: [@grablab](https://t.me/grablab) and [@grablab\_ru](https://t.me/grablab_ru).
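
To illustrate why name normalization matters, here is a minimal sketch that uses only Python's built-in codecs, not unicodec itself. The WHATWG Encoding standard treats the `iso-8859-1` label as `windows-1252`, and the two encodings disagree on bytes `0x80` to `0x9F`. The sample bytes and variable names below are made up for the illustration.

```python
# Illustration only: Python's built-in codecs, not the unicodec API.
# Bytes 0x93 and 0x94 are curly quotes in windows-1252,
# but C1 control characters in ISO-8859-1.
raw = b"\x93quoted\x94"

as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("windows-1252")

# ascii() escapes non-ASCII characters, so the difference is visible
# on any terminal.
print(ascii(as_latin1))  # '\x93quoted\x94'
print(ascii(as_cp1252))  # '\u201cquoted\u201d'
```

Pages labelled `iso-8859-1` often contain such bytes, so decoding them as `windows-1252` gives the text the author intended instead of control characters.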
## Installation
`pip install -U unicodec`
## Usage Example #1
Download a web document with urllib and convert its content to Unicode.
```python
from urllib.request import urlopen

from unicodec import decode_content, detect_content_encoding

res = urlopen("http://lib.ru")
rawdata = res.read()
data = decode_content(rawdata, content_type_header=res.headers["content-type"])
print(data[:70])
print(detect_content_encoding(rawdata, res.headers["content-type"]))
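```
Output:
```
Lib.Ru: Библиотека Максима Мошкова
```
## Usage Example #2
Normalize encoding names to their canonical form.
```python
from unicodec import normalize_encoding_name

for name in ["iso8859-1", "utf8", "cp1251"]:
    print("{} -> {}".format(name, normalize_encoding_name(name)))
```
Output: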
```
iso8859-1 -> windows-1252
utf8 -> utf-8
cp1251 -> windows-1251
```
## References
- https://docs.python.org/3/library/html.html
- https://docs.python.org/3/library/html.entities.html
- https://html.spec.whatwg.org/multipage/parsing.html
- https://encoding.spec.whatwg.org/#names-and-labels
- https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html