{"id":20036313,"url":"https://github.com/lorien/unicodec","last_synced_at":"2025-05-05T05:31:50.427Z","repository":{"id":64972418,"uuid":"579715537","full_name":"lorien/unicodec","owner":"lorien","description":"Tools to detect encoding and convert HTML bytes content to Unicode.","archived":false,"fork":false,"pushed_at":"2022-12-20T00:42:31.000Z","size":72,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-08T18:23:52.882Z","etag":null,"topics":["charset","charset-detection","charset-detector","detect-encoding","encoding","encodings","html","html5","unicode","whatwg"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lorien.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-12-18T16:50:17.000Z","updated_at":"2022-12-25T21:19:35.000Z","dependencies_parsed_at":"2023-01-10T14:15:41.395Z","dependency_job_id":null,"html_url":"https://github.com/lorien/unicodec","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lorien%2Funicodec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lorien%2Funicodec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lorien%2Funicodec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lorien%2Funicodec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lorien","download_url":"https://codeload.github.com/lorien/unicodec/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252446427,"owners_count":21749232,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["charset","charset-detection","charset-detector","detect-encoding","encoding","encodings","html","html5","unicode","whatwg"],"created_at":"2024-11-13T10:11:52.398Z","updated_at":"2025-05-05T05:31:49.875Z","avatar_url":"https://github.com/lorien.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Unicodec Package Documentation\n\n[![Test Status](https://github.com/lorien/unicodec/actions/workflows/test.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/test.yml)\n[![Code Quality](https://github.com/lorien/unicodec/actions/workflows/check.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/test.yml)\n[![Type Check](https://github.com/lorien/unicodec/actions/workflows/mypy.yml/badge.svg)](https://github.com/lorien/unicodec/actions/workflows/mypy.yml)\n[![Test Coverage Status](https://coveralls.io/repos/github/lorien/unicodec/badge.svg)](https://coveralls.io/github/lorien/unicodec)\n\nThis package provides functions for:\n\n- decoding bytes content of HTML document into Unicode text\n- detecting encoding of bytes content of HTML document\n- normalization of encoding's name to canonical form, according to WHATWG HTML standard\n\nFeel free to give feedback in Telegram groups: [@grablab](https://t.me/grablab) and [@grablab\\_ru](https://t.me/grablab_ru).\n\n## Installation\n\n`pip install -U unicodec`\n\n## Usage Example #1\n\nDownload web document with urllib and convert its content to Unicode.\n\n```python\nfrom urllib.request import urlopen\n\nfrom unicodec import decode_content, detect_content_encoding\n\nres = urlopen(\"http://lib.ru\")\nrawdata = res.read()\ndata = decode_content(rawdata, content_type_header=res.headers[\"content-type\"])\nprint(data[:70])\nprint(detect_content_encoding(rawdata, res.headers[\"content-type\"]))\n```\n\nOutput:\n```\n\u003chtml\u003e\u003chead\u003e\u003ctitle\u003eLib.Ru: Библиотека Максима Мошкова\u003c/title\u003e\u003c/head\u003e\u003cb\nkoi8-r\n```\n\n## Usage Example #2\n\nDownload web document with urllib3 and convert its content to Unicode.\n\n```python\nfrom urllib3 import PoolManager\n\nfrom unicodec import decode_content, detect_content_encoding\n\nres = PoolManager().urlopen(\"GET\", \"http://lib.ru\")\nrawdata = res.data\ndata = decode_content(rawdata, content_type_header=res.headers[\"content-type\"])\nprint(data[:70])\nprint(detect_content_encoding(rawdata, res.headers[\"content-type\"]))\n```\n\nOutput:\n```\n\u003chtml\u003e\u003chead\u003e\u003ctitle\u003eLib.Ru: Библиотека Максима Мошкова\u003c/title\u003e\u003c/head\u003e\u003cb\nkoi8-r\n```\n\n## Usage Example #3\n\nConvert names of encodings to canonical form (according to WHATWG HTML standard).\n\n```python\nfrom unicodec.normalization import normalize_encoding_name\n\nfor name in [\"iso8859-1\", \"utf8\", \"cp1251\"]:\n    print(\"{} -\u003e {}\".format(name, normalize_encoding_name(name)))\n```\n\nOutput:\n\n```\niso8859-1 -\u003e windows-1252\nutf8 -\u003e utf-8\ncp1251 -\u003e windows-1251\n```\n\n## References\n\n- https://docs.python.org/3/library/html.html\n- https://docs.python.org/3/library/html.entities.html\n- https://html.spec.whatwg.org/multipage/parsing.html\n- https://encoding.spec.whatwg.org/#names-and-labels\n- https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Florien%2Funicodec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Florien%2Funicodec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Florien%2Funicodec/lists"}