{"id":19853866,"url":"https://github.com/magnetikonline/py-encoding-detect","last_synced_at":"2025-07-25T01:07:22.658Z","repository":{"id":147825672,"uuid":"111683264","full_name":"magnetikonline/py-encoding-detect","owner":"magnetikonline","description":"Python module for detecting common text file encodings.","archived":false,"fork":false,"pushed_at":"2022-01-11T04:48:50.000Z","size":7,"stargazers_count":6,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-02T01:37:27.526Z","etag":null,"topics":["encoding","file","python-module","text"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/magnetikonline.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-11-22T12:46:29.000Z","updated_at":"2022-10-11T17:04:07.000Z","dependencies_parsed_at":"2023-05-27T15:15:32.806Z","dependency_job_id":null,"html_url":"https://github.com/magnetikonline/py-encoding-detect","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/magnetikonline/py-encoding-detect","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/magnetikonline%2Fpy-encoding-detect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/magnetikonline%2Fpy-encoding-detect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/magnetikonline%2Fpy-encoding-detect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/magnetikonline%2Fpy-encoding-detect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/magnetikonline","download_url":"https://codeload.github.com/magnetikonline/py-encoding-detect/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/magnetikonline%2Fpy-encoding-detect/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266936386,"owners_count":24009409,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-24T02:00:09.469Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["encoding","file","python-module","text"],"created_at":"2024-11-12T14:07:53.291Z","updated_at":"2025-07-25T01:07:22.619Z","avatar_url":"https://github.com/magnetikonline.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Encoding detect\n\nPython module for detecting the following encodings of a text file:\n- `ASCII`\n- `UTF-8`\n- `UTF-16BE`\n- `UTF-16LE`\n\nWill validate `UTF-8/16` files both with/without a [byte order mark](https://en.wikipedia.org/wiki/Byte_order_mark) (BOM) present.\n\n- [Usage](#usage)\n- [Detection methods](#detection-methods)\n\t- [Byte order mark (BOM)](#byte-order-mark-bom)\n\t- [ASCII/UTF-8](#asciiutf-8)\n\t- [UTF-16BE/UTF-16LE](#utf-16beutf-16le)\n- [Test](#test)\n- [Reference](#reference)\n\n## Usage\n\nModule [`encdect.py`](encdect.py) 
provides a single `EncodingDetectFile` class and a [`load()`](encdect.py#L144) method:\n- Successful detection returns a tuple of `(encoding,bom_marker,file_unicode)`.\n- Failure (unable to determine) returns `False`.\n\nExample:\n\n```python\nfrom encdect import EncodingDetectFile\n\ndetect = EncodingDetectFile()\nresult = detect.load('./test/file/utf-8-bom.txt')\n\nif (result):\n\tprint(result)\n\t# ('utf_8', '\\xef\\xbb\\xbf', u'Test string \\U0001f44d\\u263a\\n')\n```\n\nUpdating a text file, preserving encoding/BOM (if present):\n\n```python\nfrom encdect import EncodingDetectFile\nUPPERCASE_WORD = 'string'\n\ndetect = EncodingDetectFile()\nresult = detect.load('./test/file/utf-8.txt')\n\nif (result):\n\tencoding,bom_marker,file_decode = result\n\n\tprint(type(file_decode))\n\t# \u003ctype 'unicode'\u003e\n\n\tfile_decode = file_decode.replace(\n\t\tUPPERCASE_WORD,\n\t\tUPPERCASE_WORD.upper()\n\t)\n\n\t# binary mode: the BOM and encoded text are written as raw bytes\n\tfh = open('./output.txt','wb')\n\n\tif (bom_marker):\n\t\tfh.write(bom_marker)\n\n\tfh.write(file_decode.encode(encoding))\n\tfh.close()\n```\n\n## Detection methods\n\nRoutines used are based on the work of the C++/C# library https://github.com/AutoIt/text-encoding-detect, with minor tweaks/optimizations.\n\nIf _all steps_ complete without a positive result then detection is considered _not possible_.\n\nAn overview of detection steps in the order tested:\n\n### Byte order mark (BOM)\n\nLooks for a byte order mark in the first 2-3 bytes of the file, in the following order:\n\n- `UTF-16BE` (2 bytes)\n- `UTF-16LE` (2 bytes)\n- `UTF-8` (3 bytes, fairly rare)\n\nIf a BOM is found it is assumed to be valid and detection finishes.\n\n### ASCII/UTF-8\n\nIf no BOM is found, determine whether the file is `ASCII` or `UTF-8`:\n\n- Single byte read from file.\n- Value determines how many additional bytes [define the character](https://en.wikipedia.org/wiki/UTF-8#Codepage_layout):\n\t- `1 -\u003e 127` no additional (ASCII).\n\t- `194 -\u003e 223` 1 additional.\n\t- `224 -\u003e 
239` 2 additional.\n\t- `240 -\u003e 244` 3 additional.\n- Additional bytes walked over - each must be within the bounds of `128 -\u003e 191`.\n- Return to first step and repeat until end of file.\n\nIf the end of file is reached and the above rules held throughout:\n\n- If all bytes are within the range `1 -\u003e 127`, the result is `ASCII`.\n- Otherwise the result is `UTF-8`.\n\nIf the rules were not met, move on to the next detection step.\n\n### UTF-16BE/UTF-16LE\n\nFinal step for `UTF-16` detection involves two methods.\n\n#### Method 1\n\nEnd of line (EOL) characters (`\\r\\n`) are counted in odd/even positions of the file stream:\n\n- If all EOL characters are in _even_ file positions return result of `UTF-16BE`.\n- Alternatively if all EOL characters are in _odd_ file positions return result of `UTF-16LE`.\n\n#### Method 2\n\nRelying on the fact that text files generally have a high ratio of characters in the `1 -\u003e 127` range, two-byte sequences of `[0,1 -\u003e 127]` or `[1 -\u003e 127,0]` should be common:\n\n- Null bytes are counted in both odd and even positions.\n- If odd count _above_ positive threshold and even count _below_ negative threshold, return result of `UTF-16BE`.\n- If odd count _below_ negative threshold and even count _above_ positive threshold, return result of `UTF-16LE`.\n\n## Test\n\nDetection against sample files in various encoding formats can be run via [`test/detect.py`](test/detect.py).\n\n## Reference\n\n- https://docs.python.org/2/howto/unicode.html\n- https://docs.python.org/2/library/codecs.html\n- https://github.com/AutoIt/text-encoding-detect\n- https://en.wikipedia.org/wiki/Endianness\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmagnetikonline%2Fpy-encoding-detect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmagnetikonline%2Fpy-encoding-detect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmagnetikonline%2Fpy-encoding-detect/lists"}