https://github.com/magnetikonline/py-encoding-detect
Python module for detecting common text file encodings.
https://github.com/magnetikonline/py-encoding-detect
encoding file python-module text
Last synced: about 2 months ago
JSON representation
Python module for detecting common text file encodings.
- Host: GitHub
- URL: https://github.com/magnetikonline/py-encoding-detect
- Owner: magnetikonline
- License: mit
- Created: 2017-11-22T12:46:29.000Z (over 7 years ago)
- Default Branch: main
- Last Pushed: 2022-01-11T04:48:50.000Z (over 3 years ago)
- Last Synced: 2025-04-06T20:25:33.528Z (3 months ago)
- Topics: encoding, file, python-module, text
- Language: Python
- Homepage:
- Size: 6.84 KB
- Stars: 6
- Watchers: 2
- Forks: 2
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Encoding detect
Python module for detecting the following encodings of a text file:
- `ASCII`
- `UTF-8`
- `UTF-16BE`
- `UTF-16LE`Will validate `UTF-8/16` files both with/without a [byte order mark](https://en.wikipedia.org/wiki/Byte_order_mark) (BOM) present.
- [Usage](#usage)
- [Detection methods](#detection-methods)
- [Byte order mark (BOM)](#byte-order-mark-bom)
- [ASCII/UTF-8](#asciiutf-8)
- [UTF-16BE/UTF-16LE](#utf-16beutf-16le)
- [Test](#test)
- [Reference](#reference)## Usage
Module [`encdect.py`](encdect.py) provides a single `EncodingDetectFile` class and a [`load()`](encdect.py#L144) method:
- Successful detection returns a tuple of `(encoding,bom_marker,file_unicode)`.
- Failure (unable to determine) returns `False`.Example:
```python
from encdect import EncodingDetectFiledetect = EncodingDetectFile()
result = detect.load('./test/file/utf-8-bom.txt')if (result):
print(result)
# ('utf_8', '\xef\xbb\xbf', u'Test string \U0001f44d\u263a\n')
```Updating a text file, preserving encoding/BOM (if present):
```python
from encdect import EncodingDetectFile
UPPERCASE_WORD = 'string'detect = EncodingDetectFile()
result = detect.load('./test/file/utf-8.txt')if (result):
encoding,bom_marker,file_decode = resultprint(type(file_decode))
#file_decode = file_decode.replace(
UPPERCASE_WORD,
UPPERCASE_WORD.upper()
)fh = open('./output.txt','w')
if (bom_marker):
fh.write(bom_marker)fh.write(file_decode.encode(encoding))
fh.close()
```## Detection methods
Routines used are based on the work of the C++/C# library https://github.com/AutoIt/text-encoding-detect, with minor tweaks/optimizations.
If _all steps_ are passed without a positive result then detection is considered _not possible_.
An overview of detection steps in the order tested:
### Byte order mark (BOM)
Looks for a byte order mark in the first 2-3 bytes of the file, in the following order:
- `UTF-16BE` (2 bytes)
- `UTF-16LE` (2 bytes)
- `UTF-8` (3 bytes, fairly rare)If BOM is found it is assumed to be valid and detection finishes.
### ASCII/UTF-8
If no BOM found, determine if file is either `ASCII` or `UTF-8`:
- Single byte read from file.
- Value determines how many additional bytes [define the character](https://en.wikipedia.org/wiki/UTF-8#Codepage_layout):
- `1 -> 127` no additional (ASCII).
- `194 -> 223` 1 additional.
- `224 -> 239` 2 additional.
- `240 -> 244` 3 additional.
- Additional bytes walked over - each must be within the bounds of `128 -> 191`.
- Return to first step and repeat until end of file.If end of file reached and above rules remain true:
- With all bytes between range of `1 -> 127` result of `ASCII`.
- Else result of `UTF-8`.If rules were not met, move onto next detection.
### UTF-16BE/UTF-16LE
Final step for `UTF-16` detection involves two methods.
#### Method 1
End of line (EOL) characters (`\r\n`) are counted in odd/even positions of the file stream:
- If all EOL characters are in _even_ file positions return result of `UTF-16BE`.
- Alternatively if all EOL characters are in _odd_ file positions return result of `UTF-16LE`.#### Method 2
Relying on the fact that text files generally have a high ratio of characters in the `1 -> 127` range, two byte sequences of `[0,1 -> 127]` or `[1 -> 127,0]` should be common:
- Total of null bytes are counted in both odd and even positions.
- If odd count _above_ positive threshold and even count _below_ negative threshold, return result of `UTF-16BE`.
- If odd count _below_ negative threshold and even count _above_ positive threshold, return result of `UTF-16LE`.## Test
A detection of sample files with various encoding formats can be run via [`test/detect.py`](test/detect.py).
## Reference
- https://docs.python.org/2/howto/unicode.html
- https://docs.python.org/2/library/codecs.html
- https://github.com/AutoIt/text-encoding-detect
- https://en.wikipedia.org/wiki/Endianness