https://github.com/artoria2e5/mbcserrors
(WIP) Error handlers for Python's Windows MBCS encodings.
https://github.com/artoria2e5/mbcserrors
Last synced: 14 days ago
JSON representation
(WIP) Error handlers for Python's Windows MBCS encodings.
- Host: GitHub
- URL: https://github.com/artoria2e5/mbcserrors
- Owner: Artoria2e5
- License: mit
- Created: 2016-11-17T00:18:49.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2016-11-17T01:28:44.000Z (over 8 years ago)
- Last Synced: 2025-02-17T15:52:14.496Z (3 months ago)
- Language: Python
- Homepage:
- Size: 5.86 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
mbcserrors
==========Error handlers for Python's Windows MBCS encodings (`cp...`).
- **mbcsbestfit**: Use ["best fit"][bestfitreadme] replacements for encoders.
- **mbcsbestfitreplace**: Use Windows' replacement behavior.
- **mbcsbestfitignore**: Ignore unknown characters.
- **mbcsreplace**: Use Windows' replacement behavior for codecs.
- `fallback_handler(handler[, h2])`: Creates a handler with fallback.[bestfitreadme]: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
Usage
-----```Python
import mbcserrorsassert u'\u221e\ufffd'.encode('cp1252', errors='mbcsbestfitreplace') == b'8?'
assert u'\u221e\ufffd'.encode('cp1252', errors='mbcsbestfitignore') == b'8'
assert u'\u221e'.encode('cp1252', errors='mbcsbestfit') == b'8'
assert u'\u221e\ufffd'.encode('cp1252', errors='mbcsreplace') == b'??'# throws the error for \ufffd back at you
# u'\u221e\ufffd'.encode('cp1252', errors='mbcsbestfit') == b'8'
```Installation
------------`setup.py` will pull some data and do the compilation.
Best fit files
--------------A best fit file roughly corresponds to this structure:
```YAML
ast_ansi_file:
codepage: # key: cpnum
cpinfo:
bytes:
encode_replace:
decode_replace:
mbtable: # decode
sbcs:
_length:
single_mapping: list<[uint8_t -> wchar_t]>
dbcs:
_length:
dbcs_range_unit: # single range units
start:
end:
tables:
_length: end - start + 1
single_mapping: [uint8_t -> wchar_t]
wctable: # encode
_length:
single_mapping: [wchar_t -> (uint16_t if cpinfo.bytes == 2 else uint8_t)]
```To generate a "best fit" table:
1. Flatten `mbtable` into a single dict.
1. Copy all single-byte mappings into the dict.
2. For each element *trail* → *wchar* in the *i*-th table
in each range unit, `mbtable[(unit_start + i) << 8 + trail] = wchar`.
2. For each (*wchar*, *code*) pair in `wctable`, take the pair only if
`mbtable[wctable[wchar]] != wchar` (non-roundtrip).The replacement characters will be used for "replace" handlers.