https://github.com/curegit/unicodecheck
Simple tool to check if Unicode text files are Unicode-normalized
character-encoding text-normalization unicode
- Host: GitHub
- URL: https://github.com/curegit/unicodecheck
- Owner: curegit
- License: mit
- Created: 2023-10-20T00:22:01.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-10-26T08:03:31.000Z (9 months ago)
- Last Synced: 2025-01-23T02:37:04.345Z (6 months ago)
- Topics: character-encoding, text-normalization, unicode
- Language: Python
- Homepage: https://pypi.org/project/unicodecheck/
- Size: 51.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Unicodecheck
Simple tool to check if Unicode text files are Unicode-normalized
## Install
```sh
pip3 install unicodecheck
```
## Usage
### Quickstart
```sh
unicodecheck -iv SPAM.txt
```
To check files in a directory recursively:
```sh
unicodecheck -ivr Ham/Eggs/
```
### Synopsis
The main program can be invoked either through the `unicodecheck` command or through the Python main module option `python3 -m unicodecheck`.
```txt
usage: unicodecheck [-h] [-V] [-m {NFC,NFD,NFKC,NFKD}] [-d] [-u [NUMBER]] [-r] [-i] [-v]
                    PATH [PATH ...]
```
### Options
```txt
positional arguments:
  PATH                  describe input file or directory (pass '-' to specify stdin)

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -m {NFC,NFD,NFKC,NFKD}, --mode {NFC,NFD,NFKC,NFKD}
                        target Unicode normalization (default: NFC)
  -d, --diff            show diffs between the original and normalized (default: False)
  -u [NUMBER], -U [NUMBER], --unified [NUMBER]
                        show unified diffs with NUMBER lines of context [NUMBER=3] (default: False)
  -r, --recursive       follow the directory tree rooted in each PATH argument (default: False)
  -i, --include-hidden  include hidden files and directories (default: False)
  -b PATTERN [PATTERN ...], --blacklist PATTERN [PATTERN ...]
                        notify if having PATTERN (case-sensitive) (default: None)
  -e, --error           return non-zero exit code on detection (default: False)
  -v, --verbose         report non-essential logs (default: False)
```
## Tips
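For a quick check from Python without the CLI, the standard library's `unicodedata.is_normalized` (Python 3.8+) answers the same basic question. The sketch below is not this tool's implementation, and the helper name `find_unnormalized_lines` is hypothetical. It checks each line on its own, which is a simplification.

```python
import unicodedata


def find_unnormalized_lines(text, form="NFC"):
    """Return (line_number, line) pairs whose content is not in the given form.

    Hypothetical helper for illustration only; it is not part of unicodecheck.
    """
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if not unicodedata.is_normalized(form, line)
    ]


# Line 1 uses "e" + COMBINING ACUTE ACCENT, line 2 uses the precomposed U+00E9.
sample = "cafe\u0301\ncaf\u00e9\n"

for number, line in find_unnormalized_lines(sample, "NFC"):
    print(f"line {number} is not NFC: {line!r}")
```

To see what would change rather than only where, the CLI's `-d` and `-u` options are the better fit.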
### Check whether filenames are normalized
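This application checks file contents, not filenames. For filenames, the `convmv` command is a good alternative. Unless `--notest` is passed, `convmv` only reports what it would rename.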
#### NFC
```sh
convmv -f utf8 -t utf8 --nfc -r ./
```
#### NFD
```sh
convmv -f utf8 -t utf8 --nfd -r ./
```
## Notes
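As background for the first note below, here is a minimal sketch of how normalization can replace characters in ways that cannot be undone. It uses only Python's standard `unicodedata` module. The helper name `code_points` and the sample strings are chosen for illustration and do not come from this tool.

```python
import unicodedata


def code_points(text):
    """Format each character of text as a U+XXXX code point label."""
    return [f"U+{ord(char):04X}" for char in text]


# (description, original text, normalization form) chosen for illustration.
samples = [
    ("ANGSTROM SIGN", "\u212b", "NFC"),
    ("LATIN SMALL LIGATURE FI", "\ufb01", "NFKC"),
    ("x with SUPERSCRIPT TWO", "x\u00b2", "NFKC"),
]

for description, original, form in samples:
    normalized = unicodedata.normalize(form, original)
    print(f"{description} ({form}): {code_points(original)} -> {code_points(normalized)}")
```

In each case the normalized text no longer records which character was originally written, so rewriting a file in place would discard that information.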
- This tool does not normalize files in place (it offers no write mode), because Unicode normalization does not guarantee that the content remains equivalent.
- Binary files are detected using a procedure based on Git's algorithm.
## License
MIT