https://github.com/currentslab/flawunicode
Detect unreadable unicode text
- Host: GitHub
- URL: https://github.com/currentslab/flawunicode
- Owner: currentslab
- License: mit
- Created: 2022-11-05T04:59:24.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-11-06T09:36:10.000Z (over 3 years ago)
- Last Synced: 2025-12-16T23:58:35.160Z (6 months ago)
- Topics: natural-language-processing, python, text, text-mining, unicode
- Language: Python
- Homepage: https://pypi.org/project/flawunicode/
- Size: 30.6 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# flawunicode
Detect unreadable unicode text
Ever encountered text like this when crawling the internet or digging through your raw corpus?
```
srtytyrtyrty
Á¶ÀÌ½ÃÆ¼, ¡®3on3 ÇÁ¸®½ºÅ¸ÀÏ¡¯ 2Á¾ÀÇ ¿¡µð¼Ç ¹øµé Ãâ½Ã
��>+ٽT}$@�������Э����ٗ_���=���e��
```
flawunicode aims to pick these out for you. It scores each Unicode text from -1 to 1, where the score indicates the "completeness" of the text. A score lower than 0.4 means the text is likely not readable by a human.
## Usage
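```python
import flawunicode

# Random keyboard mashing scores below the 0.4 readability threshold.
text = "fdsfdxvdhjkf"
print(flawunicode.detect(text))  # 0.2727272727272727

# Ordinary English scores above the threshold.
print(flawunicode.detect("Hello World!"))  # 0.6439393939393939
```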
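To apply the 0.4 threshold across a whole corpus, a small filter along these lines may be enough. This is only a sketch: `filter_readable`, `score`, and `threshold` are hypothetical names that are not part of flawunicode, and the scorer is passed in as a plain callable so the helper does not depend on the package itself.

```python
from typing import Callable, Iterable, List


def filter_readable(
    texts: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.4,
) -> List[str]:
    """Keep only the texts whose score is at or above the threshold.

    `score` is any function that maps a string to a float; the
    assumption here is that flawunicode.detect is passed in.
    """
    return [text for text in texts if score(text) >= threshold]
```

With the package installed, `filter_readable(corpus, flawunicode.detect)` would drop the texts that score below 0.4. The cutoff is a parameter because the right value probably depends on your data.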
## Note
The underlying statistics come from a news corpus in the [currents api](https://currentsapi.services/en) database, so social-network-style text may receive a low score. If that affects your data, calculate your own frequently used character bigrams from a corpus that matches your domain and it should be fine.
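
Counting character bigrams over your own corpus could look roughly like the sketch below. `top_char_bigrams` and its parameter `k` are hypothetical and not part of flawunicode. How the resulting statistics are fed back into the library is not covered here, so this shows only the counting step.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_char_bigrams(texts: Iterable[str], k: int = 1000) -> List[Tuple[str, int]]:
    """Return the k most frequent adjacent character pairs in texts."""
    counts: Counter = Counter()
    for text in texts:
        # zip(text, text[1:]) yields every pair of neighbouring characters.
        counts.update(a + b for a, b in zip(text, text[1:]))
    return counts.most_common(k)
```

For example, `top_char_bigrams(["abab", "ab"], 2)` returns `[("ab", 3), ("ba", 1)]`.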