https://github.com/jaymon/unidefy

Python module to substitute recognized unicode characters with ascii characters

Host: GitHub
URL: https://github.com/jaymon/unidefy
Owner: Jaymon
Created: 2013-04-26T00:55:43.000Z (over 13 years ago)
Default Branch: master
Last Pushed: 2013-06-27T01:53:17.000Z (about 13 years ago)
Last Synced: 2025-01-08T22:38:18.608Z (over 1 year ago)
Language: Python
Homepage:
Size: 146 KB
Stars: 1
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# Unidefy -- substitute unicode with ascii equivalents, if available

I wrote this primarily to normalize certain data for searching. The problem is that
certain characters, like `ü` (a `u` with an umlaut), are hard to type on a normal keyboard,
so most people don't bother. Running this module on the indexed strings allows
either `u` or `ü` to be used: the `u` isn't changed and the `ü` is
changed to `u`, so you can search both ways but only have to store one version.
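
The idea above can be sketched in a few lines. This is only an illustration, not this module's API: the `SUBSTITUTIONS` table and the `to_ascii_where_known` and `search` helpers are hypothetical names, and the table is deliberately tiny.

```python
# Hypothetical substitution table, for illustration only. The real module
# uses a much larger table and its function names may differ.
SUBSTITUTIONS = {
    "ü": "u",
    "Ü": "U",
    "é": "e",
    "ñ": "n",
}


def to_ascii_where_known(text):
    """Replace characters found in the table; leave everything else alone."""
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in text)


# Normalize once at index time, so only one version of each key is stored.
NAMES = ["Müller", "Zoë", "北京"]
INDEX = {to_ascii_where_known(name).lower(): name for name in NAMES}


def search(query):
    """Normalize the query the same way, then look it up."""
    return INDEX.get(to_ascii_where_known(query).lower())


print(search("Muller"))  # found via the plain `u` spelling
print(search("Müller"))  # found via the umlaut spelling
print(search("北京"))    # characters without a substitution still match as-is
```

Because the same function runs on both the stored strings and the query, either spelling lands on the same key.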

There are other modules that help with this, but none did quite what I needed. Python's builtin
`unicodedata` only uses the defined [unicode normalizations](http://en.wikipedia.org/wiki/Unicode_normalization).
Something like [unidecode](https://pypi.python.org/pypi/Unidecode) works,
but it's a little too eager: it gets rid of unicode chars it doesn't recognize, and it turns
Chinese characters into English, which I didn't really want either. I wanted to keep any unicode
that doesn't have a good substitution. That said, `unidecode` is definitely worth a look if you
want more aggressive substitution.
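
To make the `unicodedata` limitation concrete, here is a small sketch using only the standard library. The `strip_marks` helper is a hypothetical name, not part of this module; it shows the common "decompose, then drop combining marks" approach, which only helps for characters that have a defined decomposition.

```python
import unicodedata


def strip_marks(text):
    """Decompose with NFKD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_marks("ü"))    # u  (ü decomposes into u + a combining diaeresis)
print(strip_marks("ø"))    # ø  (no decomposition is defined, so nothing changes)
print(strip_marks("ß"))    # ß  (same: there is no normalization to fall back on)
print(strip_marks("北京"))  # 北京 (left untouched)
```

Characters like `ø` and `ß` are the ones that need an explicit substitution table rather than a normalization form.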

To install, use pip:

    pip install git+https://github.com/Jaymon/unidefy#egg=unidefy

## More reading, if you're interested

This is a dump of all the links I had open/used while I wrote this module:

http://stackoverflow.com/questions/12944678/using-unicodedata-normalize-in-python-2-7

http://code.activestate.com/recipes/251871/

http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database

http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/unicode.html

http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UserContribution/index.html

http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/

http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm

http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

http://stackoverflow.com/questions/4808967/replacing-unicode-punctuation-with-ascii-approximations

http://stackoverflow.com/questions/138449/how-to-convert-a-unicode-character-to-its-ascii-equivalent

http://stackoverflow.com/questions/5651124/is-there-a-way-to-dumb-down-text-from-unicode-to-ascii

## I got the data for the substitution table from these locations

http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UserContribution/asciiConversion.html

http://unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml

http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/MapTables/CoreNormResults.html
