Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/keredson/wordninja
Probabilistically split concatenated words using NLP based on English Wikipedia unigram frequencies.
- Host: GitHub
- URL: https://github.com/keredson/wordninja
- Owner: keredson
- License: mit
- Created: 2017-04-20T22:05:42.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2023-02-19T00:44:37.000Z (about 2 years ago)
- Last Synced: 2024-10-29T15:49:01.088Z (4 months ago)
- Language: Python
- Homepage:
- Size: 740 KB
- Stars: 807
- Watchers: 10
- Forks: 108
- Open Issues: 18
Metadata Files:
- Readme: README.md
- License: LICENSE
README

Word Ninja
==========

Slice your munged-together words! Seriously, take anything: `'imateapot'`, for example, would become `['im', 'a', 'teapot']`. Useful for humanizing stuff (like database tables when people don't like underscores).

This project repackages the excellent work from here: http://stackoverflow.com/a/11642687/2449774
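
To give a feel for how this kind of splitting can work, here is a minimal sketch of a dynamic-programming splitter with a Zipf-style cost, where a word's cost grows with the log of its rank in a frequency-ordered list. It is an illustration under stated assumptions, not the package's actual source: the toy vocabulary `WORDS` and the function name `split_words` are hypothetical.

```python
from math import log

# Hypothetical toy vocabulary, most frequent word first.
# The real package ships a far larger list based on English Wikipedia.
WORDS = ["the", "a", "in", "im", "tea", "pot", "teapot", "ate"]

# Zipf-style cost: frequent (low-rank) words are cheap, rare words are expensive.
WORD_COST = {w: log((rank + 1) * log(len(WORDS))) for rank, w in enumerate(WORDS)}
MAX_WORD_LEN = max(len(w) for w in WORDS)
UNKNOWN_COST = float("inf")  # substrings outside the vocabulary are never preferred


def split_words(text):
    """Return the lowest-cost split of text into vocabulary words."""
    # best[i] holds (total cost, length of last word) for the best split of text[:i].
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for k in range(1, min(i, MAX_WORD_LEN) + 1):
            word = text[i - k:i]
            cost = best[i - k][0] + WORD_COST.get(word, UNKNOWN_COST)
            candidates.append((cost, k))
        best.append(min(candidates))

    # Walk backwards through the table to recover the chosen words.
    out = []
    i = len(text)
    while i > 0:
        k = best[i][1]
        out.append(text[i - k:i])
        i -= k
    return out[::-1]


print(split_words("imateapot"))  # ['im', 'a', 'teapot']
```

With this toy list, `'teapot'` wins over `'tea'` + `'pot'` because one moderately rare word costs less than two.
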
Usage
-----
```
$ python
>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
>>> wordninja.split('imateapot')
['im', 'a', 'teapot']
>>> wordninja.split('heshotwhointhewhatnow')
['he', 'shot', 'who', 'in', 'the', 'what', 'now']
>>> wordninja.split('thequickbrownfoxjumpsoverthelazydog')
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```

Performance
-----------
It's super fast! 10,000 splits of `'imateapot'` take about 0.4 seconds:

```
>>> import timeit
>>> def f():
...     wordninja.split('imateapot')
...
>>> timeit.timeit(f, number=10000)
0.40885152100236155
```

It can handle long strings:
```
>>> wordninja.split('wethepeopleoftheunitedstatesinordertoformamoreperfectunionestablishjusticeinsuredomestictranquilityprovideforthecommondefencepromotethegeneralwelfareandsecuretheblessingsoflibertytoourselvesandourposteritydoordainandestablishthisconstitutionfortheunitedstatesofamerica')
['we', 'the', 'people', 'of', 'the', 'united', 'states', 'in', 'order', 'to', 'form', 'a', 'more', 'perfect', 'union', 'establish', 'justice', 'in', 'sure', 'domestic', 'tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'welfare', 'and', 'secure', 'the', 'blessings', 'of', 'liberty', 'to', 'ourselves', 'and', 'our', 'posterity', 'do', 'ordain', 'and', 'establish', 'this', 'constitution', 'for', 'the', 'united', 'states', 'of', 'america']
```
And it scales well. (This string takes ~7ms to compute.)

How to Install
--------------

```
pip3 install wordninja
```

Custom Language Models
----------------------
The #1 most requested feature! If you want to do something other than English (or want to specify your own model of English), this is how you do it:

```
>>> lm = wordninja.LanguageModel('my_lang.txt.gz')
>>> lm.split('derek')
['der', 'ek']
```

Language files must be gzipped text files with one word per line, in decreasing order of probability.
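
One way to produce a file in that format is sketched below, using only the standard library. The helper `write_language_file` and the word counts are hypothetical; in practice the counts would come from a corpus in your target language.

```python
import gzip
import os
import tempfile
from collections import Counter


def write_language_file(counts, path):
    """Write words to a gzipped text file, one per line, most frequent first."""
    # most_common() orders words by decreasing count.
    ranked = [word for word, _ in Counter(counts).most_common()]
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for word in ranked:
            fh.write(word + "\n")
    return ranked


# Hypothetical counts, written to a temporary directory for illustration.
counts = {"der": 120, "ek": 45, "derek": 3}
path = os.path.join(tempfile.mkdtemp(), "my_lang.txt.gz")
ranked = write_language_file(counts, path)
print(ranked)  # ['der', 'ek', 'derek']
```

The resulting path can then be passed to `wordninja.LanguageModel(...)` as shown above.
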
If you want to make your model the default, set:
```
wordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('my_lang.txt.gz')
```