Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/keredson/wordninja

Probabilistically split concatenated words using NLP based on English Wikipedia unigram frequencies.

Host: GitHub
URL: https://github.com/keredson/wordninja
Owner: keredson
License: mit
Created: 2017-04-20T22:05:42.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2023-02-19T00:44:37.000Z (about 2 years ago)
Last Synced: 2024-10-29T15:49:01.088Z (4 months ago)
Language: Python
Homepage:
Size: 740 KB
Stars: 807
Watchers: 10
Forks: 108
Open Issues: 18
Metadata Files:
- Readme: README.md
- License: LICENSE

README

![image](https://user-images.githubusercontent.com/2049665/29219793-b4dcb942-7e7e-11e7-8785-761b0e784e04.png)

Word Ninja
==========

Slice your munged-together words! Seriously, take anything: `'imateapot'`, for example, becomes `['im', 'a', 'teapot']`. Useful for humanizing stuff (like database tables when people don't like underscores).

This project is repackaging the excellent work from here: http://stackoverflow.com/a/11642687/2449774
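To make the idea concrete, here is a minimal sketch of a unigram-based splitter, assuming a Zipf-style cost (a word's cost grows with the log of its frequency rank) and a dynamic-programming search for the cheapest segmentation. The names `build_costs` and `split_words` and the toy vocabulary are hypothetical; this illustrates the general technique, not wordninja's actual source.

```python
from math import log

def build_costs(ranked_words):
    """Map each word to a cost, assuming Zipf's law over frequency rank.

    `ranked_words` must be ordered from most to least frequent.
    """
    n = len(ranked_words)
    return {w: log((rank + 1) * log(n)) for rank, w in enumerate(ranked_words)}

def split_words(text, costs):
    """Return the lowest-cost segmentation of `text` into known words."""
    max_len = max(len(w) for w in costs)
    unknown = float('inf')  # substrings outside the vocabulary are never preferred

    # best[i] holds (total cost, length of last word) for the prefix text[:i].
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for k in range(1, min(i, max_len) + 1):
            word = text[i - k:i]
            candidates.append((best[i - k][0] + costs.get(word, unknown), k))
        best.append(min(candidates))

    # Walk backwards through the table to recover the chosen words.
    words = []
    i = len(text)
    while i > 0:
        k = best[i][1]
        words.append(text[i - k:i])
        i -= k
    return words[::-1]

# Toy vocabulary, most frequent first (an assumption for illustration only).
toy_costs = build_costs(['the', 'a', 'in', 'im', 'teapot'])
print(split_words('imateapot', toy_costs))  # ['im', 'a', 'teapot']
```

Because each prefix considers at most `max_len` candidate words, the running time in this sketch grows linearly with the length of the input string.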

Usage
-----

```
$ python
>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
>>> wordninja.split('imateapot')
['im', 'a', 'teapot']
>>> wordninja.split('heshotwhointhewhatnow')
['he', 'shot', 'who', 'in', 'the', 'what', 'now']
>>> wordninja.split('thequickbrownfoxjumpsoverthelazydog')
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```
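For the "humanizing" use case mentioned in the introduction, a small wrapper might look like the sketch below. `humanize` is a hypothetical helper, not part of wordninja; it accepts an optional `split` callable so it can be tried without the package installed, and falls back to `wordninja.split` otherwise.

```python
def humanize(identifier, split=None):
    """Turn a run-together identifier into a title-cased label.

    `humanize` is a hypothetical helper, not part of wordninja.
    """
    if split is None:
        # Imported lazily so the helper can also be exercised with a stub splitter.
        import wordninja
        split = wordninja.split
    return ' '.join(word.capitalize() for word in split(identifier))
```

With wordninja installed, `humanize('derekanderson')` should give `'Derek Anderson'`, following the split shown above.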

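Performance
-----------

It's super fast! In the run below, 10,000 splits of `'imateapot'` take about 0.41 seconds, or roughly 41 µs per call:

```
>>> import timeit
>>> def f():
...   wordninja.split('imateapot')
... 
>>> timeit.timeit(f, number=10000)
0.40885152100236155
```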

It can handle long strings:

```
>>> wordninja.split('wethepeopleoftheunitedstatesinordertoformamoreperfectunionestablishjusticeinsuredomestictranquilityprovideforthecommondefencepromotethegeneralwelfareandsecuretheblessingsoflibertytoourselvesandourposteritydoordainandestablishthisconstitutionfortheunitedstatesofamerica')
['we', 'the', 'people', 'of', 'the', 'united', 'states', 'in', 'order', 'to', 'form', 'a', 'more', 'perfect', 'union', 'establish', 'justice', 'in', 'sure', 'domestic', 'tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'welfare', 'and', 'secure', 'the', 'blessings', 'of', 'liberty', 'to', 'ourselves', 'and', 'our', 'posterity', 'do', 'ordain', 'and', 'establish', 'this', 'constitution', 'for', 'the', 'united', 'states', 'of', 'america']
```

It also scales well: this string takes ~7 ms to compute.

How to Install
--------------

```
pip3 install wordninja
```

Custom Language Models
----------------------

The #1 most requested feature! If you want to work with a language other than English (or want to specify your own model of English), this is how you do it.

```
>>> lm = wordninja.LanguageModel('my_lang.txt.gz')
>>> lm.split('derek')
['der', 'ek']
```

Language files must be gzipped text files with one word per line, in decreasing order of probability.
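A file in that format can be produced with the standard library alone. The sketch below assumes you already have a list of words from your own corpus; `write_language_file` is a hypothetical helper name, and ranking by raw frequency count is one simple way to approximate "decreasing order of probability".

```python
import gzip
from collections import Counter

def write_language_file(corpus_words, path):
    """Write a gzipped word list, most frequent word first, one word per line.

    `corpus_words` is any iterable of words; counting is case-insensitive.
    Returns the ranked list that was written.
    """
    counts = Counter(word.lower() for word in corpus_words)
    # most_common() sorts by descending count.
    ranked = [word for word, _ in counts.most_common()]
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        for word in ranked:
            f.write(word + '\n')
    return ranked
```

The resulting path can then be passed to `wordninja.LanguageModel(...)` as in the example above.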

If you want to make your model the default, set:

```
wordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('my_lang.txt.gz')
```