Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/stephantul/wordninja2
- Host: GitHub
- URL: https://github.com/stephantul/wordninja2
- Owner: stephantul
- License: mit
- Created: 2022-11-10T13:12:59.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-08-25T19:36:06.000Z (6 months ago)
- Last Synced: 2024-08-26T09:07:34.118Z (6 months ago)
- Language: Python
- Size: 1.08 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# wordninja2
wordninja2 is a faster version of [wordninja](https://github.com/keredson/wordninja). Wordninja is a word-based unigram language model that splits strings of concatenated words (written without spaces) back into separate words, as follows:
```python
>>> from wordninja2 import split
>>> split("waldorfastorianewyork")
['waldorf', 'astoria', 'new', 'york']
>>> split("besthotelpricesyoucanfind")
['best', 'hotel', 'prices', 'you', 'can', 'find']
```
Wordninja was originally defined in [a stackoverflow thread](https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words/11642687#11642687), and then rewritten into a Python package.
As the original wordninja isn't really maintained, and contains some inconsistencies, I decided to rewrite it. See below for a comparison between wordninja and wordninja2.
## Usage
wordninja2 is packaged with a wordlist, which allows you to use it out of the box. To facilitate migrating from wordninja to wordninja2, we use the exact same wordlist.
```python
>>> from wordninja2 import split
>>> split("HelloIfoundanewhousewiththreebedroomswouldwebeabletoshareit?")
['Hello',
'I',
'found',
'a',
'new',
'house',
'with',
'three',
'bedrooms',
'would',
'we',
'be',
'able',
'to',
'share',
'it',
 '?']
```
Using wordninja2 with your own wordlist is easy, and works regardless of punctuation in tokens or the languages of those tokens.
```python
>>> from wordninja2 import WordNinja
>>> my_words = ["dog", "cat", "房子"]
>>> wn = WordNinja(my_words)
>>> wn.split("idogcat房子house")
["i", "dog", "cat", "房子", "h", "o", "u", "s", "e"]```
Note that any wordlist you supply should be in descending order of importance: wordninja assumes that words higher in the list take precedence in segmentation over words lower in the list. The following example shows the effect.
```python
>>> from wordninja2 import WordNinja
>>> my_words = ["dog", "s", "a", "b", "c", "d", "e", "f", "dogs"]
>>> wn = WordNinja(my_words)
>>> wn.split("dogs")
["dog", "s"]>>> my_words = ["dogs", "dog", "s"]
>>> wn = WordNinja(my_words)
>>> wn.split("dogsdog")
["dogs", "dog"]```
### Wordfreq integration
If you want multilingual wordlists, or a better English wordlist, you can install [wordfreq](https://pypi.org/project/wordfreq/) by the great [rspeer](https://github.com/rspeer) (go give it a star on [GitHub](https://github.com/rspeer/wordfreq)). This works as follows:
```python
>>> from wordfreq import top_n_list
>>> from wordninja2 import WordNinja
>>> wordlist = top_n_list("de", 500_000)
>>> print(wordlist[:10])
['die', 'der', 'und', 'in', 'das', 'ich', 'ist', 'nicht', 'zu', 'den']
>>> wn = WordNinja(wordlist)
>>> wn.split("erinteressiertsichfüralles,aberbesondersfürschmetterlingeundandereinsekten")
['er',
'interessiert',
'sich',
'für',
'alles',
',',
'aber',
'besonders',
'für',
'schmetterlinge',
'und',
'andere',
 'insekten']
```
One interesting avenue is to segment a string with models for several languages and keep the segmentation with the lowest cost.
```python
>>> from wordfreq import top_n_list
>>> from wordninja2 import WordNinja
>>> wns = {}
>>> for language in ["de", "nl", "en", "fr"]:
        wordlist = top_n_list(language, 500_000)
        wns[language] = WordNinja(wordlist)
>>> # This is a Dutch string.
>>> string = "ditiseennederlandsetekstmeteenheelmooiverhaalofmeerdereverhalen"
>>> segmentations = {}
>>> for language, model in wns.items():
        segmentation = model.split_with_cost(string)
        segmentations[language] = segmentation
>>> for language, segmentation in sorted(segmentations.items(), key=lambda x: x[1].cost):
        print(language)
        print(segmentation.tokens)
nl
['dit', 'is', 'een', 'nederlandse', 'tekst', 'meteen', 'heel', 'mooi', 'verhaal', 'of', 'meerdere', 'verhalen']
en
['diti', 'seen', 'nederlandse', 'tekst', 'me', 'teen', 'heel', 'moo', 'iver', 'haal', 'of', 'meer', 'der', 'ever', 'halen']
fr
['dit', 'ise', 'en', 'nederlandse', 'tekst', 'me', 'te', 'en', 'heel', 'mooi', 'verh', 'aal', 'of', 'meer', 'de', 'rever', 'halen']
de
['dit', 'i', 'seen', 'nederlandse', 'tekst', 'me', 'teen', 'heel', 'mooi', 'verha', 'al', 'of', 'meer', 'der', 'ever', 'halen']
```
## Differences with wordninja
In this section I'll highlight some differences between `wordninja` and `wordninja2`.
### Consistency
The original `wordninja` is not self-consistent; that is, the following assert fails.
```python
from wordninja import split

string = "this,string-split it"
assert "".join(split(string)) == string
```
This is because `wordninja` removes all non-word characters from the string before processing it. As a consequence, `wordninja` can never detect words that contain these special characters.
`wordninja2` is completely self-consistent, and does not remove any special characters from a string.
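For comparison, the same round-trip check passes with `wordninja2`. A minimal illustration using the packaged wordlist:

```python
from wordninja2 import split

string = "this,string-split it"
tokens = split(string)
# Every character of the input is preserved, including the comma, the hyphen
# and the space, so joining the tokens reconstructs the original string.
assert "".join(tokens) == string
```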
### Speed
`wordninja2` is substantially faster than `wordninja`: on the benchmark below it is roughly six times as fast. Here we segment the entire text of Mary Shelley's Frankenstein (which you can download [here](https://www.gutenberg.org/ebooks/84)):
```python
>>> import re
>>> from wordninja2 import split
>>> from wordninja import split as old_split
>>> # Remove all spaces.
>>> txt = re.sub(r"\s", "", open("pg84.txt").read())
>>> %timeit split(txt)
299 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit old_split(txt)
1.89 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
The original wordninja uses an algorithm that backtracks up to the length of the longest word for every character in the string. Thus, if your wordlist contains even a single very long word, the whole algorithm slows down considerably. Coincidentally, the default wordlist used in `wordninja` has a really long word: `llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch`; see [here](https://www.atlasobscura.com/places/llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch) for additional background.
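To make the cost of that approach concrete, here is a minimal sketch of that style of dynamic program. It is not the actual `wordninja` source, and the cost function is only illustrative, but it shows how every position scans back up to the length of the longest word, so a single very long wordlist entry inflates the work done at every character:

```python
import math

def naive_split(text, wordlist):
    # Words earlier in the list get a lower (better) cost.
    cost = {w: math.log(i + 2) for i, w in enumerate(wordlist)}
    max_len = max(map(len, wordlist))  # one long word raises this for the whole string

    # best[i] = (total cost, length of last word) for the best split of text[:i]
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        # Backtrack up to max_len characters at every position: O(len(text) * max_len).
        candidates = (
            (best[j][0] + cost.get(text[j:i], 9e99), i - j)
            for j in range(max(0, i - max_len), i)
        )
        best.append(min(candidates))

    # Walk the table backwards to recover the words.
    words, i = [], len(text)
    while i > 0:
        _, k = best[i]
        words.append(text[i - k:i])
        i -= k
    return words[::-1]
```

On the toy wordlist from the earlier example, `naive_split("dogsdog", ["dogs", "dog", "s"])` returns `['dogs', 'dog']`.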
To avoid backtracking, `wordninja2` uses the [Aho–Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm). We use a fast implementation in Rust, [aho-corasick](https://github.com/BurntSushi/aho-corasick), through its Python bindings, [ahocorasick_rs](https://github.com/G-Research/ahocorasick_rs/).
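For illustration, this is roughly how such a matcher is driven from Python. A minimal sketch using the `ahocorasick_rs` API, not code taken from `wordninja2`:

```python
from ahocorasick_rs import AhoCorasick

wordlist = ["waldorf", "astoria", "new", "york"]
ac = AhoCorasick(wordlist)

# A single linear pass over the string finds matches of the wordlist entries,
# with no per-character backtracking; each match is a (pattern_index, start, end) tuple.
for pattern_index, start, end in ac.find_matches_as_indexes("waldorfastorianewyork"):
    print(wordlist[pattern_index], start, end)
```

The segmentation itself still has to pick the cheapest combination of matches, but finding the candidate words no longer depends on the length of the longest wordlist entry.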
## Dependencies
See the [pyproject.toml](pyproject.toml) file. We only rely on the aforementioned aho-corasick implementation and numpy.
## Installation
Clone the repo and run `make install`. I might put this on PyPI later.
## Tests
`wordninja2` has 100% test coverage; run `make test` to run the tests.
## License
MIT
## Authors
* Stéphan Tulkens
* The original code is by [keredson](https://github.com/keredson)
* The original algorithm was written by [Generic Human](https://stackoverflow.com/users/1515832/generic-human)