Python package for lexicon; Trie and DAWG implementation.
- Host: GitHub
- URL: https://github.com/aosingh/lexpy
- Owner: aosingh
- License: gpl-3.0
- Created: 2017-10-25T01:02:22.000Z (over 7 years ago)
- Default Branch: main
- Last Pushed: 2024-12-01T05:10:18.000Z (6 months ago)
- Topics: dawg, directed-acyclic-word-graph, graph, lexicon, suffix-tree, trie
- Language: Python
- Size: 14.2 MB
- Stars: 55
- Watchers: 3
- Forks: 7
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
# Lexpy
[Build](https://github.com/aosingh/lexpy/actions) · [Downloads](https://pepy.tech/project/lexpy) · [PyPI](https://pypi.python.org/pypi/lexpy) · Python 3.7–3.12 · [PyPy](https://www.pypy.org/download.html)
- A lexicon is a data structure that stores a set of words. The difference between
a dictionary and a lexicon is that in a lexicon there are **no values** associated with the words.
- A lexicon is similar to a list or a set of words, but its internal representation is different and optimized
for faster searches of words, prefixes, and wildcard patterns. For a given word of length W, the search time is O(W).
- Two important lexicon data structures are **_Trie_** and **_Directed Acyclic Word Graph (DAWG)_**.
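As a quick illustration (a minimal sketch using the lexpy API documented in the sections below; the three sample words are arbitrary), a plain Python set answers only exact membership queries, while a lexicon also answers prefix and wildcard queries:

```python
from lexpy import Trie

words = ['apple', 'apply', 'ample']

# A plain set supports exact membership lookups only.
word_set = set(words)
print('apple' in word_set)
>>> True

# A Trie answers the same query in O(W) time for a word of length W,
# and additionally supports prefix and wildcard searches.
trie = Trie()
trie.add_all(words)
print('apple' in trie)
>>> True
print(trie.search_with_prefix('app'))   # 'apple' and 'apply'; order may vary
print(trie.search('a?p*'))              # all three sample words match this pattern
```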
# Install
`lexpy` can be installed from the Python Package Index (PyPI) using `pip`. The only requirement is Python 3.7 or higher.
```commandline
pip install lexpy
```

# Interface
| **Interface Description** | **Trie** | **DAWG** |
|------------------------------------------------------------------------------------------------------------------------------- |------------------------------------------ |------------------------------------------ |
| Add a single word | `add('apple', count=2)` | `add('apple', count=2)` |
| Add multiple words | `add_all(['advantage', 'courage'])` | `add_all(['advantage', 'courage'])` |
| Check if exists? | `in` operator | `in` operator |
| Search using wildcard expression | `search('a?b*', with_count=True)` | `search('a?b*', with_count=True)` |
| Search for prefix matches | `search_with_prefix('bar', with_count=True)` | `search_with_prefix('bar', with_count=True)` |
| Search for similar words within a given edit distance. Here, the notion of edit distance is the same as Levenshtein distance | `search_within_distance('apble', dist=1, with_count=True)` | `search_within_distance('apble', dist=1, with_count=True)` |
| Get the number of nodes in the automaton | `len(trie)` | `len(dawg)` |

# Examples
## Trie
### Build from an input list, set, or tuple of words.
```python
from lexpy import Trie

trie = Trie()
input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba',
'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin',
'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit',
'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small'
'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']

trie.add_all(input_words)  # You can pass any sequence type or a file-like object here
print(trie.get_word_count())
>>> 48
```

### Build from a file or file path
Words in the file should be newline-separated.
```python
from lexpy import Trie
# Either
trie = Trie()
trie.add_all('/path/to/file.txt')

# Or
with open('/path/to/file.txt', 'r') as infile:
    trie.add_all(infile)
```
### Check if exists using the `in` operator
```python
print('ampyx' in trie)
>>> True
```

### Prefix search
```python
print(trie.search_with_prefix('ab'))
>>> ['abhor', 'abuzz']
```

```python
print(trie.search_with_prefix('ab', with_count=True))
>>> [('abuzz', 2), ('abhor', 1)]
```
### Wildcard search using `?` and `*`
- `?` = 0 or 1 occurrence of any character
- `*` = 0 or more occurrences of any character
```python
print(trie.search('a*o*'))
>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']
print(trie.search('a*o*', with_count=True))
>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]
print(trie.search('su?t'))
>>> ['suit']
print(trie.search('su?t', with_count=True))
>>> [('suit', 1)]
```
### Search for similar words using the notion of Levenshtein distance
```python
print(trie.search_within_distance('arie', dist=2))
>>> ['athie', 'arbil', 'auric']
print(trie.search_within_distance('arie', dist=2, with_count=True))
>>> [('athie', 3), ('arbil', 1), ('auric', 1)]
```
### Increment word count
- You can either add a new word or increment the counter for an existing word.
```python
trie.add('athie', count=1000)
print(trie.search_within_distance('arie', dist=2, with_count=True))
>>> [('athie', 1003), ('arbil', 1), ('auric', 1)]
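
# You can also add a brand-new word together with an initial count;
# 'zebra' here is an arbitrary example word.
trie.add('zebra', count=5)
print(trie.search('zebra', with_count=True))
>>> [('zebra', 5)]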
```

# Directed Acyclic Word Graph (DAWG)
- A DAWG supports the same set of operations as a Trie. The difference is that the number of nodes in a DAWG is always
less than or equal to the number of nodes in the corresponding Trie.
- They are both Deterministic Finite State Automata. However, a DAWG is a minimized version of the Trie DFA.
- In a Trie, prefix redundancy is removed. In a DAWG, both prefix and suffix redundancies are removed.
- In the current implementation of DAWG, the insertion order of the words should be **alphabetical**.
- The implementation idea of DAWG is borrowed from http://stevehanov.ca/blog/?id=115
```python
from lexpy import Trie, DAWG

trie = Trie()
trie.add_all(['advantageous', 'courageous'])

dawg = DAWG()
dawg.add_all(['advantageous', 'courageous'])

len(trie)  # Number of nodes in the Trie
23

dawg.reduce()  # Perform DFA minimization. Call this every time a chunk of words is added to the DAWG.

len(dawg)  # Number of nodes in the DAWG
21
```
## DAWG
The APIs are exactly the same as the Trie APIs.
### Build a DAWG
```python
from lexpy import DAWG
dawg = DAWG()

input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba',
'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin',
'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit',
'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small'
'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']

dawg.add_all(input_words)
dawg.reduce()

dawg.get_word_count()
>>> 48
```
### Check if exists using the `in` operator
```python
print('ampyx' in dawg)
>>> True
```

### Prefix search
```python
print(dawg.search_with_prefix('ab'))
>>> ['abhor', 'abuzz']
```

```python
print(dawg.search_with_prefix('ab', with_count=True))
>>> [('abuzz', 2), ('abhor', 1)]
```
### Wildcard search using `?` and `*`
- `?` = 0 or 1 occurrence of any character
- `*` = 0 or more occurrences of any character
```python
print(dawg.search('a*o*'))
>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']
print(dawg.search('a*o*', with_count=True))
>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]
print(dawg.search('su?t'))
>>> ['suit']
print(dawg.search('su?t', with_count=True))
>>> [('suit', 1)]
```
### Search for similar words using the notion of Levenshtein distance
```python
print(dawg.search_within_distance('arie', dist=2))
>>> ['athie', 'arbil', 'auric']
print(dawg.search_within_distance('arie', dist=2, with_count=True))
>>> [('athie', 3), ('arbil', 1), ('auric', 1)]
```
### Alphabetical order insertion
If you insert a word that is lexicographically out of order, a `ValueError` will be raised.
```python
dawg.add('athie', count=1000)
```
```text
ValueError: Words should be inserted in Alphabetical order.
```

### Increment the word count
- You can either add an alphabetically greater word with a specific count or increment the count of the previously added word.
```python
dawg.add_all(['thrill']*20000) # or dawg.add('thrill', count=20000)
print(dawg.search('thrill', with_count=True))
>>> [('thrill', 20001)]
```
## Special Characters
Special characters, except `?` and `*`, are matched literally.
```python
from lexpy import Trie
t = Trie()
t.add('a©')
```

```python
t.search('a©')
>>> ['a©']
```
```python
t.search('a?')
>>> ['a©']
```

```python
t.search('?©')
>>> ['a©']
```

## Trie vs DAWG


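As a minimal sketch of the comparison (using only the APIs shown above; the four sample words are arbitrary and the exact node counts depend on the input), you can build both structures from the same alphabetically ordered word list and compare their sizes:

```python
from lexpy import Trie, DAWG

# Arbitrary sample words; the DAWG requires alphabetical insertion order.
words = sorted(['advantageous', 'analogous', 'courageous', 'outrageous'])

trie = Trie()
trie.add_all(words)

dawg = DAWG()
dawg.add_all(words)
dawg.reduce()  # minimize the automaton once the words are added

# A DAWG also shares common suffixes, so it never has more nodes than the Trie.
print(len(trie), len(dawg))
```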
# Future Work
These are some ideas that I would love to work on next, in that order. Pull requests and discussions are welcome.
- Merge Trie and DAWG features into one data structure
    - Support all functionality while staying as compressed as possible.
- Serialization / Deserialization (see the sketch below)
    - Pickle is definitely an option.
- Server (TCP or HTTP) to serve queries over the network.
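For the serialization idea, here is a minimal sketch of what a pickle-based approach could look like; it is only an illustration, not an implemented lexpy feature, and it assumes the Trie object graph can be pickled directly (a very deep trie may need a higher recursion limit).

```python
import pickle

from lexpy import Trie

trie = Trie()
trie.add_all(['ample', 'apple', 'apply'])

# Serialize the whole automaton to bytes and rebuild it later.
payload = pickle.dumps(trie)
restored = pickle.loads(payload)

print('apple' in restored)  # expected True if the node graph pickles cleanly
```

# Fun Facts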
1. The 45-letter word *pneumonoultramicroscopicsilicovolcanoconiosis* is the longest English word that appears in a major dictionary.
So, for all English words, the search time is bounded by O(45).
2. The longest technical word (not in a dictionary) is the name of the protein [titin](https://en.wikipedia.org/wiki/Titin). It has 189,819
letters, and it is disputed whether it is a word.