https://github.com/poke1024/simtrie
An efficient data structure for fast string similarity searches
damerau-levenshtein-distance edit-distance fuzzy-matching levenshtein-distance prefix-tree python spell-check spelling-correction trie
- Host: GitHub
- URL: https://github.com/poke1024/simtrie
- Owner: poke1024
- License: mit
- Created: 2019-04-29T13:19:07.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2021-02-08T14:14:19.000Z (over 5 years ago)
- Last Synced: 2025-04-12T08:41:17.486Z (about 1 year ago)
- Topics: damerau-levenshtein-distance, edit-distance, fuzzy-matching, levenshtein-distance, prefix-tree, python, spell-check, spelling-correction, trie
- Language: Python
- Size: 37.1 KB
- Stars: 22
- Watchers: 2
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
- Authors: AUTHORS.rst
# simtrie
`simtrie` is a library for fast, highly
configurable approximate string similarity
searches.
Here's a simple example:
```
import timeit

import simtrie
from nltk.corpus import wordnet as wn

# Collect the unique WordNet lemmas.
lemmas = list(set(wn.words()))
print('creating set of %d words' % len(lemmas))
# >> creating set of 147306 words

s = simtrie.Set(lemmas)

def search():
    # Entries within a maximum distance of 2 from "bookish".
    return list(s.similar("bookish", 2))

print(timeit.timeit(stmt=search, number=1))
# >> 0.00041486499958409695

print(search())
# >> [('blockish', 2.0), ('bookie', 2.0), ('booking', 2.0), ('bookish', 0.0), ('boorish', 1.0), ('boxfish', 2.0), ('boyish', 2.0), ('foolish', 2.0), ('goodish', 2.0), ('monkish', 2.0), ('moorish', 2.0)]
```
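The distances in the output above are consistent with a plain unit-cost Levenshtein distance. As a reference point, here is a minimal pure-Python sketch of that distance. It is not simtrie's implementation, and the function name `levenshtein` is chosen here only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


for word in ("bookish", "boorish", "bookie", "blockish"):
    print(word, levenshtein("bookish", word))
```

This reproduces the values 0, 1, 2 and 2 that the example reports for these four words.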
`simtrie` allows you to fine-tune searches using custom
weighted metrics:
```
metric = simtrie.Metric(
    (('c', None), 1.9),  # deletion cost
    (('ab', 'ba'), 1.5)  # transpose cost
)
s.similar("bookish", 2, metric, allow_transpose=True)
```
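To make the idea of weighted costs concrete, the following pure-Python sketch shows one way per-operation weights like the ones above could enter an edit-distance computation. The function `weighted_distance` and its `costs` dict are hypothetical and not part of simtrie; simtrie's actual cost semantics may differ. The sketch uses the optimal-string-alignment variant of Damerau-Levenshtein, with a default cost of 1.0 for any operation not listed.

```python
def weighted_distance(a, b, costs=None, default=1.0):
    """Edit distance with per-operation costs (illustrative only).

    costs maps (src, dst) to a cost: (ch, None) is a deletion,
    (None, ch) an insertion, (x, y) a substitution, and
    ('ab', 'ba') a transposition of two adjacent characters.
    """
    costs = costs or {}

    def cost(src, dst):
        return costs.get((src, dst), default)

    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(a[i - 1], None)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost(None, b[j - 1])

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ca, cb = a[i - 1], b[j - 1]
            best = min(
                d[i - 1][j] + cost(ca, None),  # delete ca
                d[i][j - 1] + cost(None, cb),  # insert cb
                d[i - 1][j - 1] + (0.0 if ca == cb else cost(ca, cb)),
            )
            # Transposition of two adjacent, distinct characters.
            if i > 1 and j > 1 and ca != cb and ca == b[j - 2] and a[i - 2] == cb:
                pair = a[i - 2:i]
                best = min(best, d[i - 2][j - 2] + cost(pair, pair[::-1]))
            d[i][j] = best
    return d[n][m]


costs = {('c', None): 1.9, ('ab', 'ba'): 1.5}
print(weighted_distance("cat", "at", costs))   # deleting 'c' costs 1.9
print(weighted_distance("abc", "bac", costs))  # transposing 'ab' costs 1.5
```

With these weights, deleting `c` is still cheaper than the alternative of one deletion plus one substitution (2.0), and the transposition is cheaper than two substitutions (2.0).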
Some of simtrie's features:
* Stores string sets and dicts in RAM using a prefix tree
* Fast, configurable similarity searches over large sets
* Pythonic API similar to regular `set` and `dict`
* Supports transpose, split and union weights
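Why a prefix tree helps with similarity search can be sketched in a few lines of plain Python. The example below is an illustration only and does not reflect simtrie's internals, which use a compiled C++ structure; `build_trie` and `search` are hypothetical names. Words that share a prefix share the dynamic-programming rows computed for that prefix, and a whole subtree is skipped as soon as every entry of the current row exceeds the maximum distance.

```python
def build_trie(words):
    """Nested-dict trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def search(trie, query, max_dist):
    """Return sorted (word, distance) pairs within max_dist of query."""
    results = []

    def walk(node, prefix, prev_row):
        # prev_row[j] is the distance between prefix and query[:j].
        if None in node and prev_row[-1] <= max_dist:
            results.append((prefix, prev_row[-1]))
        for ch, child in node.items():
            if ch is None:
                continue
            row = [prev_row[0] + 1]
            for j, qc in enumerate(query, start=1):
                row.append(min(
                    row[j - 1] + 1,                # insert qc
                    prev_row[j] + 1,               # delete ch
                    prev_row[j - 1] + (qc != ch),  # substitute or match
                ))
            # Prune: no extension of this prefix can get back under the limit.
            if min(row) <= max_dist:
                walk(child, prefix + ch, row)

    walk(trie, "", list(range(len(query) + 1)))
    return sorted(results)


trie = build_trie(["bookish", "boorish", "boyish", "banana"])
print(search(trie, "bookish", 2))
# [('bookish', 0), ('boorish', 1), ('boyish', 2)]
```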
Note: binary data files are not portable between machine
architectures with different byte orders (a file is written as either
little endian or big endian, matching the machine that created it).
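The portability issue can be illustrated with Python's standard `struct` module. This sketch says nothing about simtrie's actual file layout; it only shows how the same four bytes decode to different integers depending on the assumed byte order.

```python
import struct
import sys

value = 1
little = struct.pack("<I", value)  # little-endian 32-bit unsigned int
big = struct.pack(">I", value)     # big-endian 32-bit unsigned int

print(sys.byteorder)  # byte order of the machine running this script
print(little, big)

# Decoding little-endian bytes as big-endian yields a different number.
misread = struct.unpack(">I", little)[0]
print(misread)  # 16777216
```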
# Credits
`simtrie` is a fork of https://github.com/pytries/DAWG. Its internal
data structure is a very clever C++ implementation of a DAFSA
(deterministic acyclic finite state automaton) by Susumu Yata.
Various test cases and ideas were taken from the super clean
implementation at https://github.com/infoscout/weighted-levenshtein/.
# Similar Projects
* https://github.com/wolfgarbe/SymSpell
# License
Python code is licensed under the MIT License.
The bundled `dawgdic` C++ library and the C++ extensions
for simtrie are licensed under the BSD license.