https://github.com/t-ski/string-similarity-algorithms
Common string similarity algorithm implementations.
nlp python string-distance string-similarity
- Host: GitHub
- URL: https://github.com/t-ski/string-similarity-algorithms
- Owner: t-ski
- License: mit
- Created: 2024-05-24T00:59:59.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-03T22:54:03.000Z (about 2 years ago)
- Last Synced: 2025-12-09T04:21:47.226Z (7 months ago)
- Topics: nlp, python, string-distance, string-similarity
- Language: Python
- Homepage:
- Size: 6.84 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# String Similarity Algorithms
Common string similarity algorithms `sim: (Σ* × Σ*) → [0, 1]`.
> Algorithms with an open value range (such as the Hamming distance) are normalized to `[0, 1]`.
## Hamming
**Complexity:** `O(n)`
``` python
def hamming_distance(str1: str, str2: str) -> int
```
``` python
def hamming(str1: str, str2: str) -> float
```
> The shorter string is padded with blank symbols so that the algorithm can be applied to strings of unequal length.
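
The padding and normalization described above might look roughly like the following. This is a minimal sketch and not the repository's code: the signatures mirror the ones listed, but the choice of `"\0"` as the blank symbol, the division by the longer length, and the result for two empty strings are all assumptions.

```python
def hamming_distance(str1: str, str2: str) -> int:
    # Pad the shorter string so both have the same length.
    # Assumption: "\0" stands in for the blank symbol, which the README does not name.
    width = max(len(str1), len(str2))
    padded1 = str1.ljust(width, "\0")
    padded2 = str2.ljust(width, "\0")
    # Count the positions at which the two strings differ.
    return sum(c1 != c2 for c1, c2 in zip(padded1, padded2))


def hamming(str1: str, str2: str) -> float:
    # Assumption: normalize by the longer length, the largest possible distance.
    width = max(len(str1), len(str2))
    if width == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - hamming_distance(str1, str2) / width
```

With this sketch, `hamming_distance("karolin", "kathrin")` counts three differing positions, so the similarity is `1 - 3/7`.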
## Levenshtein
**Complexity:** `O(n²)`
``` python
def levenshtein_distance(str1: str, str2: str) -> int
```
``` python
def levenshtein(str1: str, str2: str) -> float
```
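
For orientation, the quadratic complexity comes from filling a dynamic programming table with one cell per character pair. The following is a minimal sketch of that approach (keeping only two rows in memory), not the repository's implementation. Normalizing by the longer length is an assumption.

```python
def levenshtein_distance(str1: str, str2: str) -> int:
    # previous[j] holds the distance between the processed prefix of str1
    # and the first j characters of str2.
    previous = list(range(len(str2) + 1))
    for i, c1 in enumerate(str1, start=1):
        current = [i]  # distance from str1[:i] to the empty string
        for j, c2 in enumerate(str2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein(str1: str, str2: str) -> float:
    # Assumption: normalize by the longer length, an upper bound on the distance.
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein_distance(str1, str2) / longest
```

For example, `"kitten"` and `"sitting"` are three edits apart, which gives a similarity of `1 - 3/7` under this normalization.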
## Damerau-Levenshtein
**Complexity:** `O(n²)`
``` python
def damerau_levenshtein_distance(str1: str, str2: str) -> int
```
``` python
def damerau_levenshtein(str1: str, str2: str) -> float
```
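
Damerau-Levenshtein extends Levenshtein with transpositions of adjacent characters. The README does not say which variant is implemented, so the sketch below assumes the restricted form, also called optimal string alignment, in which no substring is edited more than once. The normalization by the longer length is likewise an assumption.

```python
def damerau_levenshtein_distance(str1: str, str2: str) -> int:
    rows, cols = len(str1) + 1, len(str2) + 1
    # d[i][j] is the distance between str1[:i] and str2[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if str1[i - 1] == str2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters (restricted variant).
            if (i > 1 and j > 1
                    and str1[i - 1] == str2[j - 2]
                    and str1[i - 2] == str2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def damerau_levenshtein(str1: str, str2: str) -> float:
    # Assumption: normalize by the longer length.
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - damerau_levenshtein_distance(str1, str2) / longest
```

Under this variant `"ab"` and `"ba"` are one edit apart, whereas plain Levenshtein needs two. The unrestricted variant can give smaller distances for some inputs, such as `"ca"` and `"abc"`.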
## Jaro
**Complexity:** `O(n²)`
``` python
def jaro(str1: str, str2: str) -> float
```
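
As an illustration, the commonly cited definition of Jaro similarity counts characters that match within a sliding window and then penalizes matched characters that appear in a different order. The sketch below follows that textbook definition and is not taken from the repository. The handling of empty strings is an assumption.

```python
def jaro(str1: str, str2: str) -> float:
    if str1 == str2:
        return 1.0
    len1, len2 = len(str1), len(str2)
    if len1 == 0 or len2 == 0:
        return 0.0  # assumption: an empty string matches nothing
    # Characters only match if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(str1):
        start = max(0, i - window)
        stop = min(len2, i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and str2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters in their original order; each out-of-order pair
    # counts as half a transposition.
    seq1 = [c for c, m in zip(str1, matched1) if m]
    seq2 = [c for c, m in zip(str2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3
```

For `"MARTHA"` and `"MARHTA"` all six characters match with one transposition, so the result is `(1 + 1 + 5/6) / 3`.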
## Jaro-Winkler
**Complexity:** `O(n²)`
``` python
def jaro_winkler(str1: str, str2: str, p: float = 0.1) -> float
```
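
The `p` parameter is the prefix scaling factor: Jaro-Winkler raises a Jaro score for strings that share a common prefix. The sketch below isolates that step in a hypothetical helper, `winkler_boost`, which is not part of the repository and takes an already computed Jaro score. The prefix cap of four characters follows the usual definition and is assumed here.

```python
def winkler_boost(jaro_score: float, str1: str, str2: str, p: float = 0.1) -> float:
    # Hypothetical helper: applies the Winkler prefix adjustment to a Jaro score.
    # Length of the common prefix, capped at 4 characters (assumed convention).
    prefix = 0
    for c1, c2 in zip(str1[:4], str2[:4]):
        if c1 != c2:
            break
        prefix += 1
    # Move the score toward 1 in proportion to the shared prefix length.
    return jaro_score + prefix * p * (1.0 - jaro_score)
```

With a Jaro score of `17/18` for `"MARTHA"` and `"MARHTA"` and a shared prefix of three characters, the boosted score is `17/18 + 3 * 0.1 * (1/18)`. Keeping `p` at or below `0.25` ensures the result stays within `[0, 1]` for a prefix cap of four.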
## Jaccard
**Complexity:** `O(n)`
``` python
def jaccard(str1: str, str2: str) -> float
```
> The set-based similarity algorithms (Jaccard, Sørensen-Dice, Szymkiewicz-Simpson) pair each character with an index to mimic set element identity (`{ (character, index) ∀ c ∈ S₁, S₂ }`), so repeated characters remain distinct elements.
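
One way to read the note above is sketched here. The README does not say whether the index is the character's position in the string or the count of its earlier occurrences. This sketch assumes the latter, so that `"aab"` becomes `{("a", 0), ("a", 1), ("b", 0)}`. The helper `indexed_chars` is hypothetical and not part of the repository.

```python
from collections import Counter


def indexed_chars(text: str) -> set:
    # Hypothetical helper: pair each character with its occurrence index
    # (an assumed reading of the index) so duplicates stay distinct in a set.
    seen = Counter()
    elements = set()
    for ch in text:
        elements.add((ch, seen[ch]))
        seen[ch] += 1
    return elements


def jaccard(str1: str, str2: str) -> float:
    a, b = indexed_chars(str1), indexed_chars(str2)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty strings count as identical
    # Jaccard index: size of the intersection over size of the union.
    return len(a & b) / len(union)
```

Under this reading, `"night"` and `"nacht"` share three of seven distinct elements, giving `3/7`.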
## Sørensen-Dice
**Complexity:** `O(n)`
``` python
def sorensen_dice(str1: str, str2: str) -> float
```
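
The Sørensen-Dice coefficient is twice the size of the intersection divided by the sum of both set sizes. The sketch below is not the repository's code. It uses `collections.Counter` as a multiset, which is equivalent to pairing each character with its occurrence index. That reading of the index, and the result for two empty strings, are assumptions.

```python
from collections import Counter


def sorensen_dice(str1: str, str2: str) -> float:
    # Counter acts as a multiset of characters (assumed element identity).
    a, b = Counter(str1), Counter(str2)
    total = len(str1) + len(str2)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    # `&` keeps the minimum count per character, i.e. the multiset intersection.
    shared = sum((a & b).values())
    return 2 * shared / total
```

For `"night"` and `"nacht"` three characters are shared, so the coefficient is `2 * 3 / 10`.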
## Szymkiewicz-Simpson
**Complexity:** `O(n)`
``` python
def szymkiewicz_simpson(str1: str, str2: str) -> float
```
> The Szymkiewicz-Simpson coefficient is also known simply as the “overlap” coefficient.
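
The overlap coefficient divides the size of the intersection by the size of the smaller set, so a string whose characters are fully contained in the other scores `1`. The following is a minimal sketch, not the repository's code. It treats characters as a multiset via `collections.Counter`, and both that and the handling of empty strings are assumptions.

```python
from collections import Counter


def szymkiewicz_simpson(str1: str, str2: str) -> float:
    smaller = min(len(str1), len(str2))
    if smaller == 0:
        # Assumption: an empty string only overlaps with another empty string.
        return 1.0 if str1 == str2 else 0.0
    # Multiset intersection: minimum count per character.
    shared = sum((Counter(str1) & Counter(str2)).values())
    return shared / smaller
```

For example, `"ab"` against `"abc"` gives `1.0`, while `"night"` against `"nacht"` gives `3/5`.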