https://github.com/t-ski/string-similarity-algorithms

Common string similarity algorithm implementations.

nlp python string-distance string-similarity

Last synced: 2 months ago

Host: GitHub
URL: https://github.com/t-ski/string-similarity-algorithms
Owner: t-ski
License: mit
Created: 2024-05-24T00:59:59.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-06-03T22:54:03.000Z (about 2 years ago)
Last Synced: 2025-12-09T04:21:47.226Z (7 months ago)
Topics: nlp, python, string-distance, string-similarity
Language: Python
Homepage:
Size: 6.84 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# String Similarity Algorithms

Common string similarity algorithms `sim: (Σ* × Σ*) → [0, 1]`.

> Algorithms with an open value range (such as Hamming distance) are normalized to `[0, 1]`.

## Hamming 

**Complexity:** `O(n)`

``` python

def hamming_distance(str1: str, str2: str) -> int

```

``` python

def hamming(str1: str, str2: str) -> float

```

> If the strings differ in length, the shorter one is padded with blank symbols so that the algorithm can be applied.
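
The padding and normalization described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `_sketch` names, the choice of `"\0"` as the blank symbol, and the normalization `1 - distance / max_length` are all assumptions.

```python
def hamming_distance_sketch(str1: str, str2: str, blank: str = "\0") -> int:
    """Count differing positions after padding the shorter string.

    `blank` is a hypothetical padding symbol; it should not occur in the inputs.
    """
    width = max(len(str1), len(str2))
    padded1 = str1.ljust(width, blank)
    padded2 = str2.ljust(width, blank)
    # Each position where the symbols differ adds one to the distance.
    return sum(a != b for a, b in zip(padded1, padded2))


def hamming_sketch(str1: str, str2: str) -> float:
    """Map the distance to [0, 1] (assumed normalization: 1 - d / max length)."""
    width = max(len(str1), len(str2))
    if width == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - hamming_distance_sketch(str1, str2) / width
```

For example, `hamming_distance_sketch("karolin", "kathrin")` is 3, so the similarity under this normalization is `1 - 3/7`.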

## Levenshtein

**Complexity:** `O(n²)`

``` python

def levenshtein_distance(str1: str, str2: str) -> int

```

``` python

def levenshtein(str1: str, str2: str) -> float

```
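
As an illustration of where the quadratic complexity comes from, here is a minimal sketch of the classic dynamic-programming formulation. It is not taken from the repository: the `_sketch` names and the normalization `1 - distance / max_length` are assumptions.

```python
def levenshtein_distance_sketch(str1: str, str2: str) -> int:
    """Edit distance with insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of str1
    # and the first j characters of str2.
    previous = list(range(len(str2) + 1))
    for i, c1 in enumerate(str1, start=1):
        current = [i]
        for j, c2 in enumerate(str2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]


def levenshtein_sketch(str1: str, str2: str) -> float:
    """Map the distance to [0, 1] (assumed normalization: 1 - d / max length)."""
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein_distance_sketch(str1, str2) / longest
```

The nested loops visit every pair of positions once, and keeping only two rows limits the memory to the length of `str2`.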

## Damerau-Levenshtein

**Complexity:** `O(n²)`

``` python

def damerau_levenshtein_distance(str1: str, str2: str) -> int

```

``` python

def damerau_levenshtein(str1: str, str2: str) -> float

```
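
A minimal sketch of how a transposition step extends the Levenshtein recurrence. Note the assumptions: this is the restricted "optimal string alignment" variant, which may differ from the variant the repository implements, and both the `_sketch` names and the normalization `1 - distance / max_length` are illustrative.

```python
def damerau_levenshtein_distance_sketch(str1: str, str2: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(str1) + 1, len(str2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if str1[i - 1] == str2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Swap of two adjacent characters counts as a single edit.
            if (i > 1 and j > 1
                    and str1[i - 1] == str2[j - 2]
                    and str1[i - 2] == str2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def damerau_levenshtein_sketch(str1: str, str2: str) -> float:
    """Map the distance to [0, 1] (assumed normalization: 1 - d / max length)."""
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - damerau_levenshtein_distance_sketch(str1, str2) / longest
```

With this sketch, `"ab"` and `"ba"` are one edit apart, whereas plain Levenshtein distance would report two.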

## Jaro

**Complexity:** `O(n²)`

``` python

def jaro(str1: str, str2: str) -> float

```
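
For orientation, the textbook Jaro definition can be sketched as below. The name `jaro_sketch` and the handling of empty inputs are assumptions, and the repository's implementation may differ in such details.

```python
def jaro_sketch(str1: str, str2: str) -> float:
    """Jaro similarity from matching characters and transpositions."""
    if str1 == str2:
        return 1.0
    len1, len2 = len(str1), len(str2)
    if len1 == 0 or len2 == 0:
        return 0.0  # assumption: an empty string matches nothing
    # Characters match only if they are equal and at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        low = max(0, i - window)
        high = min(len2, i + window + 1)
        for j in range(low, high):
            if not matched2[j] and str1[i] == str2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order, counted in pairs.
    seq1 = [c for c, m in zip(str1, matched1) if m]
    seq2 = [c for c, m in zip(str2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3
```

For `"MARTHA"` and `"MARHTA"` all six characters match and one transposition is counted, giving `(1 + 1 + 5/6) / 3`.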

## Jaro-Winkler

**Complexity:** `O(n²)`

``` python

def jaro_winkler(str1: str, str2: str, p: float = 0.1) -> float

```
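
The role of the scaling factor `p` can be illustrated with a short sketch of the standard Winkler adjustment, which boosts the Jaro score for strings sharing a common prefix. The helper names, the prefix cap of 4 characters, and the absence of a boost threshold are assumptions; the repository may handle these differently.

```python
def _jaro(str1: str, str2: str) -> float:
    """Compact textbook Jaro similarity, included to keep the sketch self-contained."""
    if str1 == str2:
        return 1.0
    if not str1 or not str2:
        return 0.0
    window = max(max(len(str1), len(str2)) // 2 - 1, 0)
    used = [False] * len(str2)
    seq1 = []
    for i, c in enumerate(str1):
        for j in range(max(0, i - window), min(len(str2), i + window + 1)):
            if not used[j] and str2[j] == c:
                used[j] = True
                seq1.append(c)
                break
    matches = len(seq1)
    if matches == 0:
        return 0.0
    seq2 = [c for c, u in zip(str2, used) if u]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(str1) + matches / len(str2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_sketch(str1: str, str2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    score = _jaro(str1, str2)
    prefix = 0
    # Assumption: only the first 4 characters contribute to the boost.
    for a, b in zip(str1[:4], str2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1.0 - score)
```

With the prefix capped at 4, the result stays within `[0, 1]` as long as `p` does not exceed 0.25.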

## Jaccard

**Complexity:** `O(n)`

``` python

def jaccard(str1: str, str2: str) -> float

```

> The set-based similarity algorithms pair each character with its index to mimic set element identity (`{ (character, index) ∀ c ∈ S₁, S₂ }`).
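
One way to read the note above is sketched here. It assumes that "index" means the character's position in the string, which the README does not spell out; the helper `indexed_chars` and the name `jaccard_sketch` are hypothetical.

```python
def indexed_chars(text: str) -> set[tuple[str, int]]:
    """Turn a string into a set of (character, index) pairs.

    Assumption: the index is the character's position in the string.
    """
    return {(char, index) for index, char in enumerate(text)}


def jaccard_sketch(str1: str, str2: str) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    set1, set2 = indexed_chars(str1), indexed_chars(str2)
    union = set1 | set2
    if not union:
        return 1.0  # assumption: two empty strings count as identical
    return len(set1 & set2) / len(union)
```

Under this reading, `"night"` and `"nacht"` share the pairs for `n`, `h` and `t`, so the score is `3 / 7`.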

## Sørensen-Dice

**Complexity:** `O(n)`

``` python

def sorensen_dice(str1: str, str2: str) -> float

```
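
A minimal sketch of the Sørensen-Dice coefficient, `2·|A ∩ B| / (|A| + |B|)`, over sets of (character, index) pairs. It assumes the index is the character's position in the string; the helper and function names are hypothetical and not taken from the repository.

```python
def indexed_chars(text: str) -> set[tuple[str, int]]:
    """(character, position) pairs; the positional index is an assumption."""
    return {(char, index) for index, char in enumerate(text)}


def sorensen_dice_sketch(str1: str, str2: str) -> float:
    """Twice the shared elements over the total number of elements."""
    set1, set2 = indexed_chars(str1), indexed_chars(str2)
    total = len(set1) + len(set2)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 2 * len(set1 & set2) / total
```

Compared with the Jaccard index, the shared elements are weighted twice, so partial overlaps score somewhat higher.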

## Szymkiewicz-Simpson

**Complexity:** `O(n)`

``` python

def szymkiewicz_simpson(str1: str, str2: str) -> float

```

> Szymkiewicz-Simpson is also known simply as the “overlap” coefficient.
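
The overlap coefficient divides the shared elements by the size of the smaller set, `|A ∩ B| / min(|A|, |B|)`. The sketch below assumes sets of (character, position) pairs and a particular treatment of empty strings; the names are hypothetical and not taken from the repository.

```python
def indexed_chars(text: str) -> set[tuple[str, int]]:
    """(character, position) pairs; the positional index is an assumption."""
    return {(char, index) for index, char in enumerate(text)}


def szymkiewicz_simpson_sketch(str1: str, str2: str) -> float:
    """Shared elements relative to the smaller of the two sets."""
    set1, set2 = indexed_chars(str1), indexed_chars(str2)
    smaller = min(len(set1), len(set2))
    if smaller == 0:
        # Assumption: identical only if both strings are empty.
        return 1.0 if len(set1) == len(set2) else 0.0
    return len(set1 & set2) / smaller
```

Because the denominator is the smaller set, a string that is a prefix of the other scores 1.0 under this reading, for example `"ab"` against `"abcd"`.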
