Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/foxcroftjn/polars-strsim
Polars extension for string similarity
- Host: GitHub
- URL: https://github.com/foxcroftjn/polars-strsim
- Owner: foxcroftjn
- License: mit
- Created: 2023-03-23T13:45:57.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-03T19:31:47.000Z (2 months ago)
- Last Synced: 2024-10-03T19:49:41.279Z (2 months ago)
- Language: Rust
- Homepage: https://pypi.org/project/polars-strsim/
- Size: 85 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- trackawesomelist - polars-strsim (⭐2) - Polars plugin that computes string similarity measures directly on a Polars dataframe by [@foxcroftjn](https://github.com/foxcroftjn). (Recently Updated / [Oct 08, 2024](/content/2024/10/08/README.md))
- awesome-polars - polars-strsim - Polars plugin that computes string similarity measures directly on a Polars dataframe by [@foxcroftjn](https://github.com/foxcroftjn). (Libraries/Packages/Scripts / Polars plugins)
README
# String Similarity Measures for Polars
This package provides Python bindings to compute various string similarity measures directly on a Polars DataFrame. All string similarity measures are implemented in Rust and computed in parallel.
The similarity measures that have been implemented are:
- Levenshtein
- Jaro
- Jaro-Winkler
- Jaccard
- Sørensen-Dice

Each similarity measure returns a value normalized between 0.0 and 1.0 (inclusive), where 0.0 indicates the inputs are maximally different and 1.0 means the strings are maximally similar.
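To make the normalization concrete, here is a minimal pure-Python sketch of a normalized Levenshtein similarity. It assumes the score is `1 - distance / length of the longer string`; the README does not state the exact formula, but this assumption reproduces the `levenshtein` values in the example output further down. The function names are illustrative and are not part of the `polars_strsim` API.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep on a match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer input."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_similarity("phillips", "philips"))  # 0.875
```

Dropping one `l` from `phillips` is a single edit against a longest length of 8, which gives 1 - 1/8 = 0.875.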
## Installing the Library
### With pip
```bash
pip install polars-strsim
```
### From Source
To build and install this library from source, first ensure you have [cargo](https://doc.rust-lang.org/cargo/getting-started/installation.html) installed. You will also need maturin, which you can install via `pip install 'maturin[patchelf]'`.
polars-strsim can then be installed in your current Python environment by running `maturin develop --release`.
## Using the Library
**Input:**
```python
import polars as pl
from polars_strsim import levenshtein, jaro, jaro_winkler, jaccard, sorensen_dice

# Build a small frame covering identical, near-identical, empty, and null inputs,
# then add one column per similarity measure.
df = pl.DataFrame(
    {
        "name_a": ["phillips", "phillips", ""        , "", None      , None],
        "name_b": ["phillips", "philips" , "phillips", "", "phillips", None],
    }
).with_columns(
    levenshtein=levenshtein("name_a", "name_b"),
    jaro=jaro("name_a", "name_b"),
    jaro_winkler=jaro_winkler("name_a", "name_b"),
    jaccard=jaccard("name_a", "name_b"),
    sorensen_dice=sorensen_dice("name_a", "name_b"),
)

# Render the table with ASCII borders
with pl.Config(ascii_tables=True):
    print(df)
```
**Output:**
```
shape: (6, 7)
+----------+----------+-------------+----------+--------------+---------+---------------+
| name_a | name_b | levenshtein | jaro | jaro_winkler | jaccard | sorensen_dice |
| --- | --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 | f64 |
+=======================================================================================+
| phillips | phillips | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| phillips | philips | 0.875 | 0.958333 | 0.975 | 0.875 | 0.933333 |
| | phillips | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| null | phillips | null | null | null | null | null |
| null | null | null | null | null | null | null |
+----------+----------+-------------+----------+--------------+---------+---------------+
```
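
The `jaccard` and `sorensen_dice` values in the second row (0.875 and 0.933333) can be reproduced by treating each string as a multiset (bag) of characters. The README does not say how the library tokenizes its inputs, so the sketch below is an assumption that happens to match the table, not a description of the actual implementation. The function names are hypothetical.

```python
from collections import Counter


def jaccard_multiset(a: str, b: str) -> float:
    """Jaccard similarity over character multisets: |A ∩ B| / |A ∪ B|."""
    bag_a, bag_b = Counter(a), Counter(b)
    union = sum((bag_a | bag_b).values())  # per-character max of counts
    if union == 0:
        return 1.0  # both strings empty
    intersection = sum((bag_a & bag_b).values())  # per-character min of counts
    return intersection / union


def sorensen_dice_multiset(a: str, b: str) -> float:
    """Sørensen-Dice similarity over character multisets: 2|A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # both strings empty
    intersection = sum((Counter(a) & Counter(b)).values())
    return 2 * intersection / total


print(jaccard_multiset("phillips", "philips"))                  # 0.875
print(round(sorensen_dice_multiset("phillips", "philips"), 6))  # 0.933333
```

Under this reading, `phillips` and `philips` share 7 characters out of a union of 8, giving 7/8 for Jaccard and 2·7/(8+7) = 14/15 for Sørensen-Dice. The null rows in the table are not modeled here; in the output above, a null in either input yields a null result.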