https://github.com/mrpowers/ceja

PySpark phonetic and string matching algorithms
damerau-levenshtein hamming-distance jaro-similarity jaro-winkler match-rating-comparisons metaphone nysiis porter-stemmer pyspark

Host: GitHub
URL: https://github.com/mrpowers/ceja
Owner: MrPowers
License: mit
Created: 2020-06-06T15:07:16.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2024-02-19T04:25:45.000Z (9 months ago)
Last Synced: 2024-10-13T00:11:46.248Z (about 1 month ago)
Topics: damerau-levenshtein, hamming-distance, jaro-similarity, jaro-winkler, match-rating-comparisons, metaphone, nysiis, porter-stemmer, pyspark
Language: Python
Homepage:
Size: 32.2 KB
Stars: 35
Watchers: 3
Forks: 5
Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE.md

README

# ceja

[![image](https://github.com/MrPowers/ceja/workflows/build/badge.svg)](https://github.com/MrPowers/ceja/actions/workflows/ci.yml/badge.svg)

![PyPI - Downloads](https://img.shields.io/pypi/dm/ceja)

[![PyPI version](https://badge.fury.io/py/ceja.svg)](https://badge.fury.io/py/ceja)

PySpark phonetic, stemming, and string matching algorithms. Use the power of PySpark to run these algorithms on massive datasets!

## Installation and basic usage

Run `pip install ceja` to install the library.

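Import the library with `import ceja`. After importing, you can call functions such as `ceja.nysiis` and `ceja.jaro_winkler_similarity`. The examples below assume an active SparkSession named `spark` and that `col` has been imported from `pyspark.sql.functions`.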

## Public interface summary

* Phonetic algorithms
  * nysiis
  * metaphone
  * match_rating_codex
* Stemming
  * porter_stem
* String similarity
  * damerau_levenshtein_distance
  * hamming_distance
  * jaro_similarity
  * jaro_winkler_similarity
  * match_rating_comparison

## Phonetic algorithms

### NYSIIS

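```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

import ceja

# Reuse the active session or start a new one
spark = SparkSession.builder.getOrCreate()

data = [
    ("jellyfish",),
    ("li",),
    ("luisa",),
    (None,)
]
df = spark.createDataFrame(data, ["word"])
# A null input yields a null code
actual_df = df.withColumn("word_nysiis", ceja.nysiis(col("word")))
actual_df.show()
```

```
+---------+-----------+
|     word|word_nysiis|
+---------+-----------+
|jellyfish|      JALYF|
|       li|          L|
|    luisa|        LAS|
|     null|       null|
+---------+-----------+
```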

### Metaphone

```python
data = [
    ("jellyfish",),
    ("li",),
    ("luisa",),
    ("Klumpz",),
    ("Clumps",),
    (None,)
]
df = spark.createDataFrame(data, ["word"])
actual_df = df.withColumn("word_metaphone", ceja.metaphone(col("word")))
actual_df.show()
```

```
+---------+--------------+
|     word|word_metaphone|
+---------+--------------+
|jellyfish|          JLFX|
|       li|             L|
|    luisa|            LS|
|   Klumpz|         KLMPS|
|   Clumps|         KLMPS|
|     null|          null|
+---------+--------------+
```

### Match rating codex

```python
data = [
    ("jellyfish",),
    ("li",),
    ("luisa",),
    (None,)
]
df = spark.createDataFrame(data, ["word"])
actual_df = df.withColumn("word_match_rating_codex", ceja.match_rating_codex(col("word")))
actual_df.show()
```

```
+---------+-----------------------+
|     word|word_match_rating_codex|
+---------+-----------------------+
|jellyfish|                 JLYFSH|
|       li|                      L|
|    luisa|                     LS|
|     null|                   null|
+---------+-----------------------+
```
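
To see where codes like `JLYFSH` come from, here is a minimal plain-Python sketch of the commonly described match rating encoding rules: drop vowels except a leading one, collapse repeated consonants, and keep the first three and last three letters of long codes. `mra_codex_sketch` is a hypothetical helper for illustration, not part of ceja, and ceja's results may differ on edge cases such as punctuation or spaces.

```python
VOWELS = set("AEIOU")


def mra_codex_sketch(word):
    """Illustrative match rating codex; returns None for None input."""
    if word is None:
        return None
    # Assumption: ignore anything that is not a letter
    letters = [c for c in word.upper() if c.isalpha()]
    kept = []
    prev = None
    for i, c in enumerate(letters):
        is_leading_vowel = i == 0 and c in VOWELS
        # Keep a consonant unless it repeats the previous letter
        is_new_consonant = c not in VOWELS and c != prev
        if is_leading_vowel or is_new_consonant:
            kept.append(c)
        prev = c
    # Long codes are cut down to the first three and last three letters
    if len(kept) > 6:
        kept = kept[:3] + kept[-3:]
    return "".join(kept)


for w in ["jellyfish", "li", "luisa", None]:
    print(w, "->", mra_codex_sketch(w))
```

On the three sample words this sketch gives the same codes as the table above.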

## Stemming algorithms

### Porter stem

```python

data = [
    ("chocolates",),
    ("chocolatey",),
    ("choco",),
    (None,)
]
df = spark.createDataFrame(data, ["word"])
actual_df = df.withColumn("word_porter_stem", ceja.porter_stem(col("word")))
actual_df.show()
```

```
+----------+----------------+
|      word|word_porter_stem|
+----------+----------------+
|chocolates|          chocol|
|chocolatey|      chocolatei|
|     choco|           choco|
|      null|            null|
+----------+----------------+
```

## Similarity algorithms

### Damerau-Levenshtein distance

```python
data = [
    ("jellyfish", "smellyfish"),
    ("li", "lee"),
    ("luisa", "bruna"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("damerau_levenshtein_distance", ceja.damerau_levenshtein_distance(col("word1"), col("word2")))
actual_df.show()
```

```
+---------+----------+----------------------------+
|    word1|     word2|damerau_levenshtein_distance|
+---------+----------+----------------------------+
|jellyfish|smellyfish|                           2|
|       li|       lee|                           2|
|    luisa|     bruna|                           4|
|     null|      null|                        null|
+---------+----------+----------------------------+
```
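
The distances in the table can be checked without Spark. The following is a minimal plain-Python sketch of the optimal string alignment variant of Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and swaps of adjacent characters. `osa_distance` is a hypothetical helper for illustration, not part of ceja, and ceja's own results may differ on inputs where the restricted and unrestricted variants disagree.

```python
def osa_distance(s1, s2):
    """Optimal string alignment distance; returns None if either input is None."""
    if s1 is None or s2 is None:
        return None
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] = edits needed to turn s1[:i] into s2[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as one edit
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for a, b in [("jellyfish", "smellyfish"), ("li", "lee"), ("luisa", "bruna")]:
    print(a, b, osa_distance(a, b))
```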

### Hamming distance

```python
data = [
    ("jellyfish", "smellyfish"),
    ("li", "lee"),
    ("luisa", "bruna"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("hamming_distance", ceja.hamming_distance(col("word1"), col("word2")))
actual_df.show()
```

```
+---------+----------+----------------+
|    word1|     word2|hamming_distance|
+---------+----------+----------------+
|jellyfish|smellyfish|               9|
|       li|       lee|               2|
|    luisa|     bruna|               4|
|     null|      null|            null|
+---------+----------+----------------+
```
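
Classic Hamming distance is only defined for strings of equal length, yet the table includes pairs of different lengths. The values shown are consistent with counting each unmatched trailing character as one extra difference. Here is a minimal plain-Python sketch under that assumption; `padded_hamming` is a hypothetical helper for illustration, not part of ceja.

```python
def padded_hamming(s1, s2):
    """Position-wise mismatches plus the length difference; None if either input is None."""
    if s1 is None or s2 is None:
        return None
    shorter, longer = sorted((s1, s2), key=len)
    # Compare characters position by position over the shared length
    mismatches = sum(a != b for a, b in zip(shorter, longer))
    # Every extra character in the longer string counts as one difference
    return mismatches + (len(longer) - len(shorter))


for a, b in [("jellyfish", "smellyfish"), ("li", "lee"), ("luisa", "bruna")]:
    print(a, b, padded_hamming(a, b))
```

This explains the 9 for `jellyfish` and `smellyfish`: the extra leading `s` shifts every later character, so eight of the nine compared positions differ, and the length difference adds one more.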

### Jaro similarity

```python
data = [
    ("jellyfish", "smellyfish"),
    ("li", "lee"),
    ("luisa", "bruna"),
    ("hi", "colombia"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("jaro_similarity", ceja.jaro_similarity(col("word1"), col("word2")))
actual_df.show()
```

```
+---------+----------+---------------+
|    word1|     word2|jaro_similarity|
+---------+----------+---------------+
|jellyfish|smellyfish|      0.8962963|
|       li|       lee|      0.6111111|
|    luisa|     bruna|            0.6|
|       hi|  colombia|            0.0|
|     null|      null|           null|
+---------+----------+---------------+
```
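
For readers who want to see how these scores arise, here is a minimal plain-Python sketch of the textbook Jaro similarity: characters match if they are equal and no further apart than half the longer length minus one, and half the number of out-of-order matches counts as transpositions. `jaro_sketch` is a hypothetical helper for illustration, not part of ceja; it is a sketch of the standard definition rather than ceja's implementation.

```python
def jaro_sketch(s1, s2):
    """Textbook Jaro similarity in [0, 1]; returns None if either input is None."""
    if s1 is None or s2 is None:
        return None
    # Matching characters may be at most this many positions apart
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters in order of appearance in each string
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len(s1)
        + matches / len(s2)
        + (matches - transpositions) / matches
    ) / 3


for a, b in [("jellyfish", "smellyfish"), ("li", "lee"), ("luisa", "bruna"), ("hi", "colombia")]:
    print(a, b, round(jaro_sketch(a, b), 7))
```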

### Jaro-Winkler similarity

```python
data = [
    ("jellyfish", "smellyfish"),
    ("li", "lee"),
    ("luisa", "bruna"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("jaro_winkler_similarity", ceja.jaro_winkler_similarity(col("word1"), col("word2")))
actual_df.show()
```

```
+---------+----------+-----------------------+
|    word1|     word2|jaro_winkler_similarity|
+---------+----------+-----------------------+
|jellyfish|smellyfish|              0.8962963|
|       li|       lee|              0.6111111|
|    luisa|     bruna|                    0.6|
|     null|      null|                   null|
+---------+----------+-----------------------+
```
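
The scores in this table equal the plain Jaro scores for the same pairs. That is what you would expect if the Winkler prefix bonus is only applied above a boost threshold: `jellyfish`/`smellyfish` and `luisa`/`bruna` share no prefix, and `li`/`lee` scores below a 0.7 threshold. The sketch below illustrates that adjustment in plain Python. `winkler_adjust` and its parameters (`scaling`, `boost_threshold`, `max_prefix`) are hypothetical names, and the threshold behaviour is an assumption about the underlying implementation, not something the README states.

```python
def winkler_adjust(jaro, s1, s2, scaling=0.1, boost_threshold=0.7, max_prefix=4):
    """Apply a Winkler-style prefix bonus to a precomputed Jaro score."""
    if jaro is None:
        return None
    # Assumption: scores at or below the threshold are returned unchanged
    if jaro <= boost_threshold:
        return jaro
    # Length of the common prefix, capped at max_prefix characters
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling * (1.0 - jaro)


print(winkler_adjust(0.6111111, "li", "lee"))                # below threshold, unchanged
print(winkler_adjust(0.8962963, "jellyfish", "smellyfish"))  # no shared prefix, unchanged
print(round(winkler_adjust(17 / 18, "MARTHA", "MARHTA"), 7)) # shared prefix "MAR" raises the score
```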

### Match rating comparison

```python
data = [
    ("mat", "matt"),
    ("there", "their"),
    ("luisa", "bruna"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("match_rating_comparison", ceja.match_rating_comparison(col("word1"), col("word2")))
actual_df.show()
```

```
+-----+-----+-----------------------+
|word1|word2|match_rating_comparison|
+-----+-----+-----------------------+
|  mat| matt|                   true|
|there|their|                   true|
|luisa|bruna|                  false|
| null| null|                   null|
+-----+-----+-----------------------+
```

## Contributing

Contributions are welcome and encouraged.  Feel free to open issues or send pull requests.

If you make a lot of good contributions, you'll be granted push access to the repo.

The most valuable contributions would be implementations of these functions as native Spark functions.