# spark-stringmetric

[![CI](https://github.com/MrPowers/spark-stringmetric/actions/workflows/ci.yml/badge.svg)](https://github.com/MrPowers/spark-stringmetric/actions/workflows/ci.yml)

String similarity functions and phonetic algorithms for Spark.

See [ceja](https://github.com/MrPowers/ceja) if you're using PySpark.

## Project Setup

Update your `build.sbt` file to add the library dependencies.

```scala
libraryDependencies += "org.apache.commons" % "commons-text" % "1.1"

// Spark 3
libraryDependencies += "com.github.mrpowers" %% "spark-stringmetric" % "0.4.0"

// Spark 2
libraryDependencies += "com.github.mrpowers" %% "spark-stringmetric" % "0.3.0"
```

You can find the spark-stringmetric [Scala 2.11 versions here](https://repo1.maven.org/maven2/com/github/mrpowers/spark-stringmetric_2.11/) and the [Scala 2.12 versions here](https://repo1.maven.org/maven2/com/github/mrpowers/spark-stringmetric_2.12/).

## SimilarityFunctions

* `cosine_distance`
* `fuzzy_score`
* `hamming`
* `jaccard_similarity`
* `jaro_winkler`

Here's how to import the functions:

```scala
import com.github.mrpowers.spark.stringmetric.SimilarityFunctions._
```

Here's an example of how to use the `jaccard_similarity` function.

Suppose we have the following `sourceDF`:

```
+-------+-------+
|  word1|  word2|
+-------+-------+
|  night|  nacht|
|context|contact|
|   null|  nacht|
|   null|   null|
+-------+-------+
```
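
If you want to follow along, a DataFrame like this can be built in a Spark shell. This is a minimal sketch, assuming an active `SparkSession` named `spark`; the column names match the table above:

```scala
import spark.implicits._

// Hypothetical setup for the example; Option models the null cells
val sourceDF = Seq(
  (Some("night"), Some("nacht")),
  (Some("context"), Some("contact")),
  (None, Some("nacht")),
  (None, None)
).toDF("word1", "word2")
```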

Let's run the `jaccard_similarity` function.

```scala
val actualDF = sourceDF.withColumn(
  "w1_w2_jaccard",
  jaccard_similarity(col("word1"), col("word2"))
)
```

We can run `actualDF.show()` to view the `w1_w2_jaccard` column that's been appended to the DataFrame.

```
+-------+-------+-------------+
|  word1|  word2|w1_w2_jaccard|
+-------+-------+-------------+
|  night|  nacht|         0.43|
|context|contact|         0.57|
|   null|  nacht|         null|
|   null|   null|         null|
+-------+-------+-------------+
```
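
The other similarity functions follow the same columns-in, column-out pattern. For instance, `jaro_winkler` can be applied the same way (a sketch reusing the `sourceDF` above; the output column name is arbitrary):

```scala
val withJaroWinkler = sourceDF.withColumn(
  "w1_w2_jaro_winkler",
  jaro_winkler(col("word1"), col("word2"))
)
```

As with `jaccard_similarity`, rows where either input is `null` yield a `null` result.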

## PhoneticAlgorithms

* `double_metaphone`
* `nysiis`
* `refined_soundex`

Here's how to import the functions:

```scala
import com.github.mrpowers.spark.stringmetric.PhoneticAlgorithms._
```

Here's an example of how to use the `refined_soundex` function.

Suppose we have the following `sourceDF`:

```
+-----+
|word1|
+-----+
|night|
|  cat|
| null|
+-----+
```
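
As before, this single-column DataFrame can be built in a Spark shell (a sketch, assuming an active `SparkSession` named `spark`):

```scala
import spark.implicits._

// Hypothetical setup for the example; None models the null cell
val sourceDF = Seq(Some("night"), Some("cat"), None).toDF("word1")
```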

Let's run the `refined_soundex` function.

```scala
val actualDF = sourceDF.withColumn(
  "word1_refined_soundex",
  refined_soundex(col("word1"))
)
```

We can run `actualDF.show()` to view the `word1_refined_soundex` column that's been appended to the DataFrame.

```
+-----+---------------------+
|word1|word1_refined_soundex|
+-----+---------------------+
|night|               N80406|
|  cat|                 C306|
| null|                 null|
+-----+---------------------+
```
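
The other phonetic algorithms are applied the same way. For example, `double_metaphone` takes a single string column (a sketch reusing the `sourceDF` above; the output column name is arbitrary):

```scala
val withDoubleMetaphone = sourceDF.withColumn(
  "word1_double_metaphone",
  double_metaphone(col("word1"))
)
```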

## API Documentation

[Here is the latest API documentation](https://mrpowers.github.io/spark-stringmetric/latest/api/#package).

## Release

1. Create GitHub tag

2. Build documentation with `sbt ghpagesPushSite`

3. Publish JAR

Run `sbt` to open the SBT console.

Run `> ; + publishSigned; sonatypeBundleRelease` to create the JAR files and release them to Maven. These commands are made available by the [sbt-sonatype](https://github.com/xerial/sbt-sonatype) plugin.

After running the release command, you'll be prompted to enter your GPG passphrase.

The Sonatype credentials should be stored in the `~/.sbt/sonatype_credentials` file in this format:

```
realm=Sonatype Nexus Repository Manager
host=oss.sonatype.org
user=$USERNAME
password=$PASSWORD
```

## Post Maven Release Steps

* Create a GitHub release/tag
* Publish the updated documentation