https://github.com/vickumar1981/stringdistance
A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.
- Host: GitHub
- URL: https://github.com/vickumar1981/stringdistance
- Owner: vickumar1981
- License: other
- Created: 2017-03-02T03:51:46.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2022-04-25T20:24:06.000Z (over 3 years ago)
- Last Synced: 2025-08-31T23:58:41.586Z (about 1 month ago)
- Topics: cosine-similarity, cosine-similarity-scores, dice-coefficient, fuzzy-matching, hacktoberfest, hamming-distance, jaccard, jaccard-similarity, jaro, jaro-distance, jaro-winkler, jaro-winkler-distance, levenshtein, levenshtein-distance, longest-common-subsequence, ngram, sorensen-dice-distance, soundex, soundex-algorithm, string-similarity
- Language: Scala
- Homepage: https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/index.html
- Size: 1.27 MB
- Stars: 81
- Watchers: 5
- Forks: 14
- Open Issues: 16
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: .github/CODE_OF_CONDUCT.md

# StringDistance
[Build status](https://app.travis-ci.com/github/vickumar1981/stringdistance/builds) [Coverage](https://coveralls.io/github/vickumar1981/stringdistance?branch=master) [API documentation](https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/index.html) [Maven repository](https://mvnrepository.com/artifact/com.github.vickumar1981/stringdistance) [License](LICENSE.md)
A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.
Works with generalized arrays.
For more detailed information, please refer to the [API Documentation](https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/index.html "API Documentation").
Requires: Java 8+ or Scala 2.11+
---
### Contents
1. [Add it to your project](https://github.com/vickumar1981/stringdistance#1-add-it-to-your-project-)
2. [Using in Scala](https://github.com/vickumar1981/stringdistance#2-scala-usage)
3. [Using in Scala with implicits](https://github.com/vickumar1981/stringdistance#3-scala-use-with-implicits)
4. [Using in Java](https://github.com/vickumar1981/stringdistance#4-java-usage)
5. [Using with Arrays](https://github.com/vickumar1981/stringdistance#5-using-with-arrays)
6. [Adding your own algorithm](https://github.com/vickumar1981/stringdistance#6-adding-your-own-distance-or-scoring-algorithm)
7. [Reporting an Issue](https://github.com/vickumar1981/stringdistance#7-reporting-an-issue)
8. [Contributing](https://github.com/vickumar1981/stringdistance#8-contributing)
9. [License](https://github.com/vickumar1981/stringdistance#9-license)

---
### 1. Add it to your project ...
__Using sbt:__
In `build.sbt`:
```scala
libraryDependencies += "com.github.vickumar1981" %% "stringdistance" % "1.2.7"
```
__Using Gradle:__
In `build.gradle`:
```groovy
dependencies {
    // `implementation` replaces the `compile` configuration, which was removed in Gradle 7
    implementation 'com.github.vickumar1981:stringdistance_2.13:1.2.7'
}
```
__Using Maven:__
In `pom.xml`:
```xml
<dependency>
    <groupId>com.github.vickumar1981</groupId>
    <artifactId>stringdistance_2.13</artifactId>
    <version>1.2.7</version>
</dependency>
```
**Notes:**
- For Scala 2.12, please use the `stringdistance_2.12` artifact as a dependency instead.
- For Scala 2.11, please use the `stringdistance_2.11` artifact as a dependency instead.

---
### 2. Scala Usage
__Example.scala__:
```scala
// Scala example
import com.github.vickumar1981.stringdistance.StringDistance._
import com.github.vickumar1981.stringdistance.StringSound._
import com.github.vickumar1981.stringdistance.impl.{ConstantGap, LinearGap}

// Cosine Similarity
val cosSimilarity: Double = Cosine.score("hello", "chello") // 0.935

// Damerau-Levenshtein Distance
val damerauDist: Int = Damerau.distance("martha", "marhta") // 1
val damerau: Double = Damerau.score("martha", "marhta") // 0.833

// Dice Coefficient
val diceCoefficient: Double = DiceCoefficient.score("martha", "marhta") // 0.4

// Hamming Distance
val hammingDist: Int = Hamming.distance("martha", "marhta") // 2
val hamming: Double = Hamming.score("martha", "marhta") // 0.667

// Jaccard Similarity
val jaccard: Double = Jaccard.score("karolin", "kathrin", 1)

// Jaro and Jaro Winkler
val jaro: Double = Jaro.score("martha", "marhta") // 0.944
val jaroWinkler: Double = JaroWinkler.score("martha", "marhta", 0.1) // 0.961

// Levenshtein Distance
val levenshteinDist: Int = Levenshtein.distance("martha", "marhta") // 2
val levenshtein: Double = Levenshtein.score("martha", "marhta") // 0.667

// Longest Common Subsequence
val longestCommonSubSeq: Int = LongestCommonSeq.distance("martha", "marhta") // 5

// Needleman Wunsch
val needlemanWunsch: Double = NeedlemanWunsch.score("martha", "marhta", ConstantGap()) // 0.667

// N-Gram Similarity and Distance
val ngramDist: Int = NGram.distance("karolin", "kathrin", 1) // 5
val bigramDist: Int = NGram.distance("karolin", "kathrin", 2) // 2
val ngramSimilarity: Double = NGram.score("karolin", "kathrin", 1) // 0.714
val bigramSimilarity: Double = NGram.score("karolin", "kathrin", 2) // 0.333

// N-Gram tokens, returns a List[String]
val tokens: List[String] = NGram.tokens("martha", 2) // List("ma", "ar", "rt", "th", "ha")

// Overlap Similarity
val overlap: Double = Overlap.score("karolin", "kathrin", 1) // 0.286
val overlapBiGram: Double = Overlap.score("karolin", "kathrin", 2) // 0.667

// Smith Waterman Similarities
val smithWaterman: Double = SmithWaterman.score("martha", "marhta", (LinearGap(gapValue = -1), Integer.MAX_VALUE))
val smithWatermanGotoh: Double = SmithWatermanGotoh.score("martha", "marhta", ConstantGap())

// Tversky Similarity
val tversky: Double = Tversky.score("karolin", "kathrin", 0.5) // 0.333

// Phonetic Similarity
val metaphone: Boolean = Metaphone.score("merci", "mercy") // true
val soundex: Boolean = Soundex.score("merci", "mercy") // true
```
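To make the relationship between `distance` and `score` concrete, here is a minimal, library-independent sketch in Java. The class name `LevenshteinSketch` is hypothetical, and the normalization `1 - distance / max(length)` is an assumption chosen because it reproduces the commented values above (2 and 0.667 for "martha" and "marhta"); the library's own implementation may differ.

```java
// Hypothetical illustration; not part of the stringdistance API.
final class LevenshteinSketch {
    // Classic dynamic-programming edit distance using two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1 - distance / length of the longer string.
    static double score(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }
}
```

Under this assumption, a score of 1.0 means the inputs are identical and 0.0 means every position had to be edited.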
---
### 3. Scala: Use with Implicits
- To use implicits and extend the String class: `import com.github.vickumar1981.stringdistance.StringConverter._`

__Example.scala__
```scala
// Scala example using implicits
import com.github.vickumar1981.stringdistance.StringConverter._

// Scores between two strings
val cosSimilarity: Double = "hello".cosine("chello")
val damerau: Double = "martha".damerau("marhta")
val diceCoefficient: Double = "martha".diceCoefficient("marhta")
val hamming: Double = "martha".hamming("marhta")
val jaccard: Double = "karolin".jaccard("kathrin")
val jaro: Double = "martha".jaro("marhta")
val jaroWinkler: Double = "martha".jaroWinkler("marhta")
val levenshtein: Double = "martha".levenshtein("marhta")
val needlemanWunsch: Double = "martha".needlemanWunsch("marhta")
val ngramSimilarity: Double = "karolin".nGram("kathrin")
val bigramSimilarity: Double = "karolin".nGram("kathrin", 2)
val overlap: Double = "karolin".overlap("kathrin")
val overlapBiGram: Double = "karolin".overlap("kathrin", 2)
val smithWaterman: Double = "martha".smithWaterman("marhta")
val smithWatermanGotoh: Double = "martha".smithWatermanGotoh("marhta")
val tversky: Double = "karolin".tversky("kathrin", 0.5)

// Distances between two strings
val damerauDist: Int = "martha".damerauDist("marhta") // 1
val hammingDist: Int = "martha".hammingDist("marhta")
val levenshteinDist: Int = "martha".levenshteinDist("marhta")
val longestCommonSeq: Int = "martha".longestCommonSeq("marhta")
val ngramDist: Int = "karolin".nGramDist("kathrin")
val bigramDist: Int = "karolin".nGramDist("kathrin", 2)

// N-Gram tokens, returns a List[String]
val tokens: List[String] = "martha".tokens(2) // List("ma", "ar", "rt", "th", "ha")

// Phonetic similarity of two strings
val metaphone: Boolean = "merci".metaphone("mercy")
val soundex: Boolean = "merci".soundex("mercy")
```
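The n-gram tokens shown above underlie several of the set-based scores. As a rough illustration, the sketch below builds n-gram tokens with a sliding window and derives a Dice coefficient from bigram sets. The class `NGramSketch` and its method names are hypothetical, and using sets of distinct bigrams is an assumption; it happens to reproduce the 0.4 shown for "martha" and "marhta", but the library may tokenize or weight differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of the stringdistance API.
final class NGramSketch {
    // Slide a window of width n across the string, one character at a time.
    static List<String> tokens(String s, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            result.add(s.substring(i, i + n));
        }
        return result;
    }

    // Dice coefficient over distinct bigrams: 2 * |common| / (|x| + |y|).
    static double dice(String a, String b) {
        Set<String> x = new HashSet<>(tokens(a, 2));
        Set<String> y = new HashSet<>(tokens(b, 2));
        int total = x.size() + y.size();
        if (total == 0) {
            return 0.0; // assumption: no bigrams on either side scores 0
        }
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        return 2.0 * common.size() / total;
    }
}
```

For "martha" and "marhta" the shared bigrams are "ma" and "ar", giving 2 * 2 / (5 + 5) = 0.4.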
---
### 4. Java Usage
- To use in Java: `import com.github.vickumar1981.stringdistance.util.StringDistance`

__Example.java__
```java
// Java example
import com.github.vickumar1981.stringdistance.util.StringDistance;
import com.github.vickumar1981.stringdistance.util.StringSound;
import java.util.List;

// Scores between two strings
Double cosSimilarity = StringDistance.cosine("hello", "chello");
Double damerau = StringDistance.damerau("martha", "marhta");
Double diceCoefficient = StringDistance.diceCoefficient("martha", "marhta");
Double hamming = StringDistance.hamming("martha", "marhta");
Double jaccard = StringDistance.jaccard("karolin", "kathrin");
Double jaro = StringDistance.jaro("martha", "marhta");
Double jaroWinkler = StringDistance.jaroWinkler("martha", "marhta");
Double levenshtein = StringDistance.levenshtein("martha", "marhta");
Double needlemanWunsch = StringDistance.needlemanWunsch("martha", "marhta");
Double ngramSimilarity = StringDistance.nGram("karolin", "kathrin");
Double bigramSimilarity = StringDistance.nGram("karolin", "kathrin", 2);
Double overlap = StringDistance.overlap("karolin", "kathrin");
Double overlapBiGram = StringDistance.overlap("karolin", "kathrin", 2);
Double smithWaterman = StringDistance.smithWaterman("martha", "marhta");
Double smithWatermanGotoh = StringDistance.smithWatermanGotoh("martha", "marhta");
Double tversky = StringDistance.tversky("karolin", "kathrin", 0.5);

// Distances between two strings
Integer damerauDist = StringDistance.damerauDist("martha", "marhta");
Integer hammingDist = StringDistance.hammingDist("martha", "marhta");
Integer levenshteinDist = StringDistance.levenshteinDist("martha", "marhta");
Integer longestCommonSeq = StringDistance.longestCommonSeq("martha", "marhta");
Integer ngramDist = StringDistance.nGramDist("karolin", "kathrin");
Integer bigramDist = StringDistance.nGramDist("karolin", "kathrin", 2);

// N-Gram tokens, returns a List<String>
List<String> tokens = StringDistance.nGramTokens("martha", 2); // List("ma", "ar", "rt", "th", "ha")

// Phonetic similarity of two strings
Boolean metaphone = StringSound.metaphone("merci", "mercy");
Boolean soundex = StringSound.soundex("merci", "mercy");
```
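For readers curious how scores such as `jaro` and `jaroWinkler` are commonly computed, the following is a self-contained sketch of the textbook formulation. `JaroSketch` is a hypothetical class, and the prefix cap of 4 characters is the conventional choice rather than something confirmed by this README; the scaling factor `p` corresponds to the `0.1` passed in the Scala example.

```java
// Hypothetical illustration; not part of the stringdistance API.
final class JaroSketch {
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        // Characters match only if they are equal and within this many positions.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Walk both matched sequences in order and count positions that disagree.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) {
                continue;
            }
            while (!bMatched[j]) {
                j++;
            }
            if (a.charAt(i) != b.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        double m = matches;
        double t = outOfOrder / 2.0; // each transposition is counted twice above
        return (m / a.length() + m / b.length() + (m - t) / m) / 3.0;
    }

    // Boosts the Jaro score for a shared prefix of up to 4 characters.
    static double jaroWinkler(String a, String b, double p) {
        double base = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return base + prefix * p * (1.0 - base);
    }
}
```

With "martha" and "marhta", all six characters match with one transposition, which gives roughly 0.944 for Jaro and, with the shared prefix "mar" and `p = 0.1`, roughly 0.961 for Jaro-Winkler.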
---
### 5. Using with Arrays
- You can use the [ArrayDistance](https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/ArrayDistance$.html) class just like the [StringDistance](https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/StringDistance$.html) class, but with a generic array: `Array[T]` for Scala and `T[]` for Java.
- Make sure your classes are comparable using `==` for Scala or `.equals` for Java.
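
The reason element equality matters can be seen in a small, library-independent sketch: a generic array algorithm can only compare elements through `equals`. Both `Token` and `ArrayLcsSketch` below are hypothetical names introduced for illustration, and the longest-common-subsequence routine is a standard dynamic-programming version, not the library's code.

```java
import java.util.Objects;

// Hypothetical element type; the value-based equals/hashCode is what lets elements match.
final class Token {
    private final String text;

    Token(String text) {
        this.text = text;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof Token && text.equals(((Token) other).text);
    }

    @Override
    public int hashCode() {
        return text.hashCode();
    }
}

// Hypothetical illustration; not part of the stringdistance API.
final class ArrayLcsSketch {
    // Length of the longest common subsequence, comparing elements with equals.
    static <T> int longestCommonSeq(T[] a, T[] b) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                dp[i][j] = Objects.equals(a[i - 1], b[j - 1])
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length][b.length];
    }
}
```

Without the `equals` override, two `Token` instances holding the same text would be compared by reference and never match, so every result would look as if the arrays had nothing in common.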
__Scala Sample Code:__
```scala
import com.github.vickumar1981.stringdistance.ArrayDistance._

// Example Levenshtein Distance and Score
val levenshteinDist = Levenshtein.distance(Array("m", "a", "r", "t", "h", "a"), Array("m", "a", "r", "h", "t", "a")) // 2
val levenshtein = Levenshtein.score(Array("m", "a", "r", "t", "h", "a"), Array("m", "a", "r", "h", "t", "a")) // 0.667
```
__Java Example Code:__
- [Implement .equals for your class](https://github.com/vickumar1981/stringdistance/blob/master/src/main/java/sd_example/Ch.java#L15)
- [Use ArrayDistance with your class](https://github.com/vickumar1981/stringdistance/blob/master/src/main/java/sd_example/SdJavaExample.java#L7)

---
### 6. Adding your own Distance or Scoring Algorithm
1. Create a marker trait that extends `StringMetricAlgorithm`:
```scala
trait CustomAlgorithm extends StringMetricAlgorithm
```
2. Create an implementation for that algorithm using an implicit object. Override either the `score` or the `distance` method, depending upon whether the object extends `DistanceAlgorithm` or `ScoringAlgorithm`.
```scala
implicit object CustomDistance extends DistanceAlgorithm[CustomAlgorithm] {
  override def distance(s1: String, s2: String): Int = {
    // Implement distance between s1 and s2 (??? is a placeholder so the stub compiles)
    ???
  }
}

implicit object CustomScore extends ScoringAlgorithm[CustomAlgorithm] {
  override def score(s1: String, s2: String): Double = {
    // Implement fuzzy score between s1 and s2 (??? is a placeholder so the stub compiles)
    ???
  }
}
```
3. Create an object that extends `StringMetric` using your algorithm as the type parameter, and use the `score` and `distance` methods defined in the implicit object.
```scala
object CustomMetric extends StringMetric[CustomAlgorithm]

val customScore: Double = CustomMetric.score("hello", "hello2")
val customDist: Int = CustomMetric.distance("hello", "hello2")
```
---
### 7. Reporting an Issue
Please report any issues or bugs to the [GitHub issues page](https://github.com/vickumar1981/stringdistance/issues).

---
### 8. Contributing
Please view the [contributing guidelines](CONTRIBUTING.md).

---
### 9. License
This project is licensed under the [Apache 2 License](LICENSE.md).