{"id":16599521,"url":"https://github.com/mrpowers/spark-stringmetric","last_synced_at":"2025-03-21T13:32:39.792Z","repository":{"id":54259514,"uuid":"102420477","full_name":"MrPowers/spark-stringmetric","owner":"MrPowers","description":"Spark functions to run popular phonetic and string matching algorithms","archived":false,"fork":false,"pushed_at":"2022-02-22T19:30:27.000Z","size":468,"stargazers_count":59,"open_issues_count":1,"forks_count":6,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-10-13T00:11:49.582Z","etag":null,"topics":["cosine-distance","double-metaphone","fuzzy-score","hamming-distance","jaccard-similarity","jaro-winkler","nysiis","refined-soundex","spark"],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MrPowers.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-09-05T01:49:24.000Z","updated_at":"2024-09-29T14:00:03.000Z","dependencies_parsed_at":"2022-08-13T10:20:09.684Z","dependency_job_id":null,"html_url":"https://github.com/MrPowers/spark-stringmetric","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MrPowers%2Fspark-stringmetric","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MrPowers%2Fspark-stringmetric/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MrPowers%2Fspark-stringmetric/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MrPowers%2Fspark-stringmetric/manifests","owner_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/owners/MrPowers","download_url":"https://codeload.github.com/MrPowers/spark-stringmetric/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221815774,"owners_count":16885223,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cosine-distance","double-metaphone","fuzzy-score","hamming-distance","jaccard-similarity","jaro-winkler","nysiis","refined-soundex","spark"],"created_at":"2024-10-12T00:11:47.608Z","updated_at":"2024-10-28T10:14:20.832Z","avatar_url":"https://github.com/MrPowers.png","language":"Scala","readme":"# spark-stringmetric\n\n[![CI](https://github.com/MrPowers/spark-stringmetric/actions/workflows/ci.yml/badge.svg)](https://github.com/MrPowers/spark-stringmetric/actions/workflows/ci.yml)\n\nString similarity functions and phonetic algorithms for Spark.\n\nSee [ceja](https://github.com/MrPowers/ceja) if you're using PySpark.\n\n## Project Setup\n\nUpdate your `build.sbt` file to import the libraries.\n\n```\nlibraryDependencies += \"org.apache.commons\" % \"commons-text\" % \"1.1\"\n\n// Spark 3\nlibraryDependencies += \"com.github.mrpowers\" %% \"spark-stringmetric\" % \"0.4.0\"\n\n// Spark 2\nlibraryDependencies += \"com.github.mrpowers\" %% \"spark-stringmetric\" % \"0.3.0\"\n```\n\nYou can find the spark-stringmetric [Scala 2.11 versions here](https://repo1.maven.org/maven2/com/github/mrpowers/spark-stringmetric_2.11/) and the [Scala 2.12 versions here](https://repo1.maven.org/maven2/com/github/mrpowers/spark-stringmetric_2.12/).\n\n## SimilarityFunctions\n\n* `cosine_distance`\n* 
`fuzzy_score`\n* `hamming`\n* `jaccard_similarity`\n* `jaro_winkler`\n\nHow to import the functions.\n\n```scala\nimport com.github.mrpowers.spark.stringmetric.SimilarityFunctions._\n```\n\nHere's an example on how to use the `jaccard_similarity` function.\n\nSuppose we have the following `sourceDF`:\n\n```\n+-------+-------+\n|  word1|  word2|\n+-------+-------+\n|  night|  nacht|\n|context|contact|\n|   null|  nacht|\n|   null|   null|\n+-------+-------+\n```\n\nLet's run the `jaccard_similarity` function.\n\n```scala\nval actualDF = sourceDF.withColumn(\n  \"w1_w2_jaccard\",\n  jaccard_similarity(col(\"word1\"), col(\"word2\"))\n)\n```\n\nWe can run `actualDF.show()` to view the `w1_w2_jaccard` column that's been appended to the DataFrame.\n\n```\n+-------+-------+-------------+\n|  word1|  word2|w1_w2_jaccard|\n+-------+-------+-------------+\n|  night|  nacht|         0.43|\n|context|contact|         0.57|\n|   null|  nacht|         null|\n|   null|   null|         null|\n+-------+-------+-------------+\n```\n\n## PhoneticAlgorithms\n\n* `double_metaphone`\n* `nysiis`\n* `refined_soundex`\n\nHow to import the functions.\n\n```scala\nimport com.github.mrpowers.spark.stringmetric.PhoneticAlgorithms._\n```\n\nHere's an example on how to use the `refined_soundex` function.\n\nSuppose we have the following `sourceDF`:\n\n```\n+-----+\n|word1|\n+-----+\n|night|\n|  cat|\n| null|\n+-----+\n```\n\nLet's run the `refined_soundex` function.\n\n```scala\nval actualDF = sourceDF.withColumn(\n  \"word1_refined_soundex\",\n  refined_soundex(col(\"word1\"))\n)\n```\n\nWe can run `actualDF.show()` to view the `word1_refined_soundex` column that's been appended to the DataFrame.\n\n```\n+-----+---------------------+\n|word1|word1_refined_soundex|\n+-----+---------------------+\n|night|               N80406|\n|  cat|                 C306|\n| null|                 null|\n+-----+---------------------+\n```\n\n## API Documentation\n\n[Here is the latest API 
documentation](https://mrpowers.github.io/spark-stringmetric/latest/api/#package).\n\n## Release\n\n1. Create GitHub tag\n\n2. Build documentation with `sbt ghpagesPushSite`\n\n3. Publish JAR\n\nRun `sbt` to open the SBT console.\n\nRun `\u003e ; + publishSigned; sonatypeBundleRelease` to create the JAR files and release them to Maven.  These commands are made available by the [sbt-sonatype](https://github.com/xerial/sbt-sonatype) plugin.\n\nAfter running the release command, you'll be prompted to enter your GPG passphrase.\n\nThe Sonatype credentials should be stored in the `~/.sbt/sonatype_credentials` file in this format:\n\n```\nrealm=Sonatype Nexus Repository Manager\nhost=oss.sonatype.org\nuser=$USERNAME\npassword=$PASSWORD\n```\n\n## Post Maven release steps\n\n* Create a GitHub release/tag\n* Publish the updated documentation\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrpowers%2Fspark-stringmetric","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmrpowers%2Fspark-stringmetric","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrpowers%2Fspark-stringmetric/lists"}