{"id":19133371,"url":"https://github.com/vasnake/spark.ml.spatialjointransformer","last_synced_at":"2026-02-03T12:10:59.127Z","repository":{"id":68754694,"uuid":"179492167","full_name":"vasnake/spark.ml.SpatialJoinTransformer","owner":"vasnake","description":"spark.ml.transformer: join two datasets using spatial relations","archived":false,"fork":false,"pushed_at":"2024-10-02T13:48:01.000Z","size":123,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-09T04:41:39.556Z","etag":null,"topics":["geospatial","join","ml-pipeline","python","scala","spark","spark-ml","spatial","transformer"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vasnake.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-04-04T12:22:45.000Z","updated_at":"2024-10-02T13:48:05.000Z","dependencies_parsed_at":"2023-05-23T21:30:51.415Z","dependency_job_id":null,"html_url":"https://github.com/vasnake/spark.ml.SpatialJoinTransformer","commit_stats":{"total_commits":79,"total_committers":1,"mean_commits":79.0,"dds":0.0,"last_synced_commit":"d48d1db3e641e86c3c7dcc0841321b2c7726fed7"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/vasnake/spark.ml.SpatialJoinTransformer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vasnake%2Fspark.ml.SpatialJoinTransformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vasnake%2Fspark.ml.SpatialJoinTransformer/tags","releases_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/vasnake%2Fspark.ml.SpatialJoinTransformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vasnake%2Fspark.ml.SpatialJoinTransformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vasnake","download_url":"https://codeload.github.com/vasnake/spark.ml.SpatialJoinTransformer/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vasnake%2Fspark.ml.SpatialJoinTransformer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263457199,"owners_count":23469294,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["geospatial","join","ml-pipeline","python","scala","spark","spark-ml","spatial","transformer"],"created_at":"2024-11-09T06:22:13.830Z","updated_at":"2026-02-03T12:10:59.091Z","avatar_url":"https://github.com/vasnake.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# spark.ml.SpatialJoinTransformer\n\nIt is a [spark.ml.Transformer](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/PipelineStage.html)\nthat joins input dataset with external data using\nSpatial Relations Predicates.\n\nTo perform spatial join, [SpatialSpark](https://github.com/vasnake/SpatialSpark)\nBroadcastSpatialJoin object is used.\nAlso, SpatialJoinTransformer depends on\n[LocationTech JTS](https://github.com/locationtech/jts) \nand [GeographicLib](https://sourceforge.net/projects/geographiclib/)\n\nProject was built and tested with Spark 2.4 and Scala 2.12\n\n## 
Installation\n\nYou can use binary packages from \n[releases](https://github.com/vasnake/spark.ml.SpatialJoinTransformer/releases)\npage or add a dependency to your sbt project:\n\n```scala\n// project/Build.scala\nobject Projects {\n  lazy val spatialJoinTransformer = RootProject(uri(\n    \"https://github.com/vasnake/spark.ml.SpatialJoinTransformer.git#v0.0.1\"))\n}\n\n// build.sbt\nlazy val root = (project in file(\".\")).settings(\n  ???\n).dependsOn(Projects.spatialJoinTransformer)\n\n```\n\nMaybe later I will consider publishing packages to some public repository.\nStay tuned.\n\n## Usage\n\nLet's say we have an `input` dataset that needs to be transformed, \nand some `external dataset` aka just `dataset`.\nTo perform a transformation, a spatial join to be exact, we need these datasets\nto have spatial information.\nEach dataset must have a column with WKT geometries or, in case of points,\ntwo columns with Longitude and Latitude coordinates.\n\nAnother requirement: `external dataset` must be registered in Spark SQL metastore/catalog.\nIt can be a Hive table or some previously registered DataFrame.\n\nShortest possible example: `external dataset` registered with name `poi` that\nhas point geometry columns: `lon`, `lat`.\nThe dataset to transform, `input`, also has point geometry, also located in columns `lon` and `lat`:\n\n```scala\nimport org.apache.spark.sql.DataFrame\nimport me.valik.spark.transformer.BroadcastSpatialJoin\n\nval input: DataFrame = ???\nval data: DataFrame = ???\ndata.createOrReplaceTempView(\"poi\")\nval transformer = new BroadcastSpatialJoin()\n    .setDataset(\"poi\")\n    .setDatasetPoint(\"lon, lat\")\n    .setInputPoint(\"lon, lat\")\n    .setDataColumns(\"poi_id\")\nval res = transformer.transform(input)\n```\n\nBy default the `nearest` predicate is used as the spatial relation and, attention, the `input`\ndataset will be broadcasted. 
It means that for each row from `poi`, the nearest point\nfrom `input` will be found and the `poi_id` attribute will be joined to that `input` row.\n\nYou can find more detailed examples with different parameters, conditions and predicates in\n[tests](https://github.com/vasnake/spark.ml.SpatialJoinTransformer/blob/master/src/test/scala/me/valik/spark/transformer/BroadcastSpatialJoinTest.scala)\n\nYou can find more information about Spark transformers in\n[documentation](https://spark.apache.org/docs/latest/ml-pipeline.html)\n\n### PySpark\nEvidently you can use the `BroadcastSpatialJoin` transformer in Scala or Java projects.\nThere is also a Python wrapper for use in a PySpark environment:\n\n```python\nfrom me.valik.spark.transformer import BroadcastSpatialJoin\npoi = spark.createDataFrame([(\"a\", 1.1, 3.1), (\"b\", 2.1, 5.1)], [\"poi_id\", \"lon\", \"lat\"])\npoi.createOrReplaceTempView(\"poi\")\ndf = spark.createDataFrame([(0, 1.0, 3.0), (2, 2.0, 5.0)], [\"id\", \"lon\", \"lat\"])\ntrans = BroadcastSpatialJoin(\n    dataset=\"poi\", dataColumns=\"poi_id\", datasetPoint=\"lon, lat\", inputPoint=\"lon, lat\")\nresult = trans.transform(df)\n```\n\n### Transformer parameters\nAll parameters are String parameters.\n\n`condition, setJoinCondition`\n:  experimental feature, it should be possible to apply an extra filter to each pair (input.row, dataset.row)\nfound by the spatial relation as join candidates, e.g. 
`fulldate between start_ts and end_ts`\n\n`filter, setDatasetFilter`\n:  SQL expression applied when `dataset` is loaded, in case you need to filter it before the join.\n\n`broadcast, setBroadcast`\n:  which dataset will be broadcasted, two possible values: `input` or `external`,\nby default it will be `input`.\n\n`predicate, setPredicate`\n:  one of the supported spatial relations:\n`withindist`, `within`, `contains`, `intersects`, `overlaps`, `nearest`.\nBy default it will be `nearest`.\nOperator `withindist` should be used in the form `withindist n`\nwhere `n` is a distance parameter in meters.\n\nN.b. `broadcast` and `predicate` are closely related: `broadcast` defines the `right` dataset and then\nthe spatial relation can be interpreted as \"left contains right\" if `predicate` is `contains`, for example.\n\n`dataset, setDataset`\n:  external dataset name, should be registered in SQL catalog (metastore).\n\n`dataColumns, setDataColumns`\n:  column names from `dataset` you need to join to `input`.\nFormat: CSV. Any selected column can be renamed using an alias in the form \" as alias\".\nFor example: `t.setDataColumns(\"poi_id, name as poi_name\")`\n\n`distanceColumnAlias, setDistColAlias`\n:  if not empty, a computed column with the defined name will be added to `input`.\nThat column will contain the distance (meters) between centroids of `input` and `dataset` geometries.\n\n`datasetWKT, setDatasetWKT`\n:  external dataset column name, if not empty that column must contain geometry definition in WKT format.\n\n`datasetPoint, setDatasetPoint`\n:  two column names from external dataset, if not empty those columns must contain\nLon, Lat (exactly in that order) coordinates for point geometry.\n\nSame goes for `inputWKT, setInputWKT` and `inputPoint, setInputPoint`\n\nN.b. 
you should define only one source for geometry objects: either WKT or Point, not both.\n\n`numPartitions, setNumPartitions`\n:  repartition parameter, in case you want to repartition `external dataset`\nbefore join.\n\n## Notes and limitations\n\nTransformer allows you to join input dataset with selected external dataset\nusing spatial relations between two geometry columns (or four columns in case of\nlon, lat points). Like any other join, it allows you to add selected columns\n(and a computed `distance` column) from external dataset to input dataset.\n\nOnly inner join is implemented for now.\n\ngeometry\n:  spatial data defined as a column containing WKT-formatted primitives: points, polylines, polygons.\nWGS84 coordinate system expected (lon,lat decimal degree GPS coordinates).\nPoints can be represented as coordinates in two columns: (lon, lat).\n\ninput aka input dataset\n:  DataFrame to which transformer is applied, e.g.\n`val result = bsj.transform(input)`\n\ndataset aka external dataset aka external\n:  DataFrame (table or view) registered in Spark SQL catalog\n(or Hive metastore); e.g. `data.createOrReplaceTempView(\"poi_with_wkt_geometry\")`\n\nbroadcast aka setBroadcast parameter\n:  current limitation is that the transformer performs the join using the\nBroadcastSpatialJoin module, which requires one of the datasets to be broadcasted.\nIt means that one of the `input` or `external` datasets must be small enough to be broadcasted by Spark.\nBy default `input` will be broadcasted and `external` will be iterated using flatMap to find\nall the records from `input` that satisfy the spatial relation (with `filter` and `condition`).\n\nThe `broadcast` and `predicate` parameters together define the result of the join.\nFor example, consider an input that has two rows (2 points) and a dataset that has four rows (4 points).\nLet's set predicate to the `nearest`. 
\nBy default, input will be broadcasted and that means that result table will have four rows:\nnearest point from input for each point from external dataset.\n\nleft or right dataset\n:  `broadcast` parameter defines which dataset will be considered `right` and the other, accordingly, `left`.\nBy default, `input` will be broadcasted, which means, `input` will be the `right` dataset and\n`external dataset` will be `left`.\n\nThe join process looks like iteration (flatMap) over `left` dataset and, for each left.row \nwe search for rows in `right` dataset (after building RTree spatial index)\nthat satisfy defined conditions (spatial and extra).\nIn this scenario we need to broadcast the `right` dataset, hence it should be small enough.\nAs you can see, `broadcast` parameter defines which of two datasets will be `right`\nand then another will be `left`.\n\n## Related\n\nSpatial functions as Spark (Hive) UDFs https://github.com/azavea/hiveless\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvasnake%2Fspark.ml.spatialjointransformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvasnake%2Fspark.ml.spatialjointransformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvasnake%2Fspark.ml.spatialjointransformer/lists"}