{"id":13858931,"url":"https://github.com/databrickslabs/geoscan","last_synced_at":"2025-07-14T01:32:17.133Z","repository":{"id":36986143,"uuid":"355004614","full_name":"databrickslabs/geoscan","owner":"databrickslabs","description":"Geospatial clustering at massive scale","archived":false,"fork":false,"pushed_at":"2024-07-11T12:50:25.000Z","size":2555,"stargazers_count":93,"open_issues_count":8,"forks_count":19,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-08-06T03:06:07.265Z","etag":null,"topics":["clustering","library","spark-ml"],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databrickslabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-04-05T23:58:34.000Z","updated_at":"2024-08-05T11:46:48.000Z","dependencies_parsed_at":"2023-12-09T20:46:36.792Z","dependency_job_id":null,"html_url":"https://github.com/databrickslabs/geoscan","commit_stats":{"total_commits":127,"total_committers":7,"mean_commits":"18.142857142857142","dds":0.6220472440944882,"last_synced_commit":"d779604ec1cbbfebaf29e673280bfd52de1bf6f5"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Fgeoscan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Fgeoscan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Fgeoscan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Fgeoscan/manifes
ts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databrickslabs","download_url":"https://codeload.github.com/databrickslabs/geoscan/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225938723,"owners_count":17548541,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","library","spark-ml"],"created_at":"2024-08-05T03:02:26.509Z","updated_at":"2024-11-22T17:30:31.895Z","avatar_url":"https://github.com/databrickslabs.png","language":"Scala","funding_links":[],"categories":["Scala"],"sub_categories":[],"readme":"# Geoscan\n\n[![build](https://github.com/databrickslabs/geoscan/actions/workflows/push.yml/badge.svg?style=for-the-badge)](https://github.com/databrickslabs/geoscan/actions/workflows/push.yml) \n[![codecov](https://codecov.io/gh/databrickslabs/geoscan/branch/master/graph/badge.svg?token=0UKFCOO9OM\u0026style=for-the-badge)](https://codecov.io/gh/databrickslabs/geoscan)\n[![Maven Central](https://img.shields.io/maven-central/v/com.databricks.labs/geoscan.svg)](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22geoscan)\n\n*DBSCAN (density-based spatial clustering of applications with noise) is a clustering technique used to group points that\nare closely packed together. 
Compared to other clustering methodologies, it doesn't require you to indicate the number\nof clusters beforehand, can detect clusters of varying shapes and sizes and is strong at finding outliers that don't\nbelong to any cluster, which makes it a great candidate for geospatial analysis of card transactions and fraud detection.\nThis, however, comes with a serious price tag: DBSCAN requires every point to be compared\nto every other point in order to find dense neighborhoods where at least `minPts` points can be found within an\n`epsilon` radius.* \n\nHere comes **GEOSCAN**, our novel approach to the DBSCAN algorithm for geospatial clustering, \nleveraging Uber's [H3](https://eng.uber.com/h3/) library to only group points we know are in close vicinity (according \nto H3 precision) and relying on [GraphX](https://spark.apache.org/docs/latest/graphx-programming-guide.html) to detect \ndense areas at massive scale. With such a framework, financial services institutions can better understand user \nshopping behaviours and detect anomalous transactions in real time.\n\n### Usage\n\nOur framework can be executed in 2 modes, **distributed** and **pseudo-distributed**.\n\n#### Distributed\n\nWorking **fully distributed**, we retrieve clusters from an entire dataframe using the Spark `Estimator` interface, \nwhich makes this mode fully compliant with the Spark Pipeline framework (the model can be serialized / deserialized). \nIn this mode, the core of the GEOSCAN algorithm relies on `GraphX` to detect points having `distance \u003c epsilon` and a `degree \u003e minPts`. 
\nSee the next section for an explanation of our algorithm.\n\n#### Usage\n\n```python\nfrom geoscan import Geoscan\n\ngeoscan = Geoscan() \\\n    .setLatitudeCol(\"latitude\") \\\n    .setLongitudeCol(\"longitude\") \\\n    .setPredictionCol(\"cluster\") \\\n    .setEpsilon(100) \\\n    .setMinPts(3)\n\nmodel = geoscan.fit(points_df)\n```\n\n\n| parameter     | description                                     | default   |\n|---------------|-------------------------------------------------|-----------|\n| epsilon       | the maximum distance in meters between 2 points | 50        |\n| minPts        | the minimum number of neighbours within epsilon | 3         |\n| latitudeCol   | the latitude column                             | latitude  |\n| longitudeCol  | the longitude column                            | longitude |\n| predictionCol | the resulting prediction column                 | predicted |\n\n\nAs the core of the GEOSCAN logic relies on H3 polygons, it is natural to leverage the same polygons for model \ninference instead of bringing in extra GIS dependencies for expensive point-in-polygon queries. Our model consists \nof clusters tiled with hexagons of a given resolution (driven by the `epsilon` parameter) that can easily be joined to our original dataframe. \nModel inference is fully supported as per the `Estimator` interface.\n\n```python\nmodel.transform(points_df)\n```\n\nNote that when saving the model to a distributed file system, we convert our shapes into the [GeoJson](https://tools.ietf.org/html/rfc7946) (RFC 7946) \nformat so that clusters can be loaded as-is into GIS databases or any downstream applications or libraries. 
\n\n```python\nfrom geoscan import GeoscanModel\nmodel.save('/tmp/geoscan_model/distributed')\nmodel = GeoscanModel.load('/tmp/geoscan_model/distributed')\n```\n\nThe model can always be returned as a GeoJson object directly.\n\n```python\nmodel.toGeoJson()\n```\n\nFinally, it may be useful to extract clusters as a series of H3 tiles that can be used outside a Spark environment or outside the GEOSCAN library.\nWe expose a `getTiles` method that fills all our polygons with H3 tiles of a given dimension, allowing shapes to spill over additional layers should\nwe also want to \"capture\" neighbouring points.\n\n```python\nmodel.getTiles(precision, additional_layers)\n```\n\nThis process is summarized in the picture below. Note that although a higher granularity would\nfit a polygon better, the number of tiles it generates will grow exponentially.\n\n![tiling](https://raw.githubusercontent.com/databrickslabs/geoscan/master/images/tiling.png)\n\n#### Pseudo Distributed\n\nIt is fairly common to extract personalized clusters (e.g. for each user), and doing so sequentially would be terribly inefficient.\nFor that purpose, we extended our GEOSCAN class to support `RelationalGroupedDataset` and train multiple models in parallel, one for each group attribute. 
\nAlthough the implementation is different (using in-memory `scalax.collection.Graph` instead of distributed `GraphX`), \nthe core logic remains the same as explained in the next section and should yield the same clusters for a given user.\n\n#### Usage\n\nOne must provide a new parameter `groupedCol` to tell our framework how to group the dataframe and train multiple models in parallel.\n\n```python\nfrom geoscan import GeoscanPersonalized\n\ngeoscan = GeoscanPersonalized() \\\n    .setLatitudeCol(\"latitude\") \\\n    .setLongitudeCol(\"longitude\") \\\n    .setPredictionCol(\"cluster\") \\\n    .setGroupedCol(\"user\") \\\n    .setEpsilon(100) \\\n    .setMinPts(3)\n\nmodel = geoscan.fit(points_df)\n```\n\nNote that the output signature differs from the distributed approach since we return a collection of GeoJson objects rather than a single model.\n\n```python\nmodel.toGeoJson().show()\n```\n\n```\n+--------------------+--------------------+\n|                user|             cluster|\n+--------------------+--------------------+\n|72fc865a-0c34-409...|{\"type\":\"FeatureC...|\n|cc227e67-c6d1-40a...|{\"type\":\"FeatureC...|\n|9cafdb6d-9134-4ee...|{\"type\":\"FeatureC...|\n|804c7fa2-8063-4ba...|{\"type\":\"FeatureC...|\n|65bd17be-b030-44a...|{\"type\":\"FeatureC...|\n+--------------------+--------------------+\n```\n\nNote that the standard `transform` and `getTiles` methods also apply in that mode. By tracking how tiles change over time, \nthis framework can be used to detect changes in user behaviour, as shown in the animation below using synthetic data.\n\n![trend](https://raw.githubusercontent.com/databrickslabs/geoscan/master/images/geoscan_window.gif)\n\n### Installation\n\nCompile the GEOSCAN Scala library so that it can be uploaded onto a Databricks cluster (DBR \u003e 9.1). 
Activate the `shaded` profile \nto include GEOSCAN dependencies as an assembly jar if needed.\n\n```shell\nmvn clean package -Pshaded\n```\n\nAlternatively (preferred), install the dependency from Maven Central directly in your Spark-based environment.\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.databricks.labs\u003c/groupId\u003e\n    \u003cartifactId\u003egeoscan\u003c/artifactId\u003e\n    \u003cversion\u003e0.1\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nFor Python users, install the dependencies from PyPI in addition to the above Scala dependency.\n\n```shell script\npip install geoscan==0.1\n```\n\n### Release process\n\nOnce a change is approved, peer reviewed and merged back to the `master` branch, a project admin will be able to promote \na new version to both Maven Central and the PyPI repo as a manual GitHub action.\nSee the `release.yaml` GitHub action.\n\n### Project support\n\nPlease note that all projects in the /databrickslabs github account are provided for your exploration only, and are \nnot formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make \nany guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.\n\nAny issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed \nas time permits, but there are no formal SLAs for support.\n\n### Author\n\n\u003cantoine.amend@databricks.com\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabrickslabs%2Fgeoscan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabrickslabs%2Fgeoscan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabrickslabs%2Fgeoscan/lists"}