{"id":13704946,"url":"https://github.com/harsha2010/magellan","last_synced_at":"2026-05-19T08:12:25.589Z","repository":{"id":32999087,"uuid":"36629888","full_name":"harsha2010/magellan","owner":"harsha2010","description":"Geo Spatial Data Analytics on Spark","archived":false,"fork":false,"pushed_at":"2021-08-26T15:37:34.000Z","size":13643,"stargazers_count":533,"open_issues_count":77,"forks_count":149,"subscribers_count":65,"default_branch":"master","last_synced_at":"2024-08-03T22:14:16.485Z","etag":null,"topics":["big-data","geojson","geometric-algorithms","geospatial","geospatial-analysis","geospatial-analytics","geospatial-processing","magellan","shapefile","spark","sparksql"],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/harsha2010.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-06-01T01:06:52.000Z","updated_at":"2024-07-26T08:12:44.000Z","dependencies_parsed_at":"2022-08-30T01:10:47.087Z","dependency_job_id":null,"html_url":"https://github.com/harsha2010/magellan","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harsha2010%2Fmagellan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harsha2010%2Fmagellan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harsha2010%2Fmagellan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harsha2010%2Fmagellan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/harsha2010","
download_url":"https://codeload.github.com/harsha2010/magellan/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224448826,"owners_count":17313123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","geojson","geometric-algorithms","geospatial","geospatial-analysis","geospatial-analytics","geospatial-processing","magellan","shapefile","spark","sparksql"],"created_at":"2024-08-02T22:00:27.503Z","updated_at":"2026-05-19T08:12:25.530Z","avatar_url":"https://github.com/harsha2010.png","language":"Scala","funding_links":[],"categories":["Data Processing","Scala"],"sub_categories":[],"readme":"# Magellan: Geospatial Analytics Using Spark\n[![Gitter chat](https://badges.gitter.im/Magellan-dev/Lobby.png)](https://gitter.im/Magellan-dev/Lobby)\n[![Build Status](https://travis-ci.org/harsha2010/magellan.svg?branch=master)](https://travis-ci.org/harsha2010/magellan)\n[![codecov.io](http://codecov.io/github/harsha2010/magellan/coverage.svg?branch=master)](http://codecov.io/github/harsha2010/magellan?branch=maste)\n\n\nMagellan is a distributed execution engine for geospatial analytics on big data. 
It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries.\n\nThe application developer writes standard SQL or DataFrame queries to evaluate geometric expressions, while the execution engine takes care of efficiently laying data out in memory during query processing, picking the right query plan, and optimizing query execution with cheap and efficient spatial indices, all while presenting a declarative abstraction to the developer.\n\nMagellan is the first library to extend Spark SQL to provide a relational abstraction for geospatial analytics. I see it as an evolution of geospatial analytics engines into the emerging world of big data: it provides abstractions that are developer friendly and can be leveraged by anyone who understands or uses Apache Spark, while simultaneously showcasing an execution engine that is state of the art for geospatial analytics on big data.\n\n# Version Release Notes\n\nYou can find notes on the various released versions [here](https://github.com/harsha2010/magellan/releases).\n\n# Linking\n\nYou can link against the latest release using the following coordinates:\n\n\tgroupId: harsha2010\n\tartifactId: magellan\n\tversion: 1.0.5-s_2.11\n\n# Requirements\n\nv1.0.5 requires Spark 2.1+ and Scala 2.11.\n\n# Capabilities\n\nThe library currently supports reading the following formats:\n  \n  * [ESRI Shapefile](https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf) \n  * [GeoJSON](http://geojson.org)\n  * [OSM-XML](http://wiki.openstreetmap.org/wiki/OSM_XML)\n  * [WKT](https://en.wikipedia.org/wiki/Well-known_text)\n\nWe aim to support the full suite of [OpenGIS Simple Features for SQL](http://www.opengeospatial.org/standards/sfs) spatial predicate functions and operators together with additional topological functions.\n\nThe following geometries are currently supported:\n\n**Geometries**:\n\n  * Point\n  * 
LineString\n  * Polygon\n  * MultiPoint\n  * MultiPolygon (treated as a collection of Polygons and read in as a row per polygon by the GeoJSON reader)\n\t\nThe following predicates are currently supported:\n\n  * Intersects\n  * Contains\n  * Within\n\nThe following languages are currently supported:\n\n  * Scala\n\n\n\n\n# Reading Data\n\nYou can read Shapefile formatted data as follows:\n\n\n\tval df = sqlCtx.read.\n\t  format(\"magellan\").\n\t  load(path)\n\t  \n\tdf.show()\n\t\n\t+-----+--------+--------------------+--------------------+-----+\n\t|point|polyline|             polygon|            metadata|valid|\n\t+-----+--------+--------------------+--------------------+-----+\n\t| null|    null|Polygon(5, Vector...|Map(neighborho -\u003e...| true|\n\t| null|    null|Polygon(5, Vector...|Map(neighborho -\u003e...| true|\n\t| null|    null|Polygon(5, Vector...|Map(neighborho -\u003e...| true|\n\t| null|    null|Polygon(5, Vector...|Map(neighborho -\u003e...| true|\n\t+-----+--------+--------------------+--------------------+-----+\n\t\n\tdf.select(df(\"metadata\")(\"neighborho\")).show()\n\t\n\t+--------------------+\n\t|metadata[neighborho]|\n\t+--------------------+\n\t|Twin Peaks       ...|\n\t|Pacific Heights  ...|\n\t|Visitacion Valley...|\n\t|Potrero Hill     ...|\n\t+--------------------+\n\t\n\nTo read GeoJSON data, pass the type geojson as an option during load, as follows:\n\n\tval df = sqlCtx.read.\n\t  format(\"magellan\").\n\t  option(\"type\", \"geojson\").\n\t  load(path)\n\t  \n\n# Scala API\n\nMagellan is hosted on [Spark Packages](http://spark-packages.org/package/harsha2010/magellan).\n\nWhen launching the Spark Shell, Magellan can be included like any other Spark package using the --packages option:\n\n\t\u003e $SPARK_HOME/bin/spark-shell --packages harsha2010:magellan:1.0.4-s_2.11\n\nA few common imports you might want when working with Magellan:\n\t\n\timport magellan.{Point, Polygon}\n\timport org.apache.spark.sql.magellan.dsl.expressions._\n\timport 
org.apache.spark.sql.types._\n\n## Data Structures\n\n### Point\n\n\tval points = sc.parallelize(Seq((-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0))).toDF(\"x\", \"y\").select(point($\"x\", $\"y\").as(\"point\"))\n\t\n\tpoints.show()\n\t\n\t+-----------------+\n\t|            point|\n\t+-----------------+\n\t|Point(-1.0, -1.0)|\n\t| Point(-1.0, 1.0)|\n\t| Point(1.0, -1.0)|\n\t+-----------------+\n\t\n### Polygon\n\n\tcase class PolygonRecord(polygon: Polygon)\n\t\n\tval ring = Array(Point(1.0, 1.0), Point(1.0, -1.0),\n     Point(-1.0, -1.0), Point(-1.0, 1.0),\n     Point(1.0, 1.0))\n    val polygons = sc.parallelize(Seq(\n        PolygonRecord(Polygon(Array(0), ring))\n      )).toDF()\n      \n    polygons.show()\n    \n    +--------------------+\n\t|             polygon|\n\t+--------------------+\n\t|Polygon(5, Vector...|\n\t+--------------------+\n\n## Predicates\n\n### within\n\n\tpoints.join(polygons).where($\"point\" within $\"polygon\").show()\n\n### intersects\n\n\tpoints.join(polygons).where($\"point\" intersects $\"polygon\").show()\n\t\n\t+-----------------+--------------------+\n\t|            point|             polygon|\n\t+-----------------+--------------------+\n\t|Point(-1.0, -1.0)|Polygon(5, Vector...|\n\t| Point(-1.0, 1.0)|Polygon(5, Vector...|\n\t| Point(1.0, -1.0)|Polygon(5, Vector...|\n\t+-----------------+--------------------+\n\n### contains\n\nSince contains is an overloaded expression (contains is used for checking String containment by Spark SQL), Magellan uses the Binary Expression ```\u003e?``` for checking shape containment.\n\n\tpoints.join(polygons).where($\"polygon\" \u003e? 
$\"point\").show()\n\n\n\t\nA Databricks notebook with similar examples is published [here](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/137058993011870/882779309834027/6891974485343070/latest.html) for convenience.\n\n# Spatial indexes\n\nStarting with v1.0.5, Magellan supports spatial indexes.\nThe supported spatial indexes are the so-called [Z-order curves](https://en.wikipedia.org/wiki/Z-order_curve).\n\n\nGiven a column of shapes, one can index the shapes to a given precision using a geohash indexer by doing the following:\n\n```scala\ndf.withColumn(\"index\", $\"polygon\" index 30)\n```\n\nThis produces a new column called ```index``` which is a list of Z-order curves of precision ```30``` that, taken together, cover the polygon.\n\n# Creating Indexes while loading data\n\nThe Spatial Relations (GeoJSON, Shapefile, OSM-XML) all have the ability to automatically index the geometries while loading them.\n\nTo turn this feature on, pass in the parameter ```magellan.index = true``` and optionally a value for ```magellan.index.precision``` (default = 30) while loading the data as follows:\n\n```scala\nspark.read.format(\"magellan\")\n  .option(\"magellan.index\", \"true\")\n  .option(\"magellan.index.precision\", \"25\")\n  .load(s\"$path\")\n```\n\nThis creates an additional column called ```index``` which holds the list of Z-order curves of the given precision that cover each geometry in the dataset.\n\n# Spatial Joins\n\nMagellan leverages Spark SQL and has support for joins by default. However, by default these joins are not aware that the columns are geometric, so a join of the form\n\n```scala\n  points.join(polygons).where($\"point\" within $\"polygon\")\n```\n\nwill be treated as a Cartesian Join followed by a predicate. 
\nIn some cases, especially when the polygon dataset is small (on the order of 100 to 10,000 polygons), this is fast enough.\nHowever, when the number of polygons is much larger than that, you will need spatial joins to scale this computation.\n\nTo enable spatial joins in Magellan, add a spatial join rule to Spark by injecting the following code before the join:\n\n```scala\n  magellan.Utils.injectRules(spark)\n```\n\n\nFurthermore, during the join, you will need to give Magellan a hint of the precision at which to create indices for the join.\n\nYou can do this by annotating either of the DataFrames involved in the join with a Spatial Join Hint as follows:\n\n```scala\nval indexed = df.index(30) // after load, or\nval df = spark.read.format(...).load(..).index(30) // during load\n```\n\nThen a join of the form\n\n```scala\n  points.join(polygons).where($\"point\" within $\"polygon\") // or\n  \n  points.join(polygons index 30).where($\"point\" within $\"polygon\")\n```\n\nautomatically uses indexes to speed up the join.\n\n\n# Developer Channel\n\nPlease visit [Gitter](https://gitter.im/magellan-dev/Lobby?source=orgpage) to discuss Magellan, obtain help from developers or report issues.\n\n# Magellan Blog\n\nFor more details on Magellan and thoughts around Geospatial Analytics and the optimizations chosen for this project, please visit my [blog](https://magellan.ghost.io).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharsha2010%2Fmagellan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fharsha2010%2Fmagellan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharsha2010%2Fmagellan/lists"}