{"id":13791045,"url":"https://github.com/harryprince/geospark","last_synced_at":"2025-03-16T18:31:23.528Z","repository":{"id":52794395,"uuid":"165322078","full_name":"harryprince/geospark","owner":"harryprince","description":"bring sf to spark in production","archived":false,"fork":false,"pushed_at":"2021-12-13T11:41:21.000Z","size":16673,"stargazers_count":57,"open_issues_count":11,"forks_count":17,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-03-13T02:08:45.813Z","etag":null,"topics":["apache-spark","gis","large-scale-spatial-analysis","r","spark-sql","sparklyr-extension","spatial-analysis","spatial-queries"],"latest_commit_sha":null,"homepage":"https://github.com/harryprince/geospark/wiki","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/harryprince.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-01-11T22:50:16.000Z","updated_at":"2024-12-29T15:25:31.000Z","dependencies_parsed_at":"2022-08-21T07:20:32.904Z","dependency_job_id":null,"html_url":"https://github.com/harryprince/geospark","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harryprince%2Fgeospark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harryprince%2Fgeospark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harryprince%2Fgeospark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harryprince%2Fgeospark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/harryprince","download_url"
:"https://codeload.github.com/harryprince/geospark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243826788,"owners_count":20354220,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","gis","large-scale-spatial-analysis","r","spark-sql","sparklyr-extension","spatial-analysis","spatial-queries"],"created_at":"2024-08-03T22:00:54.802Z","updated_at":"2025-03-16T18:31:23.142Z","avatar_url":"https://github.com/harryprince.png","language":"R","funding_links":[],"categories":["R","Geospatial Library","Sparklyr Analysis Tools"],"sub_categories":["R","Geospatial Data"],"readme":"GeoSpark: Bring sf to spark\n================\n\n![](https://image-static.segmentfault.com/101/895/1018959988-5c9809116a126)\n\n![](https://camo.githubusercontent.com/31267b3e96ca20997396b88f7c44233710fcc637/687474703a2f2f7777772e7265706f7374617475732e6f72672f6261646765732f6c61746573742f6163746976652e737667)\n[![CRAN version](https://www.r-pkg.org/badges/version/geospark)](https://CRAN.R-project.org/package=geospark)\n[![Build Status](https://travis-ci.org/harryprince/geospark.svg?branch=master)](https://travis-ci.org/harryprince/geospark)\n![](https://cranlogs.r-pkg.org/badges/geospark)\n\n## Introduction \u0026 Philosophy\n\nGoal: make it easier for traditional GIS users to handle geospatial big data. 
\n\nThe original idea comes from [Uber](https://www.oreilly.com/ideas/query-the-planet-geospatial-big-data-analytics-at-uber), which proposed an ESRI Hive UDF + Presto solution that uses a spatial index to process large-scale geospatial data in production.\n\nHowever, the Uber solution is not open source yet, and Presto is less popular than Spark.\n\nFor that reason, the `geospark` R package aims to bring local [sf](https://github.com/r-spatial/sf) functions to distributed Spark mode using the [GeoSpark](https://github.com/DataSystemsLab/GeoSpark) Scala package.\n\nCurrently, `geospark` supports most of the important `sf` functions in Spark;\nhere is a [summary\ncomparison](https://github.com/harryprince/geospark/wiki/SF-Migration-Guide). The `geospark` R package stays close to the geospatial and big data communities and is powered by [sparklyr](https://spark.rstudio.com), [sf](https://github.com/r-spatial/sf), [dplyr](https://db.rstudio.com/dplyr/) and [dbplyr](https://github.com/tidyverse/dbplyr).\n\n## Installation\n\nThis package requires Apache Spark 3.X, which you can install using\n`sparklyr::spark_install(version = \"3.0\")`. Earlier Spark versions such as 2.X are no longer officially maintained. 
In addition, you can install\n`geospark` as follows:\n\n``` r\npak::pkg_install(\"harryprince/geospark\")\n```\n\n## Getting Started\n\nIn this example we will join spatial data using quadtree indexing.\nFirst, we will initialize the `geospark` extension and connect to Spark\nusing `sparklyr`:\n\n``` r\nlibrary(sparklyr)\nlibrary(geospark)\n\nsc \u003c- spark_connect(master = \"local\")\nregister_gis(sc)\n```\n\nNext we will load some spatial datasets containing polygons and\npoints.\n\n``` r\npolygons \u003c- read.table(system.file(package=\"geospark\",\"examples/polygons.txt\"), sep=\"|\", col.names=c(\"area\",\"geom\"))\npoints \u003c- read.table(system.file(package=\"geospark\",\"examples/points.txt\"), sep=\"|\", col.names=c(\"city\",\"state\",\"geom\"))\n\npolygons_wkt \u003c- copy_to(sc, polygons)\npoints_wkt \u003c- copy_to(sc, points)\n```\n\nWe can quickly visualize the datasets with `mapview` and `sf`.\n\n```\nM1 = polygons %\u003e%\nsf::st_as_sf(wkt=\"geom\") %\u003e% mapview::mapview()\n\n\nM2 = points %\u003e%\nsf::st_as_sf(wkt=\"geom\") %\u003e% mapview::mapview()\n\nM1+M2\n```\n\n![](https://segmentfault.com/img/bVbqmP9/view?w=1198\u0026h=766)\n\n### The SQL Mode\n\nNow we can perform a geospatial join using `st_contains`, which\ntests whether one geometry contains another. To convert the `wkt`\ncolumns into geometry objects, we will use the `st_geomfromwkt` function. 
We can execute this\nspatial query using `DBI`:\n\n``` r\nDBI::dbGetQuery(sc, \"\n  SELECT area, state, count(*) cnt FROM\n    (SELECT area, ST_GeomFromWKT(polygons.geom) as y FROM polygons) polygons\n  INNER JOIN\n    (SELECT ST_GeomFromWKT (points.geom) as x, state, city FROM points) points\n  WHERE ST_Contains(polygons.y,points.x) GROUP BY area, state\")\n```\n\n``` \n             area state cnt\n1      texas area    TX  10\n2     dakota area    SD   1\n3     dakota area    ND  10\n4 california area    CA  10\n5   new york area    NY   9\n```\n\n### The Tidyverse Mode\n\nYou can also perform this query using `dplyr` as follows:\n\n``` r\nlibrary(dplyr)\npolygons_wkt \u003c- mutate(polygons_wkt, y = st_geomfromwkt(geom))\npoints_wkt \u003c- mutate(points_wkt, x = st_geomfromwkt(geom))\n\nsc_res \u003c- inner_join(polygons_wkt,\n                     points_wkt,\n                     sql_on = sql(\"st_contains(y,x)\")) %\u003e% \n  group_by(area, state) %\u003e%\n  summarise(cnt = n()) \n  \nsc_res %\u003e%\n  head()\n```\n\n```\n# Source: spark\u003c?\u003e [?? 
x 3]\n# Groups: area\n  area            state   cnt\n  \u003cchr\u003e           \u003cchr\u003e \u003cdbl\u003e\n1 texas area      TX       10\n2 dakota area     SD        1\n3 dakota area     ND       10\n4 california area CA       10\n5 new york area   NY        9\n```\n\nThe final result can be presented with `leaflet`.\n\n```\nIdx_df = collect(sc_res) %\u003e% \nright_join(polygons,by = (c(\"area\"=\"area\"))) %\u003e% \nsf::st_as_sf(wkt=\"geom\")\n\nIdx_df %\u003e% \nleaflet::leaflet() %\u003e% \nleaflet::addTiles() %\u003e% \nleaflet::addPolygons(popup = ~as.character(cnt),color=~colormap::colormap_pal()(cnt)) \n\n```\n\n![](https://image-static.segmentfault.com/305/306/3053068814-5c9803c8d59a7)\n\nFinally, we can disconnect:\n\n``` r\nspark_disconnect_all()\n```\n\n## Performance\n\n### Configuration\n\nTo improve performance, it is recommended to set the `KryoSerializer`\nand the `GeoSparkKryoRegistrator` in the Spark configuration and pass it when connecting, as follows:\n\n``` r\nconf \u003c- spark_config()\nconf$spark.serializer \u003c- \"org.apache.spark.serializer.KryoSerializer\"\nconf$spark.kryo.registrator \u003c- \"org.datasyslab.geospark.serde.GeoSparkKryoRegistrator\"\n# Pass the configuration when opening the connection\nsc \u003c- spark_connect(master = \"local\", config = conf)\n```\n\n### Benchmarks\n\nThis performance comparison is an extract from the original [GeoSpark: A\nCluster Computing Framework for Processing Spatial\nData](https://pdfs.semanticscholar.org/347d/992ceec645a28f4e7e45e9ab902cd75ecd92.pdf)\npaper:\n\n| No. 
| test case                                                                                                                                                            | the number of records |\n| --- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------- |\n| 1   | SELECT IDCODE FROM zhenlongxiang WHERE ST\\_Disjoint(geom,ST\\_GeomFromText('POLYGON((517000 1520000,619000 1520000,619000 2530000,517000 2530000,517000 1520000))')); | 85,236 rows           |\n| 2   | SELECT fid FROM cyclonepoint WHERE ST\\_Disjoint(geom,ST\\_GeomFromText('POLYGON((90 3,170 3,170 55,90 55,90 3))',4326))                                               | 60,591 rows           |\n\nQuery performance (ms):\n\n| No. | PostGIS/PostgreSQL | GeoSpark SQL | ESRI Spatial Framework for Hadoop |\n| --- | ------------------ | ------------ | --------------------------------- |\n| 1   | 9631               | 480          | 40,784                            |\n| 2   | 110872             | 394          | 64,217                            |\n\nAccording to this paper, GeoSpark SQL outperforms both PostGIS/PostgreSQL and\nthe ESRI Spatial Framework for Hadoop on very large data sets; in the two queries above it is roughly 20 to 280 times faster.\n\n\nIf you are wondering how the spatial index accelerates the query process,\nhere is a good Uber example: [Unwinding Uber’s Most Efficient\nService](https://medium.com/@buckhx/unwinding-uber-s-most-efficient-service-406413c5871d#.dg5v6irao)\nand the [Chinese translation\nversion](https://segmentfault.com/a/1190000008657566)\n\n## Functions\n\n### Constructor\n\nname|desc\n---|---\n`ST_GeomFromWKT`| Construct a Geometry from WKT.\n`ST_GeomFromWKB`| Construct a Geometry from WKB.\n`ST_GeomFromGeoJSON`| Construct a Geometry from GeoJSON.\n`ST_Point`| Construct a Point from X and Y. 
\n`ST_PointFromText`| Construct a Point from Text, delimited by Delimiter.\n`ST_PolygonFromText`| Construct a Polygon from Text, delimited by Delimiter.\n`ST_LineStringFromText`| Construct a LineString from Text, delimited by Delimiter.\n`ST_PolygonFromEnvelope`| Construct a Polygon from MinX, MinY, MaxX, MaxY.\n\n### Geometry Measurement\n\nname|desc\n---|---\n`ST_Length`| Return the perimeter of A\n`ST_Area`| Return the area of A\n`ST_Distance`| Return the Euclidean distance between A and B\n\n### Spatial Join\n\n![](https://camo.githubusercontent.com/f18513c8002df02bdb6e3aac451519beb3c87ebb/68747470733a2f2f7365676d656e746661756c742e636f6d2f696d672f625662714665333f773d3132383026683d353038)\n\nname|desc\n---|---\n`ST_Contains`|\n`ST_Intersects`|\n`ST_Within`|\n`ST_Equals`|\n`ST_Crosses`|\n`ST_Touches`|\n`ST_Overlaps`|\n\n### Distance Join\n\n`ST_Distance`:\n\nSpark GIS SQL mode example:\n\n```\nSELECT *\nFROM pointdf1, pointdf2\nWHERE ST_Distance(pointdf1.pointshape1,pointdf2.pointshape2) \u003c= 2\n```\n\nTidyverse style example:\n\n```\nst_join(x = pointdf1,\n           y = pointdf2,\n           join = sql(\"ST_Distance(pointshape1, pointshape2) \u003c= 2\"))\n```\n\n\n### Aggregation\n\nname|desc\n---|---\n`ST_Envelope_Aggr`| Return the entire envelope boundary of all geometries in A\n`ST_Union_Aggr`|Return the polygon union of all polygons in A\n\n### More Advanced Functions\n\nname|desc\n---|---\n`ST_ConvexHull`| Return the Convex Hull of polygon A\n`ST_Envelope`| Return the envelope boundary of A\n`ST_Centroid`| Return the centroid point of A\n`ST_Transform`| Transform the Spatial Reference System / Coordinate Reference System of A, from SourceCRS to TargetCRS\n`ST_IsValid`| Test if a geometry is well formed\n`ST_PrecisionReduce`| Reduce the coordinates of the geometry to the given number of decimal places. 
The last decimal place will be rounded.\n`ST_IsSimple`| Test if geometry's only self-intersections are at boundary points.\n`ST_Buffer`| Returns a geometry/geography that represents all points whose distance from this Geometry/geography is less than or equal to distance.\n`ST_AsText`| Return the Well-Known Text string representation of a geometry\n\n\n## Architecture\n\n# ![](https://user-images.githubusercontent.com/5362577/53225664-bf6abc80-36b3-11e9-8b8e-41611fc7098e.png)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharryprince%2Fgeospark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fharryprince%2Fgeospark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharryprince%2Fgeospark/lists"}