{"id":28633317,"url":"https://github.com/do-me/fast-osm-extraction","last_synced_at":"2026-02-24T01:33:58.277Z","repository":{"id":298119191,"uuid":"998923704","full_name":"do-me/fast-osm-extraction","owner":"do-me","description":"A workflow for performant extraction of entities from OSM protobuf (.pbf) files with various tools","archived":false,"fork":false,"pushed_at":"2025-06-12T13:40:39.000Z","size":44,"stargazers_count":19,"open_issues_count":0,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-28T18:07:35.855Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/do-me.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-09T13:19:26.000Z","updated_at":"2025-08-29T20:45:28.000Z","dependencies_parsed_at":"2025-06-09T14:47:21.822Z","dependency_job_id":null,"html_url":"https://github.com/do-me/fast-osm-extraction","commit_stats":null,"previous_names":["do-me/fast-osm-extraction"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/do-me/fast-osm-extraction","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Ffast-osm-extraction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Ffast-osm-extraction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Ffast-osm-extraction/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Ffast-o
sm-extraction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/do-me","download_url":"https://codeload.github.com/do-me/fast-osm-extraction/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Ffast-osm-extraction/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29766642,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-24T01:28:30.166Z","status":"ssl_error","status_checked_at":"2026-02-24T01:28:27.518Z","response_time":90,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-12T15:08:16.123Z","updated_at":"2026-02-24T01:33:58.251Z","avatar_url":"https://github.com/do-me.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fast OSM Extraction\n\nWorkflows for performant extraction of entities from OSM protobuf (.pbf) files with various tools like DuckDB spatial and Rust-based OSM pbf readers.\n\nAim: Extract all roads with ongoing construction fast and export to geoparquet/pmtiles or other file formats.\n\n## 🥇 1 DuckDB spatial: 4:12 minutes for planet.pbf (incl. 
post-processing) 👑\n\nThe whole idea revolves around DuckDB spatial's `st_readOSM()` function (which I stumbled upon on [HN](https://news.ycombinator.com/item?id=40891644)), which can read a pbf file directly. And no, you do not need tons of RAM! I tested it with a maximum RAM usage of 16 GB vs. 96 GB and it had no effect on the processing time; literally none (not even a few seconds). Maybe I did something wrong, but considering it works so beautifully, I won't investigate further. \n\nHave a look at the attached Jupyter Notebook I used for convenience. You might even squeeze a few more seconds of performance out of DuckDB if it's not wrapped in Python.\n\nExtracting all entities from Germany took only **13.3 seconds**! \n\n## 🥈 2 Osmium tool: 7 minutes for planet.pbf (without post-processing)\n\nOsmium (C++ based) can prefilter the planet file so that all subsequent processing becomes faster. However, this reduction step alone already took 7 minutes and turned my Mac into a helicopter.\nSince that is already slower than the complete DuckDB workflow, I didn't continue with this workflow and instead looked a little closer at Rust-based OSM pbf readers as an alternative.\n\n```shell\ntime osmium tags-filter \\\n    planet-250602.osm.pbf \\\n    w/highway,construction \\\n    -o construction-roads.osm.pbf \\\n    --overwrite\n[======================================================================] 100%\nosmium tags-filter planet-250602.osm.pbf w/highway,construction -o    2964.27s user 153.42s system 736% cpu 7:03.38 total\n```\n\n### 🥉 Osmpbfreader-rs (excl. post-processing)\n\nSee folder `osm-construction-extractor`. I had high expectations but was simply disappointed. Wrestling with the compiler, ever-changing APIs in the Rust ecosystem, and dependency issues really gave me headaches. Also, unfortunately, the geo ecosystem in Rust is underdeveloped. E.g. 
GeoPolars is stale: \n\n\u003e Update (August 2024): GeoPolars is [blocked](https://github.com/pola-rs/polars/issues/1830#issuecomment-2218102856) on Polars supporting [Arrow extension types](https://github.com/pola-rs/polars/issues/9112), which would allow GeoPolars to persist geometry type information and coordinate reference system (CRS) metadata. It's not feasible to create a `geopolars.GeoDataFrame` as a subclass of a `polars.DataFrame` (similar to how the `geopandas.GeoDataFrame` is a subclass of `pandas.DataFrame`) because polars explicitly [does not support subclassing of core data types](https://github.com/pola-rs/polars/issues/2846#issuecomment-1711799869). See https://github.com/geopolars/geopolars/pull/240.\n\nI ended up writing a short script that only read the data to an array as all the downstream tasks like persisting to geoparquet turned out to be too time-consuming for now. The speed was ok-ish: \n\n####  Germany only 48 seconds\n\n```bash\ntime ./target/release/osm-construction-extractor --input ../germany-latest.osm.pbf\n```\n\n```bash\n(base) ➜  osm-construction-extractor git:(master) ✗ time ./target/release/osm-construction-extractor --input ../germany-latest.osm.pbf\n-\u003e Opening PBF file: \"../germany-latest.osm.pbf\"\n-\u003e Pass 1: Finding ways and collecting dependencies...\n   Found 60124 total objects (ways and their required nodes) in 48.54s.\n-\u003e Pass 2: Re-structuring extracted data into final format...\n  [00:00:00] [########################################]   10049/10049   (0s)                                                                                                                                                                                                                                       \n--- BENCHMARK RESULTS ---\nTotal ways extracted: 10049\nCore extraction (PBF read \u0026 dependency resolution): 48.54s\nData restructuring (geometry building, etc.):       
14.87ms\n----------------------------------------------------\nTotal runtime:                                      48.55s\n\n✅ Success! Data is held in an in-memory array.\n\nExample of first extracted way:\nConstructionWay {\n    id: WayId(\n        3358460,\n    ),\n    tags: {\n        \"highway\": \"construction\",\n        \"bicycle\": \"yes\",\n        \"name\": \"Paul-Stritter-Weg\",\n        \"surface\": \"paving_stones\",\n        \"construction\": \"footway\",\n        \"check_date\": \"2025-05-29\",\n        \"lit\": \"yes\",\n    },\n    geometry: LINESTRING(10.0243135 53.6102686,10.0243418 53.6102772,10.0249313 53.610489099999995),\n}\n./target/release/osm-construction-extractor --input ../germany-latest.osm.pbf  238.37s user 14.80s system 518% cpu 48.832 total\n```\n\n#### Planet\n\n\n```bash\ntime ./target/release/osm-construction-extractor --input ../planet-250602.osm.pbf \n```\n\nI aborted the run after 15 minutes; there was no point in measuring the performance any further.\n\n### 4 QuackOSM: 1:40 minutes for Germany only (incl. post-processing)\n\nSpecial shoutout to QuackOSM, a fantastic tool for quick and hassle-free access to small- to medium-scale areas of interest. If you want a super convenient tool and don't want to bother with tweaking DuckDB on your system, finding out where to get the pbf from, etc., it's great! \nHowever, compared directly to the heavily optimized pure DuckDB workflow from 🥇, it's much slower and hence not suited for planet-scale workflows. From what I understand, this is due to its Swiss Army knife character: it is built to work for any kind of analysis. 
\n\nLook at this beauty, it's just three lines to get the job done!\n\n```python\n%%time\nimport quackosm as qosm\ngdf = qosm.convert_pbf_to_parquet(\"germany-latest.osm.pbf\", tags_filter={\"highway\":\"construction\"})\n```\n\n```bash\nFinished operation in 0:01:40\nCPU times: user 17min 12s, sys: 1min 17s, total: 18min 29s\nWall time: 1min 40s\n```\n\nCompared to the results from 🥇, they are the same: the DuckDB-based workflow is on the left, QuackOSM on the right. The only subtle difference is that QuackOSM also returns point geometries, which I filtered out. \n![image](https://github.com/user-attachments/assets/9aadc9a9-fe8e-4940-9667-d3fbd3b6ffc4)\n\n\n\n### Other contenders\n\n- [planetiler](https://github.com/onthegomap/planetiler) - used it once to create a protomaps/basemap, which took roughly 2h; a strong contender, also for convenience, as it can export directly to mbtiles or pmtiles\n- [osmpbf](https://github.com/b-r-u/osmpbf) - Rust-based too, haven't tried yet \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdo-me%2Ffast-osm-extraction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdo-me%2Ffast-osm-extraction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdo-me%2Ffast-osm-extraction/lists"}