https://github.com/shane98c/wiki-geoparquet
geolocated english wikipedia articles in geoparquet and pmtiles
- Host: GitHub
- URL: https://github.com/shane98c/wiki-geoparquet
- Owner: Shane98c
- License: mit
- Created: 2026-03-27T02:35:41.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-18T20:16:06.000Z (about 2 months ago)
- Last Synced: 2026-04-18T20:30:17.668Z (about 2 months ago)
- Topics: duckdb, geoparquet, geospatial, geotagged, parquet, pmtiles, wikidata, wikipedia
- Language: Python
- Homepage: https://shane98c.github.io/wiki-geoparquet/
- Size: 354 KB
- Stars: 43
- Watchers: 0
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# wiki-geoparquet
English Wikipedia main-namespace articles with Earth coordinates, as
[GeoParquet](https://geoparquet.org/) +
[PMTiles](https://docs.protomaps.com/pmtiles/).
Coordinates from Wikidata P625, enriched with instance type, country, population,
GeoNames IDs, inlink counts, and more. Updated monthly from Wikidata + Wikipedia dumps.
**demo:**
[shane98c.github.io/wiki-geoparquet](https://shane98c.github.io/wiki-geoparquet/)
## Data
All files are hosted on Cloudflare R2 with CORS enabled, so you can query them
directly from the browser or any Parquet-aware tool:
| File | Description |
| ------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
| [`wikipedia_geotagged.parquet`](https://pub-016504dd3a4d419a9c17a8939840935e.r2.dev/v2/wikipedia_geotagged.parquet) | GeoParquet, Hilbert-sorted with bbox covering |
| [`wikipedia_geotagged.pmtiles`](https://pub-016504dd3a4d419a9c17a8939840935e.r2.dev/v2/wikipedia_geotagged.pmtiles) | Vector tiles, auto-zoom with overzoom, drops by article length |
| [`wikipedia_search.parquet`](https://pub-016504dd3a4d419a9c17a8939840935e.r2.dev/v2/wikipedia_search.parquet) | Lightweight search index (lowercased label, coords, inlink count), sorted by label for prefix-range row-group pruning |
Pinned versions are also available on
[GitHub Releases](https://github.com/Shane98c/wiki-geoparquet/releases/latest).
## Schema
| Column | Type | Description |
| ---------------- | -------------- | -------------------------------------------------------- |
| `geometry` | WKB Point | WGS84 coordinates (from Wikidata P625) |
| `page_id` | int32 | Wikipedia page ID |
| `qid` | string | Wikidata QID (e.g. Q90) |
| `label` | string | Article title |
| `description` | string | Short description from `wikibase-shortdesc` |
| `instance_of` | string | Wikidata P31 type (e.g. "city", "mountain", "museum") |
| `country` | string | Country name from Wikidata P17 |
| `population` | int64 | Population from Wikidata P1082 (nullable) |
| `geonames_id` | string | GeoNames ID for cross-referencing |
| `elevation` | int16 | Ground elevation in meters, sampled from Copernicus DEM |
| `gt_type` | string | Wikipedia geo classification (supplementary, from geo_tags) |
| `page_len` | int32 | Article length in bytes |
| `inlink_count` | int32 | Number of namespace-0 pagelinks pointing here |
| `wikipedia_url` | string | Full article URL |
| `image_url` | string | Wikimedia Commons image URL |
| `related_images` | list | Additional Wikidata images (P6802) |
| `bbox` | struct | Covering bbox for spatial predicate pushdown |
The PMTiles carry only `page_id`, `label`, and `inlink_count`. The demo
enriches popups on click by querying the GeoParquet directly via DuckDB-WASM.
Reconstruct article URLs client-side: `https://en.wikipedia.org/?curid={page_id}`
## Quick start
### Query with DuckDB
```sql
INSTALL spatial;
LOAD spatial;

-- Top 20 geotagged articles by inlink count
SELECT label, instance_of, country, inlink_count, ST_AsText(geometry)
FROM 'https://pub-016504dd3a4d419a9c17a8939840935e.r2.dev/v2/wikipedia_geotagged.parquet'
ORDER BY inlink_count DESC
LIMIT 20;

-- Articles near Paris (lat 48.8566, lon 2.3522), roughly a 50 km bbox.
-- Points have xmin = xmax and ymin = ymax, so filtering on the min fields is enough.
SELECT label, instance_of, country, inlink_count
FROM 'https://pub-016504dd3a4d419a9c17a8939840935e.r2.dev/v2/wikipedia_geotagged.parquet'
WHERE bbox.xmin BETWEEN 1.7 AND 3.0
  AND bbox.ymin BETWEEN 48.4 AND 49.3
ORDER BY inlink_count DESC
LIMIT 20;
```
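The Paris bounds in the query above can be derived from a center point and a radius. The following is a minimal Python sketch, not part of the repository: the helper name `bbox_around` and the 111.32 km-per-degree constant are assumptions for illustration, and it ignores the poles and the antimeridian.

```python
import math


def bbox_around(lat: float, lon: float, radius_km: float) -> tuple[float, float, float, float]:
    """Return (xmin, ymin, xmax, ymax) in degrees for a box of +/- radius_km around a point."""
    km_per_deg_lat = 111.32  # approximate length of one degree of latitude (assumption)
    dlat = radius_km / km_per_deg_lat
    # A degree of longitude shrinks with the cosine of the latitude
    dlon = radius_km / (km_per_deg_lat * math.cos(math.radians(lat)))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)


xmin, ymin, xmax, ymax = bbox_around(48.8566, 2.3522, 50)
print(f"bbox.xmin BETWEEN {xmin:.1f} AND {xmax:.1f}")
print(f"bbox.ymin BETWEEN {ymin:.1f} AND {ymax:.1f}")
```

Rounded to one decimal place, this reproduces the `1.7`/`3.0` and `48.4`/`49.3` limits used in the SQL above.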
### Browser search with duckdb-wasm
The `wikipedia_search.parquet` index has columns `label` (lowercased, used for
both sorting and lookup), `lon`, `lat`, and `inlink_count` for ranking. It's
sorted by `label` so a lowercased prefix range lets DuckDB skip row groups and
fetch ~2 MB per query instead of the full ~25 MB file.
```js
// Assumes `conn` is an open duckdb-wasm connection and `userInput` is the raw search text.
// In duckdb-wasm, INSTALL httpfs (or SET builtin_httpfs = false). Without
// this, the built-in HTTP handler downloads the whole file on every query.
// See https://github.com/duckdb/duckdb-wasm/issues/2153.
await conn.query("SET builtin_httpfs = false;");

// Lowercase to match the index, and escape single quotes for the SQL literal
const q = userInput.toLowerCase().replace(/'/g, "''");
const results = await conn.query(`
  SELECT label, lon, lat
  FROM 'https://.../wikipedia_search.parquet'
  WHERE label >= '${q}' AND label < '${q}~'
  ORDER BY inlink_count DESC
  LIMIT 10
`);
```
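To see why the `label >= q AND label < q~` predicate behaves like a prefix search on sorted data, here is a small Python sketch. It is only an illustration: `prefix_range` and the sample labels are hypothetical, and a sorted list with `bisect` stands in for the sorted row groups.

```python
import bisect


def prefix_range(user_input: str) -> tuple[str, str]:
    """Return the half-open [lo, hi) range that covers labels starting with the input."""
    q = user_input.lower()
    # '~' (0x7E) sorts after every ASCII letter, digit, and common punctuation mark
    return q, q + "~"


# Hypothetical, already-lowercased labels, sorted like the search index
labels = sorted(["london", "paris", "paris, texas", "parma", "perth"])

lo, hi = prefix_range("Pari")
start = bisect.bisect_left(labels, lo)
end = bisect.bisect_left(labels, hi)
hits = labels[start:end]
print(hits)  # ['paris', 'paris, texas']
```

Because matches form one contiguous run in sorted order, a reader only needs the row groups whose min/max label overlaps the range. One caveat of the `~` upper bound, assuming byte-wise or code-point ordering: a label whose next character sorts after `~` (for example an accented letter) falls outside the range.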
### Use PMTiles in MapLibre
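```js
import maplibregl from "maplibre-gl";
import { Protocol } from "pmtiles";

// Register the pmtiles:// protocol so MapLibre can fetch tiles from the archive
const protocol = new Protocol();
maplibregl.addProtocol("pmtiles", protocol.tile);

const map = new maplibregl.Map({
  container: "map", // assumed id of the map element; adjust to your page
  style: {
    version: 8, // required by the MapLibre style spec
    sources: {
      wikipedia: {
        type: "vector",
        url: "pmtiles://https://pub-016504dd3a4d419a9c17a8939840935e.r2.dev/v2/wikipedia_geotagged.pmtiles",
      },
    },
    layers: [
      {
        id: "articles",
        source: "wikipedia",
        "source-layer": "wikipedia",
        type: "circle",
        paint: {
          // Scale radius with sqrt(inlink_count): 2.5 px at 0, 20 px at sqrt = 280
          "circle-radius": [
            "interpolate",
            ["linear"],
            ["sqrt", ["get", "inlink_count"]],
            0,
            2.5,
            280,
            20,
          ],
          "circle-color": "#4264fb",
        },
      },
    ],
  },
});
```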
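The `circle-radius` expression maps the square root of `inlink_count` linearly from 2.5 px to 20 px and clamps outside the two stops. As a rough check of what that produces, here is a Python sketch of the same arithmetic; the function name `circle_radius` is hypothetical.

```python
import math


def circle_radius(inlink_count: float) -> float:
    """Mirror the style expression: linear interpolation over sqrt(inlink_count)."""
    x = math.sqrt(max(inlink_count, 0))
    x0, r0 = 0.0, 2.5     # first stop: sqrt = 0   -> 2.5 px
    x1, r1 = 280.0, 20.0  # second stop: sqrt = 280 -> 20 px
    if x <= x0:
        return r0
    if x >= x1:
        return r1  # interpolate clamps beyond the last stop
    return r0 + (x - x0) / (x1 - x0) * (r1 - r0)


for count in (0, 100, 19_600, 78_400):
    print(count, circle_radius(count))
```

The maximum radius is reached at 280² = 78,400 inlinks, and an article with 19,600 inlinks (square root 140) lands halfway, at 11.25 px.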
## How it works
1. Streams the Wikidata truthy N-Triples dump (~70 GB) to extract P625
coordinates and enrichment properties (instance type, country, population,
GeoNames ID, images)
2. Resolves P31/P17 QIDs to human-readable labels via the Wikidata API
3. Streams 5 Wikipedia SQL dump files (~12 GB) for article metadata, images,
descriptions, and inlink counts
4. Writes Hilbert-sorted GeoParquet with bbox covering via DuckDB spatial
5. Pipes DuckDB to [tippecanoe](https://github.com/felt/tippecanoe) for PMTiles
with attribute-based feature dropping
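
Step 1 can be pictured as a line-by-line filter over the dump. The sketch below is not the repository's code. It assumes that truthy N-Triples lines carry P625 as a `geosparql#wktLiteral` of the form `Point(lon lat)`, and that coordinates on other globes prefix the literal with a globe entity IRI. Both are assumptions about the dump format, and `parse_p625` is a hypothetical name.

```python
import re

# Assumed shape of a truthy P625 triple for an Earth coordinate
LINE_RE = re.compile(
    r'^<http://www\.wikidata\.org/entity/(Q\d+)> '
    r'<http://www\.wikidata\.org/prop/direct/P625> '
    r'"Point\(([-+0-9.eE]+) ([-+0-9.eE]+)\)"'
    r'\^\^<http://www\.opengis\.net/ont/geosparql#wktLiteral> \.$'
)


def parse_p625(line: str):
    """Return (qid, lon, lat) for an Earth P625 triple, or None for anything else."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # WKT puts longitude first, then latitude
    return m.group(1), float(m.group(2)), float(m.group(3))


sample = (
    '<http://www.wikidata.org/entity/Q90> '
    '<http://www.wikidata.org/prop/direct/P625> '
    '"Point(2.3522 48.8566)"^^<http://www.opengis.net/ont/geosparql#wktLiteral> .'
)
print(parse_p625(sample))
```

Streaming the file one line at a time through a filter like this keeps memory flat regardless of dump size.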
See the [Makefile](Makefile) for build steps.
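For intuition about the Hilbert sort in step 4: rows are ordered by their position along a space-filling curve, so nearby points tend to land in the same row group and bbox filters can skip most of the file. The repository does this inside DuckDB spatial. The pure-Python version below is only an illustrative sketch using the classic grid-to-curve conversion, and the names `hilbert_index`, `lonlat_to_cell`, and the `order=16` default are assumptions.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Distance along a Hilbert curve covering a 2**order x 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented consistently
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def lonlat_to_cell(lon: float, lat: float, order: int) -> tuple[int, int]:
    """Quantize WGS84 degrees onto the integer grid, clamping the upper edge."""
    n = 1 << order
    x = min(n - 1, int((lon + 180.0) / 360.0 * n))
    y = min(n - 1, int((lat + 90.0) / 180.0 * n))
    return x, y


def hilbert_key(lon: float, lat: float, order: int = 16) -> int:
    x, y = lonlat_to_cell(lon, lat, order)
    return hilbert_index(order, x, y)


# Hypothetical (lon, lat) points; sorting by key groups spatial neighbors together
points = [(2.3522, 48.8566), (-74.0, 40.7), (2.3, 48.9), (139.7, 35.7)]
print(sorted(points, key=lambda p: hilbert_key(*p)))
```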
## Data source
Coordinates and enrichment from
[Wikidata entity dumps](https://dumps.wikimedia.org/wikidatawiki/entities/),
article metadata from
[English Wikipedia SQL dumps](https://dumps.wikimedia.org/enwiki/latest/).
Both released under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
## License
Code: [MIT](LICENSE).
Data outputs: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
(derived from Wikipedia/Wikidata).