https://github.com/mdsumner/gdx
- Host: GitHub
- URL: https://github.com/mdsumner/gdx
- Owner: mdsumner
- Created: 2025-04-19T20:53:34.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-11-25T12:47:47.000Z (6 months ago)
- Last Synced: 2026-01-12T02:48:09.726Z (5 months ago)
- Language: Python
- Size: 44.9 KB
- Stars: 11
- Watchers: 3
- Forks: 1
- Open Issues: 5
Metadata Files:
- Readme: README.Rmd
- Code of conduct: CODE_OF_CONDUCT.md
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
reticulate::use_python("~/workenv/bin/python3")
#reticulate::repl_python()
```
# gdx
The goal of gdx is to integrate GDAL with xarray, especially for the multidimensional API which is still relatively underutilized.
## Todo
- [ ] apply xarray indexes when relevant in Raster and Multidim (see [issue](https://github.com/mdsumner/gdx/issues/3) for some discussion)
- [ ] explore when we need to control driver choice
- [ ] compare to opening with GDAL itself after `mdim mosaic`
Here's a basic example; this could be registered as an xarray backend *engine*.
```{python basic, eval = F}
from gdx import GDALBackendEntrypoint
backend = GDALBackendEntrypoint()
dsn = "/vsicurl/https://projects.pawsey.org.au/idea-sealevel-glo-phy-l4-nrt-008-046/data.marine.copernicus.eu/SEALEVEL_GLO_PHY_L4_NRT_008_046/cmems_obs-sl_glo_phy-ssh_nrt_allsat-l4-duacs-0.125deg_P1D_202506/2025/08/nrt_global_allsat_phy_l4_20250825_20250825.nc"
ds = backend.open_dataset(f'vrt://{dsn}?sd_name=vgos', chunks = {})
ds1 = backend.open_dataset(dsn, multidim = True, chunks = {})
```
We have a Raster xarray:
```{python r,eval=F}
ds
# Size: 17MB
# Dimensions:  (x: 2880, y: 1440)
# Coordinates:
#   * x        (x) float64 23kB -180.0 -179.9 -179.8 -179.6 ... 179.6 179.8 179.9
#   * y        (y) float64 12kB 90.0 89.88 89.75 89.62 ... -89.62 -89.75 -89.88
# Data variables:
#     band_1   (y, x) int32 17MB dask.array
# Attributes:
#     crs:           GEOGCS["unknown",DATUM["unnamed",SPHEROID["Spheroid",63781...
#     geotransform:  (-180.0, 0.125, 0.0, 90.0, 0.0, -0.125)
```
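The `geotransform` attribute is GDAL's six-element affine transform. As a rough check, the coordinate values printed for the raster can be reproduced from it. This sketch uses only the numbers shown in that output and assumes a north-up grid (zero rotation terms); the function name `axis_coords` is illustrative, not part of gdx.

```python
def axis_coords(origin, step, n, center=False):
    """Coordinates along one axis of a north-up GDAL geotransform.

    center=False gives the cell edge nearest the origin (top-left corner),
    center=True shifts by half a cell to the cell centre.
    """
    shift = 0.5 if center else 0.0
    return [origin + (i + shift) * step for i in range(n)]


# Values copied from the Raster output above.
gt = (-180.0, 0.125, 0.0, 90.0, 0.0, -0.125)
nx, ny = 2880, 1440

x = axis_coords(gt[0], gt[1], nx)
y = axis_coords(gt[3], gt[5], ny)
print(x[0], x[-1])   # -180.0 179.875
print(y[0], y[-1])   # 90.0 -89.875

x_center = axis_coords(gt[0], gt[1], nx, center=True)
print(x_center[0], x_center[-1])   # -179.9375 179.9375
```

The printed raster coordinates match the edge values (-180.0 to 179.875), while the Multidim `longitude` values (-179.9 to 179.9) look like cell centres, so the two views appear to differ by half a cell.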
and a Multidim xarray:
```{python m,eval=FALSE}
ds1
# Size: 166MB
# Dimensions:    (latitude: 1440, nv: 2, longitude: 2880, time: 1)
# Coordinates:
#   * latitude   (latitude) float32 6kB -89.94 -89.81 -89.69 ... 89.69 89.81 89.94
#   * nv         (nv) int32 8B 0 1
#   * longitude  (longitude) float32 12kB -179.9 -179.8 -179.7 ... 179.8 179.9
#   * time       (time) float32 4B 2.763e+04
# Data variables:
#     lat_bnds   (latitude, nv) float32 12kB dask.array
#     lon_bnds   (longitude, nv) float32 23kB dask.array
#     sla        (time, latitude, longitude) int32 17MB dask.array
#     err_sla    (time, latitude, longitude) int32 17MB dask.array
#     ugosa      (time, latitude, longitude) int32 17MB dask.array
#     err_ugosa  (time, latitude, longitude) int32 17MB dask.array
#     vgosa      (time, latitude, longitude) int32 17MB dask.array
#     err_vgosa  (time, latitude, longitude) int32 17MB dask.array
#     adt        (time, latitude, longitude) int32 17MB dask.array
#     ugos       (time, latitude, longitude) int32 17MB dask.array
#     vgos       (time, latitude, longitude) int32 17MB dask.array
#     flag_ice   (time, latitude, longitude) int32 17MB dask.array
# Attributes: (12/44)
#     Conventions:               CF-1.6
#     Metadata_Conventions:      Unidata Dataset Discovery v1.0
#     cdm_data_type:             Grid
#     comment:                   Sea Surface Height measured by Altimetry...
#     contact:                   servicedesk.cmems@mercator-ocean.eu
#     creator_email:             servicedesk.cmems@mercator-ocean.eu
#     ...                        ...
#     summary:                   DUACS Near-Real-Time Level-4 sea surface...
#     time_coverage_duration:    P1D
#     time_coverage_end:         2025-08-25T12:00:00Z
#     time_coverage_resolution:  P1D
#     time_coverage_start:       2025-08-24T12:00:00Z
#     title:                     NRT merged all satellites Global Ocean G...
```
There's one variable called 'band_1' for the raster:
```{python eval = FALSE}
ds.band_1.isel(x = 0)
# Size: 6kB
# dask.array
# Coordinates:
# x float64 8B -180.0
# * y (y) float64 12kB 90.0 89.88 89.75 89.62 ... -89.62 -89.75 -89.88
# Attributes:
# nodata: -2147483647.0
# scale: 0.0001
# offset: 0.0
```
We can access actual values:
```{python, eval = F}
# Raw stored values for now (scale and offset are not applied)
ds.band_1.sel(x = 100, y = -50).values
# array(441, dtype=int32)
ds1.sla.isel(longitude = 0, latitude = 1000).values
# array([2404], dtype=int32)
```
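Since these are raw integers, the `nodata`, `scale` and `offset` attributes shown for `band_1` can be applied by hand. A minimal sketch in plain Python, assuming the usual GDAL convention `value = raw * scale + offset`; the helper name `unscale` is illustrative and not part of gdx.

```python
import math


def unscale(raw, scale=1.0, offset=0.0, nodata=None):
    """Convert a raw stored value to a scaled value, mapping nodata to NaN."""
    if nodata is not None and raw == nodata:
        return math.nan
    return raw * scale + offset


# Attributes copied from the band_1 output above.
attrs = {"nodata": -2147483647.0, "scale": 0.0001, "offset": 0.0}

print(unscale(441, **attrs))          # roughly 0.0441
print(unscale(-2147483647, **attrs))  # nan
```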
This example is a virtualized mosaic of NetCDF files in a multidim VRT.
```{python mdim, eval=F}
big_virtual_mdim = "/vsicurl/https://gist.githubusercontent.com/mdsumner/18c5d302d00b9a456bb73d30ac758764/raw/f26e1b2e202f759d6aace4d7deb3e04ea3c85f15/mdim.vrt"
bvm = backend.open_dataset(big_virtual_mdim, multidim = True, chunks = {})
# Size: 3TB
# Dimensions: (Time: 5479, st_ocean: 51, yt_ocean: 1500, xt_ocean: 3600)
# Coordinates:
# * Time (Time) float64 44kB 1.132e+04 1.132e+04 ... 1.68e+04 1.68e+04
# * st_ocean (st_ocean) float64 408B 2.5 7.5 12.5 ... 3.603e+03 4.509e+03
# * yt_ocean (yt_ocean) float64 12kB -74.95 -74.85 -74.75 ... 74.75 74.85 74.95
# * xt_ocean (xt_ocean) float64 29kB 0.05 0.15 0.25 0.35 ... 359.8 359.9 360.0
# Data variables:
# temp (Time, st_ocean, yt_ocean, xt_ocean) int16 3TB dask.array
bvm.sel(xt_ocean = slice(140, 150), yt_ocean = slice(-55, -45), st_ocean = slice(8, 13)).isel(Time = -1).temp.values
# array([[[-30770, -30784, -30799, ..., -30418, -30424, -30445],
# [-30755, -30771, -30788, ..., -30418, -30425, -30446],
# [-30744, -30764, -30788, ..., -30417, -30426, -30448],
# ...,
# [-29852, -29868, -29889, ..., -29413, -29338, -29325],
# [-29835, -29851, -29883, ..., -29385, -29327, -29324],
# [-29821, -29840, -29879, ..., -29353, -29319, -29322]]],
# shape=(1, 100, 100), dtype=int16)
```
There's a lot more to do. Scaling works, but I turned it off for testing for now.
Template a list of NetCDF files and mosaic them to VRT, then open the result with this xarray backend. (Note: this requires GDAL >= 3.12.0.)
```{python eval = FALSE}
from osgeo import gdal
from gdx import GDALBackendEntrypoint

# One URL per day of January 2025
month = "202501"
url = [f"/vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/{month}/oisst-avhrr-v02r01.{month}{(day+1):02d}.nc" for day in range(31)]
# Mosaic the "sst" array of every file into a single multidim VRT
gdal.Run("mdim mosaic", input = url, output = "oisst.vrt", array = "sst")

backend = GDALBackendEntrypoint()
backend.open_dataset("oisst.vrt", multidim = True)
# Size: 64MB
# Dimensions: (lat: 720, lon: 1440, time: 31, zlev: 1)
# Coordinates:
# * lat (lat) float64 6kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
# * lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
# * time (time) float64 248B 1.717e+04 1.72e+04 ... 1.717e+04 1.717e+04
# * zlev (zlev) float64 8B 0.0
# Data variables:
# sst (time, zlev, lat, lon) int16 64MB ...
```
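The list comprehension in that example hard-codes 31 days. A small sketch of the same templating for any month, using `calendar.monthrange` from the standard library; the helper name `oisst_urls` is illustrative, and the URL pattern is copied from the example as-is.

```python
import calendar

BASE = ("/vsicurl/https://www.ncei.noaa.gov/data/"
        "sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr")


def oisst_urls(month):
    """Build one /vsicurl/ URL per day of `month`, given as 'YYYYMM'."""
    year, mon = int(month[:4]), int(month[4:6])
    ndays = calendar.monthrange(year, mon)[1]  # number of days in the month
    return [f"{BASE}/{month}/oisst-avhrr-v02r01.{month}{day:02d}.nc"
            for day in range(1, ndays + 1)]


urls = oisst_urls("202502")
print(len(urls))   # 28
print(urls[0])
```

The resulting list can be passed as `input` to the `gdal.Run("mdim mosaic", ...)` call in the same way as `url`.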
### Open questions
- I set `chunks = {}` by default. Is that ok?
- dask will very happily throw fsspec byte-range requests at THREDDS, and more than 10x gets a 104 error, whereas GDAL multidim is better behaved when used on its own. Can we leverage this (GDAL connection pooling?) within xarray?
- I saw errors from HDF5. Is that via NetCDF, or is my driver selection going wrong?
## Code of Conduct
Please note that the gdx project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.