https://github.com/undp-data/sids-data-pipeline

Python data pipeline for SIDS project

Host: GitHub
URL: https://github.com/undp-data/sids-data-pipeline
Owner: UNDP-Data
License: mit
Created: 2022-01-05T15:49:43.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2022-10-03T03:56:07.000Z (over 3 years ago)
Last Synced: 2025-03-05T16:40:48.651Z (over 1 year ago)
Language: Python
Size: 142 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# SIDS data processing pipeline

## Intro

**Small Island Developing States (SIDS)** are a group of island states scattered across the world. This data pipeline pre-processes and generates the bulk of the spatial data for the SIDS platform [geospatial application](https://data.undp.org/sids/geospatial-data). For each vector layer, the pipeline computes zonal statistics from each raster layer, converts the results into Mapbox vector tiles (.pbf), and stores them in an Azure Blob Storage container.
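As a rough illustration of what "zonal statistics" means here, the sketch below aggregates raster cell values per zone in plain Python. It is not the repository's implementation: the function name `zonal_stats`, the convention that zone id `0` means "outside every zone", and the `nodata` value are all assumptions made for the example. It also assumes the vector zones have already been rasterized onto the same grid as the raster.

```python
from collections import defaultdict


def zonal_stats(values, zones, nodata=None):
    """Summarize raster cells per zone (hypothetical helper, not the repo's API).

    values: 2D list of raster cell values.
    zones:  2D list of integer zone ids on the same grid; 0 is assumed to
            mean "outside every zone".
    nodata: optional value marking cells to ignore.
    """
    cells_by_zone = defaultdict(list)
    for value_row, zone_row in zip(values, zones):
        for value, zone_id in zip(value_row, zone_row):
            # Skip cells outside all zones and cells flagged as nodata.
            if zone_id == 0 or value == nodata:
                continue
            cells_by_zone[zone_id].append(value)

    return {
        zone_id: {
            "count": len(cells),
            "min": min(cells),
            "max": max(cells),
            "mean": sum(cells) / len(cells),
        }
        for zone_id, cells in cells_by_zone.items()
    }


# Toy 2x3 grid: one nodata cell and one cell outside every zone.
raster = [[1.0, 2.0, -9999.0],
          [3.0, 4.0, 5.0]]
zone_ids = [[1, 1, 2],
            [1, 2, 0]]

stats = zonal_stats(raster, zone_ids, nodata=-9999.0)
print(stats[1])  # summary for zone 1
print(stats[2])  # summary for zone 2
```

In practice a geospatial library would handle the rasterization and the aggregation over large GeoTIFFs; the loop above only shows the per-zone grouping idea.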

## Project Structure

Inputs are hosted in Azure Blob Storage, in the `inputs` folder of the `sids` container. Rasters are stored as GeoTIFFs and vectors as GeoPackages, each in its own subfolder. The `batch.csv` file provides metadata about the rasters.

```shell
inputs
├── batch.csv
├── rasters
│   ├── data1.tif
│   ├── data2.tif
│   └── data3.tif
└── vectors
    ├── zone1.gpkg
    ├── zone2.gpkg
    └── zone3.gpkg
```

## Batch

Batch is the first sub-module. It imports rasters from across Azure Blob Storage into a single folder. This module takes a few hours to run. Based on `batch.csv`, the following data standardizations are applied:

- ZSTD compression
- EPSG:4326 projection
- clipped to lonmin=-180, lonmax=180, latmin=-35, latmax=35
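The three standardizations above map naturally onto a single `gdalwarp` call. The sketch below only builds the argument list and does not run it. Using `gdalwarp` at all is an assumption, since the source does not say which tool the batch module uses, and the helper name `build_warp_command` and the file paths are hypothetical. ZSTD compression also requires a GDAL build with ZSTD support.

```python
import shlex


def build_warp_command(src_path, dst_path):
    """Build a gdalwarp argument list (hypothetical helper, not the repo's code)."""
    return [
        "gdalwarp",
        "-t_srs", "EPSG:4326",               # reproject to EPSG:4326
        "-te", "-180", "-35", "180", "35",   # clip: xmin ymin xmax ymax
        "-of", "GTiff",                      # write a GeoTIFF
        "-co", "COMPRESS=ZSTD",              # ZSTD compression
        src_path,
        dst_path,
    ]


# Placeholder paths, for illustration only.
command = build_warp_command("raw/data1.tif", "inputs/rasters/data1.tif")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it on a machine where GDAL is installed.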

## Pipeline

Pipeline is the second sub-module and accounts for most of the run time. It generates the zonal statistics and vector tiles. Before processing a vector/raster combination, the pipeline checks whether its output already exists at the destination and, if so, skips that combination.
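The skip-if-exists behaviour can be sketched as a filter over every vector/raster pair. This is a minimal illustration, not the repository's logic: the output naming scheme in `output_name`, the helper `pending_jobs`, and the idea of passing in a pre-fetched set of existing destination names are all assumptions.

```python
from itertools import product
from pathlib import PurePosixPath


def output_name(vector, raster):
    """Hypothetical naming scheme for the result of one vector/raster pair."""
    return f"{PurePosixPath(vector).stem}/{PurePosixPath(raster).stem}.csv"


def pending_jobs(vectors, rasters, existing_outputs):
    """Return the (vector, raster) pairs whose output is not yet at the destination."""
    existing = set(existing_outputs)
    return [
        (vector, raster)
        for vector, raster in product(vectors, rasters)
        if output_name(vector, raster) not in existing
    ]


vectors = ["zone1.gpkg", "zone2.gpkg"]
rasters = ["data1.tif", "data2.tif"]
# Names assumed to be already present at the destination.
already_done = {"zone1/data1.csv", "zone2/data2.csv"}

jobs = pending_jobs(vectors, rasters, already_done)
print(jobs)
```

Listing the destination once and filtering in memory avoids one existence check per combination, which matters when the number of vector/raster pairs is large.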

## Setup

To get started, populate the `.env` file with values using the template, then log in to Azure and Docker.

```shell
az login
docker login undpgeohub.azurecr.io
```

To run either the batch or the pipeline, change directory into the corresponding subfolder and run `./deploy.sh` from there.
