{"id":20486250,"url":"https://github.com/undp-data/geohub-data-pipeline","last_synced_at":"2025-04-13T15:21:12.885Z","repository":{"id":169444129,"uuid":"601718720","full_name":"UNDP-Data/geohub-data-pipeline","owner":"UNDP-Data","description":"Python data pipeline for UNDP Geohub data uploads.","archived":false,"fork":false,"pushed_at":"2025-04-09T22:56:18.000Z","size":45065,"stargazers_count":4,"open_issues_count":15,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-09T23:35:29.827Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UNDP-Data.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-14T17:09:27.000Z","updated_at":"2025-02-28T00:29:12.000Z","dependencies_parsed_at":null,"dependency_job_id":"44c94097-d6b9-4c6e-8f61-bd98f6c0ccaf","html_url":"https://github.com/UNDP-Data/geohub-data-pipeline","commit_stats":null,"previous_names":["undp-data/geohub-data-pipeline"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UNDP-Data%2Fgeohub-data-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UNDP-Data%2Fgeohub-data-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UNDP-Data%2Fgeohub-data-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UNDP-Data%2Fgeohub-data-pipeline/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/UNDP-Data","download_url":"https://codeload.github.com/UNDP-Data/geohub-data-pipeline/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248733211,"owners_count":21152977,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T16:35:51.695Z","updated_at":"2025-04-13T15:21:12.856Z","avatar_url":"https://github.com/UNDP-Data.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# geohub-data-pipeline\nThe \"geohub-data-pipeline\" is a service that processes files uploaded to the [Geohub](https://github.com/UNDP-Data/geohub) platform. It receives the filepath of each uploaded Azure blob from a service bus queue, converts the file to Cloud-Optimized GeoTIFF (COG) or PMTiles format, and stores the result in an Azure Blob storage container. This project is designed to run in a Docker container as a deployment on AKS.\n\n## Getting Started\n\n### Prerequisites\n\nTo use this API, you'll need to have:\n\n- An Azure account with access to an Azure Blob storage container for the UNDP Data Geohub\n- Docker installed on your local machine\n- An AKS cluster set up with kubectl configured to connect to it\n\n### Installation\n\n1. Clone this repository to your local machine.\n2. Navigate to the project directory in your terminal or command prompt.\n3. Any changes pushed to this repo will automatically trigger a build on the UNDP Data Azure Container Registry. 
\n\n## Configuration\n\nTo configure the API, you'll need to create a `.env` file in the `deployments/scripts` directory.\n\nHere are the configuration options:\n\n- `AZURE_STORAGE_CONNECTION_STRING`\n- `SERVICE_BUS_CONNECTION_STRING`\n- `SERVICE_BUS_QUEUE_NAME`\n\n### Usage\n\nTo use the API, follow these steps:\n\n1. Deploy the Docker container to your AKS cluster by running the command `deployments/scripts/install.sh`.\n2. Upload a file on the Geohub dev app.\n3. A message will be sent to the service bus queue with the filepath of the uploaded file and the user token for authentication.\n4. The API will determine whether the file is a raster or vector file, then convert it to COG or PMTiles format and store it in the user's \"datasets\" directory in Azure.\n\n\n\n## Testing the App Locally with Docker Compose\n\nTo test the app locally, you can use Docker Compose to build and run the app in a local development environment.\n\n1. Clone the repository to your local machine.\n\n        git clone https://github.com/UNDP-Data/geohub-data-pipeline.git\n\n\n2. Navigate to the cloned repository.\n\n        cd geohub-data-pipeline\n\n3. Copy `.env.example` to a new `.env` file in the root directory of the repository. This file contains the environment variables used by the app. The following variables are required:\n\n- `AZURE_STORAGE_CONNECTION_STRING`\n- `SERVICE_BUS_CONNECTION_STRING`\n- `AZURE_WEBPUBSUB_CONNECTION_STRING`\n\n4. Build and run the app using Docker Compose.\n\n        docker-compose -f ./dockerfiles/docker-compose.yaml up --build\n\n\n    This command will build the app's Docker image and start the app in a Docker container. The `--build` flag rebuilds the image before starting the container, so changes to the app's code are picked up.\n\n5. Once the container is running, it will automatically begin scanning for messages in the service bus queue. 
If a message is found, it will be processed and the file will be converted to COG or PMTiles format and stored in the user's \"datasets\" directory in Azure.\n\n6. To stop the app, use the following command:\n\n        docker-compose down\n\n    This command will stop and remove the Docker container.\n\n**Note:** This is only a local development environment. For production deployment, you should use a production-grade Docker image and deploy it to a production environment.\n\n\n## Running the pipeline as a command line tool\nWhile the pipeline was designed to run in, and take advantage of, a cloud environment, it is sometimes desirable\nto run it as a command line tool. \nIn general, the best option to run GDAL-dependent Python software is through Docker.\nThis avoids potential issues arising from GDAL dependencies and version mismatches.\n\nA viable alternative would be to create a setup.py/pyproject.toml, but that\nassumes the user has a decent understanding of GDAL internals.\n\nFor the sake of simplicity, we chose to rely on Docker.\n\n1. Clone the repository to your local machine.\n\n        git clone https://github.com/UNDP-Data/geohub-data-pipeline.git\n\n\n2. Navigate to the cloned repository.\n\n        cd geohub-data-pipeline\n\n3. Build the Docker image and assign it a meaningful tag.\n\n   ```\n   docker build . -t geohub-data-pipeline\n   ```\n4. 
Run the command line version.\n   ```commandline\n   docker run --rm -it  -v /home/janf/Downloads/data:/data geohub-data-pipeline python -m ingest.cli.main -src /data/Sample.gpkg -dst /data/out/ \n\n   ```\nThe above command runs the previously built image in a new container,\nremoves the container afterwards, maps the \"/home/janf/Downloads/data\" folder to the **/data** folder inside the container, and finally\ncalls the command line version of the pipeline to convert all vector layers to PMTiles and raster bands to COG format.\n\nIf the command line script is called without any arguments, the help is displayed:\n```commandline\ndocker run --rm -it  -v /home/janf/Downloads/data:/data geohub-data-pipeline python -m ingest.cli.main\n/usr/local/lib/python3.10/dist-packages/rio_cogeo/profiles.py:182: UserWarning: Non-standard compression schema: zstd. The output COG might not be fully supported by software not build against latest libtiff.\n  warnings.warn(\nusage: main.py [-h] [-src SOURCE_FILE] [-dst DESTINATION_DIRECTORY] [-j] [-d]\n\nConvert layers/bands from GDAL supported geospatial data files to COGs/PMtiles.\n\noptions:\n  -h, --help            show this help message and exit\n  -src SOURCE_FILE, --source-file SOURCE_FILE\n                        A full absolute path to a geospatial data file. (default: None)\n  -dst DESTINATION_DIRECTORY, --destination-directory DESTINATION_DIRECTORY\n                        A full absolute path to a folder where the files will be written. (default: None)\n  -j, --join-vector-tiles\n                        Boolean flag to specify whether to create a multilayer PMtiles filein case the source file contains more then one vector layer (default: False)\n  -d, --debug           Set log level to debug (default: False)\n\n```\n\n### FAQs\n1. What geospatial formats are supported?\nANSWER: Anything GDAL supports. Be aware that the image uses the latest GDAL image (3.7).\n2. What does the tool do? 
\nANSWER: \n   - converts all vector layers to PMTiles format\n   - optionally creates a multilayer PMTiles file when the `-j`/`--join-vector-tiles` flag is set\n   - converts all raster bands or RGB rasters to COG, with zstd compression and \n   the Google Web Mercator projection\n3. Are any environment variables needed?\nANSWER: No\n\n\n\n        \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fundp-data%2Fgeohub-data-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fundp-data%2Fgeohub-data-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fundp-data%2Fgeohub-data-pipeline/lists"}