{"id":21683763,"url":"https://github.com/open-eo/openeo-udf-python-to-r","last_synced_at":"2026-04-29T09:35:55.539Z","repository":{"id":38397685,"uuid":"438186356","full_name":"Open-EO/openeo-udf-python-to-r","owner":"Open-EO","description":"Run openEO R UDFs from Python","archived":false,"fork":false,"pushed_at":"2022-11-18T15:54:11.000Z","size":355,"stargazers_count":0,"open_issues_count":3,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-05-01T23:01:18.918Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Open-EO.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-12-14T09:09:20.000Z","updated_at":"2022-07-04T18:01:15.000Z","dependencies_parsed_at":"2022-07-12T17:28:35.072Z","dependency_job_id":null,"html_url":"https://github.com/Open-EO/openeo-udf-python-to-r","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-EO%2Fopeneo-udf-python-to-r","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-EO%2Fopeneo-udf-python-to-r/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-EO%2Fopeneo-udf-python-to-r/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-EO%2Fopeneo-udf-python-to-r/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Open-EO","download_url":"https://codeload.github.com/Open-EO/openeo-udf-python-to-r/tar.gz/refs/heads/main","host":{"name":"GitHub","url":
"https://github.com","kind":"github","repositories_count":244601905,"owners_count":20479521,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-25T16:13:26.841Z","updated_at":"2026-04-29T09:35:55.473Z","avatar_url":"https://github.com/Open-EO.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# openeo-udf-python-to-r / openeo-r-udf\n\nThis is an experimental engine for openEO to run R UDFs from an R environment.\n\nThis implementation is currently limited to R UDFs that are running without any other processes in the following processes:\n- `apply`\n- `apply_dimension`\n- `reduce_dimension`\n\nThis repository contains the following content:\n- The scripts to run for testing: `tests/test.py` (single core) and `tests/test_parallel.py` (parallelized).\n- The folder `tests/udfs` contains UDF examples as users could provide them.\n- `udf_lib.py` is a Python library with the Python code required to run R UDFs from Python\n- `executor.R` is the R script that is run from R and executes the R UDF in the Python environment.\n\nThe following image shows how the implementation roughly works:\n![Workflow](docs/workflow.png)\n\n## UDF integration\n\nThis section is for back-end developers who want to add R UDF capabilities to their back-end \nor for end-users who want to test their UDFs locally.\n\n### Install from pypi\n\nYou may want to install all dependencies as a new conda environment first:\n\n`conda env create -f environment.yml`\n\nYou can install this library from pypi then:\n\n`pip install openeo-r-udf`\n\n### Run 
UDFs\n\nIn the following section we'll give examples of how to use the UDF library from a Python environment.\n\nThe following variables should be defined:\n- `udf` (string - The content of the parameter `udf` from `run_udf`, i.e. UDF code or a path/URL to a UDF)\n- `udf_folder` (string - The folder where the UDFs reside or should be written to)\n- `process` (string - The parent process, i.e. `apply`, `apply_dimension` or `reduce_dimension`)\n- `data` (xarray.DataArray - The data to process)\n- `dimension` (string, defaults to `None` - The dimension to work on if applicable, does not apply to `apply`)\n- `context` (Any, defaults to `None` - The data that has been passed in the `context` parameter)\n\nAdditionally, if your back-end keeps track of it, you can pass `spatial_dims` and `temporal_dims` to `execute_udf`\nwhere each is a list of dimension names (as strings) for the corresponding dimension types spatial (x,y,z) and temporal.\n\n### Without Parallelization *or* With Parallelization through Dask\n\nIf your back-end parallelizes already, you can directly run the following code:\n\n```python\n# import the UDF library\nfrom openeo_r_udf.udf_lib import prepare_udf, execute_udf\n\n# Define variables as documented above\n\n# Load UDF file (this should not be parallelized)\nudf_path = prepare_udf(udf, udf_folder)\n\n# Execute UDF file (this can be parallelized)\nresult = execute_udf(process, udf_path, data, dimension=dimension, context=context)\n```\n\nIf you parallelize with Dask, the xarray.DataArray must consist of Dask arrays, i.e. 
the `chunks` attribute of the DataArray MUST NOT be `None`.\n\n### With Parallelization through joblib\n\nIf your back-end is not parallelizing at all, you can run the following:\n\n```python\n# import the UDF library - make sure joblib is installed first\nfrom openeo_r_udf.udf_lib import prepare_udf, execute_udf, chunk_cube, combine_cubes\nfrom joblib import Parallel, delayed as joblibDelayed\n\n# Parallelization config (n_jobs = -1 lets joblib use all available cores)\nchunk_size = 1000\nnum_jobs = -1\n\n# Define variables as documented above\n\n# Load UDF file (this should not be parallelized)\nudf_path = prepare_udf(udf, udf_folder)\n\n# Define callback function that runs the UDF on a single chunk\ndef compute_udf(chunk):\n    return execute_udf(process, udf_path, chunk.compute(), dimension=dimension, context=context)\n\n# Run UDF in parallel\ninput_data_chunked = chunk_cube(data, size=chunk_size)\nresults = Parallel(n_jobs=num_jobs, verbose=51)(joblibDelayed(compute_udf)(chunk) for chunk in input_data_chunked)\nresult = combine_cubes(results)\n```\n\nThe `result` variable holds the processed data as an `xarray.DataArray` again.\n\n## Writing a UDF\n\n*This is for end-users*\n\nA UDF must be written differently depending on where it is executed.\nThe underlying library used for data handling is always [`stars`](https://r-spatial.github.io/stars/).\n\n### apply\n\nA UDF that is executed inside the process `apply` manipulates the values on a per-pixel basis.\nYou **can't** add or remove labels or dimensions.\n\nThe UDF function must be named `udf` and receives two parameters:\n\n- `x` is a multi-dimensional stars object and you can run vectorized functions on a per pixel basis, e.g. `abs`.\n- `context` passes through the data that has been passed to the `context` parameter of the parent process (here: `apply`). 
If nothing has been provided explicitly, the parameter is set to `NULL`.\n  \n  This can be used to pass through configurable options, parameters or some additional data.\n  For example, if you would execute `apply(process = run_udf(...), context = list(min = -1, max = -100))` then you could access the corresponding values in the UDF below as `context$min` and `context$max` (see example below).\n\nThe UDF must return a stars object with exactly the same shape.\n\n**Example:**\n\n```r\nudf = function(x, context) {\n  min(max(x, context$min), context$max) \n}\n```\n\n### apply_dimension\n\nA UDF that is executed inside the process `apply_dimension` takes all values along a dimension and computes new values based on them.\nThis could for example compute a moving average over a timeseries.\n\nThere are two different variants of UDFs that can be used as processes for `apply_dimension`.\nA reducer can be executed either \"vectorized\" or \"per chunk\".\nThis is the same behavior as defined for `reduce_dimension`. \nPlease see below for more details.\n\n### reduce_dimension\n\nA UDF that is executed inside the process `reduce_dimension` takes all values along a dimension and computes a single value for it.\nThis could for example compute an average for a timeseries.\n\nThere are two different forms of UDFs that can be used as reducers\nfor `reduce_dimension`: a reducer can be executed either \"vectorized\"\nor \"per chunk\".\n\n#### vectorized\n\nThe vectorized variant is usually the more efficient variant as it's executed once on a larger chunk of the data cube.\n\nThe UDF function must be named `udf` and receives two parameters:\n\n- `data` is a matrix. 
Each row contains the values for a \"pixel\" and the columns are the values along the given dimension.\n  So, if you reduce along the temporal dimension, the columns are the individual timestamps.\n- `context` -\u003e see the description of `context` for `apply`.\n\nThe UDF must return a list of values.\n\n**Example:**\n\n```r\nudf = function(data, context) {\n  # To get the labels for the values once:\n  # labels = colnames(data)\n  do.call(pmax, as.data.frame(data))\n  # You could also use apply, but this is much slower as it is not vectorized:\n  # apply(data, 1, max) * context\n}\n```\n\nThe input data may look like this if you reduce along a band dimension with three bands `r`, `g` and `b`:\n\n- `data` could be `matrix(c(1,2,6,3,4,5,7,1,0), nrow = 3, dimnames = list(NULL, c(\"r\",\"g\",\"b\")))`\n- `colnames(data)` would be `c(\"r\", \"g\", \"b\")`\n- Executing the example above would return `c(7, 4, 6)`\n\n#### per chunk\n\nThis variant is usually slower, but might be required for certain use cases. It is executed multiple times on the smallest chunk possible for the dimension given, e.g., a single time series.\n\nThe UDF function must be named `udf_chunked` and receives two parameters:\n\n- `data` is a list of values, e.g. 
a single time series which you could pass to `max` or `mean`.\n- `context` -\u003e see the description of `context` for `apply`.\n\nThe UDF must return a single value.\n\n**Example:**\n\n```r\nudf_chunked = function(data, context) {\n  # To get the labels for the values:\n  # labels = names(data)\n  max(data)\n}\n```\n\nThe input data may look like this if you reduce along a band dimension with three bands `r`, `g` and `b`:\n\n- `data` could be `c(1, 2, 3)`\n- `names(data)` would be `c(\"r\", \"g\", \"b\")`\n- Executing the example above would return `3`\n\n##### Setup and Teardown\n\nAs `udf_chunked` is usually executed many times in a row, you can optionally define two additional functions that are executed once before and once after the execution.\nThese functions must be named `udf_setup` and `udf_teardown` and be placed in the same file as `udf_chunked`.\n`udf_setup` could be useful to initially load some data, e.g. a machine learning (ML) model.\n`udf_teardown` could be used to clean-up stuff that has been opened in `udf_setup`.\n\n**Note:** `udf_setup` and `udf_teardown` are only available if you implement `udf_chunked`.\nIf you implement `udf`, the two additional functions are not available as you can execute them directly in the `udf` function, which is only executed once (for each worker).\n\nBoth functions receive a single parameter, which is the `context` parameter explained above.\nHere the context parameter could contain the path to a ML model file, for example.\nBy using the context parameter, you can avoid hard-coding information, which helps to make UDFs more reusable.\n\n**Example:**\n\n```r\nudf_setup = function(context) {\n  # e.g. load a ML model from a file\n}\n\nudf_chunked = function(data, context) {\n  max(data)\n}\n\nudf_teardown = function(context) {\n  # e.g. 
clean-up tasks\n}\n```\n\n**Note:** `udf_teardown` is only executed if none of the `udf_chunked` calls have resulted in an error.\n\nIf you'd like to make some data available in `udf_chunked` and/or `udf_teardown` that you have prepared in `udf_setup` (or `udf_chunked`), you can use a global variable\nand the [special assignment operator](https://cran.r-project.org/doc/manuals/R-intro.html#Scope) `\u003c\u003c-` to assign to variables in an enclosing environment.\n\n**Example:**\n\nThis loads a trained ML model object from a URL in `udf_setup` and makes it available for prediction in `udf_chunked`.\nThis matters because loading the ML model inside `udf_chunked` would download it on every call, often thousands of times, which makes the computation very slow.\n\n```r\nmodel \u003c- NULL\n\nudf_setup = function(context) {\n  model \u003c\u003c- load_model(\"https://example.com/model\")\n}\n\nudf_chunked = function(data, context) {\n  return(predict(data, model))\n}\n```\n\n## Examples\n### Docker image for running on a backend\nHere's an example of a Docker image that is used to run the R-UDF service on an openEO platform backend:\n\u003chttps://github.com/Open-EO/r4openeo-usecases/tree/main/vito-docker\u003e\n\n### Implementation at Eurac\nHere is an example of how the R-UDF service is integrated in the Eurac openEO backend based on Open Data Cube:\n\u003chttps://github.com/SARScripts/openeo_odc_driver/blob/f34cd35107e4fb137fc1d23cae246ed362517c75/openeo_odc_driver.py#L289\u003e\n\n### R4openEO use cases\nHere are use cases that use the R-UDF service:\n\u003chttps://github.com/Open-EO/r4openeo-usecases\u003e\n\n\n## Development\n\nClone this repository and switch into the corresponding folder.\n\n1. Install environment via conda: `conda env create -f environment.yml`\n2. Install package for development: `pip install -e .`\n3. 
Now you can run one of the tests for example: `python3 tests/test.py`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-eo%2Fopeneo-udf-python-to-r","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopen-eo%2Fopeneo-udf-python-to-r","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-eo%2Fopeneo-udf-python-to-r/lists"}