https://github.com/mpadge/syte-mpadge
- Host: GitHub
- URL: https://github.com/mpadge/syte-mpadge
- Owner: mpadge
- Created: 2025-01-23T16:43:22.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-01-26T11:42:34.000Z (4 months ago)
- Last Synced: 2025-03-10T06:38:23.942Z (3 months ago)
- Language: Python
- Size: 14.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Syte coding exercise
> The primary performance goal is to minimize the response time of the
> get_image(lat, long, radius) function. Given that reading a single .jp2 file
> can take several seconds or minutes, your objective is to maximize the
> throughput.

The approach taken here is to use GDAL for all of the heavy lifting, by
reading only the required portions of each file and by applying
aggregation (or down-sampling) filters to data _on disk_. This entirely
eliminates any need for these files to be directly read into memory at their
original resolution. To enable usable benchmarking, the script is strictly
single-threaded, with the potential effects of multi-threading described below.
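As a minimal sketch of this idea (not the repository's actual code), rasterio,
a Python binding to GDAL, can read a single window of a `.jp2` file and
decimate it on read via `out_shape`; the file name, bounds, and shapes below
are placeholders:

```python
import rasterio
from rasterio.enums import Resampling
from rasterio.windows import from_bounds

# Hypothetical tile and bounding box in the file's CRS (EPSG:25832 here).
with rasterio.open("tile.jp2") as src:
    window = from_bounds(600000, 5700000, 600200, 5700200, src.transform)
    # Read only this window, decimated to 256x256 with an averaging
    # (aggregation) filter; the full-resolution raster never enters memory.
    data = src.read(
        out_shape=(src.count, 256, 256),
        window=window,
        resampling=Resampling.average,
    )
```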
The repository is `makefile`-controlled, and includes an initial `help` option.
`make check` checks for all required Python dependencies and reports any
missing ones. `make run` then runs the script. The resultant image may be
viewed by uncommenting the final line, `img.show()`.

### Execution speed
The script serves the primary purpose of reporting the time needed to process
one image. Central coordinates for the image are hand-coded within the script,
and were chosen to overlap four tiles at a radius of 100m. The timing message
indicates that processing these four tiles generally takes only a few seconds.
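For illustration only, a wrapper along these lines could produce such a timing
message; `get_image` is the entry point named in the exercise prompt, and is
passed in here rather than assumed to exist:

```python
import time
from typing import Callable

def time_one_image(get_image: Callable, lat: float, lon: float,
                   radius: float) -> float:
    """Time a single request and print a message like the script's."""
    t0 = time.perf_counter()
    get_image(lat, lon, radius)
    elapsed = time.perf_counter() - t0
    print(f"Processed one image in {elapsed:.2f} s")
    return elapsed
```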
### Description of procedure

Input coordinates and a buffer distance are converted both to a bounding range
in EPSG:25832 coordinates, and to a list of the corresponding image files
needed to enclose these coordinates. The `aggregate_one_file` function then
uses this information to read only the required geographic portion of each
file, and only in aggregated form. The speed of the entire script is entirely
due to the routines applied in this function.
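A sketch of this first step, under stated assumptions: `pyproj` (not
necessarily what the script itself uses) converts WGS84 input coordinates to
EPSG:25832, and a hypothetical 1 km tile grid stands in for the data set's
actual file-naming scheme:

```python
from pyproj import Transformer

def bounds_25832(lat: float, lon: float, buffer_m: float):
    """Return (left, bottom, right, top) in EPSG:25832 around a point."""
    to_utm = Transformer.from_crs("EPSG:4326", "EPSG:25832", always_xy=True)
    x, y = to_utm.transform(lon, lat)
    return (x - buffer_m, y - buffer_m, x + buffer_m, y + buffer_m)

def tiles_for_bounds(bounds, tile_size: float = 1000.0):
    """Yield (ix, iy) indices of the hypothetical tiles enclosing bounds."""
    left, bottom, right, top = bounds
    for ix in range(int(left // tile_size), int(right // tile_size) + 1):
        for iy in range(int(bottom // tile_size), int(top // tile_size) + 1):
            yield ix, iy  # e.g. mapped to a name like f"tile_{ix}_{iy}.jp2"
```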
The `generate_merged_files` function generates an initially merged image from
the input latitude, longitude, and buffer distance parameters. This merged
version will always have a somewhat higher resolution than the final required
output, and so the function also applies a further aggregation or downsampling
routine to achieve the desired output resolution of 256x256.
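As an illustrative sketch of that final step (the repository's own routine may
differ), assuming the per-tile windows have already been merged into one array
slightly larger than the target (e.g. via `rasterio.merge.merge`), the array
can be average-pooled down to 256x256 with plain NumPy:

```python
import numpy as np

def downsample_to(arr: np.ndarray, size: int = 256) -> np.ndarray:
    """Average-pool a (bands, H, W) array down to (bands, size, size).

    Assumes H and W are integer multiples of `size` after trimming;
    remainders are simply discarded here for brevity.
    """
    bands, h, w = arr.shape
    fh, fw = h // size, w // size
    trimmed = arr[:, : fh * size, : fw * size]
    return trimmed.reshape(bands, size, fh, size, fw).mean(axis=(2, 4))
```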
### Error handling

Because of personal time restrictions, the current code contains no
error-handling routines.

### Benchmarking
Benchmarking was performed, although it is not included in the current code.
The primary parameter influencing computation times is `buffer_distance`, with
which times should generally scale quadratically (`N^2`), while all other
parameters should scale at most linearly. Benchmarking demonstrated that
responses to this parameter lie well below quadratic, which is encouraging.
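A harness along the following lines (not the one used for the reported
results) can check that claim by timing a user-supplied `get_image` across a
range of buffer distances:

```python
import time
from typing import Callable, Dict, Iterable

def benchmark_buffers(get_image: Callable, lat: float, lon: float,
                      buffers: Iterable[float] = (50, 100, 200, 400)
                      ) -> Dict[float, float]:
    """Map each buffer distance to the wall-clock time of one call."""
    times = {}
    for b in buffers:
        t0 = time.perf_counter()
        get_image(lat, lon, b)
        times[b] = time.perf_counter() - t0
    return times
```

Under pure quadratic scaling, doubling the buffer would quadruple the time;
ratios well below four between successive doublings would support the
sub-quadratic observation.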
### Optimizations

**Downscale input images**
Potential optimizations that could be implemented depend very much on typical
envisioned application of the code, and in particular on how common
approximately repeated calls are likely to be. The entire performance
bottleneck is in the reading and aggregation of the image files, within
`aggregate_one_file`. Since the aim of this exercise was to generate 256x256
pixel output images from far higher resolution inputs, the single biggest
optimization step which could be implemented would be to initially
aggregate/downsample all input images to the smallest anticipated input buffer
size ("radius"). A buffer size of 100m translates to a reduction to under half
of the current height and width of these images. Such reductions would also
quadratically increase computational efficiency.
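One way this could be realised, sketched here with rasterio's overview
support (whether overviews can be written in place depends on the file format
and GDAL build, so this is an assumption rather than the proposed mechanism):

```python
import rasterio
from rasterio.enums import Resampling

def add_overviews(path: str, factors=(2, 4, 8, 16)) -> None:
    """Build average-resampled overview levels inside a raster file, so
    that later decimated reads are served from pre-aggregated data."""
    with rasterio.open(path, "r+") as dst:
        dst.build_overviews(list(factors), Resampling.average)
```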
**Use local cache**

If further optimization were required, a second stage could involve coarsening
the reads in the initial `aggregate_one_file` function into discrete, coarser
chunks which could then be locally cached. An additional function could then be
written to align a precise set of `(lat, lon, buffer_size)` parameters to a
coarse file chunk, read that file from cache if it exists, and then trim to the
precisely required region. Details would again depend on anticipated usage
patterns, but chunking each image into, say, 10-by-10 smaller sections would
generate 100 possible sub-images from each input image, resulting in a maximum
of 6,400 locally-cached files. This cache would occupy a tiny fraction of the
size of the original data, and would likely enormously speed up the whole
routine.
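A sketch of the alignment step under these assumptions, with a hypothetical
10-by-10 chunking of 1 km tiles; all names and sizes are illustrative:

```python
from pathlib import Path

TILE_SIZE = 1000.0       # hypothetical tile edge length in metres
CHUNK = TILE_SIZE / 10   # 10-by-10 chunks per tile

def chunk_cache_path(x: float, y: float, cache_dir: str = "cache") -> Path:
    """Cache file for the coarse chunk containing (x, y) in EPSG:25832."""
    cx, cy = int(x // CHUNK), int(y // CHUNK)
    return Path(cache_dir) / f"chunk_{cx}_{cy}.tif"
```

A lookup would then read this file if it exists, generate and cache it
otherwise, and finally trim to the precisely requested region.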
**Parallelise**

The main `generate_merged_files` function contains an initial `for` loop which
occupies most of the processing time. This loop ultimately calls rasterio's
`read` method, with parameters to trim and aggregate the results. These reads
can also be issued in parallel, which would further reduce computation times.
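As a sketch of this (threads rather than processes are assumed here, since
rasterio releases the GIL during raster reads), the per-file
read-and-aggregate step, represented by the `read_one` callable, could be
mapped across a small pool:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

def read_all(read_one: Callable, paths: Iterable[str],
             workers: int = 4) -> List:
    """Apply the per-file read step to every path concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_one, paths))
```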