https://github.com/cedadev/kerchunk-tools
Tools to work with kerchunk
- Host: GitHub
- URL: https://github.com/cedadev/kerchunk-tools
- Owner: cedadev
- License: bsd-3-clause
- Created: 2022-11-09T13:31:50.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-08-30T14:58:01.000Z (almost 3 years ago)
- Last Synced: 2025-02-24T03:27:56.506Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 705 KB
- Stars: 3
- Watchers: 5
- Forks: 0
- Open Issues: 15
Metadata Files:
- Readme: README.md
- Changelog: HISTORY.rst
- Contributing: CONTRIBUTING.rst
- License: LICENSE
- Authors: AUTHORS.rst
# kerchunk-tools
## Overview
This is a set of tools for working with the "kerchunk" library:
https://fsspec.github.io/kerchunk/
Kerchunk provides cloud-friendly indexing of data files without needing to move
the data itself.
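To make the idea concrete, the sketch below mimics the shape of a version 1 Kerchunk reference set: a JSON-style mapping from Zarr chunk keys to either inline metadata or a `[url, offset, length]` triple that points into the original file. The bucket, file names, variable name and byte ranges are invented for illustration, and `locate_chunk` is a hypothetical helper. Real reference files are generated by Kerchunk rather than written by hand.

```python
import json

# Hypothetical reference set: all names, offsets and lengths are invented.
references = {
    "version": 1,
    "refs": {
        # Small metadata objects are stored inline as JSON strings.
        ".zgroup": json.dumps({"zarr_format": 2}),
        # Each data chunk is a pointer: [url, byte offset, byte length].
        "chlor_a/0.0.0": ["s3://example-bucket/example-199709.nc", 4096, 1048576],
        "chlor_a/1.0.0": ["s3://example-bucket/example-199710.nc", 4096, 1048576],
    },
}


def locate_chunk(refs, key):
    """Return (url, offset, length) for a chunk key stored as a byte-range pointer."""
    entry = refs["refs"][key]
    if isinstance(entry, str):
        raise ValueError(f"{key!r} is stored inline, not as a byte range")
    url, offset, length = entry
    return url, offset, length


url, offset, length = locate_chunk(references, "chlor_a/1.0.0")
# A reader would now fetch only this byte range from the original file.
byte_range_header = f"bytes={offset}-{offset + length - 1}"
print(url, byte_range_header)
```

Because every chunk is just a pointer, the NetCDF files stay where they are and only the small index needs to be created and shared.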
The tools included here allow:
- indexing of existing NetCDF files to kerchunk files
- aggregation of existing NetCDF files to a single kerchunk file
- writing to either POSIX file systems or S3-compatible object stores
- a wrapper around `xarray` to ensure that the data can be read by Python
- integration with access control to limit read/write operations as desired
An example notebook can be run using binder:
https://mybinder.org/v2/gh/cedadev/kerchunk-tools.git/main?filepath=notebooks
## Installation
### Method 1: Install with miniconda
Starting from scratch, you can install with conda (run from a clone of this repository):
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -p ~/miniconda -b
source ~/miniconda/bin/activate
conda create --name kerchunk-tools --file spec-file.txt
conda activate kerchunk-tools
pip install -e . --no-deps
```
### Method 2: Install with Pip
Assuming you have Python 3 installed, you can also install with Pip:
```bash
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e . --no-deps
```
NOTE: this installation method generated a lot of HDF5 library warnings
when reading data, which were not seen with the Conda install.
## Basic usage
Here is an example of using `kerchunk_tools` with authentication to the
S3 service:
```python
import kerchunk_tools.xarray_wrapper as wrap_xr
s3_config = {
"token": "TOKEN",
"secret": "SECRET",
"endpoint_url": "ENDPOINT_URL"
}
# Load a Kerchunk file
index_uri = "s3://kc-indexes-cci-cloud-v2/BICEP-OC-L3S-PP-MERGED-1M_MONTHLY_9km_mapped-1998-2020-fv4.2.zstd"
ds = wrap_xr.wrap_xr_open(index_uri, s3_config=s3_config)
# Look at the metadata
print(ds)
pp = ds.pp
print(pp.shape, pp.dims)
# Look at the data
mx = ds.pp.sel(time=slice("1998-03-01", "2000-02-01"), lat=slice(34, 40), lon=slice(20, 23)).max()
mx = float(mx)
print(mx)
assert 2137 < mx < 2139
```
## Testing
If you are connecting to a secured endpoint, then you will need three items for your S3 configuration:
- `S3_TOKEN`
- `S3_SECRET`
- `S3_ENDPOINT_URL`
Then you can run a full workflow that:
- creates a bucket in S3
- uploads some NetCDF files to S3
- creates a kerchunk file in S3 (for a single NetCDF file)
- creates a kerchunk file in S3 (for an aggregation of multiple NetCDF files)
- reads from the kerchunk files and extracts/processes a subset of data
```bash
S3_TOKEN=s3_token S3_SECRET=s3_secret S3_ENDPOINT_URL=s3_endpoint pytest tests/test_workflows/test_workflow_s3_quobyte_single.py -v
```
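The three environment variables can be collected into the `s3_config` dictionary used in the basic usage example. The helper below is a minimal sketch: `s3_config_from_env` is a hypothetical name, not part of `kerchunk_tools`, and the dictionary keys are copied from that example.

```python
import os

# Mapping from environment variable name to the s3_config key used in the
# basic usage example. The helper itself is hypothetical.
ENV_TO_KEY = {
    "S3_TOKEN": "token",
    "S3_SECRET": "secret",
    "S3_ENDPOINT_URL": "endpoint_url",
}


def s3_config_from_env(environ=None):
    """Build an s3_config dict, failing early if any setting is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in ENV_TO_KEY if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing S3 settings: {', '.join(missing)}")
    return {key: environ[name] for name, key in ENV_TO_KEY.items()}


# Example with placeholder values rather than real credentials.
example_env = {
    "S3_TOKEN": "s3_token",
    "S3_SECRET": "s3_secret",
    "S3_ENDPOINT_URL": "s3_endpoint",
}
s3_config = s3_config_from_env(example_env)
print(s3_config)
```

Calling `s3_config_from_env()` with no argument reads from the real environment, which keeps credentials out of notebooks and scripts.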
## Performance testing
Our initial tests, which have only been run a few times, produced the timings below (in seconds). Where multiple values appear, each value comes from a separate run.
| Test type | Read/process small subset | Read/process larger subset |
|---------------------|---------------------------|----------------------------|
| POSIX Kerchunk | 1.0, 0.7 | 15.2, 37.9 |
| S3-Quobyte Kerchunk | 1.1, 4.7, 1.3 | 8.5, 9.1, 5.7 |
| S3-DataCore Zarr | 3.9, 3.8 | 99.8, 99.2 |
| POSIX Xarray | 0.6, 0.9 | 86.0, 91.4 |
We need to run these repeatedly to validate them.
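As a rough way to compare the rows, the timings can be summarised per test type. The sketch below copies the values from the table and reports the mean, minimum and maximum. With only two or three runs per cell these figures are indicative at best.

```python
from statistics import mean

# Timings in seconds, copied from the table above.
timings = {
    "POSIX Kerchunk": {"small": [1.0, 0.7], "large": [15.2, 37.9]},
    "S3-Quobyte Kerchunk": {"small": [1.1, 4.7, 1.3], "large": [8.5, 9.1, 5.7]},
    "S3-DataCore Zarr": {"small": [3.9, 3.8], "large": [99.8, 99.2]},
    "POSIX Xarray": {"small": [0.6, 0.9], "large": [86.0, 91.4]},
}

# For each test type and subset size, keep (mean, min, max).
summary = {
    test: {size: (mean(vals), min(vals), max(vals)) for size, vals in sizes.items()}
    for test, sizes in timings.items()
}

for test, sizes in summary.items():
    for size, (avg, low, high) in sizes.items():
        print(f"{test:20s} {size:5s} mean={avg:6.2f} min={low:5.1f} max={high:5.1f} s")
```

The spread for POSIX Kerchunk on the larger subset (15.2 s to 37.9 s) illustrates why more runs are needed before drawing conclusions.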
### Test types
The test types are:
1. POSIX Kerchunk:
   - This uses a Kerchunk index file on the POSIX file system
   - It references NetCDF files on the POSIX file system
   - There is no use of object-store
   - This test depends on having pre-generated the Kerchunk index file
2. S3-Quobyte Kerchunk:
   - This uses a Kerchunk index file in the JASMIN S3-Quobyte object-store
   - It references NetCDF files in the S3-Quobyte object-store
   - The files are actually part of the CEDA Archive and are exposed via an S3 interface
   - There is no use of the POSIX file systems
   - This test depends on having pre-generated the Kerchunk index file
3. S3-DataCore Zarr:
   - This reads a Zarr file that we have copied into the JASMIN DataCore (formerly Caringo) object-store
   - The data is the same content as used for the other tests, converted from NetCDF to Zarr
   - There is no use of Kerchunk
   - This test depends on having pre-generated the Zarr file from NetCDF
4. POSIX Xarray:
   - This reads all the NetCDF files directly into Xarray (as a list of files)
   - The files are read directly from the POSIX file system
   - There is no pre-generation step for this test
   - This is slower because the aggregation of the NetCDF content is done on-the-fly
### Test data
The test data is a set of 279 data files from the CCI archive, under the directory:
```
/neodc/esacci/ocean_colour/data/v5.0-release/geographic/netcdf/chlor_a/monthly/v5.0/
```
The first and last files are:
```
First: .../1997/ESACCI-OC-L3S-CHLOR_A-MERGED-1M_MONTHLY_4km_GEO_PML_OCx-199709-fv5.0.nc
Last: .../2020/ESACCI-OC-L3S-CHLOR_A-MERGED-1M_MONTHLY_4km_GEO_PML_OCx-202011-fv5.0.nc
```
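The file names follow a regular monthly pattern, so the list can be reconstructed in code. The sketch below assumes one file per calendar month with no gaps, each stored in a subdirectory named after its year. Under that assumption, September 1997 to November 2020 spans 279 months, which is consistent with the file count given above. Paths are relative to the directory shown earlier, and `monthly_paths` is a hypothetical helper.

```python
from datetime import date

# File name pattern taken from the first and last files listed above.
FILE_TEMPLATE = (
    "{year}/ESACCI-OC-L3S-CHLOR_A-MERGED-1M_MONTHLY_4km_GEO_PML_OCx-"
    "{year}{month:02d}-fv5.0.nc"
)


def monthly_paths(first, last):
    """Return relative paths for every calendar month from first to last inclusive."""
    paths = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        paths.append(FILE_TEMPLATE.format(year=year, month=month))
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return paths


paths = monthly_paths(date(1997, 9, 1), date(2020, 11, 1))
print(len(paths))
print(paths[0])
print(paths[-1])
```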
### Test details
In all cases the test is run as follows.
Test 1 - Read/process small subset:
1. Load the data as an `xarray.Dataset` object.
2. Create a small time/lat/lon slice of shape: `(2, 144, 72)` (only 2 time steps == 2 files)
3. Calculate the maximum value and assert it equals the expected value.
Test 2 - Read/process larger subset:
1. Load the data as an `xarray.Dataset` object.
2. Create a larger time/lat/lon slice of shape: `(279, 12, 24)` (279 time steps == 279 files)
3. Calculate the maximum value and assert it equals the expected value.
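Since the timings need to be validated with repeated runs, a small harness along the following lines could wrap either test. This is a sketch under assumptions, not the code used for the results above: `time_runs` and `stand_in_workload` are hypothetical names, and the workload uses plain Python lists in place of the real load, slice and maximum steps on an `xarray.Dataset`.

```python
import time


def time_runs(workload, expected, repeats=3):
    """Run workload several times, check its result, and return the durations."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = workload()
        durations.append(time.perf_counter() - start)
        # Mirrors step 3: the computed maximum must match the expected value.
        assert result == expected, f"got {result}, expected {expected}"
    return durations


def stand_in_workload():
    """Placeholder for load + slice + max; uses plain Python data, not xarray."""
    values = [[t * 10 + i for i in range(100)] for t in range(2)]
    return max(max(row) for row in values)


durations = time_runs(stand_in_workload, expected=109, repeats=3)
print([f"{d:.6f}" for d in durations])
```

Keeping the timing outside the workload means the same harness can be reused for all four test types by swapping in a different callable.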
## Background reading and resources
These resources may be useful for understanding why we wanted to look at Kerchunk and how it fits into our bigger picture plans at CEDA:
- JASMIN Notebook service intro: https://www.youtube.com/watch?v=nle9teGLAb0&list=PLhF74YhqhjqmZgbQLu_PXZmA27q7vHygg
- JASMIN Notebooks workshop tutorial: https://www.youtube.com/watch?v=7UWjhIKq2x0&list=PLhF74Yhqhjqn8NDgU7xfKGLGP8h-FQ1lt&index=16
- Notebook that I showed (demonstrating intake access to CMIP6): https://github.com/cedadev/cmip6-object-store/blob/master/notebooks/cmip6-zarr-jasmin.ipynb
- Intake library documentation: https://intake.readthedocs.io/en/latest/?badge=latest
- Intake ESM (for Earth System Model data) docs: https://intake-esm.readthedocs.io/en/stable/
- Kerchunk docs: https://fsspec.github.io/kerchunk/
- Useful intro talk on Kerchunk (when it was called ReferenceFileSystem, I think): https://www.youtube.com/watch?v=AWJzDk6M6NM&t=628s