https://github.com/openclimatefix/forecast-data-prep

Handles the processing of Numerical Weather Prediction (NWP), satellite, and PV data into the correct format for machine learning tasks. It also handles efficient data transfer to Google Cloud Storage and disk preparation for use in cloud environments.

Host: GitHub
URL: https://github.com/openclimatefix/forecast-data-prep
Owner: openclimatefix
Created: 2024-09-17T11:56:16.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2025-10-14T13:25:17.000Z (10 months ago)
Last Synced: 2026-01-14T12:26:21.987Z (6 months ago)
Language: Jupyter Notebook
Size: 60.5 KB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Updating NWP + Sat + PV Data on GCS

[![All Contributors](https://img.shields.io/badge/all_contributors-2-orange.svg?style=flat-square)](#contributors-)

![ease of contribution: medium](https://img.shields.io/badge/ease%20of%20contribution:%20medium-f4900c)
[![issues badge](https://img.shields.io/github/issues/openclimatefix/forecast-data-prep?color=FFAC5F)](https://github.com/openclimatefix/forecast-data-prep/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc)

Scripts to process and upload ML training data to Google Cloud Storage for use on VMs.

## Numerical Weather Prediction (NWP) Data

Multiple processing scripts exist for NWPs due to variations in variables, coverage, and forecast horizons. Each script specifies its primary use case at the top.

NWP Processing Steps:

1. Download individual forecast init time files
2. Convert to unzipped Zarr format with combined variables
3. Merge into yearly Zarrs with proper sorting, typing and chunking
4. Validate data through visualization and testing
5. Upload yearly Zarrs to Google Storage
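The grouping and sorting behind step 3 can be sketched with the standard library alone. This is a minimal illustration, not the code used by the scripts in this repo: the function name `group_init_times_by_year` and the file naming scheme (`20230101T0000.zarr`) are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from pathlib import Path


def group_init_times_by_year(paths, time_format="%Y%m%dT%H%M"):
    """Group per-init-time Zarr paths by year, sorted by init time.

    Assumes (hypothetically) that each store is named after its init time,
    e.g. "20230101T0000.zarr". Adjust time_format to the real naming scheme.
    """
    by_year = defaultdict(list)
    for path in paths:
        init_time = datetime.strptime(Path(path).stem, time_format)
        by_year[init_time.year].append((init_time, str(path)))
    # Sort each year's stores chronologically so the merged Zarr is ordered.
    return {
        year: [p for _, p in sorted(items)]
        for year, items in sorted(by_year.items())
    }
```

Each list in the returned dictionary could then be opened and concatenated along the init time dimension before being written out as one yearly Zarr.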

### Issues and Important Considerations for NWP Processing

- Issues can arise if data is still being downloaded into the directory you are merging individual init times from. The fix is to manually remove the files that show missing data.
- Some yearly NWP files can be very large (~1TB). Test the number of threads and workers and the memory limit of the Dask client carefully before a full run. Other tasks running on the machine can affect performance, especially if they also use a lot of RAM.
- You can also track progress by watching the Zarr store grow with `du -h` in the appropriate location.
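If you would rather check the size from Python, for example inside a notebook, the sketch below does roughly what `du` does. The helper names `directory_size_bytes` and `format_size` are made up for this example and are not part of the repo.

```python
import os


def directory_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # Chunks can be renamed or replaced mid-write; skip them.
                continue
    return total


def format_size(num_bytes):
    """Format a byte count in binary units."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"
```

Note that this sums apparent file sizes, whereas `du` reports disk usage by default, so the two numbers can differ slightly.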

## Satellite Data

Satellite data processing is handled by `sat_proc.py`, which downloads satellite imagery data from Google Public Storage and processes it for ML training.

## PV Data

The `gsp_pv_proc.py` script downloads the National PV generation data from Sheffield Solar PV Live, which is used as the target for OCF's national solar power forecast.

## Moving files to GCS and onto a disk

To upload local files to Google Cloud Storage (GCS), you can use the `gsutil` command-line tool. This can be done via:

```bash
gsutil -m cp -r my/folder/path/ gs://your-bucket-name/
```

For potentially faster uploads, you can try the `upload_to_gcs.py` script, which uses multiprocessing to speed things up. However, the bottleneck is sometimes the speed of the internet connection or the disk, so it may not be faster.
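The general idea of uploading many files concurrently can be sketched as follows. This is not the contents of `upload_to_gcs.py`: it swaps in a thread pool rather than multiprocessing, and `upload_tree` and the `upload_file` callable are hypothetical names, with the actual GCS call left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def upload_tree(local_root, upload_file, max_workers=8):
    """Upload every file under local_root concurrently.

    upload_file(local_path, object_name) is a hypothetical callable supplied
    by the caller, e.g. a thin wrapper around a GCS client library.
    Returns the sorted list of object names that were uploaded.
    """
    root = Path(local_root)
    files = [p for p in root.rglob("*") if p.is_file()]

    def _upload(path):
        # Object names keep the folder layout, with "/" as the separator.
        object_name = path.relative_to(root).as_posix()
        upload_file(path, object_name)
        return object_name

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # sorted() drains the iterator, so any upload error is raised here.
        return sorted(pool.map(_upload, files))
```

Threads suit this kind of work because uploads spend most of their time waiting on the network. Raising `max_workers` only helps until the connection or the disk becomes the limit.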

Once your data is in the GCS bucket, you can transfer it to a disk on your VM. First, SSH into your VM and make sure the target disk is attached with write permissions. Then mount the disk with read and write privileges using the following command:
`sudo mount -o discard,defaults,rw /dev/ABC /mnt/disks/DISK_NAME`

(You will need to know the device name, which you can find with `lsblk`, and replace `ABC` with it.)

If updating an existing disk, please note that anyone who has the disk mounted will need to unmount it before the disk's read/write access can be changed. If cloning an existing disk and adding to it, please see the notes below at "Cloning disks on GCP".

To copy data from your GCS bucket to the mounted disk, use the following command:

```bash
gsutil -m cp -r gs://YOUR_BUCKET_NAME/YOUR_FILE_PATH* /mnt/disks/DISK_NAME/folder
```

The `*` wildcard matches every object whose path begins with `YOUR_FILE_PATH`, so all of those files are copied.

### Issues during upload

If a transfer fails part-way through and some files have already been copied, use `rsync` instead to copy the remaining files across. For example:

```bash
gsutil -m rsync -r gs://solar-pv-nowcasting-data/NWP/UK_Met_Office/UKV_extended/UKV_2023.zarr/ /mnt/disks/gcp_data/nwp/ukv/ukv_ext/UKV_2023.zarr/
```

`rsync` synchronizes files by copying only the differences between source and destination. It can be slow because it needs to scan and compare all files first, then transfer the data. For large datasets like NWP files (~1TB), both the scanning and transfer phases take considerable time due to the volume of data involved.
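The "copy only the differences" idea can be illustrated with a small sketch. It is a simplification under stated assumptions, not how `gsutil rsync` is implemented: the function name `files_to_copy` is made up, listings are plain dictionaries, and files are compared by size only, whereas the real command's comparison is more involved.

```python
def files_to_copy(source, destination):
    """Return source entries that are missing or differ in size at the destination.

    Both arguments map a relative path to a size in bytes. Comparing sizes
    only is a simplification for illustration.
    """
    return sorted(
        path
        for path, size in source.items()
        if destination.get(path) != size
    )
```

For a yearly Zarr the listings contain a very large number of chunk files, which is why building and comparing them takes a noticeable amount of time before any data moves.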

## Cloning disks on GCP

After cloning a disk on GCP, mount it via the GCP UI. Then check that the disk is not corrupted and that the transfer was successful with `sudo e2fsck -f /dev/DISK_NAME`; you can find the disk name with the `lsblk` command. A cloned disk keeps the same UUID as the original, which can cause problems when disks are auto-mounted on machine reboot. You can check a disk's UUID by running `sudo blkid` in the terminal.

To solve this, change the UUID with `sudo tune2fs /dev/DISK_NAME -U random`. Once that completes, run another check on the disk with `sudo e2fsck -fD /dev/DISK_NAME`. The `-f` option forces a check even if the filesystem seems clean, and the `-D` option optimises directories in the filesystem.

You can now check that the UUID has been updated by running `sudo blkid`.
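With several disks attached, spotting a repeated UUID by eye is error-prone. The sketch below scans text in the usual `blkid` line format (`/dev/sdb: UUID="..." TYPE="ext4"`) for UUIDs shared by more than one device. The function name `duplicate_uuids` is made up for this example, and the line format is assumed rather than taken from this repo.

```python
import re
from collections import defaultdict

# Matches the device before the colon and the filesystem UUID field.
# The \b keeps PARTUUID="..." from being mistaken for UUID="...".
_BLKID_LINE = re.compile(r'^(?P<device>[^:]+):.*?\bUUID="(?P<uuid>[^"]+)"')


def duplicate_uuids(blkid_output):
    """Map each filesystem UUID shared by more than one device to its devices."""
    devices_by_uuid = defaultdict(list)
    for line in blkid_output.splitlines():
        match = _BLKID_LINE.match(line.strip())
        if match:
            devices_by_uuid[match["uuid"]].append(match["device"])
    return {u: d for u, d in devices_by_uuid.items() if len(d) > 1}
```

Pass it the text printed by `sudo blkid`. An empty dictionary means no two devices share a UUID.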

Now the `fstab` can be set to mount the cloned disk on reboot - follow the [Google Cloud instructions](https://cloud.google.com/compute/docs/disks/format-mount-disk-linux#configure_automatic_mounting_on_vm_restart) for help on this.

## Contributing and community

- PRs are welcome! See the [Organisation Profile](https://github.com/openclimatefix) for details on contributing
- Find out about our other projects in the [OCF Meta Repo](https://github.com/openclimatefix/ocf-meta-repo)
- Check out the [OCF blog](https://openclimatefix.org/blog) for updates
- Follow OCF on [LinkedIn](https://uk.linkedin.com/company/open-climate-fix)

---

*Part of the [Open Climate Fix](https://github.com/orgs/openclimatefix/people) community.*

[![OCF Logo](https://cdn.prod.website-files.com/62d92550f6774db58d441cca/6324a2038936ecda71599a8b_OCF_Logo_black_trans.png)](https://openclimatefix.org)

## Contributors ✨

Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):

- THARAK HEGDE 📖
- Megawattz 💻