# Wandb Offline Sync Hook


A convenient way to trigger synchronizations to wandb if your compute nodes don't have internet!

[![Documentation Status](https://readthedocs.org/projects/wandb-offline-sync-hook/badge/?version=latest)](https://wandb-offline-sync-hook.readthedocs.io/en/latest/?badge=latest)
[![PyPI version](https://badge.fury.io/py/wandb-osh.svg)](https://pypi.org/project/wandb-osh)
[![Python 3.8‒3.11](https://img.shields.io/badge/python-3.8%E2%80%923.11-blue)](https://www.python.org)
[![PR welcome](https://img.shields.io/badge/PR-Welcome-%23FF8300.svg)](https://git-scm.com/book/en/v2/GitHub-Contributing-to-a-Project)
[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/klieret/wandb-offline-sync-hook/main.svg)](https://results.pre-commit.ci/latest/github/klieret/wandb-offline-sync-hook/main)
[![.github/workflows/test.yaml](https://github.com/klieret/wandb-offline-sync-hook/actions/workflows/test.yaml/badge.svg)](https://github.com/klieret/wandb-offline-sync-hook/actions/workflows/test.yaml)
[![link checker](https://github.com/klieret/wandb-offline-sync-hook/actions/workflows/check-links.yaml/badge.svg)](https://github.com/klieret/wandb-offline-sync-hook/actions)
[![codecov](https://codecov.io/github/klieret/wandb-offline-sync-hook/branch/main/graph/badge.svg?token=6MQZ4LODE5)](https://app.codecov.io/github/klieret/wandb-offline-sync-hook)
[![gitmoji](https://img.shields.io/badge/gitmoji-%20😜%20😍-FFDD67.svg)](https://gitmoji.dev)
[![Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)

## 🤔 What is this?

- ✅ You use [`wandb`/Weights & Biases](https://wandb.ai/) to record your machine learning trials?
- ✅ Your ML experiments run on compute nodes without internet access (for example, using a batch system)?
- ✅ Your compute nodes and your head/login node (with internet) have access to a shared file system?

Then this package might be useful. For alternatives, see [below](https://github.com/klieret/wandb-offline-sync-hook#what-alternatives-are-there).

![](https://user-images.githubusercontent.com/13602468/200086359-507b8653-e999-4cb3-ac93-ba1d175d2016.png)

### What you might have been doing so far

You have probably been using `export WANDB_MODE="offline"` on the compute nodes and then running something like

```bash
cd /.../result_dir/
for d in $(ls -t -d */); do cd $d; wandb sync --sync-all; cd ..; done
```

from your head node (with internet access) every now and then.
However, this is obviously not very satisfying, as the dashboard doesn't update live.
Sure, you could wrap this in a `while true` loop, but if you have many trials in your directory, each pass will take a long time, [cause unnecessary network traffic](https://github.com/wandb/wandb/issues/2887), and it's just not very elegant.

### How does `wandb-osh` solve the problem?

1. You add a hook that is called every time an epoch concludes (that is, whenever a sync should be triggered).
2. You start the `wandb-osh` script on your head node (with internet access). This script then triggers `wandb sync` upon request from one of the compute nodes.

### How is this implemented?

Very simple: Every time an epoch concludes, the hook gets called and creates a file in the _communication directory_ (`~/.wandb_osh_communication` by default).
The `wandb-osh` script that is running on the head node (with internet) reads these files and performs the synchronization.
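
In pseudocode, the two sides boil down to something like the following. This is a simplified sketch for illustration only, not the actual `wandb-osh` source; in particular, the file naming scheme and the polling interval are assumptions:

```python
# Illustration of the mechanism only -- not the actual wandb-osh implementation.
import subprocess
import time
from pathlib import Path

comm_dir = Path("~/.wandb_osh_communication").expanduser()


def trigger_sync(run_dir: str) -> None:
    """Compute-node side: request that the head node syncs run_dir."""
    comm_dir.mkdir(parents=True, exist_ok=True)
    # One command file per run; its content is the directory to sync.
    (comm_dir / f"{abs(hash(run_dir))}.command").write_text(run_dir)


def watch() -> None:
    """Head-node side (what the wandb-osh script does): poll and sync."""
    while True:
        for command_file in comm_dir.glob("*.command"):
            run_dir = command_file.read_text().strip()
            subprocess.run(["wandb", "sync", run_dir], check=False)
            command_file.unlink()  # request handled
        time.sleep(10)
```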

### What alternatives are there?

With [ray tune][ray-tune], you can use your ray head node as the place to synchronize from (rather than deploying it via the batch system as well, as the [current docs][ray-tune-slurm-docs] suggest). See the note below or my [demo repository][ray-tune-slurm-test].
Similar strategies might be possible for `wandb` as well (let me know!).

## 📦 Installation

```bash
pip3 install wandb-osh
```

For completeness, the optional extras `lightning` and `ray` are provided, but they only ensure that the corresponding package is installed.
For example,

```bash
pip3 install 'wandb-osh[lightning]'
```

also installs PyTorch Lightning if it is not already present, but has no other effect.

For development, make sure to also include the `testing` extra:

```bash
pip3 install --editable '.[testing]'
```

## 🔥 Running it!

Two steps: Set up the hook, then run the script from your head node.

### Step 1: Setting up the hook

#### With pure wandb

Let's adapt the [simple PyTorch example](https://docs.wandb.ai/guides/integrations/pytorch) from the wandb docs (it only takes 3 lines!):

```python
import wandb
from wandb_osh.hooks import TriggerWandbSyncHook  # <-- New!

trigger_sync = TriggerWandbSyncHook()  # <-- New!

wandb.init(config=args, mode="offline")

model = ...  # set up your model

# Magic
wandb.watch(model, log_freq=100)

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    if batch_idx % args.log_interval == 0:
        wandb.log({"loss": loss})
        trigger_sync()  # <-- New!
```

#### With PyTorch Lightning

Simply add the `TriggerWandbSyncLightningCallback` to your list of callbacks and you're good to go!

```python
from wandb_osh.lightning_hooks import TriggerWandbSyncLightningCallback  # <-- New!
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer

logger = WandbLogger(
    project="project",
    group="group",
    offline=True,
)

model = MyLightningModule()
trainer = Trainer(
    logger=logger,
    callbacks=[TriggerWandbSyncLightningCallback()],  # <-- New!
)
trainer.fit(model, train_dataloader, val_dataloader)
```

#### With ray tune

> **Note**
> With ray tune, you might not need this package! While the approach suggested in the
> [ray tune SLURM docs][ray-tune-slurm-docs] deploys the ray head on a worker node as well (so it doesn't
> have internet), this actually isn't needed. Instead, you can run the ray head and the
> tuning script on the head node and only submit batch jobs for your workers.
> In this way, `wandb` will be called from the head node and internet access is no
> problem there.
> For more information on this approach, take a look at my [demo repository][ray-tune-slurm-test].

You probably already use the `WandbLoggerCallback` callback. We simply add a second callback for `wandb-osh` (it only takes two new lines!):

```python
import os

from ray import tune
from ray.air import RunConfig  # import paths as of ray 2.x
from ray.air.integrations.wandb import WandbLoggerCallback
from wandb_osh.ray_hooks import TriggerWandbSyncRayHook  # <-- New!

os.environ["WANDB_MODE"] = "offline"

callbacks = [
    WandbLoggerCallback(...),  # <-- ray tune documentation tells you about this
    TriggerWandbSyncRayHook(),  # <-- New!
]

tuner = tune.Tuner(
    trainable,
    tune_config=...,
    run_config=RunConfig(
        ...,
        callbacks=callbacks,
    ),
)
```

#### With anything else

Simply take the `TriggerWandbSyncHook` class and use it as a callback in your training
loop (as in the `wandb` example above), passing the directory that `wandb` is syncing
to as an argument.
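
For example, a minimal sketch of a generic training loop, where the loop body and `run_dir` are placeholders for your own code:

```python
from wandb_osh.hooks import TriggerWandbSyncHook

trigger_sync = TriggerWandbSyncHook()

# Placeholder: the directory your offline wandb run writes to,
# e.g. something like wandb/offline-run-<timestamp>-<id>.
run_dir = "path/to/your/run_dir"

for epoch in range(10):
    ...  # one epoch of training with offline logging
    trigger_sync(run_dir)  # <-- ask the head node to sync this run
```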

### Step 2: Running the script on the head node

After installation, you should have a `wandb-osh` script in your `$PATH`. Simply call it like this:

```bash
wandb-osh
```

The output will look something like this:

```
INFO: Starting to watch /home/kl5675/.wandb_osh_command_dir
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/b1f60706 ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/92a3ef1b ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/a2caa9c0 ... done.
```

Take a look at `wandb-osh --help` or check [the documentation](https://wandb-offline-sync-hook.readthedocs.io/en/latest/cli.html) for all command line options.
You can pass additional options to the `wandb sync` call by placing them after `--`. For example:

```bash
wandb-osh -- --sync-all
```

## ❓ Q & A

> I get the warning "wandb: NOTE: use wandb sync --sync-all to sync 1 unsynced runs from local directory."

You can start `wandb-osh` with `wandb-osh -- --sync-all` to always synchronize
all available runs.

> How can I suppress logging messages (e.g., warnings about the syncing not being fast enough)?

```python
import wandb_osh

# for wandb_osh.__version__ >= 1.2.0
wandb_osh.set_log_level("ERROR")
```

## 🧰 Development setup

```bash
pip3 install pre-commit
pre-commit install
```

## 💖 Contributing

Your help is greatly appreciated! Suggestions, bug reports, and feature requests are best opened as [GitHub issues][github-issues]. You are also very welcome to submit a [pull request][pulls]!

Bug reports and pull requests are credited with the help of the [allcontributors bot](https://allcontributors.org/).



- Barthelemy Meynard-Piganeau 🐛
- MoH-assan 🐛
- Cedric Leonard 💻 🐛

[github-issues]: https://github.com/klieret/wandb-offline-sync-hook/issues
[pulls]: https://github.com/klieret/wandb-offline-sync-hook/pulls
[ray-tune-slurm-docs]: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
[ray-tune-slurm-test]: https://github.com/klieret/ray-tune-slurm-test/
[ray-tune]: https://docs.ray.io/en/latest/tune/index.html