https://github.com/klieret/wandb-offline-sync-hook
A convenient way to trigger synchronizations to wandb / Weights & Biases if your compute nodes don't have internet!
- Host: GitHub
- URL: https://github.com/klieret/wandb-offline-sync-hook
- Owner: klieret
- License: MIT
- Created: 2022-11-03T20:32:28.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-05-05T21:09:49.000Z (about 1 month ago)
- Last Synced: 2025-05-05T22:25:57.511Z (about 1 month ago)
- Topics: hyperparameter-optimization, hyperparameter-tuning, machine-learning, ray, ray-tune, wandb, weights-and-biases
- Language: Python
- Homepage: https://wandb-offline-sync-hook.rtfd.io/
- Size: 242 KB
- Stars: 78
- Watchers: 1
- Forks: 7
- Open Issues: 12
- Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
# Wandb Offline Sync Hook
A convenient way to trigger synchronizations to wandb if your compute nodes don't have internet!
## 🤔 What is this?
- ✅ You use [`wandb`/Weights & Biases](https://wandb.ai/) to record your machine learning trials?
- ✅ Your ML experiments run on compute nodes without internet access (for example, using a batch system)?
- ✅ Your compute nodes and your head/login node (with internet) have access to a shared file system?

Then this package might be useful. For alternatives, see [below](https://github.com/klieret/wandb-offline-sync-hook#what-alternatives-are-there).

### What you might have been doing so far
You have probably been using `export WANDB_MODE="offline"` on the compute nodes and then running something like
```bash
cd /.../result_dir/
for d in $(ls -t -d */); do cd $d; wandb sync --sync-all; cd ..; done
```

from your head node (with internet access) every now and then.
However, this is not very satisfying, as it doesn't update live.
Sure, you could throw this in a `while true` loop, but if you have a lot of trials in your directory, this will take forever, [cause unnecessary network traffic](https://github.com/wandb/wandb/issues/2887), and it's just not very elegant.

### How does `wandb-osh` solve the problem?
1. You add a hook that is called every time an epoch concludes (that is, when we want to trigger a sync).
2. You start the `wandb-osh` script on your head node with internet access. This script will now trigger `wandb sync` upon request from one of the compute nodes.

### How is this implemented?
Very simple: Every time an epoch concludes, the hook gets called and creates a file in the _communication directory_ (`~/.wandb_osh_communication` by default).
The `wandb-osh` script that is running on the head node (with internet) reads these files and performs the synchronization.
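For illustration, here is a minimal sketch of this file-based handshake. This is *not* the actual `wandb-osh` source; the command-file naming and polling interval are made up:

```python
from pathlib import Path
import subprocess
import time

COMM_DIR = Path("~/.wandb_osh_communication").expanduser()


def trigger_sync(run_dir: str) -> None:
    """Compute-node side: drop a command file naming the run directory."""
    COMM_DIR.mkdir(parents=True, exist_ok=True)
    # The file name just needs to be unique per run; the content says
    # which directory should be synced.
    (COMM_DIR / f"{abs(hash(run_dir))}.command").write_text(run_dir)


def watch_and_sync(poll_interval: float = 10.0) -> None:
    """Head-node side: poll for command files and call `wandb sync`."""
    while True:
        for cmd_file in COMM_DIR.glob("*.command"):
            run_dir = cmd_file.read_text().strip()
            subprocess.run(["wandb", "sync", run_dir])
            cmd_file.unlink()  # request handled; wait for the next trigger
        time.sleep(poll_interval)
```

Because only a tiny command file is written per trigger, this works on any shared file system without extra network traffic from the compute nodes.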
### What alternatives are there?

With [ray tune][ray-tune], you can use your ray head node as the place to synchronize from (rather than deploying it via the batch system as well, as the [current docs][ray-tune-slurm-docs] suggest). See the note below or my [demo repository][ray-tune-slurm-test].
Similar strategies might be possible for `wandb` as well (let me know!).

## 📦 Installation
```bash
pip3 install wandb-osh
```

For completeness, the extras `lightning` and `ray` are provided, but they only ensure that the corresponding package is installed.
For example,

```bash
pip3 install 'wandb-osh[lightning]'
```

also installs pytorch lightning if it is not already present, but has no other effect.
For development, make sure also to include the `testing` extra requirement.
```bash
pip3 install --editable '.[testing]'
```

## 🔥 Running it!
Two steps: Set up the hook, then run the script from your head node.
### Step 1: Setting up the hook
#### With pure wandb
Let's adapt the [simple pytorch example](https://docs.wandb.ai/guides/integrations/pytorch) from the wandb docs (it only takes 3 lines!):
```python
import wandb
from wandb_osh.hooks import TriggerWandbSyncHook  # <-- New!

trigger_sync = TriggerWandbSyncHook()  # <-- New!

wandb.init(config=args, mode="offline")
model = ...  # set up your model
# Magic
wandb.watch(model, log_freq=100)

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    if batch_idx % args.log_interval == 0:
        wandb.log({"loss": loss})
        trigger_sync()  # <-- New!
```

#### With pytorch lightning
Simply add the `TriggerWandbSyncLightningCallback` to your list of callbacks and you're good to go!
```python
from wandb_osh.lightning_hooks import TriggerWandbSyncLightningCallback  # <-- New!
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer

logger = WandbLogger(
    project="project",
    group="group",
    offline=True,
)

model = MyLightningModule()
trainer = Trainer(
    logger=logger,
    callbacks=[TriggerWandbSyncLightningCallback()],  # <-- New!
)
trainer.fit(model, train_dataloader, val_dataloader)
```

#### With ray tune
> **Note**
> With ray tune, you might not need this package! While the approach suggested in the
> [ray tune SLURM docs][ray-tune-slurm-docs] deploys the ray head on a worker node as well (so it doesn't
> have internet), this actually isn't needed. Instead, you can run the ray head and the
> tuning script on the head node and only submit batch jobs for your workers.
> In this way, `wandb` will be called from the head node and internet access is no
> problem there.
> For more information on this approach, take a look at my [demo repository][ray-tune-slurm-test].

You probably already use the `WandbLoggerCallback` callback. We simply add a second callback for `wandb-osh` (it only takes two new lines!):
```python
import os

from ray import tune
from ray.air import RunConfig  # import location may differ between ray versions
from ray.air.integrations.wandb import WandbLoggerCallback
from wandb_osh.ray_hooks import TriggerWandbSyncRayHook  # <-- New!

os.environ["WANDB_MODE"] = "offline"

callbacks = [
    WandbLoggerCallback(...),  # <-- ray tune documentation tells you about this
    TriggerWandbSyncRayHook(),  # <-- New!
]

tuner = tune.Tuner(
    trainable,
    tune_config=...,
    run_config=RunConfig(
        ...,
        callbacks=callbacks,
    ),
)
```

#### With anything else
Simply take the `TriggerWandbSyncHook` class and use it as a callback in your training
loop (as in the `wandb` example above), passing the directory that `wandb` is syncing
to as an argument.
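For example, a minimal sketch with a plain training loop (`n_epochs` and `train_one_epoch` are placeholders for your own code; here we pass `wandb.run.dir` as the sync directory, though depending on your setup you may need the parent run directory instead):

```python
import wandb
from wandb_osh.hooks import TriggerWandbSyncHook

trigger_sync = TriggerWandbSyncHook()  # communication dir can be set here

wandb.init(mode="offline")
for epoch in range(n_epochs):    # n_epochs: placeholder
    train_one_epoch()            # placeholder for your training step
    wandb.log({"epoch": epoch})
    trigger_sync(wandb.run.dir)  # pass the directory wandb is syncing to
```

### Step 2: Running the script on the head node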
After installation, you should have a `wandb-osh` script in your `$PATH`. Simply call it like this:
```bash
wandb-osh
```

The output will look something like this:
```
INFO: Starting to watch /home/kl5675/.wandb_osh_command_dir
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/b1f60706 ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/92a3ef1b ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/a2caa9c0 ... done.
```

Take a look at `wandb-osh --help` or check [the documentation](https://wandb-offline-sync-hook.readthedocs.io/en/latest/cli.html) for all command line options.
You can add options to the `wandb sync` call by placing them after `--`. For example,

```bash
wandb-osh -- --sync-all
```

## ❓ Q & A
> I get the warning "wandb: NOTE: use wandb sync --sync-all to sync 1 unsynced runs from local directory."
You can start `wandb-osh` with `wandb-osh -- --sync-all` to always synchronize
all available runs.

> How can I suppress logging messages (e.g., warnings about the syncing not being fast enough)?
```python
import wandb_osh

# for wandb_osh.__version__ >= 1.2.0
wandb_osh.set_log_level("ERROR")
```

## 🧰 Development setup
```bash
pip3 install pre-commit
pre-commit install
```

## 💖 Contributing
Your help is greatly appreciated! Suggestions, bug reports, and feature requests are best opened as [GitHub issues][github-issues]. You are also very welcome to submit a [pull request][pulls]!
Bug reports and pull requests are credited with the help of the [allcontributors bot](https://allcontributors.org/).
- Barthelemy Meynard-Piganeau 🐛
- MoH-assan 🐛
- Cedric Leonard 💻 🐛
[github-issues]: https://github.com/klieret/wandb-offline-sync-hook/issues
[pulls]: https://github.com/klieret/wandb-offline-sync-hook/pulls
[ray-tune-slurm-docs]: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
[ray-tune-slurm-test]: https://github.com/klieret/ray-tune-slurm-test/
[ray-tune]: https://docs.ray.io/en/latest/tune/index.html