Projects in Awesome Lists tagged with checkpointing
A curated list of projects in awesome lists tagged with checkpointing.
https://github.com/kakaobrain/torchgpipe
A GPipe implementation in PyTorch
checkpointing deep-learning gpipe model-parallelism parallelism pipeline-parallelism pytorch
Last synced: 15 May 2025
https://github.com/argonne-lcf/dlio_benchmark
An I/O benchmark for deep learning applications
artificial-intelligence checkpointing data-management deep-learning llm pytorch storage tensorflow
Last synced: 04 Apr 2025
https://github.com/cedana/cedana-cli
Cedana: Access and run on compute anywhere in the world, on any provider. Migrate seamlessly between providers, arbitraging price/performance in realtime to maximize pure runtime.
ai checkpointing cpu docker gpu linux
Last synced: 16 Jan 2026
https://github.com/jorgensd/adios4dolfinx
Extending DOLFINx with checkpointing functionality
adios2 checkpointing fenicsx mpi-applications
Last synced: 09 Oct 2025
https://github.com/dorukkarinca/keras-buoy
Keras wrapper that autosaves what ModelCheckpoint cannot.
autosave checkpointing colab colab-automation colab-notebook colaboratory data-science keras machine-learning
Last synced: 04 Oct 2025
https://github.com/sayiir/sayiir
Sayiir: Lightweight embeddable durable workflow engine in Rust with Python bindings. Checkpoint-based recovery, no deterministic replay. Alternative to Temporal, Restate, Airflow.
ai-agents async-rust checkpointing distributed distributed-systems durable-execution durable-workflows embeddable orchestration pyo3 python rust temporal-alternative workflow workflow-automation workflow-engine
Last synced: 05 Mar 2026
https://github.com/f-dangel/wandb_preempt
Code and tutorial on integrating wandb sweeps with Slurm pre-emption
checkpointing preemption pytorch slurm sweep wandb
Last synced: 13 Mar 2026
https://github.com/rubrikinc/sysfail
A shared library to help test your code with failure-injection
checkpointing failure-injection failure-injection-testing idempotency progress resilience-testing resumability stateful-app
Last synced: 26 Jun 2025
https://github.com/kadubon/bottleneck-audit-toolkit
Offline, fail-closed verifier for JSONL telemetry event logs. Emits deterministic audit certificates + human summaries with explicit claims/non-claims for bottleneck and integrity review.
ai audit bottleneck checkpointing distributed-training event-logs jsonl mlops offline-verification performance-monitoring silent-data-corruption tail-latency telemetry
Last synced: 13 Jan 2026
https://github.com/grebtsew/albumorganizer
A face recognition manager for digital albums that isolates images of a specified person.
checkpointing computer-vision docker docker-compose face-recognition multiprocess photoalbum photoalbummanager sentient-slideshow
Last synced: 06 Apr 2025
https://github.com/kamangir/blue-objects-2024-09-05-a
🌀 data objects for Bash (attempt one).
Last synced: 11 Jan 2026