{"id":51222363,"url":"https://github.com/tempoxyz/schelk","last_synced_at":"2026-06-28T08:03:51.081Z","repository":{"id":355822908,"uuid":"1136952136","full_name":"tempoxyz/schelk","owner":"tempoxyz","description":"Fast filesystem snapshot and rollback tool for benchmarking","archived":false,"fork":false,"pushed_at":"2026-05-05T11:10:56.000Z","size":169,"stargazers_count":45,"open_issues_count":1,"forks_count":3,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-05T13:20:00.430Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tempoxyz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-01-18T16:50:08.000Z","updated_at":"2026-05-01T04:04:46.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tempoxyz/schelk","commit_stats":null,"previous_names":["tempoxyz/schelk"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/tempoxyz/schelk","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tempoxyz%2Fschelk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tempoxyz%2Fschelk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tempoxyz%2Fschelk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tempoxyz%2Fschelk/manifests","owner_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tempoxyz","download_url":"https://codeload.github.com/tempoxyz/schelk/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tempoxyz%2Fschelk/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34881390,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-28T02:00:05.809Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-28T08:03:50.189Z","updated_at":"2026-06-28T08:03:51.075Z","avatar_url":"https://github.com/tempoxyz.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# schelk\n\nschelk restores a block device to a known baseline quickly. It is designed for benchmarking\nsystems with large on-disk state - databases, or blockchain execution clients like\n[reth](https://github.com/paradigmxyz/reth). For such systems, rebuilding the baseline between\nruns is slow, and snapshot layers distort the measurements we want to take.\n\n\u003e [!TIP]\n\u003e If you want Codex, Claude, or another coding agent to set up or operate schelk for you, start\n\u003e it with [`docs/SKILL.md`](docs/SKILL.md) or point it directly at\n\u003e \u003chttps://github.com/tempoxyz/schelk/blob/master/docs/SKILL.md\u003e. 
That file tells the agent how to\n\u003e install schelk, validate prerequisites, initialize volumes, and run the workflow safely.\n\n## Why\n\nA good benchmarking loop has two requirements:\n\n1. **Fast rollback.** Each run mutates the state on disk, so the baseline must be restored\n   before the next run. If the rollback takes hours, the iteration loop is dead.\n2. **Faithful measurement.** The numbers must reflect the workload, not the rollback\n   machinery. Overhead matters, but variance matters more. If a benchmark varies by 10%\n   between runs, an improvement of 5% is not visible.\n\nThe two requirements are in tension. Any mechanism that makes rollback fast has to remember\nsomething about the pre-run state, and every such mechanism leaves a trace in the read or\nwrite path. The common approaches trade one requirement against the other:\n\n- **Full copy of the volume.** The benchmark runs against a plain filesystem, with no\n  tracking in the hot path - faithful. But on a multi-TB dataset one copy takes hours, so the\n  iteration loop is impractical.\n- **Copy-on-write filesystems** (ZFS, btrfs). Rollback is fast. But every write passes\n  through the CoW layer, and successive runs fragment the dataset differently. Both the\n  overhead and the layout drift enter the numbers.\n- **LVM thin with the overlay on a separate disk.** Rollback is fast. But reads come from\n  one disk and writes go to another. This is not the IO topology of production, so we are\n  benchmarking a different system.\n\nschelk tries to satisfy both. The observation is simple: a typical benchmark writes only a\nsmall fraction of the volume. If we know *exactly which blocks* were written, we can restore\nthe scratch volume by copying only those blocks from a pristine **virgin** volume. Rollback\ntakes seconds in most cases, rather than hours. 
During the benchmark itself, the workload\nruns against a plain ext4 filesystem on a real NVMe device, with no overlay and no write\nredirection.\n\n## How it works\n\nschelk operates on two equal-size block devices: a **virgin** volume that holds the pristine\nbaseline, and a **scratch** volume that is mounted and used by the benchmark. At `init` time,\nschelk makes scratch byte-identical to virgin. This is done either by creating a fresh ext4 on\nboth volumes, or by copying an existing virgin over. After initialization, both volumes belong\nto schelk and should not be touched directly.\n\nWhen `mount` is run, schelk places a `dm-era` device-mapper target on top of scratch. dm-era\nrecords every written block into metadata that lives on the ramdisk. The benchmark runs\nagainst the mounted filesystem as normal. dm-era does not redirect reads or writes; it only\nrecords which blocks were written.\n\nWhen `recover` is run, schelk unmounts the filesystem, asks dm-era for the list of blocks\nthat were written since the last baseline, and copies exactly those blocks from virgin back to\nscratch. Recovery time is proportional to the number of written blocks, not to the size of the\nvolume, so it does not matter how long the benchmark ran.\n\nA separate `promote` operation does the reverse: it copies the written blocks from scratch\nonto virgin, so that the current state becomes the new baseline. This is useful after a schema\nmigration, or after a snapshot load that should persist across future runs.\n\n## Pre-requisites\n\n### Hardware\n\n- Two block devices of equal size, one for **virgin** and one for **scratch**. Each must be\n  large enough to hold the dataset.[^1]\n- A ramdisk for dm-era metadata. The exact size depends on the internals of dm-era rather\n  than on the workload, so a precise formula is hard to give. 
As a rule of thumb, 4 GiB is\n  sufficient for a 1.7 TiB drive at 4 KiB granularity.\n\n[^1]: 🦄 Future feature: lift the equal-size restriction.\n\n### Software\n\n- A reasonably modern Linux kernel with device-mapper and the `dm-era` target.\n- A reasonably recent Rust toolchain.\n- `mkfs.ext4` from e2fsprogs (required for `init-new`). Usually pre-installed; otherwise\n  `apt install e2fsprogs`.\n- `era_invalidate` from\n  [thin-provisioning-tools](https://github.com/device-mapper-utils/thin-provisioning-tools).\n  The distribution package works, but versions older than 1.0 are very slow. For serious use,\n  build from source.[^2]\n- `dmsetup` (shipped with most distributions).\n\n[^2]: The following command tends to work:\n  ```git clone https://github.com/jthornber/thin-provisioning-tools /tmp/tpt \u0026\u0026 cargo build --release --manifest-path /tmp/tpt/Cargo.toml \u0026\u0026 sudo cp /tmp/tpt/target/release/pdata_tools /usr/local/bin/ \u0026\u0026 sudo ln -sf /usr/local/bin/pdata_tools /usr/local/bin/era_invalidate```\n\n## Usage\n\n\u003e [!WARNING]\n\u003e schelk requires sudo and will overwrite the volumes given to it.\n\n### Install\n\nThere are no binary releases yet. Clone the repository and install from source:\n\n```\ncargo install --path .\n```\n\n### Set up a ramdisk\n\n```\n# 4 GiB ramdisk (rd_size is in KiB, so 4 GiB = 4*1024*1024 = 4194304 KiB)\nsudo modprobe brd rd_size=4194304\n```\n\n### Initialize\n\nThere are two initialization paths:\n\n**`init-new`** - create fresh ext4 filesystems on both volumes from scratch. All existing data\non both volumes is lost.\n\n```\nsudo schelk init-new \\\n    --virgin /dev/nvme1n1 \\\n    --scratch /dev/nvme2n1 \\\n    --ramdisk /dev/ram0 \\\n    --mount-point /schelk\n```\n\n**`init-from`** - adopt an existing, pre-populated virgin volume, for example one that already\nhas a database snapshot loaded. 
The scratch volume is overwritten with a copy of virgin.\n\n```\nsudo schelk init-from \\\n    --virgin /dev/nvme1n1 \\\n    --scratch /dev/nvme2n1 \\\n    --ramdisk /dev/ram0 \\\n    --mount-point /schelk \\\n    --fstype ext4\n```\n\nIf both volumes are already prepared identically, `--no-copy` skips the full copy:\n\n```\nsudo schelk init-from ... --no-copy\n```\n\n### Run a benchmark\n\n```\nsudo schelk mount       # mount scratch with dm-era tracking\n./bench.sh              # run the benchmark\nsudo schelk recover     # restore scratch to virgin\nsudo schelk restore     # restore scratch, then mount it for the next run\n```\n\n### Promote scratch to a new baseline\n\nUse this after a one-time state change that should be kept across future runs, such as a\nschema migration or a snapshot load:\n\n```\nsudo schelk promote\n```\n\n### Other commands\n\n- `schelk full-recover` - copy the entire virgin volume to scratch. Used when the incremental\n  recovery path is no longer valid, for example after a host reboot.\n- `schelk status` - report the current state (initialized, mounted, and so on).\n\nNote that both volumes must not be used outside of schelk. Mounting them directly will\ninvalidate the incremental recovery path and force a full copy.\n\n## When not to use schelk\n\nschelk is not a silver bullet. It is brittle and has rough edges, and its hardware cost is\nnot trivial: two block devices large enough to hold the dataset, plus enough DRAM to back a\nramdisk. 
For many workloads, a CoW filesystem like ZFS or btrfs is a better fit — the\noverhead is real, but easier to accept than the cost and operational effort of schelk.\n\nPrefer a different approach when:\n\n- Measurements can tolerate some overhead or distortion introduced by the rollback mechanism.\n- The workload writes most of the dataset, so incremental recovery is not faster than a full\n  copy.\n- The hardware budget does not cover two dedicated volumes and enough DRAM for the ramdisk.\n\n## Limitations\n\n- **NVMe internal state is not restored.** Overwriting logical blocks does not reset FTL\n  mappings, wear levelling, on-controller caches, or garbage collection state. Some\n  run-to-run variance will always remain. Standard mitigations - drive pre-conditioning,\n  long warmups, steady-state measurement windows - still apply; schelk does not replace them.\n- **Ramdisk metadata does not survive a reboot.** If the host reboots or loses power while a\n  dm-era device is active, incremental recovery is no longer possible. In that case, run\n  `full-recover`.\n- **Volumes are dedicated.** For the duration of a schelk session, both volumes must not be\n  used by anything else. Mounting them or writing to them outside schelk invalidates the\n  incremental recovery path.\n- **Volumes must be of equal size.** This restriction may be lifted in the future.\n\n## FAQ\n\n- **Why not LVM snapshots, ZFS, or btrfs?** They add variance in the hot path of the\n  benchmark. See [Why](#why).\n- **Why not LVM thin with a read-only base and a writable overlay?** The same reason, and\n  additionally the split read/write IO topology does not reflect production.\n- **Why not a userspace filesystem via libfuse?** libfuse is single-threaded, which is a\n  bottleneck for parallel benchmark workloads. The io_uring support in libfuse may eventually\n  lift this, but at the time of writing it was still immature. 
A libfuse-based solution would\n  also sit on top of a real filesystem, so restoring the baseline would mean writing back\n  through that filesystem. The state of the underlying filesystem would drift between runs -\n  the same problem as with CoW filesystems.\n- **Why `dm-era` specifically?** It is the lightest tracking layer in mainline Linux: it does\n  not move data, cache anything, or redirect IO. Its only job is to mark blocks with an\n  \"era\" number when they are written. A bitmap-based system would be much more efficient.\n- **Why a ramdisk for metadata?** Two reasons. First, keeping metadata writes off the drive\n  under test avoids contention with the benchmark. Second, the metadata is cheap to recreate,\n  and a reboot invalidates the incremental recovery path regardless.\n- **What is a typical recovery time?** Recovery time is proportional to the number of bytes\n  written during the run, not to the volume size. A benchmark that writes a few GiB on a\n  multi-TB volume typically recovers in seconds.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftempoxyz%2Fschelk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftempoxyz%2Fschelk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftempoxyz%2Fschelk/lists"}