Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jwodder/zarr-checksum-gallery
Various implementations of Dandi Zarr checksumming
https://github.com/jwodder/zarr-checksum-gallery
benchmarking dandi-zarr-checksum implementation-comparison rust zarr
Last synced: 20 days ago
JSON representation
Various implementations of Dandi Zarr checksumming
- Host: GitHub
- URL: https://github.com/jwodder/zarr-checksum-gallery
- Owner: jwodder
- License: mit
- Created: 2022-08-08T22:23:08.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-08-05T04:31:06.000Z (4 months ago)
- Last Synced: 2024-08-05T05:39:30.572Z (4 months ago)
- Topics: benchmarking, dandi-zarr-checksum, implementation-comparison, rust, zarr
- Language: Rust
- Homepage:
- Size: 426 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 8
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
[![Project Status: Concept – Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.](https://www.repostatus.org/badges/latest/concept.svg)](https://www.repostatus.org/#concept)
[![CI Status](https://github.com/jwodder/zarr-checksum-gallery/actions/workflows/test.yml/badge.svg)](https://github.com/jwodder/zarr-checksum-gallery/actions/workflows/test.yml)
[![codecov.io](https://codecov.io/gh/jwodder/zarr-checksum-gallery/branch/master/graph/badge.svg)](https://codecov.io/gh/jwodder/zarr-checksum-gallery)
[![MIT License](https://img.shields.io/github/license/jwodder/zarr-checksum-gallery.svg)](https://opensource.org/licenses/MIT)This is a Rust library & binary featuring a collection of various different
ways to implement a Merkle tree hash for a directory tree in the [format][1]
used by [the DANDI project](https://github.com/dandi) for Zarr assets. It was
written partly in search of the most efficient implementation but mostly as
just an exercise in Rust.[1]: https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md#zarr-entry-checksum-format
Installation
============Regardless of which installation method you choose, you need to first [install
Rust and Cargo](https://www.rust-lang.org/tools/install).To install the `zarr-checksum-gallery` binary in `~/.cargo/bin`, run:
cargo install --git https://github.com/jwodder/zarr-checksum-gallery
Alternatively, a binary localized to a clone of this repository can be built
with:git clone https://github.com/jwodder/zarr-checksum-gallery
cd zarr-checksum-gallery
cargo build # or `cargo build --release` to enable optimizations
# You can now run the binary with `cargo run -- ` while in this
# repository.Usage
=====zarr-checksum-gallery [] []
or, if running a localized binary:
cargo run [--release] -- [] []
`zarr-checksum-gallery` computes the Zarr checksum for the directory at
`` using the given `` (See list below). Regardless of
the implementation chosen, the checksum should always be the same for the same
directory contents & layout; if it is not, it is a bug.Global Options
--------------- `--debug` — Show DEBUG log messages listing the checksum for each file &
directory as it's computed.- `-E`/`--exclude-dotfiles` — Exclude the dotfiles & dot-directories `.dandi`,
`.datalad`, `.git`, `.gitattributes`, and `.gitmodules` from checksumming- `--trace` — Show TRACE log messages in addition to DEBUG messages. Not all
implementations emit TRACE logs.Implementations
---------------- `breadth-first` — Walk the directory tree iteratively & breadth-first,
building a tree of file checksums in memory- `collapsio-arc` — Walk the directory tree using multiple threads, computing
the checksum for each directory as soon as possible, with intermediate
results reported using shared memory**Options:**
- `-t `/`--threads ` — Set the number of threads to use. The
default value is the number of logical CPU cores on the machine.- `collapsio-mpsc` — Walk the directory tree using multiple threads, computing
the checksum for each directory as soon as possible, with intermediate
results reported over synchronized channels**Options:**
- `-t `/`--threads ` — Set the number of threads to use. The
default value is the number of logical CPU cores on the machine.- `depth-first` — Walk the directory tree iteratively & depth-first, computing
the checksum for each directory as soon as possible- `fastasync` — Walk the directory tree using multiple asynchronous worker
tasks, building a tree of file checksums in memory**Options:**
- `-t `/`--threads ` — Set the number of threads for the async
runtime to use. A value of 1 means to run all tasks in the main thread.
The default value is the number of logical CPU cores on the machine.- `-w `/`--workers ` — Set the number of worker tasks to use.
The default value is the number of logical CPU cores on the machine.- `fastio` — Walk the directory tree using multiple threads, building a tree of
file checksums in memory**Options:**
- `-t `/`--threads ` — Set the number of threads to use. The
default value is the number of logical CPU cores on the machine.- `recursive` — Walk the directory tree recursively and depth-first, computing
the checksum for each directory as soon as possible- `tree` — Like `fastio`, but instead of displaying only the final checksum,
shows a textual tree of the files & directories within the directory tree and
their corresponding checksums**Options:**
- `-t `/`--threads ` — Set the number of threads to use. The
default value is the number of logical CPU cores on the machine.Comparative Performance
=======================Typical final output from a run of `time-all.sh` on a 1.59 GiB directory of
7084 files:collapsio-arc ran
1.06 ± 0.06 times faster than fastio
1.31 ± 0.14 times faster than collapsio-mpsc
6.18 ± 0.07 times faster than depth-first
6.28 ± 0.10 times faster than breadth-first
6.35 ± 0.22 times faster than recursive
6.41 ± 0.24 times faster than fastasyncNote that the collapsio implementations should have some of the smallest memory
footprints, but this has not yet been tested.