{"id":18085074,"url":"https://github.com/jwodder/zarr-digest-timings","last_synced_at":"2025-03-29T00:33:22.434Z","repository":{"id":54844524,"uuid":"460489318","full_name":"jwodder/zarr-digest-timings","owner":"jwodder","description":"Timings for various Dandi Zarr checksum implementations","archived":true,"fork":false,"pushed_at":"2023-04-19T13:09:43.000Z","size":135,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-21T00:21:07.176Z","etag":null,"topics":["benchmarking","dandi-zarr-checksum","implementation-comparison","python","zarr"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jwodder.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-02-17T15:17:09.000Z","updated_at":"2024-05-12T21:49:50.000Z","dependencies_parsed_at":"2022-08-14T04:31:21.166Z","dependency_job_id":"8dfdcbc7-2e58-406f-9ae4-50dca7d88fbe","html_url":"https://github.com/jwodder/zarr-digest-timings","commit_stats":{"total_commits":69,"total_committers":1,"mean_commits":69.0,"dds":0.0,"last_synced_commit":"28f0610a92a318db706950487e7b92212ef8d5f5"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwodder%2Fzarr-digest-timings","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwodder%2Fzarr-digest-timings/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwodder%2Fzarr-digest-timings/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwodder%2Fzarr-digest-timings/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jwodder","download_url":"https://codeload.github.com/jwodder/zarr-digest-timings/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246122251,"owners_count":20726822,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmarking","dandi-zarr-checksum","implementation-comparison","python","zarr"],"created_at":"2024-10-31T15:09:17.655Z","updated_at":"2025-03-29T00:33:22.143Z","avatar_url":"https://github.com/jwodder.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"This repository contains a script, ``zarr-digest-timings.py``, for repeatedly\nrunning various implementations of a Zarr checksum calculation routine with\ndifferent types of caching and displaying the average runtime.  The script is\nrun via nox_, which manages installation of the proper varying dependencies.\n\n.. _nox: https://nox.thea.codes\n\nPython 3.7 or higher is required.\n\nUsage\n=====\n\n::\n\n    nox -e \u003cenv\u003e -- [\u003coptions\u003e] \u003cdirpath\u003e \u003cimplementation\u003e\n\nRun a given checksumming function on the given directory a number of times and\nprint out the average runtime.  If caching is in effect and\n``--no-clear-cache`` is not given, an initial function call (populating the\ncache) will be timed \u0026 reported separately.\n\nArguments\n---------\n\n``\u003cenv\u003e``\n    The nox environment in which to run the script; can be ``nothreads``, which\n    uses the non-threaded fscacher 0.1.6; ``threads``, which uses the threaded\n    implementation on the `gh-66 branch`_; or ``xor_bytes``, which uses the\n    more efficient directory fingerprinting introduced in v0.2.0.  (Note that\n    no version of fscacher will have any effect by default unless the ``-c`` or\n    ``-C`` option is passed to the script.)\n\n    .. _gh-66 branch: https://github.com/con/fscacher/pull/67\n\n``\u003cdirpath\u003e``\n    The path to a directory tree to calculate the Zarr checksum of\n\n``\u003cimplementation\u003e``\n    The checksumming function to use:\n\n    ``sync``\n        Walks the directory tree synchronously and breadth-first, digesting\n        files, and constructs an in-memory tree for calculating the Zarr digest\n\n    ``fastio``\n        Like ``sync``, but walks the directory tree using `a multithreaded\n        walk`__\n\n        __ https://gist.github.com/jart/0a71cde3ca7261f77080a3625a21672b\n\n    ``oothreads``\n        Like ``fastio``, but rewritten to be more object-oriented\n\n    ``trio``\n        Like ``sync``, but walks the directory asynchronously using trio_.  The\n        number of workers is controlled by the ``--threads`` option.  This\n        implementation is not affected by ``--cache-files``.\n\n        .. _trio: https://github.com/python-trio/trio\n\n    ``trio3``\n        A variant of ``trio`` that runs the MD5 digestion function for each\n        file in a thread.  This implementation *is* affected by\n        ``--cache-files``.\n\n    ``recursive``\n        Walks \u0026 digests the directory tree depth-first using recursion\n\nOptions\n-------\n\n-c, --cache                     Use fscacher to cache the Zarr directory\n                                checksumming routine\n\n-C, --cache-files               Use fscacher to cache digests for individual\n                                files\n\n--clear-cache, --no-clear-cache\n                                Whether to clear the cache on program startup\n                                [default: ``--clear-cache``]\n\n-n INT, --number INT            Set the number of times to call the function\n                                (not counting the initial cache-populating\n                                call, if any).  As a special case, passing 0\n                                will cause the script to simply call the\n                                function once and print out the checksum\n                                without any timing.  [default: 100]\n\n-R FILE, --report FILE          Append a report of the run, containing the\n                                average time and the various input parameters,\n                                as a line of JSON to the given file\n\n-T INT, --threads INT           Set the number of threads to use when walking a\n                                directory tree.  This affects both the\n                                ``fastio`` implementation and the threaded\n                                fscacher implementation.  The default value is\n                                the number of CPU cores plus 4, to a maximum of\n                                32.\n\n-v, --verbose                   Log the result of each function call with a\n                                timestamp as it finishes.  Specify this option\n                                up to two additional times for more debug\n                                logging.\n\n\n``mktree.py``\n=============\n\n::\n\n    python3 mktree.py \u003cdirpath\u003e \u003cspecfile\u003e\n\nThe ``mktree.py`` script can be used to generate a sample directory tree for\nrunning ``zarr-digest-timings.py`` on.  The directory is generated according to\na *layout specification*, which is a JSON file whose contents take one of the\nfollowing forms:\n\n- A list ``lst`` of ``n+1`` integers, possibly with a file object (see below)\n  appended — The tree will consist of ``lst[0]`` directories, each of which\n  contains ``lst[1]`` sub-directories, each of which contains ``lst[2]``\n  sub-subdirectories, and so on, with the directories at level ``n-1``\n  consisting of ``lst[n]`` files.  If a file object is supplied, the files will\n  be generated according to its specification; otherwise, they will be empty.\n\n- An object mapping path names to layout sub-specifications, file objects, or\n  ``null`` — For each key that maps to a layout sub-specification, a\n  subdirectory will be created in the directory with that name and layout.  For\n  each key that maps to a file object or ``null``, a file will be created in\n  the directory with that name and according to that specification (an empty\n  file for ``null``\\s).\n\nA *file object* is an object specifying the size of a file to create; it can\ntake the following forms:\n\n- If the object contains a ``\"size\": INT`` field, the file will be that size.\n\n- Otherwise, the object must contain a ``\"maxsize\": INT`` field and an optional\n  ``\"minsize\": INT`` field (default value: 0).  The file will be created with a\n  random size within the given range, inclusive.\n\nAll files are created with random bytes as data.\n\nSome sample layout specifications can be found in the ``layouts/`` directory.\n\n\n``time-all.sh``\n===============\n\n::\n\n    bash time-all.sh [\u003coptions\u003e] \u003cdirpath\u003e\n\nThe bash script ``time-all.sh`` runs ``zarr-digest-timings.py`` with all\nnon-redundant configurations against a given directory tree for a given number\nof threads, and it generates a JSON Lines report.\n\nOptions\n-------\n\n-n INT                      Set the number of times to run the checksumming\n                            function for each configuration [default: 100]\n\n-R FILE                     Save the report to the given file [default:\n                            ``time-all.json``]\n\n-T INT                      Set the number of threads to use when walking a\n                            directory tree.  See above for the default.\n\n-v                          Increase the verbosity of\n                            ``zarr-digest-timings.py``; can be specified\n                            multiple times\n\n\n``report2table``\n================\n\n::\n\n    nox -e report2table -- [\u003coptions\u003e] \u003creportfile\u003e\n\nThe ``report2table.py`` script takes a JSON Lines report generated via the\n``--report`` option of ``zarr-digest-timings.py`` and renders it as a\nreStructuredText or GitHub-Flavored Markdown document containing a series of\ntables.  It should be run via nox in order to manage its dependencies.\n\nAll of the entries in the report should have been generated on the same\nmachine.  Entries generated on different paths or using different\nimplementations will be grouped into distinct tables.  If two or more entries\nwere produced by the same configuration, their times will be combined.\n\nFor configurations that make use of caching, the corresponding cell in the\nresulting tables will consist of two times separated by a slash; the first time\nis the runtime of the initial cache-populating call, while the second time is\nthe average of the other calls.\n\nOptions\n-------\n\n-f \u003crst|md\u003e, --format \u003crst|md\u003e  Specify whether to produce a reStructuredText\n                                (``rst``) or Markdown (``md``) document\n                                [default: ``rst``]\n\n-o FILE, --outfile FILE         Output to the specified file\n\n-t TEXT, --title TEXT           Set a title for the document\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjwodder%2Fzarr-digest-timings","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjwodder%2Fzarr-digest-timings","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjwodder%2Fzarr-digest-timings/lists"}