{"id":13424917,"url":"https://github.com/kevin-hanselman/dud","last_synced_at":"2025-12-29T23:29:11.937Z","repository":{"id":37608662,"uuid":"243167144","full_name":"kevin-hanselman/dud","owner":"kevin-hanselman","description":"A lightweight CLI tool for versioning data alongside source code and building data pipelines.","archived":false,"fork":false,"pushed_at":"2025-01-15T04:12:22.000Z","size":3586,"stargazers_count":203,"open_issues_count":20,"forks_count":8,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-03-05T21:46:54.692Z","etag":null,"topics":["data-engineering","data-pipelines","data-science","dataset","dvcs","machine-learning","mlops"],"latest_commit_sha":null,"homepage":"https://kevin-hanselman.github.io/dud/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kevin-hanselman.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-26T04:20:52.000Z","updated_at":"2025-02-24T02:44:55.000Z","dependencies_parsed_at":"2024-01-06T17:45:28.674Z","dependency_job_id":"117ce90c-a6e7-4de3-a817-892c821a7c60","html_url":"https://github.com/kevin-hanselman/dud","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kevin-hanselman%2Fdud","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kevin-hanselman%2Fdud/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kevin-hanselman%2Fdud/releases","manif
ests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kevin-hanselman%2Fdud/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kevin-hanselman","download_url":"https://codeload.github.com/kevin-hanselman/dud/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243775950,"owners_count":20346296,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","data-pipelines","data-science","dataset","dvcs","machine-learning","mlops"],"created_at":"2024-07-31T00:01:00.816Z","updated_at":"2025-12-29T23:29:11.906Z","avatar_url":"https://github.com/kevin-hanselman.png","language":"Go","funding_links":[],"categories":["Go","Data Management"],"sub_categories":[],"readme":"# Dud\n\n[![Build status](https://github.com/kevin-hanselman/dud/workflows/build/badge.svg)](https://github.com/kevin-hanselman/dud/actions?query=workflow%3Abuild)\n[![Go report card](https://goreportcard.com/badge/github.com/kevin-hanselman/dud)](https://goreportcard.com/report/github.com/kevin-hanselman/dud)\n\n[Website](https://kevin-hanselman.github.io/dud/)\n| [Install](https://kevin-hanselman.github.io/dud/install)\n| [Getting Started](https://kevin-hanselman.github.io/dud/getting_started)\n| [Source Code](https://github.com/kevin-hanselman/dud)\n\nDud is a lightweight tool for versioning data alongside source code and building\ndata pipelines. 
In practice, Dud extends many of the benefits of source\ncontrol to large binary data.\n\nWith Dud, you can **commit, checkout, fetch, and push large files and\ndirectories** with a simple command line interface. Dud stores recipes (a.k.a.\nstages) for retrieving your data in small YAML files. These stages can be\nstored in source control to **link your data to your code**. On top of that,\nstages can **run the commands to generate the data**, sort of like\n[Make](https://www.gnu.org/software/make/). Stages can be chained together to\n**create data pipelines**. See the [Getting\nStarted](https://kevin-hanselman.github.io/dud/getting_started) guide for\na hands-on overview.\n\nDud is pronounced \"duhd\", not \"dood\". Dud is not an acronym.\n\n\n## Motivation\n\nDud is heavily inspired by [DVC](https://dvc.org/). DVC addresses the need for\ndata versioning and reproducibility, but its implementation is not without\nproblems. My criticisms of DVC boil down to two things: speed and simplicity. By\nspeed, I mean throughput and responsiveness. By simplicity, I mean doing\nless--both in project scope and amount of abstraction.\n\nIn terms of speed, Dud is [generally much\nfaster](https://kevin-hanselman.github.io/dud/benchmarks) than DVC. In terms of\nsimplicity, Dud has a [smaller, more focused\nscope](https://kevin-hanselman.github.io/dud/cli/dud), and it is [distributed as\na standalone executable](https://github.com/kevin-hanselman/dud/releases).\n\nTo summarize with an analogy: Dud is to DVC what [Flask][1] is to [Django][1].\nBoth Dud and DVC have their strengths. 
If you want a \"batteries included\" suite\nof tools for managing machine learning projects, DVC may be a good fit for you.\nIf data management is your main area of need and you want something lightweight\nand fast, Dud may be a better fit.\n\n[1]: https://hackr.io/blog/flask-vs-django\n\nTo get down to brass tacks, read on.\n\n### Concrete differences with DVC\n\n#### Dud does not manage experiments and/or metrics.\n\nDud is solely focused on versioning and reproducing data alongside source code.\nDVC's scope has grown to encompass a large portion of a traditional machine\nlearning workflow. While an integrated suite of tools has its benefits, if UNIX\nis any guide, the composition of smaller, more focused tools generally yields\nmore productivity than their monolithic counterparts. For example, there's no\nreason you couldn't use [MLflow](https://mlflow.org/) or\n[Aim](https://aimstack.io/) alongside Dud to track your experiments. Dud does\nnot prescribe any solution for experiment tracking, and it doesn't try to enter\nthe new, yet already crowded, marketplace for such tools.\n\nSecondly, versioning data alongside source code is an incredibly useful concept\nin its own right. Domains beyond machine learning and data science (e.g. game\ndevelopment and digital design) may greatly benefit from this approach to data\nmanagement without being burdened by extra baggage carried by a specific domain.\n\n\n#### Dud commits must always be explicitly invoked; they are never side effects.\n\nFor both Dud and DVC, committing data to the cache is one of the most expensive\noperations that each tool undertakes (in terms of both run-time and I/O).\nBecause of this, Dud puts the user in absolute control of when to commit data.\nIn Dud, commits only happen when you run `dud commit`.\n\nIn contrast, DVC often commits automatically on your behalf as a side effect of\nother commands (for example, during `dvc add` and `dvc repro`). 
While DVC is\ntrying to be helpful, these implicit commits are often accidental commits.\nFor example, if you're rapidly iterating on a pipeline, you're likely running\n`dvc repro` or `dvc run` repeatedly as you develop. However, DVC will\nautomatically commit the results each time you run `dvc repro` or `dvc\nrun`--even if you are just debugging something or tweaking your code. Such\naccidental commits have a high cost; they turn \"rapid development\" into\n\"development\", and they bloat your cache. (You can disable DVC's implicit\ncommits using the `--no-commit` flag, but you have to remember to type it each\ntime, and DVC does not support enabling this flag by default, e.g. via\nconfiguration file.)\n\n\n#### Dud checks out files as symbolic links by default.\n\nWhen Dud checks out cached files into the workspace, it uses symbolic links\n(a.k.a. symlinks) by default. Symlinks have a number of benefits that make them\nan excellent choice for checkouts. First, symlinks require very little I/O to\ncreate, so `dud checkout` usually completes almost instantaneously. Second,\nsymlinks transparently redirect to the cached files themselves, so data isn't\nduplicated between the workspace and the cache, and your storage space is used\nefficiently. Last but not least, symlinks make it trivial to check if a file is\nup-to-date (by checking the link target), so `dud status` can also be extremely\nfast.\n\nBy default, DVC checks out files as hard copies. (Technically, DVC tries to use\n[reflinks][reflink] before copies, but very few filesystems support reflinks, so\ncopies are far more likely to be the default.) With hard copies, efficiencies\nlisted above are not possible, so checkouts and status checks are inefficient by\ndefault. 
To its credit, DVC's cache can be configured to use symlinks, but\narguably DVC's default cache configuration is not sensible for projects of any\nsignificant size.\n\n[reflink]: https://en.wikipedia.org/wiki/Data_deduplication#reflink\n\n\n#### Running a Dud pipeline never implicitly alters a stage's artifacts.\n\nWhen you run a pipeline in DVC, DVC will remove all pipeline outputs before\nrunning the pipeline's command(s). While this can help ensure reproducible\npipelines, it is another implicit behavior the user must consider, and it\nprevents the user from deciding when stage outputs can safely be reused.\n\nIf you don't want DVC to automatically remove outputs for you, you need to\nexplicitly tell it each output you'd like to persist. However, by telling DVC to\npersist an output, DVC may perform a new and different automatic behavior. If\nyou're using symbolic links (or hard links) for checkouts (which is generally\na good idea; see above), DVC will \"unprotect\" all output links by replacing them\nwith hard copies from the cache. Not only is this behavior surprising, it's also\nvery costly in both runtime and storage.\n\nThe result of these two behaviors in DVC means that, in a sensible\nconfiguration, stages simply cannot reuse outputs efficiently; the user has\nlittle choice but to accept DVC's limitations.\n\nWhen you run a pipeline in Dud, Dud doesn't do any implicit modification of\nexisting files. Dud defers all modification of workspace files to the user. If\nyou want a specific behavior, you should code it into your stage's command. For\nexample, if you want to clear all outputs of a stage prior to it running, you\ncan delete any outputs at the beginning of your command's script. If you want\nto reuse outputs, you can check for preexisting outputs in your script and\nchoose not to recreate them. 
Dud's minimalist approach results in a stage's\ncommand entirely owning its own reproducibility; the responsibility is\nnot awkwardly shared between the stage and the tool.\n\n\n#### Dud delegates remote cache management to Rclone.\n\n[Rclone](https://rclone.org) is a very popular command-line tool which describes\nitself as \"The Swiss army knife of cloud storage.\" At the time of writing,\nRclone has more than 28,000 stars on GitHub. Rclone supports just about any\ncloud storage provider you've possibly heard of. (S3, GCS, Dropbox, Backblaze,\nto name a few.) This is all to say: Rclone is a top-tier choice for moving data\naround the internet.\n\nDud internally calls Rclone for all of its remote cache functionality, such as\n`dud fetch` and `dud push`. But Dud doesn't hide the Rclone abstraction\nentirely. Dud exposes its Rclone configuration file, and it's expected and\nencouraged that users will use Rclone directly to configure remote storage or\ninteract with their remote data. By using Rclone, Dud's remote cache interface\nimmediately gains the benefit of years of open-source development and a rich,\nwell-documented CLI. This is an example of how Dud embraces the UNIX philosophy\nand the composition of single-focus tools, as stated above.\n\nIn contrast, DVC stitches together various Python packages to support a modest\nassortment of cloud storage options. At the time of writing, DVC 2.6 supports\neleven cloud storage providers, and Rclone 1.56 supports more than fifty. But\nthe number of cloud storage options isn't the critical disadvantage of DVC's\napproach. (Both Dud and DVC support the biggest players, such as S3 and GCS.)\nDVC's critical disadvantage is that they must develop and maintain most of their\nremote data management stack themselves. 
If Rclone is any indication, cloud data\ntransfer is a very hard problem, and DVC has their work cut out for them.\n\nIn summary, Dud leverages the deep knowledge and effort of the Rclone developers\nto provide a robust and familiar remote cache experience. DVC plots their own\ncourse, and in doing so incurs a steep development cost.\n\n\n#### Dud does not use analytics. (And it never will.)\n\nBy default, DVC enables [embedded\nanalytics](https://dvc.org/doc/user-guide/analytics#anonymized-usage-analytics).\nI strongly disagree with this practice, especially in free and open-source\nsoftware. I will never embed analytics in Dud.\n\n\n## Contributing\n\nSee\n[CONTRIBUTING.md](https://github.com/kevin-hanselman/dud/blob/main/CONTRIBUTING.md).\n\n\n## License\n\nBSD-3-Clause. See\n[LICENSE](https://github.com/kevin-hanselman/dud/blob/main/LICENSE).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkevin-hanselman%2Fdud","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkevin-hanselman%2Fdud","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkevin-hanselman%2Fdud/lists"}