{"id":49067647,"url":"https://github.com/cirrusdata/datasim","last_synced_at":"2026-04-20T05:37:28.386Z","repository":{"id":347139908,"uuid":"1182885226","full_name":"cirrusdata/datasim","owner":"cirrusdata","description":"Cross-platform unstructured data simulator. Generate realistic file and object workloads for testing storage migration, backup, and replication workflows. Supports ongoing workload simulation with configurable file churn (creates, modifies, deletes) to mimic realistic data activity patterns.","archived":false,"fork":false,"pushed_at":"2026-03-26T19:33:01.000Z","size":68,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-27T08:18:48.215Z","etag":null,"topics":["file-generation","qa","test-data","test-data-generator","unstructured-data","workload-simulation"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cirrusdata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-03-16T03:49:07.000Z","updated_at":"2026-03-26T19:27:39.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/cirrusdata/datasim","commit_stats":null,"previous_names":["cirrusdata/datasim"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/cirrusdata/datasim","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cirrusdata%2Fdatasim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cirrusdata%2Fdatasim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cirrusdata%2Fdatasim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cirrusdata%2Fdatasim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cirrusdata","download_url":"https://codeload.github.com/cirrusdata/datasim/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cirrusdata%2Fdatasim/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32035154,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T00:18:06.643Z","status":"online","status_checked_at":"2026-04-20T02:00:06.527Z","response_time":94,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["file-generation","qa","test-data","test-data-generator","unstructured-data","workload-simulation"],"created_at":"2026-04-20T05:37:27.648Z","updated_at":"2026-04-20T05:37:28.374Z","avatar_url":"https://github.com/cirrusdata.png","language":"Go","readme":"# datasim\n\ndatasim is a workload simulation tool to facilitate storage systems testing and demonstration on Windows and Linux. \n\nThe first simulation included is `fileset`, which generates and rotates synthetic file trees. Additional simulation scenarios are planned which may include other databases, file/object-oriented and block-oriented workload simulations.\n\n## Installing\n\nDownload the latest prebuilt binary from the [releases page](https://github.com/cirrusdata/datasim/releases). Each release includes copy-paste install commands for Linux (amd64 and arm64) and Windows (amd64 and arm64).\n\n\n## fileset\n\nThe `fileset` simulation generates a directory tree with realistic file names, extensions, and directory structure on a mounted filesystem. datasim tracks everything it created in a manifest file and lets you apply ongoing churn to simulate active usage. When the test is done, one command removes exactly what was created and nothing else.\n\n\u003e [!NOTE]\n\u003e File content is currently randomly generated. This simulation produces realistic-looking structure, not readable documents.\n\n### Quick Start\n\n```bash\n# Initialize a 10 GiB corporate-profile dataset\ndatasim fileset init --fs /mnt/test --profile corporate --size 10GiB\n\n# See what was created\ndatasim fileset status /mnt/test\n\n# Apply one round of file churn (creates, deletes, and modifies files)\ndatasim fileset rotate --fs /mnt/test\n\n# Remove everything datasim created\ndatasim fileset destroy /mnt/test\n```\n\n\n### init\n\n`datasim fileset init` populates a directory with a synthetic file tree and writes a `.cirrusdata-datasim` manifest file at the root. The manifest tracks every file datasim created so that later commands know exactly what is there without scanning the directory.\n\n```bash\ndatasim fileset init --fs /mnt/test --profile corporate --size 10GiB\n```\n\n`--fs` is the directory to populate. `--profile` picks the dataset profile; it defaults to `corporate` if you leave it out. `--size` sets the target dataset size, with suffixes like `MiB`, `GiB`, or `TiB`. If you omit `--size`, datasim targets 80% of the filesystem's available capacity.\n\nTo get the same dataset every time — useful when you need a repeatable test scenario — pass a fixed seed:\n\n```bash\ndatasim fileset init --fs /mnt/test --profile nasa --size 50GiB --seed 42\n```\n\nThe `--strategy` flag controls how files are distributed across the tree. `balanced` (the default) applies the profile's characteristic distribution. `random` produces higher-variance output with irregular file counts and sizes, which can stress certain edge cases in migration or sync tools.\n\nSynthetic profiles also randomize file modification timestamps by default so the generated tree looks more like a lived-in dataset instead of a single bulk write.\n\n#### Dataset Profile\n\nA profile controls the shape of the generated dataset: how directories are named, how deep the tree goes, what file extensions appear, and how large files tend to be.\n\n`corporate` produces a generic office file tree with documents, spreadsheets, PDFs, and images organized into project-style subdirectories. Use this when you want something that looks like a typical shared drive or file server.\n\n`school` mimics academic file collections — course materials, readings, and student work organized by subject or term. It produces a wider mix of document and media types at smaller average file sizes.\n\n`nasa` generates data shaped like scientific or research archives: larger files, more uniform naming, and directory structures that resemble instrument or mission data. Use this when you need a dataset that is less document-heavy and more data-file-heavy.\n\n\n### status\n\n`datasim fileset status` reads the manifest and prints the current state of the dataset: file count, total bytes, profile, seed, last operation, number of rotations applied, and a per-category file breakdown.\n\n```bash\ndatasim fileset status /mnt/test\n```\n\nAdd `--json` to print the raw manifest as JSON, which is useful for scripting or piping into other tools:\n\n```bash\ndatasim fileset status /mnt/test --json\n```\n\n### rotate\n\n`datasim fileset rotate` applies a round of changes to an existing dataset. It creates new files, deletes some existing ones, and modifies others. The defaults are 5% create, 5% delete, and 10% modify — applied to the current file count.\n\n```bash\ndatasim fileset rotate --fs /mnt/test\n```\n\nOverride any of the rates when you want a different churn pattern. For a heavy-write scenario with minimal deletes:\n\n```bash\ndatasim fileset rotate --fs /mnt/test --create-pct 20 --delete-pct 1 --modify-pct 5\n```\n\nFor a dataset under heavy deletion pressure:\n\n```bash\ndatasim fileset rotate --fs /mnt/test --create-pct 2 --delete-pct 30 --modify-pct 5\n```\n\nEach rotation is recorded in the manifest with its timestamp, seed, and operation counts. You can see the history with `datasim fileset status /mnt/test`.\n\n### rotate loop\n\n`datasim fileset rotate loop` runs rotation on a repeating schedule. Use this when you want the dataset to keep churning throughout a longer test run without calling rotate manually.\n\n```bash\ndatasim fileset rotate loop --fs /mnt/test --interval 5m\n```\n\nBy default it runs until you stop it with Ctrl-C. Set `--iterations` to limit the number of rounds:\n\n```bash\ndatasim fileset rotate loop --fs /mnt/test --interval 2m --iterations 30\n```\n\nAll the same percentage flags from `rotate` apply here:\n\n```bash\ndatasim fileset rotate loop --fs /mnt/test --interval 1m --create-pct 10 --delete-pct 10 --modify-pct 20\n```\n\n### destroy\n\n`datasim fileset destroy` removes everything the fileset created — the generated files, directories, and the manifest itself. It uses the manifest to determine what to remove, so it does not touch anything it did not create.\n\n```bash\ndatasim fileset destroy /mnt/test\n```\n\n\n## Updating\n\nIf you installed datasim from a release binary, you can update it in place:\n\n```bash\ndatasim update\n```\n\nThis fetches the latest stable release, verifies its checksum, and replaces the running binary.\n\n## Agent Skill\n\nIf you are using an AI coding agent, install the `datasim` skill:\n\n```bash\nnpx skills add cirrusdata/datasim --skill datasim\n```\n\n## Getting help\n\nEvery command and subcommand has a `--help` flag:\n\n```bash\ndatasim --help\ndatasim fileset --help\ndatasim fileset init --help\ndatasim fileset rotate --help\n```\n\nFor engineering notes and architecture details, see [docs/design.md](docs/design.md).\n\n---\n\n## Appendix: block-device utility\n\n`datasim block-device` is a convenience utility for formatting and mounting a raw disk or partition before running a simulation, and tearing it down afterward. It is not part of the simulation itself — if you already have a mounted filesystem, you do not need it.\n\n### Linux\n\n```bash\n# Format /dev/sdc1 and mount it at /mnt/test\ndatasim block-device format /dev/sdc1 /mnt/test\n\n# Run the simulation\ndatasim fileset init --fs /mnt/test --profile corporate --size 50GiB\n# ...\ndatasim fileset destroy /mnt/test\n\n# Unmount and clean up\ndatasim block-device destroy /mnt/test\n```\n\nThe default filesystem type on Linux is `xfs`. Use `--fstype ext4` if your workload requires ext4:\n\n```bash\ndatasim block-device format /dev/sdc1 /mnt/test --fstype ext4\n```\n\nIf the device or mount point is already in use and you want to wipe and reuse it without unmounting first, add `--force`. Without it, the command refuses to touch a device or mount point that is already mounted.\n\n### Windows\n\nUse either a drive letter or a directory mount point. The default filesystem type is `ntfs`.\n\n```powershell\ndatasim block-device format \\\\.\\PHYSICALDRIVE4 X:\\\ndatasim fileset init --fs X:\\ --profile corporate --size 10GiB\ndatasim fileset destroy X:\\\ndatasim block-device destroy X:\\\n```\n\n---\n\n## License\n\nThis project is licensed under the [Apache License 2.0](LICENSE).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcirrusdata%2Fdatasim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcirrusdata%2Fdatasim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcirrusdata%2Fdatasim/lists"}