{"id":48101224,"url":"https://github.com/distributed-system-analysis/smallfile","last_synced_at":"2026-04-04T15:43:30.925Z","repository":{"id":3442122,"uuid":"4494878","full_name":"distributed-system-analysis/smallfile","owner":"distributed-system-analysis","description":"distributed metadata-intensive workload generator for POSIX-like filesystems","archived":false,"fork":false,"pushed_at":"2023-05-23T20:09:42.000Z","size":608,"stargazers_count":195,"open_issues_count":4,"forks_count":64,"subscribers_count":21,"default_branch":"main","last_synced_at":"2024-04-16T18:21:43.555Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/distributed-system-analysis.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2012-05-30T13:12:38.000Z","updated_at":"2024-04-03T09:17:41.000Z","dependencies_parsed_at":"2023-01-13T12:30:48.619Z","dependency_job_id":null,"html_url":"https://github.com/distributed-system-analysis/smallfile","commit_stats":{"total_commits":337,"total_committers":5,"mean_commits":67.4,"dds":"0.029673590504451064","last_synced_commit":"e6a31b14d76bb8472a1ab228646a8807cf86a95e"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/distributed-system-analysis/smallfile","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/distributed-system-analysis%2Fsmallfile","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/distributed-system-analysis%2Fsmallfile/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dist
ributed-system-analysis%2Fsmallfile/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/distributed-system-analysis%2Fsmallfile/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/distributed-system-analysis","download_url":"https://codeload.github.com/distributed-system-analysis/smallfile/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/distributed-system-analysis%2Fsmallfile/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31403960,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T10:20:44.708Z","status":"ssl_error","status_checked_at":"2026-04-04T10:20:06.846Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-04T15:43:30.808Z","updated_at":"2026-04-04T15:43:30.902Z","avatar_url":"https://github.com/distributed-system-analysis.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"smallfile\n=========\n\nA distributed workload generator for POSIX-like filesystems.\n\nNew features:\n* support for Kubernetes and benchmark-operator\n* YAML input format for parameters\n\n# Table of contents\n\n[License](#license)\n\n[Introduction](#introduction)\n\n[What it can do](#what-it-can-do)\n\n[Restrictions](#restrictions)\n\n[How to specify 
test](#how-to-specify-test)\n\n[Results](#results)\n\n[Postprocessing of response time data](#postprocessing-of-response-time-data)\n\n[How to run correctly](#how-to-run-correctly)\n\n[Avoiding caching effects](#avoiding-caching-effects)\n\n[Use of pause and auto-pause option](#use-of-pause-and-auto-pause-options)\n\n[Use with distributed filesystems](#use-with-distributed-filesystems)\n\n[The dreaded startup timeout error](#the-dreaded-startup-timeout-error)\n\n[Use with local filesystems](#use-with-local-filesystems)\n\n[Use of subdirectories](#use-of-subdirectories)\n\n[Sharing directories across threads](#sharing-directories-across-threads)\n\n[Hashing files into directory tree](#hashing-files-into-directory-tree)\n\n[Random file size distribution option](#random-file-size-distribution-option)\n\n[Asynchronous file copy performance](#asynchronous-file-copy-performance)\n\n[Comparable Benchmarks](#comparable-benchmarks)\n\n[Design principles](#design-principles)\n\n[Synchronization](#synchronization)\n\n[Test parameter transmission](#test-parameter-transmission)\n\n[Launching remote worker threads](#launching-remote-worker-threads)\n\n[Returning results](#returning-results)\n\n\nLicense\n=========\nCopyright [2012] [Ben England]\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use files except in compliance with the License.\nYou may obtain a copy of the License at\n\n  http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n\nIntroduction\n=========\n\nsmallfile is a python-based distributed POSIX workload generator \nwhich can be used to quickly measure performance for a\nvariety of metadata-intensive workloads across an 
entire\ncluster.  It has no dependencies on any specific filesystem or implementation.\nIt was written to complement use of the fio and iozone benchmarks for measuring performance \nof large-file workloads, and borrows some concepts from iozone\nand Ric Wheeler's fs_mark.  It was developed by Ben England starting in March 2009.\n\nWhat it can do\n--------\n\n* multi-host - manages workload generators on multiple hosts\n* containers - can run on sets of docker containers\n* aggregates throughput - for entire set of hosts\n* synchronizes workload generation - can start and stop workload generator threads at approximately same time\n* pure workloads - only one kind of operation in each run (as opposed to mixed workloads)\n* extensible - easy to extend to new workload types\n* scriptable - provides CLI for scripted use, but workload generator is separate so a GUI is possible\n* file size distributions - supports either fixed file size or random exponential file size\n* traces response times - can capture response time data in .csv format, provides utility to reduce this data to statistics\n* Windows support - different launching method, see below\n* verification of read data -- writes unique data pattern in all files, can verify data read against this pattern\n* incompressibility - can write random data pattern that is incompressible\n* async replication support - can measure time required for files to appear in a directory tree\n* fs coherency test - in multi-host tests, can force all clients to read files written by different client\n\npython 2.7, python 3, pypy and pypy3 are supported.   
pypy3 can increase throughput by up to 100% where interpreter is the bottleneck -- however at present pypy and pypy3 do not support pyyaml, at least not in Fedora 31.\n\nRestrictions\n-----------\n\n* for a multi-host test, all workload generators and the test driver must provide access to the same shared directory\n* does not support mixed workloads (mixture of different operation types)\n* is not accurate on single-threaded tests in memory resident filesystem\n* requires all hosts to have the same DNS domain name (plan to remove this\n  restriction)\n* does not support HTTP access (use COSBench/ssbench for this)\n* does not support mixture of Windows and non-Windows clients\n* For POSIX-like operating systems, we have only tested with Linux, but there\n  is a high probability that it would work with Apple OS and other UNIXes.\n* Have only tested Windows XP and Windows 7, but any Win32-compatible Windows would probably work with this.\n\nHow to specify test\n============\n\nYou must use a directory visible to all participating hosts to run a\ndistributed test.\n\nYou can include multiple hosts in a test in 1 of 2 ways:\n* provide password-less ssh access to these hosts from the test driver\n* run the launcher daemon on each host (more about this below)\n\nThis latter method is particularly useful for containers where we may not want to have each container running sshd.  To see more about this, look for the --launch-by-daemon parameter below.\n\nTo see what parameters are supported by smallfile_cli.py, do \n\n    # python smallfile_cli.py --help\n\nBoolean true/false parameters can be set to either Y\n(true) or N (false). Every command consists of a sequence of parameter\nname-value pairs with the format --name value .  
To see what default values are,\nuse --help option.\n\nThe parameters are (from most useful to least useful):\n\n * --yaml-input-file -- specify parameters in YAML instead of on command line\n * --operation -- operation type (see list below for choices)\n * --top -- top-level directory, all file accesses are done inside this\n  directory tree. If you wish to use multiple mountpoints, provide a list of\n  top-level directories separated by comma (no whitespace).\n * --response-times -- if Y then save response time for each file operation in a\n  rsptimes\\*.csv file in the shared network directory. Record format is\n  operation-type, start-time, response-time. The operation type is included so\n  that you can run different workloads at the same time and easily merge the\n  data from these runs. The start-time field is the time that the file\n  operation started, down to microsecond resolution. The response time field is\n  the file operation duration down to microsecond resolution.\n * --output-json - if specified then write results in JSON format to the specified pathname for easier postprocessing.\n * --host-set -- comma-separated set of hosts used for this test, or a file containing a list of host\n  names. Default: non-distributed test.\n * --launch-by-daemon - if specified, then ssh will not be used to launch test, see section titled \"launching remote worker threads\"\n * --files -- how many files should each thread process? \n * --threads -- how many workload generator threads should each smallfile_cli.py process create? \n * --auto-pause -- if Y then smallfile will auto-adjust the pause time between files\n * --file-size -- total amount of data accessed per file, in KB.   If zero then no\n  reads or writes are performed. \n * --file-size-distribution -- only supported value today is exponential.\n * --record-size -- record size in KB, how much data is transferred in a single\n  read or write system call.  
If 0 then it is set to the minimum of the file\n  size and 1-MiB record size limit.\n * --files-per-dir -- maximum number of files contained in any one directory.\n * --dirs-per-dir -- maximum number of subdirectories contained in any one\n  directory.\n * --fsync -- if Y then an fsync() call is inserted before closing a created/modified/appended file.\n * --hash-into-dirs -- if Y then assign next file to a directory using a hash\n  function, otherwise assign next --files-per-dir files to next directory.\n * --same-dir -- if Y then threads will share a single directory.\n * --network-sync-dir -- don't need to specify unless you run a multi-host test\n  and the --top parameter points to a non-shared directory (see discussion\n  below). Default: network_shared subdirectory under --top dir.\n * --permute-host-dirs -- if Y then have each host process a different\n  subdirectory tree than it otherwise would (see below for directory tree\n  structure).\n * --xattr-size -- size of extended attribute value in bytes (names begin with\n  'user.smallfile-') \n * --xattr-count -- number of extended attributes per file\n * --cleanup-delay-usec-per-file -- insert a delay after the \"cleanup\" operation, in microseconds per file processed (see below)\n * --prefix -- a string prefix to prepend to file names (so they don't collide with\n  previous runs for example)\n * --suffix -- a string suffix to append to file names (so they don't collide with\n  previous runs for example)\n * --incompressible -- if Y then generate a pure-random file that\n  will not be compressible (useful for tests where intermediate network or file\n  copy utility attempts to compress data)\n * --record-ctime-size -- if Y then label each created file with an\n  xattr containing a time of creation and a file size. 
This will be used by\n  the await-create operation to compute performance of asynchronous file\n  replication/copy.\n * --finish -- if Y, thread will complete all requested file operations even if\n  measurement has finished.\n * --stonewall -- if Y then thread will measure throughput as soon as it detects\n  that another thread has finished.\n * --verify-read -- if Y then smallfile will verify read data is correct.\n * --remote-pgm-dir -- don't need to specify this unless the smallfile software\n  lives in a different directory on the target hosts than on the test-driver host. \n * --pause -- integer (microseconds) each thread will wait before starting next\n  file.\n\nOperation types are:\n\n* create -- create a file and write data to it\n* append -- open an existing file and append data to it \n* delete -- delete a file \n* rename -- rename a file \n* delete_renamed -- delete a file that had previously been renamed\n* read -- read an existing file \n* stat -- just read metadata from an existing file \n* chmod -- change protection mask for file\n* setxattr -- set extended attribute values in each file \n* getxattr -- read extended attribute values in each file \n* symlink -- create a symlink pointing to each file (create must be run\nbeforehand) \n* mkdir -- create a subdirectory with 1 file in it \n* rmdir -- remove a subdirectory and its 1 file\n* readdir -- scan directories only, don't read files or their metadata\n* ls-l -- scan directories and read basic file metadata\n* cleanup -- delete any pre-existing files from a previous run \n* swift-put -- simulates OpenStack Swift behavior when doing PUT operation\n* swift-get -- simulates OpenStack Swift behavior for each GET operation.\n* overwrite -- overwrite existing files.\n* truncate-overwrite -- truncate existing file and then write data to it.\n\nFor example, if you want to run smallfile_cli.py on 1 host with 8 threads\neach creating 2 GB of 1-MiB files, you can use these options:\n\n    # python smallfile_cli.py 
--operation create --threads 8 \\\n       --file-size 1024 --files 2048 --top /mnt/gfs/smf\n\nTo run a 4-host test doing the same thing:\n\n    # python smallfile_cli.py --operation create --threads 8 \\\n       --file-size 1024 --files 2048 --top /mnt/gfs/smf \\\n       --host-set host1,host2,host3,host4 \n\nNote: You can only perform a read operation on files that were generated with smallfile (using the same parameters).\n\nErrors encountered by worker threads will be saved in /var/tmp/invoke-N.log where N is the thread number. After each test, a summary of thread results is displayed, and overall test results are aggregated for you, in three ways:\n\n * files/sec -- most relevant for smaller file sizes\n * IOPS -- application I/O operations per sec, rate of read()/write()\n * MB/s -- megabytes/sec (really MiB/sec), data transfer rate\n\nUsers should never need to run smallfile.py -- this is the python class which\nimplements the workload generator. Developers can run this module to invoke its\nunit test however:\n\n    # python smallfile.py \n\nTo run just one unit test module, for example:\n\n    # python -m unittest smallfile.Test.test_c3_Symlink\n\nHow to specify parameters in YAML\n=============\n\nSometimes it's more convenient to specify inputs in a YAML file when using a CI system such as Jenkins.  Smallfile has a flat YAML file format where the parameter name in yaml is the same as on the CLI except that the leading \"--\" is removed and a colon is appended to the parameter name.  For example:\n```\ntop: /mnt/xfs1/smf\nhost-set: host1,host2\n```\n\n\nResults\n=======\n\nAll tests display a \"files/sec\" result.  If the test performs reads or writes,\nthen a \"MB/sec\" data transfer rate and an \"IOPS\" result (i.e. total read or\nwrite calls/sec) are also displayed.  Each thread participating in the test\nkeeps track of total number of files and I/O requests that it processes during\nthe test measurement interval.  
These results are rolled up per host if it is a\nsingle-host test.  For a multi-host test, the per-thread results for each host\nare saved in a file within the --top directory, and the test master then reads\nin all of the saved results from its slaves to compute the aggregate result\nacross all client hosts.  The percentage of requested files which were\nprocessed in the measurement interval is also displayed, and if the number is\nlower than a threshold (default 70%) then an error is raised.\n\nPostprocessing of response time data\n--------\n\nIf you specify **--response-times Y** in the command, smallfile will save response time of each operation in per-thread output files in the shared directory as rsptimes\\*.csv.   For example, you can turn these into an X-Y scatterplot so that you can see how response time varies over time.   For example:\n\n    # python smallfile_cli.py --response-times Y\n    # ls -ltr /var/tmp/smf/network_shared/rsptimes*.csv\n\nYou should see 1 .csv file per thread.  These files can be loaded into any\nspreadsheet application and graphed.  An x-y scatterplot can be useful to see\nchanges over time in response time.\n\nBut if you just want statistics, you can generate these using the postprocessing command:\n\n    # python smallfile_rsptimes_stats.py /var/tmp/smf/network_shared\n\nThis will generate statistics summary in ../rsptimes-summary.csv , in this example you would find it in /var/tmp/smf/.  The file is in a form suitable for loading into a spreadsheet and graphing.  A simple example is generated using the regression test **gen-fake-rsptimes.sh** .  The result of this test is output like this:\n\n```\nfiltering out suffix .foo.com from hostnames\nrsp. time result summary at: /tmp/12573.tmp/../rsptime-summary.csv\n```\nThe first line illustrates that you can remove a common hostname suffix in the output so that it is easier to read and graph.  
In this test we pass the optional parameter **--common-hostname-suffix foo.com** to smallfile_rsptimes_stats.py.  The inputs to smallfile_rsptimes_stats.py are contained in ```/tmp/12573.tmp/``` and the output looks like this:\n```\n\n$ more /tmp/12573.tmp/../rsptime-summary.csv\nhost:thread, samples, min, max, mean, %dev, 50 %ile, 90 %ile, 95 %ile, 99 %ile, \nall:all,320, 1.000000, 40.000000, 20.500000, 56.397441, 20.500000, 36.100000, 38.050000, 40.000000, \n\nhost-21:all,160, 1.000000, 40.000000, 20.500000, 56.486046, 20.500000, 36.100000, 38.050000, 40.000000, \nhost-22:all,160, 1.000000, 40.000000, 20.500000, 56.486046, 20.500000, 36.100000, 38.050000, 40.000000, \n\nhost-21:01,40, 1.000000, 40.000000, 20.500000, 57.026595, 20.500000, 36.100000, 38.050000, 39.610000, \nhost-21:02,40, 1.000000, 40.000000, 20.500000, 57.026595, 20.500000, 36.100000, 38.050000, 39.610000, \nhost-21:03,40, 1.000000, 40.000000, 20.500000, 57.026595, 20.500000, 36.100000, 38.050000, 39.610000, \nhost-21:04,40, 1.000000, 40.000000, 20.500000, 57.026595, 20.500000, 36.100000, 38.050000, 39.610000, \nhost-22:01,40, 1.000000, 40.000000, 20.500000, 57.026595, 20.500000, 36.100000, 38.050000, 39.610000, \nhost-22:02,40, 1.000000, 40.000000, 20.500000, 57.026595, 20.500000, 36.100000, 38.050000, 39.610000, \nhost-22:03,40, 1.000000, 40.000000, 20.500000, 57.026595, 20.500000, 36.100000, 38.050000, 39.610000, \nhost-22:04,40, 1.000000, 40.000000, 20.500000, 57.026595, 20.500000, 36.100000, 38.050000, 39.610000, \n```\n* record 1 - contains headers for each column\n* record 2 - contains aggregate response time statistics for the entire distributed system, if it consists of more than 1 host\n* record 4-5 - contains per-host aggregate statistics\n* record 7-end - contains per-thread stats, sorted by host then thread\n\nYou'll notice that even though all the threads have the same simulated response times, the 99th percentile values for each thread are different than the aggregate stats per host 
or for the entire test!  How can this be?  Percentiles are computed using the [numpy.percentile](https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html) function, which linearly interpolates to obtain percentile values.  In the aggregate stats, the 99th percentile is linearly interpolated between two samples of 40 seconds, whereas in the per-thread results the 99th percentile is interpolated between samples of 40 and 39 seconds.  \n\n\nHow to run correctly\n=============\n\nHere are some things you need to know in order to get valid results - it is not\nenough to just specify the workload that you want.\n\nAvoiding caching effects\n==========\n\nThere are two types of caching effects that we wish to avoid, data caching and\nmetadata caching.  If the average object size is sufficiently large, we need\nonly be concerned about data caching effects.  In order to avoid data caching\neffects during a large-object read test, the Linux buffer cache on all servers\nmust be cleared. In part this is done using the command: \"echo 1 \u003e /proc/sys/vm/drop_caches\" on all hosts.  However, some filesystems such as\nGluster have their own internal caches - in that case you might need to\nremount the filesystem or even restart the storage pool/volume.\n\nUse of pause and auto-pause options\n==========\n\nNormally, smallfile stops the throughput measurement for the test as soon as\nthe first thread finishes processing all its files.  In some filesystems, the first thread that starts running will be operating at much higher speed (example: NFS writes) and can easily finish before other threads have a chance to get started.  This immediately invalidates the test.  To make this less likely, it is possible to insert a per-file delay into each\nthread with the **--pause** option so that the other threads have a chance to\nparticipate in the test during the measurement interval.    
It is preferable to\nrun a longer test instead, because in some cases you might otherwise restrict\nthroughput unintentionally.  But if you know that your throughput upper bound\nis X files/sec and you have N threads running, then your per-thread throughput\nshould be no more than X/N files/sec (at least N/X seconds per file), so a reasonable pause would be something like 3N/X\nseconds, converted to microseconds.  For example, if you know that you cannot do better than 100000\nfiles/sec and you have 20 threads running, try a 3 x 20/100000 sec = 600 microsecond\npause.  Verify that this isn't affecting throughput by reducing the pause and\nrunning a longer test.\n\nHowever, this pause parameter is hard to use and requires you to run tests before you set it.\nTo get all threads to run at similar speeds, the auto-pause parameter has been added.\nThis parameter is a boolean defaulting to N (false) for now, so the same test doesn't start to give different results unexpectedly.\nIf set to Y, then smallfile will continually adjust the time between files based on the response time it measures during the run (for that thread).\nIt does this by maintaining a record of the last N response times, taking the average, and then computing a pause time from that.\nWhy should this work?   If we think of the cluster as a black box, all the smallfile filesystem calls have to pass through that black box\nand this means that threads exert backpressure on each other indirectly through the response time that they experience.  We want to keep the pause time low enough that the system stays busy, but not so low that one thread can finish before another one even gets started.  One problem with this approach is client-side caching, which can decouple response times of the threads on different hosts.   
However, it is usually possible to drop cache on all hosts to prevent client-side caching.\n\nClearly the pause and auto-pause parameters are mutually exclusive - you only use 1 of the 2.\n\nUse of cleanup-delay-usec-per-file option\n=========================================\nSome distributed filesystems do not actually recycle file space at the moment you delete the file. \nThey may wait some time and then do it asynchronously to enable the application to proceed more quickly.\nThis can cause subsequent tests to compete with the space-recycling activity, resulting in\nvariable results.   The \"cleanup-delay-usec-per-file\" option gives you a way to work around this problem.\nIf you set it to non-zero, then during the \"cleanup\" operation (and only this one), \na time delay will be computed by multiplying the number of files processed by this parameter, and \nsmallfile will sleep for this time duration before proceeding to subsequent operations.\nYou can take advantage of this by structuring your tests so that each sample operation sequence,\nsuch as create,read,rename,delete-renamed , is followed by a \"cleanup\" op.   You can then cause smallfile to \npause for a while after each sample, before the next sample begins.\n\nUse with distributed filesystems\n---------\n\nTo measure the performance of a distributed filesystem, multiple hosts\nmust apply the workload simultaneously. The --host-set parameter lets you specify a comma-separated list of\nhosts to use, or you can just specify a filename containing a list of hosts, 1 host per record.  \nThe latter is certainly the more convenient option for large clusters.\n\nFor any distributed filesystem test, there must be a single directory which is\nshared across all hosts, both test driver and worker hosts, that can be used to\npass test parameters, pass back results, and coordinate activity across the\nhosts. 
This is referred to below as the “shared directory”. By\ndefault this is the network_shared/ subdirectory of the --top directory, but you\ncan override this default by specifying the --network-sync-dir directory\nparameter; see the next section for why this is useful.\n\nSome distributed filesystems, such as NFS, have relaxed,\neventual-consistency caching of directories; this will cause problems for the\nsmallfile benchmark. To work around this problem, you can use a separate NFS\nmountpoint exported from a Linux NFS server, mounted with the option actimeo=1\n(to limit duration of time NFS will cache directory entries and metadata). You\nthen reference this mountpoint using the --network-sync-dir option of smallfile.\nFor example:\n\n```\n# mount -t nfs -o actimeo=1 your-linux-server:/your/nfs/export /mnt/nfs\n# ./smallfile_cli.py --top /your/distributed/filesystem \\\n    --network-sync-dir /mnt/nfs/smf-shared\n```\n\nFor non-Windows tests, the user must set up password-less ssh between the test\ndriver and each worker host. If security is an issue, a non-root username can be used\nthroughout, since smallfile requires no special privileges. Edit the\n$HOME/.ssh/authorized_keys file to contain the public key of the account on the\ntest driver. The test driver will bypass the .ssh/known_hosts file by using the -o\nStrictHostKeyChecking=no option in the ssh command.\n\nFor Windows tests, each worker host must be running the launch_smf_host.py\nprogram that polls the shared network directory for a file that contains the\ncommand to launch smallfile_remote.py in the same way that would happen with\nssh on non-Windows tests. 
The command-line parameters on each Windows host\nwould be something like this:\n\n    start python launch_smf_host.py --shared z:\\smf\\network_shared --as-host %hostname%\n\nThen from the test driver, you could run the test, specifying your hosts:\n\n    python smallfile_cli.py --top z:\\smf --host-set gprfc023,gprfc024\n\nThe dreaded startup timeout error\n============\n\nIf you get the error \"Exception: starting signal not seen within 11 seconds\" when running a distributed test with a lot of subdirectories, the problem may be caused by insufficient time for the worker threads to get ready to run the test.   In some cases, this was caused by a flaw in smallfile's timeout calculation (which we believe is fixed).  However, before smallfile actually starts a test, each worker thread must prepare a directory tree to hold the files that will be used in the test.   This ensures that we are not measuring directory creation overhead when running a file create test, for example.  For some filesystems, directory creation can be more expensive at scale.  We take this into account with the --min-dirs-per-sec parameter, which defaults to a value more appropriate for local filesystems.   If we are doing a large distributed filesystem test, it may be necessary to lower this parameter somewhat, based on the filesystem's performance, which you can measure using --operation mkdir, and then use a value of about half what you see there.  This will result in a larger timeout value, which you can obtain using \"--output-json your-test.json\" -- look for the 'startup-timeout' and 'host-timeout' parameters in this file to see what timeout is being calculated.\n\n\nUse with local filesystems\n-----------\n\nThere are cases where you want to use a distributed filesystem test on\nhost-local filesystems. One such example is virtualization, where the “local”\nfilesystem is really layered on a virtual disk image which may be stored in a\nnetwork filesystem. 
The benchmark needs to share certain files across hosts to\nreturn results and synchronize threads. In such a case, you specify the\n--network-sync-dir directory-pathname parameter to have the benchmark use a\ndirectory in some shared filesystem external to the test directory (specified\nwith --top parameter). By default, if this parameter is not specified then the\nshared directory will be the subdirectory network_shared underneath the directory\nspecified with the --top parameter.\n\nUse of subdirectories\n----------\n\nBefore a test even starts, the smallfile benchmark ensures that the\ndirectories needed by that test already exist (there is a specific operation\ntype for testing performance of subdirectory creation and deletion). If the top\ndirectory (specified by --top parameter) is T, then the top per-thread directory\nis T/host/dTT where TT is a 2-digit thread number and “host” is the hostname.\nIf the test is not a distributed test, then it's just whatever host the\nbenchmark command was issued on, otherwise it is each of the hosts specified by\nthe --host-set parameter. The first F files (where F is the value of the\n--files-per-dir parameter) are placed in this top per-thread directory. If the\ntest uses more than F files/thread, then at least one subdirectory from the\nfirst level of subdirectories must be used; these subdirectories have the path\nT/host/dTT/dNNN where NNN is the subdirectory number. Suppose the value of the\nparameter --dirs-per-dir is D. Then there are at most D subdirectories of the\ntop per-thread directory. If the test requires more than F(D+1) files per\nthread, then a second level of subdirectories will have to be created, with\npathnames like T/host/dTT/dNNN/dMMM . This process of adding subdirectories\ncontinues in this fashion until there are sufficient subdirectories to hold all\nthe files. 
The purpose of this approach is to simulate a mixture of directories\nand files, and to not require the user to specify how many levels of\ndirectories are required.\n\nThe use of multiple mountpoints is supported. This feature is useful for\ntesting NFS, etc.\n\nNote that the test harness does not have to scan the directories to figure out\nwhich files to read or write – it simply generates the filename sequence\nitself. If you want to test directory scanning speed, use readdir or ls-l\noperations. \n\nSharing directories across threads\n---------\n\nSome applications require that many threads, possibly spread across many host\nmachines, share a set of directories. The --same-dir parameter makes it\npossible for the benchmark to test this situation. By default this parameter is\nset to N, which means each thread has its own non-overlapping directory tree.\nThis setting provides the best performance and scalability. However, if the\nuser sets this parameter to Y, then the top per-thread directory for all\nthreads will be T instead of T/host/dTT as described in the preceding section.\n\nHashing files into directory tree\n----------\n\nFor applications which create very large numbers of small files (millions for\nexample), it is impossible or at the very least impractical to place them all\nin the same directory, whether or not the filesystem supports so many files in\na single directory. There are two ways applications can solve this\nproblem:\n\n * insert files into 1 directory at a time – can create I/O and lock contention for the directory metadata\n * insert files into many directories at the same time – relieves I/O and lock contention for directory metadata, but increases the amount of metadata caching needed to avoid cache misses\n\nThe --hash-into-dirs parameter is intended to enable simulation of this latter\nmode of operation. 
By default, the value of this parameter is N, and in this\ncase a smallfile thread will sequentially access directories one at a time. In\nother words, the first F files (where F is the value of the --files-per-dir parameter)\nwill be assigned to the top per-thread directory, then the next F files will be\nassigned to the next per-thread directory, and so on. However, if the\n--hash-into-dirs parameter is set to Y, then the number of the file being\naccessed by the thread will be hashed into the set of directories that are\nbeing used by this thread. \n\nRandom file size distribution option\n-------------\n\nIn real life, users don't create files that all have the same size. Typically\nthere is a file size distribution with a majority of small files and a lesser\nnumber of larger files. This benchmark supports use of the random exponential\ndistribution to approximate that behavior. If you specify\n\n     --file-size-distribution exponential --file-size S\n\nthe meaning of the --file-size parameter changes to the maximum file size (S\nKB), and the mean file size becomes S/8. All file sizes are rounded down to the\nnearest kilobyte boundary, and the smallest allowed file size is 1 KB. When\nthis option is used, the smallfile benchmark saves the seed for each thread's\nrandom number generator object in a .seed file stored in the TMPDIR directory\n(typically /var/tmp). This allows the file reader to recreate the sequence of\nrandom numbers used by the file writer to generate file sizes, so that the\nreader knows exactly how big each file should be without asking the file system\nfor this information. The append operation works in the same way. 
All other\noperations are metadata operations and do not require that the file size be\nknown in advance.\n\n\nAsynchronous file copy performance\n---------\n\nWhen we want to measure performance of an asynchronous file copy (example:\nGluster geo-replication), we can use smallfile to create the original directory\ntree, but then we can use the new await-create operation type to wait for files\nto appear at the file copy destination. To do this, we need to specify a\nseparate network sync directory. So for example, to create the original\ndirectory tree, we could use a command like:\n\n    # ./smallfile_cli.py --top /mnt/glusterfs-master/smf \\\n        --threads 16 --files 2000 --file-size 1024 \\\n        --operation create --incompressible Y --record-ctime-size Y\n\nSuppose that this mountpoint is connected to a Gluster “master” volume which is\nbeing geo-replicated to a “slave” volume in a remote site asynchronously. We\ncan measure the performance of this process using a command like this, where\n/mnt/glusterfs-slave is a read-only mountpoint accessing the slave volume.\n\n    # ./smallfile_cli.py --top /mnt/glusterfs-slave/smf \\\n         --threads 16 --files 2000 --file-size 1024 \\\n         --operation await-create --incompressible Y \\\n         --network-sync-dir /tmp/other\n\nRequirements:\n\n* The parameters controlling file sizes, directory tree, and number of files must match in the two commands.\n* The --incompressible option must be set if you want to avoid the situation where the async copy software compresses data and thereby achieves throughput exceeding the network bandwidth.\n* The first command must use the --record-ctime-size Y option so that the await-create operation knows when the original file was created and how big it was.\n\nHow does this work? 
The first command records information in a user-defined xattr for each file so that the second command, the await-create operation, can calculate the time required to copy the file, which is recorded as a “response time”, and so that it knows that the entire file reached the destination.\n\nComparable Benchmarks\n==============\n\nThere are many existing performance test benchmarks. I have tried just about\nall the ones that I've heard of. Here are the ones I have looked at; I'm sure\nthere are many more that I failed to include here.\n\n* Bonnie++ -- works well for a single host, but you cannot generate load from multiple hosts because the benchmark will not synchronize its activities, so different phases of the benchmark will be running at the same time, whether you want them to or not.\n\n* iozone -- this is a great tool for large-file testing, but it can only do 1 file/thread in its current form.\n\n* postmark -- works fine for a single client, not as useful for multi-client tests\n\n* grinder -- has not to date been useful for filesystem testing, though it works well for web services testing.\n\n* JMeter – has been used successfully by others in the past.\n\n* fs_mark -- Ric Wheeler's filesystem benchmark, is very good at creating files\n\n* fio -- Linux test tool -- broader coverage of Linux system calls, particularly around async and direct I/O.  
Now has multi-host capabilities\n\n* diskperf – open-source tool that generates limited small-file workloads for a single host.\n\n* dbench – developed by the Samba team\n\n* SPECsfs – not open-source, but \"netmist\" component has some mixed-workload, multi-host workload generation capabilities, configured similarly to iozone, but with a wider range of workloads.\n\nDesign principles\n=============\n\nA cluster-aware test tool ideally should:\n\n* start threads on all hosts at the same time\n* stop measurement of throughput for all threads at the same time\n* be easy to use in all file system environments\n* be highly portable and be trivial to install\n* have very low overhead\n* not require threads to synchronize (be embarrassingly parallel) \n\nAlthough there may be some useful tests that involve thread synchronization or contention, we don't want the tool to force thread synchronization or contention for resources. \n\nIn order to run prolonged small-file tests (which is a requirement for scalability to very large clusters), each thread has to be able to use more than one directory.   Since some filesystems perform very differently as the files/directory ratio increases, and most applications and users do not rely on having huge file/directory ratios, this is also important for testing the filesystem with a realistic use case.  This benchmark does something similar to Ric Wheeler's fs_mark benchmark with multiple directory levels.   This benchmark imposes no hard limit on how many directories can be used and how deep the directory tree can go.  
Instead, it creates directories according to these constraints:\n\n* files (and directories) are placed as close to the root of the directory hierarchy as possible\n* no directory contains more than the number of files specified in the --files-per-dir test parameter\n* no directory contains more than the number of subdirectories specified in the --dirs-per-dir test parameter\n\n\nSynchronization\n--------------\n\nFor non-Kubernetes environments, \na single directory is used to synchronize the threads and hosts. This may seem\nproblematic, but we assume here that the file system is not very busy when the\ntest is run (otherwise why would you run a load test on it?). So if a file is\ncreated by one thread, it will quickly be visible on the others, as long as the\nfilesystem is not heavily loaded.\n\nIf it's a single-host test, any directory is sharable amongst threads, but in a\nmulti-host test only a directory shared by all participating hosts can be used.\nIf the --top test directory is in a network-accessible file system (could be NFS\nor Gluster for example), then the synchronization directory is by default\nthe network_shared subdirectory and need not be specified. If the\n--top directory is in a host-local filesystem, then the --network-sync-dir option\nmust be used to specify the synchronization directory. When a network directory\nis used, change propagation between hosts cannot be assumed to occur in under\ntwo seconds.\n\nWe use the concept of a \"starting gate\" -- each thread does all preparation for\nthe test, then waits for a special file, the \"starting gate\", to appear in the\nshared area. When a thread arrives at the starting gate, it announces its\narrival by creating a file whose name has the host and thread ID embedded in it. When\nall threads have arrived, the controlling process will see all the expected\n\"thread ready\" files, and will then create the starting gate file. 
When the\nstarting gate is seen, the thread pauses for a couple of seconds, then\ncommences generating workload. This initial pause gives all\nthreads time to see the starting gate, thereby minimizing the chance of some threads\nbeing unable to start on time. Synchronous thread startup reduces the \"warmup\ntime\" of the system significantly.\n\nWe also need a checkered flag (borrowing from the car racing metaphor). Once the test\nstarts, each thread looks for a stonewall file in the synchronization\ndirectory. If this file exists, then the thread stops measuring throughput at\nthis time (but can, and by default does, continue to perform the\nrequested number of operations). Consequently throughput measurements for each\nthread may be added to obtain an accurate aggregate throughput number. This\npractice is sometimes called \"stonewalling\" in the performance testing world.\n\nSynchronization operations in theory do not require the worker threads to read\nthe synchronization directory. For distributed tests, the test driver host has\nto check whether the various per-host synchronization files exist, but this\ndoes not require a readdir operation. The test driver does this check in such a\nway that the number of file lookups is only slightly more than the number of\nhosts, and this does not require reading the entire directory, only doing a set\nof lookup operations on individual files, so it's O(n) scalable as well.\n\nThe bad news is that some filesystems do not synchronize directories quickly\nwithout an explicit readdir() operation, so we are at present doing\nos.listdir() as a workaround -- this may have to be revisited for very large\ntests.\n\n\nTest parameter transmission\n--------\n\nThe results of the command line parse are saved in a smf_test_params object and\nstored in a Python pickle file, which is a representation independent of CPU\narchitecture or operating system. The file is placed in the shared network\ndirectory. 
Remote worker processes are invoked via the smallfile_remote.py\ncommand and read this file to discover test parameters.\n\nLaunching remote worker threads\n----------\n\nWith Kubernetes, smallfile relies on Kubernetes to launch remote \"pods\" that each contain a smallfile_cli.py process.   In the case of the benchmark-operator\nimplementation, the Redis key-value store is used to synchronize these pods so that all pods start running smallfile_cli.py at the same time.\n\nFor multi-host non-Windows non-Kubernetes environments, the test driver launches worker threads using parallel ssh commands to invoke the smallfile_remote.py program, and when this program exits, that is how the test driver discovers that the remote threads on this host have completed.  This works both for bare metal hosts and for virtual machines.\n\nFor Windows environments, ssh usage is more problematic. In Windows, the ssh daemon \"sshd\" requires installation of cygwin, a Windows app that emulates a Linux-like environment, but we really want to test with the native win32 environment instead. For containers, sshd is not typically available as a way to get inside the container.  So a different launching method is used (and this method works on non-Windows environments as well). \n\nFirst you start launch_smf_host.py in each workload generator.    You must specify the --top parameter for each remote host or container.  \n\nFor Windows workload generators, if you are running smallfile_cli.py from a non-Windows host you may need the --substitute-top parameter followed by the Windows path to the top directory, which is usually not the same as in Linux/Unix.  For example:\n\n    % start python launch_smf_host.py --top /mnt/cifs/smf --substitute-top z:\\smf\n\nFor containers, you must specify each daemon's unique ID in the command line - for example, if this is running in a container, then the hostname may not be unique.  
This unique ID will be used by the launch_smf_host.py daemon in the container to search for requests from the test driver to run a test.  For example:\n\n    # ./launch_smf_host.py --top /mnt/sharedfs/smf --host-set container_2\n\nNext, you run smallfile_cli.py with the \"--launch-by-daemon Y\" option and pass --host-set followed by a list of the daemon IDs that you want to participate in the test.  For example:\n\n    # ./smallfile_cli.py --launch-by-daemon Y --host-set container_1,container_2\n\nThis second step will result in a set of files being created in the shared network directory, 1 per daemon, that provide each daemon with the test parameters that it is to use.  The existence of its file tells the daemon to start a test.  Everything else works the same as with the ssh method.\n\nReturning results\n-----------------\n\nFor either single-host or multi-host tests, each test thread is implemented as\na smf_invocation object and all thread state is kept there.  Results are\nreturned by using Python \"pickle\" files to serialize the state of these\nper-thread objects containing details of each thread's progress during the\ntest.  The pickle files are stored in the shared synchronization directory.\n\nsmallfile_cli.py has the option to output all results in a JSON format for easy parsing.  In the case of benchmark-operator, this data is pushed to Elasticsearch as \"documents\" which can then be viewed or visualized with Kibana or Grafana, for example.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdistributed-system-analysis%2Fsmallfile","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdistributed-system-analysis%2Fsmallfile","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdistributed-system-analysis%2Fsmallfile/lists"}