{"id":17680346,"url":"https://github.com/docelic/rsyncnow","last_synced_at":"2025-03-30T18:48:44.935Z","repository":{"id":63591601,"uuid":"569018045","full_name":"docelic/rsyncnow","owner":"docelic","description":"Fast rsync indexing/syncing for enormous data sets","archived":false,"fork":false,"pushed_at":"2022-11-23T01:52:12.000Z","size":21,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-05T21:33:43.231Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/docelic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-11-21T22:45:02.000Z","updated_at":"2022-11-22T13:10:48.000Z","dependencies_parsed_at":"2023-01-22T04:35:55.220Z","dependency_job_id":null,"html_url":"https://github.com/docelic/rsyncnow","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docelic%2Frsyncnow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docelic%2Frsyncnow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docelic%2Frsyncnow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docelic%2Frsyncnow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/docelic","download_url":"https://codeload.github.com/docelic/rsyncnow/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246365640,"owners_count":20765
546,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-24T09:06:40.071Z","updated_at":"2025-03-30T18:48:44.669Z","avatar_url":"https://github.com/docelic.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# rsyncnow\n\nThis tool is for you if you have a need to rsync data sets so large\nthat just building the index of files to sync takes days or weeks.\n\n## Is it any good?\n\nYES.\n\n## The rsync problem\n\nRsync's delta algorithm doesn't need any changes - once a file to sync\nis identified, the algorithm does its job well (or the whole file is\ncopied with rsync option `-W`).\n\nBut when rsync is told to sync two directories, before it begins with\nthe actual syncing, it builds an index of all the files that need to be\nsynced.\n\nOn large data sets, this index building can take hours, days, or weeks.\n\nIn addition to just taking time, it is making it harder to schedule data\nmigrations at the end/beginning of months when the extra traffic will not\ninfluence the month's 95th percentile and is essentially free.\n\nAlso, it is making it harder to fully sync the source and destination if\nthe source is still being modified or uploaded to, because by the time\nrsync completes, the destination is already out of date and requires\nanother sync.\n\n## The gist of rsyncnow operation\n\nSo how does `rsyncnow` help?\n\nThe above-described behavior of rsync, in which it first builds an index\nand then starts syncing, cannot be changed.\n\nHowever, by using the appropriate command line options, rsync does support\na mode 
where it will print the files that need syncing to STDOUT without\ndelay (it will print them immediately as it finds/identifies them, during\nindex building).\n\nThis enables `rsyncnow` to introduce a huge increase in efficiency as follows:\n\n1. It runs a set of `rsync` processes (1 for every source path)\nthat will be finding files to sync (in dry run mode) and printing them to\nSTDOUT as a stream.  We call these processes `finders`.\n\n1. As finders keep printing files to sync, `rsyncnow` keeps reading\nthem and pushing them to a small internal queue.\n\n1. As soon as `rsyncnow` finds enough paths to sync in a batch\n(or every X seconds if a batch has not been filled up yet), it runs\nseparate rsync processes (called `syncers`) which are given those\nspecific files to sync. Syncers start syncing immediately since they\nare given specific paths, there are no indexes to build.\n\n1. Additionally, if the files to sync are being found faster than they\nare synced, and the bandwidth/resource limits allow it, one can run\n`rsyncnow` with multiple `syncer` processes to achieve even\nfaster/concurrent syncing of multiple files.\n\n## Usage instructions\n\nYou need Ruby installed to run the script. Hopefully this is a trivial requirement.\n\n\n```\nUsage: rsyncnow [OPTIONS...] SRC... DST -- [FIND OPTIONS...] -- [SYNC OPTIONS...]\n\nOPTIONS:\n  -f, --finders 1    - Number of rsync find processes. Currently always gets\n                       reset to the number of specified SRC paths\n  -s, --syncers 1    - Nr. of respawning rsync sync/copy processes, per finder\n  -b, --batchsize 5  - Number of files to collect in a batch and sync\n  -q, --queuesize 50 - Max number of paths to queue for sync. If not specified,\n                       defaults to batchsize * 10. 
Rsync find processes get\n                       automatically paused when their queue goes above this\n                       limit and are resumed when queue falls below threshold\n  -t, --timeout 5.0  - After timeout seconds, run rsync sync/copy process even\n                       if a batch isn't full\n\n  -r, --rsync rsync  - Name (and/or path) of rsync binary\n\n  -v, --verbose      - Enable rsyncnow and rsync verbose mode\n  -h, --help         - Show help and exit\n  -e, --examples     - Show examples and exit\n\nSRC, DST:\n  Rsync source and destination as usual (including the \"/\" magic)\n\nFIND OPTIONS:\n  If specified, overrides all default cmdline options for rsync find processes.\n  If you use this, options `-niR` must always be present/included.\n  Default value: -aniRe=ssh\n\nSYNC OPTIONS:\n  If specified, overrides all default cmdline options for rsync sync processes.\n  If you use this, options `-0 --files-from=-` must always remain present.\n  Default value: -lptgoD0e=ssh --files-from=-\n    NOTE: options -lptgoD are used explicitly instead just specifying -a\n    because -a also includes option -r which should not be present. Recursion\n    is controlled via FIND OPTIONS (where it is enabled/implied by -a)\n\nEXAMPLES:\n\n# Most basic example:\n# (implies finding files to sync with rsync options -aniRe=ssh,\n# and syncing the actual files with rsync options -lptgoD0e=ssh --files-from=-)\nrsyncnow -v /source/dir /target/dir\n\n# Finding files with size differences only, without full checksum (--size-only), and\n# syncing them by copying, without using rsync's delta algorithm (-W):\nrsyncnow -v /source/dir /target/dir -- -aniRe=ssh --size-only -- -lptgoD0e=ssh --files-from=- -W\n```\n\n## Notes on options -b, -q, -t\n\nOption `-b` (`--batchsize`) organizes files to sync in batches to reduce the number of\n`rsync` process invocations. 
(If one specifies `-b 1`, then a separate process is\nstarted for every file that is synced.)\n\nOption `-q` (`--queuesize`) defines the maximum internal queue size. Finder processes are automatically paused\nif they fill up the queue to this limit (i.e. if they are finding files to sync much\nfaster than the syncers are able to process them). This option doesn't primarily exist\nto save RAM, but to stop finders from finding all the files quickly and finishing\nthe directory traversal much sooner than syncers will be done with syncing. Namely,\nas long as syncers are syncing the files, the whole syncing process isn't over anyway,\nso by slowing down finders (by spreading their work over more time), we\nincrease the chance that any changes in the source directories are picked up on the\nfirst run of `rsyncnow`.\n\nFinally, regarding option `-t` (`--timeout`): if batch size is set to a large value, or if files to\nsync are found only rarely (e.g. if the source and destination are fairly well synced\nalready), then it makes sense to just sync whatever paths have been found every X seconds,\nrather than letting the process of finding files go on for too long without syncing\nanything in the meantime.\n\n## Misc notes\n\nCurrently, one rsync finder process is always started for each\nsource directory, and all finders run concurrently. 
If you don't want multiple finders running\nat the same time (for example if all source directories to sync are on the\nsame partition), you should call `rsyncnow` multiple times with 1 source path\nin every invocation instead of once with multiple source paths.\n\nRsyncnow doesn't put any restrictions on the rsync options that one can use in\neither find or sync phase (options related to comparing/finding files,\nwhat to copy/sync, max bandwidth to use etc.).\nSee a myriad of options available in the [rsync man page](https://download.samba.org/pub/rsync/rsync.1).\n\n## Feedback\n\nPlease report any comments or suggestions!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdocelic%2Frsyncnow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdocelic%2Frsyncnow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdocelic%2Frsyncnow/lists"}