{"id":19388419,"url":"https://github.com/mk-fg/lafs-backup-tool","last_synced_at":"2025-04-23T23:31:39.251Z","repository":{"id":4960358,"uuid":"6117842","full_name":"mk-fg/lafs-backup-tool","owner":"mk-fg","description":"Tool to securely push incremental (think \"rsync --link-dest\") backups to tahoe-lafs","archived":true,"fork":false,"pushed_at":"2016-04-12T13:13:07.000Z","size":552,"stargazers_count":9,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-13T12:29:21.716Z","etag":null,"topics":["automation","backup","compression","deduplication","python","tahoe-lafs","twisted","yaml"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"aljosa/django-tinymce","license":"wtfpl","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mk-fg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-10-08T00:33:26.000Z","updated_at":"2023-11-05T19:49:03.000Z","dependencies_parsed_at":"2022-09-01T07:50:33.480Z","dependency_job_id":null,"html_url":"https://github.com/mk-fg/lafs-backup-tool","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mk-fg%2Flafs-backup-tool","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mk-fg%2Flafs-backup-tool/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mk-fg%2Flafs-backup-tool/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mk-fg%2Flafs-backup-tool/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mk-fg","download_url":"ht
tps://codeload.github.com/mk-fg/lafs-backup-tool/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250532060,"owners_count":21446107,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","backup","compression","deduplication","python","tahoe-lafs","twisted","yaml"],"created_at":"2024-11-10T10:12:38.776Z","updated_at":"2025-04-23T23:31:38.911Z","avatar_url":"https://github.com/mk-fg.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"lafs-backup-tool\n--------------------\n\nTool to securely push incremental (think \"rsync --link-dest\") backups to [Tahoe\nLeast Authority File System](https://tahoe-lafs.org/).\n\n\n\nInstallation\n--------------------\n\nIt's a regular package for Python 2.7 (not 3.X), but not in pypi, so can be\ninstalled from a checkout with something like that:\n\n\t% python setup.py install\n\nBetter way would be to use [pip](http://pip-installer.org/) to install all the\nnecessary dependencies as well:\n\n\t% pip install 'git+https://github.com/mk-fg/lafs-backup-tool.git#egg=lafs-backup-tool'\n\nNote that to install stuff in system-wide PATH and site-packages, elevated\nprivileges are often required.\nUse \"install --user\",\n[~/.pydistutils.cfg](http://docs.python.org/install/index.html#distutils-configuration-files)\nor [virtualenv](http://pypi.python.org/pypi/virtualenv) to do unprivileged\ninstalls into custom paths.\n\nAlternatively, `./lafs-backup-tool` can be run right from the checkout tree,\nwithout any installation.\n\n\n### 
Requirements\n\n* [Python 2.7 (not 3.X)](http://python.org) with sqlite3 support\n* [layered-yaml-attrdict-config](https://github.com/mk-fg/layered-yaml-attrdict-config)\n* [Twisted](http://twistedmatrix.com) - core and web components, plus conch if\n\tmanhole (ssh debug shell, see below) is enabled\n* [CFFI](http://cffi.readthedocs.org) (for fs ACL and capabilities support)\n* (optional) [pyliblzma](https://launchpad.net/pyliblzma) - if xz compression is used\n\nCFFI uses C compiler to generate bindings, so gcc (or other compiler) should be\navailable if you build module from source or run straight from checkout tree.\nSome basic system libraries like \"libacl\" and \"libcap\" are used through CFFI\nbindings and must be present in system at runtime and/or during build as well.\n\n\n\nUsage\n--------------------\n\nFirst of all, make a copy of the [base\nconfiguration](https://github.com/mk-fg/lafs-backup-tool/blob/master/lafs_backup/core.yaml)\n(can be produced without comments by running `lafs-backup-tool dump_config`) and\nset all the required settings there.\nUntouched and uninteresting values can be stripped from there (reasonable\ndefaults from base config will be used), if only for the sake of clarity.\n\nResulting config file might look something like this:\n\n\tsource:\n\t  path: /srv/backups/backup.*\n\t  queue:\n\t    path: /srv/backups/tmp/queue.txt\n\t  entry_cache:\n\t    path: /srv/backups/tmp/dentries.db\n\n\tfilter:\n\t  - '+^/var/log/($|security/)' # backup only that subdir from /var/log\n\t  - '-^/var/(tmp|cache|log|spool)/'\n\t  - '-^/home/\\w+/Downloads/'\n\t  - '-^/tmp/'\n\nAfter that, backup process can be started with `lafs-backup-tool -c\n\u003cpath_to_local_config\u003e backup`.\nWhen all files will be backed-up, LAFS URI of the backup root will be, unless\ndisabled in config, printed to stdout.\n\nIf/when old backup files will be removed from LAFS, `lafs-backup-tool cleanup`\ncommand can be used to purge removed entries from deduplication 
cache\n(\"entry_cache\" settings), unless \"disable_deduplication\" option is used (in\nwhich case no such cleanup is necessary).\n\n`lafs-backup-tool list` command can be used to list finished backups, recorded\nin \"entry_cache\" db file along with their URIs.\n`lafs-backup-tool check` can run deep-check (and renew leases) on these.\n\nCLI reference can be produced by running `lafs-backup-tool --help`.\nFor command-specific options, command in question must be specified,\ne.g. `lafs-backup-tool backup -h`.\n\nAdditional info can be found in \"Implementation details\" section below.\n\n\n\nIdea\n--------------------\n\nIntended use-case is to push most important (chosen by human) parts of already\nexisting and static backups (stored as file trees) to lafs cloud backends.\n\nExcellent [GridBackup project](https://github.com/divegeek/GridBackup) seems to\nbe full of backup-process wisdom, but also carries more complexity and is targeted at a\nbit more (and much more complex) use-cases.\n\ntahoe_backup.py script, shipped with tahoe-lafs, already does most of what I\nwant, missing only the following features:\n\n* Compression.\n\n\tIt has obvious security implications, but as I try hard to exclude\n\tnon-compressible media content from backups, and given very limited amount of\n\tcloud-space I plan to use, advantages are quite significant.\n\n\txz (lzma2) compression is usually deterministic, but I suppose it might break\n\toccasionally on updates, forcing re-upload for some of the files.\n\n\tSee also: [compression\n\ttag](https://tahoe-lafs.org/trac/tahoe-lafs/query?status=!closed\u0026keywords=~compression\u0026order=priority),\n\t[#1354](https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1354).\n\n* Metadata.\n\n\tFilesystem ACLs and Capabilities can and should be properly serialized and\n\tadded to filesystem edges, if present on source filesystem.\n\n* Symlinks.\n\n\tBack these up as small files (containing destination path) with a special\n\tmetadata mark (no mode).\n\n\tSee 
also: [#641](https://tahoe-lafs.org/trac/tahoe-lafs/ticket/641).\n\n* Include / exclude regexp lists, maintained by hand.\n\n* More verbose logging\n\n\tEspecially the timestamps, info about compression and deduplication (which\n\tfiles change), to be able to improve system performance, if necessary.\n\n* Just a cleaner rewrite, as a base for any future ideas.\n\nSome additional ideas that came after the initial implementation:\n\n* Rate limiting.\n\n\tNecessary with free-cloud APIs, which tend to block too-frequent requests, but\n\tmight also be useful to reduce system load (due to compression or crypto) and\n\tnetwork load.\n\n\n\nImplementation details\n--------------------\n\nOnly immutable lafs files/dirnodes are used at the moment, with the exception of\n\"append_to_lafs_dir\" option, which updates list of backup caps in a mutable lafs\ndirectory node.\n\n\n##### Two-phase operation (of \"backup\" command)\n\n* Phase one: generate queue-file with an ordered list of paths of files/dirs and\n\tmetadata to upload.\n\n\tQueue file is a human-readable line-oriented plaintext list with relative\n\tpaths and fs metadata, like this:\n\n\t\ttmp/session_debug.log 1000:1000:100644\n\t\ttmp/root.log 0:0:100600//u::rwx,u:fraggod:rwx,g::r-x,m::rwx,o::r-x\n\t\ttmp 1000:1000:100755\n\t\tbin/skype_notify.sh 1000:1000:100755\n\t\tbin/fs_backup 1000:1000:2750/=;cap_dac_read_search+i\n\t\tbin 1000:1000:100755\n\t\t.netrc 1000:1000:100600\n\t\t 1000:1000:100755\n\n\tFormat of each line is \"path uid:gid:[mode][/caps[/acls]]\".\n\tList is reverse-alpha-sorted.\n\n* Phase two: read queue-file line-by-line and upload each file (checking if it's\n\tnot uploaded already) or create a directory entry to/on the grid.\n\n\tEach uploaded node (and its ro-cap) gets recorded in \"entry_cache\" sqlite db,\n\tkeyed by all the relevant metadata (mtime, size, xattrs, file-path,\n\tcontents-caps, etc), to facilitate both restarts and deduplication.\n\n\tIt doesn't matter in fact if the next time this 
upload will be started from\n\tthe same queue-file or another - same files won't be even considered for\n\tuploading.\n\n\tNote that such \"already uploaded\" state caching assumes that files stay\n\thealthy (i.e. available) in the grid. Appropriate check/repair tools should be\n\tused to ensure that that's the case (see \"check\" action below).\n\nPhases can be run individually - queue-file can be generated with `--queue-only\n[path]` and then just read (skipping (re-)generation) with `--reuse-queue\n[path]` (or corresponding configuration file options).\n\nInterrupted (for any reason) second phase of backup process (actual upload to\nthe grid) can be resumed by just restarting the operation.\n`--reuse-queue` option may be used to speed things up a little (i.e. skip\nbuilding it again from the same files), but is generally unnecessary if\n`source.queue.check_mtime` option is enabled (default).\n\n\n##### Path filter\n\nVery similar to rsync filter lists, but doesn't have merge (include other\nfilter-files) operations and is based on regexps, not glob patterns.\n\nRepresented as a list of exclude/include string-patterns (python regexps) to\nmatch relative (to source.path, starting with \"/\") paths to backup, and must\nstart with '+' or '-' (that character gets stripped from regexp), to include or\nexclude path, respectively.\n\nPatterns are matched against each path in the order they're listed.\n\nLeaf directories are matched with the trailing slash (as with rsync) to be\ndistinguishable from files with the same name.\nDirectories matched by exclude-patterns won't be recursed into (can save a lot\nof iops for cache and tmp paths).\n\nIf path doesn't match any regexp on the list, it will be included.\n\nExample:\n\n\t- '+/\\.git/config$'          # backup git repository config files\n\t- '+/\\.git/info/'            # backup git repository \"info\" directory/contents\n\t- '-/\\.git/'                 # *don't* backup/crawl-over any repository objects\n\t- 
'-/(?i)\\.?svn(/|ignore)$'  # exclude (case-insensitive) svn (or .svn) dirs and ignore-lists\n\t- '-^/tmp/'                  # exclude /tmp path (but not \"/subpath/tmp\")\n\nNote how ordering of these lines makes only some paths within \".git\" directories\nincluded, excluding the rest.\n\nAlso documented in [base\nconfig](https://github.com/mk-fg/lafs-backup-tool/blob/master/lafs_backup/core.yaml).\n\n\n##### Edge metadata\n\nTahoe-LAFS doesn't have a concept like \"file inode\" (metadata container) at the\nmoment, and while it's possible to emulate such a thing with an intermediate file,\nit's also unnecessary, since arbitrary metadata can be stored inside directory\nentries, beside the link to the file contents.\n\nSuch metadata can be easily fetched from urls like\n`http://tahoe-webapi/uri/URI:DIR2-CHK:.../?t=json` (see\ndocs/frontends/webapi.rst).\n\nSingle filenode edge with metadata (dumped as YAML):\n\n\tREADME.md:\n\t  - filenode\n\t  - format: CHK\n\t    metadata:\n\t      enc: xz\n\t      gid: '1000'\n\t      mode: '100644'\n\t      uid: '1000'\n\t    mutable: false\n\t    ro_uri: URI:CHK:...\n\t    size: 1140\n\t    verify_uri: URI:CHK-Verifier:...\n\nMetadata is stored in the same format as in the queue-file (described above).\n\nOne notable addition to the queue-file data here is the \"enc\" key, which in\nexample above indicates that file contents are encoded using xz compression.\nIn case of compression (as with most other possible encodings), \"size\" field\ndoesn't indicate real (decoded) file size.\n\n\n##### Compression\n\nConfigurable via similar pattern-matching mechanism as include/exclude filters\n(`destination.encoding.xz.path_filter` list).\n\nFilters here can be tuples like `[500, '\\.(txt|csv|log)$']` to compress files\nmatching a pattern, but only if size is larger than the given value.\nOtherwise syntax is identical ('+' or '-', followed by python regexp) to\n`filter` config section (see above).\n\nOne operational difference from `filter` is 
that file size is taken into account\nhere, with small-enough files not being compressed, as compressing them generally produces\nlarger output (for file sizes smaller than a few kilobytes, in case of xz\ncompression).\nSee `destination.encoding.xz.min_size` parameter.\n\n\n##### Backup result\n\nResult of the whole \"queue and upload\" operation is a single dircap to a root of\nan immutable directory tree.\n\nIt can be printed to stdout (which isn't used otherwise, though logging can be\nconfigured to use it), appended to some text file or be put into some\nhigher-level mutable directory (with a basename of a source path).\n\nOther than that, it also gets recorded to \"entry_cache\" db along with generation\nnumber for this particular backup, so that it can later be removed along with\nall the cache entries unique to it through the cleanup procedure.\n\nSee \"destination.result\" section of the [base\nconfig](https://github.com/mk-fg/lafs-backup-tool/blob/master/lafs_backup/core.yaml)\nfor more info on these.\n\n\n##### Where do lafs caps end up?\n\nIn some cases, it might be desirable to remove all keys to uploaded data, even\nthough it was read from local disk initially.\n\n* \"result\" destination (stdout, file or some mutable tahoe dir - see above),\n\tnaturally.\n\n* Deduplication \"entry_cache\" db (path is required to be set in\n  \"source.entry_cache\").\n\n\tThat file is queried for the actual plaintext caps, so it's impossible to use\n\thashed (or otherwise irreversibly-mapped) values there.\n\nSo if old data is to be removed from machine where the tool runs, these things\nshould be done:\n\n* Resulting cap should be removed or encrypted (probably with asymmetric crypto,\n\tso there'd be no decryption key on the machine), if it was stored on a local\n\tmachine (e.g. 
appended to a file).\n\tIf it was linked to a mutable tahoe directory, it should be unlinked.\n\n\tProvided \"cleanup\" command can remove caps from any configurable destinations\n\t(file, lafs dir), but only if configuration with regard to respective settings\n\t(\"append_to_file\", \"append_to_lafs_dir\") didn't change since backup and entry\n\tin lafs dir was not renamed.\n\n\tNaturally, if cap was linked to some other directory node manually, it won't\n\tbe removed by the command, same for the actual shares on tahoe nodes.\n\n* \"entry_cache\" db should be removed or encrypted in a similar fashion, or \"cleanup\"\n\tcommand should be used.\n\n\t\"cleanup\" command gets generation number, corresponding to the backup root cap\n\tand removes all the items with that number.\n\n\tWhen an item gets used in a newer backup, it gets its generation number bumped, so\n\tsuch operation is guaranteed to purge any entries used in this backup but not\n\tin any newer ones, which are guaranteed to stay intact.\n\n* If any debug logging was enabled, these logs should be purged, as they may\n\tleak various info about the paths and source file/dir metadata.\n\nOne should also (naturally) beware of sqlite (if it doesn't get removed),\nfilesystem or underlying block device (e.g. 
solid-state drives) retaining the\nthought-to-be-removed data.\n\n\n##### Logging\n\nCan be configured via config files (uses [python logging\nsubsystem](http://docs.python.org/library/logging.html)) and some CLI parameters\n(for convenience - \"--debug\", \"--noise\").\n\n\"noise\" level (which is lower than \"debug\") will have per-path logging (O(n)\nscale), while output from any levels above should be independent of the file/dir\ncount.\n\nLogs should never contain LAFS URIs/capabilities, but with \"noise\" level will\nexpose paths and some metadata information.\n\n\n##### Twisted-based http client\n\nI'm quite fond of [requests](http://docs.python-requests.org/en/latest/) module\nmyself, but ~~unfortunately it doesn't seem to provide streaming uploads of\nlarge files at the moment~~ needed functionality wasn't there before.\n\nPlus twisted is also a basis for tahoe-lafs implementation, so there's a good\nchance it's already available (unlike gevent, used in requests.async /\ngrequests).\n\n\n##### SSH debug manhole\n\nCan be enabled in configuration (\"manhole\" section).\n\nNamespace used there is persistent between connections and contains following\nuseful keys (list might be a bit outdated):\n\n- `config`\n\n\tConfiguration object (AttrDict instance), can be queried by\n\tattributes (e.g. `config.http.ca_certs_files`).\n\n\tChanges *may* have some effect on the running operation, but it's not\n\tguaranteed or supported.\n\n- `lafs_op` - instance of LAFSOperation subclass, representing currently running\n\toperation.\n\n\t- `lafs_op.debug_frame` - python interpreter frame of the long-running loops.\n\n\t\tCan be inspected to get exact line of code that's currently running, locals,\n\t\tglobals, etc. 
Try `help(lafs_op.debug_frame)`.\n\n\t- `lafs_op.debug_timeouts` - set of timeout callbacks for long stateless\n\t\toperations (listed in `operation.timeouts` config section).\n\n\t\tCan be used to simulate timeout condition (and force retry) on running\n\t\toperation that supports timing-out/retry.\n\t\tCallbacks should be present there and can be used manually even if actual\n\t\ttimeout is disabled.\n\n- `optz`, `optz_parser` - argparse namespace and ArgumentParser objects.\n\nThere's also an option (\"on_signal\") to create manhole socket only after\nreceiving signal, so that it'd be more secure and multiple invocations of the\ntool with the same configuration won't try to bind the same socket.\n\nSee python 2.X [data\nmodel](http://docs.python.org/2/reference/datamodel.html#index-59) and\n[inspect](http://docs.python.org/2/library/inspect.html) doc sections on how to\ndebug running code.\n\nNote that it should also be easy to reconfigure (e.g. set it to debug level, add\nlogfile handler, etc) logging from there.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmk-fg%2Flafs-backup-tool","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmk-fg%2Flafs-backup-tool","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmk-fg%2Flafs-backup-tool/lists"}