{"id":13695616,"url":"https://github.com/richfelker/bakelite","last_synced_at":"2025-05-07T12:03:55.540Z","repository":{"id":42485742,"uuid":"406952156","full_name":"richfelker/bakelite","owner":"richfelker","description":"Incremental backup with strong cryptographic confidentiality baked into the data model.","archived":false,"fork":false,"pushed_at":"2023-10-18T13:44:15.000Z","size":143,"stargazers_count":123,"open_issues_count":2,"forks_count":4,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-05-02T04:17:56.016Z","etag":null,"topics":["backup","cryptography"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/richfelker.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2021-09-15T23:27:08.000Z","updated_at":"2024-03-06T05:15:39.000Z","dependencies_parsed_at":"2024-01-13T16:45:50.970Z","dependency_job_id":null,"html_url":"https://github.com/richfelker/bakelite","commit_stats":{"total_commits":93,"total_committers":1,"mean_commits":93.0,"dds":0.0,"last_synced_commit":"93856ba6763e60949cd71fdfe44352a12842443f"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/richfelker%2Fbakelite","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/richfelker%2Fbakelite/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/richfelker%2Fbakelite/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/richfelker%2Fbakelite/manifests","owner_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/richfelker","download_url":"https://codeload.github.com/richfelker/bakelite/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225091438,"owners_count":17419496,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backup","cryptography"],"created_at":"2024-08-02T18:00:31.011Z","updated_at":"2024-11-17T21:17:51.903Z","avatar_url":"https://github.com/richfelker.png","language":"C","funding_links":[],"categories":["\u003ca name=\"backup\"\u003e\u003c/a\u003ebackup","C"],"sub_categories":[],"readme":"# Bakelite\n\nIncremental backup with strong cryptographic confidentiality baked\ninto the data model. In a small package, with no dependencies.\n\n**This project is still experimental!** Things may break or change.\nSee below on status.\n\n## Features\n\n- Designed around public key cryptography such that decryption key can\n  be kept offline, air-gapped.\n\n- Backup to local or remote storage with arbitrary transport.\n\n- Incremental update built on inode identity and hashed block\n  contents, compatible with moving and reorganizing entire trees.\n\n- Data deduplication.\n\n- Low local storage requirements for change tracking -- roughly 56-120\n  bytes per file plus 0.1-5% of total data size.\n\n- Live-streamable to storage. Compatible with append-only media. 
No\n  local storage required for staging a backup that will be stored\n  remotely.\n\n- Optional support for blinded garbage-collection of blobs on the\n  storage host side.\n\n- Written entirely in C with no library dependencies. Requires no\n  installation.\n\n- Built on modern cryptographic primitives: Curve25519 ECDH, ChaCha20,\n  and SHA-3.\n\n\n\n## Status\n\nBakelite is presently experimental and is a work in progress. The\nabove-described features are all present, but have not been subjected\nto third party review or extensive testing. Moreover, many advanced\nfeatures normally expected in backup software, like controls over\ninclusion/exclusion of files, are not yet available. The codebase is\nalso in transition from a rapidly developed proof of concept to\nsomething more mature and well-factored.\n\nData formats may be subject to change. If attempting to use Bakelite\nas part of a real backup workflow, you should keep note of the\nparticular version used in case it's needed for restore. Note that the\nactual backup format is more mature and stable than the configuration\nand local index format, so the more likely mode of breakage when\nupgrading is needing to start a new full (non-incremental) backup, not\ninability to read old backups.\n\n\n\n## Why another backup program?\n\nBackups are inherently attack surface on the privacy/confidentiality\nof one's data. For decades I've looked for a backup system with the\nright cryptographic properties to minimize this risk, and turned up\nnothing, leaving me reliant on redundant copies of important things\n(the \"Linus backup strategy\") rather than universal system-wide\nbackups. 
After some moderately serious data loss, I decided it was\ntime to finally write what I had in mind.\n\nAmong the near-solutions out there, some required large library or\nlanguage runtime dependencies, some only worked with a particular\nvendor's cloud storage service, and some had poor change tracking that\nfailed to account for whole trees being moved or renamed. But most\nimportantly, none with incremental capability addressed the\ncatastrophic loss of secrecy of **all past and current data** in the\nevent that the encryption key was exposed.\n\n\n\n\n\n\n## Data model\n\nA backup image is a Merkle tree of nodes representing directories,\nfiles, and file content blocks, with each node identified by a SHA-3\nhash of its *encrypted contents*, and the root of the tree referenced\nby a signed summary record. For readers familiar with the git data\nmodel, this is very much like a git *tree* (not *commit*) but with the\nobjects encrypted. Multiple trees can share common subtrees. This is\nhow incremental backups are represented, and is analogous to how git\ncommits share subtrees. Backup snapshots are not like git commits\nhowever; they do not reference each other or have parent/child\nrelationships. This allows arbitrary retention policies to be\nimplemented without breaking any Merkle tree reference chains.\n\nSince there is no way for the system being backed up to \"read back\"\nfrom the backups when it doesn't hold the private decryption key, a\n\"local index\" is kept to track how objects in storage correspond to\nthe live filesystem contents. It is a key/value dictionary mapping\n(device,inode) pairs and hashes of *unencrypted* file content blocks\nto the corresponding encrypted object hashes. The local index does not\nneed to be stored with the backup, and should not be. 
A party who has\nread access to the local index can probe whether known data was\npresent on the filesystem at the time of last backup, which inode(s)\n(thereby which files, if they exist in listable directories) contained\nthat data, and which inode(s) share common contents. (Note that these\nare exactly the capabilities needed for deduplicating of data within\nand between snapshots.)\n\n\n\n\n\n\n## Intended security level properties\n\n- If neither private nor public key is exposed (perspective of backup\n  storage provider), breaking confidentiality of backup depends on\n  breaking ChaCha20.\n\n- If the public key is exposed (for example, via breach on the system\n  being backed up), breaking confidentiality of backup depends on\n  breaking ChaCha20 or solving the computational discrete logarithm\n  problem on Curve25519.\n\n- Breaking integrity of backup depends on breaking second-preimage\n  resistance of SHA-3 or breaking the signing algorithm used\n  (signature forgery). The latter admits only complete tree\n  replacement, not selective modification.\n  \n\n\n\n## Setup\n\n1. Key generation. This step does not need to be done on the system\n   that will be backed up, and should be done on a system you\n   absolutely trust -- both to have a working cryptographic entropy\n   source, and not to expose data. Choose a place to store the secret\n   key, such as an encrypted removable device, and run:\n\n        bakelite genkey backup.sec\n        bakelite pubkey backup.sec \u003e backup.pub\n\n   Then copy `backup.pub` to the system(s) you want to back up using\n   this key.\n\n2. Initialization. On the system to be backed up, create an empty\n   directory and run:\n\n        bakelite init /path/to/backup.pub /path/needing/backup\n\n   This will create a skeleton configuration in the current working\n   directory. All further steps should be performed from this\n   directory.\n\n3. Configure storage. 
Edit the `store_cmd` script produced by\n   `bakelite init` to something that will accept data in tar format\n   and write it to the desired storage, reporting success or failure\n   via exit status. For example, for local storage to mounted media:\n\n        tar -C /media/backup -kxf -\n\n   or appending to a tape drive:\n\n        cat \u003e\u003e /dev/nst0\n\n   or to a remote host via ssh:\n\n        ssh backup@remotehost\n\n   In the latter (ssh) case, the remote `authorized_keys` file should\n   force a `command` that stores the tar stream appropriately and\n   disallows overwrite of existing data. For example:\n\n        command=\"tar -C /media/backup -kxf -\" ssh-ed25519 AAAA...\n\n4. Configure devices. Normally, Bakelite will not traverse mount\n   points to other devices; this avoids accidentally including\n   transient mounts of external media or remote shares into a backup\n   they don't belong in. If you want to include additional mounts,\n   create a symlink to the root of each in the directory named\n   \"devices\". The symlink name will serve as a \"label\" for the device\n   used in the local indexing, so that changes to device numbering\n   across reboots do not break the index. For example:\n\n        ln -s /home devices/home\n        ln -s /var devices/var\n\n5. Configure signatures. Create an executable `sign_cmd` file that\n   accepts data to sign on stdin and produces a signature file on\n   stdout. For example, to use `signify`:\n\n        signify -S -s signing.sec -x - -m -\n\n6. Additional configuration. Edit the `config` file to change any\n   other preferences as desired. It's recommended to at least set a\n   `label` for the backup so that the signed summary files will be\n   associated with a particular role/identity, unless separate signing\n   keys will be used for each tree being backed up.\n\n   To exclude files matching certain patterns from backups, create a\n   file named `exclude` containing one pattern per line. 
Patterns are\n   a superset of standard glob pattern functionality, intended to\n   match `.gitignore` conventions, except that inversion using leading\n   `!` is not supported. In particular, `**` can be used to match\n   zero or more path components, final `/` forces only directories to\n   match, and patterns with no `/` (except possibly a final one) can\n   match in any directory (they have an implicit `**/` prefix).\n\n   If the directory containing backup configuration is included in the\n   backup, it is recommended to exclude `index*` from this directory,\n   since the `index` will be out-of-date at the time of backup and\n   `index.pending` will be incomplete. Instead of backing it up, the\n   `index` file can be recreated at restore time if desired.\n\n7. Run the first backup.\n\n        bakelite backup -v\n\n   The `-v` (verbose) flag is helpful to see what's happening,\n   especially for new or changed configurations. However, it does\n   expose information about filesystem contents/changes. Setups aiming\n   to maximize privacy should not use it in an automated setting with\n   logging.\n\n   When the job is finished, a text file named according to the label\n   and UTC backup timestamp, in the form\n   `label-yyyy-mm-ddThhmmss.nnnnnnnnnZ.txt`, should be present on\n   the backup storage medium, along with a number of files with hex\n   string names in the `objects` directory. A `.sig` file will be\n   present too if signing was configured.\n\n8. Set up a cron job to perform further backups on the desired\n   schedule. For example:\n\n        0 2 * * * bakelite -C /path/to/configuration/dir backup\n\n\n## Restoring\n\nIt's recommended to test that you are able to restore backups. On a\nsystem with the secret key available, run the `restore` command, as\nin:\n\n    bakelite restore -v -k backup.sec -d /dest/path summary_file.txt\n\nIf the secret key is protected by a passphrase you will need to enter\nit. 
(Note: passphrase-protected key files are not yet implemented.)\n\nBy default, objects are searched for in `objects/` relative to the\nlocation of the summary file (the same as the default tree layout in\nthe tar stream emitted by the `backup` command for storage).\n\nIf you wish to continue incremental use of the backup after restore,\nyou will need to rebuild the local index as part of the restore\noperation, using the `-i index` option.\n\nDuring testing, original and restored trees should be compared, either\ndirectly with a tool like `diff` or by recursively printing hashes and\nmetadata with `find`, `ls -lR`, or similar and diffing the output, to\nsatisfy oneself that the backup was faithful and faithfully restored.\n\nNote that non-POSIX metadata such as extended attributes is not yet\nstored in the backup inode records or restored. Support for this\nfunctionality may be added in the future.\n\n\n\n## Managing storage\n\nIt's entirely possible to treat the backup storage as append-only,\ncycling media and performing a new full backup periodically or when\nthe media fills up. However, it's also possible to use a running\nincremental backup indefinitely without filling up storage, by\nscripting a retention policy to delete old summary files and prune\n(garbage collect) data objects that the remaining snapshots do not\nreference. This can be done *without access to the backup contents*\n(i.e. 
without the private key) via bloom filters attached to each\nsummary.\n\nFrom the directory containing the summary records and `objects`\ndirectory, run:\n\n    bakelite prune *.txt\n\nThis will output a list of relative object file pathnames which are\nnot referenced by any of `*.txt`, which can be fed into `xargs` to\nactually delete them.\n\nBakelite includes a simple retention policy for backup summaries via the\n`cull` subcommand:\n\n    bakelite cull -d7 -w4 -m4 -y1 *.txt\n\nwill scan the timestamps in the provided files (`*.txt`) and output a\nlist of which can be deleted while still keeping the latest from each\nof the last 7 days, 4 weeks, 4 months, and 1 year. In addition, all in\na given range of seconds (default 86400, selected by `-r`) before the\nlatest will be kept. As with `prune`, the output of `bakelite cull`\ncan be piped into `xargs` for actual deletion (or moving files to\nstage them for deletion).\n\nEventually, `cull` will include label matching and signature checking\nto ensure that untrusted or erroneously included files from another\nbackup are not used in computing the results, but this functionality\nis not yet implemented.\n\nBy scheduling the appropriate `cull` and `prune` commands on a backup\nstorage host, continuous incremental backups can be kept in bounded\nstorage space.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frichfelker%2Fbakelite","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frichfelker%2Fbakelite","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frichfelker%2Fbakelite/lists"}