https://github.com/richfelker/bakelite

# Bakelite

Incremental backup with strong cryptographic confidentiality baked
into the data model. In a small package, with no dependencies.

**This project is still experimental!** Things may break or change.
See below on status.

## Features

- Designed around public key cryptography such that the decryption key
can be kept offline, air-gapped.

- Backup to local or remote storage with arbitrary transport.

- Incremental update built on inode identity and hashed block
contents, compatible with moving and reorganizing entire trees.

- Data deduplication.

- Low local storage requirements for change tracking -- roughly 56-120
bytes per file plus 0.1-5% of total data size.

- Live-streamable to storage. Compatible with append-only media. No
local storage required for staging a backup that will be stored
remotely.

- Optional support for blinded garbage-collection of blobs on the
storage host side.

- Written entirely in C with no library dependencies. Requires no
installation.

- Built on modern cryptographic primitives: Curve25519 ECDH, ChaCha20,
and SHA-3.

## Status

Bakelite is presently experimental and is a work in progress. The
above-described features are all present, but have not been subjected
to third-party review or extensive testing. Moreover, many advanced
features normally expected in backup software, like controls over
inclusion/exclusion of files, are not yet available. The codebase is
also in transition from a rapidly developed proof of concept to
something more mature and well-factored.

Data formats may be subject to change. If attempting to use Bakelite
as part of a real backup workflow, you should keep note of the
particular version used in case it's needed for restore. Note that the
actual backup format is more mature and stable than the configuration
and local index format, so the more likely mode of breakage when
upgrading is needing to start a new full (non-incremental) backup, not
inability to read old backups.

## Why another backup program?

Backups are inherently an attack surface on the privacy/confidentiality
of one's data. For decades I've looked for a backup system with the
right cryptographic properties to minimize this risk, and turned up
nothing, leaving me reliant on redundant copies of important things
(the "Linus backup strategy") rather than universal system-wide
backups. After some moderately serious data loss, I decided it was
time to finally write what I had in mind.

Among the near-solutions out there, some required large library or
language runtime dependencies, some only worked with a particular
vendor's cloud storage service, and some had poor change tracking that
failed to account for whole trees being moved or renamed. But most
importantly, none with incremental capability addressed the
catastrophic loss of secrecy of **all past and current data** in the
event that the encryption key was exposed.

## Data model

A backup image is a Merkle tree of nodes representing directories,
files, and file content blocks, with each node identified by a SHA-3
hash of its *encrypted contents*, and the root of the tree referenced
by a signed summary record. For readers familiar with the git data
model, this is very much like a git *tree* (not *commit*) but with the
objects encrypted. Multiple trees can share common subtrees. This is
how incremental backups are represented, and is analogous to how git
commits share subtrees. Backup snapshots are not like git commits
however; they do not reference each other or have parent/child
relationships. This allows arbitrary retention policies to be
implemented without breaking any Merkle tree reference chains.
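
To make the shared-subtree idea concrete, here is a minimal shell sketch of a content-addressed object store. It is not Bakelite's implementation: `sha256sum` stands in for SHA-3, nothing is encrypted, and the `put_object` helper and directory layout are hypothetical. It only illustrates why two snapshots containing the same block end up referencing a single stored object.

```shell
# Illustrative only: sha256sum stands in for SHA-3 and objects are stored
# unencrypted. put_object is a hypothetical helper, not a Bakelite command.

# put_object STORE FILE: store FILE under the hex hash of its contents and
# print that hash. An object that already exists is left untouched.
put_object() {
    store=$1 file=$2
    id=$(sha256sum < "$file" | cut -d' ' -f1)
    mkdir -p "$store/objects"
    [ -e "$store/objects/$id" ] || cp "$file" "$store/objects/$id"
    printf '%s\n' "$id"
}
```

Calling `put_object` twice on identical content prints the same identifier and leaves one file in `objects/`, which is the property that lets separate snapshots share subtrees.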

Since there is no way for the system being backed up to "read back"
from the backups when it doesn't hold the private decryption key, a
"local index" is kept to track how objects in storage correspond to
the live filesystem contents. It is a key/value dictionary mapping
(device,inode) pairs and hashes of *unencrypted* file content blocks
to the corresponding encrypted object hashes. The local index does not
need to be stored with the backup, and should not be. A party who has
read access to the local index can probe whether known data was
present on the filesystem at the time of last backup, which inode(s)
(thereby which files, if they exist in listable directories) contained
that data, and which inode(s) share common contents. (Note that these
are exactly the capabilities needed for deduplication of data within
and between snapshots.)

## Intended security level properties

- If neither private nor public key is exposed (perspective of backup
storage provider), breaking confidentiality of backup depends on
breaking ChaCha20.

- If the public key is exposed (for example, via breach on the system
being backed up), breaking confidentiality of backup depends on
breaking ChaCha20 or solving the computational discrete logarithm
problem on Curve25519.

- Breaking integrity of backup depends on breaking second-preimage
resistance of SHA-3 or breaking the signing algorithm used
(signature forgery). The latter admits only complete tree
replacement, not selective modification.

## Setup

1. Key generation. This step does not need to be done on the system
that will be backed up, and should be done on a system you
absolutely trust -- both to have a working cryptographic entropy
source, and not to expose data. Choose a place to store the secret
key, such as an encrypted removable device, and run:

       bakelite genkey backup.sec
       bakelite pubkey backup.sec > backup.pub

Then copy `backup.pub` to the system(s) you want to back up using
this key.

2. Initialization. On the system to be backed up, create an empty
directory and run:

       bakelite init /path/to/backup.pub /path/needing/backup

This will create a skeleton configuration in the current working
directory. All further steps should be performed from this
directory.

3. Configure storage. Edit the `store_cmd` script produced by
`bakelite init` to something that will accept data in tar format
and write it to the desired storage, reporting success or failure
via exit status. For example, for local storage to mounted media:

       tar -C /media/backup -kxf -

or appending to a tape drive:

       cat >> /dev/nst0

or to a remote host via ssh:

       ssh backup@remotehost

In the latter (ssh) case, the remote `authorized_keys` file should
force a `command` that stores the tar stream appropriately and
disallows overwrite of existing data. For example:

       command="tar -C /media/backup -kxf -" ssh-ed25519 AAAA...

4. Configure devices. Normally, Bakelite will not traverse mount
points to other devices; this avoids accidentally including
transient mounts of external media or remote shares into a backup
they don't belong in. If you want to include additional mounts,
create a symlink to the root of each in the directory named
"devices". The symlink name will serve as a "label" for the device
used in the local indexing, so that changes to device numbering
across reboots do not break the index. For example:

       ln -s /home devices/home
       ln -s /var devices/var

5. Configure signatures. Create an executable `sign_cmd` file that
accepts data to sign on stdin and produces a signature file on
stdout. For example, to use `signify`:

       signify -S -s signing.sec -x - -m -

6. Additional configuration. Edit the `config` file to change any
other preferences as desired. It's recommended to at least set a
`label` for the backup so that the signed summary files will be
associated with a particular role/identity, unless separate signing
keys will be used for each tree being backed up.

To exclude files matching certain patterns from backups, create a
file named `exclude` containing one pattern per line. Patterns are
a superset of standard glob pattern functionality, intended to
match `.gitignore` conventions, except that inversion using leading
`!` is not supported. In particular, `**` can be used to match
zero or more path components, final `/` forces only directories to
match, and patterns with no `/` (except possibly a final one) can
match in any directory (they have an implicit `**/` prefix).

If the directory containing backup configuration is included in the
backup, it is recommended to exclude `index*` from this directory,
since the `index` will be out-of-date at the time of backup and
`index.pending` will be incomplete. Instead of backing it up, the
`index` file can be recreated at restore time if desired.

7. Run the first backup.

       bakelite backup -v

The `-v` (verbose) flag is helpful to see what's happening,
especially for new or changed configurations. However, it does
expose information about filesystem contents/changes. Setups aiming
to maximize privacy should not use it in an automated setting with
logging.

When the job is finished, a text file named according to the label
and UTC backup timestamp, in the form
`label-yyyy-mm-ddThhmmss.nnnnnnnnnZ.txt`, should be present on
the backup storage medium, along with a number of files with hex
string names in the `objects` directory. A `.sig` file will be
present too if signing was configured.

8. Set up a cron job to perform further backups on the desired
schedule. For example:

       0 2 * * * bakelite -C /path/to/configuration/dir backup
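
As a sketch of what step 3 asks for, the following wraps the local-storage example in a shell function that checks the destination first and passes `tar`'s exit status through. The function name `store_to_dir` and the default `/media/backup` path are illustrative assumptions; in a real setup the body would live in the `store_cmd` file.

```shell
# Hypothetical store_cmd body. Reads a tar stream on stdin and unpacks it
# under DEST without replacing files that already exist there.
store_to_dir() {
    dest=${1:-/media/backup}   # assumed mount point, adjust as needed

    # Fail early (nonzero status) if the destination is missing, e.g.
    # because the backup medium is not mounted.
    if [ ! -d "$dest" ] || [ ! -w "$dest" ]; then
        echo "store_cmd: $dest is not a writable directory" >&2
        return 1
    fi

    # -k keeps existing files instead of overwriting them; tar's exit status
    # becomes the function's status, which is how failure is reported.
    tar -C "$dest" -kxf -
}
```

`tar` implementations may differ on whether `-k` reports an already existing file as an error or skips it, so it is worth checking how yours behaves before relying on the exit status.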

## Restoring

It's recommended to test that you are able to restore backups. On a
system with the secret key available, run the `restore` command, as
in:

    bakelite restore -v -k backup.sec -d /dest/path summary_file.txt

If the secret key is protected by a passphrase you will need to enter
it. (Note: passphrase-protected key files are not yet implemented.)

By default, objects are searched for in `objects/` relative to the
location of the summary file (the same as the default tree layout in
the tar stream emitted by the `backup` command for storage).

If you wish to continue incremental use of the backup after restore,
you will need to rebuild the local index as part of the restore
operation, using the `-i index` option.

During testing, original and restored trees should be compared, either
directly with a tool like `diff` or by recursively printing hashes and
metadata with `find`, `ls -lR`, or similar and diffing the output, to
satisfy oneself that the backup was faithful and faithfully restored.
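
The comparison described above could be scripted roughly as follows. This is a sketch, not part of Bakelite: `tree_manifest` and `same_tree` are hypothetical helpers, `sha256sum` is just a convenient checksum tool, and only paths and contents of regular files are compared (ownership, modes, and timestamps would need an extra `ls -lR` or similar pass).

```shell
# tree_manifest DIR: print "hash  ./relative/path" for every regular file
# under DIR, sorted by path so two manifests can be compared with diff.
tree_manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# same_tree A B: succeed if both trees contain the same regular files with
# the same contents; otherwise print the differing manifest lines.
same_tree() {
    a=$(mktemp) b=$(mktemp)
    tree_manifest "$1" > "$a"
    tree_manifest "$2" > "$b"
    status=0
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Run `same_tree` with the original and restored roots as its two arguments; any output lists files whose presence or contents differ.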

Note that non-POSIX metadata such as extended attributes is not yet
stored in the backup inode records or restored. Support for this
functionality may be added in the future.

## Managing storage

It's entirely possible to treat the backup storage as append-only,
cycling media and performing a new full backup periodically or when
the media fills up. However, it's also possible to use a running
incremental backup indefinitely without filling up storage, by
scripting a retention policy to delete old summary files and prune
(garbage collect) data objects that the remaining snapshots do not
reference. This can be done *without access to the backup contents*
(i.e. without the private key) via bloom filters attached to each
summary.

From the directory containing the summary records and `objects`
directory, run:

    bakelite prune *.txt

This will output a list of relative object file pathnames which are
not referenced by any of `*.txt`, which can be fed into `xargs` to
actually delete them.
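
A sketch of that pipeline, assuming the list is printed one pathname per line. `delete_listed` and the `DRY_RUN` variable are hypothetical conveniences, not part of Bakelite; the dry-run mode makes it possible to review the list before anything is removed.

```shell
# delete_listed: read newline-separated pathnames on stdin and delete them.
# With DRY_RUN=1 set, only print what would be removed.
delete_listed() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        sed 's/^/would delete: /'
    else
        # NUL-delimit so unusual characters in names are passed through intact.
        tr '\n' '\0' | xargs -0 rm -f --
    fi
}

# Intended use, from the directory holding the summaries and objects/:
#   bakelite prune *.txt | DRY_RUN=1 delete_listed
#   bakelite prune *.txt | delete_listed
```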

Bakelite includes a simple retention policy for backup summaries via the
`cull` subcommand:

    bakelite cull -d7 -w4 -m4 -y1 *.txt

will scan the timestamps in the provided files (`*.txt`) and output a
list of those that can be deleted while still keeping the latest from each
of the last 7 days, 4 weeks, 4 months, and 1 year. In addition, all in
a given range of seconds (default 86400, selected by `-r`) before the
latest will be kept. As with `prune`, the output of `bakelite cull`
can be piped into `xargs` for actual deletion (or moving files to
stage them for deletion).
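
For the staging variant, a hedged sketch: `stage_listed` is a hypothetical helper that moves each listed summary into a holding directory (`stale` here, an arbitrary name) instead of deleting it. Any accompanying `.sig` files are not handled.

```shell
# stage_listed DIR: read newline-separated pathnames on stdin and move each
# one into DIR, creating DIR if needed. Nothing is deleted.
stage_listed() {
    stage=$1
    mkdir -p "$stage" || return 1
    while IFS= read -r f; do
        [ -e "$f" ] || continue   # skip names that no longer exist
        mv -- "$f" "$stage"/ || return 1
    done
}

# Intended use, from the directory holding the summaries:
#   bakelite cull -d7 -w4 -m4 -y1 *.txt | stage_listed stale
```

Once moved, the staged summaries no longer match `*.txt` in the current directory, so a later `bakelite prune *.txt` would treat objects referenced only by them as unreferenced.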

Eventually, `cull` will include label matching and signature checking
to ensure that untrusted or erroneously included files from another
backup are not used in computing the results, but this functionality
is not yet implemented.

By scheduling the appropriate `cull` and `prune` commands on a backup
storage host, continuous incremental backups can be kept in bounded
storage space.