{"id":13575741,"url":"https://github.com/backtrace-labs/verneuil","last_synced_at":"2025-05-15T21:03:42.367Z","repository":{"id":39566057,"uuid":"387232613","full_name":"backtrace-labs/verneuil","owner":"backtrace-labs","description":"Verneuil is a VFS extension for SQLite that asynchronously replicates databases to S3-compatible blob stores.","archived":false,"fork":false,"pushed_at":"2024-10-06T22:28:48.000Z","size":4189,"stargazers_count":484,"open_issues_count":6,"forks_count":18,"subscribers_count":17,"default_branch":"main","last_synced_at":"2025-04-13T10:58:32.079Z","etag":null,"topics":["s3","sqlite"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/backtrace-labs.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-18T17:29:43.000Z","updated_at":"2025-04-08T21:21:46.000Z","dependencies_parsed_at":"2023-02-16T23:46:01.534Z","dependency_job_id":"644bb4b8-e198-4db1-a7f9-b00203456856","html_url":"https://github.com/backtrace-labs/verneuil","commit_stats":{"total_commits":486,"total_committers":4,"mean_commits":121.5,"dds":"0.23868312757201648","last_synced_commit":"e53fdc44644145ddef788d3c61f9a999e85bb9b4"},"previous_names":[],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/backtrace-labs%2Fverneuil","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/backtrace-labs%2Fverneuil/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
backtrace-labs%2Fverneuil/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/backtrace-labs%2Fverneuil/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/backtrace-labs","download_url":"https://codeload.github.com/backtrace-labs/verneuil/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254422754,"owners_count":22068678,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["s3","sqlite"],"created_at":"2024-08-01T15:01:03.827Z","updated_at":"2025-05-15T21:03:39.944Z","avatar_url":"https://github.com/backtrace-labs.png","language":"C","funding_links":[],"categories":["C","sqlite","extentions"],"sub_categories":[],"readme":"Verneuil: streaming replication for SQLite\n==========================================\n\n[![asciicast](doc/demo.gif)](https://asciinema.org/a/457886)\n\nVerneuil[^verneuil-process] [[vɛʁnœj]](https://en.wikipedia.org/wiki/Auguste_Victor_Louis_Verneuil)\nis a [VFS (OS abstraction layer)](https://www.sqlite.org/vfs.html) for\n[SQLite](https://www.sqlite.org/index.html) that accesses local\ndatabase files like the default unix VFS while asynchronously\nreplicating snapshots to [S3](https://aws.amazon.com/s3/)-compatible\nblob stores.  
We wrote it to improve the scalability and availability\nof pre-existing services for which SQLite is a good fit, at least for\nsingle-node deployments.\n\nBacktrace relies on Verneuil to backup and replicate thousands of\nSQLite databases that range in size from 100KB to a few gigabytes,\nsome of which see updates every second... for less than $40/day in S3\ncosts.\n\nIt has been tested on linux/amd64, linux/aarch64 (little endian), and\ndarwin/aarch64.  The sqlite file format and the Verneuil replication\ndata are all platform agnostic.\n\n[^verneuil-process]:  The [Verneuil process](https://en.wikipedia.org/wiki/Verneuil_method) was the first commercial method of manufacturing synthetic gemstones... and DRH insists on pronouncing SQLite like a mineral, surely a precious one (:\n\nThe primary design goal of Verneuil is to add asynchronous read\nreplication to working single-node systems without introducing new\ncatastrophic failure modes.  Avoiding new failure modes takes\nprecedence over all other considerations, including replication lag:\nthere is no attempt to bound or minimise the staleness of read\nreplicas.  Verneuil read replicas should only be used when stale\ndata is acceptable.\n\nIn keeping with this conservative approach to replication, the local\ndatabase file on disk remains the source of truth, and the VFS is\nfully compatible with SQLite's default `unix` VFS, even for concurrent\n(with file locking) accesses.  
Verneuil stores all state that must\npersist across SQLite transactions on disk, so multiple processes can\nstill access and replicate the same database with Verneuil.\n\nVerneuil also paces all API calls (with a [currently hardcoded] limit\nof 30 calls/second/process) to avoid \"surprising\" cloud bills, and\ndecouples the SQLite VFS from the replication worker threads that\nupload data to a remote blob store with a crash-safe buffer directory\nthat bounds its worst-case disk footprint to roughly four times the\nsize of the source database file.  It's thus always safe to disable\naccess to the blob store: buffered replication data may grow over\ntime, but always within bounds.\n\nReplacing the default unix VFS with Verneuil impacts local SQLite\noperations, of course: writes must be slower, in order to queue\nupdates for replication.  However, this slowdown is usually\nproportional to the time it took to perform the write itself, and\noften dominated by the *two* `fsync`s incurred by SQLite transaction\ncommits in rollback mode.  In addition, the extra replication\nlogic runs with the write lock downgraded to a read lock, so\nsubsequent transactions only block on the new replication step once\nthey're ready to commit.\n\nThis effort is incomparable with [litestream](https://github.com/benbjohnson/litestream/issues/8):\nVerneuil is meant for asynchronous read replication, with streaming\nbackups as a nice side effect.  The\n[replication approach](https://docs.google.com/document/d/173cfdvnVB_68No9vqmgKHc_SSd0Cy3-YmPaXZsVvXIg)\nis thus completely different.  In particular, while litestream only\nworks with SQLite databases in WAL mode, Verneuil only supports\nrollback journaling.  See `doc/DESIGN.md` for details.\n\nWhat's in this repo\n-------------------\n\n1. A \"Linux\" VFS (`c/linuxvfs.c`) that implements everything that\n   SQLite needs for a non-WAL DB, without all the backward\n   compatibility cruft in SQLite's Unix VFS.  
The new VFS's behaviour\n   is fully compatible with upstream's Unix VFS!  It's a simpler\n   starting point for new (Linux-only) SQLite VFSes.\n\n2. A Rust crate with a C interface (see `include/verneuil.h`) to\n   configure and register:\n\n   - The `verneuil` VFS, which hooks into the Linux VFS to track changes,\n     generate snapshots in spooling directories, and asynchronously\n     upload spooled data to a remote blob store like S3.  This VFS is\n     only compatible with SQLite's rollback journal mode.  It can be\n     called directly as a Rust program, or via its C interface.\n\n   - The `verneuil_snapshot` VFS that lets SQLite access snapshots stored\n     in S3-compatible blob stores.\n\n3. A runtime-loadable SQLite extension, `libverneuil_vfs`, that lets\n   SQLite open databases with the `verneuil` VFS (to replicate the\n   database to remote storage), or with the `verneuil_snapshot` VFS\n   (to access a replicated snapshot).\n\n4. The `verneuilctl` command-line tool to restore snapshots, forcibly\n   upload spooled data, synchronise a database file to remote\n   storage, and perform other ad hoc administrative tasks.\n\nQuick start\n-----------\n\nThere is more detailed setup information, including how to directly\nlink against the verneuil crate instead of loading it as a SQLite\nextension, in `doc/VFS.md` and `doc/SNAPSHOT_VFS.md`.  The\n`rusqlite_integration` example shows how that works for a Rust crate.\n\nFor quick hacks and test drives, the easiest way to use Verneuil is to\nbuild it as a runtime loadable extension for SQLite\n(`libverneuil_vfs`).\n\n`cargo build --release --examples --features='dynamic_vfs'`\n\nThe `verneuilctl` tool will also be useful.\n\n`cargo build --release --examples --features='vendor_sqlite'`\n\nVerneuil needs additional configuration to know where to spool\nreplication data, and where to upload or fetch data from remote\nstorage.  
That configuration data must be encoded in JSON, and will be\ndeserialised into a `verneuil::Options` struct (in `src/lib.rs`).\n\nA minimal configuration string looks as follows.  See `doc/VFS.md` and\n`doc/SNAPSHOT_VFS.md` for more details.\n\n```\n{\n  // \"make_default\": true, to use the replicating VFS by default\n  // \"tempdir\": \"/my/tmp/\", to override the location of temporary files\n  \"replication_spooling_dir\": \"/tmp/verneuil/\",\n  \"replication_targets\": [\n    {\n      \"s3\": {\n        \"region\": \"us-east-1\",\n        // \"endpoint\": \"http://127.0.0.1:9000\", //for non-standard regions\n        \"chunk_bucket\": \"verneuil_chunks\",\n        \"manifest_bucket\": \"verneuil_manifests\",\n        \"domain_addressing\": true  // or false for the legacy bucket-as-path interface\n        // \"create_buckets_on_demand\": true // to create private buckets as needed\n      }\n    }\n  ]\n}\n```\n\nThat's a mouthful to pass as query string parameters to\n`sqlite3_open_v2`, so Verneuil currently looks for that configuration\nstring in the `VERNEUIL_CONFIG` environment variable.  
If that\nvariable's value starts with an at sign, like \"@/path/to/config.json\",\nVerneuil looks for the configuration JSON in that file.\n\nThe configuration file does not include any credentials: Verneuil\ngets those from the environment, either by hitting the local EC2\ncredentials daemon, or by reading the `AWS_ACCESS_KEY_ID` and\n`AWS_SECRET_ACCESS_KEY` environment variables.\n\nNow that the environment is set up, we can load the extension in\nSQLite, and start replicating our writes to\n[S3](https://aws.amazon.com/s3/), or any other compatible blob server\n(we use [minio](https://min.io/) for testing).\n\n```\n$ RUST_LOG=warn VERNEUIL_CONFIG=@verneuil.json sqlite3\nSQLite version 3.22.0 2018-01-22 18:45:57\nEnter \".help\" for usage hints.\nConnected to a transient in-memory database.\nUse \".open FILENAME\" to reopen on a persistent database.\nsqlite\u003e .load ./libverneuil_vfs  -- Load the Verneuil VFS extension.\nsqlite\u003e .open file:source.db?vfs=verneuil\n-- The contents of source.db will now be spooled for replication before\n-- letting each transaction close.\nsqlite\u003e .open file:verneuil://source.host.name/path/to/replicated.db?vfs=verneuil_snapshot\n-- opens a read replica for the most current snapshot replicated to s3 by `source.host.name`\n-- for the database at `/path/to/replicated.db`.\n```\n\nOutside the SQLite shell, [extension loading must be enabled](https://www.sqlite.org/c3ref/c_dbconfig_defensive.html#sqlitedbconfigenableloadextension)\nin order to allow access to the [`load_extension` SQL function](https://www.sqlite.org/lang_corefunc.html#load_extension).\n\n[URI filenames](https://www.sqlite.org/uri.html) must also be enabled\nin order to specify the VFS in the connection string; it's also possible\nto [pass a VFS argument to `sqlite3_open_v2`](https://www.sqlite.org/c3ref/open.html).\n\nReplication data is buffered to the `replication_spooling_dir`\nsynchronously, before the end of each SQLite transaction.  
Actually\nuploading the data to remote storage happens asynchronously: we\nwouldn't want to block transaction commit on network calls.\n\nAfter exiting the shell or closing an application, we can make sure\nthat all spooled data is flushed to remote storage with `verneuilctl\nflush $REPLICATION_SPOOLING_DIR`: this command will attempt to\nsynchronously upload all pending spooled data in the spooling\ndirectory, and log noisily / error out on failure.\n\nFind documentation for other `verneuilctl` subcommands with `verneuilctl help`:\n\n```\n$ ./verneuilctl --help\nverneuilctl 0.1.0\nutilities to interact with Verneuil snapshots\n\nUSAGE:\n    verneuilctl [OPTIONS] \u003cSUBCOMMAND\u003e\n\nFLAGS:\n    -h, --help\n            Prints help information\n\n    -V, --version\n            Prints version information\n\n\nOPTIONS:\n    -c, --config \u003cconfig\u003e\n            The Verneuil JSON configuration used when originally copying the database to remote storage.\n\n            A value of the form \"@/path/to/json.file\" refers to the contents of that file; otherwise, the argument\n            itself is the configuration string.\n\n            This parameter is optional, and defaults to the value of the `VERNEUIL_CONFIG` environment variable.\n    -l, --log \u003clog\u003e\n            Log level, in the same format as `RUST_LOG`.  
Defaults to only logging errors to stderr; `--log=info`\n            increases the verbosity to also log info and warning to stderr.\n\n            To fully disable logging, pass `--log=off`.\n\nSUBCOMMANDS:\n    flush            The verneuilctl flush utility accepts the path to a spooling directory, (i.e., a value for\n                     `verneuil::Options::replication_spooling_dir`), and attempts to upload all the files pending\n                     replication in that directory\n    help             Prints this message or the help of the given subcommand(s)\n    manifest         The verneuilctl manifest utility accepts the path to a source replicated file and an optional\n                     hostname, and outputs the contents of the corresponding manifest file to `--out`, or stdout by\n                     default\n    manifest-name    The verneuilctl manifest-name utility accepts the path to a source replicated file and an\n                     optional hostname, and prints the name of the corresponding manifest file to stdout\n    restore          The verneuilctl restore utility accepts the path to a verneuil manifest file, and reconstructs\n                     its contents to the `--out` argument (or stdout by default)\n    sync             The verneuilctl sync utility accepts the path to a sqlite db, and uploads a fresh snapshot to\n                     the configured replication targets\n$ ./verneuilctl restore --help\nverneuilctl-restore 0.1.0\nThe verneuilctl restore utility accepts the path to a verneuil manifest file, and reconstructs its contents to the\n`--out` argument (or stdout by default)\n\nUSAGE:\n    verneuilctl restore [OPTIONS]\n\nFLAGS:\n        --help\n            Prints help information\n\n    -V, --version\n            Prints version information\n\n\nOPTIONS:\n    -h, --hostname \u003chostname\u003e\n            The hostname of the machine that generated the snapshot.\n\n            Defaults to the current machine's hostname.\n    -m, 
--manifest \u003cmanifest\u003e\n            The manifest file that describes the snapshot to restore.\n\n            These are typically stored as objects in versioned buckets; it is up to the invoker to fish out the relevant\n            version.\n\n            If missing, verneuilctl restore will attempt to download it from remote storage, based on `--hostname` and\n            `--source_path`.\n\n            As special cases, an `http://` or `https://` prefix will be downloaded over HTTP(S), an\n            `s3://bucket.region[.endpoint]/path/to/blob` URI will be loaded via HTTPS domain-addressed S3,\n            `verneuil://machine-host-name/path/to/sqlite.db` will be loaded based on that hostname (or the current\n            machine's hostname if empty) and source path, and a `file://` prefix will always be read as a local path.\n    -o, --out \u003cout\u003e\n            The path to the reconstructed output file.\n\n            Defaults to stdout.\n    -s, --source-path \u003csource-path\u003e\n            The path to the source file that was replicated by Verneuil, when it ran on `--hostname`\n\n```\n\nBut why?\n--------\n\nBacktrace shards most of its backend metadata in thousands of small\n(1-2 MB) to medium-size (up to 1-2 GB) SQLite databases, with an\naverage aggregate write rate of a few dozen write transactions per\nsecond (with a few hot databases and many cold ones).  
Before\nVerneuil, this approach offered adequate performance and availability.\nHowever, things could be better, and that's why we wrote Verneuil: to\ndistribute logic that can work with slightly stale read replicas and to\nsimplify our disaster recovery playbooks, without introducing new\nfailure modes in single-node code that already works well enough.\n\nIn fact, making sure replicas are up to date is explicitly not a goal.\nNevertheless, we find that once our backend reaches its steady state,\nless than 0.1% of write transactions take more than 5 seconds to\nreplicate, and we detect a replication lag of more than one minute\nmore rarely than once every million writes.  Of course, this all\ndepends on the write load and the number of replicated databases on a\nmachine or process.  For example, we experience temporary spikes in\nreplication lag whenever a service restarts and writes to a few\nhundred databases in rapid succession.\n\nData freshness is not a goal because Verneuil prioritises disaster\navoidance over everything else.  That's why we interpose a wait-free\ncrash-safe replication buffer (implemented as files on disk) between\nthe snapshot update logic, which must run synchronously with SQLite\ntransaction commits, and the copier worker threads that upload\nsnapshot data to remote blob stores.  We trust this buffer to act as a\n\"data diode\" that architecturally shuts off feedback loops from the\ncopier workers back to the SQLite VFS (i.e., back to the application).\nCrucially, the amount of buffered data for a given SQLite database is\nbounded to a multiple of that database file's size, even when copiers\nare completely stuck.  Even when the blob store is inaccessible or a\ntarget bucket misconfigured, local operations will not be interrupted\nby an ever-growing replication queue.  
The buffer is also updated\nwithout `fsync` calls that could easily impact the whole storage\nsubsystem; Verneuil instead achieves crash safety by discarding all\nreplication state after a reboot.\n\nAll too often, distributed solutions for scalability and availability\nend up introducing new catastrophic failure modes, and the result is a\nsystem that might offer resilience to rare (once a year or less)\nevents like hardware failure or power loss, but does so by increasing\ncomplexity to a level such that unforeseen interactions between\ncorrectly functioning pieces of code regularly cause protracted\ncustomer-impacting issues.  Verneuil's conservative approach gives us\nsome confidence that we can use it to improve the scalability and\navailability of our preexisting centralised systems without worsening\nthe reliability of everything that already works well enough.\n\nDisaster avoidance includes bounding cloud costs.  Verneuil can\nguarantee cost effectiveness for a wide range of update rates because\nit's always able to throttle the API calls that update data: the\nreplication buffer will simply squash snapshots and always bound the\nreplication data's footprint to four times the size of the source\ndatabase file.\n\nRegardless of the update pattern (frequency and number of databases),\nwe can count on Verneuil to remain within our budget for replication:\nit will never average more than 30[^size-limit] API calls/replication\ntarget/second/process.  
Each call uploads either a chunk (64 KB for\nincompressible data, less if zstd is useful), or a manifest (16 bytes\nper 64 KB chunk in the source database file, so 512 KB for a 2 GB\nfile).\n\n[^size-limit]: This hardcoded limit, coupled with the patrol logic that \"touches\" every extant chunk once a day, limits the total size of replicated databases for a single process: the replication logic may break down around 20-30 GB, but local operations should not be affected, except for the bounded growth in buffered replication data.  That's not an issue for us because we only store metadata in SQLite, metadata that tends to be orders of magnitude smaller than the data.\n\nChunks can be reaped by a retention rule that deletes them after a\nweek of inactivity (Verneuil attempts to \"touch\" useful chunks once a\nday), so, even when there's a lot of churn, a\n[chunk upload to a standard bucket in US East](https://aws.amazon.com/s3/pricing/)\ncosts at most $5e-6 + 64 K/1 GB * $0.023 / 4 (weeks per month) \u003c $6e-6.\n\nManifests for multi-GB databases can be much larger, but manifest\nupdates are throttled to less than one per second per database, and\nmanifest blobs can be deleted more aggressively (e.g., as soon as a\nversion becomes stale).  With a 24h retention rule, uploading the\nmanifest for a 2 GB database adds up to less than $6e-6 for the API\ncall and churned storage.\n\nWe could also add storage costs for the permanent\nfootprint of the replicated databases ($0.023/GB/month for standard\nbuckets in US East) to this upper bound, but that's usually dominated\nby API costs.\n\nAt an average rate of 30 uploads/replication target/second/process, the\ncost of churned data thus adds up to less than $15.55/replication\ntarget/day/process.  There is usually only one replication target and\none replicating process per machine, so this translates into a\n*maximum* of $15.55/day/machine (comparable to a c5.4xlarge).  
In\npractice, the average daily cost for Backtrace's backend fleet\n(millions of writes a day scattered across a few thousand databases)\nis on the order of $40/day.\n\nHow is it tested?\n-----------------\n\nIn addition to simply running this in production to confirm that\nregular single-node operations still work and that the code correctly\npaces its API calls, we use SQLite's open source regression test\nsuite, after replacing the default Unix VFS with Verneuil.\nUnfortunately, some tests assume WAL DBs are supported, so we have to\ndisable them; others inject failures to exercise SQLite's failure\nhandling logic, and those must be disabled too.  The resulting test suite\nlives at https://github.com/pkhuong/sqlite/tree/pkhuong/no-wal-baseline\n\nConfigure a SQLite `build` directory from the mutilated test suite,\nthen run `verneuil/t/test.sh` to build test executables that load\nthe Verneuil VFS and make it the new default.  The test script also\nspins up a local minio container for the Verneuil VFS.\n\nIn test mode, the VFS executes internal consistency checking code, and\npanics whenever it notices a spooling or replication failure.\n\nThe logic for read replicas can't piggyback on top of the SQLite test\nsuite as easily. It is instead subjected to classic unit testing and\nmanual smoke testing.\n\nWhat's missing for 1.0\n----------------------\n\n- Configurability: most of the plumbing is there to configure\n  individual SQLite connections, but the current implementation is\n  geared towards a program linking directly against libverneuil and\n  configuring it with C calls.  We can already load Verneuil in\n  SQLite by configuring it with an environment variable (which matches\n  the current global configuration structure), but we should add\n  support for reading configuration data from the connection string\n  (SQLite query string parameters).\n\nThings we should do after 1.0\n-----------------------------\n\n1. 
We currently always create the journal file in `0644`.  Umask applies,\n   but it would make sense to implement the same logic as SQLite's Unix\n   VFS and inherit the main db file's permissions.\n\n2. Many filesystems now support copy-on-write; we should think about\n   using that for the commit step, instead of a journal file!\n\n3. The S3 client library is really naive.  We should reuse HTTPS\n   connections.\n\n4. Consider some way to get chunks without indirecting through S3.\n   Could gossip promising chunks ahead of time, or simply serve them\n   on demand over request-response like HTTP.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbacktrace-labs%2Fverneuil","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbacktrace-labs%2Fverneuil","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbacktrace-labs%2Fverneuil/lists"}