{"id":21436787,"url":"https://github.com/modfin/creek","last_synced_at":"2025-03-16T23:23:54.882Z","repository":{"id":190368062,"uuid":"681656366","full_name":"modfin/creek","owner":"modfin","description":"A PostgreSQL CDC system","archived":false,"fork":false,"pushed_at":"2024-03-14T22:09:43.000Z","size":6152,"stargazers_count":2,"open_issues_count":5,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-01-23T09:34:06.814Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/modfin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-22T13:29:14.000Z","updated_at":"2025-01-09T08:33:09.000Z","dependencies_parsed_at":"2023-08-24T10:28:08.612Z","dependency_job_id":"23d009a1-fccc-44f3-ab06-5dc62791b83c","html_url":"https://github.com/modfin/creek","commit_stats":null,"previous_names":["modfin/creek"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modfin%2Fcreek","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modfin%2Fcreek/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modfin%2Fcreek/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modfin%2Fcreek/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/modfin","download_url":"https://codeload.github.com/modfin/creek/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https
://github.com","kind":"github","repositories_count":243945818,"owners_count":20372939,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-23T00:15:35.489Z","updated_at":"2025-03-16T23:23:54.861Z","avatar_url":"https://github.com/modfin.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Creek \u003c!-- omit from toc --\u003e\n\n\u003e A PostgreSQL Change-Data-Capture (CDC) based system for event sourcing.\n\n## Motivation\n\nMany services inside Modular Finance depend on the same core data. This data is\nuseful for many services, but keeping it  in sync can be cumbersome. Some\nprojects have tried to break away from the core database while still needing\nsome of its data. Currently, no standardized way of synchronizing data is used.\nA CDC system allows capturing changes from a database in real-time and\npropagating those changes to downstream consumers. This can have many uses\nbeyond keeping two SQL databases in sync, such as streaming changes to a\nspecialized search database.\n\n## Architecture\n\nCreek consists of two major parts: producers and consumers. A producer is\nresponsible for listening for change events on a PostgreSQL database and\npublishing events on a Message Queue (MQ). The MQ used is [NATS\nJetStream](https://docs.nats.io/nats-concepts/jetstream). Events are published\nto topics corresponding to the table name. 
Generally, a practical system will\nonly consist of one producer database that acts as a single source of truth, but\ncreek is flexible and allows using multiple producers, as long as they have\ndifferent table names.\n\nA system may consist of multiple consumers, and even different types of\nconsumers. A [PostgreSQL consumer](https://github.com/modfin/creek-pg-client)\nhas been implemented. This consumer applies changes from a source table on a\ntopic to a specified table in a consumer PostgreSQL database.\n\n```mermaid\ngraph  TD\n  source_db  --\u003e|changes|source_db\n  source_db[(Source DB)]  --\u003e  |Postgres Streaming Logical Replication|producer\n  producer[Producer]  --\u003e  |Pub: CREEK.db.wal.namespace.table|nats[NATS JetStream]\n  producer[Producer]  --\u003e  |Pub: CREEK.db.schema.namespace.table|nats\n  nats --\u003e |Sub: CREEK.db.wal.namespace.table|Consumer\n```\n\n### Producer architecture\n\nThe producer works by leveraging [Postgres Logical\nReplication](https://www.postgresql.org/docs/12/logical-replication.html). It\ndirectly hooks into the Postgres Logical Replication Protocol emitted from\n`pgoutput` via the [pglogrepl](https://github.com/jackc/pglogrepl) Go library.\nIt can then listen to events from the PostgreSQL Write-Ahead Log (WAL) on\nspecific tables, and emit messages to a NATS JetStream MQ.\n\nPostgres logical replication can be started for a _Replication Slot_, which\ncorresponds to one consumer of a Postgres _Publication_. A publication can be\ndefined for specific tables. The replication slot contains information about the\ncurrent position in the WAL. As such, when restarting the producer, it will\ncontinue from the last processed WAL location, and include events that may have\nhappened while the producer was offline.\n\nThe messages produced are encoded using the binary\n[Avro](https://avro.apache.org/) data serialization format. 
This means that\nmessages are encoded in an efficient format that allows simple serialization and\ndeserialization. Avro relies on a _schema_ for both encoding and decoding\nmessages, and the same schema that was used to encode a message must be used\nwhen decoding it. Schemas can be uniquely identified by a 64-bit _fingerprint_.\n\nThe creek producer automatically generates _Avro schemas_ based on the columns\nof the producer PostgreSQL database, and uses them to encode its messages. The\nproducer is responsible for publishing the schemas used to encode its messages, and\nfor persisting them so that it can provide them to clients that request\na schema.\n\n### WAL\n\nThe creek producer publishes WAL events for each table on the topic\n`[creek-ns].[db].wal.[ns].[table]`, where `creek-ns` is a global namespace for\ncreek (by default `CREEK`), `db` is the database name, `ns` is the Postgres\nnamespace (a.k.a. schema) for the table, and `table` is the name of the table.\nMessages are encoded using Avro, and the schema differs from table to table.\nFor example, a message for a table with the following Postgres schema:\n\n```SQL\nCREATE TABLE test (\n    id int PRIMARY KEY,\n    name text,\n    at timestamptz\n);\n```\n\nWill have the following corresponding Avro schema:\n\n\u003cdetails\u003e\n    \u003csummary\u003eView schema\u003c/summary\u003e\n\n```json\n{\n    \"name\": \"publish_message\",\n    \"type\": \"record\",\n    \"fields\": [\n        {\n            \"name\": \"fingerprint\",\n            \"type\": \"string\"\n        },\n        {\n            \"name\": \"source\",\n            \"type\": {\n                \"name\": \"source\",\n                \"type\": \"record\",\n                \"fields\": [\n                    {\n                        \"name\": \"name\",\n                        \"type\": \"string\"\n                    },\n                    {\n                        \"name\": \"tx_at\",\n                        \"type\": {\n    
                        \"type\": \"long\",\n                            \"logicalType\": \"timestamp-micros\"\n                        }\n                    },\n                    {\n                        \"name\": \"db\",\n                        \"type\": \"string\"\n                    },\n                    {\n                        \"name\": \"schema\",\n                        \"type\": \"string\"\n                    },\n                    {\n                        \"name\": \"table\",\n                        \"type\": \"string\"\n                    },\n                    {\n                        \"name\": \"tx_id\",\n                        \"type\": \"long\"\n                    },\n                    {\n                        \"name\": \"last_lsn\",\n                        \"type\": \"string\"\n                    },\n                    {\n                        \"name\": \"lsn\",\n                        \"type\": \"string\"\n                    }\n                ]\n            }\n        },\n        {\n            \"name\": \"op\",\n            \"type\": {\n                \"name\": \"op\",\n                \"type\": \"enum\",\n                \"symbols\": [\n                    \"c\",\n                    \"u\",\n                    \"u_pk\",\n                    \"d\",\n                    \"t\",\n                    \"r\"\n                ]\n            }\n        },\n        {\n            \"name\": \"sent_at\",\n            \"type\": {\n                \"type\": \"long\",\n                \"logicalType\": \"timestamp-micros\"\n            }\n        },\n        {\n            \"name\": \"before\",\n            \"type\": [\n                \"null\",\n                {\n                    \"name\": \"integration_tests\",\n                    \"type\": \"record\",\n                    \"fields\": [\n                        {\n                            \"name\": \"id\",\n                            \"type\": {\n                   
             \"type\": \"string\",\n                                \"logicalType\": \"uuid\"\n                            },\n                            \"pgKey\": true,\n                            \"pgType\": \"uuid\"\n                        }\n                    ]\n                }\n            ]\n        },\n        {\n            \"name\": \"after\",\n            \"type\": [\n                \"null\",\n                {\n                    \"name\": \"integration_tests\",\n                    \"type\": \"record\",\n                    \"fields\": [\n                        {\n                            \"name\": \"id\",\n                            \"type\": {\n                                \"type\": \"string\",\n                                \"logicalType\": \"uuid\"\n                            },\n                            \"pgKey\": true,\n                            \"pgType\": \"uuid\"\n                        },\n                        {\n                            \"name\": \"data\",\n                            \"type\": [\n                                \"null\",\n                                \"string\"\n                            ],\n                            \"pgKey\": false,\n                            \"pgType\": \"text\"\n                        }\n                    ]\n                }\n            ]\n        }\n    ]\n}\n```\n\u003c/details\u003e\n\n\n### Schemas\n\nThe creek producer publishes the Avro schemas for each table on the topic\n`[creek-ns].[db].schema.[ns].[table]`, where `creek-ns` is a global namespace\nfor creek (by default `CREEK`), `db` is the database name, `ns` is the Postgres\nnamespace for the table, and `table` is the name of the table. 
These messages\nare sent as plain JSON with the following structure:\n\n```json\n{\n    \"fingerprint\": \"Sykce18MgAQ=\", // Base64 url-encoded fingerprint\n    \"schema\": \"...\",\n    \"source\": \"namespace.table\",\n    \"created_at\": \"YYYY...\"\n}\n```\n\nIn addition, the producer persists schemas in the database that it is connected\nto. Clients can request a schema using NATS Request-Reply. A client issues\na message to `[creek-ns]._schemas` with the fingerprint of the schema, and\nwill (if available) receive the schema from a producer that has it.\n\n### Snapshots\n\nThe WAL does not contain all data in the database, so in order to be able to get\na consistent view of the database, we need to be able to take snapshots of the\ndata. A snapshot is taken by the producer. Each snapshot taken will be written\nto a separate topic, with the name\n`[creek-ns].[db].snap.[ns].[table].[ts]_[id]`. Here, `creek-ns` refers to the\nglobal namespace for creek (by default `CREEK`), `db` is the database name, `ns`\nis the Postgres namespace for the table, `table` is the name of the table, `ts`\nis a timestamp of when the snapshot was taken in the form `YYYYMMDDHHMMSS_MS`,\nand `id` is a 4-character id of the snapshot.\n\nOn each snapshot topic, the first message will be a snapshot header containing\na JSON record in the following form:\n\n```json\n{\n    \"fingerprint\": \"Sykce18MgAQ=\", // Base64 url-encoded fingerprint\n    \"schema\": \"...\",\n    \"tx_id\": 6550070, // Postgres transaction id\n    \"lsn\": \"54/D8213EB8\", // Postgres WAL log sequence number (lsn)\n    \"at\": \"YYYY...\", // Timestamp\n    \"approx_rows\": 2312\n}\n```\n\nThe header is followed by $n$ messages containing the data for each row in the\ntable. These are followed by an end message containing the bytes `0x45 0x4f 0x46`\n(EOF).\n\nClients can request a snapshot using NATS Request-Reply. 
Clients send a JSON\nmessage on the channel `[creek-ns]._snapshot` in the following form:\n\n```json\n{\n    \"database\": \"db\",\n    \"namespace\": \"namespace\",\n    \"table\": \"table\"\n}\n```\n\nIf the database, namespace, and table exist, a producer will respond with the\ntopic to which the snapshot will be written. The client can now begin reading\nfrom this topic.\n\n## Postgres setup\n\nThe producer database requires a user with replication permission, access\nallowed in `pg_hba.conf`, and logical replication enabled in `postgresql.conf`.\n\nAdd a replication line to your pg_hba.conf:\n\n```\nhost replication [database] [ip address]/32 md5\n```\n\nMake sure that the following is set in your postgresql.conf:\n\n```\nwal_level=logical\nmax_wal_senders=5\nmax_replication_slots=5\n```\n\nAlso, it is a (very) good idea to cap the size of the WAL retained for replication slots, otherwise it will grow without bound\nwhile the producer is offline. This option is only available in Postgres 13 and later.\n\n```\nmax_slot_wal_keep_size=16GB\n```\n\n\n## Configuring\n\nThe producer is configured using the following environment variables:\n\n```\nPG_URI\nPG_PUBLICATION_NAME\nPG_PUBLICATION_SLOT\nPG_MESSAGE_TIMEOUT\nPG_TABLES\n\nNATS_NAMESPACE\nNATS_URI\nNATS_TIMEOUT\nNATS_MAX_PENDING\n\n\nLOG_LEVEL\nPROMETHEUS_PORT\n```\n\nIt is also possible to add tables to listen to while the producer is running\nusing a PostgreSQL API. 
Run the following on the database that the producer is connected to:\n\n```SQL\n-- Add table to listen to\nSELECT _creek.add_table('publication_name', 'namespace.table');\n\n-- Remove a table to listen to\nSELECT _creek.remove_table('publication_name', 'namespace.table');\n```\n\n## Usage\n\nThis project includes a client library for consumers written in Go.\nRefer to the documentation for more information.\nExample usage of the client:\n\n```golang\npackage main\n\nimport (\n  \"context\"\n  \"encoding/json\"\n  \"fmt\"\n\n  \"github.com/modfin/creek\"\n  \"github.com/nats-io/nats.go\"\n)\n\nfunc main() {\n  // Connect to NATS using the default creek namespace.\n  conn, err := creek.NewClient(nats.DefaultURL, \"CREEK\").Connect()\n  if err != nil {\n    panic(fmt.Errorf(\"failed to connect to creek: %w\", err))\n  }\n\n  // Subscribe to WAL events for one table.\n  stream, err := conn.StreamWAL(context.Background(), \"db\", \"namespace.table\")\n  if err != nil {\n    panic(fmt.Errorf(\"failed to stream WAL: %w\", err))\n  }\n\n  // Print each WAL message as JSON.\n  for {\n    msg, err := stream.Next(context.Background())\n    if err != nil {\n      panic(fmt.Errorf(\"failed to get next wal message: %w\", err))\n    }\n\n    b, err := json.Marshal(msg)\n    if err != nil {\n      panic(fmt.Errorf(\"failed to marshal wal message: %w\", err))\n    }\n    fmt.Println(string(b))\n  }\n}\n```\n\n## Security considerations\n\nThere is currently no authentication built into Creek. You will probably want to\nenable authentication on your NATS cluster. Also, be careful with which tables\nyou export from the producer: every client connected to NATS can stream WAL\nevents for an exported table and request full snapshots of it, and those\nsnapshots are visible to all clients, even if the table contains sensitive data.\n\n## Metrics\n\nThe producer exposes Prometheus metrics on\n`ip:PROMETHEUS_PORT/metrics`.\n\n## Limitations\n\nDue to the scope of the project, not all Postgres types are supported. 
Refer to\nthe pgtype-avro [README](./pgtype-avro/README.md) for a full list of supported\ntypes.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmodfin%2Fcreek","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmodfin%2Fcreek","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmodfin%2Fcreek/lists"}