{"id":28633849,"url":"https://github.com/brimdata/zync","last_synced_at":"2026-01-12T03:00:06.620Z","repository":{"id":36993460,"uuid":"231182131","full_name":"brimdata/zync","owner":"brimdata","description":"Kafka connector to sync Zed lakes to and from Kafka topics","archived":false,"fork":false,"pushed_at":"2025-12-04T19:59:52.000Z","size":322,"stargazers_count":18,"open_issues_count":13,"forks_count":3,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-12-08T01:28:24.895Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brimdata.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-01T06:14:59.000Z","updated_at":"2025-12-04T19:25:31.000Z","dependencies_parsed_at":"2024-02-09T20:26:36.264Z","dependency_job_id":"f553b2b2-5eae-4d61-9138-3a5ee8c19c89","html_url":"https://github.com/brimdata/zync","commit_stats":null,"previous_names":["mccanne/zinger","brimsec/zinger","brimdata/zinger"],"tags_count":11,"template":false,"template_full_name":null,"purl":"pkg:github/brimdata/zync","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brimdata%2Fzync","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brimdata%2Fzync/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brimdata%2Fzync/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brimdata%2Fzync/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/brimdata","download_url":"https://codeload.github.com/brimdata/zync/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brimdata%2Fzync/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28332831,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-12T00:36:25.062Z","status":"online","status_checked_at":"2026-01-12T02:00:08.677Z","response_time":98,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-12T15:39:18.897Z","updated_at":"2026-01-12T03:00:06.564Z","avatar_url":"https://github.com/brimdata.png","language":"Go","readme":"# zync\n\n`zync` is a connector between [Kafka](https://kafka.apache.org/) and\n[Zed lakes](https://github.com/brimdata/zed/tree/main/docs/lake).\nIt can run in either direction, syncing a Kafka topic to a Zed lake data pool\nor vice versa.\n\n`zync` can also apply transformations from one or more raw data pools\nto a staging data pool in a transactionally consistent fashion based on\nDebezium's read/create/update/delete model for CDC database logs.\nIn this way, you can use `zync` to sync input topics to raw data pools,\napply Debezium-aware transforms from the raw pools to a staging pool,\nand use `zync` to sync a staging pool to a target database, typically\na data warehouse.\n\n## Installation\n\nTo install `zync`, make sure you have Go 1.21 or better installed and then 
run\n```\ngo install github.com/brimdata/zync/cmd/zync@main\n```\n\nYou'll also need `zed` installed to run a Zed lake.  Installation instructions\nfor `zed` are in the [Zed repository](https://github.com/brimdata/zed).\n\n## Quick Start\n\nFor built-in help, run\n```\nzync -h\n```\nMake sure your config files are set up for the Kafka cluster\nand schema registry (see below), then run some tests.\n\nList schemas in the registry:\n```\nzync ls\n```\nCreate a topic called `MyTopic` with one partition using your Kafka admin tools\nand, in another window, set up a consumer to display data from that topic:\n```\nzync consume -topic MyTopic\n```\nNext, post some data to the topic:\n```\necho '{s:\"hello,world\"}' | zync produce -topic MyTopic -\n```\nThis transforms the ZSON input to Avro and posts it to the topic.\nThe consumer then converts the Avro back to ZSON and displays it.\n\n\u003e Hit Ctrl-C to interrupt `zync consume` as it will wait indefinitely\n\u003e for data to arrive on the topic.\n\n### Syncing to a Zed Lake\n\nIn another shell, run a Zed lake service:\n```\nmkdir scratch\nzed serve -lake scratch\n```\nNow, in your first shell, sync data from Kafka to a Zed lake:\n```\nzed create -orderby kafka.offset PoolA\nzync from-kafka -topic MyTopic -pool PoolA -exitafter 1s\n```\nSee the data in the Zed pool:\n```\nzed query \"from PoolA\"\n```\nNext, create a topic called `MyTarget` with one partition using your Kafka admin tools,\nsync data from a Zed pool back to Kafka, and check that it made it:\n```\nzync to-kafka -topic MyTarget -pool PoolA\nzync consume -topic MyTarget\n```\nFinally, try out shaping.  
Put a Zed script in `shape.zed`, e.g.,\n```\necho 'value:={upper:to_upper(value.s),words:split(value.s, \",\")}' \u003e shape.zed\n```\nAnd shape the record from `MyTopic` into a new `PoolB`:\n```\nzed create -orderby kafka.offset PoolB\nzync from-kafka -topic MyTopic -pool PoolB -shaper shape.zed -exitafter 1s\nzed query -Z \"from PoolB\"\n```\n\n## Configuration\n\nTo configure `zync` to talk to a Kafka cluster and a schema registry,\nyou must create two files in `$HOME/.zync`:\n[`kafka.json`](kafka.json) and\n[`schema_registry.json`](schema_registry.json).\n\nThe Kafka config file contains the Kafka bootstrap server\naddresses and access credentials.\n\nThe schema registry config file contains the URI of the service and\naccess credentials.\n\n\u003e We currently support no authentication, SASL/PLAIN authentication, and\n\u003e TLS client authentication, though it will be easy to add other\n\u003e authentication options.  Please let us know if you have a requirement\n\u003e here.\n\n## Description\n\n`zync` has two sub-commands for synchronizing data to and from Kafka:\n* `zync to-kafka` - syncs data from a Zed data pool to a Kafka topic\n* `zync from-kafka` - syncs data from Kafka topics to Zed data pools\n\nCurrently, only the binary\n[Kafka/Avro format](https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format)\nis supported, where the Avro schemas are obtained from a configured\n[schema registry](https://github.com/confluentinc/schema-registry).\n\nAn arbitrary Zed script can be applied to the Zed records in either direction.\n\nThe Zed pool used by `zync` must have its pool key set to `kafka.offset` in\nascending order.  
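\n\nFor example, using the same command shown in the Quick Start, a suitable pool (here with the placeholder name `MyPool`) can be created with\n```\nzed create -orderby kafka.offset MyPool\n```\n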
`zync` will detect and report an error if syncing\nis attempted using a pool without this configuration.\n\n### Syncing From Kafka\n\n`zync from-kafka` encapsulates records received from Kafka using the envelope\n```\n{\n  kafka: {topic:string,partition:int64,offset:int64},\n  key: {...},\n  value: {...}\n}\n```\nwhere the `key` and `value` fields represent the key/value data pair pulled from\nKafka and transcoded from Avro to Zed, and the `kafka` field contains metadata\ndescribing the topic, partition, and offset of the data received from Kafka.\n\nIf a Zed script is provided, it is applied to each such record before\nsyncing the data to the Zed pool.  While the script has access to the\nmetadata in the `kafka` field, it should not modify these values as\n`zync` relies on this field.\n\nAfter optionally shaping each record with a Zed script, the data is committed\ninto the Zed data pool in a transactionally consistent fashion where any and\nall data committed by `zync` writers must have monotonically increasing `kafka.offset`\nrelative to each topic indicated in `kafka.topic`.\n\nAs the Kafka topic and offset are stored in each record,\nthe `zync from-kafka` command can query the maximum input offset in the pool\nfor each topic and resume syncing from where it last left off.\n\nTo avoid duplicate records,\nit is best to configure `zync` with a single writer per Kafka topic.\n\n\u003e Note: we currently do not detect multiple writers to a pool but can do\n\u003e so with a small change to the load API to track commit IDs and detect\n\u003e write conflicts when the writer is not writing to the head commit that\n\u003e it expects.\n\n### Syncing To Kafka\n\n`zync to-kafka` reads records that arrive in a Zed pool, transcodes them\nto Avro, and \"produces\" them to the Kafka topic specified in the\n`kafka` metadata field of each record.\n\nThe synchronization algorithm is very simple: when `zync to-kafka` starts up,\nit queries the pool for the largest `kafka.offset` for 
each `kafka.topic`\npresent and queries\neach Kafka topic for its high-water mark.  Then it reads, shapes, and\nproduces all records from the Zed pool at the high-water mark and beyond\nfor each topic.\n\nThere is currently no logic to detect multiple concurrent writers to the\nsame Kafka output topic, so\ncare must be taken to run only a single `zync to-kafka` process at a time\nfor any given Kafka topic.\n\n\u003e Note: `zync to-kafka` currently exits after syncing to the highest contiguous offset.\n\u003e We plan to modify it soon so it runs continuously, listening for\n\u003e commits to the pool, then pushing any new data to Kafka with minimal latency.\n\n## Debezium Integration\n\n`zync` can be used with [Debezium](https://debezium.io) to perform database ETL\nand replication by syncing Debezium's CDC logs to a Zed data pool with `zync from-kafka`,\nshaping the logs for a target database schema using an experimental `zync etl`\ncommand, and replicating the shaped CDC logs to a Kafka database\nsink connector using `zync to-kafka`.\n\nThe goal of `zync etl` is to do sophisticated ETL that may involve the denormalization\nof multiple tables into one.\n\nThe model here is that `zync etl` processes data from an input pool to an output\npool where `zync from-kafka` is populating the input pool and `zync to-kafka` is processing\nthe output pool.  
More specifically, `zync from-kafka` receives Debezium\nevents from Kafka, `zync etl` transforms those events to\n[JDBC sink connector](https://docs.confluent.io/kafka-connect-jdbc/current/sink-connector/)\nrecords, and `zync to-kafka` sends those records to Kafka.\n\nEach Kafka topic must have a single partition as Debezium relies upon\nthe Kafka offset to indicate the FIFO order of all records.\n\nDebezium events have this structure.\n```\n{\n  key: {\n    // Fields correspond to source table's primary key.\n  },\n  value: {\n    op: string,\n    before: {\n      // Fields correspond to source table's columns.\n      // Value is null if and only if op is \"c\" or \"r\".\n    },\n    after: {\n      // Fields correspond to source table's columns.\n      // Value is null if and only if op is \"d\".\n    },\n    // Remaining fields depend on database connector.\n  }\n}\n```\n\nJDBC sink connector events have this structure.\n```\n{\n  key: {\n    // Fields correspond to destination table's primary key.\n  },\n  value: {\n    // Fields correspond to destination table's columns.\n    // Value is null for a tombstone (delete) record.\n  }\n}\n```\n\n### Design assumptions\n\nDebezium recommends using a single Kafka topic per database table.\nIn the same way, we can scale out the Zed lake and `zync` processes.\n\nWe assume the following Debezium event types:\n* `r` events indicate the read of a row during the startup snapshot,\n* `c` events indicate the creation of a new row,\n* `u` events indicate the update of an existing row, and\n* `d` events indicate the deletion of an existing row.\n\nTo perform denormalization, we need to collect multiple input events to\nproduce a single output event.  For example, a `c` event on input table A\nand another `c` event on table B might need to be coalesced into a\nmixed `c` event on output table C.  
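\n\nFor instance (a sketch whose key fields match the `join-on` example in the YAML config below; the other column names are illustrative), two `c` events\n```\n{key:{ID:1},value:{op:\"c\",before:null,after:{ID:1,Customer:\"Alice\"}}}                // from table A\n{key:{InvoiceID:1},value:{op:\"c\",before:null,after:{InvoiceID:1,Status:\"pending\"}}}  // from table B\n```\nmight be coalesced into a single denormalized `c` event for table C:\n```\n{key:{ID:1},value:{op:\"c\",before:null,after:{ID:1,Customer:\"Alice\",Status:\"pending\"}}}\n```\n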
There must be some way to correlate\nsuch events, e.g., by presuming there is a unique ID present in both records,\nor a foreign key in record A that can be used to locate\nthe primary key in record B.\n\nAlso, it might be desirable for a `u` event on table A to wait for a\n`u` event on table B to perform a denormalized `u` event on table C\nwhere the ETL needs info from both the A update and the B update to produce the\ncombined C update.\nIn other cases,\nit might be more straightforward to simply allow the individual updates to\nbe transformed to table C updates.  This would mean the ETL steps should be\nspecified in pieces so the `zync etl` command could apply a single configuration\nto a combined A/B update versus a single A or a single B update.\n\nThe `zync etl` command assumes that each input record participates in exactly\none transformation event.  This way, `zync etl` can track which input records have\nbeen processed and which remain to be processed by recording in the transformed\npool, exactly once, the Kafka offset and topic of each input event processed.\n\n### `zync etl` configuration\n\nThe `zync etl` command syncs one Zed data pool to another where the\ninput pool is created with `zync from-kafka` and the output pool is formatted\nfor `zync to-kafka`.\n\nSince configuration can be complex with multiple ETLs of varying types,\nthe command is configured with YAML.\n\n\u003e These YAML options are preliminary and will change as we iterate.\n\nConfiguration is currently all in a single YAML file.\n\n```\n# Routes define a pool for each Kafka topic, whether it's an input or an output.\n# The namespace of topics is currently shared between inputs and outputs as we\n# presume a single Kafka cluster for both the inputs and outputs (though this is\n# not a strict requirement).\ninputs:\n  - topic: TableA\n    pool: Raw\n  - topic: TableB\n    pool: Raw\n\noutput:\n  - topic: TableC\n    pool: Staging\n\n# Transforms define rules from one or 
more input tables\n# to an output table, using the routes to determine the pools where each\n# topic is stored.\ntransforms:\n  - type: denorm  # denorm or stateless\n    where: optional Zed Boolean expression to select this rule\n    left: TableA\n    right: TableB\n    join-on: left.value.after.ID=right.value.after.InvoiceID\n    out: TableC\n    zed: |\n      Zed script applied to all records before storing them in the pool.\n      Receives a record with two Debezium events in fields called left and right.\n      Must create a JDBC sink connector record in a field called out.\n  - type: stateless\n    where: value.op==\"u\"\n    in: TableA\n    out: TableC\n    zed: |\n      Zed script applied to all records before storing them in the pool.\n      Receives a record containing one Debezium source event: this.in.\n      Must create a JDBC sink connector record in a field called out.\n```\n\n\u003e Note that this YAML design only configures a single ETL pipeline between\n\u003e Zed data pools without any Kafka integration.  We need to work out another layer of\n\u003e YAML config that can embed these ETL configurations and additional logic\n\u003e to wire up the `zync from-kafka` and `zync to-kafka` processes and run many instances\n\u003e over a cluster.  The current plan is to have a `zync build` command that will take\n\u003e the `zync` YAMLs and produce `helm` charts to deploy all the needed `zync` processes\n\u003e across a Kubernetes cluster.\n\n### The ETL Algorithm\n\nThe algorithm here describes how the ETLs are stitched together to perform the\ndesired transformations.  Note that the YAML config above knows nothing\nabout the details below.  
However, it's useful to know how things work so you\ncan debug problems that arise (and perhaps performance issues).\n\n\u003e TBD: We'll create a library of Brim queries that can be used to easily\n\u003e navigate to different views of what's going on in a live ETL process.\n\nWhen the `zync etl` command is run, it queries the current state of its\ninput and output pools, determines if any of the ETLs have work to do,\nruns the ETLs to get the transformed results, creates completion records\nfor all records processed (by each input topic),\nand records the transformed records and completion records in an atomic\ncommit in the output pool.  It then exits.\n\nTo make incremental updates efficient, the pools must be sorted by `kafka.offset`\n(in ascending order), and\nwe maintain a cursor per input topic,\nreferred to below as `$cursor[$topic]`.\n\nA completion record is recorded in the output pool for each input record that has\nbeen processed, which has the form\n```\n{kafka:{topic:string,offset:int64}}(=done)\n```\n\u003e Note there is no Kafka partition as we require in-order delivery and thus\n\u003e only one partition per topic.\n\nAt startup, to compute the cursors we simply run a query for each input topic\non the output pool\n```\nis(\u003cdone\u003e) | max(kafka.offset) by kafka.topic\n```\n\u003e We can make this efficient by using `head 1` inside of switch legs where each\n\u003e switch case is one of the topics and scanning in descending order, which is the\n\u003e reverse order of the pool.  TBD: we have an issue to make reverse range scanning\n\u003e efficient; right now, we read the whole range and do a sort -r.  
This is not\n\u003e a trivial task but isn't too hard.\n\nWe can then enumerate the unprocessed records by scanning the raw pool\nfrom the smallest cursor up and doing an anti join for each topic.\n\nThe following pseudo-Zed would be stitched together from the YAML config by `zync`\n(assuming two input topics, \"TableA\" and \"TableB\", and output topic \"TableC\").\n```\nfork (\n    =\u003e from (\n        pool Raw range from $cursor[\"TableA\"] to MAXINT64 =\u003e kafka.topic==\"TableA\"\n        pool Staging range from $cursor[\"TableA\"] to MAXINT64 =\u003e is(\u003cdone\u003e) \u0026\u0026 kafka.topic==\"TableA\"\n      ) | anti join on kafka.offset=kafka.offset\n    =\u003e from (\n        pool Raw range from $cursor[\"TableB\"] to MAXINT64 =\u003e kafka.topic==\"TableB\"\n        pool Staging range from $cursor[\"TableB\"] to MAXINT64 =\u003e is(\u003cdone\u003e) \u0026\u0026 kafka.topic==\"TableB\"\n      ) | anti join on kafka.offset=kafka.offset\n  )\n  | switch (\n    case \u003cwhere-denorm\u003e =\u003e\n      fork (\n        =\u003e kafka.topic==\"TableA\" | yield {left:this} | sort \u003cleft-key\u003e\n        =\u003e kafka.topic==\"TableB\" | yield {right:this} | sort \u003cright-key\u003e\n      )\n      | join on \u003cleft-key\u003e=\u003cright-key\u003e right:=right\n      | \u003cZed that creates this.out from this.left and this.right\u003e\n      | out.kafka:=left.kafka\n      | yield out\n      | kafka.topic:=\"TableC\" // zync will fix kafka.offset\n    case (\u003cwhere-stateless\u003e) and kafka.topic==\"TableA\" =\u003e\n      yield {in:this}\n      | \u003cZed that creates this.out from this.in\u003e\n      | out.kafka:=in.kafka\n      | yield out\n      | kafka.topic:=\"TableC\" // zync will fix kafka.offset\n    ...\n  )\n```\n\n### Demo\n\nStart a Zed lake service.\n```\nmkdir scratch\nzed serve -lake scratch\n```\nCreate `Raw` and `Staging` pools:\n```\nzed create -orderby kafka.offset Raw\nzed create -orderby kafka.offset 
Staging\n```\nLoad the first batch of test data into `Raw`, as if `zync from-kafka` imported\nit from its topics to `Raw` as Debezium CDC logs:\n```\nzed load -use Raw@main demo/batch-1.zson\n```\nYou can easily see the Debezium table updates loaded into `Raw` with `zed query`:\n```\nzed query -f table \"from Raw | kafka.topic=='Invoices' | yield value.after\"\nzed query -f table \"from Raw | kafka.topic=='InvoiceStatus' | yield value.after\"\n```\nThese are all type `r` (read) Debezium logs and represent two new rows in each\nof the `Invoices` and `InvoiceStatus` tables.  Transform them to `Staging` with\n```\nzync etl demo/invoices.yaml\n```\nThis will report the commit ID and number of input records processed.\nNote that the number of records committed into\nthe pool is different from the number of records produced by ETL as the destination\npool includes metadata records tracking which input events have been processed.\n\nAfter running the ETL, you can see the denormalized CDC updates in the\n`Staging` pool:\n```\nzed query -f table \"from Staging | kafka.topic=='NewInvoices' | yield value\"\n```\nYou can also see the progress updates marking the input records completed\nthat are stored alongside the data in `Staging`:\n```\nzed query \"from Staging | is(\u003cdone\u003e)\"\n```\n\nIf you run the ETL again with no new data, it will do nothing as you do not\nwant duplicate data in the output:\n```\nzync etl demo/invoices.yaml\n```\n\u003e `zync` uses an anti join between the completion records in the output pool\n\u003e and the input records to remove from the input all records that have already\n\u003e been processed.\n\nNow suppose new data arrives from Debezium over Kafka.  Let's load it into\nthe `Raw` pool:\n```\nzed load -use Raw@main demo/batch-2.zson\n```\nIn this file, there are new `Invoices` rows for Charlie and Dan but only an\n`InvoiceStatus` row for Charlie.  
This means only the Charlie data can be\ndenormalized and the Dan `Invoices` row will be left unprocessed awaiting\nthe arrival of its `InvoiceStatus` counterpart.\n```\nzync etl demo/invoices.yaml\n```\nYou can see that the Charlie row made it to `Staging`:\n```\nzed query -f table \"from Staging | kafka.topic=='NewInvoices' | yield value\"\n```\nbut the Dan row is still pending.  You can see the pending records for this\nexample by running\n```\nzed query -Z -I demo/pending.zed\n```\nNow let's load another batch of records that provides the `InvoiceStatus` create\nevent for the Dan row and a \"stateless\" `InvoiceStatus` update to change\nAlice's status to \"paid\":\n```\nzed load -use Raw@main demo/batch-3.zson\nzync etl demo/invoices.yaml\n```\nNow we can see the Dan row made it to `Staging`:\n```\nzed query -f table \"from Staging | not is(\u003cdone\u003e) | yield {offset:kafka.offset,value:value} | fuse\"\n```\n\n\u003e NOTE: We formatted this output a bit differently as the updates are getting\n\u003e more complex.  
Here we numbered each update according to CDC order\n\u003e and fused the tables so you can see where the updates fall in the table.\n\nFinally, in the last batch, the remaining invoices marked \"pending\" are\nall updated.\n\n```\nzed load -use Raw@main demo/batch-4.zson\nzync etl demo/invoices.yaml\n```\n\nAnd re-run the table query from above to see the final result:\n```\nzed query -f table \"from Staging | not is(\u003cdone\u003e) | yield {offset:kafka.offset,value:value} | fuse\"\n```\n\n#### `anti join`\n\nIf you're curious how anti join works, try this:\n```\necho '{a:1,id:1}{a:2,id:2}{a:3,id:2}{a:4,id:3}{a:5,id:4}{a:5,id:5}' \u003e in.zson\necho '{drop:2}{drop:5}' \u003e drop.zson\nzq 'anti join on id=drop' in.zson drop.zson\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrimdata%2Fzync","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrimdata%2Fzync","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrimdata%2Fzync/lists"}