Sync hierarchical data between relational databases over Apache Kafka
=====================================================================

This is a proof of concept on how to *eventually* sync hierarchical data
from a source database towards a sink database via Apache Kafka in an
untainted fashion, without intermittently having corrupt content on the sink
database.

## Nested Set Model

There are multiple ways of storing and reading hierarchies in a relational database:

- adjacency list model: each tuple has a parent id pointing to its parent
- nested set model: each tuple has `left` and `right` coordinates corresponding to a preorder traversal of the tree

The advantages of the nested set model are very well described in
the following article:

https://www.sitepoint.com/hierarchical-data-database/


**TLDR** As mentioned on
[Wikipedia](https://en.wikipedia.org/wiki/Nested_set_model)

> The nested set model is a technique for representing nested sets
> (also known as trees or hierarchies) in relational databases.


![Nested Set Tree](images/nested-set-tree.png)

![Nested Set Model](images/nested-set-model.png)


## Syncing nested set models over Apache Kafka

[Kafka Connect](https://docs.confluent.io/current/connect/index.html)
is an open source component of [Apache Kafka](http://kafka.apache.org/) which,
in a nutshell, as described on the [Confluent blog](https://www.confluent.io/blog/kafka-connect-deep-dive-jdbc-source-connector/),
provides the following functionality for databases:

> It enables you to pull data (source) from a database into Kafka, and to push data (sink) from a Kafka topic to a database.


More details about the kafka-connect-jdbc connector can be found in the
[Confluent documentation](https://docs.confluent.io/current/connect/kafka-connect-jdbc/index.html).

![Kafka Connect](images/JDBC-connector.png)

Syncing the nested set model from the source database to Apache Kafka
can easily be taken care of by a kafka-connect-jdbc source connector,
which can be initialized by posting the following configuration
to the `/connectors` endpoint of
Kafka Connect (see [Kafka Connect REST interface](https://docs.confluent.io/current/connect/references/restapi.html#post--connectors)):

```json
{
    "name": "findinpath",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "mode": "timestamp+incrementing",
        "timestamp.column.name": "updated",
        "incrementing.column.name": "id",
        "topic.prefix": "findinpath.",
        "connection.user": "sa",
        "connection.password": "p@ssw0rd!source",
        "validate.non.null": "false",
        "tasks.max": "1",
        "name": "findinpath",
        "connection.url": "jdbc:postgresql://source:5432/source?loggerLevel=OFF",
        "table.whitelist": "nested_set_node"
    }
}
```

**NOTE** In the configuration above, `tasks.max` is set to `1` because JDBC source connectors can deal
with only one `SELECT` statement at a time for retrieving the updates performed on a table.
It is also advisable to use an Apache Kafka topic with only `1` partition for syncing the nested set content
towards downstream services.

### Refresh the nested set model on the sink database

On the sink database side there needs to be a mechanism that contains
a safeguard against applying invalid updates to the nested set model.
A concrete example in this direction is going from the nested set model:

```
|1| A |2|
```

to the nested set model (after adding two children):

```
|1| A |6|
    ├── |2| B |3|
    └── |4| C |5|
```

In the snippets above, the tree node labels are surrounded by their `left` and `right`
nested set model coordinates.

Via `kafka-connect-jdbc` the records corresponding to the tuple updates may arrive in various
orderings:

```
| label | left | right |
|-------|------|-------|
| A     | 1    | 6     |
| B     | 2    | 3     |
| C     | 4    | 5     |
```
or
```
| label | left | right |
|-------|------|-------|
| B     | 2    | 3     |
| C     | 4    | 5     |
| A     | 1    | 6     |
```
or any other combination of the three tuples listed above, because the records
are polled in batches of varying sizes from Apache Kafka.
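
Because a batch may contain only a part of a tree update, the sink side needs a way to decide whether the tuples it currently holds form a valid nested set model before exposing them. A minimal sketch of such a validity check is shown below; the `Node` record and class name are hypothetical, not the project's actual classes:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Validity check for a nested set model, usable as a safeguard against partial updates. */
class NestedSetValidator {

    /** Hypothetical tuple shape mirroring the nested_set_node table. */
    record Node(String label, int left, int right) {}

    static boolean isValid(List<Node> nodes) {
        if (nodes.isEmpty()) {
            return true;
        }
        // The 2n coordinates of n nodes must be exactly the numbers 1..2n.
        SortedSet<Integer> coords = new TreeSet<>();
        for (Node node : nodes) {
            if (node.left() >= node.right()) {
                return false;
            }
            coords.add(node.left());
            coords.add(node.right());
        }
        if (coords.size() != 2 * nodes.size() || coords.first() != 1
                || coords.last() != 2 * nodes.size()) {
            return false;
        }
        // A single root must span the whole coordinate range, and the
        // [left, right] intervals must nest properly (no partial overlaps).
        List<Node> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingInt(Node::left));
        if (sorted.get(0).left() != 1 || sorted.get(0).right() != 2 * nodes.size()) {
            return false;
        }
        Deque<Node> open = new ArrayDeque<>();
        for (Node node : sorted) {
            while (!open.isEmpty() && open.peek().right() < node.left()) {
                open.pop(); // that subtree is closed before the current node starts
            }
            if (!open.isEmpty() && node.right() > open.peek().right()) {
                return false; // partial overlap: node sticks out of its enclosing subtree
            }
            open.push(node);
        }
        return true;
    }
}
```

With this check, the incomplete batch `A(1,6), B(2,3)` is rejected, while the complete batch `A(1,6), B(2,3), C(4,5)` is accepted, regardless of the order in which the tuples arrived.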

Going from the nested set model table content

```
| label | left | right |
|-------|------|-------|
| A     | 1    | 2     |
```

towards

```
| label | left | right |
|-------|------|-------|
| A     | 1    | 6     |
```

or

```
| label | left | right |
|-------|------|-------|
| A     | 1    | 6     |
| B     | 2    | 3     |
```

would intermittently render the nested set model corrupt until
all the records from the source nested set model are synced over Apache Kafka.

Using a kafka-connect-jdbc sink connector is therefore out of the question for
syncing the contents of trees from a source service towards downstream online services.

One solution to cope with this problem is to separate the nested set model from
what is synced over Apache Kafka.

![Sink database table diagram](images/sink-database-table-diagram.png)


In the table diagram above, the `nested_set_node_log` table is an `INSERT`-only table
to which a record is appended whenever a new record is read from Apache Kafka.
The `log_offset` table has only one tuple, pointing to the id of the last `nested_set_node_log`
entry processed when updating the `nested_set_node` table.

Whenever new records are read from Apache Kafka, there will be a transactional attempt
to apply all the updates from `nested_set_node_log` made after the saved entry in the `log_offset`
table to the existing configuration of the `nested_set_node` nested set model.

If the applied updates lead to a valid nested set model configuration, the `nested_set_node`
table is updated and the log offset is set to the latest processed `nested_set_node_log` entry.
Otherwise the `nested_set_node` table stays in its previous state.
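
The refresh mechanism just described can be sketched in plain Java. The class below is an in-memory simulation with hypothetical names: the three fields stand in for the `nested_set_node_log`, `log_offset` and `nested_set_node` tables, the "commit"/"rollback" branches stand in for a single JDBC transaction, and the validity check is deliberately simplified to the coordinate-coverage property:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** In-memory sketch of the log + offset refresh mechanism (hypothetical names). */
class NestedSetSink {

    record LogEntry(long id, String label, int left, int right) {}

    record Node(String label, int left, int right) {}

    private final List<LogEntry> log = new ArrayList<>();   // nested_set_node_log
    private long logOffset = 0L;                            // log_offset (last applied entry id)
    private Map<String, Node> tree = new LinkedHashMap<>(); // nested_set_node, keyed by label

    /** Called whenever a record is read from Apache Kafka. */
    void append(LogEntry entry) {
        log.add(entry);
    }

    /**
     * Try to apply all log entries past the saved offset.
     * If the result is not a valid nested set model, keep the previous state.
     */
    boolean refresh() {
        Map<String, Node> candidate = new LinkedHashMap<>(tree);
        long newOffset = logOffset;
        for (LogEntry entry : log) {
            if (entry.id() <= logOffset) {
                continue;
            }
            candidate.put(entry.label(), new Node(entry.label(), entry.left(), entry.right()));
            newOffset = entry.id();
        }
        if (!isValid(candidate.values())) {
            return false;     // "rollback": tree and offset stay unchanged
        }
        tree = candidate;     // "commit": publish the new tree and advance the offset
        logOffset = newOffset;
        return true;
    }

    Map<String, Node> tree() {
        return tree;
    }

    /** Simplified safeguard: the 2n coordinates of n nodes must be exactly 1..2n. */
    private static boolean isValid(Collection<Node> nodes) {
        SortedSet<Integer> coords = new TreeSet<>();
        for (Node node : nodes) {
            if (node.left() >= node.right()) {
                return false;
            }
            coords.add(node.left());
            coords.add(node.right());
        }
        return nodes.isEmpty()
                || (coords.size() == 2 * nodes.size()
                && coords.first() == 1 && coords.last() == 2 * nodes.size());
    }
}
```

A batch that only contains part of a tree update (for example `A(1,6)` and `B(2,3)` without `C(4,5)`) fails the validity check, so the previously valid tree and the old offset are kept; once the remaining entries arrive, a later refresh applies the whole update atomically.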
\n\n## Caching\n\nOn the sink side is implemented the [Guava's Cache](https://github.com/google/guava/wiki/CachesExplained)\nfor avoiding to read each time from the persistence the contents of the nested set model.\nThis approach nears the usage on a productive system, where the contents of the nested set model\nare cached and not read from the database for each usage.\n\nWhen there are new contents added to the nested set model, the cache is notified for \ninvalidating its contents. \n\n## JDBC Transactions\n\nOne of the challenges faced before implementing this proof of concept\nwas whether to use [spring framework](https://spring.io/) to wrap the \ncomplexities of dealing with JDBC. It is extremely appealing to use\nproduction ready frameworks and not care about their implementation complexity.\n\nThe decision made in the implementation was to avoid using `spring` and\n`JPA` and go with plain old `JDBC`.\n\nAlong the way in the implementation, one open question was whether to group\nall the JDBC complexity in one repository or in multiple repositories.\nDue to the fact that multiple repositories bring a better overview in the\nmaintenance, the decision was made to go with multiple repositories.\n\nThere were some scenarios which involved transaction handling over multiple\nDAO objects. The possible ways of handling transactions over multiple repositories is very\nwell described in the stackexchange post:\n\nhttps://softwareengineering.stackexchange.com/a/339458/363485\n\nThe solution used to cope with this situation within this proof of concept was to create\nrepositories for each service operation and inject  the connection in the repositories.\n\n\u003e Dependency injection of connection: Your DAOs are not singletons but throw-away objects, receiving the connection on creation time. 
The calling code will control the connection creation for you.
>
> PRO: easy to implement
>
> CON: DB connection preparation / error handling in the business layer
>
> CON: DAOs are not singletons and you produce a lot of trash on the heap (your implementation language may vary here)
>
> CON: Will not allow stacking of service methods


## Testing

It is relatively easy to think of a solution for the problem exposed above, but before putting it into a production
environment the solution needs proper testing in conditions similar to the environment in which it will run.

This is where the [testcontainers](https://www.testcontainers.org/) library helps a great deal, by providing lightweight,
throwaway instances of common databases that can run in a Docker container.

![Nested Set Kafka Sync System Tests](images/nested-set-kafka-sync-system-tests.png)

Docker containers are used for interacting with the Apache Kafka ecosystem as well as with the source and sink databases.

This leads to tests that are easy to read and that allow testing the sync operation for various nested set models:

```java
    /**
     * This test ensures the sync accuracy for the following simple tree:
     *
     * <pre>
     * |1| A |6|
     *     ├── |2| B |3|
     *     └── |4| C |5|
     * </pre>
     */
    @Test
    public void simpleTreeDemo() {
        var aNodeId = sourceNestedSetService.insertRootNode("A");
        var bNodeId = sourceNestedSetService.insertNode("B", aNodeId);
        var cNodeId = sourceNestedSetService.insertNode("C", aNodeId);

        awaitForTheSyncOfTheNode(cNodeId);
        logSinkTreeContent();
    }
```

This project provides a functional prototype of how to set up the whole
Confluent environment (including **Confluent Schema Registry** and **Apache Kafka Connect**)
via testcontainers.

See
[AbstractNestedSetSyncTest](end-to-end-tests/src/test/java/com/findinpath/AbstractNestedSetSyncTest.java)
and the [testcontainers package](end-to-end-tests/src/test/java/com/findinpath/testcontainers) for details.

### Kafka Connect

In order to use Confluent's Kafka Connect container, this project made use of the already existing code
for [KafkaConnectContainer](https://github.com/ydespreaux/testcontainers/blob/master/testcontainers-kafka/src/main/java/com/github/ydespreaux/testcontainers/kafka/containers/KafkaConnectContainer.java)
from the [ydespreaux](https://github.com/ydespreaux) GitHub account.

**NOTE** The `KafkaConnectContainer` class mentioned above also has corresponding test cases
within the project [lib-kafka-connect](https://github.com/ydespreaux/shared/tree/master/lib-kafka-connect), which give a clue about
how to interact with the container in an integration test.
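
As an aside to the Caching section above: the load-on-demand / invalidate-on-update pattern can be illustrated with plain JDK types. The project itself uses Guava's Cache for this; the class and names below are hypothetical:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Minimal cache-aside sketch: load lazily, invalidate when new content is applied. */
class NestedSetCache<T> {

    private final Supplier<T> loader;                        // e.g. reads the tree from the sink DB
    private final AtomicReference<T> cached = new AtomicReference<>();

    NestedSetCache(Supplier<T> loader) {
        this.loader = loader;
    }

    T get() {
        T value = cached.get();
        if (value == null) {                                 // cache miss: load and remember
            value = loader.get();
            cached.compareAndSet(null, value);
        }
        return value;
    }

    void invalidate() {                                      // called after new log entries are applied
        cached.set(null);
    }
}
```

The loader is only invoked on a miss, so repeated reads are served from memory until `invalidate()` is called after a successful refresh of the nested set model.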