https://github.com/findinpath/nested-set-kafka-sync

Proof of concept on how to eventually sync hierachical data from a source database towards a sink database via Apache Kafka in an tainted fashion without intermittently having corrupt content on the sink database.
https://github.com/findinpath/nested-set-kafka-sync
kafka kafka-connect-jdbc nested-set testcontainers
Last synced: about 1 year ago
JSON representation
Host: GitHub
URL: https://github.com/findinpath/nested-set-kafka-sync
Owner: findinpath
Created: 2020-04-19T04:40:19.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2022-09-08T01:08:12.000Z (over 3 years ago)
Last Synced: 2025-01-29T18:32:10.638Z (about 1 year ago)
Topics: kafka, kafka-connect-jdbc, nested-set, testcontainers
Language: Java
Homepage: https://www.findinpath.com/tree-db-table-kafka-sync/
Size: 1.45 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          Sync hierarchical data between relational databases over Apache Kafka

=====================================================================

This is a proof of concept on how to *eventually* sync hierachical data 

from a source database towards a sink database via Apache Kafka in an

untainted fashion without intermittently having corrupt content on the sink

database.

## Nested Set Model

There are multiple ways of storing and reading hierarchies in a relational database:

- adjacency list model: each tuple has a parent id pointing to its parent

- nested set model: each tuple has `left` and `right` coordinates corresponding to the preordered representation of the tree

Details about the advantages of the nested set model are already very well described in

the following article:

https://www.sitepoint.com/hierarchical-data-database/

**TLDR** As mentioned on [Wikipedia](https://en.wikipedia.org/wiki/Nested_set_model)

> The nested set model is a technique for representing nested sets 

> (also known as trees or hierarchies) in relational databases.

![Nested Set Tree](images/nested-set-tree.png)

![Nested Set Model](images/nested-set-model.png)

## Syncing nested set models over Apache Kafka

[Kafka Connect](https://docs.confluent.io/current/connect/index.html)

is an open source component of [Apache Kafka](http://kafka.apache.org/) which

in a nutshell, as described on [Confluent blog](https://www.confluent.io/blog/kafka-connect-deep-dive-jdbc-source-connector/) 

provides the following functionality for databases:

> It enables you to pull data (source) from a database into Kafka, and to push data (sink) from a Kafka topic to a database. 

More details about kafka-connect-jdbc connector can be found on the

[Conflent documentation](https://docs.confluent.io/current/connect/kafka-connect-jdbc/index.html).

![Kafka Connect](images/JDBC-connector.png)

Syncing of nested set model from the source database to Apache Kafka

can be easily taken care of by a kafka-connect-jdbc source connector

which can be initialized by posting the following configuration 

to the `/connectors` endpoint of

Kafka Connect (see [Kafka Connect REST interface](https://docs.confluent.io/current/connect/references/restapi.html#post--connectors))

```json

{

    "name": "findinpath",

    "config": {

        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",

        "mode": "timestamp+incrementing",

        "timestamp.column.name": "updated",

        "incrementing.column.name": "id",

        "topic.prefix": "findinpath.",

        "connection.user": "sa",

        "connection.password": "p@ssw0rd!source",

        "validate.non.null": "false",

        "tasks.max": "1",

        "name": "findinpath",

        "connection.url": "jdbc:postgresql://source:5432/source?loggerLevel=OFF",

        "table.whitelist": "nested_set_node"

    }

}

```

**NOTE** in the configuration above, the `tasks.max` is set to `1` because JDBC source connectors can deal

only with one `SELECT` statement at a time for retrieving the updates performed on a table.

It is advisable to use also for Apache Kafka a topic with only `1` partition for syncing the nested set content

towards downstream services.

### Refresh the nested set model on the sink database

On the sink database side there needs to be implemented a mechanism that contains

a safeguard for not adding invalid updates to the nested set model.

A concrete example in this direction is when going from the nested set model:

```

|1| A |2|

```

to the nested set model (after adding two children)

```

|1| A |6|

    ├── |2| B |3|

    └── |4| C |5|

```

In the snippets above the tree node labels are prefixed by their `left` and `right`

nested set model coordinates.

Via `kafka-connect-jdbc` the records corresponding to the tuple updates may come in various

ordering:

```

| label | left | right |

|-------|------|-------|

| A     | 1    | 6     |

| B     | 2    | 3     |

| C     | 4    | 5     |

```

or

```

| label | left | right |

|-------|------|-------|

| B     | 2    | 3     |

| C     | 4    | 5     |

| A     | 1    | 6     |

```

or any other combinations of the three tuples listed above because the records

are polled in batches of different sizes from Apache Kafka. 

Going from the nested set model table content

```

| label | left | right |

|-------|------|-------|

| A     | 1    | 2     |

```

towards

```

| label | left | right |

|-------|------|-------|

| A     | 1    | 6     |

```

or 

 

```

| label | left | right |

|-------|------|-------|

| A     | 1    | 6     |

| B     | 2    | 3     |

```

would intermittently render the nested set model corrupt until

all the records from the source nested set model are synced over Apache Kafka.

Using a kafka-connect-jdbc sink connector is therefor out of the question for

syncing the contents of trees from a source service towards downstream online services.  

One solution to cope with such a problem would be to separate the nested set model from 

what is synced over Apache Kafka. 

![Sink database table diagram](images/sink-database-table-diagram.png)

In the table diagram above, the `nested_set_node_log` table is an `INSERT only` table

in which is written whenever a new record(s) is read from Apache Kafka.

The `log_offset` table has only one tuple pointing to the last `nested_set_node_log` tuple id

processed when updating the `nested_set_node` table.

Whenever new records are read from Apache Kafka, there will be a transactional attempt 

to apply all the updates from `nested_set_node_log` made after the saved entry in the `log_offset`

table to the existing configuration of the `nested_set_node` nested set model.

If the applied updates lead to a valid nested set model configuration, then the `nested_set_node`

table will be updated and the log offset will be set to the latest processed `nested_set_node_log` entry.

Otherwise the `nested_set_node` table stays in its previous state. 

## Caching

On the sink side is implemented the [Guava's Cache](https://github.com/google/guava/wiki/CachesExplained)

for avoiding to read each time from the persistence the contents of the nested set model.

This approach nears the usage on a productive system, where the contents of the nested set model

are cached and not read from the database for each usage.

When there are new contents added to the nested set model, the cache is notified for 

invalidating its contents. 

## JDBC Transactions

One of the challenges faced before implementing this proof of concept

was whether to use [spring framework](https://spring.io/) to wrap the 

complexities of dealing with JDBC. It is extremely appealing to use

production ready frameworks and not care about their implementation complexity.

The decision made in the implementation was to avoid using `spring` and

`JPA` and go with plain old `JDBC`.

Along the way in the implementation, one open question was whether to group

all the JDBC complexity in one repository or in multiple repositories.

Due to the fact that multiple repositories bring a better overview in the

maintenance, the decision was made to go with multiple repositories.

There were some scenarios which involved transaction handling over multiple

DAO objects. The possible ways of handling transactions over multiple repositories is very

well described in the stackexchange post:

https://softwareengineering.stackexchange.com/a/339458/363485

The solution used to cope with this situation within this proof of concept was to create

repositories for each service operation and inject  the connection in the repositories.

> Dependency injection of connection: Your DAOs are not singletons but throw-away objects, receiving the connection on creation time. The calling code will control the connection creation for you.

>

> PRO: easy to implement

>

> CON: DB connection preparation / error handling in the business layer

>

> CON: DAOs are not singletons and you produce a lot of trash on the heap (your implementation language may vary here)

>

> CON: Will not allow stacking of service methods

## Testing

It is relatively easy to think about a solution for the previously exposed problem, but before putting it to a production

environment the solution needs propper testing in conditions similar to the environment in which it will run.

This is where the [testcontainers](https://www.testcontainers.org/) library helps a great deal by providing lightweight, 

throwaway instances of common databases that can run in a Docker container.

![Nested Set Kafka Sync System Tests](images/nested-set-kafka-sync-system-tests.png)

Docker containers are used for interacting with the Apache Kafka ecosystem as well as the source and sink databases.

This leads to tests that are easy to read and allow the testing of the sync operation for various nested set models

```java

    /**

     * This test ensures the sync accuracy for the following simple tree:

     *

     * 

     * |1| A |6|

     *     ├── |2| B |3|

     *     └── |4| C |5|

     * 

     */

    @Test

    public void simpleTreeDemo() {

        var aNodeId = sourceNestedSetService.insertRootNode("A");

        var bNodeId = sourceNestedSetService.insertNode("B", aNodeId);

        var cNodeId = sourceNestedSetService.insertNode("C", aNodeId);

        awaitForTheSyncOfTheNode(cNodeId);

        logSinkTreeContent();

    }

``` 

This project provides a functional prototype on how to setup the whole

Confluent environment (including **Confluent Schema Registry** and **Apache Kafka Connect**) 

via testcontainers.

See [AbstractNestedSetSyncTest](end-to-end-tests/src/test/java/com/findinpath/AbstractNestedSetSyncTest.java) 

and the [testcontainers package](end-to-end-tests/src/test/java/com/findinpath/testcontainers) for details.

### Kafka Connect

In order to use the Confluent's Kafka Connect container, this project made use of the already existing code

for [KafkaConnectContainer](https://github.com/ydespreaux/testcontainers/blob/master/testcontainers-kafka/src/main/java/com/github/ydespreaux/testcontainers/kafka/containers/KafkaConnectContainer.java)

on [ydespreaux](https://github.com/ydespreaux) Github account.

**NOTE** that the `KafkaConnectContainer` class previously mentioned has also corresponding test cases 

within the project [lib-kafka-connect](https://github.com/ydespreaux/shared/tree/master/lib-kafka-connect) in order to have a clue

on how to interact in an integration test with the container.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/findinpath/nested-set-kafka-sync

Awesome Lists containing this project

README