
## GraphCRDT

*A portable on-memory conflict-free replicated graph database.*

*Note: So far, this project has only been tested on macOS Monterey.*

This is a simple implementation of a conflict-free replicated graph database. The graph operations are built on the Last-Writer-Wins element set, which supports the following essential functions (a minimal sketch follows this list):

- add a vertex/edge
- remove a vertex/edge
- check if a vertex/edge is in the graph
- query all vertices connected to a vertex
- find any path between two vertices
- merge with concurrent changes from other graphs/replicas
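The sketch below illustrates, in plain Python, how a Last-Writer-Wins element set can support add, remove, lookup, and merge; the class and method names are hypothetical and not taken from this repository. A graph can then keep two such sets, one for vertices and one for edges.

```python
import time

class LWWSet:
    """Last-Writer-Wins element set: keeps the latest add/remove timestamp per element."""

    def __init__(self):
        self.added = {}    # element -> timestamp of the latest add
        self.removed = {}  # element -> timestamp of the latest remove

    def add(self, element, ts=None):
        ts = ts if ts is not None else time.time()
        self.added[element] = max(ts, self.added.get(element, 0))

    def remove(self, element, ts=None):
        ts = ts if ts is not None else time.time()
        self.removed[element] = max(ts, self.removed.get(element, 0))

    def exists(self, element):
        # An element is present if its latest add is at least as new as its latest remove
        # (this sketch biases ties toward "add").
        return element in self.added and \
            self.added[element] >= self.removed.get(element, 0)

    def merge(self, other):
        # Keep the newest timestamp per element; merges therefore commute,
        # so replicas converge regardless of message order.
        for e, ts in other.added.items():
            self.added[e] = max(ts, self.added.get(e, 0))
        for e, ts in other.removed.items():
            self.removed[e] = max(ts, self.removed.get(e, 0))
```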

This project is fully decentralized and can merge data without any coordination between replicas. The core idea is that each replica can work independently. When a replica connects to the database network, it can merge with or receive updates from other replicas through its connections. To join the network, a replica must be assigned an address and know exactly one friend (replica) already in the network. After a replica receives a message that a newcomer has just registered with one of its friends, it broadcasts the newcomer's information onward; since the network is always connected, this message reaches every replica. Likewise, when a replica sends a merge request to its friends, the request also propagates to all other replicas, and the Last-Writer-Wins data type resolves any conflict.
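Purely as an illustration of the join flow described above (the message format and helper names here are hypothetical, not this project's actual protocol):

```python
def send(address, message):
    # Placeholder transport; the real project forwards messages via its gateway.
    print(f"-> {address}: {message}")

def join_network(my_address, friend_address, peers):
    # The first instance has no friend (FRIEND_ADDRESS=-1) and simply waits.
    if friend_address != "-1":
        peers.add(friend_address)
        # Tell the one known friend that we exist; it broadcasts us onward.
        send(friend_address, {"type": "register", "address": my_address})

def on_register(message, my_address, peers):
    newcomer = message["address"]
    if newcomer not in peers and newcomer != my_address:
        peers.add(newcomer)
        # Forward the newcomer to everyone we know; because the network is
        # connected, the message eventually reaches every replica.
        for p in peers:
            if p != newcomer:
                send(p, message)

peers = set()
join_network("http://replica-d:8000", "http://replica-b:8000", peers)
```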

### Installation

Requirements:

- Docker version 20.10.6+
- Python 3.8+

Build docker image for the database instance:

```bash
docker build -t gcrdt .
```

Run the first database instance. Note that the first instance in the network has no friend yet, so we set `FRIEND_ADDRESS=-1`:
```bash
docker run -d --name cluster_1 -p 8081:8000 -e ADDRESS=http://host.docker.internal:8081 \
-e FRIEND_ADDRESS=-1 gcrdt
```

Subsequent instances join the network by setting `FRIEND_ADDRESS` to the address of an existing replica. For example:

```bash
docker run -d --name cluster_2 -p 8082:8000 -e ADDRESS=http://host.docker.internal:8082 \
-e FRIEND_ADDRESS=http://host.docker.internal:8081 gcrdt

docker run -d --name cluster_3 -p 8083:8000 -e ADDRESS=http://host.docker.internal:8083 \
-e FRIEND_ADDRESS=http://host.docker.internal:8082 gcrdt
```

After executing these commands, **cluster_1**, **cluster_2**, and **cluster_3** are connected. We can also run the sample script (provided in the project repository) to start a network with 5 replicas (instances):
```bash
chmod +x run.sh
./run.sh
```

### Architecture

In this type of peer-to-peer communication, latency and switching loops (https://en.wikipedia.org/wiki/Switching_loop) are the two big problems. As the figure below shows, when D connects to the network, it lets its friend B know that it wants to join. B then sends D's information to its friend A; likewise, A sends it to its friend C, and since C is also a friend of B, C sends another request back to B. In any REST-like communication style, a service that sends a request waits for a response. That is how the switching loop problem arises here: B waits for A to respond, A waits for C, and C is also waiting for B.



There are several ways to deal with this problem:

- Use a TTL or timeout per request: not a good idea, since TTLs/timeouts increase the latency of the whole network.
- Use fire-and-forget protocols such as Apache Thrift or raw UDP sockets: the main drawback is that two services must keep a connection open to talk to each other, which becomes another problem in a large network.
- Use an asynchronous message queue such as RabbitMQ or Kafka: we could, but unfortunately we don't want any coordination between services, and the network must be fully decentralized among replicas, so any centralized component is unacceptable.
- Separate the architecture into two layers: this, at least, works for our case. The first layer handles communication across the network; the other performs logical queries and broadcasts among replicas. The two layers are connected by exactly one inner UDP socket tunnel in the container, which is more stable than holding a bunch of connections as in the second option. Since the communication gateway responds to each request immediately while workers carry out the other jobs, we can think of this as a fire-and-forget gateway (see the sketch below).
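A minimal sketch of this two-layer idea, assuming hypothetical names and a made-up tunnel port: the gateway acknowledges each request immediately and pushes the work through a local UDP socket to a worker, so no replica is ever left waiting in a loop.

```python
import json
import socket

TUNNEL_ADDR = ("127.0.0.1", 9999)  # inner UDP tunnel inside the container (hypothetical port)

def gateway_handle(request):
    """Layer 1: respond immediately, then fire-and-forget into the tunnel."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(json.dumps(request).encode(), TUNNEL_ADDR)
    sock.close()
    return {"status": "Success"}  # the caller never blocks on downstream peers

def worker_loop():
    """Layer 2: perform graph queries and broadcasts pulled off the tunnel."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(TUNNEL_ADDR)
    while True:
        data, _ = sock.recvfrom(65536)
        request = json.loads(data)
        # ... run the graph operation and forward broadcasts to peers ...
        print("processing", request)
```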



### API Client

Simply install the Python API client for this project (a PyPI release will come soon):
```bash
python setup.py install
```

Usage:

- Connect to a database instance:
```python
from gcrdt_client import CRDTGraphClient

connection_string = "http://127.0.0.1:8081"
instance = CRDTGraphClient(connection_string)
```
- Clear the database (a `broadcast()` request should be sent after any operation on a database instance to keep the data synchronized in real time among replicas; of course, we can also send a single `broadcast()` after a batch of operations, and the Last-Writer-Wins rule will resolve any conflicts):
```python
instance.clear() # clear all edges and vertices from the database
instance.broadcast() # send new update to the network
```

- Add/remove/check a vertex:
```python
instance.add_vertex(1)
instance.add_vertex(2)
instance.broadcast()

print(instance.exists_vertex(1)) # True
print(instance.exists_vertex(2)) # True

instance.remove_vertex(1)
instance.remove_vertex(2)
instance.broadcast()

print(instance.exists_vertex(1)) # False
print(instance.exists_vertex(2)) # False
```

- Add/remove/check an edge:
```python
instance.add_vertex(1)
instance.add_vertex(2)
instance.add_edge(1, 2)
instance.broadcast()

print(instance.exists_edge(1, 2)) # True

instance.remove_edge(1, 2)
instance.broadcast()

print(instance.exists_edge(1, 2)) # False
```

- Find path between two vertices:
```python
instance.add_vertex(1)
instance.add_vertex(2)
instance.add_vertex(3)
instance.add_edge(1, 2)
instance.broadcast()

print(instance.find_path(1, 2)) # [1, 2]

instance.add_edge(1, 3)
print(instance.find_path(2, 3)) # [2, 1, 3]
```

- Query all vertices connected to a vertex:
```python
instance.add_vertex(1)
instance.add_vertex(2)
instance.add_vertex(3)
instance.add_edge(1, 2)
instance.add_edge(1, 3)
instance.broadcast()

print(instance.get_neighbors(1)) # [2, 3], returned in sorted order
```

Apart from the Python client, we can also use the REST API of an instance's gateway directly to perform queries. Check it out at `{host_name}:{port}/redoc` or `{host_name}:{port}/docs` (for example, http://127.0.0.1:8081/redoc or http://127.0.0.1:8081/docs).
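For instance, the documentation pages can be fetched over plain HTTP; the operation routes themselves are listed there, so the commented call below is only a hypothetical shape, not a documented endpoint:

```python
import requests

base = "http://127.0.0.1:8081"

# The interactive documentation pages mentioned above are plain HTTP GETs:
print(requests.get(f"{base}/docs").status_code)   # Swagger-style docs
print(requests.get(f"{base}/redoc").status_code)  # ReDoc docs

# Individual graph operations are exposed as routes listed on those pages;
# a hypothetical vertex-add call might look like:
# requests.post(f"{base}/add_vertex", json={"vertex": 1})
```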

### Testing

This project also provides some pre-defined tests to validate the system. For example, to test the functionality of a single database instance, first start an instance listening on port `8081`:
```bash
docker run -d --name cluster_1 -p 8081:8000 -e ADDRESS=http://host.docker.internal:8081 \
-e FRIEND_ADDRESS=-1 gcrdt
```
Then run the unit test as below:
```bash
python -m unittest test/unit.py
```
To check the consistency between replicas, start 5 instances and then run the integration test as below:
```bash
./run.sh # to start 5 database instances

python -m unittest test/integration.py
```

Sample test case:
```python
import unittest
from gcrdt_client import CRDTGraphClient


class CRDTGraphIntegrationTestCase(unittest.TestCase):
    cluster_1 = "http://127.0.0.1:8081"
    cluster_2 = "http://127.0.0.1:8082"
    cluster_3 = "http://127.0.0.1:8083"
    cluster_4 = "http://127.0.0.1:8084"
    cluster_5 = "http://127.0.0.1:8085"

    def test_get_neighbors_and_find_paths(self):
        instance_1 = CRDTGraphClient(CRDTGraphIntegrationTestCase.cluster_1)
        instance_2 = CRDTGraphClient(CRDTGraphIntegrationTestCase.cluster_2)
        instance_3 = CRDTGraphClient(CRDTGraphIntegrationTestCase.cluster_3)
        instance_4 = CRDTGraphClient(CRDTGraphIntegrationTestCase.cluster_4)
        instance_5 = CRDTGraphClient(CRDTGraphIntegrationTestCase.cluster_5)

        self.assertTrue(instance_1.clear())
        self.assertEqual(instance_1.broadcast(), "Success")

        self.assertEqual(instance_1.add_vertex(1)["status"], "Success")
        self.assertEqual(instance_2.add_vertex(2)["status"], "Success")
        self.assertEqual(instance_3.add_vertex(3)["status"], "Success")

        self.assertEqual(instance_1.broadcast(), "Success")
        self.assertEqual(instance_2.broadcast(), "Success")
        self.assertEqual(instance_3.broadcast(), "Success")

        self.assertEqual(instance_1.add_edge(1, 2)["status"], "Success")
        self.assertEqual(instance_1.broadcast(), "Success")

        self.assertEqual(instance_2.find_path(1, 2)["data"], [1, 2])
        self.assertEqual(instance_3.add_edge(1, 3)["status"], "Success")
        self.assertEqual(instance_3.broadcast(), "Success")

        self.assertEqual(instance_4.find_path(2, 3)["data"], [2, 1, 3])
        self.assertEqual(instance_5.get_neighbors(1), [2, 3])
        self.assertEqual(instance_5.add_vertex(4)["status"], "Success")
        self.assertEqual(instance_5.broadcast(), "Success")

        self.assertEqual(instance_1.find_path(1, 4)["status"], "Error")
```

### Improvement directions

- Back up data on disk to preserve it even when all database instances crash.
- Add an authentication and authorization layer to the gateway.
- Reduce network latency and the number of broadcast operations. We can develop a smarter routing algorithm for this: the main objective is to minimize the longest path between any two replicas in the network while not keeping too many connections (edges). That is why the routing algorithm needs to be "smarter" (see the sketch below).
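"Minimize the longest path between any two replicas" is the diameter of the replica topology. A small BFS sketch (with a made-up adjacency map) shows the quantity such a routing algorithm would optimize:

```python
from collections import deque

def diameter(adjacency):
    """Longest shortest path (in hops) over a connected replica topology.

    `adjacency` maps each replica to the replicas it keeps a connection to,
    e.g. {"A": ["B"], "B": ["A", "C"], "C": ["B"]}.
    """
    def farthest_from(start):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return max(dist.values())

    # Diameter = worst case over all sources; a routing algorithm would try to
    # keep this small without adding too many connections (edges).
    return max(farthest_from(n) for n in adjacency)

print(diameter({"A": ["B"], "B": ["A", "C"], "C": ["B"]}))  # 2
```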