https://github.com/prrao87/kuzudb-study
Benchmark study on KùzuDB, an embedded OLAP graph database, on an artificial social network dataset
embedded-database graph-database graph-db knowledge-graph kuzudb neo4j python
- Host: GitHub
- URL: https://github.com/prrao87/kuzudb-study
- Owner: prrao87
- License: mit
- Created: 2023-08-03T02:00:04.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-19T13:44:47.000Z (5 months ago)
- Last Synced: 2024-10-30T01:26:53.259Z (3 months ago)
- Topics: embedded-database, graph-database, graph-db, knowledge-graph, kuzudb, neo4j, python
- Language: Python
- Homepage:
- Size: 26.8 MB
- Stars: 28
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# KùzuDB: Benchmark study
Code for the benchmark study described in this [blog post](https://thedataquarry.com/posts/embedded-db-2/).
Neo4j version | Kùzu version | Python version
:---: | :---: | :---:
5.22.0 (community) | 0.6.0 | 3.12.4

[Kùzu](https://kuzudb.com/) is an in-process (embedded) graph database management system (GDBMS) written in C++. It is blazing fast 🔥, and is optimized for handling complex join-heavy analytical workloads on very large graphs. Kùzu's [goal](https://kuzudb.com/docusaurus/blog/what-every-gdbms-should-do-and-vision) is to do in the graph database world what DuckDB has done in the world of relational databases -- that is, to provide a fast, lightweight, embeddable graph database for analytics (OLAP) use cases, while being heavily focused on usability and developer productivity.
This study has the following goals:
* Generate an artificial social network dataset, including persons, interests and locations
* You can scale up the size of the artificial dataset using the scripts provided and test query performance on larger graphs
* Ingest the dataset into two graph databases: Kùzu and Neo4j (community edition)
* Run a set of queries in Cypher on either DB to:
  * (1) Verify that the data is ingested correctly and that the results from either DB are consistent with one another
  * (2) Compare the query performance on a suite of queries that involve multi-hop traversals and aggregations

Python (and the associated client APIs for either DB) is used to orchestrate the pipelines throughout.
## Setup
Activate a Python virtual environment and install the dependencies as follows.
```sh
# Assuming that the uv package manager is installed
# https://github.com/astral-sh/uv
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
```

## Data
An artificial social network dataset is generated specifically for this exercise, via the [Faker](https://faker.readthedocs.io/en/master/) Python library.
### Generate all data at once
A shell script `generate_data.sh` is provided in the root directory of this repo that sequentially runs the Python scripts, generating the data for the nodes and edges for the social network. This is the recommended way to generate the data. A single positional argument is provided to the shell script: The number of person profiles to generate -- this is specified as an integer, as shown below.
```sh
# Generate data with 100K persons and ~2.4M edges
bash generate_data.sh 100000
```

Running this command generates a series of files in the `output` directory, following which we can proceed to ingesting the data into a graph database.
See [./data/README.md](./data/README.md) for more details on each script that is run sequentially to generate the data.
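As a rough illustration of what the generation step produces, the sketch below builds a small set of person records and follower edges. It uses the standard library's `random` module as a stand-in for Faker, and the field names (`id`, `age`), the age range and the `avg_follows` default are assumptions for illustration, not the repo's actual schema or logic.

```python
import random


def generate_persons(num_persons: int, seed: int = 42) -> list[dict]:
    """Create `num_persons` toy person records with sequential IDs."""
    rng = random.Random(seed)
    return [
        {"id": i, "age": rng.randint(18, 65)}  # hypothetical fields
        for i in range(1, num_persons + 1)
    ]


def generate_follows(
    persons: list[dict], avg_follows: int = 24, seed: int = 42
) -> list[tuple[int, int]]:
    """Create unique directed (follower, followee) pairs without self-follows."""
    rng = random.Random(seed)
    ids = [p["id"] for p in persons]
    # Cap the target at the number of distinct ordered pairs available.
    target = min(avg_follows * len(ids), len(ids) * (len(ids) - 1))
    edges: set[tuple[int, int]] = set()
    while len(edges) < target:
        src, dst = rng.choice(ids), rng.choice(ids)
        if src != dst:
            edges.add((src, dst))
    return sorted(edges)
```

Seeding the generator keeps the output reproducible, which matters when the same dataset has to be ingested into two databases and the query results compared.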
## Graph schema
The following graph schema is used for the social network dataset.
![](./assets/kuzudb-graph-schema.png)
* `Person` node `FOLLOWS` another `Person` node
* `Person` node `LIVES_IN` a `City` node
* `Person` node `HAS_INTEREST` towards an `Interest` node
* `City` node is `CITY_IN` a `State` node
* `State` node is `STATE_IN` a `Country` node

## Ingest the data into Neo4j or Kùzu
Navigate to the [neo4j](./neo4j) and the [kuzudb](./kuzudb/) directories to see the instructions on how to ingest the data into each database.
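To make the graph schema concrete before ingestion, here is a hedged sketch that assembles Kùzu-style DDL statements for the node and relationship tables. The table names come from the schema listed earlier. The single `id INT64` primary key on every node table is an assumed placeholder, and the real property lists are defined in the `kuzudb` directory.

```python
# Node labels and relationships taken from the graph schema section.
NODE_TABLES = ["Person", "City", "State", "Country", "Interest"]
REL_TABLES = [
    ("FOLLOWS", "Person", "Person"),
    ("LIVES_IN", "Person", "City"),
    ("HAS_INTEREST", "Person", "Interest"),
    ("CITY_IN", "City", "State"),
    ("STATE_IN", "State", "Country"),
]


def build_ddl() -> list[str]:
    """Return CREATE statements; the `id INT64` key is an assumed placeholder."""
    statements = [
        f"CREATE NODE TABLE {name}(id INT64, PRIMARY KEY (id))"
        for name in NODE_TABLES
    ]
    statements += [
        f"CREATE REL TABLE {rel}(FROM {src} TO {dst})"
        for rel, src, dst in REL_TABLES
    ]
    return statements


if __name__ == "__main__":
    for stmt in build_ddl():
        print(stmt)
```

With the `kuzu` Python client, each statement could then be passed to `Connection.execute` before bulk-loading the generated files.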
The generated graph is well connected, and a sample of `Person`-`Person` connections as visualized in the Neo4j browser is shown below. Certain groups of persons form cliques, while others are central hubs with many connections. Each person can have many interests, but only one primary city of residence.
![](./assets/person-person.png)
## Run the queries
Some sample queries are run in each DB to verify that the data is ingested correctly, and that the results are consistent with one another.
The following questions are asked of both graphs:
* **Query 1**: Who are the top 3 most-followed persons?
* **Query 2**: In which city does the most-followed person live?
* **Query 3**: Which 5 cities in a particular country have the lowest average age in the network?
* **Query 4**: How many persons between ages 30-40 are there in each country?
* **Query 5**: How many men in London, United Kingdom have an interest in fine dining?
* **Query 6**: Which city has the maximum number of women that like Tennis?
* **Query 7**: Which U.S. state has the maximum number of persons between the age 23-30 who enjoy photography?
* **Query 8**: How many second-degree paths exist in the graph?
* **Query 9**: How many paths exist in the graph through persons age 50 to persons above age 25?

## Performance comparison
The run times for both ingestion and queries are compared.
* For ingestion, KùzuDB is faster than Neo4j by a factor of **~63x** overall in the run reported here (Neo4j 33.41 sec vs. Kùzu 0.53 sec) for a graph with 100K persons and ~2.4M edges.
* For OLAP queries, KùzuDB is **significantly faster** than Neo4j, especially for ones that involve multi-hop queries via nodes with many-to-many relationships.

### Benchmark conditions
The benchmark is run on an M3 MacBook Pro with 36 GB RAM.
### Ingestion performance
Case | Neo4j (sec) | Kùzu (sec) | Speedup factor
--- | ---: | ---: | ---:
Nodes | 2.33 | 0.11 | 21.2x
Edges | 31.08 | 0.42 | 74.0x
Total | 33.41 | 0.53 | 63.0x

Nodes are ingested significantly faster in Kùzu. Node ingestion in Neo4j (community edition) remains of the order of seconds, despite setting constraints on the ID fields as per its best practices. The speedup factors shown are expected to be even higher as the dataset gets larger with this approach. The only way to speed up Neo4j data ingestion is to use `admin-import` instead; however, this means you lose the ability to work in Python and have to switch languages.

### Query performance benchmark
The full benchmark numbers are in the `README.md` pages of the respective `neo4j` and `kuzudb` directories, with the high-level summary shown below.
#### Notes on benchmark timing
The benchmarks are run via the `pytest-benchmark` library for the query scripts for either DB. `pytest-benchmark`, which is built on top of `pytest`, attaches each set of runs to a timer. It uses the Python time module's [`time.perf_counter`](https://docs.python.org/3/library/time.html#time.perf_counter), which has a resolution of 500 ns, smaller than the run time of the fastest query in this dataset.
* 5 warmup runs are performed to ensure byte code compilation and to warm up the cache prior to measuring run times
* Each query is run for a **minimum of 5 rounds**, so the run times shown in each section below are the **average over a minimum of 5 rounds**, and upwards of 50 rounds for the fastest queries.
* Long-running queries (where the total run time exceeds 1 sec) are run for at least 5 rounds.
* Short-running queries (of the order of milliseconds) will run as many times as fits into a period of 1 second, so the fastest queries can run upwards of 50 times.
* Python's own GC overhead can obscure true run times, so the `--benchmark-disable-gc` argument is enabled.

See the [`pytest-benchmark` docs](https://pytest-benchmark.readthedocs.io/en/latest/calibration.html) for how the timer is calibrated and how the rounds are grouped.
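The timing procedure in these notes can be approximated with the standard library alone. The sketch below is not `pytest-benchmark` itself: the helper name `time_query` and its parameters are hypothetical. It mirrors the warmup runs, the minimum of 5 rounds, the roughly 1-second budget for short queries and the disabled GC.

```python
import gc
import time
from statistics import mean
from typing import Callable


def time_query(
    run_query: Callable[[], object],
    warmup_rounds: int = 5,
    min_rounds: int = 5,
    max_time: float = 1.0,
) -> tuple[float, int]:
    """Return (mean seconds per round, number of rounds measured)."""
    for _ in range(warmup_rounds):  # warm caches before measuring
        run_query()

    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep GC pauses out of the measurements
    try:
        timings: list[float] = []
        # Always run `min_rounds`; keep going while the time budget allows.
        while len(timings) < min_rounds or sum(timings) < max_time:
            start = time.perf_counter()
            run_query()
            timings.append(time.perf_counter() - start)
    finally:
        if gc_was_enabled:
            gc.enable()
    return mean(timings), len(timings)
```

A slow query exhausts the time budget in its first round and is therefore measured exactly `min_rounds` times, while a millisecond-scale query is repeated until the budget is spent.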
#### Neo4j vs. Kùzu single-threaded
The following table shows the run times for each query (averaged over the number of rounds run, guaranteed to be a minimum of 5 runs) and the speedup factor of Kùzu over Neo4j when Kùzu is **limited to execute queries on a single thread**.
Query | Neo4j (sec) | Kùzu (sec) | Speedup factor
--- | ---: | ---: | ---:
1 | 1.375 | 0.216 | 6.4x
2 | 0.567 | 0.253 | 2.2x
3 | 0.052 | 0.006 | 8.7x
4 | 0.047 | 0.008 | 5.9x
5 | 0.012 | 0.181 | 0.1x
6 | 0.024 | 0.059 | 0.4x
7 | 0.155 | 0.013 | 11.9x
8 | 2.988 | 0.064 | 46.7x
9 | 3.755 | 0.170 | 22.1x

#### Neo4j vs. Kùzu multi-threaded
KùzuDB (by default) supports multi-threaded execution of queries. The following results are for the same queries as above, but allowing Kùzu to choose the optimal number of threads for each query. Again, the run times for each query (averaged over the number of rounds run, guaranteed to be a minimum of 5 runs) are shown.
Query | Neo4j (sec) | Kùzu (sec) | Speedup factor
--- | ---: | ---: | ---:
1 | 1.375 | 0.251 | 5.5x
2 | 0.567 | 0.283 | 2.0x
3 | 0.052 | 0.011 | 4.7x
4 | 0.047 | 0.008 | 5.9x
5 | 0.012 | 0.017 | 0.7x
6 | 0.024 | 0.061 | 0.4x
7 | 0.155 | 0.014 | 11.1x
8 | 2.988 | 0.064 | 46.7x
9 | 3.755 | 0.142 | 26.5x

> 🔥 The second-degree path-finding queries (8 and 9) show the biggest speedup over Neo4j, due to innovations in KùzuDB's query planner and execution engine.
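For intuition about what a second-degree path count involves, the sketch below counts two-hop paths in a small directed graph with plain Python. It assumes Query 8 counts every `(a)->(b)->(c)` sequence along `FOLLOWS` edges, which may not match the exact Cypher in the repo. The function name and the toy edge list are illustrative.

```python
from collections import Counter


def count_two_hop_paths(edges: list[tuple[int, int]]) -> int:
    """Count directed paths a -> b -> c over the given (src, dst) edges."""
    out_degree = Counter(src for src, _ in edges)
    # Every edge a -> b extends into one path per outgoing edge of b.
    return sum(out_degree[dst] for _, dst in edges)


toy_edges = [(1, 2), (2, 3), (2, 4), (3, 1)]
print(count_two_hop_paths(toy_edges))  # 4
```

The count is effectively a self-join of the edge list on the middle node, so it grows quickly when hub nodes have many incoming and outgoing connections.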
### Ideas for future work
#### Scale up the dataset
You can attempt to generate a much larger artificial dataset of ~100M nodes and ~2.5B edges, and see how the performance of Kùzu and Neo4j compare, if you're interested.
```sh
# Generate data with 100M persons and ~2.5B edges (takes a long time in Python!)
bash generate_data.sh 100000000
```

The above script can take really long to run in Python. [Here's an example](https://github.com/thedataquarry/rustinpieces/tree/main/src/mock_data) of using the `fake-rs` crate in Rust to do this much faster.

#### Relationship property aggregation
Queries 1-9 in this benchmark all operate on node properties. You can add relationship properties to the dataset to see how the two DBs compare when aggregating on them. For example, add a `since` date property on the `FOLLOWS` edges to run filter queries on how long a person has been following another person.
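A minimal sketch of this idea, using only the standard library: it attaches a hypothetical `since` date to each follow edge and keeps the edges older than a threshold. The helper names, the date range and the 365-day year are assumptions for illustration and are not part of the repo.

```python
import random
from datetime import date, timedelta


def add_since_dates(
    edges: list[tuple[int, int]],
    start: date = date(2015, 1, 1),
    end: date = date(2024, 1, 1),
    seed: int = 0,
) -> list[tuple[int, int, date]]:
    """Attach a random `since` date in [start, end] to each (src, dst) edge."""
    rng = random.Random(seed)
    span_days = (end - start).days
    return [
        (src, dst, start + timedelta(days=rng.randint(0, span_days)))
        for src, dst in edges
    ]


def following_longer_than(
    edges: list[tuple[int, int, date]], years: int, today: date
) -> list[tuple[int, int, date]]:
    """Keep edges whose `since` date is more than `years` (365-day) years before `today`."""
    cutoff = today - timedelta(days=365 * years)
    return [edge for edge in edges if edge[2] < cutoff]
```

In a graph database the same filter would be expressed as a predicate on the relationship property, which is the kind of aggregation the existing nine queries do not exercise.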