https://github.com/ophiase/graph-database-ifncy020

🎥 Graph DataBase Course: A project on graph database creation and management using Neo4j and PostgreSQL.| Deadline: 20 jan. 2025

Host: GitHub
URL: https://github.com/ophiase/graph-database-ifncy020
Owner: Ophiase
License: apache-2.0
Created: 2024-12-13T19:06:49.000Z (6 months ago)
Default Branch: main
Last Pushed: 2025-01-19T14:57:02.000Z (4 months ago)
Last Synced: 2025-01-19T15:40:47.540Z (4 months ago)
Topics: cypher, neo4j
Language: Python
Homepage:
Size: 224 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# 🎥 [Graph DataBase - IFNCY020](https://github.com/Ophiase/Graph-DataBase-IFNCY020)

This project involves the creation and management of a graph database using Neo4j and PostgreSQL. \
The dataset includes information about works, persons, genres, episodes, and their relationships.

## Table of Contents

- [Loading the Dataset](#loading-the-dataset)
- [Executing Queries](#executing-queries)
- [Directory Structure](#directory-structure)
- [Benchmark](#benchmark)
- [Conclusion](#conclusion)
- [Future Work](#future-work)

## Loading the Dataset

To load the dataset into PostgreSQL and Neo4j, follow the instructions in the [load/README.md](load/README.md) file. \
This includes steps for setting up the databases and importing the data.

## Executing Queries

The project includes a set of predefined queries for both Neo4j and PostgreSQL. These queries can be executed using the provided scripts.

To execute the queries, navigate to the `requests` directory and run the following commands:

```bash
cd requests
python3 requests_neo4j.py
python3 requests_psql.py

# Statistics
python3 statistics.py
# Named graph statistics (requires the named graph)
python3 named_statistics.py
```

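`requests_neo4j.py` and `requests_psql.py` execute the queries located in the `requests/queries_neo4j` and `requests/queries_psql` directories, respectively. `statistics.py` and `named_statistics.py` read their queries from `requests/queries_statistics` and `requests/queries_named_statistics`.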
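
The benchmark output further down suggests each script loops over the query files and times them one by one. The following is a minimal sketch of that pattern, not the repository's actual code: `run_query_files` and the `execute` callable are hypothetical names, and `execute` stands in for whatever database driver call the real scripts use.

```python
import tempfile
import time
from pathlib import Path


def run_query_files(query_dir, execute, suffix):
    """Run every query file ending in `suffix` and return {filename: seconds}."""
    timings = {}
    for path in sorted(Path(query_dir).glob(f"*{suffix}")):
        query = path.read_text()
        start = time.perf_counter()
        execute(query)  # hypothetical callable wrapping a database driver
        elapsed = time.perf_counter() - start
        timings[path.name] = elapsed
        print(f"Executing query '{path.name}':")
        print(f"Execution time: {elapsed:.4f} seconds")
        print("-" * 40)
    return timings


# Demo with a stub executor so the sketch runs without any database.
executed = []
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "get_all_works.sql").write_text("SELECT 1;")
    Path(tmp, "notes.txt").write_text("not a query")
    timings = run_query_files(tmp, executed.append, ".sql")
```

Passing the executor as a parameter keeps the timing loop identical for `.sql` and `.cypher` files; only the callable changes.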

## Directory Structure

- 🟢 `load/`: Contains scripts for loading the dataset into PostgreSQL and Neo4j.
  - 🔵 `data/`: Directory for storing processed data files.
  - 🔵 `resources/`: Directory for storing resource files such as images and configuration files.
  - 🔵 `default/`: Directory for storing default configuration and data files.
  - 🟠 `dump.txt`: Default dump file for loading data.
  - 🟠 `imbd_to_psql.py`: Script to generate SQL scripts from raw TSV files.
  - 🟠 `psql_to_neo4j.py`: Script to migrate data from PostgreSQL to Neo4j.
  - 🟠 `named_graph.py`: Script to recreate the named graph.
  - 🟠 `load_dump.py`
  - 🟠 `dump.py`
  - 🟠 `Makefile`: Makefile to automate the loading process.
- 🟢 `requests/`: Contains scripts and query files for executing predefined queries.
  - 🟠 `requests_neo4j.py`: Script to execute Neo4j queries.
  - 🔵 `queries_neo4j/`: Contains `.cypher` files.
  - 🟠 `requests_psql.py`: Script to execute PostgreSQL queries.
  - 🔵 `queries_psql/`: Contains `.sql` files.
  - 🟠 `statistics.py`: Script to generate statistics from the dataset.
  - 🔵 `queries_statistics/`: Contains `.cypher` files.
  - 🟠 `named_statistics.py`: Script to generate statistics for named graphs.
  - 🔵 `queries_named_statistics/`: Contains `.cypher` files.

## Benchmark

PostgreSQL queries:

```bash
➜ requests git:(main) ✗ python3 requests_psql.py
# Query responses are omitted from this report
Executing query 'get_all_episodes.sql':
Execution time: 0.0011 seconds
----------------------------------------
Executing query 'data_graph_topology.sql':
Execution time: 0.1502 seconds
----------------------------------------
Executing query 'optional_match.sql':
Execution time: 0.0008 seconds
----------------------------------------
Executing query 'negative_filter.sql':
Execution time: 0.0006 seconds
----------------------------------------
Executing query 'get_all_persons.sql':
Execution time: 0.0004 seconds
----------------------------------------
Executing query 'get_all_genres.sql':
Execution time: 0.0005 seconds
----------------------------------------
Executing query 'get_all_works.sql':
Execution time: 0.0005 seconds
```

Neo4j queries:

```bash
➜ requests git:(main) ✗ python3 requests_neo4j.py
# Query responses are omitted from this report
----------------------------------------
Executing query 'shortest_path.cypher':
Execution time: 0.0459 seconds
----------------------------------------
Executing query 'collect_unwind.cypher':
Execution time: 0.0060 seconds
----------------------------------------
Executing query 'data_graph_topology.cypher':
Execution time: 0.0884 seconds
----------------------------------------
Executing query 'negative_filter.cypher':
Execution time: 0.0020 seconds
----------------------------------------
Executing query 'with_aggregate_filter.cypher':
Execution time: 0.0060 seconds
----------------------------------------
Executing query 'post_union_filter.cypher':
Execution time: 0.0028 seconds
----------------------------------------
Executing query 'with_read_update.cypher':
Execution time: 0.0146 seconds
----------------------------------------
Skip weighted_dijkstra.cypher
----------------------------------------
Executing query 'quantified_patterns.cypher':
Execution time: 0.0054 seconds
----------------------------------------
Executing query 'optional_match.cypher':
Execution time: 0.0023 seconds
----------------------------------------
Executing query 'reduce_list.cypher':
Execution time: 0.0078 seconds
----------------------------------------
Executing query 'predicate_functions.cypher':
Execution time: 0.0083 seconds
----------------------------------------
```

Statistics on the graph:

```bash
➜ requests git:(main) ✗ python3 statistics.py
Execution times:
most_common_genre.cypher: 0.0155 seconds
average_runtime.cypher: 0.0110 seconds
most_active_person.cypher: 0.0093 seconds
total_writers.cypher: 0.0063 seconds
total_directors.cypher: 0.0081 seconds
degree_distribution.cypher: 0.0140 seconds

Results:
Most Common Genre: {'genre': 'Drama', 'count': 12911}
Average Runtime: {'average_runtime': 61.746802904697546}
Most Active Person: {'person': 'Robert Ellis', 'count': 8}
Total Writers: {'total_writers': 420}
Total Directors: {'total_directors': 304}
Degree Distribution: {'degree': None, 'count': 180032}
```

Statistics that require a named graph (Graph Data Science Library):

```bash
➜ requests git:(main) ✗ python3 named_statistics.py
Execution times:
average_path_length.cypher: 1.1397 seconds
some_path_lengths_over_100.cypher: 1.5135 seconds
undirected_double_bfs.cypher: 0.5850 seconds

Results:
Average Path Length: {'avgLength': 27.476}
Some Path Lengths Over 100: {'pathLengths': [106, 123, 105, 112, 112, 111, 113, 109, 108, 105, 109, 110, 110, 111, 109, 111, 109, 105, 110, 108]}
Undirected Double Bfs: {'length(path)': 17}
```

## Conclusion

- Loading is much faster in PostgreSQL
  - Loading the full dataset into PostgreSQL: a few seconds
  - Loading a small part of the data into Neo4j: 1 hour
- Queries faster in Neo4j
  - `data_graph_topology.cypher`: 0.0884 seconds
  - `data_graph_topology.sql`: 0.1502 seconds
- Queries faster in PostgreSQL
  - `optional_match.cypher`: 0.0023 seconds
  - `optional_match.sql`: 0.0008 seconds
  - `negative_filter.cypher`: 0.0020 seconds
  - `negative_filter.sql`: 0.0006 seconds
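
To put the timings above in proportion, the ratios can be computed directly. This is only an arithmetic sketch over the figures reported in this README; the `timings` dictionary and the `compare` helper are illustrative names, and each figure comes from a single run, so sub-millisecond differences should be read with caution.

```python
# Timings in seconds, copied from the benchmark section of this README.
timings = {
    "data_graph_topology": {"neo4j": 0.0884, "psql": 0.1502},
    "optional_match": {"neo4j": 0.0023, "psql": 0.0008},
    "negative_filter": {"neo4j": 0.0020, "psql": 0.0006},
}


def compare(times):
    """Return (faster_engine, slower_time / faster_time) for one query."""
    faster = min(times, key=times.get)
    slower = max(times, key=times.get)
    return faster, times[slower] / times[faster]


summary = {name: compare(times) for name, times in timings.items()}
for name, (faster, ratio) in summary.items():
    print(f"{name}: {faster} is about {ratio:.1f}x faster")
```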

This project demonstrates the power and flexibility of graph databases for managing and analyzing complex datasets. By leveraging both PostgreSQL and Neo4j, we can efficiently store, query, and analyze data with rich relationships and complex structures.

Loading data is significantly faster in PostgreSQL than in Neo4j: PostgreSQL loads the full dataset in seconds, while Neo4j takes much longer for a smaller subset. On the other hand, the Cypher query language in Neo4j offers a more natural and intuitive way to express complex graph queries, so most of the interesting graph queries in this project are written for Neo4j.

## Future Work

- **Data Enrichment**: Integrate additional datasets to enrich the existing data.
- **Performance Optimization**: Further optimize queries and indexing strategies.
- **Advanced Analytics**: Implement advanced graph analytics and machine learning algorithms.
- **Visualization**: Develop visualization tools to explore the graph data interactively.
- **Double BFS**: Estimation of the diameter
  - The `undirected_double_bfs.cypher` query returns a path length of 17.
  - We know from other queries that it should be at least 120 (`some_path_lengths_over_100.cypher` reports lengths up to 123).
  - This discrepancy arises because the graph is directed, and Double BFS is not suited for directed graphs.
  - To address this, we either need to modify the BFS implementation (Java) to ignore edge direction or complete the graph, which could be a lengthy process.
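
As an illustration of the idea, here is a minimal pure-Python sketch of Double BFS that ignores edge direction. It is not the project's Cypher query or the Java BFS mentioned above: the function names and the toy edge list are assumptions made for the example. The result is a lower bound on the diameter of the undirected view of the graph (exact on trees).

```python
from collections import deque


def build_undirected(edges):
    """Build an adjacency map that ignores the direction of each edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_farthest(adj, start):
    """Return (node, distance) for a node farthest from `start`."""
    dist = {start: 0}
    queue = deque([start])
    farthest = start
    while queue:
        node = queue.popleft()
        if dist[node] > dist[farthest]:
            farthest = node
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return farthest, dist[farthest]


def double_bfs_lower_bound(adj, start):
    """Two BFS sweeps: start -> farthest node a, then a -> farthest node b."""
    a, _ = bfs_farthest(adj, start)
    _, length = bfs_farthest(adj, a)
    return length


# Toy directed edges forming the path 0 - 1 - 2 - 3 - 4 once direction is ignored.
edges = [(0, 1), (2, 1), (2, 3), (4, 3)]
adj = build_undirected(edges)
estimate = double_bfs_lower_bound(adj, start=2)
print(estimate)
```

Following the edges as directed, no node in this toy graph reaches more than one hop, which is the kind of gap between directed and undirected traversal described above.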
