https://github.com/alfredorubin96/spark-neo4j-playground
- Host: GitHub
- URL: https://github.com/alfredorubin96/spark-neo4j-playground
- Owner: alfredorubin96
- Created: 2022-09-06T13:37:30.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-07-27T12:53:32.000Z (almost 2 years ago)
- Last Synced: 2025-02-10T06:12:20.127Z (4 months ago)
- Language: Jupyter Notebook
- Size: 33.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
A simple Docker Compose setup based on [this repo](https://github.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker); it contains some examples of how to use the Neo4j Connector for Apache Spark.
To start the project, just run:
```shell
# -d is for detached mode (runs everything in background)
docker compose up -d
```

To check the status of the cluster through the Spark UI, just connect to:
```shell
http://<host>:8080
```

If you want to code directly on the cluster, connect to JupyterLab at:
```shell
http://<host>:8888
```

If you want to access the Neo4j Browser of the container defined in the docker-compose file, note that this service is currently commented out. Uncomment it only if you don't already have a Neo4j instance running on your machine; if you want to run multiple Neo4j instances, give the Neo4j container different ports to avoid conflicts. The browser will then be available at:
```shell
http://<host>:<port>
```

From JupyterLab, you will see that the workspace contains several Python notebooks that will guide you through your first steps with the Neo4j Connector for Apache Spark.
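For orientation, here is a minimal sketch of how a notebook in this kind of environment might create a Spark session with the connector on the classpath. The master URL, package coordinates, and connector version below are assumptions for illustration, not values taken from this repository:

```python
from pyspark.sql import SparkSession

# Hypothetical notebook setup: the master URL and the connector version
# are placeholders and must be adapted to your own environment.
spark = (
    SparkSession.builder
    .appName("spark-neo4j-playground")
    # Standalone master service name as typically defined in the compose file (assumed)
    .master("spark://spark-master:7077")
    # Neo4j Connector for Apache Spark pulled from Maven (version is an assumption)
    .config("spark.jars.packages",
            "org.neo4j:neo4j-connector-apache-spark_2.12:5.1.0_for_spark_3")
    .getOrCreate()
)
```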
This repository contains two different Python Notebooks:
1. [simple_read_from_neo4j.ipynb](shared-workspace/simple_read_from_neo4j.ipynb): In this notebook you will see how to read data from Neo4j by running Cypher queries through the Neo4j Connector for Apache Spark (a read sketch follows after this list).
2. [write_to_neo4j.ipynb](shared-workspace/write_to_neo4j.ipynb): In this notebook you will test the ingestion of a dataset of commercial orders (a write sketch follows after this list). On a test machine with 16 GB of RAM and a 4-core CPU, the ingestion process, excluding the creation of the Spark DataFrame, took about 150 seconds (2.5 minutes).
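As a minimal sketch of what reading from Neo4j with the connector looks like, in the spirit of the first notebook (the connection URL, credentials, and Cypher query below are illustrative assumptions, not taken from the repository):

```python
# Read data from Neo4j by running a Cypher query through the connector.
# URL, credentials, and query are placeholders: adapt them to your instance.
df = (
    spark.read.format("org.neo4j.spark.DataSource")
    .option("url", "bolt://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("query", "MATCH (p:Person) RETURN p.name AS name, p.age AS age")
    .load()
)

df.show()
```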
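And a corresponding sketch of writing a DataFrame as nodes, along the lines of the second notebook; `orders_df`, the label, the key property, and the connection details are again assumptions for illustration:

```python
# Write a hypothetical DataFrame of orders into Neo4j as (:Order) nodes.
# "Overwrite" mode combined with node.keys performs a MERGE on orderId.
(
    orders_df.write.format("org.neo4j.spark.DataSource")
    .option("url", "bolt://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("labels", ":Order")
    .option("node.keys", "orderId")
    .mode("Overwrite")
    .save()
)
```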
After running the write notebook, the final graph will have around 1.9 million nodes and 2.6 million relationships, and the final schema of your database is:
