https://github.com/alfredorubin96/spark-neo4j-playground
- Host: GitHub
- URL: https://github.com/alfredorubin96/spark-neo4j-playground
- Owner: alfredorubin96
- Created: 2022-09-06T13:37:30.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-07-27T12:53:32.000Z (almost 2 years ago)
- Last Synced: 2025-02-10T06:12:20.127Z (4 months ago)
- Language: Jupyter Notebook
- Size: 33.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
A simple Docker Compose setup based on [this repo](https://github.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker); it contains some examples of how to use the Neo4j Connector for Apache Spark.
To start the project, just run:
```shell
# -d is for detached mode (runs everything in background)
docker compose up -d
```

To check the status of the cluster through the Spark UI, just connect to:
```shell
http://<host>:8080
```

If you want to code directly on the cluster, connect to JupyterLab at:
```shell
http://<host>:8888
```

If you want to access the Neo4j Browser of the container defined in the docker-compose file, note that this service is currently commented out. Uncomment it only if you don't already have a Neo4j instance running on your machine; if you want to run multiple Neo4j instances, give the Neo4j container different ports to avoid conflicts. The browser will then be available at:
```shell
http://<host>:<port>
```

From JupyterLab, you will see that the workspace contains several Python notebooks that will guide you through your first steps with the Neo4j Connector for Apache Spark.
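For orientation, here is a minimal sketch of how a notebook in this kind of environment might create a Spark session with the connector on the classpath. The master URL, package coordinates, and connector version below are assumptions for illustration, not values taken from this repository:

```python
from pyspark.sql import SparkSession

# Hypothetical notebook setup: the master URL and the connector version
# are placeholders and must be adapted to your own environment.
spark = (
    SparkSession.builder
    .appName("spark-neo4j-playground")
    # Standalone master service name as typically defined in the compose file (assumed)
    .master("spark://spark-master:7077")
    # Neo4j Connector for Apache Spark pulled from Maven (version is an assumption)
    .config("spark.jars.packages",
            "org.neo4j:neo4j-connector-apache-spark_2.12:5.1.0_for_spark_3")
    .getOrCreate()
)
```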
This repository contains two different Python Notebooks:
1. [simple_read_from_neo4j.ipynb](shared-workspace/simple_read_from_neo4j.ipynb): In this notebook you will see how to read data from Neo4j by running Cypher queries through the Neo4j Connector for Apache Spark (a read sketch follows after this list).
2. [write_to_neo4j.ipynb](shared-workspace/write_to_neo4j.ipynb): In this notebook you will test the ingestion of a dataset of commercial orders (a write sketch follows after this list). On a test machine with 16 GB of RAM and a 4-core CPU, the ingestion process, excluding the creation of the Spark DataFrame, took about 150 seconds (2.5 minutes).
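As a minimal sketch of what reading from Neo4j with the connector looks like, in the spirit of the first notebook (the connection URL, credentials, and Cypher query below are illustrative assumptions, not taken from the repository):

```python
# Read data from Neo4j by running a Cypher query through the connector.
# URL, credentials, and query are placeholders: adapt them to your instance.
df = (
    spark.read.format("org.neo4j.spark.DataSource")
    .option("url", "bolt://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("query", "MATCH (p:Person) RETURN p.name AS name, p.age AS age")
    .load()
)

df.show()
```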
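And a corresponding sketch of writing a DataFrame as nodes, along the lines of the second notebook; `orders_df`, the label, the key property, and the connection details are again assumptions for illustration:

```python
# Write a hypothetical DataFrame of orders into Neo4j as (:Order) nodes.
# "Overwrite" mode combined with node.keys performs a MERGE on orderId.
(
    orders_df.write.format("org.neo4j.spark.DataSource")
    .option("url", "bolt://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "password")
    .option("labels", ":Order")
    .option("node.keys", "orderId")
    .mode("Overwrite")
    .save()
)
```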
After running the write notebook, the final graph will have around 1.9 million nodes and 2.6 million relationships, and the final schema of your database is:
