https://github.com/albertopirillo/smbud-project-2022
Implementation of multiple NoSQL database technologies (Neo4j, MongoDB and Apache Spark) to handle a large scale data set of scientific publications.
apache-spark mongodb neo4j nosql
- Host: GitHub
- URL: https://github.com/albertopirillo/smbud-project-2022
- Owner: albertopirillo
- License: mit
- Created: 2022-10-25T17:42:52.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2022-12-21T10:46:13.000Z (over 3 years ago)
- Last Synced: 2025-07-07T07:07:03.165Z (12 months ago)
- Topics: apache-spark, mongodb, neo4j, nosql
- Language: Jupyter Notebook
- Homepage:
- Size: 21.5 MB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Systems and Methods for Big and Unstructured Data - Project
The aim of the project is to compare different NoSQL database technologies (in particular Neo4j, MongoDB and Apache Spark).
This was done by implementing a bibliographic database capable of storing a large-scale data set of publications of different types, such as scientific papers, books and articles.
The complete report of the project is available [here](/spark/report/Third%20Delivery%20SMBUD%20-%20Group%2038.pdf).
## Pre-processing
The downloaded data sets required extensive pre-processing to fit the needs of the project.
We used Python, in particular the [Pandas](https://pandas.pydata.org/) library. All the scripts and notebooks we used are in this repository.
The project report contains detailed instructions on how to use these scripts to regenerate the exact data sets we used and to rerun the queries we executed.
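As a rough illustration of this kind of Pandas pre-processing, the sketch below cleans a handful of publication records and derives a paper-to-author table. It is not taken from the scripts in this repository: the field names (`id`, `title`, `year`, `authors`), the sample records and the helper names `clean_papers` and `paper_author_pairs` are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample records; the real data set has many more fields.
records = [
    {"id": "p1", "title": " Graph Databases ", "year": 2015,
     "authors": [{"id": "a1", "name": "Ada"}, {"id": "a2", "name": "Bob"}]},
    {"id": "p2", "title": "Document Stores", "year": 2018,
     "authors": [{"id": "a1", "name": "Ada"}]},
    {"id": "p3", "title": "Undated Paper", "year": None,
     "authors": [{"id": "a2", "name": "Bob"}]},
    {"id": "p1", "title": " Graph Databases ", "year": 2015,
     "authors": [{"id": "a1", "name": "Ada"}, {"id": "a2", "name": "Bob"}]},
]


def clean_papers(records):
    """Drop duplicate and undated papers, then normalise title and year."""
    df = pd.DataFrame(records)
    df = df.drop_duplicates(subset="id").dropna(subset=["year"])
    df = df.assign(
        title=df["title"].str.strip(),
        year=df["year"].astype(int),
    )
    return df.reset_index(drop=True)


def paper_author_pairs(df):
    """Turn the nested author lists into one row per (paper, author)."""
    exploded = df[["id", "authors"]].explode("authors")
    return pd.DataFrame({
        "paper_id": exploded["id"].to_list(),
        "author_id": [author["id"] for author in exploded["authors"]],
    })


papers = clean_papers(records)
pairs = paper_author_pairs(papers)
```

Keeping the paper-to-author pairs in a separate table is one possible choice: it maps directly onto relationships in a graph database and can be re-nested for a document database.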
## Setup
Install all the required Python packages with:

    pip install -r requirements.txt
## First delivery
The first delivery was about Neo4j, a graph database.
We used a data set downloaded from [AMiner](https://lfs.aminer.cn/misc/dblp.v11.zip).
After some additional pre-processing, the data set was uploaded into Neo4j and some queries and commands were executed.
The report of this delivery is available [here](/neo4j/report/First%20Delivery%20SMBUD%20-%20Group%2038.pdf).
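To give an idea of what preparing bibliographic data for a graph database can involve, here is a minimal sketch that splits publication records into node rows and relationship rows and serialises them as CSV. It only illustrates the general approach and is not the procedure described in the report: the record fields, the sample data and the implied relationship types (author wrote paper, paper cites paper) are assumptions.

```python
import csv
import io

# Hypothetical sample records with nested authors and a reference list.
sample_papers = [
    {"id": "p1", "title": "Graph Databases", "year": 2015,
     "authors": [{"id": "a1", "name": "Ada"}, {"id": "a2", "name": "Bob"}],
     "references": ["p2"]},
    {"id": "p2", "title": "Document Stores", "year": 2012,
     "authors": [{"id": "a1", "name": "Ada"}],
     "references": []},
]


def to_graph_rows(papers):
    """Split records into paper nodes, author nodes and two edge lists."""
    paper_nodes, authors_by_id = [], {}
    authored, cites = [], []
    for paper in papers:
        paper_nodes.append(
            {"id": paper["id"], "title": paper["title"], "year": paper["year"]}
        )
        for author in paper.get("authors", []):
            # A dict keyed by id keeps each author node only once.
            authors_by_id[author["id"]] = {"id": author["id"], "name": author["name"]}
            authored.append({"author_id": author["id"], "paper_id": paper["id"]})
        for referenced_id in paper.get("references", []):
            cites.append({"from_id": paper["id"], "to_id": referenced_id})
    return paper_nodes, list(authors_by_id.values()), authored, cites


def to_csv(rows, fieldnames):
    """Serialise a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


paper_nodes, author_nodes, authored, cites = to_graph_rows(sample_papers)
cites_csv = to_csv(cites, ["from_id", "to_id"])
```

CSV files shaped like this could then be imported with Cypher's `LOAD CSV` clause, creating one node per row of the node files and one relationship per row of the edge files.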
## Second delivery
The second delivery was about MongoDB, a document-oriented database.
We used the same data set downloaded from [AMiner](https://lfs.aminer.cn/misc/dblp.v11.zip), plus some additional data sets to highlight MongoDB's ability to handle sub-documents.
After some additional pre-processing, the data set was uploaded into MongoDB and some queries and commands were executed.
The report of this delivery is available [here](/mongodb/report/Second%20Delivery%20SMBUD%20-%20Group%2038.pdf).
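The sub-document idea mentioned above can be sketched as follows: instead of referencing authors by id, each publication document embeds its authors as nested objects. This is only an illustration under assumed field names (`author_ids`, `affiliation`) and with made-up sample data; the actual document schema is the one described in the report.

```python
import json

# Hypothetical flat inputs: a paper that references authors by id,
# and a lookup table with the author details.
authors_by_id = {
    "a1": {"name": "Ada", "affiliation": "Example University"},
    "a2": {"name": "Bob", "affiliation": "Example Institute"},
}
paper = {"id": "p1", "title": "Graph Databases", "year": 2015,
         "author_ids": ["a1", "a2"]}


def to_document(paper, authors_by_id):
    """Build one document with the authors embedded as sub-documents."""
    return {
        "_id": paper["id"],
        "title": paper["title"],
        "year": paper["year"],
        "authors": [
            {
                "name": authors_by_id[author_id]["name"],
                "affiliation": authors_by_id[author_id]["affiliation"],
            }
            for author_id in paper["author_ids"]
        ],
    }


def to_json_lines(documents):
    """Serialise documents as one JSON object per line."""
    return "\n".join(json.dumps(doc, sort_keys=True) for doc in documents)


document = to_document(paper, authors_by_id)
json_lines = to_json_lines([document])
```

A file with one JSON document per line is the default input format of `mongoimport`. Once loaded, embedded fields can be filtered with dot notation, for example a filter such as `{"authors.name": "Ada"}`.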
## Third delivery
The third delivery was about Apache Spark, a framework for large-scale data processing.
We used the same data set downloaded from [AMiner](https://lfs.aminer.cn/misc/dblp.v11.zip).
After some additional pre-processing, the data set was loaded into Apache Spark and some queries and commands were executed.
The report of this delivery is available [here](/spark/report/Third%20Delivery%20SMBUD%20-%20Group%2038.pdf).
## Software
- [Draw.io](https://app.diagrams.net/)
- [Neo4j](https://neo4j.com/)
- [MongoDB](https://www.mongodb.com/)
- [Apache Spark](https://spark.apache.org/)
- [Overleaf](https://www.overleaf.com/)
- [Python](https://www.python.org/)
- [PyCharm](https://www.jetbrains.com/pycharm/)
- [Google Colaboratory](https://colab.research.google.com/)
## License
Licensed under the [MIT License](LICENSE).