https://github.com/harshoza36/movielens_pyspark
MovieLens Dataset analysis using Hadoop and Pyspark
big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql
- Host: GitHub
- URL: https://github.com/harshoza36/movielens_pyspark
- Owner: HarshOza36
- License: mit
- Created: 2021-03-04T03:41:25.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-03-31T04:23:11.000Z (over 4 years ago)
- Last Synced: 2025-02-28T10:17:12.672Z (8 months ago)
- Topics: big-data-analytics, hadoop, movielens, movielens-data-analysis, pyspark, spark, spark-sql
- Language: Jupyter Notebook
- Homepage:
- Size: 6.11 MB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# MovieLens_PySpark
MovieLens Dataset analysis using Hadoop and Pyspark

## How to Run
- Install Jupyter notebook
```pip install jupyter```
Now run ```jupyter notebook``` in your command prompt; a notebook will open in your browser on localhost.
- Install Java
Download Java (a JDK) and install it on your computer
Add Java to the Path:
Go to Program Files > Java > jdk > bin
Copy the path of that folder
Go to environment variables and add it to ```"Path"``` under User variables
- Set up Java
Add a ```"JAVA_HOME"``` variable to environment variables
Go to Program Files > Java > jdk
Copy the path of that folder and paste it as the value of the ```JAVA_HOME``` variable
- Set up Hadoop
Add ```"HADOOP_HOME"``` to environment variables
The git repo contains a hadoop folder
Copy the path to that folder
Set it as the value of the ```HADOOP_HOME``` variable
- Set up Spark
Add ```"SPARK_HOME"``` to environment variables
The git repo contains a spark zip file
Unzip it
Copy the path to the unzipped folder
Set it as the value of the ```SPARK_HOME``` variable
- Set up Pyspark
Go to environment variables and add these two Pyspark variables:
```PYSPARK_DRIVER_PYTHON``` with value ```jupyter```
```PYSPARK_DRIVER_PYTHON_OPTS``` with value ```notebook```
- Final Path setup
Go to Path in environment variables and add
```%SPARK_HOME%\bin```
```%HADOOP_HOME%\bin```
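
Before launching anything, the variables from the steps above can be checked from Python. The following is a minimal sketch, not part of the repo: the names `REQUIRED_VARS`, `_norm`, and `check_setup` are hypothetical helpers, and the comparison assumes Windows-style, case-insensitive paths as used in this guide.

```python
import os

# Variables this guide asks you to create (names taken from the steps above).
REQUIRED_VARS = [
    "JAVA_HOME",
    "HADOOP_HOME",
    "SPARK_HOME",
    "PYSPARK_DRIVER_PYTHON",
    "PYSPARK_DRIVER_PYTHON_OPTS",
]


def _norm(path):
    # Assumption: Windows paths, so compare case-insensitively and
    # ignore a trailing slash or backslash.
    return path.strip().rstrip("\\/").lower()


def check_setup(env):
    """Return a list of problems found in the environment mapping `env`."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    # Path entries are separated by os.pathsep (";" on Windows).
    entries = {_norm(p) for p in env.get("PATH", "").split(os.pathsep) if p.strip()}

    # Each of these folders should have its bin subfolder on Path.
    for home in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
        if env.get(home):
            bin_dir = os.path.join(env[home], "bin")
            if _norm(bin_dir) not in entries:
                problems.append(f"{bin_dir} is not on Path")
    return problems


# Check the current process environment and print one line per problem.
for line in check_setup(os.environ) or ["All variables look set"]:
    print(line)
```

Open a new command prompt after editing environment variables, since already-open prompts keep the old values.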
### Finally, go to the terminal and type ```pyspark```
Pyspark is now set up, and a new Jupyter notebook will open with it.

---
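
Once the notebook opens, a quick smoke test in the first cell can confirm that Spark actually starts. This is a sketch under assumptions: `spark_smoke_test` is a hypothetical helper, the app name is arbitrary, and the two rows are made-up sample data, not read from the MovieLens files in the repo.

```python
import importlib.util


def spark_smoke_test():
    """Start (or reuse) a local Spark session and count two sample rows.

    Returns (True, row_count) on success or (False, reason) on failure.
    """
    if importlib.util.find_spec("pyspark") is None:
        return False, "pyspark is not importable from this Python"
    try:
        from pyspark.sql import SparkSession

        # getOrCreate() reuses a session if one already exists.
        spark = (
            SparkSession.builder
            .master("local[1]")
            .appName("setup-smoke-test")
            .getOrCreate()
        )
        # Made-up sample rows, only to exercise the session.
        df = spark.createDataFrame(
            [(1, "Toy Story"), (2, "Jumanji")], ["movieId", "title"]
        )
        return True, df.count()
    except Exception as exc:  # e.g. Java not found or a wrong SPARK_HOME
        return False, f"{type(exc).__name__}: {exc}"


ok, detail = spark_smoke_test()
print("Spark OK, rows:" if ok else "Spark failed:", detail)
```

If the test fails, the reported reason usually points back to one of the environment variables set earlier.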
Some references to help you set up:
1. https://www.youtube.com/watch?v=cYL42BBL3Fo
2. https://www.youtube.com/watch?v=Xce3hccNf_c