https://github.com/harshoza36/movielens_pyspark

MovieLens Dataset analysis using Hadoop and Pyspark

big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql

Host: GitHub
URL: https://github.com/harshoza36/movielens_pyspark
Owner: HarshOza36
License: mit
Created: 2021-03-04T03:41:25.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2021-03-31T04:23:11.000Z (over 4 years ago)
Last Synced: 2025-02-28T10:17:12.672Z (8 months ago)
Topics: big-data-analytics, hadoop, movielens, movielens-data-analysis, pyspark, spark, spark-sql
Language: Jupyter Notebook
Homepage:
Size: 6.11 MB
Stars: 3
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# MovieLens_PySpark
MovieLens Dataset analysis using Hadoop and Pyspark

## How to Run

- Install Jupyter Notebook

```pip install jupyter```

Then run ```jupyter notebook``` in your command prompt; a notebook will open in your browser on localhost

- Install Java

Download Java and install it on your computer

Add Java to the Path

Go to Program Files > Java > jdk > bin

Copy the path of that folder

Go to environment variables and add it to ```Path``` under User variables
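To confirm that the Path change took effect, a small check like the one below can help. It is only a sketch: the helper name `java_version_output` is hypothetical, and it assumes you run it from a terminal opened after editing the environment variables, since already-open terminals keep the old values.

```python
import shutil
import subprocess


def java_version_output(executable="java"):
    """Return the text printed by `<executable> -version`, or None if it is not on the Path."""
    path = shutil.which(executable)  # searches the directories listed in the Path variable
    if path is None:
        return None
    # `java -version` writes its report to stderr, so read that stream first.
    completed = subprocess.run([path, "-version"], capture_output=True, text=True)
    return (completed.stderr or completed.stdout).strip()


if __name__ == "__main__":
    report = java_version_output()
    print(report if report is not None else "java was not found on the Path")
```

If the function returns `None`, the `bin` folder is most likely missing from the Path or the terminal was opened before the change.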

- Set up Java

Add a ```JAVA_HOME``` variable to environment variables

Go to Program Files > Java > jdk

Copy the path of that folder and paste it as the value of ```JAVA_HOME```

- Set up Hadoop

Add a ```HADOOP_HOME``` variable to environment variables

The git repo contains a hadoop folder

Copy the path to that folder

Paste it as the value of ```HADOOP_HOME```

- Set up Spark

Add a ```SPARK_HOME``` variable to environment variables

The git repo contains a spark zip

Unzip it

Copy the path to the unzipped folder

Paste it as the value of ```SPARK_HOME```

- Set up PySpark

Go to environment variables and add these two PySpark variables

```PYSPARK_DRIVER_PYTHON``` with value ```jupyter```

```PYSPARK_DRIVER_PYTHON_OPTS``` with value ```notebook```
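The variables defined so far can be verified in one pass. The sketch below assumes the five variable names used in this README; the function name `check_environment` and the wording of the messages are hypothetical.

```python
import os

# Variables that should point at an existing folder.
HOME_VARS = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")

# Variables that should hold an exact value, as described in this README.
EXPECTED_VALUES = {
    "PYSPARK_DRIVER_PYTHON": "jupyter",
    "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
}


def check_environment(env=None):
    """Return a list of problems found; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []
    for name in HOME_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing folder: {value}")
    for name, expected in EXPECTED_VALUES.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name} is {actual!r}, expected {expected!r}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["All variables look fine"]:
        print(line)
```

Passing a plain dictionary instead of the real environment makes it easy to try the check without touching your system settings.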

- Final Path setup

Go to Path in environment variables and add

```%SPARK_HOME%\bin```

```%HADOOP_HOME%\bin```
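To see whether both `bin` folders actually ended up on the Path, a comparison like the following can be used. This is a sketch under two assumptions: the process sees a Path in which `%SPARK_HOME%` and `%HADOOP_HOME%` have already been expanded to real folders, and the helper name `bin_dirs_on_path` is hypothetical.

```python
import os


def _normalize(path):
    """Normalize separators and letter case so equivalent paths compare equal."""
    return os.path.normcase(os.path.normpath(path))


def bin_dirs_on_path(env=None, home_vars=("SPARK_HOME", "HADOOP_HOME")):
    """Map each variable name to True if its `bin` subfolder is listed in PATH."""
    env = os.environ if env is None else env
    entries = {_normalize(p) for p in env.get("PATH", "").split(os.pathsep) if p}
    result = {}
    for name in home_vars:
        home = env.get(name)
        # An unset variable can never be on the Path, so report False for it.
        result[name] = bool(home) and _normalize(os.path.join(home, "bin")) in entries
    return result


if __name__ == "__main__":
    for name, found in bin_dirs_on_path().items():
        print(f"{name} bin folder on Path: {found}")
```

A `False` entry usually means the Path entry was mistyped or the terminal predates the change.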

### Finally, open a new terminal and type ```pyspark```
If everything is set up correctly, PySpark starts and a new Jupyter notebook opens with it
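Once the notebook is open, a quick smoke test can confirm that Spark really runs. The sketch below is not taken from the repository's notebooks: the function names, the app name `setup-check`, and the sample rows are all made up for illustration. Only the column names `movieId` and `rating` follow the MovieLens ratings file.

```python
import importlib.util


def pyspark_available() -> bool:
    """Return True if the current Python interpreter can locate the pyspark package."""
    return importlib.util.find_spec("pyspark") is not None


def run_smoke_test() -> int:
    """Get a local Spark session, build a tiny DataFrame, and return its row count."""
    # Imported here so this cell still loads when pyspark is not importable.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")      # one local thread, no cluster needed
        .appName("setup-check")  # hypothetical app name
        .getOrCreate()           # reuses the notebook's session if one already exists
    )
    # Made-up sample rows, not read from the MovieLens files.
    ratings = spark.createDataFrame(
        [(1, 4.0), (1, 3.0), (2, 5.0)],
        ["movieId", "rating"],
    )
    return ratings.count()
```

In a notebook cell, call `pyspark_available()` first and then `run_smoke_test()`; a return value of 3 means the session started and executed a job. The session is deliberately left running, because stopping it would also stop the session the notebook itself uses.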

---

Some references to help you set up

1. https://www.youtube.com/watch?v=cYL42BBL3Fo
2. https://www.youtube.com/watch?v=Xce3hccNf_c
