Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pathak-ashutosh/spark-movie-recommendation
A movie recommendation system on MovieLens 25M dataset using Python and Apache Spark
https://github.com/pathak-ashutosh/spark-movie-recommendation
alternating-least-squares apache-spark hybrid-system item-item-filtering mllib pyspark python random-forest recommendation-system recommender-system rmse spark-mllib-library spark-sql
Last synced: 4 days ago
JSON representation
A movie recommendation system on MovieLens 25M dataset using Python and Apache Spark
- Host: GitHub
- URL: https://github.com/pathak-ashutosh/spark-movie-recommendation
- Owner: pathak-ashutosh
- License: gpl-3.0
- Created: 2024-08-07T02:16:58.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-08-07T02:22:24.000Z (4 months ago)
- Last Synced: 2024-10-19T03:12:49.797Z (about 1 month ago)
- Topics: alternating-least-squares, apache-spark, hybrid-system, item-item-filtering, mllib, pyspark, python, random-forest, recommendation-system, recommender-system, rmse, spark-mllib-library, spark-sql
- Language: Python
- Homepage:
- Size: 19.5 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Movie Recommender System using PySpark
This project implements a sophisticated movie recommender system using various collaborative filtering techniques and machine learning algorithms. The system is built using Apache Spark and PySpark, leveraging the power of distributed computing for handling large-scale movie rating data.
## Features
1. **Alternating Least Squares (ALS) Model**
- Implements collaborative filtering using the ALS algorithm
- Utilizes cross-validation for hyperparameter tuning
- Evaluates performance using RMSE, MSE, and MAP metrics2. **Item-Item Collaborative Filtering**
- Computes movie similarities using Pearson correlation
- Generates recommendations based on item similarities3. **Hybrid Recommender System**
- Combines ALS and Item-Item CF predictions
- Incorporates a supervised learning component (Random Forest Regressor)
- Optimizes weights for different model components4. **Evaluation Metrics**
- Root Mean Square Error (RMSE)
- Mean Square Error (MSE)
- Mean Average Precision (MAP)## Dataset
The project uses the MovieLens 25M dataset, which includes:
- User ratings for movies
- Movie metadata (title, genres)## Implementation Details
- **Data Preparation**: The dataset is split into training (80%) and test (20%) sets.
- **ALS Model**: Implemented using Spark's MLlib with hyperparameter tuning.
- **Item-Item CF**: Custom implementation of similarity computation and prediction generation.
- **Hybrid Model**:
- First version combines ALS and Item-Item CF
- Second version adds a Random Forest Regressor trained on movie genres
- **Evaluation**: Comprehensive evaluation using various metrics, with a focus on RMSE for the hybrid model.## Results
The project demonstrates the effectiveness of hybrid approaches in recommendation systems:
- ALS Model RMSE: 0.845
- Hybrid Model (ALS + Item-Item CF) RMSE: 0.16
- Best Hybrid Model (ALS + Item-Item CF + Random Forest) RMSE: 0.43## Technologies Used
- Apache Spark
- PySpark
- PySpark MLlib
- Python## How to Run
1. Ensure you have Apache Spark and PySpark installed.
2. Download the MovieLens 25M dataset.
3. Update the file paths in the `movie_recommender_system.py` script to point to your dataset location.
4. Run the script using spark-submit or in a PySpark environment:## Future Improvements
- Incorporate more features like user demographics and movie metadata
- Experiment with deep learning models for recommendation
- Implement real-time recommendation updatesThis project showcases the power of combining multiple recommendation techniques to create a robust and accurate movie recommender system.