Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
CS 110- EdX - Collaborative Filtering - Apache Spark
https://github.com/rishisankineni/predicting-movie-ratings-using-als
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/rishisankineni/predicting-movie-ratings-using-als
- Owner: RishiSankineni
- Created: 2016-11-01T04:52:17.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2016-11-01T04:56:36.000Z (about 8 years ago)
- Last Synced: 2024-06-24T23:28:36.185Z (6 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 68.4 KB
- Stars: 1
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Predicting-Movie-Ratings-using-ALS (Spark MLlib)
# CS 110 - Lab 2 - EdX
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) + ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)
# Predicting Movie Ratings
One of the most common uses of big data is to predict what users want. This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like. This lab will demonstrate how we can use Apache Spark to recommend movies to a user. We will start with some basic techniques, and then use the [Spark ML][sparkml] library's Alternating Least Squares method to make more sophisticated predictions.
For this lab, we will use a subset of the 20 million ratings in the [MovieLens stable benchmark rating dataset](http://grouplens.org/datasets/movielens/); this subset comes pre-mounted on Databricks. However, the same code you write will also work on the full dataset (though running against the full dataset on Community Edition is likely to take quite a long time).
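To make the workflow concrete, here is a minimal sketch of loading ratings and fitting an ALS model with PySpark. It is not the lab's exact solution: the `ratings.csv` path is hypothetical (on Databricks the data is pre-mounted elsewhere), the column names assume the standard MovieLens CSV schema, and it uses a Spark 2.x+ `SparkSession` rather than the Spark 1.6 API the lab links to.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("movie-ratings-als").getOrCreate()

# Hypothetical path; assumes the MovieLens schema: userId, movieId, rating, timestamp.
ratings = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("ratings.csv")
           .select("userId", "movieId", "rating"))

# Hold out 20% of the ratings for evaluation.
training, test = ratings.randomSplit([0.8, 0.2], seed=42)

# ALS factorizes the sparse user-movie rating matrix into low-rank
# user and item factor matrices, alternating between the two.
als = ALS(rank=8, maxIter=5, regParam=0.1,
          userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(training)

# Drop rows with NaN predictions (users or movies unseen during training)
# before computing RMSE on the held-out set.
predictions = model.transform(test).dropna()
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
print("Test RMSE:", evaluator.evaluate(predictions))
```

Hyperparameters such as `rank` and `regParam` are illustrative defaults; the lab itself tunes these by comparing validation error across several candidate values.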
In this lab:
* *Part 0*: Preliminaries
* *Part 1*: Basic Recommendations
* *Part 2*: Collaborative Filtering
* *Part 3*: Predictions for Yourself

As mentioned during the first Learning Spark lab, think carefully before calling `collect()` on any dataset. When you are using a small dataset, calling `collect()` and then using Python to get a sense of the data locally (in the driver program) works fine, but this approach fails when the dataset is too large to fit in memory on one machine. Solutions that call `collect()` and do local analysis that could have been done with Spark will likely fail in the autograder and not receive full credit.
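For example, here are a few driver-safe ways to inspect a dataset, reusing the `ratings` DataFrame from the sketch above:

```python
# Inspect a large DataFrame without pulling it all to the driver.
ratings.show(5)           # print a few rows; the data stays on the cluster
print(ratings.take(3))    # return only the first 3 rows to the driver
print(ratings.count())    # the aggregation itself runs on the cluster

# By contrast, ratings.collect() ships every row to the driver and can
# exhaust driver memory on the full 20-million-rating dataset.
```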
[sparkml]: https://spark.apache.org/docs/1.6.2/api/python/pyspark.ml.html