Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/rishisankineni/predicting-movie-ratings-using-als

CS 110 - EdX - Collaborative Filtering - Apache Spark

README


# Predicting-Movie-Ratings-using-ALS (Spark MLlib)
# CS 110 - Lab 2 - EdX

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) + ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)

# Predicting Movie Ratings

One of the most common uses of big data is to predict what users want. This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like. This lab will demonstrate how we can use Apache Spark to recommend movies to a user. We will start with some basic techniques and then use the [Spark ML][sparkml] library's Alternating Least Squares (ALS) method to make more sophisticated predictions.
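
To give a feel for the ALS API before the lab proper, here is a minimal, self-contained sketch (not the lab's official solution). The toy ratings, column names, and hyperparameters (`rank`, `maxIter`, `regParam`) are illustrative assumptions; the lab notebook provides a `sqlContext` as used below.

```python
from pyspark.ml.recommendation import ALS

# Toy ratings (userId, movieId, rating); the real lab uses the MovieLens data.
ratings_df = sqlContext.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 3.0), (1, 2, 5.0),
     (2, 1, 1.0), (2, 2, 4.0), (0, 2, 3.0), (1, 1, 2.0)],
    ["userId", "movieId", "rating"])

# Hyperparameters here are placeholders; the lab tunes them on a validation set.
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=4, maxIter=10, regParam=0.1, seed=42)
model = als.fit(ratings_df)

# Predict a rating for a (user, movie) pair the user has not rated yet.
unrated_df = sqlContext.createDataFrame([(2, 0)], ["userId", "movieId"])
model.transform(unrated_df).show()
```

The same pattern scales to the full MovieLens data: fit on a training split, tune `rank` and `regParam` on a validation split, then score held-out ratings.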

For this lab, we will use a subset of the [MovieLens stable benchmark rating dataset](http://grouplens.org/datasets/movielens/), which contains 20 million ratings; the subset is pre-mounted on Databricks. The same code you write will also work on the full dataset, though running it on Community Edition is likely to take quite a long time.
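
A loading sketch might look like the following. The file path is a placeholder, not the lab's actual mount point, and it assumes the `spark-csv` package (used with Spark 1.6) is available on the cluster, as it is on Databricks.

```python
# Placeholder path -- substitute wherever the MovieLens files are mounted.
ratings_path = "dbfs:/path/to/movielens/ratings.csv"

ratings_df = (sqlContext.read
              .format("com.databricks.spark.csv")   # spark-csv data source
              .option("header", "true")             # first line holds column names
              .option("inferSchema", "true")        # infer numeric column types
              .load(ratings_path)
              .select("userId", "movieId", "rating"))

ratings_df.show(3)
```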

In this lab:
* *Part 0*: Preliminaries
* *Part 1*: Basic Recommendations
* *Part 2*: Collaborative Filtering
* *Part 3*: Predictions for Yourself

As mentioned during the first Learning Spark lab, think carefully before calling `collect()` on any dataset. When you are working with a small dataset, calling `collect()` and then using Python to explore the data locally (in the driver program) works fine, but it will not work for a large dataset that does not fit in memory on a single machine. Solutions that call `collect()` and do local analysis that could have been done with Spark will likely fail in the autograder and not receive full credit.
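
For example, instead of pulling an entire DataFrame to the driver with `collect()`, inspect a small sample and let the cluster compute aggregates. This sketch assumes `ratings_df` is the MovieLens ratings DataFrame from the loading step above.

```python
ratings_df.show(5)        # peek at a few rows without moving the whole dataset
print(ratings_df.count()) # count is computed by the executors

# Per-movie rating counts, computed distributedly and only then displayed.
(ratings_df.groupBy("movieId")
           .count()
           .orderBy("count", ascending=False)
           .show(10))
```
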
[sparkml]: https://spark.apache.org/docs/1.6.2/api/python/pyspark.ml.html