# E4571 Personalisation Theory Class Project -- Fall 2017


> Report for Part 2 of the project can be found in [Part2/report/final_project_report.pdf](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/blob/master/Part2/report/final_project_report.pdf).

**Team Members:**

| Name | GitHub | UNI |
| --- | --- | --- |
| Tejas Dharamsi | https://github.com/Dharamsitejas | td2520 |
| Abhay S Pawar | https://github.com/abhayspawar | asp2197 |
| Janak A Jain | https://github.com/janakajain | jaj2186 |
| Vijayraghavan Balaji | https://github.com/vijaybalaji30 | vb2428 |

### Steps to run the code

- Clone or download the repository
- Install dependencies: `pip3 install -r requirements.txt`
- Move to the folder `Part2/analysis`

> Report for Part 1 of the project can be found in [Part1/documents/report_part1.pdf](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/blob/master/Part1/documents/report_part1.pdf)

> Note: The main file containing the code for Part 1 is [CF-Data.ipynb](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/blob/master/Part1/analysis/CF-Data.ipynb)

### File Structure

#### Top
* Part2
  - [analysis](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part2/analysis)
    + _DatasetCreation\_Benchmark\_ContentBased.ipynb_: code for combining the datasets, the naïve baseline model, the item-item collaborative filtering model and the content-based model
    + _Hybrid.ipynb_: code for the hybrid model (LSH + content-based); also validates serendipity of the books recommended by our best model, LSH
    + _LSH\_Complete.ipynb_: code for the LSH model
    + _book\_features.ipynb_: code for generating word2vec features for books
    + _feature\_extraction\_from\_api.ipynb_: code to fetch book metadata from the Goodreads API using book ISBNs
    + _tree\_based\_ann.ipynb_: code for the Tree-based ANN model
  - [created\_datasets](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part2/created_datasets)
    + _Combine.csv_: the combined BX and Amazon dataset
    + _book\_features.csv_: data with features generated using word2vec
    + _ibsn\_features\_new\_batch.pickle_: data with features extracted from the Goodreads API and enriched using word2vec
  - [figures](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part2/figures): plots generated by our code
  - [raw-data](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part2/raw-data): the Book-Crossing dataset; the Amazon book dataset can be downloaded from [here](http://jmcauley.ucsd.edu/data/amazon/)
  - [Final_Project_Outline.pdf](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/blob/master/Part2/outline.pdf)
* Part1
  - [analysis](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part1/analysis): CF-Data.ipynb, the main Part 1 file, along with exploratory notebooks
  - [clean-data](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part1/clean-data): smaller subset datasets
  - [raw-data](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part1/raw-data): the Book-Crossing raw datasets
  - [documents](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part1/documents): instructions and report
  - [figures](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part1/figures): plots for visualisation
* License
* Readme
* requirements.txt

# About the Project

![Book Shelf Image](http://www.wellbuiltstyle.com/wp-content/uploads/2015/12/library.jpg)
_Image Courtesy: WellBuiltStyle.com_


The project is part of the course on [Personalization Theory and Applications](https://ds-personalization.github.io/class/) by [Prof. Brett Vintch](http://www.cns.nyu.edu/~vintch/). The aim of this project is to create a recommender system for books that is capable of offering customized recommendations to book readers based on the books they have already read.

## Motivation

> _There is no friend as loyal as a book._ - Ernest Hemingway

Thanks to [Gutenberg](https://en.wikipedia.org/wiki/Johannes_Gutenberg) and, now, the digital boom, we have access to a vast store of collective intelligence, wisdom and stories. Indeed, humans perish, but their voices continue to resonate through human minds long after they are gone - sometimes provoking us to think, sometimes making us part of revolutions, and sometimes confiding their secrets in us. Books have the ability to make us laugh, cry, think - think hard - and, most importantly, change our lives in a way that perhaps nobody else can. In this sense, books are truly our loyal friends.

Can the importance of books as loyal friends ever be overstated? We think not. That is why we believe creating just the 'right' recommendations for readers is a noble objective. Consider it a quieter (shh... no noise in this library! :)) Facebook, or a classier Tinder, for those who like to read and listen patiently.

---

## Part II - Summary of findings

We have implemented four different types of algorithms from scratch and have compared them with a naïve model. These four models are Tree-based Approximate Nearest Neighbor (ANN), Locality Sensitive Hashing (LSH), item-item collaborative filtering (CF) and a content-based model. We also created a hybrid model that combines LSH with the content-based model.
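To illustrate the core idea behind item-item CF: a user's rating for an unseen book is predicted from the ratings they gave to the most similar books, weighted by item-item cosine similarity. This is a minimal sketch in plain Python; the data and function names are hypothetical, not the project's implementation:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two item rating vectors stored as {user: rating} dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def predict(user_ratings, item_vectors, target_item, k=2):
    """Predict a rating for target_item from the k most similar items the user rated."""
    sims = []
    for item, rating in user_ratings.items():
        s = cosine_sim(item_vectors[item], item_vectors[target_item])
        if s > 0:
            sims.append((s, rating))
    sims.sort(reverse=True)
    top = sims[:k]
    if not top:
        return None
    # similarity-weighted average of the neighbours' ratings
    return sum(s * r for s, r in top) / sum(s for s, _ in top)

# Toy item -> {user: rating} vectors (hypothetical data, not the project's dataset)
items = {
    "A": {"u1": 5, "u2": 4, "u3": 1},
    "B": {"u1": 4, "u2": 5, "u3": 2},
    "C": {"u1": 1, "u2": 2, "u3": 5},
}
prediction = predict({"A": 5, "C": 1}, items, "B")
```

Because "B" is much closer to the highly rated "A" than to the poorly rated "C", the prediction lands well above the midpoint of the user's two ratings.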

We used five-fold cross-validation for all of our developed models, which helped us select the best model to compare against the benchmark.
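A five-fold split can be generated as below; this is a generic illustration of the procedure, not the project's code:

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) index pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# 10 toy samples split into 5 folds of 2
splits = list(kfold_indices(10, k=5))
```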

We have evaluated each of the developed models on the following evaluation metrics:
- Training time
- RMSE
- MAE
- Coverage
- Novelty
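For reference, the accuracy and coverage metrics can be computed as below (an illustrative sketch; `coverage` here uses one common definition, the fraction of the catalogue that is ever recommended):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def coverage(recommended_items, catalog):
    """Fraction of the catalogue that the model ever recommends."""
    return len(set(recommended_items)) / len(catalog)

y, yhat = [4, 3, 5], [3.5, 3, 4]
err_rmse, err_mae = rmse(y, yhat), mae(y, yhat)
```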

**Results**:

Comparison of the models on the evaluation metrics:

| Model Name | Training Time (hours) | Best K | Average Test MAE | Average Test RMSE | Coverage |
| --- | --- | --- | --- | --- | --- |
| Naïve | N/A | N/A | 0.763 | 0.944 | N/A |
| Item-item CF | 4.1 | 15 | 0.553 | 0.759 | 76.0% |
| Tree-based ANN | 1.927 | 20 | 0.55 | 0.76 | |
| LSH | 1.29 | 15 | 0.573 | 0.796 | 65.6% |
| Content-based | 0.6 (approx.) | 25 | 0.593 | 0.8031 | 31.55% |
| Hybrid (LSH + Content) | 1.89 | 15 | 0.5834 | 0.799 | 46.54% |

### Tweaking the Hybrid model

After developing the Hybrid model from scratch, our next step was to evaluate it at different values of its hyper-parameter: the distribution of weights across the two underlying models. Below is a summary of the MAE and RMSE metrics for the Hybrid model across various combinations of these weights.
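The blend amounts to a convex combination of the two models' predicted ratings; a sketch, not the exact implementation (the function name is ours):

```python
def hybrid_predict(r_lsh, r_content, w_lsh=0.7):
    """Blend LSH and content-based predicted ratings with complementary weights."""
    return w_lsh * r_lsh + (1.0 - w_lsh) * r_content

# e.g. blend predicted ratings 4.0 (LSH) and 3.0 (content-based) at the 7:3 ratio
blended = hybrid_predict(4.0, 3.0, w_lsh=0.7)
```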

Performance of the Hybrid model for various weight combinations of the underlying models:

| W_LSH | W_Content | MAE | RMSE |
| --- | --- | --- | --- |
| 0.9 | 0.1 | 0.587 | 0.813 |
| 0.8 | 0.2 | 0.585 | 0.806 |
| 0.7 | 0.3 | 0.583 | 0.799 |
| 0.6 | 0.4 | 0.583 | 0.796 |
| 0.5 | 0.5 | 0.583 | 0.792 |

### Interpretations

We selected the Hybrid model with a W\_LSH to W\_Content weight ratio of 7:3 to get the right blend of coverage and serendipity. However, we observed that even at this level the coverage of the model was significantly lower than that of the LSH model we implemented from scratch. Hence, we would recommend using the LSH model for making recommendations.

### A special note on Serendipity of the best model

Our best model is LSH, which has MAE and RMSE values comparable to the traditional item-based CF model. Moreover, LSH trains in about a third of the time taken to train the item-based CF model. Another evaluation metric is the serendipity, or novelty, of recommendations.
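To illustrate the underlying idea, one standard LSH scheme for cosine similarity hashes each item vector by the signs of its dot products with random hyperplanes, so similar vectors tend to land in the same bucket. This is a generic sketch with toy vectors, not the notebook's implementation:

```python
import random

def signature(vec, planes):
    """Bit signature: the sign of the dot product with each random hyperplane."""
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0) for plane in planes)

def build_lsh_buckets(vectors, n_planes=8, dim=3, seed=0):
    """Group vectors into buckets keyed by their random-hyperplane signatures."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(signature(vec, planes), []).append(name)
    return planes, buckets

# Toy item vectors (hypothetical feature embeddings, not the project's data)
vecs = {
    "book_a": [1.0, 0.9, 0.1],
    "book_b": [0.9, 1.0, 0.2],   # nearly parallel to book_a
    "book_c": [-1.0, 0.1, 0.9],  # pointing elsewhere
}
planes, buckets = build_lsh_buckets(vecs)
```

Candidate neighbours are then only searched within the same bucket, which is what makes LSH so much faster to train than exhaustive item-item CF.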

An example recommendation is shown in Figures 8 and 9 in [the report](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/blob/master/Part2/report/final_project_report.pdf). An interesting recommendation in Figure 9 is "Don Quixote". It belongs to a genre not currently present in the user's repertoire of genres. What's more, Don Quixote is considered one of the most influential works of the Spanish Golden Age.

Upon closer observation, we find that Don Quixote contains several thematic plots and stylistic elements that are very similar to other books the user has read. Such a serendipitous recommendation is therefore also likely to be liked by the user, given the higher chance of similarity in stylistic and thematic patterns.

### Future Scope of Work

In the future, we would like to extend this study to convert our code into a Python package. We invite members of the larger academic community to contribute to this project.

---

## Part I - Summary of findings

We have implemented two different types of algorithms from scratch and have compared them with competitive models available from other packages. These two algorithms are item-item collaborative filtering and Non-negative Matrix Factorization (NMF).

We implemented our models using two approaches:
- Collaborative filtering based (Approach 1)
- Non-negative Matrix Factorization (NMF) based (Approach 2)
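For Approach 2, the technique factors a non-negative ratings matrix V into two non-negative factors W and H, so that W H reconstructs V. Below is a generic plain-Python sketch using the Lee-Seung multiplicative update rules, not the project's implementation; the toy matrix is hypothetical:

```python
import random

def matmul(A, B):
    """Naive matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, k=2, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V ~ W @ H via Lee-Seung multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        WH = matmul(W, H)
        Ht = transpose(H)
        num = matmul(V, Ht)              # V H^T
        den = matmul(WH, Ht)             # W H H^T
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
        Wt = transpose(W)
        num = matmul(Wt, V)              # W^T V
        den = matmul(matmul(Wt, W), H)   # W^T W H
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
    return W, H

# Toy non-negative ratings matrix (zeros treated as 0 here, not as missing)
V = [[5, 4, 0], [4, 5, 1], [1, 0, 5]]
W, H = nmf(V, k=2)
```

The multiplicative updates keep W and H non-negative throughout, which is what distinguishes NMF from an unconstrained factorization.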

We used cross-validation for all of our developed models, which helped us select the best model to compare against the benchmark.

For both these approaches, we implemented two separate models for this study: one developed from scratch, and one built using [Surprise](http://surpriselib.com/).

**Results**:

- For Approach 1, our model performed significantly better than the Surprise model on Average MAE.
- For Approach 2, our model did not fare as well as the Surprise model.

For each approach, the results are described below for each of the similarity measures, viz. **Euclidean distance**, **cosine similarity** and **Pearson correlation coefficient**:

### Approach 1: Item-Item Collaborative filtering based

**Euclidean distance**

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.54 | 0.96 |
| Surprise | 1.58 | 1.13 |

**Cosine similarity**

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.57 | 1.06 |
| Surprise | 1.64 | 1.22 |

**Pearson correlation coefficient**

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.53 | 1.01 |
| Surprise | 1.61 | 1.20 |

---

### Approach 2: Non-negative Matrix Factorization

**NMF**

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 2.97 | 2 |
| Surprise | 1.53 | 0.98 |


## Feedback

We look forward to your feedback and comments on this project. Our email IDs are our UNIs (e.g. 'td2520') and follow the rule {UNI}@columbia.edu.