# E4571 Personalisation Theory Class Project -- Fall 2017


> Report for Part 2 of the project can be found in [Part2/report/final_project_report.pdf](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/blob/master/Part2/report/final_project_report.pdf).

**Team Members:**

| Name | GitHub | UNI |
| --- | --- | --- |
| Tejas Dharamsi | https://github.com/Dharamsitejas | td2520 |
| Abhay S Pawar | https://github.com/abhayspawar | asp2197 |
| Janak A Jain | https://github.com/janakajain | jaj2186 |
| Vijayraghavan Balaji | https://github.com/vijaybalaji30 | vb2428 |

### Steps to run the code

- Clone or download the repository
- Install dependencies: `pip3 install -r requirements.txt`
- Move to the folder `Part2/analysis`

> Report for Part 1 of the project can be found in [Part1/documents/report_part1.pdf](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/blob/master/Part1/documents/report_part1.pdf)

> Note: The main file containing the code for Part 1 is [CF-Data.ipynb](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/blob/master/Part1/analysis/CF-Data.ipynb)

### File Structure

#### Top
* Part2
  - [analysis](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part2/analysis)
    + _DatasetCreation\_Benchmark\_ContentBased.ipynb_: code for combining the datasets, the naïve baseline model, the item-item collaborative filtering model and the content-based model
    + _Hybrid.ipynb_: code for the hybrid model (LSH + content-based); also validates serendipity of the books recommended by our best model, LSH
    + _LSH\_Complete.ipynb_: code for the LSH model
    + _book\_features.ipynb_: code for generating word2vec features for books
    + _feature\_extraction\_from\_api.ipynb_: code to fetch book metadata from the Goodreads API using book ISBNs
    + _tree\_based\_ann.ipynb_: code for the Tree-based ANN model
  - [created\_datasets](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part2/created_datasets)
    + _Combine.csv_: the combined BX and Amazon dataset
    + _book\_features.csv_: data with features generated using word2vec
    + _ibsn\_features\_new\_batch.pickle_: data with features extracted from the Goodreads API and enriched using word2vec
  - [figures](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part2/figures): plots generated by our code
  - [raw-data](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part2/raw-data): the Book-Crossing dataset; the Amazon book dataset can be downloaded from [here](http://jmcauley.ucsd.edu/data/amazon/)
  - [Final_Project_Outline.pdf](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/blob/master/Part2/outline.pdf)
* Part1
  - [analysis](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part1/analysis): CF-Data.ipynb, the main Part 1 file, along with exploratory notebooks
  - [clean-data](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part1/clean-data): smaller subset datasets
  - [raw-data](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part1/raw-data): the Book-Crossing raw datasets
  - [documents](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part1/documents): instructions and report
  - [figures](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/tree/master/Part1/figures): plots for visualisation
* License
* Readme
* requirements.txt

# About the Project

![Book Shelf Image](http://www.wellbuiltstyle.com/wp-content/uploads/2015/12/library.jpg)
_Image Courtesy: WellBuiltStyle.com_


The project is part of the course on [Personalization Theory and Applications](https://ds-personalization.github.io/class/) by [Prof. Brett Vintch](http://www.cns.nyu.edu/~vintch/). The aim of this project is to create a recommender system for books that is capable of offering customized recommendations to book readers based on the books they have already read.

## Motivation

> _There is no friend as loyal as a book._ - Ernest Hemingway

Thanks to [Gutenberg](https://en.wikipedia.org/wiki/Johannes_Gutenberg) and, now, the digital boom, we have access to a vast store of collective intelligence, wisdom and stories. Indeed, humans perish, but their voices continue to resonate through human minds long after they are gone - sometimes provoking us to think, sometimes making us part of revolutions, and sometimes confiding their secrets in us. Books have the ability to make us laugh, cry, think - think hard - and, most importantly, change our lives in a way that perhaps nobody else can. In this sense, books are truly our loyal friends.

Can the importance of books as loyal friends ever be overstated? We think not. That is why we believe creating just the 'right' recommendations for readers is a noble objective. Consider it a quieter (shh... no noise in this library! :)) Facebook, or a classier Tinder, for those who like to read and listen patiently.

---

## Part II - Summary of findings

We have implemented four different types of algorithms from scratch and have compared them with a naïve model. These four models are Tree-based Approximate Nearest Neighbor (ANN), Locality Sensitive Hashing (LSH), item-item collaborative filtering (CF) and a content-based model. We also created a hybrid model that combines LSH with the content-based model.
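To illustrate the core idea behind item-item CF: a user's rating for an unseen book is predicted from the ratings they gave to the most similar books, weighted by item-item cosine similarity. This is a minimal sketch in plain Python; the data and function names are hypothetical, not the project's implementation:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two item rating vectors stored as {user: rating} dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def predict(user_ratings, item_vectors, target_item, k=2):
    """Predict a rating for target_item from the k most similar items the user rated."""
    sims = []
    for item, rating in user_ratings.items():
        s = cosine_sim(item_vectors[item], item_vectors[target_item])
        if s > 0:
            sims.append((s, rating))
    sims.sort(reverse=True)
    top = sims[:k]
    if not top:
        return None
    # similarity-weighted average of the neighbours' ratings
    return sum(s * r for s, r in top) / sum(s for s, _ in top)

# Toy item -> {user: rating} vectors (hypothetical data, not the project's dataset)
items = {
    "A": {"u1": 5, "u2": 4, "u3": 1},
    "B": {"u1": 4, "u2": 5, "u3": 2},
    "C": {"u1": 1, "u2": 2, "u3": 5},
}
prediction = predict({"A": 5, "C": 1}, items, "B")
```

Because "B" is much closer to the highly rated "A" than to the poorly rated "C", the prediction lands well above the midpoint of the user's two ratings.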

We used five-fold cross-validation for all of our developed models, which helped us select the best model to compare against the benchmark.
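A five-fold split can be generated as below; this is a generic illustration of the procedure, not the project's code:

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) index pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# 10 toy samples split into 5 folds of 2
splits = list(kfold_indices(10, k=5))
```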

We have evaluated each of the developed models on the following evaluation metrics:
- Training time
- RMSE
- MAE
- Coverage
- Novelty
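For reference, the accuracy and coverage metrics can be computed as below (an illustrative sketch; `coverage` here uses one common definition, the fraction of the catalogue that is ever recommended):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def coverage(recommended_items, catalog):
    """Fraction of the catalogue that the model ever recommends."""
    return len(set(recommended_items)) / len(catalog)

y, yhat = [4, 3, 5], [3.5, 3, 4]
err_rmse, err_mae = rmse(y, yhat), mae(y, yhat)
```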

**Results**:

Comparison of the models on the evaluation metrics:

| Model Name | Training Time (hours) | Best K | Average Test MAE | Average Test RMSE | Coverage |
| --- | --- | --- | --- | --- | --- |
| Naïve | N/A | N/A | 0.763 | 0.944 | N/A |
| Item-item CF | 4.1 | 15 | 0.553 | 0.759 | 76.0% |
| Tree-based ANN | 1.927 | 20 | 0.55 | 0.76 | |
| LSH | 1.29 | 15 | 0.573 | 0.796 | 65.6% |
| Content-based | 0.6 (approx.) | 25 | 0.593 | 0.8031 | 31.55% |
| Hybrid (LSH + Content) | 1.89 | 15 | 0.5834 | 0.799 | 46.54% |

### Tweaking the Hybrid model

After developing the Hybrid model from scratch, our next step was to evaluate it at different values of its hyper-parameter: the distribution of weights across the two underlying models. Below is a summary of the MAE and RMSE metrics for the Hybrid model across various combinations of these weights.
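The blend amounts to a convex combination of the two models' predicted ratings; a sketch, not the exact implementation (the function name is ours):

```python
def hybrid_predict(r_lsh, r_content, w_lsh=0.7):
    """Blend LSH and content-based predicted ratings with complementary weights."""
    return w_lsh * r_lsh + (1.0 - w_lsh) * r_content

# e.g. blend predicted ratings 4.0 (LSH) and 3.0 (content-based) at the 7:3 ratio
blended = hybrid_predict(4.0, 3.0, w_lsh=0.7)
```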

Performance of the Hybrid model for various weight combinations of the underlying models:

| W_LSH | W_Content | MAE | RMSE |
| --- | --- | --- | --- |
| 0.9 | 0.1 | 0.587 | 0.813 |
| 0.8 | 0.2 | 0.585 | 0.806 |
| 0.7 | 0.3 | 0.583 | 0.799 |
| 0.6 | 0.4 | 0.583 | 0.796 |
| 0.5 | 0.5 | 0.583 | 0.792 |

### Interpretations

We selected the Hybrid model with a W\_LSH to W\_Content weight ratio of 7:3 to get the right blend of coverage and serendipity. However, we observed that even at this level the coverage of the model was significantly lower than that of the LSH model we implemented from scratch. Hence, we would recommend using the LSH model for making recommendations.

### A special note on Serendipity of the best model

Our best model is LSH, which has MAE and RMSE values comparable to the traditional item-based CF model. Moreover, LSH trains in about a third of the time taken to train the item-based CF model. Another evaluation metric is the serendipity, or novelty, of recommendations.
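To illustrate the underlying idea, one standard LSH scheme for cosine similarity hashes each item vector by the signs of its dot products with random hyperplanes, so similar vectors tend to land in the same bucket. This is a generic sketch with toy vectors, not the notebook's implementation:

```python
import random

def signature(vec, planes):
    """Bit signature: the sign of the dot product with each random hyperplane."""
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0) for plane in planes)

def build_lsh_buckets(vectors, n_planes=8, dim=3, seed=0):
    """Group vectors into buckets keyed by their random-hyperplane signatures."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(signature(vec, planes), []).append(name)
    return planes, buckets

# Toy item vectors (hypothetical feature embeddings, not the project's data)
vecs = {
    "book_a": [1.0, 0.9, 0.1],
    "book_b": [0.9, 1.0, 0.2],   # nearly parallel to book_a
    "book_c": [-1.0, 0.1, 0.9],  # pointing elsewhere
}
planes, buckets = build_lsh_buckets(vecs)
```

Candidate neighbours are then only searched within the same bucket, which is what makes LSH so much faster to train than exhaustive item-item CF.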

An example recommendation is shown in Figures 8 and 9 in [the report](https://github.com/Dharamsitejas/E4571-Personalisation-Theory-Project/blob/master/Part2/report/final_project_report.pdf). An interesting recommendation in Figure 9 is "Don Quixote". It belongs to a genre not currently present in the user's repertoire of genres. What's more, Don Quixote is considered one of the most influential works of the Spanish Golden Age.

Upon closer observation, we find that Don Quixote contains several thematic plots and stylistic elements that are very similar to other books the user has read. Such a serendipitous recommendation is therefore also likely to be liked by the user, given the higher chance of similarity in stylistic and thematic patterns.

### Future Scope of Work

In the future, we would like to extend this study to convert our code into a Python package. We invite members of the larger academic community to contribute to this project.

---

## Part I - Summary of findings

We have implemented two different types of algorithms from scratch and have compared them with competitive models available from other packages. These two algorithms are item-item collaborative filtering and Non-negative Matrix Factorization (NMF).

We implemented our models using two approaches:
- Collaborative filtering based (Approach 1)
- Non-negative Matrix Factorization (NMF) based (Approach 2)
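For Approach 2, the technique factors a non-negative ratings matrix V into two non-negative factors W and H, so that W H reconstructs V. Below is a generic plain-Python sketch using the Lee-Seung multiplicative update rules, not the project's implementation; the toy matrix is hypothetical:

```python
import random

def matmul(A, B):
    """Naive matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, k=2, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V ~ W @ H via Lee-Seung multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        WH = matmul(W, H)
        Ht = transpose(H)
        num = matmul(V, Ht)              # V H^T
        den = matmul(WH, Ht)             # W H H^T
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
        Wt = transpose(W)
        num = matmul(Wt, V)              # W^T V
        den = matmul(matmul(Wt, W), H)   # W^T W H
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
    return W, H

# Toy non-negative ratings matrix (zeros treated as 0 here, not as missing)
V = [[5, 4, 0], [4, 5, 1], [1, 0, 5]]
W, H = nmf(V, k=2)
```

The multiplicative updates keep W and H non-negative throughout, which is what distinguishes NMF from an unconstrained factorization.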

We used cross-validation for all of our developed models, which helped us select the best model to compare against the benchmark.

For both these approaches, we implemented two separate models for this study: one developed from scratch, and one built using [Surprise](http://surpriselib.com/).

**Results**:

- For Approach 1, our model performed significantly better than the Surprise model on Average MAE.
- For Approach 2, our model did not fare as well as the Surprise model.

For each approach, the results are described below for each of the similarity measures, viz. **Euclidean distance**, **cosine similarity** and **Pearson correlation coefficient**:

### Approach 1: Item-Item Collaborative filtering based

**Euclidean distance**

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.54 | 0.96 |
| Surprise | 1.58 | 1.13 |

**Cosine similarity**

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.57 | 1.06 |
| Surprise | 1.64 | 1.22 |

**Pearson correlation coefficient**

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.53 | 1.01 |
| Surprise | 1.61 | 1.20 |

---

### Approach 2: Non-negative Matrix Factorization

**NMF**

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 2.97 | 2 |
| Surprise | 1.53 | 0.98 |


## Feedback

We look forward to your feedback and comments on this project. Our email IDs are our UNIs (e.g. 'td2520') and follow the rule {UNI}@columbia.edu.