Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/trgiangdo/fast_recsys
Accelerated Recommendation System on the rating prediction problem using Numba library.
- Host: GitHub
- URL: https://github.com/trgiangdo/fast_recsys
- Owner: trgiangdo
- License: gpl-3.0
- Created: 2020-08-02T10:54:38.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-02-11T00:32:28.000Z (almost 2 years ago)
- Last Synced: 2023-03-04T00:53:24.217Z (almost 2 years ago)
- Topics: knn, movielens, netflix-prize, numba, python, recommender-system, svd
- Language: Python
- Homepage:
- Size: 172 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 8
Metadata Files:
- Readme: readme.md
- License: LICENSE
README
# Fast Recommender System on MovieLens 20M Dataset (work in progress)
Inspired by [gbolmier's implementation of SVD using Numba](https://github.com/gbolmier/funk-svd).
This repo contains reimplementations of kNN and common matrix factorization methods using the [numba](https://github.com/numba/numba) library to accelerate `numpy` operations. Numba is a great library, and it is worth a shot for any future `numpy`-heavy implementation.
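To give a flavour of the approach, here is a minimal, self-contained sketch (not code from this repo) of a funk-SVD-style SGD epoch compiled with `numba.njit`; the array layout and hyperparameters are illustrative assumptions:

```python
import numpy as np
from numba import njit


@njit
def sgd_epoch(ratings, P, Q, lr, reg):
    """One SGD pass over (user, item, rating) triples for matrix factorization."""
    total_sq_err = 0.0
    for idx in range(ratings.shape[0]):
        u = int(ratings[idx, 0])
        i = int(ratings[idx, 1])
        r = ratings[idx, 2]
        err = r - np.dot(P[u], Q[i])
        total_sq_err += err * err
        # Update user and item latent factors using the pre-update user factors.
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * pu)
        Q[i] += lr * (err * pu - reg * Q[i])
    return total_sq_err / ratings.shape[0]


# Tiny synthetic example: 5 users, 4 items, 10 latent factors.
rng = np.random.default_rng(0)
ratings = np.array([[0, 1, 4.0], [2, 3, 3.5], [4, 0, 5.0]])
P = rng.normal(0.0, 0.1, (5, 10))
Q = rng.normal(0.0, 0.1, (4, 10))
print(sgd_epoch(ratings, P, Q, lr=0.005, reg=0.02))
```

The first call pays a one-time JIT compilation cost; subsequent epochs run as compiled machine code rather than interpreted Python loops.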
## MovieLens Dataset
The algorithms in this repo are tested on the MovieLens [20M Dataset](https://www.kaggle.com/grouplens/movielens-20m-dataset). This is a large dataset (20 million ratings).
To extract a smaller sample from it, first download MovieLens 20M and save it on your computer, for example to a `movielens20M` folder.
Then create a folder `movielens-sample` for the new sampled dataset. In `utils/sample_movielens.py` you can change the parameters to your liking:
```python
if __name__ == "__main__":
    sample_movielens(
        "movielens20M",
        "movielens-sample",
        sample_size=1000,
    )
```

where `"movielens20M"` is the folder containing the MovieLens 20M dataset and `"movielens-sample"` is the folder that will contain the newly extracted sample.
The size of the extracted sample can be changed via `sample_size`.
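For reference, a rough sketch of what such a sampling helper could look like; the actual `utils/sample_movielens.py` may differ, and the `rating.csv` file name, the column names, and the interpretation of `sample_size` as a number of users are all assumptions based on the standard MovieLens 20M release:

```python
import os

import pandas as pd


def sample_movielens(src_dir, dst_dir, sample_size=1000):
    """Keep only the ratings of `sample_size` randomly chosen users."""
    ratings = pd.read_csv(os.path.join(src_dir, "rating.csv"))
    users = ratings["userId"].drop_duplicates().sample(sample_size, random_state=42)
    sampled = ratings[ratings["userId"].isin(users)]
    os.makedirs(dst_dir, exist_ok=True)
    sampled.to_csv(os.path.join(dst_dir, "rating.csv"), index=False)
```

Sampling whole users (rather than individual ratings) keeps each remaining user's rating history intact, which matters for neighbourhood-based methods like kNN.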
## Netflix Prize Dataset
The algorithms in this repo are also tested on the [Netflix Prize dataset](https://www.kaggle.com/netflix-inc/netflix-prize-data).
Published by Netflix, the dataset contains a training set of 100 million ratings, which includes a probe set of 1 million ratings.
However, the qualifying dataset has not been published anywhere (to my knowledge). For that reason, the script in `utils/split_netflix_dataset.py` first uses the probe set as the validation set, then splits the remaining ratings into a training set and a test set.
The output consists of 3 files, `rating_train.csv`, `rating_test.csv`, and `rating_val.csv`, just like for MovieLens 20M, and can be loaded into the algorithms using `utils/DataLoader`.
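A rough sketch of that splitting logic is shown below; the column names, output paths, and the 90/10 train/test split are assumptions for illustration, not necessarily what `utils/split_netflix_dataset.py` does:

```python
import pandas as pd


def split_netflix(ratings: pd.DataFrame, probe: pd.DataFrame, test_frac=0.1, seed=42):
    """Use the probe set as the validation set; split the rest into train/test."""
    keys = ["userId", "movieId"]
    # Ratings that appear in the probe set become the validation set.
    is_probe = ratings.set_index(keys).index.isin(probe.set_index(keys).index)
    val = ratings[is_probe]
    rest = ratings[~is_probe]
    # Hold out a random fraction of the remaining ratings for testing.
    test = rest.sample(frac=test_frac, random_state=seed)
    train = rest.drop(test.index)
    train.to_csv("rating_train.csv", index=False)
    test.to_csv("rating_test.csv", index=False)
    val.to_csv("rating_val.csv", index=False)
```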
## Benchmarks
The `/examples` folder contains test runs on the MovieLens dataset.
Compared to [NicolasHug/Surprise](https://github.com/NicolasHug/Surprise), kNNBaseline with Pearson similarity scores runs much faster here (817s versus 3166s for Surprise on the MovieLens 20M dataset).
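For intuition on where that kind of speedup can come from, here is a standalone sketch (not the repo's actual code) of a Pearson similarity between two items' rating vectors, compiled with Numba; the dense-vector representation with 0 meaning "not rated" is an assumption:

```python
import numpy as np
from numba import njit


@njit
def pearson(r_i, r_j):
    """Pearson similarity between two items' rating vectors (0 = not rated)."""
    mask = (r_i > 0) & (r_j > 0)  # users who rated both items
    if mask.sum() < 2:
        return 0.0
    x = r_i[mask] - r_i[mask].mean()
    y = r_j[mask] - r_j[mask].mean()
    denom = np.sqrt((x * x).sum() * (y * y).sum())
    return (x * y).sum() / denom if denom > 0 else 0.0


# Example: two items rated by partially overlapping users.
item_a = np.array([5.0, 3.0, 0.0, 4.0])
item_b = np.array([4.0, 0.0, 2.0, 5.0])
print(pearson(item_a, item_b))
```

Because the jitted function is compiled to machine code, looping it over millions of item pairs avoids the Python-level overhead that otherwise dominates similarity computations.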