https://github.com/lenguyenthedat/dextra-viki-2015
My solution for Dextra Data Science Challenge #43 (Rakuten/Viki) https://challenges.dextra.sg/challenge/43
- Host: GitHub
- URL: https://github.com/lenguyenthedat/dextra-viki-2015
- Owner: lenguyenthedat
- Created: 2015-08-24T04:33:36.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2017-09-14T14:07:23.000Z (about 8 years ago)
- Last Synced: 2025-01-30T12:46:47.851Z (9 months ago)
- Topics: collaborative-filtering, recommendation-engine, recommendation-system, viki
- Language: Jupyter Notebook
- Size: 335 KB
- Stars: 2
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
viki-challenge
==============
http://www.dextra.sg/rakuten-viki-global-tv-recommender-challenge/
https://challenges.dextra.sg/challenge/43
# Some preliminary analysis:
https://public.tableau.com/profile/le.nguyen.the.dat#!/vizhome/Rakuten-VikiDataScienceChallenge2015/Rakuten-VikiDataScienceChallenge2015

# Presentation deck:
https://speakerdeck.com/lenguyenthedat/rakuten-viki-data-challenge-solution

# Requirements:
This solution is 100% Python. Below are a few libraries needed:

- Pandas
- Scikit-learn

# Collaborative Filtering (Jaccard Index) plus feature similarity
    $ python viki-videos-similarity.py # Pre-processing `#videos x #videos` matrix
    $ python viki-users-recommender.py # batch process

This is more practical since the `#videos x #videos` matrix is much smaller.
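As a rough illustration of the item-based idea named in this section, here is a minimal sketch of a Jaccard-index video similarity and a batch-style recommender built on it. It is not the code from `viki-videos-similarity.py` or `viki-users-recommender.py`: the function names `jaccard_matrix` and `recommend`, the `(user, video)` input format, and the toy data are all assumptions made for the example, and it ignores the feature similarities and weights entirely.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_matrix(watches):
    """Build a symmetric {video: {other_video: jaccard}} map from (user, video) pairs."""
    viewers = defaultdict(set)
    for user, video in watches:
        viewers[video].add(user)
    sim = defaultdict(dict)
    for a, b in combinations(sorted(viewers), 2):
        shared = len(viewers[a] & viewers[b])
        if shared:  # store only pairs with at least one common viewer
            score = shared / len(viewers[a] | viewers[b])
            sim[a][b] = score
            sim[b][a] = score
    return sim


def recommend(user, watches, sim, limit=3):
    """Rank videos the user has not watched by summed similarity to watched ones."""
    seen = {video for u, video in watches if u == user}
    scores = defaultdict(float)
    for video in seen:
        for other, score in sim.get(video, {}).items():
            if other not in seen:
                scores[other] += score
    return sorted(scores, key=lambda v: (-scores[v], v))[:limit]


# Toy data (hypothetical): (user_id, video_id) pairs.
watches = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"),
           ("u2", "C"), ("u3", "B"), ("u3", "C"), ("u4", "D")]
sim = jaccard_matrix(watches)
print(sim["A"]["B"])
print(recommend("u1", watches, sim))
```

The pairwise loop is quadratic in the number of videos, which is tolerable only because the video catalogue is small compared with the user base.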
Weights can be set manually:

    top_videos_limit = 50
    sim_features = ['sim_country', 'sim_language', 'sim_adult',
                    'sim_content_owner_id', 'sim_broadcast', 'sim_episode_count',
                    'sim_genres', 'sim_cast',
                    'jaccard_1_3', 'jaccard_2_3', 'jaccard_3_3',
                    'jaccard_high', 'sim_cosine_mv_ratio']
    weight_features = [3,3,5,
                       0,0,0,
                       5,5,
                       0,0,15,
                       15,45]
    weight_scores = [1,3,15]

Notes:
------
HOT VIDEOS dominate the CF results because Viki's homepage is currently dominated by:
- Top banner
- Popular show
- Top Drama
- Gender filter for male / female

TODO:
-----
- Utilize ratio instead of score.
- See whether a user is into hot / fresh videos or not.
- KNN: cosine similarity between users => top 10 similar users => recommend the top videos the user hasn't watched. Attempts so far:
  - Tried implementing with cosine similarity: the process was killed (`Killed 9`).
  - Tried Sklearn KNN: took 5h++.
  - Panns (https://github.com/ryanrhymes/panns): 2h++.
  - Trying Spotify's Annoy (https://github.com/spotify/annoy): 20 mins with 10 trees, 12 mins with 100 trees.
  - However, finding the k-NN for each user takes too long (more than 10s each to get a good enough result).
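For reference, the user-based KNN idea in the TODO could look roughly like the brute-force sketch below. It is an assumption-laden illustration, not one of the attempts listed above: `cosine`, `knn_recommend`, the `{user: {video: score}}` profile format, and the toy scores are all hypothetical. Each query compares the target against every other user, which hints at why this approach is slow when the user base is large.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse {video: score} vectors."""
    dot = sum(score * b[video] for video, score in a.items() if video in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_recommend(target, profiles, k=10, limit=3):
    """Recommend unwatched videos from the k users most similar to `target`."""
    mine = profiles[target]
    scored = ((cosine(mine, vec), user)
              for user, vec in profiles.items() if user != target)
    # Keep only users with some overlap, most similar first.
    neighbours = sorted((pair for pair in scored if pair[0] > 0), reverse=True)[:k]
    votes = Counter()
    for similarity, user in neighbours:
        for video, score in profiles[user].items():
            if video not in mine:
                votes[video] += similarity * score
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [video for video, _ in ranked[:limit]]


# Toy profiles (hypothetical): user -> {video_id: score}.
profiles = {
    "u1": {"A": 3, "B": 1},
    "u2": {"A": 3, "B": 1, "C": 2},
    "u3": {"D": 3},
}
print(knn_recommend("u1", profiles, k=2))
```

Approximate nearest-neighbour libraries such as Annoy replace the exhaustive scan with an index, trading some accuracy for query speed.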
Submission history:
-------------------
(Only those worth documenting, or created when I was not too lazy):
https://github.com/lenguyenthedat/dextra-viki-2015/blob/master/submission_history.txt