# recsys-nlp-graph

**Undocumented** code for a personal project on a simple recsys via matrix factorization (part 1), and NLP and graph techniques (part 2).
Shared as part of a meet-up follow-along.

Associated articles:
- Part 1: [Building a Strong Baseline Recommender in PyTorch](https://eugeneyan.com/writing/recommender-systems-baseline-pytorch/)
- Part 2: [Beating the Baseline Recommender with Graph & NLP in Pytorch](https://eugeneyan.com/writing/recommender-systems-graph-and-nlp-pytorch/)

Talk and slides:
- [DataScience SG Meetup - RecSys, Beyond the Baseline](https://eugeneyan.com/speaking/recommender-systems-beyond-the-baseline-talk/)
- [Slideshare](https://www.slideshare.net/eugeneyan/recommender-systems-beyond-the-useritem-matrix)

## Data

Electronics and books data from the [Amazon dataset (May 1996 – July 2014)](http://jmcauley.ucsd.edu/data/amazon/) was used. Here's how an example JSON entry looks:

```
{
  "asin": "0000031852",
  "title": "Girls Ballet Tutu Zebra Hot Pink",
  "price": 3.17,
  "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
  "related": {
    "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", ..., "B007R2RM8W"],
    "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", ..., "B00BFXLZ8M"],
    "bought_together": ["B002BZX8Z6"]
  },
  "salesRank": {
    "Toys & Games": 211836
  },
  "brand": "Coxlures",
  "categories": [
    ["Sports & Outdoors", "Other Sports", "Dance"]
  ]
}
```

## Comparing Matrix Factorization to Skip-gram (Node2Vec)

### Overall results for Electronics dataset

| Model                                       | All Products (AUC-ROC) | Seen Products Only (AUC-ROC) | Runtime (min) |
|---------------------------------------------|------------------------|------------------------------|---------------|
| PyTorch Matrix Factorization                | 0.7951                 | -                            | 45            |
| Node2Vec                                    | NA                     | NA                           | NA            |
| Gensim Word2Vec                             | 0.9082                 | 0.9735                       | 2.58          |
| PyTorch Word2Vec                            | 0.9554                 | 0.9855                       | 23.63         |
| PyTorch Word2Vec with Side Info             | NA                     | NA                           | NA            |
| PyTorch Matrix Factorization with Sequences | 0.9320                 | -                            | 70.39         |
| Alibaba Paper*                              | 0.9327                 | -                            | -             |

### Overall results for Books dataset

| Model                                       | All Products (AUC-ROC) | Seen Products Only (AUC-ROC) | Runtime (min) |
|---------------------------------------------|------------------------|------------------------------|---------------|
| PyTorch Matrix Factorization                | 0.4996                 | -                            | 1353.12       |
| Gensim Word2Vec                             | 0.9701                 | 0.9892                       | 16.24         |
| PyTorch Word2Vec                            | 0.9775                 | -                            | 122.66        |
| PyTorch Word2Vec with Side Info             | NA                     | NA                           | NA            |
| PyTorch Matrix Factorization with Sequences | 0.7196                 | -                            | 1393.08       |

*[Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba](https://arxiv.org/abs/1803.02349)

### 1. Matrix Factorization (iteratively, pair by pair)

At a high level, for each product pair:

- Get the embedding for each product
- Multiply the embeddings element-wise and sum the resulting vector (this is the predicted score)
- Reduce the difference between the predicted score and the label via gradient descent, with a loss such as mean squared error or binary cross-entropy

Here's some pseudo-code of how it would work:

```
for product_pair, label in train_set:
    product1, product2 = product_pair

    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)

    # Predict product-pair score (interaction term and sum)
    prediction = sigmoid(sum(product1_emb * product2_emb, dim=1))

    # L2 regularization on the embedding weights
    l2_reg = lambda * sum(embedding.weight ** 2)

    # Minimize loss
    loss = BinaryCrossEntropyLoss(prediction, label)
    loss += l2_reg

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

For the training schedule, we run 5 epochs with cosine annealing: within each epoch, the learning rate starts high (0.01) and drops rapidly to a minimum value near zero, before being reset for the next epoch.

![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/cosine-annealing.png)

One epoch seems sufficient to achieve close to optimal AUC-ROC.

![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/implementation1-learning-curve.png)

However, if we look at the precision-recall curves below, we see that at a threshold around 0.5 we hit the "cliff of death": estimate the threshold slightly too low and precision drops from close to 1.0 to 0.5; slightly too high and recall is poor.

![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/implementation1-precision-recall.png)

### 2. Matrix Factorization with Bias

Adding a bias term reduces the steepness of the curves where they intersect, making the model more production-friendly.
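Concretely, the change is small: each product also gets a scalar bias that is added to the interaction term before the sigmoid. A minimal sketch of such a model (a hypothetical module, not the repo's exact implementation):

```python
import torch
import torch.nn as nn

class MFWithBias(nn.Module):
    """Matrix factorization where each product also has a scalar bias."""
    def __init__(self, n_products, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(n_products, emb_dim)
        self.bias = nn.Embedding(n_products, 1)  # per-product scalar bias

    def forward(self, product1, product2):
        # Interaction term (dot product) plus both products' biases
        dot = (self.emb(product1) * self.emb(product2)).sum(dim=1)
        logit = dot + self.bias(product1).squeeze(1) + self.bias(product2).squeeze(1)
        return torch.sigmoid(logit)

model = MFWithBias(n_products=10, emb_dim=8)
scores = model(torch.tensor([0, 1]), torch.tensor([2, 3]))  # one score per pair
```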
Though AUC-ROC decreases slightly, this implementation is preferable.

![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/implementation2-precision-recall.png)

### 3. `Node2Vec`

I tried the `Node2Vec` implementation [here](https://github.com/aditya-grover/node2vec), but it was too memory-intensive and slow; it didn't run to completion even on a 64 GB instance.

Digging deeper, I found that its approach to generating sequences was to traverse the graph. If you allow `networkx` to use multiple threads, it spawns multiple processes to create sequences and caches them temporarily in memory. In short, it is very memory hungry. Overall, this didn't work for the datasets I had.

### 4. `gensim.word2vec`

Gensim has an implementation of word2vec that takes in a list of sequences and can be multi-threaded. It was very easy to use and the fastest to complete five epochs.

But the precision-recall curve shows a sharp cliff at a threshold of around 0.73. This is due to out-of-vocabulary products in our validation set, which have no embeddings.

![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/implementation4-precision-recall.png)

If we _only_ evaluate in-vocabulary items, performance improves significantly.

![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/implementation4b-precision-recall.png)

### 5. `PyTorch` word2vec

We implement skip-gram in PyTorch.
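Skip-gram trains on (center, context) pairs drawn from a sliding window over each product sequence, plus randomly drawn negative samples. A minimal sketch of how such pairs could be generated (a hypothetical helper, not the repo's actual dataloader):

```python
import random

def skipgram_pairs(sequence, window=2, n_neg=2, vocab=None, seed=42):
    """Yield (center, context, negative contexts) from one product sequence."""
    rng = random.Random(seed)
    vocab = vocab or list(set(sequence))
    for i, center in enumerate(sequence):
        lo, hi = max(0, i - window), min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Negatives drawn uniformly at random from the vocabulary (a real
            # implementation would typically use a frequency-based distribution)
            negs = [rng.choice(vocab) for _ in range(n_neg)]
            yield center, sequence[j], negs

seq = ["B001T9NUFS", "B003AVEU6G", "B007ZN5Y56"]
pairs = list(skipgram_pairs(seq, window=1))
```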
Here's some simplified code of how it looks.\n\n```\nclass SkipGram(nn.Module):\n    def __init__(self, emb_size, emb_dim):\n        self.center_embeddings = nn.Embedding(emb_size, emb_dim, sparse=True)\n        self.context_embeddings = nn.Embedding(emb_size, emb_dim, sparse=True)\n\n    def forward(self, center, context, neg_context):\n        emb_center, emb_context, emb_neg_context = self.get_embeddings()\n\n        # Get score for positive pairs\n        score = torch.sum(emb_center * emb_context, dim=1)\n        score = -F.logsigmoid(score)\n\n        # Get score for negative pairs\n        neg_score = torch.bmm(emb_neg_context, emb_center.unsqueeze(2)).squeeze()\n        neg_score = -torch.sum(F.logsigmoid(-neg_score), dim=1)\n\n        # Return combined score\n        return torch.mean(score + neg_score)\n```\n\nIt performed better than `gensim` when considering all products.\n\n![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/implementation5-precision-recall.png)\n\nIf considering _only_ seen products, it's still an improvement, but less dramatic.\n\n![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/implementation5b-precision-recall.png)\n\nWhen examining the learning curves, it seems that a single epoch is sufficient. In contrast to the learning curves from matrix factorization (implementation 1), the AUC-ROC doesn't drop drastically with each learning rate reset.\n\n![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/implementation5-learning-curve.png)\n\n### 6. `PyTorch` word2vec with side info\n\nWhy did we build the skip-gram model from scratch? Because we wanted to extend it with side information (e.g., brand, category, price).\n\n```\nB001T9NUFS -\u003e B003AVEU6G -\u003e B007ZN5Y56 ... 
-\u003e B007ZN5Y56\nTelevision    Sound bar     Lamp              Standing Fan\nSony          Sony          Phillips          Dyson\n500 – 600     200 – 300     50 – 75           300 - 400\n```\n\nPerhaps by learning on these we can create better embeddings? \n\nUnfortunately, it didn't work out. Here's how the learning curve looks. \n\n![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/implementation6-learning-curve.png)\n\nOne possible reason for this non-result is the sparsity of the meta data. Out of 418,749 electronic products, we only had metadata for 162,023 (39%). Of these, brand was 51% empty.\n\n### 7. Sequences + Matrix Factorization\n\nWhy did the w2v approach do so much better than matrix factorization? Was it due to the skipgram model, or due to the training data format (i.e., sequences)?\n\nTo understand this better, I tried the previous matrix factorization with bias implementation (AUC-ROC = 0.7951) with the new sequences and dataloader. It worked very well.\n\n![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/implementation7-precision-recall.png)\n\nOddly though, the matrix factorization approach still exhibits the effect of “forgetting” as learning rate resets with each epoch (Fig 9.), though not as pronounced as Figure 3 in the previous post.\n\n![](https://raw.githubusercontent.com/eugeneyan/recsys-nlp-graph/master/images/implementation7-learning-curve.png)\n\n_I wonder if this is due to using the same embeddings for both center and context._\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feugeneyan%2Frecsys-nlp-graph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feugeneyan%2Frecsys-nlp-graph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feugeneyan%2Frecsys-nlp-graph/lists"}
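One way to probe that last question would be to tie the center and context tables in the skip-gram model so both lookups share a single embedding, then compare learning curves against the untied version. A hypothetical sketch of the tied variant (not code from this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedSkipGram(nn.Module):
    """Skip-gram where center and context share one embedding table."""
    def __init__(self, emb_size, emb_dim):
        super().__init__()
        self.embeddings = nn.Embedding(emb_size, emb_dim)  # shared table

    def forward(self, center, context, neg_context):
        emb_center = self.embeddings(center)       # (batch, dim)
        emb_context = self.embeddings(context)     # (batch, dim)
        emb_neg = self.embeddings(neg_context)     # (batch, n_neg, dim)

        # Same loss as the untied model: pull positive pairs together,
        # push negative pairs apart
        score = -F.logsigmoid(torch.sum(emb_center * emb_context, dim=1))
        neg_score = torch.bmm(emb_neg, emb_center.unsqueeze(2)).squeeze(2)
        neg_score = -torch.sum(F.logsigmoid(-neg_score), dim=1)
        return torch.mean(score + neg_score)

model = TiedSkipGram(emb_size=100, emb_dim=16)
loss = model(torch.tensor([1, 2]), torch.tensor([3, 4]),
             torch.tensor([[5, 6], [7, 8]]))  # scalar loss
```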