https://github.com/westlake-repl/NineRec

Multimodal Dataset and Benchmark for Multi-domain and Cross-domain Recommendation System
cross-domain-recommendation cross-platform-recommendation foundation-recommendation-model image-recommendation llm llm-recommendation multi-domain-recommendation multi-modal-recommendation multimodal multimodal-recommenddation one-for-all pre-train-recommendation pre-training-dataset text-recommendation transfer-learning-recommendation transferable-recommendation


# [TPAMI 2024] [NineRec: A Benchmark Dataset Suite for Evaluating Transferable Recommendation](https://arxiv.org/pdf/2309.07705.pdf)



![Multi-Modal](https://img.shields.io/badge/Task-Multi--Modal-red)
![Foundation Model](https://img.shields.io/badge/Task-Foundation_Model-red)
![Transfer Learning](https://img.shields.io/badge/Task-Transfer_Learning-red)
![Recommendation](https://img.shields.io/badge/Task-Recommendation-red)

Quick links:
[📋Blog](#Blog) |
[🗃️Download](#Dataset) |
[📭Citation](#Citation) |
[🛠️Code](#Code) |
[🚀Evaluation](#Baseline_Evaluation) |
[🤗Leaderboard](#Leaderboard) |
[👀Others](#Tenrec) |
[💡News](#News)



# Note
In this paper, we evaluate the TransRec model by training the recommender backbone and the item modality encoder end to end, which is computationally expensive. We do this because there is not yet a widely accepted paradigm for pre-training recommendation models, and end-to-end training performs better than using pre-extracted multimodal features. However, we hope NineRec will inspire more effective and efficient methods of pre-training recommendation models, rather than limiting the field to the end-to-end training paradigm. An efficient method that goes beyond end-to-end training while remaining effectively transferable would be a great contribution to the community!

# Blog
- [Pre-training and Transfer Learning in Recommender System](https://medium.com/@lifengyi_6964/pre-training-and-transfer-learning-in-recommender-system-907b1011be6e)
- [Multimodal Multi-domain Recommendation System DataSet](https://medium.com/@lifengyi_6964/multimodal-multi-domain-recommendation-system-dataset-e814afcdc68a)
- [Recommendation-Systems-without-Explicit-ID-Features-A-Literature-Review](https://github.com/westlake-repl/Recommendation-Systems-without-Explicit-ID-Features-A-Literature-Review)

# Dataset

All datasets have been released!! If you have any questions about our dataset and code, please email us.

## Download link

- Google Drive: [Source Datasets](https://drive.google.com/file/d/1fR5XB-dC0CDsyZ4BvtUJJn7i3OnWJa-9/view?usp=sharing), [Downstream Datasets](https://drive.google.com/file/d/15RlthgPczrFbP4U7l6QflSImK5wSGP5K/view?usp=sharing)

If you are interested in pre-training on a dataset even larger than our source dataset, please see PixelRec: https://github.com/westlake-repl/PixelRec. PixelRec can be used as the source dataset for NineRec, and the NineRec downstream tasks are then cross-domain/cross-platform scenarios.




## Data Format (take QB as an example)
- `QB_cover` contains the raw images in JPG format, with the item ID as the file name.



- `QB_behaviour.tsv` contains the user-item interactions in item sequence format, where the first field is the user ID and the second field is a sequence of item IDs. This file is provided for QB and TN; for the other datasets, see Dataset Preparation below to generate it:

User ID | Item ID Sequence
------------| ----------------------------------------------------
u14500 | v17551 v165612 v288299 v14633 v350433

- `QB_pair.csv` contains the user-item interactions in user-item pair format, where the first field is the user ID, the second field is the item ID, and the third field is a timestamp:

User ID | Item ID | Timestamp
------------| -------- | --------
u14500 | v17551 | (provided for all datasets except QB and TN)

- `QB_item.csv` contains the raw texts, where the first field is the item ID, the second field is the text in Chinese, and the third field is the text in English:

Item ID | Text in Chinese | Text in English
------------| -------- | --------
v17551 | 韩国电影,《女教师》 | "Korean Movie, The Governess"

- `QB_url.csv` contains the URL link of items, where the first field is the item ID and the second field is the URL:

Item ID | URL
------------| --------
v17551 | (provided for all datasets except QB and TN)

*Note: the source dataset Bili_2M and its smaller version Bili_500K share the same image folder, `Source_Bili_2M_cover`, to save storage space.
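
The file layouts above can be read with the Python standard library alone. The following is a minimal sketch, not the repository's own loader. It assumes that `QB_behaviour.tsv` separates the user ID from the item sequence with a tab and separates item IDs with spaces, that `QB_item.csv` is comma-separated with quoted text fields, and that neither file has a header row. The function names `parse_behaviour` and `parse_items` are hypothetical.

```python
import csv
import io


def parse_behaviour(lines):
    """Map each user ID to its list of item IDs.

    Assumes one user per line: "<user_id>\t<item_id> <item_id> ...".
    """
    sequences = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, item_seq = line.split("\t", 1)
        sequences[user_id] = item_seq.split()
    return sequences


def parse_items(lines):
    """Map each item ID to a (Chinese text, English text) tuple.

    Assumes rows of the form: item_id,"text_zh","text_en".
    """
    items = {}
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # skip malformed or empty rows
        items[row[0]] = (row[1], row[2])
    return items


# In-memory samples built from the example rows shown above.
behaviour_sample = io.StringIO("u14500\tv17551 v165612 v288299 v14633 v350433\n")
item_sample = io.StringIO('v17551,"韩国电影,《女教师》","Korean Movie, The Governess"\n')

sequences = parse_behaviour(behaviour_sample)
items = parse_items(item_sample)
print(sequences["u14500"][0], items["v17551"][1])
```

For real files, pass an open file handle instead of the `io.StringIO` samples, for example `open("QB_item.csv", newline="", encoding="utf-8")`. Adjust the delimiters if the released files differ from the assumptions stated here.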

# Citation
If you use our dataset, code or find NineRec useful in your work, please cite our paper as:

```bib
@article{zhang2023ninerec,
title={NineRec: A Benchmark Dataset Suite for Evaluating Transferable Recommendation},
author={Jiaqi Zhang and Yu Cheng and Yongxin Ni and Yunzhu Pan and Zheng Yuan and Junchen Fu and Youhua Li and Jie Wang and Fajie Yuan},
journal={arXiv preprint arXiv:2309.07705},
year={2023}
}
```
> :warning: **Caution**: Privately modifying the dataset and then offering secondary downloads is prohibited. If you alter the dataset in your work, we encourage you to open-source the data processing code so others can benefit from your methods, or to notify us of your new dataset so we can list it on this GitHub page with your paper.

# Code
## Environments
```
Pytorch==1.12.1
cudatoolkit==11.2.1
sklearn==1.2.0
python==3.9.12
```
## Dataset Preparation
Run `get_lmdb.py` to build the LMDB database used for image loading. Run `get_behaviour.py` to convert the user-item pairs into item sequence format.
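
To illustrate the pair-to-sequence conversion, here is a minimal sketch of the idea. It is not the code in `get_behaviour.py`. It assumes each interaction is a `(user ID, item ID, timestamp)` triple with a numeric timestamp, and that items are ordered by ascending timestamp within each user. The function names and the sample IDs are hypothetical.

```python
import io
from collections import defaultdict


def pairs_to_sequences(rows):
    """Group (user_id, item_id, timestamp) rows into time-ordered item lists."""
    per_user = defaultdict(list)
    for user_id, item_id, timestamp in rows:
        per_user[user_id].append((float(timestamp), item_id))
    # sorted() is stable, so interactions with equal timestamps keep input order.
    return {
        user_id: [item for _, item in sorted(events, key=lambda e: e[0])]
        for user_id, events in per_user.items()
    }


def write_behaviour(sequences, out):
    """Write one line per user: user ID, a tab, then space-separated item IDs."""
    for user_id, item_ids in sequences.items():
        out.write(f"{user_id}\t{' '.join(item_ids)}\n")


# Made-up interactions, deliberately out of chronological order.
pairs = [
    ("u1", "v3", "30"),
    ("u1", "v1", "10"),
    ("u2", "v9", "5"),
    ("u1", "v2", "20"),
]
buffer = io.StringIO()
write_behaviour(pairs_to_sequences(pairs), buffer)
print(buffer.getvalue(), end="")
```

The output uses the same layout as the `QB_behaviour.tsv` example in the Data Format section.
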
## Run Experiments
Run `train.py` for pre-training and transferring. Run `test.py` for testing. See more specific instructions in each baseline.

# Baseline_Evaluation



# Leaderboard
Coming soon.

# Tenrec
Tenrec (https://github.com/yuangh-x/2022-NIPS-Tenrec) is the sibling dataset of NineRec. It includes multiple types of user feedback and multiple platforms, and is suitable for studying ID-based transfer and lifelong learning.

# News
Our lab is recruiting research assistants, interns, PhD students, and postdocs. Please contact the corresponding author.