- Host: GitHub
- URL: https://github.com/rucaibox/mtgrec
- Owner: RUCAIBox
- Created: 2025-05-25T15:32:07.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-07-24T05:08:52.000Z (3 months ago)
- Last Synced: 2025-07-24T08:44:01.734Z (3 months ago)
- Language: Python
- Size: 438 KB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# MTGRec
This is the official PyTorch implementation for the paper:

> Multi-Identifier Item Tokenization for Generative Recommender Pre-training
## Overview
In this paper, we propose **MTGRec**, which leverages Multi-identifier item Tokenization to augment token sequence data for Generative Recommender pre-training. Our approach makes two key contributions: *multi-identifier item tokenization* and *curriculum recommender pre-training*.

For multi-identifier item tokenization, we adopt the Residual-Quantized Variational AutoEncoder (RQ-VAE) as the backbone of the item tokenizer and treat model checkpoints from adjacent training epochs as semantically related tokenizers. This lets us associate each item with multiple identifiers and tokenize a single item interaction sequence into several token sequences, which serve as different data groups.

For curriculum recommender pre-training, we design a data curriculum scheme based on data influence estimation: during pre-training, we dynamically adjust the sampling probability of each data group according to the influence of the data from each item tokenizer, where the influence is estimated via a first-order gradient approximation. Finally, we fine-tune the pre-trained model with a single item identifier to ensure accurate item identification during recommendation.
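For intuition, the sketch below illustrates the two ideas with hypothetical code (it is *not* the implementation in this repository): greedy residual quantization turns an item embedding into a semantic ID, running the same embedding under codebooks saved at adjacent RQ-VAE epochs yields multiple identifiers per item, and an exponentiated-gradient-style update reweights data groups by a first-order influence estimate. All function names and the exact update rule are illustrative assumptions.

```python
import torch

def quantize_residual(x, codebooks):
    """Greedy residual quantization: map one item embedding to a tuple of
    codebook indices (its semantic ID), one index per quantization level."""
    ids, residual = [], x
    for codebook in codebooks:                                # (num_codes, dim)
        dists = torch.cdist(residual.unsqueeze(0), codebook)  # (1, num_codes)
        idx = int(dists.argmin())
        ids.append(idx)
        residual = residual - codebook[idx]                   # quantize what's left
    return tuple(ids)

def multi_identifiers(x, checkpoint_codebooks):
    """One identifier per RQ-VAE checkpoint: tokenizing the same item with
    codebooks from adjacent training epochs gives several semantic IDs."""
    return [quantize_residual(x, cbs) for cbs in checkpoint_codebooks]

def update_sampling_probs(probs, group_grads, val_grad, lr=0.1):
    """Reweight data groups by first-order influence: groups whose average
    training gradient aligns with the validation gradient are sampled more
    often (an exponentiated-gradient-style update, assumed here)."""
    influence = torch.stack([torch.dot(g, val_grad) for g in group_grads])
    new_probs = probs * torch.exp(lr * influence)
    return new_probs / new_probs.sum()

# Example: 3 checkpoints, 4 codebook levels of 256 codes each, embedding dim 32.
ckpts = [[torch.randn(256, 32) for _ in range(4)] for _ in range(3)]
item = torch.randn(32)
print(multi_identifiers(item, ckpts))  # three 4-token semantic IDs for one item
```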

## Requirements
```
torch==2.4.1+cu124
transformers==4.45.2
accelerate==1.0.1
```
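The `+cu124` build of `torch` is hosted on PyTorch's own wheel index rather than PyPI, so one way to install the pinned versions (assuming a CUDA 12.4 environment) is:

```shell
pip install torch==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.45.2 accelerate==1.0.1
```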
## Datasets
You can find all the datasets we used in [Google Drive](https://drive.google.com/file/d/1MAlKxygadJiVMiYHZRM14i7pnPd8J44w/view?usp=sharing). Please download the file and unzip it to the current folder. Each dataset contains the following files:
```
dataset_name/
├── metadata.sentence.json
├── all_item_seqs.json
├── id_mapping.json
└── rqvae/
    ├── sentence-t5-base_256,256,256,256_9950.sem_ids
    ├── ...
    └── sentence-t5-base_256,256,256,256_10000.sem_ids
```
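After unzipping, a quick sanity check is to load the JSON files. The file names come from the tree above; that they are plain JSON readable with `json.load`, and what they contain, are our assumptions:

```python
import json

# Paths follow the tree above; replace dataset_name with a real dataset folder.
with open("dataset_name/all_item_seqs.json") as f:
    all_item_seqs = json.load(f)   # item interaction sequences (assumed)
with open("dataset_name/id_mapping.json") as f:
    id_mapping = json.load(f)      # raw-ID-to-index mapping (assumed)

print(len(all_item_seqs), "entries in all_item_seqs.json")
```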
## Quick Start
Train RQ-VAE and generate item semantic IDs:
```shell
cd tokenizer
bash run.sh
```
Pre-train the recommender:
```shell
bash pretrain.sh
```
Fine-tune the recommender:
```shell
bash finetune.sh
```