{"id":20216231,"url":"https://github.com/thudm/kobe","last_synced_at":"2025-04-08T03:08:52.587Z","repository":{"id":38361288,"uuid":"178392931","full_name":"THUDM/KOBE","owner":"THUDM","description":"Towards Knowledge-Based Personalized Product Description Generation in E-commerce @ KDD 2019","archived":false,"fork":false,"pushed_at":"2023-01-04T19:15:14.000Z","size":856,"stargazers_count":239,"open_issues_count":3,"forks_count":70,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-31T16:21:37.622Z","etag":null,"topics":["generative-models","knowledge-graph","personalization","sequence-to-sequence","text-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-03-29T11:26:17.000Z","updated_at":"2025-01-24T07:28:59.000Z","dependencies_parsed_at":"2023-02-02T20:16:15.076Z","dependency_job_id":null,"html_url":"https://github.com/THUDM/KOBE","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FKOBE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FKOBE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FKOBE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FKOBE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/THUDM","download_url":"https://codeload.github.com/THUDM/KOBE/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247767234,"owners_count":20992547,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["generative-models","knowledge-graph","personalization","sequence-to-sequence","text-generation"],"created_at":"2024-11-14T06:26:53.906Z","updated_at":"2025-04-08T03:08:52.569Z","avatar_url":"https://github.com/THUDM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## [KOBE v2: Towards Knowledge-Based Personalized Product Description Generation in E-commerce](https://arxiv.org/abs/1903.12457)\n\n[![Unittest](https://img.shields.io/github/actions/workflow/status/THUDM/KOBE/install.yml?branch=master)](https://github.com/THUDM/KOBE/actions/workflows/install.yml)\n[![GitHub stars](https://img.shields.io/github/stars/THUDM/KOBE)](https://github.com/THUDM/KOBE/stargazers)\n[![GitHub license](https://img.shields.io/github/license/THUDM/KOBE)](https://github.com/THUDM/KOBE/blob/master/LICENSE)\n[![Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)\n\n**New:** We release **KOBE v2**, a refactored version of the original code with the latest deep learning tools in 2021 and greatly improved installation, reproducibility, performance, and visualization, in memory of Kobe Bryant.\n\nThis repo contains code and pre-trained models for KOBE, a sequence-to-sequence based approach for automatically generating product descriptions by leveraging conditional inputs, e.g., user category, and incorporating knowledge with retrieval augmented product titles.\n\nPaper accepted at KDD 2019 (Applied Data Science Track). Latest version at [arXiv](https://arxiv.org/abs/1903.12457).\n\n- [KOBE v2: Towards Knowledge-Based Personalized Product Description Generation in E-commerce](#kobe-v2-towards-knowledge-based-personalized-product-description-generation-in-e-commerce)\n- [Prerequisites](#prerequisites)\n- [Getting Started](#getting-started)\n  - [Installation](#installation)\n  - [Dataset](#dataset)\n- [Preprocessing](#preprocessing)\n  - [Build vocabulary](#build-vocabulary)\n  - [Tokenization](#tokenization)\n- [Experiments](#experiments)\n  - [Visualization with WandB](#visualization-with-wandb)\n  - [Training your own KOBE](#training-your-own-kobe)\n  - [Evaluating KOBE](#evaluating-kobe)\n  - [Pre-trained Models](#pre-trained-models)\n- [Cite](#cite)\n\n## Prerequisites\n\n- Linux\n- Python \u003e= 3.8\n- PyTorch \u003e= 1.10\n\n## Getting Started\n\n### Installation\n\nClone and install KOBE.\n\n```bash\ngit clone https://github.com/THUDM/KOBE\ncd KOBE\npip install -e .\n```\n\nVerify that KOBE is correctly installed by `import kobe`.\n\n### Dataset\n\nWe use the **TaoDescribe** dataset, which contains 2,129,187 product titles and descriptions in Chinese.\n\nRun the following command to automatically download the dataset:\n\n```bash\npython -m kobe.data.download\n```\n\nThe downloaded files will be placed at `saved/raw/`:\n\n```\n 1.6G KOBE/saved\n 1.6G ├──raw\n  42K │  ├──test.cond\n 1.4M │  ├──test.desc\n 2.0M │  ├──test.fact\n 450K │  ├──test.title\n  17M │  ├──train.cond\n 553M │  ├──train.desc\n 794M │  ├──train.fact\n 183M │  ├──train.title\n  80K │  ├──valid.cond\n 2.6M │  ├──valid.desc\n 3.7M │  ├──valid.fact\n 853K │  └──valid.title\n...\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\nMeanings of downloaded data files\n\u003c/summary\u003e\n\u003cul\u003e\n\u003cli\u003e train/valid/test.title: The product title as input (source) \u003c/li\u003e\n\u003cli\u003e train/valid/test.desc: The product description as output (generation target) \u003c/li\u003e\n\u003cli\u003e train/valid/test.cond: The product attribute and user category used as conditions in the KOBE model. The interpretations of these tags are explained at https://github.com/THUDM/KOBE/issues/14#issuecomment-516262659. \u003c/li\u003e\n\u003cli\u003e train/valid/test.fact: The retrieved knowledge for each product \u003c/li\u003e\n\u003c/ul\u003e\n\u003c/details\u003e\n\n## Preprocessing\n\nPreprocessing is a commonly neglected part in code release. However, we now provide the preprocessing scripts to rebuild the vocabulary and tokenize the texts, just in case that you wish to preprocess the KOBE data yourself or need to run on your own data.\n\n### Build vocabulary\n\nWe use BPE to build a vocabulary on the conditions (including attributes and user categories). For texts, we will use existing BertTokenizer from the huggingface transformers library.\n\n```bash\npython -m kobe.data.vocab \\\n  --input saved/raw/train.cond \\\n  --vocab-file saved/vocab.cond \\\n  --vocab-size 31 --algo word\n```\n\n### Tokenization\n\nThen, we will tokenize the raw inputs and save the preprocessed samples to `.tar` files. Note: this process can take a while (about 20 minutes with a 8-core processor).\n\n```bash\npython -m kobe.data.preprocess \\\n  --raw-path saved/raw/ \\\n  --processed-path saved/processed/ \\\n  --split train valid test \\\n  --vocab-file bert-base-chinese \\\n  --cond-vocab-file saved/vocab.cond.model\n```\n\nYou can peek into the `saved/` directories to see what these preprocessing scripts did:\n\n```\n 8.2G KOBE/saved\n  16G ├──processed\n  20M │  ├──test.tar\n 1.0G │  ├──train-0.tar\n 1.0G │  ├──train-1.tar\n 1.0G │  ├──train-2.tar\n 1.0G │  ├──train-3.tar\n 1.0G │  ├──train-4.tar\n 1.0G │  ├──train-5.tar\n 1.0G │  ├──train-6.tar\n 1.0G │  ├──train-7.tar\n  38M │  └──valid.tar\n 1.6G ├──raw\n      │  ├──...\n 238K └──vocab.cond.model\n```\n\n## Experiments\n\n### Visualization with WandB\n\nFirst, set up [WandB](https://wandb.ai/), which is an 🌟 incredible tool for visualize deep learning experiments. In case you haven't use it before, please login and follow the instructions.\n\n```bash\nwandb login\n```\n\n### Training your own KOBE\n\nWe provide four training modes: `baseline`, `kobe-attr`, `kobe-know`, `kobe-full`, corresponding to the models explored in the paper. They can be trained with the following commands:\n\n```bash\npython -m kobe.train --mode baseline --name baseline\npython -m kobe.train --mode kobe-attr --name kobe-attr\npython -m kobe.train --mode kobe-know --name kobe-know\npython -m kobe.train --mode kobe-full --name kobe-full\n```\n\nAfter launching any of the experiment above, please go to the WandB link printed out in the terminal to view the training progress and evaluation results (updated at every epoch end about once per 2 hours).\n\nIf you would like to change other hyperparameters, please look at `kobe/utils/options.py`. For example, the default setting train the models for 30 epochs with batch size 64, which is around 1 millison steps. You could add options like `--epochs 100` to train for more epochs and obtain better results. You can also increase `--num-encoder-layers` and `--num-decoder-layers` if better GPUs available.\n\n**Expected Training Progress**\n\nWe provide a reference for the training progress (training takes about 150 hours on a 2080 Ti). The full KOBE model achieves the best BERTScore and diversity, with a slightly lower BLEU score than KOBE-Attr (as shown in the paper).\n\nThe resulting training/validation/test curves and examples are shown below:\n\n![Training Progress](docs/_static/images/training.jpg)\n\n### Evaluating KOBE\n\nEvaluation is now super convenient and reproducible with the help of pytorch-lightning and WandB. The checkpoint with best bleu score will be saved at `kobe-v2/\u003cwandb-run-id\u003e/checkpoints/\u003cbest_epoch-best_step\u003e.ckpt`. To evaluate this model, run the following command:\n\n```bash\npython -m kobe.train --mode baseline --name test-baseline --test --load-file kobe-v2/\u003cwandb-run-id\u003e/checkpoints/\u003cbest_epoch-best_step\u003e.ckpt\n```\n\nThe results will be displayed on the WandB dashboard with the link printed out in the terminal. The evaluation metrics we provide include BLEU score (sacreBLEU), diversity score and [BERTScore](https://arxiv.org/abs/1904.09675). You can also manually view some generated examples and their references under the `examples/` section on WandB.\n\nWe provide Nucleus sampling (https://arxiv.org/abs/1904.09751) to replace the beam search in the original KOBE paper. To test this great decoding strategy, run:\n\n```\npython -m kobe.train --mode baseline --name test-baseline --test --load-file kobe-v2/\u003cwandb-run-id\u003e/checkpoints/\u003cbest_epoch-best_step\u003e.ckpt --decoding-strategy nucleus\n```\n\n### Pre-trained Models\n\nPre-trained model checkpoints are available at [https://bit.ly/3FiI7Ed](https://bit.ly/3FiI7Ed) (requires network access to Google Drive). In addition, download the [vocabulary file](https://drive.google.com/file/d/1ay9mAZnnjX-ZFA9BkN_NSPlWZ2qADlvO/view?usp=sharing) and place under `saved/`\n\n## Cite\n\nPlease cite our paper if you use this code in your own work:\n\n```\n@inproceedings{chen2019towards,\n  title={Towards knowledge-based personalized product description generation in e-commerce},\n  author={Chen, Qibin and Lin, Junyang and Zhang, Yichang and Yang, Hongxia and Zhou, Jingren and Tang, Jie},\n  booktitle={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \\\u0026 Data Mining},\n  pages={3040--3050},\n  year={2019}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fkobe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthudm%2Fkobe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fkobe/lists"}