{"id":13958437,"url":"https://github.com/silverriver/MMChat","last_synced_at":"2025-07-21T00:30:47.277Z","repository":{"id":57751043,"uuid":"478458899","full_name":"silverriver/MMChat","owner":"silverriver","description":"[LREC] MMChat: Multi-Modal Chat Dataset on Social Media","archived":false,"fork":false,"pushed_at":"2022-09-25T01:51:51.000Z","size":178,"stargazers_count":98,"open_issues_count":1,"forks_count":6,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-28T02:34:43.768Z","etag":null,"topics":["dataset","dialogue","multimodal"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/silverriver.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-04-06T07:54:24.000Z","updated_at":"2024-10-09T07:09:53.000Z","dependencies_parsed_at":"2022-08-26T09:30:33.920Z","dependency_job_id":null,"html_url":"https://github.com/silverriver/MMChat","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/silverriver/MMChat","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/silverriver%2FMMChat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/silverriver%2FMMChat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/silverriver%2FMMChat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/silverriver%2FMMChat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/silverriver","download_url":"https://codeload.github.com/silverriver/MMChat/tar.gz/refs/heads/main","sbom_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/silverriver%2FMMChat/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266221246,"owners_count":23894964,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","dialogue","multimodal"],"created_at":"2024-08-08T13:01:35.292Z","updated_at":"2025-07-21T00:30:46.998Z","avatar_url":"https://github.com/silverriver.png","language":"Python","funding_links":[],"categories":["其他_机器视觉","Datasets"],"sub_categories":["网络服务_其他"],"readme":"# MMChat\n\nThis repo contains the code and data for the LREC2022 paper \n**[MMChat: Multi-Modal Chat Dataset on Social Media](https://arxiv.org/abs/2108.07154)**.\n\n## News\n\n- 2022-06-09: [MMChat](https://huggingface.co/datasets/silver/mmchat) is now available through huggingface's [datasets](https://github.com/huggingface/datasets) lib：\n\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"silver/mmchat\")\n# or \n# dataset = load_dataset(\"silver/mmchat\", \"mmchat_hf\")\n# dataset = load_dataset(\"silver/mmchat\", \"mmchat_raw\")\n# dataset = load_dataset(\"silver/mmchat\", \"mmchat_lccc_filtered\")\n```\n\n## Dataset\n\nMMChat is a large-scale dialogue dataset that contains image-grounded dialogues in Chinese.\nEach dialogue in MMChat is associated with one or more images (maximum 9 images per dialogue).\nWe design various strategies to ensure the quality of the dialogues in MMChat. Please read our paper for more details.\nThe images in the dataset are hosted on Weibo's static image server. 
\nYou can refer to the scripts provided in `data_processing/weibo_image_crawler` to download these images.\n\nTwo sample dialogues from MMChat are given below (translated from Chinese):\n![A sample dialogue from MMChat](/bin/sample.jpg)\n\nMMChat is released in different versions:\n\n### MMChat\n\nThe MMChat dataset reported in our paper is given here.\nThe Weibo content corresponding to each of these dialogues is \"分享图片\" (i.e., \"Share Images\" in English).\nThe following table shows some basic statistics:\n\n| Item Description                     | Count    |\n|--------------------------------------|---------:|\n| Sessions                             | 120.84 K |\n| Sessions with more than 4 utterances |  17.32 K |\n| Utterances                           | 314.13 K |\n| Images                               | 198.82 K |\n| Avg. utterance per session           |    2.599 |\n| Avg. image per session               |    2.791 |\n| Avg. character per utterance         |    8.521 |\n\nThe above dialogues can be downloaded from either [Google Drive](https://drive.google.com/drive/folders/1sBzuJzOpPEj6-IoXl3drvfqQ8i1_tluX?usp=sharing) or [Baidu Netdisk](https://pan.baidu.com/s/1m9nwZejujNUIcVUiIKcxPg?pwd=nrqr).\n\n### MMChat-hf\n\nWe performed human annotation on sampled dialogues to determine whether the given images are related to the corresponding dialogues.\nThe following table shows statistics only for the dialogues that were annotated as image-related.\n\n| Item Description                     | Count   |\n|--------------------------------------|--------:|\n| Sessions                             | 19.90 K |\n| Sessions with more than 4 utterances | 8.91 K |\n| Utterances                           | 81.06 K |\n| Images                               | 52.66 K |\n| Avg. utterance per session           | 4.07 |\n| Avg. image per session               | 2.70 |\n| Avg. 
character per utterance         | 11.93 |\n\nWe annotated about **100K** dialogues.\nAll the annotated dialogues can be downloaded from either [Google Drive](https://drive.google.com/drive/folders/1dGg4Coc4bwH7tk7SWn0quTwMYxn-kX70?usp=sharing) or [Baidu Netdisk](https://pan.baidu.com/s/11l-bYAKoLkm4k7zDPrfZvg?pwd=zfw2).\n\n### Rule Filtered Raw MMChat\n\nWe are also releasing the raw dialogues we collected to facilitate further research.\nThis version of MMChat contains raw dialogues filtered by our rules.\nThe following table shows some basic statistics:\n\n| Item Description                     | Count    |\n|--------------------------------------|---------:|\n| Sessions                             | 4.257 M  |\n| Sessions with more than 4 utterances | 2.304 M  |\n| Utterances                           | 18.590 M |\n| Images                               | 4.874 M  |\n| Avg. utterance per session           | 4.367    |\n| Avg. image per session               | 1.670    |\n| Avg. character per utterance         | 14.104   |\n\nWe divide the above dialogues into 9 splits to make them easier to download:\n\n0. Split0 [Google Drive](https://drive.google.com/file/d/1irGoKFDqorFNwZtySrA1-g12dl61pG-7/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1JJ627hzIDG1c4gxbZQcbRg?pwd=mviv)\n1. Split1 [Google Drive](https://drive.google.com/file/d/1OkpF7MAtntn2czuZfujSRc_7rALJ6VRJ/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1iupSNrqUd4pQVESOFqNmyw?pwd=ocqr)\n2. Split2 [Google Drive](https://drive.google.com/file/d/1pv_NsPNdQrBSve3h9eVRH1MjeBH8w1AF/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1iX10kUf1at1sCUU83b8SmA?pwd=4f88)\n3. Split3 [Google Drive](https://drive.google.com/file/d/14OSOAD7gM6nVa1ydwJTSApM2WzGOBcWV/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1cq0O1QITtykhB8L0MUlqtw?pwd=w3v5)\n4. 
Split4 [Google Drive](https://drive.google.com/file/d/14Fz2kof5CBjdgyabxZ8hS6g1hN-9owLx/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1snRfnNN4kbGzfxhbcNFe3g?pwd=xzg9)\n5. Split5 [Google Drive](https://drive.google.com/file/d/1xKAzn9oeWewBKHIb3bt14g4gnrO0u2CP/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1APwm7xTE2oID92Xb74q6Zw?pwd=vvsx)\n6. Split6 [Google Drive](https://drive.google.com/file/d/1vbf8piV9hSCyo2pvx91W4lynZhKNx2lM/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/10HV3p3wnLhHHFOdbhJJOSg?pwd=5idw)\n7. Split7 [Google Drive](https://drive.google.com/file/d/1qfQ3c7SoR44Xd-4HfBb-wh_GOArUhyBz/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1BOSTdHzQizZAMavy1aeajg?pwd=yx6q)\n8. Split8 [Google Drive](https://drive.google.com/file/d/1J4LvdVyX83YsMKh04CeIfTF1N13Q3d3N/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/11VQL7rUrJtmp74x97C5L6g?pwd=lu0i)\n\n### LCCC Filtered MMChat\n\nThis version of MMChat contains the dialogues that are filtered based on the [LCCC](https://github.com/thu-coai/CDial-GPT) (Large-scale Cleaned Chinese Conversation) dataset.\nSpecifically, some dialogues in MMChat are also contained in LCCC.\nWe regard these dialogues as cleaner, since LCCC applies sophisticated schemes to filter out noise.\nThis version of MMChat is obtained using the script `data_processing/LCCC_filter.py`.\nThe following table shows some basic statistics:\n\n| Item Description                     | Count   |\n|--------------------------------------|--------:|\n| Sessions                             | 492.6 K |\n| Sessions with more than 4 utterances | 208.8 K |\n| Utterances                           | 1.986 M |\n| Images                               | 1.066 M |\n| Avg. utterance per session           | 4.031   |\n| Avg. image per session               | 2.514   |\n| Avg. character per utterance         | 11.336  |\n\nWe divide the above dialogues into 9 splits to make them easier to download:\n\n0. 
Split0 [Google Drive](https://drive.google.com/file/d/1Qd3N00ZpVOGDBqwlHcpj_QgbSIYNnysx/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/17g0UBF8zT3w5hfzvpYerQA?pwd=b2an)\n1. Split1 [Google Drive](https://drive.google.com/file/d/1H15T_aSLNaLZdc86WsUU6-c0J37OoZW-/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1xj_RIE60Be-sisdkrWt0fQ?pwd=6z1x)\n2. Split2 [Google Drive](https://drive.google.com/file/d/1dCXlyQGwx5tfRFLnsDp0B5LhdHr_Rsbi/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1_0WFHqK1ZY92yC4BEqRSwQ?pwd=35cw)\n3. Split3 [Google Drive](https://drive.google.com/file/d/1jzLgo2JW87cjGxEMRtKC8KorTIv-ODJR/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1_pgQRtr7LYnH0aQagRr2Bg?pwd=ouo0)\n4. Split4 [Google Drive](https://drive.google.com/file/d/1JiGhdzhzMZhL_dGreZclymhHxE7YuiRy/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/128CzlJpqKxhc4GJeRynX-g?pwd=pnmr)\n5. Split5 [Google Drive](https://drive.google.com/file/d/1ZLdsNZyFG-cq9pqHP5KvfL0fPXqmmXxO/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1Y21T3jMPWSiRCATvYNOC4g?pwd=ca3m)\n6. Split6 [Google Drive](https://drive.google.com/file/d/1qi99_TFwJanuGgAWDBRgi6hqNUQB9JQd/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1hfBchNqVhOYjFk9fTT_gxA?pwd=dzh3)\n7. Split7 [Google Drive](https://drive.google.com/file/d/15QMZhGuW93fzAVRhBKb6ANiZ8BNw5lX9/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1amg65X0ST7gW8c8MCutXWQ?pwd=2t1j)\n8. 
Split8 [Google Drive](https://drive.google.com/file/d/1wRCiJfxNk5n5SYzMBm4HYM1BKyGtuGak/view?usp=sharing), [Baidu Netdisk](https://pan.baidu.com/s/1-KYwR-SOezyn5jFzrA3Fxw?pwd=0pyi)\n\n## Code\n\nWe are also releasing all the code used for our experiments.\nYou can use the script `run_training.sh` in each folder to launch distributed training.\n\nFor models that require image features, you can extract the image features using the scripts in `data_processing/extract_image_features`.\n\nThe model shown in our paper can be found in `dialog_image`:\n![Model](/bin/model.jpg)\n\nThe pre-trained `chinese_gpt_original` model used in our experiments can be downloaded from [Baidu Netdisk](https://pan.baidu.com/s/1l_jLVcpBnGXpLp7yf3lqiw) (extraction code: `nmoc`) or from [Google Drive](https://drive.google.com/drive/folders/1rwWv7gbWQrxDMCOr5fpqVd0jJQF4NQu0?usp=sharing).\n\n## Reference\nPlease cite our paper if you find our work useful ;)\n\n```bibtex\n@inproceedings{zheng2022MMChat,\n  author    = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian},\n  title     = {MMChat: Multi-Modal Chat Dataset on Social Media},\n  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},\n  year      = {2022},\n  publisher = {European Language Resources Association},\n}\n```\n\n```bibtex\n@inproceedings{wang2020chinese,\n  title     = {A Large-Scale Chinese Short-Text Conversation Dataset},\n  author    = {Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},\n  booktitle = {NLPCC},\n  year      = {2020},\n  url       = {https://arxiv.org/abs/2008.03946}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsilverriver%2FMMChat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsilverriver%2FMMChat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsilverriver%2FMMChat/lists"}