{"id":28385529,"url":"https://github.com/x-plug/youku-mplug","last_synced_at":"2025-06-26T06:31:05.819Z","repository":{"id":173223988,"uuid":"650086239","full_name":"X-PLUG/Youku-mPLUG","owner":"X-PLUG","description":"Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks","archived":false,"fork":false,"pushed_at":"2024-01-08T14:27:49.000Z","size":15788,"stargazers_count":297,"open_issues_count":25,"forks_count":11,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-05-30T12:20:29.555Z","etag":null,"topics":["benchmark","chinese","dataset","mllm","multimodal","multimodal-large-language-models","multimodal-pretraining","video","video-question-answering","video-retrieval","youku"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/X-PLUG.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-06-06T09:59:50.000Z","updated_at":"2025-05-21T09:01:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"645eff16-4cd9-4c7e-9803-3fe5989ad80d","html_url":"https://github.com/X-PLUG/Youku-mPLUG","commit_stats":null,"previous_names":["x-plug/youku-mplug"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/X-PLUG/Youku-mPLUG","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FYouku-mPLUG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FYouku-mPLUG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FYouku-mPLUG/releases","manifests_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FYouku-mPLUG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/X-PLUG","download_url":"https://codeload.github.com/X-PLUG/Youku-mPLUG/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FYouku-mPLUG/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262014368,"owners_count":23245120,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","chinese","dataset","mllm","multimodal","multimodal-large-language-models","multimodal-pretraining","video","video-question-answering","video-retrieval","youku"],"created_at":"2025-05-30T10:40:33.614Z","updated_at":"2025-06-26T06:31:05.777Z","avatar_url":"https://github.com/X-PLUG.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Youku-mPLUG 10M Chinese Large-Scale Video Text Dataset\nYouku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks\n[Download Link HERE](https://modelscope.cn/datasets/modelscope/Youku-AliceMind/summary)\n\n[Paper](https://arxiv.org/abs/2306.04362)\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/youku_mplug_logo.png\" alt=\"examples for youku-mplug\"/\u003e\n\u003c/p\u003e\n\n## What is Youku-mPLUG?\nWe release **Youku-mPLUG**, the largest public high-quality Chinese video-language dataset (10 million videos), collected from Youku, a well-known Chinese video-sharing website, with strict criteria of safety, diversity, and 
quality.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/pretrain_data.jpg\" alt=\"examples for youku-mplug\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/examples.png\" alt=\"examples for youku-mplug\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cfont size=2 color=\"gray\"\u003eExamples of video clips and titles in the proposed Youku-mPLUG dataset.\u003c/font\u003e\n\u003c/p\u003e\n\nWe provide three downstream multimodal video benchmark datasets to measure the capabilities of pre-trained models. The three tasks are:\n- Video Category Prediction: Given a video and its title, predict the category of the video.\n- Video-Text Retrieval: Given a set of videos and a set of texts, retrieve texts from videos and videos from texts.\n- Video Captioning: Given a video, describe its content.\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/downstream_data.jpg\" alt=\"examples for youku-mplug downstream dataset\"/\u003e\n\u003c/p\u003e\n\n\n## Data statistics\nThe dataset contains 10 million high-quality videos in total, distributed across 20 super categories and 45 categories.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/statics.jpg\" alt=\"statistics\" width=\"60%\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cfont size=2 color=\"gray\"\u003eThe distribution of categories in the Youku-mPLUG dataset.\u003c/font\u003e\n\u003c/p\u003e\n\n## Zero-shot Capability\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/case1.jpg\" alt=\"case1\" width=\"80%\"/\u003e\n\u003cimg src=\"assets/case2.jpg\" alt=\"case2\" width=\"80%\"/\u003e\n\u003c/p\u003e\n\n\n## Download\nYou can download all the videos and annotation files through this [link](https://modelscope.cn/datasets/modelscope/Youku-AliceMind/summary).\n\n## Setup\nNote: Due to a bug in megatron_util, after installing the package you need to 
replace *conda/envs/youku/lib/python3.10/site-packages/megatron_util/initialize.py* with the *initialize.py* in the current directory.\n```bash\nconda env create -f environment.yml\nconda activate youku\npip install megatron_util==1.3.0 -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html\n\n# For caption evaluation\napt-get install default-jre\n```\n\n## mPLUG-Video (1.3B / 2.7B)\n### Pre-train\nFirst, download the GPT-3 1.3B \u0026 2.7B checkpoints from [ModelScope](https://www.modelscope.cn/models/damo/nlp_gpt3_text-generation_1.3B/summary). The pre-trained models can be downloaded [here (1.3B)](http://mm-chatgpt.oss-cn-zhangjiakou.aliyuncs.com/1_3B_mp_rank_00_model_states.pt) and [here (2.7B)](http://mm-chatgpt.oss-cn-zhangjiakou.aliyuncs.com/2_7B_mp_rank_00_model_states.pt).\n\nRun mPLUG-Video pre-training as follows:\n```bash\nexp_name='pretrain/gpt3_1.3B/pretrain_gpt3_freezeGPT_youku_v0'\n# MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK must be set in the environment.\nmkdir -p ./output/${exp_name}  # tee needs the log directory to exist\nPYTHONPATH=$PYTHONPATH:./ \\\npython -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \\\n  --master_port=$MASTER_PORT \\\n  --nnodes=$WORLD_SIZE \\\n  --node_rank=$RANK \\\n  --use_env run_pretrain_distributed_gpt3.py \\\n  --config ./configs/${exp_name}.yaml \\\n  --output_dir ./output/${exp_name} \\\n  --enable_deepspeed \\\n  --bf16 \\\n  2\u003e\u00261 | tee ./output/${exp_name}/train.log\n```\n\n### Benchmarking\nTo perform downstream fine-tuning, we take Video Category Prediction as an example:\n```bash\nexp_name='cls/cls_gpt3_1.3B_youku_v0_sharp_2'\n# MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK must be set in the environment.\nmkdir -p ./output/${exp_name}  # tee needs the log directory to exist\nPYTHONPATH=$PYTHONPATH:./ \\\npython -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \\\n  --master_port=$MASTER_PORT \\\n  --nnodes=$WORLD_SIZE \\\n  --node_rank=$RANK \\\n  --use_env downstream/run_cls_distributed_gpt3.py \\\n  --config ./configs/${exp_name}.yaml \\\n  --output_dir ./output/${exp_name} \\\n  --enable_deepspeed \\\n  --resume path/to/1_3B_mp_rank_00_model_states.pt \\\n  --bf16 \\\n  2\u003e\u00261 | tee ./output/${exp_name}/train.log\n```\n\n### Experimental results\nBelow we show the results on the validation sets for reference.\n\u003cp align=\"left\"\u003e\n\u003cimg src=\"assets/val_cls.jpg\" alt=\"Video category prediction results on the validation set.\" width=\"70%\"/\u003e\n\u003cimg src=\"assets/val_retrieval.jpg\" alt=\"Video retrieval results on the validation set.\" width=\"70%\"/\u003e\n\u003c/p\u003e\n\n## mPLUG-Video (BloomZ-7B)\nWe build the mPLUG-Video model based on [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl). To use the model, first clone the mPLUG-Owl repo:\n```bash\ngit clone https://github.com/X-PLUG/mPLUG-Owl.git\ncd mPLUG-Owl/mPLUG-Owl\n```\nThe instruction-tuned checkpoint is available on [HuggingFace](https://huggingface.co/MAGAer13/mplug-youku-bloomz-7b). To fine-tune the model, refer to the [mPLUG-Owl repo](https://github.com/X-PLUG/mPLUG-Owl). 
To run video inference, you can use the following code:\n```python\nimport torch\nfrom mplug_owl_video.modeling_mplug_owl import MplugOwlForConditionalGeneration\nfrom transformers import AutoTokenizer\nfrom mplug_owl_video.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor\n\npretrained_ckpt = 'MAGAer13/mplug-youku-bloomz-7b'\nmodel = MplugOwlForConditionalGeneration.from_pretrained(\n    pretrained_ckpt,\n    torch_dtype=torch.bfloat16,\n    device_map={'': 0},\n)\nimage_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)\ntokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)\nprocessor = MplugOwlProcessor(image_processor, tokenizer)\n\n# We use a human/AI template to organize the context as a multi-turn conversation.\n# \u003c|video|\u003e denotes a video placeholder.\n# The question in the prompt is Chinese for 'What is the woman in the video doing?'\nprompts = [\n'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: \u003c|video|\u003e\nHuman: 视频中的女人在干什么？\nAI: ''']\n\n# Path to a local video file.\nvideo_list = ['yoga.mp4']\n\n# Generation kwargs (the same as in transformers) can be passed to generate().\ngenerate_kwargs = {\n    'do_sample': True,\n    'top_k': 5,\n    'max_length': 512\n}\ninputs = processor(text=prompts, videos=video_list, num_frames=4, return_tensors='pt')\n# Cast float32 tensors to bfloat16 to match the model dtype, then move all inputs to the model device.\ninputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()}\ninputs = {k: v.to(model.device) for k, v in inputs.items()}\nwith torch.no_grad():\n    res = model.generate(**inputs, **generate_kwargs)\nsentence = tokenizer.decode(res.tolist()[0], skip_special_tokens=True)\nprint(sentence)\n```\n\n## Citing Youku-mPLUG\n\nIf you find this dataset useful for your research, please consider citing our paper.\n\n```bibtex\n@misc{xu2023youku_mplug,\n    title={Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks},\n    author={Haiyang Xu and Qinghao Ye and Xuan Wu and Ming Yan and Yuan Miao and Jiabo Ye and Guohai Xu and Anwen Hu and Yaya Shi and Chenliang Li and Qi Qian and Que Maofei and Ji Zhang and Xiao Zeng and Fei Huang},\n    year={2023},\n    eprint={2306.04362},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-plug%2Fyouku-mplug","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fx-plug%2Fyouku-mplug","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-plug%2Fyouku-mplug/lists"}