{"id":13958468,"url":"https://github.com/kywen1119/Video_sim","last_synced_at":"2025-07-21T00:30:55.640Z","repository":{"id":230228839,"uuid":"399790587","full_name":"kywen1119/Video_sim","owner":"kywen1119","description":"qq browser multimodal video similarity contest","archived":false,"fork":false,"pushed_at":"2024-03-28T13:42:28.000Z","size":22750,"stargazers_count":15,"open_issues_count":0,"forks_count":6,"subscribers_count":2,"default_branch":"cqr","last_synced_at":"2024-11-28T02:34:44.602Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kywen1119.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-08-25T11:16:09.000Z","updated_at":"2024-08-12T01:34:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"490bf83e-915b-4cca-8a16-0686e2d11da1","html_url":"https://github.com/kywen1119/Video_sim","commit_stats":null,"previous_names":["kywen1119/video_sim"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kywen1119/Video_sim","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kywen1119%2FVideo_sim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kywen1119%2FVideo_sim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kywen1119%2FVideo_sim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kywen1119%2FVideo_sim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kywen1119","download_ur
l":"https://codeload.github.com/kywen1119/Video_sim/tar.gz/refs/heads/cqr","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kywen1119%2FVideo_sim/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266221247,"owners_count":23894964,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-08T13:01:37.095Z","updated_at":"2025-07-21T00:30:55.634Z","avatar_url":"https://github.com/kywen1119.png","language":"Python","funding_links":[],"categories":["其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"### [Multimodal Video Similarity Challenge](https://algo.browser.qq.com/)\n#### [@CIKM 2021](https://www.cikm2021.org/analyticup) \n#### 4th-place solution\n#### Implementation source code of team \u003c618大庆神!\u003e.\n#### Final score: 82.8307 on test_b.\n\n#### 1. 
Model overview\nOur final result is an ensemble of 6 models, which we summarize up front (results on test_a):\n \u003ctable\u003e\n        \u003ctr\u003e\n            \u003cth\u003eModel\u003c/th\u003e\n            \u003cth\u003esingle-model\u003c/th\u003e\n            \u003cth\u003e10fold\u003c/th\u003e\n            \u003cth\u003eweight\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003cth\u003eMixNextvlad\u003c/th\u003e\n            \u003cth\u003e81.2\u003c/th\u003e\n            \u003cth\u003e81.6\u003c/th\u003e\n            \u003cth\u003e0.17\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003cth\u003eMixNextvlad_ASL\u003c/th\u003e\n            \u003cth\u003e80.9\u003c/th\u003e\n            \u003cth\u003e81.8\u003c/th\u003e\n            \u003cth\u003e0.2\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003cth\u003eMixNextvlad_roformer\u003c/th\u003e\n            \u003cth\u003e80.5\u003c/th\u003e\n            \u003cth\u003e81.3\u003c/th\u003e\n            \u003cth\u003e0.13\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003cth\u003eUniter\u003c/th\u003e\n            \u003cth\u003e80.3\u003c/th\u003e\n            \u003cth\u003e81.4\u003c/th\u003e\n            \u003cth\u003e0.13\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003cth\u003eUniter_ASL\u003c/th\u003e\n            \u003cth\u003e80.6\u003c/th\u003e\n            \u003cth\u003ex\u003c/th\u003e\n            \u003cth\u003e0.2\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003cth\u003eUniter_roformer\u003c/th\u003e\n            \u003cth\u003e80.2\u003c/th\u003e\n            \u003cth\u003ex\u003c/th\u003e\n            \u003cth\u003e0.17\u003c/th\u003e\n        \u003c/tr\u003e\n    \u003c/table\u003e\n\nPartial results on test_b:\n\n\u003ctable\u003e\n        \u003ctr\u003e\n            \u003cth\u003eModel\u003c/th\u003e\n            \u003cth\u003esingle-model\u003c/th\u003e\n            
\u003cth\u003e10fold\u003c/th\u003e\n            \u003cth\u003eweight\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003cth\u003eUniter\u003c/th\u003e\n            \u003cth\u003ex\u003c/th\u003e\n            \u003cth\u003e81.4\u003c/th\u003e\n            \u003cth\u003e0.13\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003cth\u003eUniter_ASL\u003c/th\u003e\n            \u003cth\u003e80.5\u003c/th\u003e\n            \u003cth\u003ex\u003c/th\u003e\n            \u003cth\u003e0.2\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003cth\u003eUniter_roformer\u003c/th\u003e\n            \u003cth\u003e80.1\u003c/th\u003e\n            \u003cth\u003ex\u003c/th\u003e\n            \u003cth\u003e0.17\u003c/th\u003e\n        \u003c/tr\u003e\n    \u003c/table\u003e\n\nAlthough MixNextvlad scores slightly higher as a single model, the Uniter models should do better after 10fold averaging, because we observed that the outputs produced by the MixNextvlad models contain many zeros.\n\n#### 2. Model description\nWe mainly use two kinds of models. The first is MixNextvlad, an improved version of the baseline, proposed in [1]; the second is the transformer-based Uniter [2] model. \n\n##### 2.1 MixNextvlad\nSee the figure; the paper also describes it in detail, so we do not repeat it here.\n\u003cimg src=\"./img/mixnextvlad.png\" width = \"100%\" height=\"50%\"\u003e\n\n##### 2.2 Uniter\nAs shown in the figure, the video frame features and the word embeddings of the sentence are concatenated and fed into a bert-encoder, and the output features are averaged to obtain the final embedding. When using this model we add MLM (masked language modeling) as a pretraining task: only the text is masked, and the masked tokens are predicted from the surrounding text together with the image features.\n\nDetails:\n\nModel: video frame features and word embeddings are concatenated and fed into a 12-layer pretrained bert \u0026 roformer, and the resulting features are mean-pooled.\n\npretrain: the final embedding goes through a cls layer for multi-label classification of tag ids; 15% of the input words are randomly masked, and the masked words are predicted at the output (MLM).\n\nfinetune: no masking; the image features and text are used directly as input to obtain a 256-dimensional vector after mean pooling, the similarity is computed for each pair, and an mse loss is taken against the label.\n\n\u003cimg src=\"./img/uniter.png\" width = \"100%\" height=\"50%\"\u003e\n\n##### 2.3 ASL\nThis is a loss for multi-label classification recently proposed by Alibaba [3], which effectively handles the noise from long-tail samples in multi-label classification. Using it to replace the BCE loss for multi-label classification in the baseline makes convergence faster and gives a higher final F1 score.\n\n##### 2.4 Roformer\nUsed as a replacement for bert [4].\n\n#### 3. Some tricks\nOverall: first pretrain (multi-label classification or MLM), then finetune (MSE).\n\nThe bn layer inside NextVLAD must not be commented out (it is commented out in the baseline).\n\n##### For pretraining\n1. 
Replace the original bert model: the baseline uses bert-uncased-chinese, which we swap for the better chinese-roberta-wwm-ext, or for roformer_chinese_base.\n2. For the MixNextvlad models, add a contrastive loss before the text and image features are fused. We believe this balances the scales of the two modalities and helps the fusion; during pretraining it improves spearman by 3 percentage points.\n\n##### For finetune\n1. Use 11-fold-cross-validation: split the pairwise data into 11 parts and use one part for validation each time, so the same architecture yields 11 trained models, whose embeddings are averaged at the end.\n2. Use three loss functions during finetuning: mse loss (directly optimizes the similarity), KL loss (reduces the gap between the distribution of the model's sim and that of the label sim), and tag loss (the multi-label classification loss from pretraining). Using mse alone overfits; adding the other two effectively mitigates this.\n3. Do not train for too many steps; with a batch size of 128 we only need 3000 or 4000 steps.\n\n#### 4. How to reproduce\nAll experiments were run on a single 3090. (The mixnextvlad models also run on a 1080ti with a smaller batch size.)\nEnvironment: \n\npython==3.8.0\n\ntensorflow==2.5.0    \n\ntransformers    \n\n##### 4.1 Data preparation\n1. Download the data and extract it into the Video_sim folder (including test_b).\n2. Generate the tfrecords for the pairs; there are 11 splits (11-fold-cross-validation).\n```bash\n  python write_tfrecord.py\n```\nGenerated output (each directory contains train.tfrecord \u0026 val.tfrecord):\n```\n├── data/\n|   ├── pairwise/           \n|   |   ├── 0-5999val/\n|   |   ├── 6000-11999val/\n|   |   ├── 12000-17999val/\n|   |   ├── 18000-23999val/\n|   |   ├── 24000-29999val/\n|   |   ├── 30000-35999val/\n|   |   ├── 36000-41999val/\n|   |   ├── 42000-47999val/\n|   |   ├── 48000-53999val/\n|   |   ├── 54000-59999val/\n|   |   ├── 60000-65999val/\n```\n\n##### 4.2 Direct testing (get the final result from existing ckpts)\nFirst download final_save [not convenient to release], mv it into the Video_sim folder, then run run.sh directly.\nThe script runs inference for all models, including the 11 folds of the 6 models, and then performs a weighted ensemble over all embeddings.\nThe final output is result_10_b.zip\n\n```bash\nsh run.sh\n```\n\nPS: To test only one model, taking MixNextvlad as an example, just download final_save/10fold_1_mix and run:\n\n```bash\npython inference_pair_b.py --ckpt-file final_save/10fold_1_mix/ckpt-4014 --output-zip 10fold_b_zip/10fold_1_mix.zip\n```\n\n##### 4.3 Model pretraining\n+ Pre-Train on MixNextvlad models:\n    + MixNextvlad:\n    ```bash\n    python cqrtrain_mix.py --batch-size 256 --savedmodel-path save/mix\n    ```\n    + MixNextvlad_ASL:\n    ```bash\n    python cqrtrain_mix_asl.py --batch-size 256 --savedmodel-path save/mix_asl\n    ```\n    + MixNextvlad_roformer:\n    
```bash\n    python cqrtrain_mix_roformer.py --batch-size 256 --savedmodel-path save/mix_roformer --bert-dir junnyu/roformer_chinese_base\n    ```\n+ Pre-Train on Uniter models:\n    + Uniter:\n    ```bash\n    python cqrtrain_mlm_mm_tag.py --batch-size 256 --savedmodel-path save/uniter --uniter-pooling mean \n    ```\n    + Uniter_ASL:\n    ```bash\n    python cqrtrain_mlm_mm_tag_asl.py --batch-size 256 --savedmodel-path save/uniter_asl --uniter-pooling mean \n    ```\n    + Uniter_roformer:\n    ```bash\n    python cqrtrain_mlm_mm_tag_roformer.py --batch-size 210 --savedmodel-path save/uniter_roformer --uniter-pooling mean --bert-dir junnyu/roformer_chinese_base \n    ```\n    \n##### 4.4 Model finetune\nNote: NaN may occur when training with ASL; in that case, rerun the affected model.\nWe recommend running only one model from the sh file at a time and commenting out the others, which makes debugging easier.\n```bash\nsh finetune_all.sh\n```\n\nHow to train a single model?\n\nSee finetune_all.sh, which contains the single-model training commands for all 6 models.\n\nExample:\n\n```bash\npython train_pair_mix.py --batch-size 128  --savedmodel-path save/10fold/10fold_1_mix --pretrain_model_dir save/mix --kl-weight 0.5 --total-steps 4000 --train-record-pattern data/pairwise/0-5999val/train.tfrecord --val-record-pattern data/pairwise/0-5999val/val.tfrecord\n```\n\nParameter notes: pretrain_model_dir: path of the pretrained model to load (mix/mix_asl/mix_roformer/uniter/uniter_asl/uniter_roformer)\n\n##### 4.5 Model inference\nWe recommend running only one model from the sh file at a time and commenting out the others, which makes debugging easier.\n\n```bash\nsh infer_all.sh\n```\n\nHow to test a single model?\n\nExample:\n```bash\npython inference_pair_b.py --ckpt-file save/10fold/10fold_1_mix/ckpt-4012 --output-zip 10fold_b_zip/10fold_1_mix.zip \n```\n\n\n##### 4.6 ensemble\nThe embeddings from the six 10fold models are combined by a weighted sum.\n```bash\npython ensemble_final.py\n```\n\n#### 5. References\n[1] Lin R, Xiao J, Fan J. Nextvlad: An efficient neural network to aggregate frame-level features for large-scale video classification[C]//Proceedings of the European Conference on Computer Vision (ECCV) Workshops. 2018: 0-0.\n\n[2] Chen Y C, Li L, Yu L, et al. Uniter: Universal image-text representation learning[C]//European conference on computer vision. 
Springer, Cham, 2020: 104-120.\n\n[3] Ridnik T, Ben-Baruch E, Zamir N, et al. Asymmetric Loss for Multi-Label Classification[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 82-91.\n\n[4] Su J, Lu Y, Pan S, et al. Roformer: Enhanced transformer with rotary position embedding[J]. arXiv preprint arXiv:2104.09864, 2021.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkywen1119%2FVideo_sim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkywen1119%2FVideo_sim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkywen1119%2FVideo_sim/lists"}