{"id":19456918,"url":"https://github.com/alipay/pc2-noiseofweb","last_synced_at":"2025-04-25T05:31:13.603Z","repository":{"id":208945561,"uuid":"722010431","full_name":"alipay/PC2-NoiseofWeb","owner":"alipay","description":"Noise of Web (NoW) is a challenging noisy correspondence learning (NCL) benchmark containing 100K image-text pairs for robust image-text matching/retrieval models.","archived":false,"fork":false,"pushed_at":"2024-11-26T16:11:30.000Z","size":14312,"stargazers_count":12,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-03T17:02:44.753Z","etag":null,"topics":["acmmm","acmmm2024","benchmark","captioning-images","cross-modal-retrieval","dataset","image-text-matching","image-text-retrieval","multimodal-learning","noisy-correspondence"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alipay.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-22T08:48:31.000Z","updated_at":"2024-11-26T16:11:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"f9057a1b-be58-40e8-8d3b-ec1b452f7e45","html_url":"https://github.com/alipay/PC2-NoiseofWeb","commit_stats":null,"previous_names":["alipay/noiseofweb","alipay/pc2-noiseofweb"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alipay%2FPC2-NoiseofWeb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alipay%2FPC2-NoiseofWeb/tags","releases_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alipay%2FPC2-NoiseofWeb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alipay%2FPC2-NoiseofWeb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alipay","download_url":"https://codeload.github.com/alipay/PC2-NoiseofWeb/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250760708,"owners_count":21482852,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["acmmm","acmmm2024","benchmark","captioning-images","cross-modal-retrieval","dataset","image-text-matching","image-text-retrieval","multimodal-learning","noisy-correspondence"],"created_at":"2024-11-10T17:19:00.883Z","updated_at":"2025-04-25T05:31:08.595Z","avatar_url":"https://github.com/alipay.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PC2-NoiseofWeb\r\n\r\nThis repo is the official Pytorch implementation of our paper:\r\n\r\n\u003e **PC2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval**  \r\n\u003e **Authors**: ***[Yue Duan](https://njuyued.github.io/)**, Zhangxuan Gu, Zhenzhe Ying, Lei Qi, Changhua Meng and Yinghuan Shi*\r\n \r\n \r\n- 🔗 **Quick links:** [[PDF](https://arxiv.org/pdf/2408.01349)/[Abs](https://arxiv.org/abs/2408.01349)-arXiv | [Dataset](https://huggingface.co/datasets/NJUyued/NoW) | [文章解读-知乎(Zhihu)](https://zhuanlan.zhihu.com/p/711149124) | [视频解读-bilibili](https://www.bilibili.com/video/BV1zppMezEQe/)]\r\n \r\n - 📰 **Latest 
news:**\r\n     - We provide a **video presentation (in Chinese)** of this work on [bilibili](https://www.bilibili.com/video/BV1zppMezEQe/).\r\n     - We wrote a **detailed explanation (in Chinese)** of this work on [知乎(Zhihu)](https://zhuanlan.zhihu.com/p/711149124).\r\n     - Our paper was accepted by **ACM International Conference on Multimedia (ACM MM) 2024** 🎉🎉. Thanks to users.\r\n - 📑 **More of my works:**\r\n     - 🆕 **[LATEST]** Interested in **SSL in fine-grained visual classification (SS-FGVC)**? 👉 Check out our AAAI'24 paper **SoC** [[PDF-arXiv](https://arxiv.org/pdf/2312.12237) | [Code](https://github.com/NJUyued/SoC4SS-FGVC/)].\r\n     - Interested in more scenarios of **SSL with mismatched distributions**? 👉 Check out our ICCV'23 paper **PRG** [[PDF-arXiv](https://arxiv.org/pdf/2308.08872) | [Code](https://github.com/NJUyued/PRG4SSL-MNAR)].\r\n     - Interested in **robust SSL in the MNAR setting** with mismatched distributions? 👉 Check out our ECCV'22 paper **RDA** [[PDF-arXiv](https://arxiv.org/pdf/2208.04619v2) | [Code](https://github.com/NJUyued/RDA4RobustSSL)].\r\n     - Interested in conventional SSL or more applications of **complementary labels in SSL**? 👉 Check out our TNNLS paper **MutexMatch** [[PDF-arXiv](https://arxiv.org/pdf/2203.14316) | [Code](https://github.com/NJUyued/MutexMatch4SSL/)].\r\n\r\n## Dataset Contribution: Noise of Web (NoW)\r\n### Data Collection\r\nWe develop a new dataset named **Noise of Web (NoW)** for noisy correspondence learning (NCL). It contains **100K image-text pairs** consisting of **website pages** and **multilingual website meta-descriptions** (**98,000 pairs for training, 1,000 for validation, and 1,000 for testing**). NoW has two main characteristics: *it has no human annotations, and its noisy pairs are naturally captured*.  
The source image data of NoW is obtained by taking screenshots when accessing web pages on a mobile user interface (MUI) at 720 $\\times$ 1280 resolution, and we parse the meta-description field in the HTML source code as the captions. In [NCR](https://github.com/XLearning-SCU/2021-NeurIPS-NCR) (predecessor of NCL), every image in all datasets was preprocessed using the Faster-RCNN detector provided by the [Bottom-up Attention Model](https://github.com/peteanderson80/bottom-up-attention) to generate 36 region proposals, and each proposal was encoded as a 2048-dimensional feature. Thus, following NCR, we release the features instead of raw images for fair comparison. However, we cannot simply use detection methods like Faster-RCNN to extract image features, since it is trained on real-world animals and objects from MS-COCO. To tackle this, we adopt [APT](https://openaccess.thecvf.com/content/CVPR2023/papers/Gu_Mobile_User_Interface_Element_Detection_via_Adaptively_Prompt_Tuning_CVPR_2023_paper.pdf) as the detection model since it is trained on MUI data. Then, we capture the 768-dimensional features of the top 36 objects for each image. Because the data collection process is automated and not curated by humans, the noise in NoW is highly authentic and intrinsic.  **The estimated noise ratio of this dataset is nearly 70%**.  
\r\n\r\n\u003cdiv align=center\u003e\r\n\r\n\u003cimg width=\"750px\" src=\"/figures/now-1.jpg\"\u003e \r\n \r\n\u003c/div\u003e\r\n\r\n### Data Structure\r\n\r\n```\r\n\r\n|-- h5100k_precomp\r\n|   |-- dev_caps_bpe.txt\r\n|   |-- dev_caps_bert.txt\r\n|   |-- dev_caps_jieba.txt\r\n|   |-- dev_ids.txt\r\n|   |-- dev_ims.npy\r\n|   |-- test_caps_bpe.txt\r\n|   |-- test_caps_bert.txt\r\n|   |-- test_caps_jieba.txt\r\n|   |-- test_ids.txt\r\n|   |-- test_ims.npy\r\n|   |-- train_caps_bpe.txt\r\n|   |-- train_caps_bert.txt\r\n|   |-- train_caps_jieba.txt\r\n|   |-- train_ids.txt\r\n|   |-- train_ims.npy\r\n|-- vocab\r\n|   |-- now100k_precomp_vocab_bert.json\r\n|   |-- now100k_precomp_vocab_bpe.json\r\n|   |-- now100k_precomp_vocab_jieba.json\r\n\r\n```\r\n\r\nPlease note that since our raw data contains some sensitive business data, we only provide the **encoded image features** (\\*_ims.npy) and the **token ids of the tokenized text**. For tokenization, we use [Tokenizers](https://github.com/huggingface/tokenizers) with [BPE](https://huggingface.co/docs/tokenizers/api/models#tokenizers.models.BPE) to produce \\*_caps_bpe.txt, [BertTokenizer](https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#berttokenizer) with the [bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) pre-trained model to produce \\*_caps_bert.txt, and [Jieba](https://github.com/fxsjy/jieba) to produce \\*_caps_jieba.txt. **The vocabulary size of BPETokenizer is 10,000, while BertTokenizer and JiebaTokenizer have vocabulary sizes of 32,702 and 56,271, respectively** (recorded in now100k_precomp_vocab\\_\\*.json). \\*_ids.txt records the data indices in the original 500k dataset. 
In the future, we may process the original dataset and make it public.\r\n\r\n\r\n### Download Link\r\n📎 Download NoW at **https://huggingface.co/datasets/NJUyued/NoW/resolve/main/NoW.zip?download=true**.\r\n\r\n🤗 See HuggingFace's homepage **https://huggingface.co/datasets/NJUyued/NoW** for details.\r\n\r\n### Usage\r\n\r\n```\r\nimport os\r\nimport numpy as np\r\n\r\n# data_path: your dataset name and path\r\n# data_split: {train,dev,test}\r\n# tokenizer: {bpe,bert,jieba}\r\n# vocabulary size of {bpe,bert,jieba} is {10000,32702,56271}\r\n# vocab: callable vocabulary that returns the id of a token string\r\ndef load_now(data_path, data_split, tokenizer, vocab):  # function name is illustrative\r\n    # captions: each line holds the comma-separated token ids of one caption\r\n    captions = []\r\n    with open(os.path.join(data_path, \"{}_caps_{}.txt\".format(data_split, tokenizer))) as f:\r\n        for line in f:\r\n            captions.append(line.strip())\r\n    captions_token = []\r\n    for caption in captions:\r\n        tokens = caption.split(',')\r\n        # wrap the token ids with the start and end markers\r\n        caption_ids = [vocab(\"\u003cstart\u003e\")]\r\n        caption_ids.extend(int(token) for token in tokens if token)\r\n        caption_ids.append(vocab(\"\u003cend\u003e\"))\r\n        captions_token.append(caption_ids)\r\n\r\n    # images: precomputed image features\r\n    images = np.load(os.path.join(data_path, \"%s_ims.npy\" % data_split))\r\n\r\n    return captions_token, images\r\n```\r\nAdditionally, you can search for code snippets containing the string `now100k_precomp` in `co_train.py`, `data.py`, `evaluation.py`, and `run.py` in this repo and refer to them to process the NoW dataset for use in your own code.\r\n\r\n## PC2\r\n### Introduction\r\n\r\nIn the realm of **cross-modal retrieval**, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by **noisy correspondence learning (NCL)**. *Such noise often stems from mismatched data pairs, a significant obstacle distinct from traditional noisy labels*. 
This paper introduces the Pseudo-Classification based Pseudo-Captioning ($\\text{PC}^2$) framework to address this challenge.\r\n\r\n\u003cdiv align=center\u003e\r\n\r\n\u003cimg width=\"750px\" src=\"/figures/framework.jpg\"\u003e \r\n \r\n\u003c/div\u003e\r\n\r\n### Requirements\r\n- matplotlib==3.4.2\r\n- nltk==3.8.1\r\n- numpy==1.22.3\r\n- scikit_learn==0.24.2\r\n- scipy==1.6.2\r\n- torch==2.2.2\r\n\r\n## How to Train\r\n### Important Args\r\n- `--lambda_en`: Entropy loss weight.\r\n- `--proj_dim`: Dimensionality of the projection head. By default, `--proj_dim 128` is set. \r\n- `--nb`: Number of tracked batches.\r\n- `--img_dim` : Dimensionality of the image embedding. `--img_dim 2048` is used for {coco,f30k,cc152k}; please set it to `768` for now100k.\r\n- `--warmup_epoch` : Epochs of the warm-up stage.\r\n- `--warmup_epoch_2` : Epochs of training with clean data only.\r\n- `--po_dir` : When `--resume` is set, use this path to load the PO data for resuming training.\r\n- `--model_path` : With `--resume`, use this path to load the checkpoint for resuming training; without `--resume`, use this path to load the warmup checkpoint for resuming training.\r\n- `--data_name {coco,f30k,cc152k,now100k}_precomp` and `--data_path`  : Your dataset name and path.  \r\n- `--tokenizer {bpe,bert,jieba}`: The tokenizer used for the NoW dataset.  \r\n- `--noise_ratio`: Noise ratio for Flickr30K and MS-COCO.\r\n- `--noise_file`: Noise file for the reproduction of noisy correspondence.\r\n\r\n### Training with Single GPU\r\n\r\nWe recommend using a single NVIDIA Tesla A100 80G for training to better reproduce our results. Multi-GPU training is feasible, but our results are all obtained from single GPU training.\r\n\r\n```\r\npython ./PC2/run.py --world-size 1 --rank 0 --gpu [0/1/...] 
@@@other args@@@\r\n```\r\n### Training with Multi-GPUs\r\n\r\n\r\n- Using DistributedDataParallel with single node\r\n\r\n```\r\npython ./PC2/run.py --world-size 1 --rank 0 --multiprocessing-distributed @@@other args@@@\r\n```\r\n\r\n**Please note that** our code is based on the [NCR implementation](https://github.com/XLearning-SCU/2021-NeurIPS-NCR) and the original training code can only run on a single GPU (see [issue#4](https://github.com/XLearning-SCU/2021-NeurIPS-NCR/issues/4)). In order to make it easier for you to use our code, we tried to provide a multi-GPU parallel training version based on `DistributedDataParallel`. Unfortunately, there seem to be some bugs that we have not yet solved. The following error may occur during training: \r\n\r\n```\r\n[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16349, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600341 milliseconds before timing out.\r\n[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. 
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.\r\n[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.\r\n[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16349, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600341 milliseconds before timing out.\r\nException raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):\r\nframe #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1e10143d87 in /home/dy/.local/lib/python3.8/site-packages/torch/lib/libc10.so)\r\nframe #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional\u003cstd::chrono::duration\u003clong, std::ratio\u003c1l, 1000l\u003e \u003e \u003e) + 0x1e6 (0x7f1d990756e6 in /home/dy/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)\r\nframe #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f1d99078c3d in /home/dy/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)\r\nframe #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f1d99079839 in /home/dy/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)\r\nframe #4: \u003cunknown function\u003e + 0xc9039 (0x7f1e1034f039 in /usr/local/miniconda3/envs/sharedEnv/bin/../lib/libstdc++.so.6)\r\nframe #5: \u003cunknown function\u003e + 0x76db (0x7f1e14a626db in /lib/x86_64-linux-gnu/libpthread.so.0)\r\nframe #6: clone + 0x3f (0x7f1e1478b61f in /lib/x86_64-linux-gnu/libc.so.6)\r\n\r\n``` \r\n\r\nIf any friends have insights on the occurrence of this problem, please contact us. 
At the same time, please rest assured that there will be no problem training with a single GPU (i.e., using ``--gpu`` to specify the GPU id).\r\n\r\n## Examples of Running\r\nBy default, the warmup checkpoint `warmup_model_{}.pth.tar`, best checkpoint `checkpoint_best_test.pth.tar`, best validation checkpoint `checkpoint_best_validattion.pth.tar` and PO data (the pseudo-predictions of pseudo-classification) `distri_bank_{}.pkl` will be saved in `./output_dir`. \r\n\r\n### NoW\r\n\r\n```\r\npython ./pc2/run.py --world-size 1 --rank 0 --gpu 0 --workers 8 --lr_update 30 --warmup_epoch 10 --warmup_epoch_2 25 --data_name h5100k_precomp --tokenizer bert --data_path ./data --vocab_path ./data/vocab --output_dir ./output --proj_dim 128 --lambda_en 10 --img_dim 768 \r\n```\r\n\r\n\r\n### Flickr30k\r\n\r\n```\r\npython ./pc2/run.py --world-size 1 --rank 0 --gpu 0 --workers 8 --warmup_epoch 5 --warmup_epoch_2 25 --data_name f30k_precomp --data_path ./data --vocab_path ./data/vocab  --output_dir ./output --proj_dim 128 --lambda_en 10 --noise_ratio 0.4 --noise_file noise_index/f30k_precomp_0.4\r\n```\r\n\r\n\r\n### MS-COCO\r\n\r\n```\r\npython ./pc2/run.py --world-size 1 --rank 0 --gpu 0 --workers 8 --warmup_epoch 5 --warmup_epoch_2 25 --data_name coco_precomp --data_path ./data --vocab_path ./data/vocab  --output_dir ./output --proj_dim 128 --lambda_en 10 --noise_ratio 0.4 --noise_file noise_index/coco_precomp_0.4\r\n```\r\n\r\n\r\n## Resume Training and Evaluation\r\n- To resume training from a normal checkpoint, use `--resume --model_path @your_weight_path`.\r\n\r\n- To resume training from a warmup checkpoint, use `--model_path @your_warmup_weight_path`.\r\n\r\n- For evaluation, run\r\n\r\n  ```\r\n  python ./PC2/evaluation.py --data_path @your_data_path --model_path @your_weight_path --gpu @your_gpu_id\r\n  ```\r\n      \r\n  By default, the evaluation process will directly use the dataset name saved in your checkpoint.\r\n\r\n## 
Citation\r\nPlease cite our paper if you find $\\text{PC}^2$ useful:\r\n```\r\n@article{duan2024pc,\r\n  title={PC$^2$: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval},\r\n  author={Duan, Yue and Gu, Zhangxuan and Ying, Zhenzhe and Qi, Lei and Meng, Changhua and Shi, Yinghuan},\r\n  journal={arXiv preprint arXiv:2408.01349},\r\n  year={2024}\r\n}\r\n```\r\n\r\n\r\n\r\n\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falipay%2Fpc2-noiseofweb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falipay%2Fpc2-noiseofweb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falipay%2Fpc2-noiseofweb/lists"}