{"id":16480688,"url":"https://github.com/cdluminate/robrank","last_synced_at":"2025-03-21T07:30:29.849Z","repository":{"id":39694005,"uuid":"357433467","full_name":"cdluminate/robrank","owner":"cdluminate","description":"Adversarial Attack and Defense in Deep Ranking, T-PAMI, 2024","archived":false,"fork":false,"pushed_at":"2024-02-16T04:06:37.000Z","size":445,"stargazers_count":23,"open_issues_count":6,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-17T22:24:17.769Z","etag":null,"topics":["adversarial-attacks","adversarial-defense","adversarial-machine-learning","adversarial-robustness","adversarial-training","deep-metric-learning","dml","metric-learning","ranking"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2106.03614","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cdluminate.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-04-13T05:24:00.000Z","updated_at":"2024-12-25T09:52:32.000Z","dependencies_parsed_at":"2023-12-06T00:28:28.954Z","dependency_job_id":"0b681ff8-0d2e-4262-bbd3-777a61f6d2c5","html_url":"https://github.com/cdluminate/robrank","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cdluminate%2Frobrank","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cdluminate%2Frobrank/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cdluminate%2Frobrank/releases","manifests_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/cdluminate%2Frobrank/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cdluminate","download_url":"https://codeload.github.com/cdluminate/robrank/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244757090,"owners_count":20505327,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adversarial-attacks","adversarial-defense","adversarial-machine-learning","adversarial-robustness","adversarial-training","deep-metric-learning","dml","metric-learning","ranking"],"created_at":"2024-10-11T13:04:59.682Z","updated_at":"2025-03-21T07:30:29.518Z","avatar_url":"https://github.com/cdluminate.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"RobRank: Adversarial Robustness in Deep Ranking\n===\n\n![badge](https://github.com/cdluminate/robrank/actions/workflows/github-actions-demo.yml/badge.svg)\n[![GitHub license](https://img.shields.io/github/license/cdluminate/robrank)](https://github.com/cdluminate/robrank/blob/main/LICENSE)\n\nDeep neural networks are vulnerable to adversarial attacks, and so are deep\nranking and deep metric learning models. 
The project *RobRank* aims to study\nthe empirical adversarial robustness of deep ranking / metric learning models.\nOur contributions include (1) the definition and implementation of two new\nadversarial attacks, namely candidate attack and query attack; (2) two\nadversarial defense methods (based on adversarial training) that improve\nmodel robustness against a wide range of attacks; (3) a comprehensive\nempirical robustness score for quantitatively assessing adversarial robustness.\nIn particular, an **\"Anti-Collapse Triplet\" defense** method is newly introduced\nin *RobRank*, which **achieves at least 60% and at most 540% improvement in\nadversarial robustness** compared to the ECCV work. See the preprint manuscript\nfor details.\n\nThe RobRank codebase is extended from my previous ECCV'2020 work [*\"Adversarial\nRanking Attack and Defense,\"*](https://github.com/cdluminate/advrank) with\na major code refactor. You may find most functionalities of the previous\ncodebase in this repository as well.\n\nNote, the project name is Rob**R**ank, instead of Rob**B**ank.\n\n**Preprint-Title:** \"Adversarial Attack and Defense in Deep Ranking\"  \n**Preprint-Authors:** Mo Zhou, Le Wang, Zhenxing Niu, Qilin Zhang, Nanning Zheng, Gang Hua  \n**Preprint-Link:** https://arxiv.org/abs/2106.03614  \n**Keywords:** Deep {Ranking, Metric Learning}, Adversarial {Attack, Defense, Robustness}  \n\n**Project Status:** Actively maintained.  \n**Install-RobRank-Python-Dependency:** `$ pip install -r requirements.txt`  \n**Try-It-on-Colab:** [[`fashion:rc2f2p:ptripletN`]](https://colab.research.google.com/drive/1QC34RCadO0QCj-YUsLTUI9_pzqn8nrBH?usp=sharing)\n[[`cars:rres18p:ptripletN`]](https://colab.research.google.com/drive/1jjDK4X64bIv7fLyMSlVs-btEMxxzgm6V?usp=sharing)\n\n**News and Updates**\n\n1. [2024-02-03] This manuscript has been accepted to T-PAMI. https://ieeexplore.ieee.org/document/10433769\n1. 
[2022-03-02] New paper based on this code base has been published: [Enhancing Adversarial Robustness for Deep Metric Learning, CVPR, 2022](https://github.com/cdluminate/robdml). Note, in this new paper, we further improved the benign performance, adversarial robustness, as well as training efficiency altogether for robust metric learning.\n\n## Tables for Robustness Comparison\n\nIn the following tables, \"N/A\" denotes \"no defense equipped\"; EST is the\ndefense proposed in the ECCV'2020 paper; ACT is the new defense in the preprint\npaper. These rows are sorted by ERS. I'm willing to add other DML defenses for\ncomparison in these tables.\n\n| Dataset | Model | Loss | Defense | R@1 | R@2 | mAP | NMI | ERS |\n| ---     | ---   | ---  | ---     | --- | --- | --- | --- | --- |\n| CUB | RN18 | Triplet | N/A | 53.9 | 66.4 | 26.1 | 59.5 | 3.8  |\n| CUB | RN18 | Triplet | EST | 8.5  | 13.0 | 2.6  | 25.2 | 5.3  |\n| CUB | RN18 | Triplet | ACT | 27.5 | 38.2 | 12.2 | 43.0 | 33.9 |\n| CUB | RN18 | Triplet | HM  | 34.9 | 45.0 | 19.8 | 47.1 | 36.0 |\n\n| Dataset | Model | Loss | Defense | R@1 | R@2 | mAP | NMI | ERS |\n| ---     | ---   | ---  | ---     | --- | --- | --- | --- | --- |\n| CARS | RN18 | Triplet | N/A | 62.5 | 74.0 | 23.8 | 57.0 | 3.6  |\n| CARS | RN18 | Triplet | EST | 30.7 | 41.0 | 5.6  | 31.8 | 7.3  |\n| CARS | RN18 | Triplet | ACT | 43.4 | 56.5 | 11.8 | 42.9 | 38.6 |\n| CARS | RN18 | Triplet | HM  | 60.2 | 71.6 | 33.9 | 51.2 | 46.0 |\n\n| Dataset | Model | Loss | Defense | R@1 | R@2 | mAP | NMI | ERS |\n| ---     | ---   | ---  | ---     | --- | --- | --- | --- | --- |\n| SOP | RN18 | Triplet | N/A | 62.9 | 68.5 | 39.2 | 87.4 | 4.0  |\n| SOP | RN18 | Triplet | EST | 46.0 | 51.4 | 24.5 | 84.7 | 31.7 |\n| SOP | RN18 | Triplet | ACT | 47.5 | 52.6 | 25.5 | 84.9 | 50.8 |\n| SOP | RN18 | Triplet | HM  | 46.8 | 51.7 | 24.5 | 84.7 | 61.6 |\n\nSource of these defense methods:\n\n1. N/A: Just standard classification network.\n2. 
EST: Adversarial Ranking Attack and Defense (ECCV2020)\n3. ACT: Adversarial Attack and Defense in Deep Ranking (arXiv:2106.03614)\n4. HM (or, concretely, `ghmetsmi`): Enhancing Adversarial Robustness for Deep Metric Learning (CVPR2022)\n\nDatasets like MNIST and Fashion-MNIST are excluded here because they are\nsimple toy datasets mostly for sanity testing, not for practical use.\n\n## 1. Common Usage of CLI\n\nPython library `RobRank` provides these functionalities: (1) training\nclassification or ranking (deep metric learning) models, either vanilla\nor defensive; (2) performing adversarial attacks against the trained models;\n(3) performing batched adversarial attacks. See below for detailed usage.\n\nYou can always specify the GPUs to use by `export CUDA_VISIBLE_DEVICES=\u003cGPUs\u003e`.\n\n**Environment Setup:** Use the command `$ pip install -r requirements.txt` to\ninstall all required python dependencies. Then you can use `pytest -v -x`\nto run the testsuite in order to make sure the code runs correctly. In case\nof pytest failure, you are welcome to\n[open a new issue](https://github.com/cdluminate/robrank/issues) for this\ncode repository.\n\n### 1.1. Training\n\nTrain a deep metric learning model or a classification model, either normally or\nadversarially.  As `pytorch-lightning` is used by this project, the training\nprocess will automatically use `DistributedDataParallel` when more than one GPU\nis available.\n\nThe typical usage for training a model is as follows:\n```shell\npython3 bin/train.py -C \u003cdataset\u003e:\u003cmodel\u003e:\u003closs\u003e\n```\nwhere a \"config\" is composed of three components, which makes the mechanism\nflexible enough to express many combinations. 
Specifically:\n\n* `dataset` (for all available datasets see `robrank/datasets/__init__.py`)\n  * mnist, fashion, cub, cars, sop (for deep metric learning)\n  * mnist, cifar10 (for classification)\n* model (for all available models see `robrank/models/__init__.py`)\n  * cc2f2: c2f2 network for classification\n  * cres18: resnet-18 for classification\n  * rres18: resnet-18 for deep metric learning (DML)\n  * rres18d: resnet-18 for DML with EST defense\n  * rres18p: resnet-18 for DML with ACT defense\n* loss (for all available losses see `robrank/losses/__init__.py`)\n  * ce: cross-entropy for classification\n  * ptripletN: triplet using Normalized Euclidean with SPC-2 batch.\n  * ptripletE: triplet using Euclidean (not on unit hypersphere) with SPC-2 batch.\n  * ptripletC: triplet using Cosine Distance with SPC-2 batch.\n  * pmtripletN: ptripletN using semihard sampling instead of random\n  * pstripletN: ptripletN using softhard sampling\n  * pdtripletN: ptripletN using distance weighted sampling\n  * phtripletN: ptripletN using batch hardest sampling\n\nFor example:\n```shell\n# classification\npython3 bin/train.py -C mnist:cc2f2:ce --do_test\npython3 bin/train.py -C cifar10:cres18:ce   # cifar10, resnet 18 classify, CE loss\npython3 bin/train.py -C cifar10:cres50:ce   # cifar10, resnet 50 classify, CE loss\n# deep metric learning\npython3 bin/train.py -C mnist:rc2f2:ptripletN\npython3 bin/train.py -C mnist:rc2f2p:ptripletN\npython3 bin/train.py -C cub:rres18:ptripletN\npython3 bin/train.py -C cub:rres18p:ptripletN\npython3 bin/train.py -C cars:rres18:ptripletN\npython3 bin/train.py -C cars:rres18p:ptripletN\npython3 bin/train.py -C sop:rres18:ptripletN\npython3 bin/train.py -C sop:rres18p:ptripletN\n```\n\nTips:\n1. When training DML models, export `FAISS_CPU=1` to disable NMI score\ncalculation on GPU (faiss). This could save a little bit of video memory if you\nencounter CUDA OOM.\n2. 
To change the number of PGD iterations for creating adversarial examples during\nthe training process, create an empty file to indicate the change. For example, \n`touch override_pgditer_8`. See `robrank/configs/configs_rank.py` for detail.\n\n### 1.2. Adversarial Attack\n\nScript `bin/advrank.py` is the entrance for conducting adversarial attacks\nagainst a trained model. For example, to conduct CA (w=1) with several\nmanually specified PGD parameters, you can do it as follows:\n\n```shell\npython3 bin/advrank.py -v -A CA:pm=+:W=1:eps=0.30196:alpha=0.011764:pgditer=32 -C \u003cxxx.ckpt\u003e\n```\nwhere `xxx.ckpt` is the path to the trained model (saved as a pytorch-lightning checkpoint).\nThe arguments specific to adversarial attacks are joined with a colon \":\"\nin order to avoid lengthy python code based `argparse` module. Example:\n\n```shell\npython3 bin/advrank.py -v -A CA:pm=+:W=1:eps=0.30196:alpha=0.011764:pgditer=32 -C logs_cub-rres18p-ptripletN/lightning_logs/version_0/checkpoints/epoch=74-step=3974.ckpt\n```\n\nPlease browse the bash scripts under the `tools/` directory for examples\nof other types of attacks discussed in the paper. Example:\n\n```shell\nexport CKPT=logs_cub-rres18p-ptripletN/lightning_logs/version_0/checkpoints/epoch=74-step=3974.ckpt\nbash tools/ca.bash + $CKPT      # CA+ column\nbash tools/ca.bash - $CKPT      # CA- column\nbash tools/es.bash $CKPT        # ES:D and ES:R column\n```\n\n### 1.3. Batched Adversarial Attack\n\nScript `bin/swipe.py` is used for conducting a batch of attacks against a specified\nmodel (pytorch-lightning checkpoint), automatically. And it will save the\noutput in json format as `\u003cmodel_ckpt\u003e.ckpt.\u003cswipe_profile\u003e.json`.\nAvailable `swipe_profile` includes `rob28`, `rob224` for ERS score;\nand `pami28`, `pami224` for CA and QA in various settings. A full list\nof possible profiles can be found in `robrank/cmdline.py`. 
You can even\ncustomize the code and create your own profile for batched evaluation.\n\n```shell\npython3 bin/swipe.py -p rob28 -C logs_fashion-rc2f2-ptripletN/.../xxx.ckpt\npython3 bin/swipe.py -p rob224 -C logs_cub-rres18-ptripletN/.../xxx.ckpt\n```\n\nYou may use `-m \u003cnumber\u003e` (e.g. `-m 10`) to specify the max number of iterations\nto get a quick assessment instead of going through the whole validation\ndataset.\n\nCurrently only single-GPU mode is supported for attacks. When the batched\nattack is finished, the results will be written into a json file\n`logs_fashion-rc2f2-ptripletN/.../xxx.ckpt.json`.  A helper script\n`tools/pjswipe.py` can display the content of resulting json files and\ncalculate the corresponding ERS:\n\n```\n$ python3 tools/pjswipe.py logs_fashion-rc2f2-ptripletN\n```\nThe script will automatically use the json file corresponding to the latest\nversion of the specified config. So specifying the log directory is enough.\nThat said, if multiple versions of the same config exist, and you want to\nlet it print the result of an old version, export `ITH=\u003cversion\u003e` (e.g. `ITH=1`)\nand run again. If tested with multiple profiles, export `JTYPE` to select\nthe exact profile. Read the comments in `tools/pjswipe.py` for details.\n\n### 1.4 Scripts for Complete Pipeline\n\nPlease browse the [`escript`](escript/) directory for the scripts containing\nthe command pipelines to reproduce the experiments.\n\n## 2. Project Information\n\n### 2.1. 
Directory Hierarchy\n\n```\n(the following directory tree is manually edited and annotated)\n.\n├── requirements.txt              Python deps (`pip install -r ...txt`)\n├── bin/train.py                  Entrance script for training models.\n├── bin/advrank.py                Entrance script for adversarial ranking.\n├── bin/swipe.py                  Entrance script for batched attack.\n├── robrank                       RobRank library.\n│   ├── attacks                   Attack Implementations.\n│   │   └── advrank*.py           Adversarial ranking attack (ECCV'2020).\n│   ├── defenses/*                Defense Implementations.\n│   ├── configs/*                 Configurations (incl. hyper-parameters).\n│   ├── datasets/*                Dataset classes.\n│   ├── models                    Models and base classes.\n│   │   ├── template_classify.py  Base class for classification models.\n│   │   ├── template_hybrid.py    Base class for Classification+DML models.\n│   │   └── template_rank.py      Base class for DML/ranking models.\n│   ├── losses/*                  Deep metric learning loss functions.\n│   ├── cmdline.py                Command line interface implementation.\n│   └── utils.py                  Miscellaneous utilities.\n└── tools/*                       Miscellaneous tools for experiments.\n```\n\n### 2.2. 
Tested Platform\n\nTested Software versions:\n\n```\nOS: Debian unstable, Debian Bullseye, Ubuntu 20.04 LTS, Ubuntu 16.04 LTS\nPython (anaconda distribution): 3.8.5, 3.9.X\nPyTorch: 1.7.1, 1.8.1, 1.11.0\nPyTorch-Lightning: see requirements.txt\n```\n\nMainly Tested Hardware:\n```\nCPU: Intel Xeon Family\nGPU: Nvidia GTX1080Ti, Titan Xp, RTX3090, A5000, A6000, A100\n```\nWith 8 RTX3090 GPUs, most experiments can be finished within 1 day.\nWith older configurations (such as `4* GTX1080Ti`), most experiments can be\nfinished within 3 days, including adversarial training.\n\nMemory requirement: 12GB video memory is required for adversarial training of\nRN18, Mnas, and IBN. Additionally, adversarial training of RN50 requires 24GB.\n\nIf you encounter the following error message:\n```\nTraceback (most recent call last):\n  File \"bin/train.py\", line 16, in \u003cmodule\u003e\n    import robrank as rr\nModuleNotFoundError: No module named 'robrank'\n```\nJust try `export PYTHONPATH=.` and run your command again.\n\n### 2.3. Dataset Preparation\n\nThe default data path setting for any dataset can be found in\n[`robrank/configs/configs_dataset.py`](robrank/configs/configs_dataset.py).\n\n**MNIST** and **Fashion-MNIST** are downloaded using torchvision. The helper script\n`bin/download.py` can download and extract the two datasets for you.\nJust do as follows in your terminal from the root directory of this project:\n```shell\n$ export PYTHONPATH=.\n$ python3 bin/download.py\n```\nThen the MNIST and Fashion-MNIST datasets are ready to use. Try to train a model.\n\nThe remaining datasets, namely\n[CUB-200-2011](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html),\n[Cars-196](http://ai.stanford.edu/~jkrause/cars/car_dataset.html), and\n[Stanford Online Products](https://cvgl.stanford.edu/projects/lifted_struct/)\ncan be downloaded from their corresponding websites (and then manually\nextracted). 
\n\n**CUB:** The tarball can be downloaded from `http://www.vision.caltech.edu/visipedia-data/CUB-200-2011/CUB_200_2011.tgz`. Then change your working directory to `~/.torch` and `tar xvf \u003cpath\u003e/CUB_200_2011.tgz -C .`. Now we are all set.\n\n**CARS:** Create a directory `~/.torch/cars` then change working directory into it. Download `http://imagenet.stanford.edu/internal/car196/car_ims.tgz`\nand `http://imagenet.stanford.edu/internal/car196/cars_annos.mat` into the directory. Finally, extract the tarball with `tar xvf car_ims.tgz`. We are ready to go.\n\n**SOP:** After downloading `Stanford_Online_Products.zip` from `ftp://cs.stanford.edu/cs/cvgl/Stanford_Online_Products.zip`,\njust do `$ cd ~/.torch` and `$ unzip \u003cpath\u003e/Stanford_Online_Products.zip`. Now SOP is ready to use.\n\nThe dataset loader is able to smartly read the dataset from `/dev/shm` to\novercome IO bottleneck (especially from HDDs) if a copy of the dataset is available\nthere. For instance, `rsync -av ~/.torch/Stanford_Online_Products /dev/shm`.\n\n**CIFAR:** For cifar10 `cd ~/.torch/; wget -c https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz; tar xvf cifar-10-python.tar.gz`. And for cifar100 `cd ~/.torch/; wget -c https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz; tar xvf cifar-100-python.tar.gz`.\n\n### 2.4. References and Bibtex\n\nIf you found the paper/code useful/inspiring, please consider citing my work:\n\n```bibtex\n@misc{robrank,\n      title={Adversarial Attack and Defense in Deep Ranking}, \n      author={Mo Zhou and Le Wang and Zhenxing Niu and Qilin Zhang and Nanning Zheng and Gang Hua},\n      year={2021},\n      eprint={2106.03614},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n\nBibtex of [M. Zhou, et al. \"Adversarial Ranking Attack and Defense,\" ECCV'2020.](https://github.com/cdluminate/advrank) can be found in the linked page.\n\n**Reference Software Projects:**\n\n1. 
https://github.com/Confusezius/Deep-Metric-Learning-Baselines\n2. https://github.com/Confusezius/Revisiting_Deep_Metric_Learning_PyTorch\n3. https://github.com/idstcv/SoftTriple\n4. https://github.com/KevinMusgrave/pytorch-metric-learning\n5. https://github.com/RobustBench/robustbench\n6. https://github.com/fra31/auto-attack\n7. https://github.com/KevinMusgrave/powerful-benchmarker\n8. https://github.com/MadryLab/robustness\n\n## Frequently Asked Questions\n\n* Q: Concrete code position of the defense methods?\n\nA: As you may have found, there are lots of leftover attempts towards a\nbetter defense in `robrank/defenses`. Renames during the research process\nalso resulted in some inconsistency. So I'd better directly point out the\ncode position here:  \n(1) `hm_training_step` in [`defenses/amd.py`](https://github.com/cdluminate/robrank/blob/main/robrank/defenses/amd.py)\nis the Hardness Manipulation (HM) defense. The function for creating adversarial\nexamples for adversarial training is `MadryInnerMax.HardnessManipulate` in the same file.  \n(2) `pnp_training_step` in [`defenses/pnp.py`](https://github.com/cdluminate/robrank/blob/main/robrank/defenses/pnp.py)\nis the Anti-Collapse Triplet (ACT) defense. The function for creating adversarial examples for adversarial\ntraining is `PositiveNegativePerplexing.pncollapse` in the same file.  \n(3) `est_training_step` in [`defenses/est.py`](https://github.com/cdluminate/robrank/blob/main/robrank/defenses/est.py)\nis the Embedding-Shift Triplet (EST) defense. The function for creating adversarial examples for adversarial\ntraining is the ES attack from the [`AdvRank` class](https://github.com/cdluminate/robrank/blob/main/robrank/attacks/advrank.py).  \n\n* Q: Training stuck at the end of validation with Nvidia A100, A6000, A5000, RTX3090, etc.\n\nA: I hate Nvidia for such a weird issue. 
And the reason for distributed data parallel\nbeing stuck varies across different situations or machines.\nHere are a bunch of tricks that might or might not work:  \n(1) Comment out `th.distributed.barrier()` from the code and run again.\nYou can locate that barrier function in the code using ripgrep. This seemed effective on RTX3090;  \n(2) use [`rank_zero_only` option](https://github.com/PyTorchLightning/pytorch-lightning/issues/8821#issuecomment-902402784) for pytorch-lightning logger:\n`sed -i robrank/models/template_rank.py -e \"s/self.log(\\(.*\\))/self.log(\\1, rank_zero_only=True)/g\"`;  \n(3) [change the distributed backend](https://github.com/PyTorchLightning/pytorch-lightning/discussions/6509) of [pytorch](https://pytorch.org/docs/stable/distributed.html#debugging-torch-distributed-applications): `export PL_TORCH_DISTRIBUTED_BACKEND=gloo`;  \n(4) disable P2P feature for NCCL. `export NCCL_P2P_DISABLE=1`;  \n(5) change strategy from `ddp` to `ddp_spawn` in `robrank/cmdline.py`. Run the training again and let it raise an error.\nThen change back to `ddp`; after this the A5000 started working;  \n(6) [P2P GPU traffic will fail with IOMMU](https://github.com/pytorch/pytorch/issues/1637#issuecomment-338268158). Check the `p2pBandwidthLatencyTest` cuda example and see whether it can run. If not, then it's not a pytorch issue. Disabling `iommu` through a kernel parameter should work: set `GRUB_CMDLINE_LINUX=\"iommu=soft\"` in `/etc/default/grub` and run `sudo update-grub2` after the edit. The Linux kernel has documentation describing [this iommu parameter](https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt). IOMMU group assignment can be found under `/sys/kernel/iommu_groups`;  \n(7) Use only even/odd numbered GPUs `CUDA_VISIBLE_DEVICES=1,3,5` instead of `CUDA_VISIBLE_DEVICES=1,2,3`. This works sometimes for at least the `p2pBandwidthLatencyTest` test program;  \n(8) turn off ACS in BIOS;  \n(9) change `num_workers=0` for dataloader.  
\n\n* Q: Maxepoch is 16 or 150 in the paper, but 8 or 75 in the code?\n\nA: They are equivalent due to the implementation details in the dataset\nsampler. It is a fixable problem (but not necessary). See [issue #9](https://github.com/cdluminate/robrank/issues/9).\n\n* Q: Training time?\n\nRTX A5000 performance is similar to RTX 3090. RTX A6000 is slightly faster\nthan RTX 3090. Nvidia A100 is roughly 1.5 times faster than RTX 3090.\nRTX 3090 is roughly 2~3 times faster than Nvidia Titan Xp (or GTX 1080Ti).\nIn the following table, `eta` is exactly the PGD iteration number (pgditer).\nIt can be overridden by file indicators like `override_pgditer_8` as described\nin previous documentation. Time cost on MNIST and Fashion-MNIST is expected\nto be identical. For the remaining datasets, time consumption order is CUB \u003c CARS \u003c SOP.\n\n| Config                              | eta | GPU Model | Number of GPUs | Time (roughly) |\n| ---                                 | --- | ---       | ---            | ---            |\n| `fashion:rc2f2:ptripletN`           | N/A | Titan Xp  | 2 (DDP)        | 2 min          |\n| `fashion:rc2f2p:ptripletN`          | 32  | Titan Xp  | 2 (DDP)        | 10 min         |\n| `cub:rres18:ptripletN`              | N/A | Titan Xp  | 2 (DDP)        | 30 min         |\n| `cub:rres18p:ptripletN`             | 8   | Titan Xp  | 2 (DDP)        | 130 min        |\n| `cub:rres18p:ptripletN`             | 32  | Titan Xp  | 2 (DDP)        | 420 min        |\n| `cub:rres18ghmetsmi:ptripletN`      | 32  | Titan Xp  | 2 (DDP)        | 470 min        |\n| `cars:rres18p:ptripletN`            | 8   | Titan Xp  | 2 (DDP)        | 180 min        |\n| `cars:rres18ghmetsmi:ptripletN`     | 32  | Titan Xp  | 2 (DDP)        | 530 min        |\n| `sop:rres18:ptripletN`              | N/A | RTX A5000 | 4 (DDP)        | 60 min         |\n| `sop:rres18:ptripletN`              | N/A | RTX A6000 | 2 (DDP)        | 120 min        |\n| `sop:rres18p:ptripletN`             | 8   
| RTX A6000 | 2 (DDP)        | 560 min        |\n| `sop:rres18p:ptripletN`             | 32  | RTX A6000 | 2 (DDP)        | 1830 min       |\n\n* Q: Pre-trained models and logs?\n\nSee the [model card](doc/model-card.md) for download links.\n\n### Copyright and License\n\n```\nCopyright (C) 2019-2022, Mo Zhou \u003ccdluminate@gmail.com\u003e\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcdluminate%2Frobrank","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcdluminate%2Frobrank","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcdluminate%2Frobrank/lists"}