https://github.com/sungnyun/openssl-simcore
(CVPR 2023) Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning
- Host: GitHub
- URL: https://github.com/sungnyun/openssl-simcore
- Owner: sungnyun
- Created: 2023-03-15T07:32:16.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-10-03T04:11:22.000Z (over 2 years ago)
- Last Synced: 2025-03-25T22:21:33.929Z (about 1 year ago)
- Topics: coreset, openssl, pytorch, self-supervised-learning
- Language: Python
- Homepage:
- Size: 167 KB
- Stars: 28
- Watchers: 2
- Forks: 8
- Open Issues: 0
# OpenSSL-SimCore (CVPR 2023)
[**Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning**](https://arxiv.org/abs/2303.11101)
[Sungnyun Kim](https://github.com/sungnyun)\*,
[Sangmin Bae](https://www.raymin0223.com)\*,
[Se-Young Yun](https://fbsqkd.github.io)
\* equal contribution
- **Open-set Self-Supervised Learning (OpenSSL) task**: an unlabeled open-set is available during the pretraining phase on the fine-grained dataset.
- **SimCore**: a simple coreset selection algorithm that leverages a subset of the open-set that is semantically similar to the target dataset.
- SimCore significantly improves representation learning performance in various downstream tasks.
- [update on 10.02.2023] Shared SimCore-pretrained models on [HuggingFace Models](https://huggingface.co/sungnyun/openssl-simcore).
## Requirements
Install the necessary packages with:
```bash
$ pip install -r requirements.txt
```
## Data Preparation
We used 11 fine-grained datasets and 7 open-sets.
Place the files of each dataset in `data/[DATASET_NAME]/` (the directory should follow the `torchvision.datasets.ImageFolder` format).
To download and set up the data, see the [docs](data/README.md) and, if necessary, run the Python generator scripts.
```bash
$ cd data/
$ python [DATASET_NAME]_image_folder_generator.py
```
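Before pretraining, it can help to confirm that a dataset directory really has the `root/<class>/<image>` layout that `torchvision.datasets.ImageFolder` expects. The following is a minimal sketch using only the standard library; the helper names `summarize_image_folder` and `check_image_folder` and the extension list are assumptions made for this illustration and are not part of the repository.

```python
import os

# Assumed subset of image extensions for this sketch; torchvision accepts a longer list.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".webp")


def summarize_image_folder(root):
    """Return {class_name: image_count} for a root laid out as root/<class>/<image>."""
    counts = {}
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if not entry.is_dir():
            continue  # only subdirectories of the root are treated as classes
        n_images = 0
        # Images may sit in nested folders below the class directory.
        for _dirpath, _dirnames, filenames in os.walk(entry.path):
            n_images += sum(name.lower().endswith(IMAGE_EXTENSIONS) for name in filenames)
        counts[entry.name] = n_images
    return counts


def check_image_folder(root):
    """Raise ValueError if there are no class folders or a class folder has no images."""
    counts = summarize_image_folder(root)
    if not counts:
        raise ValueError(f"no class subdirectories found in {root}")
    empty = [name for name, n in counts.items() if n == 0]
    if empty:
        raise ValueError(f"class folders without images: {empty}")
    return counts
```

Calling `check_image_folder("data/[DATASET_NAME]")` with the placeholder replaced by a real directory returns the per-class image counts. The check on empty class folders is deliberately strict and may be stricter than some torchvision versions.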
## Pretraining
To pretrain the model, run the shell script. (Multi-GPU training is supported; we used 4 GPUs.)
You will need to define the **path for each dataset** and the **retrieval model checkpoint**.
```bash
# specify $TAG and $DATA, and set CUDA_VISIBLE_DEVICES to the GPU indices to use
$ CUDA_VISIBLE_DEVICES= bash run_selfsup.sh
```
Here are some important arguments to be considered.
- `--dataset1`: fine-grained target dataset name
- `--dataset2`: open-set name (default: imagenet)
- `--data_folder1`: directory where the `dataset1` is located
- `--data_folder2`: directory where the `dataset2` is located
- `--retrieval_ckpt`: checkpoint of the retrieval model, which is required before SimCore pretraining; to obtain it, pretrain with vanilla SSL for 1K epochs
- `--model`: model architecture (default: resnet50), see [models](models/)
- `--method`: self-supervised learning method (default: simclr), see [ssl](ssl/)
- `--sampling_method`: strategy for sampling from the open-set (choose between "random" and "simcore")
- `--no_sampling`: set this to True to disable sampling (vanilla SSL pretraining)
The pretrained model checkpoints are saved to `save/[EXP_NAME]/`. For example, if you run the default shell script, the last-epoch checkpoint is saved as `save/$DATA_resnet50_pretrain_simclr_merge_imagenet_$TAG/last.pth`.
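To give a feel for the difference between the two `--sampling_method` choices, here is a simplified sketch that works on plain feature vectors, such as those a retrieval model might produce. It is not the repository's SimCore implementation: the `"similarity"` branch is a one-shot nearest-neighbor ranking with a fixed `budget`, standing in for SimCore, and it has no stopping criterion. The function names, the `budget` parameter, and the list-of-lists feature format are all assumptions made for this illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def max_cosine_scores(target_feats, openset_feats):
    """For each open-set feature, the highest cosine similarity to any target feature."""
    targets = [l2_normalize(t) for t in target_feats]
    scores = []
    for feat in openset_feats:
        f = l2_normalize(feat)
        scores.append(max(sum(a * b for a, b in zip(f, t)) for t in targets))
    return scores


def sample_openset(target_feats, openset_feats, budget, method="similarity", seed=0):
    """Return sorted indices of open-set samples to merge with the target dataset."""
    n = len(openset_feats)
    budget = min(budget, n)
    if method == "random":
        # Baseline: uniform sampling without replacement.
        return sorted(random.Random(seed).sample(range(n), budget))
    if method == "similarity":
        # Simplified stand-in for SimCore: keep the samples closest to the target set.
        scores = max_cosine_scores(target_feats, openset_feats)
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        return sorted(ranked[:budget])
    raise ValueError(f"unknown sampling method: {method}")


if __name__ == "__main__":
    target = [[1.0, 0.0], [0.9, 0.1]]
    openset = [[0.0, 1.0], [1.0, 0.05], [-1.0, 0.0], [0.8, 0.2]]
    print(sample_openset(target, openset, budget=2))  # prints [1, 3]
```

In this toy example, the two open-set vectors pointing in nearly the same direction as the target vectors are selected, while the orthogonal and opposite ones are left out.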
## Linear Evaluation
Linear evaluation of the pretrained models is run in the same way as pretraining.
Run the following shell file, with the **pretrained model checkpoint** additionally defined.
```bash
# specify $TAG, $DATA, and --pretrained_ckpt, and set CUDA_VISIBLE_DEVICES to the GPU indices to use
$ CUDA_VISIBLE_DEVICES= bash run_sup.sh
```
We also support **kNN evaluation** (`--knn`, `--topk`) and **semi-supervised fine-tuning** (`--label_ratio`, `--e2e`).
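As a rough illustration of what kNN evaluation with a top-k setting measures, the sketch below classifies each test feature by majority vote over its most similar training features. It assumes cosine similarity and unweighted voting; the repository's `--knn` implementation may differ (for example, by weighting votes), and the function names here are hypothetical.

```python
import math
from collections import Counter


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def knn_predict(train_feats, train_labels, query_feat, topk=3):
    """Predict a label by majority vote over the topk most similar training features."""
    q = _normalize(query_feat)
    sims = [sum(a * b for a, b in zip(q, _normalize(f))) for f in train_feats]
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:topk]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def knn_accuracy(train_feats, train_labels, test_feats, test_labels, topk=3):
    """Fraction of test features whose kNN prediction matches the true label."""
    correct = sum(
        knn_predict(train_feats, train_labels, f, topk) == y
        for f, y in zip(test_feats, test_labels)
    )
    return correct / len(test_labels)
```

Because the encoder stays frozen and no classifier is trained, this kind of evaluation depends only on the geometry of the pretrained features.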
### Result
SimCore with a stopping criterion improves accuracy by +10.5% (averaged over 11 datasets) compared to pretraining without any open-set.
### Try other open-sets
SimCore works with various open-sets, even uncurated ones. You can also try your own custom, web-crawled, or uncurated open-sets.
## Downstream Tasks
SimCore is extensively evaluated on various downstream tasks.
We therefore provide training and evaluation code for the following downstream tasks.
For more details, please see the [docs](downstream/README.md) and `downstream/` directory.
- [object detection](downstream/detection)
- [pixel-wise segmentation](downstream/segmentation)
- [open-set semi-supervised learning](downstream/opensemi)
- [webly supervised learning](downstream/weblysup)
- [semi-supervised learning](downstream/semisup)
- [active learning](downstream/active)
- [hard negative mining](downstream/hnm)
Use the pretrained model checkpoint to run each downstream task.
## BibTeX
If you find this repo useful for your research, please consider citing our paper:
```
@article{kim2023coreset,
title={Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning},
author={Kim, Sungnyun and Bae, Sangmin and Yun, Se-Young},
journal={arXiv preprint arXiv:2303.11101},
year={2023}
}
```
## Contact
- Sungnyun Kim: ksn4397@kaist.ac.kr
- Sangmin Bae: bsmn0223@kaist.ac.kr