https://github.com/sungnyun/openssl-simcore

(CVPR 2023) Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning
coreset openssl pytorch self-supervised-learning

Host: GitHub
URL: https://github.com/sungnyun/openssl-simcore
Owner: sungnyun
Created: 2023-03-15T07:32:16.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2023-10-03T04:11:22.000Z (over 2 years ago)
Last Synced: 2025-03-25T22:21:33.929Z (about 1 year ago)
Topics: coreset, openssl, pytorch, self-supervised-learning
Language: Python
Homepage:
Size: 167 KB
Stars: 28
Watchers: 2
Forks: 8
Open Issues: 0
Metadata Files:
- Readme: README.md

# OpenSSL-SimCore (CVPR 2023)

[**Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning**](https://arxiv.org/abs/2303.11101)


[Sungnyun Kim](https://github.com/sungnyun)\*,

[Sangmin Bae](https://www.raymin0223.com)\*,

[Se-Young Yun](https://fbsqkd.github.io)


\* equal contribution

- **Open-set Self-Supervised Learning (OpenSSL) task**: an unlabeled open-set is available during the pretraining phase on the fine-grained dataset.

- **SimCore**: a simple coreset selection algorithm that selects a subset of the open-set that is semantically similar to the target dataset.

- SimCore significantly improves representation learning performance in various downstream tasks.

- [update on 10.02.2023] Shared SimCore-pretrained models on [HuggingFace Models](https://huggingface.co/sungnyun/openssl-simcore).
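
As a rough illustration of what selecting an open-set subset "semantically similar to the target dataset" can look like, the sketch below greedily picks open-set samples so that every target feature is well covered by its most similar selected sample (a facility-location style objective). It is a simplified stand-in, not the repository's implementation: the function names, the fixed `budget`, and the toy feature vectors are assumptions, and the actual pipeline relies on a retrieval model checkpoint and a stopping criterion.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_similar_subset(target_feats, openset_feats, budget):
    """Greedily select up to `budget` open-set indices (hypothetical helper).

    Each step adds the open-set sample that most increases the sum, over
    target samples, of the similarity to the closest selected sample.
    """
    sims = [[cosine(t, o) for o in openset_feats] for t in target_feats]
    best = [-1.0] * len(target_feats)  # lowest possible cosine similarity
    selected = []
    candidates = set(range(len(openset_feats)))
    while candidates and len(selected) < budget:
        def gain(j):
            return sum(max(row[j], b) - b for row, b in zip(sims, best))

        j_star = max(sorted(candidates), key=gain)
        if gain(j_star) <= 0.0:
            break  # no remaining sample improves coverage of the target set
        selected.append(j_star)
        candidates.remove(j_star)
        best = [max(row[j_star], b) for row, b in zip(sims, best)]
    return selected


# Toy 2-D "features"; real features would come from a pretrained encoder.
target = [[1.0, 0.0], [0.0, 1.0]]
openset = [[1.0, 0.1], [0.1, 1.0], [-1.0, -1.0], [0.9, 0.0]]
print(select_similar_subset(target, openset, budget=2))
```

The open-set sample pointing away from both target vectors is never selected, which is the intended effect: samples unrelated to the target data are left out of the pretraining set.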

## Requirements

Install the necessary packages with: 

```

$ pip install -r requirements.txt

```

## Data Preparation

We used 11 fine-grained datasets and 7 open-sets.

Place each dataset's files in `data/[DATASET_NAME]/` (organized in the `torchvision.datasets.ImageFolder` format).

To download and set up the data, see the [docs](data/README.md) and run the Python scripts if necessary.

```bash

$ cd data/

$ python [DATASET_NAME]_image_folder_generator.py

```
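
The `ImageFolder` format expects one subdirectory per class, with the image files inside it (`root/class_name/image.jpg`). The following is a small sanity check, assuming only the standard library, that counts the images found under each class directory. The helper name and the extension list are illustrative assumptions and are not part of this repository.

```python
import os

# Assumed subset of image extensions; extend it if your data uses others.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".webp")


def summarize_image_folder(root):
    """Return {class_name: image_count} for a root/class_name/image layout."""
    with os.scandir(root) as it:
        entries = sorted(it, key=lambda e: e.name)
    summary = {}
    for entry in entries:
        if not entry.is_dir():
            continue  # files directly under root do not belong to any class
        count = 0
        for _dirpath, _dirnames, filenames in os.walk(entry.path):
            count += sum(
                name.lower().endswith(IMAGE_EXTENSIONS) for name in filenames
            )
        summary[entry.name] = count
    return summary
```

Calling `summarize_image_folder` on a prepared dataset directory and looking for classes with a count of zero is a quick way to spot an incomplete download or a misplaced folder before pretraining starts.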

## Pretraining

To pretrain the model, run the shell script. (Multi-GPU training is supported; we used 4 GPUs.)

You will need to define the **path for each dataset** and the **retrieval model checkpoint**.

```bash

# specify $TAG and $DATA, and set CUDA_VISIBLE_DEVICES to the GPU IDs to use

$ CUDA_VISIBLE_DEVICES= bash run_selfsup.sh

```

Here are some important arguments to be considered.

- `--dataset1`: fine-grained target dataset name

- `--dataset2`: open-set name (default: imagenet)

- `--data_folder1`: directory where the `dataset1` is located

- `--data_folder2`: directory where the `dataset2` is located

- `--retrieval_ckpt`: checkpoint of the retrieval model used before SimCore pretraining; to obtain it, pretrain with vanilla SSL for 1K epochs

- `--model`: model architecture (default: resnet50), see [models](models/)

- `--method`: self-supervised learning method (default: simclr), see [ssl](ssl/)

- `--sampling_method`: strategy for sampling from the open-set (choose between "random" or "simcore")

- `--no_sampling`: set this to True to disable sampling (vanilla SSL pretraining)

The pretrained model checkpoints will be saved at `save/[EXP_NAME]/`. For example, if you run the default shell file, the last epoch checkpoint will be saved as `save/$DATA_resnet50_pretrain_simclr_merge_imagenet_$TAG/last.pth`.
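
To locate the checkpoint for a later evaluation run, the path from the example above can be built programmatically. This is only a sketch: it assumes that the model, method, and open-set names slot into the experiment name exactly as in the default example, which this README confirms only for the default settings. The function name and the sample values `"cub"` and `"run1"` are hypothetical.

```python
from pathlib import PurePosixPath


def default_checkpoint_path(data, tag, model="resnet50", method="simclr",
                            openset="imagenet"):
    """Build the last-epoch checkpoint path following the default example."""
    # Assumed pattern, taken from the default example in this README.
    exp_name = f"{data}_{model}_pretrain_{method}_merge_{openset}_{tag}"
    return PurePosixPath("save") / exp_name / "last.pth"


# "cub" and "run1" are placeholder values for $DATA and $TAG.
print(default_checkpoint_path("cub", "run1"))
```

If a run uses non-default arguments, check the actual directory name under `save/` rather than relying on this pattern.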

## Linear Evaluation

Linear evaluation of the pretrained models is run in the same way as pretraining.

Run the following shell script with the **pretrained model checkpoint** additionally defined.

```bash

# specify $TAG, $DATA, and --pretrained_ckpt, and set CUDA_VISIBLE_DEVICES to the GPU IDs to use

$ CUDA_VISIBLE_DEVICES= bash run_sup.sh

```

We also support **kNN evaluation** (`--knn`, `--topk`) and **semi-supervised fine-tuning** (`--label_ratio`, `--e2e`).
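
For readers unfamiliar with kNN evaluation, the idea is to classify each test feature by the labels of its `topk` most similar training features, without training a classifier. The sketch below uses plain majority voting over cosine similarity on toy vectors. It is a generic illustration, not the repository's evaluation code, which may weight votes or batch the computation differently; all names here are assumptions.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(query, train_feats, train_labels, topk=3):
    """Predict a label by majority vote among the topk nearest neighbors."""
    scored = [(cosine(query, f), y) for f, y in zip(train_feats, train_labels)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:topk])
    return votes.most_common(1)[0][0]


def knn_accuracy(test_feats, test_labels, train_feats, train_labels, topk=3):
    """Fraction of test samples whose kNN prediction matches the label."""
    correct = sum(
        knn_predict(q, train_feats, train_labels, topk) == y
        for q, y in zip(test_feats, test_labels)
    )
    return correct / len(test_labels)


# Toy features standing in for encoder outputs.
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = [0, 0, 1, 1]
print(knn_accuracy([[1.0, 0.05], [0.0, 1.0]], [0, 1], train_feats, train_labels))
```

Because no classifier is trained, this kind of evaluation reflects the quality of the frozen features directly.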

### Result

SimCore with a stopping criterion improves accuracy by +10.5% (averaged over 11 datasets) compared to pretraining without any open-set.







### Try other open-sets

SimCore works with various open-sets, even uncurated ones. You can also try it with your own custom or web-crawled open-sets.





 

 

 





## Downstream Tasks

SimCore is extensively evaluated in various downstream tasks.    

We therefore provide training and evaluation code for the following downstream tasks.

For more details, please see the [docs](downstream/README.md) and `downstream/` directory.    

- [object detection](downstream/detection)

- [pixel-wise segmentation](downstream/segmentation)

- [open-set semi-supervised learning](downstream/opensemi)

- [webly supervised learning](downstream/weblysup)

- [semi-supervised learning](downstream/semisup)

- [active learning](downstream/active)

- [hard negative mining](downstream/hnm)

Use the pretrained model checkpoint to run each downstream task.

## BibTeX

If you find this repo useful for your research, please consider citing our paper:

```

@article{kim2023coreset,

  title={Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning},

  author={Kim, Sungnyun and Bae, Sangmin and Yun, Se-Young},

  journal={arXiv preprint arXiv:2303.11101},

  year={2023}

}

```

## Contact

- Sungnyun Kim: ksn4397@kaist.ac.kr

- Sangmin Bae: bsmn0223@kaist.ac.kr
