# Soft Label Pruning for Large-scale Dataset Distillation (LPLD)
[[`Paper`](https://arxiv.org/abs/2410.15919) | [`BibTex`](#citation) | [`Google Drive`](https://drive.google.com/drive/folders/1_eFjyWmrFXtprslgAwjyMpvhfB_qTf7t?usp=sharing)]

---
Official Implementation for "[Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?](https://arxiv.org/abs/2410.15919)", published at NeurIPS'24.
[Lingao Xiao](https://scholar.google.com/citations?user=MlNI5YYAAAAJ), [Yang He](https://scholar.google.com/citations?user=vvnFsIIAAAAJ)
> **Abstract**: In ImageNet-condensation, the storage for auxiliary soft labels exceeds that of the condensed dataset by over 30 times.
However, ***are large-scale soft labels necessary for large-scale dataset distillation***?
In this paper, we first discover that the high within-class similarity in condensed datasets necessitates the use of large-scale soft labels.
This high within-class similarity can be attributed to the fact that previous methods use samples from different classes to construct a single batch for batch normalization (BN) matching.
To reduce the within-class similarity, we introduce class-wise supervision during the image synthesizing process by batching the samples within classes, instead of across classes.
As a result, we can increase within-class diversity and reduce the size of required soft labels.
A key benefit of improved image diversity is that soft label compression can be achieved through simple random pruning, eliminating the need for complex rule-based strategies. Experiments validate our discoveries.
For example, when condensing ImageNet-1K to 200 images per class, our approach compresses the required soft labels from 113 GB to 2.8 GB (40x compression) with a 2.6% performance gain.
> Images from left to right are from IPC20 LPLD datasets: cock (left), bald eagle, volcano, trailer truck (right).
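To make the pruning idea from the abstract concrete, below is a minimal sketch of simple random pruning of stored soft labels. The shapes, the pruned axis, and all names here are illustrative assumptions for this sketch, not the repository's actual implementation; see the paper and the code for the real pipeline.

```python
import torch

# Toy scale so the sketch actually runs; real label sets are far larger
# (e.g., 113 GB of soft labels for ImageNet-1K IPC200).
num_epochs, num_images, num_classes = 300, 100, 10
soft_labels = torch.randn(num_epochs, num_images, num_classes)

def random_prune(labels: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep a random subset of entries along dim 0 (1/40 for 40x compression)."""
    num_keep = max(1, int(labels.shape[0] * keep_ratio))
    keep_idx = torch.randperm(labels.shape[0])[:num_keep].sort().values
    return labels[keep_idx]

pruned = random_prune(soft_labels, keep_ratio=1 / 40)
print(pruned.shape)  # torch.Size([7, 100, 10])
```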
# Installation

Download the repo:
```sh
git clone https://github.com/he-y/soft-label-pruning-for-dataset-distillation.git LPLD
cd LPLD
```

Create the PyTorch environment:
```sh
conda env create -f environment.yml
conda activate lpld
```

## Download all datasets and labels
### Method 1: Automatic Downloading
```sh
# sh download.sh [true|false]
sh download.sh false
```
- `true|false` controls whether to download only the 40x-compressed labels (`true`) or all labels (`false`, the default).

### Method 2: Manual Downloading
Download manually from [Google Drive](https://drive.google.com/drive/folders/1_eFjyWmrFXtprslgAwjyMpvhfB_qTf7t?usp=sharing), and place the downloaded files in the following structure:
```
.
├── README.md
├── recover
│   ├── model_with_class_bn
│   │   └── [put Models-with-Class-BN here]
│   └── validate_result
│       └── [put Distilled-Dataset here]
└── relabel_and_validate
    └── syn_label_LPLD
        └── [put Labels here]
```
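After placing the files, a quick layout check can catch misplaced downloads. A minimal sketch, assuming it is run from the repo root (directory names taken from the tree above):

```python
from pathlib import Path

# Directories from the layout above; run this from the repo root.
expected = [
    "recover/model_with_class_bn",
    "recover/validate_result",
    "relabel_and_validate/syn_label_LPLD",
]
for rel in expected:
    print(f"{rel}: {'ok' if Path(rel).is_dir() else 'missing'}")
```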
## You will find the following after downloading

#### Model with Class BN
| Dataset | Model with Class BN | Size |
| :-----------: | :-----------------: | :------------------------------------------------------------------------------------------------: |
| ImageNet-1K | ResNet18 | [50.41 MB](https://drive.google.com/file/d/1Vfou8nPp3x7m7YEG0wd7FcuQ_yE9jj34/view?usp=drive_link) |
| Tiny-ImageNet | ResNet18 | [81.30 MB](https://drive.google.com/file/d/1sCArvJoHFthbSaBuWoUhDn67tsYtRPTn/view?usp=drive_link) |
| ImageNet-21K  | ResNet18 | [445.87 MB](https://drive.google.com/file/d/1BuplTqBhXKzdfJqCKkTBg218Cbezef57/view?usp=drive_link) |

#### Distilled Image Dataset
| Dataset | Setting | Dataset Size |
| :-----------: | :----: | :----------------------------------------------------------------------------------------------: |
| ImageNet-1K   | IPC10  | [0.15 GB](https://drive.google.com/file/d/1lXN_zi8LRq1pvrVZZgQA6W_ZU83BwvUf/view?usp=drive_link) |
| ImageNet-1K   | IPC20  | [0.30 GB](https://drive.google.com/file/d/18q2MZ5sr9AfNYqcd-NAa3-j3m7LRXKwN/view?usp=drive_link) |
| ImageNet-1K   | IPC50  | [0.75 GB](https://drive.google.com/file/d/1o081uXC-ebu28S_uuT04liACqAwhjA1O/view?usp=drive_link) |
| ImageNet-1K   | IPC100 | [1.49 GB](https://drive.google.com/file/d/18maJqCbuPXKT8zBHTLebbMbUgGwJIn4o/view?usp=drive_link) |
| ImageNet-1K   | IPC200 | [2.98 GB](https://drive.google.com/file/d/1-dLbdD3ww5wap4LpSjb1Ees4II4crv7p/view?usp=drive_link) |
| Tiny-ImageNet | IPC50  | [21 MB](https://drive.google.com/file/d/1W0JUOAZBrQwIlquIgOpdbFi5C_s8TNt8/view?usp=drive_link)    |
| Tiny-ImageNet | IPC100 | [40 MB](https://drive.google.com/file/d/1cQDD8OfMfoshsDIaiWOQb95pn2q9veuk/view?usp=drive_link)    |
| ImageNet-21K  | IPC10  | [3 GB](https://drive.google.com/file/d/1DgmZNr1swgJrKZySjk1smgOiGr0mUU2R/view?usp=drive_link)     |
| ImageNet-21K  | IPC20  | [5 GB](https://drive.google.com/file/d/1rycYU2q6JeUbGUDPBatr_QJQMJQSnGUk/view?usp=drive_link)     |

#### Previous Soft Labels vs Ours
| Dataset | Setting | Previous Label Size | Previous Model Acc. | Ours Label Size | Ours Model Acc. |
| :------------ | :----: | :-------: | :---: | :-----------------------------------------------------------------------------------------------------: | :---: |
| ImageNet-1K   | IPC10  | 5.67 GB   | 20.1% | [0.14 GB (40x)](https://drive.google.com/file/d/1Nf1piVIXIF-_v-jCEmaYGHdWTXsuQIkY/view?usp=drive_link)   | 20.2% |
| ImageNet-1K   | IPC20  | 11.33 GB  | 33.6% | [0.29 GB (40x)](https://drive.google.com/file/d/1AdP44DJUadFlY1WCrYiE7F6slotk3Vx4/view?usp=drive_link)   | 33.0% |
| ImageNet-1K   | IPC50  | 28.33 GB  | 46.8% | [0.71 GB (40x)](https://drive.google.com/file/d/1GnCY-Apg-dXgZe8BvDwDKqrQSAz1PAbs/view?usp=drive_link)   | 46.7% |
| ImageNet-1K   | IPC100 | 56.66 GB  | 52.8% | [1.43 GB (40x)](https://drive.google.com/file/d/12f6qUjsoN6AczK7iJz2ZAT8xNiX0W4bX/view?usp=drive_link)   | 54.0% |
| ImageNet-1K   | IPC200 | 113.33 GB | 57.0% | [2.85 GB (40x)](https://drive.google.com/file/d/1mHWwOaB0yG7fP_lbDSZMmIHUrMh_nDWZ/view?usp=drive_link)   | 59.6% |
| Tiny-ImageNet | IPC50  | 449 MB    | 41.1% | [11 MB (40x)](https://drive.google.com/file/d/1Yzgu-I96ODg2J8_AhGuNOP2mlUtbCzHU/view?usp=drive_link)     | 38.4% |
| Tiny-ImageNet | IPC100 | 898 MB    | 49.7% | [22 MB (40x)](https://drive.google.com/file/d/1oJuUIq36raTtD63sfzT37ZJ3kGqZGqbv/view?usp=drive_link)     | 46.1% |
| ImageNet-21K  | IPC10  | 643 GB    | 18.5% | [16 GB (40x)](https://drive.google.com/file/d/1inuNAC7ApJWiuXaCsEwWU9_z7DOpMBzG/view?usp=drive_link)     | 21.3% |
| ImageNet-21K  | IPC20  | 1286 GB   | 20.5% | [32 GB (40x)](https://drive.google.com/file/d/1g52Lo2XoKHbJySkiLFo3Gsl6hnjffOEN/view?usp=drive_link)     | 29.4% |
- Full labels for ImageNet-21K are too large to upload; nevertheless, we provide the 40x-pruned labels.
- Labels for other compression ratios are provided on [Google Drive](https://drive.google.com/drive/folders/1LIKrlcydyowSkw2lRjgrzfULHYZWTNh7?usp=drive_link); alternatively, refer to [README: Usage](./README_usage.md) to generate the labels.
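- Sanity check on the 40x ratio: for ImageNet-1K IPC200, 113.33 GB / 40 ≈ 2.83 GB, in line with the 2.85 GB pruned-label entry above.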
## Necessary Modification for PyTorch

Modify the PyTorch source code `torch.utils.data._utils.fetch._MapDatasetFetcher` to support multi-processing loading of soft label data and mix configurations:

```python
class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        # Added: load the pre-generated mix configuration and soft label
        # for this batch before fetching the images.
        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            if hasattr(self.dataset, "G_VBSM") and self.dataset.G_VBSM:
                pass  # G_VBSM: uses self-decoding in the training script
            elif hasattr(self.dataset, "use_batch") and self.dataset.use_batch:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config_by_batch_idx(possibly_batched_index[0])
            else:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config(possibly_batched_index[0])

        # Unchanged stock PyTorch fetching logic.
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
        else:
            data = self.dataset[possibly_batched_index]

        # Added: return the mix configuration and soft label alongside
        # the collated batch.
        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            # NOTE: mix_index, mix_lam, mix_bbox can be None
            mix_index_cpu = mix_index.cpu() if mix_index is not None else None
            return self.collate_fn(data), mix_index_cpu, mix_lam, mix_bbox, soft_label.cpu()
        else:
            return self.collate_fn(data)
```
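The class to patch lives inside your installed PyTorch package; a quick way to find the exact file (plain module introspection, not specific to this repo):

```python
# Print the location of the module that defines _MapDatasetFetcher,
# i.e., the file to patch with the class above.
from torch.utils.data._utils import fetch

print(fetch.__file__)
```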
# Reproduce Results for 40x compression ratio

To reproduce the [[`Table`](#previous-soft-labels-vs-ours)] results for the 40x compression ratio, run the following:
```sh
cd relabel_and_validate
bash scripts/reproduce/main_table_in1k.sh
bash scripts/reproduce/main_table_tiny.sh
bash scripts/reproduce/main_table_in21k.sh
```

NOTE: the validation directory (`val_dir`) in the config files (`relabel_and_validate/cfg/reproduce/CONFIG_FILE`) should be changed to the correct path on your device.
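For a one-off change, `val_dir` can also be rewritten programmatically. This sketch assumes the configs are YAML files with a top-level `val_dir` key and that PyYAML is installed; `CONFIG_FILE` is the placeholder from the note above, so verify against the actual files before relying on it:

```python
from pathlib import Path

import yaml  # PyYAML

# Hypothetical example: point val_dir at a local ImageNet validation set.
cfg_path = Path("relabel_and_validate/cfg/reproduce") / "CONFIG_FILE"
cfg = yaml.safe_load(cfg_path.read_text())
cfg["val_dir"] = "/path/to/imagenet/val"
cfg_path.write_text(yaml.safe_dump(cfg))
```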
# Reproduce Results for other compression ratios
**Please refer to [README: Usage](./README_usage.md) for details, including three modules**.
## Table Results ([Google Drive](https://drive.google.com/drive/folders/1hw62Qi5N2Vuh1NLdXCAM3BXyNBzT0n1u?usp=drive_link))
| No. | Content | Datasets |
| :-------------------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------- | :------------ |
| Table 1                                                                                              | Dataset Analysis                                                                           | ImageNet-1K   |
| [Table 2](https://drive.google.com/drive/folders/17GEr8tbvKtNmNvxKZHEkB_KMr0LbK9o6?usp=drive_link)   | (a) SOTA Comparison, (b) Large Networks                                                    | Tiny-ImageNet |
| [Table 3](https://drive.google.com/drive/folders/1JGgDLB8vuNovTJ_fIN4peHvih4_aJ83c?usp=drive_link)   | SOTA Comparison                                                                            | ImageNet-1K   |
| [Table 4](https://drive.google.com/drive/folders/1Q37e8IXV30nISHRFTStfin33TZJ2yVFm?usp=drive_link)   | Ablation Study                                                                             | ImageNet-1K   |
| [Table 5](https://drive.google.com/drive/folders/1R34uovaGjB7vz-VrHcFhTuZs86VYcRJs?usp=drive_link)   | (a) Pruning Metrics, (b) Calibration                                                       | ImageNet-1K   |
| [Table 6](https://drive.google.com/drive/folders/1KsxslLvXK5enPAhNpBlIif7EDykH0yUX?usp=drive_link)   | (a) Large Pruning Ratio, (b) ResNet-50 Result, (c) Cross-Architecture Result               | ImageNet-1K   |
| [Table 7](https://drive.google.com/drive/folders/1Cc_hwZYCKN9inzLKO8DsKOdwBGaY3cHu?usp=drive_link)   | SOTA Comparison                                                                            | ImageNet-21K  |
| [Table 8](https://drive.google.com/drive/folders/14zo1eKf_s3d1bwJ3iB5lcSlzEX1LYl5Q?usp=drive_link)   | Adaptation to Optimization-free Method (i.e., [RDED](https://arxiv.org/abs/2312.03526))    | ImageNet-1K   |
| [Table 9](https://drive.google.com/drive/folders/1Ycnk5dqs0P7AgmY1_gBb1E1z2o6b-zGY?usp=drive_link)   | Comparison to [G-VBSM](https://arxiv.org/abs/2311.17950)                                   | ImageNet-1K   |
| **Appendix**                                                                                         |                                                                                            |               |
| Table 10-18                                                                                          | Configurations                                                                             | -             |
| [Table 19](https://drive.google.com/drive/folders/1O-3HLxKGOTDPyn-iXaEmQuBQwfWdzh6G?usp=drive_link)  | Detailed Ablation                                                                          | ImageNet-1K   |
| [Table 20](https://drive.google.com/drive/folders/15TDuuiScIsjiYnkxAeYuGwssViW-jNWP?usp=drive_link)  | Large IPCs (i.e., IPC300 and IPC400)                                                       | ImageNet-1K   |
| [Table 23](https://drive.google.com/drive/folders/1L-75-dVPBS2JD63DncMd3S3WlKqJgBV0?usp=drive_link)  | Comparison to [FKD](https://github.com/szq0214/FKD/blob/main/FKD)                          | ImageNet-1K   |

## Related Repos
Our code is mainly related to the following papers and repos:
- [Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective](https://github.com/VILA-Lab/SRe2L)
- [ImageNet-21K Pretraining for the Masses](https://github.com/Alibaba-MIIL/ImageNet21K)

## Citation
```
@inproceedings{xiao2024lpld,
title={Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?},
author={Lingao Xiao and Yang He},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024}
}
```