# Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images
*by Mehdi Cherti, Jenia Jitsev* [\[arXiv:2106.00116\]](https://arxiv.org/abs/2106.00116)

[Short version of the paper](http://www.cse.cuhk.edu.hk/~qdou/public/medneurips2021/21_effect_scale_transfer_final_camera_MedNeurIPS2021.pdf) accepted at [Medical Imaging Meets NeurIPS 2021 Workshop](https://sites.google.com/view/med-neurips-2021/)

Longer version of the paper accepted at [IEEE International Joint Conference on Neural Networks 2022](https://wcci2022.org/)

[![Open In Colab][colab-badge]][colab-notebook]

[colab-notebook]: https://colab.research.google.com/drive/1alh4O7fFHsqSYsiEkT6Bux8d0-P05kIo?usp=sharing
[colab-badge]: https://colab.research.google.com/assets/colab-badge.svg

## Introduction

In this repository, we provide the code for reproducing the experiments on large-scale pre-training and transfer learning for the paper *"Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images"* ([arXiv:2106.00116](https://arxiv.org/abs/2106.00116)).

We provide instructions on how to download the different datasets used in the paper.
We also provide the pre-trained models, along with instructions for fine-tuning a pre-trained model on any of the datasets considered in the paper, as well as on new datasets.

Organization
------------

```
├── LICENSE               <- MIT License
├── README.md             <- Main doc README on reproducing the experiments
├── requirements.txt      <- Requirements file for reproducing the experiments environment,
│                            e.g. generated with `pip freeze > requirements.txt`
├── setup.py              <- Makes the `transfer_learning` package installable so it can be imported
├── pretrain.py           <- Code for pre-training
├── finetune.py           <- Code for fine-tuning
├── transfer_learning     <- Source code
│   ├── dataloaders       <- Dataset loaders
│   ├── datasets          <- Datasets
│   ├── finetuning        <- Utilities used for the BiT-HyperRule
│   ├── lr_scheduler      <- Learning rate schedulers
│   ├── models            <- Neural network architecture definitions
│   └── optim             <- Optimizers
├── datasets              <- Folder where datasets are stored
└── pretrained_models     <- Folder where pre-trained models are stored
```
--------

## Installation

Steps to install the package:

- `pip install -r requirements.txt`
- `python setup.py develop`
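
As a quick sanity check that the installation worked, you can try importing the package from Python (assuming it installs under the name `transfer_learning`, as the repository layout suggests):

```python
# Sanity check: the package should be importable after `python setup.py develop`.
import transfer_learning

print(transfer_learning.__file__)  # should point into this repository
```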

## Obtaining Data

The folder `datasets` is used to store the datasets, with one dataset per subfolder.
Following are the instructions to download each dataset considered in the paper.

### Obtaining source datasets for pre-training

#### CheXpert v1.0

1. Fill the form on the CheXpert website, download the dataset, and extract the archive
2. Put the folder `CheXpert-v1.0` in `datasets`

#### MIMIC-CXR v2.0

1. Follow the download instructions on the MIMIC-CXR-JPG PhysioNet page (section **Files**) and extract the archive
2. Put the folder `mimic-cxr-jpg` in `datasets`

#### NIH Chest Xray-14

1. Download all the files from the NIH download page, and extract all the archives into `images/`
2. Create a folder `NIH-ChestXRay-14` inside `datasets`, and put all the files and the folder `images` in `NIH-ChestXRay-14`

#### PadChest

1. Fill the form on the BIMCV PadChest website, download the complete dataset, and extract the archive
2. Unzip all the zip files `0.zip`, `1.zip`, ..., `55.zip` inside `BIMCV-PadChest-FULL` (a scripted version of this step is sketched below)
3. Put the folder `BIMCV-PadChest-FULL` in `datasets` and rename it to `PadChest`
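
A minimal sketch for step 2, assuming you run it from the directory that contains `BIMCV-PadChest-FULL`:

```python
import zipfile
from pathlib import Path

# Unzip 0.zip ... 55.zip in place.
src = Path("BIMCV-PadChest-FULL")
for i in range(56):
    with zipfile.ZipFile(src / f"{i}.zip") as zf:
        zf.extractall(src)
```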

### Obtaining target datasets for transfer

#### Oxford Flowers-102

1. Download the dataset and extract the archive
2. Put the folder `oxford-102-flowers` in `datasets`

If you would like to re-create the dataset from the original version, follow these steps:

1. Download the original dataset and extract the archive
2. `cat valid.txt train.txt > mod_test.txt`
3. `cat test.txt > mod_train.txt`
4. `wget https://raw.githubusercontent.com/SLAMPAI/large-scale-pretraining-transfer/master/scripts/datasets/flowers_to_image_folder.py; python flowers_to_image_folder.py`; this will create a folder `mod_train` and a folder `mod_test`
5. Move the folder `oxford-102-flowers` to `datasets`

#### Oxford-IIIT Pets

1. Download the dataset and extract the archive
2. Put the folder `oxford-iiit-pet` in `datasets`

#### COVIDx

1. Download the dataset and extract the archive inside a new folder `COVIDx-CXR2`
2. Put the folder `COVIDx-CXR2` in `datasets`

#### Tuberculosis

1. Download the dataset and extract the archive inside a new folder `Tuberculosis_dataset`
2. Put the folder `Tuberculosis_dataset` in `datasets`
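
Once you have followed the steps above, `datasets` should contain one subfolder per dataset. A small sketch to confirm the layout, using the folder names from the instructions above:

```python
from pathlib import Path

# Folder names as used in the download instructions above.
expected = [
    "CheXpert-v1.0", "mimic-cxr-jpg", "NIH-ChestXRay-14", "PadChest",
    "oxford-102-flowers", "oxford-iiit-pet", "COVIDx-CXR2", "Tuberculosis_dataset",
]

root = Path("datasets")
for name in expected:
    print(f"{name:25s} {'ok' if (root / name).is_dir() else 'MISSING'}")
```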

## How to run

### Pre-training experiments

Note that you need Horovod to execute the pre-training experiments.
Check the Horovod documentation to see how to run it on your setup.

For instance, here is how you can run pre-training with Horovod, using 4 GPUs:

`horovodrun -np 4 python pretrain.py --config-file configs/chexpert_mimic_nih_padchest_bit50x1.yaml`

This will run pre-training for a ResNet-50x1 BiT model, on the concatenation of CheXpert, MIMIC-CXR, NIH Chest-Xray and PadChest.
You can check the other config files in `configs/` for other pre-training experiments, and run them in the same manner.
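
Under the hood, `pretrain.py` runs as a Horovod program, with one process per GPU. The following is a minimal sketch of the general Horovod + PyTorch pattern that `horovodrun` relies on (not the repository's actual training code; the model is a placeholder, not BiT):

```python
import torch
import horovod.torch as hvd

hvd.init()                               # one process per GPU under horovodrun
torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = torch.nn.Linear(10, 2).cuda()    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Average gradients across workers, and start all workers from identical weights.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```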

### Pre-trained models

We provide models with pre-trained weights for different network sizes (ResNet-50x1, ResNet-152x4), pre-trained on various source datasets of different types and sizes.
All models are available for download using the script described below.

Each model has its own folder, named following the template `<source_datasets>_<model>`,
e.g., `chexpert_mimic_nih_padchest_bit152x4` is a ResNet-152x4 pre-trained on
the concatenation of CheXpert, MIMIC-CXR, NIH Chest-Xray and PadChest.

You can use the script `scripts/download_model.sh` to download a pre-trained model,
by providing its name.
For instance, to download `chexpert_mimic_nih_padchest_bit152x4`, you can use:

`bash scripts/download_model.sh chexpert_mimic_nih_padchest_bit152x4`
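
If you want to inspect a downloaded checkpoint from Python, here is a hedged sketch assuming a standard PyTorch checkpoint file (the actual file name and format are whatever `scripts/download_model.sh` fetches):

```python
import torch

# Hypothetical path: the real file name depends on what download_model.sh fetched.
ckpt = torch.load(
    "pretrained_models/chexpert_mimic_nih_padchest_bit152x4/model.pth",
    map_location="cpu",
)
print(type(ckpt))
```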

### Fine-tuning transfer experiments

#### CIFAR-10 example

`python finetune.py --pretrain-config-file configs/imagenet1k_bit50x1.yaml --finetune-config-file configs/finetune/cifar10.yaml --logdir cifar10_finetuning`

This will fine-tune an R50x1 (pre-trained on ImageNet-1k) on CIFAR-10.
The file `configs/finetune/cifar10.yaml` contains the hyper-parameters used in fine-tuning.

Inside the log directory `cifar10_finetuning`, you will find a log file and a tensorboard file
that you can use to visualize the learning curve with different metrics.
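
For instance, you can point TensorBoard at the log directory with:

`tensorboard --logdir cifar10_finetuning`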

#### Tuberculosis example

`python finetune.py --pretrain-config-file configs/chexpert_mimic_nih_padchest_bit50x1.yaml --finetune-config-file configs/finetune/tuberculosis_full.yaml --logdir tuberculosis_finetuning`

This will fine-tune an R50x1 (pre-trained on the concatenation of CheXpert, MIMIC-CXR, NIH Chest-Xray and PadChest) on the Tuberculosis dataset.
The file `configs/finetune/tuberculosis_full.yaml` contains the hyper-parameters used in fine-tuning.

Inside the log directory `tuberculosis_finetuning`, you will find a log file and a tensorboard file
that you can use to visualize the learning curve with different metrics.

You can also find a fine-tuning example with Tuberculosis in the [Colab Notebook](https://colab.research.google.com/drive/1alh4O7fFHsqSYsiEkT6Bux8d0-P05kIo?usp=sharing).

#### New dataset?

You can also fine-tune one of the pre-trained models on a new dataset.
You might need a different data loader depending on your dataset's structure.
The easiest option is an image folder layout compatible with [TorchVision's ImageFolder](https://pytorch.org/vision/stable/datasets.html#imagefolder), where
each subfolder of the image folder contains the images belonging to one class.

Following are the steps to fine-tune on a dataset with an image folder structure.

1. `cp configs/finetune/template_image_folder.yaml configs/finetune/your_new_dataset.yaml`
2. Set `train_dir` to the training directory
3. Set `val_dir` to the validation or test directory
4. Set `nb_classes` to the number of classes
5. Train, using for instance `python finetune.py --pretrain-config-file configs/chexpert_mimic_nih_padchest_bit50x1.yaml --finetune-config-file configs/finetune/your_new_dataset.yaml --logdir your_new_dataset_finetuning`
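
Before training, you can verify that your folder layout is what `ImageFolder` expects; a small sketch, where the path is whatever you set as `train_dir`:

```python
from torchvision.datasets import ImageFolder

# Point this at the directory you set as `train_dir` in the config.
ds = ImageFolder("datasets/your_new_dataset/train")
print(f"{len(ds)} images across {len(ds.classes)} classes: {ds.classes}")
```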

## Plot results

We provide all the results as a set of CSV files in the folder `results`.
You can use the notebook in `notebooks/plots.ipynb` to regenerate the figures from the paper.
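
If you just want to inspect the raw numbers without running the notebook, here is a minimal sketch for browsing the CSV files (no column names are assumed; it prints whatever each file contains):

```python
import glob

import pandas as pd

# Print the shape and first few columns of every results file.
for path in sorted(glob.glob("results/*.csv")):
    df = pd.read_csv(path)
    print(path, df.shape, list(df.columns)[:8])
```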

## Citation

If you find this work helpful, please cite our paper:
```
@article{cherti2021effect,
title={Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images},
author={Cherti, Mehdi and Jitsev, Jenia},
journal={arXiv preprint arXiv:2106.00116},
year={2021}
}
```

## Acknowledgements

- Thanks to the BiT authors. We used the BiT architecture, training procedures and BiT-HyperRule from their code.
- Thanks to the TorchXrayVision authors. We used their dataset classes for the medical data (CheXpert, MIMIC-CXR, NIH Chest-Xray, PadChest).
- Thanks to the Horovod authors; Horovod was used for distributed training. The skeleton of the pre-training code is based on the official Horovod examples.
- Thanks to the authors of the project template that inspired the structure of this code.
- Thanks to the creators and maintainers of the openly available X-ray medical imaging datasets (CheXpert, MIMIC-CXR, NIH Chest-Xray, PadChest, COVIDx, Tuberculosis) that enabled our research.
- The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputers JUWELS, JUWELS Booster at Jülich Supercomputing Centre (JSC). We also acknowledge computing resources from the Helmholtz Data Federation and further computing time provided on supercomputer JUSUF in frame of offer for epidemiology research on COVID-19 by JSC.