{"id":13737668,"url":"https://github.com/lxtGH/CAE","last_synced_at":"2025-05-08T15:30:52.499Z","repository":{"id":37471971,"uuid":"499748952","full_name":"lxtGH/CAE","owner":"lxtGH","description":"This is a PyTorch implementation of “Context AutoEncoder for Self-Supervised Representation Learning\"","archived":false,"fork":false,"pushed_at":"2023-01-11T16:59:23.000Z","size":375,"stargazers_count":193,"open_issues_count":8,"forks_count":22,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-11-14T01:25:47.139Z","etag":null,"topics":["context-autoencoder","masked-image-modeling","self-supervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lxtGH.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-04T06:50:42.000Z","updated_at":"2024-11-10T07:32:14.000Z","dependencies_parsed_at":"2023-02-09T04:00:57.418Z","dependency_job_id":null,"html_url":"https://github.com/lxtGH/CAE","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lxtGH%2FCAE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lxtGH%2FCAE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lxtGH%2FCAE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lxtGH%2FCAE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lxtGH","download_url":"https://codeload.github.com/lxtGH/CAE/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","
kind":"github","repositories_count":224742168,"owners_count":17362229,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["context-autoencoder","masked-image-modeling","self-supervised-learning"],"created_at":"2024-08-03T03:01:56.760Z","updated_at":"2024-11-15T06:30:46.869Z","avatar_url":"https://github.com/lxtGH.png","language":"Python","funding_links":[],"categories":["Python","Fundamental MIM Methods"],"sub_categories":["MIM for Transformers"],"readme":"# CAE: Context AutoEncoder for Self-Supervised Representation Learning \n\n\u003cp align=\"center\"\u003e\n  \u003cimg src='furnace/CAE.png'\u003e\n\u003c/p\u003e\n\nThis is a PyTorch implementation of [CAE: Context AutoEncoder for Self-Supervised Representation Learning](https://arxiv.org/abs/2202.03026).\n\n## Highlights\n\n- State-of-the-art MIM performance. Results in the paper are successfully reproduced.\n\n## Installation\n\nClone the repo and install required packages.\n```bash\npip install -r requirements.txt\n\n# install apex\ngit clone https://github.com/NVIDIA/apex\ncd apex\npip install -v --disable-pip-version-check --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./\n```\n\n## Data Preparation\nFirst, download ImageNet-1k from http://image-net.org/.\n\nThe directory structure is the standard layout of torchvision's datasets.ImageFolder. 
The training and validation data are expected to be in the train/ and val/ folders, respectively:\n\n```\n/path/to/imagenet/\n  train/\n    class1/\n      img1.jpeg\n    class2/\n      img2.jpeg\n  val/\n    class1/\n      img3.jpeg\n    class2/\n      img4.jpeg\n```\n\nSecond, download the pretrained tokenizer.\n\n```bash\nTOKENIZER_PATH=/path/to/save/dall_e_tokenizer_weight\nmkdir -p $TOKENIZER_PATH\n# -O sets the output file; lowercase -o would only write wget's log there\nwget -O $TOKENIZER_PATH/encoder.pkl https://cdn.openai.com/dall-e/encoder.pkl\nwget -O $TOKENIZER_PATH/decoder.pkl https://cdn.openai.com/dall-e/decoder.pkl\n```\n\n\n## Pretraining\n\nHere is an example that pretrains CAE-base on ImageNet-1K with 32 GPUs. Please see [scripts/cae_base_800e.sh](scripts/cae_base_800e.sh) for the complete script.\n```bash\nOMP_NUM_THREADS=1 $PYTHON -m torch.distributed.launch \\\n  --nproc_per_node=8 \\\n  tools/run_pretraining.py \\\n  --data_path ${DATA_PATH} \\\n  --output_dir ${OUTPUT_DIR} \\\n  --model cae_base_patch16_224_8k_vocab --discrete_vae_weight_path ${TOKENIZER_PATH} \\\n  --batch_size 64 --lr 1.5e-3 --warmup_epochs 20 --epochs 800 \\\n  --clip_grad 3.0 --layer_scale_init_value 0.1 \\\n  --imagenet_default_mean_and_std \\\n  --color_jitter 0 \\\n  --drop_path 0.1 \\\n  --sincos_pos_emb \\\n  --mask_generator block \\\n  --num_mask_patches 98 \\\n  --decoder_layer_scale_init_value 0.1 \\\n  --no_auto_resume \\\n  --save_ckpt_freq 100 \\\n  --exp_name $my_name \\\n  --regressor_depth 4 \\\n  --decoder_depth 4 \\\n  --align_loss_weight 2\n```\n- `--num_mask_patches`: number of input patches to mask. \n- `--batch_size`: batch size per GPU.\n- Effective batch size = `number of GPUs` * `--batch_size`. So in the above example, the effective batch size is `64*32 = 2048`.\n- `--lr`: learning rate.\n- `--warmup_epochs`: learning rate warmup epochs. 
Use 10, 20, or 40 warmup epochs for 300, 800, or 1600 pretraining epochs, respectively.\n- `--epochs`: total pretraining epochs.\n- `--clip_grad`: maximum gradient norm for gradient clipping.\n- `--drop_path`: stochastic depth rate.\n- `--imagenet_default_mean_and_std`: enable this for ImageNet-1k pretraining, i.e., `(0.485, 0.456, 0.406)` for mean and `(0.229, 0.224, 0.225)` for std. For other pretraining data, use `(0.5, 0.5, 0.5)` for mean and `(0.5, 0.5, 0.5)` for std by default.\n- `--layer_scale_init_value`: 0.1 for base, 1e-5 for large; set to 0 to disable LayerScale. We set `--decoder_layer_scale_init_value` to the same value.\n- `--sincos_pos_emb`: adopt sin-cos positional embedding during pretraining.\n- `--regressor_depth`: depth (number of layers) of the regressor.\n- `--decoder_depth`: depth (number of layers) of the decoder.\n- `--align_loss_weight`: weight for alignment loss. 2 by default.\n\nWarmup epochs for 300/800/1600 epochs pretraining are 10/20/40.\n\nFor CAE-large, please refer to [scripts/cae_large_1600e.sh](scripts/cae_large_1600e.sh). \n\n\n## Results\nResults for CAE-base and CAE-large are provided for the following evaluation tasks:\n- Linear probing\n- Attentive probing\n- Fine-tuning\n- Semantic segmentation\n- Object detection and instance segmentation\n\nPretrained weights and logs are available ([Google Drive](https://drive.google.com/drive/folders/1wwhg7nj2GQuU9uthVuQLkEEXEjx90G7g?usp=sharing), [Baidu Cloud [Code: 4kil]](https://pan.baidu.com/s/15eZGoI72iLupLrOHqmOM9w)). 
*: from CAE paper.\n\n| Model      | Pretraining data | #Epoch | Linear | Attentive | Fine-tuning | ADE Seg | COCO Det | COCO InstSeg |\n| ---------- | ---------------- | ------ | ------ | --------- | ----------- | ------- | -------- | ------------ |\n| MAE-base*  | ImageNet-1K      | 1600   | 67.8   | 74.2      | 83.6        | 48.1    | 48.4     | 42.6         |\n| MAE-large* | ImageNet-1K      | 1600   | 76.0   | 78.8      | 86.0        | 53.6    | 54.0     | 47.1         |\n| CAE-base   | ImageNet-1K      | 300    | 64.5   | 74.0      | 83.6        | 48.1    | 48.3     | 42.7         |\n| CAE-base   | ImageNet-1K      | 800    | 68.9   | 75.9      | 83.8        | 49.7    | 49.9     | 43.9         |\n| CAE-base   | ImageNet-1K      | 1600   | 70.3   | 77.2      | 83.9        | 50.3    | 50.3     | 44.2         |\n| CAE-large  | ImageNet-1K      | 1600   | 77.8   | 81.2      | 86.2        | 54.9    | 54.5     | 47.5         |\n\n\n### Linear Probing\n- Please refer to [scripts/cae_base_800e.sh](scripts/cae_base_800e.sh) (32 GPUs).  \n- For CAE-large, just replace `--model cae_base_patch16_224` with `--model cae_large_patch16_224`.\n\n### Attentive Probing\n\n- Please refer to [scripts/cae_base_800e.sh](scripts/cae_base_800e.sh) (32 GPUs). \n- For CAE-large, just replace `--model cae_base_patch16_224` with `--model cae_large_patch16_224`.\n\n### Fine-tuning\n- Please refer to [scripts/cae_base_finetune.sh](scripts/cae_base_finetune.sh) (32 GPUs). \n- For CAE-large, please refer to [scripts/cae_large_finetune.sh](scripts/cae_large_finetune.sh) (32 GPUs).\n\n### Segmentation \u0026 Detection\n- Please refer to [downstream_tasks](./downstream_tasks) dir to get started.\n\n## Acknowledgement\n\nThis repository is built using the [BEiT](https://github.com/microsoft/unilm/edit/master/beit) and [MMSelfSup](https://github.com/open-mmlab/mmselfsup), thanks for their open-source code! 
Thanks also to the CAE authors for their excellent work!\n\n## Citation\n```bibtex\n@article{ContextAutoencoder2022,\n  title={Context Autoencoder for Self-Supervised Representation Learning},\n  author={Chen, Xiaokang and Ding, Mingyu and Wang, Xiaodi and Xin, Ying and Mo, Shentong and Wang, Yunhao and Han, Shumin and Luo, Ping and Zeng, Gang and Wang, Jingdong},\n  journal={arXiv preprint arXiv:2202.03026},\n  year={2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FlxtGH%2FCAE","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FlxtGH%2FCAE","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FlxtGH%2FCAE/lists"}