{"id":19279777,"url":"https://github.com/showlab/sparseformer","last_synced_at":"2025-06-25T06:41:12.454Z","repository":{"id":174949390,"uuid":"622439596","full_name":"showlab/sparseformer","owner":"showlab","description":"(ICLR 2024, CVPR 2024) SparseFormer ","archived":false,"fork":false,"pushed_at":"2024-11-10T12:28:01.000Z","size":267,"stargazers_count":73,"open_issues_count":1,"forks_count":2,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-03-24T09:13:46.013Z","etag":null,"topics":["computer-vision","efficient-neural-networks","sparseformer","transformer","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/showlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-02T05:31:12.000Z","updated_at":"2025-03-01T15:48:01.000Z","dependencies_parsed_at":"2024-03-06T17:13:32.789Z","dependency_job_id":"6c97095a-ffa2-4a40-a8ef-73fe89002b86","html_url":"https://github.com/showlab/sparseformer","commit_stats":{"total_commits":41,"total_committers":1,"mean_commits":41.0,"dds":0.0,"last_synced_commit":"f4484c49ad87ee218c09aef5bd1b96c304365fbb"},"previous_names":["showlab/sparseformer"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2Fsparseformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2Fsparseformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2Fsparseformer/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2Fsparseformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/showlab","download_url":"https://codeload.github.com/showlab/sparseformer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246899643,"owners_count":20851894,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","efficient-neural-networks","sparseformer","transformer","vision-transformer"],"created_at":"2024-11-09T21:16:05.372Z","updated_at":"2025-04-03T03:13:24.358Z","avatar_url":"https://github.com/showlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🎆 SparseFormer\n\nThis is the official repo for SparseFormer research:\n\n\u003e [**SparseFormer: Sparse Visual Recognition via Limited Latent Tokens**](https://arxiv.org/abs/2304.03768) **(ICLR 2024)**\u003cbr\u003e\n\u003e Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou\u003cbr\u003e\n\n\u003e [**Bootstrapping SparseFormers from Vision Foundation Models**](https://arxiv.org/abs/2312.01987) **(CVPR 2024)**\u003cbr\u003e\n\u003e Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou\u003cbr\u003e\n\n\u003c!-- ## TL;DR\nSparseFormer is a ViT that uses **fewer tokens and less compute**, and it can also handle **any aspect ratio and resolution**. --\u003e\n\n## Out-of-box SparseFormer as a Library (recommended)\nSparseFormer works out of the box once the sparseformer library is installed. 
\n\n__Getting started__. You can install sparseformer as a library with the following command:\n```shell\npip install -e sparseformer # in this folder\n```\n\nAvailable pre-trained model weights are listed [here](./sparseformer/sparseformer/factory.py#L11), including both v1 and bootstrapped weights. You can simply use [`create_model`](./sparseformer/sparseformer/factory.py#L37) with the argument `download=True` to get pre-trained models. You can play with it like this!\n```python\nfrom sparseformer.factory import create_model\n\n# e.g., make a SparseFormer v1 tiny model\nmodel = create_model(\"sparseformer_v1_tiny\", download=True)\n\n\n# or make a CLIP SparseFormer large model and put it in the OpenCLIP pipeline\nimport open_clip\n# create_model_and_transforms returns (model, preprocess_train, preprocess_val)\nclip, _, preprocess = open_clip.create_model_and_transforms(\"ViT-L-14\", pretrained=\"openai\")\nvisual = create_model(\"sparseformer_btsp_openai_clip_large\", download=True)\nclip.visual = visual\n# ...\n\n```\n\n__Video SparseFormers__. We also provide a unified [`MediaSparseFormer`](./sparseformer/sparseformer/media.py#L103) implementation for both video and image inputs (an image is treated as a single-frame video) with the token inflation argument `replicates`. MediaSparseFormer can load pre-trained weights of the image `SparseFormer` via [`load_2d_state_dict`](./sparseformer/sparseformer/media.py#L147).\n\nNote: Pre-trained weights for VideoSparseFormers are currently unavailable. We might reproduce VideoSparseFormers if there is strong demand from the community.\n\n__ADVANCED: Make your own SparseFormer and load timm weights__. \nOur codebase is generally compatible with [timm vision transformer](https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/vision_transformer.py) weights. 
So here comes something to play with: you can make your own SparseFormer and load timm transformer weights, without being limited to our provided configurations!\n\nFor example, you can make a SparseFormer similar to ViT-224/16 with sampling \u0026 decoding and RoI adjusting every 3 blocks, and load the official OpenAI CLIP pre-trained weights into it:\n```python\nfrom sparseformer.modeling import SparseFormer, OP\nfrom sparseformer.config import base_btsp_config\n\nops_list = []\nnum_layers = 12\nfor i in range(num_layers):\n    if i % 3 == 0:\n        ops_list.append([OP.SAMPLING_M, OP.ATTN, OP.MLP, OP.ROI_ADJ, OP.PE_INJECT,])\n    else:\n        ops_list.append([OP.ATTN, OP.MLP])\n\nconfig = base_btsp_config()\nconfig.update(\n    num_latent_tokens=16,\n    num_sampling_points=9,\n    width_configs=[768, ]*num_layers,\n    repeats=[1, ]*num_layers,\n    ops_list=ops_list,\n)\n\nmodel = SparseFormer(**config)\n\nimport timm\npretrained = timm.create_model(\"vit_base_patch16_clip_224.openai\", pretrained=True)\nnew_dict = dict()\nold_dict = pretrained.state_dict()\nfor k in old_dict:\n    nk = k\n    if \"blocks\" in k:\n        nk = nk.replace(\"blocks\", \"layers\")\n    new_dict[nk] = old_dict[k]\nprint(model.load_state_dict(new_dict, strict=False))\n```\nAll attention and MLP layer weights should load successfully. The resulting SparseFormer needs to be fine-tuned to output meaningful results, since the sampling \u0026 decoding and RoI adjusting parts are newly initialized. Maybe you can fine-tune it to be a CLIP-based open-vocabulary detector (we have not tried this yet, but it looks very promising imo! 
:D).\n\n\n\n## Training (SparseFormer v1)\nTo train SparseFormer v1 on ImageNet ([**SparseFormer: Sparse Visual Recognition via Limited Latent Tokens**](https://arxiv.org/abs/2304.03768)), please check [imagenet](./imagenet/).\n\n**Note:** the [imagenet](./imagenet/) sub-codebase will be refactored soon.\n\n\n## Citation\nIf you find SparseFormer useful in your research or work, please consider citing us using the following entries:\n```\n@inproceedings{gao2024sparseformer,\n  author       = {Ziteng Gao and\n                  Zhan Tong and\n                  Limin Wang and\n                  Mike Zheng Shou},\n  title        = {SparseFormer: Sparse Visual Recognition via Limited Latent Tokens},\n  booktitle    = {{ICLR}},\n  publisher    = {OpenReview.net},\n  year         = {2024}\n}\n\n@inproceedings{gao2024bootstrapping,\n  author       = {Ziteng Gao and\n                  Zhan Tong and\n                  Kevin Qinghong Lin and\n                  Joya Chen and\n                  Mike Zheng Shou},\n  title        = {Bootstrapping SparseFormers from Vision Foundation Models},\n  booktitle    = {{CVPR}},\n  pages        = {17710--17721},\n  publisher    = {{IEEE}},\n  year         = {2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshowlab%2Fsparseformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshowlab%2Fsparseformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshowlab%2Fsparseformer/lists"}