# Transformer-in-Vision
Recent Transformer-based CV and related works. Welcome to comment/contribute!

The transformer is now a basic component, adopted in nearly all AI models. This list is updated irregularly.

New Hope: [LLM-in-Vision](https://github.com/DirtyHarryLYL/LLM-in-Vision)

## Resource

- **ChatGPT** for **Robotics**: Design Principles and Model Abilities, [[Paper]](https://www.microsoft.com/en-us/research/uploads/prod/2023/02/ChatGPT___Robotics.pdf), [[Code]](https://github.com/microsoft/PromptCraft-Robotics)

- DIFFUSIONDB [[Page]](https://poloclub.github.io/diffusiondb), [[Paper]](https://arxiv.org/pdf/2210.14896.pdf)

- LAION-5B [[Page]](https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/), [[Paper]](https://arxiv.org/pdf/2210.08402.pdf)

- LAVIS [[Page]](https://github.com/salesforce/LAVIS), [[Paper]](https://arxiv.org/pdf/2209.09019.pdf)

- Imagen Video [[Page]](https://imagen.research.google/video/), [[Paper]](https://imagen.research.google/video/paper.pdf)

- Phenaki [[Page]](https://phenaki.video/), [[Paper]](https://openreview.net/pdf?id=vOEXS39nOF)

- DREAMFUSION [[Page]](https://dreamfusion3d.github.io/), [[Paper]](https://arxiv.org/pdf/2209.14988.pdf)

- MAKE-A-VIDEO [[Page]](https://make-a-video.github.io/), [[Paper]](https://arxiv.org/pdf/2209.14792.pdf)

- Stable Diffusion [[Page]](https://ommer-lab.com/research/latent-diffusion-models/), [[Paper]](https://arxiv.org/pdf/2112.10752.pdf)

- NUWA-Infinity [[Page]](https://nuwa-infinity.microsoft.com/#/), [[Paper]](https://arxiv.org/pdf/2207.09814.pdf)

- Parti [[Page]](https://parti.research.google/), [[Code]](https://github.com/google-research/parti)

- Imagen [[Page]](https://imagen.research.google/), [[Paper]](https://arxiv.org/pdf/2205.11487.pdf)

- Gato: A Generalist Agent, [[Paper]](https://storage.googleapis.com/deepmind-media/A%20Generalist%20Agent/Generalist%20Agent.pdf)

- PaLM: Scaling Language Modeling with Pathways, [[Paper]](https://arxiv.org/pdf/2204.02311.pdf)

- DALL·E 2 [[Page]](https://openai.com/dall-e-2/), [[Paper]](https://cdn.openai.com/papers/dall-e-2.pdf)

- SCENIC: A JAX Library for Computer Vision Research and Beyond, [[Code]](https://github.com/google-research/scenic)

- V-L joint learning study (with good tables): [[METER]](https://arxiv.org/pdf/2111.02387.pdf), [[Kaleido-BERT]](https://arxiv.org/pdf/2103.16110.pdf)

- Attention is all you need, [[Paper]](https://arxiv.org/pdf/1706.03762.pdf) (a minimal attention sketch is given at the end of this section)

- CLIP [[Page]](https://openai.com/blog/clip/), [[Paper]](https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf), [[Code]](https://github.com/openai/CLIP), [[arXiv]](https://arxiv.org/pdf/2103.00020.pdf) (a minimal usage sketch is given at the end of this section)

- DALL·E [[Page]](https://openai.com/blog/dall-e/), [[Code]](https://github.com/openai/DALL-E), [[Paper]](https://arxiv.org/pdf/2102.12092.pdf)

- [huggingface/transformers](https://github.com/huggingface/transformers)

- [Kyubyong/transformer](https://github.com/Kyubyong/transformer), TF

- [jadore801120/attention-is-all-you-need-pytorch](https://github.com/jadore801120/attention-is-all-you-need-pytorch), Torch

- [krasserm/fairseq-image-captioning](https://github.com/krasserm/fairseq-image-captioning)

- [PyTorch Transformers Tutorials](https://github.com/abhimishra91/transformers-tutorials)

- [ictnlp/awesome-transformer](https://github.com/ictnlp/awesome-transformer)

- [basicv8vc/awesome-transformer](https://github.com/basicv8vc/awesome-transformer)

- [dk-liang/Awesome-Visual-Transformer](https://github.com/dk-liang/Awesome-Visual-Transformer)

- [yuewang-cuhk/awesome-vision-language-pretraining-papers](https://github.com/yuewang-cuhk/awesome-vision-language-pretraining-papers)
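
For quick reference, below is a minimal sketch of the scaled dot-product attention from "Attention is all you need" (plain PyTorch; the tensor shapes in the toy example are illustrative assumptions, not taken from any particular paper):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5           # (..., L_q, L_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                      # attention weights
    return weights @ v                                       # (..., L_q, d_v)

# Toy example: batch of 2, 4 query tokens attending over 6 key/value tokens, dim 8.
q, k, v = torch.randn(2, 4, 8), torch.randn(2, 6, 8), torch.randn(2, 6, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 4, 8])
```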
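
And a minimal zero-shot classification sketch with the openai/CLIP package linked above (install it from that repository; the checkpoint name, image path, and label set here are illustrative assumptions, see the repository README for the authoritative usage):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # downloads weights on first use

# Encode one image and a few candidate captions, then compare them in the shared embedding space.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # per-caption probabilities for the image, e.g. [[0.92 0.07 0.01]]
```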

## Survey

- (arXiv 2023.2) TRANSFORMER-BASED **SENSOR FUSION** FOR **AUTONOMOUS DRIVING**: A SURVEY, [[Paper]](https://arxiv.org/pdf/2302.11481.pdf), [[Page]](https://github.com/ApoorvRoboticist/Transformers-Sensor-Fusion)

- (arXiv 2023.2) Deep Learning for **Video-Text Retrieval**: a Review, [[Paper]](https://arxiv.org/pdf/2302.12552.pdf)

- (arXiv 2023.2) Large-scale **Multi-Modal Pre-trained Models**: A Comprehensive Survey, [[Paper]](https://arxiv.org/pdf/2302.10035.pdf)

- (arXiv 2023.2) Transformer-based **Generative Adversarial Networks** in Computer Vision: A Comprehensive Survey, [[Paper]](https://arxiv.org/pdf/2302.08641.pdf)

- (arXiv 2023.2) **Knowledge Distillation** in Vision Transformers: A Critical Review, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2302/2302.02108.pdf)

- (arXiv 2023.2) A Survey on **Efficient Training** of Transformers, [[Paper]](https://arxiv.org/pdf/2302.01107.pdf)

- (arXiv 2023.1) ChatGPT is not all you need. A State of the Art Review of **large Generative AI models**, [[Paper]](https://arxiv.org/pdf/2301.04655.pdf)

- (arXiv 2022.12) Transformers in **Action Recognition**: A Review on Temporal Modeling, [[Paper]](https://arxiv.org/pdf/2302.01921.pdf)

- (arXiv 2022.11) Vision Transformers in **Medical Imaging**: A Review, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.10043.pdf)

- (arXiv 2022.11) A survey on **knowledge**-enhanced **multimodal** learning, [[Paper]](https://arxiv.org/pdf/2211.12328.pdf)

- (arXiv 2022.10) Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, [[Paper]](https://arxiv.org/pdf/2210.09263.pdf)

- (arXiv 2022.10) A Survey on Graph Neural Networks and **Graph** Transformers in Computer Vision: A Task-Oriented Perspective, [[Paper]](https://arxiv.org/pdf/2209.13232.pdf)

- (arXiv 2022.09) VISION TRANSFORMERS FOR **ACTION RECOGNITION**: A SURVEY, [[Paper]](https://arxiv.org/pdf/2209.05700.pdf)

- (arXiv 2022.09) Transformers in **Remote Sensing**: A Survey, [[Paper]](https://arxiv.org/pdf/2209.01206.pdf), [[Code]](https://github.com/VIROBO-15/Transformer-in-Remote-Sensing)

- (arXiv 2022.08) **3D Vision** with Transformers: A Survey, [[Paper]](https://arxiv.org/pdf/2208.04309.pdf), [[Code]](https://github.com/lahoud/3d-vision-transformers)

- (arXiv 2022.08) A Survey on **Masked Autoencoder** for Self-supervised Learning in Vision and Beyond, [[Paper]](https://arxiv.org/pdf/2208.00173.pdf)

- (arXiv 2022.07) **Vision** Transformers: State of the Art and Research Challenges, [[Paper]](https://arxiv.org/pdf/2207.03041.pdf)

- (arXiv 2022.07) **SELF-SUPERVISED** LEARNING FOR **VIDEOS**: A SURVEY, [[Paper]](https://arxiv.org/pdf/2207.00419.pdf)

- (arXiv 2022.06) **Multimodal** Learning with Transformers: A Survey, [[Paper]](https://arxiv.org/pdf/2206.06488.pdf)

- (arXiv 2022.05) Vision Transformer: **ViT** and its **Derivatives**, [[Paper]](https://arxiv.org/pdf/2205.11239.pdf)

- (arXiv 2022.05) Transformers in 3D **Point Clouds**: A Survey, [[Paper]](https://arxiv.org/pdf/2205.07417.pdf)

- (arXiv 2022.04) **Visual Attention** Methods in Deep Learning: An In-Depth Survey, [[Paper]](https://arxiv.org/pdf/2204.07756.pdf)

- (arXiv 2022.04) **Vision-and-Language** Pretrained Models: A Survey, [[Paper]](https://arxiv.org/pdf/2204.07356.pdf)

- (arXiv 2022.03) A Roadmap for **Big Model**, [[Paper]](https://arxiv.org/pdf/2203.14101.pdf)

- (arXiv 2022.03) Transformers Meet **Visual** Learning Understanding: A Comprehensive Review, [[Paper]](https://arxiv.org/pdf/2203.12944.pdf)

- (arXiv 2022.03) Recent Advances in **Vision** Transformer: A Survey and Outlook of Recent Work, [[Paper]](https://arxiv.org/pdf/2203.01536.pdf), [[Project]](https://github.com/khawar512/ViT-Survey)

- (arXiv 2022.02) A Survey of **Vision-Language** Pre-Trained Models, [[Paper]](https://arxiv.org/pdf/2202.10936.pdf)

- (arXiv 2022.02) VLP: A Survey on **Vision-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2202.09061.pdf)

- (arXiv 2022.02) Transformer for **Graphs**: An Overview from Architecture Perspective, [[Paper]](https://arxiv.org/pdf/2202.08455.pdf)

- (arXiv 2022.01) **Video** Transformers: A Survey, [[Paper]](https://arxiv.org/pdf/2201.05991.pdf)

- (arXiv 2021.11) ARE WE READY FOR A NEW PARADIGM SHIFT? A SURVEY ON VISUAL DEEP **MLP**, [[Paper]](https://arxiv.org/pdf/2111.04060.pdf)

- (arXiv 2021.11) A Survey of **Visual** Transformers, [[Paper]](https://arxiv.org/pdf/2111.06091.pdf)

- (arXiv 2021.09) Survey: Transformer based **Video-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2109.09920.pdf)

- (arXiv 2021.06) A Survey of **Transformers**, [[Paper]](https://arxiv.org/pdf/2106.04554.pdf)

- (arXiv 2021.06) **Attention** mechanisms and deep learning for machine vision: A survey of the state of the art, [[Paper]](https://arxiv.org/pdf/2106.07550.pdf)

- (arXiv 2021.06) **Pre-Trained Models**: Past, Present and Future, [[Paper]](https://arxiv.org/pdf/2106.07139.pdf)

- (arXiv 2021.05) Can Attention Enable **MLPs** To Catch Up With CNNs? [[Paper]](https://arxiv.org/pdf/2105.15078.pdf)

- (arXiv 2021.03) A Practical Survey on **Faster** and **Lighter** Transformers, [[Paper]](https://arxiv.org/pdf/2103.14636.pdf)

- (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with **Language and Vision**, [[Paper]](https://arxiv.org/pdf/2103.04037.pdf)

- (arXiv 2021.01) A Survey on **Visual** Transformer, [[Paper]](https://arxiv.org/pdf/2012.12556.pdf)

- (arXiv 2020.09) **Efficient** Transformers: A Survey, [[Paper]](https://arxiv.org/pdf/2009.06732.pdf)

- (arXiv 2021.01) **Transformers in Vision**: A Survey, [[Paper]](https://arxiv.org/pdf/2101.01169.pdf)

## Recent Papers

### 2023.8

- (arXiv 2023.8) VL-PET: Vision-and-Language Parameter-**Efficient Tuning** via Granularity Control, [[Paper]](https://arxiv.org/pdf/2308.09804), [[Project]](https://henryhzy.github.io/VL-PET/)

### 2023.5

- (arXiv 2023.5) Understanding Gaussian **Attention** Bias of Vision Transformers Using Effective Receptive Fields, [[Paper]](https://arxiv.org/pdf/2305.04722.pdf)

### 2023.3

- (arXiv 2023.3) Query-Dependent **Video** Representation for **Moment Retrieval** and **Highlight Detection**, [[Paper]](https://arxiv.org/pdf/2303.13874.pdf), [[Code]](https://github.com/wjun0830/QD-DETR)

### 2023.2

- (arXiv 2023.2) **Open-domain Visual Entity Recognition**: Towards Recognizing Millions of Wikipedia Entities, [[Paper]](https://arxiv.org/pdf/2302.11154.pdf)

- (arXiv 2023.2) KS-DETR: Knowledge Sharing in Attention Learning for **Detection** Transformer, [[Paper]](https://arxiv.org/pdf/2302.11208.pdf), [[Code]](https://github.com/edocanonymous/KS-DETR)

- (arXiv 2023.2) HUMAN MOTIONFORMER: **TRANSFERRING** HUMAN **MOTIONS** WITH VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2302.11306.pdf), [[Code]](https://github.com/KumapowerLIU/Human-MotionFormer)

- (arXiv 2023.2) Aligning **Text-to-Image** Models using **Human Feedback**, [[Paper]](https://arxiv.org/pdf/2302.12192.pdf)

- (arXiv 2023.2) Controlled and Conditional **Text to Image** Generation with Diffusion Prior, [[Paper]](https://arxiv.org/pdf/2302.11710.pdf)

- (arXiv 2023.2) Can Pre-trained Vision and Language Models Answer **Visual Information-Seeking Questions**? [[Paper]](https://arxiv.org/pdf/2302.11713.pdf), [[Code]](https://open-vison-language.github.io/infoseek)

- (arXiv 2023.2) OBJECT-CENTRIC **VIDEO PREDICTION** VIA DECOUPLING OF OBJECT DYNAMICS AND INTERACTIONS, [[Paper]](https://arxiv.org/pdf/2302.11850.pdf), [[Project]](https://sites.google.com/view/ocvp-vp)

- (arXiv 2023.2) Distribution Normalization: An “Effortless” **Test-Time Augmentation** for Contrastively Learned **Visual-language** Models, [[Paper]](https://arxiv.org/pdf/2302.11084.pdf), [[Code]](https://github.com/fengyuli2002/distribution-normalization)

- (arXiv 2023.2) Teaching **CLIP** to **Count** to Ten, [[Paper]](https://arxiv.org/pdf/2302.12066.pdf), [[Project]](https://teaching-clip-to-count.github.io/)

- (arXiv 2023.2) Designing an Encoder for Fast Personalization of **Text-to-Image** Models, [[Paper]](https://arxiv.org/pdf/2302.12228.pdf), [[Project]](https://tuning-encoder.github.io/)

- (arXiv 2023.2) Side Adapter Network for **Open-Vocabulary Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2302.12242.pdf), [[Code]](https://github.com/MendelXu/SAN)

- (arXiv 2023.2) Learning Visual Representations via **Language-Guided Sampling**, [[Paper]](https://arxiv.org/pdf/2302.12248.pdf)

- (arXiv 2023.2) VoxFormer: Sparse Voxel Transformer for Camera-based **3D Semantic Scene Completion**, [[Paper]](https://arxiv.org/pdf/2302.12251.pdf), [[Code]](https://github.com/NVlabs/VoxFormer)

- (arXiv 2023.2) Language-Driven Representation Learning for **Robotics**, [[Paper]](https://arxiv.org/pdf/2302.12766.pdf), [[Project]](https://sites.google.com/view/voltron-robotics)

- (arXiv 2023.2) A Convolutional Vision Transformer for **Semantic Segmentation** of Side-Scan **Sonar** Data, [[Paper]](https://arxiv.org/pdf/2302.12416.pdf), [[Code]](https://github.com/hayatrajani/s3seg-vit)

- (arXiv 2023.2) **Lightweight** Real-time Semantic **Segmentation** Network with Efficient Transformer and CNN, [[Paper]](https://arxiv.org/pdf/2302.10484.pdf), [[Code]](https://github.com/IVIPLab/LETNet)

- (arXiv 2023.2) VIEWCO: DISCOVERING **TEXT-SUPERVISED** **SEGMENTATION** MASKS VIA MULTI-VIEW SEMANTIC CONSISTENCY, [[Paper]](https://arxiv.org/pdf/2302.10307.pdf), [[Code]](https://github.com/pzhren/ViewCo)

- (arXiv 2023.2) CertViT: Certified **Robustness** of Pre-Trained Vision Transformers, [[Paper]](https://arxiv.org/pdf/2302.10287.pdf), [[Code]](https://github.com/sagarverma/transformer-lipschitz)

- (arXiv 2023.2) Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for **Grounding Viewpoint Descriptions**, [[Paper]](https://arxiv.org/pdf/2302.10282.pdf)

- (arXiv 2023.2) MaskedKD: Efficient **Distillation** of Vision Transformers with **Masked** Images, [[Paper]](https://arxiv.org/pdf/2302.10494.pdf)

- (arXiv 2023.2) A General Visual Representation Guided Framework with Global Affinity for **Weakly Supervised Salient Object Detection**, [[Paper]](https://arxiv.org/pdf/2302.10697.pdf)

- (arXiv 2023.2) ViTA: A Vision Transformer **Inference Accelerator** for **Edge** Applications, [[Paper]](https://arxiv.org/pdf/2302.09108.pdf)

- (arXiv 2023.2) **Video Action Recognition** Collaborative Learning with Dynamics via PSO-ConvNet Transformer, [[Paper]](https://arxiv.org/pdf/2302.09187.pdf), [[Code]](https://github.com/leonlha/Video-Action-Recognition-via-PSO-ConvNet-Transformer-Collaborative-Learning-with-Dynamics)

- (arXiv 2023.2) A Pilot **Evaluation** of ChatGPT and DALL-E 2 on **Decision Making** and **Spatial Reasoning**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2302/2302.09068.pdf)

- (arXiv 2023.2) StyLIP: Multi-Scale Style-Conditioned Prompt Learning for **CLIP**-based **Domain Generalization**, [[Paper]](https://arxiv.org/pdf/2302.09251.pdf)

- (arXiv 2023.2) Meta Style Adversarial Training for Cross-Domain **Few-Shot** Learning, [[Paper]](https://arxiv.org/pdf/2302.09309.pdf)

- (arXiv 2023.2) HYNETER: HYBRID NETWORK TRANSFORMER FOR OBJECT **DETECTION**, [[Paper]](https://arxiv.org/pdf/2302.09365.pdf)

- (arXiv 2023.2) STOA-VLP: Spatial-Temporal Modeling of Object and Action for **Video-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2302.09736.pdf)

- (arXiv 2023.2) Constraint and Union for Partially-Supervised **Temporal Sentence Grounding**, [[Paper]](https://arxiv.org/pdf/2302.09850.pdf)

- (arXiv 2023.2) STB-VMM: Swin Transformer Based **Video Motion Magnification**, [[Paper]](https://arxiv.org/pdf/2302.10001.pdf)

- (arXiv 2023.2) **Fashion Image Retrieval** with Multi-Granular Alignment, [[Paper]](https://arxiv.org/pdf/2302.08902.pdf)

- (arXiv 2023.2) LayoutDiffuse: Adapting Foundational Diffusion Models for **Layout-to-Image Generation**, [[Paper]](https://arxiv.org/pdf/2302.08908.pdf)

- (arXiv 2023.2) CK-Transformer: Commonsense Knowledge Enhanced Transformers for **Referring Expression Comprehension**, [[Paper]](https://arxiv.org/pdf/2302.09027.pdf), [[Code]](https://github.com/FightingFighting/CK-Transformer)

- (arXiv 2023.2) MaskSketch: Unpaired Structure-guided Masked **Image Generation**, [[Paper]](https://arxiv.org/pdf/2302.05496.pdf)

- (arXiv 2023.2) Single **Motion** **Diffusion**, [[Paper]](https://arxiv.org/pdf/2302.05905.pdf), [[Code]](https://sinmdm.github.io/SinMDM-page)

- (arXiv 2023.2) Tri-Perspective View for Vision-Based **3D Semantic Occupancy Prediction**, [[Paper]](https://arxiv.org/pdf/2302.07817.pdf), [[Code]](https://github.com/wzzheng/TPVFormer)

- (arXiv 2023.2) ANSEL Photobot: A **Robot** **Event Photographer** with Semantic Intelligence, [[Paper]](https://arxiv.org/pdf/2302.07931.pdf)

- (arXiv 2023.2) ForceFormer: Exploring Social Force and Transformer for **Pedestrian Trajectory Prediction**, [[Paper]](https://arxiv.org/pdf/2302.07583.pdf)

- (arXiv 2023.2) **Video** Probabilistic **Diffusion** Models in Projected Latent Space, [[Paper]](https://arxiv.org/pdf/2302.07685.pdf)

- (arXiv 2023.2) Dataset Interfaces: **Diagnosing Model Failures** Using Controllable Counterfactual Generation, [[Paper]](https://arxiv.org/pdf/2302.07865.pdf), [[Code]](https://github.com/MadryLab/dataset-interfaces)

- (arXiv 2023.2) Learning to Substitute Ingredients in **Recipes**, [[Paper]](https://arxiv.org/pdf/2302.07960.pdf)

- (arXiv 2023.2) **Energy** Transformer, [[Paper]](https://arxiv.org/pdf/2302.07253.pdf)

- (arXiv 2023.2) Efficiency 360: **Efficient** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2302.08374.pdf)

- (arXiv 2023.2) A-la-carte **Prompt Tuning** (APT): Combining Distinct Data Via Composable Prompting, [[Paper]](https://arxiv.org/pdf/2302.07994.pdf)

- (arXiv 2023.2) Effective Data **Augmentation** With **Diffusion** Models, [[Paper]](https://arxiv.org/pdf/2302.07944.pdf), [[Project]](https://btrabuc.co/da-fusion)

- (arXiv 2023.2) PRedItOR: Text Guided **Image Editing** with Diffusion Prior, [[Paper]](https://arxiv.org/pdf/2302.07979.pdf)

- (arXiv 2023.2) TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual Vision Transformer for Fast Arbitrary **One-Shot Image Generation**, [[Paper]](https://arxiv.org/pdf/2302.08047.pdf)

- (arXiv 2023.2) Hierarchical Cross-modal Transformer for **RGB-D Salient Object Detection**, [[Paper]](https://arxiv.org/pdf/2302.08052.pdf)

- (arXiv 2023.2) MINOTAUR: Multi-task **Video Grounding** From Multimodal Queries, [[Paper]](https://arxiv.org/pdf/2302.08063.pdf)

- (arXiv 2023.2) Towards **Efficient** Visual **Adaption** via Structural Re-parameterization, [[Paper]](https://arxiv.org/pdf/2302.08106.pdf), [[Code]](https://github.com/luogen1996/RepAdapter)

- (arXiv 2023.2) Efficient **3D Object Reconstruction** using Visual Transformers, [[Paper]](https://arxiv.org/pdf/2302.08474.pdf)

- (arXiv 2023.2) Retrieval-augmented Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2302.08268.pdf)

- (arXiv 2023.2) Robust Human **Motion Forecasting** using Transformer-based Model, [[Paper]](https://arxiv.org/pdf/2302.08274.pdf)

- (arXiv 2023.2) VQ3D: Learning a **3D**-Aware **Generative** Model on ImageNet, [[Paper]](https://arxiv.org/pdf/2302.06833.pdf), [[Project]](https://kylesargent.github.io/vq3d)

- (arXiv 2023.2) UKnow: A Unified Knowledge Protocol for **Common-Sense Reasoning** and **Vision-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2302.06891.pdf), [[Code]](https://github.com/Gongggg/UKnow)

- (arXiv 2023.2) A **THEORETICAL** UNDERSTANDING OF **SHALLOW** VISION TRANSFORMERS: LEARNING, GENERALIZATION, AND SAMPLE COMPLEXITY, [[Paper]](https://arxiv.org/pdf/2302.06015.pdf)

- (arXiv 2023.2) A Simple Zero-shot Prompt Weighting Technique to Improve **Prompt** Ensembling in **Text-Image** Models, [[Paper]](https://arxiv.org/pdf/2302.06235.pdf)

- (arXiv 2023.2) Generalized Few-Shot **Continual Learning** with Contrastive Mixture of Adapters, [[Paper]](https://arxiv.org/pdf/2302.05936.pdf), [[Code]](https://github.com/yawencui/CMoA)

- (arXiv 2023.2) Actional Atomic-Concept Learning for Demystifying **Vision-Language Navigation**, [[Paper]](https://arxiv.org/pdf/2302.06072.pdf)

- (arXiv 2023.2) Towards Local Visual Modeling for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2302.06098.pdf), [[Code]](https://github.com/xmu-xiaoma666/LSTNet)

- (arXiv 2023.2) CLIP-RR: IMPROVED CLIP NETWORK FOR RELATION-FOCUSED **CROSS-MODAL INFORMATION RETRIEVAL**, [[Paper]](https://arxiv.org/pdf/2302.06350.pdf)

- (arXiv 2023.2) **Anticipating** Next Active Objects for **Egocentric Videos**, [[Paper]](https://arxiv.org/pdf/2302.06358.pdf), [[Code]]()

- (arXiv 2023.2) UniAdapter: Unified Parameter-Efficient Transfer Learning for **Cross-modal Modeling**, [[Paper]](https://arxiv.org/pdf/2302.06605.pdf), [[Code]](https://github.com/RERV/UniAdapter)

- (arXiv 2023.2) TEAM **DETR**: GUIDE QUERIES AS A PROFESSIONAL TEAM IN DETECTION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2302.07116.pdf), [[Code]](https://github.com/horrible-dong/TeamDETR)

- (arXiv 2023.2) ConceptFusion: Open-set **Multimodal** **3D Mapping**, [[Paper]](https://arxiv.org/pdf/2302.07241.pdf), [[Project]](https://concept-fusion.github.io/)

- (arXiv 2023.2) Team Triple-Check at Factify 2: Parameter-Efficient Large Foundation Models with Feature Representations for **Multi-Modal Fact Verification**, [[Paper]](https://arxiv.org/pdf/2302.07740.pdf), [[Code]](https://github.com/wwweiwei/Pre-CoFactv2-AAAI-2023)

- (arXiv 2023.2) PolyFormer: Referring Image **Segmentation** as Sequential Polygon Generation, [[Paper]](https://arxiv.org/pdf/2302.07387.pdf)

- (arXiv 2023.2) Pose-Oriented Transformer with Uncertainty-Guided Refinement for **2D-to-3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2302.07408.pdf)

- (arXiv 2023.2) TFormer: A Transmission-Friendly ViT Model for **IoT** Devices, [[Paper]](https://arxiv.org/pdf/2302.07734.pdf), [[Code]]()

- (arXiv 2023.2) Adding Conditional Control to **Text-to-Image Diffusion** Models, [[Paper]](https://arxiv.org/pdf/2302.05543.pdf), [[Code]](https://github.com/lllyasviel/ControlNet)

- (arXiv 2023.2) Invariant **Slot Attention**: **Object Discovery** with Slot-Centric Reference Frames, [[Paper]](https://arxiv.org/pdf/2302.04973.pdf)

- (arXiv 2023.2) IS MULTI-MODAL **VISION** SUPERVISION **BENEFICIAL** TO **LANGUAGE**? [[Paper]](https://arxiv.org/pdf/2302.05016.pdf)

- (arXiv 2023.2) Data-Driven **Stochastic Motion Evaluation** and **Optimization** with Image by Spatially-Aligned Temporal Encoding, [[Paper]](https://arxiv.org/pdf/2302.05041.pdf)

- (arXiv 2023.2) **Scaling** Vision Transformers to **22 Billion Parameters**, [[Paper]](https://arxiv.org/pdf/2302.05442.pdf)

- (arXiv 2023.2) Adapting **Pre-trained** Vision Transformers from **2D to 3D** through Weight Inflation Improves Medical Image Segmentation, [[Paper]](https://arxiv.org/pdf/2302.04303.pdf), [[Code]](https://github.com/yuhui-zh15/TransSeg)

- (arXiv 2023.2) Mitigating **Bias** in Visual Transformers via Targeted Alignment, [[Paper]](https://arxiv.org/pdf/2302.04358.pdf)

- (arXiv 2023.2) IH-ViT: Vision Transformer-based **Integrated Circuit Appearance Defect Detection**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2302/2302.04521.pdf)

- (arXiv 2023.2) Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2302.04858.pdf)

- (arXiv 2023.2) Learning by Asking for **Embodied** Visual **Navigation** and **Task Completion**, [[Paper]](https://arxiv.org/pdf/2302.04865.pdf)

- (arXiv 2023.2) **Reversible** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2302.04869.pdf), [[Code1]](https://github.com/facebookresearch/slowfast), [[Code2]](https://github.com/karttikeya/minREV)

- (arXiv 2023.2) Neural Congealing: **Aligning Images** to a Joint **Semantic Atlas**, [[Paper]](https://arxiv.org/pdf/2302.03956.pdf), [[Project]](https://neural-congealing.github.io/)

- (arXiv 2023.2) **Adversarial Prompting** for Black Box Foundation Models, [[Paper]](https://arxiv.org/pdf/2302.04237.pdf)

- (arXiv 2023.2) Understanding Why ViT **Trains** Badly on **Small Datasets**: An Intuitive Perspective, [[Paper]](https://arxiv.org/pdf/2302.03751.pdf), [[Code]](https://github.com/BoyuanJackChen/MiniProject2_VisTrans)

- (arXiv 2023.2) CROSS-LAYER RETROSPECTIVE RETRIEVING VIA LAYER **ATTENTION**, [[Paper]](https://arxiv.org/pdf/2302.03985.pdf), [[Code]](https://github.com/joyfang1106/MRLA)

- (arXiv 2023.2) Convolutional Neural Networks Trained to **Identify Words** Provide a Good Account of Visual Form Priming Effects, [[Paper]](https://arxiv.org/pdf/2302.03992.pdf)

- (arXiv 2023.2) Zero-shot **Generation** of Coherent **Storybook** from Plain Text Story using Diffusion Models, [[Paper]](https://arxiv.org/pdf/2302.03900.pdf)

- (arXiv 2023.2) OSRT: Omnidirectional **Image Super-Resolution** with Distortion-aware Transformer, [[Paper]](https://arxiv.org/pdf/2302.03453.pdf)

- (arXiv 2023.2) Pic2Word: Mapping Pictures to Words for Zero-shot **Composed** **Image Retrieval**, [[Paper]](https://arxiv.org/pdf/2302.03084.pdf), [[Code]](https://github.com/google-research/composed_image_retrieval)

- (arXiv 2023.2) SimCon Loss with Multiple Views for Text Supervised **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2302.03432.pdf)

- (arXiv 2023.2) PhysFormer++: **Facial** Video-based **Physiological Measurement** with SlowFast Temporal Difference Transformer, [[Paper]](https://arxiv.org/pdf/2302.03548.pdf)

- (arXiv 2023.2) Scaling **Self-Supervised** End-to-End **Driving** with Multi-View Attention Learning, [[Paper]](https://arxiv.org/pdf/2302.03198.pdf)

- (arXiv 2023.2) HumanMAC: Masked Motion Completion for **Human Motion Prediction**, [[Paper]](https://arxiv.org/pdf/2302.03665.pdf), [[Project]](https://lhchen.top/Human-MAC/)

- (arXiv 2023.2) LAMPP: **Language Models** as Probabilistic Priors for **Perception** and **Action**, [[Paper]](https://arxiv.org/pdf/2302.02801.pdf)

- (arXiv 2023.2) Zero-Shot **Robot Manipulation** from Passive Human Videos, [[Paper]](https://arxiv.org/pdf/2302.02011.pdf), [[Project]](https://sites.google.com/view/human-0shot-robot)

- (arXiv 2023.2) MixFormer: End-to-End **Tracking** with Iterative Mixed Attention, [[Paper]](https://arxiv.org/pdf/2302.02814.pdf), [[Code]](https://github.com/MCG-NJU/MixFormer)

- (arXiv 2023.2) LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale **Image-Text Retrieval**, [[Paper]](https://arxiv.org/pdf/2302.02908.pdf)

- (arXiv 2023.2) V1T: large-scale **mouse V1 response prediction** using a Vision Transformer, [[Paper]](https://arxiv.org/pdf/2302.03023.pdf)

- (arXiv 2023.2) AIM: ADAPTING **IMAGE MODELS** FOR EFFICIENT **VIDEO ACTION RECOGNITION**, [[Paper]](https://arxiv.org/pdf/2302.03024.pdf), [[Project]](https://adapt-image-models.github.io/)

- (arXiv 2023.2) KDEformer: **Accelerating** Transformers via Kernel Density Estimation, [[Paper]](https://arxiv.org/pdf/2302.02451.pdf), [[Code]](https://github.com/majid-daliri/kdeformer)

- (arXiv 2023.2) Semantic-Guided **Image Augmentation** with Pre-trained Models, [[Paper]](https://arxiv.org/pdf/2302.02070.pdf)

- (arXiv 2023.2) X-ReID: Cross-Instance Transformer for Identity-Level **Person Re-Identification**, [[Paper]](https://arxiv.org/pdf/2302.02075.pdf)

- (arXiv 2023.2) MOMA: **Distill** from Self-Supervised Teachers, [[Paper]](https://arxiv.org/pdf/2302.02089.pdf)

- (arXiv 2023.2) Learning to Agree on Vision Attention for **Visual Commonsense Reasoning**, [[Paper]](https://arxiv.org/pdf/2302.02117.pdf)

- (arXiv 2023.2) Efficient End-to-End **Video Question Answering** with Pyramidal Multimodal Transformer, [[Paper]](https://arxiv.org/pdf/2302.02136.pdf), [[Code]](https://github.com/Trunpm/PMT-AAAI23)

- (arXiv 2023.2) LipFormer: Learning to **Lipread** Unseen Speakers based on Visual-Landmark Transformers, [[Paper]](https://arxiv.org/pdf/2302.02141.pdf)

- (arXiv 2023.2) Oscillation-free **Quantization** for Low-bit Vision Transformers, [[Paper]](https://arxiv.org/pdf/2302.02210.pdf)

- (arXiv 2023.2) Design Booster: A Text-Guided Diffusion Model for **Image Translation** with Spatial Layout Preservation, [[Paper]](https://arxiv.org/pdf/2302.02284.pdf)

- (arXiv 2023.2) Contrast with Reconstruct: **Contrastive** **3D** Representation Learning Guided by Generative Pretraining, [[Paper]](https://arxiv.org/pdf/2302.02318.pdf), [[Code]](https://github.com/qizekun/ReCon)

- (arXiv 2023.2) Leaving Reality to Imagination: **Robust** **Classification** via **Generated** Datasets, [[Paper]](https://arxiv.org/pdf/2302.02503.pdf), [[Code]](https://github.com/Hritikbansal/generative-robustness)

- (arXiv 2023.2) CHiLS: Zero-Shot Image **Classification** with **Hierarchical** Label Sets, [[Paper]](https://arxiv.org/pdf/2302.02551.pdf), [[Code]](https://github.com/acmi-lab/CHILS)

- (arXiv 2023.2) Zero-shot **Image-to-Image** Translation, [[Paper]](https://arxiv.org/pdf/2302.03027.pdf), [[Project]](https://pix2pixzero.github.io/)

- (arXiv 2023.2) Learning a **Fourier Transform** for Linear Relative **Positional Encodings** in Transformers, [[Paper]](https://arxiv.org/pdf/2302.01925.pdf)

- (arXiv 2023.2) EXPLICIT BOX DETECTION UNIFIES END-TO-END **MULTI-PERSON POSE ESTIMATION**, [[Paper]](http://my.sjtu.edu.cn/Task), [[Code]](https://github.com/IDEA-Research/ED-Pose)

- (arXiv 2023.2) CFFT-GAN: Cross-domain Feature Fusion Transformer for Exemplar-based **Image Translation**, [[Paper]](https://arxiv.org/pdf/2302.01608.pdf)

- (arXiv 2023.2) DEVICE: DEpth and VIsual ConcEpts Aware Transformer for **TextCaps**, [[Paper]](https://arxiv.org/pdf/2302.01540.pdf)

- (arXiv 2023.2) CVTNet: A Cross-View Transformer Network for **Place Recognition** Using **LiDAR** Data, [[Paper]](https://arxiv.org/pdf/2302.01665.pdf), [[Code]](https://github.com/BIT-MJY/CVTNet)

- (arXiv 2023.2) DilateFormer: **Multi-Scale Dilated** Transformer for Visual Recognition, [[Paper]](https://arxiv.org/pdf/2302.01791.pdf), [[Code]](https://github.com/JIAOJIAYUASD/dilateformer)

- (arXiv 2023.2) HDFormer: High-order Directed Transformer for **3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2302.01825.pdf), [[Code]](https://github.com/hyer/HDFormer)

- (arXiv 2023.2) IC^3: Image Captioning by Committee Consensus, [[Paper]](https://arxiv.org/pdf/2302.01328.pdf), [[Code]](https://github.com/DavidMChan/caption-by-committee)

- (arXiv 2023.2) Boosting Low-Data Instance **Segmentation** by Unsupervised Pre-training with Saliency Prompt, [[Paper]](https://arxiv.org/pdf/2302.01171.pdf)

- (arXiv 2023.2) QR-CLIP: Introducing Explicit Open-World Knowledge for **Location and Time Reasoning**, [[Paper]](https://arxiv.org/pdf/2302.00952.pdf)

- (arXiv 2023.2) Vision Transformer-based Feature Extraction for **Generalized Zero-Shot Learning**, [[Paper]](https://arxiv.org/pdf/2302.00875.pdf)

- (arXiv 2023.2) **Multimodal** Chain-of-Thought **Reasoning** in Language Models, [[Paper]](https://arxiv.org/pdf/2302.00923.pdf), [[Code]](https://github.com/amazon-science/mm-cot)

- (arXiv 2023.2) CLIPood: Generalizing **CLIP** to **Out-of-Distributions**, [[Paper]](https://arxiv.org/pdf/2302.00864.pdf)

- (arXiv 2023.2) Language Quantized AutoEncoders: Towards Unsupervised **Text-Image** Alignment, [[Paper]](https://arxiv.org/pdf/2302.00902.pdf)

- (arXiv 2023.2) The geometry of **hidden representations** of large transformer models, [[Paper]](https://arxiv.org/pdf/2302.00294.pdf)

- (arXiv 2023.2) **Debiasing** **Vision-Language** Models via Biased Prompts, [[Paper]](https://arxiv.org/pdf/2302.00070.pdf), [[Code]](https://github.com/chingyaoc/debias_vl)

- (arXiv 2023.2) COMPOSITIONAL PROMPT TUNING WITH MOTION CUES FOR **OPEN-VOCABULARY VIDEO RELATION DETECTION**, [[Paper]](https://arxiv.org/pdf/2302.00268.pdf), [[Code]](https://github.com/Dawn-LX/OpenVoc-VidVRD)

- (arXiv 2023.2) mPLUG-2: A Modularized **Multi-modal** Foundation Model Across Text, Image and Video, [[Paper]](https://arxiv.org/pdf/2302.00402.pdf), [[Code]](https://github.com/alibaba/AliceMind/tree/main/mPLUG)

- (arXiv 2023.2) Transforming **CLIP** to an **Open-vocabulary Video Model** via Interpolated Weight Optimization, [[Paper]](https://arxiv.org/pdf/2302.00624.pdf)

- (arXiv 2023.2) ADAPT: Action-aware Driving **Caption** Transformer, [[Paper]](https://arxiv.org/pdf/2302.00673.pdf), [[Code]](https://github.com/jxbbb/ADAPT)

### 2023.1

- (arXiv 2023.1) AdaPoinTr: Diverse **Point Cloud Completion** with Adaptive Geometry-Aware Transformers, [[Paper]](https://arxiv.org/pdf/2301.04545.pdf), [[Code]](https://github.com/yuxumin/PoinTr)

- (arXiv 2023.1) **EXIF** as Language: Learning Cross-Modal Associations Between **Images and Camera Metadata**, [[Paper]](https://arxiv.org/pdf/2301.04647.pdf), [[Project]](https://hellomuffin.github.io/exif-as-language)

- (arXiv 2023.1) Head-Free Lightweight **Semantic Segmentation** with Linear Transformer, [[Paper]](https://arxiv.org/pdf/2301.04648.pdf), [[Code]](https://github.com/dongbo811/AFFormer)

- (arXiv 2023.1) Geometry-biased Transformers for **Novel View Synthesis**, [[Paper]](https://arxiv.org/pdf/2301.04650.pdf), [[Project]](https://mayankgrwl97.github.io/gbt)

- (arXiv 2023.1) **Continual** **Few-Shot** Learning Using HyperTransformers, [[Paper]](https://arxiv.org/pdf/2301.04584.pdf)

- (arXiv 2023.1) SEMPPL: PREDICTING **PSEUDO-LABELS** FOR BETTER **CONTRASTIVE** REPRESENTATIONS, [[Paper]](https://arxiv.org/pdf/2301.05158.pdf)

- (arXiv 2023.1) Learning to **Summarize Videos** by Contrasting Clips, [[Paper]](https://arxiv.org/pdf/2301.05213.pdf)

- (arXiv 2023.1) Guiding **Text-to-Image** **Diffusion** Model Towards Grounded Generation, [[Paper]](https://arxiv.org/pdf/2301.05221.pdf), [[Project]](https://lipurple.github.io/Grounded_Diffusion/)

- (arXiv 2023.1) Domain Expansion of **Image Generators**, [[Paper]](https://arxiv.org/pdf/2301.05225.pdf), [[Code]](https://yotamnitzan.github.io/domain-expansion/)

- (arXiv 2023.1) Scene-centric vs. Object-centric Image-Text **Cross-modal Retrieval**: A Reproducibility Study, [[Paper]](https://arxiv.org/pdf/2301.05174.pdf)

- (arXiv 2023.1) Tracr: Compiled Transformers as a Laboratory for **Interpretability**, [[Paper]](https://arxiv.org/pdf/2301.05062.pdf), [[Code]](https://github.com/deepmind/tracr)

- (arXiv 2023.1) **CLIP** the Gap: A Single **Domain Generalization** Approach for Object **Detection**, [[Paper]](https://arxiv.org/pdf/2301.05499.pdf)

- (arXiv 2023.1) **Text to Point Cloud Localization** with Relation-Enhanced Transformer, [[Paper]](https://arxiv.org/pdf/2301.05372.pdf)

- (arXiv 2023.1) GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured **Pruning** for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2301.05345.pdf)

- (arXiv 2023.1) Toward Building General Foundation Models for Language, Vision, and **Vision-Language** Understanding Tasks, [[Paper]](https://arxiv.org/pdf/2301.05065.pdf)

- (arXiv 2023.1) ViTs for SITS: Vision Transformers for **Satellite Image Time Series**, [[Paper]](https://arxiv.org/pdf/2301.04944.pdf), [[Code]](https://github.com/michaeltrs/DeepSatModels)

- (arXiv 2023.1) CLIP2Scene: Towards Label-efficient **3D Scene Understanding** by **CLIP**, [[Paper]](https://arxiv.org/pdf/2301.04926.pdf)

- (arXiv 2023.1) A Large-Scale Outdoor Multi-modal **Dataset** and Benchmark for **Novel View Synthesis** and Implicit **Scene Reconstruction**, [[Paper]](https://arxiv.org/pdf/2301.06782.pdf), [[Project]](https://ommo.luchongshan.com/)

- (arXiv 2023.1) USER: Unified Semantic Enhancement with Momentum Contrast for **Image-Text Retrieval**, [[Paper]](https://arxiv.org/pdf/2301.06844.pdf), [[Code]](https://github.com/zhangy0822/USER)

- (arXiv 2023.1) SAT: Size-Aware Transformer for 3D **Point Cloud Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2301.06869.pdf)

- (arXiv 2023.1) **Masked** **Visual** Reconstruction in **Language** Semantic Space, [[Paper]](https://arxiv.org/pdf/2301.06958.pdf), [[Code]](https://github.com/hustvl/RILS)

- (arXiv 2023.1) Vision Learners Meet Web **Image-Text** Pairs, [[Paper]](https://arxiv.org/pdf/2301.07088.pdf), [[Code]](https://huggingface.co/spaces/tennant/MUG_caption)

- (arXiv 2023.1) GLIGEN: Open-Set Grounded **Text-to-Image** Generation, [[Paper]](https://arxiv.org/pdf/2301.07093.pdf), [[Project]](https://gligen.github.io/)

- (arXiv 2023.1) **Learning** Customized Visual Models with **Retrieval**-Augmented **Knowledge**, [[Paper]](https://arxiv.org/pdf/2301.07094.pdf), [[Project]](https://react-vl.github.io/)

- (arXiv 2023.1) UATVR: Uncertainty-Adaptive **Text-Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2301.06309.pdf)

- (arXiv 2023.1) Learning Aligned Cross-modal Representations for **Referring Image Segmentation**, [[Paper]](https://arxiv.org/pdf/2301.06429.pdf)

- (arXiv 2023.1) T2M-GPT: **Generating** Human **Motion** from Textual Descriptions with Discrete Representations, [[Paper]](https://arxiv.org/pdf/2301.06052.pdf), [[Project]](https://mael-zys.github.io/T2M-GPT/)

- (arXiv 2023.1) DSVT: Dynamic **Sparse Voxel** Transformer with Rotated Sets, [[Paper]](https://arxiv.org/pdf/2301.06051.pdf), [[Code]](https://github.com/Haiyang-W/DSVT)

- (arXiv 2023.1) CMAE-V: Contrastive Masked Autoencoders for **Video Action Recognition**, [[Paper]](https://arxiv.org/pdf/2301.06018.pdf)

- (arXiv 2023.1) Generating Templated Caption for **Video Grounding**, [[Paper]](https://arxiv.org/pdf/2301.05997.pdf)

- (arXiv 2023.1) Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised **Depth Estimation** in Dynamic Scenes, [[Paper]](https://arxiv.org/pdf/2301.05871.pdf)

- (arXiv 2023.1) SwinDepth: Unsupervised **Depth Estimation** using Monocular Sequences via Swin Transformer and Densely Cascaded Network, [[Paper]](https://arxiv.org/pdf/2301.06715.pdf)

- (arXiv 2023.1) **CLIP**TER: Looking at the Bigger Picture in **Scene Text Recognition**, [[Paper]](https://arxiv.org/pdf/2301.07464.pdf)

- (arXiv 2023.1) Temporal Perceiving **Video-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2301.07463.pdf)

- (arXiv 2023.1) Joint Representation Learning for **Text** and 3D **Point Cloud**, [[Paper]](https://arxiv.org/pdf/2301.07584.pdf), [[Code]](https://github.com/LeapLabTHU/Text4Point)

- (arXiv 2023.1) Effective End-to-End **Vision Language** Pretraining with Semantic Visual Loss, [[Paper]](https://arxiv.org/pdf/2301.07236.pdf)

- (arXiv 2023.1) PTA-Det: Point Transformer Associating Point cloud and Image for **3D Object Detection**, [[Paper]](https://arxiv.org/pdf/2301.07301.pdf)

- (arXiv 2023.1) **Face Recognition** in the age of CLIP & Billion image datasets, [[Paper]](https://arxiv.org/pdf/2301.07315.pdf)

- (arXiv 2023.1) HSTFormer: Hierarchical Spatial-Temporal Transformers for **3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2301.07322.pdf), [[Code]](https://github.com/qianxiaoye825/HSTFormer)

- (arXiv 2023.1) Towards Models that Can **See** and **Read**, [[Paper]](https://arxiv.org/pdf/2301.07389.pdf)

- (arXiv 2023.1) **Embodied** Agents for Efficient Exploration and Smart Scene Description, [[Paper]](https://arxiv.org/pdf/2301.07150.pdf)

- (arXiv 2023.1) **Self-Supervised Learning** from Images with a Joint-Embedding Predictive Architecture, [[Paper]](https://arxiv.org/pdf/2301.08243.pdf)

- (arXiv 2023.1) Revisiting the Spatial and Temporal Modeling for **Few-shot Action Recognition**, [[Paper]](https://arxiv.org/pdf/2301.07944.pdf)

- (arXiv 2023.1) Multimodal Video Adapter for Parameter Efficient **Video Text Retrieval**, [[Paper]](https://arxiv.org/pdf/2301.07868.pdf)

- (arXiv 2023.1) **Self Supervision** Does Not Help Natural Language Supervision at Scale, [[Paper]](https://arxiv.org/pdf/2301.07836.pdf)

- (arXiv 2023.1) MULTI-TARGET MULTI-CAMERA **VEHICLE TRACKING** USING TRANSFORMER-BASED CAMERA LINK MODEL AND SPATIAL-TEMPORAL INFORMATION, [[Paper]](https://arxiv.org/pdf/2301.07805.pdf)

- (arXiv 2023.1) ATMAN: **Understanding** Transformer Predictions Through Memory Efficient **Attention** Manipulation, [[Paper]](https://arxiv.org/pdf/2301.08110.pdf)

- (arXiv 2023.1) DDS: Decoupled Dynamic **Scene-Graph Generation** Network, [[Paper]](https://arxiv.org/pdf/2301.07666.pdf), [[Code]]()

- (arXiv 2023.1) Visual Writing Prompts: Character-Grounded **Story Generation** with Curated Image Sequences, [[Paper]](https://arxiv.org/pdf/2301.08571.pdf)

- (arXiv 2023.1) **Image Memorability Prediction** with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2301.08647.pdf)

- (arXiv 2023.1) HOLISTICALLY **EXPLAINABLE** VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2301.08669.pdf)

- (arXiv 2023.1) FlatFormer: Flattened Window Attention for **Efficient** **Point Cloud** Transformer, [[Paper]](https://arxiv.org/pdf/2301.08739.pdf)

- (arXiv 2023.1) LEGO-Net: Learning Regular **Rearrangements** of **Objects** in Rooms, [[Paper]](https://arxiv.org/pdf/2301.09629.pdf), [[Project]](https://ivl.cs.brown.edu/projects/lego-net)

- (arXiv 2023.1) Zorro: the masked **multimodal** transformer, [[Paper]](https://arxiv.org/pdf/2301.09595.pdf)

- (arXiv 2023.1) Towards Robust **Video Instance Segmentation** with Temporal-Aware Transformer, [[Paper]](https://arxiv.org/pdf/2301.09416.pdf)

- (arXiv 2023.1) Learning **Open-vocabulary Semantic Segmentation** Models From Natural Language Supervision, [[Paper]](https://arxiv.org/pdf/2301.09121.pdf), [[Project]](https://jazzcharles.github.io/OVSegmentor/)

- (arXiv 2023.1) Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object **Interaction Anticipation**, [[Paper]](https://arxiv.org/pdf/2301.09209.pdf), [[Code]](https://eth-ait.github.io/transfusion-proj/)

- (arXiv 2023.1) Combined Use of Federated Learning and Image Encryption for **Privacy**-Preserving **Image Classification** with Vision Transformer, [[Paper]](https://arxiv.org/pdf/2301.09255.pdf)

- (arXiv 2023.1) Slice Transformer and Self-supervised Learning for **6DoF Localization** in 3D Point Cloud Maps, [[Paper]](https://arxiv.org/pdf/2301.08957.pdf)

- (arXiv 2023.1) IMPROVING ACCURACY OF **ZERO-SHOT ACTION RECOGNITION** WITH HANDCRAFTED FEATURES, [[Paper]](https://arxiv.org/pdf/2301.08874.pdf)

- (arXiv 2023.1) Learning to View: Decision Transformers for **Active Object Detection**, [[Paper]](https://arxiv.org/pdf/2301.09544.pdf)

- (arXiv 2023.1) Visual Semantic Relatedness Dataset for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2301.08784.pdf), [[Code]](https://github.com/ahmedssabir/Textual-Visual-Semantic-Dataset)

- (arXiv 2023.1) VERSATILE NEURAL PROCESSES FOR LEARNING **IMPLICIT NEURAL REPRESENTATIONS**, [[Paper]](https://arxiv.org/pdf/2301.08883.pdf), [[Code]](https://github.com/ZongyuGuo/Versatile-NP)

- (arXiv 2023.1) RangeViT: Towards Vision Transformers for **3D Semantic Segmentation** in Autonomous Driving, [[Paper]](https://arxiv.org/pdf/2301.10222.pdf), [[Code]](https://github.com/valeoai/rangevit)

- (arXiv 2023.1) Exploiting Optical Flow Guidance for Transformer-Based **Video Inpainting**, [[Paper]](https://arxiv.org/pdf/2301.10048.pdf)

- (arXiv 2023.1) Image **Super-Resolution** using Efficient Striped Window Transformer, [[Paper]](https://arxiv.org/pdf/2301.09869.pdf), [[Code]](https://github.com/Fried-Rice-Lab/FriedRiceLab)

- (arXiv 2023.1) **Out of Distribution** Performance of State of Art Vision Model, [[Paper]](https://arxiv.org/pdf/2301.10750.pdf), [[Code]](https://github.com/salman-lui/vision_course_project)

- (arXiv 2023.1) Compact Transformer **Tracker** with Correlative Masked Modeling, [[Paper]](https://arxiv.org/pdf/2301.10938.pdf), [[Code]](https://github.com/HUSTDML/CTTrack)

- (arXiv 2023.1) **Vision-Language** Models Performing Zero-Shot Tasks Exhibit **Gender**-based **Disparities**, [[Paper]](https://arxiv.org/pdf/2301.11100.pdf)

- (arXiv 2023.1) Cut and Learn for **Unsupervised** Object **Detection** and Instance **Segmentation**, [[Paper]](https://arxiv.org/pdf/2301.11320.pdf), [[Code]](https://github.com/facebookresearch/CutLER)

- (arXiv 2023.1) Explaining Visual **Biases** as Words by Generating Captions, [[Paper]](https://arxiv.org/pdf/2301.11104.pdf), [[Code]](https://github.com/alinlab/b2t)

- (arXiv 2023.1) Revisiting **Temporal Modeling** for **CLIP**-based Image-to-Video Knowledge Transferring, [[Paper]](https://arxiv.org/pdf/2301.11116.pdf), [[Code]](https://github.com/farewellthree/STAN)

- (arXiv 2023.1) **Multi-video Moment Ranking** with Multimodal Clue, [[Paper]](https://arxiv.org/pdf/2301.13606.pdf)

- (arXiv 2023.1) SDF-FORMER: **MONOCULAR SCENE RECONSTRUCTION** WITH 3D SDF TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2301.13510.pdf), [[Project]](https://weihaosky.github.io/sdfformer)

- (arXiv 2023.1) Grounding Language Models to Images for **Multimodal Generation**, [[Paper]](https://arxiv.org/pdf/2301.13823.pdf)

- (arXiv 2023.1) Pseudo 3D Perception Transformer with Multi-level Confidence Optimization for **Visual Commonsense Reasoning**, [[Paper]](https://arxiv.org/pdf/2301.13335.pdf)

- (arXiv 2023.1) A Modular Multi-stage Lightweight Graph Transformer Network for **Human Pose and Shape Estimation** from 2D Human Pose, [[Paper]](https://arxiv.org/pdf/2301.13403.pdf)

- (arXiv 2023.1) Priors are Powerful: Improving a Transformer for **Multi-camera 3D Detection** with 2D Priors, [[Paper]](https://arxiv.org/pdf/2301.13592.pdf)

- (arXiv 2023.1) UPop: Unified and Progressive Pruning for **Compressing** **Vision-Language** Transformers, [[Paper]](https://arxiv.org/pdf/2301.13741.pdf)

- (arXiv 2023.1) **Fairness**-aware Vision Transformer via Debiased Self-Attention, [[Paper]](https://arxiv.org/pdf/2301.13803.pdf)

- (arXiv 2023.1) Anchor-Based Adversarially Robust **Zero-Shot Learning** Driven by Language, [[Paper]](https://arxiv.org/pdf/2301.13096.pdf)

- (arXiv 2023.1) Distilling Internet-Scale **Vision-Language** Models into **Embodied** Agents, [[Paper]](https://arxiv.org/pdf/2301.12507.pdf)

- (arXiv 2023.1) 6-DoF Robotic **Grasping** with Transformer, [[Paper]](https://arxiv.org/pdf/2301.12476.pdf)

- (arXiv 2023.1) Do Embodied Agents Dream of Pixelated Sheep?: **Embodied Decision Making** using Language Guided World Modelling, [[Paper]](https://arxiv.org/pdf/2301.12050.pdf), [[Project]](https://deckardagent.github.io/)

- (arXiv 2023.1) GALIP: Generative Adversarial CLIPs for **Text-to-Image** Synthesis, [[Paper]](https://arxiv.org/pdf/2301.12959.pdf), [[Code]](https://github.com/tobran/GALIP)

- (arXiv 2023.1) STAIR: Learning **Sparse** **Text and Image** Representation in Grounded Tokens, [[Paper]](https://arxiv.org/pdf/2301.13081.pdf)

- (arXiv 2023.1) **Aerial** Image Object **Detection** With Vision Transformer Detector (ViTDet), [[Paper]](https://arxiv.org/ftp/arxiv/papers/2301/2301.12058.pdf)

- (arXiv 2023.1) Towards Vision Transformer Unrolling Fixed-Point Algorithm: a Case Study on **Image Restoration**, [[Paper]](https://arxiv.org/pdf/2301.12332.pdf)

- (arXiv 2023.1) Debiased Fine-Tuning for **Vision-language** Models by **Prompt** Regularization, [[Paper]](https://arxiv.org/pdf/2301.12429.pdf), [[Code]]()

- (arXiv 2023.1) BLIP-2: Bootstrapping **Language-Image** Pre-training with **Frozen** Image Encoders and Large Language Models, [[Paper]](https://arxiv.org/pdf/2301.12597.pdf), [[Code]](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)

- (arXiv 2023.1) Tagging before Alignment: Integrating Multi-Modal Tags for **Video-Text Retrieval**, [[Paper]](https://arxiv.org/pdf/2301.12644.pdf)

- (arXiv 2023.1) SEAFORMER: SQUEEZE-ENHANCED AXIAL TRANSFORMER FOR MOBILE SEMANTIC **SEGMENTATION**, [[Paper]](https://arxiv.org/pdf/2301.13156.pdf), [[Code]](https://github.com/fudan-zvg/SeaFormer)

- (arXiv 2023.1) Learning 6-DoF Fine-grained **Grasp Detection** Based on Part Affordance Grounding, [[Paper]](https://arxiv.org/pdf/2301.11564.pdf), [[Project]](https://sites.google.com/view/lang-shape)

- (arXiv 2023.1) Multimodal Event Transformer for **Image-guided Story Ending Generation**, [[Paper]](https://arxiv.org/pdf/2301.11357.pdf)

- (arXiv 2023.1) Style-Aware Contrastive Learning for Multi-Style Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2301.11367.pdf)

- (arXiv 2023.1) 3DShape2VecSet: A **3D Shape Representation** for Neural Fields and Generative Diffusion Models, [[Paper]](https://arxiv.org/pdf/2301.11445.pdf)

- (arXiv 2023.1) Semi-Parametric **Video-Grounded Text Generation**, [[Paper]](https://arxiv.org/pdf/2301.11507.pdf)

- (arXiv 2023.1) **Robust** Transformer with Locality Inductive Bias and Feature Normalization, [[Paper]](https://arxiv.org/pdf/2301.11553.pdf)

- (arXiv 2023.1) LEVERAGING THE THIRD DIMENSION IN **CONTRASTIVE LEARNING**, [[Paper]](https://arxiv.org/pdf/2301.11790.pdf)

- (arXiv 2023.1) Understanding **Self-Supervised** Pretraining with **Part**-Aware Representation Learning, [[Paper]](https://arxiv.org/pdf/2301.11915.pdf)

- (arXiv 2023.1) Hypergraph Transformer for **Skeleton-based Action Recognition**, [[Paper]](https://arxiv.org/pdf/2211.09590.pdf)

- (arXiv 2023.1) CPT-V: A Contrastive Approach to Post-Training **Quantization** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.09643.pdf)

- (arXiv 2023.1) InstructPix2Pix: Learning to Follow **Image Editing** Instructions, [[Paper]](https://arxiv.org/pdf/2211.09800.pdf), [[Code]](http://timothybrooks.com/instruct-pix2pix)

- (arXiv 2023.1) OvarNet: Towards Open-vocabulary Object **Attribute Recognition**, [[Paper]](https://arxiv.org/pdf/2301.09506.pdf), [[Project]](https://kyanchen.github.io/OvarNet)

- (arXiv 2023.1) **Token** Transformer: Can class token help window-based transformer build better **long-range interactions**? [[Paper]](https://arxiv.org/pdf/2211.06083.pdf)

- (arXiv 2023.1) Multimodal Inverse Cloze Task for Knowledge-based **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2301.04366.pdf), [[Code]]()

- (arXiv 2023.1) FGAHOI: Fine-Grained Anchors for **Human-Object Interaction** Detection, [[Paper]](https://arxiv.org/pdf/2301.04019.pdf), [[Code]](https://github.com/xiaomabufei/FGAHOI)

- (arXiv 2023.1) Parallel Reasoning Network for **Human-Object Interaction** Detection, [[Paper]](https://arxiv.org/pdf/2301.03510.pdf)

- (arXiv 2023.1) In Defense of Structural Symbolic Representation for **Video Event-Relation Prediction**, [[Paper]](https://arxiv.org/pdf/2301.03410.pdf)

- (arXiv 2023.1) **Scene Synthesis** from Human **Motion**, [[Paper]](https://arxiv.org/pdf/2301.01424.pdf), [[Project]](https://lijiaman.github.io/projects/summon/)

### 2022.12

- (arXiv 2022.12) EVA: Exploring the Limits of **Masked Visual Representation** Learning at Scale, [[Paper]](https://arxiv.org/pdf/2211.07636.pdf), [[Code]](https://github.com/baaivision/EVA)

- (arXiv 2022.12) OneFormer: One Transformer to Rule Universal Image **Segmentation**, [[Paper]](https://arxiv.org/pdf/2211.06220.pdf), [[Code]](https://github.com/SHI-Labs/OneFormer)

- (arXiv 2022.12) MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards **Multi-modal Open-domain Conversation**, [[Paper]](https://arxiv.org/pdf/2211.05719.pdf), [[Project]](https://github.com/victorsungo/MMDialog)

- (arXiv 2022.12) Why is Winoground Hard? Investigating Failures in **Visuolinguistic Compositionality**, [[Paper]](https://arxiv.org/pdf/2211.00768.pdf), [[Code]](https://github.com/ajd12342/why-winoground-hard)

- (arXiv 2022.12) Multimodal **Information Bottleneck**: Learning Minimal Sufficient Unimodal and **Multimodal** Representations, [[Paper]](https://arxiv.org/pdf/2210.17444.pdf), [[Code]](https://github.com/TmacMai/Multimodal-Information-Bottleneck)

- (arXiv 2022.12) CLIP-FLOW: CONTRASTIVE LEARNING BY SEMISUPERVISED ITERATIVE PSEUDO LABELING FOR **OPTICAL FLOW ESTIMATION**, [[Paper]](https://arxiv.org/pdf/2210.14383.pdf)

- (arXiv 2022.12) INSTRUCTION-FOLLOWING **AGENTS** WITH JOINTLY PRE-TRAINED **VISION-LANGUAGE** MODELS, [[Paper]](https://arxiv.org/pdf/2210.13431.pdf), [[Code]](https://github.com/lhao499/instructrl)

- (arXiv 2022.12) MetaFormer **Baselines** for Vision, [[Paper]](https://arxiv.org/pdf/2210.13452.pdf), [[Code]](https://github.com/sail-sg/metaformer)

- (arXiv 2022.12) ViTCoD: Vision Transformer **Acceleration** via Dedicated Algorithm and Accelerator Co-Design, [[Paper]](https://arxiv.org/pdf/2210.09573.pdf), [[Code]](https://github.com/GATECH-EIC/ViTCoD)

- (arXiv 2022.12) FROM PLAY TO POLICY: CONDITIONAL BEHAVIOR GENERATION FROM UNCURATED **ROBOT** DATA, [[Paper]](https://arxiv.org/pdf/2210.10047.pdf), [[Project]](https://play-to-policy.github.io/)

- (arXiv 2022.12) Optimizing **Prompts** for **Text-to-Image** Generation, [[Paper]](https://arxiv.org/pdf/2212.09611.pdf), [[Code]](https://aka.ms/promptist)

- (arXiv 2022.12) Attentive **Mask** **CLIP**, [[Paper]](https://arxiv.org/pdf/2212.08653.pdf)

- (arXiv 2022.12) Rethinking **Cooking State Recognition** with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2212.08586.pdf)

- (arXiv 2022.12) Enhancing **Multi-modal** and **Multi-hop Question Answering** via Structured Knowledge and Unified Retrieval-Generation, [[Paper]](https://arxiv.org/pdf/2212.08632.pdf), [[Code]]()

- (arXiv 2022.12) MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in **Vision and Language** Models & Tasks, [[Paper]](https://arxiv.org/pdf/2212.08158.pdf), [[Code]](https://github.com/Heidelberg-NLP/MM-SHAP)

- (arXiv 2022.12) RepQ-ViT: Scale Reparameterization for Post-Training **Quantization** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2212.08254.pdf)

- (arXiv 2022.12) WAVENHANCER: UNIFYING WAVELET AND TRANSFORMER FOR **IMAGE ENHANCEMENT**, [[Paper]](https://arxiv.org/pdf/2212.08327.pdf)

- (arXiv 2022.12) AUTOENCODERS AS CROSS-MODAL TEACHERS: CAN PRETRAINED 2D IMAGE TRANSFORMERS HELP **3D REPRESENTATION** LEARNING? [[Paper]](https://arxiv.org/pdf/2212.08320.pdf), [[Code]](https://github.com/RunpeiDong/ACT)

- (arXiv 2022.12) SceneGATE: Scene-Graph based co-Attention networks for TExt **visual question answering**, [[Paper]](https://arxiv.org/pdf/2212.08283.pdf)

- (arXiv 2022.12) Emergent **Analogical Reasoning** in Large Language Models, [[Paper]](https://arxiv.org/pdf/2212.09196.pdf)

- (arXiv 2022.12) Unleashing the Power of **Visual Prompting** At the Pixel Level, [[Paper]](https://arxiv.org/pdf/2212.10556.pdf), [[Code]](https://github.com/UCSC-VLAA/EVP)

- (arXiv 2022.12) Does **CLIP** Bind Concepts? Probing **Compositionality** in Large Image Models, [[Paper]](https://arxiv.org/pdf/2212.10537.pdf)

- (arXiv 2022.12) LayoutDETR: Detection Transformer Is a Good Multimodal **Layout Designer**, [[Paper]](https://arxiv.org/pdf/2212.09877.pdf), [[Code]](https://github.com/salesforce/LayoutDETR)

- (arXiv 2022.12) Towards Unsupervised **Visual Reasoning**: Do Off-The-Shelf Features Know How to Reason? [[Paper]](https://arxiv.org/pdf/2212.10292.pdf)

- (arXiv 2022.12) Benchmarking **Spatial Relationships** in **Text-to-Image** Generation, [[Paper]](https://arxiv.org/pdf/2212.10015.pdf), [[Project]](https://visort2i.github.io/)

- (arXiv 2022.12) MetaCLUE: Towards Comprehensive **Visual Metaphors** Research, [[Paper]](https://arxiv.org/pdf/2212.09898.pdf), [[Project]](https://metaclue.github.io/)

- (arXiv 2022.12) Tackling Ambiguity with Images: Improved **Multimodal** Machine **Translation** and Contrastive Evaluation, [[Paper]](https://arxiv.org/pdf/2212.10140.pdf), [[Code]](https://github.com/MatthieuFP/CoMMuTE.git)

- (arXiv 2022.12) Cross-modal Attention Congruence Regularization for **Vision-Language** **Relation** Alignment, [[Paper]](https://arxiv.org/pdf/2212.10549.pdf)

- (arXiv 2022.12) Does unsupervised **grammar induction** need pixels?, [[Paper]](https://arxiv.org/pdf/2212.10564.pdf)

- (arXiv 2022.12) Hi-LASSIE: High-Fidelity **Articulated** Shape and Skeleton **Discovery** from Sparse **Image** Ensemble, [[Paper]](https://arxiv.org/pdf/2212.11042.pdf)

- (arXiv 2022.12) MAViC: Multimodal Active Learning for **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2212.11109.pdf)

- (arXiv 2022.12) What Makes for Good **Tokenizers** in Vision Transformer? [[Paper]](https://arxiv.org/pdf/2212.11115.pdf)

- (arXiv 2022.12) Not Just Pretty Pictures: **Text-to-Image** Generators Enable Interpretable Interventions for **Robust** Representations, [[Paper]](https://arxiv.org/pdf/2212.11237.pdf)

- (arXiv 2022.12) Generalized Decoding for **Pixel**, **Image**, and **Language**, [[Paper]](https://arxiv.org/pdf/2212.11270.pdf), [[Project]](https://x-decoder-vl.github.io/)

- (arXiv 2022.12) METEOR Guided Divergence for **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2212.10690.pdf), [[Code]](https://github.com/d-rothen/bmhrl)

- (arXiv 2022.12) SLGTFORMER: AN ATTENTION-BASED APPROACH TO **SIGN LANGUAGE RECOGNITION**, [[Paper]](https://arxiv.org/pdf/2212.10746.pdf), [[Code]](https://github.com/neilsong/slt)

- (arXiv 2022.12) FROM IMAGES TO TEXTUAL **PROMPTS**: ZERO-SHOT **VQA** WITH FROZEN LARGE LANGUAGE MODELS, [[Paper]](https://arxiv.org/pdf/2212.10846.pdf), [[Code]](https://github.com/salesforce/LAVIS/tree/main/projects/img2prompt-vqa)

- (arXiv 2022.12) 3D Highlighter: Localizing Regions on **3D** Shapes via **Text** Descriptions, [[Paper]](https://arxiv.org/pdf/2212.11263.pdf), [[Code]](https://github.com/threedle/3DHighlighter)

- (arXiv 2022.12) Contrastive **Language-Vision** AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification **Bias**, [[Paper]](https://arxiv.org/pdf/2212.11261.pdf)

- (arXiv 2022.12) Ultra-High-Definition **Low-Light Image Enhancement**: A Benchmark and Transformer-Based Method, [[Paper]](https://arxiv.org/pdf/2212.11548.pdf), [[Code]](https://github.com/TaoWangzj/LLFormer)

- (arXiv 2022.12) Tune-A-Video: One-Shot Tuning of Image Diffusion Models for **Text-to-Video** Generation, [[Paper]](https://arxiv.org/pdf/2212.11565.pdf), [[Project]](https://tuneavideo.github.io/)

- (arXiv 2022.12) Beyond SOT: It’s Time to **Track** **Multiple** Generic **Objects** at Once, [[Paper]](https://arxiv.org/pdf/2212.11920.pdf)

- (arXiv 2022.12) KNOWLEDGE-DRIVEN SCENE PRIORS FOR SEMANTIC AUDIO-VISUAL **EMBODIED NAVIGATION**, [[Paper]](https://arxiv.org/pdf/2212.11345.pdf)

- (arXiv 2022.12) SegViT: **Semantic Segmentation** with Plain Vision Transformers, [[Paper]](https://arxiv.org/pdf/2210.05844.pdf), [[Code]](https://github.com/zbwxp/SegVit)

- (arXiv 2022.12) Open-Vocabulary **Temporal Action Detection** with Off-the-Shelf Image-Text Features, [[Paper]](https://arxiv.org/pdf/2212.10596.pdf)

- (arXiv 2022.12) Point·E: A System for **Generating 3D Point Clouds** from Complex **Prompts**, [[Paper]](https://arxiv.org/pdf/2212.08751.pdf), [[Code]](https://github.com/openai/point-e)

- (arXiv 2022.12) Inductive Attention for **Video Action Anticipation**, [[Paper]](https://arxiv.org/pdf/2212.08830.pdf)

- (arXiv 2022.12) **Image-and-Language** Understanding from Pixels Only, [[Paper]](https://arxiv.org/pdf/2212.08045.pdf), [[Code]](https://github.com/google-research/big_vision)

- (arXiv 2022.12) FlexiViT: One Model for All **Patch Sizes**, [[Paper]](https://arxiv.org/pdf/2212.08013.pdf), [[Code]](https://github.com/google-research/big_vision)

- (arXiv 2022.12) **Unsupervised** Object **Localization**: Observing the Background to Discover Objects, [[Paper]](https://arxiv.org/pdf/2212.07834.pdf), [[Code]](https://github.com/valeoai/FOUND)

- (arXiv 2022.12) Vision Transformers are Parameter-Efficient **Audio-Visual** Learners, [[Paper]](https://arxiv.org/pdf/2212.07983.pdf), [[Project]](https://genjib.github.io/project_page/LAVISH/)

- (arXiv 2022.12) Full Contextual Attention for Multi-resolution Transformers in **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2212.07890.pdf)

- (arXiv 2022.12) DETR4D: Direct Multi-View **3D Object Detection** with Sparse Attention, [[Paper]](https://arxiv.org/pdf/2212.07849.pdf)

- (arXiv 2022.12) Enhanced Training of Query-Based Object **Detection** via Selective Query Recollection, [[Paper]](https://arxiv.org/pdf/2212.07593.pdf), [[Code]](https://github.com/Fangyi-Chen/SQR)

- (arXiv 2022.12) TEXT-GUIDED MASK-FREE LOCAL **IMAGE RETOUCHING**, [[Paper]](https://arxiv.org/pdf/2212.07603.pdf)

- (arXiv 2022.12) Summary-Oriented Vision Modeling for **Multimodal Abstractive Summarization**, [[Paper]](https://arxiv.org/pdf/2212.07672.pdf), [[Code]](https://github.com/XL2248/SOV-MAS)

- (arXiv 2022.12) One-Shot Domain Adaptive and Generalizable **Semantic Segmentation** with Class-Aware Cross-Domain Transformers, [[Paper]](https://arxiv.org/pdf/2212.07292.pdf)

- (arXiv 2022.12) ConQueR: Query Contrast Voxel-DETR for **3D Object Detection**, [[Paper]](https://arxiv.org/pdf/2212.07289.pdf)

- (arXiv 2022.12) Examining the **Difference** Among **Transformers** and **CNNs** with Explanation Methods, [[Paper]](https://arxiv.org/pdf/2212.06872.pdf)

- (arXiv 2022.12) Find Someone Who: Visual Commonsense Understanding in Human-Centric **Grounding**, [[Paper]](https://arxiv.org/pdf/2212.06971.pdf), [[Code]](https://github.com/Hxyou/HumanCog)

- (arXiv 2022.12) Dual-branch Cross-Patch Attention Learning for **Group Affect Recognition**, [[Paper]](https://arxiv.org/pdf/2212.07055.pdf)

- (arXiv 2022.12) Cross-Modal Similarity-Based Curriculum Learning for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2212.07075.pdf)

- (arXiv 2022.12) NLIP: Noise-robust **Language-Image** Pre-training, [[Paper]](https://arxiv.org/pdf/2212.07086.pdf)

- (arXiv 2022.12) Lidar**CLIP** or: How I Learned to Talk to **Point Clouds**, [[Paper]](https://arxiv.org/pdf/2212.06858.pdf), [[Code]](https://github.com/atonderski/lidarclip)

- (arXiv 2022.12) **CLIP**SEP: LEARNING TEXT-QUERIED **SOUND SEPARATION** WITH NOISY UNLABELED VIDEOS, [[Paper]](https://arxiv.org/pdf/2212.07065.pdf)

- (arXiv 2022.12) Reproducible **scaling laws** for contrastive language-image learning, [[Paper]](https://arxiv.org/pdf/2212.07143.pdf), [[Code]](https://github.com/LAION-AI/scaling-laws-openclip)

- (arXiv 2022.12) WHAT DO VISION TRANSFORMERS LEARN? A VISUAL **EXPLORATION**, [[Paper]](https://arxiv.org/pdf/2212.06727.pdf)

- (arXiv 2022.12) Self-Play and Self-Describe: **Policy Adaptation** with **Vision-Language** Foundation Models, [[Paper]](https://arxiv.org/pdf/2212.07398.pdf), [[Project]](https://geyuying.github.io/SPLAYD)

- (arXiv 2022.12) GPVIT: A **HIGH RESOLUTION** NON-HIERARCHICAL VISION TRANSFORMER WITH GROUP PROPAGATION, [[Paper]](https://arxiv.org/pdf/2212.06795.pdf), [[Code]](https://github.com/ChenhongyiYang/GPViT)

- (arXiv 2022.12) Learning 3D Representations from 2D Pre-trained Models via **Image-to-Point** Masked Autoencoders, [[Paper]](https://arxiv.org/pdf/2212.06785.pdf), [[Code]](https://github.com/ZrrSkywalker/I2P-MAE)

- (arXiv 2022.12) Parallel Queries for **Human-Object Interaction Detection**, [[Paper]](https://dl.acm.org/doi/pdf/10.1145/3551626.3564944)

- (arXiv 2022.12) Structure-Guided **Image Completion** with Image-level and Object-level Semantic Discriminators, [[Paper]](https://arxiv.org/pdf/2212.06310.pdf)

- (arXiv 2022.12) Localized Latent Updates for **Fine-Tuning** **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2212.06556.pdf)

- (arXiv 2022.12) CamoFormer: Masked Separable Attention for **Camouflaged Object Detection**, [[Paper]](https://arxiv.org/pdf/2212.06570.pdf)

- (arXiv 2022.12) FastMIM: Expediting **Masked** Image Modeling Pre-training for Vision, [[Paper]](https://arxiv.org/pdf/2212.06593.pdf), [[Code]](https://github.com/ggjy/FastMIM.pytorch)

- (arXiv 2022.12) OAMixer: Object-aware **Mixing** Layer for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2212.06595.pdf), [[Code]](https://github.com/alinlab/OAMixer)

- (arXiv 2022.12) Doubly Right **Object Recognition**: A Why **Prompt** for Visual **Rationales**, [[Paper]](https://arxiv.org/pdf/2212.06202.pdf)

- (arXiv 2022.12) RT-1: **ROBOTICS** TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE, [[Paper]](https://arxiv.org/pdf/2212.06817.pdf), [[Project]](https://robotics-transformer.github.io/)

- (arXiv 2022.12) **Egocentric Video** Task Translation, [[Paper]](https://arxiv.org/pdf/2212.06301.pdf)

- (arXiv 2022.12) ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved **Visio-Linguistic** Models in **3D** Scenes, [[Paper]](https://arxiv.org/pdf/2212.06250.pdf), [[Project]](https://scanents3d.github.io/)

- (arXiv 2022.12) **Curriculum Learning** Meets Weakly Supervised **Modality Correlation** Learning, [[Paper]](https://arxiv.org/pdf/2212.07619.pdf)

- (arXiv 2022.12) IMoS: Intent-Driven Full-Body **Motion Synthesis** for **Human-Object Interactions**, [[Paper]](https://arxiv.org/pdf/2212.07555.pdf)

- (arXiv 2022.12) MultiAct: Long-Term **3D Human Motion Generation** from Multiple Action Labels, [[Paper]](https://arxiv.org/pdf/2212.05897.pdf)

- (arXiv 2022.12) A New Path: Scaling **Vision-and-Language Navigation** with Synthetic Instructions and Imitation Learning, [[Paper]](https://arxiv.org/pdf/2210.03112.pdf)

- (arXiv 2022.12) Beyond Object Recognition: A New Benchmark towards **Object Concept Learning**, [[Paper]](https://arxiv.org/pdf/2212.02710.pdf), [[Project]](https://mvig-rhos.com/ocl)

- (arXiv 2022.12) ViTPose+: Vision Transformer Foundation Model for Generic Body **Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2212.04246.pdf), [[Code]](https://github.com/ViTAE-Transformer/ViTPose)

- (arXiv 2022.12) Structured **Vision-Language** Pretraining for **Computational** Cooking, [[Paper]](https://arxiv.org/pdf/2212.04267.pdf)

- (arXiv 2022.12) MIME: **Human**-Aware **3D Scene Generation**, [[Paper]](https://arxiv.org/pdf/2212.04360.pdf), [[Project]](https://mime.is.tue.mpg.de/)

- (arXiv 2022.12) OFASys: A **Multi-Modal Multi-Task** Learning System for Building **Generalist Models**, [[Paper]](https://arxiv.org/pdf/2212.04408.pdf), [[Code]](https://github.com/OFA-Sys/OFASys)

- (arXiv 2022.12) Task **Bias** in **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2212.04412.pdf)

- (arXiv 2022.12) Multi-Concept Customization of **Text-to-Image** **Diffusion**, [[Paper]](https://arxiv.org/pdf/2212.04488.pdf), [[Code]](https://www.cs.cmu.edu/~custom-diffusion/)

- (arXiv 2022.12) Few-View Object **Reconstruction** with Unknown Categories and Camera Poses, [[Paper]](https://arxiv.org/pdf/2212.04492.pdf), [[Project]](https://ut-austin-rpl.github.io/FORGE/)

- (arXiv 2022.12) Masked Video Distillation: Rethinking **Masked** Feature Modeling for **Self-supervised** **Video Representation** Learning, [[Paper]](https://arxiv.org/pdf/2212.04500.pdf), [[Code]](https://github.com/ruiwang2021/mvd)

- (arXiv 2022.12) Learning **Video** Representations from **Large Language Models**, [[Paper]](https://arxiv.org/pdf/2212.04501.pdf), [[Project]](https://facebookresearch.github.io/LaViLa)

- (arXiv 2022.12) Frozen **CLIP** Model is Efficient **Point Cloud** Backbone, [[Paper]](https://arxiv.org/pdf/2212.04098.pdf)

- (arXiv 2022.12) DialogCC: Large-scale **Multi-Modal Dialogue** Dataset, [[Paper]](https://arxiv.org/pdf/2212.04119.pdf), [[Project]](https://github.com/passing2961/DialogCC)

- (arXiv 2022.12) Group Generalized Mean **Pooling** for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2212.04114.pdf)

- (arXiv 2022.12) LEARNING DOMAIN INVARIANT **PROMPT** FOR **VISION-LANGUAGE** MODELS, [[Paper]](https://arxiv.org/pdf/2212.04196.pdf)

- (arXiv 2022.12) LLM-Planner: Few-Shot Grounded **Planning** for **Embodied** Agents with **Large Language Models**, [[Paper]](https://arxiv.org/pdf/2212.04088.pdf)

- (arXiv 2022.12) Hyperbolic **Contrastive** Learning for Visual **Representations** beyond Objects, [[Paper]](https://arxiv.org/pdf/2212.00653.pdf), [[Code]](https://github.com/shlokk/HCL/tree/main/HCL)

### 2022.11

- (arXiv 2022.11) Texts as Images in Prompt Tuning for **Multi-Label Image Recognition**, [[Paper]](https://arxiv.org/pdf/2211.12739.pdf), [[Code]](https://github.com/guozix/TaI-DPT)

- (arXiv 2022.11) Tell Me What Happened: Unifying **Text-guided Video Completion** via Multimodal Masked Video Generation, [[Paper]](https://arxiv.org/pdf/2211.12824.pdf)

- (arXiv 2022.11) InDiReCT: Language-Guided Zero-Shot Deep **Metric Learning** for Images, [[Paper]](https://arxiv.org/pdf/2211.12760.pdf)

- (arXiv 2022.11) VoP: Text-Video Co-operative Prompt Tuning for **Cross-Modal Retrieval**, [[Paper]](https://arxiv.org/pdf/2211.12764.pdf), [[Code]](https://github.com/bighuang624/VoP)

- (arXiv 2022.11) **Completing point cloud** from few points by Wasserstein GAN and Transformers, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.12746.pdf), [[Code]](https://github.com/WxfQjh/Stability-point-recovery.git)

- (arXiv 2022.11) Integrally Pre-Trained Transformer **Pyramid** Networks, [[Paper]](https://arxiv.org/pdf/2211.12735.pdf), [[Code]](https://github.com/sunsmarterjie/iTPN)

- (arXiv 2022.11) Data Augmentation Vision Transformer for **Fine-grained Image Classification**, [[Paper]](https://arxiv.org/pdf/2211.12879.pdf)

- (arXiv 2022.11) **DETR**s with Collaborative Hybrid Assignments **Training**, [[Paper]](https://arxiv.org/pdf/2211.12860.pdf), [[Code]](https://github.com/Sense-X/Co-DETR)

- (arXiv 2022.11) Open-vocabulary **Attribute Detection**, [[Paper]](https://arxiv.org/pdf/2211.12914.pdf), [[Project]](https://ovad-benchmark.github.io/)

- (arXiv 2022.11) Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised **Monocular Depth Estimation**, [[Paper]](https://arxiv.org/pdf/2211.13202.pdf), [[Code]](https://github.com/noahzn/Lite-Mono)

- (arXiv 2022.11) Inversion-Based **Creativity Transfer** with Diffusion Models, [[Paper]](https://arxiv.org/pdf/2211.13203.pdf), [[Code]](https://github.com/zyxElsa/creativity-transfer)

- (arXiv 2022.11) CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free **Continual Learning**, [[Paper]](https://arxiv.org/pdf/2211.13218.pdf)

- (arXiv 2022.11) SVFormer: Semi-supervised Video Transformer for **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2211.13222.pdf), [[Code]](https://github.com/ChenHsing/SVFormer)

- (arXiv 2022.11) Generalizable **Implicit Neural Representations** via Instance Pattern Composers, [[Paper]](https://arxiv.org/pdf/2211.13223.pdf)

- (arXiv 2022.11) Improving **Visual-textual Sentiment Analysis** by Fusing Expert Features, [[Paper]](https://arxiv.org/pdf/2211.12981.pdf)

- (arXiv 2022.11) **Self-Supervised** Learning based on Heat Equation, [[Paper]](https://arxiv.org/pdf/2211.13228.pdf)

- (arXiv 2022.11) Peekaboo: **Text to Image** Diffusion Models are Zero-Shot Segmentors, [[Paper]](https://arxiv.org/pdf/2211.13224.pdf)

- (arXiv 2022.11) Paint by Example: Exemplar-based **Image Editing** with Diffusion Models, [[Paper]](https://arxiv.org/pdf/2211.13227.pdf), [[Code]](https://github.com/Fantasy-Studio/Paint-by-Example)

- (arXiv 2022.11) Human or Machine? **Turing Tests** for Vision and Language, [[Paper]](https://arxiv.org/pdf/2211.13087.pdf), [[Code]](https://tinyurl.com/8x8nha7p)

- (arXiv 2022.11) Teach-DETR: Better **Training** **DETR** with Teachers, [[Paper]](https://arxiv.org/pdf/2211.11953.pdf), [[Code]](https://github.com/LeonHLJ/Teach-DETR)

- (arXiv 2022.11) Conv2Former: A Simple Transformer-Style **ConvNet** for Visual Recognition, [[Paper]](https://arxiv.org/pdf/2211.11943.pdf)

- (arXiv 2022.11) X^2-VLM: All-In-One Pre-trained Model For **Vision-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2211.12402.pdf), [[Code]](https://github.com/zengyan-97/X2-VLM)

- (arXiv 2022.11) Aligning Source Visual and Target Language Domains for Unpaired **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2211.12148.pdf)

- (arXiv 2022.11) On the Transferability of Visual Features in **Generalized Zero-Shot Learning**, [[Paper]](https://arxiv.org/pdf/2211.12494.pdf), [[Code]](https://github.com/uvavision/TV-GZSL)

- (arXiv 2022.11) Generalizable Industrial Visual **Anomaly Detection** with Self-Induction Vision Transformer, [[Paper]](https://arxiv.org/pdf/2211.12311.pdf)

- (arXiv 2022.11) Transformer Based Multi-Grained Features for Unsupervised **Person Re-Identification**, [[Paper]](https://arxiv.org/pdf/2211.12280.pdf), [[Code]](https://github.com/RikoLi/WACV23-workshop-TMGF)

- (arXiv 2022.11) Efficient Frequency Domain-based Transformers for High-Quality Image **Deblurring**, [[Paper]](https://arxiv.org/pdf/2211.12250.pdf), [[Code]](https://github.com/kkkls/FFTformer)

- (arXiv 2022.11) Event Transformer+. A multi-purpose solution for efficient **event data processing**, [[Paper]](https://arxiv.org/pdf/2211.12222.pdf)

- (arXiv 2022.11) MagicPony: Learning Articulated **3D Animals** in the Wild, [[Paper]](https://arxiv.org/pdf/2211.12497.pdf), [[Project]](https://3dmagicpony.github.io/)

- (arXiv 2022.11) Gated Class-Attention with Cascaded Feature Drift Compensation for Exemplar-free **Continual Learning** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.12292.pdf), [[Code]](https://github.com/OcraM17/GCAB-CFDC)

- (arXiv 2022.11) Expectation-Maximization Contrastive Learning for Compact **Video-and-Language** Representations, [[Paper]](https://arxiv.org/pdf/2211.11427.pdf), [[Code]](https://github.com/jpthu17/EMCL)

- (arXiv 2022.11) N-Gram in Swin Transformers for Efficient Lightweight **Image Super-Resolution**, [[Paper]](https://arxiv.org/pdf/2211.11436.pdf)

- (arXiv 2022.11) **Robotic** Skill Acquisition via Instruction Augmentation with Vision-Language Models, [[Paper]](https://arxiv.org/pdf/2211.11736.pdf), [[Code]](https://instructionaugmentation.github.io/)

- (arXiv 2022.11) Peeling the Onion: Hierarchical Reduction of Data Redundancy for **Efficient** Vision Transformer **Training**, [[Paper]](https://arxiv.org/pdf/2211.10801.pdf), [[Code]](https://github.com/ZLKong/Tri-Level-ViT)

- (arXiv 2022.11) Unifying **Vision-Language** Representation Space with Single-tower Transformer, [[Paper]](https://arxiv.org/pdf/2211.11153.pdf)

- (arXiv 2022.11) DeepSolo: Let Transformer Decoder with Explicit Points Solo for **Text Spotting**, [[Paper]](https://arxiv.org/pdf/2211.10772.pdf)

- (arXiv 2022.11) Castling-ViT: **Compressing Self-Attention** via Switching Towards Linear-Angular Attention During Vision Transformer Inference, [[Paper]](https://arxiv.org/pdf/2211.10526.pdf)

- (arXiv 2022.11) CL-CrossVQA: A Continual Learning Benchmark for **Cross-Domain Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2211.10567.pdf)

- (arXiv 2022.11) Normal Transformer: Extracting Surface Geometry from **LiDAR** Points Enhanced by Visual Semantics, [[Paper]](https://arxiv.org/pdf/2211.10580.pdf)

- (arXiv 2022.11) A Unified Model for **Video** Understanding and Knowledge Embedding with Heterogeneous **Knowledge Graph** Dataset, [[Paper]](https://arxiv.org/pdf/2211.10624.pdf)

- (arXiv 2022.11) Efficient **Video Representation** Learning via Masked Video Modeling with Motion-centric Token Selection, [[Paper]](https://arxiv.org/pdf/2211.10636.pdf)

- (arXiv 2022.11) DiffStyler: Controllable Dual Diffusion for Text-Driven **Image Stylization**, [[Paper]](https://arxiv.org/pdf/2211.10682.pdf)

- (arXiv 2022.11) TORE: Token Reduction for Efficient **Human Mesh Recovery** with Transformer, [[Paper]](https://arxiv.org/pdf/2211.10705.pdf)

- (arXiv 2022.11) **Synthesizing** Coherent **Story** with Auto-Regressive Latent Diffusion Models, [[Paper]](https://arxiv.org/pdf/2211.10950.pdf), [[Code]](https://github.com/Flash-321/ARLDM)

- (arXiv 2022.11) Are **Out-of-Distribution Detection** Methods Reliable?, [[Paper]](https://arxiv.org/pdf/2211.10892.pdf)

- (arXiv 2022.11) GLT-T: Global-Local Transformer Voting for **3D Single Object Tracking** in Point Clouds, [[Paper]](https://arxiv.org/pdf/2211.10927.pdf), [[Code]](https://github.com/haooozi/GLT-T)

- (arXiv 2022.11) CROSS-MODAL CONTRASTIVE LEARNING FOR ROBUST REASONING IN **VQA**, [[Paper]](https://arxiv.org/pdf/2211.11190.pdf), [[Code]](https://github.com/qizhust/cmcl_vqa_pl)

- (arXiv 2022.11) LISA: Localized **Image Stylization** with Audio via Implicit Neural Representation, [[Paper]](https://arxiv.org/pdf/2211.11381.pdf)

- (arXiv 2022.11) MagicVideo: Efficient **Video Generation** With Latent Diffusion Models, [[Paper]](https://arxiv.org/pdf/2211.11018.pdf), [[Code]](https://magicvideo.github.io/#)

- (arXiv 2022.11) DreamArtist: Towards Controllable One-Shot **Text-to-Image** Generation via Contrastive Prompt-Tuning, [[Paper]](https://arxiv.org/pdf/2211.11337.pdf)

- (arXiv 2022.11) Hybrid Transformer Based Feature Fusion for Self-Supervised **Monocular Depth Estimation**, [[Paper]](https://arxiv.org/pdf/2211.11066.pdf)

- (arXiv 2022.11) Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable **Image Classification**, [[Paper]](https://arxiv.org/pdf/2211.11158.pdf)

- (arXiv 2022.11) Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language **Navigation**, [[Paper]](https://arxiv.org/pdf/2211.11116.pdf)

- (arXiv 2022.11) You Need Multiple Exiting: Dynamic Early Exiting for **Accelerating** Unified Vision Language Model, [[Paper]](https://arxiv.org/pdf/2211.11152.pdf)

- (arXiv 2022.11) Beyond Attentive Tokens: Incorporating Token Importance and Diversity for **Efficient** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.11315.pdf)

- (arXiv 2022.11) FlowLens: Seeing Beyond the **FoV** via Flow-guided **Clip**-Recurrent Transformer, [[Paper]](https://arxiv.org/pdf/2211.11293.pdf), [[Code]](https://github.com/MasterHow/FlowLens)

- (arXiv 2022.11) PS-Transformer: Learning Sparse **Photometric Stereo** Network using Self-Attention Mechanism, [[Paper]](https://arxiv.org/pdf/2211.11386.pdf)

- (arXiv 2022.11) On the Robustness, Generalization, and Forgetting of Shape-Texture Debiased **Continual Learning**, [[Paper]](https://arxiv.org/pdf/2211.11174.pdf)

- (arXiv 2022.11) Vision Transformer with Super **Token Sampling**, [[Paper]](https://arxiv.org/pdf/2211.11167.pdf), [[Code]](https://github.com/hhb072/SViT)

- (arXiv 2022.11) Detect Only What You Specify : Object **Detection** with Linguistic Target, [[Paper]](https://arxiv.org/pdf/2211.11572.pdf)

- (arXiv 2022.11) Visual Programming: Compositional **visual reasoning** without training, [[Paper]](https://arxiv.org/pdf/2211.11559.pdf), [[Project]](https://prior.allenai.org/projects/visprog)

- (arXiv 2022.11) ClipCrop: Conditioned **Cropping** Driven by **Vision-Language** Model, [[Paper]](https://arxiv.org/pdf/2211.11492.pdf)

- (arXiv 2022.11) SMAUG: Sparse **Masked** Autoencoder for **Efficient** **Video-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2211.11446.pdf)

- (arXiv 2022.11) **Blur Interpolation** Transformer for Real-World Motion from Blur, [[Paper]](https://arxiv.org/pdf/2211.11423.pdf)

- (arXiv 2022.11) Mean Shift Mask Transformer for Unseen Object Instance **Segmentation**, [[Paper]](https://arxiv.org/pdf/2211.11679.pdf), [[Code]](https://github.com/YoungSean/UnseenObjectsWithMeanShift)

- (arXiv 2022.11) PointCLIP V2: Adapting **CLIP** for Powerful **3D** Open-world Learning, [[Paper]](https://arxiv.org/pdf/2211.11682.pdf), [[Code]](https://github.com/yangyangyang127/PointCLIP_V2)

- (arXiv 2022.11) Exploring Discrete **Diffusion** Models for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2211.11694.pdf), [[Code]](https://github.com/buxiangzhiren/DDCap)

- (arXiv 2022.11) PERCEIVER-VL: **Efficient** **Vision-and-Language** Modeling with Iterative Latent Attention, [[Paper]](https://arxiv.org/pdf/2211.11701.pdf), [[Code]](https://github.com/zinengtang/Perceiver_VL)

- (arXiv 2022.11) Multitask **Vision-Language** **Prompt** Tuning, [[Paper]](https://arxiv.org/pdf/2211.11720.pdf), [[Code]](https://github.com/sIncerass/MVLPT)

- (arXiv 2022.11) Teaching **Structured** **Vision & Language** Concepts to Vision & Language Models, [[Paper]](https://arxiv.org/pdf/2211.11733.pdf)

- (arXiv 2022.11) WEIGHTED **ENSEMBLE** **SELF-SUPERVISED** LEARNING, [[Paper]](https://arxiv.org/pdf/2211.09981.pdf)

- (arXiv 2022.11) BEVFormer v2: Adapting Modern Image Backbones to **Bird’s-Eye-View Recognition** via Perspective Supervision, [[Paper]](https://arxiv.org/pdf/2211.10439.pdf)

- (arXiv 2022.11) Task Residual for Tuning **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2211.10277.pdf), [[Code]](https://github.com/geekyutao/TaskRes)

- (arXiv 2022.11) α DARTS Once More: Enhancing Differentiable **Architecture Search** by **Masked** Image Modeling, [[Paper]](https://arxiv.org/pdf/2211.10105.pdf)

- (arXiv 2022.11) Delving into Transformer for Incremental Semantic **Segmentation**, [[Paper]](https://arxiv.org/pdf/2211.10253.pdf)

- (arXiv 2022.11) DETRDistill: A Universal **Knowledge Distillation** Framework for **DETR**-families, [[Paper]](https://arxiv.org/pdf/2211.10156.pdf)

- (arXiv 2022.11) PromptCap: Prompt-Guided Task-Aware Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2211.09699.pdf)

- (arXiv 2022.11) UNIFORMERV2: SPATIOTEMPORAL LEARNING BY ARMING IMAGE VITS WITH **VIDEO** UNIFORMER, [[Paper]](https://arxiv.org/pdf/2211.09552.pdf), [[Code]](https://github.com/OpenGVLab/UniFormerV2)

- (arXiv 2022.11) **Masked** Reconstruction **Contrastive** Learning with Information Bottleneck Principle, [[Paper]](https://arxiv.org/pdf/2211.09013.pdf)

- (arXiv 2022.11) Listen, denoise, action! Audio-driven **motion synthesis** with diffusion models, [[Paper]](https://arxiv.org/pdf/2211.09707.pdf), [[Project]](https://www.speech.kth.se/research/listen-denoise-action/)

- (arXiv 2022.11) ConStruct-VL: Data-Free Continual **Structured VL Concepts** Learning, [[Paper]](https://arxiv.org/pdf/2211.09790.pdf)

- (arXiv 2022.11) How to **Fine-Tune** Vision Models with **SGD**, [[Paper]](https://arxiv.org/pdf/2211.09359.pdf)

- (arXiv 2022.11) Progressive Tree-Structured Prototype Network for End-to-End Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2211.09460.pdf), [[Code]](https://github.com/NovaMind-Z/PTSN)

- (arXiv 2022.11) CapEnrich: Enriching **Caption** Semantics for Web Images via Cross-modal Pre-trained Knowledge, [[Paper]](https://arxiv.org/pdf/2211.09371.pdf)

- (arXiv 2022.11) Visual Commonsense-aware Representation Network for **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2211.09469.pdf), [[Code]](https://github.com/zchoi/VCRN)

- (arXiv 2022.11) Language Conditioned Spatial Relation Reasoning for **3D Object Grounding**, [[Paper]](https://arxiv.org/pdf/2211.09646.pdf), [[Code]](https://cshizhe.github.io/projects/vil3dref.html)

- (arXiv 2022.11) HARDVS: Revisiting Human **Activity Recognition** with **Dynamic Vision Sensors**, [[Paper]](https://arxiv.org/pdf/2211.09648.pdf), [[Code]](https://github.com/Event-AHU/HARDVS)

- (arXiv 2022.11) Towards All-in-one **Pre-training** via Maximizing **Multi-modal** Mutual Information, [[Paper]](https://arxiv.org/pdf/2211.09807.pdf), [[Code]](https://github.com/OpenGVLab/M3I-Pretraining)

- (arXiv 2022.11) Uni-Perceiver v2: A **Generalist** Model for Large-Scale **Vision** and **Vision-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2211.09808.pdf), [[Code]](https://github.com/fundamentalvision/Uni-Perceiver)

- (arXiv 2022.11) D^3ETR: Decoder **Distillation** for **Detection** Transformer, [[Paper]](https://arxiv.org/pdf/2211.09768.pdf)

- (arXiv 2022.11) **CAE** v2: Context Autoencoder with **CLIP** Target, [[Paper]](https://arxiv.org/pdf/2211.09799.pdf)

- (arXiv 2022.11) Cross-Modal Adapter for **Text-Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2211.09623.pdf), [[Code]](https://github.com/LeapLabTHU/Cross-Modal-Adapter)

- (arXiv 2022.11) TOKEN **TURING MACHINES**, [[Paper]](https://arxiv.org/pdf/2211.09119.pdf)

- (arXiv 2022.11) WILL LARGE-SCALE **GENERATIVE** MODELS CORRUPT **FUTURE DATASETS**? [[Paper]](https://arxiv.org/pdf/2211.08095.pdf), [[Code]](https://github.com/moskomule/dataset-contamination)

- (arXiv 2022.11) Demystify **Self-Attention** in Vision Transformers from a Semantic Perspective: Analysis and Application, [[Paper]](https://arxiv.org/pdf/2211.08543.pdf)

- (arXiv 2022.11) SATVSR: Scenario Adaptive Transformer for Cross Scenarios **Video Super-Resolution**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.08703.pdf)

- (arXiv 2022.11) TransCC: Transformer-based **Multiple Illuminant Color Constancy** Using Multitask Learning, [[Paper]](https://arxiv.org/pdf/2211.08772.pdf)

- (arXiv 2022.11) Stare at What You See: **Masked Image Modeling** without Reconstruction, [[Paper]](https://arxiv.org/pdf/2211.08887.pdf), [[Code]](https://github.com/OpenPerceptionX/maskalign)

- (arXiv 2022.11) HeatViT: Hardware-Efficient Adaptive **Token Pruning** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.08110.pdf)

- (arXiv 2022.11) Cross-domain Federated Adaptive **Prompt Tuning** for **CLIP**, [[Paper]](https://arxiv.org/pdf/2211.07864.pdf)

- (arXiv 2022.11) YORO - Lightweight End to End **Visual Grounding**, [[Paper]](https://arxiv.org/pdf/2211.07912.pdf)

- (arXiv 2022.11) **Knowledge Distillation** for Detection Transformer with Consistent Distillation Points Sampling, [[Paper]](https://arxiv.org/pdf/2211.08071.pdf)

- (arXiv 2022.11) BiViT: Extremely **Compressed** **Binary** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2211.07091.pdf)

- (arXiv 2022.11) ContextCLIP: Contextual Alignment of **Image-Text** pairs on **CLIP** visual representations, [[Paper]](https://arxiv.org/pdf/2211.07122.pdf)

- (arXiv 2022.11) Zero-shot Image **Captioning** by Anchor-augmented Vision-Language Space Alignment, [[Paper]](https://arxiv.org/pdf/2211.07275.pdf)

- (arXiv 2022.11) Seeing Beyond the **Brain**: Conditional Diffusion Model with Sparse Masked Modeling for **Vision Decoding**, [[Paper]](https://arxiv.org/pdf/2211.06956.pdf), [[Project]](https://mind-vis.github.io/)

- (arXiv 2022.11) Enhancing **Few-Shot Image Classification** with Cosine Transformer, [[Paper]](https://arxiv.org/pdf/2211.06828.pdf), [[Code]](https://github.com/vinuni-vishc/Few-Shot-Cosine-Transformer)

- (arXiv 2022.11) SCOTCH and SODA: A Transformer **Video Shadow Detection** Framework, [[Paper]](https://arxiv.org/pdf/2211.06885.pdf)

- (arXiv 2022.11) AU-Aware Vision Transformers for Biased **Facial Expression Recognition**, [[Paper]](https://arxiv.org/pdf/2211.06609.pdf)

- (arXiv 2022.11) Fast Text-Conditional Discrete **Denoising** on Vector-Quantized Latent Spaces, [[Paper]](https://arxiv.org/pdf/2211.07292.pdf), [[Code]](https://github.com/dome272/Paella)

- (arXiv 2022.11) Large-Scale Bidirectional Training for Zero-Shot Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2211.06774.pdf)

- (arXiv 2022.11) Grafting Pre-trained Models for Multimodal **Headline Generation**, [[Paper]](https://arxiv.org/pdf/2211.07210.pdf)

- (arXiv 2022.11) CabViT: Cross **Attention** among Blocks for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2211.07198.pdf), [[Code]](https://github.com/hkzhang91/CabViT)

- (arXiv 2022.11) **Composed Image Retrieval** with Text Feedback via Multi-grained Uncertainty Regularization, [[Paper]](https://arxiv.org/pdf/2211.07394.pdf)

- (arXiv 2022.11) SSGVS: Semantic **Scene Graph-to-Video** Synthesis, [[Paper]](https://arxiv.org/pdf/2211.06119.pdf)

- (arXiv 2022.11) One-Time **Model Adaptation** to Heterogeneous Clients: An Intra-Client and Inter-Image Attention Design, [[Paper]](https://arxiv.org/pdf/2211.06276.pdf)

- (arXiv 2022.11) An Improved End-to-End **Multi-Target Tracking** Method Based on Transformer Self-Attention, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.06001.pdf)

- (arXiv 2022.11) Zero-shot Visual Commonsense **Immorality Prediction**, [[Paper]](https://arxiv.org/pdf/2211.05521.pdf), [[Code]](https://github.com/ku-vai/Zero-shot-Visual-Commonsense-Immorality-Prediction)

- (arXiv 2022.11) Hyperbolic Cosine Transformer for **LiDAR 3D Object Detection**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.05580.pdf)

- (arXiv 2022.11) **Training** a Vision Transformer from scratch in less than 24 hours with 1 GPU, [[Paper]](https://arxiv.org/pdf/2211.05187.pdf), [[Code]](https://github.com/BorealisAI/efficient-vit-training)

- (arXiv 2022.11) ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer **Acceleration** with a Linear Taylor Attention, [[Paper]](https://arxiv.org/pdf/2211.05109.pdf)

- (arXiv 2022.11) SimOn: A Simple Framework for **Online Temporal Action Localization**, [[Paper]](https://arxiv.org/pdf/2211.04905.pdf), [[Code]](https://github.com/TuanTNG/SimOn)

- (arXiv 2022.11) ERNIE-UNIX^2: A UNIFIED **CROSS-LINGUAL CROSS-MODAL** FRAMEWORK FOR UNDERSTANDING AND GENERATION, [[Paper]](https://arxiv.org/pdf/2211.04861.pdf)

- (arXiv 2022.11) SG-Shuffle: Multi-aspect Shuffle Transformer for **Scene Graph Generation**, [[Paper]](https://arxiv.org/pdf/2211.04773.pdf)

- (arXiv 2022.11) Understanding Cross-modal Interactions in V&L Models that Generate **Scene Descriptions**, [[Paper]](https://arxiv.org/pdf/2211.04971.pdf)

- (arXiv 2022.11) VieCap4H - VLSP 2021: ObjectAoA - Enhancing performance of Object Relation Transformer with Attention on Attention for **Vietnamese** image **captioning**, [[Paper]](https://arxiv.org/pdf/2211.05405.pdf)

- (arXiv 2022.11) Watching the News: Towards **VideoQA** Models that can Read, [[Paper]](https://arxiv.org/pdf/2211.05588.pdf), [[Project]](http://cvit.iiit.ac.in/research/projects/cvit-projects/videoqa)

- (arXiv 2022.11) Efficient Joint **Detection** and **Multiple Object Tracking** with Spatially Aware Transformer, [[Paper]](https://arxiv.org/pdf/2211.05654.pdf)

- (arXiv 2022.11) **Demystify** Transformers & **Convolutions** in Modern Image Deep Networks, [[Paper]](https://arxiv.org/pdf/2211.05781.pdf), [[Code]](https://github.com/OpenGVLab/STM-Evaluation)

- (arXiv 2022.11) InternImage: Exploring Large-Scale Vision Foundation Models with **Deformable Convolutions**, [[Paper]](https://arxiv.org/pdf/2211.05778.pdf), [[Code]](https://github.com/OpenGVLab/InternImage)

- (arXiv 2022.11) DEPTHFORMER: MULTIMODAL POSITIONAL ENCODINGS AND CROSS-INPUT ATTENTION FOR TRANSFORMER-BASED **SEGMENTATION** NETWORKS, [[Paper]](https://arxiv.org/pdf/2211.04188.pdf)

- (arXiv 2022.11) Sequential Transformer for End-to-End **Person Search**, [[Paper]](https://arxiv.org/pdf/2211.04323.pdf)

- (arXiv 2022.11) Prompting Large Pre-trained Vision-Language Models For **Compositional Concept Learning**, [[Paper]](https://arxiv.org/pdf/2211.05077.pdf)

- (arXiv 2022.11) CASA: Category-agnostic **Skeletal Animal Reconstruction**, [[Paper]](https://arxiv.org/pdf/2211.03568.pdf)

- (arXiv 2022.11) ViT-CX: Causal **Explanation** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.03064.pdf)

- (arXiv 2022.11) Disentangling Content and Motion for **Text-Based Neural Video Manipulation**, [[Paper]](https://arxiv.org/pdf/2211.02980.pdf)

- (arXiv 2022.11) **Efficient** Multi-order Gated Aggregation Network, [[Paper]](https://arxiv.org/pdf/2211.03295.pdf)

- (arXiv 2022.11) CLOP: **Video-and-Language** Pre-Training with Knowledge Regularizations, [[Paper]](https://arxiv.org/pdf/2211.03314.pdf)

- (arXiv 2022.11) MSMG-Net: Multi-scale Multi-grained Supervised Networks for Multi-task Image Manipulation **Detection** and **Localization**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.03140.pdf)

- (arXiv 2022.11) Understanding and Mitigating Overfitting in **Prompt** Tuning for **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2211.02219.pdf), [[Code]](https://tinyurl.com/mpe64f89)

- (arXiv 2022.11) Zero-shot **Video Moment Retrieval** With Off-the-Shelf Models, [[Paper]](https://arxiv.org/pdf/2211.02178.pdf)

- (arXiv 2022.11) Scaling **Multimodal** Pre-Training via Cross-Modality Gradient Harmonization, [[Paper]](https://arxiv.org/pdf/2211.02077.pdf)

- (arXiv 2022.11) A Transformer Architecture for Online **Gesture Recognition** of Mathematical Expressions, [[Paper]](https://arxiv.org/pdf/2211.02643.pdf)

- (arXiv 2022.11) Evaluating and Improving Factuality in **Multimodal Abstractive Summarization**, [[Paper]](https://arxiv.org/pdf/2211.02580.pdf), [[Code]](https://github.com/meetdavidwan/faithful-multimodal-summ)

- (arXiv 2022.11) RCDPT: **RADAR-CAMERA FUSION** DENSE PREDICTION TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2211.02432.pdf)

- (arXiv 2022.11) **Video Event Extraction** via Tracking Visual States of Arguments, [[Paper]](https://arxiv.org/pdf/2211.01781.pdf)

- (arXiv 2022.11) The **Lottery Ticket** Hypothesis for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.01484.pdf)

- (arXiv 2022.11) TEXTCRAFT: ZERO-SHOT GENERATION OF HIGH-FIDELITY AND DIVERSE **SHAPES FROM TEXT**, [[Paper]](https://arxiv.org/pdf/2211.01427.pdf)

- (arXiv 2022.11) PolyBuilding: Polygon Transformer for End-to-End **Building Extraction**, [[Paper]](https://arxiv.org/pdf/2211.01589.pdf)

- (arXiv 2022.11) RETHINKING **HIERARCHIES** IN PRE-TRAINED PLAIN VISION TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2211.01785.pdf), [[Code]](https://github.com/ViTAE-Transformer/HPViT)

- (arXiv 2022.11) SAP-**DETR**: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency, [[Paper]](https://arxiv.org/pdf/2211.02006.pdf)

- (arXiv 2022.11) Could Giant Pretrained Image Models Extract **Universal Representations**? [[Paper]](https://arxiv.org/pdf/2211.02043.pdf)

- (arXiv 2022.11) MAEDAY: MAE for few and zero shot **AnomalY-Detection**, [[Paper]](https://arxiv.org/pdf/2211.14307.pdf), [[Code]](https://github.com/EliSchwartz/MAEDAY)

- (arXiv 2022.11) Degenerate Swin to Win: Plain **Window-based** Transformer without Sophisticated Operations, [[Paper]](https://arxiv.org/pdf/2211.14255.pdf)

- (arXiv 2022.11) Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for **3D Visual Grounding**, [[Paper]](https://arxiv.org/pdf/2211.14241.pdf), [[Code]](https://eslambakr.github.io/LAR.github.io/)

- (arXiv 2022.11) SpaText: Spatio-Textual Representation for **Controllable Image Generation**, [[Paper]](https://arxiv.org/pdf/2211.14305.pdf), [[Project]](https://omriavrahami.com/spatext)

- (arXiv 2022.11) Learning **3D** Scene Priors with **2D** Supervision, [[Paper]](https://arxiv.org/pdf/2211.14157.pdf), [[Project]](https://yinyunie.github.io/sceneprior-page/)

- (arXiv 2022.11) PoET: Pose Estimation Transformer for Single-View, Multi-Object **6D Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2211.14125.pdf), [[Code]](https://github.com/aau-cns/poet)

- (arXiv 2022.11) Spatial-Spectral Transformer for **Hyperspectral Image Denoising**, [[Paper]](https://arxiv.org/pdf/2211.14090.pdf), [[Code]](https://github.com/MyuLi/SST)

- (arXiv 2022.11) Training **Vision-Language** Models with Less Bimodal Supervision, [[Paper]](https://arxiv.org/pdf/2211.00262.pdf)

- (arXiv 2022.11) Text-Only Training for Image **Captioning** using Noise-Injected **CLIP**, [[Paper]](https://arxiv.org/pdf/2211.00575.pdf), [[Code]](https://github.com/DavidHuji/CapDec)

- (arXiv 2022.11) Attention-based **Neural Cellular Automata**, [[Paper]](https://arxiv.org/pdf/2211.01233.pdf)

- (arXiv 2022.11) eDiff-I: **Text-to-Image** Diffusion Models with an Ensemble of Expert Denoisers, [[Paper]](https://arxiv.org/pdf/2211.01324.pdf), [[Code]](https://deepimagination.cc/eDiff-I/)

- (arXiv 2022.11) Chinese CLIP: Contrastive **Vision-Language** Pretraining in **Chinese**, [[Paper]](https://arxiv.org/pdf/2211.01335.pdf), [[Code]](https://github.com/OFA-Sys/Chinese-CLIP)

- (arXiv 2022.11) P^3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for **Open-Vocabulary Object Detection**, [[Paper]](https://arxiv.org/pdf/2211.00849.pdf)

- (arXiv 2022.11) tSF: Transformer-based Semantic Filter for **Few-Shot Learning**, [[Paper]](https://arxiv.org/pdf/2211.00868.pdf)

- (arXiv 2022.11) WITT: A WIRELESS IMAGE TRANSMISSION TRANSFORMER FOR **SEMANTIC COMMUNICATIONS**, [[Paper]](https://arxiv.org/pdf/2211.00937.pdf), [[Code]](https://github.com/KeYang8/WITT)

- (arXiv 2022.11) Pair DETR: Contrastive Learning **Speeds Up** **DETR** Training, [[Paper]](https://arxiv.org/pdf/2210.16476.pdf)

- (arXiv 2022.11) Interaction Visual Transformer for **Egocentric Action Anticipation**, [[Paper]](https://arxiv.org/pdf/2211.14154.pdf)

- (arXiv 2022.11) UDE: A Unified Driving Engine for Human **Motion Generation**, [[Paper]](https://arxiv.org/pdf/2211.16016.pdf), [[Code]](https://github.com/zixiangzhou916/UDE/)

- (arXiv 2022.11) Action-**GPT**: Leveraging Large-scale Language Models for Improved and Generalized Zero Shot **Action Generation**, [[Paper]](https://arxiv.org/pdf/2211.15603.pdf), [[Project]](https://actiongpt.github.io/)

- (arXiv 2022.11) Knowledge **Prompting** for Few-shot **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2211.12030.pdf)

- (arXiv 2022.11) UPainting: Unified **Text-to-Image** Diffusion Generation with Cross-modal Guidance, [[Paper]](https://arxiv.org/pdf/2210.16031.pdf), [[Project]](https://upainting.github.io/)

- (arXiv 2022.11) LVP-M^3: Language-aware Visual Prompt for **Multilingual Multimodal Machine Translation**, [[Paper]](https://arxiv.org/pdf/2210.15461.pdf)

- (arXiv 2022.11) PROCONTEXT: PROGRESSIVE CONTEXT TRANSFORMER FOR **TRACKING**, [[Paper]](https://arxiv.org/pdf/2210.15511.pdf), [[Code]](https://github.com/jp-lan/ProContEXT)

- (arXiv 2022.11) Video based Object **6D Pose Estimation** using Transformers, [[Paper]](https://arxiv.org/pdf/2210.13540.pdf), [[Code]](https://github.com/ApoorvaBeedu/VideoPose)

- (arXiv 2022.11) S2WAT: **Image Style Transfer** via Hierarchical Vision Transformer using Strips Window Attention, [[Paper]](https://arxiv.org/pdf/2210.12381.pdf), [[Code]](https://github.com/AlienZhang1996/S2WAT)

- (arXiv 2022.11) Holistic Interaction Transformer Network for **Action Detection**, [[Paper]](https://arxiv.org/pdf/2210.12686.pdf), [[Code]](https://github.com/joslefaure/HIT)

- (arXiv 2022.11) Learning and Retrieval from Prior Data for Skill-based **Imitation Learning**, [[Paper]](https://arxiv.org/pdf/2210.11435.pdf), [[Code]](https://ut-austin-rpl.github.io/sailor)

- (arXiv 2022.11) SimpleClick: **Interactive** Image **Segmentation** with Simple Vision Transformers, [[Paper]](https://arxiv.org/pdf/2210.11006.pdf), [[Code]](https://github.com/uncbiag/SimpleClick)

- (arXiv 2022.11) TANGO: **Text-driven** Photorealistic and Robust **3D Stylization** via Lighting Decomposition, [[Paper]](https://arxiv.org/pdf/2210.11277.pdf), [[Code]](https://cyw-3d.github.io/tango/)

- (arXiv 2022.11) CPL: Counterfactual **Prompt** Learning for **Vision and Language** Models, [[Paper]](https://arxiv.org/pdf/2210.10362.pdf), [[Code]](https://github.com/eric-ai-lab/CPL)

- (arXiv 2022.11) Plug-and-Play VQA: Zero-shot **VQA** by Conjoining Large Pretrained Models with Zero Training, [[Paper]](https://arxiv.org/pdf/2210.08773.pdf)

- (arXiv 2022.11) Selective Query-guided Debiasing for **Video** Corpus Moment **Retrieval**, [[Paper]](https://arxiv.org/pdf/2210.08714.pdf)

- (arXiv 2022.11) Scaling & Shifting Your Features: A New Baseline for **Efficient Model Tuning**, [[Paper]](https://arxiv.org/pdf/2210.08823.pdf), [[Code]](https://github.com/dongzelian/SSF)

- (arXiv 2022.11) DENOISING **MASKED AUTOENCODERS** ARE CERTIFIABLE ROBUST VISION LEARNERS, [[Paper]](https://arxiv.org/pdf/2210.06983.pdf), [[Code]](https://github.com/quanlin-wu/dmae)

- (arXiv 2022.11) **Token-Label Alignment** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2210.06455.pdf), [[Code]](https://github.com/Euphoria16/TL-Align)

- (arXiv 2022.11) **CLIP**-Fields: Weakly Supervised Semantic Fields for **Robotic** Memory, [[Paper]](https://arxiv.org/pdf/2210.05663.pdf), [[Code]](https://mahis.life/clip-fields)

- (arXiv 2022.11) Multi-Scale Wavelet Transformer for **Face Forgery Detection**, [[Paper]](https://arxiv.org/pdf/2210.03899.pdf)

- (arXiv 2022.11) **CLIP**-PAE: PROJECTION-AUGMENTATION EMBEDDING TO EXTRACT RELEVANT FEATURES FOR A DISENTANGLED, INTERPRETABLE, AND CONTROLLABLE **TEXT-GUIDED IMAGE MANIPULATION**, [[Paper]](https://arxiv.org/pdf/2210.03919.pdf)

- (arXiv 2022.11) VISUAL PROMPT TUNING FOR **TEST-TIME DOMAIN ADAPTATION**, [[Paper]](https://arxiv.org/pdf/2210.04831.pdf)

- (arXiv 2022.11) Fast**CLIP**styler: Optimisation-free Text-based **Image Style Transfer** Using Style Representations, [[Paper]](https://arxiv.org/pdf/2210.03461.pdf)

- (arXiv 2022.11) PROGRESSIVE DENOISING MODEL FOR FINE-GRAINED **TEXT-TO-IMAGE** GENERATION, [[Paper]](https://arxiv.org/pdf/2210.02291.pdf)

- (arXiv 2022.11) **DALL-E**-Bot: Introducing Web-Scale Diffusion Models to **Robotics**, [[Paper]](https://arxiv.org/pdf/2210.02438.pdf), [[Project]](https://www.robot-learning.uk/dall-e-bot)

- (arXiv 2022.11) Decomposed Soft Prompt Guided Fusion Enhancing for **Compositional Zero-Shot Learning**, [[Paper]](https://arxiv.org/pdf/2211.10681.pdf), [[Code]](https://github.com/Forest-art/DFSP.git)

- (arXiv 2022.11) ACCURATE **IMAGE RESTORATION** WITH ATTENTION RETRACTABLE TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2210.01427.pdf), [[Code]](https://github.com/gladzhang/ART)

- (arXiv 2022.11) **Dilated** Neighborhood **Attention** Transformer, [[Paper]](https://arxiv.org/pdf/2209.15001.pdf), [[Code]](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer)

- (arXiv 2022.11) Unified Loss of Pair Similarity Optimization for **Vision-Language Retrieval**, [[Paper]](https://arxiv.org/pdf/2209.13869.pdf)

- (arXiv 2022.11) TVLT: Textless **Vision-Language** Transformer, [[Paper]](https://arxiv.org/pdf/2209.14156.pdf), [[Code]](https://github.com/zinengtang/TVLT)

### 2022.10

- (arXiv 2022.10) DiMBERT: Learning **Vision-Language** Grounded Representations with Disentangled Multimodal-Attention, [[Paper]](https://arxiv.org/pdf/2210.16431.pdf)

- (arXiv 2022.10) TFORMER: **3D TOOTH SEGMENTATION** IN MESH SCANS WITH GEOMETRY GUIDED TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2210.16627.pdf)

- (arXiv 2022.10) **ON-THE-FLY** OBJECT **DETECTION** USING STYLEGAN WITH **CLIP** GUIDANCE, [[Paper]](https://arxiv.org/pdf/2210.16742.pdf)

- (arXiv 2022.10) Image-free Domain Generalization via **CLIP** for **3D Hand Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2210.16788.pdf)

- (arXiv 2022.10) Temporal-Viewpoint Transportation Plan for **Skeletal Few-shot Action Recognition**, [[Paper]](https://arxiv.org/pdf/2210.16820.pdf)

- (arXiv 2022.10) A SIMPLE, EFFICIENT AND SCALABLE CONTRASTIVE **MASKED AUTOENCODER** FOR LEARNING VISUAL REPRESENTATIONS, [[Paper]](https://arxiv.org/pdf/2210.16870.pdf)

- (arXiv 2022.10) Time-rEversed diffusioN tEnsor Transformer: A new TENET of **Few-Shot Object Detection**, [[Paper]](https://arxiv.org/pdf/2210.16897.pdf)

- (arXiv 2022.10) **Foreign Object Debris Detection** for Airport Pavement Images based on Self-supervised Localization and Vision Transformer, [[Paper]](https://arxiv.org/pdf/2210.16901.pdf)

- (arXiv 2022.10) ViT-LSLA: Vision Transformer with **Light Self-Limited-Attention**, [[Paper]](https://arxiv.org/pdf/2210.17115.pdf)

- (arXiv 2022.10) Generative Negative Text Replay for Continual **Vision-Language** Pretraining, [[Paper]](https://arxiv.org/pdf/2210.17322.pdf)

- (arXiv 2022.10) PatchRot: A **Self-Supervised** Technique for Training Vision Transformers, [[Paper]](https://arxiv.org/pdf/2210.15722.pdf)

- (arXiv 2022.10) MULTIMODAL TRANSFORMER DISTILLATION FOR **AUDIO-VISUAL** SYNCHRONIZATION, [[Paper]](https://arxiv.org/pdf/2210.15563.pdf)

- (arXiv 2022.10) **Multimodal** Transformer for Parallel Concatenated Variational Autoencoders, [[Paper]](https://arxiv.org/pdf/2210.16174.pdf)

- (arXiv 2022.10) Differentially **Private** CutMix for Split Learning with Vision Transformer, [[Paper]](https://arxiv.org/pdf/2210.15986.pdf)

- (arXiv 2022.10) OHMG: ZERO-SHOT OPEN-VOCABULARY HUMAN **MOTION GENERATION**, [[Paper]](https://arxiv.org/pdf/2210.15929.pdf)

- (arXiv 2022.10) VLT: Vision-Language Transformer and Query Generation for **Referring Segmentation**, [[Paper]](https://arxiv.org/pdf/2210.15871.pdf)

- (arXiv 2022.10) PSFORMER: POINT TRANSFORMER FOR **3D SALIENT OBJECT DETECTION**, [[Paper]](https://arxiv.org/pdf/2210.15933.pdf)

- (arXiv 2022.10) **GRAFTING** VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2210.15943.pdf)

- (arXiv 2022.10) Generalization Differences between End-to-End and Neuro-Symbolic **Vision-Language** **Reasoning** Systems, [[Paper]](https://arxiv.org/pdf/2210.15037)

- (arXiv 2022.10) FaD-VLP: **Fashion** **Vision-and-Language** Pre-training towards Unified Retrieval and Captioning, [[Paper]](https://arxiv.org/pdf/2210.15028.pdf)

- (arXiv 2022.10) Masked **Vision-Language** Transformer in **Fashion**, [[Paper]](https://arxiv.org/pdf/2210.15110.pdf), [[Code]](https://github.com/GewelsJI/MVLT)

- (arXiv 2022.10) Learning Variational Motion Prior for **Video-based Motion Capture**, [[Paper]](https://arxiv.org/pdf/2210.15134.pdf)

- (arXiv 2022.10) **Open-vocabulary Semantic Segmentation** with Frozen Vision-Language Models, [[Paper]](https://arxiv.org/pdf/2210.15138.pdf), [[Code]](https://yyh-rain-song.github.io/Fusioner_webpage/)

- (arXiv 2022.10) **TEXT2MODEL**: MODEL INDUCTION FOR ZERO-SHOT GENERALIZATION USING TASK DESCRIPTIONS, [[Paper]](https://arxiv.org/pdf/2210.15182.pdf)

- (arXiv 2022.10) Learning Joint Representation of **Human Motion** and **Language**, [[Paper]](https://arxiv.org/pdf/2210.15187.pdf)

- (arXiv 2022.10) ERNIE-ViLG 2.0: Improving **Text-to-Image** Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts, [[Paper]](https://arxiv.org/pdf/2210.15257.pdf)

- (arXiv 2022.10) MSF3DDETR: Multi-Sensor Fusion **3D** Detection Transformer for **Autonomous Driving**, [[Paper]](https://arxiv.org/pdf/2210.15316.pdf)

- (arXiv 2022.10) Li3DeTr: A **LiDAR** based **3D Detection** Transformer

- (arXiv 2022.10) Masked Transformer for **Image Anomaly Localization**, [[Paper]](https://arxiv.org/pdf/2210.15540.pdf)

- (arXiv 2022.10) Discovering Design Concepts for **CAD Sketches**, [[Paper]](https://arxiv.org/pdf/2210.14451.pdf)

- (arXiv 2022.10) Compressing And Debiasing Vision-Language Pre-Trained Models for **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2210.14558.pdf)

- (arXiv 2022.10) End-to-End Multimodal Representation Learning for **Video Dialog**, [[Paper]](https://arxiv.org/pdf/2210.14512.pdf)

- (arXiv 2022.10) TPFNet: A Novel **Text In-painting** Transformer for Text Removal, [[Paper]](https://arxiv.org/pdf/2210.14461.pdf), [[Code]](https://github.com/CandleLabAI/TPFNet)

- (arXiv 2022.10) IMU2CLIP: MULTIMODAL CONTRASTIVE LEARNING FOR **IMU MOTION SENSORS** FROM **EGOCENTRIC** VIDEOS AND TEXT NARRATIONS, [[Paper]](https://arxiv.org/pdf/2210.14395.pdf)

- (arXiv 2022.10) Can Transformer Attention Spread Give Insights Into **Uncertainty** of **Detected** and **Tracked** Objects? [[Paper]](https://arxiv.org/pdf/2210.14391.pdf)

- (arXiv 2022.10) SemFormer: Semantic Guided Activation Transformer for **Weakly Supervised Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2210.14618.pdf), [[Code]](https://github.com/JLChen-C/SemFormer)

- (arXiv 2022.10) End-to-end **Tracking** with a Multi-query Transformer, [[Paper]](https://arxiv.org/pdf/2210.14601.pdf)

- (arXiv 2022.10) Explicitly Increasing **Input Information Density** for Vision Transformers on **Small Datasets**, [[Paper]](https://arxiv.org/pdf/2210.14319.pdf), [[Code]](https://github.com/xiangyu8/DenseVT)

- (arXiv 2022.10) TAMFORMER: MULTI-MODAL TRANSFORMER WITH LEARNED ATTENTION MASK FOR **EARLY INTENT PREDICTION**, [[Paper]](https://arxiv.org/pdf/2210.14714.pdf)

- (arXiv 2022.10) **VISUAL ANSWER LOCALIZATION** WITH CROSS-MODAL MUTUAL KNOWLEDGE TRANSFER, [[Paper]](https://arxiv.org/pdf/2210.14823.pdf), [[Code]](https://github.com/WENGSYX/MutualSL)

- (arXiv 2022.10) Visual **Semantic Parsing**: From Images to Abstract Meaning Representation, [[Paper]](https://arxiv.org/pdf/2210.14862.pdf)

- (arXiv 2022.10) End-to-end Transformer for **Compressed Video Quality Enhancement**, [[Paper]](https://arxiv.org/pdf/2210.13827.pdf)

- (arXiv 2022.10) PlanT: Explainable **Planning** Transformers via Object-Level Representations, [[Paper]](https://arxiv.org/pdf/2210.14222.pdf), [[Project]](https://www.katrinrenz.de/plant)

- (arXiv 2022.10) Strong-TransCenter: Improved **Multi-Object Tracking** based on Transformers with Dense Representations, [[Paper]](https://arxiv.org/pdf/2210.13570.pdf), [[Code]](https://github.com/amitgalor18/STC_Tracker)

- (arXiv 2022.10) GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online **Action Prediction**, [[Paper]](https://arxiv.org/pdf/2210.13605.pdf)

- (arXiv 2022.10) VLC-BERT: **Visual Question Answering** with Contextualized Commonsense Knowledge, [[Paper]](https://arxiv.org/pdf/2210.13626.pdf), [[Code]](https://github.com/aditya10/VLC-BERT)

- (arXiv 2022.10) Learning by Hallucinating: **Vision-Language** Pre-training with Weak Supervision, [[Paper]](https://arxiv.org/pdf/2210.13591.pdf)

- (arXiv 2022.10) Learning Explicit **Object-Centric Representations** with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2210.14139.pdf)

- (arXiv 2022.10) Abductive **Action** Inference, [[Paper]](https://arxiv.org/pdf/2210.13984.pdf)

- (arXiv 2022.10) Minutiae-Guided **Fingerprint** Embeddings via Vision Transformers, [[Paper]](https://arxiv.org/pdf/2210.13994.pdf)

- (arXiv 2022.10) 3DALL-E: Integrating **Text-to-Image** AI in **3D** Design Workflows, [[Paper]](https://arxiv.org/pdf/2210.11603.pdf)

- (arXiv 2022.10) COMPOSING **ENSEMBLES** OF **PRE-TRAINED MODELS** VIA ITERATIVE CONSENSUS, [[Paper]](https://arxiv.org/pdf/2210.11522.pdf), [[Code]](https://energy-based-model.github.io/composing-pretrained-models)

- (arXiv 2022.10) Do **Vision-and-Language** Transformers Learn Grounded **Predicate-Noun Dependencies**?, [[Paper]](https://arxiv.org/pdf/2210.12079.pdf)

- (arXiv 2022.10) Boosting vision transformers for **image retrieval**, [[Paper]](https://arxiv.org/pdf/2210.11909.pdf), [[Code]](https://github.com/dealicious-inc/DToP)

- (arXiv 2022.10) LiteVL: Efficient **Video-Language** Learning with Enhanced Spatial-Temporal Modeling, [[Paper]](https://arxiv.org/pdf/2210.11929.pdf)

- (arXiv 2022.10) Fine-grained Semantic Alignment Network for Weakly Supervised **Temporal Language Grounding**, [[Paper]](https://arxiv.org/pdf/2210.11933.pdf)

- (arXiv 2022.10) **Face** **Pyramid** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2210.11974.pdf), [[Code]](https://khawar-islam.github.io/fpvt/)

- (arXiv 2022.10) Context-Enhanced **Stereo** Transformer, [[Paper]](https://arxiv.org/pdf/2210.11719.pdf), [[Code]](https://github.com/guoweiyu/Context-Enhanced-Stereo-Transformer)

- (arXiv 2022.10) CRT-6D: Fast **6D Object Pose Estimation** with Cascaded Refinement Transformers, [[Paper]](https://arxiv.org/pdf/2210.11718.pdf), [[Code]](https://github.com/PedroCastro/CRT-6D)

- (arXiv 2022.10) Rethinking Learning Approaches for Long-Term **Action Anticipation**, [[Paper]](https://arxiv.org/pdf/2210.11566.pdf), [[Code]](https://github.com/Nmegha2601/anticipatr)

- (arXiv 2022.10) Extending **Phrase Grounding** with Pronouns in Visual Dialogues, [[Paper]](https://arxiv.org/pdf/2210.12658.pdf)

- (arXiv 2022.10) Accumulated Trivial **Attention** Matters in Vision Transformers on **Small Datasets**, [[Paper]](https://arxiv.org/pdf/2210.12333.pdf), [[Code]](https://github.com/xiangyu8/SATA)

- (arXiv 2022.10) Transformers For **Recognition** In **Overhead Imagery**: A Reality Check, [[Paper]](https://arxiv.org/pdf/2210.12599.pdf)

- (arXiv 2022.10) Anticipative Feature Fusion Transformer for Multi-Modal **Action Anticipation**, [[Paper]](https://arxiv.org/pdf/2210.12649.pdf), [[Code]](https://github.com/zeyun-zhong/AFFT)

- (arXiv 2022.10) UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for **Face Forgery Detection**, [[Paper]](https://arxiv.org/pdf/2210.12752.pdf)

- (arXiv 2022.10) LCPFormer: Towards Effective 3D **Point Cloud** Analysis via Local Context Propagation in Transformers, [[Paper]](https://arxiv.org/pdf/2210.12755.pdf)

- (arXiv 2022.10) Towards Real-Time **Text2Video** via **CLIP**-Guided, Pixel-Level Optimization, [[Paper]](https://arxiv.org/pdf/2210.12826.pdf), [[Code]](https://pschaldenbrand.github.io/text2video/)

- (arXiv 2022.10) Language-free Training for Zero-shot **Video Grounding**, [[Paper]](https://arxiv.org/pdf/2210.12977.pdf)

- (arXiv 2022.10) Foreground Guidance and Multi-Layer Feature Fusion for **Unsupervised Object Discovery** with Transformers, [[Paper]](https://arxiv.org/pdf/2210.13053.pdf), [[Code]](https://github.com/VDIGPKU/FORMULA)

- (arXiv 2022.10) Towards Unifying **Reference Expression** Generation and Comprehension, [[Paper]](https://arxiv.org/pdf/2210.13076.pdf)

- (arXiv 2022.10) Robust **Self-Supervised Learning** with Lie Groups, [[Paper]](https://arxiv.org/pdf/2210.13356.pdf)

- (arXiv 2022.10) VIOLA: Imitation Learning for Vision-Based **Manipulation** with Object Proposal Priors, [[Paper]](https://arxiv.org/pdf/2210.11339.pdf), [[Code]](https://ut-austin-rpl.github.io/VIOLA)

- (arXiv 2022.10) VTC: Improving **Video-Text Retrieval** with User Comments, [[Paper]](https://arxiv.org/pdf/2210.10820.pdf), [[Project]](https://unitaryai.github.io/vtc-paper)

- (arXiv 2022.10) SOLVING **REASONING** TASKS WITH A SLOT TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2210.11394.pdf)

- (arXiv 2022.10) Prompting through Prototype: A Prototype-based **Prompt** Learning on Pretrained **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2210.10841.pdf)

- (arXiv 2022.10) Grounded **Video Situation Recognition**, [[Paper]](https://arxiv.org/pdf/2210.10828.pdf), [[Project]](https://zeeshank95.github.io/grvidsitu)

- (arXiv 2022.10) Single **Image Super-Resolution** Using Lightweight Networks Based on Swin Transformer, [[Paper]](https://arxiv.org/pdf/2210.11019.pdf)

- (arXiv 2022.10) Visual Spatial Description: Controlled **Spatial**-Oriented **Image-to-Text** Generation, [[Paper]](https://arxiv.org/pdf/2210.11109.pdf), [[Code]](https://github.com/zhaoyucs/VSD)

- (arXiv 2022.10) Movie**CLIP**: Visual **Scene Recognition** in Movies, [[Paper]](https://arxiv.org/pdf/2210.11065.pdf)

- (arXiv 2022.10) PointTAD: Multi-Label **Temporal Action Detection** with Learnable Query Points, [[Paper]](https://arxiv.org/pdf/2210.11035.pdf), [[Code]](https://github.com/MCG-NJU/PointTAD)

- (arXiv 2022.10) TOWARDS SUSTAINABLE **SELF-SUPERVISED** LEARNING, [[Paper]](https://arxiv.org/pdf/2210.11016.pdf)

- (arXiv 2022.10) **Visual-Semantic** Contrastive Alignment for Few-Shot Image Classification, [[Paper]](https://arxiv.org/pdf/2210.11000.pdf)

- (arXiv 2022.10) i-MAE: ARE LATENT REPRESENTATIONS IN **MASKED AUTOENCODERS** LINEARLY SEPARABLE? [[Paper]](https://arxiv.org/pdf/2210.11470.pdf), [[Code]](https://github.com/vision-learning-acceleration-lab/i-mae)

- (arXiv 2022.10) 2nd Place Solution to ECCV 2022 Challenge: Transformer-based **Action recognition** in **hand-object** interacting scenarios, [[Paper]](https://arxiv.org/pdf/2210.11387.pdf)

- (arXiv 2022.10) 1st Place Solution to ECCV 2022 Challenge on HBHA: Transformer-based Global **3D Hand Pose Estimation** in Two Hands Manipulating Objects Scenarios, [[Paper]](https://arxiv.org/pdf/2210.11384.pdf)

- (arXiv 2022.10) **DALLE-2** is Seeing Double: **Flaws** in Word-to-Concept Mapping in Text2Image Models, [[Paper]](https://arxiv.org/pdf/2210.10606.pdf)

- (arXiv 2022.10) **CLIP**-Driven Fine-grained Text-Image **Person Re-identification**, [[Paper]](https://arxiv.org/pdf/2210.10276.pdf)

- (arXiv 2022.10) Dense but Efficient **VideoQA** for Intricate Compositional Reasoning, [[Paper]](https://arxiv.org/pdf/2210.10300.pdf)

- (arXiv 2022.10) Multi-view **Gait Recognition** based on Siamese Vision Transformer, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2210/2210.10421.pdf)

- (arXiv 2022.10) TOIST: Task Oriented **Instance Segmentation** Transformer with Noun-Pronoun Distillation, [[Paper]](https://arxiv.org/pdf/2210.10775.pdf), [[Code]](https://github.com/AIR-DISCOVER/TOIST)

- (arXiv 2022.10) CroCo: **Self-Supervised** Pre-training for **3D** Vision Tasks by Cross-View **Completion**, [[Paper]](https://arxiv.org/pdf/2210.10716.pdf), [[Project]](https://europe.naverlabs.com/research/computer-vision/croco/)

- (arXiv 2022.10) A Unified View of **Masked** Image Modeling, [[Paper]](https://arxiv.org/pdf/2210.10615.pdf), [[Code]](https://aka.ms/unimim)

- (arXiv 2022.10) Cross-Modal Fusion Distillation for Fine-Grained **Sketch-Based Image Retrieval**, [[Paper]](https://arxiv.org/pdf/2210.10486.pdf), [[Code]](https://github.com/abhrac/xmodal-vit)

- (arXiv 2022.10) BOAT: Bilateral Local **Attention** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2201.13027.pdf)

- (arXiv 2022.10) **TOKEN MERGING**: YOUR VIT BUT **FASTER**, [[Paper]](https://arxiv.org/pdf/2210.09461.pdf), [[Code]](https://github.com/facebookresearch/ToMe)

- (arXiv 2022.10) Using Language to Extend to **Unseen Domains**, [[Paper]](https://arxiv.org/pdf/2210.09520.pdf)

- (arXiv 2022.10) SWINV2-IMAGEN: HIERARCHICAL VISION TRANSFORMER DIFFUSION MODELS FOR **TEXT-TO-IMAGE** GENERATION, [[Paper]](https://arxiv.org/pdf/2210.09549.pdf)

- (arXiv 2022.10) HUMANISE: Language-conditioned Human **Motion Generation** in **3D Scenes**, [[Paper]](https://arxiv.org/pdf/2210.09729.pdf), [[Project]](https://silverster98.github.io/HUMANISE/)

- (arXiv 2022.10) Transfer-learning for **video classification**: Video Swin Transformer on multiple domains, [[Paper]](https://arxiv.org/pdf/2210.09969.pdf)

- (arXiv 2022.10) PERCEPTUAL **GROUPING** IN **VISION-LANGUAGE** MODELS, [[Paper]](https://arxiv.org/pdf/2210.09996.pdf)

- (arXiv 2022.10) How Mask Matters: Towards **Theoretical Understandings** of **Masked Autoencoders**, [[Paper]](https://arxiv.org/pdf/2210.08344.pdf), [[Code]](https://github.com/zhangq327/U-MAE)

- (arXiv 2022.10) **LINEAR** **VIDEO** TRANSFORMER WITH FEATURE FIXATION, [[Paper]](https://arxiv.org/pdf/2210.08164.pdf)

- (arXiv 2022.10) Transformer-based **dimensionality reduction**, [[Paper]](https://arxiv.org/pdf/2210.08288.pdf)

- (arXiv 2022.10) Bridging the Domain Gap for **Multi-Agent Perception**, [[Paper]](https://arxiv.org/pdf/2210.08451.pdf)

- (arXiv 2022.10) TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-**Drone** **Detection** in **Aerial** Videos, [[Paper]](https://arxiv.org/pdf/2210.08423.pdf), [[Code]](https://github.com/tusharsangam/TransVisDrone)

- (arXiv 2022.10) SCRATCHING VISUAL TRANSFORMER’S BACK WITH UNIFORM **ATTENTION**, [[Paper]](https://arxiv.org/pdf/2210.08457.pdf)

- (arXiv 2022.10) Increasing Visual Awareness in **Multimodal Neural Machine Translation** from an Information Theoretic Perspective, [[Paper]](https://arxiv.org/pdf/2210.08478.pdf)

- (arXiv 2022.10) TLDW: Extreme Multimodal **Summarisation** of News **Videos**, [[Paper]](https://arxiv.org/pdf/2210.08481.pdf)

- (arXiv 2022.10) Character-Centric **Story Visualization** via Visual Planning and Token Alignment, [[Paper]](https://arxiv.org/pdf/2210.08465.pdf), [[Code]](https://github.com/PlusLabNLP/VP-CSV)

- (arXiv 2022.10) COFAR: Commonsense and Factual Reasoning in **Image Search**, [[Paper]](https://arxiv.org/pdf/2210.08554.pdf), [[Code]](https://vl2g.github.io/projects/cofar)

- (arXiv 2022.10) Learning Self-Regularized **Adversarial** Views for Self-Supervised Vision Transformers, [[Paper]](https://arxiv.org/pdf/2210.08458.pdf), [[Code]](https://github.com/Trent-tangtao/AutoView)

- (arXiv 2022.10) Temporal and Contextual Transformer for **Multi-Camera Editing** of TV Shows, [[Paper]](https://arxiv.org/pdf/2210.08737.pdf)

- (arXiv 2022.10) **Forecasting** Human **Trajectory** from Scene History, [[Paper]](https://arxiv.org/pdf/2210.08732.pdf), [[Code]](https://github.com/MaKaRuiNah/SHENet)

- (arXiv 2022.10) SGRAM: Improving **Scene Graph Parsing** via Abstract Meaning Representation, [[Paper]](https://arxiv.org/pdf/2210.08675.pdf)

- (arXiv 2022.10) Contrastive **Language-Image** Pre-Training with Knowledge Graphs, [[Paper]](https://arxiv.org/pdf/2210.08901.pdf)

- (arXiv 2022.10) A Saccaded Visual Transformer for General **Object Spotting**, [[Paper]](https://arxiv.org/pdf/2210.09220.pdf)

- (arXiv 2022.10) Vision Transformers provably learn **spatial structure**, [[Paper]](https://arxiv.org/pdf/2210.09221.pdf)

- (arXiv 2022.10) oViT: An Accurate Second-Order **Pruning** Framework for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2210.09223.pdf)

- (arXiv 2022.10) Fine-grained **Category Discovery** under Coarse-grained supervision with Hierarchical Weighted Self-contrastive Learning, [[Paper]](https://arxiv.org/pdf/2210.07733.pdf), [[Code]](https://github.com/Lackel/Hierarchical_Weighted_SCL)

- (arXiv 2022.10) Non-Contrastive Learning Meets **Language-Image** Pre-Training, [[Paper]](https://arxiv.org/pdf/2210.09304.pdf)

- (arXiv 2022.10) Frame Mining: a Free Lunch for Learning Robotic **Manipulation** from 3D Point Clouds, [[Paper]](https://arxiv.org/pdf/2210.07442.pdf), [[Project]](https://colin97.github.io/FrameMining/)

- (arXiv 2022.10) Pretrained Transformers Do not Always Improve **Robustness**, [[Paper]](https://arxiv.org/pdf/2210.07663.pdf)

- (arXiv 2022.10) Plausible May Not Be Faithful: Probing Object Hallucination in **Vision-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2210.07688.pdf)

- (arXiv 2022.10) CONTRASTIVE **AUDIO-VISUAL** **MASKED** AUTOENCODER, [[Paper]](https://arxiv.org/pdf/2210.07839.pdf)

- (arXiv 2022.10) SWFormer: Sparse Window Transformer for **3D Object Detection** in Point Clouds, [[Paper]](https://arxiv.org/pdf/2210.07372.pdf)

- (arXiv 2022.10) Trailers12k: Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label **Movie Trailer Genre Classification**, [[Paper]](https://arxiv.org/pdf/2210.07983.pdf)

- (arXiv 2022.10) AVLEN: Audio-Visual-Language Embodied **Navigation** in 3D Environments, [[Paper]](https://arxiv.org/pdf/2210.07940.pdf)

- (arXiv 2022.10) MOVE: Unsupervised Movable Object **Segmentation** and **Detection**, [[Paper]](https://arxiv.org/pdf/2210.07920.pdf)

- (arXiv 2022.10) IS SYNTHETIC DATA FROM **GENERATIVE** MODELS READY FOR IMAGE **RECOGNITION**?, [[Paper]](https://arxiv.org/pdf/2210.07574.pdf), [[Code]](https://github.com/CVMI-Lab/SyntheticData)

- (arXiv 2022.10) Towards Transformer-based Homogenization of **Satellite Imagery** for Landsat-8 and Sentinel-2, [[Paper]](https://arxiv.org/pdf/2210.07654.pdf)

- (arXiv 2022.10) MCTNET: A MULTI-SCALE CNN-TRANSFORMER NETWORK FOR **CHANGE DETECTION** IN OPTICAL **REMOTE SENSING** IMAGES, [[Paper]](https://arxiv.org/pdf/2210.07601.pdf)

- (arXiv 2022.10) Vision Transformer **Visualization**: What Neurons Tell and How Neurons Behave? [[Paper]](https://arxiv.org/pdf/2210.07646.pdf), [[Code]](https://github.com/byM1902/ViT_visualization)

- (arXiv 2022.10) TokenMixup: Efficient Attention-guided Token-level Data **Augmentation** for Transformers, [[Paper]](https://arxiv.org/pdf/2210.07562.pdf), [[Code]](https://github.com/mlvlab/TokenMixup)

- (arXiv 2022.10) SQA3D: SITUATED **QUESTION ANSWERING** IN **3D** SCENES, [[Paper]](https://arxiv.org/pdf/2210.07474.pdf)

- (arXiv 2022.10) When **Adversarial** Training Meets Vision Transformers: Recipes from Training to Architecture, [[Paper]](https://arxiv.org/pdf/2210.07540.pdf), [[Code]](https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers)

- (arXiv 2022.10) STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2210.07503.pdf)

- (arXiv 2022.10) PedFormer: **Pedestrian Behavior Prediction** via Cross-Modal Attention Modulation and Gated Multitask Learning, [[Paper]](https://arxiv.org/pdf/2210.07886.pdf)

- (arXiv 2022.10) One Model to Edit Them All: Free-Form Text-Driven **Image Manipulation** with Semantic Modulations, [[Paper]](https://arxiv.org/pdf/2210.07883.pdf), [[Code]](https://github.com/KumapowerLIU/FFCLIP)

- (arXiv 2022.10) IMAGINARYNET: LEARNING OBJECT **DETECTORS** WITHOUT REAL IMAGES AND ANNOTATIONS, [[Paper]](https://arxiv.org/pdf/2210.06886.pdf), [[Code]](https://github.com/kodenii/ImaginaryNet)

- (arXiv 2022.10) Feature-Proxy Transformer for **Few-Shot Segmentation**, [[Paper]](https://arxiv.org/pdf/2210.06908.pdf), [[Code]](https://github.com/Jarvis73/FPTrans)

- (arXiv 2022.10) Scene **Text Image Super-Resolution** via Content Perceptual Loss and Criss-Cross Transformer Blocks, [[Paper]](https://arxiv.org/pdf/2210.06924.pdf)

- (arXiv 2022.10) UNIFIED **VISION AND LANGUAGE** **PROMPT** LEARNING, [[Paper]](https://arxiv.org/pdf/2210.07225.pdf), [[Code]](https://github.com/yuhangzang/UPT)

- (arXiv 2022.10) Exploring Long-Sequence **Masked Autoencoders**, [[Paper]](https://arxiv.org/pdf/2210.07224.pdf), [[Code]](https://github.com/facebookresearch/long_seq_mae)

- (arXiv 2022.10) MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for **Vision-Language** Few-Shot **Prompting**, [[Paper]](https://arxiv.org/pdf/2210.07179.pdf)

- (arXiv 2022.10) Interactive **Language**: Talking to **Robots** in Real Time, [[Paper]](https://arxiv.org/pdf/2210.06407.pdf), [[Project]](https://interactive-language.github.io/)

- (arXiv 2022.10) RTFormer: Efficient Design for Real-Time **Semantic Segmentation** with Transformer, [[Paper]](https://arxiv.org/pdf/2210.07124.pdf), [[Code]](https://github.com/PaddlePaddle/PaddleSeg)

- (arXiv 2022.10) How to **Train** Vision Transformer on **Small-scale Datasets**?, [[Paper]](https://arxiv.org/pdf/2210.07240.pdf), [[Code]](https://github.com/hananshafi/vits-for-small-scale-datasets)

- (arXiv 2022.10) Hate-CLIPper: Multimodal Hateful **Meme Classification** based on Cross-modal Interaction of **CLIP** Features, [[Paper]](https://arxiv.org/pdf/2210.05916.pdf), [[Code]](https://github.com/gokulkarthik/hateclipper)

- (arXiv 2022.10) Large Models are Parsimonious Learners: **Activation Sparsity** in Trained Transformers, [[Paper]](https://arxiv.org/pdf/2210.06313.pdf)

- (arXiv 2022.10) CURVED **REPRESENTATION SPACE** OF VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2210.05742.pdf)

- (arXiv 2022.10) **Foundation** Transformers, [[Paper]](https://arxiv.org/pdf/2210.06423.pdf), [[Code]](https://github.com/microsoft/unilm)

- (arXiv 2022.10) Underspecification in **Scene Description-to-Depiction** Tasks, [[Paper]](https://arxiv.org/pdf/2210.05815.pdf)

- (arXiv 2022.10) Continuous conditional **video synthesis** by neural processes, [[Paper]](https://arxiv.org/pdf/2210.05810.pdf), [[Code]](https://github.com/NPVS/NPVS)

- (arXiv 2022.10) SAIT: SPARSE VISION TRANSFORMERS THROUGH ADAPTIVE TOKEN **PRUNING**, [[Paper]](https://arxiv.org/pdf/2210.05832.pdf)

- (arXiv 2022.10) ZITS++: **Image Inpainting** by Improving the Incremental Transformer on Structural Priors, [[Paper]](https://arxiv.org/pdf/2210.05950.pdf)

- (arXiv 2022.10) SLOTFORMER: UNSUPERVISED VISUAL **DYNAMICS SIMULATION** WITH OBJECT-CENTRIC MODELS, [[Paper]](https://arxiv.org/pdf/2210.05861.pdf), [[Project]](https://slotformer.github.io/)

- (arXiv 2022.10) Learning by Asking Questions for Knowledge-based Novel **Object Recognition**, [[Paper]](https://arxiv.org/pdf/2210.05879.pdf)

- (arXiv 2022.10) Bridging the **Gap** Between Vision Transformers and **Convolutional Neural Networks** on Small Datasets, [[Paper]](https://arxiv.org/pdf/2210.05958.pdf), [[Code]](https://github.com/ArieSeirack/DHVT)

- (arXiv 2022.10) GGViT: Multistream Vision Transformer Network in Face2Face **Facial Reenactment Detection**, [[Paper]](https://arxiv.org/pdf/2210.05990.pdf)

- (arXiv 2022.10) Distilling Knowledge from Language Models for **Video-based Action Anticipation**, [[Paper]](https://arxiv.org/pdf/2210.05991.pdf)

- (arXiv 2022.10) Long-Form **Video-Language** Pre-Training with Multimodal Temporal Contrastive Learning, [[Paper]](https://arxiv.org/pdf/2210.06031.pdf), [[Code]](https://github.com/microsoft/XPretrain)

- (arXiv 2022.10) M3VIDEO: **MASKED** MOTION MODELING FOR SELF-SUPERVISED **VIDEO REPRESENTATION** LEARNING, [[Paper]](https://arxiv.org/pdf/2210.06096.pdf)

- (arXiv 2022.10) Uplift and Upsample: Efficient **3D Human Pose Estimation** with Uplifting Transformers, [[Paper]](https://arxiv.org/pdf/2210.06110.pdf), [[Code]](https://github.com/goldbricklemon/uplift-upsample-3dhpe)

- (arXiv 2022.10) FontTransformer: Few-shot High-resolution Chinese **Glyph Image Synthesis** via Stacked Transformers, [[Paper]](https://arxiv.org/pdf/2210.06301.pdf)

- (arXiv 2022.10) AISFormer: Amodal **Instance Segmentation** with Transformer, [[Paper]](https://arxiv.org/pdf/2210.06323.pdf), [[Code]](https://github.com/UARK-AICV/AISFormer)

- (arXiv 2022.10) ViewBirdiformer: Learning to **recover** ground-plane **crowd trajectories** and **ego-motion** from a single ego-centric view, [[Paper]](https://arxiv.org/pdf/2210.06332.pdf)

- (arXiv 2022.10) One does not fit all! On the Complementarity of Vision Encoders for **Vision and Language** Tasks, [[Paper]](https://arxiv.org/pdf/2210.06379.pdf)

- (arXiv 2022.10) **PROMPT GENERATION** NETWORKS FOR EFFICIENT **ADAPTATION** OF FROZEN VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2210.06466.pdf), [[Code]](https://github.com/jochemloedeman/PGN)

- (arXiv 2022.10) Generating Executable Action **Plans** with Environmentally-Aware Language Models, [[Paper]](https://arxiv.org/pdf/2210.04964.pdf)

- (arXiv 2022.10) AVE-**CLIP**: AudioCLIP-based Multi-window Temporal Transformer for **Audio Visual Event Localization**, [[Paper]](https://arxiv.org/pdf/2210.05060.pdf)

- (arXiv 2022.10) Improving Dense **Contrastive Learning** with Dense Negative Pairs, [[Paper]](https://arxiv.org/pdf/2210.05063.pdf)

- (arXiv 2022.10) Fine-Grained **Image Style Transfer** with Visual Transformers, [[Paper]](https://arxiv.org/pdf/2210.05176.pdf), [[Code]](https://github.com/researchmm/STTR)

- (arXiv 2022.10) IT TAKES TWO: **MASKED** APPEARANCE-MOTION MODELING FOR **SELF-SUPERVISED VIDEO** TRANSFORMER PRE-TRAINING, [[Paper]](https://arxiv.org/pdf/2210.05234.pdf)

- (arXiv 2022.10) Contrastive **Video-Language** Learning with Fine-grained Frame Sampling, [[Paper]](https://arxiv.org/pdf/2210.05039.pdf)

- (arXiv 2022.10) Style-Guided Inference of Transformer for High-resolution **Image Synthesis**, [[Paper]](https://arxiv.org/pdf/2210.05533.pdf)

- (arXiv 2022.10) MAP: Modality-Agnostic Uncertainty-Aware **Vision-Language** Pre-training Model, [[Paper]](https://arxiv.org/pdf/2210.05335.pdf), [[Code]](https://github.com/IIGROUP/MAP)

- (arXiv 2022.10) LEARNING TO LOCATE VISUAL **ANSWER** IN **VIDEO** CORPUS USING **QUESTION**, [[Paper]](https://arxiv.org/pdf/2210.05423.pdf), [[Code]](https://github.com/WENGSYX/CCGS)

- (arXiv 2022.10) UNDERSTANDING **EMBODIED** REFERENCE WITH TOUCH-LINE TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2210.05668.pdf)

- (arXiv 2022.10) **Point** Transformer V2: Grouped Vector Attention and Partition-based Pooling, [[Paper]](https://arxiv.org/pdf/2210.05666.pdf), [[Code]](https://github.com/Gofinge/PointTransformerV2)

- (arXiv 2022.10) See, Plan, Predict: Language-guided Cognitive **Planning** with Video Prediction, [[Paper]](https://arxiv.org/pdf/2210.03825.pdf)

- (arXiv 2022.10) USING BOTH DEMONSTRATIONS AND LANGUAGE INSTRUCTIONS TO EFFICIENTLY LEARN **ROBOTIC** TASKS, [[Paper]](https://arxiv.org/pdf/2210.04476.pdf), [[Project]](https://sites.google.com/view/del-taco-learning)

- (arXiv 2022.10) Generating image **captions** with external encyclopedic knowledge, [[Paper]](https://arxiv.org/pdf/2210.04806.pdf)

- (arXiv 2022.10) LOCL: Learning **Object-Attribute Composition** using Localization, [[Paper]](https://arxiv.org/pdf/2210.03780.pdf)

- (arXiv 2022.10) SVL-Adapter: Self-Supervised Adapter for **Vision-Language** Pretrained Models, [[Paper]](https://arxiv.org/pdf/2210.03794.pdf), [[Code]](https://github.com/omipan/svl_adapter)

- (arXiv 2022.10) ConTra: (Con)text (Tra)nsformer for **Cross-Modal Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2210.04341.pdf)

- (arXiv 2022.10) Learning Fine-Grained Visual Understanding for **Video Question Answering** via Decoupling Spatial-Temporal Modeling, [[Paper]](https://arxiv.org/pdf/2210.03941.pdf), [[Code]](https://github.com/shinying/dest)

- (arXiv 2022.10) (Fusionformer): Exploiting the Joint Motion Synergy with Fusion Network Based On Transformer for **3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2210.04006.pdf)

- (arXiv 2022.10) Fast-ParC: Position Aware Global **Kernel** for ConvNets and ViTs, [[Paper]](https://arxiv.org/pdf/2210.04020.pdf), [[Code]](https://github.com/yangtao2019yt/Fast_ParC.git)

- (arXiv 2022.10) OGC: Unsupervised **3D Object Segmentation** from Rigid Dynamics of Point Clouds, [[Paper]](https://arxiv.org/pdf/2210.04458.pdf), [[Code]](https://github.com/vLAR-group/OGC)

- (arXiv 2022.10) Multi-Modal Fusion Transformer for **Visual Question Answering** in **Remote Sensing**, [[Paper]](https://arxiv.org/pdf/2210.04510.pdf), [[Code]](https://git.tu-berlin.de/rsim/multi-modal-fusion-transformer-for-vqa-in-rs)

- (arXiv 2022.10) Semantics-Consistent **Cross-domain Summarization** via Optimal Transport Alignment, [[Paper]](https://arxiv.org/pdf/2210.04722.pdf)

- (arXiv 2022.10) VOLTA: **VISION-LANGUAGE** TRANSFORMER WITH WEAKLY-SUPERVISED LOCAL-FEATURE ALIGNMENT, [[Paper]](https://arxiv.org/pdf/2210.04135.pdf)

- (arXiv 2022.10) OPEN-VOCABULARY SEMANTIC **SEGMENTATION** WITH MASK-ADAPTED **CLIP**, [[Paper]](https://arxiv.org/pdf/2210.04150.pdf), [[Project]](https://jeff-liangf.github.io/projects/ovseg)

- (arXiv 2022.10) MAMO: Masked Multimodal Modeling for Fine-Grained **Vision-Language** Representation Learning, [[Paper]](https://arxiv.org/pdf/2210.04183.pdf)

- (arXiv 2022.10) SELF-SUPERVISED **VIDEO** REPRESENTATION LEARNING WITH MOTION-AWARE **MASKED AUTOENCODERS**, [[Paper]](https://arxiv.org/pdf/2210.04154.pdf), [[Code]](https://github.com/happy-hsy/MotionMAE)

- (arXiv 2022.10) LEARNING TO DECOMPOSE **VISUAL** FEATURES WITH LATENT **TEXTUAL** PROMPTS, [[Paper]](https://arxiv.org/pdf/2210.04287.pdf)

- (arXiv 2022.10) DCVQE: A Hierarchical Transformer for **Video Quality Assessment**, [[Paper]](https://arxiv.org/pdf/2210.04377.pdf)

- (arXiv 2022.10) **Fine-grained Object** Categorization for Service **Robots**, [[Paper]](https://arxiv.org/pdf/2210.04613.pdf)

- (arXiv 2022.10) **CLIP**-DIFFUSION-LM: APPLY DIFFUSION MODEL ON IMAGE **CAPTIONING**, [[Paper]](https://arxiv.org/pdf/2210.04559.pdf), [[Code]](https://github.com/xu-shitong/diffusion-image-captioning)

- (arXiv 2022.10) A Memory Transformer Network for **Incremental Learning**, [[Paper]](https://arxiv.org/pdf/2210.04485.pdf)

- (arXiv 2022.10) Bridging **CLIP** and StyleGAN through Latent Alignment for **Image Editing**, [[Paper]](https://arxiv.org/pdf/2210.04506.pdf)

- (arXiv 2022.10) LMQFormer: A Laplace-Prior-Guided Mask Query Transformer for **Lightweight Snow Removal**, [[Paper]](https://arxiv.org/pdf/2210.04787.pdf)

- (arXiv 2022.10) FS-DETR: **FEW-SHOT** **DETECTION** TRANSFORMER WITH PROMPTING AND WITHOUT RE-TRAINING, [[Paper]](https://arxiv.org/pdf/2210.04845.pdf)

- (arXiv 2022.10) Transformer-based Localization from **Embodied** Dialog with Large-scale Pre-training, [[Paper]](https://arxiv.org/pdf/2210.04864.pdf)

- (arXiv 2022.10) Turbo Training with Token **Dropout**, [[Paper]](https://arxiv.org/pdf/2210.04889.pdf)

- (arXiv 2022.10) Polyhistor: Parameter-**Efficient** **Multi-Task Adaptation** for Dense Vision Tasks, [[Paper]](https://arxiv.org/pdf/2210.03265.pdf)

- (arXiv 2022.10) C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual **Text-Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2210.03625.pdf)

- (arXiv 2022.10) **Pose** Guided **Human Image Synthesis** with Partially Decoupled GAN, [[Paper]](https://arxiv.org/pdf/2210.03627.pdf)

- (arXiv 2022.10) A Simple Plugin for Transforming **Images** to **Arbitrary Scales**, [[Paper]](https://arxiv.org/pdf/2210.03417.pdf), [[Project]](https://lipurple.github.io/ARIS_Webpage/)

- (arXiv 2022.10) Time-Space Transformers for **Video Panoptic Segmentation**, [[Paper]](https://arxiv.org/pdf/2210.03546.pdf)

- (arXiv 2022.10) MOAT: ALTERNATING **MOBILE** CONVOLUTION AND ATTENTION BRINGS STRONG VISION MODELS, [[Paper]](https://arxiv.org/pdf/2210.01820.pdf), [[Code]](https://github.com/google-research/deeplab2)

- (arXiv 2022.10) **IMAGEN** VIDEO: HIGH DEFINITION **VIDEO GENERATION** WITH **DIFFUSION** MODELS, [[Paper]](https://arxiv.org/pdf/2210.02303.pdf), [[Project]](https://imagen.research.google/video/)

- (arXiv 2022.10) clip2latent: **Text** driven sampling of a pre-trained **StyleGAN** using denoising diffusion and **CLIP**, [[Paper]](https://arxiv.org/pdf/2210.02347.pdf)

- (arXiv 2022.10) FQDet: **Fast**-converging Query-based **Detector**, [[Paper]](https://arxiv.org/pdf/2210.02318.pdf), [[Code]](https://github.com/CedricPicron/FQDet)

- (arXiv 2022.10) VARIATIONAL **PROMPT** TUNING IMPROVES GENERALIZATION OF **VISION-LANGUAGE** MODELS, [[Paper]](https://arxiv.org/pdf/2210.02390.pdf)

- (arXiv 2022.10) Grounding **Language** with **Visual** **Affordances** over Unstructured Data, [[Paper]](https://arxiv.org/pdf/2210.01911.pdf), [[Project]](http://hulc2.cs.uni-freiburg.de/)

- (arXiv 2022.10) Granularity-aware Adaptation for **Image Retrieval** over Multiple Tasks, [[Paper]](https://arxiv.org/pdf/2210.02254.pdf)

- (arXiv 2022.10) WHEN AND WHY **VISION-LANGUAGE** MODELS BEHAVE LIKE BAGS-OF-WORDS, AND WHAT TO DO ABOUT IT? [[Paper]](https://arxiv.org/pdf/2210.01936.pdf)

- (arXiv 2022.10) Multi-view **Human** Body **Mesh** Translator, [[Paper]](https://arxiv.org/pdf/2210.01886.pdf)

- (arXiv 2022.10) EXPLORING THE ROLE OF MEAN TEACHERS IN **SELF-SUPERVISED** **MASKED** AUTO-ENCODERS, [[Paper]](https://arxiv.org/pdf/2210.02077.pdf)

- (arXiv 2022.10) **Point Cloud Recognition** with Position-to-Structure Attention Transformers, [[Paper]](https://arxiv.org/pdf/2210.02030.pdf)

- (arXiv 2022.10) TEMPORALLY CONSISTENT VIDEO TRANSFORMER FOR LONG-TERM **VIDEO PREDICTION**, [[Paper]](https://arxiv.org/pdf/2210.02396.pdf), [[Code]](https://wilson1yan.github.io/teco)

- (arXiv 2022.10) PHENAKI: VARIABLE LENGTH **VIDEO GENERATION** FROM OPEN DOMAIN TEXTUAL DESCRIPTIONS, [[Paper]](https://arxiv.org/pdf/2210.02399.pdf)

- (arXiv 2022.10) MuRAG: Multimodal Retrieval-Augmented Generator for **Open Question Answering** over Images and Text, [[Paper]](https://arxiv.org/pdf/2210.02928.pdf)

- (arXiv 2022.10) Real-World **Robot Learning** with **Masked** Visual Pre-training, [[Paper]](https://arxiv.org/pdf/2210.03109.pdf), [[Project]](https://tetexiao.com/projects/real-mvp)

- (arXiv 2022.10) BaseTransformers: Attention over base data-points for **One Shot** Learning, [[Paper]](https://arxiv.org/pdf/2210.02476.pdf), [[Code]](https://github.com/mayug/BaseTransformers)

- (arXiv 2022.10) Focal and Global Spatial-Temporal Transformer for **Skeleton**-based **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2210.02693.pdf)

- (arXiv 2022.10) Vision Transformer Based Model for **Describing** a Set of **Images** as a Story, [[Paper]](https://arxiv.org/pdf/2210.02762.pdf)

- (arXiv 2022.10) **Video Referring Expression Comprehension** via Transformer with Content-aware Query, [[Paper]](https://arxiv.org/pdf/2210.02953.pdf), [[Code]](https://github.com/mengcaopku/ContFormer)

- (arXiv 2022.10) **EFFECTIVE** **SELF-SUPERVISED** PRE-TRAINING ON LOW-COMPUTE NETWORKS WITHOUT DISTILLATION, [[Paper]](https://arxiv.org/pdf/2210.02808.pdf)

- (arXiv 2022.10) **CLIP** MODEL IS AN EFFICIENT **CONTINUAL LEARNER**, [[Paper]](https://arxiv.org/pdf/2210.03114.pdf)

- (arXiv 2022.10) Content-Based Search for Deep **Generative** Models, [[Paper]](https://arxiv.org/pdf/2210.03116.pdf)

- (arXiv 2022.10) MAPLE: **MULTI-MODAL** **PROMPT** LEARNING, [[Paper]](https://arxiv.org/pdf/2210.03117.pdf), [[Code]](https://tinyurl.com/2dzs8f3w)

- (arXiv 2022.10) SYSTEMATIC GENERALIZATION AND EMERGENT STRUCTURES IN TRANSFORMERS TRAINED ON **STRUCTURED TASKS**, [[Paper]](https://arxiv.org/pdf/2210.00400.pdf)

- (arXiv 2022.10) WIDE **ATTENTION** IS THE WAY FORWARD FOR TRANSFORMERS? [[Paper]](https://arxiv.org/pdf/2210.00640.pdf)

- (arXiv 2022.10) DARTFORMER: **FINDING** THE BEST TYPE OF **ATTENTION**, [[Paper]](https://arxiv.org/pdf/2210.00641.pdf)

- (arXiv 2022.10) MOBILEVITV3: **MOBILE**-FRIENDLY VISION TRANSFORMER WITH SIMPLE AND EFFECTIVE FUSION OF LOCAL, GLOBAL AND INPUT FEATURES, [[Paper]](https://arxiv.org/pdf/2209.15159.pdf), [[Code]](https://github.com/micronDLA/MobileViTv3)

- (arXiv 2022.10) Differentiable Parsing and Visual **Grounding** of Verbal Instructions for **Object Placement**, [[Paper]](https://arxiv.org/pdf/2210.00215.pdf), [[Project]](https://1989ryan.github.io/projects/paragon.html)

- (arXiv 2022.10) EAPruning: Evolutionary **Pruning** for Vision Transformers and CNNs, [[Paper]](https://arxiv.org/pdf/2210.00181.pdf)

- (arXiv 2022.10) Motion-inductive Self-supervised **Object Discovery** in Videos, [[Paper]](https://arxiv.org/pdf/2210.00221.pdf)

- (arXiv 2022.10) Fully Transformer Network for Change Detection of **Remote Sensing** Images, [[Paper]](https://arxiv.org/pdf/2210.00757.pdf), [[Code]](https://github.com/AI-Zhpp/FTN)

- (arXiv 2022.10) TOWARDS A UNIFIED VIEW ON VISUAL PARAMETER-**EFFICIENT** **TRANSFER LEARNING**, [[Paper]](https://arxiv.org/pdf/2210.00788.pdf)

- (arXiv 2022.10) Visual **Prompt** Tuning for Generative **Transfer Learning**, [[Paper]](https://arxiv.org/pdf/2210.00990.pdf)

- (arXiv 2022.10) A Strong Transfer Baseline for **RGB-D Fusion** in Vision Transformers, [[Paper]](https://arxiv.org/pdf/2210.00843.pdf)

- (arXiv 2022.10) LPT: **LONG-TAILED** **PROMPT** TUNING FOR IMAGE CLASSIFICATION, [[Paper]](https://arxiv.org/pdf/2210.01033.pdf)

- (arXiv 2022.10) Expediting Large-Scale Vision Transformer for **Dense Prediction** without Fine-tuning, [[Paper]](https://arxiv.org/pdf/2210.01035.pdf)

- (arXiv 2022.10) CLIP2POINT: TRANSFER **CLIP** TO **POINT CLOUD CLASSIFICATION** WITH IMAGE-DEPTH PRE-TRAINING, [[Paper]](https://arxiv.org/pdf/2210.01055.pdf)

- (arXiv 2022.10) Dual-former: Hybrid Self-attention Transformer for Efficient **Image Restoration**, [[Paper]](https://arxiv.org/pdf/2210.01069.pdf)

- (arXiv 2022.10) LANGUAGE-AWARE **SOFT PROMPTING** FOR **VISION & LANGUAGE** FOUNDATION MODELS, [[Paper]](https://arxiv.org/pdf/2210.01115.pdf)

- (arXiv 2022.10) ASIF: COUPLED DATA TURNS UNIMODAL MODELS TO **MULTIMODAL** WITHOUT TRAINING, [[Paper]](https://arxiv.org/pdf/2210.01738.pdf)

- (arXiv 2022.10) ImmFusion: Robust mmWave-RGB Fusion for **3D Human Body Reconstruction** in All Weather Conditions, [[Paper]](https://arxiv.org/pdf/2210.01346.pdf)

- (arXiv 2022.10) PROMPT LEARNING WITH **OPTIMAL TRANSPORT** FOR **VISION-LANGUAGE** MODELS, [[Paper]](https://arxiv.org/pdf/2210.01253.pdf)

- (arXiv 2022.10) Bridged Transformer for Vision and Point Cloud **3D Object Detection**, [[Paper]](https://arxiv.org/pdf/2210.01391.pdf)

- (arXiv 2022.10) Dense Prediction Transformer for Scale Estimation in Monocular Visual **Odometry**, [[Paper]](https://arxiv.org/pdf/2210.01723.pdf)

- (arXiv 2022.10) HUMAN **MOTION** **DIFFUSION** MODEL, [[Paper]](https://arxiv.org/pdf/2209.14916.pdf), [[Project]](https://guytevet.github.io/mdm-page/)

- (arXiv 2022.10) TokenFlow: Rethinking Fine-grained Cross-modal Alignment in **Vision-Language Retrieval**, [[Paper]](https://arxiv.org/pdf/2209.13822.pdf)

- (arXiv 2022.10) Uni**CLIP**: Unified Framework for Contrastive **Language–Image** Pre-training, [[Paper]](https://arxiv.org/pdf/2209.13430.pdf)

- (arXiv 2022.10) CrossDTR: Cross-view and Depth-guided Transformers for **3D Object Detection**, [[Paper]](https://arxiv.org/pdf/2209.13507.pdf), [[Code]](https://github.com/sty61010/CrossDTR)

- (arXiv 2022.10) Multi-dataset Training of Transformers for Robust **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2209.12362.pdf), [[Code]](https://github.com/JunweiLiang/MultiTrain)

- (arXiv 2022.10) Multi-Scale **Human-Object Interaction** Detector, [[Paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9927451)

- (arXiv 2022.10) LGDN: Language-Guided Denoising Network for **Video-Language** Modeling, [[Paper]](https://arxiv.org/pdf/2209.11388.pdf)

- (arXiv 2022.10) RaP: Redundancy-aware Video-language Pre-training for **Text-Video** Retrieval, [[Paper]](https://arxiv.org/pdf/2210.06881.pdf), [[Code]](https://github.com/caskcsg/VLP/tree/main/RaP)

- (arXiv 2022.10) Intermediate Prototype Mining Transformer for Few-Shot **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2210.06780.pdf), [[Code]](https://github.com/LIUYUANWEI98/IPMT)

- (arXiv 2022.10) Decoding Visual Neural Representations by Multimodal Learning of **Brain-Visual-Linguistic** Features, [[Paper]](https://arxiv.org/pdf/2210.06756.pdf), [[Code]](https://github.com/ChangdeDu/BraVL)

- (arXiv 2022.10) Q-ViT: Accurate and Fully **Quantized** Low-bit Vision Transformer, [[Paper]](https://arxiv.org/pdf/2210.06707.pdf), [[Code]](https://github.com/YanjingLi0202/Q-ViT)

- (arXiv 2022.10) Prepended Domain Transformer: Heterogeneous **Face Recognition** without Bells and Whistles, [[Paper]](https://arxiv.org/pdf/2210.06529.pdf)

- (arXiv 2022.10) Visual Knowledge Graph for Human **Action Reasoning in Videos**, [[Paper]](https://dl.acm.org/doi/pdf/10.1145/3503161.3548257)

- (arXiv 2022.10) Human Joint Kinematics Diffusion-Refinement for Stochastic **Motion Prediction**, [[Paper]](https://arxiv.org/pdf/2210.05976.pdf)

- (arXiv 2022.10) VIMA: GENERAL **ROBOT MANIPULATION** WITH MULTIMODAL **PROMPTS**, [[Paper]](https://arxiv.org/pdf/2210.03094.pdf), [[Project]](https://vimalabs.github.io/)

- (arXiv 2022.10) What Should the System Do Next?: **Operative Action Captioning** for Estimating System Actions, [[Paper]](https://arxiv.org/pdf/2210.02735.pdf)

- (arXiv 2022.10) DMMGAN: Diverse Multi **Motion Prediction of 3D Human Joints** using Attention-Based Generative Adversarial Network, [[Paper]](https://arxiv.org/pdf/2209.09124.pdf)

- (arXiv 2022.10) PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to **6 DoF Tracking**, [[Paper]](https://arxiv.org/pdf/2209.07589.pdf), [[Code]](https://github.com/nv-nguyen/pizza)

### 2022.09

- (arXiv 2022.09) SELF-DISTILLATION FOR FURTHER **PRE-TRAINING** OF TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2210.02871.pdf)

- (arXiv 2022.09) **Visuo-Tactile** Transformers for Manipulation, [[Paper]](https://arxiv.org/pdf/2210.00121.pdf), [[Project]](https://www.mmintlab.com/vtt)

- (arXiv 2022.09) UNDERSTANDING PURE **CLIP** GUIDANCE FOR **VOXEL** GRID **NERF** MODELS, [[Paper]](https://arxiv.org/pdf/2209.15172.pdf), [[Project]](https://hanhung.github.io/PureCLIPNeRF/)

- (arXiv 2022.09) Dual Progressive Transformations for Weakly Supervised Semantic **Segmentation**, [[Paper]](https://arxiv.org/pdf/2209.15211.pdf), [[Code]](https://github.com/huodongjian0603/crt)

- (arXiv 2022.09) Transformers for Object **Detection** in Large **Point Clouds**, [[Paper]](https://arxiv.org/pdf/2209.15258.pdf)

- (arXiv 2022.09) **DIFFUSION**-BASED **IMAGE TRANSLATION** USING DISENTANGLED STYLE AND CONTENT REPRESENTATION, [[Paper]](https://arxiv.org/pdf/2209.15264.pdf)

- (arXiv 2022.09) ERNIE-VIL 2.0: MULTI-VIEW CONTRASTIVE LEARNING FOR **IMAGE-TEXT** PRE-TRAINING, [[Paper]](https://arxiv.org/pdf/2209.15270.pdf), [[Code]](https://github.com/PaddlePaddle/ERNIE/)

- (arXiv 2022.09) LEARNING TRANSFERABLE **SPATIOTEMPORAL** REPRESENTATIONS FROM NATURAL **SCRIPT** KNOWLEDGE, [[Paper]](https://arxiv.org/pdf/2209.15280.pdf)

- (arXiv 2022.09) SMALLCAP: Lightweight Image **Captioning** Prompted with Retrieval Augmentation, [[Paper]](https://arxiv.org/pdf/2209.15323.pdf), [[Code]](https://github.com/RitaRamo/smallcap)

- (arXiv 2022.09) SPIKFORMER: WHEN **SPIKING NEURAL NETWORK** MEETS TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2209.15425.pdf)

- (arXiv 2022.09) F-VLM: OPEN-VOCABULARY OBJECT **DETECTION** UPON FROZEN VISION AND LANGUAGE MODELS, [[Paper]](https://arxiv.org/pdf/2209.15639.pdf)

- (arXiv 2022.09) CONTRASTIVE CORPUS ATTRIBUTION FOR EXPLAINING REPRESENTATIONS, [[Paper]](https://arxiv.org/pdf/2210.00107.pdf)

- (arXiv 2022.09) Alignment-guided Temporal Attention for **Video Action Recognition**, [[Paper]](https://arxiv.org/pdf/2210.00132.pdf)

- (arXiv 2022.09) EDA: Explicit Text-Decoupling and Dense Alignment for **3D Visual and Language** Learning, [[Paper]](https://arxiv.org/pdf/2209.14941.pdf), [[Code]](https://github.com/yanmin-wu/EDA)

- (arXiv 2022.09) SPOTLIGHT: **MOBILE UI UNDERSTANDING** USING VISION-LANGUAGE MODELS WITH A FOCUS, [[Paper]](https://arxiv.org/pdf/2209.14927.pdf)

- (arXiv 2022.09) DREAMFUSION: **TEXT-TO-3D** USING 2D **DIFFUSION**, [[Paper]](https://arxiv.org/pdf/2209.14988.pdf), [[Project]](https://dreamfusion3d.github.io/)

- (arXiv 2022.09) REST: RETRIEVE & SELF-TRAIN FOR GENERATIVE **ACTION RECOGNITION**, [[Paper]](https://arxiv.org/pdf/2209.15000.pdf)

- (arXiv 2022.09) **Effective** Vision Transformer **Training**: A Data-Centric Perspective, [[Paper]](https://arxiv.org/pdf/2209.15006.pdf)

- (arXiv 2022.09) Human-in-the-loop Robotic **Grasping** using BERT Scene Representation, [[Paper]](https://arxiv.org/pdf/2209.14026.pdf), [[Project]](https://sites.google.com/view/hitl-grasping-bert)

- (arXiv 2022.09) Revisiting **Few-Shot** Learning from a **Causal** Perspective, [[Paper]](https://arxiv.org/pdf/2209.13816.pdf)

- (arXiv 2022.09) **Attacking** Compressed Vision Transformers, [[Paper]](https://arxiv.org/pdf/2209.13785.pdf)

- (arXiv 2022.09) Adaptive Sparse ViT: Towards Learnable Adaptive **Token Pruning** by Fully Exploiting Self-Attention, [[Paper]](https://arxiv.org/pdf/2209.13802.pdf)

- (arXiv 2022.09) DeViT: Deformed Vision Transformers in **Video Inpainting**, [[Paper]](https://arxiv.org/pdf/2209.13925.pdf)

- (arXiv 2022.09) Obj2Seq: Formatting **Objects** as Sequences with Class Prompt for Visual Tasks, [[Paper]](https://arxiv.org/pdf/2209.13948.pdf), [[Code]](https://github.com/CASIA-IVA-Lab/Obj2Seq)

- (arXiv 2022.09) Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual **Grounding**, [[Paper]](https://arxiv.org/pdf/2209.13959.pdf)

- (arXiv 2022.09) Motion Transformer for Unsupervised **Image Animation**, [[Paper]](https://arxiv.org/pdf/2209.14024.pdf)

- (arXiv 2022.09) Weighted Contrastive **Hashing**, [[Paper]](https://arxiv.org/pdf/2209.14099.pdf), [[Code]](http://github.com/RosieYuu/WCH)

- (arXiv 2022.09) CALIP: **Zero-Shot** Enhancement of **CLIP** with Parameter-free Attention, [[Paper]](https://arxiv.org/pdf/2209.14169.pdf)

- (arXiv 2022.09) Dialog Acts for Task-Driven Embodied Agents, [[Paper]](https://arxiv.org/pdf/2209.12953.pdf)

- (arXiv 2022.09) NEURAL MARIONETTE: A Transformer-based Multi-action Human **Motion Synthesis** System, [[Paper]](https://arxiv.org/pdf/2209.13204.pdf), [[Code]](https://wjohnnyw.github.io/blog/tag2motion/)

- (arXiv 2022.09) Embracing Consistency: A One-Stage Approach for **Spatio-Temporal Video Grounding**, [[Paper]](https://arxiv.org/pdf/2209.13306.pdf), [[Code]](https://github.com/jy0205/STCAT)

- (arXiv 2022.09) Text-Adaptive Multiple Visual Prototype Matching for **Video-Text** Retrieval, [[Paper]](https://arxiv.org/pdf/2209.13307.pdf)

- (arXiv 2022.09) Towards Parameter-Efficient Integration of Pre-Trained **Language** Models In Temporal **Video** Grounding, [[Paper]](https://arxiv.org/pdf/2209.13359.pdf)

- (arXiv 2022.09) **Anomaly Detection** in **Aerial** Videos with Transformers, [[Paper]](https://arxiv.org/pdf/2209.13363.pdf), [[Video]](https://youtu.be/ancczYryOBY)

- (arXiv 2022.09) AdaFocusV3: On Unified Spatial-temporal Dynamic **Video Recognition**, [[Paper]](https://arxiv.org/pdf/2209.13465.pdf)

- (arXiv 2022.09) **Motion** Transformer with Global Intention **Localization** and Local Movement Refinement, [[Paper]](https://arxiv.org/pdf/2209.13508.pdf), [[Code]](https://github.com/sshaoshuai/MTR)

- (arXiv 2022.09) FREESEG: FREE MASK FROM INTERPRETABLE CONTRASTIVE LANGUAGE-IMAGE PRETRAINING FOR **SEMANTIC SEGMENTATION**, [[Paper]](https://arxiv.org/pdf/2209.13558.pdf)

- (arXiv 2022.09) Learning State-Aware Visual Representations from Audible **Interactions**, [[Paper]](https://arxiv.org/pdf/2209.13583.pdf), [[Code]](https://github.com/HimangiM/RepLAI)

- (arXiv 2022.09) Towards Explainable **3D** Grounded **Visual Question Answering**: A New Benchmark and Strong Baseline, [[Paper]](https://arxiv.org/pdf/2209.12028.pdf)

- (arXiv 2022.09) Leveraging Self-Supervised Training for **Unintentional Action Recognition**, [[Paper]](https://arxiv.org/pdf/2209.11870.pdf)

- (arXiv 2022.09) NeRF-Loc: Transformer-Based **Object Localization** Within **Neural Radiance Fields**, [[Paper]](https://arxiv.org/pdf/2209.12068.pdf)

- (arXiv 2022.09) All are Worth Words: a ViT Backbone for Score-based **Diffusion** Models, [[Paper]](https://arxiv.org/pdf/2209.12152.pdf)

- (arXiv 2022.09) Paraphrasing Is All You Need for Novel Object **Captioning**, [[Paper]](https://arxiv.org/pdf/2209.12343.pdf)

- (arXiv 2022.09) Collaboration of Pre-trained Models Makes Better **Few-shot** Learner, [[Paper]](https://arxiv.org/pdf/2209.12255.pdf)

- (arXiv 2022.09) Multi-modal **Video Chapter Generation**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2209/2209.12694.pdf), [[Code]](https://github.com/czt117/MVCG)

- (arXiv 2022.09) **Best Prompts** for **Text-to-Image** Models and How to Find Them, [[Paper]](https://arxiv.org/pdf/2209.11711)

- (arXiv 2022.09) Swin2SR: SwinV2 Transformer for **Compressed Image Super-Resolution** and **Restoration**, [[Paper]](https://arxiv.org/pdf/2209.11345.pdf), [[Code]](https://github.com/mv-lab/swin2sr)

- (arXiv 2022.09) 3DPCT: 3D **Point Cloud** Transformer with Dual Self-attention, [[Paper]](https://arxiv.org/pdf/2209.11255.pdf)

- (arXiv 2022.09) **LIGHTWEIGHT** TRANSFORMERS FOR HUMAN **ACTIVITY RECOGNITION** ON MOBILE DEVICES, [[Paper]](https://arxiv.org/pdf/2209.11750.pdf)

- (arXiv 2022.09) PACT: Perception-Action Causal Transformer for **Autoregressive Robotics Pre-Training**, [[Paper]](https://arxiv.org/pdf/2209.11133.pdf)

- (arXiv 2022.09) UniColor: A Unified Framework for Multi-Modal **Colorization** with Transformer, [[Paper]](https://arxiv.org/pdf/2209.11223.pdf), [[Code]](https://luckyhzt.github.io/unicolor)

- (arXiv 2022.09) **Traffic Accident Risk Forecasting** using Contextual Vision Transformers, [[Paper]](https://arxiv.org/pdf/2209.11180.pdf)

- (arXiv 2022.09) CONE: An Efficient COarse-to-fiNE Alignment Framework for **Long Video Temporal Grounding**, [[Paper]](https://arxiv.org/pdf/2209.10918.pdf)

- (arXiv 2022.09) **Recipe Generation** from Unsegmented Cooking Videos, [[Paper]](https://arxiv.org/pdf/2209.10134.pdf)

- (arXiv 2022.09) PicT: A Slim Weakly Supervised Vision Transformer for **Pavement Distress Classification**, [[Paper]](https://arxiv.org/pdf/2209.10074.pdf), [[Code]](https://github.com/DearCaat/PicT)

- (arXiv 2022.09) Show, Interpret and Tell: Entity-aware Contextualised Image **Captioning** in Wikipedia, [[Paper]](https://arxiv.org/pdf/2209.10474.pdf)

- (arXiv 2022.09) RNGDet++: **Road Network Graph Detection** by Transformer with Instance Segmentation and Multi-scale Features Enhancement, [[Paper]](https://arxiv.org/pdf/2209.10150.pdf), [[Code]](https://tonyxuqaq.github.io/projects/RNGDetPlusPlus/)

- (arXiv 2022.09) Toward 3D Spatial Reasoning for Human-like Text-based **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2209.10326.pdf)

- (arXiv 2022.09) I2DFormer: Learning **Image** to **Document** Attention for **Zero-Shot Image Classification**, [[Paper]](https://arxiv.org/pdf/2209.10304.pdf)

- (arXiv 2022.09) **Integer** Fine-tuning of Transformer-based Models, [[Paper]](https://arxiv.org/pdf/2209.09815.pdf)

- (arXiv 2022.09) Open-vocabulary Queryable Scene Representations for **Real World Planning**, [[Paper]](https://arxiv.org/pdf/2209.09874.pdf), [[Code]](https://nlmap-saycan.github.io/)

- (arXiv 2022.09) Det**CLIP**: Dictionary-Enriched Visual-Concept Paralleled Pre-training for **Open-world Detection**, [[Paper]](https://arxiv.org/pdf/2209.09407.pdf)

- (arXiv 2022.09) Hierarchical Temporal Transformer for **3D Hand Pose Estimation** and **Action Recognition** from **Egocentric** RGB Videos, [[Paper]](https://arxiv.org/pdf/2209.09484.pdf)

- (arXiv 2022.09) **Graph** Reasoning Transformer for **Image Parsing**, [[Paper]](https://arxiv.org/pdf/2209.09545.pdf)

- (arXiv 2022.09) **Quantum** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2209.08167.pdf)

- (arXiv 2022.09) Active **Visual Search** in the Wild, [[Paper]](https://arxiv.org/pdf/2209.08803.pdf)

- (arXiv 2022.09) PPT: token-Pruned Pose Transformer for monocular and multi-view human **pose estimation**, [[Paper]](https://arxiv.org/pdf/2209.08194.pdf), [[Code]](https://github.com/HowieMa/PPT)

- (arXiv 2022.09) Learning Distinct and Representative **Modes** for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2209.08231.pdf), [[Code]](https://github.com/bladewaltz1/ModeCap)

- (arXiv 2022.09) TODE-Trans: **Transparent** Object **Depth Estimation** with Transformer, [[Paper]](https://arxiv.org/pdf/2209.08455.pdf), [[Code]](https://github.com/yuchendoudou/TODE)

- (arXiv 2022.09) Tree-based **Text-Vision** BERT for Video Search in Baidu Video Advertising, [[Paper]](https://arxiv.org/pdf/2209.08759.pdf)

- (arXiv 2022.09) Integrative Feature and Cost Aggregation with Transformers for **Dense Correspondence**, [[Paper]](https://arxiv.org/pdf/2209.08742.pdf)

- (arXiv 2022.09) Axially Expanded Windows for **Local-Global Interaction** in Vision Transformers, [[Paper]](https://arxiv.org/pdf/2209.08726.pdf)

- (arXiv 2022.09) UNCERTAINTY AWARE MULTITASK PYRAMID VISION TRANSFORMER FOR **UAV**-BASED **OBJECT RE-IDENTIFICATION**, [[Paper]](https://arxiv.org/pdf/2209.08686.pdf)

- (arXiv 2022.09) TASKED: Transformer-based Adversarial learning for **human activity recognition** using **wearable sensors** via Self-KnowledgE Distillation, [[Paper]](https://arxiv.org/pdf/2209.09092.pdf)

- (arXiv 2022.09) EcoFormer: Energy-Saving Attention with **Linear Complexity**, [[Paper]](https://arxiv.org/pdf/2209.09004.pdf), [[Code]](https://github.com/ziplab/EcoFormer)

- (arXiv 2022.09) **Panoramic** Vision Transformer for **Saliency Detection** in 360° Videos, [[Paper]](https://arxiv.org/pdf/2209.08956.pdf)

- (arXiv 2022.09) THE BIASED ARTIST: EXPLOITING CULTURAL **BIASES** VIA HOMOGLYPHS IN **TEXT-GUIDED IMAGE GENERATION MODELS**, [[Paper]](https://arxiv.org/pdf/2209.08891.pdf)

- (arXiv 2022.09) **Scene Graph Modification** as Incremental Structure Expanding, [[Paper]](https://arxiv.org/pdf/2209.09093.pdf), [[Code]](https://github.com/THU-BPM/SGM)

- (arXiv 2022.09) Discriminative Sampling of Proposals in Self-Supervised Transformers for **Weakly Supervised Object Localization**, [[Paper]](https://arxiv.org/pdf/2209.09209.pdf), [[Code]](https://github.com/shakeebmurtaza/dips)

- (arXiv 2022.09) Real-time **Online Video Detection** with Temporal Smoothing Transformers, [[Paper]](https://arxiv.org/pdf/2209.09236.pdf)

- (arXiv 2022.09) ViT-DD: Multi-Task Vision Transformer for Semi-Supervised **Driver Distraction Detection**, [[Paper]](https://arxiv.org/pdf/2209.09178.pdf), [[Code]](https://github.com/PurdueDigitalTwin/ViT-DD)

- (arXiv 2022.09) Code as Policies: Language Model Programs for **Embodied Control**, [[Paper]](https://arxiv.org/pdf/2209.07753.pdf), [[Project]](https://code-as-policies.github.io/)

- (arXiv 2022.09) SQ-Swin: a Pretrained Siamese Quadratic Swin Transformer for **Lettuce Browning Prediction**, [[Paper]](https://arxiv.org/pdf/2209.07683.pdf)

- (arXiv 2022.09) Self-Attentive Pooling for **Efficient** Deep Learning, [[Paper]](https://arxiv.org/pdf/2209.07659.pdf)

- (arXiv 2022.09) Domain-Unified Prompt Representations for Source-Free **Domain Generalization**, [[Paper]](https://arxiv.org/pdf/2209.14926.pdf), [[Code]](https://github.com/muse1998/Source-Free-Domain-Generalization)

- (arXiv 2022.09) BRIDGING THE GAP TO REAL-WORLD **OBJECT-CENTRIC LEARNING**, [[Paper]](https://arxiv.org/pdf/2209.14860.pdf)

- (arXiv 2022.09) Prompt-guided **Scene Generation** for **3D** Zero-Shot Learning, [[Paper]](https://arxiv.org/pdf/2209.14690.pdf)

- (arXiv 2022.09) RE-IMAGEN: RETRIEVAL-AUGMENTED **TEXT-TO-IMAGE** GENERATOR, [[Paper]](https://arxiv.org/pdf/2209.14491.pdf)

- (arXiv 2022.09) Distribution Aware **Metrics** for Conditional Natural **Language Generation**, [[Paper]](https://arxiv.org/pdf/2209.07518.pdf)

- (arXiv 2022.09) **CLIP**ping Privacy: Identity Inference **Attacks** on Multi-Modal Machine Learning Models, [[Paper]](https://arxiv.org/pdf/2209.07341.pdf)

- (arXiv 2022.09) Finetuning Pretrained **Vision-Language** Models with Correlation Information Bottleneck for Robust **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2209.06954.pdf)

- (arXiv 2022.09) PriorLane: A Prior Knowledge Enhanced **Lane Detection** Approach Based on Transformer, [[Paper]](https://arxiv.org/pdf/2209.06994.pdf), [[Code]](https://github.com/vincentqqb/PriorLane)

- (arXiv 2022.09) Can We Solve **3D** Vision Tasks Starting from A **2D** Vision **Transformer**? [[Paper]](https://arxiv.org/pdf/2209.07026.pdf), [[Code]](https://github.com/VITA-Group/Simple3D-Former.git)

- (arXiv 2022.09) EXPLORING VISUAL INTERPRETABILITY FOR CONTRASTIVE **LANGUAGE-IMAGE** PRE-TRAINING, [[Paper]](https://arxiv.org/pdf/2209.07046.pdf)

- (arXiv 2022.09) OmniVL: One Foundation Model for **Image-Language** and **Video-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2209.07526.pdf)

- (arXiv 2022.09) Test-Time Training with **Masked Autoencoders**, [[Paper]](https://arxiv.org/pdf/2209.07522.pdf), [[Code]](https://yossigandelsman.github.io/ttt_mae/index.html)

- (arXiv 2022.09) VISUAL **RECOGNITION** WITH DEEP NEAREST **CENTROIDS**, [[Paper]](https://arxiv.org/pdf/2209.07383.pdf), [[Code]](https://github.com/ChengHan111/DNC)

- (arXiv 2022.09) One-Shot Transfer of **Affordance** Regions? AffCorrs! [[Paper]](https://arxiv.org/pdf/2209.07147.pdf), [[Code]](https://sites.google.com/view/affcorrs)

- (arXiv 2022.09) Test-Time **Prompt Tuning** for Zero-Shot Generalization in **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2209.07511.pdf), [[Code]](https://azshue.github.io/TPT/)

- (arXiv 2022.09) A Light Recipe to Train **Robust** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2209.07399.pdf), [[Code]](https://github.com/dedeswim/vits-robustness-torch)

- (arXiv 2022.09) On the Surprising Effectiveness of Transformers in Low-Labeled **Video Recognition**, [[Paper]](https://arxiv.org/pdf/2209.07474.pdf)

- (arXiv 2022.09) Number of **Attention Heads** vs. Number of Transformer-**Encoders** in Computer Vision, [[Paper]](https://arxiv.org/pdf/2209.07221.pdf)

- (arXiv 2022.09) Global Semantic Descriptors for **Zero-Shot Action Recognition**, [[Paper]](https://arxiv.org/pdf/2209.12061.pdf), [[Code]](https://github.com/valterlej/objsentzsar)

- (arXiv 2022.09) Revisiting Neural **Scaling Laws** in Language and Vision, [[Paper]](https://arxiv.org/pdf/2209.06640.pdf)

- (arXiv 2022.09) Small Transformers Compute Universal **Metric Embeddings**, [[Paper]](https://arxiv.org/pdf/2209.06788.pdf)

- (arXiv 2022.09) **CLIP**-ViP: Adapting Pre-trained Image-Text Model to **Video-Language** Representation Alignment, [[Paper]](https://arxiv.org/pdf/2209.06430.pdf), [[Code]](https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP)

- (arXiv 2022.09) CRAFT: Camera-Radar **3D Object Detection** with Spatio-Contextual Fusion Transformer, [[Paper]](https://arxiv.org/pdf/2209.06535.pdf)

- (arXiv 2022.09) Transformers and CNNs both Beat Humans on **SBIR**, [[Paper]](https://arxiv.org/pdf/2209.06629.pdf)

- (arXiv 2022.09) PaLI: A Jointly-Scaled **Multilingual** **Language-Image** Model, [[Paper]](https://arxiv.org/pdf/2209.06794.pdf)

- (arXiv 2022.09) MUST-VQA: MUltilingual Scene-text **VQA**, [[Paper]](https://arxiv.org/pdf/2209.06730.pdf)

- (arXiv 2022.09) Leveraging Large Language Models for **Robot 3D Scene Understanding**, [[Paper]](https://arxiv.org/pdf/2209.05629.pdf), [[Code]](https://github.com/MIT-SPARK/llm_scene_understanding)

- (arXiv 2022.09) A lightweight Transformer-based model for **fish landmark detection**, [[Paper]](https://arxiv.org/pdf/2209.05777.pdf)

- (arXiv 2022.09) PSAQ-ViT V2: Towards Accurate and General Data-Free **Quantization** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2209.05687.pdf), [[Code]](https://github.com/zkkli/PSAQ-ViT)

- (arXiv 2022.09) ComplETR: Reducing the cost of annotations for object **detection** in dense scenes with vision transformers, [[Paper]](https://arxiv.org/pdf/2209.05654.pdf)

- (arXiv 2022.09) Semantic2Graph: Graph-based Multi-modal Feature for **Action Segmentation** in **Videos**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2209/2209.05653.pdf)

- (arXiv 2022.09) CenterFormer: Center-based Transformer for **3D Object Detection**, [[Paper]](https://arxiv.org/pdf/2209.05588.pdf), [[Code]](https://github.com/TuSimple/centerformer)

- (arXiv 2022.09) PreSTU: Pre-Training for **Scene-Text** Understanding, [[Paper]](https://arxiv.org/pdf/2209.05534.pdf)

- (arXiv 2022.09) OmDet: Language-Aware Object **Detection** with Large-scale **Vision-Language** Multi-dataset Pre-training, [[Paper]](https://arxiv.org/pdf/2209.05946.pdf)

- (arXiv 2022.09) DMTNet: Dynamic Multi-scale Network for Dual-pixel Images **Defocus Deblurring** with Transformer, [[Paper]](https://arxiv.org/pdf/2209.06040.pdf)

- (arXiv 2022.09) SeRP: Self-Supervised Representation Learning Using Perturbed **Point Clouds**, [[Paper]](https://arxiv.org/pdf/2209.06067.pdf)

- (arXiv 2022.09) VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2209.06103.pdf)

- (arXiv 2022.09) Story**DALL-E**: Adapting Pretrained Text-to-Image Transformers for **Story Continuation**, [[Paper]](https://arxiv.org/pdf/2209.06192.pdf), [[Code]](https://github.com/adymaharana/storydalle)

- (arXiv 2022.09) ON THE **COMPUTATIONAL COMPLEXITY** OF SELF-ATTENTION, [[Paper]](https://arxiv.org/pdf/2209.04881.pdf)

- (arXiv 2022.09) Instruction-driven history-aware policies for **robotic manipulations**, [[Paper]](https://arxiv.org/pdf/2209.04899.pdf), [[Code]](https://guhur.github.io/hiveformer/)

- (arXiv 2022.09) Towards Multi-Lingual **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2209.05401.pdf)

- (arXiv 2022.09) PERCEIVER-ACTOR: A Multi-Task Transformer for **Robotic Manipulation**, [[Paper]](https://arxiv.org/pdf/2209.05451.pdf), [[Project]](https://peract.github.io/)

- (arXiv 2022.09) GLOBAL PROTOTYPE ENCODING FOR INCREMENTAL **VIDEO HIGHLIGHTS DETECTION**, [[Paper]](https://arxiv.org/pdf/2209.05166.pdf), [[Code]](https://github.com/ForeverPs/GPE)

- (arXiv 2022.09) Self-Supervised Multimodal Fusion Transformer for **Passive Activity Recognition**, [[Paper]](https://arxiv.org/pdf/2209.03765.pdf)

- (arXiv 2022.09) FETA: Towards Specializing **Foundation Models** for **Expert Task** Applications, [[Paper]](https://arxiv.org/pdf/2209.03648.pdf)

- (arXiv 2022.09) Prior Knowledge-Guided **Attention** in Self-Supervised Vision Transformers, [[Paper]](https://arxiv.org/pdf/2209.03745.pdf)

- (arXiv 2022.09) Exploring Target Representations for **Masked Autoencoders**, [[Paper]](https://arxiv.org/pdf/2209.03917.pdf)

- (arXiv 2022.09) ISS: IMAGE AS STEPPING STONE FOR **TEXT-GUIDED 3D SHAPE GENERATION**, [[Paper]](https://arxiv.org/pdf/2209.04145.pdf)

- (arXiv 2022.09) Towards Confidence-guided **Shape Completion** for **Robotic** Applications, [[Paper]](https://arxiv.org/pdf/2209.04300.pdf), [[Code]](https://github.com/andrearosasco/hyperpcr)

- (arXiv 2022.09) Pre-training **image-language** transformers for **open-vocabulary** tasks, [[Paper]](https://arxiv.org/pdf/2209.04372.pdf)

- (arXiv 2022.09) Improved Masked **Image Generation** with Token-Critic, [[Paper]](https://arxiv.org/pdf/2209.04439.pdf)

- (arXiv 2022.09) Do As I Can, Not As I Say: **Grounding Language** in **Robotic** Affordances, [[Paper]](https://arxiv.org/pdf/2204.01691.pdf), [[Code]](https://say-can.github.io/)

- (arXiv 2022.09) Uformer-ICS: A Specialized U-Shaped Transformer for **Image Compressive Sensing**, [[Paper]](https://arxiv.org/pdf/2209.01763.pdf)

- (arXiv 2022.09) An Empirical Study of End-to-End **Video-Language** Transformers with **Masked** Visual Modeling, [[Paper]](https://arxiv.org/pdf/2209.01540.pdf)

- (arXiv 2022.09) Spatial-Temporal Transformer for **Video Snapshot Compressive Imaging**, [[Paper]](https://arxiv.org/pdf/2209.01578.pdf), [[Code]](https://github.com/ucaswangls/STFormer)

- (arXiv 2022.09) MAFormer: A Transformer Network with Multi-scale **Attention** Fusion for Visual Recognition, [[Paper]](https://arxiv.org/pdf/2209.01620.pdf)

- (arXiv 2022.09) SEFormer: Structure Embedding Transformer for **3D Object Detection**, [[Paper]](https://arxiv.org/pdf/2209.01745.pdf)

- (arXiv 2022.09) ADTR: **Anomaly Detection** Transformer with Feature Reconstruction, [[Paper]](https://arxiv.org/pdf/2209.01816.pdf)

- (arXiv 2022.09) Learning Canonical Embeddings for Unsupervised Shape **Correspondence** with Locally Linear Transformations, [[Paper]](https://arxiv.org/pdf/2209.02152.pdf)

- (arXiv 2022.09) Transformer-CNN Cohort: **Semi-supervised Semantic Segmentation** by the Best of Both Students, [[Paper]](https://arxiv.org/pdf/2209.02178.pdf)

- (arXiv 2022.09) PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards **Video Object Detection**, [[Paper]](https://arxiv.org/pdf/2209.02242.pdf), [[Code]](https://github.com/Hon-Wong/PTSEFormer)

- (arXiv 2022.09) VITKD: PRACTICAL GUIDELINES FOR VIT FEATURE **KNOWLEDGE DISTILLATION**, [[Paper]](https://arxiv.org/pdf/2209.02432.pdf), [[Code]](https://github.com/yzd-v/cls_KD)

- (arXiv 2022.09) DPIT: Dual-Pipeline Integrated Transformer for Human **Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2209.02431.pdf)

- (arXiv 2022.09) SkeletonMAE: Spatial-Temporal **Masked Autoencoders** for Self-supervised **Skeleton Action Recognition**, [[Paper]](https://arxiv.org/pdf/2209.02399.pdf)

- (arXiv 2022.09) What does a platypus look like? Generating customized prompts for **zero-shot image classification**, [[Paper]](https://arxiv.org/pdf/2209.03320.pdf), [[Code]](https://github.com/sarahpratt/CuPL)
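
  For context, a minimal, hedged sketch of the general recipe behind such work (ensembling several customized text prompts per class for CLIP zero-shot classification), written against the openai/CLIP package; the class names and descriptions below are illustrative placeholders, not the prompts generated in the paper:

  ```python
  # Sketch: CLIP zero-shot classification with several customized prompts per class.
  # Assumes the openai/CLIP package: pip install git+https://github.com/openai/CLIP.git
  import torch
  import clip
  from PIL import Image

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model, preprocess = clip.load("ViT-B/32", device=device)

  # Placeholder descriptions (hypothetical, not the paper's generated prompts).
  class_prompts = {
      "platypus": ["a photo of a platypus", "a duck-billed mammal with a flat tail"],
      "beaver": ["a photo of a beaver", "a large rodent with a paddle-shaped tail"],
  }

  with torch.no_grad():
      # One text embedding per class: average its (normalized) prompt embeddings.
      class_embs = []
      for prompts in class_prompts.values():
          emb = model.encode_text(clip.tokenize(prompts).to(device))
          emb = emb / emb.norm(dim=-1, keepdim=True)
          class_embs.append(emb.mean(dim=0))
      text_features = torch.stack(class_embs)
      text_features = text_features / text_features.norm(dim=-1, keepdim=True)

      image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
      image_features = model.encode_image(image)
      image_features = image_features / image_features.norm(dim=-1, keepdim=True)

      probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
  print(dict(zip(class_prompts.keys(), probs[0].tolist())))
  ```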

- (arXiv 2022.09) AI Illustrator: Translating Raw Descriptions into **Images** by **Prompt**-based Cross-Modal **Generation**, [[Paper]](https://arxiv.org/pdf/2209.03160.pdf), [[Code]](https://github.com/researchmm/AI_Illustrator)

- (arXiv 2022.09) MimCo: **Masked** Image Modeling Pre-training with Contrastive Teacher, [[Paper]](https://arxiv.org/pdf/2209.03063.pdf)

- (arXiv 2022.09) **Multi-modal** Contrastive Representation Learning for **Entity Alignment**, [[Paper]](https://arxiv.org/pdf/2209.00891.pdf)

- (arXiv 2022.09) Zero-Shot Multi-Modal **Artist-Controlled Retrieval** and **Exploration** of **3D** Object Sets, [[Paper]](https://arxiv.org/pdf/2209.00682.pdf)

- (arXiv 2022.09) Geometry Aligned Variational Transformer for **Image-conditioned Layout Generation**, [[Paper]](https://arxiv.org/pdf/2209.00852.pdf)

- (arXiv 2022.09) Real-time **3D** Single Object **Tracking** with Transformer, [[Paper]](https://arxiv.org/pdf/2209.00860.pdf), [[Code]](https://github.com/shanjiayao/PTT)

- (arXiv 2022.09) Video-Guided Curriculum Learning for **Spoken Video Grounding**, [[Paper]](https://arxiv.org/pdf/2209.00277.pdf), [[Code]](https://github.com/marmot-xy/Spoken-Video-Grounding)

- (arXiv 2022.09) FLAME: Free-form Language-based **Motion Synthesis** & **Editing**, [[Paper]](https://arxiv.org/pdf/2209.00349.pdf)

- (arXiv 2022.09) TOKENCUT: **SEGMENTING** OBJECTS IN IMAGES AND VIDEOS WITH SELF-SUPERVISED TRANSFORMER AND NORMALIZED CUT, [[Paper]](https://arxiv.org/pdf/2209.00383.pdf), [[Code]](https://www.m-psi.fr/Papers/TokenCut2022/)

- (arXiv 2022.09) Unified Fully and Timestamp Supervised **Temporal Action Segmentation** via Sequence to Sequence Translation, [[Paper]](https://arxiv.org/pdf/2209.00638.pdf)

- (arXiv 2022.09) MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised **Point Cloud Action Recognition**, [[Paper]](https://arxiv.org/pdf/2209.00407.pdf), [[Project]](http://xiaodongchen.cn/MAPLE/)

- (arXiv 2022.09) **Visual Prompting** via Image Inpainting, [[Paper]](https://arxiv.org/pdf/2209.00647.pdf), [[Project]](https://yossigandelsman.github.io/visual_prompt)

- (arXiv 2022.09) RLIP: Relational **Language-Image** Pre-training for **Human-Object Interaction** Detection, [[Paper]](https://arxiv.org/pdf/2209.01814.pdf), [[Code]](https://github.com/JacobYuan7/RLIP)

### 2022.08

- (arXiv 2022.08) On Grounded Planning for **Embodied** Tasks with Language Models, [[Paper]](https://arxiv.org/pdf/2209.00465.pdf), [[Project]](https://inklab.usc.edu/G-PlanET/)

- (arXiv 2022.08) **Group Activity Recognition** in Basketball Tracking Data - Neural Embeddings in Team Sports (NETS), [[Paper]](https://arxiv.org/pdf/2209.00451.pdf)

- (arXiv 2022.08) SWIN-TRANSFORMER-YOLOV5 FOR REAL-TIME WINE GRAPE BUNCH **DETECTION**, [[Paper]](https://arxiv.org/pdf/2208.14508.pdf)

- (arXiv 2022.08) SIM-Trans: Structure Information Modeling Transformer for **Fine-grained Visual Categorization**, [[Paper]](https://arxiv.org/pdf/2208.14607.pdf), [[Code]](https://github.com/PKU-ICST-MIPL/SIM-Trans_ACMMM2022)

- (arXiv 2022.08) INJECTING **IMAGE DETAILS** INTO **CLIP**’S FEATURE SPACE, [[Paper]](https://arxiv.org/pdf/2208.14649.pdf)

- (arXiv 2022.08) Hierarchical Local-Global Transformer for **Temporal Sentence Grounding**, [[Paper]](https://arxiv.org/pdf/2208.14882.pdf)

- (arXiv 2022.08) EViT: **Privacy**-Preserving **Image Retrieval** via Encrypted Vision Transformer in Cloud Computing, [[Paper]](https://arxiv.org/pdf/2208.14657.pdf)

- (arXiv 2022.08) TRUST: An Accurate and End-to-End **Table structure Recognizer** Using Splitting-based Transformers, [[Paper]](https://arxiv.org/pdf/2208.14687.pdf)

- (arXiv 2022.08) ELMformer: Efficient Raw **Image Restoration** with a Locally Multiplicative Transformer, [[Paper]](https://arxiv.org/pdf/2208.14704.pdf), [[Code]](https://github.com/leonmakise/ELMformer)

- (arXiv 2022.08) SoMoFormer: **Multi-Person Pose Forecasting** with Transformers, [[Paper]](https://arxiv.org/pdf/2208.14023.pdf)

- (arXiv 2022.08) A Circular Window-based Cascade Transformer for **Online Action Detection**, [[Paper]](https://arxiv.org/pdf/2208.14209.pdf)

- (arXiv 2022.08) ASpanFormer: Detector-Free **Image Matching** with Adaptive Span Transformer, [[Paper]](https://arxiv.org/pdf/2208.14201.pdf)

- (arXiv 2022.08) Robust Sound-Guided **Image Manipulation**, [[Paper]](https://arxiv.org/pdf/2208.14114.pdf)

- (arXiv 2022.08) TrojViT: **Trojan Insertion** in Vision Transformers, [[Paper]](https://arxiv.org/pdf/2208.13049.pdf)

- (arXiv 2022.08) User-Controllable Latent Transformer for StyleGAN **Image Layout Editing**, [[Paper]](https://arxiv.org/pdf/2208.12408.pdf)

- (arXiv 2022.08) Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for **Few-Shot Classification**, [[Paper]](https://arxiv.org/pdf/2208.12398.pdf)

- (arXiv 2022.08) JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational **Embodied Agents**, [[Paper]](https://arxiv.org/pdf/2208.13266.pdf)

- (arXiv 2022.08) TFusion: Transformer based N-to-One **Multimodal** Fusion Block, [[Paper]](https://arxiv.org/pdf/2208.12776.pdf)

- (arXiv 2022.08) VMFormer: End-to-End **Video Matting** with Transformer, [[Paper]](https://arxiv.org/pdf/2208.12801.pdf), [[Code]](https://chrisjuniorli.github.io/project/VMFormer/)

- (arXiv 2022.08) LOGICRANK: Logic Induced Reranking for Generative **Text-to-Image** Systems, [[Paper]](https://arxiv.org/pdf/2208.13518.pdf)

- (arXiv 2022.08) CLUSTR: EXPLORING **EFFICIENT SELF-ATTENTION** VIA CLUSTERING FOR VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2208.13138.pdf)

- (arXiv 2022.08) **Federated Zero-Shot Learning** with Mid-Level Semantic Knowledge Transfer, [[Paper]](https://arxiv.org/pdf/2208.13465.pdf)

- (arXiv 2022.08) **Prompt Tuning** with Soft Context Sharing for **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2208.13474.pdf)

- (arXiv 2022.08) Efficient **Vision-Language** Pretraining with Visual Concepts and Hierarchical Alignment, [[Paper]](https://arxiv.org/pdf/2208.13628.pdf), [[Code]](https://github.com/mshukor/ViCHA)

- (arXiv 2022.08) CounTR: Transformer-based Generalised Visual **Counting**, [[Paper]](https://arxiv.org/pdf/2208.13721.pdf), [[Code]](https://verg-avesta.github.io/CounTR_Webpage/)

- (arXiv 2022.08) **Open-Set** Semi-Supervised Object **Detection**, [[Paper]](https://arxiv.org/pdf/2208.13722.pdf)

- (arXiv 2022.08) g**Swin**: Gated **MLP** Vision Model with Hierarchical Structure of Shifted Window, [[Paper]](https://arxiv.org/pdf/2208.11718.pdf)

- (arXiv 2022.08) Adaptive Perception Transformer for **Temporal Action Localization**, [[Paper]](https://arxiv.org/pdf/2208.11908.pdf), [[Code]](https://github.com/SouperO/AdaPerFormer)

- (arXiv 2022.08) Symbolic Replay: **Scene Graph** as Prompt for Continual Learning on **VQA** Task, [[Paper]](https://arxiv.org/pdf/2208.12037.pdf), [[Code]](https://github.com/showlab/CLVQA)

- (arXiv 2022.08) **Masked** Autoencoders Enable Efficient **Knowledge Distillers**, [[Paper]](https://arxiv.org/pdf/2208.12256.pdf), [[Code]](https://github.com/UCSC-VLAA/DMAE)

- (arXiv 2022.08) LaTe**RF**: Label and **Text** Driven Object Radiance Fields, [[Paper]](https://arxiv.org/pdf/2207.01583.pdf)

- (arXiv 2022.08) Video Mobile-Former: **Video Recognition** with **Efficient** Global Spatial-temporal Modeling, [[Paper]](https://arxiv.org/pdf/2208.12257.pdf)

- (arXiv 2022.08) Pix4Point: Image Pretrained Transformers for 3D **Point Cloud Understanding**, [[Paper]](https://arxiv.org/pdf/2208.12259.pdf), [[Code]](https://github.com/guochengqian/Pix4Point)

- (arXiv 2022.08) Mask**CLIP**: **Masked** Self-Distillation Advances Contrastive Language-Image Pretraining, [[Paper]](https://arxiv.org/pdf/2208.12262.pdf)

- (arXiv 2022.08) Visual Subtitle Feature Enhanced **Video Outline Generation**, [[Paper]](https://arxiv.org/pdf/2208.11307.pdf), [[Code]](https://github.com/Aopolin-Lv/VSENet)

- (arXiv 2022.08) CATS: COMPLEMENTARY **CNN** AND TRANSFORMER ENCODERS FOR **SEGMENTATION**, [[Paper]](https://arxiv.org/pdf/2208.11572.pdf)

- (arXiv 2022.08) Modeling Paragraph-Level **Vision-Language** Semantic Alignment for Multi-Modal Summarization, [[Paper]](https://arxiv.org/pdf/2208.11303.pdf)

- (arXiv 2022.08) Fashion**VQA**: A Domain-Specific Visual Question Answering System, [[Paper]](https://arxiv.org/pdf/2208.11253.pdf)

- (arXiv 2022.08) K-ORDER GRAPH-ORIENTED TRANSFORMER WITH GRAATTENTION FOR **3D POSE AND SHAPE ESTIMATION**, [[Paper]](https://arxiv.org/pdf/2208.11328.pdf)

- (arXiv 2022.08) Towards **Efficient** Use of Multi-Scale Features in Transformer-Based Object **Detectors**, [[Paper]](https://arxiv.org/pdf/2208.11356.pdf), [[Code]](https://github.com/ZhangGongjie/IMFA)

- (arXiv 2022.08) Improving **video retrieval** using **multilingual** knowledge transfer, [[Paper]](https://arxiv.org/pdf/2208.11553.pdf)

- (arXiv 2022.08) **EFFICIENT** SPARSELY ACTIVATED TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2208.14580.pdf)

- (arXiv 2022.08) M2HF: MULTI-LEVEL MULTI-MODAL HYBRID FUSION FOR **TEXT-VIDEO RETRIEVAL**, [[Paper]](https://arxiv.org/pdf/2208.07664.pdf)

- (arXiv 2022.08) **Accelerating** Vision Transformer Training via a Patch Sampling Schedule, [[Paper]](https://arxiv.org/pdf/2208.09520.pdf), [[Project]](https://github.com/BradMcDanel/pss)

- (arXiv 2022.08) A Dual Modality Approach For (Zero-Shot) **Multi-Label Classification**, [[Paper]](https://arxiv.org/pdf/2208.09562.pdf)

- (arXiv 2022.08) Offline **Handwritten Mathematical Recognition** using Adversarial Learning and Transformers, [[Paper]](https://arxiv.org/pdf/2208.09662.pdf)

- (arXiv 2022.08) Semantic-enhanced Image **Clustering**, [[Paper]](https://arxiv.org/pdf/2208.09849.pdf)

- (arXiv 2022.08) DPTNet: A Dual-Path Transformer Architecture for **Scene Text Detection**, [[Paper]](https://arxiv.org/pdf/2208.09878.pdf)

- (arXiv 2022.08) ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for **Interpretable Image Recognition**, [[Paper]](https://arxiv.org/pdf/2208.10431.pdf), [[Code]](https://github.com/zju-vipa/ProtoPFormer)

- (arXiv 2022.08) Image as a Foreign Language: BEIT Pretraining for All Vision and **Vision-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2208.10442.pdf), [[Project]](https://aka.ms/beit-3)

- (arXiv 2022.08) PoseBERT: A Generic Transformer Module for **Temporal 3D Human Modeling**, [[Paper]](https://arxiv.org/pdf/2208.10211.pdf), [[Code]](https://github.com/naver/posebert)

- (arXiv 2022.08) **EFFICIENT** ATTENTION-FREE **VIDEO** SHIFT TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2208.11108.pdf)

- (arXiv 2022.08) Flat Multi-modal Interaction Transformer for **Named Entity Recognition**, [[Paper]](https://arxiv.org/pdf/2208.11039.pdf)

- (arXiv 2022.08) **Dance Style Transfer** with Cross-modal Transformer, [[Paper]](https://arxiv.org/pdf/2208.09406.pdf)

- (arXiv 2022.08) Improved **Image Classification** with Token Fusion, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2208/2208.09183.pdf)

- (arXiv 2022.08) VAuLT: Augmenting the **Vision-and-Language** Transformer with the Propagation of Deep Language Representations, [[Paper]](https://arxiv.org/pdf/2208.09021.pdf), [[Code]](https://github.com/gchochla/VAuLT)

- (arXiv 2022.08) **TEXT TO IMAGE GENERATION**: LEAVING NO LANGUAGE BEHIND, [[Paper]](https://arxiv.org/pdf/2208.09333.pdf)

- (arXiv 2022.08) Aspect-based **Sentiment Classification** with Sequential Cross-modal Semantic Graph, [[Paper]](https://arxiv.org/pdf/2208.09417.pdf)

- (arXiv 2022.08) Diverse **Video Captioning** by Adaptive Spatio-temporal Attention, [[Paper]](https://arxiv.org/pdf/2208.09266.pdf)

- (arXiv 2022.08) VL**MAE**: **Vision-Language** Masked Autoencoder, [[Paper]](https://arxiv.org/pdf/2208.09374.pdf)

- (arXiv 2022.08) SoMoFormer: Social-Aware Motion Transformer for **Multi-Person Motion Prediction**, [[Paper]](https://arxiv.org/pdf/2208.09224.pdf)

- (arXiv 2022.08) ILLUME: Rationalizing **Vision-Language** Models by Interacting with their Jabber, [[Paper]](https://arxiv.org/pdf/2208.08241.pdf)

- (arXiv 2022.08) ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human **Activity Recognition** in Videos, [[Paper]](https://arxiv.org/pdf/2208.07929.pdf)

- (arXiv 2022.08) UniLayout: Taming Unified Sequence-to-Sequence Transformers for **Graphic Layout Generation**, [[Paper]](https://arxiv.org/pdf/2208.08037.pdf)

- (arXiv 2022.08) InterTrack: Interaction Transformer for **3D Multi-Object Tracking**, [[Paper]](https://arxiv.org/pdf/2208.08041.pdf)

- (arXiv 2022.08) Understanding **Attention** for **Vision-and-Language** Task, [[Paper]](https://arxiv.org/pdf/2208.08104.pdf)

- (arXiv 2022.08) Towards **Open-vocabulary Scene Graph Generation** with Prompt-based Finetuning, [[Paper]](https://arxiv.org/pdf/2208.08165.pdf)

- (arXiv 2022.08) Class-Aware Visual Prompt Tuning for **Vision-Language** Pre-Trained Model, [[Paper]](https://arxiv.org/pdf/2208.08340.pdf)

- (arXiv 2022.08) Unifying Visual **Perception** by Dispersible **Points** Learning, [[Paper]](https://arxiv.org/pdf/2208.08630.pdf), [[Code]](https://github.com/Sense-X/UniHead)

- (arXiv 2022.08) **Text-to-Image Generation** via Implicit Visual Guidance and Hypernetwork, [[Paper]](https://arxiv.org/pdf/2208.08493.pdf)

- (arXiv 2022.08) ConMatch: **Semi-Supervised Learning** with Confidence-Guided Consistency Regularization, [[Paper]](https://arxiv.org/pdf/2208.08631.pdf), [[Code]](https://github.com/JiwonCocoder/ConMatch)

- (arXiv 2022.08) The 8-Point Algorithm as an Inductive Bias for **Relative Pose Prediction** by ViTs, [[Paper]](https://arxiv.org/pdf/2208.08988.pdf)

- (arXiv 2022.08) Open-Vocabulary **Panoptic Segmentation** with Mask**CLIP**, [[Paper]](https://arxiv.org/pdf/2208.08984.pdf)

- (arXiv 2022.08) Prompt Vision Transformer for **Domain Generalization**, [[Paper]](https://arxiv.org/pdf/2208.08914.pdf)

- (arXiv 2022.08) GSRFormer: **Grounded Situation Recognition** Transformer with Alternate Semantic Attention Refinement, [[Paper]](https://arxiv.org/pdf/2208.08965.pdf)

- (arXiv 2022.08) CONVIFORMERS: **CONVOLUTIONALLY** GUIDED VISION TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2208.08900.pdf)

- (arXiv 2022.08) Learning Spatial-Frequency Transformer for Visual Object **Tracking**, [[Paper]](https://arxiv.org/pdf/2208.08829.pdf), [[Code]](https://github.com/Tchuanm/SFTransT.git)

- (arXiv 2022.08) **Efficient** **Multimodal** Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis, [[Paper]](https://arxiv.org/pdf/2208.07589.pdf)

- (arXiv 2022.08) Your ViT is Secretly a Hybrid **Discriminative-Generative** **Diffusion** Model, [[Paper]](https://arxiv.org/pdf/2208.07791.pdf), [[Code]](https://github.com/sndnyang/Diffusion_ViT)

- (arXiv 2022.08) LLM.int8(): 8-bit **Matrix Multiplication** for Transformers at **Scale**, [[Paper]](https://arxiv.org/pdf/2208.07339.pdf)
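
  The LLM.int8() kernels are distributed through the bitsandbytes library and exposed in huggingface/transformers; below is a minimal sketch of loading a model with 8-bit weights, assuming a transformers/accelerate/bitsandbytes install and a CUDA GPU (the checkpoint name is only an example):

  ```python
  # Sketch: loading a causal LM with 8-bit (LLM.int8) weight matrices via bitsandbytes.
  # Assumes: pip install transformers accelerate bitsandbytes, plus a CUDA GPU.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "facebook/opt-1.3b"  # example checkpoint; any supported LM works
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      device_map="auto",   # let accelerate place the layers
      load_in_8bit=True,   # use the LLM.int8 mixed-precision matmul path
  )

  inputs = tokenizer("Vision transformers are", return_tensors="pt").to(model.device)
  outputs = model.generate(**inputs, max_new_tokens=20)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```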

- (arXiv 2022.08) ExpansionNet v2: Block Static Expansion in fast end to end training for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2208.06551.pdf), [[Code]](https://github.com/jchenghu/ExpansionNet_v2)

- (arXiv 2022.08) Multi-modal Transformer **Path Prediction** for Autonomous Vehicle, [[Paper]](https://arxiv.org/pdf/2208.07256.pdf)

- (arXiv 2022.08) Flow-Guided Transformer for **Video Inpainting**, [[Paper]](https://arxiv.org/pdf/2208.06768.pdf), [[Code]](https://github.com/hitachinsk/FGT)

- (arXiv 2022.08) TL;DW? **Summarizing Instructional Videos** with Task Relevance & Cross-Modal Saliency, [[Paper]](https://arxiv.org/pdf/2208.06773.pdf), [[Project]](https://medhini.github.io/ivsum/)

- (arXiv 2022.08) HoW-3D: Holistic **3D Wireframe Perception** from a Single Image, [[Paper]](https://arxiv.org/pdf/2208.06999.pdf), [[Code]](https://github.com/Wenchao-M/HoW-3D)

- (arXiv 2022.08) **BEIT V2**: **Masked Image Modeling** with Vector-Quantized Visual Tokenizers, [[Paper]](https://arxiv.org/pdf/2208.06366.pdf), [[Code]](https://github.com/microsoft/unilm)

- (arXiv 2022.08) MILAN: **Masked Image Pretraining** on Language Assisted Representation, [[Paper]](https://arxiv.org/pdf/2208.06049.pdf), [[Code]](https://github.com/zejiangh/MILAN)

- (arXiv 2022.08) Hybrid Transformer Network for **Deepfake Detection**, [[Paper]](https://arxiv.org/pdf/2208.05820.pdf)

- (arXiv 2022.08) **Semi-supervised** Vision Transformers at Scale, [[Paper]](https://arxiv.org/pdf/2208.05688.pdf)

- (arXiv 2022.08) PPMN: Pixel-Phrase Matching Network for One-Stage **Panoptic Narrative Grounding**, [[Paper]](https://arxiv.org/pdf/2208.05647.pdf), [[Code]](https://github.com/dzh19990407/PPMN)

- (arXiv 2022.08) Exploring Anchor-based Detection for **Ego4D** Natural **Language Query**, [[Paper]](https://arxiv.org/pdf/2208.05375.pdf)

- (arXiv 2022.08) Language Supervised Training for **Skeleton-based Action Recognition**, [[Paper]](https://arxiv.org/pdf/2208.05318.pdf), [[Code]](https://github.com/MartinXM/LST)

- (arXiv 2022.08) Exploring Point-BEV Fusion for 3D **Point Cloud Object Tracking** with Transformer, [[Paper]](https://arxiv.org/pdf/2208.05216.pdf), [[Code]](https://github.com/Jasonkks/PTTR)

- (arXiv 2022.08) Ghost-free **High Dynamic Range Imaging** with Context-aware Transformer, [[Paper]](https://arxiv.org/pdf/2208.05114.pdf), [[Code]](https://github.com/megvii-research/HDR-Transformer)

- (arXiv 2022.08) **CLIP**-based Neural Neighbor **Style Transfer** for **3D Assets**, [[Paper]](https://arxiv.org/pdf/2208.04370.pdf)

- (arXiv 2022.08) **Sports Video Analysis** on Large-Scale Data, [[Paper]](https://arxiv.org/pdf/2208.04897.pdf), [[Code]](https://github.com/jackwu502/NSVA)

- (arXiv 2022.08) How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving **Art Classification**, [[Paper]](https://arxiv.org/pdf/2208.04693.pdf)

- (arXiv 2022.08) In the Eye of Transformer: Global-Local Correlation for **Egocentric Gaze Estimation**, [[Paper]](https://arxiv.org/pdf/2208.04464.pdf), [[Code]](https://bolinlai.github.io/GLC-EgoGazeEst)

- (arXiv 2022.08) **DALLE**-URBAN: Capturing the **urban** design expertise of large **text to image** transformers, [[Paper]](https://arxiv.org/pdf/2208.04139.pdf), [[Code]](https://github.com/sachith500/DALLEURBAN)

- (arXiv 2022.08) PlaneFormers: From Sparse View Planes to **3D Reconstruction**, [[Paper]](https://arxiv.org/pdf/2208.04307.pdf), [[Code]](https://samiragarwala.github.io/PlaneFormers)

- (arXiv 2022.08) Boosting **Video-Text Retrieval** with Explicit High-Level Semantics, [[Paper]](https://arxiv.org/pdf/2208.04215.pdf)

- (arXiv 2022.08) Distinctive Image **Captioning** via **CLIP** Guided Group Optimization, [[Paper]](https://arxiv.org/pdf/2208.04254.pdf)

- (arXiv 2022.08) Understanding **Masked Image Modeling** via Learning Occlusion Invariant Feature, [[Paper]](https://arxiv.org/pdf/2208.04164.pdf)

- (arXiv 2022.08) GRIT-VLP: Grouped Mini-batch Sampling for Efficient **Vision and Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2208.04060.pdf), [[Code]](https://github.com/jaeseokbyun/GRIT-VLP)

- (arXiv 2022.08) Advancing Plain Vision Transformer Towards **Remote Sensing** Foundation Model, [[Paper]](https://arxiv.org/pdf/2208.03987.pdf), [[Code]](https://github.com/ViTAE-Transformer/Remote-Sensing-RVSA)

- (arXiv 2022.08) Domain Randomization-Enhanced Depth Simulation and Restoration for Perceiving and **Grasping** Specular and Transparent Objects, [[Paper]](https://arxiv.org/pdf/2208.03792.pdf), [[Code]](https://github.com/PKU-EPIC/DREDS)

- (arXiv 2022.08) Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for **3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2208.03704.pdf)

- (arXiv 2022.08) Frozen **CLIP** Models are **Efficient Video** Learners, [[Paper]](https://arxiv.org/pdf/2208.03550.pdf), [[Code]](https://github.com/OpenGVLab/efficient-video-recognition)

- (arXiv 2022.08) MonoViT: Self-Supervised **Monocular Depth Estimation** with a Vision Transformer, [[Paper]](https://arxiv.org/pdf/2208.03543.pdf), [[Code]](https://github.com/zxcqlf/MonoViT)

- (arXiv 2022.08) HaloAE: An HaloNet based Local Transformer Auto-Encoder for **Anomaly Detection** and **Localization**, [[Paper]](https://arxiv.org/pdf/2208.03486.pdf), [[Code]](https://anonymous.4open.science/r/HaloAE-E27B/README.md)

- (arXiv 2022.08) IVT: An End-to-End Instance-guided Video Transformer for **3D Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2208.03431.pdf)

- (arXiv 2022.08) A Sketch Is Worth a Thousand Words: **Image Retrieval** with **Text** and **Sketch**, [[Paper]](https://arxiv.org/pdf/2208.03354.pdf), [[Code]](https://janesjanes.github.io/tsbir/)

- (arXiv 2022.08) PointConvFormer: Revenge of the **Point-based Convolution**, [[Paper]](https://arxiv.org/pdf/2208.02879.pdf)

- (arXiv 2022.08) ChiQA: A Large Scale Image-based Real-World **Question Answering Dataset** for Multi-Modal Understanding, [[Paper]](https://arxiv.org/pdf/2208.03030.pdf)

- (arXiv 2022.08) LaTTe: **Language** **Trajectory** TransformEr, [[Paper]](https://arxiv.org/pdf/2208.02918.pdf), [[Code]](https://github.com/arthurfenderbucker/NL_trajectory_reshaper)

- (arXiv 2022.08) Learning Spatiotemporal Frequency-Transformer for **Compressed Video Super-Resolution**, [[Paper]](https://arxiv.org/pdf/2208.03012.pdf), [[Code]](https://github.com/researchmm/FTVSR)

- (arXiv 2022.08) TransMatting: Enhancing **Transparent Objects Matting** with Transformers, [[Paper]](https://arxiv.org/pdf/2208.03007.pdf), [[Project]](https://github.com/AceCHQ/TransMatting)

- (arXiv 2022.08) Word-Level Fine-Grained **Story Visualization**, [[Paper]](https://arxiv.org/pdf/2208.02341.pdf)

- (arXiv 2022.08) Fine-Grained Semantically Aligned **Vision-Language** Pre-Training, [[Paper]](https://arxiv.org/pdf/2208.02515.pdf)

- (arXiv 2022.08) Expanding **Language-Image** Pretrained Models for General **Video Recognition**, [[Paper]](https://arxiv.org/pdf/2208.02816.pdf), [[Code]](https://github.com/microsoft/VideoX/tree/master/X-CLIP)

- (arXiv 2022.08) P2P: Tuning Pre-trained Image Models for **Point Cloud Analysis** with Point-to-Pixel Prompting, [[Paper]](https://arxiv.org/pdf/2208.02812.pdf), [[Code]](https://github.com/wangzy22/P2P)

- (arXiv 2022.08) **Drop**Key, [[Paper]](https://arxiv.org/pdf/2208.02646.pdf)

- (arXiv 2022.08) MVSFormer: **Multi-View Stereo** with Pre-trained Vision Transformers and Temperature-based Depth, [[Paper]](https://arxiv.org/pdf/2208.02541.pdf)

- (arXiv 2022.08) Per-Clip Video Object **Segmentation**, [[Paper]](https://arxiv.org/pdf/2208.01924.pdf)

- (arXiv 2022.08) XCon: Learning with Experts for **Fine-grained Category Discovery**, [[Paper]](https://arxiv.org/pdf/2208.01898.pdf), [[Code]](https://github.com/YiXXin/XCon)

- (arXiv 2022.08) Combined CNN Transformer Encoder for Enhanced Fine-grained Human **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2208.01897.pdf)

- (arXiv 2022.08) RE-ATTENTION TRANSFORMER FOR WEAKLY SUPERVISED **OBJECT LOCALIZATION**, [[Paper]](https://arxiv.org/pdf/2208.01838.pdf), [[Code]](https://github.com/su-hui-zz/ReAttentionTransformer)

- (arXiv 2022.08) TAG: Boosting Text-**VQA** via Text-aware Visual Question-answer Generation, [[Paper]](https://arxiv.org/pdf/2208.01813.pdf)

- (arXiv 2022.08) Two-Stream Transformer Architecture for **Long Form Video Understanding**, [[Paper]](https://arxiv.org/pdf/2208.01753.pdf)

- (arXiv 2022.08) A Fast **Text-Driven** Approach for **Generating Artistic Content**, [[Paper]](https://arxiv.org/pdf/2208.01748.pdf)

- (arXiv 2022.08) DAHITRA: **DAMAGE ASSESSMENT** USING A NOVEL HIERARCHICAL TRANSFORMER ARCHITECTURE, [[Paper]](https://arxiv.org/pdf/2208.02205.pdf)

- (arXiv 2022.08) MinVIS: A Minimal **Video Instance Segmentation** Framework without Video-based Training, [[Paper]](https://arxiv.org/pdf/2208.02245.pdf), [[Code]](https://github.com/NVlabs/MinVIS)

- (arXiv 2022.08) **Masked** **Vision and Language** Modeling for Multi-modal Representation Learning, [[Paper]](https://arxiv.org/pdf/2208.02131.pdf)

- (arXiv 2022.08) SSformer: A **Lightweight** Transformer for Semantic Segmentation, [[Paper]](https://arxiv.org/pdf/2208.02034.pdf), [[Code]](https://github.com/shiwt03/SSformer)

- (arXiv 2022.08) **Pose** Uncertainty Aware **Movement Synchrony Estimation** via Spatial-Temporal Graph Transformer, [[Paper]](https://arxiv.org/pdf/2208.01161.pdf)

- (arXiv 2022.08) Making the Best of Both Worlds: A Domain-Oriented Transformer for **Unsupervised Domain Adaptation**, [[Paper]](https://arxiv.org/pdf/2208.01195.pdf), [[Code]](https://github.com/BIT-DA/Domain-Oriented-Transformer)

- (arXiv 2022.08) Unified Normalization for **Accelerating** and **Stabilizing** Transformers, [[Paper]](https://arxiv.org/pdf/2208.01313.pdf)

- (arXiv 2022.08) An Image is Worth One Word: Personalizing **Text-to-Image Generation** using Textual Inversion, [[Paper]](https://arxiv.org/pdf/2208.01618.pdf), [[Project]](https://textual-inversion.github.io/)

- (arXiv 2022.08) Prompt-to-**Prompt** **Image Editing** with Cross Attention Control, [[Paper]](https://arxiv.org/pdf/2208.01626.pdf)

- (arXiv 2022.08) Momentum Transformer: Closing the Performance Gap Between Self-attention and Its **Linearization**, [[Paper]](https://arxiv.org/pdf/2208.00579.pdf)

- (arXiv 2022.08) Testing Relational Understanding in **Text-Guided Image Generation**, [[Paper]](https://arxiv.org/pdf/2208.00005.pdf)

- (arXiv 2022.08) UAVM: A Unified Model for **Audio-Visual** Learning, [[Paper]](https://arxiv.org/pdf/2208.00061.pdf)

- (arXiv 2022.08) Meta-**DETR**: Image-Level **Few-Shot** Detection with Inter-Class Correlation Exploitation, [[Paper]](https://arxiv.org/pdf/2208.00219.pdf), [[Code]](https://github.com/ZhangGongjie/Meta-DETR)

- (arXiv 2022.08) Point Primitive Transformer for Long-Term **4D Point Cloud Video Understanding**, [[Paper]](https://arxiv.org/pdf/2208.00281.pdf)

- (arXiv 2022.08) One for All: One-stage **Referring Expression Comprehension** with Dynamic Reasoning, [[Paper]](https://arxiv.org/pdf/2208.00361.pdf)

- (arXiv 2022.08) Toward Understanding WordArt: Corner-Guided Transformer for **Scene Text Recognition**, [[Paper]](https://arxiv.org/pdf/2208.00438.pdf), [[Code]](https://github.com/xdxie/WordArt)

- (arXiv 2022.08) SdAE: Self-distillated **Masked Autoencoder**, [[Paper]](https://arxiv.org/pdf/2208.00449.pdf), [[Code]](https://github.com/AbrahamYabo/SdAE)

- (arXiv 2022.08) Augmenting **Vision Language** Pretraining by Learning Codebook with Visual Semantics, [[Paper]](https://arxiv.org/pdf/2208.00475.pdf)

- (arXiv 2022.08) STrajNet: **Occupancy Flow Prediction** via Multi-modal Swin Transformer, [[Paper]](https://arxiv.org/pdf/2208.00394.pdf)

- (arXiv 2022.08) D^3Former: Debiased Dual Distilled Transformer for **Incremental Learning**, [[Paper]](https://arxiv.org/pdf/2208.00777.pdf), [[Code]](https://tinyurl.com/d3former)

- (arXiv 2022.08) Local Perception-Aware Transformer for **Aerial Tracking**, [[Paper]](https://arxiv.org/pdf/2208.00662.pdf), [[Code]](https://github.com/vision4robotics/LPAT)

- (arXiv 2022.08) SIAMIXFORMER: A SIAMESE TRANSFORMER NETWORK FOR BUILDING DETECTION AND CHANGE DETECTION FROM BI-TEMPORAL **REMOTE SENSING** IMAGES, [[Paper]](https://arxiv.org/pdf/2208.00657.pdf)

- (arXiv 2022.08) Transformers as Meta-Learners for **Implicit Neural Representations**, [[Paper]](https://arxiv.org/pdf/2208.02801.pdf), [[Code]](https://yinboc.github.io/trans-inr/)

- (arXiv 2022.08) **Video Question Answering** with Iterative Video-Text Co-Tokenization, [[Paper]](https://arxiv.org/pdf/2208.00934.pdf), [[Code]](https://sites.google.com/view/videoqa-cotokenization)

- (arXiv 2022.08) Understanding Adversarial **Robustness** of Vision Transformers via Cauchy Problem, [[Paper]](https://arxiv.org/pdf/2208.00906.pdf), [[Code]](https://github.com/TrustAI/ODE4RobustViT)

### 2022.07

- (arXiv 2022.07) Pro-tuning: Unified **Prompt Tuning** for Vision Tasks, [[Paper]](https://arxiv.org/pdf/2207.14381.pdf)

- (arXiv 2022.07) ALADIN: Distilling Fine-grained Alignment Scores for Efficient **Image-Text** Matching and Retrieval, [[Paper]](https://arxiv.org/pdf/2207.14757.pdf), [[Code]](https://github.com/mesnico/ALADIN)

- (arXiv 2022.07) Curriculum Learning for Data-Efficient **Vision-Language** Alignment, [[Paper]](https://arxiv.org/pdf/2207.14525.pdf)

- (arXiv 2022.07) DnSwin: Toward Real-World **Denoising** via Continuous Wavelet Sliding-Transformer, [[Paper]](https://arxiv.org/pdf/2207.13861.pdf)

- (arXiv 2022.07) Cross-Attention of Disentangled Modalities for **3D Human Mesh Recovery** with Transformers, [[Paper]](https://arxiv.org/pdf/2207.13820.pdf), [[Code]](https://github.com/postech-ami/FastMETRO)

- (arXiv 2022.07) AvatarPoser: Articulated **Full-Body Pose Tracking** from Sparse Motion Sensing, [[Paper]](https://arxiv.org/pdf/2207.13784.pdf), [[Project]](https://github.com/eth-siplab/AvatarPoser)

- (arXiv 2022.07) Semantic-Aligned Matching for Enhanced **DETR** Convergence and Multi-Scale Feature Fusion, [[Paper]](https://arxiv.org/pdf/2207.14172.pdf), [[Code]](https://github.com/ZhangGongjie/SAM-DETR)

- (arXiv 2022.07) Safety-Enhanced **Autonomous Driving** Using Interpretable Sensor Fusion Transformer, [[Paper]](https://arxiv.org/pdf/2207.14024.pdf), [[Code]](https://github.com/opendilab/InterFuser)

- (arXiv 2022.07) Video Mask Transfiner for High-Quality **Video Instance Segmentation**, [[Paper]](https://arxiv.org/pdf/2207.14012.pdf), [[Project]](http://vis.xyz/pub/vmt)

- (arXiv 2022.07) A **Variational AutoEncoder** for Transformers with Nonparametric Variational Information Bottleneck, [[Paper]](https://arxiv.org/pdf/2207.13529.pdf)

- (arXiv 2022.07) Online **Continual Learning** with Contrastive Vision Transformer, [[Paper]](https://arxiv.org/pdf/2207.13516.pdf)

- (arXiv 2022.07) Retrieval-Augmented Transformer for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2207.13162.pdf)

- (arXiv 2022.07) Spatiotemporal Self-attention Modeling with Temporal Patch Shift for **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2207.13259.pdf), [[Code]](https://github.com/MartinXM/TPS)

- (arXiv 2022.07) Is Attention All **NeRF** Needs?, [[Paper]](https://arxiv.org/pdf/2207.13298.pdf), [[Code]](https://vita-group.github.io/GNT/)

- (arXiv 2022.07) **Convolutional Embedding** Makes Hierarchical Vision Transformer Stronger, [[Paper]](https://arxiv.org/pdf/2207.13317.pdf)

- (arXiv 2022.07) SiRi: A Simple Selective Retraining Mechanism for Transformer-based **Visual Grounding**, [[Paper]](https://arxiv.org/pdf/2207.13325.pdf), [[Code]](https://github.com/qumengxue/siri-vg.git)

- (arXiv 2022.07) Deep **Clustering** with Features from **Self-Supervised** Pretraining, [[Paper]](https://arxiv.org/pdf/2207.13364.pdf)

- (arXiv 2022.07) Contrastive **Masked Autoencoders** are Stronger Vision Learners, [[Paper]](https://arxiv.org/pdf/2207.13532.pdf)

- (arXiv 2022.07) VICTOR: VISUAL **INCOMPATIBILITY DETECTION** WITH TRANSFORMERS AND FASHION-SPECIFIC CONTRASTIVE PRE-TRAINING, [[Paper]](https://arxiv.org/pdf/2207.13458.pdf)

- (arXiv 2022.07) Compositional **Human-Scene Interaction Synthesis** with Semantic Control, [[Paper]](https://arxiv.org/pdf/2207.12824.pdf), [[Code]](https://github.com/zkf1997/COINS)

- (arXiv 2022.07) Static and Dynamic Concepts for **Self-supervised** **Video** Representation Learning, [[Paper]](https://arxiv.org/pdf/2207.12795.pdf)

- (arXiv 2022.07) Unsupervised Domain Adaptation for Video Transformers in **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2207.12842.pdf), [[Code]](https://github.com/vturrisi/UDAVT)

- (arXiv 2022.07) LaKo: Knowledge-driven **Visual Question Answering** via Late Knowledge-to-Text Injection, [[Paper]](https://arxiv.org/pdf/2207.12888.pdf)

- (arXiv 2022.07) TransFiner: A Full-Scale Refinement Approach for **Multiple Object Tracking**, [[Paper]](https://arxiv.org/pdf/2207.12967.pdf)

- (arXiv 2022.07) S-Prompts Learning with Pre-trained Transformers: An Occam’s Razor for **Domain Incremental Learning**, [[Paper]](https://arxiv.org/pdf/2207.12819.pdf)

- (arXiv 2022.07) WinoGAViL: Gamified Association **Benchmark** to Challenge **Vision-and-Language** Models, [[Paper]](https://arxiv.org/pdf/2207.12576.pdf), [[Project]](https://winogavil.github.io/)

- (arXiv 2022.07) Cross-Modal Causal Relational Reasoning for Event-Level **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2207.12647.pdf)

- (arXiv 2022.07) Graph Neural Network and Spatiotemporal Transformer Attention for **3D** Video Object **Detection** from Point Clouds, [[Paper]](https://arxiv.org/pdf/2207.12659.pdf)

- (arXiv 2022.07) Learning Visual Representation from Modality-Shared Contrastive **Language-Image** Pre-training, [[Paper]](https://arxiv.org/pdf/2207.12661.pdf), [[Code]](https://github.com/Hxyou/MSCLIP)

- (arXiv 2022.07) V^2L: Leveraging Vision and **Vision-language** Models into Large-scale **Product Retrieval**, [[Paper]](https://arxiv.org/pdf/2207.12994.pdf), [[Code]](https://github.com/WangWenhao0716/V2L)

- (arXiv 2022.07) NewsStories: Illustrating **articles** with **visual** summaries, [[Paper]](https://arxiv.org/pdf/2207.13061.pdf), [[Project]](https://github.com/NewsStoriesData/newsstories.github.io)

- (arXiv 2022.07) **DETR**s with Hybrid **Matching**, [[Paper]](https://arxiv.org/pdf/2207.13080.pdf), [[Code]](https://github.com/HDETR)

- (arXiv 2022.07) GROUP **DETR**: **FAST** TRAINING CONVERGENCE WITH DECOUPLED ONE-TO-MANY LABEL ASSIGNMENT, [[Paper]](https://arxiv.org/pdf/2207.13085.pdf)

- (arXiv 2022.07) Improved **Super Resolution** of MR Images Using CNNs and Vision Transformers, [[Paper]](https://arxiv.org/pdf/2207.11748.pdf)

- (arXiv 2022.07) Video Swin Transformers for **Egocentric Video** Understanding @ Ego4D Challenges 2022, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2207/2207.11329.pdf), [[Code]](https://github.com/BCV-Uniandes/PNR_OSCC)

- (arXiv 2022.07) An Impartial Take to the CNN vs Transformer **Robustness** Contest, [[Paper]](https://arxiv.org/pdf/2207.11347.pdf)

- (arXiv 2022.07) **Generative** Artisan: A Semantic-Aware and Controllable **CLIP**styler, [[Paper]](https://arxiv.org/pdf/2207.11598.pdf)

- (arXiv 2022.07) MAR: Masked Autoencoders for Efficient **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2207.11660.pdf), [[Code]](https://github.com/alibaba-mmai-research/Masked-Action-Recognition)

- (arXiv 2022.07) **Object State Change Classification** in **Egocentric** Videos using the Divided Space-Time Attention Mechanism, [[Paper]](https://arxiv.org/pdf/2207.11814.pdf), [[Code]](https://github.com/md-mohaiminul/ObjectStateChange)

- (arXiv 2022.07) Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for **Panoramic Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2207.11860.pdf), [[Code]](https://github.com/jamycheung/Trans4PASS)

- (arXiv 2022.07) Reference-based Image **Super-Resolution** with Deformable Attention Transformer, [[Paper]](https://arxiv.org/pdf/2207.11938.pdf), [[Code]](https://github.com/caojiezhang/DATSR)

- (arXiv 2022.07) JIGSAW-VIT: LEARNING **JIGSAW PUZZLES** IN VISION TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2207.11971.pdf), [[Code]](https://yingyichen-cyy.github.io/Jigsaw-ViT)

- (arXiv 2022.07) TransCL: Transformer Makes Strong and Flexible **Compressive Learning**, [[Paper]](https://arxiv.org/pdf/2207.11972.pdf), [[Code]](https://github.com/MC-E/TransCL/)

- (arXiv 2022.07) 3D Siamese Transformer Network for Single Object **Tracking** on **Point Clouds**, [[Paper]](https://arxiv.org/pdf/2207.11995.pdf), [[Code]](https://github.com/fpthink/STNet)

- (arXiv 2022.07) Intention-Conditioned Long-Term Human **Egocentric Action Forecasting** @ EGO4D Challenge 2022, [[Paper]](https://arxiv.org/pdf/2207.12080.pdf), [[Code]](https://github.com/Evm7/ego4dlta-icvae)

- (arXiv 2022.07) IGFormer: Interaction Graph Transformer for **Skeleton**-based **Human Interaction Recognition**, [[Paper]](https://arxiv.org/pdf/2207.12100.pdf)

- (arXiv 2022.07) Is **GPT-3** all you need for **Visual Question Answering** in Cultural Heritage? [[Paper]](https://arxiv.org/pdf/2207.12101.pdf)

- (arXiv 2022.07) Applying Spatiotemporal Attention to **Identify Distracted** and **Drowsy Driving** with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2207.12148.pdf)

- (arXiv 2022.07) **Action Quality Assessment** using Transformers, [[Paper]](https://arxiv.org/pdf/2207.12318.pdf)

- (arXiv 2022.07) Self-Distilled Vision Transformer for **Domain Generalization**, [[Paper]](https://arxiv.org/pdf/2207.12392.pdf), [[Code]](https://github.com/maryam089/SDViT)

- (arXiv 2022.07) Exploring **CLIP** for **Assessing** the Look and Feel of **Images**, [[Paper]](https://arxiv.org/pdf/2207.12396.pdf), [[Code]](https://github.com/IceClear/CLIP-IQA)

- (arXiv 2022.07) Transformer with Implicit Edges for Particle-based **Physics Simulation**, [[Paper]](https://arxiv.org/pdf/2207.10860.pdf), [[Code]](https://github.com/ftbabi/TIE_ECCV2022.git)

- (arXiv 2022.07) Auto-regressive **Image Synthesis** with Integrated Quantization, [[Paper]](https://arxiv.org/pdf/2207.10776.pdf)

- (arXiv 2022.07) Efficient Modeling of Future Context for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2207.10897.pdf), [[Code]](https://github.com/feizc/Future-Caption)

- (arXiv 2022.07) Zero-Shot Video **Captioning** with Evolving Pseudo-Tokens, [[Paper]](https://arxiv.org/pdf/2207.11100.pdf), [[Code]](https://github.com/YoadTew/zero-shot-video-to-text)

- (arXiv 2022.07) Panoptic **Scene Graph** Generation, [[Paper]](https://arxiv.org/pdf/2207.11247.pdf), [[Project]](https://psgdataset.org/), [[Code]](https://github.com/Jingkang50/OpenPSG)

- (arXiv 2022.07) **Facial Expression Recognition** using Vanilla ViT backbones with MAE Pretraining, [[Paper]](https://arxiv.org/pdf/2207.11081.pdf)

- (arXiv 2022.07) Target-Driven Structured Transformer Planner for **Vision-Language Navigation**, [[Paper]](https://arxiv.org/pdf/2207.11201.pdf)

- (arXiv 2022.07) **Scaling Laws** vs Model Architectures: How does Inductive Bias Influence Scaling? [[Paper]](https://arxiv.org/pdf/2207.10551.pdf)

- (arXiv 2022.07) Hybrid CNN-Transformer Model For **Facial Affect Recognition** In the ABAW4 Challenge, [[Paper]](https://arxiv.org/pdf/2207.10201.pdf)

- (arXiv 2022.07) Mesh**MAE**: Masked Autoencoders for 3D **Mesh** Data Analysis, [[Paper]](https://arxiv.org/pdf/2207.10228.pdf)

- (arXiv 2022.07) SeedFormer: Patch Seeds based **Point Cloud Completion** with Upsample Transformer, [[Paper]](https://arxiv.org/pdf/2207.10315.pdf), [[Code]](https://github.com/hrzhou2/seedformer)

- (arXiv 2022.07) LocVTP: **Video-Text** Pre-training for Temporal Localization, [[Paper]](https://arxiv.org/pdf/2207.10362.pdf), [[Code]](https://github.com/mengcaopku/LocVTP)

- (arXiv 2022.07) Temporal Saliency Query Network for **Efficient Video Recognition**, [[Paper]](https://arxiv.org/pdf/2207.10379.pdf), [[Code]](https://lawrencexia2008.github.io/projects/tsqnet)

- (arXiv 2022.07) Pose for Everything: Towards Category-Agnostic **Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2207.10387.pdf), [[Code]](https://github.com/luminxu/Pose-for-Everything)

- (arXiv 2022.07) Weakly Supervised **Object Localization** via Transformer with Implicit Spatial Calibration, [[Paper]](https://arxiv.org/pdf/2207.10447.pdf), [[Code]](https://github.com/164140757/SCM)

- (arXiv 2022.07) An Efficient **Spatio-Temporal** Pyramid Transformer for **Action Detection**, [[Paper]](https://arxiv.org/pdf/2207.10448.pdf)

- (arXiv 2022.07) Towards **Efficient Adversarial Training** on Vision Transformers, [[Paper]](https://arxiv.org/pdf/2207.10498.pdf)

- (arXiv 2022.07) TinyViT: Fast Pretraining Distillation for **Small** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2207.10666.pdf), [[Code]](https://github.com/microsoft/Cream/tree/main/TinyViT)

- (arXiv 2022.07) Hierarchically Self-Supervised Transformer for Human **Skeleton Representation** Learning, [[Paper]](https://arxiv.org/pdf/2207.09644.pdf), [[Code]](https://github.com/yuxiaochen1103/Hi-TRS)

- (arXiv 2022.07) Explicit Image **Caption Editing**, [[Paper]](https://arxiv.org/pdf/2207.09625.pdf), [[Code]](https://github.com/baaaad/ECE)

- (arXiv 2022.07) AiATrack: Attention in Attention for Transformer Visual **Tracking**, [[Paper]](https://arxiv.org/pdf/2207.09603.pdf), [[Code]](https://github.com/Little-Podi/AiATrack)

- (arXiv 2022.07) Tip-Adapter: Training-free Adaption of **CLIP** for **Few-shot Classification**, [[Paper]](https://arxiv.org/pdf/2207.09519.pdf), [[Code]](https://github.com/gaopengcuhk/Tip-Adapter)
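
  As a rough, hedged sketch of the training-free cache idea behind this line of work (not a faithful reproduction of the paper's exact formulation or hyper-parameters): few-shot image features serve as keys, their one-hot labels as values, and the cache logits are blended with CLIP's zero-shot logits:

  ```python
  # Sketch of a training-free feature cache for few-shot adaption of CLIP features.
  # All tensors are assumed to be precomputed, L2-normalized CLIP embeddings.
  import torch
  import torch.nn.functional as F

  def cache_adapter_logits(test_feats, support_feats, support_labels, clip_logits,
                           num_classes, alpha=1.0, beta=5.0):
      """Blend zero-shot CLIP logits with similarity-weighted few-shot labels.

      test_feats:     (N, D) normalized test image features
      support_feats:  (K, D) normalized few-shot image features (cache keys)
      support_labels: (K,)   integer labels of the few-shot images
      clip_logits:    (N, C) zero-shot logits obtained from text prompts
      """
      values = F.one_hot(support_labels, num_classes).float()      # (K, C) cache values
      affinity = test_feats @ support_feats.t()                    # (N, K) cosine similarity
      cache_logits = torch.exp(-beta * (1.0 - affinity)) @ values  # (N, C)
      return clip_logits + alpha * cache_logits

  # Toy usage with random placeholders.
  N, K, D, C = 4, 16, 512, 8
  test = F.normalize(torch.randn(N, D), dim=-1)
  supp = F.normalize(torch.randn(K, D), dim=-1)
  labels = torch.randint(0, C, (K,))
  zero_shot_logits = torch.randn(N, C)
  print(cache_adapter_logits(test, supp, labels, zero_shot_logits, C).shape)  # torch.Size([4, 8])
  ```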

- (arXiv 2022.07) Single Frame **Atmospheric Turbulence Mitigation**: A Benchmark Study and A New Physics-Inspired Transformer Model, [[Paper]](https://arxiv.org/pdf/2207.10040.pdf), [[Code]](https://github.com/VITA-Group/TurbNet)

- (arXiv 2022.07) HTNet: Anchor-free **Temporal Action Localization** with Hierarchical Transformers, [[Paper]](https://arxiv.org/pdf/2207.09662.pdf)

- (arXiv 2022.07) GRIT: Faster and Better Image **captioning** Transformer Using Dual Visual Features, [[Paper]](https://arxiv.org/pdf/2207.09666.pdf)

- (arXiv 2022.07) OTPose: Occlusion-Aware Transformer for **Pose Estimation** in Sparsely-Labeled Videos, [[Paper]](https://arxiv.org/pdf/2207.09725.pdf)

- (arXiv 2022.07) FaceFormer: Scale-aware Blind **Face Restoration** with Transformers, [[Paper]](https://arxiv.org/pdf/2207.09790.pdf)

- (arXiv 2022.07) Multimodal Transformer for **Automatic 3D Annotation** and Object **Detection**, [[Paper]](https://arxiv.org/pdf/2207.09805.pdf), [[Code]](https://github.com/Cliu2/MTrans)

- (arXiv 2022.07) Temporal and cross-modal attention for **audio-visual** **zero-shot** learning, [[Paper]](https://arxiv.org/pdf/2207.09966.pdf), [[Code]](https://github.com/ExplainableML/TCAF-GZSL)

- (arXiv 2022.07) Locality Guidance for Improving Vision Transformers on **Tiny Datasets**, [[Paper]](https://arxiv.org/pdf/2207.10026.pdf), [[Code]](https://github.com/lkhl/tiny-transformers)

- (arXiv 2022.07) Is an Object-Centric Video Representation Beneficial for Transfer? [[Paper]](https://arxiv.org/pdf/2207.10075.pdf)

- (arXiv 2022.07) DUQIM-Net: Probabilistic Object Hierarchy Representation for Multi-View **Manipulation**, [[Paper]](https://arxiv.org/pdf/2207.09105.pdf)

- (arXiv 2022.07) RELATIONAL FUTURE **CAPTIONING** MODEL FOR EXPLAINING LIKELY COLLISIONS IN DAILY TASKS, [[Paper]](https://arxiv.org/pdf/2207.09083.pdf)

- (arXiv 2022.07) Conditional **DETR** V2: **Efficient** Detection Transformer with Box Queries, [[Paper]](https://arxiv.org/pdf/2207.08914.pdf)

- (arXiv 2022.07) Exploiting Unlabeled Data with **Vision and Language** Models for Object **Detection**, [[Paper]](https://arxiv.org/pdf/2207.08954.pdf), [[Code]](https://github.com/xiaofeng94/VL-PLM)

- (arXiv 2022.07) TTVFI: Learning Trajectory-Aware Transformer for **Video Frame Interpolation**, [[Paper]](https://arxiv.org/pdf/2207.09048.pdf), [[Code]](https://github.com/researchmm/TTVFI.git)

- (arXiv 2022.07) Time Is MattEr: **Temporal Self-supervision** for Video Transformers, [[Paper]](https://arxiv.org/pdf/2207.09067.pdf)

- (arXiv 2022.07) IDET: Iterative Difference-Enhanced Transformers for **High-Quality Change Detection**, [[Paper]](https://arxiv.org/pdf/2207.09240.pdf)

- (arXiv 2022.07) Don’t Stop Learning: Towards **Continual Learning** for the **CLIP** Model, [[Paper]](https://arxiv.org/pdf/2207.09248.pdf)

- (arXiv 2022.07) **Action Quality Assessment** with Temporal Parsing Transformer, [[Paper]](https://arxiv.org/pdf/2207.09270.pdf)

- (arXiv 2022.07) Visual **Representation** Learning with Transformer: A Sequence-to-Sequence Perspective, [[Paper]](https://arxiv.org/pdf/2207.09339.pdf), [[Code]](https://github.com/fudan-zvg/SETR)

- (arXiv 2022.07) Structural Prior Guided Generative Adversarial Transformers for **Low-Light Image Enhancement**, [[Paper]](https://arxiv.org/pdf/2207.07828.pdf)

- (arXiv 2022.07) TS2-Net: Token Shift and Selection Transformer for **Text-Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2207.07852.pdf), [[Code]](https://github.com/yuqi657/ts2_net)

- (arXiv 2022.07) Clover: Towards A Unified **Video-Language** Alignment and Fusion Model, [[Paper]](https://arxiv.org/pdf/2207.07885.pdf), [[Code]](https://github.com/LeeYN-43/Clover)

- (arXiv 2022.07) SatMAE: Pre-training Transformers for Temporal and Multi-Spectral **Satellite Imagery**, [[Paper]](https://arxiv.org/pdf/2207.08051.pdf)

- (arXiv 2022.07) FashionViL: Fashion-Focused **Vision-and-Language** Representation Learning, [[Paper]](https://arxiv.org/pdf/2207.08150.pdf), [[Code]](https://github.com/BrandonHanx/mmf)

- (arXiv 2022.07) Zero-Shot **Temporal Action Detection** via Vision-Language Prompting, [[Paper]](https://arxiv.org/pdf/2207.08184.pdf), [[Code]](https://github.com/sauradip/STALE)

- (arXiv 2022.07) Rethinking Alignment in **Video Super-Resolution** Transformers, [[Paper]](https://arxiv.org/pdf/2207.08494.pdf), [[Code]](https://github.com/XPixelGroup/RethinkVSRAlignment)

- (arXiv 2022.07) Open-world **Semantic Segmentation** via Contrasting and Clustering Vision-Language Embedding, [[Paper]](https://arxiv.org/pdf/2207.08455.pdf)

- (arXiv 2022.07) TokenMix: Rethinking Image Mixing for Data **Augmentation** in Vision Transformers, [[Paper]](https://arxiv.org/pdf/2207.08409.pdf), [[Code]](https://github.com/Sense-X/TokenMix)

- (arXiv 2022.07) Towards the Human Global Context: Does the **Vision-Language** Model Really Judge Like a Human Being? [[Paper]](https://arxiv.org/pdf/2207.08333.pdf)

- (arXiv 2022.07) Defect Transformer: An Efficient Hybrid Transformer Architecture for **Surface Defect Detection**, [[Paper]](https://arxiv.org/pdf/2207.08319.pdf)

- (arXiv 2022.07) Semantic **Novelty Detection** via Relational Reasoning, [[Paper]](https://arxiv.org/pdf/2207.08699.pdf)

- (arXiv 2022.07) Unifying **Event Detection** and **Captioning** as Sequence Generation via Pre-Training, [[Paper]](https://arxiv.org/pdf/2207.08625.pdf), [[Code]](https://github.com/QiQAng/UEDVC)

- (arXiv 2022.07) Multi-manifold **Attention** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2207.08569.pdf)

- (arXiv 2022.07) UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in **Bird’s-Eye-View**, [[Paper]](https://arxiv.org/pdf/2207.08536.pdf)

- (arXiv 2022.07) **Position Prediction** as an Effective Pretraining Strategy, [[Paper]](https://arxiv.org/pdf/2207.07611.pdf)

- (arXiv 2022.07) **Lightweight** Vision Transformer with Cross Feature Attention, [[Paper]](https://arxiv.org/pdf/2207.07268.pdf)

- (arXiv 2022.07) Parameterization of **Cross-Token Relations** with Relative Positional Encoding for Vision **MLP**, [[Paper]](https://arxiv.org/pdf/2207.07284.pdf), [[Code]](https://github.com/Zhicaiwww/PosMLP)

- (arXiv 2022.07) X-CLIP: End-to-End Multi-grained Contrastive Learning for **Video-Text Retrieval**, [[Paper]](https://arxiv.org/pdf/2207.07285.pdf)

- (arXiv 2022.07) Learning Parallax Transformer Network for **Stereo Image JPEG Artifacts Removal**, [[Paper]](https://arxiv.org/pdf/2207.07335.pdf)

- (arXiv 2022.07) A Dual-Masked Auto-Encoder for **Robust Motion Capture** with Spatial-Temporal Skeletal Token Completion, [[Paper]](https://arxiv.org/pdf/2207.07381.pdf)

- (arXiv 2022.07) Is a **Caption** Worth a Thousand **Images**? A Controlled Study for **Representation** Learning, [[Paper]](https://arxiv.org/pdf/2207.07635.pdf)

- (arXiv 2022.07) Multimodal **Open-Vocabulary Video Classification** via Pre-Trained Vision and Language Models, [[Paper]](https://arxiv.org/pdf/2207.07646.pdf)

- (arXiv 2022.07) Cross-Attention Transformer for **Video Interpolation**, [[Paper]](https://arxiv.org/pdf/2207.04132.pdf)

- (arXiv 2022.07) Towards Multimodal **Vision-Language** Models Generating Non-Generic Text, [[Paper]](https://arxiv.org/pdf/2207.04174.pdf)

- (arXiv 2022.07) QKVA grid: **Attention** in Image Perspective and Stacked DETR, [[Paper]](https://arxiv.org/pdf/2207.04313.pdf), [[Code]](https://github.com/shengwenyuan/sdetr)

- (arXiv 2022.07) Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person **3D Pose Estimation Tracking** and **Forecasting** on a Video Snippet, [[Paper]](https://arxiv.org/pdf/2207.04320.pdf), [[Code]](https://github.com/JimmyZou/Snipper)

- (arXiv 2022.07) Horizontal and Vertical **Attention** in Transformers, [[Paper]](https://arxiv.org/pdf/2207.04399.pdf)

- (arXiv 2022.07) CoMER: Modeling Coverage for Transformer-based **Handwritten Mathematical Expression Recognition**, [[Paper]](https://arxiv.org/pdf/2207.04410.pdf), [[Code]](https://github.com/Green-Wood/CoMER)

- (arXiv 2022.07) DPText-DETR: Towards Better **Scene Text Detection** with Dynamic Points in Transformer, [[Paper]](https://arxiv.org/pdf/2207.04491.pdf), [[Code]](https://github.com/ymy-k/DPText-DETR)

- (arXiv 2022.07) DEPTHFORMER: MULTISCALE VISION TRANSFORMER FOR **MONOCULAR DEPTH ESTIMATION** WITH GLOBAL LOCAL INFORMATION FUSION, [[Paper]](https://arxiv.org/pdf/2207.04535.pdf), [[Code]](https://github.com/ashutosh1807/Depthformer.git)

- (arXiv 2022.07) LaT: Latent Translation with Cycle-Consistency for **Video-Text Retrieval**, [[Paper]](https://arxiv.org/pdf/2207.04858.pdf)

- (arXiv 2022.07) **Dual** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2207.04976.pdf), [[Code]](https://github.com/YehLi/ImageNetModel)

- (arXiv 2022.07) Wave-ViT: Unifying **Wavelet** and Transformers for Visual **Representation** Learning, [[Paper]](https://arxiv.org/pdf/2207.04978.pdf), [[Code]](https://github.com/YehLi/ImageNetModel)

- (arXiv 2022.07) Scaling Novel Object **Detection** with Weakly Supervised Detection Transformers, [[Paper]](https://arxiv.org/pdf/2207.05205.pdf)

- (arXiv 2022.07) Hunting Group Clues with Transformers for Social **Group Activity Recognition**, [[Paper]](https://arxiv.org/pdf/2207.05254.pdf)

- (arXiv 2022.07) **Outpainting** by Queries, [[Paper]](https://arxiv.org/pdf/2207.05312.pdf), [[Code]](https://github.com/Kaiseem/QueryOTR)

- (arXiv 2022.07) IDEA: Increasing Text Diversity via Online Multi-Label Recognition for **Vision-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2207.05333.pdf)

- (arXiv 2022.07) Video Graph Transformer for **Video Question Answering**, [[Paper]](https://arxiv.org/pdf/2207.05342.pdf), [[Code]](https://github.com/sail-sg/VGT)

- (arXiv 2022.07) Next-ViT: Next Generation Vision Transformer for **Efficient Deployment** in Realistic **Industrial** Scenarios, [[Paper]](https://arxiv.org/pdf/2207.05501.pdf)

- (arXiv 2022.07) UniNet: Unified **Architecture Search** with Convolution, Transformer, and MLP, [[Paper]](https://arxiv.org/pdf/2207.05420.pdf), [[Code]](https://github.com/Sense-X/UniNet)

- (arXiv 2022.07) Image and Model Transformation with **Secret Key** for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2207.05366.pdf)

- (arXiv 2022.07) eX-ViT: A Novel eXplainable Vision Transformer for **Weakly Supervised Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2207.05358.pdf)

- (arXiv 2022.07) Compound Prototype Matching for **Few-shot Action Recognition**, [[Paper]](https://arxiv.org/pdf/2207.05515.pdf)

- (arXiv 2022.07) Long-term Leap Attention, Short-term Periodic Shift for **Video Classification**, [[Paper]](https://arxiv.org/pdf/2207.05526.pdf), [[Code]](https://github.com/VideoNetworks/LAPS-transformer)

- (arXiv 2022.07) LightViT: Towards **Light**-Weight **Convolution-Free** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2207.05557.pdf), [[Code]](https://github.com/hunto/LightViT)

- (arXiv 2022.07) Learning from **Label Relationships** in Human Affect, [[Paper]](https://arxiv.org/pdf/2207.05577.pdf)

- (arXiv 2022.07) MSP-Former: Multi-Scale Projection Transformer for Single Image **Desnowing**, [[Paper]](https://arxiv.org/pdf/2207.05621.pdf)

- (arXiv 2022.07) Tell Me the Evidence? Dual **Visual-Linguistic** Interaction for **Answer Grounding**, [[Paper]](https://arxiv.org/pdf/2207.05703.pdf)

- (arXiv 2022.07) Vision Transformer for NeRF-Based **View Synthesis** from a Single Input Image, [[Paper]](https://arxiv.org/pdf/2207.05736.pdf), [[Code]](https://cseweb.ucsd.edu/~viscomp/projects/VisionNeRF/)

- (arXiv 2022.07) COSIM: Commonsense Reasoning for **Counterfactual Scene Imagination**, [[Paper]](https://arxiv.org/pdf/2207.03961.pdf), [[Code]](https://github.com/hyounghk/CoSIm)

- (arXiv 2022.07) Beyond Transfer Learning: Co-finetuning for **Action Localisation**, [[Paper]](https://arxiv.org/pdf/2207.03807.pdf)

- (arXiv 2022.07) RePFormer: Refinement Pyramid Transformer for Robust **Facial Landmark Detection**, [[Paper]](https://arxiv.org/pdf/2207.03917.pdf)

- (arXiv 2022.07) k-means **Mask** Transformer, [[Paper]](https://arxiv.org/pdf/2207.04044.pdf), [[Code]](https://github.com/google-research/deeplab2)

- (arXiv 2022.07) **Training** Transformers Together, [[Paper]](https://arxiv.org/pdf/2207.03481.pdf), [[Code]](https://training-transformers-together.github.io/)

- (arXiv 2022.07) Improving **Few-Shot Image Classification** Using Machine- and User-Generated Natural Language Descriptions, [[Paper]](https://arxiv.org/pdf/2207.03133.pdf)

- (arXiv 2022.07) MaiT: Leverage **Attention Masks** for More **Efficient** Image Transformers, [[Paper]](https://arxiv.org/pdf/2207.03006.pdf)

- (arXiv 2022.07) Dual-Stream Transformer for Generic **Event Boundary Captioning**, [[Paper]](https://arxiv.org/pdf/2207.03038.pdf), [[Code]](https://github.com/GX77/Dual-Stream-Transformer-for-Generic-Event-Boundary-Captioning)

- (arXiv 2022.07) **Softmax-free** Linear Transformers, [[Paper]](https://arxiv.org/pdf/2207.03341.pdf), [[Code]](https://github.com/fudan-zvg/SOFT)

- (arXiv 2022.07) Bridging the Gap between Object and Image-level Representations for **Open-Vocabulary Detection**, [[Paper]](https://arxiv.org/pdf/2207.03482.pdf), [[Code]](https://bit.ly/3byZoQp)

- (arXiv 2022.07) Transformers are Adaptable **Task Planners**, [[Paper]](https://arxiv.org/pdf/2207.02442.pdf), [[Code]](https://anonymous.4open.science/r/temporal_task_planner-Paper148/)

- (arXiv 2022.07) **ARRAY CAMERA IMAGE FUSION** USING PHYSICS-AWARE TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2207.02250.pdf)

- (arXiv 2022.07) OSFormer: One-Stage Camouflaged Instance **Segmentation** with Transformers, [[Paper]](https://arxiv.org/pdf/2207.02255.pdf), [[Code]](https://github.com/PJLallen/OSFormer)

- (arXiv 2022.07) Weakly Supervised Grounding for **VQA** in Vision-Language Transformers, [[Paper]](https://arxiv.org/pdf/2207.02334.pdf), [[Code]](https://github.com/aurooj/WSG-VQA-VLTransformers)

- (arXiv 2022.07) PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2207.02583.pdf)

- (arXiv 2022.07) STVGFormer: Spatio-Temporal **Video Grounding** with Static-Dynamic Cross-Modal Understanding, [[Paper]](https://arxiv.org/pdf/2207.02756.pdf)

- (arXiv 2022.07) Towards Counterfactual **Image Manipulation** via **CLIP**, [[Paper]](https://arxiv.org/pdf/2207.02812.pdf)

- (arXiv 2022.07) MatFormer: A **Generative** Model for Procedural **Materials**, [[Paper]](https://arxiv.org/pdf/2207.01044.pdf)

- (arXiv 2022.07) Multimodal Frame-Scoring Transformer for **Video Summarization**, [[Paper]](https://arxiv.org/pdf/2207.01814.pdf)

- (arXiv 2022.07) **3D Part Assembly** Generation with Instance Encoded Transformer, [[Paper]](https://arxiv.org/pdf/2207.01779.pdf)

- (arXiv 2022.07) Scene-Aware Prompt for Multi-modal **Dialogue Understanding and Generation**, [[Paper]](https://arxiv.org/pdf/2207.01823.pdf)

- (arXiv 2022.07) **Efficient** Representation Learning via Adaptive Context Pooling, [[Paper]](https://arxiv.org/pdf/2207.01844.pdf)

- (arXiv 2022.07) **Gaze Target Estimation** inspired by Interactive Attention, [[Paper]](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9828503), [[Code]](https://github.com/nkuhzx/VSG-IA)

- (arXiv 2022.07) Generalizable Patch-Based **Neural Rendering**, [[Paper]](https://arxiv.org/pdf/2207.10662.pdf), [[Project]](https://mohammedsuhail.net/gen_patch_neural_rendering/)

- (arXiv 2022.07) Interaction Transformer for Human **Reaction Generation**, [[Paper]](https://arxiv.org/pdf/2207.01685.pdf)

- (arXiv 2022.07) TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of **3D Human Motions and Texts**, [[Paper]](https://arxiv.org/pdf/2207.01696.pdf), [[Project]](https://ericguo5513.github.io/TM2T/)

- (arXiv 2022.07) FishFormer: Annulus Slicing-based Transformer for **Fisheye Rectification** with Efficacy Domain Exploration, [[Paper]](https://arxiv.org/pdf/2207.01925.pdf)

- (arXiv 2022.07) Open-Vocabulary Multi-Label Classification via Multi-modal **Knowledge Transfer**, [[Paper]](https://arxiv.org/pdf/2207.01887.pdf), [[Code]](https://github.com/seanhe97/MKT)

- (arXiv 2022.07) Toward Explainable and Fine-Grained **3D Grounding** through Referring Textual Phrases, [[Paper]](https://arxiv.org/pdf/2207.01821.pdf), [[Code]](https://yanx27.github.io/phraserefer/)

- (arXiv 2022.07) Improving **Semantic Segmentation** in Transformers using Hierarchical Inter-Level Attention, [[Paper]](https://arxiv.org/pdf/2207.02126.pdf)

- (arXiv 2022.07) MULTI-MODAL **ROBUSTNESS** ANALYSIS AGAINST **LANGUAGE AND VISUAL** PERTURBATIONS, [[Paper]](https://arxiv.org/pdf/2207.02159.pdf), [[Project]](https://maddy12.github.io/MultiModalVideoRobustness/)

- (arXiv 2022.07) CoBEVT: Cooperative **Bird’s Eye View Semantic Segmentation** with Sparse Transformers, [[Paper]](https://arxiv.org/pdf/2207.02202.pdf)

- (arXiv 2022.07) **Segmenting Moving Objects** via an Object-Centric Layered Representation, [[Paper]](https://arxiv.org/pdf/2207.02206.pdf)

- (arXiv 2022.07) Counterfactually Measuring and Eliminating **Social Bias** in **Vision-Language** Pre-training Models, [[Paper]](https://arxiv.org/pdf/2207.01056.pdf)

- (arXiv 2022.07) Contrastive Cross-Modal Knowledge Sharing Pre-training for **Vision-Language** Representation Learning and Retrieval, [[Paper]](https://arxiv.org/pdf/2207.00733.pdf)

- (arXiv 2022.07) Learning Cross-Image Object Semantic Relation in Transformer for **Few-Shot Fine-Grained** Image **Classification**, [[Paper]](https://arxiv.org/pdf/2207.00784.pdf), [[Code]](https://github.com/JiakangYuan/HelixFormer)

- (arXiv 2022.07) Memory-Based Label-Text Tuning for **Few-Shot** Class-**Incremental** **Learning**, [[Paper]](https://arxiv.org/pdf/2207.01036.pdf)

- (arXiv 2022.07) Exploiting Context Information for Generic Event Boundary **Captioning**, [[Paper]](https://arxiv.org/pdf/2207.01050.pdf), [[Code]](https://github.com/zjr2000/Context-GEBC)

- (arXiv 2022.07) You Only Need One **Detector**: Unified Object Detector for **Different Modalities** based on Vision Transformers, [[Paper]](https://arxiv.org/pdf/2207.01071.pdf), [[Code]](https://github.com/liketheflower/YONOD.git)

- (arXiv 2022.07) Divert More Attention to **Vision-Language Tracking**, [[Paper]](https://arxiv.org/pdf/2207.01076.pdf), [[Code]](https://github.com/JudasDie/SOTS)

- (arXiv 2022.07) Can **Language** Understand **Depth**? [[Paper]](https://arxiv.org/pdf/2207.01077.pdf), [[Code]](https://github.com/Adonis-galaxy/DepthCLIP)

- (arXiv 2022.07) TANet: Transformer-based Asymmetric Network for **RGB-D Salient Object Detection**, [[Paper]](https://arxiv.org/pdf/2207.01172.pdf), [[Code]](https://github.com/lc012463/TANet)

- (arXiv 2022.07) DUET: Cross-modal Semantic Grounding for **Contrastive Zero-shot Learning**, [[Paper]](https://arxiv.org/pdf/2207.01328.pdf)

- (arXiv 2022.07) Transferring **Textual Knowledge** for Visual Recognition, [[Paper]](https://arxiv.org/pdf/2207.01297.pdf), [[Code]](https://github.com/whwu95/Text4Vis)

- (arXiv 2022.07) R^2-VOS: Robust Referring **Video** Object **Segmentation** via Relational Cycle Consistency, [[Paper]](https://arxiv.org/pdf/2207.01203.pdf)

- (arXiv 2022.07) CRFormer: A Cross-Region Transformer for **Shadow Removal**, [[Paper]](https://arxiv.org/pdf/2207.01600.pdf)

- (arXiv 2022.07) Dynamic **Spatial Sparsification** for **Efficient** Vision Transformers and Convolutional Neural Networks, [[Paper]](https://arxiv.org/pdf/2207.01580.pdf), [[Code]](https://github.com/raoyongming/DynamicViT)

- (arXiv 2022.07) Back to MLP: A Simple Baseline for Human **Motion Prediction**, [[Paper]](https://arxiv.org/pdf/2207.01567.pdf), [[Code]](https://github.com/dulucas/siMLPe)

- (arXiv 2022.07) I-ViT: Integer-only **Quantization** for **Efficient** Vision Transformer Inference, [[Paper]](https://arxiv.org/pdf/2207.01405.pdf)

- (arXiv 2022.07) Rethinking **Query-Key** Pairwise Interactions in Vision Transformers, [[Paper]](https://arxiv.org/pdf/2207.00188.pdf)

- (arXiv 2022.07) LARGE-SCALE **ROBUSTNESS** ANALYSIS OF **VIDEO ACTION RECOGNITION** MODELS, [[Paper]](https://arxiv.org/pdf/2207.01398.pdf), [[Code]](https://rose-ar.github.io/)

- (arXiv 2022.07) VL-CheckList: **Evaluating** Pre-trained **Vision-Language** Models with Objects, Attributes and Relations, [[Paper]](https://arxiv.org/pdf/2207.00221.pdf), [[Code]](https://github.com/om-ai-lab/VL-CheckList)

- (arXiv 2022.07) **Masked Autoencoders** for Self-Supervised Learning on Automotive **Point Clouds**, [[Paper]](https://arxiv.org/pdf/2207.00531.pdf)

- (arXiv 2022.07) MotionMixer: **MLP**-based **3D** Human Body **Pose Forecasting**, [[Paper]](https://arxiv.org/pdf/2207.00499.pdf), [[Code]](https://github.com/MotionMLP/MotionMixer)

- (arXiv 2022.07) DALG: Deep Attentive Local and Global Modeling for **Image Retrieval**, [[Paper]](https://arxiv.org/pdf/2207.00287.pdf)

- (arXiv 2022.07) PolarFormer: Multi-camera **3D Object Detection** with Polar Transformers, [[Paper]](https://arxiv.org/pdf/2206.15398.pdf), [[Code]](https://github.com/fudan-zvg/PolarFormer)

- (arXiv 2022.07) CTrGAN: Cycle Transformers GAN for **Gait Transfer**, [[Paper]](https://arxiv.org/pdf/2206.15248.pdf)

- (arXiv 2022.07) LM-Nav: Robotic **Navigation** with Large Pre-Trained Models of **Language**, **Vision**, and **Action**, [[Paper]](https://arxiv.org/pdf/2207.04429.pdf)

- (arXiv 2022.07) Bootstrapped **Masked Autoencoders** for Vision BERT Pretraining, [[Paper]](https://arxiv.org/pdf/2207.07116.pdf), [[Code]](https://github.com/LightDXY/BootMAE)

- (arXiv 2022.07) ReAct: **Temporal Action Detection** with Relational Queries, [[Paper]](https://arxiv.org/pdf/2207.07097.pdf), [[Code]](https://github.com/sssste/React)

- (arXiv 2022.07) **Benchmarking** **Omni-Vision** Representation through the Lens of Visual **Realms**, [[Paper]](https://arxiv.org/pdf/2207.07106.pdf), [[Project]](https://zhangyuanhan-ai.github.io/OmniBenchmark)

- (arXiv 2022.07) **Convolutional Bypasses** Are Better Vision Transformer **Adapters**, [[Paper]](https://arxiv.org/pdf/2207.07039.pdf)

- (arXiv 2022.07) **LANGUAGE MODELLING** WITH PIXELS, [[Paper]](https://arxiv.org/pdf/2207.06991.pdf)

- (arXiv 2022.07) Transformer-based Context Condensation for Boosting Feature Pyramids in Object **Detection**, [[Paper]](https://arxiv.org/pdf/2207.06603.pdf)

- (arXiv 2022.07) **DEEPFAKE VIDEO DETECTION** WITH SPATIOTEMPORAL DROPOUT TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2207.06612.pdf)

- (arXiv 2022.07) iColoriT: Towards Propagating Local Hint to the Right Region in **Interactive Colorization** by Leveraging Vision Transformer, [[Paper]](https://arxiv.org/pdf/2207.06831.pdf)

- (arXiv 2022.07) **Imaging** through the **Atmosphere** using **Turbulence** Mitigation Transformer, [[Paper]](https://arxiv.org/pdf/2207.06465.pdf)

- (arXiv 2022.07) Symmetry-Aware Transformer-based **Mirror Detection**, [[Paper]](https://arxiv.org/pdf/2207.06332.pdf), [[Code]](https://github.com/tyhuang0428/SATNet)

- (arXiv 2022.07) Pyramid Transformer for **Traffic Sign Detection**, [[Paper]](https://arxiv.org/pdf/2207.06067.pdf)

- (arXiv 2022.07) Global-local Motion Transformer for Unsupervised **Skeleton**-based **Action** Learning, [[Paper]](https://arxiv.org/pdf/2207.06101.pdf), [[Code]](https://github.com/Boeun-Kim/GL-Transformer)

- (arXiv 2022.07) DynaST: Dynamic Sparse Transformer for Exemplar-Guided **Image Generation**, [[Paper]](https://arxiv.org/pdf/2207.06124.pdf)

- (arXiv 2022.07) Trans4Map: Revisiting Holistic Top-down Mapping from **Egocentric Images** to Allocentric Semantics with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2207.06205.pdf), [[Code]](https://github.com/jamycheung/Trans4Map)

- (arXiv 2022.07) Entry-Flipped Transformer for Inference and Prediction of **Participant Behavior**, [[Paper]](https://arxiv.org/pdf/2207.06235.pdf)

- (arXiv 2022.07) Wayformer: **Motion Forecasting** via Simple & Efficient Attention Networks, [[Paper]](https://arxiv.org/pdf/2207.05844.pdf)

- (arXiv 2022.07) Diverse **Dance Synthesis** via Keyframes with Transformer Controllers, [[Paper]](https://arxiv.org/pdf/2207.05906.pdf)

- (arXiv 2022.07) Learning to Estimate **External Forces** of Human **Motion** in Video, [[Paper]](https://arxiv.org/pdf/2207.05845.pdf)

- (arXiv 2022.07) Vision Transformer for **Contrastive Clustering**, [[Paper]](https://arxiv.org/pdf/2206.12925.pdf), [[Code]](https://github.com/JackKoLing/VTCC)

- (arXiv 2022.07) Pose2Room: Understanding **3D Scenes** from **Human Activities**, [[Paper]](https://arxiv.org/pdf/2112.03030.pdf)

- (arXiv 2022.07) Towards Hard-Positive Query Mining for DETR-based **Human-Object Interaction Detection**, [[Paper]](https://arxiv.org/pdf/2207.05293.pdf), [[Code]](https://github.com/MuchHair/HQM)

- (arXiv 2022.07) Cross-Architecture **Knowledge Distillation**, [[Paper]](https://arxiv.org/pdf/2207.05273.pdf)

- (arXiv 2022.07) Distance Matters in **Human-Object Interaction Detection**, [[Paper]](https://arxiv.org/pdf/2207.01869.pdf)
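
Several efficiency-oriented entries above (e.g., Softmax-free Linear Transformers) replace the quadratic softmax attention with a linearized, softmax-free form. The snippet below is a minimal, paper-agnostic PyTorch sketch of that general idea; the `elu(x) + 1` feature map, tensor shapes, and function name are illustrative assumptions, not the exact formulation of any paper listed here.

```python
# Minimal sketch of softmax-free (linear) attention; assumptions noted above.
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, heads, tokens, dim).
    Cost is O(n * d^2) instead of the O(n^2 * d) of softmax attention."""
    q = F.elu(q) + 1.0  # positive feature map standing in for exp(<q, k>)
    k = F.elu(k) + 1.0
    kv = torch.einsum('bhnd,bhne->bhde', k, v)                    # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(2)) + eps)  # per-token normalizer
    return torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)


if __name__ == "__main__":
    x = torch.randn(2, 4, 197, 64)  # e.g. a ViT with 196 patch tokens + [CLS]
    print(linear_attention(x, x, x).shape)  # torch.Size([2, 4, 197, 64])
```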

### 2022.06

- (arXiv 2022.06) TENET: Transformer Encoding Network for Effective Temporal Flow on **Motion Prediction**, [[Paper]](https://arxiv.org/pdf/2207.00170.pdf)

- (arXiv 2022.06) GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for **Few-Shot Gait Impairment Severity Estimation**, [[Paper]](https://arxiv.org/pdf/2207.00106.pdf), [[Code]](https://github.com/markendo/GaitForeMer)

- (arXiv 2022.06) GSCLIP : A Framework for Explaining **Distribution Shifts** in Natural **Language**, [[Paper]](https://arxiv.org/pdf/2206.15007.pdf)

- (arXiv 2022.06) Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2206.15002.pdf)

- (arXiv 2022.06) Causality for Inherently **Explainable** Transformers: CAT-XPLAIN, [[Paper]](https://arxiv.org/pdf/2206.14841.pdf), [[Code]](https://github.com/mvrl/CAT-XPLAIN)

- (arXiv 2022.06) A Unified End-to-End Retriever-Reader Framework for Knowledge-based **VQA**, [[Paper]](https://arxiv.org/pdf/2206.14989.pdf)

- (arXiv 2022.06) **Continual Learning** with Transformers for **Image Classification**, [[Paper]](https://arxiv.org/pdf/2206.14085.pdf)

- (arXiv 2022.06) ST-Adapter: Parameter-**Efficient** **Image-to-Video** Transfer Learning for **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2206.13559.pdf)

- (arXiv 2022.06) **Robustifying** Vision Transformer without Retraining from Scratch by **Test-Time** Class-Conditional Feature Alignment, [[Paper]](https://arxiv.org/pdf/2206.13951.pdf), [[Code]](https://github.com/kojima-takeshi188/CFA)

- (arXiv 2022.06) Leveraging **Language** for Accelerated Learning of **Tool Manipulation**, [[Paper]](https://arxiv.org/pdf/2206.13074.pdf)

- (arXiv 2022.06) RoME: Role-aware Mixture-of-Expert Transformer for **Text-to-Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2206.12845.pdf)

- (arXiv 2022.06) VLCAP: **VISION-LANGUAGE** WITH CONTRASTIVE LEARNING FOR COHERENT VIDEO PARAGRAPH **CAPTIONING**, [[Paper]](https://arxiv.org/pdf/2206.12972.pdf), [[Code]](https://github.com/UARK-AICV/VLCAP)

- (arXiv 2022.06) Video2**StyleGAN**: Encoding **Video** in Latent Space for **Manipulation**, [[Paper]](https://arxiv.org/pdf/2206.13078.pdf)

- (arXiv 2022.06) Text-Driven **Stylization** of **Video** Objects, [[Paper]](https://arxiv.org/pdf/2206.12396.pdf), [[Project]](https://sloeschcke.github.io/Text-Driven-Stylization-of-Video-Objects/)

- (arXiv 2022.06) **Open Vocabulary** Object **Detection** with Proposal Mining and Prediction Equalization, [[Paper]](https://arxiv.org/pdf/2206.11134.pdf), [[Code]](https://github.com/Pealing/MEDet)

- (arXiv 2022.06) CMT-DeepLab: Clustering Mask Transformers for Panoptic **Segmentation**, [[Paper]](https://arxiv.org/pdf/2206.08948.pdf)

- (arXiv 2022.06) Towards Adversarial **Attack** on **Vision-Language** Pre-training Models, [[Paper]](https://arxiv.org/pdf/2206.09391.pdf)

- (arXiv 2022.06) CLiMB: A **Continual Learning** Benchmark for **Vision-and-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2206.09059.pdf), [[Code]](https://github.com/GLAMOR-USC/CLiMB)

- (arXiv 2022.06) **VISUALIZING** AND UNDERSTANDING **SELF-SUPERVISED** VISION LEARNING, [[Paper]](https://arxiv.org/pdf/2206.09753.pdf), [[Code]](https://github.com/fawazsammani/xai-ssl)

- (arXiv 2022.06) VReBERT: A Simple and Flexible Transformer for **Visual Relationship Detection**, [[Paper]](https://arxiv.org/pdf/2206.09111.pdf)

- (arXiv 2022.06) Bear the Query in Mind: **Visual Grounding** with Query-conditioned Convolution, [[Paper]](https://arxiv.org/pdf/2206.09114.pdf)

- (arXiv 2022.06) **DALL-E** for **Detection**: Language-driven Context Image Synthesis for Object Detection, [[Paper]](https://arxiv.org/pdf/2206.09592.pdf)

- (arXiv 2022.06) REVECA – Rich Encoder-decoder framework for **Video Event CAptioner**, [[Paper]](https://arxiv.org/pdf/2206.09178.pdf), [[Code]](https://github.com/TooTouch/REVECA)

- (arXiv 2022.06) SAViR-T: Spatially Attentive **Visual Reasoning** with Transformers, [[Paper]](https://arxiv.org/pdf/2206.09265.pdf)

- (arXiv 2022.06) EATFormer: Improving Vision Transformer Inspired by **Evolutionary** Algorithm, [[Paper]](https://arxiv.org/pdf/2206.09325.pdf), [[Code]](https://github.com/zhangzjn/EATFormer)

- (arXiv 2022.06) Dual**CoOp**: Fast Adaptation to **Multi-Label** Recognition with Limited Annotations, [[Paper]](https://arxiv.org/pdf/2206.09541.pdf)

- (arXiv 2022.06) Capturing and Inferring Dense Full-Body **Human-Scene Contact**, [[Paper]](https://arxiv.org/pdf/2206.09553.pdf), [[Project]](https://rich.is.tue.mpg.de/)

- (arXiv 2022.06) M&M Mix: A Multimodal Multiview Transformer **Ensemble**, [[Paper]](https://arxiv.org/pdf/2206.09852.pdf)

- (arXiv 2022.06) DisCoVQA: Temporal Distortion-Content Transformers for **Video Quality Assessment**, [[Paper]](https://arxiv.org/pdf/2206.09853.pdf)

- (arXiv 2022.06) Voxel-**MAE**: Masked Autoencoders for Pre-training Large-scale **Point Clouds**, [[Paper]](https://arxiv.org/pdf/2206.09900.pdf), [[Code]](https://github.com/chaytonmin/Voxel-MAE)

- (arXiv 2022.06) **Global Context** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2206.09959.pdf), [[Code]](https://github.com/NVlabs/GCViT)

- (arXiv 2022.06) **Counting** Varying Density **Crowds** Through Density Guided Adaptive Selection CNN and Transformer Estimation, [[Paper]](https://arxiv.org/pdf/2206.10075.pdf)

- (arXiv 2022.06) One-stage **Action Detection** Transformer, [[Paper]](https://arxiv.org/pdf/2206.10080.pdf)

- (arXiv 2022.06) Sem**MAE**: Semantic-Guided Masking for Learning Masked Autoencoders, [[Paper]](https://arxiv.org/pdf/2206.10207.pdf)

- (arXiv 2022.06) TRANSFORMER-BASED MULTI-MODAL PROPOSAL AND RE-RANK FOR WIKIPEDIA **IMAGE-CAPTION** MATCHING, [[Paper]](https://arxiv.org/pdf/2206.10436.pdf), [[Code]](https://github.com/mesnico/Wiki-Image-Caption-Matching)

- (arXiv 2022.06) **Vicinity** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2206.10552.pdf), [[Code]](https://github.com/OpenNLPLab/Vicinity-Vision-Transformer)

- (arXiv 2022.06) EdgeNeXt: **Efficiently** Amalgamated **CNN-Transformer** Architecture for Mobile Vision Applications, [[Paper]](https://arxiv.org/pdf/2206.10589.pdf), [[Code]](https://t.ly/_Vu9)

- (arXiv 2022.06) Temporally Consistent Semantic **Video Editing**, [[Paper]](https://arxiv.org/pdf/2206.10590.pdf)

- (arXiv 2022.06) VLMbench: A Compositional Benchmark for **Vision-and-Language Manipulation**, [[Paper]](https://arxiv.org/pdf/2206.08522.pdf)

- (arXiv 2022.06) MINEDOJO: Building Open-Ended **Embodied Agents** with Internet-Scale Knowledge, [[Paper]](https://arxiv.org/pdf/2206.08853.pdf), [[Project]](https://minedojo.org/)

- (arXiv 2022.06) IRISformer: Dense Vision Transformers for Single-Image **Inverse Rendering** in Indoor Scenes, [[Paper]](https://arxiv.org/pdf/2206.08423.pdf), [[Code]](https://github.com/ViLab-UCSD/IRISformer)

- (arXiv 2022.06) Backdoor **Attacks** on Vision Transformers, [[Paper]](https://arxiv.org/pdf/2206.08477.pdf), [[Code]](https://github.com/UCDvision/backdoor_transformer.git)

- (arXiv 2022.06) Rectify ViT **Shortcut** Learning by Visual **Saliency**, [[Paper]](https://arxiv.org/pdf/2206.08567.pdf)

- (arXiv 2022.06) Learning Using Privileged Information for **Zero-Shot Action Recognition**, [[Paper]](https://arxiv.org/pdf/2206.08632.pdf)

- (arXiv 2022.06) Bridge-Tower: Building Bridges Between Encoders in **Vision-Language** Representation Learning, [[Paper]](https://arxiv.org/pdf/2206.08657.pdf), [[Code]](https://github.com/microsoft/BridgeTower)

- (arXiv 2022.06) CtrlFormer: Learning Transferable State Representation for **Visual Control** via Transformer, [[Paper]](https://arxiv.org/pdf/2206.08883.pdf), [[Project]](https://sites.google.com/view/ctrlformer-icml/)

- (arXiv 2022.06) SimA: Simple **Softmax-free** **Attention** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2206.08898.pdf), [[Code]](https://github.com/UCDvision/sima)

- (arXiv 2022.06) UNIFIED-IO: A **UNIFIED MODEL** FOR **VISION, LANGUAGE**, AND **MULTI-MODAL** TASKS, [[Paper]](https://arxiv.org/pdf/2206.08916.pdf), [[Project]](https://unified-io.allenai.org/)

- (arXiv 2022.06) VLMixer: Unpaired **Vision-Language** Pre-training via Cross-Modal CutMix, [[Paper]](https://arxiv.org/pdf/2206.08919.pdf), [[Code]](https://github.com/ttengwang/VLMixer)

- (arXiv 2022.06) ReLER@ZJU-Alibaba Submission to the **Ego4D** Natural Language **Queries** Challenge 2022, [[Paper]](https://arxiv.org/pdf/2207.00383.pdf)

- (arXiv 2022.06) Video + **CLIP** Baseline for **Ego4D** Long-term **Action Anticipation**, [[Paper]](https://arxiv.org/pdf/2207.00579.pdf), [[Code]](https://github.com/srijandas07/clip_baseline_LTA_Ego4d)

- (arXiv 2022.06) What makes **domain generalization** hard?, [[Paper]](https://arxiv.org/pdf/2206.07802.pdf)

- (arXiv 2022.06) SAVi++: Towards End-to-End **Object-Centric Learning** from Real-World **Videos**, [[Paper]](https://arxiv.org/pdf/2206.07764.pdf), [[Code]](https://slot-attention-video.github.io/savi++/)

- (arXiv 2022.06) Disentangling **visual** and **written** **concepts** in **CLIP**, [[Paper]](https://arxiv.org/pdf/2206.07835.pdf), [[Project]](https://joaanna.github.io/disentangling_spelling_in_clip/)

- (arXiv 2022.06) Multi-scale Cooperative Multimodal Transformers for Multimodal **Sentiment Analysis** in Videos, [[Paper]](https://arxiv.org/pdf/2206.07981.pdf)

- (arXiv 2022.06) **Patch**-level **Representation** Learning for Self-supervised Vision Transformers, [[Paper]](https://arxiv.org/pdf/2206.07990.pdf)

- (arXiv 2022.06) **Zero-Shot Video Question Answering** via Frozen Bidirectional Language Models, [[Paper]](https://arxiv.org/pdf/2206.08155.pdf), [[Code]](https://antoyang.github.io/frozenbilm.html)

- (arXiv 2022.06) Omni**MAE**: Single Model Masked Pretraining on Images and Videos, [[Paper]](https://arxiv.org/pdf/2206.08356.pdf), [[Code]](https://github.com/facebookresearch/omnivore) (a minimal masked-autoencoding sketch is given at the end of this month's list)

- (arXiv 2022.06) **Adapting** Self-Supervised Vision Transformers by Probing Attention-Conditioned **Masking** Consistency, [[Paper]](https://arxiv.org/pdf/2206.08222.pdf), [[Code]](https://github.com/virajprabhu/PACMAC)

- (arXiv 2022.06) LAVENDER: Unifying **Video-Language** Understanding as Masked Language Modeling, [[Paper]](https://arxiv.org/pdf/2206.07160.pdf), [[Code]](https://github.com/microsoft/LAVENDER)

- (arXiv 2022.06) Multimodal Event Graphs: Towards **Event Centric Understanding** of **Multimodal** World, [[Paper]](https://arxiv.org/pdf/2206.07207.pdf)

- (arXiv 2022.06) Rethinking Generalization in **Few-Shot Classification**, [[Paper]](https://arxiv.org/pdf/2206.07267.pdf), [[Code]](https://github.com/mrkshllr/FewTURE)

- (arXiv 2022.06) VCT: A **Video Compression** Transformer, [[Paper]](https://arxiv.org/pdf/2206.07307.pdf)

- (arXiv 2022.06) **Forecasting** of **depth** and **ego-motion** with transformers and self-supervision, [[Paper]](https://arxiv.org/pdf/2206.07435.pdf)

- (arXiv 2022.06) Coarse-to-Fine **Vision-Language** Pre-training with Fusion in the Backbone, [[Paper]](https://arxiv.org/pdf/2206.07643.pdf), [[Code]](https://github.com/microsoft/FIBER)

- (arXiv 2022.06) SP-ViT: Learning **2D Spatial Priors** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2206.07662.pdf)

- (arXiv 2022.06) A Simple **Data Mixing** Prior for Improving **Self-Supervised** Learning, [[Paper]](https://arxiv.org/pdf/2206.07692.pdf), [[Code]](https://github.com/OliverRensu/SDMP)

- (arXiv 2022.06) Prefix Language Models are **Unified Modal Learners**, [[Paper]](https://arxiv.org/pdf/2206.07699.pdf), [[Code]](https://github.com/shizhediao/DaVinci)

- (arXiv 2022.06) **Masked Frequency Modeling** for Self-Supervised Visual Pre-Training, [[Paper]](https://arxiv.org/pdf/2206.07706.pdf), [[Code]](https://www.mmlab-ntu.com/project/mfm/index.html)

- (arXiv 2022.06) Generalizable **Neural Radiance Fields** for Novel View Synthesis with Transformer, [[Paper]](https://arxiv.org/pdf/2206.05375.pdf)

- (arXiv 2022.06) A Unified **Continuous Learning** Framework for Multi-modal Knowledge Discovery and Pre-training, [[Paper]](https://arxiv.org/pdf/2206.05555.pdf)

- (arXiv 2022.06) Learning to Estimate **Shapley Values** with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2206.05282.pdf), [[Code]](https://github.com/suinleelab/vit-shapley)

- (arXiv 2022.06) Graph-based Spatial Transformer with Memory Replay for Multi-future **Pedestrian Trajectory Prediction**, [[Paper]](https://arxiv.org/pdf/2206.05712.pdf), [[Code]](https://github.com/Jacobieee/ST-MR)

- (arXiv 2022.06) **GLIPv2**: Unifying Localization and **VL Understanding**, [[Paper]](https://arxiv.org/pdf/2206.05836.pdf), [[Code]](https://github.com/microsoft/GLIP)

- (arXiv 2022.06) INDIGO: Intrinsic Multimodality for **Domain Generalization**, [[Paper]](https://arxiv.org/pdf/2206.05912.pdf)

- (arXiv 2022.06) TRANSDUCTIVE **CLIP** WITH **CLASS-CONDITIONAL** CONTRASTIVE LEARNING, [[Paper]](https://arxiv.org/pdf/2206.06177.pdf)

- (arXiv 2022.06) SILVER-BULLET-3D AT MANISKILL 2021: LEARNING-FROM-DEMONSTRATIONS AND HEURISTIC RULE-BASED METHODS FOR **OBJECT MANIPULATION**, [[Paper]](https://arxiv.org/pdf/2206.06289.pdf), [[Code]](https://github.com/caiqi/Silver-Bullet-3D/)

- (arXiv 2022.06) MLP-3D: A **MLP**-like **3D** Architecture with Grouped Time Mixing, [[Paper]](https://arxiv.org/pdf/2206.06292.pdf), [[Code]](https://github.com/ZhaofanQiu/MLP-3D)

- (arXiv 2022.06) Visual Transformer for Object **Detection**, [[Paper]](https://arxiv.org/pdf/2206.06323.pdf)

- (arXiv 2022.06) Bringing **Image Scene Structure to Video** via Frame-Clip Consistency of Object Tokens, [[Paper]](https://arxiv.org/pdf/2206.06346.pdf), [[Code]](https://eladb3.github.io/SViT/)

- (arXiv 2022.06) TransVG++: End-to-End **Visual Grounding** with Language Conditioned Vision Transformer, [[Paper]](https://arxiv.org/pdf/2206.06619.pdf)

- (arXiv 2022.06) ReCo: Retrieve and Co-**segment** for **Zero-shot** Transfer, [[Paper]](https://arxiv.org/pdf/2206.07045.pdf), [[Project]](https://www.robots.ox.ac.uk/~vgg/research/reco)

- (arXiv 2022.06) MAREO: MEMORY- AND ATTENTION-BASED **VISUAL REASONING**, [[Paper]](https://arxiv.org/pdf/2206.04928.pdf)

- (arXiv 2022.06) Recurrent Transformer Variational Autoencoders for **Multi-Action Motion Synthesis**, [[Paper]](https://arxiv.org/pdf/2206.06741.pdf)

- (arXiv 2022.06) **Object Scene Representation** Transformer, [[Paper]](https://arxiv.org/pdf/2206.06922.pdf)

- (arXiv 2022.06) Comprehending and Ordering Semantics for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2206.06930.pdf), [[Code]](https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/cosnet)

- (arXiv 2022.06) Exploring **Adversarial Attacks** and **Defenses** in Vision Transformers trained with **DINO**, [[Paper]](https://arxiv.org/pdf/2206.06761.pdf)

- (arXiv 2022.06) **Peripheral** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2206.06801.pdf), [[Code]](http://cvlab.postech.ac.kr/research/PerViT/)

- (arXiv 2022.06) Efficient Decoder-free Object **Detection** with Transformers, [[Paper]](https://arxiv.org/pdf/2206.06829.pdf), [[Code]](https://github.com/Pealing/DFFT)

- (arXiv 2022.06) Prototypical **Contrastive Language Image Pretraining**, [[Paper]](https://arxiv.org/pdf/2206.10996.pdf), [[Code]](https://github.com/megvii-research/protoclip)

- (arXiv 2022.06) SpA-Former: Transformer image **shadow detection and removal** via spatial attention, [[Paper]](https://arxiv.org/pdf/2206.10910.pdf), [[Code]](https://github.com/zhangbaijin/SpA-Former-shadow-removal)

- (arXiv 2022.06) A Unified and Biologically-Plausible **Relational Graph Representation** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2206.11073.pdf)

- (arXiv 2022.06) Can Foundation Models Talk **Causality**? [[Paper]](https://arxiv.org/pdf/2206.10591.pdf)

- (arXiv 2022.06) Learning **Viewpoint-Agnostic** Visual **Representations** by Recovering Tokens in 3D Space, [[Paper]](https://arxiv.org/pdf/2206.11895.pdf), [[Code]](https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html)

- (arXiv 2022.06) MaskViT: **Masked** Visual Pre-Training for **Video Prediction**, [[Paper]](https://arxiv.org/pdf/2206.11894.pdf)

- (arXiv 2022.06) PromptPose: Language **Prompt** Helps **Animal Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2206.11752.pdf)

- (arXiv 2022.06) **Video PreTraining** (VPT): Learning to Act by Watching **Unlabeled** **Online** Videos, [[Paper]](https://arxiv.org/pdf/2206.11795.pdf)

- (arXiv 2022.06) MERLOT Reserve: Neural Script Knowledge through **Vision and Language and Sound**, [[Paper]](https://arxiv.org/pdf/2201.02639.pdf), [[Project]](https://rowanzellers.com/merlotreserve)

- (arXiv 2022.06) Building Spatio-temporal Transformers for **Egocentric 3D Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2206.04785.pdf)

- (arXiv 2022.06) **Position** Labels for **Self-Supervised** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2206.04981.pdf)

- (arXiv 2022.06) Exploring Feature Self-relation for **Self-supervised** Transformer, [[Paper]](https://arxiv.org/pdf/2206.05184.pdf)

- (arXiv 2022.06) Patch-based Object-centric Transformers for Efficient **Video Generation**, [[Paper]](https://arxiv.org/pdf/2206.04003.pdf), [[Code]](https://sites.google.com/view/povt-public)

- (arXiv 2022.06) Sparse Fusion **Mixture-of-Experts** are Domain Generalizable Learners, [[Paper]](https://arxiv.org/pdf/2206.04046.pdf), [[Code]](https://github.com/Luodian/SF-MoE-DG)

- (arXiv 2022.06) VN-Transformer: **Rotation-Equivariant** Attention for Vector Neurons, [[Paper]](https://arxiv.org/pdf/2206.04176.pdf)

- (arXiv 2022.06) **CLIP**-Actor: **Text**-Driven Recommendation and Stylization for **Animating Human Meshes**, [[Paper]](https://arxiv.org/pdf/2206.04382.pdf), [[Code]](https://github.com/Youwang-Kim/CLIP-Actor)

- (arXiv 2022.06) **OOD** Augmentation May Be at Odds with **Open-Set Recognition**, [[Paper]](https://arxiv.org/pdf/2206.04242.pdf)

- (arXiv 2022.06) Draft-and-Revise: Effective **Image Generation** with Contextual RQ-Transformer, [[Paper]](https://arxiv.org/pdf/2206.04452.pdf)

- (arXiv 2022.06) cycle text2face: cycle **text-to-face gan** via transformers, [[Paper]](https://arxiv.org/pdf/2206.04503.pdf)

- (arXiv 2022.06) Efficient and Robust **2D-to-BEV** Representation Learning via Geometry-guided Kernel Transformer, [[Paper]](https://arxiv.org/pdf/2206.04584.pdf), [[Code]](https://github.com/hustvl/GKT)

- (arXiv 2022.06) Transformer based Urdu Handwritten **Text** Optical Character Reader, [[Paper]](https://arxiv.org/pdf/2206.04575.pdf)

- (arXiv 2022.06) **Spatial Entropy Regularization** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2206.04636.pdf)

- (arXiv 2022.06) On Data Scaling in **Masked Image Modeling**, [[Paper]](https://arxiv.org/pdf/2206.04664.pdf)

- (arXiv 2022.06) Extreme **Masking** for Learning Instance and Distributed Visual **Representations**, [[Paper]](https://arxiv.org/pdf/2206.04667.pdf)

- (arXiv 2022.06) GateHUB: Gated History Unit with Background Suppression for **Online Action Detection**, [[Paper]](https://arxiv.org/pdf/2206.04668.pdf)

- (arXiv 2022.06) **Anomaly detection** in surveillance videos using transformer based attention model, [[Paper]](https://arxiv.org/pdf/2206.01524.pdf), [[Code]](https://github.com/kapildeshpande/Anomaly-Detection-in-Surveillance-Videos)

- (arXiv 2022.06) Contra**CLIP**: Interpretable **GAN** generation driven by pairs of contrasting sentences, [[Paper]](https://arxiv.org/pdf/2206.02104.pdf), [[Code]](https://github.com/chi0tzp/ContraCLIP)

- (arXiv 2022.06) EAANet: **Efficient** Attention Augmented Convolutional Networks, [[Paper]](https://arxiv.org/pdf/2206.01821.pdf)

- (arXiv 2022.06) Visual Clues: Bridging **Vision and Language** Foundations for Image Paragraph **Captioning**, [[Paper]](https://arxiv.org/pdf/2206.01843.pdf)

- (arXiv 2022.06) Recurrent **Video Restoration** Transformer with Guided Deformable Attention, [[Paper]](https://arxiv.org/pdf/2206.02146.pdf), [[Code]](https://github.com/JingyunLiang/RVRT)

- (arXiv 2022.06) Rethinking the **Openness** of **CLIP**, [[Paper]](https://arxiv.org/pdf/2206.01986.pdf)

- (arXiv 2022.06) OrdinalCLIP: Learning Rank Prompts for **Language-Guided Ordinal Regression**, [[Paper]](https://arxiv.org/pdf/2206.02338.pdf)

- (arXiv 2022.06) Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel **Video-Language Retrieval**, [[Paper]](https://arxiv.org/pdf/2206.02082.pdf)

- (arXiv 2022.06) CONTRASTIVE GRAPH MULTIMODAL MODEL FOR **TEXT CLASSIFICATION** IN VIDEOS, [[Paper]](https://arxiv.org/pdf/2206.02343.pdf)

- (arXiv 2022.06) Separable **Self-attention** for **Mobile** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2206.02680.pdf), [[Code]](https://github.com/apple/ml-cvnets)

- (arXiv 2022.06) Mask **DINO**: Towards A Unified Transformer-based Framework for Object **Detection** and **Segmentation**, [[Paper]](https://arxiv.org/pdf/2206.02777.pdf), [[Code]](https://github.com/IDEACVR/MaskDINO)

- (arXiv 2022.06) Multimodal Contrastive Learning with LIMoE: the **Language-Image** **Mixture of Experts**, [[Paper]](https://arxiv.org/pdf/2206.02770.pdf)

- (arXiv 2022.06) cViL: Cross-Lingual Training of **Vision-Language** Models using Knowledge Distillation, [[Paper]](https://arxiv.org/pdf/2206.03354.pdf)

- (arXiv 2022.06) **Masked** **Unsupervised** Self-training for Zero-shot Image Classification, [[Paper]](https://arxiv.org/pdf/2206.02967.pdf), [[Code]](https://github.com/salesforce/MUST)

- (arXiv 2022.06) DETR++: Taming Your Multi-Scale **Detection** Transformer, [[Paper]](https://arxiv.org/pdf/2206.02977.pdf)

- (arXiv 2022.06) Structured Context Transformer for Generic **Event Boundary Detection**, [[Paper]](https://arxiv.org/pdf/2206.02985.pdf)

- (arXiv 2022.06) Revealing Single Frame Bias for **Video-and-Language** Learning, [[Paper]](https://arxiv.org/pdf/2206.03428.pdf), [[Code]](https://github.com/jayleicn/singularity)

- (arXiv 2022.06) Cerberus Transformer: Joint **Semantic**, **Affordance** and **Attribute** Parsing, [[Paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Chen_Cerberus_Transformer_Joint_Semantic_Affordance_and_Attribute_Parsing_CVPR_2022_paper.pdf), [[Code]](https://github.com/OPEN-AIR-SUN/Cerberus)

- (arXiv 2022.06) Can **CNNs** Be More **Robust** Than Transformers? [[Paper]](https://arxiv.org/pdf/2206.03452.pdf), [[Code]](https://github.com/UCSC-VLAA/RobustCNN)

- (arXiv 2022.06) Detection Hub: Unifying Object **Detection** Datasets via Query Adaptation on Language Embedding, [[Paper]](https://arxiv.org/pdf/2206.03484.pdf)

- (CVPR 2022) Keypoint Transformer: Solving Joint Identification in Challenging **Hands and Object Interactions** for Accurate **3D Pose** Estimation, [[Paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Hampali_Keypoint_Transformer_Solving_Joint_Identification_in_Challenging_Hands_and_Object_CVPR_2022_paper.pdf)

- (arXiv 2022.06) A-OKVQA: A Benchmark for **Visual Question Answering** using World Knowledge, [[Paper]](https://arxiv.org/pdf/2206.01718.pdf), [[Project]](http://a-okvqa.allenai.org/)

- (arXiv 2022.06) Revisiting the “Video” in **Video-Language** Understanding, [[Paper]](https://arxiv.org/pdf/2206.01720.pdf), [[Project]](https://stanfordvl.github.io/atp-revisit-video-lang/)

- (arXiv 2022.06) Efficient **Self-supervised** Vision Pretraining with Local **Masked** Reconstruction, [[Paper]](https://arxiv.org/pdf/2206.00790.pdf)

- (arXiv 2022.06) Modeling Image Composition for Complex **Scene Generation**, [[Paper]](https://arxiv.org/pdf/2206.00923.pdf), [[Code]](https://github.com/JohnDreamer/TwFA)

- (arXiv 2022.06) Unified Recurrence Modeling for **Video Action Anticipation**, [[Paper]](https://arxiv.org/pdf/2206.01009.pdf)

- (arXiv 2022.06) **Prefix Conditioning** Unifies Language and Label Supervision, [[Paper]](https://arxiv.org/pdf/2206.01125.pdf)

- (arXiv 2022.06) Optimizing Relevance Maps of Vision Transformers Improves **Robustness**, [[Paper]](https://arxiv.org/pdf/2206.01161.pdf), [[Code]](https://github.com/hila-chefer/RobustViT)

- (arXiv 2022.06) VL-BEIT: Generative **Vision-Language** Pretraining, [[Paper]](https://arxiv.org/pdf/2206.01127.pdf), [[Code]](https://github.com/microsoft/unilm)

- (arXiv 2022.06) **Efficient**Former: Vision Transformers at MobileNet Speed, [[Paper]](https://arxiv.org/pdf/2206.01191.pdf), [[Code]](https://github.com/snap-research/EfficientFormer)

- (arXiv 2022.06) REVIVE: Regional Visual Representation Matters in Knowledge-Based **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2206.01201.pdf)

- (arXiv 2022.06) Siamese Image Modeling for **Self-Supervised** Vision Representation Learning, [[Paper]](https://arxiv.org/pdf/2206.01204.pdf)

- (CVPR 2022) Distillation Using Oracle Queries for Transformer-based **Human-Object Interaction Detection**, [[Paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Qu_Distillation_Using_Oracle_Queries_for_Transformer-Based_Human-Object_Interaction_Detection_CVPR_2022_paper.pdf), [[Code]](https://github.com/SherlockHolmes221/DOQ)

- (CVPR 2022) Exploring Structure-aware Transformer over Interaction Proposals for **Human-Object Interaction Detection**, [[Paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Zhang_Exploring_Structure-Aware_Transformer_Over_Interaction_Proposals_for_Human-Object_Interaction_Detection_CVPR_2022_paper.pdf), [[Code]](https://github.com/zyong812/STIP)

- (CVPR 2022) Human **Trajectory Prediction** with Momentary Observation, [[Paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Sun_Human_Trajectory_Prediction_With_Momentary_Observation_CVPR_2022_paper.pdf)

- (arXiv 2022.06) Where are my Neighbors? Exploiting Patches Relations in **Self-Supervised** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2206.00481.pdf)

- (arXiv 2022.06) Unifying Voxel-based Representation with Transformer for **3D Object Detection**, [[Paper]](https://arxiv.org/pdf/2206.00630.pdf), [[Code]](https://github.com/dvlab-research/UVTR)

- (arXiv 2022.06) Extreme **Floorplan Reconstruction** by Structure-Hallucinating Transformer Cascades, [[Paper]](https://arxiv.org/pdf/2206.00645.pdf)

- (arXiv 2022.06) Cross-View Language Modeling: Towards Unified Cross-Lingual **Cross-Modal Pre-training**, [[Paper]](https://arxiv.org/pdf/2206.00621.pdf)

- (arXiv 2022.06) VALHALLA: **Visual Hallucination** for Machine Translation, [[Paper]](https://arxiv.org/pdf/2206.00100.pdf), [[Code]](http://www.svcl.ucsd.edu/projects/valhalla)

- (arXiv 2022.06) Learning Sequential Contexts using Transformer for **3D Hand Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2206.00171.pdf)

- (arXiv 2022.06) CLIP4IDC: **CLIP** for Image Difference **Captioning**, [[Paper]](https://arxiv.org/pdf/2206.00629.pdf), [[Code]](https://github.com/sushizixin/CLIP4IDC)

- (arXiv 2022.06) Cross-domain **Detection** Transformer based on Spatial-aware and Semantic-aware Token Alignment, [[Paper]](https://arxiv.org/pdf/2206.00222.pdf)

- (arXiv 2022.06) Vision **GNN**: An Image is Worth Graph of Nodes, [[Paper]](https://arxiv.org/pdf/2206.00272.pdf), [[Code]](https://github.com/huawei-noah/CV-Backbones)

- (arXiv 2022.06) Weakly-supervised Action Transition Learning for Stochastic Human **Motion Prediction**, [[Paper]](https://arxiv.org/pdf/2205.15608.pdf), [[Code]](https://github.com/wei-mao-2019/WAT)

- (arXiv 2022.06) TubeFormer-**DeepLab**: Video Mask Transformer, [[Paper]](https://arxiv.org/pdf/2205.15361.pdf)

- (arXiv 2022.06) **Video**-based **Human-Object Interaction** Detection from Tubelet Tokens, [[Paper]](https://arxiv.org/pdf/2206.01908.pdf)
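
Many entries in this month's list (OmniMAE, SemMAE, Voxel-MAE, MaskViT, the masked-image-modeling data-scaling study, and others) build on the masked-autoencoder recipe: randomly mask most patch tokens, encode only the visible ones, and reconstruct the masked patches. Below is a minimal, paper-agnostic sketch of the random masking and the masked-patch loss; the 75% mask ratio and per-patch MSE follow the common MAE recipe and are assumptions here, not a reproduction of any specific paper above.

```python
# Minimal sketch of MAE-style random patch masking and reconstruction loss.
import torch


def random_masking(tokens, mask_ratio=0.75):
    """tokens: (batch, num_patches, dim). Keep a random subset of patches."""
    b, n, d = tokens.shape
    n_keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n, device=tokens.device)
    ids_keep = noise.argsort(dim=1)[:, :n_keep]          # random subset per sample
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)                      # 1 = masked, 0 = visible
    return visible, mask, ids_keep


def masked_patch_loss(pred, target, mask):
    """Per-patch MSE averaged over masked patches only; pred/target: (b, n, d)."""
    per_patch = (pred - target).pow(2).mean(dim=-1)      # (batch, num_patches)
    return (per_patch * mask).sum() / mask.sum()
```

In a full pipeline, `visible` would go through the ViT encoder, mask tokens would be re-inserted before a lightweight decoder, and `masked_patch_loss` would be computed against the (typically normalized) pixel patches.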

### 2022.05

- (arXiv 2022.05) HeatER: An Efficient and Unified Network for **Human Reconstruction** via Heatmap-based TransformER, [[Paper]](https://arxiv.org/pdf/2205.15448.pdf)

- (arXiv 2022.05) Robotic **grasp detection** based on Transformer, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2205/2205.15112.pdf)

- (arXiv 2022.05) Multimodal **Masked Autoencoders** Learn Transferable Representations, [[Paper]](https://arxiv.org/pdf/2205.14204.pdf)

- (arXiv 2022.05) Multimodal **Fake News Detection** via **CLIP**-Guided Learning, [[Paper]](https://arxiv.org/pdf/2205.14304.pdf)

- (arXiv 2022.05) WT-MVSNet: Window-based Transformers for **Multi-view Stereo**, [[Paper]](https://arxiv.org/pdf/2205.14319.pdf)

- (arXiv 2022.05) Object-wise **Masked Autoencoders** for **Fast** Pre-training, [[Paper]](https://arxiv.org/pdf/2205.14338.pdf)

- (arXiv 2022.05) A Closer Look at **Self-supervised** **Lightweight** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2205.14443.pdf)

- (arXiv 2022.05) Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2205.14458.pdf)

- (arXiv 2022.05) CY**CLIP**: Cyclic Contrastive **Language-Image** Pretraining, [[Paper]](https://arxiv.org/pdf/2205.14459.pdf), [[Code]](https://github.com/goel-shashank/CyCLIP)

- (arXiv 2022.05) MDMLP: Image Classification from Scratch on Small Datasets with **MLP**, [[Paper]](https://arxiv.org/pdf/2205.14477.pdf), [[Code]](https://github.com/Amoza-Theodore/MDMLP)

- (arXiv 2022.05) SupMAE: **Supervised** **Masked Autoencoders** Are Efficient Vision Learners, [[Paper]](https://arxiv.org/pdf/2205.14540.pdf), [[Code]](https://github.com/cmu-enyac/supmae)

- (arXiv 2022.05) 3D-C2FT: Coarse-to-fine Transformer for **Multi-view 3D Reconstruction**, [[Paper]](https://arxiv.org/pdf/2205.14575.pdf)

- (arXiv 2022.05) Prompt-aligned Gradient for **Prompt** Tuning, [[Paper]](https://arxiv.org/pdf/2205.14865.pdf), [[Code]](https://github.com/BeierZhu/Prompt-align)

- (arXiv 2022.05) **Illumination** Adaptive Transformer, [[Paper]](https://arxiv.org/pdf/2205.14871.pdf), [[Code]](https://github.com/cuiziteng/Illumination-Adaptive-Transformer)

- (arXiv 2022.05) HiViT: **Hierarchical** Vision Transformer Meets **Masked Image Modeling**, [[Paper]](https://arxiv.org/pdf/2205.14949.pdf)

- (arXiv 2022.05) GMML is All you Need, [[Paper]](https://arxiv.org/pdf/2205.14986.pdf), [[Code]](https://github.com/Sara-Ahmed/GMML)

- (arXiv 2022.05) COMPLETEDT: **POINT CLOUD COMPLETION** WITH DENSE AUGMENT INFERENCE TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2205.14999.pdf)

- (arXiv 2022.05) Self-Supervised Pre-training of Vision Transformers for **Dense Prediction** Tasks, [[Paper]](https://arxiv.org/pdf/2205.15173.pdf)

- (arXiv 2022.05) VLUE: A Multi-Task **Benchmark** for Evaluating **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2205.15237.pdf), [[Benchmark]](https://vlue-benchmark.github.io/), [[Code]](https://github.com/MichaelZhouwang/VLUE)

- (arXiv 2022.05) Architecture-Agnostic **Masked Image Modeling** – From ViT back to CNN, [[Paper]](https://arxiv.org/pdf/2205.13943.pdf)

- (arXiv 2022.05) **Contrastive** Learning Rivals **Masked Image Modeling** in Fine-tuning via Feature Distillation, [[Paper]](https://arxiv.org/pdf/2205.14141.pdf), [[Code]](https://github.com/SwinTransformer/Feature-Distillation)

- (arXiv 2022.05) GIT: A Generative **Image-to-text** Transformer for Vision and Language, [[Paper]](https://arxiv.org/pdf/2205.14100.pdf)

- (arXiv 2022.05) 3DILG: Irregular Latent Grids for **3D Generative Modeling**, [[Paper]](https://arxiv.org/pdf/2205.13914.pdf)

- (arXiv 2022.05) Simple **Unsupervised** **Object-Centric Learning** for Complex and Naturalistic Videos, [[Paper]](https://arxiv.org/pdf/2205.14065.pdf), [[Code]](https://sites.google.com/view/slot-transformer-for-videos)

- (arXiv 2022.05) Future Transformer for Long-term **Action Anticipation**, [[Paper]](https://arxiv.org/pdf/2205.14022.pdf), [[Project]](http://cvlab.postech.ac.kr/research/FUTR)

- (arXiv 2022.05) X-ViT: High Performance **Linear** Vision Transformer without Softmax, [[Paper]](https://arxiv.org/pdf/2205.13805.pdf)

- (arXiv 2022.05) **Knowledge Distillation** via the Target-aware Transformer, [[Paper]](https://arxiv.org/pdf/2205.10793.pdf)

- (arXiv 2022.05) Dynamic **Query** Selection for Fast Visual Perceiver, [[Paper]](https://arxiv.org/pdf/2205.10873.pdf)

- (arXiv 2022.05) MonoFormer: Towards Generalization of self-supervised monocular **depth** estimation with Transformers, [[Paper]](https://arxiv.org/pdf/2205.11083.pdf)

- (arXiv 2022.05) PEVL: Position-enhanced Pre-training and Prompt Tuning for **Vision-language** Models, [[Paper]](https://arxiv.org/pdf/2205.11169.pdf), [[Code]](https://github.com/thunlp/PEVL)

- (arXiv 2022.05) Supporting **Vision-Language** Model Inference with Causality-pruning Knowledge **Prompt**, [[Paper]](https://arxiv.org/pdf/2205.11100.pdf)

- (arXiv 2022.05) Super Vision Transformer, [[Paper]](https://arxiv.org/pdf/2205.11397.pdf), [[Code]](https://github.com/lmbxmu/SuperViT)

- (arXiv 2022.05) mPLUG: Effective and Efficient **Vision-Language** Learning by Cross-modal Skip-connections, [[Paper]](https://arxiv.org/pdf/2205.12005.pdf)

- (arXiv 2022.05) **VQA**-GNN: **Reasoning** with Multimodal Semantic Graph for Visual Question Answering, [[Paper]](https://arxiv.org/pdf/2205.11501.pdf)

- (arXiv 2022.05) UMSNet: An Universal Multi-sensor Network for Human **Activity Recognition**, [[Paper]](https://arxiv.org/pdf/2205.11756.pdf)

- (arXiv 2022.05) **Privacy**-Preserving Image **Classification** Using Vision Transformer, [[Paper]](https://arxiv.org/pdf/2205.12041.pdf)

- (arXiv 2022.05) HiVLP: Hierarchical **Vision-Language** Pre-Training for Fast Image-Text Retrieval, [[Paper]](https://arxiv.org/pdf/2205.12105.pdf)

- (arXiv 2022.05) ASSET: Autoregressive **Semantic Scene Editing** with Transformers at High Resolutions, [[Paper]](https://arxiv.org/pdf/2205.12231.pdf), [[Code]](https://github.com/DifanLiu/ASSET)

- (arXiv 2022.05) HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent **Trajectory Prediction** via Scene Encoding, [[Paper]](https://arxiv.org/pdf/2205.09753.pdf)

- (arXiv 2022.05) **Mask**-guided Vision Transformer (MG-ViT) for **Few-Shot** Learning, [[Paper]](https://arxiv.org/pdf/2205.09995.pdf)

- (arXiv 2022.05) Degradation-Aware Unfolding Half-Shuffle Transformer for **Spectral Compressive Imaging**, [[Paper]](https://arxiv.org/pdf/2205.10102.pdf)

- (arXiv 2022.05) Uniform Masking: Enabling **MAE** Pre-training for **Pyramid**-based Vision Transformers with Locality, [[Paper]](https://arxiv.org/pdf/2205.10063.pdf), [[Code]](https://github.com/implus/UM-MAE)

- (arXiv 2022.05) Visual **Concepts** Tokenization, [[Paper]](https://arxiv.org/pdf/2205.10093.pdf)

- (arXiv 2022.05) MSTRIQ: No Reference **Image Quality Assessment** Based on Swin Transformer with Multi-Stage Fusion, [[Paper]](https://arxiv.org/pdf/2205.10101.pdf)

- (arXiv 2022.05) CogVideo: Large-scale Pretraining for **Text-to-Video** Generation via Transformers, [[Paper]](https://github.com/THUDM/CogVideo/blob/main/paper/CogVideo-arxiv.pdf), [[Code]](https://github.com/THUDM/CogVideo)

- (arXiv 2022.05) Evidence for **Hypodescent** in Visual Semantic AI, [[Paper]](https://arxiv.org/pdf/2205.10764.pdf)

- (arXiv 2022.05) Boosting Camouflaged Object **Detection** with Dual-Task Interactive Transformer, [[Paper]](https://arxiv.org/pdf/2205.10579.pdf), [[Code]](https://github.com/liuzywen/COD)

- (arXiv 2022.05) muNet: Evolving Pretrained Deep Neural Networks into Scalable **Auto-tuning Multitask Systems**, [[Paper]](https://arxiv.org/pdf/2205.10937.pdf)

- (arXiv 2022.05) Large Language Models are **Zero-Shot Reasoners**, [[Paper]](https://arxiv.org/pdf/2205.11916.pdf)

- (arXiv 2022.05) AdaptFormer: **Adapting** Vision Transformers for **Scalable** Visual Recognition, [[Paper]](https://arxiv.org/pdf/2205.13535.pdf), [[Code]](http://www.shoufachen.com/adaptformer-page)

- (arXiv 2022.05) **Green** Hierarchical Vision Transformer for **Masked** Image Modeling, [[Paper]](https://arxiv.org/pdf/2205.13515.pdf), [[Code]](https://github.com/LayneH/GreenMIM)

- (arXiv 2022.05) Efficient U-Transformer with Boundary-Aware Loss for **Action Segmentation**, [[Paper]](https://arxiv.org/pdf/2205.13425.pdf)

- (arXiv 2022.05) Cross-Architecture **Self-supervised Video** Representation Learning, [[Paper]](https://arxiv.org/pdf/2205.13313.pdf), [[Code]](https://github.com/guoshengcv/CACL)

- (arXiv 2022.05) **Prompt**-based Learning for Unpaired Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2205.13125.pdf)

- (arXiv 2022.05) MixMIM: **Mixed** and **Masked** Image Modeling for **Efficient** Visual Representation Learning, [[Paper]](https://arxiv.org/pdf/2205.13137.pdf), [[Code]](https://github.com/Sense-X/MixMIM)

- (arXiv 2022.05) **Fast** Vision Transformers with HiLo **Attention**, [[Paper]](https://arxiv.org/pdf/2205.13213.pdf), [[Code]](https://github.com/zip-group/LITv2)

- (arXiv 2022.05) Fine-grained Image **Captioning** with **CLIP** Reward, [[Paper]](https://arxiv.org/pdf/2205.13115.pdf), [[Code]](https://github.com/j-min/CLIP-Caption-Reward)

- (arXiv 2022.05) Mutual Information Divergence: A Unified Metric for Multimodal **Generative Models**, [[Paper]](https://arxiv.org/pdf/2205.13445.pdf)

- (arXiv 2022.05) MoCoViT: **Mobile** **Convolutional** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2205.12635.pdf)

- (arXiv 2022.05) AO2-DETR: Arbitrary-Oriented Object **Detection** Transformer, [[Paper]](https://arxiv.org/pdf/2205.12785.pdf)

- (arXiv 2022.05) **Inception** Transformer, [[Paper]](https://arxiv.org/pdf/2205.12956.pdf), [[Code]](https://github.com/sail-sg/iFormer)

- (arXiv 2022.05) VTP: Volumetric Transformer for Multi-view Multi-person **3D Pose** Estimation, [[Paper]](https://arxiv.org/pdf/2205.12602.pdf)

- (arXiv 2022.05) UViM: A **Unified Modeling** Approach for Vision with Learned Guiding Codes, [[Paper]](https://arxiv.org/pdf/2205.10337.pdf)

- (arXiv 2022.05) Language Models with Image Descriptors are Strong Few-Shot **Video-Language** Learners, [[Paper]](https://arxiv.org/pdf/2205.10747.pdf), [[Code]](https://github.com/MikeWangWZHL/VidIL)

- (arXiv 2022.05) **Training** Vision-Language Transformers from **Captions** Alone, [[Paper]](https://arxiv.org/pdf/2205.09256.pdf), [[Code]](https://github.com/guilk/VLC)

- (arXiv 2022.05) **Voxel**-informed **Language** Grounding, [[Paper]](https://arxiv.org/pdf/2205.09710.pdf), [[Code]](https://github.com/rcorona/voxel_informed_language_grounding)

- (arXiv 2022.05) Cross-Enhancement Transformer for **Action Segmentation**, [[Paper]](https://arxiv.org/pdf/2205.09445.pdf)

- (arXiv 2022.05) TRT-ViT: **TensorRT**-oriented Vision Transformer, [[Paper]](https://arxiv.org/pdf/2205.09579.pdf)

- (arXiv 2022.05) Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object **Detection**, [[Paper]](https://arxiv.org/pdf/2205.09613.pdf)

- (arXiv 2022.05) A graph-transformer for whole slide image **classification**, [[Paper]](https://arxiv.org/pdf/2205.09671.pdf)

- (arXiv 2022.05) VNT-Net: **Rotational Invariant** Vector Neuron Transformers, [[Paper]](https://arxiv.org/pdf/2205.09690.pdf)

- (arXiv 2022.05) **Masked** Image Modeling with Denoising **Contrast**, [[Paper]](https://arxiv.org/pdf/2205.09616.pdf)

- (arXiv 2022.05) Cross-subject **Action Unit Detection** with Meta Learning and Transformer-based Relation Modeling, [[Paper]](https://arxiv.org/pdf/2205.08787.pdf)

- (arXiv 2022.05) **Masked Autoencoders** As Spatiotemporal Learners, [[Paper]](https://arxiv.org/pdf/2205.09113.pdf)

- (arXiv 2022.05) BodyMap: Learning Full-**Body** Dense **Correspondence** Map, [[Paper]](https://arxiv.org/pdf/2205.09111.pdf), [[Code]](https://nsarafianos.github.io/bodymap)

- (arXiv 2022.05) Unraveling **Attention** via Convex Duality: Analysis and Interpretations of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2205.08078.pdf)

- (arXiv 2022.05) Avatar**CLIP**: Zero-Shot Text-Driven Generation and Animation of 3D **Avatars**, [[Paper]](https://arxiv.org/pdf/2205.08535.pdf)

- (arXiv 2022.05) Vision Transformer Adapter for **Dense Predictions**, [[Paper]](https://arxiv.org/pdf/2205.08534.pdf), [[Code]](https://github.com/czczup/ViT-Adapter)

- (arXiv 2022.05) Demo: Real-Time **Semantic Communications** with a Vision Transformer, [[Paper]](https://arxiv.org/pdf/2205.03886.pdf)

- (arXiv 2022.05) MulT: An End-to-End **Multitask** Learning Transformer, [[Paper]](https://arxiv.org/pdf/2205.08303.pdf), [[Code]](https://ivrl.github.io/MulT/)

- (arXiv 2022.05) A **CLIP**-Hitchhiker’s Guide to Long **Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2205.08508.pdf)

- (arXiv 2022.05) Video **Frame Interpolation** with Transformer, [[Paper]](https://arxiv.org/pdf/2205.07230.pdf), [[Code]](https://github.com/dvlab-research/VFIformer)

- (arXiv 2022.05) Dense residual Transformer for Image **Denoising**, [[Paper]](https://arxiv.org/pdf/2205.06944.pdf)

- (arXiv 2022.05) Learning Lip-Based **Audio-Visual** Speaker Embeddings with AV-HuBERT, [[Paper]](https://arxiv.org/pdf/2205.07180.pdf)

- (arXiv 2022.05) **Robot Cooking** with Stir-fry: Bimanual Non-prehensile Manipulation of Semi-fluid Objects, [[Paper]](https://arxiv.org/pdf/2205.05960.pdf)

- (arXiv 2022.05) Entity-aware and Motion-aware Transformers for Language-driven **Action Localization** in Videos, [[Paper]](https://arxiv.org/pdf/2205.05854.pdf), [[Code]](https://github.com/shuoyang129/EAMAT)

- (arXiv 2022.05) Learning to **Retrieve Videos** by Asking Questions, [[Paper]](https://arxiv.org/pdf/2205.05739.pdf)

- (arXiv 2022.05) One Model, **Multiple Modalities**: A Sparsely Activated Approach for Text, Sound, Image, Video and Code, [[Paper]](https://arxiv.org/pdf/2205.06126.pdf)

- (arXiv 2022.05) Simple Open-Vocabulary Object **Detection** with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2205.06230.pdf), [[Code]](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)

- (arXiv 2022.05) AggPose: Deep Aggregation Vision Transformer for Infant **Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2205.05277.pdf), [[Code]](https://github.com/SZAR-LAB/AggPose)

- (arXiv 2022.05) An Empirical Study of Self-supervised Learning Approaches for Object **Detection** with Transformers, [[Paper]](https://arxiv.org/pdf/2205.05543.pdf), [[Code-DETR]](https://github.com/gokulkarthik/detr), [[Code-Deform-DETR]](https://github.com/gokulkarthik/Deformable-DETR)

- (arXiv 2022.05) Reduce Information Loss in Transformers for Pluralistic Image **Inpainting**, [[Paper]](https://arxiv.org/pdf/2205.05076.pdf), [[Code]](https://github.com/liuqk3/PUT)

- (arXiv 2022.05) Transformer-based Cross-Modal **Recipe** Embeddings with Large Batch Training, [[Paper]](https://arxiv.org/pdf/2205.04948.pdf)

- (arXiv 2022.05) Spatio-Temporal Transformer for Dynamic **Facial Expression Recognition** in the Wild, [[Paper]](https://arxiv.org/pdf/2205.04749.pdf)

- (arXiv 2022.05) Generalizable **Task Planning** through Representation Pretraining, [[Paper]](https://arxiv.org/pdf/2205.07993.pdf), [[Project]](https://sites.google.com/view/gentp)

- (arXiv 2022.05) EdgeViTs: Competing **Light-weight** CNNs on **Mobile Devices** with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2205.03436.pdf)

- (arXiv 2022.05) Activating More Pixels in Image **Super-Resolution** Transformer, [[Paper]](https://arxiv.org/pdf/2205.04437.pdf), [[Code]](https://github.com/chxy95/HAT)

- (arXiv 2022.05) Row-wise **Accelerator** for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2205.03998.pdf)

- (arXiv 2022.05) SparseTT: Visual **Tracking** with Sparse Transformers, [[Paper]](https://arxiv.org/pdf/2205.03776.pdf), [[Code]](https://github.com/fzh0917/SparseTT)

- (arXiv 2022.05) RoViST: Learning Robust Metrics for **Visual Storytelling**, [[Paper]](https://arxiv.org/pdf/2205.03774.pdf), [[Code]](https://github.com/usydnlp/rovist)

- (arXiv 2022.05) Beyond Bounding Box: Multimodal Knowledge Learning for Object **Detection**, [[Paper]](https://arxiv.org/pdf/2205.04072.pdf)

- (arXiv 2022.05) Multilevel Hierarchical Network with Multiscale Sampling for **Video Question Answering**, [[Paper]](https://arxiv.org/pdf/2205.04061.pdf)

- (arXiv 2022.05) Incremental-DETR: Incremental Few-Shot Object **Detection** via Self-Supervised Learning, [[Paper]](https://arxiv.org/pdf/2205.04042.pdf)

- (arXiv 2022.05) Conv**MAE**: Masked **Convolution** Meets Masked Autoencoders, [[Paper]](https://arxiv.org/pdf/2205.03892.pdf), [[Code]](https://github.com/Alpha-VL/ConvMAE)

- (arXiv 2022.05) Cross-lingual Adaptation for **Recipe Retrieval** with Mixup, [[Paper]](https://arxiv.org/pdf/2205.03891.pdf)

- (arXiv 2022.05) Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A **Vision-Language** Framework, [[Paper]](https://arxiv.org/pdf/2205.03860.pdf)

- (arXiv 2022.05) Transformer **Tracking** with Cyclic Shifting Window Attention, [[Paper]](https://arxiv.org/pdf/2205.03806.pdf), [[Code]](https://github.com/SkyeSong38/CSWinTT)

- (arXiv 2022.05) Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2205.04363.pdf)

- (arXiv 2022.05) Prompt Distribution Learning, [[Paper]](https://arxiv.org/pdf/2205.03340.pdf)

- (arXiv 2022.05) CLIP-CLOP: **CLIP**-Guided **Collage** and **Photomontage**, [[Paper]](https://arxiv.org/pdf/2205.03146.pdf)

- (arXiv 2022.05) Dual-Level Decoupled Transformer for **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2205.03039.pdf)

- (arXiv 2022.05) Declaration-based Prompt Tuning for **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2205.02456.pdf), [[Code]](https://github.com/CCIIPLab/DPT)

- (arXiv 2022.05) P^3IV: Probabilistic **Procedure Planning** from **Instructional Videos** with Weak Supervision, [[Paper]](https://arxiv.org/pdf/2205.02300.pdf)

- (arXiv 2022.05) Language Models Can See: Plugging **Visual** Controls in **Text Generation**, [[Paper]](https://arxiv.org/pdf/2205.02655.pdf), [[Code]](https://github.com/yxuansu/MAGIC)

- (arXiv 2022.05) YOLOPose: Transformer-based Multi-Object **6D Pose Estimation** using Keypoint Regression, [[Paper]](https://arxiv.org/pdf/2205.02536.pdf)

- (arXiv 2022.05) Cross-view Transformers for real-time Map-view **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2205.02833.pdf), [[Code]](https://github.com/bradyz/cross_view_transformers)

- (arXiv 2022.05) i-Code: An Integrative and Composable **Multimodal** Learning Framework, [[Paper]](https://arxiv.org/pdf/2205.01818.pdf)

- (arXiv 2022.05) **Visual Commonsense** in Pretrained Unimodal and Multimodal Models, [[Paper]](https://arxiv.org/pdf/2205.01850.pdf), [[Project]](https://github.com/ChenyuHeidiZhang/VL-commonsense)

- (arXiv 2022.05) Dual Cross-Attention Learning for Fine-Grained Visual **Categorization** and Object **Re-Identification**, [[Paper]](https://arxiv.org/pdf/2205.02151.pdf)

- (arXiv 2022.05) RecipeSnap - a lightweight **image to recipe** model, [[Paper]](https://arxiv.org/pdf/2205.02141.pdf), [[Code]](https://github.com/jianfa/RecipeSnap-a-lightweight-image-to-recipe-model.git)

- (arXiv 2022.05) CoCa: Contrastive Captioners are **Image-Text** Foundation Models, [[Paper]](https://arxiv.org/pdf/2205.01917.pdf)
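
  CoCa pairs a captioning loss with a CLIP-style contrastive objective. Below is a minimal sketch of only the generic symmetric image-text contrastive (InfoNCE) term in plain PyTorch; the temperature value and tensor shapes are assumptions, and this is not CoCa's implementation.

  ```python
  import torch
  import torch.nn.functional as F

  def clip_style_contrastive_loss(img_emb: torch.Tensor,
                                  txt_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
      """Symmetric InfoNCE over a batch of paired image/text embeddings.

      img_emb, txt_emb: (B, D). The i-th image and i-th text are positives;
      all other pairs in the batch act as negatives.
      """
      img_emb = F.normalize(img_emb, dim=-1)
      txt_emb = F.normalize(txt_emb, dim=-1)

      logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
      targets = torch.arange(img_emb.size(0), device=img_emb.device)

      loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
      loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
      return 0.5 * (loss_i2t + loss_t2i)

  loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
  ```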

- (arXiv 2022.05) Data Determines Distributional **Robustness** in Contrastive Language Image Pre-training (**CLIP**), [[Paper]](https://arxiv.org/pdf/2205.01397.pdf)

- (arXiv 2022.05) Cross-modal Representation Learning for **Zero-shot Action Recognition**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2205/2205.01657.pdf), [[Code]](https://github.com/microsoft/ResT)

- (arXiv 2022.05) Cross-Domain Object **Detection** with Mean-Teacher Transformer, [[Paper]](https://arxiv.org/pdf/2205.01643.pdf)

- (arXiv 2022.05) Better plain ViT baselines for **ImageNet-1k**, [[Paper]](https://arxiv.org/pdf/2205.01580.pdf), [[Code]](https://github.com/google-research/big_vision)

- (arXiv 2022.05) Reinforced Swin-Convs Transformer for **Underwater Image Enhancement**, [[Paper]](https://arxiv.org/pdf/2205.00434.pdf)

- (arXiv 2022.05) UTC: A Unified Transformer with Inter-Task Contrastive Learning for **Visual Dialog**, [[Paper]](https://arxiv.org/pdf/2205.00423.pdf)

- (arXiv 2022.05) Answer-Me: Multi-Task Open-Vocabulary **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2205.00949.pdf)

- (arXiv 2022.05) Center**CLIP**: Token Clustering for Efficient **Text-Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2205.00823.pdf), [[Code]](https://github.com/mzhaoshuai/CenterCLIP)

- (arXiv 2022.05) Arbitrary Shape **Text Detection** via Boundary Transformer, [[Paper]](https://arxiv.org/pdf/2205.05320.pdf), [[Code]](https://github.com/GXYM/TextBPN-Puls-Plus)

- (arXiv 2022.05) HULC: **3D Human Motion Capture** with Pose Manifold Sampling and Dense Contact Guidance, [[Paper]](https://arxiv.org/pdf/2205.05677.pdf), [[Project]](https://vcai.mpi-inf.mpg.de/projects/HULC)

### 2022.04

- (arXiv 2022.04) Learn to Understand Negation in **Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2205.00132.pdf)

- (arXiv 2022.04) LayoutBERT: Masked Language **Layout** Model for Object Insertion, [[Paper]](https://arxiv.org/pdf/2205.00347.pdf)

- (arXiv 2022.04) Improving **Visual Grounding** with Visual-Linguistic Verification and Iterative Reasoning, [[Paper]](https://arxiv.org/pdf/2205.00272.pdf), [[Code]](https://github.com/yangli18/VLTVG)

- (arXiv 2022.04) Coarse-to-Fine **Video Denoising** with Dual-Stage Spatial-Channel Transformer, [[Paper]](https://arxiv.org/pdf/2205.00214.pdf)

- (arXiv 2022.04) SideRT: A Real-time Pure Transformer Architecture for Single Image **Depth Estimation**, [[Paper]](https://arxiv.org/pdf/2204.13892.pdf)

- (arXiv 2022.04) Where in the World is this Image? Transformer-based **Geo-localization** in the Wild, [[Paper]](https://arxiv.org/pdf/2204.13861.pdf)

- (arXiv 2022.04) **Depth Estimation** with Simplified Transformer, [[Paper]](https://arxiv.org/pdf/2204.13791.pdf)

- (arXiv 2022.04) A very preliminary **analysis** of **DALL-E 2**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2204/2204.13807.pdf)

- (arXiv 2022.04) CogView2: Faster and Better **Text-to-Image Generation** via Hierarchical Transformers, [[Paper]](https://arxiv.org/pdf/2204.14217.pdf), [[Code]](https://github.com/THUDM/CogView2)

- (arXiv 2022.04) **CLIP**-Art: Contrastive Pre-training for **Fine-Grained Art Classification**, [[Paper]](https://arxiv.org/pdf/2204.14244.pdf), [[Code]](https://github.com/KeremTurgutlu/clip_art)

- (arXiv 2022.04) TEMOS: **Generating** diverse human **motions** from textual descriptions, [[Paper]](https://arxiv.org/pdf/2204.14109.pdf), [[Project]](https://imagine.enpc.fr/~petrovim/temos)

- (arXiv 2022.04) PyramidCLIP: Hierarchical Feature Alignment for **Vision-language** Model Pretraining, [[Paper]](https://arxiv.org/pdf/2204.14095.pdf)

- (arXiv 2022.04) Symmetric Transformer-based Network for **Unsupervised Image Registration**, [[Paper]](https://arxiv.org/pdf/2204.13575.pdf), [[Code]](https://github.com/MingR-Ma/SymTrans)

- (arXiv 2022.04) Tragedy Plus Time: Capturing **Unintended Human Activities** from Weakly-labeled Videos, [[Paper]](https://arxiv.org/pdf/2204.13548.pdf), [[Code]](https://asu-apg.github.io/TragedyPlusTime)

- (arXiv 2022.04) CapOnImage: Context-driven Dense-**Captioning** on Image, [[Paper]](https://arxiv.org/pdf/2204.12974.pdf)

- (arXiv 2022.04) Self-Supervised Learning of Object Parts for Semantic **Segmentation**, [[Paper]](https://arxiv.org/pdf/2204.13101.pdf), [[Code]](https://github.com/MkuuWaUjinga/leopart)

- (arXiv 2022.04) DearKD: Data-**Efficient** Early **Knowledge Distillation** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2204.12997.pdf)

- (arXiv 2022.04) CATrans: Context and Affinity Transformer for Few-Shot **Segmentation**, [[Paper]](https://arxiv.org/pdf/2204.12817.pdf)

- (arXiv 2022.04) Self-Driving Car **Steering Angle Prediction**: Let Transformer Be a Car Again, [[Paper]](https://arxiv.org/pdf/2204.12748.pdf), [[Code]](https://github.com/chingisooinar/AI)

- (arXiv 2022.04) ClothFormer: Taming Video **Virtual Try-on** in All Module, [[Paper]](https://arxiv.org/pdf/2204.12151.pdf)

- (arXiv 2022.04) Deeper Insights into ViTs **Robustness** towards Common Corruptions, [[Paper]](https://arxiv.org/pdf/2204.12143.pdf)

- (arXiv 2022.04) VITPOSE: SIMPLE VISION TRANSFORMER BASELINES FOR HUMAN **POSE ESTIMATION**, [[Paper]](https://arxiv.org/pdf/2204.12484.pdf), [[Code]](https://github.com/ViTAE-Transformer/ViTPose)

- (arXiv 2022.04) Understanding The **Robustness** in Vision Transformers, [[Paper]](https://arxiv.org/pdf/2204.12451.pdf), [[Code]](https://github.com/NVlabs/FAN)

- (arXiv 2022.04) MILES: Visual BERT Pre-training with Injected Language Semantics for **Video-text Retrieval**, [[Paper]](https://arxiv.org/pdf/2204.12408.pdf)

- (arXiv 2022.04) Contrastive Language-Action Pre-training for **Temporal Localization**, [[Paper]](https://arxiv.org/pdf/2204.12293.pdf)

- (arXiv 2022.04) Boosting **Adversarial Transferability** of **MLP**-Mixer, [[Paper]](https://arxiv.org/pdf/2204.12204.pdf)

- (arXiv 2022.04) Adaptive **Split-Fusion** Transformer, [[Paper]](https://arxiv.org/pdf/2204.12196.pdf), [[Code]](https://github.com/szx503045266/ASF-former)

- (arXiv 2022.04) Can Foundation Models Perform Zero-Shot Task Specification For **Robot Manipulation**? [[Paper]](https://arxiv.org/pdf/2204.11134.pdf), [[Project]](https://sites.google.com/view/zestproject)

- (arXiv 2022.04) RELVIT: CONCEPT-GUIDED VISION TRANSFORMER FOR **VISUAL RELATIONAL REASONING**, [[Paper]](https://arxiv.org/pdf/2204.11167.pdf)

- (arXiv 2022.04) VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame Filtration for Automatic **Retail Checkout**, [[Paper]](https://arxiv.org/pdf/2204.11024.pdf), [[Code]](https://github.com/istiakshihab/automated-retail-checkout-aicity22)

- (arXiv 2022.04) **CLIP**-DISSECT: AUTOMATIC **DESCRIPTION** OF **NEURON** REPRESENTATIONS IN DEEP VISION NETWORKS, [[Paper]](https://arxiv.org/pdf/2204.10965.pdf)

- (arXiv 2022.04) Unsupervised Hierarchical **Semantic Segmentation** with Multiview Cosegmentation and Clustering Transformers, [[Paper]](https://arxiv.org/pdf/2204.11432.pdf)

- (arXiv 2022.04) SwinFuse: A Residual Swin Transformer Fusion Network for **Infrared and Visible Images**, [[Paper]](https://arxiv.org/pdf/2204.11436.pdf), [[Code]](https://github.com/Zhishe-Wang/SwinFuse)

- (arXiv 2022.04) OCFormer: One-Class Transformer Network for **Image Classification**, [[Paper]](https://arxiv.org/pdf/2204.11449.pdf)

- (arXiv 2022.04) DRT: A Lightweight Single Image **Deraining** Recursive Transformer, [[Paper]](https://arxiv.org/pdf/2204.11385.pdf), [[Code]](https://github.com/YC-Liang/DRT)

- (arXiv 2022.04) Hypergraph Transformer: Weakly-Supervised Multi-hop Reasoning for **Knowledge-based Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2204.10448.pdf), [[Code]](https://github.com/yujungheo/kbvqa-public)

- (arXiv 2022.04) ParkPredict+: **Multimodal Intent** and **Motion Prediction** for **Vehicles** in Parking Lots with CNN and Transformer, [[Paper]](https://arxiv.org/pdf/2204.10777.pdf)

- (arXiv 2022.04) iCAR: Bridging Image Classification and **Image-text** Alignment for Visual Recognition, [[Paper]](https://arxiv.org/pdf/2204.10760.pdf), [[Code]](https://github.com/weiyx16/iCAR)

- (arXiv 2022.04) DIVERSE INSTANCE DISCOVERY: VISION-TRANSFORMER FOR INSTANCE-AWARE **MULTI-LABEL IMAGE RECOGNITION**, [[Paper]](https://arxiv.org/pdf/2204.10731.pdf)

- (arXiv 2022.04) Spatiality-guided Transformer for 3D Dense **Captioning** on **Point Clouds**, [[Paper]](https://arxiv.org/pdf/2204.10688.pdf), [[Code]](https://spacap3d.github.io/)

- (arXiv 2022.04) DFAM-DETR: Deformable feature based attention mechanism DETR on slender object **detection**, [[Paper]](https://arxiv.org/pdf/2204.10667.pdf)

- (arXiv 2022.04) NFormer: Robust **Person Re-identification** with Neighbor Transformer, [[Paper]](https://arxiv.org/pdf/2204.09331.pdf), [[Code]](https://github.com/haochenheheda/NFormer)

- (arXiv 2022.04) **Video Moment Retrieval** from Text Queries via Single Frame Annotation, [[Paper]](https://arxiv.org/pdf/2204.09409.pdf)

- (arXiv 2022.04) GIMO: Gaze-Informed **Human Motion Prediction** in Context, [[Paper]](https://arxiv.org/pdf/2204.09443.pdf)

- (arXiv 2022.04) VQGAN-CLIP: Open Domain **Image Generation and Editing** with Natural Language Guidance, [[Paper]](https://arxiv.org/pdf/2204.08583.pdf)

- (arXiv 2022.04) Sim-2-Sim Transfer for **Vision-and-Language Navigation** in Continuous Environments, [[Paper]](https://arxiv.org/pdf/2204.09667.pdf)

- (arXiv 2022.04) Not All Tokens Are Equal: **Human-centric** Visual **Analysis** via Token Clustering Transformer, [[Paper]](https://arxiv.org/pdf/2204.08680.pdf), [[Code]](https://github.com/zengwang430521/TCFormer.git)

- (arXiv 2022.04) **Multimodal** Token Fusion for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2204.08721.pdf)

- (arXiv 2022.04) Self-Calibrated Efficient Transformer for Lightweight **Super-Resolution**, [[Paper]](https://arxiv.org/pdf/2204.08913.pdf), [[Code]](https://github.com/AlexZou14/SCET)

- (arXiv 2022.04) Searching Intrinsic **Dimensions** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2204.07722.pdf)

- (arXiv 2022.04) Towards **Lightweight** Transformer via Group-wise Transformation for **Vision-and-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2204.07780.pdf)

- (arXiv 2022.04) Multimodal Few-Shot Object **Detection** with Meta-Learning Based Cross-Modal Prompting, [[Paper]](https://arxiv.org/pdf/2204.07841.pdf)

- (arXiv 2022.04) Multi-Frame Self-Supervised **Depth** with Transformers, [[Paper]](https://arxiv.org/pdf/2204.07616.pdf), [[Code]](https://sites.google.com/tri.global/depthformer)

- (arXiv 2022.04) MST++: Multi-stage Spectral-wise Transformer for Efficient **Spectral Reconstruction**, [[Paper]](https://arxiv.org/pdf/2204.07908.pdf), [[Code]](https://github.com/caiyuanhao1998/MST-plus-plus)

- (arXiv 2022.04) Vision-Language Pre-Training for Multimodal Aspect-Based **Sentiment Analysis**, [[Paper]](https://arxiv.org/pdf/2204.07955.pdf), [[Code]](https://github.com/NUSTM/VLP-MABSA)

- (arXiv 2022.04) An Extendable, Efficient and Effective Transformer-based Object **Detector**, [[Paper]](https://arxiv.org/pdf/2204.07962.pdf), [[Code]](https://github.com/naver-ai/vidt)

- (arXiv 2022.04) VDTR: **Video Deblurring** with Transformer, [[Paper]](https://arxiv.org/pdf/2204.08023.pdf), [[Code]](https://github.com/ljzycmd/VDTR)

- (arXiv 2022.04) BSRT: Improving Burst **Super-Resolution** with Swin Transformer and Flow-Guided Deformable Alignment, [[Paper]](https://arxiv.org/pdf/2204.08332.pdf), [[Code]](https://github.com/Algolzw/BSRT)

- (arXiv 2022.04) Temporally Efficient Vision Transformer for **Video Instance Segmentation**, [[Paper]](https://arxiv.org/pdf/2204.08412.pdf), [[Code]](https://github.com/hustvl/TeViT)

- (arXiv 2022.04) VSA: Learning Varied-Size Window **Attention** in Vision Transformers, [[Paper]](https://arxiv.org/pdf/2204.08446.pdf), [[Code]](https://github.com/ViTAE-Transformer/ViTAE-VSA)
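
  VSA learns varied-size attention windows on top of the fixed-window scheme used by Swin-style models. The sketch below shows only the generic fixed-size window partition/reverse step in PyTorch; the 7x7 window and function names are assumptions, and VSA's learned window sizing is not reproduced here.

  ```python
  import torch

  def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
      """Split a feature map (B, H, W, C) into non-overlapping ws x ws windows.

      Returns (num_windows * B, ws*ws, C); attention is then computed
      independently inside each window.
      """
      B, H, W, C = x.shape
      x = x.view(B, H // ws, ws, W // ws, ws, C)
      x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
      return x.view(-1, ws * ws, C)

  def window_reverse(windows: torch.Tensor, ws: int, H: int, W: int) -> torch.Tensor:
      """Inverse of window_partition: back to (B, H, W, C)."""
      B = windows.shape[0] // ((H // ws) * (W // ws))
      x = windows.view(B, H // ws, W // ws, ws, ws, -1)
      x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
      return x.view(B, H, W, -1)

  feat = torch.randn(2, 56, 56, 96)        # e.g. an early-stage feature map
  wins = window_partition(feat, ws=7)      # (2*64, 49, 96)
  back = window_reverse(wins, 7, 56, 56)
  assert torch.allclose(feat, back)
  ```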

- (arXiv 2022.04) XDBERT: Distilling Visual Information to **BERT** from Cross-Modal Systems to Improve Language Understanding, [[Paper]](https://arxiv.org/pdf/2204.07316.pdf)

- (arXiv 2022.04) IMPROVING CROSS-MODAL UNDERSTANDING IN **VISUAL DIALOG** VIA CONTRASTIVE LEARNING, [[Paper]](https://arxiv.org/pdf/2204.07302.pdf)

- (arXiv 2022.04) MVSTER: Epipolar Transformer for Efficient **Multi-View Stereo**, [[Paper]](https://arxiv.org/pdf/2204.07346.pdf), [[Code]](https://github.com/JeffWang987/MVSTER)

- (arXiv 2022.04) UNCONDITIONAL **IMAGE-TEXT PAIR GENERATION** WITH MULTIMODAL CROSS QUANTIZER, [[Paper]](https://arxiv.org/pdf/2204.07537.pdf)

- (arXiv 2022.04) Pushing the Limits of Simple Pipelines for **Few-Shot Learning**: External Data and Fine-Tuning Make a Difference, [[Paper]](https://arxiv.org/pdf/2204.07305.pdf)

- (arXiv 2022.04) COTS: Collaborative Two-Stream **Vision-Language** Pre-Training Model for Cross-Modal Retrieval, [[Paper]](https://arxiv.org/pdf/2204.07441.pdf)

- (arXiv 2022.04) Image **Captioning** In the Transformer Age, [[Paper]](https://arxiv.org/pdf/2204.07374.pdf), [[Code]](https://github.com/SjokerLily/awesome-image-captioning)

- (arXiv 2022.04) **ResT** V2: Simpler, Faster and Stronger, [[Paper]](https://arxiv.org/pdf/2204.07366.pdf), [[Code]](https://github.com/wofmanaf/ResT)

- (arXiv 2022.04) Lightweight Bimodal Network for Single-Image **Super-Resolution** via Symmetric CNN and Recursive Transformer, [[Paper]](https://arxiv.org/pdf/2204.13286.pdf), [[Code]](https://github.com/IVIPLab/LBNet)

- (arXiv 2022.04) Temporal Progressive Attention for **Early Action Prediction**, [[Paper]](https://arxiv.org/pdf/2204.13340.pdf), [[Code]](https://github.com/alexandrosstergiou/progressive-action-prediction)

- (arXiv 2022.04) Keep the Caption Information: Preventing Shortcut Learning in Contrastive **Image-Caption** Retrieval, [[Paper]](https://arxiv.org/pdf/2204.13382.pdf)

- (arXiv 2022.04) Flamingo: a **Visual Language** Model for **Few-Shot** Learning, [[Paper]](https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/tackling-multiple-tasks-with-a-single-visual-language-model/flamingo.pdf)

- (arXiv 2022.04) **Unsupervised** Human **Action** Recognition with Skeletal Graph Laplacian and Self-Supervised Viewpoints Invariance, [[Paper]](https://arxiv.org/pdf/2204.10312.pdf), [[Code]](https://github.com/IIT-PAVIS/UHAR_Skeletal_Laplacian)

- (arXiv 2022.04) Learning **Future Object Prediction** with a Spatiotemporal Detection Transformer, [[Paper]](https://arxiv.org/pdf/2204.10321.pdf)

- (arXiv 2022.04) R^2-Trans: **Fine-Grained Visual Categorization** with Redundancy Reduction, [[Paper]](https://arxiv.org/pdf/2204.10095.pdf), [[Code]](https://anonymous.4open.science/r/R-2-Trans)

- (arXiv 2022.04) A New Dataset and Transformer for **Stereoscopic Video Super-Resolution**, [[Paper]](https://arxiv.org/pdf/2204.10039.pdf), [[Code]](https://github.com/H-deep/Trans-SVSR/)

- (arXiv 2022.04) Transformer-Guided Convolutional Neural Network for Cross-View **Geolocalization**, [[Paper]](https://arxiv.org/pdf/2204.09967.pdf)

- (arXiv 2022.04) Multi-Scale Features and Parallel Transformers Based **Image Quality Assessment**, [[Paper]](https://arxiv.org/pdf/2204.09779.pdf), [[Code]](https://github.com/KomalPal9610/IQA)

- (arXiv 2022.04) BTranspose: Bottleneck Transformers for **Human Pose Estimation** with Self-Supervised Pre-Training, [[Paper]](https://arxiv.org/pdf/2204.10209.pdf)

- (arXiv 2022.04) **Human-Object Interaction Detection** via Disentangled Transformer, [[Paper]](https://arxiv.org/pdf/2204.09290.pdf)

- (arXiv 2022.04) ELEVATER: A **Benchmark** and **Toolkit** for Evaluating **Language-Augmented Visual Models**, [[Paper]](https://arxiv.org/pdf/2204.08790.pdf)

- (arXiv 2022.04) Interactiveness Field in **Human-Object Interactions**, [[Paper]](https://arxiv.org/pdf/2204.07718.pdf), [[Code]](https://github.com/Foruck/Interactiveness-Field)

- (arXiv 2022.04) DeiT III: Revenge of the **ViT**, [[Paper]](https://arxiv.org/pdf/2204.07118.pdf)

- (arXiv 2022.04) Residual Swin Transformer Channel Attention Network for **Image Demosaicing**, [[Paper]](https://arxiv.org/pdf/2204.07098.pdf)

- (arXiv 2022.04) Neighborhood **Attention** Transformer, [[Paper]](https://arxiv.org/pdf/2204.07143.pdf), [[Code]](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer)

- (arXiv 2022.04) MiniViT: **Compressing** Vision Transformers with Weight Multiplexing, [[Paper]](https://arxiv.org/pdf/2204.07154.pdf), [[Code]](https://github.com/microsoft/Cream)

- (arXiv 2022.04) ViTOL: Vision Transformer for **Weakly Supervised Object Localization**, [[Paper]](https://arxiv.org/pdf/2204.06772.pdf), [[Code]](https://github.com/Saurav-31/ViTOL)

- (arXiv 2022.04) What Matters in Language Conditioned Robotic **Imitation Learning**, [[Paper]](https://arxiv.org/pdf/2204.06252.pdf), [[Code]](http://hulc.cs.uni-freiburg.de/)

- (arXiv 2022.04) Consistency driven Sequential Transformers Attention Model for **Partially Observable Scenes**, [[Paper]](https://arxiv.org/pdf/2204.00656.pdf)

- (arXiv 2022.04) ReCLIP: A Strong Zero-Shot Baseline for **Referring Expression Comprehension**, [[Paper]](https://arxiv.org/pdf/2204.05991.pdf)

- (arXiv 2022.04) Are **Multimodal** Transformers **Robust** to Missing Modality? [[Paper]](https://arxiv.org/pdf/2204.05454.pdf)

- (arXiv 2022.04) TopFormer: Token Pyramid Transformer for Mobile **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2204.05525.pdf), [[Code]](https://github.com/hustvl/TopFormer)

- (arXiv 2022.04) X-DETR: A Versatile Architecture for Instance-wise **Vision-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2204.05626.pdf)

- (arXiv 2022.04) **Event** Transformer, [[Paper]](https://arxiv.org/pdf/2204.05172.pdf)

- (arXiv 2022.04) Evaluating Vision Transformer Methods for **Deep Reinforcement Learning** from Pixels, [[Paper]](https://arxiv.org/pdf/2204.04905.pdf)

- (arXiv 2022.04) ManiTrans: Entity-Level **Text-Guided Image Manipulation** via Token-wise Semantic Alignment and Generation, [[Paper]](https://arxiv.org/pdf/2204.04428.pdf), [[Code]](https://jawang19.github.io/manitrans)

- (arXiv 2022.04) Multimodal Transformer for Nursing **Activity Recognition**, [[Paper]](https://arxiv.org/pdf/2204.04564.pdf), [[Code]](https://github.com/Momilijaz96/MMT_for_NCRC)

- (arXiv 2022.04) **Robust** Cross-Modal Representation Learning with Progressive Self-Distillation, [[Paper]](https://arxiv.org/pdf/2204.04588.pdf)

- (arXiv 2022.04) Stripformer: Strip Transformer for Fast Image **Deblurring**, [[Paper]](https://arxiv.org/pdf/2204.04627.pdf)

- (arXiv 2022.04) No Token Left Behind: **Explainability**-Aided Image Classification and Generation, [[Paper]](https://arxiv.org/pdf/2204.04908.pdf)

- (arXiv 2022.04) Fashionformer: A Simple, Effective and Unified Baseline for **Human Fashion Segmentation and Recognition**, [[Paper]](https://arxiv.org/pdf/2204.04654.pdf), [[Code]](https://github.com/xushilin1/FashionFormer)

- (arXiv 2022.04) Panoptic-PartFormer: Learning a Unified Model for **Panoptic Part Segmentation**, [[Paper]](https://arxiv.org/pdf/2204.04655.pdf), [[Code]](https://github.com/lxtGH/Panoptic-PartFormer)

- (arXiv 2022.04) DILEMMA: Self-Supervised **Shape and Texture** Learning with Transformers, [[Paper]](https://arxiv.org/pdf/2204.04788.pdf)

- (arXiv 2022.04) Learning Trajectory-Aware Transformer for **Video Super-Resolution**, [[Paper]](https://arxiv.org/pdf/2204.04216.pdf), [[Code]](https://github.com/researchmm/TTVSR.git)

- (arXiv 2022.04) Learning to Induce **Causal** Structure, [[Paper]](https://arxiv.org/pdf/2204.04875.pdf)

- (arXiv 2022.04) Consistency Learning via Decoding Path Augmentation for Transformers in **Human Object Interaction Detection**, [[Paper]](https://arxiv.org/pdf/2204.04836.pdf), [[Code]](https://github.com/mlvlab/CPChoi)

- (arXiv 2022.04) Category-Aware Transformer Network for Better **Human-Object Interaction Detection**, [[Paper]](https://arxiv.org/pdf/2204.04911.pdf)

- (arXiv 2022.04) Does **Robustness** on ImageNet Transfer to Downstream Tasks?, [[Paper]](https://arxiv.org/pdf/2204.03934.pdf)

- (arXiv 2022.04) POSTER: A Pyramid Cross-Fusion Transformer Network for **Facial Expression Recognition**, [[Paper]](https://arxiv.org/pdf/2204.04083.pdf), [[Code]](https://github.com/zczcwh/POSTER)

- (arXiv 2022.04) Vision Transformers for Single Image **Dehazing**, [[Paper]](https://arxiv.org/pdf/2204.03883.pdf), [[Code]](https://github.com/IDKiro/DehazeFormer)

- (arXiv 2022.04) Underwater **Image Enhancement** Using Pre-trained Transformer, [[Paper]](https://arxiv.org/pdf/2204.04199.pdf)

- (arXiv 2022.04) Event Transformer. A sparse-aware solution for efficient **event data processing**, [[Paper]](https://arxiv.org/pdf/2204.03355.pdf), [[Code]](https://github.com/AlbertoSabater/EventTransformer)

- (arXiv 2022.04) PSTR: End-to-End One-Step **Person Search** With Transformers, [[Paper]](https://arxiv.org/pdf/2204.03340.pdf), [[Code]](https://github.com/JialeCao001/PSTR)

- (arXiv 2022.04) Adapting **CLIP** For **Phrase Localization** Without Further Training, [[Paper]](https://arxiv.org/pdf/2204.03647.pdf), [[Code]](https://github.com/pals-ttic/adapting-CLIP)

- (arXiv 2022.04) FineDiving: A Fine-grained **Dataset** for Procedure-aware **Action Quality Assessment**, [[Paper]](https://arxiv.org/pdf/2204.03646.pdf), [[Project]](https://github.com/xujinglin/FineDiving)

- (arXiv 2022.04) DaViT: Dual **Attention** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2204.03645.pdf), [[Code]](https://github.com/dingmyu/davit)

- (arXiv 2022.04) Unsupervised Prompt Learning for **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2204.03649.pdf), [[Code]](https://github.com/tonyhuang2022/UPL)

- (arXiv 2022.04) **Long Video Generation** with Time-Agnostic VQGAN and Time-Sensitive Transformer, [[Paper]](https://arxiv.org/pdf/2204.03638.pdf), [[Project]](https://songweige.github.io/projects/tats/index.html)

- (arXiv 2022.04) Unified Contrastive Learning in **Image-Text-Label** Space, [[Paper]](https://arxiv.org/pdf/2204.03610.pdf), [[Code]](https://github.com/microsoft/UniCL)

- (arXiv 2022.04) HunYuan_tvr for **Text-Video** Retrieval, [[Paper]](https://arxiv.org/pdf/2204.03382.pdf)

- (arXiv 2022.04) LEARNING TO COMPOSE SOFT PROMPTS FOR **COMPOSITIONAL ZERO-SHOT LEARNING**, [[Paper]]()

- (arXiv 2022.04) End-to-End Zero-Shot **HOI** Detection via **Vision and Language** Knowledge Distillation, [[Paper]](https://arxiv.org/pdf/2204.03541.pdf), [[Code]](https://github.com/mrwu-mac/EoID)

- (arXiv 2022.04) **Temporal Alignment** Networks for Long-term Video, [[Paper]](https://arxiv.org/pdf/2204.02968.pdf), [[Code]](https://www.robots.ox.ac.uk/~vgg/research/tan/)

- (arXiv 2022.04) Unleashing Vanilla Vision Transformer with Masked Image Modeling for **Object Detection**, [[Paper]](https://arxiv.org/pdf/2204.02964.pdf), [[Code]](https://github.com/hustvl/MIMDet)

- (arXiv 2022.04) MixFormer: **Mixing Features** across Windows and Dimensions, [[Paper]](https://arxiv.org/pdf/2204.02557.pdf), [[Code]](https://github.com/PaddlePaddle/PaddleClas)

- (arXiv 2022.04) CM3: A **CAUSAL** MASKED **MULTIMODAL** MODEL OF THE INTERNET, [[Paper]](https://arxiv.org/pdf/2201.07520.pdf)

- (arXiv 2022.04) DO AS I CAN, NOT AS I SAY: GROUNDING LANGUAGE IN **ROBOTIC** AFFORDANCES, [[Paper]](https://arxiv.org/pdf/2204.01691.pdf), [[Project]](https://say-can.github.io/)

- (arXiv 2022.04) TransGeo: Transformer Is All You Need for **Cross-view Image Geo-localization**, [[Paper]](https://arxiv.org/pdf/2204.00097.pdf), [[Code]](https://github.com/Jeff-Zilence/TransGeo2022)

- (arXiv 2022.04) Socratic Models: Composing **Zero-Shot Multimodal Reasoning** with Language, [[Paper]](https://arxiv.org/pdf/2204.00598.pdf), [[Project]](https://socraticmodels.github.io/)

- (arXiv 2022.04) Vision Transformer with Cross-attention by Temporal Shift for Efficient **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2204.00452.pdf)

- (arXiv 2022.04) Learning **Audio-Video** Modalities from Image Captions, [[Paper]](https://arxiv.org/pdf/2204.00679.pdf)

- (arXiv 2022.04) Improving Vision Transformers by Revisiting **High-frequency Components**, [[Paper]]()

- (arXiv 2022.04) POS-BERT: **Point Cloud** One-Stage BERT Pre-Training, [[Paper]](https://arxiv.org/pdf/2204.00989.pdf), [[Code]](https://github.com/fukexue/POS-BERT)

- (arXiv 2022.04) BinsFormer: Revisiting Adaptive Bins for **Monocular Depth Estimation**, [[Paper]](https://arxiv.org/pdf/2204.00987.pdf), [[Code]](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox)

- (arXiv 2022.04) BatchFormerV2: Exploring Sample Relationships for **Dense Representation** Learning, [[Paper]](https://arxiv.org/pdf/2204.01254.pdf)

- (arXiv 2022.04) TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for **Repetitive Action Counting**, [[Paper]](https://arxiv.org/pdf/2204.01018.pdf)

- (arXiv 2022.04) **Long** Movie Clip Classification with State-Space **Video** Models, [[Paper]](https://arxiv.org/pdf/2204.01692.pdf), [[Code]](https://github.com/md-mohaiminul/ViS4mer)

- (arXiv 2022.04) TALLFormer: **Temporal Action Localization** with Long-memory Transformer, [[Paper]](https://arxiv.org/pdf/2204.01680.pdf), [[Code]](https://github.com/klauscc/TALLFormer)

- (arXiv 2022.04) Multi**MAE**: Multi-modal Multi-task Masked Autoencoders, [[Paper]](https://arxiv.org/pdf/2204.01678.pdf), [[Project]](https://multimae.epfl.ch/)

- (arXiv 2022.04) “This is my unicorn, Fluffy”: Personalizing frozen **vision-language** representations, [[Paper]](https://arxiv.org/pdf/2204.01694.pdf)

- (arXiv 2022.04) SE(3)-Equivariant Attention Networks for **Shape Reconstruction** in Function Space, [[Paper]](https://arxiv.org/pdf/2204.02394.pdf)

- (arXiv 2022.04) Multi-View Transformer for **3D Visual Grounding**, [[Paper]](https://arxiv.org/pdf/2204.02174.pdf), [[Code]](https://github.com/sega-hsj/MVT-3DVG)

- (arXiv 2022.04) VISION TRANSFORMER EQUIPPED WITH NEURAL RESIZER ON **FACIAL EXPRESSION RECOGNITION** TASK, [[Paper]](https://arxiv.org/pdf/2204.02181.pdf)

- (arXiv 2022.04) Dual-AI: Dual-path Actor Interaction Learning for **Group Activity Recognition**, [[Paper]](https://arxiv.org/pdf/2204.02148.pdf), [[Project]](https://mingfei.info/Dual-AI/)

- (arXiv 2022.04) Detector-Free Weakly Supervised **Group Activity Recognition**, [[Paper]](https://arxiv.org/pdf/2204.02139.pdf)

- (arXiv 2022.04) Joint **Hand Motion and Interaction Hotspots Prediction** from Egocentric Videos, [[Paper]](https://arxiv.org/pdf/2204.01696.pdf), [[Project]](https://stevenlsw.github.io/hoi-forecast)

- (arXiv 2022.04) What to look at and where: Semantic and Spatial Refined Transformer for detecting **human-object interactions**, [[Paper]](https://arxiv.org/pdf/2204.00746.pdf)

- (arXiv 2022.04) MaxViT: **Multi-Axis** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2204.01697.pdf)

### 2022.03

- (arXiv 2022.03) A **ConvNet** for the 2020s, [[Paper]](https://arxiv.org/pdf/2201.03545.pdf), [[Code]](https://github.com/facebookresearch/ConvNeXt)

- (arXiv 2022.03) DeepNet: Scaling Transformers to **1,000 Layers**, [[Paper]](https://arxiv.org/pdf/2203.00555.pdf)

- (arXiv 2022.03) Spatial-Temporal Parallel Transformer for **Arm-Hand Dynamic Estimation**, [[Paper]](https://arxiv.org/pdf/2203.16202.pdf)

- (arXiv 2022.03) ViSTA: **Vision** and **Scene Text** Aggregation for Cross-Modal **Retrieval**, [[Paper]](https://arxiv.org/pdf/2203.16778.pdf)

- (arXiv 2022.03) ReSTR: Convolution-free **Referring Image Segmentation** Using Transformers, [[Paper]](https://arxiv.org/pdf/2203.16768.pdf), [[Project]](http://cvlab.postech.ac.kr/research/restr/)

- (arXiv 2022.03) CREATE: A **Benchmark** for Chinese Short **Video Retrieval** and **Title Generation**, [[Paper]](https://arxiv.org/pdf/2203.16763.pdf)

- (arXiv 2022.03) **Deformable** **Video** Transformer, [[Paper]](https://arxiv.org/pdf/2203.16795.pdf)

- (arXiv 2022.03) End-to-End **Trajectory** Distribution Prediction Based on Occupancy Grid Maps, [[Paper]](https://arxiv.org/pdf/2203.16910.pdf)

- (arXiv 2022.03) CRAFT: Cross-Attentional Flow Transformer for Robust **Optical Flow**, [[Paper]](https://arxiv.org/pdf/2203.16896.pdf), [[Code]](https://github.com/askerlee/craft)

- (arXiv 2022.03) VL-InterpreT: An Interactive **Visualization** Tool for Interpreting **Vision-Language** Transformers, [[Paper]](https://arxiv.org/pdf/2203.17247.pdf), [[App]](http://vlinterpretenv4env-env.eba-vmhhefup.us-east-2.elasticbeanstalk.com/)

- (arXiv 2022.03) TransEditor: Transformer-Based Dual-Space **GAN** for Highly Controllable **Facial Editing**, [[Paper]](https://arxiv.org/pdf/2203.17266.pdf), [[Code]](https://github.com/BillyXYB/TransEditor)

- (arXiv 2022.03) BEVFormer: Learning **Bird’s-Eye-View** Representation from Multi-Camera Images via Spatiotemporal Transformers, [[Paper]](https://arxiv.org/pdf/2203.17270.pdf), [[Code]](https://github.com/zhiqi-li/BEVFormer)

- (arXiv 2022.03) **Visual Prompting**: Modifying Pixel Space to Adapt Pre-trained Models, [[Paper]](https://arxiv.org/pdf/2203.17274.pdf), [[Code]](https://hjbahng.github.io/visual_prompting/)

- (arXiv 2022.03) Bringing Old **Films** Back to Life, [[Paper]](https://arxiv.org/pdf/2203.17276.pdf), [[Code]](https://github.com/raywzy/Bringing-Old-Films-Back-to-Life)

- (arXiv 2022.03) Learning to Prompt for **Open-Vocabulary Object Detection** with Vision-Language Model, [[Paper]](https://arxiv.org/pdf/2203.14940.pdf), [[Code]](https://github.com/dyabel/detpro)

- (arXiv 2022.03) SeqTR: A Simple yet Universal Network for **Visual Grounding**, [[Paper]](https://arxiv.org/pdf/2203.16265.pdf), [[Code]](https://github.com/sean-zhuh/SeqTR)

- (arXiv 2022.03) InstaFormer: Instance-Aware **Image-to-Image Translation** with Transformer, [[Paper]](https://arxiv.org/pdf/2203.16248.pdf)

- (arXiv 2022.03) Omni-DETR: **Omni-Supervised** Object **Detection** with Transformers, [[Paper]](https://arxiv.org/pdf/2203.16089.pdf), [[Code]](https://github.com/amazon-research/omni-detr)

- (arXiv 2022.03) Learning **Program Representations** for Food Images and Cooking Recipes, [[Paper]](https://arxiv.org/pdf/2203.16071.pdf), [[Project]](http://cookingprograms.csail.mit.edu/)

- (arXiv 2022.03) ITTR: **Unpaired Image-to-Image Translation** with Transformers, [[Paper]](https://arxiv.org/pdf/2203.16015.pdf)

- (arXiv 2022.03) VPTR: **Efficient** Transformers for **Video Prediction**, [[Paper]](https://arxiv.org/pdf/2203.15836.pdf), [[Code]](https://github.com/XiYe20/VPTR)

- (arXiv 2022.03) Parameter-**efficient** Fine-tuning for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2203.16329.pdf)

- (arXiv 2022.03) TubeDETR: Spatio-Temporal Video **Grounding** with Transformers, [[Paper]](https://arxiv.org/pdf/2203.16434.pdf), [[Code]](https://antoyang.github.io/tubedetr.html)

- (arXiv 2022.03) Exploring Plain Vision Transformer Backbones for Object **Detection**, [[Paper]](https://arxiv.org/pdf/2203.16527.pdf)

- (arXiv 2022.03) PROMPTDET: EXPAND YOUR **DETECTOR** VOCABULARY WITH UNCURATED IMAGES, [[Paper]](https://arxiv.org/pdf/2203.16513.pdf), [[Code]](https://fcjian.github.io/promptdet)

- (arXiv 2022.03) **Few-Shot** Object **Detection** with Fully Cross-Transformer, [[Paper]](https://arxiv.org/pdf/2203.15021.pdf)

- (arXiv 2022.03) Unified Transformer Tracker for Object **Tracking**, [[Paper]](https://arxiv.org/pdf/2203.15175.pdf)

- (arXiv 2022.03) X-Pool: Cross-Modal **Language-Video** Attention for Text-Video Retrieval, [[Paper]](https://arxiv.org/pdf/2203.15086.pdf), [[Code]](https://layer6ai-labs.github.io/xpool/)

- (arXiv 2022.03) Fine-tuning Image Transformers using Learnable **Memory**, [[Paper]](https://arxiv.org/pdf/2203.15243.pdf)

- (arXiv 2022.03) MAT: Mask-Aware Transformer for Large Hole Image **Inpainting**, [[Paper]](https://arxiv.org/pdf/2203.15270.pdf), [[Code]](https://github.com/fenglinglwb/MAT)

- (arXiv 2022.03) mc-BEiT: Multi-choice Discretization for **Image BERT** Pre-training, [[Paper]](https://arxiv.org/pdf/2203.15371.pdf)

- (arXiv 2022.03) End-to-End Transformer Based Model for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2203.15350.pdf)

- (arXiv 2022.03) Hybrid Routing Transformer for **Zero-Shot Learning**, [[Paper]](https://arxiv.org/pdf/2203.15310.pdf)

- (arXiv 2022.03) TREATMENT LEARNING TRANSFORMER FOR **NOISY IMAGE CLASSIFICATION**, [[Paper]](https://arxiv.org/pdf/2203.15529.pdf)

- (arXiv 2022.03) Do **Vision-Language** Pretrained Models Learn **Primitive Concepts**?, [[Paper]](https://arxiv.org/pdf/2203.17271.pdf)

- (arXiv 2022.03) Transformer Inertial Poser: Attention-based Real-time **Human Motion Reconstruction** from Sparse **IMUs**, [[Paper]](https://arxiv.org/pdf/2203.15720.pdf)

- (arXiv 2022.03) SepViT: **Separable** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2203.15380.pdf)

- (arXiv 2022.03) MatteFormer: Transformer-Based **Image Matting** via Prior-Tokens, [[Paper]](https://arxiv.org/pdf/2203.15662.pdf), [[Code]](https://github.com/webtoon/matteformer)

- (arXiv 2022.03) Feature Selective Transformer for **Semantic Image Segmentation**, [[Paper]](https://arxiv.org/pdf/2203.14124.pdf)

- (arXiv 2022.03) Bridge-Prompt: Towards **Ordinal Action Understanding** in Instructional Videos, [[Paper]](https://arxiv.org/pdf/2203.14104.pdf), [[Code]](https://github.com/ttlmh/Bridge-Prompt)

- (arXiv 2022.03) RSTT: Real-time Spatial Temporal Transformer for Space-Time **Video Super-Resolution**, [[Paper]](https://arxiv.org/pdf/2203.14186.pdf), [[Code]](https://github.com/llmpass/RSTT)

- (arXiv 2022.03) Single-Stream Multi-Level Alignment for **Vision-Language** Pretraining, [[Paper]](https://arxiv.org/pdf/2203.14395.pdf)

- (arXiv 2022.03) Beyond Masking: Demystifying **Token-Based Pre-Training** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2203.14313.pdf), [[Code]](https://github.com/sunsmarterjie/beyond_masking)

- (arXiv 2022.03) Collaborative Transformers for **Grounded Situation Recognition**, [[Paper]](https://arxiv.org/pdf/2203.16518.pdf), [[Code]](https://github.com/jhcho99/CoFormer)

- (arXiv 2022.03) Object Memory Transformer for Object Goal **Navigation**, [[Paper]](https://arxiv.org/pdf/2203.14708.pdf)

- (arXiv 2022.03) Brain-inspired **Multilayer Perceptron** with **Spiking Neurons**, [[Paper]](https://arxiv.org/pdf/2203.14679.pdf), [[Code]](https://gitee.com/mindspore/models/tree/master/research/cv/snn_mlp)

- (arXiv 2022.03) HandOccNet: Occlusion-Robust **3D Hand Mesh Estimation** Network, [[Paper]](https://arxiv.org/pdf/2203.14564.pdf), [[Code]](https://github.com/namepllet/HandOccNet)

- (arXiv 2022.03) REGTR: End-to-end **Point Cloud Correspondences** with Transformers, [[Paper]](https://arxiv.org/pdf/2203.14517.pdf), [[Code]](https://github.com/yewzijian/RegTR)

- (arXiv 2022.03) Automated Progressive Learning for **Efficient Training** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2203.14509.pdf)

- (arXiv 2022.03) Stratified Transformer for 3D **Point Cloud Segmentation**, [[Paper]](https://arxiv.org/pdf/2203.14508.pdf), [[Code]](https://github.com/dvlab-research/Stratified-Transformer)

- (arXiv 2022.03) NOC-REK: Novel Object **Captioning** with Retrieved Vocabulary from External Knowledge, [[Paper]](https://arxiv.org/pdf/2203.14499.pdf)

- (arXiv 2022.03) **FACIAL EXPRESSION RECOGNITION** WITH SWIN TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2203.13472.pdf)

- (arXiv 2022.03) Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch **Robustness**, [[Paper]](https://arxiv.org/pdf/2203.13639.pdf)

- (arXiv 2022.03) Efficient Visual **Tracking** via Hierarchical Cross-Attention Transformer, [[Paper]](https://arxiv.org/pdf/2203.13537.pdf), [[Code]](https://github.com/chenxin-dlut/HCAT)

- (arXiv 2022.03) High-Performance Transformer **Tracking**, [[Paper]](https://arxiv.org/pdf/2203.13533.pdf), [[Code]](https://github.com/chenxin-dlut/TransT-M)

- (arXiv 2022.03) RayTran: **3D pose estimation** and **shape reconstruction** of multiple objects from videos with ray-traced transformers, [[Paper]](https://arxiv.org/pdf/2203.13296.pdf)

- (arXiv 2022.03) Multi-modal Multi-label **Facial Action Unit Detection** with Transformer, [[Paper]](https://arxiv.org/pdf/2203.13301.pdf)

- (arXiv 2022.03) MonoDETR: Depth-aware Transformer for Monocular **3D** Object **Detection**, [[Paper]](https://arxiv.org/pdf/2203.13310.pdf), [[Code]](https://github.com/ZrrSkywalker/MonoDETR.git)

- (arXiv 2022.03) **Text to Mesh** Without 3D Supervision Using Limit Subdivision, [[Paper]](https://arxiv.org/pdf/2203.13333.pdf), [[Project]](https://www.nasir.lol/clipmesh)

- (arXiv 2022.03) GEN-VLKT: Simplify Association and Enhance Interaction Understanding for **HOI Detection**, [[Paper]](https://arxiv.org/pdf/2203.13954.pdf), [[Code]](https://github.com/YueLiao/gen-vlkt)

- (arXiv 2022.03) CrossFormer: Cross Spatio-Temporal Transformer for **3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2203.13387.pdf)

- (arXiv 2022.03) FitCLIP: Refining Large-Scale Pretrained **Image-Text** Models for Zero-Shot Video Understanding Tasks, [[Paper]](https://arxiv.org/pdf/2203.13371.pdf), [[Code]](https://github.com/bryant1410/)

- (arXiv 2022.03) Vision Transformer **Compression** with Structured Pruning and Low Rank Approximation, [[Paper]](https://arxiv.org/pdf/2203.13444.pdf)

- (arXiv 2022.03) Multi-Modal Learning for **AU Detection** Based on Multi-Head Fused Transformers, [[Paper]](https://arxiv.org/pdf/2203.11441.pdf)

- (arXiv 2022.03) MSTR: Multi-Scale Transformer for End-to-End **Human-Object Interaction Detection**, [[Paper]](https://arxiv.org/pdf/2203.14709.pdf)

- (arXiv 2022.03) Learning Patch-to-Cluster **Attention** in Vision Transformer, [[Paper]](https://arxiv.org/pdf/2203.11987.pdf)

- (arXiv 2022.03) Visual **Prompt Tuning**, [[Paper]](https://arxiv.org/pdf/2203.12119.pdf)
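
  Visual Prompt Tuning keeps the pretrained ViT frozen and learns a small set of extra input tokens. The sketch below shows a shallow-prompt variant that prepends learnable tokens to the patch sequence; the module name, prompt count, and dimensions are assumptions, not the paper's code.

  ```python
  import torch
  import torch.nn as nn

  class PromptedViTInput(nn.Module):
      """Prepend learnable prompt tokens to a frozen ViT's token sequence (shallow-prompt style)."""

      def __init__(self, embed_dim: int = 768, num_prompts: int = 10):
          super().__init__()
          self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
          nn.init.trunc_normal_(self.prompts, std=0.02)

      def forward(self, tokens: torch.Tensor) -> torch.Tensor:
          # tokens: (B, 1 + N, D) = [CLS] + patch embeddings from the frozen backbone
          B = tokens.size(0)
          prompts = self.prompts.expand(B, -1, -1)
          # insert the prompts between the [CLS] token and the patch tokens
          return torch.cat([tokens[:, :1], prompts, tokens[:, 1:]], dim=1)

  # During fine-tuning only `prompts` (and a task head) receive gradients;
  # the transformer weights stay frozen.
  x = torch.randn(4, 197, 768)             # ViT-B/16 tokens for a 224x224 image
  x = PromptedViTInput()(x)                # (4, 207, 768)
  ```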

- (arXiv 2022.03) Training-free Transformer **Architecture Search**, [[Paper]](https://arxiv.org/pdf/2203.12217.pdf)

- (arXiv 2022.03) VideoMAE: Masked Autoencoders are Data-Efficient Learners for **Self-Supervised Video Pre-Training**, [[Paper]](https://arxiv.org/pdf/2203.12602.pdf), [[Code]](https://github.com/MCG-NJU/VideoMAE)

- (arXiv 2022.03) METAMORPH: LEARNING **UNIVERSAL CONTROLLERS** WITH TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2203.11931.pdf), [[Project]](https://metamorph-iclr.github.io/site/)

- (arXiv 2022.03) A Prompt Array Keeps the Bias Away: **Debiasing** **Vision-Language** Models with Adversarial Learning, [[Paper]](https://arxiv.org/pdf/2203.11933.pdf)

- (arXiv 2022.03) Reshaping **Robot Trajectories** Using Natural **Language** Commands: A Study of Multi-Modal Data Alignment Using Transformers, [[Paper]](https://arxiv.org/pdf/2203.13411.pdf), [[Project]](https://arthurfenderbucker.github.io/NL_trajectory_reshaper/)

- (arXiv 2022.03) Associating Objects with Scalable Transformers for **Video Object Segmentation**, [[Paper]](https://arxiv.org/pdf/2203.11442.pdf), [[Project]](https://github.com/z-x-yang/AOT)

- (arXiv 2022.03) HOP: History-and-Order Aware Pre-training for **Vision-and-Language Navigation**, [[Paper]](https://arxiv.org/pdf/2203.11591.pdf), [[Code]](https://github.com/YanyuanQiao/HOP-VLN)

- (arXiv 2022.03) Learning to **generate line drawings** that convey geometry and semantics, [[Paper]](https://arxiv.org/pdf/2203.12691.pdf), [[Project]](https://carolineec.github.io/informative_drawings/)

- (arXiv 2022.03) UMT: Unified Multi-modal Transformers for Joint **Video Moment Retrieval** and **Highlight Detection**, [[Paper]](https://arxiv.org/pdf/2203.12745.pdf), [[Code]](https://github.com/TencentARC/UMT)

- (arXiv 2022.03) AIMusicGuru: Music Assisted **Human Pose Correction**, [[Paper]](https://arxiv.org/pdf/2203.12829.pdf)

- (arXiv 2022.03) What to Hide from Your Students: Attention-Guided **Masked Image Modeling**, [[Paper]](https://arxiv.org/pdf/2203.12719.pdf)

- (arXiv 2022.03) Towards Efficient and Elastic **Visual Question Answering** with Doubly Slimmable Transformer, [[Paper]](https://arxiv.org/pdf/2203.12814.pdf)

- (arXiv 2022.03) ViT-FOD: A Vision Transformer based **Fine-grained Object Discriminator**, [[Paper]](https://arxiv.org/pdf/2203.12816.pdf)

- (arXiv 2022.03) **Keypoints Tracking** via Transformer Networks, [[Paper]](https://arxiv.org/pdf/2203.12848.pdf), [[Code]](https://github.com/LexaNagiBator228/Keypoints-Tracking-via-Transformer-Networks/)

- (arXiv 2022.03) Beyond Fixation: **Dynamic Window** Visual Transformer, [[Paper]](https://arxiv.org/pdf/2203.12856.pdf), [[Code]](https://github.com/pzhren/DW-ViT)

- (arXiv 2022.03) Make-A-Scene: Scene-Based **Text-to-Image** Generation with Human Priors, [[Paper]](https://arxiv.org/pdf/2203.13131.pdf)

- (arXiv 2022.03) Self-supervised Video-centralised Transformer for **Video Face Clustering**, [[Paper]](https://arxiv.org/pdf/2203.13166.pdf)

- (arXiv 2022.03) Towards Exemplar-Free **Continual Learning** in Vision Transformers: an Account of Attention, Functional and Weight Regularization, [[Paper]](https://arxiv.org/pdf/2203.13167.pdf)

- (arXiv 2022.03) Global **Tracking** Transformers, [[Paper]](https://arxiv.org/pdf/2203.13250.pdf), [[Code]](https://github.com/xingyizhou/GTR)

- (arXiv 2022.03) **Video Instance Segmentation** via Multi-scale Spatio-temporal Split Attention Transformer, [[Paper]](https://arxiv.org/pdf/2203.13253.pdf), [[Code]](https://github.com/OmkarThawakar/MSSTS-VIS)

- (arXiv 2022.03) QS-Craft: Learning to Quantize, Scrabble and Craft for **Conditional Human Motion Animation**, [[Paper]](https://arxiv.org/pdf/2203.11632.pdf)

- (arXiv 2022.03) Look for the Change: Learning **Object States** and **State-Modifying Actions** from Untrimmed Web Videos, [[Paper]](https://arxiv.org/pdf/2203.11637.pdf), [[Project]](https://data.ciirc.cvut.cz/public/projects/2022LookForTheChange/)

- (arXiv 2022.03) GradViT: **Gradient Inversion** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2203.11894.pdf), [[Code]](https://gradvit.github.io/)

- (arXiv 2022.03) **Mask Usage Recognition** using Vision Transformer with Transfer Learning and Data Augmentation, [[Paper]](https://arxiv.org/pdf/2203.11542.pdf)

- (arXiv 2022.03) Under the Hood of Transformer Networks for **Trajectory Forecasting**, [[Paper]](https://arxiv.org/pdf/2203.11878.pdf)

- (arXiv 2022.03) **Open-Vocabulary DETR** with Conditional Matching, [[Paper]](https://arxiv.org/pdf/2203.11876.pdf)

- (arXiv 2022.03) Meta-attention for ViT-backed **Continual Learning**, [[Paper]](https://arxiv.org/pdf/2203.11684.pdf), [[Code]](https://github.com/zju-vipa/MEAT-TIL)

- (arXiv 2022.03) CNNs and Transformers Perceive **Hybrid Images** Similar to Humans, [[Paper]](https://arxiv.org/pdf/2203.11678.pdf), [[Code]](https://github.com/aliborji/hybrid_images.git)

- (arXiv 2022.03) Bailando: **3D Dance Generation** by Actor-Critic GPT with Choreographic Memory, [[Paper]](https://arxiv.org/pdf/2203.13055.pdf), [[Code]](https://github.com/lisiyao21/Bailando/)

- (arXiv 2022.03) Affective Feedback Synthesis Towards Multimodal **Text and Image** Data, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2203/2203.12692.pdf)

- (arXiv 2022.03) ViewFormer: **NeRF-free Neural Rendering** from Few Images Using Transformers, [[Paper]](https://arxiv.org/pdf/2203.10157.pdf)

- (arXiv 2022.03) **CLIP** on Wheels: Zero-Shot Object **Navigation** as Object Localization and Exploration, [[Paper]](https://arxiv.org/pdf/2203.10421.pdf)

- (arXiv 2022.03) Voxel Set Transformer: A Set-to-Set Approach to **3D Object Detection** from Point Clouds, [[Paper]](https://arxiv.org/pdf/2203.10314.pdf), [[Code]](https://github.com/skyhehe123/VoxSeT)

- (arXiv 2022.03) HIPA: Hierarchical Patch Transformer for Single Image **Super Resolution**, [[Paper]](https://arxiv.org/pdf/2203.10247.pdf)

- (arXiv 2022.03) DirecFormer: A Directed Attention in Transformer Approach to **Robust Action Recognition**, [[Paper]](https://arxiv.org/pdf/2203.10233.pdf), [[Code]](https://github.com/uark-cviu/DirecFormer)

- (arXiv 2022.03) MixFormer: End-to-End **Tracking** with Iterative Mixed Attention, [[Paper]](https://arxiv.org/pdf/2203.11082.pdf), [[Code]](https://github.com/MCG-NJU/MixFormer)

- (arXiv 2022.03) PersFormer: **3D Lane Detection** via Perspective Transformer and the OpenLane Benchmark, [[Paper]](https://arxiv.org/pdf/2203.11089.pdf), [[Code]](https://github.com/OpenPerceptionX/OpenLane)

- (arXiv 2022.03) Relationformer: A Unified Framework for **Image-to-Graph** Generation, [[Paper]](https://arxiv.org/pdf/2203.10202.pdf), [[Code]](https://github.com/suprosanna/relationformer)

- (arXiv 2022.03) **CLIP** meets GamePhysics: Towards **bug identification** in gameplay videos using zero-shot transfer learning, [[Paper]](https://arxiv.org/pdf/2203.11096.pdf), [[Code]](https://asgaardlab.github.io/CLIPxGamePhysics/)

- (arXiv 2022.03) **Hyperbolic** Vision Transformers: Combining Improvements in Metric Learning, [[Paper]](https://arxiv.org/pdf/2203.10833.pdf), [[Code]](https://github.com/htdt/hyp_metric)

- (arXiv 2022.03) MonoDTR: Monocular **3D Object Detection** with Depth-Aware Transformer, [[Paper]](https://arxiv.org/pdf/2203.10981.pdf), [[Code]](https://github.com/kuanchihhuang/MonoDTR)

- (arXiv 2022.03) Transformer-based **HTR** for Historical Documents, [[Paper]](https://arxiv.org/pdf/2203.11008.pdf)

- (arXiv 2022.03) simCrossTrans: A Simple **Cross-Modality** Transfer Learning for Object **Detection** with ConvNets or Vision Transformers, [[Paper]](https://arxiv.org/pdf/2203.10456.pdf), [[Code]](https://github.com/liketheflower/simCrossTrans)

- (arXiv 2022.03) End-to-End **Human-Gaze-Target Detection** with Transformers, [[Paper]](https://arxiv.org/pdf/2203.10433.pdf)

- (arXiv 2022.03) End-to-End **Video Text Spotting** with Transformer, [[Paper]](https://arxiv.org/pdf/2203.10539.pdf), [[Code]](https://github.com/weijiawu/TransDETR)

- (arXiv 2022.03) Open-Vocabulary One-Stage **Detection** with Hierarchical **Visual-Language** Knowledge Distillation, [[Paper]](https://arxiv.org/pdf/2203.10593.pdf), [[Code]](https://github.com/mengqiDyangge/HierKD)

- (arXiv 2022.03) V2X-ViT: **Vehicle**-to-Everything Cooperative Perception with Vision Transformer, [[Paper]](https://arxiv.org/pdf/2203.10638.pdf)

- (arXiv 2022.03) LocATe: End-to-end **Localization of Actions** in 3D with Transformers, [[Paper]](https://arxiv.org/pdf/2203.10719.pdf)

- (arXiv 2022.03) AnoViT: **Unsupervised Anomaly Detection and Localization** with Vision Transformer-based Encoder-Decoder, [[Paper]](https://arxiv.org/pdf/2203.10808.pdf)

- (arXiv 2022.03) ViM: **Out-Of-Distribution** with Virtual-logit Matching, [[Paper]](https://arxiv.org/pdf/2203.10807.pdf), [[Code]](https://github.com/haoqiwang/vim)

- (arXiv 2022.03) ScalableViT: Rethinking the Context-oriented **Generalization** of Vision Transformer, [[Paper]](https://arxiv.org/pdf/2203.10790.pdf)

- (arXiv 2022.03) Iwin: **Human-Object Interaction Detection** via Transformer with Irregular Windows, [[Paper]](https://arxiv.org/pdf/2203.10537.pdf)

- (arXiv 2022.03) Vision Transformer with Convolutions **Architecture Search**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2203/2203.10435.pdf)

- (arXiv 2022.03) Cascade Transformers for End-to-End **Person Search**, [[Paper]](https://arxiv.org/pdf/2203.09642.pdf), [[Code]](https://github.com/Kitware/COAT)

- (arXiv 2022.03) CodedVTR: Codebook-based Sparse **Voxel** Transformer with Geometric Guidance, [[Paper]](https://arxiv.org/pdf/2203.09887.pdf)

- (arXiv 2022.03) MatchFormer: Interleaving Attention in Transformers for **Feature Matching**, [[Paper]](https://arxiv.org/pdf/2203.09645.pdf), [[Code]](https://github.com/jamycheung/MatchFormer)

- (arXiv 2022.03) Local-Global Context Aware Transformer for **Language-Guided Video Segmentation**, [[Paper]](https://arxiv.org/pdf/2203.09773.pdf), [[Code]](https://github.com/leonnnop/Locater)

- (arXiv 2022.03) **Three things** everyone should know about Vision Transformers, [[Paper]](https://arxiv.org/pdf/2203.09795.pdf)

- (arXiv 2022.03) Are Vision Transformers **Robust** to Spurious Correlations? [[Paper]](https://arxiv.org/pdf/2203.09125.pdf), [[Code]](https://github.com/deeplearning-wisc/vit-spurious-robustness)

- (arXiv 2022.03) MUTUAL GENERATIVE TRANSFORMER LEARNING FOR **CROSS-VIEW GEO-LOCALIZATION**, [[Paper]](https://arxiv.org/pdf/2203.09135.pdf)

- (arXiv 2022.03) DU-VLG: Unifying **Vision-and-Language** Generation via Dual Sequence-to-Sequence Pre-training, [[Paper]](https://arxiv.org/pdf/2203.09052.pdf)

- (arXiv 2022.03) Semantic-aligned Fusion Transformer for **One-shot** Object **Detection**, [[Paper]](https://arxiv.org/pdf/2203.09093.pdf)

- (arXiv 2022.03) UNIMO-2: End-to-End Unified **Vision-Language** Grounded Learning, [[Paper]](https://arxiv.org/pdf/2203.09067.pdf), [[Code]](https://unimo-ptm.github.io/)

- (arXiv 2022.03) Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for **Few-shot Learning**, [[Paper]](https://arxiv.org/pdf/2203.09064.pdf), [[Code]](https://github.com/StomachCold/HCTransformers)

- (arXiv 2022.03) One-Shot Adaptation of **GAN** in Just One **CLIP**, [[Paper]](https://arxiv.org/pdf/2203.09301.pdf)

- (arXiv 2022.03) PanoFormer: Panorama Transformer for Indoor 360° **Depth Estimation**, [[Paper]](https://arxiv.org/pdf/2203.09283.pdf)

- (arXiv 2022.03) PreTR: Spatio-Temporal Non-Autoregressive **Trajectory Prediction** Transformer, [[Paper]](https://arxiv.org/pdf/2203.09293.pdf)

- (arXiv 2022.03) Look Outside the Room: **Synthesizing** A Consistent Long-Term **3D Scene Video** from A Single Image, [[Paper]](https://arxiv.org/pdf/2203.09457.pdf), [[Code]](https://xrenaa.github.io/look-outside-room/)

- (arXiv 2022.03) Transframer: Arbitrary **Frame Prediction** with Generative Models, [[Paper]](https://arxiv.org/pdf/2203.09494.pdf)

- (arXiv 2022.03) Towards Data-**Efficient** Detection Transformers, [[Paper]](https://arxiv.org/pdf/2203.09507.pdf), [[Code]](https://github.com/encounter1997/DE-DETRs)

- (arXiv 2022.03) Bi-directional Object-Context Prioritization Learning for **Saliency** Ranking, [[Paper]](https://arxiv.org/pdf/2203.09416.pdf), [[Code]](https://github.com/GrassBro/OCOR)

- (arXiv 2022.03) PATCH-FOOL: ARE VISION TRANSFORMERS ALWAYS ROBUST AGAINST **ADVERSARIAL** PERTURBATIONS? [[Paper]](https://arxiv.org/pdf/2203.08392.pdf), [[Code]](https://github.com/RICE-EIC/Patch-Fool)

- (arXiv 2022.03) WegFormer: Transformers for Weakly Supervised **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2203.08421.pdf)

- (arXiv 2022.03) **Open Set Recognition** using Vision Transformer with an Additional Detection Head, [[Paper]](https://arxiv.org/pdf/2203.08441.pdf), [[Code]](https://github.com/feiyang-cai/osr_vit.git)

- (arXiv 2022.03) UNIFIED VISUAL TRANSFORMER **COMPRESSION**, [[Paper]](https://arxiv.org/pdf/2203.08243.pdf), [[Code]](https://github.com/VITA-Group/UVC)

- (arXiv 2022.03) Towards Practical **Certifiable Patch Defense** with Vision Transformer, [[Paper]](https://arxiv.org/pdf/2203.08519.pdf)

- (arXiv 2022.03) EDTER: **Edge Detection** with Transformer, [[Paper]](https://arxiv.org/pdf/2203.08566.pdf), [[Code]](https://github.com/MengyangPu/EDTER)

- (arXiv 2022.03) ActFormer: A GAN Transformer Framework towards General Action-Conditioned **3D Human Motion Generation**, [[Paper]](https://arxiv.org/pdf/2203.07706.pdf)

- (arXiv 2022.03) Rich CNN-Transformer Feature Aggregation Networks for **Super-Resolution**, [[Paper]](https://arxiv.org/pdf/2203.07682.pdf)

- (arXiv 2022.03) Revitalize Region Feature for Democratizing **Video-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2203.07720.pdf), [[Code]](https://github.com/CuthbertCai/DemoVLP)

- (arXiv 2022.03) Inverted Pyramid Multi-task Transformer for **Dense Scene Understanding**, [[Paper]](https://arxiv.org/pdf/2203.07997.pdf)

- (arXiv 2022.03) Smoothing Matters: Momentum Transformer for Domain Adaptive **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2203.07988.pdf), [[Code]](https://github.com/alpc91/TransDA)

- (arXiv 2022.03) Style Transformer for **Image Inversion** and **Editing**, [[Paper]](https://arxiv.org/pdf/2203.07932.pdf), [[Code]](https://github.com/sapphire497/style-transformer)

- (arXiv 2022.03) MotionCLIP: Exposing Human **Motion Generation** to **CLIP** Space, [[Paper]](https://arxiv.org/pdf/2203.08063.pdf), [[Project]](https://guytevet.github.io/motionclip-page/)

- (arXiv 2022.03) The Principle of **Diversity**: Training Stronger Vision Transformers Calls for Reducing All Levels of **Redundancy**, [[Paper]](https://arxiv.org/pdf/2203.06345.pdf), [[Code]](https://github.com/VITA-Group/Diverse-ViT)

- (arXiv 2022.03) Enabling **Multimodal Generation** on CLIP via Vision-Language Knowledge Distillation, [[Paper]](https://arxiv.org/pdf/2203.06386.pdf)

- (arXiv 2022.03) Sparse Local Patch Transformer for Robust **Face Alignment** and **Landmarks Inherent Relation** Learning, [[Paper]](https://arxiv.org/pdf/2203.06541.pdf), [[Code]](https://github.com/Jiahao-UTS/SLPT-master)

- (arXiv 2022.03) Joint CNN and Transformer Network via weakly supervised Learning for efficient **crowd counting**, [[Paper]](https://arxiv.org/pdf/2203.06388.pdf)

- (arXiv 2022.03) DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for **Salient Object Detection**, [[Paper]](https://arxiv.org/pdf/2203.06429.pdf)

- (arXiv 2022.03) DATR: Domain-adaptive transformer for **multi-domain landmark detection**, [[Paper]](https://arxiv.org/pdf/2203.06433.pdf)

- (arXiv 2022.03) EventFormer: AU Event Transformer for **Facial Action** Unit Event Detection, [[Paper]](https://arxiv.org/pdf/2203.06355.pdf)

- (arXiv 2022.03) Accelerating **DETR** **Convergence** via Semantic-Aligned Matching, [[Paper]](https://arxiv.org/pdf/2203.06883.pdf), [[Code]](https://github.com/ZhangGongjie/SAM-DETR)
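
  DETR-style detectors assign each ground-truth box to exactly one query via bipartite (Hungarian) matching before computing the loss. The sketch below shows the generic matching step with a classification + L1 cost using `scipy.optimize.linear_sum_assignment`; the cost weights and shapes are assumptions, and the paper's semantic-aligned variant is not reproduced here.

  ```python
  import torch
  from scipy.optimize import linear_sum_assignment

  def hungarian_match(pred_logits: torch.Tensor,   # (Q, num_classes) class logits
                      pred_boxes: torch.Tensor,    # (Q, 4) normalized boxes
                      gt_labels: torch.Tensor,     # (G,) class ids
                      gt_boxes: torch.Tensor,      # (G, 4)
                      w_cls: float = 1.0,
                      w_l1: float = 5.0):
      """Return (query_idx, gt_idx) giving a one-to-one assignment, generic DETR style."""
      prob = pred_logits.softmax(-1)                    # (Q, C)
      cost_cls = -prob[:, gt_labels]                    # (Q, G): -p(ground-truth class)
      cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)  # (Q, G): L1 box distance
      cost = w_cls * cost_cls + w_l1 * cost_l1          # (a GIoU term is usually added too)

      q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
      return torch.as_tensor(q_idx), torch.as_tensor(g_idx)

  q, g = hungarian_match(torch.randn(100, 92), torch.rand(100, 4),
                         torch.tensor([3, 17]), torch.rand(2, 4))
  ```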

- (arXiv 2022.03) All in One: Exploring Unified **Video-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2203.07303.pdf), [[Code]](https://github.com/showlab/all-in-one)

- (arXiv 2022.03) CLIP Models are **Few-shot** Learners: Empirical Studies on VQA and Visual Entailment, [[Paper]](https://arxiv.org/pdf/2203.07190.pdf)
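
  The entry above studies CLIP as a few-/zero-shot learner. For reference, a minimal zero-shot classification example with the public openai/CLIP package is sketched below; the image path and label prompts are placeholders, and this is not the paper's VQA or visual-entailment setup.

  ```python
  import torch
  import clip                      # pip install git+https://github.com/openai/CLIP.git
  from PIL import Image

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model, preprocess = clip.load("ViT-B/32", device=device)

  # placeholder image and label set
  image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
  classes = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
  text = clip.tokenize(classes).to(device)

  with torch.no_grad():
      logits_per_image, _ = model(image, text)          # (1, num_classes)
      probs = logits_per_image.softmax(dim=-1)

  print({c: float(p) for c, p in zip(classes, probs[0])})
  ```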

- (arXiv 2022.03) EIT: **Efficiently** Lead Inductive Biases to ViT, [[Paper]](https://arxiv.org/pdf/2203.07116.pdf), [[Code]](https://github.com/MrHaiPi/EIT)

- (arXiv 2022.03) Self-Promoted Supervision for **Few-Shot** Transformer, [[Paper]](https://arxiv.org/pdf/2203.07057.pdf), [[Code]](https://github.com/DongSky/few-shot-vit)

- (arXiv 2022.03) MDMMT-2: Multidomain Multimodal Transformer for **Video Retrieval**, One More Step Towards Generalization, [[Paper]](https://arxiv.org/pdf/2203.07086.pdf)

- (arXiv 2022.03) Disentangled Representation Learning for **Text-Video** Retrieval, [[Paper]](https://arxiv.org/pdf/2203.07111.pdf)

- (arXiv 2022.03) TransCAM: Transformer Attention-based CAM Refinement for **Weakly Supervised Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2203.07239.pdf), [[Code]](https://github.com/liruiwen/TransCAM)

- (arXiv 2022.03) Synopses of Movie Narratives: a **Video-Language Dataset** for Story Understanding, [[Paper]](https://arxiv.org/pdf/2203.05711.pdf), [[Dataset]](https://github.com/insundaycathy/SYMON)

- (arXiv 2022.03) Visualizing and Understanding **Patch Interactions** in Vision Transformer, [[Paper]](https://arxiv.org/pdf/2203.05922.pdf)

- (arXiv 2022.03) **ANTI-OVERSMOOTHING** IN DEEP VISION TRANSFORMERS VIA THE FOURIER DOMAIN ANALYSIS: FROM THEORY TO PRACTICE, [[Paper]](https://arxiv.org/pdf/2203.05962.pdf), [[Code]](https://github.com/VITA-Group/ViT-Anti-Oversmoothing)

- (arXiv 2022.03) Democratizing Contrastive **Language-Image** Pre-training: A CLIP **Benchmark** of Data, Model, and Supervision, [[Paper]](https://arxiv.org/pdf/2203.05796.pdf), [[Code]](https://github.com/Sense-GVT/DeCLIP)

- (arXiv 2022.03) ActiveMLP: An **MLP**-like Architecture with Active Token Mixer, [[Paper]](https://arxiv.org/pdf/2203.06108.pdf), [[Code]](https://github.com/microsoft/ActiveMLP)
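
  ActiveMLP belongs to the MLP-style family in which spatial "token mixing" replaces self-attention. The block below is a generic Mixer-style sketch of that idea in PyTorch; the dimensions are assumptions and ActiveMLP's active token mixer itself is more involved.

  ```python
  import torch
  import torch.nn as nn

  class TokenMixingBlock(nn.Module):
      """Generic Mixer-style block: an MLP across tokens, then an MLP across channels."""

      def __init__(self, num_tokens: int, dim: int,
                   hidden_tokens: int = 384, hidden_dim: int = 3072):
          super().__init__()
          self.norm1 = nn.LayerNorm(dim)
          self.token_mlp = nn.Sequential(
              nn.Linear(num_tokens, hidden_tokens), nn.GELU(), nn.Linear(hidden_tokens, num_tokens))
          self.norm2 = nn.LayerNorm(dim)
          self.channel_mlp = nn.Sequential(
              nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))

      def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, D)
          # token mixing: transpose so the MLP acts over the N token positions
          x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
          # channel mixing: standard per-token MLP
          x = x + self.channel_mlp(self.norm2(x))
          return x

  y = TokenMixingBlock(num_tokens=196, dim=768)(torch.randn(2, 196, 768))
  ```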

- (arXiv 2022.03) **Zero-Shot Action Recognition** with Transformer-based Video Semantic Embedding, [[Paper]](https://arxiv.org/pdf/2203.05156.pdf)

- (arXiv 2022.03) TrueType Transformer: **Character and Font Style Recognition** in Outline Format, [[Paper]](https://arxiv.org/pdf/2203.05338.pdf)

- (arXiv 2022.03) LOOPITR: Combining Dual and Cross Encoder Architectures for **Image-Text** Retrieval, [[Paper]](https://arxiv.org/pdf/2203.05465.pdf)

- (arXiv 2022.03) MVP: **Multimodality**-guided Visual Pre-training, [[Paper]](https://arxiv.org/pdf/2203.05175.pdf)

- (arXiv 2022.03) DEER: Detection-agnostic End-to-End Recognizer for **Scene Text Spotting**, [[Paper]](https://arxiv.org/pdf/2203.05122.pdf)

- (arXiv 2022.03) **Multi-Modal** Mixup for **Robust** Fine-tuning, [[Paper]](https://arxiv.org/pdf/2203.03897.pdf)

- (arXiv 2022.03) AssistQ: Affordance-centric Question-driven Task Completion for **Egocentric Assistant**, [[Paper]](https://arxiv.org/pdf/2203.04203.pdf), [[Project]](https://showlab.github.io/assistq/)

- (arXiv 2022.03) **Coarse-to-Fine** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2203.03821.pdf), [[Code]](https://github.com/ChenMnZ/CF-ViT)

- (arXiv 2022.03) Monocular Robot **Navigation** with Self-Supervised Pretrained Vision Transformers, [[Paper]](https://arxiv.org/pdf/2203.03682.pdf)

- (arXiv 2022.03) WAVEMIX: RESOURCE-**EFFICIENT** TOKEN MIXING FOR IMAGES, [[Paper]](https://arxiv.org/pdf/2203.03689.pdf)

- (arXiv 2022.03) VOVIT: LOW LATENCY GRAPH-BASED **AUDIO-VISUAL** VOICE SEPARATION TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2203.04099.pdf), [[Code]](https://ipcv.github.io/VoViT/)

- (arXiv 2022.03) Graph Attention Transformer Network for **Multi-Label** Image **Classification**, [[Paper]](https://arxiv.org/pdf/2203.04049.pdf)

- (arXiv 2022.03) EDGEFORMER: IMPROVING **LIGHT-WEIGHT CONVNETS** BY LEARNING FROM VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2203.03952.pdf), [[Code]](https://github.com/hkzhang91/EdgeFormer)

- (arXiv 2022.03) Skating-Mixer: Multimodal **MLP** for **Scoring Figure Skating**, [[Paper]](https://arxiv.org/pdf/2203.03990.pdf)

- (arXiv 2022.03) Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group **Attention**, [[Paper]](https://arxiv.org/pdf/2203.03937.pdf)

- (arXiv 2022.03) CP-ViT: Cascade Vision Transformer **Pruning** via Progressive Sparsity Prediction, [[Paper]](https://arxiv.org/pdf/2203.04570.pdf)

- (arXiv 2022.03) Model-Agnostic Multitask Fine-tuning for Few-shot **Vision-Language** **Transfer Learning**, [[Paper]](https://arxiv.org/pdf/2203.04904.pdf)

- (arXiv 2022.03) ChiTransformer: Towards Reliable **Stereo** from Cues, [[Paper]](https://arxiv.org/pdf/2203.04554.pdf)

- (arXiv 2022.03) A Unified Transformer Framework for Group-based Segmentation: **Co-Segmentation**, **Co-Saliency Detection** and **Video Salient Object Detection**, [[Paper]](https://arxiv.org/pdf/2203.04708.pdf), [[Code]](https://github.com/suyukun666/UFO)

- (arXiv 2022.03) Coarse-to-Fine Sparse Transformer for **Hyperspectral Image Reconstruction**, [[Paper]](https://arxiv.org/pdf/2203.04845.pdf)

- (arXiv 2022.03) CMX: Cross-Modal Fusion for **RGB-X Semantic Segmentation** with Transformers, [[Paper]](https://arxiv.org/pdf/2203.04838.pdf), [[Code]](https://github.com/huaaaliu/RGBX_Semantic_Segmentation)

- (arXiv 2022.03) Multiscale Transformer for **Hyperspectral Image Classification**, [[Paper]](https://arxiv.org/pdf/2203.04771.pdf)

- (arXiv 2022.03) Mind the Gap: Understanding the Modality Gap in **Multi-modal Contrastive Representation** Learning, [[Paper]](https://arxiv.org/pdf/2203.02053.pdf), [[Code]](https://modalitygap.readthedocs.io/)

- (arXiv 2022.03) Autoregressive **Image Generation** using Residual Quantization, [[Paper]](https://arxiv.org/pdf/2203.01941.pdf)

- (arXiv 2022.03) CONTEXTFORMER: A TRANSFORMER WITH SPATIO-CHANNEL ATTENTION FOR CONTEXT MODELING IN LEARNED **IMAGE COMPRESSION**, [[Paper]](https://arxiv.org/pdf/2203.02452.pdf)

- (arXiv 2022.03) Patch Similarity Aware Data-Free **Quantization** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2203.02250.pdf)

- (arXiv 2022.03) ViT-P: Rethinking Data-**efficient** Vision Transformers from Locality, [[Paper]](https://arxiv.org/pdf/2203.02358.pdf)

- (arXiv 2022.03) DIT: SELF-SUPERVISED PRE-TRAINING FOR **DOCUMENT IMAGE** TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2203.02378.pdf)

- (arXiv 2022.03) Towards **Efficient** and **Scalable** Sharpness-Aware Minimization, [[Paper]](https://arxiv.org/pdf/2203.02714.pdf)

- (arXiv 2022.03) HyperTransformer: A Textural and Spectral Feature Fusion Transformer for **Pansharpening**, [[Paper]](https://arxiv.org/pdf/2203.02503.pdf), [[Code]](https://github.com/wgcban/HyperTransformer)

- (arXiv 2022.03) UVCGAN: UNET VISION TRANSFORMER CYCLE-CONSISTENT GAN FOR **UNPAIRED IMAGE-TO-IMAGE TRANSLATION**, [[Paper]](https://arxiv.org/pdf/2203.02557.pdf), [[Code]](https://github.com/LS4GAN/uvcgan)

- (arXiv 2022.03) Show Me What and Tell Me How: **Video Synthesis** via Multimodal Conditioning, [[Paper]](https://arxiv.org/pdf/2203.02573.pdf), [[Code]](https://github.com/snap-research/MMVID)

- (arXiv 2022.03) PANFORMER: A TRANSFORMER BASED MODEL FOR **PAN-SHARPENING**, [[Paper]](https://arxiv.org/pdf/2203.02916.pdf), [[Code]](https://github.com/zhysora/PanFormer)

- (arXiv 2022.03) Multi-class Token Transformer for **Weakly Supervised Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2203.02891.pdf), [[Code]](https://github.com/xulianuwa/MCTformer)

- (arXiv 2022.03) Cross Language Image Matching for **Weakly Supervised Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2203.02668.pdf)

- (arXiv 2022.03) Learning Affinity from Attention: End-to-End **Weakly-Supervised Semantic Segmentation** with Transformers, [[Paper]](https://arxiv.org/pdf/2203.02664.pdf), [[Code]](https://github.com/rulixiang/afa)

- (arXiv 2022.03) DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object **Detection**, [[Paper]](https://arxiv.org/pdf/2203.03605.pdf), [[Code]](https://github.com/IDEACVR/DINO)

- (arXiv 2022.03) MetaFormer: A Unified Meta Framework for **Fine-Grained Recognition**, [[Paper]](https://arxiv.org/pdf/2203.02751.pdf), [[Code]](https://github.com/dqshuai/MetaFormer)

- (arXiv 2022.03) **Audio-visual** Generalised Zero-shot Learning with Cross-modal Attention and Language, [[Paper]](https://arxiv.org/pdf/2203.03598.pdf)

- (arXiv 2022.03) Knowledge Amalgamation for Object **Detection** with Transformers, [[Paper]](https://arxiv.org/pdf/2203.03187.pdf)

- (arXiv 2022.03) Learnable Irrelevant Modality Dropout for **Multimodal Action Recognition** on Modality-Specific Annotated Videos, [[Paper]](https://arxiv.org/pdf/2203.03014.pdf)

- (arXiv 2022.03) Modeling Coreference Relations in **Visual Dialog**, [[Paper]](https://arxiv.org/pdf/2203.02986.pdf), [[Code]](https://github.com/Mingxiao-Li/Modeling-Coreference-Relations-in-Visual-Dialog)

- (arXiv 2022.03) VITRANSPAD: VIDEO TRANSFORMER USING CONVOLUTION AND SELF-ATTENTION FOR **FACE PRESENTATION ATTACK DETECTION**, [[Paper]](https://arxiv.org/pdf/2203.01562.pdf)

- (arXiv 2022.03) Multi-Tailed Vision Transformer for **Efficient Inference**, [[Paper]](https://arxiv.org/pdf/2203.01587.pdf)

- (arXiv 2022.03) Bending Reality: Distortion-aware Transformers for Adapting to Panoramic **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2203.01452.pdf), [[Code]](https://github.com/jamycheung/Trans4PASS)

- (arXiv 2022.03) Ensembles of Vision Transformers as a New Paradigm for Automated Classification in **Ecology**, [[Paper]](https://arxiv.org/pdf/2203.01726.pdf)

- (arXiv 2022.03) LGT-Net: Indoor Panoramic Room **Layout Estimation** with Geometry-Aware Transformer Network, [[Paper]](https://arxiv.org/pdf/2203.01824.pdf), [[Code]](https://github.com/zhigangjiang/LGT-Net)

- (arXiv 2022.03) LatentFormer: Multi-Agent Transformer-Based **Interaction Modeling** and **Trajectory Prediction**, [[Paper]](https://arxiv.org/pdf/2203.01880.pdf)

- (arXiv 2022.03) DCT-Former: **Efficient** Self-Attention with Discrete Cosine Transform, [[Paper]](https://arxiv.org/pdf/2203.01178.pdf), [[Code]](https://github.com/cscribano/DCT-Former-Public)

- (arXiv 2022.03) Unsupervised **Vision-and-Language** Pre-training via Retrieval-based Multi-Granular Alignment, [[Paper]](https://arxiv.org/pdf/2203.00242.pdf)

- (arXiv 2022.03) Spatiotemporal Transformer Attention Network for 3D Voxel Level Joint Segmentation and Motion Prediction in **Point Cloud**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2203/2203.00138.pdf)

- (arXiv 2022.03) **CLIP**-GEN: Language-Free Training of a **Text-to-Image** Generator with CLIP, [[Paper]]()

- (arXiv 2022.03) MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for **3D** Human **Pose** Estimation in Video, [[Paper]](https://arxiv.org/pdf/2203.00859.pdf)

- (arXiv 2022.03) X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for **3D Dense Captioning**, [[Paper]](https://arxiv.org/pdf/2203.00843.pdf)

- (arXiv 2022.03) 3DCTN: 3D Convolution-Transformer Network for **Point Cloud** Classification, [[Paper]](https://arxiv.org/pdf/2203.00828.pdf)

- (arXiv 2022.03) DeciWatch: A Simple Baseline for 10× **Efficient** 2D and 3D **Pose** Estimation, [[Paper]](https://arxiv.org/pdf/2203.08713.pdf)

- (arXiv 2022.03) D^2ETR: **Decoder-Only DETR** with Computationally Efficient Cross-Scale Attention, [[Paper]](https://arxiv.org/pdf/2203.00860.pdf)

- (arXiv 2022.03) Incremental Transformer Structure Enhanced Image **Inpainting** with Masking Positional Encoding, [[Paper]](https://arxiv.org/pdf/2203.00867.pdf), [[Code]](https://github.com/DQiaole/ZITS_inpainting)

- (arXiv 2022.03) Self-supervised Transformer for **Deepfake Detection**, [[Paper]](https://arxiv.org/pdf/2203.01265.pdf)

- (arXiv 2022.03) Aggregated **Pyramid** Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2203/2203.00960.pdf)

- (arXiv 2022.03) TransDARC: Transformer-based **Driver Activity Recognition** with Latent Space Feature Calibration, [[Paper]](https://arxiv.org/pdf/2203.00927.pdf), [[Code]](https://github.com/KPeng9510/TransDARC)

- (arXiv 2022.03) DN-DETR: **Accelerate** DETR **Training** by Introducing Query DeNoising, [[Paper]](https://arxiv.org/pdf/2203.01305.pdf), [[Code]](https://github.com/FengLi-ust/DN-DETR)

- (arXiv 2022.03) **Protecting Celebrities** with Identity Consistency Transformer, [[Paper]](https://arxiv.org/pdf/2203.01318.pdf)

- (arXiv 2022.03) Masked Visual Pre-training for **Motor Control**, [[Paper]](https://arxiv.org/pdf/2203.06173.pdf), [[Project]](https://tetexiao.com/projects/mvp)

- (arXiv 2022.03) NLX-GPT: A Model for Natural Language Explanations in Vision and **Vision-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2203.05081.pdf), [[Code]](https://github.com/fawazsammani/nlxgpt)

- (arXiv 2022.03) Conditional Prompt Learning for Vision-Language Models, [[Paper]](https://arxiv.org/pdf/2203.05557.pdf), [[Code]](https://github.com/KaiyangZhou/CoOp)

- (arXiv 2022.03) **Lane Detection** with Versatile AtrousFormer and Local Semantic Guidance, [[Paper]](https://arxiv.org/pdf/2203.04067.pdf)

- (arXiv 2022.03) **Forecasting** Characteristic **3D Poses** of Human Actions, [[Paper]](https://arxiv.org/pdf/2011.15079.pdf), [[Code]](https://github.com/chrdiller/characteristic3dposes)
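
Several of the 2022.03 entries above (e.g. the CLIP few-shot study and other CLIP-based works) build directly on off-the-shelf CLIP embeddings. As a reference point, below is a minimal sketch of zero-shot candidate scoring with the public openai/CLIP package; the prompt template, image path, and candidate labels are illustrative assumptions, not the protocol of any specific paper listed here.

```python
# Minimal zero-shot scoring with off-the-shelf CLIP (openai/CLIP package).
# The image path, prompt template, and candidate labels are placeholders,
# not the setup of any particular paper above.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
candidates = ["a dog", "a cat", "a bird"]  # hypothetical candidate set
texts = clip.tokenize([f"a photo of {c}" for c in candidates]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)
    # Cosine similarity between the image and each candidate prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print({c: float(p) for c, p in zip(candidates, probs[0])})
```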

### 2022.02

- (arXiv 2022.02) **Bayesian Structure Learning** with Generative Flow Networks, [[Paper]](https://arxiv.org/pdf/2202.13903.pdf)

- (arXiv 2022.02) Towards **Unsupervised Domain Adaptation** via Domain-Transformer, [[Paper]](https://arxiv.org/pdf/2202.13777.pdf)

- (arXiv 2022.02) An End-to-End Transformer Model for **Crowd Localization**, [[Paper]](https://arxiv.org/pdf/2202.13065.pdf)

- (arXiv 2022.02) Instantaneous **Physiological Estimation** using Video Transformers, [[Paper]](https://arxiv.org/pdf/2202.12368.pdf), [[Code]](https://github.com/revanurambareesh/instantaneous_transformer)

- (arXiv 2022.02) Style**CLIP**Draw: Coupling Content and Style in **Text-to-Drawing** Translation, [[Paper]](https://arxiv.org/pdf/2202.12362.pdf), [[Code]](https://github.com/pschaldenbrand/StyleCLIPDraw)

- (arXiv 2022.02) ATTENTION ENABLES ZERO **APPROXIMATION** ERROR, [[Paper]](https://arxiv.org/pdf/2202.12166.pdf)

- (arXiv 2022.02) When Transformer Meets **Robotic Grasping**: Exploits Context for Efficient Grasp Detection, [[Paper]](https://arxiv.org/pdf/2202.11911.pdf), [[Code]](https://github.com/WangShaoSUN/grasp-transformer)

- (arXiv 2022.02) **AUTO-SCALING** VISION TRANSFORMERS WITHOUT TRAINING, [[Paper]](https://arxiv.org/pdf/2202.11921.pdf), [[Code]](https://github.com/VITA-Group/AsViT)

- (arXiv 2022.02) Think Global, Act Local: Dual-scale Graph Transformer for **Vision-and-Language Navigation**, [[Paper]](https://arxiv.org/pdf/2202.11742.pdf), [[Project]](https://cshizhe.github.io/projects/vln_duet.html)

- (arXiv 2022.02) LEARNING TO **MERGE TOKENS** IN VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2202.12015.pdf)

- (arXiv 2022.02) ProFormer: Learning **Data-efficient** Representations of **Body Movement** with Prototype-based Feature Augmentation and Visual Transformers, [[Paper]](https://arxiv.org/pdf/2202.11423.pdf), [[Code]](https://github.com/KPeng9510/ProFormer)

- (arXiv 2022.02) SELF-SUPERVISED TRANSFORMERS FOR **UNSUPERVISED OBJECT DISCOVERY** USING NORMALIZED CUT, [[Paper]](https://arxiv.org/pdf/2202.11539.pdf), [[Project]](https://www.m-psi.fr/Papers/TokenCut2022/)

- (arXiv 2022.02) Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal **Texture Synthesis**, [[Paper]](https://arxiv.org/pdf/2202.11703.pdf)

- (arXiv 2022.02) CaMEL: Mean Teacher Learning for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2202.10492.pdf)

- (arXiv 2022.02) **Hierarchical** Perceiver, [[Paper]](https://arxiv.org/pdf/2202.10890.pdf)

- (arXiv 2022.02) **Movies2Scenes**: Learning Scene Representations Using Movie Similarities, [[Paper]](https://arxiv.org/pdf/2202.10650.pdf)

- (arXiv 2022.02) GroupViT: **Semantic Segmentation** Emerges from Text Supervision, [[Paper]](https://arxiv.org/pdf/2202.11094.pdf)

- (arXiv 2022.02) Snowflake Point Deconvolution for **Point Cloud** Completion and Generation with Skip-Transformer, [[Paper]](https://arxiv.org/pdf/2202.09367.pdf), [[Code]](https://github.com/AllenXiangX/SnowflakeNet)

- (arXiv 2022.02) Audio Visual Scene-Aware **Dialog Generation** with Transformer-based Video Representations, [[Paper]](https://arxiv.org/pdf/2202.09979.pdf)

- (arXiv 2022.02) ViTAEv2: Vision Transformer Advanced by Exploring **Inductive Bias** for Image Recognition and Beyond, [[Paper]](https://arxiv.org/pdf/2202.10108.pdf)

- (arXiv 2022.02) PMP-Net++: **Point Cloud Completion** by Transformer-Enhanced Multi-step Point Moving Paths, [[Paper]](https://arxiv.org/pdf/2202.09507.pdf), [[Code]](https://github.com/diviswen/PMP-Net)

- (arXiv 2022.02) DataMUX: **Data Multiplexing** for Neural Networks, [[Paper]](https://arxiv.org/pdf/2202.09318.pdf), [[Code]](https://github.com/princeton-nlp/DataMUX)

- (arXiv 2022.02) On Guiding Visual **Attention** with **Language** Specification, [[Paper]](https://arxiv.org/pdf/2202.08926.pdf)

- (arXiv 2022.02) SPATIO-TEMPORAL OUTDOOR **LIGHTING AGGREGATION** ON IMAGE SEQUENCES USING TRANSFORMER NETWORKS, [[Paper]](https://arxiv.org/pdf/2202.09206.pdf)

- (arXiv 2022.02) **MISINFORMATION DETECTION** IN SOCIAL MEDIA **VIDEO** POSTS, [[Paper]](https://arxiv.org/pdf/2202.07706.pdf)

- (arXiv 2022.02) Can Deep Learning be Applied to Model-Based **Multi-Object Tracking**? [[Paper]](https://arxiv.org/pdf/2202.07909.pdf)

- (arXiv 2022.02) NOT ALL PATCHES ARE WHAT YOU NEED: EXPEDITING VISION TRANSFORMERS VIA **TOKEN REORGANIZATIONS**, [[Paper]](https://arxiv.org/pdf/2202.07800.pdf), [[Code]](https://github.com/youweiliang/evit) (see the token-pruning sketch at the end of this section)

- (arXiv 2022.02) ActionFormer: **Localizing** Moments of **Actions** with Transformers, [[Paper]](https://arxiv.org/pdf/2202.07925.pdf), [[Code]](https://github.com/happyharrycn/actionformer_release)

- (arXiv 2022.02) One Step at a Time: Long-Horizon **Vision-and-Language Navigation** with Milestones, [[Paper]](https://arxiv.org/pdf/2202.07028.pdf)

- (arXiv 2022.02) XAI for Transformers: Better **Explanations** through Conservative Propagation, [[Paper]](https://arxiv.org/pdf/2202.07304.pdf)

- (arXiv 2022.02) MeshLeTemp: Leveraging the Learnable Vertex-Vertex Relationship to Generalize Human **Pose** and **Mesh Reconstruction** for In-the-Wild Scenes, [[Paper]](https://arxiv.org/pdf/2202.07228.pdf)

- (arXiv 2022.02) ViNTER: **Image Narrative Generation** with Emotion-Arc-Aware Transformer, [[Paper]](https://arxiv.org/pdf/2202.07305.pdf)

- (arXiv 2022.02) Hyper-relationship Learning Network for **Scene Graph** Generation, [[Paper]](https://arxiv.org/pdf/2202.07271.pdf)

- (arXiv 2022.02) CommerceMM: Large-Scale Commerce **MultiModal Representation** Learning with Omni Retrieval, [[Paper]](https://arxiv.org/pdf/2202.07247.pdf)

- (arXiv 2022.02) Flowformer: **Linearizing** Transformers with Conservation Flows, [[Paper]](https://arxiv.org/pdf/2202.06258.pdf)

- (arXiv 2022.02) DialFRED: Dialogue-Enabled Agents for **Embodied** Instruction Following, [[Paper]](https://arxiv.org/pdf/2202.13330.pdf), [[Code]](https://github.com/anonrabit/DialFRED)

- (arXiv 2022.02) CATs++: Boosting **Cost Aggregation** with Convolutions and Transformers, [[Paper]](https://arxiv.org/pdf/2202.06817.pdf)

- (arXiv 2022.02) Geometric Transformer for Fast and Robust **Point Cloud Registration**, [[Paper]](https://arxiv.org/pdf/2202.06688.pdf)

- (arXiv 2022.02) I-Tuning: Tuning Language Models with Image for **Caption** Generation, [[Paper]]()

- (arXiv 2022.02) Multi-direction and Multi-scale Pyramid in Transformer for Video-based **Pedestrian Retrieval**, [[Paper]](https://arxiv.org/pdf/2202.06014.pdf), [[Code]](https://git.openi.org.cn/zangxh/PiT.git)

- (arXiv 2022.02) **Visual Acoustic** Matching, [[Paper]](https://arxiv.org/pdf/2202.06875.pdf)

- (arXiv 2022.02) LighTN: **Light**-weight Transformer Network for Performance-overhead Tradeoff in **Point Cloud Downsampling**, [[Paper]](https://arxiv.org/pdf/2202.06263.pdf)

- (arXiv 2022.02) BViT: Broad **Attention** based Vision Transformer, [[Paper]](https://arxiv.org/pdf/2202.06268.pdf), [[Code]](https://github.com/DRL-CASIA/Dense_ViT)

- (arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for **Few-Shot Segmentation**, [[Paper]](https://arxiv.org/pdf/2202.06498.pdf)

- (arXiv 2022.02) Domain Adaptation via **Prompt** Learning, [[Paper]](https://arxiv.org/pdf/2202.06687.pdf)

- (arXiv 2022.02) Mixing and Shifting: Exploiting Global and Local Dependencies in **Vision MLPs**, [[Paper]](https://arxiv.org/pdf/2202.06510.pdf), [[Code]](https://github.com/JegZheng/MS-MLP)

- (arXiv 2022.02) Wukong: 100 Million Large-scale Chinese **Cross-modal Pre-training** Dataset and A Foundation Framework, [[Paper]](https://arxiv.org/pdf/2202.06767.pdf), [[Project]](https://wukong-dataset.github.io/wukong-dataset/)

- (arXiv 2022.02) HOW DO VISION TRANSFORMERS WORK? [[Paper]](https://arxiv.org/pdf/2202.06709.pdf), [[Code]](https://github.com/xxxnell/how-do-vits-work)

- (arXiv 2022.02) ACORT: A Compact Object Relation Transformer for Parameter Efficient Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2202.05451.pdf), [[Code]](https://github.com/jiahuei/sparse-image-captioning)

- (arXiv 2022.02) **CLIP**asso: Semantically-Aware **Object Sketching**, [[Paper]](https://arxiv.org/pdf/2202.05822.pdf), [[Code]](https://clipasso.github.io/clipasso/)

- (arXiv 2022.02) Towards Weakly-Supervised **Text Spotting** using a Multi-Task Transformer, [[Paper]](https://arxiv.org/pdf/2202.05508.pdf)

- (arXiv 2022.02) DEEP **SOCCER CAPTIONING** WITH TRANSFORMER: DATASET, SEMANTICS-RELATED LOSSES, AND MULTI-LEVEL EVALUATION, [[Paper]](https://arxiv.org/pdf/2202.05728.pdf), [[Project]](https://sites.google.com/view/soccercaptioning)

- (arXiv 2022.02) ENTROFORMER: A TRANSFORMER-BASED ENTROPY MODEL FOR LEARNED **IMAGE COMPRESSION**, [[Paper]](https://arxiv.org/pdf/2202.05492.pdf), [[Code]](https://github.com/mx54039q/entroformer)

- (arXiv 2022.02) Image Difference Captioning with Pre-training and Contrastive Learning, [[Paper]](https://arxiv.org/pdf/2202.04298.pdf), [[Code]](https://github.com/yaolinli/IDC)

- (arXiv 2022.02) MaskGIT: Masked **Generative** **Image** Transformer, [[Paper]](https://arxiv.org/pdf/2202.04200.pdf)

- (arXiv 2022.02) Distillation with Contrast is All You Need for **Self-Supervised** **Point Cloud** Representation Learning, [[Paper]](https://arxiv.org/pdf/2202.04241.pdf)

- (arXiv 2022.02) Motion-Aware Transformer For **Occluded Person Re-identification**, [[Paper]](https://arxiv.org/pdf/2202.04243.pdf)

- (arXiv 2022.02) Conditional **Motion In-betweening**, [[Paper]](https://arxiv.org/pdf/2202.04307.pdf)

- (arXiv 2022.02) Memory-based **gaze prediction** in deep imitation learning for **robot manipulation**, [[Paper]](https://arxiv.org/pdf/2202.04877.pdf)

- (arXiv 2022.02) **Spherical** Transformer, [[Paper]](https://arxiv.org/pdf/2202.04942.pdf)

- (arXiv 2022.02) OWL (Observe, Watch, Listen): **Localizing Actions** in **Egocentric Video** via Audiovisual Temporal Context, [[Paper]](https://arxiv.org/pdf/2202.04947.pdf)

- (arXiv 2022.02) The Abduction of Sherlock Holmes: A **Dataset** for **Visual Abductive Reasoning**, [[Paper]](https://arxiv.org/pdf/2202.04800.pdf), [[Project]](http://www.visualabduction.com/)

- (arXiv 2022.02) DALL-EVAL: Probing the Reasoning Skills and Social Biases of **Text-to-Image** Generative Transformers, [[Paper]](https://arxiv.org/pdf/2202.04053.pdf), [[Code]](https://github.com/j-min/DallEval)

- (arXiv 2022.02) Pre-Trained Language Models for **Interactive Decision-Making**, [[Paper]](https://arxiv.org/pdf/2202.01771.pdf)

- (arXiv 2022.02) TransFollower: Long-Sequence Car-Following **Trajectory Prediction** through Transformer, [[Paper]](https://arxiv.org/pdf/2202.03183.pdf)

- (arXiv 2022.02) The devil is in the labels: **Semantic segmentation** from sentences, [[Paper]](https://arxiv.org/pdf/2202.02002.pdf)

- (arXiv 2022.02) Webly Supervised Concept Expansion for **General Purpose Vision Models**, [[Paper]](https://arxiv.org/pdf/2202.02317.pdf), [[Project]](https://prior.allenai.org/projects/gpv2)

- (arXiv 2022.02) VU-BERT: A UNIFIED FRAMEWORK FOR **VISUAL DIALOG**, [[Paper]](https://arxiv.org/pdf/2202.10787.pdf)

- (arXiv 2022.02) **UNIFYING** ARCHITECTURES, TASKS, AND MODALITIES THROUGH A SIMPLE SEQUENCE-TO-SEQUENCE LEARNING FRAMEWORK, [[Paper]](https://arxiv.org/pdf/2202.03052.pdf), [[Code]](https://github.com/OFA-Sys/OFA)

- (arXiv 2022.02) Transformers in Self-Supervised **Monocular Depth Estimation** with Unknown Camera Intrinsics, [[Paper]](https://arxiv.org/pdf/2202.03131.pdf)

- (arXiv 2022.02) TRANSDREAMER: **REINFORCEMENT LEARNING** WITH TRANSFORMER WORLD MODELS, [[Paper]](https://arxiv.org/pdf/2202.09481.pdf)

- (arXiv 2022.02) **Vision-Language** Pre-Training with Triple Contrastive Learning, [[Paper]](https://arxiv.org/pdf/2202.10401.pdf), [[Code]](https://github.com/uta-smile/TCL)

- (arXiv 2022.02) Corrupted Image Modeling for **Self-Supervised** Visual **Pre-Training**, [[Paper]](https://arxiv.org/pdf/2202.03382.pdf)

- (arXiv 2022.02) BLIP: Bootstrapping Language-Image Pre-training for Unified **Vision-Language** Understanding and Generation, [[Paper]](https://arxiv.org/pdf/2201.12086.pdf), [[Code]](https://github.com/salesforce/BLIP)

- (arXiv 2022.02) DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in **DNN Accelerators**, [[Paper]](https://arxiv.org/pdf/2201.11218.pdf)

- (arXiv 2022.02) Interactron: **Embodied** Adaptive **Object Detection**, [[Paper]](https://arxiv.org/pdf/2202.00660.pdf)

- (arXiv 2022.02) Local Feature Matching with Transformers for low-end devices **LoFTR** method adaptation approach, [[Paper]](https://arxiv.org/pdf/2202.00770.pdf), [[Code]](https://github.com/Kolkir/Coarse_LoFTR_TRT)

- (arXiv 2022.02) Can Transformers be Strong **Treatment Effect Estimators**?, [[Paper]](https://arxiv.org/pdf/2202.01336.pdf)

- (arXiv 2022.02) Improving **Sample Efficiency of Value** Based Models Using Attention and Vision Transformers, [[Paper]](https://arxiv.org/pdf/2202.00710.pdf)

- (arXiv 2022.02) Detecting **Human-Object Interactions** with Object-Guided Cross-Modal Calibrated Semantics, [[Paper]](https://arxiv.org/pdf/2202.00259.pdf), [[Code]](https://github.com/JacobYuan7/OCN-HOI-Benchmark)
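
Several 2022.02 entries above reorganize or merge ViT tokens (e.g. the token-reorganizations and token-merging papers), typically guided by how strongly the [CLS] token attends to each patch token. The snippet below is a rough, generic sketch of that idea, keeping the top-k most attended patch tokens and fusing the rest into a single token; it illustrates the common pattern only, not the exact procedure of any paper listed here.

```python
import torch

def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep: int) -> torch.Tensor:
    """Generic CLS-attention token-pruning sketch (illustrative, not paper-exact).

    tokens:   (B, 1 + N, C) with the [CLS] token first.
    cls_attn: (B, N) attention weights from [CLS] to the N patch tokens
              (e.g. averaged over heads in the preceding attention layer).
    keep:     number of patch tokens to keep.
    Returns   (B, 1 + keep + 1, C): [CLS], kept tokens, one fused token.
    """
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]            # split off [CLS]
    idx = cls_attn.topk(keep, dim=1).indices                   # (B, keep)
    keep_mask = torch.zeros_like(cls_attn, dtype=torch.bool).scatter_(1, idx, True)

    b, n, c = patches.shape
    kept = patches[keep_mask].reshape(b, keep, c)
    dropped = patches[~keep_mask].reshape(b, n - keep, c)
    w = cls_attn[~keep_mask].reshape(b, n - keep, 1)
    # Fuse the discarded tokens into one token, weighted by their CLS attention.
    fused = (w * dropped).sum(dim=1, keepdim=True) / w.sum(dim=1, keepdim=True).clamp_min(1e-6)

    return torch.cat([cls_tok, kept, fused], dim=1)

# Example: 2 images, 196 patch tokens of width 384, keep the 98 most attended.
x = torch.randn(2, 197, 384)
attn = torch.rand(2, 196).softmax(dim=-1)
print(prune_tokens(x, attn, keep=98).shape)   # torch.Size([2, 100, 384])
```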

### 2022.01

- (arXiv 2022.01) O-ViT: Orthogonal Vision Transformer, [[Paper]](https://arxiv.org/pdf/2201.12133.pdf)

- (arXiv 2022.01) DynaMixer: A Vision **MLP** Architecture with Dynamic Mixing, [[Paper]](https://arxiv.org/pdf/2201.12083.pdf)

- (arXiv 2022.01) VRT: A **Video Restoration** Transformer, [[Paper]](https://arxiv.org/pdf/2201.12288.pdf), [[Code]](https://github.com/JingyunLiang/VRT)

- (arXiv 2022.01) DAB-DETR: DYNAMIC **ANCHOR** BOXES ARE BETTER QUERIES FOR **DETR**, [[Paper]](https://arxiv.org/pdf/2201.12329.pdf), [[Code]](https://github.com/SlongLiu/DAB-DETR)

- (arXiv 2022.01) Plug-In Inversion: Model-Agnostic **Inversion** for Vision with Data Augmentations, [[Paper]](https://arxiv.org/pdf/2201.12961.pdf)

- (arXiv 2022.01) MVP: Multi-Stage **Vision-Language** Pre-Training via Multi-Level Semantic Alignment, [[Paper]](https://arxiv.org/pdf/2201.12596.pdf)

- (arXiv 2022.01) VC-GPT: Visual Conditioned GPT for End-to-End Generative **Vision-and-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2201.12723.pdf)

- (arXiv 2022.01) BOAT: Bilateral Local **Attention** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2201.13027.pdf)

- (arXiv 2022.01) GRAPH SELF-ATTENTION FOR LEARNING **GRAPH** REPRESENTATION WITH TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2201.12787.pdf)

- (arXiv 2022.01) Aggregating **Global** Features into **Local** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2201.12903.pdf), [[Code]](https://github.com/krushi1992/MOA-transformer)

- (arXiv 2022.01) Transformer Module Networks for Systematic Generalization in **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2201.11316.pdf)

- (arXiv 2022.01) Generalised Image **Outpainting** with U-Transformer, [[Paper]](https://arxiv.org/pdf/2201.11403.pdf)

- (arXiv 2022.01) RelTR: Relation Transformer for **Scene Graph Generation**, [[Paper]](https://arxiv.org/pdf/2201.11460.pdf)

- (arXiv 2022.01) DocSegTr: An Instance-Level End-to-End **Document Image Segmentation** Transformer, [[Paper]](https://arxiv.org/pdf/2201.11438.pdf)

- (arXiv 2022.01) Pre-Trained **Language** Transformers are Universal **Image** Classifiers, [[Paper]](https://arxiv.org/pdf/2201.10182.pdf)

- (arXiv 2022.01) Explore and Match: End-to-End **Video Grounding** with Transformer, [[Paper]](https://arxiv.org/pdf/2201.10168.pdf)

- (arXiv 2022.01) TGFuse: An **Infrared** and **Visible Image Fusion** Approach Based on Transformer and Generative Adversarial Network, [[Paper]](https://arxiv.org/pdf/2201.10147.pdf)

- (arXiv 2022.01) ViT-HGR: Vision Transformer-based **Hand Gesture Recognition** from High Density Surface EMG Signals, [[Paper]](https://arxiv.org/pdf/2201.10060.pdf)

- (arXiv 2022.01) ShapeFormer: Transformer-based **Shape Completion** via Sparse Representation, [[Paper]](https://arxiv.org/pdf/2201.10326.pdf), [[Project]](https://shapeformer.github.io/)

- (arXiv 2022.01) **CONVOLUTIONAL** XFORMERS FOR VISION, [[Paper]](https://arxiv.org/pdf/2201.10271.pdf), [[Code]](https://github.com/pranavphoenix/CXV)

- (arXiv 2022.01) DocEnTr: An End-to-End **Document Image Enhancement** Transformer, [[Paper]](https://arxiv.org/pdf/2201.10252.pdf), [[Code]](https://github.com/dali92002/DocEnTR)

- (arXiv 2022.01) Zero-Shot **Sketch** Based **Image Retrieval** using Graph Transformer, [[Paper]](https://arxiv.org/pdf/2201.10185.pdf)

- (arXiv 2022.01) SA-**VQA**: Structured Alignment of Visual and Semantic Representations for Visual Question Answering, [[Paper]](https://arxiv.org/pdf/2201.10654.pdf)

- (arXiv 2022.01) DUAL-TASKS SIAMESE TRANSFORMER FRAMEWORK FOR **BUILDING DAMAGE ASSESSMENT**, [[Paper]](https://arxiv.org/pdf/2201.10953.pdf)

- (arXiv 2022.01) When **Shift Operation** Meets Vision Transformer: An Extremely Simple Alternative to **Attention** Mechanism, [[Paper]](https://arxiv.org/pdf/2201.10801.pdf), [[Code]](https://github.com/microsoft/SPACH)

- (arXiv 2022.01) Self-supervised 3D Semantic Representation Learning for **Vision-and-Language Navigation**, [[Paper]](https://arxiv.org/pdf/2201.10788.pdf)

- (arXiv 2022.01) **Training** Vision Transformers with Only 2040 Images, [[Paper]](https://arxiv.org/pdf/2201.10728.pdf)

- (arXiv 2022.01) Learning To Recognize **Procedural Activities** with Distant Supervision, [[Paper]](https://arxiv.org/pdf/2201.10990.pdf)

- (arXiv 2022.01) EVALUATING **LANGUAGE**-BIASED **IMAGE** CLASSIFICATION BASED ON SEMANTIC REPRESENTATIONS, [[Paper]](https://arxiv.org/pdf/2201.11014.pdf)

- (arXiv 2022.01) A Comprehensive Study of Vision Transformers on **Dense Prediction Tasks**, [[Paper]](https://arxiv.org/pdf/2201.08683.pdf)

- (arXiv 2022.01) UniFormer: Unifying **Convolution** and **Self-attention** for Visual Recognition, [[Paper]](https://arxiv.org/pdf/2201.09450.pdf), [[Code]](https://github.com/Sense-X/UniFormer)

- (arXiv 2022.01) **Patches** Are All You Need? [[Paper]](https://arxiv.org/pdf/2201.09792.pdf), [[Code]](https://github.com/locuslab/convmixer) (see the ConvMixer sketch at the end of this section)

- (arXiv 2022.01) Reading-strategy Inspired Visual Representation Learning for **Text-to-Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2201.09168.pdf)

- (arXiv 2022.01) LEARNING TO ACT WITH AFFORDANCE-AWARE **MULTIMODAL** NEURAL **SLAM**, [[Paper]](https://arxiv.org/pdf/2201.09862.pdf)

- (arXiv 2022.01) Visual Information Guided **Zero-Shot Paraphrase Generation**, [[Paper]](https://arxiv.org/pdf/2201.09107.pdf)

- (arXiv 2022.01) TerViT: An **Efficient** **Ternary** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2201.08050.pdf)

- (arXiv 2022.01) End-to-end Generative Pretraining for **Multimodal Video Captioning**, [[Paper]](https://arxiv.org/pdf/2201.08264.pdf)

- (arXiv 2022.01) OMNIVORE: A Single Model for **Many** Visual **Modalities**, [[Paper]](https://arxiv.org/pdf/2201.08377.pdf), [[Project]](https://facebookresearch.github.io/omnivore/)

- (arXiv 2022.01) MeMViT: Memory-Augmented Multiscale Vision Transformer for **Efficient Long-Term Video Recognition**, [[Paper]](https://arxiv.org/pdf/2201.08383.pdf)

- (arXiv 2022.01) The CLEAR Benchmark: **Continual LEArning** on Real-World Imagery, [[Paper]](https://arxiv.org/pdf/2201.06289.pdf), [[Project]](https://clear-benchmark.github.io/)

- (arXiv 2022.01) ProposalCLIP: **Unsupervised** Open-Category Object **Proposal** Generation via Exploiting **CLIP** Cues, [[Paper]](https://arxiv.org/pdf/2201.06696.pdf)

- (arXiv 2022.01) Cross-modal Contrastive Distillation for **Instructional Activity Anticipation**, [[Paper]](https://arxiv.org/pdf/2201.06734.pdf)

- (arXiv 2022.01) Transformers in Action: **Weakly Supervised Action Segmentation**, [[Paper]](https://arxiv.org/pdf/2201.05675.pdf)

- (arXiv 2022.01) VAQF: Fully Automatic **Software-hardware Co-design** Framework for **Low-bit** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2201.06618.pdf)

- (arXiv 2022.01) CLIP-TD: **CLIP** Targeted Distillation for **Vision-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2201.05729.pdf)

- (arXiv 2022.01) **Domain Adaptation** via Bidirectional Cross-Attention Transformer, [[Paper]](https://arxiv.org/pdf/2201.05887.pdf)

- (arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for **Online Inference**, [[Paper]](https://arxiv.org/pdf/2201.06268.pdf)

- (arXiv 2022.01) **Motion Inbetweening** via Deep ∆-Interpolator, [[Paper]](https://arxiv.org/pdf/2201.06701.pdf)

- (arXiv 2022.01) RePre: Improving **Self-Supervised** Vision Transformer with Reconstructive Pre-training, [[Paper]](https://arxiv.org/pdf/2201.06857.pdf)

- (arXiv 2022.01) GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for **Nowcasting Extreme Events**, [[Paper]](https://arxiv.org/pdf/2201.06717.pdf)

- (arXiv 2022.01) TransFuse: A Unified Transformer-based **Image Fusion** Framework using Self-supervised Learning, [[Paper]](https://arxiv.org/pdf/2201.07451.pdf)

- (arXiv 2022.01) Q-ViT: Fully Differentiable **Quantization** for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2201.07703.pdf)

- (arXiv 2022.01) Disentangled Latent Transformer for **Interpretable Monocular Height Estimation**, [[Paper]](https://arxiv.org/pdf/2201.06357.pdf), [[Project]](https://github.com/ShadowXZT/DLT-Height-Estimation.pytorch)

- (arXiv 2022.01) Poseur: Direct Human **Pose Regression** with Transformers, [[Paper]](https://arxiv.org/pdf/2201.07412.pdf)

- (arXiv 2022.01) SWINUNET3D - A HIERARCHICAL ARCHITECTURE FOR DEEP **TRAFFIC PREDICTION** USING SHIFTED WINDOW TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2201.06390.pdf), [[Code]](https://github.com/bojesomo/Traffic4Cast2021-SwinUNet3D)

- (arXiv 2022.01) SWIN-POSE: SWIN TRANSFORMER BASED HUMAN **POSE ESTIMATION**, [[Paper]](https://arxiv.org/pdf/2201.07384.pdf)

- (arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for **Robotic Manipulation**, [[Paper]](https://arxiv.org/pdf/2201.07779.pdf), [[Project]](https://jangirrishabh.github.io/lookcloser/)

- (arXiv 2022.01) ViT2Hash: Unsupervised Information-Preserving **Hashing**, [[Paper]](https://arxiv.org/pdf/2201.05541.pdf)

- (arXiv 2022.01) LANGUAGE-DRIVEN **SEMANTIC SEGMENTATION**, [[Paper]](https://arxiv.org/pdf/2201.03546.pdf), [[Code]](https://github.com/isl-org/lang-seg)

- (arXiv 2022.01) **Pedestrian Detection**: Domain Generalization, CNNs, Transformers and Beyond, [[Paper]](https://arxiv.org/pdf/2201.03176.pdf), [[Code]](https://github.com/hasanirtiza/Pedestron)

- (arXiv 2022.01) ImageSubject: A Large-scale Dataset for **Subject Detection**, [[Paper]](https://arxiv.org/pdf/2201.03101.pdf)

- (arXiv 2022.01) **Detecting** Twenty-thousand Classes using Image-level Supervision, [[Paper]](https://arxiv.org/pdf/2201.02605.pdf), [[Code]](https://github.com/facebookresearch/Detic)

- (arXiv 2022.01) Generalized **Category Discovery**, [[Paper]](https://arxiv.org/pdf/2201.02609.pdf), [[Code]](https://github.com/sgvaze/generalized-category-discovery)

- (arXiv 2022.01) Video **Summarization** Based on **Video-text** Modelling, [[Paper]](https://arxiv.org/pdf/2201.02494.pdf)

- (arXiv 2022.01) Spatio-Temporal Tuples Transformer for **Skeleton-Based Action Recognition**, [[Paper]](https://arxiv.org/pdf/2201.02849.pdf), [[Code]](https://github.com/heleiqiu/STTFormer)

- (arXiv 2022.01) **QUADTREE ATTENTION** FOR VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2201.02767.pdf), [[Code]](https://github.com/Tangshitao/QuadtreeAttention)

- (arXiv 2022.01) A Comprehensive Empirical Study of **Vision-Language** Pre-trained Model for Supervised Cross-Modal Retrieval, [[Paper]](https://arxiv.org/pdf/2201.02772.pdf), [[Project]](https://github.com/zhixiongz/CLIP4CMR)

- (arXiv 2022.01) MERLOT Reserve: Neural Script Knowledge through **Vision and Language and Sound**, [[Paper]](https://arxiv.org/pdf/2201.02639.pdf), [[Project]](https://rowanzellers.com/merlotreserve)

- (arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2201.03965.pdf)

- (arXiv 2022.01) Pyramid Fusion Transformer for **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2201.04019.pdf)

- (arXiv 2022.01) Multiview Transformers for **Video Recognition**, [[Paper]](https://arxiv.org/pdf/2201.04288.pdf)

- (arXiv 2022.01) HYPERTRANSFORMER: MODEL GENERATION FOR SUPERVISED AND SEMI-SUPERVISED **FEW-SHOT LEARNING**, [[Paper]](https://arxiv.org/pdf/2201.04182.pdf)

- (arXiv 2022.01) UNIFORMER: UNIFIED TRANSFORMER FOR **EFFICIENT SPATIOTEMPORAL** REPRESENTATION LEARNING, [[Paper]](https://arxiv.org/pdf/2201.04676.pdf), [[Code]](https://github.com/Sense-X/UniFormer)

- (arXiv 2022.01) BridgeFormer: Bridging **Video-text** Retrieval with Multiple Choice Questions, [[Paper]](https://arxiv.org/pdf/2201.04850.pdf), [[Project]](https://geyuying.github.io/MCQ.html)

- (arXiv 2022.01) TransVOD: End-to-end **Video Object Detection** with Spatial-Temporal Transformers, [[Paper]](https://arxiv.org/pdf/2201.05047.pdf)

- (arXiv 2022.01) **CLIP**-Event: Connecting Text and Images with **Event** Structures, [[Paper]](https://arxiv.org/pdf/2201.05078.pdf), [[Code]](https://github.com/limanling/clip-event)

- (arXiv 2022.01) Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular **Vision-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2201.04026.pdf)

- (arXiv 2022.01) Lawin Transformer: Improving **Semantic Segmentation** Transformer with Multi-Scale Representations via Large Window Attention, [[Paper]](https://arxiv.org/pdf/2201.01615.pdf), [[Code]](https://github.com/yan-hao-tian/lawin)

- (arXiv 2022.01) **Self-Training** **Vision Language** BERTs with a Unified Conditional Model, [[Paper]](https://arxiv.org/pdf/2201.02010.pdf)

- (arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [[Paper]](https://arxiv.org/pdf/2201.02001.pdf)

- (arXiv 2022.01) Compact Bidirectional Transformer for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2201.01984.pdf), [[Code]](https://github.com/YuanEZhou/CBTrans)

- (arXiv 2022.01) Flow-Guided Sparse Transformer for **Video Deblurring**, [[Paper]](https://arxiv.org/pdf/2201.01893.pdf)

- (arXiv 2022.01) **Stochastic Layers** in Vision Transformers, [[Paper]](https://arxiv.org/pdf/2112.15111.pdf)

- (arXiv 2022.01) ERNIE-VILG: UNIFIED GENERATIVE PRE-TRAINING FOR **BIDIRECTIONAL VISION-LANGUAGE GENERATION**, [[Paper]](https://arxiv.org/pdf/2112.15283.pdf)

- (arXiv 2022.01) InverseMV: **Composing Piano Scores** with a Convolutional **Video-Music** Transformer, [[Paper]](https://arxiv.org/pdf/2112.15320.pdf), [[Code]](https://github.com/linchintung/VMT)

- (arXiv 2022.01) CSformer: Bridging Convolution and Transformer for **Compressive Sensing**, [[Paper]](https://arxiv.org/pdf/2112.15299.pdf)

- (arXiv 2022.01) Persformer: A Transformer Architecture for **Topological Machine Learning**, [[Paper]](https://arxiv.org/pdf/2112.15210.pdf)

- (arXiv 2022.01) Vision Transformer **Slimming**: Multi-Dimension Searching in Continuous Optimization Space, [[Paper]](https://arxiv.org/pdf/2201.00814.pdf)

- (arXiv 2022.01) Language as Queries for **Referring Video Object Segmentation**, [[Paper]](https://arxiv.org/pdf/2201.00487.pdf), [[Code]](https://github.com/wjn922/ReferFormer)

- (arXiv 2022.01) PyramidTNT: Improved **Transformer-in-Transformer** Baselines with Pyramid Architecture, [[Paper]](https://arxiv.org/pdf/2201.00978.pdf), [[Code]](https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch)

- (arXiv 2022.01) A TRANSFORMER-BASED SIAMESE NETWORK FOR **CHANGE DETECTION**, [[Paper]](https://arxiv.org/pdf/2201.01293.pdf), [[Code]](https://github.com/wgcban/ChangeFormer)

- (arXiv 2022.01) Vision Transformer with **Deformable Attention**, [[Paper]](https://arxiv.org/pdf/2201.00520.pdf), [[Code]](https://github.com/LeapLabTHU/DAT)

- (arXiv 2022.01) Splicing ViT Features for **Semantic Appearance Transfer**, [[Paper]](https://arxiv.org/pdf/2201.00424.pdf), [[Project]](https://splice-vit.github.io/)

- (arXiv 2022.01) Detail-Preserving Transformer for **Light Field Image Super-Resolution**, [[Paper]](https://arxiv.org/pdf/2201.00346.pdf), [[Code]](https://github.com/BITszwang/DPT)
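
The "Patches Are All You Need?" entry above (ConvMixer) is compact enough to restate in code. Below is a minimal sketch of the architecture as described in that paper: a patch-embedding convolution followed by repeated depthwise + pointwise convolution blocks with a residual connection around the depthwise step; the hyperparameters are illustrative defaults, not a tuned configuration.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Adds a skip connection around an arbitrary sub-module."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim=256, depth=8, kernel_size=9, patch_size=7, n_classes=1000):
    """Minimal ConvMixer-style model: patch embedding, then `depth` blocks of
    depthwise conv (with residual) followed by pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )

# Example: classify a batch of two 224x224 images.
model = conv_mixer()
print(model(torch.randn(2, 3, 224, 224)).shape)   # torch.Size([2, 1000])
```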

### 2021.12

- (arXiv 2021.12) Multi-Dimensional **Model Compression** of Vision Transformer, [[Paper]](https://arxiv.org/pdf/2201.00043.pdf)

- (arXiv 2021.12) Siamese Network with Interactive Transformer for **Video Object Segmentation**, [[Paper]](https://arxiv.org/pdf/2112.13983.pdf), [[Code]](https://github.com/LANMNG/SITVOS)

- (arXiv 2021.12) Pale Transformer: A General Vision Transformer **Backbone** with Pale-Shaped **Attention**, [[Paper]](https://arxiv.org/pdf/2112.14000.pdf), [[Code]](https://github.com/BR-IDL/PaddleViT)

- (arXiv 2021.12) APRIL: Finding the Achilles’ Heel on **Privacy** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2112.14087.pdf)

- (arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in **Video-to-Text Translation**, [[Paper]](https://arxiv.org/pdf/2112.14088.pdf)

- (arXiv 2021.12) Does **CLIP** Benefit **Visual Question Answering** in the Medical Domain as Much as it Does in the General Domain?, [[Paper]](https://arxiv.org/pdf/2112.13906.pdf)

- (arXiv 2021.12) SPViT: Enabling **Faster** Vision Transformers via Soft Token Pruning, [[Paper]](https://arxiv.org/pdf/2112.13890.pdf)

- (arXiv 2021.12) A FISTFUL OF WORDS: LEARNING TRANSFERABLE VISUAL MODELS FROM BAG-OF-WORDS SUPERVISION, [[Paper]](https://arxiv.org/pdf/2112.13884.pdf)

- (arXiv 2021.12) StyleGAN-V: A Continuous **Video** Generator with the Price, Image Quality and Perks of **StyleGAN2**, [[Paper]](https://arxiv.org/pdf/2112.14683.pdf), [[Code]](https://universome.github.io/stylegan-v)

- (arXiv 2021.12) A Simple Baseline for **Zero-shot Semantic Segmentation** with Pre-trained **Vision-language** Model, [[Paper]](https://arxiv.org/pdf/2112.14757.pdf), [[Code]](https://github.com/MendelXu/zsseg.baseline)

- (arXiv 2021.12) Miti-DETR: Object **Detection** based on Transformers with Mitigatory Self-Attention Convergence, [[Paper]](https://arxiv.org/pdf/2112.13310.pdf)

- (arXiv 2021.12) SIMVIT: EXPLORING A SIMPLE VISION TRANSFORMER WITH **SLIDING WINDOWS**, [[Paper]](https://arxiv.org/pdf/2112.13085.pdf), [[Code]](https://github.com/ucasligang/SimViT)

- (arXiv 2021.12) SGTR: End-to-end **Scene Graph Generation** with Transformer, [[Paper]](https://arxiv.org/pdf/2112.12970.pdf)

- (arXiv 2021.12) **Video** Joint Modelling Based on Hierarchical Transformer for **Co-summarization**, [[Paper]](https://arxiv.org/pdf/2112.13478.pdf)

- (arXiv 2021.12) Vision Transformer for **Small-Size Datasets**, [[Paper]](https://arxiv.org/pdf/2112.13492.pdf)

- (arXiv 2021.12) Learning **Generative** Vision Transformer with Energy-Based Latent Space for **Saliency Prediction**, [[Paper]](https://arxiv.org/pdf/2112.13528.pdf)

- (arXiv 2021.12) ViR: the Vision **Reservoir**, [[Paper]](https://arxiv.org/pdf/2112.13545.pdf)

- (arXiv 2021.12) SeMask: Semantically Masked Transformers for **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2112.12782.pdf), [[Code]](https://github.com/Picsart-AI-Research/SeMask-Segmentation)

- (arXiv 2021.12) Open-Vocabulary Image **Segmentation**, [[Paper]](https://arxiv.org/pdf/2112.12143.pdf)

- (arXiv 2021.12) ELSA: Enhanced Local **Self-Attention** for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2112.12786.pdf), [[Code]](https://github.com/damo-cv/ELSA)

- (arXiv 2021.12) LaTr: Layout-Aware Transformer for **Scene-Text** **VQA**, [[Paper]](https://arxiv.org/pdf/2112.12494.pdf)

- (arXiv 2021.12) **Multimodal Personality Recognition** using Cross-Attention Transformer and Behaviour Encoding, [[Paper]](https://arxiv.org/pdf/2112.12180.pdf)

- (arXiv 2021.12) Fine-grained **Multi-Modal Self-Supervised Learning**, [[Paper]](https://arxiv.org/pdf/2112.12182.pdf)

- (arXiv 2021.12) SLIP: Self-supervision meets **Language-Image** Pre-training, [[Paper]](https://arxiv.org/pdf/2112.12750.pdf), [[Code]](https://github.com/facebookresearch/SLIP)

- (arXiv 2021.12) CLEVR3D: Compositional Language and Elementary Visual Reasoning for **Question Answering** in **3D Real-World Scenes**, [[Paper]](https://arxiv.org/pdf/2112.11691.pdf)

- (arXiv 2021.12) MIA-Former: **Efficient** and **Robust** Vision Transformers via Multi-grained Input Adaptation, [[Paper]](https://arxiv.org/pdf/2112.11542.pdf)

- (arXiv 2021.12) iSegFormer: Interactive Image **Segmentation** with Transformers, [[Paper]](https://arxiv.org/pdf/2112.11325.pdf), [[Code]](https://github.com/qinliuliuqin/iSegFormer.git)

- (arXiv 2021.12) Contrastive Object **Detection** Using Knowledge Graph Embeddings, [[Paper]](https://arxiv.org/pdf/2112.11366.pdf)

- (arXiv 2021.12) RepMLPNet: Hierarchical Vision **MLP** with Re-parameterized **Locality**, [[Paper]](https://arxiv.org/pdf/2112.11081.pdf), [[Code]](https://github.com/DingXiaoH/RepMLP)

- (arXiv 2021.12) **Lite** Vision Transformer with Enhanced **Self-Attention**, [[Paper]](https://arxiv.org/pdf/2112.10809.pdf), [[Code]](https://github.com/Chenglin-Yang/LVT)

- (arXiv 2021.12) MPViT : Multi-Path Vision Transformer for **Dense Prediction**, [[Paper]](https://arxiv.org/pdf/2112.11010.pdf), [[Code]](https://git.io/MPViT)

- (arXiv 2021.12) SOIT: **Segmenting** Objects with Instance-Aware Transformers, [[Paper]](https://arxiv.org/pdf/2112.11037.pdf), [[Code]](https://github.com/yuxiaodongHRI/SOIT)

- (arXiv 2021.12) Learned Queries for Efficient Local **Attention**, [[Paper]](https://arxiv.org/pdf/2112.11435.pdf), [[Code]](https://github.com/moabarar/qna)

- (arXiv 2021.12) On **Efficient** Transformer and Image Pre-training for **Low-level** Vision, [[Paper]](https://arxiv.org/pdf/2112.10175.pdf), [[Code]](https://github.com/fenglinglwb/EDT)

- (arXiv 2021.12) LOCFORMER: Enabling Transformers to Perform **Temporal Moment Localization** on Long Untrimmed Videos With a Feature Sampling Approach, [[Paper]](https://arxiv.org/pdf/2112.10066.pdf)

- (arXiv 2021.12) Tell me what you see: A zero-shot **action recognition** method based on natural language descriptions, [[Paper]](https://arxiv.org/pdf/2112.09976.pdf), [[Code]](https://github.com/valterlej/zsarcap)

- (arXiv 2021.12) Pre-Training Transformers for **Domain Adaptation**, [[Paper]](https://arxiv.org/pdf/2112.09965.pdf)

- (arXiv 2021.12) ScanQA: 3D Question Answering for Spatial Scene Understanding, [[Paper]](https://arxiv.org/pdf/2112.10482.pdf)

- (arXiv 2021.12) Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [[Paper]](https://arxiv.org/pdf/2112.10740.pdf)

- (arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution **Image Generation**, [[Paper]](https://arxiv.org/pdf/2112.10762.pdf), [[Code]](https://github.com/microsoft/StyleSwin)

- (arXiv 2021.12) Mask2Former for **Video Instance Segmentation**, [[Paper]](https://arxiv.org/pdf/2112.10764.pdf), [[Code]](https://github.com/facebookresearch/Mask2Former)

- (arXiv 2021.12) GLIDE: Towards Photorealistic **Image Generation** and **Editing** with **Text**-Guided Diffusion Models, [[Paper]](https://arxiv.org/pdf/2112.10741.pdf), [[Code]](https://github.com/openai/glide-text2im)

- (arXiv 2021.12) **Efficient** Visual **Tracking** with Exemplar Transformers, [[Paper]](https://arxiv.org/pdf/2112.09686.pdf), [[Code]](https://github.com/visionml/pytracking)

- (arXiv 2021.12) **Neuromorphic Camera Denoising** using Graph Neural Network-driven Transformers, [[Paper]](https://arxiv.org/pdf/2112.09685.pdf)

- (arXiv 2021.12) Align and Prompt: **Video-and-Language** Pre-training with Entity Prompts, [[Paper]](https://arxiv.org/pdf/2112.09583.pdf), [[Code]](https://github.com/salesforce/ALPRO)

- (arXiv 2021.12) DATA **EFFICIENT** **LANGUAGE-SUPERVISED ZEROSHOT RECOGNITION** WITH OPTIMAL TRANSPORT DISTILLATION, [[Paper]](https://arxiv.org/pdf/2112.09445.pdf)

- (arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame **Image Restoration** with Pre-Trained Siamese Transformers, [[Paper]](https://arxiv.org/pdf/2112.09426.pdf)

- (arXiv 2021.12) Full Transformer Framework for Robust **Point Cloud Registration** with Deep Information Interaction, [[Paper]](https://arxiv.org/pdf/2112.09385.pdf), [[Code]](https://github.com/CGuangyan-BIT/DIT)

- (arXiv 2021.12) ZeroVL: A Strong Baseline for Aligning **Vision-Language** Representations with **Limited Resources**, [[Paper]](https://arxiv.org/pdf/2112.09331.pdf)

- (arXiv 2021.12) Towards End-to-End **Image Compression and Analysis** with Transformers, [[Paper]](https://arxiv.org/pdf/2112.09300.pdf)

- (arXiv 2021.12) How to **augment** your ViTs? Consistency loss and StyleAug, a random style transfer augmentation, [[Paper]](https://arxiv.org/pdf/2112.09260.pdf)

- (arXiv 2021.12) Learning to Prompt for **Continual Learning**, [[Paper]](https://arxiv.org/pdf/2112.08654.pdf), [[Code]](https://github.com/google-research/l2p)

- (arXiv 2021.12) Distilled Dual-Encoder Model for **Vision-Language** Understanding, [[Paper]](https://arxiv.org/pdf/2112.08723.pdf), [[Code]](https://github.com/kugwzk/Distilled-DualEncoder)

- (arXiv 2021.12) Dense Video **Captioning** Using Unsupervised Semantic Information, [[Paper]](https://arxiv.org/pdf/2112.08455.pdf), [[Code]](https://github.com/valterlej/dvcusi)

- (arXiv 2021.12) Looking Outside the Box to **Ground Language** in **3D** Scenes, [[Paper]](https://arxiv.org/pdf/2112.08879.pdf), [[Code]](https://github.com/nickgkan/beauty_detr)

- (arXiv 2021.12) Region**CLIP**: Region-based **Language-Image** Pretraining, [[Paper]](https://arxiv.org/pdf/2112.09106.pdf), [[Code]](https://github.com/microsoft/RegionCLIP)

- (arXiv 2021.12) DProST: **6-DoF Object Pose Estimation** Using Space Carving and Dynamic Projective Spatial Transformer, [[Paper]](https://arxiv.org/pdf/2112.08775.pdf)

- (arXiv 2021.12) Masked Feature Prediction for **Self-Supervised** Visual Pre-Training, [[Paper]](https://arxiv.org/pdf/2112.09133.pdf)

- (arXiv 2021.12) SGEITL: Scene Graph Enhanced Image-Text Learning for **Visual Commonsense Reasoning**, [[Paper]](https://arxiv.org/pdf/2112.08587.pdf)

- (arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for **Zero-Shot Learning**, [[Paper]](https://arxiv.org/pdf/2112.08643.pdf), [[Code]](https://github.com/shiming-chen/TransZero_pp)

- (arXiv 2021.12) Vision Transformer Based **Video Hashing Retrieval** for Tracing the Source of Fake Videos, [[Paper]](https://arxiv.org/pdf/2112.08117.pdf), [[Code]](https://github.com/lajlksdf/vtl)

- (arXiv 2021.12) Co-training Transformer with Videos and Images Improves **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2112.07175.pdf)

- (arXiv 2021.12) QAHOI: Query-Based Anchors for **Human-Object Interaction** Detection, [[Paper]](https://arxiv.org/pdf/2112.08647.pdf), [[Code]](https://github.com/cjw2021/QAHOI)

- (arXiv 2021.12) AdaViT: Adaptive Tokens for **Efficient** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2112.07658.pdf)

- (arXiv 2021.12) **CLIP**-Lite: Information **Efficient** Visual Representation Learning from Textual Annotations, [[Paper]](https://arxiv.org/pdf/2112.07133.pdf)

- (arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on **Unpaired Images and Text**, [[Paper]](https://arxiv.org/pdf/2112.07074.pdf)

- (arXiv 2021.12) Deep ViT Features as Dense Visual **Descriptors**, [[Paper]](https://arxiv.org/pdf/2112.05814.pdf), [[Project]](https://dino-vit-features.github.io/)

- (arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [[Paper]](https://arxiv.org/pdf/2112.07374.pdf), [[Code]](https://github.com/mikecheninoulu/CGT)

- (arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2112.07338.pdf)

- (arXiv 2021.12) COMPOSER: Compositional Learning of **Group Activity** in Videos, [[Paper]](https://arxiv.org/pdf/2112.05892.pdf)

- (arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for **Micro-Expression Recognition**, [[Paper]](https://arxiv.org/pdf/2112.05851.pdf)

- (arXiv 2021.12) Improving and Diagnosing Knowledge-Based **Visual Question Answering** via Entity Enhanced Knowledge Injection, [[Paper]](https://arxiv.org/pdf/2112.06888.pdf)

- (arXiv 2021.12) SVIP: **Sequence VerIfication** for Procedures in **Videos**, [[Paper]](https://arxiv.org/pdf/2112.06447.pdf)

- (arXiv 2021.12) Improving Vision Transformers for **Incremental Learning**, [[Paper]](https://arxiv.org/pdf/2112.06103.pdf)

- (arXiv 2021.12) VL-ADAPTER: Parameter-Efficient Transfer Learning for **Vision-and-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2112.06825.pdf), [[Code]](https://github.com/ylsung/VL_adapter)

- (arXiv 2021.12) Embracing Single Stride **3D Object Detector** with Sparse Transformer, [[Paper]](https://arxiv.org/pdf/2112.06375.pdf), [[Code]](https://github.com/TuSimple/SST)

- (arXiv 2021.12) PartGlot: Learning **Shape Part Segmentation** from Language Reference Games, [[Paper]](https://arxiv.org/pdf/2112.06390.pdf)

- (arXiv 2021.12) **Pedestrian Trajectory Prediction** via Spatial Interaction Transformer Network, [[Paper]](https://arxiv.org/pdf/2112.06624.pdf)

- (arXiv 2021.12) LEARNING SEMANTIC-ALIGNED FEATURE REPRESENTATION FOR **TEXT-BASED PERSON SEARCH**, [[Paper]](https://arxiv.org/pdf/2112.06714.pdf)

- (arXiv 2021.12) L-Verse: Bidirectional **Generation** Between **Image** and **Text**, [[Paper]](https://arxiv.org/pdf/2111.11133.pdf)

- (arXiv 2021.12) **SELF-ATTENTION** DOES NOT NEED O(n^2) MEMORY, [[Paper]](https://arxiv.org/pdf/2112.05682.pdf)

- (arXiv 2021.12) Are Vision Transformers **Robust** to Patch Perturbations? [[Paper]](https://arxiv.org/pdf/2111.10659.pdf)

- (arXiv 2021.12) Mesa: A **Memory-saving Training** Framework for Transformers, [[Paper]](https://arxiv.org/pdf/2111.11124.pdf), [[Code]](https://github.com/zhuang-group/Mesa)

- (arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2112.05230.pdf)

- (arXiv 2021.12) MAGMA – Multimodal **Augmentation** of **Generative** Models through Adapter-based Finetuning, [[Paper]](https://arxiv.org/pdf/2112.05253.pdf)

- (arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for **Weakly Supervised Object Localization**, [[Paper]](https://arxiv.org/pdf/2112.05291.pdf)

- (arXiv 2021.12) FaceFormer: **Speech-Driven 3D Facial Animation** with Transformers, [[Paper]](https://arxiv.org/pdf/2112.05329.pdf)

- (arXiv 2021.12) Rethinking the Two-Stage Framework for **Grounded Situation Recognition**, [[Paper]](https://arxiv.org/pdf/2112.05375.pdf), [[Code]](https://github.com/kellyiss/SituFormer)

- (arXiv 2021.12) **CLIP**2Style**GAN**: Unsupervised Extraction of StyleGAN Edit Directions, [[Paper]](https://arxiv.org/pdf/2112.05219.pdf)

- (arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling **Attention** Map, [[Paper]](https://arxiv.org/pdf/2112.05425.pdf)

- (arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for **Vision-Language** Understanding and Generation, [[Paper]](https://arxiv.org/pdf/2112.05587.pdf)

- (arXiv 2021.12) Visual Transformers with Primal Object Queries for **Multi-Label Image Classification**, [[Paper]](https://arxiv.org/pdf/2112.05485.pdf)

- (arXiv 2021.12) Colossal-AI: A Unified Deep Learning System For **Large-Scale Parallel Training**, [[Paper]](https://arxiv.org/pdf/2110.14883.pdf), [[Code]](https://github.com/hpcaitech/ColossalAI)

- (arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for **Action Detection**, [[Paper]](https://arxiv.org/pdf/2112.03902.pdf)

- (arXiv 2021.12) Grounded **Language-Image** Pre-training, [[Paper]](https://arxiv.org/pdf/2112.03857.pdf), [[Code]](https://github.com/microsoft/GLIP)

- (arXiv 2021.12) U^2-Former: A Nested U-shaped Transformer for **Image Restoration**, [[Paper]](https://arxiv.org/pdf/2112.02279.pdf)

- (arXiv 2021.12) ADAPTIVE CHANNEL ENCODING TRANSFORMER FOR **POINT CLOUD** ANALYSIS, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2112/2112.02507.pdf)

- (arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person **Re-identification** Based on Transformer, [[Paper]](https://arxiv.org/pdf/2112.02466.pdf), [[Code]](https://github.com/WangTaoAs/PFD_Net)

- (arXiv 2021.12) VT-CLIP: Enhancing **Vision-Language** Models with Visual-guided Texts, [[Paper]](https://arxiv.org/pdf/2112.02399.pdf)

- (arXiv 2021.12) PointCLIP: **Point Cloud** Understanding by **CLIP**, [[Paper]](https://arxiv.org/pdf/2112.02413.pdf), [[Code]](https://github.com/ZrrSkywalker/PointCLIP)

- (arXiv 2021.12) Learning **Tracking** Representations via Dual-Branch Fully Transformer Networks, [[Paper]](https://arxiv.org/pdf/2112.02571.pdf), [[Code]](https://github.com/phiphiphi31/DualTFR)

- (arXiv 2021.12) DYNAMIC TOKEN **NORMALIZATION** IMPROVES VISION TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2112.02624.pdf), [[Code]](https://github.com/wqshao126/DTN)

- (arXiv 2021.12) PTTR: Relational 3D **Point Cloud Object Tracking** with Transformer, [[Paper]](https://arxiv.org/pdf/2112.02857.pdf), [[Code]](https://github.com/Jasonkks/PTTR)

- (arXiv 2021.12) GETAM: Gradient-weighted Element-wise Transformer Attention Map for **Weakly-supervised Semantic segmentation**, [[Paper]](https://arxiv.org/pdf/2112.02841.pdf)

- (arXiv 2021.12) **Text2Mesh**: Text-Driven Neural Stylization for Meshes, [[Paper]](https://arxiv.org/pdf/2112.03221.pdf), [[Project]](https://threedle.github.io/text2mesh/)

- (arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal **Emotion Recognition** from Unaligned Multimodal Sequences, [[Paper]](https://arxiv.org/pdf/2112.01697.pdf)

- (arXiv 2021.12) Make A Long Image Short: Adaptive **Token** Length for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2112.01686.pdf)

- (arXiv 2021.12) FuseDream: Training-Free **Text-to-Image Generation** with Improved **CLIP**+GAN Space Optimization, [[Paper]](https://arxiv.org/pdf/2112.01573.pdf), [[Code]](https://github.com/gnobitab/FuseDream)

- (arXiv 2021.12) TransZero: Attribute-guided Transformer for **Zero-Shot Learning**, [[Paper]](https://arxiv.org/pdf/2112.01683.pdf), [[Code]](https://github.com/shiming-chen/TransZero)

- (arXiv 2021.12) Learning Generalizable **Vision-Tactile** Robotic **Grasping** Strategy for Deformable Objects via Transformer, [[Paper]](https://arxiv.org/pdf/2112.06374.pdf), [[Code]](https://github.com/GTLIDAR/DeformableObjectsGrasping.git)

- (arXiv 2021.12) Hformer: Hybrid CNN-Transformer for **Fringe Order Prediction** in Phase Unwrapping of Fringe Projection, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2112/2112.06759.pdf)

- (arXiv 2021.12) Pre-training and Fine-tuning Transformers for **fMRI Prediction** Tasks, [[Paper]](https://arxiv.org/pdf/2112.05761.pdf)

- (arXiv 2021.12) Transformer based **trajectory prediction**, [[Paper]](https://arxiv.org/pdf/2112.04350.pdf)

- (arXiv 2021.12) Evaluating Transformers for Lightweight **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2111.09641.pdf)

- (arXiv 2021.12) Contextualized Spatio-Temporal **Contrastive Learning** with Self-Supervision, [[Paper]](https://arxiv.org/pdf/2112.05181.pdf)

- (arXiv 2021.12) CMA-CLIP: Cross-Modality Attention **CLIP** for **Image-Text** Classification, [[Paper]](https://arxiv.org/pdf/2112.03562.pdf)

- (arXiv 2021.12) **Bootstrapping** ViTs: Towards Liberating Vision Transformers from Pre-training, [[Paper]](https://arxiv.org/pdf/2112.03552.pdf)

- (arXiv 2021.12) Decision-based Black-box **Attack** Against Vision Transformers via Patch-wise Adversarial Removal, [[Paper]](https://arxiv.org/pdf/2112.03492.pdf), [[Code]](https://github.com/shiyuchengTJU/PAR)

- (arXiv 2021.12) DoodleFormer: Creative **Sketch Drawing** with Transformers, [[Paper]](https://arxiv.org/pdf/2112.03258.pdf)

- (arXiv 2021.12) Creating **Multimodal Interactive Agents** with Imitation and Self-Supervised Learning, [[Paper]](https://arxiv.org/pdf/2112.03763.pdf)

- (arXiv 2021.12) **AUDIO-VISUAL** SYNCHRONISATION IN THE WILD, [[Paper]](https://arxiv.org/pdf/2112.04432.pdf), [[Project]](https://www.robots.ox.ac.uk/~vgg/research/avs)

- (arXiv 2021.12) **Classification**-Then-**Grounding**: Reformulating **Video** Scene Graphs as Temporal Bipartite Graphs, [[Paper]](https://arxiv.org/pdf/2112.04222.pdf)

- (arXiv 2021.12) Garment4D: **Garment Reconstruction** from Point Cloud Sequences, [[Paper]](https://arxiv.org/pdf/2112.04159.pdf), [[Code]](https://github.com/hongfz16/Garment4D)

- (arXiv 2021.12) Locally Shifted **Attention** With Early Global Integration, [[Paper]](https://arxiv.org/pdf/2112.05080.pdf), [[Code]](https://github.com/shellysheynin/Locally-SAG-Transformer)

- (arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable **Layout Generation**, [[Paper]](https://arxiv.org/pdf/2112.05112.pdf)

- (arXiv 2021.12) PE-former: **Pose Estimation** Transformer, [[Paper]](https://arxiv.org/pdf/2112.04981.pdf), [[Project]](https://www.ics.forth.gr/hccv/)

- (arXiv 2021.12) Hair**CLIP**: **Design** Your Hair by Text and Reference Image, [[Paper]](https://arxiv.org/pdf/2112.05142.pdf), [[Project]](https://github.com/wty-ustc/HairCLIP)

- (arXiv 2021.12) **CLIP**-**NeRF**: Text-and-Image Driven Manipulation of Neural Radiance Fields, [[Paper]](https://arxiv.org/pdf/2112.05139.pdf), [[Code]](https://cassiepython.github.io/clipnerf/)

- (arXiv 2021.12) A Bilingual, Open World Video Text **Dataset** and End-to-end **Video Text Spotter** with Transformer, [[Paper]](https://arxiv.org/pdf/2112.04888.pdf), [[Code]](https://github.com/weijiawu/TransVTSpotter), [[Dataset]](https://github.com/weijiawu/BOVText-Benchmark)

- (arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for **Efficient Video Recognition**, [[Paper]](https://arxiv.org/pdf/2112.04674.pdf), [[Code]](https://github.com/sail-sg/dualformer)

- (arXiv 2021.12) Recurrent Glimpse-based Decoder for **Detection** with Transformer, [[Paper]](https://arxiv.org/pdf/2112.04632.pdf), [[Code]](https://github.com/zhechen/Deformable-DETR-REGO)

- (arXiv 2021.12) Fast **Point** Transformer, [[Paper]](https://arxiv.org/pdf/2112.04702.pdf)

- (arXiv 2021.12) Assistive Tele-op: Leveraging Transformers to **Collect Robotic Task Demonstrations**, [[Paper]](https://arxiv.org/pdf/2112.05129.pdf), [[Project]](https://sites.google.com/view/assistive-teleop)

- (arXiv 2021.12) Cross-Modality Fusion Transformer for **Multispectral Object Detection**, [[Paper]](https://arxiv.org/pdf/2111.00273.pdf)

- (arXiv 2021.12) PatchFormer: An **Efficient** **Point** Transformer with Patch Attention, [[Paper]](https://arxiv.org/pdf/2111.00207.pdf)

- (arXiv 2021.12) Transformer-Based Approach for Joint **Handwriting** and **Named Entity Recognition** in Historical documents, [[Paper]](https://arxiv.org/pdf/2112.04189.pdf)

- (arXiv 2021.12) **MLP** Architectures for **Vision-and-Language** Modeling: An Empirical Study, [[Paper]](https://arxiv.org/pdf/2112.04453.pdf), [[Code]](https://github.com/easonnie/mlp-vil)

- (arXiv 2021.12) Everything at Once – Multi-modal Fusion Transformer for **Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2112.04446.pdf)

- (arXiv 2021.12) Prompting **Visual-Language** Models for Efficient Video Understanding, [[Paper]](https://arxiv.org/pdf/2112.04478.pdf), [[Project]](https://ju-chen.github.io/efficient-prompt/)

- (arXiv 2021.12) FLAVA: A Foundational **Language And Vision** Alignment Model, [[Paper]](https://arxiv.org/pdf/2112.04482.pdf)

- (arXiv 2021.12) Embedding Arithmetic for **Text-driven Image Transformation**, [[Paper]](https://arxiv.org/pdf/2112.03162.pdf)

- (arXiv 2021.12) LAVT: Language-Aware Vision Transformer for **Referring Image Segmentation**, [[Paper]](https://arxiv.org/pdf/2112.02244.pdf)

- (arXiv 2021.12) Look at What I’m Doing: Self-Supervised **Spatial Grounding** of Narrations in Instructional Videos, [[Paper]](https://arxiv.org/pdf/2110.10596.pdf), [[Project]](https://cs-people.bu.edu/rxtan/projects/grounding_narrations/)

- (arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for **Generic Perception** for **Zero-shot and Few-shot** Tasks, [[Paper]](https://arxiv.org/pdf/2112.01522.pdf)

- (arXiv 2021.12) Dense**CLIP**: Language-Guided **Dense** Prediction with Context-Aware Prompting, [[Paper]](https://arxiv.org/pdf/2112.01518.pdf), [[Code]](https://github.com/raoyongming/DenseCLIP)

- (arXiv 2021.12) Self-supervised **Video** Transformer, [[Paper]](https://arxiv.org/pdf/2112.01514.pdf), [[Code]](https://git.io/J1juJ)

- (arXiv 2021.12) OW-DETR: **Open-world Detection** Transformer, [[Paper]](https://arxiv.org/pdf/2112.01513.pdf)

- (arXiv 2021.12) Zero-Shot **Text-Guided Object Generation** with Dream Fields, [[Paper]](https://arxiv.org/pdf/2112.01455.pdf), [[Project]](https://ajayj.com/dreamfields)

- (arXiv 2021.12) **Video-Text** Pre-training with Learned Regions, [[Paper]](https://arxiv.org/pdf/2112.01194.pdf), [[Code]](https://github.com/ruiyan1995/Region_Learner)

- (arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for **RGB-D Salient Object Detection**, [[Paper]](https://arxiv.org/pdf/2112.01177.pdf)

- (arXiv 2021.12) TCTN: A 3D-Temporal Convolutional Transformer Network for **Spatiotemporal** Predictive Learning, [[Paper]](https://arxiv.org/pdf/2112.01085.pdf)

- (arXiv 2021.12) DenseCLIP: Extract Free **Dense** Labels from **CLIP**, [[Paper]](https://arxiv.org/pdf/2112.01071.pdf)

- (arXiv 2021.12) TransMEF: A Transformer-Based **Multi-Exposure Image Fusion** Framework using Self-Supervised Multi-Task Learning, [[Paper]](https://arxiv.org/pdf/2112.01030.pdf)

- (arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer **Tracking**, [[Paper]](https://arxiv.org/pdf/2112.00995.pdf), [[Code]](https://github.com/LitingLin/SwinTrack)

- (arXiv 2021.12) Object-Centric Unsupervised Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2112.00969.pdf)

- (arXiv 2021.12) Vision Pair Learning: An **Efficient** Training Framework for Image **Classification**, [[Paper]](https://arxiv.org/pdf/2112.00965.pdf)

- (arXiv 2021.12) Visual-Semantic Transformer for **Scene Text Recognition**, [[Paper]](https://arxiv.org/pdf/2112.00948.pdf)

- (arXiv 2021.12) Differentiable **Spatial Planning** using Transformers, [[Paper]](https://arxiv.org/pdf/2112.01010.pdf), [[Project]](https://devendrachaplot.github.io/projects/spatial-planning-transformers)

- (arXiv 2021.12) Improved **Multiscale** Vision Transformers for **Classification** and **Detection**, [[Paper]](https://arxiv.org/pdf/2112.01526.pdf)

- (arXiv 2021.12) Masked-attention Mask Transformer for Universal Image **Segmentation**, [[Paper]](https://arxiv.org/pdf/2112.01527.pdf), [[Code]](https://bowenc0221.github.io/mask2former)

- (arXiv 2021.12) BEVT: BERT Pretraining of **Video** Transformers, [[Paper]](https://arxiv.org/pdf/2112.01529.pdf)

- (arXiv 2021.12) **Human-Object Interaction Detection** via Weak Supervision, [[Paper]](https://arxiv.org/pdf/2112.00492.pdf)

- (arXiv 2021.12) Learning Transformer Features for **Image Quality Assessment**, [[Paper]](https://arxiv.org/pdf/2112.00485.pdf)

- (arXiv 2021.12) **CLIP**styler: **Image Style Transfer** with a Single Text Condition, [[Paper]](https://arxiv.org/pdf/2112.00374.pdf)

- (arXiv 2021.12) **Multi-View Stereo** with Transformer, [[Paper]](https://arxiv.org/pdf/2112.00336.pdf)

- (arXiv 2021.12) VoRTX: **Volumetric 3D Reconstruction** With Transformers for Voxelwise View Selection and Fusion, [[Paper]](https://arxiv.org/pdf/2112.00236.pdf), [[Code]](https://noahstier.github.io/vortx)

- (arXiv 2021.12) Object-aware **Video-language** Pre-training for Retrieval, [[Paper]](https://arxiv.org/pdf/2112.00656.pdf), [[Code]](https://github.com/FingerRec/OA-Transformer)

### 2021.11

- (arXiv 2021.11) Multi-modal Transformers Excel at **Class-agnostic** Object **Detection**, [[Paper]](https://arxiv.org/pdf/2111.11430.pdf), [[Code]](https://git.io/J1HPY)

- (arXiv 2021.11) Predict, Prevent, and Evaluate: Disentangled **Text-Driven Image Manipulation** Empowered by Pre-Trained Vision-Language Model, [[Paper]](https://arxiv.org/pdf/2111.13333.pdf)

- (arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for **Visual Recognition**, [[Paper]](https://arxiv.org/pdf/2111.12994.pdf), [[Code]](https://github.com/NomMer1125/NomMer)

- (arXiv 2021.11) PolyViT: **Co-training** Vision Transformers on **Images**, **Videos** and **Audio**, [[Paper]](https://arxiv.org/pdf/2111.12993.pdf)

- (arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [[Paper]](https://arxiv.org/pdf/2111.13677.pdf)

- (arXiv 2021.11) ADAPTIVE **FOURIER** NEURAL OPERATORS: **EFFICIENT** TOKEN MIXERS FOR TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2111.13587.pdf)

- (arXiv 2021.11) DyTox: Transformers for **Continual Learning** with DYnamic TOken eXpansion, [[Paper]](https://arxiv.org/pdf/2111.11326.pdf), [[Code]](https://github.com/arthurdouillard/dytox)

- (arXiv 2021.11) DABS: A Domain-Agnostic **Benchmark** for **Self-Supervised** Learning, [[Paper]](https://arxiv.org/pdf/2111.12062.pdf), [[Code]](https://github.com/alextamkin/dabs)

- (arXiv 2021.11) Ice hockey **player identification** via transformers, [[Paper]](https://arxiv.org/pdf/2111.11535.pdf)

- (arXiv 2021.11) DBIA: Data-free Backdoor Injection **Attack** against Transformer Networks, [[Paper]](https://arxiv.org/pdf/2111.11870.pdf), [[Code]](https://anonymous.4open.science/r/DBIA-825D)

- (arXiv 2021.11) Sparse Fusion for **Multimodal** Transformers, [[Paper]](https://arxiv.org/pdf/2111.11992.pdf)

- (arXiv 2021.11) PhysFormer: **Facial Video-based Physiological Measurement** with Temporal Difference Transformer, [[Paper]](https://arxiv.org/pdf/2111.12082.pdf), [[Code]](https://github.com/ZitongYu/PhysFormer)

- (arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person **Re-Identification**, [[Paper]](https://arxiv.org/pdf/2111.12084.pdf), [[Code]](https://github.com/michuanhaohao/TransReID-SSL)

- (arXiv 2021.11) DISCRETE REPRESENTATIONS STRENGTHEN VISION TRANSFORMER **ROBUSTNESS**, [[Paper]](https://arxiv.org/pdf/2111.10493.pdf)

- (arXiv 2021.11) TRAVLR: Now You See It, Now You Don’t! Evaluating Cross-Modal Transfer of **Visio-Linguistic Reasoning**, [[Paper]](https://arxiv.org/pdf/2111.10756.pdf)

- (arXiv 2021.11) Crossing the Format Boundary of Text and Boxes: Towards Unified **Vision-Language** Modeling, [[Paper]](https://arxiv.org/pdf/2111.12085.pdf)

- (arXiv 2021.11) **Semi-Supervised** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2111.11067.pdf)

- (arXiv 2021.11) CpT: Convolutional Point Transformer for 3D **Point Cloud** Processing, [[Paper]](https://arxiv.org/pdf/2111.10866.pdf)

- (arXiv 2021.11) ZERO-SHOT CERTIFIED **DEFENSE** AGAINST **ADVERSARIAL** PATCHES WITH VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2111.10481.pdf)

- (arXiv 2021.11) PointMixer: MLP-Mixer for **Point Cloud** Understanding, [[Paper]](https://arxiv.org/pdf/2111.11187.pdf)

- (arXiv 2021.11) **MetaFormer** is Actually What You Need for Vision, [[Paper]](https://arxiv.org/pdf/2111.11418.pdf), [[Code]](https://github.com/sail-sg/poolformer)

- (arXiv 2021.11) Florence: A New **Foundation Model** for Computer Vision, [[Paper]](https://arxiv.org/pdf/2111.11432.pdf)

- (arXiv 2021.11) Benchmarking **Detection Transfer Learning** with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2111.11429.pdf)

- (arXiv 2021.11) Learning to **Compose Visual Relations**, [[Paper]](https://arxiv.org/pdf/2111.09297.pdf), [[Project]](https://composevisualrelations.github.io/)

- (arXiv 2021.11) REFERENCE-BASED **MAGNETIC RESONANCE IMAGE RECONSTRUCTION** USING TEXTURE TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2111.09492.pdf)

- (arXiv 2021.11) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for **Instructional Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2111.09276.pdf)

- (arXiv 2021.11) **Swin Transformer V2**: Scaling Up Capacity and Resolution, [[Paper]](https://arxiv.org/pdf/2111.09883.pdf), [[Code]](https://github.com/microsoft/Swin-Transformer)

- (arXiv 2021.11) SimMIM: A Simple Framework for **Masked Image Modeling**, [[Paper]](https://arxiv.org/pdf/2111.09886.pdf), [[Code]](https://github.com/microsoft/SimMIM)

- (arXiv 2021.11) Restormer: Efficient Transformer for **High-Resolution Image Restoration**, [[Paper]](https://arxiv.org/pdf/2111.09881.pdf), [[Code]](https://github.com/swz30/Restormer)

- (arXiv 2021.11) Simple but Effective: **CLIP** Embeddings for **Embodied AI**, [[Paper]](https://arxiv.org/pdf/2111.09888.pdf)

- (arXiv 2021.11) ClipCap: CLIP Prefix for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2111.09734.pdf), [[Code]](https://github.com/rmokady/CLIP_prefix_caption)

- (arXiv 2021.11) TransMix: Attend to **Mix** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2111.09833.pdf), [[Code]](https://github.com/Beckschen/TransMix)

- (arXiv 2021.11) TRIG: Transformer-Based **Text Recognizer** with Initial Embedding Guidance, [[Paper]](https://arxiv.org/pdf/2111.08314.pdf)

- (arXiv 2021.11) Multi-Grained **Vision Language** Pre-Training: Aligning Texts with Visual Concepts, [[Paper]](https://arxiv.org/pdf/2111.08276.pdf), [[Code]](https://github.com/zengyan-97/X-VLM)

- (arXiv 2021.11) Explainable Semantic Space by **Grounding Language to Vision** with Cross-Modal Contrastive Learning, [[Paper]](https://arxiv.org/pdf/2111.07180.pdf), [[Code]](https://github.com/yizhen-zhang/VG-Bert)

- (arXiv 2021.11) Semantically Grounded Object Matching for Robust **Robotic Scene Rearrangement**, [[Paper]](https://arxiv.org/pdf/2111.07975.pdf), [[Code]](https://github.com/applied-ai-lab/object_matching)

- (arXiv 2021.11) **Tracking** People with **3D** Representations, [[Paper]](https://arxiv.org/pdf/2111.07868.pdf), [[Code]](https://brjathu.github.io/T3DP)

- (arXiv 2021.11) LiT: Zero-Shot Transfer with Locked-**image** **Text** **Tuning**, [[Paper]](https://arxiv.org/pdf/2111.07991.pdf)

- (arXiv 2021.11) FILIP: FINE-GRAINED INTERACTIVE **LANGUAGE-IMAGE** PRE-TRAINING, [[Paper]](https://arxiv.org/pdf/2111.07783.pdf)

- (arXiv 2021.11) Graph Relation Transformer: Incorporating **pairwise object features** into the Transformer architecture, [[Paper]](https://arxiv.org/pdf/2111.06075.pdf), [[Code]](https://github.com/derikclive/transformers)

- (arXiv 2021.11) **Attention** Approximates Sparse Distributed Memory, [[Paper]](https://arxiv.org/pdf/2111.05498.pdf)

- (arXiv 2021.11) SLICED **RECURSIVE** TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2111.05297.pdf), [[Code]](https://github.com/szq0214/SReT)

- (arXiv 2021.11) HYBRID **BYOL-VIT**: EFFICIENT APPROACH TO DEAL WITH **SMALL DATASETS**, [[Paper]](https://arxiv.org/pdf/2111.04845.pdf)

- (arXiv 2021.11) Tip-Adapter: Training-free **CLIP**-Adapter for Better **Vision-Language** Modeling, [[Paper]](https://arxiv.org/pdf/2111.03930.pdf), [[Code]](https://github.com/gaopengcuhk/Tip-Adapter)
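
  A rough sketch of the training-free, cache-based adaptation idea behind Tip-Adapter: few-shot image features act as keys, their one-hot labels as values, and the resulting cache logits are blended with CLIP's zero-shot logits. The scaling function and the `alpha`/`beta` hyper-parameters below are assumptions loosely following the paper; see the official code for the exact formulation:

  ```python
  import torch

  def tip_adapter_logits(test_feat, cache_keys, cache_vals, text_feat,
                         alpha=1.0, beta=5.5):
      """Training-free cache-based adaptation of CLIP features (sketch).

      test_feat:  (B, D) L2-normalized image features of test images
      cache_keys: (N, D) L2-normalized features of the few-shot support images
      cache_vals: (N, C) one-hot labels of the support images
      text_feat:  (C, D) L2-normalized text embeddings of the class prompts
      """
      affinity = test_feat @ cache_keys.t()                            # (B, N)
      cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_vals  # (B, C)
      zero_shot_logits = 100.0 * test_feat @ text_feat.t()             # (B, C)
      return zero_shot_logits + alpha * cache_logits
  ```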

- (arXiv 2021.11) Improving Visual Quality of **Image Synthesis** by A Token-based Generator with Transformers, [[Paper]](https://arxiv.org/pdf/2111.03481.pdf)

- (arXiv 2021.11) Style**CLIP**Draw: Coupling Content and Style in **Text-to-Drawing Synthesis**, [[Paper]](https://arxiv.org/pdf/2111.03133.pdf), [[Code]](https://github.com/pschaldenbrand/StyleCLIPDraw)

- (arXiv 2021.11) Revisiting **spatio-temporal** layouts for **compositional action recognition**, [[Paper]](https://arxiv.org/pdf/2111.01936.pdf), [[Code]](https://github.com/gorjanradevski/revisiting-spatial-temporal-layouts)

- (arXiv 2021.11) PatchGame: Learning to Signal Mid-level Patches in **Referential Games**, [[Paper]](https://arxiv.org/pdf/2111.01785.pdf), [[Code]](https://kampta.github.io/patch-game)

- (arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM **CONVOLUTION**? [[Paper]](https://arxiv.org/pdf/2111.01353.pdf)

- (arXiv 2021.11) Livestock Monitoring with Transformer, [[Paper]](https://arxiv.org/pdf/2111.00801.pdf)

- (arXiv 2021.11) With a Little Help from my Temporal Context: Multimodal **Egocentric Action Recognition**, [[Paper]](https://arxiv.org/pdf/2111.01024.pdf), [[Code]](https://github.com/ekazakos/MTCN)

- (arXiv 2021.11) IconQA: A New Benchmark for Abstract Diagram Understanding and **Visual Language Reasoning**, [[Paper]](https://arxiv.org/pdf/2110.13214.pdf), [[Project]](https://iconqa.github.io/)

- (arXiv 2021.11) BoxeR: **Box-Attention** for 2D and 3D Transformers, [[Paper]](https://arxiv.org/pdf/2111.13087.pdf)

- (arXiv 2021.11) VLDeformer: **Vision-Language** Decomposed Transformer for Fast **Cross-Modal Retrieval**, [[Paper]](https://arxiv.org/pdf/2110.11338.pdf)

- (arXiv 2021.11) Multi-Person **3D Motion Prediction** with Multi-Range Transformers, [[Paper]](https://arxiv.org/pdf/2111.12073.pdf), [[Code]](https://jiashunwang.github.io/MRT/)

- (arXiv 2021.11) Scene Representation Transformer: Geometry-Free **Novel View Synthesis** Through Set-Latent Scene Representations, [[Paper]](https://arxiv.org/pdf/2111.13152.pdf), [[Project]](https://srt-paper.github.io/)

- (arXiv 2021.11) **Global Interaction Modelling** in Vision Transformer via Super Tokens, [[Paper]](https://arxiv.org/pdf/2111.13156.pdf)

- (arXiv 2021.11) ML-Decoder: Scalable and Versatile **Classification Head**, [[Paper]](https://arxiv.org/pdf/2111.12933.pdf), [[Code]](https://github.com/Alibaba-MIIL/ML_Decoder)

- (arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for **Unsupervised Domain Adaptation**, [[Paper]](https://arxiv.org/pdf/2111.12941.pdf)

- (arXiv 2021.11) SWINBERT: End-to-End Transformers with Sparse Attention for **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2111.13196.pdf)

- (arXiv 2021.11) Amortized Prompt: Lightweight Fine-Tuning for **CLIP** in **Domain Generalization**, [[Paper]](https://arxiv.org/pdf/2111.12853.pdf)

- (arXiv 2021.11) Universal Captioner: Long-Tail **Vision-and-Language** Model Training through Content-Style Separation, [[Paper]](https://arxiv.org/pdf/2111.12727.pdf)

- (arXiv 2021.11) **Sparse** is Enough in Scaling Transformers, [[Paper]](https://arxiv.org/pdf/2111.12763.pdf)

- (arXiv 2021.11) An implementation of the “**Guess who**?” game using CLIP, [[Paper]](https://arxiv.org/pdf/2112.00599.pdf), [[Code]](https://github.com/ArnauDIMAI/CLIP-GuessWho)

- (arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for **Structured Reconstruction**, [[Paper]](https://arxiv.org/pdf/2111.15143.pdf)

- (arXiv 2021.11) A Unified **Pruning** Framework for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2111.15127.pdf)

- (arXiv 2021.11) Pyramid **Adversarial Training** Improves ViT Performance, [[Paper]](https://arxiv.org/pdf/2111.15121.pdf)

- (arXiv 2021.11) AssistSR: Affordance-centric Question-driven **Video Segment Retrieval**, [[Paper]](https://arxiv.org/pdf/2111.15050.pdf), [[Code & Data]](https://github.com/StanLei52/AQVSR)

- (arXiv 2021.11) DAFormer: Improving Network Architectures and Training Strategies for **Domain-Adaptive Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2111.14887.pdf), [[Code]](https://github.com/lhoyer/DAFormer)

- (arXiv 2021.11) AdaViT: Adaptive Vision Transformers for **Efficient** Image Recognition, [[Paper]](https://arxiv.org/pdf/2111.15668.pdf)

- (arXiv 2021.11) ATS: Adaptive Token Sampling For **Efficient** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2111.15667.pdf)

- (arXiv 2021.11) **CLIP** Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate **Captioning**, [[Paper]](https://arxiv.org/pdf/2111.15162.pdf)

- (arXiv 2021.11) CRIS: **CLIP**-Driven Referring Image **Segmentation**, [[Paper]](https://arxiv.org/pdf/2111.15174.pdf)

- (arXiv 2021.11) Shunted **Self-Attention** via Multi-Scale Token Aggregation, [[Paper]](https://arxiv.org/pdf/2111.15193.pdf), [[Code]](https://github.com/OliverRensu/Shunted-Transformer)

- (arXiv 2021.11) MC-SSL0.0: Towards Multi-Concept **Self-Supervised** Learning, [[Paper]](https://arxiv.org/pdf/2111.15340.pdf)

- (arXiv 2021.11) TransWeather: Transformer-based **Restoration of Images** Degraded by Adverse Weather Conditions, [[Paper]](https://arxiv.org/pdf/2111.14813.pdf), [[Code]](https://github.com/jeya-maria-jose/TransWeather)

- (arXiv 2021.11) Searching the **Search Space** of Vision Transformer, [[Paper]](https://arxiv.org/pdf/2111.14725.pdf), [[Code]](https://github.com/microsoft/Cream)

- (arXiv 2021.11) TransMVSNet: Global Context-aware **Multi-view Stereo** Network with Transformers, [[Paper]](https://arxiv.org/pdf/2111.14600.pdf), [[Code]](https://github.com/MegviiRobot/TransMVSNet)

- (arXiv 2021.11) **Recurrent** Vision Transformer for Solving Visual **Reasoning** Problems, [[Paper]]()

- (arXiv 2021.11) **Video Frame Interpolation** Transformer, [[Paper]](https://arxiv.org/pdf/2111.13817.pdf)

- (arXiv 2021.11) FQ-ViT: Fully **Quantized** Vision Transformer without Retraining, [[Paper]](https://arxiv.org/pdf/2111.13824.pdf), [[Code]](https://github.com/linyang-zhh/FQ-ViT)

- (arXiv 2021.11) LAFITE: Towards Language-Free Training for **Text-to-Image Generation**, [[Paper]](https://arxiv.org/pdf/2111.13792.pdf)

- (arXiv 2021.11) SPARSE DETR: **EFFICIENT** END-TO-END OBJECT **DETECTION** WITH LEARNABLE SPARSITY, [[Paper]](https://arxiv.org/pdf/2111.14330.pdf), [[Code]](https://github.com/kakaobrain/sparse-detr)

- (arXiv 2021.11) End-to-End **Referring Video Object Segmentation** with Multimodal Transformers, [[Paper]](https://arxiv.org/pdf/2111.14821.pdf), [[Code]](https://github.com/mttr2021/MTTR)

- (arXiv 2021.11) Point-BERT: Pre-training 3D **Point Cloud** Transformers with Masked Point Modeling, [[Paper]](https://arxiv.org/pdf/2111.14819.pdf), [[Code]](https://github.com/lulutang0608/Point-BERT)

- (arXiv 2021.11) Zero-Shot **Image-to-Text Generation** for Visual-Semantic Arithmetic, [[Paper]](https://arxiv.org/pdf/2111.14447.pdf), [[Code]](https://github.com/YoadTew/zero-shot-image-to-text)

- (arXiv 2021.11) Blended Diffusion for **Text-driven Editing** of **Natural Images**, [[Paper]](https://arxiv.org/pdf/2111.14818.pdf), [[Code]](https://github.com/omriav/blended-diffusion)

- (arXiv 2021.11) Mask Transfiner for High-Quality **Instance Segmentation**, [[Paper]](https://arxiv.org/pdf/2111.13673.pdf), [[Code]](http://vis.xyz/pub/transfiner)

- (arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for **3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2111.12707.pdf), [[Code]](https://github.com/Vegetebird/MHFormer)

- (arXiv 2021.11) PeCo: Perceptual Codebook for **BERT Pre-training** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2111.12710.pdf), [[Code]](https://github.com/microsoft/PeCo)

- (arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast **High-Resolution Image Generation** from Vector-Quantized Codes, [[Paper]](https://arxiv.org/pdf/2111.12701.pdf), [[Code]](https://github.com/samb-t/unleashing-transformers)

- (arXiv 2021.11) Towards Tokenized **Human Dynamics** Representation, [[Paper]](https://arxiv.org/pdf/2111.11433.pdf), [[Code]](https://github.com/likenneth/acton)

- (arXiv 2021.11) **Self-slimmed** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2111.12624.pdf)

- (arXiv 2021.11) VIOLET: End-to-End **Video-Language** Transformers with Masked Visual-token Modeling, [[Paper]](https://arxiv.org/pdf/2111.12681.pdf), [[Code]](https://github.com/tsujuifu/pytorch_violet)

- (arXiv 2021.11) A Lightweight Graph Transformer Network for **Human Mesh Reconstruction** from 2D Human Pose, [[Paper]](https://arxiv.org/pdf/2111.12696.pdf)

- (arXiv 2021.11) MorphMLP: A Self-Attention Free, **MLP**-Like Backbone for Image and Video, [[Paper]](https://arxiv.org/pdf/2111.12527.pdf)

- (arXiv 2021.11) Octree Transformer: Autoregressive **3D Shape Generation** on Hierarchically Structured Sequences, [[Paper]](https://arxiv.org/pdf/2111.12480.pdf)

- (arXiv 2021.11) Hierarchical Modular Network for **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2111.12476.pdf)

- (arXiv 2021.11) NÜWA: **Visual Synthesis Pre-training** for Neural visUal World creAtion, [[Paper]](https://arxiv.org/pdf/2111.12417.pdf), [[Code]](https://github.com/microsoft/NUWA)

- (arXiv 2021.11) An Image Patch is a Wave: Phase-Aware Vision **MLP**, [[Paper]](https://arxiv.org/pdf/2111.12294.pdf)

- (arXiv 2021.11) PTQ4ViT: Post-Training **Quantization** Framework for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2111.12293.pdf)

- (arXiv 2021.11) PU-Transformer: **Point Cloud Upsampling** Transformer, [[Paper]](https://arxiv.org/pdf/2111.12242.pdf)

- (arXiv 2021.11) Scaling Up **Vision-Language Pre-training** for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2111.12233.pdf)

- (arXiv 2021.11) Cerberus Transformer: Joint **Semantic, Affordance and Attribute Parsing**, [[Paper]](https://arxiv.org/pdf/2111.12608.pdf), [[Code]](https://github.com/OPEN-AIR-SUN/Cerberus)

- (arXiv 2021.11) Efficient **Video** Transformers with Spatial-Temporal Token Selection, [[Paper]](https://arxiv.org/pdf/2111.11591.pdf)

- (arXiv 2021.11) RedCaps: Web-curated **image-text data** created by the people, for the people, [[Paper]](https://arxiv.org/pdf/2111.11431.pdf), [[Project]](https://redcaps.xyz/)

- (arXiv 2021.11) EMScore: Evaluating **Video Captioning** via Coarse-Grained and Fine-Grained Embedding Matching, [[Paper]](https://arxiv.org/pdf/2111.08919.pdf), [[Code]](https://github.com/ShiYaya/emscore)

- (arXiv 2021.11) Compositional Transformers for **Scene Generation**, [[Paper]](https://arxiv.org/pdf/2111.08960.pdf), [[Code]](https://github.com/dorarad/gansformer)

- (arXiv 2021.11) Vis-TOP: Visual Transformer **Overlay Processor**, [[Paper]](https://arxiv.org/pdf/2110.10957.pdf)

- (arXiv 2021.11) **Grounded Situation Recognition** with Transformers, [[Paper]](https://arxiv.org/pdf/2111.10135.pdf), [[Code]](https://github.com/jhcho99/gsrtr)

- (arXiv 2021.11) Rethinking **Query, Key, and Value** Embedding in Vision Transformer under **Tiny Model** Constraints, [[Paper]](https://arxiv.org/pdf/2111.10017.pdf)

- (arXiv 2021.11) UFO: A UniFied TransfOrmer for **Vision-Language** Representation Learning, [[Paper]](https://arxiv.org/pdf/2111.10023.pdf)

- (arXiv 2021.11) Advancing High-Resolution **Video-Language** Representation with Large-Scale Video Transcriptions, [[Paper]](https://arxiv.org/pdf/2111.10337.pdf)

- (arXiv 2021.11) Combined Scaling for **Zero-shot Transfer Learning**, [[Paper]](https://arxiv.org/pdf/2111.10050.pdf)

- (arXiv 2021.11) Improved **Robustness** of Vision Transformer via PreLayerNorm in Patch Embedding, [[Paper]](https://arxiv.org/pdf/2111.08413.pdf)

- (arXiv 2021.11) IBOT: **IMAGE BERT PRE-TRAINING** WITH ONLINE TOKENIZER, [[Paper]](https://arxiv.org/pdf/2111.07832.pdf), [[Code]](https://github.com/bytedance/ibot)

- (arXiv 2021.11) **Masked Autoencoders** Are Scalable Vision Learners, [[Paper]](https://arxiv.org/pdf/2111.06377.pdf)
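
  The masking step at the heart of masked-autoencoder pre-training (keep a small random subset of patch tokens for the encoder and reconstruct the rest) can be sketched in a few lines of PyTorch; the snippet below is a hedged sketch of random patch masking, not the authors' code:

  ```python
  import torch

  def random_masking(patches, mask_ratio=0.75):
      """Keep a random subset of patch tokens (sketch of MAE-style masking).

      patches: (B, N, D) patch embeddings.
      Returns the kept tokens, a binary mask (0 = kept, 1 = masked) and the
      indices needed to restore the original patch order in the decoder.
      """
      B, N, D = patches.shape
      n_keep = int(N * (1 - mask_ratio))
      noise = torch.rand(B, N, device=patches.device)   # one random score per patch
      ids_shuffle = noise.argsort(dim=1)                # random permutation
      ids_restore = ids_shuffle.argsort(dim=1)          # inverse permutation
      ids_keep = ids_shuffle[:, :n_keep]
      kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
      mask = torch.ones(B, N, device=patches.device)
      mask.scatter_(1, ids_keep, 0.0)                   # mark the kept positions
      return kept, mask, ids_restore
  ```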

- (arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient **Hyperspectral Image Reconstruction**, [[Paper]](https://arxiv.org/pdf/2111.07910.pdf)

- (arXiv 2021.11) Are Transformers More **Robust** Than CNNs?, [[Paper]](https://arxiv.org/pdf/2111.05464.pdf), [[Code]](https://github.com/ytongbai/ViTs-vs-CNNs)

- (arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for **Video-Text Retrieval**, [[Paper]](https://arxiv.org/pdf/2111.05610.pdf)

- (arXiv 2021.11) Multimodal Transformer with Variable-length Memory for **Vision-and-Language Navigation**, [[Paper]](https://arxiv.org/pdf/2111.05759.pdf)

- (arXiv 2021.11) VLMO: Unified **Vision-Language** Pre-Training with Mixture-of-Modality-Experts, [[Paper]](https://arxiv.org/pdf/2111.02358.pdf), [[Code]](https://aka.ms/vlmo)

- (arXiv 2021.11) LAION-400M: Open **Dataset** of **CLIP**-Filtered 400 Million **Image-Text** Pairs, [[Paper]](https://arxiv.org/pdf/2111.02114.pdf), [[Project]](https://laion.ai/laion-400-open-dataset/)

- (arXiv 2021.11) An Empirical Study of **Training** End-to-End **Vision-and-Language** Transformers, [[Paper]](https://arxiv.org/pdf/2111.02387.pdf), [[Code]](https://github.com/zdou0830/METER)

- (arXiv 2021.11) HRViT: **Multi-Scale High-Resolution** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2111.01236.pdf)

### 2021.10

- (arXiv 2021.10) **Visual Keyword Spotting** with Attention, [[Paper]](https://arxiv.org/pdf/2110.15957.pdf)

- (arXiv 2021.10) Learning **Co-segmentation** by Segment Swapping for Retrieval and Discovery, [[Paper]](https://arxiv.org/pdf/2110.15904.pdf), [[Data & Code]](http://imagine.enpc.fr/~shenx/SegSwap/)

- (arXiv 2021.10) Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal **Text-Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2110.15609.pdf), [[Code]](https://github.com/Lionel-Hing/VSR-Net)

- (arXiv 2021.10) Dispensed Transformer Network for **Unsupervised Domain Adaptation**, [[Paper]](https://arxiv.org/pdf/2110.14944.pdf)

- (arXiv 2021.10) Scatterbrain: Unifying **Sparse** and **Low-rank Attention** Approximation, [[Paper]](https://arxiv.org/pdf/2110.15343.pdf)

- (arXiv 2021.10) **3D Object Tracking** with Transformer, [[Paper]](https://arxiv.org/pdf/2110.14921.pdf), [[Code]](https://github.com/3bobo/lttr)

- (arXiv 2021.10) Blending **Anti-Aliasing** into Vision Transformer, [[Paper]](https://arxiv.org/pdf/2110.15156.pdf), [[Code]](https://github.com/amazon-research/anti-aliasing-transformer)

- (arXiv 2021.10) UltraPose: **Synthesizing** Dense Pose with 1 Billion Points by **Human-body** Decoupling **3D** Model, [[Paper]](https://arxiv.org/pdf/2110.15267.pdf), [[Data & Code]](https://github.com/MomoAILab/ultrapose)

- (arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for **Vision-and-Language Navigation**, [[Paper]](https://arxiv.org/pdf/2110.14143.pdf)

- (arXiv 2021.10) Bangla Image **Caption Generation** through CNN-Transformer based Encoder-Decoder Network, [[Paper]](https://arxiv.org/pdf/2110.12442.pdf)

- (arXiv 2021.10) History Aware Multimodal Transformer for **Vision-and-Language Navigation**, [[Paper]](https://arxiv.org/pdf/2110.13309.pdf), [[Project]](https://cshizhe.github.io/projects/vln_hamt.html)

- (arXiv 2021.10) TriBERT: Full-body Human-centric Audio-visual Representation Learning for **Visual Sound Separation**, [[Paper]](https://arxiv.org/pdf/2110.13412.pdf)

- (arXiv 2021.10) TNTC: TWO-STREAM NETWORK WITH TRANSFORMER-BASED COMPLEMENTARITY FOR GAIT-BASED **EMOTION RECOGNITION**, [[Paper]](https://arxiv.org/pdf/2110.13708.pdf)

- (arXiv 2021.10) Contextual Similarity Aggregation with Self-attention for **Visual Re-ranking**, [[Paper]](https://arxiv.org/pdf/2110.13430.pdf), [[Code]](https://github.com/MCC-WH/CSA)

- (arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for **Skeleton-Based Action Recognition**, [[Paper]](https://arxiv.org/pdf/2110.13385.pdf), [[Code]](https://github.com/qtwang0035/IIP-Transformer)

- (arXiv 2021.10) IMAGE-BASED **CLIP**-GUIDED ESSENCE TRANSFER, [[Paper]](https://arxiv.org/pdf/2110.12427.pdf), [[Code]](https://github.com/hila-chefer/TargetCLIP)

- (arXiv 2021.10) Sinkformers: Transformers with Doubly Stochastic **Attention**, [[Paper]](https://arxiv.org/pdf/2110.11773.pdf)

- (arXiv 2021.10) ILLITERATE **DALL·E** LEARNS TO COMPOSE, [[Paper]](https://arxiv.org/pdf/2110.11405.pdf), [[Project]](https://sites.google.com/view/slate-autoencoder), [[Code]](https://github.com/singhgautam/slate)

- (arXiv 2021.10) Learning Text-Image Joint Embedding for Efficient **Cross-Modal Retrieval** with Deep Feature Engineering, [[Paper]](https://arxiv.org/pdf/2110.11592.pdf)

- (arXiv 2021.10) SOFT: Softmax-free Transformer with **Linear Complexity**, [[Paper]](https://arxiv.org/pdf/2110.11945.pdf), [[Code]](https://fudan-zvg.github.io/SOFT)

- (arXiv 2021.10) Deep Two-Stream Video Inference for Human Body **Pose** and **Shape Estimation**, [[Paper]](https://arxiv.org/pdf/2110.11680.pdf)

- (arXiv 2021.10) TRANSFORMER **ACCELERATION** WITH DYNAMIC SPARSE ATTENTION, [[Paper]](https://arxiv.org/pdf/2110.11299.pdf)

- (arXiv 2021.10) CLOOB: MODERN **HOPFIELD** NETWORKS WITH INFOLOOB OUTPERFORM **CLIP**, [[Paper]](https://arxiv.org/pdf/2110.11316.pdf), [[Code]](https://github.com/ml-jku/cloob)

- (arXiv 2021.10) Integrating Visuospatial, Linguistic and Commonsense Structure into **Story Visualization**, [[Paper]](https://arxiv.org/pdf/2110.10834.pdf)

- (arXiv 2021.10) StructFormer: Learning Spatial Structure for **Language-Guided** Semantic **Rearrangement** of Novel Objects, [[Paper]](https://arxiv.org/pdf/2110.10189.pdf), [[Project]](https://sites.google.com/view/structformer)

- (arXiv 2021.10) Gophormer: Ego-**Graph** Transformer for **Node Classification**, [[Paper]](https://arxiv.org/pdf/2110.13094.pdf)

- (arXiv 2021.10) STRANSGAN: AN EMPIRICAL STUDY ON TRANSFORMER IN **GANS**, [[Paper]](https://arxiv.org/pdf/2110.13107.pdf), [[Code]](https://nbei.github.io/stransgan.html)

- (arXiv 2021.10) MVT: Multi-view Vision Transformer for **3D Object Recognition**, [[Paper]](https://arxiv.org/pdf/2110.13083.pdf)

- (arXiv 2021.10) DocTr: **Document Image** Transformer for Geometric Unwarping and Illumination Correction, [[Paper]](https://arxiv.org/pdf/2110.12942.pdf), [[Code]](https://github.com/fh2019ustc/DocTr)

- (arXiv 2021.10) WAV2CLIP: LEARNING ROBUST **AUDIO REPRESENTATIONS** FROM **CLIP**, [[Paper]](https://arxiv.org/pdf/2110.11499.pdf), [[Code]](https://github.com/descriptinc/lyrebird-wav2clip)

- (arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for **Medical Image Segmentation**, [[Paper]](https://arxiv.org/pdf/2110.10403.pdf)

- (arXiv 2021.10) AniFormer: Data-driven **3D Animation** with Transformer, [[Paper]](https://arxiv.org/pdf/2110.10533.pdf), [[Code]](https://github.com/mikecheninoulu/AniFormer)

- (arXiv 2021.10) **Few-Shot Temporal Action Localization** with Query Adaptive Transformer, [[Paper]](https://arxiv.org/pdf/2110.10552.pdf), [[Code]](https://github.com/sauradip/fewshotQAT)

- (arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for **Hyperspectral Image Classification**, [[Paper]](https://arxiv.org/pdf/2110.11084.pdf), [[Code]](https://github.com/xmm/3D-ANAS-V2)

- (arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared **Person Re-identification**, [[Paper]](https://arxiv.org/pdf/2110.08994.pdf)

- (arXiv 2021.10) 3D-RETR: End-to-End **Single and Multi-View 3D Reconstruction** with Transformers, [[Paper]](https://arxiv.org/pdf/2110.08861.pdf), [[Code]](https://github.com/FomalhautB/3D-RETR)

- (arXiv 2021.10) HRFormer: **High-Resolution** Transformer for **Dense Prediction**, [[Paper]](https://arxiv.org/pdf/2110.09408.pdf), [[Code]](https://github.com/HRNet/HRFormer)

- (arXiv 2021.10) Leveraging MoCap Data for **Human Mesh Recovery**, [[Paper]](https://arxiv.org/pdf/2110.09243.pdf)

- (arXiv 2021.10) A Good **Prompt** Is Worth Millions of Parameters? Low-resource Prompt-based Learning for **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2110.08484.pdf)

- (arXiv 2021.10) ASFormer: Transformer for **Action Segmentation**, [[Paper]](https://arxiv.org/pdf/2110.08568.pdf), [[Code]](https://github.com/ChinaYi/ASFormer)

- (arXiv 2021.10) Multimodal **Dialogue Response Generation**, [[Paper]](https://arxiv.org/pdf/2110.08515.pdf)

- (arXiv 2021.10) Understanding **Procedural Knowledge** by Sequencing Multimodal Instructional Manuals, [[Paper]](https://arxiv.org/pdf/2110.08486.pdf)

- (arXiv 2021.10) COMPOSITIONAL **ATTENTION**: DISENTANGLING SEARCH AND RETRIEVAL, [[Paper]](https://arxiv.org/pdf/2110.09419.pdf), [[Code]](https://github.com/sarthmit/Compositional-Attention)

- (arXiv 2021.10) Spatial-Temporal Transformer for 3D **Point Cloud Sequences**, [[Paper]](https://arxiv.org/pdf/2110.09783.pdf)

- (arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for **3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2110.09554.pdf), [[Code]](https://github.com/HowieMa/TransFusion-Pose)

- (arXiv 2021.10) Unifying Multimodal Transformer for **Bi-directional Image and Text Generation**, [[Paper]](https://arxiv.org/pdf/2110.09753.pdf)

- (arXiv 2021.10) Transformer with a Mixture of **Gaussian Keys**, [[Paper]](https://arxiv.org/pdf/2110.08678.pdf)

- (arXiv 2021.10) DIFFUSIONCLIP: **TEXT-GUIDED IMAGE MANIPULATION** USING DIFFUSION MODELS, [[Paper]](https://arxiv.org/pdf/2110.02711.pdf)

- (arXiv 2021.10) Adversarial **Robustness** Comparison of Vision Transformer and MLP-Mixer to CNNs, [[Paper]](https://arxiv.org/pdf/2110.02797.pdf), [[Code]](https://github.com/phibenz/robustness_comparison_vit_mlp-mixer_cnn)

- (arXiv 2021.10) RIPPLE ATTENTION FOR VISUAL PERCEPTION WITH **SUB-QUADRATIC COMPLEXITY**, [[Paper]](https://arxiv.org/pdf/2110.02453.pdf)

- (arXiv 2021.10) Certified Patch **Robustness** via Smoothed Vision Transformers, [[Paper]](https://arxiv.org/pdf/2110.07719.pdf), [[Code]](https://github.com/MadryLab/smoothed-vit)

- (arXiv 2021.10) CLIP-Forge: Towards Zero-Shot **Text-to-Shape** Generation, [[Paper]](https://arxiv.org/pdf/2110.02624.pdf)

- (arXiv 2021.10) Understanding and Improving **Robustness** of Vision Transformers through Patch-based Negative Augmentation, [[Paper]](https://arxiv.org/pdf/2110.07858.pdf)

- (arXiv 2021.10) SPARSE MOES MEET **EFFICIENT ENSEMBLES**, [[Paper]](https://arxiv.org/pdf/2110.03360.pdf)

- (arXiv 2021.10) Shared **Visual Representations** of Drawing for Communication: How do different **biases** affect human interpretability and intent? [[Paper]](https://arxiv.org/pdf/2110.08203.pdf)

- (arXiv 2021.10) SignBERT: Pre-Training of Hand-Model-Aware Representation for **Sign Language Recognition**, [[Paper]](https://arxiv.org/pdf/2110.05382.pdf)

- (arXiv 2021.10) Revitalizing CNN Attentions via Transformers in **Self-Supervised** Visual Representation Learning, [[Paper]](https://arxiv.org/pdf/2110.05340.pdf)

- (arXiv 2021.10) Investigating **Transfer Learning Capabilities** of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2110/2110.05270.pdf)

- (arXiv 2021.10) SUPERVISION EXISTS EVERYWHERE: A DATA EFFICIENT CONTRASTIVE **LANGUAGE-IMAGE** PRE-TRAINING PARADIGM, [[Paper]](https://arxiv.org/pdf/2110.05208.pdf), [[Code]](https://github.com/Sense-GVT/)

- (arXiv 2021.10) CLIP4Caption++: Multi-CLIP for **Video Caption**, [[Paper]](https://arxiv.org/pdf/2110.05204.pdf)

- (arXiv 2021.10) Transformer-based Dual Relation Graph for **Multi-label Image Recognition**, [[Paper]](https://arxiv.org/pdf/2110.04722.pdf)

- (arXiv 2021.10) VECTOR-QUANTIZED **IMAGE MODELING** WITH IMPROVED VQGAN, [[Paper]](https://arxiv.org/pdf/2110.04627.pdf)

- (arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for **3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2110.05092.pdf), [[Code]](https://github.com/lelexx/MTF-Transformer)

- (arXiv 2021.10) NVIT: VISION TRANSFORMER **COMPRESSION** AND **PARAMETER REDISTRIBUTION**, [[Paper]](https://arxiv.org/pdf/2110.04869.pdf)

- (arXiv 2021.10) 6D-ViT: Category-Level **6D Object Pose Estimation** via Transformer-based Instance Representation Learning, [[Paper]](https://arxiv.org/pdf/2110.04792.pdf)

- (arXiv 2021.10) CLIP-Adapter: Better **Vision-Language** Models with Feature Adapters, [[Paper]](https://arxiv.org/pdf/2110.04544.pdf), [[Code]](https://github.com/gaopengcuhk/CLIP-Adapter)

- (arXiv 2021.10) ATISS: Autoregressive Transformers for **Indoor Scene Synthesis**, [[Paper]](https://arxiv.org/pdf/2110.03675.pdf), [[Code]](https://nv-tlabs.github.io/ATISS)

- (arXiv 2021.10) MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND **MOBILE**-FRIENDLY VISION TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2110.02178.pdf)

- (arXiv 2021.10) **TOKEN POOLING** IN VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2110.03860.pdf)

- (arXiv 2021.10) VIDT: AN EFFICIENT AND EFFECTIVE FULLY TRANSFORMER-BASED **OBJECT DETECTOR**, [[Paper]](https://arxiv.org/pdf/2110.03921.pdf), [[Code]](https://github.com/naver-ai/vidt)

- (arXiv 2021.10) CLIP4Caption: CLIP for **Video Caption**, [[Paper]](https://arxiv.org/pdf/2110.06615.pdf)

- (arXiv 2021.10) **OBJECT**-REGION **VIDEO** TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2110.06915.pdf), [[Code]](https://roeiherz.github.io/ORViT/)

- (arXiv 2021.10) LEVERAGING **REDUNDANCY** IN ATTENTION WITH REUSE TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2110.06821.pdf)

- (arXiv 2021.10) **Dynamic Inference** with Neural Interpreters, [[Paper]](https://arxiv.org/pdf/2110.06399.pdf)

- (arXiv 2021.10) A CLIP-Enhanced Method for **Video-Language** Understanding, [[Paper]](https://arxiv.org/pdf/2110.07137.pdf)

- (arXiv 2021.10) **Visual Relationship Detection** Using Part-and-Sum Transformers with Composite Queries, [[Paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Dong_Visual_Relationship_Detection_Using_Part-and-Sum_Transformers_With_Composite_Queries_ICCV_2021_paper.pdf)

- (arXiv 2021.10) Discovering Human **Interactions** with Large-Vocabulary Objects via Query and Multi-Scale Detection, [[Paper]](https://openaccess.thecvf.com/content/ICCV2021/papers/Wang_Discovering_Human_Interactions_With_Large-Vocabulary_Objects_via_Query_and_Multi-Scale_ICCV_2021_paper.pdf)

- (arXiv 2021.10) Learning Structural Representations for **Recipe Generation** and **Food Retrieval**, [[Paper]](https://arxiv.org/pdf/2110.01209.pdf)

- (arXiv 2021.10) A FREE LUNCH FROM VIT: ADAPTIVE ATTENTION MULTI-SCALE FUSION TRANSFORMER FOR **FINE-GRAINED VISUAL RECOGNITION**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2110/2110.01240.pdf)

### 2021.09

- (arXiv 2021.09) Joint Multimedia **Event Extraction** from Video and Article, [[Paper]](https://arxiv.org/pdf/2109.12776.pdf)

- (arXiv 2021.09) Long-Range Transformers for **Dynamic Spatiotemporal Forecasting**, [[Paper]](https://arxiv.org/pdf/2109.12218.pdf)

- (arXiv 2021.09) **Visually Grounded Concept** Composition, [[Paper]](https://arxiv.org/pdf/2109.14115.pdf)

- (arXiv 2021.09) CoSeg: Cognitively Inspired Unsupervised Generic **Event Segmentation**, [[Paper]](https://arxiv.org/pdf/2109.15170.pdf)

- (arXiv 2021.09) CCTrans: Simplifying and Improving **Crowd Counting** with Transformer, [[Paper]](https://arxiv.org/pdf/2109.14483.pdf)

- (arXiv 2021.09) UFO-ViT: High Performance **Linear** Vision Transformer **without Softmax**, [[Paper]](https://arxiv.org/pdf/2109.14382.pdf)

- (arXiv 2021.09) **Infrared Small-Dim Target Detection** with Transformer under Complex Backgrounds, [[Paper]](https://arxiv.org/pdf/2109.14379.pdf)

- (arXiv 2021.09) **Localizing Objects** with Self-Supervised Transformers and no Labels, [[Paper]](https://arxiv.org/pdf/2109.14279.pdf), [[Code]](https://github.com/valeoai/LOST)

- (arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2109.14137.pdf)

- (arXiv 2021.09) VideoCLIP: Contrastive Pre-training for **Zero-shot Video-Text Understanding**, [[Paper]](https://arxiv.org/pdf/2109.14084.pdf), [[Code]](https://github.com/pytorch/fairseq/examples/MMPT)

- (arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of **State Variables in Ising Models**, [[Paper]](https://arxiv.org/pdf/2109.13925.pdf)

- (arXiv 2021.09) CLIP-It! Language-Guided **Video Summarization**, [[Paper]](https://arxiv.org/pdf/2107.00650.pdf), [[Project]](https://medhini.github.io/clip_it)

- (arXiv 2021.09) MFEVIT: A ROBUST LIGHTWEIGHT TRANSFORMER-BASED NETWORK FOR MULTIMODAL 2D+3D **FACIAL EXPRESSION RECOGNITION**, [[Paper]](https://arxiv.org/pdf/2109.13086.pdf)

- (arXiv 2021.09) Sparse Spatial Transformers for **Few-Shot Learning**, [[Paper]](https://arxiv.org/pdf/2109.12932.pdf), [[Code]](https://github.com/chenhaoxing/SSFormers)

- (arXiv 2021.09) Vision Transformer Hashing for **Image Retrieval**, [[Paper]](https://arxiv.org/pdf/2109.12564.pdf)

- (arXiv 2021.09) PETA: **Photo Albums Event Recognition** using Transformers Attention, [[Paper]](https://arxiv.org/pdf/2109.12499.pdf)

- (arXiv 2021.09) MLIM: **VISION-AND-LANGUAGE** MODEL PRE-TRAINING WITH MASKED LANGUAGE AND IMAGE MODELING, [[Paper]](https://arxiv.org/pdf/2109.12178.pdf)

- (arXiv 2021.09) Dense Contrastive **Visual-Linguistic** Pretraining, [[Paper]](https://arxiv.org/pdf/2109.11778.pdf)

- (arXiv 2021.09) CPT: COLORFUL **PROMPT TUNING** FOR PRE-TRAINED VISION-LANGUAGE MODELS, [[Paper]](https://arxiv.org/pdf/2109.11797.pdf)

- (arXiv 2021.09) Localizing ∞-shaped fishes: **Sketch-guided object localization** in the wild, [[Paper]](https://arxiv.org/pdf/2109.11874.pdf), [[Code]](https://github.com/priba/sgol_wild)

- (arXiv 2021.09) CLIPORT: What and Where Pathways for **Robotic Manipulation**, [[Paper]](https://arxiv.org/pdf/2109.12098.pdf), [[Project]](https://cliport.github.io/), [[Code]](https://github.com/cliport/cliport)

- (arXiv 2021.09) GraFormer: Graph Convolution Transformer for **3D Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2109.08364.pdf), [[Code]](https://github.com/Graformer/GraFormer)

- (arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for **Visual Dialogue Generation**, [[Paper]](https://arxiv.org/pdf/2109.08478.pdf)

- (arXiv 2021.09) Expression Snippet Transformer for Robust Video-based **Facial Expression Recognition**, [[Paper]](https://arxiv.org/pdf/2109.08409.pdf), [[Code]](https://anonymous.4open.science/r/ATSE-C58B)

- (arXiv 2021.09) LOTR: **Face Landmark Localization** Using Localization Transformer, [[Paper]](https://arxiv.org/pdf/2109.10057.pdf)

- (arXiv 2021.09) Dyadformer: A **Multi-modal** Transformer for Long-Range Modeling of Dyadic Interactions, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2109/2109.09487.pdf)

- (arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for **Dense Image Prediction**, [[Paper]](https://arxiv.org/pdf/2109.08963.pdf)

- (arXiv 2021.09) KD-VLP: Improving End-to-End **Vision-and-Language Pretraining** with Object Knowledge Distillation, [[Paper]](https://arxiv.org/pdf/2109.10504.pdf)

- (arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [[Paper]](https://arxiv.org/pdf/2109.10948.pdf)

- (arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for **Person Re-Identification**, [[Paper]](https://arxiv.org/pdf/2109.11159.pdf)

- (arXiv 2021.09) PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR **OBJECT DETECTION**, [[Paper]](https://arxiv.org/pdf/2109.10852.pdf)

- (arXiv 2021.09) ActionCLIP: A New Paradigm for **Video Action Recognition**, [[Paper]](https://arxiv.org/pdf/2109.08472.pdf)

- (arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for **Scene Graph Generation**, [[Paper]](https://arxiv.org/pdf/2109.05346.pdf)

- (arXiv 2021.09) Neural Human Performer: Learning Generalizable Radiance Fields for **Human Performance Rendering**, [[Paper]](https://arxiv.org/pdf/2109.07448.pdf), [[Code]](https://youngjoongunc.github.io/nhp/)

- (arXiv 2021.09) **Anchor DETR**: Query Design for Transformer-Based Detector, [[Paper]](https://arxiv.org/pdf/2109.07107.pdf), [[Code]](https://github.com/megvii-model/AnchorDETR)

- (arXiv 2021.09) An End-to-End Transformer Model for **3D Object Detection**, [[Paper]](https://arxiv.org/pdf/2109.08141.pdf), [[Code]](https://facebookresearch.github.io/3detr)

- (arXiv 2021.09) Hybrid Local-Global Transformer for **Image Dehazing**, [[Paper]](https://arxiv.org/pdf/2109.07100.pdf)

- (arXiv 2021.09) Semi-Supervised Wide-Angle **Portraits Correction** by Multi-Scale Transformer, [[Paper]](https://arxiv.org/pdf/2109.08024.pdf)

- (arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2109.07799.pdf)

- (arXiv 2021.09) Pose Transformers (POTR): **Human Motion Prediction** with Non-Autoregressive Transformers, [[Paper]](https://arxiv.org/pdf/2109.07531.pdf), [[Code]](https://github.com/idiap/potr)

- (arXiv 2021.09) PnP-DETR: Towards **Efficient** Visual Analysis with Transformers, [[Paper]](https://arxiv.org/pdf/2109.07036.pdf), [[Code]](https://github.com/twangnh/pnp-detr)

- (arXiv 2021.09) Learning to **Ground** Visual Objects for Visual Dialog, [[Paper]](https://arxiv.org/pdf/2109.06013.pdf)

- (arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for **Video Grounding**, [[Paper]](https://arxiv.org/pdf/2109.06085.pdf), [[Code]](https://sites.google.com/view/mengcao/publication/gtr)

- (arXiv 2021.09) CDTrans: Cross-domain Transformer for **Unsupervised Domain Adaptation**, [[Paper]](https://arxiv.org/pdf/2109.06165.pdf)

- (arXiv 2021.09) IS ATTENTION BETTER THAN **MATRIX DECOMPOSITION**? [[Paper]](https://arxiv.org/pdf/2109.04553.pdf), [[Code]](https://github.com/Gsunshine/Enjoy-Hamburger)

- (arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for **Video Question Answering**, [[Paper]](https://arxiv.org/pdf/2109.04735.pdf)

- (arXiv 2021.09) Line as a Visual Sentence: Context-aware **Line Descriptor** for Visual Localization, [[Paper]](https://arxiv.org/pdf/2109.04753.pdf)

- (arXiv 2021.09) Negative Sample Matters: A Renaissance of Metric Learning for **Temporal Grounding**, [[Paper]](https://arxiv.org/pdf/2109.04872.pdf)

- (arXiv 2021.09) LAViTeR: Learning Aligned **Visual and Textual** Representations Assisted by Image and Caption Generation, [[Paper]](https://arxiv.org/pdf/2109.04993.pdf), [[Code]](https://github.com/mshaikh2/LaViTeR)

- (arXiv 2021.09) Panoptic Narrative **Grounding**, [[Paper]](https://arxiv.org/pdf/2109.04988.pdf)

- (arXiv 2021.09) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based **VQA**, [[Paper]](https://arxiv.org/pdf/2109.05014.pdf)

- (arXiv 2021.09) PlaTe: **Visually-Grounded Planning** with Transformers in Procedural Tasks, [[Paper]](https://arxiv.org/pdf/2109.04869.pdf), [[Project]](https://www.pair.toronto.edu/plate-planner/)

- (arXiv 2021.09) **EfficientCLIP**: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling, [[Paper]](https://arxiv.org/pdf/2109.04699.pdf)

- (arXiv 2021.09) **Scaled ReLU** Matters for **Training** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2109.03810.pdf)

- (arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for **Video Inpainting**, [[Paper]](https://arxiv.org/pdf/2109.02974.pdf), [[Code]](https://github.com/ruiliu-ai/FuseFormer)

- (arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2109.02860.pdf)

- (arXiv 2021.09) WHYACT: Identifying **Action Reasons** in Lifestyle **Vlogs**, [[Paper]](https://arxiv.org/pdf/2109.02747.pdf)

- (arXiv 2021.09) Zero-Shot **Open Set Detection** by Extending **CLIP**, [[Paper]](https://arxiv.org/pdf/2109.02748.pdf)

- (arXiv 2021.09) Towards Transferable **Adversarial Attacks** on Vision Transformers, [[Paper]](https://arxiv.org/pdf/2109.04176.pdf)

- (arXiv 2021.09) Learning to **Prompt** for **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2109.01134), [[Code]](https://github.com/KaiyangZhou/CoOp)
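
  Prompt learning of this kind replaces hand-crafted prompt words with a handful of learnable context vectors that are prepended to the class-name token embeddings and fed to a frozen text encoder. The snippet below is a minimal, hypothetical PyTorch sketch of that idea; shapes and names are assumptions, see the CoOp repository linked above for the actual implementation:

  ```python
  import torch
  import torch.nn as nn

  class LearnablePrompt(nn.Module):
      """A few learnable context vectors shared across classes (sketch).

      The learned context is concatenated with fixed class-name token
      embeddings and passed to a frozen CLIP-like text encoder; only `ctx`
      receives gradients during prompt tuning.
      """
      def __init__(self, class_token_embeds, n_ctx=16):
          super().__init__()
          # class_token_embeds: (C, L, D) embeddings of the class-name tokens
          dim = class_token_embeds.shape[-1]
          self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))
          self.register_buffer("cls_embeds", class_token_embeds)

      def forward(self):
          C = self.cls_embeds.shape[0]
          ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)     # (C, n_ctx, D)
          return torch.cat([ctx, self.cls_embeds], dim=1)   # (C, n_ctx+L, D)
  ```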

- (arXiv 2021.09) Improving **Video-Text Retrieval** by Multi-Stream Corpus Alignment and Dual Softmax Loss, [[Paper]](https://arxiv.org/pdf/2109.04290.pdf), [[Code]](https://github.com/starmemda/CAMoW/)

- (arXiv 2021.09) UCTransNet: Rethinking the **Skip Connections in U-Net** from a Channel-wise Perspective with Transformer, [[Paper]](https://arxiv.org/pdf/2109.04335.pdf), [[Code]](https://github.com/McGregorWwww/UCTransNet)

- (arXiv 2021.09) ConvMLP: Hierarchical Convolutional **MLPs** for Vision, [[Paper]](https://arxiv.org/pdf/2109.04454.pdf), [[Code]](https://github.com/SHI-Labs/Convolutional-MLPs)

- (arXiv 2021.09) TxT: **Crossmodal** End-to-End Learning with Transformers, [[Paper]](https://arxiv.org/pdf/2109.04422.pdf)

- (arXiv 2021.09) Vision-and-Language or Vision-for-Language? On **Cross-Modal Influence** in Multimodal Transformers, [[Paper]](https://arxiv.org/pdf/2109.04448.pdf)

- (arXiv 2021.09) **Sparse**-MLP: A Fully-**MLP** Architecture with Conditional Computation, [[Paper]](https://arxiv.org/pdf/2109.02008.pdf)

- (arXiv 2021.09) SORNet: Spatial Object-Centric Representations for Sequential **Manipulation**, [[Paper]](https://arxiv.org/pdf/2109.03891.pdf), [[Project]](https://wentaoyuan.github.io/sornet)

- (arXiv 2021.09) Audio-Visual Transformer Based **Crowd Counting**, [[Paper]](https://arxiv.org/pdf/2109.01926.pdf)

- (arXiv 2021.09) Weakly Supervised Relative Spatial Reasoning for **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2109.01934.pdf), [[Code]](https://github.com/pratyay-banerjee/weak_sup_vqa)

- (arXiv 2021.09) FUSFORMER: A TRANSFORMER-BASED FUSION APPROACH FOR HYPERSPECTRAL IMAGE **SUPER-RESOLUTION**, [[Paper]](https://arxiv.org/pdf/2109.02079.pdf)

- (arXiv 2021.09) CTRL-C: **Camera calibration** TRansformer with Line-Classification, [[Paper]](https://arxiv.org/pdf/2109.02259.pdf), [[Code]](https://github.com/jwlee-vcl/CTRL-C)

- (arXiv 2021.09) Learning to Generate **Scene Graph** from Natural Language Supervision, [[Paper]](https://arxiv.org/pdf/2109.02227.pdf), [[Code]](https://github.com/YiwuZhong/SGG_from_NLS)

- (arXiv 2021.09) The Animation Transformer: Visual **Correspondence** via Segment Matching, [[Paper]](https://arxiv.org/pdf/2109.02614.pdf)

- (arXiv 2021.09) Voxel Transformer for **3D Object Detection**, [[Paper]](https://arxiv.org/pdf/2109.02497.pdf)

- (ICCV 2021.09) **3D Human Texture Estimation** from a Single Image with Transformers, [[Paper]](http://personal.ie.cuhk.edu.hk/~ccloy/files/iccv_2021_texformer.pdf), [[Code]](https://github.com/xuxy09/Texformer)

- (arXiv 2021.09) Encoder-decoder with Multi-level Attention for **3D Human Shape and Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2109.02303.pdf), [[Code]](https://github.com/ziniuwan/maed)

- (arXiv 2021.09) Joint Graph Learning and Matching for **Semantic Feature Correspondence**, [[Paper]](https://arxiv.org/pdf/2109.00240.pdf)

- (arXiv 2021.09) Searching for **Efficient** Multi-Stage Vision Transformers, [[Paper]](https://arxiv.org/pdf/2109.00642.pdf), [[Code]](https://github.com/yilunliao/vit-search)

### 2021.08

- (arXiv 2021.08) SIGN: Spatial-information Incorporated Generative Network for **Generalized Zero-shot Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2108.12517.pdf)

- (arXiv 2021.08) GroupFormer: **Group Activity Recognition** with Clustered Spatial-Temporal Transformer, [[Paper]](https://arxiv.org/pdf/2108.12630.pdf), [[Code]](https://github.com/xueyee/GroupFormer)

- (arXiv 2021.08) **A Battle of Network Structures**: An Empirical Study of CNN, Transformer, and MLP, [[Paper]](https://arxiv.org/pdf/2108.13002.pdf)

- (arXiv 2021.08) Exploring and Improving **Mobile** Level Vision Transformers, [[Paper]](https://arxiv.org/pdf/2108.13015.pdf)

- (arXiv 2021.08) Cross-category **Video Highlight Detection** via Set-based Learning, [[Paper]](https://arxiv.org/pdf/2108.11770.pdf), [[Code]](https://github.com/ChrisAllenMing/Cross_Category_Video_Highlight)

- (arXiv 2021.08) Shifted Chunk Transformer for **Spatio-Temporal** Representational Learning, [[Paper]](https://arxiv.org/pdf/2108.11575.pdf)

- (arXiv 2021.08) SASRA: Semantically-aware Spatio-temporal Reasoning Agent for **Vision-and-Language Navigation** in Continuous Environments, [[Paper]](https://arxiv.org/pdf/2108.11945.pdf)

- (arXiv 2021.08) LocTex: Learning **Data-Efficient** Visual **Representations** from Localized Textual Supervision, [[Paper]](https://arxiv.org/pdf/2108.11950.pdf), [[Project]](https://loctex.mit.edu/)

- (arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based **Detection** Heads, [[Paper]](https://arxiv.org/pdf/2108.09691.pdf)

- (arXiv 2021.08) SIMVLM: SIMPLE **VISUAL LANGUAGE** MODEL PRETRAINING WITH WEAK SUPERVISION, [[Paper]](https://arxiv.org/pdf/2108.10904.pdf)

- (arXiv 2021.08) TransFER: Learning Relation-aware **Facial Expression Representations** with Transformers, [[Paper]](https://arxiv.org/pdf/2108.11116)