https://github.com/DirtyHarryLYL/Transformer-in-Vision

# Transformer-in-Vision
Recent Transformer-based CV and related works. Comments and contributions are welcome!

The transformer is now a basic component adopted in nearly all AI models, so this list has moved from regular updates to irregular updates.

New Hope: [LLM-in-Vision](https://github.com/DirtyHarryLYL/LLM-in-Vision)

## Resource

- **ChatGPT** for **Robotics**: Design Principles and Model Abilities, [[Paper]](https://www.microsoft.com/en-us/research/uploads/prod/2023/02/ChatGPT___Robotics.pdf), [[Code]](https://github.com/microsoft/PromptCraft-Robotics)

- DIFFUSIONDB [[Page]](https://poloclub.github.io/diffusiondb), [[Paper]](https://arxiv.org/pdf/2210.14896.pdf)

- LAION-5B [[Page]](https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/), [[Paper]](https://arxiv.org/pdf/2210.08402.pdf)

- LAVIS [[Page]](https://github.com/salesforce/LAVIS), [[Paper]](https://arxiv.org/pdf/2209.09019.pdf)

- Imagen Video [[Page]](https://imagen.research.google/video/), [[Paper]](https://imagen.research.google/video/paper.pdf)

- Phenaki [[Page]](https://phenaki.video/), [[Paper]](https://openreview.net/pdf?id=vOEXS39nOF)

- DREAMFUSION [[Page]](https://dreamfusion3d.github.io/), [[Paper]](https://arxiv.org/pdf/2209.14988.pdf)

- MAKE-A-VIDEO [[Page]](https://make-a-video.github.io/), [[Paper]](https://arxiv.org/pdf/2209.14792.pdf)

- Stable Diffusion [[Page]](https://ommer-lab.com/research/latent-diffusion-models/), [[Paper]](https://arxiv.org/pdf/2112.10752.pdf)

- NUWA-Infinity [[Page]](https://nuwa-infinity.microsoft.com/#/), [[Paper]](https://arxiv.org/pdf/2207.09814.pdf)

- Parti [[Page]](https://parti.research.google/), [[Code]](https://github.com/google-research/parti)

- Imagen [[Page]](https://imagen.research.google/), [[Paper]](https://arxiv.org/pdf/2205.11487.pdf)

- Gato: A Generalist Agent, [[Paper]](https://storage.googleapis.com/deepmind-media/A%20Generalist%20Agent/Generalist%20Agent.pdf)

- PaLM: Scaling Language Modeling with Pathways, [[Paper]](https://arxiv.org/pdf/2204.02311.pdf)

- DALL·E 2 [[Page]](https://openai.com/dall-e-2/), [[Paper]](https://cdn.openai.com/papers/dall-e-2.pdf)

- SCENIC: A JAX Library for Computer Vision Research and Beyond, [[Code]](https://github.com/google-research/scenic)

- Vision-language (V-L) joint learning studies (with good tables): [[METER]](https://arxiv.org/pdf/2111.02387.pdf), [[Kaleido-BERT]](https://arxiv.org/pdf/2103.16110.pdf)

- Attention is all you need, [[Paper]](https://arxiv.org/pdf/1706.03762.pdf)

- CLIP [[Page]](https://openai.com/blog/clip/), [[Paper]](https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf), [[Code]](https://github.com/openai/CLIP), [[arXiv]](https://arxiv.org/pdf/2103.00020.pdf)

- DALL·E [[Page]](https://openai.com/blog/dall-e/), [[Code]](https://github.com/openai/DALL-E), [[Paper]](https://arxiv.org/pdf/2102.12092.pdf)

- [huggingface/transformers](https://github.com/huggingface/transformers)

- [Kyubyong/transformer](https://github.com/Kyubyong/transformer), TF

- [jadore801120/attention-is-all-you-need-pytorch](https://github.com/jadore801120/attention-is-all-you-need-pytorch), Torch

- [krasserm/fairseq-image-captioning](https://github.com/krasserm/fairseq-image-captioning)

- [PyTorch Transformers Tutorials](https://github.com/abhimishra91/transformers-tutorials)

- [ictnlp/awesome-transformer](https://github.com/ictnlp/awesome-transformer)

- [basicv8vc/awesome-transformer](https://github.com/basicv8vc/awesome-transformer)

- [dk-liang/Awesome-Visual-Transformer](https://github.com/dk-liang/Awesome-Visual-Transformer)

- [yuewang-cuhk/awesome-vision-language-pretraining-papers](https://github.com/yuewang-cuhk/awesome-vision-language-pretraining-papers)

## Survey

- (arXiv 2023.2) TRANSFORMER-BASED **SENSOR FUSION** FOR **AUTONOMOUS DRIVING**: A SURVEY, [[Paper]](https://arxiv.org/pdf/2302.11481.pdf), [[Page]](https://github.com/ApoorvRoboticist/Transformers-Sensor-Fusion)

- (arXiv 2023.2) Deep Learning for **Video-Text Retrieval**: a Review, [[Paper]](https://arxiv.org/pdf/2302.12552.pdf)

- (arXiv 2023.2) Large-scale **Multi-Modal Pre-trained Models**: A Comprehensive Survey, [[Paper]](https://arxiv.org/pdf/2302.10035.pdf)

- (arXiv 2023.2) Transformer-based **Generative Adversarial Networks** in Computer Vision: A Comprehensive Survey, [[Paper]](https://arxiv.org/pdf/2302.08641.pdf)

- (arXiv 2023.2) **Knowledge Distillation** in Vision Transformers: A Critical Review, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2302/2302.02108.pdf)

- (arXiv 2023.2) A Survey on **Efficient Training** of Transformers, [[Paper]](https://arxiv.org/pdf/2302.01107.pdf)

- (arXiv 2023.1) ChatGPT is not all you need. A State of the Art Review of **large Generative AI models**, [[Paper]](https://arxiv.org/pdf/2301.04655.pdf)

- (arXiv 2022.12) Transformers in **Action Recognition**: A Review on Temporal Modeling, [[Paper]](https://arxiv.org/pdf/2302.01921.pdf)

- (arXiv 2022.11) Vision Transformers in **Medical Imaging**: A Review, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.10043.pdf)

- (arXiv 2022.11) A survey on **knowledge**-enhanced **multimodal** learning, [[Paper]](https://arxiv.org/pdf/2211.12328.pdf)

- (arXiv 2022.10) Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, [[Paper]](https://arxiv.org/pdf/2210.09263.pdf)

- (arXiv 2022.10) A Survey on Graph Neural Networks and **Graph** Transformers in Computer Vision: A Task-Oriented Perspective, [[Paper]](https://arxiv.org/pdf/2209.13232.pdf)

- (arXiv 2022.09) VISION TRANSFORMERS FOR **ACTION RECOGNITION**: A SURVEY, [[Paper]](https://arxiv.org/pdf/2209.05700.pdf)

- (arXiv 2022.09) Transformers in **Remote Sensing**: A Survey, [[Paper]](https://arxiv.org/pdf/2209.01206.pdf), [[Code]](https://github.com/VIROBO-15/Transformer-in-Remote-Sensing)

- (arXiv 2022.08) **3D Vision** with Transformers: A Survey, [[Paper]](https://arxiv.org/pdf/2208.04309.pdf), [[Code]](https://github.com/lahoud/3d-vision-transformers)

- (arXiv 2022.08) A Survey on **Masked Autoencoder** for Self-supervised Learning in Vision and Beyond, [[Paper]](https://arxiv.org/pdf/2208.00173.pdf)

- (arXiv 2022.07) **Vision** Transformers: State of the Art and Research Challenges, [[Paper]](https://arxiv.org/pdf/2207.03041.pdf)

- (arXiv 2022.07) **SELF-SUPERVISED** LEARNING FOR **VIDEOS**: A SURVEY, [[Paper]](https://arxiv.org/pdf/2207.00419.pdf)

- (arXiv 2022.06) **Multimodal** Learning with Transformers: A Survey, [[Paper]](https://arxiv.org/pdf/2206.06488.pdf)

- (arXiv 2022.05) Vision Transformer: **Vit** and its **Derivatives**, [[Paper]](https://arxiv.org/pdf/2205.11239.pdf)

- (arXiv 2022.05) Transformers in 3D **Point Clouds**: A Survey, [[Paper]](https://arxiv.org/pdf/2205.07417.pdf)

- (arXiv 2022.04) **Visual Attention** Methods in Deep Learning: An In-Depth Survey, [[Paper]](https://arxiv.org/pdf/2204.07756.pdf)

- (arXiv 2022.04) **Vision-and-Language** Pretrained Models: A Survey, [[Paper]](https://arxiv.org/pdf/2204.07356.pdf)

- (arXiv 2022.03) A Roadmap for **Big Model**, [[Paper]](https://arxiv.org/pdf/2203.14101.pdf)

- (arXiv 2022.03) Transformers Meet **Visual** Learning Understanding: A Comprehensive Review, [[Paper]](https://arxiv.org/pdf/2203.12944.pdf)

- (arXiv 2022.03) Recent Advances in **Vision** Transformer: A Survey and Outlook of Recent Work, [[Paper]](https://arxiv.org/pdf/2203.01536.pdf), [[Project]](https://github.com/khawar512/ViT-Survey)

- (arXiv 2022.02) A Survey of **Vision-Language** Pre-Trained Models, [[Paper]](https://arxiv.org/pdf/2202.10936.pdf)

- (arXiv 2022.02) VLP: A Survey on **Vision-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2202.09061.pdf)

- (arXiv 2022.02) Transformer for **Graphs**: An Overview from Architecture Perspective, [[Paper]](https://arxiv.org/pdf/2202.08455.pdf)

- (arXiv 2022.01) **Video** Transformers: A Survey, [[Paper]](https://arxiv.org/pdf/2201.05991.pdf)

- (arXiv 2021.11) ARE WE READY FOR A NEW PARADIGM SHIFT? A SURVEY ON VISUAL DEEP **MLP**, [[Paper]](https://arxiv.org/pdf/2111.04060.pdf)

- (arXiv 2021.11) A Survey of **Visual** Transformers, [[Paper]](https://arxiv.org/pdf/2111.06091.pdf)

- (arXiv 2021.09) Survey: Transformer based **Video-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2109.09920.pdf)

- (arXiv 2021.06) A Survey of **Transformers**, [[Paper]](https://arxiv.org/pdf/2106.04554.pdf)

- (arXiv 2021.06) **Attention** mechanisms and deep learning for machine vision: A survey of the state of the art, [[Paper]](https://arxiv.org/pdf/2106.07550.pdf)

- (arXiv 2021.06) **Pre-Trained Models**: Past, Present and Future, [[Paper]](https://arxiv.org/pdf/2106.07139.pdf)

- (arXiv 2021.05) Can Attention Enable **MLPs** To Catch Up With CNNs? [[Paper]](https://arxiv.org/pdf/2105.15078.pdf)

- (arXiv 2021.03) A Practical Survey on **Faster** and **Lighter** Transformers, [[Paper]](https://arxiv.org/pdf/2103.14636.pdf)

- (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with **Language and Vision**, [[Paper]](https://arxiv.org/pdf/2103.04037.pdf)

- (arXiv 2021.01) A Survey on **Visual** Transformer, [[Paper]](https://arxiv.org/pdf/2012.12556.pdf)

- (arXiv 2020.9) **Efficient** Transformers: A Survey, [[Paper]](https://arxiv.org/pdf/2009.06732.pdf)

- (arXiv 2021.1) **Transformers in Vision**: A Survey, [[Paper]](https://arxiv.org/pdf/2101.01169.pdf)

## Recent Papers

### 2023.8

- (arXiv 2023.8) VL-PET: Vision-and-Language Parameter-**Efficient Tuning** via Granularity Control, [[Paper]](https://arxiv.org/pdf/2308.09804), [[Project]](https://henryhzy.github.io/VL-PET/)

### 2023.5

- (arXiv 2023.5) Understanding Gaussian **Attention** Bias of Vision Transformers Using Effective Receptive Fields, [[Paper]](https://arxiv.org/pdf/2305.04722.pdf)

### 2023.3

- (arXiv 2023.3) Query-Dependent **Video** Representation for **Moment Retrieval** and **Highlight Detection**, [[Paper]](https://arxiv.org/pdf/2303.13874.pdf), [[Code]](https://github.com/wjun0830/QD-DETR)

### 2023.2

- (arXiv 2023.2) **Open-domain Visual Entity Recognition**: Towards Recognizing Millions of Wikipedia Entities, [[Paper]](https://arxiv.org/pdf/2302.11154.pdf)

- (arXiv 2023.2) KS-DETR: Knowledge Sharing in Attention Learning for **Detection** Transformer, [[Paper]](https://arxiv.org/pdf/2302.11208.pdf), [[Code]](https://github.com/edocanonymous/KS-DETR)

- (arXiv 2023.2) HUMAN MOTIONFORMER: **TRANSFERRING** HUMAN **MOTIONS** WITH VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2302.11306.pdf), [[Code]](https://github.com/KumapowerLIU/Human-MotionFormer)

- (arXiv 2023.2) Aligning **Text-to-Image** Models using **Human Feedback**, [[Paper]](https://arxiv.org/pdf/2302.12192.pdf)

- (arXiv 2023.2) Controlled and Conditional **Text to Image** Generation with Diffusion Prior, [[Paper]](https://arxiv.org/pdf/2302.11710.pdf)

- (arXiv 2023.2) Can Pre-trained Vision and Language Models Answer **Visual Information-Seeking Questions**? [[Paper]](https://arxiv.org/pdf/2302.11713.pdf), [[Code]](https://open-vison-language.github.io/infoseek)

- (arXiv 2023.2) OBJECT-CENTRIC **VIDEO PREDICTION** VIA DECOUPLING OF OBJECT DYNAMICS AND INTERACTIONS, [[Paper]](https://arxiv.org/pdf/2302.11850.pdf), [[Project]](https://sites.google.com/view/ocvp-vp)

- (arXiv 2023.2) Distribution Normalization: An “Effortless” **Test-Time Augmentation** for Contrastively Learned **Visual-language** Models, [[Paper]](https://arxiv.org/pdf/2302.11084.pdf), [[Code]](https://github.com/fengyuli2002/distribution-normalization)

- (arXiv 2023.2) Teaching **CLIP** to **Count** to Ten, [[Paper]](https://arxiv.org/pdf/2302.12066.pdf), [[Project]](https://teaching-clip-to-count.github.io/)

- (arXiv 2023.2) Designing an Encoder for Fast Personalization of **Text-to-Image** Models, [[Paper]](https://arxiv.org/pdf/2302.12228.pdf), [[Project]](https://tuning-encoder.github.io/)

- (arXiv 2023.2) Side Adapter Network for **Open-Vocabulary Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2302.12242.pdf), [[Code]](https://github.com/MendelXu/SAN)

- (arXiv 2023.2) Learning Visual Representations via **Language-Guided Sampling**, [[Paper]](https://arxiv.org/pdf/2302.12248.pdf)

- (arXiv 2023.2) VoxFormer: Sparse Voxel Transformer for Camera-based **3D Semantic Scene Completion**, [[Paper]](https://arxiv.org/pdf/2302.12251.pdf), [[Code]](https://github.com/NVlabs/VoxFormer)

- (arXiv 2023.2) Language-Driven Representation Learning for **Robotics**, [[Paper]](https://arxiv.org/pdf/2302.12766.pdf), [[Project]](https://sites.google.com/view/voltron-robotics)

- (arXiv 2023.2) A Convolutional Vision Transformer for **Semantic Segmentation** of Side-Scan **Sonar** Data, [[Paper]](https://arxiv.org/pdf/2302.12416.pdf), [[Code]](https://github.com/hayatrajani/s3seg-vit)

- (arXiv 2023.2) **Lightweight** Real-time Semantic **Segmentation** Network with Efficient Transformer and CNN, [[Paper]](https://arxiv.org/pdf/2302.10484.pdf), [[Code]](https://github.com/IVIPLab/LETNet)

- (arXiv 2023.2) VIEWCO: DISCOVERING **TEXT-SUPERVISED** **SEGMENTATION** MASKS VIA MULTI-VIEW SEMANTIC CONSISTENCY, [[Paper]](https://arxiv.org/pdf/2302.10307.pdf), [[Code]](https://github.com/pzhren/ViewCo)

- (arXiv 2023.2) CertViT: Certified **Robustness** of Pre-Trained Vision Transformers, [[Paper]](https://arxiv.org/pdf/2302.10287.pdf), [[Code]](https://github.com/sagarverma/transformer-lipschitz)

- (arXiv 2023.2) Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for **Grounding Viewpoint Descriptions**, [[Paper]](https://arxiv.org/pdf/2302.10282.pdf)

- (arXiv 2023.2) MaskedKD: Efficient **Distillation** of Vision Transformers with **Masked** Images, [[Paper]](https://arxiv.org/pdf/2302.10494.pdf)

- (arXiv 2023.2) A General Visual Representation Guided Framework with Global Affinity for **Weakly Supervised Salient Object Detection**, [[Paper]](https://arxiv.org/pdf/2302.10697.pdf)

- (arXiv 2023.2) ViTA: A Vision Transformer **Inference Accelerator** for **Edge** Applications, [[Paper]](https://arxiv.org/pdf/2302.09108.pdf)

- (arXiv 2023.2) **Video Action Recognition** Collaborative Learning with Dynamics via PSO-ConvNet Transformer, [[Paper]](https://arxiv.org/pdf/2302.09187.pdf), [[Code]](https://github.com/leonlha/Video-Action-Recognition-via-PSO-ConvNet-Transformer-Collaborative-Learning-with-Dynamics)

- (arXiv 2023.2) A Pilot **Evaluation** of ChatGPT and DALL-E 2 on **Decision Making** and **Spatial Reasoning**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2302/2302.09068.pdf)

- (arXiv 2023.2) StyLIP: Multi-Scale Style-Conditioned Prompt Learning for **CLIP**-based **Domain Generalization**, [[Paper]](https://arxiv.org/pdf/2302.09251.pdf)

- (arXiv 2023.2) Meta Style Adversarial Training for Cross-Domain **Few-Shot** Learning, [[Paper]](https://arxiv.org/pdf/2302.09309.pdf)

- (arXiv 2023.2) HYNETER: HYBRID NETWORK TRANSFORMER FOR OBJECT **DETECTION**, [[Paper]](https://arxiv.org/pdf/2302.09365.pdf)

- (arXiv 2023.2) STOA-VLP: Spatial-Temporal Modeling of Object and Action for **Video-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2302.09736.pdf)

- (arXiv 2023.2) Constraint and Union for Partially-Supervised **Temporal Sentence Grounding**, [[Paper]](https://arxiv.org/pdf/2302.09850.pdf)

- (arXiv 2023.2) STB-VMM: Swin Transformer Based **Video Motion Magnification**, [[Paper]](https://arxiv.org/pdf/2302.10001.pdf)

- (arXiv 2023.2) **Fashion Image Retrieval** with Multi-Granular Alignment, [[Paper]](https://arxiv.org/pdf/2302.08902.pdf)

- (arXiv 2023.2) LayoutDiffuse: Adapting Foundational Diffusion Models for **Layout-to-Image Generation**, [[Paper]](https://arxiv.org/pdf/2302.08908.pdf)

- (arXiv 2023.2) CK-Transformer: Commonsense Knowledge Enhanced Transformers for **Referring Expression Comprehension**, [[Paper]](https://arxiv.org/pdf/2302.09027.pdf), [[Code]](https://github.com/FightingFighting/CK-Transformer)

- (arXiv 2023.2) MaskSketch: Unpaired Structure-guided Masked **Image Generation**, [[Paper]](https://arxiv.org/pdf/2302.05496.pdf)

- (arXiv 2023.2) Single **Motion** **Diffusion**, [[Paper]](https://arxiv.org/pdf/2302.05905.pdf), [[Code]](https://sinmdm.github.io/SinMDM-page)

- (arXiv 2023.2) Tri-Perspective View for Vision-Based **3D Semantic Occupancy Prediction**, [[Paper]](https://arxiv.org/pdf/2302.07817.pdf), [[Code]](https://github.com/wzzheng/TPVFormer)

- (arXiv 2023.2) ANSEL Photobot: A **Robot** **Event Photographer** with Semantic Intelligence, [[Paper]](https://arxiv.org/pdf/2302.07931.pdf)

- (arXiv 2023.2) ForceFormer: Exploring Social Force and Transformer for **Pedestrian Trajectory Prediction**, [[Paper]](https://arxiv.org/pdf/2302.07583.pdf)

- (arXiv 2023.2) **Video** Probabilistic **Diffusion** Models in Projected Latent Space, [[Paper]](https://arxiv.org/pdf/2302.07685.pdf)

- (arXiv 2023.2) Dataset Interfaces: **Diagnosing Model Failures** Using Controllable Counterfactual Generation, [[Paper]](https://arxiv.org/pdf/2302.07865.pdf), [[Code]](https://github.com/MadryLab/dataset-interfaces)

- (arXiv 2023.2) Learning to Substitute Ingredients in **Recipes**, [[Paper]](https://arxiv.org/pdf/2302.07960.pdf)

- (arXiv 2023.2) **Energy** Transformer, [[Paper]](https://arxiv.org/pdf/2302.07253.pdf)

- (arXiv 2023.2) Efficiency 360: **Efficient** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2302.08374.pdf)

- (arXiv 2023.2) A-la-carte **Prompt Tuning** (APT): Combining Distinct Data Via Composable Prompting, [[Paper]](https://arxiv.org/pdf/2302.07994.pdf)

- (arXiv 2023.2) Effective Data **Augmentation** With **Diffusion** Models, [[Paper]](https://arxiv.org/pdf/2302.07944.pdf), [[Project]](https://btrabuc.co/da-fusion)

- (arXiv 2023.2) PRedItOR: Text Guided **Image Editing** with Diffusion Prior, [[Paper]](https://arxiv.org/pdf/2302.07979.pdf)

- (arXiv 2023.2) TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual Vision Transformer for Fast Arbitrary **One-Shot Image Generation**, [[Paper]](https://arxiv.org/pdf/2302.08047.pdf)

- (arXiv 2023.2) Hierarchical Cross-modal Transformer for **RGB-D Salient Object Detection**, [[Paper]](https://arxiv.org/pdf/2302.08052.pdf)

- (arXiv 2023.2) MINOTAUR: Multi-task **Video Grounding** From Multimodal Queries, [[Paper]](https://arxiv.org/pdf/2302.08063.pdf)

- (arXiv 2023.2) Towards **Efficient** Visual **Adaption** via Structural Re-parameterization, [[Paper]](https://arxiv.org/pdf/2302.08106.pdf), [[Code]](https://github.com/luogen1996/RepAdapter)

- (arXiv 2023.2) Efficient **3D Object Reconstruction** using Visual Transformers, [[Paper]](https://arxiv.org/pdf/2302.08474.pdf)

- (arXiv 2023.2) Retrieval-augmented Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2302.08268.pdf)

- (arXiv 2023.2) Robust Human **Motion Forecasting** using Transformer-based Model, [[Paper]](https://arxiv.org/pdf/2302.08274.pdf)

- (arXiv 2023.2) VQ3D: Learning a **3D**-Aware **Generative** Model on ImageNet, [[Paper]](https://arxiv.org/pdf/2302.06833.pdf), [[Project]](https://kylesargent.github.io/vq3d)

- (arXiv 2023.2) UKnow: A Unified Knowledge Protocol for **Common-Sense Reasoning** and **Vision-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2302.06891.pdf), [[Code]](https://github.com/Gongggg/UKnow)

- (arXiv 2023.2) A **THEORETICAL** UNDERSTANDING OF **SHALLOW** VISION TRANSFORMERS: LEARNING, GENERALIZATION, AND SAMPLE COMPLEXITY, [[Paper]](https://arxiv.org/pdf/2302.06015.pdf)

- (arXiv 2023.2) A Simple Zero-shot Prompt Weighting Technique to Improve **Prompt** Ensembling in **Text-Image** Models, [[Paper]](https://arxiv.org/pdf/2302.06235.pdf)

- (arXiv 2023.2) Generalized Few-Shot **Continual Learning** with Contrastive Mixture of Adapters, [[Paper]](https://arxiv.org/pdf/2302.05936.pdf), [[Code]](https://github.com/yawencui/CMoA)

- (arXiv 2023.2) Actional Atomic-Concept Learning for Demystifying **Vision-Language Navigation**, [[Paper]](https://arxiv.org/pdf/2302.06072.pdf)

- (arXiv 2023.2) Towards Local Visual Modeling for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2302.06098.pdf), [[Code]](https://github.com/xmu-xiaoma666/LSTNet)

- (arXiv 2023.2) CLIP-RR: IMPROVED CLIP NETWORK FOR RELATION-FOCUSED **CROSS-MODAL INFORMATION RETRIEVAL**, [[Paper]](https://arxiv.org/pdf/2302.06350.pdf)

- (arXiv 2023.2) **Anticipating** Next Active Objects for **Egocentric Videos**, [[Paper]](https://arxiv.org/pdf/2302.06358.pdf), [[Code]]()

- (arXiv 2023.2) UniAdapter: Unified Parameter-Efficient Transfer Learning for **Cross-modal Modeling**, [[Paper]](https://arxiv.org/pdf/2302.06605.pdf), [[Code]](https://github.com/RERV/UniAdapter)

- (arXiv 2023.2) TEAM **DETR**: GUIDE QUERIES AS A PROFESSIONAL TEAM IN DETECTION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2302.07116.pdf), [[Code]](https://github.com/horrible-dong/TeamDETR)

- (arXiv 2023.2) ConceptFusion: Open-set **Multimodal** **3D Mapping**, [[Paper]](https://arxiv.org/pdf/2302.07241.pdf), [[Project]](https://concept-fusion.github.io/)

- (arXiv 2023.2) Team Triple-Check at Factify 2: Parameter-Efficient Large Foundation Models with Feature Representations for **Multi-Modal Fact Verification**, [[Paper]](https://arxiv.org/pdf/2302.07740.pdf), [[Code]](https://github.com/wwweiwei/Pre-CoFactv2-AAAI-2023)

- (arXiv 2023.2) PolyFormer: Referring Image **Segmentation** as Sequential Polygon Generation, [[Paper]](https://arxiv.org/pdf/2302.07387.pdf)

- (arXiv 2023.2) Pose-Oriented Transformer with Uncertainty-Guided Refinement for **2D-to-3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2302.07408.pdf)

- (arXiv 2023.2) TFormer: A Transmission-Friendly ViT Model for **IoT** Devices, [[Paper]](https://arxiv.org/pdf/2302.07734.pdf), [[Code]]()

- (arXiv 2023.2) Adding Conditional Control to **Text-to-Image Diffusion** Models, [[Paper]](https://arxiv.org/pdf/2302.05543.pdf), [[Code]](https://github.com/lllyasviel/ControlNet)

- (arXiv 2023.2) Invariant **Slot Attention**: **Object Discovery** with Slot-Centric Reference Frames, [[Paper]](https://arxiv.org/pdf/2302.04973.pdf)

- (arXiv 2023.2) IS MULTI-MODAL **VISION** SUPERVISION **BENEFICIAL** TO **LANGUAGE**? [[Paper]](https://arxiv.org/pdf/2302.05016.pdf)

- (arXiv 2023.2) Data-Driven **Stochastic Motion Evaluation** and **Optimization** with Image by Spatially-Aligned Temporal Encoding, [[Paper]](https://arxiv.org/pdf/2302.05041.pdf)

- (arXiv 2023.2) **Scaling** Vision Transformers to **22 Billion Parameters**, [[Paper]](https://arxiv.org/pdf/2302.05442.pdf)

- (arXiv 2023.2) Adapting **Pre-trained** Vision Transformers from **2D to 3D** through Weight Inflation Improves Medical Image Segmentation, [[Paper]](https://arxiv.org/pdf/2302.04303.pdf), [[Code]](https://github.com/yuhui-zh15/TransSeg)

- (arXiv 2023.2) Mitigating **Bias** in Visual Transformers via Targeted Alignment, [[Paper]](https://arxiv.org/pdf/2302.04358.pdf)

- (arXiv 2023.2) IH-ViT: Vision Transformer-based **Integrated Circuit Appearance Defect Detection**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2302/2302.04521.pdf)

- (arXiv 2023.2) Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2302.04858.pdf)

- (arXiv 2023.2) Learning by Asking for **Embodied** Visual **Navigation** and **Task Completion**, [[Paper]](https://arxiv.org/pdf/2302.04865.pdf)

- (arXiv 2023.2) **Reversible** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2302.04869.pdf), [[Code1]](https://github.com/facebookresearch/slowfast), [[Code2]](https://github.com/karttikeya/minREV)

- (arXiv 2023.2) Neural Congealing: **Aligning Images** to a Joint **Semantic Atlas**, [[Paper]](https://arxiv.org/pdf/2302.03956.pdf), [[Project]](https://neural-congealing.github.io/)

- (arXiv 2023.2) **Adversarial Prompting** for Black Box Foundation Models, [[Paper]](https://arxiv.org/pdf/2302.04237.pdf)

- (arXiv 2023.2) Understanding Why ViT **Trains** Badly on **Small Datasets**: An Intuitive Perspective, [[Paper]](https://arxiv.org/pdf/2302.03751.pdf), [[Code]](https://github.com/BoyuanJackChen/MiniProject2_VisTrans)

- (arXiv 2023.2) CROSS-LAYER RETROSPECTIVE RETRIEVING VIA LAYER **ATTENTION**, [[Paper]](https://arxiv.org/pdf/2302.03985.pdf), [[Code]](https://github.com/joyfang1106/MRLA)

- (arXiv 2023.2) Convolutional Neural Networks Trained to **Identify Words** Provide a Good Account of Visual Form Priming Effects, [[Paper]](https://arxiv.org/pdf/2302.03992.pdf)

- (arXiv 2023.2) Zero-shot **Generation** of Coherent **Storybook** from Plain Text Story using Diffusion Models, [[Paper]](https://arxiv.org/pdf/2302.03900.pdf)

- (arXiv 2023.2) OSRT: Omnidirectional **Image Super-Resolution** with Distortion-aware Transformer, [[Paper]](https://arxiv.org/pdf/2302.03453.pdf)

- (arXiv 2023.2) Pic2Word: Mapping Pictures to Words for Zero-shot **Composed** **Image Retrieval**, [[Paper]](https://arxiv.org/pdf/2302.03084.pdf), [[Code]](https://github.com/google-research/composed_image_retrieval)

- (arXiv 2023.2) SimCon Loss with Multiple Views for Text Supervised **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2302.03432.pdf)

- (arXiv 2023.2) PhysFormer++: **Facial** Video-based **Physiological Measurement** with SlowFast Temporal Difference Transformer, [[Paper]](https://arxiv.org/pdf/2302.03548.pdf)

- (arXiv 2023.2) Scaling **Self-Supervised** End-to-End **Driving** with Multi-View Attention Learning, [[Paper]](https://arxiv.org/pdf/2302.03198.pdf)

- (arXiv 2023.2) HumanMAC: Masked Motion Completion for **Human Motion Prediction**, [[Paper]](https://arxiv.org/pdf/2302.03665.pdf), [[Project]](https://lhchen.top/Human-MAC/)

- (arXiv 2023.2) LAMPP: **Language Models** as Probabilistic Priors for **Perception** and **Action**, [[Paper]](https://arxiv.org/pdf/2302.02801.pdf)

- (arXiv 2023.2) Zero-Shot **Robot Manipulation** from Passive Human Videos, [[Paper]](https://arxiv.org/pdf/2302.02011.pdf), [[Project]](https://sites.google.com/view/human-0shot-robot)

- (arXiv 2023.2) MixFormer: End-to-End **Tracking** with Iterative Mixed Attention, [[Paper]](https://arxiv.org/pdf/2302.02814.pdf), [[Code]](https://github.com/MCG-NJU/MixFormer)

- (arXiv 2023.2) LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale **Image-Text Retrieval**, [[Paper]](https://arxiv.org/pdf/2302.02908.pdf)

- (arXiv 2023.2) V1T: large-scale **mouse V1 response prediction** using a Vision Transformer, [[Paper]](https://arxiv.org/pdf/2302.03023.pdf)

- (arXiv 2023.2) AIM: ADAPTING **IMAGE MODELS** FOR EFFICIENT **VIDEO ACTION RECOGNITION**, [[Paper]](https://arxiv.org/pdf/2302.03024.pdf), [[Project]](https://adapt-image-models.github.io/)

- (arXiv 2023.2) KDEformer: **Accelerating** Transformers via Kernel Density Estimation, [[Paper]](https://arxiv.org/pdf/2302.02451.pdf), [[Code]](https://github.com/majid-daliri/kdeformer)

- (arXiv 2023.2) Semantic-Guided **Image Augmentation** with Pre-trained Models, [[Paper]](https://arxiv.org/pdf/2302.02070.pdf)

- (arXiv 2023.2) X-ReID: Cross-Instance Transformer for Identity-Level **Person Re-Identification**, [[Paper]](https://arxiv.org/pdf/2302.02075.pdf)

- (arXiv 2023.2) MOMA: **Distill** from Self-Supervised Teachers, [[Paper]](https://arxiv.org/pdf/2302.02089.pdf)

- (arXiv 2023.2) Learning to Agree on Vision Attention for **Visual Commonsense Reasoning**, [[Paper]](https://arxiv.org/pdf/2302.02117.pdf)

- (arXiv 2023.2) Efficient End-to-End **Video Question Answering** with Pyramidal Multimodal Transformer, [[Paper]](https://arxiv.org/pdf/2302.02136.pdf), [[Code]](https://github.com/Trunpm/PMT-AAAI23)

- (arXiv 2023.2) LipFormer: Learning to **Lipread** Unseen Speakers based on Visual-Landmark Transformers, [[Paper]](https://arxiv.org/pdf/2302.02141.pdf)

- (arXiv 2023.2) Oscillation-free **Quantization** for Low-bit Vision Transformers, [[Paper]](https://arxiv.org/pdf/2302.02210.pdf)

- (arXiv 2023.2) Design Booster: A Text-Guided Diffusion Model for **Image Translation** with Spatial Layout Preservation, [[Paper]](https://arxiv.org/pdf/2302.02284.pdf)

- (arXiv 2023.2) Contrast with Reconstruct: **Contrastive** **3D** Representation Learning Guided by Generative Pretraining, [[Paper]](https://arxiv.org/pdf/2302.02318.pdf), [[Code]](https://github.com/qizekun/ReCon)

- (arXiv 2023.2) Leaving Reality to Imagination: **Robust** **Classification** via **Generated** Datasets, [[Paper]](https://arxiv.org/pdf/2302.02503.pdf), [[Code]](https://github.com/Hritikbansal/generative-robustness)

- (arXiv 2023.2) CHiLS: Zero-Shot Image **Classification** with **Hierarchical** Label Sets, [[Paper]](https://arxiv.org/pdf/2302.02551.pdf), [[Code]](https://github.com/acmi-lab/CHILS)

- (arXiv 2023.2) Zero-shot **Image-to-Image** Translation, [[Paper]](https://arxiv.org/pdf/2302.03027.pdf), [[Project]](https://pix2pixzero.github.io/)

- (arXiv 2023.2) Learning a **Fourier Transform** for Linear Relative **Positional Encodings** in Transformers, [[Paper]](https://arxiv.org/pdf/2302.01925.pdf)

- (arXiv 2023.2) EXPLICIT BOX DETECTION UNIFIES END-TO-END **MULTI-PERSON POSE ESTIMATION**, [[Paper]](http://my.sjtu.edu.cn/Task), [[Code]](https://github.com/IDEA-Research/ED-Pose)

- (arXiv 2023.2) CFFT-GAN: Cross-domain Feature Fusion Transformer for Exemplar-based **Image Translation**, [[Paper]](https://arxiv.org/pdf/2302.01608.pdf)

- (arXiv 2023.2) DEVICE: DEpth and VIsual ConcEpts Aware Transformer for **TextCaps**, [[Paper]](https://arxiv.org/pdf/2302.01540.pdf)

- (arXiv 2023.2) CVTNet: A Cross-View Transformer Network for **Place Recognition** Using **LiDAR** Data, [[Paper]](https://arxiv.org/pdf/2302.01665.pdf), [[Code]](https://github.com/BIT-MJY/CVTNet)

- (arXiv 2023.2) DilateFormer: **Multi-Scale Dilated** Transformer for Visual Recognition, [[Paper]](https://arxiv.org/pdf/2302.01791.pdf), [[Code]](https://github.com/JIAOJIAYUASD/dilateformer)

- (arXiv 2023.2) HDFormer: High-order Directed Transformer for **3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2302.01825.pdf), [[Code]](https://github.com/hyer/HDFormer)

- (arXiv 2023.2) IC^3: Image Captioning by Committee Consensus, [[Paper]](https://arxiv.org/pdf/2302.01328.pdf), [[Code]](https://github.com/DavidMChan/caption-by-committee)

- (arXiv 2023.2) Boosting Low-Data Instance **Segmentation** by Unsupervised Pre-training with Saliency Prompt, [[Paper]](https://arxiv.org/pdf/2302.01171.pdf)

- (arXiv 2023.2) QR-CLIP: Introducing Explicit Open-World Knowledge for **Location and Time Reasoning**, [[Paper]](https://arxiv.org/pdf/2302.00952.pdf)

- (arXiv 2023.2) Vision Transformer-based Feature Extraction for **Generalized Zero-Shot Learning**, [[Paper]](https://arxiv.org/pdf/2302.00875.pdf)

- (arXiv 2023.2) **Multimodal** Chain-of-Thought **Reasoning** in Language Models, [[Paper]](https://arxiv.org/pdf/2302.00923.pdf), [[Code]](https://github.com/amazon-science/mm-cot)

- (arXiv 2023.2) CLIPood: Generalizing **CLIP** to **Out-of-Distributions**, [[Paper]](https://arxiv.org/pdf/2302.00864.pdf)

- (arXiv 2023.2) Language Quantized AutoEncoders: Towards Unsupervised **Text-Image** Alignment, [[Paper]](https://arxiv.org/pdf/2302.00902.pdf)

- (arXiv 2023.2) The geometry of **hidden representations** of large transformer models, [[Paper]](https://arxiv.org/pdf/2302.00294.pdf)

- (arXiv 2023.2) **Debiasing** **Vision-Language** Models via Biased Prompts, [[Paper]](https://arxiv.org/pdf/2302.00070.pdf), [[Code]](https://github.com/chingyaoc/debias_vl)

- (arXiv 2023.2) COMPOSITIONAL PROMPT TUNING WITH MOTION CUES FOR **OPEN-VOCABULARY VIDEO RELATION DETECTION**, [[Paper]](https://arxiv.org/pdf/2302.00268.pdf), [[Code]](https://github.com/Dawn-LX/OpenVoc-VidVRD)

- (arXiv 2023.2) mPLUG-2: A Modularized **Multi-modal** Foundation Model Across Text, Image and Video, [[Paper]](https://arxiv.org/pdf/2302.00402.pdf), [[Code]](https://github.com/alibaba/AliceMind/tree/main/mPLUG)

- (arXiv 2023.2) Transforming **CLIP** to an **Open-vocabulary Video Model** via Interpolated Weight Optimization, [[Paper]](https://arxiv.org/pdf/2302.00624.pdf)

- (arXiv 2023.2) ADAPT: Action-aware Driving **Caption** Transformer, [[Paper]](https://arxiv.org/pdf/2302.00673.pdf), [[Code]](https://github.com/jxbbb/ADAPT)

### 2023.1

- (arXiv 2023.1) AdaPoinTr: Diverse **Point Cloud Completion** with Adaptive Geometry-Aware Transformers, [[Paper]](https://arxiv.org/pdf/2301.04545.pdf), [[Code]](https://github.com/yuxumin/PoinTr)

- (arXiv 2023.1) **EXIF** as Language: Learning Cross-Modal Associations Between **Images and Camera Metadata**, [[Paper]](https://arxiv.org/pdf/2301.04647.pdf), [[Project]](https://hellomuffin.github.io/exif-as-language)

- (arXiv 2023.1) Head-Free Lightweight **Semantic Segmentation** with Linear Transformer, [[Paper]](https://arxiv.org/pdf/2301.04648.pdf), [[Code]](https://github.com/dongbo811/AFFormer)

- (arXiv 2023.1) Geometry-biased Transformers for **Novel View Synthesis**, [[Paper]](https://arxiv.org/pdf/2301.04650.pdf), [[Project]](https://mayankgrwl97.github.io/gbt)

- (arXiv 2023.1) **Continual** **Few-Shot** Learning Using HyperTransformers, [[Paper]](https://arxiv.org/pdf/2301.04584.pdf)

- (arXiv 2023.1) SEMPPL: PREDICTING **PSEUDO-LABELS** FOR BETTER **CONTRASTIVE** REPRESENTATIONS, [[Paper]](https://arxiv.org/pdf/2301.05158.pdf)

- (arXiv 2023.1) Learning to **Summarize Videos** by Contrasting Clips, [[Paper]](https://arxiv.org/pdf/2301.05213.pdf)

- (arXiv 2023.1) Guiding **Text-to-Image** **Diffusion** Model Towards Grounded Generation, [[Paper]](https://arxiv.org/pdf/2301.05221.pdf), [[Project]](https://lipurple.github.io/Grounded_Diffusion/)

- (arXiv 2023.1) Domain Expansion of **Image Generators**, [[Paper]](https://arxiv.org/pdf/2301.05225.pdf), [[Project]](https://yotamnitzan.github.io/domain-expansion/)

- (arXiv 2023.1) Scene-centric vs. Object-centric Image-Text **Cross-modal Retrieval**: A Reproducibility Study, [[Paper]](https://arxiv.org/pdf/2301.05174.pdf)

- (arXiv 2023.1) Tracr: Compiled Transformers as a Laboratory for **Interpretability**, [[Paper]](https://arxiv.org/pdf/2301.05062.pdf), [[Code]](https://github.com/deepmind/tracr)

- (arXiv 2023.1) **CLIP** the Gap: A Single **Domain Generalization** Approach for Object **Detection**, [[Paper]](https://arxiv.org/pdf/2301.05499.pdf)

- (arXiv 2023.1) **Text to Point Cloud Localization** with Relation-Enhanced Transformer, [[Paper]](https://arxiv.org/pdf/2301.05372.pdf)

- (arXiv 2023.1) GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured **Pruning** for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2301.05345.pdf)

- (arXiv 2023.1) Toward Building General Foundation Models for Language, Vision, and **Vision-Language** Understanding Tasks, [[Paper]](https://arxiv.org/pdf/2301.05065.pdf)

- (arXiv 2023.1) ViTs for SITS: Vision Transformers for **Satellite Image Time Series**, [[Paper]](https://arxiv.org/pdf/2301.04944.pdf), [[Code]](https://github.com/michaeltrs/DeepSatModels)

- (arXiv 2023.1) CLIP2Scene: Towards Label-efficient **3D Scene Understanding** by **CLIP**, [[Paper]](https://arxiv.org/pdf/2301.04926.pdf)

- (arXiv 2023.1) A Large-Scale Outdoor Multi-modal **Dataset** and Benchmark for **Novel View Synthesis** and Implicit **Scene Reconstruction**, [[Paper]](https://arxiv.org/pdf/2301.06782.pdf), [[Project]](https://ommo.luchongshan.com/)

- (arXiv 2023.1) USER: Unified Semantic Enhancement with Momentum Contrast for **Image-Text Retrieval**, [[Paper]](https://arxiv.org/pdf/2301.06844.pdf), [[Code]](https://github.com/zhangy0822/USER)

- (arXiv 2023.1) SAT: Size-Aware Transformer for 3D **Point Cloud Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2301.06869.pdf)

- (arXiv 2023.1) **Masked** **Visual** Reconstruction in **Language** Semantic Space, [[Paper]](https://arxiv.org/pdf/2301.06958.pdf), [[Code]](https://github.com/hustvl/RILS)

- (arXiv 2023.1) Vision Learners Meet Web **Image-Text** Pairs, [[Paper]](https://arxiv.org/pdf/2301.07088.pdf), [[Code]](https://huggingface.co/spaces/tennant/MUG_caption)

- (arXiv 2023.1) GLIGEN: Open-Set Grounded **Text-to-Image** Generation, [[Paper]](https://arxiv.org/pdf/2301.07093.pdf), [[Project]](https://gligen.github.io/)

- (arXiv 2023.1) **Learning** Customized Visual Models with **Retrieval**-Augmented **Knowledge**, [[Paper]](https://arxiv.org/pdf/2301.07094.pdf), [[Project]](https://react-vl.github.io/)

- (arXiv 2023.1) UATVR: Uncertainty-Adaptive **Text-Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2301.06309.pdf)

- (arXiv 2023.1) Learning Aligned Cross-modal Representations for **Referring Image Segmentation**, [[Paper]](https://arxiv.org/pdf/2301.06429.pdf)

- (arXiv 2023.1) T2M-GPT: **Generating** Human **Motion** from Textual Descriptions with Discrete Representations, [[Paper]](https://arxiv.org/pdf/2301.06052.pdf), [[Project]](https://mael-zys.github.io/T2M-GPT/)

- (arXiv 2023.1) DSVT: Dynamic **Sparse Voxel** Transformer with Rotated Sets, [[Paper]](https://arxiv.org/pdf/2301.06051.pdf), [[Code]](https://github.com/Haiyang-W/DSVT)

- (arXiv 2023.1) CMAE-V: Contrastive Masked Autoencoders for **Video Action Recognition**, [[Paper]](https://arxiv.org/pdf/2301.06018.pdf)

- (arXiv 2023.1) Generating Templated Caption for **Video Grounding**, [[Paper]](https://arxiv.org/pdf/2301.05997.pdf)

- (arXiv 2023.1) Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised **Depth Estimation** in Dynamic Scenes, [[Paper]](https://arxiv.org/pdf/2301.05871.pdf)

- (arXiv 2023.1) SwinDepth: Unsupervised **Depth Estimation** using Monocular Sequences via Swin Transformer and Densely Cascaded Network, [[Paper]](https://arxiv.org/pdf/2301.06715.pdf)

- (arXiv 2023.1) **CLIP**TER: Looking at the Bigger Picture in **Scene Text Recognition**, [[Paper]](https://arxiv.org/pdf/2301.07464.pdf)

- (arXiv 2023.1) Temporal Perceiving **Video-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2301.07463.pdf)

- (arXiv 2023.1) Joint Representation Learning for **Text** and 3D **Point Cloud**, [[Paper]](https://arxiv.org/pdf/2301.07584.pdf), [[Code]](https://github.com/LeapLabTHU/Text4Point)

- (arXiv 2023.1) Effective End-to-End **Vision Language** Pretraining with Semantic Visual Loss, [[Paper]](https://arxiv.org/pdf/2301.07236.pdf)

- (arXiv 2023.1) PTA-Det: Point Transformer Associating Point cloud and Image for **3D Object Detection**, [[Paper]](https://arxiv.org/pdf/2301.07301.pdf)

- (arXiv 2023.1) **Face Recognition** in the age of CLIP & Billion image datasets, [[Paper]](https://arxiv.org/pdf/2301.07315.pdf)

- (arXiv 2023.1) HSTFormer: Hierarchical Spatial-Temporal Transformers for **3D Human Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2301.07322.pdf), [[Code]](https://github.com/qianxiaoye825/HSTFormer)

- (arXiv 2023.1) Towards Models that Can **See** and **Read**, [[Paper]](https://arxiv.org/pdf/2301.07389.pdf)

- (arXiv 2023.1) **Embodied** Agents for Efficient Exploration and Smart Scene Description, [[Paper]](https://arxiv.org/pdf/2301.07150.pdf)

- (arXiv 2023.1) **Self-Supervised Learning** from Images with a Joint-Embedding Predictive Architecture, [[Paper]](https://arxiv.org/pdf/2301.08243.pdf)

- (arXiv 2023.1) Revisiting the Spatial and Temporal Modeling for **Few-shot Action Recognition**, [[Paper]](https://arxiv.org/pdf/2301.07944.pdf)

- (arXiv 2023.1) Multimodal Video Adapter for Parameter Efficient **Video Text Retrieval**, [[Paper]](https://arxiv.org/pdf/2301.07868.pdf)

- (arXiv 2023.1) **Self Supervision** Does Not Help Natural Language Supervision at Scale, [[Paper]](https://arxiv.org/pdf/2301.07836.pdf)

- (arXiv 2023.1) MULTI-TARGET MULTI-CAMERA **VEHICLE TRACKING** USING TRANSFORMER-BASED CAMERA LINK MODEL AND SPATIAL-TEMPORAL INFORMATION, [[Paper]](https://arxiv.org/pdf/2301.07805.pdf)

- (arXiv 2023.1) ATMAN: **Understanding** Transformer Predictions Through Memory Efficient **Attention** Manipulation, [[Paper]](https://arxiv.org/pdf/2301.08110.pdf)

- (arXiv 2023.1) DDS: Decoupled Dynamic **Scene-Graph Generation** Network, [[Paper]](https://arxiv.org/pdf/2301.07666.pdf)

- (arXiv 2023.1) Visual Writing Prompts: Character-Grounded **Story Generation** with Curated Image Sequences, [[Paper]](https://arxiv.org/pdf/2301.08571.pdf)

- (arXiv 2023.1) **Image Memorability Prediction** with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2301.08647.pdf)

- (arXiv 2023.1) HOLISTICALLY **EXPLAINABLE** VISION TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2301.08669.pdf)

- (arXiv 2023.1) FlatFormer: Flattened Window Attention for **Efficient** **Point Cloud** Transformer, [[Paper]](https://arxiv.org/pdf/2301.08739.pdf)

- (arXiv 2023.1) LEGO-Net: Learning Regular **Rearrangements** of **Objects** in Rooms, [[Paper]](https://arxiv.org/pdf/2301.09629.pdf), [[Project]](https://ivl.cs.brown.edu/projects/lego-net)

- (arXiv 2023.1) Zorro: the masked **multimodal** transformer, [[Paper]](https://arxiv.org/pdf/2301.09595.pdf)

- (arXiv 2023.1) Towards Robust **Video Instance Segmentation** with Temporal-Aware Transformer, [[Paper]](https://arxiv.org/pdf/2301.09416.pdf)

- (arXiv 2023.1) Learning **Open-vocabulary Semantic Segmentation** Models From Natural Language Supervision, [[Paper]](https://arxiv.org/pdf/2301.09121.pdf), [[Project]](https://jazzcharles.github.io/OVSegmentor/)

- (arXiv 2023.1) Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object **Interaction Anticipation**, [[Paper]](https://arxiv.org/pdf/2301.09209.pdf), [[Project]](https://eth-ait.github.io/transfusion-proj/)

- (arXiv 2023.1) Combined Use of Federated Learning and Image Encryption for **Privacy**-Preserving **Image Classification** with Vision Transformer, [[Paper]](https://arxiv.org/pdf/2301.09255.pdf)

- (arXiv 2023.1) Slice Transformer and Self-supervised Learning for **6DoF Localization** in 3D Point Cloud Maps, [[Paper]](https://arxiv.org/pdf/2301.08957.pdf)

- (arXiv 2023.1) IMPROVING ACCURACY OF **ZERO-SHOT ACTION RECOGNITION** WITH HANDCRAFTED FEATURES, [[Paper]](https://arxiv.org/pdf/2301.08874.pdf)

- (arXiv 2023.1) Learning to View: Decision Transformers for **Active Object Detection**, [[Paper]](https://arxiv.org/pdf/2301.09544.pdf)

- (arXiv 2023.1) Visual Semantic Relatedness Dataset for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2301.08784.pdf), [[Code]](https://github.com/ahmedssabir/Textual-Visual-Semantic-Dataset)

- (arXiv 2023.1) VERSATILE NEURAL PROCESSES FOR LEARNING **IMPLICIT NEURAL REPRESENTATIONS**, [[Paper]](https://arxiv.org/pdf/2301.08883.pdf), [[Code]](https://github.com/ZongyuGuo/Versatile-NP)

- (arXiv 2023.1) RangeViT: Towards Vision Transformers for **3D Semantic Segmentation** in Autonomous Driving, [[Paper]](https://arxiv.org/pdf/2301.10222.pdf), [[Code]](https://github.com/valeoai/rangevit)

- (arXiv 2023.1) Exploiting Optical Flow Guidance for Transformer-Based **Video Inpainting**, [[Paper]](https://arxiv.org/pdf/2301.10048.pdf)

- (arXiv 2023.1) Image **Super-Resolution** using Efficient Striped Window Transformer, [[Paper]](https://arxiv.org/pdf/2301.09869.pdf), [[Code]](https://github.com/Fried-Rice-Lab/FriedRiceLab)

- (arXiv 2023.1) **Out of Distribution** Performance of State of Art Vision Model, [[Paper]](https://arxiv.org/pdf/2301.10750.pdf), [[Code]](https://github.com/salman-lui/vision_course_project)

- (arXiv 2023.1) Compact Transformer **Tracker** with Correlative Masked Modeling, [[Paper]](https://arxiv.org/pdf/2301.10938.pdf), [[Code]](https://github.com/HUSTDML/CTTrack)

- (arXiv 2023.1) **Vision-Language** Models Performing Zero-Shot Tasks Exhibit **Gender**-based **Disparities**, [[Paper]](https://arxiv.org/pdf/2301.11100.pdf)

- (arXiv 2023.1) Cut and Learn for **Unsupervised** Object **Detection** and Instance **Segmentation**, [[Paper]](https://arxiv.org/pdf/2301.11320.pdf), [[Code]](https://github.com/facebookresearch/CutLER)

- (arXiv 2023.1) Explaining Visual **Biases** as Words by Generating Captions, [[Paper]](https://arxiv.org/pdf/2301.11104.pdf), [[Code]](https://github.com/alinlab/b2t)

- (arXiv 2023.1) Revisiting **Temporal Modeling** for **CLIP**-based Image-to-Video Knowledge Transferring, [[Paper]](https://arxiv.org/pdf/2301.11116.pdf), [[Code]](https://github.com/farewellthree/STAN)

- (arXiv 2023.1) **Multi-video Moment Ranking** with Multimodal Clue, [[Paper]](https://arxiv.org/pdf/2301.13606.pdf)

- (arXiv 2023.1) SDF-FORMER: **MONOCULAR SCENE RECONSTRUCTION** WITH 3D SDF TRANSFORMERS, [[Paper]](https://arxiv.org/pdf/2301.13510.pdf), [[Project]](https://weihaosky.github.io/sdfformer)

- (arXiv 2023.1) Grounding Language Models to Images for **Multimodal Generation**, [[Paper]](https://arxiv.org/pdf/2301.13823.pdf)

- (arXiv 2023.1) Pseudo 3D Perception Transformer with Multi-level Confidence Optimization for **Visual Commonsense Reasoning**, [[Paper]](https://arxiv.org/pdf/2301.13335.pdf)

- (arXiv 2023.1) A Modular Multi-stage Lightweight Graph Transformer Network for **Human Pose and Shape Estimation** from 2D Human Pose, [[Paper]](https://arxiv.org/pdf/2301.13403.pdf)

- (arXiv 2023.1) Priors are Powerful: Improving a Transformer for **Multi-camera 3D Detection** with 2D Priors, [[Paper]](https://arxiv.org/pdf/2301.13592.pdf)

- (arXiv 2023.1) UPop: Unified and Progressive Pruning for **Compressing** **Vision-Language** Transformers, [[Paper]](https://arxiv.org/pdf/2301.13741.pdf)

- (arXiv 2023.1) **Fairness**-aware Vision Transformer via Debiased Self-Attention, [[Paper]](https://arxiv.org/pdf/2301.13803.pdf)

- (arXiv 2023.1) Anchor-Based Adversarially Robust **Zero-Shot Learning** Driven by Language, [[Paper]](https://arxiv.org/pdf/2301.13096.pdf)

- (arXiv 2023.1) Distilling Internet-Scale **Vision-Language** Models into **Embodied** Agents, [[Paper]](https://arxiv.org/pdf/2301.12507.pdf)

- (arXiv 2023.1) 6-DoF Robotic **Grasping** with Transformer, [[Paper]](https://arxiv.org/pdf/2301.12476.pdf)

- (arXiv 2023.1) Do Embodied Agents Dream of Pixelated Sheep?: **Embodied Decision Making** using Language Guided World Modelling, [[Paper]](https://arxiv.org/pdf/2301.12050.pdf), [[Project]](https://deckardagent.github.io/)

- (arXiv 2023.1) GALIP: Generative Adversarial CLIPs for **Text-to-Image** Synthesis, [[Paper]](https://arxiv.org/pdf/2301.12959.pdf), [[Code]](https://github.com/tobran/GALIP)

- (arXiv 2023.1) STAIR: Learning **Sparse** **Text and Image** Representation in Grounded Tokens, [[Paper]](https://arxiv.org/pdf/2301.13081.pdf)

- (arXiv 2023.1) **Aerial** Image Object **Detection** With Vision Transformer Detector (ViTDet), [[Paper]](https://arxiv.org/ftp/arxiv/papers/2301/2301.12058.pdf)

- (arXiv 2023.1) Towards Vision Transformer Unrolling Fixed-Point Algorithm: a Case Study on **Image Restoration**, [[Paper]](https://arxiv.org/pdf/2301.12332.pdf)

- (arXiv 2023.1) Debiased Fine-Tuning for **Vision-language** Models by **Prompt** Regularization, [[Paper]](https://arxiv.org/pdf/2301.12429.pdf)

- (arXiv 2023.1) BLIP-2: Bootstrapping **Language-Image** Pre-training with **Frozen** Image Encoders and Large Language Models, [[Paper]](https://arxiv.org/pdf/2301.12597.pdf), [[Code]](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)

- (arXiv 2023.1) Tagging before Alignment: Integrating Multi-Modal Tags for **Video-Text Retrieval**, [[Paper]](https://arxiv.org/pdf/2301.12644.pdf)

- (arXiv 2023.1) SEAFORMER: SQUEEZE-ENHANCED AXIAL TRANSFORMER FOR MOBILE SEMANTIC **SEGMENTATION**, [[Paper]](https://arxiv.org/pdf/2301.13156.pdf), [[Code]](https://github.com/fudan-zvg/SeaFormer)

- (arXiv 2023.1) Learning 6-DoF Fine-grained **Grasp Detection** Based on Part Affordance Grounding, [[Paper]](https://arxiv.org/pdf/2301.11564.pdf), [[Project]](https://sites.google.com/view/lang-shape)

- (arXiv 2023.1) Multimodal Event Transformer for **Image-guided Story Ending Generation**, [[Paper]](https://arxiv.org/pdf/2301.11357.pdf)

- (arXiv 2023.1) Style-Aware Contrastive Learning for Multi-Style Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2301.11367.pdf)

- (arXiv 2023.1) 3DShape2VecSet: A **3D Shape Representation** for Neural Fields and Generative Diffusion Models, [[Paper]](https://arxiv.org/pdf/2301.11445.pdf)

- (arXiv 2023.1) Semi-Parametric **Video-Grounded Text Generation**, [[Paper]](https://arxiv.org/pdf/2301.11507.pdf)

- (arXiv 2023.1) **Robust** Transformer with Locality Inductive Bias and Feature Normalization, [[Paper]](https://arxiv.org/pdf/2301.11553.pdf)

- (arXiv 2023.1) LEVERAGING THE THIRD DIMENSION IN **CONTRASTIVE LEARNING**, [[Paper]](https://arxiv.org/pdf/2301.11790.pdf)

- (arXiv 2023.1) Understanding **Self-Supervised** Pretraining with **Part**-Aware Representation Learning, [[Paper]](https://arxiv.org/pdf/2301.11915.pdf)

- (arXiv 2023.1) Hypergraph Transformer for **Skeleton-based Action Recognition**, [[Paper]](https://arxiv.org/pdf/2211.09590.pdf)

- (arXiv 2023.1) CPT-V: A Contrastive Approach to Post-Training **Quantization** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.09643.pdf)

- (arXiv 2023.1) InstructPix2Pix: Learning to Follow **Image Editing** Instructions, [[Paper]](https://arxiv.org/pdf/2211.09800.pdf), [[Project]](http://timothybrooks.com/instruct-pix2pix)

- (arXiv 2023.1) OvarNet: Towards Open-vocabulary Object **Attribute Recognition**, [[Paper]](https://arxiv.org/pdf/2301.09506.pdf), [[Project]](https://kyanchen.github.io/OvarNet)

- (arXiv 2023.1) **Token** Transformer: Can class token help window-based transformer build better **long-range interactions**?, [[Paper]](https://arxiv.org/pdf/2211.06083.pdf)

- (arXiv 2023.1) Multimodal Inverse Cloze Task for Knowledge-based **Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2301.04366.pdf)

- (arXiv 2023.1) FGAHOI: Fine-Grained Anchors for **Human-Object Interaction** Detection, [[Paper]](https://arxiv.org/pdf/2301.04019.pdf), [[Code]](https://github.com/xiaomabufei/FGAHOI)

- (arXiv 2023.1) Parallel Reasoning Network for **Human-Object Interaction** Detection, [[Paper]](https://arxiv.org/pdf/2301.03510.pdf)

- (arXiv 2023.1) In Defense of Structural Symbolic Representation for **Video Event-Relation Prediction**, [[Paper]](https://arxiv.org/pdf/2301.03410.pdf)

- (arXiv 2023.1) **Scene Synthesis** from Human **Motion**, [[Paper]](https://arxiv.org/pdf/2301.01424.pdf), [[Project]](https://lijiaman.github.io/projects/summon/)

### 2022.12

- (arXiv 2022.12) EVA: Exploring the Limits of **Masked Visual Representation** Learning at Scale, [[Paper]](https://arxiv.org/pdf/2211.07636.pdf), [[Code]](https://github.com/baaivision/EVA)

- (arXiv 2022.12) OneFormer: One Transformer to Rule Universal Image **Segmentation**, [[Paper]](https://arxiv.org/pdf/2211.06220.pdf), [[Code]](https://github.com/SHI-Labs/OneFormer)

- (arXiv 2022.12) MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards **Multi-modal Open-domain Conversation**, [[Paper]](https://arxiv.org/pdf/2211.05719.pdf), [[Project]](https://github.com/victorsungo/MMDialog)

- (arXiv 2022.12) Why is Winoground Hard? Investigating Failures in **Visuolinguistic Compositionality**, [[Paper]](https://arxiv.org/pdf/2211.00768.pdf), [[Code]](https://github.com/ajd12342/why-winoground-hard)

- (arXiv 2022.12) Multimodal **Information Bottleneck**: Learning Minimal Sufficient Unimodal and **Multimodal** Representations, [[Paper]](https://arxiv.org/pdf/2210.17444.pdf), [[Code]](https://github.com/TmacMai/Multimodal-Information-Bottleneck)

- (arXiv 2022.12) CLIP-FLOW: CONTRASTIVE LEARNING BY SEMISUPERVISED ITERATIVE PSEUDO LABELING FOR **OPTICAL FLOW ESTIMATION**, [[Paper]](https://arxiv.org/pdf/2210.14383.pdf)

- (arXiv 2022.12) INSTRUCTION-FOLLOWING **AGENTS** WITH JOINTLY PRE-TRAINED **VISION-LANGUAGE** MODELS, [[Paper]](https://arxiv.org/pdf/2210.13431.pdf), [[Code]](https://github.com/lhao499/instructrl)

- (arXiv 2022.12) MetaFormer **Baselines** for Vision, [[Paper]](https://arxiv.org/pdf/2210.13452.pdf), [[Code]](https://github.com/sail-sg/metaformer)

- (arXiv 2022.12) ViTCoD: Vision Transformer **Acceleration** via Dedicated Algorithm and Accelerator Co-Design, [[Paper]](https://arxiv.org/pdf/2210.09573.pdf), [[Code]](https://github.com/GATECH-EIC/ViTCoD)

- (arXiv 2022.12) FROM PLAY TO POLICY: CONDITIONAL BEHAVIOR GENERATION FROM UNCURATED **ROBOT** DATA, [[Paper]](https://arxiv.org/pdf/2210.10047.pdf), [[Project]](https://play-to-policy.github.io/)

- (arXiv 2022.12) Optimizing **Prompts** for **Text-to-Image** Generation, [[Paper]](https://arxiv.org/pdf/2212.09611.pdf), [[Code]](https://aka.ms/promptist)

- (arXiv 2022.12) Attentive **Mask** **CLIP**, [[Paper]](https://arxiv.org/pdf/2212.08653.pdf)

- (arXiv 2022.12) Rethinking **Cooking State Recognition** with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2212.08586.pdf)

- (arXiv 2022.12) Enhancing **Multi-modal** and **Multi-hop Question Answering** via Structured Knowledge and Unified Retrieval-Generation, [[Paper]](https://arxiv.org/pdf/2212.08632.pdf)

- (arXiv 2022.12) MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in **Vision and Language** Models & Tasks, [[Paper]](https://arxiv.org/pdf/2212.08158.pdf), [[Code]](https://github.com/Heidelberg-NLP/MM-SHAP)

- (arXiv 2022.12) RepQ-ViT: Scale Reparameterization for Post-Training **Quantization** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2212.08254.pdf)

- (arXiv 2022.12) WAVENHANCER: UNIFYING WAVELET AND TRANSFORMER FOR **IMAGE ENHANCEMENT**, [[Paper]](https://arxiv.org/pdf/2212.08327.pdf)

- (arXiv 2022.12) AUTOENCODERS AS CROSS-MODAL TEACHERS: CAN PRETRAINED 2D IMAGE TRANSFORMERS HELP **3D REPRESENTATION** LEARNING?, [[Paper]](https://arxiv.org/pdf/2212.08320.pdf), [[Code]](https://github.com/RunpeiDong/ACT)

- (arXiv 2022.12) SceneGATE: Scene-Graph based co-Attention networks for TExt **visual question answering**, [[Paper]](https://arxiv.org/pdf/2212.08283.pdf)

- (arXiv 2022.12) Emergent **Analogical Reasoning** in Large Language Models, [[Paper]](https://arxiv.org/pdf/2212.09196.pdf)

- (arXiv 2022.12) Unleashing the Power of **Visual Prompting** At the Pixel Level, [[Paper]](https://arxiv.org/pdf/2212.10556.pdf), [[Code]](https://github.com/UCSC-VLAA/EVP)

- (arXiv 2022.12) Does **CLIP** Bind Concepts? Probing **Compositionality** in Large Image Models, [[Paper]](https://arxiv.org/pdf/2212.10537.pdf)

- (arXiv 2022.12) LayoutDETR: Detection Transformer Is a Good Multimodal **Layout Designer**, [[Paper]](https://arxiv.org/pdf/2212.09877.pdf), [[Code]](https://github.com/salesforce/LayoutDETR)

- (arXiv 2022.12) Towards Unsupervised **Visual Reasoning**: Do Off-The-Shelf Features Know How to Reason?, [[Paper]](https://arxiv.org/pdf/2212.10292.pdf)

- (arXiv 2022.12) Benchmarking **Spatial Relationships** in **Text-to-Image** Generation, [[Paper]](https://arxiv.org/pdf/2212.10015.pdf), [[Project]](https://visort2i.github.io/)

- (arXiv 2022.12) MetaCLUE: Towards Comprehensive **Visual Metaphors** Research, [[Paper]](https://arxiv.org/pdf/2212.09898.pdf), [[Project]](https://metaclue.github.io/)

- (arXiv 2022.12) Tackling Ambiguity with Images: Improved **Multimodal** Machine **Translation** and Contrastive Evaluation, [[Paper]](https://arxiv.org/pdf/2212.10140.pdf), [[Code]](https://github.com/MatthieuFP/CoMMuTE.git)

- (arXiv 2022.12) Cross-modal Attention Congruence Regularization for **Vision-Language** **Relation** Alignment, [[Paper]](https://arxiv.org/pdf/2212.10549.pdf)

- (arXiv 2022.12) Does unsupervised **grammar induction** need pixels?, [[Paper]](https://arxiv.org/pdf/2212.10564.pdf)

- (arXiv 2022.12) Hi-LASSIE: High-Fidelity **Articulated** Shape and Skeleton **Discovery** from Sparse **Image** Ensemble, [[Paper]](https://arxiv.org/pdf/2212.11042.pdf)

- (arXiv 2022.12) MAViC: Multimodal Active Learning for **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2212.11109.pdf)

- (arXiv 2022.12) What Makes for Good **Tokenizers** in Vision Transformer?, [[Paper]](https://arxiv.org/pdf/2212.11115.pdf)

- (arXiv 2022.12) Not Just Pretty Pictures: **Text-to-Image** Generators Enable Interpretable Interventions for **Robust** Representations, [[Paper]](https://arxiv.org/pdf/2212.11237.pdf)

- (arXiv 2022.12) Generalized Decoding for **Pixel**, **Image**, and **Language**, [[Paper]](https://arxiv.org/pdf/2212.11270.pdf), [[Project]](https://x-decoder-vl.github.io/)

- (arXiv 2022.12) METEOR Guided Divergence for **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2212.10690.pdf), [[Code]](https://github.com/d-rothen/bmhrl)

- (arXiv 2022.12) SLGTFORMER: AN ATTENTION-BASED APPROACH TO **SIGN LANGUAGE RECOGNITION**, [[Paper]](https://arxiv.org/pdf/2212.10746.pdf), [[Code]](https://github.com/neilsong/slt)

- (arXiv 2022.12) FROM IMAGES TO TEXTUAL **PROMPTS**: ZERO-SHOT **VQA** WITH FROZEN LARGE LANGUAGE MODELS, [[Paper]](https://arxiv.org/pdf/2212.10846.pdf), [[Code]](https://github.com/salesforce/LAVIS/tree/main/projects/img2prompt-vqa)

- (arXiv 2022.12) 3D Highlighter: Localizing Regions on **3D** Shapes via **Text** Descriptions, [[Paper]](https://arxiv.org/pdf/2212.11263.pdf), [[Code]](https://github.com/threedle/3DHighlighter)

- (arXiv 2022.12) Contrastive **Language-Vision** AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification **Bias**, [[Paper]](https://arxiv.org/pdf/2212.11261.pdf)

- (arXiv 2022.12) Ultra-High-Definition **Low-Light Image Enhancement**: A Benchmark and Transformer-Based Method, [[Paper]](https://arxiv.org/pdf/2212.11548.pdf), [[Code]](https://github.com/TaoWangzj/LLFormer)

- (arXiv 2022.12) Tune-A-Video: One-Shot Tuning of Image Diffusion Models for **Text-to-Video** Generation, [[Paper]](https://arxiv.org/pdf/2212.11565.pdf), [[Project]](https://tuneavideo.github.io/)

- (arXiv 2022.12) Beyond SOT: It’s Time to **Track** **Multiple** Generic **Objects** at Once, [[Paper]](https://arxiv.org/pdf/2212.11920.pdf)

- (arXiv 2022.12) KNOWLEDGE-DRIVEN SCENE PRIORS FOR SEMANTIC AUDIO-VISUAL **EMBODIED NAVIGATION**, [[Paper]](https://arxiv.org/pdf/2212.11345.pdf)

- (arXiv 2022.12) SegViT: **Semantic Segmentation** with Plain Vision Transformers, [[Paper]](https://arxiv.org/pdf/2210.05844.pdf), [[Code]](https://github.com/zbwxp/SegVit)

- (arXiv 2022.12) Open-Vocabulary **Temporal Action Detection** with Off-the-Shelf Image-Text Features, [[Paper]](https://arxiv.org/pdf/2212.10596.pdf)

- (arXiv 2022.12) Point·E: A System for **Generating 3D Point Clouds** from Complex **Prompts**, [[Paper]](https://arxiv.org/pdf/2212.08751.pdf), [[Code]](https://github.com/openai/point-e)

- (arXiv 2022.12) Inductive Attention for **Video Action Anticipation**, [[Paper]](https://arxiv.org/pdf/2212.08830.pdf)

- (arXiv 2022.12) **Image-and-Language** Understanding from Pixels Only, [[Paper]](https://arxiv.org/pdf/2212.08045.pdf), [[Code]](https://github.com/google-research/big_vision)

- (arXiv 2022.12) FlexiViT: One Model for All **Patch Sizes**, [[Paper]](https://arxiv.org/pdf/2212.08013.pdf), [[Code]](https://github.com/google-research/big_vision)

- (arXiv 2022.12) **Unsupervised** Object **Localization**: Observing the Background to Discover Objects, [[Paper]](https://arxiv.org/pdf/2212.07834.pdf), [[Code]](https://github.com/valeoai/FOUND)

- (arXiv 2022.12) Vision Transformers are Parameter-Efficient **Audio-Visual** Learners, [[Paper]](https://arxiv.org/pdf/2212.07983.pdf), [[Project]](https://genjib.github.io/project_page/LAVISH/)

- (arXiv 2022.12) Full Contextual Attention for Multi-resolution Transformers in **Semantic Segmentation**, [[Paper]](https://arxiv.org/pdf/2212.07890.pdf)

- (arXiv 2022.12) DETR4D: Direct Multi-View **3D Object Detection** with Sparse Attention, [[Paper]](https://arxiv.org/pdf/2212.07849.pdf)

- (arXiv 2022.12) Enhanced Training of Query-Based Object **Detection** via Selective Query Recollection, [[Paper]](https://arxiv.org/pdf/2212.07593.pdf), [[Code]](https://github.com/Fangyi-Chen/SQR)

- (arXiv 2022.12) TEXT-GUIDED MASK-FREE LOCAL **IMAGE RETOUCHING**, [[Paper]](https://arxiv.org/pdf/2212.07603.pdf)

- (arXiv 2022.12) Summary-Oriented Vision Modeling for **Multimodal Abstractive Summarization**, [[Paper]](https://arxiv.org/pdf/2212.07672.pdf), [[Code]](https://github.com/XL2248/SOV-MAS)

- (arXiv 2022.12) One-Shot Domain Adaptive and Generalizable **Semantic Segmentation** with Class-Aware Cross-Domain Transformers, [[Paper]](https://arxiv.org/pdf/2212.07292.pdf)

- (arXiv 2022.12) ConQueR: Query Contrast Voxel-DETR for **3D Object Detection**, [[Paper]](https://arxiv.org/pdf/2212.07289.pdf)

- (arXiv 2022.12) Examining the **Difference** Among **Transformers** and **CNNs** with Explanation Methods, [[Paper]](https://arxiv.org/pdf/2212.06872.pdf)

- (arXiv 2022.12) Find Someone Who: Visual Commonsense Understanding in Human-Centric **Grounding**, [[Paper]](https://arxiv.org/pdf/2212.06971.pdf), [[Code]](https://github.com/Hxyou/HumanCog)

- (arXiv 2022.12) Dual-branch Cross-Patch Attention Learning for **Group Affect Recognition**, [[Paper]](https://arxiv.org/pdf/2212.07055.pdf)

- (arXiv 2022.12) Cross-Modal Similarity-Based Curriculum Learning for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2212.07075.pdf)

- (arXiv 2022.12) NLIP: Noise-robust **Language-Image** Pre-training, [[Paper]](https://arxiv.org/pdf/2212.07086.pdf)

- (arXiv 2022.12) Lidar**CLIP** or: How I Learned to Talk to **Point Clouds**, [[Paper]](https://arxiv.org/pdf/2212.06858.pdf), [[Code]](https://github.com/atonderski/lidarclip)

- (arXiv 2022.12) **CLIP**SEP: LEARNING TEXT-QUERIED **SOUND SEPARATION** WITH NOISY UNLABELED VIDEOS, [[Paper]](https://arxiv.org/pdf/2212.07065.pdf)

- (arXiv 2022.12) Reproducible **scaling laws** for contrastive language-image learning, [[Paper]](https://arxiv.org/pdf/2212.07143.pdf), [[Code]](https://github.com/LAION-AI/scaling-laws-openclip)

- (arXiv 2022.12) WHAT DO VISION TRANSFORMERS LEARN? A VISUAL **EXPLORATION**, [[Paper]](https://arxiv.org/pdf/2212.06727.pdf)

- (arXiv 2022.12) Self-Play and Self-Describe: **Policy Adaptation** with **Vision-Language** Foundation Models, [[Paper]](https://arxiv.org/pdf/2212.07398.pdf), [[Project]](https://geyuying.github.io/SPLAYD)

- (arXiv 2022.12) GPVIT: A **HIGH RESOLUTION** NON-HIERARCHICAL VISION TRANSFORMER WITH GROUP PROPAGATION, [[Paper]](https://arxiv.org/pdf/2212.06795.pdf), [[Code]](https://github.com/ChenhongyiYang/GPViT)

- (arXiv 2022.12) Learning 3D Representations from 2D Pre-trained Models via **Image-to-Point** Masked Autoencoders, [[Paper]](https://arxiv.org/pdf/2212.06785.pdf), [[Code]](https://github.com/ZrrSkywalker/I2P-MAE)

- (arXiv 2022.12) Parallel Queries for **Human-Object Interaction Detection**, [[Paper]](https://dl.acm.org/doi/pdf/10.1145/3551626.3564944)

- (arXiv 2022.12) Structure-Guided **Image Completion** with Image-level and Object-level Semantic Discriminators, [[Paper]](https://arxiv.org/pdf/2212.06310.pdf)

- (arXiv 2022.12) Localized Latent Updates for **Fine-Tuning** **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2212.06556.pdf)

- (arXiv 2022.12) CamoFormer: Masked Separable Attention for **Camouflaged Object Detection**, [[Paper]](https://arxiv.org/pdf/2212.06570.pdf)

- (arXiv 2022.12) FastMIM: Expediting **Masked** Image Modeling Pre-training for Vision, [[Paper]](https://arxiv.org/pdf/2212.06593.pdf), [[Code]](https://github.com/ggjy/FastMIM.pytorch)

- (arXiv 2022.12) OAMixer: Object-aware **Mixing** Layer for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2212.06595.pdf), [[Code]](https://github.com/alinlab/OAMixer)

- (arXiv 2022.12) Doubly Right **Object Recognition**: A Why **Prompt** for Visual **Rationales**, [[Paper]](https://arxiv.org/pdf/2212.06202.pdf)

- (arXiv 2022.12) RT-1: **ROBOTICS** TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE, [[Paper]](https://arxiv.org/pdf/2212.06817.pdf), [[Project]](https://robotics-transformer.github.io/)

- (arXiv 2022.12) **Egocentric Video** Task Translation, [[Paper]](https://arxiv.org/pdf/2212.06301.pdf)

- (arXiv 2022.12) ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved **Visio-Linguistic** Models in **3D** Scenes, [[Paper]](https://arxiv.org/pdf/2212.06250.pdf), [[Project]](https://scanents3d.github.io/)

- (arXiv 2022.12) **Curriculum Learning** Meets Weakly Supervised **Modality Correlation** Learning, [[Paper]](https://arxiv.org/pdf/2212.07619.pdf)

- (arXiv 2022.12) IMoS: Intent-Driven Full-Body **Motion Synthesis** for **Human-Object Interactions**, [[Paper]](https://arxiv.org/pdf/2212.07555.pdf)

- (arXiv 2022.12) MultiAct: Long-Term **3D Human Motion Generation** from Multiple Action Labels, [[Paper]](https://arxiv.org/pdf/2212.05897.pdf)

- (arXiv 2022.12) A New Path: Scaling **Vision-and-Language Navigation** with Synthetic Instructions and Imitation Learning, [[Paper]](https://arxiv.org/pdf/2210.03112.pdf)

- (arXiv 2022.12) Beyond Object Recognition: A New Benchmark towards **Object Concept Learning**, [[Paper]](https://arxiv.org/pdf/2212.02710.pdf), [[Project]](https://mvig-rhos.com/ocl)

- (arXiv 2022.12) ViTPose+: Vision Transformer Foundation Model for Generic Body **Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2212.04246.pdf), [[Code]](https://github.com/ViTAE-Transformer/ViTPose)

- (arXiv 2022.12) Structured **Vision-Language** Pretraining for **Computational** Cooking, [[Paper]](https://arxiv.org/pdf/2212.04267.pdf)

- (arXiv 2022.12) MIME: **Human**-Aware **3D Scene Generation**, [[Paper]](https://arxiv.org/pdf/2212.04360.pdf), [[Project]](https://mime.is.tue.mpg.de/)

- (arXiv 2022.12) OFASys: A **Multi-Modal Multi-Task** Learning System for Building **Generalist Models**, [[Paper]](https://arxiv.org/pdf/2212.04408.pdf), [[Code]](https://github.com/OFA-Sys/OFASys)

- (arXiv 2022.12) Task **Bias** in **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2212.04412.pdf)

- (arXiv 2022.12) Multi-Concept Customization of **Text-to-Image** **Diffusion**, [[Paper]](https://arxiv.org/pdf/2212.04488.pdf), [[Code]](https://www.cs.cmu.edu/~custom-diffusion/)

- (arXiv 2022.12) Few-View Object **Reconstruction** with Unknown Categories and Camera Poses, [[Paper]](https://arxiv.org/pdf/2212.04492.pdf), [[Project]](https://ut-austin-rpl.github.io/FORGE/)

- (arXiv 2022.12) Masked Video Distillation: Rethinking **Masked** Feature Modeling for **Self-supervised** **Video Representation** Learning, [[Paper]](https://arxiv.org/pdf/2212.04500.pdf), [[Code]](https://github.com/ruiwang2021/mvd)

- (arXiv 2022.12) Learning **Video** Representations from **Large Language Models**, [[Paper]](https://arxiv.org/pdf/2212.04501.pdf), [[Project]](https://facebookresearch.github.io/LaViLa)

- (arXiv 2022.12) Frozen **CLIP** Model is Efficient **Point Cloud** Backbone, [[Paper]](https://arxiv.org/pdf/2212.04098.pdf)

- (arXiv 2022.12) DialogCC: Large-scale **Multi-Modal Dialogue** Dataset, [[Paper]](https://arxiv.org/pdf/2212.04119.pdf), [[Project]](https://github.com/passing2961/DialogCC)

- (arXiv 2022.12) Group Generalized Mean **Pooling** for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2212.04114.pdf)

- (arXiv 2022.12) LEARNING DOMAIN INVARIANT **PROMPT** FOR **VISION-LANGUAGE** MODELS, [[Paper]](https://arxiv.org/pdf/2212.04196.pdf)

- (arXiv 2022.12) LLM-Planner: Few-Shot Grounded **Planning** for **Embodied** Agents with **Large Language Models**, [[Paper]](https://arxiv.org/pdf/2212.04088.pdf)

- (arXiv 2022.12) Hyperbolic **Contrastive** Learning for Visual **Representations** beyond Objects, [[Paper]](https://arxiv.org/pdf/2212.00653.pdf), [[Code]](https://github.com/shlokk/HCL/tree/main/HCL)

### 2022.11

- (arXiv 2022.11) Texts as Images in Prompt Tuning for **Multi-Label Image Recognition**, [[Paper]](https://arxiv.org/pdf/2211.12739.pdf), [[Code]](https://github.com/guozix/TaI-DPT)

- (arXiv 2022.11) Tell Me What Happened: Unifying **Text-guided Video Completion** via Multimodal Masked Video Generation, [[Paper]](https://arxiv.org/pdf/2211.12824.pdf)

- (arXiv 2022.11) InDiReCT: Language-Guided Zero-Shot Deep **Metric Learning** for Images, [[Paper]](https://arxiv.org/pdf/2211.12760.pdf)

- (arXiv 2022.11) VoP: Text-Video Co-operative Prompt Tuning for **Cross-Modal Retrieval**, [[Paper]](https://arxiv.org/pdf/2211.12764.pdf), [[Code]](https://github.com/bighuang624/VoP)

- (arXiv 2022.11) **Completing point cloud** from few points by Wasserstein GAN and Transformers, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.12746.pdf), [[Code]](https://github.com/WxfQjh/Stability-point-recovery.git)

- (arXiv 2022.11) Integrally Pre-Trained Transformer **Pyramid** Networks, [[Paper]](https://arxiv.org/pdf/2211.12735.pdf), [[Code]](https://github.com/sunsmarterjie/iTPN)

- (arXiv 2022.11) Data Augmentation Vision Transformer for **Fine-grained Image Classification**, [[Paper]](https://arxiv.org/pdf/2211.12879.pdf)

- (arXiv 2022.11) **DETR**s with Collaborative Hybrid Assignments **Training**, [[Paper]](https://arxiv.org/pdf/2211.12860.pdf), [[Code]](https://github.com/Sense-X/Co-DETR)

- (arXiv 2022.11) Open-vocabulary **Attribute Detection**, [[Paper]](https://arxiv.org/pdf/2211.12914.pdf), [[Project]](https://ovad-benchmark.github.io/)

- (arXiv 2022.11) Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised **Monocular Depth Estimation**, [[Paper]](https://arxiv.org/pdf/2211.13202.pdf), [[Code]](https://github.com/noahzn/Lite-Mono)

- (arXiv 2022.11) Inversion-Based **Creativity Transfer** with Diffusion Models, [[Paper]](https://arxiv.org/pdf/2211.13203.pdf), [[Code]](https://github.com/zyxElsa/creativity-transfer)

- (arXiv 2022.11) CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free **Continual Learning**, [[Paper]](https://arxiv.org/pdf/2211.13218.pdf)

- (arXiv 2022.11) SVFormer: Semi-supervised Video Transformer for **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2211.13222.pdf), [[Code]](https://github.com/ChenHsing/SVFormer)

- (arXiv 2022.11) Generalizable **Implicit Neural Representations** via Instance Pattern Composers, [[Paper]](https://arxiv.org/pdf/2211.13223.pdf)

- (arXiv 2022.11) Improving **Visual-textual Sentiment Analysis** by Fusing Expert Features, [[Paper]](https://arxiv.org/pdf/2211.12981.pdf)

- (arXiv 2022.11) **Self-Supervised** Learning based on Heat Equation, [[Paper]](https://arxiv.org/pdf/2211.13228.pdf)

- (arXiv 2022.11) Peekaboo: **Text to Image** Diffusion Models are Zero-Shot Segmentors, [[Paper]](https://arxiv.org/pdf/2211.13224.pdf)

- (arXiv 2022.11) Paint by Example: Exemplar-based **Image Editing** with Diffusion Models, [[Paper]](https://arxiv.org/pdf/2211.13227.pdf), [[Code]](https://github.com/Fantasy-Studio/Paint-by-Example)

- (arXiv 2022.11) Human or Machine? **Turing Tests** for Vision and Language, [[Paper]](https://arxiv.org/pdf/2211.13087.pdf), [[Code]](https://tinyurl.com/8x8nha7p)

- (arXiv 2022.11) Teach-DETR: Better **Training** **DETR** with Teachers, [[Paper]](https://arxiv.org/pdf/2211.11953.pdf), [[Code]](https://github.com/LeonHLJ/Teach-DETR)

- (arXiv 2022.11) Conv2Former: A Simple Transformer-Style **ConvNet** for Visual Recognition, [[Paper]](https://arxiv.org/pdf/2211.11943.pdf)

- (arXiv 2022.11) X^2-VLM: All-In-One Pre-trained Model For **Vision-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2211.12402.pdf), [[Code]](https://github.com/zengyan-97/X2-VLM)

- (arXiv 2022.11) Aligning Source Visual and Target Language Domains for Unpaired **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2211.12148.pdf)

- (arXiv 2022.11) On the Transferability of Visual Features in **Generalized Zero-Shot Learning**, [[Paper]](https://arxiv.org/pdf/2211.12494.pdf), [[Code]](https://github.com/uvavision/TV-GZSL)

- (arXiv 2022.11) Generalizable Industrial Visual **Anomaly Detection** with Self-Induction Vision Transformer, [[Paper]](https://arxiv.org/pdf/2211.12311.pdf)

- (arXiv 2022.11) Transformer Based Multi-Grained Features for Unsupervised **Person Re-Identification**, [[Paper]](https://arxiv.org/pdf/2211.12280.pdf), [[Code]](https://github.com/RikoLi/WACV23-workshop-TMGF)

- (arXiv 2022.11) Efficient Frequency Domain-based Transformers for High-Quality Image **Deblurring**, [[Paper]](https://arxiv.org/pdf/2211.12250.pdf), [[Code]](https://github.com/kkkls/FFTformer)

- (arXiv 2022.11) Event Transformer+. A multi-purpose solution for efficient **event data processing**, [[Paper]](https://arxiv.org/pdf/2211.12222.pdf)

- (arXiv 2022.11) MagicPony: Learning Articulated **3D Animals** in the Wild, [[Paper]](https://arxiv.org/pdf/2211.12497.pdf), [[Project]](https://3dmagicpony.github.io/)

- (arXiv 2022.11) Gated Class-Attention with Cascaded Feature Drift Compensation for Exemplar-free **Continual Learning** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.12292.pdf), [[Code]](https://github.com/OcraM17/GCAB-CFDC)

- (arXiv 2022.11) Expectation-Maximization Contrastive Learning for Compact **Video-and-Language** Representations, [[Paper]](https://arxiv.org/pdf/2211.11427.pdf), [[Code]](https://github.com/jpthu17/EMCL)

- (arXiv 2022.11) N-Gram in Swin Transformers for Efficient Lightweight **Image Super-Resolution**, [[Paper]](https://arxiv.org/pdf/2211.11436.pdf)

- (arXiv 2022.11) **Robotic** Skill Acquisition via Instruction Augmentation with Vision-Language Models, [[Paper]](https://arxiv.org/pdf/2211.11736.pdf), [[Code]](https://instructionaugmentation.github.io/)

- (arXiv 2022.11) Peeling the Onion: Hierarchical Reduction of Data Redundancy for **Efficient** Vision Transformer **Training**, [[Paper]](https://arxiv.org/pdf/2211.10801.pdf), [[Code]](https://github.com/ZLKong/Tri-Level-ViT)

- (arXiv 2022.11) Unifying **Vision-Language** Representation Space with Single-tower Transformer, [[Paper]](https://arxiv.org/pdf/2211.11153.pdf)

- (arXiv 2022.11) DeepSolo: Let Transformer Decoder with Explicit Points Solo for **Text Spotting**, [[Paper]](https://arxiv.org/pdf/2211.10772.pdf)

- (arXiv 2022.11) Castling-ViT: **Compressing Self-Attention** via Switching Towards Linear-Angular Attention During Vision Transformer Inference, [[Paper]](https://arxiv.org/pdf/2211.10526.pdf)

- (arXiv 2022.11) CL-CrossVQA: A Continual Learning Benchmark for **Cross-Domain Visual Question Answering**, [[Paper]](https://arxiv.org/pdf/2211.10567.pdf)

- (arXiv 2022.11) Normal Transformer: Extracting Surface Geometry from **LiDAR** Points Enhanced by Visual Semantics, [[Paper]](https://arxiv.org/pdf/2211.10580.pdf)

- (arXiv 2022.11) A Unified Model for **Video** Understanding and Knowledge Embedding with Heterogeneous **Knowledge Graph** Dataset, [[Paper]](https://arxiv.org/pdf/2211.10624.pdf)

- (arXiv 2022.11) Efficient **Video Representation** Learning via Masked Video Modeling with Motion-centric Token Selection, [[Paper]](https://arxiv.org/pdf/2211.10636.pdf)

- (arXiv 2022.11) DiffStyler: Controllable Dual Diffusion for Text-Driven **Image Stylization**, [[Paper]](https://arxiv.org/pdf/2211.10682.pdf)

- (arXiv 2022.11) TORE: Token Reduction for Efficient **Human Mesh Recovery** with Transformer, [[Paper]](https://arxiv.org/pdf/2211.10705.pdf)

- (arXiv 2022.11) **Synthesizing** Coherent **Story** with Auto-Regressive Latent Diffusion Models, [[Paper]](https://arxiv.org/pdf/2211.10950.pdf), [[Code]](https://github.com/Flash-321/ARLDM)

- (arXiv 2022.11) Are **Out-of-Distribution Detection** Methods Reliable?, [[Paper]](https://arxiv.org/pdf/2211.10892.pdf)

- (arXiv 2022.11) GLT-T: Global-Local Transformer Voting for **3D Single Object Tracking** in Point Clouds, [[Paper]](https://arxiv.org/pdf/2211.10927.pdf), [[Code]](https://github.com/haooozi/GLT-T)

- (arXiv 2022.11) CROSS-MODAL CONTRASTIVE LEARNING FOR ROBUST REASONING IN **VQA**, [[Paper]](https://arxiv.org/pdf/2211.11190.pdf), [[Code]](https://github.com/qizhust/cmcl_vqa_pl)

- (arXiv 2022.11) LISA: Localized **Image Stylization** with Audio via Implicit Neural Representation, [[Paper]](https://arxiv.org/pdf/2211.11381.pdf)

- (arXiv 2022.11) MagicVideo: Efficient **Video Generation** With Latent Diffusion Models, [[Paper]](https://arxiv.org/pdf/2211.11018.pdf), [[Code]](https://magicvideo.github.io/#)

- (arXiv 2022.11) DreamArtist: Towards Controllable One-Shot **Text-to-Image** Generation via Contrastive Prompt-Tuning, [[Paper]](https://arxiv.org/pdf/2211.11337.pdf)

- (arXiv 2022.11) Hybrid Transformer Based Feature Fusion for Self-Supervised **Monocular Depth Estimation**, [[Paper]](https://arxiv.org/pdf/2211.11066.pdf)

- (arXiv 2022.11) Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable **Image Classification**, [[Paper]](https://arxiv.org/pdf/2211.11158.pdf)

- (arXiv 2022.11) Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language **Navigation**, [[Paper]](https://arxiv.org/pdf/2211.11116.pdf)

- (arXiv 2022.11) You Need Multiple Exiting: Dynamic Early Exiting for **Accelerating** Unified Vision Language Model, [[Paper]](https://arxiv.org/pdf/2211.11152.pdf)

- (arXiv 2022.11) Beyond Attentive Tokens: Incorporating Token Importance and Diversity for **Efficient** Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.11315.pdf)

- (arXiv 2022.11) FlowLens: Seeing Beyond the **FoV** via Flow-guided **Clip**-Recurrent Transformer, [[Paper]](https://arxiv.org/pdf/2211.11293.pdf), [[Code]](https://github.com/MasterHow/FlowLens)

- (arXiv 2022.11) PS-Transformer: Learning Sparse **Photometric Stereo** Network using Self-Attention Mechanism, [[Paper]](https://arxiv.org/pdf/2211.11386.pdf)

- (arXiv 2022.11) On the Robustness, Generalization, and Forgetting of Shape-Texture Debiased **Continual Learning**, [[Paper]](https://arxiv.org/pdf/2211.11174.pdf)

- (arXiv 2022.11) Vision Transformer with Super **Token Sampling**, [[Paper]](https://arxiv.org/pdf/2211.11167.pdf), [[Code]](https://github.com/hhb072/SViT)

- (arXiv 2022.11) Detect Only What You Specify: Object **Detection** with Linguistic Target, [[Paper]](https://arxiv.org/pdf/2211.11572.pdf)

- (arXiv 2022.11) Visual Programming: Compositional **visual reasoning** without training, [[Paper]](https://arxiv.org/pdf/2211.11559.pdf), [[Project]](https://prior.allenai.org/projects/visprog)

- (arXiv 2022.11) ClipCrop: Conditioned **Cropping** Driven by **Vision-Language** Model, [[Paper]](https://arxiv.org/pdf/2211.11492.pdf)

- (arXiv 2022.11) SMAUG: Sparse **Masked** Autoencoder for **Efficient** **Video-Language** Pre-training, [[Paper]](https://arxiv.org/pdf/2211.11446.pdf)

- (arXiv 2022.11) **Blur Interpolation** Transformer for Real-World Motion from Blur, [[Paper]](https://arxiv.org/pdf/2211.11423.pdf)

- (arXiv 2022.11) Mean Shift Mask Transformer for Unseen Object Instance **Segmentation**, [[Paper]](https://arxiv.org/pdf/2211.11679.pdf), [[Code]](https://github.com/YoungSean/UnseenObjectsWithMeanShift)

- (arXiv 2022.11) PointCLIP V2: Adapting **CLIP** for Powerful **3D** Open-world Learning, [[Paper]](https://arxiv.org/pdf/2211.11682.pdf), [[Code]](https://github.com/yangyangyang127/PointCLIP_V2)

- (arXiv 2022.11) Exploring Discrete **Diffusion** Models for Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2211.11694.pdf), [[Code]](https://github.com/buxiangzhiren/DDCap)

- (arXiv 2022.11) PERCEIVER-VL: **Efficient** **Vision-and-Language** Modeling with Iterative Latent Attention, [[Paper]](https://arxiv.org/pdf/2211.11701.pdf), [[Code]](https://github.com/zinengtang/Perceiver_VL)

- (arXiv 2022.11) Multitask **Vision-Language** **Prompt** Tuning, [[Paper]](https://arxiv.org/pdf/2211.11720.pdf), [[Code]](https://github.com/sIncerass/MVLPT)

- (arXiv 2022.11) Teaching **Structured** **Vision & Language** Concepts to Vision & Language Models, [[Paper]](https://arxiv.org/pdf/2211.11733.pdf)

- (arXiv 2022.11) WEIGHTED **ENSEMBLE** **SELF-SUPERVISED** LEARNING, [[Paper]](https://arxiv.org/pdf/2211.09981.pdf)

- (arXiv 2022.11) BEVFormer v2: Adapting Modern Image Backbones to **Bird’s-Eye-View Recognition** via Perspective Supervision, [[Paper]](https://arxiv.org/pdf/2211.10439.pdf)

- (arXiv 2022.11) Task Residual for Tuning **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2211.10277.pdf), [[Code]](https://github.com/geekyutao/TaskRes)

- (arXiv 2022.11) α DARTS Once More: Enhancing Differentiable **Architecture Search** by **Masked** Image Modeling, [[Paper]](https://arxiv.org/pdf/2211.10105.pdf)

- (arXiv 2022.11) Delving into Transformer for Incremental Semantic **Segmentation**, [[Paper]](https://arxiv.org/pdf/2211.10253.pdf)

- (arXiv 2022.11) DETRDistill: A Universal **Knowledge Distillation** Framework for **DETR**-families, [[Paper]](https://arxiv.org/pdf/2211.10156.pdf)

- (arXiv 2022.11) PromptCap: Prompt-Guided Task-Aware Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2211.09699.pdf)

- (arXiv 2022.11) UNIFORMERV2: SPATIOTEMPORAL LEARNING BY ARMING IMAGE VITS WITH **VIDEO** UNIFORMER, [[Paper]](https://arxiv.org/pdf/2211.09552.pdf), [[Code]](https://github.com/OpenGVLab/UniFormerV2)

- (arXiv 2022.11) **Masked** Reconstruction **Contrastive** Learning with Information Bottleneck Principle, [[Paper]](https://arxiv.org/pdf/2211.09013.pdf)

- (arXiv 2022.11) Listen, denoise, action! Audio-driven **motion synthesis** with diffusion models, [[Paper]](https://arxiv.org/pdf/2211.09707.pdf), [[Project]](https://www.speech.kth.se/research/listen-denoise-action/)

- (arXiv 2022.11) ConStruct-VL: Data-Free Continual **Structured VL Concepts** Learning, [[Paper]](https://arxiv.org/pdf/2211.09790.pdf)

- (arXiv 2022.11) How to **Fine-Tune** Vision Models with **SGD**, [[Paper]](https://arxiv.org/pdf/2211.09359.pdf)

- (arXiv 2022.11) Progressive Tree-Structured Prototype Network for End-to-End Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2211.09460.pdf), [[Code]](https://github.com/NovaMind-Z/PTSN)

- (arXiv 2022.11) CapEnrich: Enriching **Caption** Semantics for Web Images via Cross-modal Pre-trained Knowledge, [[Paper]](https://arxiv.org/pdf/2211.09371.pdf)

- (arXiv 2022.11) Visual Commonsense-aware Representation Network for **Video Captioning**, [[Paper]](https://arxiv.org/pdf/2211.09469.pdf), [[Code]](https://github.com/zchoi/VCRN)

- (arXiv 2022.11) Language Conditioned Spatial Relation Reasoning for **3D Object Grounding**, [[Paper]](https://arxiv.org/pdf/2211.09646.pdf), [[Code]](https://cshizhe.github.io/projects/vil3dref.html)

- (arXiv 2022.11) HARDVS: Revisiting Human **Activity Recognition** with **Dynamic Vision Sensors**, [[Paper]](https://arxiv.org/pdf/2211.09648.pdf), [[Code]](https://github.com/Event-AHU/HARDVS)

- (arXiv 2022.11) Towards All-in-one **Pre-training** via Maximizing **Multi-modal** Mutual Information, [[Paper]](https://arxiv.org/pdf/2211.09807.pdf), [[Code]](https://github.com/OpenGVLab/M3I-Pretraining)

- (arXiv 2022.11) Uni-Perceiver v2: A **Generalist** Model for Large-Scale **Vision** and **Vision-Language** Tasks, [[Paper]](https://arxiv.org/pdf/2211.09808.pdf), [[Code]](https://github.com/fundamentalvision/Uni-Perceiver)

- (arXiv 2022.11) D^3ETR: Decoder **Distillation** for **Detection** Transformer, [[Paper]](https://arxiv.org/pdf/2211.09768.pdf)

- (arXiv 2022.11) **CAE** v2: Context Autoencoder with **CLIP** Target, [[Paper]](https://arxiv.org/pdf/2211.09799.pdf)

- (arXiv 2022.11) Cross-Modal Adapter for **Text-Video Retrieval**, [[Paper]](https://arxiv.org/pdf/2211.09623.pdf), [[Code]](https://github.com/LeapLabTHU/Cross-Modal-Adapter)

- (arXiv 2022.11) TOKEN **TURING MACHINES**, [[Paper]](https://arxiv.org/pdf/2211.09119.pdf)

- (arXiv 2022.11) WILL LARGE-SCALE **GENERATIVE** MODELS CORRUPT **FUTURE DATASETS**?, [[Paper]](https://arxiv.org/pdf/2211.08095.pdf), [[Code]](https://github.com/moskomule/dataset-contamination)

- (arXiv 2022.11) Demystify **Self-Attention** in Vision Transformers from a Semantic Perspective: Analysis and Application, [[Paper]](https://arxiv.org/pdf/2211.08543.pdf)

- (arXiv 2022.11) SATVSR: Scenario Adaptive Transformer for Cross Scenarios **Video Super-Resolution**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.08703.pdf)

- (arXiv 2022.11) TransCC: Transformer-based **Multiple Illuminant Color Constancy** Using Multitask Learning, [[Paper]](https://arxiv.org/pdf/2211.08772.pdf)

- (arXiv 2022.11) Stare at What You See: **Masked Image Modeling** without Reconstruction, [[Paper]](https://arxiv.org/pdf/2211.08887.pdf), [[Code]](https://github.com/OpenPerceptionX/maskalign)

- (arXiv 2022.11) HeatViT: Hardware-Efficient Adaptive **Token Pruning** for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.08110.pdf)

- (arXiv 2022.11) Cross-domain Federated Adaptive **Prompt Tuning** for **CLIP**, [[Paper]](https://arxiv.org/pdf/2211.07864.pdf)

- (arXiv 2022.11) YORO - Lightweight End to End **Visual Grounding**, [[Paper]](https://arxiv.org/pdf/2211.07912.pdf)

- (arXiv 2022.11) **Knowledge Distillation** for Detection Transformer with Consistent Distillation Points Sampling, [[Paper]](https://arxiv.org/pdf/2211.08071.pdf)

- (arXiv 2022.11) BiViT: Extremely **Compressed** **Binary** Vision Transformer, [[Paper]](https://arxiv.org/pdf/2211.07091.pdf)

- (arXiv 2022.11) ContextCLIP: Contextual Alignment of **Image-Text** pairs on **CLIP** visual representations, [[Paper]](https://arxiv.org/pdf/2211.07122.pdf)

- (arXiv 2022.11) Zero-shot Image **Captioning** by Anchor-augmented Vision-Language Space Alignment, [[Paper]](https://arxiv.org/pdf/2211.07275.pdf)

- (arXiv 2022.11) Seeing Beyond the **Brain**: Conditional Diffusion Model with Sparse Masked Modeling for **Vision Decoding**, [[Paper]](https://arxiv.org/pdf/2211.06956.pdf), [[Project]](https://mind-vis.github.io/)

- (arXiv 2022.11) Enhancing **Few-Shot Image Classification** with Cosine Transformer, [[Paper]](https://arxiv.org/pdf/2211.06828.pdf), [[Code]](https://github.com/vinuni-vishc/Few-Shot-Cosine-Transformer)

- (arXiv 2022.11) SCOTCH and SODA: A Transformer **Video Shadow Detection** Framework, [[Paper]](https://arxiv.org/pdf/2211.06885.pdf)

- (arXiv 2022.11) AU-Aware Vision Transformers for Biased **Facial Expression Recognition**, [[Paper]](https://arxiv.org/pdf/2211.06609.pdf)

- (arXiv 2022.11) Fast Text-Conditional Discrete **Denoising** on Vector-Quantized Latent Spaces, [[Paper]](https://arxiv.org/pdf/2211.07292.pdf), [[Code]](https://github.com/dome272/Paella)

- (arXiv 2022.11) Large-Scale Bidirectional Training for Zero-Shot Image **Captioning**, [[Paper]](https://arxiv.org/pdf/2211.06774.pdf)

- (arXiv 2022.11) Grafting Pre-trained Models for Multimodal **Headline Generation**, [[Paper]](https://arxiv.org/pdf/2211.07210.pdf)

- (arXiv 2022.11) CabViT: Cross **Attention** among Blocks for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2211.07198.pdf), [[Code]](https://github.com/hkzhang91/CabViT)

- (arXiv 2022.11) **Composed Image Retrieval** with Text Feedback via Multi-grained Uncertainty Regularization, [[Paper]](https://arxiv.org/pdf/2211.07394.pdf)

- (arXiv 2022.11) SSGVS: Semantic **Scene Graph-to-Video** Synthesis, [[Paper]](https://arxiv.org/pdf/2211.06119.pdf)

- (arXiv 2022.11) One-Time **Model Adaptation** to Heterogeneous Clients: An Intra-Client and Inter-Image Attention Design, [[Paper]](https://arxiv.org/pdf/2211.06276.pdf)

- (arXiv 2022.11) An Improved End-to-End **Multi-Target Tracking** Method Based on Transformer Self-Attention, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.06001.pdf)

- (arXiv 2022.11) Zero-shot Visual Commonsense **Immorality Prediction**, [[Paper]](https://arxiv.org/pdf/2211.05521.pdf), [[Code]](https://github.com/ku-vai/Zero-shot-Visual-Commonsense-Immorality-Prediction)

- (arXiv 2022.11) Hyperbolic Cosine Transformer for **LiDAR 3D Object Detection**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.05580.pdf)

- (arXiv 2022.11) **Training** a Vision Transformer from scratch in less than 24 hours with 1 GPU, [[Paper]](https://arxiv.org/pdf/2211.05187.pdf), [[Code]](https://github.com/BorealisAI/efficient-vit-training)

- (arXiv 2022.11) ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer **Acceleration** with a Linear Taylor Attention, [[Paper]](https://arxiv.org/pdf/2211.05109.pdf)

- (arXiv 2022.11) SimOn: A Simple Framework for **Online Temporal Action Localization**, [[Paper]](https://arxiv.org/pdf/2211.04905.pdf), [[Code]](https://github.com/TuanTNG/SimOn)

- (arXiv 2022.11) ERNIE-UNIX^2: A UNIFIED **CROSS-LINGUAL CROSS-MODAL** FRAMEWORK FOR UNDERSTANDING AND GENERATION, [[Paper]](https://arxiv.org/pdf/2211.04861.pdf)

- (arXiv 2022.11) SG-Shuffle: Multi-aspect Shuffle Transformer for **Scene Graph Generation**, [[Paper]](https://arxiv.org/pdf/2211.04773.pdf)

- (arXiv 2022.11) Understanding Cross-modal Interactions in V&L Models that Generate **Scene Descriptions**, [[Paper]](https://arxiv.org/pdf/2211.04971.pdf)

- (arXiv 2022.11) VieCap4H - VLSP 2021: ObjectAoA - Enhancing performance of Object Relation Transformer with Attention on Attention for **Vietnamese** image **captioning**, [[Paper]](https://arxiv.org/pdf/2211.05405.pdf)

- (arXiv 2022.11) Watching the News: Towards **VideoQA** Models that can Read, [[Paper]](https://arxiv.org/pdf/2211.05588.pdf), [[Project]](http://cvit.iiit.ac.in/research/projects/cvit-projects/videoqa)

- (arXiv 2022.11) Efficient Joint **Detection** and **Multiple Object Tracking** with Spatially Aware Transformer, [[Paper]](https://arxiv.org/pdf/2211.05654.pdf)

- (arXiv 2022.11) **Demystify** Transformers & **Convolutions** in Modern Image Deep Networks, [[Paper]](https://arxiv.org/pdf/2211.05781.pdf), [[Code]](https://github.com/OpenGVLab/STM-Evaluation)

- (arXiv 2022.11) InternImage: Exploring Large-Scale Vision Foundation Models with **Deformable Convolutions**, [[Paper]](https://arxiv.org/pdf/2211.05778.pdf), [[Code]](https://github.com/OpenGVLab/InternImage)

- (arXiv 2022.11) DEPTHFORMER: MULTIMODAL POSITIONAL ENCODINGS AND CROSS-INPUT ATTENTION FOR TRANSFORMER-BASED **SEGMENTATION** NETWORKS, [[Paper]](https://arxiv.org/pdf/2211.04188.pdf)

- (arXiv 2022.11) Sequential Transformer for End-to-End **Person Search**, [[Paper]](https://arxiv.org/pdf/2211.04323.pdf)

- (arXiv 2022.11) Prompting Large Pre-trained Vision-Language Models For **Compositional Concept Learning**, [[Paper]](https://arxiv.org/pdf/2211.05077.pdf)

- (arXiv 2022.11) CASA: Category-agnostic **Skeletal Animal Reconstruction**, [[Paper]](https://arxiv.org/pdf/2211.03568.pdf)

- (arXiv 2022.11) ViT-CX: Causal **Explanation** of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.03064.pdf)

- (arXiv 2022.11) Disentangling Content and Motion for **Text-Based Neural Video Manipulation**, [[Paper]](https://arxiv.org/pdf/2211.02980.pdf)

- (arXiv 2022.11) **Efficient** Multi-order Gated Aggregation Network, [[Paper]](https://arxiv.org/pdf/2211.03295.pdf)

- (arXiv 2022.11) CLOP: **Video-and-Language** Pre-Training with Knowledge Regularizations, [[Paper]](https://arxiv.org/pdf/2211.03314.pdf)

- (arXiv 2022.11) MSMG-Net: Multi-scale Multi-grained Supervised Networks for Multi-task Image Manipulation **Detection** and **Localization**, [[Paper]](https://arxiv.org/ftp/arxiv/papers/2211/2211.03140.pdf)

- (arXiv 2022.11) Understanding and Mitigating Overfitting in **Prompt** Tuning for **Vision-Language** Models, [[Paper]](https://arxiv.org/pdf/2211.02219.pdf), [[Code]](https://tinyurl.com/mpe64f89)

- (arXiv 2022.11) Zero-shot **Video Moment Retrieval** With Off-the-Shelf Models, [[Paper]](https://arxiv.org/pdf/2211.02178.pdf)

- (arXiv 2022.11) Scaling **Multimodal** Pre-Training via Cross-Modality Gradient Harmonization, [[Paper]](https://arxiv.org/pdf/2211.02077.pdf)

- (arXiv 2022.11) A Transformer Architecture for Online **Gesture Recognition** of Mathematical Expressions, [[Paper]](https://arxiv.org/pdf/2211.02643.pdf)

- (arXiv 2022.11) Evaluating and Improving Factuality in **Multimodal Abstractive Summarization**, [[Paper]](https://arxiv.org/pdf/2211.02580.pdf), [[Code]](https://github.com/meetdavidwan/faithful-multimodal-summ)

- (arXiv 2022.11) RCDPT: **RADAR-CAMERA FUSION** DENSE PREDICTION TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2211.02432.pdf)

- (arXiv 2022.11) **Video Event Extraction** via Tracking Visual States of Arguments, [[Paper]](https://arxiv.org/pdf/2211.01781.pdf)

- (arXiv 2022.11) The **Lottery Ticket** Hypothesis for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2211.01484.pdf)

- (arXiv 2022.11) TEXTCRAFT: ZERO-SHOT GENERATION OF HIGH-FIDELITY AND DIVERSE **SHAPES FROM TEXT**, [[Paper]](https://arxiv.org/pdf/2211.01427.pdf)

- (arXiv 2022.11) PolyBuilding: Polygon Transformer for End-to-End **Building Extraction**, [[Paper]](https://arxiv.org/pdf/2211.01589.pdf)

- (arXiv 2022.11) RETHINKING **HIERARCHIES** IN PRE-TRAINED PLAIN VISION TRANSFORMER, [[Paper]](https://arxiv.org/pdf/2211.01785.pdf), [[Code]](https://github.com/ViTAE-Transformer/HPViT)

- (arXiv 2022.11) SAP-**DETR**: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency, [[Paper]](https://arxiv.org/pdf/2211.02006.pdf)

- (arXiv 2022.11) Could Giant Pretrained Image Models Extract **Universal Representations**?, [[Paper]](https://arxiv.org/pdf/2211.02043.pdf)

- (arXiv 2022.11) MAEDAY: MAE for few and zero shot **AnomalY-Detection**, [[Paper]](https://arxiv.org/pdf/2211.14307.pdf), [[Code]](https://github.com/EliSchwartz/MAEDAY)

- (arXiv 2022.11) Degenerate Swin to Win: Plain **Window-based** Transformer without Sophisticated Operations, [[Paper]](https://arxiv.org/pdf/2211.14255.pdf)

- (arXiv 2022.11) Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for **3D Visual Grounding**, [[Paper]](https://arxiv.org/pdf/2211.14241.pdf), [[Code]](https://eslambakr.github.io/LAR.github.io/)

- (arXiv 2022.11) SpaText: Spatio-Textual Representation for **Controllable Image Generation**, [[Paper]](https://arxiv.org/pdf/2211.14305.pdf), [[Project]](https://omriavrahami.com/spatext)

- (arXiv 2022.11) Learning **3D** Scene Priors with **2D** Supervision, [[Paper]](https://arxiv.org/pdf/2211.14157.pdf), [[Project]](https://yinyunie.github.io/sceneprior-page/)

- (arXiv 2022.11) PoET: Pose Estimation Transformer for Single-View, Multi-Object **6D Pose Estimation**, [[Paper]](https://arxiv.org/pdf/2211.14125.pdf), [[Code]](https://github.com/aau-cns/poet)

- (arXiv 2022.11) Spatial-Spectral Transformer for **Hyperspectral Image Denoising**, [[Paper]](https://arxiv.org/pdf/2211.14090.pdf), [[Code]](https://github.com/MyuLi/SST)

- (arXiv 2022.11) Training **Vision-Language** Models with Less Bimodal Supervision, [[Paper]](https://arxiv.org/pdf/2211.00262.pdf)

- (arXiv 2022.11) Text-Only Training for Image **Captioning** using Noise-Injected **CLIP**, [[Paper]](https://arxiv.org/pdf/2211.00575.pdf), [[Code]](https://github.com/DavidHuji/CapDec)

- (arXiv 2022.11) Attention-based **Neural Cellular Automata**, [[Paper]](https://arxiv.org/pdf/2211.01233.pdf)

- (arXiv 2022.11) eDiff-I: **Text-to-Image** Diffusion Models with an Ensemble of Expert Denoisers, [[Paper]](https://arxiv.org/pdf/2211.01324.pdf), [[Code]](https://deepimagination.cc/eDiff-I/)

- (arXiv 2022.11) Chinese CLIP: Contrastive **Vision-Language** Pretraining in **Chinese**, [[Paper]](https://arxiv.org/pdf/2211.01335.pdf), [[Code]](https://github.com/OFA-Sys/Chinese-CLIP)

- (arXiv 2022.11) P^3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for **Open-Vocabulary Object Detection**, [[Paper]](https://arxiv.org/pdf/2211.00849.pdf)

- (arXiv 2022.11) tSF: Transformer-based Semantic Filter for **Few-Shot Learning**, [[Paper]](https://arxiv.org/pdf/2211.00868.pdf)

- (arXiv 2022.11) WITT: A Wireless Image Transmission Transformer for **Semantic Communications**, [[Paper]](https://arxiv.org/pdf/2211.00937.pdf), [[Code]](https://github.com/KeYang8/WITT)

- (arXiv 2022.11) Pair DETR: Contrastive Learning **Speeds Up** **DETR** Training, [[Paper]](https://arxiv.org/pdf/2210.16476.pdf)

- (arXiv 2022.11) Interaction Visual Transformer for **Egocentric Action Anticipation**, [[Paper]](https://arxiv.org/pdf/2211.14154.pdf)

- (arXiv 2022.11) UDE: A Unified Driving Engine for Human **Motion Generation**, [[Paper]](https://arxiv.org/pdf/2211.16016.pdf), [[Code]](https://github.com/zixiangzhou916/UDE/)

- (arXiv 2022.11) Action-**GPT**: Leveraging Large-scale Language Models for Improved and Generalized Zero Shot **Action Generation**, [[Paper]](https://arxiv.org/pdf/2211.15603.pdf), [[Project]](https://actiongpt.github.io/)

- (arXiv 2022.11) Human or Machine? **Turing Tests** for **Vision** and **Language**, [[Paper]](https://arxiv.org/pdf/2211.13087.pdf), [[Code]](https://tinyurl.com/8x8nha7p)

- (arXiv 2022.11) Knowledge **Prompting** for Few-shot **Action Recognition**, [[Paper]](https://arxiv.org/pdf/2211.12030.pdf)

- (arXiv 2022.11) UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance, [[Paper]](https://arxiv.org/pdf/2210.16031.pdf), [[Project]](https://upainting.github.io/)

- (arXiv 2022.11) LVP-M^3: Language-aware Visual Prompt for **Multilingual Multimodal Machine Translation**, [[Paper]](https://arxiv.org/pdf/2210.15461.pdf)

- (arXiv 2022.11) ProContEXT: Progressive Context Transformer for **Tracking**, [[Paper]](https://ar