https://github.com/52cv/iccv-2023-papers

# ICCV-2023-Papers
![Alt text](af0c53186833a908a200f58867b6dcf.png)

## Official website: https://iccv2023.thecvf.com/

### Workshops :bell:: October 2-3, 2023

### Main conference :bell:: October 4-6, 2023

## Categorized survey papers from past years: click here ↘️[CV-Surveys](https://github.com/52CV/CV-Surveys) (under construction)

## Categorized 2024 papers: click here
↘️[WACV-2024-Papers](https://github.com/52CV/WACV-2024-Papers)

## Categorized 2023 papers: click here
↘️[CVPR-2023-Papers](https://github.com/52CV/CVPR-2023-Papers)
↘️[WACV-2023-Papers](https://github.com/52CV/WACV-2023-Papers)
↘️[ICCV-2023-Papers](https://github.com/52CV/ICCV-2023-Papers)

## [Categorized 2022 papers: click here](#000)
## [Categorized 2021 papers: click here](#00)
## [Categorized 2020 papers: click here](#0)

## Contents

|:cat:|:dog:|:tiger:|:wolf:|
|------|------|------|------|
|[1.Others](#1)|[2.3D (3D Reconstruction/3D Vision)](#2)|[3.Object Detection](#3)|[4.Object Tracking](#4)|
|[5.Biometric Recognition](#5)|[6.Face](#6)|[7.Image Processing (Low-Level Image Processing, Quality Assessment)](#7)|[8.Image Segmentation](#8)|
|[9.Image Classification](#9)|[10.Image Synthesis](#10)|[11.Image/Video Editing](#11)|[12.Medical Image](#12)|
|[13.Image Captions](#13)|[14.Image/Video Compression](#14)|[15.Image/Video Retrieval](#15)|[16.Super-Resolution](#16)|
|[17.GAN](#17)|[18.Pose](#18)|[19.UAV/Remote Sensing/Satellite Image](#19)|[20.Reid](#20)|
|[21.Point Cloud](#21)|[22.OCR](#22)|[23.Optical Flow Estimation](#23)|[24.Few/Zero-Shot Learning/Domain Generalization/Adaptation](#24)|
|[25.Model Compression/KD/Pruning](#25)|[26.ML (Machine Learning)](#26)|[27.Self/Semi-Supervised Learning](#27)|[28.Style Transfer](#28)|
|[29.Autonomous Vehicles](#29)|[30.SLAM/AR/VR/Robotics](#30)|[31.HOI (Human-Object Interaction)](#31)|[32.Sign Language Recognition](#32)|
|[33.Video](#33)|[34.Action Detection](#34)|[35.Human Motion Prediction](#35)|[36.Vision Question Answering](#36)|
|[37.Object Pose Estimation](#37)|[38.Vision-Language](#38)|[39.Keypoint Detection](#39)|[40.Anomaly Detection](#40)|
|[41.Vision Transformers](#41)|[42.Dataset/Benchmark](#42)|[43.Neural Radiance Fields](#43)|[44.Rendering](#44)|
|[45.Scene Graph Generation](#45)|[46.Edge Detection](#46)|[47.Image-to-Image Translation](#47)|[48.Image Reconstruction](#48)|
|[49.Image Fusion](#49)|[50.Image Matching](#50)|[51.Visual Place Recognition](#51)|[52.View Synthesis](#52)|
|[53.Computational Imaging (e.g., optical, geometric, and light-field imaging)](#53)|[54.Gaze Estimation](#54)|[55.Sound (Speech)](#55)|[56.NAS](#56)|
|[57.Semantic Scene Completion](#57)|[58.Scene Flow Estimation](#58)|[59.Copyright Protection (Information Security)](#59)|[60.Visual Localization](#60)|
|[61.Natural Language Processing (NLP)](#61)|[62.Group Affect Recognition](#62)|[63.Industrial Defect Detectors](#63)|[64.Scene Understanding](#64)|
|[65.Deepfake Detectors](#65)|[66.Graph Neural Networks](#66)|[67.Open Set Recognition](#67)|[68.Clustering](#68)|
|[69.Affordance Learning](#69)|[70.Active Learning](#70)|[71.Data Augmentation](#71)|[72.Dense Prediction](#72)|
|[73.Spiking Neural Networks](#73)|

## 💥💥💥ICCV 2023 Award-Winning Papers
### Best Paper Award (Marr Prize)
* [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/pdf/2302.05543.pdf)
:star:[code](https://github.com/lllyasviel/ControlNet)
* [Passive Ultra-Wideband Single-Photon Imaging](https://appleswithacapitala.github.io/static/docs/paper.pdf)
:house:[project](https://appleswithacapitala.github.io/)
### Best Paper Honorable Mention
* [Segment Anything](https://arxiv.org/abs/2304.02643)
:house:[project](https://segment-anything.com/)
### Best Student Paper Award
* [Tracking Everything Everywhere All at Once](https://browse.arxiv.org/pdf/2306.05422.pdf)
:star:[code](https://github.com/qianqianwang68/omnimotion)


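:thumbsup:[ICCV 2023 dataset roundup (underwater images and videos, shadow removal, object detection/tracking/segmentation, interaction, super-resolution, etc.)](https://mp.weixin.qq.com/s/XK943x4INOGHzD5Kvvo_Hw)

:thumbsup:[ICCV 2023 dataset roundup (animal and human pose, autonomous driving, remote sensing, snow removal, faces, VOS, etc.)](https://zhuanlan.zhihu.com/p/660344176)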

## 78.Sketch
* [CLIPascene: Scene Sketching with Different Types and Levels of Abstraction](http://arxiv.org/abs/2211.17256)

## 73.Spiking Neural Networks
* [RMP-Loss: Regularizing Membrane Potential Distribution for Spiking Neural Networks](http://arxiv.org/abs/2308.06787v1)
* [Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks](https://openaccess.thecvf.com/content/ICCV2023/papers/Meng_Towards_Memory-_and_Time-Efficient_Backpropagation_for_Training_Spiking_Neural_Networks_ICCV_2023_paper.pdf)
* [SSF: Accelerating Training of Spiking Neural Networks with Stabilized Spiking Flow](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_SSF_Accelerating_Training_of_Spiking_Neural_Networks_with_Stabilized_Spiking_ICCV_2023_paper.pdf)
* [Inherent Redundancy in Spiking Neural Networks](http://arxiv.org/abs/2308.08227v1)
:star:[code](https://github.com/BICLab/ASA-SNN)
* [Membrane Potential Batch Normalization for Spiking Neural Networks](http://arxiv.org/abs/2308.08359v1)
:star:[code](https://github.com/yfguo91/MPBN)
* [Unleashing the Potential of Spiking Neural Networks with Dynamic Confidence](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_Unleashing_the_Potential_of_Spiking_Neural_Networks_with_Dynamic_Confidence_ICCV_2023_paper.pdf)
* [Temporal-Coded Spiking Neural Networks with Dynamic Firing Threshold: Learning with Event-Driven Backpropagation](https://openaccess.thecvf.com/content/ICCV2023/papers/Wei_Temporal-Coded_Spiking_Neural_Networks_with_Dynamic_Firing_Threshold_Learning_with_ICCV_2023_paper.pdf)
* [Efficient Converted Spiking Neural Network for 3D and 2D Classification](https://openaccess.thecvf.com/content/ICCV2023/papers/Lan_Efficient_Converted_Spiking_Neural_Network_for_3D_and_2D_Classification_ICCV_2023_paper.pdf)

## 72.Dense Prediction
* [Multi-Task Learning with Knowledge Distillation for Dense Prediction](https://openaccess.thecvf.com/content/ICCV2023/papers/Xu_Multi-Task_Learning_with_Knowledge_Distillation_for_Dense_Prediction_ICCV_2023_paper.pdf)
* [Consistent Depth Prediction for Transparent Object Reconstruction from RGB-D Camera](https://openaccess.thecvf.com/content/ICCV2023/papers/Cai_Consistent_Depth_Prediction_for_Transparent_Object_Reconstruction_from_RGB-D_Camera_ICCV_2023_paper.pdf)
* [EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction](https://openaccess.thecvf.com/content/ICCV2023/papers/Cai_EfficientViT_Lightweight_Multi-Scale_Attention_for_High-Resolution_Dense_Prediction_ICCV_2023_paper.pdf)

## 71.Data Augmentation
* [HybridAugment++: Unified Frequency Spectra Perturbations for Model Robustness](http://arxiv.org/abs/2307.11823v1)
* [MixBag: Bag-Level Data Augmentation for Learning from Label Proportions](http://arxiv.org/abs/2308.08822v1)
* [When to Learn What: Model-Adaptive Data Augmentation Curriculum](http://arxiv.org/abs/2309.04747)
* [Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning](http://arxiv.org/abs/2308.06038)

## 70.Active Learning
* [TiDAL: Learning Training Dynamics for Active Learning](http://arxiv.org/abs/2210.06788)
* [HAL3D: Hierarchical Active Learning for Fine-Grained 3D Part Labeling](http://arxiv.org/abs/2301.10460)
* [Knowledge-Aware Federated Active Learning with Non-IID Data](http://arxiv.org/abs/2211.13579)
* [Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_Skip-Plan_Procedure_Planning_in_Instructional_Videos_via_Condensed_Action_Space_ICCV_2023_paper.pdf)

## 69.Affordance Learning
* [MAAL: Multimodality-Aware Autoencoder-Based Affordance Learning for 3D Articulated Objects](https://openaccess.thecvf.com/content/ICCV2023/papers/Liang_MAAL_Multimodality-Aware_Autoencoder-Based_Affordance_Learning_for_3D_Articulated_Objects_ICCV_2023_paper.pdf)

## 68.Clustering
* [Deep Multiview Clustering by Contrasting Cluster Assignments](http://arxiv.org/abs/2304.10769)
* [Cross-modal Scalable Hyperbolic Hierarchical Clustering](https://openaccess.thecvf.com/content/ICCV2023/papers/Long_Cross-modal_Scalable_Hierarchical_Clustering_in_Hyperbolic_space_ICCV_2023_paper.pdf)
* [Cross-view Topology Based Consistent and Complementary Information for Deep Multi-view Clustering](https://openaccess.thecvf.com/content/ICCV2023/papers/Dong_Cross-view_Topology_Based_Consistent_and_Complementary_Information_for_Deep_Multi-view_ICCV_2023_paper.pdf)
* [MHCN: A Hyperbolic Neural Network Model for Multi-view Hierarchical Clustering](https://openaccess.thecvf.com/content/ICCV2023/papers/Lin_MHCN_A_Hyperbolic_Neural_Network_Model_for_Multi-view_Hierarchical_Clustering_ICCV_2023_paper.pdf)
* [Stable Cluster Discrimination for Deep Clustering](https://openaccess.thecvf.com/content/ICCV2023/papers/Qian_Stable_Cluster_Discrimination_for_Deep_Clustering_ICCV_2023_paper.pdf)
* [Unsupervised Manifold Linearizing and Clustering](http://arxiv.org/abs/2301.01805)
* [Anchor Structure Regularization Induced Multi-view Subspace Clustering via Enhanced Tensor Rank Minimization](https://openaccess.thecvf.com/content/ICCV2023/papers/Ji_Anchor_Structure_Regularization_Induced_Multi-view_Subspace_Clustering_via_Enhanced_Tensor_ICCV_2023_paper.pdf)
* [Surface Normal Clustering for Implicit Representation of Manhattan Scenes](http://arxiv.org/abs/2212.01331)

## 67.Open Set Recognition
* [FedPD: Federated Open Set Recognition with Parameter Disentanglement](https://openaccess.thecvf.com/content/ICCV2023/papers/Yang_FedPD_Federated_Open_Set_Recognition_with_Parameter_Disentanglement_ICCV_2023_paper.pdf)

## 66.Graph Neural Networks
* [VertexSerum: Poisoning Graph Neural Networks for Link Inference](http://arxiv.org/abs/2308.01469)
* [Learning Adaptive Neighborhoods for Graph Neural Networks](http://arxiv.org/abs/2307.09065)
* [Vision HGNN: An Image is More than a Graph of Nodes](https://openaccess.thecvf.com/content/ICCV2023/papers/Han_Vision_HGNN_An_Image_is_More_than_a_Graph_of_ICCV_2023_paper.pdf)

## 65.Deepfake Detectors
* [Towards Understanding the Generalization of Deepfake Detectors from a Game-Theoretical View](https://openaccess.thecvf.com/content/ICCV2023/papers/Yao_Towards_Understanding_the_Generalization_of_Deepfake_Detectors_from_a_Game-Theoretical_ICCV_2023_paper.pdf)
* [Quality-Agnostic Deepfake Detection with Intra-model Collaborative Learning](http://arxiv.org/abs/2309.05911)
* [SeeABLE: Soft Discrepancies and Bounded Contrastive Learning for Exposing Deepfakes](http://arxiv.org/abs/2211.11296)
* [UCF: Uncovering Common Features for Generalizable Deepfake Detection](http://arxiv.org/abs/2304.13949)

## 64.Scene Understanding
* [Shape Anchor Guided Holistic Indoor Scene Understanding](http://arxiv.org/abs/2309.11133)
* [Efficient Computation Sharing for Multi-Task Visual Scene Understanding](http://arxiv.org/abs/2303.09663)
* [Ordered Atomic Activity for Fine-grained Interactive Traffic Scenario Understanding](https://openaccess.thecvf.com/content/ICCV2023/papers/Agarwal_Ordered_Atomic_Activity_for_Fine-grained_Interactive_Traffic_Scenario_Understanding_ICCV_2023_paper.pdf)
* [Clutter Detection and Removal in 3D Scenes with View-Consistent Inpainting](http://arxiv.org/abs/2304.03763)
* [Human-centric Scene Understanding for 3D Large-scale Scenarios](http://arxiv.org/abs/2307.14392)

## 63.Industrial Defect Detectors
* Industrial Defect Localization
* [Removing Anomalies as Noises for Industrial Defect Localization](https://openaccess.thecvf.com/content/ICCV2023/papers/Lu_Removing_Anomalies_as_Noises_for_Industrial_Defect_Localization_ICCV_2023_paper.pdf)
* Industrial Anomaly Detection
* [PNI : Industrial Anomaly Detection using Position and Neighborhood Information](http://arxiv.org/abs/2211.12634)
* [FastRecon: Few-shot Industrial Anomaly Detection via Fast Feature Reconstruction](https://openaccess.thecvf.com/content/ICCV2023/papers/Fang_FastRecon_Few-shot_Industrial_Anomaly_Detection_via_Fast_Feature_Reconstruction_ICCV_2023_paper.pdf)
* Crack Detection
* [The Devil is in the Crack Orientation: A New Perspective for Crack Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Chen_The_Devil_is_in_the_Crack_Orientation_A_New_Perspective_ICCV_2023_paper.pdf)

## 62.Group Affect Recognition
* [Most Important Person-Guided Dual-Branch Cross-Patch Attention for Group Affect Recognition](https://openaccess.thecvf.com/content/ICCV2023/papers/Xie_Most_Important_Person-Guided_Dual-Branch_Cross-Patch_Attention_for_Group_Affect_Recognition_ICCV_2023_paper.pdf)

## 61.Natural Language Processing (NLP)
* [Improved Visual Fine-tuning with Natural Language Supervision](http://arxiv.org/abs/2304.01489)
* [Tracking by Natural Language Specification with Long Short-term Context Decoupling](https://openaccess.thecvf.com/content/ICCV2023/papers/Ma_Tracking_by_Natural_Language_Specification_with_Long_Short-term_Context_Decoupling_ICCV_2023_paper.pdf)

## 60.Visual Localization
* [EP2P-Loc: End-to-End 3D Point to 2D Pixel Localization for Large-Scale Visual Localization](http://arxiv.org/abs/2309.07471v1)
* [Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance](http://arxiv.org/abs/2309.07403v1)
* [Yes, we CANN: Constrained Approximate Nearest Neighbors for Local Feature-Based Visual Localization](http://arxiv.org/abs/2306.09012)
* [OFVL-MS: Once for Visual Localization across Multiple Indoor Scenes](https://openaccess.thecvf.com/content/ICCV2023/papers/Xie_OFVL-MS_Once_for_Visual_Localization_across_Multiple_Indoor_Scenes_ICCV_2023_paper.pdf)

## 59.Copyright Protection (Information Security)
* [Towards Robust Model Watermark via Reducing Parametric Vulnerability](http://arxiv.org/abs/2309.04777v1)
:star:[code](https://github.com/GuanhaoGan/robust-model-watermarking)
* [CopyRNeRF: Protecting the CopyRight of Neural Radiance Fields](http://arxiv.org/abs/2307.11526v1)

## 58.Scene Flow Estimation
* [EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity](http://arxiv.org/abs/2309.01296v1)
* [Fast Neural Scene Flow](http://arxiv.org/abs/2304.09121)
* [Multi-Scale Bidirectional Recurrent Network with Hybrid Correlation for Point Cloud Based Scene Flow Estimation](https://openaccess.thecvf.com/content/ICCV2023/papers/Cheng_Multi-Scale_Bidirectional_Recurrent_Network_with_Hybrid_Correlation_for_Point_Cloud_ICCV_2023_paper.pdf)
* [IHNet: Iterative Hierarchical Network Guided by High-Resolution Estimated Information for Scene Flow Estimation](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_IHNet_Iterative_Hierarchical_Network_Guided_by_High-Resolution_Estimated_Information_for_ICCV_2023_paper.pdf)

## 57.Semantic Scene Completion
* [NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space](http://arxiv.org/abs/2309.14616v1)
:house:[project](https://jiawei-yao0812.github.io/NDC-Scene/)
:star:[code](https://github.com/Jiawei-Yao0812/NDCScene)
* [DDIT: Semantic Scene Completion via Deformable Deep Implicit Templates](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_DDIT_Semantic_Scene_Completion_via_Deformable_Deep_Implicit_Templates_ICCV_2023_paper.pdf)
* [CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion](http://arxiv.org/abs/2307.07938)
* [Learning Long-Range Information with Dual-Scale Transformers for Indoor Scene Completion](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_Learning_Long-Range_Information_with_Dual-Scale_Transformers_for_Indoor_Scene_Completion_ICCV_2023_paper.pdf)

## 56.NAS
* [ShiftNAS: Improving One-shot NAS via Probability Shift](http://arxiv.org/abs/2307.08300)
* [ROME: Robustifying Memory-Efficient NAS via Topology Disentanglement and Gradient Accumulation](http://arxiv.org/abs/2011.11233)
* [Extensible and Efficient Proxy for Neural Architecture Search](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_Extensible_and_Efficient_Proxy_for_Neural_Architecture_Search_ICCV_2023_paper.pdf)
* [MixPath: A Unified Approach for One-shot Neural Architecture Search](http://arxiv.org/abs/2001.05887)
* [Unleashing the Power of Gradient Signal-to-Noise Ratio for Zero-Shot NAS](https://openaccess.thecvf.com/content/ICCV2023/papers/Sun_Unleashing_the_Power_of_Gradient_Signal-to-Noise_Ratio_for_Zero-Shot_NAS_ICCV_2023_paper.pdf)

## 55.Sound (Speech)
* [Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video](http://arxiv.org/abs/2309.04814)
* [MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition](http://arxiv.org/abs/2303.05309)
* [Be Everywhere - Hear Everything (BEE): Audio Scene Reconstruction by Sparse Audio-Visual Samples](https://openaccess.thecvf.com/content/ICCV2023/papers/Chen_Be_Everywhere_-_Hear_Everything_BEE_Audio_Scene_Reconstruction_by_ICCV_2023_paper.pdf)
* [On the Audio-visual Synchronization for Lip-to-Speech Synthesis](http://arxiv.org/abs/2303.00502)
* [Boosting Positive Segments for Weakly-Supervised Audio-Visual Video Parsing](https://openaccess.thecvf.com/content/ICCV2023/papers/Rachavarapu_Boosting_Positive_Segments_for_Weakly-Supervised_Audio-Visual_Video_Parsing_ICCV_2023_paper.pdf)
* [DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding](http://arxiv.org/abs/2308.07787v1)
* [Sound Source Localization is All about Cross-Modal Alignment](http://arxiv.org/abs/2309.10724v1)
* [Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping](http://arxiv.org/abs/2308.06112)
* Dereverberation
* [AdVerb: Visually Guided Audio Dereverberation](http://arxiv.org/abs/2308.12370v1)
:house:[project](https://gamma.umd.edu/researchdirections/speech/adverb)
* Lip Reading
* [Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge](http://arxiv.org/abs/2308.09311v1)
* Audio-to-Video Generation
* [The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion](http://arxiv.org/abs/2309.04509v1)
:house:[project](https://ku-vai.github.io/TPoS/)
* Audio-Visual Navigation
* [Omnidirectional Information Gathering for Knowledge Transfer-Based Audio-Visual Navigation](http://arxiv.org/abs/2308.10306)
* Sound Localization
* [Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation](http://arxiv.org/abs/2303.11329)
* Audio-Visual Segmentation
* [Multimodal Variational Auto-encoder based Audio-Visual Segmentation](https://openaccess.thecvf.com/content/ICCV2023/papers/Mao_Multimodal_Variational_Auto-encoder_based_Audio-Visual_Segmentation_ICCV_2023_paper.pdf)

## 54.Gaze Estimation
* [DVGaze: Dual-View Gaze Estimation](http://arxiv.org/abs/2308.10310v1)
:star:[code](https://github.com/yihuacheng/DVGaze)

## 53.Computational Imaging (e.g., optical, geometric, and light-field imaging)
* [Event Camera Data Pre-training](http://arxiv.org/abs/2301.01928)
* [Deep Geometrized Cartoon Line Inbetweening](https://openaccess.thecvf.com/content/ICCV2023/papers/Siyao_Deep_Geometrized_Cartoon_Line_Inbetweening_ICCV_2023_paper.pdf)
* [Aperture Diffraction for Compact Snapshot Spectral Imaging](https://openaccess.thecvf.com/content/ICCV2023/papers/Lv_Aperture_Diffraction_for_Compact_Snapshot_Spectral_Imaging_ICCV_2023_paper.pdf)
* [Examining Autoexposure for Challenging Scenes](http://arxiv.org/abs/2309.04542v1)
* [Vanishing Point Estimation in Uncalibrated Images with Prior Gravity Direction](http://arxiv.org/abs/2308.10694v1)
:star:[code](https://github.com/cvg/VP-Estimation-with-Prior-Gravity)
* [Robust Frame-to-Frame Camera Rotation Estimation in Crowded Scenes](http://arxiv.org/abs/2309.08588v1)
:house:[project](https://fabiendelattre.com/robust-rotation-estimation)
* [Exploring Positional Characteristics of Dual-Pixel Data for Camera Autofocus](https://openaccess.thecvf.com/content/ICCV2023/papers/Choi_Exploring_Positional_Characteristics_of_Dual-Pixel_Data_for_Camera_Autofocus_ICCV_2023_paper.pdf)
* [Enhancing Non-line-of-sight Imaging via Learnable Inverse Kernel and Attention Mechanisms](https://openaccess.thecvf.com/content/ICCV2023/papers/Yu_Enhancing_Non-line-of-sight_Imaging_via_Learnable_Inverse_Kernel_and_Attention_Mechanisms_ICCV_2023_paper.pdf)
* [On the Robustness of Normalizing Flows for Inverse Problems in Imaging](http://arxiv.org/abs/2212.04319)
* [Factorized Inverse Path Tracing for Efficient and Accurate Material-Lighting Estimation](http://arxiv.org/abs/2304.05669)
* Light Field
* [NeILF++: Inter-Reflectable Light Fields for Geometry and Material Estimation](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_NeILF_Inter-Reflectable_Light_Fields_for_Geometry_and_Material_Estimation_ICCV_2023_paper.pdf)
* Camera Pose Estimation
* [Multi-body Depth and Camera Pose Estimation from Multiple Views](https://openaccess.thecvf.com/content/ICCV2023/papers/Dal_Cin_Multi-body_Depth_and_Camera_Pose_Estimation_from_Multiple_Views_ICCV_2023_paper.pdf)

## 52.View Synthesis
* [Forward Flow for Novel View Synthesis of Dynamic Scenes](https://openaccess.thecvf.com/content/ICCV2023/papers/Guo_Forward_Flow_for_Novel_View_Synthesis_of_Dynamic_Scenes_ICCV_2023_paper.pdf)
* [iVS-Net: Learning Human View Synthesis from Internet Videos](https://openaccess.thecvf.com/content/ICCV2023/papers/Dong_iVS-Net_Learning_Human_View_Synthesis_from_Internet_Videos_ICCV_2023_paper.pdf)
* [Multi-task View Synthesis with Neural Radiance Fields](https://openaccess.thecvf.com/content/ICCV2023/papers/Zheng_Multi-task_View_Synthesis_with_Neural_Radiance_Fields_ICCV_2023_paper.pdf)
* [IntrinsicNeRF: Learning Intrinsic Neural Radiance Fields for Editable Novel View Synthesis](http://arxiv.org/abs/2210.00647)
* [Generative Novel View Synthesis with 3D-Aware Diffusion Models](http://arxiv.org/abs/2304.02602)
* [SparseNeRF: Distilling Depth Ranking for Few-shot Novel View Synthesis](http://arxiv.org/abs/2303.16196)
* [Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis](https://openaccess.thecvf.com/content/ICCV2023/papers/Song_Total-Recon_Deformable_Scene_Reconstruction_for_Embodied_View_Synthesis_ICCV_2023_paper.pdf)
* [Neural LiDAR Fields for Novel View Synthesis](https://openaccess.thecvf.com/content/ICCV2023/papers/Huang_Neural_LiDAR_Fields_for_Novel_View_Synthesis_ICCV_2023_paper.pdf)
* [LoLep: Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion Inference](http://arxiv.org/abs/2307.12217v1)
* [Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis](http://arxiv.org/abs/2308.02840v1)
:house:[project](https://w-ted.github.io/publications/udc-nerf)
* [Efficient View Synthesis with Neural Radiance Distribution Field](http://arxiv.org/abs/2308.11130v1)
* [NeO 360: Neural Fields for Sparse View Synthesis of Outdoor Scenes](http://arxiv.org/abs/2308.12967v1)
:house:[project](https://zubair-irshad.github.io/projects/neo360.html)
* [PARF: Primitive-Aware Radiance Fusion for Indoor Scene Novel View Synthesis](https://openaccess.thecvf.com/content/ICCV2023/papers/Ying_PARF_Primitive-Aware_Radiance_Fusion_for_Indoor_Scene_Novel_View_Synthesis_ICCV_2023_paper.pdf)
* [Urban Radiance Field Representation with Deformable Neural Mesh Primitives](http://arxiv.org/abs/2307.10776v1)
:house:[project](https://dnmp.github.io/)
* [FlipNeRF: Flipped Reflection Rays for Few-shot Novel View Synthesis](http://arxiv.org/abs/2306.17723)
* [SAMPLING: Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image](http://arxiv.org/abs/2309.06323)
* [A Large-Scale Outdoor Multi-Modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction](http://arxiv.org/abs/2301.06782)
* [Cross-Ray Neural Radiance Fields for Novel-View Synthesis from Unconstrained Image Collections](http://arxiv.org/abs/2307.08093)
* [Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models](http://arxiv.org/abs/2304.10700)
* [NEMTO: Neural Environment Matting for Novel View and Relighting Synthesis of Transparent Objects](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_NEMTO_Neural_Environment_Matting_for_Novel_View_and_Relighting_Synthesis_ICCV_2023_paper.pdf)

## 51.Visual Place Recognition
* [CASSPR: Cross Attention Single Scan Place Recognition](http://arxiv.org/abs/2211.12542)
* [EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition](http://arxiv.org/abs/2308.10832v1)
:star:[code](https://github.com/gmberton/auto_VPR)
:star:[code](https://github.com/gmberton/EigenPlaces)
* [CrossLoc3D: Aerial-Ground Cross-Source 3D Place Recognition](http://arxiv.org/abs/2303.17778)
* [BEVPlace: Learning LiDAR-based Place Recognition using Bird's Eye View Images](https://openaccess.thecvf.com/content/ICCV2023/papers/Luo_BEVPlace_Learning_LiDAR-based_Place_Recognition_using_Birds_Eye_View_Images_ICCV_2023_paper.pdf)

## 50.Image Matching
* [Occ2Net: Robust Image Matching Based on 3D Occupancy Estimation for Occluded Regions](https://openaccess.thecvf.com/content/ICCV2023/papers/Fan_Occ2Net_Robust_Image_Matching_Based_on_3D_Occupancy_Estimation_for_ICCV_2023_paper.pdf)
:thumbsup:[ICCV 2023 | Occ2Net: an effective and robust image matching method for occluded regions, based on 3D occupancy estimation](https://mp.weixin.qq.com/s/mbk5tnYlzCLOg4_xfyKRyA)
* [Scene-Aware Feature Matching](http://arxiv.org/abs/2308.09949v1)
* [Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints](http://arxiv.org/abs/2303.02885)
* [GlueStick: Robust Image Matching by Sticking Points and Lines Together](https://openaccess.thecvf.com/content/ICCV2023/papers/Pautrat_GlueStick_Robust_Image_Matching_by_Sticking_Points_and_Lines_Together_ICCV_2023_paper.pdf)
* [Grounded Image Text Matching with Mismatched Relation Reasoning](http://arxiv.org/abs/2308.01236)
* [Graph Matching with Bi-level Noisy Correspondence](http://arxiv.org/abs/2212.04085)
* Feature Matching
* [Guiding Local Feature Matching with Surface Curvature](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_Guiding_Local_Feature_Matching_with_Surface_Curvature_ICCV_2023_paper.pdf)

## 49.Image Fusion
* [DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion](http://arxiv.org/abs/2303.06840)
* [MEFLUT: Unsupervised 1D Lookup Tables for Multi-exposure Image Fusion](http://arxiv.org/abs/2309.11847)
* [Learned Image Reasoning Prior Penetrates Deep Unfolding Network for Panchromatic and Multi-Spectral Image Fusion](http://arxiv.org/abs/2308.16083v1)
:star:[code](https://manman1995.github.io/)
* [Degradation-Resistant Unfolding Network for Heterogeneous Image Fusion](https://openaccess.thecvf.com/content/ICCV2023/papers/He_Degradation-Resistant_Unfolding_Network_for_Heterogeneous_Image_Fusion_ICCV_2023_paper.pdf)
* [UniFusion: Unified Multi-View Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View](https://openaccess.thecvf.com/content/ICCV2023/papers/Qin_UniFusion_Unified_Multi-View_Fusion_Transformer_for_Spatial-Temporal_Representation_in_Birds-Eye-View_ICCV_2023_paper.pdf)
* [Multi-Modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion](http://arxiv.org/abs/2302.01392)

## 48.Image Reconstruction
* [Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction](http://arxiv.org/abs/2308.10820v1)
:star:[code](https://github.com/MyuLi/PADUT)
* [RawHDR: High Dynamic Range Image Reconstruction from a Single Raw Image](http://arxiv.org/abs/2309.02020v1)
* [Learning Continuous Exposure Value Representations for Single-Image HDR Reconstruction](http://arxiv.org/abs/2309.03900v1)
:house:[project](https://skchen1993.github.io/CEVR_web/)

## 47.Image-to-Image Translation
* [General Image-to-Image Translation with One-Shot Image Guidance](http://arxiv.org/abs/2307.14352v1)
:star:[code](https://github.com/CrystalNeuro/visual-concept-translator)
* [Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation](http://arxiv.org/abs/2308.12968v1)
:star:[code](https://github.com/Yuxinn-J/Scenimefy)
:house:[project](https://yuxinn-j.github.io/projects/Scenimefy.html)
* [UGC: Unified GAN Compression for Efficient Image-to-Image Translation](http://arxiv.org/abs/2309.09310)

## 46.Edge Detection
* [Tiny and Efficient Model for the Edge Detection Generalization](http://arxiv.org/abs/2308.06468v1)
:star:[code](https://github.com/xavysp/TEED)

## 45.Scene Graph Generation
* [SGAligner: 3D Scene Alignment with Scene Graphs](http://arxiv.org/abs/2304.14880)
* [Environment-Invariant Curriculum Relation Learning for Fine-Grained Scene Graph Generation](http://arxiv.org/abs/2308.03282v1)
* [Compositional Feature Augmentation for Unbiased Scene Graph Generation](http://arxiv.org/abs/2308.06712v1)
* [Vision Relation Transformer for Unbiased Scene Graph Generation](http://arxiv.org/abs/2308.09472v1)
* [TextPSG: Panoptic Scene Graph Generation from Textual Descriptions](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhao_TextPSG_Panoptic_Scene_Graph_Generation_from_Textual_Descriptions_ICCV_2023_paper.pdf)
* [HiLo: Exploiting High Low Frequency Relations for Unbiased Panoptic Scene Graph Generation](http://arxiv.org/abs/2303.15994)
* [Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World](http://arxiv.org/abs/2303.13233)
* [Visual Traffic Knowledge Graph Generation from Scene Images](https://openaccess.thecvf.com/content/ICCV2023/papers/Guo_Visual_Traffic_Knowledge_Graph_Generation_from_Scene_Images_ICCV_2023_paper.pdf)

## 44.Rendering
* [LiveHand: Real-time and Photorealistic Neural Hand Rendering](http://arxiv.org/abs/2302.07672)
* [NeMF: Inverse Volume Rendering with Neural Microflake Field](http://arxiv.org/abs/2304.00782)
* [ActorsNeRF: Animatable Few-shot Human Rendering with Generalizable NeRFs](http://arxiv.org/abs/2304.14401)
* [HollowNeRF: Pruning Hashgrid-Based NeRFs with Trainable Collision Mitigation](http://arxiv.org/abs/2308.10122v1)
* [DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering](https://arxiv.org/abs/2307.10173)
:house:[project](https://dna-rendering.github.io/)
* [S3IM: Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural Fields](http://arxiv.org/abs/2308.07032v1)
:star:[code](https://github.com/Madaoer/S3IM)
:thumbsup:[ICCV 2023 | A "Magic Loss" that boosts NeRF performance: S3IM, Stochastic Structural Similarity](https://mp.weixin.qq.com/s/w5IUykx6_-7lBE_2r_Onkg)
* [TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering](http://arxiv.org/abs/2307.12291v1)
:house:[project](https://pansanity666.github.io/TransHuman/)
* [Tri-MipRF: Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields](http://arxiv.org/abs/2307.11335v1)
:house:[project](https://wbhu.github.io/projects/Tri-MipRF)
* [Rendering Humans from Object-Occluded Monocular Videos](http://arxiv.org/abs/2308.04622v1)
:house:[project](https://cs.stanford.edu/~xtiange/projects/occnerf/)
* [ScatterNeRF: Seeing Through Fog with Physically-Based Inverse Neural Rendering](http://arxiv.org/abs/2305.02103)
* [Neural Microfacet Fields for Inverse Rendering](http://arxiv.org/abs/2303.17806)
* [3D-aware Blending with Generative NeRFs](http://arxiv.org/abs/2302.06608)

## 43.Neural Radiance Fields
* [Instance Neural Radiance Field](http://arxiv.org/abs/2304.04395)
* [Adaptive Positional Encoding for Bundle-Adjusting Neural Radiance Fields](https://openaccess.thecvf.com/content/ICCV2023/papers/Gao_Adaptive_Positional_Encoding_for_Bundle-Adjusting_Neural_Radiance_Fields_ICCV_2023_paper.pdf)
* [FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models](http://arxiv.org/abs/2303.12786)
* [NerfAcc: Efficient Sampling Accelerates NeRFs](http://arxiv.org/abs/2305.04966)
* [MIMO-NeRF: Fast Neural Rendering with Multi-input Multi-output Neural Radiance Fields](https://openaccess.thecvf.com/content/ICCV2023/papers/Kaneko_MIMO-NeRF_Fast_Neural_Rendering_with_Multi-input_Multi-output_Neural_Radiance_Fields_ICCV_2023_paper.pdf)
* [UHDNeRF: Ultra-High-Definition Neural Radiance Fields](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_UHDNeRF_Ultra-High-Definition_Neural_Radiance_Fields_ICCV_2023_paper.pdf)
* [Deformable Neural Radiance Fields using RGB and Event Cameras](http://arxiv.org/abs/2309.08416)
* [Learning Neural Implicit Surfaces with Object-Aware Radiance Fields](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_Learning_Neural_Implicit_Surfaces_with_Object-Aware_Radiance_Fields_ICCV_2023_paper.pdf)
* [ClimateNeRF: Extreme Weather Synthesis in Neural Radiance Field](http://arxiv.org/abs/2211.13226)
* [HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video](http://arxiv.org/abs/2304.12281)
* [ReNeRF: Relightable Neural Radiance Fields with Nearfield Lighting](https://openaccess.thecvf.com/content/ICCV2023/papers/Xu_ReNeRF_Relightable_Neural_Radiance_Fields_with_Nearfield_Lighting_ICCV_2023_paper.pdf)
* [E2NeRF: Event Enhanced Neural Radiance Fields from Blurry Images](https://openaccess.thecvf.com/content/ICCV2023/papers/Qi_E2NeRF_Event_Enhanced_Neural_Radiance_Fields_from_Blurry_Images_ICCV_2023_paper.pdf)
* [Neural Fields for Structured Lighting](https://openaccess.thecvf.com/content/ICCV2023/papers/Shandilya_Neural_Fields_for_Structured_Lighting_ICCV_2023_paper.pdf)
* [NeRF-MS: Neural Radiance Fields with Multi-Sequence](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_NeRF-MS_Neural_Radiance_Fields_with_Multi-Sequence_ICCV_2023_paper.pdf)
* [StegaNeRF: Embedding Invisible Information within Neural Radiance Fields](http://arxiv.org/abs/2212.01602)
* [SHERF: Generalizable Human NeRF from a Single Image](http://arxiv.org/abs/2303.12791)
* [Nerfbusters: Removing Ghostly Artifacts from Casually Captured NeRFs](http://arxiv.org/abs/2304.10532)
* [Tetra-NeRF: Representing Neural Radiance Fields Using Tetrahedra](https://openaccess.thecvf.com/content/ICCV2023/papers/Kulhanek_Tetra-NeRF_Representing_Neural_Radiance_Fields_Using_Tetrahedra_ICCV_2023_paper.pdf)
* [Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields](https://openaccess.thecvf.com/content/ICCV2023/papers/Barron_Zip-NeRF_Anti-Aliased_Grid-Based_Neural_Radiance_Fields_ICCV_2023_paper.pdf)
* [NeRFrac: Neural Radiance Fields through Refractive Surface](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhan_NeRFrac_Neural_Radiance_Fields_through_Refractive_Surface_ICCV_2023_paper.pdf)
* [MonoNeRF: Learning a Generalizable Dynamic Radiance Field from Monocular Videos](http://arxiv.org/abs/2212.13056)
* [Reference-guided Controllable Inpainting of Neural Radiance Fields](http://arxiv.org/abs/2304.09677)
* [DeLiRa: Self-Supervised Depth, Light, and Radiance Fields](http://arxiv.org/abs/2304.02797)
* [Neural Radiance Field with LiDAR maps](https://openaccess.thecvf.com/content/ICCV2023/papers/Chang_Neural_Radiance_Field_with_LiDAR_maps_ICCV_2023_paper.pdf)
* [Dynamic Mesh-Aware Radiance Fields](http://arxiv.org/abs/2309.04581v1)
* [Locally Stylized Neural Radiance Fields](http://arxiv.org/abs/2309.10684v1)
* [Generalizable Neural Fields as Partially Observed Neural Processes](http://arxiv.org/abs/2309.06660v1)
* [DeformToon3D: Deformable 3D Toonification from Neural Radiance Fields](http://arxiv.org/abs/2309.04410v1)
:house:[project](https://www.mmlab-ntu.com/project/deformtoon3d/)
:star:[code](https://github.com/junzhezhang/DeformToon3D)
* [Robust e-NeRF: NeRF from Sparse & Noisy Events under Non-Uniform Motion](http://arxiv.org/abs/2309.08596v1)
:star:[code](https://wengflow.github.io/robust-e-nerf)
* [Pose-Free Neural Radiance Fields via Implicit Pose Regularization](http://arxiv.org/abs/2308.15049v1)
* [Canonical Factors for Hybrid Neural Fields](http://arxiv.org/abs/2308.15461v1)
:star:[code](https://brentyi.github.io/tilted/)
* [Multi-Modal Neural Radiance Field for Monocular Dense SLAM with a Light-Weight ToF Sensor](http://arxiv.org/abs/2308.14383v1)
:star:[code](https://zju3dv.github.io/tof_slam/)
* [Blending-NeRF: Text-Driven Localized Editing in Neural Radiance Fields](http://arxiv.org/abs/2308.11974v1)
* [Strata-NeRF: Neural Radiance Fields for Stratified Scenes](http://arxiv.org/abs/2308.10337v1)
:star:[code](https://ankitatiisc.github.io/Strata-NeRF/)
* [DReg-NeRF: Deep Registration for Neural Radiance Fields](http://arxiv.org/abs/2308.09386v1)
:star:[code](https://github.com/AIBluefisher/DReg-NeRF)
* [Seal-3D: Interactive Pixel-Level Editing for Neural Radiance Fields](http://arxiv.org/abs/2307.15131v1)
:house:[project](https://windingwind.github.io/seal-3d/)
:star:[code](https://github.com/windingwind/seal-3d/)
* [WaveNeRF: Wavelet-based Generalizable Neural Radiance Fields](http://arxiv.org/abs/2308.04826v1)
* [UrbanGIRAFFE: Representing Urban Scenes as Compositional Generative Neural Feature Fields](http://arxiv.org/abs/2303.14167)
* [LERF: Language Embedded Radiance Fields](http://arxiv.org/abs/2303.09553)
* [Strivec: Sparse Tri-Vector Radiance Fields](http://arxiv.org/abs/2307.13226)
* [Multiscale Representation for Real-Time Anti-Aliasing Neural Rendering](http://arxiv.org/abs/2304.10075)

## 42.Dataset/Benchmark
* Datasets
* [Building3D: An Urban-Scale Dataset and Benchmarks for Learning Roof Structures from Point Clouds](https://arxiv.org/pdf/2307.11914.pdf)
:sunflower:[dataset](https://building3d.ucalgary.ca/#)
:thumbsup:[ICCV 2023: Building3D, the first urban-scale building modeling dataset based on aerial point clouds](https://mp.weixin.qq.com/s/gKFByZ8ud2aNlG7C7t2-2Q)
* [LoTE-Animal: A Long Time-span Dataset for Endangered Animal Behavior Understanding](https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_LoTE-Animal_A_Long_Time-span_Dataset_for_Endangered_Animal_Behavior_Understanding_ICCV_2023_paper.pdf)
* [Beyond the Pixel: a Photometrically Calibrated HDR Dataset for Luminance and Color Prediction](https://openaccess.thecvf.com/content/ICCV2023/papers/Bolduc_Beyond_the_Pixel_a_Photometrically_Calibrated_HDR_Dataset_for_Luminance_ICCV_2023_paper.pdf)
* [Atmospheric Transmission and Thermal Inertia Induced Blind Road Segmentation with a Large-Scale Dataset TBRSD](https://openaccess.thecvf.com/content/ICCV2023/papers/Chen_Atmospheric_Transmission_and_Thermal_Inertia_Induced_Blind_Road_Segmentation_with_ICCV_2023_paper.pdf)
* [H3WB: Human3.6M 3D WholeBody Dataset and Benchmark](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhu_H3WB_Human3.6M_3D_WholeBody_Dataset_and_Benchmark_ICCV_2023_paper.pdf)
* [V3Det: Vast Vocabulary Visual Detection Dataset](http://arxiv.org/abs/2304.03752)
* [HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_HoloAssist_an_Egocentric_Human_Interaction_Dataset_for_Interactive_AI_Assistants_ICCV_2023_paper.pdf)
* [Zenseact Open Dataset: A Large-Scale and Diverse Multimodal Dataset for Autonomous Driving](http://arxiv.org/abs/2305.02008)
* [FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods](http://arxiv.org/abs/2308.06248)
* [Lecture Presentations Multimodal Dataset: Towards Understanding Multimodality in Educational Videos](https://openaccess.thecvf.com/content/ICCV2023/papers/Lee_Lecture_Presentations_Multimodal_Dataset_Towards_Understanding_Multimodality_in_Educational_Videos_ICCV_2023_paper.pdf)
* [RealGraph: A Multiview Dataset for 4D Real-world Context Graph Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Lin_RealGraph_A_Multiview_Dataset_for_4D_Real-world_Context_Graph_Generation_ICCV_2023_paper.pdf)
* [Video Background Music Generation: Dataset, Method and Evaluation](http://arxiv.org/abs/2211.11248)
* [Thinking Image Color Aesthetics Assessment: Models, Datasets and Benchmarks](https://openaccess.thecvf.com/content/ICCV2023/papers/He_Thinking_Image_Color_Aesthetics_Assessment_Models_Datasets_and_Benchmarks_ICCV_2023_paper.pdf)
* [Snow Removal in Video: A New Dataset and A Novel Method](https://openaccess.thecvf.com/content/ICCV2023/papers/Chen_Snow_Removal_in_Video_A_New_Dataset_and_A_Novel_ICCV_2023_paper.pdf)
* [SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes](http://arxiv.org/abs/2304.05170)
* [EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes](http://arxiv.org/abs/2307.07961)
* [DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners](http://arxiv.org/abs/2309.03483)
* [PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking](http://arxiv.org/abs/2307.15055)
* [SynBody: Synthetic Dataset with Layered Human Models for 3D Human Perception and Modeling](http://arxiv.org/abs/2303.17368)
* [MOSE: A New Dataset for Video Object Segmentation in Complex Scenes](http://arxiv.org/abs/2302.01872)
* [Audio-Visual Deception Detection: DOLOS Dataset and Parameter-Efficient Crossmodal Learning](http://arxiv.org/abs/2303.12745)
* [3DMiner: Discovering Shapes from Large-Scale Unannotated Image Datasets](https://openaccess.thecvf.com/content/ICCV2023/papers/Cheng_3DMiner_Discovering_Shapes_from_Large-Scale_Unannotated_Image_Datasets_ICCV_2023_paper.pdf)
* [MatrixCity: A Large-scale City Dataset for City-scale Neural Rendering and Beyond](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_MatrixCity_A_Large-scale_City_Dataset_for_City-scale_Neural_Rendering_and_ICCV_2023_paper.pdf)
* [LaRS: A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark](http://arxiv.org/abs/2308.09618v1)
:star:[code](https://lojzezust.github.io/lars-dataset)
* [EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding](http://arxiv.org/abs/2309.08816v1)
:star:[code](https://github.com/facebookresearch/EgoObjects)
* [Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations](http://arxiv.org/abs/2309.01858v1)
:house:[project](https://cmp.felk.cvut.cz/univ_emb/)
* [High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset and A Frequency-Aware Shadow Erasing Net](http://arxiv.org/abs/2308.14221)
* [ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes](http://arxiv.org/abs/2308.11417v1)
:house:[project](https://youtu.be/E6P9e2r6M8I)
:star:[code](https://cy94.github.io/scannetpp/)
* [Learning Optical Flow from Event Camera with Rendered Dataset](https://arxiv.org/abs/2303.11011)
* [Efficient neural supersampling on a novel gaming dataset](http://arxiv.org/abs/2308.01483v1)
* [AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception](http://arxiv.org/abs/2307.13933v1)
:star:[code](https://github.com/ydk122024/AIDE)
* [360VOT: A New Benchmark Dataset for Omnidirectional Visual Object Tracking](http://arxiv.org/abs/2307.14630v1)
:house:[project](https://360vot.hkustvgd.com)
:star:[code](https://github.com/HuajianUP/360VOT) Omnidirectional visual object tracking
* [Harvard Glaucoma Detection and Progression: A Multimodal Multitask Dataset and Generalization-Reinforced Semi-Supervised Learning](http://arxiv.org/abs/2308.13411v1)
:house:[project](https://ophai.hms.harvard.edu/datasets/harvard-gdp1000)
* [FishNet: A Large-scale Dataset and Benchmark for Fish Recognition, Detection, and Functional Trait Prediction](https://openaccess.thecvf.com/content/ICCV2023/papers/Khan_FishNet_A_Large-scale_Dataset_and_Benchmark_for_Fish_Recognition_Detection_ICCV_2023_paper.pdf)
* Benchmarks
* [Towards Real-world Burst Image Super-Resolution: Benchmark and Method](https://arxiv.org/abs/2309.04803)
:star:[code](https://github.com/yjsunnn/FBANet)
:thumbsup:[ICCV 2023 | FBANet: Towards Real-World Multi-Frame Super-Resolution](https://mp.weixin.qq.com/s/JN-5d_Ujak3YDcGM6ZZjPw)
* [SQAD: Automatic Smartphone Camera Quality Assessment and Benchmarking](https://openaccess.thecvf.com/content/ICCV2023/papers/Fang_SQAD_Automatic_Smartphone_Camera_Quality_Assessment_and_Benchmarking_ICCV_2023_paper.pdf)
* [ARNOLD: A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic 3D Scenes](http://arxiv.org/abs/2304.04321)
* [Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception](http://arxiv.org/abs/2306.06362)
* [From Sky to the Ground: A Large-scale Benchmark and Simple Baseline Towards Real Rain Removal](http://arxiv.org/abs/2308.03867v1)
* [ChildPlay: A New Benchmark for Understanding Children's Gaze Behaviour](https://openaccess.thecvf.com/content/ICCV2023/papers/Tafasca_ChildPlay_A_New_Benchmark_for_Understanding_Childrens_Gaze_Behaviour_ICCV_2023_paper.pdf)
* [PlanarTrack: A Large-scale Challenging Benchmark for Planar Object Tracking](http://arxiv.org/abs/2303.07625)
* [OmniLabel: A Challenging Benchmark for Language-Based Object Detection](http://arxiv.org/abs/2304.11463)
* [OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception](http://arxiv.org/abs/2303.03991)
* [HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models](https://openaccess.thecvf.com/content/ICCV2023/papers/Bakr_HRS-Bench_Holistic_Reliable_and_Scalable_Benchmark_for_Text-to-Image_Models_ICCV_2023_paper.pdf)
* [Beyond Object Recognition: A New Benchmark towards Object Concept Learning](http://arxiv.org/abs/2212.02710)
* [ClothPose: A Real-world Benchmark for Visual Analysis of Garment Pose via An Indirect Recording Solution](https://openaccess.thecvf.com/content/ICCV2023/papers/Xu_ClothPose_A_Real-world_Benchmark_for_Visual_Analysis_of_Garment_Pose_ICCV_2023_paper.pdf)
* [REAP: A Large-Scale Realistic Adversarial Patch Benchmark](http://arxiv.org/abs/2212.05680)
* [Chaotic World: A Large and Challenging Benchmark for Human Behavior Understanding in Chaotic Events](https://openaccess.thecvf.com/content/ICCV2023/papers/Ong_Chaotic_World_A_Large_and_Challenging_Benchmark_for_Human_Behavior_ICCV_2023_paper.pdf)
* [Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images](http://arxiv.org/abs/2303.07274)
* [FACET: Fairness in Computer Vision Evaluation Benchmark](http://arxiv.org/abs/2309.00035)
* [A Benchmark for Chinese-English Scene Text Image Super-Resolution](http://arxiv.org/abs/2308.03262)
* [Ego-Humans: An Ego-Centric 3D Multi-Human Benchmark](https://openaccess.thecvf.com/content/ICCV2023/papers/Khirodkar_Ego-Humans_An_Ego-Centric_3D_Multi-Human_Benchmark_ICCV_2023_paper.pdf)
* [COCO-O: A Benchmark for Object Detectors under Natural Distribution Shifts](http://arxiv.org/abs/2307.12730v1)
:star:[code](https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o)
* [Dancing in the Dark: A Benchmark towards General Low-light Video Enhancement](https://openaccess.thecvf.com/content/ICCV2023/papers/Fu_Dancing_in_the_Dark_A_Benchmark_towards_General_Low-light_Video_ICCV_2023_paper.pdf)
* [DiLiGenT-Pi: Photometric Stereo for Planar Surfaces with Rich Details - Benchmark Dataset and Beyond](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_DiLiGenT-Pi_Photometric_Stereo_for_Planar_Surfaces_with_Rich_Details_-_ICCV_2023_paper.pdf)
* Methods
* [Prototype-based Dataset Comparison](http://arxiv.org/abs/2309.02401v1)
:star:[code](https://github.com/Nanne/ProtoSim)

## 41.Vision Transformers
* [Masked Spiking Transformer](http://arxiv.org/abs/2210.01208)
* [Scale-Aware Modulation Meet Transformer](http://arxiv.org/abs/2307.08579)
* [BiViT: Extremely Compressed Binary Vision Transformers](http://arxiv.org/abs/2211.07091)
* [Fcaformer: Forward Cross Attention in Hybrid Vision Transformer](http://arxiv.org/abs/2211.07198)
* [FastViT: A Fast Hybrid Vision Transformer Using Structural Reparameterization](http://arxiv.org/abs/2303.14189)
* [Multimodal High-order Relation Transformer for Scene Boundary Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Wei_Multimodal_High-order_Relation_Transformer_for_Scene_Boundary_Detection_ICCV_2023_paper.pdf)
* [GET: Group Event Transformer for Event-Based Vision](https://openaccess.thecvf.com/content/ICCV2023/papers/Peng_GET_Group_Event_Transformer_for_Event-Based_Vision_ICCV_2023_paper.pdf)
* [DiffRate: Differentiable Compression Rate for Efficient Vision Transformers](http://arxiv.org/abs/2305.17997)
* [Scratching Visual Transformer's Back with Uniform Attention](https://openaccess.thecvf.com/content/ICCV2023/papers/Hyeon-Woo_Scratching_Visual_Transformers_Back_with_Uniform_Attention_ICCV_2023_paper.pdf)
* [Skill Transformer: A Monolithic Policy for Mobile Manipulation](http://arxiv.org/abs/2308.09873)
* [A Multidimensional Analysis of Social Biases in Vision Transformers](http://arxiv.org/abs/2308.01948)
* [Token-Label Alignment for Vision Transformers](http://arxiv.org/abs/2210.06455)
* [Building Vision Transformers with Hierarchy Aware Feature Aggregation](https://openaccess.thecvf.com/content/ICCV2023/papers/Chen_Building_Vision_Transformers_with_Hierarchy_Aware_Feature_Aggregation_ICCV_2023_paper.pdf)
* [TripLe: Revisiting Pretrained Model Reuse and Progressive Learning for Efficient Vision Transformer Scaling and Searching](https://openaccess.thecvf.com/content/ICCV2023/papers/Fu_TripLe_Revisiting_Pretrained_Model_Reuse_and_Progressive_Learning_for_Efficient_ICCV_2023_paper.pdf)
* [DarSwin: Distortion Aware Radial Swin Transformer](https://openaccess.thecvf.com/content/ICCV2023/papers/Athwale_DarSwin_Distortion_Aware_Radial_Swin_Transformer_ICCV_2023_paper.pdf)
* [Robustifying Token Attention for Vision Transformers](http://arxiv.org/abs/2303.11126)
* [FLatten Transformer: Vision Transformer using Focused Linear Attention](http://arxiv.org/abs/2308.00442)
* [Detection Transformer with Stable Matching](http://arxiv.org/abs/2304.04742)
* [LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization](https://openaccess.thecvf.com/content/ICCV2023/papers/Yu_LaPE_Layer-adaptive_Position_Embedding_for_Vision_Transformers_with_Independent_Layer_ICCV_2023_paper.pdf)
* [M2T: Masking Transformers Twice for Faster Decoding](https://openaccess.thecvf.com/content/ICCV2023/papers/Mentzer_M2T_Masking_Transformers_Twice_for_Faster_Decoding_ICCV_2023_paper.pdf)
* [FDViT: Improve the Hierarchical Architecture of Vision Transformer](https://openaccess.thecvf.com/content/ICCV2023/papers/Xu_FDViT_Improve_the_Hierarchical_Architecture_of_Vision_Transformer_ICCV_2023_paper.pdf)
* [Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers](http://arxiv.org/abs/2308.10814)
* [Rethinking Vision Transformers for MobileNet Size and Speed](http://arxiv.org/abs/2212.08059)
* [Structure Invariant Transformation for better Adversarial Transferability](http://arxiv.org/abs/2309.14700v1)
:star:[code](https://github.com/xiaosen-wang/SIT)
* [SG-Former: Self-guided Transformer with Evolving Token Reallocation](http://arxiv.org/abs/2308.12216v1)
:star:[code](https://github.com/OliverRensu/SG-Former)
* [Pre-training Vision Transformers with Very Limited Synthesized Images](http://arxiv.org/abs/2307.14710v1)
* [SMMix: Self-Motivated Image Mixing for Vision Transformers](https://arxiv.org/abs/2212.12977)
* [Revisiting Vision Transformer from the View of Path Ensemble](http://arxiv.org/abs/2308.06548v1)
* [SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM](http://arxiv.org/abs/2308.09891v1)
:star:[code](https://github.com/SongTang-x/SwinLSTM)
* [Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers](http://arxiv.org/abs/2308.13494v1)
* [Contrastive Feature Masking Open-Vocabulary Vision Transformer](http://arxiv.org/abs/2309.00775v1)
* [MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer](http://arxiv.org/abs/2309.09067v1)
* [SAL-ViT: Towards Latency Efficient Private Inference on ViT using Selective Attention Search with a Learnable Softmax Approximation](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_SAL-ViT_Towards_Latency_Efficient_Private_Inference_on_ViT_using_Selective_ICCV_2023_paper.pdf)
* [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](http://arxiv.org/abs/2303.15446)
* [MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention](http://arxiv.org/abs/2211.13955)

## 40.Anomaly Detection
* [Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection](http://arxiv.org/abs/2308.10155v1)
* [Anomaly Detection Under Distribution Shift](http://arxiv.org/abs/2303.13845)
* [Unsupervised Surface Anomaly Detection with Diffusion Probabilistic Model](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_Unsupervised_Surface_Anomaly_Detection_with_Diffusion_Probabilistic_Model_ICCV_2023_paper.pdf)
* [Anomaly Detection using Score-based Perturbation Resilience](https://openaccess.thecvf.com/content/ICCV2023/papers/Shin_Anomaly_Detection_using_Score-based_Perturbation_Resilience_ICCV_2023_paper.pdf)
* [Remembering Normality: Memory-guided Knowledge Distillation for Unsupervised Anomaly Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Gu_Remembering_Normality_Memory-guided_Knowledge_Distillation_for_Unsupervised_Anomaly_Detection_ICCV_2023_paper.pdf)
* [Template-guided Hierarchical Feature Restoration for Anomaly Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Guo_Template-guided_Hierarchical_Feature_Restoration_for_Anomaly_Detection_ICCV_2023_paper.pdf)
* [Inter-Realization Channels: Unsupervised Anomaly Detection Beyond One-Class Classification](https://openaccess.thecvf.com/content/ICCV2023/papers/McIntosh_Inter-Realization_Channels_Unsupervised_Anomaly_Detection_Beyond_One-Class_Classification_ICCV_2023_paper.pdf)
* Image Anomaly Detection
* [Focus the Discrepancy: Intra- and Inter-Correlation Learning for Image Anomaly Detection](http://arxiv.org/abs/2308.02983v1)
:star:[code](https://github.com/xcyao00/FOD)
* OOD
* [CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No](http://arxiv.org/abs/2308.12213v1)
:star:[code](https://github.com/xmed-lab/CLIPN)
* [Meta OOD Learning for Continuously Adaptive OOD Detection](http://arxiv.org/abs/2309.11705v1)
* [Simple and Effective Out-of-Distribution Detection via Cosine-based Softmax Loss](https://openaccess.thecvf.com/content/ICCV2023/papers/Noh_Simple_and_Effective_Out-of-Distribution_Detection_via_Cosine-based_Softmax_Loss_ICCV_2023_paper.pdf)
* [Nearest Neighbor Guidance for Out-of-Distribution Detection](http://arxiv.org/abs/2309.14888v1)
:star:[code](https://github.com/roomo7time/nnguide)
* [Understanding the Feature Norm for Out-of-Distribution Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Park_Understanding_the_Feature_Norm_for_Out-of-Distribution_Detection_ICCV_2023_paper.pdf)
* [SAFE: Sensitivity-Aware Features for Out-of-Distribution Object Detection](http://arxiv.org/abs/2208.13930)
* [Unified Out-Of-Distribution Detection: A Model-Specific Perspective](http://arxiv.org/abs/2304.06813)
* [Revisit PCA-based Technique for Out-of-Distribution Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Guan_Revisit_PCA-based_Technique_for_Out-of-Distribution_Detection_ICCV_2023_paper.pdf)
* [Hierarchical Visual Categories Modeling: A Joint Representation Learning and Density Estimation Framework for Out-of-Distribution Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_Hierarchical_Visual_Categories_Modeling_A_Joint_Representation_Learning_and_Density_ICCV_2023_paper.pdf)
* [WDiscOOD: Out-of-Distribution Detection via Whitened Linear Discriminant Analysis](http://arxiv.org/abs/2303.07543)

## 39.Keypoint Detection
* [Neural Interactive Keypoint Detection](http://arxiv.org/abs/2308.10174v1)
:star:[code](https://github.com/IDEA-Research/Click-Pose)
* [3D Implicit Transporter for Temporally Consistent Keypoint Discovery](http://arxiv.org/abs/2309.05098v1)
:star:[code](https://github.com/zhongcl-thu/3D-Implicit-Transporter)

## 38.Vision-Language
* [Linear Spaces of Meanings: Compositional Structures in Vision-Language Models](http://arxiv.org/abs/2302.14383)
* [ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation](http://arxiv.org/abs/2308.16689)
* [Gloss-Free Sign Language Translation: Improving from Visual-Language Pretraining](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhou_Gloss-Free_Sign_Language_Translation_Improving_from_Visual-Language_Pretraining_ICCV_2023_paper.pdf)
* [SuS-X: Training-Free Name-Only Transfer of Vision-Language Models](https://openaccess.thecvf.com/content/ICCV2023/papers/Udandarao_SuS-X_Training-Free_Name-Only_Transfer_of_Vision-Language_Models_ICCV_2023_paper.pdf)
* [Bayesian Prompt Learning for Image-Language Model Generalization](http://arxiv.org/abs/2210.02390)
* [eP-ALM: Efficient Perceptual Augmentation of Language Models](https://openaccess.thecvf.com/content/ICCV2023/papers/Shukor_eP-ALM_Efficient_Perceptual_Augmentation_of_Language_Models_ICCV_2023_paper.pdf)
* [Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models](http://arxiv.org/abs/2303.17169)
* [SLAN: Self-Locator Aided Network for Vision-Language Understanding](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhai_SLAN_Self-Locator_Aided_Network_for_Vision-Language_Understanding_ICCV_2023_paper.pdf)
* [Borrowing Knowledge From Pre-trained Language Model: A New Data-efficient Visual Learning Paradigm](https://openaccess.thecvf.com/content/ICCV2023/papers/Ma_Borrowing_Knowledge_From_Pre-trained_Language_Model_A_New_Data-efficient_Visual_ICCV_2023_paper.pdf)
* [VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching](https://openaccess.thecvf.com/content/ICCV2023/papers/Bi_VL-Match_Enhancing_Vision-Language_Pretraining_with_Token-Level_and_Instance-Level_Matching_ICCV_2023_paper.pdf)
* [A Retrospect to Multi-prompt Learning across Vision and Language](https://openaccess.thecvf.com/content/ICCV2023/papers/Chen_A_Retrospect_to_Multi-prompt_Learning_across_Vision_and_Language_ICCV_2023_paper.pdf)
* [CiT: Curation in Training for Effective Vision-Language Data](http://arxiv.org/abs/2301.02241)
* [EgoTV: Egocentric Task Verification from Natural Language Task Descriptions](http://arxiv.org/abs/2303.16975)
* [Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models](http://arxiv.org/abs/2303.06571)
* [Towards Unifying Medical Vision-and-Language Pre-Training via Soft Prompts](http://arxiv.org/abs/2302.08958)
* [Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models](http://arxiv.org/abs/2303.06628)
* [Perceptual Grouping in Contrastive Vision-Language Models](http://arxiv.org/abs/2210.09996)
* [Black Box Few-Shot Adaptation for Vision-Language Models](https://openaccess.thecvf.com/content/ICCV2023/papers/Ouali_Black_Box_Few-Shot_Adaptation_for_Vision-Language_Models_ICCV_2023_paper.pdf)
* [GrowCLIP: Data-Aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-Training](http://arxiv.org/abs/2308.11331)
* [I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision](https://openaccess.thecvf.com/content/ICCV2023/papers/Gu_I_Cant_Believe_Theres_No_Images_Learning_Visual_Tasks_Using_ICCV_2023_paper.pdf)
* [Too Large; Data Reduction for Vision-Language Pre-Training](http://arxiv.org/abs/2305.20087)
* [Equivariant Similarity for Vision-Language Foundation Models](http://arxiv.org/abs/2303.14465)
* [Going Beyond Nouns With Vision & Language Models Using Synthetic Data](http://arxiv.org/abs/2303.17590)
* [SINC: Self-Supervised In-Context Learning for Vision-Language Tasks](http://arxiv.org/abs/2307.07742)
* [Unified Visual Relationship Detection with Vision and Language Models](http://arxiv.org/abs/2303.08998)
* [ProbVLM: Probabilistic Adapter for Frozen Vison-Language Models](http://arxiv.org/abs/2307.00398)
* [Distilling Large Vision-Language Model with Out-of-Distribution Generalizability](http://arxiv.org/abs/2307.03135)
* [Distribution-Aware Prompt Tuning for Vision-Language Models](http://arxiv.org/abs/2309.03406v1)
:star:[code](https://github.com/mlvlab/DAPT)
* [LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models](http://arxiv.org/abs/2309.01155v1)
:star:[code](https://chengshiest.github.io/logo)
* [CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation](http://arxiv.org/abs/2308.15226v1)
* [RLIPv2: Fast Scaling of Relational Language-Image Pre-training](http://arxiv.org/abs/2308.09351v1)
:star:[code](https://github.com/JacobYuan7/RLIPv2)
* [Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models](http://arxiv.org/abs/2307.15049v1)
:star:[code](https://wuw2019.github.io/RMT/)
* [Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?](http://arxiv.org/abs/2307.11978v1)
:star:[code](https://github.com/CEWu/PTNL)
* [Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models](http://arxiv.org/abs/2307.14061v1)
:thumbsup:[ICCV 2023 Oral | SUSTech VIP Lab | Set-level Guidance Attack on VLP Models](https://mp.weixin.qq.com/s/bE97oBoa4nH1c5XuOz4WWA)
* [CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation](http://arxiv.org/abs/2308.07146v1)
:star:[code](https://github.com/KevinLight831/CTP)
* [VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control](http://arxiv.org/abs/2308.09804v1)
:star:[code](https://github.com/HenryHZY/VL-PET)
* [Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models](http://arxiv.org/abs/2308.11186v1)
* [VLSlice: Interactive Vision-and-Language Slice Discovery](http://arxiv.org/abs/2309.06703v1)
:house:[project](https://ericslyman.com/vlslice/)
* [What does CLIP know about a red circle? Visual prompt engineering for VLMs](http://arxiv.org/abs/2304.06712)
* [BUS: Efficient and Effective Vision-Language Pre-Training with Bottom-Up Patch Summarization](http://arxiv.org/abs/2307.08504)
* Visual Representation Learning
* [Hallucination Improves the Performance of Unsupervised Visual Representation Learning](http://arxiv.org/abs/2307.12168v1)
* [Semantics-Consistent Feature Search for Self-Supervised Visual Representation Learning](http://arxiv.org/abs/2212.06486)
* [ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data](http://arxiv.org/abs/2308.11194v1)
* VLN
* [Learning Vision-and-Language Navigation from YouTube Videos](http://arxiv.org/abs/2307.11984v1)
:star:[code](https://github.com/JeremyLinky/YouTube-VLN)
* [Learning Navigational Visual Representations with Semantic Map Supervision](http://arxiv.org/abs/2307.12335)
* [Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation](http://arxiv.org/abs/2308.12587)
* [GridMM: Grid Memory Map for Vision-and-Language Navigation](http://arxiv.org/abs/2307.12907v1)
* [Scaling Data Generation in Vision-and-Language Navigation](http://arxiv.org/abs/2307.15644v1)
* [Bird's-Eye-View Scene Graph for Vision-Language Navigation](http://arxiv.org/abs/2308.04758v1)
:star:[code](https://github.com/DefaultRui/BEV-Scene-Graph)
* [AerialVLN: Vision-and-Language Navigation for UAVs](http://arxiv.org/abs/2308.06735v1)
:star:[code](https://github.com/AirVLN/AirVLN)
* [DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation](http://arxiv.org/abs/2308.07498v1)
:star:[code](https://github.com/hanqingwangai/Dreamwalker)
* [VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation](http://arxiv.org/abs/2308.10172v1)
* [March in Chat: Interactive Prompting for Remote Embodied Referring Expression](http://arxiv.org/abs/2308.10141v1)
* [BEVBert: Multimodal Map Pre-training for Language-guided Navigation](https://arxiv.org/pdf/2212.04385.pdf)
:star:[code](https://github.com/MarSaKi/VLN-BEVBert)
* Visual Grounding
* [ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding](http://arxiv.org/abs/2303.16894)
* [Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding](http://arxiv.org/abs/2307.09267)
* [Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding](https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_Confidence-aware_Pseudo-label_Learning_for_Weakly_Supervised_Visual_Grounding_ICCV_2023_paper.pdf)
* Video-Language
* [HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training](http://arxiv.org/abs/2212.14546)
* [Learning Trajectory-Word Alignments for Video-Language Tasks](http://arxiv.org/abs/2301.01953)
* [HiVLP: Hierarchical Interactive Video-Language Pre-Training](https://openaccess.thecvf.com/content/ICCV2023/papers/Shao_HiVLP_Hierarchical_Interactive_Video-Language_Pre-Training_ICCV_2023_paper.pdf)
* [Verbs in Action: Improving Verb Understanding in Video-Language Models](http://arxiv.org/abs/2304.06708)
* [Exploring Temporal Concurrency for Video-Language Representation Learning](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_Exploring_Temporal_Concurrency_for_Video-Language_Representation_Learning_ICCV_2023_paper.pdf)
* [EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone](http://arxiv.org/abs/2307.05463)
* [SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-Training](http://arxiv.org/abs/2211.11446)
* Visual Reasoning
* [ViperGPT: Visual Inference via Python Execution for Reasoning](https://openaccess.thecvf.com/content/ICCV2023/papers/Suris_ViperGPT_Visual_Inference_via_Python_Execution_for_Reasoning_ICCV_2023_paper.pdf)
* LLM
* [LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models](https://openaccess.thecvf.com/content/ICCV2023/papers/Song_LLM-Planner_Few-Shot_Grounded_Planning_for_Embodied_Agents_with_Large_Language_ICCV_2023_paper.pdf)
* [Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts](http://arxiv.org/abs/2308.11793v1)
:star:[code](https://github.com/VITA-Group/GNT-MOVE)

## 37.Object Pose Estimation
* [IST-Net: Prior-Free Category-Level Pose Estimation with Implicit Space Transformation](https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_IST-Net_Prior-Free_Category-Level_Pose_Estimation_with_Implicit_Space_Transformation_ICCV_2023_paper.pdf)
* [LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs](https://openaccess.thecvf.com/content/ICCV2023/papers/Cheng_LU-NeRF_Scene_and_Pose_Estimation_by_Synchronizing_Local_Unposed_NeRFs_ICCV_2023_paper.pdf)
* [PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment](http://arxiv.org/abs/2306.15667)
* [Nonrigid Object Contact Estimation With Regional Unwrapping Transformer](http://arxiv.org/abs/2308.14074v1)
* 6D
* [Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation](http://arxiv.org/abs/2308.05438v1)
* [SOCS: Semantically-Aware Object Coordinate Space for Category-Level 6D Object Pose Estimation under Large Shape Variations](http://arxiv.org/abs/2303.10346)
* [Linear-Covariance Loss for End-to-End Learning of 6D Pose Estimation](http://arxiv.org/abs/2303.11516)
* [Pseudo Flow Consistency for Self-Supervised 6D Object Pose Estimation](http://arxiv.org/abs/2308.10016v1)
* [Center-Based Decoupled Point-cloud Registration for 6D Object Pose Estimation](https://openaccess.thecvf.com/content/ICCV2023/papers/Jiang_Center-Based_Decoupled_Point-cloud_Registration_for_6D_Object_Pose_Estimation_ICCV_2023_paper.pdf)
* [Query6DoF: Learning Sparse Queries as Implicit Shape Prior for Category-Level 6DoF Pose Estimation](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_Query6DoF_Learning_Sparse_Queries_as_Implicit_Shape_Prior_for_Category-Level_ICCV_2023_paper.pdf)
* [VI-Net: Boosting Category-level 6D Object Pose Estimation via Learning Decoupled Rotations on the Spherical Representations](http://arxiv.org/abs/2308.09916v1)
:star:[code](https://github.com/JiehongLin/VI-Net)
* [Learning Symmetry-Aware Geometry Correspondences for 6D Object Pose Estimation](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhao_Learning_Symmetry-Aware_Geometry_Correspondences_for_6D_Object_Pose_Estimation_ICCV_2023_paper.pdf)
* [3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation](http://arxiv.org/abs/2302.03744)
* Object Counting
* [STEERER: Resolving Scale Variations for Counting and Localization via Selective Inheritance Learning](http://arxiv.org/abs/2308.10468v1)
:star:[code](https://github.com/taohan10200/STEERER)
* [Interactive Class-Agnostic Object Counting](http://arxiv.org/abs/2309.05277)
* [A Low-Shot Object Counting Network With Iterative Prototype Adaptation](https://openaccess.thecvf.com/content/ICCV2023/papers/Dukic_A_Low-Shot_Object_Counting_Network_With_Iterative_Prototype_Adaptation_ICCV_2023_paper.pdf)
* Animal Pose Estimation
* [Animal3D: A Comprehensive Dataset of 3D Animal Pose and Shape](http://arxiv.org/abs/2308.11737)

## 36.Visual Question Answering
* [Toward Unsupervised Realistic Visual Question Answering](http://arxiv.org/abs/2303.05068)
* [Variational Causal Inference Network for Explanatory Visual Question Answering](https://openaccess.thecvf.com/content/ICCV2023/papers/Xue_Variational_Causal_Inference_Network_for_Explanatory_Visual_Question_Answering_ICCV_2023_paper.pdf)
* [VQA Therapy: Exploring Answer Differences by Visually Grounding Answers](http://arxiv.org/abs/2308.11662)
* [VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_VQA-GNN_Reasoning_with_Multimodal_Knowledge_via_Graph_Neural_Networks_for_ICCV_2023_paper.pdf)
* [TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering](http://arxiv.org/abs/2303.11897)
* [Encyclopedic VQA: Visual Questions About Detailed Properties of Fine-Grained Categories](http://arxiv.org/abs/2306.09224)
* [PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3](https://openaccess.thecvf.com/content/ICCV2023/papers/Hu_PromptCap_Prompt-Guided_Image_Captioning_for_VQA_with_GPT-3_ICCV_2023_paper.pdf)
* [Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering](https://openaccess.thecvf.com/content/ICCV2023/papers/Qian_Decouple_Before_Interact_Multi-Modal_Prompt_Learning_for_Continual_Visual_Question_ICCV_2023_paper.pdf)
* Video-QA
* [Discovering Spatio-Temporal Rationales for Video Question Answering](http://arxiv.org/abs/2307.12058v1)
:star:[code](https://github.com/yl3800/TranSTR)
* [Knowledge Proxy Intervention for Deconfounded Video Question Answering](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_Knowledge_Proxy_Intervention_for_Deconfounded_Video_Question_Answering_ICCV_2023_paper.pdf)
* [Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer](http://arxiv.org/abs/2308.08414v1)
* [Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models](http://arxiv.org/abs/2308.09363v1)
:star:[code](https://github.com/mlvlab/OVQA)
* Video Intent Reasoning
* [IntentQA: Context-aware Video Intent Reasoning](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_IntentQA_Context-aware_Video_Intent_Reasoning_ICCV_2023_paper.pdf)

## 35.Human Motion Prediction
* [Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction](http://arxiv.org/abs/2308.08942v1)
:star:[code](https://github.com/MediaBrain-SJTU/AuxFormer)
* [Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders](http://arxiv.org/abs/2308.09882v1)
:star:[code](https://github.com/jchengai/forecast-mae)
* [Priority-Centric Human Motion Generation in Discrete Latent Space](http://arxiv.org/abs/2308.14480v1)
* [MotionLM: Multi-Agent Motion Forecasting as Language Modeling](https://openaccess.thecvf.com/content/ICCV2023/papers/Seff_MotionLM_Multi-Agent_Motion_Forecasting_as_Language_Modeling_ICCV_2023_paper.pdf)
* [HumanMAC: Masked Motion Completion for Human Motion Prediction](http://arxiv.org/abs/2302.03665)
* [Joint-Relation Transformer for Multi-Person Motion Prediction](http://arxiv.org/abs/2308.04808)
* [Bootstrap Motion Forecasting With Self-Consistent Constraints](http://arxiv.org/abs/2204.05859)
* [PhysDiff: Physics-Guided Human Motion Diffusion Model](http://arxiv.org/abs/2212.02500)
* [AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism](http://arxiv.org/abs/2309.00796)
* [BeLFusion: Latent Diffusion for Behavior-Driven Human Motion Prediction](http://arxiv.org/abs/2211.14304)
* [Social Diffusion: Long-term Multiple Human Motion Anticipation](https://openaccess.thecvf.com/content/ICCV2023/papers/Tanke_Social_Diffusion_Long-term_Multiple_Human_Motion_Anticipation_ICCV_2023_paper.pdf)
* [SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation](http://arxiv.org/abs/2304.10417)

## 34.Action Detection (Action Recognition)
* [Multimodal Distillation for Egocentric Action Recognition](http://arxiv.org/abs/2307.07483)
* [Memory-and-Anticipation Transformer for Online Action Understanding](http://arxiv.org/abs/2308.07893v1)
:star:[code](https://github.com/Echo0125/Memory-and-Anticipation-Transformer)
* [Masked Motion Predictors are Strong 3D Action Representation Learners](http://arxiv.org/abs/2308.07092v1)
:star:[code](https://github.com/maoyunyao/MAMP)
* [Efficient Video Action Detection with Token Dropout and Context Refinement](http://arxiv.org/abs/2304.08451)
* [Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition](https://openaccess.thecvf.com/content/ICCV2023/papers/Wasim_Video-FocalNets_Spatio-Temporal_Focal_Modulation_for_Video_Action_Recognition_ICCV_2023_paper.pdf)
* [E2E-LOAD: End-to-End Long-form Online Action Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Cao_E2E-LOAD_End-to-End_Long-form_Online_Action_Detection_ICCV_2023_paper.pdf)
* [Ego-Only: Egocentric Action Detection without Exocentric Transferring](https://openaccess.thecvf.com/content/ICCV2023/papers/Wang_Ego-Only_Egocentric_Action_Detection_without_Exocentric_Transferring_ICCV_2023_paper.pdf)
* [Cross-Modal Learning with 3D Deformable Attention for Action Recognition](http://arxiv.org/abs/2212.05638)
* [DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion](http://arxiv.org/abs/2303.14863)
* [STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition](http://arxiv.org/abs/2301.03046)
* [MiniROAD: Minimal RNN Framework for Online Action Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/An_MiniROAD_Minimal_RNN_Framework_for_Online_Action_Detection_ICCV_2023_paper.pdf)
* [Video Action Recognition with Attentive Semantic Units](http://arxiv.org/abs/2303.09756)
* [A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition](http://arxiv.org/abs/2303.13505)
* [What Can a Cook in Italy Teach a Mechanic in India? Action Recognition Generalisation Over Scenarios and Locations](http://arxiv.org/abs/2306.08713)
* Skeleton-based Action Recognition
* [LAC -- Latent Action Composition for Skeleton-based Action Segmentation](http://arxiv.org/abs/2308.14500v1)
* [Generative Action Description Prompts for Skeleton-based Action Recognition](http://arxiv.org/abs/2208.05318)
* [Parallel Attention Interaction Network for Few-Shot Skeleton-Based Action Recognition](https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_Parallel_Attention_Interaction_Network_for_Few-Shot_Skeleton-Based_Action_Recognition_ICCV_2023_paper.pdf)
* [Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition](http://arxiv.org/abs/2212.04761)
* [Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition](http://arxiv.org/abs/2208.10741)
* [Modeling the Relative Visual Tempo for Self-supervised Skeleton-based Action Recognition](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhu_Modeling_the_Relative_Visual_Tempo_for_Self-supervised_Skeleton-based_Action_Recognition_ICCV_2023_paper.pdf)
* [SkeleTR: Towards Skeleton-based Action Recognition in the Wild](http://arxiv.org/abs/2309.11445v1)
* [Hard No-Box Adversarial Attack on Skeleton-Based Human Action Recognition with Skeleton-Motion-Informed Gradient](http://arxiv.org/abs/2308.05681)
* [FSAR: Federated Skeleton-based Action Recognition with Adaptive Topology Structure and Knowledge Distillation](http://arxiv.org/abs/2306.11046)
* Open-set Action Recognition
* [SOAR: Scene-debiasing Open-set Action Recognition](http://arxiv.org/abs/2309.01265v1)
:star:[code](https://github.com/yhZhai/SOAR)
* Zero-shot Action Recognition
* [MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge](http://arxiv.org/abs/2303.08914)
* Few-shot Action Recognition
* [Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching](http://arxiv.org/abs/2308.09346v1)
:star:[code](https://github.com/jiazheng-xing/GgHM)
* Temporal Action Localization
* [DDG-Net: Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization](http://arxiv.org/abs/2307.16415v1)
:star:[code](https://github.com/XiaojunTang22/ICCV2023-DDGNet)
* [Movement Enhancement toward Multi-Scale Video Feature Representation for Temporal Action Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhao_Movement_Enhancement_toward_Multi-Scale_Video_Feature_Representation_for_Temporal_Action_ICCV_2023_paper.pdf)
* [Self-Feedback DETR for Temporal Action Detection](http://arxiv.org/abs/2308.10570v1)
* [Action Sensitivity Learning for Temporal Action Localization](http://arxiv.org/abs/2305.15701)
* [Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization](https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_Revisiting_Foreground_and_Background_Separation_in_Weakly-supervised_Temporal_Action_Localization_ICCV_2023_paper.pdf)
* [Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Localization](https://openaccess.thecvf.com/content/ICCV2023/papers/Xia_Learning_from_Noisy_Pseudo_Labels_for_Semi-Supervised_Temporal_Action_Localization_ICCV_2023_paper.pdf)
* Weakly-supervised Action Localization
* [Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling](http://arxiv.org/abs/2308.09946v1)
* Few-shot Action Localization
* [Few-Shot Common Action Localization via Cross-Attentional Fusion of Context and Temporal Dynamics](https://openaccess.thecvf.com/content/ICCV2023/papers/Lee_Few-Shot_Common_Action_Localization_via_Cross-Attentional_Fusion_of_Context_and_ICCV_2023_paper.pdf)
* Action Understanding
* [Memory-and-Anticipation Transformer for Online Action Understanding](http://arxiv.org/abs/2308.07893)

## 33.Video
* [Neural Video Depth Stabilizer](http://arxiv.org/abs/2307.08695)
* [NPC: Neural Point Characters from Video](http://arxiv.org/abs/2304.02013)
* [Localizing Moments in Long Video Via Multimodal Guidance](http://arxiv.org/abs/2302.13372)
* [Order-Prompted Tag Sequence Generation for Video Tagging](https://openaccess.thecvf.com/content/ICCV2023/papers/Ma_Order-Prompted_Tag_Sequence_Generation_for_Video_Tagging_ICCV_2023_paper.pdf)
* [Moment Detection in Long Tutorial Videos](https://openaccess.thecvf.com/content/ICCV2023/papers/Croitoru_Moment_Detection_in_Long_Tutorial_Videos_ICCV_2023_paper.pdf)
* [MMVP: Motion-Matrix-based Video Prediction](http://arxiv.org/abs/2308.16154v1)
:star:[code](https://github.com/Kay1794/MMVP-motion-matrix-based-video-prediction)
* [D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation](http://arxiv.org/abs/2308.04197v1)
:star:[code](https://github.com/solicucu/D3G)
* [LAN-HDR: Luminance-based Alignment Network for High Dynamic Range Video Reconstruction](http://arxiv.org/abs/2308.11116v1)
* [TALL: Thumbnail Layout for Deepfake Video Detection](http://arxiv.org/abs/2307.07494)
* [Spatio-temporal Prompting Network for Robust Video Feature Extraction](https://openaccess.thecvf.com/content/ICCV2023/papers/Sun_Spatio-temporal_Prompting_Network_for_Robust_Video_Feature_Extraction_ICCV_2023_paper.pdf)
* [Neural Reconstruction of Relightable Human Model from Monocular Video](https://openaccess.thecvf.com/content/ICCV2023/papers/Sun_Neural_Reconstruction_of_Relightable_Human_Model_from_Monocular_Video_ICCV_2023_paper.pdf)
* Video Understanding
* [Long-range Multimodal Pretraining for Movie Understanding](http://arxiv.org/abs/2308.09775v1)
* [RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D](http://arxiv.org/abs/2308.12035v1)
* [UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_UniFormerV2_Unlocking_the_Potential_of_Image_ViTs_for_Video_Understanding_ICCV_2023_paper.pdf)
* Video Classification
* [ReGen: A good Generative Zero-Shot Video Classifier Should be Rewarded](https://openaccess.thecvf.com/content/ICCV2023/papers/Bulat_ReGen_A_good_Generative_Zero-Shot_Video_Classifier_Should_be_Rewarded_ICCV_2023_paper.pdf)
* [Gram-based Attentive Neural Ordinary Differential Equations Network for Video Nystagmography Classification](https://openaccess.thecvf.com/content/ICCV2023/papers/Qiu_Gram-based_Attentive_Neural_Ordinary_Differential_Equations_Network_for_Video_Nystagmography_ICCV_2023_paper.pdf)
* [Few-Shot Video Classification via Representation Fusion and Promotion Learning](https://openaccess.thecvf.com/content/ICCV2023/papers/Xia_Few-Shot_Video_Classification_via_Representation_Fusion_and_Promotion_Learning_ICCV_2023_paper.pdf)
* Video Synthesis
* [StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation](http://arxiv.org/abs/2308.16909v1)
:house:[project](https://www.mmlab-ntu.com/project/styleinv/index.html)
:star:[code](https://github.com/johannwyh/StyleInV)
* [Mixed Neural Voxels for Fast Multi-view Video Synthesis](http://arxiv.org/abs/2212.00190)
* [WALDO: Future Video Synthesis Using Object Layer Decomposition and Parametric Flow Prediction](http://arxiv.org/abs/2211.14308)
* [Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://openaccess.thecvf.com/content/ICCV2023/papers/Khachatryan_Text2Video-Zero_Text-to-Image_Diffusion_Models_are_Zero-Shot_Video_Generators_ICCV_2023_paper.pdf)
* [StyleLipSync: Style-based Personalized Lip-sync Video Generation](http://arxiv.org/abs/2305.00521)
* [Text2Performer: Text-Driven Human Video Generation](http://arxiv.org/abs/2304.08483)
* [DreamPose: Fashion Video Synthesis with Stable Diffusion](https://openaccess.thecvf.com/content/ICCV2023/papers/Karras_DreamPose_Fashion_Video_Synthesis_with_Stable_Diffusion_ICCV_2023_paper.pdf)
* [Structure and Content-Guided Video Synthesis with Diffusion Models](http://arxiv.org/abs/2302.03011)
* Video Stabilization
* [Fast Full-frame Video Stabilization with Iterative Optimization](http://arxiv.org/abs/2307.12774v1)
* [Minimum Latency Deep Online Video Stabilization](http://arxiv.org/abs/2212.02073)
* Video Grounding
* [G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory](http://arxiv.org/abs/2307.14277v1)
* [UniVTG: Towards Unified Video-Language Temporal Grounding](http://arxiv.org/abs/2307.16715v1)
:star:[code](https://github.com/showlab/UniVTG)
* [Knowing Where to Focus: Event-aware Transformer for Video Grounding](http://arxiv.org/abs/2308.06947v1)
:star:[code](https://github.com/jinhyunj/EaTR)
* [Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos](http://arxiv.org/abs/2303.08345)
* Video Segmentation
* [XMem++: Production-level Video Segmentation From Few Annotated Frames](http://arxiv.org/abs/2307.15958v1)
* [Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation](http://arxiv.org/abs/2309.13248v1)
:star:[code](https://github.com/kfan21/EoRaS)
* [GraphEcho: Graph-Driven Unsupervised Domain Adaptation for Echocardiogram Video Segmentation](http://arxiv.org/abs/2309.11145v1)
:star:[code](https://github.com/xmed-lab/GraphEcho)
* [MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions](http://arxiv.org/abs/2308.08544v1)
:star:[code](https://henghuiding.github.io/MeViS)
:thumbsup:[ICCV2023 | New Dataset MeViS: Video Segmentation Based on Motion Expressions](https://mp.weixin.qq.com/s/hHAgfiQdA_g0DkmgWPzLeg)
* [MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation](http://arxiv.org/abs/2308.11185v1)
* [Tracking Anything with Decoupled Video Segmentation](http://arxiv.org/abs/2309.03903v1)
:star:[code](https://hkchengrex.github.io/Tracking-Anything-with-DEVA)
* [The Making and Breaking of Camouflage](http://arxiv.org/abs/2309.03899v1)
* [Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation](https://openaccess.thecvf.com/content/ICCV2023/papers/Li_Tube-Link_A_Flexible_Cross_Tube_Framework_for_Universal_Video_Segmentation_ICCV_2023_paper.pdf)
* Video Correspondence
* [Learning Fine-Grained Features for Pixel-wise Video Correspondences](http://arxiv.org/abs/2308.03040v1)
:star:[code](https://github.com/qianduoduolr/FGVC)
* Video Perception
* [ResQ: Residual Quantization for Video Perception](http://arxiv.org/abs/2308.09511v1)
* Video Recognition
* [Audio-Visual Glance Network for Efficient Video Recognition](http://arxiv.org/abs/2308.09322v1)
* [Efficient Decision-based Black-box Patch Attacks on Video Recognition](http://arxiv.org/abs/2303.11917)
* [Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition](http://arxiv.org/abs/2308.11489v1)
:star:[code](https://github.com/wqtwjt1996/SUM-L)
* [Implicit Temporal Modeling with Learnable Alignment for Video Recognition](http://arxiv.org/abs/2304.10465)
* Video Inpainting
* [ProPainter: Improving Propagation and Transformer for Video Inpainting](http://arxiv.org/abs/2309.03897v1)
:star:[code](https://github.com/sczhou/ProPainter)
* Video Representation Learning
* [MGMAE: Motion Guided Masking for Video Masked Autoencoding](http://arxiv.org/abs/2308.10794v1)
* [Spatio-Temporal Crop Aggregation for Video Representation Learning](http://arxiv.org/abs/2211.17042)
* VAD
* [TeD-SPAD: Temporal Distinctiveness for Self-supervised Privacy-preservation for video Anomaly Detection](http://arxiv.org/abs/2308.11072v1)
:star:[code](https://joefioresi718.github.io/TeD-SPAD_webpage/)
* [Video Anomaly Detection via Sequentially Learning Multiple Pretext Tasks](https://openaccess.thecvf.com/content/ICCV2023/papers/Shi_Video_Anomaly_Detection_via_Sequentially_Learning_Multiple_Pretext_Tasks_ICCV_2023_paper.pdf)
* [Feature Prediction Diffusion Model for Video Anomaly Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Yan_Feature_Prediction_Diffusion_Model_for_Video_Anomaly_Detection_ICCV_2023_paper.pdf)
* Video Localization
* [UnLoc: A Unified Framework for Video Localization Tasks](http://arxiv.org/abs/2308.11062v1)
:star:[code](https://github.com/google-research/scenic)
* [Video OWL-ViT: Temporally-consistent open-world localization in video](http://arxiv.org/abs/2308.11093v1)
* [Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Flaborea_Multimodal_Motion_Conditioned_Diffusion_Model_for_Skeleton-based_Video_Anomaly_Detection_ICCV_2023_paper.pdf)
* [TeD-SPAD: Temporal Distinctiveness for Self-Supervised Privacy-Preservation for Video Anomaly Detection](https://openaccess.thecvf.com/content/ICCV2023/papers/Fioresi_TeD-SPAD_Temporal_Distinctiveness_for_Self-Supervised_Privacy-Preservation_for_Video_Anomaly_Detection_ICCV_2023_paper.pdf)
* Video Prediction
* [MMVP: Motion-Matrix-Based Video Prediction](http://arxiv.org/abs/2308.16154)
* [Efficient Video Prediction via Sparsely Conditioned Flow Matching](http://arxiv.org/abs/2211.14575)
* Video Glass Segmentation
* [Multi-view Spectral Polarization Propagation for Video Glass Segmentation](https://openaccess.thecvf.com/content/ICCV2023/papers/Qiao_Multi-view_Spectral_Polarization_Propagation_for_Video_Glass_Segmentation_ICCV_2023_paper.pdf)
* Video Frame Interpolation
* [Rethinking Video Frame Interpolation from Shutter Mode Induced Degradation](https://openaccess.thecvf.com/content/ICCV2023/papers/Ji_Rethinking_Video_Frame_Interpolation_from_Shutter_Mode_Induced_Degradation_ICCV_2023_paper.pdf)
* Video Semantic Compression
* [Non-Semantics Suppressed Mask Learning for Unsupervised Video Semantic Compression](https://openaccess.thecvf.com/content/ICCV2023/papers/Tian_Non-Semantics_Suppressed_Mask_Learning_for_Unsupervised_Video_Semantic_Compression_ICCV_2023_paper.pdf)
* Video-to-Video Translation
* [Shortcut-V2V: Compression Framework for Video-to-Video Translation Based on Temporal Redundancy Reduction](https://openaccess.thecvf.com/content/ICCV2023/papers/Chung_Shortcut-V2V_Compression_Framework_for_Video-to-Video_Translation_Based_on_Temporal_Redundancy_ICCV_2023_paper.pdf)

## 32.Sign Language Recognition
* [Human Part-wise 3D Motion Context Learning for Sign Language Recognition](http://arxiv.org/abs/2308.09305v1)
* [CoSign: Exploring Co-occurrence Signals in Skeleton-based Continuous Sign Language Recognition](https://openaccess.thecvf.com/content/ICCV2023/papers/Jiao_CoSign_Exploring_Co-occurrence_Signals_in_Skeleton-based_Continuous_Sign_Language_Recognition_ICCV_2023_paper.pdf)
* [Improving Continuous Sign Language Recognition with Cross-Lingual Signs](http://arxiv.org/abs/2308.10809v1)
* [C2ST: Cross-Modal Contextualized Sequence Transduction for Continuous Sign Language Recognition](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_C2ST_Cross-Modal_Contextualized_Sequence_Transduction_for_Continuous_Sign_Language_Recognition_ICCV_2023_paper.pdf)
* Sign Language Translation
* [Sign Language Translation with Iterative Prototype](http://arxiv.org/abs/2308.12191v1)
* [Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining](http://arxiv.org/abs/2307.14768v1)
:star:[code](https://github.com/zhoubenjia/GFSLT-VLP)

## 31.Human-Object Interaction
* [Full-Body Articulated Human-Object Interaction](http://arxiv.org/abs/2212.10621)
* [Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory](http://arxiv.org/abs/2309.03696)
* [Learning Human-Human Interactions in Images from Weak Textual Supervision](http://arxiv.org/abs/2304.14104)
* [Persistent-Transient Duality: A Multi-mechanism Approach for Modeling Human-Object Interaction](http://arxiv.org/abs/2307.12729v1)
* [Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection](http://arxiv.org/abs/2307.13529v1)
* [Agglomerative Transformer for Human-Object Interaction Detection](http://arxiv.org/abs/2308.08370v1)
* [InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion](http://arxiv.org/abs/2308.16905v1)
:star:[code](https://sirui-xu.github.io/InterDiff/)
* [Exploring Predicate Visual Context in Detecting of Human-Object Interactions](http://arxiv.org/abs/2308.06202)
* [Narrator: Towards Natural Control of Human-Scene Interaction Generation via Relationship Reasoning](http://arxiv.org/abs/2303.09410)
* [Open Set Video HOI detection from Action-Centric Chain-of-Look Prompting](https://openaccess.thecvf.com/content/ICCV2023/papers/Xi_Open_Set_Video_HOI_detection_from_Action-Centric_Chain-of-Look_Prompting_ICCV_2023_paper.pdf)
* [Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models](https://openaccess.thecvf.com/content/ICCV2023/papers/Pi_Hierarchical_Generation_of_Human-Object_Interactions_with_Diffusion_Probabilistic_Models_ICCV_2023_paper.pdf)
* Hand-Object Interaction
* [EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding](http://arxiv.org/abs/2309.02423v1)
:house:[project](https://mvig-rhos.com/ego_pca)
* [Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips](http://arxiv.org/abs/2309.05663v1)
:star:[code](https://judyye.github.io/diffhoi-www/)
* [AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose](http://arxiv.org/abs/2309.08942v1)
:star:[code](https://github.com/GentlesJan/AffordPose)
* [Novel-View Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views](http://arxiv.org/abs/2308.11198)

## 30.SLAM/Augmented Reality/Virtual Reality/Robotics
* Virtual Human Generation
* [MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions](http://arxiv.org/abs/2307.10008v1)
* [GETAvatar: Generative Textured Meshes for Animatable Human Avatars](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_GETAvatar_Generative_Textured_Meshes_for_Animatable_Human_Avatars_ICCV_2023_paper.pdf)
* [NSF: Neural Surface Fields for Human Modeling from Monocular Depth](http://arxiv.org/abs/2308.14847v1)
:house:[project](https://yuxuan-xue.com/nsf)
* [AvatarCraft: Transforming Text into Neural Human Avatars with Parameterized Shape and Pose Control](http://arxiv.org/abs/2303.17606)
* [DINAR: Diffusion Inpainting of Neural Textures for One-Shot Human Avatars](http://arxiv.org/abs/2303.09375)
* Robotics
* [Leveraging SE(3) Equivariance for Learning 3D Geometric Shape Assembly](http://arxiv.org/abs/2309.06810v1)
:star:[code](https://github.com/crtie/Leveraging-SE-3-Equivariance-for-Learning-3D-Geometric-Shape-Assembly)
:star:[code](https://crtie.github.io/SE-3-part-assembly/)
* [PourIt!: Weakly-Supervised Liquid Perception from a Single Image for Visual Closed-Loop Robotic Pouring](http://arxiv.org/abs/2307.11299)
* AR/VR
* [HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations](http://arxiv.org/abs/2308.11261v1)
* SLAM
* [GO-SLAM: Global Optimization for Consistent 3D Instant Reconstruction](http://arxiv.org/abs/2309.02436v1)
:star:[code](https://youmi-zym.github.io/projects/GO-SLAM/)
:star:[code](https://github.com/youmi-zym/GO-SLAM)
* [Point-SLAM: Dense Neural Point Cloud-based SLAM](https://openaccess.thecvf.com/content/ICCV2023/papers/Sandstrom_Point-SLAM_Dense_Neural_Point_Cloud-based_SLAM_ICCV_2023_paper.pdf)
* [NeRF-LOAM: Neural Implicit Representation for Large-Scale Incremental LiDAR Odometry and Mapping](https://openaccess.thecvf.com/content/ICCV2023/papers/Deng_NeRF-LOAM_Neural_Implicit_Representation_for_Large-Scale_Incremental_LiDAR_Odometry_and_ICCV_2023_paper.pdf)
* [MV-Map: Offboard HD-Map Generation with Multi-view Consistency](https://openaccess.thecvf.com/content/ICCV2023/papers/Xie_MV-Map_Offboard_HD-Map_Generation_with_Multi-view_Consistency_ICCV_2023_paper.pdf)
* Virtual Try-On
* [Virtual Try-On with Pose-Garment Keypoints Guided Inpainting](https://openaccess.thecvf