https://github.com/52cv/cvpr-2024-papers

Host: GitHub
URL: https://github.com/52cv/cvpr-2024-papers
Owner: 52CV
Created: 2023-11-29T06:53:31.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-06-27T02:22:48.000Z (about 2 years ago)
Last Synced: 2025-11-17T06:06:42.506Z (8 months ago)
Size: 255 KB
Stars: 1,119
Watchers: 6
Forks: 64
Open Issues: 0
Metadata Files:
- Readme: README.md

# CVPR-2024-Papers

![homepage_image](https://github.com/52CV/CVPR-2024-Papers/assets/62801906/41a45750-bca8-4cb8-89dc-a04b0bbe7b2c)

## Official website: https://cvpr.thecvf.com/

### Workshops :bell: June 17-18


### Main conference :bell: June 19-21

## Categorized survey papers from previous years: ↘️[CV-Surveys](https://github.com/52CV/CV-Surveys) (under construction)

## Categorized 2024 paper collections

↘️[WACV-2024-Papers](https://github.com/52CV/WACV-2024-Papers)

↘️[CVPR-2024-Papers](https://github.com/52CV/CVPR-2024-Papers)

↘️[ECCV-2024-Papers](https://github.com/52CV/ECCV-2024-Papers)

## Categorized 2023 paper collections

↘️[CVPR-2023-Papers](https://github.com/52CV/CVPR-2023-Papers)

↘️[WACV-2023-Papers](https://github.com/52CV/WACV-2023-Papers)

↘️[ICCV-2023-Papers](https://github.com/52CV/ICCV-2023-Papers)

## [Categorized 2022 paper collections](#000)

## [Categorized 2021 paper collections](#00)

## [Categorized 2020 paper collections](#0)

## 💥💥💥 All accepted papers have been added and fully categorized!

### 🏆Best Papers

* [Generative Image Dynamics](https://arxiv.org/abs/2309.07906)
:house:[project](https://generative-dynamics.github.io/)

* [Rich Human Feedback for Text-to-Image Generation](http://arxiv.org/abs/2312.10240)

### 🏅Best Paper Runners-Up

* [EventPS: Real-Time Photometric Stereo Using an Event Camera](https://openaccess.thecvf.com/content/CVPR2024/papers/Yu_EventPS_Real-Time_Photometric_Stereo_Using_an_Event_Camera_CVPR_2024_paper.pdf)

* [pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction](http://arxiv.org/abs/2312.12337)

### 🥇Best Student Papers

* [Mip-Splatting: Alias-free 3D Gaussian Splatting](https://arxiv.org/abs/2311.16493)
:star:[code](https://github.com/autonomousvision/mip-splatting)
:house:[project](https://niujinshuchong.github.io/mip-splatting/)

* [BioCLIP: A Vision Foundation Model for the Tree of Life](https://arxiv.org/abs/2311.18803)
:star:[code](https://github.com/Imageomics/bioclip)

### 🥈Best Student Paper Runners-Up

* [SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency](https://openaccess.thecvf.com/content/CVPR2024/papers/Roetzer_SpiderMatch_3D_Shape_Matching_with_Global_Optimality_and_Geometric_Consistency_CVPR_2024_paper.pdf)

* [Image Processing GNN: Breaking Rigidity in Super-Resolution](https://openaccess.thecvf.com/content/CVPR2024/papers/Tian_Image_Processing_GNN_Breaking_Rigidity_in_Super-Resolution_CVPR_2024_paper.pdf)

* [Objects as Volumes: A Stochastic Geometry View of Opaque Solids](http://arxiv.org/abs/2312.15406)

* [Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods](https://arxiv.org/abs/2212.06872)

## Table of Contents

|:cat:|:dog:|:tiger:|:wolf:|
|------|------|------|------|
|[1.Other](#1)|[2.Image Segmentation](#2)|[3.Image Classification](#3)|[4.Image/Video Super-Resolution](#4)|
|[5.Image/Video Compression](#5)|[6.Image/Video Captioning](#6)|[7.Image Processing](#7)|[8.Image Synthesis](#8)|
|[9.Face](#9)|[10.Medical Image Processing](#10)|[11.3D](#11)|[12.Video](#12)|
|[13.HPE (Human Pose Estimation)](#13)|[14.HAR (Human Action Recognition and Detection)](#14)|[15.Object Detection](#15)|[16.Point Cloud](#16)|
|[17.Automated Driving](#17)|[18.SLAM/AR/VR/Robotics](#18)|[19.Object Pose Estimation](#19)|[20.Optical Flow Estimation](#20)|
|[21.Few/Zero-Shot Learning/DG/DA (Domain Generalization/Adaptation)](#21)|[22.Deepfake Detection](#22)|[23.Sound (Speech Processing)](#23)|[24.ML (Machine Learning)](#24)|
|[25.Object Tracking](#25)|[26.Information Security](#26)|[27.Vision-Language](#27)|[28.UAV/Remote Sensing/Satellite Image](#28)|
|[29.MC/KD/Pruning (Model Compression/Knowledge Distillation/Pruning)](#29)|[30.Person Re-Id](#30)|[31.Edge Detection](#31)|[32.NLP (Natural Language Processing)](#32)|
|[33.NeRF](#33)|[34.Human–Computer Interaction](#34)|[35.Scene Understanding](#35)|[36.4D Reconstruction](#36)|
|[37.OCR](#37)|[38.VQA (Visual Question Answering)](#38)|[39.Motion Generation](#39)|[40.Scene Graph Generation](#40)|
|[41.Graph Generative Network(GNN/GCN)](#41)|[42.Image Retrieval](#42)|[43.Image Matching](#43)|[44.Image Fusion](#44)|
|[45.NAS (Neural Architecture Search)](#45)|[46.Industrial Anomaly Detection](#46)|[47.Dense Predictions](#47)|[48.Semi/Self-Supervised Learning](#48)|
|[49.Dataset](#49)|[50.OOD Detection](#50)|[51.Style Transfer](#51)|[52.Biomedical](#52)|
|[53.Light-Field](#53)|[54.ViT](#54)|[55.REC (Referring Expression Comprehension)](#55)|[56.Visual Emotion Recognition](#56)|
|[57.Visual Relationship Detection](#57)|[58.Fisheye Images](#58)|[59.Clustering](#59)|[60.Sketch](#60)|
|[61.Gaze](#61)|[62.All-in-One](#62)|



## 62.All-in-One

* [UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition](https://arxiv.org/abs/2311.15599)
:star:[code](https://github.com/AILab-CVC/UniRepLKNet)

* [GPT4Point: A Unified Framework for Point-Language Understanding and Generation](https://arxiv.org/abs/2312.02980)

* [AvatarGPT: All-in-One Framework for Motion Understanding Planning Generation and Beyond](https://arxiv.org/abs/2311.16468)



## 61.Gaze

* [Sharingan: A Transformer Architecture for Multi-Person Gaze Following](https://arxiv.org/abs/2310.00816) gaze following

* [From Feature to Gaze: A Generalizable Replacement of Linear Layer for Gaze Estimation](https://openaccess.thecvf.com/content/CVPR2024/papers/Bao_From_Feature_to_Gaze_A_Generalizable_Replacement_of_Linear_Layer_CVPR_2024_paper.pdf)



## 60.Sketch

* [What Sketch Explainability Really Means for Downstream Tasks](http://arxiv.org/abs/2403.09480v1)

* [SketchINR: A First Look into Sketches as Implicit Neural Representations](https://arxiv.org/abs/2403.09344)

* [Open Vocabulary Semantic Scene Sketch Understanding](https://arxiv.org/abs/2312.12463) sketch understanding

* [CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention](https://arxiv.org/abs/2402.17678)



## 59.Clustering

* [MoDE: CLIP Data Experts via Clustering](http://arxiv.org/abs/2404.16030) clustering

* [Fine-Grained Bipartite Concept Factorization for Clustering](https://openaccess.thecvf.com/content/CVPR2024/papers/Peng_Fine-Grained_Bipartite_Concept_Factorization_for_Clustering_CVPR_2024_paper.pdf)

* Multi-view clustering

  * [Investigating and Mitigating the Side Effects of Noisy Views for Self-Supervised Clustering Algorithms in Practical Multi-View Scenarios](https://arxiv.org/abs/2303.17245)

  * [Learn from View Correlation: An Anchor Enhancement Strategy for Multi-view Clustering](https://openaccess.thecvf.com/content/CVPR2024/papers/Liu_Learn_from_View_Correlation_An_Anchor_Enhancement_Strategy_for_Multi-view_CVPR_2024_paper.pdf)

  * [Differentiable Information Bottleneck for Deterministic Multi-view Clustering](https://arxiv.org/abs/2403.15681)



## 58.Fisheye Images

* [Deep Single Image Camera Calibration by Heatmap Regression to Recover Fisheye Images Under Manhattan World Assumption](https://openaccess.thecvf.com/content/CVPR2024/papers/Wakai_Deep_Single_Image_Camera_Calibration_by_Heatmap_Regression_to_Recover_CVPR_2024_paper.pdf) fisheye images



## 57.Visual Relationship Detection

* [Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection](http://arxiv.org/abs/2403.17709v1)
:star:[code](https://github.com/mlvlab/SpeaQ)



## 56.Visual emotion recognition

* [EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning](https://export.arxiv.org/abs/2404.16670)
:star:[code](https://github.com/aimmemotion/EmoVIT) visual emotion understanding

* Multimodal intent recognition

  * [Contextual Augmented Global Contrast for Multimodal Intent Recognition](https://openaccess.thecvf.com/content/CVPR2024/papers/Sun_Contextual_Augmented_Global_Contrast_for_Multimodal_Intent_Recognition_CVPR_2024_paper.pdf)



## 55.Referring Expression Comprehension

* [ScanFormer: Referring Expression Comprehension by Iteratively Scanning](https://openaccess.thecvf.com/content/CVPR2024/papers/Su_ScanFormer_Referring_Expression_Comprehension_by_Iteratively_Scanning_CVPR_2024_paper.pdf)

* [Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions](https://arxiv.org/abs/2311.17048)
:star:[code](https://github.com/Show-han/Zeroshot_REC) zero-shot referring expression comprehension

* [Revisiting Counterfactual Problems in Referring Expression Comprehension](https://openaccess.thecvf.com/content/CVPR2024/papers/Yu_Revisiting_Counterfactual_Problems_in_Referring_Expression_Comprehension_CVPR_2024_paper.pdf)



## 54.Vision Transformers

* [Dexterous Grasp Transformer](http://arxiv.org/abs/2404.18135)

* [Mean-Shift Feature Transformer](https://openaccess.thecvf.com/content/CVPR2024/papers/Kobayashi_Mean-Shift_Feature_Transformer_CVPR_2024_paper.pdf)

* [MLP Can Be A Good Transformer Learner](https://arxiv.org/abs/2404.05657)
:star:[code](https://github.com/sihaoevery/lambda_vit)

* [Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers](http://arxiv.org/abs/2303.09383)

* [Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers](http://arxiv.org/abs/2404.07292)

* [Dual-scale Transformer for Large-scale Single-Pixel Imaging](http://arxiv.org/abs/2404.05001)

* [DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets](http://arxiv.org/abs/2404.02900)

* [Towards Understanding and Improving Adversarial Robustness of Vision Transformers](https://openaccess.thecvf.com/content/CVPR2024/papers/Jain_Towards_Understanding_and_Improving_Adversarial_Robustness_of_Vision_Transformers_CVPR_2024_paper.pdf)

* [RMT: Retentive Networks Meet Vision Transformers](https://arxiv.org/abs/2309.11523)
:star:[code](https://github.com/qhfan/RMT)

* [You Only Need Less Attention at Each Stage in Vision Transformers](https://arxiv.org/abs/2406.00427)

* [MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers](https://arxiv.org/abs/2311.15475)
:house:[project](https://nihalsid.github.io/mesh-gpt/)

* [Instance-Aware Group Quantization for Vision Transformers](https://arxiv.org/abs/2404.00928)
:house:[project](https://cvlab.yonsei.ac.kr/projects/IGQ-ViT/)

* [Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers](http://arxiv.org/abs/2403.10030v1)
:star:[code](https://github.com/mlvlab/MCTF)

* [RepViT: Revisiting Mobile CNN From ViT Perspective](https://arxiv.org/abs/2307.09283)
:star:[code](https://github.com/THU-MIG/RepViT)

* [Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer](http://arxiv.org/abs/2403.14552v1)

* [Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers](https://arxiv.org/abs/2403.10574)
:thumbsup:[abstract](https://informatics.xmu.edu.cn/info/1053/36349.htm)

* [Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods](https://arxiv.org/abs/2212.06872)

* [On the Faithfulness of Vision Transformer Explanations](http://arxiv.org/abs/2404.01415v1)

* [Learning Correlation Structures for Vision Transformers](http://arxiv.org/abs/2404.03924v1)

* [Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach](https://arxiv.org/abs/2403.19067)
:star:[code](https://github.com/zstarN70/RLRR.git)

* [Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression](http://arxiv.org/abs/2403.15835v1)

* [Point Transformer V3: Simpler Faster Stronger](https://arxiv.org/abs/2312.10035)
:star:[code](https://github.com/Pointcept/PointTransformerV3)

* [A General and Efficient Training for Transformer via Token Expansion](http://arxiv.org/abs/2404.00672v1)
:star:[code](https://github.com/Osilly/TokenExpansion)

* [HEAL-SWIN: A Vision Transformer On The Sphere](https://arxiv.org/abs/2307.07313)
:star:[code](https://github.com/JanEGerken/HEAL-SWIN)

* [SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design](https://arxiv.org/abs/2401.16456)

* [TransNeXt: Robust Foveal Visual Perception for Vision Transformers](https://arxiv.org/abs/2311.17132)
:star:[code](https://github.com/DaiShiResearch/TransNeXt)

* [Making Vision Transformers Truly Shift-Equivariant](https://arxiv.org/abs/2305.16316)

* [Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities](https://arxiv.org/abs/2401.14405)
:star:[code](https://github.com/AILab-CVC/M2PT)

* [Random Entangled Tokens for Adversarially Robust Vision Transformer](https://openaccess.thecvf.com/content/CVPR2024/papers/Gong_Random_Entangled_Tokens_for_Adversarially_Robust_Vision_Transformer_CVPR_2024_paper.pdf)



## 53.Light-Field

* [Time-Efficient Light-Field Acquisition Using Coded Aperture and Events](https://arxiv.org/abs/2403.07244)
:house:[project](https://www.fujii.nuee.nagoya-u.ac.jp/Research/EventLF/)

* [Continuous Pose for Monocular Cameras in Neural Implicit Representation](https://arxiv.org/abs/2311.17119)
:star:[code](https://github.com/qimaqi/Continuous-Pose-in-NeRF)

* [PanoPose: Self-supervised Relative Pose Estimation for Panoramic Images](https://openaccess.thecvf.com/content/CVPR2024/papers/Tu_PanoPose_Self-supervised_Relative_Pose_Estimation_for_Panoramic_Images_CVPR_2024_paper.pdf)
:house:[project](http://www.3dv.ac.cn/en/publication/cvpr-b/)

* [Unbiased Estimator for Distorted Conics in Camera Calibration](http://arxiv.org/abs/2403.04583)

* Camera pose

  * [Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences](https://arxiv.org/abs/2404.06337)

  * [Map-Relative Pose Regression for Visual Re-Localization](http://arxiv.org/abs/2404.09884v1)
:star:[code](https://nianticlabs.github.io/marepo)

  * [The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement](http://arxiv.org/abs/2404.10438v1)
:star:[code](https://github.com/ga1i13o/mcloc_poseref)

* Snapshot compressive imaging

  * [DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model](https://arxiv.org/abs/2311.11417)



## 52.Biomedical

* [ManiFPT: Defining and Analyzing Fingerprints of Generative Models](https://arxiv.org/abs/2402.10401)

* [Flexible Biometrics Recognition: Bridging the Multimodality Gap through Attention Alignment and Prompt Tuning](https://openaccess.thecvf.com/content/CVPR2024/papers/Tiong_Flexible_Biometrics_Recognition_Bridging_the_Multimodality_Gap_through_Attention_Alignment_CVPR_2024_paper.pdf) biometrics

* Person identification

  * [Activity-Biometrics: Person Identification from Daily Activities](http://arxiv.org/abs/2403.17360v1)
:star:[code](https://github.com/sacrcv/Activity-Biometrics/)



## 51.Style Transfer

* [Z*: Zero-shot Style Transfer via Attention Reweighting](https://openaccess.thecvf.com/content/CVPR2024/papers/Deng_Z_Zero-shot_Style_Transfer_via_Attention_Reweighting_CVPR_2024_paper.pdf)

* [MoST: Motion Style Transformer Between Diverse Action Contents](http://arxiv.org/abs/2403.06225)
:star:[code](https://github.com/Boeun-Kim/MoST)

* [ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation](https://arxiv.org/abs/2312.02109)
:star:[code](https://github.com/cardinalblue/ArtAdapter)
:house:[project](https://cardinalblue.github.io/artadapter.github.io/)

* [Arbitrary Motion Style Transfer with Multi-condition Motion Latent Diffusion Model](https://openaccess.thecvf.com/content/CVPR2024/papers/Song_Arbitrary_Motion_Style_Transfer_with_Multi-condition_Motion_Latent_Diffusion_Model_CVPR_2024_paper.pdf)

* [Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer](https://arxiv.org/abs/2312.09008v2)
:house:[project](https://jiwoogit.github.io/StyleID_site)

* [Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network](https://arxiv.org/abs/2405.19775)
:thumbsup:[Balancing efficiency and quality: NUAA proposes Puff-Net, a new style transfer algorithm](https://mp.weixin.qq.com/s/B-RkdeQNvIXmAYJMUkHkYQ)

* Zero-shot text-driven motion transfer

  * [Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer](http://arxiv.org/abs/2311.17009)
:house:[project](https://diffusion-motion-transfer.github.io/)



## 50.OOD Detection

* [Test-Time Linear Out-of-Distribution Detection](https://openaccess.thecvf.com/content/CVPR2024/papers/Fan_Test-Time_Linear_Out-of-Distribution_Detection_CVPR_2024_paper.pdf)

* [Segment Every Out-of-Distribution Object](https://arxiv.org/abs/2311.16516)

* [Label-Efficient Group Robustness via Out-of-Distribution Concept Curation](https://openaccess.thecvf.com/content/CVPR2024/papers/Yang_Label-Efficient_Group_Robustness_via_Out-of-Distribution_Concept_Curation_CVPR_2024_paper.pdf)

* [Enhancing the Power of OOD Detection via Sample-Aware Model Selection](https://openaccess.thecvf.com/content/CVPR2024/papers/Xue_Enhancing_the_Power_of_OOD_Detection_via_Sample-Aware_Model_Selection_CVPR_2024_paper.pdf)OOD

* [Discriminability-Driven Channel Selection for Out-of-Distribution Detection](https://openaccess.thecvf.com/content/CVPR2024/papers/Yuan_Discriminability-Driven_Channel_Selection_for_Out-of-Distribution_Detection_CVPR_2024_paper.pdf)

* [CORES: Convolutional Response-based Score for Out-of-distribution Detection](https://openaccess.thecvf.com/content/CVPR2024/papers/Tang_CORES_Convolutional_Response-based_Score_for_Out-of-distribution_Detection_CVPR_2024_paper.pdf)

* [Learning Transferable Negative Prompts for Out-of-Distribution Detection](https://arxiv.org/abs/2404.03248)
:star:[code](https://github.com/mala-lab/negprompt)

* [A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?](http://arxiv.org/abs/2404.01775v1)
:star:[code](https://github.com/glhr/ood-labelnoise)

* [Improving Out-of-Distribution Generalization in Graphs via Hierarchical Semantic Environments](https://arxiv.org/abs/2403.01773)

* Anomaly detection

  * [Hyperbolic Anomaly Detection](https://openaccess.thecvf.com/content/CVPR2024/papers/Li_Hyperbolic_Anomaly_Detection_CVPR_2024_paper.pdf)

  * [Universal Novelty Detection through Adaptive Contrastive Learning](https://openaccess.thecvf.com/content/CVPR2024/papers/Mirzaei_Universal_Novelty_Detection_Through_Adaptive_Contrastive_Learning_CVPR_2024_paper.pdf)

  * [Looking 3D: Anomaly Detection with 2D-3D Alignment](https://openaccess.thecvf.com/content/CVPR2024/papers/Bhunia_Looking_3D_Anomaly_Detection_with_2D-3D_Alignment_CVPR_2024_paper.pdf)



## 49.Dataset

* Datasets

  * [Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset](https://openaccess.thecvf.com/content/CVPR2024/papers/Li_Multiagent_Multitraversal_Multimodal_Self-Driving_Open_MARS_Dataset_CVPR_2024_paper.pdf)

  * [4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations](https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_4D-DRESS_A_4D_Dataset_of_Real-World_Human_Clothing_With_Semantic_CVPR_2024_paper.pdf)

  * [DiLiGenRT: A Photometric Stereo Dataset with Quantified Roughness and Translucency](https://openaccess.thecvf.com/content/CVPR2024/papers/Guo_DiLiGenRT_A_Photometric_Stereo_Dataset_with_Quantified_Roughness_and_Translucency_CVPR_2024_paper.pdf)

  * [MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation](http://arxiv.org/abs/2404.02790)

  * [LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs](http://arxiv.org/abs/2312.04372)

  * [360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries](http://arxiv.org/abs/2311.17389)

  * [Towards Automatic Power Battery Detection: New Challenge Benchmark Dataset and Baseline](http://arxiv.org/abs/2312.02528)

  * [MSU-4S - The Michigan State University Four Seasons Dataset](https://openaccess.thecvf.com/content/CVPR2024/papers/Kent_MSU-4S_-_The_Michigan_State_University_Four_Seasons_Dataset_CVPR_2024_paper.pdf)

  * [DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields](https://openaccess.thecvf.com/content/CVPR2024/papers/Lu_DiVa-360_The_Dynamic_Visual_Dataset_for_Immersive_Neural_Fields_CVPR_2024_paper.pdf)

  * [Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline](http://arxiv.org/abs/2309.14611)

  * [LiDAR-Net: A Real-scanned 3D Point Cloud Dataset for Indoor Scenes](https://openaccess.thecvf.com/content/CVPR2024/papers/Guo_LiDAR-Net_A_Real-scanned_3D_Point_Cloud_Dataset_for_Indoor_Scenes_CVPR_2024_paper.pdf)

  * [Advancing Saliency Ranking with Human Fixations: Dataset, Models and Benchmarks](https://openaccess.thecvf.com/content/CVPR2024/papers/Deng_Advancing_Saliency_Ranking_with_Human_Fixations_Dataset_Models_and_Benchmarks_CVPR_2024_paper.pdf)

  * [MAGICK: A Large-scale Captioned Dataset from Matting Generated Images using Chroma Keying](https://openaccess.thecvf.com/content/CVPR2024/papers/Burgert_MAGICK_A_Large-scale_Captioned_Dataset_from_Matting_Generated_Images_using_CVPR_2024_paper.pdf)

  * [HardMo: A Large-Scale Hardcase Dataset for Motion Capture](https://openaccess.thecvf.com/content/CVPR2024/papers/Liao_HardMo_A_Large-Scale_Hardcase_Dataset_for_Motion_Capture_CVPR_2024_paper.pdf)

  * [The STVchrono Dataset: Towards Continuous Change Recognition in Time](https://openaccess.thecvf.com/content/CVPR2024/papers/Sun_The_STVchrono_Dataset_Towards_Continuous_Change_Recognition_in_Time_CVPR_2024_paper.pdf)

  * [Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding](https://openaccess.thecvf.com/content/CVPR2024/papers/Nguyen_Insect-Foundation_A_Foundation_Model_and_Large-scale_1M_Dataset_for_Visual_CVPR_2024_paper.pdf)

  * [LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising](https://arxiv.org/abs/2405.19718)

  * [On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm](https://arxiv.org/abs/2312.03526)

  * [Towards Modern Image Manipulation Localization: A Large-Scale Dataset and Novel Methods](https://openaccess.thecvf.com/content/CVPR2024/papers/Qu_Towards_Modern_Image_Manipulation_Localization_A_Large-Scale_Dataset_and_Novel_CVPR_2024_paper.pdf)

  * [Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation](https://arxiv.org/abs/2306.11290)

  * [FineSports: A Multi-person Hierarchical Sports Video Dataset for Fine-grained Action Understanding](https://openaccess.thecvf.com/content/CVPR2024/papers/Xu_FineSports_A_Multi-person_Hierarchical_Sports_Video_Dataset_for_Fine-grained_Action_CVPR_2024_paper.pdf) fine-grained action understanding

  * [MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos](https://arxiv.org/abs/2306.04216)
:house:[project](https://mmsum-dataset.github.io/)

  * [Traffic Scene Parsing through the TSP6K Dataset](http://arxiv.org/abs/2303.02835)

  * [Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset](https://arxiv.org/abs/2311.17396)

  * [RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos](https://arxiv.org/abs/2401.12592)
:house:[project](https://wildrgbd.github.io/)
:sunflower:[dataset](https://github.com/wildrgbd/wildrgbd) RGB-D object dataset

  * [eTraM: Event-based Traffic Monitoring Dataset](https://arxiv.org/abs/2403.19976)
:star:[code](https://github.com/eventbasedvision/eTraM)
:house:[project](https://eventbasedvision.github.io/eTraM/) traffic monitoring dataset

  * [Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network](http://arxiv.org/abs/2405.00244)
:sunflower:[dataset](https://github.com/yungsyu99/Real-HDRV)

  * [JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups](http://arxiv.org/abs/2404.04458v1)
:house:[project](https://jrdb.erc.monash.edu/dataset/social)

  * [TULIP: Multi-camera 3D Precision Assessment of Parkinson's Disease](https://openaccess.thecvf.com/content/CVPR2024/papers/Kim_TULIP_Multi-camera_3D_Precision_Assessment_of_Parkinsons_Disease_CVPR_2024_paper.pdf)

  * [JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments](http://arxiv.org/abs/2404.01686v1)

  * [OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion](http://arxiv.org/abs/2403.19417v1)
:house:[project](https://oakink.net/v2)

  * [SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos](http://arxiv.org/abs/2404.04565v1)

  * [RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method](http://arxiv.org/abs/2403.19501v1)
:house:[project](http://www.lidarhumanmotion.net/reli11d/)
:thumbsup:[abstract](https://informatics.xmu.edu.cn/info/1053/36349.htm)

  * [MatSynth: A Modern PBR Materials Dataset](https://arxiv.org/abs/2401.06056)
:house:[project](https://gvecchio.com/matsynth/)

  * [RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception](http://arxiv.org/abs/2403.10145v1)
:star:[code](https://github.com/AIR-THU/DAIR-RCooper)

  * [Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection](http://arxiv.org/abs/2403.12580v1)

  * [EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World](http://arxiv.org/abs/2403.16182v1)
:star:[code](https://github.com/OpenGVLab/EgoExoLearn)

  * [MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception](https://arxiv.org/abs/2403.11496)
:sunflower:[dataset](https://mcdviral.github.io/)

  * [HouseCat6D -- A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios](https://arxiv.org/abs/2212.10428)

  * [HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative](https://arxiv.org/abs/2403.02640)
:sunflower:[dataset](https://holovic.net/)

  * [DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision](https://arxiv.org/abs/2312.16256)
:sunflower:[dataset](https://github.com/DL3DV-10K/Dataset)

  * [EFHQ: Multi-purpose ExtremePose-Face-HQ dataset](https://arxiv.org/abs/2312.17205)
:star:[code](https://www.vinai.io/)
:house:[project](https://bomcon123456.github.io/efhq/) dataset

  * [LUWA Dataset: Learning Lithic Use-Wear Analysis on Microscopic Images](https://arxiv.org/abs/2403.13171)

  * [MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors](http://arxiv.org/abs/2403.17610v1)
:star:[code](https://haolyuan.github.io/MMVP-Dataset/)

  * [FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions](https://arxiv.org/abs/2309.05073)
:house:[project](https://wangjiongw.github.io/freeman/)

  * [TUMTraf V2X Cooperative Perception Dataset](https://arxiv.org/pdf/2403.01316.pdf)
:house:[project](https://tum-traffic-dataset.github.io/tumtraf-v2x/)

  * [MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures](https://arxiv.org/abs/2312.02963)
:sunflower:[dataset](https://x-zhangyang.github.io/MVHumanNet/)

* Benchmarks

  * [When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach](https://openaccess.thecvf.com/content/CVPR2024/papers/Ma_When_Visual_Grounding_Meets_Gigapixel-level_Large-scale_Scenes_Benchmark_and_Approach_CVPR_2024_paper.pdf)

  * [THRONE: A Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models](http://arxiv.org/abs/2405.05256)

  * [M3-UDA: A New Benchmark for Unsupervised Domain Adaptive Fetal Cardiac Structure Detection](https://openaccess.thecvf.com/content/CVPR2024/papers/Pu_M3-UDA_A_New_Benchmark_for_Unsupervised_Domain_Adaptive_Fetal_Cardiac_CVPR_2024_paper.pdf)
:star:[code](https://github.com/LiwenWang919/M3-UDA)

  * [DriveTrack: A Benchmark for Long-Range Point Tracking in Real-World Videos](https://arxiv.org/abs/2312.09523)

  * [SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge](https://arxiv.org/abs/2405.09713)

  * [MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding](https://openaccess.thecvf.com/content/CVPR2024/papers/Cao_MAPLM_A_Real-World_Large-Scale_Vision-Language_Benchmark_for_Map_and_Traffic_CVPR_2024_paper.pdf)

  * [RoDLA: Benchmarking the Robustness of Document Layout Analysis Models](http://arxiv.org/abs/2403.14442v1)
:star:[code](https://yufanchen96.github.io/projects/RoDLA)

  * [GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation](https://openaccess.thecvf.com/content/CVPR2024/papers/Khanna_GOAT-Bench_A_Benchmark_for_Multi-Modal_Lifelong_Navigation_CVPR_2024_paper.pdf)

  * [MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI](http://arxiv.org/abs/2311.16502)

  * [Advancing Saliency Ranking with Human Fixations: Dataset Models and Benchmarks](https://openaccess.thecvf.com/content/CVPR2024/papers/Deng_Advancing_Saliency_Ranking_with_Human_Fixations_Dataset_Models_and_Benchmarks_CVPR_2024_paper.pdf)

  * [ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks](https://openaccess.thecvf.com/content/CVPR2024/papers/Rosasco_ConCon-Chi_Concept-Context_Chimera_Benchmark_for_Personalized_Vision-Language_Tasks_CVPR_2024_paper.pdf)

  * [Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark](http://arxiv.org/abs/2403.18821v1)
:star:[code](https://facebookresearch.github.io/real-acoustic-fields/)

  * [UVEB: A Large-scale Benchmark and Baseline Towards Real-World Underwater Video Enhancement](http://arxiv.org/abs/2404.14542)

  * [PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling](https://openaccess.thecvf.com/content/CVPR2024/papers/Zheng_PKU-DyMVHumans_A_Multi-View_Video_Benchmark_for_High-Fidelity_Dynamic_Human_Modeling_CVPR_2024_paper.pdf)
:house:[project](https://pku-dymvhumans.github.io/)

  * [MVBench: A Comprehensive Multi-modal Video Understanding Benchmark](https://arxiv.org/abs/2311.17005)
:star:[code](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2)

  * [Uncovering What Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly](http://arxiv.org/abs/2405.00181)

  * [VBench: Comprehensive Benchmark Suite for Video Generative Models](https://arxiv.org/abs/2311.17982)
:star:[code](https://arxiv.org/abs/2311.17982)
:house:[project](https://vchitect.github.io/VBench-project/)

  * [MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark](http://arxiv.org/abs/2403.20225v1)

  * [CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs](https://arxiv.org/abs/2311.16703)
:house:[project](https://enigma-li.github.io/CADTalk/)

  * [How to Train Neural Field Representations: A Comprehensive Study and Benchmark](https://arxiv.org/abs/2312.10531)

  * [OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM](https://arxiv.org/abs/2402.09181)



## 48.Semi/self-supervised learning

* Weakly supervised learning

  * Partial-label learning

    * [CroSel: Cross Selection of Confident Pseudo Labels for Partial-Label Learning](https://arxiv.org/abs/2303.10365) partial-label learning, a weakly supervised learning problem

* Semi-supervised learning

  * [Targeted Representation Alignment for Open-World Semi-Supervised Learning](https://openaccess.thecvf.com/content/CVPR2024/papers/Xiao_Targeted_Representation_Alignment_for_Open-World_Semi-Supervised_Learning_CVPR_2024_paper.pdf)

  * [SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder](https://openaccess.thecvf.com/content/CVPR2024/papers/Zheng_SeNM-VAE_Semi-Supervised_Noise_Modeling_with_Hierarchical_Variational_Autoencoder_CVPR_2024_paper.pdf)

  * [CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning](http://arxiv.org/abs/2403.10391v1)

  * [BEM: Balanced and Entropy-based Mix for Long-Tailed Semi-Supervised Learning](http://arxiv.org/abs/2404.01179v1)

  * Positive-unlabeled learning

    * [Positive-Unlabeled Learning by Latent Group-Aware Meta Disambiguation](https://openaccess.thecvf.com/content/CVPR2024/papers/Long_Positive-Unlabeled_Learning_by_Latent_Group-Aware_Meta_Disambiguation_CVPR_2024_paper.pdf) (positive-unlabeled learning, an important branch of semi-supervised learning)

* Self-supervised learning

  * [Self-supervised Representation Learning from Arbitrary Scenarios](https://openaccess.thecvf.com/content/CVPR2024/papers/Li_Self-Supervised_Representation_Learning_from_Arbitrary_Scenarios_CVPR_2024_paper.pdf)

  * [Self-supervised Debiasing Using Low Rank Regularization](http://arxiv.org/abs/2210.05248)

  * [Self-Supervised Dual Contouring](http://arxiv.org/abs/2405.18131)

  * [Neural Modes: Self-supervised Learning of Nonlinear Modal Subspaces](https://arxiv.org/abs/2404.17620)

  * [SD2Event: Self-supervised Learning of Dynamic Detectors and Contextual Descriptors for Event Cameras](https://openaccess.thecvf.com/content/CVPR2024/papers/Gao_SD2EventSelf-supervised_Learning_of_Dynamic_Detectors_and_Contextual_Descriptors_for_Event_CVPR_2024_paper.pdf)

  * [An Asymmetric Augmented Self-Supervised Learning Method for Unsupervised Fine-Grained Image Hashing](https://openaccess.thecvf.com/content/CVPR2024/papers/Hu_An_Asymmetric_Augmented_Self-Supervised_Learning_Method_for_Unsupervised_Fine-Grained_Image_CVPR_2024_paper.pdf)

  * [CNC-Net: Self-Supervised Learning for CNC Machining Operations](https://arxiv.org/abs/2312.09925)

* Unsupervised learning

  * [Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos](https://openaccess.thecvf.com/content/CVPR2024/papers/Sommer_Unsupervised_Learning_of_Category-Level_3D_Pose_from_Object-Centric_Videos_CVPR_2024_paper.pdf)



## 47.Dense Predictions

* [Efficient Multitask Dense Predictor via Binarization](https://arxiv.org/abs/2405.14136) (dense prediction)

* [Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models](https://openaccess.thecvf.com/content/CVPR2024/papers/Huang_Going_Beyond_Multi-Task_Dense_Prediction_with_Synergy_Embedding_Models_CVPR_2024_paper.pdf)

* [Exploiting Diffusion Prior for Generalizable Dense Prediction](https://arxiv.org/abs/2311.18832)
:house:[project](https://shinying.github.io/dmp)

* [ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions](http://arxiv.org/abs/2403.07392v1)
:star:[code](https://github.com/Traffic-X/ViT-CoMer)
:thumbsup:[Baidu proposes ViT-CoMer, a new vision backbone that sets a new SOTA on dense prediction tasks](https://mp.weixin.qq.com/s/Q2xI_rU5_7Mv6jiYeu6NkA)

* [Multi-Task Dense Prediction via Mixture of Low-Rank Experts](http://arxiv.org/abs/2403.17749v1)
:star:[code](https://github.com/YuqiYang213/MLoRE)



## 46.Industrial Anomaly Detection

* [Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection](https://arxiv.org/abs/2310.12790)
:star:[code](https://github.com/mala-lab/AHL)

* Anomaly detection

  * [Supervised Anomaly Detection for Complex Industrial Images](http://arxiv.org/abs/2405.04953)

  * [Prompt-Enhanced Multiple Instance Learning for Weakly Supervised Video Anomaly Detection](https://openaccess.thecvf.com/content/CVPR2024/papers/Chen_Prompt-Enhanced_Multiple_Instance_Learning_for_Weakly_Supervised_Video_Anomaly_Detection_CVPR_2024_paper.pdf) (weakly supervised anomaly detection)

  * [Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping](https://arxiv.org/abs/2312.04521)

  * [Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation](https://arxiv.org/abs/2403.06247)

  * [Long-Tailed Anomaly Detection with Learnable Class Names](http://arxiv.org/abs/2403.20236v1)
:house:[project](https://zenodo.org/records/10854201)

  * [RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection](http://arxiv.org/abs/2403.05897v1)
:star:[code](https://github.com/cnulab/RealNet)

  * [Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts](http://arxiv.org/abs/2403.06495v1)
:star:[code](https://github.com/mala-lab/InCTRL)

  * [PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection](http://arxiv.org/abs/2404.05231v1)
:star:[code](https://github.com/FuNz-0/PromptAD)

* Film removal

  * [Learning to Remove Wrinkled Transparent Film with Polarized Prior](http://arxiv.org/abs/2403.04368v1)
:star:[code](https://github.com/jqtangust/FilmRemoval)

* Benchmarks/datasets

  * [Real-IAD: A Real-World Multi-view Dataset for Benchmarking Versatile Industrial Anomaly Detection](https://arxiv.org/abs/2403.12580)
:star:[code](https://github.com/TencentYoutuResearch/AnomalyDetection_Real-IAD)

  * [Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network](https://arxiv.org/abs/2311.14897)
:star:[code](https://github.com/Chopper-233/Anomaly-ShapeNet)



## 45.Neural Architecture Search

* [Towards Accurate and Robust Architectures via Neural Architecture Search](https://arxiv.org/abs/2405.05502)

* [Boosting Order-Preserving and Transferability for Neural Architecture Search: a Joint Architecture Refined Search and Fine-tuning Approach](http://arxiv.org/abs/2403.11380v1)

* [Building Optimal Neural Architectures using Interpretable Knowledge](http://arxiv.org/abs/2403.13293v1)
:star:[code](https://github.com/Ascend-Research/AutoBuild)

* [AZ-NAS: Assembling Zero-Cost Proxies for Network Architecture Search](http://arxiv.org/abs/2403.19232v1)

* [SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model](https://arxiv.org/abs/2406.00195)

* [Insights from the Use of Previously Unseen Neural Architecture Search Datasets](https://arxiv.org/abs/2404.02189)

* [FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer](https://arxiv.org/abs/2403.12821)
:star:[code](http://github.com/y0ngjaenius/CVPR2024_FLOWERFormer)



## 44.Image Fusion

* [Equivariant Multi-Modality Image Fusion](https://arxiv.org/abs/2305.11443) (image fusion)

* [Task-Customized Mixture of Adapters for General Image Fusion](http://arxiv.org/abs/2403.12494v1)
:star:[code](https://github.com/YangSun22/TC-MoA)

* [Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion](http://arxiv.org/abs/2403.16387v1)
:star:[code](https://github.com/XunpengYi/Text-IF)

* [Revisiting Spatial-Frequency Information Integration from a Hierarchical Perspective for Panchromatic and Multi-Spectral Image Fusion](https://openaccess.thecvf.com/content/CVPR2024/papers/Tan_Revisiting_Spatial-Frequency_Information_Integration_from_a_Hierarchical_Perspective_for_Panchromatic_CVPR_2024_paper.pdf)

* [Neural Spline Fields for Burst Image Fusion and Layer Separation](https://arxiv.org/abs/2312.14235)
:house:[project](https://light.princeton.edu/publication/nsf)

* Infrared and visible image fusion

  * [Probing Synergistic High-Order Interaction in Infrared and Visible Image Fusion](https://openaccess.thecvf.com/content/CVPR2024/papers/Zheng_Probing_Synergistic_High-Order_Interaction_in_Infrared_and_Visible_Image_Fusion_CVPR_2024_paper.pdf)



## 43.Image Matching

* [XFeat: Accelerated Features for Lightweight Image Matching](https://arxiv.org/abs/2404.19174)
:house:[project](http://www.verlab.dcc.ufmg.br/descriptors/xfeat_cvpr24) (image matching)

* Image-text

  * [Composing Object Relations and Attributes for Image-Text Matching](https://openaccess.thecvf.com/content/CVPR2024/papers/Pham_Composing_Object_Relations_and_Attributes_for_Image-Text_Matching_CVPR_2024_paper.pdf)



## 42.Image Retrieval

* [Language-only Training of Zero-shot Composed Image Retrieval](https://openaccess.thecvf.com/content/CVPR2024/papers/Gu_Language-only_Training_of_Zero-shot_Composed_Image_Retrieval_CVPR_2024_paper.pdf)
:star:[code](https://github.com/navervision/lincir)

* [Evaluating Transferability in Retrieval Tasks: An Approach Using MMD and Kernel Methods](https://openaccess.thecvf.com/content/CVPR2024/papers/Dai_Evaluating_Transferability_in_Retrieval_Tasks_An_Approach_Using_MMD_and_CVPR_2024_paper.pdf)

* [Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval](http://arxiv.org/abs/2403.16005v1)

* [On Train-Test Class Overlap and Detection for Image Retrieval](http://arxiv.org/abs/2404.01524v1)
:star:[code](https://github.com/dealicious-inc/RGLDv2-clean)

* [D3still: Decoupled Differential Distillation for Asymmetric Image Retrieval](https://openaccess.thecvf.com/content/CVPR2024/papers/Xie_D3still_Decoupled_Differential_Distillation_for_Asymmetric_Image_Retrieval_CVPR_2024_paper.pdf)

* [Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection](http://arxiv.org/abs/2404.09263)

* Cross-domain retrieval

  * [ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval](https://arxiv.org/abs/2312.12478)
:star:[code](https://github.com/fangkaipeng/ProS)

* Video retrieval

  * [Composed Video Retrieval via Enriched Context and Discriminative Embeddings](http://arxiv.org/abs/2403.16997v1)
:star:[code](https://github.com/OmkarThawakar/composed-video-retrieval)

* Cross-modal retrieval

  * [Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval](http://arxiv.org/abs/2403.05105v1)
:star:[code](https://github.com/hhc1997/L2RM)

  * [Fine-grained Prototypical Voting with Heterogeneous Mixup for Semi-supervised 2D-3D Cross-modal Retrieval](https://openaccess.thecvf.com/content/CVPR2024/papers/Zhang_Fine-grained_Prototypical_Voting_with_Heterogeneous_Mixup_for_Semi-supervised_2D-3D_Cross-modal_CVPR_2024_paper.pdf)

* Text-video retrieval

  * [Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval](http://arxiv.org/abs/2403.17998v1)
:star:[code](https://github.com/Jiamian-Wang/T-MASS-text-video-retrieval)

  * [Holistic Features are almost Sufficient for Text-to-Video Retrieval](https://www.researchgate.net/publication/379270657_Holistic_Features_are_almost_Sufficient_for_Text-to-Video_Retrieval)

* Image-text retrieval

  * [How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?](https://openaccess.thecvf.com/content/CVPR2024/papers/Chen_How_to_Make_Cross_Encoder_a_Good_Teacher_for_Efficient_CVPR_2024_paper.pdf)

* Video-text retrieval

  * [MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval](https://openaccess.thecvf.com/content/CVPR2024/papers/Jin_MV-Adapter_Multimodal_Video_Transfer_Learning_for_Video_Text_Retrieval_CVPR_2024_paper.pdf)

* Composed image retrieval

  * [Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval](http://arxiv.org/abs/2404.15516)

* Fine-grained image retrieval

  * [You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval](https://arxiv.org/abs/2403.07222)
:house:[project](https://subhadeepkoley.github.io/Sketch2Word)

  * [Characteristics Matching Based Hash Codes Generation for Efficient Fine-grained Image Retrieval](https://openaccess.thecvf.com/content/CVPR2024/papers/Chen_Characteristics_Matching_Based_Hash_Codes_Generation_for_Efficient_Fine-grained_Image_CVPR_2024_paper.pdf)

* Sketch-based retrieval

  * [How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval?](http://arxiv.org/abs/2403.07203v1)
:star:[code](https://subhadeepkoley.github.io/AbstractAway)

  * [Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers](http://arxiv.org/abs/2403.07214v1)
:house:[project](https://subhadeepkoley.github.io/DiffusionZSSBIR)  



## 41.Graph Generative Network(GNN/GCN)

* GNN

  * [Domain Separation Graph Neural Networks for Saliency Object Ranking](https://openaccess.thecvf.com/content/CVPR2024/papers/Wu_Domain_Separation_Graph_Neural_Networks_for_Saliency_Object_Ranking_CVPR_2024_paper.pdf)

  * [GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs](https://arxiv.org/abs/2405.06849)

  * [FC-GNN: Recovering Reliable and Accurate Correspondences from Interferences](https://openaccess.thecvf.com/content/CVPR2024/papers/Xu_FC-GNN_Recovering_Reliable_and_Accurate_Correspondences_from_Interferences_CVPR_2024_paper.pdf)

  * [DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching](https://arxiv.org/abs/2306.12547)

  * [GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds](https://arxiv.org/abs/2312.00068) (graph generative network)

* GCN

  * [Learning for Transductive Threshold Calibration in Open-World Recognition](https://arxiv.org/abs/2305.12039)



## 40.Scene Graph Generation

* [Leveraging Predicate and Triplet Learning for Scene Graph Generation](https://arxiv.org/abs/2406.02038)

* [OED: Towards One-stage End-to-End Dynamic Scene Graph Generation](https://arxiv.org/abs/2405.16925)

* [CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning](https://openaccess.thecvf.com/content/CVPR2024/papers/Chen_CLIP-Driven_Open-Vocabulary_3D_Scene_Graph_Generation_via_Cross-Modality_Contrastive_Learning_CVPR_2024_paper.pdf)

* [Multi-Level Neural Scene Graphs for Dynamic Urban Environments](http://arxiv.org/abs/2404.00168v1)
:star:[code](https://tobiasfshr.github.io/pub/ml-nsg/)

* [HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation](http://arxiv.org/abs/2403.12033v1)
:house:[project](https://zhangce01.github.io/HiKER-SGG)
:star:[code](https://github.com/zhangce01/HiKER-SGG)

* [DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation](http://arxiv.org/abs/2403.14886v1)
:star:[code](https://github.com/zeeshanhayder/DSGG)
:house:[project](https://zeeshanhayder.github.io/DSGG/)

* [From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models](http://arxiv.org/abs/2404.00906v1)

* [EGTR: Extracting Graph from Transformer for Scene Graph Generation](http://arxiv.org/abs/2404.02072v1)
:star:[code](https://github.com/naver-ai/egtr)

* [LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation](http://arxiv.org/abs/2310.10404)



## 39.Motion Generation

* [Programmable Motion Generation for Open-Set Motion Control Tasks](https://arxiv.org/abs/2405.19283)

* [Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance](http://arxiv.org/abs/2403.18036v1)

* [AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents](https://arxiv.org/abs/2403.12835)

* [Towards Variable and Coordinated Holistic Co-Speech Motion Generation](http://arxiv.org/abs/2404.00368v1)
:star:[code](https://feifeifeiliu.github.io/probtalk/)

* [Generating Human Motion in 3D Scenes from Text Descriptions](http://arxiv.org/abs/2405.07784)

* [NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis](https://arxiv.org/abs/2307.07511)
:house:[project](https://nileshkulkarni.github.io/nifty) (human motion synthesis)

* [OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers](https://arxiv.org/abs/2312.08985)
:house:[project](https://tr3e.github.io/omg-page)

* [WANDR: Intention-guided Human Motion Generation](http://arxiv.org/abs/2404.15383)
:tv:[video](https://www.youtube.com/watch?v=9szizM-XUCg)

* [MAS: Multi-view Ancestral Sampling for 3D Motion Generation Using 2D Diffusion](http://arxiv.org/abs/2310.14729)
:house:[project](https://guytevet.github.io/mas-page/)

* [Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action](http://arxiv.org/abs/2312.17172)

* [Multimodal Sense-Informed Forecasting of 3D Human Motions](https://arxiv.org/abs/2405.02911)

* Motion retrieval

  * [Tri-Modal Motion Retrieval by Learning a Joint Embedding Space](http://arxiv.org/abs/2403.00691)

* Animal motion

  * [OmniMotionGPT: Animal Motion Generation with Limited Data](https://arxiv.org/abs/2311.18303)
:house:[project](https://zshyang.github.io/omgpt-website/)

* Human motion prediction

  * [MoML: Online Meta Adaptation for 3D Human Motion Prediction](https://openaccess.thecvf.com/content/CVPR2024/papers/Sun_MoML_Online_Meta_Adaptation_for_3D_Human_Motion_Prediction_CVPR_2024_paper.pdf)

  * [MoST: Multi-Modality Scene Tokenization for Motion Prediction](http://arxiv.org/abs/2404.19531)

  * [Rethinking Human Motion Prediction with Symplectic Integral](https://openaccess.thecvf.com/content/CVPR2024/papers/Chen_Rethinking_Human_Motion_Prediction_with_Symplectic_Integral_CVPR_2024_paper.pdf)

  * [Human Motion Prediction Under Unexpected Perturbation](https://openaccess.thecvf.com/content/CVPR2024/papers/Yue_Human_Motion_Prediction_Under_Unexpected_Perturbation_CVPR_2024_paper.pdf)

  * [Continual Learning for Motion Prediction Model via Meta-Representation Learning and Optimal Memory Buffer Retention Strategy](https://openaccess.thecvf.com/content/CVPR2024/papers/Kang_Continual_Learning_for_Motion_Prediction_Model_via_Meta-Representation_Learning_and_CVPR_2024_paper.pdf)

* Human motion estimation

  * [MultiPhys: Multi-Person Physics-aware 3D Motion Estimation](https://arxiv.org/abs/2404.11987)
:house:[project](http://www.iri.upc.edu/people/nugrinovic/multiphys/)

  * [A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals](https://arxiv.org/abs/2404.04890) (human motion estimation)

* Human motion reconstruction

  * [RoHM: Robust Human Motion Reconstruction via Diffusion](http://arxiv.org/abs/2401.08570)



## 38.Visual Question Answering

* [GRAM: Global Reasoning for Multi-Page VQA](https://arxiv.org/abs/2401.03411)

* [SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities](http://arxiv.org/abs/2401.12168)
:house:[project](https://spatial-vlm.github.io/)

* [Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering](http://arxiv.org/abs/2404.10193v1)

* [How to Configure Good In-Context Sequence for Visual Question Answering](https://arxiv.org/abs/2312.01571)
:star:[code](https://github.com/GaryJiajia/OFv2_ICL_VQA)

* [Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models](https://arxiv.org/abs/2312.06685)

* [Question Aware Vision Transformer for Multimodal Reasoning](http://arxiv.org/abs/2402.05472)

* [OpenEQA: Embodied Question Answering in the Era of Foundation Models](https://openaccess.thecvf.com/content/CVPR2024/papers/Majumdar_OpenEQA_Embodied_Question_Answering_in_the_Era_of_Foundation_Models_CVPR_2024_paper.pdf)

* Video-QA

  * [Grounded Question-Answering in Long Egocentric Videos](https://arxiv.org/abs/2312.06505)
:star:[code](https://github.com/Becomebright/GroundVQA)

  * [Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels](http://arxiv.org/abs/2403.14430v1) 

  * [Language-aware Visual Semantic Distillation for Video Question Answering](https://openaccess.thecvf.com/content/CVPR2024/papers/Zou_Language-aware_Visual_Semantic_Distillation_for_Video_Question_Answering_CVPR_2024_paper.pdf)

  * [MoReVQA: Exploring Modular Reasoning Models for Video Question Answering](https://arxiv.org/abs/2404.06511)

  * [Can I Trust Your Answer? Visually Grounded Video Question Answering](https://arxiv.org/abs/2309.01327)
:star:[code](https://github.com/doc-doc/NExT-GQA)

  * [Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering](https://openaccess.thecvf.com/content/CVPR2024/papers/Liao_Align_and_Aggregate_Compositional_Reasoning_with_Video_Alignment_and_Answer_CVPR_2024_paper.pdf)

* Chart/diagram question answering

  * [CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering](https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_CoG-DQA_Chain-of-Guiding_Learning_with_Large_Language_Models_for_Diagram_Question_CVPR_2024_paper.pdf)

  * [Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA](http://arxiv.org/abs/2403.16385v1)

* Visual text question answering

  * [VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning](https://arxiv.org/abs/2303.02635)



## 37.OCR

* Scene text recognition

  * [OTE: Exploring Accurate Scene Text Recognition Using One Token](https://openaccess.thecvf.com/content/CVPR2024/papers/Xu_OTE_Exploring_Accurate_Scene_Text_Recognition_Using_One_Token_CVPR_2024_paper.pdf)

  * [An Empirical Study of Scaling Law for Scene Text Recognition](https://arxiv.org/abs/2401.00028)
:star:[code](https://github.com/large-ocr-model/large-ocr-model.github.io) (scene text recognition)

  * [Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer](https://arxiv.org/abs/2311.13120)
:star:[code](https://github.com/bytedance/E2STR)

  * [Kernel Adaptive Convolution for Scene Text Detection via Distance Map Prediction](https://openaccess.thecvf.com/content/CVPR2024/papers/Zheng_Kernel_Adaptive_Convolution_for_Scene_Text_Detection_via_Distance_Map_CVPR_2024_paper.pdf)

  * [Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing](https://arxiv.org/abs/2405.04377) (scene text recognition, removal, and editing)

  * [ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting](https://arxiv.org/abs/2403.00303)
:star:[code](https://github.com/PriNing/ODM)

* Scene text image synthesis

  * [Layout-Agnostic Scene Text Image Synthesis with Diffusion Models](https://openaccess.thecvf.com/content/CVPR2024/papers/Zhangli_Layout-Agnostic_Scene_Text_Image_Synthesis_with_Diffusion_Models_CVPR_2024_paper.pdf)

* Scene text understanding

  * [LayoutFormer: Hierarchical Text Detection Towards Scene Text Understanding](https://openaccess.thecvf.com/content/CVPR2024/papers/Liang_LayoutFormer_Hierarchical_Text_Detection_Towards_Scene_Text_Understanding_CVPR_2024_paper.pdf)

* Chemical structure recognition

  * [Atom-Level Optical Chemical Structure Recognition with Limited Supervision](https://arxiv.org/abs/2404.01743)
:star:[code](https://github.com/molden/atomlenz)

* Document chromaticity detection

  * [CMA: A Chromaticity Map Adapter for Robust Detection of Screen-Recapture Document Images](https://openaccess.thecvf.com/content/CVPR2024/papers/Chen_CMA_A_Chromaticity_Map_Adapter_for_Robust_Detection_of_Screen-Recapture_CVPR_2024_paper.pdf)
:star:[code](https://github.com/chenlewis/Chromaticity-Map-Adapter-for-DPAD)

* Text detection

  * [OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition](http://arxiv.org/abs/2403.19128v1)
:star:[code](https://github.com/AlibabaResearch/AdvancedLiterateMachinery)

  * [Bridging the Gap Between End-to-End and Two-Step Text Spotting](http://arxiv.org/abs/2404.04624v1)
:star:[code](https://github.com/mxin262/Bridging-Text-Spotting)

  * [Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis](https://arxiv.org/abs/2405.07481)

* Document understanding

  * [LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding](http://arxiv.org/abs/2404.05225v1)
:star:[code](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/LayoutLLM)

  * [HRVDA: High-Resolution Visual Document Assistant](http://arxiv.org/abs/2404.06918v1)

* Font generation

  * [Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models](https://openaccess.thecvf.com/content/CVPR2024/papers/Fu_Generate_Like_Experts_Multi-Stage_Font_Generation_by_Incorporating_Font_Transfer_CVPR_2024_paper.pdf)



## 36.4D Reconstruction

* [Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle](https://arxiv.org/abs/2312.03431)
:house:[project](https://nju-3dv.github.io/projects/Gaussian-Flow)

* [Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking](https://arxiv.org/abs/2401.06614)
:house:[project](https://vveicao.github.io/projects/Motion2VecSets/)

* [4D Gaussian Splatting for Real-Time Dynamic Scene Rendering](https://arxiv.org/abs/2310.08528)
:star:[code](https://github.com/hustvl/4DGaussians)
:house:[project](https://guanjunwu.github.io/4dgs/)

* Text- and image-guided 4D scene generation

  * [A Unified Approach for Text- and Image-guided 4D Scene Generation](https://arxiv.org/abs/2311.16854)
:house:[project](https://research.nvidia.com/labs/nxp/dream-in-4d/)

* 4D view synthesis

  * [4K4D: Real-Time 4D View Synthesis at 4K Resolution](https://arxiv.org/abs/2310.11448)
:star:[code](https://github.com/zju3dv/4K4D)
:house:[project](https://zju3dv.github.io/4k4d/)

* Language-to-4D modeling

  * [L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream](https://openaccess.thecvf.com/content/CVPR2024/papers/Sun_L4D-Track_Language-to-4D_Modeling_Towards_6-DoF_Tracking_and_Shape_Reconstruction_in_CVPR_2024_paper.pdf)
:star:[code](https://github.com/S-JingTao/L4D_Track)



## 35.Scene Understanding

* [Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding](https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_Omni-Q_Omni-Directional_Scene_Understanding_for_Unsupervised_Visual_Grounding_CVPR_2024_paper.pdf)

* [PanoContext-Former: Panoramic Total Scene Understanding with a Transformer](https://openaccess.thecvf.com/content/CVPR2024/papers/Dong_PanoContext-Former_Panoramic_Total_Scene_Understanding_with_a_Transformer_CVPR_2024_paper.pdf)

* [DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving](https://arxiv.org/abs/2405.04390)

* [OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies](https://arxiv.org/abs/2405.05259)
:star:[code](https://github.com/ldkong1205/OpenESS)

* [A Category Agnostic Model for Visual Rearrangment](https://openaccess.thecvf.com/content/CVPR2024/papers/Liu_A_Category_Agnostic_Model_for_Visual_Rearrangment_CVPR_2024_paper.pdf)
:thumbsup:[VIPL](https://vipl.ict.ac.cn/news/research/202403/t20240315_207758.html)

* [360+x: A Panoptic Multi-modal Scene Understanding Dataset](http://arxiv.org/abs/2404.00989v1)
:star:[code](https://x360dataset.github.io)

* Open-vocabulary scene understanding

  * [Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding](https://arxiv.org/abs/2311.18482)

* 3D scene understanding

  * [HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting](https://arxiv.org/abs/2403.12722)
:house:[project](https://xdimlab.github.io/hugs_website)

  * [SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field](https://arxiv.org/abs/2403.14366)

  * [GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding](http://arxiv.org/abs/2403.03608v1)

  * [GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding](https://openaccess.thecvf.com/content/CVPR2024/papers/Li_GP-NeRF_Generalized_Perception_NeRF_for_Context-Aware_3D_Scene_Understanding_CVPR_2024_paper.pdf)

  * [RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding](https://arxiv.org/abs/2304.00962)
:house:[project](https://jihanyang.github.io/projects/RegionPLC)

  * [GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding](http://arxiv.org/abs/2403.09639v1)
:star:[code](https://github.com/dvlab-research/GroupContrast)

  * [SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes](https://openaccess.thecvf.com/content/CVPR2024/papers/Delitzas_SceneFun3D_Fine-Grained_Functionality_and_Affordance_Understanding_in_3D_Scenes_CVPR_2024_paper.pdf)



## 34.Human–Computer Interaction

* [Exploring Pose-Aware Human-Object Interaction via Hybrid Learning](https://openaccess.thecvf.com/content/CVPR2024/papers/Wu_Exploring_Pose-Aware_Human-Object_Interaction_via_Hybrid_Learning_CVPR_2024_paper.pdf)

* [Bilateral Adaptation for Human-Object Interaction Detection with Occlusion-Robustness](https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_Bilateral_Adaptation_for_Human-Object_Interaction_Detection_with_Occlusion-Robustness_CVPR_2024_paper.pdf)

* [Scaling Up Dynamic Human-Scene Interaction Modeling](https://arxiv.org/abs/2403.08629)
:star:[code](https://huggingface.co/spaces/jnnan/trumans/tree/main)
:house:[project](https://jnnan.github.io/trumans/)

* [ReGenNet: Towards Human Action-Reaction Synthesis](http://arxiv.org/abs/2403.11882v1)
:star:[code](https://liangxuy.github.io/ReGenNet/)

* [DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback](https://arxiv.org/pdf/2311.10081.pdf)
:star:[code](https://huggingface.co/datasets/YangyiYY/LVLM_NLF) (interaction)

* [HOI-M^3: Capture Multiple Humans and Objects Interaction within Contextual Environment](https://openaccess.thecvf.com/content/CVPR2024/papers/Zhang_HOI-M3_Capture_Multiple_Humans_and_Objects_Interaction_within_Contextual_Environment_CVPR_2024_paper.pdf)

* [GenZI: Zero-Shot 3D Human-Scene Interaction Generation](http://arxiv.org/abs/2311.17737)

* [Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection](http://arxiv.org/abs/2404.06194)

* Human motion tracking

  * [HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations](http://arxiv.org/abs/2403.03561v1)
:house:[project](https://pico-ai-team.github.io/hmd-poser)

* Novel motion synthesis

  * [PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics](https://arxiv.org/abs/2311.12198)
:star:[code](https://github.com/XPandora/PhysGaussian)
:house:[project](https://xpandora.github.io/PhysGaussian/)

* Hand interaction

  * [InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion](http://arxiv.org/abs/2403.17422v1)
:star:[code](https://jyunlee.github.io/projects/interhandgen/)

  * [HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data](http://arxiv.org/abs/2403.12011)

  * [Physics-Aware Hand-Object Interaction Denoising](http://arxiv.org/abs/2405.11481)

  * [HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video](https://arxiv.org/abs/2311.18448)
:star:[code](https://github.com/zc-alexfan/hold)
:house:[project](https://zc-alexfan.github.io/hold) (hand-object interaction)

  * [GEARS: Local Geometry-aware Hand-object Interaction Synthesis](https://arxiv.org/abs/2404.01758)

  * [TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding](https://arxiv.org/abs/2401.08399)
:house:[project](https://taco2024.github.io/)

  * [Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction](http://arxiv.org/abs/2404.00562v1)
:star:[code](https://github.com/JunukCha/Text2HOI)

  * [G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis](http://arxiv.org/abs/2404.12383v1)
:star:[code](https://judyye.github.io/ghop-www)

  * [MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision](http://arxiv.org/abs/2310.11696)

  * [HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild](https://openaccess.thecvf.com/content/CVPR2024/papers/Narasimhaswamy_HOIST-Former_Hand-held_Objects_Identification_Segmentation_and_Tracking_in_the_Wild_CVPR_2024_paper.pdf)

* Human-object interaction

  * [Discovering Syntactic Interaction Clues for Human-Object Interaction Detection](https://openaccess.thecvf.com/content/CVPR2024/papers/Luo_Discovering_Syntactic_Interaction_Clues_for_Human-Object_Interaction_Detection_CVPR_2024_paper.pdf)

  * [Open-World Human-Object Interaction Detection via Multi-modal Prompts](https://arxiv.org/abs/2406.07221)

  * [LEMON: Learning 3D Human-Object Interaction Relation from 2D Images](https://arxiv.org/pdf/2312.08963.pdf)
:star:[code](https://github.com/yyvhang/lemon_3d)
:house:[project](https://yyvhang.github.io/LEMON/)

  * [Disentangled Pre-training for Human-Object Interaction Detection](http://arxiv.org/abs/2404.01725v1)
:star:[code](https://github.com/xingaoli/DP-HOI)

  * [GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation Demonstration and Imitation](https://arxiv.org/abs/2401.00929)
:house:[project](https://genh2r.github.io/)

  * [Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition](http://arxiv.org/abs/2405.09931)
:house:[project](https://yuchen2199.github.io/Interactive-Gaze/)

  * [Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation](https://arxiv.org/abs/2312.07063)
:house:[project](https://virtualhumans.mpi-inf.mpg.de/procigen-hdm)

  * 3D Human-Object Interaction

    * [I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions](https://arxiv.org/abs/2312.08869)
:house:[project](https://afterjourney00.github.io/IM-HOI.github.io/)

    * [CG-HOI: Contact-Guided 3D Human-Object Interaction Generation](https://arxiv.org/abs/2311.16097)
:house:[project](https://cg-hoi.christian-diller.de/)

* Human-Human Interaction

  * [Inter-X: Towards Versatile Human-Human Interaction Analysis](https://arxiv.org/abs/2312.16051)
:star:[code](https://github.com/liangxuy/Inter-X)
:house:[project](https://liangxuy.github.io/inter-x/)
:thumbsup:[3D Digital Human Reconstruction, Editing, and Animation](https://valser.org/webinar/slide/slides/20240403/Valse20240403%E6%99%8F%E8%BD%B6%E8%B6%85.pdf)



## 33.NeRF

* [GARField: Group Anything with Radiance Fields](http://arxiv.org/abs/2401.09419)

* [IReNe: Instant Recoloring of Neural Radiance Fields](https://openaccess.thecvf.com/content/CVPR2024/papers/Mazzucchelli_IReNe_Instant_Recoloring_of_Neural_Radiance_Fields_CVPR_2024_paper.pdf)

* [PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF](https://openaccess.thecvf.com/content/CVPR2024/papers/Feng_PIE-NeRF_Physics-based_Interactive_Elastodynamics_with_NeRF_CVPR_2024_paper.pdf)

* [LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes](http://arxiv.org/abs/2405.00900)

* [SIGNeRF: Scene Integrated Generation for Neural Radiance Fields](http://arxiv.org/abs/2401.01647)

* [NC-SDF: Enhancing Indoor Scene Reconstruction Using Neural SDFs with View-Dependent Normal Compensation](https://openaccess.thecvf.com/content/CVPR2024/papers/Chen_NC-SDF_Enhancing_Indoor_Scene_Reconstruction_Using_Neural_SDFs_with_View-Dependent_CVPR_2024_paper.pdf)

* [SpecNeRF: Gaussian Directional Encoding for Specular Reflections](http://arxiv.org/abs/2312.13102)

* [PaReNeRF: Toward Fast Large-scale Dynamic NeRF with Patch-based Reference](https://openaccess.thecvf.com/content/CVPR2024/papers/Tang_PaReNeRF_Toward_Fast_Large-scale_Dynamic_NeRF_with_Patch-based_Reference_CVPR_2024_paper.pdf)

* [Global and Hierarchical Geometry Consistency Priors for Few-shot NeRFs in Indoor Scenes](https://openaccess.thecvf.com/content/CVPR2024/papers/Sun_Global_and_Hierarchical_Geometry_Consistency_Priors_for_Few-shot_NeRFs_in_CVPR_2024_paper.pdf)
:thumbsup:[Abstract](https://informatics.xmu.edu.cn/info/1053/36349.htm)

* [NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs](http://arxiv.org/abs/2402.08622)

* [Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling](https://arxiv.org/abs/2405.14847)
:star:[code](https://github.com/lwwu2/nde)

* [Accelerating Neural Field Training via Soft Mining](http://arxiv.org/abs/2312.00075)

* [Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling](https://arxiv.org/abs/2406.03723)
:house:[project](https://merl.com/research/highlights/gear-nerf)

* [How Far Can We Compress Instant-NGP-Based NeRF?](https://arxiv.org/abs/2406.04101)
:star:[code](https://github.com/yihangchen-ee/cnc/)
:house:[project](https://yihangchen-ee.github.io/project_cnc/)

* [BANF: Band-Limited Neural Fields for Levels of Detail Reconstruction](http://arxiv.org/abs/2404.13024)
:star:[code](https://theialab.github.io/banf/)
:house:[project](https://theialab.github.io/banf/)

* [Tactile-Augmented Radiance Fields](https://arxiv.org/abs/2405.04534)
:star:[code](https://github.com/Dou-Yiming/TaRF/)
:house:[project](https://dou-yiming.github.io/TaRF)

* [NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild](https://arxiv.org/abs/2405.18715)
:house:[project](https://nerf-on-the-go.github.io/)

* [L0-Sampler: An L0 Model Guided Volume Sampling for NeRF](https://arxiv.org/abs/2311.07044)
:house:[project](https://ustc3dv.github.io/L0-Sampler/)

* [HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses](https://arxiv.org/abs/2312.02232)

* [Entangled View-Epipolar Information Aggregation for Generalizable Neural Radiance Fields](https://arxiv.org/abs/2311.11845)
:star:[code](https://github.com/tatakai1/EVENeRF)

* [NeRFCodec: Neural Feature Compression Meets Neural Radiance Fields for Memory-Efficient Scene Representation](https://arxiv.org/abs/2404.02185)

* [MuRF: Multi-Baseline Radiance Fields](https://arxiv.org/abs/2312.04565)
:house:[project](https://haofeixu.github.io/murf/)

* [InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields](https://arxiv.org/abs/2305.15094)
:house:[project](https://ivrl.github.io/InNeRF360)

* [NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors](https://arxiv.org/abs/2403.03122)
:star:[code](https://github.com/hynann/NRDF)
:house:[project](https://virtualhumans.mpi-inf.mpg.de/nrdf/)

* [Neural Fields as Distributions: Signal Processing Beyond Euclidean Space](https://openaccess.thecvf.com/content/CVPR2024/papers/Rebain_Neural_Fields_as_Distributions_Signal_Processing_Beyond_Euclidean_Space_CVPR_2024_paper.pdf)
:house:[project](https://ubc-vision.github.io/nfd/)

* [CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs](http://arxiv.org/abs/2403.16885v1)
:star:[code](https://zhongyingji.github.io/CVT-xRF)

* [DaReNeRF: Direction-aware Representation for Dynamic Scenes](http://arxiv.org/abs/2403.02265v1)

* [Geometry Transfer for Stylizing Radiance Fields](https://arxiv.org/abs/2402.00863)
:house:[project](https://hyblue.github.io/geo-srf/)

* [S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes](http://arxiv.org/abs/2403.06205v1)
:star:[code](https://xingyi-li.github.io/s-dyrf/)

* [SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream](http://arxiv.org/abs/2403.11222v1)
:star:[code](https://github.com/BIT-Vision/SpikeNeRF)

* [Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes](http://arxiv.org/abs/2403.16141v1)
:star:[code](https://otonari726.github.io/entitynerf/)

* [Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates](https://arxiv.org/abs/2309.11281)
:star:[code](https://github.com/kcshum/pose-conditioned-NeRF-object-fusion)

* [LAENeRF: Local Appearance Editing for Neural Radiance Fields](https://arxiv.org/abs/2312.09913)
:star:[code](https://github.com/r4dl/LAENeRF)
:house:[project](https://r4dl.github.io/LAENeRF/)

* [Single View Refractive Index Tomography with Neural Fields](http://arxiv.org/abs/2309.04437)

* [ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models](https://arxiv.org/abs/2406.06133)

* [TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video](https://arxiv.org/abs/2312.06713)

* [NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation](http://arxiv.org/abs/2403.17537v1)
:star:[code](https://cnhaox.github.io/NeRF-HuGS/)

* [Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency](http://arxiv.org/abs/2403.17638v1)
:star:[code](https://github.com/HKCLynn/ReVoRF)

* [Grounding and Enhancing Grid-based Models for Neural Fields](http://arxiv.org/abs/2403.20002v1)
:house:[project](https://sites.google.com/view/cvpr24-2034-submission/home)

* [Mitigating Motion Blur in Neural Radiance Fields with Events and Frames](http://arxiv.org/abs/2403.19780v1)

* [OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos](http://arxiv.org/abs/2404.00676v1)

* [Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects](http://arxiv.org/abs/2404.01440v1)
:star:[code](https://github.com/NVlabs/DigitalTwinArt)

* [Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields](https://openaccess.thecvf.com/content/CVPR2024/papers/Goli_Bayes_Rays_Uncertainty_Quantification_for_Neural_Radiance_Fields_CVPR_2024_paper.pdf)

* [Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields](http://arxiv.org/abs/2404.02155v1)
:house:[project](https://pals.ttic.edu/p/alpha-invariance)

* [Dynamic LiDAR Re-simulation using Compositional Neural Fields](https://arxiv.org/abs/2312.05247)
:house:[project](https://shengyuh.github.io/dynfl)

* [SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields](https://arxiv.org/abs/2311.15803)
:house:[project](https://qherau.github.io/SOAC/)

* [ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization](https://arxiv.org/abs/2401.08937)

* [NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows](https://openaccess.thecvf.com/content/CVPR2024/papers/Tang_NeRFDeformer_NeRF_Transformation_from_a_Single_View_via_3D_Scene_CVPR_2024_paper.pdf)

* Novel View Synthesis

  * [ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image](http://arxiv.org/abs/2310.17994)

  * [Unifying Correspondence Pose and NeRF for Generalized Pose-Free Novel View Synthesis](https://openaccess.thecvf.com/content/CVPR2024/papers/Hong_Unifying_Correspondence_Pose_and_NeRF_for_Generalized_Pose-Free_Novel_View_CVPR_2024_paper.pdf)

  * [NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis](https://openaccess.thecvf.com/content/CVPR2024/papers/You_NeLF-Pro_Neural_Light_Field_Probes_for_Multi-Scale_Novel_View_Synthesis_CVPR_2024_paper.pdf)

  * [3D Geometry-Aware Deformable Gaussian Splatting for Dynamic View Synthesis](http://arxiv.org/abs/2404.06270)

  * [G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images](http://arxiv.org/abs/2404.07474v1)

  * [MultiDiff: Consistent Novel View Synthesis from a Single Image](https://openaccess.thecvf.com/content/CVPR2024/papers/Muller_MultiDiff_Consistent_Novel_View_Synthesis_from_a_Single_Image_CVPR_2024_paper.pdf)

  * [Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis](https://arxiv.org/abs/2401.02436)

  * [DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis](https://arxiv.org/abs/2312.13016)

  * [Generalizable Novel-View Synthesis using a Stereo Camera](http://arxiv.org/abs/2404.13541)
:house:[project](https://jinwonjoon.github.io/stereonerf/)

  * [DART: Implicit Doppler Tomography for Radar Novel View Synthesis](http://arxiv.org/abs/2403.03896v1)
:house:[project](https://wiselabcmu.github.io/dart/)

  * [XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold](http://arxiv.org/abs/2403.19517v1)

  * [Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis](https://arxiv.org/abs/2312.16812)
:star:[code](https://github.com/oppo-us-research/SpacetimeGaussians)
:house:[project](https://oppo-us-research.github.io/SpacetimeGaussians-website/)

  * [NViST: In the Wild New View Synthesis from a Single Image with Transformers](https://arxiv.org/abs/2312.08568)
:star:[code](https://github.com/wbjang/nvist_official)
:house:[project](https://wbjang.github.io/nvist_webpage/)

  * [ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models](https://arxiv.org/abs/2312.01305)
:house:[project](https://jgkwak95.github.io/ViVid-1-to-3/)

  * [SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes](https://arxiv.org/abs/2312.14937)
:star:[code](https://github.com/yihua7/SC-GS)
:house:[project](https://yihua7.github.io/SC-GS-web/)

  * [Neural Visibility Field for Uncertainty-Driven Active Mapping](https://arxiv.org/abs/2406.06948)
:house:[project](https://sites.google.com/view/nvf-cvpr24/)

  * [EscherNet: A Generative Model for Scalable View Synthesis](https://arxiv.org/abs/2402.03908)
:star:[code](https://github.com/kxhit/EscherNet)
:house:[project](https://kxhit.github.io/EscherNet)

  * [GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis](https://arxiv.org/pdf/2312.02155.pdf)
:star:[code](https://github.com/ShunyuanZheng/GPS-Gaussian)
:house:[project](https://shunyuanzheng.github.io/GPS-Gaussian)

  * [DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization](http://arxiv.org/abs/2403.06912v1)
:star:[code](https://github.com/Fictionarry/DNGaussian)
:house:[project](https://fictionarry.github.io/DNGaussian/)

  * [LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis](https://arxiv.org/abs/2404.02742)
:star:[code](https://github.com/ispc-lab/LiDAR4D)
:house:[project](https://dyfcalid.github.io/LiDAR4D)

  * [Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?](http://arxiv.org/abs/2403.06092v1)

  * [Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models](https://github.com/Q-Future/Q-Instruct/tree/main/fig/Q_Instruct_v0_1_preview.pdf)
:star:[code](https://huggingface.co/datasets/teowu/Q-Instruct)
:house:[project](https://q-future.github.io/Q-Instruct/)

  * [CoPoNeRF: Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs](https://arxiv.org/abs/2312.07246)
:star:[code](https://github.com/KU-CVLAB/CoPoNeRF)
:house:[project](https://ku-cvlab.github.io/CoPoNeRF/)

  * [EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion](https://arxiv.org/abs/2312.06725)
:star:[code](https://github.com/huanngzh/EpiDiff)
:house:[project](https://huanngzh.github.io/EpiDiff/)

  * [Free3D: Consistent Novel View Synthesis without 3D Representation](https://arxiv.org/abs/2312.04551)
:star:[code](https://github.com/lyndonzheng/Free3D)
:house:[project](https://chuanxiaz.com/free3d/)

  * [Novel View Synthesis with View-Dependent Effects from a Single Image](https://arxiv.org/abs/2312.08071)
:house:[project](https://kaist-viclab.github.io/monovde-site)

* Rendering

  * [NeRF Director: Revisiting View Selection in Neural Volume Rendering](https://openaccess.thecvf.com/content/CVPR2024/papers/Xiao_NeRF_Director_Revisiting_View_Selection_in_Neural_Volume_Rendering_CVPR_2024_paper.pdf)

  * [Multiplane Prior Guided Few-Shot Aerial Scene Rendering](https://arxiv.org/abs/2406.04961)

  * [Differentiable Point-based Inverse Rendering](https://arxiv.org/abs/2312.02480)

  * [Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance](https://arxiv.org/abs/2312.04529)

  * [Perceptual Assessment and Optimization of HDR Image Rendering](https://openaccess.thecvf.com/content/CVPR2024/papers/Cao_Perceptual_Assessment_and_Optimization_of_HDR_Image_Rendering_CVPR_2024_paper.pdf)

  * [Global Latent Neural Rendering](https://arxiv.org/abs/2312.08338)

  * [Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields](https://arxiv.org/abs/2404.17528)
:star:[code](https://github.com/TQTQliu/GeFu)
:house:[project](https://gefucvpr24.github.io/)

  * [GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering](https://arxiv.org/abs/2402.10128)
:house:[project](https://abdullahamdi.com/ges)

  * [Real-time Acquisition and Reconstruction of Dynamic Volumes with Neural Structured Illumination](https://openaccess.thecvf.com/content/CVPR2024/papers/Zeng_Real-time_Acquisition_and_Reconstruction_of_Dynamic_Volumes_with_Neural_Structured_CVPR_2024_paper.pdf)
:house:[project](https://svbrdf.github.io/publications/realtimedynamic/project.html)
:tv:[video](https://www.youtube.com/watch?v=XoTYTGSueh4)
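:thumbsup:[Using neural structured illumination, Zhejiang University achieves real-time acquisition and reconstruction of dynamic 3D phenomena](https://mp.weixin.qq.com/s/cUnFIaL4xLaHBOWpNcI7Yg)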

  * [Inverse Rendering of Glossy Objects via the Neural Plenoptic Function and Radiance Fields](http://arxiv.org/abs/2403.16224v1)
:house:[project](https://whyy.site/paper/nep)

  * [Dr.Bokeh: DiffeRentiable Occlusion-aware Bokeh Rendering](https://arxiv.org/abs/2308.08843)
:house:[project](https://shengcn.github.io/DrBokeh/)

  * [HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting](https://arxiv.org/abs/2312.03461)
:thumbsup:[HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussians](https://cloud.tencent.com/developer/article/2383180)

  * [ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering](https://arxiv.org/abs/2312.05941)
:house:[project](https://vcai.mpi-inf.mpg.de/projects/ash/)

  * [SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild](https://arxiv.org/abs/2401.10171)
:house:[project](https://shinobi.aengelhardt.com/)

  * [LTM: Lightweight Textured Mesh Extraction and Refinement of Large Unbounded Scenes for Efficient Storage and Real-time Rendering](https://openaccess.thecvf.com/content/CVPR2024/papers/Choi_LTM_Lightweight_Textured_Mesh_Extraction_and_Refinement_of_Large_Unbounded_CVPR_2024_paper.pdf)

  * [HashPoint: Accelerated Point Searching and Sampling for Neural Rendering](https://export.arxiv.org/abs/2404.14044)
:house:[project](https://jiahao-ma.github.io/hashpoint/)

  * [HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces](https://arxiv.org/abs/2312.03160)
:house:[project](https://haithemturki.com/hybrid-nerf/)

  * [DUDF: Differentiable Unsigned Distance Fields with Hyperbolic Scaling](https://arxiv.org/abs/2402.08876)
:star:[code](https://github.com/LIA-DiTella/DiffUDF)
:house:[project](https://lia-ditella.github.io/DUDF/)

  * [Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras](https://arxiv.org/abs/2312.07423)
:house:[project](https://vcai.mpi-inf.mpg.de/projects/holochar/)

  * [ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis](https://arxiv.org/abs/2311.17123)
:house:[project](https://gaoxiangjun.github.io/contex_human/)

* Multi-view Inverse Rendering

  * [VMINer: Versatile Multi-view Inverse Rendering with Near- and Far-field Light Sources](https://openaccess.thecvf.com/content/CVPR2024/papers/Fei_VMINer_Versatile_Multi-view_Inverse_Rendering_with_Near-_and_Far-field_Light_CVPR_2024_paper.pdf)

* Object Reconstruction

  * [Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction](https://arxiv.org/abs/2312.01196)
:house:[project](https://geometric-rl.mpi-inf.mpg.de/npg)

  * [SAOR: Single-View Articulated Object Reconstruction](https://arxiv.org/abs/2303.13514)
:house:[project](https://mehmetaygun.github.io/saor)



## 32.NLP(Natural Language Processing)

* [Describing Differences in Image Sets with Natural Language](http://arxiv.org/abs/2312.02974)

* Entity Recognition

  * [A Generative Approach for Wikipedia-Scale Visual Entity Recognition](http://arxiv.org/abs/2403.02041v1)

* Prompt Learning

  * [BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP](https://arxiv.org/abs/2311.16194)

  * [Active Prompt Learning in Vision Language Models](https://arxiv.org/abs/2311.11178)
:star:[code](https://github.com/kaist-dmlab/pcb)

  * [Domain Prompt Learning with Quaternion Networks](https://arxiv.org/abs/2312.08878)

  * [On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do We Really Need Prompt Learning?](http://arxiv.org/abs/2405.02266)

  * [ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection](http://arxiv.org/abs/2311.15243)

* Foundation Models

  * [Asymmetric Masked Distillation for Pre-Training Small Foundation Models](https://arxiv.org/abs/2311.03149)
:star:[code](https://github.com/MCG-NJU/AMD)

  * [Bootstrapping SparseFormers from Vision Foundation Models](https://arxiv.org/abs/2312.01987)
:star:[code](https://github.com/showlab/sparseformer)

 



## 31.Edge Detection

* [MuGE: Multiple Granularity Edge Detection](https://openaccess.thecvf.com/content/CVPR2024/papers/Zhou_MuGE_Multiple_Granularity_Edge_Detection_CVPR_2024_paper.pdf)

* [RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based Losses](http://arxiv.org/abs/2403.01795v1)
:star:[code](https://ranked-cvpr24.github.io)



## 30.Person Re-Identification

* [Fusing Personal and Environmental Cues for Identification and Segmentation of First-Person Camera Wearers in Third-Person Views](https://openaccess.thecvf.com/content/CVPR2024/papers/Zhao_Fusing_Personal_and_Environmental_Cues_for_Identification_and_Segmentation_of_CVPR_2024_paper.pdf)

* [Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception](https://arxiv.org/abs/2311.13793)

* Pedestrian Detection

  * [DAP: A Dynamic Adversarial Patch for Evading Person Detectors](https://arxiv.org/abs/2305.11618)

  * [Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection](http://arxiv.org/abs/2403.01300v1)
:star:[code](https://github.com/ssbin0914/Causal-Mode-Multiplexer)

  * [WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects Under Occlusion](http://arxiv.org/abs/2403.19022)

  * Text-based Person Retrieval

    * [UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity](https://arxiv.org/abs/2312.03441)
:star:[code](https://github.com/Zplusdragon/UFineBench)

* Crowd Counting

  * [Single Domain Generalization for Crowd Counting](http://arxiv.org/abs/2403.09124v1)
:star:[code](https://github.com/Shimmer93/MPCount)

  * [CrowdDiff: Multi-hypothesis Crowd Density Estimation using Diffusion Models](https://arxiv.org/abs/2303.12790)
:house:[project](https://dylran.github.io/crowddiff.github.io)

  * [Regressor-Segmenter Mutual Prompt Learning for Crowd Counting](https://arxiv.org/abs/2312.01711)

* Pedestrian Attribute Detection

  * [Learning Group Activity Features Through Person Attribute Prediction](https://arxiv.org/abs/2403.02753)
:star:[code](https://github.com/chihina/GAFL-CVPR2024)
:house:[project](https://www.toyota-ti.ac.jp/Lab/Denshi/iim/ukita/selection/CVPR2024-GAFL.html)

* Re-Identification

  * [SEAS: ShapE-Aligned Supervision for Person Re-Identification](https://openaccess.thecvf.com/content/CVPR2024/papers/Zhu_SEAS_ShapE-Aligned_Supervision_for_Person_Re-Identification_CVPR_2024_paper.pdf)

  * [Learning Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification](https://openaccess.thecvf.com/content/CVPR2024/papers/Cui_Learning_Continual_Compatible_Representation_for_Re-indexing_Free_Lifelong_Person_Re-identification_CVPR_2024_paper.pdf)
:star:[code](https://github.com/PKU-ICST-MIPL/C2R_CVPR2024)

  * [View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network](http://arxiv.org/abs/2403.14513v1)
:star:[code](https://github.com/LinlyAC/VDT-AGPReID)

  * [CA-Jaccard: Camera-aware Jaccard Distance for Person Re-identification](https://arxiv.org/abs/2311.10605)

  * [Attribute-Guided Pedestrian Retrieval: Bridging Person Re-ID with Internal Attribute Variability](https://openaccess.thecvf.com/content/CVPR2024/papers/Huang_Attribute-Guided_Pedestrian_Retrieval_Bridging_Person_Re-ID_with_Internal_Attribute_Variability_CVPR_2024_paper.pdf)

  * [All in One Framework for Multimodal Re-identification in the Wild](https://arxiv.org/abs/2405.04741)

  * [A Pedestrian is Worth One Prompt: Towards Language Guidance Person Re-Identification](https://openaccess.thecvf.com/content/CVPR2024/papers/Yang_A_Pedestrian_is_Worth_One_Prompt_Towards_Language_Guidance_Person_CVPR_2024_paper.pdf)

  * [Distribution-aware Knowledge Prototyping for Non-exemplar Lifelong Person Re-identification](https://zhoujiahuan1991.github.io/pub/CVPR2024_DKP.pdf)

  * [Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions](https://arxiv.org/abs/2306.07520)
:star:[code](https://github.com/hwz-zju/Instruct-ReID)

  * LiDAR-based Re-ID

    * [LiDAR-based Person Re-identification](https://arxiv.org/abs/2312.03033)

  * Visible-Infrared Person Re-Identification

    * [Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-Identification](http://arxiv.org/abs/2403.11708v1)
:star:[code](https://github.com/1KK077/IDKL)

    * [Shallow-Deep Collaborative Learning for Unsupervised Visible-Infrared Person Re-Identification](https://openaccess.thecvf.com/content/CVPR2024/papers/Yang_Shallow-Deep_Collaborative_Learning_for_Unsupervised_Visible-Infrared_Person_Re-Identification_CVPR_2024_paper.pdf)

  * Text-to-Image Re-Identification

    * [Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID](https://arxiv.org/abs/2405.04940)

    * [Noisy-Correspondence Learning for Text-to-Image Person Re-identification](https://arxiv.org/abs/2308.09911)
:star:[code](https://github.com/QinYang79/RDE)

* Gait Recognition

  * [Learning Visual Prompt for Gait Recognition](https://openaccess.thecvf.com/content/CVPR2024/papers/Ma_Learning_Visual_Prompt_for_Gait_Recognition_CVPR_2024_paper.pdf)

  * [BigGait: Learning Gait Representation You Want by Large Vision Models](https://arxiv.org/abs/2402.19122)
:star:[code](https://github.com/ShiqiYu/OpenGait)



## 29.Model Compression/Knowledge Distillation/Pruning

* MC

  * [Dense Vision Transformer Compression with Few Samples](http://arxiv.org/abs/2403.18708v1)

* KD

  * [Small Scale Data-Free Knowledge Distillation](https://openaccess.thecvf.com/content/CVPR2024/papers/Liu_Small_Scale_Data-Free_Knowledge_Distillation_CVPR_2024_paper.pdf)

  * [KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling](https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_KD-DETR_Knowledge_Distillation_for_Detection_Transformer_with_Consistent_Distillation_Points_CVPR_2024_paper.pdf)

  * [Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation](http://arxiv.org/abs/2404.07933)

  * [Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities](http://arxiv.org/abs/2404.16456)

  * [C2KD: Bridging the Modality Gap for Cross-Modal Knowledge Distillation](https://openaccess.thecvf.com/content/CVPR2024/papers/Huo_C2KD_Bridging_the_Modality_Gap_for_Cross-Modal_Knowledge_Distillation_CVPR_2024_paper.pdf)

  * [CrossKD: Cross-Head Knowledge Distillation for Object Detection](http://arxiv.org/abs/2306.11369)

  * [CLIP-KD: An Empirical Study of CLIP Model Distillation](https://arxiv.org/abs/2307.12732)
:star:[code](https://github.com/winycg/CLIP-KD)

  * [Aligning Logits Generatively for Principled Black-Box Knowledge Distillation](https://arxiv.org/abs/2205.10490)

  * [FreeKD: Knowledge Distillation via Semantic Frequency Prompt](https://arxiv.org/abs/2311.12079)

  * [Logit Standardization in Knowledge Distillation](http://arxiv.org/abs/2403.01427v1)

  * [$V_kD$: Improving Knowledge Distillation using Orthogonal Projections](http://arxiv.org/abs/2403.06213v1)
:star:[code](https://github.com/roymiles/vkd)

  * [Scale Decoupled Distillation](http://arxiv.org/abs/2403.13512v1)
:star:[code](https://github.com/shicaiwei123/SDD-CVPR2024)

  * [NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation](https://arxiv.org/abs/2310.00258v2)
:star:[code](https://github.com/tmtuan1307/nayer)

  * [De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts](http://arxiv.org/abs/2403.19539v1)

  * [PromptKD: Unsupervised Prompt Distillation for Vision-Language Models](https://arxiv.org/abs/2403.02781)
:star:[code](https://github.com/zhengli97/PromptKD)
:house:[project](https://zhengli97.github.io/PromptKD/)
:thumbsup:[Chinese-language explainer](https://zhengli97.github.io/PromptKD/chinese_interpertation.html)

* Pruning

  * [Device-Wise Federated Network Pruning](https://openaccess.thecvf.com/content/CVPR2024/papers/Gao_Device-Wise_Federated_Network_Pruning_CVPR_2024_paper.pdf)

  * [FedMef: Towards Memory-efficient Federated Dynamic Pruning](http://arxiv.org/abs/2403.14737)

  * [OrthCaps: An Orthogonal CapsNet with Sparse Attention Routing and Pruning](http://arxiv.org/abs/2403.13351)

  * [BilevelPruning: Unified Dynamic and Static Channel Pruning for Convolutional Neural Networks](https://openaccess.thecvf.com/content/CVPR2024/papers/Gao_BilevelPruning_Unified_Dynamic_and_Static_Channel_Pruning_for_Convolutional_Neural_CVPR_2024_paper.pdf)

  * [Resource-Efficient Transformer Pruning for Finetuning of Large Models](https://openaccess.thecvf.com/content/CVPR2024/papers/Ilhan_Resource-Efficient_Transformer_Pruning_for_Finetuning_of_Large_Models_CVPR_2024_paper.pdf)

  * [Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch](https://arxiv.org/abs/2403.14729)

  * [Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning](https://arxiv.org/abs/2406.01820)
:house:[project](https://iurada.github.io/PX)

  * [Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers](https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_Zero-TPrune_Zero-Shot_Token_Pruning_through_Leveraging_of_the_Attention_Graph_CVPR_2024_paper.pdf)

  * [MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric](http://arxiv.org/abs/2403.07839v1)

  * [Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment](http://arxiv.org/abs/2403.19490v1)

  * [MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning](http://arxiv.org/abs/2404.05621v1)
:star:[code](https://github.com/FarinaMatteo/multiflow)

* Quantization

  * [PTQ4SAM: Post-Training Quantization for Segment Anything](https://arxiv.org/abs/2405.03144)

  * [Reg-PTQ: Regression-specialized Post-training Quantization for Fully Quantized Object Detector](https://openaccess.thecvf.com/content/CVPR2024/papers/Ding_Reg-PTQ_Regression-specialized_Post-training_Quantization_for_Fully_Quantized_Object_Detector_CVPR_2024_paper.pdf)

  * [Data-Free Quantization via Pseudo-label Filtering](https://openaccess.thecvf.com/content/CVPR2024/papers/Fan_Data-Free_Quantization_via_Pseudo-label_Filtering_CVPR_2024_paper.pdf)

  * [JointSQ: Joint Sparsification-Quantization for Distributed Learning](https://openaccess.thecvf.com/content/CVPR2024/papers/Xie_JointSQ_Joint_Sparsification-Quantization_for_Distributed_Learning_CVPR_2024_paper.pdf)

  * [Retraining-Free Model Quantization via One-Shot Weight-Coupling Learning](http://arxiv.org/abs/2401.01543)

  * [Epistemic Uncertainty Quantification For Pre-Trained Neural Networks](http://arxiv.org/abs/2404.10124)

  * [Enhancing Post-training Quantization Calibration through Contrastive Learning](https://openaccess.thecvf.com/content/CVPR2024/papers/Shang_Enhancing_Post-training_Quantization_Calibration_through_Contrastive_Learning_CVPR_2024_paper.pdf)

  * [Towards Accurate Post-training Quantization for Diffusion Models](http://arxiv.org/abs/2305.18723)

  * [Is Conventional SNN Really Efficient? A Perspective from Network Quantization](https://arxiv.org/abs/2311.10802)

  * [Are Conventional SNNs Really Efficient? A Perspective from Network Quantization](https://openaccess.thecvf.com/content/CVPR2024/papers/Shen_Are_Conventional_SNNs_Really_Efficient_A_Perspective_from_Network_Quantization_CVPR_2024_paper.pdf)



## 28.UAV/Remote Sensing/Satellite Image

* [Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization](http://arxiv.org/abs/2403.14198v1)
:star:[code](https://github.com/liguopeng0923/UCVGL)

* [Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery](http://arxiv.org/abs/2403.05419v1)
:star:[code](https://github.com/techmn/satmae_pp)

* [Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery](http://arxiv.org/abs/2403.11812v1)
:house:[project](https://zyqz97.github.io/Aerial_Lifting/)

* [S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data](https://openaccess.thecvf.com/content/CVPR2024/papers/Li_S2MAE_A_Spatial-Spectral_Pretraining_Foundation_Model_for_Spectral_Remote_Sensing_CVPR_2024_paper.pdf)

* [Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans](https://arxiv.org/abs/2304.09704)
:house:[project](https://imagine.enpc.fr/~loiseaur/learnable-earth-parser)

* [WildlifeMapper: Aerial Image Analysis for Multi-Species Detection and Identification](https://openaccess.thecvf.com/content/CVPR2024/papers/Kumar_WildlifeMapper_Aerial_Image_Analysis_for_Multi-Species_Detection_and_Identification_CVPR_2024_paper.pdf)
:star:[code](https://github.com/UCSB-VRL/WildlifeMapper)

* [Learning without Exact Guidanc