{"id":28975203,"url":"https://github.com/52cv/cvpr-2025-papers","last_synced_at":"2026-02-03T19:02:48.913Z","repository":{"id":298091528,"uuid":"893808080","full_name":"52CV/CVPR-2025-Papers","owner":"52CV","description":null,"archived":false,"fork":false,"pushed_at":"2025-06-26T04:05:11.000Z","size":248,"stargazers_count":26,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-26T05:19:40.270Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/52CV.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-25T08:48:10.000Z","updated_at":"2025-06-26T04:05:15.000Z","dependencies_parsed_at":"2025-06-26T05:18:53.100Z","dependency_job_id":"41e7850f-765d-4cb2-9f78-51757cca51f4","html_url":"https://github.com/52CV/CVPR-2025-Papers","commit_stats":null,"previous_names":["52cv/cvpr-2025-papers"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/52CV/CVPR-2025-Papers","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/52CV%2FCVPR-2025-Papers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/52CV%2FCVPR-2025-Papers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/52CV%2FCVPR-2025-Papers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/52CV%2FCVPR-2025-Papers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/52CV","download_url":"https:
//codeload.github.com/52CV/CVPR-2025-Papers/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/52CV%2FCVPR-2025-Papers/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29054047,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-03T15:43:47.601Z","status":"ssl_error","status_checked_at":"2026-02-03T15:43:46.709Z","response_time":96,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-24T12:32:41.856Z","updated_at":"2026-02-03T19:02:48.894Z","avatar_url":"https://github.com/52CV.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# CVPR-2025-Papers\n## Conference dates: June 11-15, 2025\n## Conference website: https://cvpr.thecvf.com/\n## ❣❣❣ CVPR 2025 paper categorization in progress\n\n## For 2025 survey papers, see ↘️[2025-CV-Surveys](https://github.com/52CV/CV-Surveys)\n\n## 2025 paper collections by category\n↘️[WACV-2025-Papers](https://github.com/52CV/WACV-2025-Papers)\n↘️[CVPR-2025-Papers](https://github.com/52CV/CVPR-2025-Papers)\n↘️[ICCV-2025-Papers](https://github.com/52CV/ICCV-2025-Papers)\n\n## [2024 paper collections by category](#00000)\n## [2023 paper collections by category](#0000)\n## [2022 paper collections by category](#000)\n## [2021 paper collections by category](#00)\n## [2020 paper collections by category](#0)\n\n# ❣❣❣ CVPR 2025 paper categorization complete\n# :loudspeaker::loudspeaker::loudspeaker:Award-Winning Papers\n### :trophy:Best Paper\n* [VGGT: Visual Geometry Grounded 
Transformer](http://arxiv.org/abs/2503.11651v1)\u003cbr\u003e:star:[code](https://vgg-t.github.io/)\u003cbr\u003e:star:[code](https://github.com/facebookresearch/vggt)\n### :trophy:Best Student Paper\n* [Neural Inverse Rendering from Propagating Light](https://openaccess.thecvf.com/content/CVPR2025/html/Malik_Neural_Inverse_Rendering_from_Propagating_Light_CVPR_2025_paper.html)\n### :trophy:Best Paper Honorable Mention\n* [Navigation World Models](https://openaccess.thecvf.com/content/CVPR2025/html/Bar_Navigation_World_Models_CVPR_2025_paper.html)\n* [3D Student Splatting and Scooping](https://openaccess.thecvf.com/content/CVPR2025/html/Zhu_3D_Student_Splatting_and_Scooping_CVPR_2025_paper.html)\n* [MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos](https://openaccess.thecvf.com/content/CVPR2025/html/Li_MegaSaM_Accurate_Fast_and_Robust_Structure_and_Motion_from_Casual_CVPR_2025_paper.html)\n* [Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Deitke_Molmo_and_PixMo_Open_Weights_and_Open_Data_for_State-of-the-Art_CVPR_2025_paper.html)\n### :trophy:Best Student Paper Honorable Mention\n* [Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens](http://arxiv.org/abs/2504.14666v1)\u003cbr\u003e:star:[code](https://DDT-LLaMA.github.io/)\n\n\n|:cat:|:dog:|:tiger:|:wolf:|\n|------|------|------|------|\n|[1.Other](#1)|[2.Face](#2)|[3.Image Segmentation](#3)|[4.Image/Video Processing](#4)|\n|[5.Image SR(Super-Resolution)](#5)|[6.Image Classification](#6)|[7.Image/Video Compression](#7)|[8.Image/Video Captions](#8)|\n|[9.Image/Video Retrieval](#9)|[10.ODetection(Object Detection)](#10)|[11.OTracking(Object Tracking)](#11)|[12.Autonomous Driving](#12)|\n|[13.Medical Image Processing](#13)|[14.HPE(Pose Estimation)](#14)|[15.ActDetection(Action Detection)](#15)|[16.Human Motion Generation](#16)|\n|[17.HOI(Human-Object Interaction)](#17)|[18.Person Re-id](#18)|[19.UAV/RS/Satellite 
Image](#19)|[20.VQA(Visual Question Answering)](#20)|\n|[21.Point Cloud](#21)|[22.3D(3D Reconstruction/3D Vision)](#22)|[23.OCR](#23)|[24.Video](#24)|\n|[25.GAN/Image Synthesis](#25)|[26.Style Transfer](#26)|[27.SGG(Scene Graph Generation)](#27)|[28.Optical Flow Estimation](#28)|\n|[29.Scene Flow Estimation](#29)|[30.Gaze Estimation](#30)|[31.Robot Navigation/SLAM](#31)|[32.Machine Learning](#32)|\n|[33.MC/KD/Pruning(Model Compression/Knowledge Distillation/Pruning)](#33)|[34.NAS(Neural Architecture Search)](#34)|[35.Self-Supervised](#35)|[36.Vision-Language](#36)|\n|[37.Sound](#)|[38.Dataset/Benchmark](#)|[39.Vision Transformers](#)|[40.Deepfake Detection/AI-Generated Image Detection](#)|\n|[41.F/ZSL/DG/A(Few-/Zero-Shot Learning/Domain Generalization/Domain Adaptation)](#41)|[42.GNN/GCN](#42)|[43.Object Re-Id/Counting](#43)|[44.Object Pose Estimation](#44)|\n|[45.Anomaly Detection](#45)|[46.Neural Radiance Fields](#46)|[47.Industrial Anomaly Detection(Industrial Defect Detection)](#47)|[48.Feature Matching](#48)|\n|[49.Image Fusion](#49)|[50.Dense Prediction](#50)|[51.Protecting copyright](#51)|[52.Animal](#52)|\n|[53.Sketch](#53)|[54.Animation](#54)|[55.Retrieval-Augmented Generation](#55)|[56.Multi-view Clustering](#56)|\n|[57.Computational Imaging](#57)|\n\n\u003ca name=\"57\"/\u003e\n\n## 57.Computational Imaging\n* [AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos](http://arxiv.org/abs/2503.23282v1)\u003cbr\u003e:star:[code](https://fwmb.github.io/anycam)\n* [Dynamic Camera Poses and Where to Find Them](http://arxiv.org/abs/2504.17788v1)\u003cbr\u003e:house:[project](https://research.nvidia.com/labs/dir/dynpose-100k)\n* [EquiPose: Exploiting Permutation Equivariance for Relative Camera Pose Estimation](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_EquiPose_Exploiting_Permutation_Equivariance_for_Relative_Camera_Pose_Estimation_CVPR_2025_paper.html)\n* [HyperPose: Hypernetwork-Infused Camera Pose Localization and an Extended Cambridge Landmarks 
Dataset](https://openaccess.thecvf.com/content/CVPR2025/html/Ferens_HyperPose_Hypernetwork-Infused_Camera_Pose_Localization_and_an_Extended_Cambridge_Landmarks_CVPR_2025_paper.html)\n* [FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_FLARE_Feed-forward_Geometry_Appearance_and_Camera_Estimation_from_Uncalibrated_Sparse_CVPR_2025_paper.html)\n* Camera Relocalization\n  * [From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting](http://arxiv.org/abs/2503.19358v1)\n\n\u003ca name=\"56\"/\u003e\n\n## 56.Multi-view Clustering\n* [AdaptCMVC: Robust Adaption to Incremental Views in Continual Multi-view Clustering](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_AdaptCMVC_Robust_Adaption_to_Incremental_Views_in_Continual_Multi-view_Clustering_CVPR_2025_paper.html)\n* [Deep Fair Multi-View Clustering with Attention KAN](https://openaccess.thecvf.com/content/CVPR2025/html/Xu_Deep_Fair_Multi-View_Clustering_with_Attention_KAN_CVPR_2025_paper.html)\n* [Imputation-free and Alignment-free: Incomplete Multi-view Clustering Driven by Consensus Semantic Learning](https://openaccess.thecvf.com/content/CVPR2025/html/Dai_Imputation-free_and_Alignment-free_Incomplete_Multi-view_Clustering_Driven_by_Consensus_Semantic_CVPR_2025_paper.html)\n* [Medusa: A Multi-Scale High-order Contrastive Dual-Diffusion Approach for Multi-View Clustering](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_Medusa_A_Multi-Scale_High-order_Contrastive_Dual-Diffusion_Approach_for_Multi-View_Clustering_CVPR_2025_paper.html)\n* [A Hubness Perspective on Representation Learning for Graph-Based Multi-View Clustering](https://openaccess.thecvf.com/content/CVPR2025/html/Xu_A_Hubness_Perspective_on_Representation_Learning_for_Graph-Based_Multi-View_Clustering_CVPR_2025_paper.html)\n* [EASEMVC:Efficient Dual Selection Mechanism for Deep Multi-View 
Clustering](https://openaccess.thecvf.com/content/CVPR2025/html/Xiao_EASEMVCEfficient_Dual_Selection_Mechanism_for_Deep_Multi-View_Clustering_CVPR_2025_paper.html)\n* [ROLL: Robust Noisy Pseudo-label Learning for Multi-View Clustering with Noisy Correspondence](https://openaccess.thecvf.com/content/CVPR2025/html/Sun_ROLL_Robust_Noisy_Pseudo-label_Learning_for_Multi-View_Clustering_with_Noisy_CVPR_2025_paper.html)\n* [Enhanced then Progressive Fusion with View Graph for Multi-View Clustering](https://openaccess.thecvf.com/content/CVPR2025/html/Dong_Enhanced_then_Progressive_Fusion_with_View_Graph_for_Multi-View_Clustering_CVPR_2025_paper.html)\n\n\u003ca name=\"55\"/\u003e\n\n## 55.Retrieval-Augmented Generation\n* [VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents](https://openaccess.thecvf.com/content/CVPR2025/html/Tanaka_VDocRAG_Retrieval-Augmented_Generation_over_Visually-Rich_Documents_CVPR_2025_paper.html)\n* Generative Retrieval\n  * [GENIUS: A Generative Framework for Universal Multimodal Search](http://arxiv.org/abs/2503.19868v1)\n\n\u003ca name=\"54\"/\u003e\n\n## 54.Animation\n* [AniDoc: Animation Creation Made Easier](https://openaccess.thecvf.com/content/CVPR2025/html/Meng_AniDoc_Animation_Creation_Made_Easier_CVPR_2025_paper.html)\n* [X-Dyna: Expressive Dynamic Human Image Animation](https://openaccess.thecvf.com/content/CVPR2025/html/Chang_X-Dyna_Expressive_Dynamic_Human_Image_Animation_CVPR_2025_paper.html)\n* [EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation](https://openaccess.thecvf.com/content/CVPR2025/html/Meng_EchoMimicV2_Towards_Striking_Simplified_and_Semi-Body_Human_Animation_CVPR_2025_paper.html)\n* [StableAnimator: High-Quality Identity-Preserving Human Image Animation](https://openaccess.thecvf.com/content/CVPR2025/html/Tu_StableAnimator_High-Quality_Identity-Preserving_Human_Image_Animation_CVPR_2025_paper.html)\n* [Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D 
Characters](https://openaccess.thecvf.com/content/CVPR2025/html/Guo_Make-It-Animatable_An_Efficient_Framework_for_Authoring_Animation-Ready_3D_Characters_CVPR_2025_paper.html)\n* [PhysAnimator: Physics-Guided Generative Cartoon Animation](https://openaccess.thecvf.com/content/CVPR2025/html/Xie_PhysAnimator_Physics-Guided_Generative_Cartoon_Animation_CVPR_2025_paper.html)\n* [Free-viewpoint Human Animation with Pose-correlated Reference Selection](https://openaccess.thecvf.com/content/CVPR2025/html/Hong_Free-viewpoint_Human_Animation_with_Pose-correlated_Reference_Selection_CVPR_2025_paper.html)\n* [Consistent and Controllable Image Animation with Motion Diffusion Models](https://openaccess.thecvf.com/content/CVPR2025/html/Ma_Consistent_and_Controllable_Image_Animation_with_Motion_Diffusion_Models_CVPR_2025_paper.html)\n* [Let's Chorus: Partner-aware Hybrid Song-Driven 3D Head Animation](https://openaccess.thecvf.com/content/CVPR2025/html/Xie_Lets_Chorus_Partner-aware_Hybrid_Song-Driven_3D_Head_Animation_CVPR_2025_paper.html)\n* [MotiF: Making Text Count in Image Animation with Motion Focal Loss](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_MotiF_Making_Text_Count_in_Image_Animation_with_Motion_Focal_CVPR_2025_paper.html)\n* [FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations](https://openaccess.thecvf.com/content/CVPR2025/html/Bandyopadhyay_FlipSketch_Flipping_Static_Drawings_to_Text-Guided_Sketch_Animations_CVPR_2025_paper.html)\n* [Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Diffusion-based_Realistic_Listening_Head_Generation_via_Hybrid_Motion_Modeling_CVPR_2025_paper.html)\n* [Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach](https://openaccess.thecvf.com/content/CVPR2025/html/Bi_Unveiling_Visual_Perception_in_Language_Models_An_Attention_Head_Analysis_CVPR_2025_paper.html)\n* Portrait Animation\n  * [Sonic: Shifting 
Focus to Global Audio Perception in Portrait Animation](https://openaccess.thecvf.com/content/CVPR2025/html/Ji_Sonic_Shifting_Focus_to_Global_Audio_Perception_in_Portrait_Animation_CVPR_2025_paper.html)\n  * [Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer](https://openaccess.thecvf.com/content/CVPR2025/html/Cui_Hallo3_Highly_Dynamic_and_Realistic_Portrait_Image_Animation_with_Video_CVPR_2025_paper.html)\n  * [High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model](https://openaccess.thecvf.com/content/CVPR2025/html/Guo_High-Fidelity_Relightable_Monocular_Portrait_Animation_with_Lighting-Controllable_Video_Diffusion_Model_CVPR_2025_paper.html)\n  * [KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation](http://arxiv.org/abs/2503.01715v1)\n  * [HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation](http://arxiv.org/abs/2503.18860v1)\u003cbr\u003e:star:[code](https://kkakkkka.github.io/HunyuanPortrait)\n  * [Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation](https://openaccess.thecvf.com/content/CVPR2025/html/Li_Wav2Sem_Plug-and-Play_Audio_Semantic_Decoupling_for_3D_Speech-Driven_Facial_Animation_CVPR_2025_paper.html)\n  * [Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion](http://arxiv.org/abs/2503.15851v1)\n  * [Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation](http://arxiv.org/abs/2503.18429v1)\n  * [MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation](http://arxiv.org/abs/2503.19383v1)\n  * [MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_MoEE_Mixture_of_Emotion_Experts_for_Audio-Driven_Portrait_Animation_CVPR_2025_paper.html)\n\n\u003ca name=\"53\"/\u003e\n\n## 
53.Sketch\n* [Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch](https://openaccess.thecvf.com/content/CVPR2025/html/Sain_Sketch_Down_the_FLOPs_Towards_Efficient_Networks_for_Human_Sketch_CVPR_2025_paper.html)\n* [Image Referenced Sketch Colorization Based on Animation Creation Workflow](https://openaccess.thecvf.com/content/CVPR2025/html/Yan_Image_Referenced_Sketch_Colorization_Based_on_Animation_Creation_Workflow_CVPR_2025_paper.html)\n* [SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models](http://arxiv.org/abs/2503.14129v1)\u003cbr\u003e:star:[code](https://subhadeepkoley.github.io/SketchFusion/)\n* [SketchAgent: Language-Driven Sequential Sketch Generation](https://openaccess.thecvf.com/content/CVPR2025/html/Vinker_SketchAgent_Language-Driven_Sequential_Sketch_Generation_CVPR_2025_paper.html)\n* 3D Sketch\n  * [Recovering Dynamic 3D Sketches from Videos](http://arxiv.org/abs/2503.20321v1)\u003cbr\u003e:house:[project](https://jaeah.me/liv3stroke_web)\n\n\n\u003ca name=\"52\"/\u003e\n\n## 52.Animal\n* [Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation](http://arxiv.org/abs/2503.21140v1)\u003cbr\u003e:star:[code](https://github.com/chenbys/FMMP)\n* [Probabilistic Prompt Distribution Learning for Animal Pose Estimation](http://arxiv.org/abs/2503.16120v1)\u003cbr\u003e:star:[code](https://github.com/Raojiyong/PPAP)\n* [AniMo: Species-Aware Model for Text-Driven Animal Motion Generation](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_AniMo_Species-Aware_Model_for_Text-Driven_Animal_Motion_Generation_CVPR_2025_paper.html)\n* [AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer](https://openaccess.thecvf.com/content/CVPR2025/html/Lyu_AniMer_Animal_Pose_and_Shape_Estimation_Using_Family_Aware_Transformer_CVPR_2025_paper.html)\n* [Reconstructing Animals and the 
Wild](https://openaccess.thecvf.com/content/CVPR2025/html/Kulits_Reconstructing_Animals_and_the_Wild_CVPR_2025_paper.html)\n\n\u003ca name=\"51\"/\u003e\n\n## 51.Protecting copyright\n* [CDI: Copyrighted Data Identification in Diffusion Models](https://openaccess.thecvf.com/content/CVPR2025/html/Dubinski_CDI_Copyrighted_Data_Identification_in_Diffusion_Models_CVPR_2025_paper.html)\n* [Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models](http://arxiv.org/abs/2503.11071v1)\n* [Vision-Language Model IP Protection via Prompt-based Learning](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Vision-Language_Model_IP_Protection_via_Prompt-based_Learning_CVPR_2025_paper.html)\n* Watermarking\n  * [3D-GSW: 3D Gaussian Splatting for Robust Watermarking](https://openaccess.thecvf.com/content/CVPR2025/html/Jang_3D-GSW_3D_Gaussian_Splatting_for_Robust_Watermarking_CVPR_2025_paper.html)\n  * [GuardSplat: Efficient and Robust Watermarking for 3D Gaussian Splatting](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_GuardSplat_Efficient_and_Robust_Watermarking_for_3D_Gaussian_Splatting_CVPR_2025_paper.html)\n  * [OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_OmniGuard_Hybrid_Manipulation_Localization_via_Augmented_Versatile_Deep_Image_Watermarking_CVPR_2025_paper.html)\n  * [Watermarking One for All: A Robust Watermarking Scheme Against Partial Image Theft](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_Watermarking_One_for_All_A_Robust_Watermarking_Scheme_Against_Partial_CVPR_2025_paper.html)\n  * [EntropyMark: Towards More Harmless Backdoor Watermark via Entropy-based Constraint for Open-source Dataset Copyright Protection](https://openaccess.thecvf.com/content/CVPR2025/html/Sun_EntropyMark_Towards_More_Harmless_Backdoor_Watermark_via_Entropy-based_Constraint_for_CVPR_2025_paper.html)\n  * [SleeperMark: Towards 
Robust Watermark against Fine-Tuning Text-to-image Diffusion Models](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_SleeperMark_Towards_Robust_Watermark_against_Fine-Tuning_Text-to-image_Diffusion_Models_CVPR_2025_paper.html)\n  * [Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models](https://openaccess.thecvf.com/content/CVPR2025/html/Muller_Black-Box_Forgery_Attacks_on_Semantic_Watermarks_for_Diffusion_Models_CVPR_2025_paper.html)\n\n\u003ca name=\"50\"/\u003e\n\n## 50.Dense Prediction\n* [Unified Dense Prediction of Video Diffusion](http://arxiv.org/abs/2503.09344v1)\n* [Frequency Dynamic Convolution for Dense Image Prediction](http://arxiv.org/abs/2503.18783v1)\u003cbr\u003e:star:[code](https://github.com/Linwei-Chen/FDConv)\n* [A Unified Image-Dense Annotation Generation Model for Underwater Scenes](http://arxiv.org/abs/2503.21771v1)\n\n\n\u003ca name=\"49\"/\u003e\n\n\n## 49.Image Fusion\n* [DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion](http://arxiv.org/abs/2503.17673v1)\u003cbr\u003e:star:[code](https://github.com/Beate-Suy-Zhang/DCEvo)\n* [Task-driven Image Fusion with Learnable Fusion Loss](https://openaccess.thecvf.com/content/CVPR2025/html/Bai_Task-driven_Image_Fusion_with_Learnable_Fusion_Loss_CVPR_2025_paper.html)\n* [Binarized Neural Network for Multi-spectral Image Fusion](https://openaccess.thecvf.com/content/CVPR2025/html/Hou_Binarized_Neural_Network_for_Multi-spectral_Image_Fusion_CVPR_2025_paper.html)\n* [Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond](https://openaccess.thecvf.com/content/CVPR2025/html/Wu_Every_SAM_Drop_Counts_Embracing_Semantic_Priors_for_Multi-Modality_Image_CVPR_2025_paper.html)\n* [Self-Learning Hyperspectral and Multispectral Image Fusion via Adaptive Residual Guided Subspace Diffusion 
Model](https://openaccess.thecvf.com/content/CVPR2025/html/Zhu_Self-Learning_Hyperspectral_and_Multispectral_Image_Fusion_via_Adaptive_Residual_Guided_CVPR_2025_paper.html)\n* [One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion](http://arxiv.org/abs/2502.19854v1)\u003cbr\u003e:star:[code](https://github.com/AWCXV/GIFNet)\n* [A Selective Re-learning Mechanism for Hyperspectral Fusion Imaging](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_A_Selective_Re-learning_Mechanism_for_Hyperspectral_Fusion_Imaging_CVPR_2025_paper.html)\n\n\u003ca name=\"48\"/\u003e\n\n## 48.Feature Matching\n* [CoMatcher: Multi-View Collaborative Feature Matching](http://arxiv.org/abs/2504.01872v1)\n* [JamMa: Ultra-lightweight Local Feature Matching with Joint Mamba](http://arxiv.org/abs/2503.03437v1)\u003cbr\u003e:star:[code](https://leoluxxx.github.io/JamMa-page/)\n* [FG^2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching](https://openaccess.thecvf.com/content/CVPR2025/html/Xia_FG2_Fine-Grained_Cross-View_Localization_by_Fine-Grained_Feature_Matching_CVPR_2025_paper.html)\n* [EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching](https://openaccess.thecvf.com/content/CVPR2025/html/Jung_EDM_Equirectangular_Projection-Oriented_Dense_Kernelized_Feature_Matching_CVPR_2025_paper.html)\n\n\u003ca name=\"47\"/\u003e\n\n## 47.Industrial Anomaly Detection(Industrial Defect Detection)\n* [DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection](http://arxiv.org/abs/2503.13985v1)\n* [Towards Training-free Anomaly Detection with Vision and Language Foundation Models](http://arxiv.org/abs/2503.18325v1)\u003cbr\u003e:star:[code](https://github.com/zhang0jhon/LogSAD)\n* [The MVTec AD 2 Dataset: Advanced Scenarios for Unsupervised Anomaly 
Detection](http://arxiv.org/abs/2503.21622v1)\u003cbr\u003e:house:[project](https://benchmark.mvtec.com/)\u003cbr\u003e:house:[project](https://www.mvtec.com/company/research/datasets/mvtec-ad-2)\u003cbr\u003e:house:[project](https://sites.google.com/view/vand30cvpr2025/challenge)\n* [Wavelet and Prototype Augmented Query-based Transformer for Pixel-level Surface Defect Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Yan_Wavelet_and_Prototype_Augmented_Query-based_Transformer_for_Pixel-level_Surface_Defect_CVPR_2025_paper.html)\n* [Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties](https://openaccess.thecvf.com/content/CVPR2025/html/Li_Multi-Sensor_Object_Anomaly_Detection_Unifying_Appearance_Geometry_and_Internal_Properties_CVPR_2025_paper.html)\n* [AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios](https://openaccess.thecvf.com/content/CVPR2025/html/Huang_AnomalyNCD_Towards_Novel_Anomaly_Class_Discovery_in_Industrial_Scenarios_CVPR_2025_paper.html)\n* Anomaly Detection\n  * [One-for-More: Continual Diffusion Model for Anomaly Detection](http://arxiv.org/abs/2502.19848v1)\u003cbr\u003e:star:[code](https://github.com/FuNz-0/One-for-More)\n  * [AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP](http://arxiv.org/abs/2503.06661v1)\u003cbr\u003e:star:[code](https://github.com/Mwxinnn/AA-CLIP)\n  * [Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection](http://arxiv.org/abs/2503.02424v1)\u003cbr\u003e:star:[code](https://github.com/luow23/INP-Former)\n  * [Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection](http://arxiv.org/abs/2503.03562v1)\n  * [Distribution Prototype Diffusion Learning for Open-set Supervised Anomaly Detection](http://arxiv.org/abs/2502.20981v1)\n  * [UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly 
Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Gu_UniVAD_A_Training-free_Unified_Model_for_Few-shot_Visual_Anomaly_Detection_CVPR_2025_paper.html)\n  * [Unseen Visual Anomaly Generation](https://openaccess.thecvf.com/content/CVPR2025/html/Sun_Unseen_Visual_Anomaly_Generation_CVPR_2025_paper.html)\n  * [PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies](https://openaccess.thecvf.com/content/CVPR2025/html/Nafez_PatchGuard_Adversarially_Robust_Anomaly_Detection_and_Localization_through_Vision_Transformers_CVPR_2025_paper.html)\n  * [Odd-One-Out: Anomaly Detection by Comparing with Neighbors](https://openaccess.thecvf.com/content/CVPR2025/html/Bhunia_Odd-One-Out_Anomaly_Detection_by_Comparing_with_Neighbors_CVPR_2025_paper.html)\n  * [Beyond Single-Modal Boundary: Cross-Modal Anomaly Detection through Visual Prototype and Harmonization](https://openaccess.thecvf.com/content/CVPR2025/html/Mao_Beyond_Single-Modal_Boundary_Cross-Modal_Anomaly_Detection_through_Visual_Prototype_and_CVPR_2025_paper.html)\n  * [PIAD: Pose and Illumination agnostic Anomaly Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Yang_PIAD_Pose_and_Illumination_agnostic_Anomaly_Detection_CVPR_2025_paper.html)\n  * [DFM: Differentiable Feature Matching for Anomaly Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Wu_DFM_Differentiable_Feature_Matching_for_Anomaly_Detection_CVPR_2025_paper.html)\n  * [A Unified Latent Schrodinger Bridge Diffusion Model for Unsupervised Anomaly Detection and Localization](https://openaccess.thecvf.com/content/CVPR2025/html/Akshay_A_Unified_Latent_Schrodinger_Bridge_Diffusion_Model_for_Unsupervised_Anomaly_CVPR_2025_paper.html)\n  * [TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly 
Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Jung_TailedCore_Few-Shot_Sampling_for_Unsupervised_Long-Tail_Noisy_Anomaly_Detection_CVPR_2025_paper.html)\n  * [Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Qu_Bayesian_Prompt_Flow_Learning_for_Zero-Shot_Anomaly_Detection_CVPR_2025_paper.html)\n  * [Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Beizaee_Correcting_Deviations_from_Normality_A_Reformulated_Diffusion_Model_for_Multi-Class_CVPR_2025_paper.html)\n\n\u003ca name=\"46\"/\u003e\n\n## 46.Neural Radiance Fields\n* [LookCloser: Frequency-aware Radiance Field for Tiny-Detail Scene](http://arxiv.org/abs/2503.18513v1)\n* [RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings](http://arxiv.org/abs/2502.19781v1)\n* [LIRM: Large Inverse Rendering Model for Progressive Reconstruction of Shape, Materials and View-dependent Radiance Fields](https://openaccess.thecvf.com/content/CVPR2025/html/Li_LIRM_Large_Inverse_Rendering_Model_for_Progressive_Reconstruction_of_Shape_CVPR_2025_paper.html)\n* [PBR-NeRF: Inverse Rendering with Physics-Based Neural Fields](https://openaccess.thecvf.com/content/CVPR2025/html/Wu_PBR-NeRF_Inverse_Rendering_with_Physics-Based_Neural_Fields_CVPR_2025_paper.html)\n* [NeISF++: Neural Incident Stokes Field for Polarized Inverse Rendering of Conductors and Dielectrics](https://openaccess.thecvf.com/content/CVPR2025/html/Li_NeISF_Neural_Incident_Stokes_Field_for_Polarized_Inverse_Rendering_of_CVPR_2025_paper.html)\n* [Time of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields](https://openaccess.thecvf.com/content/CVPR2025/html/Li_Time_of_the_Flight_of_the_Gaussians_Optimizing_Depth_Indirectly_CVPR_2025_paper.html)\n* [Joint Optimization of Neural Radiance Fields and Continuous Camera 
Motion from a Monocular Video](https://openaccess.thecvf.com/content/CVPR2025/html/Nguyen_Joint_Optimization_of_Neural_Radiance_Fields_and_Continuous_Camera_Motion_CVPR_2025_paper.html)\n* [RelationField: Relate Anything in Radiance Fields](https://openaccess.thecvf.com/content/CVPR2025/html/Koch_RelationField_Relate_Anything_in_Radiance_Fields_CVPR_2025_paper.html)\n* [Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction](https://openaccess.thecvf.com/content/CVPR2025/html/Fang_Depth-Guided_Bundle_Sampling_for_Efficient_Generalizable_Neural_Radiance_Field_Reconstruction_CVPR_2025_paper.html)\n* View Synthesis\n  * [EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis](http://arxiv.org/abs/2503.20168v1)\n  * [NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting](http://arxiv.org/abs/2503.18794v1)\u003cbr\u003e:star:[code](https://usmizuki.github.io/NexusGS/)\n  * [SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs](http://arxiv.org/abs/2503.12535v1)\u003cbr\u003e:star:[code](https://gbliao.github.io/SPC-GS.github.io/)\n  * [CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis](http://arxiv.org/abs/2503.20998v1)\n  * [Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views](http://arxiv.org/abs/2503.24382v1)\u003cbr\u003e:star:[code](https://zju3dv.github.io/free360/)\n  * [LITA-GS: Illumination-Agnostic Novel View Synthesis via Reference-Free 3D Gaussian Splatting and Physical Priors](http://arxiv.org/abs/2504.00219v1)\u003cbr\u003e:star:[code](https://github.com/LowLevelAI/LITA-GS)\n  * [NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed 
Images](https://openaccess.thecvf.com/content/CVPR2025/html/Li_NVComposer_Boosting_Generative_Novel_View_Synthesis_with_Multiple_Sparse_and_CVPR_2025_paper.html)\n  * [MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes](https://openaccess.thecvf.com/content/CVPR2025/html/Lu_MOVIS_Enhancing_Multi-Object_Novel_View_Synthesis_for_Indoor_Scenes_CVPR_2025_paper.html)\n  * [Novel View Synthesis with Pixel-Space Diffusion Models](https://openaccess.thecvf.com/content/CVPR2025/html/Elata_Novel_View_Synthesis_with_Pixel-Space_Diffusion_Models_CVPR_2025_paper.html)\n  * [FrugalNeRF: Fast Convergence for Extreme Few-shot Novel View Synthesis without Learned Priors](https://openaccess.thecvf.com/content/CVPR2025/html/Lin_FrugalNeRF_Fast_Convergence_for_Extreme_Few-shot_Novel_View_Synthesis_without_CVPR_2025_paper.html)\n  * [Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion](https://openaccess.thecvf.com/content/CVPR2025/html/Guizilini_Zero-Shot_Novel_View_and_Depth_Synthesis_with_Multi-View_Geometric_Diffusion_CVPR_2025_paper.html)\n  * [AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis](https://openaccess.thecvf.com/content/CVPR2025/html/Vuong_AerialMegaDepth_Learning_Aerial-Ground_Reconstruction_and_View_Synthesis_CVPR_2025_paper.html)\n  * [GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_GoLF-NRT_Integrating_Global_Context_and_Local_Geometry_for_Few-Shot_View_CVPR_2025_paper.html)\n  * [SimVS: Simulating World Inconsistencies for Robust View Synthesis](https://openaccess.thecvf.com/content/CVPR2025/html/Trevithick_SimVS_Simulating_World_Inconsistencies_for_Robust_View_Synthesis_CVPR_2025_paper.html)\n  * [EVPGS: Enhanced View Prior Guidance for Splatting-based Extrapolated View 
Synthesis](https://openaccess.thecvf.com/content/CVPR2025/html/Li_EVPGS_Enhanced_View_Prior_Guidance_for_Splatting-based_Extrapolated_View_Synthesis_CVPR_2025_paper.html)\n  * [StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models](https://openaccess.thecvf.com/content/CVPR2025/html/Yan_StreetCrafter_Street_View_Synthesis_with_Controllable_Video_Diffusion_Models_CVPR_2025_paper.html)\n* Rendering\n  * [Differentiable Inverse Rendering with Interpretable Basis BRDFs](https://arxiv.org/abs/2411.17994)\n  * [Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes](http://arxiv.org/abs/2503.09993v1)\n  * [TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering](http://arxiv.org/abs/2503.18328v1)\u003cbr\u003e:star:[code](https://github.com/fudan-zvg/tensoflow)\n  * [MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction](http://arxiv.org/abs/2503.18363v1)\u003cbr\u003e:star:[code](https://wen-yuan-zhang.github.io/MonoInstance/)\n  * [BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation](http://arxiv.org/abs/2503.20672v1)\u003cbr\u003e:star:[code](https://bizgen-msra.github.io)\n  * [Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models](https://openaccess.thecvf.com/content/CVPR2025/html/Liang_Diffusion_Renderer_Neural_Inverse_and_Forward_Rendering_with_Video_Diffusion_CVPR_2025_paper.html)\n  * [3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes](https://openaccess.thecvf.com/content/CVPR2025/html/Held_3D_Convex_Splatting_Radiance_Field_Rendering_with_3D_Smooth_Convexes_CVPR_2025_paper.html)\n  * [Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering](https://openaccess.thecvf.com/content/CVPR2025/html/Sun_Sparse_Voxels_Rasterization_Real-time_High-fidelity_Radiance_Field_Rendering_CVPR_2025_paper.html)\n  * [AMO Sampler: Enhancing Text Rendering with 
Overshooting](https://openaccess.thecvf.com/content/CVPR2025/html/Hu_AMO_Sampler_Enhancing_Text_Rendering_with_Overshooting_CVPR_2025_paper.html)\n* 4D \n  * [4Deform: Neural Surface Deformation for Robust Shape Interpolation](http://arxiv.org/abs/2502.20208v1)\n  * [Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis](http://arxiv.org/abs/2503.03132v1)\u003cbr\u003e:star:[code](https://4d-dsns.github.io/DSNS/)\n  * [Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video](https://openaccess.thecvf.com/content/CVPR2025/html/Yao_Uni4D_Unifying_Visual_Foundation_Models_for_4D_Modeling_from_a_CVPR_2025_paper.html)\n  * [DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation](https://openaccess.thecvf.com/content/CVPR2025/html/Zhao_DriveDreamer4D_World_Models_Are_Effective_Data_Machines_for_4D_Driving_CVPR_2025_paper.html)\n  * [Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields](https://openaccess.thecvf.com/content/CVPR2025/html/Zhou_Feature4X_Bridging_Any_Monocular_Video_to_4D_Agentic_AI_with_CVPR_2025_paper.html)\n  * [MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds](https://openaccess.thecvf.com/content/CVPR2025/html/Lei_MoSca_Dynamic_Gaussian_Fusion_from_Casual_Videos_via_4D_Motion_CVPR_2025_paper.html)\n  * [DIO: Decomposable Implicit 4D Occupancy-Flow World Model](https://openaccess.thecvf.com/content/CVPR2025/html/Diehl_DIO_Decomposable_Implicit_4D_Occupancy-Flow_World_Model_CVPR_2025_paper.html)\n  * [DNF: Unconditional 4D Generation with Dictionary-based Neural Fields](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_DNF_Unconditional_4D_Generation_with_Dictionary-based_Neural_Fields_CVPR_2025_paper.html)\n  * [4D-Fly: Fast 4D Reconstruction from a Single Monocular 
Video](https://openaccess.thecvf.com/content/CVPR2025/html/Wu_4D-Fly_Fast_4D_Reconstruction_from_a_Single_Monocular_Video_CVPR_2025_paper.html)\n  * [CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models](https://openaccess.thecvf.com/content/CVPR2025/html/Wu_CAT4D_Create_Anything_in_4D_with_Multi-View_Video_Diffusion_Models_CVPR_2025_paper.html)\n  * [Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos](https://openaccess.thecvf.com/content/CVPR2025/html/Jin_Stereo4D_Learning_How_Things_Move_in_3D_from_Internet_Stereo_CVPR_2025_paper.html)\n  * [GIFStream: 4D Gaussian-based Immersive Video with Feature Stream](https://openaccess.thecvf.com/content/CVPR2025/html/Li_GIFStream_4D_Gaussian-based_Immersive_Video_with_Feature_Stream_CVPR_2025_paper.html)\n  * [FIction: 4D Future Interaction Prediction from Video](https://openaccess.thecvf.com/content/CVPR2025/html/Ashutosh_FIction_4D_Future_Interaction_Prediction_from_Video_CVPR_2025_paper.html)\n  * [NTR-Gaussian: Nighttime Dynamic Thermal Reconstruction with 4D Gaussian Splatting Based on Thermodynamics](https://openaccess.thecvf.com/content/CVPR2025/html/Yang_NTR-Gaussian_Nighttime_Dynamic_Thermal_Reconstruction_with_4D_Gaussian_Splatting_Based_CVPR_2025_paper.html)\n  * [DrivingSphere: Building a High-fidelity 4D World for Closed-loop Simulation](https://openaccess.thecvf.com/content/CVPR2025/html/Yan_DrivingSphere_Building_a_High-fidelity_4D_World_for_Closed-loop_Simulation_CVPR_2025_paper.html)\n  * [4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians](https://openaccess.thecvf.com/content/CVPR2025/html/Matsuki_4DTAM_Non-Rigid_Tracking_and_Mapping_via_Dynamic_Surface_Gaussians_CVPR_2025_paper.html)\n  * [Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene 
Simulation](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_Unleashing_the_Potential_of_Multi-modal_Foundation_Models_and_Video_Diffusion_CVPR_2025_paper.html)\n  * [Robust Multi-Object 4D Generation for In-the-wild Videos](https://openaccess.thecvf.com/content/CVPR2025/html/Chu_Robust_Multi-Object_4D_Generation_for_In-the-wild_Videos_CVPR_2025_paper.html)\n  * [4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_4Real-Video_Learning_Generalizable_Photo-Realistic_4D_Video_Diffusion_CVPR_2025_paper.html)\n\n\u003ca name=\"45\"/\u003e\n\n## 45.Anomaly Detection(异常检测)\n* OOD\n  * [CADRef: Robust Out-of-Distribution Detection via Class-Aware Decoupled Relative Feature Leveraging](http://arxiv.org/abs/2503.00325v1)\n  * [Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations](http://arxiv.org/abs/2503.18817v1)\n  * [ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth Networks](http://arxiv.org/abs/2503.21397v1)\u003cbr\u003e:star:[code](https://github.com/walline/prohoc)\n  * [DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Li_DPU_Dynamic_Prototype_Updating_for_Multimodal_Out-of-Distribution_Detection_CVPR_2025_paper.html)\n  * [Dual Energy-Based Model with Open-World Uncertainty Estimation for Out-of-distribution Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_Dual_Energy-Based_Model_with_Open-World_Uncertainty_Estimation_for_Out-of-distribution_Detection_CVPR_2025_paper.html)\n  * [OODD: Test-time Out-of-Distribution Detection with Dynamic Dictionary](https://openaccess.thecvf.com/content/CVPR2025/html/Yang_OODD_Test-time_Out-of-Distribution_Detection_with_Dynamic_Dictionary_CVPR_2025_paper.html)\n  * [Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution 
Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Xu_Overcoming_Shortcut_Problem_in_VLM_for_Robust_Out-of-Distribution_Detection_CVPR_2025_paper.html)\n  * [H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_H2ST_Hierarchical_Two-Sample_Tests_for_Continual_Out-of-Distribution_Detection_CVPR_2025_paper.html)\n  * [Beyond Clean Training Data: A Versatile and Model-Agnostic Framework for Out-of-Distribution Detection with Contaminated Training Data](https://openaccess.thecvf.com/content/CVPR2025/html/Li_Beyond_Clean_Training_Data_A_Versatile_and_Model-Agnostic_Framework_for_CVPR_2025_paper.html)\n  * [Leveraging Perturbation Robustness to Enhance Out-of-Distribution Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_Leveraging_Perturbation_Robustness_to_Enhance_Out-of-Distribution_Detection_CVPR_2025_paper.html)\n  * [Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy](https://openaccess.thecvf.com/content/CVPR2025/html/Jeong_Playing_the_Fool_Jailbreaking_LLMs_and_Multimodal_LLMs_with_Out-of-Distribution_CVPR_2025_paper.html)\n  * [On the Out-Of-Distribution Generalization of Large Multimodal Models](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_On_the_Out-Of-Distribution_Generalization_of_Large_Multimodal_Models_CVPR_2025_paper.html)\n  * [Detecting Out-of-Distribution Through the Lens of Neural Collapse](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_Detecting_Out-of-Distribution_Through_the_Lens_of_Neural_Collapse_CVPR_2025_paper.html)\n  * [Open Set Label Shift with Test Time Out-of-Distribution Reference](https://openaccess.thecvf.com/content/CVPR2025/html/Ye_Open_Set_Label_Shift_with_Test_Time_Out-of-Distribution_Reference_CVPR_2025_paper.html)\n  * [Simplification Is All You Need against Out-of-Distribution 
Overconfidence](https://openaccess.thecvf.com/content/CVPR2025/html/Tang_Simplification_Is_All_You_Need_against_Out-of-Distribution_Overconfidence_CVPR_2025_paper.html)\n* Image Anomaly Detection\n  * [FlexUOD: The Answer to Real-world Unsupervised Image Outlier Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_FlexUOD_The_Answer_to_Real-world_Unsupervised_Image_Outlier_Detection_CVPR_2025_paper.html)\n\n\u003ca name=\"44\"/\u003e\n\n## 44.Object Pose Estimation(物体姿态估计)\n* [Co-op: Correspondence-based Novel Object Pose Estimation](http://arxiv.org/abs/2503.17731v1)\n* [GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation](http://arxiv.org/abs/2503.15110v1)\u003cbr\u003e:star:[code](https://github.com/ziqin-h/GIVEPose)\n* [GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation](https://openaccess.thecvf.com/content/CVPR2025/html/Li_GCE-Pose_Global_Context_Enhancement_for_Category-level_Object_Pose_Estimation_CVPR_2025_paper.html)\n* [UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_UNOPose_Unseen_Object_Pose_Estimation_with_an_Unposed_RGB-D_Reference_CVPR_2025_paper.html)\n* [Rethinking Correspondence-based Category-Level Object Pose Estimation](https://openaccess.thecvf.com/content/CVPR2025/html/Ren_Rethinking_Correspondence-based_Category-Level_Object_Pose_Estimation_CVPR_2025_paper.html)\n* [CRISP: Object Pose and Shape Estimation with Test-Time Adaptation](https://openaccess.thecvf.com/content/CVPR2025/html/Shi_CRISP_Object_Pose_and_Shape_Estimation_with_Test-Time_Adaptation_CVPR_2025_paper.html)\n* 6D\n  * [Any6D: Model-free 6D Pose Estimation of Novel Objects](http://arxiv.org/abs/2503.18673v1)\u003cbr\u003e:house:[project](https://taeyeop.com/any6d)\n  * [RefPose: Leveraging Reference Geometric Correspondences for Accurate 6D Pose Estimation of Unseen Objects](http://arxiv.org/abs/2505.10841v1)\n  * 
[UA-Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References](https://openaccess.thecvf.com/content/CVPR2025/html/Li_UA-Pose_Uncertainty-Aware_6D_Object_Pose_Estimation_and_Online_Object_Completion_CVPR_2025_paper.html)\n  * [ONDA-Pose: Occlusion-Aware Neural Domain Adaptation for Self-Supervised 6D Object Pose Estimation](https://openaccess.thecvf.com/content/CVPR2025/html/Tan_ONDA-Pose_Occlusion-Aware_Neural_Domain_Adaptation_for_Self-Supervised_6D_Object_Pose_CVPR_2025_paper.html)\n  * [iG-6DoF: Model-free 6DoF Pose Estimation for Unseen Object via Iterative 3D Gaussian Splatting](https://openaccess.thecvf.com/content/CVPR2025/html/Cao_iG-6DoF_Model-free_6DoF_Pose_Estimation_for_Unseen_Object_via_Iterative_CVPR_2025_paper.html)\n  * [Leveraging Global Stereo Consistency for Category-Level Shape and 6D Pose Estimation from Stereo Images](https://openaccess.thecvf.com/content/CVPR2025/html/Qiu_Leveraging_Global_Stereo_Consistency_for_Category-Level_Shape_and_6D_Pose_CVPR_2025_paper.html)\n  * [One2Any: One-Reference 6D Pose Estimation for Any Object](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_One2Any_One-Reference_6D_Pose_Estimation_for_Any_Object_CVPR_2025_paper.html)\n  * [Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision](https://openaccess.thecvf.com/content/CVPR2025/html/Yoshida_Generating_6DoF_Object_Manipulation_Trajectories_from_Action_Description_in_Egocentric_CVPR_2025_paper.html)\n  * [Pos3R: 6D Pose Estimation for Unseen Objects Made Easy](https://openaccess.thecvf.com/content/CVPR2025/html/Deng_Pos3R_6D_Pose_Estimation_for_Unseen_Objects_Made_Easy_CVPR_2025_paper.html)\n  * [CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image](https://openaccess.thecvf.com/content/CVPR2025/html/Huang_CAP-Net_A_Unified_Network_for_6D_Pose_and_Size_Estimation_CVPR_2025_paper.html)\n\n\n\u003ca 
name=\"43\"/\u003e\n\n## 43.Object Re-Id/Counting(计数)\n* [T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting](http://arxiv.org/abs/2502.20625v1)\u003cbr\u003e:star:[code](https://github.com/cha15yq/T2ICount)\n* [AirRoom: Objects Matter in Room Reidentification](http://arxiv.org/abs/2503.01130v1)\n* [Single Domain Generalization for Few-Shot Counting via Universal Representation Matching](http://arxiv.org/abs/2505.16778v1)\n* Object Re-Identification\n  * [IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification](http://arxiv.org/abs/2503.10324v1)\n\n\u003ca name=\"42\"/\u003e\n\n\n## 42.Graph Neural Network(GNN/GCN)\n* [Graph Neural Network Combining Event Stream and Periodic Aggregation for Low-Latency Event-based Vision](https://openaccess.thecvf.com/content/CVPR2025/html/Dampfhoffer_Graph_Neural_Network_Combining_Event_Stream_and_Periodic_Aggregation_for_CVPR_2025_paper.html)\n* [Deterministic Certification of Graph Neural Networks against Graph Poisoning Attacks with Arbitrary Perturbations](https://openaccess.thecvf.com/content/CVPR2025/html/Li_Deterministic_Certification_of_Graph_Neural_Networks_against_Graph_Poisoning_Attacks_CVPR_2025_paper.html)\n\n\u003ca name=\"41\"/\u003e\n\n## 41.Few/Zero-Shot Learning/DG/A(小/零样本/域泛化/域适应)\n* FSL\n  * [Logits DeConfusion with CLIP for Few-Shot Learning](https://openaccess.thecvf.com/content/CVPR2025/html/Li_Logits_DeConfusion_with_CLIP_for_Few-Shot_Learning_CVPR_2025_paper.html)\n  * [ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning](https://openaccess.thecvf.com/content/CVPR2025/html/Yang_ImagineFSL_Self-Supervised_Pretraining_Matters_on_Imagined_Base_Set_for_VLM-based_CVPR_2025_paper.html)\n  * [UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning](https://openaccess.thecvf.com/content/CVPR2025/html/Zhou_UNEM_UNrolled_Generalized_EM_for_Transductive_Few-Shot_Learning_CVPR_2025_paper.html)\n* ZSL\n  * [Visual and 
Semantic Prompt Collaboration for Generalized Zero-Shot Learning](http://arxiv.org/abs/2503.23030v1)\n  * [LOGICZSL: Exploring Logic-induced Representation for Compositional Zero-shot Learning](https://openaccess.thecvf.com/content/CVPR2025/html/Wu_LOGICZSL_Exploring_Logic-induced_Representation_for_Compositional_Zero-shot_Learning_CVPR_2025_paper.html)\n* DG \n  * [Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection](http://arxiv.org/abs/2503.02101v1)\u003cbr\u003e:star:[code](https://github.com/heboyong/Generalized-Diffusion-Detector)\n  * [Unlocking the Potential of Unlabeled Data in Semi-Supervised Domain Generalization](http://arxiv.org/abs/2503.13915v1)\u003cbr\u003e:star:[code](https://github.com/dongkwani/UPCSC)\n  * [OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP](http://arxiv.org/abs/2503.16106v1)\n  * [When Domain Generalization meets Generalized Category Discovery: An Adaptive Task-Arithmetic Driven Approach](https://openaccess.thecvf.com/content/CVPR2025/html/Rathore_When_Domain_Generalization_meets_Generalized_Category_Discovery_An_Adaptive_Task-Arithmetic_CVPR_2025_paper.html)\n  * [Domain Generalization in CLIP via Learning with Diverse Text Prompts](https://openaccess.thecvf.com/content/CVPR2025/html/Wen_Domain_Generalization_in_CLIP_via_Learning_with_Diverse_Text_Prompts_CVPR_2025_paper.html)\n  * [SoMA: Singular Value Decomposed Minor Components Adaptation for Domain Generalizable Representation Learning](https://openaccess.thecvf.com/content/CVPR2025/html/Yun_SoMA_Singular_Value_Decomposed_Minor_Components_Adaptation_for_Domain_Generalizable_CVPR_2025_paper.html)\n  * [PEER Pressure: Model-to-Model Regularization for Single Source Domain Generalization](https://openaccess.thecvf.com/content/CVPR2025/html/Cho_PEER_Pressure_Model-to-Model_Regularization_for_Single_Source_Domain_Generalization_CVPR_2025_paper.html)\n  * [Seeking Consistent Flat 
Minima for Better Domain Generalization via Refining Loss Landscapes](https://openaccess.thecvf.com/content/CVPR2025/html/Li_Seeking_Consistent_Flat_Minima_for_Better_Domain_Generalization_via_Refining_CVPR_2025_paper.html)\n  * [Gradient-Guided Annealing for Domain Generalization](https://openaccess.thecvf.com/content/CVPR2025/html/Ballas_Gradient-Guided_Annealing_for_Domain_Generalization_CVPR_2025_paper.html)\n  * [Adversarial Domain Prompt Tuning and Generation for Single Domain Generalization](https://openaccess.thecvf.com/content/CVPR2025/html/Xu_Adversarial_Domain_Prompt_Tuning_and_Generation_for_Single_Domain_Generalization_CVPR_2025_paper.html)\n  * [Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Balanced_Direction_from_Multifarious_Choices_Arithmetic_Meta-Learning_for_Domain_Generalization_CVPR_2025_paper.html)\n  * [TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction](https://openaccess.thecvf.com/content/CVPR2025/html/Agarwal_TIDE_Training_Locally_Interpretable_Domain_Generalization_Models_Enables_Test-time_Correction_CVPR_2025_paper.html)\n* DA\n  * [Distinguish Then Exploit: Source-free Open Set Domain Adaptation via Weight Barcode Estimation and Sparse Label Assignment](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_Distinguish_Then_Exploit_Source-free_Open_Set_Domain_Adaptation_via_Weight_CVPR_2025_paper.html)\n  * [Link-based Contrastive Learning for One-Shot Unsupervised Domain Adaptation](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_Link-based_Contrastive_Learning_for_One-Shot_Unsupervised_Domain_Adaptation_CVPR_2025_paper.html)\n  * [Revisiting Source-Free Domain Adaptation: Insights into Representativeness, Generalization, and 
Variety](https://openaccess.thecvf.com/content/CVPR2025/html/Zhu_Revisiting_Source-Free_Domain_Adaptation_Insights_into_Representativeness_Generalization_and_Variety_CVPR_2025_paper.html)\n  * [ADU: Adaptive Detection of Unknown Categories in Black-Box Domain Adaptation](https://openaccess.thecvf.com/content/CVPR2025/html/Lai_ADU_Adaptive_Detection_of_Unknown_Categories_in_Black-Box_Domain_Adaptation_CVPR_2025_paper.html)\n  * [MODfinity: Unsupervised Domain Adaptation with Multimodal Information Flow Intertwining](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_MODfinity_Unsupervised_Domain_Adaptation_with_Multimodal_Information_Flow_Intertwining_CVPR_2025_paper.html)\n  * [Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation](https://openaccess.thecvf.com/content/CVPR2025/html/Vuong_Preserving_Clusters_in_Prompt_Learning_for_Unsupervised_Domain_Adaptation_CVPR_2025_paper.html)\n* Generalized Category Discovery\n  * [GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_GET_Unlocking_the_Multi-modal_Potential_of_CLIP_for_Generalized_Category_CVPR_2025_paper.html)\n  * [Adaptive Part Learning for Fine-Grained Generalized Category Discovery: A Plug-and-Play Enhancement](https://openaccess.thecvf.com/content/CVPR2025/html/Dai_Adaptive_Part_Learning_for_Fine-Grained_Generalized_Category_Discovery_A_Plug-and-Play_CVPR_2025_paper.html)\n  * [Less Attention is More: Prompt Transformer for Generalized Category Discovery](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_Less_Attention_is_More_Prompt_Transformer_for_Generalized_Category_Discovery_CVPR_2025_paper.html)\n  * [MOS: Modeling Object-Scene Associations in Generalized Category Discovery](https://openaccess.thecvf.com/content/CVPR2025/html/Peng_MOS_Modeling_Object-Scene_Associations_in_Generalized_Category_Discovery_CVPR_2025_paper.html)\n\n\u003ca name=\"40\"/\u003e\n\n\n## 40.Deepfake Detection/AI-Generated Image Detection\n* 
[FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing](https://openaccess.thecvf.com/content/CVPR2025/html/Kashiani_FreqDebias_Towards_Generalizable_Deepfake_Detection_via_Consistency-Driven_Frequency_Debiasing_CVPR_2025_paper.html)\n* [D^3: Scaling Up Deepfake Detection by Learning from Discrepancy](https://openaccess.thecvf.com/content/CVPR2025/html/Yang_D3_Scaling_Up_Deepfake_Detection_by_Learning_from_Discrepancy_CVPR_2025_paper.html)\n* [SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model](https://openaccess.thecvf.com/content/CVPR2025/html/Huang_SIDA_Social_Media_Image_Deepfake_Detection_Localization_and_Explanation_with_CVPR_2025_paper.html)\n* [Where the Devil Hides: Deepfake Detectors Can No Longer Be Trusted](https://openaccess.thecvf.com/content/CVPR2025/html/Yuan_Where_the_Devil_Hides_Deepfake_Detectors_Can_No_Longer_Be_CVPR_2025_paper.html)\n* AI-Generated Image Detection\n  * [Towards Universal AI-Generated Image Detection by Variational Information Bottleneck Network](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_Towards_Universal_AI-Generated_Image_Detection_by_Variational_Information_Bottleneck_Network_CVPR_2025_paper.html)\n  * [A Bias-Free Training Paradigm for More General AI-generated Image Detection](https://openaccess.thecvf.com/content/CVPR2025/html/Guillaro_A_Bias-Free_Training_Paradigm_for_More_General_AI-generated_Image_Detection_CVPR_2025_paper.html)\n  * [Any-Resolution AI-Generated Image Detection by Spectral Learning](https://openaccess.thecvf.com/content/CVPR2025/html/Karageorgiou_Any-Resolution_AI-Generated_Image_Detection_by_Spectral_Learning_CVPR_2025_paper.html)\n  * [Beyond Generation: A Diffusion-based Low-level Feature Extractor for Detecting AI-generated Images](https://openaccess.thecvf.com/content/CVPR2025/html/Zhong_Beyond_Generation_A_Diffusion-based_Low-level_Feature_Extractor_for_Detecting_AI-generated_CVPR_2025_paper.html)\n  * 
[Where's the Liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content](https://openaccess.thecvf.com/content/CVPR2025/html/Bai_Wheres_the_Liability_in_the_Generative_Era_Recovery-based_Black-Box_Detection_CVPR_2025_paper.html)\n  * [Secret Lies in Color: Enhancing AI-Generated Images Detection with Color Distribution Analysis](https://openaccess.thecvf.com/content/CVPR2025/html/Jia_Secret_Lies_in_Color_Enhancing_AI-Generated_Images_Detection_with_Color_CVPR_2025_paper.html)\n* Forgery Detection\n  * [Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Forensics-Bench_A_Comprehensive_Forgery_Detection_Benchmark_Suite_for_Large_Vision_CVPR_2025_paper.html)\n  * [Detecting Adversarial Data Using Perturbation Forgery](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Detecting_Adversarial_Data_Using_Perturbation_Forgery_CVPR_2025_paper.html)\n  * [Community Forensics: Using Thousands of Generators to Train Fake Image Detectors](https://openaccess.thecvf.com/content/CVPR2025/html/Park_Community_Forensics_Using_Thousands_of_Generators_to_Train_Fake_Image_CVPR_2025_paper.html)\n* Forged Video Detection\n  * [Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning](https://openaccess.thecvf.com/content/CVPR2025/html/Yan_Generalizing_Deepfake_Video_Detection_with_Plug-and-Play_Video-Level_Blending_and_Spatiotemporal_CVPR_2025_paper.html)\n\n\u003ca name=\"39\"/\u003e\n\n## 39.Vision Transformers\n* [Split Adaptation for Pre-trained Vision Transformers](http://arxiv.org/abs/2503.00441v1)\u003cbr\u003e:star:[code](https://github.com/conditionWang/Split_Adaptation)\n* [BHViT: Binarized Hybrid Vision Transformer](http://arxiv.org/abs/2503.02394v1)\n* [VGGT: Visual Geometry Grounded 
Transformer](http://arxiv.org/abs/2503.11651v1)\u003cbr\u003e:star:[code](https://vgg-t.github.io/)\u003cbr\u003e:star:[code](https://github.com/facebookresearch/vggt)\n* [ERUPT: Efficient Rendering with Unposed Patch Transformer](http://arxiv.org/abs/2503.24374v1)\n* [Spiking Transformer:Introducing Accurate Addition-Only Spiking Self-Attention for Transformer](http://arxiv.org/abs/2503.00226v1)\n* [Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement](http://arxiv.org/abs/2503.15404v1)\u003cbr\u003e:star:[code](https://github.com/RYC-98/FPR)\n* [Hypergraph Vision Transformers: Images are More than Nodes, More than Edges](https://openaccess.thecvf.com/content/CVPR2025/html/Fixelle_Hypergraph_Vision_Transformers_Images_are_More_than_Nodes_More_than_CVPR_2025_paper.html)\n* [LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions](https://openaccess.thecvf.com/content/CVPR2025/html/Mehri_LibraGrad_Balancing_Gradient_Flow_for_Universally_Better_Vision_Transformer_Attributions_CVPR_2025_paper.html)\n* [Your Scale Factors are My Weapon: Targeted Bit-Flip Attacks on Vision Transformers via Scale Factor Manipulation](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Your_Scale_Factors_are_My_Weapon_Targeted_Bit-Flip_Attacks_on_CVPR_2025_paper.html)\n* [Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers](https://openaccess.thecvf.com/content/CVPR2025/html/Hong_Comprehensive_Information_Bottleneck_for_Unveiling_Universal_Attribution_to_Interpret_Vision_CVPR_2025_paper.html)\n* [Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis](https://openaccess.thecvf.com/content/CVPR2025/html/Chowdhury_Prompt-CAM_Making_Vision_Transformers_Interpretable_for_Fine-Grained_Analysis_CVPR_2025_paper.html)\n* [SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision 
Transformers](https://openaccess.thecvf.com/content/CVPR2025/html/Nikzad_SATA_Spatial_Autocorrelation_Token_Analysis_for_Enhancing_the_Robustness_of_CVPR_2025_paper.html)\n* [DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers](https://openaccess.thecvf.com/content/CVPR2025/html/Ren_DA-VPT_Semantic-Guided_Visual_Prompt_Tuning_for_Vision_Transformers_CVPR_2025_paper.html)\n* [CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction](https://openaccess.thecvf.com/content/CVPR2025/html/Zhou_CARE_Transformer_Mobile-Friendly_Linear_Visual_Transformer_via_Decoupled_Dual_Interaction_CVPR_2025_paper.html)\n\n\n\n\n\n\u003ca name=\"38\"/\u003e\n\n## 38.Dataset/Benchmark(数据集/基准)\n* Benchmarks\n  * [MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research](http://arxiv.org/abs/2503.13399v1)\n  * [Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos](http://arxiv.org/abs/2503.13646v1)\u003cbr\u003e:star:[code](https://github.com/google-research-datasets/egotempo.git)\n  * [Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion](http://arxiv.org/abs/2503.22262v1)\u003cbr\u003e:star:[code](https://mono2stereo-bench.github.io/)\n  * [Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks](http://arxiv.org/abs/2503.18637v1)\u003cbr\u003e:star:[code](https://utd-project.github.io/)\n  * [VinaBench: Benchmark for Faithful and Consistent Visual Narratives](http://arxiv.org/abs/2503.20871v1)\n  * [OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts](http://arxiv.org/abs/2503.22952v1)\n  * [CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation](https://openaccess.thecvf.com/content/CVPR2025/html/Long_CheckManual_A_New_Challenge_and_Benchmark_for_Manual-based_Appliance_Manipulation_CVPR_2025_paper.html)\n  * [OmniDocBench: Benchmarking Diverse PDF Document Parsing 
with Comprehensive Annotations](https://openaccess.thecvf.com/content/CVPR2025/html/Ouyang_OmniDocBench_Benchmarking_Diverse_PDF_Document_Parsing_with_Comprehensive_Annotations_CVPR_2025_paper.html)\n  * [EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark](https://openaccess.thecvf.com/content/CVPR2025/html/Li_EEE-Bench_A_Comprehensive_Multimodal_Electrical_And_Electronics_Engineering_Benchmark_CVPR_2025_paper.html)\n  * [ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems](https://openaccess.thecvf.com/content/CVPR2025/html/Xue_ComfyBench_Benchmarking_LLM-based_Agents_in_ComfyUI_for_Autonomously_Designing_Collaborative_CVPR_2025_paper.html)\n  * [Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Is_Your_World_Simulator_a_Good_Story_Presenter_A_Consecutive_CVPR_2025_paper.html)\n  * [Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_Q-Bench-Video_Benchmark_the_Video_Quality_Understanding_of_LMMs_CVPR_2025_paper.html)\n  * [FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding](https://openaccess.thecvf.com/content/CVPR2025/html/Gao_FSBench_A_Figure_Skating_Benchmark_for_Advancing_Artistic_Sports_Understanding_CVPR_2025_paper.html)\n  * [Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Mutimodal Models](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Spatial457_A_Diagnostic_Benchmark_for_6D_Spatial_Reasoning_of_Large_CVPR_2025_paper.html)\n  * [Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map](https://openaccess.thecvf.com/content/CVPR2025/html/Chang_Driving_by_the_Rules_A_Benchmark_for_Integrating_Traffic_Sign_CVPR_2025_paper.html)\n  * [Towards Long-Horizon Vision-Language 
Navigation: Platform, Benchmark and Method](https://openaccess.thecvf.com/content/CVPR2025/html/Song_Towards_Long-Horizon_Vision-Language_Navigation_Platform_Benchmark_and_Method_CVPR_2025_paper.html)\n  * [SMTPD: A New Benchmark for Temporal Prediction of Social Media Popularity](https://openaccess.thecvf.com/content/CVPR2025/html/Xu_SMTPD_A_New_Benchmark_for_Temporal_Prediction_of_Social_Media_CVPR_2025_paper.html)\n  * [PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction](https://openaccess.thecvf.com/content/CVPR2025/html/Poesina_PQPP_A_Joint_Benchmark_for_Text-to-Image_Prompt_and_Query_Performance_CVPR_2025_paper.html)\n  * [NSD-Imagery: A Benchmark Dataset for Extending fMRI Vision Decoding Methods to Mental Imagery](https://openaccess.thecvf.com/content/CVPR2025/html/Kneeland_NSD-Imagery_A_Benchmark_Dataset_for_Extending_fMRI_Vision_Decoding_Methods_CVPR_2025_paper.html)\n  * [From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing](https://openaccess.thecvf.com/content/CVPR2025/html/Wei_From_Words_to_Structured_Visuals_A_Benchmark_and_Framework_for_CVPR_2025_paper.html)\n  * [Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning](https://openaccess.thecvf.com/content/CVPR2025/html/Zhu_Mosaic_of_Modalities_A_Comprehensive_Benchmark_for_Multimodal_Graph_Learning_CVPR_2025_paper.html)\n  * [RUBIK: A Structured Benchmark for Image Matching across Geometric Challenges](https://openaccess.thecvf.com/content/CVPR2025/html/Loiseau_RUBIK_A_Structured_Benchmark_for_Image_Matching_across_Geometric_Challenges_CVPR_2025_paper.html)\n  * [HuPerFlow: A Comprehensive Benchmark for Human vs. 
Machine Motion Estimation Comparison](https://openaccess.thecvf.com/content/CVPR2025/html/Yang_HuPerFlow_A_Comprehensive_Benchmark_for_Human_vs._Machine_Motion_Estimation_CVPR_2025_paper.html)\n  * [OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation](https://openaccess.thecvf.com/content/CVPR2025/html/Zhou_OpenING_A_Comprehensive_Benchmark_for_Judging_Open-ended_Interleaved_Image-Text_Generation_CVPR_2025_paper.html)\n  * [Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding](https://openaccess.thecvf.com/content/CVPR2025/html/Zhao_Can_Machines_Understand_Composition_Dataset_and_Benchmark_for_Photographic_Image_CVPR_2025_paper.html)\n  * [LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos](https://openaccess.thecvf.com/content/CVPR2025/html/Geng_LongVALE_Vision-Audio-Language-Event_Benchmark_Towards_Time-Aware_Omni-Modal_Perception_of_Long_Videos_CVPR_2025_paper.html)\n  * [SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_SeriesBench_A_Benchmark_for_Narrative-Driven_Drama_Series_Understanding_CVPR_2025_paper.html)\n  * [Quad-Pixel Image Defocus Deblurring: A New Benchmark and Model](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_Quad-Pixel_Image_Defocus_Deblurring_A_New_Benchmark_and_Model_CVPR_2025_paper.html)\n  * [MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval](https://openaccess.thecvf.com/content/CVPR2025/html/Kriz_MultiVENT_2.0_A_Massive_Multilingual_Benchmark_for_Event-Centric_Video_Retrieval_CVPR_2025_paper.html)\n  * [VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models](https://openaccess.thecvf.com/content/CVPR2025/html/Li_VL-RewardBench_A_Challenging_Benchmark_for_Vision-Language_Generative_Reward_Models_CVPR_2025_paper.html)\n* Datasets\n  
* [LiSu: A Dataset and Method for LiDAR Surface Normal Estimation](http://arxiv.org/abs/2503.08601v1)\n  * [HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization](http://arxiv.org/abs/2503.01725v1)\u003cbr\u003e:star:[code](https://harmonyset.github.io/)\n  * [MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps](http://arxiv.org/abs/2503.18223v1)\u003cbr\u003e:star:[code](https://github.com/eceo-epfl/MammAlps)\n  * [MultimodalStudio: A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities](http://arxiv.org/abs/2503.19673v1)\n  * [RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives](http://arxiv.org/abs/2503.21459v1)\u003cbr\u003e:star:[code](https://roadsocial.github.io/)\n  * [ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate](http://arxiv.org/abs/2503.21268v1)\u003cbr\u003e:house:[project](http://www.lidarhumanmotion.net/climbingcap/)\n  * [OpticalNet: An Optical Imaging Dataset and Benchmark Beyond the Diffraction Limit](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_OpticalNet_An_Optical_Imaging_Dataset_and_Benchmark_Beyond_the_Diffraction_CVPR_2025_paper.html)\n  * [HD-EPIC: A Highly-Detailed Egocentric Video Dataset](https://openaccess.thecvf.com/content/CVPR2025/html/Perrett_HD-EPIC_A_Highly-Detailed_Egocentric_Video_Dataset_CVPR_2025_paper.html)\n  * [MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments](https://openaccess.thecvf.com/content/CVPR2025/html/Ozsoy_MM-OR_A_Large_Multimodal_Operating_Room_Dataset_for_Semantic_Understanding_CVPR_2025_paper.html)\n  * [VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame 
Selection](https://openaccess.thecvf.com/content/CVPR2025/html/Han_VideoEspresso_A_Large-Scale_Chain-of-Thought_Dataset_for_Fine-Grained_Video_Reasoning_via_CVPR_2025_paper.html)\n  * [EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision](https://openaccess.thecvf.com/content/CVPR2025/html/Zhao_EgoPressure_A_Dataset_for_Hand_Pressure_and_Pose_Estimation_in_CVPR_2025_paper.html)\n  * [BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation](https://openaccess.thecvf.com/content/CVPR2025/html/Pan_BASKET_A_Large-Scale_Video_Dataset_for_Fine-Grained_Skill_Estimation_CVPR_2025_paper.html)\n  * [RealEdit: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations](https://openaccess.thecvf.com/content/CVPR2025/html/Sushko_RealEdit_Reddit_Edits_As_a_Large-scale_Empirical_Dataset_for_Image_CVPR_2025_paper.html)\n  * [CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_CoMM_A_Coherent_Interleaved_Image-Text_Dataset_for_Multimodal_Understanding_and_CVPR_2025_paper.html)\n  * [GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities](https://openaccess.thecvf.com/content/CVPR2025/html/Fu_GigaHands_A_Massive_Annotated_Dataset_of_Bimanual_Hand_Activities_CVPR_2025_paper.html)\n  * [Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving](https://openaccess.thecvf.com/content/CVPR2025/html/Nekrasov_Spotting_the_Unexpected_STU_A_3D_LiDAR_Dataset_for_Anomaly_CVPR_2025_paper.html)\n  * [Fish-Vista: A Multi-Purpose Dataset for Understanding \u0026 Identification of Traits from Images](https://openaccess.thecvf.com/content/CVPR2025/html/Mehrab_Fish-Vista_A_Multi-Purpose_Dataset_for_Understanding__Identification_of_Traits_CVPR_2025_paper.html)\n  * [Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video 
Content](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Koala-36M_A_Large-scale_Video_Dataset_Improving_Consistency_between_Fine-grained_Conditions_CVPR_2025_paper.html)\n  * [Automatic Spectral Calibration of Hyperspectral Images: Method, Dataset and Benchmark](https://openaccess.thecvf.com/content/CVPR2025/html/Du_Automatic_Spectral_Calibration_of_Hyperspectral_Images_Method_Dataset_and_Benchmark_CVPR_2025_paper.html)\n  * [The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition](https://openaccess.thecvf.com/content/CVPR2025/html/Brookes_The_PanAf-FGBG_Dataset_Understanding_the_Impact_of_Backgrounds_in_Wildlife_CVPR_2025_paper.html)\n  * [CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools](https://openaccess.thecvf.com/content/CVPR2025/html/Nwoye_CholecTrack20_A_Multi-Perspective_Tracking_Dataset_for_Surgical_Tools_CVPR_2025_paper.html)\n  * [Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset](https://openaccess.thecvf.com/content/CVPR2025/html/Dong_Digital_Twin_Catalog_A_Large-Scale_Photorealistic_3D_Object_Digital_Twin_CVPR_2025_paper.html)\n  * [SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_SPA-VL_A_Comprehensive_Safety_Preference_Alignment_Dataset_for_Vision_Language_CVPR_2025_paper.html)\n  * [M3GYM: A Large-Scale Multimodal Multi-view Multi-person Pose Dataset for Fitness Activity Understanding in Real-world Settings](https://openaccess.thecvf.com/content/CVPR2025/html/Xu_M3GYM_A_Large-Scale_Multimodal_Multi-view_Multi-person_Pose_Dataset_for_Fitness_CVPR_2025_paper.html)\n  * [3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination](https://openaccess.thecvf.com/content/CVPR2025/html/Yang_3D-GRAND_A_Million-Scale_Dataset_for_3D-LLMs_with_Better_Grounding_and_CVPR_2025_paper.html)\n  * [Sketchtopia: A Dataset and 
Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback](https://openaccess.thecvf.com/content/CVPR2025/html/Khan_Sketchtopia_A_Dataset_and_Foundational_Agents_for_Benchmarking_Asynchronous_Multimodal_CVPR_2025_paper.html)\n  * Face\n    * [AI-Face: A Million-Scale Demographically Annotated AI-Generated Face Dataset and Fairness Benchmark](https://openaccess.thecvf.com/content/CVPR2025/html/Lin_AI-Face_A_Million-Scale_Demographically_Annotated_AI-Generated_Face_Dataset_and_Fairness_CVPR_2025_paper.html)\n    * [FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs](http://arxiv.org/abs/2503.21457v1)\u003cbr\u003e:star:[code](https://github.com/CVI-SZU/FaceBench)\n  * Autonomous Driving\n    * [OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_OmniDrive_A_Holistic_Vision-Language_Dataset_for_Autonomous_Driving_with_Counterfactual_CVPR_2025_paper.html)\n  * HOI\n    * [CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative Object REarrangement](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_CORE4D_A_4D_Human-Object-Human_Interaction_Dataset_for_Collaborative_Object_REarrangement_CVPR_2025_paper.html)\n  * Visual-Text Anomaly Detection\n    * [MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects](https://openaccess.thecvf.com/content/CVPR2025/html/Fan_MANTA_A_Large-Scale_Multi-View_and_Visual-Text_Anomaly_Detection_Dataset_for_CVPR_2025_paper.html)\n* Dataset Distillation\n  * [Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation](http://arxiv.org/abs/2503.18872v1)\u003cbr\u003e:star:[code](https://github.com/CYDaaa30/CCFS)\n  * [Dataset Distillation with Neural Characteristic Function: A Minmax Perspective](http://arxiv.org/abs/2502.20653v1)\n  * [Enhancing Dataset Distillation via Non-Critical Region 
Refinement](http://arxiv.org/abs/2503.18267v1)\u003cbr\u003e:star:[code](https://github.com/tmtuan1307/NRR-DD)\n  * [Hierarchical Features Matter: A Deep Exploration of Progressive Parameterization Method for Dataset Distillation](https://openaccess.thecvf.com/content/CVPR2025/html/Zhong_Hierarchical_Features_Matter_A_Deep_Exploration_of_Progressive_Parameterization_Method_CVPR_2025_paper.html)\n  * [OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation](https://openaccess.thecvf.com/content/CVPR2025/html/Cui_OPTICAL_Leveraging_Optimal_Transport_for_Contribution_Allocation_in_Dataset_Distillation_CVPR_2025_paper.html)\n  * [DELT: A Simple Diversity-driven EarlyLate Training for Dataset Distillation](https://openaccess.thecvf.com/content/CVPR2025/html/Shen_DELT_A_Simple_Diversity-driven_EarlyLate_Training_for_Dataset_Distillation_CVPR_2025_paper.html)\n  * [Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Emphasizing_Discriminative_Features_for_Dataset_Distillation_in_Complex_Scenarios_CVPR_2025_paper.html)\n  * [Towards Universal Dataset Distillation via Task-Driven Diffusion](https://openaccess.thecvf.com/content/CVPR2025/html/Qi_Towards_Universal_Dataset_Distillation_via_Task-Driven_Diffusion_CVPR_2025_paper.html)\n  * [Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory](https://openaccess.thecvf.com/content/CVPR2025/html/Zhong_Towards_Stable_and_Storage-efficient_Dataset_Distillation_Matching_Convexified_Trajectory_CVPR_2025_paper.html)\n  * [Distilling Long-tailed Datasets](https://openaccess.thecvf.com/content/CVPR2025/html/Zhao_Distilling_Long-tailed_Datasets_CVPR_2025_paper.html)\n* Data Augmentation\n  * [Inversion Circle Interpolation: Diffusion-based Image Augmentation for Data-scarce 
Classification](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Inversion_Circle_Interpolation_Diffusion-based_Image_Augmentation_for_Data-scarce_Classification_CVPR_2025_paper.html)\n  * [MeshGen: Generating PBR Textured Mesh with Render-Enhanced Auto-Encoder and Generative Data Augmentation](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_MeshGen_Generating_PBR_Textured_Mesh_with_Render-Enhanced_Auto-Encoder_and_Generative_CVPR_2025_paper.html)\n\n\n\u003ca name=\"37\"/\u003e\n\n## 37.Sound \n* [SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding](http://arxiv.org/abs/2504.05576v1)\n* [Learning to Highlight Audio by Watching Movies](http://arxiv.org/abs/2505.12154v1)\u003cbr\u003e:star:[code](https://wikichao.github.io/VisAH/)\n* [Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes](http://arxiv.org/abs/2503.18880v1)\n* [Circumventing Shortcuts in Audio-visual Deepfake Detection Datasets with Unsupervised Learning](https://openaccess.thecvf.com/content/CVPR2025/html/Smeu_Circumventing_Shortcuts_in_Audio-visual_Deepfake_Detection_Datasets_with_Unsupervised_Learning_CVPR_2025_paper.html)\n* [UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing](https://openaccess.thecvf.com/content/CVPR2025/html/Lai_UWAV_Uncertainty-weighted_Weakly-supervised_Audio-Visual_Video_Parsing_CVPR_2025_paper.html)\n* [Supervising Sound Localization by In-the-wild Egomotion](https://openaccess.thecvf.com/content/CVPR2025/html/Min_Supervising_Sound_Localization_by_In-the-wild_Egomotion_CVPR_2025_paper.html)\n* [EchoTraffic: Enhancing Traffic Anomaly Understanding with Audio-Visual Insights](https://openaccess.thecvf.com/content/CVPR2025/html/Xing_EchoTraffic_Enhancing_Traffic_Anomaly_Understanding_with_Audio-Visual_Insights_CVPR_2025_paper.html)\n* [Crab: A Unified Audio-Visual Scene Understanding Model with Explicit 
Cooperation](https://openaccess.thecvf.com/content/CVPR2025/html/Du_Crab_A_Unified_Audio-Visual_Scene_Understanding_Model_with_Explicit_Cooperation_CVPR_2025_paper.html)\n* [Language-Guided Audio-Visual Learning for Long-Term Sports Assessment](https://openaccess.thecvf.com/content/CVPR2025/html/Xu_Language-Guided_Audio-Visual_Learning_for_Long-Term_Sports_Assessment_CVPR_2025_paper.html)\n* [CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment](https://openaccess.thecvf.com/content/CVPR2025/html/Araujo_CAV-MAE_Sync_Improving_Contrastive_Audio-Visual_Mask_Autoencoders_via_Fine-Grained_Alignment_CVPR_2025_paper.html)\n* [TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation](https://openaccess.thecvf.com/content/CVPR2025/html/Radman_TSAM_Temporal_SAM_Augmented_with_Multimodal_Prompts_for_Referring_Audio-Visual_CVPR_2025_paper.html)\n* [Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds](https://openaccess.thecvf.com/content/CVPR2025/html/Shaar_Adapting_to_the_Unknown_Training-Free_Audio-Visual_Event_Perception_with_Dynamic_CVPR_2025_paper.html)\n* [Animate and Sound an Image](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Animate_and_Sound_an_Image_CVPR_2025_paper.html)\n* [Sound Bridge: Associating Egocentric and Exocentric Videos via Audio Cues](https://openaccess.thecvf.com/content/CVPR2025/html/Huang_Sound_Bridge_Associating_Egocentric_and_Exocentric_Videos_via_Audio_Cues_CVPR_2025_paper.html)\n* [Video-Guided Foley Sound Generation with Multimodal Controls](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_Video-Guided_Foley_Sound_Generation_with_Multimodal_Controls_CVPR_2025_paper.html)\n* [Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes](https://openaccess.thecvf.com/content/CVPR2025/html/Ryu_Seeing_Speech_and_Sound_Distinguishing_and_Locating_Audio_Sources_in_CVPR_2025_paper.html)\n* 
[Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes](https://openaccess.thecvf.com/content/CVPR2025/html/Dou_Hearing_Hands_Generating_Sounds_from_Physical_Interactions_in_3D_Scenes_CVPR_2025_paper.html)\n* [VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation](https://openaccess.thecvf.com/content/CVPR2025/html/Kushwaha_VinTAGe_Joint_Video_and_Text_Conditioning_for_Holistic_Audio_Generation_CVPR_2025_paper.html)\n* [DistinctAD: Distinctive Audio Description Generation in Contexts](https://openaccess.thecvf.com/content/CVPR2025/html/Fang_DistinctAD_Distinctive_Audio_Description_Generation_in_Contexts_CVPR_2025_paper.html)\n* Audio-Visual Segmentation\n  * [SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_SAM2-LOVE_Segment_Anything_Model_2_in_Language-aided_Audio-Visual_Scenes_CVPR_2025_paper.html)\n  * [Revisiting Audio-Visual Segmentation with Vision-Centric Transformer](https://openaccess.thecvf.com/content/CVPR2025/html/Huang_Revisiting_Audio-Visual_Segmentation_with_Vision-Centric_Transformer_CVPR_2025_paper.html)\n  * [Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment](http://arxiv.org/abs/2503.12847v1)\n  * [Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics](http://arxiv.org/abs/2503.12840v1)\n* Audio-Visual Localization\n  * [Towards Open-Vocabulary Audio-Visual Event Localization](https://openaccess.thecvf.com/content/CVPR2025/html/Zhou_Towards_Open-Vocabulary_Audio-Visual_Event_Localization_CVPR_2025_paper.html)\n  * [Improving Sound Source Localization with Joint Slot Attention on Image and Audio](http://arxiv.org/abs/2504.15118v1)\n  * [Audio-Visual Semantic Graph Network for Audio-Visual Event Localization](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_Audio-Visual_Semantic_Graph_Network_for_Audio-Visual_Event_Localization_CVPR_2025_paper.html)\n  * [Object-aware Sound Source Localization via 
Audio-Visual Scene Understanding](https://openaccess.thecvf.com/content/CVPR2025/html/Um_Object-aware_Sound_Source_Localization_via_Audio-Visual_Scene_Understanding_CVPR_2025_paper.html)\n* Video-to-Audio\n  * [Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows](https://openaccess.thecvf.com/content/CVPR2025/html/Mo_Foley-Flow_Coordinated_Video-to-Audio_Generation_with_Masked_Audio-Visual_Alignment_and_Dynamic_CVPR_2025_paper.html)\n  * [Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition](http://arxiv.org/abs/2503.06984v1)\u003cbr\u003e:star:[code](https://wjc2830.github.io/MelQCD/)\n  * [MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://openaccess.thecvf.com/content/CVPR2025/html/Cheng_MMAudio_Taming_Multimodal_Joint_Training_for_High-Quality_Video-to-Audio_Synthesis_CVPR_2025_paper.html)\n* Speech Transcription\n  * [LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale](http://arxiv.org/abs/2504.16030v1)\u003cbr\u003e:star:[code](https://showlab.github.io/livecc)\n* Music Production\n  * [FilmComposer: LLM-Driven Music Production for Silent Film Clips](https://openaccess.thecvf.com/content/CVPR2025/html/Xie_FilmComposer_LLM-Driven_Music_Production_for_Silent_Film_Clips_CVPR_2025_paper.html)\n* Video-to-Music\n  * [VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling](https://openaccess.thecvf.com/content/CVPR2025/html/Tian_VidMuse_A_Simple_Video-to-Music_Generation_Framework_with_Long-Short-Term_Modeling_CVPR_2025_paper.html)\n\n\n\n\n\n\u003ca name=\"36\"/\u003e\n\n## 36.Vision-Language\n* [Synthetic Data is an Elegant GIFT for Continual Vision-Language Models](http://arxiv.org/abs/2503.04229v1)\n* [Words or Vision: Do Vision-Language Models Have Blind Faith in Text?](http://arxiv.org/abs/2503.02199v1)\n* [Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document 
Retrieval](http://arxiv.org/abs/2503.01980v1)\u003cbr\u003e:star:[code](https://github.com/aimagelab/ReT)\n* [GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks](http://arxiv.org/abs/2503.06514v1)\n* [MMRL: Multi-Modal Representation Learning for Vision-Language Models](http://arxiv.org/abs/2503.08497v1)\u003cbr\u003e:star:[code](https://github.com/yunncheng/MMRL)\n* [DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models](http://arxiv.org/abs/2503.13443v1)\u003cbr\u003e:star:[code](https://github.com/JREion/DPC)\n* [From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration](http://arxiv.org/abs/2503.12821v1)\n* [Hyperbolic Safety-Aware Vision-Language Models](http://arxiv.org/abs/2503.12127v1)\u003cbr\u003e:star:[code](https://github.com/aimagelab/HySAC)\n* [O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models](http://arxiv.org/abs/2503.12096v1)\n* [MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation](http://arxiv.org/abs/2503.13446v1)\u003cbr\u003e:star:[code](https://gary3410.github.io/momanipVLA/)\n* [Identifying and Mitigating Position Bias of Multi-image Vision-Language Models](http://arxiv.org/abs/2503.13792v1)\n* [EfficientLLaVA:Generalizable Auto-Pruning for Large Vision-language Models](http://arxiv.org/abs/2503.15369v1)\n* [Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models](http://arxiv.org/abs/2503.17142v1)\u003cbr\u003e:star:[code](https://github.com/BerasiDavide/vlm_image_compositionality)\n* [Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks](http://arxiv.org/abs/2503.16930v1)\n* [Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis](http://arxiv.org/abs/2503.22420v1)\u003cbr\u003e:star:[code](https://beacon-3d.github.io)\n* [CoT-VLA: 
Visual Chain-of-Thought Reasoning for Vision-Language-Action Models](http://arxiv.org/abs/2503.22020v1)\u003cbr\u003e:star:[code](https://cot-vla.github.io/)\n* [It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data](http://arxiv.org/abs/2503.24129v1)\u003cbr\u003e:star:[code](https://dominik-schnaus.github.io/itsamatch/)\n* [Taxonomy-Aware Evaluation of Vision-Language Models](http://arxiv.org/abs/2504.05457v1)\n* [SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation](http://arxiv.org/abs/2504.05925v1)\n* [Assessing and Learning Alignment of Unimodal Vision and Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_Assessing_and_Learning_Alignment_of_Unimodal_Vision_and_Language_Models_CVPR_2025_paper.html)\n* [Dynamic Updates for Language Adaptation in Visual-Language Tracking](https://openaccess.thecvf.com/content/CVPR2025/html/Li_Dynamic_Updates_for_Language_Adaptation_in_Visual-Language_Tracking_CVPR_2025_paper.html)\n* [Yo'Chameleon: Personalized Vision and Language Generation](https://openaccess.thecvf.com/content/CVPR2025/html/Nguyen_YoChameleon_Personalized_Vision_and_Language_Generation_CVPR_2025_paper.html)\n* [R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning](https://openaccess.thecvf.com/content/CVPR2025/html/Sheng_R-TPT_Improving_Adversarial_Robustness_of_Vision-Language_Models_through_Test-Time_Prompt_CVPR_2025_paper.html)\n* [LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Sun_LayoutVLM_Differentiable_Optimization_of_3D_Layout_via_Vision-Language_Models_CVPR_2025_paper.html)\n* [Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language 
Models](https://openaccess.thecvf.com/content/CVPR2025/html/Hao_Exploring_Visual_Vulnerabilities_via_Multi-Loss_Adversarial_Search_for_Jailbreaking_Vision-Language_CVPR_2025_paper.html)\n* [F^3OCUS - Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics](https://openaccess.thecvf.com/content/CVPR2025/html/Saha_F3OCUS_-_Federated_Finetuning_of_Vision-Language_Foundation_Models_with_Optimal_CVPR_2025_paper.html)\n* [ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_ICT_Image-Object_Cross-Level_Trusted_Intervention_for_Mitigating_Object_Hallucination_in_CVPR_2025_paper.html)\n* [SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments](https://openaccess.thecvf.com/content/CVPR2025/html/Cao_SceneTAP_Scene-Coherent_Typographic_Adversarial_Planner_against_Vision-Language_Models_in_Real-World_CVPR_2025_paper.html)\n* [Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?](https://openaccess.thecvf.com/content/CVPR2025/html/Liao_Can_Large_Vision-Language_Models_Correct_Semantic_Grounding_Errors_By_Themselves_CVPR_2025_paper.html)\n* [SmartCLIP: Modular Vision-language Alignment with Identification Guarantees](https://openaccess.thecvf.com/content/CVPR2025/html/Xie_SmartCLIP_Modular_Vision-language_Alignment_with_Identification_Guarantees_CVPR_2025_paper.html)\n* [TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_TAPT_Test-Time_Adversarial_Prompt_Tuning_for_Robust_Inference_in_Vision-Language_CVPR_2025_paper.html)\n* [Conical Visual Concentration for Efficient Large Vision-Language 
Models](https://openaccess.thecvf.com/content/CVPR2025/html/Xing_Conical_Visual_Concentration_for_Efficient_Large_Vision-Language_Models_CVPR_2025_paper.html)\n* [DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment](https://openaccess.thecvf.com/content/CVPR2025/html/Jose_DINOv2_Meets_Text_A_Unified_Framework_for_Image-_and_Pixel-Level_CVPR_2025_paper.html)\n* [Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves](https://openaccess.thecvf.com/content/CVPR2025/html/Wu_Skip_Tuning_Pre-trained_Vision-Language_Models_are_Effective_and_Efficient_Adapters_CVPR_2025_paper.html)\n* [Document Haystacks:  Vision-Language Reasoning Over Piles of 1000+ Documents](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_Document_Haystacks__Vision-Language_Reasoning_Over_Piles_of_1000_Documents_CVPR_2025_paper.html)\n* [Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages](https://openaccess.thecvf.com/content/CVPR2025/html/Farina_Rethinking_Few-Shot_Adaptation_of_Vision-Language_Models_in_Two_Stages_CVPR_2025_paper.html)\n* [Once-Tuning-Multiple-Variants: Tuning Once and Expanded as Multiple Vision-Language Model Variants](https://openaccess.thecvf.com/content/CVPR2025/html/Yu_Once-Tuning-Multiple-Variants_Tuning_Once_and_Expanded_as_Multiple_Vision-Language_Model_Variants_CVPR_2025_paper.html)\n* [Post-pre-training for Modality Alignment in Vision-Language Foundation Models](https://openaccess.thecvf.com/content/CVPR2025/html/Yamaguchi_Post-pre-training_for_Modality_Alignment_in_Vision-Language_Foundation_Models_CVPR_2025_paper.html)\n* [Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data](https://openaccess.thecvf.com/content/CVPR2025/html/Li_Enhancing_Vision-Language_Compositional_Understanding_with_Multimodal_Synthetic_Data_CVPR_2025_paper.html)\n* [Joint Vision-Language Social Bias Removal for 
CLIP](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_Joint_Vision-Language_Social_Bias_Removal_for_CLIP_CVPR_2025_paper.html)\n* [SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters](https://openaccess.thecvf.com/content/CVPR2025/html/Jiang_SOLAMI_Social_Vision-Language-Action_Modeling_for_Immersive_Interaction_with_3D_Autonomous_CVPR_2025_paper.html)\n* [Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens](https://openaccess.thecvf.com/content/CVPR2025/html/Jiang_Devils_in_Middle_Layers_of_Large_Vision-Language_Models_Interpreting_Detecting_CVPR_2025_paper.html)\n* [SLADE: Shielding against Dual Exploits in Large Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Hossain_SLADE_Shielding_against_Dual_Exploits_in_Large_Vision-Language_Models_CVPR_2025_paper.html)\n* [HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Huang_HiRes-LLaVA_Restoring_Fragmentation_Input_in_High-Resolution_Large_Vision-Language_Models_CVPR_2025_paper.html)\n* [DH-Set: Improving Vision-Language Alignment with Diverse and Hybrid Set-Embeddings Learning](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_DH-Set_Improving_Vision-Language_Alignment_with_Diverse_and_Hybrid_Set-Embeddings_Learning_CVPR_2025_paper.html)\n* [Task-Aware Clustering for Prompting Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Hao_Task-Aware_Clustering_for_Prompting_Vision-Language_Models_CVPR_2025_paper.html)\n* [MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_MambaVLT_Time-Evolving_Multimodal_State_Space_Model_for_Vision-Language_Tracking_CVPR_2025_paper.html)\n* [Adaptive Parameter Selection for Tuning Vision-Language 
Models](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_Adaptive_Parameter_Selection_for_Tuning_Vision-Language_Models_CVPR_2025_paper.html)\n* [ShowUI: One Vision-Language-Action Model for GUI Visual Agent](https://openaccess.thecvf.com/content/CVPR2025/html/Lin_ShowUI_One_Vision-Language-Action_Model_for_GUI_Visual_Agent_CVPR_2025_paper.html)\n* [ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos](https://openaccess.thecvf.com/content/CVPR2025/html/Hannan_ReVisionLLM_Recursive_Vision-Language_Model_for_Temporal_Grounding_in_Hour-Long_Videos_CVPR_2025_paper.html)\n* [ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Bendou_ProKeR_A_Kernel_Perspective_on_Few-Shot_Adaptation_of_Large_Vision-Language_CVPR_2025_paper.html)\n* [Vision-Language Models Do Not Understand Negation](https://openaccess.thecvf.com/content/CVPR2025/html/Alhamoud_Vision-Language_Models_Do_Not_Understand_Negation_CVPR_2025_paper.html)\n* [CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Zhu_CoSpace_Benchmarking_Continuous_Space_Perception_Ability_for_Vision-Language_Models_CVPR_2025_paper.html)\n* [HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding](https://openaccess.thecvf.com/content/CVPR2025/html/Tao_HoVLE_Unleashing_the_Power_of_Monolithic_Vision-Language_Models_with_Holistic_CVPR_2025_paper.html)\n* [Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection](https://openaccess.thecvf.com/content/CVPR2025/html/Yang_Nullu_Mitigating_Object_Hallucinations_in_Large_Vision-Language_Models_via_HalluSpace_CVPR_2025_paper.html)\n* [Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local 
Attention](https://openaccess.thecvf.com/content/CVPR2025/html/An_Mitigating_Object_Hallucinations_in_Large_Vision-Language_Models_with_Assembly_of_CVPR_2025_paper.html)\n* [MEET: Towards Memory-Efficient Temporal Sparse Deep Neural Networks](https://openaccess.thecvf.com/content/CVPR2025/html/Zhu_MEET_Towards_Memory-Efficient_Temporal_Sparse_Deep_Neural_Networks_CVPR_2025_paper.html)\n* [Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_Florence-VL_Enhancing_Vision-Language_Models_with_Generative_Vision_Encoder_and_Depth-Breadth_CVPR_2025_paper.html)\n* [Mamba-Reg: Vision Mamba Also Needs Registers](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Mamba-Reg_Vision_Mamba_Also_Needs_Registers_CVPR_2025_paper.html)\n* [Reproducible Vision-Language Models Meet Concepts Out of Pre-Training](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_Reproducible_Vision-Language_Models_Meet_Concepts_Out_of_Pre-Training_CVPR_2025_paper.html)\n* [Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding](https://openaccess.thecvf.com/content/CVPR2025/html/Kang_Your_Large_Vision-Language_Model_Only_Needs_A_Few_Attention_Heads_CVPR_2025_paper.html)\n* [Bayesian Test-Time Adaptation for Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Zhou_Bayesian_Test-Time_Adaptation_for_Vision-Language_Models_CVPR_2025_paper.html)\n* [Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key](https://openaccess.thecvf.com/content/CVPR2025/html/Yang_Mitigating_Hallucinations_in_Large_Vision-Language_Models_via_DPO_On-Policy_Data_CVPR_2025_paper.html)\n* [NLPrompt: Noise-Label Prompt Learning for Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Pan_NLPrompt_Noise-Label_Prompt_Learning_for_Vision-Language_Models_CVPR_2025_paper.html)\n* [Towards Understanding 
How Knowledge Evolves in Large Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Towards_Understanding_How_Knowledge_Evolves_in_Large_Vision-Language_Models_CVPR_2025_paper.html)\n* [Evaluating Vision-Language Models as Evaluators in Path Planning](https://openaccess.thecvf.com/content/CVPR2025/html/Aghzal_Evaluating_Vision-Language_Models_as_Evaluators_in_Path_Planning_CVPR_2025_paper.html)\n* [Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Deitke_Molmo_and_PixMo_Open_Weights_and_Open_Data_for_State-of-the-Art_CVPR_2025_paper.html)\n* [Self-Evolving Visual Concept Library using Vision-Language Critics](https://openaccess.thecvf.com/content/CVPR2025/html/Sehgal_Self-Evolving_Visual_Concept_Library_using_Vision-Language_Critics_CVPR_2025_paper.html)\n* [Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift](https://openaccess.thecvf.com/content/CVPR2025/html/Liang_Revisiting_Backdoor_Attacks_against_Large_Vision-Language_Models_from_Domain_Shift_CVPR_2025_paper.html)\n* [On the Zero-shot Adversarial Robustness of Vision-Language Models: A Truly Zero-shot and Training-free Approach](https://openaccess.thecvf.com/content/CVPR2025/html/Tong_On_the_Zero-shot_Adversarial_Robustness_of_Vision-Language_Models_A_Truly_CVPR_2025_paper.html)\n* [ResCLIP: Residual Attention for Training-free Dense Vision-language Inference](https://openaccess.thecvf.com/content/CVPR2025/html/Yang_ResCLIP_Residual_Attention_for_Training-free_Dense_Vision-language_Inference_CVPR_2025_paper.html)\n* [Realistic Test-Time Adaptation of Vision-Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Zanella_Realistic_Test-Time_Adaptation_of_Vision-Language_Models_CVPR_2025_paper.html)\n* [PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language 
Models](https://openaccess.thecvf.com/content/CVPR2025/html/Schmalfuss_PARC_A_Quantitative_Framework_Uncovering_the_Symmetries_within_Vision_Language_CVPR_2025_paper.html)\n* [What's in the Image? A Deep-Dive into the Vision of Vision Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Kaduri_Whats_in_the_Image_A_Deep-Dive_into_the_Vision_of_CVPR_2025_paper.html)\n* [Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Liu_Stealthy_Backdoor_Attack_in_Self-Supervised_Learning_Vision_Encoders_for_Large_CVPR_2025_paper.html)\n* [VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Lee_VLsI_Verbalized_Layers-to-Interactions_from_Large_to_Small_Vision_Language_Models_CVPR_2025_paper.html)\n* [Seeing the Abstract: Translating the Abstract Language for Vision Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Talon_Seeing_the_Abstract_Translating_the_Abstract_Language_for_Vision_Language_CVPR_2025_paper.html)\n* [VisionZip: Longer is Better but Not Necessary in Vision Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Yang_VisionZip_Longer_is_Better_but_Not_Necessary_in_Vision_Language_CVPR_2025_paper.html)\n* [FastVLM: Efficient Vision Encoding for Vision Language Models](https://openaccess.thecvf.com/content/CVPR2025/html/Vasu_FastVLM_Efficient_Vision_Encoding_for_Vision_Language_Models_CVPR_2025_paper.html)\n* [COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training](https://openaccess.thecvf.com/content/CVPR2025/html/Kim_COSMOS_Cross-Modality_Self-Distillation_for_Vision_Language_Pre-training_CVPR_2025_paper.html)\n* [HalLoc: Token-level Localization of Hallucinations for Vision Language 
Models](https://openaccess.thecvf.com/content/CVPR2025/html/Park_HalLoc_Token-level_Localization_of_Hallucinations_for_Vision_Language_Models_CVPR_2025_paper.html)\n* [Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks](https://openaccess.thecvf.com/content/CVPR2025/html/Wang_Steering_Away_from_Harm_An_Adaptive_Approach_to_Defending_Vision_CVPR_2025_paper.html)\n* [Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation](https://openaccess.thecvf.com/content/CVPR2025/html/Zhang_Automated_Generation_of_Challenging_Multiple-Choice_Questions_for_Vision_Language_Model_CVPR_2025_paper.html)\n* [Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts](https://openaccess.thecvf.com/content/CVPR2025/html/Chen_Lifelong_Knowledge_Editing_for_Vi","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F52cv%2Fcvpr-2025-papers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F52cv%2Fcvpr-2025-papers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F52cv%2Fcvpr-2025-papers/lists"}