https://github.com/chenin-wang/awesome_ai_paper

List: awesome_ai_paper

Last synced: 2 months ago

Host: GitHub
URL: https://github.com/chenin-wang/awesome_ai_paper
Owner: chenin-wang
Created: 2024-06-17T11:05:19.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-04-09T18:19:20.000Z (2 months ago)
Last Synced: 2025-04-09T19:29:29.889Z (2 months ago)
Language: Python
Homepage: http://paper.cheninweb.asia/
Size: 13.8 MB
Stars: 13
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

ultimate-awesome - awesome_ai_paper - Paper.cheninweb.asia. (Other Lists / Julia Lists)

README

## Updated on 2025.04.09
> Usage instructions: [here](./docs/README.md#usage)

Table of Contents

Multimodal

6DOF Object Pose

nerf

Classification/Detection/Recognition/Segmentation

Generative Models

Transformer

## Multimodal

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
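|**2025-04-08**|[Transfer between Modalities with MetaQueries](http://arxiv.org/abs/2504.06256)|null|Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but bringing these different modalities into a single architecture usually requires complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal large language models (MLLMs) and diffusion models. MetaQueries connects the MLLM's latent representations to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and a standard diffusion objective. Notably, this transfer is effective even when the MLLM backbone is kept frozen, which preserves its state-of-the-art multimodal understanding while achieving strong generative performance. In addition, our method is flexible and can easily be instruction-tuned for advanced applications such as image editing and subject-driven generation.|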
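|**2025-04-07**|[ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering](http://arxiv.org/abs/2504.05506)|null|Charts are ubiquitous, and people often use them to analyze data, answer questions, and discover key insights. However, performing complex analytical tasks with charts requires considerable perceptual and cognitive effort. Chart Question Answering (CQA) systems automate this process by enabling models to interpret and reason over visual representations of data. Existing benchmarks such as ChartQA, however, lack real-world diversity and have recently shown performance saturation with modern large vision-language models (LVLMs). To address these limitations, we introduce ChartQAPro, a new benchmark with 1,341 charts from 157 diverse sources, covering a variety of chart types including infographics and dashboards, and 1,948 questions of various types, such as multiple-choice, conversational, hypothetical, and unanswerable questions, to better reflect real-world challenges. Our evaluation of 21 models shows a substantial performance drop for LVLMs on ChartQAPro; for example, Claude Sonnet 3.5 scores 90.5% on ChartQA but only 55.81% on ChartQAPro, which underscores the complexity of chart reasoning. We complement our findings with detailed error analyses and ablation studies, identifying key challenges and opportunities for advancing LVLMs in chart understanding and reasoning. We release ChartQAPro at https://github.com/vis-nlp/ChartQAPro.|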
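|**2025-04-07**|[Probing the Visualization Literacy of Vision Language Models: the Good, the Bad, and the Ugly](http://arxiv.org/abs/2504.05445)|null|Vision Language Models (VLMs) show promising chart comprehension capabilities. However, prior explorations of their visualization literacy have been limited to assessing the correctness of their responses and have not probed their internal reasoning. To close this gap, we adapt attention-guided class activation maps (AG-CAM) to VLMs in order to visualize the influence and importance of input features (image and text) on model responses. Using this approach, we examine four open-source models (ChartGemma, Janus 1B and 7B, and LLaVA) and two closed-source models (GPT-4o, Gemini), comparing their performance and, for the open-source models, their AG-CAM results. Overall, we find that ChartGemma, a 3B-parameter VLM fine-tuned for chart question answering (QA), outperforms the other open-source models and performs on par with closed-source VLMs that have larger parameter counts. We also find that VLMs exhibit spatial reasoning by accurately localizing key chart features, and semantic reasoning by associating visual elements with the corresponding data values and query tokens. Our approach is the first to demonstrate the use of AG-CAM on the widely used early-fusion VLM architecture and for chart QA. We also present preliminary evidence that these results are consistent with human reasoning. Our promising open-source VLM results pave the way for transparent and reproducible research on AI visualization literacy.|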
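|**2025-04-07**|[SmolVLM: Redefining small and efficient multimodal models](http://arxiv.org/abs/2504.05299)|null|Large Vision-Language Models (VLMs) deliver excellent performance but require substantial computational resources, which limits their deployment on mobile and edge devices. Smaller VLMs typically mirror the design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and limited practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models designed for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks while minimizing memory footprint. Our smallest model, SmolVLM-256M, uses less than 1GB of GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs that consume twice the GPU memory. SmolVLM models go beyond static images and demonstrate robust video comprehension. Our results emphasize that strategic architectural optimization, aggressive yet efficient tokenization, and carefully curated training data can significantly enhance multimodal performance, facilitating practical, energy-efficient deployment at much smaller scales.|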
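|**2025-04-07**|[Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation](http://arxiv.org/abs/2504.05225)|null|Model Predictive Control (MPC) is a widely adopted control paradigm that uses predictive models to estimate future system states and optimize control inputs accordingly. However, while MPC excels at planning and control, it lacks environmental perception, which leads to failures in complex and unstructured scenarios. To address this limitation, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation planning framework that combines the perception capability of vision-language models (VLMs) with MPC. VLMPC uses a conditional action sampling module that takes a goal image or a language instruction as input and leverages a VLM to generate candidate action sequences. These candidates are fed into a video prediction model that simulates future frames conditioned on the actions. In addition, we propose an enhanced variant, Traj-VLMPC, which replaces video prediction with motion trajectory generation to reduce computational complexity while maintaining accuracy. Traj-VLMPC estimates motion dynamics from the candidate actions, offering a more efficient alternative for long-horizon tasks and real-time applications. Both VLMPC and Traj-VLMPC select the best action sequence using a VLM-based hierarchical cost function that captures pixel-level and knowledge-level consistency between the current observation and the task input. We show that both methods outperform existing state-of-the-art approaches on public benchmarks and achieve excellent performance on a variety of real-world robotic manipulation tasks. Code is available at https://github.com/PPjmchen/VLMPC.|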
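|**2025-04-07**|[Resource-Efficient Beam Prediction in mmWave Communications with Multimodal Realistic Simulation Framework](http://arxiv.org/abs/2504.05187)|null|Beamforming is a key technology in millimeter-wave (mmWave) communications that improves signal transmission by optimizing directionality and intensity. However, conventional channel estimation methods, such as pilot signals or beam sweeping, often fail to adapt to rapidly changing communication environments. To address this limitation, multimodal sensing-aided beam prediction has attracted considerable attention; it uses sensing data from devices such as LiDAR, radar, GPS, and RGB cameras to predict user locations or network conditions. Despite its potential, the adoption of multimodal sensing-aided beam prediction is hindered by high computational complexity, high cost, and limited datasets. This paper therefore proposes a resource-efficient learning approach that transfers knowledge from a multimodal network to a mono-modal (radar-only) network based on cross-modal relational knowledge distillation (CRKD), reducing computational overhead while preserving prediction accuracy. To enable multimodal learning with realistic data, a novel multimodal simulation framework is developed that integrates sensor data generated by the autonomous driving simulator CARLA with MATLAB-based mmWave channel modeling and reflects real-world conditions. The proposed CRKD achieves its goal by distilling relational information across different feature spaces, which enhances beam prediction performance without relying on expensive sensor data. Simulation results show that CRKD distills multimodal knowledge effectively, allowing the radar-only model to reach 94.62% of the teacher model's performance. Notably, this is achieved with only 10% of the teacher network's parameters, significantly reducing computational complexity and dependence on multimodal sensor data.|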
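|**2025-04-08**|[A Taxonomy of Self-Handover](http://arxiv.org/abs/2504.04939)|null|Self-handover, the transfer of an object between one's own hands, is a common but understudied bimanual action. While it facilitates seamless transitions in complex tasks, the strategies underlying its execution remain largely unexplored. In this paper, we introduce the first systematic taxonomy of self-handover, derived from manual annotation of more than 12 hours of cooking activity performed by 21 participants. Our analysis shows that self-handover is not merely a passive transition but a highly coordinated action involving anticipatory adjustments by both hands. As a step toward automated analysis of human manipulation, we further demonstrate the feasibility of classifying self-handover types with a state-of-the-art vision-language model. These findings offer new insights into bimanual coordination and underscore the role of self-handover in enabling smooth task transitions, an ability essential for adaptive dual-arm robots.|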
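|**2025-04-07**|[SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models](http://arxiv.org/abs/2504.04893)|null|Typographic attacks exploit the interplay between text and visual content in multimodal foundation models, causing misclassification when misleading text is embedded in images. However, existing datasets are limited in size and diversity, which makes such vulnerabilities difficult to study. In this paper, we introduce SCAM, the largest and most diverse dataset of real-world typographic attack images to date, containing 1,162 images across hundreds of object categories and attack words. Through extensive benchmarking of Vision-Language Models (VLMs) on SCAM, we show that typographic attacks significantly degrade performance, and we identify that training data and model architecture influence susceptibility to these attacks. Our findings indicate that typographic attacks persist in state-of-the-art Large Vision-Language Models (LVLMs) because of the choice of vision encoder, although larger Large Language Model (LLM) backbones help reduce their vulnerability. In addition, we show that synthetic attacks closely resemble real-world (handwritten) attacks, which validates their use in research. Our work provides a comprehensive resource and empirical insights to support future research on robust and trustworthy multimodal AI systems. We publicly release the dataset introduced in this paper at https://huggingface.co/datasets/BLISS-e-V/SCAM, along with the evaluation code at https://github.com/Bliss-e-V/SCAM.|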
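|**2025-04-07**|[Don't Lag, RAG: Training-Free Adversarial Detection Using RAG](http://arxiv.org/abs/2504.04858)|null|Adversarial patch attacks pose a major threat to vision systems by embedding localized perturbations that mislead deep models. Traditional defenses often require retraining or fine-tuning, which makes them impractical for real-world deployment. We propose a training-free Visual Retrieval-Augmented Generation (VRAG) framework that integrates Vision-Language Models (VLMs) to detect adversarial patches. By retrieving visually similar patches and images that resemble attacks stored in a continuously expanding database, VRAG performs generative reasoning to identify diverse attack types, all without additional training or fine-tuning. We extensively evaluate open-source large-scale VLMs, including Qwen-VL-Plus, Qwen2.5-VL-72B, and UI-TARS-72B-DPO, alongside the closed-source model Gemini-2.0. Notably, the open-source UI-TARS-72B-DPO model achieves up to 95% classification accuracy, setting a new state of the art for open-source adversarial patch detection. Gemini-2.0 attains the highest overall accuracy, 98%, but remains closed-source. Experimental results show that VRAG can identify a variety of adversarial patches with minimal human annotation, paving the way for robust, practical defenses against evolving adversarial patch attacks.|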
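|**2025-04-06**|[M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models](http://arxiv.org/abs/2504.04633)|null|Multimodal in-context learning (ICL) is an important capability of Large Vision-Language Models (LVLMs), allowing task adaptation through contextual prompts without retraining parameters. However, its application is limited by the token-intensive nature of the inputs and the high complexity of cross-modal few-shot learning, which restricts the expressive power of representation methods. To address these challenges, we propose M2IV, a method that replaces explicit demonstrations with learnable In-context Vectors integrated directly into the LVLM. By exploiting the complementary strengths of multi-head attention (MHA) and multi-layer perceptrons (MLP), M2IV achieves robust cross-modal fidelity and fine-grained semantic distillation through training. This significantly improves performance across a range of LVLMs and tasks and scales efficiently to many-shot scenarios, bypassing context window limitations. We also introduce VLibrary, a repository for storing and retrieving M2IV that supports flexible steering of LVLMs on tasks such as cross-modal alignment, customized generation, and safety improvement. Experiments on seven benchmarks and three LVLMs show that M2IV surpasses Vanilla ICL and prior representation engineering methods, with an average accuracy gain of 3.74% over ICL at the same shot count, along with substantial efficiency advantages.|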
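|**2025-04-04**|[SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding](http://arxiv.org/abs/2504.03254)|null|Synthetic Aperture Radar (SAR) is a crucial remote sensing technology that enables all-weather, day-and-night observation with strong surface penetration, supporting precise and continuous environmental monitoring and analysis. However, SAR image interpretation remains challenging because of its complex physical imaging mechanisms and its significant differences from human visual perception. In recent years, Vision-Language Models (VLMs) have achieved remarkable success in RGB image understanding, offering powerful open-vocabulary interpretation and flexible language interaction. Their application to SAR images, however, is severely limited by the lack of SAR-specific knowledge in their training data, which leads to suboptimal performance. To address this limitation, we introduce SARLANG-1M, a large-scale benchmark dataset designed for multimodal SAR image understanding, with a primary focus on combining SAR with the text modality. SARLANG-1M contains more than 1 million high-quality SAR image-text pairs collected from over 59 cities worldwide. It features hierarchical resolutions (ranging from 0.1 to 25 meters), fine-grained semantic descriptions (including both concise and detailed captions), diverse remote sensing categories (1,696 object types and 16 land cover classes), and multi-task question-answering pairs covering seven applications and 1,012 question types. Extensive experiments on mainstream VLMs show that fine-tuning with SARLANG-1M significantly improves their performance on SAR image interpretation, reaching a level comparable to human experts. The dataset and code will be released publicly at https://github.com/Jimmyxichen/SARLANG-1M.|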
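|**2025-04-04**|[NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving](http://arxiv.org/abs/2504.03164)|null|Recent advances in Vision-Language Models (VLMs) have shown great potential for autonomous driving tasks. However, their spatial understanding and reasoning, key capabilities for autonomous driving, still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluates the spatial reasoning capabilities of VLMs in driving scenarios. To fill this gap, we propose NuScenes-SpatialQA, the first large-scale, ground-truth-based question-answering (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous driving. The benchmark is built on the NuScenes dataset and constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline. It systematically evaluates the spatial understanding and reasoning performance of VLMs along multiple dimensions. Using this benchmark, we conduct extensive experiments on a variety of VLMs, including both general-purpose and spatially enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving. Surprisingly, the experimental results show that spatially enhanced VLMs perform well on qualitative QA but are not competitive on quantitative QA. Overall, VLMs still face considerable challenges in spatial understanding and reasoning.|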
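|**2025-04-04**|[TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference](http://arxiv.org/abs/2504.03154)|null|Conventional Vision-Language Models (VLMs) typically use a fixed number of vision tokens regardless of task complexity. This one-size-fits-all strategy introduces notable inefficiencies: using too many tokens on simple tasks leads to unnecessary computational overhead, while too few tokens in more complex cases compromise fine-grained understanding of visual information. To overcome these limitations, we present TokenFLEX, an innovative and adaptable vision-language framework that encodes images into a variable number of tokens for efficient integration with a Large Language Model (LLM). Our approach rests on two key innovations. First, we propose a new training paradigm that improves performance across different numbers of vision tokens by randomly varying the token count during training. Second, we design a lightweight vision token projector, comprising an adaptive pooling layer and SwiGLU, that allows flexible downsampling of vision tokens and adaptive selection of features tailored to a specific token count. Comprehensive experiments show that TokenFLEX consistently outperforms its fixed-token counterparts, achieving notable performance gains across token counts, with average improvements of 1.6%, 1.0%, and 0.4% using 64, 144, and 256 tokens, respectively, across eight vision-language benchmarks. These results highlight TokenFLEX's flexibility while maintaining high-performance vision-language understanding.|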
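|**2025-04-04**|[MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories](http://arxiv.org/abs/2504.03153)|null|We propose MORAL (a multimodal reinforcement learning framework for decision making in autonomous laboratories), which enhances sequential decision making in autonomous robotic laboratories by integrating visual and textual inputs. Using the BridgeData V2 dataset, we generate fine-tuned image captions with a pretrained BLIP-2 vision-language model and combine them with visual features through an early fusion strategy. The fused representations are processed by Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) agents. Experimental results show that, after sufficient training, the multimodal agents achieve a 20% improvement in task completion rate and significantly outperform visual-only and textual-only baselines. Compared with transformer-based and recurrent multimodal RL models, our approach achieves superior performance in cumulative reward and caption quality metrics (BLEU, METEOR, ROUGE-L). These results highlight the impact of semantically aligned language cues on agent learning efficiency and generalization. The proposed framework contributes to the advancement of multimodal reinforcement learning and embodied AI systems in dynamic, real-world environments.|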
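|**2025-04-03**|[QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding](http://arxiv.org/abs/2504.02971)|null|In Visual Document Understanding (VDU) tasks, fine-tuning a pretrained Vision-Language Model (VLM) on a new dataset often falls short in optimizing the vision encoder to identify query-specific regions in text-rich document images. Existing methods that inject queries directly into model layers by modifying the network architecture often struggle to adapt to new datasets with limited annotations. To address this, we introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder, leading to notable performance gains, particularly in data-scarce fine-tuning scenarios. Specifically, our approach introduces a dual-module framework: a query-aware module that generates a unique query vector to precisely guide the model's focus, and a query-agnostic module that captures the positional relationships among tokens, ensuring robust spatial understanding. Notably, both modules operate independently of the vision attention blocks, which facilitates targeted learning of query embeddings and enhances visual semantic identification. Experiments with OCR-free VLMs across multiple datasets show that our approach yields significant performance improvements, especially when handling text-rich documents in data-scarce environments.|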
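|**2025-04-03**|[Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence](http://arxiv.org/abs/2504.02799)|null|Large Vision-Language Models (VLMs) offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training. This flexibility is particularly promising in medicine, where expert-annotated data is scarce. However, the practical utility of VLMs in intervention-focused domains remains uncertain, especially in surgery, where decision making is subjective and clinical scenarios are variable. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI, ranging from anatomy recognition to skill assessment, using 13 datasets spanning laparoscopic, robotic, and open procedures. In our experiments, VLMs show promising generalizability, at times outperforming supervised models when deployed outside their training setting. In-context learning, which incorporates examples at test time, boosted performance up to three-fold, suggesting adaptability as a key strength. Still, tasks requiring spatial or temporal reasoning remained difficult. Beyond surgery, our findings offer insights into the potential of VLMs for tackling complex and dynamic scenarios in clinical and broader real-world applications.|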
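|**2025-04-03**|[Robot-Led Vision Language Model Wellbeing Assessment of Children](http://arxiv.org/abs/2504.02765)|null|This study presents a novel robot-led approach to assessing children's mental wellbeing using a Vision Language Model (VLM). Inspired by the Child Apperception Test (CAT), the social robot NAO presents children with pictorial stimuli to elicit their verbal narratives of the images, which are then evaluated by a VLM according to CAT assessment guidelines. The VLM's assessments were systematically compared with those provided by a trained psychologist. The results show that while the VLM demonstrates moderate reliability in identifying cases with no wellbeing concerns, its ability to accurately classify assessments with clinical concern remains limited. Moreover, although the model's performance was generally consistent when prompted with varying demographic factors such as age and gender, a significantly higher false positive rate was observed for girls, indicating that the model may be sensitive to the gender attribute. These findings highlight both the promise and the challenges of integrating VLMs into robot-led assessments of children's wellbeing.|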
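|**2025-04-03**|[Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation](http://arxiv.org/abs/2504.02438)|null|Long-form video processing poses a fundamental challenge for vision-language models (VLMs) because of the high computational cost of handling long temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLaMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection, which maximizes query relevance at the frame level while maintaining temporal distinctiveness, and (2) differential feature merging, which preserves query-salient features in non-keyframes at the patch level. ViLaMP therefore retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments show that ViLaMP performs strongly on four video understanding benchmarks, particularly on long-form content. Notably, ViLaMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance.|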
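|**2025-04-03**|[Large (Vision) Language Models are Unsupervised In-Context Learners](http://arxiv.org/abs/2504.02349)|**[link](https://github.com/mlbio-epfl/joint-inference)**|Recent advances in large language and vision-language models have enabled zero-shot inference, allowing models to solve new tasks without task-specific training. Various adaptation techniques, such as prompt engineering, in-context learning (ICL), and supervised fine-tuning, can further improve a model's performance on a downstream task, but they require substantial manual effort to construct effective prompts or labeled examples. In this work, we introduce a joint inference framework for fully unsupervised adaptation, eliminating the need for manual prompt engineering and labeled examples. Unlike zero-shot inference, which makes independent predictions, joint inference makes predictions simultaneously for all inputs in a given task. Since direct joint inference involves computationally expensive optimization, we develop efficient approximation techniques, leading to two unsupervised adaptation methods: unsupervised fine-tuning and unsupervised ICL. We demonstrate the effectiveness of our methods across diverse tasks and models, including the language-only Llama-3.1 on natural language processing tasks, the reasoning-oriented Qwen2.5-Math on grade school math problems, the vision-language OpenFlamingo on vision tasks, and the API-only GPT-4o on massive multi-discipline tasks. Our experiments show substantial improvements over the standard zero-shot approach, including a 39% absolute improvement on the challenging GSM8K math reasoning dataset. Remarkably, despite being fully unsupervised, our framework often performs on par with supervised approaches that rely on ground-truth labels.|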
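|**2025-04-03**|[Re-thinking Temporal Search for Long-Form Video Understanding](http://arxiv.org/abs/2504.02259)|null|Efficiently understanding long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding and study a fundamental issue shared by all state-of-the-art long-context vision-language models (VLMs). Our contributions are twofold. First, we formulate temporal search as a Long Video Haystack problem: given a specific query, find a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos. To validate our formulation, we create LV-Haystack, the first benchmark containing 3,874 human-annotated instances, with fine-grained metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with state-of-the-art keyframe selection methods achieving only a 2.1% temporal F1 score on the LVBench subset. Next, inspired by visual search in images, we rethink temporal search and propose T*, a lightweight keyframe search framework that recasts expensive temporal search as a spatial search problem. T* leverages the strong visual localization capabilities typically used for images and introduces an adaptive zooming-in mechanism that operates across both the temporal and spatial dimensions. Our extensive experiments show that, when integrated with existing methods, T* significantly improves state-of-the-art long-form video understanding performance. Specifically, under an inference budget of 32 frames, T* improves GPT-4o's performance on the LongVideoBench XL subset from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4%. Our PyTorch code, benchmark dataset, and models are included in the supplementary material.|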
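|**2025-04-03**|[SocialGesture: Delving into Multi-person Gesture Understanding](http://arxiv.org/abs/2504.02244)|null|Previous research on human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation of existing datasets poses a significant challenge for aligning human gestures with other modalities such as language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture covers a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gestures in complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark the performance of vision-language models (VLMs) on social gesture understanding. Our findings highlight several limitations of current gesture recognition models and offer insights into directions for future improvement in this field. SocialGesture is available at huggingface.co/datasets/IrohXu/SocialGesture.|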
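|**2025-04-02**|[One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image](http://arxiv.org/abs/2504.02132)|null|Multimodal retrieval-augmented generation (M-RAG) has recently emerged as a way to curb hallucinations of large multimodal models (LMMs) through a factual knowledge base (KB). However, M-RAG also introduces new attack vectors for adversaries who aim to disrupt the system by injecting malicious entries into the KB. In this work, we present a poisoning attack against M-RAG targeting visual document retrieval applications, where the KB contains images of document pages. Our objective is to craft a single image that is retrieved for a variety of different user queries and consistently influences the output produced by the generative model, thereby creating a universal denial-of-service (DoS) attack against the M-RAG system. We show that while our attack is effective against a diverse range of widely used, state-of-the-art retrievers (embedding models) and generators (LMMs), it can also be ineffective against robust embedding models. Our attack not only highlights the vulnerability of M-RAG pipelines to poisoning attacks but also sheds light on a fundamental weakness that can hamper their performance even in benign settings.|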
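|**2025-04-02**|[FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs](http://arxiv.org/abs/2504.01916)|null|As a pioneering vision-language model, CLIP (Contrastive Language-Image Pre-training) has achieved significant success across various domains and a wide range of downstream vision-language tasks. However, the text encoders in popular CLIP models are limited to processing only 77 text tokens, which constrains their ability to handle longer, detail-rich captions effectively. In addition, CLIP models often struggle to capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present FineLIP, a novel approach that extends the capabilities of CLIP. FineLIP enhances cross-modal text-image mapping by incorporating fine-grained alignment with longer text input within the CLIP-style framework. FineLIP first extends the positional embeddings to handle longer text and then dynamically aggregates local image and text tokens. The aggregated results are then used to enforce fine-grained token-to-token cross-modal alignment. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation. Quantitative and qualitative experimental results demonstrate the effectiveness of FineLIP, which outperforms existing state-of-the-art approaches. Furthermore, comprehensive ablation studies validate the benefits of the key design elements of FineLIP.|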
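|**2025-04-02**|[Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness](http://arxiv.org/abs/2504.01901)|null|The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt them to 3D scene understanding. However, the absence of large-scale 3D vision-language datasets poses a significant obstacle. To address this, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work offers a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it combines cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover bird's-eye-view images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments show significant potential in leveraging large amounts of unlabeled 3D vision-only data.|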
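|**2025-04-02**|[Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images](http://arxiv.org/abs/2504.01838)|**[link](https://github.com/munia03/dermdit)**|Artificial Intelligence (AI) has made significant progress in skin disease diagnosis, but a major concern is that these models frequently show biased performance across subgroups, especially with respect to sensitive attributes such as skin color. To address these issues, we propose a novel generative AI-based framework, Dermatology Diffusion Transformer (DermDiT), which uses text prompts generated by vision-language models together with multimodal text-image learning to generate new dermoscopic images. We use large vision-language models to generate accurate and appropriate prompts for each dermoscopic image, which helps produce synthetic images that improve the representation of underrepresented groups (patient, disease, etc.) in highly imbalanced datasets for clinical diagnosis. Our extensive experiments show that large vision-language models provide much more insightful representations, enabling DermDiT to generate high-quality images. Our code is available at https://github.com/Munia03/DermDiT.|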
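|**2025-03-31**|[FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics](http://arxiv.org/abs/2503.24267)|null|The rapid and unrestrained advancement of generative artificial intelligence (AI) is a double-edged sword: while enabling unprecedented creativity, it also facilitates the generation of highly convincing deceptive content, undermining societal trust. As image generation techniques become increasingly sophisticated, detecting synthetic images is no longer just a binary task: it requires interpretable, context-aware methodologies that enhance trustworthiness and transparency. However, existing detection models focus primarily on classification and offer limited explanatory insight into image authenticity. In this work, we propose FakeScope, an expert large multimodal model (LMM) tailored for AI-generated image forensics, which not only identifies AI-synthesized images with high accuracy but also provides rich, interpretable, and query-driven forensic insights. We first construct the FakeChain dataset, which contains linguistic authenticity reasoning grounded in visual trace evidence and was developed through a novel human-machine collaborative framework. Building on it, we further present FakeInstruct, the largest multimodal instruction-tuning dataset, containing 2 million visual instructions designed to enhance forensic awareness in LMMs. FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios. It can distinguish synthetic images with high accuracy while offering coherent and insightful explanations, free-form discussion of fine-grained forgery attributes, and actionable enhancement strategies. Notably, despite being trained exclusively on qualitative hard labels, FakeScope demonstrates remarkable zero-shot quantitative detection capability through our proposed token-based probability estimation strategy. Furthermore, FakeScope exhibits strong generalization and in-the-wild applicability, ensuring its value in real-world scenarios.|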
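|**2025-03-31**|[Predicting Targeted Therapy Resistance in Non-Small Cell Lung Cancer Using Multimodal Machine Learning](http://arxiv.org/abs/2503.24165)|null|Lung cancer is the leading cause of cancer death worldwide, and non-small cell lung cancer (NSCLC) is its most common subtype. Among NSCLC patients, approximately 32.3% have mutations in the epidermal growth factor receptor (EGFR) gene. Osimertinib, a third-generation EGFR tyrosine kinase inhibitor (TKI), has shown remarkable efficacy in NSCLC patients with activating and T790M resistance EGFR mutations. Despite its established efficacy, drug resistance remains a significant challenge that prevents patients from fully benefiting from osimertinib. The absence of a standard tool for accurately predicting TKI resistance, including resistance to osimertinib, remains a critical obstacle. To bridge this gap, this study develops an interpretable multimodal machine learning model for predicting osimertinib resistance in late-stage NSCLC patients with activating EGFR mutations, achieving a c-index of 0.82 on a multi-institutional dataset. The model uses readily available data routinely collected during patient visits and medical assessments to facilitate precision lung cancer management and informed treatment decisions. By integrating various data types, such as histology images, next-generation sequencing (NGS) data, demographic data, and clinical records, our multimodal model can generate better-informed recommendations. Our experimental results also show that the multimodal model outperforms single-modality models (c-index 0.82 compared with 0.75 and 0.77), underscoring the benefit of combining multiple modalities in patient outcome prediction.|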
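|**2025-03-31**|[SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation](http://arxiv.org/abs/2503.24164)|null|Large vision and language models perform strongly on tasks such as image captioning, visual question answering, and retrieval. However, challenges remain in integrating speech, text, and vision into a unified model, especially for tasks that involve speech. Speech generation methods vary: some produce speech directly, while others generate it through text, and the impact of this choice on quality is unclear. Evaluation often relies on automatic speech recognition, which may introduce bias. We propose SVLA, a unified speech-vision-language model based on a transformer architecture that handles multimodal inputs and outputs. We train it on 38.2 million speech-text-image examples, including 64.1 hours of synthetic speech. We also introduce Speech VQA Accuracy, a new metric for evaluating spoken responses. SVLA improves multimodal understanding and generation by better combining speech, vision, and language.|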
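|**2025-03-31**|[H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding](http://arxiv.org/abs/2503.24008)|null|With the rapid development of multimodal models, the demand for assessing video understanding capabilities has been steadily increasing. However, existing benchmarks for evaluating video understanding exhibit significant limitations in coverage, task diversity, and scene adaptability. These shortcomings hinder accurate assessment of a model's comprehensive video understanding capabilities. To tackle this challenge, we propose a hierarchical and holistic video understanding (H2VU) benchmark designed to evaluate both general video and online streaming video comprehension. The benchmark has three key features. Extended video duration: it spans videos from brief 3-second clips to full 1.5-hour recordings, bridging the duration gaps found in current benchmarks. Comprehensive assessment tasks: beyond traditional perception and reasoning tasks, we introduce modules for countercommonsense comprehension and trajectory state tracking, which test a model's deep understanding beyond mere prior knowledge. Enriched video data: to keep pace with the rapid evolution of AI agents, we expand the first-person streaming video datasets, which allows exploration of how well multimodal models understand streaming video from a first-person perspective. Extensive results on H2VU show that existing multimodal large language models (MLLMs) have substantial room for improvement on our newly proposed evaluation tasks. We expect H2VU to facilitate advances in video understanding research by offering a comprehensive and in-depth analysis of MLLMs.|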
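|**2025-03-31**|[HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment](http://arxiv.org/abs/2503.23907)|null|Image Aesthetic Assessment (IAA) is a long-standing and challenging research task. However, its subset, Human Image Aesthetic Assessment (HIAA), has been scarcely explored, even though HIAA is widely used in social media, AI workflows, and related domains. To bridge this research gap, our work pioneers a holistic implementation framework tailored for HIAA. Specifically, we introduce HumanBeauty, the first dataset purpose-built for HIAA, which comprises 108k high-quality human images with manual annotations. To achieve comprehensive and fine-grained HIAA, 50k human images are manually collected through a rigorous curation process and annotated using our pioneering 12-dimensional aesthetic standard, while the remaining 58k images with overall aesthetic labels are systematically filtered from public datasets. Based on the HumanBeauty database, we propose HumanAesExpert, a powerful vision-language model for aesthetic evaluation of human images. We design an Expert head that incorporates human knowledge of aesthetic sub-dimensions while jointly using a Language Modeling (LM) head and a Regression head. This approach enables our model to achieve superior performance in both overall and fine-grained HIAA. Furthermore, we introduce MetaVoter, which aggregates the scores from all three heads to balance the capabilities of each head effectively, thereby achieving higher assessment precision. Extensive experiments show that our HumanAesExpert models deliver significantly better performance in HIAA than other state-of-the-art models. Our datasets, models, and code are publicly released to advance the HIAA community. Project webpage: https://humanaesexpert.github.io/HumanAesExpert/|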
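|**2025-03-31**|[Texture or Semantics? Vision-Language Models Get Lost in Font Recognition](http://arxiv.org/abs/2503.23768)|null|Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance on tasks such as image recognition and object localization. However, their effectiveness on fine-grained tasks remains an open question. In everyday life, people who encounter design materials such as magazines, typography tutorials, research papers, or branding content may wish to identify the aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, in which 10 sentences are rendered in different fonts, and (ii) a hard version, in which each text sample consists of the names of the 15 fonts themselves, introducing a Stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) current VLMs have limited font recognition capabilities, and many state-of-the-art models fail to achieve satisfactory performance. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefit in improving font recognition accuracy across different VLMs. (iii) Attention analysis reveals the inherent limitations of VLMs in capturing semantic features.|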
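|**2025-03-31**|[KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language](http://arxiv.org/abs/2503.23730)|**[link](https://github.com/maum-ai/koffvqa)**|The recent emergence of Large Vision-Language Models (VLMs) has led to a variety of benchmarks for evaluating such models. Nevertheless, we observe that most existing evaluation methods are flawed: they either require the model to choose from predetermined responses, sacrificing open-endedness, or evaluate responses with a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a metric separate from the more common English benchmarks, since the performance of generative language models can differ significantly depending on the language used. We therefore present KOFFVQA, a general-purpose free-form visual question answering benchmark in Korean for evaluating VLMs. Our benchmark consists of 275 carefully crafted questions, each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the reliability problem by allowing the judge model to grade each response against a predetermined set of rules. By defining the evaluation criteria objectively, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also show experimentally that our method of evaluating with pre-existing grading criteria is much more reliable than existing methods. Our evaluation code is available at https://github.com/maum-ai/KOFFVQA.|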
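|**2025-03-30**|[BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation](http://arxiv.org/abs/2503.23534)|**[link](https://github.com/rafiibnsultan/bipvl-seg)**|Medical image segmentation typically relies solely on visual data, overlooking the rich textual information clinicians use for diagnosis. Vision-language models attempt to bridge this gap, but existing approaches often process visual and textual features independently, resulting in weak cross-modal alignment. Simple fusion techniques fail because of the inherent differences between spatial visual features and sequential text embeddings. In addition, medical terminology differs from generic language, which limits the effectiveness of off-the-shelf text encoders and further hinders vision-language alignment. We propose BiPVL-Seg, an end-to-end framework that integrates vision-language fusion and embedding alignment through architectural and training innovations, where the two components reinforce each other to enhance medical image segmentation. BiPVL-Seg introduces bidirectional progressive fusion in the architecture, which facilitates stage-wise information exchange between the vision and text encoders. It also incorporates global-local contrastive alignment, a training objective that strengthens the text encoder's comprehension by aligning text and vision embeddings at both the class and concept levels. Extensive experiments on diverse medical imaging benchmarks across CT and MR modalities show that BiPVL-Seg outperforms state-of-the-art methods in complex multi-class segmentation. Source code is available in the GitHub repository.|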
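|**2025-03-30**|[Re-Aligning Language to Visual Objects with an Agentic Workflow](http://arxiv.org/abs/2503.23508)|null|Language-based object detection (LOD) aims to align visual objects with language expressions. Large amounts of paired data are used to improve the generalization of LOD models. During training, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, which facilitates scaling up the training data. In this process, we observe that VLM hallucinations introduce inaccurate object descriptions (e.g., object name, color, and shape) that degrade vision-language alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by a large language model (LLM) that re-aligns language to visual objects by adaptively adjusting image and text prompts. We name this workflow Real-LOD; it includes planning, tool use, and reflection steps. Given an image with detected objects and the raw language expressions from a VLM, Real-LOD automatically reasons about its state and arranges an action based on our neural-symbolic design (planning). The action adaptively adjusts the image and text prompts and sends them to the VLM to re-describe the object (tool use). We then use another LLM to analyze these refined expressions for feedback (reflection). These steps run in a loop, gradually improving the language descriptions so that they are re-aligned to the visual objects. We construct a dataset containing a small set of 0.18M images with re-aligned language expressions and train a popular LOD model that outperforms existing LOD methods by around 50% on standard benchmarks. With automatic vision-language refinement, our Real-LOD workflow shows the potential to preserve data quality while scaling up data quantity, which further improves LOD performance from a data-alignment perspective.|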
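|**2025-03-30**|[Evolutionary Prompt Optimization Discovers Emergent Multimodal Reasoning Strategies in Vision-Language Models](http://arxiv.org/abs/2503.23503)|null|We present a framework for optimizing prompts in vision-language models to elicit multimodal reasoning without retraining the model. We use an evolutionary algorithm to guide prompt updates downstream of visual tasks, and our approach improves on baseline prompt-updating algorithms that lack evolution-style "survival of the fittest" iteration. Crucially, we find that this approach enables the language model to independently discover progressive problem-solving techniques over several evolutionary generations. For example, the model reasons that, to "break down" visually complex spatial tasks, calling a Python interpreter to perform operations such as cropping, image segmentation, or saturation changes would significantly improve performance. Our experiments show that explicitly invoking this "tool calling" via system-level XML $...\texttt{} ... \texttt{}...$ tags effectively flags Python interpreter access so that the same language model generates the relevant programs, yielding advanced multimodal functionality. This functionality can be crystallized into a system-level prompt that improves performance at inference time, and our experiments suggest relative improvements of up to roughly 50% on selected visual tasks. Downstream performance is trained and evaluated on subtasks from the MathVista, M3CoT, and GeoBench-VLM datasets. Importantly, our approach shows that evolutionary prompt optimization guides language models toward self-reasoning discovery, which improves zero-shot generalization across tasks.|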
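|**2025-03-28**|[VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection](http://arxiv.org/abs/2503.22291)|null|As object detectors are increasingly deployed as black-box cloud services or pretrained models with restricted access to the original training data, the challenge of zero-shot object-level out-of-distribution (OOD) detection arises. This task is crucial for ensuring the reliability of detectors in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pretrained vision-language models such as CLIP, directly applying such models to object-level OOD detection is challenging because of the loss of contextual information and the reliance on image-level alignment. To tackle these challenges, we introduce a new method that uses visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level OOD detection. Our method preserves critical contextual information and improves the ability to distinguish ID from OOD objects, achieving competitive performance across different benchmarks.|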
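|**2025-03-28**|[FLIP: Towards Comprehensive and Reliable Evaluation of Federated Prompt Learning](http://arxiv.org/abs/2503.22263)|**[link](https://github.com/0-ml/flip)**|The increasing emphasis on privacy and data security has driven the adoption of federated learning, a decentralized approach to training machine learning models without sharing raw data. Prompt learning, which fine-tunes prompt embeddings of pretrained models, offers significant advantages in federated settings: it reduces computational cost and communication overhead while leveraging the strong performance and generalization capabilities of vision-language models such as CLIP. This paper addresses the intersection of federated learning and prompt learning, particularly for vision-language models. In this work, we introduce a comprehensive framework, named FLIP, to evaluate federated prompt learning algorithms. FLIP assesses the performance of 8 state-of-the-art federated prompt learning methods across 4 federated learning protocols and 12 open datasets, considering 6 distinct evaluation scenarios. Our findings show that prompt learning maintains strong generalization performance in both in-distribution and out-of-distribution settings with minimal resource consumption. This work highlights the effectiveness of federated prompt learning in environments characterized by data scarcity, unseen classes, and cross-domain distributional shifts. We open-source the code for all implemented algorithms in FLIP to facilitate further research in this domain.|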
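|**2025-03-28**|[REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation](http://arxiv.org/abs/2503.22122)|null|Vision-language models (VLMs) have demonstrated remarkable capabilities in robotic planning, particularly for long-horizon tasks that require a holistic understanding of the environment for task decomposition. Existing methods typically rely on prior environmental knowledge or carefully designed task-specific prompts, making them struggle with dynamic scene changes or unexpected task conditions, e.g., a robot attempting to put a carrot in the microwave but finding the door closed. Such challenges underscore two critical issues: adaptability and efficiency. To address them, in this work we propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution through continuous reflection and self-evolution. REMAC incorporates two key modules: a self-reflection module that performs pre-condition and post-condition checks in the loop to evaluate progress and refine plans, and a self-evolvement module that dynamically adapts plans based on scene-specific reasoning. It offers several appealing benefits: 1) Robots can initially explore and reason about the environment without complex prompt design. 2) Robots can keep reflecting on potential planning errors and adapt the plan based on task-specific insights. 3) After several iterations, a robot can call another one to coordinate tasks in parallel, maximizing task execution efficiency. To validate REMAC's effectiveness, we build a multi-agent environment for long-horizon robot manipulation and navigation based on RoboCasa, featuring 4 task categories with 27 task styles and more than 50 different objects. On this basis, we further benchmark state-of-the-art reasoning models, including DeepSeek-R1, o3-mini, QwQ, and Grok3; the results demonstrate REMAC's superiority, raising the average success rate by 40% and execution efficiency by 52.7% over the single-robot baseline.|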
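|**2025-03-28**|[How Well Can Vison-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark](http://arxiv.org/abs/2503.22093)|null|Vision-Language Models (VLMs) have demonstrated strong reasoning capabilities in Visual Question Answering (VQA) tasks; however, their ability to perform Theory of Mind (ToM) tasks, such as accurately inferring human intentions, beliefs, and other mental states, remains underexplored. In this work, we propose an open-ended question framework to comprehensively evaluate VLMs' performance across diverse categories of ToM tasks. We curated and annotated a benchmark dataset composed of 30 images. We then assessed the performance of four VLMs of varying sizes on this dataset. Our experimental results show that the GPT-4 model outperformed all others, with only one smaller model, GPT-4o-mini, achieving comparable performance. Additionally, we observed that VLMs often struggle to accurately infer intentions in complex scenarios such as bullying or cheating. Moreover, our findings also reveal that smaller models can sometimes infer the correct intention even though they rely on incorrect visual cues.|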
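|**2025-03-28**|[A Survey on Remote Sensing Foundation Models: From Vision to Multimodality](http://arxiv.org/abs/2503.22081)|null|The rapid advancement of remote sensing foundation models, particularly vision and multimodal models, has significantly enhanced the capabilities of intelligent geospatial data interpretation. These models combine various data modalities, such as optical, radar, and LiDAR imagery, with textual and geographic information, enabling more comprehensive analysis and understanding of remote sensing data. The integration of multiple modalities improves performance on tasks such as object detection, land cover classification, and change detection, which are often challenged by the complexity and heterogeneity of remote sensing data. Despite these advances, several challenges remain. The diversity of data types, the need for large-scale annotated datasets, and the complexity of multimodal fusion techniques pose significant obstacles to the effective deployment of these models. Moreover, the computational demands of training and fine-tuning multimodal models require substantial resources, further complicating their practical application in remote sensing image interpretation tasks. This paper provides a comprehensive review of the state of the art in vision and multimodal foundation models for remote sensing, focusing on their architecture, training methods, datasets, and application scenarios. We discuss the key challenges these models face, such as data alignment, cross-modal transfer learning, and scalability, while also identifying emerging research directions aimed at overcoming these limitations. Our goal is to provide a clear understanding of the current landscape of remote sensing foundation models and to inspire future research that pushes the boundaries of what these models can achieve in real-world applications. The list of resources collected in this paper can be found at https://github.com/IRIP-BUAA/A-Review-for-remote-sensing-vision-language-models.|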
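|**2025-03-27**|[CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models](http://arxiv.org/abs/2503.22020)|null|Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations to learn generalizable sensorimotor control. While this paradigm effectively exploits large-scale data from both robotic and non-robotic sources, current VLAs focus primarily on direct input-output mappings and lack the intermediate reasoning steps that are crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into VLAs by autoregressively predicting future image frames as visual goals before generating a short action sequence to achieve those goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results show that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% on real-world manipulation tasks and by 6% on simulation benchmarks. Project website: https://cot-vla.github.io/|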
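|**2025-03-27**|[On Large Multimodal Models as Open-World Image Classifiers](http://arxiv.org/abs/2503.21851)|**[link](https://github.com/altndrr/lmms-owc)**|Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly in natural language (e.g., answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies of LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground-truth classes. We then evaluate 13 models across 10 benchmarks, covering prototypical, non-prototypical, fine-grained, and very fine-grained classes, and demonstrate the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlight challenges related to granularity and fine-grained capabilities, and show how tailored prompting and reasoning can alleviate them.|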
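|**2025-03-27**|[MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX](http://arxiv.org/abs/2503.21699)|null|Frontier models have either been language-only or have focused primarily on the vision and language modalities. Although recent progress in models with vision and audio understanding has been substantial, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modal perception performance. We introduce MAVERIX (Multimodal Audio-Visual Evaluation Reasoning IndeX), a new benchmark with 700 videos and 2,556 questions explicitly designed to evaluate multimodal models through tasks that require close integration of video and audio information. MAVERIX uniquely provides models with audiovisual tasks that closely mimic the multimodal perceptual experience available to humans during inference and decision making. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration. Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show performance approaching human level (around 70% accuracy), while human experts reach near-ceiling performance (95.1%). With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.|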
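|**2025-03-27**|[FusionSegReID: Advancing Person Re-Identification with Multimodal Retrieval and Precise Segmentation](http://arxiv.org/abs/2503.21595)|null|Person re-identification (ReID) plays a critical role in applications such as security surveillance and criminal investigation by matching individuals across large image galleries captured by different cameras. Traditional ReID methods rely on unimodal inputs, typically images, but their performance is limited by challenges such as occlusion, lighting changes, and pose variation. While progress has been made in both image-based and text-based ReID systems, the fusion of the two modalities remains underexplored. This paper presents FusionSegReID, a multimodal model that combines image and text inputs to improve ReID performance. By leveraging the complementary strengths of these modalities, our model improves matching accuracy and robustness, particularly in complex real-world scenarios where a single modality may struggle. Our experiments show significant improvements in Top-1 accuracy and mean Average Precision (mAP) for ReID, as well as better segmentation results in challenging scenarios such as occlusion and low-quality images. Ablation studies further confirm that multimodal fusion and the segmentation module contribute to improved re-identification and mask accuracy. The results show that FusionSegReID outperforms traditional unimodal models, offering a more robust and flexible solution for real-world person ReID tasks.|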
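|**2025-03-27**|[Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models](http://arxiv.org/abs/2503.21435)|null|Graph Neural Networks (GNNs), the dominant paradigm for graph-structured learning, have long faced the dual challenges of exponentially escalating computational complexity and inadequate cross-scenario generalization. With the rapid advancement of multimodal learning, Vision-Language Models (VLMs) have demonstrated exceptional cross-modal relational reasoning and generalization capabilities, opening new pathways for overcoming the inherent limitations of conventional graph learning paradigms. However, current research focuses predominantly on the single-graph reasoning capabilities of VLMs, which fundamentally fails to address the critical requirement for coordinated reasoning across multiple heterogeneous graphs in real-world application scenarios. To address these limitations, we propose the first multi-graph joint reasoning benchmark for VLMs. Our benchmark covers four graph categories: knowledge graphs, flowcharts, mind maps, and route maps, with each graph group accompanied by three instruction-response pairs of progressively increasing difficulty. Using this benchmark, we conduct a comprehensive capability assessment of state-of-the-art VLMs and fine-tune open-source models. This study not only addresses the underexplored evaluation gap in multi-graph reasoning for VLMs but also empirically validates their generalization advantages in graph-structured learning.|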
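|**2025-03-27**|[UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning](http://arxiv.org/abs/2503.21193)|null|We introduce UGen, a unified autoregressive multimodal model that demonstrates strong performance across text processing, image understanding, and image generation tasks simultaneously. UGen converts both text and images into discrete token sequences and uses a single transformer to generate them uniformly in an autoregressive manner. To address the challenges of unified multimodal learning, UGen is trained with a novel mechanism, progressive vocabulary learning. In this process, visual token IDs are incrementally activated and integrated into the training phase, ultimately improving the effectiveness of unified multimodal learning. Experiments on comprehensive text and image tasks show that UGen achieves a significant overall performance improvement of 13.3% over the vanilla unified autoregressive method, and it also delivers competitive results across all tasks against several task-specific models.|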
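|**2025-03-27**|[AdaMHF: Adaptive Multimodal Hierarchical Fusion for Survival Prediction](http://arxiv.org/abs/2503.21124)|**[link](https://github.com/Curry30Messi/AdaMHF)**|With advances in multimodal learning, integrating pathology images and genomic data for survival analysis has attracted growing attention. However, current methods often overlook biological characteristics within and across modalities, such as heterogeneity and sparsity, which ultimately limits their adaptability to clinical practice. To address these challenges, we propose AdaMHF: Adaptive Multimodal Hierarchical Fusion, a framework designed for efficient, comprehensive, and tailored feature extraction and fusion. AdaMHF is specifically adapted to the uniqueness of medical data, enabling accurate predictions with minimal resource consumption even in challenging scenarios such as missing modalities. First, AdaMHF employs an expert expansion and residual structure to activate specialized experts for extracting heterogeneous and sparse features. The extracted features are refined through selection and aggregation, reducing the weight of non-dominant features while preserving comprehensive information. Subsequently, the encoded features are hierarchically fused, capturing multi-grained interactions across modalities. In addition, we introduce a survival prediction benchmark designed to address scenarios with missing modalities, mirroring real-world clinical conditions. Extensive experiments on TCGA datasets show that AdaMHF surpasses current state-of-the-art (SOTA) methods, demonstrating exceptional performance in both complete and incomplete modality settings.|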
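|**2025-03-27**|[Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning](http://arxiv.org/abs/2503.20752)|null|Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve the reasoning of vision-language models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning, using meticulously annotated training data to enhance visual reasoning. However, this training paradigm may lead to overfitting and cognitive rigidity, restricting the model's ability to transfer visual reasoning skills across domains and limiting its real-world applicability. To address these limitations, we propose Reason-RFT, a novel reinforcement fine-tuning framework that significantly enhances generalization in visual reasoning tasks. Reason-RFT introduces a two-phase training framework for visual reasoning: (1) supervised fine-tuning (SFT) with curated CoT data activates the reasoning potential of VLMs, followed by (2) reinforcement learning based on Group Relative Policy Optimization (GRPO), which generates multiple reasoning-response pairs and significantly enhances generalization in visual reasoning tasks. To evaluate Reason-RFT's visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation. Experimental results demonstrate three key advantages of Reason-RFT: (1) performance enhancement: it achieves state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models; (2) generalization superiority: it consistently maintains robust performance across diverse tasks and domains, outperforming alternative training paradigms; (3) data efficiency: it excels in few-shot learning scenarios, surpassing full-dataset SFT baselines. Project website: https://tanhuajie.github.io/ReasonRFT|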
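|**2025-03-26**|[IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting](http://arxiv.org/abs/2503.20612)|**[link](https://github.com/ferdinandzju/iap)**|Recent pre-trained vision-language models (PT-VLMs) often face a Multi-Domain Class-Incremental Learning (MCIL) scenario in practice, where several classes and domains of multimodal tasks arrive incrementally. Without access to previously learned tasks and unseen tasks, memory-constrained MCIL suffers from forward and backward forgetting. To alleviate these challenges, parameter-efficient fine-tuning (PEFT) techniques, such as prompt tuning, are employed to adapt the PT-VLM to the diverse incrementally learned tasks. To achieve effective new-task adaptation, existing methods only consider the effect of PEFT strategy selection and neglect the influence of PEFT parameter settings (e.g., prompting). In this paper, we tackle the challenge of optimizing prompt designs for diverse tasks in MCIL and propose an Instance-Aware Prompting (IAP) framework. Specifically, our Instance-Aware Gated Prompting (IA-GP) module enhances adaptation to new tasks while mitigating forgetting by dynamically assigning prompts across transformer layers at the instance level. Our Instance-Aware Class-Distribution-Driven Prompting (IA-CDDP) improves the task adaptation process by determining an accurate task-label-related confidence score for each instance. Experimental evaluations across 11 datasets, using three performance metrics, demonstrate the effectiveness of our proposed method. Code can be found at https://github.com/FerdinandZJU/IAP.|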
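|**2025-03-26**|[Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding](http://arxiv.org/abs/2503.20362)|null|Large Vision-Language Models (LVLMs) perform well on short-video tasks such as video question answering but struggle with long video understanding. The linear frame sampling strategy commonly used by LVLMs fails to account for the non-linear distribution of key events in video data, often introducing redundant or irrelevant information in longer contexts while risking the omission of critical events in shorter ones. To address this, we propose SelfReS, a non-linear spatiotemporal self-reflective sampling method that dynamically selects key video fragments based on user prompts. Unlike prior approaches, SelfReS leverages the inherently sparse attention maps of LVLMs to define reflection tokens, enabling relevance-aware token selection without additional training or external modules. Experiments show that SelfReS can be integrated seamlessly into strong base LVLMs, improving long-video task accuracy and achieving up to 46% faster inference within the same GPU memory budget.|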
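|**2025-03-26**|[sudo rm -rf agentic_security](http://arxiv.org/abs/2503.20279)|**[link](https://github.com/AIM-Intelligence/SUDO)**|Large Language Models (LLMs) are increasingly deployed as computer-use agents, autonomously performing tasks within real desktop or web environments. While this evolution greatly expands practical use cases for humans, it also creates serious security exposures. We present SUDO (Screen-based Universal Detox2Tox Offense), a novel attack framework that systematically bypasses refusal-trained safeguards in commercial computer-use agents, such as Claude Computer Use. The core mechanism, Detox2Tox, transforms harmful requests (which agents initially reject) into seemingly benign requests via detoxification, obtains detailed instructions from advanced vision-language models (VLMs), and then reintroduces the malicious content via toxification just before execution. Unlike conventional jailbreaks, SUDO iteratively refines its attacks based on built-in refusal feedback, making it increasingly effective against robust policy filters. In extensive tests spanning 50 real-world tasks and multiple state-of-the-art VLMs, SUDO achieves a striking attack success rate of 24% (with no refinement) and up to 41% (with its iterative refinement) in Claude Computer Use. By revealing these vulnerabilities and demonstrating how easily they can be exploited in real-world computing environments, this paper highlights an urgent need for robust, context-aware safeguards. WARNING: This paper includes harmful or offensive model outputs.|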
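|**2025-03-26**|[Qwen2.5-Omni Technical Report](http://arxiv.org/abs/2503.20215)|null|In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable streaming of multimodal inputs, both the audio and visual encoders use a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach named TMRoPE (Time-aligned Multimodal RoPE). To generate text and speech concurrently while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model responsible for text generation, while Talker is a dual-track autoregressive model that directly uses the hidden representations from Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and run for inference in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial packet delay. Qwen2.5-Omni is comparable to the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.|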
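|**2025-03-25**|[CoLLM: A Large Language Model for Composed Image Retrieval](http://arxiv.org/abs/2503.19910)|**[link](https://github.com/hmchuong/CoLLM)**|Composed image retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of the desired modification, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches that use synthetic triplets or that leverage vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. These methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query because triplet data is missing. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of the vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on the fly from image-caption pairs, enabling supervised training without manual annotation. We use large language models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4 million samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results show that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with performance improvements of up to 15%. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.||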
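|**2025-03-25**|[FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs](http://arxiv.org/abs/2503.19850)|null|Retrieving information from hour-long videos is a significant challenge even for state-of-the-art vision-language models (VLMs), particularly when the desired information is localized within a small subset of frames. Long video data is difficult for VLMs because of context window limitations and the difficulty of pinpointing the frames that contain the answer. Our novel video agent, FALCONEye, combines a VLM and a large language model (LLM) to search for relevant information in the video and locate the frames containing the answer. FALCONEye's novelty lies in: 1) a proposed meta-architecture that is better suited to hour-long videos than the short-video approaches in the state of the art; 2) a new efficient exploration algorithm that locates information using short clips, captions, and answer confidence; and 3) our calibration analysis of state-of-the-art VLMs with respect to answer confidence. Our agent is built on a small VLM and a medium-sized LLM and can run on standard computational resources. We also release FALCON-Bench, a benchmark for evaluating answer search in long (average > 1 hour) videos, which highlights the need for open-ended question evaluation. Our experiments show that FALCONEye outperforms the state of the art on FALCON-Bench and achieves similar or better performance on related benchmarks.||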
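|**2025-03-25**|[ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation](http://arxiv.org/abs/2503.19755)|null|End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation because of their limited causal reasoning capability. Current methods attempt to use the strong understanding and reasoning abilities of vision-language models (VLMs) to resolve this dilemma. However, few VLMs for E2E methods perform well in closed-loop evaluation, owing to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To address this, we propose ORION, a holistic E2E autonomous driving framework based on vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a large language model (LLM) for driving scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space and the action space to enable unified E2E optimization for both visual question answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive dataset, outperforming state-of-the-art (SOTA) methods by 14.28 DS and 19.61% SR.||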
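|**2025-03-25**|[Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models](http://arxiv.org/abs/2503.19707)|**[link](https://github.com/stogiannidis/srbench)**|Vision-language models (VLMs) have recently emerged as powerful tools, excelling at tasks that integrate visual and textual understanding, such as image captioning, visual question answering, and image-text retrieval. However, existing VLM benchmarks that include spatial components often fail to isolate spatial reasoning from related tasks such as object detection or semantic comprehension. In this paper, we address these shortcomings with a multi-faceted approach to understanding spatial reasoning. Informed by the diverse and multi-dimensional nature of human spatial reasoning abilities, we first present a detailed analysis of the core elements of spatial reasoning: spatial relations, orientation and navigation, mental rotation, and spatial visualization. We then assess the performance of these models on both synthetic and real-world images, bridging controlled and naturalistic contexts. We analyze 13 state-of-the-art vision-language models and uncover key insights into their spatial reasoning performance. Our results reveal serious shortcomings in current VLMs, with average accuracy across the 13 models close to random chance, which highlights spatial reasoning as a persistent obstacle. This work not only exposes the pressing need to advance spatial reasoning within VLMs but also establishes a solid foundation for future exploration. Code is available on GitHub (https://github.com/stogiannidis/srbench) and the dataset is available on HuggingFace (https://huggingface.co/datasets/stogiannidis/srbench).||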
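|**2025-03-25**|[RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models](http://arxiv.org/abs/2503.19654)|null|We introduce RGB-Th-Bench, the first benchmark designed to evaluate the ability of vision-language models (VLMs) to comprehend RGB-thermal image pairs. Although VLMs have made remarkable progress in visual reasoning and multimodal understanding, their evaluation has been largely limited to RGB-based benchmarks, leaving a critical gap in assessing their capabilities on infrared vision tasks. Existing visible-infrared datasets are either task-specific or lack the high-quality annotations needed for rigorous model evaluation. To address these limitations, RGB-Th-Bench provides a comprehensive evaluation framework covering 14 distinct skill dimensions, with more than 1,600 expert-annotated yes/no questions in total. The benchmark uses two accuracy metrics: a standard question-level accuracy and a stricter skill-level accuracy, which evaluates model robustness across multiple questions within each skill dimension. This design ensures a thorough assessment of model performance, including resilience to adversarial and hallucinated responses. We conduct extensive evaluations on 19 state-of-the-art VLMs, revealing significant performance gaps in RGB-thermal understanding. Our results show that even the strongest models struggle with thermal image comprehension, with performance heavily constrained by their RGB-based capabilities. In addition, the lack of large-scale, application-specific, expert-annotated thermal image-caption datasets in pre-training is an important reason for the observed performance gap. RGB-Th-Bench highlights the urgent need for further progress in multimodal learning to bridge the gap between visible and thermal image understanding. The dataset is available through this link, and the evaluation code will also be made publicly available.||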
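|**2025-03-25**|[Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation](http://arxiv.org/abs/2503.19647)|null|Large vision-language models (VLMs) are increasingly regarded as foundation models that can be instructed through prompting to solve diverse tasks without task-specific training. We examine a seemingly obvious question: how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models, guided by either text or visual prompts, on the out-of-distribution MESS dataset collection. We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning. The results show that VLMs lag far behind specialist models trained for a specific segmentation task, by about 30% on average on the Intersection-over-Union metric. Moreover, we find that text prompts and visual prompts are complementary: each of the two modes fails on many examples that the other one can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to an 11% improvement in performance. Motivated by these findings, we propose PromptMatcher, a remarkably simple training-free baseline that combines text and visual prompts and achieves state-of-the-art results on few-shot prompted semantic segmentation, outperforming the best text-prompted VLM by 2.5% and the top visual-prompted VLM by 3.5%.||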
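|**2025-03-25**|[RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation](http://arxiv.org/abs/2503.19510)|null|As robotics advances toward more complex multimodal interaction and manipulation tasks, the integration of advanced vision-language models (VLMs) has become a key driver in the field. Despite progress with current methods, challenges remain in fusing depth and RGB information within 3D environments and in executing tasks guided by language instructions. To address these challenges, we enhance the existing RoboFlamingo framework by introducing RoboFlamingo-Plus, which incorporates depth data into VLMs to significantly improve robotic manipulation performance. Our work achieves a nuanced fusion of RGB and depth information by combining a pre-trained Vision Transformer (ViT) with a resampling technique, closely aligning the combined data with linguistic cues for superior multimodal understanding. The novelty of RoboFlamingo-Plus lies in its adaptation of inputs for depth data processing, its use of a pre-trained resampler for depth feature extraction, and its use of cross-attention for optimal feature fusion. These improvements allow RoboFlamingo-Plus not only to interpret 3D environments in depth but also to perform complex, language-guided tasks in challenging settings with ease. Experimental results show that RoboFlamingo-Plus improves robotic manipulation performance by 10-20% over existing methods, marking a significant advance. Code and model weights are publicly released at RoboFlamingo-Plus.||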
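|**2025-03-25**|[LangBridge: Interpreting Image as a Combination of Language Embeddings](http://arxiv.org/abs/2503.19404)|null|In recent years, large vision-language models (LVLMs) have made remarkable progress, reaching human-level performance on a variety of complex vision-language tasks. Following the LLaVA paradigm, mainstream LVLMs typically use a shallow MLP for visual-language alignment through a two-stage training process: cross-modal alignment pre-training and instruction tuning. Although this approach has proven effective, the underlying mechanism by which the MLP bridges the modality gap remains poorly understood. While some research has explored how LLMs process the transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter must be retrained whenever the LLM backbone is switched. To address these limitations, we first investigate how MLP adapters work and find that they learn to progressively project visual embeddings into the subspace spanned by the corresponding text embeddings. Building on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results show that a LangBridge adapter pre-trained on Qwen2-0.5B can be applied directly to larger models such as LLaMA3-8B or Qwen2.5-14B while remaining competitive. Overall, LangBridge enables interpretable vision-language alignment by grounding visual representations in LLM vocabulary embeddings, while its plug-and-play design ensures efficient reuse across multiple LLMs with almost no performance degradation. See our project page at https://LangBridge.github.io/||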
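|**2025-03-25**|[ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models](http://arxiv.org/abs/2503.19355)|null|Spatio-temporal reasoning is essential for understanding real-world environments in domains such as autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of vision-language models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements of moving objects, such as traveled distance and speed. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations that detail object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline that generates pseudo-labels using 4D reconstruction at real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning that performs strongly on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.||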
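|**2025-03-25**|[LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text](http://arxiv.org/abs/2503.19311)|**[link](https://github.com/mitsuichen14/lrsclip)**|This study addresses the technical bottlenecks that remote sensing vision-language foundation models (VLFMs) face when handling long text, as well as the "hallucination" problem caused by insufficient information in short text. We propose a new vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) by integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset containing 2 million image-text pairs, which provides both short and long texts for the first time and thus addresses the limited semantic granularity of existing datasets; (2) we design the LRSCLIP architecture based on Long-CLIP's KPS module, which extends CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that, on the zero-shot long-text cross-modal retrieval task, LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline. On the zero-shot short-text cross-modal retrieval task, LRSCLIP improves over the current best model, GeoRSCLIP, by 0.17%, 0.67%, and 0.92% in Text to Image R@1, Image to Text R@1, and mR on RSITMD, and by 0.04%, 2.93%, and 1.28% on RSICD, respectively. LRSCLIP achieves state-of-the-art performance on zero-shot image classification (average accuracy = 75.75%) and semantic localization (Rmi = 0.7653). These results validate the dual advantages of LRSCLIP in fine-grained semantic understanding and global feature matching. This work provides a new benchmark model and data support for remote sensing multimodal learning. The code has been open sourced and is available at https://github.com/MitsuiChen14/LRSCLIP.||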
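|**2025-03-21**|[OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement](http://arxiv.org/abs/2503.17352)|**[link](https://github.com/yihedeng9/openvlthinker)**|Recent advances demonstrated by DeepSeek-R1 show that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved through reinforcement learning (RL) with verifiable rewards, which significantly improves model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively applies supervised fine-tuning (SFT) on lightweight training data and reinforcement learning (RL) to further improve model generalization. Initially, reasoning capabilities are distilled from pure-text R1 models by generating reasoning steps using high-quality captions of images sourced from diverse visual datasets. Subsequently, iterative RL training further enhances reasoning skills, with each iteration's RL-improved model generating refined SFT datasets for the next round. This iterative process yields OpenVLThinker, an LVLM that consistently improves reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model, and data are available at https://github.com/yihedeng9/OpenVLThinker.||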
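|**2025-03-21**|[Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models](http://arxiv.org/abs/2503.17349)|null|Vision-language models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning, such as accurately understanding the relative positions of objects. Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail at spatial tasks despite strong object recognition capabilities. Our interpretability-driven analysis reveals a critical underlying cause: vision embeddings in VLMs are treated primarily as a semantic "bag of tokens", and their disproportionately large embedding norms overshadow subtle yet crucial positional cues. We validate this insight through extensive diagnostic experiments, demonstrating that removing token order or fine-grained spatial details has minimal impact on performance. Guided by these findings, we propose simple, interpretable interventions, including normalizing vision embedding norms and extracting mid-layer spatially rich features, to restore spatial awareness. Empirical results on our synthetic datasets and on standard benchmarks show improved spatial reasoning capabilities, highlighting the value of interpretability-informed design choices. Our study not only uncovers fundamental limitations in current VLM architectures but also provides actionable insights for enhancing structured perception of visual scenes.||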
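|**2025-03-21**|[Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology](http://arxiv.org/abs/2503.17238)|null|This paper addresses the challenge of few-shot classification of histopathology whole slide images (WSIs) by using foundational vision-language models (VLMs) and slide-level prompt learning. Given the gigapixel scale of WSIs, conventional multiple instance learning (MIL) methods rely on aggregation functions to derive slide-level (bag-level) predictions from patch representations, which requires extensive bag-level labels for training. In contrast, VLM-based approaches excel at aligning visual embeddings of patches with candidate class text prompts but lack essential pathological prior knowledge. Our method distinguishes itself by using pathological prior knowledge from language models to identify the crucial local tissue types (patches) for WSI classification and integrating this into a VLM-based MIL framework. Our approach effectively aligns patches with tissue types, and we fine-tune the model via prompt learning using only a few labeled WSIs per category. Experiments and ablation studies on real-world pathological WSI datasets highlight our method's superior performance over existing MIL-based and VLM-based methods in few-shot WSI classification tasks. Our code is publicly available at https://github.com/LTS5/SLIP.||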
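|**2025-03-21**|[Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models](http://arxiv.org/abs/2503.17142)|**[link](https://github.com/berasidavide/vlm_image_compositionality)**|Vision-language models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs from different modalities. While prior work has shown that VLMs organize natural language representations into regular structures that encode composite meanings, it remains unclear whether compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by the noise and sparsity of visual data. We address these problems and propose a framework called Geodesically Decomposable Embeddings (GDE), which approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that the visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and we evaluate the effectiveness of this property on compositional classification and group robustness tasks. GDE achieves stronger performance in compositional classification than its counterpart method that assumes linear geometry of the latent space. Notably, it is particularly effective for group robustness, where we achieve higher results than task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable. Code is available at https://github.com/BerasiDavide/vlm_image_compositionality.||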
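|**2025-03-21**|[Beyond Accuracy: What Matters in Designing Well-Behaved Models?](http://arxiv.org/abs/2503.17110)|null|Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect the quality dimensions. We reveal various new insights, such as: (i) vision-language models exhibit high fairness on ImageNet-1k classification and strong robustness against domain changes; (ii) self-supervised learning is an effective training paradigm that improves almost all considered quality dimensions; and (iii) training dataset size is a major driver for most of the quality dimensions. Finally, we introduce the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple quality dimensions, enabling tailored recommendations based on specific user needs.||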
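|**2025-03-21**|[PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition](http://arxiv.org/abs/2503.16945)|null|Vision-language models (VLMs) such as CLIP offer promising solutions for dynamic facial expression recognition (DFER) but face challenges such as inefficient full fine-tuning, high complexity, and poor alignment between textual and visual representations. In addition, existing methods struggle with effective temporal modeling. To address these issues, we propose PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework that adapts CLIP to DFER while significantly reducing trainable parameters and maintaining high accuracy. PE-CLIP introduces two specialized adapters: a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA). The TDA is a GRU-based module with dynamic scaling that captures sequential dependencies while emphasizing informative temporal features and suppressing irrelevant variations. The ShA is a lightweight adapter that refines representations within both the textual and visual encoders, ensuring consistency and efficiency. Additionally, we integrate Multi-modal Prompt Learning (MaPLe), introducing learnable prompts for the visual and action-unit-based textual inputs, which enhances semantic alignment between modalities and enables efficient adaptation of CLIP to dynamic tasks. We evaluate PE-CLIP on two benchmark datasets, DFEW and FERV39K, achieving competitive performance compared with state-of-the-art methods while requiring fewer trainable parameters. By balancing efficiency and accuracy, PE-CLIP sets a new benchmark in resource-efficient DFER. The source code of the proposed PE-CLIP will be publicly available at https://github.com/Ibtissam-SAADI/PE-CLIP.||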
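|**2025-03-21**|[Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks](http://arxiv.org/abs/2503.16930)|null|Dynamic image degradations, including noise, blur, and lighting inconsistencies, pose significant challenges for image restoration and often arise from sensor limitations or adverse environmental conditions. Existing deep unfolding networks (DUNs) offer stable restoration performance but require manual selection of a degradation matrix for each degradation type, which limits their adaptability across diverse scenarios. To address this issue, we propose the Vision-Language-guided Unfolding Network (VLU-Net), a unified DUN framework that handles multiple degradation types simultaneously. VLU-Net uses a vision-language model (VLM) fine-tuned on degraded image-text pairs to align image features with degradation descriptions, thereby selecting the appropriate transform for the target degradation. By integrating an automatic VLM-based gradient estimation strategy into the proximal gradient descent (PGD) algorithm, VLU-Net effectively tackles complex multi-degradation restoration tasks while maintaining interpretability. Furthermore, we design a hierarchical feature unfolding structure to enhance the VLU-Net framework, efficiently synthesizing degradation patterns across different levels. VLU-Net is the first all-in-one DUN framework and outperforms current leading one-by-one and all-in-one end-to-end methods by 3.74 dB on the SOTS dehazing dataset and 1.70 dB on the Rain100L deraining dataset.||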
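|**2025-03-21**|[Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification](http://arxiv.org/abs/2503.16873)|null|Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is both challenging and costly. To address this, a recent study suggests using the powerful vision-language model CLIP for unsupervised multi-label classification. Despite CLIP's proficiency, it suffers from view-dependent predictions and inherent bias, which limit its effectiveness. We propose a novel method that addresses these issues by using multiple views near target objects, guided by the class activation mapping (CAM) of the classifier, and by debiasing the pseudo-labels derived from CLIP predictions. Our Classifier-guided CLIP Distillation (CCD) selects multiple local views without extra labels and debiases predictions, thereby improving classification performance. Experimental results confirm that our method outperforms existing techniques across diverse datasets. Code is available at https://github.com/k0u-id/CCD.||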
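|**2025-03-21**|[Joint Extraction Matters: Prompt-Based Visual Question Answering for Multi-Field Document Information Extraction](http://arxiv.org/abs/2503.16868)|null|Visual question answering (VQA) has emerged as a flexible approach for extracting specific pieces of information from document images. However, existing work typically queries each field in isolation, overlooking potential dependencies across multiple items. This paper investigates the merits of extracting multiple fields jointly versus separately. Through experiments on multiple large vision-language models and datasets, we show that jointly extracting fields often improves accuracy, especially when the fields share strong numeric or contextual dependencies. We further analyze how performance scales with the number of requested items and use a regression-based metric to quantify inter-field relationships. Our results suggest that multi-field prompts can mitigate confusion arising from similar surface forms and related numeric values, providing practical methods for designing robust VQA systems for document information extraction.||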
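|**2025-03-21**|[MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers](http://arxiv.org/abs/2503.16856)|null|Fully comprehending scientific papers by machine reflects a high level of artificial general intelligence and requires the ability to reason across fragmented and heterogeneous sources of information, which is a complex and practically significant challenge. While vision-language models (VLMs) have made remarkable strides on various tasks, particularly those involving reasoning over evidence from a single image or text page, their ability to reason with cross-source information remains an open problem. This work presents MMCR, a high-difficulty benchmark designed to evaluate VLMs' capacity to reason with cross-source information from scientific papers. The benchmark comprises 276 high-quality questions, meticulously annotated by humans across 7 subjects and 10 task types. Experiments with 18 VLMs show that cross-source reasoning poses a substantial challenge for existing models. Notably, even the top-performing model, GPT-4o, achieves only 48.55% overall accuracy and only 20% accuracy on multi-table comprehension tasks, while the second-best model, Qwen2.5-VL-72B, reaches 39.86% overall accuracy. Furthermore, we investigate the impact of the chain-of-thought (CoT) technique on cross-source reasoning and observe a detrimental effect on small models, whereas larger models show substantially improved performance. These results highlight the pressing need to develop VLMs capable of effectively using cross-source information for reasoning.||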
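|**2025-03-20**|[Exploring the Hidden Reasoning Process of Large Language Models by Misleading Them](http://arxiv.org/abs/2503.16401)|null|Large language models (LLMs) and vision-language models (VLMs) can perform many forms of reasoning tasks in a wide range of scenarios, but are they truly performing task abstraction and rule-based reasoning rather than mere memorization and pattern matching? To answer this question, we propose a novel experimental approach, Misleading Fine-Tuning (MisFT), which examines whether LLMs/VLMs perform abstract reasoning by altering their original understanding of fundamental rules. Specifically, we construct a dataset of math expressions that contradict correct operation principles, fine-tune the model to learn those contradictory rules, and assess its generalization ability on different test domains. Through a series of experiments, we find that current LLMs/VLMs can effectively apply the contradictory rules to solve practical math word problems and math expressions represented as images, which implies the presence of an internal mechanism that abstracts before reasoning.||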
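|**2025-03-20**|[Disentangled and Interpretable Multimodal Attention Fusion for Cancer Survival Prediction](http://arxiv.org/abs/2503.16069)|null|To improve the prediction of cancer survival using whole-slide images and transcriptomics data, it is crucial to capture both modality-shared and modality-specific information. However, multimodal frameworks often entangle these representations, limiting interpretability and potentially suppressing discriminative features. To address this, we propose Disentangled and Interpretable Multimodal Attention Fusion (DIMAF), a multimodal framework that separates intra-modal and inter-modal interactions within an attention-based fusion mechanism to learn distinct modality-specific and modality-shared representations. We introduce a loss based on distance correlation to promote disentanglement between these representations and integrate Shapley additive explanations to assess their relative contributions to survival prediction. We evaluate DIMAF on four public cancer survival datasets, achieving an average improvement of 1.85% in performance and 23.7% in disentanglement compared with current state-of-the-art multimodal models. Beyond improved performance, our interpretable framework enables deeper exploration of the underlying interactions between and within modalities in cancer biology.||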
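|**2025-03-20**|[STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding](http://arxiv.org/abs/2503.15973)|**[link](https://github.com/zhoujiahuan1991/cvpr2025-stop)**|Vision-language models pre-trained on massive image-text pairs, such as CLIP, have shown promising zero-shot generalization across numerous image-based tasks. However, extending these capabilities to video tasks remains challenging because of limited labeled video data and high training costs. Recent video prompting methods attempt to adapt CLIP to video tasks by introducing learnable prompts, but they typically rely on a single static prompt for all video sequences, overlooking the diverse temporal dynamics and spatial variations that exist across frames. This limitation significantly hinders the model's ability to capture the essential temporal information needed for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model, which consists of two complementary modules: intra-frame spatial prompting and inter-frame temporal prompting. Our intra-frame spatial prompts are designed to adaptively highlight discriminative regions within each frame by using intra-frame attention and temporal variation, allowing the model to focus on areas with substantial temporal dynamics and capture fine-grained spatial details. Additionally, to highlight the varying importance of frames for video understanding, we introduce inter-frame temporal prompts, which dynamically insert prompts between frames with high temporal variance as measured by frame similarity. This enables the model to prioritize key frames and enhances its capacity to understand temporal dependencies across the sequence. Extensive experiments on various video benchmarks show that STOP consistently outperforms state-of-the-art methods. Code is available at https://github.com/zhoujiahuan1991/CVPR2025-STOP.||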
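|**2025-03-20**|[Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation](http://arxiv.org/abs/2503.15969)|null|Vision-language models for Earth observation (EO) typically rely only on the visible spectrum as model input and therefore fail to exploit the rich spectral information available in the multispectral channels recorded by satellites. In this paper, we therefore introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset, and report the performance gains that result from the extended spectral range. Furthermore, we present the largest image-caption dataset for multispectral data to date, consisting of one million Sentinel-2 samples and corresponding textual descriptions generated with Llama3-LLaVA-Next and Overture Maps data. We develop a scalable captioning pipeline, which has been validated by domain experts. We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity. Our results show that Llama3-MS-CLIP significantly outperforms other RGB-based approaches, improving classification accuracy by 6.77% on average and retrieval performance by 4.63% mAP compared with the second-best model. Our results emphasize the relevance of multispectral vision-language learning. We release the image-caption dataset, code, and model weights under an open-source license.||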
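|**2025-03-20**|[CausalCLIPSeg: Unlocking CLIP's Potential in Referring Medical Image Segmentation with Causal Intervention](http://arxiv.org/abs/2503.15949)|**[link](https://github.com/wutcm-lab/causalclipseg)**|Referring medical image segmentation aims to delineate lesions indicated by textual descriptions. Aligning visual and textual cues is challenging because of their distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Although CLIP is not trained on medical data, we transfer its rich semantic space to the medical domain through a tailored cross-modal decoding method to achieve text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module that self-annotates confounders and mines causal features from the inputs for segmentation judgments. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of the proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.||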
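|**2025-03-20**|[Don't Fight Hallucinations, Use Them: Estimating Image Realism using NLI over Atomic Facts](http://arxiv.org/abs/2503.15948)|**[link](https://github.com/s-nlp/dont-fight-hallucinations)**|Quantifying the realism of images remains a challenging problem in artificial intelligence. For example, an image of Albert Einstein holding a smartphone violates common sense because modern smartphones were invented after Einstein's death. We introduce a novel method to assess image realism using large vision-language models (LVLMs) and natural language inference (NLI). Our approach is based on the premise that LVLMs may generate hallucinations when confronted with images that defy common sense. Using an LVLM to extract atomic facts from these images, we obtain a mix of accurate facts and erroneous hallucinations. We then calculate pairwise entailment scores among these facts and aggregate the values to yield a single reality score. This process serves to identify contradictions between genuine facts and hallucinatory elements, signaling the presence of images that violate common sense. Our approach achieves new state-of-the-art performance in zero-shot mode on the WHOOPS! dataset.||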
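|**2025-03-20**|[UMIT: Unifying Medical Imaging Tasks via Vision-Language Models](http://arxiv.org/abs/2503.15892)|**[link](https://github.com/dz-osamu/UMIT)**|With the rapid advancement of deep learning, particularly in medical image analysis, a growing number of vision-language models (VLMs) are being widely applied to complex health and biomedical challenges. However, existing research has primarily focused on specific tasks or single modalities, which limits applicability and generalization across diverse medical scenarios. In response to this challenge, we propose UMIT, a unified multimodal, multi-task VLM designed specifically for medical imaging tasks. UMIT can solve a variety of tasks, including visual question answering, disease detection, and medical report generation. In addition, it applies to multiple imaging modalities (e.g., X-ray, CT, and PET), covering a broad range of applications from basic diagnostics to complex lesion analysis. UMIT also supports both English and Chinese, expanding its global applicability and ensuring access to healthcare services in different linguistic contexts. To enhance the model's adaptability and task-handling capability, we design a unique two-stage training strategy and fine-tune UMIT with designed instruction templates. Through extensive empirical evaluation, UMIT outperforms previous methods on five tasks across multiple datasets. The performance of UMIT indicates that it can significantly improve diagnostic accuracy and workflow efficiency, providing an effective solution for medical imaging applications.||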
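|**2025-03-20**|[What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?](http://arxiv.org/abs/2503.15846)|null|Dynamic scene graph generation (DSGG) for videos is a challenging task in computer vision. While existing approaches often focus on sophisticated architectural design and use only recall during evaluation, we take a closer look at their predicted scene graphs and discover three critical issues with existing DSGG methods: a severe precision-recall trade-off, a lack of awareness of triplet importance, and inappropriate evaluation protocols. On the other hand, recent advances in large multimodal models (LMMs) have shown great potential in video understanding, yet they have not been tested on fine-grained, frame-wise understanding tasks such as DSGG. In this work, we conduct the first systematic analysis of video LMMs for performing DSGG. Without relying on sophisticated architectural design, we show that LMMs with a simple decoder-only structure can be turned into state-of-the-art scene graph generators that effectively overcome the aforementioned issues, while requiring only a small amount of fine-tuning (5-10% of the training data).||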
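|**2025-03-20**|[AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models](http://arxiv.org/abs/2503.15778)|null|In autonomous driving, open-ended question answering often suffers from unreliable evaluation because freeform responses require either complex metrics or subjective human judgment. To address this challenge, we introduce AutoDrive-QA, an automatic pipeline that converts existing driving QA datasets (including DriveLM, NuScenes-QA, and LingoQA) into a structured multiple-choice question (MCQ) format. The benchmark systematically assesses perception, prediction, and planning tasks, providing a standardized and objective evaluation framework. AutoDrive-QA uses an automated pipeline in which large language models (LLMs) generate high-quality, contextually relevant distractors based on domain-specific error patterns commonly found in autonomous driving scenarios. To evaluate both general capabilities and generalization performance, we test the benchmark on three public datasets and conduct zero-shot experiments on an unseen dataset. The zero-shot evaluations show that GPT-4V leads with 69.57% accuracy (74.94% in perception, 65.33% in prediction, and 68.45% in planning), indicating that although all models excel in perception, they struggle in prediction. AutoDrive-QA thus establishes a rigorous, unbiased standard for integrating and evaluating different vision-language models across various autonomous driving datasets, thereby improving generalization in this field. We release all the code in the AutoDrive-QA GitHub repository.||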
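|**2025-03-19**|[TULIP: Towards Unified Language-Image Pretraining](http://arxiv.org/abs/2503.15485)|null|Despite the recent success of image-text contrastive models such as CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. By performing language alignment, these models tend to prioritize high-level semantics over visual understanding, which weakens their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method uses generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach scales to over 1B parameters and outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a $2\times$ enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over $3\times$ higher scores than SigLIP on MMVP. Our code and checkpoints are available at https://tulip-berkeley.github.io.||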
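|**2025-03-14**|[Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense](http://arxiv.org/abs/2503.11619)|null|Deploying large vision-language models (LVLMs) introduces a unique vulnerability: susceptibility to malicious attacks through visual inputs. However, existing defense methods suffer from two key limitations: (1) they focus solely on textual defenses and fail to directly address threats in the visual domain where the attacks originate, and (2) their additional processing steps often incur significant computational overhead or compromise model performance on benign tasks. Building on these insights, we propose ESIII (Embedding Security Instructions Into Images), a novel methodology that transforms the visual space from a source of vulnerability into an active defense mechanism. First, we embed security instructions into defensive images through gradient-based optimization, obtaining security instructions in the visual dimension. Subsequently, we integrate the security instructions from the visual and textual dimensions with the input query. The collaboration between security instructions from different dimensions ensures comprehensive security protection. Extensive experiments show that our approach effectively strengthens the robustness of LVLMs against such attacks while preserving their performance on standard benign tasks and adding only a negligible time cost.||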
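|**2025-03-14**|[Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers](http://arxiv.org/abs/2503.11579)|null|State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs because the quadratic complexity of causal self-attention leads to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction and build a hybrid Mamba-Transformer model (VAMBA) that uses Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640×360) on a single GPU, while transformer-based models can encode only 256 frames. On long video input, VAMBA achieves at least a 50% reduction in GPU memory usage during training and inference and nearly doubles the speed per training step compared with transformer-based LMMs. Our experimental results show that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs and maintains strong performance on a broad range of long and short video understanding tasks.||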
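|**2025-03-14**|[SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion](http://arxiv.org/abs/2503.11576)|null|We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundation models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers end-to-end conversion that accurately captures the content, structure, and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling performs robustly in correctly reproducing document features such as code listings, tables, equations, charts, and lists across a diverse range of document types, including business documents, academic papers, technical reports, patents, and forms, which significantly extends beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for chart, table, equation, and code recognition. Experimental results show that SmolDocling competes with other vision-language models that are up to 27 times larger while substantially reducing computational requirements. The model is currently available, and the datasets will be publicly available soon.||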
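|**2025-03-14**|[Similarity-Aware Token Pruning: Your VLM but Faster](http://arxiv.org/abs/2503.11549)|**[link](https://github.com/ArmenJeddi/saint)**|The computational demands of Vision Transformers (ViTs) and vision-language models (VLMs) remain a significant challenge because of the quadratic complexity of self-attention. While token pruning offers a promising solution, existing methods often introduce training overhead or fail to adapt dynamically across layers. We present SAINT, a training-free token pruning framework that uses token similarity and a graph-based formulation to dynamically optimize pruning rates and redundancy thresholds. Through systematic analysis, we identify a universal three-stage token evolution process (aligner-explorer-aggregator) in transformers, which enables aggressive pruning in early stages without sacrificing critical information. For ViTs, SAINT doubles the throughput of ViT-H/14 at 224px with only a 0.6% accuracy loss on ImageNet-1K, surpassing the closest competitor by 0.8%. For VLMs, we apply SAINT in three modes: ViT-only, LLM-only, and hybrid. SAINT reduces LLaVA-13B's tokens by 75%, achieving latency comparable to LLaVA-7B with less than 1% performance loss across benchmarks. Our work establishes a unified, practical framework for efficient inference in ViTs and VLMs.||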
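|**2025-03-14**|[Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models](http://arxiv.org/abs/2503.11519)|null|Current cross-modality generation models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision-modality inputs in real-world scenarios, cross-vision tasks, including vision-language perception (VLP) and image-to-image (I2I) tasks, have attracted significant attention. Large vision-language models (LVLMs) and I2I generation models are used to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I generation models to produce disruptive outputs that are semantically related to those words. In addition, visual prompts, a more sophisticated form of typography, have also been found to pose security risks to various applications of VLP tasks when injected into images. In this paper, we comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) on various LVLMs and I2I generation models. To better observe the performance modifications and characteristics of this threat, we also introduce the TVPI Dataset. Through extensive explorations, we deepen the understanding of the underlying causes of the TVPI threat in various generation models and offer valuable insights into its potential origins.||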
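|**2025-03-14**|[PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models](http://arxiv.org/abs/2503.11360)|null|Language-guided attention frameworks have significantly improved both interpretability and performance in image classification; however, reliance on deterministic embeddings from pre-trained vision-language foundation models to generate reference attention maps frequently overlooks the intrinsic multivaluedness and ill-posed nature of cross-modal mappings. To address these limitations, we introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps, which align the textual and visual modalities more effectively than deterministic approaches while incorporating uncertainty estimates. Experiments on benchmark test problems show that PARIC improves prediction accuracy, mitigates bias, ensures consistent predictions, and improves robustness across various datasets.||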
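|**2025-03-14**|[Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset](http://arxiv.org/abs/2503.11342)|null|Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression and lacks proactive prevention capabilities. With the advent of vision-language models (VLMs), it has become possible to reason about trigger events visually and then engage in dialog-based comforting before the driver's anger escalates. To this end, we propose the road rage reasoning task, along with a finely annotated test dataset and evaluation metrics, to assess the capabilities of current mainstream VLMs in scene understanding, event recognition, and road rage reasoning. The results indicate that current VLMs exhibit significant shortcomings in scene understanding within the visual modality, as well as in comprehending the spatial relationships between objects in the textual modality. Improving VLMs' performance in these areas will greatly benefit downstream tasks such as antecedent-focused road rage regulation.||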
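|**2025-03-14**|[DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models](http://arxiv.org/abs/2503.11265)|null|Visual question answering (VQA) models, which fall under the category of vision-language models, conventionally apply multiple downsampling steps to image inputs to strike a balance between computational efficiency and model performance. Although this approach helps concentrate on salient features and reduces computational burden, it causes the loss of vital detailed information, which is particularly detrimental in end-to-end autonomous driving scenarios. Downsampling can lead to inadequate capture of distant or small objects such as pedestrians, road signs, or obstacles, all of which are crucial for safe navigation. This loss of features negatively affects an autonomous driving system's capacity to perceive the environment accurately and may increase the risk of accidents. To tackle this problem, we propose the Dynamic Resolution Vision Language Model (DynRsl-VLM). DynRsl-VLM uses a dynamic-resolution image input processing approach that captures all entity feature information within an image while ensuring that the image input remains computationally tractable for the Vision Transformer (ViT). Moreover, we devise a novel image-text alignment module to replace the Q-Former, enabling simple and efficient alignment with text when dealing with dynamic-resolution image inputs. Our method enhances the environmental perception capabilities of autonomous driving systems without exceeding computational constraints.||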
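|**2025-03-14**|[Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment](http://arxiv.org/abs/2503.11229)|null|Large multimodal models (LMMs) have demonstrated exceptional performance across a wide range of domains. This paper explores their potential in pronunciation assessment tasks, with a particular focus on evaluating the capabilities of the Generative Pre-trained Transformer (GPT) model, specifically GPT-4o. Our study investigates its ability to process speech and audio for pronunciation assessment across multiple levels of granularity and dimensions, with an emphasis on feedback generation and scoring. For our experiments, we use the publicly available Speechocean762 dataset. The evaluation focuses on two key aspects: multi-level scoring and the practicality of the generated feedback. Scoring results are compared against the manual scores provided in the Speechocean762 dataset, while feedback quality is assessed using large language models (LLMs). The findings highlight the effectiveness of integrating LMMs with traditional methods for pronunciation assessment, offering insights into the model's strengths and identifying areas for further improvement.||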
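|**2025-03-14**|[Augmenting Image Annotation: A Human-LMM Collaborative Framework for Efficient Object Selection and Label Generation](http://arxiv.org/abs/2503.11096)|null|Traditional image annotation tasks rely heavily on human effort for object selection and label assignment, making the process time-consuming and prone to decreased efficiency as annotators experience fatigue after extended work. This paper introduces a novel framework that uses the visual understanding capabilities of large multimodal models (LMMs), particularly GPT, to assist annotation workflows. In our proposed approach, human annotators focus on selecting objects via bounding boxes, while the LMM autonomously generates relevant labels. This human-AI collaborative framework improves annotation efficiency by reducing the cognitive and time burden on human annotators. By analyzing the system's performance across various types of annotation tasks, we demonstrate its ability to generalize to tasks such as object recognition, scene description, and fine-grained categorization. Our proposed framework highlights the potential of this approach to redefine annotation workflows, offering a scalable and efficient solution for large-scale data labeling in computer vision. Finally, we discuss how integrating LMMs into the annotation pipeline can advance bidirectional human-AI alignment, as well as the challenge of alleviating the "endless annotation" burden caused by information overload by shifting part of the work to AI.||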
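|**2025-03-13**|[A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1](http://arxiv.org/abs/2503.10635)|**[link](https://github.com/vila-lab/m-attack)**|Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analysis of failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, causing the attack to fail. To overcome these issues, we note that identifying core semantic objects is a key objective for models trained with various datasets and methodologies. This insight motivates our approach, which improves semantic clarity by encoding explicit semantic details within local regions, thus ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is randomly cropped with a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models such as o1, Claude-3.7-thinking, and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods. Our adversarial examples optimized under different configurations and our training code are available at https://github.com/VILA-Lab/M-Attack.||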
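|**2025-03-13**|[HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model](http://arxiv.org/abs/2503.10631)|null|Recent advances in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pre-trained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods add a diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, we propose a collaborative training recipe that injects diffusion modeling directly into next-token prediction. With this recipe, we find that the two forms of action prediction not only reinforce each other but also perform differently across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses the two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.||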
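|**2025-03-13**|[DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding](http://arxiv.org/abs/2503.10621)|**[link](https://github.com/ayesha-ishaq/drivelmm-o1)**|While large multimodal models (LMMs) have demonstrated strong performance across various visual question answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. Autonomous driving is one particularly challenging task, as it demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that produces it. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model fine-tuned on our reasoning dataset, which demonstrates robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at https://github.com/ayesha-ishaq/DriveLMM-o1.||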
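|**2025-03-13**|[CoSTA $\ast$ : Cost-Sensitive Toolpath Agent for Multi-turn Image Editing](http://arxiv.org/abs/2503.10613)|**[link](https://github.com/tianyi-lab/CoSTAR)**|Text-to-image models such as Stable Diffusion and DALLE-3 still struggle with multi-turn image editing. We decompose such a task into an agentic workflow (path) of tool use that addresses a sequence of subtasks with AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. While large language models (LLMs) possess prior knowledge of subtask planning, they may lack accurate estimates of tool capabilities and costs and so cannot determine which tool to apply in each subtask. Can we combine the strengths of LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach, "CoSTA*", that uses an LLM to create a subtask tree, which helps prune the graph of AI tools for the given task, and then conducts A* search on the small subgraph to find a tool path. To better balance total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask's output is then evaluated by a vision-language model (VLM), and a failure triggers an update of the tool's cost and quality on that subtask. Thus, the A* search can recover quickly from failures and explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel, challenging benchmark for multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models or agents in both cost and quality and supports versatile trade-offs according to user preference.||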
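|**2025-03-13**|[GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding](http://arxiv.org/abs/2503.10596)|**[link](https://github.com/hustvl/groundingsuite)**|Pixel grounding, which covers tasks such as Referring Expression Segmentation (RES), has attracted considerable attention because of its great potential for bridging the vision and language modalities. However, progress in this domain is currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework that uses multiple vision-language model (VLM) agents; (2) a large-scale training dataset of 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a carefully curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset enables substantial performance improvements, allowing models trained on it to achieve state-of-the-art results: specifically, a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework is more efficient than the current leading data annotation method; for example, it is 4.5 times faster than GLaMM.||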
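|**2025-03-13**|[VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search](http://arxiv.org/abs/2503.10582)|null|Vision-language models have made significant progress on many perception-focused tasks; however, their progress on reasoning-focused tasks appears limited because of the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that uses a search engine to create a diverse, high-quality dataset spanning multiple disciplines such as math, physics, finance, and chemistry. Starting with 30,000 carefully selected seed images, we use Google Image Search to identify websites containing similar images. We collect and process the HTML from over 700K unique URL sources. Through a pipeline of content extraction, filtering, and synthesis, we build a dataset of approximately 900K question-answer pairs, 40% of which are visual QA pairs and the rest text QA pairs. Models fine-tuned on VisualWebInstruct show significant performance gains: (1) training from Llava-OV-mid shows 10-20% absolute point gains across benchmarks, and (2) training from MAmmoTH-VL shows a 5% absolute gain. Our best model, MAmmoTH-VL2, shows the best performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%). These results highlight the effectiveness of our dataset in enhancing the reasoning capabilities of vision-language models for complex multimodal tasks.||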
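|**2025-03-13**|[Towards Fast, Memory-based and Data-Efficient Vision-Language Policy](http://arxiv.org/abs/2503.10322)|null|Vision-language models (VLMs) pre-trained on Internet-scale vision-language data have demonstrated the potential to transfer their knowledge to robotic learning. However, the existing paradigm faces three critical challenges: (1) expensive inference cost resulting from large-scale model parameters, (2) frequent domain shifts caused by mismatched data modalities, and (3) limited capacity to handle past or future experiences. In this work, we propose LiteVLP, a lightweight, memory-based, general-purpose vision-language policy generation model. LiteVLP is built on a pre-trained 1B-parameter VLM and fine-tuned on a tiny-scale, conversation-style robotic dataset. Through extensive experiments, we demonstrate that LiteVLP outperforms state-of-the-art vision-language policies on VIMA-Bench with minimal training time. Furthermore, LiteVLP exhibits superior inference speed while maintaining very high accuracy. On long-horizon manipulation tasks, LiteVLP also shows remarkable memory ability, outperforming the best-performing baseline model by 18.8%. These results highlight LiteVLP as a promising model for integrating the intelligence of VLMs into robotic learning.||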
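|**2025-03-13**|[SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence](http://arxiv.org/abs/2503.10265)|null|The integration of vision-language models (VLMs) into surgical intelligence is hindered by hallucinations, domain knowledge gaps, and a limited understanding of task interdependencies within surgical scenes, which undermines clinical reliability. While recent VLMs demonstrate strong general reasoning and thinking capabilities, they still lack the domain expertise and task awareness required for precise interpretation of surgical scenes. Although chain-of-thought (CoT) can structure reasoning more effectively, current approaches rely on self-generated CoT steps, which often exacerbate inherent domain gaps and hallucinations. To overcome this, we present SurgRAW, a CoT-driven multi-agent framework that delivers transparent, interpretable insights for most tasks in robotic-assisted surgery. By using tailored CoT prompts across five tasks (instrument recognition, action recognition, action prediction, patient data extraction, and outcome assessment), SurgRAW mitigates hallucinations through structured, domain-aware reasoning. Retrieval-augmented generation (RAG) is also integrated with external medical knowledge to bridge domain gaps and improve response reliability. Most importantly, a hierarchical agentic system ensures that CoT-embedded VLM agents collaborate effectively while understanding task interdependencies, with a panel discussion mechanism that promotes logical consistency. To evaluate our method, we introduce SurgCoTBench, the first reasoning-based dataset with structured frame-level annotations. Through comprehensive experiments, we demonstrate that the proposed SurgRAW improves accuracy by 29.32% over baseline VLMs on 12 robotic procedures, achieving state-of-the-art performance and advancing explainable, trustworthy, and autonomous surgical assistance.||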
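|**2025-03-13**|[ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning](http://arxiv.org/abs/2503.10166)|**[link](https://github.com/pengfei-luo/ImageScope)**|With the proliferation of images in online content, language-guided image retrieval (LGIR) has emerged as a research hotspot over the past decade, encompassing a variety of subtasks with diverse input forms. While the development of large multimodal models (LMMs) has significantly facilitated these tasks, existing approaches often address them in isolation, requiring a separate system for each task. This not only increases system complexity and maintenance costs but also exacerbates the challenges that stem from language ambiguity and complex image content, making it difficult for retrieval systems to provide accurate and reliable results. To this end, we propose ImageScope, a training-free, three-stage framework that uses collective reasoning to unify LGIR tasks. The key insight behind the unification lies in the compositional nature of language, which transforms diverse LGIR tasks into a generalized text-to-image retrieval process, with LMM reasoning serving as a universal verification step to refine the results. Specifically, in the first stage, we improve the robustness of the framework by synthesizing search intents at different levels of semantic granularity using chain-of-thought (CoT) reasoning. In the second and third stages, we reflect on the retrieval results by verifying predicate propositions locally and performing pairwise evaluations globally. Experiments on six LGIR datasets show that ImageScope outperforms competitive baselines. Comprehensive evaluations and ablation studies further confirm the effectiveness of our design.||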
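|**2025-03-13**|[IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models](http://arxiv.org/abs/2503.10110)|null|Motion planning involves determining a sequence of robot configurations to reach a desired pose, subject to movement and safety constraints. Traditional motion planning finds collision-free paths, but this is overly restrictive in cluttered environments, where it may not be possible for a robot to accomplish a task without contact. In addition, contacts range from relatively benign (e.g., brushing a soft pillow) to more dangerous (e.g., toppling a glass vase). Because of this diversity, it is difficult to characterize which contacts are acceptable or unacceptable. In this paper, we propose IMPACT, a novel motion planning framework that uses vision-language models (VLMs) to infer environment semantics, identifying which parts of the environment can best tolerate contact based on object properties and locations. Our approach uses the VLM's outputs to produce a dense 3D "cost map" that encodes contact tolerances and integrates seamlessly with standard motion planners. We perform experiments using 20 simulation scenes and 10 real-world scenes and assess them using task success rate, object displacements, and feedback from human evaluators. Our results over 3620 simulation and 200 real-world trials suggest that IMPACT enables efficient contact-rich motion planning in cluttered settings while outperforming alternative methods and ablations. Supplementary material is available at https://impact-planning.github.io/.||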
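|**2025-03-11**|[ComicsPAP: understanding comic strips by picking the correct panel](http://arxiv.org/abs/2503.08561)|null|Large multimodal models (LMMs) have made impressive strides in image captioning, visual question answering, and video comprehension, yet they still struggle with the intricate temporal and spatial cues found in comics. To address this gap, we introduce ComicsPAP, a large-scale benchmark designed for comic strip understanding. ComicsPAP comprises more than 100k samples, organized into 5 subtasks under a "Pick-a-Panel" framework that requires models to identify the missing panel in a sequence. Our evaluations, conducted under both multi-image and single-image protocols, reveal that current state-of-the-art LMMs perform near chance on these tasks, underscoring significant limitations in capturing sequential and contextual dependencies. To close the gap, we adapt LMMs for comic strip understanding and obtain better results on ComicsPAP than models 10 times larger, demonstrating that ComicsPAP offers a robust resource to drive future research in multimodal comic comprehension.||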
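|**2025-03-11**|[GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training](http://arxiv.org/abs/2503.08525)|null|Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates the problem through extensive experiments on complex card games, such as 24 points, and on embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs and instead leads to a phenomenon we term thought collapse, characterized by a rapid loss of diversity in the agent's thoughts, state-irrelevant and incomplete reasoning, and subsequent invalid actions that result in negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent's reasoning at each RL step. This simple and scalable GTR (Guided Thought Reinforcement) framework trains reasoning and action simultaneously without the need for dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7b model across various visual environments, achieving 3-5 times higher task success rates than state-of-the-art models, with notably smaller model sizes.||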
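|**2025-03-11**|[External Knowledge Injection for CLIP-Based Class-Incremental Learning](http://arxiv.org/abs/2503.08510)|**[link](https://github.com/g-u-n/pycil)**|Class-incremental learning (CIL) enables learning systems to continuously adapt to evolving data streams. With the advancement of pre-training, using pre-trained vision-language models (e.g., CLIP) offers a promising starting point for CIL. However, CLIP makes decisions by matching visual embeddings to class names, overlooking the rich contextual information conveyed through language. For instance, the concept of "cat" can be decomposed into features such as tail, fur, and face for recognition. Moreover, because the model is continually updated, these detailed features are overwritten in CIL and require external knowledge for compensation. In this paper, we introduce ExterNal knowledGe INjEction (ENGINE) for CLIP-based CIL. To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both the visual and textual modalities. The visual branch is enhanced with data augmentation to enrich the visual features, while the textual branch uses GPT-4 to rewrite discriminative descriptors. In addition to this on-the-fly knowledge injection, we also implement post-tuning knowledge by re-ranking the prediction results during inference. With the injected knowledge, the model can better capture informative features for downstream tasks as data evolves. Extensive experiments demonstrate the state-of-the-art performance of ENGINE. Code is available at: https://github.com/RenaissCode/ENGINE||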
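|**2025-03-11**|[MMRL: Multi-Modal Representation Learning for Vision-Language Models](http://arxiv.org/abs/2503.08497)|**[link](https://github.com/yunncheng/MMRL)**|Large-scale pre-trained vision-language models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, which degrades their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that optimize only class token features, MMRL integrates representation tokens at the higher layers of the encoders, where dataset-specific features are more prominent, while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized: a trainable projection layer is applied to the representation tokens, whereas the class token projection layer is kept frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class features and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model's generalization capacity. At inference, a decoupling strategy is employed in which both representation and class features are used for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL outperforms state-of-the-art methods, achieving a balanced trade-off between task-specific adaptation and generalization. Code is available at https://github.com/yunncheng/MMRL .||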
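|**2025-03-11**|[SuperCap: Multi-resolution Superpixel-based Image Captioning](http://arxiv.org/abs/2503.08496)|null|A long-standing goal in image captioning is to move away from reliance on object detection. We investigate using superpixels together with vision-language models (VLMs) to bridge the gap between detector-based captioning architectures and those pre-trained solely on large datasets. Our novel superpixel approach ensures that the model receives object-like features, while the use of VLMs gives the model open-set object understanding. Furthermore, we extend our architecture to make use of multi-resolution inputs, allowing the model to view images at different levels of detail and to use an attention mechanism to determine which parts are most relevant to the caption. We demonstrate our model's performance with multiple VLMs and through a range of ablations detailing the impact of different architectural choices. Our full model achieves a competitive CIDEr score of 136.9 on the COCO Karpathy split.||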
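|**2025-03-11**|[PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability](http://arxiv.org/abs/2503.08481)|**[link](https://github.com/unira-zwj/PhysVLM)**|Understanding the environment and a robot's physical reachability is crucial for task execution. While state-of-the-art vision-language models (VLMs) excel at environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks because they lack an understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, the Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning. Specifically, the S-P Map abstracts a robot's physical reachability into a generalized spatial representation that is independent of specific robot configurations, allowing the model to focus on reachability features rather than robot-specific parameters. PhysVLM then extends traditional VLM architectures with an additional feature encoder that processes the S-P Map, enabling the model to reason about physical reachability without compromising its general vision-language capabilities. To train and evaluate PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments. Experimental results show that PhysVLM outperforms existing models, achieving a 14% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks. Additionally, the S-P Map shows strong compatibility with various VLMs, and integrating it into GPT-4o-mini yields a 7.1% performance improvement.||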
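|**2025-03-11**|[Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving](http://arxiv.org/abs/2503.08336)|null|Embodied outdoor scene understanding forms the foundation for autonomous agents to perceive, analyze, and react to dynamic driving environments. However, existing 3D understanding is predominantly based on 2D vision-language models (VLMs), which collect and process limited scene-aware context. In contrast to 2D planar visual information, point cloud sensors such as LiDAR offer rich depth information and fine-grained 3D representations of objects. Meanwhile, the emerging 4D millimeter-wave (mmWave) radar can detect the motion trend, velocity, and reflection intensity of each object. Integrating these two modalities therefore provides more flexible querying conditions for natural language and enables more accurate 3D visual grounding. To this end, we propose TPCNet, the first outdoor 3D visual grounding model built on a prompt-guided combination of point cloud sensors, including both LiDAR and radar contexts. To adaptively balance the features of the two sensors as required by the prompt, we design a multi-fusion paradigm called Two-Stage Heterogeneous Modal Adaptive Fusion. Specifically, this paradigm first employs Bidirectional Agent Cross-Attention (BACA), which feeds dual-sensor features with global receptive fields to the text features for querying. In addition, we design a Dynamic Gated Graph Fusion (DGGF) module to locate the regions of interest identified by the queries. To further improve accuracy, we design a C3D-RECHead based on the nearest object edge. Our experiments demonstrate that TPCNet and its individual modules achieve state-of-the-art performance on both the Talk2Radar and Talk2Car datasets.||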
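|**2025-03-11**|[Modeling Variants of Prompts for Vision-Language Models](http://arxiv.org/abs/2503.08229)|**[link](https://github.com/liaolea/mvp)**|Large pre-trained vision-language models (VLMs) offer a promising approach to using human language to enhance downstream tasks. However, VLMs such as CLIP face a significant limitation: their performance is highly sensitive to the design of prompt templates. Although prompt learning methods can address this sensitivity by replacing natural language prompts with learnable ones, the learned prompts are incomprehensible to humans. Ensuring consistent performance across various prompt templates lets models adapt seamlessly to diverse phrasings, enhancing their ability to handle downstream tasks without extensive prompt engineering. In this work, we introduce the RobustPrompt Benchmark, a systematic benchmark for evaluating the robustness of VLMs to different prompt templates. It includes a dataset with hundreds of carefully designed prompt templates, divided into six types that cover a wide variety of commonly used templates. Beyond the benchmark, we propose Modeling Variants of Prompts (MVP), a simple yet effective method that mitigates sensitivity by modeling variants of prompt structures. The innovation of MVP lies in decoupling prompts into templates and class names, and using a Variational Autoencoder (VAE) to model the distribution of diverse prompt structures. Experiments across 11 datasets demonstrate that MVP can greatly enhance model robustness to variations in input prompts without a drop in performance. Code is available at https://github.com/xiaoyaoxinyi/MVP .||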
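|**2025-03-11**|[FASIONAD++ : Integrating High-Level Instruction and Information Bottleneck in FAt-Slow fusION Systems for Enhanced Safety in Autonomous Driving with Adaptive Feedback](http://arxiv.org/abs/2503.08162)|null|Ensuring safe, comfortable, and efficient planning is crucial for autonomous driving systems. While end-to-end models trained on large datasets perform well in standard driving scenarios, they struggle with complex low-frequency events. Recent advances in large language models (LLMs) and vision-language models (VLMs) offer enhanced reasoning but suffer from computational inefficiency. Inspired by the dual-process cognitive model "Thinking, Fast and Slow", we propose FASIONAD, a novel dual-system framework that couples a fast end-to-end planner with a VLM-based reasoning module. The fast system uses end-to-end learning to achieve real-time trajectory generation in common scenarios, while the slow system is activated through uncertainty estimation to perform contextual analysis and complex scenario resolution. Our architecture introduces three key innovations: (1) a dynamic switching mechanism, based on real-time uncertainty evaluation, that enables slow-system intervention; (2) an information bottleneck with high-level plan feedback that optimizes the slow system's guidance capability; and (3) a bidirectional knowledge exchange in which visual prompts enhance the slow system's reasoning while its feedback refines the fast planner's decision-making. To strengthen VLM reasoning, we develop a question-answering mechanism coupled with a reward-instruct training strategy. In open-loop experiments, FASIONAD achieves a 6.7% reduction in average L2 trajectory error and a 28.1% lower collision rate.||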
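|**2025-03-11**|[Uni $\textbf{F}^2$ ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models](http://arxiv.org/abs/2503.08120)|null|Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational computer vision research, showing significant potential in both image understanding and generation. However, existing research in the face domain primarily focuses on coarse-grained facial attribute understanding, with limited capacity to handle fine-grained facial attributes and no treatment of generation. To overcome these limitations, we propose UniF²ace, the first UMM tailored specifically for fine-grained face understanding and generation. In general, we train UniF²ace on a self-constructed, specialized dataset using two mutually beneficial diffusion techniques and a two-level mixture-of-experts architecture. Specifically, we first build a large-scale facial dataset, UniF²ace-130K, which contains 130K image-text pairs and one million question-answering pairs spanning a wide range of facial attributes. Second, we establish a theoretical connection between discrete diffusion score matching and masked generative models, optimizing both evidence lower bounds simultaneously, which significantly improves the model's ability to synthesize facial details. Finally, we introduce both token-level and sequence-level mixture-of-experts, enabling efficient fine-grained representation learning for understanding and generation tasks. Extensive experiments on UniF²ace-130K demonstrate that UniF²ace outperforms existing UMMs and generative models, achieving superior performance on both understanding and generation tasks.||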
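|**2025-03-07**|[VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method](http://arxiv.org/abs/2503.05383)|**[link](https://github.com/camel-ai/vlm-play-starcraft2)**|We introduce VLM-Attention, a multimodal StarCraft II environment that aligns artificial agent perception with the human gameplay experience. Traditional frameworks such as SMAC rely on abstract state representations that diverge significantly from human perception, limiting the ecological validity of agent behavior. Our environment addresses this limitation by incorporating RGB visual inputs and natural language observations that more closely simulate human cognitive processes during gameplay. The VLM-Attention framework consists of three integrated components: (1) a vision-language model enhanced with specialized self-attention mechanisms for strategic unit targeting and battlefield assessment, (2) a retrieval-augmented generation system that leverages domain-specific StarCraft II knowledge to inform tactical decisions, and (3) a dynamic role-based task distribution system that enables coordinated multi-agent behavior. Our experimental evaluation across 21 custom scenarios demonstrates that VLM-based agents powered by foundation models (specifically Qwen-VL and GPT-4o) can execute complex tactical maneuvers without explicit training, achieving performance comparable to traditional MARL methods that require substantial training iterations. This work establishes a foundation for developing human-aligned StarCraft II agents and advances the broader research agenda of multimodal game AI. Our implementation is available at https://github.com/camel-ai/VLM-Play-StarCraft2 .||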
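|**2025-03-07**|[Robust Multimodal Learning for Ophthalmic Disease Grading via Disentangled Representation](http://arxiv.org/abs/2503.05319)|**[link](https://github.com/xinkunwang111/robust-multimodal-learning-for-ophthalmic-disease-grading-via-disentangled-representation)**|This paper discusses how ophthalmologists often rely on multimodal data to improve diagnostic accuracy. However, complete multimodal data is rare in real-world applications because of a lack of medical equipment and concerns about data privacy. Traditional deep learning methods typically address these issues by learning representations in latent space. The paper highlights two key limitations of these approaches: (i) task-irrelevant redundant information in complex modalities (e.g., massive numbers of slices) leads to significant redundancy in latent space representations, and (ii) overlapping multimodal representations make it difficult to extract features unique to each modality. To overcome these challenges, the authors propose the Essence-Point and Disentangle Representation Learning (EDRL) strategy, which integrates a self-distillation mechanism into an end-to-end framework to enhance feature selection and disentanglement for more robust multimodal learning. Specifically, the Essence-Point Representation Learning module selects discriminative features that improve disease grading performance. The Disentangled Representation Learning module separates multimodal data into modality-common and modality-unique representations, reducing feature entanglement and enhancing both robustness and interpretability in ophthalmic disease diagnosis. Experiments on multimodal ophthalmology datasets show that the proposed EDRL strategy significantly outperforms current state-of-the-art methods.||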
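|**2025-03-07**|[Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions](http://arxiv.org/abs/2503.05186)|null|In recent text-video retrieval research, the use of additional captions from vision-language models has shown a positive effect on performance. However, existing models that use additional captions often struggle to capture the rich semantics inherent in the video, including temporal changes. In addition, incorrect information produced by generative models can lead to inaccurate retrieval. To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, that is, the narration. The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect narration information, 3) a dual-modal matching score obtained by adding query-video similarity and query-narration similarity, and 4) a hard-negative loss that learns discriminative features from the two similarities, which offer different perspectives. Experimental results demonstrate that NarVid achieves state-of-the-art performance on various benchmark datasets.||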
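|**2025-03-07**|[Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation](http://arxiv.org/abs/2503.05064)|null|Vision-language models (VLMs) demonstrate remarkable potential in robotic manipulation, yet challenges persist in executing complex fine manipulation tasks with high speed and precision. While existing VLM methods excel at high-level planning, they fall short in guiding robots through precise sequences of fine motor actions. To address this limitation, we introduce a progressive VLM planning algorithm that enables robots to perform fast, precise, and error-correctable fine manipulation. Our method decomposes complex tasks into sub-actions and maintains three key data structures: a task memory structure, a 2D topology graph, and a 3D spatial network, achieving high-precision spatial-semantic fusion. Together, these three components accumulate and store critical information throughout task execution, providing rich context for our task-oriented VLM interaction mechanism. This allows the VLM to dynamically adjust its guidance based on real-time feedback, generating precise action plans and facilitating step-wise error correction. Experimental validation on complex assembly tasks demonstrates that our algorithm effectively guides robots to complete fine manipulation rapidly and precisely in challenging scenarios, significantly advancing robot intelligence for precision tasks.||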
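|**2025-03-06**|[LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression](http://arxiv.org/abs/2503.04982)|null|Despite recent efforts to understand how compression affects large language models (LLMs) in terms of downstream task performance and trustworthiness on relatively simple uni-modal benchmarks (for example, question answering and common sense reasoning), a detailed study of its effect on multi-modal large vision-language models (LVLMs) is still missing. To bridge this gap, we present LVLM-Compress-Bench, a framework for a thorough first study of the broad impact of compression on the generative performance of LVLMs on tasks driven by multi-modal input. Specifically, we consider two major classes of compression for autoregressive models, KV cache compression and weight compression, which target the dynamically growing intermediate cache and the static weights, respectively. We use four LVLM variants of the popular LLaVA framework to present our analysis, integrating various state-of-the-art KV and weight compression methods, including uniform, outlier-reduced, and group quantization for the KV cache and weights. With this framework, we report results on ten different multi-modal datasets covering diverse capabilities, including recognition, knowledge, language generation, spatial awareness, visual reasoning, hallucination and visual illusion identification, toxicity, stereotypes, and bias. In particular, our framework uses a combination of real-world and synthetic datasets to encompass diverse societal intersectional attributes, and thus demonstrates the impact of compression on both general and ethically critical metrics. Extensive experimental evaluations yield diverse and intriguing observations on the behavior of LVLMs at different quantization budgets for KV and weights, in both maintaining and losing performance compared to the baseline model in FP16 format. Code will be open-sourced at https://github.com/opengear-project/LVLM-compress-bench .||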
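|**2025-03-06**|[Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach](http://arxiv.org/abs/2503.04918)|null|Artificial intelligence has progressed through the development of vision-language models (VLMs), which integrate text and visual inputs to achieve comprehensive understanding and interaction in various contexts. Improving the performance of these models, such as the transformer-based Florence 2, on specialized tasks like object detection in complex and unstructured environments requires fine-tuning. The goal of this paper is to improve the efficiency of the Florence 2 model in challenging environments through fine-tuning. We accomplish this by experimenting with different configurations, using various GPU types (T4, L4, A100) and optimizers such as AdamW and SGD. We also employ a range of learning rates and LoRA (Low-Rank Adaptation) settings. Analysis of performance metrics such as mean Average Precision (mAP) shows that the fine-tuned Florence 2 models perform comparably to YOLO models, including YOLOv8, YOLOv9, and YOLOv10. This demonstrates how transformer-based VLMs can be adapted to detailed object detection tasks. The paper emphasizes that optimized transformer-based VLMs can address the specific challenges of object detection in unstructured environments, opening promising avenues for practical applications in demanding and complex settings.||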
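|**2025-03-06**|[A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning](http://arxiv.org/abs/2503.04592)|null|Remote sensing image captioning (RSIC) is a cross-modal field bridging vision and language that aims to automatically generate natural language descriptions of the features and scenes in remote sensing imagery. Despite significant advances in developing sophisticated methods and large-scale datasets for training vision-language models (VLMs), two critical challenges persist: the scarcity of non-English descriptive datasets and the lack of multilingual capability evaluation for models. These limitations fundamentally impede the progress and practical deployment of RSIC, particularly in the era of large VLMs. To address these challenges, this paper makes several key contributions. First, we present and analyze BRSIC (Bilingual Remote Sensing Image Captioning), a comprehensive bilingual dataset that enriches three established English RSIC datasets with Chinese descriptions, encompassing 13,634 images paired with 68,170 bilingual captions. Building on this foundation, we develop a systematic evaluation framework that addresses the prevalent inconsistency in evaluation protocols, enabling rigorous assessment of model performance through standardized retraining procedures on BRSIC. Furthermore, we present an extensive empirical study of eight state-of-the-art large vision-language models (LVLMs), examining their capabilities across multiple paradigms, including zero-shot inference, supervised fine-tuning, and multi-lingual training. This comprehensive evaluation provides crucial insights into the strengths and limitations of current LVLMs in handling multilingual remote sensing tasks. Additionally, our cross-dataset transfer experiments reveal interesting findings. The code and data will be available at https://github.com/mrazhou/BRSIC .||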
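|**2025-03-06**|[AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM](http://arxiv.org/abs/2503.04504)|**[link](https://github.com/SkiddieAhn/Paper-AnyAnomaly)**|Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. However, existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. Consequently, users must retrain models or develop separate AI models for new environments, which requires expertise in machine learning, high-performance hardware, and extensive data collection, limiting the practical usability of VAD. To address these challenges, this study proposes the customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD treats user-defined text as an abnormal event and detects frames containing the specified event in a video. We implement AnyAnomaly effectively using context-aware visual question answering, without fine-tuning the large vision-language model. To validate the effectiveness of the proposed model, we constructed C-VAD datasets and demonstrated the superiority of AnyAnomaly. Furthermore, our approach shows competitive performance on VAD benchmark datasets, achieving state-of-the-art results on the UBnormal dataset and outperforming other methods in generalization across all datasets. Our code is available online at github.com/SkiddieAhn/Paper-AnyAnomaly.||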
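|**2025-03-06**|[Semantic Alignment of Unimodal Medical Text and Vision Representations](http://arxiv.org/abs/2503.04478)|null|General-purpose AI models, particularly those designed for text and vision, demonstrate impressive versatility across a wide range of deep learning tasks. However, they often underperform in specialized domains such as medical imaging, where domain-specific solutions or alternative knowledge transfer approaches are typically required. Recent studies have noted that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not occur naturally. Building on this insight, it has been shown that applying a simple transformation, at most affine, estimated from a subset of semantically corresponding samples known as anchors, enables model stitching across diverse training paradigms, architectures, and modalities. In this paper, we explore how semantic alignment, that is, estimating transformations between anchors, can bridge general-purpose AI with specialized medical knowledge. Using multiple public chest X-ray datasets, we demonstrate that model stitching across model architectures allows general models to integrate domain-specific knowledge without additional training, leading to improved performance on medical tasks. Furthermore, we introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities. Our results show that our method not only outperforms general multimodal models but also approaches the performance levels of fully trained, medical-specific multimodal solutions.||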
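|**2025-03-06**|[TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction](http://arxiv.org/abs/2503.04457)|null|Vision-language models (VLMs) have achieved remarkable advances, capitalizing on the impressive capabilities of large language models (LLMs) across diverse tasks. Despite this, a critical challenge known as hallucination persists: models overconfidently describe objects or attributes that are absent from the image, a problem exacerbated by the tendency of VLMs to rely on linguistic priors. This limitation reduces model reliability in high-stakes applications. In this work, we observe a property of enhanced consistency in the continuity of logits and introduce a simple, efficient method, Cross-Temporal Prediction Connection (TPC), designed to enhance the semantic consistency of logits by connecting them across timesteps. TPC amplifies information flow and improves coherence, effectively reducing hallucination. Extensive experiments show that TPC surpasses existing representative methods, delivering superior performance in both accuracy and efficiency while maintaining robustness in open-ended text generation tasks.||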
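|**2025-03-06**|[ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task](http://arxiv.org/abs/2503.04444)|null|Large multimodal models (LMMs) are powerful tools capable of reasoning about and understanding multimodal information beyond text and language. Despite their entrenched impact, the development of LMMs is hindered by higher computational requirements compared to their unimodal counterparts. One of the main causes is the large number of tokens needed to encode visual input, which is especially evident in multi-image multimodal tasks. Recent approaches to reducing visual tokens depend on the visual encoder architecture, require fine-tuning the LLM to maintain performance, and only consider single-image scenarios. To address these limitations, we propose ToFu, a visual-encoder-agnostic, training-free token fusion strategy that combines redundant visual tokens of LMMs for high-resolution, multi-image tasks. The core intuition behind our method is simple yet effective: preserve distinctive tokens while merging similar ones. We achieve this by sequentially examining visual tokens and deciding whether to merge them with others or keep them as separate entities. We validate our approach on the established LLaVA-Interleave Bench, which covers challenging multi-image tasks. In addition, we push the method to the extreme by testing it on a newly created benchmark, ComPairs, which focuses on multi-image comparisons where a large number of images and visual tokens are fed to the LMM. Our extensive analysis, covering several LMM architectures, demonstrates the benefits of our approach in both efficiency and performance gain.||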
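|**2025-03-06**|[Synthetic Data is an Elegant GIFT for Continual Vision-Language Models](http://arxiv.org/abs/2503.04229)|null|Pre-trained vision-language models (VLMs) require continual learning (CL) to efficiently update their knowledge and adapt to various downstream tasks without retraining from scratch. However, for VLMs, in addition to the loss of knowledge previously learned from downstream tasks, pre-training knowledge is also corrupted during continual fine-tuning. This issue is exacerbated by the unavailability of the original pre-training data, which degrades the generalization ability of the VLM. In this paper, we propose GIFT, a novel continual fine-tuning approach that uses synthetic data to overcome catastrophic forgetting in VLMs. Taking advantage of recent advances in text-to-image synthesis, we employ a pre-trained diffusion model to recreate both pre-training data and learned downstream task data. In this way, the VLM can revisit previous knowledge through distillation on matching diffusion-generated images and their corresponding text prompts. Leveraging the broad distribution and high alignment of synthetic image-text pairs in the VLM's feature space, we propose a contrastive distillation loss along with an image-text alignment constraint. To further combat in-distribution overfitting and enhance distillation performance with a limited amount of generated data, we incorporate adaptive weight consolidation, which uses Fisher information from these synthetic image-text pairs to achieve a better stability-plasticity balance. Extensive experiments demonstrate that our method consistently outperforms previous state-of-the-art approaches across various settings.||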
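|**2025-03-06**|[EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models](http://arxiv.org/abs/2503.04058)|null|The advent of large vision-language models (LVLMs) has advanced video-related tasks such as video captioning and video understanding. Some previous research indicates that taking the text in videos as input can further improve video understanding performance. As an indispensable type of information in short videos and movies, subtitles can help LVLMs better understand videos. Most existing methods for video subtitle extraction are based on a multi-stage framework that handles each frame independently, so they can hardly exploit the temporal information of videos. Although some LVLMs exhibit robust OCR capability, predicting accurate timestamps for subtitle text remains challenging. In this paper, we propose an end-to-end video subtitle extraction method, called EVE, which consists of three modules: a vision encoder, an adapter module, and a large language model. To effectively compress the visual tokens from the vision encoder, we propose a novel adapter, InterleavedVT, to interleave the two modalities. It contains a visual compressor and a textual region compressor. The proposed InterleavedVT combines the merits of average pooling and Q-Former in token compression. To take the temporal information of videos into account, we introduce a sliding-window mechanism in the textual region compressor. To benchmark the video subtitle extraction task, we propose a large dataset, ViSa, containing 2.5M videos. Extensive experiments on ViSa demonstrate that the proposed EVE can outperform existing open-source tools and LVLMs.||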
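|**2025-03-05**|[Vision-Language Models Struggle to Align Entities across Modalities](http://arxiv.org/abs/2503.03854)|null|Cross-modal entity linking refers to the ability to align entities and their attributes across different modalities. Although cross-modal entity linking is a fundamental skill needed for real-world applications such as multimodal code generation, fake news detection, and scene understanding, it has not been thoroughly studied in the literature. In this paper, we introduce a new task and benchmark to address this gap. Our benchmark, MATE, consists of 5.5k evaluation instances featuring visual scenes aligned with their textual representations. To evaluate cross-modal entity linking performance, we design a question-answering task that involves retrieving one attribute of an object in one modality based on a unique attribute of that object in another modality. We evaluate state-of-the-art vision-language models (VLMs) and humans on this task and find that VLMs struggle significantly compared to humans, particularly as the number of objects in the scene increases. Our analysis also shows that, while chain-of-thought prompting can improve VLM performance, models remain far from human-level proficiency. These findings highlight the need for further research in cross-modal entity linking and show that MATE is a strong benchmark to support that progress.||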
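|**2025-03-05**|[Decoupling the components of geometric understanding in Vision Language Models](http://arxiv.org/abs/2503.03840)|null|Understanding geometry relies heavily on vision. In this work, we evaluate whether state-of-the-art vision-language models (VLMs) can understand simple geometric concepts. We use a paradigm from cognitive science that isolates visual understanding of simple geometry from the many other capabilities it is often conflated with, such as reasoning and world knowledge. We compare model performance with that of human adults from the USA, as well as with prior research on human adults without formal education from an Amazonian indigenous group. We find that VLMs consistently underperform both groups of human adults, although they succeed with some concepts more than others. We also find that VLM geometric understanding is more brittle than human understanding and is not robust when tasks require mental rotation. This work highlights interesting differences in the origins of geometric understanding in humans and machines, for example from printed materials used in formal education versus interactions with the physical world or a combination of the two, and takes a small step toward understanding these differences.||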
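|**2025-03-05**|[See What You Are Told: Visual Attention Sink in Large Multimodal Models](http://arxiv.org/abs/2503.03321)|null|Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder. Ideally, these models should focus on the key visual information relevant to the text token. However, recent findings indicate that LMMs have a strong tendency to consistently allocate high attention weights to specific visual tokens, even when these tokens are irrelevant to the corresponding text. In this study, we investigate the property behind the appearance of these irrelevant visual tokens and examine their characteristics. Our findings show that this behavior arises from massive activation of certain hidden state dimensions, which resembles the attention sink found in language models. Hence, we refer to this phenomenon as the visual attention sink. In particular, our analysis reveals that removing the irrelevant visual sink tokens does not affect model performance, despite the high attention weights they receive. Consequently, we recycle the attention to these tokens as surplus resources, redistributing the attention budget to enhance focus on the image. To achieve this, we introduce Visual Attention Redistribution (VAR), a method that redistributes attention in image-centric heads, which we identify as heads that are innately focused on visual information. VAR can be seamlessly applied across different LMMs to improve performance on a wide range of tasks, including general vision-language tasks, visual hallucination tasks, and vision-centric tasks, all without additional training, models, or inference steps. Experimental results demonstrate that VAR enables LMMs to process visual information more effectively by adjusting their internal attention mechanisms, offering a new direction for enhancing the multimodal capabilities of LMMs.||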
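|**2025-03-04**|[Multimodal AI predicts clinical outcomes of drug combinations from preclinical data](http://arxiv.org/abs/2503.02781)|**[link](https://github.com/mims-harvard/Madrigal)**|Predicting clinical outcomes from preclinical data is essential for identifying safe and effective drug combinations. Current models rely on structural or target-based features to identify high-efficacy, low-toxicity drug combinations. However, these approaches fail to incorporate the multimodal data necessary for accurate, clinically relevant predictions. Here, we introduce MADRIGAL, a multimodal AI model that learns from structural, pathway, cell viability, and transcriptomic data to predict drug combination effects across 953 clinical outcomes and 21,842 compounds, including combinations of approved drugs and novel compounds in development. MADRIGAL uses a transformer bottleneck module to unify preclinical drug data modalities while handling missing data during training and inference, a major challenge in multimodal learning. It outperforms single-modality methods and state-of-the-art models in predicting adverse drug interactions. MADRIGAL performs virtual screening of anticancer drug combinations and supports polypharmacy management for type II diabetes and metabolic dysfunction-associated steatohepatitis (MASH). It identifies transporter-mediated drug interactions. MADRIGAL predicts that resmetirom, the first and only FDA-approved drug for MASH, is among the therapies with the most favorable safety profile. It supports personalized cancer therapy by integrating genomic profiles from cancer patients. Using primary acute myeloid leukemia samples and patient-derived xenograft models, it predicts the efficacy of personalized drug combinations. Integrating MADRIGAL with a large language model allows users to describe clinical outcomes in natural language and improves safety assessment by identifying potential adverse interactions and toxicity risks. MADRIGAL provides a multimodal approach for designing combination therapies with improved predictive accuracy and clinical relevance.||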
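|**2025-03-04**|[Vision-Language Model IP Protection via Prompt-based Learning](http://arxiv.org/abs/2503.02393)|null|Vision-language models (VLMs) such as CLIP (Contrastive Language-Image Pre-Training) have seen remarkable success in visual recognition, highlighting the increasing need to safeguard the intellectual property (IP) of well-trained models. Effective IP protection extends beyond ensuring authorized usage; it also requires restricting model deployment to authorized data domains, particularly when the model is fine-tuned for specific target domains. However, current IP protection methods often rely solely on the visual backbone, which may lack sufficient semantic richness. To bridge this gap, we introduce IP-CLIP, a lightweight IP protection strategy tailored to CLIP that employs a prompt-based learning approach. By leveraging the frozen visual backbone of CLIP, we extract both image style and content information and incorporate them into the learning of the IP prompt. This strategy acts as a robust barrier, effectively preventing the unauthorized transfer of features from authorized domains to unauthorized ones. Additionally, we propose a style-enhancement branch that constructs feature banks for both authorized and unauthorized domains. This branch integrates self-enhanced and cross-domain features, further strengthening IP-CLIP's ability to block features from unauthorized domains. Finally, we present three new metrics designed to better balance the performance degradation of authorized and unauthorized domains. Comprehensive experiments in various scenarios demonstrate its promising potential for IP protection tasks on VLMs.||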
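|**2025-03-04**|[BiasICL: In-Context Learning and Demographic Biases of Vision Language Models](http://arxiv.org/abs/2503.02334)|null|Vision-language models (VLMs) show promise in medical diagnosis, but their performance across demographic subgroups when using in-context learning (ICL) remains poorly understood. We examine how the demographic composition of demonstration examples affects VLM performance in two medical imaging tasks: skin lesion malignancy prediction and pneumothorax detection from chest radiographs. Our analysis reveals that ICL influences model predictions through multiple mechanisms: (1) ICL allows VLMs to learn subgroup-specific disease base rates from prompts, and (2) ICL leads VLMs to make predictions that perform differently across demographic groups, even after controlling for subgroup-specific disease base rates. Our empirical results inform best practices for prompting current VLMs (specifically, examining performance by demographic subgroup and matching the base rates of labels to the target distribution both at the population level and within subgroups), while also suggesting next steps for improving our theoretical understanding of these models.||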
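|**2025-03-04**|[Words or Vision: Do Vision-Language Models Have Blind Faith in Text?](http://arxiv.org/abs/2503.02199)|**[link](https://github.com/d-ailin/blind-faith-in-text)**|Vision-language models (VLMs) excel at integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate the modality preferences of VLMs when they are faced with visual data and varied textual inputs in vision-centric settings. By introducing textual variations to four vision-centric tasks and evaluating ten VLMs, we discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise, leading to significant performance drops under corrupted text and raising safety concerns. We analyze factors influencing this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. While certain factors, such as scaling up the language model size, slightly mitigate text bias, others, such as token order, can exacerbate it because of positional biases inherited from language models. To address this issue, we explore supervised fine-tuning with text augmentation and demonstrate its effectiveness in reducing text bias. Additionally, we provide a theoretical analysis suggesting that the blind faith in text phenomenon may stem from an imbalance of pure text and multimodal data during training. Our findings highlight the need for balanced training and careful consideration of modality interactions in VLMs to enhance their robustness and reliability in handling multimodal data inconsistencies.||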
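|**2025-03-04**|[DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models](http://arxiv.org/abs/2503.02175)|**[link](https://github.com/vbdi/divprune)**|Large multimodal models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated large language model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length for the LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, have been proposed. Existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics, which results in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as a Max-Min Diversity Problem (MMDP), where the goal is to select a subset such that the diversity among the selected tokens is maximized. We then solve the MMDP to obtain the selected subset and prune the remaining tokens. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage of the tested models. The code is available [here](https://github.com/vbdi/divprune).||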
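|**2025-03-03**|[Abn-BLIP: Abnormality-aligned Bootstrapping Language-Image Pre-training for Pulmonary Embolism Diagnosis and Report Generation from CTPA](http://arxiv.org/abs/2503.02034)|null|Medical imaging plays a pivotal role in modern healthcare, with computed tomography pulmonary angiography (CTPA) being a critical tool for diagnosing pulmonary embolism and other thoracic conditions. However, interpreting CTPA scans and generating accurate radiology reports remains a significant challenge. This paper introduces Abn-BLIP (Abnormality-aligned Bootstrapping Language-Image Pretraining), an advanced diagnosis model designed to align abnormal findings so that radiology reports are accurate and comprehensive. By leveraging learnable queries and cross-modal attention mechanisms, our model demonstrates superior performance over existing methods in detecting abnormalities, reducing missed findings, and generating structured reports. Our experiments show that Abn-BLIP outperforms state-of-the-art medical vision-language models and 3D report generation methods in both accuracy and clinical relevance. These results highlight the potential of integrating multimodal learning strategies to improve radiology reporting. The source code is available at https://github.com/zzs95/abn-blip .||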
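|**2025-03-03**|[OFF-CLIP: Improving Normal Detection Confidence in Radiology CLIP with Simple Off-Diagonal Term Auto-Adjustment](http://arxiv.org/abs/2503.01794)|null|Contrastive Language-Image Pre-Training (CLIP) has enabled zero-shot classification in radiology, reducing reliance on manual annotations. However, conventional contrastive learning struggles with normal case detection because of its strict intra-sample alignment, which disrupts the clustering of normal samples and leads to high false positive (FP) and false negative (FN) rates. To address these issues, we propose OFF-CLIP, a contrastive learning refinement that improves normal detection by introducing an off-diagonal term loss to enhance normal sample clustering and by applying sentence-level text filtering, which removes misaligned normal statements from abnormal reports, to reduce FNs. OFF-CLIP can be applied to radiology CLIP models without any architectural modifications. Experimental results show that OFF-CLIP significantly improves normal classification, achieving a 0.61 increase in area under the curve (AUC) on VinDr-CXR over CARZero, the state-of-the-art zero-shot classification baseline, while maintaining or improving abnormal classification performance. Additionally, OFF-CLIP enhances zero-shot grounding by improving pointing game accuracy, confirming better anomaly localization. These results demonstrate that OFF-CLIP is a robust and efficient enhancement for medical vision-language models.||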
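|**2025-03-03**|[Visual-RFT: Visual Reinforcement Fine-Tuning](http://arxiv.org/abs/2503.01785)|**[link](https://github.com/liuziyu77/visual-rft)**|Reinforcement Fine-Tuning (RFT) in large reasoning models such as OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce. Recent open-source work such as DeepSeek-R1 demonstrates that reinforcement learning with verifiable rewards is one key direction in reproducing o1. While R1-style models have demonstrated success in language models, their application in multimodal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT to visual tasks. Specifically, Visual-RFT first uses large vision-language models (LVLMs) to generate multiple responses, each containing reasoning tokens and a final answer, for each input, and then uses our proposed visual-perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, and open-vocabulary object detection benchmarks show the competitive performance and stronger generalization ability of Visual-RFT compared with supervised fine-tuning (SFT). For example, Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by 21.9 on COCO's two-shot setting and by 15.4 on LVIS. Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.||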
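|**2025-03-03**|[Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs](http://arxiv.org/abs/2503.01743)|null|We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data. It significantly outperforms recent open-source models of similar size and matches the performance of models twice its size on math and coding tasks that require complex reasoning. This achievement is driven by a carefully curated synthetic data recipe that emphasizes high-quality math and coding datasets. Compared with its predecessor, Phi-3.5-Mini, Phi-4-Mini has an expanded vocabulary of 200K tokens to better support multilingual applications, and it uses group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes that combine various modalities without interference. For example, it currently ranks first on the OpenASR leaderboard, even though the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment with further training Phi-4-Mini to enhance its reasoning capabilities. Despite its compact size of 3.8 billion parameters, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.||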
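|**2025-03-03**|[DeepSuM: Deep Sufficient Modality Learning Framework](http://arxiv.org/abs/2503.01728)|null|Multimodal learning has become a pivotal approach to developing robust learning models, with applications spanning multimedia, robotics, large language models, and healthcare. Because different modalities come with varying costs and resource demands, the efficiency of multimodal systems is a critical concern. This underscores the need for effective modality selection to balance performance gains against resource expenditure. In this study, we propose a novel framework for modality selection that independently learns the representation of each modality. This approach allows the significance of each modality to be assessed within its own representation space, enabling the development of tailored encoders and facilitating the joint analysis of modalities with distinct characteristics. Our framework aims to enhance the efficiency and effectiveness of multimodal learning by optimizing modality integration and selection.||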
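|**2025-02-28**|[PET Image Denoising via Text-Guided Diffusion: Integrating Anatomical Priors through Text Prompts](http://arxiv.org/abs/2502.21260)|null|Low-dose positron emission tomography (PET) imaging presents significant challenges because of increased noise and reduced image quality, which can compromise its diagnostic accuracy and clinical utility. Denoising diffusion probabilistic models (DDPMs) have demonstrated promising performance for PET image denoising. However, existing DDPM-based methods typically overlook valuable metadata such as patient demographics, anatomical information, and scanning parameters, which should further enhance denoising performance if considered. Recent advances in vision-language models (VLMs), particularly the pre-trained Contrastive Language-Image Pre-training (CLIP) model, have highlighted the potential of incorporating text-based information into visual tasks to improve downstream performance. In this preliminary study, we propose a novel text-guided DDPM for PET image denoising that integrates anatomical priors through text prompts. Anatomical text descriptions are encoded using a pre-trained CLIP text encoder to extract semantic guidance, which is then incorporated into the diffusion process via a cross-attention mechanism. Evaluations on paired 1/20 low-dose and normal-dose 18F-FDG PET datasets demonstrate that the proposed method achieves better quantitative performance than conventional UNet and standard DDPM methods at both the whole-body and organ levels. These results underscore the potential of leveraging VLMs to integrate rich metadata into the diffusion framework to enhance the image quality of low-dose PET scans.||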
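|**2025-02-28**|[FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts](http://arxiv.org/abs/2502.21059)|null|Large vision-language models (LVLMs) have become powerful and are widely adopted in some practical applications. However, recent research has revealed their vulnerability to multimodal jailbreak attacks, in which the model can be induced to generate harmful content, leading to safety risks. Although most LVLMs have undergone safety alignment, recent research shows that the visual modality is still vulnerable to jailbreak attacks. In our work, we discover that flowcharts containing partially harmful information can induce LVLMs to provide additional harmful details. Based on this, we propose a jailbreak attack method based on auto-generated flowcharts, FC-Attack. Specifically, FC-Attack first fine-tunes a pre-trained LLM to create a step-description generator based on benign datasets. The generator is then used to produce step descriptions corresponding to a harmful query, which are transformed into flowcharts in three different shapes (vertical, horizontal, and S-shaped) that serve as visual prompts. These flowcharts are then combined with a benign textual prompt to execute a jailbreak attack on LVLMs. Our evaluations using the Advbench dataset show that FC-Attack achieves attack success rates of over 90% on the Gemini-1.5, Llava-Next, Qwen2-VL, and InternVL-2.5 models, outperforming existing LVLM jailbreak methods. Additionally, we investigate factors affecting attack performance, including the number of steps and the font style in the flowcharts. Our evaluation shows that FC-Attack can improve jailbreak performance on Claude-3.5 from 4% to 28% by changing the font style. To mitigate the attack, we explore several defenses and find that AdaShield can largely reduce jailbreak performance, but at the cost of reduced utility.||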
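|**2025-02-28**|[DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping](http://arxiv.org/abs/2502.20900)|null|Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on specific assumptions, such as single-object settings or limited environments, leading to constrained generalization. Our solution is DexGraspVLA, a hierarchical framework that uses a pre-trained vision-language model as the high-level task planner and learns a diffusion-based policy as the low-level action controller. The key insight lies in iteratively transforming diverse language and visual inputs into domain-invariant representations, where imitation learning can be applied effectively because domain shift is alleviated. As a result, the framework generalizes robustly across a wide range of real-world scenarios. Notably, our method achieves a success rate above 90% across thousands of unseen object, lighting, and background combinations in a "zero-shot" environment. Empirical analysis further confirms the consistency of internal model behavior across environmental variations, thereby validating our design and explaining its generalization performance. We hope our work can be a step forward in achieving general dexterous grasping. Our demo and code can be found at https://dexgraspvla.github.io/ .||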
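|**2025-02-28**|[VLEER: Vision and Language Embeddings for Explainable Whole Slide Image Representation](http://arxiv.org/abs/2502.20850)|null|Recent advances in vision-language models (VLMs) have shown remarkable potential in bridging visual and textual modalities. In computational pathology, domain-specific VLMs pre-trained on extensive histopathology image-text datasets have succeeded in various downstream tasks. However, existing research has primarily focused on the pre-training process and direct patch-level applications of VLMs, leaving their great potential for whole slide image (WSI) applications unexplored. In this study, we hypothesize that pre-trained VLMs inherently capture informative and interpretable WSI representations through quantitative feature extraction. To validate this hypothesis, we introduce Vision and Language Embeddings for Explainable WSI Representation (VLEER), a novel method designed to leverage VLMs for WSI representation. We systematically evaluate VLEER on three pathological WSI datasets and show that it outperforms conventional vision features in WSI analysis. More importantly, VLEER offers the unique advantage of interpretability, enabling direct human-readable insights into the results by leveraging the textual modality for detailed pathology annotations, and providing clear reasoning for WSI-level pathology downstream tasks.||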
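|**2025-02-28**|[Multimodal Learning for Just-In-Time Software Defect Prediction in Autonomous Driving Systems](http://arxiv.org/abs/2502.20806)|null|In recent years, the rise of autonomous driving technologies has highlighted the critical importance of reliable software for ensuring safety and performance. This paper proposes a novel approach to just-in-time software defect prediction (JIT-SDP) in autonomous driving software systems using multimodal learning. The proposed model leverages a multimodal transformer in which pre-trained transformers and a combining module process the multiple data modalities of the software system datasets, such as code features, change metrics, and contextual information. The key to adopting multimodal learning is the use of attention mechanisms between the different data modalities, such as text, numerical, and categorical data. In the combining module, the output of a transformer model on text data and the tabular features containing categorical and numerical data are combined to produce predictions using fully connected layers. Experiments conducted on three open-source autonomous driving system software projects collected from GitHub repositories (Apollo, Carla, and Donkeycar) show that the proposed approach significantly outperforms state-of-the-art deep learning and machine learning models on the evaluation metrics. Our findings highlight the potential of multimodal learning to enhance the reliability and safety of autonomous driving software through improved defect prediction.||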
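|**2025-02-28**|[MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models](http://arxiv.org/abs/2502.20780)|**[link](https://github.com/russellyq/medhalltune)**|The increasing use of vision-language models (VLMs) in healthcare applications presents the great challenge of hallucinations, in which models may generate seemingly plausible results that are in fact incorrect. Such hallucinations can jeopardize clinical decision making, potentially harming diagnosis and treatment. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level. The experimental results show that fine-tuning with MedHallTune successfully improves the ability of several existing models to manage hallucinations and boosts their zero-shot performance on downstream visual question answering (VQA) tasks, making them more reliable for practical medical applications. Our work contributes to the development of more trustworthy VLMs. Code and dataset will be available at [MedHallTune](https://github.com/russellyq/MedHallTune).||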
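|**2025-02-28**|[Towards General Visual-Linguistic Face Forgery Detection(V2)](http://arxiv.org/abs/2502.20698)|**[link](https://github.com/skjack/vlffd)**|Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust. Recent work demonstrates that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection. However, existing annotation approaches, whether through human labeling or direct generation by multimodal large language models (MLLMs), often suffer from hallucination, leading to inaccurate text descriptions, especially for high-quality forgeries. To address this, we propose the Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by using forgery masks for initial region and type identification, followed by a comprehensive prompting strategy that guides MLLMs to reduce hallucination. We validate our approach by fine-tuning CLIP with a three-branch training framework that combines unimodal and multimodal objectives, and by fine-tuning MLLMs with our structured annotations. Experimental results show that our method not only achieves more accurate annotations with higher region identification accuracy, but also improves model performance across various forgery detection benchmarks. Our code is available at https://github.com/skJack/VLFFD.git .||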
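|**2025-02-28**|[T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting](http://arxiv.org/abs/2502.20625)|null|Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models such as CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages the rich prior knowledge and fine-grained visual understanding of pre-trained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on the majority objects in images, which can mask a model's text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/cha15yq/T2ICount .||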
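|**2025-02-27**|[Cache-of-Thought: Master-Apprentice Framework for Cost-Effective Vision Language Model Inference](http://arxiv.org/abs/2502.20587)|null|Vision-language models (VLMs) have achieved remarkable success in a wide range of vision applications of increasing complexity and scale, yet choosing the right VLM model size involves a trade-off between response quality and cost. While smaller VLMs are cheaper to run, they typically produce responses only marginally better than random guessing on benchmarks such as MMMU. In this paper, we propose Cache of Thought (CoT), a master-apprentice framework for collaborative inference between large and small VLMs. CoT manages high-quality query results from large VLMs (the master) in a cache, which are then selected via novel multimodal retrieval and in-context learning to aid the performance of small VLMs (the apprentice). We extensively evaluate CoT on various widely recognized and challenging general VQA benchmarks and show that CoT increases overall VQA performance by up to 7.7% under the same budget, and specifically boosts the performance of apprentice VLMs by up to 36.6%.||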
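|**2025-02-27**|[Interpreting CLIP with Hierarchical Sparse Autoencoders](http://arxiv.org/abs/2502.20578)|null|Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods are limited in their ability to optimize reconstruction quality and sparsity simultaneously, because they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining about 80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks such as CelebA.||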
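|**2025-02-27**|[R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts](http://arxiv.org/abs/2502.20395)|**[link](https://github.com/tianyi-lab/R2-T2)**|In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the powerful reasoning capabilities of large language models (LLMs), which holds back LMM performance on challenging downstream tasks. This weakness has recently been mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides the rich, multi-granularity, and diverse representations required by different downstream tasks. The performance of a multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method, Re-Routing in Test-Time (R2-T2), that locally optimizes the vector of routing weights at test time by moving it toward the routing weight vectors of correctly predicted samples in the neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and significantly improves the performance of state-of-the-art LMMs on challenging benchmarks across diverse tasks, without training any base-model parameters.||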
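|**2025-02-27**|[Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think](http://arxiv.org/abs/2502.20172)|**[link](https://github.com/chenllliang/dreamengine)**|The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, such as canny edge maps and depth maps, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To close the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, in which image and text can be well aligned to serve as a condition for external diffusion models. Based on this finding, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models such as SD3.5, we replace the original text-only encoders by incorporating versatile multimodal information encoders such as QwenVL. Our approach uses a two-stage training paradigm consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark and matching the performance of state-of-the-art text-to-image models such as SD3.5 and FLUX.||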
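|**2025-02-27**|[Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion](http://arxiv.org/abs/2502.20120)|null|Although multimodal learning (MML) has made significant progress, the existence of modality imbalance prevents multimodal learning from achieving its expected superiority over unimodal models in practice. To overcome this issue, mainstream multimodal learning methods place greater emphasis on balancing the learning process. However, these approaches do not explicitly enhance the classification ability of weaker modalities, leading to limited performance gains. By designing a sustained boosting algorithm, we propose a novel multimodal learning approach that dynamically balances the classification ability of weak and strong modalities. Concretely, we first propose a sustained boosting algorithm for multimodal learning that simultaneously optimizes the classification and residual errors using a designed configurable classifier module. We then propose an adaptive classifier assignment strategy to dynamically improve the classification performance of the weak modality. In this way, the classification ability of strong and weak modalities is expected to become balanced, thereby mitigating the imbalance issue. Empirical experiments on widely used datasets demonstrate the superiority of our method in comparison with various state-of-the-art (SoTA) multimodal learning baselines.||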
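|**2025-02-27**|[Vision-Encoders (Already) Know What They See: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore](http://arxiv.org/abs/2502.20034)|**[link](https://github.com/abzb1/f-clip)**|Recently, large vision-language models (LVLMs) have shown remarkable performance across various domains. However, these models suffer from object hallucination. This study revisits the previous claim that the primary cause of such hallucination lies in the limited representational capacity of the vision encoder. Our analysis reveals that the capacity of the vision encoder itself is already sufficient for detecting object hallucination. Based on this insight, we propose a Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun phrase level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore significantly outperforms conventional CLIPScore in accuracy, by a margin of up to 39.6%, without additional training. We further validate F-CLIPScore by showing that an LVLM trained on data filtered with F-CLIPScore exhibits reduced object hallucination.||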
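|**2025-02-27**|[ProAPO: Progressively Automatic Prompt Optimization for Visual Classification](http://arxiv.org/abs/2502.19844)|**[link](https://github.com/MorningStarOvO/ProAPO)**|Vision-language models (VLMs) have made significant progress in image classification by training on large-scale paired image-text data. Their performance depends largely on prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination because of hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human in the loop. We propose an evolution-based algorithm that progressively optimizes language prompts from task-specific templates to class-specific descriptions. Unlike template optimization, the search space for class-specific candidate prompts explodes, which increases prompt generation costs, the number of iterations, and the risk of overfitting. To this end, we first introduce several simple yet effective edit-based and evolution-based operations that generate diverse candidate prompts with a single query to the LLM. We then propose two sampling strategies to find a better initial search point and reduce the number of traversed categories, saving iteration costs. Moreover, we apply a novel fitness score with an entropy constraint to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. We also show that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.||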
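|**2025-02-27**|[Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study](http://arxiv.org/abs/2502.19828)|null|Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks, yet their efficacy in handling complex multi-object scenarios remains challenging. This study presents a comprehensive analysis of CLIP's performance limitations in multi-object contexts through controlled experiments. We introduce two custom datasets, SimCO and CompCO, to evaluate CLIP's image and text encoders in various multi-object configurations. Our findings reveal significant biases in both encoders: the image encoder favors larger objects, while the text encoder prioritizes objects mentioned first in descriptions. We hypothesize that these biases originate from CLIP's training process and provide evidence through analyses of the COCO dataset and CLIP's training progression. Additionally, we extend our investigation to Stable Diffusion models, revealing that biases in the CLIP text encoder significantly affect text-to-image generation tasks. Our experiments demonstrate how these biases affect CLIP's performance in image-caption matching and generation tasks, particularly when object sizes and their order in captions are manipulated. This work contributes valuable insights into CLIP's behavior in complex visual environments and highlights areas for improvement in future vision-language models.||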
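|**2025-02-27**|[Mixtera: A Data Plane for Foundation Model Training](http://arxiv.org/abs/2502.19790)|**[link](https://github.com/eth-easl/mixtera)**|State-of-the-art large language and vision models are trained over trillions of tokens aggregated from a large variety of sources. As training datasets grow, manually managing the samples becomes time-consuming, tedious, and error-prone. Yet recent research shows that the data mixture and the order in which samples are visited during training can significantly influence model accuracy. We build and present Mixtera, a data plane for foundation model training that enables users to declaratively express which data samples should be used, in which proportion, and in which order during training. Mixtera is a centralized, read-only layer that is deployed on top of existing training data collections and can be queried declaratively. It operates independently of the filesystem structure and supports mixtures across arbitrary properties (e.g., language, source dataset) as well as dynamic adjustment of the mixture based on model feedback. We evaluate Mixtera experimentally and show that our implementation does not bottleneck training and scales to 256 GH200 superchips. We demonstrate how Mixtera supports recent advances in mixing strategies by implementing the proposed Adaptive Data Optimization (ADO) algorithm in the system and evaluating its performance impact. We also explore the role of mixtures for vision-language models.||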
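|**2025-02-27**|[Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success](http://arxiv.org/abs/2502.19645)|null|Recent vision-language-action models (VLAs) build upon pre-trained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to fine-tune them most effectively is unclear given the many possible strategies. In this work, we study key VLA adaptation design choices, such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective, which together improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26x. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and to outperform other VLAs (π0 and RDT-1B) fine-tuned with their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT), by up to 15% (absolute) in average success rate. We release code for OFT and pre-trained model checkpoints at https://openvla-oft.github.io/ .||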
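|**2025-02-26**|[Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models](http://arxiv.org/abs/2502.19417)|null|Generalist robots that can perform a range of different tasks in open-world settings must be able not only to reason about the steps needed to accomplish their goals, but also to process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but also the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction-following methods that can fulfill simple commands ("pick up the cup"), our system can reason through complex prompts and incorporate situated feedback during task execution ("that's not trash"). We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping.||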
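|**2025-02-27**|[Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions](http://arxiv.org/abs/2502.19293)|null|Pathologists examine millions of melanocytic skin lesions every year, the majority of which are common nevi (i.e., ordinary moles). While most of these lesions can be diagnosed in seconds, writing the corresponding pathology report takes considerably more time. Partially automating report writing could therefore alleviate the growing workload of pathologists. In this work, we develop a vision-language model specifically for the pathology domain of cutaneous melanocytic lesions. The model follows the Contrastive Captioner framework and is trained and evaluated on a melanocytic lesion dataset of 42,512 H&E-stained whole slide images and 19,645 corresponding pathology reports. Our results show that, as assessed by an expert pathologist in a reader study, the quality scores of model-generated reports for common nevi are on par with those of pathologist-written reports. While report generation proved more challenging for rare melanocytic lesion subtypes, cross-modal retrieval performance for these cases was considerably better.||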
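|**2025-02-25**|[WebGames: Challenging General-Purpose Web-Browsing AI Agents](http://arxiv.org/abs/2502.18356)|**[link](https://github.com/convergence-ai/webgames)**|We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through more than 50 interactive challenges. These challenges are specifically crafted to be straightforward for humans while systematically testing the limitations of current AI systems across fundamental browser interactions, advanced input processing, cognitive tasks, workflow automation, and interactive entertainment. Our framework eliminates external dependencies through a hermetic testing environment, ensuring reproducible evaluation with verifiable ground-truth solutions. We evaluate leading vision-language models, including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL, against human performance. Results reveal a substantial capability gap: even the best AI system achieves only a 43.1% success rate compared with 95.7% for humans, highlighting fundamental limitations in current AI systems' ability to handle common web interaction patterns that humans find intuitive. The benchmark is publicly available at webgames.convergence.ai and offers a lightweight, client-side implementation that facilitates rapid evaluation cycles. Through its modular architecture and standardized challenge specifications, WebGames provides a robust foundation for measuring progress in the development of more capable web-browsing agents.||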
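|**2025-02-25**|[Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs](http://arxiv.org/abs/2502.18179)|**[link](https://github.com/gayecolakoglu/layie-llm)**|This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are 1) data structuring, 2) model engagement, and 3) output refinement. Our study examines the sub-problems within these core challenges, such as input representation, chunking, prompting, and the selection of LLMs and multimodal models. It examines the outcomes of different design choices through a new layout-aware IE test suite, benchmarking against the state-of-the-art (SoA) model LayoutLMv3. The results show that the configuration found by a one-factor-at-a-time (OFAT) trial achieves near-optimal results, with a 14.1-point F1 gain over the baseline model, whereas full factorial exploration yields only a slightly higher 15.1-point gain at roughly 36x greater token usage. We demonstrate that well-configured general-purpose LLMs can match the performance of specialized models, providing a cost-effective alternative. Our test suite is freely available at https://github.com/gayecolakoglu/LayIE-LLM.||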
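|**2025-02-25**|[VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion](http://arxiv.org/abs/2502.18042)|null|Human drivers adeptly navigate complex scenarios by drawing on rich attentional semantics, but current autonomous driving systems struggle to replicate this ability because they often lose critical semantic information when converting 2D observations into 3D space. This hinders their effective deployment in dynamic and complex environments. Leveraging the superior scene understanding and reasoning abilities of vision-language models (VLMs), we propose VLM-E2E, a novel framework that uses VLMs to enhance training by providing attentional cues. Our method integrates textual representations into bird's-eye-view (BEV) features for semantic supervision, which enables the model to learn richer feature representations that explicitly capture the driver's attentional semantics. By focusing on attentional semantics, VLM-E2E aligns better with human driving behavior, which is critical for navigating dynamic and complex environments. In addition, we introduce a BEV-Text learnable weighted fusion strategy to address the imbalance in modality importance when fusing multimodal information. This approach dynamically balances the contributions of BEV and text features, ensuring that complementary information from the visual and textual modalities is used effectively. By explicitly addressing the imbalance in multimodal fusion, our method yields a more holistic and robust representation of the driving environment. We evaluate VLM-E2E on the nuScenes dataset and demonstrate significant improvements over state-of-the-art approaches.||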
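|**2025-02-25**|[Can Multimodal LLMs Perform Time Series Anomaly Detection?](http://arxiv.org/abs/2502.17812)|**[link](https://github.com/mllm-ts/visualtimeanomaly)**|Large language models (LLMs) are increasingly used for time series analysis. However, the potential of multimodal LLMs (MLLMs), particularly vision-language models, for time series remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. Motivated by this, we raise a critical and practical research question: can MLLMs perform time series anomaly detection? To answer it, we propose the VisualTimeAnomaly benchmark to evaluate MLLMs on time series anomaly detection (TSAD). Our approach converts numerical time series data into images and feeds these images into various MLLMs, including proprietary models (GPT-4o and Gemini-1.5) and open-source models (LLaVA-NeXT and Qwen2-VL), each in a larger and a smaller version. In total, VisualTimeAnomaly contains 12.4k time series images spanning 3 scenarios and 3 anomaly granularities, with 9 anomaly types across 8 MLLMs. Starting with the univariate case (point-wise and range-wise anomalies), we extend our evaluation to more practical scenarios, including multivariate and irregular time series and variate-wise anomalies. Our study reveals several key insights: 1) MLLMs detect range-wise and variate-wise anomalies more effectively than point-wise anomalies. 2) MLLMs are highly robust to irregular time series, even with 25% of the data missing. 3) Open-source MLLMs perform comparably to proprietary models in TSAD. While open-source MLLMs excel on univariate time series, proprietary MLLMs are more effective on multivariate time series. To the best of our knowledge, this is the first work to comprehensively investigate MLLMs for TSAD, particularly for multivariate and irregular time series scenarios. We release our dataset and code at https://github.com/mllm-ts/VisualTimeAnomaly to support future research.||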
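|**2025-02-24**|[Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures](http://arxiv.org/abs/2502.17710)|**[link](https://github.com/Akhila-Yerukola/culturally-offensive-gestures)**|Gestures are an integral part of non-verbal communication, with meanings that vary across cultures, and misinterpretations can have serious social and diplomatic consequences. As AI systems become more integrated into global applications, ensuring that they do not inadvertently perpetuate cultural offenses is critical. To this end, we introduce the Multi-Cultural Set of Inappropriate Gestures and Nonverbal Signs (MC-SIGNS), a dataset of 288 gesture-country pairs covering 25 gestures and 85 countries, annotated for offensiveness, cultural significance, and contextual factors. Through systematic evaluation using MC-SIGNS, we uncover critical limitations: text-to-image (T2I) systems exhibit a strong US-centric bias, performing better at detecting offensive gestures in US contexts than in non-US ones; large language models (LLMs) tend to over-flag gestures as offensive; and vision-language models (VLMs) default to US-based interpretations when responding to universal concepts such as wishing someone luck, frequently suggesting culturally inappropriate gestures. These findings highlight the urgent need for culturally aware AI safety mechanisms to ensure the equitable global deployment of AI technologies.||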
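|**2025-02-24**|[METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling](http://arxiv.org/abs/2502.17651)|null|Chart generation aims to generate code that produces charts satisfying desired visual properties, e.g., text, layout, color, and type. It has great potential to empower automatic professional report generation in financial analysis, research presentation, education, and healthcare. In this work, we build a vision-language model (VLM) based multi-agent framework for effective automatic chart generation. Generating high-quality charts requires both strong visual design skills and precise coding capabilities that embed the desired visual properties into code. Such a complex multimodal reasoning process is difficult for direct prompting of VLMs. To address these challenges, we propose METAL, a multi-agent framework that decomposes the chart generation task into iterative collaboration among specialized agents. METAL improves accuracy on the chart generation task by 5.2% over the current best result. The METAL framework also exhibits test-time scaling: its performance increases monotonically as the logarithmic computational budget grows from 512 to 8192 tokens. In addition, we find that separating different modalities during METAL's critique process boosts the self-correction capability of VLMs in multimodal contexts.||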
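|**2025-02-24**|[End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models](http://arxiv.org/abs/2502.17589)|null|Automated chart summarization is crucial for enhancing data accessibility and enabling efficient information extraction from visual data. Although recent advances in vision-language models (VLMs) have shown promise, existing methods often fall short in aligning generated summaries with chart data and in reasoning about complex chart patterns. This paper introduces End-to-End Visual Chain-of-Thought (V-CoT) for chart summarization, a novel approach optimized for large vision-language models (LVLMs). Our method directly trains an LVLM to process chart images and generate textual summaries end to end, eliminating the need for an explicit chart parsing module. We incorporate a visual chain-of-thought mechanism through instruction fine-tuning, implicitly guiding the LVLM to perform visual reasoning steps during summary generation. Evaluated on the large-scale Chart-Sum-QA dataset, our V-CoT method significantly outperforms state-of-the-art baselines across a range of automatic metrics, including BLEU, BLEURT, CIDEr, and CS, and shows superior matching degree and reasoning correctness in human evaluation. Ablation studies and detailed analyses further validate the effectiveness and robustness of the proposed approach, establishing a new benchmark for end-to-end chart summarization.||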
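|**2025-02-25**|[DIS-CO: Discovering Copyrighted Content in VLMs Training Data](http://arxiv.org/abs/2502.17358)|**[link](https://github.com/avduarte333/dis-co)**|How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Building on the hypothesis that a VLM can recognize images from its training corpus, we propose DIS-CO, a novel approach for inferring whether copyrighted content was included during the model's development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content's identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model's training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. Our findings also highlight a broader concern: all tested models appear to have been exposed to some extent to copyrighted content. Our code and data are available at https://github.com/avduarte333/DIS-CO.||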
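|**2025-02-24**|[Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI](http://arxiv.org/abs/2502.17092)|null|We introduce Shakti VLM, a family of vision-language models with 1B and 4B parameters designed to address data efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning. Our results indicate that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.||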
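|**2025-02-24**|[Systematic Weight Evaluation for Pruning Large Language Models: Enhancing Performance and Sustainability](http://arxiv.org/abs/2502.17071)|null|The exponential growth of large language models (LLMs) such as ChatGPT has revolutionized artificial intelligence, offering unprecedented capabilities in natural language processing. However, the extensive computational resources required to train these models have significant environmental implications, including high carbon emissions, energy consumption, and water usage. This research presents a novel approach to LLM pruning that focuses on the systematic evaluation of individual weight importance throughout the training process. By monitoring how parameters evolve over time, we propose a method that effectively reduces model size without compromising performance. Extensive experiments with both a scaled-down LLM and a large multimodal model show that moderate pruning improves efficiency and reduces loss, whereas excessive pruning drastically degrades model performance. These findings highlight the critical need to optimize AI models for sustainable development, balancing technological advancement with environmental responsibility.||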
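|**2025-02-21**|[ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval](http://arxiv.org/abs/2502.15682)|null|The objective of this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that boosts the performance of large-scale pre-trained vision-language models so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query to predict a set of visual prompts that condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP/SigLIP and the state-of-the-art BLIP-2 architectures. To train the architecture with limited computing resources, we develop a "student friendly" best practice that involves global hard sample mining and the selection and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalization of the models to different domains. Thanks to the novel architecture and data curation, experiments show that our enhanced network significantly boosts CLIP/SigLIP performance and outperforms the state-of-the-art BLIP-2 model on text-to-image retrieval.||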
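|**2025-02-21**|[Testing the limits of fine-tuning to improve reasoning in vision language models](http://arxiv.org/abs/2502.15678)|null|Pre-trained vision-language models still fall short of human visual cognition. To improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, which allows us to systematically evaluate performance across cognitive domains in a consistent environment. We fine-tune models on ground-truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. It can also improve the alignment of model behavior with human behavior. However, we find that fine-tuning does not lead to robust, human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.||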
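|**2025-02-21**|[FaultGPT: Industrial Fault Diagnosis Question Answering System by Vision Language Models](http://arxiv.org/abs/2502.15481)|null|Recently, using single-modality large language models on mechanical vibration signals as fault predictors has brought new perspectives to intelligent fault diagnosis. However, the potential of these methods to exploit multimodal data remains under-explored, particularly in complex mechanical systems where relying on a single data source often fails to capture comprehensive fault information. In this paper, we present FaultGPT, a novel model that generates fault diagnosis reports directly from raw vibration signals. By leveraging large vision-language models (LVLMs) and text-based supervision, FaultGPT performs end-to-end fault diagnosis question answering (FDQA), distinguishing itself from traditional classification or regression approaches. Specifically, we construct a large-scale FDQA instruction dataset for instruction tuning of LVLMs. The dataset includes vibration time-frequency image-text label pairs and human instruction-ground truth pairs. To improve the ability to generate high-quality fault diagnosis reports, we design a multi-scale cross-modal image decoder that extracts fine-grained fault semantics, and we instruction-tune the LVLM without introducing additional training parameters. Extensive experiments, including fault diagnosis report generation and few-shot and zero-shot evaluation across multiple datasets, validate the superior performance and adaptability of FaultGPT in diverse industrial scenarios.||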
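|**2025-02-21**|[LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models](http://arxiv.org/abs/2502.15393)|null|Large multimodal models (LMMs) have shown exceptional performance on video understanding tasks and can even process videos longer than one hour. However, despite their ability to handle long inputs, generating outputs of corresponding richness remains a challenge. In this paper, we explore the issue of long outputs in LMMs using video captioning as a proxy task, and we find that open-source LMMs struggle to consistently generate outputs exceeding about 300 words. Through controlled experiments, we find that the scarcity of paired examples with long captions during training is the primary factor limiting the model's output length. However, manually annotating long-caption examples is time-consuming and expensive. To address this, we propose LongCaption-Agent, a framework that synthesizes long-caption data by aggregating multi-level descriptions. Using LongCaption-Agent, we curate a new long-caption dataset, LongCaption-10K. We also develop LongCaption-Bench, a benchmark designed to comprehensively evaluate the quality of long captions generated by LMMs. By incorporating LongCaption-10K into training, we enable LMMs to generate captions exceeding 1,000 words while maintaining high output quality. On LongCaption-Bench, our 8B-parameter model achieves state-of-the-art performance, even surpassing larger proprietary models. We will release the dataset and code after publication.||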
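|**2025-02-21**|[CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models](http://arxiv.org/abs/2502.15278)|null|Assessing whether AI-generated images are substantially similar to copyrighted works is a crucial step in resolving copyright disputes. In this paper, we propose CopyJudge, an automated copyright infringement identification framework that leverages large vision-language models (LVLMs) to simulate practical court processes for determining substantial similarity between copyrighted images and those generated by text-to-image diffusion models. Specifically, we employ an abstraction-filtration-comparison test framework with multi-LVLM debate to assess the likelihood of infringement and provide detailed judgment rationales. Based on these judgments, we further introduce a general LVLM-based mitigation strategy that automatically optimizes infringing prompts by avoiding sensitive expressions while preserving the non-infringing content. In addition, our approach can be enhanced by using reinforcement learning to explore non-infringing noise vectors within the diffusion latent space, even without modifying the original prompts. Experimental results show that our identification method achieves performance comparable to the best existing approaches while offering better generalization and interpretability across various forms of infringement, and that our mitigation method reduces memorization and IP infringement more effectively without losing non-infringing expressions.||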
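|**2025-02-21**|[PairBench: A Systematic Framework for Selecting Reliable Judge VLMs](http://arxiv.org/abs/2502.15210)|null|As large vision-language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare data pairs as instructed in the prompt becomes essential. To address this, we present PairBench, a low-cost framework that systematically evaluates VLMs as customizable similarity tools across various modalities and scenarios. Through PairBench, we introduce four metrics that represent key desiderata of similarity scores: alignment with human annotations, consistency for data pairs irrespective of their order, smoothness of the similarity distribution, and controllability through prompting. Our analysis shows that no model, whether closed- or open-source, is superior on all metrics; the optimal choice depends on the desired behavior of the auto-evaluator (e.g., a smooth vs. a sharp judge), which highlights the risks of adopting VLMs as evaluators widely without thorough assessment. For instance, the majority of VLMs struggle to maintain symmetric similarity scores regardless of order. Additionally, our results show that the performance of VLMs on the PairBench metrics correlates closely with popular benchmarks, showcasing its predictive power in ranking models.||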
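|**2025-02-21**|[CurricuVLM: Towards Safe Autonomous Driving via Personalized Safety-Critical Curriculum Learning with Vision-Language Models](http://arxiv.org/abs/2502.15119)|null|Ensuring the safety of autonomous driving systems remains a critical challenge, especially in handling rare but potentially catastrophic safety-critical scenarios. While existing research has explored generating safety-critical scenarios for autonomous vehicle (AV) testing, there is limited work on effectively incorporating these scenarios into policy learning to enhance safety. Furthermore, developing training curricula that adapt to an AV's evolving behavioral patterns and performance bottlenecks remains largely unexplored. To address these challenges, we propose CurricuVLM, a novel framework that leverages vision-language models (VLMs) to enable personalized curriculum learning for autonomous driving agents. Our approach uniquely exploits the multimodal understanding capabilities of VLMs to analyze agent behavior, identify performance weaknesses, and dynamically generate tailored training scenarios for curriculum adaptation. Through comprehensive analysis of unsafe driving situations with narrative descriptions, CurricuVLM performs in-depth reasoning to evaluate the AV's capabilities and identify critical behavioral patterns. The framework then synthesizes customized training scenarios targeting these identified limitations, enabling effective and personalized curriculum learning. Extensive experiments on the Waymo Open Motion Dataset show that CurricuVLM outperforms state-of-the-art baselines in both regular and safety-critical scenarios, achieving superior performance in navigation success, driving efficiency, and safety metrics. Further analysis shows that CurricuVLM serves as a general approach that can be integrated with various reinforcement learning algorithms to enhance autonomous driving systems. The code and demo video are available at https://zihaosheng.github.io/CurricuVLM/.||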
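|**2025-02-21**|[Social Genome: Grounded Social Reasoning Abilities of Multimodal Models](http://arxiv.org/abs/2502.15109)|null|Social reasoning abilities are crucial for AI systems to effectively interpret and respond to multimodal human communication and interaction within social contexts. We introduce Social Genome, the first benchmark for evaluating the fine-grained, grounded social reasoning abilities of multimodal models. Social Genome contains 272 videos of interactions and 1,486 human-annotated reasoning traces related to inferences about these interactions. These traces contain 5,777 reasoning steps that reference evidence from visual cues, verbal cues, vocal cues, and external knowledge (contextual knowledge external to the videos). Social Genome is also the first modeling challenge to study external knowledge in social reasoning. Social Genome computes metrics to holistically evaluate the semantic and structural qualities of model-generated social reasoning traces. We demonstrate the utility of Social Genome through experiments with state-of-the-art models, identifying performance gaps and opportunities for future research to improve the grounded social reasoning abilities of multimodal models.||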
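|**2025-02-20**|[InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback](http://arxiv.org/abs/2502.15027)|null|Existing benchmarks do not test large multimodal models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework that can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench, which evaluates the interactive intelligence of 10 different open-source LMMs using two representative datasets, MMMU-Pro and MathVerse. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing the interactive performance of leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results show that even state-of-the-art LMMs (such as OpenAI-o1) stay below 50% when correcting their results through human feedback. Our findings point to the need for methods that enhance LMMs' ability to interpret and benefit from feedback.||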
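|**2025-02-20**|[Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation](http://arxiv.org/abs/2502.14846)|null|Reasoning about text-rich images, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains because diverse text-rich vision-language data is scarce. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code serving as a textual representation of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we construct a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks show that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images and showcasing its potential for developing multimodal agents capable of acting in real-world environments.||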
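|**2025-02-20**|[LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models](http://arxiv.org/abs/2502.14834)|**[link](https://github.com/thu-keg/longwriter-v)**|Existing large vision-language models (LVLMs) can process inputs with context lengths of up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long-output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset of 22,158 examples, each with multiple input images, an instruction, and a corresponding output ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks for evaluating the long-generation capabilities of VLMs. Our 7B-parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models such as GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V||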
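|**2025-02-20**|[FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis](http://arxiv.org/abs/2502.14807)|**[link](https://github.com/biomedia-mbzuai/fetalclip)**|Foundation models are becoming increasingly effective in the medical domain: they are pre-trained on large datasets and can be readily adapted to downstream tasks. Despite this progress, fetal ultrasound images remain a challenging domain for foundation models because of their inherent complexity, often requiring substantial additional training and facing limitations due to the scarcity of paired multimodal data. To overcome these challenges, we introduce FetalCLIP, a vision-language foundation model capable of generating universal representations of fetal ultrasound images. FetalCLIP was pre-trained using a multimodal learning approach on a diverse dataset of 210,035 fetal ultrasound images paired with text. This is the largest paired dataset of its kind used for foundation model development to date. This unique training approach allows FetalCLIP to effectively learn the intricate anatomical features present in fetal ultrasound images, resulting in robust representations that can be used for a variety of downstream applications. In extensive benchmarking across a range of key fetal ultrasound applications, including classification, gestational age estimation, congenital heart defect (CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all baselines while demonstrating remarkable generalizability and strong performance even with limited labeled data. We plan to release the FetalCLIP model publicly for the benefit of the broader scientific community.||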
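|**2025-02-20**|[SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features](http://arxiv.org/abs/2502.14786)|**[link](https://github.com/google-research/big_vision)**|We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques, combined into a unified recipe: captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for vision-language models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants that support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To let users trade off inference cost against performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).||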
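|**2025-02-20**|[Harnessing PDF Data for Improving Japanese Large Multimodal Models](http://arxiv.org/abs/2502.14778)|null|Large multimodal models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, which restricts their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely untapped. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, optical character recognition (OCR), and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from the extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on Japanese LMM benchmarks. Our results show substantial improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data across various factors, such as model size and language model, reinforcing its value as a multimodal resource for Japanese LMMs. We plan to make the source code and data publicly available upon acceptance.||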
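|**2025-02-20**|[Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective](http://arxiv.org/abs/2502.14770)|null|In this paper, we address the challenge of determining layer-wise sparsity rates for large language models (LLMs) from a theoretical perspective. We identify a critical issue in existing LLM sparsification methods, which we call "reconstruction error explosion": throughout the sparsification process, the reconstruction errors of earlier layers propagate to and are amplified in subsequent layers, and the cumulative effect significantly increases the overall reconstruction error and severely degrades model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to determining a single common-difference hyperparameter. Remarkably, this allows the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results show that this sparsity allocation scheme is near optimal. Extensive experiments demonstrate that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method reduces the perplexity of the 70% sparse LLaMA2-7B model obtained via Wanda by 52.10, improves average zero-shot accuracy by 10.50%, and delivers speedups of 2.63x and 2.23x on CPU and GPU, respectively.||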
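|**2025-02-20**|[PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models](http://arxiv.org/abs/2502.14504)|null|Large vision-language models (LVLMs) have demonstrated remarkable capabilities across a range of multimodal tasks. However, their inference efficiency is constrained by the large number of visual tokens processed during decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token Pruning (PLPHP), a two-level fine-grained pruning method comprising layer-level retention rate allocation and head-level vision token pruning. Motivated by the phenomenon of vision token re-attention across decoder layers, we dynamically adjust token retention rates layer by layer. Layers that attend strongly to visual information preserve more vision tokens, while layers with lower vision attention are pruned more aggressively. Furthermore, PLPHP applies pruning at the attention-head level, enabling different heads within the same layer to independently retain critical context. Experiments on multiple benchmarks show that PLPHP delivers 18% faster decoding and reduces the key-value cache (KV Cache) size by over 50%, at the cost of only a 0.46% average performance drop, while also achieving notable performance improvements on multi-image tasks. These results highlight the effectiveness of fine-grained token pruning and contribute to the efficiency and scalability of LVLMs. Our source code will be made publicly available.||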
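|**2025-02-20**|[Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models](http://arxiv.org/abs/2502.14191)|**[link](https://github.com/facebookresearch/multimodal_rewardbench)**|Reward models play an essential role in training vision-language models (VLMs) by assessing output quality to enable alignment with human preferences. Despite their importance, the research community lacks comprehensive open benchmarks for evaluating multimodal reward models in VLMs. To address this gap, we introduce Multimodal RewardBench, an expert-annotated benchmark covering six domains: general correctness, preference, knowledge, reasoning, safety, and visual question answering. Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various VLMs. In evaluating a range of VLM judges, we find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy. Notably, most models struggle in the reasoning and safety domains. These findings suggest that Multimodal RewardBench offers a challenging testbed for advancing reward model development across multiple domains. We release the benchmark at https://github.com/facebookresearch/multimodal_rewardbench.||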
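|**2025-02-19**|[PitVQA++: Vector Matrix-Low-Rank Adaptation for Open-Ended Visual Question Answering in Pituitary Surgery](http://arxiv.org/abs/2502.14149)|**[link](https://github.com/hrl-mike/pitvqa-plus)**|Vision-language models (VLMs) in visual question answering (VQA) offer a unique opportunity to enhance intra-operative decision-making, promote intuitive interactions, and significantly advance surgical education. However, developing VLMs for surgical VQA is challenging because of limited datasets and the risk of overfitting and catastrophic forgetting during full fine-tuning of pretrained weights. While parameter-efficient techniques such as Low-Rank Adaptation (LoRA) and Matrix of Rank Adaptation (MoRA) address adaptation challenges, their uniform parameter distribution overlooks the feature hierarchy in deep networks, where earlier layers, which learn general features, require more parameters than later ones. This work introduces PitVQA++, which comprises an open-ended PitVQA dataset and Vector Matrix-Low-Rank Adaptation (Vector-MoLoRA), an innovative VLM fine-tuning approach for adapting GPT-2 to pituitary surgery. Open-Ended PitVQA comprises around 101,803 frames from 25 surgical videos with 745,972 question-answer sentence pairs, covering key surgical elements such as phase and step recognition, context understanding, tool detection, localization, and interaction recognition. Vector-MoLoRA combines the principles of LoRA and MoRA into a matrix-low-rank adaptation strategy that uses vector ranking to allocate more parameters to earlier layers and gradually fewer to later layers. Our approach, validated on the Open-Ended PitVQA and EndoVis18-VQA datasets, effectively mitigates catastrophic forgetting while significantly improving performance over recent baselines. Furthermore, our risk-coverage analysis highlights its enhanced reliability and trustworthiness in handling uncertain predictions. Our source code and dataset are available at https://github.com/HRL-Mike/PitVQA-Plus.||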
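|**2025-02-19**|[Modular Prompt Learning Improves Vision-Language Models](http://arxiv.org/abs/2502.14125)|**[link](https://github.com/Zhenhan-Huang/Modular-Prompt-Learning)**|Pre-trained vision-language models are able to interpret visual concepts and language semantics. Prompt learning, a method of constructing prompts for text encoders or image encoders, elicits the potential of pre-trained models and readily adapts them to new scenarios. Compared to fine-tuning, prompt learning enables the model to achieve comparable or better performance with fewer trainable parameters. Moreover, prompt learning freezes the pre-trained model and avoids the catastrophic forgetting issue of fine-tuning. Continuous prompts inserted into the input of every transformer layer (i.e., deep prompts) can improve the performance of pre-trained models on downstream tasks. For the i-th transformer layer, the inserted prompts replace the prompts previously inserted in the (i-1)-th layer. Although the self-attention mechanism contextualizes the newly inserted prompts for the current layer with the embeddings output by the previous layer, removing all inserted prompts from the previous layer inevitably loses the information contained in those continuous prompts. In this work, we propose Modular Prompt Learning (MPL), which is designed to promote the preservation of the information contained in the inserted prompts. We evaluate the proposed method on base-to-new generalization and cross-dataset tasks. Averaged over 11 datasets, our method achieves a 0.7% performance gain on the base-to-new generalization task compared to the state-of-the-art method. The largest improvement on an individual dataset is 10.7% (EuroSAT).||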
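|**2025-02-18**|[Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization](http://arxiv.org/abs/2502.13146)|**[link](https://github.com/taco-group/re-align)**|The emergence of large vision-language models (VLMs) has broadened the scope and capabilities of single-modal large language models (LLMs) by integrating the visual modality, unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent work has focused on applying direct preference optimization (DPO) to carefully curated datasets to mitigate these issues. Yet such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset, effectively incorporating both textual and visual preference signals. We further introduce rDPO, an extension of standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning. Our experimental results show that Re-Align not only mitigates hallucinations more effectively than previous methods but also yields significant performance gains on general visual question answering (VQA) tasks. Moreover, we show that Re-Align maintains robustness and scalability across a wide range of VLM sizes and architectures. This work represents a significant step forward in aligning multimodal LLMs, paving the way for more reliable and effective cross-modal applications. We release all code at https://github.com/taco-group/Re-Align.||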
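|**2025-02-18**|[Understanding and Rectifying Safety Perception Distortion in VLMs](http://arxiv.org/abs/2502.13095)|null|Recent studies reveal that vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality, exhibiting greater vulnerability than their text-only LLM backbones. To uncover the root cause of this phenomenon, we conduct an in-depth analysis and identify a key issue: multimodal inputs introduce a modality-induced activation shift toward a "safer" direction compared to their text-only counterparts, leading VLMs to systematically overestimate the safety of harmful inputs. We refer to this issue as safety perception distortion. To mitigate such distortion, we propose Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety. By isolating and removing the safety-relevant component, ShiftDC restores the inherent safety alignment of the LLM backbone while preserving the vision-language capabilities of the VLM. Experiments show that ShiftDC significantly enhances alignment performance on safety benchmarks without impairing model utility.||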
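|**2025-02-18**|[Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection](http://arxiv.org/abs/2502.13061)|null|Hateful memes have become a significant concern on the Internet, calling for robust automated detection systems. While large multimodal models have shown strong generalization across various tasks, they generalize poorly to hateful meme detection because of the dynamic nature of memes, which are tied to emerging social trends and breaking news. Recent work further highlights the limitations of conventional supervised fine-tuning of large multimodal models in this context. To address these challenges, we propose Large Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a novel two-stage fine-tuning framework designed to improve both in-domain accuracy and cross-domain generalization. Experimental results on six widely used meme classification datasets show that LMM-RGCL achieves state-of-the-art performance, outperforming agent-based systems such as VPD-PALI-X-55B. Furthermore, our method generalizes effectively to out-of-domain memes under low-resource settings, surpassing models such as GPT-4o.||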
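|**2025-02-18**|[MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching](http://arxiv.org/abs/2502.12852)|null|Existing multilingual vision-language (VL) benchmarks often cover only a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages, over 100 more than the most multilingual existing VL benchmarks cover. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle with cross-modal topic matching in lower-resource languages, performing no better than chance on languages such as N'Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by comparing cross-modal and text-only topical matching performance. We also observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.||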
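|**2025-02-18**|[CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base](http://arxiv.org/abs/2502.12591)|null|Large vision-language models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, but they remain susceptible to hallucination, particularly object hallucination, in which non-existent objects or incorrect attributes are fabricated in generated descriptions. Existing detection methods achieve strong performance but rely heavily on expensive API calls and iterative LVLM-based validation, making them impractical for large-scale or offline use. To address these limitations, we propose CutPaste&Find, a lightweight and training-free framework for detecting hallucinations in LVLM-generated outputs. Our approach leverages off-the-shelf visual and linguistic modules to perform multi-step verification efficiently without requiring LVLM inference. At the core of our framework is a Visual-aid Knowledge Base that encodes rich entity-attribute relationships and associated image representations. We introduce a scaling factor to refine similarity scores, mitigating the issue of suboptimal alignment values even for ground-truth image-text pairs. Comprehensive evaluations on benchmark datasets, including POPE and R-Bench, show that CutPaste&Find achieves hallucination detection performance competitive with existing methods while being significantly more efficient and cost-effective.||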
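|**2025-02-18**|[Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning](http://arxiv.org/abs/2502.12425)|**[link](https://github.com/MICLAB-BUPT/DCL)**|In this paper, we propose a new Robust Disentangled Counterfactual Learning (RDCL) approach for physical audiovisual commonsense reasoning. The task aims to infer objects' physical commonsense from both video and audio input, and the main challenge is how to imitate human reasoning ability, even when a modality is missing. Most current methods fail to take full advantage of the different characteristics of multimodal data, and the lack of causal reasoning ability in models impedes progress in inferring implicit physical knowledge. To address these issues, our proposed RDCL method decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space through a disentangled sequential encoder, which adopts a variational autoencoder (VAE) and maximizes mutual information with a contrastive loss function. Furthermore, we introduce a counterfactual learning module that augments the model's reasoning ability by modeling physical knowledge relationships among different objects under counterfactual intervention. To alleviate the problem of incomplete modality data, we introduce a robust multimodal learning method that recovers the missing data by decomposing shared features and model-specific features. Our proposed method is a plug-and-play module that can be incorporated into any baseline, including VLMs. Experiments show that the proposed method improves the reasoning accuracy and robustness of baseline methods and achieves state-of-the-art performance.||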
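|**2025-02-17**|[LanP: Rethinking the Impact of Language Priors in Large Vision-Language Models](http://arxiv.org/abs/2502.12359)|null|Large vision-language models (LVLMs) have shown impressive performance on a variety of tasks. However, LVLMs suffer from hallucination, which hinders their adoption in the real world. Existing studies emphasize that the strong language priors of LVLMs can overpower visual information and cause hallucinations. However, the positive role of language priors is key to a powerful LVLM. If the language priors are too weak, LVLMs will struggle to leverage their rich parametric knowledge and instruction understanding abilities to complete tasks in challenging visual scenarios where visual information alone is insufficient. We therefore propose a benchmark called LanP to rethink the impact of language priors in LVLMs. It is designed to investigate how strong the language priors are in current LVLMs. LanP consists of 170 images and 340 corresponding well-designed questions. Extensive experiments on 25 popular LVLMs reveal that the language priors of many LVLMs are not strong enough to effectively aid question answering when objects are partially hidden. Many models, including GPT-4 Turbo, exhibit an accuracy below 0.5 in such a scenario.||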
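|**2025-02-17**|[VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues](http://arxiv.org/abs/2502.12084)|null|Visually linking matching cues is a crucial ability in daily life, such as identifying the same person in multiple photos based on their cues, even without knowing who they are. Although vision-language models (VLMs) possess extensive knowledge, it remains largely unexplored whether they can perform this fundamental task. To address this, we introduce VLM$^2$-Bench, a benchmark designed to assess whether VLMs can visually link matching cues, with 9 subtasks and over 3,000 test cases. We comprehensively evaluate eight open-source VLMs and GPT-4o, and further analyze various language-side and vision-side prompting methods, arriving at eight key findings. We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap in which even GPT-4o lags 34.80% behind humans. Based on these insights, we advocate (i) enhancing core visual capabilities to improve adaptability and reduce reliance on prior knowledge, (ii) establishing clearer principles for integrating language-based reasoning in vision-centric tasks to prevent unnecessary biases, and (iii) shifting vision-text training paradigms toward fostering models' ability to independently structure and infer relationships among visual cues.||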
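|**2025-02-17**|[How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines](http://arxiv.org/abs/2502.12051)|null|Neural scaling laws have revolutionized the design and optimization of large-scale AI models by revealing predictable relationships between model size, dataset size, and computational resources. Early research established power-law relationships in model performance, leading to compute-optimal scaling strategies. However, recent studies have highlighted the limitations of these laws across different architectures, modalities, and deployment contexts. Sparse models, mixture-of-experts, retrieval-augmented learning, and multimodal models often deviate from traditional scaling patterns. Moreover, scaling behavior varies across domains such as vision, reinforcement learning, and fine-tuning, underscoring the need for more nuanced approaches. In this survey, we synthesize insights from more than 50 studies, examining the theoretical foundations, empirical findings, and practical implications of scaling laws. We also explore key challenges, including data efficiency, inference scaling, and architecture-specific constraints, and advocate adaptive scaling strategies tailored to real-world applications. We suggest that while scaling laws provide a useful guide, they do not always generalize across all architectures and training strategies.||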
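|**2025-02-17**|[From Open-Vocabulary to Vocabulary-Free Semantic Segmentation](http://arxiv.org/abs/2502.11891)|null|Open-vocabulary semantic segmentation enables models to recognize novel object categories beyond their training data. While this flexibility represents a significant advancement, current approaches still rely on manually specified class names as input, creating an inherent bottleneck in real-world applications. This work proposes a vocabulary-free semantic segmentation pipeline that eliminates the need for a predefined category vocabulary. Specifically, we address the chicken-and-egg problem in which users need to know all potential objects within a scene in order to identify them, while the purpose of segmentation is often to discover these objects. The proposed approach leverages vision-language models to automatically recognize objects and generate appropriate class names, aiming to solve the challenges of class specification and naming quality. Through extensive experiments on several public datasets, we highlight the crucial role of the text encoder in model performance, particularly when the image text classes are paired with generated descriptions. Although the text encoder's sensitivity to false negatives in the class tagging process adds complexity to the task, we demonstrate that our fully automated pipeline significantly improves vocabulary-free segmentation accuracy across diverse real-world scenarios.||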
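|**2025-02-14**|[Probing Perceptual Constancy in Large Vision Language Models](http://arxiv.org/abs/2502.10273)|null|Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for recognizing visual information in a dynamic world, making it essential for vision-language models (VLMs). However, whether VLMs are currently and theoretically capable of mastering this ability remains under-explored. In this study, we evaluated 33 VLMs using 253 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions, to evaluate the models' recognition of object properties under varying conditions. We found significant variability in VLM performance, with models' performance in shape constancy clearly dissociated from their performance in color and size constancy.||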
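|**2025-02-14**|[VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models](http://arxiv.org/abs/2502.10250)|null|Vision-language models (VLMs) excel on various visual benchmarks but are often constrained by the lack of high-quality visual fine-tuning data. To address this challenge, we introduce VisCon-100K, a novel dataset derived from interleaved image-text web documents. Our approach transforms 45K web documents from the OBELICS dataset into 100K image conversation samples. We use GPT-4V to generate image-contextual captions and the OpenChat 3.5 model to convert these captions into diverse free-form and multiple-choice question-answer pairs. Integrating this dataset for fine-tuning considerably enhances VLM performance across multiple benchmarks. Unlike methods that focus solely on fine-grained visual content, our approach leverages the accompanying web context, yielding superior results. We also find that a "leaky modality mix", in which conversation samples contain questions answerable from both the image and its contextual caption, outperforms non-leaky combinations of captions and Q&A pairs. The VisCon-100K dataset shows strong performance with two popular VLM approaches: a text-only large language model (LLM) aligned with a vision encoder using image caption data (ShareGPT4V-7b), and an LLM multimodally pretrained on interleaved image-text data (IDEFICS2-8b). In addition to releasing the VisCon-100K dataset, we provide a contextual captioner trained on this dataset, facilitating scalable fine-tuning data generation for future research and open-source applications. Using the same pipeline, but substituting our trained contextual captioner for GPT-4V, we also release the larger VisCon-1M dataset.||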
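|**2025-02-14**|[Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models](http://arxiv.org/abs/2502.10090)|null|Humans possess an extraordinary ability to understand and execute complex manipulation tasks by interpreting abstract instruction manuals. For robots, however, this capability remains a substantial challenge, as they cannot interpret abstract instructions and translate them into executable actions. In this paper, we present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. Our approach leverages a vision-language model (VLM) to extract structured information from instructional images and then uses this information to construct hierarchical assembly graphs. These graphs represent parts, subassemblies, and the relationships between them. To facilitate task execution, a pose estimation model predicts the relative 6D poses of components at each assembly step. At the same time, a motion planning module generates actionable sequences for real-world robotic implementation. We demonstrate the effectiveness of Manual2Skill by successfully assembling several real-world IKEA furniture items. This application highlights its ability to manage long-horizon manipulation tasks with both efficiency and precision, significantly enhancing the practicality of robot learning from instruction manuals. This work marks a step forward in advancing robotic systems capable of understanding and executing complex manipulation tasks in a manner akin to human capabilities.||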
|**2025-02-14**|[Diffusion Trajectory-guided Policy for Long-horizon Robot Manipulation](http://arxiv.org/abs/2502.10040)|null|Recently, Vision-Language-Action models (VLA) have advanced robot imitation learning, but high data collection costs and limited demonstrations hinder generalization, and current imitation learning methods struggle in out-of-distribution scenarios, especially for long-horizon tasks. A key challenge is how to mitigate compounding errors in imitation learning, which lead to cascading failures over extended trajectories. To address these challenges, we propose the Diffusion Trajectory-guided Policy (DTP) framework, which generates 2D trajectories through a diffusion model to guide policy learning for long-horizon tasks. By leveraging task-relevant trajectories, DTP provides trajectory-level guidance to reduce error accumulation. Our two-stage approach first trains a generative vision-language model to create diffusion-based trajectories, then refines the imitation policy using them. Experiments on the CALVIN benchmark show that DTP outperforms state-of-the-art baselines by 25% in success rate, starting from scratch without external pretraining. Moreover, DTP significantly improves real-world robot performance.||
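|**2025-02-14**|[HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation](http://arxiv.org/abs/2502.09838)|**[link](https://github.com/dcdmllm/healthgpt)**|We present HealthGPT, a powerful medical large vision-language model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To train HealthGPT effectively, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate the exceptional performance and scalability of HealthGPT on medical visual unified tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.||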
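|**2025-02-13**|[MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency](http://arxiv.org/abs/2502.09621)|null|Answering questions with chain-of-thought (CoT) has significantly enhanced the reasoning capabilities of large language models (LLMs), yet its impact on large multimodal models (LMMs) still lacks systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark for evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs and uncover several key insights: 1) models with a reflection mechanism demonstrate superior CoT quality, with Kimi k1.5 outperforming GPT-4o and showing the highest-quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting potentially harmful overthinking behavior; and 3) although their CoT quality is high, LMMs with reflection exhibit significant inefficiency in both the normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs. Project page: https://mmecot.github.io/||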
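|**2025-02-13**|[GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis](http://arxiv.org/abs/2502.09598)|**[link](https://github.com/Orion-AI-Lab/GAIA)**|The continuous operation of Earth-orbiting satellites generates vast and ever-growing archives of remote sensing (RS) images. Natural language offers an intuitive interface for accessing, querying, and interpreting the data in such archives. However, existing vision-language models (VLMs) are predominantly trained on web-scraped, noisy image-text data and have limited exposure to the specialized domain of RS. This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead emphasize only attributes such as date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 205,150 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset offers a spatially and temporally balanced distribution, spanning the globe and covering the last 25 years with a balanced temporal distribution of observations. GAIA's construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, show that GAIA significantly improves performance on RS image classification, cross-modal retrieval, and image captioning tasks.||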
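|**2025-02-13**|[When and How Does CLIP Enable Domain and Compositional Generalization?](http://arxiv.org/abs/2502.09507)|null|The remarkable generalization performance of contrastive vision-language models such as CLIP is often attributed to the diversity of their training distributions. However, key questions remain unanswered: can CLIP generalize to an entirely unseen domain when trained on a diverse mixture of domains (domain generalization)? Can it generalize to unseen classes within partially seen domains (compositional generalization)? What factors affect such generalization? To answer these questions, we trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure. Our experiments show that domain diversity is essential for both domain and compositional generalization, yet compositional generalization can be surprisingly weaker than domain generalization when the training distribution contains a suboptimal subset of the test domain. Through data-centric and mechanistic analyses, we find that successful generalization requires learning shared representations in intermediate layers and shared circuitry.||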
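|**2025-02-13**|[Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model](http://arxiv.org/abs/2502.09057)|**[link](https://github.com/ia-gu/vision-language-in-context-learning-driven-few-shot-visual-inspection-model)**|We propose a general visual inspection model that uses a vision-language model (VLM) together with a few example images of non-defective or defective products and explanatory texts that serve as inspection criteria. Although existing VLMs exhibit high performance across various tasks, they are not trained on specific tasks such as visual inspection. We therefore construct a dataset consisting of diverse images of non-defective and defective products collected from the web, along with output text in a unified format, and fine-tune the VLM on it. For new products, our method employs in-context learning, which allows the model to perform inspection given an example of a non-defective or defective image and the corresponding explanatory text with visual prompts. This approach eliminates the need to collect a large number of training samples and retrain the model for each product. Experimental results show that our method achieves high performance on the MVTec AD dataset in a one-shot manner, with an MCC of 0.804 and an F1-score of 0.950. Our code is available at https://github.com/ia-gu/Vision-Language-In-Context-Learning-Driven-Few-Shot-Visual-Inspection-Model.||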
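|**2025-02-12**|[ClipRover: Zero-shot Vision-Language Exploration and Target Discovery by Mobile Robots](http://arxiv.org/abs/2502.08791)|null|Vision-language navigation (VLN) has emerged as a promising paradigm that enables mobile robots to perform zero-shot inference and execute tasks without specific pre-programming. However, current systems often separate map exploration from path planning, with exploration relying on inefficient algorithms because of limited (partially observed) environmental information. In this paper, we present a novel navigation pipeline named "ClipRover" for simultaneous exploration and target discovery in unknown environments, leveraging the capabilities of a vision-language model named CLIP. Our approach requires only monocular vision and operates without any prior map or knowledge about the target. For comprehensive evaluation, we design a functional prototype of an unmanned ground vehicle (UGV) system named "Rover Master", a customized platform for general-purpose VLN tasks. We integrate and deploy the ClipRover pipeline on Rover Master to evaluate its throughput, obstacle avoidance capability, and trajectory performance across various real-world scenarios. Experimental results show that ClipRover consistently outperforms traditional map traversal algorithms and achieves performance comparable to path-planning methods that depend on prior map and target knowledge. Notably, ClipRover offers real-time active navigation without requiring pre-captured candidate images or pre-built node graphs, addressing key limitations of existing VLN pipelines.||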
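|**2025-02-13**|[PulseCheck457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models](http://arxiv.org/abs/2502.08636)|null|Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3D spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework for comprehensively evaluating 6D spatial reasoning across varying levels of complexity. To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed around 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluate various LMMs on PulseCheck457 and observe a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), which highlights key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.||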
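|**2025-02-12**|[ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification](http://arxiv.org/abs/2502.08391)|**[link](https://github.com/jiangbo-shi/vila-mil)**|Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs), which have gigapixel size and hierarchical image context, in digital pathology. However, these methods rely heavily on a substantial number of bag-level labels and learn solely from the original slides, which makes them vulnerable to shifts in data distribution. Recently, vision-language model (VLM)-based methods have introduced a language prior by pre-training on large-scale pathological image-text pairs. However, previous text prompts do not take pathological prior knowledge into account and therefore do not substantially boost model performance. Moreover, collecting such image-text pairs and running the pre-training process are very time-consuming and resource-intensive. To solve these problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on a frozen large language model (LLM) to boost VLM performance effectively. To transfer the VLM to WSI processing efficiently, for the image branch we propose a prototype-guided patch decoder that aggregates patch features progressively by grouping similar patches into the same prototype; for the text branch we introduce a context-guided text decoder that enhances text features by incorporating multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.||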
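|**2025-02-12**|[Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting](http://arxiv.org/abs/2502.08317)|null|Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), leading to incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) a bidirectional constraint, which ensures consistency in pairwise object relations, and (2) a transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely used spatial relation datasets and demonstrate performance improvements over existing approaches. Additionally, a systematic analysis of various bidirectional relation analysis choices and transitivity reference selections highlights the broader potential of our method for incorporating constraints to mitigate spatial relation hallucinations.||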
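|**2025-02-12**|[What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations](http://arxiv.org/abs/2502.08279)|**[link](https://github.com/dongqi-me/vista)**|Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between model and human performance, highlighting the challenges of scientific video summarization.||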
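|**2025-02-12**|[UniCoRN: Unified Commented Retrieval Network with LMMs](http://arxiv.org/abs/2502.08254)|null|Multimodal retrieval methods are limited in handling complex, compositional queries that require reasoning about the visual content of both the query and the retrieved entities. Large multimodal models (LMMs), on the other hand, can answer more complex visual questions in language but lack the inherent ability to retrieve relevant entities to support their answers. We aim to address these limitations with UniCoRN, a Unified Commented Retrieval Network that combines the strengths of composed multimodal retrieval methods and generative language approaches, going beyond retrieval-augmented generation (RAG). We introduce an entity adapter module that injects the retrieved multimodal entities back into the LMM so that it can attend to them while generating answers and comments. By keeping the base LMM frozen, UniCoRN preserves its original capabilities while performing both retrieval and text generation tasks within a single integrated framework. To assess these new abilities, we introduce the Commented Retrieval task (CoR) and a corresponding dataset, with the goal of retrieving an image that accurately answers a given question and generating an additional textual response that provides further clarification and details about the visual information. We demonstrate the effectiveness of UniCoRN on several datasets, showing improvements of +4.5% recall over the state of the art for composed multimodal retrieval and of +14.9% METEOR and +18.4% BEM for commenting in CoR.||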
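|**2025-02-11**|[Scaling Pre-training to One Hundred Billion Data for Vision Language Models](http://arxiv.org/abs/2502.07617)|null|We empirically investigate the potential of pre-training vision-language models at an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. However, tasks of cultural diversity achieve more substantial gains from the 100-billion-scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model's multilinguality and show gains in low-resource languages. In addition, we observe that reducing the size of the pre-training dataset through quality filters such as CLIP, a step typically used to enhance performance, may inadvertently reduce the cultural diversity represented even in large-scale datasets. Our results indicate that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.||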
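|**2025-02-11**|[MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification](http://arxiv.org/abs/2502.07409)|**[link](https://github.com/HauschildLab/MGPATH)**|Whole slide pathology image classification presents challenges such as difficult model generalization, owing to the huge image sizes and limited annotation labels. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and is fine-tuned with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention, which compares interactions between learnable prompts and individual image patches as well as groups of them. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage an (unbalanced) optimal transport-based visual-text distance that makes the model more robust by mitigating perturbations that may occur during data augmentation. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach: we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP. We release our implementations and pre-trained models at MGPATH.||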
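|**2025-02-11**|[TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation](http://arxiv.org/abs/2502.07306)|null|In this work, we propose a modular approach to the vision-and-language navigation (VLN) task that decomposes the problem into four sub-modules, each using state-of-the-art large language models (LLMs) and vision-language models (VLMs) in a zero-shot setting. Given a navigation instruction in natural language, we first prompt an LLM to extract the landmarks and the order in which they are visited. Assuming a known model of the environment, we retrieve the top-k locations of the last landmark and generate k path hypotheses from the starting location to the last landmark using a shortest-path algorithm on the topological map of the environment. Each path hypothesis is represented by a sequence of panoramas. We then use dynamic programming to compute the alignment score between the sequence of panoramas and the sequence of landmark names, where the matching scores are obtained from a VLM. Finally, we compute the nDTW metric for the hypothesis that yields the highest alignment score to evaluate path fidelity. We demonstrate superior performance compared with other approaches that use joint semantic maps, such as VLMaps, on the complex R2R-Habitat instruction dataset, and we quantify in detail the effect of visual grounding on navigation performance.||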
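|**2025-02-10**|[EVEv2: Improved Baselines for Encoder-Free Vision-Language Models](http://arxiv.org/abs/2502.06788)|**[link](https://github.com/baaivision/eve)**|Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential of unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs that use pre-trained vision encoders, discrete tokenizers, and minimalist visual layers trained from scratch, and we dig into the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities; (ii) a well-designed training strategy enables effective optimization of encoder-free VLMs. Through extensive evaluation, EVEv2.0 represents a thorough study of decoder-only architectures across modalities and demonstrates superior data efficiency and strong vision-reasoning capability. Code is publicly available at https://github.com/baaivision/EVE.||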
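|**2025-02-10**|[Learning Musical Representations for Music Performance Question Answering](http://arxiv.org/abs/2502.06710)|null|Music performances are representative scenarios for audio-visual modeling. Unlike common scenarios with sparse audio, music performances contain dense audio signals throughout. While existing multimodal learning methods show impressive capabilities in audio-video question answering, they fail to handle fundamental problems in music performances: they underexplore the interactions between the multimodal signals in a performance and do not consider the distinctive characteristics of instruments and music. As a result, existing methods tend to answer questions about music performances inaccurately. To bridge these research gaps, (i) given the intricate multimodal interconnectivity inherent to music data, our primary backbone is designed to incorporate multimodal interactions in the context of music; (ii) to enable the model to learn music characteristics, we annotate and release rhythm and music sources in the current music datasets; (iii) for time-aware audio-visual modeling, we align the model's music predictions with the temporal dimension. Our experiments show state-of-the-art results on the Music AVQA datasets. Our code is available at https://github.com/xid32/Amuse.||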
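|**2025-02-10**|[AppVLM: A Lightweight Vision Language Model for Online App Control](http://arxiv.org/abs/2502.06395)|null|Using foundation models as smartphone assistants, termed app agents, is a critical research challenge. These agents aim to carry out human instructions on smartphones by interpreting textual instructions and performing actions through the device's interface. While promising, current approaches face significant limitations. Methods that use large proprietary models, such as GPT-4o, are computationally expensive, while those that use smaller fine-tuned models often lack adaptability to out-of-distribution tasks. In this work, we introduce AppVLM, a lightweight vision-language model (VLM). First, we fine-tune it offline on the AndroidControl dataset. Then, we refine its policy by collecting data from the AndroidWorld environment and performing further training iterations. Our results indicate that AppVLM achieves the highest action prediction accuracy in offline evaluation on the AndroidControl dataset compared with all evaluated baselines, and matches GPT-4o in online task completion success rate in the AndroidWorld environment while being up to ten times faster. This makes AppVLM a practical and efficient solution for real-world deployment.||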
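|**2025-02-11**|[When Data Manipulation Meets Attack Goals: An In-depth Survey of Attacks for VLMs](http://arxiv.org/abs/2502.06390)|**[link](https://github.com/aobtdai/vlm_attack_paper_list)**|Vision-language models (VLMs) have gained considerable attention in recent years for their ability to effectively integrate and process both textual and visual information. This integration has significantly improved performance across applications such as scene perception and robotics. However, the deployment of VLMs has also raised critical safety and security concerns, calling for extensive research to assess the potential vulnerabilities of these systems. In this work, we present an in-depth survey of attack strategies against VLMs. We categorize these attacks by their underlying objectives, namely jailbreak, camouflage, and exploitation, and detail the various methods used to manipulate VLM data. We also outline the corresponding defense mechanisms that have been proposed to mitigate these vulnerabilities. By identifying key connections and distinctions among the different types of attacks, we propose a compelling taxonomy of VLM attacks. Furthermore, we summarize the evaluation metrics that comprehensively describe the characteristics and impact of different attacks on VLMs. Finally, we discuss promising future research directions that could further enhance the robustness and safety of VLMs, emphasizing the importance of continued exploration in this critical area. To facilitate community engagement, we maintain an up-to-date project page at https://github.com/AobtDai/VLM_Attack_Paper_List.||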
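|**2025-02-10**|[Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models](http://arxiv.org/abs/2502.06130)|**[link](https://github.com/zhangce01/degf)**|While recent large vision-language models (LVLMs) have shown remarkable performance on multimodal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that text-to-image generation is the inverse of image-conditioned response generation in LVLMs, we explore the potential of text-to-image generative models to help mitigate hallucinations in LVLMs. We find that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by the LVLM; this image acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at https://github.com/zhangce01/DeGF.||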
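|**2025-02-10**|[Fair-MoE: Fairness-Oriented Mixture of Experts in Vision-Language Models](http://arxiv.org/abs/2502.06094)|null|Fairness is a fundamental principle in medical ethics. Vision-language models (VLMs) have shown significant potential in the medical field because they can leverage both visual and linguistic contexts, reducing the need for large datasets and enabling complex tasks. However, the exploration of fairness in VLM applications remains limited. Applying VLMs without a comprehensive analysis of fairness could raise concerns about equal treatment opportunities and diminish public trust in medical deep learning models. To build trust in medical VLMs, we propose Fair-MoE, a model specifically designed to ensure both fairness and effectiveness. Fair-MoE comprises two key components: the Fairness-Oriented Mixture of Experts (FO-MoE) and the Fairness-Oriented Loss (FOL). FO-MoE is designed to leverage the expertise of various specialists to filter out biased patch embeddings and uses an ensemble approach to extract fairer information relevant to specific tasks. FOL is a novel fairness-oriented loss function that not only minimizes the distances between different attributes but also optimizes the differences in the dispersion of the various attributes' distributions. Extended experiments demonstrate the effectiveness and fairness of Fair-MoE. Tested on the Harvard-FairVLMed dataset, Fair-MoE shows improvements in both fairness and accuracy across all four attributes. Code will be publicly available.||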
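|**2025-02-09**|[DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control](http://arxiv.org/abs/2502.05855)|null|Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks across diverse robot embodiments. DexVLA features a novel diffusion-based action expert, scaled to one billion parameters and designed for cross-embodiment learning. A novel embodiment curriculum learning strategy facilitates efficient training in three stages: (1) pre-training the diffusion expert, which is separable from the VLA, on cross-embodiment data; (2) aligning the VLA model to specific embodiments; and (3) post-training for rapid adaptation to new tasks. We conduct comprehensive experiments across multiple embodiments, including single-arm, bimanual, and dexterous-hand robots, demonstrating DexVLA's adaptability to challenging tasks without task-specific adaptation, its ability to learn dexterous skills on novel embodiments with limited data, and its capacity to complete complex, long-horizon tasks, such as laundry folding, using only direct language prompting. In all settings, our method demonstrates superior performance compared with state-of-the-art models such as Octo, OpenVLA, and Diffusion Policy.||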
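|**2025-02-07**|[Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuray](http://arxiv.org/abs/2502.05177)|**[link](https://github.com/vita-mllm/long-vita)**|Establishing the long-context capability of large vision-language models is crucial for video understanding, high-resolution image understanding, multimodal agents, and reasoning. We introduce Long-VITA, a simple yet effective large multimodal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing image, video, and text modalities over 4K frames or 1M tokens while delivering advanced performance on short-context multimodal tasks. We propose an effective multimodal training schema that starts with a large language model and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and a logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of only 17 million samples from public datasets and demonstrates state-of-the-art performance on various multimodal benchmarks compared with recent cutting-edge models trained on internal data. Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multimodal understanding.||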
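|**2025-02-07**|[DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions](http://arxiv.org/abs/2502.05091)|null|Vision-language models (VLMs) align visual and textual representations, enabling high-performance zero-shot classification and image-text retrieval in 2D medical imaging. However, extending VLMs to 3D medical imaging remains computationally challenging. Existing 3D VLMs rely either on Vision Transformers (ViTs), which are computationally expensive because of the quadratic complexity of self-attention, or on 3D convolutions, whose parameter counts and FLOPs (floating-point operations) grow sharply as the kernel size increases. We introduce DCFormer, an efficient 3D medical image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology reports, for zero-shot multi-abnormality detection across 18 pathologies. Compared with ViT, ConvNeXt, PoolFormer, and TransUNet, DCFormer achieves superior efficiency and accuracy, with DCFormer-Tiny reaching 62.0% accuracy and a 46.3% F1-score while using only a small number of parameters. These results highlight DCFormer's potential for scalable, clinically deployable 3D medical VLMs. Our code will be publicly available.||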
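|**2025-02-07**|[OccGS: Zero-shot 3D Occupancy Reconstruction with Semantic and Geometric-Aware Gaussian Splatting](http://arxiv.org/abs/2502.04981)|null|Obtaining semantic 3D occupancy from raw sensor data without manual annotations remains an essential yet challenging task. While prior works have approached this as a perception prediction problem, we formulate it as scene-aware 3D occupancy reconstruction with geometry and semantics. In this work, we propose OccGS, a novel 3D occupancy reconstruction framework that uses semantic and geometric-aware Gaussian Splatting in a zero-shot manner. Leveraging semantics extracted from vision-language models and geometry guided by LiDAR points, OccGS constructs semantic and geometric-aware Gaussians from raw multi-sensor data. We also develop a cumulative Gaussian-to-3D-voxel splatting method for reconstructing occupancy from the Gaussians. OccGS outperforms self-supervised methods in occupancy prediction, achieves performance comparable to fully supervised approaches, and attains state-of-the-art performance on zero-shot semantic 3D occupancy estimation.||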
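|**2025-02-06**|[AnyPlace: Learning Generalized Object Placement for Robot Manipulation](http://arxiv.org/abs/2502.04531)|null|Object placement in robotic tasks is inherently challenging because of the diversity of object geometries and placement configurations. To address this, we propose AnyPlace, a two-stage method trained entirely on synthetic data that can predict a wide range of feasible placement poses for real-world tasks. Our key insight is that by leveraging a vision-language model (VLM) to identify rough placement locations, we focus only on the regions relevant to local placement, which enables us to train the low-level placement-pose-prediction model to capture diverse placements efficiently. For training, we generate a fully synthetic dataset of randomly generated objects in different placement configurations (insertion, stacking, hanging) and train local placement-prediction models. We conduct extensive evaluations in simulation, demonstrating that our method outperforms baselines in terms of success rate, coverage of possible placement modes, and precision. In real-world experiments, we show how our approach directly transfers models trained purely on synthetic data to the real world, where it successfully performs placements in scenarios where other models struggle, such as those involving varying object geometries, diverse placement modes, and high precision for fine placement. More at: https://any-place.github.io.||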
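|**2025-02-06**|[Color in Visual-Language Models: CLIP deficiencies](http://arxiv.org/abs/2502.04470)|null|This work explores how color is encoded in CLIP (Contrastive Language-Image Pre-training), currently the most influential VLM (visual language model) in artificial intelligence. After performing different experiments on synthetic datasets created for this task, we conclude that CLIP is able to attribute correct color labels to colored visual stimuli, but we identify two main deficiencies: (a) a clear bias on achromatic stimuli, which are poorly associated with the color concept, so that white, gray, and black are rarely assigned as color labels; and (b) a tendency to prioritize text over other visual information. We show that the latter is highly significant in color labeling through an exhaustive Stroop-effect test. To find the causes of these color deficiencies, we analyze the internal representation at the neuron level. We conclude that CLIP presents a large number of text-selective neurons, especially in the deepest layers of the network, and a smaller number of multimodal color neurons, which could be the key to a proper understanding of the concept of color. Our investigation underscores the need to refine color representation mechanisms in neural networks to foster a more comprehensive understanding of how humans understand color, thereby advancing the efficacy and versatility of multimodal models like CLIP in real-world scenarios.||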
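|**2025-02-06**|[Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting](http://arxiv.org/abs/2502.04395)|null|Recent advances in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, which limits the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained vision-language models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a retrieval-augmented learner, which extracts enriched temporal features through memory bank interactions; (2) a vision-augmented learner, which encodes time series as informative images; and (3) a text-augmented learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for the final prediction. Extensive experiments across diverse datasets demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting.||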
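|**2025-02-06**|[Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment](http://arxiv.org/abs/2502.04328)|**[link](https://github.com/ola-omni/ola)**|Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. Although some open-source alternatives have emerged, their performance still lags notably behind specialized single-modality models. In this paper, we present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding compared with specialized counterparts. The core design of Ola lies in its progressive modality alignment strategy, which extends the modalities supported by the language model step by step. Our training pipeline begins with the most distinct modalities, image and text, and then gradually expands the model's skill set using speech data that connects language and audio knowledge, and video data that connects all modalities. The progressive learning pipeline also lets us keep the cross-modal alignment data relatively small, making it easier and less costly to develop omni-modal models from existing vision-language models. Moreover, to unlock an advanced interactive experience like that of GPT-4o, we further design a sentence-wise decoding solution for streaming speech generation. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared with state-of-the-art specialized models of similar size. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.||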
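|**2025-02-06**|[Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion](http://arxiv.org/abs/2502.04263)|**[link](https://github.com/miccunifi/cross-the-gap)**|Pre-trained multimodal vision-language models like CLIP are widely used for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multimodal models is suboptimal for intra-modal tasks such as image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss, which does not enforce any intra-modal constraints and leads to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additionally trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance over intra-modal baselines on more than fifteen datasets. We also demonstrate that approaching a native inter-modal task (for example, zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective, or narrowing the modality gap between the text and image feature embedding spaces, helps reduce intra-modal misalignment. The code is publicly available at https://github.com/miccunifi/Cross-the-Gap.||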
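|**2025-02-07**|[Efficient Few-Shot Continual Learning in Vision-Language Models](http://arxiv.org/abs/2502.04098)|null|Vision-language models (VLMs) excel at tasks such as visual question answering and image captioning. However, VLMs are often limited by their use of pre-trained image encoders, such as CLIP, which leads to image understanding errors that hinder overall performance. In addition, real-world applications often require the model to be continually adapted as new and often limited data arrive. To address this, we propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating image encoders within VLMs. LoRSU introduces structured and localized parameter updates, effectively correcting performance on previously error-prone data while preserving the model's general robustness. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency. Specifically, we demonstrate that LoRSU reduces computational overhead by more than 25x compared with full VLM updates, without sacrificing performance. Experimental results on VQA tasks in the few-shot continual learning setting validate LoRSU's scalability, efficiency, and effectiveness, making it a compelling solution for image encoder adaptation in resource-constrained environments.||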
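|**2025-02-06**|[CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing](http://arxiv.org/abs/2502.03997)|null|Computer-aided design (CAD) is indispensable across many industries. Text-based CAD editing, which automates the modification of CAD models according to textual instructions, holds great potential but remains underexplored. Existing methods focus primarily on design variation generation or text-based CAD generation, and either lack support for text-based control or neglect existing CAD models as constraints. We introduce CAD-Editor, the first framework for text-based CAD editing. To address the challenge of the triplet data with accurate correspondence required for training, we propose an automated data synthesis pipeline. This pipeline uses design variation models to generate pairs of original and edited CAD models and employs large vision-language models (LVLMs) to summarize their differences into editing instructions. To tackle the composite nature of text-based CAD editing, we propose a locate-then-infill framework that decomposes the task into two focused sub-tasks: locating the regions that require modification, and infilling these regions with appropriate edits. Large language models (LLMs) serve as the backbone for both sub-tasks, leveraging their capabilities in natural language understanding and CAD knowledge. Experiments show that CAD-Editor achieves superior performance both quantitatively and qualitatively.||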
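|**2025-02-05**|[RadVLM: A Multitask Conversational Vision-Language Model for Radiology](http://arxiv.org/abs/2502.03333)|null|The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work, we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs that contain both single-turn tasks (such as report generation, abnormality classification, and visual grounding) and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM, we evaluate it across different tasks and benchmark it against re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly in scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant that provides structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.||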
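|**2025-02-05**|[iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs](http://arxiv.org/abs/2502.03214)|**[link](https://github.com/SharkyBamboozle/iVISPAR)**|Vision-language models (VLMs) fall short in spatial reasoning and visual alignment. To overcome these limitations, we introduce iVISPAR, an interactive multimodal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents. iVISPAR is based on a variant of the sliding tile puzzle, a classic problem that demands logical planning, spatial awareness, and multi-step reasoning. The benchmark supports visual 2D, visual 3D, and text-based input modalities, enabling comprehensive assessment of VLMs' planning and reasoning skills. We evaluate a range of state-of-the-art open-source and closed-source VLMs, comparing their performance while also providing optimal path solutions and a human baseline to assess the task's complexity and feasibility for humans. Results indicate that while some VLMs perform well on simple spatial tasks, they encounter difficulties with more complex configurations and problem properties. Notably, while VLMs generally perform better in 2D vision than in 3D or text-based representations, they consistently fall short of human performance, illustrating the persistent challenge of visual alignment. This highlights critical gaps in current VLM capabilities and their limitations in achieving human-level cognition.||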
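|**2025-02-05**|[Tell2Reg: Establishing spatial correspondence between images by the same language prompts](http://arxiv.org/abs/2502.03118)|**[link](https://github.com/yanwenci/tell2reg)**|Spatial correspondence can be represented by pairs of segmented regions, so that an image registration network aims to segment corresponding regions rather than predict displacement fields or transformation parameters. In this work, we show that such corresponding region pairs can be predicted with the same language prompt on two different images using pre-trained large multimodal models based on GroundingDINO and SAM. This enables a fully automated and training-free registration algorithm that can potentially generalize to a wide range of image registration tasks. In this paper, we present experimental results on one challenging task, registering inter-subject prostate MR images, which involves highly variable intensity and morphology between patients. Tell2Reg is training-free and eliminates the costly and time-consuming data curation and labelling that this registration task previously required. This approach outperforms the unsupervised learning-based registration methods tested and performs comparably to weakly supervised methods. Additional qualitative results are also presented to suggest, for the first time, a potential correlation between language semantics and spatial correspondence, including the spatial invariance of language-prompted regions and the difference in language prompts between the obtained local and global correspondences. Code is available at https://github.com/yanwenCi/Tell2Reg.git.||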
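|**2025-02-05**|[Disentangling CLIP Features for Enhanced Localized Understanding](http://arxiv.org/abs/2502.02977)|null|Vision-language models (VLMs) demonstrate impressive capabilities in coarse-grained tasks such as image classification and retrieval. However, they struggle with fine-grained tasks that require localized understanding. To investigate this weakness, we comprehensively analyze CLIP features and identify an important issue: semantic features are highly correlated. Specifically, the features of one class encode information about other classes, which we call mutual feature information (MFI). This mutual information becomes evident when we query a specific class and unrelated objects are activated along with the target class. To address this issue, we propose Unmix-CLIP, a novel framework designed to reduce MFI and improve feature disentanglement. We introduce an MFI loss, which explicitly separates text features by projecting them into a space where inter-class similarity is minimized. To ensure a corresponding separation in image features, we use multi-label recognition (MLR) to align the image features with the separated text features. This ensures that both image and text features are disentangled and aligned across modalities, improving feature separation for downstream tasks. On the COCO-14 dataset, Unmix-CLIP reduces feature similarity by 24.9%. We demonstrate its effectiveness through extensive evaluations on MLR and zero-shot semantic segmentation (ZS3). In MLR, our method performs competitively on the VOC2007 dataset and surpasses state-of-the-art (SOTA) approaches on the COCO-14 dataset while using fewer training parameters. In addition, Unmix-CLIP consistently outperforms existing ZS3 methods on the COCO and VOC datasets.||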
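|**2025-02-04**|[Vision-Language Model Dialog Games for Self-Improvement](http://arxiv.org/abs/2502.02740)|null|The growing demand for high-quality, diverse training data poses a significant bottleneck to advancing vision-language models (VLMs). This paper proposes VLM Dialog Games, a novel and scalable self-improvement framework for VLMs. Our approach leverages self-play between two agents engaged in a goal-oriented game centered on image identification. By filtering for successful game interactions, we automatically curate a high-quality dataset of interleaved images and text. We demonstrate that fine-tuning on this synthetic data improves performance on downstream tasks and generalizes across datasets. Moreover, because improvements in the model lead to better game play, the procedure can be applied iteratively. This work paves the way for self-improving VLMs, with potential applications in a variety of real-world scenarios, especially when high-quality multimodal data is scarce.||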
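|**2025-02-04**|[COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation](http://arxiv.org/abs/2502.02589)|null|This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Built on the COCO dataset with advanced COCONut panoptic masks, the dataset aims to overcome the lack of detailed, scene-comprehensive descriptions in existing image-text datasets. COCONut-PanCap incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and of generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering benefits complementary to large-scale datasets. The dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multimodal learning.||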
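|**2025-01-31**|[Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SCICAP Challenge 2023](http://arxiv.org/abs/2501.19353)|null|Since the SCICAP datasets were launched in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SCICAP Challenge took place, inviting teams worldwide to use an expanded SCICAP dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advanced quickly, and many powerful pre-trained large multimodal models (LMMs) emerged that showed impressive capabilities in various vision-and-language tasks. This paper presents an overview of the first SCICAP Challenge and details the performance of various models on its data, capturing a snapshot of the field's state. We find that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even over the original captions written by authors. Following this key finding, we conduct detailed analyses to answer the question: have advanced LMMs solved the task of generating captions for scientific figures?||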
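|**2025-01-31**|[A Survey on Class-Agnostic Counting: Advancements from Reference-Based to Open-World Text-Guided Approaches](http://arxiv.org/abs/2501.19184)|null|Object counting has recently shifted towards class-agnostic counting (CAC), which addresses the challenge of counting objects across arbitrary categories, a crucial capability for versatile counting systems. While humans effortlessly identify and count objects from diverse categories without prior knowledge, most counting methods remain restricted to enumerating instances of known classes, require extensive labeled datasets for training, and struggle in open-vocabulary settings. In contrast, CAC aims to count objects belonging to classes never seen during training, typically operating in a few-shot setting. In this paper, we present the first review of advances in CAC methodologies, categorizing them into three paradigms according to how target object classes can be specified: reference-based, reference-less, and open-world text-guided. Reference-based approaches set performance benchmarks using exemplar-guided mechanisms. Reference-less methods eliminate the dependency on exemplars by leveraging inherent image patterns. Finally, open-world text-guided methods use vision-language models to let object classes be described through textual prompts, representing a flexible and appealing solution. We analyze state-of-the-art techniques and report their results on existing gold-standard benchmarks, comparing their performance and identifying and discussing their strengths and limitations. Persistent challenges, such as annotation dependency, scalability, and generalization, are also discussed, along with future directions. We believe this survey will be a valuable resource for researchers seeking to understand the progressive developments and contributions of CAC over time and the current state of the art, offering insights into future directions and open challenges.||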
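|**2025-01-31**|[Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification](http://arxiv.org/abs/2501.19086)|null|X-ray imaging is pivotal in medical diagnostics, offering non-invasive insights into a range of health conditions. Recently, vision-language models such as the Contrastive Language-Image Pretraining (CLIP) model have demonstrated potential for improving diagnostic accuracy by leveraging large-scale image-text datasets. However, because CLIP was not originally designed for medical images, several CLIP-like models trained specifically on medical images have been developed. Despite their enhanced performance, issues of fairness, particularly with respect to demographic attributes, remain largely unaddressed. In this study, we perform a comprehensive fairness analysis of CLIP-like models applied to X-ray image classification. We assess their performance and fairness across diverse patient demographics and disease categories using zero-shot inference and various fine-tuning techniques, including Linear Probing, Multilayer Perceptron (MLP), Low-Rank Adaptation (LoRA), and full fine-tuning. Our results indicate that while fine-tuning improves model accuracy, fairness concerns persist, highlighting the need for further fairness interventions in these foundation models.||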
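|**2025-01-31**|[RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception](http://arxiv.org/abs/2501.18880)|null|Fine-tuning vision-language models (VLMs) on natural language instructions for application-specific visual grounding has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies heavily on high-quality datasets to succeed across various downstream tasks. In addition, VLMs often run into limitations caused by insufficient and imbalanced fine-tuning data. To address these issues, we propose a new generalizable framework that improves VLM fine-tuning by integrating it with a reinforcement learning (RL) agent. Our method uses the RL agent to manipulate objects in an indoor setting to create synthetic data for fine-tuning, addressing certain vulnerabilities of the VLM. Specifically, we use the performance of the VLM to provide feedback to the RL agent, which then generates informative data that efficiently fine-tunes the VLM on the targeted task (for example, spatial reasoning). The key contribution of this work is a framework in which the RL agent serves as an informative data sampling tool and assists the VLM in order to enhance performance and address task-specific vulnerabilities. By targeting the data sampling process to address the weaknesses of the VLM, we can effectively train a more context-aware model. In addition, generating synthetic data gives us precise control over each scene and lets us produce granular ground-truth captions. Our results show that the proposed data generation approach improves the spatial reasoning performance of VLMs, which demonstrates the benefits of RL-guided data generation in vision-language tasks.||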
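|**2025-01-31**|[Test-time Loss Landscape Adaptation for Zero-Shot Generalization in Vision-Language Models](http://arxiv.org/abs/2501.18864)|null|Test-time adaptation of pre-trained vision-language models has emerged as a technique for tackling distribution shifts at test time. Although existing methods, especially those based on Test-time Prompt Tuning (TPT), have shown promising results, the high computational cost of their parameter optimization presents challenges for scalability and practical application. This paper reveals, from a loss landscape perspective, that backpropagation is unnecessary in existing methods. Building on this insight, the paper proposes a simple yet effective framework called Test-time Loss Landscape Adaptation (TLLA). TLLA leverages the relative position between the training minimum and the test loss landscapes to guide the adaptation process, avoiding updates to model parameters at test time. Specifically, it consists of two main stages. In the prompt tuning stage, a Sharpness-Aware Prompt Tuning (SAPT) method is introduced to identify the training flat minimum, laying the foundation for the subsequent test-time adaptation. In the test stage, a Sharpness-based Test Sample Selection (STSS) approach is used to ensure that the flat minimum within the training loss landscape is aligned with the loss landscape of each augmented test sample. Extensive experiments on both domain generalization and cross-dataset benchmarks demonstrate that TLLA achieves state-of-the-art performance while significantly reducing computational overhead. Notably, TLLA surpasses TPT by an average of 5.32% and 6.98% on four ImageNet variant datasets when using ResNet50 and ViT-B/16 image encoders, respectively. The code will be available soon.||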
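|**2025-01-30**|[Human Re-ID Meets LVLMs: What can we expect?](http://arxiv.org/abs/2501.18698)|null|Large vision-language models (LVLMs) have achieved breakthrough progress across many tasks, including content generation, virtual assistants, and multimodal search and retrieval. However, in many application scenarios the performance of these methods has been widely criticized, particularly when compared with the state-of-the-art methods and technologies of each specific domain. In this work, we compare the performance of leading large vision-language models on the human re-identification task against state-of-the-art AI models designed specifically for this problem. We compare the results of ChatGPT-4o, Gemini-2.0-Flash, Claude 3.5 Sonnet, and Qwen-VL-Max with a baseline ReID PersonViT model on the well-known Market1501 dataset. Our evaluation pipeline includes dataset curation, prompt engineering, and metric selection to assess model performance. Results are analyzed from several perspectives: similarity scores, classification accuracy, and classification metrics, including precision, recall, F1 score, and area under the curve (AUC). Our results confirm the strengths of LVLMs but also reveal severe limitations that often lead to catastrophic answers and should be the focus of further research. Finally, we outline several future research directions that fuse traditional methods and LVLMs to combine the strengths of both families of techniques and achieve substantial performance gains.||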
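|**2025-01-30**|[Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models](http://arxiv.org/abs/2501.18533)|null|Large vision-language models (VLMs) have achieved remarkable performance across a wide range of tasks. However, deploying them in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, fall short in addressing challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation highlights a safety reasoning gap: these methods lack safety visual reasoning ability, which leads to such bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image safety scenarios, consisting of training and test splits. Our experiments demonstrate that InternVL2.5-8B fine-tuned with MIS significantly outperforms both powerful open-source models and API-based models on challenging multi-image tasks that require safety-related visual reasoning. This approach not only delivers exceptional safety performance but also preserves general capabilities without any trade-offs. Specifically, fine-tuning with MIS increases average accuracy by 0.83% across five general benchmarks and reduces the attack success rate (ASR) on multiple safety benchmarks by a large margin. Data and models are released at https://dripnowhy.github.io/MIS/.||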
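|**2025-01-30**|[Pre-Trained Vision-Language Model Selection and Reuse for Downstream Tasks](http://arxiv.org/abs/2501.18271)|null|Pre-trained vision-language models (VLMs) are becoming increasingly popular across various visual tasks, and several open-source VLM variants have been released. However, selecting the best-performing pre-trained VLM for a specific downstream task is challenging, since no single VLM achieves satisfactory performance on all downstream tasks and evaluating all available VLMs is impossible given time and data limitations. To address this problem, this paper proposes a novel paradigm for selecting and reusing VLMs for downstream tasks, called Model Label Learning (MLL). The proposal contains three key modules: *model labeling*, which assigns labels to each VLM to describe its specialty and utility; *model selection*, which matches the requirements of the target task with model labels; and *model reuse*, which applies the selected VLMs to the target task in an ensemble manner. The proposal is highly computationally efficient and growable, since the model labeling process is completed independently of the target task and its ability can grow with the number of candidate VLMs. We also introduce a new benchmark for evaluating VLM selection methods, including 49 VLMs and 17 target task datasets. Experimental results clearly demonstrate the effectiveness of the proposed method for selecting and reusing VLMs.||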
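|**2025-01-30**|[Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment](http://arxiv.org/abs/2501.18157)|null|Building reliable speech systems often requires combining multiple modalities, such as audio and visual cues. While such multimodal solutions frequently improve performance and may even be critical in certain cases, they come with several constraints, such as increased sensory requirements, computational cost, and modality synchronization, to mention a few. These challenges constrain the direct use of multimodal solutions in real-world applications. In this work, we develop approaches in which learning uses all available modalities but deployment or inference uses only one modality or a reduced set of modalities. To do so, we propose a Multimodal Training and Unimodal Deployment (MUTUD) framework, which includes a Temporally Aligned Modality feature Estimation (TAME) module that can estimate information from the missing modality using the modalities present during inference. This approach facilitates the integration of information across different modalities, enhancing the overall inference process by leveraging the strengths of each modality to compensate for the absence of certain modalities during inference. We apply MUTUD to various audiovisual speech tasks and show that it can considerably reduce the performance gap between multimodal models and their corresponding unimodal counterparts. MUTUD achieves this while reducing model size and compute compared with multimodal models, in some cases by almost 80%.||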
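|**2025-01-29**|[U2A: Unified Unimodal Adaptation for Robust and Efficient Multimodal Learning](http://arxiv.org/abs/2501.17823)|null|Multimodal learning often relies on designing new models and complex training strategies to achieve optimal performance. We present Unified Unimodal Adaptation (U2A), which jointly fine-tunes pre-trained unimodal encoders using low-rank adaptation (LoRA) for various multimodal tasks. Our method significantly reduces the number of learnable parameters and eliminates the need for complex training strategies such as alternating training, gradient modifications, or unimodal fine-tuning. To address missing modalities during both training and testing, we introduce Mask Tokens (MT), which generate missing-modality features from the available modalities using a single token per modality. This simplifies the process and removes the need for specialized feature estimation or prompt-tuning methods. Our evaluation demonstrates that U2A matches or outperforms state-of-the-art methods in both complete and missing-modality settings, showing strong performance and robustness across various modalities, tasks, and datasets. We also analyze and report the effectiveness of Mask Tokens in different missing-modality scenarios. Overall, our method provides a robust, flexible, and efficient solution for multimodal learning with minimal computational overhead.||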
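|**2025-01-29**|[Planning with Vision-Language Models and a Use Case in Robot-Assisted Teaching](http://arxiv.org/abs/2501.17665)|null|Automating the generation of Planning Domain Definition Language (PDDL) with large language models (LLMs) opens new research topics in AI planning, particularly for complex real-world tasks. This paper introduces Image2PDDL, a novel framework that leverages vision-language models (VLMs) to automatically convert images of an initial state and descriptions of a goal state into PDDL problems. By providing a PDDL domain alongside visual inputs, Image2PDDL addresses key challenges in bridging perceptual understanding with symbolic planning, reducing the expertise required to create structured problem instances and improving scalability across tasks of varying complexity. We evaluate the framework on various domains, including standard planning domains such as Blocksworld and Sliding-Tile Puzzles, using datasets with multiple difficulty levels. Performance is assessed on syntax correctness, which ensures valid grammar and executability, and on content correctness, which verifies that states are accurately represented in the generated PDDL problems. The proposed approach demonstrates promising results across diverse task complexities, suggesting its potential for broader applications in AI planning. We also discuss a potential use case in robot-assisted teaching of students with Autism Spectrum Disorder.||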
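|**2025-01-29**|[Exploring Vision Language Models for Multimodal and Multilingual Stance Detection](http://arxiv.org/abs/2501.17654)|null|Social media's global reach amplifies the spread of information, highlighting the need for robust natural language processing tasks such as stance detection across languages and modalities. Prior research focuses predominantly on text-only inputs, leaving multimodal scenarios, such as those involving both images and text, relatively underexplored. Meanwhile, the prevalence of multimodal posts has increased significantly in recent years. Although state-of-the-art vision-language models (VLMs) show promise, their performance on multimodal and multilingual stance detection tasks remains largely unexamined. This paper evaluates state-of-the-art VLMs on a newly extended dataset covering seven languages and multimodal inputs, investigating their use of visual cues, language-specific performance, and cross-modality interactions. Our results show that VLMs generally rely more on text than on images for stance detection, and that this trend persists across languages. Additionally, VLMs rely more on text contained within the images than on other visual content. Regarding multilinguality, the models studied tend to generate consistent predictions across languages whether or not they are explicitly multilingual, although there are outliers that are incongruous with macro F1, language support, and model size.||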
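|**2025-01-29**|[Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation](http://arxiv.org/abs/2501.17642)|null|Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel in an image to a specific class defined by arbitrary text descriptions. Recent advances in large-scale vision-language models have demonstrated their open-vocabulary understanding capabilities, significantly facilitating the development of OVSS. However, most existing methods suffer from either suboptimal performance or long latency. This study introduces ERR-Seg, a novel framework that effectively reduces redundancy to balance accuracy and efficiency. ERR-Seg incorporates a training-free Channel Reduction Module (CRM) that leverages prior knowledge from vision-language models such as CLIP to identify the most relevant classes while discarding the others. It also incorporates Efficient Semantic Context Fusion (ESCF) with spatial-level and class-level sequence reduction strategies. CRM and ESCF yield substantial memory and computational savings without compromising accuracy. Additionally, recognizing the significance of hierarchical semantics extracted from middle-layer features for closed-set semantic segmentation, ERR-Seg introduces the Hierarchical Semantic Module (HSM) to exploit hierarchical semantics in the context of OVSS. Compared with previous state-of-the-art methods under the ADE20K-847 setting, ERR-Seg achieves a +5.6% mIoU improvement and reduces latency by 67.3%.||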
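|**2025-01-30**|[Boosting Weak Positives for Text Based Person Search](http://arxiv.org/abs/2501.17586)|null|Large vision-language models have revolutionized cross-modal object retrieval, but text-based person search (TBPS) remains a challenging task because of limited data and the fine-grained nature of the task. Existing methods focus primarily on aligning image-text pairs into a common representation space and often disregard the fact that real-world positive image-text pairs share varying degrees of similarity with one another. This leads models to prioritize easy pairs, and in some recent approaches, challenging samples are discarded as noise during training. In this work, we introduce a boosting technique that dynamically identifies and emphasizes these challenging samples during training. Our approach is motivated by classical boosting techniques and dynamically updates the weights of the weak positives, that is, the pairs for which the rank-1 match does not share the identity of the query. The weight allows these misranked pairs to contribute more to the loss, so the network has to pay more attention to such samples. Our method achieves improved performance across four pedestrian datasets, demonstrating the effectiveness of the proposed module.||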
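|**2025-01-29**|[Learning Free Token Reduction for Multi-Modal LLM](http://arxiv.org/abs/2501.17391)|null|Vision-language models (VLMs) have achieved remarkable success across a range of multimodal tasks; however, their practical deployment is often constrained by high computational costs and prolonged inference times. Since the vision modality typically carries more information than the text modality, compressing visual prompts offers a promising way to alleviate these challenges. Existing approaches focus predominantly on refining model architectures or directly reducing the number of visual tokens. However, these methods often compromise inference performance because they do not consider the unique spatial and temporal characteristics of visual data. In this work, we propose a token compression paradigm that operates on both the spatial and temporal dimensions. Our approach includes a learning-free, plug-and-play compression pipeline that can be seamlessly integrated into most multimodal large language model (MLLM) frameworks. With this method, we enhance the model's inference capability while reducing its computational cost. Experimental results on the video question answering task demonstrate the effectiveness of the proposed approach, showing significant improvements in efficiency without sacrificing performance.||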
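|**2025-01-30**|[Probing LLM World Models: Enhancing Guesstimation with Wisdom of Crowds Decoding](http://arxiv.org/abs/2501.17310)|null|Guesstimation, the task of making approximate quantity estimates, is a common real-world challenge. However, it has been largely overlooked in research on large language models (LLMs) and vision-language models (VLMs). We introduce MARBLES, a novel guesstimation dataset. The dataset requires one to estimate how many items (for example, marbles) can fit into a container (for example, a one-cup measuring cup), both with and without an accompanying image. Inspired by the social science concept of the "Wisdom of Crowds" (WOC), taking the median of estimates from a crowd, which has proven effective in guesstimation, we propose a "WOC decoding" strategy for LLM guesstimation. We show that LLMs and VLMs perform well at guesstimation, suggesting that they possess some level of the "world model" necessary for guesstimation. Moreover, similar to human performance, the WOC decoding method improves LLM/VLM guesstimation accuracy. Furthermore, including images in the multimodal condition enhances model performance. These results highlight the value of the WOC decoding strategy for LLMs and VLMs and position guesstimation as a probe for evaluating their world models. Because an LLM's world model is a fundamental prerequisite for many real-world tasks (for example, human-AI teaming), our findings have broad implications for the AI community.||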
|**2025-01-28**|[Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection](http://arxiv.org/abs/2501.16981)|null|Owing to large-scale image-text contrastive training, pre-trained vision-language models (VLMs) like CLIP show superior open-vocabulary recognition ability. Most existing open-vocabulary object detectors attempt to utilize the pre-trained VLM to attain generative representation. F-ViT uses the pre-trained visual encoder as the backbone network and freezes it during training. However, the frozen backbone doesn't benefit from the labeled data to strengthen the representation. Therefore, we propose a novel two-branch backbone network design, named ViT-Feature-Modulated Multi-Scale Convolutional network (VMCNet). VMCNet consists of a trainable convolutional branch, a frozen pre-trained ViT branch, and a feature modulation module. The trainable CNN branch can be optimized with labeled data, while the frozen pre-trained ViT branch keeps the representation ability derived from large-scale pre-training. The proposed feature modulation module then modulates the multi-scale CNN features with the representations from the ViT branch. With the proposed mixed structure, the detector is more likely to discover novel categories. Evaluated on two popular benchmarks, our method boosts detection performance on novel categories and outperforms the baseline. On OV-COCO, the proposed method achieves 44.3 AP$_{50}^{\mathrm{novel}}$ with ViT-B/16 and 48.5 AP$_{50}^{\mathrm{novel}}$ with ViT-L/14. On OV-LVIS, VMCNet with ViT-B/16 and ViT-L/14 reaches 27.8 and 38.4 mAP$_{r}$, respectively.||
|**2025-01-28**|[Improving Vision-Language-Action Model with Online Reinforcement Learning](http://arxiv.org/abs/2501.16664)|null|Recent studies have successfully integrated large vision-language models (VLMs) into low-level robotic control by supervised fine-tuning (SFT) with expert robotic datasets, resulting in what we term vision-language-action (VLA) models. Although VLA models are powerful, how to improve these large models during interaction with environments remains an open question. In this paper, we explore how to further improve these VLA models via reinforcement learning (RL), a commonly used fine-tuning technique for large models. However, we find that directly applying online RL to large VLA models presents significant challenges, including training instability that severely impacts the performance of large models, and computing burdens that exceed the capabilities of most local machines. To address these challenges, we propose the iRe-VLA framework, which iterates between reinforcement learning and supervised learning to effectively improve VLA models, leveraging the exploratory benefits of RL while maintaining the stability of supervised learning. Experiments in two simulated benchmarks and a real-world manipulation suite validate the effectiveness of our method.||
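|**2025-01-27**|[BiFold: Bimanual Cloth Folding with Language Guidance](http://arxiv.org/abs/2501.16458)|null|Cloth folding is a complex task because of the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. Given the lack of annotated bimanual folding data, we devise a procedure to automatically parse the actions of a simulated dataset and tag them with aligned text instructions. BiFold attains the best performance on our dataset and can transfer to new instructions, garments, and environments.||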
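|**2025-01-27**|[PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding](http://arxiv.org/abs/2501.16411)|null|Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While vision-language models (VLMs) have shown great promise in reasoning and task planning for embodied AI, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding across a diverse set of tasks. PhysBench contains 100,000 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments on 75 representative VLMs reveal that while these models excel at common-sense reasoning, they struggle to understand the physical world, likely because of the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle this shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding can help embodied AI agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.||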
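|**2025-01-27**|[CLISC: Bridging clip and sam by enhanced cam for unsupervised brain tumor segmentation](http://arxiv.org/abs/2501.16246)|null|Brain tumor segmentation is important for tumor diagnosis, and current deep learning methods rely on a large set of annotated images for training, which is costly to annotate. Unsupervised segmentation promises to avoid human annotation, but its performance is often limited. In this study, we present a novel unsupervised segmentation approach that leverages foundation models and consists of three main steps: (1) A vision-language model (that is, CLIP) is used to obtain image-level pseudo-labels for training a classification network. Class Activation Mapping (CAM) is then used to extract Regions of Interest (ROIs), with an adaptive masking-based data augmentation used to enhance ROI identification. (2) The ROIs are used to generate bounding-box and point prompts for the Segment Anything Model (SAM) to obtain segmentation pseudo-labels. (3) A 3D segmentation network is trained with the SAM-derived pseudo-labels, where low-quality pseudo-labels are filtered out in a self-learning process based on the similarity between the SAM's output and the network's prediction. Evaluation on the BraTS2020 dataset demonstrates that our approach obtains an average Dice Similarity Score (DSC) of 85.60%, outperforming five state-of-the-art unsupervised segmentation methods by more than 10 percentage points. In addition, our approach outperforms zero-shot inference with SAM used directly, and its performance is close to that of fully supervised learning.||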
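|**2025-01-28**|[SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model](http://arxiv.org/abs/2501.15830)|null|In this paper, we argue that spatial understanding is the key to robot manipulation, and we propose SpatialVLA to explore effective spatial representations for a robot foundation model. Specifically, we introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and we propose Adaptive Action Grids to represent spatial robot movement actions with adaptively discretized action grids, facilitating the learning of generalizable and transferable spatial action knowledge for cross-robot control. SpatialVLA is first pre-trained on top of a vision-language model with 1.1 million real-world robot episodes to learn a generalist manipulation policy across multiple robot environments and tasks. After pre-training, SpatialVLA can be applied directly to perform numerous tasks in a zero-shot manner. Superior results in both simulation and on real-world robots demonstrate its advantage in inferring complex robot motion trajectories and its strong in-domain multi-task generalization ability. We further show that the proposed Adaptive Action Grids offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups, where the pre-learned action grids are re-discretized to capture the robot-specific spatial action movements of the new setups. Superior results from extensive evaluations demonstrate exceptional in-domain generalization and out-of-domain adaptation capability, highlighting the crucial benefit of the proposed spatial-aware representations for generalist robot policy learning. All details and code will be open-sourced.||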
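|**2025-01-26**|[Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts](http://arxiv.org/abs/2501.15688)|null|Multimodal knowledge graph completion (MMKGC) aims to predict missing links in multimodal knowledge graphs (MMKGs) by leveraging information from various modalities alongside structural data. Existing MMKGC approaches primarily extend traditional knowledge graph embedding (KGE) models, which often require creating an embedding for every entity. This results in large model sizes and inefficiencies in integrating multimodal information, particularly for real-world graphs. Meanwhile, Transformer-based models have demonstrated competitive performance in knowledge graph completion (KGC). However, their focus on single-modal knowledge limits their capacity to use cross-modal information. Recently, large vision-language models (VLMs) have shown potential in cross-modal tasks but are constrained by high training costs. In this work, we propose a novel approach that integrates Transformer-based KGE models with cross-modal context generated by pre-trained VLMs, thereby extending their applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform relevant visual information from entities and their neighbors into textual sequences. We then frame KGC as a sequence-to-sequence task and fine-tune the model with the generated cross-modal context. This simple yet effective method significantly reduces model size compared with traditional KGE approaches while achieving competitive performance across multiple large-scale datasets with minimal hyperparameter tuning.||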
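|**2025-01-26**|[TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding](http://arxiv.org/abs/2501.15513)|**[link](https://github.com/zhangxj199/tinyllava-video)**|We present TinyLLaVA-Video, a video understanding model with no more than 4B parameters that processes video sequences in a simple manner, without the need for complex architectures, and supports both fps sampling and uniform frame sampling. Our model is modular and scalable, which allows training and inference with limited computational resources and lets users replace components as needed. We validate the effectiveness of this framework through experiments; the best model achieves performance comparable to certain existing 7B models on multiple video understanding benchmarks. The code and training recipes are fully open source, and all components and training data are publicly available. We hope this work can serve as a baseline for practitioners exploring small-scale multimodal models for video understanding. It is available at https://github.com/ZhangXJ199/TinyLLaVA-Video.||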
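|**2025-01-26**|[Cross-Modal Transfer from Memes to Videos: Addressing Data Scarcity in Hateful Video Detection](http://arxiv.org/abs/2501.15438)|**[link](https://github.com/social-ai-studio/crossmodaltransferlearning)**|Detecting hate speech in online content is essential to ensuring safer digital spaces. While significant progress has been made on text and meme modalities, video-based hate speech detection remains underexplored, hindered by a lack of annotated datasets and the high cost of video annotation. This problem is particularly acute given the growing reliance on large models, which demand substantial amounts of training data. To address this challenge, we leverage meme datasets as both a substitution and an augmentation strategy for training hateful video detection models. Our approach introduces a human-assisted reannotation pipeline to align meme dataset labels with video datasets, ensuring consistency with minimal labeling effort. Using two state-of-the-art vision-language models, we demonstrate that meme data can substitute for video data in resource-scarce scenarios and can augment video datasets to achieve further performance gains. Our results consistently outperform state-of-the-art benchmarks, showcasing the potential of cross-modal transfer learning for advancing hateful video detection. Dataset and code are available at https://github.com/Social-AI-Studio/CrossModalTransferLearning.||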
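|**2025-01-26**|[Scaling Large Vision-Language Models for Enhanced Multimodal Comprehension In Biomedical Image Analysis](http://arxiv.org/abs/2501.15370)|null|Large language models (LLMs) show great potential in understanding textual data and are increasingly adopted by researchers to accelerate scientific discovery, with applications including knowledge extraction (information retrieval), knowledge distillation (summarizing key findings and methodologies into concise forms), and knowledge synthesis (aggregating information from multiple scientific sources to address complex queries, generate hypotheses, and formulate experimental plans). However, scientific data often exists in both visual and textual modalities. Vision-language models (VLMs) address this by incorporating a pretrained vision backbone to process images and a cross-modal projector that adapts image tokens into the LLM dimensional space, providing richer multimodal comprehension. Nevertheless, off-the-shelf VLMs have limited ability to handle domain-specific data and are prone to hallucination. We developed intelligent assistants fine-tuned from LLaVA models to enhance multimodal understanding in low-dose radiation therapy (LDRT), a benign approach used in the treatment of cancer-related illnesses. Using multilingual data from 42,673 articles, we devise complex reasoning and detailed description tasks for visual question answering (VQA) benchmarks. Our assistants, trained on 50,882 image-text pairs and evaluated with an LLM-as-a-judge approach, outperform the base models, particularly in reducing hallucination and improving domain-specific comprehension.|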
|**2025-01-24**|[Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding](http://arxiv.org/abs/2501.14548)|**[link](https://github.com/alibaba-damo-academy/fvlm)**|Artificial intelligence (AI) shows great potential in assisting radiologists to improve the efficiency and accuracy of medical image interpretation and diagnosis. However, a versatile AI model requires large-scale data and comprehensive annotations, which are often impractical in medical settings. Recent studies leverage radiology reports as a naturally high-quality supervision for medical images, using contrastive language-image pre-training (CLIP) to develop language-informed models for radiological image interpretation. Nonetheless, these approaches typically contrast entire images with reports, neglecting the local associations between imaging regions and report sentences, which may undermine model performance and interoperability. In this paper, we propose a fine-grained vision-language model (fVLM) for anatomy-level CT image interpretation. Specifically, we explicitly match anatomical regions of CT images with corresponding descriptions in radiology reports and perform contrastive pre-training for each anatomy individually. Fine-grained alignment, however, faces considerable false-negative challenges, mainly from the abundance of anatomy-level healthy samples and similarly diseased abnormalities. To tackle this issue, we propose identifying false negatives of both normal and abnormal samples and calibrating contrastive learning from patient-level to disease-aware pairing. We curated the largest CT dataset to date, comprising imaging and report data from 69,086 patients, and conducted a comprehensive evaluation of 54 major and important disease diagnosis tasks across 15 main anatomies. Experimental results demonstrate the substantial potential of fVLM in versatile medical image interpretation. In the zero-shot classification task, we achieved an average AUC of 81.3% on 54 diagnosis tasks, surpassing CLIP and supervised methods by 12.9% and 8.0%, respectively.|
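|**2025-01-24**|[Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models](http://arxiv.org/abs/2501.14276)|null|As demand for high-resolution image processing in large vision-language models (LVLMs) grows, sub-image partitioning has become a popular approach for reducing the visual information loss associated with fixed-resolution processing. However, existing partitioning methods process sub-images uniformly, which leads to suboptimal image understanding. In this work, we find that sub-images with higher semantic relevance to the entire image contain richer visual information and can preserve the model's visual understanding ability. We therefore propose the Global Semantic-guided Weight Allocator (GSWA) module, which dynamically allocates weights to sub-images based on their relative information density, emulating human visual attention mechanisms. This approach enables the model to focus on more informative regions, overcoming the limitations of uniform treatment. We integrate GSWA into the InternVL2-2B framework to create SleighVL, a lightweight yet high-performing model. Extensive experiments show that SleighVL outperforms models with comparable parameter counts and remains competitive with larger models. Our work offers a promising direction for more efficient and context-aware high-resolution image processing in LVLMs, advancing the development of multimodal systems.|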
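|**2025-01-24**|[SelfPrompt: Confidence-Aware Semi-Supervised Tuning for Robust Vision-Language Model Adaptation](http://arxiv.org/abs/2501.14148)|null|We present SelfPrompt, a novel prompt-tuning approach for vision-language models (VLMs) in a semi-supervised learning setup. Existing methods for tuning VLMs in semi-supervised setups struggle with the negative impact of miscalibrated VLMs on pseudo-labelling and with the accumulation of noisy pseudo-labels. SelfPrompt addresses these challenges by introducing a cluster-guided pseudo-labelling method that improves pseudo-label accuracy, and a confidence-aware semi-supervised learning module that maximizes the use of unlabelled data by combining supervised and weakly-supervised learning. In addition, we study our method in an active semi-supervised learning setup, where the labelled set is strategically selected to make the best use of a limited labelling budget. To this end, we propose a weakly-supervised sampling technique that selects a diverse and representative labelled set and can be seamlessly integrated into existing methods to improve their performance. We conduct extensive evaluations across 13 datasets and, using a 2-shot setup, significantly surpass state-of-the-art performance, with average improvements of 6.23% in standard semi-supervised learning, 6.25% in active semi-supervised learning, and 4.9% in base-to-novel generalization. Furthermore, SelfPrompt shows excellent generalization in single-shot settings, achieving an average improvement of 11.78%.|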
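|**2025-01-23**|[GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing](http://arxiv.org/abs/2501.13925)|**[link](https://github.com/mbzuai-oryx/geopixel)**|Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an essential factor in visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly on remote sensing (RS). The distinctive overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery pose unique challenges for region-level comprehension. Moreover, the development of grounded conversation capability in LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. To address these limitations, we propose GeoPixel, the first end-to-end high-resolution RS LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, making it well suited to high-precision RS image analysis. To support grounded conversation generation (GCG) in RS imagery, we build a visually grounded dataset, GeoPixelD, through a semi-automated pipeline that uses set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our ablation studies validate the effectiveness of each component of the overall architecture. Our code and data will be publicly released.|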
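|**2025-01-23**|[Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models](http://arxiv.org/abs/2501.13904)|null|Multimodal large language models (LLMs) play a pivotal role in revolutionizing customer support and operations by integrating multiple modalities such as text, images, and audio. Federated prompt learning (FPL) is a recently proposed approach that combines pre-trained multimodal LLMs, such as vision-language models, with federated learning to create personalized, privacy-preserving AI systems. However, balancing the competing goals of personalization, generalization, and privacy remains a significant challenge. Over-personalization can lead to overfitting and reduce generalizability, while stringent privacy measures, such as differential privacy, can hinder both personalization and generalization. In this paper, we propose a Differentially Private Federated Prompt Learning (DP-FPL) approach to tackle this challenge. It uses a low-rank adaptation scheme to capture generalization while keeping a residual term that preserves expressiveness for personalization. To ensure privacy, we introduce a novel method that applies local differential privacy to the two low-rank components of the local prompt and global differential privacy to the global prompt. Our approach mitigates the impact of privacy noise on model performance while balancing the trade-off between personalization and generalization. Extensive experiments demonstrate the effectiveness of our approach over other baselines.|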
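|**2025-01-23**|[Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning](http://arxiv.org/abs/2501.13893)|**[link](https://github.com/geshang777/pix2cap)**|We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. To achieve this, we carefully design an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for individual objects within images, enabling models to learn more granular relationships between objects and their contexts. This approach yields 167,254 detailed captions, with an average of 22.94 words per caption. Building on Pix2Cap-COCO, we introduce a new task, panoptic segmentation-captioning, which requires models to recognize instances in an image and simultaneously provide a detailed description of each. To benchmark this task, we design a robust baseline based on X-Decoder. Experimental results show that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel at both fine-grained visual understanding and detailed language generation. Furthermore, we use Pix2Cap-COCO for supervised fine-tuning (SFT) of large multimodal models (LMMs) to improve their performance. For example, training with Pix2Cap-COCO significantly improves the performance of GPT4RoI, yielding gains of +1.4% in CIDEr, +0.4% in ROUGE, and +0.5% in SPICE on the Visual Genome dataset, and strengthens its region understanding ability on ViP-BENCH, with an overall improvement of +5.1%, including notable increases of +11.2% in recognition accuracy and +22.2% in language generation quality.|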
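|**2025-01-23**|[Dual-Modal Prototype Joint Learning for Compositional Zero-Shot Learning](http://arxiv.org/abs/2501.13859)|null|Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects by leveraging knowledge learned from seen compositions. Recent approaches have explored the use of vision-language models (VLMs) to align textual and visual modalities. These methods typically employ prompt engineering, parameter tuning, and modality fusion to generate rich textual prototypes that serve as class prototypes for CZSL. However, the modality gap prevents textual prototypes from fully capturing the optimal representations of all class prototypes, particularly those with fine-grained features, which can be obtained directly from the visual modality. In this paper, we propose a novel dual-modal prototype joint learning framework for the CZSL task. Built on VLMs, our approach introduces prototypes in both the textual and visual modalities. The textual prototype is optimized to capture broad conceptual information, helping the model generalize to unseen compositions. Meanwhile, the visual prototype is used to mitigate classification errors caused by the modality gap and to capture fine-grained details that distinguish images with similar appearance. To optimize these prototypes effectively, we design specialized decomposition modules and a joint learning strategy that enrich the features from both modalities. These prototypes not only capture key category information during training but also serve as crucial reference targets during inference. Experimental results show that our approach achieves state-of-the-art performance in the closed-world setting and competitive performance in the open-world setting on three publicly available CZSL benchmarks. These findings validate the effectiveness of our method in advancing compositional generalization.|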
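|**2025-01-23**|[Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes](http://arxiv.org/abs/2501.13851)|**[link](https://github.com/seefreem/meme_text_retrieval_p1)**|Memes have become a powerful form of communication that blends visual and textual elements to convey humor, satire, and cultural messages. Existing research has focused mainly on aspects such as emotion classification, meme generation, propagation, interpretation, figurative language, and sociolinguistics, but has often overlooked deeper meme comprehension and meme-text retrieval. To address these gaps, this study introduces ClassicMemes-50-templates (CM50), a large-scale dataset of over 33,000 memes centered on 50 popular meme templates. We also present an automated knowledge-grounded annotation pipeline that uses large vision-language models to produce high-quality image captions, meme captions, and literary device labels, overcoming the labor-intensive demands of manual annotation. In addition, we propose a meme-text retrieval CLIP model (mtrCLIP) that uses cross-modal embeddings to enhance meme analysis and significantly improves retrieval performance. Our contributions include: (1) a novel dataset for large-scale meme research, (2) a scalable meme annotation framework, and (3) a fine-tuned CLIP model for meme-text retrieval, all aimed at advancing the understanding and analysis of memes at scale.|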
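|**2025-01-23**|[Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos](http://arxiv.org/abs/2501.13826)|null|Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos are an effective medium for this learning process and help learners progress through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities of large multimodal models (LMMs). To address this gap, we introduce Video-MMMU, a multimodal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and use knowledge from videos. Video-MMMU contains 300 expert-level videos and 900 human-annotated questions across six disciplines, and evaluates knowledge acquisition through stage-aligned question-answer pairs (Perception, Comprehension, and Adaptation). A proposed knowledge gain metric, Δknowledge, quantifies the improvement in performance after viewing a video. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods that enhance LMMs' ability to learn and adapt from videos.|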
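|**2025-01-23**|[Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak](http://arxiv.org/abs/2501.13772)|null|Large language models (LLMs) demonstrate remarkable zero-shot performance across various natural language processing tasks. The integration of multimodal encoders extends their capabilities, enabling the development of multimodal large language models that process vision, audio, and text. However, these capabilities also raise significant security concerns, as these models can be manipulated to generate harmful or inappropriate content through jailbreak. While extensive research has explored the impact of modality-specific input edits on text-based LLMs and large vision-language models in jailbreak, the effect of audio-specific edits on large audio-language models (LALMs) remains under-explored. This paper therefore addresses the gap by investigating how audio-specific edits influence LALM inference with regard to jailbreak. We introduce the Audio Editing Toolbox (AET), which enables audio-modality edits such as tone adjustment, word emphasis, and noise injection, and the Edited Audio Datasets (EADs), a comprehensive audio jailbreak benchmark. We also conduct extensive evaluations of state-of-the-art LALMs to assess their robustness under different audio edits. This work lays the groundwork for future exploration of audio-modality interactions in LALM security.|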
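|**2025-01-23**|[Cognitive Paradigms for Evaluating VLMs on Visual Reasoning Task](http://arxiv.org/abs/2501.13620)|null|Evaluating the reasoning capabilities of vision-language models (VLMs) on complex visual tasks provides valuable insight into their potential and limitations. This study evaluates the performance of VLMs on the challenging Bongard OpenWorld Problems benchmark, which involves reasoning over natural images. We propose and evaluate three human-inspired paradigms: holistic analysis (global context processing), deductive rule learning (explicit rule derivation and application), and componential analysis (structured decomposition of images into components). Our results show that state-of-the-art models, including GPT-4o and Gemini, not only surpass human benchmarks but also excel at structured reasoning tasks, with componential analysis proving especially effective. However, ablation studies reveal key challenges, such as handling synthetic images, making fine-grained distinctions, and interpreting nuanced contextual information. These insights underscore the need for further advances in model robustness and generalization, while also highlighting the transformative potential of structured reasoning approaches in enhancing VLM capabilities.|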
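|**2025-01-23**|[Text-driven Online Action Detection](http://arxiv.org/abs/2501.13518)|**[link](https://github.com/3dperceptionlab/toad)**|Detecting actions as they occur is essential for applications such as video surveillance, autonomous driving, and human-robot interaction. This task, known as online action detection, requires classifying actions in streaming videos, handling background noise, and coping with incomplete actions. Transformer architectures are the current state of the art, yet recent advances in computer vision, in particular the potential of vision-language models (VLMs), remain largely untapped for this problem, partly because of high computational costs. In this paper, we introduce TOAD: a Text-driven Online Action Detection architecture that supports zero-shot and few-shot learning. TOAD uses CLIP (Contrastive Language-Image Pretraining) textual embeddings, enabling efficient use of VLMs without significant computational overhead. Our model achieves 82.46% mAP on the THUMOS14 dataset, outperforming existing methods, and sets new baselines for zero-shot and few-shot performance on the THUMOS14 and TVSeries datasets.|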
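|**2025-01-23**|[Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge](http://arxiv.org/abs/2501.13468)|**[link](https://github.com/hmxiong/streamchat)**|Recent advances in large language models (LLMs) have enabled the development of Video-LLMs, advancing multimodal learning by bridging video data with language tasks. However, current video understanding models struggle to process long video sequences, support multi-turn dialogue, and adapt to real-world dynamic scenarios. To address these issues, we propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. StreamChat uses a novel hierarchical memory system to efficiently process and compress video features over extended sequences, enabling real-time, multi-turn dialogue. Our framework incorporates a parallel system scheduling strategy that increases processing speed and reduces latency, ensuring robust performance in real-world applications. Furthermore, we introduce StreamBench, a versatile benchmark that evaluates streaming video understanding across diverse media types and interactive scenarios, including multi-turn interactions and complex reasoning tasks. Extensive evaluations on StreamBench and other public benchmarks show that StreamChat significantly outperforms existing state-of-the-art models in accuracy and response time, confirming its effectiveness for streaming video understanding. Code is available at StreamChat: https://github.com/hmxiong/StreamChat.|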
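|**2025-01-21**|[InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model](http://arxiv.org/abs/2501.12368)|**[link](https://github.com/internlm/internlm-xcomposer)**|Despite the strong performance of large vision-language models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) used with reinforcement learning or test-time scaling offer the potential to improve generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we build a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) providing a supervisory signal for RL training: we integrate IXC-2.5-Reward with Proximal Policy Optimization (PPO) to produce IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) selecting the best response from candidate responses for test-time scaling; and (3) filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer.|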
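|**2025-01-21**|[Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2](http://arxiv.org/abs/2501.12356)|null|Radiology plays a pivotal role in modern medicine because of its non-invasive diagnostic capabilities. However, the manual generation of unstructured medical reports is time-consuming and error-prone, which creates a significant bottleneck in clinical workflows. Despite progress in AI-generated radiology reports, challenges remain in achieving detailed and accurate report generation. In this study, we evaluate different combinations of multimodal models that integrate computer vision and natural language processing to generate comprehensive radiology reports. We use a pretrained Vision Transformer (ViT-B16) and a SWIN Transformer as image encoders, and BART and GPT-2 models as textual decoders. Using chest X-ray images and reports from the IU-Xray dataset, we evaluate the usability of the SWIN Transformer-BART, SWIN Transformer-GPT-2, ViT-B16-BART, and ViT-B16-GPT-2 models for report generation, aiming to find the best combination among them. The SWIN-BART model performs best of the four, achieving remarkable results on almost all evaluation metrics, such as ROUGE, BLEU, and BERTScore.|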
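|**2025-01-21**|[CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification](http://arxiv.org/abs/2501.12266)|null|The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the final disease prediction on a set of predefined, human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system must be retrained. Inspired by the remarkable performance of large vision-language models (LVLMs) in few-shot settings, we propose CBVLM, a simple yet effective method that addresses both of these challenges. First, for each concept, we prompt the LVLM to answer whether the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis in the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information is available on our project page: https://cristianopatricio.github.io/CBVLM/.|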
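|**2025-01-21**|[Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model](http://arxiv.org/abs/2501.12206)|**[link](https://github.com/hasanar1f/llava-hallunication-fix)**|Large vision-language models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models frequently exhibit hallucination behavior, generating descriptions that contain objects or details absent from the input image. Our work investigates this phenomenon by analyzing attention patterns across transformer layers and heads, revealing that hallucinations often stem from a progressive degradation of visual grounding in deeper layers. We propose a novel attention modification approach that combines selective token emphasis and head-specific modulation to maintain visual grounding throughout the generation process. Our method introduces two key components: (1) a dual-stream token selection mechanism that identifies and prioritizes both locally informative and spatially significant visual tokens, and (2) an attention head-specific modulation strategy that differentially amplifies visual information processing based on the measured visual sensitivity of individual attention heads. Through extensive experiments on the MSCOCO dataset, we show that our approach reduces hallucination rates by up to 62.3% compared to baseline models while maintaining comparable task performance. Our analysis shows that selectively modulating tokens across attention heads with different levels of visual sensitivity can significantly improve visual grounding without retraining the model.|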
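|**2025-01-20**|[Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks](http://arxiv.org/abs/2501.11733)|null|Smartphones have become indispensable in modern life, yet completing complex tasks on mobile devices is often frustrating. Recent advances in mobile agents based on large multimodal models (LMMs) have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short of addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experience. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework consists of a Manager, responsible for devising overall plans by breaking complex tasks down into subgoals, and four subordinate agents (Perceptor, Operator, Action Reflector, and Notetaker), which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module that maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to interact effectively with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored to specific subroutines. The inclusion of Tips and Shortcuts enables continuous improvement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks that require long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent|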
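|**2025-01-20**|[SimLabel: Consistency-Guided OOD Detection with Pretrained Vision-Language Models](http://arxiv.org/abs/2501.11485)|**[link](https://github.com/shuzou-1/simlabel)**|Detecting out-of-distribution (OOD) data is crucial in real-world machine learning applications, particularly in safety-critical domains. Existing methods often use language information from vision-language models (VLMs) to enhance OOD detection by improving confidence estimation through rich class-wise text information. However, when building an OOD detection score on in-distribution (ID) text-image affinity, existing works focus either on each ID class or on the whole ID label set, overlooking the inherent connections between ID classes. We find that the semantic information shared across different ID classes is beneficial for effective OOD detection. We therefore investigate the image-text comprehension ability of VLMs across different semantically related ID labels and propose a novel post-hoc strategy called SimLabel. SimLabel enhances the separability between ID and OOD samples by establishing a more robust image-class similarity metric that considers consistency over a set of similar class labels. Extensive experiments demonstrate the superior performance of SimLabel on various zero-shot OOD detection benchmarks. The proposed model is also extended to various VLM backbones, demonstrating good generalization ability. Our demonstration and implementation code are available at: https://github.com/ShuZou-1/SimLabel.|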
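|**2025-01-20**|[ITCFN: Incomplete Triple-Modal Co-Attention Fusion Network for Mild Cognitive Impairment Conversion Prediction](http://arxiv.org/abs/2501.11276)|**[link](https://github.com/justinhxy/itfc)**|Alzheimer's disease (AD) is a common neurodegenerative disease among the elderly. Early prediction and timely intervention at its prodromal stage, mild cognitive impairment (MCI), can decrease the risk of progression to AD. Combining information from various modalities can significantly improve predictive accuracy. However, challenges such as missing data and heterogeneity across modalities complicate multimodal learning methods, as adding more modalities can worsen these issues. Current multimodal fusion techniques often fail to adapt to the complexity of medical data, which hinders the ability to identify relationships between modalities. To address these challenges, we propose an innovative multimodal approach for predicting MCI conversion, focusing specifically on the problems of missing positron emission tomography (PET) data and integrating diverse medical information. To this end, we propose an incomplete triple-modal MCI conversion prediction network. Through a missing modal generation module, we synthesize the missing PET data from magnetic resonance imaging and extract features using specifically designed encoders. We also develop a channel aggregation module and a triple-modal co-attention fusion module to reduce feature redundancy and achieve effective multimodal data fusion. Furthermore, we design a loss function to handle the missing-modality problem and align cross-modal features. Together, these components harness multimodal data to boost network performance. Experimental results on the ADNI1 and ADNI2 datasets show that our method significantly outperforms existing unimodal and other multimodal models. Our code is available at https://github.com/justinhxy/ITFC.|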
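|**2025-01-19**|[Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding](http://arxiv.org/abs/2501.10967)|**[link](https://github.com/sakuratroychen/pype)**|Vision-language models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions continues to hinder comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights, allowing multi-granularity perception of visual elements, and countering over-reliance on anchor tokens. Extensive experimental evaluations show that PyPE consistently improves the general capabilities of VLMs across various sizes. Code is available at https://github.com/SakuraTroyChen/PyPE.|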
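|**2025-01-17**|[HiMix: Reducing Computational Complexity in Large Vision-Language Models](http://arxiv.org/abs/2501.10318)|null|Benefiting from recent advances in large language models and modality alignment techniques, existing large vision-language models (LVLMs) have achieved strong performance across a wide range of scenarios. However, excessive computational complexity limits the widespread use of these models in practical applications. We argue that one main bottleneck in computational complexity is caused by the involvement of redundant vision sequences in model computation. This view is motivated by a reassessment of the efficiency of vision and language information transmission in the language decoder of LVLMs. We therefore propose a novel hierarchical vision-language interaction mechanism called Hierarchical Vision injection for Mixture Attention (HiMix). In HiMix, only the language sequence undergoes full forward propagation, while the vision sequence interacts with the language at specific stages within each language decoder layer. Notably, our approach significantly reduces computational complexity with minimal performance loss. Specifically, HiMix achieves a 10x reduction in the computational cost of the language decoder across multiple LVLM models while maintaining comparable performance. This highlights the advantages of our method, and we hope our research brings new perspectives to the field of vision-language understanding. Project page: https://xuange923.github.io/HiMix|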
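|**2025-01-17**|[FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization](http://arxiv.org/abs/2501.10067)|**[link](https://github.com/casia-iva-lab/filo)**|Anomaly detection methods typically require a large number of normal samples from the target class for training, which limits their applicability in scenarios that demand rapid adaptation, such as cold start. Zero-shot and few-shot anomaly detection do not require labeled samples from the target class in advance, making them a promising research direction. Existing zero-shot and few-shot approaches often use powerful multimodal models to detect and localize anomalies by comparing image-text similarity. However, their handcrafted generic descriptions fail to capture the diverse range of anomalies that may appear in different objects, and simple patch-level image-text matching often struggles to localize anomalous regions of varying shapes and sizes. To address these issues, this paper proposes the FiLo++ method, which consists of two key components. The first component, Fused Fine-Grained Descriptions (FusDes), uses large language models to generate anomaly descriptions for each object category, combines both fixed and learnable prompt templates, and applies a runtime prompt filtering method, producing more accurate and task-specific textual descriptions. The second component, Deformable Localization (DefLoc), integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module, enabling accurate localization of anomalies of various shapes and sizes. In addition, we design a position-enhanced patch matching approach to improve few-shot anomaly detection performance. Experiments on multiple datasets show that FiLo++ achieves significant performance improvements over existing methods. Code will be available at https://github.com/CASIA-IVA-Lab/FiLo.|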
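|**2025-01-17**|[Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions](http://arxiv.org/abs/2501.10011)|null|Current popular large vision-language models (LVLMs) suffer from Hallucinations on Object Attributes (HoOA), leading to incorrect judgments of fine-grained attributes in the input images. Leveraging significant advances in 3D generation from a single image, this paper proposes a novel method to mitigate HoOA in LVLMs. The method uses multiview images sampled from generated 3D representations as visual prompts for LVLMs, thereby providing more visual information from other viewpoints. Furthermore, we observe that the input order of multiple multiview images significantly affects the performance of LVLMs. Consequently, we design the Multiview Image Augmented VLM (MIAVLM), which incorporates a Multiview Attributes Perceiver (MAP) submodule capable of simultaneously eliminating the influence of input image order and aligning visual information from multiview images with large language models (LLMs). In addition, we design and employ negative instructions to mitigate LVLMs' bias towards "Yes" responses. Comprehensive experiments demonstrate the effectiveness of our method.|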
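|**2025-01-16**|[Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key](http://arxiv.org/abs/2501.09695)|**[link](https://github.com/zhyang2226/opa-dpo)**|Hallucination remains a major challenge for large vision-language models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It learns directly from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, the different data construction methods used in existing works lead to notable performance variations. We identify a crucial factor here: outcomes depend largely on whether the constructed data is on-policy with respect to the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the KL divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws of existing algorithms that use DPO to address hallucination. To alleviate these problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely uses expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data, OPA-DPO achieves a further reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared with the previous SOTA algorithm trained with 16k samples.|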
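|**2025-01-16**|[Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness](http://arxiv.org/abs/2501.09446)|null|This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance this robustness. Unlike previous approaches that apply lightweight adversarial fine-tuning to a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, ΔCLIP and Δ²LLaVA, show substantially enhanced zero-shot robustness and set a new state of the art in adversarial defense for vision-language models. For example, the adversarial robustness of ΔCLIP surpasses that of the previous best models on ImageNet-1k by about 20%. Compared to prior art, Δ²LLaVA improves robustness by about 30% on image captioning and about 20% on visual question answering. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is https://doublevisualdefense.github.io/.|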
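|**2025-01-16**|[Vision-Language Models Do Not Understand Negation](http://arxiv.org/abs/2501.09425)|null|Many practical vision-language applications require models that understand negation, for example when using natural language to retrieve images that contain certain objects but not others. Despite advances in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains under-explored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79K examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: retrieval with negation and multiple choice questions with negated captions. Our evaluation shows that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach in which CLIP models are fine-tuned on large-scale synthetic datasets containing millions of negated captions. We show that this approach can yield a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple choice questions with negated captions.|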
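|**2025-01-16**|[Efficient Few-Shot Medical Image Analysis via Hierarchical Contrastive Vision-Language Learning](http://arxiv.org/abs/2501.09294)|null|Few-shot learning in medical image classification is a significant challenge because of the limited availability of annotated data and the complexity of medical imagery. In this work, we propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA), a novel framework that uses large vision-language models (LVLMs) for medical image analysis. HiCA introduces a two-stage fine-tuning strategy that combines domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels. We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound, achieving state-of-the-art performance in both few-shot and zero-shot settings. Further analyses demonstrate the robustness, generalizability, and interpretability of our method, with substantial improvements in performance over existing baselines. Our work highlights the potential of hierarchical contrastive strategies for adapting LVLMs to the unique challenges of medical imaging tasks.|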
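|**2025-01-16**|[Are Open-Vocabulary Models Ready for Detection of MEP Elements on Construction Sites](http://arxiv.org/abs/2501.09267)|null|The construction industry has long explored robotics and computer vision, yet their deployment on construction sites remains very limited. These technologies have the potential to revolutionize traditional workflows by improving the accuracy, efficiency, and safety of construction management. Ground robots equipped with advanced vision systems could automate tasks such as monitoring mechanical, electrical, and plumbing (MEP) systems. This study evaluates the applicability of open-vocabulary vision-language models compared with fine-tuned, lightweight, closed-set object detectors for detecting MEP components using a mobile ground robotic platform. A dataset collected with cameras mounted on a ground robot was manually annotated and analyzed to compare model performance. The results show that, despite the versatility of vision-language models, fine-tuned lightweight models still largely outperform them in specialized environments and on domain-specific tasks.|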
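|**2025-01-15**|[Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks](http://arxiv.org/abs/2501.09052)|null|Single image defocus deblurring (SIDD) aims to restore an all-in-focus image from a defocused one. Distribution shifts in defocused images generally lead to performance degradation of existing methods during out-of-distribution inference. In this work, we identify the intrinsic reason behind the performance degradation, namely the heterogeneity of lens-specific point spread functions. Empirical evidence supports this finding and motivates us to employ a continual test-time adaptation (CTTA) paradigm for SIDD. However, traditional CTTA methods, which rely primarily on entropy minimization, cannot sufficiently explore task-dependent information for pixel-level regression tasks like SIDD. To address this issue, we propose a novel Siamese networks-based continual test-time adaptation framework, which adapts source models to continuously changing target domains using only unlabeled target data in an online manner. To further mitigate semantically erroneous textures introduced by source SIDD models under severe degradation, we revisit the learning paradigm through a structural causal model and propose Causal Siamese networks (CauSiam). Our method uses large-scale pre-trained vision-language models to derive discriminative universal semantic priors and integrates these priors into the Siamese networks, ensuring causal identifiability between blurry inputs and restored images. Extensive experiments show that CauSiam effectively improves the generalization performance of existing SIDD methods in continuously changing domains.|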
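|**2025-01-15**|[IDEA: Image Description Enhanced CLIP-Adapter](http://arxiv.org/abs/2501.08816)|**[link](https://github.com/fourierai/idea)**|CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (for example, zero-shot or few-shot classification) is a hot topic in multimodal learning. However, current studies focus primarily on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by using both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and its performance is comparable to or even exceeds that of state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (a projector and a learnable latent space), further improving the model's performance and achieving state-of-the-art results on 11 datasets. As one important contribution, we use the Llama model and design a comprehensive pipeline to generate textual descriptions for the images of the 11 datasets, producing a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at https://github.com/FourierAI/IDEA.|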
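|**2025-01-15**|[Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning](http://arxiv.org/abs/2501.08597)|null|Large vision-language models (LVLMs) have demonstrated impressive capabilities in multimodal tasks, but their performance is often constrained by a lack of external knowledge integration, which limits their ability to handle knowledge-intensive tasks such as visual question answering and reasoning. To address this challenge, we propose a novel method, Adaptive Knowledge-Guided Pretraining for Large Vision-Language Models (AKGP-LVLM), which dynamically incorporates structured and unstructured knowledge into LVLMs during pretraining and fine-tuning. Our approach uses a knowledge encoder to represent external knowledge, a retrieval mechanism to select task-relevant information, and a dynamic adaptor to align multimodal and knowledge representations effectively. We evaluate our method on four benchmark datasets and demonstrate significant performance improvements over state-of-the-art models. Furthermore, human evaluations highlight the superior correctness and relevance of our model's outputs. Extensive analyses confirm the robustness, efficiency, and scalability of AKGP-LVLM, making it a compelling solution for real-world knowledge-intensive tasks.|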
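|**2025-01-14**|[MiniMax-01: Scaling Foundation Models with Lightning Attention](http://arxiv.org/abs/2501.08313)|null|We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolates to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models such as GPT-4o and Claude-3.5-Sonnet while offering a 20 to 32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.|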
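|**2025-01-14**|[Benchmarking Multimodal Models for Fine-Grained Image Analysis: A Comparative Study Across Diverse Visual Features](http://arxiv.org/abs/2501.08170)|null|This article introduces a benchmark designed to evaluate the capabilities of multimodal models in analyzing and interpreting images. The benchmark focuses on seven key visual aspects: main object, additional objects, background, detail, dominant colors, style, and viewpoint. A dataset of 14,580 images, generated from diverse text prompts, is used to assess the performance of seven leading multimodal models. These models are evaluated on their ability to accurately identify and describe each visual aspect, providing insight into their strengths and weaknesses for comprehensive image understanding. The findings of this benchmark have significant implications for the development and selection of multimodal models for various image analysis tasks.|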
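|**2025-01-14**|[Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding](http://arxiv.org/abs/2501.07888)|**[link](https://github.com/bytedance/tarsier)**|We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed to generate detailed and accurate video descriptions while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advances through three key upgrades: (1) scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) performing fine-grained temporal alignment during supervised fine-tuning; and (3) using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and by 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% advantage over GPT-4o and a +24.9% advantage over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question answering, video grounding, hallucination tests, and embodied question answering, demonstrating its versatility as a robust generalist vision-language model.|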
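|**2025-01-14**|[Visual Language Models as Operator Agents in the Space Domain](http://arxiv.org/abs/2501.07802)|null|This paper explores the application of vision-language models (VLMs) as operator agents in the space domain, focusing on both software and hardware operational paradigms. Building on advances in large language models (LLMs) and their multimodal extensions, we investigate how VLMs can enhance autonomous control and decision-making in space missions. In the software context, we employ VLMs within the Kerbal Space Program Differential Games (KSPDG) simulation environment, enabling the agent to interpret visual screenshots of the graphical user interface to perform complex orbital maneuvers. In the hardware context, we integrate VLMs with robotic systems equipped with cameras to inspect and diagnose physical space objects, such as satellites. Our results show that VLMs can effectively process visual and textual data to generate contextually appropriate actions, competing with traditional methods and non-multimodal LLMs in simulation tasks and showing promise in real-world applications.|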
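|**2025-01-14**|[BMIP: Bi-directional Modality Interaction Prompt Learning for VLM](http://arxiv.org/abs/2501.07769)|null|Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for its ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies focus mainly on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects that result from the interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called Bi-directional Modality Interaction Prompt (BMIP), which dynamically weights bi-modal information by learning the information of the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization, which complements the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets show that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance gains.|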
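|**2025-01-13**|[SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing](http://arxiv.org/abs/2501.07554)|**[link](https://github.com/custommetrics-sst/sst_customevaluationmetrics)**|Video editing models have advanced significantly, but evaluating their performance remains challenging. Traditional metrics, such as CLIP text and image scores, often fall short: text scores are limited by inadequate training data and hierarchical dependencies, while image scores fail to assess temporal consistency. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that uses modern vision-language models (VLMs), object detection, and temporal consistency checks. SST-EM comprises four components: (1) semantic extraction from frames using a VLM, (2) primary object tracking with object detection, (3) focused object refinement via an LLM agent, and (4) temporal consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression analysis. The name SST-EM reflects its focus on the semantic, spatial, and temporal aspects of video evaluation. SST-EM provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing. The source code is available in the GitHub repository.|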
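|**2025-01-13**|[Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models](http://arxiv.org/abs/2501.07396)|null|Automatic target recognition (ATR) plays a critical role in tasks such as navigation and surveillance, where safety and accuracy are paramount. In extreme use cases, such as military applications, these factors are often challenged by unknown terrain, environmental conditions, and novel target categories. Current object detectors, including open-world detectors, lack the ability to confidently recognize novel targets or operate in unknown environments, as they have not been exposed to these new conditions. However, large vision-language models (LVLMs) exhibit emergent properties that enable them to recognize targets under varying conditions in a zero-shot manner. Despite this, LVLMs struggle to localize targets effectively within a scene. To address these limitations, we propose a novel pipeline that combines the detection capabilities of open-world detectors with the recognition confidence of LVLMs, creating a robust system for zero-shot ATR of novel classes and unknown domains. In this study, we compare the performance of various LVLMs for recognizing military vehicles, which are often under-represented in training datasets. Additionally, we examine the impact of factors such as distance range, modality, and prompting methods on recognition performance, providing insights for the development of more reliable ATR systems for novel conditions and classes.|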
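|**2025-01-13**|[GestLLM: Advanced Hand Gesture Interpretation via Large Language Models for Human-Robot Interaction](http://arxiv.org/abs/2501.07295)|null|This paper introduces GestLLM, an advanced system for human-robot interaction that enables intuitive robot control through hand gestures. Unlike conventional systems, which rely on a limited set of predefined gestures, GestLLM uses large language models and feature extraction via MediaPipe to interpret a diverse range of gestures. This integration addresses key limitations of existing systems, such as restricted gesture flexibility and the inability to recognize complex or unconventional gestures commonly used in human communication. By combining state-of-the-art feature extraction and language model capabilities, GestLLM achieves performance comparable to leading vision-language models while supporting gestures under-represented in traditional datasets. This includes, for example, gestures from popular culture, such as the "Vulcan salute" from Star Trek, without any additional pretraining, prompt engineering, and so on. This flexibility enhances the naturalness and inclusivity of robot control, making interactions more intuitive and user-friendly. GestLLM marks a significant step forward in gesture-based interaction, enabling robots to understand and respond effectively to a wide variety of hand gestures. This paper outlines its design, implementation, and evaluation, demonstrating its potential applications in advanced human-robot collaboration, assistive robotics, and interactive entertainment.|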
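|**2025-01-13**|[Exploring the Use of Contrastive Language-Image Pre-Training for Human Posture Classification: Insights from Yoga Pose Analysis](http://arxiv.org/abs/2501.07221)|null|Accurate human posture classification in images and videos is crucial for automated applications across various fields, including work safety, physical rehabilitation, sports training, and daily assisted living. Recently, multimodal learning methods, such as Contrastive Language-Image Pretraining (CLIP), have advanced significantly in jointly understanding images and text. This study aims to assess the effectiveness of CLIP in classifying human postures, focusing on its application in yoga. Despite the initial limitations of the zero-shot approach, applying transfer learning on 15,301 images (real and synthetic) covering 82 classes has shown promising results. The article describes the full fine-tuning procedure, including the choice of image description syntax, models, and hyperparameter adjustment. The fine-tuned CLIP model, tested on 3826 images, achieves an accuracy of over 85%, surpassing the current state of the art of previous works on the same dataset by approximately 6%, while its training time is 3.5 times lower than that needed to fine-tune a YOLOv8-based model. For more application-oriented scenarios, with smaller datasets of six postures each, containing 1301 and 401 training images, the fine-tuned models attain accuracies of 98.8% and 99.1%, respectively. Furthermore, our experiments indicate that training with as few as 20 images per pose can yield around 90% accuracy on a six-class dataset. This study shows that this multimodal technique can be used effectively for yoga pose classification, and possibly for human posture classification in general. Additionally, the CLIP inference time (around 7 ms) supports integrating the model into automated systems for posture evaluation, for example to develop a real-time personal yoga assistant for performance assessment.|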
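|**2025-01-13**|[TimeLogic: A Temporal Logic Benchmark for Video QA](http://arxiv.org/abs/2501.07214)|null|Temporal logical understanding, a core facet of human cognition, plays a pivotal role in capturing complex sequential events and their temporal relationships within videos. This capability is particularly crucial in tasks such as video question answering (VideoQA), where the goal is to process visual data that changes over time together with textual data in order to provide coherent answers. However, because annotating temporal logic is challenging, current VideoQA benchmarks devote little attention to evaluating this critical skill. Despite advances in vision-language models, assessing their temporal logical reasoning remains a challenge, primarily because of the lack of QA pairs that demand formal, complex temporal reasoning. To bridge this gap, we introduce the TimeLogic QA (TLQA) framework to automatically generate QA pairs specifically designed to evaluate temporal logical understanding. To this end, TLQA uses the temporal annotations of existing video datasets together with temporal operators derived from logic theory to construct questions that test understanding of event sequences and their temporal relationships. The TLQA framework is generic and scalable: it can use existing video action datasets with temporal action segmentation annotations, or video datasets with temporal scene graph annotations, to automatically generate temporal logical questions. We use 4 datasets, STAR, Breakfast, AGQA, and CrossTask, and generate two VideoQA dataset variants, small (TLQA-S) and large (TLQA-L), containing 2k and 10k QA pairs per category, resulting in 32k and 160k total pairs per dataset. We undertake a comprehensive evaluation of leading VideoQA models, using TLQA to benchmark their temporal logical understanding. We assess the temporal reasoning performance of VideoQA models on 16 categories of temporal logic with varying temporal complexity.|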
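|**2025-01-10**|[Generate, Transduct, Adapt: Iterative Transduction with VLMs](http://arxiv.org/abs/2501.06031)|null|Transductive zero-shot learning with vision-language models exploits image-image similarities within a dataset to achieve better classification accuracy than the inductive setting. However, little work has explored the structure of the language space in this context. We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. Our approach is iterative and consists of three steps: (i) incrementally exploring the attribute space by querying language models, (ii) an attribute-augmented transductive inference procedure, and (iii) fine-tuning the language and vision encoders based on inferred labels within the dataset. Through experiments with CLIP encoders, we show that in the zero-shot setting GTA-CLIP yields average performance improvements of 8.6% over CLIP and 3.7% over transductive CLIP across 12 datasets and 3 encoders. We also observe similar improvements in the few-shot setting. We present ablation studies that demonstrate the value of each step and visualize how the vision and language spaces evolve over iterations driven by transductive learning.|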
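|**2025-01-10**|[Scalable Vision Language Model Training via High Quality Data Curation](http://arxiv.org/abs/2501.05952)|null|In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning via High QuaLity Data Curation), an open-source vision-language model (VLM) with 2B parameters and state-of-the-art performance. We introduce three key improvements that contribute to SAIL-VL's leading performance: (1) Scalable construction of high-quality visual understanding data: we implement a visual understanding data construction pipeline that supports annotation of high-quality recaption data at the hundred-million scale. With this pipeline, we build SAIL-Caption, a large-scale caption dataset with greater quantity and the highest data quality compared with open-source caption datasets. (2) Scalable pretraining with high-quality visual understanding data: we scale SAIL-VL's pretraining budget up to 131B tokens and show that even a 2B VLM benefits from larger training data, exhibiting the expected data-size scaling laws in visual understanding and instruction-following performance. (3) Scalable SFT via quantity and quality scaling: we introduce general guidance for instruction data curation to scale up instruction data continuously, allowing us to construct a large SFT dataset of the highest quality. To further improve SAIL-VL's performance, we propose quality scaling, a multi-stage training recipe with curriculum learning, to improve the scaling curve of model performance with respect to data size from logarithmic to near-linear. SAIL-VL obtains the highest average score across the 19 commonly used benchmarks in our evaluation and achieves top performance among VLMs of comparable size on OpenCompass (https://rank.opencompass.org.cn/leaderboard-multimodal). We release our SAIL-VL-2B model on HuggingFace (https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B).|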
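|**2025-01-10**|[Valley2: Exploring Multimodal Models with Scalable Vision-Language Design](http://arxiv.org/abs/2501.05901)|**[link](https://github.com/bytedance/valley)**|Vision-language models have made remarkable progress in recent years, demonstrating strong capabilities in tasks such as image captioning and video understanding. We introduce Valley2, a novel multimodal large language model designed to enhance performance across all domains and to extend the boundaries of practical applications in e-commerce and short-video scenarios. Notably, Valley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks, surpassing open-source models of similar size by a large margin (79.66 vs. 72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an impressive average score of 67.4. The code and model weights are open-sourced at https://github.com/bytedance/Valley.|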
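|**2025-01-10**|[Super-class guided Transformer for Zero-Shot Attribute Classification](http://arxiv.org/abs/2501.05728)|**[link](https://github.com/mlvlab/SugaFormer)**|Attribute classification is crucial for identifying specific characteristics within image regions. Vision-language models (VLMs) have been effective in zero-shot tasks by leveraging the general knowledge they acquire from large-scale datasets. Recent studies show that transformer-based models with class-wise queries can effectively address zero-shot multi-label classification. However, poor use of the relationship between seen and unseen attributes leaves these models lacking in generalizability. In addition, attribute classification generally involves many attributes, which makes it difficult to maintain model scalability. To address these issues, we propose the Super-class guided transFormer (SugaFormer), a novel framework that uses super-classes to enhance scalability and generalizability for zero-shot attribute classification. SugaFormer employs Super-class Query Initialization (SQI) to reduce the number of queries, using common semantic information from super-classes, and incorporates Multi-context Decoding (MD) to handle diverse visual cues. To strengthen generalizability, we introduce two knowledge transfer strategies that use VLMs. During training, Super-class guided Consistency Regularization (SCR) aligns SugaFormer's features with VLMs using region-specific prompts, and during inference, Zero-shot Retrieval-based Score Enhancement (ZRSE) refines predictions for unseen attributes. Extensive experiments show that SugaFormer achieves state-of-the-art performance on three widely used attribute classification benchmarks under zero-shot and cross-dataset transfer settings. Our code is available at https://github.com/mlvlab/SugaFormer.|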
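|**2025-01-09**|[Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation](http://arxiv.org/abs/2501.05413)|null|Training audio-to-image generative models requires an abundance of diverse, semantically aligned audio-visual pairs. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence inherent to them. In this work, we hypothesize that insisting on the absolute need for ground-truth audio-visual correspondence is not only unnecessary but also leads to severe restrictions on the scale, quality, and diversity of the data, ultimately impairing its use in modern generative models. That is, we propose a scalable image sonification framework in which instances from a variety of high-quality yet disjoint uni-modal sources can be artificially paired through a retrieval process powered by the reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against the state of the art. Finally, through a series of ablation studies, we show several intriguing auditory capabilities that our model implicitly develops to guide the image generation process, such as semantic mixing and interpolation, loudness calibration, and acoustic space modeling through reverberation.|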
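|**2025-01-09**|[Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection](http://arxiv.org/abs/2501.05228)|null|Out-of-distribution (OOD) detection has seen significant advances with zero-shot approaches that leverage powerful vision-language models (VLMs) such as CLIP. However, in our preliminary study we observe that prior work has focused predominantly on improving far-OOD performance, potentially at the expense of near-OOD efficacy. To address this issue, we propose a novel strategy that uses large language models (LLMs) and VLMs to enhance zero-shot OOD detection performance in both far-OOD and near-OOD scenarios. Our approach first uses an LLM to generate super-classes of the ID labels and their corresponding background descriptions, and then extracts features with CLIP. We then isolate the core semantic features of the ID data by subtracting the background features from the super-class features. This refined representation facilitates the selection of more appropriate negative labels for OOD data from a comprehensive candidate label set of WordNet, thereby improving zero-shot OOD detection performance in both scenarios. Furthermore, we introduce novel few-shot prompt tuning and visual prompt tuning to adapt the proposed framework so that it aligns better with the target distribution. Experimental results show that the proposed approach consistently outperforms current state-of-the-art methods across multiple benchmarks, with an improvement of up to 2.9% in AUROC and a reduction of up to 12.6% in FPR95. Additionally, our method exhibits superior robustness against covariate shift across different domains, further highlighting its effectiveness in real-world scenarios.|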
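|**2025-01-09**|[Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model](http://arxiv.org/abs/2501.05122)|null|Most large vision-language models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but they do so in a largely ad hoc manner, lacking insight into how different training mixes affect different groups of languages. In this work, we present a comprehensive investigation into training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining (1) the number of training languages that can be included without degrading English performance and the optimal language distributions of (2) pre-training and (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.|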
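|**2025-01-08**|[Re-ranking the Context for Multimodal Retrieval Augmented Generation](http://arxiv.org/abs/2501.04695)|null|Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge to generate responses within a context that are more accurate and less prone to hallucination. However, multimodal RAG systems face unique challenges: (i) the retrieval process may select entries (for example, images or documents) that are irrelevant to the user query, and (ii) vision-language models or multimodal language models such as GPT-4o may hallucinate when processing these entries to generate the RAG output. In this paper, we aim to address the first challenge, that is, improving how relevant context is selected from the knowledge base in the retrieval phase of multimodal RAG. Specifically, we use the relevancy score (RS) measure, designed in our previous work for evaluating RAG performance, to select more relevant entries during retrieval. Retrieval based on embeddings, such as CLIP-based embeddings, and cosine similarity usually performs poorly, particularly for multimodal data. We show that a more advanced relevancy measure can enhance the retrieval process by selecting more relevant pieces from the knowledge base and by eliminating irrelevant pieces from the context through adaptively selecting up to k entries instead of a fixed number of entries. Our evaluation using the COCO dataset demonstrates significant improvement in selecting relevant context and in the accuracy of the generated response.|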
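|**2025-01-08**|[DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests](http://arxiv.org/abs/2501.04671)|null|Large vision-language models (LVLMs) augment language models with visual understanding, enabling multimodal reasoning. However, because of the modality gap between textual and visual data, they often face significant challenges such as over-reliance on text priors, hallucinations, and limited capacity for complex visual reasoning. Existing benchmarks for evaluating visual reasoning in LVLMs often rely on schematic or synthetic images and on imprecise machine-generated explanations. To bridge the modality gap, we present DrivingVQA, a new benchmark derived from driving theory tests for evaluating visual chain-of-thought reasoning in complex real-world scenarios. It provides 3,931 expert-crafted multiple-choice questions and interleaved explanations grounded in entities relevant to the reasoning process. We use this dataset to conduct an extensive study of LVLMs' ability to reason about complex visual scenes. Our experiments show that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning in zero-shot settings. We investigate training strategies that exploit relevant entities to improve visual reasoning. Notably, we observe a performance gain of up to 7% when reasoning over image tokens of cropped regions related to these entities.|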
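|**2025-01-08**|[A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI](http://arxiv.org/abs/2501.04641)|**[link](https://github.com/willcai7/multimodal-ghm)**|Multimodal generative AI systems, such as those combining vision and language, rely on contrastive pre-training to learn representations across modalities. While their practical benefits are widely recognized, a rigorous theoretical understanding of the contrastive pre-training framework remains limited. This paper develops a theoretical framework to explain the success of contrastive pre-training in downstream tasks such as zero-shot classification, conditional diffusion models, and vision-language models. We introduce the concept of approximate sufficient statistics, a generalization of classical sufficient statistics, and show that approximate minimizers of the contrastive pre-training loss are approximately sufficient, making them adaptable to a variety of downstream tasks. We further propose a joint generative hierarchical model for the joint distribution of images and text, and show that transformers can efficiently approximate the relevant functions in this model via belief propagation. Building on this framework, we derive sample complexity guarantees for multimodal learning based on contrastively pre-trained representations. Numerical simulations validate these theoretical findings and demonstrate the strong generalization performance of contrastively pre-trained transformers on various multimodal tasks.|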
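|**2025-01-08**|[Supervision-free Vision-Language Alignment](http://arxiv.org/abs/2501.04568)|null|Vision-language models (VLMs) have shown remarkable potential in integrating visual and linguistic information, but their performance is often limited by the need for large amounts of high-quality image-text training data. Collecting such image-text pairs is both time-consuming and computationally expensive. To address this, we introduce SVP (Supervision-free Visual Projection), a novel framework that enhances vision-language alignment without relying on curated data or preference annotations. SVP uses self-captioning and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach in six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results show significant improvements, including an average 14% gain on captioning tasks, a 12% increase in object recall, and a substantially lower hallucination rate. Notably, a small VLM using SVP achieves hallucination reduction comparable to a model five times larger, while a VLM with poor referring ability more than doubles its performance, approaching that of models twice its size.|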
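|**2025-01-08**|[Online Gaussian Test-Time Adaptation of Vision-Language Models](http://arxiv.org/abs/2501.04352)|**[link](https://github.com/cfuchs2023/oga)**|Online test-time adaptation (OTTA) of vision-language models (VLMs) has recently attracted growing attention; its goal is to use data observed in a stream to improve future predictions. Unfortunately, existing methods rely on dataset-specific hyperparameters, which severely limits their adaptability to unseen tasks. We therefore propose Online Gaussian Adaptation (OGA), a novel method that models the likelihood of visual features with Gaussian distributions and incorporates zero-shot priors into an interpretable maximum a posteriori (MAP) estimation framework, using fixed hyperparameters across all datasets. We show that OGA outperforms state-of-the-art methods on most datasets and runs. We also show that combining OTTA with popular few-shot techniques, a practical but overlooked setting in prior work, is highly beneficial. Furthermore, our experimental study reveals that common OTTA evaluation protocols, which average performance over at most three runs per dataset, are inadequate because all OTTA methods show substantial variability across runs. We therefore advocate more rigorous evaluation practices, including more runs and additional quantitative metrics such as our proposed Expected Tail Accuracy (ETA), computed as the average accuracy over the worst 10% of runs. We hope these contributions encourage the OTTA community to adopt more rigorous and diverse evaluation practices. Code is available at https://github.com/cfuchs2023/OGA.|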
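|**2025-01-08**|[Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation](http://arxiv.org/abs/2501.04268)|null|Zero-shot generalization in robotic manipulation across different robots, tasks, and environments remains a major challenge. Policy code generation methods use executable code to connect high-level task descriptions with low-level action sequences, drawing on the generalization ability of large language models and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model that can perceive visual information and follow free-form instructions to perform robotic manipulation with policy code in a zero-shot manner. To address the inefficiency and high cost of collecting runtime code data for robotic tasks, we design Video2Code, which uses an off-the-shelf vision-language model and a code-domain large language model to synthesize executable code from large amounts of in-the-wild videos. Extensive experiments show that RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulators and real-world environments. Specifically, RoboPro's zero-shot success rate on RLBench exceeds that of the state-of-the-art model GPT-4o by 11.6%, which is even comparable to a strong supervised training baseline. RoboPro is also robust to variations in API formats and skill sets.|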
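|**2025-01-07**|[MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation](http://arxiv.org/abs/2501.04155)|**[link](https://github.com/sjoshi804/mm-gen)**|Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, Llava-1.5 struggles with chart and diagram understanding because of a lack of task-specific training data. Existing training data comes from general-purpose datasets and fails to capture the nuances these tasks require. We introduce MM-Gen, a scalable method that uses stronger models to generate task-specific, high-quality synthetic text for candidate images. MM-Gen follows a three-stage targeted process: partitioning the data into subgroups, generating targeted text based on task descriptions, and filtering out redundant and outlier data. Fine-tuning VLMs on data generated with MM-Gen yields significant gains, including 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5 (7B). Compared with human-curated caption data, MM-Gen improves over the original model by up to 1.6x, demonstrating its effectiveness in boosting task-specific VLM performance and in bridging the gap between general-purpose datasets and specialized needs. Code is available at https://github.com/sjoshi804/MM-Gen.|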
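|**2025-01-07**|[Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives](http://arxiv.org/abs/2501.04003)|**[link](https://github.com/opendrivelab/drivelm)**|Recent advances in vision-language models (VLMs) have sparked interest in their use for autonomous driving, particularly for generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), covering 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings show that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially when visual input is degraded or missing. This behavior, masked by dataset imbalance and inadequate evaluation metrics, poses significant risks in safety-critical scenarios such as autonomous driving. We further observe that VLMs struggle with multimodal reasoning and are highly sensitive to input corruptions, leading to inconsistent performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multimodal understanding. We also highlight the potential of using VLMs' awareness of corruptions to improve their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving. The benchmark toolkit is publicly available.|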
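|**2025-01-07**|[Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos](http://arxiv.org/abs/2501.04001)|**[link](https://github.com/magic-research/Sa2VA)**|This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multimodal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, images, and video in a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 to produce precise masks, enabling grounded multimodal understanding of both static and dynamic visual content. In addition, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV dataset to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art results on multiple tasks, particularly referring video object segmentation, highlighting its potential for complex real-world applications.|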
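|**2025-01-07**|[RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance](http://arxiv.org/abs/2501.03995)|null|Retrieval-augmented generation (RAG) improves large language models (LLMs) by using external knowledge to guide response generation, reducing hallucinations. However, RAG, and multimodal RAG in particular, can introduce new sources of hallucination: (i) the retrieval process may select irrelevant pieces (e.g., documents, images) from the database as raw context, and (ii) retrieved images are processed into text-based context by vision-language models (VLMs) or used directly by multimodal language models (MLLMs) such as GPT-4o, which may hallucinate. To address this, we propose a novel framework that evaluates the reliability of multimodal RAG with two performance measures: (i) the relevancy score (RS), which assesses the relevance of retrieved entries to the query, and (ii) the correctness score (CS), which assesses the accuracy of the generated response. We train the RS and CS models using a ChatGPT-derived database and human evaluator samples. Results show that both models reach about 88% accuracy on test data. In addition, we construct a human-annotated database of 5,000 samples for evaluating the relevance of retrieved pieces and the correctness of response statements. Our RS model aligns with human preferences in retrieval 20% more often than CLIP, and our CS model matches human preferences about 91% of the time. Finally, we use RS and CS to assess the selection and generation performance of various RAG systems.|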
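|**2025-01-07**|[CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds Ratio on High-Resolution Point Clouds](http://arxiv.org/abs/2501.03879)|null|Recent research has shown that large language models (LLMs) are not limited to text-only tasks but can also serve as multimodal models across modalities including audio, images, and video. In particular, research on 3D large multimodal models (3D LMMs) is making notable progress, driven by the potential of processing higher-dimensional data such as point clouds. On closer examination, however, we find that the visual and textual content in each sample of existing training datasets lacks both high informational granularity and clarity, which becomes a bottleneck for precise cross-modal understanding. To address these issues, we propose CL3DOR, Contrastive Learning for 3D large multimodal models via Odds Ratio on high-resolution point clouds, designed to ensure greater specificity and clarity in visual and textual content. Specifically, we increase the density of point clouds per object and construct informative hard negative responses in the training dataset to penalize unwanted responses. To make use of the hard negatives, we incorporate the odds ratio as an auxiliary contrastive-learning term into the conventional language modeling loss. CL3DOR achieves state-of-the-art performance on 3D scene understanding and reasoning benchmarks. We also validate the effectiveness of CL3DOR's key components through extensive experiments.|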
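|**2025-01-07**|[KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration](http://arxiv.org/abs/2501.03786)|null|Zero-shot anomaly detection (ZSAD) identifies anomalies without training samples from the target dataset, which is essential in scenarios with privacy concerns or limited data. Vision-language models such as CLIP show promise for ZSAD but have limitations: relying on hand-crafted fixed textual descriptions or anomaly prompts is time-consuming and prone to semantic ambiguity, and CLIP struggles with pixel-level anomaly segmentation because it focuses on global semantics rather than local details. To address these limitations, we introduce KAnoCLIP, a novel ZSAD framework built on vision-language models. Through Knowledge-Driven Prompt Learning (KnPL), KAnoCLIP combines general knowledge from a large language model (GPT-3.5) with fine-grained, image-specific knowledge from a visual question answering system (Llama3). KnPL uses a knowledge-driven (KD) loss function to create learnable anomaly prompts, removing the need for fixed text prompts and improving generalization. KAnoCLIP includes a CLIP visual encoder with V-V attention (CLIP-VV), bidirectional cross-attention for multi-level cross-modal interaction (Bi-CMCI), and a Conv-Adapter. These components preserve local visual semantics, improve local cross-modal fusion, and align global visual features with textual information, enhancing pixel-level anomaly detection. KAnoCLIP achieves state-of-the-art ZSAD performance across 12 industrial and medical datasets, showing better generalization than existing methods.|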
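|**2025-01-07**|[Realistic Test-Time Adaptation of Vision-Language Models](http://arxiv.org/abs/2501.03729)|**[link](https://github.com/maxzanella/stata)**|The zero-shot capabilities of vision-language models (VLMs) have been widely used to improve predictive performance. However, prior work on transductive or test-time adaptation (TTA) often makes strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios and introduces a more realistic evaluation framework that includes (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. We provide comprehensive evaluations, comparisons, and ablation studies showing how current transductive or TTA methods for VLMs systematically compromise the model's initial zero-shot robustness across various realistic scenarios, favoring performance gains under advantageous assumptions about the test sample distribution. Furthermore, we introduce StatA, a versatile method that can handle a wide range of deployment scenarios, including those with a variable number of effective classes at test time. Our approach incorporates a novel regularization term designed specifically for VLMs, which acts as a statistical anchor that preserves the initial text-encoder knowledge, particularly in low-data regimes. Code is available at https://github.com/MaxZanella/StatA.|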
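|**2025-01-07**|[SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning](http://arxiv.org/abs/2501.03675)|**[link](https://github.com/togethercomputer/smir)**|Vision-language models (VLMs) excel at understanding single images, helped by large amounts of high-quality instruction datasets. However, multi-image reasoning remains underexplored in the open-source community because of two main challenges: (1) scaling datasets with multiple related images and complex reasoning instructions is resource-intensive and quality is hard to maintain, and (2) robust evaluation benchmarks for multi-image tasks are lacking. To address these issues, we introduce SMIR, an efficient synthetic data generation pipeline for multi-image reasoning, together with a high-quality dataset generated with it. Our pipeline uses multimodal embeddings to efficiently extract highly correlated images, combining visual and descriptive information, and uses open-source large language models (LLMs) to generate high-quality instructions. With this pipeline, we generate 160K synthetic training samples, offering a cost-effective alternative to expensive closed-source solutions. We also present SMIR-BENCH, a novel multi-image reasoning evaluation benchmark comprising 200 diverse examples across 7 complex multi-image reasoning tasks. SMIR-BENCH is multi-turn and uses a VLM judge to evaluate free-form responses, providing a comprehensive assessment of model expressiveness and reasoning ability across modalities. We demonstrate the effectiveness of the SMIR dataset by fine-tuning several open-source VLMs and evaluating them on SMIR-BENCH. Our results show that models trained on our dataset outperform baseline models on multi-image reasoning tasks by up to 8%, with a more scalable data pipeline.|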
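|**2025-01-06**|[OpenLKA: an open dataset of lane keeping assist from market autonomous vehicles](http://arxiv.org/abs/2501.03287)|null|Lane keeping assist (LKA) has become a standard feature in recent car models. Although marketed as providing automatic steering, the system's operational characteristics and safety performance remain underexplored because of a lack of real-world testing and comprehensive data. To fill this gap, we extensively tested mainstream LKA systems from leading U.S. automakers in Tampa, Florida. Using an innovative method, we collected a comprehensive dataset that includes full Controller Area Network (CAN) messages with LKA attributes, as well as video, perception, and lateral trajectory data from a high-quality front-facing camera equipped with advanced vision detection and trajectory planning algorithms. Our tests covered a variety of challenging conditions, including complex road geometry, adverse weather, degraded lane markings, and combinations of these. A vision-language model (VLM) further annotated the videos to capture weather, lighting, and traffic features. Based on this dataset, we present an empirical overview of LKA's operational characteristics and safety performance. Key findings indicate that: (i) LKA is vulnerable to faint markings and low pavement contrast; (ii) it struggles in lane transitions (merges, diverges, intersections), often causing unintended departures or disengagements; (iii) steering torque limits lead to frequent deviations on sharp turns, posing safety risks; and (iv) LKA systems consistently hold a rigid lane-centered position, lacking adaptability on tight curves or near large vehicles such as trucks. Finally, we demonstrate how the dataset can guide infrastructure planning and self-driving technology. Given LKA's limitations, we recommend improvements in road geometry and pavement maintenance. We also illustrate how the dataset can support the development of human-like LKA systems through VLM fine-tuning and chain-of-thought reasoning.|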
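|**2025-01-06**|[MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models](http://arxiv.org/abs/2501.02955)|null|In recent years, vision-language models (VLMs) have made notable progress in video understanding. However, a crucial capability, fine-grained motion comprehension, remains underexplored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring broad coverage of real-world video content. Experimental results show that existing VLMs perform poorly at understanding fine-grained motion. To improve VLMs' ability to perceive fine-grained motion within a limited LLM sequence length, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame-rate inputs and TE Fusion improve motion understanding, yet substantial room for improvement remains. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io.|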
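|**2025-01-05**|[Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?](http://arxiv.org/abs/2501.02669)|**[link](https://github.com/princeton-pli/vlm_s2h)**|While vision-language models (VLMs) perform well on tasks such as visual question answering (VQA) and image captioning, their ability to apply multi-step reasoning to images lags behind, giving rise to perceptions of modality imbalance or brittleness. To study these issues systematically, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning (AVR), comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each task has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We seek training strategies on the SIMPLE version of a task that improve performance on the corresponding HARD task, i.e., S2H generalization. In this synthetic framework, each task also has a text-only version, which allows us to quantify the modality imbalance and how training strategies affect it. Ablations highlight the importance of explicit image-to-text conversion in promoting S2H generalization when using auto-regressive training. We also report the results of a mechanistic study of this phenomenon, including a measure of gradient alignment that appears to identify training strategies that promote better S2H generalization.|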
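|**2025-01-03**|[LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction](http://arxiv.org/abs/2501.01767)|null|Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image's visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for maintaining high quality standards and minimizing costly recalls. Previous anomaly detection (AD) research relies on prior knowledge to design algorithms, which often requires extensive manual annotation, substantial computing power, and large amounts of training data. Autoregressive multimodal vision-language models (AVLMs) offer a promising alternative because they perform well at visual reasoning across many domains. Nevertheless, their application to logical anomaly detection remains unexplored. In this work, we investigate the use of AVLMs for logical anomaly detection and show that they are well suited to the task. By combining AVLMs with format embedding and a logic reasoner, we achieve state-of-the-art performance on the public benchmark MVTec LOCO AD, with an AUROC of 86.0% and an F1-max of 83.7%, along with explanations of the anomalies. This significantly outperforms the existing state-of-the-art methods.|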
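|**2025-01-03**|[MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders](http://arxiv.org/abs/2501.01709)|null|Visual encoders are fundamental components of vision-language models (VLMs), each showing distinct strengths derived from different pre-trained visual foundation models. To exploit the varied capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, which considerably increases computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and mixture-of-experts (MoE) to selectively activate specialized knowledge based on input features, improving both adaptability and efficiency. To regularize the distillation process and improve performance, we propose an attention-based distillation strategy that adaptively weights the different visual encoders and emphasizes valuable visual tokens, reducing the burden of replicating comprehensive but distinct features from multiple teachers. Comprehensive experiments on popular VLMs such as LLaVA and LLaVA-NeXT validate the effectiveness of our method. The code will be released.|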
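|**2025-01-03**|[GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models](http://arxiv.org/abs/2501.01428)|**[link](https://github.com/Qi-Zhangyang/GPT4Scene)**|In recent years, 2D vision-language models (VLMs) have made significant progress on image-text understanding tasks. However, their performance in 3D spatial understanding, which is critical for embodied intelligence, remains limited. Recent advances use 3D point clouds and multi-view images as inputs and achieve promising results. We instead propose exploring a purely vision-based solution inspired by human perception, which relies only on visual cues for 3D spatial understanding. This paper empirically investigates the limitations of VLMs in 3D spatial knowledge and reveals that their main shortcoming is the lack of global-local correspondence between the scene and individual frames. To address this, we introduce GPT4Scene, a novel visual prompting paradigm for VLM training and inference that helps build the global-local relationship and significantly improves 3D spatial understanding of indoor scenes. Specifically, GPT4Scene constructs a 3D bird's-eye-view (BEV) image from the video and marks consistent object IDs across both the frames and the BEV image. The model then takes the concatenated BEV image and video frames with markers as input. In zero-shot evaluations, GPT4Scene outperforms closed-source VLMs such as GPT-4o. In addition, we prepare a processed video dataset with 165K text annotations for fine-tuning open-source VLMs, achieving state-of-the-art performance on all 3D understanding tasks. Surprisingly, after training with the GPT4Scene paradigm, VLMs consistently improve at inference even without visual prompting and the BEV image as explicit correspondence. This suggests that the proposed paradigm helps VLMs develop an intrinsic ability to understand 3D scenes, paving the way for a noninvasive approach to extending pre-trained VLMs for 3D scene understanding.|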
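|**2025-01-02**|[CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering](http://arxiv.org/abs/2501.01371)|null|Recent vision-language models (VLMs) have shown remarkable capabilities in visual understanding and reasoning, particularly in multiple-choice visual question answering (VQA). Still, these models can make distinctly unnatural errors, such as giving (wrong) answers to unanswerable VQA questions, for example questions asking about objects that do not appear in the image. To address this issue, we propose CLIP-UP: CLIP-based Unanswerable Problem detection, a novel lightweight method for equipping VLMs with the ability to withhold answers to unanswerable questions. By using CLIP to extract question-image alignment information, CLIP-UP requires only efficient training of a few additional layers while keeping the original VLM's weights unchanged. Tested across LLaVA models, CLIP-UP achieves state-of-the-art results on the MM-UPD benchmark for assessing unanswerability in multiple-choice VQA, while preserving the original performance on other tasks.|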
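|**2025-01-02**|[CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries](http://arxiv.org/abs/2501.01282)|null|Vision-language models (VLMs) have advanced human-AI interaction but still fall short in cultural understanding, often misinterpreting symbols, gestures, and artifacts because their training data is predominantly Western-centric. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural categories, and 3 question types, with the aim of characterizing and improving VLMs' multicultural understanding. We then propose CultureVLM, a series of VLMs fine-tuned on our dataset that achieve significant performance gains in cultural understanding. Our evaluation of 16 models reveals significant disparities, with stronger performance on Western cultural concepts and weaker results in African and Asian contexts. Fine-tuning on CultureVerse improves cultural perception, showing cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on general VLM benchmarks. We further present insights on cultural generalization and forgetting. We hope this work lays a foundation for more equitable and culturally aware multimodal AI systems.|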
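|**2025-01-02**|[Asymmetric Reinforcing against Multi-modal Representation Bias](http://arxiv.org/abs/2501.01240)|**[link](https://github.com/gao-xiyuan/arm)**|The strength of multimodal learning lies in its ability to integrate information from different sources, providing rich and comprehensive insights. In real-world scenarios, however, multimodal systems often face dynamically changing modality contributions: the dominance of different modalities may shift with the environment, leading to suboptimal multimodal learning performance. Current methods mainly strengthen weak modalities to balance multimodal representation bias, which inevitably optimizes from a partial-modality perspective and easily degrades the performance of dominant modalities. To address this problem, we propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM). ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information. Moreover, we provide an in-depth analysis showing that optimizing certain modalities can cause information loss and prevent the full advantages of multimodal data from being used. By exploring the dominance of modalities and narrowing the contribution gaps between them, we significantly improve multimodal learning performance, making notable progress in mitigating imbalanced multimodal learning.|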
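|**2025-01-02**|[3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer](http://arxiv.org/abs/2501.01163)|null|Current 3D large multimodal models (3D LMMs) have shown great potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and enable flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly capable 3D LMM designed to act as an intelligent assistant for comprehending, reasoning about, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines, such as offline multi-view feature extraction or additional task-specific heads, 3D-LLaVA adopts a minimalist design with an integrated architecture and takes only point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens, (2) a visual prompt encoder that embeds interactive visual prompts into the visual token space, and (3) a referring mask decoder that produces 3D masks based on text descriptions. This versatile OST acquires perception priors through hybrid pre-training and serves as the visual connector that bridges 3D data and the LLM. After unified instruction tuning, 3D-LLaVA reports impressive results on various benchmarks. The code and model will be released to promote future exploration.|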
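|**2025-01-02**|[Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs](http://arxiv.org/abs/2501.01042)|null|Video-based multimodal large language models (V-MLLMs) have been shown to be vulnerable to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models, a common and practical real-world scenario, remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied to V-MLLMs in black-box settings, which we attribute to: (1) a lack of generalization in perturbing video features, (2) a focus only on sparse key frames, and (3) a failure to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we use an image-based multimodal model (IMM) as a surrogate model to craft adversarial video samples. Multimodal interactions and temporal information are integrated to disrupt video representations in the latent space, improving adversarial transferability. In addition, we introduce a perturbation propagation technique to handle different unknown frame sampling strategies. Experimental results show that our method generates adversarial examples with strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared with white-box attacks on these models, our black-box attack (using BLIP-2 as the surrogate model) achieves competitive performance, with average attack success rates of 55.48% on MSVD-QA and 58.26% on MSRVTT-QA for VideoQA tasks. Our code will be released upon acceptance.|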
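|**2025-01-02**|[Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models](http://arxiv.org/abs/2501.01034)|**[link](https://github.com/audiollms/singlish)**|Singlish, a creole language rooted in English, is an important focus of linguistic research in multilingual and multicultural contexts. However, its spoken form remains understudied, limiting insight into its linguistic structure and applications. To address this gap, we standardize and annotate the largest spoken Singlish corpus and name it the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including automatic speech recognition (ASR), spoken question answering (SQA), spoken dialogue summarization (SDS), and paralinguistic question answering (PQA). We release standardized data splits and a human-verified test set to facilitate further research. In addition, we propose SingAudioLLM, a multi-task multimodal model that uses multimodal large language models to handle these tasks concurrently. Experiments show that our model adapts to the Singlish context, achieving state-of-the-art performance and outperforming other AudioLLMs and cascaded solutions by 10-30%.|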
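|**2025-01-03**|[2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining](http://arxiv.org/abs/2501.00958)|**[link](https://github.com/damo-nlp-sg/multimodal_textbook)**|Compared with image-text pair data, interleaved corpora enable vision-language models (VLMs) to understand the world more naturally, as humans do. However, existing datasets of this kind are crawled from webpages and suffer from low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast numbers of instructional videos (e.g., online geometry courses) that people widely use to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality **multimodal textbook** corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. We then progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos and organize it as an image-text interleaved corpus in temporal order. Compared with its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its strong pretraining performance, particularly on knowledge- and reasoning-intensive tasks such as ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook show outstanding interleaved context awareness, using visual and textual cues in their few-shot context to solve tasks.|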
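|**2024-12-30**|[Hierarchical Banzhaf Interaction for General Video-Language Representation Learning](http://arxiv.org/abs/2412.20964)|**[link](https://github.com/jpthu17/HBI)**|Multimodal representation learning, and contrastive learning in particular, plays an important role in artificial intelligence. As an important subfield, video-language representation learning focuses on learning representations from the global semantic interaction between pre-defined video-text pairs. However, to enhance and refine such coarse-grained global interactions, more detailed interactions are necessary for fine-grained multimodal learning. In this study, we introduce a new approach that models video and text as game players and uses multivariate cooperative game theory to handle the uncertainty in fine-grained semantic interactions, which have diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to model the fine-grained correspondence between video clips and textual words from a hierarchical perspective. Furthermore, to mitigate bias in the computation of Banzhaf Interaction, we propose reconstructing the representation by fusing single-modal and cross-modal components. This reconstructed representation ensures fine granularity comparable to that of the single-modal representation while preserving the adaptive encoding characteristics of the cross-modal representation. In addition, we extend the original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks. Extensive experiments on commonly used text-video retrieval, video question answering, and video captioning benchmarks show the superior performance of our method and validate its effectiveness and generalization.|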
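|**2024-12-30**|[WalkVLM: Aid Visually Impaired People Walking by Vision Language Model](http://arxiv.org/abs/2412.20903)|null|Around 200 million people worldwide suffer from some degree of visual impairment, making it crucial to use AI technology to provide walking assistance for them. With recent progress in vision-language models (VLMs), applying VLMs to this field has become a popular research topic. However, most existing methods are studied on self-built question-answering datasets and lack a unified training and testing benchmark for walking guidance. Moreover, the blind-walking task requires real-time streaming video parsing and the generation of concise yet informative reminders, which poses a major challenge for VLMs that suffer from redundant responses and low inference efficiency. In this paper, we first release a diverse, extensive, and unbiased walking awareness dataset containing 12K video-manual annotation pairs from Europe and Asia, providing a fair training and testing benchmark for the blind-walking task. We also propose the WalkVLM model, which uses chain of thought for hierarchical planning to generate concise but informative reminders and uses temporal-aware adaptive prediction to reduce the temporal redundancy of reminders. Finally, we establish a solid benchmark for the blind-walking task and verify the advantages of WalkVLM over other VLMs in stream video processing for this task. Our dataset and code will be released at the anonymous link https://walkvlm2024.github.io.|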
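|**2024-12-30**|[Are Vision-Language Models Truly Understanding Multi-vision Sensor?](http://arxiv.org/abs/2412.20750)|**[link](https://github.com/top-yun/ms-pr)**|Large-scale vision-language models (VLMs) have advanced by aligning vision inputs with text, significantly improving performance on computer vision tasks. For VLMs to be used effectively in real-world applications, an understanding of diverse multi-vision sensor data, such as thermal, depth, and X-ray information, is essential. However, we find that current VLMs process multi-vision sensor images without a deep understanding of sensor information, disregarding each sensor's unique physical properties. This limits their capacity to interpret and answer complex questions that require multi-vision sensor reasoning. To address this, we propose a novel Multi-vision Sensor Perception and Reasoning (MS-PR) benchmark that assesses VLMs' capacity for sensor-specific reasoning. We also introduce Diverse Negative Attributes (DNA) optimization, which enables VLMs to reason deeply on multi-vision sensor tasks and helps bridge the core information gap between images and sensor data. Extensive experimental results confirm that the proposed DNA method significantly improves multi-vision sensor reasoning in VLMs.|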
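|**2024-12-30**|[UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models](http://arxiv.org/abs/2412.20742)|**[link](https://github.com/IntelliSensing/UniRS)**|The domain gap between remote sensing imagery and natural images has recently received widespread attention, and vision-language models (VLMs) have shown strong generalization in remote sensing multimodal tasks. However, current research is still limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce UniRS, the first vision-language model to handle multiple types of visual input in a unified way across multi-temporal remote sensing tasks. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation that lets the model accept various visual inputs. For dual-time image pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. In addition, we design a prompt augmentation mechanism tailored to the model's reasoning process, using the prior knowledge of a general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks. Our code and dataset will be released soon.|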
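|**2024-12-30**|[M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs](http://arxiv.org/abs/2412.20718)|**[link](https://github.com/BeiiiY/M3oralBench)**|In recent years, large foundation models, including large language models (LLMs) and large vision-language models (LVLMs), have become essential tools in critical fields such as law, finance, and healthcare. As these models increasingly integrate into daily life, moral evaluation is needed to ensure that their outputs align with human values and stay within moral boundaries. Previous work has focused mainly on LLMs, proposing moral datasets and benchmarks limited to the text modality. Given the rapid development of LVLMs, however, multimodal moral evaluation methods are still lacking. To bridge this gap, we introduce M$^3$oralBench, the first multimodal moral benchmark for LVLMs. M$^3$oralBench expands the everyday moral scenarios in the Moral Foundations Vignettes (MFVs) and uses the text-to-image diffusion model SD3.0 to create corresponding scenario images. It conducts moral evaluation across the six moral foundations of Moral Foundations Theory (MFT) and covers the tasks of moral judgement, moral classification, and moral response, providing a comprehensive assessment of model performance in multimodal moral understanding and reasoning. Extensive experiments on 10 popular open-source and closed-source LVLMs show that M$^3$oralBench is a challenging benchmark that exposes notable moral limitations in current models. Our benchmark is publicly available.|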
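|**2024-12-30**|[Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks](http://arxiv.org/abs/2412.20682)|null|Vision-language models (VLMs) such as CLIP show excellent zero-shot capability on classification benchmarks. However, selecting the best-performing VLM for an unlabeled downstream task is non-trivial. Existing VLM selection methods focus on the class-name-only setting and rely on supervised large-scale datasets and large language models, which may not be accessible or feasible during deployment. This paper introduces the problem of unsupervised vision-language model selection, where only unsupervised downstream datasets are available and no additional information is provided. To solve this problem, we propose a method called visual-textual graph alignment (VEGA), which selects VLMs without any annotations by measuring the alignment of a VLM between the two modalities on the unlabeled downstream task. VEGA is motivated by the pre-training paradigm of VLMs, which aligns features with the same semantics from the visual and textual modalities, thereby mapping both modalities into a shared representation space. Specifically, we first construct two graphs on the vision and textual features, respectively. VEGA is then defined as the overall similarity between the visual and textual graphs at both the node and edge levels. Extensive experiments across three different benchmarks, covering a variety of application scenarios and downstream datasets, show that VEGA consistently provides reliable and accurate estimates of VLM performance on unlabeled downstream tasks.|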
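|**2024-12-30**|[YOLO-UniOW: Efficient Universal Open-World Object Detection](http://arxiv.org/abs/2412.20645)|**[link](https://github.com/thu-mig/yolo-uniow)**|Traditional object detection models are constrained by closed-set datasets and detect only the categories encountered during training. While multimodal models have extended category recognition by aligning text and image modalities, they introduce significant inference overhead because of cross-modality fusion and remain restricted by predefined vocabularies, which makes them ineffective at handling unknown objects in open-world scenarios. In this work, we introduce Universal Open-World Object Detection (Uni-OWD), a new paradigm that unifies open-vocabulary and open-world object detection tasks. To address the challenges of this setting, we propose YOLO-UniOW, a novel model that advances efficiency, versatility, and performance. YOLO-UniOW incorporates Adaptive Decision Learning, replacing computationally expensive cross-modality fusion with lightweight alignment in the CLIP latent space, which achieves efficient detection without compromising generalization. In addition, we design a Wildcard Learning strategy that detects out-of-distribution objects as "unknown" while enabling dynamic vocabulary expansion without incremental learning. This design allows YOLO-UniOW to adapt seamlessly to new categories in open-world environments. Extensive experiments validate the superiority of YOLO-UniOW, which achieves 34.6 AP and 30.0 APr on LVIS with an inference speed of 69.6 FPS. The model also sets benchmarks on the M-OWODB, S-OWODB, and nuScenes datasets, showing strong performance in open-world object detection. Code and models are available at https://github.com/THU-MIG/YOLO-UniOW.|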
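|**2024-12-29**|[HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models](http://arxiv.org/abs/2412.20622)|**[link](https://github.com/AikyamLab/hallucinogen)**|Large vision-language models (LVLMs) have shown remarkable performance on complex multimodal tasks. However, they are still plagued by object hallucination: the misidentification or misclassification of objects present in images. To this end, we propose HALLUCINOGEN, a novel visual question answering (VQA) object hallucination attack benchmark that uses diverse contextual reasoning prompts to evaluate object hallucination in state-of-the-art LVLMs. We design a series of contextual reasoning hallucination prompts to evaluate LVLMs' ability to accurately identify objects in a target image while they perform diverse vision-language tasks such as identifying, locating, or performing visual reasoning around specific objects. Further, we extend our benchmark to high-stakes medical applications and introduce MED-HALLUCINOGEN, hallucination attacks tailored to the biomedical domain, and evaluate the hallucination performance of LVLMs on medical images, a critical area where precision is crucial. Finally, we conduct extensive evaluations of eight LVLMs and two hallucination mitigation strategies across multiple datasets to show that current general-purpose and medical LVLMs remain susceptible to hallucination attacks.|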
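|**2024-12-29**|[Audiopedia: Audio QA with Knowledge](http://arxiv.org/abs/2412.20619)|**[link](https://github.com/Abhiram4572/Audiopedia)**|This paper introduces Audiopedia, a novel task of audio question answering with knowledge, which requires both audio comprehension and reasoning over external knowledge. Unlike traditional audio question answering (AQA) benchmarks, which focus on simple queries answerable from the audio alone, Audiopedia targets knowledge-intensive questions. We define three sub-tasks: (i) Single Audio Question Answering (s-AQA), where questions are answered based on a single audio sample; (ii) Multi-Audio Question Answering (m-AQA), which requires reasoning over multiple audio samples; and (iii) Retrieval-Augmented Audio Question Answering (r-AQA), which involves retrieving relevant audio to answer the question. We benchmark large audio language models (LALMs) on these sub-tasks and observe suboptimal performance. To address this, we propose a generic framework that can be adapted to any LALM to equip it with knowledge reasoning capabilities. Our framework has two components: (i) Audio Entity Linking (AEL) and (ii) a Knowledge-Augmented Audio Large Multimodal Model (KA2LM), which together improve performance on knowledge-intensive AQA tasks. To our knowledge, this is the first work to address advanced audio understanding through knowledge-intensive tasks such as Audiopedia.|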
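|**2024-12-29**|[Diff4MMLiTS: Advanced Multimodal Liver Tumor Segmentation via Diffusion-Based Image Synthesis and Alignment](http://arxiv.org/abs/2412.20418)|null|Multimodal learning has been shown to improve performance across various clinical tasks, owing to the diverse perspectives offered by different modalities of data. However, existing multimodal segmentation methods rely on well-registered multimodal data, which is unrealistic for real-world clinical images, particularly for indistinct and diffuse regions such as liver tumors. In this paper, we introduce Diff4MMLiTS, a four-stage multimodal liver tumor segmentation pipeline: pre-registration of the target organs in multimodal CTs; dilation of the annotated modality's mask followed by inpainting to obtain multimodal normal CTs without tumors; synthesis of strictly aligned multimodal CTs with tumors using a latent diffusion model conditioned on multimodal CT features and randomly generated tumor masks; and finally, training the segmentation model, which removes the need for strictly aligned multimodal data. Extensive experiments on public and internal datasets show that Diff4MMLiTS outperforms other state-of-the-art multimodal segmentation methods.|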
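|**2024-12-27**|[MVTamperBench: Evaluating Robustness of Vision-Language Models](http://arxiv.org/abs/2412.19794)|null|Recent advances in vision-language models (VLMs) have enabled significant progress on complex video understanding tasks. However, their robustness to real-world manipulations remains underexplored, limiting their reliability in critical applications. To address this gap, we introduce MVTamperBench, a comprehensive benchmark designed to evaluate VLMs' resilience to video tampering effects, including rotation, dropping, masking, substitution, and repetition. By systematically assessing state-of-the-art models, MVTamperBench reveals substantial variability in robustness, with models such as InternVL2-8B achieving high performance while others, such as Llama-VILA1.5-8B, show severe vulnerabilities. To foster broader adoption and reproducibility, MVTamperBench is integrated into VLMEvalKit, a modular evaluation toolkit that enables streamlined testing and supports progress in model robustness. Our benchmark represents a critical step towards developing tamper-resilient VLMs and ensuring their dependability in real-world scenarios. Project page: https://amitbcp.github.io/MVTamperBench/|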
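|**2024-12-27**|[OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis](http://arxiv.org/abs/2412.19723)|null|Graphical user interface (GUI) agents powered by vision-language models (VLMs) have shown human-like computer control capability. Despite their usefulness in advancing digital automation, a critical bottleneck persists: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or on synthetic data generation by executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. These methods also suffer from limited data diversity and significant gaps between synthetic data and real-world environments. To address these challenges, we propose OS-Genesis, a novel GUI data synthesis pipeline that reverses the conventional trajectory collection process. Instead of relying on pre-defined tasks, OS-Genesis lets agents first perceive environments and perform step-wise interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then used to ensure the quality of the generated trajectories. We show that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis's efficiency and its higher data quality and diversity compared with existing synthesis methods. Our code, data, and checkpoints are available at the [OS-Genesis homepage](https://qiushisun.github.io/OS-Genesis-Home/).|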
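|**2024-12-27**|[From Elements to Design: A Layered Approach for Automatic Graphic Design Composition](http://arxiv.org/abs/2412.19712)|null|In this work, we investigate automatic design composition from multimodal graphic elements. Although recent studies have developed various generative models for graphic design, they usually face two limitations: they focus only on certain subtasks and are far from achieving the design composition task, and they do not consider the hierarchical information of graphic designs during generation. To tackle these issues, we introduce the layered design principle into large multimodal models (LMMs) and propose a novel approach, called LaDeCo, to accomplish this challenging task. Specifically, LaDeCo first performs layer planning for a given element set, dividing the input elements into different semantic layers according to their content. Based on the planning results, it then predicts the element attributes that control the design composition in a layer-wise manner and includes the rendered image of previously generated layers in the context. With this design, LaDeCo decomposes the difficult task into smaller, more manageable steps, making the generation process smoother and clearer. Experimental results demonstrate the effectiveness of LaDeCo in design composition. Furthermore, we show that LaDeCo supports some interesting applications in graphic design, such as resolution adjustment, element filling, and design variation. It even outperforms specialized models on some design subtasks without any task-specific training.|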
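|**2024-12-27**|[Is Your Text-to-Image Model Robust to Caption Noise?](http://arxiv.org/abs/2412.19531)|null|In text-to-image (T2I) generation, a prevalent training technique uses vision-language models (VLMs) for image re-captioning. Although VLMs are known to hallucinate, generating descriptive content that deviates from the visual reality, the effect of such caption hallucinations on T2I generation performance remains underexplored. Through our empirical investigation, we first build a comprehensive dataset of VLM-generated captions and then systematically analyze how caption hallucination influences generation outcomes. Our findings show that (1) disparities in caption quality persistently affect model outputs during fine-tuning; (2) VLM confidence scores are reliable indicators for detecting and characterizing noise-related patterns in the data distribution; and (3) even subtle variations in caption fidelity have significant effects on the quality of learned representations. Together, these findings underscore the profound impact of caption quality on model performance and highlight the need for more sophisticated robust training algorithms in T2I. In response to these observations, we propose an approach that uses VLM confidence scores to mitigate caption noise, making T2I models more robust to caption hallucination.|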
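|**2024-12-27**|[Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation](http://arxiv.org/abs/2412.19492)|**[link](https://github.com/yecy749/gsnet)**|In recent years, deep learning based methods have revolutionized remote sensing image segmentation. However, these methods usually rely on a pre-defined set of semantic classes and therefore need additional image annotation and model training when adapting to new classes. More importantly, they cannot segment arbitrary semantic classes. In this work, we introduce Open-Vocabulary Remote Sensing Image Semantic Segmentation (OVRSISS), which aims to segment arbitrary semantic classes in remote sensing images. To address the lack of OVRSISS datasets, we develop LandDiscover50K, a comprehensive dataset of 51,846 images covering 40 diverse semantic classes. In addition, we propose a novel framework named GSNet that integrates domain priors from special remote sensing models with the versatile capabilities of general vision-language models. Technically, GSNet consists of a Dual-Stream Image Encoder (DSIE), Query-Guided Feature Fusion (QGFF), and a Residual Information Preservation Decoder (RIPD). DSIE first captures comprehensive features from both special and general models in dual streams. Then, guided by a variable vocabulary, QGFF integrates specialist and generalist features so that they complement each other. Finally, RIPD aggregates multi-source features for more accurate mask predictions. Experiments show that our method outperforms other methods by a large margin and that the proposed LandDiscover50K improves the performance of OVRSISS methods. The proposed dataset and method will be made publicly available at https://github.com/yecy749/GSNet.|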
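|**2024-12-26**|[CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models](http://arxiv.org/abs/2412.19331)|null|Recent advances in large vision-language models (LVLMs) have driven significant progress in general-purpose vision tasks through visual instruction tuning. While some studies have shown that LVLMs can generate segmentation masks that align phrases with natural language descriptions in a single image, they struggle with segmentation-grounded comparisons across multiple images, particularly at finer granularities such as object parts. In this paper, we introduce the new task of part-focused semantic co-segmentation, which seeks to identify and segment common and unique objects and parts across images. To address this task, we present CALICO, the first LVLM that can segment and reason over multiple masks across images, enabling object comparison based on constituent parts. CALICO features two proposed components: a novel Correspondence Extraction Module, which captures semantic-rich information to identify part-level correspondences between objects, and a Correspondence Adaptation Module, which embeds this information into the LVLM in a parameter-efficient way to facilitate multi-image understanding. To support training and evaluation, we curate MixedParts, a comprehensive multi-image segmentation dataset containing about 2.4M samples across about 44K images, spanning diverse object and part categories. Experimental results show that CALICO, fine-tuned on only 0.3% of its architecture, achieves robust performance in part-focused semantic co-segmentation.|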
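|**2024-12-26**|[Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching](http://arxiv.org/abs/2412.19184)|null|With the rapid development of multimodal learning, the image-text matching task, as a bridge connecting vision and language, has become increasingly important. Building on existing research, this study proposes an innovative visual semantic embedding model, the Multi-Headed Consensus-Aware Visual-Semantic Embedding (MH-CVSE). The model introduces a multi-head self-attention mechanism on top of the consensus-aware visual semantic embedding model (CVSE) to capture information in multiple subspaces in parallel, significantly enhancing the model's ability to understand and represent the complex relationship between images and texts. In addition, we adopt a parameterized feature fusion strategy to flexibly integrate feature information at different levels, further improving the model's expressive power. For the loss function, MH-CVSE uses a dynamic weight adjustment strategy that adjusts weights according to the loss values themselves, so that the model can better balance the contributions of different loss terms during training. We also introduce a cosine annealing learning rate strategy to help the model converge more stably in the later stages of training. Extensive experiments on the Flickr30k dataset show that MH-CVSE achieves better performance than previous methods on both bidirectional image and text retrieval tasks, demonstrating its effectiveness and superiority.|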
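|**2024-12-26**|[MoPD: Mixture-of-Prompts Distillation for Vision-Language Models](http://arxiv.org/abs/2412.19087)|null|Soft prompt learning methods are effective for adapting vision-language models (VLMs) to downstream tasks. Nevertheless, empirical evidence shows that existing methods tend to overfit seen classes and perform worse on unseen classes. This limitation stems from the inherent bias in the training data towards the seen classes. To address this issue, we propose a novel soft prompt learning method, named Mixture-of-Prompts Distillation (MoPD), which effectively transfers useful knowledge from manually hand-crafted hard prompts (the teacher prompts) to the learnable soft prompt (the student prompt), thereby improving the generalization of soft prompts to unseen classes. Moreover, the proposed MoPD method uses a gating network that learns to select the hard prompts used for prompt distillation. Extensive experiments show that the proposed MoPD method outperforms state-of-the-art baselines, especially on unseen classes.|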
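|**2024-12-26**|[Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation](http://arxiv.org/abs/2412.19021)|null|Open-vocabulary scene graph generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. This enables the identification of novel visual relationships, making it applicable to real-world scenarios with diverse relationships. However, existing OV-SGG methods are constrained by fixed text representations, which limits diversity and accuracy in image-text alignment. To address these challenges, we propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representations by integrating subject-object and region-specific relation information. Our approach uses entity clustering to handle the complexity of relation triplet categories, enabling effective integration of subject-object information. In addition, we use a large language model (LLM) to generate detailed region-aware prompts that capture fine-grained visual interactions and improve alignment between the visual and textual modalities. RAHP also introduces a dynamic selection mechanism within vision-language models (VLMs), which adaptively selects relevant text prompts based on the visual content, reducing noise from irrelevant prompts. Extensive experiments on the Visual Genome and Open Images v6 datasets show that our framework consistently achieves state-of-the-art performance, demonstrating its effectiveness in addressing the challenges of open-vocabulary scene graph generation.|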
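|**2024-12-24**|[TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models](http://arxiv.org/abs/2412.18675)|**[link](https://github.com/anguyen8/TAB)**|Multi-head self-attention (MHSA) is a key component of Transformers, an architecture widely popular in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet they also obscure the contribution of each input patch to the model's output. We propose a novel single-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all image patches to the range $[0, 1]$. That is, when the total attention is 0, no visual information is propagated further into the network, and the vision-language model (VLM) defaults to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image difference captioning. Across three datasets, our models perform similarly to baseline VLMs in captioning, but the bottleneck is superior at localizing changes and at identifying when no changes occur. TAB is the first architecture to let users intervene by editing attention, which often leads VLMs to produce the expected outputs.|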
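|**2024-12-24**|[MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning](http://arxiv.org/abs/2412.18437)|**[link](https://github.com/Madjid-CH/auto-mixer)**|Choosing a suitable deep learning architecture for multimodal data fusion is a challenging task, because it requires effectively integrating and processing diverse data types, each with distinct structures and characteristics. In this paper, we introduce MixMAS, a sampling-based mixer architecture search framework tailored to multimodal learning. Our approach automatically selects the optimal MLP-based architecture for a given multimodal machine learning (MML) task. Specifically, MixMAS uses a sampling-based micro-benchmarking strategy to explore various combinations of modality-specific encoders, fusion functions, and fusion networks, systematically identifying the architecture that best meets the task's performance metrics.|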
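|**2024-12-24**|[LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating](http://arxiv.org/abs/2412.18424)|**[link](https://github.com/dengc2023/longdocurl)**|Large vision-language models (LVLMs) have markedly improved document understanding, handling complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks are limited to a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: long document understanding, numerical reasoning, and cross-element locating. We then propose a comprehensive benchmark, LongDocURL, which integrates these three primary tasks and comprises 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, significantly outperforming existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.|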
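|**2024-12-24**|[Weak Scaling Capability in Token Space: An Observation from Large Vision Language Model](http://arxiv.org/abs/2412.18387)|**[link](https://github.com/tenghuilee/scalingcapfusedvisionlm)**|Scaling capability has been widely validated with respect to the number of parameters and the size of training data. One important question remains unexplored: does scaling capability also hold with respect to the number of vision tokens? This study fills the gap by investigating the relationship between the number of vision tokens and the performance of vision-language models. Our theoretical analysis and empirical evaluations show that the model exhibits weak scaling capability in the length $N_l$, with performance approximately $S(N_l) \approx (c/N_l)^{\alpha}$, where $c, \alpha$ are hyperparameters. Interestingly, this scaling behavior remains largely unaffected by whether the user's question is included in the input. Furthermore, fusing the user's question with the vision tokens can improve model performance when the question is relevant to the task. To address the computational challenges associated with large numbers of vision tokens, we propose a novel architecture that efficiently reduces the token count while integrating user question tokens into the representation. Our findings may offer insights for developing more efficient and effective vision-language models under specific task constraints.|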
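|**2024-12-24**|[Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation](http://arxiv.org/abs/2412.18176)|null|Sequential recommendation (SR) systems have evolved significantly over the past decade, moving from traditional collaborative filtering to deep learning approaches and, more recently, to large language models (LLMs). While the adoption of LLMs has driven substantial advances, these models inherently lack collaborative filtering information and rely primarily on textual content data while neglecting other modalities, and therefore fail to achieve optimal recommendation performance. To address this limitation, we propose Molar, a multimodal large language sequential recommendation framework that integrates multiple content modalities with ID information to capture collaborative signals effectively. Molar uses a multimodal large language model (MLLM) to generate unified item representations from both textual and non-textual data, facilitating comprehensive multimodal modeling and enriching item embeddings. In addition, it incorporates collaborative filtering signals through a post-alignment mechanism that aligns user representations from content-based and ID-based models, ensuring precise personalization and robust performance. By combining multimodal content with collaborative filtering insights, Molar captures both user interests and contextual semantics, leading to higher recommendation accuracy. Extensive experiments validate that Molar significantly outperforms traditional and LLM-based baselines, highlighting its strength in using multimodal data and collaborative signals for sequential recommendation tasks. The source code is available at https://anonymous.4open.science/r/Molar-8B06/.|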
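|**2024-12-24**|[EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation](http://arxiv.org/abs/2412.18150)|**[link](https://github.com/DYEvaLab/EvalMuse)**|Recently, text-to-image (T2I) generation models have made significant progress. Correspondingly, many automated metrics have emerged to evaluate the image-text alignment capabilities of generative models. However, existing small datasets limit performance comparison among these automated metrics. These datasets also lack the capacity to assess automated metrics at a fine-grained level. In this study, we contribute the EvalMuse-40K benchmark, which gathers 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks. During construction, we employ strategies such as balanced prompt sampling and data re-annotation to ensure the diversity and reliability of the benchmark. This allows us to comprehensively evaluate the effectiveness of image-text alignment metrics for T2I models. Meanwhile, we introduce two new methods for evaluating the image-text alignment capabilities of T2I models: FGA-BLIP2, which fine-tunes a vision-language model end to end to produce fine-grained image-text alignment scores, and PN-VQA, which adopts a novel positive-negative VQA manner in VQA models for zero-shot fine-grained evaluation. Both methods achieve impressive performance in image-text alignment evaluation. We also use our methods to rank current AIGC models; the results can serve as a reference for future research and foster the development of T2I generation. The data and code will be made publicly available.|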
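|**2024-12-24**|[VisionLLM-based Multimodal Fusion Network for Glottic Carcinoma Early Detection](http://arxiv.org/abs/2412.18124)|null|Early detection of glottic carcinoma is critical for improving patient outcomes, because it enables timely intervention, preserves vocal function, and significantly reduces the risk of tumor progression and metastasis. However, the morphological similarity between glottic carcinoma and vocal cord dysplasia leads to suboptimal detection accuracy. To address this issue, we propose a vision large language model (VisionLLM) based multimodal fusion network for glottic carcinoma detection, called MMGC-Net. By integrating image and text modalities, multimodal models can capture complementary information, leading to more accurate and robust predictions. In this paper, we collect a real, private glottic carcinoma dataset named SYSU1H from the First Affiliated Hospital of Sun Yat-sen University, containing 5,799 image-text pairs. We use an image encoder and an additional Q-Former to extract vision embeddings, and the large language model Llama3 from Meta AI to obtain text embeddings. These modalities are then integrated through a laryngeal feature fusion block, enabling a comprehensive fusion of image and text features and thereby improving glottic carcinoma identification. Extensive experiments on the SYSU1H dataset show that MMGC-Net achieves state-of-the-art performance, outperforming previous multimodal models.|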
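|**2024-12-24**|[MMFactory: A Universal Solution Search Engine for Vision-Language Tasks](http://arxiv.org/abs/2412.18072)|null|With advances in foundation and vision-language models, and with effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model can handle all the tasks or applications that potential users may envision. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks through program synthesis. However, such approaches overlook user constraints (e.g., performance or computational needs), produce test-time sample-specific solutions that are difficult to deploy, and sometimes require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components and acts like a solution search engine across various available models. Based on a task description, a few sample input-output pairs, and (optionally) resource or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. Beyond synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance and resource characteristics, allowing users to pick a solution that meets their unique design constraints. From a technical perspective, we also introduce a committee-based solution proposer that uses multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. The project page is available at https://davidhalladay.github.io/mmfactory_demo.|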
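|**2024-12-23**|[Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection](http://arxiv.org/abs/2412.17800)|**[link](https://github.com/row11n/prova)**|Enabling models to recognize vast open-world categories has been a longstanding pursuit in object detection. By drawing on the generalization capabilities of vision-language models, current open-world detectors can recognize a broader range of vocabularies even though they are trained on limited categories. However, when the scale of the category vocabulary during training expands to a real-world level, classifiers previously aligned with coarse class names significantly reduce the recognition performance of these detectors. In this paper, we introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection. Prova extracts comprehensive multi-modal prototypes as initialization for the alignment classifiers to tackle the vast-vocabulary object recognition failure problem. On V3Det, this simple method greatly improves the performance of one-stage, two-stage, and DETR-based detectors with only additional projection layers, in both supervised and open-vocabulary settings. In particular, Prova improves Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 AP, respectively, in the supervised setting of V3Det. For the open-vocabulary setting, Prova achieves a new state-of-the-art performance with 32.8 base AP and 11.0 novel AP, gains of 2.6 and 4.3 over previous methods.|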
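|**2024-12-23**|[Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective](http://arxiv.org/abs/2412.17787)|null|Recent large vision-language models (LVLMs) have shown promising reasoning capabilities on text-rich images such as charts, tables, and documents. However, the abundant text in such images may increase a model's sensitivity to language. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the instruction. To this end, we introduce XT-VQA (Cross-Lingual Text-Rich Visual Question Answering), a benchmark for assessing how LVLMs handle language inconsistency between image text and questions. XT-VQA integrates five existing text-rich VQA datasets and a newly collected dataset, XPaperQA, covering diverse scenarios that require faithful recognition and comprehension of visual information despite language inconsistency. Our evaluation of prominent LVLMs on XT-VQA shows a significant performance drop in cross-lingual scenarios, even for models with multilingual capabilities. A mutual information analysis suggests that this performance gap arises because cross-lingual questions fail to adequately activate the relevant visual information. To mitigate this, we propose MVCL-MI (Maximization of Vision-Language Cross-Lingual Mutual Information), which builds visual-text cross-lingual alignment by maximizing the mutual information between model outputs and visual information. This is achieved by distilling knowledge from the monolingual setting to the cross-lingual setting through KL divergence minimization, with the monolingual output logits serving as the teacher. Experimental results on XT-VQA show that MVCL-MI effectively reduces the visual-text cross-lingual performance gap while preserving the inherent capabilities of LVLMs, offering new insight into potential practices for improving LVLMs. Code is available at: https://github.com/Stardust-y/XTVQA.git|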
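|**2024-12-23**|[Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy](http://arxiv.org/abs/2412.17759)|null|Multimodal learning is a rapidly evolving area of artificial intelligence that seeks to build more versatile and robust systems by integrating and analyzing multiple types of data, including text, images, audio, and video. Inspired by the human ability to absorb information through multiple senses, this approach enables applications such as text-to-video conversion, visual question answering, and image captioning. This overview highlights recent developments in datasets that support multimodal language models (MLLMs). Large-scale multimodal datasets are essential because they allow these models to be thoroughly tested and trained. With a focus on their contribution to the field, the study examines a variety of datasets, including those for training, domain-specific tasks, and real-world applications. It also emphasizes the importance of benchmark datasets for assessing model performance, scalability, and applicability across scenarios. Because multimodal learning is constantly changing, overcoming these obstacles will help AI research and applications reach new heights.|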
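|**2024-12-20**|[HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding](http://arxiv.org/abs/2412.16158)|null|The rapid progress of large language models (LLMs) has catalyzed the development of vision-language models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to compositional VLMs but face the challenge of inferior performance. Most existing monolithic VLMs require tuning a pre-trained LLM to acquire vision abilities, which may degrade its language capabilities. To address this dilemma, this paper presents HoVLE, a novel high-performance monolithic VLM. We note that LLMs have been shown to be capable of interpreting images when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs actually lies in the lack of a holistic embedding module for both vision and language inputs. HoVLE therefore introduces a holistic embedding module that converts visual and textual inputs into a shared space, allowing the LLM to process images in the same way as text. In addition, a multi-stage training strategy is carefully designed to strengthen the holistic embedding module. The module is first trained to distill visual features from a pre-trained vision encoder and text embeddings from the LLM, which enables large-scale training with unpaired random images and text tokens. The whole model then undergoes next-token prediction on multimodal data to align the embeddings. Finally, an instruction-tuning stage is added. Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks and outperforms previous monolithic models by a large margin. The model is available at https://huggingface.co/OpenGVLab/HoVLE.|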
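|**2024-12-20**|[Frequency Is What You Need: Word-frequency Masking Benefits Vision-Language Model Pre-training](http://arxiv.org/abs/2412.16148)|**[link](https://github.com/mingliangliang3/clipf)**|Vision-language models (VLMs) can be trained more efficiently if the size of the training set can be reduced. Recent work has shown that masking text during VLM training in various ways (e.g., truncation, random masking, block masking, and syntax masking) is beneficial. In this paper, we show that the best masking strategy changes as training epochs progress and that, given sufficient training epochs, word-frequency information is what is needed to achieve the best performance. Experiments on a range of large datasets demonstrate the advantages of our approach, called Contrastive Language-Image Pre-training with word Frequency Masking (CLIPF). The advantage is especially noticeable as the number of input tokens decreases. We analyze the effect of CLIPF versus other masking approaches on word-frequency balance and discuss the crucial role CLIPF plays in maintaining word-frequency balance across part-of-speech categories.|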
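|**2024-12-20**|[Demystifying the Potential of ChatGPT-4 Vision for Construction Progress Monitoring](http://arxiv.org/abs/2412.16108)|null|The integration of large vision-language models (LVLMs), such as OpenAI's GPT-4 Vision, into various industries marks a significant advance in artificial intelligence, particularly in visual data analysis and interpretation. This paper explores the practical application of GPT-4 Vision in the construction industry, focusing on its ability to monitor and track the progress of construction projects. Using high-resolution aerial imagery of construction sites, the study examines how GPT-4 Vision performs detailed scene analysis and tracks developmental changes over time. The findings show that while GPT-4 Vision is proficient at identifying construction stages, materials, and machinery, it still faces challenges with precise object localization and segmentation. Despite these limitations, the technology has substantial potential for future development. This research not only highlights the current state and opportunities of using LVLMs in construction but also discusses future directions for enhancing the model's utility through domain-specific training and integration with other computer vision techniques and digital twins.|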
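|**2024-12-20**|[Error-driven Data-efficient Large Multimodal Model Tuning](http://arxiv.org/abs/2412.15652)|null|Large multimodal models (LMMs) have shown impressive performance on many academic benchmarks. However, fine-tuning remains essential for satisfactory performance on downstream tasks, while task-specific tuning samples are usually unavailable or expensive and time-consuming to obtain. To address this, we propose an error-driven, data-efficient tuning framework that aims to efficiently adapt generic LMMs to newly emerging tasks without requiring any task-specific training samples. In our approach, a generic LMM, acting as the student model, is first evaluated on a small validation set of the target task. A more powerful model, acting as the teacher, then identifies the erroneous steps in the student's reasoning and analyzes the capability gaps that prevent it from fully solving the target task. Based on these gaps, targeted training samples are retrieved from existing task-agnostic datasets to tune the student model and tailor it to the target task. We run extensive experiments across three training data scales and seven tasks, showing that our training paradigm significantly and efficiently improves LMM performance on downstream tasks, with an average performance gain of 7.01%.|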
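|**2024-12-20**|[VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving](http://arxiv.org/abs/2412.15544)|null|In recent years, reinforcement learning (RL)-based methods for learning driving policies have gained increasing attention in the autonomous driving community and have made remarkable progress in various driving scenarios. However, traditional RL approaches rely on manually engineered reward functions, which require extensive human effort and often lack generalizability. To address these limitations, we propose VLM-RL, a unified framework that integrates pre-trained vision-language models (VLMs) with RL and generates reward signals from image observations and natural-language goals. The core of VLM-RL is the contrasting language goal (CLG)-as-reward paradigm, which uses positive and negative language goals to generate semantic rewards. We further introduce a hierarchical reward synthesis approach that combines CLG-based semantic rewards with vehicle state information, improving reward stability and providing a more comprehensive reward signal. In addition, a batch-processing technique is used to optimize computational efficiency during training. Extensive experiments in the CARLA simulator show that VLM-RL outperforms state-of-the-art baselines, with a 10.5% reduction in collision rate and a 104.6% increase in route completion rate, and that it generalizes robustly to unseen driving scenarios. Furthermore, VLM-RL can be integrated seamlessly with almost any standard RL algorithm, which could revolutionize the existing RL paradigm that relies on manual reward engineering and enable continuous performance improvements. The demo video and code can be accessed at: https://zilin-huang.github.io/VLM-RL-website.|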
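|**2024-12-19**|[PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation](http://arxiv.org/abs/2412.15209)|null|Despite significant progress in large vision-language models (LVLMs), existing pixel-grounding models still operate in single-image settings, which limits their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work bridges this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and by proposing PRIMA, a novel LVLM that combines pixel-level grounding with robust multi-image reasoning to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by 25.3%. To support training and evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark of about 224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results show that PRIMA outperforms state-of-the-art baselines.|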
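|**2024-12-19**|[EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues](http://arxiv.org/abs/2412.15190)|null|Automated analysis of vast Earth observation data through interactive vision-language models (VLMs) can open new opportunities for environmental monitoring, disaster response, and resource management. Existing generic VLMs perform poorly on remote sensing data, while recent geospatial VLMs remain restricted to a fixed resolution and a few sensor modalities. In this paper, we introduce EarthDial, a conversational assistant designed specifically for Earth observation (EO) data that turns complex, multi-sensory Earth observations into interactive natural-language dialogues. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, enabling a wide range of remote sensing tasks, including classification, detection, captioning, question answering, visual reasoning, and visual grounding. To achieve this, we introduce an extensive instruction-tuning dataset of more than 11.11 million instruction pairs covering RGB, synthetic aperture radar (SAR), and multispectral modalities such as near-infrared (NIR) and infrared. EarthDial also handles bi-temporal and multi-temporal sequence analysis for applications such as change detection. Our extensive experiments on 37 downstream applications show that EarthDial outperforms existing generic and domain-specific models and generalizes better across EO tasks.|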
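|**2024-12-19**|[Qwen2.5 Technical Report](http://arxiv.org/abs/2412.15115)|**[link](https://github.com/qwenlm/qwen2.5)**|In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared with previous iterations, Qwen2.5 has been significantly improved in both the pre-training and post-training stages. For pre-training, we scaled the high-quality pre-training dataset from the previous 7 trillion tokens to 18 trillion tokens, which provides a strong foundation for common sense, expert knowledge, and reasoning. For post-training, we apply intricate supervised fine-tuning on more than 1 million samples together with multi-stage reinforcement learning. The post-training techniques enhance alignment with human preferences and notably improve long-text generation, structured data analysis, and instruction following. To handle diverse use cases effectively, we offer the Qwen2.5 LLM series in a range of sizes. The open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants, Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 demonstrates top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, and more. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and is competitive with the state-of-the-art open-weight model Llama-3-405B-Instruct, which is about 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o, respectively. Additionally, as foundation models, the Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.|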
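|**2024-12-19**|[Progressive Multimodal Reasoning via Active Retrieval](http://arxiv.org/abs/2412.14835)|null|Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs), and how to effectively improve their performance in such scenarios remains an open problem. In this paper, we propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs through active retrieval (AR) and Monte Carlo tree search (MCTS). Our approach begins with a unified retrieval module that retrieves key supporting insights from a hybrid-modal retrieval corpus for solving complex reasoning problems. To bridge the gap in automated multimodal reasoning verification, we employ the MCTS algorithm combined with an active retrieval mechanism, which enables the automatic generation of step-wise annotations. This strategy dynamically retrieves key insights for each reasoning step, moving beyond traditional beam search sampling to improve the diversity and reliability of the reasoning space. In addition, we introduce a process reward model that is aligned progressively to support automatic verification of multimodal reasoning tasks. Experimental results on three complex multimodal reasoning benchmarks confirm the effectiveness of the AR-MCTS framework in improving the performance of various multimodal models. Further analysis shows that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.|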
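|**2024-12-19**|[A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space](http://arxiv.org/abs/2412.14680)|**[link](https://github.com/d-robotics-ai-lab/dosod)**|Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications because of their high computational burden and complex deployment. To address this, this paper proposes a light-weight framework called Decoupled OSOD (DOSOD), a practical and highly efficient solution that supports real-time OSOD tasks in robotic systems. Specifically, DOSOD builds on the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A multilayer perceptron (MLP) adaptor is developed to transform the text embeddings extracted by the VLM into a joint space, within which the detector learns region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. At test time, DOSOD operates like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection. Compared with the baseline YOLO-World, the proposed DOSOD significantly improves real-time performance while maintaining comparable accuracy. With similar backbones on the LVIS minival dataset, the light-weight DOSOD-S model achieves a Fixed AP of 26.7%, compared with 26.2% for YOLO-World-v1-S and 22.7% for YOLO-World-v2-S. Meanwhile, the FPS of DOSOD-S is 57.1% higher than that of YOLO-World-v1-S and 29.6% higher than that of YOLO-World-v2-S. We also show that DOSOD models facilitate deployment on edge devices. The code and models are publicly available at https://github.com/D-Robotics-AI-Lab/DOSOD.|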
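|**2024-12-19**|[Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation](http://arxiv.org/abs/2412.14487)|null|Direct preference optimization (DPO) has proven highly effective at mitigating hallucination in large vision-language models (LVLMs) by aligning model outputs more closely with human preferences. Despite recent progress, existing methods have two drawbacks: 1) they lack scalable token-level rewards, and 2) they neglect visual-anchored tokens. To this end, we propose a novel token preference optimization model with self-calibrated rewards (dubbed TPO), which adaptively attends to visually correlated tokens without fine-grained annotations. Specifically, we introduce a token-level visual-anchored reward, defined as the difference between the logit distributions of generated tokens conditioned on the raw image and on a corrupted image. In addition, to highlight informative visual-anchored tokens, we propose a visual-aware training objective that enables more accurate token-level optimization. Extensive experimental results show that the proposed TPO achieves state-of-the-art performance. For example, built on LLAVA-1.5-7B, our TPO delivers significant performance gains on hallucination benchmarks.|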
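|**2024-12-19**|[GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering](http://arxiv.org/abs/2412.14480)|null|In embodied question answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics because it is difficult to obtain useful semantic representations, to update these representations online, and to leverage prior world knowledge for efficient exploration and planning. To address these limitations, we propose GraphEQA, a novel approach that uses real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multimodal memory for grounding vision-language models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. Through simulation experiments on the HM-EQA dataset and real-world experiments in home and office environments, we show that our method outperforms key baselines, completing EQA tasks with higher success rates and fewer planning steps.|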
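|**2024-12-19**|[MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval](http://arxiv.org/abs/2412.14475)|null|Despite rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision-language models (VLMs) and open-domain images, together with a massive synthetic dataset generated by this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform a baseline model trained on 70 times more data from existing datasets. Moreover, because MegaPairs relies only on general image corpora and open-source VLMs, it can be scaled up easily, enabling continuous improvements in retrieval performance. So far, we have produced more than 26 million training instances and trained several models of varying sizes on this data. These new models achieve state-of-the-art zero-shot performance on 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also show notable performance improvements after downstream fine-tuning. We will publicly release our generated dataset, trained models, and data synthesis pipeline to facilitate future development in this field.|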
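|**2024-12-18**|[Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception](http://arxiv.org/abs/2412.14233)|**[link](https://github.com/syp2ysy/dce)**|Training large multimodal models (LMMs) relies on descriptive image captions that connect images and language. Existing methods either distill captions from LMMs or construct captions from internet images or by hand. We propose to enhance image captions using off-the-shelf visual specialists, which were originally trained on annotated images rather than for image captioning. Our approach, named DCE, explores low-level and fine-grained object attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines these attributes into the descriptive caption. Experiments show that such visual specialists improve performance on visual understanding tasks as well as on reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can be easily combined into the pipeline. The complete source code of the DCE pipeline and the datasets will be available at https://github.com/syp2ysy/DCE.|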
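|**2024-12-18**|[Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation](http://arxiv.org/abs/2412.14145)|null|Visual understanding is usually studied at three levels of granularity: image, patch, and pixel. Visual tokenization, trained through self-supervised reconstructive learning, uses a codebook to compress visual data at the patch level with little information loss, but the visual tokens themselves carry no semantics. Open-vocabulary semantic segmentation benefits from evolving vision-language models (VLMs) with strong image-level zero-shot capability, but turning image-level understanding into pixel-level understanding remains a pressing challenge. In this paper, we treat segmentation as pixel tokenization and study a unified perceptual and semantic token compression for understanding at all granularities, which in turn facilitates open-vocabulary semantic segmentation. Drawing on the cognitive process of pre-trained VLMs, in which low-level features are progressively composed into high-level semantics, we propose Feature Pyramid Tokenization (PAT), which clusters and represents multi-resolution features with learnable codebooks and then decodes them by jointly learning pixel reconstruction and semantic segmentation. We design loosely coupled pixel and semantic learning branches. The pixel branch simulates bottom-up composition and top-down visualization of codebook tokens, while the semantic branch collectively fuses the hierarchical codebooks as auxiliary segmentation guidance. Our experiments show that PAT enhances the semantic intuition of the VLM feature pyramid, improves performance over the baseline segmentation model, and achieves competitive performance on open-vocabulary semantic segmentation benchmarks. Our model is parameter-efficient for VLM integration and flexible for standalone tokenization. We hope to offer inspiration not only for improving segmentation but also for the use of semantic visual tokens.|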
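|**2024-12-17**|[Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration](http://arxiv.org/abs/2412.13180)|null|Recent work on accelerating vision-language models shows that strong performance can be maintained across a variety of vision-language tasks even when visual information is highly compressed. In this work, we examine the popular acceleration approach of pruning visual tokens early inside the language model and find that its strong performance on many tasks is due not to an exceptional ability to compress visual information, but to the limited ability of the benchmarks to assess fine-grained visual capabilities. Specifically, we demonstrate a core issue with this acceleration approach: most tokens toward the top of the image are pruned away. Yet this issue shows up in performance only for a small subset of tasks such as localization. For the other evaluated tasks, strong performance is maintained even with the flawed pruning strategy. Given the limited visual capabilities of the studied acceleration technique, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that (1) resolves the identified issue with early-layer pruning, (2) incorporates uniform sampling to ensure coverage of all image regions, and (3) applies pruning in two stages so that the criteria become more effective at a later layer while still achieving a significant speedup through early-layer pruning. With comparable computational savings, we find that FEATHER improves performance by more than 5 times over the original acceleration approach on vision-centric localization benchmarks.|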
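|**2024-12-17**|[DoPTA: Improving Document Layout Analysis using Patch-Text Alignment](http://arxiv.org/abs/2412.12902)|null|The advent of multimodal learning has brought significant improvements to document AI. Documents are now treated as multimodal entities, combining textual and visual information for downstream analysis. However, work in this area usually focuses on the textual side and uses the visual space as auxiliary information. While some works have explored purely vision-based techniques for document image understanding, they either require OCR-identified text as input during inference or do not align with text during learning. We therefore present a novel image-text alignment technique specifically designed to leverage the textual information in document images to improve performance on visual tasks. Our document encoder model, DoPTA, trained with this technique, shows strong performance on a wide range of document image understanding tasks without requiring OCR during inference. Combined with an auxiliary reconstruction objective, DoPTA consistently outperforms larger models while using significantly less pre-training compute. DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.|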
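|**2024-12-17**|[ZoRI: Towards Discriminative Zero-Shot Remote Sensing Instance Segmentation](http://arxiv.org/abs/2412.12798)|**[link](https://github.com/huangshiqi128/zori)**|Instance segmentation algorithms for remote sensing are typically based on conventional methods, which limits their application to seen scenarios and closed-set predictions. In this work, we propose a novel task called zero-shot remote sensing instance segmentation, which aims to identify aerial objects that are absent from the training data. Challenges arise when classifying aerial categories with high inter-class similarity and intra-class variance. In addition, the domain gap between vision-language model pre-training datasets and remote sensing datasets hinders zero-shot capability when pre-trained models are applied directly to remote sensing images. To address these challenges, we propose a zero-shot remote sensing instance segmentation framework, dubbed ZoRI. Our approach features a discrimination-enhanced classifier that uses refined textual embeddings to increase awareness of class disparities. Instead of direct fine-tuning, we propose a knowledge-maintained adaptation strategy that decouples semantic-related information to preserve the pre-trained vision-language alignment while adjusting features to capture remote-sensing-specific visual cues. In addition, we introduce a prior-injected prediction with a cache bank of aerial visual prototypes to supplement the semantic richness of text embeddings and seamlessly integrate aerial representations, adapting to the remote sensing domain. We establish new experimental protocols and benchmarks, and extensive experiments convincingly demonstrate that ZoRI achieves state-of-the-art performance on the zero-shot remote sensing instance segmentation task. Our code is available at https://github.com/HuangShiqi128/ZoRI.|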
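|**2024-12-17**|[CRoF: CLIP-based Robust Few-shot Learning on Noisy Labels](http://arxiv.org/abs/2412.12793)|null|Noisy labels threaten the robustness of few-shot learning (FSL) because features in a new domain are inexact. CLIP, a large-scale vision-language model, performs well in FSL based on image-text embedding similarity, but it is susceptible to misclassification caused by noisy labels. How to enhance the domain generalization of CLIP on noisy data within FSL tasks is a critical challenge. In this paper, we offer a new view for mitigating the influence of noisy labels: CLIP-based Robust Few-shot learning (CRoF). CRoF is a general plug-in module for CLIP-based models. To avoid misclassification and confused label embeddings, we design a few-shot task-oriented prompt generator that gives more discriminative descriptions of each category. The proposed prompts achieve larger distances between inter-class textual embeddings. Furthermore, rather than fully trusting CLIP's zero-shot classification, we fine-tune CLIP on noisy few-shot data in the new domain with a weighting strategy similar to label smoothing. The weights for multiple potentially correct labels account for the relationship between CLIP's prior knowledge and the original label information to ensure reliability. Our multiple-label loss function further supports robust training under this paradigm. Comprehensive experiments show that CRoF, as a plug-in, outperforms fine-tuned and vanilla CLIP models across different noise types and noise ratios.|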
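|**2024-12-17**|[Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference](http://arxiv.org/abs/2412.12785)|null|Large vision-language models (LVLMs) typically learn visual capacity through visual instruction tuning, which involves updating both a projector and their LLM backbone. Drawing inspiration from the concept of a visual region in the human brain, we investigate whether a similar "visual region" that functions as a cognitive core exists within LLMs, and we explore the possibility of training LVLMs efficiently through selective layer tuning. We run detailed experiments with Bunny-Llama-3-8B-V and validate them with LLaVA-1.5-7B and LLaVA-1.5-13B across a range of visual and textual tasks. Our findings show that selectively updating 25% of LLM layers, when sparsely and uniformly distributed, preserves nearly 99% of visual performance while maintaining or improving textual task results, and also effectively reduces training time. Building on this targeted training approach, we further propose a novel visual-region-based pruning paradigm that removes non-critical layers outside the visual region, which minimizes performance loss. By activating a layer-wise visual region within LLMs, this study offers an effective and efficient strategy for LVLM training and inference that holds consistently across different models and parameter scales.|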
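|**2024-12-17**|[SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models](http://arxiv.org/abs/2412.12693)|**[link](https://github.com/zwenyu/SPHERE-VLM)**|Current vision-language models may incorporate single-dimensional spatial cues, such as depth, object boundaries, and basic spatial directions (e.g., left, right, front, back), yet they often lack the multi-dimensional spatial reasoning needed for human-like understanding and real-world applications. To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework with a new human-annotated dataset for pinpointing model strengths and weaknesses, progressing from single-skill tasks to multi-skill tasks and ultimately to complex reasoning tasks that require combining multiple spatial and visual cues with logical reasoning. Benchmark evaluation of state-of-the-art open-source models reveals significant shortcomings, especially in understanding distance and proximity, in reasoning from both egocentric and allocentric viewpoints, and in performing complex reasoning in a physical context. This work underscores the need for more advanced approaches to spatial understanding and reasoning, paving the way for improvements in vision-language models and their alignment with human-like spatial capabilities. The dataset will be open-sourced upon publication.|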
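|**2024-12-17**|[PBVS 2024 Solution: Self-Supervised Learning and Sampling Strategies for SAR Classification in Extreme Long-Tail Distribution](http://arxiv.org/abs/2412.12565)|null|The Multimodal Learning Workshop (PBVS 2024) aims to improve the performance of automatic target recognition (ATR) systems by learning jointly from synthetic aperture radar (SAR) data, which is difficult to interpret but unaffected by weather conditions and visible light, and electro-optical (EO) data. The subtask, named the Multi-modal Aerial View Imagery Challenge - Classification, focuses on predicting the class label of a low-resolution aerial image based on a set of SAR-EO image pairs and their respective class labels. The provided dataset consists of SAR-EO pairs and is characterized by a severe long-tail distribution, with more than a 1000-fold difference between the largest and smallest classes, which makes typical long-tail methods difficult to apply. In addition, the domain disparity between the SAR and EO datasets complicates the effectiveness of standard multimodal methods. To address these significant challenges, we propose a two-stage learning approach that uses self-supervised techniques, combined with multimodal learning and inference through SAR-to-EO translation, to make effective use of the EO data. In the final testing phase of the PBVS 2024 Multi-modal Aerial View Imagery Challenge - Classification (SAR Classification) task, our model achieved an accuracy of 21.45%, an AUC of 0.56, and a total score of 0.30, placing ninth in the competition.|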
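|**2024-12-17**|[DuSSS: Dual Semantic Similarity-Supervised Vision-Language Model for Semi-Supervised Medical Image Segmentation](http://arxiv.org/abs/2412.12492)|**[link](https://github.com/QingtaoPan/DuSSS)**|Semi-supervised medical image segmentation (SSMIS) uses consistency learning to regularize model training, which reduces the burden of pixel-wise manual annotation. However, it often suffers from erroneous supervision from low-quality pseudo labels. Vision-language models (VLMs) have great potential to improve pseudo labels by introducing multimodal supervision guided by text prompts. They nevertheless face a cross-modal problem: the obtained messages tend to correspond to multiple targets. To address these issues, we propose a Dual Semantic Similarity-Supervised VLM (DuSSS) for SSMIS. Specifically, 1) a Dual Contrastive Learning (DCL) scheme is designed to improve cross-modal semantic consistency by capturing intrinsic representations within each modality and semantic correlations across modalities. 2) To encourage the learning of multiple semantic correspondences, a Semantic Similarity-Supervision strategy (SSS) is proposed and injected into each contrastive learning process of DCL, supervising semantic similarity through distribution-based uncertainty levels. Furthermore, a novel VLM-based SSMIS network is designed to compensate for the quality deficiencies of pseudo labels. It uses the pretrained VLM to generate text-prompt-guided supervision, refining the pseudo labels for better consistency regularization. Experimental results show that DuSSS achieves outstanding performance on three public datasets (QaTa-COV19, BM-Seg, and MoNuSeg), with Dice scores of 82.52%, 74.61%, and 78.03%, respectively.|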
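|**2024-12-17**|[Causal Diffusion Transformers for Generative Modeling](http://arxiv.org/abs/2412.12095)|**[link](https://github.com/causalfusion/causalfusion)**|We introduce Causal Diffusion, the autoregressive (AR) counterpart of diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models such as LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization into a diffusion model can substantially improve its performance and enables a smooth transition between AR and diffusion generation modes. We therefore propose CausalFusion, a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, achieving state-of-the-art results on the ImageNet generation benchmark while also enjoying the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase its ability to perform zero-shot in-context image manipulation. We hope this work gives the community a fresh perspective on training multimodal models over discrete and continuous data.|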
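|**2024-12-16**|[CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology](http://arxiv.org/abs/2412.12077)|null|The emergence of large multimodal models (LMMs) has brought significant advances to pathology. Previous research has focused mainly on training patch-level and whole-slide image (WSI)-level models separately, which limits the integration of knowledge learned across patches and WSIs and results in redundant models. In this work, we introduce CPath-Omni, the first 15-billion-parameter LMM designed to unify patch-level and WSI-level image analysis, consolidating a variety of tasks at both levels, including classification, visual question answering, captioning, and visual referring prompting. Extensive experiments show that CPath-Omni achieves state-of-the-art (SOTA) performance across seven diverse tasks on 39 out of 42 datasets, outperforming or matching task-specific models trained for individual tasks. In addition, we develop a specialized pathology CLIP-based visual processor for CPath-Omni, CPath-CLIP, which for the first time integrates different vision models and incorporates a large language model as the text encoder to build a more powerful CLIP model, achieving SOTA performance on nine zero-shot and four few-shot datasets. Our findings highlight CPath-Omni's ability to unify diverse pathology tasks, demonstrating its potential to streamline and advance the field of foundation models in pathology.|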
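|**2024-12-13**|[Apollo: An Exploration of Video Understanding in Large Multimodal Models](http://arxiv.org/abs/2412.10360)|null|Despite the rapid integration of video perception capabilities into large multimodal models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. As a result, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the main contributors to the high computational requirements of video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) transfer effectively to larger models. Leveraging these insights, we explore many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we show that fps sampling during training is preferable to uniform frame sampling, and we identify which vision encoders are best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieves superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models and reaching an impressive 55.1 on LongVideoBench. Apollo-7B is state of the art among 7B LMMs, scoring 70.9 on MLVU and 63.3 on Video-MME.|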
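|**2024-12-13**|[A dual contrastive framework](http://arxiv.org/abs/2412.10348)|null|In current multimodal tasks, models typically freeze the encoder and decoder while adapting intermediate layers to task-specific goals, such as region captioning. Region-level visual understanding presents significant challenges for large-scale vision-language models. While limited spatial awareness is a known issue, coarse-grained pretraining in particular makes it harder to optimize latent representations for effective encoder-decoder alignment. We propose AlignCap, a framework designed to enhance region-level understanding through fine-grained alignment of latent spaces. Our approach introduces a novel latent feature refinement module that enhances conditioned latent space representations to improve region-level captioning performance. We also propose an innovative alignment strategy, the semantic space alignment module, which improves the quality of multimodal representations. In addition, we incorporate contrastive learning in a novel manner within both modules to further improve region-level captioning performance. To address spatial limitations, we employ a General Object Detection (GOD) method as a data preprocessing pipeline, which enhances spatial reasoning at the region level. Extensive experiments show that our approach significantly improves region-level captioning performance across a variety of tasks.|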
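|**2024-12-13**|[DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding](http://arxiv.org/abs/2412.10302)|**[link](https://github.com/deepseek-ai/deepseek-vl2)**|We present DeepSeek-VL2, an advanced series of large mixture-of-experts (MoE) vision-language models that significantly improves on its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we adopt a dynamic tiling vision encoding strategy designed to process high-resolution images with different aspect ratios. For the language component, we use DeepSeekMoE models with the multi-head latent attention mechanism, which compresses the key-value cache into latent vectors to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across a variety of tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series consists of three variants, DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters, respectively. Compared with existing open-source dense and MoE-based models, DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters. Code and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.|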
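|**2024-12-13**|[VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation](http://arxiv.org/abs/2412.10151)|null|We propose VLR-Bench, a visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) based on retrieval-augmented generation (RAG). Unlike existing evaluation datasets for external-knowledge-based VQA, the proposed VLR-Bench includes five input passages. This makes it possible to test the ability to determine which passage is useful for answering a given query, a capability lacking in previous research. In this context, we construct a dataset of 32,000 automatically generated instruction-following examples, which we call VLR-IF. This dataset is designed to strengthen the RAG capabilities of VLMs by teaching them to generate appropriate answers based on the input passages. We evaluate the validity of the proposed benchmark and training data, and verify its performance, using Llava-Llama-3, a VLM based on the state-of-the-art Llama3. The proposed VLR-Bench and VLR-IF datasets are publicly available online.|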
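|**2024-12-13**|[Performance of ChatGPT on tasks involving physics visual representations: the case of the Brief Electricity and Magnetism Assessment](http://arxiv.org/abs/2412.10019)|null|Artificial-intelligence-based chatbots are increasingly influencing physics education because of their ability to interpret and respond to textual and visual inputs. This study evaluates the performance of two large multimodal model-based chatbots, ChatGPT-4 and ChatGPT-4o, on the Brief Electricity and Magnetism Assessment (BEMA), a conceptual physics inventory rich in visual representations such as vector fields, circuit diagrams, and graphs. Quantitative analysis shows that ChatGPT-4o outperforms both ChatGPT-4 and a large sample of university students, and demonstrates improvements in ChatGPT-4o's vision interpretation ability over its predecessor ChatGPT-4. However, qualitative analysis of ChatGPT-4o's responses reveals persistent challenges. We identify three types of difficulties in the chatbot's responses to BEMA tasks: (1) difficulties with visual interpretation, (2) difficulties in providing correct physics laws or rules, and (3) difficulties with spatial coordination and the application of physics representations. Spatial reasoning tasks, particularly those requiring the right-hand rule, proved especially problematic. These findings show that the most broadly used large multimodal model-based chatbot, ChatGPT-4o, still has significant difficulty with physics tasks involving visual representations. While the chatbot shows potential for educational applications, including personalized tutoring and accessibility support for blind or low-vision students, its limitations call for caution. On the other hand, our findings can also be used to design assessments that are difficult for chatbots to solve.|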
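|**2024-12-13**|[WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model](http://arxiv.org/abs/2412.09951)|null|The rapid progress of vision-language models (VLMs) in general human knowledge and impressive logical reasoning has driven growing interest in applying VLMs to high-level autonomous driving tasks such as scene understanding and decision making. However, the relationship between knowledge proficiency, especially essential driving expertise, and closed-loop autonomous driving performance still requires in-depth study. In this paper, we investigate the effects of the depth and breadth of fundamental driving knowledge on closed-loop trajectory planning and introduce WiseAD, a specialized VLM tailored for end-to-end autonomous driving that is capable of driving reasoning, action justification, object recognition, risk analysis, driving suggestions, and trajectory planning across diverse scenarios. We train jointly on driving knowledge and planning datasets, enabling the model to perform knowledge-aligned trajectory planning accordingly. Extensive experiments show that as the diversity of driving knowledge expands, critical accidents are notably reduced, with improvements of 11.9% in driving score and 12.4% in route completion on the Carla closed-loop evaluations, achieving state-of-the-art performance. WiseAD also shows remarkable performance in knowledge evaluations on both in-domain and out-of-domain datasets.|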
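|**2024-12-13**|[CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models](http://arxiv.org/abs/2412.09936)|**[link](https://github.com/kennyyao2001/16824-caloraify)**|Obesity, that is, the problem of excess body weight, is a leading cause of preventable chronic disease worldwide. Traditional calorie estimation tools often rely on specific data formats or complex pipelines, which limits their practicality in real-world scenarios. Recently, vision-language models (VLMs) have excelled at understanding real-world contexts and enabling conversational interaction, making them well suited to downstream tasks such as ingredient analysis. However, applying VLMs to calorie estimation requires domain-specific data and alignment strategies. To this end, we curate CalData, a dataset of 330K image-text pairs tailored for ingredient recognition and calorie estimation, which combines a large-scale recipe dataset with detailed nutritional instructions for robust vision-language training. Built on this dataset, we present CaLoRAify, a novel VLM framework that aligns ingredient recognition and calorie estimation through training with visual-text pairs. During inference, users need only a single monocular food image to estimate calories, while retaining the flexibility of agent-based conversational interaction. With Low-Rank Adaptation (LoRA) and Retrieval-Augmented Generation (RAG) techniques, our system enhances the performance of foundational VLMs in the vertical domain of calorie estimation. Our code and data are fully open-sourced at https://github.com/KennyYao2001/16824-CaLORAify.|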
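|**2024-12-13**|[Selective State Space Memory for Large Vision-Language Models](http://arxiv.org/abs/2412.09875)|null|Large vision-language models (LVLMs) have demonstrated remarkable performance across a wide range of multimodal tasks. However, fine-tuning these models for domain-specific applications remains a computationally intensive challenge. This paper introduces State Space Memory Integration (SSMI), a novel approach for efficient fine-tuning of LVLMs. By integrating lightweight Mamba-based state space modules into the LVLM architecture, SSMI effectively captures long-range dependencies and injects task-specific visual and sequential patterns. Unlike traditional fine-tuning methods, SSMI requires updating only a small fraction of the model's parameters, making it computationally efficient and scalable. Experiments on benchmark datasets, including COCO Captioning, VQA, and Flickr30k, show that SSMI achieves state-of-the-art performance while maintaining robustness and generalization. Comprehensive analysis further validates the advantages of SSMI in efficiency, adaptability, and interpretability, positioning it as a compelling solution for fine-tuning large-scale vision-language models.|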
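|**2024-12-12**|[BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation](http://arxiv.org/abs/2412.09718)|null|The emergence of large pre-trained vision-language models (VLMs) represents a paradigm shift in machine learning, with unprecedented results across a broad range of visual recognition tasks. CLIP, one of the most popular VLMs, has shown remarkable zero-shot and transfer learning capabilities in classification. To transfer CLIP to downstream tasks, adapters constitute a parameter-efficient approach that avoids backpropagation through the large model (unlike related prompt learning methods). However, CLIP adapters have been developed with discriminative performance in mind, and the quality of their uncertainty estimates has been overlooked. In this work, we show that the discriminative performance of state-of-the-art CLIP adapters does not always correlate with their uncertainty estimation capabilities, which are essential for safe deployment in real-world scenarios. We also demonstrate that one of these adapters is obtained through maximum a posteriori (MAP) inference in a more general probabilistic framework. Based on this observation, we introduce BayesAdapter, which leverages Bayesian inference to estimate a full probability distribution instead of a single point estimate, better capturing the variability inherent in the parameter space. In a comprehensive empirical evaluation, we show that our approach obtains high-quality uncertainty estimates in its predictions, standing out in calibration and selective classification. Our code is publicly available at: https://github.com/pablomorales92/BayesAdapter.|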
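|**2024-12-13**|[V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding](http://arxiv.org/abs/2412.09616)|**[link](https://github.com/opengvlab/pe)**|Vision-language models (VLMs) have shown promising capabilities on a variety of multimodal tasks, but they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents. In this work, we first conduct an empirical analysis of the long-context capabilities of VLMs using our augmented long-context multimodal datasets. Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and that VLM performance degrades sharply when the position encoding exceeds the model's context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that uses variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. Our experiments show that V2PE effectively enhances VLMs' ability to understand and reason over long multimodal contexts. We further combine V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM InternVL2. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model can process multimodal sequences of up to 1M tokens, highlighting its potential for real-world long-context applications.|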
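|**2024-12-12**|[PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models](http://arxiv.org/abs/2412.09613)|null|Large vision-language models (VLMs) have been extended to understand both images and videos. Visual token compression is used to reduce the considerable token length of visual inputs. To meet the needs of different tasks, existing high-performance models usually process images and videos separately with different token compression strategies, which limits their ability to combine images and videos. To this end, we extend each image into a "static" video and introduce a unified token compression strategy called Progressive Visual Token Compression (PVC), in which the tokens of each frame are progressively encoded and adaptively compressed to supplement the information not extracted from previous frames. Video tokens are compressed efficiently by exploiting the inherent temporal redundancy. Images are repeated as static videos, and spatial details can be gradually supplemented across multiple frames. PVC thus unifies token compression for images and videos. With a limited number of tokens per frame (64 tokens by default), spatial details and temporal changes can still be preserved. Experiments show that our model achieves state-of-the-art performance on various video understanding benchmarks, including long-video tasks and fine-grained short-video tasks. Meanwhile, our unified token compression strategy incurs no performance loss on image benchmarks, particularly on detail-sensitive tasks.|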
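|**2024-12-12**|[Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM](http://arxiv.org/abs/2412.09530)|**[link](https://github.com/hon-wong/bytevideollm)**|The application of large vision-language models (LVLMs) to image and video analysis is an exciting and rapidly evolving field. In recent years, we have seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but comparable datasets for video are still lacking. In addition, many VideoLLMs are extensions of single-image VLMs and may not handle the complexity of longer videos efficiently. In this study, we introduce a large-scale synthetic dataset created with proprietary models, using carefully designed prompts to cover a wide range of questions. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench. Code is available at https://github.com/Hon-Wong/ByteVideoLLM.|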
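|**2024-12-12**|[Embeddings are all you need! Achieving High Performance Medical Image Classification through Training-Free Embedding Analysis](http://arxiv.org/abs/2412.09445)|null|Developing artificial intelligence (AI) and machine learning (ML) models for medical imaging typically involves extensive training and testing on large datasets, consuming significant computational time, energy, and resources. More efficient methods are needed that achieve comparable or superior diagnostic performance without the associated resource burden. We investigate the feasibility of replacing conventional training procedures with an embedding-based approach that leverages concise and semantically meaningful representations of medical images. Using pre-trained foundation models, specifically convolutional neural networks (CNNs) such as ResNet and multimodal models such as Contrastive Language-Image Pre-training (CLIP), we generate image embeddings for multi-class classification tasks. Simple linear classifiers are then applied to these embeddings. The approach is evaluated across diverse medical imaging modalities, including retinal images, mammography, dermatoscopic images, and chest radiographs. Performance is compared with benchmark models trained and tested using traditional methods. In multi-class classification tasks across these modalities, the embedding-based models exceed the area under the receiver operating characteristic curve (AUC-ROC) scores of the benchmark models by 87 percentage points. Notably, the CLIP embedding models achieve the highest AUC-ROC scores, demonstrating superior classification performance while significantly reducing computational demands. Our study indicates that embeddings from pre-trained foundation models can effectively replace conventional, resource-intensive training and testing procedures in medical image analysis. This embedding-based approach offers a more efficient alternative for image segmentation, classification, and prediction, potentially accelerating the integration of AI technology into clinical practice.|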
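|**2024-12-12**|[Causal Graphical Models for Vision-Language Compositional Understanding](http://arxiv.org/abs/2412.09353)|**[link](https://github.com/aimagelab/COGT)**|Recent work has shown empirically that vision-language models (VLMs) struggle to fully understand the compositional properties of human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities in a sentence (subject, verb, etc.) and their mutual relationships. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built with a dependency parser, and we train a decoder conditioned on the VLM visual encoder. Unlike standard autoregressive or parallel prediction, our decoder's generative process is partially ordered following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence and to discard spurious correlations. Through extensive experiments on five compositional benchmarks, we show that our method significantly outperforms all state-of-the-art compositional approaches and also improves over methods trained on much larger datasets.|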
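|**2024-12-12**|[GaGA: Towards Interactive Global Geolocation Assistant](http://arxiv.org/abs/2412.08907)|null|Global geolocation, which seeks to predict the geographic location of images captured anywhere in the world, is one of the most challenging tasks in computer vision. In this paper, we introduce GaGA, an innovative interactive global geolocation assistant built on large vision-language models (LVLMs). GaGA uncovers geographic clues within images and combines them with the extensive world knowledge embedded in LVLMs to determine the geolocation, while also providing justifications and explanations for its predictions. We further design a novel interactive geolocation method that goes beyond traditional static inference approaches. It allows users to intervene in, correct, or provide clues for the predictions, making the model more flexible and practical. The development of GaGA relies on the newly proposed Multi-modal Global Geolocation (MG-Geo) dataset, a comprehensive collection of 5 million high-quality image-text pairs. GaGA achieves state-of-the-art performance on the GWS15k dataset, improving accuracy by 4.57% at the country level and 2.92% at the city level and setting a new benchmark. These advances represent a significant leap forward in developing highly accurate, interactive geolocation systems with global applicability.|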
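|**2024-12-11**|[DocVLM: Make Your VLM an Efficient Reader](http://arxiv.org/abs/2412.08746)|null|Vision-language models (VLMs) excel at diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive applications demand high resolution, which incurs significant computational overhead. Using OCR-extracted text in VLM prompts partially addresses this issue but underperforms full-resolution images, because it lacks the complete visual context needed for optimal performance. We introduce DocVLM, a method that integrates an OCR-based modality into VLMs to enhance document processing while preserving the original weights. Our approach employs an OCR encoder to capture textual content and layout, compresses this information into a compact set of learned queries, and incorporates the queries into the VLM. Comprehensive evaluations across leading VLMs show that DocVLM significantly reduces the reliance on high-resolution images for document understanding. In limited-token regimes (448×448), DocVLM with 64 learned queries improves DocVQA results from 56.0% to 86.6% when integrated with InternVL2, and from 84.4% to 91.2% with Qwen2-VL. In LLaVA-OneVision, DocVLM achieves better results while using 80% fewer image tokens. The reduced token usage allows multiple pages to be processed efficiently, yielding impressive zero-shot results on DUDE and state-of-the-art performance on MP-DocVQA, which highlights DocVLM's potential for applications that require both high performance and efficiency.|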
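|**2024-12-11**|[StreamChat: Chatting with Streaming Video](http://arxiv.org/abs/2412.08646)|null|This paper presents StreamChat, a novel approach that enhances the ability of large multimodal models (LMMs) to interact with streaming video content. In streaming interaction scenarios, existing methods rely solely on the visual information available at the moment a question is posed, so the model remains unaware of subsequent changes in the streaming video, which causes significant delays. StreamChat addresses this by innovatively updating the visual context at each decoding step, ensuring that the model uses up-to-date video content throughout decoding. In addition, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interaction. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results show that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared with state-of-the-art video LMMs.|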
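|**2024-12-11**|[Multimodal Latent Language Modeling with Next-Token Diffusion](http://arxiv.org/abs/2412.08635)|**[link](https://github.com/microsoft/unilm/tree/master/LatentLM)**|Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion to generate these vectors autoregressively. In addition, we develop $\sigma$-VAE to address the challenge of variance collapse, which is crucial for autoregressive modeling. Extensive experiments validate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves better performance than Transfusion and vector-quantized models in the setting of scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness while requiring 10 times fewer decoding steps. These results establish LatentLM as a highly effective and scalable approach for advancing large multimodal models.|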
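|**2024-12-11**|[Synthetic Vision: Training Vision-Language Models to Understand Physics](http://arxiv.org/abs/2412.08619)|null|Physical reasoning, which involves interpreting, understanding, and predicting the behavior of objects in dynamic environments, remains a significant challenge for current vision-language models (VLMs). In this work, we propose two methods to enhance the physical reasoning capabilities of VLMs using simulated data. First, we fine-tune a pre-trained VLM on question-answer (QA) pairs generated from simulations relevant to physical reasoning tasks. Second, we introduce Physics Context Builders (PCBs), specialized VLMs fine-tuned to create scene descriptions enriched with physical properties and processes. During physical reasoning tasks, these PCBs can be used as context to assist a large language model (LLM) and improve its performance. We evaluate both approaches on multiple benchmarks, including a new stability detection QA dataset called Falling Tower, which contains both simulated and real-world scenes, and CLEVRER. We show that a small QA fine-tuned VLM can significantly outperform larger state-of-the-art foundation models. We also show that integrating PCBs boosts the performance of foundation LLMs on physical reasoning tasks. Using the real-world scenes from the Falling Tower dataset, we also validate the robustness of both approaches in Sim2Real transfer. Our results highlight the utility of simulated data in creating learning systems capable of advanced physical reasoning.|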
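|**2024-12-10**|[BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities](http://arxiv.org/abs/2412.07769)|**[link](https://github.com/mbzuai-oryx/bimedix2)**|This paper introduces BiMediX2, a bilingual (Arabic-English) bio-medical expert large multimodal model (LMM) with a unified architecture that integrates text and visual modalities, enabling advanced image understanding and medical applications. BiMediX2 leverages the Llama3.1 architecture and integrates text and visual capabilities to allow seamless interaction in both English and Arabic, supporting text-based inputs and multi-turn conversations involving medical images. The model is trained on an extensive bilingual healthcare dataset of 1.6 million samples of diverse medical interactions covering both text and image modalities, mixed in Arabic and English. We also propose the first bilingual GPT-4o-based medical LMM benchmark, named BiMed-MBench. BiMediX2 is benchmarked on both text-based and image-based tasks, achieving state-of-the-art performance across several medical benchmarks. It outperforms recent state-of-the-art models on medical LLM evaluation benchmarks. Our model also sets a new benchmark in multimodal medical evaluations, with improvements of more than 9% in English evaluations and more than 20% in Arabic evaluations. In addition, it surpasses GPT-4 by around 9% in UPHILL factual accuracy evaluations and excels in various medical visual question answering, report generation, and report summarization tasks. The project page, including source code and the trained model, is available at https://github.com/mbzuai-oryx/BiMediX2.|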
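|**2024-12-10**|[DriveMM: All-in-One Large Multimodal Model for Autonomous Driving](http://arxiv.org/abs/2412.07689)|**[link](https://github.com/zhijian11/DriveMM)**|Large multimodal models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in autonomous driving (AD) by incorporating large language models. Despite these advances, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting overall capability and generalization. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad range of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess its general capabilities and generalization, we evaluate on six public benchmarks and perform zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM will be a promising solution for future end-to-end autonomous driving applications in the real world.|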
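|**2024-12-10**|[RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models](http://arxiv.org/abs/2412.07679)|**[link](https://github.com/nvlabs/radio)**|Agglomerative models have recently emerged as a powerful approach to training vision foundation models, leveraging multi-teacher distillation from existing models such as CLIP, DINO, and SAM. This strategy enables the efficient creation of robust models that combine the strengths of individual teachers while significantly reducing computational and resource demands. In this paper, we thoroughly analyze state-of-the-art agglomerative models and identify critical challenges, including resolution mode shifts, teacher imbalance, idiosyncratic teacher artifacts, and an excessive number of output tokens. To address these issues, we propose several novel solutions: multi-resolution training, mosaic augmentation, and improved balancing of teacher loss functions. Specifically, in the context of vision-language models, we introduce a token compression technique to maintain high-resolution information within a fixed token count. We release our top-performing models, available at multiple scales (-B, -L, -H, and -g), along with inference code and pretrained weights.|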
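|**2024-12-10**|[DRUM: Learning Demonstration Retriever for Large MUlti-modal Models](http://arxiv.org/abs/2412.07619)|null|Recently, large language models (LLMs) have demonstrated an impressive ability to handle new tasks with the help of in-context learning (ICL). In studies of large vision-language models (LVLMs), researchers implementing ICL usually adopt naive strategies, such as fixed demonstrations across different samples, or selecting demonstrations directly with a visual-language embedding model. These methods do not guarantee that the configured demonstrations fit the needs of the LVLM. To address this issue, we propose a novel framework, the demonstration retriever for large multi-modal models (DRUM), which fine-tunes the visual-language embedding model to better meet the LVLM's needs. First, we discuss retrieval strategies for a visual-language task, assuming an embedding model is given, and propose concatenating the image and text embeddings to improve retrieval performance. Second, we propose re-ranking the demonstrations retrieved by the embedding model using feedback from the LVLM, and computing a list-wise ranking loss to train the embedding model. Third, we propose an iterative demonstration mining strategy to improve the training of the embedding model. Through extensive experiments on 3 types of visual-language tasks and 7 benchmark datasets, our DRUM framework is shown to be effective in boosting the LVLM's in-context learning performance by retrieving more suitable demonstrations.|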
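|**2024-12-10**|[Hallucination Elimination and Semantic Enhancement Framework for Vision-Language Models in Traffic Scenarios](http://arxiv.org/abs/2412.07518)|**[link](https://github.com/fjq-tongji/hcoenet)**|Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation tasks. However, these models occasionally generate hallucinatory text, producing descriptions that seem plausible but do not match the image. This phenomenon can lead autonomous driving systems to make wrong driving decisions. To address this challenge, this paper proposes HCOENet, a plug-and-play chain-of-thought correction method designed to eliminate object hallucinations and generate enhanced descriptions for critical objects overlooked in the initial response. Specifically, HCOENet employs a cross-checking mechanism to filter entities and extracts critical objects directly from the given image to enrich the descriptive text. Experimental results on the POPE benchmark show that HCOENet improves the F1 score of the Mini-InternVL-4B and mPLUG-Owl3 models by 12.58% and 4.28%, respectively. In addition, qualitative results using images collected in open campus scenes further highlight the practical applicability of the method. Compared with the GPT-4o model, HCOENet achieves comparable descriptive performance at significantly lower cost. Finally, we create two new semantic understanding datasets for traffic scenarios, CODA_desc and nuScenes_desc, to support future research. The code and datasets are publicly available at https://github.com/fjq-tongji/HCOENet.|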
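|**2024-12-10**|[SmartAgent: Chain-of-User-Thought for Embodied Personalized Agent in Cyber World](http://arxiv.org/abs/2412.07472)|**[link](https://github.com/tsinghua-fib-lab/smartagent)**|Recent embodied agents built on the multimodal perception and reasoning capabilities of large vision-language models (LVLMs) have advanced considerably. They excel at interacting autonomously in real or cyber worlds, helping people make intelligent decisions in complex environments. However, current work is usually optimized through golden action trajectories or ideal task-oriented solutions toward a definitive goal. This paradigm considers very few user-oriented factors, which may explain the performance drop in a wide range of personal assistant applications. To address this, we propose Chain-of-User-Thought (COUT), a novel embodied reasoning paradigm that follows a chain of thought from basic action thinking to explicit and implicit personalized preference thinking, incorporating personalized factors into autonomous agent learning. To implement COUT, we introduce SmartAgent, an agent framework that perceives cyber environments and reasons about personalized requirements by 1) interacting with a GUI to access an item pool, 2) generating the user's explicit requirements implied by previous actions, and 3) recommending items to fulfill the user's implicit requirements. To demonstrate SmartAgent's capabilities, we also create a brand-new dataset, SmartSpot, which offers a full-stage environment with personalized actions. To the best of our knowledge, our work is the first to formulate the COUT process and serves as a preliminary attempt at embodied personalized agent learning. Our extensive experiments on SmartSpot demonstrate SmartAgent's functionality across a series of embodied and personalized sub-tasks. We will release code and data upon paper acceptance at https://github.com/tsinghua-fib-lab/SmartAgent.|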
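|**2024-12-10**|[MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models](http://arxiv.org/abs/2412.07148)|**[link](https://github.com/souradipp76/mm-poe)**|This paper introduces Multiple Choice Reasoning via Process of Elimination using Multi-Modal models, abbreviated MM-PoE. This novel methodology is designed to enhance the efficacy of vision-language models (VLMs) on multiple-choice visual reasoning tasks. Unlike conventional approaches that evaluate each option independently, MM-PoE employs a two-step scoring paradigm that first identifies and excludes implausible choices and then concentrates on the most probable remaining options. This method emulates human test-taking strategies, in which people typically eliminate clearly incorrect answers before selecting the best one. Our empirical evaluations on three benchmark datasets show that MM-PoE significantly improves both the zero-shot and few-shot performance of contemporary state-of-the-art VLMs. Critically, this approach not only extends the elimination process to multimodal contexts but also allows few-shot experimentation, thereby addressing two principal limitations of prior PoE work, namely its use only in zero-shot settings and only within language-only frameworks. As a result, MM-PoE not only refines the reasoning capabilities of VLMs but also broadens their applicability to complex visual question answering scenarios. All code and documentation supporting our work are available at https://pypi.org/project/mm-poe/, enabling researchers and practitioners to easily integrate and further develop these techniques.|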
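|**2024-12-09**|[Visual Lexicon: Rich Image Features in Language Space](http://arxiv.org/abs/2412.06774)|null|We present Visual Lexicon (ViLex), a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often difficult to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As image embeddings in the language space, ViLex tokens leverage the compositionality of natural language, allowing them to be used independently as "text tokens" or combined with natural-language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs). Experiments show that ViLex achieves higher fidelity in image reconstruction than text embeddings, even with a single ViLex token. Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner without fine-tuning the T2I model. In addition, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline.|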
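|**2024-12-09**|[The Narrow Gate: Localized Image-Text Communication in Vision-Language Models](http://arxiv.org/abs/2412.06646)|null|Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, focusing on how visual information is processed and transferred to the textual domain. We compare VLMs that generate both images and text with VLMs that output only text, highlighting key differences in information flow. We find that in models with multimodal outputs, image and text embeddings are more separated within the residual stream. The models also differ in how visual information is passed to text tokens. VLMs that output only text exhibit a distributed communication pattern in which information is exchanged through multiple image tokens. In contrast, models trained for both image and text generation rely on a single token that acts as a narrow gate for visual information. We show that ablating this single token significantly degrades performance on image-understanding tasks. Furthermore, modifying this token effectively steers the image semantics, indicating that targeted, local interventions can reliably control the model's global behavior.|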
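|**2024-12-09**|[Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels](http://arxiv.org/abs/2412.06461)|null|As large multimodal models (LMMs) are increasingly deployed across diverse applications, the need for adaptable, real-world model ranking has become critical. Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics. They are resource-intensive and may not generalize to new scenarios, which highlights the importance of unsupervised ranking. In this work, we explore unsupervised ranking of LMMs by leveraging their uncertainty signals, such as softmax probabilities. We evaluate state-of-the-art LMMs (e.g., LLaVA) on visual question answering benchmarks and analyze how well uncertainty-based metrics reflect model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust and consistent basis for ranking models across varied tasks. This makes it possible to rank LMMs for visual question answering on real-world, unlabeled data, offering a practical way to select models across diverse domains without manual annotation.|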
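|**2024-12-06**|[Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies](http://arxiv.org/abs/2412.05155)|**[link](https://github.com/firatcekinel/Multimodal-Fact-Checking-with-Vision-Language-Models)**|This study evaluates the effectiveness of vision-language models (VLMs) in representing and using multimodal content for fact-checking. Specifically, we investigate whether incorporating multimodal content improves performance over text-only models, and how well VLMs use text and image information to improve misinformation detection. We also propose a probing-classifier-based solution built on VLMs. Our approach extracts embeddings from the last hidden layer of selected VLMs and feeds them into a neural probing classifier for multi-class veracity classification. Through a series of experiments on two fact-checking datasets, we show that although multimodality can improve performance, fusing separate embeddings from text and image encoders yields better results than using VLM embeddings. Moreover, the proposed neural classifier significantly outperforms KNN and SVM baselines at exploiting the extracted embeddings, highlighting its effectiveness for multimodal fact-checking.|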
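|**2024-12-06**|[Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora](http://arxiv.org/abs/2412.05149)|null|The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or fewer. This year, we released improved text corpora as well as a vision-and-language corpus to facilitate research into cognitively plausible vision-language models. Submissions were compared on evaluation tasks targeting grammatical ability, (visual) question answering, pragmatic abilities, and grounding, among others. Participants could submit to a 10M-word text-only track, a 100M-word text-only track, and/or a 100M-word image-text multimodal track. Among the 31 submissions employing diverse methods, a hybrid causal-masked language model architecture outperformed other approaches. No submission outperformed the baselines in the multimodal track. In follow-up analyses, we found a strong relationship between training FLOPs and average performance across tasks, and that the best-performing submissions proposed changes to the training data, training objective, and model architecture. This year's BabyLM Challenge shows that there is still substantial room for innovation in this setting, particularly for image-text modeling, and that community-driven research can yield actionable insights into effective strategies for small-scale language modeling.|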
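|**2024-12-06**|[Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model](http://arxiv.org/abs/2412.04729)|null|Most current vision-language models (VLMs) for video struggle to understand videos longer than a few seconds, primarily because they cannot scale to a large number of frames. To address this limitation, we propose Espresso, a novel method that extracts and compresses spatial and temporal information separately. Through extensive evaluations, we show that spatial and temporal compression in Espresso each have a positive impact on long-form video understanding, and that the benefit increases when the two are combined. We further show that Espresso's performance scales with more training data and that Espresso is far more effective than existing VLM projectors for long-form video understanding. We also devise a more difficult evaluation setting for EgoSchema, called "needle-in-a-haystack", which increases the length of the input video. Espresso achieves state-of-the-art (SOTA) performance on this task, outperforming SOTA VLMs that were trained on much more data.|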
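|**2024-12-05**|[Cross-Self KV Cache Pruning for Efficient Vision-Language Inference](http://arxiv.org/abs/2412.04652)|**[link](https://github.com/terrypei/csp)**|KV cache pruning has emerged as a promising technique for reducing memory and computation costs in long-context autoregressive generation. Existing methods for vision-language models (VLMs) typically rely on self-attention scores from large language models (LLMs) to identify and prune irrelevant tokens. However, these methods overlook the inherent distributional discrepancies between modalities, which often leads to inaccurate estimates of token importance and over-pruning of critical visual tokens. To address this, we propose decomposing attention scores into intra-modality attention (within the same modality) and inter-modality attention (across modalities), so that the KV cache can be pruned more precisely by managing these attention types independently. We also introduce an n-softmax function to counteract the distribution shift caused by pruning, preserving the original smoothness of the attention scores and ensuring stable performance. Our final training-free method, Cross-Self Pruning (CSP), achieves competitive performance compared with models using the full KV cache while significantly outperforming previous pruning methods. Extensive evaluations on MileBench, a benchmark comprising 29 multimodal datasets, demonstrate CSP's effectiveness, with up to a 41% performance improvement on challenging tasks such as conversational embodied dialogue while reducing the KV cache budget by 13.6%. Code is available at https://github.com/TerryPei/CSP.|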
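|**2024-12-05**|[Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation](http://arxiv.org/abs/2412.04533)|**[link](https://github.com/hustvl/maskadapter)**|Recent open-vocabulary segmentation methods use a mask generator to predict segmentation masks and then classify those masks with a pretrained vision-language model (e.g., CLIP) via mask pooling. Although these methods show promising results, it is counterintuitive that accurate masks often fail to yield accurate classification results when CLIP image embeddings are pooled within the mask regions. In this paper, we reveal the performance limitations of mask pooling and introduce Mask-Adapter, a simple yet effective method that addresses these challenges in open-vocabulary segmentation. Rather than using the proposal masks directly, Mask-Adapter extracts semantic activation maps from them, providing richer contextual information and ensuring alignment between the masks and CLIP. We also propose a mask consistency loss that encourages proposal masks with similar IoU to obtain similar CLIP embeddings, improving the model's robustness to varying predicted masks. Mask-Adapter integrates seamlessly into mask-pooling-based open-vocabulary segmentation methods in a plug-and-play manner and delivers more accurate classification results. Extensive experiments across several zero-shot benchmarks show significant performance gains for Mask-Adapter on several well-established methods. Notably, Mask-Adapter also extends effectively to SAM and achieves impressive results on several open-vocabulary segmentation datasets. Code and models are available at https://github.com/hustvl/MaskAdapter.|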
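|**2024-12-05**|[VisionZip: Longer is Better but Not Necessary in Vision Language Models](http://arxiv.org/abs/2412.04467)|**[link](https://github.com/dvlab-research/visionzip)**|Recent advances in vision-language models have improved performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational cost. However, we observe that the visual tokens generated by popular vision encoders such as CLIP and SigLIP contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens as input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. VisionZip can be applied broadly to image and video understanding tasks and is well suited to multi-turn dialogue in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% in nearly all settings. Our method also significantly improves inference speed, improving prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Finally, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip.|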
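|**2024-12-05**|[Grounding Descriptions in Images informs Zero-Shot Visual Recognition](http://arxiv.org/abs/2412.04429)|**[link](https://github.com/shaunak27/grain-clip)**|Vision-language models (VLMs) such as CLIP are valued for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation has the highest similarity with the query image. Although successful in some domains, this approach struggles to identify fine-grained entities and to generalize to unseen concepts not captured by the training distribution. Recent work attempts to mitigate these challenges by integrating category descriptions at test time, but the improvements are modest. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy that aligns representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions while aligning overarching captions with global image representations. To drive this pretraining, we use frozen multimodal large language models (MLLMs) to derive large-scale synthetic annotations. We demonstrate improved zero-shot performance over current state-of-the-art methods on 11 diverse image classification datasets. We also introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and show our model's ability to recognize these concepts by benchmarking on it. Significant improvements on other downstream tasks such as retrieval further highlight the quality of the representations learned by our approach. Code is available at https://github.com/shaunak27/grain-clip.|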
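|**2024-12-05**|[SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model](http://arxiv.org/abs/2412.04292)|null|The rapid progress of generative models in creating highly realistic images poses substantial risks for the spread of misinformation. For example, a synthetic image shared on social media can mislead large audiences and erode trust in digital content, with severe consequences. Despite some progress, academia has not yet created a large and diverse deepfake detection dataset for social media, nor devised an effective solution to the problem. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) large volume, with 300K AI-generated/tampered and authentic images with comprehensive annotations; (2) broad diversity, covering fully synthetic and tampered images across various classes; and (3) high realism, with images that are mostly indistinguishable from real ones by visual inspection alone. Furthermore, leveraging the strong capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of an image but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Extensive experiments show that SIDA achieves superior performance across diverse settings compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks. The code, models, and dataset will be released publicly.|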
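|**2024-12-05**|[3D Part Segmentation via Geometric Aggregation of 2D Visual Features](http://arxiv.org/abs/2412.04247)|**[link](https://github.com/marco-garosi/COPS)**|Supervised 3D part segmentation models are tailored to a fixed set of objects and parts, which limits their transferability to open-set, real-world scenarios. Recent work has explored vision-language models (VLMs) as a promising alternative, using multi-view rendering and textual prompting to identify object parts. However, naively applying VLMs in this context introduces several drawbacks, such as the need for meticulous prompt engineering and the failure to exploit the 3D geometric structure of objects. To address these limitations, we propose COPS, a comprehensive model for part segmentation that blends semantics extracted from visual concepts with 3D geometry to identify object parts effectively. COPS renders a point cloud from multiple viewpoints, extracts 2D features, projects them back to 3D, and uses a novel geometry-aware feature aggregation procedure to ensure spatial and semantic consistency. Finally, it clusters points into parts and labels them. We show that COPS is efficient and scalable, and that it achieves zero-shot state-of-the-art performance on five datasets covering synthetic and real-world data, texture-less and colored objects, and rigid and non-rigid shapes. Code is available at https://3d-cops.github.io.|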
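|**2024-12-05**|[CALMM-Drive: Confidence-Aware Autonomous Driving with Large Multimodal Model](http://arxiv.org/abs/2412.04209)|null|Decision-making and motion planning are critical to the safety and efficiency of autonomous vehicles (AVs). Existing approaches typically follow one of two paradigms: decision then planning, or generation then scoring. The former often struggles to align decisions with planning, while the latter faces significant challenges in integrating short-term operational utility with long-term tactical efficacy. To address these issues, we introduce CALMM-Drive, a novel autonomous driving framework powered by confidence-aware large multimodal models (LMMs). Our approach uses Top-K confidence elicitation, which facilitates generating multiple candidate decisions together with their confidence levels. We also propose a new planning module that integrates a diffusion model for trajectory generation with a hierarchical refinement process for finding the optimal path. The framework selects the best plan by accounting for both low-level solution quality and high-level tactical confidence, which reduces the risk of one-shot decisions and overcomes the limitations of short-sighted scoring mechanisms. Comprehensive evaluations in the nuPlan closed-loop simulation environment demonstrate that CALMM-Drive achieves reliable and flexible driving performance, a significant step toward integrating uncertainty into LMM-powered AVs. The code will be released upon acceptance.|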
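|**2024-12-05**|[AIpparel: A Large Multimodal Generative Model for Digital Garments](http://arxiv.org/abs/2412.03937)|null|Apparel is essential to human life: it offers protection, reflects cultural identity, and expresses personal style. Yet creating garments remains a time-consuming process, largely because of the manual work involved in designing them. To simplify this process, we introduce AIpparel, a large multimodal model for generating and editing sewing patterns. Our model fine-tunes state-of-the-art large multimodal models (LMMs) on a custom-curated large-scale dataset of over 120,000 unique garments, each with multimodal annotations including text, images, and sewing patterns. We also propose a novel tokenization scheme that concisely encodes these complex sewing patterns so that large language models (LLMs) can learn to predict them efficiently. AIpparel achieves state-of-the-art performance on single-modal tasks, including text-to-garment and image-to-garment prediction, and enables novel multimodal garment generation applications such as interactive garment editing. The project website is at georgenakayama.github.io/AIpparel/.|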
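|**2024-12-05**|[MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models](http://arxiv.org/abs/2412.03927)|null|In vision-language models (VLMs), the ability to perceive and interpret color and physical environment is crucial for contextually accurate understanding and interaction. Despite advances in multimodal modeling, there remains a significant lack of specialized datasets that rigorously evaluate a model's capacity to discern subtle color variations and spatial context, which are key elements for situational comprehension and reliable deployment in real-world applications. To this end, we curate MegaCOIN, a high-quality, human-labeled dataset based on real images with various contextual attributes. MegaCOIN consists of two parts: MegaCOIN-Instruct, which serves as a supervised fine-tuning (SFT) dataset for VLMs, and MegaCOIN-Bench, an annotated test set that can be used as a standalone question answering dataset. MegaCOIN provides three annotated features for 220,000 real images: foreground color, background color, and a description of the object's physical environment, for a total of 660k human annotations. MegaCOIN can also be used to benchmark domain generalization (DG) algorithms. We explore benchmarking DG methods in the linear probing setup for VLMs and present some new insights. Last but not least, we find that VLMs, including GPT-4o, have subpar color recognition capabilities, and that fine-tuning with MegaCOIN improves performance on visual evaluation tasks. In certain cases, small-scale open-source models such as LLaVA and Bunny fine-tuned with MegaCOIN can outperform the closed-source GPT-4o. We hope the utility of MegaCOIN can shed light on directions for improving VLMs and provide a more complex platform for domain generalization algorithms.|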
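|**2024-12-05**|[CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance](http://arxiv.org/abs/2412.03871)|null|Beyond the success of Contrastive Language-Image Pre-training (CLIP), recent trends mark a shift toward exploring lightweight vision-language models for resource-constrained scenarios. These models often perform suboptimally when they rely solely on a single image-text contrastive learning objective, which highlights the need for more effective training mechanisms that guarantee robust cross-modal feature alignment. In this work, we propose CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance, a simple and efficient training paradigm designed to boost lightweight vision-language models with minimal computational overhead and lower data demands. CLIP-PING uses unimodal features extracted from arbitrary pretrained encoders to obtain intrinsic guidance from proximus neighbor samples, namely nearest neighbors (NN) and cross nearest neighbors (XNN). We find that the extra contrastive supervision from these neighbors substantially improves cross-modal alignment, enabling lightweight models to learn more generic features with rich semantic diversity. Extensive experiments show that CLIP-PING notably surpasses its peers on zero-shot generalization and cross-modal retrieval tasks. Specifically, compared with the original CLIP using a ViT-XS image encoder trained on 3 million (image, text) pairs, CLIP-PING gains 5.5% on zero-shot ImageNet1K, and 10.7% and 5.7% on Flickr30K image-to-text (I2T) and text-to-image (T2I) retrieval, respectively. CLIP-PING also shows strong transferability under the linear evaluation protocol across several downstream tasks.|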
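|**2024-12-05**|[LL-ICM: Image Compression for Low-level Machine Vision via Large Vision-Language Model](http://arxiv.org/abs/2412.03841)|null|Image Compression for Machines (ICM) aims to compress images for machine vision tasks rather than for human viewing. Current work focuses predominantly on high-level tasks such as object detection and semantic segmentation. However, the quality of original images in the real world is usually not guaranteed, leading to even worse perceptual quality or downstream task performance after compression. Low-level (LL) machine vision models, such as image restoration models, can help improve this quality, so their compression requirements should also be considered. In this paper, we propose a pioneering ICM framework for LL machine vision tasks, namely LL-ICM. By jointly optimizing compression and the LL tasks, LL-ICM not only strengthens its encoding ability to generalize to various LL tasks but also optimizes the processing ability of the downstream LL task models, achieving mutual adaptation between the image codec and the LL task models. We further integrate large-scale vision-language models into the LL-ICM framework to generate more universal and distortion-robust feature embeddings for LL vision tasks. A single LL-ICM codec can therefore generalize to multiple tasks. We establish a solid benchmark to evaluate LL-ICM, including extensive objective experiments using both full-reference and no-reference image quality assessments. Experimental results show that LL-ICM achieves a 22.65% BD-rate reduction over state-of-the-art methods.|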
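|**2024-12-04**|[Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension](http://arxiv.org/abs/2412.03704)|**[link](https://github.com/si0wang/visvm)**|Despite significant advances in vision-language models (VLMs), effective approaches for improving response quality by scaling inference-time computation are still lacking. In recent large language model research, this capability is regarded as a core step toward self-improving models. In this paper, we present the Vision Value Model (VisVM), which guides VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated at the current search step but also anticipates the quality of subsequent sentences that may follow from it, providing a long-term value. In this way, VisVM steers VLMs away from generating sentences that are prone to hallucination or lack detail, producing higher-quality responses. Experimental results show that VisVM-guided search significantly improves VLMs' ability to generate descriptive captions with richer visual detail and fewer hallucinations, compared with greedy decoding and search methods that use other visual reward signals. We also find that self-training the model on VisVM-guided captions improves VLM performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.|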
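|**2024-12-03**|[CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs](http://arxiv.org/abs/2412.02602)|null|This paper analyzes the performance of small language models (SLMs) and vision-language models (VLMs) and evaluates the trade-off between model performance and carbon emissions across four essential tasks: image captioning, visual question answering (VQA), dialogue summarization, and text-to-SQL conversion. Various SLMs and VLMs from the Qwen and LLaMA architecture families are selected, and variants based on model size (number of parameters, quantization level, and fine-tuning parameters) are evaluated. The performance and carbon emissions of each model variant are computed. To quantify the trade-off between model performance and carbon emissions, we introduce a novel metric called CEGI (Carbon Efficient Gain Index). The metric represents the carbon emitted per unit percentage gain per million trainable parameters. It provides a normalized measure for comparing a model's efficiency in terms of performance improvement relative to its environmental cost. Experimental results show that fine-tuned SLMs and VLMs can reach performance levels comparable to large language models (LLMs) while producing significantly lower carbon emissions. Our findings suggest that the marginal accuracy gains from larger models do not justify the substantial increase in carbon emissions. With lower-bit quantization levels, the proposed metric shows that energy efficiency can be improved further without compromising performance. This study highlights the balance between high performance and environmental sustainability, and offers a valuable metric for selecting models suited to environmentally friendly AI development.|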
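|**2024-12-03**|[SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection](http://arxiv.org/abs/2412.02565)|**[link](https://github.com/jw-chae/sjtu)**|Despite advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge for modern AI systems. Existing vision-language models, which rely primarily on backbone architectures or CLIP-based embedding learning, show inherent limitations in fine-grained spatial localization and manipulation. This paper introduces SJTU: Spatial Judgments in multimodal models Towards Unified segmentation through coordinate detection, a novel framework that uses spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification from natural language instructions. The framework proposes a new approach that combines segmentation techniques with vision-language models based on multimodal spatial reasoning. By using normalized coordinate detection for bounding boxes and converting it into actionable segmentation outputs, we explore the possibility of integrating multimodal spatial and language representations. Based on the proposed technical approach, the framework shows strong performance and accurate object segmentation on various benchmark datasets. Results on the COCO 2017 dataset for general object detection and the Pascal VOC dataset for semantic segmentation demonstrate the framework's generalization ability.|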
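|**2024-12-03**|[BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding](http://arxiv.org/abs/2412.02449)|null|Dynamic scene understanding remains a persistent challenge in robotic applications. Early dynamic mapping methods focused on mitigating the negative influence of short-term dynamic objects on camera motion estimation by masking or tracking specific categories, an approach that often falls short in adapting to long-term scene changes. Recent work has attempted to address object association in long-term dynamic environments using neural networks trained on synthetic datasets, but these methods still rely on predefined object shapes and categories. Other methods incorporate visual, geometric, or semantic heuristics for association but often lack robustness. In this work, we introduce BYE, a class-agnostic, per-scene point cloud encoder that removes the need for predefined categories, shape priors, or extensive association datasets. Trained on only a single sequence of exploration data, BYE can efficiently perform object association in dynamically changing scenes. We further propose an ensembling scheme that combines the semantic strengths of vision-language models (VLMs) with the scene-specific expertise of BYE, achieving a 7% improvement and a 95% success rate on object association tasks. Code and dataset are available at https://byencoder.github.io.|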
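|**2024-12-03**|[Initial Study On Improving Segmentation By Combining Preoperative CT And Intraoperative CBCT Using Synthetic Data](http://arxiv.org/abs/2412.02294)|null|Computer-assisted interventions enable clinicians to perform precise, minimally invasive procedures, often relying on advanced imaging methods. Cone-beam computed tomography (CBCT) can be used to support computer-assisted interventions, although it often suffers from artifacts that make accurate interpretation difficult. While this degraded image quality can affect image analysis, the availability of high-quality preoperative scans offers potential for improvement. Here we consider a setting where preoperative CT and intraoperative CBCT scans are both available, but the alignment (registration) between the scans is imperfect, to simulate a real-world scenario. We propose a multimodal learning method that fuses roughly aligned CBCT and CT scans and investigate its effect on segmentation performance. For this experiment we use synthetically generated data containing real CT and synthetic CBCT volumes with corresponding voxel annotations. Segmentation performance improved in 18 of the 20 settings investigated.|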
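|**2024-12-03**|[CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy](http://arxiv.org/abs/2412.02210)|null|Large multimodal models (LMMs) have shown impressive performance in recognizing document images with natural language instructions. However, it remains unclear how far their literacy extends when the text has rich structure and poses fine-grained visual challenges. The field currently lacks a comprehensive benchmark that effectively measures the literacy capabilities of LMMs, and existing benchmarks are often limited to narrow scenarios and specified tasks. To this end, we introduce CC-OCR, a comprehensive benchmark covering a diverse range of scenarios, tasks, and challenges. CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, 41% of which are sourced from real applications and released for the first time. We also evaluate nine prominent LMMs and reveal the strengths and weaknesses of these models, particularly in text grounding, multi-orientation text, and repetition hallucination. CC-OCR aims to comprehensively evaluate the capabilities of LMMs on OCR-centered tasks and thereby advance their development.|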
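|**2024-12-03**|[LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models](http://arxiv.org/abs/2412.02193)|null|Open-universe 3D layout generation arranges unlabeled 3D assets according to language instructions. Large language models (LLMs) struggle to generate physically plausible 3D scenes that adhere to the input instructions, particularly in cluttered scenes. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of vision-language models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM uses VLMs to generate two mutually reinforcing representations from visually marked images, together with a self-consistent decoding process that improves the VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts that better match the semantic intent of the input language instructions. We also show that fine-tuning VLMs with the proposed scene layout representation, extracted from existing scene datasets, can improve performance.|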
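|**2024-12-03**|[VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding](http://arxiv.org/abs/2412.02186)|**[link](https://github.com/kangsankim07/videoicl)**|Recent advances in video large multimodal models (LMMs) have significantly improved their video understanding and reasoning capabilities. However, their performance drops on out-of-distribution (OOD) tasks that are underrepresented in the training data. Traditional fine-tuning on OOD datasets is impractical because of its high computational cost. In-context learning (ICL) with demonstration examples has shown promising generalization on language and image-language tasks without fine-tuning, but applying ICL to video-language tasks is challenging because video LMMs have limited context length while videos require long token sequences. To address these issues, we propose VideoICL, a novel video in-context learning framework for OOD tasks that introduces a similarity-based strategy for selecting relevant examples and a confidence-based iterative inference approach. The framework selects the most relevant examples and ranks them by similarity for use in inference. If the generated response has low confidence, it selects new examples and performs inference again, iteratively refining the result until a high-confidence response is obtained. This approach improves OOD video understanding by extending the effective context length without incurring high costs. Experimental results on multiple benchmarks show significant performance gains, especially in domain-specific scenarios, laying the groundwork for broader video comprehension applications. Code will be released at https://github.com/KangsanKim07/VideoICL|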
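|**2024-12-03**|[VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning](http://arxiv.org/abs/2412.02172)|null|The ability of large vision-language models (LVLMs) to critique and correct their own reasoning is an essential building block for self-improvement. However, a systematic analysis of these capabilities in LVLMs is still lacking. We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs. Whereas existing work uses a single scalar value to critique an entire reasoning process [4], VISCO features dense, fine-grained critique: it requires LVLMs to evaluate the correctness of each step in the chain of thought and to provide natural language explanations supporting their judgments. Extensive evaluation of 24 LVLMs shows that human-written critiques significantly improve performance after correction, demonstrating the potential of the self-improvement strategy. However, model-generated critiques are less helpful and sometimes even hurt performance, which suggests that critique is the crucial bottleneck. We identify three common patterns of critique failure: failure to critique visual perception, reluctance to "say no", and exaggerated assumptions of error propagation. To address these issues, we propose an effective LookBack strategy that revisits the image to verify each piece of information in the initial reasoning. LookBack significantly improves critique and correction performance by up to 13.5%.|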
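|**2024-12-02**|[X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models](http://arxiv.org/abs/2412.01824)|**[link](https://github.com/sunzey/x-prompt)**|In-context generation is a key component of the open-task generalization capability of large language models (LLMs). By using a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advances in autoregressive vision-language models (VLMs) built on LLMs have shown impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely autoregressive large vision-language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks within a unified in-context learning framework. X-Prompt uses a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with improved task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.|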
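|**2024-12-02**|[VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models](http://arxiv.org/abs/2412.01822)|null|The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs of various sizes. However, scaling VLMs to larger models to improve performance brings significant computational challenges, especially for deployment on resource-constrained devices such as mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes that prioritizes efficiency without compromising accuracy. VLsI uses a unique layer-wise distillation process that introduces intermediate "verbalizers", which map the features from each layer into natural language space and allow smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLM's layer-wise progression with that of the large one. We validate VLsI on ten challenging vision-language benchmarks, achieving notable performance gains over GPT-4V (11.0% for 2B and 17.4% for 7B) without model scaling, merging, or architectural changes.|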
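|**2024-11-29**|[SDR-GNN: Spectral Domain Reconstruction Graph Neural Network for Incomplete Multimodal Learning in Conversational Emotion Recognition](http://arxiv.org/abs/2411.19822)|**[link](https://github.com/fufangze/SDR-GNN)**|Multimodal Emotion Recognition in Conversations (MERC) aims to classify the emotion of utterances using features from the text, audio, and visual modalities. Most existing MERC methods assume that every utterance has complete modalities, overlooking the missing-modality problem that is common in real-world scenarios. Recently, graph neural networks (GNNs) have achieved notable results in Incomplete Multimodal Emotion Recognition in Conversations (IMERC). However, traditional GNNs focus on binary relationships between nodes, which limits their ability to capture more complex, higher-order information. Moreover, repeated message passing can cause over-smoothing, reducing their capacity to preserve essential high-frequency details. To address these issues, we propose a Spectral Domain Reconstruction Graph Neural Network (SDR-GNN) for incomplete multimodal learning in conversational emotion recognition. SDR-GNN constructs an utterance semantic interaction graph using a sliding window based on both speaker and context relationships to model emotional dependencies. To capture higher-order and high-frequency information, SDR-GNN uses weighted relationship aggregation, ensuring consistent semantic feature extraction across utterances. It also performs multi-frequency aggregation in the spectral domain, which allows incomplete modalities to be recovered efficiently by extracting both high- and low-frequency information. Finally, multi-head attention is applied to fuse and optimize features for emotion recognition. Extensive experiments on various real-world datasets show that our approach is effective in incomplete multimodal learning and outperforms current state-of-the-art methods.|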
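|**2024-11-29**|[SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks](http://arxiv.org/abs/2411.19688)|**[link](https://github.com/iml-dkfz/sure-vqa)**|Vision-language models (VLMs) have great potential in medical tasks such as visual question answering (VQA), where they could act as interactive assistants for both patients and clinicians. Yet their robustness to distribution shifts on unseen data remains a critical concern for safe deployment. Evaluating such robustness requires a controlled experimental setup that allows a systematic understanding of the model's behavior. However, we show that current setups fail to offer sufficiently thorough evaluations, limiting their ability to assess model robustness accurately. To address this gap, we introduce a novel framework, SURE-VQA, built around three key requirements to overcome the current pitfalls and systematically analyze the robustness of VLMs: 1) since robustness on synthetic shifts does not necessarily translate to real-world shifts, robustness should be measured on real-world shifts that are inherent to the VQA data; 2) traditional token-matching metrics often fail to capture underlying semantics, so large language models (LLMs) are needed for more accurate semantic evaluation; 3) model performance often lacks interpretability because sanity baselines are missing, so meaningful baselines should be reported to assess the multimodal impact on the VLM. To demonstrate the relevance of this framework, we study the robustness of various fine-tuning methods across three medical datasets under four different types of distribution shift. Our study reveals several important findings: 1) sanity baselines that do not use image data can perform surprisingly well; 2) we confirm LoRA as the best-performing PEFT method; 3) no PEFT method consistently outperforms the others in robustness to shifts. Code is available at https://github.com/IML-DKFZ/sure-vqa.|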
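|**2024-11-29**|[CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation](http://arxiv.org/abs/2411.19650)|null|Advances in large vision-language-action (VLA) models have significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. Although existing VLAs adapted from pretrained large vision-language models (VLMs) have shown promising generalizability, their task performance is still unsatisfactory, as indicated by low task success rates across different environments. In this paper, we present a new advanced VLA architecture derived from VLMs. Unlike previous work that directly repurposes a VLM for action prediction through simple action quantization, we propose a componentized VLA architecture with a specialized action module conditioned on the VLM output. We systematically study the design of the action module and demonstrate the strong performance gains from using diffusion action transformers for action sequence modeling, as well as their favorable scaling behavior. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models under varied designs. Evaluation on 5 robot embodiments in simulation and the real world shows that our model not only significantly surpasses existing VLAs in task performance but also adapts remarkably well to new robots and generalizes to unseen objects and backgrounds. It exceeds the average success rate of OpenVLA, which has a similar model size (7B), by over 35% in simulated evaluation and by over 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% in absolute success rate in simulation. Code and models can be found on our project page (https://cogact.github.io/).|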
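|**2024-11-29**|[Interleaved-Modal Chain-of-Thought](http://arxiv.org/abs/2411.19488)|null|Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), their text-only rationales struggle to express fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal chain of thought, named **Interleaved-modal Chain-of-Thought (ICoT)**, which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, ICoT requires VLMs to generate fine-grained interleaved-modal content, which is hard for current VLMs to do. Considering that the required visual information is usually part of the input image, we propose **Attention-driven Selection (ADS)** to realize ICoT on existing VLMs. ADS intelligently inserts regions of the input image to generate the interleaved-modal reasoning steps, with negligible additional latency. ADS relies solely on the attention maps of VLMs and requires no parameterization, so it is a plug-and-play strategy that can be generalized to a wide range of VLMs. We apply ADS to realize ICoT on two popular VLMs with different architectures. Extensive evaluations on three benchmarks show that ICoT prompting achieves substantial gains in performance (up to 14%) and interpretability compared with existing multimodal CoT prompting methods.|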
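|**2024-11-28**|[Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation](http://arxiv.org/abs/2411.19331)|**[link](https://github.com/lorebianchi98/Talk2DINO)**|Open-Vocabulary Segmentation (OVS) aims to segment images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by exploiting coarse spatial information from Vision Transformers, they face challenges in spatial localization because of their global alignment of image and text features. Conversely, self-supervised visual models such as DINO excel at fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP with the patch-level features of DINOv2 through a learned mapping function, without fine-tuning the underlying backbones. At training time, we use the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the strong semantic and localization abilities of Talk2DINO improve the segmentation process, yielding more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results show that Talk2DINO achieves state-of-the-art performance on several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.|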
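|**2024-11-28**|[GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks](http://arxiv.org/abs/2411.19325)|**[link](https://github.com/the-ai-alliance/geo-bench-vlm)**|Although numerous recent benchmarks focus on evaluating generic vision-language models (VLMs), they fall short of addressing the unique demands of geospatial applications. Generic VLM benchmarks are not designed to handle the complexities of geospatial data, which is critical for applications such as environmental monitoring, urban planning, and disaster management. Some of the unique challenges in the geospatial domain include temporal analysis of changes, counting objects in large quantities, detecting tiny objects, and understanding relationships between entities in remote sensing imagery. To address this gap, we present GEOBench-VLM, a comprehensive benchmark specifically designed to evaluate VLMs on geospatial tasks, including scene understanding, object counting, localization, fine-grained categorization, and temporal analysis. Our benchmark features over 10,000 manually verified instructions and covers a diverse set of variations in visual conditions, object type, and scale. We evaluate several state-of-the-art VLMs to assess their accuracy in the geospatial context. The results indicate that although existing VLMs show potential, they face challenges when dealing with geospatial-specific examples, highlighting room for further improvement. Specifically, the best-performing GPT4o achieves only 40% accuracy on multiple-choice questions, which is only double the random-guess performance. Our benchmark is publicly available at https://github.com/The-AI-Alliance/GEO-Bench-VLM.|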
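|**2024-11-28**|[GRAPE: Generalizing Robot Policy via Preference Alignment](http://arxiv.org/abs/2411.19309)|null|Despite recent advances of vision-language-action (VLA) models on a variety of robotics tasks, they suffer from critical issues, such as poor generalizability to unseen tasks, because they rely exclusively on behavior cloning from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, which introduces distribution bias and limits their adaptability to diverse manipulation objectives such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs at the trajectory level and implicitly models reward from both successful and failed trials to improve generalizability to diverse tasks. GRAPE also breaks complex manipulation tasks down into independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results show that GRAPE improves the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%, respectively. GRAPE can also be aligned with various objectives such as safety and efficiency, reducing collision rates by 44.31% and rollout step length by 11.15%, respectively. All code, models, and data are available at https://grape-vla.github.io/.|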
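|**2024-11-28**|[VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models](http://arxiv.org/abs/2411.19103)|null|In this paper, we introduce VARCO-VISION, an open-source Korean-English vision-language model (VLM). We adopt a step-by-step training strategy that allows the model to learn both linguistic and visual information while preserving the backbone model's knowledge. Compared with models of similar size, our model shows outstanding performance in diverse settings that require bilingual image-text understanding and generation abilities. VARCO-VISION is also capable of grounding, referring, and optical character recognition (OCR), which expands its usage and potential applications in real-world scenarios. In addition to the model, we release five Korean evaluation datasets, including four closed-set benchmarks and one open-set benchmark. We anticipate that this milestone will broaden the opportunities for AI researchers aiming to train VLMs. VARCO-VISION is available at https://huggingface.co/NCSOFT/VARCO-VISION-14B.|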
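|**2024-11-27**|[Evaluating Vision-Language Models as Evaluators in Path Planning](http://arxiv.org/abs/2411.18711)|null|Although large language models (LLMs) show promise in complex reasoning, their effectiveness in end-to-end planning is limited. This raises an intriguing question: if these models cannot plan well, can they still contribute to a planning framework as helpful plan evaluators? In this work, we generalize this question to consider LLMs augmented with visual understanding, that is, vision-language models (VLMs). We introduce PathEval, a novel benchmark that evaluates VLMs as plan evaluators in complex path-planning scenarios. To succeed on the benchmark, a VLM needs to abstract the traits of optimal paths from the scenario description, demonstrate precise low-level perception of each path, and integrate this information to decide which path is better. Our analysis of state-of-the-art VLMs shows that these models face significant challenges on the benchmark. We observe that the VLMs can precisely abstract given scenarios to identify the desired traits and show mixed performance in integrating the provided information. Their vision component, however, is a critical bottleneck: the models struggle to perceive low-level details of a path. Our experimental results show that this issue cannot be trivially addressed by end-to-end fine-tuning; rather, task-specific discriminative adaptation of the vision encoders is needed for these VLMs to become effective path evaluators.|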
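|**2024-11-27**|[Embodied Red Teaming for Auditing Robotic Foundation Models](http://arxiv.org/abs/2411.18676)|null|Language-conditioned robot models, or robotic foundation models, enable robots to perform a wide range of tasks based on natural language instructions. Despite strong performance on existing benchmarks, evaluating the safety and effectiveness of these models remains challenging because of the complexity of testing all possible language variations. Current benchmarks have two key limitations: they rely on a limited set of human-generated instructions and so miss many challenging cases, and they focus only on task performance without assessing safety, such as avoiding damage. To address these gaps, we introduce Embodied Red Teaming (ERT), a new evaluation method that generates diverse and challenging instructions to test these models. ERT uses automated red-teaming techniques with vision-language models (VLMs) to create contextually grounded, difficult instructions. Experimental results show that state-of-the-art models frequently fail or behave unsafely on ERT tests, underscoring the shortcomings of current benchmarks in evaluating real-world performance and safety. Code and videos are available at: https://sites.google.com/view/embodiedredteam.|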
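|**2024-11-27**|[AMPS: ASR with Multimodal Paraphrase Supervision](http://arxiv.org/abs/2411.18368)|null|Spontaneous or conversational multilingual speech presents many challenges for state-of-the-art automatic speech recognition (ASR) systems. In this work, we present a new technique, AMPS, that augments a multilingual multimodal ASR system with paraphrase-based supervision to improve conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. We use paraphrases of the reference transcriptions as additional supervision while training the multimodal ASR model, and selectively invoke this paraphrase objective for utterances with poor ASR performance. Using AMPS with SeamlessM4T, a state-of-the-art multimodal model, we obtain significant relative reductions in word error rate (WER) of up to 5%. We present a detailed analysis of our system using both objective and human evaluation metrics.|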
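|**2024-11-27**|[Large Language Model-Brained GUI Agents: A Survey](http://arxiv.org/abs/2411.18279)|**[link](https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey)**|Graphical user interfaces (GUIs) have long been central to human-computer interaction, providing an intuitive and visually driven way to access and operate digital systems. The advent of large language models (LLMs), particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents that can interpret complex GUI elements and autonomously execute actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span web navigation, mobile app interaction, and desktop automation, offering a transformative user experience that changes how individuals interact with software. This emerging field is advancing rapidly, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and use of data for training specialized GUI agents, the development of large action models tailored to GUI tasks, and the evaluation metrics and benchmarks needed to assess their effectiveness. We also examine emerging applications powered by these agents. Through detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advances in the field. By consolidating foundational knowledge and recent developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.|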
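|**2024-11-27**|[Grid-augumented vision: A simple yet effective approach for enhanced spatial understanding in multi-modal agents](http://arxiv.org/abs/2411.18270)|**[link](https://github.com/triumph123aaa/grid-augmented-vision)**|Recent advances in multimodal models have demonstrated impressive capabilities in object recognition and scene understanding. However, these models often struggle with precise spatial localization, which is critical for real-world applications. Inspired by how humans use grid-based references such as chessboards and maps, we propose introducing explicit visual position encoding through a simple grid overlay. By adding a 9x9 black grid pattern to the input image, our method provides visual spatial guidance analogous to positional encoding in Transformers, but in an explicit, visual form. Experiments on the COCO 2017 dataset show that the grid-based approach substantially improves localization accuracy, with a 107.4% increase in IoU (from 0.27 to 0.56) and a 194.4% improvement in GIoU (from 0.18 to 0.53) over baseline performance. Through attention visualization analysis, we show how this visual position encoding helps models better grasp spatial relationships. The simplicity and effectiveness of our method make it particularly valuable for applications that require accurate spatial reasoning, such as robotic manipulation, medical imaging, and autonomous navigation.|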
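|**2024-11-27**|[Multimodal Integration of Longitudinal Noninvasive Diagnostics for Survival Prediction in Immunotherapy Using Deep Learning](http://arxiv.org/abs/2411.18253)|null|Purpose: Analyzing noninvasive longitudinal and multimodal data using artificial intelligence could transform immunotherapy for cancer patients, paving the way toward precision medicine. Methods: In this study, we integrated pre- and on-treatment blood measurements, prescribed medications, and CT-based organ volumes from a large pan-cancer cohort of 694 patients treated with immunotherapy to predict short- and long-term overall survival. By combining recent developments, we trained different variants of our extended multimodal transformer-based simple temporal attention (MMTSimTA) network end-to-end to predict mortality at three, six, nine, and twelve months. These models were also compared with baseline methods that incorporate intermediate and late fusion based integration approaches. Results: The extended transformer-based multimodal model showed the strongest prognostic performance, with areas under the curve (AUC) of $0.84 \pm 0.04$, $0.83 \pm 0.02$, $0.82 \pm 0.02$, and $0.81 \pm 0.03$ for 3-, 6-, 9-, and 12-month survival prediction, respectively. Conclusion: Our findings suggest that analyzing integrated early treatment data has the potential to predict survival of immunotherapy patients. Integrating complementary noninvasive modalities into a jointly trained model using our extended transformer-based architecture showed improved multimodal prognostic performance, especially for short-term survival prediction.|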
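|**2024-11-27**|[Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning](http://arxiv.org/abs/2411.18203)|null|Vision-language models (VLMs) have made remarkable progress on multimodal reasoning tasks. However, they still often generate inaccurate or irrelevant responses because of issues such as hallucinated image understanding or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm that aims to boost the reasoning capability of VLMs. The framework decouples the reasoning process from the critic process by integrating two independent components: the Reasoner, which generates reasoning paths from visual and textual inputs, and the Critic, which provides constructive critique to refine those paths. In this approach, the Reasoner generates reasoning responses according to text prompts, and these prompts can evolve iteratively as a policy based on feedback from the Critic. This interaction is theoretically grounded in a reinforcement learning framework in which the Critic offers natural language critiques instead of scalar rewards, enabling more nuanced feedback that improves the Reasoner's capability on complex reasoning tasks. The Critic model is trained with Direct Preference Optimization (DPO), using a preference dataset of critiques ranked by Rule-based Reward (RBR) to strengthen its critic capabilities. Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks, especially in reasoning accuracy and efficiency. Combining the Reasoner's dynamic text-based policy with constructive feedback from the preference-optimized Critic yields a more reliable and context-sensitive multimodal reasoning process. Our approach offers a promising way to improve the reliability of VLMs and thereby their performance in real-world, reasoning-heavy multimodal applications such as autonomous driving and embodied intelligence.|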
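|**2024-11-27**|[COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models](http://arxiv.org/abs/2411.18145)|null|With the rapid development of large vision-language models (VLMs), both general-domain models and models designed specifically for remote sensing Earth observation have shown exceptional perception and reasoning abilities in this specific field. However, there is currently no benchmark for comprehensively evaluating the remote sensing capabilities of these VLMs, which is a significant gap. To bridge it, we propose COREval, the first benchmark designed to comprehensively and objectively evaluate the hierarchical remote sensing capabilities of VLMs. Concentrating on two primary capability dimensions in remote sensing, perception and reasoning, we further divide these into 6 secondary dimensions and 22 leaf tasks to ensure well-rounded evaluation coverage of this field. COREval guarantees the quality of its 6,263 questions through a rigorous process of data collection from 50 globally distributed cities, question construction, and quality control. The multiple-choice format with definitive answers allows an objective and straightforward evaluation of VLM performance. We conducted a holistic evaluation of 13 prominent open-source VLMs from both the general and remote sensing domains, highlighting their current shortcomings in remote sensing capabilities and providing directions for improving their application in this field. We hope COREval will serve as a valuable resource and offer deeper insight into the challenges and potential of VLMs in remote sensing.|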
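|**2024-11-27**|[VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis](http://arxiv.org/abs/2411.18038)|null|Large vision-language models (VLMs) have recently made significant progress in bridging two fundamental modalities. A VLM trained on a sufficiently large dataset exhibits a comprehensive understanding of both vision and language and can perform a diverse set of tasks. To distill this knowledge accurately, in this paper we introduce a novel approach that explicitly uses a VLM as the objective function for the human-object interaction (HOI) detection task (VLM-HOI). Specifically, we propose a method that quantifies the similarity of predicted HOI triplets using an image-text matching technique. We represent HOI triplets linguistically to make full use of the language comprehension of VLMs, which are better suited to this task than CLIP models because of their localization and object-centric nature. This matching score is used as the objective for contrastive optimization. To our knowledge, this is the first use of VLM language abilities for HOI detection. Experiments demonstrate the effectiveness of our method, which achieves state-of-the-art HOI detection accuracy on benchmarks. We believe integrating VLMs into HOI detection represents a significant step toward more advanced and interpretable analysis of human-object interactions.|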
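|**2024-11-26**|[HOPPR Medical-Grade Platform for Medical Imaging AI](http://arxiv.org/abs/2411.17891)|null|Technological advances in artificial intelligence (AI) have enabled the development of large vision-language models (LVLMs) trained on millions of paired image and text samples. Subsequent research has demonstrated the great potential of LVLMs to achieve high performance in medical imaging use cases (e.g., radiology report generation), but there are also barriers that hinder the widespread deployment of these solutions. These include the cost of the extensive computational requirements for developing large-scale models, the expertise required to develop sophisticated AI models, and the difficulty of accessing large, high-quality datasets that adequately represent the population in which the LVLM solution is to be deployed. The HOPPR Medical-Grade Platform addresses these barriers by providing powerful computational infrastructure, a suite of foundation models on top of which developers can fine-tune for their specific use cases, and a robust quality management system that sets a standard for evaluating fine-tuned models for deployment in clinical settings. The HOPPR Platform has access to millions of imaging studies and text reports sourced from hundreds of imaging centers representing diverse populations, which are used to pretrain foundation models and to enable use-case-specific cohorts for fine-tuning. All data are de-identified and securely stored for HIPAA compliance. In addition, developers can securely host models on the HOPPR Platform and access them via an API to run inference with these models within established clinical workflows. With the Medical-Grade Platform, HOPPR's mission is to expedite the deployment of LVLM solutions for medical imaging and ultimately optimize radiologists' workflows and meet the growing demands of the field.|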
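|**2024-11-26**|[NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?](http://arxiv.org/abs/2411.17794)|null|Multimodal large language models (MLLMs) have made notable advances in visual understanding, yet their ability to recognize objects modified by specific attributes remains an open question. To address this, we explore the reasoning capabilities of MLLMs in object recognition, ranging from commonsense to beyond-commonsense scenarios. We introduce a novel benchmark, NEMO, which comprises 900 images of original fruits and their corresponding attribute-modified versions, along with 2,700 questions of open-ended, multiple-choice, and unsolvable types. We assess 26 recent open-source and commercial models using our benchmark. The findings highlight pronounced performance gaps in recognizing objects in NEMO and reveal distinct answer preferences across models. Although stronger vision encoders improve performance, MLLMs still lag behind standalone vision encoders. Interestingly, scaling up model size does not consistently yield better results, and deeper analysis reveals that larger LLMs can weaken vision encoders during fine-tuning. These insights shed light on critical limitations of current MLLMs and suggest potential pathways toward more versatile and resilient multimodal models.|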
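|**2024-11-26**|[VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models](http://arxiv.org/abs/2411.17451)|null|Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods rely primarily on AI-annotated preference labels from traditional vision-language tasks, which can introduce biases and often fail to challenge state-of-the-art models effectively. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline, which combines sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models shows that VL-RewardBench is an effective and challenging testbed: even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench correlates strongly (Pearson's r > 0.9) with MMMU-Pro accuracy under Best-of-N sampling with VL-GenRMs. Analysis experiments reveal three key insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) the benefits of inference-time scaling vary dramatically with model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench, along with these experimental insights, will be a valuable resource for advancing VL-GenRMs.|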
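|**2024-11-26**|[CoA: Chain-of-Action for Generative Semantic Labels](http://arxiv.org/abs/2411.17406)|**[link](https://github.com/WilsonMqz/CoA)**|Recent advances in vision-language models (VLMs) have demonstrated remarkable capability in image classification. These VLMs use a predefined set of categories to construct text prompts for zero-shot reasoning. However, in more open-ended domains such as autonomous driving, using a predefined label set becomes impractical because the semantic label space is unknown and constantly evolving. In addition, fixed embedding text prompts often tend to predict a single label, whereas in reality there are usually multiple labels per image. In this paper, we introduce CoA, an innovative Chain-of-Action (CoA) method that generates labels aligned with all contextually relevant features of an image. CoA is designed around the observation that enriched, valuable contextual information improves generative performance during inference. Traditional vision-language models tend to output singular and redundant responses, so we employ a tailored CoA to alleviate this problem. We first break the label generation task down into detailed actions and construct a CoA leading to the final generative objective. Each action extracts and merges key information from the previous action and passes the enriched information as context to the next action, ultimately improving the VLM's ability to generate comprehensive and accurate semantic labels. We assess the effectiveness of CoA through comprehensive evaluations on widely used benchmark datasets, and the results show significant improvements across key performance metrics.|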
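|**2024-11-26**|[AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM](http://arxiv.org/abs/2411.17221)|**[link](https://github.com/wangjiarui153/AIGV-Assessor)**|The rapid advancement of large multimodal models (LMMs) has led to the rapid expansion of artificial intelligence generated videos (AIGVs), which highlights the pressing need for effective video quality assessment (VQA) models designed specifically for AIGVs. Current VQA models generally fall short in accurately assessing the perceptual quality of AIGVs because of their unique distortions, such as unrealistic objects, unnatural movements, or inconsistent visual elements. To address this challenge, we first present AIGVQA-DB, a large-scale dataset of 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 diverse prompts. With these AIGVs, we devise a systematic annotation pipeline including scoring and ranking processes, which has so far collected 370k expert ratings. Based on AIGVQA-DB, we further introduce AIGV-Assessor, a novel VQA model that uses spatiotemporal features and an LMM framework to capture the intricate quality attributes of AIGVs, accurately predicting precise video quality scores and video pair preferences. Through comprehensive experiments on both AIGVQA-DB and existing AIGV databases, AIGV-Assessor demonstrates state-of-the-art performance, significantly surpassing existing scoring or evaluation methods across multiple perceptual quality dimensions.|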
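|**2024-11-26**|[Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment](http://arxiv.org/abs/2411.17188)|null|Many real-world user queries (e.g., "How do I make egg fried rice?") could benefit from systems capable of generating responses with both textual steps and accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face the challenge of ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG uses a scene graph structure to capture the relationships between text and image blocks, and evaluates responses at four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. Alongside ISG, we introduce a benchmark, ISG-Bench, with 1,150 samples across 8 categories and 21 subcategories. The benchmark dataset includes complex language-vision dependencies and golden answers to evaluate models effectively on vision-centric tasks such as style transfer, a challenging area for current models. Using ISG-Bench, we show that recent unified vision-language models perform poorly at generating interleaved content. Compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, but their performance remains suboptimal at both the block and image levels. To facilitate future work, we develop ISG-Agent, a baseline agent that uses a "plan-execute-refine" pipeline to invoke tools, achieving a 122% performance improvement.|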
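|**2024-11-26**|[Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation](http://arxiv.org/abs/2411.17150)|null|Open-Vocabulary Semantic Segmentation (OVSS) has advanced with recent vision-language models (VLMs), enabling segmentation beyond predefined categories through various learning schemes. Notably, training-free methods offer scalable, easily deployable solutions for handling unseen data, a key goal of OVSS. Yet a critical issue persists: object-level context is not considered when segmenting complex objects in the challenging OVSS setting based on arbitrary query prompts. This oversight limits a model's ability to group semantically consistent elements within an object and map them precisely to user-defined arbitrary classes. In this work, we introduce a novel approach that overcomes this limitation by incorporating object-level contextual knowledge within images. Specifically, our model improves intra-object consistency by distilling spectral-driven features from vision foundation models into the attention mechanism of the visual encoder, so that semantically coherent components form a single object mask. We also refine the text embeddings with zero-shot object presence likelihood to ensure accurate alignment with the specific objects represented in the images. By leveraging object-level contextual knowledge, our proposed approach achieves state-of-the-art performance with strong generalizability across diverse datasets.|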
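|**2024-11-26**|[Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation](http://arxiv.org/abs/2411.17141)|**[link](https://github.com/zhengxuJosh/AnySeg)**|Simultaneously using multimodal inputs from multiple sensors to train segmentors is intuitively advantageous but practically challenging. A key challenge is unimodal bias, where multimodal segmentors over-rely on certain modalities, causing performance drops when other modalities are missing, which is common in real-world applications. To this end, we develop the first framework for learning a robust segmentor that can handle any combination of visual modalities. Specifically, we first introduce a parallel multimodal learning strategy for learning a strong teacher. Cross-modal and unimodal distillation is then achieved in the multi-scale representation space by transferring feature-level knowledge from multimodal to anymodal segmentors, with the aim of addressing unimodal bias and avoiding over-reliance on specific modalities. We also propose a prediction-level, modality-agnostic semantic distillation to transfer semantic knowledge for segmentation. Extensive experiments on both synthetic and real-world multi-sensor benchmarks show that our method achieves superior performance.|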
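|**2024-11-26**|[Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models](http://arxiv.org/abs/2411.17066)|**[link](https://github.com/colinconwell/t2i-probology)**|Despite remarkable progress in multimodal AI research, there is a salient domain in which modern AI continues to lag considerably behind even human children: the reliable deployment of logical operators. Here, we examine three forms of logical operators: relations, negations, and discrete numbers. We asked human respondents (N=178 in total) to evaluate images generated by a state-of-the-art image-generating AI (DALL-E 3) when prompted with these "logical probes", and find that none reliably produce human agreement scores greater than 50%. The negation probes and the number probes (beyond 3) fail most frequently. In a fourth experiment, we assess a "grounded diffusion" pipeline that uses targeted prompt engineering and structured intermediate representations for greater compositional control, but find that its performance is judged even worse than that of DALL-E 3 across all prompts. To provide further clarity on potential sources of success and failure in these text-to-image systems, we supplement our 4 core experiments with multiple auxiliary analyses and schematic diagrams. These include, for example, directly quantifying the relationship between the N-gram frequency of relational prompts and the average match to generated images; the success rates of 3 different prompt modification strategies in the rendering of negation prompts; and the scalar variability / ratio dependence ("approximate numeracy") of prompts involving integers. We conclude by discussing the limitations inherent to "grounded" multimodal learning systems whose grounding relies heavily on vector-based semantics (e.g., DALL-E 3) or under-specified syntactic constraints (e.g., "grounded diffusion"), and propose minimal modifications (inspired by development, based in imagery) that could help bridge the lingering compositional gap between scale and structure. All data and code are available at https://github.com/ColinConwell/T2I-Probology.|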
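|**2024-11-26**|[Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation](http://arxiv.org/abs/2411.17002)|**[link](https://github.com/ShambhaviCodes/CLIPOT)**|Vision-language foundation models, such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distribution shifts, where their performance degrades significantly. In this work, we explore how to efficiently leverage class text information to mitigate the distribution drifts encountered by large pre-trained vision-language models (VLMs) during test-time inference. In particular, we propose to generate pseudo-labels for the test-time samples by using generic class text embeddings as fixed centroids of a label assignment problem, which is solved efficiently with optimal transport. Furthermore, the proposed adaptation method (CLIP-OT) integrates a multiple-template knowledge distillation approach, which replicates multi-view contrastive learning strategies from unsupervised representation learning without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks of varying complexity empirically show the superiority of CLIP-OT, which achieves performance gains of up to 7% over recent state-of-the-art methods while remaining computationally and memory efficient.|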
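|**2024-11-25**|[Probing the limitations of multimodal language models for chemistry and materials research](http://arxiv.org/abs/2411.16955)|**[link](https://github.com/lamalab-org/mac-bench)**|Recent advances in artificial intelligence have sparked interest in scientific assistants that could support researchers across the full scientific workflow, from literature review to experimental design and data analysis. A key capability for such systems is the ability to process and reason about scientific information in both visual and textual forms, from interpreting spectroscopic data to understanding laboratory setups. Here, we introduce MaCBench, a comprehensive benchmark for evaluating how vision-language models handle real-world chemistry and materials science tasks across three core aspects: data extraction, experimental understanding, and results interpretation. Through a systematic evaluation of leading models, we find that while these systems show promising capabilities in basic perception tasks, achieving near-perfect performance in equipment identification and standardized data extraction, they exhibit fundamental limitations in spatial reasoning, cross-modal information synthesis, and multi-step logical inference. Our insights have important implications beyond chemistry and materials science, suggesting that developing reliable multimodal AI scientific assistants may require advances in curating suitable training data and in approaches to training those models.|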
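|**2024-11-25**|[Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge](http://arxiv.org/abs/2411.16824)|null|Large vision-language models (LVLMs) typically integrate separately pre-trained vision and language components, often using CLIP-ViT as the vision backbone. However, these models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM). Specifically, the VE's representation of visual information may not fully align with the LLM's cognitive framework, leading to a mismatch in which visual features exceed the language model's interpretive range. To address this, we investigate how variations in VE representations influence LVLM comprehension, especially when the LLM faces VE-Unknown data: images whose ambiguous visual representations challenge the VE's interpretive precision. Accordingly, we construct a multi-granularity landmark dataset and systematically examine the impact of VE-Known and VE-Unknown data on interpretive abilities. Our results show that VE-Unknown data limits an LVLM's capacity for accurate understanding, while VE-Known data, rich in distinctive features, helps reduce cognitive misalignment. Building on these insights, we propose Entity-Enhanced Cognitive Alignment (EECA), a method that employs multi-granularity supervision to generate visually enriched, well-aligned tokens that not only integrate into the LLM's embedding space but also align with the LLM's cognitive framework. This alignment markedly enhances LVLM performance in landmark recognition. Our findings underscore the challenges posed by VE-Unknown data and highlight the essential role of cognitive alignment in advancing multimodal systems.|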
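|**2024-11-22**|[PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision](http://arxiv.org/abs/2411.15127)|**[link](https://github.com/nokia-bell-labs/pretrained-imu-encoders)**|Sensing human motion through Inertial Measurement Units (IMUs) embedded in personal devices has significant applications in health and wellness. While labeled IMU data is scarce, we can collect unlabeled or weakly labeled IMU data to model human motion. For video or text modalities, the "pretrain and adapt" approach uses large volumes of unlabeled or weakly labeled data for pretraining, building a strong feature extractor, followed by adaptation to specific tasks using limited labeled data. This approach has not been widely adopted in the IMU domain for two reasons: (1) pretraining methods are poorly understood in the context of IMU, and (2) open-source pretrained models that generalize across datasets are rarely publicly available. In this paper, we aim to address the first issue by proposing PRIMUS, a method for pretraining IMU encoders. We conduct a systematic and unified evaluation of various self-supervised and multimodal learning pretraining objectives. Our findings indicate that using PRIMUS, which combines self-supervision, multimodal supervision, and nearest-neighbor supervision, can significantly enhance downstream performance. With fewer than 500 labeled samples per class, PRIMUS effectively improves downstream performance by 15% on held-out test data, compared to state-of-the-art multimodal training methods. To benefit the broader community, our code and pretrained IMU encoders will be made publicly available at github.com/nokia-bell-labs upon publication.|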
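|**2024-11-22**|[Context-Aware Multimodal Pretraining](http://arxiv.org/abs/2411.15099)|null|Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find a four-fold improvement in test-time sample efficiency and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.|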
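|**2024-11-22**|[Information Extraction from Heterogenous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation](http://arxiv.org/abs/2411.14957)|null|Invoices and receipts submitted by employees are visually rich documents (VRDs) containing textual, visual, and layout information. To protect against the risk of fraud and abuse, it is crucial for organizations to efficiently extract the desired information from submitted receipts. This helps in assessing key factors such as the appropriateness of the expense claim, adherence to spending and transaction policies, the validity of the receipt, as well as downstream anomaly detection at various levels. These documents are heterogeneous, with multiple formats and languages, uploaded with varying image quality, and often do not contain ground truth labels for the efficient training of models. In this paper we propose Task Aware Instruction-based Labelling (TAIL), a method for synthetic label generation in VRD corpora without labels, and fine-tune a multimodal Visually Rich Document Understanding Model (VRDU) on TAIL labels using response-based knowledge distillation, without using the teacher model's weights or training dataset, to conditionally generate annotations in the appropriate format. Using a benchmark external dataset where ground truth labels are available, we demonstrate through empirical study the conditions under which our approach performs at par with Claude 3 Sonnet. We then show that the resulting model performs at par with or better than the state-of-the-art large multimodal model (LMM) Claude 3 Sonnet on the internal expense documents of a large multinational organization, while being 85% less costly and about 5 times faster, and outperforms layout-aware baselines by more than 10% in Average Normalized Levenshtein Similarity (ANLS) scores due to its ability to reason and extract information from rare formats. Finally, we illustrate the usage of our approach in overpayment prevention.|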
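|**2024-11-22**|[VisGraphVar: A Benchmark Generator for Assessing Variability in Graph Analysis Using Large Vision-Language Models](http://arxiv.org/abs/2411.14832)|null|The fast advancement of large vision-language models (LVLMs) has shown immense potential. These models are increasingly capable of tackling abstract visual tasks. Geometric structures, particularly graphs with their inherent flexibility and complexity, serve as an excellent benchmark for evaluating these models' predictive capabilities. While human observers can readily identify subtle visual details and perform accurate analyses, our investigation reveals that state-of-the-art LVLMs exhibit consistent limitations in specific visual graph scenarios, especially when confronted with stylistic variations. In response to these challenges, we introduce VisGraphVar (Visual Graph Variability), a customizable benchmark generator able to produce graph images for seven distinct task categories (detection, classification, segmentation, pattern recognition, link prediction, reasoning, matching), designed to systematically evaluate the strengths and limitations of individual LVLMs. We use VisGraphVar to produce 990 graph images and evaluate six LVLMs with two distinct prompting strategies, namely zero-shot and chain-of-thought. The findings demonstrate that variations in the visual attributes of images (e.g., node labeling and layout) and the deliberate inclusion of visual imperfections (e.g., overlapping nodes) significantly affect model performance. This research emphasizes the importance of a comprehensive evaluation across graph-related tasks, extending beyond reasoning alone. VisGraphVar offers valuable insights to guide the development of more reliable and robust systems capable of performing advanced visual graph analysis.|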
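|**2024-11-22**|[VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection](http://arxiv.org/abs/2411.14794)|**[link](https://github.com/hshjerry/videoespresso)**|Advances in large vision-language models (LVLMs) have significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity, or on automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset featuring VideoQA pairs that preserve essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich the reasoning process, guiding GPT-4o in extracting logical relationships from the QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a frame selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark of 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso|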
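|**2024-11-22**|[Effective SAM Combination for Open-Vocabulary Semantic Segmentation](http://arxiv.org/abs/2411.14723)|null|Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods typically address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model, such as CLIP. But these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness under challenging conditions.|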
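|**2024-11-21**|[FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers](http://arxiv.org/abs/2411.14507)|null|Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains through extensive scaling of model parameters. Recent works observe redundancy across the transformer blocks and develop compression methods based on structured pruning of the unimportant blocks. However, such straightforward elimination always causes irreversible performance degradation. In this paper, we propose FuseGPT, a novel methodology that recycles the pruned transformer blocks to further recover model performance. First, we introduce a new importance detection metric, Macro Influence (MI), to detect the long-term influence of each transformer block by calculating the information loss after its removal. Then, we propose group-level layers fusion, which adopts the parameters in layers of the unimportant blocks and injects them into the corresponding layers inside the neighboring blocks. The fusion is not one-off but proceeds through iterative parameter updates by lightweight group-level fine-tuning. Specifically, the injected parameters are frozen but weighted with learnable rank decomposition matrices to reduce the overhead during fine-tuning. Our approach works well not only on large language models but also on large multimodal models. Experiments show that, using only modest amounts of data, FuseGPT can outperform previous works in both perplexity and zero-shot task performance.|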
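|**2024-11-21**|[Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance](http://arxiv.org/abs/2411.14279)|null|Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLMs and the multimodal alignment stage. 2. The learned inference bias due to the short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with a multimodal dual-attention mechanism (MDA) and soft-image guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at [lacing-lvlm.github.io](https://lacing-lvlm.github.io).|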
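|**2024-11-21**|[Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset](http://arxiv.org/abs/2411.14137)|**[link](https://github.com/hazel-heejeong-nam/vague)**|The ability to perform complex reasoning across multimodal inputs is essential for models to effectively interact with humans in real-world scenarios. Advancements in vision-language models have significantly improved performance on tasks that require processing explicit and direct textual inputs, such as Visual Question Answering (VQA) and Visual Grounding (VG). However, less attention has been given to improving models' ability to comprehend nuanced and ambiguous forms of communication. This presents a critical challenge, as human language in real-world interactions often conveys hidden intentions that rely on context for accurate interpretation. To address this gap, we propose VAGUE, a multimodal benchmark comprising 3.9K indirect human utterances paired with corresponding scenes. Additionally, we contribute a model-based pipeline for generating prompt-solution pairs from input images. Our work aims to delve deeper into the ability of models to understand indirect communication and seeks to contribute to the development of models capable of more refined and human-like interactions. Extensive evaluation of multiple VLMs reveals that mainstream models still struggle with indirect communication when required to perform complex linguistic and visual reasoning. We release our code and data at https://github.com/Hazel-Heejeong-Nam/VAGUE.git.|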
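|**2024-11-21**|[MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective](http://arxiv.org/abs/2411.14062)|**[link](https://github.com/lerogo/mmgenbench)**|Large multimodal models (LMMs) have demonstrated remarkable capabilities. However, existing benchmarks for evaluating LMMs mainly focus on image comprehension, and few works evaluate them from the image generation perspective. To address this issue, we propose a straightforward automated evaluation pipeline. Specifically, this pipeline requires LMMs to generate an image description from a given input image. It then uses a text-to-image generative model to create a new image based on the generated description. Finally, we evaluate the performance of LMMs by comparing the original image with the generated one. Furthermore, we introduce MMGenBench-Test, a comprehensive benchmark developed to evaluate LMMs across 13 distinct image patterns, and MMGenBench-Domain, targeting the performance evaluation of LMMs within the generative image domain. A thorough evaluation of over 50 popular LMMs demonstrates the effectiveness and reliability of both the pipeline and the benchmark. Our observations indicate that numerous LMMs that excel on existing benchmarks fail to adequately complete basic tasks related to image understanding and description. This finding highlights the substantial potential for performance improvement in current LMMs and suggests avenues for future model optimization. Concurrently, our pipeline facilitates efficient assessment of LMM performance across diverse domains using only image inputs.|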
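|**2024-11-20**|[BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games](http://arxiv.org/abs/2411.13543)|null|Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continual exploration of new strategies, areas in which we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, including tasks that non-expert humans can solve in seconds and extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community.|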
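|**2024-11-20**|[Teaching VLMs to Localize Specific Objects from In-context Examples](http://arxiv.org/abs/2411.13317)|**[link](https://github.com/sivandoveh/iploc)**|Vision-language models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA), when explicitly trained for these tasks. Despite these advances, we find that current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking the context into account. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples), each with a category label and bounding box, and is tasked with localizing the same object type in a query image. To elicit personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly enhances few-shot localization performance without sacrificing generalization, as demonstrated on several benchmarks tailored to personalized localization. This work is the first to explore and benchmark personalized few-shot localization for VLMs, laying a foundation for future research in context-driven vision-language applications. The code for our project is available at https://github.com/SivanDoveh/IPLoc.|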
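|**2024-11-20**|[VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation](http://arxiv.org/abs/2411.13281)|null|Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods such as multiple-choice questions in benchmarks like VideoMME and LongVideoBench, which tend to lack the depth needed to capture the complex demands of real-world users. To address this limitation, and given the high cost and inefficiency of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by the LMSYS Chatbot Arena framework and designed to automatically assess LMMs' video analysis abilities. VideoAutoArena uses user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO rating system for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a "gold standard" using a carefully curated subset of human annotations, demonstrating that our arena aligns strongly with human judgment while remaining scalable. Additionally, we introduce a fault-driven evolution strategy that progressively increases question complexity to push models toward handling more challenging video analysis scenarios. Experimental results show that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, in which human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.|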
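|**2024-11-20**|[XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation](http://arxiv.org/abs/2411.13243)|**[link](https://github.com/wangzy22/xmask3d)**|Existing methodologies in open-vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to achieve only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we develop a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are used to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open-vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open-vocabulary semantic segmentation. Code is available at https://github.com/wangzy22/XMask3D.|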
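|**2024-11-21**|[ViSTa Dataset: Do vision-language models understand sequential tasks?](http://arxiv.org/abs/2411.13211)|**[link](https://github.com/eugleo/vista-dataset)**|Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating vision-based understanding of sequential tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure, in which basic single-step tasks are composed into more and more complex sequential tasks, allows a fine-grained understanding of how well VLMs can judge tasks of varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We find that, while they are all good at object recognition, they fail to understand sequential tasks, with only GPT-4o achieving non-trivial performance.|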
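|**2024-11-20**|[TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models](http://arxiv.org/abs/2411.13136)|null|Large pre-trained vision-language models (VLMs) such as CLIP have demonstrated excellent zero-shot generalizability across various downstream tasks. However, recent studies have shown that the inference performance of CLIP can be greatly degraded by small adversarial perturbations, especially in its visual modality, posing significant safety threats. To mitigate this vulnerability, in this paper, we propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks. TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP. Specifically, it is an unsupervised method that optimizes the defensive prompts for each test sample by minimizing a multi-view entropy and aligning adversarial-clean distributions. We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets, demonstrating that it enhances the zero-shot adversarial robustness of the original CLIP by at least 48.9% against AutoAttack (AA), while largely maintaining performance on clean examples. Moreover, TAPT outperforms existing adversarial prompt tuning methods across various backbones, achieving an average robustness improvement of at least 36.6%.|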
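|**2024-11-19**|[VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge](http://arxiv.org/abs/2411.12915)|null|Generalist vision-language models (VLMs) have made significant strides in computer vision, but they fall short in specialized fields like healthcare, where expert knowledge is essential. In traditional computer vision tasks, creative or approximate answers may be acceptable, but in healthcare, precision is paramount. Current large multimodal models like Gemini and GPT-4o are insufficient for medical tasks due to their reliance on memorized internet knowledge rather than the nuanced expertise required in healthcare. VLMs are usually trained in three stages: vision pre-training, vision-language pre-training, and instruction fine-tuning (IFT). IFT has typically been applied using a mixture of generic and healthcare data. In contrast, we propose that for medical VLMs, a fourth stage of specialized IFT is necessary, which focuses on medical data and includes information from domain expert models. Domain expert models developed for medical use are crucial because they are specifically trained for certain clinical tasks, e.g. to detect tumors and classify abnormalities through segmentation and classification, and so learn fine-grained features of medical data, features that are often too intricate for a VLM to capture effectively, especially in radiology. This paper introduces a new framework, VILA-M3, for medical VLMs that utilizes domain knowledge via expert models. Through our experiments, we show an improved state-of-the-art (SOTA) performance, with an average improvement of about 9% over the prior SOTA model Med-Gemini and about 6% over models trained on the specific tasks. Our approach emphasizes the importance of domain expertise in creating precise, reliable VLMs for medical applications.|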
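|**2024-11-18**|[Vision Language Models Are Few-Shot Audio Spectrogram Classifiers](http://arxiv.org/abs/2411.12058)|null|We demonstrate that vision-language models (VLMs) are capable of recognizing the content in audio recordings when given corresponding spectrogram images. Specifically, we instruct VLMs to perform audio classification tasks in a few-shot setting by prompting them to classify a spectrogram image given example spectrogram images of each class. By carefully designing the spectrogram image representation and selecting good few-shot examples, we show that GPT-4o can achieve 59.00% cross-validated accuracy on the ESC-10 environmental sound classification dataset. Moreover, we demonstrate that VLMs currently outperform the only available commercial audio language model with audio understanding capabilities (Gemini-1.5) on the equivalent audio classification task (59.00% vs. 49.62%), and even perform slightly better than human experts on visual spectrogram classification (73.75% vs. 72.50% on the first fold). We envision two potential use cases for these findings: (1) combining the spectrogram and language understanding capabilities of VLMs for audio caption augmentation, and (2) posing visual spectrogram classification as a challenge task for VLMs.|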
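|**2024-11-18**|[ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements](http://arxiv.org/abs/2411.12044)|**[link](https://github.com/m-arda-aydn/itaclip)**|Recent advances in foundational vision-language models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including open-vocabulary semantic segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer; 2) image engineering: applying data augmentations to enrich input image representations; and 3) using large language models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.|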
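|**2024-11-17**|[On-Board Vision-Language Models for Personalized Autonomous Vehicle Motion Control: System Design and Real-World Validation](http://arxiv.org/abs/2411.11913)|null|Personalized driving refers to an autonomous vehicle's ability to adapt its driving behavior or control strategy to match individual users' preferences and driving styles while maintaining safety and comfort standards. However, existing works either fail to capture every individual preference precisely or become computationally inefficient as the user base expands. Vision-language models (VLMs) offer promising solutions to this problem thanks to their natural language understanding and scene reasoning capabilities. In this work, we propose a lightweight yet effective on-board VLM framework that provides low-latency personalized driving performance while maintaining strong reasoning capabilities. Our solution incorporates a Retrieval-Augmented Generation (RAG)-based memory module that enables continuous learning of individual driving preferences through human feedback. Through comprehensive real-world vehicle deployment and experiments, our system has demonstrated the ability to provide safe, comfortable, and personalized driving experiences across various scenarios and to reduce takeover rates significantly, by up to 76.9%. To the best of our knowledge, this work represents the first end-to-end VLM-based motion control system in real-world autonomous vehicles.|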
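|**2024-11-18**|[The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning](http://arxiv.org/abs/2411.11758)|**[link](https://github.com/michigannlp/mosaic)**|Large multimodal models (LMMs) exhibit impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of most data and models. Conversely, multi-agent models have shown significant capability in solving complex tasks. Our study evaluates the collective performance of LMMs in a multi-agent interaction setting for the novel task of cultural image captioning. Our contributions are as follows: (1) we introduce MosAIC, a multi-agent framework to enhance cross-cultural image captioning using LMMs with distinct cultural personas; (2) we provide a dataset of culturally enriched image captions in English for images from China, India, and Romania across three datasets: GeoDE, GD-VCR, and CVQA; (3) we propose a culture-adaptable metric for evaluating cultural information within image captions; and (4) we show that multi-agent interaction outperforms single-agent models across different metrics, and offer valuable insights for future research. Our dataset and models can be accessed at https://github.com/MichiganNLP/MosAIC.|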
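|**2024-11-18**|[MC-LLaVA: Multi-Concept Personalized Vision-Language Model](http://arxiv.org/abs/2411.11706)|**[link](https://github.com/arctanxarc/mc-llava)**|Current vision-language models (VLMs) show exceptional abilities across diverse tasks, including visual question answering. To enhance user experience in practical applications, recent studies investigate VLM personalization to understand user-provided concepts. However, existing studies mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits the real-world applicability of personalized VLMs. In this paper, we propose the first multi-concept personalization method, named MC-LLaVA, along with a high-quality multi-concept personalization dataset. Specifically, MC-LLaVA uses a joint training strategy that incorporates multiple concepts in a single training step, allowing VLMs to perform accurately in multi-concept personalization. To reduce the cost of joint training, MC-LLaVA leverages visual token information for concept token initialization, yielding improved concept representation and accelerating joint training. To advance multi-concept personalization research, we further contribute a high-quality dataset. We carefully collect images from various movies that contain multiple characters and manually generate multi-concept question-answer samples. Our dataset features diverse movie types and question-answer types. We conduct comprehensive qualitative and quantitative experiments to demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA.|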
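|**2024-11-18**|[VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation](http://arxiv.org/abs/2411.11609)|null|Following human instructions to explore and search for a specified target in an unfamiliar environment is a crucial skill for mobile service robots. Most previous works on object goal navigation have focused on a single input modality as the target, which may lead to limited consideration of language descriptions containing detailed attributes and spatial relationships. To address this limitation, we propose VLN-Game, a novel zero-shot framework for visual target navigation that can effectively process object names and descriptive language targets. More precisely, our approach constructs a 3D object-centric spatial map by integrating pre-trained vision-language features with a 3D reconstruction of the physical environment. Then, the framework identifies the most promising areas to explore in search of potential target candidates. A game-theoretic vision-language model is employed to determine which target best matches the given language description. Experiments conducted on the Habitat-Matterport 3D (HM3D) dataset demonstrate that the proposed framework achieves state-of-the-art performance in both object goal navigation and language-based navigation tasks. Moreover, we show that VLN-Game can be easily deployed on real-world robots. The success of VLN-Game highlights the promising potential of using game-theoretic methods with compact vision-language models to advance decision-making capabilities in robotic systems. The supplementary video and code can be accessed via the following link: https://sites.google.com/view/vln-game.|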
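|**2024-11-18**|[Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment](http://arxiv.org/abs/2411.11543)|null|Benefiting from the powerful capabilities of large language models (LLMs), pre-trained visual encoder models connected to LLMs form vision-language models (VLMs). However, recent research shows that the visual modality in VLMs is highly vulnerable, allowing attackers to bypass safety alignment in LLMs through visually transmitted content and launch harmful attacks. To address this challenge, we propose a progressive concept-based alignment strategy, PSA-VLM, which incorporates safety modules as concept bottlenecks to enhance visual modality safety alignment. By aligning model predictions with specific safety concepts, we improve defenses against risky images, enhancing explainability and controllability while minimally impacting general performance. Our method is obtained through two-stage training. The low computational cost of the first stage brings very effective performance improvement, and the fine-tuning of the language model in the second stage further improves safety performance. Our method achieves state-of-the-art results on popular VLM safety benchmarks.|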
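|**2024-11-18**|[InstruGen: Automatic Instruction Generation for Vision-and-Language Navigation Via Large Multimodal Models](http://arxiv.org/abs/2411.11394)|null|Recent research on Vision-and-Language Navigation (VLN) indicates that agents generalize poorly in unseen environments due to the lack of realistic training environments and high-quality path-instruction pairs. Most existing methods for constructing realistic navigation scenes have high costs, and the extension of instructions mainly relies on predefined templates or rules, lacking adaptability. To alleviate this issue, we propose InstruGen, a VLN path-instruction pair generation paradigm. Specifically, we use YouTube house tour videos as realistic navigation scenes and leverage the powerful visual understanding and generation abilities of large multimodal models (LMMs) to automatically generate diverse and high-quality VLN path-instruction pairs. Our method generates navigation instructions with different granularities and achieves fine-grained alignment between instructions and visual observations, which was difficult to achieve with previous methods. Additionally, we design a multi-stage verification mechanism to reduce hallucinations and inconsistency of LMMs. Experimental results demonstrate that agents trained with path-instruction pairs generated by InstruGen achieve state-of-the-art performance on the R2R and RxR benchmarks, particularly in unseen environments. Code is available at https://github.com/yanyu0526/InstruGen.|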
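|**2024-11-18**|[Efficient Transfer Learning for Video-language Foundation Models](http://arxiv.org/abs/2411.11223)|**[link](https://github.com/chenhaoxing/etl4video)**|Pre-trained vision-language models provide a robust foundation for efficient transfer learning across various downstream tasks. In the field of video action recognition, mainstream approaches often introduce additional parameter modules to capture temporal information. While the increased model capacity brought by these additional parameters helps better fit the video-specific inductive biases, existing methods require learning a large number of parameters and are prone to catastrophic forgetting of the original generalizable knowledge. In this paper, we propose a simple yet effective Multi-modal Spatio-Temporal Adapter (MSTA) to improve the alignment between representations in the text and vision branches, achieving a balance between general knowledge and task-specific knowledge. Furthermore, to mitigate over-fitting and enhance generalizability, we introduce a spatio-temporal description-guided consistency constraint. This constraint involves feeding template inputs (i.e., "a video of {cls}") into the trainable language branch, while LLM-generated spatio-temporal descriptions are input into the pre-trained language branch, enforcing consistency between the outputs of the two branches. This mechanism prevents over-fitting to downstream tasks and improves the distinguishability of the trainable branch within the spatio-temporal semantic space. We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-supervised learning. Compared to many state-of-the-art methods, our MSTA achieves outstanding performance across all evaluations, while using only 2-7% of the trainable parameters in the original model. Code will be available at https://github.com/chenhaoxing/ETL4Video.|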
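|**2024-11-17**|[Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection](http://arxiv.org/abs/2411.10922)|**[link](https://github.com/cogito2012/openmixer)**|Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting, where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world, where test videos inevitably go beyond the trained action categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem. It aims to detect any action in test videos while training a model on a fixed set of action categories. To achieve such an open-vocabulary capability, we propose a novel method, OpenMixer, that exploits the inherent semantics and localizability of large vision-language models (VLMs) within the family of query-based detection transformers (DETR). Specifically, OpenMixer consists of spatial and temporal OpenMixer blocks (S-OMB and T-OMB) and a dynamically fused alignment (DFA) module. The three components collectively enjoy the merits of strong generalization from pre-trained VLMs and end-to-end learning from the DETR design. Moreover, we establish OVAD benchmarks under various settings, and the experimental results show that OpenMixer outperforms baselines in detecting both seen and unseen actions. We release the code, models, and dataset splits at https://github.com/Cogito2012/OpenMixer.|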
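|**2024-11-15**|[LLaVA-o1: Let Vision Language Models Reason Step-by-Step](http://arxiv.org/abs/2411.10440)|**[link](https://github.com/PKU-YuanGroup/LLaVA-CoT)**|Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current vision-language models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. In addition, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.|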
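|**2024-11-15**|[SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning](http://arxiv.org/abs/2411.10161)|**[link](https://github.com/chencn2020/seagull)**|Existing Image Quality Assessment (IQA) methods achieve remarkable success in analyzing the quality of the overall image, but few works explore quality analysis for Regions of Interest (ROIs). The quality analysis of ROIs can provide fine-grained guidance for image quality improvement and is crucial for scenarios focusing on region-level quality. This paper proposes a novel network, SEAGULL, which can see and assess the quality of ROIs with guidance from a large vision-language model. SEAGULL incorporates a vision-language model (VLM), masks generated by the Segment Anything Model (SAM) to specify ROIs, and a meticulously designed Mask-based Feature Extractor (MFE) to extract global and local tokens for the specified ROIs, enabling accurate fine-grained IQA for ROIs. Moreover, this paper constructs two ROI-based IQA datasets, SEAGULL-100w and SEAGULL-3k, for training and evaluating ROI-based IQA. SEAGULL-100w comprises about 1 million synthetic distortion images with 33 million ROIs for pre-training to improve the model's regional quality perception, and SEAGULL-3k contains about 3,000 authentic distortion ROIs to enhance the model's ability to perceive real-world distortions. After pre-training on SEAGULL-100w and fine-tuning on SEAGULL-3k, SEAGULL shows remarkable performance on fine-grained ROI quality assessment. Code and datasets are publicly available at https://github.com/chencn2020/Seagull.|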
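|**2024-11-15**|[Federated Domain Generalization via Prompt Learning and Aggregation](http://arxiv.org/abs/2411.10063)|**[link](https://github.com/GongShuai8210/PLAN)**|Federated domain generalization (FedDG) aims to improve the global model's generalization in unseen domains by addressing data heterogeneity under privacy-preserving constraints. A common strategy in existing FedDG studies involves sharing domain-specific knowledge among clients, such as spectrum information, class prototypes, and data styles. However, this knowledge is extracted directly from local client samples, and sharing such sensitive information poses a potential risk of data leakage, which might not fully meet the requirements of FedDG. In this paper, we introduce prompt learning to adapt pre-trained vision-language models (VLMs) in the FedDG scenario, and leverage locally learned prompts as a more secure bridge to facilitate knowledge transfer among clients. Specifically, we propose a novel FedDG framework through Prompt Learning and AggregatioN (PLAN), which comprises two training stages to collaboratively generate local prompts and global prompts at each federated round. First, each client performs both text and visual prompt learning using its own data, with local prompts indirectly synchronized by regarding the global prompts as a common reference. Second, all domain-specific local prompts are exchanged among clients and selectively aggregated into the global prompts using lightweight attention-based aggregators. Finally, the global prompts are applied to adapt VLMs to unseen target domains. As our PLAN framework requires training only a limited number of prompts and lightweight aggregators, it offers notable advantages in computational and communication efficiency for FedDG. Extensive experiments demonstrate the superior generalization ability of PLAN across four benchmark datasets.|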
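|**2024-11-15**|[Free Lunch in Pathology Foundation Model: Task-specific Model Adaptation with Concept-Guided Feature Enhancement](http://arxiv.org/abs/2411.09894)|**[link](https://github.com/hku-medai/cate)**|Whole slide image (WSI) analysis is gaining prominence within the medical imaging field. Recent advances in pathology foundation models have shown the potential to extract powerful feature representations from WSIs for downstream tasks. However, these foundation models are usually designed for general-purpose pathology image analysis and may not be optimal for specific downstream tasks or cancer types. In this work, we present Concept Anchor-guided Task-specific Feature Enhancement (CATE), an adaptable paradigm that can boost the expressivity and discriminativeness of pathology foundation models for specific downstream tasks. Based on a set of task-specific concepts derived from the pathology vision-language model with expert-designed prompts, we introduce two interconnected modules to dynamically calibrate the generic image features extracted by foundation models for certain tasks or cancer types. Specifically, we design a Concept-guided Information Bottleneck module to enhance task-relevant characteristics by maximizing the mutual information between image features and concept anchors while suppressing superfluous information. Moreover, a Concept-Feature Interference module is proposed to utilize the similarity between calibrated features and concept anchors to further generate discriminative task-specific features. Extensive experiments on public WSI datasets demonstrate that CATE significantly enhances the performance and generalizability of MIL models. Additionally, heatmap and UMAP visualization results also reveal the effectiveness and interpretability of CATE. The source code is available at https://github.com/HKU-MedAI/CATE.|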
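|**2024-11-14**|[Cross-Modal Consistency in Multimodal Large Language Models](http://arxiv.org/abs/2411.09273)|null|Recent developments in multimodal methodologies have marked the beginning of a new era for models that process diverse data types, encompassing text, audio, and visual content. Models like GPT-4V, which merge computer vision with advanced language processing, exhibit extraordinary proficiency in handling intricate tasks that require a simultaneous understanding of both textual and visual information. Prior research efforts have meticulously evaluated the efficacy of these Vision Large Language Models (VLLMs) in various domains, including object detection, image captioning, and other related fields. However, existing analyses have often suffered from limitations, primarily centering on the isolated evaluation of each modality's performance while neglecting to explore their intricate cross-modal interactions. Specifically, the question of whether these models achieve the same level of accuracy when confronted with identical task instances across different modalities remains unanswered. In this study, we take the initiative to delve into the interaction and comparison among these modalities of interest by introducing a novel concept termed cross-modal consistency. Furthermore, we propose a quantitative evaluation framework founded on this concept. Our experimental findings, drawn from a curated collection of parallel vision-language datasets developed by us, unveil a pronounced inconsistency between the vision and language modalities within GPT-4V, despite its portrayal as a unified multimodal model. Our research yields insights into the appropriate use of such models and hints at potential avenues for enhancing their design.|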
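|**2024-11-13**|[ClevrSkills: Compositional Language and Visual Reasoning in Robotics](http://arxiv.org/abs/2411.09052)|**[link](https://github.com/Qualcomm-AI-research/ClevrSkills)**|Robotics tasks are highly compositional by nature. For example, to perform a high-level task like cleaning the table, a robot must employ low-level capabilities of moving the effectors to the objects on the table, picking them up, and then moving them off the table one by one, while re-evaluating the consequently dynamic scene in the process. Given that large vision-language models (VLMs) have shown progress on many tasks that require high-level, human-like reasoning, we ask the question: if the models are taught the requisite low-level capabilities, can they compose them in novel ways to achieve interesting high-level tasks like cleaning the table without having to be explicitly taught so? To this end, we present ClevrSkills, a benchmark suite for compositional reasoning in robotics. ClevrSkills is an environment suite developed on top of the ManiSkill2 simulator, together with an accompanying dataset. The dataset contains trajectories generated on a range of robotics tasks with language and visual annotations as well as multimodal prompts as task specifications. The suite includes a curriculum of tasks with three levels of compositional understanding, starting with simple tasks requiring basic motor skills. We benchmark multiple different VLM baselines on ClevrSkills and show that even after being pre-trained on a large number of tasks, these models fail at compositional reasoning in robotics tasks.|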
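|**2024-11-13**|[DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models](http://arxiv.org/abs/2411.09022)|**[link](https://github.com/wyd0817/Breakdown_Function_Modules)**|Large language models (LLMs) have demonstrated significant reasoning capabilities in robotic systems. However, their deployment in multi-robot systems remains fragmented and struggles to handle complex task dependencies and parallel execution. This study introduces DART-LLM (Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models), a system designed to address these challenges. DART-LLM uses LLMs to parse natural language instructions, decomposing them into multiple subtasks with dependencies to establish complex task sequences, thereby enhancing efficient coordination and parallel execution in multi-robot systems. The system includes a QA LLM module, a breakdown function module, an execution module, and a vision-language model (VLM)-based object detection module, enabling task decomposition and execution from natural language instructions to robotic actions. Experimental results demonstrate that DART-LLM excels at handling long-horizon tasks and collaborative tasks with complex dependencies. Even when using smaller models such as Llama 3.1 8B, the system achieves good performance, highlighting DART-LLM's robustness in terms of model size. For more videos and code, please visit the project website: https://wyd0817.github.io/project-dart-llm/.|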
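|**2024-11-13**|[The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models](http://arxiv.org/abs/2411.08870)|**[link](https://github.com/taekb/eval-medical-dapt)**|Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare ten public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question-answering (QA). For instance, across all tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 22.7% of cases, reach a (statistical) tie in 36.8% of cases, and are significantly worse than their base models in the remaining 40.5% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately in zero-/few-shot prompting; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Meanwhile, we find that after fine-tuning on specific QA tasks, medical LLMs can show performance improvements, but the benefits do not carry over to tasks based on clinical notes. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.|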
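|**2024-11-13**|[Sharingan: Extract User Action Sequence from Desktop Recordings](http://arxiv.org/abs/2411.08768)|null|Video recordings of user activities, particularly desktop recordings, offer a rich source of data for understanding user behaviors and automating processes. However, despite advancements in vision-language models (VLMs) and their increasing use in video analysis, extracting user actions from desktop recordings remains an underexplored area. This paper addresses this gap by proposing two novel VLM-based methods for user action extraction: the Direct Frame-Based Approach (DF), which inputs sampled frames directly into VLMs, and the Differential Frame-Based Approach (DiffF), which incorporates explicit frame differences detected via computer vision techniques. We evaluate these methods using a basic self-curated dataset and an advanced benchmark adapted from prior work. Our results show that the DF approach achieves an accuracy of 70% to 80% in identifying user actions, with the extracted action sequences being replayable through Robotic Process Automation (RPA). We find that while VLMs show potential, incorporating explicit UI changes can degrade performance, making the DF approach more reliable. This work represents the first application of VLMs for extracting user action sequences from desktop recordings, contributing new methods, benchmarks, and insights for future research.|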
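|**2024-11-13**|[Voxeland: Probabilistic Instance-Aware Semantic Mapping with Evidence-based Uncertainty Quantification](http://arxiv.org/abs/2411.08727)|**[link](https://github.com/MAPIRlab/Voxeland)**|Robots in human-centered scenarios require accurate scene understanding to perform high-level tasks effectively. This understanding can be achieved through instance-aware semantic mapping, which involves reconstructing elements at the level of individual instances. Neural networks, the de facto solution for scene understanding, still face limitations such as overconfident incorrect predictions for out-of-distribution objects or generating inaccurate masks. Over-reliance on these predictions makes the reconstruction susceptible to errors, reducing the robustness of the resulting maps and hampering robot operation. In this work, we propose Voxeland, a probabilistic framework for incrementally building instance-aware semantic maps. Inspired by the Theory of Evidence, Voxeland treats neural network predictions as subjective opinions regarding map instances at both the geometric and semantic levels. These opinions are aggregated over time to form evidence, which is formalized through a probabilistic model. This enables us to quantify uncertainty in the reconstruction process, facilitating the identification of map areas requiring improvement (e.g. re-observation or reclassification). As one strategy to exploit this, we incorporate a large vision-language model (LVLM) to perform semantic-level disambiguation for instances with high uncertainty. Results from standard benchmarking on the publicly available SceneNN dataset demonstrate that Voxeland outperforms state-of-the-art methods, highlighting the benefits of incorporating and leveraging both instance- and semantic-level uncertainties to enhance reconstruction robustness. This is further validated through qualitative experiments on the real-world ScanNet dataset.|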
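|**2024-11-13**|[Retrieval Augmented Recipe Generation](http://arxiv.org/abs/2411.08715)|null|Given the potential applications of generating recipes from food images, this area has garnered significant attention from researchers in recent years. Existing works for recipe generation primarily use a two-stage training method, first generating ingredients and then obtaining instructions from both the image and the ingredients. Large multimodal models (LMMs), which have achieved notable success across a variety of vision and language tasks, shed light on generating both ingredients and instructions directly from images. Nevertheless, LMMs still face the common issue of hallucinations during recipe generation, leading to suboptimal performance. To tackle this, we propose a retrieval-augmented large multimodal model for recipe generation. We first introduce Stochastic Diversified Retrieval Augmentation (SDRA) to retrieve recipes semantically related to the image from an existing datastore as a supplement, integrating them into the prompt to add diverse and rich context to the input image. Additionally, a Self-Consistency Ensemble Voting mechanism is proposed to determine the most confident predicted recipe as the final output. It calculates the consistency among generated recipe candidates, which use different retrieved recipes as context for generation. Extensive experiments validate the effectiveness of our proposed method, which demonstrates state-of-the-art (SOTA) performance in recipe generation on the Recipe1M dataset.|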
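|**2024-11-13**|[Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints](http://arxiv.org/abs/2411.08253)|null|Foundation models trained on internet-scale data, such as vision-language models (VLMs), excel at performing tasks involving common sense, such as visual question answering. Despite their impressive capabilities, these models cannot currently be directly applied to challenging robot manipulation problems that require complex and precise continuous reasoning. Task and Motion Planning (TAMP) systems can control high-dimensional continuous systems over long horizons by combining traditional primitive robot operations. However, these systems require detailed models of how the robot can impact its environment, preventing them from directly interpreting and addressing novel human objectives, for example, an arbitrary natural language goal. We propose deploying VLMs within TAMP systems by having them generate discrete and continuous language-parameterized constraints that enable TAMP to reason about open-world concepts. Specifically, we propose algorithms for VLM partial planning that constrain a TAMP system's discrete temporal search and VLM continuous constraints interpretation to augment the traditional manipulation constraints that TAMP systems seek to satisfy. We demonstrate our approach on two robot embodiments, including a real-world robot, across several manipulation tasks, where the desired objectives are conveyed solely through language.|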
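|**2024-11-12**|[DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection](http://arxiv.org/abs/2411.08227)|**[link](https://github.com/lili0415/dpu-ood-detection)**|Out-of-distribution (OOD) detection is essential for ensuring the robustness of machine learning models by identifying samples that deviate from the training distribution. While traditional OOD detection has primarily focused on single-modality inputs, such as images, recent advances in multimodal models have demonstrated the potential of leveraging multiple modalities (e.g., video, optical flow, audio) to enhance detection performance. However, existing methods often overlook intra-class variability within in-distribution (ID) data, assuming that samples of the same class are perfectly cohesive and consistent. This assumption can lead to performance degradation, especially when prediction discrepancies are uniformly amplified across all samples. To address this issue, we propose Dynamic Prototype Updating (DPU), a novel plug-and-play framework for multimodal OOD detection that accounts for intra-class variations. Our method dynamically updates class center representations for each class by measuring the variance of similar samples within each batch, enabling adaptive adjustments. This approach allows us to amplify prediction discrepancies based on the updated class centers, thereby improving the model's robustness and generalization across different modalities. Extensive experiments on two tasks, five datasets, and nine base OOD algorithms demonstrate that DPU significantly improves OOD detection performance, setting a new state of the art in multimodal OOD detection, with improvements of up to 80% in Far-OOD detection. To facilitate accessibility and reproducibility, our code is publicly available on GitHub.|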
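|**2024-11-12**|[JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation](http://arxiv.org/abs/2411.07975)|**[link](https://github.com/deepseek-ai/janus)**|We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.|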
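|**2024-11-12**|[SparrowVQE: Visual Question Explanation for Course Content Understanding](http://arxiv.org/abs/2411.07516)|**[link](https://github.com/youshanzhang/sparrowvqe)**|Visual Question Answering (VQA) research seeks to create AI systems that answer natural language questions about images, yet VQA methods often yield overly simplistic and short answers. This paper aims to advance the field by introducing Visual Question Explanation (VQE), which enhances the ability of VQA to provide detailed explanations rather than brief responses and addresses the need for more complex interaction with visual content. We first created an MLVQE dataset from a 14-week streamed video machine learning course, including 885 slide images, 110,407 words of transcripts, and 9,416 designed question-answer (QA) pairs. Next, we proposed a novel SparrowVQE, a small 3-billion-parameter multimodal model. We trained our model with a three-stage training mechanism consisting of multimodal pre-training (slide image and transcript feature alignment), instruction tuning (tuning the pre-trained model with transcripts and QA pairs), and domain fine-tuning (fine-tuning on slide images and QA pairs). Eventually, our SparrowVQE can understand and connect visual information using the SigLIP model and process transcripts using the Phi-2 language model with an MLP adapter. Experimental results demonstrate that our SparrowVQE achieves better performance on our developed MLVQE dataset and outperforms state-of-the-art methods on the other five benchmark VQA datasets. The source code is available at https://github.com/YoushanZhang/SparrowVQE.|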
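|**2024-11-11**|[Multimodal Fusion Balancing Through Game-Theoretic Regularization](http://arxiv.org/abs/2411.07335)|null|Multimodal learning can complete the picture of information extraction by uncovering key dependencies between data sources. However, current systems fail to fully leverage multiple modalities for optimal performance. This has been attributed to modality competition, where modalities strive for training resources, leaving some underoptimized. We show that current balancing methods struggle to train multimodal models that surpass even simple baselines, such as ensembles. This raises the question: how can we ensure that all modalities in multimodal training are sufficiently trained, and that learning from new modalities consistently improves performance? This paper proposes the Multimodal Competition Regularizer (MCR), a new loss component inspired by mutual information (MI) decomposition, designed to prevent the adverse effects of competition in multimodal training. Our key contributions are: 1) introducing game-theoretic principles in multimodal learning, where each modality acts as a player competing to maximize its influence on the final outcome, enabling automatic balancing of the MI terms; 2) refining lower and upper bounds for each MI term to enhance the extraction of task-relevant unique and shared information across modalities; and 3) suggesting latent space permutations for conditional MI estimation, significantly improving computational efficiency. MCR outperforms all previously suggested training strategies and is the first to consistently improve multimodal learning beyond the ensemble baseline, clearly demonstrating that combining modalities leads to significant performance gains on both synthetic and large real-world datasets.|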
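|**2024-11-11**|[StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification](http://arxiv.org/abs/2411.07076)|**[link](https://github.com/hyc2026/StoryTeller)**|Existing large vision-language models (LVLMs) are largely limited to processing short, seconds-long videos and struggle to generate coherent descriptions for extended videos spanning minutes or more. Long video description introduces new challenges, such as plot-level consistency across the description. To address these, we identify audio-visual character identification, matching character names to each dialogue, as a key factor. We propose StoryTeller, a system for generating dense descriptions of long videos, incorporating both low-level visual concepts and high-level plot information. StoryTeller uses a multimodal large language model that integrates visual, audio, and text modalities to perform audio-visual character identification on minute-long video clips. The results are then fed into an LVLM to enhance the consistency of the video description. We validate our approach on movie description tasks and introduce MovieStory101, a dataset with dense descriptions for three-minute movie clips. To evaluate long video descriptions, we create MovieQA, a large set of multiple-choice questions for the MovieStory101 test set. We assess descriptions by inputting them into GPT-4 to answer these questions, using accuracy as an automatic evaluation metric. Experiments show that StoryTeller outperforms all open-source and closed-source baselines on MovieQA, achieving 9.5% higher accuracy than the strongest baseline, Gemini-1.5-pro, and demonstrating a +15.56% advantage in human side-by-side evaluations. Additionally, incorporating audio-visual character identification from StoryTeller improves the performance of all video description models, with Gemini-1.5-pro and GPT-4o showing accuracy gains of 5.5% and 13.0%, respectively, on MovieQA.|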
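|**2024-11-11**|[UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models](http://arxiv.org/abs/2411.06921)|**[link](https://github.com/git-ljc/umfc)**|Pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities. But they still struggle with domain shifts and typically require labeled data to adapt to downstream tasks, which could be costly. In this work, we aim to leverage unlabeled data that naturally spans multiple domains to enhance the transferability of vision-language models. Under this unsupervised multi-domain setting, we have identified inherent model bias within CLIP, notably in its visual and text encoders. Specifically, we observe that CLIP's visual encoder tends to prioritize encoding domain information over discriminative category information, while its text encoder exhibits a preference for domain-relevant classes. To mitigate this model bias, we propose a training-free and label-free feature calibration method, Unsupervised Multi-domain Feature Calibration (UMFC). UMFC estimates image-level biases from domain-specific features and text-level biases from the direction of domain transition. These biases are subsequently subtracted from the original image and text features, respectively, to render them domain-invariant. We evaluate our method in multiple settings, including transductive learning and test-time adaptation. Extensive experiments show that our method outperforms CLIP and performs on par with state-of-the-art methods that need additional annotations or optimization. Our code is available at https://github.com/GIT-LJc/UMFC.|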
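|**2024-11-11**|[Renaissance: Investigating the Pretraining of Vision-Language Encoders](http://arxiv.org/abs/2411.06657)|**[link](https://github.com/bsu-slim/renaissance)**|In the past several years there has been an explosion of available models for vision-language tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. In this paper we seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis. In our first set of experiments, we show that we can save significant compute at no cost to downstream performance by freezing large parts of vision-language models during pretraining. In our second set of experiments, we examine the effect of basing a vision-language transformer on a vision model versus a text model. Additionally, we introduce a vision-language modeling platform called Renaissance that we use to conduct all of the experiments. This program offers a great deal of flexibility in creating, training, and evaluating transformer encoders for vision-language modeling. The source code for Renaissance can be found at https://github.com/bsu-slim/renaissance.|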
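|**2024-11-09**|[M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework](http://arxiv.org/abs/2411.06176)|null|The ability to understand documents and answer questions about them is useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal content such as text, figures, and tables, which is very time-consuming for humans to read. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended answers and not only extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning open-source models, we construct a training corpus in a fully automatic manner for the question-answering task over such documents. Experiments show that our tuning approach improves the correctness of model responses by 4.6% compared to the baseline open-source models. Our data, code, and models are available at https://multimodal-documents.github.io.|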
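|**2024-11-09**|[Aquila: A Hierarchically Aligned Visual-Language Model for Enhanced Remote Sensing Image Comprehension](http://arxiv.org/abs/2411.06074)|null|Recently, large vision-language models (VLMs) have made significant advancements in vision-language capabilities through visual instruction tuning, showing great promise in the field of remote sensing image interpretation. However, existing remote sensing vision-language models (RSVLMs) often fall short in capturing the complex characteristics of remote sensing scenes, as they typically rely on low-resolution, single-scale visual features and simplistic methods to map visual features to language features. In this paper, we present Aquila, an advanced vision-language foundation model designed to enable richer visual feature representation and more precise vision-language feature alignment for remote sensing images. Our approach introduces a learnable Hierarchical Spatial Feature Integration (SFI) module that supports high-resolution image inputs and aggregates multi-scale visual features, allowing for the detailed representation of complex visual information. Additionally, the SFI module is repeatedly integrated into the layers of the large language model (LLM) to achieve deep vision-language feature alignment, without compromising the model's performance in natural language processing tasks. These innovations, capturing detailed visual information through higher resolution and multi-scale input and enhancing feature alignment, significantly improve the model's ability to learn from image-text data. We validate the effectiveness of Aquila through extensive quantitative experiments and qualitative analyses, demonstrating its superior performance.|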
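|**2024-11-09**|[GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection](http://arxiv.org/abs/2411.06071)|**[link](https://github.com/yul-git/glocalclip)**|Zero-shot anomaly detection (ZSAD) is crucial for detecting abnormal patterns in target datasets without training samples, especially when there is a distributional difference between the target domain and the training data, or when data is scarce due to restricted access. Although recent pre-trained vision-language models show strong zero-shot performance across various visual tasks, they focus on learning class semantics, which makes their direct application to ZSAD challenging. To address this, we propose GlocalCLIP, which uniquely separates global and local prompts and jointly optimizes them. This approach enables an object-agnostic global semantic prompt design that effectively captures general normal and anomalous patterns without depending on specific objects in the image. We refine the text prompts for more precise adjustment by applying deep-text prompt tuning in the text encoder. In the vision encoder, we apply V-V attention layers to capture detailed local image features. Finally, we introduce glocal contrastive learning to improve the complementary learning of global and local prompts, effectively detecting abnormal patterns across various domains. The generalization performance of GlocalCLIP in ZSAD is demonstrated on 15 real-world datasets from the industrial and medical domains, where it outperforms existing methods.|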
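|**2024-11-09**|[An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models](http://arxiv.org/abs/2411.06048)|**[link](https://github.com/fatemehshiri/spatial-mm)**|Large multimodal models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial reasoning capabilities are under-investigated. In this paper, we construct a novel VQA dataset, Spatial-MM, to comprehensively study the spatial understanding and reasoning capabilities of LMMs. Our analyses of object relationships and multi-hop reasoning reveal several important findings. First, bounding boxes and scene graphs, even synthetic ones, can significantly enhance the spatial reasoning of LMMs. Second, LMMs struggle more with questions posed from the human perspective than from the camera perspective. Third, chain-of-thought (CoT) prompting does not improve model performance on complex multi-hop questions involving spatial relations. Finally, our perturbation analysis on GQA-spatial shows that LMMs are much stronger at basic object detection than at complex spatial reasoning. We believe our benchmark dataset and in-depth analyses can spark further research on the spatial reasoning of LMMs. The Spatial-MM benchmark is available at https://github.com/FatemehShiri/Spatial-MM.|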
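|**2024-11-08**|[End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering](http://arxiv.org/abs/2411.05755)|**[link](https://github.com/Jirl-upenn/VLMnav)**|We present VLMnav, an embodied framework that turns a vision-language model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, that is, without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach against baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/.|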
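|**2024-11-08**|[Towards Low-Resource Harmful Meme Detection with LMM Agents](http://arxiv.org/abs/2411.05383)|**[link](https://github.com/jianzhao-huang/lorehm)**|The proliferation of Internet memes in the age of social media makes effective identification of harmful ones necessary. Because memes are dynamic in nature, existing data-driven models may struggle in low-resource scenarios where only a few labeled examples are available. In this paper, we propose an agent-based framework for low-resource harmful meme detection that uses few-shot annotated samples for both outward and inward analysis. Inspired by the strong multimodal reasoning ability of large multimodal models (LMMs), we first retrieve relevant annotated memes so that their label information can serve as auxiliary signals for the LMM agent. We then elicit knowledge-revising behavior within the LMM agent to derive well-generalized insights into meme harmfulness. By combining these strategies, our approach enables dialectical reasoning over intricate and implicit harm-indicative patterns. Extensive experiments on three meme datasets demonstrate that our approach outperforms state-of-the-art methods on the low-resource harmful meme detection task.|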
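|**2024-11-08**|[Enhancing Visual Classification using Comparative Descriptors](http://arxiv.org/abs/2411.05357)|**[link](https://github.com/hk1ee/comparative-clip)**|The performance of vision-language models (VLMs) such as CLIP on visual classification tasks has been enhanced by leveraging semantic knowledge from large language models (LLMs), including GPT. Recent studies have shown that in zero-shot classification, descriptors incorporating additional cues, high-level concepts, or even random characters often outperform those using only the category name. In many classification tasks, while top-1 accuracy may be relatively low, top-5 accuracy is often much higher. This gap implies that most misclassifications occur among a few similar classes, highlighting the model's difficulty in distinguishing classes with subtle differences. To address this challenge, we introduce the novel concept of comparative descriptors. These descriptors emphasize the unique features of a target class against its most similar classes, enhancing differentiation. By generating these comparative descriptors and integrating them into the classification framework, we refine the semantic focus and improve classification accuracy. An additional filtering process ensures that these descriptors are closer to the image embeddings in the CLIP space, further enhancing performance. By addressing the specific challenge of subtle inter-class differences, our approach improves the accuracy and robustness of visual classification.|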
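|**2024-11-08**|[Exploring the Alignment Landscape: LLMs and Geometric Deep Models in Protein Representation](http://arxiv.org/abs/2411.05316)|**[link](https://github.com/tizzzzy/llm-gdm-alignment)**|Latent representation alignment has become a foundational technique for constructing multimodal large language models (MLLMs): it maps embeddings from different modalities into a shared space, often aligned with the embedding space of large language models (LLMs), to enable effective cross-modal understanding. While preliminary protein-focused MLLMs have emerged, they have predominantly relied on heuristic approaches and lack a fundamental understanding of optimal alignment practices across representations. In this study, we explore the alignment of multimodal representations between LLMs and geometric deep models (GDMs) in the protein domain. We comprehensively evaluate three state-of-the-art LLMs (Gemma2-2B, LLaMa3.1-8B, and LLaMa3.1-70B) with four protein-specialized GDMs (GearNet, GVP, ScanNet, GAT). Our work examines alignment factors from both model and protein perspectives, identifies challenges in current alignment methodologies, and proposes strategies to improve the alignment process. Our key findings reveal that GDMs incorporating both graph and 3D structural information align better with LLMs, that larger LLMs demonstrate improved alignment capabilities, and that protein rarity significantly impacts alignment performance. We also find that increasing the GDM embedding dimension, using two-layer projection heads, and fine-tuning LLMs on protein-specific data substantially enhance alignment quality. These strategies offer potential for improving the performance of protein-related multimodal models. Our code and data are available at https://github.com/Tizzzzy/LLM-GDM-alignment.|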
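|**2024-11-08**|[Real-World Offline Reinforcement Learning from Vision Language Model Feedback](http://arxiv.org/abs/2411.05273)|null|Offline reinforcement learning can learn policies from pre-collected, sub-optimal datasets without online interaction. This makes it well suited to real-world robots and safety-critical scenarios, where collecting online data or expert demonstrations is slow, costly, and risky. However, most existing offline RL work assumes the dataset is already labeled with task rewards, a process that often requires significant human effort, especially when the ground-truth state is hard to determine (for example, in the real world). In this paper, we build on prior work, specifically RL-VLM-F, and propose a novel system that automatically generates reward labels for offline datasets using preference feedback from a vision-language model and a text description of the task. Our method then learns a policy using offline RL on the reward-labeled dataset. We demonstrate the system's applicability to a complex real-world robot-assisted dressing task, where we first learn a reward function with a vision-language model on a sub-optimal offline dataset, and then use the learned reward with Implicit Q-Learning to develop an effective dressing policy. Our method also performs well in simulation tasks involving the manipulation of rigid and deformable objects, and significantly outperforms baselines such as behavior cloning and inverse RL. In summary, we propose a new system that enables automatic reward labeling and policy learning from unlabeled, sub-optimal offline datasets.|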
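|**2024-11-07**|[On Erroneous Agreements of CLIP Image Embeddings](http://arxiv.org/abs/2411.05195)|null|Recent research suggests that the failures of vision-language models (VLMs) in visual reasoning often stem from erroneous agreements, where semantically distinct images are ambiguously encoded by the CLIP image encoder into embeddings with high cosine similarity. In this paper, we show that erroneous agreements are not always the main culprit, because multimodal large language models (MLLMs) may still extract distinct information from them. For instance, when distinguishing objects on the left versus the right in the What'sUp benchmark, the CLIP image embeddings of the left/right pairs have an average cosine similarity above 0.99, and CLIP performs at chance level; yet LLaVA-1.5-7B, which uses the same CLIP image encoder, achieves nearly 100% accuracy. We find that the extractable information in CLIP image embeddings may be obscured by CLIP's inadequate vision-language alignment: its matching score, learned with a contrastive objective, may not capture all the diverse image-text correspondences. We also study the MMVP benchmark, on which prior work has shown that LLaVA-1.5 cannot distinguish image pairs with high cosine similarity. We observe a performance gain from an alternative decoding algorithm that attends more to the visual input. Furthermore, accuracy improves significantly if the model can take both images as input to emphasize their nuanced differences. Both findings indicate that LLaVA-1.5 does not fully use the extracted visual information. In conclusion, our findings suggest that while improving image encoders could benefit VLMs, there is still room to improve models with a fixed image encoder by applying better strategies for extracting and using visual information.|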
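|**2024-11-07**|[DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation](http://arxiv.org/abs/2411.04999)|**[link](https://github.com/hello-robot/stretch_ai)**|Significant progress has been made in open-vocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current systems assume a static environment, which limits their applicability in real-world scenarios where environments change frequently due to human intervention or the robot's own actions. In this work, we present DynaMem, a new approach to open-world mobile manipulation that uses a dynamic spatio-semantic memory to represent the robot's environment. DynaMem builds a 3D data structure to maintain a dynamic memory of point clouds, and answers open-vocabulary object localization queries using multimodal LLMs or open-vocabulary features generated by state-of-the-art vision-language models. Powered by DynaMem, our robots can explore novel environments, search for objects not found in memory, and continuously update the memory as objects move, appear, or disappear in the scene. We run extensive experiments with Stretch SE3 robots in three real scenes and nine offline scenes, and achieve an average pick-and-drop success rate of 70% on non-stationary objects, more than a 2x improvement over state-of-the-art static systems. Our code and our experiment and deployment videos are open sourced and can be found on our project website: https://dynamem.github.io/|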
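|**2024-11-07**|[Exploring Hierarchical Molecular Graph Representation in Multimodal LLMs](http://arxiv.org/abs/2411.04708)|null|Following the milestones in large language models (LLMs) and multimodal models, we have seen a surge in applying LLMs to biochemical tasks. Leveraging graph features and molecular text representations, LLMs can tackle various tasks, such as predicting chemical reaction outcomes and describing molecular properties. However, most current work overlooks the multi-level nature of graph features. The impact of different feature levels on LLMs and the importance of each level remain unexplored, and different chemistry tasks may require different feature levels. In this work, we first investigate the effect of feature granularity by fusing GNN-generated feature tokens, and find that even reducing all tokens to a single token does not significantly impact performance. We then explore the effect of various feature levels on performance, and find that both the quality of LLM-generated molecules and performance on different tasks benefit from different feature levels. We conclude with two key insights: (1) current molecular multimodal LLMs (MLLMs) lack a comprehensive understanding of graph features, and (2) static processing is not sufficient for hierarchical graph features. Our code will be publicly available soon.|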
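|**2024-11-07**|[Vision Language Models are In-Context Value Learners](http://arxiv.org/abs/2411.04549)|null|Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such a progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods that scale and generalize. To address these challenges, we present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Naively asking a VLM to predict values for a video sequence performs poorly because of the strong temporal correlation between successive frames. Instead, GVL poses value estimation as a temporal ordering problem over shuffled video frames; this seemingly more challenging task encourages VLMs to more fully exploit their underlying semantic and temporal grounding capabilities to differentiate frames based on their perceived task progress, which produces significantly better value predictions. Without any robot-specific or task-specific training, GVL can predict effective values in-context, zero-shot and few-shot, for more than 300 distinct real-world tasks across diverse robot platforms, including challenging bimanual manipulation tasks. Furthermore, we demonstrate that GVL permits flexible multimodal in-context learning via examples from heterogeneous tasks and embodiments, such as human videos. The generality of GVL enables various downstream applications pertinent to visuomotor policy learning, including dataset filtering, success detection, and advantage-weighted regression, all without any model training or fine-tuning.|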
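|**2024-11-06**|[Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?](http://arxiv.org/abs/2411.04118)|**[link](https://github.com/taekb/eval-medical-dapt)**|Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare seven public "medical" LLMs and two VLMs against their corresponding base models and arrive at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question answering (QA). For instance, across the tasks and model pairs we consider in the 3-shot setting, medical LLMs outperform their base models in only 12.1% of cases, reach a (statistical) tie in 49.8% of cases, and are significantly worse than their base models in the remaining 38.2% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against its corresponding base model; (ii) optimizing the prompts for each model separately; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and we offer recommendations to strengthen the conclusions of future studies.|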
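|**2024-11-06**|[RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models](http://arxiv.org/abs/2411.04097)|**[link](https://github.com/stanford-aimi/ravl)**|Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image level rather than intervening directly on fine-grained image features, and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by using a region-level clustering approach to identify the precise image features that contribute to zero-shot classification errors. RaVL then mitigates the identified spurious correlations with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement in worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.|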
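|**2024-11-06**|[DesignMinds: Enhancing Video-Based Design Ideation with Vision-Language Model and Context-Injected Large Language Model](http://arxiv.org/abs/2411.03827)|null|Ideation is a critical component of video-based design (VBD), where videos serve as the primary medium for design exploration and inspiration. The emergence of generative AI offers considerable potential to enhance this process by streamlining video analysis and facilitating idea generation. In this paper, we present DesignMinds, a prototype that integrates a state-of-the-art vision-language model (VLM) with a context-enhanced large language model (LLM) to support ideation in VBD. To evaluate DesignMinds, we conducted a between-subjects study with 35 design practitioners, comparing its performance to a baseline condition. Our results show that DesignMinds significantly enhances the flexibility and originality of ideation, while also increasing task engagement. Importantly, introducing this technology did not negatively impact user experience, technology acceptance, or usability.|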
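|**2024-11-06**|[Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction](http://arxiv.org/abs/2411.03707)|null|Geometric dimensioning and tolerancing (GD&T) plays a critical role in manufacturing by defining acceptable variations in part features to ensure component quality and functionality. However, extracting GD&T information from 2D engineering drawings is a time-consuming and labor-intensive task that often relies on manual effort or semi-automated tools. To address these challenges, this study proposes an automated and efficient method for GD&T extraction by fine-tuning Florence-2, an open-source vision-language model (VLM). The model is trained on a dataset of 400 drawings with ground-truth annotations provided by domain experts. For comparison, two state-of-the-art closed-source VLMs, GPT-4o and Claude-3.5-Sonnet, are also evaluated on the same dataset. All models are assessed using precision, recall, F1-score, and hallucination metrics. Because fine-tuning large closed-source VLMs for domain-specific tasks is computationally costly and impractical, GPT-4o and Claude-3.5-Sonnet are evaluated in a zero-shot setting. In contrast, Florence-2, a smaller model with 0.23 billion parameters, is optimized through full-parameter fine-tuning across three distinct experiments, each using a dataset augmented to a different level. The results show that Florence-2 achieves a 29.95% increase in precision, a 37.75% increase in recall, a 52.40% improvement in F1-score, and a 43.15% reduction in hallucination rate compared to the best-performing closed-source model. These findings highlight the effectiveness of fine-tuning smaller, open-source VLMs such as Florence-2, offering a practical and efficient solution for automated GD&T extraction to support downstream manufacturing tasks.|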
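|**2024-11-05**|[Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset](http://arxiv.org/abs/2411.03554)|**[link](https://github.com/safolab-wisc/fiubench)**|Machine unlearning has emerged as an effective strategy for forgetting specific information in the training data. However, with the increasing integration of visual data, privacy concerns in vision-language models (VLMs) remain underexplored. To address this, we introduce the Facial Identity Unlearning Benchmark (FIUBench), a novel VLM unlearning benchmark designed to robustly evaluate the effectiveness of unlearning algorithms under the Right to be Forgotten setting. Specifically, we formulate the VLM unlearning task by constructing a fictitious facial identity VQA dataset and apply a two-stage evaluation pipeline designed to precisely control the sources of information and their exposure levels. For evaluation, since VLMs support many ways of asking questions with the same semantic meaning, we also provide robust evaluation metrics, including membership inference attacks and carefully designed adversarial privacy attacks, to assess the performance of algorithms. By evaluating four baseline VLM unlearning algorithms within FIUBench, we find that all methods remain limited in their unlearning performance, with significant trade-offs between model utility and forget quality. Furthermore, our findings highlight the importance of privacy attacks for robust evaluations. We hope FIUBench will drive progress in developing more effective VLM unlearning algorithms.|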
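|**2024-11-05**|[VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation](http://arxiv.org/abs/2411.03540)|**[link](https://github.com/haochenz11/vla-3d)**|With the recent rise of large language models (LLMs), vision-language models (VLMs), and other general foundation models, there is growing potential for multimodal, multi-task embodied agents that can operate in diverse environments given only natural language as input. One such application area is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging because of the spatial reasoning and semantic understanding required, particularly in arbitrary scenes that may contain many objects belonging to fine-grained classes. To address this challenge, we curate the largest real-world dataset for Vision and Language-guided Action in 3D Scenes (VLA-3D), consisting of over 11.5K scanned 3D indoor rooms from existing datasets, 23.5M heuristically generated semantic relations between objects, and 9.7M synthetically generated referential statements. Our dataset consists of processed 3D point clouds, semantic object and room annotations, scene graphs, navigable free-space annotations, and referential language statements that specifically focus on view-independent spatial relations for disambiguating objects. These features are intended to aid the downstream task of navigation, especially on real-world systems, where some level of robustness must be guaranteed in an open world of changing scenes and imperfect language. We benchmark our dataset with current state-of-the-art models to obtain a performance baseline. All code to generate and visualize the dataset is publicly released; see https://github.com/HaochenZ11/VLA-3D. With the release of this dataset, we hope to provide a resource for progress in semantic 3D scene understanding that is robust to changes and that will aid the development of interactive indoor navigation systems.|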
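|**2024-11-05**|[MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning](http://arxiv.org/abs/2411.03314)|null|In recent years, multimodal benchmarks for the general domain have guided the rapid development of multimodal models on general tasks. However, the financial field has its own peculiarities. It features unique graphical images (e.g., candlestick charts, technical indicator charts) and possesses a wealth of specialized financial knowledge (e.g., futures, turnover rate). Therefore, benchmarks from the general domain often fail to measure the performance of multimodal models in the financial domain, and thus cannot effectively guide the rapid development of large financial models. To promote the development of large financial multimodal models, we propose MME-Finance, a bilingual open-ended visual question answering (VQA) benchmark oriented to practical usage. Our benchmark is characterized by its financial focus and expertise: it constructs charts that reflect the actual usage needs of users (e.g., computer screenshots and mobile photography), creates questions according to the preferences of financial-domain inquiries, and has the questions annotated by experts with more than 10 years of experience in the financial industry. Additionally, we develop a custom-designed financial evaluation system in which visual information is first introduced into the multimodal evaluation process. We conduct extensive experimental evaluations of 19 mainstream multimodal large language models (MLLMs) to test their perception, reasoning, and cognition capabilities. The results indicate that models performing well on general benchmarks do not do well on MME-Finance; for instance, the top-performing open-source and closed-source models obtain 65.69 (Qwen2VL-72B) and 63.18 (GPT-4o), respectively. Their performance is particularly poor in the categories most relevant to finance, such as candlestick charts and technical indicator charts. In addition, we propose a Chinese version, which helps compare the performance of MLLMs in a Chinese context.|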
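|**2024-11-05**|[Inference Optimal VLMs Need Only One Visual Token but Larger Models](http://arxiv.org/abs/2411.03312)|**[link](https://github.com/locuslab/llava-token-compression)**|Vision-language models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high latency during inference, caused by the substantial compute the large language model (LLM) needs to process the large number of input tokens, most of which come from the image. To reduce inference costs, one can either downsize the LLM or reduce the number of input image tokens, the latter being the focus of many recent works on token compression. However, because both factors directly affect VLM performance, the optimal trade-off is unclear. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture how performance varies with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, inference-optimal behavior in VLMs, that is, minimum downstream error at any given fixed inference compute, is achieved by using the largest LLM that fits within the inference budget while minimizing the visual token count, often down to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., 5-10x), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take some initial steps toward building approaches tailored to high token compression settings. Code is available at https://github.com/locuslab/llava-token-compression.|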
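|**2024-11-05**|[HumanVLM: Foundation for Human-Scene Vision-Language Model](http://arxiv.org/abs/2411.03034)|null|Human-scene vision-language tasks are increasingly prevalent in diverse social applications, yet recent advances predominantly rely on models specifically tailored to individual tasks. Emerging research indicates that large vision-language models (VLMs) can enhance performance across various downstream vision-language understanding tasks. However, general-domain models often underperform in specialized fields. This study introduces a domain-specific large vision-language model, the Human-Scene Vision-Language Model (HumanVLM), designed to provide a foundation for human-scene vision-language tasks. Specifically, (1) we create a large-scale human-scene multimodal image-text dataset (HumanCaption-10M) sourced from the Internet to facilitate domain-specific alignment; (2) we develop a captioning approach for human-centered images that captures human faces, bodies, and backgrounds, and construct a high-quality human-scene image-text dataset (HumanCaptionHQ, about 311k pairs) that contains as much detailed information about the people as possible; (3) using HumanCaption-10M and HumanCaptionHQ, we train HumanVLM. In the experiments, we then evaluate HumanVLM across various downstream tasks, where it demonstrates superior overall performance among multimodal models of comparable scale, excelling particularly in human-related tasks and significantly outperforming similar models, including Qwen2VL and ChatGPT-4o. HumanVLM, together with the data introduced, will stimulate research in human-related fields.|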
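|**2024-11-05**|[Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning](http://arxiv.org/abs/2411.02793)|null|Multimodal sentiment analysis (MSA) is an important research area that aims to understand and recognize human sentiment through multiple modalities. The complementary information provided by multimodal fusion promotes sentiment analysis, making it more effective than using a single modality alone. However, in real-world applications, many unavoidable factors may lead to uncertain modality missing, which hinders the effectiveness of multimodal modeling and degrades model performance. To this end, we propose a Hierarchical Representation Learning Framework (HRLF) for the MSA task under uncertain missing modalities. Specifically, we propose a fine-grained representation factorization module that sufficiently extracts valuable sentiment information by factorizing each modality into sentiment-relevant and modality-specific representations through cross-modal translation and sentiment semantic reconstruction. Moreover, we introduce a hierarchical mutual information maximization mechanism that incrementally maximizes the mutual information between multi-scale representations to align and reconstruct the high-level semantics in the representations. Finally, we propose a hierarchical adversarial learning mechanism that further aligns and adapts the latent distribution of sentiment-relevant representations to produce robust joint multimodal representations. Comprehensive experiments on three datasets demonstrate that HRLF significantly improves MSA performance under uncertain modality missing cases.|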
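|**2024-11-05**|[DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark](http://arxiv.org/abs/2411.02733)|**[link](https://github.com/haodongli2024/rspope)**|With the rapid development of large vision-language models (LVLMs), these models have shown excellent results on various multimodal tasks. Because LVLMs are prone to hallucinations and there are currently few datasets and evaluation methods specifically designed for remote sensing, their performance is typically poor when applied to remote sensing tasks. To address these issues, this paper introduces DDFAV, a high-quality remote sensing LVLM dataset created using data augmentation and data mixing strategies. Next, a training instruction set is produced based on some high-quality remote sensing images selected from the proposed dataset. Finally, we develop RSPOPE, a remote sensing LVLM hallucination evaluation method based on the proposed dataset, and evaluate the zero-shot capabilities of different LVLMs. Our proposed dataset, instruction set, and evaluation method files are available at https://github.com/HaodongLi2024/rspope.|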
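|**2024-11-04**|[INQUIRE: A Natural World Text-to-Image Retrieval Benchmark](http://arxiv.org/abs/2411.02537)|**[link](https://github.com/inquire-benchmark/INQUIRE)**|We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries. These queries are paired with all relevant images comprehensively labeled within iNat24, comprising 33,000 total matches. The queries span categories such as species identification, context, behavior, and appearance, emphasizing tasks that require nuanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full-dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals. Detailed evaluation of a range of recent multimodal models shows that INQUIRE poses a significant challenge, with even the best models failing to achieve an mAP@50 above 50%. In addition, we show that reranking with more powerful multimodal models can enhance retrieval performance, yet there remains significant room for improvement. By focusing on scientifically motivated ecological challenges, INQUIRE aims to bridge the gap between AI capabilities and the needs of real-world scientific inquiry, encouraging the development of retrieval systems that can help accelerate ecological and biodiversity research. Our dataset and code are available at https://inquire-benchmark.github.io.|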
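|**2024-11-04**|[One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering](http://arxiv.org/abs/2411.02210)|null|Vision-language models (VLMs) have shown significant promise in visual question answering (VQA) tasks by leveraging web-scale multimodal datasets. However, these models often struggle with continual learning because of catastrophic forgetting when adapting to new tasks. As an effective remedy for catastrophic forgetting, rehearsal strategies use data from past tasks when learning a new task. However, such strategies require storing past data, which may not be feasible because of hardware constraints or privacy concerns. In this work, we propose the first data-free method for continual VQA, which leverages the language generation capability of a VLM, instead of relying on external models, to produce pseudo-rehearsal data. Our proposal, named GaB, generates pseudo-rehearsal data by posing previous-task questions on new-task data. Yet, despite being effective, the distribution of generated questions skews toward the most frequently posed questions because the training data is limited and task-specific. To mitigate this issue, we introduce a pseudo-rehearsal balancing module that aligns the generated data with the ground-truth data distribution using either question meta-statistics or an unsupervised clustering method. We evaluate our proposed method on two recent benchmarks, VQACL-VQAv2 and CLOVE-function. GaB outperforms all data-free baselines, with substantial improvements in maintaining VQA performance across evolving tasks, while being on par with methods that have access to past data.|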
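|**2024-11-04**|[TableGPT2: A Large Multimodal Model with Tabular Data Integration](http://arxiv.org/abs/2411.02059)|**[link](https://github.com/tablegpt/tablegpt-agent)**|The emergence of models like GPT, Claude, LLaMA, and Qwen has reshaped AI applications, presenting vast new opportunities across industries. Yet the integration of tabular data remains notably underdeveloped, despite its foundational role in numerous real-world domains. This gap is critical for three main reasons. First, database or data warehouse integration is essential for advanced applications; second, the vast and largely untapped resource of tabular data offers immense analytical potential; and third, the business intelligence domain specifically demands adaptable, precise solutions that many current LLMs may struggle to provide. In response, we introduce TableGPT2, a model rigorously pre-trained and fine-tuned with over 593.8K tables and 2.36M high-quality query-table-output tuples, a scale of table-related data unprecedented in prior research. This extensive training enables TableGPT2 to excel at table-centric tasks while maintaining strong general language and coding abilities. One of TableGPT2's key innovations is its novel table encoder, specifically designed to capture schema-level and cell-level information. This encoder strengthens the model's ability to handle ambiguous queries, missing column names, and irregular tables commonly encountered in real-world applications. Similar to vision-language models, this pioneering approach integrates with the decoder to form a robust large multimodal model. We believe the results are compelling: across 23 benchmarking metrics, TableGPT2 achieves an average performance improvement of 35.20% with the 7B model and 49.32% with the 72B model over prior benchmark-neutral LLMs, while maintaining robust general-purpose capabilities.|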
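|**2024-11-04**|[Foundations and Recent Trends in Multimodal Mobile Agents: A Survey](http://arxiv.org/abs/2411.02006)|**[link](https://github.com/aialt/awesome-mobile-agents)**|Mobile agents are essential for automating tasks in complex and dynamic mobile environments. As foundation models evolve, the demand for agents that can adapt in real time and process multimodal data has grown. This survey provides a comprehensive review of mobile agent technologies, focusing on recent advances that enhance real-time adaptability and multimodal interaction. Recently developed evaluation benchmarks better capture the static and interactive environments of mobile tasks, offering more accurate assessments of agent performance. We categorize these advances into two main approaches: prompt-based methods, which use large language models (LLMs) for instruction-based task execution, and training-based methods, which fine-tune multimodal models for mobile-specific applications. In addition, we explore complementary technologies that augment agent performance. By discussing key challenges and outlining future research directions, this survey offers valuable insights for advancing mobile agent technologies. A comprehensive resource list is available at https://github.com/aialt/awesome-mobile-agents.|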
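|**2024-11-03**|[EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark](http://arxiv.org/abs/2411.01492)|null|Recent studies on large language models (LLMs) and large multimodal models (LMMs) have demonstrated promising skills in various domains, including science and mathematics. However, their capability in more challenging and real-world related scenarios such as engineering has not been systematically studied. To bridge this gap, we propose EEE-Bench, a multimodal benchmark aimed at assessing the ability of LMMs to solve practical engineering tasks, using electrical and electronics engineering (EEE) as the testbed. Our benchmark consists of 2860 carefully curated problems spanning 10 essential subdomains such as analog circuits and control systems. Compared to benchmarks in other domains, engineering problems are intrinsically (1) more visually complex and versatile and (2) less deterministic in their solutions. Successfully solving these problems often demands more rigorous integration of visual and textual information than before, because models need to understand intricate images such as abstract circuits and system diagrams while also taking professional instructions into account, which makes them excellent candidates for LMM evaluation. Alongside EEE-Bench, we provide extensive quantitative evaluations and fine-grained analysis of 17 widely used open-source and closed-source LLMs and LMMs. Our results demonstrate notable deficiencies of current foundation models in EEE, with average performance ranging from 19.48% to 46.78%. Finally, we reveal and explore a critical shortcoming of LMMs that we term "laziness": the tendency to take shortcuts by relying on the text while overlooking the visual context when reasoning about technical image problems. In summary, we believe EEE-Bench not only reveals some noteworthy limitations of LMMs but also provides a valuable resource for advancing research on their application to practical engineering tasks, driving future improvements in their ability to handle complex real-world scenarios.|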
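|**2024-10-31**|[ $π_0$ : A Vision-Language-Action Flow Model for General Robot Control](http://arxiv.org/abs/2410.24164)|null|Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, and to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks zero-shot after pre-training, to follow language instructions from people and from a high-level VLM policy, and to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.|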
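|**2024-10-31**|[Exploring Vision Language Models for Facial Attribute Recognition: Emotion, Race, Gender, and Age](http://arxiv.org/abs/2410.24148)|null|Technologies for recognizing facial attributes such as race, gender, age, and emotion have wide applications in surveillance, advertising content, sentiment analysis, and the study of demographic trends and social behaviors. Analyzing demographic characteristics and facial expressions from images is challenging because of the complexity of facial attributes. Traditional approaches have employed convolutional neural networks (CNNs) and various other deep learning techniques, trained on large collections of labeled images. While these methods perform effectively, there remains room for further improvement. In this paper, we propose using vision-language models (VLMs) such as generative pre-trained transformer (GPT), GEMINI, large language and vision assistant (LLAVA), PaliGemma, and Microsoft Florence2 to recognize facial attributes such as race, gender, age, and emotion from images of human faces. We use various datasets, such as FairFace, AffectNet, and UTKFace, to evaluate the solutions. The results show that VLMs are competitive with, if not superior to, traditional techniques. Additionally, we propose "FaceScanPaliGemma", a fine-tuned PaliGemma model, for race, gender, age, and emotion recognition. The results show an accuracy of 81.1%, 95.8%, 80%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming the pre-trained version of PaliGemma, other VLMs, and SotA methods. Finally, we propose "FaceScanGPT", a GPT-4o model that recognizes the above attributes when several individuals are present in the image, using a prompt engineered for a person with specific facial and/or physical attributes. The results underscore the superior multitasking capability of FaceScanGPT, which uses only a prompt to drive the detection and recognition tasks, detecting an individual's attributes such as haircut, clothing color, and posture.|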
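|**2024-10-31**|[Nearest Neighbor Normalization Improves Multimodal Retrieval](http://arxiv.org/abs/2410.24114)|**[link](https://github.com/multimodal-interpretability/nnn)**|Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on text retrieval and image retrieval metrics for all of the contrastive models we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets we used (MS-COCO and Flickr30k). NNN requires a reference database but does not require any training on this database, and can even increase the retrieval accuracy of a model after fine-tuning.|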
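|**2024-10-31**|[Bayesian-guided Label Mapping for Visual Reprogramming](http://arxiv.org/abs/2410.24018)|**[link](https://github.com/tmlr-group/bayesianlm)**|Visual reprogramming (VR) leverages the intrinsic capabilities of pretrained vision models by adapting their input or output interfaces to solve downstream tasks whose labels (i.e., downstream labels) might be totally different from the labels associated with the pretrained models (i.e., pretrained labels). When adapting the output interface, label mapping methods transform the pretrained labels into downstream labels by establishing a gradient-free one-to-one correspondence between the two sets of labels. However, in this paper, we reveal that one-to-one mappings may overlook the complex relationship between pretrained and downstream labels. Motivated by this observation, we propose a Bayesian-guided Label Mapping (BLM) method. BLM constructs an iteratively updated probabilistic label mapping matrix, with each element quantifying a pairwise relationship between a pretrained label and a downstream label. The assignment of values in this matrix is guided by Bayesian conditional probability, considering the joint distribution of the downstream labels and the labels predicted by the pretrained model on downstream samples. Experiments on both pretrained vision models (e.g., ResNeXt) and vision-language models (e.g., CLIP) demonstrate the superior performance of BLM over existing label mapping methods. The success of BLM also offers a probabilistic lens through which to understand and analyze the effectiveness of VR. Our code is available at https://github.com/tmlr-group/BayesianLM.|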
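|**2024-10-31**|[EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection](http://arxiv.org/abs/2410.23904)|**[link](https://github.com/chelsielei/ez-hoi)**|Detecting human-object interactions (HOI) in zero-shot settings, where models must handle unseen classes, poses significant challenges. Existing methods rely on aligning visual encoders with large vision-language models (VLMs) to tap into the extensive knowledge of VLMs, which requires large, computationally expensive models and encounters training difficulties. Adapting VLMs with prompt learning offers an alternative to direct alignment. However, because labels for unseen classes are absent, fine-tuning on task-specific datasets often leads to overfitting to seen classes and suboptimal performance on unseen classes. To address these challenges, we introduce a novel prompt learning-based framework for Efficient Zero-Shot HOI detection (EZ-HOI). First, we introduce learnable prompts guided by a large language model (LLM) and a VLM, integrating detailed HOI descriptions and visual semantics to adapt VLMs to HOI tasks. However, because training datasets contain labels for seen classes only, fine-tuning VLMs on such datasets tends to optimize the learnable prompts for seen classes rather than unseen ones. Therefore, we design prompt learning for unseen classes using information from related seen classes, with LLMs used to highlight the differences between unseen classes and related seen classes. Quantitative evaluations on benchmark datasets demonstrate that EZ-HOI achieves state-of-the-art performance across various zero-shot settings while using only 10.35% to 33.95% of the trainable parameters of existing methods. Code is available at https://github.com/ChelsieLei/EZ-HOI.|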
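|**2024-10-31**|[Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP](http://arxiv.org/abs/2410.23698)|null|Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models), where the visual concepts are unseen or under-represented during pretraining. Prompt learning offers a parameter-efficient fine-tuning framework that can adapt CLIP to downstream tasks even when annotated data is limited. In this paper, we improve prompt learning by distilling textual knowledge from natural language prompts (either human-generated or LLM-generated) to provide rich priors for those under-represented concepts. We first obtain a prompt "summary" aligned to each input image via a learned prompt aggregator. We then jointly train a prompt generator, optimized to produce a prompt embedding that stays close to the aggregated summary while minimizing the task loss at the same time. We dub such a prompt embedding the Aggregate-and-Adapted Prompt Embedding (AAPE). AAPE is shown to generalize to different downstream data distributions and tasks, including vision-language understanding tasks (e.g., few-shot classification, VQA) and generation tasks (image captioning), where it achieves competitive performance. We also show that AAPE is particularly helpful for handling non-canonical and OOD examples. Furthermore, AAPE learning eliminates the LLM-based inference cost required by baselines, and scales better with data and LLM model size.|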
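|**2024-10-31**|[SuctionPrompt: Visual-assisted Robotic Picking with a Suction Cup Using Vision-Language Models and Facile Hardware Design](http://arxiv.org/abs/2410.23640)|null|The development of large language models and vision-language models (VLMs) has led to the increasing use of robotic systems in various fields. However, effectively integrating these models into real-world robotic tasks is a key challenge. We developed a versatile robotic system called SuctionPrompt that combines VLM prompting techniques with 3D detection to perform product-picking tasks in diverse and dynamic environments. Our method highlights the importance of integrating 3D spatial information with adaptive action planning, enabling robots to approach and manipulate objects in novel environments. In the validation experiments, the system accurately selected suction points 75.4% of the time and achieved a 65.0% success rate in picking common items. This study highlights the effectiveness of VLMs in robotic manipulation tasks, even with simple 3D processing.|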
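|**2024-10-30**|[CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP](http://arxiv.org/abs/2410.23330)|null|Machine unlearning (MU) has gained significant attention as a means to remove specific data from trained models without requiring a full retraining process. While progress has been made in unimodal domains such as text and image classification, unlearning in multimodal models remains relatively underexplored. In this work, we address the unique challenges of unlearning in CLIP, a prominent multimodal model that aligns visual and textual representations. We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations, ensuring that unlearning does not compromise model performance. CLIPErase consists of three key modules: a Forgetting Module that disrupts the associations of samples in the forget set, a Retention Module that preserves the model's performance on the retain set, and a Consistency Module that maintains consistency with the original model. Extensive experiments on the CIFAR-100 and Flickr30K datasets across four CLIP downstream tasks demonstrate that CLIPErase effectively forgets the designated associations of multimodal samples in zero-shot tasks, while preserving the model's performance on the retain set after unlearning.|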
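|**2024-10-30**|[EMMA: End-to-End Multimodal Model for Autonomous Driving](http://arxiv.org/abs/2410.23262)|null|We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multimodal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the use of world knowledge from the pre-trained large language model by representing all non-sensor inputs (e.g., navigation instructions and ego vehicle status) and outputs (e.g., trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space and to generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also has certain limitations: it can process only a small number of image frames, does not incorporate accurate 3D sensing modalities such as LiDAR or radar, and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.|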
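|**2024-10-30**|[Keypoint Abstraction using Large Models for Object-Relative Imitation Learning](http://arxiv.org/abs/2410.23254)|null|Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have proven effective as a succinct representation for capturing essential object features and for establishing a reference frame in action prediction, enabling data-efficient learning of robot skills. However, their manual design nature and reliance on additional human labels limit their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating keypoint proposals with LMs and verifying them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels. Website: https://kalm-il.github.io/|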
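|**2024-10-29**|[Natural Language Inference Improves Compositionality in Vision-Language Models](http://arxiv.org/abs/2410.22315)|null|Compositional reasoning in vision-language models (VLMs) remains challenging, as these models often struggle to relate objects, attributes, and spatial relationships. Recent methods aim to address these limitations by relying on the semantics of the textual description, using large language models (LLMs) to break it down into subsets of questions and answers. However, these methods primarily operate at the surface level, failing to incorporate deeper lexical understanding while introducing incorrect assumptions generated by the LLM. In response to these issues, we present Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages natural language inference (NLI) to generate entailments and contradictions from a given premise. CECE produces lexically diverse sentences while maintaining their core meaning. Through extensive experiments, we show that CECE enhances interpretability and reduces over-reliance on biased or superficial features. By balancing CECE with the original premise, we achieve significant improvements over previous methods without requiring additional fine-tuning, producing state-of-the-art results on benchmarks that score agreement with human judgments for image-text alignment, and achieving gains of +19.2% (group score) on Winoground and +12.9% (group score) on EqBen over the best prior work (fine-tuned with targeted data).|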
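|**2024-10-29**|[Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving](http://arxiv.org/abs/2410.22313)|**[link](https://github.com/hustvl/senna)**|End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data, but still struggles in complex and rare scenarios because of limited commonsense. In contrast, large vision-language models (LVLMs) excel at scene understanding and reasoning. The path forward lies in merging the strengths of both approaches. Previous methods that use LVLMs to predict trajectories or control signals yield suboptimal results, because LVLMs are not well suited to precise numerical predictions. This paper presents Senna, an autonomous driving system that combines an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction. Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM uses a multi-image encoding approach and multi-view prompts for efficient scene understanding. In addition, we introduce planning-oriented QAs alongside a three-stage training strategy, which enhances Senna-VLM's planning performance while preserving commonsense. Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance. Notably, with pre-training on the large-scale dataset DriveX and fine-tuning on nuScenes, Senna reduces average planning error by 27.12% and collision rate by 33.33% compared with the model without pre-training. We believe Senna's cross-scenario generalization and transferability are essential for achieving fully autonomous driving. Code and models will be released at https://github.com/hustvl/Senna.|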
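|**2024-10-29**|[ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding](http://arxiv.org/abs/2410.22211)|**[link](https://github.com/kimihiroh/promqa)**|Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, systems are typically evaluated on traditional classification tasks such as action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system progress in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs grounded in user recordings of procedural activities and their corresponding instructions. For QA annotation, we take a cost-effective human-LLM collaborative approach, in which the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide benchmark results to set the baseline performance on ProMQA. Our experiments reveal a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models' multimodal understanding capabilities.|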
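|**2024-10-29**|[Active Learning for Vision-Language Models](http://arxiv.org/abs/2410.22187)|null|Pre-trained vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, there still exists a considerable performance gap between these models and supervised deep models trained on a downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation. To achieve this, our approach first calibrates the predicted entropy of the VLM and then uses a combination of self-uncertainty and neighbor-aware uncertainty to compute a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets and significantly enhances the zero-shot performance of VLMs.|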
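|**2024-10-29**|[Are VLMs Really Blind](http://arxiv.org/abs/2410.22029)|**[link](https://github.com/vlgiitr/Are-VLMs-Really-Blind)**|Vision-language models excel at handling a wide range of complex tasks, including optical character recognition (OCR), visual question answering (VQA), and advanced geometric reasoning. However, these models perform poorly on low-level basic visual tasks that are especially easy for humans. The goal of our work is to determine whether these models are truly "blind" to geometric reasoning, or whether there are ways to enhance their capabilities in this area. Our work presents a novel automatic pipeline designed to extract key information from images in response to specific questions. Instead of relying on direct VQA alone, we use keywords derived from the question to create a caption that highlights the important details in the image related to the question. A language model then uses this caption to provide a precise answer to the question without requiring external fine-tuning.|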
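|**2024-10-29**|[Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications](http://arxiv.org/abs/2410.21943)|**[link](https://github.com/riedlerm/multimodal_rag_for_industry)**|Large language models (LLMs) have demonstrated impressive capabilities in answering questions, but they lack domain-specific knowledge and are prone to hallucinations. Retrieval-augmented generation (RAG) is one approach to address these challenges, while multimodal models are emerging as promising AI assistants for processing both text and images. In this paper, we describe a series of experiments aimed at determining how best to integrate multimodal models into RAG systems for the industrial domain. The purpose of the experiments is to determine whether including images alongside text from documents in the industrial domain increases RAG performance, and to find the optimal configuration for such a multimodal RAG system. Our experiments include two approaches to image processing and retrieval, as well as two LLMs (GPT4-Vision and LLaVA) for answer synthesis. The image processing strategies involve the use of multimodal embeddings and the generation of textual summaries from images. We evaluate our experiments with an LLM-as-a-Judge approach. Our results reveal that multimodal RAG can outperform single-modality RAG settings, although image retrieval poses a greater challenge than text retrieval. Additionally, using textual summaries of images is a more promising approach than using multimodal embeddings, providing more opportunities for future advancements.|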
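|**2024-10-29**|[Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models](http://arxiv.org/abs/2410.21802)|**[link](https://github.com/zhyblue424/tga-zsr)**|Because of their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon in which adversarial perturbations induce shifts in text-guided attention. Building on this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). The framework incorporates two components: an Attention Refinement module and an Attention-based Model Constraint module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. The Attention Refinement module aligns the text-guided attention obtained from the target model on adversarial examples with the text-guided attention obtained from the original model on clean examples. This alignment enhances the model's robustness. Additionally, the Attention-based Model Constraint module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. The experiments validate that our method improves zero-shot robust accuracy by 9.58% over current state-of-the-art techniques across 16 datasets. Our code is available at https://github.com/zhyblue424/TGA-ZSR.|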
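|**2024-10-29**|[AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?](http://arxiv.org/abs/2410.21259)|**[link](https://github.com/wad3birch/AutoBench-V)**|Large vision-language models (LVLMs) have become essential for advancing the integration of visual and linguistic information, facilitating a wide range of complex applications and tasks. However, the evaluation of LVLMs presents significant challenges, because constructing an evaluation benchmark always demands a great deal of human effort, and once built the benchmark remains static and lacks flexibility. Although automatic evaluation has been explored in the text modality, the visual modality remains under-explored. In this work, we therefore ask: "Can LVLMs serve as a path to automatic benchmarking?" We introduce AutoBench-V, an automated framework for on-demand evaluation, that is, benchmarking LVLMs on specific aspects of model capability. Upon receiving an evaluation capability, AutoBench-V uses text-to-image models to generate relevant image samples and then uses LVLMs to orchestrate visual question answering (VQA) tasks, completing the evaluation process efficiently and flexibly. Through an extensive evaluation of seven popular LVLMs across five user inputs (i.e., evaluation capabilities), the framework shows effectiveness and reliability. We observe the following: (1) our constructed benchmark accurately reflects varying task difficulties; (2) as task difficulty rises, the performance gap between models widens; (3) while models perform strongly in abstract-level understanding, they underperform in detail-oriented reasoning tasks; and (4) constructing datasets with varying levels of difficulty is critical for a comprehensive and exhaustive evaluation. Overall, AutoBench-V not only successfully uses LVLMs for automated benchmarking but also reveals the considerable potential of LVLMs as evaluators.|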
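|**2024-10-28**|[Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines](http://arxiv.org/abs/2410.21220)|**[link](https://github.com/cnzzx/vsa)**|Search engines enable the retrieval of unknown information through text. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user's question about that image. Moreover, because new objects and events continuously emerge, frequently updating VLMs is impractical given the heavy computational burden. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages the visual understanding capabilities of VLMs and the real-time information access of web agents to perform open-world retrieval-augmented generation via the web. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system. Extensive experiments on both open-set and closed-set QA benchmarks demonstrate that Vision Search Assistant significantly outperforms other models and can be widely applied to existing VLMs.|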
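|**2024-10-28**|[Zero-Shot Action Recognition in Surveillance Videos](http://arxiv.org/abs/2410.21113)|null|The growing demand for surveillance in public spaces presents significant challenges given the shortage of human resources. Current AI-based video surveillance systems rely heavily on core computer vision models that require extensive fine-tuning, which is particularly difficult in surveillance settings because datasets are limited and the conditions are hard (viewpoint, low quality, etc.). In this work, we propose leveraging large vision-language models (LVLMs), known for their strong zero-shot and few-shot generalization, to tackle video understanding tasks in surveillance. Specifically, we explore VideoLLaMA2, a state-of-the-art LVLM, and an improved token-level sampling method, Self-Reflective Sampling (Self-ReS). Our experiments on the UCF-Crime dataset show that VideoLLaMA2 represents a significant leap in zero-shot performance, with a 20% boost over the baseline. Self-ReS additionally increases zero-shot action recognition performance to 44.6%. These results highlight the potential of LVLMs, paired with improved sampling techniques, for advancing surveillance video analysis in diverse scenarios.|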
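|**2024-10-25**|[Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models](http://arxiv.org/abs/2410.19732)|null|Large vision-language models (LVLMs) excel at cross-modal tasks but underperform in long-context reasoning because they over-rely on textual information and depend less on visual information. In this study, we conduct an empirical analysis of LVLMs in long-context reasoning, which shows that as context length increases, reliance on language rises while reliance on vision falls. To address this issue, we propose a novel training-free context pruning method that selectively removes less critical textual information. Our approach enhances visual dependency and reduces textual noise, thereby improving LVLM performance in long-context reasoning. We validate our method by constructing a long-context dataset and demonstrate its effectiveness across various LVLMs. Moreover, further analysis confirms the robustness of different token pruning strategies and offers a preliminary exploration of the scaling relationship between pruning rate and context length.|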
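|**2024-10-25**|[OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization](http://arxiv.org/abs/2410.19609)|**[link](https://github.com/minorjerry/openwebvoyager)**|The rapid development of large language and multimodal models has sparked significant interest in using proprietary models, such as GPT-4o, to develop autonomous agents capable of handling real-world scenarios like web navigation. Although recent open-source efforts have tried to equip agents with the ability to explore environments and continuously improve over time, they build text-only agents in synthetic environments where the reward signals are clearly defined. Such agents struggle to generalize to realistic settings that require multimodal perception abilities and lack ground-truth signals. In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agents that can autonomously conduct real-world exploration and improve themselves. We first train the base model with imitation learning to gain basic abilities. We then let the agent explore the open web and collect feedback on its trajectories. After that, it further improves its policy by learning from well-performing trajectories judged by another general-purpose model. This exploration-feedback-optimization cycle can continue for several iterations. Experimental results show that our web agent successfully improves itself after each iteration, demonstrating strong performance across multiple test sets.|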
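|**2024-10-25**|[GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing](http://arxiv.org/abs/2410.19552)|**[link](https://github.com/HosamGen/GeoLLaVA)**|Detecting temporal changes in geographical landscapes is critical for applications such as environmental monitoring and urban planning. While remote sensing data is abundant, existing vision-language models (VLMs) often fail to capture temporal dynamics effectively. This paper addresses these limitations by introducing an annotated dataset of video frame pairs to track evolving geographical patterns over time. Using fine-tuning techniques such as Low-Rank Adaptation (LoRA), quantized LoRA (QLoRA), and model pruning on models such as Video-LLaVA and LLaVA-NeXT-Video, we significantly enhance VLM performance in processing temporal changes in remote sensing. Results show significant improvements, with the best performance reaching a BERT score of 0.864 and a ROUGE-1 score of 0.576, demonstrating superior accuracy in describing land-use transformations.|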
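|**2024-10-25**|[COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training](http://arxiv.org/abs/2410.19313)|**[link](https://github.com/nvlabs/coat)**|FP8 training has emerged as a promising method for improving training efficiency. Existing frameworks accelerate training by applying FP8 computation to linear layers while leaving optimizer states and activations in higher precision, which fails to fully optimize memory usage. This paper introduces COAT (Compressing Optimizer States and Activations for FP8 Training), a novel FP8 training framework designed to significantly reduce the memory footprint when training large models. COAT addresses current limitations through two key innovations: (1) Dynamic Range Expansion, which aligns optimizer state distributions more closely with the FP8 representation range, thereby reducing quantization error, and (2) Mixed-Granularity Activation Quantization, which optimizes activation memory using a combination of per-tensor and per-group quantization strategies. Experiments demonstrate that COAT effectively reduces the end-to-end training memory footprint by 1.54x compared to BF16 while achieving nearly lossless performance across various tasks, such as large language model pretraining and fine-tuning and vision-language model training. COAT also achieves a 1.43x end-to-end training speedup compared to BF16, performing on par with or surpassing TransformerEngine's speedup. COAT enables efficient full-parameter training of large models on fewer GPUs and doubles the batch size in distributed training settings, providing a practical solution for scaling large-scale model training. Code is available at https://github.com/NVlabs/COAT.|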
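|**2024-10-25**|[Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting](http://arxiv.org/abs/2410.19294)|null|Vision-language models, such as CLIP, have shown impressive generalization capacities when given appropriate text descriptions. While optimizing prompts on downstream labeled data has proven effective in improving performance, these methods entail the labor cost of annotation and are limited by annotation quality. Additionally, because CLIP is pre-trained on highly imbalanced web-scale data, it suffers from inherent label bias that leads to suboptimal performance. To tackle these challenges, we propose a label-**F**ree p**ro**mpt distribution **l**earning and b**i**as **c**orrection framework, dubbed **Frolic**, which boosts zero-shot performance without the need for labeled data. Specifically, Frolic learns distributions over prompt prototypes to capture diverse visual representations and adaptively fuses these with the original CLIP through confidence matching. This fused model is further enhanced by correcting label bias via a label-free logit adjustment. Notably, our method is not only training-free but also circumvents the need for hyper-parameter tuning. Extensive experimental results across 16 datasets demonstrate the efficacy of our approach: it outperforms the state of the art by an average of 2.6% on 10 datasets with CLIP ViT-B/16, and achieves an average margin of 1.5% on ImageNet and its five distribution shifts with CLIP ViT-B/16. Code is available at https://github.com/zhuhsingyuu/Frolic.|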
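|**2024-10-24**|[Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant](http://arxiv.org/abs/2410.19144)|**[link](https://github.com/vl2g/KaLMA)**|We revisit knowledge-aware text-based visual question answering, also known as Text-KVQA, in light of recent advances in large multimodal models (LMMs), and make the following contributions: (i) We propose VisTEL, a principled approach to visual text entity linking. The proposed VisTEL module harnesses a state-of-the-art visual text recognition engine and the power of a large multimodal model to reason jointly over textual and visual context obtained from surrounding cues in the image, linking the visual text entity to the correct knowledge base entity. (ii) We present KaLMA, a knowledge-aware large multimodal assistant that augments an LMM with knowledge associated with the visual text entity in the image to arrive at an accurate answer. Further, we provide a comprehensive experimental analysis and comparison of our approach with traditional visual question answering models, pre-LMM models, LMMs, and prior top-performing approaches. Averaged over three splits of Text-KVQA, our proposed approach surpasses the previous best approach by a substantial 23.3% on an absolute scale and establishes a new state of the art. We make our implementation publicly available.|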
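|**2024-10-24**|[VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](http://arxiv.org/abs/2410.19100)|null|Videos are often used to learn or extract the information needed to complete tasks, in ways that text and static images alone cannot provide. However, many existing agent benchmarks neglect long-context video understanding and focus instead on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the video understanding capabilities of long-context multimodal agents. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. Skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, while factual retention tasks evaluate whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves a 13.3% success rate on factual retention tasks and 45.8% on factual retention QA pairs, far below human performance of 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, showing a 5% performance decrease on WebArena tasks and a 10.3% decrease on VisualWebArena tasks. Our work highlights the need to improve the agentic abilities of long-context multimodal models and provides a testbed for future development of long-context video agents.|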
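|**2024-10-24**|[CAMEL-Bench: A Comprehensive Arabic LMM Benchmark](http://arxiv.org/abs/2410.18976)|**[link](https://github.com/mbzuai-oryx/CAMEL-Bench)**|Recent years have seen significant interest in developing large multimodal models (LMMs) capable of a wide range of visual reasoning and understanding tasks. This has led to the introduction of multiple LMM benchmarks that evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for Arabic, a language representing a population of over 400 million people. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, in order to evaluate generalizability across a broad range of scenarios. CAMEL-Bench contains around 29,036 questions filtered from a larger pool of samples, with quality manually verified by native speakers to ensure reliable model assessment. We evaluate both closed-source LMMs, including the GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o achieving an overall score of only 62%. Our benchmark and evaluation scripts are open-sourced.|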
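|**2024-10-24**|[Deep Insights into Cognitive Decline: A Survey of Leveraging Non-Intrusive Modalities with Deep Learning Techniques](http://arxiv.org/abs/2410.18972)|null|Cognitive decline is a natural part of aging and often results in reduced cognitive abilities. In some cases, however, the decline is more pronounced, typically because of disorders such as Alzheimer's disease. Early detection of anomalous cognitive decline is crucial, as it can facilitate timely professional intervention. While medical data can help with this detection, it often involves invasive procedures. An alternative is to employ non-intrusive techniques such as speech or handwriting analysis, which do not necessarily affect daily activities. This survey reviews the most relevant methodologies that use deep learning to automate the estimation of cognitive decline, covering audio, text, and visual processing. We discuss the key features and advantages of each modality and methodology, including state-of-the-art approaches such as Transformer architectures and foundation models. In addition, we present works that integrate different modalities to develop multimodal models. We also highlight the most significant datasets and the quantitative results from studies that use these resources. Several conclusions emerge from this review. In most cases, the textual modality achieves the best results and is the most relevant for detecting cognitive decline. Moreover, combining approaches from individual modalities into a multimodal model consistently improves performance across nearly all scenarios.|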
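|**2024-10-24**|[Zero-shot Object Navigation with Vision-Language Models Reasoning](http://arxiv.org/abs/2410.18570)|null|Object navigation is crucial for robots, but traditional methods require substantial training data and cannot generalize to unknown environments. Zero-shot object navigation (ZSON) aims to address this challenge, allowing robots to interact with unknown objects without specific training data. Language-driven zero-shot object navigation (L-ZSON) is an extension of ZSON that incorporates natural language instructions to guide robot navigation and interaction with objects. In this paper, we propose a novel Vision Language model with a Tree-of-thought Network (VLTNet) for L-ZSON. VLTNet comprises four main modules: vision language model understanding, semantic mapping, tree-of-thought reasoning and exploration, and goal identification. Among these, the Tree-of-Thought (ToT) reasoning and exploration module is the core component, using the ToT reasoning framework to select navigation frontiers during robot exploration. Compared with conventional frontier selection without reasoning, navigation with ToT reasoning involves multi-path reasoning and backtracking when necessary, which enables globally informed decisions with higher accuracy. Experimental results on the PASTURE and RoboTHOR benchmarks show that our model performs strongly on L-ZSON, particularly in scenarios where the target instruction is expressed in complex natural language.|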
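|**2024-10-24**|[Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data](http://arxiv.org/abs/2410.18558)|null|Vision-language models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance and leave a gap relative to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset with 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs that uses detailed image annotations and diverse question generation. Using this data, we trained a 2-billion-parameter VLM, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.|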
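|**2024-10-24**|[Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics](http://arxiv.org/abs/2410.18537)|null|Traditionally, style has been considered mainly in terms of artistic elements such as colors, brushstrokes, and lighting. However, identical semantic subjects, such as people, boats, and houses, can vary significantly across artistic traditions, which indicates that style also encompasses the underlying semantics. In this study, we therefore propose a zero-shot scheme for image variation with coordinated semantics. Specifically, our scheme transforms the image-to-image problem into an image-to-text-to-image problem. The image-to-text operation employs a vision-language model (e.g., BLIP) to generate text describing the content of the input image, including the objects and their positions. The input style keyword is then elaborated into a detailed description, which is merged with the content text using the reasoning capabilities of ChatGPT. Finally, the text-to-image operation uses a Diffusion model to generate images from the text prompt. To enable the Diffusion model to accommodate more styles, we propose a fine-tuning strategy that injects text and style constraints into cross-attention. This ensures that the output image exhibits similar semantics in the desired style. To validate the proposed scheme, we construct a benchmark comprising images of various styles and scenes and introduce two novel metrics. Despite its simplicity, our scheme yields highly plausible results in a zero-shot manner, particularly when generating stylized images with high-fidelity semantics.|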
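|**2024-10-23**|[R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models](http://arxiv.org/abs/2410.17885)|**[link](https://github.com/dle666/r-cot)**|Existing large multimodal models (LMMs) perform poorly on mathematical geometric reasoning because of a lack of high-quality image-text paired data. Current geometric data generation approaches, whether they apply preset templates to generate geometric data or use large language models (LLMs) to rephrase questions and answers (Q&A), unavoidably limit data accuracy and diversity. To synthesize higher-quality data, we propose a two-stage Reverse Chain-of-Thought (R-CoT) geometry problem generation pipeline. First, we introduce GeoChain to produce high-fidelity geometric images together with corresponding descriptions that highlight the relations among geometric elements. We then design a Reverse Q&A method that reasons step by step over the descriptions and generates questions in reverse from the reasoning results. Experiments show that the proposed method brings significant and consistent improvements to multiple LMM baselines, setting new performance records in the 2B, 7B, and 8B settings. Notably, R-CoT-8B significantly outperforms the previous state-of-the-art open-source mathematical models by 16.6% on MathVista and 9.2% on GeoQA, while also surpassing the closed-source model GPT-4o by an average of 13% across both datasets. Code is available at https://github.com/dle666/R-CoT.|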
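|**2024-10-23**|[Lightweight Neural App Control](http://arxiv.org/abs/2410.17883)|null|This paper introduces a novel mobile phone control architecture, termed "app agents", for efficient interaction and control across various Android apps. The proposed Lightweight Multi-modal App Control (LiMAC) takes as input a textual goal and a sequence of past mobile observations, such as screenshots and the corresponding UI trees, and generates precise actions. To address the computational constraints inherent to smartphones, we introduce within LiMAC a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets and demonstrate that our small-form-factor approach outperforms fine-tuned versions of open-source VLMs such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines that use closed-source foundation models such as GPT-4o. More specifically, LiMAC increases overall action accuracy by up to 19% compared with fine-tuned VLMs and by up to 42% compared with prompt engineering baselines.|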
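|**2024-10-23**|[MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models](http://arxiv.org/abs/2410.17637)|**[link](https://github.com/liuziyu77/mia-dpo)**|Visual preference alignment involves training large vision-language models (LVLMs) to predict human preferences over visual inputs. This is typically achieved by using labeled datasets of chosen/rejected pairs and applying optimization algorithms such as direct preference optimization (DPO). Existing visual alignment methods are designed primarily for single-image scenarios and struggle to handle the complexity of multi-image tasks effectively, owing to the scarcity of diverse training data and the high cost of annotating chosen/rejected pairs. We present Multi-Image Augmented Direct Preference Optimization (MIA-DPO), a visual preference alignment approach that effectively handles multi-image inputs. MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or picture-in-picture formats, which significantly reduces the cost of multi-image data annotation. We observe that the attention values of LVLMs vary considerably across different images. We use attention values to identify and filter out rejected responses that the model may have mistakenly focused on. Our attention-aware selection constructs the chosen/rejected pairs without relying on (i) human annotation, (ii) extra data, or (iii) external models or APIs. MIA-DPO is compatible with various architectures and outperforms existing methods on five multi-image benchmarks, with an average performance gain of 3.0% on LLaVA-v1.5 and 4.3% on the recent InternLM-XC2.5. Moreover, MIA-DPO has minimal effect on the model's ability to understand single images.|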
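|**2024-10-22**|[JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation](http://arxiv.org/abs/2410.17250)|null|Accelerating research on large multimodal models (LMMs) in non-English languages is crucial for improving user experiences across broader populations. In this paper, we introduce JMMMU (Japanese MMMU), the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks grounded in the Japanese cultural context. To facilitate comprehensive culture-aware evaluation, JMMMU features two complementary subsets: (i) a culture-agnostic (CA) subset, in which culture-independent subjects (e.g., Math) are selected and translated into Japanese, enabling one-to-one comparison with the English counterpart MMMU; and (ii) a culture-specific (CS) subset, comprising newly crafted subjects that reflect the Japanese cultural context. Using the CA subset, we observe a performance drop in many LMMs when evaluated in Japanese, which is attributable purely to the language difference. Using the CS subset, we reveal their inadequate understanding of Japanese culture. Further, by combining both subsets, we find that some LMMs perform well on the CA subset but not on the CS subset, exposing a shallow understanding of the Japanese language that lacks cultural depth. We hope this work will not only help advance LMM performance in Japanese but also serve as a guideline for creating high-standard, culturally diverse benchmarks for multilingual LMM development. The project page is https://mmmu-japanese-benchmark.github.io/JMMMU/.|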
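|**2024-10-22**|[PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction](http://arxiv.org/abs/2410.17247)|**[link](https://github.com/cooperx521/pyramiddrop)**|In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "a picture is worth a thousand words" suggests, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs that grow quadratically as input image resolution increases, severely affecting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably lose crucial image information and ultimately degrade model performance. To address this challenge, we conduct an empirical study showing that all visual tokens are necessary in the shallow layers of LVLMs, while token redundancy progressively increases in the deeper layers. To this end, we propose PyramidDrop, a visual redundancy reduction strategy for LVLMs that improves their training and inference efficiency with negligible performance loss. Specifically, we partition the LVLM into several stages and drop a portion of the image tokens at the end of each stage at a predefined ratio, creating pyramid-like visual tokens across model layers. The dropping is based on a lightweight similarity calculation with negligible time overhead. Extensive experiments demonstrate that PyramidDrop can reduce the training time of LLaVA-NeXT by 40% and inference FLOPs by 55% with comparable performance. In addition, PyramidDrop can serve as a plug-and-play, training-free inference acceleration strategy, achieving better performance and lower inference cost than comparable methods. We hope that the insights and approach introduced by PyramidDrop will inspire future research into the role of image tokens in LVLMs.|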
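|**2024-10-22**|[An Eye for an AI: Evaluating GPT-4o's Visual Perception Skills and Geometric Reasoning Skills Using Computer Graphics Questions](http://arxiv.org/abs/2410.16991)|null|CG (Computer Graphics) is a popular field within CS (Computer Science), but many students find the subject difficult because it requires a wide range of skills such as mathematics, programming, geometric reasoning, and creativity. Over the past few years, researchers have explored ways to harness the power of generative artificial intelligence (GenAI) to improve teaching. In CS, much of this research has focused on introductory computing education. A recent study evaluating the large language model (LLM) GPT-4 (text-only) on CG questions indicated that GPT-4 performs poorly and relies on detailed descriptions of image content, which often requires considerable insight from the user to obtain reasonable results. So far, no studies have investigated the ability of large multimodal models (LMMs), or multimodal LLMs, to solve CG questions and how these abilities can be used to improve teaching. In this study, we construct two datasets of CG questions that require varying degrees of visual perception and geometric reasoning skills, and we evaluate the current state-of-the-art LMM, GPT-4o, on both datasets. We find that although GPT-4o shows great potential for independently solving questions that contain visual information, major limitations remain in the accuracy and quality of the generated results. We propose several novel approaches for CG educators to incorporate GenAI into CG teaching despite these limitations. We hope that our guidelines will further encourage learning and engagement in CG classrooms.|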
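|**2024-10-22**|[MPDS: A Movie Posters Dataset for Image Generation with Diffusion Model](http://arxiv.org/abs/2410.16840)|null|Movie posters are vital for captivating audiences, conveying themes, and driving market competition in the film industry. While traditional design is laborious, intelligent generation technology can improve efficiency and enhance design quality. Despite exciting progress in image generation, current models often fall short of producing satisfactory posters. The primary issue is the absence of specialized poster datasets for targeted model training. In this work, we propose a Movie Posters DataSet (MPDS), tailored for text-to-image generation models and intended to transform poster production. MPDS is dedicated to posters and, to the best of our knowledge, is the first image-text pair dataset of its kind, comprising more than 373k image-text pairs and more than 8k actor images covering more than 4k actors. Detailed poster descriptions, such as movie titles, genres, casts, and synopses, are carefully organized and standardized based on public movie synopses; these are also referred to as movie-synopsis prompts. To enrich the poster descriptions and reduce their divergence from the movie synopses, we further use a large vision-language model to automatically produce vision-perceptive prompts for each poster, followed by manual correction and integration with the movie-synopsis prompts. In addition, we introduce a poster-caption prompt to represent the text elements in posters, such as actor names and movie titles. For movie poster generation, we develop a multi-condition diffusion framework that takes the poster prompt, the poster caption, and an actor image (for personalization) as inputs and yields strong results through the learning of a diffusion model. Experiments demonstrate the value of the proposed MPDS dataset for advancing personalized movie poster generation. MPDS is available at https://anonymous.4open.science/r/MPDS-373k-BD3B.|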
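|**2024-10-21**|[DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding](http://arxiv.org/abs/2410.16472)|null|Document structure editing involves manipulating localized textual, visual, and layout components in document images based on user requests. Past work has shown that multimodal grounding of user requests in the document image, together with accurate identification of structural components and their associated attributes, remains a key challenge for this task. To address these issues, we introduce DocEdit-v2, a novel framework that performs end-to-end document editing by leveraging large multimodal models (LMMs). It consists of three novel components: (1) Doc2Command, which simultaneously localizes the edit region of interest (RoI) and disambiguates user edit requests into edit commands; (2) LLM-based Command Reformulation prompting, which tailors edit commands originally intended for specialized software into edit instructions suitable for generalist LMMs; and (3) processing of these outputs by large multimodal models such as GPT-4V and Gemini, which parse the document layout, execute edits on the grounded region of interest (RoI), and generate the edited document image. Extensive experiments on the DocEdit dataset show that DocEdit-v2 significantly outperforms strong baselines on edit command generation (2-33%), RoI bounding box detection (12-31%), and overall document editing (1-12%).|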
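|**2024-10-21**|[Promoting cross-modal representations to improve multimodal foundation models for physiological signals](http://arxiv.org/abs/2410.16424)|null|Many healthcare applications are inherently multimodal and involve several physiological signals. As sensors for these signals become more common, improving machine learning methods for multimodal healthcare data is crucial. Pretraining foundation models is a promising avenue for success. However, methods for developing foundation models in healthcare are still at an early stage of exploration, and it is unclear which pretraining strategies are most effective given the diversity of physiological signals. This is partly due to the challenges of multimodal health data: obtaining data from many patients is difficult and costly, there is considerable inter-subject variability, and modalities are often heterogeneously informative across downstream tasks. Here, we explore these challenges on the PhysioNet 2018 dataset. We use a masked autoencoding objective to pretrain a multimodal model. We show that the representations learned by this model can be linearly probed for a diverse set of downstream tasks. We hypothesize that cross-modal reconstruction objectives are important for successful multimodal training, because they encourage the model to integrate information across modalities. We demonstrate that modality dropout in the input space improves performance on downstream tasks. We also find that late-fusion models pretrained with contrastive learning objectives are less effective across multiple tasks. Finally, we analyze the model's representations and show that, with our pretraining strategy, attention weights become more cross-modal and temporally aligned. The learned embeddings also become more distributed in terms of the modalities encoded by each unit. Overall, our work demonstrates the utility of multimodal foundation models for health data, even across diverse physiological data sources. We further argue that explicit methods for inducing cross-modality may enhance multimodal pretraining strategies.|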
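|**2024-10-21**|[VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use](http://arxiv.org/abs/2410.16400)|null|While vision-language models (VLMs) have demonstrated remarkable performance across various tasks that combine textual and visual information, they continue to struggle with fine-grained visual perception tasks that require detailed pixel-level analysis. How to effectively elicit comprehensive reasoning from VLMs about such intricate visual elements remains an open challenge. In this paper, we present VipAct, an agent framework that enhances VLMs by integrating multi-agent collaboration and vision expert models, enabling more precise visual understanding and more comprehensive reasoning. VipAct consists of an orchestrator agent, which handles task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks such as image captioning, and vision expert models that provide high-precision perceptual information. This multi-agent approach allows VLMs to perform fine-grained visual perception tasks better by combining planning, reasoning, and tool use. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, and the experimental results show significant performance improvements over state-of-the-art baselines across all tasks. Furthermore, comprehensive ablation studies reveal the critical role of multi-agent collaboration in eliciting more detailed System-2 reasoning and highlight the importance of image input for task planning. Additionally, our error analysis identifies patterns in the inherent limitations of VLMs in visual perception, providing insights into potential future improvements. VipAct offers a flexible and extensible framework, paving the way for more advanced visual perception systems across various real-world applications.|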
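|**2024-10-21**|[Improve Vision Language Model Chain-of-thought Reasoning](http://arxiv.org/abs/2410.16198)|**[link](https://github.com/riflezhang/llava-reasoner-dpo)**|Chain-of-thought (CoT) reasoning in vision-language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data and rely on datasets dominated by short annotations with minimal rationales. In this work, we show that training a VLM on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from the GPT-4o model to enrich the training data and fine-tune VLMs, which boosts their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains by comparing their predictions with the annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model's reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets and better generalization to direct answer prediction as well. This work emphasizes the importance of incorporating detailed rationales in training and of using reinforcement learning to strengthen the reasoning capabilities of VLMs.|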
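|**2024-10-21**|[Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models](http://arxiv.org/abs/2410.16163)|**[link](https://github.com/jefferyzhan/griffon)**|Large multimodal models (LMMs) have achieved significant breakthroughs across various vision-language and vision-centric tasks based on auto-regressive modeling. However, these models typically focus either on vision-centric tasks, such as visual grounding and region description, or on vision-language tasks, such as image captioning and multi-scenario visual question answering (VQA). No LMM has yet comprehensively unified both types of tasks within a single model in the way that large language models have in natural language processing. Furthermore, even with abundant multi-task instruction-following data, directly stacking this data to extend universal capabilities remains challenging. To address these issues, we introduce CCMD-8M, a novel multi-dimension curated and consolidated multimodal dataset that overcomes the data barriers to unifying vision-centric and vision-language tasks through multi-level data curation and multi-task consolidation. More importantly, we present Griffon-G, a general LMM that addresses both vision-centric and vision-language tasks within a single end-to-end paradigm. Griffon-G resolves the training collapse encountered during joint optimization of these tasks and achieves better training efficiency. Evaluations across multimodal benchmarks, general VQA tasks, scene-text-centric VQA tasks, document-related VQA tasks, referring expression comprehension, and object detection show that Griffon-G surpasses advanced LMMs and achieves expert-level performance on complex vision-centric tasks.|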
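|**2024-10-21**|[Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning](http://arxiv.org/abs/2410.16162)|null|Vision-language models (VLMs) have demonstrated impressive performance across a wide range of downstream tasks. However, their spatial reasoning remains limited, even though it plays a crucial role in tasks involving navigation and interaction with physical environments. Specifically, much of the spatial reasoning in these tasks occurs in two-dimensional (2D) environments, and our evaluation reveals that state-of-the-art VLMs frequently generate implausible and incorrect responses to composite spatial reasoning problems, including simple pathfinding tasks that humans can solve effortlessly at a glance. To address this, we explore an effective approach to enhancing 2D spatial reasoning in VLMs by training the model on basic spatial capabilities. We begin by disentangling the key components of 2D spatial reasoning: direction comprehension, distance estimation, and localization. Our central hypothesis is that mastering these basic spatial capabilities can significantly improve a model's performance on composite spatial tasks that require advanced spatial understanding and combinatorial problem solving. To test this hypothesis, we introduce Sparkle, a framework that fine-tunes VLMs on these three basic spatial capabilities through synthetic data generation and targeted supervision, forming an instruction dataset for each capability. Our experiments show that VLMs fine-tuned with Sparkle achieve significant performance gains not only on the basic tasks themselves but also generalize to composite and out-of-distribution spatial reasoning tasks (e.g., improving from 13.5% to 40.0% on the shortest-path problem). These findings underscore the effectiveness of mastering basic spatial capabilities for enhancing composite spatial problem solving and offer insights for improving the spatial reasoning capabilities of VLMs.|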
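|**2024-10-18**|[NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples](http://arxiv.org/abs/2410.14669)|null|Vision-language models (VLMs) have made significant progress on recent visual question answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models such as CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, which contains 10,000 human-verified VQA samples for reliably evaluating VLMs. Crucially, we adopt a vision-centric design that pairs each question with two images that yield different answers, preventing models from answering blindly without using the images. This makes NaturalBench more challenging than previous benchmarks that can be solved with commonsense priors. We evaluate 53 state-of-the-art VLMs on NaturalBench and show that models such as LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is hard from two angles: (1) Compositionality: solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning such as logic and counting. To this end, unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2) Biases: NaturalBench exposes severe biases in VLMs, as models often choose the same answer regardless of the image. Lastly, we apply our benchmark curation method to diverse data sources, including long captions (over 100 words) and non-English languages such as Chinese and Hindi, highlighting its potential for dynamic evaluation of VLMs.|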
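|**2024-10-18**|[Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension](http://arxiv.org/abs/2410.14332)|**[link](https://github.com/deepglint/croc)**|Recent advances in large language models (LLMs) have driven the development of large multimodal models (LMMs). However, existing research focuses primarily on tuning language and image instructions and overlooks the critical pretraining phase in which models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs that enhances the visual comprehension capabilities of LLMs by introducing a novel cross-modal comprehension stage. Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens. We then conceptualize visual tokens as a "foreign language" for the LLM and propose a mixed attention mechanism, combining bidirectional visual attention with unidirectional textual attention, to comprehensively enhance the understanding of visual tokens. In addition, we integrate a detailed caption generation task that uses rich descriptions to further help the LLM understand visual semantic information. After pretraining on 1.5 million publicly accessible data samples, we present a new foundation model called Croc. Experimental results demonstrate that Croc achieves new state-of-the-art performance on large vision-language benchmarks. To support reproducibility and facilitate further research, we release the training code and pretrained model weights at https://github.com/deepglint/Croc.|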
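|**2024-10-18**|[E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model](http://arxiv.org/abs/2410.14200)|null|The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment. However, compared with 2D medical images, 3D medical images such as CT scans face challenges including limited training data and high dimensionality, which severely restrict progress on 3D medical vision-language models. To address these issues, we collect a large amount of unlabeled 3D CT data and use self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features. We then apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity while preserving spatial information. We also construct two instruction-tuning datasets based on BIMCV-R and CT-RATE to fine-tune the 3D vision-language model. Our model demonstrates superior performance over existing methods in report generation, visual question answering, and disease diagnosis. Code and data will be made publicly available soon.|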
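|**2024-10-18**|[LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs](http://arxiv.org/abs/2410.14182)|null|Laboratory accidents pose significant risks to human life and property, underscoring the importance of robust safety protocols. Despite advances in safety training, laboratory personnel may still unknowingly engage in unsafe practices. As various fields, including laboratory settings, increasingly rely on large language models (LLMs) for guidance, concerns are growing about the reliability of LLMs in critical safety-related decisions. Unlike trained human researchers, LLMs lack formal lab safety education, which raises questions about their ability to provide safe and accurate guidance. Existing research on LLM trustworthiness focuses primarily on issues such as ethical compliance, truthfulness, and fairness, but does not fully cover safety-critical real-world applications such as lab safety. To address this gap, we propose the Laboratory Safety Benchmark (LabSafety Bench), a comprehensive evaluation framework based on a new taxonomy aligned with Occupational Safety and Health Administration (OSHA) protocols. The benchmark includes 765 multiple-choice questions verified by human experts and assesses the performance of LLMs and vision-language models (VLMs) in lab safety contexts. Our evaluations show that although GPT-4o outperforms human participants, it is still prone to critical errors, which highlights the risks of relying on LLMs in safety-critical environments. Our findings emphasize the need for specialized benchmarks to accurately assess the trustworthiness of LLMs in real-world safety applications.|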
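|**2024-10-18**|[ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom](http://arxiv.org/abs/2410.14138)|null|Large vision-language models (LVLMs) have made significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information in visual reasoning tasks, which degrades performance. To tackle this issue, we first identify the drawbacks of existing solutions, namely insufficient and irrelevant visual descriptions and limited multimodal capacities. We then decompose the visual reasoning process into two stages, visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. The framework features multi-run proactive perception and decoupled vision-reasoning capabilities. Briefly, given a multimodal question, ProReason iterates between proactive information collection and reasoning until the answer can be concluded from necessary and sufficient visual descriptions. Notably, the decoupling of capabilities allows existing large language models (LLMs) to be integrated seamlessly to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods on a wide range of benchmarks, for both open-source and closed-source models. In addition, with the assistance of LLMs, ProReason achieves a performance improvement of up to 15% on the MMMU benchmark. Our insights into existing solutions and the decoupled perspective on feasible LLM integration shed light on future research on visual reasoning techniques, especially LLM-assisted ones.|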
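|**2024-10-17**|[Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers](http://arxiv.org/abs/2410.14072)|null|Recent advances in vision-language models (VLMs) have expanded their potential for real-world applications, enabling these models to perform complex reasoning over images. In widely used fully autoregressive Transformer-based models such as LLaVA, projected visual tokens are prepended to the textual tokens. The visual tokens are often far more numerous than the prompt tokens, which increases computational overhead during both training and inference. In this paper, we propose Visual Compact Token Registers (Victor), a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens. Victor adds a few learnable register tokens after the visual tokens and summarizes the visual information into these registers using the first few layers of the VLM's language tower. After these few layers, all visual tokens are discarded, which significantly improves computational efficiency for both training and inference. Notably, our method is easy to implement and requires only a small number of new trainable parameters, with minimal impact on model performance. In our experiments, with just 8 visual registers (about 1% of the original tokens), Victor keeps the accuracy drop below 4% while reducing total training time by 43% and boosting inference throughput by 3.3x.|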
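|**2024-10-17**|[Reproducibility study of "LICO: Explainable Models with Language-Image Consistency"](http://arxiv.org/abs/2410.13989)|**[link](https://github.com/robertdvdk/lico-fact)**|The growing reproducibility crisis in machine learning calls for careful examination of research findings. This paper investigates the LICO method proposed by Lei et al. (2023), which aims to enhance post-hoc interpretability techniques and improve image classification performance. LICO uses natural language supervision from a vision-language model to enrich feature representations and guide the learning process. We conduct a comprehensive reproducibility study using (Wide) ResNets and established interpretability methods such as Grad-CAM and RISE. We were mostly unable to reproduce the authors' results. In particular, we did not find that LICO consistently improved classification performance or the quantitative and qualitative measures of interpretability. Our findings thus underscore the importance of rigorous evaluation and transparent reporting in interpretability research.|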
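|**2024-10-17**|[Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations](http://arxiv.org/abs/2410.13976)|null|Large vision-language models (LVLMs) such as LLaVA have demonstrated impressive capabilities as general-purpose chatbots that can hold conversations about a provided input image. However, their responses are influenced by societal biases present in their training datasets, which leads to undesirable differences in how the model handles images depicting people of different demographics. In this work, we propose a novel debiasing framework for LVLMs that directly ablates biased attributes during text generation, so that the model avoids generating text related to protected attributes or even representing them internally. Our method requires no training and only a relatively small number of representative biased outputs (about 1000 samples). Our experiments show that we can not only minimize the propensity of LVLMs to generate text related to protected attributes, but can even use synthetic data to guide the ablation while maintaining captioning performance on real data such as COCO. Furthermore, we find that generations from a debiased LVLM exhibit accuracy similar to those of the biased baseline model, which shows that debiasing can be achieved without sacrificing model performance.|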
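|**2024-10-17**|[Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation](http://arxiv.org/abs/2410.13848)|**[link](https://github.com/deepseek-ai/janus)**|In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, as in Chameleon. However, because multimodal understanding and generation require different levels of information granularity, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways while still using a single, unified Transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, the multimodal understanding and generation components can each independently select their most suitable encoding method. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.|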
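|**2024-10-17**|[VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks](http://arxiv.org/abs/2410.13666)|**[link](https://github.com/shailaja183/vl-glue)**|Deriving inferences from heterogeneous inputs such as images, text, and audio is an important skill that humans use to perform day-to-day tasks. A similar ability is highly desirable for developing advanced artificial intelligence (AI) systems. While state-of-the-art models are rapidly closing the gap with human-level performance on diverse computer vision and natural language processing tasks, they still struggle with tasks that require joint reasoning over visual and textual modalities. Inspired by GLUE (Wang et al., 2018), a multitask benchmark for natural language understanding, we propose VL-GLUE in this paper. VL-GLUE consists of over 100k samples spanning seven different tasks, each of which requires visuo-linguistic reasoning at its core. Moreover, our benchmark covers diverse image types (from synthetically rendered figures and day-to-day scenes to charts and complex diagrams) and a broad variety of domain-specific text (from cooking, politics, and sports to high-school curricula), reflecting the need for multimodal understanding in the real world. We show that this benchmark is quite challenging for existing large-scale vision-language models and encourage the development of systems with robust visuo-linguistic reasoning capabilities.|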
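|**2024-10-17**|[H2OVL-Mississippi Vision Language Models Technical Report](http://arxiv.org/abs/2410.13611)|null|Smaller vision-language models (VLMs) are becoming increasingly important for privacy-focused, on-device applications because they can run efficiently on consumer hardware to process enterprise business documents and images. These models require strong language understanding and visual capabilities to enhance human-machine interaction. To address this need, we present H2OVL-Mississippi, a pair of small VLMs trained on 37 million image-text pairs using 240 hours of compute on 8 H100 GPUs. H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition; it achieves state-of-the-art performance on the text recognition portion of OCRBench and surpasses many much larger models in this area. In addition, we release H2OVL-Mississippi-2B, a 2-billion-parameter general-purpose model that shows highly competitive metrics across various academic benchmarks. Both models build on our prior work with the H2O-Danube language models, extending their capabilities into the visual domain. We release them under the Apache 2.0 license, making VLMs accessible to everyone and democratizing document AI and visual LLMs.|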
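|**2024-10-17**|[GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models](http://arxiv.org/abs/2410.13510)|null|Geometry problem solving demands advanced reasoning abilities to process multimodal inputs and apply mathematical knowledge effectively. Vision-language models (VLMs) have made significant progress on various multimodal tasks. Yet they still struggle with geometry problems and are significantly limited by their inability to perform mathematical operations not seen during pre-training, such as calculating the cosine of an arbitrary angle, and by difficulties in correctly applying relevant geometry formulas. To overcome these challenges, we present GeoCoder, which uses modular code fine-tuning to generate and execute code with a predefined geometry function library. Executing code yields accurate and deterministic calculations, in contrast to the stochastic nature of autoregressive token prediction, while the function library minimizes errors in formula usage. We also propose a multimodal retrieval-augmented variant of GeoCoder, named RAG-GeoCoder, which incorporates a non-parametric memory module to retrieve functions from the geometry library and thereby reduces reliance on parametric memory. Our modular code fine-tuning approach enhances the geometric reasoning capabilities of VLMs, yielding an average improvement of over 16% across problem complexities on the GeomVerse dataset compared with other fine-tuning methods.|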
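|**2024-10-17**|[Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR](http://arxiv.org/abs/2410.13445)|null|Automatic speech recognition (ASR) for low-resource languages remains a challenge because of the scarcity of labeled training data. Parameter-efficient fine-tuning and text-only adaptation are two popular methods used to address such low-resource settings. In this work, we investigate how these techniques can be combined effectively using a multilingual multimodal model such as SeamlessM4T. Multimodal models can exploit unlabeled text through text-only adaptation and then undergo parameter-efficient ASR fine-tuning, which improves ASR performance. We also show cross-lingual transfer from a high-resource language, achieving up to a 17% relative word error rate (WER) reduction over the baseline in a zero-shot setting without any labeled speech.|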
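|**2024-10-17**|[Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding](http://arxiv.org/abs/2410.13321)|null|Large vision-language models (LVLMs) demonstrate impressive capabilities in generating detailed and coherent responses from visual inputs. However, they are prone to hallucinations because of an over-reliance on language priors. To address this issue, we investigate the language priors in LVLMs and make two key observations: (1) even when predicting tokens associated with image-related part-of-speech (POS) tags, models increasingly rely on language priors as the token sequence grows, which amplifies hallucinations; (2) methods that directly calibrate the LVLM's output distribution to mitigate language priors can degrade text quality or even exacerbate hallucinations. Based on these findings, we propose a novel method, Summary-Guided Decoding (SGD). This method naturally encourages the model to focus more on image information by reducing the text context through summaries, while controlling only the image-related POS tokens to maintain text quality. Through experiments, we demonstrate that SGD achieves state-of-the-art performance on object hallucination benchmarks. Furthermore, in terms of the trade-off between precision and recall, SGD achieves Pareto optimality among existing methods. Lastly, we observe that although existing methods struggle to balance reducing object hallucinations with maintaining text quality, SGD is robust in handling this challenge.|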
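|**2024-10-17**|[Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead](http://arxiv.org/abs/2410.13146)|**[link](https://github.com/kuleens/vlmbiaseval)**|As vision-language models (VLMs) gain widespread use, their fairness remains under-explored. In this paper, we analyze demographic biases across five models and six datasets. We find that portrait datasets such as UTKFace and CelebA are the best tools for bias detection, revealing gaps in performance and fairness between LLaVa and CLIP models. However, scene-based datasets such as PATA and VLStereoSet fail to serve as useful bias benchmarks because of the way they are constructed. As for pronoun-based datasets such as VisoGender, we receive mixed signals, since only some subsets of the data are useful in providing insights. To alleviate this problem, we introduce a more difficult version of VisoGender to serve as a more rigorous evaluation. Based on these results, we call for more effective and carefully designed datasets to ensure that VLMs are both fair and reliable.|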
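|**2024-10-16**|[Sensitivity of Generative VLMs to Semantically and Lexically Altered Prompts](http://arxiv.org/abs/2410.13030)|null|Despite the proliferation of prompt-tuning techniques for generative vision-language models (VLMs), it remains unclear how sensitive these models are to lexical and semantic alterations in prompts. In this paper, we use the SugarCrepe++ dataset to evaluate the ability of generative VLMs to understand lexical and semantic changes in text. We analyze the sensitivity of VLMs to lexical alterations in prompts that do not correspond to any semantic change. Our findings show that generative VLMs are highly sensitive to such alterations. In addition, we show that this vulnerability affects the performance of techniques aimed at making their outputs consistent.|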
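|**2024-10-16**|[Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models](http://arxiv.org/abs/2410.13002)|null|End-to-end learning maps sensory inputs directly to actions, creating highly integrated and efficient policies for complex robotics tasks. However, such models are hard to train efficiently and often struggle to generalize beyond their training scenarios, which limits adaptability to new environments, tasks, and concepts. In this work, we investigate the minimal data requirements and architectural adaptations necessary for vision-based control policies to achieve robust closed-loop performance under unseen text instructions and visual distribution shifts. To this end, we design datasets with varying levels of data representation richness, refine feature extraction protocols by leveraging multimodal foundation model encoders, and assess the suitability of different policy network heads. Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained vision-language models (VLMs) as frozen patch-wise feature extractors, generating spatially aware embeddings that integrate semantic and visual information. These rich features form the basis for training highly robust downstream policies capable of generalizing across platforms, environments, and text-specified tasks. We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning on a small simulated dataset successfully generalize to real-world scenes and handle diverse novel goals and command formulations.|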
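|**2024-10-16**|[The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio](http://arxiv.org/abs/2410.12787)|null|Recent advances in large multimodal models (LMMs) have significantly improved performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, that is, discrepancies between the factual multimodal input and the generated textual output, which limits their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: over-reliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the Curse of Multi-Modalities (CMM) benchmark, which comprehensively evaluates hallucinations in LMMs and provides a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, and underscore the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could improve the reliability of LMMs.|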
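|**2024-10-15**|[Unveiling the Mystery of Visual Attributes of Concrete and Abstract Concepts: Variability, Nearest Neighbors, and Challenging Categories](http://arxiv.org/abs/2410.11657)|**[link](https://github.com/TarunTater/AbstractConceptsInImages)**|The visual representation of a concept varies significantly depending on its meaning and the context in which it occurs, which poses multiple challenges for both vision and multimodal models. Our study focuses on concreteness, a well-researched lexical-semantic variable, and uses it as a case study to examine variability in visual representations. We rely on images associated with approximately 1,000 abstract and concrete concepts, extracted from two different datasets: Bing and YFCC. Our goals are to: (i) evaluate whether visual diversity in the depiction of concepts can reliably distinguish between concrete and abstract concepts; (ii) analyze the variability of visual features across multiple images of the same concept through a nearest neighbor analysis; and (iii) identify challenging factors contributing to this variability by categorizing and annotating images. Our findings indicate that, for classifying images of abstract versus concrete concepts, a combination of basic visual features such as color and texture is more effective than features extracted by more complex models such as the Vision Transformer (ViT). However, ViTs perform better in the nearest neighbor analysis, which emphasizes the need for careful selection of visual features when analyzing conceptual variables through modalities other than text.|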
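|**2024-10-15**|[On-the-fly Modulation for Balanced Multimodal Learning](http://arxiv.org/abs/2410.11582)|**[link](https://github.com/gewu-lab/bml_tpami2024)**|Multimodal learning is expected to boost model performance by integrating information from different modalities. However, its potential is not fully exploited because the widely used joint training strategy, which applies a uniform objective to all modalities, leads to imbalanced and under-optimized unimodal representations. Specifically, we point out that there often exist modalities with more discriminative information, for example the vision of playing football and the sound of blowing wind. They can dominate the joint training process, leaving other modalities significantly under-optimized. To alleviate this problem, we first analyze the under-optimization phenomenon from both the feed-forward and the back-propagation stages of optimization. We then propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies, which modulate the optimization of each modality by monitoring the discriminative discrepancy between modalities during training. Concretely, OPM weakens the influence of the dominant modality by dropping its features with a dynamic probability in the feed-forward stage, while OGM mitigates its gradient in the back-propagation stage. In experiments, our methods demonstrate considerable improvement across a variety of multimodal tasks. These simple yet effective strategies not only enhance the performance of vanilla and task-oriented multimodal models, but also perform well on more complex multimodal tasks, which demonstrates their effectiveness and flexibility. The source code is available at https://github.com/GeWu-Lab/BML_TPAMI2024.|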
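|**2024-10-15**|[Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference](http://arxiv.org/abs/2410.11403)|null|Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities. A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number (2^M) of inference networks for all possible modality combinations. Mixture-based models simplify this by requiring only as many inference models as there are modalities and aggregating the unimodal inferences. However, they suffer from information loss when modalities are missing. Alignment-based VAEs address this by aligning unimodal inference models with a multimodal model through minimizing the Kullback-Leibler (KL) divergence, but they face issues because the amortization gap compromises inference accuracy. To tackle these problems, we introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework. This method overcomes the information loss caused by missing modalities and minimizes the amortization gap by iteratively refining the multimodal inference using all available modalities. By aligning the unimodal inference to this refined multimodal posterior, we obtain unimodal inferences that effectively incorporate multimodal information while requiring only unimodal inputs during inference. Experiments on benchmark datasets show that our approach improves inference performance, as evidenced by higher linear classification accuracy and competitive cosine similarity, and enhances cross-modal generation, as indicated by lower FID scores. This demonstrates that our method enhances the representations inferred from unimodal inputs.|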
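|**2024-10-15**|[LargePiG: Your Large Language Model is Secretly a Pointer Generator](http://arxiv.org/abs/2410.11366)|null|Recent research on query generation has focused on using large language models (LLMs), which bring state-of-the-art performance but also introduce hallucination problems in the generated queries. In this work, we introduce relevance hallucination and factuality hallucination as a new typology for the hallucination problems brought by LLM-based query generation. We propose an effective way to separate content from form in LLM-generated queries: factual knowledge extracted and integrated from the inputs is preserved, while the syntactic structure, including function words, is composed using the powerful linguistic capabilities of the LLM. Specifically, we introduce a model-agnostic and training-free method that turns a large language model into a pointer generator (LargePiG), in which the pointer attention distribution leverages the LLM's inherent attention weights and the copy probability is derived from the difference between the vocabulary distributions of the model's high layers and its last layer. To validate the effectiveness of LargePiG, we construct two datasets for assessing hallucination problems in query generation, covering both document and video scenarios. Empirical studies on various LLMs demonstrate the superiority of LargePiG on both datasets. Additional experiments also verify that LargePiG can reduce hallucination in large vision-language models and improve accuracy on document-based question answering and factuality evaluation tasks.|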
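|**2024-10-15**|[CLIP-DFGS: A Hard Sample Mining Method for CLIP in Generalizable Person Re-Identification](http://arxiv.org/abs/2410.11255)|null|Recent advances in pre-trained vision-language models such as CLIP have shown their potential for person re-identification (ReID) applications. However, their performance on generalizable person re-identification tasks remains suboptimal. The large-scale and diverse image-text pairs used in CLIP's pre-training may lead to a lack or insufficiency of certain fine-grained features. In light of these challenges, we propose a hard sample mining method called DFGS (Depth-First Graph Sampler), based on depth-first search, which is designed to supply sufficiently challenging samples to enhance CLIP's ability to extract fine-grained features. DFGS can be applied to both the image encoder and the text encoder in CLIP. By leveraging the powerful cross-modal learning capabilities of CLIP, we apply DFGS to extract challenging samples and form mini-batches with high discriminative difficulty, providing the image model with more effective and harder-to-distinguish samples and thereby enhancing the model's ability to tell individuals apart. Our results show significant improvements over other methods, confirming the effectiveness of DFGS in supplying challenging samples that enhance CLIP's performance on generalizable person re-identification.|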
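|**2024-10-14**|[Locality Alignment Improves Vision-Language Models](http://arxiv.org/abs/2410.11087)|null|Vision-language models (VLMs) have seen growing adoption in recent years, but many still make basic spatial reasoning errors. We hypothesize that this is because VLMs adopt pre-trained vision backbones, specifically vision transformers (ViTs) trained with image-level supervision and minimal inductive biases. Such models may fail to encode the class content at each position in the image, and our goal is to resolve this by ensuring that the vision backbone effectively captures both local and global image semantics. Our main insight is that we do not need new supervision to learn this capability: pre-trained models contain significant knowledge of local semantics that we can extract and use for scalable self-supervision. We propose a new, efficient post-training stage for ViTs called locality alignment, together with a novel fine-tuning procedure called MaskEmbed that uses a masked reconstruction loss to learn the semantic contribution of each image patch. We first evaluate locality alignment with a vision-only benchmark and find that it improves a model's performance on a patch-level semantic segmentation task, especially for strong backbones trained with image-caption pairs (e.g., CLIP and SigLIP). We then train a series of VLMs with and without locality alignment and show that locality-aligned backbones improve performance across a range of benchmarks, particularly those that involve spatial understanding (e.g., RefCOCO, OCID-Ref, TallyQA, VSR, AI2D). Overall, we demonstrate that local semantic extraction can be learned efficiently through a locality alignment stage, and that this procedure complements existing VLM training recipes that use off-the-shelf vision backbones.|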
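|**2024-10-14**|[Towards Foundation Models for 3D Vision: How Close Are We?](http://arxiv.org/abs/2410.10799)|**[link](https://github.com/princeton-vl/uniqa-3d)**|Building a foundation model for 3D vision is a complex challenge that remains unsolved. Toward that goal, it is important to understand the 3D reasoning capabilities of current models and to identify the gaps between these models and humans. We therefore construct a new 3D visual understanding benchmark that covers fundamental 3D vision tasks in the visual question answering (VQA) format. We evaluate state-of-the-art vision-language models (VLMs), specialized models, and human subjects. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust and fail under geometric perturbations. In contrast, human vision remains the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms than classical computer vision methods do, and that Transformer-based networks such as ViT align more closely with human 3D vision mechanisms than CNNs. We hope our study will benefit the future development of foundation models for 3D vision.|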
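|**2024-10-14**|[VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents](http://arxiv.org/abs/2410.10594)|**[link](https://github.com/openbmb/visrag)**|Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to use external knowledge sources for generation. However, current RAG systems are based solely on text, which makes it impossible to use visual information such as layout and images that plays a crucial role in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a RAG pipeline based on a vision-language model (VLM). In this pipeline, instead of first parsing the document to obtain text, the document is embedded directly as an image using a VLM and then retrieved to enhance the generation of a VLM. Compared with traditional text-based RAG, VisRAG maximizes the retention and use of the data information in the original documents and eliminates the information loss introduced during parsing. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25%-39% end-to-end performance gain over a traditional text-based RAG pipeline. Further analysis reveals that VisRAG uses training data effectively and demonstrates strong generalization capability, which makes it a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag.|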
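|**2024-10-14**|[LG-CAV: Train Any Concept Activation Vector with Language Guidance](http://arxiv.org/abs/2410.10308)|null|The concept activation vector (CAV) has attracted broad research interest in explainable AI because it elegantly attributes model predictions to specific concepts. However, training a CAV often requires a large number of high-quality images, which are expensive to curate, and CAVs are therefore limited to a predefined set of concepts. To address this issue, we propose the Language-Guided CAV (LG-CAV) to harness the abundant concept knowledge within certain pre-trained vision-language models (e.g., CLIP). This method allows any CAV to be trained without labeled data by using the corresponding concept descriptions as guidance. To bridge the gap between the vision-language model and the target model, we use the vision-language model to compute the activation values of concept descriptions on a common pool of images (probe images) and use them as language guidance to train the LG-CAV. Furthermore, after training high-quality LG-CAVs related to all the predicted classes in the target model, we propose activation sample reweighting (ASR), a model correction technique that in turn improves the performance of the target model. Experiments on four datasets across nine architectures demonstrate that LG-CAV achieves significantly better quality than previous CAV methods for any given concept, and that our model correction method achieves state-of-the-art performance compared with existing concept-based methods. Our code is available at https://github.com/hqhQAQ/LG-CAV.|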
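|**2024-10-14**|[Saliency Guided Optimization of Diffusion Latents](http://arxiv.org/abs/2410.10257)|null|With the rapid advances in diffusion models, generating high-quality images from text prompts is no longer a challenge. The focus of text-to-image generation is now on how to optimize the generated results so that they align better with human intentions or prompts. Existing optimization methods commonly treat the entire image as a whole and conduct global optimization. These methods overlook the fact that when viewing an image, the human visual system naturally concentrates attention on salient areas and disregards less important or non-salient regions. In other words, humans are likely to overlook optimizations made in non-salient areas. Consequently, although model fine-tuning is conducted under the guidance of large multimodal models, existing methods that perform global optimization yield suboptimal results. To address this alignment challenge effectively and efficiently, we propose Saliency Guided Optimization Of Diffusion Latents (SGOOL). We first employ a saliency detector to mimic the human visual attention system and mark out the salient regions. To avoid retraining an additional model, our method directly optimizes the diffusion latents. In addition, SGOOL uses an invertible diffusion process and benefits from a constant-memory implementation. Our method is therefore a parameter-efficient, plug-and-play fine-tuning method. Extensive experiments with several metrics and human evaluation show the superiority of SGOOL in image quality and prompt alignment.|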
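|**2024-10-11**|[SegGrasp: Zero-Shot Task-Oriented Grasping via Semantic and Geometric Guided Segmentation](http://arxiv.org/abs/2410.08901)|null|Task-oriented grasping, which involves grasping specific parts of objects according to their functions, is crucial for developing advanced robotic systems capable of performing complex tasks in dynamic environments. In this paper, we propose a training-free framework that incorporates both semantic and geometric priors for zero-shot task-oriented grasp generation. The proposed framework, SegGrasp, first uses vision-language models such as GLIP for coarse segmentation. It then uses detailed geometric information from convex decomposition to improve segmentation quality through a fusion policy named GeoFusion. A grasp network operating on the improved segmentation can then generate effective grasp poses. We conducted experiments on both a segmentation benchmark and real-world robot grasping. The experimental results show that SegGrasp surpasses the baseline by more than 15% in both grasping and segmentation performance.|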
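|**2024-10-11**|[Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation](http://arxiv.org/abs/2410.08895)|null|Cache-based approaches stand out as both effective and efficient for adapting vision-language models (VLMs). Nonetheless, existing cache models overlook three crucial aspects. 1) Pre-trained VLMs are optimized mainly for image-text similarity and neglect the importance of image-image similarity, which leaves a gap between pre-training and adaptation. 2) Current cache models are based on the Nadaraya-Watson (N-W) estimator, which disregards the intricate relationships among training samples when constructing the weight function. 3) With limited samples, the logits generated by the cache model are highly uncertain, and using these logits directly without accounting for confidence can be problematic. To address these challenges, this work presents three calibration modules. Similarity Calibration refines image-image similarity by using unlabeled images. We add a learnable projection layer with a residual connection on top of the pre-trained image encoder of CLIP and optimize its parameters by minimizing a self-supervised contrastive loss. Weight Calibration introduces a precision matrix into the weight function to adequately model the relationships among training samples, transforming the existing cache model into a Gaussian Process (GP) regressor, which can be more accurate than the N-W estimator. Confidence Calibration uses the predictive variances computed by GP regression to dynamically rescale the logits of the cache model, ensuring that the cache model's outputs are adjusted appropriately according to their confidence levels. In addition, to reduce the high complexity of GPs, we further propose a group-based learning strategy. Integrating these designs, we propose both training-free and training-required variants. Extensive experiments on 11 few-shot classification datasets demonstrate that the proposed methods achieve state-of-the-art performance.|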
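|**2024-10-11**|[RoRA-VLM: Robust Retrieval-Augmented Vision Language Models](http://arxiv.org/abs/2410.08876)|null|Current vision-language models (VLMs) still underperform on knowledge-intensive tasks, primarily because of the difficulty of accurately encoding all the associations between visual objects and scenes and their corresponding entities and background knowledge. While retrieval augmentation methods offer an efficient way to integrate external knowledge, extending them to the vision-language domain presents unique challenges: (1) precisely retrieving relevant information from external sources, given the inherent discrepancy within multimodal queries; and (2) resisting the irrelevant, extraneous, and noisy information contained in the retrieved multimodal knowledge snippets. In this work, we introduce RORA-VLM, a novel and robust retrieval augmentation framework tailored specifically for VLMs, with two key innovations: (1) a two-stage retrieval process with image-anchored textual-query expansion that synergistically combines the visual and textual information in the query and retrieves the most relevant multimodal knowledge snippets; and (2) a robust retrieval augmentation method that strengthens the VLM's resilience to irrelevant information in the retrieved multimodal knowledge by injecting adversarial noise into the retrieval-augmented training process, and that filters out extraneous visual information, such as unrelated entities present in images, through a query-oriented visual token refinement strategy. We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets. Our results demonstrate that, with a minimal number of training instances, RORA-VLM enables the base model to achieve significant performance improvements and to consistently outperform state-of-the-art retrieval-augmented VLMs on all benchmarks, while also exhibiting a novel zero-shot domain transfer capability.|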
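|**2024-10-11**|[VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model](http://arxiv.org/abs/2410.08792)|null|Vision-language models (VLMs) have recently been adopted in robotics for their capabilities in common-sense reasoning and generalizability. Existing work has applied VLMs to generate task and motion plans from natural language instructions and to simulate training data for robot learning. In this work, we explore using a VLM to interpret human demonstration videos and generate robot task plans. Our method integrates keyframe selection, visual perception, and VLM reasoning into a pipeline. We name it SeeDo because it enables the VLM to "see" human demonstrations and explain the corresponding plans to the robot for it to "do". To validate our approach, we collected a set of long-horizon human videos demonstrating pick-and-place tasks in three diverse categories and designed a set of metrics to comprehensively benchmark SeeDo against several baselines, including state-of-the-art video-input VLMs. The experiments demonstrate SeeDo's superior performance. We further deployed the generated task plans both in a simulation environment and on a real robot arm.|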
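|**2024-10-11**|[Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models](http://arxiv.org/abs/2410.08791)|**[link](https://github.com/abbasireza/super-pipeline)**|The rapid growth of machine learning models, particularly in natural language processing and computer vision, has made it challenging to run these models on hardware with limited resources. This paper introduces Superpipeline, a new framework designed to optimize the execution of large AI models on constrained hardware during both training and inference. Our approach manages model execution dynamically by dividing models into individual layers and efficiently transferring these layers between GPU and CPU memory. In our experiments, Superpipeline reduces GPU memory usage by up to 60% while maintaining model accuracy and acceptable processing speeds. This allows models that would otherwise exceed the available GPU memory to run effectively. Unlike existing solutions that focus mainly on inference or on specific model types, Superpipeline can be applied to large language models (LLMs), vision-language models (VLMs), and vision-based models. We tested Superpipeline's performance across various models and hardware setups. The method includes two key parameters that allow fine-tuning of the balance between GPU memory use and processing speed. Importantly, Superpipeline does not require retraining or changing model parameters, so the original model's output remains unchanged. Superpipeline's simplicity and flexibility make it useful for researchers and professionals working with advanced AI models on limited hardware. It enables the use of larger models or bigger batch sizes on existing hardware, which could speed up innovation across many machine learning applications. This work marks an important step toward making advanced AI models more accessible and optimizing their deployment in resource-limited environments. The code for Superpipeline is available at https://github.com/abbasiReza/super-pipeline.|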
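|**2024-10-11**|[Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping](http://arxiv.org/abs/2410.08695)|**[link](https://github.com/yangyue5114/DME)**|Large vision-language models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to good performance on various multimodal evaluation benchmarks. However, these benchmarks are static and overlap with the pre-training data, which results in fixed complexity constraints and data contamination issues. This raises concerns about the validity of the evaluation. To address these two challenges, we introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB). VLB provides a robust and comprehensive assessment for LVLMs with reduced data contamination and flexible complexity. To this end, VLB dynamically generates new visual question-answering samples through a multimodal bootstrapping module that modifies both images and language, while a judge module ensures that the newly generated samples remain consistent with the original ones. By composing various bootstrapping strategies, VLB offers dynamic variants of existing benchmarks with diverse complexities, enabling the evaluation to co-evolve with the ever-evolving capabilities of LVLMs. Extensive experimental results across multiple benchmarks, including SEEDBench, MMBench, and MME, show that VLB significantly reduces data contamination and exposes the performance limitations of LVLMs.|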
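|**2024-10-11**|[Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models](http://arxiv.org/abs/2410.08611)|**[link](https://github.com/mengyuanchen21/neurips2024-csp)**|A straightforward pipeline for zero-shot out-of-distribution (OOD) detection selects potential OOD labels from an extensive semantic pool and then uses a pre-trained vision-language model to classify over both in-distribution (ID) and OOD labels. In this paper, we theorize that improving performance requires expanding the semantic pool while increasing the expected probability that OOD samples activate the selected OOD labels, and ensuring low mutual dependence among the activations of these OOD labels. A natural way to expand is to adopt a larger lexicon; however, the inevitable introduction of numerous synonyms and uncommon words fails to meet these requirements, indicating that viable expansion goes beyond merely selecting words from a lexicon. Since OOD detection aims to correctly classify input images into ID/OOD class groups, we can "make up" OOD label candidates that are not standard class names but are beneficial to the process. Observing that the original semantic pool consists of unmodified specific class names, we correspondingly construct a conjugated semantic pool (CSP) consisting of modified superclass names, each serving as a cluster center for samples that share similar properties across different categories. Consistent with our established theory, expanding OOD label candidates with the CSP satisfies the requirements and outperforms existing work by 7.89% in FPR95. Code is available at https://github.com/MengyuanChen21/NeurIPS2024-CSP.|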
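|**2024-10-11**|[ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression](http://arxiv.org/abs/2410.08584)|null|The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache during the decoding phase, particularly with high-resolution images or videos. Visual content often exhibits substantial redundancy, resulting in highly sparse attention maps within LVLMs. This sparsity can be exploited to accelerate attention computation or compress the KV cache through various approaches. However, most studies address only one of these bottlenecks and do not adequately support dynamic adjustment of sparsity across different layers or tasks. In this paper, we present ZipVL, an efficient inference framework for LVLMs that resolves both computation and memory bottlenecks through a dynamic ratio allocation strategy for important tokens. The ratio is determined adaptively from the layer-specific distribution of attention scores rather than fixed as a hyperparameter, improving efficiency on simpler tasks while maintaining high performance on more challenging ones. We then select important tokens based on their normalized attention scores and perform attention only on those tokens to accelerate the prefill phase. To mitigate the memory bottleneck in the decoding phase, we apply mixed-precision quantization to the KV cache, using high-bit quantization for the caches of important tokens and low-bit quantization for those of less important tokens. Our experiments show that ZipVL accelerates the prefill phase by 2.6x and reduces GPU memory usage by 50.0%, with an accuracy drop of only 0.2% on the Video-MME benchmark with the LongVA-7B model, effectively improving the generation efficiency of LVLMs.|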
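|**2024-10-10**|[LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts](http://arxiv.org/abs/2410.08211)|null|Large-scale vision-language pre-trained (VLP) models such as CLIP are known for their versatility, as they can be applied to diverse applications in a zero-shot setting. However, when these models are used in specific domains, their performance often falls short because of domain gaps or the under-representation of those domains in the training data. Fine-tuning VLP models on custom datasets with human-annotated labels can address this issue, but annotating even a small-scale dataset (e.g., 100k samples) can be expensive and often requires expert annotators if the task is complex. To address these challenges, we propose LatteCLIP, an unsupervised method for fine-tuning CLIP models for classification with known class names in custom domains, without relying on human annotations. Our method leverages large multimodal models (LMMs) to generate expressive textual descriptions for both individual images and groups of images. These provide additional contextual information to guide the fine-tuning process in the custom domains. Since LMM-generated descriptions are prone to hallucination or missing details, we introduce a novel strategy that distills only the useful information and stabilizes training. Specifically, we learn rich per-class prototype representations from noisy generated texts and dual pseudo-labels. Our experiments on 10 domain-specific datasets show that LatteCLIP outperforms pre-trained zero-shot methods by an average of +4.74 points in top-1 accuracy and other state-of-the-art unsupervised methods by +3.45 points.|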
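|**2024-10-10**|[Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision](http://arxiv.org/abs/2410.08209)|null|Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice of fine-tuning LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method that leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM that uses a diffusion-based visual encoder instead of the standard CLIP visual encoder and is trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a grounding mask recall of 44.2 on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://groundLMM.github.io.|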
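|**2024-10-10**|[MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models](http://arxiv.org/abs/2410.08182)|null|Existing multimodal retrieval benchmarks focus primarily on evaluating whether models can retrieve and use external textual knowledge to answer questions. However, there are scenarios in which retrieving visual information is more beneficial or more easily accessible than textual data. In this paper, we introduce MRAG-Bench, a multimodal retrieval-augmented generation benchmark in which we systematically identify and categorize scenarios where visually augmented knowledge is better than textual knowledge, for instance, more images from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we evaluate 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs improve more when augmented with images than with textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, we conduct extensive analysis with MRAG-Bench, which offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, struggles to use retrieved knowledge effectively, achieving only a 5.82% improvement with ground-truth information, in contrast to the 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to improve LVLMs' ability to use retrieved visual knowledge more effectively.|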
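|**2024-10-10**|[Q-VLM: Post-training Quantization for Large Vision-Language Models](http://arxiv.org/abs/2410.08119)|**[link](https://github.com/changyuanwang17/qvlm)**|In this paper, we propose a post-training quantization framework for large vision-language models (LVLMs) for efficient multimodal inference. Conventional quantization methods sequentially search layer-wise rounding functions by minimizing activation discretization errors, which fails to find the optimal quantization strategy because it ignores cross-layer dependency. In contrast, we mine the cross-layer dependency that significantly influences the discretization errors of the entire vision-language model and embed this dependency into the search for the optimal quantization strategy at low search cost. Specifically, we observe a strong correlation between activation entropy and the cross-layer dependency related to output discretization errors. We therefore use entropy as a proxy to partition blocks optimally, aiming for a satisfactory trade-off between discretization error and search cost. Moreover, we optimize the visual encoder to disentangle cross-layer dependency, enabling a fine-grained decomposition of the search space so that search cost is further reduced without harming quantization accuracy. Experimental results show that our method compresses memory by 2.78x and increases generation speed by 1.44x for the approximately 13B LLaVA model without performance degradation on diverse multimodal reasoning tasks. Code is available at https://github.com/ChangyuanWang17/QVLM.|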
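|**2024-10-10**|[Unsupervised Data Validation Methods for Efficient Model Training](http://arxiv.org/abs/2410.07880)|null|This paper investigates the challenges and potential solutions for improving machine learning systems for low-resource languages. State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT), and vision-language models (VLM) rely heavily on large datasets, which are often unavailable for low-resource languages. This research explores key areas such as defining "quality data", developing methods for generating appropriate data, and enhancing accessibility to model training. A comprehensive review of current methodologies, including data augmentation, multilingual transfer learning, synthetic data generation, and data selection techniques, highlights both advancements and limitations. Several open research questions are identified, providing a framework for future studies aimed at optimizing data utilization, reducing the required data quantity, and maintaining high-quality model performance. By addressing these challenges, the paper aims to make advanced machine learning models more accessible for low-resource languages, enhancing their utility and impact across various sectors.|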
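|**2024-10-10**|[HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter](http://arxiv.org/abs/2410.07854)|null|Adapter-based tuning methods have shown significant potential for transferring knowledge from pre-trained vision-language models to downstream tasks. However, after reviewing existing adapters, we find that they generally fail to fully explore the interactions between different modalities when constructing task-specific knowledge. Also, existing work usually focuses only on similarity matching between positive text prompts, making it challenging to distinguish classes with highly similar visual content. To address these issues, in this paper we propose a novel Heterogeneous Graph Adapter for tuning vision-language models on downstream tasks. Specifically, we first construct a unified heterogeneous graph schema, which contains i) visual nodes, positive text nodes, and negative text nodes, and ii) several types of edge connections to comprehensively model intra-modality, inter-modality, and inter-class structural knowledge. Next, we employ a specific heterogeneous graph neural network to mine multi-modal structural knowledge for adapting both visual and textual features for the downstream tasks. Finally, after HeGraphAdapter, we construct both text-based and visual-based classifiers simultaneously to comprehensively enhance the performance of the CLIP model. Experimental results on 11 benchmark datasets demonstrate the effectiveness and benefits of the proposed HeGraphAdapter.|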
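|**2024-10-10**|[FLIER: Few-shot Language Image Models Embedded with Latent Representations](http://arxiv.org/abs/2410.07648)|null|With the rapid development of large vision-language models such as Contrastive Language-Image Pre-training (CLIP), many CLIP-like methods have shown impressive abilities in visual recognition, especially in low-data scenarios. However, we have noticed that most of these methods are limited to introducing new modifications to the text and image encoders. Recently, latent diffusion models (LDMs) have shown good ability in image generation. The strong capabilities of LDMs direct our attention to the latent representations sampled by the UNet. Inspired by the conjecture in CoOp that learned prompts encode meanings beyond the existing vocabulary, we hypothesize that, for deep models, latent representations are a concise and accurate understanding of images in which high-frequency, imperceptible details are abstracted away. In this paper, we propose a Few-shot Language Image model Embedded with latent Representations (FLIER) for image recognition, which introduces a latent encoder jointly trained with CLIP's image encoder and thereby combines the pre-trained vision-language knowledge of CLIP with the latent representations from Stable Diffusion. We first generate images and corresponding latent representations via Stable Diffusion using textual inputs from GPT-3. Treating latent representations as "model-understandable pixels", we introduce a flexible convolutional neural network with two convolutional layers as the latent encoder, which is simpler than most encoders in vision-language models. The latent encoder is jointly trained with CLIP's image encoder, transferring pre-trained knowledge to downstream tasks better. Experiments and extensive ablation studies on various visual classification tasks show that FLIER achieves state-of-the-art performance on 11 datasets for most few-shot classification settings.|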
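|**2024-10-10**|[A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks](http://arxiv.org/abs/2410.07593)|**[link](https://github.com/HoinJung/Unified-Debiaisng-VLM-SFID)**|Recent advancements in vision-language models (VLMs) have enabled complex multimodal tasks by processing text and image data simultaneously, significantly enhancing the field of artificial intelligence. However, these models often exhibit biases that can skew outputs towards societal stereotypes, which calls for debiasing strategies. Existing debiasing methods focus narrowly on specific modalities or tasks and require extensive retraining. To address these limitations, this paper introduces Selective Feature Imputation for Debiasing (SFID), a novel methodology that integrates feature pruning and low-confidence imputation (LCI) to effectively reduce biases in VLMs. SFID is versatile, maintains the semantic integrity of outputs, and is cost-effective because it eliminates the need for retraining. Our experimental results demonstrate SFID's effectiveness across various VLM tasks, including zero-shot classification, text-to-image retrieval, image captioning, and text-to-image generation, significantly reducing gender bias without compromising performance. This approach not only enhances the fairness of VLM applications but also preserves their efficiency and utility across diverse scenarios.|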
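|**2024-10-10**|[3D Vision-Language Gaussian Splatting](http://arxiv.org/abs/2410.07577)|null|Recent advancements in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches naively embed semantic representations into 3D reconstruction methods without striking a balance between the visual and language modalities, which leads to unsatisfactory semantic rasterization of translucent or reflective objects, as well as over-fitting on the color modality. To alleviate these limitations, we propose a solution that adequately handles the distinct visual and semantic modalities, i.e., a 3D vision-language Gaussian splatting model for scene understanding, to put emphasis on the representation learning of the language modality. We propose a novel cross-modal rasterizer that uses modality fusion along with a smoothed semantic indicator to enhance semantic rasterization. We also employ a camera-view blending technique to improve semantic consistency between existing and synthesized views, thereby effectively mitigating over-fitting. Extensive experiments demonstrate that our method achieves state-of-the-art performance in open-vocabulary semantic segmentation, surpassing existing methods by a significant margin.|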
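|**2024-10-09**|[The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks](http://arxiv.org/abs/2410.07391)|null|There is increasing interest in tracking the capabilities of general-purpose AI foundation models. This study benchmarks leading large language models and vision-language models against human performance on the Wechsler Adult Intelligence Scale (WAIS-IV), a comprehensive, population-normed assessment of underlying human cognition and intellectual abilities, with a focus on the domains of Verbal Comprehension (VCI), Working Memory (WMI), and Perceptual Reasoning (PRI). Most models demonstrated exceptional capabilities in the storage, retrieval, and manipulation of tokens such as arbitrary sequences of letters and numbers, with performance on the Working Memory Index (WMI) at or above the 99.5th percentile when compared with human population norms. Performance on the Verbal Comprehension Index (VCI), which measures retrieval of acquired information and linguistic understanding of word meanings and their relationships to each other, was also consistently at or above the 98th percentile. Despite these broad strengths, we observed consistently poor performance on the Perceptual Reasoning Index (PRI; range 0.1-10th percentile) from multimodal models, indicating a profound inability to interpret and reason about visual information. Smaller and older model versions consistently performed worse, indicating that advances in training data, parameter count, and fine-tuning are resulting in significant advances in cognitive ability.|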
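|**2024-10-07**|[Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia](http://arxiv.org/abs/2410.05270)|**[link](https://github.com/astra-vision/prolip)**|We consider the problem of adapting a contrastively pre-trained vision-language model such as CLIP (Radford et al., 2021) for few-shot classification. The existing literature addresses this problem by learning a linear classifier on the frozen visual features, optimizing word embeddings, or learning external feature adapters. This paper introduces an alternative way to adapt CLIP without adding "external" parameters to optimize. We find that simply fine-tuning the last projection matrix of the vision encoder yields strong performance compared with the existing baselines. Furthermore, we show that regularizing training with the distance between the fine-tuned and pre-trained matrices adds reliability when adapting CLIP through this layer. Perhaps surprisingly, this approach, coined ProLIP, yields performance on par with or better than the state of the art on 11 few-shot classification benchmarks, few-shot domain generalization, cross-dataset transfer, and test-time adaptation. Code will be made available at https://github.com/astra-vision/ProLIP.|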
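|**2024-10-07**|[TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens](http://arxiv.org/abs/2410.05261)|null|Reading dense text and locating objects within images are fundamental abilities for large vision-language models (LVLMs) that perform advanced tasks. Previous LVLMs, including superior proprietary models such as GPT-4o, have struggled to excel at both tasks simultaneously. Moreover, previous LVLMs with fine-grained perception cost thousands of tokens per image, making them resource-intensive. We present TextHawk2, a bilingual LVLM featuring efficient fine-grained perception and demonstrating state-of-the-art performance across general-purpose, OCR, and grounding tasks with 16 times fewer image tokens. Key improvements include: (1) Token compression: building on the efficient architecture of its predecessor, TextHawk2 reduces the number of tokens per image by 16 times, facilitating training and deployment of the TextHawk series with minimal resources. (2) Visual encoder reinforcement: we enhance the visual encoder through LVLM co-training, unlocking its potential for previously unseen tasks such as Chinese OCR and grounding. (3) Data diversity: we maintain a comparable scale of 100 million samples while diversifying the sources of pre-training data. We assess TextHawk2 across multiple benchmarks, where it consistently delivers superior performance and outperforms closed-source models of similar scale, achieving 78.4% accuracy on OCRBench, 81.4% accuracy on ChartQA, 89.6% ANLS on DocVQA, and 88.1% accuracy@0.5 on RefCOCOg-test.|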
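|**2024-10-07**|[TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models](http://arxiv.org/abs/2410.05239)|**[link](https://github.com/naamiinepal/tunevlseg)**|Vision-language models (VLMs) have shown impressive performance on vision tasks, but adapting them to new domains often requires expensive fine-tuning. Prompt tuning techniques, including textual, visual, and multimodal prompting, offer efficient alternatives by leveraging learnable prompts. However, their application to vision-language segmentation models (VLSMs) and their evaluation under significant domain shifts remain unexplored. This work presents TuneVLSeg, an open-source benchmarking framework that integrates various unimodal and multimodal prompt tuning techniques into VLSMs, making prompt tuning usable for downstream segmentation datasets with any number of classes. TuneVLSeg includes 6 prompt tuning strategies at various prompt depths used in 2 VLSMs, totaling 8 different combinations. We test various prompt tuning on 8 diverse medical datasets, including 3 radiology datasets (breast tumor, echocardiograph, chest X-ray pathologies) and 5 non-radiology datasets (polyp, ulcer, skin cancer), and two natural-domain segmentation datasets. Our study found that textual prompt tuning struggles under significant domain shifts, from natural-domain images to medical data. Furthermore, visual prompt tuning, which has fewer hyperparameters than multimodal prompt tuning, often achieves performance competitive with multimodal approaches, making it a valuable first attempt. Our work advances the understanding and applicability of different prompt-tuning techniques for robust domain-specific segmentation. The source code is available at https://github.com/naamiinepal/tunevlseg.|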
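|**2024-10-07**|[LADEV: A Language-Driven Testing and Evaluation Platform for Vision-Language-Action Models in Robotic Manipulation](http://arxiv.org/abs/2410.05191)|null|Building on the advancements of large language models (LLMs) and vision-language models (VLMs), recent research has introduced vision-language-action (VLA) models as an integrated solution for robotic manipulation tasks. These models take camera images and natural-language task instructions as input and directly generate control actions for robots to perform the specified tasks, greatly improving both decision-making capabilities and interaction with human users. However, the data-driven nature of VLA models, combined with their lack of interpretability, makes assuring their effectiveness and robustness a challenging task. This highlights the need for a reliable testing and evaluation platform. To this end, in this work we propose LADEV, a comprehensive and efficient platform specifically designed for evaluating VLA models. We first present a language-driven approach that automatically generates simulation environments from natural-language inputs, reducing the need for manual adjustments and significantly improving testing efficiency. Then, to further assess the influence of language input on VLA models, we implement a paraphrase mechanism that produces diverse natural-language task instructions for testing. Finally, to expedite the evaluation process, we introduce a batch-style method for conducting large-scale testing of VLA models. Using LADEV, we conducted experiments on several state-of-the-art VLA models, demonstrating its effectiveness as a tool for evaluating these models. Our results show that LADEV not only improves testing efficiency but also establishes a solid foundation for evaluating VLA models, paving the way for the development of more intelligent and advanced robotic systems.|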
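|**2024-10-07**|[HE-Drive: Human-Like End-to-End Driving with Vision Language Models](http://arxiv.org/abs/2410.05051)|null|In this paper, we propose HE-Drive: the first human-like-centric end-to-end autonomous driving system to generate trajectories that are both temporally consistent and comfortable. Recent studies have shown that imitation-learning-based planners and learning-based trajectory scorers can effectively generate and select accurate trajectories that closely mimic expert demonstrations. However, such trajectory planners and scorers face the dilemma of generating temporally inconsistent and uncomfortable trajectories. To solve these problems, HE-Drive first extracts key 3D spatial representations through sparse perception, which then serve as conditional inputs for a motion planner based on conditional denoising diffusion probabilistic models (DDPMs) to generate temporally consistent multi-modal trajectories. A trajectory scorer guided by vision-language models (VLMs) subsequently selects the most comfortable trajectory from these candidates to control the vehicle, ensuring human-like end-to-end driving. Experiments show that HE-Drive not only achieves state-of-the-art performance (i.e., reducing the average collision rate by 71% compared with VAD) and efficiency (i.e., 1.9x faster than SparseDrive) on the challenging nuScenes and OpenScene datasets, but also provides the most comfortable driving experience on real-world data. For more information, visit the project website: https://jmwang0117.github.io/HE-Drive/.|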
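|**2024-10-07**|[Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models](http://arxiv.org/abs/2410.04884)|null|Vision-language pre-training (VLP) models have achieved remarkable success in various domains, yet they remain vulnerable to adversarial attacks. Addressing these adversarial vulnerabilities is crucial for enhancing security in multimodal learning. Traditionally, adversarial methods targeting VLP models perturb images and text simultaneously. However, this approach faces notable challenges: first, adversarial perturbations often fail to translate effectively into real-world scenarios; second, direct modifications to the text are conspicuous. To overcome these limitations, we propose a novel strategy that uses only image patches for attacks, thus preserving the integrity of the original text. Our method leverages prior knowledge from diffusion models to enhance the authenticity and naturalness of the perturbations. Moreover, to optimize patch placement and improve the efficacy of our attacks, we use the cross-attention mechanism, which encapsulates inter-modal interactions by generating attention maps to guide strategic patch placement. Comprehensive experiments conducted in a white-box setting for image-to-text scenarios reveal that our proposed method significantly outperforms existing techniques, achieving a 100% attack success rate. Additionally, it demonstrates commendable performance in transfer tasks involving text-to-image configurations.|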
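|**2024-10-05**|[TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions](http://arxiv.org/abs/2410.04107)|**[link](https://github.com/nlpcode/tubench)**|Large vision-language models (LVLMs) have achieved remarkable progress in visual perception and linguistic interpretation. Despite their impressive capabilities across various tasks, LVLMs still suffer from hallucination, which involves generating content that is incorrect or unfaithful to the visual or textual inputs. Traditional benchmarks, such as MME and POPE, evaluate hallucination in LVLMs within the scope of visual question answering (VQA) using answerable questions. However, some questions are unanswerable because the image contains insufficient information, and the performance of LVLMs on such unanswerable questions remains underexplored. To bridge this research gap, we propose TUBench, a benchmark specifically designed to evaluate the reliability of LVLMs using unanswerable questions. TUBench comprises an extensive collection of high-quality unanswerable questions that are meticulously crafted using ten distinct strategies. To thoroughly evaluate LVLMs, the unanswerable questions in TUBench are based on images from four diverse domains as visual contexts: screenshots of code snippets, natural images, geometry diagrams, and screenshots of statistical tables. These unanswerable questions are tailored to test LVLMs' trustworthiness in code reasoning, commonsense reasoning, geometric reasoning, and mathematical reasoning related to tables, respectively. We conducted a comprehensive quantitative evaluation of 28 leading foundational models on TUBench, with Gemini-1.5-Pro, the top-performing model, achieving an average accuracy of 69.2% in determining whether questions are answerable, and GPT-4o, the third-ranked model, achieving 66.7%. TUBench is available at https://github.com/NLPCode/TUBench.|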
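|**2024-10-05**|[Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks](http://arxiv.org/abs/2410.04055)|**[link](https://github.com/ivy3h/SCL)**|While vision-language models (VLMs) have shown remarkable abilities in visual and language reasoning tasks, they invariably generate flawed responses. Self-correction, which instructs models to refine their outputs, presents a promising solution to this issue. Previous studies have mainly concentrated on large language models (LLMs), while the self-correction abilities of VLMs, particularly concerning both visual and linguistic information, remain largely unexamined. This study investigates the self-correction capabilities of VLMs during both inference and fine-tuning stages. We introduce a Self-Correction Learning (SCL) approach that enables VLMs to learn from their self-generated self-correction data through Direct Preference Optimization (DPO) without relying on external feedback, facilitating self-improvement. Specifically, we collect preferred and disfavored samples based on the correctness of the initial and refined responses, which are obtained by two-turn self-correction with VLMs during the inference stage. Experimental results demonstrate that although VLMs struggle to self-correct effectively during iterative inference without additional fine-tuning and external feedback, they can enhance their performance and avoid previous mistakes through preference fine-tuning when their self-generated self-correction data are categorized into preferred and disfavored samples. This study emphasizes that self-correction is not merely a refinement process; rather, it should enhance the reasoning abilities of models through additional training, enabling them to generate high-quality responses directly without further refinement.|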
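|**2024-10-05**|[Gamified crowd-sourcing of high-quality data for visual fine-tuning](http://arxiv.org/abs/2410.04038)|null|This paper introduces Gamified Adversarial Prompting (GAP), a framework that crowd-sources high-quality data for visual instruction tuning of large multimodal models. GAP transforms the data collection process into an engaging game, incentivizing players to provide fine-grained, challenging questions and answers that target gaps in the model's knowledge. Our contributions include (1) an approach to capture question-answer pairs from humans that directly address weaknesses in a model's knowledge, (2) a method for evaluating and rewarding players that successfully incentivizes them to provide high-quality submissions, and (3) a scalable, gamified platform that succeeds in collecting this data from over 50,000 participants in just a few weeks. Our implementation of GAP has significantly improved the accuracy of a small multimodal model, MiniCPM-Llama3-V-2.5-8B, increasing its GPT score from 0.147 to 0.477 on our dataset, approaching the benchmark set by the much larger GPT-4V. Moreover, we demonstrate that the data generated using MiniCPM-Llama3-V-2.5-8B also enhances its performance across other benchmarks and exhibits cross-model benefits. Specifically, the same data improves the performance of QWEN2-VL-2B and QWEN2-VL-7B on the same multiple benchmarks.|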
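|**2024-10-04**|[Model Developmental Safety: A Safety-Centric Method and Applications in Vision-Language Models](http://arxiv.org/abs/2410.03955)|**[link](https://github.com/ganglii/devsafety)**|In the real world, a learning-enabled system usually undergoes multiple cycles of model development to enhance the system's ability to handle difficult or emerging tasks. This continual model development process raises a significant issue: model development for acquiring new capabilities or improving existing ones may inadvertently lose capabilities of the old model, also known as catastrophic forgetting. Existing continual learning studies focus on mitigating catastrophic forgetting by trading off performance on previous tasks and new tasks to ensure good average performance. However, they are inadequate for many applications, especially in safety-critical domains, because failure to strictly preserve the performance of the old model not only introduces safety risks and uncertainties but also imposes substantial expenses in re-improving and re-validating existing properties. To address this issue, we introduce model developmental safety as a guarantee of a learning system such that, in the model development process, the new model should strictly preserve the existing protected capabilities of the old model while improving its performance on target tasks. To ensure model developmental safety, we present a safety-centric framework that formulates model developmental safety as data-dependent constraints. Under this framework, we study how to develop a pretrained vision-language model (also known as the CLIP model) to acquire new capabilities or improve existing image classification capabilities. We propose an efficient constrained optimization algorithm with theoretical guarantees and use its insights to fine-tune a CLIP model with task-dependent heads to promote model developmental safety. Our experiments on improving vision perception capabilities on autonomous driving and scene recognition datasets demonstrate the efficacy of the proposed approach.|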
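|**2024-10-04**|[Generalizable Prompt Tuning for Vision-Language Models](http://arxiv.org/abs/2410.03189)|null|Prompt tuning for vision-language models such as CLIP involves optimizing the text prompts used to generate image-text pairs for specific downstream tasks. While hand-crafted or template-based prompts are generally applicable to a wider range of unseen classes, they tend to perform poorly in downstream tasks (i.e., seen classes). Learnable soft prompts, on the other hand, often perform well in downstream tasks but lack generalizability. Additionally, prior research has predominantly concentrated on the textual modality, with very few studies attempting to explore the prompt's generalization potential from the visual modality. Keeping these limitations in mind, we investigate how to perform prompt tuning to obtain both competitive downstream performance and generalization. The study shows that by treating soft and hand-crafted prompts as dual views of the textual modality and maximizing their mutual information, we can better ensemble task-specific and general semantic information. Moreover, to generate more expressive prompts, the study introduces class-wise augmentation from the visual modality, resulting in significant robustness to a wider range of unseen classes. Extensive evaluations on several benchmarks show that the proposed approach achieves competitive results in terms of both task-specific performance and general abilities.|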
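|**2024-10-04**|[Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models](http://arxiv.org/abs/2410.03176)|**[link](https://github.com/yufang-liu/clip_hallucination)**|Large vision-language models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations in these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into object hallucination in CLIP models, which serve as the backbone for many state-of-the-art vision-language systems. We reveal that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between the vision and language modalities. To address this, we propose a counterfactual data augmentation method that creates negative samples with a variety of hallucination issues. We demonstrate that our method can effectively mitigate object hallucinations in the CLIP model, and we show that the enhanced model can be employed as a visual encoder, effectively alleviating object hallucination in LVLMs.|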
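|**2024-10-04**|[AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark](http://arxiv.org/abs/2410.03051)|null|Video detailed captioning is a key task that aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner based on a large multimodal model. We follow the simplest architecture design without additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we implement a token merging strategy that reduces the number of input visual tokens. Surprisingly, we found that this strategy results in little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2). However, existing video caption benchmarks only include simple descriptions consisting of a few dozen words, which limits research in this field. Therefore, we develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. In addition, we propose a new LLM-assisted metric, VDCscore, for better evaluation, which adopts a divide-and-conquer strategy to transform long caption evaluation into multiple short question-answer pairs. With the help of human Elo ranking, our experiments show that this benchmark correlates better with human judgments of video detailed captioning quality.|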
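|**2024-10-03**|[CPFD: Confidence-aware Privileged Feature Distillation for Short Video Classification](http://arxiv.org/abs/2410.03038)|null|Dense features, customized for different business scenarios, are essential in short video classification. However, their complexity, specific adaptation requirements, and high computational costs make them resource-intensive and less accessible during online inference. Consequently, these dense features are categorized as "Privileged Dense Features". Meanwhile, end-to-end multimodal models have shown promising results in numerous computer vision tasks. In industrial applications, prioritizing end-to-end multimodal features can improve efficiency but often leads to the loss of valuable information from historical privileged dense features. To integrate both kinds of features while maintaining efficiency and manageable resource costs, we present Confidence-aware Privileged Feature Distillation (CPFD), which empowers the features of an end-to-end multimodal model by adaptively distilling privileged features during training. Existing privileged feature distillation (PFD) methods apply uniform weights to all instances during distillation, which can cause unstable performance across different business scenarios and a notable performance gap between the teacher model (the dense-feature-enhanced multimodal model DF-X-VLM) and the student model (the multimodal-only model X-VLM). In contrast, CPFD leverages confidence scores derived from the teacher model to adaptively mitigate the performance variance of the student model. We conducted extensive offline experiments on five diverse tasks, showing that CPFD improves the video classification F1 score by 6.76% compared with the end-to-end multimodal model (X-VLM) and by 2.31% on average compared with vanilla PFD. It reduces the performance gap by 84.6% and achieves results comparable to the teacher model DF-X-VLM. The effectiveness of CPFD is further substantiated by online experiments, and our framework has been deployed in production systems for over a dozen models.|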
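|**2024-10-03**|[MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection](http://arxiv.org/abs/2410.03010)|null|Multimodal learning seeks to combine data from multiple input sources to enhance the performance of different downstream tasks. In real-world scenarios, performance can degrade substantially if some input modalities are missing. Existing methods that can handle missing modalities involve custom training or adaptation steps for each input modality combination. These approaches are either tied to specific modalities or become computationally expensive as the number of input modalities increases. In this paper, we propose Masked Modality Projection (MMP), a method designed to train a single model that is robust to any missing-modality scenario. We achieve this by randomly masking a subset of modalities during training and learning to project the available input modalities to estimate the tokens for the masked modalities. This approach enables the model to effectively learn to leverage the information from the available modalities to compensate for the missing ones, enhancing missing-modality robustness. We conduct a series of experiments with various baseline models and datasets to assess the effectiveness of this strategy. Experiments demonstrate that our approach improves robustness to different missing-modality scenarios, outperforming existing methods designed for missing modalities or specific modality combinations.|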
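|**2024-10-03**|[Real-World Cooking Robot System from Recipes Based on Food State Recognition Using Foundation Models and PDDL](http://arxiv.org/abs/2410.02874)|null|Although there is a growing demand for cooking behaviors from robots, a series of cooking behaviors performed by a robot in the real world based on a new recipe description has not yet been realized. In this study, we propose a robot system that integrates real-world executable robot cooking behavior planning, using a large language model (LLM) and classical planning with PDDL descriptions, with food ingredient state recognition learned from a small amount of data using a vision-language model (VLM). We succeeded in experiments in which PR2, a dual-armed wheeled robot, performed cooking from arranged new recipes in a real-world environment, and confirmed the effectiveness of the proposed system.|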
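|**2024-10-03**|[Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos](http://arxiv.org/abs/2410.02763)|null|There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs struggle to distinguish temporal differences between different actions and object transformations. For example, the best model, GPT-4o, only obtains about 50% on our text and video scores, showing a large gap compared with the human baseline of about 90%. All open-source multimodal models and CLIP-based models perform worse, producing mostly random chance performance. Through this work, we shed light on the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at https://vinoground.github.io.|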
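|**2024-10-03**|[Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations](http://arxiv.org/abs/2410.02762)|**[link](https://github.com/nickjiang2378/vl-interp)**|We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs' internal image representations onto their language vocabulary and observe more confident output probabilities on real objects than on hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.|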
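|**2024-10-03**|[Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models](http://arxiv.org/abs/2410.02740)|null|Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. Using Short Synthetic Captions (SSC) and Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.|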
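|**2024-10-03**|[DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects](http://arxiv.org/abs/2410.02730)|**[link](https://github.com/zhaowei-wang-nlp/divscene)**|Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects in a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a large vision-language model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce chain-of-thought (CoT) explanation traces of the action prediction for better performance when tuning LVLMs. Our extensive experiments find that we can build a well-performing LVLM-based agent through imitation learning on the shortest paths constructed by a BFS planner without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. Meanwhile, we carry out various analyses showing the generalization ability of our agent.|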
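|**2024-10-03**|[Video Instruction Tuning With Synthetic Data](http://arxiv.org/abs/2410.02713)|null|The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach: creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question answering (QA), and multiple-choice QA. By training on this dataset in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LLM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.|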
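|**2024-10-03**|[LLaVA-Critic: Learning to Evaluate Multimodal Models](http://arxiv.org/abs/2410.02712)|null|We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.|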
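|**2024-10-03**|[Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models](http://arxiv.org/abs/2410.02681)|null|Confidence calibration is critical for the safe deployment of machine learning models in the real world. However, this issue has not been fully addressed for vision-language models such as CLIP, particularly after fine-tuning. In this work, we demonstrate that existing prompt tuning methods usually lead to a trade-off in calibration between base and new classes: the cross-entropy loss in CoOp causes overconfidence on new classes by increasing textual label divergence, whereas the regularization of KgCoOp maintains the confidence level but results in underconfidence on base classes due to the improved accuracy. Inspired by these observations, we introduce Dynamic Outlier Regularization (DOR) to ensure confidence calibration on both base and new classes after fine-tuning. In particular, we propose to minimize the feature deviation of novel textual labels (instead of base classes) sampled from a large vocabulary. In effect, DOR prevents the increase in textual divergence for new labels while easing restrictions on base classes. Extensive experiments demonstrate that DOR can enhance the calibration performance of current fine-tuning methods on base and new classes.|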
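|**2024-10-03**|[Guiding Long-Horizon Task and Motion Planning with Vision Language Models](http://arxiv.org/abs/2410.02193)|null|Vision-language models (VLMs) can generate plausible high-level plans when prompted with a goal, the context, an image of the scene, and any planning constraints. However, there is no guarantee that the predicted actions are geometrically and kinematically feasible for a particular robot embodiment. As a result, many prerequisite steps, such as opening drawers to access objects, are often omitted in their plans. Robot task and motion planners can generate motion trajectories that respect the geometric feasibility of actions and insert physically necessary actions, but they do not scale to everyday problems that require commonsense knowledge and involve large state spaces comprising many variables. We propose VLM-TAMP, a hierarchical planning algorithm that leverages a VLM to generate both semantically meaningful and horizon-reducing intermediate subgoals that guide a task and motion planner. When a subgoal or action cannot be refined, the VLM is queried again for replanning. We evaluate VLM-TAMP on kitchen tasks where a robot must accomplish cooking goals that require performing 30-50 actions in sequence and interacting with up to 21 objects. VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences, both in terms of success rates (50% to 100% versus 0%) and average task completion percentage (72% to 100% versus 15% to 45%). See the project site https://zt-yang.github.io/vlm-tamp-robot/ for more information.|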
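|**2024-10-02**|[Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations](http://arxiv.org/abs/2410.02086)|null|Multimodal learning plays a crucial role in enabling machine learning models to fuse and use diverse data sources, such as text, images, and audio, to support a variety of downstream tasks. A unified representation across modalities is particularly important for improving efficiency and performance. Recent binding methods, such as ImageBind (Girdhar et al., 2023), typically use a fixed anchor modality to align multimodal data in the anchor modality's embedding space. In this paper, we mathematically analyze fixed-anchor binding methods and uncover notable limitations: (1) over-reliance on the choice of the anchor modality, (2) failure to capture intra-modal information, and (3) failure to account for inter-modal correlation among non-anchored modalities. To address these limitations, we propose CentroBind, a simple yet powerful approach that eliminates the need for a fixed anchor; instead, it employs dynamically adjustable centroid-based anchors generated from all available modalities, resulting in a balanced and rich representation space. We theoretically demonstrate that our method captures three crucial properties of multimodal learning: intra-modal learning, inter-modal learning, and multimodal alignment, while also constructing a robust unified representation across all modalities. Our experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed method, showing that dynamic anchor methods outperform all fixed-anchor binding methods because the former capture more nuanced multimodal interactions.|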
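|**2024-10-02**|[Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning](http://arxiv.org/abs/2410.02052)|null|Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents, e.g., those powered by GPT-4o, to explore the decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent's performance by fine-tuning GPT-4o through self-learning, using R-MCTS-generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared with the previous state of the art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97% of R-MCTS's performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model can explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success. Moreover, our work demonstrates the compute scaling properties in both training (data collection with R-MCTS) and testing time. These results suggest a promising research direction for enhancing VLMs' reasoning and planning capabilities for agentic applications via test-time search and self-learning.|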
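|**2024-09-30**|[HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding](http://arxiv.org/abs/2409.20429)|**[link](https://github.com/F-Yuan303/HELPD)**|Large vision-language models (LVLMs) have shown remarkable performance on many visual-language tasks. However, these models still suffer from multimodal hallucination, meaning the generation of objects or content that contradicts the image. Many existing works detect hallucination by directly judging whether an object exists in an image, overlooking the association between the object and the semantics. To address this issue, we propose Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD). This framework incorporates hallucination feedback at both the object and sentence semantic levels. Remarkably, even with a marginal degree of training, this approach can reduce hallucination by over 15%. Simultaneously, HELPD penalizes the output logits according to the image attention window to avoid being overly affected by generated text. HELPD can be seamlessly integrated with any LVLM. Our experiments demonstrate that the proposed framework yields favorable results across multiple hallucination benchmarks. It effectively mitigates hallucination for different LVLMs and concurrently improves their text generation quality.|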
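|**2024-09-30**|[CableInspect-AD: An Expert-Annotated Anomaly Detection Dataset](http://arxiv.org/abs/2409.20353)|**[link](https://github.com/mila-iqia/cableinspect-ad-code)**|Machine learning models are increasingly being deployed in real-world contexts. However, systematic studies on their transferability to specific and critical applications are underrepresented in the research literature. An important example is visual anomaly detection (VAD) for robotic power line inspection. While existing VAD methods perform well in controlled environments, real-world scenarios present diverse and unexpected anomalies that current datasets fail to capture. To address this gap, we introduce *CableInspect-AD*, a high-quality, publicly available dataset created and annotated by domain experts from Hydro-Québec, a Canadian public utility. This dataset includes high-resolution images with challenging real-world anomalies, covering defects with varying severity levels. To address the challenges of collecting diverse anomalous and nominal examples for setting a detection threshold, we propose an enhancement to the celebrated PatchCore algorithm. This enhancement enables its use in scenarios with limited labeled data. We also present a comprehensive evaluation protocol based on cross-validation to assess models' performance. We evaluate our *Enhanced-PatchCore* for few-shot and many-shot detection, and vision-language models for zero-shot detection. While promising, these models struggle to detect all anomalies, highlighting the dataset's value as a challenging benchmark for the broader research community. Project page: https://mila-iqia.github.io/cableinspect-ad/.|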
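|**2024-09-30**|[Visual Context Window Extension: A New Perspective for Long Video Understanding](http://arxiv.org/abs/2409.20018)|null|Large multimodal models (LMMs) have demonstrated impressive performance on short video understanding tasks but face great challenges when applied to long video understanding. In contrast, large language models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training. However, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between the visual and language modalities lead to different context windows for visual and language tokens, making it difficult to directly extend the visual tokens to match the language context window. Based on this, we propose to adapt LMMs for long video understanding tasks by extending the visual context window, eliminating the need for retraining on large-scale long video datasets. To further mitigate the significant memory consumption caused by long sequences, we introduce a progressive pooling inference strategy that selectively adjusts the spatial resolution of frame embeddings, reducing the number of visual tokens while retaining important spatial information. Across multiple long video understanding benchmarks, our method consistently improves performance as the number of video frames increases. On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B. Additionally, in the 256-frame setting, our method reduces memory usage by approximately 45% compared with the baseline, without introducing any performance loss.|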
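|**2024-09-30**|[Towards Robust Multimodal Sentiment Analysis with Incomplete Data](http://arxiv.org/abs/2409.20012)|**[link](https://github.com/haoyu-ha/lnln)**|The field of multimodal sentiment analysis (MSA) has recently seen an emerging direction that seeks to tackle the issue of data incompleteness. Recognizing that the language modality typically contains dense sentiment information, we consider it the dominant modality and present an innovative Language-dominated Noise-resistant Learning Network (LNLN) to achieve robust MSA. The proposed LNLN features a dominant modality correction (DMC) module and a dominant-modality-based multimodal learning (DMML) module, which enhance the model's robustness across various noise scenarios by ensuring the quality of dominant modality representations. Aside from the methodological design, we perform comprehensive experiments under random data missing scenarios, using diverse and meaningful settings on several popular datasets (e.g., MOSI, MOSEI, and SIMS), providing additional uniformity, transparency, and fairness compared with existing evaluations in the literature. Empirically, LNLN consistently outperforms existing baselines, demonstrating superior performance across these challenging and extensive evaluation metrics.|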
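|**2024-09-30**|[Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels](http://arxiv.org/abs/2409.19846)|null|Large-scale vision-language models such as CLIP have demonstrated impressive open-vocabulary capabilities for image-level tasks, excelling at recognizing what objects are present. However, they struggle with pixel-level recognition tasks such as semantic segmentation, which additionally require understanding where the objects are located. In this work, we propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding by guiding the model on where objects are, using unlabeled images and masks generated from vision foundation models such as SAM and DINO. To address the challenge of leveraging masks without semantic labels, we devise an online clustering algorithm that uses learnable class names to acquire general semantic concepts. PixelCLIP shows significant performance improvements over CLIP and competitive results compared with caption-supervised methods in open-vocabulary semantic segmentation. Project page: https://cvlab-kaist.github.io/PixelCLIP|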
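|**2024-09-29**|[PALM: Few-Shot Prompt Learning for Audio Language Models](http://arxiv.org/abs/2409.19806)|null|Audio-language models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks. Inspired by advancements in vision-language models (VLMs), they match features from audio waveforms with class-specific text prompt features. Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is on par with or outperforms other approaches while being computationally less demanding. Code is available at https://asif-hanif.github.io/palm/.|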
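|**2024-09-29**|[Vision-Language Models are Strong Noisy Label Detectors](http://arxiv.org/abs/2409.19696)|**[link](https://github.com/HotanLee/DeFT)**|Recent research on fine-tuning vision-language models has demonstrated impressive performance on various downstream tasks. However, the challenge of obtaining accurately labeled data in real-world applications poses a significant obstacle during fine-tuning. To address this challenge, this paper presents a denoising fine-tuning framework, called DeFT, for adapting vision-language models. DeFT uses the robust alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs to sieve out noisy labels. The proposed framework establishes a noisy label detector by learning positive and negative textual prompts for each class. The positive prompt seeks to reveal distinctive features of the class, while the negative prompt serves as a learnable threshold for separating clean and noisy samples. We employ parameter-efficient fine-tuning to adapt the pre-trained visual encoder and promote its alignment with the learned textual prompts. As a general framework, DeFT can seamlessly fine-tune many pre-trained models to downstream tasks by using carefully selected clean samples. Experimental results on seven synthetic and real-world noisy datasets validate the effectiveness of DeFT in both noisy label detection and image classification.|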
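|**2024-09-29**|[MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation](http://arxiv.org/abs/2409.19684)|**[link](https://github.com/MedHK23/MedViLaM)**|Medicine is inherently multimodal and multitask, with diverse data modalities spanning text, imaging, and more. However, most current models in the medical field are unimodal and single-task, and lack good generalizability and explainability. In this study, we introduce MedViLaM, a general-purpose vision-language model for medical data that can flexibly encode and interpret various forms of medical data, including clinical language and imaging, all using the same set of model weights. To facilitate the creation of such a multi-task model, we have curated MultiMedBench, a comprehensive pretraining dataset and benchmark consisting of several distinct tasks, i.e., continuous question answering, multi-label disease classification, disease localization, and the generation and summarization of radiology reports. MedViLaM demonstrates strong performance across all MultiMedBench tasks, frequently outpacing other generalist models by a significant margin. Additionally, we present instances of zero-shot generalization to new medical concepts and tasks, effective transfer learning across different tasks, and the emergence of zero-shot medical reasoning.|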
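|**2024-09-29**|[Federated Learning from Vision-Language Foundation Models: Theoretical Analysis and Method](http://arxiv.org/abs/2409.19610)|**[link](https://github.com/PanBikang/PromptFolio)**|Integrating pretrained vision-language foundation models such as CLIP into federated learning has attracted significant attention for enhancing generalization across diverse tasks. Typically, federated learning of vision-language models employs prompt learning to reduce communication and computational costs, i.e., prompt-based federated learning. However, there is limited theoretical analysis of the performance of prompt-based federated learning. In this work, we construct a theoretical analysis framework for prompt-based federated learning via feature learning theory. Specifically, we monitor the evolution of signal learning and noise memorization in prompt-based federated learning, demonstrating that performance can be assessed by the ratio of task-relevant to task-irrelevant coefficients. Furthermore, we draw an analogy between income and risk in portfolio optimization and the task-relevant and task-irrelevant terms in feature learning. Inspired by portfolio optimization theory, in which combining two independent assets maintains the income while reducing the risk, we introduce two prompts, a global prompt and a local prompt, to construct a prompt portfolio that balances generalization and personalization. Consequently, we show the performance advantage of the prompt portfolio and derive the optimal mixing coefficient. These theoretical claims are further supported by empirical experiments.|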
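|**2024-09-28**|[FairPIVARA: Reducing and Assessing Biases in CLIP-Based Multimodal Models](http://arxiv.org/abs/2409.19474)|**[link](https://github.com/hiaac-nlp/fairpivara)**|Despite significant advancements and pervasive use of vision-language models, few studies have addressed their ethical implications. These models typically require extensive training data, often drawn from hastily reviewed text and image datasets, leading to highly imbalanced datasets and ethical concerns. Additionally, models initially trained in English are frequently fine-tuned for other languages, as with the CLIP model, which can be extended with more data to enhance its capabilities but can also add new biases. CAPIVARA, a CLIP-based model adapted to Portuguese, has shown strong performance in zero-shot tasks. In this paper, we evaluate four different types of discriminatory practices within vision-language models and introduce FairPIVARA, a method to reduce them by removing the most affected dimensions of feature embeddings. Applying FairPIVARA reduced the observed biases by up to 98% while promoting a more balanced word distribution within the model. Our model and code are available at: https://github.com/hiaac-nlp/FairPIVARA.|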
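|**2024-09-27**|[Image-guided topic modeling for interpretable privacy classification](http://arxiv.org/abs/2409.18674)|**[link](https://github.com/idiap/itm)**|Predicting and explaining the private information contained in an image in human-understandable terms is a complex and context-dependent task. This task is challenging even for large language models. To facilitate the understanding of privacy decisions, we propose to predict image privacy based on a set of natural-language content descriptors. These content descriptors are associated with privacy scores that reflect how people perceive image content. We generate the descriptors with our novel Image-guided Topic Modeling (ITM) approach. ITM leverages, via multimodal alignment, both vision information and image textual descriptions from a vision-language model. We use the ITM-generated descriptors to learn a privacy predictor, Priv×ITM, whose decisions are interpretable by design. Our Priv×ITM classifier outperforms the reference interpretable method by 5 percentage points in accuracy and performs comparably to the current non-interpretable state-of-the-art model.|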
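|**2024-09-26**|[LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness](http://arxiv.org/abs/2409.18125)|null|Recent advancements in large multimodal models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising its 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, the 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains 2D image understanding and vision-language conversation capabilities comparable to LLaVA.|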
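|**2024-09-26**|[EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions](http://arxiv.org/abs/2409.18042)|null|GPT-4o, a multimodal model that enables spoken conversations with diverse emotions and tones, marks a milestone for multimodal foundation models. However, empowering large language models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still suffer from limited or even absent vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips large language models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice, surprisingly, that multimodal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, we propose a lightweight style module for flexible control of speech style (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks, while supporting multimodal spoken dialogue with vivid emotions.|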
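|**2024-09-26**|[DARE: Diverse Visual Question Answering with Robustness Evaluation](http://arxiv.org/abs/2409.18023)|null|Vision-language models (VLMs) extend the remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. Moreover, while they might be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). To couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse categories and includes four robustness-oriented evaluations based on variations of: prompts, the subsets of answer options, the output format, and the number of correct answers. Among a spectrum of other findings, we report that state-of-the-art VLMs still struggle with questions in most categories and are unable to consistently deliver their peak performance across the tested robustness evaluations. The worst-case performance across the subsets of options is 34% below the performance in the standard case. The robustness of open-source VLMs such as LLaVA 1.6 and Idefics2 cannot match that of closed-source models such as GPT-4 and Gemini, but even the latter remain very brittle to different variations.|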
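|**2024-09-26**|[The Hard Positive Truth about Vision-Language Compositionality](http://arxiv.org/abs/2409.17958)|**[link](https://github.com/amitakamath/hard_positives)**|Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by fine-tuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have in fact been significantly overstated, because existing benchmarks do not probe whether fine-tuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP's performance by 12.9%, while humans perform effortlessly at 99% accuracy. CLIP fine-tuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a 1,775,259 image-text training set with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating a more robust improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP's understanding of semantic relationships between related "positive" concepts.|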
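|**2024-09-26**|[A Multimodal Single-Branch Embedding Network for Recommendation in Cold-Start and Missing Modality Scenarios](http://arxiv.org/abs/2409.17864)|**[link](https://github.com/hcai-mms/sibrar---single-branch-recommender)**|Most recommender systems adopt collaborative filtering (CF) and provide recommendations based on past collective interactions. Therefore, the performance of CF algorithms degrades when few or no interactions are available, a scenario referred to as cold-start. To address this issue, previous work relies on models that leverage both collaborative data and side information on the users or items. Similar to multimodal learning, these models aim to combine collaborative and content representations in a shared embedding space. In this work, we propose a novel multimodal recommendation technique that relies on a multimodal Single-Branch embedding network for Recommendation (SiBraR). Leveraging weight-sharing, SiBraR encodes interaction data as well as multimodal side information using the same single-branch embedding network on different modalities. This makes SiBraR effective in scenarios of missing modality, including cold start. Our extensive experiments on large-scale recommendation datasets from three different recommendation domains (music, movie, and e-commerce) and providing multimodal content information (audio, text, image, labels, and interactions) show that SiBraR significantly outperforms CF as well as state-of-the-art content-based recommender systems in cold-start scenarios, and is competitive in warm-start scenarios. We show that SiBraR's recommendations are accurate in missing-modality scenarios, and that the model is able to map different modalities to the same region of the shared embedding space, hence reducing the modality gap.|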
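|**2024-09-26**|[Cascade Prompt Learning for Vision-Language Model Adaptation](http://arxiv.org/abs/2409.17805)|**[link](https://github.com/megvii-research/caspl)**|Prompt learning has emerged as an effective way to improve the performance of vision-language models (VLMs) such as CLIP on downstream tasks. However, current learnable prompt tokens mainly serve a single phase of adapting to the task (i.e., adapting prompts), which easily leads to overfitting. In this work, we propose a novel Cascade Prompt Learning (CasPL) framework that lets prompt learning serve both generic and specific expertise (i.e., boosting and adapting prompts) at the same time. Specifically, CasPL is a new learning paradigm comprising two distinct phases of learnable prompts. The first, boosting prompts, extract domain-general knowledge from a senior, larger CLIP teacher model by aligning their predicted logits using a large number of unlabeled domain images. The second, adapting prompts, are then cascaded with the frozen first set to fine-tune on downstream tasks, following the approaches used in prior research. In this way, CasPL captures both domain-general and task-specific representations in explicitly different, progressive groups of prompts, potentially alleviating overfitting in the target domain. Notably, CasPL is a plug-and-play module that can be seamlessly integrated into any existing prompt learning approach. CasPL achieves a significantly better balance between performance and inference speed, which is especially beneficial for deploying smaller VLMs in resource-constrained environments. Compared with the previous state-of-the-art method PromptSRC, CasPL improves on average by 1.85% for base classes, 3.44% for novel classes, and 2.72% for the harmonic mean over 11 image classification datasets. Code is publicly available at https://github.com/megvii-research/CasPL.||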
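|**2024-09-26**|[Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification](http://arxiv.org/abs/2409.17777)|**[link](https://github.com/RaghavSinghal10/M3CoL)**|Deep multimodal learning has shown remarkable results by using contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations that go beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach for capturing the nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities, thereby capturing the shared relations between them. For multimodal classification tasks, we introduce a framework that combines a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we show that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the importance of learning shared relations for robust multimodal learning and opens promising avenues for future research.||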
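|**2024-09-26**|[Robotic-CLIP: Fine-tuning CLIP on Action Data for Robotic Applications](http://arxiv.org/abs/2409.17727)|null|Vision-language models have played a key role in extracting meaningful features for a wide range of robotic applications. Among them, Contrastive Language-Image Pretraining (CLIP) is widely used in robotic tasks that require both vision and natural language understanding. However, CLIP was trained only on static images paired with text prompts and has not been fully adapted to robotic tasks involving dynamic actions. In this paper, we introduce Robotic-CLIP to enhance robotic perception. We first collect and label large-scale action data, and then build Robotic-CLIP by fine-tuning CLIP with contrastive learning on action data from 309,433 videos (about 7.4 million frames). By leveraging action data, Robotic-CLIP inherits CLIP's strong image performance while gaining the ability to understand actions in robotic contexts. Extensive experiments show that Robotic-CLIP outperforms other CLIP-based models across various language-driven robotic tasks. We also demonstrate the practical effectiveness of Robotic-CLIP in real-world grasping applications.||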
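|**2024-09-26**|[MIO: A Foundation Model on Multimodal Tokens](http://arxiv.org/abs/2409.17692)|**[link](https://github.com/mio-team/mio)**|This paper introduces MIO, a novel foundation model built on multimodal tokens that can understand and generate speech, text, images, and videos in an end-to-end, autoregressive manner. Although large language models (LLMs) and multimodal large language models (MM-LLMs) have advanced artificial general intelligence through their versatility, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support generating multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes four training stages: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO is competitive with, and in some cases superior to, previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities that come with its any-to-any design, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, and instructional image editing.||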
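|**2024-09-26**|[P4Q: Learning to Prompt for Quantization in Visual-language Models](http://arxiv.org/abs/2409.17634)|null|Large-scale pre-trained vision-language models (VLMs) have achieved remarkable results on various visual and multimodal tasks, but deploying VLMs to downstream application platforms remains challenging because of their heavy demands on training samples and computing resources. Fine-tuning and quantizing VLMs can substantially reduce sample and computation costs, so research in this direction is urgently needed. Two main paradigms currently exist in quantization: Quantization-Aware Training (QAT) can quantize large-scale VLMs effectively but incurs a huge training cost, while low-bit Post-Training Quantization (PTQ) suffers from a notable performance drop. We propose a method that balances fine-tuning and quantization, named "Prompt for Quantization" (P4Q), in which we design a lightweight architecture that uses contrastive loss supervision to improve the recognition performance of a PTQ model. Our method effectively reduces the gap between image features and text features caused by low-bit quantization by reorganizing textual representations with learnable prompts and realigning the distributions of image and text features with a low-bit adapter. We also introduce a distillation loss based on cosine similarity predictions to distill the quantized model from a full-precision teacher. Extensive experimental results show that P4Q outperforms prior work and even achieves results comparable to its full-precision counterparts. For instance, our 8-bit P4Q can theoretically compress CLIP-ViT/B-32 by 4x while achieving 66.94% Top-1 accuracy on ImageNet, outperforming the learnable-prompt fine-tuned full-precision model by 2.24% with negligible additional parameters.||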
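|**2024-09-18**|[Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution](http://arxiv.org/abs/2409.12191)|**[link](https://github.com/qwenlm/qwen2-vl)**|We present the Qwen2-VL series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach to visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which lets the model dynamically process images of varying resolutions into different numbers of visual tokens. This allows the model to generate more efficient and accurate visual representations, closely aligned with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), which facilitates the effective fusion of positional information across text, images, and videos. We adopt a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.||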
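|**2024-09-18**|[GauTOAO: Gaussian-based Task-Oriented Affordance of Objects](http://arxiv.org/abs/2409.11941)|null|When your robot grasps an object with a dexterous hand or gripper, it should understand the Task-Oriented Affordances of the Object (TOAO), because different tasks often require attention to specific parts of the object. To address this challenge, we propose GauTOAO, a Gaussian-based framework for task-oriented affordances of objects that uses vision-language models in a zero-shot manner to predict affordance-relevant regions of an object given a natural language query. Our approach introduces a new paradigm, "static camera, moving object", which lets the robot better observe and understand the object in its hand during manipulation. GauTOAO addresses the limitations of existing methods, which often lack effective spatial grouping, by using DINO features to extract a complete 3D object mask. This mask is then used to conditionally query the Gaussians, producing a refined, task-specific semantic distribution over the object. This approach yields more accurate TOAO extraction, enhancing the robot's understanding of the object and improving task performance. We validate the effectiveness of GauTOAO through real-world experiments, demonstrating that it generalizes across a variety of tasks.||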
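|**2024-09-18**|[LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models](http://arxiv.org/abs/2409.11919)|**[link](https://github.com/valeoai/LLM_wrapper)**|Vision-language models (VLMs) perform impressively on numerous tasks, but their zero-shot capabilities can be limited compared with dedicated or fine-tuned models. Yet fine-tuning VLMs has its own limitations: it requires "white-box" access to the model's architecture and weights, as well as expertise to design the fine-tuning objectives and optimize the hyperparameters, all of which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach that adapts VLMs in a "black-box" manner by using large language models (LLMs) to reason over their outputs. We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models and achieves results competitive with classic fine-tuning.||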
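|**2024-09-17**|[NVLM: Open Frontier-Class Multimodal LLMs](http://arxiv.org/abs/2409.11402)|null|We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that improves both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate our multimodal pretraining and supervised fine-tuning datasets and provide detailed information about them. Our findings indicate that dataset quality and task diversity matter more than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel at vision-language tasks while maintaining or even improving text-only performance relative to their LLM backbones. To achieve this, we craft a high-quality text-only dataset and integrate it into multimodal training alongside a substantial amount of multimodal math and reasoning data, which enhances math and coding capabilities across modalities. To advance research in the field, we will release the model weights and open-source the code for the community: https://nvlm-project.github.io/.||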
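|**2024-09-17**|[CAST: Cross-modal Alignment Similarity Test for Vision Language Models](http://arxiv.org/abs/2409.11007)|**[link](https://github.com/gautierdag/cast)**|Vision-language models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks that assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases, nor does it assess hallucinations caused by misalignment between modalities. To address this, we propose the Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. The test asks models to identify similarities between two scenes through text only, images only, or both, and then assesses the truthfulness of the similarities they generate. Since there is no ground truth to compare against, this evaluation focuses not on objective accuracy but on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.||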
|**2024-09-17**|[KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph](http://arxiv.org/abs/2409.10921)|**[link](https://github.com/yanbei-jiang/artwork-interpretation)**|Exploring the narratives conveyed by fine-art paintings is a challenge in image captioning, where the goal is to generate descriptions that not only precisely represent the visual content but also offer an in-depth interpretation of the artwork's meaning. The task is particularly complex for artwork images due to their diverse interpretations and varied aesthetic principles across different artistic schools and styles. In response to this, we present KALE (Knowledge-Augmented vision-Language model for artwork Elaborations), a novel approach that enhances existing vision-language models by integrating artwork metadata as additional knowledge. KALE incorporates the metadata in two ways: firstly as direct textual input, and secondly through a multimodal heterogeneous knowledge graph. To optimize the learning of graph representations, we introduce a new cross-modal alignment loss that maximizes the similarity between the image and its corresponding metadata. Experimental results demonstrate that KALE achieves strong performance (when evaluated with CIDEr, in particular) over existing state-of-the-art work across several artwork datasets. Source code of the project is available at https://github.com/Yanbei-Jiang/Artwork-Interpretation.||
|**2024-09-16**|[Do Pre-trained Vision-Language Models Encode Object States?](http://arxiv.org/abs/2409.10488)|null|For a vision-language model (VLM) to understand the physical world, such as cause and effect, a first step is to capture the temporal dynamics of the visual world, for example how the physical states of objects evolve over time (e.g. a whole apple into a sliced apple). Our paper aims to investigate if VLMs pre-trained on web-scale data learn to encode object states, which can be extracted with zero-shot text prompts. We curate an object state recognition dataset ChangeIt-Frames, and evaluate nine open-source VLMs, including models trained with contrastive and generative objectives. We observe that while these state-of-the-art vision-language models can reliably perform object recognition, they consistently fail to accurately distinguish the objects' physical states. Through extensive experiments, we identify three areas for improvements for VLMs to better encode object states, namely the quality of object localization, the architecture to bind concepts to objects, and the objective to learn discriminative visual and language encoders on object states. Data and code are released.||
|**2024-09-16**|[CtRNet-X: Camera-to-Robot Pose Estimation in Real-world Conditions Using a Single Camera](http://arxiv.org/abs/2409.10441)|null|Camera-to-robot calibration is crucial for vision-based robot control and requires effort to make it accurate. Recent advancements in markerless pose estimation methods have eliminated the need for time-consuming physical setups for camera-to-robot calibration. While the existing markerless pose estimation methods have demonstrated impressive accuracy without the need for cumbersome setups, they rely on the assumption that all the robot joints are visible within the camera's field of view. However, in practice, robots usually move in and out of view, and some portion of the robot may stay out-of-frame during the whole manipulation task due to real-world constraints, leading to a lack of sufficient visual features and subsequent failure of these approaches. To address this challenge and enhance the applicability to vision-based robot control, we propose a novel framework capable of estimating the robot pose with partially visible robot manipulators. Our approach leverages the Vision-Language Models for fine-grained robot components detection, and integrates it into a keypoint-based pose estimation network, which enables more robust performance in varied operational conditions. The framework is evaluated on both public robot datasets and self-collected partial-view datasets to demonstrate our robustness and generalizability. As a result, this method is effective for robot pose estimation in a wider range of real-world manipulation scenarios.||
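|**2024-09-16**|[HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models](http://arxiv.org/abs/2409.10419)|**[link](https://github.com/vineet2104/hifics)**|Robots that can interact with humans through natural language can unlock numerous applications such as Referring Grasp Synthesis (RGS). Given a text query, RGS determines a stable grasp pose to manipulate the referred object in the robot's workspace. RGS comprises two steps: visual grounding and grasp pose estimation. Recent studies use powerful vision-language models (VLMs) to visually ground free-flowing natural language in real-world robotic execution. However, comparisons in complex, cluttered environments with multiple instances of the same object are lacking. This paper introduces HiFi-CS, which applies Featurewise Linear Modulation (FiLM) hierarchically to fuse image and text embeddings, improving visual grounding for the complex, attribute-rich text queries encountered in robotic grasping. Visual grounding associates an object in 2D/3D space with natural language input and is studied in two scenarios: closed vocabulary and open vocabulary. HiFi-CS features a lightweight decoder combined with a frozen VLM and outperforms competitive baselines in the closed-vocabulary setting while being 100x smaller in size. Our model can effectively guide open-set object detectors such as GroundedSAM to improve open-vocabulary performance. We validate our approach through real-world RGS experiments using a 7-DOF robotic arm, achieving 90.33% visual grounding accuracy in 15 tabletop scenes. We include our codebase in the supplementary material.||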
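|**2024-09-19**|[IRIS: Interactive Responsive Intelligent Segmentation for 3D Affordance Analysis](http://arxiv.org/abs/2409.10078)|null|Recent advances in large language and vision-language models have significantly enhanced multimodal understanding, yet translating high-level language instructions into precise robotic actions in 3D space remains challenging. This paper introduces IRIS (Interactive Responsive Intelligent Segmentation), a novel training-free multimodal system for 3D affordance segmentation, together with a benchmark for evaluating interactive, language-guided affordances in everyday environments. IRIS integrates a large multimodal model with a specialized 3D vision network, enabling a seamless fusion of 2D and 3D visual understanding with language comprehension. To facilitate evaluation, we provide a dataset of 10 typical indoor environments, each with 50 images annotated for object actions and 3D affordance segmentation. Extensive experiments show that IRIS handles interactive 3D affordance segmentation tasks across diverse settings and achieves competitive performance on a range of metrics. Our results highlight the potential of IRIS for enhancing human-robot interaction based on affordance understanding in complex indoor environments, advancing the development of more intuitive and efficient robotic systems for real-world applications.||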
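|**2024-09-15**|[FSL-LVLM: Friction-Aware Safety Locomotion using Large Vision Language Model in Wheeled Robots](http://arxiv.org/abs/2409.09845)|null|Wheeled-legged robots offer significant advantages in mobility and versatility but face substantial challenges when operating on slippery terrain. Traditional model-based controllers for these robots assume no slipping. While reinforcement learning (RL) helps quadruped robots adapt to different surfaces, recovering from slips remains challenging, especially for systems with few contact points. Estimating the ground friction coefficient is another open challenge. In this paper, we propose a novel friction-aware safety locomotion framework that integrates large vision-language models (LLMs) with an RL policy. Our approach explicitly incorporates the estimated friction coefficient into the RL policy, enabling the robot to adapt its behavior in advance, based on the surface type, before it reaches that surface. We introduce a Friction-From-Vision (FFV) module that uses LLMs to estimate ground friction coefficients, eliminating the need for large datasets and extensive training. The framework is validated on a customized wheeled inverted pendulum, and experimental results show that it increases the success rate of completing driving tasks by adjusting speed according to terrain type, while achieving better tracking performance than baseline methods. Our framework can be easily integrated with any other RL policy.||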
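|**2024-09-15**|[Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models](http://arxiv.org/abs/2409.09788)|null|Although recent work shows that vision-language models (VLMs) can describe complex relationships in images using natural language, their ability to reason quantitatively about object sizes and distances remains underexplored. In this work, we introduce Q-Spatial Bench, a manually annotated benchmark of 271 questions across five categories designed for quantitative spatial reasoning, and systematically investigate the performance of state-of-the-art VLMs on this task. Our analysis reveals that reasoning about distances between objects is particularly challenging for SoTA VLMs; however, some VLMs significantly outperform others, with a gap of over 40 points between the two best-performing models. We also make the surprising observation that the success rate of the top-performing VLM increases by 19 points when a reasoning path using a reference object emerges naturally in the response. Inspired by this observation, we develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using reference objects as visual cues. By instructing VLMs to use reference objects in their reasoning paths via SpatialPrompt, Gemini 1.5 Pro, Gemini 1.5 Flash, and GPT-4V improve their success rates by over 40, 20, and 30 points, respectively. We emphasize that these significant improvements are obtained without additional data, model architecture modifications, or fine-tuning.||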
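|**2024-09-15**|[Finetuning CLIP to Reason about Pairwise Differences](http://arxiv.org/abs/2409.09721)|**[link](https://github.com/dsam99/pc_clip)**|Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space appears to lack some of the structure of its purely text-based alternatives. For instance, while text embeddings have long been noted to satisfy *analogies* in embedding space using vector arithmetic, CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner so that it can reason about differences in embedding space. We fine-tune CLIP so that differences in image embedding space correspond to *text descriptions of the image differences*, which we synthetically generate with large language models on image-caption paired datasets. We first show that our approach yields significantly improved ability to rank images by a certain attribute (e.g., elephants are larger than cats), which is useful in retrieval or in constructing attribute-based classifiers, and improves zero-shot classification performance on many downstream image classification tasks. In addition, our approach enables a new mechanism for inference that we refer to as comparative prompting, where we leverage prior knowledge of text descriptions of the differences between classes of interest to achieve even larger gains in classification. Finally, we illustrate that the resulting embeddings obey a greater degree of geometric properties in embedding space, for example in text-to-image generation.||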
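|**2024-09-13**|[Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing](http://arxiv.org/abs/2409.08885)|null|Object detection in remote sensing imagery plays a vital role in various Earth observation applications. Unlike object detection in natural scene images, however, this task is particularly challenging because of the abundance of small, often barely visible objects across diverse terrains. To address these challenges, multimodal learning can be used to integrate features from different data modalities and thereby improve detection accuracy. Nonetheless, the performance of multimodal learning is often constrained by the limited size of labeled datasets. In this paper, we propose using Masked Image Modeling (MIM) as a pre-training technique, leveraging self-supervised learning on unlabeled data to enhance detection performance. However, conventional MIM methods such as MAE use masked tokens without any contextual information and struggle to capture fine-grained details because they lack interactions with other parts of the image. To address this, we propose a new interactive MIM method that establishes interactions between different tokens, which is particularly beneficial for object detection in remote sensing. Extensive ablation studies and evaluations demonstrate the effectiveness of our approach.||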
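|**2024-09-13**|[A Multimodal Approach for Fluid Overload Prediction: Integrating Lung Ultrasound and Clinical Data](http://arxiv.org/abs/2409.08790)|null|Maintaining fluid balance is crucial for dialysis patients, as improper management can lead to severe complications. In this paper, we propose a multimodal approach that integrates visual features from lung ultrasound images with clinical data to improve the prediction of excess body fluid. Our framework uses independent encoders to extract features for each modality and combines them through a cross-domain attention mechanism to capture complementary information. By framing the prediction as a classification task, the model achieves better performance than regression models. The results show that multimodal models consistently outperform single-modality models, particularly when the attention mechanism prioritizes tabular data. Pseudo-sample generation further helps mitigate the class imbalance in the classification problem, achieving the highest accuracy of 88.31%. This study underscores the effectiveness of multimodal learning for managing fluid overload in dialysis patients and offers valuable insights for improving clinical outcomes.||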
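|**2024-09-13**|[ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning](http://arxiv.org/abs/2409.08582)|null|Remote sensing (RS) change analysis is vital for monitoring Earth's dynamic processes by detecting changes in images over time. Traditional change detection excels at identifying pixel-level changes but lacks the ability to put those changes in context. While recent advances in change captioning provide natural language descriptions of changes, they do not support interactive, user-specific queries. To address these limitations, we introduce ChangeChat, the first bitemporal vision-language model (VLM) designed specifically for RS change analysis. ChangeChat uses multimodal instruction tuning, which allows it to handle complex queries such as change captioning, category-specific quantification, and change localization. To improve the model's performance, we developed the ChangeChat-87k dataset, generated with a combination of rule-based methods and GPT-assisted techniques. Experiments show that ChangeChat offers a comprehensive, interactive solution for RS change analysis, matching or even exceeding state-of-the-art (SOTA) methods on specific tasks and significantly surpassing the latest general-purpose model, GPT-4. Code and pre-trained weights are available at https://github.com/hanlinwu/ChangeChat.||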
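|**2024-09-13**|[Generalization Boosted Adapter for Open-Vocabulary Segmentation](http://arxiv.org/abs/2409.08468)|null|Vision-language models (VLMs) have demonstrated remarkable open-vocabulary object recognition capabilities, motivating their adaptation to dense prediction tasks such as segmentation. However, directly applying VLMs to such tasks remains challenging because they lack pixel-level granularity and the data available for fine-tuning is limited, leading to overfitting and poor generalization. To address these limitations, we propose the Generalization Boosted Adapter (GBA), a novel adapter strategy that enhances the generalization and robustness of VLMs for open-vocabulary segmentation. GBA comprises two core components: (1) a Style Diversification Adapter (SDA) that decouples features into amplitude and phase components and operates solely on the amplitude to enrich the feature space representation while preserving semantic consistency; and (2) a Correlation Constraint Adapter (CCA) that employs cross-attention to establish tighter semantic associations between text categories and target regions, suppressing irrelevant low-frequency "noise" information and avoiding erroneous associations. Through the synergy of the shallow SDA and the deep CCA, GBA effectively alleviates overfitting and enhances the semantic relevance of feature representations. As a simple, efficient, plug-and-play component, GBA can be flexibly integrated into various CLIP-based methods, demonstrating broad applicability and achieving state-of-the-art performance on multiple open-vocabulary segmentation benchmarks.||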
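|**2024-09-12**|[Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations](http://arxiv.org/abs/2409.08381)|null|Vision-language models (VLMs) such as CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by using prompt learning, in which a positive and a negative prompt are learned for each class so that their embeddings are associated with the presence or absence of the class in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, because the datasets used to train VLMs lack image-caption pairs that explicitly focus on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and that learning only positive prompts combined with learned negative embeddings (PositiveCoOp) outperforms dual prompt learning approaches. Moreover, we quantify the performance benefit of prompt learning over a simple vision-features-only baseline, observing that when the proportion of missing labels is low, the baseline performs comparably to the dual prompt learning approach (DualCoOp) while requiring half the training compute and 16 times fewer parameters.||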
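|**2024-09-12**|[What Makes a Maze Look Like a Maze?](http://arxiv.org/abs/2409.08202)|null|A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules that explain what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches can form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that uses explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas: dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then uses vision-language models to hierarchically ground the schema's components, from concrete to abstract, onto images. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different reasoning methods on our new Visual Abstractions Dataset, which consists of diverse real-world images of abstract concepts with corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models and is a step toward human-aligned understanding of visual abstractions.||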
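|**2024-09-13**|[A Comprehensive Survey on Deep Multimodal Learning with Missing Modality](http://arxiv.org/abs/2409.07825)|null|During multimodal model training and inference, data samples may miss certain modalities because of sensor limitations, cost constraints, privacy concerns, data loss, and temporal and spatial factors, which degrades model performance. This survey gives an overview of recent progress in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning techniques. It is the first comprehensive survey to cover the historical background and the distinction between MLMM and standard multimodal learning setups, followed by a detailed analysis of current MLMM methods, applications, and datasets, and it concludes with a discussion of challenges and potential future directions in the field.||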
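|**2024-09-12**|[Top-down Activity Representation Learning for Video Question Answering](http://arxiv.org/abs/2409.07748)|null|Capturing complex hierarchical human activities, from atomic actions (e.g., picking up a present, moving to the sofa, unwrapping the present) to contextual events (e.g., celebrating Christmas), is crucial for high-performance video question answering (VideoQA). Recent work has extended multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing the models' temporal reasoning capabilities. However, these approaches often fail to capture contextual events that can be decomposed into multiple atomic actions distributed non-continuously over relatively long-term sequences. In this paper, to exploit the spatial visual context representation capability of the CLIP model for obtaining non-continuous visual representations of contextual events in videos, we convert long-term video sequences into the spatial image domain and fine-tune the multimodal model LLaVA for the VideoQA task. Our approach achieves competitive performance on the STAR task and, in particular, reaches 78.4% accuracy on the NExTQA task, exceeding the current state-of-the-art score by 2.8 points.||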
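|**2024-09-12**|[DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?](http://arxiv.org/abs/2409.07703)|**[link](https://github.com/liqiangjing/dsbench)**|Large language models (LLMs) and large vision-language models (LVLMs) have demonstrated impressive language/vision reasoning abilities, sparking the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short of real-world data science applications because of their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents on realistic tasks. The benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning over large data files and multi-table structures, and end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks: the best agent solves only 34.12% of the data analysis tasks and achieves a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advances in developing more practical, intelligent, and autonomous data science agents.||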
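|**2024-09-12**|[Open-Vocabulary Remote Sensing Image Semantic Segmentation](http://arxiv.org/abs/2409.07683)|**[link](https://github.com/caoql98/ovrs)**|Open-vocabulary image semantic segmentation (OVS) seeks to segment images into semantic regions across an open set of categories. Existing OVS methods commonly depend on foundation vision-language models and use similarity computation to tackle OVS tasks. However, these approaches are predominantly tailored to natural images and struggle with the unique characteristics of remote sensing images, such as rapidly changing orientations and significant scale variations. These challenges complicate OVS tasks in Earth vision and call for specialized approaches. To tackle this, we draw on the distinctive properties of remote sensing and propose the first OVS framework designed specifically for remote sensing imagery. In particular, to address the varying orientations, we introduce a rotation-aggregative similarity computation module that generates orientation-adaptive similarity maps as initial semantic maps. These maps are then refined at both the spatial and categorical levels to produce more accurate semantic maps. Additionally, to handle significant scale changes, we integrate multi-scale image features into the upsampling process, resulting in the final scale-aware semantic masks. To advance OVS in Earth vision and encourage reproducible research, we establish the first open-source OVS benchmark for remote sensing imagery, comprising four public remote sensing datasets. Extensive experiments on this benchmark show that our proposed method achieves state-of-the-art performance. All code and datasets are available at https://github.com/caoql98/OVRS.||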
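|**2024-09-11**|[Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks](http://arxiv.org/abs/2409.07353)|**[link](https://github.com/speedlab-git/robust-encoder-against-jailbreak-attack)**|Large vision-language models (LVLMs), trained on large multimodal datasets, excel at vision-language tasks and have significantly advanced AI. However, these models remain vulnerable to adversarial attacks, particularly jailbreak attacks, which bypass safety protocols and cause the model to generate misleading or harmful responses. This vulnerability stems from both the inherent susceptibilities of large language models (LLMs) and the expanded attack surface introduced by the visual modality. We propose Sim-CLIP+, a novel defense mechanism that adversarially fine-tunes the CLIP vision encoder using a Siamese architecture. This approach maximizes the cosine similarity between perturbed and clean samples, strengthening resistance to adversarial manipulation. Sim-CLIP+ offers a plug-and-play solution that can be seamlessly integrated into existing LVLM architectures as a robust vision encoder. Unlike previous defenses, our method requires no structural modifications to the LVLM and incurs minimal computational overhead. Sim-CLIP+ is effective against both gradient-based adversarial attacks and various jailbreak techniques. We evaluate Sim-CLIP+ against three distinct jailbreak attack strategies and perform clean evaluations on standard downstream datasets, including COCO for image captioning and OKVQA for visual question answering. Extensive experiments show that Sim-CLIP+ maintains high clean accuracy while substantially improving robustness against both gradient-based adversarial attacks and jailbreak techniques. Our code and robust vision encoders are available at https://github.com/speedlab-git/Robust-Encoder-against-Jailbreak-attack.git.||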
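|**2024-09-11**|[MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving](http://arxiv.org/abs/2409.07267)|**[link](https://github.com/emzucas/minidrive)**|Vision-language models (VLMs) serve as general-purpose end-to-end models in autonomous driving, performing subtasks such as prediction, planning, and perception through question-and-answer interactions. However, most existing methods rely on computationally expensive visual encoders and large language models (LLMs), making them difficult to deploy in real-world scenarios and real-time applications. Meanwhile, most existing VLMs cannot process multiple images, which makes it hard to adapt them to multi-camera perception in autonomous driving. To address these issues, we propose a novel framework called MiniDrive, which incorporates our proposed Feature Engineering Mixture of Experts (FE-MoE) module and Dynamic Instruction Adapter (DI-Adapter). FE-MoE efficiently maps 2D features into visual token embeddings before they are fed into the language model. The DI-Adapter lets the visual token embeddings change dynamically with the instruction text embeddings, resolving the issue in previous approaches of static visual token embeddings for the same image. Compared with previous work, MiniDrive achieves state-of-the-art performance in terms of parameter size, floating-point operations, and response efficiency, with the smallest version containing only 83M parameters.||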
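|**2024-09-11**|[MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis](http://arxiv.org/abs/2409.07129)|null|This paper introduces MVLLaVA, an intelligent agent designed for novel view synthesis tasks. MVLLaVA integrates multiple multi-view diffusion models with a large multimodal model, LLaVA, enabling it to handle a wide range of tasks efficiently. MVLLaVA is a versatile, unified platform that adapts to diverse input types, including a single image, a descriptive caption, or a specific change in viewing azimuth, with language instructions guiding viewpoint generation. We carefully craft task-specific instruction templates, which are then used to fine-tune LLaVA. As a result, MVLLaVA acquires the ability to generate novel view images from user instructions, demonstrating its flexibility across diverse tasks. Experiments validate the effectiveness of MVLLaVA, demonstrating its strong performance and versatility in tackling diverse novel view synthesis challenges.||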
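|**2024-09-11**|[FSMDet: Vision-guided feature diffusion for fully sparse 3D detector](http://arxiv.org/abs/2409.06945)|null|Fully sparse 3D object detection has attracted increasing interest in recent years. However, the sparsity of features in these frameworks makes proposal generation difficult because the diffusion process is limited. In addition, the pursuit of efficiency has left vision-assisted fully sparse models largely unexplored. In this paper, we propose FSMDet (Fully Sparse Multi-modal Detection), which uses visual information to guide the LiDAR feature diffusion process while preserving the efficiency of the pipeline. Most fully sparse work focuses on complex, customized center-fusion diffusion/regression operators. However, we observe that if adequate object completion is performed, even the simplest interpolation operator yields satisfactory results. Inspired by this observation, we split the vision-guided diffusion process into two modules: a Shape Recover Layer (SRLayer) and a Self Diffusion Layer (SDLayer). The former uses RGB information to recover the shape of the visible part of an object, and the latter uses a visual prior to further spread the features to the center region. Experiments show that our approach improves on previous LiDAR-only fully sparse models and reaches SOTA performance among multimodal models. Moreover, thanks to the sparse architecture, our method can be up to 5 times more efficient than previous SOTA methods during inference.||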
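|**2024-09-10**|[ExIQA: Explainable Image Quality Assessment Using Distortion Attributes](http://arxiv.org/abs/2409.06853)|null|Blind Image Quality Assessment (BIQA) aims to develop methods that estimate image quality scores without a reference image. In this paper, we approach BIQA from a distortion identification perspective, where the primary goal is to predict distortion types and strengths using vision-language models (VLMs) such as CLIP, owing to their extensive knowledge and generalizability. Based on these predicted distortions, we then estimate the quality score of the image. To this end, we propose an explainable approach to distortion identification based on attribute learning. Instead of prompting VLMs with the names of distortions, we prompt them with the attributes or effects of distortions and aggregate this information to infer the distortion strength. Additionally, we consider multiple distortions per image, which makes our method more scalable. To support this, we generate a dataset of 100,000 images for efficient training. Finally, attribute probabilities are retrieved and fed into a regressor to predict the image quality score. The results show that our approach, besides being explainable and transparent, achieves state-of-the-art (SOTA) performance across multiple datasets in both PLCC and SRCC metrics. Moreover, zero-shot results demonstrate the generalizability of the proposed approach.||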
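|**2024-09-10**|[MAGDA: Multi-agent guideline-driven diagnostic assistance](http://arxiv.org/abs/2409.06351)|null|In emergency departments, rural hospitals, or clinics in less developed regions, clinicians often lack trained radiologists for fast image analysis, which can have a detrimental effect on patients' healthcare. Large language models (LLMs) have the potential to relieve some of this pressure by providing insights that help clinicians make decisions. While these LLMs achieve high scores on medical exams, showcasing their extensive theoretical medical knowledge, they tend not to follow medical guidelines. In this work, we introduce a new approach to zero-shot, guideline-driven decision support. We model a system of multiple LLM agents, augmented with a contrastive vision-language model, that collaborate to reach a patient diagnosis. After the agents are given simple diagnostic guidelines, they synthesize prompts and screen the image for findings according to these guidelines. Finally, they provide understandable chain-of-thought reasoning for their diagnosis, which is then self-refined to account for inter-dependencies between diseases. Because our method is zero-shot, it is suited to settings with rare diseases, where training data is limited but expert-crafted disease descriptions are available. We evaluate our method on two chest X-ray datasets, CheXpert and ChestX-ray 14 Longtail, showing performance improvements over existing zero-shot methods and generalizability to rare diseases.||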
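|**2024-09-10**|[INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding](http://arxiv.org/abs/2409.06210)|null|Affordance denotes the potential interactions inherent in objects. Perceiving affordances lets intelligent agents navigate and interact with new environments efficiently. Weakly supervised affordance grounding teaches agents the concept of affordance without costly pixel-level annotations, but it requires exocentric images. Although recent advances in weakly supervised affordance grounding have yielded promising results, challenges remain, including the requirement for paired exocentric and egocentric image datasets and the complexity of grounding diverse affordances for a single object. To address them, we propose INTeraction Relationship-aware weakly supervised Affordance grounding (INTRA). Unlike prior art, INTRA recasts this problem as representation learning, identifying unique features of interactions through contrastive learning with exocentric images only, which eliminates the need for paired datasets. Moreover, we leverage vision-language model embeddings to perform affordance grounding flexibly with any text, design text-conditioned affordance map generation to reflect interaction relationships for contrastive learning, and enhance robustness with our text synonym augmentation. Our method outperforms prior art on diverse datasets such as AGD20K, IIT-AFF, CAD, and UMD. Additionally, experimental results show that our method scales remarkably well across domains to synthesized images and illustrations, and can perform affordance grounding for novel interactions and objects.||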
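|**2024-09-10**|[Revisiting Prompt Pretraining of Vision-Language Models](http://arxiv.org/abs/2409.06166)|null|Prompt learning is an effective method for customizing vision-language models (VLMs) to various downstream tasks, requiring only a small number of parameters in the input prompt tokens to be tuned. Recently, prompt pretraining on large-scale datasets (e.g., ImageNet-21K) has come to play a crucial role in prompt learning for universal visual recognition. However, we revisit this practice and observe that, given the extensive number of images, a limited set of learnable prompts may face the risk of underfitting during prompt pretraining while also leading to poor generalization. To address these issues, we propose a general framework termed Revisiting Prompt Pretraining (RPP), which aims to improve fitting and generalization ability from two aspects: prompt structure and prompt supervision. For prompt structure, we break the restriction of the common practice in which the query, key, and value vectors are all derived from a shared learnable prompt token. Instead, we introduce unshared, individual query, key, and value learnable prompts, enhancing the model's fitting capacity through increased parameter diversity. For prompt supervision, we additionally use soft labels derived from the zero-shot probability predictions of a pretrained Contrastive Language-Image Pretraining (CLIP) teacher model. These soft labels provide more nuanced and general insight into inter-class relationships, giving the pretraining process better generalization ability. RPP produces a more robust prompt initialization, enhancing its robust transferability across diverse visual recognition tasks. Experiments across multiple benchmarks consistently confirm the state-of-the-art performance of our pretrained prompts. Code and models will be released soon.||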
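|**2024-09-09**|[PEERNet: An End-to-End Profiling Tool for Real-Time Networked Robotic Systems](http://arxiv.org/abs/2409.06078)|**[link](https://github.com/utaustin-swarmlab/peernet)**|Networked robotic systems must balance compute, power, and latency constraints in applications such as self-driving vehicles, drone swarms, and teleoperated surgery. The core problem in this domain is deciding when to offload a computationally expensive task to the cloud (a remote server) at the cost of communication latency. Task offloading algorithms often rely on precise knowledge of system-specific performance metrics, such as sensor data rates, network bandwidth, and machine learning model latency. While these metrics can be modeled during system design, uncertainties in connection quality, server load, and hardware conditions introduce real-time performance variations that affect overall performance. We introduce PEERNet, an end-to-end, real-time profiling tool for cloud robotics. PEERNet enables performance monitoring on heterogeneous hardware through targeted yet adaptive profiling of system components such as sensors, networks, deep learning pipelines, and devices. We showcase PEERNet's capabilities through networked robotics tasks, such as image-based teleoperation of a Franka Emika Panda arm and querying vision-language models using an Nvidia Jetson Orin. PEERNet reveals non-intuitive behavior in robotic systems, such as asymmetric network transmission and bimodal language model output. Our evaluation underscores the effectiveness and importance of benchmarking in networked robotics and demonstrates PEERNet's adaptability. Our code is open-source and available at github.com/UTAustin-SwarmLab/PEERNet.||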
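|**2024-09-07**|[Unlocking Potential Binders: Multimodal Pretraining DEL-Fusion for Denoising DNA-Encoded Libraries](http://arxiv.org/abs/2409.05916)|null|In drug discovery, DNA-encoded compound library (DEL) screening has emerged as an efficient method for identifying high-affinity compounds. However, DEL screening faces a significant challenge: noise arising from nonspecific interactions within complex biological systems. Neural networks trained on DEL libraries have been used to extract compound features, with the aim of denoising the data and uncovering potential binders to therapeutic targets. Nevertheless, the inherent structure of DELs, constrained by the limited diversity of building blocks, affects the performance of compound encoders. Moreover, existing methods capture compound features at only a single level, further limiting the effectiveness of denoising strategies. To mitigate these issues, we propose a Multimodal Pretraining DEL-Fusion model (MPDF) that enhances encoder capabilities through pretraining and integrates compound features across different scales. We develop pretraining tasks that apply contrastive objectives between different compound representations and their text descriptions, enhancing the compound encoders' ability to acquire generic features. Furthermore, we propose a novel DEL-fusion framework that fuses compound information at the atomic, submolecular, and molecular levels, as captured by various compound encoders. The synergy of these innovations equips MPDF with rich, multi-scale features, enabling comprehensive downstream denoising. Evaluated on three DEL datasets, MPDF demonstrates superior performance in data processing and analysis for validation tasks. Notably, MPDF offers new insights into identifying high-affinity molecules, paving the way for improved DEL use in drug discovery.||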
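|**2024-09-09**|[DexDiff: Towards Extrinsic Dexterity Manipulation of Ungraspable Objects in Unrestricted Environments](http://arxiv.org/abs/2409.05493)|null|Grasping large, flat objects (e.g., a book or a pan) is often regarded as an ungraspable task, which poses significant challenges because the required grasp poses are unreachable. Previous work leverages extrinsic dexterity, such as walls or table edges, to grasp such objects. However, these approaches are limited to task-specific policies and lack task planning to find pre-grasp conditions, which makes it difficult to adapt to varied environments and extrinsic dexterity constraints. We therefore present DexDiff, a robust robotic manipulation method for long-horizon planning with extrinsic dexterity. Specifically, we use a vision-language model (VLM) to perceive the environmental state and generate high-level task plans, followed by a goal-conditioned action diffusion (GCAD) model that predicts the sequence of low-level actions. This model learns the low-level policy from offline data, with the cumulative reward guided by high-level planning as the goal condition, which allows for improved prediction of robot actions. Experimental results show that our method not only performs ungraspable tasks effectively but also generalizes to previously unseen objects. It outperforms baselines by a 47% higher success rate in simulation and facilitates efficient deployment and manipulation in real-world scenarios.||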
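|**2024-09-08**|[PIP: Detecting Adversarial Examples in Large Vision-Language Models via Attention Patterns of Irrelevant Probe Questions](http://arxiv.org/abs/2409.05076)|**[link](https://github.com/btzyd/pip)**|Large vision-language models (LVLMs) have demonstrated powerful multimodal capabilities. However, they also face serious safety problems, because adversaries can induce robustness issues in LVLMs with well-designed adversarial examples. LVLMs are therefore in urgent need of detection tools for adversarial examples to prevent incorrect responses. In this work, we first discover that LVLMs exhibit regular attention patterns on clean images when presented with probe questions. We propose an unconventional method named PIP, which uses the attention patterns of one randomly selected irrelevant probe question (e.g., "Is there a clock?") to distinguish adversarial examples from clean examples. Regardless of the image under test and its corresponding question, PIP needs only one additional inference over the image under test and the probe question to detect adversarial examples. Even under black-box attacks and open-dataset scenarios, PIP coupled with a simple SVM still achieves more than 98% recall and more than 90% precision. PIP is the first attempt to detect adversarial attacks on LVLMs via simple irrelevant probe questions, shedding light on deeper understanding and introspection within LVLMs. The code is available at https://github.com/btzyd/pip.||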
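|**2024-09-07**|[POINTS: Improving Your Vision-language Model with Affordable Strategies](http://arxiv.org/abs/2409.04828)|null|In recent years, vision-language models have made significant strides, excelling in tasks such as optical character recognition and geometric problem solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source work is under-explored, with datasets added empirically, which makes the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we make the following contributions: 1) We train a robust baseline model using the latest advances in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filter pre-training data using perplexity, selecting the data with the lowest perplexity for training. This approach allows us to train on a curated 1M dataset and achieve competitive performance. 3) During visual instruction tuning, we apply model soup across different datasets when adding more datasets yields only marginal improvements. These innovations result in a 9B-parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easy for the community to adopt.||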
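|**2024-09-07**|[Enhancing Outlier Knowledge for Few-Shot Out-of-Distribution Detection with Extensible Local Prompts](http://arxiv.org/abs/2409.04796)|null|Out-of-distribution (OOD) detection, which aims to distinguish outliers from known categories, has become increasingly important in practical scenarios. Recently, the advent of vision-language models (VLMs) has heightened interest in enhancing OOD detection for VLMs through few-shot tuning. However, existing methods mainly focus on optimizing global prompts and ignore the refined use of local information with regard to outliers. Motivated by this, we freeze global prompts and introduce a novel coarse-to-fine tuning paradigm that emphasizes regional enhancement with local prompts. Our method comprises two integral components: global prompt guided negative augmentation and local prompt enhanced regional regularization. The former uses frozen, coarse global prompts as guiding cues to incorporate negative augmentation, thereby leveraging local outlier knowledge. The latter employs trainable local prompts and a regional regularization to capture local information effectively, aiding outlier identification. We also propose a region-related metric to enrich OOD detection. Moreover, since our approach explores enhancing only the local prompts, it can be seamlessly integrated with trained global prompts during inference to boost performance. Comprehensive experiments demonstrate the effectiveness and potential of our method. Notably, in 4-shot tuning on the ImageNet-1k dataset, our method reduces average FPR95 by 5.17% against the state-of-the-art method, even outperforming the 16-shot results of previous methods.||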
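|**2024-09-06**|[COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes](http://arxiv.org/abs/2409.04053)|**[link](https://github.com/koen-47/columbus)**|While visual question answering (VQA) benchmarks have catalyzed the development of reasoning techniques, they have focused on vertical thinking. Effective problem solving also requires lateral thinking, which remains understudied in AI and has not been used to test visual perception systems. To bridge this gap, we formulate visual lateral thinking as a multiple-choice question-answering task and describe a three-step, taxonomy-driven methodology for instantiating task examples. We then develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles based on publicly available collections of compound words and common phrases. COLUMBUS comprises over 1,000 puzzles, each with four answer candidates. While state-of-the-art vision-language models (VLMs) achieve decent performance, our evaluation reveals a substantial gap between humans and models. VLMs benefit from human-curated descriptions but struggle to generate such representations themselves at the right level of abstraction.||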
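|**2024-09-06**|[Generating Faithful and Salient Text from Multimodal Data](http://arxiv.org/abs/2409.03961)|**[link](https://github.com/TahsinaHashem/FaithD2T)**|While large multimodal models (LMMs) perform well on many multimodal tasks, they may still hallucinate when generating text. Their performance at detecting salient features in visual data is also unclear. In this paper, we develop a framework for generating faithful and salient text from mixed-modal data, which includes images and structured data (represented as knowledge graphs or tables). Specifically, we train a small vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in a post-editing step to improve generation quality. Experiments on two datasets show that our framework improves the generation quality of LMMs in both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination.||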
|**2024-09-05**|[Few-shot Adaptation of Medical Vision-Language Models](http://arxiv.org/abs/2409.03868)|**[link](https://github.com/fereshteshakeri/few-shot-medvlms)**|Integrating image and text data through multi-modal learning has emerged as a new approach in medical imaging research, following its successful deployment in computer vision. While considerable efforts have been dedicated to establishing medical foundation models and their zero-shot transfer to downstream tasks, the popular few-shot setting remains relatively unexplored. Following on from the currently strong emergence of this setting in computer vision, we introduce the first structured benchmark for adapting medical vision-language models (VLMs) in a strict few-shot regime and investigate various adaptation strategies commonly used in the context of natural images. Furthermore, we evaluate a simple generalization of the linear-probe adaptation baseline, which seeks an optimal blending of the visual prototypes and text embeddings via learnable class-wise multipliers. Surprisingly, such a text-informed linear probe yields competitive performances in comparison to convoluted prompt-learning and adapter-based strategies, while running considerably faster and accommodating the black-box setting. Our extensive experiments span three different medical modalities and specialized foundation models, nine downstream tasks, and several state-of-the-art few-shot adaptation methods. We made our benchmark and code publicly available to trigger further developments in this emergent subject: https://github.com/FereshteShakeri/few-shot-MedVLMs.||
|**2024-09-05**|[Have Large Vision-Language Models Mastered Art History?](http://arxiv.org/abs/2409.03521)|null|The emergence of large Vision-Language Models (VLMs) has recently established new baselines in image classification across multiple domains. However, the performance of VLMs in the specific task of artwork classification, particularly art style classification of paintings - a domain traditionally mastered by art historians - has not been explored yet. Artworks pose a unique challenge compared to natural images due to their inherently complex and diverse structures, characterized by variable compositions and styles. Art historians have long studied the unique aspects of artworks, with style prediction being a crucial component of their discipline. This paper investigates whether large VLMs, which integrate visual and textual data, can effectively predict the art historical attributes of paintings. We conduct an in-depth analysis of four VLMs, namely CLIP, LLaVA, OpenFlamingo, and GPT-4o, focusing on zero-shot classification of art style, author and time period using two public benchmarks of artworks. Additionally, we present ArTest, a well-curated test set of artworks, including pivotal paintings studied by art historians.||
|**2024-09-04**|[Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving](http://arxiv.org/abs/2409.02914)|null|Large Vision-Language Models (LVLMs) have recently garnered significant attention, with many efforts aimed at harnessing their general knowledge to enhance the interpretability and robustness of autonomous driving models. However, LVLMs typically rely on large, general-purpose datasets and lack the specialized expertise required for professional and safe driving. Existing vision-language driving datasets focus primarily on scene understanding and decision-making, without providing explicit guidance on traffic rules and driving skills, which are critical aspects directly related to driving safety. To bridge this gap, we propose IDKB, a large-scale dataset containing over one million data items collected from various countries, including driving handbooks, theory test data, and simulated road test data. Much like the process of obtaining a driver's license, IDKB encompasses nearly all the explicit knowledge needed for driving from theory to practice. In particular, we conducted comprehensive tests on 15 LVLMs using IDKB to assess their reliability in the context of autonomous driving and provided extensive analysis. We also fine-tuned popular models, achieving notable performance improvements, which further validate the significance of our dataset. The project page can be found at: https://4dvlab.github.io/project_page/idkb.html||
|**2024-09-04**|[Benchmarking Spurious Bias in Few-Shot Image Classifiers](http://arxiv.org/abs/2409.02882)|**[link](https://github.com/gtzheng/fewstab)**|Few-shot image classifiers are designed to recognize and classify new data with minimal supervision and limited data but often show reliance on spurious correlations between classes and spurious attributes, known as spurious bias. Spurious correlations commonly hold in certain samples and few-shot classifiers can suffer from spurious bias induced from them. There is an absence of an automatic benchmarking system to assess the robustness of few-shot classifiers against spurious bias. In this paper, we propose a systematic and rigorous benchmark framework, termed FewSTAB, to fairly demonstrate and quantify varied degrees of robustness of few-shot classifiers to spurious bias. FewSTAB creates few-shot evaluation tasks with biased attributes so that using them for predictions can demonstrate poor performance. To construct these tasks, we propose attribute-based sample selection strategies based on a pre-trained vision-language model, eliminating the need for manual dataset curation. This allows FewSTAB to automatically benchmark spurious bias using any existing test data. FewSTAB offers evaluation results in a new dimension along with a new design guideline for building robust classifiers. Moreover, it can benchmark spurious bias in varied degrees and enable designs for varied degrees of robustness. Its effectiveness is demonstrated through experiments on ten few-shot learning methods across three datasets. We hope our framework can inspire new designs of robust few-shot classifiers. Our code is available at https://github.com/gtzheng/FewSTAB.||
|**2024-09-06**|[CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models](http://arxiv.org/abs/2409.02834)|**[link](https://github.com/ecnu-icalk/educhat-math)**|Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or opinions, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets.||
|**2024-09-04**|[MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark](http://arxiv.org/abs/2409.02813)|null|This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.||
|**2024-09-04**|[Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection](http://arxiv.org/abs/2409.02664)|null|The proliferation of deepfake faces poses huge potential negative impacts on our daily lives. Despite substantial advancements in deepfake detection over these years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via data perturbations, our method can reprogram a pretrained VLM model (e.g., CLIP) solely based on manipulating its input without tuning the inner parameters. Furthermore, we insert a pseudo-word guided by facial identity into the text prompt. Extensive experiments on several popular benchmarks demonstrate that (1) the cross-dataset and cross-manipulation performances of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in cross-dataset setting from FF++ to WildDeepfake) using a pre-trained CLIP model with our proposed reprogramming method; (2) our superior performances are at less cost of trainable parameters, making it a promising approach for real-world applications.||
|**2024-09-04**|[Understanding eGFR Trajectories and Kidney Function Decline via Large Multimodal Models](http://arxiv.org/abs/2409.02530)|null|The estimated Glomerular Filtration Rate (eGFR) is an essential indicator of kidney function in clinical practice. Although traditional equations and Machine Learning (ML) models using clinical and laboratory data can estimate eGFR, accurately predicting future eGFR levels remains a significant challenge for nephrologists and ML researchers. Recent advances demonstrate that Large Language Models (LLMs) and Large Multimodal Models (LMMs) can serve as robust foundation models for diverse applications. This study investigates the potential of LMMs to predict future eGFR levels with a dataset consisting of laboratory and clinical values from 50 patients. By integrating various prompting techniques and ensembles of LMMs, our findings suggest that these models, when combined with precise prompts and visual representations of eGFR trajectories, offer predictive performance comparable to existing ML models. This research extends the application of foundation models and suggests avenues for future studies to harness these models in addressing complex medical forecasting challenges.||
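|**2024-09-03**|[Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems](http://arxiv.org/abs/2409.02278)|null|Recent developments in vision-language models (VLMs) have shown great potential for applications related to image understanding. In this study, we explore state-of-the-art VLMs on vision-based transportation engineering tasks such as image classification and object detection. The image classification tasks include congestion detection and crack identification, while the object detection task identifies helmet violations. We apply open-source models such as CLIP, BLIP, OWL-ViT, and Llava-Next, as well as the closed-source GPT-4o, and evaluate how well these state-of-the-art VLMs harness language understanding for vision-based transportation tasks. The tasks are performed by applying zero-shot prompting to the VLMs, since zero-shot prompting allows tasks to be carried out without any task-specific training. This eliminates the need for annotated datasets or fine-tuning for specific tasks. Although these models achieve results comparable to benchmark convolutional neural network (CNN) models on image classification tasks, there is still room for improvement on object localization tasks. This study therefore provides a comprehensive evaluation of state-of-the-art VLMs, highlighting their strengths and limitations, and can serve as a baseline for future improvement and large-scale deployment.||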
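|**2024-09-03**|[How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?](http://arxiv.org/abs/2409.02253)|**[link](https://github.com/asgsaeid/cad_vqa)**|Large foundation models have revolutionized the field, yet optimizing multimodal models for specialized visual tasks remains challenging. We propose a novel and general method to determine the preferred image distribution of a black-box vision-language model (VLM) by measuring the consistency of its outputs across varied input prompts. We apply it to different rendering types of 3D objects, demonstrating its efficacy across domains that require precise interpretation of complex structures, with computer-aided design (CAD) as an exemplar domain. We further refine VLM outputs using in-context learning with human feedback, significantly improving explanation quality. To address the lack of benchmarks in specialized domains, we introduce CAD-VQA, a new dataset for evaluating VLMs on CAD-related visual question answering tasks. Our evaluation of state-of-the-art VLMs on CAD-VQA establishes baseline performance levels, providing a framework for advancing VLM capabilities in complex visual reasoning tasks across domains that require expert-level visual interpretation. We release the dataset and evaluation code at https://github.com/asgsaeid/cad_vqa.||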
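|**2024-09-03**|[Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models](http://arxiv.org/abs/2409.02101)|**[link](https://github.com/jiaqixuac/WResVLM)**|This paper examines the limitations of adverse weather image restoration methods trained on synthetic data when applied to real-world scenarios. We build a semi-supervised learning framework that uses vision-language models to improve restoration performance across diverse adverse weather conditions in real-world settings. Our approach uses vision-language models to assess image clearness and provide semantics on real data, which serve as supervision signals for training restoration models. For clearness enhancement, we use real-world data with a dual strategy: pseudo-labels assessed by vision-language models and weather prompt learning. For semantic enhancement, we incorporate real-world data by adjusting the weather conditions in vision-language model descriptions while preserving semantics. In addition, we introduce an effective training strategy to boost restoration performance. Our approach achieves superior results in real-world adverse weather image restoration, as demonstrated by qualitative and quantitative comparisons with state-of-the-art works.||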
|**2024-09-03**|[GraspSplats: Efficient Manipulation with 3D Feature Splatting](http://arxiv.org/abs/2409.02084)|null|The ability for robots to perform efficient and zero-shot grasping of object parts is crucial for practical applications and is becoming prevalent with recent advances in Vision-Language Models (VLMs). To bridge the 2D-to-3D gap for representations to support such a capability, existing methods rely on neural fields (NeRFs) via differentiable rendering or point-based projection methods. However, we demonstrate that NeRFs are inappropriate for scene changes due to their implicitness and point-based methods are inaccurate for part localization without rendering-based optimization. To amend these issues, we propose GraspSplats. Using depth supervision and a novel reference feature computation method, GraspSplats generates high-quality scene representations in under 60 seconds. We further validate the advantages of Gaussian-based representation by showing that the explicit and optimized geometry in GraspSplats is sufficient to natively support (1) real-time grasp sampling and (2) dynamic and articulated object manipulation with point trackers. With extensive experiments on a Franka robot, we demonstrate that GraspSplats significantly outperforms existing methods under diverse task settings. In particular, GraspSplats outperforms NeRF-based methods like F3RM and LERF-TOGO, and 2D detection methods.||


## 6DOF Object Pose

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
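|**2025-04-03**|[PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation](http://arxiv.org/abs/2504.02617)|null|Novel object pose estimation from RGB images poses a significant challenge for zero-shot generalization, as it involves estimating the relative 6D transformation between an RGB observation and the CAD model of an object unseen during training. In this paper, we introduce PicoPose, a novel framework that tackles this task with a three-stage pixel-to-pixel correspondence learning process. First, PicoPose matches features from the RGB observation with those from rendered object templates, identifying the best-matched template and establishing coarse correspondences. Second, PicoPose smooths the correspondences by globally regressing a 2D affine transformation (in-plane rotation, scale, and translation) from the coarse correspondence map. Third, PicoPose applies the affine transformation to the feature map of the best-matched template and learns correspondence offsets within local regions to achieve fine-grained correspondences. By progressively refining the correspondences, PicoPose significantly improves the accuracy of object poses computed via PnP/RANSAC. PicoPose achieves state-of-the-art performance on the seven core datasets of the BOP benchmark, showing strong generalization to novel objects represented by CAD models or object reference images. Code and models are available at https://github.com/foollh/PicoPose.|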
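|**2025-03-25**|[Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders](http://arxiv.org/abs/2503.19947)|**[link](https://github.com/KochPJ/vanishing-depth)**|Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision encoders do not support. To address this, we propose Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to incorporate and align metric depth into their feature embeddings. Based on our novel positional depth encoding, we enable stable depth-density- and depth-distribution-invariant feature extraction. We achieve performance improvements and SOTA results across a range of relevant RGBD downstream tasks, without fine-tuning the encoder. Most notably, we achieve 56.05 mIoU on SUN-RGBD segmentation, 88.3 RMSE on Void depth completion, and 83.8 Top-1 accuracy on NYUv2 scene classification. In 6D object pose estimation, we outperform predecessor models such as DinoV2, EVA-02, and Omnivore, and achieve SOTA results for non-fine-tuned encoders on several relevant RGBD downstream tasks.|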
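|**2025-03-25**|[Visuo-Tactile Object Pose Estimation for a Multi-Finger Robot Hand with Low-Resolution In-Hand Tactile Sensing](http://arxiv.org/abs/2503.19893)|null|Accurate 3D pose estimation of grasped objects is an important prerequisite for robots to perform assembly or in-hand manipulation tasks, but occlusion of the object by the robot's own hand greatly increases the difficulty of this perceptual task. Here, we propose that combining visual information and proprioception with binary, low-resolution tactile contact measurements from across the interior surface of an articulated robotic hand can mitigate this issue. We formulate the visuo-tactile object pose estimation problem probabilistically in a factor graph. The pose of the object is optimized to align with the three kinds of measurements using a robust cost function that reduces the influence of visual or tactile outlier readings. The advantages of the proposed approach are first demonstrated in simulation: a custom 15-DoF robot hand with one binary tactile sensor per link grasps 17 YCB objects while observed by an RGB-D camera. This low-resolution in-hand tactile sensing significantly improves object pose estimates under high occlusion and high visual noise. We also show these benefits through grasping tests with a preliminary real version of our tactile hand, obtaining reasonable visuo-tactile object pose estimates at approximately 13.3 Hz on average.|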
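|**2025-03-25**|[DynOPETs: A Versatile Benchmark for Dynamic Object Pose Estimation and Tracking in Moving Camera Scenarios](http://arxiv.org/abs/2503.19625)|null|In the field of object pose estimation, scenarios involving both dynamic objects and moving cameras are prevalent. However, the scarcity of corresponding real-world datasets significantly hinders the development and evaluation of robust pose estimation models. This is largely attributed to the inherent challenges of accurately annotating object poses in dynamic scenes captured by moving cameras. To bridge this gap, this paper presents a novel dataset, DynOPETs, and a dedicated data acquisition and annotation pipeline tailored for object pose estimation and tracking in such unconstrained environments. Our efficient annotation method integrates pose estimation and pose tracking techniques to generate pseudo-labels, which are subsequently refined by pose graph optimization. The resulting dataset offers accurate pose annotations for dynamic objects observed from moving cameras. To validate the effectiveness and value of our dataset, we perform comprehensive evaluations using 18 state-of-the-art methods, demonstrating its potential to accelerate research in this challenging domain. The dataset will be made publicly available to facilitate further exploration and advancement in the field.|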
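|**2025-03-25**|[Any6D: Model-free 6D Pose Estimation of Novel Objects](http://arxiv.org/abs/2503.18673)|null|We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. Unlike existing methods that rely on textured 3D models or multiple viewpoints, Any6D uses a joint object alignment process to enhance 2D-3D alignment and metric scale estimation, improving pose accuracy. Our approach integrates a render-and-compare strategy to generate and refine pose hypotheses, enabling robust performance in scenarios with occlusions, non-overlapping views, diverse lighting conditions, and large cross-environment variations. We evaluate our method on five challenging datasets (REAL275, Toyota-Light, HO3D, YCBINEOAT, and LM-O), demonstrating that it significantly outperforms state-of-the-art methods for novel object pose estimation. Project page: https://taeyeop.com/any6d|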
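|**2025-03-20**|[GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation](http://arxiv.org/abs/2503.15110)|**[link](https://github.com/ziqin-h/givepose)**|Recent advances in RGBD-based category-level object pose estimation have been limited by their reliance on precise depth information, restricting their broader applicability. In response, RGB-based methods have been developed. Among these methods, geometry-guided pose regression, which originated from instance-level tasks, has demonstrated strong performance. However, we argue that the NOCS map is an inadequate intermediate representation for geometry-guided pose regression methods, because its many-to-one correspondence with category-level pose introduces redundant instance-specific information, leading to suboptimal results. This paper identifies the intra-class variation problem inherent in pose regression based solely on the NOCS map and proposes the Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model. By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements gradual intra-class variation elimination for category-level object pose estimation. Extensive evaluations on both synthetic and real-world datasets show that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches, achieving substantial improvements in category-level object pose estimation. Our code is available at https://github.com/ziqin-h/GIVEPose.|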
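|**2025-03-19**|[Learning Shape-Independent Transformation via Spherical Representations for Category-Level Object Pose Estimation](http://arxiv.org/abs/2503.13926)|null|Category-level object pose estimation aims to determine the pose and size of novel objects in specific categories. Existing correspondence-based approaches typically adopt point-based representations to establish correspondences between primitive observed points and normalized object coordinates. However, due to the inherent shape dependence of canonical coordinates, these methods suffer from semantic incoherence across diverse object shapes. To resolve this issue, we use a sphere as a shared proxy shape of objects to learn a shape-independent transformation via spherical representations. Based on this insight, we introduce a novel architecture called SpherePose, which yields precise correspondence prediction through three core designs. First, we endow the point-wise feature extraction with SO(3)-invariance, which facilitates robust mapping between camera coordinate space and object coordinate space regardless of rotation transformation. Second, a spherical attention mechanism is designed to propagate and integrate features among spherical anchors from a global perspective, mitigating the interference of noise and incomplete point clouds. Finally, a hyperbolic correspondence loss function is designed to distinguish subtle differences, which improves the precision of correspondence prediction. Experimental results on the CAMERA25, REAL275, and HouseCat6D benchmarks demonstrate the superior performance of our method, verifying the effectiveness of spherical representations and architectural innovations.|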
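|**2025-03-17**|[UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation](http://arxiv.org/abs/2503.13303)|null|Estimating the 3D pose of a hand and a potential hand-held object from monocular images is a longstanding challenge. Yet existing methods are overly specialized, focusing on either bare hands or hands interacting with objects. No method can flexibly handle both scenarios, and performance degrades when a method is applied to the other scenario. In this paper, we propose UniHOPE, a unified approach for general 3D hand-object pose estimation that flexibly adapts to both scenarios. Technically, we design a grasp-aware feature fusion module that integrates hand-object features with an object switcher to dynamically control hand-object pose estimation according to the grasping status. Further, to improve the robustness of hand pose estimation regardless of object presence, we generate realistic de-occluded image pairs to train the model to learn object-induced hand occlusions, and we formulate multi-level feature enhancement techniques for learning occlusion-invariant features. Extensive experiments on three commonly used benchmarks demonstrate UniHOPE's state-of-the-art performance in both hand-only and hand-object scenarios. Code will be released at https://github.com/JoyboyWang/UniHOPE_Pytorch.|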
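|**2025-03-17**|[Uncertainty-Aware Knowledge Distillation for Compact and Efficient 6DoF Pose Estimation](http://arxiv.org/abs/2503.13053)|null|Compact and efficient 6DoF object pose estimation is crucial in applications such as robotics, augmented reality, and space autonomous navigation systems, where lightweight models are critical for real-time accurate performance. This paper introduces a novel uncertainty-aware end-to-end knowledge distillation (KD) framework focused on keypoint-based 6DoF pose estimation. Keypoints predicted by a large teacher model exhibit varying levels of uncertainty that can be exploited within the distillation process to improve the student model's accuracy while ensuring its compactness. To this end, we propose a distillation strategy that aligns the student and teacher predictions by adjusting the knowledge transfer according to the uncertainty associated with each teacher keypoint prediction. Additionally, the proposed KD uses this uncertainty-aware alignment of keypoints to transfer knowledge at key locations of their respective feature maps. Experiments on the widely used LINEMOD benchmark demonstrate the effectiveness of our method, achieving superior 6DoF object pose estimation with lightweight models compared to state-of-the-art approaches. Further validation on the SPEED+ dataset for spacecraft pose estimation highlights the robustness of our approach across diverse 6DoF pose estimation scenarios.|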
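|**2025-03-09**|[AxisPose: Model-Free Matching-Free Single-Shot 6D Object Pose Estimation via Axis Generation](http://arxiv.org/abs/2503.06660)|null|Object pose estimation, which plays a vital role in robotics, augmented reality, and autonomous driving, has long been an important research direction in computer vision. Existing studies either require multi-stage pose regression or rely on 2D-3D feature matching. Although these approaches have shown promising results, they rely heavily on appearance information and require complex inputs (e.g., multi-view reference inputs, depth, or CAD models) and intricate pipelines (e.g., feature extraction, SfM, 2D-to-3D matching, PnP). We propose AxisPose, a model-free, matching-free, single-shot solution for robust 6D pose estimation that fundamentally diverges from the existing paradigm. Unlike existing methods that rely on 2D-3D or 2D-2D matching using 3D techniques such as SfM and PnP, AxisPose directly infers a robust 6D pose from a single view, without reference views, by using a diffusion model to learn the latent axis distribution of objects. Specifically, AxisPose constructs an Axis Generation Module (AGM) to capture the latent geometric distribution of object axes through a diffusion model. The diffusion process is guided by injecting the gradient of a geometric consistency loss into the noise estimation to maintain the geometric consistency of the generated tri-axis. With the generated tri-axis projection, AxisPose further adopts a Triaxial Back-projection Module (TBM) to recover the 6D pose from the object tri-axis. The proposed AxisPose achieves robust performance at the cross-instance level (i.e., one model for N instances) using only a single view as input and no reference images, and shows great potential for generalizing to the unseen-object level.|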
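|**2025-03-07**|[Novel Object 6D Pose Estimation with a Single Reference View](http://arxiv.org/abs/2503.05578)|**[link](https://github.com/cnjianliu/sinref-6d)**|Existing novel object 6D pose estimation methods typically rely on CAD models or dense reference views, both of which are difficult to acquire. Using only a single reference view is more scalable, but challenging because of large pose discrepancies and limited geometric and spatial information. To address these issues, we propose a Single-Reference-based novel object 6D (SinRef-6D) pose estimation method. Our key idea is to iteratively establish point-wise alignment in the camera coordinate system based on state space models (SSMs). Specifically, iterative camera-space point-wise alignment can effectively handle large pose discrepancies, while our proposed RGB and Points SSMs can capture long-range dependencies and spatial information from a single view, offering linear complexity and superior spatial modeling capability. Once pre-trained on synthetic data, SinRef-6D can estimate the 6D pose of a novel object using only a single reference view, without requiring retraining or a CAD model. Extensive experiments on six popular datasets and real-world robotic scenes show that we achieve performance on par with CAD-based and dense-reference-view-based methods, despite operating in the more challenging single-reference setting. Code will be released at https://github.com/CNJianLiu/SinRef-6D.||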
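|**2025-03-05**|[Improving 6D Object Pose Estimation of metallic Household and Industry Objects](http://arxiv.org/abs/2503.03655)|null|6D object pose estimation suffers from reduced accuracy when applied to metallic objects. We set out to improve the state of the art by addressing challenges such as reflections and specular highlights in industrial applications. Our novel BOP-compatible dataset, featuring a diverse set of metallic objects (cans, household, and industrial items) under various lighting and background conditions, provides additional geometric and visual cues. We demonstrate that these cues can be effectively leveraged to enhance overall performance. To illustrate the usefulness of the additional features, we improve upon the GDRNPP algorithm by introducing an additional keypoint prediction and a material estimator head to improve spatial scene understanding. Evaluations on the new dataset show improved accuracy for metallic objects, supporting the hypothesis that additional geometric and visual cues can improve learning.||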
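|**2025-03-05**|[Tiny Lidars for Manipulator Self-Awareness: Sensor Characterization and Initial Localization Experiments](http://arxiv.org/abs/2503.03449)|null|For many tasks, from manipulation to inspection, it is beneficial for robots to localize a target object in their surroundings. In this paper, we propose an approach that uses coarse point clouds obtained from miniature VL53L5CX time-of-flight (ToF) sensors (tiny lidars) to localize a target object in the robot's workspace. We first conduct an experiment to calibrate the dependency of sensor readings on relative range and orientation to targets. We then propose a probabilistic sensor model that is validated in an object pose estimation task using a particle filter (PF). The results show that the proposed sensor model improves target object localization performance compared with two baselines: one that assumes measurements are free from uncertainty and one that uses the confidence provided by the sensor datasheet.||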
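|**2025-03-04**|[DQO-MAP: Dual Quadrics Multi-Object mapping with Gaussian Splatting](http://arxiv.org/abs/2503.02223)|**[link](https://github.com/lihaoy-ux/dqo-map)**|Accurate object perception is essential for robotic applications such as object navigation. In this paper, we propose DQO-MAP, a novel object-SLAM system that seamlessly integrates object pose estimation and reconstruction. We employ 3D Gaussian Splatting for high-fidelity object reconstruction and leverage quadrics for precise object pose estimation. Both are managed on the CPU, while optimization is performed on the GPU, significantly improving system efficiency. By associating objects with unique IDs, our system enables rapid object extraction from the scene. Extensive experimental results on object reconstruction and pose estimation demonstrate that DQO-MAP achieves outstanding performance in terms of precision, reconstruction quality, and computational efficiency. The code and dataset are available at: https://github.com/LiHaoy-ux/DQO-MAP.||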
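|**2025-03-03**|[Category-level Meta-learned NeRF Priors for Efficient Object Mapping](http://arxiv.org/abs/2503.01582)|null|In 3D object mapping, category-level priors enable efficient object reconstruction and canonical pose estimation, requiring only a single prior per semantic category (e.g., chair, book, laptop). Recently, DeepSDF has predominantly been used as a category-level shape prior, but it struggles to reconstruct sharp geometry and is computationally expensive. In contrast, NeRFs capture fine details but have yet to be effectively integrated with category-level priors in a real-time multi-object mapping framework. To bridge this gap, we introduce PRENOM, a Prior-based Efficient Neural Object Mapper that integrates category-level priors with object-level NeRFs to improve reconstruction efficiency while enabling canonical object pose estimation. PRENOM gets to "know" objects by meta-learning on synthetic reconstruction tasks generated from open-source shape datasets. To account for object category variations, it employs a multi-objective genetic algorithm to optimize the NeRF architecture for each category, balancing reconstruction quality and training time. Additionally, prior-based probabilistic ray sampling directs sampling toward expected object regions, accelerating convergence and improving reconstruction quality under constrained resources. Experimental results on a low-end GPU highlight PRENOM's ability to achieve high-quality reconstructions while maintaining computational feasibility. Specifically, comparisons with prior-free NeRF-based approaches on a synthetic dataset show a 21% lower Chamfer distance, indicating better reconstruction quality. Furthermore, evaluations against other approaches using shape priors on a noisy real-world dataset indicate a 13% improvement averaged across all reconstruction metrics, improved rotation estimation accuracy, and comparable translation and size estimation performance, with 5x less training time.||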
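|**2025-02-24**|[V-HOP: Visuo-Haptic 6D Object Pose Tracking](http://arxiv.org/abs/2502.17434)|null|Humans naturally integrate vision and haptics for robust object perception during manipulation. The loss of either modality significantly degrades performance. Inspired by this multisensory integration, prior object pose estimation research has attempted to combine visual and haptic/tactile feedback. Although these works demonstrate improvements in controlled environments or synthetic datasets, they often underperform vision-only approaches in real-world settings due to poor generalization across diverse grippers, sensor layouts, or sim-to-real environments. Furthermore, they typically estimate the object pose for each frame independently, resulting in less coherent tracking over sequences in real-world deployments. To address these limitations, we introduce a novel unified haptic representation that effectively handles multiple gripper embodiments. Building on this representation, we introduce a new visuo-haptic transformer-based object pose tracker that seamlessly integrates visual and haptic input. We validate our framework on our dataset and the Feelsight dataset, demonstrating significant performance improvement on challenging sequences. Notably, our method achieves superior generalization and robustness across novel embodiments, objects, and sensor types (both taxel-based and vision-based tactile sensors). In real-world experiments, we demonstrate that our approach outperforms state-of-the-art visual trackers by a large margin. We further show that we can achieve precise manipulation tasks by incorporating our real-time object tracking result into motion plans, underscoring the advantages of visuo-haptic perception. Our model and dataset will be made open source upon acceptance of the paper. Project website: https://lhy.xyz/projects/v-hop/||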
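|**2025-02-17**|[Enhancing Transparent Object Pose Estimation: A Fusion of GDR-Net and Edge Detection](http://arxiv.org/abs/2502.12027)|null|Because of the strong influence of lighting, background, and reflections, pose estimation of transparent objects remains a challenging task in robot vision. However, the edges of transparent objects have the highest contrast, which leads to stable and prominent features. We propose a novel approach that incorporates edge detection as a pre-processing step for the tasks of object detection and object pose estimation. We conducted experiments to investigate the effect of edge detectors on transparent objects. We examine the performance of the state-of-the-art 6D object pose estimation pipeline GDR-Net and the object detector YOLOX when different edge detectors are applied as pre-processing steps (i.e., Canny edge detection with and without color information, and holistically-nested edges (HED)). We evaluate on Trans6D-32K, a physically-based rendered dataset of transparent objects, with parameters proposed by the BOP Challenge. Our results indicate that applying edge detection as a pre-processing step improves performance for certain objects.||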
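|**2025-02-04**|[Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation](http://arxiv.org/abs/2502.02525)|**[link](https://github.com/cnjianliu/diff9d)**|Nine-degrees-of-freedom (9-DoF) object pose and size estimation is crucial for augmented reality and robotic manipulation. Category-level methods have received extensive research attention due to their potential for generalization to intra-class unknown objects. However, these methods require manual collection and labeling of large-scale real-world training data. To address this problem, we introduce a diffusion-based paradigm for domain-generalized category-level 9-DoF object pose estimation. Our motivation is to leverage the latent generalization ability of the diffusion model to address the domain generalization challenge in object pose estimation. This entails training the model exclusively on rendered synthetic data to achieve generalization to real-world scenes. We propose an effective diffusion model to redefine 9-DoF object pose estimation from a generative perspective. Our model does not require any 3D shape priors during training or inference. By employing the Denoising Diffusion Implicit Model, we demonstrate that the reverse diffusion process can be executed in as few as 3 steps, achieving near real-time performance. Finally, we design a robotic grasping system comprising both hardware and software components. Through comprehensive experiments on two benchmark datasets and the real-world robotic system, we show that our method achieves state-of-the-art domain generalization performance. Our code will be made public at https://github.com/CNJianLiu/Diff9D.||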
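|**2025-02-03**|[CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation](http://arxiv.org/abs/2502.01312)|null|Category-level object pose estimation aims to recover the rotation, translation, and size of unseen instances within predefined categories. In this task, deep neural network-based methods have demonstrated remarkable performance. However, previous studies show that they suffer from spurious correlations raised by "unclean" confounders in models, hindering their performance on novel instances with significant variations. To address this issue, we propose CleanPose, a novel approach that integrates causal learning and knowledge distillation to enhance category-level pose estimation. To mitigate the negative effect of unobserved confounders, we develop a causal inference module based on front-door adjustment, which promotes unbiased estimation by reducing potential spurious correlations. Additionally, to further improve generalization ability, we devise a residual-based knowledge distillation method that has proven effective in providing comprehensive category information guidance. Extensive experiments across multiple benchmarks (REAL275, CAMERA25, and HouseCat6D) highlight the superiority of CleanPose over state-of-the-art methods. Code will be released.||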
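|**2025-01-24**|[Optimizing Grasping Precision for Industrial Pick-and-Place Tasks Through a Novel Visual Servoing Approach](http://arxiv.org/abs/2501.14557)|null|Robotic arm manipulators have been widely integrated into industrial production lines because of their efficiency and effectiveness in carrying out specific tasks. With advances in camera technology, visual sensors and perception systems have been incorporated to address more complex operations. This study introduces a novel visual servoing control system designed for robotic operations in challenging environments, where accurate object pose estimation is hindered by factors such as vibrations, tool path deviations, and machining marks. To overcome these obstacles, our solution focuses on improving the accuracy of pick-and-place tasks, ensuring reliable performance across various scenarios. This is accomplished by a novel visual servoing method based on the integration of two complementary methodologies: a technique for object localization and a separate approach for precise control through visual feedback, leveraging their strengths to address the challenges posed by the industrial context and thereby improving overall grasping precision. Our method uses feedback from perception sensors to adjust the control loop efficiently, enabling the robotic system to pick and place objects adeptly. We introduce a controller capable of seamlessly managing the detection and manipulation of objects of various shapes and types in an industrial context, addressing numerous challenges that arise in such environments.||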
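|**2024-12-13**|[Targeted Hard Sample Synthesis Based on Estimated Pose and Occlusion Error for Improved Object Pose Estimation](http://arxiv.org/abs/2412.04279)|null|6D object pose estimation is a fundamental component in robotics, enabling efficient interaction with the environment. It is particularly challenging in bin-picking applications, where objects may be textureless and in difficult poses, and occlusion between objects of the same type may cause confusion even in well-trained models. We propose a novel method of hard example synthesis that is model-agnostic, using existing simulators and the modeling of pose error in both the camera-to-object viewsphere and occlusion space. By evaluating the model's performance with respect to the distribution of object poses and occlusions, we discover regions of high error and generate realistic training samples to specifically target these regions. With our training approach, we demonstrate an improvement in correct detection rate of up to 20% across several ROBI-dataset objects using state-of-the-art pose estimation models.||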
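|**2024-12-02**|[6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting](http://arxiv.org/abs/2412.01543)|null|Efficient and accurate object pose estimation is an essential component of modern vision systems in many applications such as augmented reality, autonomous driving, and robotics. While research in model-based 6D object pose estimation has delivered promising results, model-free methods are hindered by the high computational load of rendering and inferring consistent poses of arbitrary objects in a live RGB-D video stream. To address this issue, we present 6DOPE-GS, a novel method for online 6D object pose estimation and tracking with a single RGB-D camera that leverages recent advances in Gaussian Splatting. Thanks to the fast differentiable rendering capabilities of Gaussian Splatting, 6DOPE-GS can simultaneously optimize for 6D object poses and 3D object reconstruction. To achieve the efficiency and accuracy necessary for live tracking, our method uses incremental 2D Gaussian Splatting with an intelligent dynamic keyframe selection procedure to achieve high spatial object coverage and prevent erroneous pose updates. We also propose an opacity-statistic-based pruning mechanism for adaptive Gaussian density control, to ensure training stability and efficiency. We evaluate our method on the HO3D and YCBInEOAT datasets and show that 6DOPE-GS matches the performance of state-of-the-art baselines for model-free simultaneous 6D pose tracking and reconstruction while providing a 5x speedup. We also demonstrate the method's suitability for live, dynamic object tracking and reconstruction in a real-world setting.||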
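|**2024-12-01**|[Particle-based 6D Object Pose Estimation from Point Clouds using Diffusion Models](http://arxiv.org/abs/2412.00835)|**[link](https://github.com/zitronian/6dposediffusion)**|Object pose estimation from a single view remains a challenging problem. In particular, partial observability, occlusions, and object symmetries eventually result in pose ambiguity. To account for this multimodality, this work proposes training a diffusion-based generative model for 6D object pose estimation. During inference, the trained generative model allows for sampling multiple particles, i.e., pose hypotheses. To distill this information into a single pose estimate, we propose two novel and effective pose selection strategies that do not require any additional training or computationally intensive operations. Moreover, while many existing methods for pose estimation primarily focus on the image domain and only incorporate depth information for final pose refinement, our model operates solely on point cloud data. The model thereby leverages recent advancements in point cloud processing and operates on an SE(3)-equivariant latent space that forms the basis for the particle selection strategies and allows for improved inference times. Our thorough experimental results on the Linemod dataset demonstrate the competitive performance of our approach and showcase the effectiveness of our design choices. Code is available at https://github.com/zitronian/6DPoseDiffusion.||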
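|**2024-11-25**|[UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image](http://arxiv.org/abs/2411.16106)|null|Unseen object pose estimation methods often rely on CAD models or multiple reference views, making the onboarding stage costly. To simplify reference acquisition, we aim to estimate the unseen object's pose through a single unposed RGB-D reference image. While previous works leverage reference images as pose anchors to limit the range of relative pose, our scenario presents significant challenges since the relative transformation could vary across the entire SE(3) space. Moreover, factors like occlusion, sensor noise, and extreme geometry could result in low viewpoint overlap. To address these challenges, we present a novel approach and benchmark, termed UNOPose, for unseen one-reference-based object pose estimation. Building upon a coarse-to-fine paradigm, UNOPose constructs an SE(3)-invariant reference frame to standardize object representation despite pose and size variations. To alleviate small overlap across viewpoints, we recalibrate the weight of each correspondence based on its predicted likelihood of being within the overlapping region. Evaluated on our proposed benchmark based on the BOP Challenge, UNOPose demonstrates superior performance, significantly outperforming traditional and learning-based methods in the one-reference setting and remaining competitive with CAD-model-based methods. The code and dataset will be made available.||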
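|**2024-11-24**|[Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching](http://arxiv.org/abs/2411.15860)|**[link](https://github.com/scy639/gen2sm)**|In this paper, we present a novel generalizable object pose estimation method that determines the object pose using only one RGB image. Unlike traditional approaches that rely on instance-level object pose estimation and require extensive training data, our method generalizes to unseen objects without extensive training, operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object. These characteristics are achieved by using a diffusion model to generate novel-view images and conducting two-sided matching on these generated images. Quantitative experiments demonstrate the superiority of our method over existing pose estimation techniques on both synthetic and real-world datasets. Remarkably, our approach maintains strong performance even in scenarios with significant viewpoint changes, highlighting its robustness and versatility in challenging conditions. The code will be released at https://github.com/scy639/Gen2SM.||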
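|**2024-11-21**|[SEMPose: A Single End-to-end Network for Multi-object Pose Estimation](http://arxiv.org/abs/2411.14002)|null|In computer vision, estimating the six-degrees-of-freedom pose from an RGB image is a fundamental task. However, this task becomes highly challenging in multi-object scenes. Currently, the best methods typically employ an indirect strategy, which identifies 2D and 3D correspondences and then solves with the Perspective-n-Points method. Yet this approach cannot be trained end-to-end. Direct methods, on the other hand, suffer from lower accuracy due to challenges such as varying object sizes and occlusions. To address these issues, we propose SEMPose, an end-to-end multi-object pose estimation network. SEMPose uses a well-designed texture-shape guided feature pyramid network, effectively tackling the challenge of object size variations. Additionally, it employs an iterative refinement head structure, progressively regressing rotation and translation separately to enhance estimation accuracy. During training, we alleviate the impact of occlusion by selecting positive samples from visible parts. Experimental results demonstrate that SEMPose can perform inference at 32 FPS without requiring inputs other than the RGB image. It can accurately estimate the poses of multiple objects in real time, with inference time unaffected by the number of target objects. On the LM-O and YCB-V datasets, our method outperforms other RGB-based single-model methods, achieving higher accuracy. Even when compared with multi-model methods and approaches that use additional refinement, our results remain competitive.||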
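|**2024-11-08**|[DeepArUco++: Improved detection of square fiducial markers in challenging lighting conditions](http://arxiv.org/abs/2411.05552)|**[link](https://github.com/avauco/deeparuco)**|Fiducial markers are a computer vision tool used for object pose estimation and detection. These markers are highly useful in fields such as industry, medicine, and logistics. However, optimal lighting conditions are not always available, and other factors such as blur or sensor noise can affect image quality. Classical computer vision techniques that precisely locate and decode fiducial markers often fail under difficult illumination conditions (e.g., extreme variations of lighting within the same frame). Hence, we propose DeepArUco++, a deep learning-based framework that leverages the robustness of convolutional neural networks to perform marker detection and decoding in challenging lighting conditions. The framework is based on a pipeline using different neural network models at each step, namely marker detection, corner refinement, and marker decoding. Additionally, we propose a simple method for generating synthetic data for training the different models that compose the proposed pipeline, and we present a second, real-life dataset of ArUco markers in challenging lighting conditions used to evaluate our system. The developed method outperforms other state-of-the-art methods in such tasks and remains competitive even when tested on the datasets used to develop those methods. Code is available on GitHub: https://github.com/AVAuco/deeparuco/||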
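|**2024-10-08**|[AIVIO: Closed-loop, Object-relative Navigation of UAVs with AI-aided Visual Inertial Odometry](http://arxiv.org/abs/2410.05996)|null|Object-relative mobile robot navigation is essential for a variety of tasks, such as autonomous critical infrastructure inspection, but requires the capability to extract semantic information about the objects of interest from raw sensory data. While deep learning-based (DL) methods excel at inferring semantic object information from images, such as class and relative 6-degree-of-freedom (6-DoF) pose, they are computationally demanding and thus often unsuitable for payload-constrained mobile robots. In this short paper, we present a real-time capable unmanned aerial vehicle (UAV) system for object-relative, closed-loop navigation with a minimal sensor configuration consisting of an inertial measurement unit (IMU) and an RGB camera. Using a DL-based object pose estimator, trained solely on synthetic data and optimized for companion board deployment, the object-relative pose measurements are fused with the IMU data to perform object-relative localization. We conduct multiple real-world experiments to validate the performance of our system for the challenging use case of power pole inspection. An example closed-loop flight is presented in the supplementary video.||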
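|**2024-09-24**|[LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation](http://arxiv.org/abs/2409.15727)|**[link](https://github.com/lolrudy/lapose)**|While RGBD-based methods for category-level object pose estimation hold promise, their reliance on depth data limits their applicability in diverse scenarios. In response, recent efforts have turned to RGB-based methods; however, they face significant challenges stemming from the absence of depth information. On one hand, the lack of depth exacerbates the difficulty of handling intra-class shape variation, resulting in increased uncertainty in shape predictions. On the other hand, RGB-only inputs introduce inherent scale ambiguity, rendering the estimation of object size and translation an ill-posed problem. To tackle these challenges, we propose LaPose, a novel framework that models the object shape as a Laplacian mixture model for pose estimation. By representing each point as a probabilistic distribution, we explicitly quantify the shape uncertainty. LaPose leverages both a generalized 3D information stream and a specialized feature stream to independently predict the Laplacian distribution for each point, capturing different aspects of object geometry. These two distributions are then integrated as a Laplacian mixture model to establish 2D-3D correspondences, which are used to solve the pose via the PnP module. To mitigate scale ambiguity, we introduce a scale-agnostic representation for object size and translation, enhancing training efficiency and overall robustness. Extensive experiments on the NOCS datasets validate the effectiveness of LaPose, yielding state-of-the-art performance in RGB-based category-level object pose estimation. Code is released at https://github.com/lolrudy/LaPose.||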
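|**2024-09-22**|[Tactile Functasets: Neural Implicit Representations of Tactile Datasets](http://arxiv.org/abs/2409.14592)|null|Modern incarnations of tactile sensors produce high-dimensional raw sensory feedback such as images, making it challenging to efficiently store, process, and generalize across sensors. To address these concerns, we introduce a novel implicit function representation for tactile sensor feedback. Rather than directly using raw tactile images, we propose neural implicit functions trained to reconstruct the tactile dataset, producing compact representations that capture the underlying structure of the sensory inputs. These representations offer several advantages over their raw counterparts: they are compact, enable probabilistically interpretable inference, and facilitate generalization across different sensors. We demonstrate the efficacy of this representation on the downstream task of in-hand object pose estimation, achieving improved performance over image-based methods while simplifying downstream models. We release code, demos, and datasets at https://www.mmintlab.com/tactile-functasets.||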
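|**2024-09-18**|[FAST GDRNPP: Improving the Speed of State-of-the-Art 6D Object Pose Estimation](http://arxiv.org/abs/2409.12720)|null|6D object pose estimation involves determining the three-dimensional translation and rotation of an object within a scene, relative to a chosen coordinate system. This problem is of particular interest for many practical applications in industrial tasks such as quality control, bin picking, and robotic manipulation, where both speed and accuracy are critical for real-world deployment. Current models, both classical and deep-learning-based, often struggle with the trade-off between accuracy and latency. Our research focuses on enhancing the speed of a prominent state-of-the-art deep learning model, GDRNPP, while keeping its high accuracy. We employ several techniques to reduce the model size and improve inference time. These techniques include using smaller and quicker backbones, pruning unnecessary parameters, and distillation to transfer knowledge from a large, high-performing model to a smaller, more efficient student model. Our findings demonstrate that the proposed configuration maintains accuracy comparable to the state of the art while significantly improving inference time. This advancement could lead to more efficient and practical applications in various industrial scenarios, thereby enhancing the overall applicability of 6D object pose estimation models in real-world settings.||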
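|**2024-09-12**|[Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation](http://arxiv.org/abs/2409.08269)|null|Today's touch sensors come in many shapes and sizes. This has made it challenging to develop general-purpose touch processing methods, since models are generally tied to one specific sensor design. We address this problem by performing cross-modal prediction between touch sensors: given the tactile signal from one sensor, we use a generative model to estimate how the same physical contact would be perceived by another sensor. This allows us to apply sensor-specific methods to the generated signal. We implement this idea by training a diffusion model to translate between the popular GelSlim and Soft Bubble sensors. As a downstream task, we perform in-hand object pose estimation using GelSlim sensors while using an algorithm that operates only on Soft Bubble signals. The dataset, the code, and additional details can be found at https://www.mmintlab.com/research/touch2touch/.||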
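|**2024-09-04**|[Object Gaussian for Monocular 6D Pose Estimation from Sparse Views](http://arxiv.org/abs/2409.02581)|null|Monocular object pose estimation, a pivotal task in computer vision and robotics, depends heavily on accurate 2D-3D correspondences, which often demand costly CAD models that may not be readily available. Object 3D reconstruction methods offer an alternative, among which recent advancements in 3D Gaussian Splatting (3DGS) show compelling potential. Yet its performance still falls short, and it tends to overfit when given few input views. To address this challenge, we introduce SGPose, a novel framework for sparse-view object pose estimation using Gaussian-based methods. With as few as ten views, SGPose generates a geometry-aware representation by starting from a random cuboid initialization, avoiding reliance on the Structure-from-Motion (SfM) pipeline-derived geometry required by traditional 3DGS methods. SGPose removes the dependence on CAD models by regressing dense 2D-3D correspondences between images and the reconstructed model from sparse input and random initialization, while geometry-consistent depth supervision and online synthetic view warping are key to its success. Experiments on typical benchmarks, especially on the Occlusion LM-O dataset, demonstrate that SGPose outperforms existing methods even under sparse-view constraints, underscoring its potential in real-world applications.||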
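|**2024-08-29**|[OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation](http://arxiv.org/abs/2408.16547)|**[link](https://github.com/yc-che/op-align)**|Category-level articulated object pose estimation focuses on the pose estimation of unknown articulated objects within known categories. Despite its significance, this task remains challenging due to the varying shapes and poses of objects, expensive dataset annotation costs, and complex real-world environments. In this paper, we propose a novel self-supervised approach that leverages a single-frame point cloud to solve this task. Our model consistently generates a reconstruction of the complete input object with a canonical pose and joint state, and estimates object-level poses that reduce overall pose variance and part-level poses that align each part of the input with its corresponding part of the reconstruction. Experimental results demonstrate that our approach significantly outperforms previous self-supervised methods and is comparable to state-of-the-art supervised methods. To assess the performance of our model in real-world scenarios, we also introduce a new real-world articulated object benchmark dataset.||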
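|**2024-08-19**|[RUMI: Rummaging Using Mutual Information](http://arxiv.org/abs/2408.10450)|null|This paper presents Rummaging Using Mutual Information (RUMI), a method for online generation of robot action sequences to gather information about the pose of a known movable object in visually occluded environments. Focusing on contact-rich rummaging, our approach leverages mutual information between the object pose distribution and robot trajectory for action planning. From an observed partial point cloud, RUMI deduces the compatible object pose distribution and approximates its mutual information with workspace occupancy in real time. Based on this, we develop an information gain cost function and a reachability cost function to keep the object within the robot's reach. These are integrated into a model predictive control (MPC) framework with a stochastic dynamics model, updating the pose distribution in a closed loop. Key contributions include a new belief framework for object pose estimation, an efficient information gain computation strategy, and a robust MPC-based control scheme. RUMI demonstrates superior performance in both simulated and real tasks compared with baseline methods.||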
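|**2024-08-15**|[Comparative Evaluation of 3D Reconstruction Methods for Object Pose Estimation](http://arxiv.org/abs/2408.08234)|**[link](https://github.com/varunburde/reconstruction_pose_benchmark)**|Object pose estimation is essential to many industrial applications involving robotic manipulation, navigation, and augmented reality. Current generalizable object pose estimators, i.e., approaches that do not need to be trained per object, rely on accurate 3D models. CAD models are predominantly used, but they can be hard to obtain in practice. At the same time, acquiring images of an object is relatively easy. Naturally, this leads to the question of whether 3D models reconstructed from images are sufficient for accurate object pose estimation. To answer this question, we propose a novel benchmark for measuring the impact of 3D reconstruction quality on pose estimation accuracy. Our benchmark provides calibrated images for object reconstruction that are registered with the test images of the YCB-V dataset, for pose evaluation under the BOP benchmark format. Detailed experiments with multiple state-of-the-art 3D reconstruction and object pose estimation approaches show that the geometry produced by modern reconstruction methods is often sufficient for accurate pose estimation. Our experiments lead to several interesting observations: (1) Standard metrics for measuring 3D reconstruction quality are not necessarily indicative of pose estimation accuracy, which shows the need for dedicated benchmarks such as ours. (2) Classical, non-learning-based approaches can perform on par with modern learning-based reconstruction techniques and can even offer a better trade-off between reconstruction time and pose accuracy. (3) There is still a sizable gap between performance with reconstructed models and with CAD models. To foster research on closing this gap, our benchmark is publicly available at https://github.com/VarunBurde/reconstruction_pose_benchmark.||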
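|**2024-07-16**|[NeuSurfEmb: A Complete Pipeline for Dense Correspondence-based 6D Object Pose Estimation without CAD Models](http://arxiv.org/abs/2407.12207)|**[link](https://github.com/ethz-asl/neusurfemb)**|State-of-the-art approaches for 6D object pose estimation assume the availability of CAD models and require the user to manually set up physically-based rendering (PBR) pipelines for synthetic training data generation. Both factors limit the application of these methods in real-world scenarios. In this work, we present a pipeline that does not require CAD models and allows training a state-of-the-art pose estimator with only a small set of real images as input. Our method is based on a NeuS2 object representation, which we learn through a semi-automated procedure based on Structure-from-Motion (SfM) and object-agnostic segmentation. We exploit the novel-view synthesis ability of NeuS2 and simple cut-and-paste augmentation to automatically generate photorealistic object renderings, which we use to train the correspondence-based SurfEmb pose estimator. We evaluate our method on the LINEMOD-Occlusion dataset, extensively studying the impact of its individual components and showing competitive performance with respect to approaches based on CAD models and PBR data. We additionally demonstrate the ease of use and effectiveness of our pipeline on self-collected real-world objects, showing that our method outperforms state-of-the-art CAD-model-free approaches, with better accuracy and robustness to mild occlusions. To allow the robotics community to benefit from this system, we will publicly release it at https://www.github.com/ethz-asl/neusurfemb.||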
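|**2024-06-06**|[Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking](http://arxiv.org/abs/2406.04316)|null|6D object pose estimation is a crucial yet challenging task in computer vision, suffering from a significant lack of large-scale datasets. This scarcity impedes comprehensive evaluation of model performance, limiting research advancements. Furthermore, the restricted number of available instances or categories curtails its applications. To address these issues, this paper introduces Omni6DPose, a substantial dataset characterized by its diversity in object categories, large scale, and variety in object materials. Omni6DPose is divided into three main components: ROPE (Real 6D Object Pose Estimation Dataset), which includes 332K images with over 1.5M annotations across 581 instances in 149 categories; SOPE (Simulated 6D Object Pose Estimation Dataset), consisting of 475K images created in a mixed-reality setting with depth simulation, with over 5M annotations across 4162 instances in the same 149 categories; and the manually aligned real scanned objects used in both ROPE and SOPE. Omni6DPose is inherently challenging due to the substantial variations and ambiguities. To address this challenge, we introduce GenPose++, an enhanced version of the SOTA category-level pose estimation framework, incorporating two pivotal improvements: semantic-aware feature extraction and clustering-based aggregation. Moreover, we provide a comprehensive benchmarking analysis to evaluate the performance of previous methods on this large-scale dataset in 6D object pose estimation and pose tracking.||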
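|**2024-06-05**|[Sparse Color-Code Net: Real-Time RGB-Based 6D Object Pose Estimation on Edge Devices](http://arxiv.org/abs/2406.02977)|null|As robotics and augmented reality applications increasingly rely on precise and efficient 6D object pose estimation, real-time performance on edge devices is required for more interactive and responsive systems. Our proposed Sparse Color-Code Net (SCCN) embodies a clear and concise pipeline design to effectively address this requirement. SCCN performs pixel-level predictions on the target object in the RGB image, using the sparsity of essential object geometry features to speed up the Perspective-n-Point (PnP) computation process. Additionally, it introduces a novel pixel-level geometry-based object symmetry representation that seamlessly integrates with the initial pose predictions, effectively addressing symmetric object ambiguities. SCCN achieves an estimation rate of 19 frames per second (FPS) and 6 FPS on the benchmark LINEMOD dataset and the Occlusion LINEMOD dataset, respectively, on an NVIDIA Jetson AGX Xavier, while consistently maintaining high estimation accuracy at these rates.||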
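|**2024-05-31**|[Deep Learning-Based Object Pose Estimation: A Comprehensive Survey](http://arxiv.org/abs/2405.07801)|**[link](https://github.com/cnjianliu/awesome-object-pose-estimation)**|Object pose estimation is a fundamental computer vision problem with broad applications in augmented reality and robotics. Over the past decade, deep learning models, thanks to their superior accuracy and robustness, have increasingly supplanted conventional algorithms that rely on engineered point-pair features. Nevertheless, contemporary methods still face several challenges, including their dependence on labeled training data, model compactness, robustness under challenging conditions, and their ability to generalize to novel unseen objects. A recent survey discussing the progress made in the field, the outstanding challenges, and promising future directions is missing. To fill this gap, we discuss recent advances in deep learning-based object pose estimation, covering all three formulations of the problem: instance-level, category-level, and unseen object pose estimation. Our survey also covers multiple input data modalities, degrees of freedom of output poses, object properties, and downstream tasks, providing readers with a holistic understanding of the field. Additionally, it discusses training paradigms of different domains, inference modes, application areas, evaluation metrics, and benchmark datasets, and reports the performance of current state-of-the-art methods on these benchmarks, thereby helping readers select the most suitable method for their application. Finally, the survey identifies key challenges, reviews the prevailing trends along with their pros and cons, and identifies promising directions for future research. We also keep tracking the latest works at https://github.com/CNJianLiu/Awesome-Object-Pose-Estimation.|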
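|**2024-03-28**|[Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation](http://arxiv.org/abs/2403.19527)|**[link](https://github.com/leeiieeo/ag-pose)**|Category-level 6D object pose estimation aims to estimate the rotation, translation, and size of unseen instances within specific categories. In this area, dense correspondence-based methods have achieved leading performance. However, they do not explicitly consider the local and global geometric information of different instances, resulting in poor generalization to unseen instances with significant shape variations. To address this problem, we propose a novel Instance-Adaptive and Geometric-Aware Keypoint Learning method for category-level 6D object pose estimation (AG-Pose), which includes two key designs: (1) an Instance-Adaptive Keypoint Detection module, which adaptively detects a set of sparse keypoints that represent the geometric structures of various instances; and (2) a Geometric-Aware Feature Aggregation module, which efficiently integrates local and global geometric information into keypoint features. These two modules work together to establish robust keypoint-level correspondences for unseen instances, thus enhancing the generalization ability of the model. Experimental results on the CAMERA25 and REAL275 datasets show that the proposed AG-Pose outperforms state-of-the-art methods by a large margin without category-specific shape priors.|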
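|**2024-06-01**|[Object Pose Estimation via the Aggregation of Diffusion Features](http://arxiv.org/abs/2403.18791)|**[link](https://github.com/tianfu18/diff-feats-pose)**|Estimating the pose of objects from images is a crucial task in 3D scene understanding, and recent approaches have shown promising results on very large benchmarks. However, these methods experience a significant performance drop when dealing with unseen objects. We believe this results from the limited generalizability of image features. To address this problem, we conduct an in-depth analysis of the features of diffusion models, e.g., Stable Diffusion, which hold substantial potential for modeling unseen objects. Based on this analysis, we introduce these diffusion features to object pose estimation. To achieve this, we propose three distinct architectures that can effectively capture and aggregate diffusion features of different granularity, greatly improving the generalizability of object pose estimation. Our approach outperforms state-of-the-art methods by a considerable margin on three popular benchmark datasets: LM, O-LM, and T-LESS. In particular, our method achieves higher accuracy than the previous best results on unseen objects: 98.2% vs. 93.5% on Unseen LM and 85.9% vs. 76.3% on Unseen O-LM, showing the strong generalizability of our method. Our code is released at https://github.com/Tianfu18/diff-feats-pose.|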

(back to top)

## nerf

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
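|**2025-04-02**|[BOGausS: Better Optimized Gaussian Splatting](http://arxiv.org/abs/2504.01844)|null|3D Gaussian Splatting (3DGS) offers an efficient solution for novel view synthesis. Its framework provides fast, high-fidelity rendering. Although it is less complex than other solutions such as Neural Radiance Fields (NeRF), building smaller models without sacrificing quality still poses some challenges. In this study, we carefully analyze the 3DGS training process and propose a new optimization methodology. Our Better Optimized Gaussian Splatting (BOGausS) solution is able to generate models up to ten times lighter than the original 3DGS with no quality degradation, thus significantly improving the performance of Gaussian Splatting compared to the state of the art.|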
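|**2025-03-24**|[NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting](http://arxiv.org/abs/2503.18794)|null|Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have noticeably advanced photorealistic novel view synthesis using images from densely spaced camera viewpoints. However, these methods struggle in few-shot scenarios due to limited supervision. In this paper, we present NexusGS, a 3DGS-based approach that enhances novel view synthesis from sparse-view images by directly embedding depth information into point clouds, without relying on complex manual regularizations. Exploiting the inherent epipolar geometry of 3DGS, our method introduces a novel point cloud densification strategy that initializes 3DGS with a dense point cloud, reducing randomness in point placement while preventing over-smoothing and overfitting. Specifically, NexusGS comprises three key steps: epipolar depth association, flow-resilient depth fusion, and flow-filtered depth pruning. These steps leverage optical flow and camera poses to compute accurate depth maps while mitigating the inaccuracies often associated with optical flow. By incorporating epipolar depth priors, NexusGS ensures reliable dense point cloud coverage and supports stable 3DGS training under sparse-view conditions. Experiments demonstrate that NexusGS significantly enhances depth accuracy and rendering quality, surpassing state-of-the-art methods by a considerable margin. Furthermore, we validate the superiority of our generated point clouds by substantially boosting the performance of competing methods. Project page: https://usmizuki.github.io/NexusGS/.|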
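|**2025-03-13**|[MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction](http://arxiv.org/abs/2503.10604)|null|Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods degrade substantially when viewpoints deviate significantly from the training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, an innovative framework that integrates a multi-modal diffusion model with Gaussian Splatting (GS) for urban scene reconstruction. MuDG leverages aggregated LiDAR point clouds together with RGB and geometric priors to condition a multi-modal video diffusion model, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization and provides comprehensive supervision signals for refining 3DGS representations, enhancing rendering robustness under extreme viewpoint changes. Experiments on the Open Waymo Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.|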
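|**2025-03-11**|[Dynamic Scene Reconstruction: Recent Advance in Real-time Rendering and Streaming](http://arxiv.org/abs/2503.08166)|null|Representing and rendering dynamic scenes from 2D images is a fundamental yet challenging problem in computer vision and graphics. This survey provides a comprehensive review of the evolution and advancements in dynamic scene representation and rendering, with a particular emphasis on recent progress in reconstruction methods based on Neural Radiance Fields and 3D Gaussian Splatting. We systematically summarize existing approaches, categorize them according to their core principles, compile relevant datasets, compare the performance of various methods on these benchmarks, and explore the challenges and future research directions in this rapidly evolving field. In total, we review over 170 relevant papers, offering a broad perspective on the state of the art in this domain.|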
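|**2025-03-09**|[Gaussian RBFNet: Gaussian Radial Basis Functions for Fast and Accurate Representation and Reconstruction of Neural Fields](http://arxiv.org/abs/2503.06762)|null|Neural fields such as DeepSDF and Neural Radiance Fields have recently revolutionized novel-view synthesis and 3D reconstruction from RGB images and videos. However, achieving high-quality representation, reconstruction, and rendering requires deep neural networks, which are slow to train and evaluate. Although several acceleration techniques have been proposed, they often trade memory for speed. Gaussian splatting-based methods, on the other hand, accelerate rendering but remain costly in terms of training speed and the memory needed to store the parameters of a large number of Gaussians. In this paper, we introduce a novel neural representation that is fast at both training and inference time, and lightweight. Our key observation is that the neurons used in traditional MLPs perform simple computations (a dot product followed by ReLU activation), so wide and deep MLPs or high-resolution, high-dimensional feature grids are needed to parameterize complex nonlinear functions. We show in this paper that by replacing traditional neurons with Radial Basis Function (RBF) kernels, one can achieve highly accurate representation of 2D (RGB images), 3D (geometry), and 5D (radiance fields) signals with just a single layer of such neurons. The representation is highly parallelizable, operates on low-resolution feature grids, and is compact and memory-efficient. We demonstrate that the proposed representation can be trained for 3D geometry representation in less than 15 seconds and for novel view synthesis in less than 15 minutes. At runtime, it can synthesize novel views at more than 60 fps without sacrificing quality.|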
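|**2025-03-08**|[Feature-EndoGaussian: Feature Distilled Gaussian Splatting in Surgical Deformable Scene Reconstruction](http://arxiv.org/abs/2503.06161)|null|Minimally invasive surgery (MIS) has transformed clinical practice by reducing recovery times, minimizing complications, and enhancing precision. Nonetheless, MIS inherently relies on indirect visualization and precise instrument control, posing unique challenges. Recent advances in artificial intelligence have enabled real-time surgical scene understanding through techniques such as image classification, object detection, and segmentation, with scene reconstruction emerging as a key element for enhanced intraoperative guidance. Although Neural Radiance Fields (NeRF) have been explored for this purpose, their substantial data requirements and slow rendering inhibit real-time performance. In contrast, 3D Gaussian Splatting (3DGS) offers a more efficient alternative, achieving state-of-the-art performance in dynamic surgical scene reconstruction. In this work, we introduce Feature-EndoGaussian (FEG), an extension of 3DGS that integrates 2D segmentation cues into 3D rendering to enable real-time semantic and scene reconstruction. By leveraging pretrained segmentation foundation models, FEG incorporates semantic feature distillation within the Gaussian deformation framework, thereby enhancing both reconstruction fidelity and segmentation accuracy. On the EndoNeRF dataset, FEG achieves superior performance (SSIM of 0.97, PSNR of 39.08, and LPIPS of 0.03) compared to leading methods. Additionally, on the EndoVis18 dataset, FEG demonstrates competitive class-wise segmentation metrics while balancing model size and real-time performance.|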
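|**2025-03-05**|[LensDFF: Language-enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation](http://arxiv.org/abs/2503.03890)|null|Learning dexterous manipulation from few-shot demonstrations is a significant yet challenging problem for advanced, human-like robotic systems. Dense distilled feature fields have addressed this challenge by distilling rich semantic features from 2D visual foundation models into the 3D domain. However, their reliance on neural rendering models such as Neural Radiance Fields (NeRF) or Gaussian Splatting results in high computational costs. In contrast, previous approaches based on sparse feature fields either suffer from inefficiencies due to multi-view dependencies and extensive training or lack sufficient grasp dexterity. To overcome these limitations, we propose Language-ENhanced Sparse Distilled Feature Field (LensDFF), which efficiently distills view-consistent 2D features onto 3D points using our novel language-enhanced feature fusion strategy, thereby enabling single-view few-shot generalization. Based on LensDFF, we further introduce a few-shot dexterous manipulation framework that integrates grasp primitives into the demonstrations to generate stable and highly dexterous grasps. Moreover, we present a real2sim grasp evaluation pipeline for efficient grasp assessment and hyperparameter tuning. Through extensive simulation experiments based on the real2sim pipeline and real-world experiments, our approach achieves competitive grasping performance, outperforming state-of-the-art approaches.|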
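|**2025-03-04**|[2DGS-Avatar: Animatable High-fidelity Clothed Avatar via 2D Gaussian Splatting](http://arxiv.org/abs/2503.02452)|null|Real-time rendering of high-fidelity, animatable avatars from monocular videos remains a challenging problem in computer vision and graphics. Over the past few years, Neural Radiance Fields (NeRF) have made significant progress in rendering quality but deliver poor run-time performance due to the low efficiency of volumetric rendering. Recently, methods based on 3D Gaussian Splatting (3DGS) have shown great potential for fast training and real-time rendering. However, they still suffer from artifacts caused by inaccurate geometry. To address these problems, we propose 2DGS-Avatar, a novel approach based on 2D Gaussian Splatting (2DGS) for modeling animatable clothed avatars with high fidelity and fast training performance. Given monocular RGB videos as input, our method generates an avatar that can be driven by poses and rendered in real time. Compared to 3DGS-based methods, our 2DGS-Avatar retains the advantages of fast training and rendering while also capturing detailed, dynamic, and photorealistic appearances. We conduct extensive experiments on popular datasets such as AvatarRex and THuman4.0, demonstrating impressive performance in both qualitative and quantitative metrics.|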
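|**2025-02-28**|[EndoPBR: Material and Lighting Estimation for Photorealistic Surgical Simulations via Physically-based Rendering](http://arxiv.org/abs/2502.20669)|null|The lack of labeled datasets of 3D surgical scenes has hindered the development of robust 3D reconstruction algorithms in the medical domain. Despite the popularity of Neural Radiance Fields and 3D Gaussian Splatting in the general computer vision community, these systems have yet to find consistent success in surgical scenes due to challenges such as non-stationary lighting and non-Lambertian surfaces. As a result, the need for labeled surgical datasets continues to grow. In this work, we introduce a differentiable rendering framework for material and lighting estimation from endoscopic images and known geometry. Compared to previous approaches that model lighting and material jointly as radiance, we explicitly disentangle these scene properties for robust and photorealistic novel view synthesis. To disambiguate the training process, we formulate domain-specific properties inherent in surgical scenes. Specifically, we model the scene lighting as a simple spotlight and the material properties as a bidirectional reflectance distribution function parameterized by a neural network. By grounding color predictions in the rendering equation, we can generate photorealistic images at arbitrary camera poses. We evaluate our method on various sequences from the Colonoscopy 3D Video Dataset and show that it produces competitive novel view synthesis results compared with other approaches. Furthermore, we demonstrate that synthetic data can be used to develop 3D vision algorithms by fine-tuning a depth estimation model with our rendered outputs. Overall, the resulting depth estimation performance is on par with fine-tuning on the original real images.|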
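|**2025-02-26**|[Does 3D Gaussian Splatting Need Accurate Volumetric Rendering?](http://arxiv.org/abs/2502.19318)|**[link](https://github.com/cg-tuwien/does_3d_gaussian_splatting_need_accurate_volumetric_rendering)**|Since its introduction, 3D Gaussian Splatting (3DGS) has become an important reference method for learning 3D representations of captured scenes, allowing real-time novel-view synthesis with high visual quality and fast training times. Neural Radiance Fields (NeRFs), which preceded 3DGS, are based on a principled ray-marching approach for volumetric rendering. In contrast, while sharing a similar image formation model with NeRF, 3DGS uses a hybrid rendering solution that builds on the strengths of volume rendering and primitive rasterization. A crucial benefit of 3DGS is its performance, achieved through a set of approximations, in many cases with respect to volumetric rendering theory. A naturally arising question is whether replacing these approximations with more principled volumetric rendering solutions can improve the quality of 3DGS. In this paper, we present an in-depth analysis of the various approximations and assumptions used by the original 3DGS solution. We demonstrate that, while more accurate volumetric rendering can help when the number of primitives is low, efficient optimization and the large number of Gaussians allow 3DGS to outperform volumetric rendering despite its approximations.|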
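|**2025-02-24**|[VR-Pipe: Streamlining Hardware Graphics Pipeline for Volume Rendering](http://arxiv.org/abs/2502.17078)|null|Graphics rendering that builds on machine learning and radiance fields is gaining significant attention due to its outstanding quality and speed in generating photorealistic images from novel viewpoints. However, prior work has primarily focused on evaluating its performance through software-based rendering on programmable shader cores, leaving its performance when exploiting fixed-function graphics units largely unexplored. In this paper, we investigate the performance implications of performing radiance field rendering on the hardware graphics pipeline. To this end, we implement the state-of-the-art radiance field method, 3D Gaussian splatting, using graphics APIs and evaluate it across synthetic and real-world scenes on today's graphics hardware. Based on our analysis, we present VR-Pipe, which seamlessly integrates two innovations into graphics hardware to streamline the hardware pipeline for volume rendering, such as radiance field methods. First, we introduce native hardware support for early termination by repurposing existing special-purpose hardware in modern GPUs. Second, we propose multi-granular tile binning with quad merging, which opportunistically blends fragments in shader cores before passing them to fixed-function blending units. Our evaluation shows that VR-Pipe greatly improves rendering performance, achieving up to a 2.78x speedup over the conventional graphics pipeline with negligible hardware overhead.|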
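|**2025-02-18**|[GS-QA: Comprehensive Quality Assessment Benchmark for Gaussian Splatting View Synthesis](http://arxiv.org/abs/2502.13196)|null|Gaussian Splatting (GS) offers a promising alternative to NeRF for real-time 3D scene rendering. Using a set of 3D Gaussians to represent complex geometry and appearance, GS achieves faster rendering and lower memory consumption than the neural network approach used in NeRF. However, quality assessment of GS-generated static content has not yet been explored in depth. This paper describes a subjective quality assessment study that evaluates the quality of videos synthesized with several state-of-the-art static GS methods. The methods are applied to diverse visual scenes, covering both 360-degree and forward-facing (FF) camera trajectories. Moreover, the performance of 18 objective quality metrics is analyzed using the scores from the subjective study, providing insights into their strengths, limitations, and alignment with human perception. All videos and scores are made publicly available, providing a comprehensive database that can be used as a benchmark for GS view synthesis and objective quality metrics.|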
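|**2025-02-17**|[3D Gaussian Inpainting with Depth-Guided Cross-View Consistency](http://arxiv.org/abs/2502.11801)|null|When performing 3D inpainting with novel-view rendering methods such as Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS), achieving texture and geometry consistency across camera views has been a challenge. In this paper, we propose a framework of 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency (3DGIC) for cross-view consistent 3D inpainting. Guided by the depth information rendered from each training view, our 3DGIC exploits background pixels visible across different views to update the inpainting mask, allowing us to refine the 3DGS for inpainting purposes. Through extensive experiments on benchmark datasets, we confirm that our 3DGIC outperforms current state-of-the-art 3D inpainting methods both quantitatively and qualitatively.|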
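|**2025-02-11**|[Flow Distillation Sampling: Regularizing 3D Gaussians with Pre-trained Matching Priors](http://arxiv.org/abs/2502.07615)|null|3D Gaussian Splatting (3DGS) has achieved excellent rendering quality with fast training and rendering speed. However, its optimization process lacks explicit geometric constraints, leading to suboptimal geometric reconstruction in regions with sparse or no observed input views. In this work, we try to mitigate the issue by incorporating a pre-trained matching prior into the 3DGS optimization process. We introduce Flow Distillation Sampling (FDS), a technique that leverages pre-trained geometric knowledge to improve the accuracy of the Gaussian radiance field. Our method employs a strategic sampling technique that targets unobserved views adjacent to the input views, using the optical flow computed by the matching model (Prior Flow) to guide the flow analytically calculated from the 3DGS geometry (Radiance Flow). Comprehensive experiments in depth rendering, mesh reconstruction, and novel view synthesis showcase the significant advantages of FDS over state-of-the-art methods. Additionally, our interpretive experiments and analysis aim to shed light on the effects of FDS on geometric accuracy and rendering quality, giving readers insight into its performance. Project page: https://nju-3dv.github.io/projects/fds|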
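|**2025-02-07**|[SC-OmniGS: Self-Calibrating Omnidirectional Gaussian Splatting](http://arxiv.org/abs/2502.04734)|null|360-degree cameras streamline data collection for radiance field 3D reconstruction by capturing comprehensive scene data. However, traditional radiance field methods do not address the specific challenges inherent to 360-degree images. We present SC-OmniGS, a novel self-calibrating omnidirectional Gaussian splatting system for fast and accurate omnidirectional radiance field reconstruction using 360-degree images. Rather than converting 360-degree images to cube maps and performing perspective image calibration, we treat 360-degree images as a whole sphere and derive a mathematical framework that enables direct omnidirectional camera pose calibration together with 3D Gaussian optimization. Furthermore, we introduce a differentiable omnidirectional camera model to rectify the distortion of real-world data and improve performance. Overall, the omnidirectional camera intrinsic model, extrinsic poses, and 3D Gaussians are jointly optimized by minimizing a weighted spherical photometric loss. Extensive experiments demonstrate that our proposed SC-OmniGS is able to recover a high-quality radiance field from noisy camera poses, or even with no pose prior, in challenging scenarios characterized by wide baselines and non-object-centric configurations. The noticeable performance gain on a real-world dataset captured by consumer-grade omnidirectional cameras verifies the effectiveness of our general omnidirectional camera model in reducing the distortion of 360-degree images.|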
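|**2025-01-27**|[Deformable Beta Splatting](http://arxiv.org/abs/2501.18630)|null|3D Gaussian Splatting (3DGS) has advanced radiance field reconstruction by enabling real-time rendering. However, its reliance on Gaussian kernels for geometry and low-order Spherical Harmonics (SH) for color encoding limits its ability to capture complex geometries and diverse colors. We introduce Deformable Beta Splatting (DBS), a deformable and compact approach that enhances both geometry and color representation. DBS replaces Gaussian kernels with deformable Beta kernels, which offer bounded support and adaptive frequency control to capture fine geometric details with higher fidelity while achieving better memory efficiency. In addition, we extend the Beta kernel to color encoding, which facilitates improved representation of diffuse and specular components and yields superior results compared to SH-based methods. Furthermore, unlike prior densification techniques that depend on Gaussian properties, we mathematically prove that adjusting regularized opacity alone ensures distribution-preserved Markov chain Monte Carlo (MCMC), independent of the splatting kernel type. Experimental results demonstrate that DBS achieves state-of-the-art visual quality while using only 45% of the parameters and rendering 1.5x faster than 3DGS-based methods. Notably, for the first time, splatting-based methods outperform state-of-the-art Neural Radiance Fields, highlighting the superior performance and efficiency of DBS for real-time radiance field rendering.|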
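|**2025-01-23**|[VIGS SLAM: IMU-based Large-Scale 3D Gaussian Splatting SLAM](http://arxiv.org/abs/2501.13402)|null|Recently, map representations based on radiance fields, such as 3D Gaussian Splatting and NeRF, have attracted attention for their realistic depictions and have prompted attempts to combine them with SLAM. While these approaches can build highly realistic maps, large-scale SLAM remains a challenge because they require a large number of Gaussian images for mapping and adjacent images as keyframes for tracking. We propose a novel 3D Gaussian Splatting SLAM method, VIGS SLAM, that uses sensor fusion of RGB-D and IMU sensors to handle large-scale indoor environments. To reduce the computational load of 3DGS-based tracking, we adopt an ICP-based tracking framework that incorporates IMU preintegration to provide a good initial guess for accurate pose estimation. Our method is the first to propose that Gaussian Splatting-based SLAM can be performed effectively in large-scale environments by integrating IMU sensor measurements. This proposal not only enhances the performance of Gaussian Splatting SLAM beyond room-scale scenarios but also achieves SLAM performance comparable to state-of-the-art methods in large-scale indoor environments.|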
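|**2024-12-27**|[Learning Radiance Fields from a Single Snapshot Compressive Image](http://arxiv.org/abs/2412.19483)|null|In this paper, we explore the potential of the Snapshot Compressive Imaging (SCI) technique for recovering the underlying 3D scene structure from a single temporally compressed image. SCI is a cost-effective method that records high-dimensional data, such as hyperspectral or temporal information, into a single image using low-cost 2D imaging sensors. To achieve this, a series of specially designed 2D masks is usually employed, which reduces storage and transmission requirements and offers potential privacy protection. Inspired by this, we take one step further and recover the encoded 3D scene information by leveraging the powerful 3D scene representation capabilities of Neural Radiance Fields (NeRF). Specifically, we propose SCINeRF, in which we formulate the physical imaging process of SCI as part of NeRF training, thereby exploiting its impressive performance in capturing complex scene structures. In addition, we further integrate the popular 3D Gaussian Splatting (3DGS) framework and propose SCISplat to improve 3D scene reconstruction quality and training/rendering speed by explicitly optimizing point clouds into 3D Gaussian representations. To assess the effectiveness of our method, we conduct extensive evaluations using both synthetic data and real data captured by our SCI system. Experimental results demonstrate that our proposed approach surpasses state-of-the-art methods in image reconstruction and novel view synthesis. Moreover, our method can render high-frame-rate, multi-view-consistent images in real time by leveraging SCI and the rendering capabilities of 3DGS. Code will be available at https://github.com/WU-CVGL/SCISplat.|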
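|**2024-12-18**|[GraphAvatar: Compact Head Avatars with GNN-Generated 3D Gaussians](http://arxiv.org/abs/2412.13983)|**[link](https://github.com/ucwxb/graphavatar)**|Rendering photorealistic head avatars from arbitrary viewpoints is crucial for various applications such as virtual reality. Although previous methods based on Neural Radiance Fields (NeRF) can achieve impressive results, they lack fidelity and efficiency. Recent methods using 3D Gaussian Splatting (3DGS) have improved rendering quality and real-time performance but still require significant storage overhead. In this paper, we introduce GraphAvatar, a method that uses Graph Neural Networks (GNN) to generate the 3D Gaussians for a head avatar. Specifically, GraphAvatar trains a geometric GNN and an appearance GNN to generate the attributes of the 3D Gaussians from the tracked mesh. Our method can therefore store the GNN models instead of the 3D Gaussians, significantly reducing the storage overhead to just 10MB. To reduce the impact of face-tracking errors, we also present a novel graph-guided optimization module that refines face-tracking parameters during training. Finally, we introduce a 3D-aware enhancer for post-processing to improve rendering quality. We conduct comprehensive experiments to demonstrate the advantages of GraphAvatar, which surpasses existing methods in visual fidelity and storage consumption. The ablation study sheds light on the trade-offs between rendering quality and model size. The code will be released at https://github.com/ucwxb/GraphAvatar.|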
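|**2025-02-03**|[CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image](http://arxiv.org/abs/2412.12906)|null|Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. In a single forward pass, these approaches create a 3D radiance field parameterized by per-pixel 3D Gaussian primitives from just a few images. However, unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction from a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints of monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement the insufficient information in a single image. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction that goes beyond relying solely on visual cues. Moreover, we advocate using spatial guidance from 3D point features for comprehensive geometric understanding under single-view settings. With 3D priors, image features can capture rich structural information for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate that CATSplat achieves state-of-the-art performance in single-view 3D scene reconstruction and synthesizes high-quality novel views.|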
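|**2024-12-12**|[LIVE-GS: LLM Powers Interactive VR by Enhancing Gaussian Splatting](http://arxiv.org/abs/2412.09176)|null|Recently, radiance field rendering techniques such as 3D Gaussian Splatting (3DGS) have shown immense potential for VR content creation thanks to their high-quality rendering and efficient production process. However, due to a lack of scene understanding, existing physics-based interactive systems for 3DGS can only perform simple, unrealistic simulations or require substantial user input in complex scenes. In this paper, we propose LIVE-GS, a highly realistic interactive VR system powered by a large language model (LLM). After object-aware GS reconstruction, we use GPT-4o to analyze the physical properties of objects in the scene, which then guide physical simulation consistent with real phenomena. We also design a GPT-assisted GS inpainting module to fill the unseen areas occluded by manipulated objects. To segment Gaussian kernels precisely, we propose a feature-mask segmentation strategy. To enable rich interaction, we further propose a unified interpolation method based on Position-Based Dynamics (PBD), building a computationally efficient physical simulation framework that supports various physical forms such as rigid bodies, soft bodies, and granular materials. Our experimental results show that, with the help of the LLM's scene understanding and enhancement, our VR system can support complex and realistic interactions without additional manual design and annotation.|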
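|**2024-12-10**|[EventSplat: 3D Gaussian Splatting from Moving Event Cameras for Real-time Rendering](http://arxiv.org/abs/2412.07293)|null|We introduce a method for using event camera data in novel view synthesis via Gaussian Splatting. Event cameras offer exceptional temporal resolution and a high dynamic range. Leveraging these capabilities allows us to effectively address the challenge of novel view synthesis under fast camera motion. To initialize the optimization process, our approach uses prior knowledge encoded in an event-to-video model. We also use spline interpolation to obtain high-quality poses along the event camera trajectory. This enhances reconstruction quality for fast-moving cameras while overcoming the computational limitations traditionally associated with event-based Neural Radiance Field (NeRF) methods. Our experimental evaluation demonstrates that our results achieve higher visual fidelity and better performance than existing event-based NeRF approaches while rendering an order of magnitude faster.|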
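|**2024-12-10**|[Extrapolated Urban View Synthesis Benchmark](http://arxiv.org/abs/2412.05256)|**[link](https://github.com/ai4ce/EUVS-Benchmark)**|Photorealistic simulators are essential for the training and evaluation of vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a crucial capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-time speeds and have been widely used to model large-scale driving scenes. However, their performance is commonly evaluated in an interpolated setup with highly correlated training and test views. In contrast, extrapolation, where test views deviate largely from training views, remains underexplored, limiting progress in generalizable simulation technology. To address this gap, we leverage publicly available AV datasets with multiple traversals, multiple vehicles, and multiple cameras to build the first Extrapolated Urban View Synthesis (EUVS) benchmark. Meanwhile, we conduct quantitative and qualitative evaluations of state-of-the-art Gaussian Splatting methods across different difficulty levels. Our results show that Gaussian Splatting is prone to overfitting to training views. Moreover, incorporating diffusion priors and improving geometry cannot fundamentally improve NVS under large view changes, highlighting the need for more robust approaches and large-scale training. We have released our data to help advance self-driving and urban robotics simulation technology.|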
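|**2024-12-04**|[NeRF and Gaussian Splatting SLAM in the Wild](http://arxiv.org/abs/2412.03263)|**[link](https://github.com/iis-esslingen/nerf-3dgs-benchmark)**|Navigating outdoor environments with visual Simultaneous Localization and Mapping (SLAM) systems poses significant challenges due to dynamic scenes, lighting variations, and seasonal changes, requiring robust solutions. While traditional SLAM methods struggle with adaptability, deep learning-based approaches and emerging SLAM methods based on Neural Radiance Fields and Gaussian Splatting offer promising alternatives. However, these methods have primarily been evaluated in controlled indoor environments with stable conditions, leaving a gap in understanding their performance in unstructured and variable outdoor settings. This study addresses this gap by evaluating these methods in natural outdoor environments, focusing on camera tracking accuracy, robustness to environmental factors, and computational efficiency, and highlighting their distinct trade-offs. Extensive evaluations demonstrate that neural SLAM methods achieve superior robustness, particularly under challenging conditions such as low light, but at a high computational cost. At the same time, traditional methods perform best across seasons but are highly sensitive to variations in lighting conditions. The code of the benchmark is publicly available at https://github.com/iis-esslingen/nerf-3dgs-benchmark.|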
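|**2024-12-03**|[Gaussian Splatting Under Attack: Investigating Adversarial Noise in 3D Objects](http://arxiv.org/abs/2412.02803)|null|3D Gaussian Splatting has advanced radiance field reconstruction, enabling high-quality view synthesis and fast rendering in 3D modeling. While adversarial attacks on object detection models are well studied for 2D images, their impact on 3D models remains underexplored. This work introduces the Masked Iterative Fast Gradient Sign Method (M-IFGSM), designed to generate adversarial noise targeting the CLIP vision-language model. M-IFGSM specifically alters the object of interest by focusing perturbations on masked regions, degrading the performance of CLIP's zero-shot object detection capability when applied to 3D models. Using eight objects from the Common Objects 3D (CO3D) dataset, we demonstrate that our method effectively reduces the accuracy and confidence of the model, with adversarial noise being nearly imperceptible to human observers. The top-1 accuracy on original model renders drops from 95.4% to 12.5% for training images and from 91.2% to 35.4% for test images, with confidence levels reflecting this shift from correct to incorrect classification, underscoring the risks that adversarial attacks pose to 3D models in applications such as autonomous driving, robotics, and surveillance. The significance of this research lies in its potential to expose vulnerabilities in modern 3D vision models, including radiance fields, prompting the development of more robust defenses and security measures in critical real-world applications.|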
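|**2024-11-27**|[SmileSplat: Generalizable Gaussian Splats for Unconstrained Sparse Images](http://arxiv.org/abs/2411.18072)|null|Sparse multi-view images can be used to learn to predict explicit radiance fields via generalizable Gaussian Splatting approaches, which have broader prospects in real-world applications because they do not require ground-truth camera parameters as input. In this paper, a novel generalizable Gaussian Splatting method, SmileSplat, is proposed to reconstruct pixel-aligned Gaussian surfels from unconstrained sparse multi-view images alone. First, Gaussian surfels are predicted by a multi-head Gaussian regression decoder; they can be represented with fewer degrees of freedom while offering better multi-view consistency. Furthermore, the normal vectors of the Gaussian surfels are enhanced using high-quality normal priors. Second, the Gaussians and the camera parameters (both extrinsic and intrinsic) are optimized by the proposed Bundle-Adjusting Gaussian Splatting module to obtain high-quality Gaussian radiance fields for novel view synthesis tasks. Extensive experiments on novel view rendering and depth map prediction are conducted on public datasets, demonstrating that the proposed method achieves state-of-the-art performance in various 3D vision tasks. More information can be found on our project page (https://yanyan-li.github.io/project/gs/smilesplat).|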
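|**2024-11-26**|[3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes](http://arxiv.org/abs/2411.14974)|**[link](https://github.com/convexsplatting/convex-splatting)**|Recent advances in radiance field reconstruction, such as 3D Gaussian Splatting (3DGS), have achieved high-quality novel view synthesis and fast rendering by representing scenes with compositions of Gaussian primitives. However, 3D Gaussians present several limitations for scene reconstruction. Accurately capturing hard edges is challenging without significantly increasing the number of Gaussians, which creates a large memory footprint. Moreover, Gaussians struggle to represent flat surfaces because they are diffused in space. Without hand-crafted regularizers, they tend to disperse irregularly around the actual surface. To circumvent these issues, we introduce a novel method, named 3D Convex Splatting (3DCS), which leverages 3D smooth convexes as primitives for modeling geometrically meaningful radiance fields from multi-view images. Smooth convex shapes offer greater flexibility than Gaussians, allowing a better representation of 3D scenes with hard edges and dense volumes using fewer primitives. Powered by our efficient CUDA-based rasterizer, 3DCS achieves superior performance over 3DGS on benchmarks such as Mip-NeRF360, Tanks and Temples, and Deep Blending. Specifically, our method attains an improvement of up to 0.81 in PSNR and 0.026 in LPIPS compared to 3DGS while maintaining high rendering speeds and reducing the number of required primitives. Our results highlight the potential of 3D Convex Splatting to become the new standard for high-quality scene reconstruction and novel view synthesis. Project page: convexsplatting.github.io.|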
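|**2024-11-20**|[GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting](http://arxiv.org/abs/2411.12981)|null|Gaze estimation encounters generalization challenges when dealing with out-of-distribution data. To address this problem, recent methods use Neural Radiance Fields (NeRF) to generate augmented data. However, existing NeRF-based methods are computationally expensive and lack facial details. 3D Gaussian Splatting (3DGS) has become the prevailing representation of neural fields. While 3DGS has been extensively examined for head avatars, it faces challenges with accurate gaze control and generalization across different subjects. In this work, we propose GazeGaussian, a high-fidelity gaze redirection method that uses a two-stream 3DGS model to represent the face and eye regions separately. By leveraging the unstructured nature of 3DGS, we develop a novel eye representation for rigid eye rotation based on the target gaze direction. To enhance synthesis generalization across various subjects, we integrate an expression-conditional module to guide the neural renderer. Comprehensive experiments show that GazeGaussian outperforms existing methods in rendering speed, gaze redirection accuracy, and facial synthesis across multiple datasets. We also demonstrate that existing gaze estimation methods can leverage GazeGaussian to improve their generalization performance. The code will be available at https://ucwxb.github.io/GazeGaussian/.|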
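|**2024-11-15**|[GSEditPro: 3D Gaussian Splatting Editing with Attention-based Progressive Localization](http://arxiv.org/abs/2411.10033)|null|With the emergence of large-scale Text-to-Image (T2I) models and implicit 3D representations such as Neural Radiance Fields (NeRF), many NeRF-based text-driven generative editing methods have appeared. However, the implicit encoding of geometric and textural information poses challenges for accurately locating and controlling objects during editing. Recently, significant advances have been made in editing methods for 3D Gaussian Splatting, a real-time rendering technology that relies on an explicit representation. However, these methods still suffer from issues including inaccurate localization and limited manipulation over editing. To tackle these challenges, we propose GSEditPro, a novel 3D scene editing framework that allows users to perform various creative and precise edits using text prompts only. Leveraging the explicit nature of the 3D Gaussian distribution, we introduce an attention-based progressive localization module that adds semantic labels to each Gaussian during rendering. This enables precise localization of editing areas by classifying Gaussians based on their relevance to the editing prompts, as derived from the cross-attention layers of the T2I model. Furthermore, we present an innovative editing optimization method based on 3D Gaussian Splatting that obtains stable and refined editing results through the guidance of Score Distillation Sampling and pseudo ground truth. We demonstrate the efficacy of our method through extensive experiments.|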
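|**2024-11-13**|[Biomass phenotyping of oilseed rape through UAV multi-view oblique imaging with 3DGS and SAM model](http://arxiv.org/abs/2411.08453)|null|Biomass estimation of oilseed rape is crucial for optimizing crop productivity and breeding strategies. While imaging based on unmanned aerial vehicles (UAVs) has advanced high-throughput phenotyping, current methods often rely on orthophoto images, which struggle with overlapping leaves and incomplete structural information in complex field environments. This study integrates 3D Gaussian Splatting (3DGS) with the Segment Anything Model (SAM) for precise 3D reconstruction and biomass estimation of oilseed rape. UAV multi-view oblique images from 36 angles were used to perform 3D reconstruction, with the SAM module enhancing point cloud segmentation. The segmented point clouds were then converted into point cloud volumes, which were fitted to ground-measured biomass using linear regression. The results showed that 3DGS (7k and 30k iterations) provided high accuracy, with peak signal-to-noise ratios (PSNR) of 27.43 and 29.53 and training times of 7 and 49 minutes, respectively. This performance exceeded that of structure from motion (SfM) and mipmap Neural Radiance Fields (Mip-NeRF), demonstrating superior efficiency. The SAM module achieved high segmentation accuracy, with a mean intersection over union (mIoU) of 0.961 and an F1-score of 0.980. Additionally, a comparison of biomass extraction models found the point cloud volume model to be the most accurate, with a coefficient of determination (R²) of 0.976, a root mean square error (RMSE) of 2.92 g/plant, and a mean absolute percentage error (MAPE) of 6.81%, outperforming both the plot crop volume and individual crop volume models. This study highlights the potential of combining 3DGS with multi-view UAV imaging for improved biomass phenotyping.|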
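|**2024-11-13**|[MBA-SLAM: Motion Blur Aware Dense Visual SLAM with Radiance Fields Representation](http://arxiv.org/abs/2411.08279)|**[link](https://github.com/wu-cvgl/mba-slam)**|Emerging 3D scene representations, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have demonstrated their effectiveness in Simultaneous Localization and Mapping (SLAM) for photorealistic rendering, particularly when using high-quality video sequences as input. However, existing methods struggle with motion-blurred frames, which are common in real-world scenarios such as low-light or long-exposure conditions. This often results in a significant reduction in both camera localization accuracy and map reconstruction quality. To address this challenge, we propose a dense visual SLAM pipeline (i.e., MBA-SLAM) to handle severely motion-blurred inputs. Our approach integrates an efficient motion-blur-aware tracker with a mapper based on either Neural Radiance Fields or Gaussian Splatting. By accurately modeling the physical image formation process of motion-blurred images, our method simultaneously learns the 3D scene representation and estimates the camera's local trajectory during the exposure time, enabling proactive compensation for the motion blur caused by camera movement. In our experiments, we demonstrate that MBA-SLAM surpasses previous state-of-the-art methods in both camera localization and map reconstruction, showing superior performance across a range of datasets, including synthetic and real datasets featuring sharp images as well as images affected by motion blur, highlighting the versatility and robustness of our approach. Code is available at https://github.com/WU-CVGL/MBA-SLAM.|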
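|**2024-11-06**|[3DGS-CD: 3D Gaussian Splatting-based Change Detection for Physical Object Rearrangement](http://arxiv.org/abs/2411.03706)|**[link](https://github.com/520xyxyzq/3dgs-cd)**|We present 3DGS-CD, the first method based on 3D Gaussian Splatting (3DGS) for detecting physical object rearrangements in 3D scenes. Our approach estimates 3D object-level changes by comparing two sets of unaligned images taken at different times. Leveraging the novel view rendering of 3DGS and the zero-shot segmentation capabilities of EfficientSAM, we detect 2D object-level changes, which are then associated and fused across views to estimate 3D changes. Our method can detect changes in cluttered environments using as few as a single new image, within as little as 18 seconds, without relying on depth input, user instructions, object classes, or object models: an object is recognized simply if it has been rearranged. Our approach is evaluated on both public and self-collected real-world datasets, achieving up to 14% higher accuracy and three orders of magnitude faster performance compared to the state-of-the-art radiance-field-based change detection method. This significant performance boost enables a broad range of downstream applications, of which we highlight three key use cases: object reconstruction, robot workspace reset, and 3DGS model update. Our code and data will be made available at https://github.com/520xyxyzq/3DGS-CD.|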
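|**2024-11-06**|[Structure Consistent Gaussian Splatting with Matching Prior for Few-shot Novel View Synthesis](http://arxiv.org/abs/2411.03637)|**[link](https://github.com/prstrive/scgaussian)**|Despite the substantial progress of novel view synthesis, existing methods, whether based on Neural Radiance Fields (NeRF) or the more recent 3D Gaussian Splatting (3DGS), suffer significant degradation when the input becomes sparse. Numerous efforts have been made to alleviate this problem, but they still struggle to synthesize satisfactory results efficiently, especially in large scenes. In this paper, we propose SCGaussian, a Structure Consistent Gaussian Splatting method that uses matching priors to learn a 3D-consistent scene structure. Considering the high interdependence of Gaussian attributes, we optimize the scene structure in two respects: the rendering geometry and, more importantly, the position of Gaussian primitives, which is hard to constrain directly in vanilla 3DGS due to its unstructured nature. To achieve this, we present a hybrid Gaussian representation. Besides ordinary unstructured Gaussian primitives, our model also includes ray-based Gaussian primitives that are bound to matching rays and whose position optimization is restricted to be along the ray. We can thus use the matching correspondence to directly enforce that the positions of these Gaussian primitives converge to the surface points where the rays intersect. Extensive experiments on forward-facing, surrounding, and complex large scenes show the effectiveness of our approach, with state-of-the-art performance and high efficiency. Code is available at https://github.com/prstrive/SCGaussian.|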
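|**2024-11-05**|[HFGaussian: Learning Generalizable Gaussian Human with Integrated Human Features](http://arxiv.org/abs/2411.03086)|null|Recent radiance-field-based rendering techniques show remarkable results in 3D scene representation, where Gaussian splatting-based techniques have emerged as the state of the art due to their quality and efficiency. Gaussian splatting is widely used for various applications, including 3D human representation. However, previous 3D Gaussian splatting methods for human representation either use parametric body models as additional information or fail to provide any underlying structure, such as the human biomechanical features that are essential for different applications. In this paper, we present a novel approach called HFGaussian that can estimate novel views and human features, such as the 3D skeleton, 3D keypoints, and dense pose, from sparse input images in real time at 25 FPS. The proposed method leverages a generalizable splatting technique to represent the human subject and its associated features, enabling efficient and generalizable reconstruction. By combining a pose regression network and a feature splatting technique with Gaussian splatting, HFGaussian demonstrates improved capabilities over existing 3D human methods, showcasing the potential of 3D human representations with integrated biomechanics. We thoroughly compare our HFGaussian method against the latest state-of-the-art techniques in human Gaussian splatting and pose estimation, demonstrating its real-time, state-of-the-art performance.|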
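|**2024-11-05**|[FewViewGS: Gaussian Splatting with Few View Matching and Multi-stage Training](http://arxiv.org/abs/2411.02229)|null|The field of novel view synthesis from images has seen rapid advancements with the introduction of Neural Radiance Fields (NeRF) and, more recently, 3D Gaussian Splatting. Gaussian Splatting has been widely adopted due to its efficiency and ability to render novel views accurately. While Gaussian Splatting performs well when a sufficient number of training images is available, its unstructured explicit representation tends to overfit in scenarios with sparse input images, resulting in poor rendering performance. To address this, we present a 3D Gaussian-based novel view synthesis method for sparse input images that can accurately render the scene from viewpoints not covered by the training images. We propose a multi-stage training scheme with matching-based consistency constraints imposed on the novel views, without relying on pre-trained depth estimation or diffusion models. This is achieved by using matches between the available training images to supervise the generation of novel views sampled between the training frames, with color, geometry, and semantic losses. In addition, we introduce a locality-preserving regularization for 3D Gaussians that removes rendering artifacts by preserving the local color structure of the scene. Evaluation on synthetic and real-world datasets demonstrates competitive or superior performance of our method in few-shot novel view synthesis compared to existing state-of-the-art methods.|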
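|**2024-10-31**|[GaussianMarker: Uncertainty-Aware Copyright Protection of 3D Gaussian Splatting](http://arxiv.org/abs/2410.23718)|null|3D Gaussian Splatting (3DGS) has become a crucial method for acquiring 3D assets. To protect the copyright of these assets, digital watermarking techniques can be applied to embed ownership information discreetly within 3DGS models. However, existing watermarking methods for meshes, point clouds, and implicit radiance fields cannot be directly applied to 3DGS models, because 3DGS models use explicit 3D Gaussians with distinct structures and do not rely on neural networks. Naively embedding the watermark in a pre-trained 3DGS can cause obvious distortion in rendered images. In our work, we propose an uncertainty-based method that constrains the perturbation of model parameters to achieve invisible watermarking for 3DGS. At the message decoding stage, the copyright messages can be reliably extracted from both the 3D Gaussians and the 2D rendered images, even under various forms of 3D and 2D distortions. We conduct extensive experiments on the Blender, LLFF, and MipNeRF-360 datasets to validate the effectiveness of our proposed method, demonstrating state-of-the-art performance in both message decoding accuracy and view synthesis quality.|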
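|**2024-10-23**|[VR-Splatting: Foveated Radiance Field Rendering via 3D Gaussian Splatting and Neural Points](http://arxiv.org/abs/2410.17932)|null|Recent advances in novel view synthesis (NVS), particularly Neural Radiance Fields (NeRF) and Gaussian Splatting (3DGS), have demonstrated impressive results in photorealistic scene rendering. These techniques hold great potential for applications such as virtual tourism and teleportation, where immersive realism is crucial. However, the high performance demands of virtual reality (VR) systems make it challenging to directly use even fast-to-render scene representations like 3DGS, mainly because of latency and computational constraints. In this paper, we propose foveated rendering as a promising solution to these obstacles. We analyze state-of-the-art NVS methods with respect to their rendering performance and their compatibility with the human visual system. Our approach introduces a novel foveated rendering method for virtual reality that leverages the sharp, detailed output of neural point rendering for the foveal region and fuses it with a smooth 3DGS rendering for peripheral vision. Our evaluation confirms that perceived sharpness and detail richness increase with our approach compared to a standard VR-ready 3DGS configuration. Our system meets the performance requirements for real-time VR interaction, ultimately enhancing the user's immersive experience. Project page: https://lfranke.github.io/vr_splatting|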
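|**2024-10-18**|[GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting](http://arxiv.org/abs/2410.17084)|null|In this paper, we introduce GS-LIVM, a real-time photo-realistic LiDAR-Inertial-Visual mapping framework with Gaussian Splatting tailored for outdoor scenes. Compared to existing methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), our approach enables real-time photo-realistic mapping while ensuring high-quality image rendering in large-scale unbounded outdoor environments. Gaussian Process Regression (GPR) is employed to mitigate the issues arising from sparse and unevenly distributed LiDAR observations. The voxel-based 3D Gaussian map representation facilitates real-time dense mapping in large outdoor environments, with acceleration provided by custom CUDA kernels. Moreover, the overall framework is designed in a covariance-centered manner, where the estimated covariance is used to initialize the scale and rotation of the 3D Gaussians as well as to update the parameters of the GPR. We evaluate our algorithm on several outdoor datasets, and the results demonstrate that our method achieves state-of-the-art performance in mapping efficiency and rendering quality. The source code is available on GitHub.|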
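|**2024-10-22**|[E-3DGS: Gaussian Splatting with Exposure and Motion Events](http://arxiv.org/abs/2410.16995)|**[link](https://github.com/masterhow/e-3dgs)**|Estimating Neural Radiance Fields (NeRFs) from images captured under optimal conditions has been extensively explored in the vision community. However, robotic applications often face challenges such as motion blur, insufficient illumination, and high computational overhead, which adversely affect downstream tasks like navigation, inspection, and scene visualization. To address these challenges, we propose E-3DGS, a novel event-based approach that partitions events into motion events (from camera or object movement) and exposure events (from camera exposure), using the former to handle fast-motion scenes and the latter to reconstruct grayscale images for high-quality training and optimization of event-based 3D Gaussian Splatting (3DGS). We introduce a novel integration of 3DGS with exposure events for high-quality reconstruction of explicit scene representations. Our versatile framework can operate on motion events alone for 3D reconstruction, enhance quality using exposure events, or adopt a hybrid mode that balances quality and efficiency by optimizing with initial exposure events followed by high-speed motion events. We also introduce EME-3D, a real-world 3D dataset with exposure events, motion events, camera calibration parameters, and sparse point clouds. Our method is faster and delivers better reconstruction quality than event-based NeRF, while being more cost-effective than NeRF methods that combine event and RGB data because it uses a single event sensor. By combining motion and exposure events, E-3DGS sets a new benchmark for event-based 3D reconstruction, with robust performance in challenging conditions and lower hardware demands. The source code and dataset will be available at https://github.com/MasterHow/E-3DGS.|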
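|**2024-10-18**|[DaRePlane: Direction-aware Representations for Dynamic Scene Reconstruction](http://arxiv.org/abs/2410.14169)|null|Numerous recent approaches to modeling and re-rendering dynamic scenes leverage plane-based explicit representations, addressing the slow training times associated with models such as Neural Radiance Fields (NeRF) and Gaussian Splatting (GS). However, merely decomposing 4D dynamic scenes into multiple 2D plane-based representations is insufficient for high-fidelity re-rendering of scenes with complex motions. In response, we present DaRePlane, a novel direction-aware representation approach that captures scene dynamics from six different directions. This learned representation undergoes an inverse dual-tree complex wavelet transform (DTCWT) to recover plane-based information. Within NeRF pipelines, DaRePlane computes features for each space-time point by fusing vectors from these recovered planes, which are then passed to a tiny MLP for color regression. When applied to Gaussian Splatting, DaRePlane computes the features of Gaussian points, followed by a tiny multi-head MLP for spatial-time deformation prediction. Notably, to address the redundancy introduced by the six real and six imaginary direction-aware wavelet coefficients, we introduce a trainable masking approach that mitigates storage issues without significant performance decline. To demonstrate the generality and efficiency of DaRePlane, we test it on both regular and surgical dynamic scenes, for both NeRF and GS systems. Extensive experiments show that DaRePlane yields state-of-the-art performance in novel view synthesis for various complex dynamic scenes.|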
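|**2024-10-16**|[3D Gaussian Splatting in Robotics: A Survey](http://arxiv.org/abs/2410.12262)|**[link](https://github.com/zstsandy/awesome-3d-gaussian-splatting-in-robotics)**|Dense 3D representation of the environment has been a long-term goal in the robotics field. While the previous coordinate-based implicit Neural Radiance Fields (NeRF) representation has been popular, the recently emerged 3D Gaussian Splatting (3DGS) has demonstrated remarkable potential in its explicit radiance field representation. By leveraging 3D Gaussian primitives for explicit scene representation and enabling differentiable rendering, 3DGS has shown significant advantages over other radiance fields in real-time rendering and photorealistic performance, which is beneficial for robotic applications. In this survey, we provide a comprehensive understanding of 3DGS in the field of robotics. We divide our discussion of the related works into two main categories: the application of 3DGS and the advancements in 3DGS techniques. In the application section, we explore how 3DGS has been used in various robotics tasks from the perspectives of scene understanding and interaction. The section on advancements in 3DGS techniques focuses on improvements to the properties of 3DGS itself in terms of adaptability and efficiency, aiming to enhance its performance in robotics. We then summarize the most commonly used datasets and evaluation metrics in robotics. Finally, we identify the challenges and limitations of current 3DGS methods and discuss the future development of 3DGS in robotics.|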
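|**2024-10-15**|[MCGS: Multiview Consistency Enhancement for Sparse-View 3D Gaussian Radiance Fields](http://arxiv.org/abs/2410.11394)|null|Radiance fields represented by 3D Gaussians excel at synthesizing novel views, offering both high training efficiency and fast rendering. However, with sparse input views, the lack of multi-view consistency constraints results in poorly initialized point clouds and unreliable heuristics for optimization and densification, leading to suboptimal performance. Existing methods often incorporate depth priors from dense estimation networks but overlook the inherent multi-view consistency in the input images. Additionally, they rely on initialization based on multi-view stereo (MVS), which limits the efficiency of scene representation. To overcome these challenges, we propose a view synthesis framework based on 3D Gaussian Splatting, named MCGS, enabling photorealistic scene reconstruction from sparse input views. The key innovations of MCGS in enhancing multi-view consistency are as follows: i) We introduce an initialization method that leverages a sparse matcher combined with a random filling strategy, yielding a compact yet sufficient set of initial points. This approach enhances the initial geometry prior and promotes efficient scene representation. ii) We develop a multi-view consistency-guided progressive pruning strategy that refines the Gaussian field by strengthening consistency and eliminating low-contribution Gaussians. These modular, plug-and-play strategies enhance robustness to sparse input views, accelerate rendering, and reduce memory consumption, making MCGS a practical and efficient framework for 3D Gaussian Splatting.|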
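|**2024-10-14**|[Few-shot Novel View Synthesis using Depth Aware 3D Gaussian Splatting](http://arxiv.org/abs/2410.11080)|**[link](https://github.com/raja-kumar/depth-aware-3dgs)**|3D Gaussian splatting has surpassed neural radiance field methods in novel view synthesis, achieving lower computational costs and real-time high-quality rendering. Although it produces high-quality renderings when many input views are available, its performance drops significantly when only a few views are available. In this work, we address this problem by proposing a depth-aware Gaussian splatting method for few-shot novel view synthesis. We use monocular depth prediction as a prior, along with a scale-invariant depth loss, to constrain the 3D shape under just a few input views. We also model color using lower-order spherical harmonics to avoid overfitting. Further, we observe that removing splats with low opacity periodically, as done in the original work, leads to a very sparse point cloud and hence a lower-quality rendering. To mitigate this, we retain all the splats, leading to a better reconstruction in a few-view setting. Experimental results show that our method outperforms traditional 3D Gaussian splatting, achieving improvements of 10.5% in peak signal-to-noise ratio, 6% in structural similarity index, and 14.1% in perceptual similarity, thereby validating the effectiveness of our approach. The code will be made available at https://github.com/raja-kumar/depth-aware-3DGS.|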
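|**2024-10-09**|[DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation](http://arxiv.org/abs/2410.06756)|null|Recent advances in 2D/3D generative techniques have facilitated the generation of dynamic 3D objects from monocular videos. Previous methods mainly rely on implicit neural radiance fields (NeRF) or explicit Gaussian Splatting as the underlying representation and struggle to achieve satisfactory spatial-temporal consistency and surface appearance. Drawing inspiration from modern 3D animation pipelines, we introduce DreamMesh4D, a novel framework combining mesh representation with a geometric skinning technique to generate high-quality 4D objects from a monocular video. Instead of using a classical texture map for appearance, we bind Gaussian splats to the triangle faces of the mesh for differentiable optimization of both the texture and the mesh vertices. In particular, DreamMesh4D begins with a coarse mesh obtained through an image-to-3D generation procedure. Sparse points are then uniformly sampled across the mesh surface and used to build a deformation graph that drives the motion of the 3D object, improving computational efficiency and providing additional constraints. At each step, transformations of the sparse control points are predicted using a deformation network, and the mesh vertices as well as the surface Gaussians are deformed via a novel geometric skinning algorithm, a hybrid approach combining LBS (linear blend skinning) and DQS (dual-quaternion skinning) that mitigates the drawbacks associated with both. The static surface Gaussians and mesh vertices, as well as the deformation network, are learned in a two-stage manner via a reference-view photometric loss, a score distillation loss, and other regularizers. Extensive experiments demonstrate the superior performance of our method. Furthermore, our method is compatible with modern graphics pipelines, showcasing its potential in the 3D gaming and film industries.|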
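|**2024-10-08**|[Comparative Analysis of Novel View Synthesis and Photogrammetry for 3D Forest Stand Reconstruction and extraction of individual tree parameters](http://arxiv.org/abs/2410.05772)|null|Accurate and efficient 3D reconstruction of trees is crucial for forest resource assessment and management. Close-Range Photogrammetry (CRP) is commonly used for reconstructing forest scenes but faces challenges such as low efficiency and poor quality. Recently, Novel View Synthesis (NVS) technologies, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have shown promise for 3D plant reconstruction from limited images. However, existing research mainly focuses on small plants in orchards or on individual trees, leaving uncertainty about their application in larger, more complex forest stands. In this study, we collected sequential images of forest plots with varying complexity and performed dense reconstruction using NeRF and 3DGS. The resulting point clouds were compared with those from photogrammetry and laser scanning. Results indicate that NVS methods significantly enhance reconstruction efficiency. Photogrammetry struggles with complex stands, leading to point clouds with excessive canopy noise and incorrectly reconstructed trees, such as duplicated trunks. NeRF, while better suited to canopy regions, may produce errors in ground areas with limited views. The 3DGS method generates sparser point clouds, particularly in trunk areas, affecting the accuracy of diameter at breast height (DBH). All three methods can extract tree height information, with NeRF yielding the highest accuracy; however, photogrammetry remains advantageous for DBH accuracy. These findings suggest that NVS methods have significant potential for 3D reconstruction of forest stands, offering valuable support for complex forest inventory and visualization tasks.|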
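|**2024-09-30**|[RL-GSBridge: 3D Gaussian Splatting Based Real2Sim2Real Method for Robotic Manipulation Learning](http://arxiv.org/abs/2409.20291)|null|Sim-to-Real refers to the process of transferring policies learned in simulation to the real world, which is crucial for achieving practical robotics applications. However, recent Sim2real methods rely either on a large amount of augmented data or on large learning models, which is inefficient for specific tasks. In recent years, radiance-field-based reconstruction methods, especially 3D Gaussian Splatting, have made it possible to reproduce realistic real-world scenarios. To this end, we propose RL-GSBridge, a novel real-to-sim-to-real reinforcement learning framework that introduces a mesh-based 3D Gaussian Splatting method to realize zero-shot sim-to-real transfer for vision-based deep reinforcement learning. We improve the mesh-based 3D GS modeling method by using soft binding constraints, enhancing the rendering quality of mesh models. We then employ a GS editing approach to synchronize rendering with the physics simulator, reflecting the interactions of the physical robot more accurately. Through a series of sim-to-real robotic arm experiments, including grasping and pick-and-place tasks, we demonstrate that RL-GSBridge maintains a satisfactory success rate for real-world task completion during sim-to-real transfer. Furthermore, a series of rendering metrics and visualization results indicate that our proposed mesh-based 3D Gaussian reduces artifacts in unstructured objects, demonstrating more realistic rendering performance.|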
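|**2024-09-25**|[SeaSplat: Representing Underwater Scenes with 3D Gaussian Splatting and a Physically Grounded Image Formation Model](http://arxiv.org/abs/2409.17345)|null|We introduce SeaSplat, a method to enable real-time rendering of underwater scenes leveraging recent advances in 3D radiance fields. Underwater scenes are challenging visual environments, as rendering through a medium such as water introduces both range- and color-dependent effects on image capture. We constrain 3D Gaussian Splatting (3DGS), a recent advance in radiance fields enabling rapid training and real-time rendering of full 3D scenes, with a physically grounded underwater image formation model. Applying SeaSplat to real-world scenes from the SeaThru-NeRF dataset (scenes collected by an underwater vehicle in the US Virgin Islands) and to simulation-degraded real-world scenes, we not only see increased quantitative performance on rendering novel viewpoints of the scene with the medium present, but are also able to recover the underlying true color of the scene and restore renders to be free of the intervening medium. We show that the underwater image formation model helps learn scene structure and yields better depth maps, and that our improvements maintain the significant computational advantages afforded by a 3D Gaussian representation.|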
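|**2024-09-25**|[Let's Make a Splan: Risk-Aware Trajectory Optimization in a Normalized Gaussian Splat](http://arxiv.org/abs/2409.16915)|null|Neural Radiance Fields and Gaussian Splatting have transformed the field of computer vision by enabling photorealistic representation of complex scenes. Despite this success, they have seen only limited use in real-world robotics tasks such as trajectory optimization. Two key factors have contributed to this limited success. First, it is challenging to reason about collisions in radiance models. Second, it is difficult to perform inference on radiance models fast enough for real-time trajectory synthesis. This paper addresses these challenges by proposing SPLANNING, a risk-aware trajectory optimizer that operates in a Gaussian Splatting model. The paper first derives a method for rigorously upper-bounding the probability of collision between a robot and a radiance field. Second, it introduces a normalized reformulation of Gaussian Splatting that enables efficient computation of the collision bound in a Gaussian Splat. Third, it presents a method to optimize trajectories while avoiding collisions with a scene represented by a Gaussian Splat. Experiments demonstrate that SPLANNING outperforms state-of-the-art methods in generating collision-free trajectories in highly cluttered environments. The proposed system is also tested on a real-world robot manipulator. A project page is available at https://roahmlab.github.io/splanning.|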
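|**2024-09-22**|[MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views](http://arxiv.org/abs/2409.14316)|null|Recently, advances in Neural Radiance Fields (NeRF) have facilitated few-shot Novel View Synthesis (NVS), which is a significant challenge in 3D vision applications. Despite numerous attempts to reduce the dense input requirement in NeRF, it still suffers from time-consuming training and rendering processes. More recently, 3D Gaussian Splatting (3DGS) has achieved real-time, high-quality rendering with an explicit point-based representation. However, similar to NeRF, it tends to overfit the training views for lack of constraints. In this paper, we propose MVPGS, a few-shot NVS method that excavates multi-view priors based on 3D Gaussian Splatting. We leverage recent learning-based Multi-view Stereo (MVS) to enhance the quality of geometric initialization for 3DGS. To mitigate overfitting, we propose a forward-warping method that imposes additional appearance constraints on the scene, conforming to the computed geometry. Furthermore, we introduce a view-consistent geometry constraint for Gaussian parameters to facilitate proper optimization convergence and use monocular depth regularization as compensation. Experiments show that the proposed method achieves state-of-the-art performance with real-time rendering speed. Project page: https://zezeaaa.github.io/projects/MVPGS/|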
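|**2024-09-10**|[Sources of Uncertainty in 3D Scene Reconstruction](http://arxiv.org/abs/2409.06407)|**[link](https://github.com/aaltoml/uncertainty-nerf-gs)**|The process of 3D scene reconstruction can be affected by numerous sources of uncertainty in real-world scenes. While Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (GS) achieve high-fidelity rendering, they lack built-in mechanisms to directly address or quantify uncertainties arising from noise, occlusions, confounding outliers, and imprecise camera pose inputs. In this paper, we introduce a taxonomy that categorizes the different sources of uncertainty inherent in these methods. Moreover, we extend NeRF- and GS-based methods with uncertainty estimation techniques, including learning uncertainty outputs and ensembles, and perform an empirical study to assess their ability to capture the sensitivity of the reconstruction. Our study highlights the need to address various aspects of uncertainty when designing NeRF/GS-based methods for uncertainty-aware 3D reconstruction.|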
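|**2024-09-05**|[Optimizing 3D Gaussian Splatting for Sparse Viewpoint Scene Reconstruction](http://arxiv.org/abs/2409.03213)|null|3D Gaussian Splatting (3DGS) has emerged as a promising approach for 3D scene representation, offering a reduction in computational overhead compared to Neural Radiance Fields (NeRF). However, 3DGS is susceptible to high-frequency artifacts and performs poorly under sparse viewpoint conditions, limiting its applicability in robotics and computer vision. To address these limitations, we introduce SVS-GS, a novel framework for sparse viewpoint scene reconstruction that integrates a 3D Gaussian smoothing filter to suppress artifacts. Furthermore, our approach combines a Depth Gradient Profile Prior (DGPP) loss with a dynamic depth mask to sharpen edges, and 2D diffusion with a Score Distillation Sampling (SDS) loss to enhance geometric consistency in novel view synthesis. Experimental evaluations on the MipNeRF-360 and SeaThru-NeRF datasets demonstrate that SVS-GS markedly improves 3D reconstruction from sparse viewpoints, offering a robust and efficient solution for scene understanding in robotics and computer vision applications.|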
|**2024-08-20**|[Gaussian in the Dark: Real-Time View Synthesis From Inconsistent Dark Images Using Gaussian Splatting](http://arxiv.org/abs/2408.09130)|**[link](https://github.com/yec22/Gaussian-DK)**|3D Gaussian Splatting has recently emerged as a powerful representation that can synthesize remarkable novel views using consistent multi-view images as input. However, we notice that images captured in dark environments where the scenes are not fully illuminated can exhibit considerable brightness variations and multi-view inconsistency, which poses great challenges to 3D Gaussian Splatting and severely degrades its performance. To tackle this problem, we propose Gaussian-DK. Observing that inconsistencies are mainly caused by camera imaging, we represent a consistent radiance field of the physical world using a set of anisotropic 3D Gaussians, and design a camera response module to compensate for multi-view inconsistencies. We also introduce a step-based gradient scaling strategy to constrain Gaussians near the camera, which turn out to be floaters, from splitting and cloning. Experiments on our proposed benchmark dataset demonstrate that Gaussian-DK produces high-quality renderings without ghosting and floater artifacts and significantly outperforms existing methods. Furthermore, we can also synthesize light-up images by controlling exposure levels that clearly show details in shadow areas.||
|**2024-09-05**|[EaDeblur-GS: Event assisted 3D Deblur Reconstruction with Gaussian Splatting](http://arxiv.org/abs/2407.13520)|null|3D deblurring reconstruction techniques have recently seen significant advancements with the development of Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Although these techniques can recover relatively clear 3D reconstructions from blurry image inputs, they still face limitations in handling severe blurring and complex camera motion. To address these issues, we propose Event-assisted 3D Deblur Reconstruction with Gaussian Splatting (EaDeblur-GS), which integrates event camera data to enhance the robustness of 3DGS against motion blur. By employing an Adaptive Deviation Estimator (ADE) network to estimate Gaussian center deviations and using novel loss functions, EaDeblur-GS achieves sharp 3D reconstructions in real-time, demonstrating performance comparable to state-of-the-art methods.||
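|**2024-10-02**|[DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation](http://arxiv.org/abs/2407.11394)|**[link](https://github.com/kaist-cvml/dreamcatalyst)**|Score distillation sampling (SDS) has emerged as an effective framework for text-driven 3D editing, leveraging diffusion models for 3D-consistent editing. However, existing SDS-based 3D editing methods suffer from long training times and low-quality results. We find that the root cause of this performance degradation is their conflict with the sampling dynamics of diffusion models. Resolving this conflict allows us to treat SDS as a diffusion reverse process for 3D editing by sampling from data space. In contrast, existing methods simply use diffusion models to distill the score function. Based on these insights, we propose DreamCatalyst, a new framework that accounts for these sampling dynamics within the SDS framework. Specifically, we design the optimization process of DreamCatalyst to approximate the diffusion reverse process in editing tasks, thereby aligning with diffusion sampling dynamics. As a result, DreamCatalyst reduces training time and improves editing quality. Our method offers two modes: (1) a fast mode that edits Neural Radiance Field (NeRF) scenes about 23 times faster than current state-of-the-art NeRF editing methods, and (2) a high-quality mode that produces results about 8 times better than these methods. Notably, our high-quality mode outperforms current state-of-the-art NeRF editing methods in both speed and quality. DreamCatalyst also surpasses state-of-the-art 3D Gaussian Splatting (3DGS) editing methods, making it an effective and model-agnostic 3D editing solution. See more results on our project page: https://dream-catalyst.github.io.||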
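|**2024-07-10**|[3D Gaussian Ray Tracing: Fast Tracing of Particle Scenes](http://arxiv.org/abs/2407.07090)|null|Particle-based representations of radiance fields, such as 3D Gaussian Splatting, have found great success in reconstructing and re-rendering complex scenes. Most existing methods render particles via rasterization, projecting them to screen-space tiles for processing in sorted order. This work instead considers ray tracing the particles, building a bounding volume hierarchy and casting a ray for each pixel using high-performance GPU ray tracing hardware. To efficiently handle large numbers of semi-transparent particles, we describe a specialized rendering algorithm that encapsulates particles with bounding meshes to leverage fast ray-triangle intersections, and shades batches of intersections in depth order. The benefits of ray tracing are well known in computer graphics: processing incoherent rays for secondary lighting effects such as shadows and reflections, rendering from the highly distorted cameras common in robotics, stochastically sampling rays, and more. With our renderer, this flexibility comes at little cost compared to rasterization. Experiments demonstrate the speed and accuracy of our approach, as well as several applications in computer graphics and vision. We further propose related improvements to the basic Gaussian representation, including a simple use of generalized kernel functions that significantly reduces particle hit counts.||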
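|**2024-07-07**|[GaussReg: Fast 3D Registration with Gaussian Splatting](http://arxiv.org/abs/2407.05254)|null|Point cloud registration is a fundamental problem for large-scale 3D scene scanning and reconstruction. With the help of deep learning, registration methods have evolved significantly and are approaching a mature stage. Since its introduction, the Neural Radiance Field (NeRF) has become the most popular 3D scene representation thanks to its powerful view synthesis capabilities. For NeRF representations, large-scale scene reconstruction also requires registration. However, this topic remains largely unexplored, owing to the inherent challenge of modeling the geometric relationship between two scenes with implicit representations. Existing methods usually convert the implicit representation to an explicit one for further registration. Recently, Gaussian Splatting (GS) was introduced, which employs explicit 3D Gaussians. This method significantly increases rendering speed while maintaining high rendering quality. Given two scenes with explicit GS representations, we explore the 3D registration task between them in this work. To this end, we propose GaussReg, a fast and accurate coarse-to-fine framework. The coarse stage follows existing point cloud registration methods and estimates a rough alignment for point clouds from GS. We also present a new image-guided fine registration approach that renders images from GS to provide more detailed geometric information for precise alignment. To support comprehensive evaluation, we carefully build a scene-level dataset called ScanNet-GSReg with 1379 scenes obtained from the ScanNet dataset, and collect a real-world dataset called GSReg. Experimental results show that our method achieves state-of-the-art performance on multiple datasets. GaussReg is 44 times faster than HLoc (with SuperPoint as the feature extractor and SuperGlue as the matcher) with comparable accuracy.||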
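|**2024-07-04**|[CRiM-GS: Continuous Rigid Motion-Aware Gaussian Splatting from Motion Blur Images](http://arxiv.org/abs/2407.03923)|null|Neural radiance fields (NeRFs) have received significant attention for their high-quality novel view rendering, prompting research into their use in various real-world scenarios. One critical challenge is camera motion blur caused by camera movement during the exposure time, which prevents accurate 3D scene reconstruction. In this study, we propose continuous rigid motion-aware Gaussian splatting (CRiM-GS) to reconstruct accurate 3D scenes from blurry images at real-time rendering speed. Considering that the actual camera motion blurring process consists of complex motion patterns, we predict the continuous movement of the camera based on neural ordinary differential equations (ODEs). Specifically, we use rigid body transformations to model the camera motion with proper regularization, preserving the shape and size of the object. Furthermore, we introduce a continuous deformable 3D transformation in the SE(3) field to adapt the rigid body transformation to real-world problems by ensuring a higher degree of freedom. By revisiting fundamental camera theory and employing advanced neural network training techniques, we achieve accurate modeling of continuous camera trajectories. We conduct extensive experiments that demonstrate state-of-the-art performance both quantitatively and qualitatively on benchmark datasets.||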
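|**2024-07-29**|[Trimming the Fat: Efficient Compression of 3D Gaussian Splats through Pruning](http://arxiv.org/abs/2406.18214)|**[link](https://github.com/salmanali96/trimming-the-fat)**|The use of 3D models has recently gained traction, owing to the capacity for end-to-end training offered first by Neural Radiance Fields and more recently by 3D Gaussian Splatting (3DGS) models. The latter holds a significant advantage because it converges quickly and easily during training and offers extensive editability. Yet, despite rapid progress, the literature on the scalability of these models is still in its infancy. In this study, we take some initial steps toward addressing this gap, presenting an approach that enables both memory and computational scalability of such models. Specifically, we propose "Trimming the fat", a post-hoc gradient-informed iterative pruning technique that eliminates redundant information encoded in the model. Our experimental results on widely recognized benchmarks attest to the effectiveness of our approach, showing that up to 75% of the Gaussians can be removed while maintaining or even improving on baseline performance. Our approach achieves around 50× compression while preserving performance similar to the baseline model, and can speed up computation to as much as 600 FPS.||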
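|**2024-06-21**|[Gaussian Splatting to Real World Flight Navigation Transfer with Liquid Networks](http://arxiv.org/abs/2406.15149)|null|Simulators are powerful tools for autonomous robot learning, as they offer scalable data generation, flexible design, and trajectory optimization. However, transferring behavior learned from simulation data to the real world has proven difficult, and usually has to be mitigated with compute-heavy domain randomization methods or further model fine-tuning. We present a method to improve generalization and robustness to distribution shifts in sim-to-real visual quadrotor navigation tasks. To this end, we first build a simulator by integrating Gaussian Splatting with quadrotor flight dynamics, and then train robust navigation policies using Liquid neural networks. In this way, we obtain a complete imitation learning protocol that combines advances in 3D Gaussian Splatting radiance field rendering, crafty programming of expert demonstration training data, and the task understanding capabilities of Liquid networks. Through a series of quantitative flight tests, we demonstrate that navigation skills learned in a single simulation scene transfer robustly and directly to the real world. We further show the ability to maintain performance beyond the training environment under drastic distribution and physical environment changes. Our learned Liquid policies, trained only on single target maneuvers curated from a photorealistic simulated indoor flight, generalize to multi-step hikes onboard a real hardware platform outdoors.||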
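|**2024-06-14**|[Wild-GS: Real-Time Novel View Synthesis from Unconstrained Photo Collections](http://arxiv.org/abs/2406.10373)|null|Photographs captured in unstructured tourist environments frequently exhibit variable appearances and transient occlusions, which challenge accurate scene reconstruction and induce artifacts in novel view synthesis. Although prior approaches have combined Neural Radiance Fields (NeRF) with additional learnable modules to handle dynamic appearances and eliminate transient objects, their extensive training demands and slow rendering speeds limit practical deployment. Recently, 3D Gaussian Splatting (3DGS) has emerged as a promising alternative to NeRF, offering superior training and inference efficiency along with better rendering quality. This paper presents Wild-GS, an innovative adaptation of 3DGS optimized for unconstrained photo collections while preserving its efficiency benefits. Wild-GS determines the appearance of each 3D Gaussian from its inherent material attributes, the global illumination and camera properties of each image, and point-level local variance in reflectance. Unlike previous methods that model reference features in image space, Wild-GS explicitly aligns pixel appearance features with the corresponding local Gaussians by sampling the triplane extracted from the reference image. This design effectively transfers the high-frequency detailed appearance of the reference view to 3D space and significantly speeds up training. In addition, 2D visibility maps and depth regularization are used to mitigate transient effects and constrain the geometry, respectively. Extensive experiments show that Wild-GS achieves state-of-the-art rendering performance and the highest training and inference efficiency among all existing techniques.||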
|**2024-06-06**|[A Survey on 3D Human Avatar Modeling -- From Reconstruction to Generation](http://arxiv.org/abs/2406.04253)|null|3D modeling has long been an important area in computer vision and computer graphics. Recently, thanks to the breakthroughs in neural representations and generative models, we witnessed a rapid development of 3D modeling. 3D human modeling, lying at the core of many real-world applications, such as gaming and animation, has attracted significant attention. Over the past few years, a large body of work on creating 3D human avatars has been introduced, forming a new and abundant knowledge base for 3D human modeling. The scale of the literature makes it difficult for individuals to keep track of all the works. This survey aims to provide a comprehensive overview of these emerging techniques for 3D human avatar modeling, from both reconstruction and generation perspectives. Firstly, we review representative methods for 3D human reconstruction, including methods based on pixel-aligned implicit function, neural radiance field, and 3D Gaussian Splatting, etc. We then summarize representative methods for 3D human generation, especially those using large language models like CLIP, diffusion models, and various 3D representations, which demonstrate state-of-the-art performance. Finally, we discuss our reflection on existing methods and open challenges for 3D human avatar modeling, shedding light on future research.||
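|**2024-06-13**|[3D-HGS: 3D Half-Gaussian Splatting](http://arxiv.org/abs/2406.02720)|**[link](https://github.com/lihaolin88/3d-half-gaussian-splatting)**|Photo-realistic 3D reconstruction is a fundamental problem in 3D computer vision. The field has advanced considerably with the advent of recent neural rendering techniques. These techniques predominantly focus on learning volumetric representations of 3D scenes and refining these representations via loss functions derived from rendering. Among them, 3D Gaussian Splatting (3D-GS) has emerged as a significant method, surpassing Neural Radiance Fields (NeRFs). 3D-GS uses parameterized 3D Gaussians to model both spatial locations and color information, combined with a fast tile-based rendering technique. Despite its superior rendering performance and speed, the use of 3D Gaussian kernels has inherent limitations in accurately representing discontinuous functions, notably at edges and corners for shape discontinuities and across varying textures for color discontinuities. To address this problem, we propose 3D Half-Gaussian (3D-HGS) kernels, which can be used as a plug-and-play kernel. Our experiments show that they improve the performance of current 3D-GS related methods and achieve state-of-the-art rendering performance on various datasets without compromising rendering speed.||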

(back to top)

## Classification/Detection/Recognition/Segmentation

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
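|**2025-04-08**|[Memory-Modular Classification: Learning to Generalize with Memory Replacement](http://arxiv.org/abs/2504.06021)|null|We propose a novel memory-modular learner for image classification that separates knowledge memorization from reasoning. Our model generalizes effectively to new classes by simply replacing the memory contents, without retraining the model. Unlike traditional models that encode both world knowledge and task-specific skills into their weights during training, our model stores knowledge in an external memory of web-crawled image and text data. At inference time, the model dynamically selects relevant content from the memory based on the input image, allowing it to adapt to arbitrary classes by simply replacing the memory contents. The key differentiator of our learner is that it meta-learns to perform classification tasks with noisy web data from unseen classes, resulting in robust performance across various classification scenarios. Experimental results demonstrate the promising performance and versatility of our approach on diverse classification tasks, including zero-shot/few-shot classification of unseen classes, fine-grained classification, and class-incremental classification.|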
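|**2025-04-08**|[Balancing long- and short-term dynamics for the modeling of saliency in videos](http://arxiv.org/abs/2504.05913)|null|The role of long- and short-term dynamics in salient object detection in videos is under-researched. We present a Transformer-based approach that learns a joint representation of video frames and past saliency information. Our model embeds long- and short-term information to detect dynamically shifting saliency in video. We provide the model with a stream of video frames and past saliency maps, which act as a prior for the next prediction, and extract spatiotemporal tokens from both modalities. Decomposing the frame sequence into tokens lets the model incorporate short-term information from within a token while making long-term connections between tokens throughout the sequence. The core of the system is a dual-stream Transformer architecture that processes the extracted sequences independently before fusing the two modalities. In addition, we apply a saliency-based masking scheme to the input frames to learn an embedding that helps recognize deviations from previous outputs. We observe that the additional prior information aids the first detection of the salient location. Our findings indicate that the ratio of spatiotemporal long- and short-term features directly affects the model's performance. While increasing the short-term context is beneficial up to a point, the model's performance benefits greatly from an expansion of the long-term context.|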
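|**2025-04-08**|[Intrinsic Saliency Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation](http://arxiv.org/abs/2504.05904)|null|Recent unsupervised video object segmentation (UVOS) methods predominantly adopt the motion-appearance paradigm. Mainstream motion-appearance approaches use either a two-encoder structure to encode motion and appearance features separately, or a single-encoder structure for joint encoding. However, these methods fail to properly balance the motion-appearance relationship. Consequently, even with complex fusion modules for motion-appearance integration, the suboptimal extracted features degrade the model's overall performance. Moreover, the quality of optical flow varies across scenarios, so relying on optical flow alone is insufficient for high-quality segmentation results. To address these challenges, we propose the Intrinsic Saliency guided Trunk-Collateral Network (ISTC-Net), which better balances the motion-appearance relationship and incorporates the model's intrinsic saliency information to enhance segmentation performance. Specifically, since optical flow maps are derived from RGB images, the two share commonalities as well as differences. We propose a novel Trunk-Collateral structure: the shared trunk backbone captures the motion-appearance commonality, while the collateral branch learns the uniqueness of motion features. Furthermore, we design an Intrinsic Saliency guided Refinement Module (ISRM) that efficiently uses the model's intrinsic saliency information to refine high-level features and provide pixel-level guidance for motion-appearance fusion, improving performance without additional input. Experimental results show that ISTC-Net achieves state-of-the-art performance on three UVOS datasets (89.2% J&F on DAVIS-16, 76% J on YouTube-Objects, 86.4% J on FBMS) and four standard video salient object detection (VSOD) benchmarks with notable gains, demonstrating its effectiveness and superiority over previous methods.|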
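|**2025-04-08**|[KAN-SAM: Kolmogorov-Arnold Network Guided Segment Anything Model for RGB-T Salient Object Detection](http://arxiv.org/abs/2504.05878)|null|Existing RGB-thermal salient object detection (RGB-T SOD) methods aim to identify visually salient objects by leveraging both RGB and thermal modalities to achieve robust performance in complex scenes, but they often suffer from limited generalization because of the limited diversity of available datasets and inefficiencies in constructing multi-modal representations. In this paper, we propose a novel prompt-learning-based RGB-T SOD method named KAN-SAM, which reveals the potential of visual foundation models for RGB-T SOD tasks. Specifically, we extend Segment Anything Model 2 (SAM2) to RGB-T SOD by introducing thermal features as guiding prompts through efficient and accurate Kolmogorov-Arnold Network (KAN) adapters, which effectively enhances RGB representations and improves robustness. We also introduce a mutually exclusive random masking strategy to reduce reliance on RGB data and improve generalization. Benchmark results show that the method outperforms existing state-of-the-art methods.|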
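|**2025-04-08**|[DefMamba: Deformable Visual State Space Model](http://arxiv.org/abs/2504.05794)|null|Recently, state space models (SSMs), particularly Mamba, have attracted significant attention from scholars for their ability to balance computational efficiency and performance. However, most existing visual Mamba methods flatten images into 1D sequences using predefined scan orders, which makes it hard for the model to exploit the spatial structure of images during feature extraction. To address this issue, we propose a novel visual foundation model called DefMamba. The model includes a multi-scale backbone structure and deformable Mamba (DM) blocks, which dynamically adjust the scanning path to prioritize important information, enhancing the capture and processing of relevant input features. By incorporating a deformable scanning (DS) strategy, the model significantly improves its ability to learn image structure and detect changes in object details. Extensive experiments show that DefMamba achieves state-of-the-art performance on various visual tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code is open-sourced at DefMamba.|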
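|**2025-04-08**|[POD: Predictive Object Detection with Single-Frame FMCW LiDAR Point Cloud](http://arxiv.org/abs/2504.05649)|null|LiDAR-based 3D object detection is a fundamental task in autonomous driving. This paper explores the unique advantages of Frequency Modulated Continuous Wave (FMCW) LiDAR for autonomous perception. Given a single-frame FMCW point cloud with radial velocity measurements, we expect our object detector to detect the short-term future locations of objects using only the current frame's sensor data, and to demonstrate a fast ability to respond to intermediate danger. To this end, we extend the standard object detection task to a new task named predictive object detection (POD), which aims to predict the short-term future location and dimensions of objects based solely on current observations. Motion prediction tasks typically require historical sensor information to process the temporal context of each object, whereas our detector avoids multi-frame history and can therefore respond to potential dangers more quickly. The core advantage of FMCW LiDAR is that every reflected point carries a radial velocity. We propose a novel POD framework whose core idea is to generate a virtual future point using a ray casting mechanism, create a virtual two-frame point cloud containing the current frame and a virtual future frame, and encode these two-frame voxel features with a sparse 4D encoder. The 4D voxel features are then separated by temporal index and remapped into two Bird's Eye View (BEV) features: one decoded for standard current-frame object detection and the other for future predictive object detection. Extensive experiments on our in-house dataset show that the proposed POD framework achieves state-of-the-art performance in both standard and predictive detection.|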
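|**2025-04-08**|[AD-Det: Boosting Object Detection in UAV Images with Focused Small Objects and Balanced Tail Classes](http://arxiv.org/abs/2504.05601)|null|Object detection in Unmanned Aerial Vehicle (UAV) images poses significant challenges because of complex scale variations and class imbalance among objects. Existing methods typically address these challenges separately, overlooking the intricate nature of UAV images and the potential synergy between them. In response, this paper proposes AD-Det, a novel framework that employs a coarse-to-fine strategy and seamlessly integrates two pivotal components: Adaptive Small Object Enhancement (ASOE) and Dynamic Class-balanced Copy-paste (DCC). ASOE uses a high-resolution feature map to identify and cluster regions containing small objects. These regions are then enlarged and processed by a fine-grained detector. DCC, in turn, performs object-level resampling by dynamically pasting tail-class objects around the cluster centers obtained by ASOE, maintaining a dynamic memory bank for each tail class. This approach lets AD-Det not only extract regions containing small objects for precise detection but also dynamically perform reasonable resampling of tail-class objects. As a result, AD-Det addresses the challenges of scale variation and class imbalance in UAV images through a synergistic and adaptive framework, improving overall detection performance. We extensively evaluate our approach on two public datasets, VisDrone and UAVDT, and show that AD-Det significantly outperforms existing competitive alternatives. Notably, AD-Det achieves 37.5% Average Precision (AP) on the VisDrone dataset, surpassing other methods by at least 3.1%.|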
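|**2025-04-07**|[Secure Diagnostics: Adversarial Robustness Meets Clinical Interpretability](http://arxiv.org/abs/2504.05483)|null|Deep neural networks for medical image classification often fail to generalize consistently in clinical practice because the independent and identically distributed assumption is violated and their decision-making is opaque. This paper examines the interpretability of deep neural networks fine-tuned for fracture detection by evaluating model performance against adversarial attacks and comparing interpretability methods to fracture regions annotated by an orthopedic surgeon. Our findings show that robust models yield explanations more aligned with clinically meaningful areas, indicating that robustness encourages the prioritization of anatomically relevant features. We emphasize the value of interpretability for facilitating human-AI collaboration, in which models serve as assistants under a human-in-the-loop paradigm: clinically plausible explanations foster trust, support error correction, and discourage reliance on AI for high-stakes decisions. This paper investigates robustness and interpretability as complementary benchmarks for bridging the gap between benchmark performance and safe, actionable clinical deployment.|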
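|**2025-04-07**|[Diffusion-based Models for Unpaired Super-resolution in Fluid Dynamics](http://arxiv.org/abs/2504.05443)|null|High-fidelity, high-resolution numerical simulations are crucial for studying complex multiscale phenomena in fluid dynamics, such as turbulent flows and ocean waves. However, direct numerical simulation with high-resolution solvers is computationally prohibitive. As an alternative, super-resolution techniques can enhance low-fidelity, low-resolution simulations. Conventional super-resolution approaches, however, rely on paired low-fidelity, low-resolution and high-fidelity, high-resolution datasets for training, which are often unavailable for complex flow systems. To address this challenge, we propose a novel two-step approach that eliminates the need for paired datasets. First, we perform unpaired domain translation at the low-resolution level using an Enhanced Denoising Diffusion Implicit Bridge. This process transforms low-fidelity, low-resolution inputs into high-fidelity, low-resolution outputs, and we provide a theoretical analysis that highlights the advantages of this enhanced diffusion-based approach. Second, we employ a cascaded Super-Resolution via Repeated Refinement model to upscale the high-fidelity, low-resolution predictions to high-resolution results. We demonstrate the effectiveness of our approach on three fluid dynamics problems. Moreover, by incorporating a neural operator to learn the system dynamics, our method can be extended to improve evolutionary simulations of low-fidelity, low-resolution data.|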
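|**2025-04-07**|[Time-adaptive Video Frame Interpolation based on Residual Diffusion](http://arxiv.org/abs/2504.05402)|null|In this work, we propose a new diffusion-based method for video frame interpolation (VFI) in the context of traditional hand-made animation. We introduce three main contributions. First, we explicitly handle the interpolation time in our model and re-estimate it during training, to cope with the particularly large temporal variations in the animation domain compared to natural videos. Second, we adapt and generalize to VFI a diffusion scheme called ResShift, recently proposed in the super-resolution community, which lets us produce estimates with a very low number of diffusion steps (around 10). Third, we leverage the stochastic nature of the diffusion process to provide a pixel-wise estimate of the uncertainty on the interpolated frame, which helps anticipate where the model may be wrong. We provide extensive comparisons with state-of-the-art models and show that our model outperforms them on animation videos.|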
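|**2025-04-04**|[PF3Det: A Prompted Foundation Feature Assisted Visual LiDAR 3D Detector](http://arxiv.org/abs/2504.03563)|null|3D object detection is crucial for autonomous driving, using LiDAR point clouds for precise depth information and camera images for rich semantic information. Multi-modal methods that combine both modalities therefore offer more robust detection results. However, efficiently fusing LiDAR points and images remains challenging because of the domain gap. In addition, the performance of many models is limited by the amount of high-quality labeled data, which is expensive to create. Recent advances in foundation models, which use large-scale pre-training across different modalities, enable better multi-modal fusion. Combining these with prompt engineering techniques for efficient training, we propose the Prompted Foundational 3D Detector (PF3Det), which integrates foundation model encoders and soft prompts to enhance LiDAR-camera feature fusion. PF3Det achieves state-of-the-art results with limited training data, improving NDS by 1.19% and mAP by 2.42% on the nuScenes dataset, demonstrating its efficiency in 3D detection.|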
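|**2025-04-04**|[ZFusion: An Effective Fuser of Camera and 4D Radar for 3D Object Perception in Autonomous Driving](http://arxiv.org/abs/2504.03438)|null|Reliable 3D object perception is essential in autonomous driving. Owing to its sensing capability in all weather conditions, 4D radar has recently received much attention. However, compared with LiDAR, 4D radar provides a much sparser point cloud. In this paper, we propose a 3D object detection method, termed ZFusion, that fuses 4D radar and vision modalities. At the core of ZFusion, our proposed FP-DDCA (Feature Pyramid-Double Deformable Cross Attention) fuser effectively complements the (sparse) radar information with the (dense) vision information. Specifically, the FP-DDCA fuser uses a feature-pyramid structure with Transformer blocks to interactively fuse multi-modal features at different scales, thereby improving perception accuracy. In addition, because of the physical properties of 4D radar, we use a Depth-Context-Split view transformation module. Since 4D radar costs far less than LiDAR, ZFusion is an attractive alternative to LiDAR-based methods. In typical traffic scenarios such as the VoD (View-of-Delft) dataset, experiments show that, at a reasonable inference speed, ZFusion achieves state-of-the-art mAP (mean average precision) in the region of interest while remaining competitive with the baseline methods in mAP over the entire area, which indicates performance close to LiDAR and far better than camera-only methods.|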
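|**2025-04-04**|[Infrared bubble recognition in the Milky Way and beyond using deep learning](http://arxiv.org/abs/2504.03367)|null|We propose a deep learning model that can accurately detect Spitzer bubbles using two-wavelength near-infrared data acquired by the Spitzer Space Telescope and JWST. The model is based on the Single Shot MultiBox Detector as an object detection model, and is trained and validated using Spitzer bubbles identified by the Milky Way Project (MWP-Bubble). We find that using only MWP-Bubbles with clear structures, together with normalization and data augmentation, significantly improves performance. To reduce dataset bias, we also use no-bubble data selected by combining two techniques: negative sampling and clustering. The model is optimized through hyperparameter tuning with Bayesian optimization. Applied to a test region of the Galactic plane, the model achieves a 98% detection rate for MWP-Bubbles in which 8 µm emission clearly encloses 24 µm emission. We also apply the model to a broader area (1°≤\|l\|≤65°, \|b\|≤1°), including both training and validation regions, where it detects 3006 bubbles, of which 1413 are newly detected. We further attempt to detect bubbles in the high-mass star-forming region Cygnus X and in the external galaxies Large Magellanic Cloud (LMC) and NGC 628. The model successfully detects Spitzer bubbles in these external galaxies, although it also detects Mira-type variable stars and other compact sources that are hard to distinguish from Spitzer bubbles. The detection process takes only a few hours, demonstrating the efficiency of detecting bubble structures. The method used for detecting Spitzer bubbles is also applied to shell-like structures observable only in the 8 µm emission band, leading to the detection of 469 shell-like structures in the LMC and 143 in NGC 628.|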
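|**2025-04-04**|[Adaptive Classification of Interval-Valued Time Series](http://arxiv.org/abs/2504.03318)|null|In recent years, the modeling and analysis of interval-valued time series have attracted significant attention in econometrics and statistics. However, the existing literature focuses primarily on regression tasks and neglects classification. In this paper, we propose an adaptive approach to interval-valued time series classification. Specifically, we represent interval-valued time series using convex combinations of the upper and lower bounds of the intervals, and transform these representations into images based on point-valued time series imaging methods. We use a fine-grained image classification neural network to classify these images, thereby classifying the original interval-valued time series. The approach applies to both univariate and multivariate interval-valued time series. On the optimization side, we treat the convex combination coefficients as learnable parameters similar to the parameters of the neural network, and provide an efficient estimation method based on the alternating direction method of multipliers (ADMM). On the theoretical side, under specific conditions, we establish a margin-based multiclass generalization bound for generic CNNs composed of basic blocks such as convolution, pooling, and fully connected layers. Through simulation studies and real data applications, we validate the effectiveness of the proposed method and compare its performance against a wide range of point-valued time series classification methods.|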
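|**2025-04-04**|[Real-Time Roadway Obstacle Detection for Electric Scooters Using Deep Learning and Multi-Sensor Fusion](http://arxiv.org/abs/2504.03171)|null|The growing popularity of electric scooters (e-scooters) in urban areas has been accompanied by a rise in traffic accidents and injuries, largely because of their small wheels, lack of suspension, and sensitivity to uneven surfaces. While deep learning-based object detection has been widely used to improve automobile safety, its application to e-scooter obstacle detection remains unexplored. This study introduces a novel ground obstacle detection system for e-scooters that integrates an RGB camera and a depth camera to enhance real-time road hazard detection. In addition, an Inertial Measurement Unit (IMU) measures linear vertical acceleration to identify surface vibrations, which guides the selection of six obstacle categories: tree branches, manhole covers, potholes, pine cones, non-directional cracks, and truncated domes. All sensors, including the RGB camera, depth camera, and IMU, are integrated within the Intel RealSense Camera D435i. A YOLO-powered deep learning model detects road hazards and uses depth data to estimate obstacle distance. Evaluated on a seven-hour naturalistic riding dataset, the system achieves a high mean average precision (mAP) of 0.827 and shows excellent real-time performance. This approach offers an effective way to improve e-scooter safety through advanced computer vision and data fusion. The dataset is available at https://zenodo.org/records/14583718, and the project code is hosted at https://github.com/Zeyang-Zheng/Real-Time-Roadway-Obstacle-Detection-for-Electric-Scooters.|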
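|**2025-04-04**|[Finding the Reflection Point: Unpadding Images to Remove Data Augmentation Artifacts in Large Open Source Image Datasets for Machine Learning](http://arxiv.org/abs/2504.03168)|null|This paper explores a novel image restoration problem relevant to machine learning dataset curation: the detection and removal of noisy mirrored padding artifacts. While data augmentation techniques such as padding are necessary for standardizing image dimensions, they can introduce artifacts that degrade model evaluation when datasets are repurposed across domains. We propose a systematic algorithm that precisely locates the reflection boundary through a minimum mean squared error approach with thresholding, and removes the reflective padding. Our method effectively identifies the transition between authentic content and its mirrored counterpart, even in the presence of compression or interpolation noise. We demonstrate the algorithm's efficacy on the SHEL5k dataset, showing a substantial performance improvement in zero-shot object detection with OWLv2: average precision rises from 0.47 to 0.61 for hard hat detection and from 0.68 to 0.73 for person detection. By addressing annotation inconsistencies and distorted objects in padded regions, our approach enhances dataset integrity and enables more reliable model evaluation across computer vision tasks.|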
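|**2025-04-03**|[Attention-Aware Multi-View Pedestrian Tracking](http://arxiv.org/abs/2504.03047)|null|Despite recent advances in multi-object tracking, occlusion remains a significant challenge. Multi-camera setups address this challenge by providing comprehensive coverage of the scene. Recent multi-view pedestrian detection models have highlighted the potential of an early-fusion strategy, which projects feature maps from all views onto a common ground plane or Bird's Eye View (BEV) and then performs detection. This strategy has been shown to improve both detection and tracking performance. However, the perspective transformation causes significant distortion on the ground plane, which affects the robustness of pedestrian appearance features. To address this limitation, we propose a novel model that incorporates attention mechanisms in a multi-view pedestrian tracking scenario. Our model uses an early-fusion strategy for detection and a cross-attention mechanism to establish robust associations between pedestrians in different frames, while efficiently propagating pedestrian features across frames, resulting in a more robust feature representation for each pedestrian. Extensive experiments show that our model outperforms state-of-the-art models, with an IDF1 score of 96.1% on the Wildtrack dataset and 85.7% on the MultiviewX dataset.|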
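|**2025-04-03**|[HQViT: Hybrid Quantum Vision Transformer for Image Classification](http://arxiv.org/abs/2504.02730)|null|Transformer-based architectures have revolutionized deep learning. In computer vision, Vision Transformers demonstrate performance on par with or even superior to convolutional neural networks. However, the quadratic computational complexity of their self-attention mechanism poses challenges for classical computing, making model training with high-dimensional input data, such as images, particularly expensive. To address these limitations, we propose a Hybrid Quantum Vision Transformer (HQViT) that leverages the principles of quantum computing to accelerate model training and enhance model performance. HQViT introduces whole-image processing with amplitude encoding to better preserve global image information without additional positional encoding. By applying quantum computation to the most critical steps and selectively handling other components classically, we lower the quantum resource cost of HQViT. The qubit requirement is minimized to $O(\log_2 N)$ and the number of parameterized quantum gates is only $O(\log_2 d)$, making it well suited to Noisy Intermediate-Scale Quantum (NISQ) devices. By offloading the computationally intensive attention coefficient matrix calculation to the quantum framework, HQViT reduces the classical computational load by $O(T^2d)$. Extensive experiments across various computer vision datasets show that HQViT outperforms existing models, achieving an improvement of up to $10.9\%$ (on the MNIST 10-class classification task) over the state of the art. This work highlights the great potential of combining quantum and classical computing for complex image classification tasks.|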
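|**2025-04-03**|[Rip Current Segmentation: A Novel Benchmark and YOLOv8 Baseline Results](http://arxiv.org/abs/2504.02558)|null|Rip currents are the leading cause of fatal accidents and injuries on many beaches worldwide, which underlines the importance of automatically detecting these hazardous surface water currents. In this paper, we propose a novel task: rip current instance segmentation. We introduce a comprehensive dataset of 2,466 images with newly created polygonal annotations for instance segmentation, used for training and validation. In addition, we provide a new dataset of 17 drone videos (about 24,000 frames) captured at 30 FPS and annotated with both polygons for instance segmentation and bounding boxes for object detection, used for testing. We train various versions of YOLOv8 for instance segmentation on static images and assess their performance on the test dataset (videos). The best results were achieved by the YOLOv8-nano model (runnable on a portable device), with an mAP50 of 88.94% on the validation dataset and a macro average of 81.21% on the test dataset. These results provide a baseline for future research in rip current segmentation. Our work contributes to the existing literature by introducing a detailed, annotated dataset and training a deep learning model for rip current instance segmentation. The code, training details, and annotated dataset are publicly available at https://github.com/Irikos/rip_currents.|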
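|**2025-04-03**|[Data-Driven Object Tracking: Integrating Modular Neural Networks into a Kalman Framework](http://arxiv.org/abs/2504.02519)|null|This paper presents novel machine learning (ML) methods for multi-object tracking (MOT), designed to meet the growing complexity and precision demands of Advanced Driver Assistance Systems (ADAS). We introduce three neural network (NN) models that address key challenges in MOT: (i) the Single-Prediction Network (SPENT) for trajectory prediction, (ii) the Single-Association Network (SANT) for mapping individual Sensor Objects (SOs) to existing tracks, and (iii) the Multi-Association Network (MANTa) for associating multiple SOs with multiple tracks. These models are seamlessly integrated into a traditional Kalman Filter (KF) framework, maintaining the system's modularity by replacing the relevant components without disrupting the overall architecture. Importantly, all three networks are designed to run in a real-time embedded environment. Each network contains fewer than 50k trainable parameters. Our evaluation on the public KITTI tracking dataset shows significant improvements in tracking performance. SPENT reduces the Root Mean Square Error (RMSE) by 50% compared to a standard KF, while SANT and MANTa achieve up to 95% accuracy in sensor object-to-track assignments. These results underscore the effectiveness of incorporating task-specific NNs into traditional tracking systems, improving performance and robustness while preserving modularity, maintainability, and interpretability.|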
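|**2025-04-03**|[CornerPoint3D: Look at the Nearest Corner Instead of the Center](http://arxiv.org/abs/2504.02464)|null|3D object detection aims to predict the centers, dimensions, and rotations of objects from LiDAR point clouds. Despite its simplicity, LiDAR captures only the near side of objects, which makes center-based detectors prone to poor localization accuracy in cross-domain tasks with varying point distributions. Meanwhile, existing evaluation metrics designed for single-domain assessment also suffer from overfitting caused by dataset-specific size variations. A key question arises: do we really need models to maintain excellent performance over the entire 3D bounding box after being applied across domains? In practice, one of our main concerns is preventing collisions between the vehicle and other obstacles, especially in cross-domain scenarios where accurately predicting sizes is much more difficult. To address these issues, we rethink cross-domain 3D object detection from a practical perspective. We propose two new metrics that evaluate a model's ability to detect the surfaces of objects closer to the LiDAR sensor. In addition, we introduce EdgeHead, a refinement head that guides models to focus more on the learnable closer surfaces, significantly improving cross-domain performance under both our new metrics and the existing BEV/3D metrics. Furthermore, we argue that predicting the nearest corner rather than the object center enhances robustness. We propose a novel 3D object detector, CornerPoint3D, which is built on CenterPoint and uses heatmaps to supervise the learning and detection of the nearest corner of each object. Our proposed methods balance the detection quality of entire bounding boxes against the localization accuracy of surfaces closer to the LiDAR sensor, outperforming the traditional center-based detector CenterPoint in multiple cross-domain tasks and providing a more practically reasonable and robust cross-domain 3D object detection solution.||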
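|**2025-04-03**|[A Physics-Informed Meta-Learning Framework for the Continuous Solution of Parametric PDEs on Arbitrary Geometries](http://arxiv.org/abs/2504.02459)|null|In this work, we introduce implicit Finite Operator Learning (iFOL) for the continuous and parametric solution of partial differential equations (PDEs) on arbitrary geometries. We propose a physics-informed encoder-decoder network that establishes the mapping between continuous parameter and solution spaces. The decoder constructs the parametric solution field using an implicit neural field network conditioned on a latent or feature code. Instance-specific codes are derived through a PDE encoding process based on a second-order meta-learning technique. During training and inference, a physics-informed loss function is minimized in the PDE encoding and decoding processes. iFOL expresses the loss function in an energy or weighted residual form and evaluates it using discrete residuals derived from standard numerical PDE methods. As a result, discrete residuals are backpropagated during both training and inference. iFOL has several key features: (1) its unique loss formulation eliminates the need for the conventional encode-process-decode pipeline previously used in operator learning with conditional neural fields for PDEs; (2) it provides not only accurate parametric and continuous fields but also solution-to-parameter gradients without requiring additional loss terms or sensitivity analysis; (3) it can effectively capture sharp discontinuities in the solution; and (4) it removes constraints on geometry and mesh, making it applicable to arbitrary geometries and spatial sampling (zero-shot super-resolution capability). We critically assess these features and analyze the network's ability to generalize to unseen samples across both stationary and transient PDEs. The overall performance of the proposed method is promising, demonstrating its applicability to a range of challenging problems in computational mechanics.||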
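|**2025-04-03**|[Hyperspectral Remote Sensing Images Salient Object Detection: The First Benchmark Dataset and Baseline](http://arxiv.org/abs/2504.02416)|null|The goal of hyperspectral remote sensing image salient object detection (HRSI-SOD) is to identify objects or regions that exhibit distinct spectral contrast with the background. This area holds great promise for practical applications; however, its progress has been limited by the absence of dedicated datasets and methods. To bridge this gap and stimulate further research, we introduce the first HRSI-SOD dataset, termed HRSSD, which contains 704 hyperspectral images and 5327 pixel-level annotated salient objects. The HRSSD dataset poses substantial challenges for salient object detection algorithms because of large scale variation, diverse foreground-background relations, and multiple salient objects. In addition, we propose an innovative and efficient baseline model for HRSI-SOD, termed the Deep Spectral Saliency Network (DSSN). The core of DSSN is the Cross-level Saliency Assessment Block, which performs pixel-wise attention and evaluates the contributions of multi-scale similarity maps at each spatial location, effectively reducing erroneous responses in cluttered regions and emphasizing salient regions across scales. Moreover, the High-resolution Fusion Module combines a bottom-up fusion strategy with learned spatial upsampling to exploit the strengths of multi-scale saliency maps, ensuring accurate localization of small objects. Experiments on the HRSSD dataset robustly validate the superiority of DSSN and underscore the pressing need for dedicated datasets and methods in this field. Further evaluations on the HSOD-BIT and HS-SOD datasets demonstrate the method's generalizability. The dataset and source code are publicly available at https://github.com/laprf/HRSSD.||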
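|**2025-04-03**|[LLM-Guided Evolution: An Autonomous Model Optimization for Object Detection](http://arxiv.org/abs/2504.02280)|null|In machine learning, Neural Architecture Search (NAS) requires domain knowledge of model design and a large amount of trial and error to achieve promising performance. Meanwhile, evolutionary algorithms have traditionally relied on fixed rules and predefined building blocks. The Large Language Model (LLM)-Guided Evolution (GE) framework transformed this approach by incorporating LLMs to directly modify the model source code of image classification algorithms on CIFAR data and to intelligently guide mutations and crossovers. A key element of LLM-GE is the "Evolution of Thought" (EoT) technique, which establishes feedback loops that allow LLMs to refine their decisions iteratively based on how previous operations performed. In this study, we perform NAS for object detection by improving LLM-GE to modify the architecture of You Only Look Once (YOLO) models to enhance performance on the KITTI dataset. Our approach intelligently adjusts the design and settings of YOLO to find the best algorithm for objectives such as detection accuracy and speed. We show that LLM-GE produces variants with significant performance improvements, such as an increase in Mean Average Precision from 92.5% to 94.5%. This result highlights the flexibility and effectiveness of LLM-GE on real-world challenges, offering a new paradigm for automated machine learning that combines LLM-driven reasoning with evolutionary strategies.||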
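|**2025-04-02**|[Neural Style Transfer for Synthesising a Dataset of Ancient Egyptian Hieroglyphs](http://arxiv.org/abs/2504.02163)|null|The limited availability of training data for low-resource languages makes applying machine learning techniques challenging. Ancient Egyptian is one such low-resource language. However, innovative applications of data augmentation methods, such as Neural Style Transfer, could overcome these barriers. This paper presents a novel method for generating datasets of ancient Egyptian hieroglyphs by applying Neural Style Transfer to a digital typeface. Experimental results show that image classification models trained on Neural Style Transfer-generated examples and on photographs show equal performance and transfer to unseen real images of hieroglyphs.||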
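|**2025-04-03**|[ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement](http://arxiv.org/abs/2504.01934)|null|We present ILLUME+, which leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models struggle to handle three fundamental capabilities in a single model: understanding, generation, and editing. Models such as Chameleon and EMU3 use VQGAN for image discretization; lacking deep semantic interaction, they lag behind specialist models such as LLaVA on visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, but they struggle with image editing because of poor texture preservation. Meanwhile, the Janus series decouples the input and output image representations, limiting its ability to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. In addition, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified multimodal large language model (MLLM) and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) performs competitively against existing unified MLLMs and specialist models on multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project page: https://illume-unified-mllm.github.io/.||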
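|**2025-04-02**|[A Randomized Zeroth-Order Hierarchical Framework for Heterogeneous Federated Learning](http://arxiv.org/abs/2504.01839)|null|Heterogeneity in federated learning (FL) is a critical and challenging aspect that significantly affects model performance and convergence. In this paper, we propose a novel framework that formulates heterogeneous FL as a hierarchical optimization problem. This new framework captures both local and global training processes through a bilevel formulation and is capable of the following: (i) addressing client heterogeneity through a personalized learning framework; (ii) capturing the pre-training process on the server side; (iii) updating the global model through nonstandard aggregation; (iv) allowing for nonidentical local steps; and (v) capturing clients' local constraints. We design and analyze an implicit zeroth-order FL method (ZO-HFL), providing non-asymptotic convergence guarantees for both the server-agent and the individual client-agents, and asymptotic guarantees for both the server-agent and client-agents in an almost sure sense. Notably, our method does not rely on standard assumptions in heterogeneous FL, such as the bounded gradient dissimilarity condition. We implement our method on image classification tasks and compare it with other methods under different heterogeneous settings.||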
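|**2025-03-31**|[Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach](http://arxiv.org/abs/2503.24271)|**[link](https://github.com/fpramunno/ldm_superresolution)**|The spatial properties of the solar magnetic field are crucial to decoding the physical processes in the solar interior and their interplanetary effects. However, observations from older instruments, such as the Michelson Doppler Imager (MDI), have limited spatial or temporal resolution, which hinders the study of small-scale solar features in detail. Super-resolving these older datasets is essential for uniform analysis across different solar cycles, enabling better characterization of solar flares, active regions, and magnetic network dynamics. In this work, we introduce a novel diffusion model approach for super-resolution and apply it to MDI magnetograms to match the higher-resolution capabilities of the Helioseismic and Magnetic Imager (HMI). By training a Latent Diffusion Model (LDM) with residuals on downscaled HMI data and fine-tuning it with paired MDI/HMI data, we can enhance the resolution of MDI observations from 2"/pixel to 0.5"/pixel. We evaluate the quality of the reconstructed images with classical metrics (e.g., PSNR, SSIM, FID, and LPIPS) and check whether physical properties, such as the unsigned magnetic flux or the size of an active region, are preserved. We compare our model with different variants of LDM and Denoising Diffusion Probabilistic Models (DDPMs), and also with two deterministic architectures previously used for super-resolution tasks. Furthermore, we show with an analysis in the Fourier domain that the LDM with residuals can resolve features smaller than 2", and that, thanks to the probabilistic nature of the LDM, we can assess their reliability, in contrast with the deterministic models. Future studies aim to super-resolve the temporal scale of the solar MDI instrument so that we can also better understand the dynamics of old events.||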
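|**2025-03-31**|[MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing](http://arxiv.org/abs/2503.24219)|**[link](https://github.com/rd20karim/mb-ores)**|We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for the VG task, we fine-tune an open-set object detector using referring expression data, framing it as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. Our task-aware architecture then processes this graph to perform the VG task. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring object localization. Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG datasets, achieving significant improvements over state-of-the-art methods while retaining classical OD capabilities. The code will be available in our repository: https://github.com/rd20karim/MB-ORES.||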
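|**2025-03-31**|[CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization](http://arxiv.org/abs/2503.24182)|null|Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success in cross-modal tasks such as zero-shot image classification and text-image retrieval by effectively aligning visual and textual representations. However, the theoretical foundations underlying CLIP's strong generalization remain unclear. In this work, we address this gap by proposing the Cross-modal Information Bottleneck (CIB) framework. CIB offers a principled interpretation of CLIP's contrastive learning objective in terms of the information bottleneck. Under this view, the model maximizes shared cross-modal information while discarding modality-specific redundancies, thereby preserving essential semantic alignment across modalities. Building on this insight, we introduce a Cross-modal Information Bottleneck Regularization (CIBR) method that explicitly enforces these information bottleneck principles during training. CIBR introduces a penalty term to discourage modality-specific redundancy, thereby enhancing semantic alignment between image and text features. We validate CIBR on extensive vision-language benchmarks, including zero-shot classification across seven diverse image datasets and text-image retrieval on MSCOCO and Flickr30K. The results show consistent performance gains over standard CLIP. These findings provide the first theoretical understanding of CLIP's generalization through the information bottleneck lens. They also demonstrate practical improvements, offering guidance for future cross-modal representation learning.||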
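|**2025-03-31**|[PixelCAM: Pixel Class Activation Mapping for Histology Image Classification and ROI Localization](http://arxiv.org/abs/2503.24135)|**[link](https://github.com/alexisguichemerrecode/pixelcam)**|Weakly supervised object localization (WSOL) methods allow training models to classify images and localize regions of interest (ROIs). WSOL requires only low-cost image-class annotations yet provides a visually interpretable classifier, which is important in histology image analysis. Standard WSOL methods rely on class activation mapping (CAM) methods to produce spatial localization maps according to a single- or two-step strategy. While both strategies have made significant progress, they still face several limitations with histology images. Single-step methods can easily result in under- or over-activation because of the limited visual ROI saliency in histology images and the limited localization cues. They also face the well-known issue of asynchronous convergence between the classification and localization tasks. The two-step approach is suboptimal because it is tied to a frozen classifier, which limits the capacity for localization. Moreover, these methods also struggle when applied to out-of-distribution (OOD) datasets. This paper introduces a multi-task approach for WSOL that trains both tasks simultaneously to address the asynchronous convergence problem. In particular, localization is performed in the pixel-feature space of an image encoder that is shared with classification. This allows learning discriminant features and accurate delineation of foreground/background regions to support ROI localization and image classification. We propose PixelCAM, a cost-effective foreground/background pixel-wise classifier in the pixel-feature space that allows for spatial object localization. PixelCAM is trained using pixel pseudo-labels collected from a pretrained WSOL model. Both image and pixel-wise classifiers are trained simultaneously using standard gradient descent. In addition, our pixel classifier can easily be integrated into CNN- and Transformer-based architectures without any modifications.||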
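|**2025-03-31**|[Crossmodal Knowledge Distillation with WordNet-Relaxed Text Embeddings for Robust Image Classification](http://arxiv.org/abs/2503.24017)|null|Crossmodal knowledge distillation (KD) aims to enhance a unimodal student using a multimodal teacher model. In particular, when the teacher's modalities include the student's, additional complementary information can be exploited to improve knowledge transfer. In supervised image classification, image datasets typically include class labels that represent high-level concepts, suggesting a natural avenue for incorporating textual cues for crossmodal KD. However, these labels rarely capture the deeper semantic structures in real-world visuals and can lead to label leakage if used directly as inputs, ultimately limiting KD performance. To address these issues, we propose a multi-teacher crossmodal KD framework that integrates CLIP image embeddings with learnable WordNet-relaxed text embeddings under a hierarchical loss. By avoiding direct use of exact class names and instead using semantically richer WordNet expansions, we mitigate label leakage and introduce more diverse textual cues. Experiments show that this strategy significantly boosts student performance, whereas noisy or overly precise text embeddings hinder distillation efficiency. Interpretability analyses confirm that WordNet-relaxed prompts encourage heavier reliance on visual features over textual shortcuts, while still effectively incorporating the newly introduced textual cues. Our method achieves state-of-the-art or second-best results on six public datasets, demonstrating its effectiveness in advancing crossmodal KD.||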
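|**2025-03-31**|[Spectral-Adaptive Modulation Networks for Visual Perception](http://arxiv.org/abs/2503.23947)|null|Recent studies have shown that 2D convolution and self-attention exhibit distinct spectral behaviors, and that optimizing their spectral properties can enhance vision model performance. However, theoretical analyses still struggle to explain why 2D convolution is more effective at high-pass filtering than self-attention and why larger kernels favor shape bias, akin to self-attention. In this paper, we employ graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self-attention within a unified framework. Our results corroborate previous empirical findings and reveal that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Building on this insight, we introduce a Spectral-Adaptive Modulation (SPAM) mixer, which processes visual features in a spectral-adaptive manner using multi-scale convolutional kernels and a spectral re-scaling mechanism to refine spectral components. Based on SPAM, we develop SPANetV2, a novel vision backbone. Extensive experiments show that SPANetV2 outperforms state-of-the-art models across multiple vision tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.||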
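|**2025-03-30**|[DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution](http://arxiv.org/abs/2503.23580)|null|Large-scale pre-trained diffusion models are increasingly used to solve the real-world image super-resolution (Real-ISR) problem because of their rich generative priors. The recently developed diffusion transformer (DiT) has shown overwhelming performance over traditional UNet-based architectures in image generation, which raises the question: can we adopt the advanced DiT-based diffusion model for Real-ISR? To this end, we propose DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR. Instead of directly injecting embeddings extracted from low-resolution (LR) images as ControlNet does, we integrate the LR embeddings into the original attention mechanism of DiT, allowing bidirectional information flow between the LR latent and the generated latent. The sufficient interaction of these two streams lets the LR stream evolve with the diffusion process, producing progressively refined guidance that better aligns with the generated latent at each diffusion step. In addition, the LR guidance is injected into the generated latent via a cross-stream convolution layer, compensating for DiT's limited ability to capture local information. These simple but effective designs give the DiT model superior performance in Real-ISR, as demonstrated by extensive experiments. Project page: https://adam-duan.github.io/projects/dit4sr/.||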
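|**2025-03-30**|[Re-Aligning Language to Visual Objects with an Agentic Workflow](http://arxiv.org/abs/2503.23508)|null|Language-based object detection (LOD) aims to align visual objects with language expressions. A large amount of paired data is used to improve LOD model generalization. During training, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, facilitating the scaling up of training data. In this process, we observe that VLM hallucinations introduce inaccurate object descriptions (e.g., object name, color, and shape) that degrade VL alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by an LLM that re-aligns language to visual objects by adaptively adjusting image and text prompts. We name this workflow Real-LOD; it includes planning, tool use, and reflection steps. Given an image with detected objects and raw VLM language expressions, Real-LOD automatically reasons about its state and arranges an action based on our neural symbolic designs (i.e., planning). The action adaptively adjusts the image and text prompts and sends them to VLMs for object re-description (i.e., tool use). We then use another LLM to analyze these refined expressions for feedback (i.e., reflection). These steps run in a loop to gradually improve language descriptions so that they re-align to visual objects. We construct a dataset containing a small amount of 0.18M images with re-aligned language expressions and train a prevalent LOD model that surpasses existing LOD methods by around 50% on the standard benchmarks. With automatic VL refinement, our Real-LOD workflow shows the potential to preserve data quality while scaling up data quantity, which further improves LOD performance from a data-alignment perspective.||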
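|**2025-03-30**|[Efficient Dynamic Attention 3D Convolution for Hyperspectral Image Classification](http://arxiv.org/abs/2503.23472)|null|Deep neural networks face several challenges in hyperspectral image classification, including insufficient use of joint spatial-spectral information, vanishing gradients as depth increases, and overfitting. To improve feature extraction efficiency while skipping redundant information, this paper proposes a dynamic attention convolution design based on an improved 3D-DenseNet model. The design employs multiple parallel convolutional kernels instead of a single kernel and assigns dynamic attention weights to these parallel convolutions. This dynamic attention mechanism achieves adaptive feature response based on spatial characteristics in the spatial dimension of hyperspectral images, focusing more on key spatial structures. In the spectral dimension, it enables dynamic discrimination of different bands, alleviating the information redundancy and computational complexity caused by high spectral dimensionality. The DAC module enhances the model's representation capability through attention-based aggregation of multiple convolutional kernels, without increasing network depth or width. The method shows superior performance in both inference speed and accuracy, outperforming mainstream hyperspectral image classification methods on the IN, UP, and KSC datasets.||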
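|**2025-03-30**|[EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing](http://arxiv.org/abs/2503.23330)|**[link](https://github.com/xiangtodayeatswhat/eaglevision)**|Recent advances in multimodal large language models (MLLMs) have produced impressive results in various visual tasks. However, in remote sensing (RS), high resolution and the small proportion of objects pose challenges to existing MLLMs, which struggle with object-centric tasks, particularly precise localization and fine-grained attribute description for each object. These RS MLLMs have not yet surpassed classical visual perception models, as they provide only coarse image understanding, leading to limited gains in real-world scenarios. To address this gap, we establish EagleVision, an MLLM tailored for remote sensing that excels in object detection and attribute comprehension. Equipped with an Attribute Disentangle module, EagleVision learns disentangled vision tokens to express distinct attributes. To support object-level visual-language alignment, we construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning, along with a novel evaluation benchmark, EVBench. EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks, highlighting the mutual promotion between detection and understanding capabilities in MLLMs. The code, model, data, and demo will be available at https://github.com/XiangTodayEatsWhat/EagleVision.||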
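|**2025-03-28**|[Data Quality Matters: Quantifying Image Quality Impact on Machine Learning Performance](http://arxiv.org/abs/2503.22375)|null|Precise perception of the environment is critical in highly automated driving systems, which rely on machine learning tasks such as object detection and segmentation. Compression of sensor data is commonly used for data handling, while virtualization is used for hardware-in-the-loop validation. Both methods can alter sensor data and degrade model performance. This calls for a systematic approach to quantifying image validity. This paper presents a four-step framework to evaluate the impact of image modifications on machine learning tasks. First, a dataset with modified images is prepared to ensure one-to-one matching image pairs, enabling measurement of the deviations caused by compression and virtualization. Second, image deviations are quantified by comparing the effects of compression and virtualization against the original camera-based sensor data. Third, the performance of state-of-the-art object detection models is analyzed to determine how the altered input data affects perception tasks, including bounding box accuracy and reliability. Finally, a correlation analysis is performed to identify relationships between image quality and model performance. The results show that the LPIPS metric achieves the highest correlation between image deviation and machine learning performance across all evaluated machine learning tasks.||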
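|**2025-03-28**|[ForcePose: A Deep Learning Approach for Force Calculation Based on Action Recognition Using MediaPipe Pose Estimation Combined with Object Detection](http://arxiv.org/abs/2503.22363)|null|Force estimation in human-object interactions is crucial for fields such as ergonomics, physical therapy, and sports science. Traditional methods depend on specialized equipment such as force plates and sensors, which makes accurate assessment both expensive and restricted to laboratory settings. In this paper, we introduce ForcePose, a novel deep learning framework that estimates applied forces by combining human pose estimation with object detection. Our approach uses MediaPipe for skeletal tracking and SSD MobileNet for object recognition to create a unified representation of human-object interaction. We develop a specialized neural network that processes both spatial and temporal features to predict force magnitude and direction without requiring any physical sensors. After training on a dataset of 850 annotated videos with corresponding force measurements, our model achieves a mean absolute error of 5.83 N in force magnitude and 7.4 degrees in force direction. Compared to existing computer vision approaches, our method improves performance by 27.5% while still offering real-time performance on standard computing hardware. ForcePose opens new possibilities for force analysis in diverse real-world scenarios where traditional measurement tools are impractical or intrusive. This paper discusses our methodology, the dataset creation process, evaluation metrics, and potential applications in rehabilitation, ergonomics assessment, and athletic performance analysis.||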
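|**2025-03-28**|[Data-Free Universal Attack by Exploiting the Intrinsic Vulnerability of Deep Models](http://arxiv.org/abs/2503.22205)|**[link](https://github.com/yyt0718/Intri_Attack)**|Deep neural networks (DNNs) are susceptible to universal adversarial perturbations (UAPs), instance-agnostic perturbations that can fool a target model across a wide range of samples. Compared with instance-specific adversarial examples, UAPs are more challenging because they must generalize across different samples and models. Generating UAPs typically requires access to a large number of samples, which is a strong assumption in real-world tasks. In this paper, we propose a novel data-free method called Intrinsic UAP (IntriUAP) that exploits the intrinsic vulnerabilities of deep models. We analyze a series of popular deep models composed of linear and nonlinear layers with a Lipschitz constant of 1, and find that the vulnerability of these models is governed predominantly by their linear components. Based on this observation, we exploit the ill-conditioned nature of the linear components by aligning the UAP with the right singular vectors corresponding to the largest singular value of each linear layer. Notably, our method achieves highly competitive performance in attacking popular image classification models without using any image samples. We also evaluate the black-box attack performance of our method, showing that it matches state-of-the-art data-free baselines on models that conform to our theoretical framework. Beyond the data-free assumption, IntriUAP also operates under a weaker assumption in which the adversary can access only some of the victim model's layers. Experiments show that the attack success rate decreases by only 4% when the adversary has access to only 50% of the linear layers in the victim model.||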
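|**2025-03-28**|[Hyperspectral Adapter for Object Tracking based on Hyperspectral Video](http://arxiv.org/abs/2503.22199)|null|Object tracking based on hyperspectral video is attracting increasing attention because of its rich spectral and motion information. Current mainstream hyperspectral tracking methods adapt to hyperspectral tasks by fine-tuning an entire pre-trained RGB-based tracking network on hyperspectral datasets, which has achieved impressive results in challenging scenarios. However, the performance of hyperspectral trackers is limited by the loss of spectral information during the transformation, and fine-tuning the entire pre-trained network is inefficient for practical applications. To address these issues, this paper proposes a new hyperspectral object tracking method, the hyperspectral adapter for tracking (HyA-T). It introduces a hyperspectral adapter for self-attention (HAS) and a hyperspectral adapter for the multilayer perceptron (HAM), which generate adaptation information and transfer the multi-head self-attention (MSA) module and the multilayer perceptron (MLP) of the pre-trained network to the hyperspectral tracking task by augmenting the adaptation information into the computation of the MSA and MLP. In addition, a hyperspectral enhancement of input (HEI) module is proposed to augment the original spectral information into the input of the tracking network. The proposed method extracts spectral information directly from hyperspectral images, which prevents the loss of spectral information. Moreover, only the parameters of the proposed modules need to be fine-tuned, which is more efficient than existing methods. Extensive experiments on four datasets with different spectral bands verify the effectiveness of the proposed method. HyA-T achieves state-of-the-art performance on all datasets.||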
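|**2025-03-28**|[Knowledge Rectification for Camouflaged Object Detection: Unlocking Insights from Low-Quality Data](http://arxiv.org/abs/2503.22180)|null|Low-quality data often suffer from insufficient image detail, introducing an extra implicit aspect of camouflage that complicates camouflaged object detection (COD). Existing COD methods focus primarily on high-quality data and overlook the challenges posed by low-quality data, which leads to significant performance degradation. We therefore propose KRNet, the first framework explicitly designed for COD on low-quality data. KRNet presents a Leader-Follower framework in which the Leader extracts dual gold-standard distributions from high-quality data, a conditional distribution and a hybrid distribution, to drive the Follower to rectify the knowledge learned from low-quality data. The framework further benefits from a cross-consistency strategy that improves the rectification of these distributions and a time-dependent conditional encoder that enriches distribution diversity. Extensive experiments on benchmark datasets demonstrate that KRNet outperforms state-of-the-art COD methods and super-resolution-assisted COD approaches, proving its effectiveness in tackling the challenges of low-quality data in COD.||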
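|**2025-03-28**|[A Survey on Remote Sensing Foundation Models: From Vision to Multimodality](http://arxiv.org/abs/2503.22081)|null|The rapid advancement of remote sensing foundation models, particularly vision and multimodal models, has significantly enhanced the capabilities of intelligent geospatial data interpretation. These models combine diverse data modalities, such as optical, radar, and LiDAR imagery, with textual and geographic information, enabling more comprehensive analysis and understanding of remote sensing data. Integrating multiple modalities improves performance on tasks such as object detection, land-cover classification, and change detection, which are often challenged by the complexity and heterogeneity of remote sensing data. Despite these advances, several challenges remain. The diversity of data types, the need for large-scale annotated datasets, and the complexity of multimodal fusion techniques pose significant obstacles to effective deployment. Moreover, the computational demands of training and fine-tuning multimodal models require substantial resources, further complicating their practical use in remote sensing image interpretation. This paper provides a comprehensive review of recent vision and multimodal foundation models for remote sensing, focusing on their architectures, training methods, datasets, and application scenarios. We discuss the key challenges these models face, such as data alignment, cross-modal transfer learning, and scalability, and identify emerging research directions aimed at overcoming these limitations. Our goal is to give a clear picture of the current landscape of remote sensing foundation models and to inspire future research that pushes the boundaries of what these models can achieve in real-world applications. The list of resources collected in the paper is available at https://github.com/IRIP-BUAA/A-Review-for-remote-sensing-vision-language-models.||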
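|**2025-03-27**|[AGILE: A Diffusion-Based Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification](http://arxiv.org/abs/2503.22019)|**[link](https://github.com/plant-ai-biophysics-lab/agile)**|Semantically consistent cross-domain image translation facilitates the generation of training data by transferring labels across domains, which is particularly useful for plant trait identification in agriculture. However, existing generative models struggle to maintain object-level accuracy when translating images between domains, especially when the domain gap is large. In this work, we introduce AGILE (Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification), a diffusion-based framework that uses optimized text embeddings and attention guidance to semantically constrain image translation. AGILE leverages pretrained diffusion models and publicly available agricultural datasets to improve the fidelity of translated images while preserving critical object semantics. Our method optimizes text embeddings to strengthen the correspondence between source and target images and guides attention maps during denoising to control object placement. We evaluate AGILE on cross-domain plant datasets and demonstrate its effectiveness in generating semantically accurate translated images. Quantitative experiments show that AGILE improves object detection performance in the target domain while maintaining realism and consistency. Compared with prior image translation methods, AGILE achieves superior semantic alignment, particularly in challenging cases where objects vary significantly or domain gaps are substantial.||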
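|**2025-03-27**|[FACETS: Efficient Once-for-all Object Detection via Constrained Iterative Search](http://arxiv.org/abs/2503.21999)|null|Neural architecture search (NAS) for deep learning object detection frameworks typically involves multiple modules that perform distinct tasks. These modules create a huge search space, so a search can take several GPU hours or even days depending on the complexity of the space. This makes joint optimization both challenging and computationally expensive. Meeting target-device constraints across modules adds further complexity to the optimization process. To address these challenges, we propose FACETS (Efficient Once-for-All Object Detection via Constrained Iterative Search), a novel unified iterative NAS method that refines the architecture of all modules in a cyclical manner. FACETS uses feedback from previous iterations, alternating between fixing the architecture of one module and optimizing the others. This approach reduces the overall search space while preserving interdependencies among modules, and incorporates constraints based on the target device's computational budget. In a controlled comparison against progressive and single-module search strategies, FACETS finds architectures with up to 4.75% higher accuracy twice as fast as progressive search strategies in earlier stages, while still being able to reach a global optimum. FACETS also demonstrates the ability to refine the search space iteratively, producing better-performing architectures over time. The refined search space yields candidate architectures whose mean accuracy is 27% higher than global search and 5% higher than progressive search methods.||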
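|**2025-03-27**|[Exponentially Weighted Instance-Aware Repeat Factor Sampling for Long-Tailed Object Detection Model Training in Unmanned Aerial Vehicles Surveillance Scenarios](http://arxiv.org/abs/2503.21893)|null|Object detection models often suffer from class imbalance, where rare categories appear far less frequently than common ones. Existing sampling-based rebalancing strategies, such as Repeat Factor Sampling (RFS) and Instance-Aware Repeat Factor Sampling (IRFS), mitigate this issue by adjusting sample frequencies based on image and instance counts. However, these methods rely on linear adjustments, which limits their effectiveness on long-tailed distributions. This work introduces Exponentially Weighted Instance-Aware Repeat Factor Sampling (E-IRFS), an extension of IRFS that applies exponential scaling to better differentiate between rare and frequent classes. E-IRFS adjusts sampling probabilities using an exponential function applied to the geometric mean of image and instance frequencies, ensuring a more adaptive rebalancing strategy. We evaluate E-IRFS on a dataset derived from the Fireman-UAV-RGBT dataset and four additional public datasets, using YOLOv11 object detection models to identify fire, smoke, people, and lakes in emergency scenarios. The results show that E-IRFS improves detection performance by 22% over the baseline and outperforms RFS and IRFS, especially for rare categories. The analysis also highlights that E-IRFS has a stronger effect on lightweight models with limited capacity, since these models rely more on data sampling strategies to address class imbalance. The findings demonstrate that E-IRFS improves rare object detection in resource-constrained environments, making it a suitable solution for real-time applications such as UAV-based emergency monitoring.||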
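|**2025-03-27**|[On Large Multimodal Models as Open-World Image Classifiers](http://arxiv.org/abs/2503.21851)|**[link](https://github.com/altndrr/lmms-owc)**|Traditional image classification requires a predefined list of semantic categories. In contrast, large multimodal models (LMMs) can sidestep this requirement by classifying images directly using natural language (for example, answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies of LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground-truth classes. We then evaluate 13 models across 10 benchmarks, covering prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, and showing how tailored prompting and reasoning can alleviate them.||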
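|**2025-03-27**|[Residual Learning Inspired Crossover Operator and Strategy Enhancements for Evolutionary Multitasking](http://arxiv.org/abs/2503.21347)|null|In evolutionary multitask optimization, strategies such as crossover operators and skill factor assignment are critical for effective knowledge transfer. Existing improvements to crossover operators focus mainly on low-dimensional variable combinations, such as arithmetic crossover or partially mapped crossover, which are insufficient for modeling complex high-dimensional interactions. In addition, static or semi-dynamic crossover strategies fail to adapt to the dynamic dependencies among tasks, and current multifactorial evolutionary algorithm frameworks often rely on fixed skill factor assignment strategies that lack flexibility. To address these limitations, this paper proposes the Multifactorial Evolutionary Algorithm based on Residual Learning (MFEA-RL). The method uses a Very Deep Super-Resolution (VDSR) model to generate high-dimensional residual representations of individuals, strengthening the modeling of complex relationships within dimensions. A ResNet-based mechanism dynamically assigns skill factors to improve task adaptability, while a random mapping mechanism performs crossover efficiently and reduces the risk of negative transfer. Theoretical analysis and experimental results show that MFEA-RL outperforms state-of-the-art multitask optimization algorithms. It shows excellent convergence and adaptability on standard evolutionary multitasking benchmarks, including CEC2017-MTSO and WCCI2020-MTSO. Its effectiveness is further validated in a real-world application scenario.||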
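|**2025-03-27**|[Improving $(α, f)$ -Byzantine Resilience in Federated Learning via layerwise aggregation and cosine distance](http://arxiv.org/abs/2503.21244)|**[link](https://github.com/ari-dasci/S-layerwise_cosine_aggregation)**|The rapid development of artificial intelligence systems has amplified societal concerns about their use, creating a need for regulatory frameworks that encompass data privacy. Federated learning (FL) is regarded as a potential solution to data privacy challenges in distributed machine learning, since it enables collaborative model training without sharing data. However, FL systems remain vulnerable to Byzantine attacks, in which malicious nodes contribute corrupted model updates. Although Byzantine-resilient operators have emerged as a widely adopted class of robust aggregation algorithms for mitigating these attacks, their efficacy diminishes significantly in high-dimensional parameter spaces, sometimes leading to poorly performing models. This paper introduces Layerwise Cosine Aggregation, a novel aggregation scheme designed to enhance the robustness of these rules in high-dimensional settings while preserving computational efficiency. A theoretical analysis shows that the proposed Layerwise Cosine Aggregation is more robust than the original robust aggregation operators. Empirical evaluation across diverse image classification datasets, under varying data distributions and Byzantine attack scenarios, consistently demonstrates the improved performance of Layerwise Cosine Aggregation, with gains of up to 16% in model accuracy.||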
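|**2025-03-27**|[PLAIN: Scalable Estimation Architecture for Integrated Sensing and Communication](http://arxiv.org/abs/2503.21242)|**[link](https://github.com/bashar-tahir/plain)**|Integrated sensing and communication (ISAC) is envisioned as one of the paradigms for building next-generation mobile networks, extending localization and tracking capabilities and enabling environment-aware wireless access. A key aspect of sensing integration is parameter estimation, which involves extracting information about the surrounding environment, such as the direction, distance, and velocity of the objects within it. The problem is typically high-dimensional and leads to significant computational complexity if performed jointly across multiple sensing dimensions, such as space, frequency, and time. In addition, because sensing is incorporated on top of data transmission, the time window available for sensing may be short, resulting in an estimation problem with access to only a single snapshot. In this work, we propose PLAIN, a tensor-based estimation architecture that scales flexibly with multiple sensing dimensions and can handle high dimensionality, limited measurement time, and super-resolution requirements. It consists of three stages: a compression stage, in which the high-dimensional input is converted to a lower dimensionality without sacrificing resolution; a decoupled estimation stage, in which the parameters across the different dimensions are estimated in parallel with low complexity; and an input-based fusion stage, in which the decoupled parameters are fused together to form a paired multidimensional estimate. We investigate the performance of the architecture for different configurations and compare it against practical sequential and joint estimation baselines, as well as theoretical bounds. Our results show that PLAIN, using tools from tensor algebra, subspace-based processing, and compressed sensing, scales flexibly with dimensionality while retaining low complexity and super-resolution.||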
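|**2025-03-27**|[Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection](http://arxiv.org/abs/2503.21099)|**[link](https://github.com/zyrant/cpdet3d)**|Both indoor and outdoor scene perception are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes and do not consider indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes that learns class prototypes to make effective use of unlabeled objects. Specifically, we first propose a prototype-based object mining module that converts unlabeled object mining into a matching problem between class prototypes and unlabeled features. Using the optimal transport matching results, we assign prototype labels to high-confidence features, thereby mining the unlabeled objects. We then present a multi-label cooperative refinement module that effectively recovers missed detections through pseudo-label quality control and prototype label cooperation. Experiments show that our method achieves state-of-the-art performance under the one-object-per-scene sparse supervised setting on both indoor and outdoor datasets. With only one labeled object per scene, our method reaches about 78%, 90%, and 96% of the performance of fully supervised detectors on ScanNet V2, SUN RGB-D, and KITTI, respectively, highlighting the scalability of our method. Code is available at https://github.com/zyrant/CPDet3D.||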
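|**2025-03-27**|[Neural Architecture Search by Learning a Hierarchical Search Space](http://arxiv.org/abs/2503.21061)|null|Monte Carlo Tree Search (MCTS) is a powerful tool for many non-differentiable search-related problems, such as adversarial games. However, the performance of this approach depends heavily on the order of the nodes considered at each branching of the tree. If the first branches cannot distinguish between promising and deceptive configurations for the final task, search efficiency degrades exponentially. In neural architecture search (NAS), because only the final architecture matters, the visiting order of the branches can be optimized to improve learning. In this paper, we study the application of MCTS to NAS for image classification. We analyze several sampling methods and branching alternatives for MCTS and propose to learn the branching through hierarchical clustering based on architecture similarity. The similarity is measured by the pairwise distance of the architectures' output vectors. Extensive experiments on two challenging benchmarks, on CIFAR10 and ImageNet, show that MCTS can produce promising solutions more efficiently than other NAS approaches when given a good branching hierarchy.||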
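|**2025-03-26**|[TS-Inverse: A Gradient Inversion Attack Tailored for Federated Time Series Forecasting Models](http://arxiv.org/abs/2503.20952)|**[link](https://github.com/capsar/ts-inverse)**|Federated learning (FL) for time series forecasting (TSF) enables clients with privacy-sensitive time series (TS) data to collaboratively learn accurate forecasting models, for example in energy load prediction. Privacy risks in FL persist, however, because servers can potentially reconstruct clients' training data through gradient inversion attacks (GIAs). Although GIAs have been demonstrated for image classification tasks, little is known about them for time series regression tasks. In this paper, we first conduct an extensive empirical study of inverting TS data across 4 TSF models and 4 datasets, identifying the unique challenges of reconstructing both the observations and the targets of TS data. We then propose TS-Inverse, a novel GIA that improves the inversion of TS data by (i) learning a gradient inversion model that outputs quantile predictions, (ii) a unique loss function that incorporates periodicity and trend regularization, and (iii) regularization according to the quantile predictions. Our evaluations demonstrate the strong performance of TS-Inverse, which improves on existing GIA methods for TS data by at least 2x to 10x in terms of the sMAPE metric. Code repository: https://github.com/Capsar/ts-inverse||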
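|**2025-03-26**|[Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications](http://arxiv.org/abs/2503.20516)|null|Small object detection (SOD) is a critical yet challenging task in computer vision, with applications spanning surveillance, autonomous driving systems, medical imaging, and remote sensing. Unlike larger objects, small objects contain limited spatial and contextual information, which makes accurate detection difficult. Challenges such as low resolution, occlusion, background interference, and class imbalance further complicate the problem. This survey provides a comprehensive review of recent advances in small object detection using deep learning, focusing on articles published in Q1 journals during 2024-2025. We analyze challenges, state-of-the-art techniques, datasets, evaluation metrics, and real-world applications. Recent advances in deep learning have introduced innovative solutions, including multi-scale feature extraction, super-resolution (SR) techniques, attention mechanisms, and Transformer-based architectures. Improvements in data augmentation, synthetic data generation, and transfer learning have addressed data scarcity and domain adaptation. Furthermore, emerging trends such as lightweight neural networks, knowledge distillation (KD), and self-supervised learning offer promising directions for improving detection efficiency, particularly in resource-constrained settings such as unmanned aerial vehicle (UAV)-based surveillance and edge computing. We also review widely used datasets along with standard evaluation metrics such as mean average precision (mAP) and size-specific AP scores. The survey highlights real-world applications, including traffic monitoring, maritime surveillance, industrial defect detection, and precision agriculture. Finally, we discuss open research challenges and future directions, emphasizing the need for robust domain adaptation techniques, better feature fusion strategies, and real-time performance optimization.||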
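|**2025-03-26**|[FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System](http://arxiv.org/abs/2503.20499)|null|In this work, we propose FireRedTTS-1S, a high-quality streaming foundation text-to-speech system upgraded from the streamable version of FireRedTTS. FireRedTTS-1S achieves streaming generation in two steps: text-to-semantic decoding and semantic-to-acoustic decoding. In text-to-semantic decoding, a semantic-aware speech tokenizer converts the speech signal into semantic tokens, which can be synthesized from text autoregressively by a semantic language model. Meanwhile, the semantic-to-acoustic decoding module translates the generated semantic tokens into the speech signal in a streaming fashion, using a super-resolution causal audio codec and a multi-stream acoustic language model. This design allows us to produce high-quality speech audio in zero-shot settings while providing a real-time generation process with low latency (under 150 ms). In zero-shot voice cloning experiments, objective results confirm FireRedTTS-1S as a high-quality foundation model with intelligibility and speaker similarity comparable to industrial baseline systems. The subjective scores of FireRedTTS-1S further highlight its impressive synthesis performance, reaching quality comparable to ground-truth recordings. These results validate FireRedTTS-1S as a high-quality streaming foundation TTS system.||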
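|**2025-03-27**|[Consistency Trajectory Matching for One-Step Generative Super-Resolution](http://arxiv.org/abs/2503.20349)|null|Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Distillation techniques are therefore used to accelerate the multi-step teacher model into a one-step student model. However, these methods significantly raise training costs and are constrained by the teacher model, which limits the performance of the student model. To overcome these challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that generates photo-realistic SR results in one step. Specifically, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. We then apply the Consistency Training (CT) strategy to learn this mapping directly in one step, eliminating the need for a pretrained diffusion model. To further enhance performance and make better use of ground-truth data during training, we aim to align the distribution of SR results more closely with that of natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution through our carefully designed Distribution Trajectory Matching (DTM) loss, improving the realism of the recovered HR images. Comprehensive experimental results demonstrate that the proposed method attains comparable or even superior performance on both synthetic and real datasets while maintaining minimal inference latency.||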
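|**2025-03-26**|[Progressive Focused Transformer for Single Image Super-Resolution](http://arxiv.org/abs/2503.20337)|**[link](https://github.com/labshuhanggu/pft-sr)**|Transformer-based methods have achieved remarkable results in image super-resolution because they can capture non-local dependencies in low-quality input images. However, this feature-intensive modeling approach is computationally expensive, because it computes the similarities of numerous features that are irrelevant to the query features when obtaining attention weights. These unnecessary similarity calculations not only degrade reconstruction performance but also introduce significant computational overhead. How to accurately identify the features that are important to the current query features, and avoid similarity calculations between irrelevant features, remains an open problem. To address this issue, we propose a novel and effective Progressive Focused Transformer (PFT) that links all isolated attention maps in the network through Progressive Focused Attention (PFA) to focus attention on the most important tokens. PFA not only enables the network to capture more critical similar features, but also significantly reduces the computational cost of the overall network by filtering out irrelevant features before calculating similarities. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance on various single image super-resolution benchmarks.||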
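|**2025-03-25**|[Extensions of regret-minimization algorithm for optimal design](http://arxiv.org/abs/2503.19874)|null|We explore extensions and applications of the regret-minimization framework introduced in \cite{design} for solving optimal experimental design problems. Specifically, we incorporate the entropy regularizer into this framework, leading to a novel sample selection objective and a provable sample complexity bound that guarantees a $(1+\epsilon)$-near-optimal solution. We further extend the method to handle regularized optimal design settings. As an application, we use our algorithm to select a small set of representative samples from image classification datasets without relying on label information. To evaluate the quality of the selected samples, we train a logistic regression model and compare its performance against several baseline sampling strategies. Experimental results on MNIST, CIFAR-10, and a 50-class subset of ImageNet show that our approach consistently outperforms competing methods in most cases.||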
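|**2025-03-25**|[Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models](http://arxiv.org/abs/2503.19707)|**[link](https://github.com/stogiannidis/srbench)**|Vision-language models (VLMs) have recently emerged as powerful tools, excelling at tasks that combine visual and textual understanding, such as image captioning, visual question answering, and image-text retrieval. However, existing VLM benchmarks that include spatial components often fail to isolate spatial reasoning from related tasks such as object detection or semantic comprehension. In this paper, we address these deficiencies with a multi-faceted approach to understanding spatial reasoning. Informed by the diverse and multi-dimensional nature of human spatial reasoning abilities, we first give a detailed analysis that delineates the core elements of spatial reasoning: spatial relations, orientation and navigation, mental rotation, and spatial visualization. We then assess the performance of these models on both synthetic and real-world images, bridging controlled and naturalistic contexts. We analyze 13 state-of-the-art vision-language models and uncover pivotal insights into their spatial reasoning performance. Our results reveal profound shortcomings in current VLMs, with average accuracy across the 13 models close to random chance, underscoring that spatial reasoning remains a persistent obstacle. This work not only exposes the pressing need to advance spatial reasoning in VLMs but also lays a solid foundation for future exploration. Code is available on GitHub (https://github.com/stogiannidis/srbench) and the dataset is available on HuggingFace (https://huggingface.co/datasets/stogiannidis/srbench).||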
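|**2025-03-25**|[BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction](http://arxiv.org/abs/2503.19658)|null|Manual digitization of bibliographic metadata is time-consuming and labor-intensive, especially for historical and real-world archives with highly variable formatting. Despite advances in machine learning, the absence of dedicated datasets for metadata extraction hinders automation. To address this gap, we introduce BiblioPage, a dataset of scanned title pages annotated with structured bibliographic metadata. The dataset consists of approximately 2,000 monograph title pages collected from 14 Czech libraries, spanning a wide range of publication periods, typographic styles, and layout structures. Each title page is annotated with 16 bibliographic attributes, including title, contributors, and publication metadata, along with precise positional information in the form of bounding boxes. To extract structured information from this dataset, we evaluated object detection models such as YOLO and DETR combined with Transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59. We also assessed various visual large language models, including Llama 3.2-Vision and GPT-4o, with the best model reaching an F1 score of 67. BiblioPage serves as a real-world benchmark for bibliographic metadata extraction, contributing to document understanding, document question answering, and document information extraction. The dataset and evaluation scripts are available at https://github.com/DCGM/biblio-dataset||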
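|**2025-03-25**|[Burst Image Super-Resolution with Mamba](http://arxiv.org/abs/2503.19634)|null|Burst image super-resolution (BISR) aims to enhance the resolution of a keyframe by leveraging information from multiple low-resolution images captured in quick succession. In the deep learning era, BISR methods have evolved from fully convolutional networks to Transformer-based architectures, which, despite their effectiveness, suffer from the quadratic complexity of self-attention. We see Mamba as the next natural step in the evolution of this field, offering a comparable global receptive field and selective information routing with only linear time complexity. In this work, we introduce BurstMamba, a Mamba-based architecture for BISR. Our approach decouples the task into two specialized branches: a spatial module for keyframe super-resolution and a temporal module for subpixel prior extraction, striking a balance between computational efficiency and burst information integration. To further enhance burst processing with Mamba, we propose two novel strategies: (i) optical flow-based serialization, which aligns burst sequences only during state updates to preserve subpixel details, and (ii) a wavelet-based reparameterization of the state-space update rules that prioritizes high-frequency features for improved burst-to-keyframe information passing. Our framework achieves state-of-the-art performance on the public SyntheticSR, RealBSR-RGB, and RealBSR-RAW benchmarks.||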
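|**2025-03-25**|[Single Shot AI-assisted quantification of KI-67 proliferation index in breast cancer](http://arxiv.org/abs/2503.19606)|null|Ki-67 is a key proliferation marker in breast cancer, and its reliable quantification is essential for molecular subtyping and informed treatment planning. Conventional approaches, including visual estimation and manual counting, suffer from interobserver variability and limited reproducibility. This study introduces an AI-assisted method for automated Ki-67 scoring based on the YOLOv8 object detection framework. High-resolution digital images (40x magnification) of immunohistochemically stained tumor sections were captured from Ki-67 hotspot regions and manually annotated by domain experts to distinguish Ki-67-positive from Ki-67-negative tumor cells. The dataset was augmented and divided into training (80%), validation (10%), and test (10%) subsets. Among the YOLOv8 variants tested, the Medium model achieved the best performance, with a mean average precision at 50% intersection over union (mAP50) exceeding 85% for Ki-67-positive cells. The method offers an efficient, scalable, and objective alternative to conventional scoring, supporting greater consistency in Ki-67 evaluation. Future directions include developing user-friendly clinical interfaces and expanding to multi-institutional datasets to improve generalizability and facilitate broader adoption in diagnostic practice.||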
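|**2025-03-25**|[VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models](http://arxiv.org/abs/2503.19530)|null|Popular PEFT methods achieve parameter efficiency by assuming that incremental weight updates are inherently low-rank, which often leads to a performance gap relative to full fine-tuning. Although recent methods have attempted to address this limitation, they typically lack sufficient parameter and memory efficiency. We propose VectorFit, an effective and easily deployable approach that adaptively trains the singular vectors and biases of pre-trained weight matrices. We show that exploiting the structural and transformational characteristics of pre-trained weights enables high-rank updates comparable to those of full fine-tuning. As a result, VectorFit achieves superior performance with 9x fewer trainable parameters than state-of-the-art PEFT methods. Through extensive experiments on 17 datasets spanning diverse language and vision tasks, such as natural language understanding and generation, question answering, image classification, and image generation, we show that VectorFit consistently outperforms baselines, even in extremely low-budget scenarios.||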
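|**2025-03-25**|[Single-Step Latent Consistency Model for Remote Sensing Image Super-Resolution](http://arxiv.org/abs/2503.19505)|null|Recent advances in diffusion models (DMs) have greatly advanced remote sensing image super-resolution (RSISR). However, their iterative sampling processes often result in slow inference, limiting their use in real-time tasks. To address this challenge, we propose the latent consistency model for super-resolution (LCMSR), a novel single-step diffusion approach designed to improve both efficiency and visual quality in RSISR tasks. Our proposal is structured into two distinct stages. In the first stage, we pretrain a residual autoencoder to encode the differential information between high-resolution (HR) and low-resolution (LR) images, moving the diffusion process into a latent space to reduce computational costs. The second stage focuses on consistency diffusion learning, which aims to learn the distribution of residual encodings in the latent space, conditioned on LR images. The consistency constraint enforces that predictions at any two timesteps along the reverse diffusion trajectory remain consistent, enabling direct mapping from noise to data. As a result, the proposed LCMSR reduces the iterative steps of traditional diffusion models from 50-1000 or more to just a single step, significantly improving efficiency. Experimental results demonstrate that LCMSR effectively balances efficiency and performance, achieving inference times comparable to non-diffusion models while maintaining high-quality output.||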
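|**2025-03-25**|[MATT-GS: Masked Attention-based 3DGS for Robot Perception and Object Detection](http://arxiv.org/abs/2503.19330)|null|This paper presents a novel masked attention-based 3D Gaussian Splatting (3DGS) approach to enhance robotic perception and object detection in industrial and smart factory environments. U2-Net is used for background removal to isolate target objects from raw images, minimizing clutter and ensuring that the model processes only relevant data. In addition, a Sobel filter-based attention mechanism is integrated into the 3DGS framework to enhance fine details, capturing critical features such as screws, wires, and intricate textures that are essential for high-precision tasks. We validate our approach using quantitative metrics, including L1 loss, SSIM, and PSNR, comparing the performance of the background-removed, attention-incorporated 3DGS model against the ground-truth images and the original 3DGS training baseline. The results show significant improvements in visual fidelity and detail preservation, highlighting the effectiveness of our method in enhancing robotic vision for object recognition and manipulation in complex industrial settings.||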
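|**2025-03-25**|[LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text](http://arxiv.org/abs/2503.19311)|**[link](https://github.com/mitsuichen14/lrsclip)**|This study addresses the technical bottlenecks that remote sensing vision-language foundation models (VLFMs) face in handling long text, as well as the "hallucination" issues caused by insufficient short-text information. We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) by integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs and is the first to provide both short and long texts, solving the problem of semantic granularity limitations in existing datasets; (2) we design the LRSCLIP architecture based on Long-CLIP's KPS module, which extends CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that in the zero-shot long-text cross-modal retrieval task, LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline. For the zero-shot short-text cross-modal retrieval task, LRSCLIP improves on the current best model, GeoRSCLIP, by 0.17%, 0.67%, and 0.92% in Text to Image R@1, Image to Text R@1, and mR on RSITMD, respectively, and by 0.04%, 2.93%, and 1.28% on RSICD. In the zero-shot image classification task (average accuracy=75.75%) and the semantic localization task (Rmi=0.7653), LRSCLIP achieves state-of-the-art performance. These results validate LRSCLIP's dual advantages in fine-grained semantic understanding and global feature matching. This work provides a new benchmark model and data support for remote sensing multimodal learning. The code has been open-sourced and is available at https://github.com/MitsuiChen14/LRSCLIP.||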
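|**2025-03-25**|[Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment](http://arxiv.org/abs/2503.19295)|**[link](https://github.com/GuangluDong0728/SFD)**|Generative adversarial networks (GANs) have been widely applied to image super-resolution (SR) to enhance perceptual quality. However, most existing GAN-based SR methods perform coarse-grained discrimination directly on images and ignore their semantic information, making it difficult for the super-resolution network (SRN) to learn fine-grained, semantically relevant texture details. To alleviate this issue, we propose a semantic feature discrimination method, SFD, for perceptual SR. Specifically, we first design a feature discriminator (Feat-D) that discriminates the pixel-wise middle semantic features from CLIP, aligning the feature distributions of SR images with those of high-quality images. We further propose a text-guided discrimination method (TG-D) that introduces learnable prompt pairs (LPP) in an adversarial manner to perform discrimination on the more abstract output features of CLIP, further strengthening the discriminative ability of our method. With both Feat-D and TG-D, SFD can effectively distinguish between the semantic feature distributions of low-quality and high-quality images, encouraging the SRN to generate more realistic and semantically relevant textures. Furthermore, based on the trained Feat-D and LPP, we propose a novel opinion-unaware no-reference image quality assessment (OU NR-IQA) method, SFD-IQA, which greatly improves OU NR-IQA performance without any additional targeted training. Extensive experiments on classical SISR, real-world SISR, and OU NR-IQA tasks demonstrate the effectiveness of the proposed methods.||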
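|**2025-03-21**|[An Iterative Feedback Mechanism for Improving Natural Language Class Descriptions in Open-Vocabulary Object Detection](http://arxiv.org/abs/2503.17285)|null|Recent advances in open-vocabulary object detection models will enable automatic target recognition systems to be sustained and repurposed by non-technical end users for a variety of applications or missions. New, and potentially nuanced, classes can be defined in the field with natural language text descriptions immediately before runtime, without retraining the model. We present an approach for improving non-technical users' natural language text descriptions of their objects of interest, based on a combination of analysis techniques on the text embeddings and appropriate combinations of embeddings for contrastive examples. We quantify the improvement that our feedback mechanism provides by demonstrating performance with multiple publicly available open-vocabulary object detection models.||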
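|**2025-03-21**|[Leveraging Text-to-Image Generation for Handling Spurious Correlation](http://arxiv.org/abs/2503.17226)|null|Deep neural networks trained with empirical risk minimization (ERM) perform well when training and test data come from the same domain, but they often fail to generalize to out-of-distribution samples. In image classification, these models may rely on spurious correlations that often exist between labels and irrelevant image features, making predictions unreliable when those features are absent. We propose a technique that generates training samples with text-to-image (T2I) diffusion models to address the spurious correlation problem. First, we compute the best describing token for the visual features pertaining to the causal components of samples through a textual inversion mechanism. Then, using a language segmentation method and a diffusion model, we generate new samples by combining the causal component with elements from other classes. We also carefully prune the generated samples based on the prediction probabilities and attribution scores of the ERM model to ensure that they match our target compositions. Finally, we retrain the ERM model on the augmented dataset. This process reduces the model's reliance on spurious correlations by learning from carefully crafted samples in which the correlation does not exist. Our experiments show that, across different benchmarks, our technique achieves better worst-group accuracy than existing state-of-the-art methods.||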
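|**2025-03-21**|[Which2comm: An Efficient Collaborative Perception Framework for 3D Object Detection](http://arxiv.org/abs/2503.17175)|null|Collaborative perception allows real-time information exchange between agents, offering a valuable opportunity to enhance the perception capabilities of individual agents. However, limited communication bandwidth in practical scenarios restricts the volume of inter-agent data transmission, degrading the performance of collaborative perception systems. This implies a trade-off between perception performance and communication cost. To address this issue, we propose Which2comm, a novel multi-agent 3D object detection framework that leverages object-level sparse features. By integrating the semantic information of objects into 3D object detection boxes, we introduce semantic detection boxes (SemDBs). Transmitting these information-rich object-level sparse features among agents not only significantly reduces the required communication volume but also improves 3D object detection performance. Specifically, we construct a fully sparse network to extract SemDBs from individual agents, and use a temporal fusion approach with a relative temporal encoding mechanism to obtain comprehensive spatiotemporal features. Extensive experiments on the V2XSet and OPV2V datasets show that Which2comm consistently outperforms other state-of-the-art methods in both perception performance and communication cost, and exhibits better robustness to real-world latency. These results indicate that, for multi-agent collaborative 3D object detection, transmitting only object-level sparse features is sufficient to achieve high-precision and robust performance.||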
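|**2025-03-21**|[Beyond Accuracy: What Matters in Designing Well-Behaved Models?](http://arxiv.org/abs/2503.17110)|null|Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect the quality dimensions. We reveal various new insights, such as (i) vision-language models exhibit high fairness on ImageNet-1k classification and strong robustness against domain changes; (ii) self-supervised learning is an effective training paradigm for improving almost all considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.||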
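|**2025-03-21**|[R2LDM: An Efficient 4D Radar Super-Resolution Framework Leveraging Diffusion Model](http://arxiv.org/abs/2503.17097)|null|We introduce R2LDM, an innovative approach for generating dense and accurate 4D radar point clouds, guided by corresponding LiDAR point clouds. Instead of using range images or bird's-eye view (BEV) images, we represent both LiDAR and 4D radar point clouds using voxel features, which more effectively capture 3D shape information. We then propose the Latent Voxel Diffusion Model (LVDM), which performs the diffusion process in the latent space. In addition, a novel Latent Point Cloud Reconstruction (LPCR) module is used to reconstruct point clouds from high-dimensional latent voxel features. As a result, R2LDM effectively generates LiDAR-like point clouds from paired raw radar data. We evaluate our approach on two different datasets, and the experimental results demonstrate that our model achieves 6- to 10-fold densification of radar point clouds, outperforming state-of-the-art baselines in 4D radar point cloud super-resolution. Furthermore, the enhanced radar point clouds generated by our method significantly improve downstream tasks, with a 31.7% improvement in point cloud registration recall and a 24.9% improvement in object detection accuracy.||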
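|**2025-03-21**|[Superpowering Open-Vocabulary Object Detectors for X-ray Vision](http://arxiv.org/abs/2503.17071)|null|Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges because of data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, raising average mAP by 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset are available at https://github.com/PAGF188/RAXO.||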
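|**2025-03-21**|[Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos](http://arxiv.org/abs/2503.17050)|null|Video camouflaged object detection (VCOD) aims to segment objects whose appearances closely resemble their surroundings, making it a challenging and emerging task. Existing vision models often struggle in such scenarios because camouflaged objects are hard to distinguish by appearance and because dynamic information in videos is insufficiently exploited. To address these challenges, we propose an end-to-end VCOD framework inspired by human memory recognition, which leverages historical video information by integrating memory reference frames for camouflaged sequence processing. Specifically, we design a dual-purpose decoder that simultaneously generates predicted masks and scores, enabling reference frame selection based on scores while introducing auxiliary supervision to enhance feature extraction. In addition, this study introduces a novel reference-guided multilevel asymmetric attention mechanism, which effectively integrates long-term reference information with short-term motion cues for comprehensive feature extraction. By combining these modules, we develop the Scoring, Remember, and Reference (SRR) framework, which efficiently extracts information to locate targets and uses memory guidance to improve subsequent processing. With its optimized module design and effective use of video data, our model achieves significant performance improvements, surpassing existing approaches by 10% on benchmark datasets while requiring fewer parameters (54M) and only a single pass through the video. The code will be made publicly available.||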
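|**2025-03-21**|[EasyRobust: A Comprehensive and Easy-to-use Toolkit for Robust and Generalized Vision](http://arxiv.org/abs/2503.16975)|null|Deep neural networks (DNNs) have shown great promise in computer vision tasks. However, machine vision achieved by DNNs is still not as robust as human perception. Adversarial attacks and data distribution shifts are two major scenarios that degrade machine performance and obstruct the wide deployment of machines in the real world. To overcome these obstacles and facilitate research on model robustness, we develop EasyRobust, a comprehensive and easy-to-use toolkit for training, evaluating, and analyzing robust vision models. EasyRobust targets two types of robustness: 1) adversarial robustness enables the model to defend against malicious inputs crafted by worst-case perturbations, also known as adversarial examples; 2) non-adversarial robustness enhances model performance on natural test images with corruptions or distribution shifts. Thorough benchmarks on image classification enable EasyRobust to provide an accurate robustness evaluation of vision models. We hope EasyRobust can help with training practically robust models and promote progress in both academia and industry toward closing the gap between human and machine vision. Code and models of EasyRobust have been open-sourced at https://github.com/alibaba/easyrobust.||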
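|**2025-03-21**|[Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision](http://arxiv.org/abs/2503.16811)|null|LiDAR-based 3D object detection and semantic segmentation are critical tasks in 3D scene understanding. Traditional detection and segmentation methods supervise their models through bounding box labels and semantic mask labels. However, these two independent sets of labels inherently contain significant redundancy. This paper aims to eliminate that redundancy by supervising 3D object detection using only semantic labels. The challenge arises from the incomplete geometric structure and boundary ambiguity of point-cloud instances, which lead to inaccurate pseudo-labels and poor detection results. To address these challenges, we propose a novel method named Seg2Box. We first introduce a Multi-Frame Multi-Scale Clustering (MFMS-C) module, which leverages the spatio-temporal consistency of point clouds to generate accurate box-level pseudo-labels. We also present the Semantic-Guiding Iterative-Mining Self-Training (SGIM-ST) module, which improves performance by progressively refining the pseudo-labels and mining the instances for which no pseudo-labels were generated. Experiments on the Waymo Open Dataset and nuScenes Dataset show that our method significantly outperforms other competitive methods by 23.7% and 10.3% in mAP, respectively. The results demonstrate the great label-efficiency potential and advancement of our method.||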
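|**2025-03-21**|[Local Ratio based Real-time Job Offloading and Resource Allocation in Mobile Edge Computing](http://arxiv.org/abs/2503.16794)|null|Mobile edge computing (MEC) has emerged as a promising paradigm that enables vehicles to handle computation-intensive and time-sensitive applications for intelligent transportation. Because resources in MEC are limited, effective resource management is crucial for improving system performance. While existing studies mostly focus on the job offloading problem and assume that job resource demands are fixed and given in advance, few consider the joint problem of job offloading (selecting the edge server for each job) and resource allocation (determining the bandwidth and computation resources for offloading and processing). This paper addresses the joint problem for deadline-constrained jobs in MEC under both communication and computation resource constraints, with the aim of maximizing the total utility gained by jobs. To solve this problem, we propose an approximation algorithm, $\mathtt{IDAssign}$, with an approximation bound of $\frac{1}{6}$, and experimentally evaluate the performance of $\mathtt{IDAssign}$ by comparing it with state-of-the-art heuristics using real-world taxi trajectories and an object detection application.||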
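|**2025-03-20**|[Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction](http://arxiv.org/abs/2503.16318)|null|DUSt3R has recently shown that many tasks in multi-view geometry, including estimating camera intrinsics and extrinsics, reconstructing the scene in 3D, and establishing image correspondences, can be reduced to predicting a pair of viewpoint-invariant point maps, that is, pixel-aligned point clouds defined in a common reference frame. This formulation is elegant and powerful, but it cannot handle dynamic scenes. To address this challenge, we introduce the concept of Dynamic Point Maps (DPM), extending standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. Our key intuition is that, when time is introduced, there are several possible spatial and temporal references that can be used to define the point maps. We identify a minimal set of such combinations that can be regressed by a network to solve the subtasks mentioned above. We train a DPM predictor on a mixture of synthetic and real data and evaluate it across diverse benchmarks for video depth prediction, dynamic point cloud reconstruction, 3D scene flow, and object pose tracking, achieving state-of-the-art performance. Code, models, and additional results are available at https://www.robots.ox.ac.uk/~vgg/research/dynamic-point-maps/.||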
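|**2025-03-20**|[RESFL: An Uncertainty-Aware Framework for Responsible Federated Learning by Balancing Privacy, Fairness and Utility in Autonomous Vehicles](http://arxiv.org/abs/2503.16251)|null|Autonomous vehicles (AVs) increasingly rely on federated learning (FL) to enhance perception models while preserving privacy. However, existing FL frameworks struggle to balance privacy, fairness, and robustness, leading to performance disparities across demographic groups. Privacy-preserving techniques such as differential privacy reduce the risk of data leakage but worsen unfairness by restricting access to the sensitive attributes needed for bias correction. This work explores the trade-off between privacy and fairness in FL-based object detection for AVs and introduces RESFL, an integrated solution that optimizes both. RESFL combines adversarial privacy disentanglement with uncertainty-guided fairness-aware aggregation. The adversarial component uses a gradient reversal layer to remove sensitive attributes, reducing privacy risks while maintaining fairness. The uncertainty-aware aggregation employs an evidential neural network to weight client updates adaptively, prioritizing contributions with lower fairness disparities and higher confidence. This ensures robust and equitable FL model updates. We evaluate RESFL on the FACET dataset and the CARLA simulator, assessing accuracy, fairness, privacy resilience, and robustness under varying conditions. Compared with other approaches, RESFL improves detection accuracy, reduces fairness disparities, and lowers privacy attack success rates, while demonstrating stronger robustness under adversarial conditions.||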
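|**2025-03-20**|[CLS-RL: Image Classification with Rule-Based Reinforcement Learning](http://arxiv.org/abs/2503.16188)|**[link](https://github.com/minglllli/CLS-RL)**|Classification is a core task in machine learning. Recent research has shown that although multimodal large language models (MLLMs) are initially poor at image classification, fine-tuning them with an adequate amount of data can significantly improve their performance, making them comparable to state-of-the-art classification models. However, acquiring large-scale labeled data is expensive. In this paper, we explore few-shot MLLM classification fine-tuning. We find that supervised fine-tuning (SFT) can cause severe overfitting and may even degrade performance relative to the zero-shot approach. To address this challenge, inspired by recent successes in rule-based reinforcement learning, we propose CLS-RL, which uses verifiable signals as rewards to fine-tune MLLMs. We find that CLS-RL outperforms SFT on most datasets and has a higher average accuracy in both the base-to-new and few-shot learning settings. Moreover, we observe a "free-lunch" phenomenon for CLS-RL: when models are fine-tuned on a particular dataset, their performance on other, distinct datasets may also improve over zero-shot models, even if those datasets differ in distribution and class names. This suggests that RL-based methods effectively teach models the fundamentals of classification. Finally, inspired by recent work on inference-time thinking, we re-examine the "thinking process" during fine-tuning, a critical aspect of RL-based methods, in the context of visual classification. We question whether such tasks require an extensive thinking process during fine-tuning, proposing that this may actually detract from performance. Based on this premise, we introduce the No-Thinking-CLS-RL method, which minimizes the thinking process during training by setting an equality accuracy reward. Our findings indicate that, with less fine-tuning time, No-Thinking-CLS-RL achieves better in-domain performance and generalization than CLS-RL.||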
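|**2025-03-20**|[Uncertainty Meets Diversity: A Comprehensive Active Learning Framework for Indoor 3D Object Detection](http://arxiv.org/abs/2503.16125)|null|Active learning has emerged as a promising approach to reduce the substantial annotation burden in 3D object detection tasks, spurring several initiatives in outdoor environments. However, its application in indoor environments remains unexplored. Compared with outdoor 3D datasets, indoor datasets face significant challenges, including fewer training samples per class, a greater number of classes, more severe class imbalance, and more diverse scene types and intra-class variance. This paper presents the first study of active learning for indoor 3D object detection, for which we propose a novel framework tailored to this task. Our method incorporates two key criteria, uncertainty and diversity, to actively select the most ambiguous and informative unlabeled samples for annotation. The uncertainty criterion accounts for both inaccurate detections and undetected objects, ensuring that the most ambiguous samples are prioritized. Meanwhile, the diversity criterion is formulated as a joint optimization problem that maximizes the diversity of both object class distributions and scene types, using a new Class-aware Adaptive Prototype (CAP) bank. The CAP bank dynamically allocates representative prototypes to each class, helping to capture varying intra-class diversity across different categories. We evaluate our method on SUN RGB-D and ScanNetV2, where it outperforms baselines by a significant margin, reaching over 85% of fully supervised performance with just 10% of the annotation budget.||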
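|**2025-03-20**|[Semantic-Guided Global-Local Collaborative Networks for Lightweight Image Super-Resolution](http://arxiv.org/abs/2503.16056)|**[link](https://github.com/fanamber831/sgglc-net)**|Single-image super-resolution (SISR) plays a pivotal role in enhancing the accuracy and reliability of measurement systems, which are integral to various vision-based instrumentation and measurement applications. These systems often require clear and detailed images for precise object detection and recognition. However, images captured by visual measurement tools frequently suffer from degradation, including blurring and loss of detail, which can impede measurement accuracy. As a potential remedy, in this paper we propose a Semantic-Guided Global-Local Collaborative Network (SGGLC-Net) for lightweight SISR. SGGLC-Net leverages semantic priors extracted from a pre-trained model to guide the super-resolution process, effectively enhancing image detail quality. Specifically, we propose a Semantic Guidance Module that seamlessly integrates the semantic priors into the super-resolution network, enabling the network to capture and use semantic priors more adeptly and thereby enhance image details. To further explore both local and non-local interactions for improved detail rendition, we propose a Global-Local Collaborative Module, which features three Global and Local Detail Enhancement Modules as well as a Hybrid Attention Mechanism that work together to efficiently learn more useful features. Our extensive experiments show that SGGLC-Net achieves competitive PSNR and SSIM values across multiple benchmark datasets, demonstrating higher performance with a 12.81G reduction in multi-adds compared with state-of-the-art lightweight super-resolution approaches. These improvements underscore the potential of our approach to enhance the precision and effectiveness of visual measurement systems. Code is available at https://github.com/fanamber831/SGGLC-Net.||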
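|**2025-03-20**|[DIPLI: Deep Image Prior Lucky Imaging for Blind Astronomical Image Restoration](http://arxiv.org/abs/2503.15984)|null|Contemporary image restoration and super-resolution techniques effectively harness deep neural networks, markedly outperforming traditional methods. However, astrophotography presents unique challenges for deep learning because of limited training data. This work explores hybrid strategies, such as the Deep Image Prior (DIP) model, which facilitates blind training but is susceptible to overfitting, artifact generation, and instability when handling noisy images. We propose enhancements to the baseline performance of the DIP model through several advanced techniques. First, we refine the model to process multiple frames concurrently, employing the Back Projection method and the TVNet model. Next, we adopt a Markov approach incorporating Monte Carlo estimation, Langevin dynamics, and a variational input technique to achieve unbiased estimates with minimal variance and to counteract overfitting effectively. Collectively, these modifications reduce the likelihood of noise learning and mitigate loss function fluctuations during training, enhancing result stability. We validated our algorithm on multiple image sets of astronomical and celestial objects, achieving performance that not only overcomes the limitations of Lucky Imaging, a classical computer vision technique that remains a standard in astronomical image reconstruction, but also surpasses the original DIP model and state-of-the-art Transformer- and diffusion-based models, underscoring the significance of our improvements.||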
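|**2025-03-20**|[Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation](http://arxiv.org/abs/2503.15969)|null|Vision-language models for Earth observation (EO) typically rely only on the visible spectrum of the data as model input, thus failing to leverage the rich spectral information available in the multispectral channels recorded by satellites. In this paper, we therefore introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset, and report on the performance gains due to the extended spectral range. Furthermore, we present the largest image-caption dataset for multispectral data to date, consisting of one million Sentinel-2 samples and corresponding textual descriptions generated with Llama3-LLaVA-Next and Overture Maps data. We develop a scalable captioning pipeline, which has been validated by domain experts. We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity. Our results demonstrate that Llama3-MS-CLIP significantly outperforms other RGB-based approaches, improving classification accuracy by 6.77% on average and retrieval performance by 4.63% mAP compared with the second-best model. Our results emphasize the relevance of multispectral vision-language learning. We will release the image-caption dataset, code, and model weights under an open-source license.||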
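|**2025-03-19**|[Graph-Weighted Contrastive Learning for Semi-Supervised Hyperspectral Image Classification](http://arxiv.org/abs/2503.15731)|**[link](https://github.com/kunzhan/semihsi)**|Most existing graph-based semi-supervised hyperspectral image classification methods rely on superpixel partitioning techniques. However, because superpixel boundaries are inaccurate, they suffer from misclassification of certain pixels; that is, the initial error of the superpixel partition limits overall classification performance. In this paper, we propose a novel graph-weighted contrastive learning approach that avoids superpixel partitioning and directly employs neural networks to learn hyperspectral image representations. Furthermore, while many approaches require all graph nodes to be available during training, our approach supports mini-batch training by processing only a subset of nodes at a time, reducing computational complexity and improving generalization to unseen nodes. Experimental results on three widely used datasets demonstrate the effectiveness of the proposed approach compared with baselines that rely on superpixel partitioning.||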
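|**2025-03-19**|[Toward task-driven satellite image super-resolution](http://arxiv.org/abs/2503.15474)|null|Super-resolution aims to reconstruct high-resolution images from low-resolution observations. State-of-the-art approaches based on deep learning achieve outstanding results, generating images of high perceptual quality. However, it often remains unclear whether the reconstructed details are close to the actual ground-truth information and whether they constitute a more valuable source for image analysis algorithms. In the reported work, we address the latter problem and present our efforts toward learning super-resolution algorithms in a task-driven way, making them suitable for generating high-resolution images that can be exploited for automated image analysis. In the reported initial research, we propose a methodological approach for assessing existing models that perform computer vision tasks, to determine whether they can be used for evaluating super-resolution reconstruction algorithms as well as for training them in a task-driven way. We support our analysis with an experimental study and expect it to establish a solid foundation for selecting appropriate computer vision tasks that will advance the capabilities of real-world super-resolution.||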
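|**2025-03-19**|[DCA: Dividing and Conquering Amnesia in Incremental Object Detection](http://arxiv.org/abs/2503.15295)|null|Incremental object detection (IOD) aims to cultivate an object detector that can continuously localize and recognize novel classes while preserving its performance on previous classes. Existing methods achieve some success by improving knowledge distillation and exemplar replay for Transformer-based detection frameworks, but the intrinsic forgetting mechanisms remain underexplored. In this paper, we examine the cause of forgetting and discover a forgetting imbalance between localization and recognition in Transformer-based IOD, meaning that localization is less prone to forgetting and can generalize to future classes, whereas catastrophic forgetting occurs primarily in recognition. Based on these insights, we propose a Divide-and-Conquer Amnesia (DCA) strategy, which redesigns Transformer-based IOD into a localization-then-recognition process. DCA can maintain and transfer the localization ability well, leaving the decoupled, fragile recognition to be handled separately. To reduce feature drift in recognition, we leverage semantic knowledge encoded in pre-trained language models to anchor class representations within a unified feature space across incremental tasks. This involves designing a duplex classifier fusion and embedding class semantic features into the recognition decoding process in the form of queries. Extensive experiments validate that our approach achieves state-of-the-art performance, especially in long-term incremental scenarios. For example, under the four-step setting on MS-COCO, our DCA strategy improves the final AP by 6.9%.||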
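|**2025-03-14**|[FLASHμ: Fast Localizing And Sizing of Holographic Microparticles](http://arxiv.org/abs/2503.11538)|null|Reconstructing the 3D location and size of microparticles from diffraction images (holograms) is a computationally expensive inverse problem that has traditionally been solved using physics-based reconstruction methods. More recently, researchers have used machine learning methods to speed up the process. However, for small particles in large sample volumes, the performance of these methods falls short of standard physics-based reconstruction methods. Here we design a two-stage neural network architecture, FLASH$\mu$, to detect small particles (6-100 µm) from holograms with sample depths up to 20 cm. Trained only on synthetic data with added physical noise, our method reliably detects particles of at least 9 µm in real holograms, comparable to the standard reconstruction-based approaches, while operating on smaller crops at a quarter of the original resolution and providing roughly a 600-fold speedup. In addition to introducing a novel approach to a non-local object detection or signal demixing problem, our work could enable low-cost, real-time holographic imaging setups.||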
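|**2025-03-14**|[PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models](http://arxiv.org/abs/2503.11360)|null|Language-guided attention frameworks have significantly improved both the interpretability and the performance of image classification; however, relying on deterministic embeddings from pre-trained vision-language foundation models to generate reference attention maps often overlooks the intrinsic multivaluedness and ill-posed nature of cross-modal mappings. To address these limitations, we introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps, which align textual and visual modalities more effectively than their deterministic counterparts while incorporating uncertainty estimates. Experiments on benchmark test problems demonstrate that PARIC enhances prediction accuracy, mitigates bias, ensures consistent predictions, and improves robustness across various datasets.||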
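|**2025-03-14**|[Falcon: A Remote Sensing Vision-Language Foundation Model](http://arxiv.org/abs/2503.11070)|**[link](https://github.com/tianhuilab/falcon)**|This paper introduces a holistic vision-language foundation model tailored for remote sensing, named Falcon. Falcon offers a unified, prompt-based paradigm that effectively executes comprehensive and complex remote sensing tasks. Falcon demonstrates powerful understanding and reasoning abilities at the image, region, and pixel levels. Specifically, given simple natural language instructions and remote sensing images, Falcon can produce impressive results in text form across 14 distinct tasks, including image classification, object detection, segmentation, and image captioning. To facilitate Falcon's training and strengthen its representation capacity for encoding rich spatial and semantic information, we developed Falcon_SFT, a large-scale, multi-task, instruction-tuning dataset for remote sensing. The Falcon_SFT dataset consists of approximately 78 million high-quality data samples, covering 5.6 million multi-spatial-resolution and multi-view remote sensing images with diverse instructions. It features hierarchical annotations and undergoes manual sampling verification to ensure high data quality and reliability. Extensive comparative experiments show that Falcon achieves remarkable performance across 67 datasets and 14 tasks, despite having only 0.7B parameters. We release the complete dataset, code, and model weights at https://github.com/TianHuiLab/Falcon, hoping to further advance the open-source community.||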
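|**2025-03-14**|[FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection](http://arxiv.org/abs/2503.11030)|null|Camouflaged object detection (COD) is challenging because of the strong similarity between camouflaged objects and their surroundings, which complicates identification. Existing methods rely mainly on spatial local features and fail to capture global information, while Transformers increase computational cost. To address this, we propose the Frequency-Assisted Mamba-Like Linear Attention Network (FMNet), which leverages frequency-domain learning to capture global features efficiently and to mitigate ambiguity between objects and the background. FMNet introduces the Multi-Scale Frequency-Assisted Mamba-Like Linear Attention (MFM) module, which fuses frequency and spatial features through a multi-scale structure to handle scale variations while reducing computational complexity. In addition, the Pyramidal Frequency Attention Extraction (PFAE) module and the Frequency Reverse Decoder (FRD) enhance semantic information and reconstruct features. Experimental results demonstrate that FMNet outperforms existing methods on multiple COD datasets, showcasing its advantages in both performance and efficiency. Code is available at https://anonymous.4open.science/r/FMNet-3CE5.||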
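|**2025-03-14**|[Comparative Analysis of Advanced AI-based Object Detection Models for Pavement Marking Quality Assessment during Daytime](http://arxiv.org/abs/2503.11008)|null|Visual object detection using deep learning plays a significant role in computer vision and has extensive applications in transportation engineering. This paper focuses on detecting pavement marking quality during daytime using the You Only Look Once (YOLO) model, leveraging its advanced architectural features to enhance road safety through precise, real-time assessments. Using image data from New Jersey, this study employed three YOLOv8 variants: YOLOv8m, YOLOv8n, and YOLOv8x. The models were evaluated on their prediction accuracy for classifying pavement markings into good, moderate, and poor visibility categories. The results show that YOLOv8n provides the best balance between accuracy and computational efficiency, achieving the highest mean average precision (mAP) for objects with good visibility and demonstrating robust performance across various intersection over union (IoU) thresholds. This research enhances transportation safety by offering an automated and accurate method for evaluating the quality of pavement markings.||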
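|**2025-03-14**|[Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection](http://arxiv.org/abs/2503.11005)|**[link](https://github.com/zchuhan/cckt-det)**|To detect open-vocabulary objects beyond predefined categories, prior open-vocabulary object detection (OVD) techniques typically resort to pretrained vision-language models (VLMs) for base-to-novel category generalization. However, to mitigate the misalignment between upstream image-text pretraining and downstream region-level perception, additional supervision is indispensable, for example image-text pairs or pseudo annotations generated via self-training strategies. In this work, we propose CCKT-Det, which is trained without any extra supervision. The proposed framework constructs a cyclic and dynamic knowledge transfer from language queries and visual region features extracted from VLMs, which forces the detector to closely align with the visual-semantic space of VLMs. Specifically, 1) we prefilter and inject semantic priors to guide the learning of queries, and 2) we introduce a regional contrastive loss to improve the awareness of queries on novel objects. CCKT-Det can consistently improve performance as the scale of VLMs increases, while requiring only moderate computational overhead. Comprehensive experimental results demonstrate that, on the challenging COCO benchmark, our method achieves gains of +2.9% and +10.2% AP50 over the previous state of the art, without and with a stronger teacher model, respectively. Code is available at https://github.com/ZCHUHan/CCKT-Det.||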
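|**2025-03-13**|[The Power of One: A Single Example is All it Takes for Segmentation in VLMs](http://arxiv.org/abs/2503.10779)|null|Large-scale vision-language models (VLMs), trained on extensive datasets of image-text pairs, exhibit strong multimodal understanding capabilities by implicitly learning associations between textual descriptions and image regions. This emergent ability enables zero-shot object detection and segmentation using techniques that rely on text-image attention maps, without training on abundant labeled segmentation datasets. However, the performance of such methods depends heavily on prompt engineering and on manually selected layers or heads of the attention layers. In this work, we demonstrate that, rather than relying solely on textual prompts, providing a single visual example for each category and fine-tuning the text-to-image attention layers and embeddings significantly improves performance. Additionally, we propose learning an ensemble through few-shot fine-tuning across multiple layers and/or prompts. We propose an entropy-based ranking and selection mechanism for text-to-image attention layers to identify the top-performing layers without the need for segmentation labels. This eliminates the need for hyperparameter selection of text-to-image attention layers, providing a more flexible and scalable solution for open-vocabulary segmentation. We show that this approach yields strong zero-shot performance, further enhanced through fine-tuning with a single visual example. Moreover, we demonstrate that our method and findings are general and can be applied across various vision-language models (VLMs).||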
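|**2025-03-13**|[HeightFormer: Learning Height Prediction in Voxel Features for Roadside Vision Centric 3D Object Detection via Transformer](http://arxiv.org/abs/2503.10777)|null|Roadside vision-centric 3D object detection has received increasing attention in recent years. It extends the perception range of autonomous vehicles and improves road safety. Previous methods focus on predicting per-pixel height rather than depth and have made significant gains in roadside visual perception. However, they are limited by the perspective property of image features, where near objects appear large and far objects small, which makes it difficult for the network to understand the real size of objects in the 3D world. Compared with image features, BEV features and voxel features present the real distribution of objects in the 3D world. However, BEV features tend to lose details because they lack explicit height information, and voxel features are computationally expensive. Based on this insight, this paper proposes an efficient framework, called HeightFormer, that learns height prediction in voxel features via a Transformer. It groups the voxel features into local height sequences and uses an attention mechanism to obtain height distribution predictions. The local height sequences are then reassembled to generate accurate 3D features. The method is applied to two large-scale roadside benchmarks, DAIR-V2X-I and Rope3D. Extensive experiments show that HeightFormer outperforms state-of-the-art methods on the roadside vision-centric 3D object detection task.||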
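|**2025-03-13**|[OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer](http://arxiv.org/abs/2503.10616)|**[link](https://github.com/jinyanglii/ovtr)**|Open-vocabulary multiple object tracking aims to generalize trackers to categories unseen during training, enabling their application across a variety of real-world scenarios. However, existing open-vocabulary trackers are constrained by their framework structure, isolated frame-level perception, and insufficient modal interactions, which hinders their performance in open-vocabulary classification and tracking. In this paper, we propose OVTR (End-to-End Open-Vocabulary Multiple Object Tracking with TRansformer), the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously. To achieve stable classification and continuous tracking, we design the CIP (Category Information Propagation) strategy, which establishes multiple high-level category information priors for subsequent frames. In addition, we introduce a dual-branch structure for generalization capability and deep multimodal interaction, and we incorporate protective strategies in the decoder to improve performance. Experimental results show that our method surpasses previous trackers on the open-vocabulary MOT benchmark while achieving faster inference and significantly reducing preprocessing requirements. Experiments transferring the model to another dataset also demonstrate its strong adaptability. Models and code are released at https://github.com/jinyanglii/OVTR.||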
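|**2025-03-13**|[Extreme Learning Machines for Attention-based Multiple Instance Learning in Whole-Slide Image Classification](http://arxiv.org/abs/2503.10510)|null|Whole-slide image classification is a key challenge in computational pathology and medicine. Attention-based multiple instance learning (MIL) has emerged as an effective approach to this problem. However, the effect of the attention mechanism architecture on model performance for biomedical imaging is not well documented. In this work, we compare different MIL methods and implementations, including deep learning variants. We introduce a new method for deep MIL that uses higher-dimensional feature spaces. We also develop a novel algorithm for whole-slide image classification that combines extreme learning machines with attention-based MIL to improve sensitivity and reduce training complexity. We apply our algorithm to the problem of detecting circulating rare cells (CRCs), such as erythroblasts, in peripheral blood. Our results indicate that nonlinearities play a key role in classification, since removing them leads to a sharp decrease in stability in addition to a drop of more than 4% in average area under the curve (AUC). We also show that model robustness improves considerably when higher-dimensional feature spaces are used, with an average AUC improvement of more than 10%. Furthermore, we show that extreme learning machines can offer clear improvements in training efficiency by reducing the number of trained parameters by a factor of 5 while keeping the average AUC within 1.5% of the deep MIL model. Finally, we discuss options for enriching the classical computing framework with quantum algorithms in the future. This work thus helps pave the way toward more accurate and efficient single-cell diagnostics, one of the building blocks of precision medicine.||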
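|**2025-03-13**|[RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation](http://arxiv.org/abs/2503.10410)|**[link](https://github.com/duyuwen-duen/roco-sim)**|Roadside collaborative perception refers to a system in which multiple roadside units collaborate to pool their perceptual data and help vehicles improve their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues such as calibration errors, sparse information, and multi-view consistency, leading to poor performance on recently published datasets. To significantly improve roadside collaborative perception and address critical data issues, we present RoCo-Sim, the first simulation framework for roadside collaborative perception. RoCo-Sim can generate diverse, multi-view-consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) camera extrinsic optimization, which ensures accurate 3D-to-2D projection for roadside cameras; (2) a novel Multi-View Occlusion-Aware Sampler (MOAS), which determines the placement of diverse digital assets within 3D space; (3) DepthSAM, which models foreground-background relationships from single-frame fixed-view images to ensure multi-view consistency of the foreground; and (4) a scalable post-processing toolkit, which generates more realistic and richer scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods on AP70 by 83.74 on Rcooper-Intersection and by 83.12 on TUMTraf-V2X. RoCo-Sim fills a critical gap in roadside perception simulation. Code and pretrained models will be released soon: https://github.com/duyuwen-duen/RoCo-Sim||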
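|**2025-03-13**|[RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing](http://arxiv.org/abs/2503.10392)|**[link](https://github.com/mililab/roma)**|Recent advances in self-supervised learning for Vision Transformers (ViTs) have driven breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA improves scalability for high-resolution images through a tailored auto-regressive learning strategy that incorporates two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scale inherent to RS imagery. Systematic empirical studies confirm that Mamba follows RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks show that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at https://github.com/MiliLab/RoMA.||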
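|**2025-03-13**|[A Multi-Modal Federated Learning Framework for Remote Sensing Image Classification](http://arxiv.org/abs/2503.10262)|null|Federated learning (FL) enables the collaborative training of deep neural networks across decentralized data archives (i.e., clients) without sharing the clients' local data. Most existing FL methods assume that the data distributed across all clients is associated with the same data modality. However, remote sensing (RS) images held by different clients can be associated with different data modalities. Jointly using multi-modal RS data can significantly improve classification performance. To use decentralized and unshared multi-modal RS data effectively, this paper introduces a novel multi-modal FL framework for RS image classification problems. The framework comprises three modules: 1) multi-modal fusion (MF); 2) feature whitening (FW); and 3) mutual information maximization (MIM). The MF module uses iterative model averaging to facilitate learning without access to multi-modal training data on the clients. The FW module aims to address the limitations of training data heterogeneity by aligning data distributions across clients. The MIM module aims to model mutual information by maximizing the similarity between images from different modalities. In the experimental analysis, we focus on multi-label classification and pixel-based classification tasks in RS. The results obtained on two benchmark archives show the effectiveness of the proposed framework compared with state-of-the-art algorithms in the literature. The code of the proposed framework will be available at https://git.tu-berlin.de/rsim/multi-modal-FL.||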
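|**2025-03-13**|[TARS: Traffic-Aware Radar Scene Flow Estimation](http://arxiv.org/abs/2503.10210)|null|Scene flow provides crucial motion information for autonomous driving. Recent LiDAR scene flow models use the rigid-motion assumption at the instance level, assuming that objects are rigid bodies. However, these instance-level methods are not suitable for sparse radar point clouds. In this work, we present a novel Traffic-Aware Radar Scene flow estimation method, named TARS, which uses motion rigidity at the traffic level. To address the challenges in radar scene flow, we perform object detection and scene flow estimation jointly and boost the latter. We incorporate the feature map from the object detector, trained with detection losses, to make radar scene flow aware of the environment and road users. From this, we construct a Traffic Vector Field (TVF) in the feature space, enabling holistic traffic-level scene understanding in our scene flow branch. When estimating the scene flow, we consider both point-level motion cues from point neighbors and traffic-level consistency of rigid motion within the space. TARS outperforms the state of the art on a proprietary dataset and the View-of-Delft dataset, improving the benchmarks by 23% and 15%, respectively.||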
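|**2025-03-13**|[Automatic quality control in multi-centric fetal brain MRI super-resolution reconstruction](http://arxiv.org/abs/2503.10156)|null|Quality control (QC) has long been considered essential to guarantee the reliability of neuroimaging studies. It is particularly important for fetal brain MRI, where acquisition and image processing techniques are less standardized than in adult imaging. In this work, we focus on automated quality control of super-resolution reconstruction (SRR) volumes of fetal brain MRI, an important processing step in which multiple stacks of thick 2D slices are registered together and combined to build a single, isotropic, and artifact-free T2-weighted volume. We propose FetMRQC$_{SR}$, a machine-learning method that extracts more than 100 image quality metrics and uses a random forest model to predict image quality scores. This approach is well suited to problems that are high-dimensional, with heterogeneous data and small datasets. We validate FetMRQC$_{SR}$ in an out-of-domain (OOD) setting and report high performance (ROC AUC = 0.89), even when faced with data from an unseen site or SRR method. We also investigate failure cases and show that they occur in 45% of the images because of ambiguous configurations for which the experts' ratings are arguable. These results are encouraging and illustrate how a non-deep-learning approach such as FetMRQC$_{SR}$ is well suited to this multifaceted problem. Our tool, along with all the code used to generate, train, and evaluate the model, will be released upon acceptance of the paper.||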
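|**2025-03-13**|[Multiplicative Learning](http://arxiv.org/abs/2503.10144)|null|Efficient training of artificial neural networks remains a key challenge in deep learning. Backpropagation (BP), the standard learning algorithm, relies on gradient descent and typically requires many iterations to converge. In this study, we introduce Expectation Reflection (ER), a novel learning approach that updates weights multiplicatively based on the ratio of observed to predicted outputs. Unlike traditional methods, ER maintains consistency without requiring an ad hoc loss function or a learning-rate hyperparameter. We extend ER to multilayer networks and demonstrate its effectiveness on image classification tasks. Notably, ER achieves optimal weight updates in a single iteration. In addition, we reinterpret ER as a modified form of gradient descent that incorporates the inverse mapping of target propagation. These findings suggest that ER provides an efficient and scalable alternative for training neural networks.||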
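|**2025-03-13**|[Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild](http://arxiv.org/abs/2503.10065)|null|Neural architectures tend to fit their data with relatively simple functions. This "simplicity bias" is widely regarded as key to their success. This paper explores the limits of this principle. Building on recent findings that the simplicity bias stems from ReLU activations [96], we introduce a method to meta-learn new activation functions and inductive biases better suited to specific tasks. Findings: we identify multiple tasks where the simplicity bias is inadequate and ReLUs are suboptimal. In these cases, we learn new activation functions that perform better by inducing a prior of higher complexity. Interestingly, these cases correspond to domains where neural networks have historically struggled: tabular data, regression tasks, cases of shortcut learning, and algorithmic grokking tasks. In comparison, the simplicity bias induced by ReLUs proves adequate on image tasks, where the best learned activations are nearly identical to ReLUs and GeLUs. Implications: contrary to popular belief, the simplicity bias of ReLU networks is not universally useful. It is near-optimal for image classification, but other inductive biases are sometimes preferable. We show that activation functions can control these inductive biases, but future tailored architectures might provide further benefits. Further work is still needed to characterize a model's inductive biases beyond "complexity" and to match them to the data.||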
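|**2025-03-13**|[Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection](http://arxiv.org/abs/2503.09968)|null|Recently, the task of single-domain generalized object detection (Single-DGOD) was proposed, which aims to generalize a detector to multiple unknown domains never seen during training. Because target-domain data is unavailable, some methods use the multimodal capabilities of vision-language models, employing textual prompts to estimate cross-domain information and improve the model's generalization capability. These methods typically use a single textual prompt, often referred to as the one-step prompt method. However, when dealing with complex styles, such as the combination of rain and night, we observe that the one-step prompt method often performs relatively poorly. The reason may be that many scenes contain not a single style but a combination of multiple styles. The one-step prompt method may not effectively synthesize combined information involving various styles. To address this limitation, we propose a new method, Style Evolving along Chain-of-Thought, which aims to progressively integrate and expand style information along the chain of thought, enabling the style to evolve continually. Specifically, by progressively refining style descriptions and guiding the diverse evolution of styles, this approach represents various style characteristics more accurately and helps the model gradually learn and adapt to subtle differences between styles. It also exposes the model to a broader range of style features with different data distributions, thereby improving its generalization capability in unseen domains. Significant performance gains in five adverse-weather scenarios and on the Real to Art benchmark demonstrate the superiority of our method.||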
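|**2025-03-11**|[Referring to Any Person](http://arxiv.org/abs/2503.08507)|**[link](https://github.com/idea-research/rexseek)**|Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, has substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, which hinders progress in this area. In this work, we revisit the task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, resulting in a robust referring model named RexSeek. Experimental results show that state-of-the-art models that perform well on commonly used benchmarks such as RefCOCO/+/g struggle on HumanRef because they cannot detect multiple individuals. In contrast, RexSeek not only excels at human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks. Code is available at https://github.com/IDEA-Research/RexSeek.||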
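|**2025-03-11**|[SuperCap: Multi-resolution Superpixel-based Image Captioning](http://arxiv.org/abs/2503.08496)|null|A long-standing goal of the image captioning field has been to remove the dependency on object detection. We study the use of superpixels together with vision-language models (VLMs) to bridge the gap between detector-based captioning architectures and those that are solely pretrained on large datasets. Our novel superpixel approach ensures that the model receives object-like features, while the use of VLMs provides our model with open-set object understanding. Furthermore, we extend our architecture to use multi-resolution inputs, allowing the model to view images at different levels of detail and to use an attention mechanism to determine which parts are most relevant to the caption. We demonstrate our model's performance with multiple VLMs and through a range of ablations detailing the impact of different architectural choices. Our full model achieves a competitive CIDEr score of 136.9 on the COCO Karpathy split.||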
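|**2025-03-11**|[TrackOcc: Camera-based 4D Panoptic Occupancy Tracking](http://arxiv.org/abs/2503.08471)|null|Comprehensive and consistent dynamic scene understanding from camera input is essential for advanced autonomous systems. Traditional camera-based perception tasks such as 3D object tracking and semantic occupancy prediction lack either spatial comprehensiveness or temporal consistency. In this work, we introduce a brand-new task, camera-based 4D panoptic occupancy tracking, which simultaneously addresses panoptic occupancy segmentation and object tracking from camera-only input. Furthermore, we propose TrackOcc, a cutting-edge approach that processes image inputs in a streaming, end-to-end manner with 4D panoptic queries to address the proposed task. Using a localization-aware loss, TrackOcc improves the accuracy of 4D panoptic occupancy tracking without additional complex designs. Experimental results show that our method achieves state-of-the-art performance on the Waymo dataset. The source code will be released at https://github.com/Tsinghua-MARS-Lab/TrackOcc.||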
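|**2025-03-11**|[Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels](http://arxiv.org/abs/2503.08421)|**[link](https://github.com/xmuqimingxia/dota)**|Unsupervised 3D object detection is an important solution to the difficulty of offline 3D object annotation. However, because of data sparsity and limited viewpoints, label fitting in clustering-based unsupervised object detection often produces low-quality pseudo labels. Multi-agent collaborative datasets, in which agents share complementary observations, have the potential to break through this bottleneck. In this paper, we introduce a novel unsupervised method, named DOtA, that learns to Detect Objects from Multi-Agent LiDAR scans without using external labels. DOtA first uses the ego poses and shapes shared internally among the collaborative agents to initialize the detector, relying on the generalization ability of neural networks to infer preliminary labels. DOtA then uses the complementary observations between agents to perform multi-scale encoding of the preliminary labels and decodes them into high-quality and low-quality labels. These labels are further used as prompts to guide a correct feature learning process, thereby improving the performance of the unsupervised object detection task. Extensive experiments on the V2V4Real and OPV2V datasets show that DOtA outperforms state-of-the-art unsupervised 3D object detection methods. We also validate the effectiveness of the DOtA labels under various collaborative perception frameworks. Code is available at https://github.com/xmuqimingxia/DOtA.||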
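|**2025-03-11**|[Generalizable and Explainable Deep Learning for Medical Image Computing: An Overview](http://arxiv.org/abs/2503.08420)|null|Objective. This paper presents an overview of generalizable and explainable artificial intelligence (XAI) in deep learning (DL) for medical imaging, aimed at addressing the urgent need for transparency and explainability in clinical applications. Methodology. We propose to use four CNNs for medical image classification tasks on three medical datasets (brain tumor, skin cancer, and chest X-ray). In addition, we perform paired t-tests to show the significance of the differences observed between methods. Furthermore, we propose to combine ResNet50 with five common XAI techniques to obtain explainable results for model predictions, aiming to improve model transparency. We also introduce a quantitative metric (confidence increase) to evaluate the usefulness of XAI techniques. Key findings. The experimental results indicate that ResNet50 can achieve feasible accuracy and F1 score on all datasets (e.g., 86.31% accuracy on skin cancer). Furthermore, the findings show that while certain XAI methods, such as XgradCAM, effectively highlight relevant abnormal regions in medical images, others, such as EigenGradCAM, may perform less effectively in specific scenarios. In addition, XgradCAM shows a higher confidence increase (e.g., 0.12 for glioma) than GradCAM++ (0.09) and LayerCAM (0.08). Implications. Based on the experimental results and recent advances, we outline future research directions to improve the robustness and generalizability of DL models in biomedical imaging.||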
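|**2025-03-11**|[Embodied Crowd Counting](http://arxiv.org/abs/2503.08367)|null|Occlusion is one of the fundamental challenges in crowd counting. Various data-driven approaches have been developed to address this issue, yet their effectiveness is limited. This is mainly because most existing crowd counting datasets used to train these methods are based on passive cameras, which restricts their ability to fully sense the environment. Recently, embodied navigation methods have shown significant potential for precise object detection in interactive scenes. These methods incorporate active camera settings and hold promise for addressing the fundamental issues in crowd counting. However, most existing methods are designed for indoor navigation, and their performance in analyzing complex object distributions in large-scale scenes, such as crowds, remains unknown. Moreover, most existing embodied navigation datasets are indoor scenes with limited scale and object quantity, which prevents them from being applied to dense crowd analysis. Based on this, we propose a novel task, Embodied Crowd Counting (ECC). We first build an interactive simulator, the Embodied Crowd Counting Dataset (ECCD), which supports large-scale scenes and large object quantities. To generate crowds, we introduce a prior probability distribution that approximates realistic crowd distributions. We then propose a zero-shot navigation method (ZECC). This method contains an MLLM-driven coarse-to-fine navigation mechanism that supports active Z-axis exploration, and a normal-line-based crowd distribution analysis method for fine counting. Experimental comparisons with baselines show that the proposed method achieves the best trade-off between counting accuracy and navigation cost.||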
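|**2025-03-11**|[Feature Alignment with Equivariant Convolutions for Burst Image Super-Resolution](http://arxiv.org/abs/2503.08300)|null|Burst image processing (BIP), which captures multiple frames and integrates them into a single high-quality image, is widely used in consumer cameras. As a typical BIP task, burst image super-resolution (BISR) has made notable progress through deep learning in recent years. Existing BISR methods typically involve three key stages: alignment, upsampling, and fusion, often in varying orders and implementations. Among these stages, alignment is critical for ensuring accurate feature matching and subsequent reconstruction. However, existing methods often rely on techniques such as deformable convolutions and optical flow to realize alignment, which either focus only on local transformations or lack theoretical grounding, thereby limiting their performance. To alleviate these issues, we propose a novel BISR framework featuring alignment based on equivariant convolutions, which ensures consistent transformations between the image and feature domains. This allows the alignment transformation to be learned through explicit supervision in the image domain and applied easily in the feature domain in a theoretically sound way, effectively improving alignment accuracy. In addition, we design an effective reconstruction module that uses advanced deep architectures for upsampling and fusion to obtain the final BISR result. Extensive experiments on BISR benchmarks show that our approach achieves superior performance in both quantitative metrics and visual quality.||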
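|**2025-03-11**|[Physics-based AI methodology for Material Parameter Extraction from Optical Data](http://arxiv.org/abs/2503.08183)|null|We report a novel methodology for extracting material parameters from spectroscopic optical data using a physics-based neural network. The proposed model integrates classical optimization frameworks with a multi-scale object detection framework, specifically exploring the effect of incorporating physics into the neural network. We validate and analyze its performance on simulated transmission spectra at terahertz and infrared frequencies. Compared with traditional model-based approaches, our method is designed to be autonomous, robust, and time-efficient, making it particularly relevant for industrial and societal applications.||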
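|**2025-03-11**|[Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking](http://arxiv.org/abs/2503.08145)|null|Open-vocabulary multi-object tracking (OV-MOT) aims to free tracking methods from the limitation of a predefined set of categories. Current OV-MOT methods typically rely primarily on instance-level detection and association, often overlooking trajectory information that is unique and essential to object tracking tasks. Using trajectory information can improve association stability and classification accuracy, especially under occlusion and category ambiguity, thereby improving adaptability to novel classes. We therefore propose TRACT, an open-vocabulary tracker that uses trajectory information to improve both object association and classification in OV-MOT. Specifically, we introduce a Trajectory Consistency Reinforcement (TCR) strategy, which benefits tracking performance by improving target identity and category consistency. In addition, we present TraCLIP, a plug-and-play trajectory classification module. It integrates Trajectory Feature Aggregation (TFA) and Trajectory Semantic Enrichment (TSE) strategies to make full use of trajectory information from both visual and language perspectives to improve classification results. Extensive experiments on OV-TAO show that TRACT significantly improves tracking performance, highlighting trajectory information as a valuable asset for OV-MOT. Code will be released.||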
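|**2025-03-11**|[Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method](http://arxiv.org/abs/2503.08144)|null|Large language models (LLMs) and vision-language models (VLMs) have achieved remarkable success in recent years, showing strong capabilities in understanding various images and videos, particularly in classification and detection tasks. However, because of the substantial differences between remote sensing images and conventional optical images, these models face considerable challenges in understanding remote sensing images, especially in detection tasks. Directly prompting VLMs with detection instructions often fails to yield satisfactory results. To address this issue, this paper explores the application of VLMs to object detection in remote sensing images. Specifically, we use publicly available remote sensing object detection datasets, including SSDD, HRSID, and NWPU-VHR-10, and convert the traditional annotation information into natural language, thereby constructing an instruction-tuning (SFT) dataset for VLM training. We then evaluate the detection performance of different VLM fine-tuning strategies and obtain optimized model weights for object detection in remote sensing images. Finally, we assess the model's prior knowledge capabilities through natural language queries. Experimental results show that, without modifying the model architecture, remote sensing object detection can be achieved effectively using natural language alone. The model also shows the ability to perform certain visual question answering (VQA) tasks. Our dataset and related code will be released soon.||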
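|**2025-03-07**|[QArtSR: Quantization via Reverse-Module and Timestep-Retraining in One-Step Diffusion based Image Super-Resolution](http://arxiv.org/abs/2503.05584)|**[link](https://github.com/libozhu03/qartsr)**|One-step diffusion-based image super-resolution (OSDSR) models are showing increasingly strong performance. However, although their denoising steps have been reduced to one and they can be quantized to 8 bits to reduce cost further, there is still significant potential to quantize OSDSR to lower bit widths. To explore more possibilities for quantized OSDSR, we propose an efficient method, Quantization via reverse-module and timestep-retraining for OSDSR, named QArtSR. First, we investigate the influence of timestep value on the performance of quantized models. Then, we propose Timestep Retraining Quantization (TRQ) and Reversed Per-module Quantization (RPQ) strategies to calibrate the quantized model. Meanwhile, we adopt a module loss and an image loss to update all quantized modules. We update only the parameters in the quantization fine-tuning components, excluding the original weights. To ensure that all modules are fully fine-tuned, we add an extended end-to-end training stage after the per-module stage. Our 4-bit and 2-bit quantization experiments show that QArtSR obtains better results than recent leading comparison methods. The performance of 4-bit QArtSR is close to that of the full-precision model. Our code will be released at https://github.com/libozhu03/QArtSR.||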
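|**2025-03-07**|[Grouped Sequential Optimization Strategy -- the Application of Hyperparameter Importance Assessment in Deep Learning](http://arxiv.org/abs/2503.05106)|null|Hyperparameter optimization (HPO) is a critical component of machine learning pipelines, significantly affecting model robustness, stability, and generalization. However, HPO is often a time-consuming and computationally intensive task. Traditional HPO methods, such as grid search and random search, often suffer from inefficiency. Bayesian optimization, while more efficient, still struggles with high-dimensional search spaces. In this paper, we contribute to the field by exploring how insights gained from hyperparameter importance assessment (HIA) can be used to accelerate HPO, reducing both time and computational resources. Building on prior work that evaluated 10 hyperparameters of CNNs on 10 common image classification datasets, we implement a novel HPO strategy called "Sequential Grouping". That prior work assessed the importance weights of the investigated hyperparameters based on their influence on model performance, providing valuable insights that we use to optimize the HPO process. Our experiments, validated on six additional image classification datasets, show that incorporating HIA can significantly accelerate HPO without compromising model performance, reducing optimization time by an average of 31.9% compared with the conventional simultaneous strategy.||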
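|**2025-03-06**|[HieroLM: Egyptian Hieroglyph Recovery with Next Word Prediction Language Model](http://arxiv.org/abs/2503.04996)|null|Egyptian hieroglyphs are found on numerous ancient Egyptian artifacts, but they are often blurry or even missing because of erosion. Existing efforts to restore blurry hieroglyphs adopt computer vision techniques such as CNNs and model hieroglyph recovery as an image classification task, which has two major limitations: (i) they cannot handle severely damaged or completely missing hieroglyphs, and (ii) they make predictions based on a single hieroglyph without considering contextual and grammatical information. This paper proposes a novel approach that models hieroglyph recovery as a next-word prediction task and uses language models to address it. We compare the performance of different state-of-the-art language models and choose LSTM as the architecture of our HieroLM because of the strong local affinities of semantics in Egyptian hieroglyph texts. Experiments show that HieroLM achieves over 44% accuracy and maintains notable performance on multi-shot predictions and under data scarcity, which makes it a practical tool to assist scholars in inferring missing hieroglyphs. It can also complement computer-vision-based models to significantly reduce perplexity in recognizing blurry hieroglyphs. Our code is available at https://github.com/Rick-Cai/HieroLM/.||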
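|**2025-03-06**|[Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach](http://arxiv.org/abs/2503.04918)|null|Artificial intelligence has progressed through the development of vision-language models (VLMs), which integrate text and visual inputs to achieve comprehensive understanding and interaction in various contexts. Improving the performance of these models, such as the Transformer-based Florence 2, on specialized tasks like object detection in complex and unstructured environments requires fine-tuning. The goal of this paper is to improve the efficiency of the Florence 2 model in challenging environments through fine-tuning. We accomplish this by experimenting with different configurations, using various GPU types (T4, L4, A100) and optimizers such as AdamW and SGD. We also employ a range of learning rates and LoRA (Low-Rank Adaptation) settings. Analysis of performance metrics, such as mean Average Precision (mAP) scores, shows that the fine-tuned Florence 2 models perform comparably to YOLO models, including YOLOv8, YOLOv9, and YOLOv10. This demonstrates how Transformer-based VLMs can be adapted for detailed object detection tasks. The paper emphasizes the ability of optimized Transformer-based VLMs to address the specific challenges of object detection in unstructured environments, opening promising avenues for practical applications in demanding and complex settings.||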
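|**2025-03-06**|[Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation](http://arxiv.org/abs/2503.04718)|null|Scene flow estimation is a foundational task for many robotic applications, including robust dynamic object detection, automatic labeling, and sensor synchronization. Two types of approaches to the problem have evolved: 1) supervised methods and 2) optimization-based methods. Supervised methods are fast during inference and achieve high-quality results; however, they are limited by the need for large amounts of labeled training data and are susceptible to domain gaps. In contrast, unsupervised test-time optimization methods do not face the problem of domain gaps but usually suffer from long runtimes, exhibit artifacts, or fail to converge to the right solution. In this work, we mitigate several limitations of existing optimization-based methods. To this end, we 1) introduce a simple voxel-grid-based model that improves over the standard MLP-based formulation in multiple dimensions and 2) introduce a new multiframe loss formulation. 3) We combine both contributions in our new method, termed Floxels. On the Argoverse 2 benchmark, Floxels is surpassed only by EulerFlow among unsupervised methods while achieving comparable performance at a fraction of the computational cost. Floxels achieves a speedup of more than 60 to 140 times over EulerFlow, reducing the runtime from a day to 10 minutes per sequence. Over the faster but lower-quality baseline, NSFP, Floxels achieves a speedup of about 14 times.||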
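|**2025-03-06**|[DEAL-YOLO: Drone-based Efficient Animal Localization using YOLO](http://arxiv.org/abs/2503.04698)|null|Although advances in deep learning and aerial surveillance technology are improving wildlife conservation efforts, complex and erratic environmental conditions still pose a problem and require innovative solutions for cost-effective small animal detection. This work introduces DEAL-YOLO, a novel approach that improves small object detection in unmanned aerial vehicle (UAV) images by using multi-objective loss functions such as Wise IoU (WIoU) and Normalized Wasserstein Distance (NWD), which prioritize pixels near the centre of the bounding box, ensuring smoother localization and reducing abrupt deviations. In addition, the model is optimized through efficient feature extraction with Linear Deformable (LD) convolutions, improving accuracy while maintaining computational efficiency. The Scaled Sequence Feature Fusion (SSFF) module improves object detection by effectively capturing inter-scale relationships, improving feature representation, and boosting metrics through optimized multiscale fusion. Comparison with baseline models shows high efficacy, with up to 69.5% fewer parameters than vanilla Yolov8-N, highlighting the robustness of the proposed modifications. Through this approach, our paper aims to facilitate the detection of endangered species, animal population analysis, habitat monitoring, biodiversity research, and various other applications that enrich wildlife conservation efforts. DEAL-YOLO employs a two-stage inference paradigm for object detection, refining selected regions to improve localization and confidence. This approach improves performance, especially for small instances with low objectness scores.||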
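|**2025-03-06**|[Teach YOLO to Remember: A Self-Distillation Approach for Continual Object Detection](http://arxiv.org/abs/2503.04688)|null|Real-time object detectors such as YOLO achieve exceptional performance when trained on large datasets for multiple epochs. However, in real-world scenarios where data arrives incrementally, neural networks suffer from catastrophic forgetting, leading to a loss of previously learned knowledge. To address this, prior research has explored class-incremental learning (CIL) strategies for Continual Learning for Object Detection (CLOD), with most approaches focusing on two-stage object detectors. However, existing work suggests that Learning without Forgetting (LwF) may be ineffective for one-stage anchor-free detectors such as YOLO because noisy regression outputs risk transferring corrupted knowledge. In this work, we introduce YOLO LwF, a self-distillation approach tailored for YOLO-based continual object detection. We show that, when coupled with a replay memory, YOLO LwF significantly mitigates forgetting. Compared with previous approaches, it achieves state-of-the-art performance, improving mAP by +2.1% and +2.9% on the VOC and COCO benchmarks, respectively.||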
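|**2025-03-06**|[Omnidirectional Multi-Object Tracking](http://arxiv.org/abs/2503.04565)|**[link](https://github.com/xifen523/omnitrack)**|Panoramic imagery, with its 360° field of view, offers comprehensive information to support multi-object tracking (MOT) by capturing spatial and temporal relationships of surrounding objects. However, most MOT algorithms are tailored for pinhole images with limited views, which impairs their effectiveness in panoramic settings. In addition, panoramic image distortions, such as resolution loss, geometric deformation, and uneven lighting, hinder the direct application of existing MOT methods and lead to significant performance degradation. To address these challenges, we propose OmniTrack, an omnidirectional MOT framework that incorporates Tracklet Management to introduce temporal cues, FlexiTrack Instances for object localization and association, and the CircularStatE Module to alleviate image and geometric distortions. This integration enables tracking in large field-of-view scenarios, even under rapid sensor motion. To mitigate the lack of panoramic MOT datasets, we introduce the QuadTrack dataset, a comprehensive panoramic dataset collected by a quadruped robot that features diverse challenges such as wide fields of view, intense motion, and complex environments. Extensive experiments on the public JRDB dataset and the newly introduced QuadTrack benchmark demonstrate the state-of-the-art performance of the proposed framework. OmniTrack achieves a HOTA score of 26.92% on JRDB, an improvement of 3.43%, and reaches 23.45% on QuadTrack, surpassing the baseline by 6.81%. The dataset and code will be made publicly available at https://github.com/xifen523/OmniTrack.||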
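|**2025-03-06**|[ReynoldsFlow: Exquisite Flow Estimation via Reynolds Transport Theorem](http://arxiv.org/abs/2503.04500)|null|Optical flow is a fundamental technique for motion estimation, widely applied in video stabilization, interpolation, and object tracking. Recent advances in artificial intelligence (AI) have enabled deep learning models to use optical flow as an important feature for motion analysis. However, traditional optical flow methods rely on restrictive assumptions, such as brightness constancy and slow motion constraints, which limit their effectiveness in complex scenes. Deep-learning-based methods require extensive training on large domain-specific datasets, making them computationally demanding. Furthermore, optical flow is typically visualized in the HSV color space, which introduces nonlinear distortions when converted to RGB and is highly sensitive to noise, degrading the accuracy of motion representation. These limitations inherently constrain the performance of downstream models and can hinder object tracking and motion analysis tasks. To address these challenges, we propose Reynolds flow, a novel training-free flow estimation method inspired by the Reynolds transport theorem, which offers a principled approach to modeling complex motion dynamics. In addition to the conventional HSV-based visualization, referred to as ReynoldsFlow, we introduce an alternative representation, ReynoldsFlow+, designed to improve flow visualization. We evaluate ReynoldsFlow and ReynoldsFlow+ on three video-based benchmarks: tiny object detection on UAVDB, infrared object detection on Anti-UAV, and pose estimation on GolfDB. Experimental results show that networks trained with ReynoldsFlow+ achieve state-of-the-art (SOTA) performance, with improved robustness and efficiency across all tasks.||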
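|**2025-03-06**|[Scale-Invariant Adversarial Attack against Arbitrary-scale Super-resolution](http://arxiv.org/abs/2503.04385)|null|The advent of the local continuous image function (LIIF) has drawn significant attention to arbitrary-scale super-resolution (SR) techniques. However, while the vulnerabilities of fixed-scale SR have been assessed, the robustness of continuous-representation-based arbitrary-scale SR against adversarial attacks remains an area that warrants further exploration. Adversarial attacks elaborately designed for fixed-scale SR are scale-dependent, which makes them time-consuming and memory-intensive when applied to arbitrary-scale SR. To address this concern, we propose a simple yet effective "scale-invariant" SR adversarial attack method with good transferability, termed SIAGT. Specifically, we propose to construct resource-saving attacks by exploiting finite discrete points of the continuous representation. In addition, we formulate a coordinate-dependent loss to improve the cross-model transferability of the attack. The attack can significantly degrade the quality of SR images while introducing imperceptible distortion to the targeted low-resolution (LR) images. Experiments on three popular LIIF-based SR approaches and four classical SR datasets show the strong attack performance and transferability of SIAGT.||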
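|**2025-03-06**|[S2Gaussian: Sparse-View Super-Resolution 3D Gaussian Splatting](http://arxiv.org/abs/2503.04314)|null|In this paper, we aim to address a realistic and challenging problem: how to reconstruct high-quality 3D scenes from sparse low-resolution views that suffer simultaneously from insufficient views and low clarity. Existing methods handle only sparse views or only low-resolution observations and fail to cope with such mixed and complex scenarios. To this end, we propose a novel Sparse-view Super-resolution 3D Gaussian Splatting framework, named S2Gaussian, that can reconstruct structurally accurate and detail-faithful 3D scenes from only sparse, low-resolution views. S2Gaussian operates in two stages. In the first stage, we optimize a low-resolution Gaussian representation with depth regularization and densify it through a tailored Gaussian Shuffle Split operation to initialize the high-resolution Gaussians. In the second stage, we refine the high-resolution Gaussians with super-resolved images generated from both the original sparse views and pseudo-views rendered by the low-resolution Gaussians. Here, a customized blur-free inconsistency modeling scheme and a 3D robust optimization strategy are carefully designed to mitigate multi-view inconsistency and eliminate erroneous updates caused by imperfect supervision. Extensive experiments show superior results, particularly in geometric consistency and finer detail, establishing new state-of-the-art performance.||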
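|**2025-03-06**|[Spiking Meets Attention: Efficient Remote Sensing Image Super-Resolution with Attention Spiking Neural Networks](http://arxiv.org/abs/2503.04223)|null|Spiking neural networks (SNNs) are emerging as a promising alternative to traditional artificial neural networks (ANNs), offering biological plausibility and energy efficiency. Despite these merits, SNNs are frequently hampered by limited capacity and insufficient representation power, and they remain underexplored in remote sensing super-resolution (SR) tasks. In this paper, we first observe that spiking signals exhibit drastic intensity variations across diverse textures, highlighting an active learning state of the neurons. This observation motivates us to apply SNNs to efficient SR of remote sensing images (RSIs). Inspired by the success of attention mechanisms in representing salient information, we devise the spiking attention block (SAB), a concise yet effective component that optimizes membrane potentials through inferred attention weights, which in turn regulates spiking activity for better feature representation. Our key contributions include: 1) we bridge the independent modulation between temporal and channel dimensions, facilitating joint feature correlation learning, and 2) we access global self-similar patterns in large-scale remote sensing imagery to infer spatial attention weights, incorporating effective priors for realistic and faithful reconstruction. Building on SAB, we propose SpikeSR, which achieves state-of-the-art performance across various remote sensing benchmarks such as AID, DOTA, and DIOR while maintaining high computational efficiency. The code of SpikeSR will be released upon paper acceptance.||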
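|**2025-03-06**|[WeakSupCon: Weakly Supervised Contrastive Learning for Encoder Pre-training](http://arxiv.org/abs/2503.04165)|null|Weakly supervised multiple instance learning (MIL) is a challenging task because only bag-level labels are provided, while each bag typically contains multiple instances. The topic has been studied extensively in histopathological image analysis, where labels are usually available only at the whole slide image (WSI) level, while each WSI can be divided into thousands of small image patches for training. The dominant MIL approaches take fixed patch features as input to address computational constraints and ensure model stability. These features are commonly generated by encoders pretrained on ImageNet, foundation encoders pretrained on large datasets, or through self-supervised learning on local datasets. Although self-supervised encoder pretraining on the same dataset as the downstream MIL task helps mitigate domain shift and generates better features, the bag-level labels are not used during that process, and the features of patches from different categories may cluster together, reducing classification performance on MIL tasks. Recently, pretraining with supervised contrastive learning (SupCon) has shown superior performance compared with self-supervised contrastive learning and even end-to-end training on traditional image classification tasks. In this paper, we propose a novel encoder pretraining method for downstream MIL tasks called Weakly Supervised Contrastive Learning (WeakSupCon) that uses bag-level labels. In our method, we employ multi-task learning and define distinct contrastive learning losses for samples with different bag labels. Our experiments show that the features generated with WeakSupCon significantly improve MIL classification performance compared with self-supervised approaches across three datasets.||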
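|**2025-03-06**|[Fractional Correspondence Framework in Detection Transformer](http://arxiv.org/abs/2503.04107)|null|The Detection Transformer (DETR), by incorporating the Hungarian algorithm, has significantly simplified the matching process in object detection tasks. This algorithm facilitates optimal one-to-one matching between predicted bounding boxes and ground-truth annotations during training. While effective, this strict matching process does not inherently account for the varying densities and distributions of objects, leading to suboptimal correspondences such as failing to handle multiple detections of the same object or missing small objects. To address this, we propose the Regularized Transport Plan (RTP). RTP introduces a flexible matching strategy that captures the cost of aligning predictions with ground truths to find the most accurate correspondences between these sets. By using the differentiable Sinkhorn algorithm, RTP allows for soft, fractional matching rather than strict one-to-one assignments. This approach improves the model's ability to manage varying object densities and distributions effectively. Our extensive evaluations on the MS-COCO and VOC benchmarks demonstrate the effectiveness of our approach. RTP-DETR surpasses both Deform-DETR and the recently introduced DINO-DETR, achieving absolute gains in mAP of +3.8% and +1.7%, respectively.||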
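|**2025-03-04**|[Undertrained Image Reconstruction for Realistic Degradation in Blind Image Super-Resolution](http://arxiv.org/abs/2503.02767)|null|Most super-resolution (SR) models struggle with real-world low-resolution (LR) images. This is because the degradation characteristics in synthetic datasets differ from those in real-world LR images. Since SR models are trained on pairs of high-resolution (HR) and LR images generated by downsampling, they are optimized for simple degradation. However, real-world LR images contain complex degradation caused by factors such as the imaging process and JPEG compression. Because of these differences in degradation characteristics, most SR models perform poorly on real-world LR images. This study proposes a dataset generation method that uses undertrained image reconstruction models. These models have the property of reconstructing low-quality images with diverse degradation from input images. Using this property, the study generates LR images with diverse degradation from HR images to construct the datasets. Fine-tuning pretrained SR models on our generated datasets improves noise removal and blur reduction, improving performance on real-world LR images. Furthermore, an analysis of the datasets reveals that degradation diversity contributes to performance improvements, whereas color differences between HR and LR images may degrade performance. 11 pages (11 figures, 2 tables).||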
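|**2025-03-04**|[Measurement noise scaling laws for cellular representation learning](http://arxiv.org/abs/2503.02726)|**[link](https://github.com/ggdna/scScaling)**|Deep learning scaling laws predict how performance improves with increased model and dataset size. Here we identify measurement noise in data as another performance scaling axis, governed by a distinct logarithmic law. We focus on representation learning models of biological single-cell genomic data, where a dominant source of measurement noise is molecular undersampling. We introduce an information-theoretic metric for the quality of cellular representation models and find that it scales with sampling depth. A single quantitative relationship holds across several model types and several datasets. We show that the analytical form of this relationship can be derived from a simple Gaussian noise model, which in turn provides an intuitive interpretation of the scaling law. Finally, we show that the same relationship emerges in image classification models with respect to two types of imaging noise, suggesting that measurement noise scaling may be a general phenomenon. Noise scaling can serve as a guide for generating and curating data for deep learning models, particularly in fields where measurement quality can vary dramatically between datasets.||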
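|**2025-03-04**|[XFMamba: Cross-Fusion Mamba for Multi-View Medical Image Classification](http://arxiv.org/abs/2503.02619)|null|Compared with single-view medical image classification, using multiple views can significantly improve predictive accuracy because it accounts for the complementarity of each view while exploiting correlations between views. Existing multi-view approaches typically employ separate convolutional or Transformer branches combined with simplistic feature fusion strategies. However, these approaches inadvertently disregard essential cross-view correlations, leading to suboptimal classification performance, and they suffer from limited receptive fields (CNNs) or quadratic computational complexity (Transformers). Inspired by state space sequence models, we propose XFMamba, a pure Mamba-based cross-fusion architecture that addresses the challenge of multi-view medical image classification. XFMamba introduces a novel two-stage fusion strategy that facilitates the learning of single-view features and their cross-view disparity. This mechanism captures spatially long-range dependencies within each view while improving seamless information transfer between views. Results on three public datasets, MURA, CheXpert, and DDSM, show the effectiveness of our approach across diverse multi-view medical image classification tasks, where it outperforms existing convolution-based and Transformer-based multi-view methods. Code is available at https://github.com/XZheng0427/XFMamba.||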
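|**2025-03-04**|[Exploring Model Quantization in GenAI-based Image Inpainting and Detection of Arable Plants](http://arxiv.org/abs/2503.02420)|null|Deep-learning-based weed control systems often suffer from limited training data diversity and constrained on-board computation, which affects their real-world performance. To overcome these challenges, we propose a framework that uses Stable Diffusion-based inpainting to augment the training data progressively in 10% increments, up to an additional 200%, thereby increasing both the volume and the diversity of samples. We evaluate the detection performance of two state-of-the-art object detection models, YOLO11(l) and RT-DETR(l), using the mAP50 metric. We explore quantization strategies (FP16 and INT8) for both the generative inpainting and the detection models to strike a balance between inference speed and accuracy. Deployment of the downstream models on the Jetson Orin Nano demonstrates the practical viability of our framework in resource-constrained environments, ultimately improving detection accuracy and computational efficiency in intelligent weed management systems.||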
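|**2025-03-04**|[Robust detection of overlapping bioacoustic sound events](http://arxiv.org/abs/2503.02389)|null|We propose a method for accurately detecting bioacoustic sound events that is robust to overlapping events, a common issue in domains such as ethology, ecology, and conservation. While standard methods employ a frame-based multi-label approach, we introduce an onset-based detection method that we name Voxaboxen. It takes inspiration from object detection methods in computer vision while also taking advantage of recent advances in self-supervised audio encoders. For each time window, Voxaboxen predicts whether it contains the start of a vocalization and how long the vocalization is. It also does the same in reverse, predicting whether each window contains the end of a vocalization and how long ago it started. The two resulting sets of bounding boxes are then fused using a graph-matching algorithm. We also release a new dataset designed to measure performance on detecting overlapping vocalizations. It consists of recordings of zebra finches annotated with temporally strong labels and showing frequent overlaps. We test Voxaboxen on seven existing datasets and on our new dataset. We compare Voxaboxen with natural baselines and existing sound event detection methods and demonstrate state-of-the-art results. Further experiments show that the improvements are robust to frequent vocalization overlap.||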
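|**2025-03-04**|[YOLO-PRO: Enhancing Instance-Specific Object Detection with Full-Channel Global Self-Attention](http://arxiv.org/abs/2503.02348)|null|This paper addresses the inherent limitations of conventional bottleneck structures (reduced instance discriminability caused by overemphasis on batch statistics) and decoupled heads (computational redundancy) in object detection frameworks by proposing two novel modules: the Instance-Specific Bottleneck (ISB) with full-channel global self-attention and the Instance-Specific Asymmetric Decoupled Head (ISADH). The ISB module reconstructs feature maps to establish an efficient full-channel global attention mechanism through the synergistic fusion of batch-statistical and instance-specific features. Complementing this, the ISADH module introduces an asymmetric decoupled architecture that enables hierarchical multi-dimensional feature integration through dual-stream batch-instance representation fusion. Extensive experiments on the MS-COCO benchmark show that deploying ISB and ISADH together in the YOLO-PRO framework achieves state-of-the-art performance across all computational scales. Specifically, YOLO-PRO surpasses YOLOv8 by 1.0-1.6% AP (N/S/M/L/X scales) and outperforms YOLO11 by 0.1-0.5% AP in the critical M/L/X groups, while maintaining competitive computational efficiency. This work provides practical insights for developing high-precision detectors deployable on edge devices.||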
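|**2025-03-04**|[Diffusion-Based mmWave Radar Point Cloud Enhancement Driven by Range Images](http://arxiv.org/abs/2503.02300)|null|Millimeter-wave (mmWave) radar has attracted significant attention in robotics and autonomous driving. However, despite its perception stability in harsh environments, the point cloud generated by mmWave radar is relatively sparse and contains significant noise, which limits its further development. Traditional mmWave radar enhancement approaches often struggle to exploit the effectiveness of diffusion models in super-resolution, largely because of the unnatural range-azimuth heatmap (RAH) or bird's-eye view (BEV) representation. To overcome this limitation, we propose a novel method that fuses range images with image diffusion models, achieving accurate and dense mmWave radar point clouds similar to LiDAR. Because its projection aligns with human observation, the range image representation of mmWave radar is close to natural images, allowing the knowledge from pretrained image diffusion models to be transferred effectively and significantly improving overall performance. Extensive evaluations on both public datasets and self-constructed datasets show that our approach provides substantial improvements, establishing new state-of-the-art performance in generating truly three-dimensional LiDAR-like point clouds from mmWave radar.||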
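|**2025-03-04**|[Making Better Mistakes in CLIP-Based Zero-Shot Classification with Hierarchy-Aware Language Prompts](http://arxiv.org/abs/2503.02248)|null|Recent studies use advances in large language models (LLMs) trained on extensive internet-crawled text data to generate textual descriptions of downstream classes for CLIP-based zero-shot image classification. While most of these approaches aim to improve accuracy, our work focuses on "making better mistakes", where the severity of a mistake is derived from the given label hierarchy of the downstream task. Since CLIP's image encoder is trained with language supervision signals, it implicitly captures the hierarchical semantic relationships between different classes. This motivates our goal of making better mistakes in zero-shot classification, a task for which CLIP is naturally well suited. Our approach (HAPrompts) queries the language model to produce textual representations for the given classes, which serve as zero-shot classifiers for CLIP to perform image classification on downstream tasks. To our knowledge, this is the first work to introduce making better mistakes in CLIP-based zero-shot classification. In our experiments, our approach outperforms related methods in a holistic comparison across five datasets of varying scale with label hierarchies of different heights. Our code and LLM-generated image prompts: https://github.com/ltong1130ztr/HAPrompts.||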
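|**2025-03-03**|[Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection](http://arxiv.org/abs/2503.02101)|null|Domain generalization (DG) for object detection aims to improve detectors' performance in unseen scenarios. The task remains challenging because of complex variations in real-world applications. Recently, diffusion models have shown remarkable capabilities in diverse scene generation, which inspires us to explore their potential for improving DG tasks. Instead of generating images, our method extracts multi-step intermediate features during the diffusion process to obtain domain-invariant features for generalized detection. Furthermore, we propose an efficient knowledge transfer framework that enables detectors to inherit the generalization capabilities of diffusion models through feature-level and object-level alignment, without increasing inference time. We conduct extensive experiments on six challenging DG benchmarks. The results show that our method achieves a substantial improvement of 14.0% mAP over existing DG approaches across different domains and corruption types. Notably, our method even outperforms most domain adaptation methods without accessing any target-domain data. Moreover, the diffusion-guided detectors show a consistent improvement of 15.9% mAP on average compared with the baseline. Our work aims to present an effective approach for domain-generalized detection and to provide potential insights for robust visual recognition in real-world scenarios. Code is available at Generalized Diffusion Detector (https://github.com/heboyong/Generalized-Diffusion-Detector).||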
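|**2025-03-03**|[Uncertainty Representation in a SOTIF-Related Use Case with Dempster-Shafer Theory for LiDAR Sensor-Based Object Detection](http://arxiv.org/abs/2503.02087)|**[link](https://github.com/milinpatel07/SOTIF-PCOD)**|Uncertainty in LiDAR sensor-based object detection arises from environmental variability and sensor performance limitations. Representing these uncertainties is essential for ensuring the Safety of the Intended Functionality (SOTIF), which focuses on preventing hazards in automated driving scenarios. This paper presents a systematic approach to identifying, classifying, and representing uncertainties in LiDAR-based object detection within a SOTIF-related scenario. Dempster-Shafer Theory (DST) is used to construct a Frame of Discernment (FoD) that represents detection outcomes. Conditional Basic Probability Assignments (BPAs) are applied based on dependencies among the identified uncertainty sources. Yager's Rule of Combination is used to resolve conflicting evidence from multiple sources, providing a structured framework for evaluating the impact of uncertainties on detection accuracy. The study applies variance-based sensitivity analysis (VBSA) to quantify and prioritize uncertainties, detailing their specific impact on detection performance.||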
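|**2025-02-28**|[Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning](http://arxiv.org/abs/2502.21130)|**[link](https://github.com/JiuyangDong/HDMIL)**|Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs because it must process numerous patches from gigapixel whole slide images (WSIs). To address this, we propose HDMIL, a hierarchical distillation multi-instance learning framework that achieves fast and accurate classification by eliminating irrelevant patches. HDMIL consists of two key components: the dynamic multi-instance network (DMIN) and the lightweight instance pre-screening network (LIPN). DMIN operates on high-resolution WSIs, while LIPN operates on the corresponding low-resolution counterparts. During training, DMIN is trained for WSI classification while generating attention-score-based masks that indicate irrelevant patches. These masks then guide the training of LIPN to predict the relevance of each low-resolution patch. During testing, LIPN first determines the useful regions within low-resolution WSIs, which indirectly allows us to eliminate irrelevant regions in high-resolution WSIs, thereby reducing inference time without degrading performance. In addition, we design the first Chebyshev-polynomial-based Kolmogorov-Arnold classifier in computational pathology, which improves the performance of HDMIL through learnable activation layers. Extensive experiments on three public datasets show that HDMIL outperforms previous state-of-the-art methods, e.g., achieving a 3.13% improvement in AUC while reducing inference time by 28.6% on the Camelyon16 dataset.||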
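|**2025-02-28**|[BadRefSR: Backdoor Attacks Against Reference-based Image Super Resolution](http://arxiv.org/abs/2502.20943)|**[link](https://github.com/xuefusiji/badrefsr)**|Reference-based image super-resolution (RefSR) represents a promising advance in super-resolution (SR). In contrast to single-image super-resolution (SISR), RefSR uses an additional reference image to help recover high-frequency details, yet its vulnerability to backdoor attacks has not been explored. To fill this research gap, we propose a novel attack framework called BadRefSR, which embeds backdoors in the RefSR model by adding triggers to the reference images and training with a mixed loss function. Extensive experiments across various backdoor attack settings demonstrate the effectiveness of BadRefSR. The compromised RefSR network performs normally on clean input images, while outputting attacker-specified target images on triggered input images. Our study aims to alert researchers to the potential backdoor risks in RefSR. Code is available at https://github.com/xuefusiji/BadRefSR.||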
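|**2025-02-28**|[LV-DOT: LiDAR-visual dynamic obstacle detection and tracking for autonomous robot navigation](http://arxiv.org/abs/2502.20607)|**[link](https://github.com/zhefan-xu/lv-dot)**|Accurate perception of dynamic obstacles is essential for autonomous robot navigation in indoor environments. Although sophisticated 3D object detection and tracking methods have been investigated and developed thoroughly in the fields of computer vision and autonomous driving, their demands for expensive, high-accuracy sensor setups and for substantial computational resources from large neural networks make them unsuitable for indoor robotics. Recently, more lightweight perception algorithms that use onboard cameras or LiDAR sensors have emerged as promising alternatives. However, relying on a single sensor poses significant limitations: cameras have limited fields of view and can suffer from high noise, whereas LiDAR sensors operate at lower frequencies and lack the richness of visual features. To address this limitation, we propose a dynamic obstacle detection and tracking framework that uses both onboard camera and LiDAR data to enable lightweight and accurate perception. Our proposed method extends our previous ensemble detection approach, which integrates outputs from multiple low-accuracy but computationally efficient detectors to ensure real-time performance on the onboard computer. In this work, we propose a more robust fusion strategy that integrates both LiDAR and visual data to improve detection accuracy further. We then use a tracking module that adopts feature-based object association and the Kalman filter to track and estimate the states of detected obstacles. In addition, we design a dynamic obstacle classification algorithm to robustly identify moving objects. The dataset evaluation shows better perception performance than benchmark methods. Physical experiments on a quadcopter robot confirm the feasibility for real-world navigation.||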
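|**2025-02-28**|[Exploring the Impact of Temperature Scaling in Softmax for Classification and Adversarial Robustness](http://arxiv.org/abs/2502.20604)|null|The softmax function is a fundamental component of deep learning. This study examines the often-overlooked "temperature" parameter of the softmax function, providing novel insights into the practical and theoretical aspects of temperature scaling for image classification. Our empirical studies, conducted on multiple benchmark datasets using convolutional neural networks and Transformers, show that moderate temperatures generally yield better overall performance. Through extensive experiments and rigorous theoretical analysis, we explore the role of temperature scaling in model training and reveal that temperature not only influences the learning step size but also shapes the model's optimization direction. Moreover, for the first time, we find a surprising benefit of elevated temperatures: improved model robustness against common corruptions, natural perturbations, and non-targeted adversarial attacks such as Projected Gradient Descent. We extend our findings to adversarial training, showing that, compared with the standard softmax function at the default temperature value, higher temperatures have the potential to improve adversarial training. This work opens new avenues for improving model performance and security in deep learning applications.||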
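|**2025-02-27**|[Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds](http://arxiv.org/abs/2502.20316)|null|Masked autoencoders (MAE) have shown great potential for self-supervised learning (SSL) in vision and beyond. However, the LiDAR point clouds used in automated driving are particularly challenging for MAEs because large areas of the 3D volume are empty. Consequently, existing work suffers from leaking occupancy information into the decoder and has significant computational complexity, which in practice limits SSL pretraining to 2D bird's-eye-view encoders only. In this work, we propose the novel neighborhood occupancy MAE (NOMAE), which overcomes these challenges by performing masked occupancy reconstruction only in the neighborhood of non-masked voxels. We incorporate voxel masking and occupancy reconstruction at multiple scales with our proposed hierarchical mask generation technique to capture features of objects of different sizes in the point cloud. NOMAE is highly flexible and can be used directly for SSL in existing 3D architectures. We perform extensive evaluations on the nuScenes and Waymo Open datasets for the downstream perception tasks of semantic segmentation and 3D object detection, comparing with both discriminative and generative SSL methods. The results show that NOMAE sets a new state of the art on multiple benchmarks for multiple point cloud perception tasks.||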
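|**2025-02-27**|[Gradient-Guided Annealing for Domain Generalization](http://arxiv.org/abs/2502.20162)|**[link](https://github.com/aristotelisballas/gga)**|Domain generalization (DG) research has gained considerable traction recently, since the ability to generalize to unseen data distributions is a requirement that eludes even state-of-the-art training algorithms. In this paper, we observe that the initial iterations of model training play a key role in domain generalization effectiveness, since the loss landscape may differ significantly between the training and test distributions, contrary to the case of i.i.d. data. Conflicts between the gradients of the loss components of each domain lead the optimization procedure to undesirable local minima that do not capture the domain-invariant features of the target classes. We propose alleviating domain conflicts in model optimization by iteratively annealing the parameters of a model in the early stages of training and searching for points where gradients align between domains. By discovering a set of parameter values where the gradients of every data distribution present in the training set are updated in the same direction, the proposed Gradient-Guided Annealing (GGA) algorithm encourages models to seek out minima that show improved robustness against domain shifts. The efficacy of GGA is evaluated on five widely accepted and challenging image classification domain generalization benchmarks, where the use of GGA alone is able to establish highly competitive or even state-of-the-art performance. Moreover, when combined with previously proposed domain generalization algorithms, it consistently improves their effectiveness by significant margins.||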
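|**2025-02-27**|[MITracker: Multi-View Integration for Visual Object Tracking](http://arxiv.org/abs/2502.20111)|null|Multi-view object tracking (MVOT) offers promising solutions to challenges such as occlusion and target loss, which are common in traditional single-view tracking. However, progress has been limited by the lack of comprehensive multi-view datasets and effective cross-view integration methods. To overcome these limitations, we compiled a Multi-View object Tracking (MVTrack) dataset of 234K high-quality annotated frames featuring 27 distinct objects across various scenes. In conjunction with this dataset, we introduce a novel MVOT method, Multi-View Integration Tracker (MITracker), to integrate multi-view object features efficiently and provide stable tracking outcomes. MITracker can track any object in video frames of arbitrary length from arbitrary viewpoints. The key advances of our method over traditional single-view approaches come from two aspects: (1) MITracker transforms 2D image features into a 3D feature volume and compresses it into a bird's-eye view (BEV) plane, facilitating inter-view information fusion; (2) we propose an attention mechanism that uses geometric information from the fused 3D feature volume to refine the tracking results at each view. MITracker outperforms existing methods on the MVTrack and GMTD datasets, achieving state-of-the-art performance. The code and the new dataset will be available at https://mii-laboratory.github.io/MITracker/.||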
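|**2025-02-27**|[OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels](http://arxiv.org/abs/2502.20087)|**[link](https://github.com/lmmmeng/overlock)**|Top-down attention plays a crucial role in the human visual system, where the brain first obtains a rough overview of a scene to discover salient cues (i.e., overview first) and then examines it more closely for finer-grained judgments (i.e., look closely next). However, recent efforts in ConvNet design have focused mainly on increasing kernel size to obtain a larger receptive field, without considering this crucial biomimetic mechanism to improve performance further. To this end, we propose a novel pure ConvNet vision backbone, termed OverLoCK, which is carefully devised from both the architecture and mixer perspectives. Specifically, we introduce a biomimetic Deep-stage Decomposition Strategy (DDS) that fuses semantically meaningful context representations into middle and deep layers by providing dynamic top-down context guidance at both the feature and kernel weight levels. To fully unleash the power of top-down context guidance, we further propose a novel Context-Mixing Dynamic Convolution (ContMix) that effectively models long-range dependencies while preserving the inherent local inductive biases even when the input resolution increases. These properties are absent in previous convolutions. With the support of DDS and ContMix, OverLoCK shows a notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while using only about one-third of the FLOPs/parameters. On object detection with Cascade Mask R-CNN, OverLoCK-S surpasses MogaNet-B by a significant 1% in AP$^b$. On semantic segmentation with UperNet, OverLoCK-T significantly improves on UniRepLKNet-T by 1.7% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.||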
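|**2025-02-27**|[ProAPO: Progressively Automatic Prompt Optimization for Visual Classification](http://arxiv.org/abs/2502.19844)|**[link](https://github.com/MorningStarOvO/ProAPO)**|Vision-language models (VLMs) have made significant progress in image classification by training on large-scale paired image-text data. Their performance depends largely on prompt quality. Although recent methods show that visual descriptions generated by large language models (LLMs) improve the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination because of hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human in the loop. We propose an evolution-based algorithm that progressively optimizes language prompts from task-specific templates to class-specific descriptions. Unlike optimizing templates, the search space for class-specific candidate prompts grows explosively. This increases prompt generation costs, the number of iterations, and the overfitting problem. To this end, we first introduce several simple yet effective edit-based and evolution-based operations to generate diverse candidate prompts with a single LLM query. Then, two sampling strategies are proposed to find a better initial search point and to reduce the categories traversed, saving iteration costs. Moreover, we apply a novel fitness score with an entropy constraint to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual-prompt-based methods and improves on LLM-generated description methods across 13 datasets. We also show that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.||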
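|**2025-02-27**|[MFSR: Multi-fractal Feature for Super-resolution Reconstruction with Fine Details Recovery](http://arxiv.org/abs/2502.19797)|null|In image super-resolution, the handling of complex localized information can have a significant impact on the quality of the generated image. Fractal features can capture the rich details of both micro and macro texture structures in an image. We therefore propose a diffusion-model-based super-resolution method that incorporates fractal features of low-resolution images, named MFSR. MFSR uses these fractal features as reinforcing conditions in the denoising process of the diffusion model to ensure accurate recovery of texture information. MFSR employs convolution as a soft assignment to approximate the fractal features of low-resolution images. This approach is also used to approximate the density feature maps of these images. By using soft assignment, the spatial layout of the image is described hierarchically, encoding the self-similarity properties of the image at different scales. Different processing methods are applied to different types of features to enrich the information acquired by the model. In addition, a sub-denoiser is integrated into the denoising U-Net to reduce noise in the feature maps during upsampling, thereby improving the quality of the generated images. Experiments conducted on various face and natural image datasets show that MFSR can generate higher-quality images.||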
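|**2025-02-27**|[Learning Mask Invariant Mutual Information for Masked Image Modeling](http://arxiv.org/abs/2502.19718)|null|Masked autoencoders (MAEs) are a prominent self-supervised learning paradigm in computer vision. Despite their empirical success, the underlying mechanisms of MAEs remain insufficiently understood. Recent studies have tried to explain how MAEs work through contrastive learning and feature representation analysis, but these approaches often provide only implicit insights. In this paper, we propose a new perspective for understanding MAEs based on the information bottleneck principle from information theory. Our theoretical analysis shows that optimizing the latent features to balance relevant and irrelevant information is key to improving MAE performance. Building on our proofs, we introduce MI-MAE, a new method that optimizes MAEs through mutual information maximization and minimization. By enhancing the latent features to retain maximal relevant information between them and the output, and minimizing irrelevant information between them and the input, our approach achieves better performance. Extensive experiments on standard benchmarks show that MI-MAE significantly outperforms MAE models on tasks such as image classification, object detection, and semantic segmentation. Our findings validate the theoretical framework and highlight the practical advantages of applying the information bottleneck principle to MAEs, offering deeper insights for developing more powerful self-supervised learning models.||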
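|**2025-02-27**|[Spatial-Spectral Diffusion Contrastive Representation Network for Hyperspectral Image Classification](http://arxiv.org/abs/2502.19699)|null|Although efficiently extracting discriminative spatial-spectral features is critical for hyperspectral image classification (HSIC), such features are difficult to obtain because of factors such as spatial-spectral heterogeneity and noise effects. This paper presents a Spatial-Spectral Diffusion Contrastive Representation Network (DiffCRN) for HSIC, based on the denoising diffusion probabilistic model (DDPM) combined with contrastive learning (CL), with the following characteristics. First, to improve spatial-spectral feature representation, instead of the UNet-like structures widely used in DDPMs, we design a novel staged architecture in DiffCRN with a spatial self-attention denoising module (SSAD) and a spectral group self-attention denoising module (SGSAD), which improves the efficiency of spatial-spectral feature learning. Second, to improve unsupervised feature learning efficiency, we design a new DDPM model with a logarithmic absolute error (LAE) loss and CL, which improves the effectiveness of the loss function and increases instance-level and inter-class discriminability. Third, to improve feature selection, we design a learnable approach based on the pixel-level spectral angle mapping (SAM) that selects time steps in the proposed DDPM model adaptively and automatically. Finally, to improve feature integration and classification, we design an adaptive weighting module (AWAM) and a cross time step spatial-spectral fusion module (CTSSFM) to fuse time-step-wise features and perform classification. Experiments on four widely used hyperspectral image datasets show that the proposed DiffCRN outperforms classical backbone models as well as state-of-the-art GAN and Transformer models and other pretrained methods. The source code and pretrained model will be made publicly available.||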
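|**2025-02-27**|[BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance](http://arxiv.org/abs/2502.19694)|null|Bird's-eye-view (BEV) representations play a crucial role in autonomous driving tasks. Despite recent advances in BEV generation, the inherent noise arising from sensor limitations and the learning process remains largely unaddressed, resulting in suboptimal BEV representations that adversely affect downstream task performance. To address this, we propose BEVDiffuser, a novel diffusion model that effectively denoises BEV feature maps using the ground-truth object layout as guidance. BEVDiffuser can be operated in a plug-and-play manner during training to enhance existing BEV models without any architectural modifications. Extensive experiments on the challenging nuScenes dataset demonstrate BEVDiffuser's strong denoising and generation capabilities, which significantly enhance existing BEV models: mAP and NDS for 3D object detection improve by 12.3% and 10.1%, respectively, without introducing additional computational complexity. Moreover, substantial improvements in long-tail object detection and under challenging weather and lighting conditions further validate BEVDiffuser's effectiveness in denoising and enhancing BEV representations.||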
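|**2025-02-26**|[Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras](http://arxiv.org/abs/2502.19630)|**[link](https://github.com/mickeykang16/ev3dod)**|Point cloud-based 3D object detection plays a crucial role in autonomous driving systems. Recently, multi-modal methods that incorporate camera information have achieved notable results. However, for a safe and effective autonomous driving system, algorithms must be not only accurate but also fast and low-latency. Existing algorithms fail to meet these requirements because of the latency and bandwidth limitations of fixed frame rate sensors such as LiDAR and cameras. To address this, we introduce asynchronous event cameras into 3D object detection for the first time. We exploit their high temporal resolution and low bandwidth to enable high-speed 3D object detection. Our method enables detection even during inter-frame intervals when synchronized data is unavailable, by retrieving previous 3D information through the event camera. In addition, we introduce DSEC-3DOD, the first event-based 3D object detection dataset, which includes ground-truth 3D bounding boxes at 100 FPS and establishes the first benchmark for event-based 3D detectors. The code and dataset are available at https://github.com/mickeykang16/Ev3DOD.||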
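|**2025-02-25**|[MedKAN: An Advanced Kolmogorov-Arnold Network for Medical Image Classification](http://arxiv.org/abs/2502.18416)|null|Recent advances in deep learning for image classification have relied mainly on convolutional neural networks (CNNs) or Transformer-based architectures. However, these models face notable challenges in medical imaging, particularly in capturing intricate texture details and contextual features. Kolmogorov-Arnold Networks (KANs) represent a new type of architecture that enhances nonlinear transformation modeling and thereby represents complex features better. In this work, we propose MedKAN, a medical image classification framework built on KAN and its convolutional extensions. MedKAN has two core modules: the Local Information KAN (LIK) module for fine-grained feature extraction and the Global Information KAN (GIK) module for global context integration. By combining these modules, MedKAN achieves robust feature modeling and fusion. To meet different computational needs, we introduce three scalable variants: MedKAN-S, MedKAN-B, and MedKAN-L. Experimental results on nine public medical imaging datasets show that MedKAN achieves superior performance compared with CNN-based and Transformer-based models, highlighting its effectiveness and generalizability in medical image analysis.||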
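|**2025-02-25**|[UASTrack: A Unified Adaptive Selection Framework with Modality-Customization in Single Object Tracking](http://arxiv.org/abs/2502.18220)|null|Multi-modal tracking is essential in single-object tracking (SOT), because different sensor types offer unique capabilities for overcoming challenges caused by variations in object appearance. However, existing unified RGB-X trackers (where X denotes the depth, event, or thermal modality) either rely on task-specific training strategies for individual RGB-X image pairs or fail to address the critical importance of modality-adaptive perception in real-world applications. In this work, we propose UASTrack, a unified adaptive selection framework that facilitates both model and parameter unification as well as adaptive modality discrimination across various multi-modal tracking tasks. To achieve modality-adaptive perception in joint RGB-X pairs, we design a Discriminative Auto-Selector (DAS) capable of identifying modality labels and thereby distinguishing the data distributions of auxiliary modalities. Furthermore, we propose a Task-Customized Optimization Adapter (TCOA) tailored to the various modalities in the latent space. This strategy effectively filters noise redundancy and mitigates background interference based on the specific characteristics of each modality. Extensive comparisons on five benchmarks (LasHeR, GTOT, RGBT234, VisEvent, and DepthTrack), covering RGB-T, RGB-E, and RGB-D tracking scenarios, show that our approach achieves comparable performance while introducing only 1.87M additional training parameters and 1.95G FLOPs. The code will be available at https://github.com/wanghe/UASTrack.||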
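|**2025-02-24**|[A Priori Generalizability Estimate for a CNN](http://arxiv.org/abs/2502.17622)|null|We formulate truncated singular value decompositions of entire convolutional neural networks. We demonstrate that the computed left and right singular vectors are useful for identifying images on which the convolutional neural network is likely to perform poorly. To create this diagnostic tool, we define two metrics: the Right Projection Ratio and the Left Projection Ratio. The Right (Left) Projection Ratio evaluates the fidelity of the projection of an image (label) onto the computed right (left) singular vectors. We observe that both ratios can identify the presence of class imbalance in an image classification problem. Additionally, we find that the Right Projection Ratio, which requires only unlabeled data, correlates with model performance when applied to image segmentation. This suggests that the Right Projection Ratio could be a useful metric for estimating how likely a model is to perform well on a sample.||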
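|**2025-02-24**|[V-HOP: Visuo-Haptic 6D Object Pose Tracking](http://arxiv.org/abs/2502.17434)|null|Humans naturally integrate vision and touch for robust object perception during manipulation. The loss of either modality significantly degrades performance. Inspired by this multisensory integration, prior object pose estimation research has attempted to combine visual and haptic/tactile feedback. Although these works show improvements in controlled environments or on synthetic datasets, they often underperform vision-only approaches in real-world settings because of poor generalization across diverse grippers, sensor layouts, or sim-to-real environments. Furthermore, they typically estimate the object pose for each frame independently, resulting in less coherent tracking over sequences in real-world deployments. To address these limitations, we introduce a novel unified haptic representation that effectively handles multiple gripper embodiments. Building on this representation, we introduce a new visuo-haptic Transformer-based object pose tracker that seamlessly integrates visual and haptic input. We validate our framework on our dataset and the Feelsight dataset, demonstrating significant performance improvements on challenging sequences. Notably, our method achieves superior generalization and robustness across novel embodiments, objects, and sensor types (both taxel-based and vision-based tactile sensors). In real-world experiments, we show that our approach outperforms state-of-the-art visual trackers by a large margin. We further show that precise manipulation tasks can be achieved by incorporating our real-time object tracking results into motion plans, underscoring the advantages of visuo-haptic perception. Our model and dataset will be open-sourced upon acceptance of the paper. Project website: https://lhy.xyz/projects/v-hop/||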
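|**2025-02-24**|[Enriching Physical-Virtual Interaction in AR Gaming by Tracking Identical Real Objects](http://arxiv.org/abs/2502.17399)|**[link](https://github.com/gmudcxr/EnrichingARInteraction)**|Augmented reality (AR) games, particularly those designed for headsets, have become increasingly prevalent with advances in both hardware and software. However, most AR games still rely on pre-scanned or static scenes, and interaction mechanisms are often limited to controllers or hand tracking. In addition, the presence of identical objects in AR games poses challenges for conventional object tracking techniques, which often struggle to differentiate identical objects or require fixed cameras to be installed for global object movement tracking. In response to these limitations, we present a novel approach to tracking identical objects in AR scenes that enriches physical-virtual interaction. Our method leverages partial scene observations captured by an AR headset, using the perspective and spatial data provided by this technology. Object identities within the scene are determined by solving a label assignment problem with integer programming. To improve computational efficiency, we incorporate a Voronoi diagram-based pruning method into our approach. We implement this approach in a farm-to-table AR game and demonstrate its satisfactory performance and robustness. We also showcase the versatility and practicality of our method through applications in AR storytelling and a simulated gaming robot. Our video demo is available at https://youtu.be/rPGkLYuKvCQ.||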
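|**2025-02-24**|[Experimental validation of UAV search and detection system in real wilderness environment](http://arxiv.org/abs/2502.17372)|null|Search and rescue (SAR) missions require reliable search methods to locate survivors, especially in challenging or inaccessible environments. Introducing unmanned aerial vehicles (UAVs) can therefore greatly improve the efficiency of SAR missions while increasing the safety of everyone involved. Motivated by this, we design and experiment with autonomous UAV search for humans in a Mediterranean karst environment. The UAVs are guided by the Heat equation-driven area coverage (HEDAC) ergodic control method, according to a known probability density and detection function. The implemented sensing framework consists of a probabilistic search model, a motion control system, and computer vision object detection. It makes it possible to calculate the probability of the target being detected in the SAR mission, and this paper focuses on experimental validation of the proposed probabilistic framework and UAV control. To ensure a uniform probability of finding targets in the desired search area, we assigned carefully considered tasks to 78 volunteers, achieving a uniform probability density. The detection model is based on YOLO and trained on a previously collected ortho-photo image database. The experimental search was carefully planned and conducted, with as many parameters as possible recorded. The thorough analysis covers the motion control system, object detection, and search validation. The assessment of detection and search performance provides strong indication that the detection model designed into the UAV control algorithm is aligned with real-world results.||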
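|**2025-02-24**|[Disentangling Visual Transformers: Patch-level Interpretability for Image Classification](http://arxiv.org/abs/2502.17196)|null|Visual transformers have achieved remarkable results in image classification tasks, but this performance gain has come at the cost of interpretability. One of the main obstacles to the interpretability of transformers is the self-attention mechanism, which mixes visual information across the whole image in a complex way. In this paper, we propose the Hindered Transformer (HiT), a novel interpretable-by-design architecture inspired by visual transformers. Our proposed architecture rethinks the design of transformers to better disentangle patch influences at the classification stage. Ultimately, HiT can be interpreted as a linear combination of patch-level information. We show that the interpretability advantages of HiT come with a reasonable trade-off in performance, making it an attractive alternative for applications where interpretability is paramount.||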
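|**2025-02-24**|[LCV2I: Communication-Efficient and High-Performance Collaborative Perception Framework with Low-Resolution LiDAR](http://arxiv.org/abs/2502.17039)|null|Vehicle-to-Infrastructure (V2I) collaborative perception uses data collected by infrastructure sensors to enhance vehicle perception capabilities. LiDAR, a commonly used sensor in cooperative perception, is widely deployed in intelligent vehicles and infrastructure. However, its superior performance comes with a correspondingly high cost. To achieve low-cost V2I, reducing the cost of LiDAR is crucial. We therefore study the use of low-resolution LiDAR on the vehicle to minimize cost. However, simply reducing the resolution of the vehicle's LiDAR results in sparse point clouds, making distant small objects even more blurred. In addition, traditional communication methods have relatively low bandwidth utilization efficiency. These factors pose challenges for us. To balance cost and perceptual accuracy, we propose a new collaborative perception framework, LCV2I. LCV2I takes data from cameras and low-resolution LiDAR as input. It also uses a feature offset correction module and a regional feature enhancement algorithm to improve feature representation. Finally, we use a regional difference map and a regional score map to assess the value of collaboration content, improving communication bandwidth efficiency. In summary, our approach achieves high perceptual performance while substantially reducing the vehicle's need for high-resolution sensors. To evaluate the algorithm, we conduct 3D object detection in real-world scenarios from DAIR-V2X, and the results show that LCV2I consistently outperforms existing algorithms.||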
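|**2025-02-24**|[Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment](http://arxiv.org/abs/2502.16894)|null|Although Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of large language models (LLMs), its performance often falls short of full fine-tuning (Full FT). Current methods optimize LoRA by initializing it with static singular value decomposition (SVD) subsets, which leads to suboptimal use of pretrained knowledge. Another path to improving LoRA is to incorporate a Mixture-of-Experts (MoE) architecture. However, weight misalignment and complex gradient dynamics make it challenging to adopt SVD prior to the LoRA MoE architecture. To mitigate these issues, we propose Great LoRA Mixture-of-Expert (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with a fully fine-tuned MoE by deriving a theoretical scaling factor. We show that proper scaling, without modifying the architecture or training algorithms, improves the efficiency and performance of LoRA MoE. Experiments across 25 datasets, covering natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT's state-of-the-art performance, closing the gap with Full FT.||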
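|**2025-02-24**|[CRTrack: Low-Light Semi-Supervised Multi-object Tracking Based on Consistency Regularization](http://arxiv.org/abs/2502.16809)|null|Multi-object tracking in low-light environments is common in real life. The field of multi-object tracking has developed rapidly in recent years. However, because of the lack of datasets and high annotation costs, multi-object tracking in low-light environments remains a persistent challenge. In this paper, we focus on multi-object tracking under low-light conditions. To address limited data and the lack of datasets, we first construct a low-light multi-object tracking dataset (LLMOT). The dataset comprises data from MOT17 that has been enhanced for nighttime conditions, as well as multiple unannotated low-light videos. Then, to tackle high annotation costs and image quality degradation, we propose a semi-supervised multi-object tracking method based on consistency regularization, named CRTrack. First, we calibrate a consistent adaptive sampling assignment to replace the static IoU-based strategy, making the semi-supervised tracking method resistant to noisy pseudo-bounding boxes. Then, we design an adaptive semi-supervised network update method that effectively uses unannotated data to improve model performance. Dataset and code: https://github.com/ZJZhao123/CRTrack.||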
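|**2025-02-21**|[Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection](http://arxiv.org/abs/2502.15488)|null|PETR-based methods dominate 3D perception benchmarks and are increasingly becoming a key component of modern autonomous driving systems. However, their quantization performance degrades significantly when INT8 inference is required, with drops of 58.2% in mAP and 36.9% in NDS on the NuScenes dataset. To address this, we propose a quantization-aware position embedding transformation for multi-view 3D object detection, termed Q-PETR. Q-PETR offers a quantization-friendly and deployment-friendly architecture while preserving the original performance of PETR. It substantially narrows the accuracy gap between INT8 and FP32 inference for PETR-series methods. Without bells and whistles, our approach keeps the mAP and NDS drops within 1% under standard 8-bit per-tensor post-training quantization. Furthermore, our method exceeds the performance of the original PETR in floating-point precision. Extensive experiments across a variety of PETR-series models demonstrate its broad generalization.||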
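|**2025-02-21**|[A Novel Riemannian Sparse Representation Learning Network for Polarimetric SAR Image Classification](http://arxiv.org/abs/2502.15302)|null|Deep learning is an effective end-to-end method for Polarimetric Synthetic Aperture Radar (PolSAR) image classification, but it lacks guidance from related mathematical principles and is essentially a black-box model. In addition, existing deep models learn features in Euclidean space, where PolSAR complex matrices are commonly converted into complex-valued vectors as the network input, distorting the matrix structure and channel relationships. However, the complex covariance matrix is Hermitian positive definite (HPD) and resides on a Riemannian manifold rather than in a Euclidean space. Existing methods cannot measure the geometric distance between HPD matrices and are prone to misclassifications caused by an inappropriate Euclidean measure. To address these issues, we propose a novel Riemannian Sparse Representation Learning Network (SRSR CNN) for PolSAR images. First, a superpixel-based Riemannian Sparse Representation (SRSR) model is designed to learn sparse features with a Riemannian metric. Then, the optimization procedure of the SRSR model is derived and further unfolded into an SRSRnet, which can automatically learn the sparse coefficients and dictionary atoms. Furthermore, to learn contextual high-level features, a CNN-enhanced module is added to improve classification performance. The proposed network is a sparse representation (SR) guided deep learning model that can directly use the covariance matrix as the network input and use the Riemannian metric to learn the geometric structure and sparse features of complex matrices in Riemannian space. Experiments on three real PolSAR datasets show that the proposed method surpasses state-of-the-art techniques in ensuring accurate edge details and correct region homogeneity.||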
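|**2025-02-21**|[Quantum autoencoders for image classification](http://arxiv.org/abs/2502.15254)|null|Classical machine learning often struggles with complex, high-dimensional data. Quantum machine learning offers a potential solution, promising more efficient processing. Although the quantum convolutional neural network (QCNN), a hybrid quantum-classical algorithm, is suitable for current noisy intermediate-scale quantum (NISQ) era hardware, its learning process relies heavily on classical computation. Future large-scale, gate-based quantum computers could unlock the full potential of quantum effects in machine learning. In contrast to QCNNs, quantum autoencoders (QAEs) use classical optimization solely for parameter tuning. Data compression and reconstruction are handled entirely within quantum circuits, enabling purely quantum-based feature extraction. This study introduces a novel image classification approach using QAEs that achieves classification without requiring additional qubits compared with conventional QAE implementations. The quantum circuit structure significantly affects classification accuracy. Unlike hybrid methods such as QCNN, QAE-based classification emphasizes quantum computation. Our experiments demonstrate high accuracy in a four-class classification task and evaluate various quantum-gate configurations to understand how different parameterized quantum circuit (ansatz) structures affect classification performance. Our results show that specific ansatz structures achieve higher accuracy, and we provide an analysis of their effectiveness. Moreover, the proposed approach achieves performance comparable to conventional machine learning methods while significantly reducing the number of parameters that require optimization. These findings indicate that QAEs can serve as efficient classification models with fewer parameters, and they highlight the potential of using quantum circuits for complete end-to-end learning, in contrast to hybrid approaches such as QCNN.||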
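|**2025-02-21**|[TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba](http://arxiv.org/abs/2502.15130)|null|Transformers are favored in both uni-modal and multi-modal foundation models because of the flexible scalability of their attention modules. As a result, many pretrained Transformer models, such as LLaVA, CLIP, and DEIT, are publicly available. Recent research has introduced sub-quadratic architectures such as Mamba, which enables global awareness with linear complexity. However, training specialized sub-quadratic architectures from scratch for specific tasks is both resource-intensive and time-consuming. To this end, we explore cross-architecture training that transfers the knowledge in existing Transformer models to the alternative architecture Mamba, termed TransMamba. Our approach uses a two-stage strategy to accelerate the training of new Mamba models, ensuring effectiveness across uni-modal and cross-modal tasks. To handle architectural differences, we project the intermediate features into an aligned latent space before transferring knowledge. In addition, we introduce a Weight Subcloning and Adaptive Bidirectional distillation method (WSAB) for knowledge transfer that is not limited by differing numbers of layers. For cross-modal learning, we propose a cross-Mamba module that integrates language awareness into Mamba's visual features, enhancing the cross-modal interaction capabilities of the Mamba architecture. Despite using less than 75% of the training data typically required for training from scratch, TransMamba delivers substantially stronger performance across various network architectures and downstream tasks, including image classification, visual question answering, and text-video retrieval. The code will be publicly available.||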
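|**2025-02-20**|[YOLOv12: A Breakdown of the Key Architectural Features](http://arxiv.org/abs/2502.14740)|null|This paper presents an architectural analysis of YOLOv12, a significant advance in single-stage, real-time object detection that builds on the strengths of its predecessors while introducing key improvements. The model incorporates an optimized backbone (R-ELAN), 7x7 separable convolutions, and FlashAttention-driven area-based attention, improving feature extraction, enhancing efficiency, and enabling robust detection. Like its predecessors, YOLOv12 comes in multiple model variants, offering scalable solutions for both latency-sensitive and high-accuracy applications. Experimental results show consistent gains in mean average precision (mAP) and inference speed, making YOLOv12 a compelling choice for applications such as autonomous systems, security, and real-time analytics. By achieving an optimal balance between computational efficiency and performance, YOLOv12 sets a new benchmark for real-time computer vision and facilitates deployment across diverse hardware platforms, from edge devices to high-performance clusters.||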
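|**2025-02-20**|[LXLv2: Enhanced LiDAR Excluded Lean 3D Object Detection with Fusion of 4D Radar and Camera](http://arxiv.org/abs/2502.14503)|null|As the previous state-of-the-art method for 3D object detection based on 4D radar-camera fusion, LXL uses predicted image depth distribution maps and radar 3D occupancy grids to assist sampling-based image view transformation. However, the depth prediction lacks accuracy and consistency, and the concatenation-based fusion in LXL impedes model robustness. In this work, we propose LXLv2, which introduces modifications to overcome these limitations and improve performance. Specifically, considering the position error in radar measurements, we devise a one-to-many depth supervision strategy via radar points, in which the radar cross section (RCS) value is further exploited to adjust the supervision area for object-level depth consistency. In addition, a channel and spatial attention-based fusion module named CSAFusion is introduced to improve feature adaptiveness. Experimental results on the View-of-Delft and TJ4DRadSet datasets show that the proposed LXLv2 outperforms LXL in detection accuracy, inference speed, and robustness, demonstrating the effectiveness of the model.||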
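|**2025-02-20**|[Stochastic Resonance Improves the Detection of Low Contrast Images in Deep Learning Models](http://arxiv.org/abs/2502.14442)|null|Stochastic resonance describes the utility of noise in improving the detectability of weak signals in certain types of systems. It has been observed widely in natural and engineered settings, but its utility in image classification with rate-based neural networks has not been studied extensively. In this analysis, a simple LSTM recurrent neural network is trained for digit recognition and classification. During the test phase, image contrast is reduced to a point where the model fails to recognize the presence of a stimulus. Controlled noise is added to partially recover classification performance. The results indicate the presence of stochastic resonance in rate-based recurrent neural networks.||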
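|**2025-02-20**|[Reliable Explainability of Deep Learning Spatial-Spectral Classifiers for Improved Semantic Segmentation in Autonomous Driving](http://arxiv.org/abs/2502.14416)|null|Combining hyperspectral imagery (HSI) with deep neural networks (DNNs) can improve the accuracy of intelligent vision systems by fusing spectral and spatial information, which is useful for tasks such as semantic segmentation in autonomous driving. To advance research on such safety-critical systems, the precise contribution of spectral information to the output of complex DNNs needs to be determined. To this end, several saliency methods, such as class activation maps (CAM), have been proposed, primarily for image classification. However, recent studies have raised concerns about their reliability. In this paper, we address these limitations and propose an alternative approach that uses the data provided by the activations and weights of relevant DNN layers to better capture the relationship between input features and predictions. The study aims to assess the superior performance of HSI compared with 3-channel and single-channel DNNs. We also examine the influence of spectral signature normalization on enhancing DNN robustness in real-world driving conditions.||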
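|**2025-02-20**|[ODVerse33: Is the New YOLO Version Always Better? A Multi Domain benchmark from YOLO v5 to v11](http://arxiv.org/abs/2502.14314)|null|YOLO (You Only Look Once) models have been widely used to build real-time object detectors across various domains. As new YOLO versions are released with increasing frequency, key questions arise: Are newer versions always better than their predecessors? What are the core innovations in each YOLO version? How do these changes translate into real-world performance gains? In this paper, we summarize the key innovations from YOLOv1 to YOLOv11 and introduce a comprehensive benchmark called ODverse33, which includes 33 datasets spanning 11 diverse domains (autonomous driving, agriculture, underwater, medical, video games, industrial, aerial, wildlife, retail, microscopic, and security), and we explore the practical impact of model improvements in real-world, multi-domain applications through extensive experimental results. We hope this study can offer some guidance to the many users of object detection models and serve as a reference for the future development of real-time object detectors.||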
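|**2025-02-19**|[MambaLiteSR: Image Super-Resolution with Low-Rank Mamba using Knowledge Distillation](http://arxiv.org/abs/2502.14090)|null|Generative artificial intelligence (AI) has gained significant attention in recent years, transforming a wide range of applications across industries. Among these, advanced vision models for image super-resolution are in high demand, particularly for deployment on edge devices where real-time processing is crucial. However, deploying such models on edge devices is challenging because of limited computing power and memory. In this paper, we present MambaLiteSR, a novel lightweight image super-resolution (SR) model that uses the architecture of Vision Mamba. It integrates State Space Blocks and a reconstruction module for efficient feature extraction. To optimize efficiency without compromising performance, MambaLiteSR uses knowledge distillation, with hyperparameter tuning, to transfer key information from a larger Mamba-based teacher model to a smaller student model. Through mathematical analysis of model parameters and their impact on peak signal-to-noise ratio (PSNR), we identify key factors and adjust them accordingly. Our comprehensive evaluation shows that MambaLiteSR outperforms state-of-the-art edge SR methods by reducing power consumption while maintaining competitive PSNR and structural similarity (SSIM) scores on benchmark datasets. It also reduces power usage during training through low-rank approximation. Moreover, MambaLiteSR reduces parameters with minimal performance loss, enabling efficient deployment of generative AI models on resource-constrained devices. Deployment on the embedded NVIDIA Jetson Orin Nano confirms MambaLiteSR's strong balance of size, latency, and efficiency. Experiments show that MambaLiteSR achieves performance comparable to both the baseline and other edge models while using 15% fewer parameters. It also reduces power consumption by up to 58% compared with state-of-the-art SR edge models, while maintaining low energy use during training.||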
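|**2025-02-19**|[Image compositing is all you need for data augmentation](http://arxiv.org/abs/2502.13936)|null|This paper investigates the impact of various data augmentation techniques on the performance of object detection models. Specifically, we explore classical augmentation methods, image compositing, and advanced generative models such as Stable Diffusion XL and ControlNet. The goal of this work is to enhance model robustness and improve detection accuracy, particularly when annotated data is limited. Using YOLOv8, we fine-tune the model on a custom dataset of commercial and military aircraft, applying different augmentation strategies. Our experiments show that image compositing offers the highest improvement in detection performance, as measured by precision, recall, and mean average precision (mAP). Other methods, including Stable Diffusion XL and ControlNet, also show significant gains, highlighting the potential of advanced data augmentation techniques for object detection tasks. The results underscore the importance of dataset diversity and augmentation in achieving better generalization and performance in real-world applications. Future work will explore the integration of semi-supervised learning methods and further optimizations to enhance model performance on larger and more complex datasets.||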
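|**2025-02-19**|[MEX: Memory-efficient Approach to Referring Multi-Object Tracking](http://arxiv.org/abs/2502.13875)|null|Referring Multi-Object Tracking (RMOT) is a relatively new concept that has rapidly gained traction as a promising research direction at the intersection of computer vision and natural language processing. Unlike traditional multi-object tracking, RMOT identifies and tracks objects and incorporates textual descriptions of object class names, making the approach more intuitive. Various techniques have been proposed to address this challenging problem; however, most require training the entire network because of their end-to-end nature. Among these methods, iKUN has emerged as a particularly promising solution. We therefore further explore its pipeline and improve its performance. In this paper, we introduce a practical module dubbed Memory-Efficient Cross-modality (MEX). This memory-efficient technique can be applied directly to off-the-shelf trackers such as iKUN, resulting in significant architectural improvements. Our method proves effective during inference on a single GPU with 4 GB of memory. Among the various benchmarks, the Refer-KITTI dataset, which offers diverse autonomous driving scenes with relevant language expressions, is particularly useful for studying this problem. Empirically, our method demonstrates effectiveness and efficiency in terms of HOTA tracking scores, substantially improving memory allocation and processing speed.||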
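|**2025-02-19**|[MSVCOD:A Large-Scale Multi-Scene Dataset for Video Camouflage Object Detection](http://arxiv.org/abs/2502.13859)|null|Video Camouflaged Object Detection (VCOD) is a challenging task that aims to identify objects that blend seamlessly into the background in videos. The dynamic nature of video allows camouflaged objects to be detected through motion cues or varied perspectives. Previous VCOD datasets primarily contain animal objects, limiting the scope of research to wildlife scenarios. However, the applications of VCOD extend well beyond wildlife and have significant implications in security, art, and medical fields. To address this, we construct MSVCOD, a new large-scale multi-domain VCOD dataset. To achieve high-quality annotations, we design a semi-automatic iterative annotation pipeline that reduces costs while maintaining annotation accuracy. MSVCOD is the largest VCOD dataset to date and the first to introduce multiple object categories, including human, animal, medical, and vehicle objects, while also expanding background diversity across various environments. This expanded scope increases the practical applicability of the VCOD task in camouflaged object detection. Alongside the dataset, we introduce a one-stream video camouflaged object detection model that performs both feature extraction and information fusion without an additional motion feature fusion module. Our framework achieves state-of-the-art results on the existing VCOD animal dataset and the proposed MSVCOD. The dataset and code will be made publicly available.||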
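|**2025-02-19**|[Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention](http://arxiv.org/abs/2502.13693)|**[link](https://github.com/Omid-Nejati/MedViTV2)**|Convolutional networks, transformers, hybrid models, and Mamba-based architectures have demonstrated strong performance across various medical image classification tasks. However, these methods were primarily designed to classify clean images using labeled data. In contrast, real-world clinical data often involve image corruptions that are unique to multi-center studies and stem from variations in imaging equipment across manufacturers. In this paper, we introduce the Medical Vision Transformer (MedViTV2), a novel architecture that incorporates Kolmogorov-Arnold Network (KAN) layers into the transformer architecture for the first time, aiming for generalized medical image classification. We develop an efficient KAN block to reduce computational load while improving on the accuracy of the original MedViT. In addition, to counteract the fragility of MedViT when scaled up, we propose an enhanced Dilated Neighborhood Attention (DiNA), an adaptation of the efficient fused dot-product attention kernel that captures global context and expands receptive fields, scaling the model effectively and addressing feature collapse. Moreover, a hierarchical hybrid strategy is introduced to stack our Local Feature Perception and Global Feature Perception blocks efficiently, balancing local and global feature perception to boost performance. Extensive experiments on 17 medical image classification datasets and 12 corrupted medical image datasets show that MedViTV2 achieves state-of-the-art results in 27 out of 29 experiments with reduced computational complexity. MedViTV2 is 44% more computationally efficient than the previous version and significantly improves accuracy, with gains of 4.6% on MedMNIST, 5.8% on NonMNIST, and 13.4% on the MedMNIST-C benchmark.||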
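|**2025-02-18**|[RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Bird's Eye View for 3D Object Detection](http://arxiv.org/abs/2502.13071)|null|Although recent low-cost radar-camera approaches have shown promising results in multi-modal 3D object detection, both sensors face challenges from environmental and intrinsic disturbances. Poor lighting or adverse weather conditions degrade camera performance, while radar suffers from noise and positional ambiguity. Robust radar-camera 3D object detection requires consistent performance across varying conditions, a topic that has not yet been fully explored. In this work, we first conduct a systematic analysis of the robustness of radar-camera detection under five kinds of noise and propose RobuRCDet, a robust object detection model in BEV space. Specifically, we design a 3D Gaussian Expansion (3DGE) module to mitigate inaccuracies in radar points, including position, radar cross section (RCS), and velocity. 3DGE uses RCS and velocity priors to generate a deformable kernel map and variance for kernel size adjustment and value distribution. In addition, we introduce a weather-adaptive fusion module that adaptively fuses radar and camera features based on camera signal confidence. Extensive experiments on the popular benchmark nuScenes show that our model achieves competitive results under both regular and noisy conditions.||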
|**2025-02-18**|[Benchmarking MedMNIST dataset on real quantum hardware](http://arxiv.org/abs/2502.13056)|null|Quantum machine learning (QML) has emerged as a promising domain for leveraging the computational capabilities of quantum systems to solve complex classification tasks. In this work, we present the first comprehensive QML study that benchmarks MedMNIST, a diverse collection of medical imaging datasets, on real 127-qubit IBM quantum hardware, to evaluate the feasibility and performance of quantum models (without any classical neural networks) in practical applications. This study explores recent advancements in quantum computing, such as device-aware quantum circuits and error suppression and mitigation, for medical image classification. Our methodology comprises the following stages: preprocessing, generation of noise-resilient and hardware-efficient quantum circuits, optimization/training of quantum circuits on classical hardware, and inference on real IBM quantum hardware. First, we process all input images in the preprocessing stage to reduce their spatial dimension because of quantum hardware limitations. We then use backend properties to generate hardware-efficient quantum circuits that are expressive enough to learn complex patterns for medical image classification. After classical optimization of the QML models, we perform inference on real quantum hardware. We also incorporate advanced error suppression and mitigation techniques into our QML workflow, including dynamical decoupling (DD), gate twirling, and matrix-free measurement mitigation (M3), to mitigate the effects of noise and improve classification performance. The experimental results showcase the potential of quantum computing for medical imaging and establish a benchmark for future advancements in QML applied to healthcare.||
|**2025-02-17**|[OCT Data is All You Need: How Vision Transformers with and without Pre-training Benefit Imaging](http://arxiv.org/abs/2502.12379)|null|Optical Coherence Tomography (OCT) provides high-resolution cross-sectional images useful for diagnosing various diseases, but their distinct characteristics from natural images raise questions about whether large-scale pre-training on datasets like ImageNet is always beneficial. In this paper, we investigate the impact of ImageNet-based pre-training on Vision Transformer (ViT) performance for OCT image classification across different dataset sizes. Our experiments cover four-category retinal pathologies (CNV, DME, Drusen, Normal). Results suggest that while pre-training can accelerate convergence and potentially offer better performance in smaller datasets, training from scratch may achieve comparable or even superior accuracy when sufficient OCT data is available. Our findings highlight the importance of matching domain characteristics in pre-training and call for further study on large-scale OCT-specific pre-training.||
|**2025-02-17**|[Enhancing Transparent Object Pose Estimation: A Fusion of GDR-Net and Edge Detection](http://arxiv.org/abs/2502.12027)|null|Object pose estimation of transparent objects remains a challenging task in the field of robot vision due to the immense influence of lighting, background, and reflections. However, the edges of clear objects have the highest contrast, which leads to stable and prominent features. We propose a novel approach that incorporates edge detection as a pre-processing step for the tasks of object detection and object pose estimation. We conducted experiments to investigate the effect of edge detectors on transparent objects. We examine the performance of the state-of-the-art 6D object pose estimation pipeline GDR-Net and the object detector YOLOX when applying different edge detectors as pre-processing steps (i.e., Canny edge detection with and without color information, and holistically-nested edges (HED)). We evaluate on the physically-based rendered dataset Trans6D-32K of transparent objects, using parameters proposed by the BOP Challenge. Our results indicate that applying edge detection as a pre-processing step enhances performance for certain objects.||
|**2025-02-16**|[Leveraging Conditional Mutual Information to Improve Large Language Model Fine-Tuning For Classification](http://arxiv.org/abs/2502.11258)|null|Although large language models (LLMs) have demonstrated remarkable capabilities in recent years, the potential of information theory (IT) to enhance LLM development remains underexplored. This paper introduces the information theoretic principle of Conditional Mutual Information (CMI) to LLM fine-tuning for classification tasks, exploring its promise in two main ways: minimizing CMI to improve a model's standalone performance and maximizing CMI to enhance knowledge distillation (KD) for more capable student models. To apply CMI in LLM fine-tuning, we adapt the recently proposed CMI-constrained deep learning framework, which was initially developed for image classification, with some modification. By minimizing CMI during LLM fine-tuning, we achieve superior performance gains on 6 of 8 GLUE classification tasks compared to BERT. Additionally, maximizing CMI during the KD process results in significant performance improvements in 6 of 8 GLUE classification tasks compared to DistilBERT. These findings demonstrate CMI's adaptability for optimizing both standalone LLMs and student models, showcasing its potential as a robust framework for advancing LLM fine-tuning. Our work bridges the gap between information theory and LLM development, offering new insights for building high-performing language models.||
|**2025-02-16**|[DAViMNet: SSMs-Based Domain Adaptive Object Detection](http://arxiv.org/abs/2502.11178)|**[link](https://github.com/enesdoruk/davimnet)**|Unsupervised domain adaptation (UDA) for object detection adapts models trained on labeled source domains to unlabeled target domains, ensuring robust performance across domain shifts. Transformer-based architectures excel at capturing long-range dependencies but face efficiency challenges due to their quadratic attention complexity, which limits scalability in UDA tasks. To address these issues, we propose a hybrid domain-adaptive Mamba Transformer architecture that combines Mamba's efficient state-space modeling with attention mechanisms to tackle domain-specific spatial and channel-wise variations. Each hybrid block integrates domain-adaptive Mamba blocks and attention mechanisms: Domain-Adaptive Mamba employs spatial and channel state-space models to adaptively model domain variations, while attention mechanisms leverage self-attention for intra-domain feature enhancement and cross-attention for effective source-target alignment. Our approach processes both shallow and deeper features, employing an entropy-based knowledge distillation framework with margin ReLU to emphasize discriminative features and suppress noise. Gradient Reversal Layers enable adversarial alignment across network layers, while entropy-driven gating attention with random perturbations refines target features and mitigates overfitting. By unifying these components, our architecture achieves state-of-the-art performance in UDA object detection, balancing efficiency with robust generalization.||
|**2025-02-15**|[Do Deepfake Detectors Work in Reality?](http://arxiv.org/abs/2502.10920)|null|Deepfakes, particularly those involving faceswap-based manipulations, have sparked significant societal concern due to their increasing realism and potential for misuse. Despite rapid advancements in generative models, detection methods have not kept pace, creating a critical gap in defense strategies. This disparity is further amplified by the disconnect between academic research and real-world applications, which often prioritize different objectives and evaluation criteria. In this study, we take a pivotal step toward bridging this gap by presenting a novel observation: the post-processing step of super-resolution, commonly employed in real-world scenarios, substantially undermines the effectiveness of existing deepfake detection methods. To substantiate this claim, we introduce and publish the first real-world faceswap dataset, collected from popular online faceswap platforms. We then qualitatively evaluate the performance of state-of-the-art deepfake detectors on real-world deepfakes, revealing that their accuracy approaches the level of random guessing. Furthermore, we quantitatively demonstrate the significant performance degradation caused by common post-processing techniques. By addressing this overlooked challenge, our study underscores a critical avenue for enhancing the robustness and practical applicability of deepfake detection methods in real-world settings.||
|**2025-02-15**|[CLoCKDistill: Consistent Location-and-Context-aware Knowledge Distillation for DETRs](http://arxiv.org/abs/2502.10683)|null|Object detection has advanced significantly with Detection Transformers (DETRs). However, these models are computationally demanding, posing challenges for deployment in resource-constrained environments (e.g., self-driving cars). Knowledge distillation (KD) is an effective compression method widely applied to CNN detectors, but its application to DETR models has been limited. Most KD methods for DETRs fail to distill transformer-specific global context. Also, they blindly believe in the teacher model, which can sometimes be misleading. To bridge the gaps, this paper proposes Consistent Location-and-Context-aware Knowledge Distillation (CLoCKDistill) for DETR detectors, which includes both feature distillation and logit distillation components. For feature distillation, instead of distilling backbone features like existing KD methods, we distill the transformer encoder output (i.e., memory) that contains valuable global context and long-range dependencies. Also, we enrich this memory with object location details during feature distillation so that the student model can prioritize relevant regions while effectively capturing the global context. To facilitate logit distillation, we create target-aware queries based on the ground truth, allowing both the student and teacher decoders to attend to consistent and accurate parts of encoder memory. Experiments on the KITTI and COCO datasets show our CLoCKDistill method's efficacy across various DETRs, e.g., single-scale DAB-DETR, multi-scale deformable DETR, and denoising-based DINO. Our method boosts student detector performance by 2.2% to 6.4%.||
|**2025-02-14**|[Data-driven Super-Resolution of Flood Inundation Maps using Synthetic Simulations](http://arxiv.org/abs/2502.10601)|**[link](https://github.com/aaravamudan2014/SIDDIS)**|The frequency of extreme flood events is increasing throughout the world. Daily, high-resolution (30m) Flood Inundation Maps (FIM) observed from space play a key role in informing mitigation and preparedness efforts to counter these extreme events. However, the temporal frequency of publicly available high-resolution FIMs, e.g., from Landsat, is on the order of two weeks, which limits the effective monitoring of flood inundation dynamics. Conversely, global, low-resolution (~300m) Water Fraction Maps (WFM) are publicly available from NOAA VIIRS daily. Motivated by the recent successes of deep learning methods for single image super-resolution, we explore the effectiveness and limitations of similar data-driven approaches to downscaling low-resolution WFMs to high-resolution FIMs. To overcome the scarcity of high-resolution FIMs, we train our models with high-quality synthetic data obtained through physics-based simulations. We evaluate our models on real-world data from flood events in the state of Iowa. The study indicates that data-driven approaches exhibit superior reconstruction accuracy over non-data-driven alternatives and that the use of synthetic data is a viable proxy for training purposes. Additionally, we show that our trained models can exhibit superior zero-shot performance when transferred to regions with hydroclimatological similarity to the U.S. Midwest.||
|**2025-02-14**|[Simplifying DINO via Coding Rate Regularization](http://arxiv.org/abs/2502.10385)|null|DINO and DINOv2 are two model families being widely used to learn representations from unlabeled imagery data at large scales. Their learned representations often enable state-of-the-art performance for downstream tasks, such as image classification and segmentation. However, they employ many empirically motivated design choices and their training pipelines are highly complex and unstable -- many hyperparameters need to be carefully tuned to ensure that the representations do not collapse -- which poses considerable difficulty to improving them or adapting them to new domains. In this work, we posit that we can remove most such-motivated idiosyncrasies in the pre-training pipelines, and only need to add an explicit coding rate term in the loss function to avoid collapse of the representations. As a result, we obtain highly simplified variants of the DINO and DINOv2 which we call SimDINO and SimDINOv2, respectively. Remarkably, these simplified models are more robust to different design choices, such as network architecture and hyperparameters, and they learn even higher-quality representations, measured by performance on downstream tasks, offering a Pareto improvement over the corresponding DINO and DINOv2 models. This work highlights the potential of using simplifying design principles to improve the empirical practice of deep learning.||
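|**2025-02-14**|[Ocular Disease Classification Using CNN with Deep Convolutional Generative Adversarial Network](http://arxiv.org/abs/2502.10334)|null|Convolutional neural networks (CNNs) have shown impressive performance in image classification because of their strong learning capabilities. However, they require a large and balanced dataset for effective training. Otherwise, the network frequently overfits and struggles to generalize to new examples. Publicly available datasets of fundus images of ocular diseases are insufficient to train any classification model to satisfactory accuracy. We therefore propose a generative adversarial network (GAN)-based data generation technique to synthesize a dataset for training a CNN-based classification model, and we then test the model on original ocular images showing the diseases. When tested on the original ocular images, the model achieves an accuracy of 78.6% for myopia, 88.6% for glaucoma, and 84.6% for cataract, with an overall classification accuracy of 84.6%.||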
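|**2025-02-14**|[Object Detection and Tracking](http://arxiv.org/abs/2502.10310)|**[link](https://github.com/omar9008-eng/Object-Detection-Tracking-2025)**|Efficient and accurate object detection is an important topic in the development of computer vision systems. With the advent of deep learning techniques, the accuracy of object detection has increased significantly. This project aims to integrate a modern object detection technique with the goal of achieving high accuracy at real-time performance. Many object recognition systems depend on other computer vision algorithms, which results in poor and inefficient performance and is a significant obstacle. In this study, we solve the problem of end-to-end object detection entirely with deep learning techniques. The network is trained on the most challenging publicly available dataset, which is used for an annual object detection challenge. Applications that need object detection can benefit from the system's fast and precise detection.||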
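|**2025-02-14**|[Artificial Intelligence to Assess Dental Findings from Panoramic Radiographs -- A Multinational Study](http://arxiv.org/abs/2502.10277)|**[link](https://github.com/stmharry/dental-pano-ai)**|Dental panoramic radiographs (DPRs) are widely used in clinical practice for comprehensive oral assessment but present challenges because of overlapping structures and time constraints in interpretation. This study aims to establish a solid baseline for the AI-automated assessment of findings in DPRs by developing and evaluating an AI system and comparing its performance with that of human readers across multinational datasets. We analyzed 6,669 DPRs from three datasets (the Netherlands, Brazil, and Taiwan), focusing on 8 types of dental findings. The AI system combines object detection and semantic segmentation techniques to identify findings per tooth. Performance metrics included sensitivity, specificity, and area under the receiver operating characteristic curve (AUC-ROC). The AI's generalizability was tested across datasets, and its performance was compared with that of human dental practitioners. The AI system showed comparable or superior performance to human readers, particularly in identifying periapical radiolucencies, with a sensitivity gain of +67.9% (95% CI: 54.0%-81.9%; p < .001), and in identifying missing teeth, with a sensitivity gain of +4.7% (95% CI: 1.4%-8.0%; p = .008). Across the 8 findings, the AI achieved a macro-averaged AUC-ROC of 96.2% (95% CI: 94.6%-97.8%). The AI's agreement with the reference was comparable to inter-human agreement for 7 of the 8 findings, the exception being caries (p = .024). The AI system showed robust generalization across diverse imaging and demographic settings and processed images 79 times faster (95% CI: 75-82) than human readers. The AI system effectively assessed findings in DPRs, performing on par with or better than human experts while significantly reducing interpretation time. These results highlight the potential of integrating AI into clinical workflows to improve diagnostic efficiency and accuracy as well as patient management.||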
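|**2025-02-14**|[SeWA: Selective Weight Average via Probabilistic Masking](http://arxiv.org/abs/2502.10119)|null|Weight averaging has become a standard technique for enhancing model performance. However, methods such as Stochastic Weight Averaging (SWA) and Latest Weight Averaging (LAWA) often require manually designed procedures to sample from the training trajectory, and their results depend heavily on hyperparameter tuning. To minimize human effort, this paper proposes a simple yet efficient algorithm called Selective Weight Averaging (SeWA), which adaptively selects checkpoints for averaging during the final stages of training. Based on SeWA, we show that only a few points are needed to achieve better generalization and faster convergence. Theoretically, solving the discrete subset selection problem is inherently challenging. To address this, we transform it into a continuous probabilistic optimization framework and use the Gumbel-Softmax estimator to learn the non-differentiable mask for each checkpoint. Furthermore, we theoretically derive stability-based generalization bounds for SeWA, which are sharper than those of SGD under both convex and non-convex assumptions. Finally, extensive experiments across various domains, including behavior cloning, image classification, and text classification, further validate the effectiveness of our approach.||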
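|**2025-02-14**|[AffectSRNet : Facial Emotion-Aware Super-Resolution Network](http://arxiv.org/abs/2502.09932)|null|Facial expression recognition (FER) systems in low-resolution settings face significant challenges in accurately identifying expressions because fine-grained facial details are lost. This limitation is especially problematic for applications such as surveillance and mobile communications, where low image resolution is common and compromises recognition accuracy. Traditional single-image face super-resolution (FSR) techniques, however, often fail to preserve the emotional intent of expressions, introducing distortions that obscure the original affective content. Given the inherently ill-posed nature of single-image super-resolution, a targeted approach is required to balance image quality enhancement with emotion retention. In this paper, we propose AffectSRNet, a novel emotion-aware super-resolution framework that reconstructs high-quality facial images from low-resolution inputs while maintaining the intensity and fidelity of facial expressions. Our method effectively bridges the gap between image resolution and expression accuracy by employing an expression-preserving loss function tailored to FER applications. Additionally, we introduce a new metric to assess emotion preservation in super-resolved images, providing a more nuanced evaluation of FER system performance in low-resolution scenarios. Experimental results on standard datasets, including CelebA, FFHQ, and Helen, demonstrate that AffectSRNet outperforms existing FSR approaches in both visual quality and emotion fidelity, highlighting its potential for integration into practical FER applications. This work not only improves image clarity but also ensures that emotion-driven applications retain their core functionality in suboptimal-resolution environments, paving the way for broader adoption in FER systems.||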
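|**2025-02-13**|[GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis](http://arxiv.org/abs/2502.09598)|**[link](https://github.com/Orion-AI-Lab/GAIA)**|The continuous operation of Earth-orbiting satellites generates vast and ever-growing archives of remote sensing (RS) images. Natural language provides an intuitive interface for accessing, querying, and interpreting the data in such archives. However, existing vision-language models (VLMs) are predominantly trained on noisy, web-scraped image-text data and have limited exposure to the specialized domain of RS. This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead emphasize only attributes such as date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 205,150 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset provides a spatially and temporally balanced distribution, spanning the globe and covering the last 25 years with a balanced temporal distribution of observations. GAIA's construction involved a two-stage process: (1) targeted web scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval, and image captioning tasks.||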
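|**2025-02-13**|[Wholly-WOOD: Wholly Leveraging Diversified-quality Labels for Weakly-supervised Oriented Object Detection](http://arxiv.org/abs/2502.09471)|**[link](https://github.com/yuyi1005/whollywood)**|Accurately estimating the orientation of visual objects and representing them with compact rotated bounding boxes (RBoxes) has become a prominent demand, which challenges existing object detection paradigms that use only horizontal bounding boxes (HBoxes). To equip detectors with orientation awareness, supervised regression/classification modules have been introduced, at the high cost of rotation annotation. Meanwhile, some existing datasets containing oriented objects are already annotated with horizontal boxes or even single points. How to effectively use weaker single-point and horizontal annotations to train an oriented object detector (OOD) has become an attractive yet still open problem. We develop Wholly-WOOD, a weakly supervised OOD framework capable of wholly leveraging various labeling forms (points, HBoxes, RBoxes, and their combinations) in a unified fashion. Using only HBoxes for training, Wholly-WOOD achieves performance very close to that of its RBox-trained counterpart in remote sensing and other domains, significantly reducing the tedious effort of labor-intensive annotation for oriented objects. The source code is available at https://github.com/VisionXLab/whollywood (PyTorch-based) and https://github.com/VisionXLab/whollywood-jittor (Jittor-based).||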
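|**2025-02-13**|[Mitigating the Impact of Prominent Position Shift in Drone-based RGBT Object Detection](http://arxiv.org/abs/2502.09311)|null|Drone-based RGBT object detection plays a crucial role in many around-the-clock applications. However, real-world drone-view RGBT data suffers from a prominent position shift problem, i.e., the position of a tiny object differs greatly between modalities. For instance, a slight deviation of a tiny object in the thermal modality can cause it to drift away from its own body in the RGB modality. Because RGBT data is usually annotated on only one modality (the reference modality), the unlabeled modality (the sensed modality) lacks accurate supervision signals, which prevents the detector from learning a good feature representation. Moreover, the mismatch of corresponding feature points between the modalities makes the fused features confusing for the detection head. In this paper, we cast the cross-modality box shift issue as a label noise problem and address it on the fly via a novel Mean Teacher-based Cross-modality Box Correction head ensemble (CBC). In this way, the network can learn more informative representations for both modalities. Furthermore, to alleviate feature map mismatch in RGBT fusion, we devise a sliding-window-based cascaded alignment module (SWCA). SWCA mines long-range dependencies between spatially unaligned features inside sliding windows and aligns the sensed features with the reference ones in a cascaded manner. Extensive experiments on two drone-based RGBT object detection datasets demonstrate that the correction results are both visually and quantitatively favorable, thereby improving detection performance. In particular, our CBC module boosts the precision of the sensed modality ground truth by 25.52 aSim points. Overall, the proposed detector achieves an mAP_50 of 43.55 on the RGBTDronePerson dataset and surpasses a state-of-the-art method by 8.6 mAP50 on a shift subset of the DroneVehicle dataset. The code and data will be made publicly available.||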
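|**2025-02-13**|[Feature-based Graph Attention Networks Improve Online Continual Learning](http://arxiv.org/abs/2502.09143)|null|Online continual learning is essential for image classification models to adapt to new data while retaining knowledge of previously learned tasks. This capability is crucial for addressing real-world challenges involving dynamic environments and evolving data distributions. Traditional approaches predominantly employ convolutional neural networks, which are limited to processing images as grids and primarily capture local patterns rather than relational information. Although the emergence of Transformer architectures has improved the ability to capture relationships, these models often require significantly more resources. In this paper, we present a novel online continual learning framework based on Graph Attention Networks (GATs), which effectively captures contextual relationships and dynamically updates task-specific representations via learned attention weights. Our approach uses a pre-trained feature extractor to convert images into graphs using hierarchical feature maps, representing information at varying levels of granularity. These graphs are then processed by a GAT combined with an enhanced global pooling strategy to improve classification performance in continual learning. In addition, we propose a rehearsal memory duplication technique that improves the representation of previous tasks while maintaining the memory budget. Comprehensive evaluations on benchmark datasets, including SVHN, CIFAR10, CIFAR100, and MiniImageNet, show that our method outperforms existing state-of-the-art methods.||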
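|**2025-02-12**|[Deep EEG Super-Resolution: Upsampling EEG Spatial Resolution with Generative Adversarial Networks](http://arxiv.org/abs/2502.08803)|null|Electroencephalography (EEG) activity contains a wealth of information about what is happening within the human brain. Recording more of this data has the potential to unlock endless future applications. However, the cost of EEG hardware grows with the number of EEG channels recorded simultaneously. In this paper, we address this problem by proposing a novel deep EEG super-resolution (SR) approach based on Generative Adversarial Networks (GANs). This approach can produce high-spatial-resolution EEG data from low-resolution samples by generating channel-wise upsampled data that effectively interpolates numerous missing channels, thereby reducing the need for expensive EEG equipment. We tested the performance using an EEG dataset from a mental imagery task. Compared with a baseline bicubic interpolation method, our proposed GAN model reduced the mean squared error (MSE) and mean absolute error (MAE) by factors of 10^4 and 10^2, respectively. We further validated our method by training a classifier on the original classification task, which showed minimal loss in accuracy when using super-resolved data. EEG super-resolution with GANs is a promising approach to improving the spatial resolution of low-density EEG headsets.||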
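|**2025-02-12**|[Rapid Whole Brain Mesoscale In-vivo MR Imaging using Multi-scale Implicit Neural Representation](http://arxiv.org/abs/2502.08634)|null|Purpose: To develop and validate a novel image reconstruction technique using implicit neural representations (INR) for multi-view thick-slice acquisitions that reduces scan time while maintaining a high signal-to-noise ratio (SNR). Methods: We propose Rotating-view super-resolution (ROVER)-MRI, an unsupervised neural network-based algorithm designed to reconstruct MRI data from multi-view thick slices, effectively reducing scan time by a factor of 2 while preserving fine anatomical details. We compare our method with bicubic interpolation and the current state-of-the-art regularized least-squares super-resolution reconstruction (LS-SRR) technique. Validation is performed using ground-truth ex-vivo monkey brain data, and we demonstrate superior reconstruction quality on several in-vivo human datasets. Notably, we achieve the reconstruction of a whole human brain in-vivo T2-weighted image with a spatial resolution of 180 micrometers, accomplished in just 17 minutes of scan time on a 7T MRI scanner. Results: ROVER-MRI outperformed LS-SRR in reconstruction quality, with a 22.4% lower relative error (RE) and a 7.5% lower full-width at half-maximum (FWHM), indicating better preservation of fine structural details in nearly half the scan time. Conclusion: ROVER-MRI offers an efficient and robust approach to mesoscale MR imaging, enabling rapid, high-resolution whole-brain scans. Its versatility holds great promise for research applications that require anatomical detail and time-efficient imaging.||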
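|**2025-02-12**|[ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification](http://arxiv.org/abs/2502.08391)|**[link](https://github.com/jiangbo-shi/vila-mil)**|Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs), with their gigapixel size and hierarchical image context, in digital pathology. However, these methods rely heavily on a substantial number of bag-level labels and learn solely from the original slides, which makes them susceptible to shifts in data distribution. Recently, vision-language model (VLM)-based methods have introduced a language prior by pre-training on large-scale pathological image-text pairs. However, previous text prompts lack consideration of pathological prior knowledge and therefore do not substantially boost model performance. Moreover, collecting such image-text pairs and running the pre-training process are very time-consuming and resource-intensive. To solve these problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on a frozen large language model (LLM) to boost the performance of the VLM effectively. To transfer the VLM to WSI processing efficiently, for the image branch we propose a prototype-guided patch decoder that aggregates patch features progressively by grouping similar patches into the same prototype; for the text branch we introduce a context-guided text decoder that enhances text features by incorporating multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.||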
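|**2025-02-12**|[Uncertainty Aware Human-machine Collaboration in Camouflaged Object Detection](http://arxiv.org/abs/2502.08373)|**[link](https://github.com/ziyuey/uncertainty-aware-human-machine-collaboration-in-camouflaged-object-identification)**|Camouflaged Object Detection (COD), the task of identifying objects concealed within their environments, has seen rapid growth in recent years owing to its wide range of practical applications. A key step toward developing trustworthy COD systems is the estimation and effective use of uncertainty. In this work, we propose a human-machine collaboration framework for classifying camouflaged objects, leveraging the complementary strengths of computer vision (CV) models and noninvasive brain-computer interfaces (BCIs). Our approach introduces a multiview backbone to estimate uncertainty in the CV model's predictions, uses this uncertainty during training to improve efficiency, and defers low-confidence cases to human evaluation via RSVP-based BCIs during testing for more reliable decision-making. We evaluated the framework on the CAMO dataset, achieving state-of-the-art results with an average improvement of 4.56% in balanced accuracy (BA) and 3.66% in F1 score compared with existing methods. For the best-performing participants, the improvements reached 7.6% in BA and 6.66% in F1 score. Analysis of the training process revealed a strong correlation between our confidence measures and precision, while an ablation study confirmed the effectiveness of the proposed training policy and the human-machine collaboration strategy. Overall, this work reduces human cognitive load, improves system reliability, and provides a strong foundation for advances in real-world COD applications and human-computer interaction. Our code and data are available at: https://github.com/ziyuey/Uncertainty-aware-human-machine-collaboration-in-camouflaged-object-identification.||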
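|**2025-02-12**|[Plantation Monitoring Using Drone Images: A Dataset and Performance Review](http://arxiv.org/abs/2502.08233)|null|Automatic monitoring of tree plantations plays a crucial role in agriculture. Flawless monitoring of tree health helps farmers make informed management decisions by taking appropriate action. Automatic plantation monitoring using drone images can improve the accuracy of the monitoring process while remaining affordable to small farmers in developing countries such as India. Small, low-cost drones equipped with an RGB camera can capture high-resolution images of agricultural fields, allowing detailed analysis of plantation health. Existing methods of automated plantation monitoring are mostly based on satellite images, which are difficult for farmers to obtain. We propose an automated system for plantation health monitoring using drone images, which are becoming easier for farmers to obtain. We propose a dataset of tree images in three categories: "healthy", "stunted", and "dead". We annotate the dataset with the CVAT annotation tool for research use. We experiment with different well-known CNN models to observe their performance on the proposed dataset. The initially low accuracy levels show the complexity of the proposed dataset. Further, our study reveals that depthwise convolution operations embedded in a deep CNN model can enhance the model's performance on the drone dataset. In addition, we apply state-of-the-art object detection models to identify individual trees so that they can be better monitored automatically.||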
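|**2025-02-12**|[Take What You Need: Flexible Multi-Task Semantic Communications with Channel Adaptation](http://arxiv.org/abs/2502.08221)|null|The growing demand for efficient semantic communication systems that can manage diverse tasks and adapt to fluctuating channel conditions has driven the development of robust, resource-efficient frameworks. This article introduces a novel channel-adaptive and multi-task-aware semantic communication framework based on a masked auto-encoder architecture. Our framework optimizes the transmission of meaningful information by incorporating a multi-task-aware scoring mechanism that identifies and prioritizes semantically significant data across multiple concurrent tasks. A channel-aware extractor is employed to dynamically select relevant information in response to real-time channel conditions. By jointly optimizing semantic relevance and transmission efficiency, the framework ensures minimal performance degradation under resource constraints. Experimental results demonstrate the superior performance of our framework compared with conventional methods in tasks such as image reconstruction and object detection. These results underscore the framework's adaptability to heterogeneous channel environments and its scalability for multi-task applications, positioning it as a promising solution for next-generation semantic communication networks.||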
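|**2025-02-11**|[Visual-based spatial audio generation system for multi-speaker environments](http://arxiv.org/abs/2502.07538)|null|In multimedia applications such as films and video games, spatial audio techniques are widely used to enhance the user experience by simulating 3D sound: transforming mono audio into binaural formats. However, this process is often complex and labor-intensive for sound designers, requiring precise synchronization of audio with the spatial positions of visual components. To address these challenges, we propose a visual-based spatial audio generation system: an automated system that integrates a face-detection YOLOv8 model for object detection, monocular depth estimation, and spatial audio techniques. Notably, the system operates without requiring additional binaural dataset training. We evaluated the proposed system against an existing spatial audio generation system using objective metrics. Experimental results demonstrate that our method significantly improves spatial consistency between audio and video, enhances speech quality, and performs robustly in multi-speaker scenarios. By streamlining the audio-visual alignment process, the proposed system enables sound engineers to achieve high-quality results efficiently, making it a valuable tool for professionals in multimedia production.||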
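|**2025-02-11**|[Optimizing Knowledge Distillation in Transformers: Enabling Multi-Head Attention without Alignment Barriers](http://arxiv.org/abs/2502.07436)|null|Knowledge distillation (KD) in Transformers often faces challenges due to mismatches in the number of attention heads between teacher and student models. Existing methods either require identical head counts or introduce projectors to bridge dimensional gaps, limiting flexibility and efficiency. We propose Squeezing-Heads Distillation (SHD), a novel approach that enables seamless knowledge transfer between models with different head counts by compressing multi-head attention maps via efficient linear approximation. Unlike prior work, SHD eliminates alignment barriers without additional parameters or architectural modifications. Our method dynamically approximates the combined effect of multiple teacher heads into fewer student heads, preserving fine-grained attention patterns while reducing redundancy. Experiments across language (LLaMA, GPT) and vision (DiT, MDT) generative tasks and vision (DeiT) discriminative tasks demonstrate SHD's effectiveness: it outperforms logit-based and feature-alignment KD baselines, achieving state-of-the-art results in image classification, image generation, language fine-tuning, and language pre-training. The key innovations of flexible head compression, projector-free design, and linear-time complexity make SHD a versatile and scalable solution for distilling modern Transformers. This work bridges a critical gap in KD, enabling efficient deployment of compact models without compromising performance.||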
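|**2025-02-11**|[MoENAS: Mixture-of-Expert based Neural Architecture Search for jointly Accurate, Fair, and Robust Edge Deep Neural Networks](http://arxiv.org/abs/2502.07422)|null|Recent years have seen a surge of work on optimizing the accuracy and efficiency of edge deep neural networks (DNNs), using traditional optimization techniques such as pruning and, more recently, automatic design methodologies. However, these design techniques often overlook critical metrics such as fairness, robustness, and generalization. As a result, when evaluating SOTA edge DNNs on image classification using the FACET dataset, we found that they exhibit significant accuracy disparities (14.09%) across 10 different skin tones, alongside non-robustness and poor generalizability. In response to these observations, we introduce Mixture-of-Experts-based Neural Architecture Search (MoENAS), an automatic design technique that searches a mixture-of-experts space to discover accurate, fair, robust, and general edge DNNs. Compared with SOTA edge DNNs, MoENAS improves accuracy by 4.02% and reduces the skin-tone accuracy disparity from 14.09% to 5.60%, while enhancing robustness by 3.80% and minimizing overfitting to 0.21%, all while keeping the model size close to the average size of state-of-the-art models (+0.4M). With these improvements, MoENAS establishes a new benchmark for edge DNN design, paving the way for more inclusive and robust edge DNNs.||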
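|**2025-02-11**|[MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification](http://arxiv.org/abs/2502.07409)|**[link](https://github.com/HauschildLab/MGPATH)**|Whole slide pathology image classification presents challenges due to huge image sizes and limited annotation labels, which hinder model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and is fine-tuned with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention, which compares interactions between learnable prompts and both individual image patches and groups of them. This approach improves the model's ability to capture fine-grained details as well as broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we use an (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during data augmentation. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP. We release our implementations and pre-trained models at MGPATH.||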
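|**2025-02-11**|[Spatial Degradation-Aware and Temporal Consistent Diffusion Model for Compressed Video Super-Resolution](http://arxiv.org/abs/2502.07381)|null|Because of storage and bandwidth limitations, videos stored and transmitted on the Internet are usually of low quality, with low resolution and compression noise. Although video super-resolution (VSR) is an efficient technique for enhancing video resolution, existing VSR methods focus mainly on uncompressed videos. Directly applying general VSR approaches fails to improve practical videos, especially when frames are highly compressed at a low bit rate. Recently, diffusion models have achieved superior performance in low-level visual tasks, and their high-realism generation capability makes them applicable to VSR. To synthesize more of the details lost to compression and to refine temporal consistency, we propose a novel Spatial Degradation-Aware and Temporal Consistent (SDATC) diffusion model for compressed VSR. Specifically, we introduce a distortion control module (DCM) to modulate the diffusion model inputs and guide the generation. Next, the diffusion model performs the denoising process for texture generation with a fine-tuned spatial prompt-based compression-aware module (PCAM) and a spatio-temporal attention module (STAM). PCAM extracts features to encode specific compression information dynamically. STAM extends the spatial attention mechanism to the spatio-temporal dimension to capture temporal correlation. Extensive experimental results on benchmark datasets demonstrate the effectiveness of the proposed modules in enhancing compressed videos.||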
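|**2025-02-11**|[Multi-Task-oriented Nighttime Haze Imaging Enhancer for Vision-driven Measurement Systems](http://arxiv.org/abs/2502.07351)|**[link](https://github.com/ai-chen-lab/mtoie)**|Salient object detection (SOD) plays a critical role in vision-driven measurement systems (VMS), facilitating the detection and segmentation of key visual elements in an image. However, adverse imaging conditions such as daytime haze, low light, and nighttime haze severely degrade image quality and complicate the SOD process. To address these challenges, we propose a multi-task-oriented nighttime haze imaging enhancer (MToIE), which integrates three tasks: daytime dehazing, low-light enhancement, and nighttime dehazing. MToIE incorporates two key innovative components. First, the network employs a task-oriented node learning mechanism to handle three specific degradation types (daytime haze, low light, and nighttime haze), with an embedded self-attention module enhancing its performance in nighttime imaging. In addition, a multi-receptive field enhancement module efficiently extracts multi-scale features through three parallel depthwise separable convolution branches with different dilation rates, capturing comprehensive spatial information with minimal computational overhead. To ensure optimal image reconstruction quality and visual characteristics, we suggest a hybrid loss function. Extensive experiments under different types of weather and imaging conditions show that MToIE surpasses existing methods, significantly enhancing the accuracy and reliability of vision systems across diverse imaging scenarios. The code is available at https://github.com/Ai-Chen-Lab/MToIE.||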
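|**2025-02-11**|[Dense Object Detection Based on De-homogenized Queries](http://arxiv.org/abs/2502.07194)|null|Dense object detection is widely used in autonomous driving, video surveillance, and other fields. This paper focuses on the challenging task of dense object detection. Currently, detection methods based on greedy algorithms, such as non-maximum suppression (NMS), often produce many duplicate predictions or missed detections in dense scenarios, which is a common problem for NMS-based algorithms. Through the end-to-end DETR (DEtection TRansformer), a type of detector that can incorporate the de-duplication capability of post-processing such as NMS into the network, we found that homogeneous queries in query-based detectors reduce both the network's de-duplication capability and the learning efficiency of the encoder, resulting in duplicate predictions and missed detections. To solve this problem, we propose learnable differentiated encoding to de-homogenize the queries; at the same time, queries can communicate with each other via the differentiated encoding information, replacing the previous self-attention among the queries. In addition, we apply a joint loss that considers both location and confidence prediction to the output of the encoder, providing a higher-quality initialization for the queries. Without cumbersome decoder stacking and while maintaining accuracy, our proposed end-to-end detection framework is more concise and reduces the number of parameters by about 8% compared with deformable DETR. Our method achieves excellent results on the challenging CrowdHuman dataset, with 93.6% average precision (AP), 39.2% MR-2, and 84.3% JI. This performance surpasses previous SOTA methods such as Iter-E2EDet (Progressive End-to-End Object Detection) and MIP (One proposal, Multiple predictions). In addition, our method is more robust across scenarios with different densities.||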
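|**2025-02-10**|[From Image to Video: An Empirical Study of Diffusion Representations](http://arxiv.org/abs/2502.07001)|null|Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in the potential of their representations for visual understanding tasks. Although recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely unexplored. To address this gap, we systematically compare the same model architecture trained for video versus image generation, analyzing the performance of the latent representations on various downstream tasks, including image classification, action recognition, depth estimation, and tracking. Results show that video diffusion models consistently outperform their image counterparts, although we find that the extent of this advantage varies widely. We further analyze features extracted from different layers and at varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work is the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.||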
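|**2025-02-10**|[Enhancing Performance of Explainable AI Models with Constrained Concept Refinement](http://arxiv.org/abs/2502.06775)|null|The trade-off between accuracy and interpretability has long been a challenge in machine learning (ML). This tension is particularly significant for emerging interpretable-by-design methods, which aim to redesign ML algorithms for trustworthy interpretability but often sacrifice accuracy in the process. In this paper, we address this gap by investigating how deviations in concept representations, an essential component of interpretable models, affect prediction performance, and we propose a novel framework to mitigate these effects. The framework builds on the principle of optimizing concept embeddings under constraints that preserve interpretability. Using a generative model as a test-bed, we rigorously prove that our algorithm achieves zero loss while progressively enhancing the interpretability of the resulting model. Additionally, we evaluate the practical performance of our proposed framework in generating explainable predictions for image classification tasks across various benchmarks. Compared with existing explainable methods, our approach not only improves prediction accuracy while preserving model interpretability across various large-scale benchmarks but also does so at significantly lower computational cost.||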
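|**2025-02-10**|[AgilePilot: DRL-Based Drone Agent for Real-Time Motion Planning in Dynamic Environments by Leveraging Object Detection](http://arxiv.org/abs/2502.06725)|null|Autonomous drone navigation in dynamic environments remains a critical challenge, especially when dealing with unpredictable scenarios, including fast-moving objects with rapidly changing goal positions. While traditional planners and classical optimization methods have been used extensively to address this dynamic problem, they often struggle with real-time, unpredictable changes, ultimately leading to sub-optimal performance in terms of adaptiveness and real-time decision-making. In this work, we propose a novel motion planner, AgilePilot, based on deep reinforcement learning (DRL) trained in dynamic conditions, coupled with real-time computer vision (CV) for object detection during flight. The training-to-deployment framework bridges the Sim2Real gap, leveraging sophisticated reward structures that promote both safety and agility depending on environmental conditions. The system can rapidly adapt to changing environments while achieving a maximum speed of 3.0 m/s in real-world scenarios. In comparison, our approach outperforms classical motion planning algorithms based on the Artificial Potential Field (APF) by a factor of 3, both in performance and in the tracking accuracy of dynamic targets, by using velocity predictions, while exhibiting a 90% success rate in 75 conducted experiments. This work highlights the effectiveness of DRL in tackling real-time dynamic navigation challenges, offering intelligent safety and agility.||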
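|**2025-02-07**|[AIQViT: Architecture-Informed Post-Training Quantization for Vision Transformers](http://arxiv.org/abs/2502.04628)|null|Post-training quantization (PTQ) has emerged as a promising solution for reducing the storage and computational cost of vision transformers (ViTs). Recent advances primarily target the design of quantizers to handle the peculiar activation characteristics of ViTs. However, most existing methods underestimate the information loss incurred by weight quantization, resulting in significant performance deterioration, particularly in low-bit cases. Furthermore, a common practice in quantizing the post-Softmax activations of ViTs is to employ logarithmic transformations, which unfortunately prioritize less informative values around zero. This approach introduces additional redundancies, ultimately leading to suboptimal quantization efficacy. To address these issues, this paper proposes an innovative PTQ method tailored for ViTs, termed AIQViT (Architecture-Informed Post-training Quantization for ViTs). First, we design an architecture-informed low-rank compensation mechanism, in which learnable low-rank weights are introduced to compensate for the degradation caused by weight quantization. Second, we design a dynamic focusing quantizer to accommodate the unbalanced distribution of post-Softmax activations, which dynamically selects the most valuable interval for higher quantization resolution. Extensive experiments on five vision tasks, including image classification, object detection, instance segmentation, point cloud classification, and point cloud part segmentation, demonstrate the superiority of AIQViT over state-of-the-art PTQ methods.||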
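|**2025-02-06**|[Augmented Conditioning Is Enough For Effective Training Image Generation](http://arxiv.org/abs/2502.04475)|null|The image generation abilities of text-to-image diffusion models have advanced significantly, yielding highly photo-realistic images from descriptive text and increasing the viability of using synthetic images to train computer vision models. To serve as effective training data, generated images must be highly realistic while also sufficiently diverse within the support of the target data distribution. Yet state-of-the-art conditional image generation models have been optimized primarily for creative applications, prioritizing image realism and prompt adherence over conditional diversity. In this paper, we investigate how to improve the diversity of generated images, without fine-tuning the image generation model, with the goal of increasing their effectiveness for training downstream image classification models. We find that conditioning the generation process on an augmented real image and a text prompt produces generations that serve as effective synthetic datasets for downstream training. Conditioning on real training images keeps the generation process in-domain with the real image distribution, while data augmentations introduce visual diversity that improves the performance of the downstream classifier. We validate augmentation-conditioning on a total of five established long-tail and few-shot image classification benchmarks and show that using augmentations to condition the generation process yields consistent improvements over the state of the art on the long-tailed benchmark and remarkable gains in the extreme few-shot regimes of the remaining four benchmarks. These results constitute an important step toward effectively leveraging synthetic data for downstream training.||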
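|**2025-02-07**|[Point2RBox-v2: Rethinking Point-supervised Oriented Object Detection with Spatial Layout Among Instances](http://arxiv.org/abs/2502.04268)|**[link](https://github.com/visionxlab/point2rbox-)**|With the rapidly increasing demand for oriented object detection (OOD), recent research on weakly supervised detectors that learn OOD from point annotations has gained great attention. In this paper, we rethink this challenging task setting with the layout among instances and present Point2RBox-v2. At its core are three principles: 1) Gaussian overlap loss. It learns an upper bound for each instance by treating objects as 2D Gaussian distributions and minimizing their overlap. 2) Voronoi watershed loss. It learns a lower bound for each instance through a watershed transform on a Voronoi tessellation. 3) Consistency loss. It learns the size/rotation variation between two output sets with respect to an input image and its augmented view. Supplemented by a few devised techniques, e.g. edge loss and copy-paste, the detector is further enhanced. To the best of our knowledge, Point2RBox-v2 is the first approach to explore the spatial layout among instances for learning point-supervised OOD. Our solution is concise and lightweight, yet it delivers competitive performance even in densely packed scenes: 62.61%/86.15%/34.71% on the DOTA/HRSC/FAIR1M datasets. Code is available at https://github.com/VisionXLab/point2rbox-v2.||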
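|**2025-02-06**|[Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion](http://arxiv.org/abs/2502.04263)|**[link](https://github.com/miccunifi/cross-the-gap)**|Pre-trained multi-modal vision-language models like CLIP are widely used in a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks such as image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss, which does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additionally trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance over intra-modal baselines on more than fifteen datasets. Additionally, we demonstrate that approaching a natively inter-modal task (e.g. zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective or narrowing the modality gap between the text and image feature embedding spaces helps reduce the intra-modal misalignment. The code is publicly available at: https://github.com/miccunifi/Cross-the-Gap.||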
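|**2025-02-06**|[An object detection approach for lane change and overtake detection from motion profiles](http://arxiv.org/abs/2502.04244)|null|In the application domain of fleet management and driver monitoring, it is very challenging to obtain relevant driving events and activities from dashcam footage while minimizing the amount of information stored and analyzed. In this paper, we identify overtake and lane-change maneuvers with a novel object detection approach applied to motion profiles, a compact representation of driving video footage as a single image. To train and test our model, we created an internal dataset of motion profile images obtained from a variety of dashcam videos, manually labeled with overtake and lane-change maneuvers by the ego vehicle. In addition to a standard object detection approach, we show how the inclusion of CoordConvolution layers further improves the model's performance in terms of mAP and F1 score, yielding state-of-the-art performance compared with other baselines from the literature. The extremely low computational requirements of the proposed solution make it especially suitable for running on-device.||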
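|**2025-02-06**|[Expanding Training Data for Endoscopic Phenotyping of Eosinophilic Esophagitis](http://arxiv.org/abs/2502.04199)|null|Eosinophilic esophagitis (EoE) is a chronic esophageal disorder marked by eosinophil-dominated inflammation. Diagnosing EoE usually involves endoscopic inspection of the esophageal mucosa and obtaining esophageal biopsies for histologic confirmation. Recent advances have seen AI-assisted endoscopic imaging, guided by the EREFS system, emerge as a potential alternative that reduces reliance on invasive histological assessment. Despite these advancements, significant challenges persist because of the limited availability of data for training AI models, a common issue even in the development of AI for more prevalent diseases. This study seeks to improve the performance of deep learning-based EoE phenotype classification by augmenting our training data with a diverse set of images from online platforms, public datasets, and electronic textbooks, increasing our dataset from 435 to 7,050 images. We used the Data-efficient Image Transformer for image classification and incorporated attention map visualizations to boost interpretability. The findings show that our expanded dataset and model enhancements improved diagnostic accuracy, robustness, and comprehensive analysis, enhancing patient outcomes.||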
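|**2025-02-06**|[High Spatiotemporal Resolution Structured Illumination Microscopy: Principle, Instrumentation, and Applications](http://arxiv.org/abs/2502.04072)|null|Among super-resolution microscopy techniques, structured illumination microscopy (SIM) shows great advantages of low phototoxicity, high speed, and excellent performance in long-term dynamic observation, making it especially suitable for live-cell imaging. This review delves into the principles, instrumentation, and applications of SIM, highlighting its capabilities in achieving high spatiotemporal resolution. SIM employs two main types of structured illumination mechanisms: (1) stripe-based SIM, in which the illumination stripes are formed through interference or projection, with resolution enhancement achieved through Fourier-domain extension; and (2) point-scanning-based SIM, in which illumination patterns are generated through the projection of a focal point or focal array, with resolution enhancement achieved through photon reassignment. We discuss the evolution of SIM from mechanical to high-speed photoelectric devices, such as spatial light modulators, digital micromirror devices, and galvanometers, which significantly enhance imaging speed, resolution, and modulation flexibility. The review also explores SIM's applications in biological research, particularly in live-cell imaging and studies of cellular interactions, providing insights into disease mechanisms and cellular functions. We conclude by outlining future directions for SIM in the life sciences. With advances in imaging techniques and reconstruction algorithms, SIM is poised to bring revolutionary impacts to frontier research fields, providing new avenues for exploring the intricacies of cellular biology.||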
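|**2025-02-06**|[Advanced Object Detection and Pose Estimation with Hybrid Task Cascade and High-Resolution Networks](http://arxiv.org/abs/2502.03877)|null|In the field of computer vision, 6D object detection and pose estimation are critical for applications such as robotics, augmented reality, and autonomous driving. Traditional methods often struggle to achieve high accuracy in both object detection and precise pose estimation simultaneously. This study proposes an improved 6D object detection and pose estimation pipeline based on the existing 6D-VNet framework, enhanced by integrating a Hybrid Task Cascade (HTC) and a High-Resolution Network (HRNet) backbone. By leveraging the multi-stage refinement process of HTC and the ability of HRNet to maintain high-resolution representations, our approach significantly improves detection accuracy and pose estimation precision. Furthermore, we introduce advanced post-processing techniques and a novel model ensembling strategy that together contribute to superior performance on public and private benchmarks. Our method demonstrates substantial improvements over state-of-the-art models, making it a valuable contribution to the domain of 6D object detection and pose estimation.||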
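|**2025-02-06**|[Pursuing Better Decision Boundaries for Long-Tailed Object Detection via Category Information Amount](http://arxiv.org/abs/2502.03852)|null|In object detection, the instance count is typically used to define whether a dataset exhibits a long-tail distribution, implicitly assuming that models will underperform on categories with fewer instances. This assumption has led to extensive research on category bias in datasets with imbalanced instance counts. However, models still exhibit category bias even in datasets where instance counts are relatively balanced, clearly indicating that instance count alone cannot explain this phenomenon. In this work, we first introduce the concept and measurement of category information amount. We observe a significant negative correlation between category information amount and accuracy, suggesting that category information amount more accurately reflects the learning difficulty of a category. Based on this observation, we propose the Information Amount-Guided Angular Margin (IGAM) Loss. The core idea of IGAM is to dynamically adjust the decision space of each category according to its information amount, thereby reducing category bias in long-tail datasets. IGAM Loss not only performs well on long-tailed benchmark datasets such as LVIS v1.0 and COCO-LT but also shows significant improvement for underrepresented categories in the non-long-tailed dataset Pascal VOC. Comprehensive experiments demonstrate the potential of category information amount as a tool and the generality of our proposed method.||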
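|**2025-02-06**|[Single-Domain Generalized Object Detection by Balancing Domain Diversity and Invariance](http://arxiv.org/abs/2502.03835)|null|Single-domain generalization for object detection (S-DGOD) aims to transfer knowledge from a single source domain to unseen target domains. In recent years, many models have focused primarily on achieving feature invariance to enhance robustness. However, because of the inherent diversity across domains, an excessive emphasis on invariance can cause the model to overlook the actual differences between images. This overemphasis may complicate the training process and lead to a loss of valuable information. To address this issue, we propose the Diversity Invariance Detection Model (DIDM), which focuses on the balance between domain-specific diversity and cross-domain invariance. Recognizing that domain diversity introduces variations in domain-specific features, we introduce a Diversity Learning Module (DLM). The DLM is designed to preserve the diversity of domain-specific information through the proposed feature diversity loss while limiting the category semantics in the features. In addition, to maintain domain invariance, we incorporate a Weighted Aligning Module (WAM), which aligns features without compromising feature diversity. We conducted experiments with our model on five distinct datasets, and the results demonstrate the superior performance and effectiveness of the proposed model.||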
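|**2025-02-06**|[UAV Cognitive Semantic Communications Enabled by Knowledge Graph for Robust Object Detection](http://arxiv.org/abs/2502.03761)|null|Unmanned aerial vehicles (UAVs) are widely used for object detection. However, existing UAV-based object detection systems face severe challenges: their limited computational, energy, and communication resources restrict the achievable detection performance. To overcome these challenges, we propose a UAV cognitive semantic communication system that exploits a knowledge graph. Moreover, we design a multi-scale codec for semantic compression that reduces the amount of transmitted data while ensuring detection performance. Considering the complexity and dynamics of UAV communication scenarios, we introduce a signal-to-noise ratio (SNR) adaptive module with robust channel adaptation capability. Furthermore, we propose an object detection scheme that exploits the knowledge graph to overcome channel noise interference and compression distortion. Simulation results on a practical aerial image dataset demonstrate that our proposed semantic communication system outperforms benchmark systems in detection accuracy, communication robustness, and computational efficiency, especially when dealing with low bandwidth compression ratios and low SNR regimes.||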
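|**2025-02-06**|[Conditional Diffusion Models are Medical Image Classifiers that Provide Explainability and Uncertainty for Free](http://arxiv.org/abs/2502.03687)|null|Discriminative classifiers have become a foundational tool in deep learning for medical imaging, excelling at learning separable features of complex data distributions. However, these models often need careful design, augmentation, and training techniques to ensure safe and reliable deployment. Recently, diffusion models have become synonymous with generative modeling in 2D. These models showcase robustness across a range of tasks, including natural image classification, where classification is performed by comparing reconstruction errors across images generated for each possible conditioning input. This work presents the first exploration of the potential of class-conditional diffusion models for 2D medical image classification. First, we develop a novel majority voting scheme to improve the performance of medical diffusion classifiers. Next, extensive experiments on the CheXpert and ISIC melanoma skin cancer datasets demonstrate that foundation and trained-from-scratch diffusion models achieve performance competitive with state-of-the-art discriminative classifiers without the need for explicit supervision. In addition, we show that diffusion classifiers are intrinsically explainable and can be used to quantify the uncertainty of their predictions, increasing their trustworthiness and reliability in safety-critical clinical contexts. Further information is available on our project page: https://faverogian.github.io/med-diffusion-classifier.github.io/||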
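|**2025-01-31**|[Redefining Machine Unlearning: A Conformal Prediction-Motivated Approach](http://arxiv.org/abs/2501.19403)|null|Machine unlearning seeks to systematically remove specified data from a trained model, effectively reaching a state as though the data had never been encountered during training. While metrics such as Unlearning Accuracy (UA) and Membership Inference Attack (MIA) provide a baseline for assessing unlearning performance, they fall short of evaluating the completeness and reliability of forgetting. This is because the ground-truth labels remain potential candidates within the scope of uncertainty quantification, leaving gaps in the evaluation of true forgetting. In this paper, we identify critical limitations in existing unlearning metrics and propose enhanced evaluation metrics inspired by conformal prediction. Our metrics can effectively capture the extent to which ground-truth labels are excluded from the prediction set. Furthermore, we observe that many existing machine unlearning methods do not achieve satisfactory forgetting performance when evaluated with our new metrics. To address this, we propose an unlearning framework that integrates conformal prediction insights into the Carlini & Wagner adversarial attack loss. Extensive experiments on image classification tasks demonstrate that our enhanced metrics offer deeper insights into unlearning effectiveness and that our unlearning framework significantly improves the forgetting quality of unlearning methods.||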
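|**2025-01-31**|[Let Human Sketches Help: Empowering Challenging Image Segmentation Task with Freehand Sketches](http://arxiv.org/abs/2501.19329)|null|Sketches, with their expressive potential, allow humans to convey the essence of an object through even a rough contour. For the first time, we harness this expressive potential to improve segmentation performance in challenging tasks such as camouflaged object detection (COD). Our approach introduces an innovative sketch-guided interactive segmentation framework that allows users to intuitively annotate objects with freehand sketches (drawing a rough contour of the object) instead of the traditional bounding boxes or points used in classic interactive segmentation models such as SAM. We demonstrate that sketch input can significantly improve the performance of existing iterative segmentation methods, outperforming text or bounding box annotations. Additionally, we introduce key modifications to network architectures and a novel sketch augmentation technique to fully harness the power of sketch input and further boost segmentation accuracy. Remarkably, our model's output can be used directly to train other neural networks, achieving results comparable to pixel-by-pixel annotations while reducing annotation time by up to 120 times, which shows great potential for democratizing the annotation process and reducing reliance on resource-intensive, laborious pixel-level annotations. We also present KOSCamo+, the first freehand sketch dataset for camouflaged object detection. The dataset, code, and labeling tool will be open sourced.||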
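|**2025-01-31**|[Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification](http://arxiv.org/abs/2501.19086)|null|X-ray imaging is pivotal in medical diagnostics, offering non-invasive insights into a range of health conditions. Recently, vision-language models such as the Contrastive Language-Image Pretraining (CLIP) model have shown potential for improving diagnostic accuracy by leveraging large-scale image-text datasets. However, because CLIP was not originally designed for medical images, several CLIP-like models trained specifically on medical images have been developed. Despite their enhanced performance, issues of fairness, particularly regarding demographic attributes, remain largely unaddressed. In this study, we perform a comprehensive fairness analysis of CLIP-like models applied to X-ray image classification. We assess their performance and fairness across diverse patient demographics and disease categories using zero-shot inference and various fine-tuning techniques, including Linear Probing, Multilayer Perceptron (MLP), Low-Rank Adaptation (LoRA), and full fine-tuning. Our results indicate that while fine-tuning improves model accuracy, fairness concerns persist, highlighting the need for further fairness interventions in these foundation models.||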
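|**2025-01-30**|[OT-Transformer: A Continuous-time Transformer Architecture with Optimal Transport Regularization](http://arxiv.org/abs/2501.18793)|null|Transformers have achieved state-of-the-art performance in numerous tasks. In this paper, we propose a continuous-time formulation of Transformers. Specifically, we consider a dynamical system whose governing equation is parametrized by Transformer blocks. We leverage optimal transport theory to regularize the training problem, which enhances stability in training and improves the generalization of the resulting model. Moreover, we demonstrate in theory that this regularization is necessary, as it promotes uniqueness and regularity of solutions. Our model is flexible in that almost any existing Transformer architecture can be adopted to construct the dynamical system with only slight modifications to the existing code. We perform extensive numerical experiments on tasks motivated by natural language processing, image classification, and point cloud classification. Our experimental results show that the proposed method improves the performance of its discrete counterpart and outperforms relevant competing models.||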
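|**2025-01-30**|[Tuning Event Camera Biases Heuristic for Object Detection Applications in Staring Scenarios](http://arxiv.org/abs/2501.18788)|null|One of the main challenges in unlocking the potential of neuromorphic cameras, also called event cameras, is the development of novel methods that solve the multi-parameter problem of adjusting their bias parameters to suit a desired task. Indeed, it is very difficult to find in the literature a systematic heuristic that solves the problem for any desired application. In this paper, we present a parameter-tuning heuristic for the biases of event cameras, for tasks that require small object detection in staring scenarios. The main purpose of the heuristic is to exploit as much of the camera's potential as possible, optimize its performance, and expand its detection capabilities. In the presentation, we translate the experimental properties of the event camera and the system constraints into mathematical terms, and show, under certain assumptions, how the multi-variable problem reduces to a two-parameter problem that can be solved experimentally. A main conclusion that will be demonstrated is that for certain desired signals, such as those provided by incandescent lamps powered by the periodic electrical grid, the optimal values of the camera are very far from the default values recommended by the manufacturer.||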
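|**2025-01-30**|[Distillation-Driven Diffusion Model for Multi-Scale MRI Super-Resolution: Make 1.5T MRI Great Again](http://arxiv.org/abs/2501.18736)|**[link](https://github.com/zwang78/sr)**|Magnetic resonance imaging (MRI) offers critical insights into microstructural details; however, the spatial resolution of standard 1.5T imaging systems is often limited. In contrast, 7T MRI provides significantly enhanced spatial resolution, enabling finer visualization of anatomical structures. Even so, the high cost and limited availability of 7T MRI hinder its widespread use in clinical settings. To address this challenge, this paper proposes a novel super-resolution (SR) model for generating 7T-like MRI from standard 1.5T MRI scans. Our approach leverages a diffusion-based architecture, incorporating gradient nonlinearity correction and bias field correction data from 7T imaging as guidance. Moreover, to improve deployability, we introduce a progressive distillation strategy. Specifically, the student model refines the 7T SR task in steps, using feature maps from the inference phase of the teacher model as guidance, so that the student model progressively reaches 7T SR performance with a smaller, deployable model size. Experimental results demonstrate that our baseline teacher model achieves state-of-the-art SR performance. The student model, while lightweight, sacrifices minimal performance. Furthermore, the student model can accept MRI inputs at varying resolutions without retraining, significantly enhancing deployment flexibility. The clinical relevance of the proposed method is validated using clinical data from Massachusetts General Hospital. Our code is available at https://github.com/ZWang78/SR.||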
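|**2025-01-30**|[Rethinking the Upsampling Layer in Hyperspectral Image Super Resolution](http://arxiv.org/abs/2501.18664)|null|Deep learning has achieved significant success in single hyperspectral image super-resolution (SHSR); however, the high spectral dimensionality leads to a heavy computational burden, making deployment in real-time scenarios difficult. To address this issue, this paper proposes a novel lightweight SHSR network, LKCA-Net, which incorporates channel attention to calibrate multi-scale channel features of hyperspectral images. Furthermore, we demonstrate, for the first time, that the low-rank property of the learnable upsampling layer is a key bottleneck in lightweight SHSR methods. To address this, we employ a low-rank approximation strategy to optimize the parameter redundancy of the learnable upsampling layer. Additionally, we introduce a knowledge distillation-based feature alignment technique to ensure that the low-rank approximated network retains the same feature representation capacity as the original. We conducted extensive experiments on the Chikusei, Houston 2018, and Pavia Center datasets and compared our method with several state-of-the-art approaches. The results demonstrate that our method is competitive in performance while achieving speedups of tens to hundreds of times over other well-performing SHSR methods.||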
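|**2025-01-30**|[HSRMamba: Contextual Spatial-Spectral State Space Model for Single Hyperspectral Super-Resolution](http://arxiv.org/abs/2501.18500)|null|Mamba has demonstrated exceptional performance in visual tasks thanks to its powerful global modeling capability and linear computational complexity, offering considerable potential for hyperspectral image super-resolution (HSISR). However, in HSISR, Mamba's conversion of images into 1D sequences neglects the spatial-spectral structural relationships between locally adjacent pixels, and its performance is highly sensitive to input order, which affects the restoration of both spatial and spectral details. This paper proposes HSRMamba, a contextual spatial-spectral modeling state space model for HSISR, to address these issues both locally and globally. Specifically, a local spatial-spectral partitioning mechanism is designed to establish patch-wise causal relationships among adjacent pixels in 3D features, mitigating the local forgetting issue. In addition, a global spectral reordering strategy based on spectral similarity is employed to enhance the causal representation of similar pixels across both spatial and spectral dimensions. Finally, experimental results demonstrate that HSRMamba outperforms state-of-the-art methods in both quantitative quality and visual results. Code will be available soon.||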
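|**2025-01-30**|[Waveform-Specific Performance of Deep Learning-Based Super-Resolution for Ultrasound Contrast Imaging](http://arxiv.org/abs/2501.18375)|**[link](https://github.com/miagrouput/super-resolution-waveforms)**|Resolving arterial flows is essential for understanding cardiovascular pathologies, improving diagnosis, and monitoring patient condition. Ultrasound contrast imaging uses microbubbles to enhance scattering from the blood pool, allowing real-time visualization of blood flow. Recent developments in vector flow imaging further expand the imaging capabilities of ultrasound by temporally resolving fast arterial flow. The next obstacle to overcome is the lack of spatial resolution. Super-resolved ultrasound images can be obtained by deconvolving radiofrequency (RF) signals before beamforming, breaking the link between resolution and pulse duration. Convolutional neural networks (CNNs) can be trained to locally estimate the deconvolution kernel and consequently super-localize the microbubbles directly within the RF signal. However, microbubble contrast is nonlinear, and the potential of CNNs in microbubble localization has not yet been fully exploited. Assessing deep learning-based deconvolution performance for non-trivial imaging pulses is therefore essential for successful translation to a practical setting, where the signal-to-noise ratio is limited and transmission schemes should comply with safety guidelines. In this study, we train CNNs to deconvolve RF signals and localize the microbubbles driven by harmonic pulses, chirps, or delay-encoded pulse trains. Furthermore, we discuss potential hurdles for in-vitro and in-vivo super-resolution by presenting preliminary experimental results. We find that, while CNNs can accurately localize microbubbles for all pulses, a short imaging pulse offers the best performance in noise-free conditions. However, chirps offer comparable performance without noise, are more robust to noise, and outperform all other pulses in low signal-to-noise ratio conditions.||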
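|**2025-01-30**|[IROAM: Improving Roadside Monocular 3D Object Detection Learning from Autonomous Vehicle Data Domain](http://arxiv.org/abs/2501.18162)|null|In autonomous driving, the perception capabilities of the ego vehicle can be improved with roadside sensors, which provide a more holistic view of the environment. However, because of the view-domain gap, existing monocular detection methods designed for vehicle cameras are not suitable for roadside cameras. To bridge this gap and improve roadside monocular 3D object detection, we propose IROAM, a semantic-geometry decoupled contrastive learning framework that takes vehicle-side and roadside data as input simultaneously. IROAM has two significant modules. The In-Domain Query Interaction module uses a Transformer to learn content and depth information for each domain and outputs object queries. To learn better feature representations from the two domains, the Cross-Domain Query Enhancement module decouples queries into semantic and geometric parts and uses only the former for contrastive learning. Experiments demonstrate the effectiveness of IROAM in improving the performance of roadside detectors and validate that IROAM is capable of learning cross-domain information.||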
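|**2025-01-29**|[Efficient Feature Fusion for UAV Object Detection](http://arxiv.org/abs/2501.17983)|**[link](https://github.com/gamepai0811/fmsa)**|Object detection in unmanned aerial vehicle (UAV) remote sensing images poses significant challenges due to unstable image quality, small object sizes, complex backgrounds, and environmental occlusions. Small objects in particular occupy a very small portion of an image, making their accurate detection highly difficult. Existing multi-scale feature fusion methods address these challenges to some extent by aggregating features across different resolutions. However, these methods often fail to balance classification and localization performance for small objects effectively, primarily because of insufficient feature representation and imbalanced network information flow. In this paper, we propose a feature fusion framework designed specifically for UAV object detection tasks to enhance both localization accuracy and classification performance. The framework integrates hybrid upsampling and downsampling modules, enabling feature maps from different network depths to be flexibly adjusted to arbitrary resolutions. This design facilitates cross-layer connections and multi-scale feature fusion, ensuring improved representation of small objects. Our approach leverages hybrid downsampling to enhance fine-grained feature representation, improving the spatial localization of small targets even under complex conditions. Meanwhile, the upsampling module aggregates global contextual information, optimizing feature consistency across scales and enhancing classification robustness in cluttered scenes. Experimental results on two public UAV datasets demonstrate the effectiveness of the framework. Integrated into the YOLO-V10 model, our method achieves a 2% improvement in average precision (AP) over the baseline YOLO-V10 model while maintaining the same number of parameters. These results highlight the potential of our framework for accurate and efficient UAV object detection.||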
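|**2025-01-29**|[Detection of Oscillation-like Patterns in Eclipsing Binary Light Curves using Neural Network-based Object Detection Algorithms](http://arxiv.org/abs/2501.17538)|**[link](https://github.com/burakulas/detocs)**|The primary aim of this research is to evaluate several convolutional neural network-based object detection algorithms for identifying oscillation-like patterns in the light curves of eclipsing binaries. This involves creating a robust detection framework that can effectively process both synthetic light curves and real observational data. The study employs several state-of-the-art object detection algorithms, including Single Shot MultiBox Detector (SSD), Faster Region-based Convolutional Neural Network (Faster R-CNN), You Only Look Once (YOLO), and EfficientDet, as well as a custom non-pretrained model implemented from scratch. Synthetic light curve images and images derived from observational TESS light curves of known eclipsing binaries with a pulsating component were constructed, with corresponding annotation files, using custom scripts. The models were trained and validated on the established datasets and then tested on unseen Kepler data to assess their generalization performance. Statistical metrics were also calculated to evaluate the quality of each model. The results indicate that the pre-trained models exhibit high accuracy and reliability in detecting the targeted patterns. Faster R-CNN and YOLO in particular showed superior object detection evaluation metrics on the validation dataset, such as mAP values exceeding 99%. SSD, on the other hand, is the fastest, although its performance is slightly lower, with an mAP of 97%. These findings highlight the potential of these models for the automated determination of pulsating components in eclipsing binary systems, facilitating more efficient and comprehensive astrophysical investigations.||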
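|**2025-01-30**|[Assessing the Capability of YOLO- and Transformer-based Object Detectors for Real-time Weed Detection](http://arxiv.org/abs/2501.17387)|null|Precision spraying represents an efficient and sustainable method for reducing the amount of pesticides, particularly herbicides, used in agricultural fields. To achieve this, it is of utmost importance to reliably differentiate between crops and weeds, and even between individual weed species, in the field and under real-time conditions. To assess their suitability for real-time applications, different state-of-the-art object detection models were compared. All available models of YOLOv8, YOLOv9, YOLOv10, and RT-DETR were trained and evaluated with images from a real field situation. The images were separated into two distinct datasets: in the initial dataset, each plant species was trained individually; in the subsequent dataset, a distinction was made between monocotyledonous weeds, dicotyledonous weeds, and three chosen crops. The results demonstrate that, while all models perform equally well on the evaluation metrics, the YOLOv9 models, particularly YOLOv9s and YOLOv9e, stand out on dataset 2 in terms of recall (66.58% and 72.36%), mAP50 (73.52% and 79.86%), and mAP50-95 (43.82% and 47.00%). However, the RT-DETR models, especially RT-DETR-l, excel in precision, reaching 82.44% on dataset 1 and 81.46% on dataset 2, which makes them particularly suitable for scenarios where minimizing false positives is critical. Notably, the smallest variants of the YOLO models (YOLOv8n, YOLOv9t, and YOLOv10n) achieve inference times as low as 7.58 ms per frame on an NVIDIA GeForce RTX 4090 GPU while maintaining competitive accuracy, which highlights their potential for deployment in resource-constrained embedded computing devices, as typically used in production setups.||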
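|**2025-01-28**|[MR imaging in the low-field: Leveraging the power of machine learning](http://arxiv.org/abs/2501.17211)|null|Recent innovations in magnetic resonance imaging (MRI) hardware and software have reignited interest in low-field (<1 T) and ultra-low-field (<0.1 T) MRI. These technologies offer advantages such as lower power consumption, reduced specific absorption rate, reduced field inhomogeneities, and cost-effectiveness, presenting a promising alternative for resource-limited and point-of-care settings. However, low-field MRI faces inherent challenges such as a reduced signal-to-noise ratio, and therefore potentially lower spatial resolution or longer scan times. This chapter examines the challenges and opportunities of low-field and ultra-low-field MRI, with a focus on the role of machine learning (ML) in overcoming these limitations. We provide an overview of deep neural networks and their application in enhancing low-field and ultra-low-field MRI performance. Specific ML-based solutions, including advanced image reconstruction, denoising, and super-resolution algorithms, are discussed. The chapter concludes by exploring how integrating ML with low-field MRI could expand its clinical applications and improve accessibility, potentially revolutionizing its use in diverse healthcare settings.||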
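|**2025-01-28**|[Depth Separable architecture for Sentinel-5P Super-Resolution](http://arxiv.org/abs/2501.17210)|null|The Sentinel-5P (S5P) satellite provides atmospheric measurements for air quality and climate monitoring. While the S5P satellite offers rich spectral resolution, it has physical limitations that restrict its spatial resolution. Super-resolution (SR) techniques can overcome these limitations and enhance the spatial resolution of S5P data. In this work, we introduce a novel SR model designed specifically for S5P data, which has eight spectral bands with around 500 channels each. Our proposed S5-DSCR model relies on a Depth Separable Convolution (DSC) architecture to perform spatial SR effectively by exploiting cross-channel correlations. Quantitative evaluation demonstrates that our model outperforms existing methods for the majority of the spectral bands. This work highlights the potential of leveraging a DSC architecture to address the challenges of hyperspectral SR. Our model captures the fine details necessary for precise analysis and paves the way for advances in air quality monitoring and remote sensing applications.||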
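|**2025-01-28**|[DINOSTAR: Deep Iterative Neural Object Detector Self-Supervised Training for Roadside LiDAR Applications](http://arxiv.org/abs/2501.17076)|null|Recent advances in deep learning-based methods for object detection in point-cloud data have enabled numerous roadside applications, fostering improvements in transportation safety and management. However, the intricate nature of point-cloud data poses significant challenges for human-supervised labeling, resulting in substantial expenditures of time and capital. This paper proposes an end-to-end, scalable, and self-supervised framework for training deep object detectors tailored to roadside point-cloud data. The framework uses self-supervised, statistically modeled teachers to train off-the-shelf deep object detectors, thus circumventing the need for human supervision. The teacher models follow fine-tuned standard practices of background filtering, object clustering, bounding-box fitting, and classification to generate noisy labels. It is shown that training the student model on the combined noisy annotations from multiple teachers enhances its capacity to discern background from foreground and leads it to learn diverse point-cloud representations for the object categories of interest. Evaluations involving publicly available roadside datasets and state-of-the-art deep object detectors demonstrate that the proposed framework achieves performance comparable to deep object detectors trained on human-annotated labels, despite not using such human annotations in its training process.||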
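|**2025-01-28**|[Approach Towards Semi-Automated Certification for Low Criticality ML-Enabled Airborne Applications](http://arxiv.org/abs/2501.17028)|null|As machine learning (ML) makes its way into aviation, ML-enabled systems, including low-criticality systems, require a reliable certification process to ensure safety and performance. Traditional standards for aviation-critical software, such as DO-178C, do not fully cover the unique aspects of ML. This paper proposes a semi-automated certification approach for low-criticality ML systems that focuses on data and model validation, resilience assessment, and usability assurance while integrating manual and automated processes. Key aspects include structured classification to guide certification rigor according to system attributes, an Assurance Profile that consolidates evaluation outcomes into a confidence measure for the ML component, and methodologies for integrating human oversight into certification activities. Through a case study of a YOLOv8-based object detection system designed to classify military and civilian vehicles in real time for reconnaissance and surveillance aircraft, we show how this approach supports the certification of ML systems in low-criticality airborne applications.||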
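|**2025-01-28**|[SSF-PAN: Semantic Scene Flow-Based Perception for Autonomous Navigation in Traffic Scenarios](http://arxiv.org/abs/2501.16754)|null|Vehicle detection and localization in complex traffic scenarios pose significant challenges due to the interference of moving objects. Traditional methods often rely on outlier exclusion or semantic segmentation, both of which suffer from low computational efficiency and accuracy. The proposed SSF-PAN achieves LiDAR point cloud-based object detection/localization and SLAM (Simultaneous Localization and Mapping) with high computational efficiency and accuracy, enabling a map-free navigation framework. The novelty of this work is threefold: 1) developing a neural network that segments static and dynamic objects with different motion features within the scene flow, that is, semantic scene flow (SSF); 2) developing an iterative framework that further optimizes the quality of the input scene flow and the output segmentation results; 3) developing a scene flow-based navigation platform that can test the performance of the SSF perception system in a simulation environment. The proposed SSF-PAN method is validated on the SUScape-CARLA and KITTI datasets, as well as on the CARLA simulator. Experimental results demonstrate that the proposed approach outperforms traditional methods in scene flow computation accuracy, moving object detection accuracy, computational efficiency, and autonomous navigation effectiveness.||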
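|**2025-01-28**|[Overcoming Semantic Dilution in Transformer-Based Next Frame Prediction](http://arxiv.org/abs/2501.16753)|null|Next-frame prediction in videos is crucial for applications such as autonomous driving, object tracking, and motion prediction. The primary challenge in next-frame prediction lies in effectively capturing and processing both spatial and temporal information from previous video sequences. The Transformer architecture, known for its ability to handle sequence data, has made remarkable progress in this domain. However, Transformer-based next-frame prediction models face notable issues: (a) the multi-head self-attention (MHSA) mechanism requires the input embedding to be split into $N$ chunks, where $N$ is the number of heads. Each segment captures only a fraction of the original embedding's information, which distorts the representation of the embedding in the latent space and results in a semantic dilution problem; (b) these models predict the embeddings of the next frames rather than the frames themselves, but the loss function is based on the errors of the reconstructed frames, not the predicted embeddings, which creates a discrepancy between the training objective and the model output. We propose a Semantic Concentration Multi-Head Self-Attention (SCMHSA) architecture, which effectively mitigates semantic dilution in Transformer-based next-frame prediction. Additionally, we introduce a loss function that optimizes SCMHSA in the latent space, aligning the training objective more closely with the model output. Our method demonstrates superior performance compared with the original Transformer-based predictors.||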
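|**2025-01-28**|[Toward Relative Positional Encoding in Spiking Transformers](http://arxiv.org/abs/2501.16745)|null|Spiking neural networks (SNNs) are bio-inspired networks that model how neurons in the brain communicate through discrete spikes, and they have great potential in various tasks because of their energy efficiency and temporal processing capabilities. SNNs with self-attention mechanisms (Spiking Transformers) have recently shown great advances in various tasks such as sequential modeling and image classification. However, integrating positional information, which is essential for capturing sequential relationships in data, remains a challenge in Spiking Transformers. In this paper, we introduce an approximate method for relative positional encoding (RPE) in Spiking Transformers, leveraging Gray Code as the foundation for our approach. We provide comprehensive proof of the method's effectiveness in partially capturing relative positional information for sequential tasks. Additionally, we extend our RPE approach by adapting it to a two-dimensional form suitable for image patch processing. We evaluate the proposed RPE methods on several tasks, including time series forecasting, text classification, and patch-based image classification. Our experimental results demonstrate that the addition of RPE significantly enhances performance by effectively capturing relative positional information.||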
|**2025-01-27**|[Directing Mamba to Complex Textures: An Efficient Texture-Aware State Space Model for Image Restoration](http://arxiv.org/abs/2501.16583)|null|Image restoration aims to recover details and enhance contrast in degraded images. With the growing demand for high-quality imaging (e.g., 4K and 8K), achieving a balance between restoration quality and computational efficiency has become increasingly critical. Existing methods, primarily based on CNNs, Transformers, or their hybrid approaches, apply uniform deep representation extraction across the image. However, these methods often struggle to effectively model long-range dependencies and largely overlook the spatial characteristics of image degradation (regions with richer textures tend to suffer more severe damage), making it hard to achieve the best trade-off between restoration quality and efficiency. To address these issues, we propose a novel texture-aware image restoration method, TAMambaIR, which simultaneously perceives image textures and achieves a trade-off between performance and efficiency. Specifically, we introduce a novel Texture-Aware State Space Model, which enhances texture awareness and improves efficiency by modulating the transition matrix of the state-space equation and focusing on regions with complex textures. Additionally, we design a Multi-Directional Perception Block to improve multi-directional receptive fields while maintaining low computational overhead. Extensive experiments on benchmarks for image super-resolution, deraining, and low-light image enhancement demonstrate that TAMambaIR achieves state-of-the-art performance with significantly improved efficiency, establishing it as a robust and efficient framework for image restoration.||
|**2025-01-27**|[Object Detection for Medical Image Analysis: Insights from the RT-DETR Model](http://arxiv.org/abs/2501.16469)|null|Deep learning has emerged as a transformative approach for solving complex pattern recognition and object detection challenges. This paper focuses on the application of a novel detection framework based on the RT-DETR model for analyzing intricate image data, particularly in areas such as diabetic retinopathy detection. Diabetic retinopathy, a leading cause of vision loss globally, requires accurate and efficient image analysis to identify early-stage lesions. The proposed RT-DETR model, built on a Transformer-based architecture, excels at processing high-dimensional and complex visual data with enhanced robustness and accuracy. Comparative evaluations with models such as YOLOv5, YOLOv8, SSD, and DETR demonstrate that RT-DETR achieves superior performance across precision, recall, mAP50, and mAP50-95 metrics, particularly in detecting small-scale objects and densely packed targets. This study underscores the potential of Transformer-based models like RT-DETR for advancing object detection tasks, offering promising applications in medical imaging and beyond.||
|**2025-01-27**|[Solving Turbulent Rayleigh-Bénard Convection using Fourier Neural Operators](http://arxiv.org/abs/2501.16209)|null|We train Fourier Neural Operator (FNO) surrogate models for Rayleigh-Bénard Convection (RBC), a model for convection processes that occur in nature and industrial settings. We compare the prediction accuracy and model properties of FNO surrogates to two popular surrogates used in fluid dynamics: the Dynamic Mode Decomposition (DMD) and the Linearly-Recurrent Autoencoder Network (LRAN). We regard Direct Numerical Simulations (DNS) of the RBC equations as the ground truth on which the models are trained and evaluated in different settings. The FNO performs favorably when compared to the DMD and LRAN and its predictions are fast and highly accurate for this task. Additionally, we show its zero-shot super-resolution ability for the convection dynamics. The FNO model has a high potential to be used in downstream tasks such as flow control in RBC.||
|**2025-01-27**|[The Linear Attention Resurrection in Vision Transformer](http://arxiv.org/abs/2501.16182)|null|Vision Transformers (ViTs) have recently taken computer vision by storm. However, the softmax attention underlying ViTs comes with a quadratic complexity in time and memory, hindering the application of ViTs to high-resolution images. We revisit the attention design and propose a linear attention method to address the limitation, which doesn't sacrifice ViT's core advantage of capturing global representation like existing methods (e.g. local window attention of Swin). We further investigate the key difference between linear attention and softmax attention. Our empirical results suggest that linear attention lacks a fundamental property of concentrating the distribution of the attention matrix. Inspired by this observation, we introduce a local concentration module to enhance linear attention. By incorporating enhanced linear global attention and local window attention, we propose a new ViT architecture, dubbed L$^2$ViT. Notably, L$^2$ViT can effectively capture both global interactions and local representations while enjoying linear computational complexity. Extensive experiments demonstrate the strong performance of L$^2$ViT. On image classification, L$^2$ViT achieves 84.4% Top-1 accuracy on ImageNet-1K without any extra training data or label. By further pre-training on ImageNet-22k, it attains 87.0% when fine-tuned with resolution 384$^2$. For downstream tasks, L$^2$ViT delivers favorable performance as a backbone on object detection as well as semantic segmentation.||
|**2025-01-27**|[Spatial-Angular Representation Learning for High-Fidelity Continuous Super-Resolution in Diffusion MRI](http://arxiv.org/abs/2501.16014)|null|Diffusion magnetic resonance imaging (dMRI) often suffers from low spatial and angular resolution due to inherent limitations in imaging hardware and system noise, adversely affecting the accurate estimation of microstructural parameters with fine anatomical details. Deep learning-based super-resolution techniques have shown promise in enhancing dMRI resolution without increasing acquisition time. However, most existing methods are confined to either spatial or angular super-resolution, limiting their effectiveness in capturing detailed microstructural features. Furthermore, traditional pixel-wise loss functions struggle to recover intricate image details essential for high-resolution reconstruction. To address these challenges, we propose SARL-dMRI, a novel Spatial-Angular Representation Learning framework for high-fidelity, continuous super-resolution in dMRI. SARL-dMRI explores implicit neural representations and spherical harmonics to model continuous spatial and angular representations, simultaneously enhancing both spatial and angular resolution while improving microstructural parameter estimation accuracy. To further preserve image fidelity, a data-fidelity module and wavelet-based frequency loss are introduced, ensuring the super-resolved images remain consistent with the original input and retain fine details. Extensive experiments demonstrate that, compared to five other state-of-the-art methods, our method significantly enhances dMRI data resolution, improves the accuracy of microstructural parameter estimation, and provides better generalization capabilities. It maintains stable performance even under a 45$\times$ downsampling factor.||
|**2025-01-24**|[Geometric Mean Improves Loss For Few-Shot Learning](http://arxiv.org/abs/2501.14593)|null|Few-shot learning (FSL) is a challenging task in machine learning, demanding that a model render discriminative classification using only a few labeled samples. In the FSL literature, deep models are trained in a metric-learning manner to provide a metric in a feature space that generalizes well to samples of novel classes; in that space, even a small number of labeled training examples can construct an effective classifier. In this paper, we propose a novel FSL loss based on the *geometric mean* to embed a discriminative metric into deep features. In contrast to other losses, such as those using the arithmetic mean in a softmax-based formulation, the proposed method leverages the geometric mean to aggregate pair-wise relationships among samples, enhancing the discriminative metric across class categories. The proposed loss is not only formulated in a simple form but is also thoroughly analyzed theoretically to reveal characteristics that are favorable for learning a feature metric in FSL. In experiments on few-shot image classification tasks, the method produces competitive performance in comparison to other losses.||
|**2025-01-24**|[Visual Localization via Semantic Structures in Autonomous Photovoltaic Power Plant Inspection](http://arxiv.org/abs/2501.14587)|null|Inspection systems utilizing unmanned aerial vehicles (UAVs) equipped with thermal cameras are increasingly popular for the maintenance of photovoltaic (PV) power plants. However, automation of the inspection task is a challenging problem as it requires precise navigation to capture images from optimal distances and viewing angles. This paper presents a novel localization pipeline that directly integrates PV module detection with UAV navigation, allowing precise positioning during inspection. Detections are used to identify the power plant structures in the image and associate these with the power plant model. We define visually recognizable anchor points for the initial association and use object tracking to discern global associations. We present three distinct methods for visual segmentation of PV modules based on traditional computer vision, deep learning, and their fusion, and we evaluate their performance in relation to the proposed localization pipeline. The presented methods were verified and evaluated using custom aerial inspection data sets, demonstrating their robustness and applicability for real-time navigation. Additionally, we evaluate the influence of the power plant model's precision on the localization methods.||
|**2025-01-24**|[SpikePack: Enhanced Information Flow in Spiking Neural Networks with High Hardware Compatibility](http://arxiv.org/abs/2501.14484)|null|Spiking Neural Networks (SNNs) hold promise for energy-efficient, biologically inspired computing. We identify substantial information loss during spike transmission, linked to temporal dependencies in traditional Leaky Integrate-and-Fire (LIF) neurons, a key factor potentially limiting SNN performance. Existing SNN architectures also underutilize modern GPUs, constrained by single-bit spike storage and isolated weight-spike operations that restrict computational efficiency. We introduce SpikePack, a neuron model designed to reduce transmission loss while preserving essential features like membrane potential reset and leaky integration. SpikePack achieves constant $\mathcal{O}(1)$ time and space complexity, enabling efficient parallel processing on GPUs and also supporting serial inference on existing SNN hardware accelerators. Compatible with standard Artificial Neural Network (ANN) architectures, SpikePack facilitates near-lossless ANN-to-SNN conversion across various networks. Experimental results on tasks such as image classification, detection, and segmentation show SpikePack achieves significant gains in accuracy and efficiency for both directly trained and converted SNNs over state-of-the-art models. Tests on FPGA-based platforms further confirm cross-platform flexibility, delivering high performance and enhanced sparsity. By enhancing information flow and rethinking SNN-ANN integration, SpikePack advances efficient SNN deployment across diverse hardware platforms.||
|**2025-01-24**|[Quantum Neural Networks: A Comparative Analysis and Noise Robustness Evaluation](http://arxiv.org/abs/2501.14412)|null|In current noisy intermediate-scale quantum (NISQ) devices, hybrid quantum neural networks (HQNNs) offer a promising solution, combining the strengths of classical machine learning with quantum computing capabilities. However, the performance of these networks can be significantly affected by the quantum noise inherent in NISQ devices. In this paper, we conduct an extensive comparative analysis of various HQNN algorithms, namely Quantum Convolution Neural Network (QCNN), Quanvolutional Neural Network (QuanNN), and Quantum Transfer Learning (QTL), for image classification tasks. We evaluate the performance of each algorithm across quantum circuits with different entangling structures, variations in layer count, and optimal placement in the architecture. Subsequently, we select the highest-performing architectures and assess their robustness against noise influence by introducing quantum gate noise through Phase Flip, Bit Flip, Phase Damping, Amplitude Damping, and the Depolarizing Channel. Our results reveal that the top-performing models exhibit varying resilience to different noise gates. However, in most scenarios, the QuanNN demonstrates greater robustness across various quantum noise channels, consistently outperforming other models. This highlights the importance of tailoring model selection to specific noise environments in NISQ devices.||
|**2025-01-24**|[Correlation-Based Band Selection for Hyperspectral Image Classification](http://arxiv.org/abs/2501.14338)|**[link](https://github.com/dibyabha/hsi-cc)**|Hyperspectral images offer extensive spectral information about ground objects across multiple spectral bands. However, the large volume of data can pose challenges during processing. Typically, adjacent bands in hyperspectral data are highly correlated, leading to the use of only a few selected bands for various applications. In this work, we present a correlation-based band selection approach for hyperspectral image classification. Our approach calculates the average correlation between bands using correlation coefficients to identify the relationships among different bands. Afterward, we select a subset of bands by analyzing the average correlation and applying a threshold-based method. This allows us to isolate and retain bands that exhibit lower inter-band dependencies, ensuring that the selected bands provide diverse and non-redundant information. We evaluate our proposed approach on two standard benchmark datasets: Pavia University (PA) and Salinas Valley (SA), focusing on image classification tasks. The experimental results demonstrate that our method performs competitively with other standard band selection approaches.||
|**2025-01-23**|[CSAOT: Cooperative Multi-Agent System for Active Object Tracking](http://arxiv.org/abs/2501.13994)|null|Object Tracking is essential for many computer vision applications, such as autonomous navigation, surveillance, and robotics. Unlike Passive Object Tracking (POT), which relies on static camera viewpoints to detect and track objects across consecutive frames, Active Object Tracking (AOT) requires a controller agent to actively adjust its viewpoint to maintain visual contact with a moving target in complex environments. Existing AOT solutions are predominantly single-agent-based, which struggle in dynamic and complex scenarios due to limited information gathering and processing capabilities, often resulting in suboptimal decision-making. Alleviating these limitations necessitates the development of a multi-agent system where different agents perform distinct roles and collaborate to enhance learning and robustness in dynamic and complex environments. Although some multi-agent approaches exist for AOT, they typically rely on external auxiliary agents, which require additional devices, making them costly. In contrast, we introduce the Collaborative System for Active Object Tracking (CSAOT), a method that leverages multi-agent deep reinforcement learning (MADRL) and a Mixture of Experts (MoE) framework to enable multiple agents to operate on a single device, thereby improving tracking performance and reducing costs. Our approach enhances robustness against occlusions and rapid motion while optimizing camera movements to extend tracking duration. We validated the effectiveness of CSAOT on various interactive maps with dynamic and stationary obstacles.||
|**2025-01-23**|[Attribute-based Visual Reprogramming for Image Classification with CLIP](http://arxiv.org/abs/2501.13982)|**[link](https://github.com/tmlr-group/attrvr)**|Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs. When applied to vision-language models (e.g., CLIP), existing VR approaches follow the same pipeline used in vision models (e.g., ResNet, ViT), where ground-truth class labels are inserted into fixed text templates to guide the optimization of VR patterns. This label-based approach, however, overlooks the rich information and diverse attribute-guided textual representations that CLIP can exploit, which may lead to the misclassification of samples. In this paper, we propose Attribute-based Visual Reprogramming (AttrVR) for CLIP, utilizing descriptive attributes (DesAttrs) and distinctive attributes (DistAttrs), which respectively represent common and unique feature descriptions for different classes. Besides, as images of the same class may reflect different attributes after VR, AttrVR iteratively refines patterns using the $k$-nearest DesAttrs and DistAttrs for each image sample, enabling more dynamic and sample-specific optimization. Theoretically, AttrVR is shown to reduce intra-class variance and increase inter-class separation. Empirically, it achieves superior performance in 12 downstream tasks for both ViT-based and ResNet-based CLIP. The success of AttrVR facilitates more effective integration of VR from unimodal vision models into vision-language models. Our code is available at https://github.com/tmlr-group/AttrVR.||
|**2025-01-23**|[First Lessons Learned of an Artificial Intelligence Robotic System for Autonomous Coarse Waste Recycling Using Multispectral Imaging-Based Methods](http://arxiv.org/abs/2501.13855)|null|Current disposal facilities for coarse-grained waste perform manual sorting of materials with heavy machinery. Large quantities of recyclable materials are lost to coarse waste, so more effective sorting processes must be developed to recover them. Two key aspects of automating the sorting process are object detection with material classification in mixed piles of waste, and autonomous control of hydraulic machinery. Because most objects in those accumulations of waste are damaged or destroyed, object detection alone is not feasible in the majority of cases. To address these challenges, we propose classifying materials with multispectral images in the ultraviolet (UV), visual (VIS), near-infrared (NIR), and short-wave infrared (SWIR) spectra. A solution for autonomous control of hydraulic heavy machines for sorting bulky waste is being investigated using cost-effective cameras and artificial intelligence-based controllers.||
|**2025-01-23**|[You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain](http://arxiv.org/abs/2501.13725)|null|The in-situ detection of planetary, lunar, and small-body surface terrain is crucial for autonomous spacecraft applications, where learning-based computer vision methods are increasingly employed to enable intelligence without prior information or human intervention. However, many of these methods remain computationally expensive for spacecraft processors and prevent real-time operation. Training of such algorithms is additionally complex due to the scarcity of labeled data and reliance on supervised learning approaches. Unsupervised Domain Adaptation (UDA) offers a promising solution by facilitating model training with disparate data sources such as simulations or synthetic scenes, although UDA is difficult to apply to celestial environments where challenging feature spaces are paramount. To alleviate such issues, You Only Crash Once (YOCOv1) has studied the integration of Visual Similarity-based Alignment (VSA) into lightweight one-stage object detection architectures to improve space terrain UDA. Although proven effective, the approach faces notable limitations, including performance degradations in multi-class and high-altitude scenarios. Building upon the foundation of YOCOv1, we propose novel additions to the VSA scheme that enhance terrain detection capabilities under UDA, and our approach is evaluated across both simulated and real-world data. Our second YOCO rendition, YOCOv2, is capable of achieving state-of-the-art UDA performance on surface terrain detection, where we showcase improvements upwards of 31% compared with YOCOv1 and terrestrial state-of-the-art. We demonstrate the practical utility of YOCOv2 with spacecraft flight hardware performance benchmarking and qualitative evaluation of NASA mission data.||
|**2025-01-23**|[YOLO11-JDE: Fast and Accurate Multi-Object Tracking with Self-Supervised Re-ID](http://arxiv.org/abs/2501.13710)|**[link](https://github.com/inakierregueab/yolo11-jde)**|We introduce YOLO11-JDE, a fast and accurate multi-object tracking (MOT) solution that combines real-time object detection with self-supervised Re-Identification (Re-ID). By incorporating a dedicated Re-ID branch into YOLO11s, our model performs Joint Detection and Embedding (JDE), generating appearance features for each detection. The Re-ID branch is trained in a fully self-supervised setting while simultaneously training for detection, eliminating the need for costly identity-labeled datasets. The triplet loss, with hard positive and semi-hard negative mining strategies, is used for learning discriminative embeddings. Data association is enhanced with a custom tracking implementation that successfully integrates motion, appearance, and location cues. YOLO11-JDE achieves competitive results on MOT17 and MOT20 benchmarks, surpassing existing JDE methods in terms of FPS and using up to ten times fewer parameters. This makes our method a highly attractive solution for real-world applications.||
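|**2025-01-23**|[AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning](http://arxiv.org/abs/2501.13389)|null|Robust training with noisy labels is a critical challenge in image classification, offering the potential to reduce reliance on costly clean-label datasets. Real-world datasets often contain a mix of in-distribution (ID) and out-of-distribution (OOD) instance-dependent label noise, a challenge that existing methods rarely address simultaneously and that is further compounded by the lack of comprehensive benchmark datasets. Furthermore, even though current noisy-label learning approaches attempt to find noisy-label samples during training, they do not aim to estimate the ID and OOD noise rates to improve their effectiveness at selecting such samples, and they are often represented by inefficient multi-stage learning algorithms. We propose the Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise (AEON) approach to address these research gaps. AEON is an efficient one-stage noisy-label learning method that dynamically estimates instance-dependent ID and OOD label noise rates to enhance robustness to complex noise settings. Additionally, we introduce a new benchmark reflecting real-world ID and OOD noise scenarios. Experiments demonstrate that AEON achieves state-of-the-art performance on both synthetic and real-world datasets.||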
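|**2025-01-23**|[Contrast: A Hybrid Architecture of Transformers and State Space Models for Low-Level Vision](http://arxiv.org/abs/2501.13353)|null|Transformers have become increasingly popular for image super-resolution (SR) tasks due to their strong global context modeling capabilities. However, their quadratic computational complexity necessitates window-based attention mechanisms, which restrict the receptive field and limit effective context expansion. Recently, the Mamba architecture has emerged as a promising alternative with linear computational complexity, allowing it to avoid window mechanisms and maintain a large receptive field. Nevertheless, Mamba faces challenges in handling long-context dependencies when high pixel-level precision is required, as in SR tasks. This is due to its hidden state mechanism, which can compress and store a substantial amount of context but only in an approximate manner, leading to inaccuracies that Transformers do not suffer from. In this paper, we propose Contrast, a hybrid SR model that combines Convolutional (Con), Transformer (Tra), and State Space (St) components, effectively blending the strengths of Transformers and Mamba to address their individual limitations. By integrating Transformer and state space mechanisms, Contrast compensates for the shortcomings of each approach, enhancing both global context modeling and pixel-level accuracy. We demonstrate that combining these two architectures mitigates the problems inherent in each, resulting in improved performance on image super-resolution tasks.||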
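|**2025-01-23**|[Multi-aspect Knowledge Distillation with Large Language Model](http://arxiv.org/abs/2501.13341)|**[link](https://github.com/taegyeong-lee/makd)**|Recent advances in deep learning have significantly improved performance on computer vision tasks. Previous image classification methods primarily modify model architectures or add features, and they optimize models using cross-entropy loss on class logits. Because they focus on classifying images using class labels, these methods may struggle to learn the various *aspects* of classes (e.g., natural positions and shape changes). Rethinking the previous approach from a new perspective, we propose a multi-aspect knowledge distillation method using Multimodal Large Language Models (MLLMs). Our approach involves: 1) querying a large language model with multi-aspect questions relevant to the knowledge we want to transfer to the model, 2) extracting the corresponding logits from the MLLM, and 3) expanding the model's output dimensions to distill these multi-aspect logits. We then apply cross-entropy loss to the class logits and binary cross-entropy loss to the multi-aspect logits. Through our method, the model can learn not only knowledge about visual aspects but also abstract and complex aspects that require a deeper understanding. We primarily apply our method to image classification and, to explore the potential for extending our model, expand it to other tasks such as object detection. In all experimental results, our method improves the performance of the baselines. Additionally, we analyze the effect of multi-aspect knowledge distillation. These results demonstrate that our method can transfer knowledge about various aspects to the model and that this aspect knowledge can enhance model performance in computer vision tasks. This paper demonstrates the great potential of multi-aspect knowledge distillation, and we believe it offers a promising direction for future research in computer vision and beyond.||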
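|**2025-01-22**|[Revisiting Data Augmentation for Ultrasound Images](http://arxiv.org/abs/2501.13193)|**[link](https://github.com/adamtupper/ultrasound-augmentation)**|Data augmentation is a widely used and effective technique for improving the generalization performance of deep neural networks. Yet, despite the limited data availability often faced when working with medical images, it is frequently underutilized. This appears to stem from a gap in our collective understanding of the efficacy of different augmentation techniques across tasks and modalities. Ultrasound imaging is one such modality. This work addresses the gap by analyzing the effectiveness of different augmentation techniques at improving model performance across a wide range of ultrasound image analysis tasks. To this end, we introduce a new standardized benchmark of 14 ultrasound image classification and semantic segmentation tasks from 10 different sources, covering 11 body regions. Our results demonstrate that many of the augmentations commonly used for natural image tasks are also effective on ultrasound images, in some cases even more so than augmentations developed specifically for ultrasound images. We also show that diverse augmentation with TrivialAugment, which is widely used for natural images, is effective for ultrasound images as well. Moreover, our proposed methodology represents a structured approach for assessing various data augmentations that can be applied to other contexts and modalities.||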
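|**2025-01-22**|[Adapting OpenAI's CLIP Model for Few-Shot Image Inspection in Manufacturing Quality Control: An Expository Case Study with Multiple Application Examples](http://arxiv.org/abs/2501.12596)|null|This expository paper introduces a simplified approach to image-based quality inspection in manufacturing using OpenAI's CLIP (Contrastive Language-Image Pretraining) model adapted for few-shot learning. While CLIP has demonstrated impressive capabilities in general computer vision tasks, its direct application to manufacturing inspection presents challenges due to the domain gap between its training data and industrial applications. We evaluate CLIP's effectiveness through five case studies: metallic pan surface inspection, 3D printing extrusion profile analysis, stochastic textured surface evaluation, automotive assembly inspection, and microstructure image classification. Our results show that CLIP can achieve high classification accuracy with relatively small learning sets (50-100 examples per class) for single-component and texture-based applications. However, its performance degrades in complex multi-component scenes. We provide a practical implementation framework that enables quality engineers to quickly assess CLIP's suitability for their specific applications before pursuing more complex solutions. This work establishes CLIP-based few-shot learning as an effective baseline approach that balances ease of implementation with robust performance, as demonstrated in several manufacturing quality control applications.||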
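|**2025-01-21**|[Analyzer-less X-ray Interferometry with Super-Resolution Methods](http://arxiv.org/abs/2501.12527)|null|We propose the use of super-resolution methods for X-ray grating interferometry without an analyzer, for cases where the detector fails to meet the Nyquist sampling rate required by traditional image recovery algorithms. This approach enables Talbot-Lau interferometry without the X-ray absorbing analyzer and allows analyzer-less modulated phase grating interferometers to reach higher autocorrelation lengths. This permits a lower X-ray dose and higher autocorrelation lengths than previously accessible. We demonstrate the use of super-resolution methods to iteratively reconstruct attenuation, differential-phase, and dark-field images using simulations of a one-dimensional lung tumor phantom. For a fringe period of pD = 22 µm, we compare the simulated imaging performance of interferometers with 30 µm and 50 µm detectors at various signal-to-noise ratios. We show that our super-resolution iterative reconstruction methods are highly robust and can be used to improve grating interferometry in cases where traditional algorithms cannot be used.||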
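|**2025-01-21**|[Large-image Object Detection for Fine-grained Recognition of Punches Patterns in Medieval Panel Painting](http://arxiv.org/abs/2501.12489)|**[link](https://github.com/marcozullich/punches-object-detection)**|Attributing an artwork to its author is typically a laborious manual process, usually relying on the subjective evaluations of experts. However, in some situations quantitative features of the artwork can support these evaluations. The extraction of these features can sometimes be automated, for instance with machine learning (ML) techniques. One example of such features is repeated, mechanically impressed patterns, called punches, found chiefly in 13th- and 14th-century panel paintings from Tuscany. Previous research in art history has shown a strong connection between the shapes of punches and specific artists or workshops, suggesting that these quantitative cues can be used to support attribution. In the present work, we first collect a large-scale dataset of images of these panel paintings. Then, using YOLOv10, a recent and popular object detection model, we train an ML pipeline to detect the punches contained in the images. Because the images are large, detection is split across multiple frames using a sliding-window approach with overlaps, after which the predictions are combined for the whole image using a custom non-maximum suppression routine. Our results indicate how art historians working in the field can reliably use our method to identify and extract punches.||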
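|**2025-01-21**|[InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling](http://arxiv.org/abs/2501.12386)|**[link](https://github.com/opengvlab/internvideo)**|This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. To this end, we develop a new version, InternVideo2.5, with a focus on enhancing the original MLLM's ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate that this unique LRC design greatly improves the results of video MLLMs on mainstream video understanding benchmarks (both short and long video), enables the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and lets it master specialized vision capabilities such as object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in strengthening the MLLM's innate abilities (focus and memory), providing new insights for future research on video MLLMs. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5.||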
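|**2025-01-21**|[Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems](http://arxiv.org/abs/2501.12269)|**[link](https://github.com/ast-fortiss-tum/perturbation-drive)**|Advanced Driver Assistance Systems (ADAS) based on deep neural networks (DNNs) are widely used in autonomous vehicles for critical perception tasks such as object detection, semantic segmentation, and lane recognition. However, these systems are highly sensitive to input variations, such as noise and changes in lighting, which can compromise their effectiveness and potentially lead to safety-critical failures. This study offers a comprehensive empirical evaluation of image perturbations, a technique commonly used to assess the robustness of DNNs, to validate and improve the robustness and generalization of ADAS perception systems. We first conducted a systematic review of the literature, identifying 38 categories of perturbations. Next, we evaluated their effectiveness in revealing failures in two different ADAS, at both the component level and the system level. Finally, we explored perturbation-based data augmentation and continual learning strategies to improve ADAS adaptation to new operational design domains. Our results demonstrate that all categories of image perturbations successfully expose robustness issues in ADAS and that dataset augmentation and continual learning significantly improve ADAS performance in novel, unseen environments.||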
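|**2025-01-21**|[DLEN: Dual Branch of Transformer for Low-Light Image Enhancement in Dual Domains](http://arxiv.org/abs/2501.12235)|null|Low-light image enhancement (LLE) aims to improve the visual quality of images captured in poorly lit conditions, which often suffer from low brightness, low contrast, noise, and color distortion. These issues hinder the performance of computer vision tasks such as object detection, facial recognition, and autonomous driving. Traditional enhancement techniques, such as multi-scale fusion and histogram equalization, fail to preserve fine details and often struggle to maintain the natural appearance of enhanced images under complex lighting conditions. Although Retinex theory provides a foundation for image decomposition, it often amplifies noise, leading to suboptimal image quality. In this paper, we propose the Dual Light Enhance Network (DLEN), a novel architecture that incorporates two distinct attention mechanisms, considering both the spatial and frequency domains. Our model introduces a learnable wavelet transform module in the illumination estimation phase, preserving high- and low-frequency components to enhance edge and texture details. Additionally, we design a dual-branch structure that leverages the power of the Transformer architecture to enhance both the illumination and structural components of the image. In extensive experiments, our model outperforms state-of-the-art methods on standard benchmarks. Code is available at https://github.com/LaLaLoXX/DLEN.||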
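|**2025-01-21**|[SMamba: Sparse Mamba for Event-based Object Detection](http://arxiv.org/abs/2501.11971)|**[link](https://github.com/Zizzzzzzz/SMamba_AAAI2025)**|Transformer-based methods have achieved remarkable performance in event-based object detection, owing to their global modeling ability. However, they neglect the influence of non-event and noisy regions and process them uniformly, leading to high computational overhead. To reduce computation cost, some researchers have proposed sparsification strategies based on window attention to discard unimportant regions, but this sacrifices global modeling ability and results in suboptimal performance. To achieve a better trade-off between accuracy and efficiency, we propose Sparse Mamba (SMamba), which performs adaptive sparsification to reduce computational effort while maintaining global modeling capability. Specifically, we propose a Spatio-Temporal Continuity Assessment module that measures the information content of tokens and discards uninformative ones by leveraging the differences in spatiotemporal distribution between activity and noise events. Based on the assessment results, we design an Information-Prioritized Local Scan strategy that shortens the scan distance between high-information tokens, facilitating interactions among them in the spatial dimension. Furthermore, to extend the global interaction from 2D space to 3D representations, we propose a Global Channel Interaction module that aggregates channel information from a global spatial perspective. Results on three datasets (Gen1, 1Mpx, and eTram) demonstrate that our model outperforms other methods in both performance and efficiency.||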
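|**2025-01-20**|[PD-SORT: Occlusion-Robust Multi-Object Tracking Using Pseudo-Depth Cues](http://arxiv.org/abs/2501.11288)|**[link](https://github.com/wangyc2000/pd_sort)**|Multi-object tracking (MOT) is a rising topic in video processing technologies and has important application value in consumer electronics. Currently, tracking-by-detection (TBD) is the dominant paradigm for MOT, performing target detection and association frame by frame. However, the association performance of TBD methods degrades in complex scenes with heavy occlusions, which hinders the application of such methods in real-world scenarios. To this end, we incorporate pseudo-depth cues to enhance association performance and propose Pseudo-Depth SORT (PD-SORT). First, we extend the Kalman filter state vector with pseudo-depth states. Second, we introduce a novel depth volume IoU (DVIoU) by combining the conventional 2D IoU with pseudo-depth. Third, we develop a quantized pseudo-depth measurement (QPDM) strategy for more robust data association. In addition, we integrate camera motion compensation (CMC) to handle dynamic camera situations. With these designs, PD-SORT significantly alleviates occlusion-induced ambiguous associations and achieves leading performance on DanceTrack, MOT17, and MOT20. Notably, the improvement is especially pronounced on DanceTrack, where objects show complex motions, similar appearances, and frequent occlusions. The code is available at https://github.com/Wangyc2000/PD_SORT.||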
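|**2025-01-20**|[Enhancing SAR Object Detection with Self-Supervised Pre-training on Masked Auto-Encoders](http://arxiv.org/abs/2501.11249)|null|Supervised fine-tuning (SFT) methods are highly efficient for artificial intelligence interpretation of SAR images, leveraging the powerful representation knowledge of pre-trained models. Because domain-specific pre-trained backbones for SAR images are lacking, the traditional strategy is to load foundation models pre-trained on natural scenes, such as ImageNet, whose image characteristics differ greatly from those of SAR images. This may hinder model performance on downstream tasks when applying SFT to small-scale annotated SAR data. In this paper, a self-supervised learning (SSL) method of masked image modeling based on Masked Auto-Encoders (MAE) is proposed to learn feature representations of SAR images during pre-training and to benefit the SFT object detection task on SAR images. Evaluation experiments on the large-scale SAR object detection benchmark SARDet-100k verify that the proposed method captures proper latent representations of SAR images and improves model generalization in downstream tasks by shifting the pre-training domain from natural scenes to SAR images through SSL. The proposed method achieves an improvement of 1.3 mAP on the SARDet-100k benchmark compared to using SFT strategies alone.||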
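|**2025-01-20**|[KPL: Training-Free Medical Knowledge Mining of Vision-Language Models](http://arxiv.org/abs/2501.11231)|**[link](https://github.com/jxliu-ai/kpl)**|Vision-language models such as CLIP excel at image recognition thanks to extensive image-text pre-training. However, applying CLIP inference to zero-shot classification, particularly for medical image diagnosis, faces the following challenges: 1) a single category name is inadequate to represent an image category; and 2) there is a modality gap between the visual and text spaces generated by the CLIP encoders. Despite attempts to enrich disease descriptions with large language models, the lack of class-specific knowledge often leads to poor performance. In addition, empirical evidence suggests that existing proxy learning methods for zero-shot image classification on natural image datasets are unstable when applied to medical datasets. To tackle these challenges, we introduce Knowledge Proxy Learning (KPL) to mine knowledge from CLIP. KPL is designed to leverage CLIP's multimodal understanding for medical image classification through Text Proxy Optimization and Multimodal Proxy Learning. Specifically, KPL retrieves image-relevant knowledge descriptions from a constructed knowledge-enhanced base to enrich semantic text proxies. It then uses the CLIP-encoded input images and these descriptions to stably generate multimodal proxies that boost zero-shot classification performance. Extensive experiments on both medical and natural image datasets demonstrate that KPL enables effective zero-shot image classification, outperforming all baselines. These findings highlight the great potential of this paradigm of mining knowledge from CLIP for medical image classification and broader areas.||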
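|**2025-01-19**|[LiFT: Lightweight, FPGA-tailored 3D object detection based on LiDAR data](http://arxiv.org/abs/2501.11159)|**[link](https://github.com/vision-agh/lift)**|This paper presents LiFT, a lightweight, fully quantized 3D object detection algorithm for LiDAR data, optimized for real-time inference on FPGA platforms. Through an in-depth analysis of FPGA-specific limitations, we identify a set of FPGA-induced constraints that shape the algorithm's design. These include a computational complexity limit of 30 GMACs (billion multiply-accumulate operations), INT8 quantization of weights and activations, 2D cell-based processing instead of 3D voxels, and minimal use of skip connections. To meet these constraints while maximizing performance, LiFT combines novel mechanisms with state-of-the-art techniques such as reparameterizable convolutions and a fully sparse architecture. Key innovations include the Dual-bound Pillar Feature Net, which boosts performance without increasing complexity, and an efficient scheme for INT8 quantization of input features. With a computational cost of just 20.73 GMACs, LiFT is one of the few algorithms targeting minimal-complexity 3D object detection. Among comparable methods, LiFT ranks first, achieving an mAP of 51.84% and an NDS of 61.01% on the challenging NuScenes validation dataset. The code will be available at https://github.com/vision-agh/lift.||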
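|**2025-01-19**|[CLOFAI: A Dataset of Real And Fake Image Classification Tasks for Continual Learning](http://arxiv.org/abs/2501.11140)|**[link](https://github.com/will-doherty/clofai)**|The rapid advancement of generative AI models capable of creating realistic media has led to a need for classifiers that can accurately distinguish between genuine and artificially generated images. A significant challenge for these classifiers arises when they encounter images from generative models not represented in their training data, which usually results in diminished performance. A typical approach is to periodically update the classifier's training data with images from the new generative models and then retrain the classifier on the updated dataset. However, in some real-life scenarios, storage, computational, or privacy constraints render this approach impractical. Additionally, models used in security applications may need to adapt rapidly. In these circumstances, continual learning provides a promising alternative, as the classifier can be updated without retraining on the entire dataset. In this paper, we introduce a new dataset called CLOFAI (Continual Learning On Fake and Authentic Images), which takes the form of a domain-incremental image classification problem. Moreover, we showcase the applicability of this dataset as a benchmark for evaluating continual learning methods. To do this, we set baselines on our new dataset using three foundational continual learning methods, EWC, GEM, and Experience Replay, and find that EWC performs poorly, while GEM and Experience Replay show promise, performing significantly better than a naive baseline. The dataset and the code to run the experiments can be accessed from the following GitHub repository: https://github.com/Will-Doherty/CLOFAI.||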
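|**2025-01-19**|[Leveraging counterfactual concepts for debugging and improving CNN model performance](http://arxiv.org/abs/2501.11087)|null|Counterfactual explanation methods have recently received significant attention for explaining CNN-based image classifiers, owing to their ability to provide easily understandable explanations that align more closely with human reasoning. However, limited attention has been given to using explainability methods to improve model performance. In this paper, we propose leveraging counterfactual concepts to enhance the performance of CNN models in image classification tasks. Our proposed approach uses counterfactual reasoning to identify the crucial filters used in the decision-making process. We then retrain the model with a novel methodology and loss functions that encourage the activation of class-relevant important filters and discourage the activation of irrelevant filters for each class. This process effectively minimizes the deviation between the activation patterns of local predictions and the global activation patterns of their respective inferred classes. By incorporating counterfactual explanations, we validate unseen model predictions and identify misclassifications. The proposed methodology provides insights into potential weaknesses and biases in the model's learning process, enabling targeted improvements and enhanced performance. Experimental results on publicly available datasets show an improvement of 1-2%, validating the effectiveness of the approach.||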
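|**2025-01-17**|[DiffStereo: High-Frequency Aware Diffusion Model for Stereo Image Restoration](http://arxiv.org/abs/2501.10325)|null|Diffusion models (DMs) have achieved promising performance in image restoration but have not yet been applied to stereo images. Applying DMs to stereo image restoration poses a series of challenges. The need to reconstruct two images exacerbates the computational cost of DMs. Additionally, existing latent DMs usually focus on semantic information and remove high-frequency details as redundancy during latent compression, yet these details are precisely what matters for image restoration. To address these problems, we propose a high-frequency aware diffusion model, DiffStereo, for stereo image restoration, the first attempt at applying DMs in this domain. Specifically, DiffStereo first learns latent high-frequency representations (LHFR) of high-quality images. A DM is then trained in the learned latent space to estimate the LHFR of stereo images, and these LHFR are fused into a Transformer-based stereo image restoration network, supplying beneficial high-frequency information from the corresponding high-quality images. The resolution of the LHFR is kept the same as that of the input images, which avoids texture distortion, while compression along the channel dimension alleviates the computational burden of the DM. Furthermore, we devise a position encoding scheme for integrating the LHFR into the restoration network, enabling distinctive guidance at different depths of the network. Comprehensive experiments verify that, by combining a generative DM with a Transformer, DiffStereo achieves both higher reconstruction accuracy and better perceptual quality on stereo super-resolution, deblurring, and low-light enhancement compared with state-of-the-art methods.||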
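|**2025-01-17**|[Spatio-temporal Graph Learning on Adaptive Mined Key Frames for High-performance Multi-Object Tracking](http://arxiv.org/abs/2501.10129)|null|In multi-object tracking, accurately capturing the spatial and temporal relationships between objects in video sequences remains a significant challenge. The problem is further complicated by frequent mutual occlusions among objects, which can lead to tracking errors and reduced performance in existing methods. Motivated by these challenges, we propose a novel adaptive key frame mining strategy that addresses the limitations of current tracking approaches. Specifically, we introduce a Key Frame Extraction (KFE) module that leverages reinforcement learning to adaptively segment videos, thereby guiding the tracker to exploit the intrinsic logic of the video content. This approach allows us to capture structured spatial relationships between different objects as well as the temporal relationships of objects across frames. To tackle object occlusions, we develop an Intra-Frame Feature Fusion (IFF) module. Unlike traditional graph-based methods that primarily focus on inter-frame feature fusion, our IFF module uses a Graph Convolutional Network (GCN) to facilitate information exchange between the target and surrounding objects within a frame. This significantly enhances target distinguishability and mitigates the tracking loss and appearance similarity caused by occlusions. By combining the strengths of long and short trajectories and considering the spatial relationships between objects, our proposed tracker achieves impressive results on the MOT17 dataset, namely 68.6 HOTA, 81.0 IDF1, 66.6 AssA, and 893 IDS, demonstrating its effectiveness and accuracy.||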
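|**2025-01-17**|[DiffVSR: Enhancing Real-World Video Super-Resolution with Diffusion Models for Advanced Visual Quality and Temporal Consistency](http://arxiv.org/abs/2501.10110)|null|Diffusion models have demonstrated exceptional capabilities in image generation and restoration, yet their application to video super-resolution faces significant challenges in maintaining both high fidelity and temporal consistency. We present DiffVSR, a diffusion-based framework for real-world video super-resolution that effectively addresses these challenges through key innovations. For intra-sequence coherence, we develop a multi-scale temporal attention module and a temporal-enhanced VAE decoder that capture fine-grained motion details. To ensure inter-sequence stability, we introduce a noise rescheduling mechanism with an interweaved latent transition approach, which enhances temporal consistency without additional training overhead. We propose a progressive learning strategy that transitions from simple to complex degradations, enabling robust optimization even with limited high-quality video data. Extensive experiments demonstrate that DiffVSR delivers superior results in both visual quality and temporal consistency, setting a new performance standard for real-world video super-resolution.||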
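|**2025-01-17**|[Classifier Ensemble for Efficient Uncertainty Calibration of Deep Neural Networks for Image Classification](http://arxiv.org/abs/2501.10089)|null|This paper investigates novel classifier ensemble techniques for uncertainty calibration applied to various deep neural networks for image classification. We evaluate both accuracy and calibration metrics, focusing on Expected Calibration Error (ECE) and Maximum Calibration Error (MCE). Our work compares different methods for building simple yet efficient classifier ensembles, including majority voting and several metamodel-based approaches. Our evaluation shows that although state-of-the-art deep neural networks for image classification achieve high accuracy on standard datasets, they frequently suffer from significant calibration errors. Basic ensemble techniques such as majority voting provide modest improvements, while metamodel-based ensembles consistently reduce ECE and MCE across all architectures. Notably, the largest of the metamodels we compare shows the most substantial calibration improvement, with minimal impact on accuracy. Moreover, classifier ensembles with metamodels outperform traditional model ensembles in calibration performance while requiring significantly fewer parameters. Compared with traditional post-hoc calibration methods, our approach removes the need for a separate calibration dataset. These findings underscore the potential of our proposed metamodel-based classifier ensembles as an efficient approach to improving model calibration, thereby contributing to more reliable deep learning systems.||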
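|**2025-01-17**|[One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression](http://arxiv.org/abs/2501.10064)|null|Current image tokenization methods require a large number of tokens to capture the information contained in images. Although the amount of information varies across images, most image tokenizers only support fixed-length tokenization, leading to inefficient token allocation. In this study, we introduce One-D-Piece, a discrete image tokenizer designed for variable-length tokenization that provides a quality-controllable mechanism. To enable variable compression rates, we introduce a simple but effective regularization mechanism named "Tail Token Drop" into discrete one-dimensional image tokenizers. This method encourages critical information to concentrate at the head of the token sequence, supporting variable-length tokenization while preserving state-of-the-art reconstruction quality. We evaluate our tokenizer across multiple reconstruction quality metrics and find that it delivers significantly better perceptual quality than existing quality-controllable compression methods, including JPEG and WebP, at smaller byte sizes. We also assess our tokenizer on various downstream computer vision tasks, including image classification, object detection, semantic segmentation, and depth estimation, confirming its adaptability to numerous applications compared with other variable-rate methods. Our approach demonstrates the versatility of variable-length discrete image tokenization, establishing a new paradigm in both compression efficiency and reconstruction performance. Finally, we validate the effectiveness of Tail Token Drop through detailed analysis of the tokenizer.||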
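|**2025-01-17**|[LWGANet: A Lightweight Group Attention Backbone for Remote Sensing Visual Tasks](http://arxiv.org/abs/2501.10040)|**[link](https://github.com/lwcver/lwganet)**|Remote sensing (RS) visual tasks are of significant academic and practical importance. However, they face numerous challenges that hinder effective feature extraction, including detecting and recognizing multiple objects with large scale variations within a single image. While earlier dual-branch or multi-branch architectures can handle these object variations effectively, they also considerably increase computational demands and parameter counts, making them less viable for deployment on resource-constrained devices. Contemporary lightweight backbone networks, designed primarily for natural images, often struggle to extract features of multi-scale objects effectively, which compromises their efficacy in RS visual tasks. This paper introduces LWGANet, a lightweight backbone network tailored for RS visual tasks that incorporates a novel lightweight group attention (LWGA) module designed to address these specific challenges. The LWGA module, tailored for RS imagery, exploits redundant features to extract a wide range of spatial information, from local to global scales, without introducing additional complexity or computational overhead. This enables precise feature extraction across multiple scales within an efficient framework. LWGANet was rigorously evaluated on twelve datasets spanning four key RS visual tasks: scene classification, oriented object detection, semantic segmentation, and change detection. The results confirm LWGANet's broad applicability and its ability to maintain an optimal balance between high performance and low complexity, achieving state-of-the-art (SOTA) results across diverse datasets. LWGANet emerges as a novel solution for resource-limited scenarios that require robust RS image processing capabilities.||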
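|**2025-01-17**|[FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis](http://arxiv.org/abs/2501.09887)|null|Object Referring Analysis (ORA), commonly known as referring expression comprehension, requires identifying and locating specific objects in an image based on natural language descriptions. Unlike general object detection, ORA demands both accurate language understanding and precise visual localization, making it inherently more complex. Although recent pre-trained large visual grounding detectors have made significant progress, they rely heavily on extensive labeled data and time-consuming learning. To address these issues, we introduce a novel, training-free framework for zero-shot ORA, termed FLORA (Formal Language for Object Referring and Analysis). FLORA harnesses the inherent reasoning capabilities of large language models (LLMs) and integrates a formal language model, a logical framework that regulates language within structured, rule-based descriptions, to provide effective zero-shot ORA. More specifically, our formal language model (FLM) enables an effective, logic-driven interpretation of object descriptions without any training process. Building on the FLM-regulated LLM outputs, we further devise a Bayesian inference framework and employ appropriate off-the-shelf interpretive models to finalize the reasoning, delivering favorable robustness against LLM hallucinations and compelling ORA performance in a training-free manner. In practice, FLORA boosts the zero-shot performance of existing pre-trained grounding detectors by about 45%. Our comprehensive evaluation on different challenging datasets also confirms that FLORA consistently surpasses current state-of-the-art zero-shot methods in both detection and segmentation tasks associated with zero-shot ORA. We believe our probabilistic parsing and reasoning of the LLM outputs improve the reliability and interpretability of zero-shot ORA. We will release the code upon publication.||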
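|**2025-01-16**|[SRE-Conv: Symmetric Rotation Equivariant Convolution for Biomedical Image Classification](http://arxiv.org/abs/2501.09753)|**[link](https://github.com/xypb/sre-conv)**|Convolutional neural networks (CNNs) are essential tools for computer vision tasks, but they lack traditionally desired properties of extracted features that could further improve model performance, such as rotational equivariance. Such properties are ubiquitous in biomedical images, which often lack an explicit orientation. While current work largely relies on data augmentation or explicit modules to capture orientation information, this comes at the expense of increased training costs or ineffective approximations of the desired equivariance. To overcome these challenges, we propose a novel and efficient implementation of the Symmetric Rotation-Equivariant (SRE) Convolution (SRE-Conv) kernel, designed to learn rotation-invariant features while compressing the model size. The SRE-Conv kernel can easily be incorporated into any CNN backbone. We validate the ability of a deep SRE-CNN to capture rotational equivariance using the public MedMNISTv2 dataset (16 tasks in total). SRE-Conv-CNN achieves improved classification accuracy on rotated images across all 16 test datasets in both 2D and 3D, while improving efficiency through fewer parameters and a smaller memory footprint. The code is available at https://github.com/XYPB/SRE-Conv.||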
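|**2025-01-16**|[A Simple Aerial Detection Baseline of Multimodal Language Models](http://arxiv.org/abs/2501.09720)|**[link](https://github.com/li-qingyun/mllm-mmrotate)**|Multimodal language models (MLMs) based on generative pre-trained Transformers are considered strong candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance on multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, existing RS MLMs have not explored aerial detection because the autoregressive prediction mechanism of MLMs differs significantly from detection outputs. In this paper, we present a simple baseline, named LMMRotate, that applies MLMs to aerial detection for the first time. Specifically, we first introduce a normalization method that transforms detection outputs into textual outputs compatible with the MLM framework. We then propose an evaluation method that ensures a fair comparison between MLMs and conventional object detection models. We build the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at https://github.com/Li-Qingyun/mllm-mmrotate.||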
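|**2025-01-16**|[Multi-task deep-learning for sleep event detection and stage classification](http://arxiv.org/abs/2501.09519)|**[link](https://github.com/adrania/sleep-events-detection)**|Polysomnographic sleep analysis is the standard clinical method for accurately diagnosing and treating sleep disorders. It is an intricate process that involves manually identifying, classifying, and locating multiple sleep event patterns. The task is complex because identifying different types of events requires focusing on different subsets of signals, resulting in an iterative and time-consuming process with several rounds of visual analysis. In this paper, we propose a multi-task deep-learning approach for simultaneously detecting sleep events and constructing hypnograms in a single pass. Taking state-of-the-art object detection methods from computer vision as a reference, we reformulate the problem of multivariate time series analysis and, more specifically, pattern detection in the sleep analysis scenario. We investigate the performance of the resulting method in identifying different combinations of EEG arousals, respiratory events (apneas and hypopneas), and sleep stages, while also considering different configurations of input signal montages. Furthermore, we evaluate our approach using two independent datasets, assessing true generalization effects in both local and external validation scenarios. Based on our results, we analyze and discuss the method's capabilities and its potential for wide applicability across different settings and datasets.||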
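|**2025-01-16**|[The Devil is in the Details: Simple Remedies for Image-to-LiDAR Representation Learning](http://arxiv.org/abs/2501.09485)|null|LiDAR is a crucial sensor in autonomous driving and is commonly used alongside cameras. By exploiting this camera-LiDAR setup and recent advances in image representation learning, prior studies have shown the promising potential of image-to-LiDAR distillation. These prior works focus primarily on designing their own losses to effectively distill pre-trained 2D image representations into 3D models. However, the other parts of the design have been surprisingly unexplored. We find that fundamental design elements, such as the LiDAR coordinate system, quantization according to the existing input interface, and data utilization, are more critical than developing loss functions, yet they have been overlooked in prior work. In this work, we show that simple fixes to these designs notably outperform existing methods in downstream task performance, by 16% on 3D semantic segmentation on the nuScenes dataset and by 13% on 3D object detection on the KITTI dataset. We focus on overlooked design choices along the spatial and temporal axes. Spatially, prior work has used cylindrical coordinates and voxel sizes without considering the side effects they produce with the commonly deployed sparse convolution layer input interface, leading to spatial quantization errors in 3D models. Temporally, existing work has avoided cumbersome data curation by discarding unsynced data, limiting usage to only the small portion of data that is temporally synced across sensors. We analyze these effects and propose simple solutions for each overlooked aspect.||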
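|**2025-01-16**|[RE-POSE: Synergizing Reinforcement Learning-Based Partitioning and Offloading for Edge Object Detection](http://arxiv.org/abs/2501.09465)|null|Object detection plays a crucial role in smart video analysis, with applications ranging from autonomous driving and security surveillance to smart cities. However, achieving real-time object detection on edge devices presents significant challenges because of their limited computational resources and the high demands of deep neural network (DNN)-based detection models, particularly when processing high-resolution video. Conventional strategies, such as down-sampling the input and scaling up the network, often either sacrifice detection accuracy for faster performance or lead to higher inference latency. To address these issues, this paper introduces RE-POSE, a Reinforcement Learning (RL)-driven partitioning and edge offloading framework designed to optimize the accuracy-latency trade-off in resource-constrained edge environments. Our approach features an RL-based Dynamic Clustering Algorithm (RL-DCA) that partitions video frames into non-uniform blocks based on object distribution and the computational characteristics of DNNs. A parallel edge offloading scheme is also implemented to distribute these blocks across multiple edge servers for concurrent processing. Experimental evaluations show that RE-POSE significantly improves detection accuracy and reduces inference latency, outperforming existing methods.||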
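|**2025-01-16**|[Shape-Based Single Object Classification Using Ensemble Method Classifiers](http://arxiv.org/abs/2501.09311)|null|Nowadays, more and more images are becoming available. Annotating and retrieving these images pose classification problems, where each class is defined as the group of database images labelled with a common semantic label. Various systems have been proposed for content-based retrieval as well as for image classification and indexing. In this paper, we propose a hierarchical classification framework to bridge the semantic gap effectively and to achieve multi-category image classification. We use a well-known pre-processing and post-processing method and apply it to three problems: image segmentation, object identification, and image classification. The method was used to classify single-object images from the Amazon and Google datasets. We tested classification with four different classifiers: Bayes Network (BN), Random Forest (RF), Bagging, and Vote. The estimated classification accuracies ranged from 20% to 99% (using 10-fold cross-validation). The Bagging classifier performed best, followed by the Random Forest classifier.||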
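|**2025-01-16**|[Efficient Few-Shot Medical Image Analysis via Hierarchical Contrastive Vision-Language Learning](http://arxiv.org/abs/2501.09294)|null|Few-shot learning in medical image classification presents a significant challenge because of the limited availability of annotated data and the complexity of medical imagery. In this work, we propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA), a novel framework that leverages large vision-language models (LVLMs) for medical image analysis. HiCA introduces a two-stage fine-tuning strategy that combines domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels. We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound, achieving state-of-the-art performance in both few-shot and zero-shot settings. Further analyses demonstrate the robustness, generalizability, and interpretability of our method, with substantial performance improvements over existing baselines. Our work highlights the potential of hierarchical contrastive strategies in adapting LVLMs to the unique challenges of medical imaging tasks.||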
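|**2025-01-16**|[SoccerSynth-Detection: A Synthetic Dataset for Soccer Player Detection](http://arxiv.org/abs/2501.09281)|null|In soccer video analysis, player detection is essential for identifying key events and reconstructing tactical positions. The presence of numerous players and frequent occlusions, combined with copyright restrictions, severely restricts the availability of datasets, leaving limited options such as SoccerNet-Tracking and SportsMOT. These datasets lack diversity, which hinders algorithms from adapting effectively to varied soccer video contexts. To address these challenges, we developed SoccerSynth-Detection, the first synthetic dataset designed for the detection of synthetic soccer players. It includes a broad range of random lighting and textures, as well as simulated camera motion blur. We validated its efficacy using the object detection model Yolov8n against real-world datasets (SoccerNet-Tracking and SportsMOT). In transfer tests, it matched the performance of real datasets and significantly outperformed them on images with motion blur; in pre-training tests, it proved effective as a pre-training dataset, significantly enhancing the algorithm's overall performance. Our work demonstrates the potential of synthetic datasets to replace real datasets for algorithm training in soccer video analysis.||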
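|**2025-01-16**|[Bias for Action: Video Implicit Neural Representations with Bias Modulation](http://arxiv.org/abs/2501.09277)|null|We propose a new continuous video modeling framework based on implicit neural representations (INRs), called ActINR. At the core of our approach is the observation that INRs can be considered learnable dictionaries, with the shapes of the basis functions governed by the weights of the INR and their locations governed by the biases. Assuming compact non-linear activation functions, we hypothesize that an INR's biases are suitable for capturing motion across images and facilitate compact representations of video sequences. Using these observations, we design ActINR to share INR weights across the frames of a video sequence while using unique biases for each frame. We further model the biases as the output of a separate INR conditioned on the time index to promote smoothness. By training the video INR and this bias INR together, we demonstrate unique capabilities, including 10x video slow motion, 4x spatial super-resolution combined with 2x slow motion, denoising, and video inpainting. ActINR performs remarkably well across numerous video processing tasks (often achieving improvements of more than 6 dB), setting a new standard for continuous modeling of videos.||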
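|**2025-01-15**|[Boosting Diffusion Guidance via Learning Degradation-Aware Models for Blind Super Resolution](http://arxiv.org/abs/2501.08819)|**[link](https://github.com/ryanlu2240/boosting-diffusion-guidance-via-learning-degradation-aware-models-for-blind-super-resolution)**|Recently, diffusion-based blind super-resolution (SR) methods have shown a strong ability to generate high-resolution images with abundant high-frequency detail, but the detail is often achieved at the expense of fidelity. Meanwhile, another line of research that focuses on rectifying the reverse process of diffusion models (i.e., diffusion guidance) has demonstrated the power to generate high-fidelity results for non-blind SR. However, these methods rely on known degradation kernels, making them difficult to apply to blind SR. To address these issues, we introduce degradation-aware models that can be integrated into the diffusion guidance framework, eliminating the need to know the degradation kernels. Additionally, we propose two novel techniques, input perturbation and guidance scalar, to further improve our performance. Extensive experimental results show that our proposed method outperforms state-of-the-art methods on blind SR benchmarks.||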
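|**2025-01-14**|[Training Hybrid Neural Networks with Multimode Optical Nonlinearities Using Digital Twins](http://arxiv.org/abs/2501.07991)|null|The ability to train ever-larger neural networks has brought artificial intelligence to the forefront of scientific and technical discovery. However, their exponentially increasing size creates a proportionally greater demand for energy and computational hardware. Incorporating complex physical events into networks as fixed, efficient computation modules can address this demand by decreasing the complexity of the trainable layers. Here, we use ultrashort pulse propagation in multimode fibers, which performs large-scale nonlinear transformations, for this purpose. The hybrid architecture is trained through a neural model that differentiably approximates the optical system. The training algorithm updates the neural simulator and backpropagates the error signal through this proxy to optimize the layers preceding the optical one. Our experimental results achieve state-of-the-art image classification accuracy and simulation fidelity. Moreover, the framework demonstrates exceptional resilience to experimental drift. By integrating low-energy physical systems into neural networks, this approach enables scalable, energy-efficient AI models with significantly reduced computational demands.||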
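|**2025-01-14**|[Bridge-SR: Schrödinger Bridge for Efficient SR](http://arxiv.org/abs/2501.07897)|null|Speech super-resolution (SR), which generates a waveform at a higher sampling rate from its low-resolution version, is a long-standing critical task in speech restoration. Previous work has explored speech SR in different data spaces, but these methods either require additional compression networks or offer limited synthesis quality and inference speed. Motivated by recent advances in probabilistic generative models, we present Bridge-SR, a novel and efficient any-to-48kHz SR system in the speech waveform domain. Using tractable Schrödinger Bridge models, we take the observed low-resolution waveform as a prior, which is intrinsically informative for the high-resolution target. By optimizing a lightweight network to learn the score functions from the prior to the target, we achieve efficient waveform SR through a data-to-data generation process that fully exploits the instructive content contained in the low-resolution observation. Furthermore, we identify the importance of the noise schedule, data scaling, and auxiliary loss functions, which further improve the SR quality of bridge-based systems. Experiments conducted on the benchmark dataset VCTK demonstrate the efficiency of our system: (1) in terms of sample quality, Bridge-SR outperforms several strong baseline methods under different SR settings using a lightweight network backbone (1.7M parameters); (2) in terms of inference speed, our 4-step synthesis achieves better performance than the 8-step conditional diffusion counterpart (LSD: 0.911 vs 0.927). Demo at https://bridge-sr.github.io.||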
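|**2025-01-14**|[A Low-cost and Ultra-lightweight Binary Neural Network for Traffic Signal Recognition](http://arxiv.org/abs/2501.07808)|null|The deployment of neural networks on vehicle platforms and in wearable Artificial Intelligence-of-Things (AIoT) scenarios has become a research area that attracts much attention. As deep learning technology continues to develop, many image classification models are committed to improving recognition accuracy, but this is often accompanied by problems such as large model resource usage, complex structure, and high power consumption, which make them difficult to deploy on resource-constrained platforms. This paper proposes an ultra-lightweight binary neural network (BNN) model designed for hardware deployment and conducts image classification research based on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. We also validate it on the Chinese Traffic Sign (CTS) and Belgian Traffic Sign (BTS) datasets. The proposed model shows excellent recognition performance with an accuracy of up to 97.64%, making it one of the best-performing BNN models on the GTSRB dataset. Compared with the full-precision model, the accuracy loss is kept within 1%, and the parameter storage overhead of the model is only 10% of that of the full-precision model. More importantly, our network model relies only on logical operations and low-bit-width fixed-point addition and subtraction during the inference phase, which greatly simplifies the design complexity of the processing element (PE). Our research shows the great potential of BNNs in the hardware deployment of computer vision models, especially in computer vision tasks related to autonomous driving.||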
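|**2025-01-14**|[Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation](http://arxiv.org/abs/2501.07806)|**[link](https://github.com/hy0523/mtnet)**|In this paper, we propose an efficient algorithm, termed MTNet, to address the challenges of unsupervised video object segmentation (UVOS) by exploiting motion and temporal cues simultaneously. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects within a unified framework. MTNet is designed to effectively merge appearance and motion features during feature extraction within the encoders, promoting a more complementary representation. To capture the intricate long-range contextual dynamics and information embedded in videos, we introduce a temporal transformer module that facilitates effective inter-frame interactions throughout a video clip. Furthermore, we employ a cascade of decoders across all feature levels to make optimal use of the extracted features, aiming to generate more precise segmentation masks. As a result, MTNet provides a strong and compact framework that explores both temporal and cross-modality knowledge to localize and track the primary object robustly and efficiently in various challenging scenarios. Extensive experiments across multiple benchmarks show that our method not only attains state-of-the-art performance in unsupervised video object segmentation but also delivers competitive results in video salient object detection. These findings highlight the method's versatility and its adaptability to a range of segmentation tasks. Source code is available at https://github.com/hy0523/MTNet.||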
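|**2025-01-14**|[Balance Divergence for Knowledge Distillation](http://arxiv.org/abs/2501.07804)|null|Knowledge distillation has been widely adopted in computer vision tasks because it can effectively improve the performance of lightweight student networks by using knowledge transferred from cumbersome teacher networks. Most existing knowledge distillation methods use the Kullback-Leibler divergence to mimic the logit output probabilities between the teacher network and the student network. However, these methods may neglect the negative parts of the teacher's "dark knowledge" because the divergence calculation may ignore the effect of the minute probabilities in the teacher's logit output. This deficiency may lead to suboptimal logit mimicry during distillation and result in an imbalance in the information acquired by the student network. In this paper, we investigate the impact of this imbalance and propose a novel method, named Balance Divergence Distillation. By introducing a compensatory operation using the reverse Kullback-Leibler divergence, our method improves the modeling of the extremely small values in the negative part of the teacher while preserving the learning capacity for the positive part. Furthermore, we test the impact of different temperature coefficient adjustments, which can further balance knowledge transfer. We evaluate the proposed method on several computer vision tasks, including image classification and semantic segmentation. The evaluation results show that our method achieves an accuracy improvement of 1% to 3% for lightweight students on both the CIFAR-100 and ImageNet datasets, and a 4.55% mIoU improvement for PSP-ResNet18 on the Cityscapes dataset. The experiments show that our method is a simple yet highly effective solution that can be smoothly applied to different knowledge distillation methods.||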
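|**2025-01-14**|[Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding](http://arxiv.org/abs/2501.07783)|**[link](https://github.com/opengvlab/piip)**|Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process images at multiple resolutions, leading to significant computational cost. To address this challenge, we propose a novel network architecture called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where higher-resolution images are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and to a representative multimodal large language model, LLaVA, and conduct extensive experiments on tasks such as object detection, segmentation, image classification, and multimodal understanding. PIIP achieves superior performance compared with single-branch and existing multi-resolution approaches at lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP improves its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data. Our code is released at https://github.com/OpenGVLab/PIIP.||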
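|**2025-01-13**|[C2PD: Continuity-Constrained Pixelwise Deformation for Guided Depth Super-Resolution](http://arxiv.org/abs/2501.07688)|**[link](https://github.com/amhamster/c2pd)**|Guided depth super-resolution (GDSR) has demonstrated impressive performance across a wide range of domains, and many methods have been proposed. However, existing methods often treat depth maps as images, where shading values are computed discretely, which makes it hard for them to restore the continuity inherent in depth maps. In this paper, we propose a novel approach that maximizes the use of spatial characteristics in depth, combined with human abstract perception of real-world substance, by transforming the GDSR problem into the deformation of a rough model with ideal plasticity, which can be deformed by force like a continuous object. Specifically, we first design a cross-modal operation, Continuity-constrained Asymmetrical Pixelwise Operation (CAPO), which can mimic the process of deforming an isovolumetric flexible object through external forces. Using CAPO as the fundamental component, we develop Pixelwise Cross Gradient Deformation (PCGD), which is capable of emulating operations on ideal plastic objects (without volume constraint). Notably, our approach achieves state-of-the-art performance on four widely adopted GDSR benchmarks, with significant advantages in large-scale tasks and generalizability.||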
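|**2025-01-13**|[SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing](http://arxiv.org/abs/2501.07554)|**[link](https://github.com/custommetrics-sst/sst_customevaluationmetrics)**|Video editing models have advanced significantly, but evaluating their performance remains challenging. Traditional metrics, such as CLIP text and image scores, often fall short: text scores are limited by inadequate training data and hierarchical dependencies, while image scores fail to assess temporal consistency. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), object detection, and temporal consistency checks. SST-EM comprises four components: (1) semantic extraction from frames using a VLM; (2) primary object tracking with object detection; (3) focused object refinement via an LLM agent; and (4) temporal consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression analysis. The name SST-EM reflects its focus on the semantic, spatial, and temporal aspects of video evaluation. SST-EM provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing. The source code is available in the GitHub repository.||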
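|**2025-01-13**|[TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations](http://arxiv.org/abs/2501.07360)|**[link](https://github.com/timbervision/timbervision)**|Timber is an increasingly valuable and versatile resource. However, forestry operations such as harvesting, handling, and measuring logs still require substantial human labor, often in remote environments that pose significant safety risks. Progressively automating these tasks has the potential to increase their efficiency and safety, but it requires accurate detection of individual logs as well as live trees and their context. Although initial approaches have been proposed for this challenging application domain, specialized data and algorithms are still too scarce to develop robust solutions. To close this gap, we introduce the TimberVision dataset, consisting of more than 2,000 annotated RGB images containing a total of 51,000 trunk components, including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in both quantity and detail by a wide margin. Based on this data, we conduct a series of ablation experiments for oriented object detection and instance segmentation and evaluate the influence of multiple scene parameters on model performance. We introduce a generic framework that fuses the components detected by our models for both tasks into unified trunk representations. Furthermore, we automatically derive geometric properties and apply multi-object tracking to further enhance robustness. Our detection and tracking approach provides highly descriptive and accurate trunk representations from RGB image data alone, even under challenging environmental conditions. Our solution is suitable for a wide range of application scenarios and can readily be combined with other sensor modalities.||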
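|**2025-01-13**|[Toward Realistic Camouflaged Object Detection: Benchmarks and Method](http://arxiv.org/abs/2501.07297)|**[link](https://github.com/zhimengxin/rcod)**|Camouflaged object detection (COD) primarily relies on semantic or instance segmentation methods. While these methods have made significant advances in identifying the contours of camouflaged objects, they may be inefficient or not cost-effective for tasks that only require the specific location of the object. In such cases, object detection algorithms offer an optimized solution for Realistic Camouflaged Object Detection (RCOD). However, detecting camouflaged objects remains a formidable challenge because of the high degree of similarity between the features of the objects and their backgrounds. Unlike segmentation methods, which perform pixel-wise comparisons to differentiate foreground from background, object detectors omit this analysis, further aggravating the challenge. To solve this problem, we propose a camouflage-aware feature refinement (CAFR) strategy. Because camouflaged objects are not rare categories, CAFR fully exploits the clear perception of current objects within the prior knowledge of large models to help detectors deeply understand the distinctions between background and foreground. Specifically, in CAFR we introduce the Adaptive Gradient Propagation (AGP) module, which fine-tunes all feature extractor layers in large detection models to fully refine class-specific features from camouflaged contexts. We then design the Sparse Feature Refinement (SFR) module, which optimizes the Transformer-based feature extractor to focus primarily on capturing class-specific features in camouflaged scenarios. To facilitate the assessment of RCOD tasks, we manually annotate the labels required for detection on three existing segmentation COD datasets, creating a new benchmark for RCOD tasks. Code and datasets are available at: https://github.com/zhimengXin/RCOD.||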
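|**2025-01-10**|[Merging Feed-Forward Sublayers for Compressed Transformers](http://arxiv.org/abs/2501.06126)|**[link](https://github.com/nverma1/merging-ffs-compression)**|With the rise and ubiquity of larger deep learning models, the need for high-quality compression techniques is growing in order to deploy these models widely. The sheer parameter count of these models makes it difficult to fit them into the memory constraints of different hardware. In this work, we present a novel approach to model compression that merges similar parameter groups within a model rather than pruning away less important parameters. Specifically, we select, align, and merge separate feed-forward sublayers in Transformer models, and test our method on language modeling, image classification, and machine translation. With our method, we demonstrate performance comparable to the original models while merging more than a third of the model's feed-forward sublayers, and we demonstrate improved performance over a strong layer-pruning baseline. For instance, we can remove over 21% of total parameters from a Vision Transformer while maintaining 99% of its original performance. Additionally, we observe that some groups of feed-forward sublayers exhibit high activation similarity, which may help explain their surprising mergeability.||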
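|**2025-01-10**|[Minimizing Occlusion Effect on Multi-View Camera Perception in BEV with Multi-Sensor Fusion](http://arxiv.org/abs/2501.05997)|null|Autonomous driving technology is advancing rapidly, offering the potential for safer and more efficient transportation. However, the performance of these systems can be significantly compromised by sensor occlusion caused by environmental factors such as dust, rain, and fog. These occlusions severely affect vision-based tasks such as object detection, vehicle segmentation, and lane recognition. In this paper, we investigate the impact of various kinds of occlusion on camera sensors by projecting their effects from multi-view camera images of the nuScenes dataset into the Bird's-Eye View (BEV) domain. This approach allows us to analyze how occlusions are spatially distributed and how they influence vehicle segmentation accuracy within the BEV domain. Despite significant advances in sensor technology and multi-sensor fusion, the existing literature lacks studies on the specific effects of camera occlusion on BEV-based perception systems. To address this gap, we use a multi-sensor fusion technique that integrates LiDAR and radar sensor data to mitigate the performance degradation caused by occluded cameras. Our findings show that this approach significantly improves the accuracy and robustness of vehicle segmentation, leading to more reliable autonomous driving systems.||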
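|**2025-01-10**|[Automatic detection of single-electron regime of quantum dots and definition of virtual gates using U-Net and clustering](http://arxiv.org/abs/2501.05878)|null|Realizing practical quantum computers requires a large number of qubits. Semiconductor spin qubits offer advantages such as high scalability and compatibility with existing semiconductor technologies. However, as the number of qubits increases, manual qubit tuning becomes infeasible, motivating automated tuning approaches. In this study, we use U-Net, a neural network method for object detection, to identify charge transition lines in experimental charge stability diagrams. The extracted charge transition lines are analyzed using the Hough transform to determine their positions and angles. Based on this analysis, we obtain the transformation matrix to virtual gates. Furthermore, we identify the single-electron regime by clustering the Hough transform outputs. We also show the single-electron regime within the virtual gate space. These sequential processes are performed automatically. This approach will advance automated control technologies for large-scale quantum devices.||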
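|**2025-01-10**|[Zero-shot Shark Tracking and Biometrics from Aerial Imagery](http://arxiv.org/abs/2501.05717)|null|The recent widespread adoption of drones for studying marine animals provides opportunities for deriving biological information from aerial imagery. The large scale of imagery data acquired from drones is well suited for machine learning (ML) analysis. However, the development of ML models for analyzing marine animal aerial imagery has followed the classical paradigm of training, testing, and deploying a new model for each dataset, which requires significant time, human effort, and ML expertise. We introduce Frame-Level Alignment and Tracking (FLAIR), which leverages the video understanding of the Segment Anything Model 2 (SAM2) and the vision-language capabilities of Contrastive Language-Image Pre-training (CLIP). FLAIR takes drone video as input and outputs segmentation masks of the species of interest across the video. Notably, FLAIR uses a zero-shot approach, eliminating the need for labeled data, training a new model, or fine-tuning an existing model in order to generalize to other species. Using a dataset of 18,000 drone images of Pacific nurse sharks, we trained state-of-the-art object detection models to compare against FLAIR. The results show that FLAIR massively outperforms these object detectors and performs competitively against two human-in-the-loop methods for prompting SAM2, achieving a Dice score of 0.81. FLAIR readily generalizes to other shark species without additional human effort and can be combined with novel heuristics to automatically extract relevant information, including length and tailbeat frequency. FLAIR has significant potential to accelerate aerial imagery analysis workflows, requiring markedly less human effort and expertise than traditional machine learning workflows while achieving superior accuracy. By reducing the effort required for aerial imagery analysis, FLAIR allows scientists to spend more time interpreting results and deriving insights about marine ecosystems.||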
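|**2025-01-10**|[Dark Energy Survey Year 6 Results: Synthetic-source Injection Across the Full Survey Using Balrog](http://arxiv.org/abs/2501.05683)|null|Synthetic source injection (SSI), the insertion of simulated sources into pixel-level on-sky images, is a powerful method for characterizing object detection and measurement in wide-field astronomical imaging surveys. Within the Dark Energy Survey (DES), SSI plays a critical role in characterizing the various algorithms needed to convert images into catalogs and in deriving key quantities needed for the cosmology analysis, such as object detection rates, galaxy redshift estimation, galaxy magnification, star-galaxy classification, and photometric performance. We present here a source injection catalog of 146 million injections spanning the entire 5,000 square degree DES footprint, generated using the Balrog SSI pipeline. Through this sample, we demonstrate that the DES Year 6 (Y6) image processing pipeline provides accurate estimates of the object properties, for both galaxies and stars, at the percent level, and we highlight specific regimes where the accuracy is reduced. We then show the consistency between the SSI catalog and the data catalogs for all galaxy samples developed within the weak lensing and galaxy clustering analyses of DES Y6. The consistency between the two catalogs also extends to their correlations with survey observing properties (seeing, airmass, depth, extinction, etc.). Finally, we highlight a number of applications of this catalog to the DES Y6 cosmology analysis. This dataset is the largest SSI catalog produced at this fidelity to date and will serve as a key testing ground for exploring the utility of SSI catalogs in upcoming surveys such as the Vera C. Rubin Observatory Legacy Survey of Space and Time.||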
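|**2025-01-09**|[Bit-depth color recovery via off-the-shelf super-resolution models](http://arxiv.org/abs/2501.05611)|null|Advances in imaging technology have enabled hardware to support 10 to 16 bits per channel, facilitating precise manipulation in applications such as image editing and video processing. While deep neural networks promise to recover high bit-depth representations, existing methods often rely on scale-invariant image information, limiting performance in certain scenarios. In this paper, we introduce a novel approach that integrates a super-resolution architecture to extract detailed prior information from images. By leveraging interpolated data generated during the super-resolution process, our method achieves pixel-level recovery of fine-grained color details. We also show that spatial features learned through the super-resolution process contribute significantly to the recovery of detailed color depth information. Experiments on benchmark datasets show that our approach outperforms state-of-the-art methods, highlighting the potential of super-resolution for high-fidelity color restoration.||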
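|**2025-01-09**|[An Empirical Study of Autoregressive Pre-training from Videos](http://arxiv.org/abs/2501.05453)|null|We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models called Toto. We treat videos as sequences of visual tokens and train Transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks, including image recognition, video classification, object tracking, and robotics. Our results show that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in scaling curves similar to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/||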
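|**2025-01-09**|[Performance of YOLOv7 in Kitchen Safety While Handling Knife](http://arxiv.org/abs/2501.05399)|null|Safe knife practices in the kitchen significantly reduce the risk of cuts, injuries, and serious accidents during food preparation. Using YOLOv7, an advanced object detection model, this study focuses on identifying safety risks during knife handling, particularly improper finger placement and blade contact with the hand. The model's performance was evaluated using metrics such as precision, recall, mAP50, and mAP50-95. The results show that YOLOv7 achieved its best performance at epoch 31, with an mAP50-95 score of 0.7879, precision of 0.9063, and recall of 0.7503. These findings highlight YOLOv7's potential to accurately detect knife-related hazards, promoting the development of improved kitchen safety.||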
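|**2025-01-09**|[A Systematic Literature Review on Deep Learning-based Depth Estimation in Computer Vision](http://arxiv.org/abs/2501.05147)|null|Depth estimation (DE) provides spatial information about a scene and enables tasks such as 3D reconstruction, object detection, and scene understanding. Recently, interest has grown in using deep learning (DL)-based methods for DE. Traditional techniques rely on handcrafted features that often struggle to generalize to diverse scenes and require extensive manual tuning. DL models for DE, however, can automatically extract relevant features from input data, adapt to various scene conditions, and generalize well to unseen environments. Numerous DL-based methods have been developed, making it necessary to survey and synthesize the state of the art (SOTA). Previous reviews on DE have mainly focused on either monocular or stereo-based techniques rather than reviewing DE comprehensively. Furthermore, to the best of our knowledge, there is no systematic literature review (SLR) that comprehensively focuses on DE. Therefore, this SLR study is being conducted. Initially, electronic databases were searched for relevant publications, resulting in 1284 publications. Using defined exclusion and quality criteria, 128 publications were shortlisted and further filtered to select 59 high-quality primary studies. These studies were analyzed to extract data and answer the defined research questions. The results show that DL methods were developed mainly for three different types of DE: monocular, stereo, and multi-view. Twenty publicly available datasets were used to train, test, and evaluate DL models for DE, with KITTI, NYU Depth V2, and Make 3D being the most used datasets. Twenty-nine evaluation metrics were used to assess the performance of DE. Thirty-five base models were reported in the primary studies, and the five most-used base models were ResNet-50, ResNet-18, ResNet-101, U-Net, and VGG-16. Finally, the lack of ground truth data was among the most significant challenges reported by the primary studies.||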
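|**2025-01-09**|[CorrDiff: Adaptive Delay-aware Detector with Temporal Cue Inputs for Real-time Object Detection](http://arxiv.org/abs/2501.05132)|null|Real-time object detection plays a critical role in the decision-making process of many real-world applications, including collision avoidance and path planning in autonomous driving systems. This paper presents a novel real-time streaming perception method named CorrDiff, designed to tackle the challenge of delays in real-time detection systems. The main contribution of CorrDiff lies in its adaptive delay-aware detector, which uses runtime-estimated temporal cues to predict objects' locations for multiple future frames and selectively produces predictions that match real-world time, effectively compensating for any communication and computational delays. Using motion estimation and feature enhancement, the proposed model outperforms current state-of-the-art methods both for 1) single-frame detection for the current frame or the next frame, in terms of the mAP metric, and 2) prediction for (multiple) future frames, in terms of the sAP metric (the sAP metric evaluates object detection algorithms in streaming scenarios, factoring in both latency and accuracy). It demonstrates robust performance on a range of devices, from the powerful Tesla V100 to the modest RTX 2080Ti, achieving the highest level of perceptual accuracy on all platforms. Unlike most state-of-the-art methods, which struggle to complete computation within a single frame on less powerful devices, CorrDiff meets stringent real-time processing requirements on all kinds of devices. The experimental results emphasize the system's adaptability and its potential to significantly improve the safety and reliability of many real-world systems, such as autonomous driving. Our code is completely open-sourced and is available at https://anonymous.4open.science/r/CorrDiff.||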
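|**2025-01-09**|[A 1Mb mixed-precision quantized encoder for image classification and patch-based compression](http://arxiv.org/abs/2501.05097)|null|Even though Application Specific Integrated Circuits (ASICs) have proven to be a relevant choice for integrating inference at the edge, they are often limited in terms of applicability. In this paper, we demonstrate that an ASIC neural network accelerator dedicated to image processing can be applied to multiple tasks of different levels, image classification and compression, while requiring very limited hardware. The key component is a reconfigurable, mixed-precision (3b/2b/1b) encoder that takes advantage of proper weight and activation quantizations combined with convolutional layer structural pruning to lower hardware-related constraints (memory and computing). We introduce an automatic adaptation of linear symmetric quantizer scaling factors to perform quantized level equalization, aiming at stabilizing quinary and ternary weight training. In addition, a proposed layer-shared Bit-Shift Normalization significantly simplifies the implementation of the hardware-expensive Batch Normalization. For a specific configuration in which the encoder design only requires 1Mb, the classification accuracy reaches 87.5% on CIFAR-10. We also show that this quantized encoder can be used to compress images patch by patch while the reconstruction can be performed remotely by a dedicated full-frame decoder. This solution typically enables end-to-end compression almost without any block artifacts, outperforming patch-based state-of-the-art techniques that employ a patch-constant bitrate.||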
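|**2025-01-09**|[FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching](http://arxiv.org/abs/2501.04926)|**[link](https://github.com/jjunak-yun/FLowHigh_code)**|Audio super-resolution is challenging because of its ill-posed nature. Recently, the application of diffusion models to audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the need for many sampling steps, which significantly increases latency when synthesizing high-quality audio samples. In this paper, we propose FLowHigh, a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. We also explore probability paths specially tailored for audio super-resolution, which effectively capture high-resolution audio distributions and thereby enhance reconstruction quality. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates. Experimental results on the VCTK benchmark dataset show that FLowHigh achieves state-of-the-art performance in audio super-resolution, as evaluated by log-spectral distance and ViSQOL, while maintaining computational efficiency with only a single-step sampling process.||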
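|**2025-01-08**|[Planarian Neural Networks: Evolutionary Patterns from Basic Bilateria Shaping Modern Artificial Neural Network Architectures](http://arxiv.org/abs/2501.04700)|null|This study examines the feasibility of improving the prediction accuracy of artificial neural networks (ANNs) on image classification tasks by developing ANNs with evolutionary patterns similar to those of biological neural networks. ResNet is a widely used family of neural networks with both deep and wide variants; therefore, it was selected as the base model for our investigation. The aim of this study is to improve the image classification performance of ANNs via a novel approach inspired by the biological nervous system architecture of planarians, which comprises a brain and two nerve cords. We believe that the unique neural architecture of planarians offers valuable insights into the performance enhancement of ANNs. The proposed planarian neural architecture-based neural network was evaluated on the CIFAR-10 and CIFAR-100 datasets. Our results indicate that the proposed method exhibits higher prediction accuracy than the baseline neural network models in image classification tasks. These findings demonstrate the significant potential of biologically inspired neural network architectures for improving the performance of ANNs in a wide range of applications.||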
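|**2025-01-08**|[RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark](http://arxiv.org/abs/2501.04440)|**[link](https://github.com/zhasion/rsar)**|Rotated object detection has made significant progress in optical remote sensing. However, advances in the Synthetic Aperture Radar (SAR) field lag behind, primarily because of the absence of a large-scale dataset. Annotating such a dataset is inefficient and costly. A promising solution is to employ a weakly supervised model (e.g., trained with available horizontal boxes only) to generate pseudo-rotated boxes for reference before manual calibration. Unfortunately, existing weakly supervised models exhibit limited accuracy in predicting an object's angle. Previous works attempt to enhance angle prediction by using angle resolvers that decouple angles into cosine and sine encodings. In this work, we first reevaluate these resolvers from a unified perspective of dimension mapping and reveal that they share the same shortcoming: these methods overlook the unit circle constraint inherent in these encodings, which easily leads to prediction biases. To address this issue, we propose the Unit Cycle Resolver (UCR), which incorporates a unit circle constraint loss to improve angle prediction accuracy. Our approach can effectively improve the performance of existing state-of-the-art weakly supervised methods and even surpasses fully supervised models on existing optical benchmarks (i.e., the DOTA-v1.0 dataset). With the aid of UCR, we further annotate and introduce RSAR, the largest multi-class rotated SAR object detection dataset to date. Extensive experiments on both RSAR and optical datasets demonstrate that our UCR enhances angle prediction accuracy. Our dataset and code can be found at https://github.com/zhasion/RSAR.||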
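|**2025-01-08**|[FGU3R: Fine-Grained Fusion via Unified 3D Representation for Multimodal 3D Object Detection](http://arxiv.org/abs/2501.04373)|null|Multimodal 3D object detection has attracted considerable interest in autonomous driving. However, multimodal detectors suffer from dimension mismatches that derive from coarsely fusing 3D points and 2D pixels, which leads to sub-optimal fusion performance. In this paper, we propose a multimodal framework, FGU3R, that tackles this issue through unified 3D representation and fine-grained fusion, and that consists of two important components. First, we propose an efficient feature extractor for raw and pseudo points, termed Pseudo-Raw Convolution (PRConv), which modulates multimodal features synchronously and aggregates features from different types of points at key points based on multimodal interaction. Second, we design a Cross-Attention Adaptive Fusion (CAAF) module that adaptively fuses homogeneous 3D RoI (Region of Interest) features through a cross-attention variant in a fine-grained manner. Together they perform fine-grained fusion on a unified 3D representation. Experiments conducted on the KITTI and nuScenes datasets show the effectiveness of our proposed method.||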
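|**2025-01-08**|[H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving](http://arxiv.org/abs/2501.04302)|null|With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving. However, videos in such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules, Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context at different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame into a learnable query and attentively selects multi-granularity video context into the query. Consequently, it can adaptively integrate the video contexts of all multi-scale temporal resolutions to enhance video understanding. Through a plug-and-play paradigm in MLLMs, our H-MBA shows remarkable performance on multi-modal video tasks in autonomous driving; for example, for risk object detection, it outperforms the previous SOTA method with a 5.5% mIoU improvement.||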
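|**2025-01-07**|[LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving](http://arxiv.org/abs/2501.04005)|null|Recent advances in vision foundation models (VFMs) have revolutionized 2D visual perception, but their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. This paper introduces LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. We introduce several key innovations: i) VFM-driven superpixel generation for detailed semantic representation; ii) a VFM-assisted contrastive learning strategy to align multimodal features; iii) superpoint temporal consistency to maintain stable representations across time; and iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks, for LiDAR-based segmentation as well as object detection. Extensive experiments on eleven large-scale multi-modal datasets highlight our superior performance, demonstrating adaptability, efficiency, and robustness in real-world autonomous driving scenarios.||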
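|**2025-01-07**|[Visual question answering: from early developments to recent advances -- a survey](http://arxiv.org/abs/2501.03939)|null|Visual Question Answering (VQA) is an evolving research field that aims to enable machines to answer questions about visual content by integrating image and language processing techniques such as feature extraction, object detection, text embedding, natural language understanding, and language generation. With the growth of multimodal data research, VQA has gained significant attention because of its broad applications, including interactive educational tools, medical image diagnosis, customer service, entertainment, and social media captioning. Additionally, VQA plays a vital role in assisting visually impaired individuals by generating descriptive content from images. This survey introduces a taxonomy of VQA architectures, categorizing them based on design choices and key components to facilitate comparative analysis and evaluation. We review major VQA approaches, focusing on deep learning-based methods, and explore the emerging field of Large Visual Language Models (LVLMs) that have demonstrated success in multimodal tasks such as VQA. The paper further examines available datasets and the evaluation metrics essential for measuring VQA system performance, followed by an exploration of real-world VQA applications. Finally, we highlight ongoing challenges and future directions in VQA research, presenting open questions and potential areas for development. This survey serves as a comprehensive resource for researchers and practitioners interested in the latest advancements and future directions.||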
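|**2025-01-07**|[Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback](http://arxiv.org/abs/2501.03916)|null|The scientific research paradigm is undergoing a profound transformation owing to the development of Artificial Intelligence (AI). Recent work shows that various AI-assisted research methods can largely improve research efficiency by improving data analysis, accelerating computation, and fostering novel idea generation. To move further toward the ultimate goal (i.e., automatic scientific research), we propose Dolphin, the first closed-loop open-ended auto-research framework, to further build the entire process of human scientific research. Dolphin can generate research ideas, perform experiments, and obtain feedback from experimental results to generate higher-quality ideas. More specifically, Dolphin first generates novel ideas based on relevant papers that are ranked by topic and task attributes. Then, code is automatically generated and debugged with an exception-traceback-guided local code structure. Finally, Dolphin automatically analyzes the results of each idea and feeds the results back to the next round of idea generation. Experiments are conducted on benchmark datasets of different topics, and the results show that Dolphin can generate novel ideas continuously and complete experiments in a loop. We highlight that Dolphin can automatically propose methods that are comparable to the state of the art in some tasks, such as 2D image classification and 3D point cloud classification.||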
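|**2025-01-07**|[Neuromorphic Optical Tracking and Imaging of Randomly Moving Targets through Strongly Scattering Media](http://arxiv.org/abs/2501.03874)|null|Tracking randomly moving targets obscured by scattering media and simultaneously acquiring optical images of them remains a challenging problem in many applications that require precise object localization and identification. In this work, we develop an end-to-end neuromorphic optical engineering and computational approach to demonstrate how to track and image normally invisible objects by combining an event camera with a multistage neuromorphic deep learning strategy. Photons emerging from dense scattering media are detected by the event camera and converted to pixel-wise asynchronous spike trains, a first step in isolating object-specific information from the dominant uninformative background. The spiking data is fed into a deep spiking neural network (SNN) engine in which object tracking and image reconstruction are performed by two separate yet interconnected modules running in parallel in discrete time steps over the event duration. Through benchtop experiments, we demonstrate tracking and imaging of randomly moving objects in dense turbid media, as well as image reconstruction of spatially stationary but optically dynamic objects. Standardized character sets serve as representative proxies for geometrically complex objects, underscoring the method's generality. The results highlight the advantages of a fully neuromorphic approach in meeting a major imaging technology need with high computational efficiency and low power consumption.||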
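|**2025-01-07**|[MedFocusCLIP : Improving few shot classification in medical datasets using pixel wise attention](http://arxiv.org/abs/2501.03839)|null|With the popularity of foundation models, parameter-efficient fine-tuning has become the de facto approach for leveraging pretrained models to perform downstream tasks. Taking inspiration from recent advances in large language models, visual prompt tuning, and similar techniques, we learn an additional prompt to efficiently fine-tune a pretrained vision foundation model. However, we observe that such prompting is insufficient for fine-grained visual classification tasks such as medical image classification, where there is large inter-class variance and small intra-class variance. Hence, in this paper we propose to leverage the advanced segmentation capabilities of the Segment Anything Model 2 (SAM2) as a visual prompting cue to help the visual encoder in CLIP (Contrastive Language-Image Pretraining) by guiding its attention to relevant regions in the image. This helps the model focus on highly discriminative regions without getting distracted by visually similar background features, an essential requirement in a few-shot, fine-grained classification setting. We evaluate our method on diverse medical datasets, including X-rays, CT scans, and MRI images, and report accuracies of (71%, 81%, 86%, 58%) on the (COVID, lung-disease, brain-tumor, breast-cancer) datasets respectively, against (66%, 70%, 68%, 29%) for a pretrained CLIP model after few-shot training. The proposed approach also allows an interpretable explanation of the classification performance to be obtained using the localization from segmentation.||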
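|**2025-01-07**|[SCC-YOLO: An Improved Object Detector for Assisting in Brain Tumor Diagnosis](http://arxiv.org/abs/2501.03836)|null|Brain tumors can result in neurological dysfunction, alterations in cognitive and psychological states, increased intracranial pressure, and seizures, thereby posing a substantial risk to human life and health. The You Only Look Once (YOLO) series of models has demonstrated superior accuracy in object detection for medical imaging. In this paper, we develop a novel SCC-YOLO architecture by integrating the SCConv attention mechanism into YOLOv9. The SCConv module reconstructs an efficient convolutional module by reducing spatial and channel redundancy among features, thereby enhancing the learning of image features. We investigate the impact of integrating different attention mechanisms with the YOLOv9 model on brain tumor image detection, using both the Br35H dataset and our self-made dataset (Brain_Tumor_Dataset). Experimental results show that on the Br35H dataset, SCC-YOLO achieved a 0.3% improvement in mAP50 compared with YOLOv9, while on our self-made dataset, SCC-YOLO exhibited a 0.5% improvement over YOLOv9. SCC-YOLO has reached state-of-the-art performance in brain tumor detection. Source code is available at: https://jihulab.com/healthcare-information-studio/SCC-YOLO/-/tree/master||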
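|**2025-01-07**|[AuxDepthNet: Real-Time Monocular 3D Object Detection with Depth-Sensitive Features](http://arxiv.org/abs/2501.03700)|null|Monocular 3D object detection is a challenging task in autonomous driving systems because single-view images lack explicit depth information. Existing methods often depend on external depth estimators or expensive sensors, which increases computational complexity and hinders real-time performance. To overcome these limitations, we propose AuxDepthNet, an efficient framework for real-time monocular 3D object detection that eliminates the reliance on external depth maps or pre-trained depth models. AuxDepthNet introduces two key components: the Auxiliary Depth Feature (ADF) module, which implicitly learns depth-sensitive features to improve spatial reasoning and computational efficiency, and the Depth Position Mapping (DPM) module, which embeds depth positional information directly into the detection process to enable accurate object localization and 3D bounding box regression. Leveraging the DepthFusion Transformer architecture, AuxDepthNet globally integrates visual and depth-sensitive features through depth-guided interactions, ensuring robust and efficient detection. Extensive experiments on the KITTI dataset show that AuxDepthNet achieves state-of-the-art performance, with $\text{AP}_{3D}$ scores of 24.72% (Easy), 18.63% (Moderate), and 15.31% (Hard), and $\text{AP}_{\text{BEV}}$ scores of 34.11% (Easy), 25.18% (Moderate), and 21.90% (Hard) at an IoU threshold of 0.7.||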
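|**2025-01-06**|[FTA-FTL: A Fine-Tuned Aggregation Federated Transfer Learning Scheme for Lithology Microscopic Image Classification](http://arxiv.org/abs/2501.03349)|**[link](https://github.com/ahmadtaheri2021/lithology-microscopic-images-mini-dataset)**|Lithology discrimination is a crucial activity in characterizing oil reservoirs, and processing lithology microscopic images is an essential technique for investigating fossils and minerals and for geological assessment of shale oil exploration. In this context, deep learning (DL) is a powerful approach for building robust classifier models. However, collecting and producing large datasets remains a considerable challenge. Transfer learning and data augmentation techniques have emerged as popular approaches to tackle this problem. Furthermore, for various reasons, especially data privacy, individuals, organizations, and industrial companies are often unwilling to share their sensitive data and information. Federated Learning (FL) aims to train a highly accurate central model across multiple decentralized edge servers without transferring sensitive data, preserving sensitive data and enhancing security. This study involves two phases. The first phase performs lithology microscopic image classification on a small dataset using transfer learning; to this end, various pre-trained DL model architectures are comprehensively compared for the classification task. In the second phase, we formulate the classification task as a Federated Transfer Learning (FTL) scheme and propose a Fine-Tuned Aggregation strategy for Federated Learning (FTA-FTL). For a comprehensive experimental study, several metrics, such as accuracy, F1 score, precision, specificity, sensitivity (recall), and the confusion matrix, are considered. The results are in excellent agreement and confirm the efficiency of the proposed scheme, showing that the proposed FTA-FTL algorithm can achieve approximately the same results as the centralized implementation for the lithology microscopic image classification task.||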
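|**2025-01-06**|[Plant Leaf Disease Detection and Classification Using Deep Learning: A Review and A Proposed System on Bangladesh's Perspective](http://arxiv.org/abs/2501.03305)|null|Agriculture is crucial to employment, GDP contribution, and livelihood for the people of Bangladesh. It plays a major role in reducing poverty and ensuring food security. Plant diseases are a serious obstacle to agricultural production in Bangladesh. Sometimes the human eye cannot detect a disease from the infected leaf. Applying inorganic chemicals or pesticides to plants when it is too late is usually futile and wastes all the earlier labor. Deep learning techniques for leaf-based image classification have shown impressive results and can make the work of recognizing and classifying all diseases easier and more precise. In this paper, we primarily propose a better model for detecting leaf diseases. Our work includes collecting data on three different crops: bell pepper, tomato, and potato. To train and test the proposed CNN model, a plant leaf disease dataset collected from Kaggle, containing 17,430 images, is used. The images are labeled with 14 separate classes of damage. The developed CNN model performs efficiently and can successfully detect and classify the tested diseases. The proposed CNN model may have great potential in crop disease management.||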
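|**2025-01-06**|[ImageMM: Joint multi-frame image restoration and super-resolution](http://arxiv.org/abs/2501.03002)|null|A key processing step in ground-based astronomy is combining multiple noisy and blurry exposures to produce an image of the night sky with improved signal-to-noise ratio. Typically, this is achieved via image coaddition, and it can be done in a way that enhances the spatial resolution of the final night-sky image. However, this task remains a formidable challenge despite decades of advancement. In this paper, we introduce ImageMM: a new framework based on the majorization-minimization algorithm for joint multi-frame astronomical image restoration and super-resolution. ImageMM uses multiple registered astronomical exposures to produce a nonparametric latent image of the night sky, i.e., the image as it was before the atmosphere affected the observed exposures. Our framework also features a novel variational approach to compute refined point-spread functions of arbitrary resolution for the restoration and super-resolution procedure. Our algorithms, implemented in TensorFlow, leverage graphics processing unit acceleration to produce latent images in near real time, even when processing high-resolution exposures. We tested ImageMM on Hyper Suprime-Cam (HSC) exposures, which are a precursor of the upcoming imaging data from the Rubin Observatory. The results are encouraging: ImageMM produces sharp latent images in which the spatial features of bright sources are revealed in unprecedented detail (e.g., the structure of spiral galaxies), and faint sources that are usually indistinguishable from the noisy sky background also become discernible, thus pushing the detection limits. Moreover, aperture photometry performed on the HSC pipeline coadd and on ImageMM's latent images yielded consistent source detection and flux measurements, demonstrating ImageMM's suitability for cutting-edge photometric studies with state-of-the-art astronomical imaging data.||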
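|**2025-01-03**|[Dual Mutual Learning Network with Global-local Awareness for RGB-D Salient Object Detection](http://arxiv.org/abs/2501.01648)|**[link](https://github.com/kingkung2016/gl-dmnet)**|RGB-D salient object detection (SOD), which aims to highlight prominent regions of a given scene by jointly modeling RGB and depth information, is a challenging pixel-level prediction task. Recently, dual-attention mechanisms have been widely applied in this area because of their ability to strengthen the detection process. However, most existing methods directly fuse attentional cross-modality features under a manually mandated fusion paradigm without considering the inherent discrepancy between RGB and depth, which may lead to reduced performance. Moreover, the long-range dependencies derived from global and local information make it difficult to apply a unified, efficient fusion strategy. Hence, in this paper, we propose GL-DMNet, a novel dual mutual learning network with global-local awareness. Specifically, we present a position mutual fusion module and a channel mutual fusion module to exploit the interdependencies among different modalities in the spatial and channel dimensions. In addition, we adopt an efficient decoder based on cascade transformer-infused reconstruction to integrate multi-level fusion features jointly. Extensive experiments on six benchmark datasets show that our proposed GL-DMNet outperforms 24 RGB-D SOD methods, achieving an average improvement of about 3% across four evaluation metrics compared with the second-best model (S3Net). Code and results are available at https://github.com/kingkung2016/GL-DMNet.||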
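|**2025-01-02**|[Embedding Similarity Guided License Plate Super Resolution](http://arxiv.org/abs/2501.01483)|null|Super-resolution (SR) techniques play a pivotal role in enhancing the quality of low-resolution images, particularly in applications such as security and surveillance, where accurate license plate recognition is crucial. This study proposes a novel framework that combines pixel-based loss with embedding similarity learning to address the unique challenges of license plate super-resolution (LPSR). The introduced pixel and embedding consistency loss (PECL) integrates a Siamese network and applies contrastive loss to enforce embedding similarity, improving perceptual and structural fidelity. By effectively balancing pixel-wise accuracy with embedding-level consistency, the framework achieves superior alignment of fine-grained features between high-resolution (HR) and super-resolved (SR) license plates. Extensive experiments on the CCPD dataset validate the efficacy of the proposed framework, demonstrating consistent improvements over state-of-the-art methods in PSNR_RGB, PSNR_Y, and optical character recognition (OCR) accuracy. These results highlight the potential of embedding similarity learning to advance both perceptual quality and task-specific performance in extreme super-resolution scenarios.||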
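|**2025-01-02**|[A Multi-task Supervised Compression Model for Split Computing](http://arxiv.org/abs/2501.01420)|**[link](https://github.com/yoshitomo-matsubara/ladon-multi-task-sc2)**|Split computing ( $\neq$ split learning) is a promising approach to deep learning models for resource-constrained edge computing systems, where weak sensor (mobile) devices are wirelessly connected to stronger edge servers through channels with limited communication capacity. State-of-the-art work on split computing presents methods for single tasks such as image classification, object detection, or semantic segmentation. Applying existing methods to multi-task problems degrades model accuracy and/or significantly increases runtime latency. In this study, we propose Ladon, the first multi-task-head supervised compression model for multi-task split computing. Experimental results show that the multi-task supervised compression model either outperforms or rivals strong lightweight baseline models in terms of predictive performance on the ILSVRC 2012, COCO 2017, and PASCAL VOC 2012 datasets while learning compressed representations at its early layers. Furthermore, our models reduce end-to-end latency (by up to 95.4%) and energy consumption of mobile devices (by up to 88.2%) in multi-task split computing scenarios.||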
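|**2025-01-02**|[HybridTrack: A Hybrid Approach for Robust Multi-Object Tracking](http://arxiv.org/abs/2501.01275)|**[link](https://github.com/leandro-svg/hybridtrack)**|The evolution of Advanced Driver Assistance Systems (ADAS) has increased the need for robust and generalizable algorithms for multi-object tracking. Traditional statistical model-based tracking methods rely on predefined motion models and assumptions about system noise distributions. Although computationally efficient, they often lack adaptability to varying traffic scenarios and require extensive manual design and parameter tuning. To address these issues, we propose HybridTrack, a novel 3D multi-object tracking approach for vehicles that integrates a data-driven Kalman Filter (KF) within a tracking-by-detection paradigm. In particular, it learns the transition residual and Kalman gain directly from data, which eliminates the need for manual motion and stochastic parameter modeling. Validated on the real-world KITTI dataset, HybridTrack achieves 82.08% HOTA accuracy, significantly outperforming state-of-the-art methods. We also evaluate our method under different configurations, achieving the fastest processing speed of 112 FPS. Consequently, HybridTrack eliminates the dependency on scene-specific designs while improving performance and maintaining real-time efficiency. The code will be made publicly available at the time of publication: https://github.com/leandro-svg/HybridTrack.git.||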
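|**2025-01-02**|[Sensitivity of Room Impulse Responses in Changing Acoustic Environment](http://arxiv.org/abs/2501.01206)|null|Changes in room acoustics, such as modifications to surface absorption or the insertion of a scattering object, significantly affect measured room impulse responses (RIRs). These changes can affect the performance of echo cancellation and active acoustic systems, as well as supporting tasks such as navigation and object tracking. Recognizing and quantifying such changes is therefore essential for advancing technologies based on room acoustics. This study introduces a method for analyzing acoustic environment changes by evaluating the similarity of consecutively recorded RIRs. Short-time coherence is used to characterize various modifications, including changes in wall absorption or the presence of a moving person in the room. A sensitivity rating is further used to quantify the magnitude of these changes. The results clearly distinguish between different types of modification: atmospheric variation, changes in absorption, and human presence. The described method offers a novel way of analyzing and interpreting room acoustics, emphasizing RIR similarity and extracting information from temporal and spectral signal properties.|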
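|**2025-01-02**|[MSC-Bench: Benchmarking and Analyzing Multi-Sensor Corruption for Driving Perception](http://arxiv.org/abs/2501.01037)|null|Multi-sensor fusion models play a crucial role in autonomous driving perception, particularly in tasks such as 3D object detection and HD map construction. These models provide essential and comprehensive static environmental information for autonomous driving systems. While camera-LiDAR fusion methods have achieved remarkable results by integrating data from both modalities, they often depend on complete sensor inputs. This reliance can lead to low robustness and potential failures when sensors are corrupted or missing, raising significant safety concerns. To address this challenge, we introduce the Multi-Sensor Corruption Benchmark (MSC-Bench), the first comprehensive benchmark designed to evaluate the robustness of multi-sensor autonomous driving perception models against various sensor corruptions. Our benchmark includes 16 combined corruption types that disrupt camera and LiDAR inputs, either individually or simultaneously. Extensive evaluations of six 3D object detection models and four HD map construction models reveal substantial performance degradation under adverse weather conditions and sensor failures, underscoring critical safety issues. The benchmark toolkit and associated code and model checkpoints have been made publicly available.|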
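|**2025-01-01**|[A Novel Approach using CapsNet and Deep Belief Network for Detection and Identification of Oral Leukopenia](http://arxiv.org/abs/2501.00876)|null|Oral cancer is a major global health concern that caused 277,484 deaths in 2023, with the highest incidence in low- and middle-income countries. Automating the detection of potentially malignant and malignant oral lesions could enable cost-effective early diagnosis. Building an extensive, carefully annotated database of oral lesions is essential. In this study, photographs were collected from clinical experts worldwide, who were provided with an annotation tool to generate comprehensive labels. The study proposes a novel method for combining bounding-box annotations from different clinicians. In addition, a Deep Belief Network is combined with CapsNet to develop an automated system that extracts complex patterns for this challenging problem. The study evaluates two deep-learning-based computer vision approaches for the automatic detection and classification of oral lesions to support early detection of oral cancer: image classification using CapsNet, and object detection. Image classification achieved an F1 score of 94.23% for detecting photographs that contain lesions and 93.46% for identifying images that require referral. Object detection achieved an F1 score of 89.34% for identifying lesions that require referral. Classification performance by type of referral decision is also reported. Our preliminary findings show that deep learning can address this complex problem.|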
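|**2025-01-01**|[NMM-HRI: Natural Multi-modal Human-Robot Interaction with Voice and Deictic Posture via Large Language Model](http://arxiv.org/abs/2501.00785)|null|Translating human intent into robot commands is crucial for the future of service robots in an aging society. Existing human-robot interaction (HRI) systems that rely on gestures or verbal commands are impractical for the elderly, who struggle to master complex syntax or sign language. To address this challenge, this paper introduces a multi-modal interaction framework that combines voice and deictic posture information to create a more natural HRI system. Visual cues are first processed by an object detection model to gain a global understanding of the environment, and bounding boxes are then estimated from depth information. A large language model (LLM) processes the speech-to-text commands together with the temporally aligned selected bounding boxes to generate robot action sequences, while key control syntax constraints are applied to avoid potential LLM hallucinations. The system is evaluated on real-world tasks of varying complexity using a Universal Robots UR3e manipulator. Our method shows significantly better HRI performance in terms of accuracy and robustness. To benefit the research community and the general public, we will open-source our code and design.|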
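|**2025-01-01**|[Less is More: Token Context-aware Learning for Object Tracking](http://arxiv.org/abs/2501.00758)|**[link](https://github.com/XuChenLong/LMTrack)**|Several recent studies have shown that using contextual information to perceive target states is crucial for object tracking. They typically capture context by incorporating multiple video frames. However, these naive frame-context methods fail to consider the importance of each patch within a reference frame, which makes them susceptible to noise and redundant tokens and degrades tracking performance. To address this challenge, we propose LMTrack, a new token context-aware tracking pipeline designed to automatically learn high-quality reference tokens for efficient visual tracking. Following the principle of "less is more", the core idea of LMTrack is to analyze the importance distribution of all reference tokens, and to collect, continually attend to, and update the important ones. Specifically, we design a novel Token Context Memory module that dynamically collects high-quality spatio-temporal information about the target in an autoregressive manner, eliminating redundant background tokens from the reference frames. In addition, we design an effective unidirectional token attention mechanism to establish dependencies between reference tokens and the search frame, enabling robust cross-frame association and target localization. Extensive experiments demonstrate the superiority of our tracker, which achieves state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.|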
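|**2025-01-01**|[Ensuring superior learning outcomes and data security for authorized learner](http://arxiv.org/abs/2501.00754)|null|In machine learning, it is essential that the learner generates a hypothesis close to the target function. Achieving this requires sufficient data; however, unauthorized access by an eavesdropping learner creates security risks. It is therefore important to guarantee the performance of the "authorized" learner by limiting the quality of the training data accessible to eavesdroppers. Unlike previous studies that focus on encryption or access control, we provide a theorem that ensures superior learning outcomes exclusively for the authorized learner through quantum label encoding. In this context, we use the probably-approximately-correct (PAC) learning framework and introduce the concept of learning probability to quantitatively assess learner performance. Our theorem allows for a condition under which, given a training dataset, the authorized learner is guaranteed to achieve a certain quality of learning outcome while eavesdroppers are not. Notably, this condition is constructed solely from quantities of the training data that the authorized learner can measure, namely its size and noise level. We validate our theoretical proofs and predictions through image classification learning with convolutional neural networks (CNNs).|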
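|**2024-12-30**|[Uncertainty-Aware Out-of-Distribution Detection with Gaussian Processes](http://arxiv.org/abs/2412.20918)|null|Deep neural networks (DNNs) are often constructed under the closed-world assumption, which may cause them to fail to generalize to out-of-distribution (OOD) data. This leads DNNs to produce overconfident wrong predictions and can have disastrous consequences in safety-critical applications. Existing OOD detection methods mainly rely on curating a set of OOD data for model training or hyper-parameter tuning in order to distinguish OOD data from training data (also known as in-distribution, or InD, data). In real-world applications, however, OOD samples are not always available during training, which hinders OOD detection accuracy. To overcome this limitation, we propose a Gaussian-process-based OOD detection method that establishes a decision boundary using only InD data. The basic idea is to quantify the uncertainty of the DNN's unconstrained softmax scores with a multi-class Gaussian process (GP), and then to define a score function that separates InD from potential OOD data based on their fundamental differences in the GP posterior predictive distribution. Two case studies, on conventional image classification datasets and on real-world image datasets, demonstrate that the proposed method outperforms state-of-the-art OOD detection methods when no OOD samples are observed during training.|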
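|**2024-12-30**|[Humanoid Robot RHP Friends: Seamless Combination of Autonomous and Teleoperated Tasks in a Nursing Context](http://arxiv.org/abs/2412.20770)|null|This paper describes RHP Friends, a social humanoid robot developed to enable assistive robotic deployments in human-coexisting environments. As a use-case application, we present its potential use in nursing by extending its capabilities to operate human devices and tools according to the task and by enabling remote assistance operations. To meet the wide variety of tasks and situations in environments designed for humans, we developed a system that seamlessly integrates a slim and lightweight robot with several technologies: locomanipulation, multi-contact motion, teleoperation, and object detection and tracking. We demonstrated the use of this system in a nursing application. The robot efficiently performed the daily task of patient transfer and a non-routine task, represented by a request to operate a circuit breaker. The demonstration was held at the 2023 International Robot Exhibition (IREX), three times a day over three days.|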
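|**2024-12-30**|[Open-Set Object Detection By Aligning Known Class Representations](http://arxiv.org/abs/2412.20701)|null|Open-Set Object Detection (OSOD) has emerged as a contemporary research direction for detecting unknown objects. Recently, a few works have achieved remarkable performance on the OSOD task by employing contrastive clustering to separate unknown classes. In contrast, we propose a new semantic-clustering-based approach that facilitates meaningful alignment of clusters in semantic space, and we introduce a class decorrelation module to enhance inter-cluster separation. Our approach further incorporates an object focus module that predicts objectness scores, which improves the detection of unknown objects. In addition, we employ (i) an evaluation technique that penalizes low-confidence outputs to reduce the risk of misclassifying unknown objects, and (ii) a new metric called HMP that combines known and unknown precision using the harmonic mean. Our extensive experiments show that the proposed model achieves significant improvements on the MS-COCO and PASCAL VOC datasets for the OSOD task.|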
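|**2024-12-30**|[YOLO-UniOW: Efficient Universal Open-World Object Detection](http://arxiv.org/abs/2412.20645)|**[link](https://github.com/thu-mig/yolo-uniow)**|Traditional object detection models are constrained by closed-set datasets and can detect only the categories encountered during training. While multimodal models have extended category recognition by aligning text and image modalities, they introduce significant inference overhead because of cross-modality fusion and remain restricted by predefined vocabularies, which makes them ineffective at handling unknown objects in open-world scenarios. In this work, we introduce Universal Open-World Object Detection (Uni-OWD), a new paradigm that unifies open-vocabulary and open-world object detection tasks. To address the challenges of this setting, we propose YOLO-UniOW, a novel model that pushes the boundaries of efficiency, versatility, and performance. YOLO-UniOW incorporates Adaptive Decision Learning, replacing computationally expensive cross-modality fusion with lightweight alignment in the CLIP latent space to achieve efficient detection without compromising generalization. In addition, we design a Wildcard Learning strategy that detects out-of-distribution objects as "unknown" while supporting dynamic vocabulary expansion without incremental learning. This design allows YOLO-UniOW to adapt seamlessly to new categories in open-world environments. Extensive experiments validate the superiority of YOLO-UniOW, which achieves 34.6 AP and 30.0 APr on LVIS at an inference speed of 69.6 FPS. The model also sets benchmarks on the M-OWODB, S-OWODB, and nuScenes datasets, demonstrating its strong performance in open-world object detection. Code and models are available at https://github.com/THU-MIG/YOLO-UniOW.|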
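|**2024-12-29**|[Hilbert Curve Based Molecular Sequence Analysis](http://arxiv.org/abs/2412.20616)|null|Accurate molecular sequence analysis is a key task in bioinformatics. To apply molecular sequence classification algorithms, we first need to generate appropriate representations of the sequences. Traditional numeric sequence representation techniques are mostly based on sequence alignment, which has limitations in accuracy. Although several alignment-free techniques have also been introduced, their tabular data form yields lower performance with deep learning (DL) models than image-based data does. To address this issue, and to let DL models reach their full potential while capturing important spatial information in the sequence data, we propose a universal Hilbert-curve-based Chaos Game Representation (CGR) method. The method is a transformation function that includes a novel alphabetic index mapping technique for constructing Hilbert-curve-based image representations from molecular sequences. Our method can be applied globally to any type of molecular sequence data. The Hilbert-curve-based image representations can be used as input to sophisticated vision DL models for sequence classification. The proposed method shows promising results: when tested with a CNN model on a lung cancer dataset, it reaches an accuracy of 94.5% and an F1 score of 93.9%, outperforming current state-of-the-art methods. This approach opens a new horizon for exploring molecular sequence analysis with image classification methods.|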
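|**2024-12-29**|[Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)](http://arxiv.org/abs/2412.20596)|**[link](https://github.com/tirer-lab/cm4ir)**|In recent years, it has become popular to tackle image restoration tasks with a single pretrained diffusion model (DM) and data-fidelity guidance, instead of training a dedicated deep neural network per task. However, such "zero-shot" restoration schemes currently require many neural function evaluations (NFEs) to perform well, which may be attributed to the many NFEs needed by the generative functionality of the DMs themselves. Recently, faster DM variants for image generation have been explored, including Consistency Models (CMs), which can generate samples with a few NFEs. Yet existing work that uses guided CMs for restoration still requires tens of NFEs or per-task fine-tuning of the model, which degrades performance if the assumptions made during fine-tuning are inaccurate. In this paper, we propose a zero-shot restoration scheme that uses CMs and performs well with as few as 4 NFEs. It is based on a careful combination of several ingredients: better initialization, back-projection guidance, and, most importantly, a novel noise injection mechanism. We demonstrate the advantages of our approach for image super-resolution, deblurring, and inpainting. Interestingly, we find that the noise injection technique is useful beyond CMs: it can also mitigate the performance degradation of existing guided DM methods when the number of NFEs is reduced.|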
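|**2024-12-29**|[A Novel FPGA-based CNN Hardware Accelerator: Optimization for Convolutional Layers using Karatsuba Ofman Multiplier](http://arxiv.org/abs/2412.20393)|null|This paper proposes a new architecture for a CNN hardware accelerator. Convolutional neural networks (CNNs) are a class of neural networks that have shown excellent performance in a variety of computer vision applications, including object detection and image classification. Convolution, the fundamental building block of a CNN, is a mathematical operation that multiplies, shifts, and adds a set of input values with a set of learnable parameters called filters or kernels. The Karatsuba Ofman multiplier is known for performing high-speed multiplication with fewer hardware resources than conventional multipliers. This paper examines the use of the Karatsuba Ofman multiplier on FPGAs in the prominent CNN designs AlexNet, VGG16, and VGG19.|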
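|**2024-12-29**|[Differential Evolution Integrated Hybrid Deep Learning Model for Object Detection in Pre-made Dishes](http://arxiv.org/abs/2412.20370)|null|With rising living standards and a fast-paced working environment, pre-made dishes are increasingly popular with families and restaurants because they save time, are convenient, varied, and cost-effective, and meet quality standards. Object detection is a key technology for selecting ingredients and evaluating dish quality in the pre-made dishes industry. Many object detection methods have been proposed to date. However, accurate object detection for pre-made dishes is extremely difficult because of overlapping occlusion of ingredients, similarity between ingredients, and insufficient light in the processing environment. The recognition scenes are therefore relatively complex, and a single model detects objects poorly. To address this issue, this paper proposes a Differential Evolution Integrated Hybrid Deep Learning (DEIHDL) model. The main idea of DEIHDL is threefold: 1) three base models based on YOLO and Transformer are developed separately to increase diversity in detecting objects in pre-made dishes; 2) the three base models are integrated with self-adjusting weights optimized by differential evolution; and 3) a weighted boxes fusion strategy is used to score the confidence of the three base models during integration. DEIHDL therefore inherits the diverse strengths of the three base models and achieves precise object detection in complex pre-made dish scenes. Extensive experiments on real datasets show that the proposed DEIHDL model significantly outperforms the base models in detecting objects in pre-made dishes.|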
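|**2024-12-29**|[Deep Learning in Image Classification: Evaluating VGG19's Performance on Complex Visual Data](http://arxiv.org/abs/2412.20345)|null|This study explores automatic classification of pneumonia X-ray images based on the VGG19 deep convolutional neural network, and evaluates its use in pneumonia diagnosis by comparing it with classic models such as SVM, XGBoost, MLP, and ResNet50. The experimental results show that VGG19 performs well on multiple metrics, including accuracy (92%), AUC (0.95), F1 score (0.90), and recall (0.87), and outperforms the other models, especially in image feature extraction and classification accuracy. ResNet50 performs well on some metrics but is slightly behind VGG19 in recall and F1 score. The traditional machine learning models SVM and XGBoost are clearly limited in image classification, especially in complex medical image analysis, and their performance is relatively mediocre. The results indicate that deep learning, and convolutional neural networks in particular, has significant advantages in medical image classification, especially pneumonia X-ray analysis, and can provide efficient and accurate support for automatic diagnosis. This study provides strong technical support for early detection of pneumonia and for the development of automated diagnostic systems, and lays a foundation for further application and development of automated medical image processing.|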
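|**2024-12-28**|[Few-shot Algorithm Assurance](http://arxiv.org/abs/2412.20275)|null|In image classification tasks, deep learning models are vulnerable to image distortion. For successful deployment, it is important to identify the distortion levels under which the model remains usable, that is, under which its accuracy stays above a stipulated threshold. We refer to this problem as Model Assurance under Image Distortion and formulate it as a classification task. Given a distortion level, our goal is to predict whether the model's accuracy on the set of distorted images is greater than the threshold. We propose a novel classifier based on a Level Set Estimation (LSE) algorithm, which uses the mean and variance functions of the LSE to form the classification rule. We further extend our method to a "few-shot" setting in which only a few real images are available for the model assurance process. Our idea is to generate additional synthetic images using a novel Conditional Variational Autoencoder model with two new loss functions. Extensive experiments show that our classification method significantly outperforms strong baselines on five benchmark image datasets.|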
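|**2024-12-27**|[Asymmetrical Reciprocity-based Federated Learning for Resolving Disparities in Medical Diagnosis](http://arxiv.org/abs/2412.19654)|**[link](https://github.com/JackqqWang/fedhelp)**|Geographic health disparities pose a pressing global challenge, particularly in underserved regions of low- and middle-income nations. Addressing this issue requires a collaborative approach that improves healthcare quality and draws on support from medically more developed regions. Federated learning has emerged as a promising tool for this purpose. However, the scarcity of medical data and limited computational resources in underserved regions make collaborative training of powerful machine learning models challenging. Furthermore, there is an asymmetrical reciprocity between underserved and developed regions. To overcome these challenges, we propose FedHelp, a novel cross-silo federated learning framework that aims to alleviate geographic health disparities and strengthen the diagnostic capabilities of underserved regions. Specifically, FedHelp leverages foundation model knowledge through one-time API access to guide the learning process of underserved small clients, addressing the challenge of insufficient data. In addition, we introduce a novel asymmetric dual knowledge distillation module to manage asymmetric reciprocity, facilitating the exchange of necessary knowledge between developed large clients and underserved small clients. We validate the effectiveness and utility of FedHelp through extensive experiments on medical image classification and segmentation tasks. The results show significant performance improvements over state-of-the-art baselines, particularly benefiting clients in underserved regions.|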
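|**2024-12-27**|[Chimera: A Block-Based Neural Architecture Search Framework for Event-Based Object Detection](http://arxiv.org/abs/2412.19646)|null|Event-based cameras are sensors that mimic the functioning of the human eye, offering advantages such as high-speed robustness and low power consumption. Established deep learning techniques have proven effective at processing event data. Chimera is a block-based neural architecture search (NAS) framework designed specifically for event-based object detection, with the aim of providing a systematic way to adapt RGB-domain processing methods to the event domain. The Chimera design space is built from various macroblocks, including attention blocks, convolutions, state space models, and MLP-mixer-based architectures, which offer a valuable trade-off between local and global processing capabilities at varying levels of complexity. Results on the PErson Detection in Robotics (PEDRo) dataset show performance comparable to leading state-of-the-art models, with an average parameter reduction of 1.6 times.|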
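|**2024-12-27**|[Structural Similarity in Deep Features: Image Quality Assessment Robust to Geometrically Disparate Reference](http://arxiv.org/abs/2412.19553)|null|Image Quality Assessment (IQA) with references plays an important role in optimizing and evaluating computer vision tasks. Traditional methods assume that all pixels of the reference and test images are fully aligned. Such Aligned-Reference IQA (AR-IQA) approaches fail to address many real-world problems in which various geometric deformations exist between the two images. Although significant effort has been made to tackle Geometrically-Disparate-Reference IQA (GDR-IQA), the solutions have been task-dependent, for example through dedicated designs for image super-resolution and retargeting, or by assuming that geometric distortions are small enough to be countered by translation-robust filters or explicit image registration. This paper rethinks the problem and proposes a unified, training-free Deep Structural Similarity (DeepSSIM) approach that addresses these problems in a single framework. It assesses the structural similarity of deep features in a simple but effective way and uses an attention calibration strategy to alleviate attention deviation. Without application-specific design, the proposed method achieves state-of-the-art performance on AR-IQA datasets while showing strong robustness across various GDR-IQA test cases. Interestingly, our tests also show that DeepSSIM is effective as an optimization tool for training image super-resolution, enhancement, and restoration, which suggests broader generalizability.|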
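|**2024-12-27**|[Multi-label Classification using Deep Multi-order Context-aware Kernel Networks](http://arxiv.org/abs/2412.19491)|null|Multi-label classification is a challenging task in pattern recognition. Many deep learning methods have been proposed and have greatly improved classification performance. However, most existing sophisticated methods ignore context during model learning. Because context can provide additional cues to the learned model, it may significantly boost classification performance. In this work, we make full use of context (that is, the geometric structure of images) to learn better context-aware similarities (also known as kernels) between images. We reformulate context-aware kernel design as a feed-forward network that outputs explicit kernel mapping features. The resulting context-aware kernel network further exploits multiple orders of patch neighbors within different distances, yielding a more discriminative Deep Multi-order Context-aware Kernel Network (DMCKN) for multi-label classification. We evaluate the proposed method on the challenging Corel5K and NUS-WIDE benchmarks. The empirical results show that our method is competitive with related state-of-the-art approaches, and both quantitative and qualitative results confirm its effectiveness for multi-label image classification.|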
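|**2024-12-27**|[Optimizing Helmet Detection with Hybrid YOLO Pipelines: A Detailed Analysis](http://arxiv.org/abs/2412.19467)|null|Helmet detection is crucial for improving safety in public road traffic. The problem can be cast as an object detection task. This paper therefore compares recent You Only Look Once (YOLO) models for helmet detection in terms of reliability and computational load. Specifically, YOLOv8, YOLOv9, and the newly released YOLOv11 are used. In addition, the paper proposes a modified architectural pipeline that substantially improves overall performance. This hybridized YOLO model (h-YOLO) is compared against the standalone models, and the analysis shows that h-YOLO outperforms plain YOLO models for helmet detection. The models are tested with a range of standard object detection metrics such as recall, precision, and mAP (mean Average Precision). Training and testing times are also recorded to characterize the models' overall suitability for real-time detection scenarios.|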
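|**2024-12-27**|[An In-Depth Analysis of Adversarial Discriminative Domain Adaptation for Digit Classification](http://arxiv.org/abs/2412.19391)|**[link](https://github.com/eugenechoi2004/cos429_final)**|Domain adaptation is an active area of research, driven by the growing demand for robust machine learning models that perform well on real-world data. Adversarial learning for deep neural networks (DNNs) has emerged as a promising approach to improving generalization, particularly for image classification. In this paper, we implement a specific adversarial learning technique known as Adversarial Discriminative Domain Adaptation (ADDA) and replicate the digit classification experiments from the original ADDA paper. We extend their findings by examining a broader range of domain shifts and provide a detailed analysis of in-domain classification accuracy after ADDA. Our results show that ADDA significantly improves accuracy across certain domain shifts with minimal impact on in-domain performance. We also provide qualitative analysis and offer potential explanations for ADDA's limitations in less successful domain shifts. Code is at https://github.com/eugenechoi2004/COS429_FINAL.|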
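|**2024-12-26**|[Revisiting Monocular 3D Object Detection from Scene-Level Depth Retargeting to Instance-Level Spatial Refinement](http://arxiv.org/abs/2412.19165)|null|Monocular 3D object detection is challenging because accurate depth is unavailable. Existing depth-assisted solutions still perform poorly, and the reason is widely attributed to the insufficient accuracy of monocular depth estimation models. In this paper, we revisit monocular 3D object detection from the depth perspective and identify a further issue: the limited 3D structure awareness of existing depth representations (for example, depth one-hot encoding or depth distribution). To address this, we propose a novel depth-adapted monocular 3D object detection network, termed RD3D, which mainly comprises a Scene-Level Depth Retargeting (SDR) module and an Instance-Level Spatial Refinement (ISR) module. The former incorporates scene-level perception of 3D structure and retargets traditional depth representations into a new form: a depth thickness field. The latter refines the voxel spatial representation under instance guidance, eliminating ambiguity in 3D occupancy and thereby improving detection accuracy. Extensive experiments on the KITTI and Waymo datasets show that our method outperforms existing state-of-the-art (SoTA) methods and generalizes when equipped with different depth estimation models. The code will be released publicly.|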
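|**2024-12-26**|[SUTrack: Towards Simple and Unified Single Object Tracking](http://arxiv.org/abs/2412.19138)|**[link](https://github.com/chenxin-dlut/sutrack)**|In this paper, we propose a simple yet unified single object tracking (SOT) framework, dubbed SUTrack. It consolidates five SOT tasks (RGB-based, RGB-Depth, RGB-Thermal, RGB-Event, and RGB-Language tracking) into a unified model trained in a single session. Because of the distinct nature of the data, current methods typically design separate architectures and train separate models for each task. This fragmentation results in redundant training processes, repeated technical innovations, and limited cross-modal knowledge sharing. In contrast, SUTrack demonstrates that a single model with a unified input representation can effectively handle various common SOT tasks, eliminating the need for task-specific designs and separate training sessions. In addition, we introduce a task-recognition auxiliary training strategy and a soft token type embedding that further improve SUTrack's performance with minimal overhead. Experiments show that SUTrack outperforms previous task-specific counterparts on 11 datasets spanning the five SOT tasks. We also provide a range of models that suit both edge devices and high-performance GPUs and strike a good balance between speed and accuracy. We hope SUTrack can serve as a strong foundation for future research into unified tracking models. Code and models are available at github.com/chenxin-dlut/SUTrack.|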
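|**2024-12-26**|[From Coin to Data: The Impact of Object Detection on Digital Numismatics](http://arxiv.org/abs/2412.19091)|null|This paper investigates the application of advanced object detection techniques to digital numismatics, focusing on the analysis of historical coins. Using models such as Contrastive Language-Image Pre-training (CLIP), we develop a flexible framework that identifies and classifies specific coin features from both image and textual descriptions. By examining two distinct datasets, modern Russian coins featuring the intricate "Saint George and the Dragon" design and degraded 1st-century CE Southeast Asian coins bearing Hindu-Buddhist symbols, we evaluate the effectiveness of different detection algorithms in search and classification tasks. Our results show that larger CLIP models perform better at detecting complex imagery, while traditional methods excel at identifying simple geometric patterns. In addition, we propose a statistical calibration mechanism to improve the reliability of similarity scores in low-quality datasets. This work highlights the transformative potential of integrating state-of-the-art object detection into digital numismatics, enabling more scalable, precise, and efficient analysis of historical artifacts. These advances pave the way for new methods in cultural heritage research, artifact provenance studies, and forgery detection.|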
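|**2024-12-26**|[Assessing Pre-trained Models for Transfer Learning through Distribution of Spectral Components](http://arxiv.org/abs/2412.19085)|null|Pre-trained model assessment aims to identify the best candidate for a downstream task from a model hub without time-consuming fine-tuning. Existing advanced work mainly analyzes the intrinsic characteristics of the overall features extracted by each pre-trained model, or how well those features fit the target labels. This paper proposes a new perspective for assessing pre-trained models through the Distribution of Spectral Components (DISCO). By applying singular value decomposition to the features extracted by pre-trained models, we investigate the different spectral components and observe that they differ in transferability and contribute differently to fine-tuning performance. Motivated by this, we propose an assessment method based on the distribution of spectral components, which measures the proportions of their corresponding singular values. Pre-trained models whose features concentrate on more transferable components are regarded as better choices for transfer learning. We further use the labels of downstream data to better estimate the transferability of each spectral component and derive the final assessment criterion. The proposed method is flexible and applies to both classification and regression tasks. Comprehensive experiments on three benchmarks and two tasks, image classification and object detection, show that our method achieves state-of-the-art performance in choosing suitable pre-trained models from a model hub for transfer learning.|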
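|**2024-12-24**|[Sampling Bag of Views for Open-Vocabulary Object Detection](http://arxiv.org/abs/2412.18273)|null|Existing open-vocabulary object detection (OVD) methods handle unseen categories at test time by aligning object region embeddings with the corresponding VLM features. A recent study leverages the idea that VLMs implicitly learn the compositional structure of semantic concepts within an image. Instead of using an individual region embedding, it uses a bag of region embeddings as a new representation to incorporate compositional structure into the OVD task. However, this approach often fails to capture the contextual concepts of each region, which leads to noisy compositional structures. The result is only marginal performance improvement and reduced efficiency. To address this, we propose a novel concept-based alignment method that samples a more powerful and efficient compositional structure. Our approach groups contextually related "concepts" into a bag and adjusts the scale of concepts within the bag for more effective embedding alignment. Combined with Faster R-CNN, our method improves on prior work by 2.6 box AP50 and 0.5 mask AP on novel categories in the open-vocabulary COCO and LVIS benchmarks. Furthermore, our method reduces the FLOPs of CLIP computation by 80.3% compared with previous research, significantly improving efficiency. Experimental results show that the proposed method outperforms previous state-of-the-art models on OVD datasets.|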
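|**2024-12-24**|[Efficient Detection Framework Adaptation for Edge Computing: A Plug-and-play Neural Network Toolbox Enabling Edge Deployment](http://arxiv.org/abs/2412.18230)|**[link](https://github.com/word-ky/Edge-TOOLBOX)**|Edge computing has emerged as a key paradigm for deploying deep-learning-based object detection in time-sensitive scenarios. However, existing edge detection methods face three challenges: 1) difficulty balancing detection precision with lightweight models, 2) limited adaptability of generalized deployment designs, and 3) insufficient real-world validation. To address these problems, we propose the Edge Detection Toolbox (ED-TOOLBOX), which uses generalizable plug-and-play components to adapt object detection models to edge environments. Specifically, we introduce a lightweight Reparameterized Dynamic Convolutional Network (Rep-DConvNet) with weighted multi-shape convolutional branches to enhance detection performance. In addition, we design a Sparse Cross-Attention (SC-A) network with a localized-mapping-assisted self-attention mechanism, enabling a well-crafted joint module for adaptive feature transfer. For real-world applications, we incorporate an Efficient Head into the YOLO framework to accelerate edge model optimization. To demonstrate practical impact, we identify a gap in helmet detection, namely that the strap, a critical safety factor, is overlooked, and create the Helmet Band Detection Dataset (HBDD). Using ED-TOOLBOX-optimized models, we address this real-world task. Extensive experiments validate the effectiveness of ED-TOOLBOX: edge detection models outperform six state-of-the-art methods in visual surveillance simulations and achieve real-time, accurate performance. These results highlight ED-TOOLBOX as a superior solution for edge object detection.|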
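|**2024-12-24**|[VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis](http://arxiv.org/abs/2412.18178)|**[link](https://github.com/yangliu9208/visiongru)**|Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are the two dominant models for image analysis. CNNs excel at extracting multi-scale features and ViTs effectively capture global dependencies, but both suffer from high computational costs, particularly when processing high-resolution images. Recently, state-space models (SSMs) and recurrent neural networks (RNNs) have attracted attention for their efficiency. However, their performance in image classification tasks remains limited. To address these challenges, this paper introduces VisionGRU, a novel RNN-based architecture designed for efficient image classification. VisionGRU uses a simplified Gated Recurrent Unit (minGRU) to process large-scale image features with linear complexity. It divides images into smaller patches and progressively reduces the sequence length while increasing the channel depth, which facilitates multi-scale feature extraction. A hierarchical 2DGRU module with bidirectional scanning captures both local and global context, improving long-range dependency modeling, particularly for tasks such as semantic segmentation. Experimental results on the ImageNet and ADE20K datasets show that VisionGRU outperforms ViTs while significantly reducing memory usage and computational cost, especially for high-resolution images. These findings underscore the potential of RNN-based approaches for developing efficient and scalable computer vision solutions. Code will be available at https://github.com/YangLiu9208/VisionGRU.|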
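|**2024-12-24**|[Spectrum-oriented Point-supervised Saliency Detector for Hyperspectral Images](http://arxiv.org/abs/2412.18112)|**[link](https://github.com/laprf/spsd)**|Hyperspectral salient object detection (HSOD) aims to extract targets or regions with significantly different spectra from hyperspectral images. Although existing deep-learning-based methods can achieve good detection results, they generally require pixel-level annotations, which are notoriously difficult to obtain for hyperspectral images. To address this issue, we introduce point supervision into HSOD and incorporate spectral saliency, derived from conventional HSOD methods, as a pivotal spectral representation within the framework. This integration leads to a novel Spectrum-oriented Point-supervised Saliency Detector (SPSD). Specifically, we propose a new pipeline, designed specifically for hyperspectral images, to generate pseudo-labels, which effectively mitigates the performance decline caused by point supervision. In addition, spectral saliency is used to counteract information loss during model supervision and saliency refinement, preserving the structural integrity and edge accuracy of detected objects. Furthermore, we introduce a Spectrum-transformed Spatial Gate to focus more precisely on salient regions while reducing feature redundancy. We conduct comprehensive experiments on the HSOD-BIT and HS-SOD datasets, using mean absolute error (MAE), E-measure, F-measure, area under the curve, and cross correlation as evaluation metrics to validate the proposed method. For instance, on the HSOD-BIT dataset, SPSD achieves an MAE of 0.031 and an F-measure of 0.878. Thorough ablation studies confirm the effectiveness of each module and provide insight into the model's working mechanism. Further evaluations on RGB-thermal salient object detection datasets highlight the versatility of our approach.|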
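|**2024-12-24**|[Multi-Point Positional Insertion Tuning for Small Object Detection](http://arxiv.org/abs/2412.18090)|null|Small object detection aims to localize and classify small objects within images. With recent advances in large-scale vision-language pretraining, fine-tuning pretrained object detection models has emerged as a promising approach. However, fine-tuning large models is computationally and memory expensive. To address this issue, this paper introduces multi-point positional insertion (MPI) tuning, a parameter-efficient fine-tuning (PEFT) method for small object detection. Specifically, MPI incorporates multiple positional embeddings into a frozen pretrained model, enabling efficient detection of small objects by providing precise positional information to the latent features. Experiments on the SODA-D dataset demonstrate the effectiveness of the method. MPI performs comparably to conventional PEFT methods, including CoOp and VPT, while significantly reducing the number of parameters that need to be tuned.|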
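|**2024-12-24**|[COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal Object Detection](http://arxiv.org/abs/2412.18076)|null|Single-modal object detection often suffers performance degradation in complex scenes. In contrast, multimodal object detection can provide more comprehensive object feature information by integrating data from different modalities. Current multimodal object detection methods generally use various fusion techniques, including conventional neural networks and transformer-based models, to implement feature fusion strategies and achieve complementary information. However, because multimodal images are captured by different sensors, they are often misaligned, which makes direct matching challenging. This misalignment hinders the ability to establish strong correlations for the same object across modalities. In this paper, we propose a novel approach named the CrOss-Mamba interaction and Offset-guided fusion (COMO) framework for multimodal object detection. The COMO framework uses the cross-mamba technique to formulate feature interaction equations, enabling multimodal serialized state computation. This produces interactive fusion outputs while reducing computational overhead and improving efficiency. In addition, COMO uses high-level features, which are less affected by misalignment, to facilitate interaction and transfer complementary information between modalities, addressing the positional offsets caused by variations in camera angle and capture time. COMO also incorporates global and local scanning mechanisms in the cross-mamba module to capture features with local correlation, particularly in remote sensing images. To preserve low-level features, the offset-guided fusion mechanism ensures effective multi-scale feature use, allowing the construction of a multi-scale fusion data cube that improves detection performance.|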
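|**2024-12-23**|[Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection](http://arxiv.org/abs/2412.17800)|**[link](https://github.com/row11n/prova)**|Enabling models to recognize the vast range of open-world categories has been a long-standing pursuit in object detection. By leveraging the generalization capabilities of vision-language models, current open-world detectors can recognize a broader vocabulary even though they are trained on limited categories. However, when the scale of the category vocabulary during training expands to a real-world level, previous classifiers aligned with coarse class names significantly reduce the recognition performance of these detectors. In this paper, we introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection. Prova extracts comprehensive multi-modal prototypes to initialize alignment classifiers and thereby tackles the failure of vast-vocabulary object recognition. On V3Det, this simple method greatly improves the performance of one-stage, two-stage, and DETR-based detectors in both supervised and open-vocabulary settings, with only additional projection layers. In particular, in the supervised setting of V3Det, Prova improves the AP of Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 respectively. In the open-vocabulary setting, Prova achieves a new state of the art with 32.8 base AP and 11.0 novel AP, gains of 2.6 and 4.3 over previous methods.|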
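|**2024-12-23**|[COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Learning](http://arxiv.org/abs/2412.17684)|null|Retrieval augmentation, the practice of retrieving additional data from large auxiliary pools, has emerged as an effective technique for improving model performance in the low-data regime, such as few-shot learning. Prior approaches have used only nearest-neighbor-based strategies for data selection, retrieving auxiliary samples that are highly similar to instances in the target task. However, because they do not incorporate any notion of diversity, these approaches are prone to selecting highly redundant samples. In our work, we first show that the data selection strategies used in prior retrieval-augmented few-shot learning settings can be generalized with a class of functions known as Combinatorial Mutual Information (CMI) measures. We then propose COBRA (COmBinatorial Retrieval Augmentation), which uses an alternative CMI measure that considers both diversity and similarity to the target dataset. When used to retrieve samples from LAION-2B, COBRA consistently outperforms previous retrieval approaches across image classification tasks and few-shot learning techniques. COBRA adds negligible computational overhead to the cost of retrieval while providing significant gains in downstream model performance.|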
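|**2024-12-23**|[Enhanced Temporal Processing in Spiking Neural Networks for Static Object Detection Using 3D Convolutions](http://arxiv.org/abs/2412.17654)|null|Spiking Neural Networks (SNNs) are a class of network models capable of processing spatiotemporal information, with event-driven characteristics and energy-efficiency advantages. In recent years, directly trained SNNs have shown the potential to match or even surpass traditional Artificial Neural Networks (ANNs) in classification tasks. In object detection, however, directly trained SNNs still show a significant performance gap relative to ANNs when tested on frame-based static object datasets such as COCO2017. Bridging this gap, so that directly trained SNNs reach ANN-level performance on these static datasets, has therefore become one of the key challenges in SNN development. To address this challenge, this paper focuses on strengthening the unique ability of SNNs to process spatiotemporal information. Spiking neurons, the core components of SNNs, facilitate the exchange of information between different temporal channels while converting input floating-point data into binary spike signals. However, existing neuron models still have limitations in transmitting temporal information. Some studies have even suggested that disabling backpropagation along the time dimension during SNN training can still yield good training results. To improve how SNNs handle temporal information, this paper proposes replacing traditional 2D convolutions with 3D convolutions, which incorporates temporal information directly into the convolution. In addition, a temporal information recurrence mechanism is introduced within the neurons to further improve their efficiency in using temporal information. Experimental results show that the proposed method enables directly trained SNNs to reach performance comparable to ANNs on the COCO2017 and VOC datasets.|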
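|**2024-12-23**|[Impact of Evidence Theory Uncertainty on Training Object Detection Models](http://arxiv.org/abs/2412.17405)|null|This paper investigates the use of Evidence Theory to improve the training efficiency of object detection models by incorporating uncertainty into the feedback loop. In the validation phase of each training iteration, Evidence Theory is applied to establish a relationship between ground truth labels and predictions. The Dempster-Shafer rule of combination is used to quantify uncertainty based on the evidence from these predictions. This uncertainty measure is then used to weight the feedback loss for the subsequent iteration, allowing the model to adjust its learning dynamically. By experimenting with various uncertainty-weighting strategies, the study aims to determine the most effective way to optimize feedback and accelerate training. The results show that uncertainty-based feedback not only reduces training time but also improves model performance compared with traditional approaches. This research offers insight into the role of uncertainty in improving machine learning workflows, particularly in object detection, and suggests broader applications of uncertainty-driven training in other AI disciplines.|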
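|**2024-12-20**|[Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks](http://arxiv.org/abs/2412.16146)|**[link](https://github.com/cocoalex00/Mamba2D)**|State-Space Models (SSMs) have recently emerged as a powerful and efficient alternative to the long-dominant transformer architecture. However, existing SSM formulations retain deeply rooted biases from their origins in natural language processing. This limits their ability to properly model the spatially dependent characteristics of visual inputs. In this paper, we address these limitations by re-deriving modern selective state-space techniques from a natively multi-dimensional formulation. Prior work has attempted to apply natively 1D SSMs to 2D data (that is, images) by relying on arbitrary combinations of 1D scan directions to capture spatial dependencies. In contrast, Mamba2D improves on this with a single 2D scan direction that natively accounts for both dimensions of the input, effectively modeling spatial dependencies when constructing hidden states. In standard image classification evaluations on the ImageNet-1K dataset, Mamba2D performs comparably to prior adaptations of SSMs for vision tasks.|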
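|**2024-12-20**|[NeRF-To-Real Tester: Neural Radiance Fields as Test Image Generators for Vision of Autonomous Systems](http://arxiv.org/abs/2412.16141)|null|Autonomous inspection of infrastructure on land and in water is a quickly growing market, with applications including surveying constructions, monitoring plants, and tracking environmental changes in on- and off-shore wind energy farms. For autonomous underwater vehicles (AUVs) and unmanned aerial vehicles (UAVs), overfitting of controllers to simulation conditions leads to poor performance in the real operating environment. There is a pressing need for more diverse and realistic test data that accurately represents the challenges these systems face. We address the challenge of generating perception test data for autonomous systems by using Neural Radiance Fields to generate realistic and diverse test images, and by integrating them into a metamorphic testing framework for vision components such as vSLAM and object detection. Our tool, N2R-Tester, allows models of custom scenes to be trained and test images to be rendered from perturbed positions. An experimental evaluation of N2R-Tester on eight different vision components in AUVs and UAVs demonstrates the efficacy and versatility of the approach.|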
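|**2024-12-20**|[Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG](http://arxiv.org/abs/2412.16086)|**[link](https://github.com/tifat58/irr-with-cbm-rag)**|Deep learning has advanced medical image classification, but interpretability challenges hinder its clinical adoption. This study enhances interpretability in chest X-ray (CXR) classification by using concept bottleneck models (CBMs) and a multi-agent Retrieval-Augmented Generation (RAG) system for report generation. By modeling relationships between visual features and clinical concepts, we create interpretable concept vectors that guide a multi-agent RAG system in generating radiology reports, improving clinical relevance, explainability, and transparency. Evaluation of the generated reports with a large language model (LLM) as a judge confirms the interpretability and clinical utility of our model's outputs. On the COVID-QU dataset, our model achieves 81% classification accuracy and shows robust report generation performance, with five key metrics ranging between 84% and 90%. This interpretable multi-agent framework bridges the gap between high-performance AI and the explainability required for reliable AI-driven CXR analysis in clinical settings.|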
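|**2024-12-20**|[MR-GDINO: Efficient Open-World Continual Object Detection](http://arxiv.org/abs/2412.15979)|**[link](https://github.com/dongsky/mr-gdino)**|Open-world (OW) recognition and detection models show strong zero- and few-shot adaptation abilities, which has motivated their use as initializations in continual learning methods to improve performance. Despite promising results on seen classes, these OW abilities degrade substantially on unseen classes because of catastrophic forgetting. To tackle this challenge, we propose an open-world continual object detection task that requires detectors to generalize to old, new, and unseen categories in continual learning scenarios. Based on this task, we present OW-COD, a challenging yet practical benchmark for assessing detection abilities. The goal is to motivate OW detectors to simultaneously preserve learned classes, adapt to new classes, and maintain open-world capabilities under few-shot adaptation. To mitigate forgetting on unseen categories, we propose MR-GDINO, a strong, efficient, and scalable baseline that uses memory and retrieval mechanisms within a highly scalable memory pool. Experimental results show that existing continual detectors suffer from severe forgetting for both seen and unseen categories. In contrast, MR-GDINO largely mitigates forgetting with only 0.1% additional activated parameters, achieving state-of-the-art performance on old, new, and unseen categories.|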
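|**2024-12-20**|[Mask-RadarNet: Enhancing Transformer With Spatial-Temporal Semantic Context for Radar Object Detection in Autonomous Driving](http://arxiv.org/abs/2412.15595)|null|As a cost-effective and robust technology, automotive radar has improved steadily over the past few years, making it an appealing complement to commonly used sensors in autonomous driving such as cameras and LiDAR. Radio frequency data, which carries rich semantic information, is attracting growing attention. Most current radar-based models take radio frequency image sequences as input. However, these models rely heavily on convolutional neural networks and overlook the spatial-temporal semantic context during the encoding stage. To solve these problems, we propose a model called Mask-RadarNet that fully uses the hierarchical semantic features of the input radar data. Mask-RadarNet uses a combination of interleaved convolution and attention operations to replace the traditional architecture in transformer-based models. In addition, a patch shift mechanism is introduced in Mask-RadarNet for efficient spatial-temporal feature learning. By shifting part of the patches in a specific mosaic pattern along the temporal dimension, Mask-RadarNet achieves competitive performance while reducing the computational burden of spatial-temporal modeling. To capture spatial-temporal semantic contextual information, we design a class masking attention module (CMAM) in the encoder. We also add a lightweight auxiliary decoder to the model to aggregate the prior maps generated by the CMAM. Experiments on the CRUW dataset show that the proposed method outperforms several state-of-the-art radar-based object detection algorithms. With relatively low computational complexity and fewer parameters, the proposed Mask-RadarNet achieves higher recognition accuracy for object detection in autonomous driving.|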
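|**2024-12-20**|[Continual Learning Using a Kernel-Based Method Over Foundation Models](http://arxiv.org/abs/2412.15571)|**[link](https://github.com/salehmomeni/klda)**|Continual learning (CL) learns a sequence of tasks incrementally. This paper studies the challenging CL setting of class-incremental learning (CIL). CIL has two key challenges: catastrophic forgetting (CF) and inter-task class separation (ICS). Despite numerous proposed methods, these issues remain persistent obstacles. This paper proposes a novel CIL method called Kernel Linear Discriminant Analysis (KLDA) that can effectively avoid the CF and ICS problems. It leverages only the powerful features learned in a foundation model (FM). However, using these features directly proved suboptimal. To address this, KLDA incorporates the Radial Basis Function (RBF) kernel and its Random Fourier Features (RFF) to enhance the feature representations from the FM, which improves performance. When a new task arrives, KLDA computes only the mean of each class in the task and updates a shared covariance matrix for all learned classes based on the kernelized features. Classification is performed with Linear Discriminant Analysis. Our empirical evaluation on text and image classification datasets shows that KLDA significantly outperforms baselines. Remarkably, without relying on replay data, KLDA achieves accuracy comparable to joint training on all classes, which is considered the upper bound for CIL performance. The KLDA code is available at https://github.com/salehmomeni/klda.|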
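|**2024-12-19**|[Uncertainty Estimation for Super-Resolution using ESRGAN](http://arxiv.org/abs/2412.15439)|null|Deep-learning-based image super-resolution (SR) has been widely adopted with the help of Generative Adversarial Networks. Models such as SRGAN and ESRGAN are consistently ranked among the best image SR tools. However, they lack a principled way to estimate predictive uncertainty. In this study, we enhance these models using Monte Carlo Dropout and Deep Ensembles, which allows predictive uncertainty to be computed. Combined with the prediction itself, predictive uncertainty gives model users more information by highlighting pixels where the SR output is uncertain and therefore potentially inaccurate, provided these estimates are reliable. Our findings suggest that the uncertainty estimates are reasonably well calibrated and can therefore serve this purpose, with no drop in performance relative to the corresponding models without uncertainty estimation.|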
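|**2024-12-19**|[Exploring Machine Learning Engineering for Object Detection and Tracking by Unmanned Aerial Vehicle (UAV)](http://arxiv.org/abs/2412.15347)|null|With the advancement of deep learning methods, autonomous systems are bound to become increasingly intelligent by including advanced machine learning algorithms that carry out a variety of autonomous operations. One such task involves the design and evaluation of a perception subsystem for object detection and tracking. The challenges of creating software for this task lie in identifying the need for a dataset, annotating the dataset, selecting features, and integrating and refining existing algorithms, while evaluating performance metrics through training and testing. This research focuses on developing a machine learning pipeline that emphasizes the inclusion of assurance methods within increasingly automated processes. In the process, a new dataset was created by collecting videos of moving objects, such as a Roomba vacuum cleaner, emulating search and rescue (SAR) in an indoor environment. Individual frames were extracted from the videos and labeled using a combination of manual and automated techniques. The annotated dataset was refined for accuracy through initial training on YOLOv4. After dataset refinement, it was used to train a second YOLOv4 model and a Mask R-CNN model, which were deployed on a Parrot Mambo drone to perform real-time object detection and tracking. Experimental results show that the models accurately detect and track the Roomba across multiple trials, achieving an average loss of 0.1942 and 96% accuracy.|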
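|**2024-12-19**|[Scaling 4D Representations](http://arxiv.org/abs/2412.15212)|null|Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused its evaluation on semantic-related tasks, such as action classification and ImageNet classification. In this paper, we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that, by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models does in fact scale: performance on these 4D tasks improves consistently as model size increases from 20M parameters all the way to the largest self-supervised video model reported to date, at 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.|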
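|**2024-12-19**|[A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space](http://arxiv.org/abs/2412.14680)|**[link](https://github.com/d-robotics-ai-lab/dosod)**|Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications because of their high computational burden and complex deployment. To address this issue, this paper proposes a lightweight framework called Decoupled OSOD (DOSOD), a practical and highly efficient solution that supports real-time OSOD tasks in robotic systems. Specifically, DOSOD builds on the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor is developed to transform text embeddings extracted by the VLM into a joint space, within which the detector learns region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, which avoids complex feature interactions and thereby improves computational efficiency. During the testing phase DOSOD operates like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection. Compared with the baseline YOLO-World, the proposed DOSOD significantly improves real-time performance while maintaining comparable accuracy. Using similar backbone networks on the LVIS minival dataset, the lightweight DOSOD-S model achieves a Fixed AP of 26.7%, compared with 26.2% for YOLO-World-v1-S and 22.7% for YOLO-World-v2-S. Meanwhile, the FPS of DOSOD-S is 57.1% higher than YOLO-World-v1-S and 29.6% higher than YOLO-World-v2-S. We also show that the DOSOD model facilitates deployment on edge devices. Code and models are publicly available at https://github.com/D-Robotics-AI-Lab/DOSOD.|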
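|**2024-12-19**|[Progressive Fine-to-Coarse Reconstruction for Accurate Low-Bit Post-Training Quantization in Vision Transformers](http://arxiv.org/abs/2412.14633)|null|Because of its efficiency, Post-Training Quantization (PTQ) has been widely adopted for compressing Vision Transformers (ViTs). However, when quantized to low-bit representations, performance often degrades significantly compared with the full-precision counterpart. To address this issue, reconstruction methods have been incorporated into the PTQ framework to improve performance in low-bit quantization settings. Nevertheless, existing methods predefine the reconstruction granularity and seldom explore the progressive relationships between different reconstruction granularities, which leads to sub-optimal quantization results in ViTs. To this end, we propose a Progressive Fine-to-Coarse Reconstruction (PFCR) method for accurate PTQ that significantly improves the performance of low-bit quantized vision transformers. Specifically, we define the multi-head self-attention and multi-layer perceptron modules, together with their shortcuts, as the finest reconstruction units. After reconstructing these two fine-grained units, we combine them into coarser blocks and reconstruct them at a coarser granularity level. We perform this combination and reconstruction process iteratively, achieving progressive fine-to-coarse reconstruction. In addition, we introduce a Progressive Optimization Strategy (POS) for PFCR to alleviate the difficulty of training and further improve model performance. Experimental results on the ImageNet dataset show that the proposed method achieves the best Top-1 accuracy among state-of-the-art methods, in particular reaching 75.61% for 3-bit quantized ViT-B in PTQ. Quantization results on the COCO dataset further show the effectiveness and generalization of the method on other computer vision tasks such as object detection and instance segmentation.|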
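|**2024-12-19**|[Alignment-Free RGB-T Salient Object Detection: A Large-scale Dataset and Progressive Correlation Network](http://arxiv.org/abs/2412.14576)|**[link](https://github.com/angknpng/pcnet)**|Alignment-free RGB-Thermal (RGB-T) salient object detection (SOD) aims to achieve robust performance in complex scenes by directly using the complementary information in unaligned visible-thermal image pairs, without manual alignment. However, the labor-intensive process of collecting and annotating image pairs limits the scale of existing benchmarks and hinders the advancement of alignment-free RGB-T SOD. In this paper, we construct a large-scale, high-diversity unaligned RGB-T SOD dataset named UVT20K, comprising 20,000 image pairs, 407 scenes, and 1256 object categories. All samples are collected from real-world scenarios with various challenges, such as low illumination, image clutter, and complex salient objects. To support further research, each sample in UVT20K is annotated with a comprehensive set of ground truths, including saliency masks, scribbles, boundaries, and challenge attributes. In addition, we propose a Progressive Correlation Network (PCNet) that models inter- and intra-modal correlations on the basis of explicit alignment to achieve accurate predictions in unaligned image pairs. Extensive experiments on unaligned and aligned datasets demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/Angknpng/PCNet.|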
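|**2024-12-19**|[SCKD: Semi-Supervised Cross-Modality Knowledge Distillation for 4D Radar Object Detection](http://arxiv.org/abs/2412.14571)|null|3D object detection is one of the fundamental perception tasks for autonomous vehicles. Performing this task with a 4D millimeter-wave radar is very attractive because the sensor can acquire 3D point clouds similar to LiDAR while maintaining robust measurements in adverse weather. However, because of the high sparsity and noise of radar point clouds, the performance of existing methods is still far below expectations. In this paper, we propose a novel Semi-supervised Cross-modality Knowledge Distillation (SCKD) method for 4D-radar-based 3D object detection. It learns features from a LiDAR-radar-fused teacher network through semi-supervised distillation. We first propose an adaptive fusion module in the teacher network to boost its performance. Two feature distillation modules are then designed to facilitate cross-modality knowledge transfer. Finally, a semi-supervised output distillation is proposed to increase the effectiveness and flexibility of the distillation framework. With the same network structure, our radar-only student trained by SCKD improves mAP by 10.38% over the baseline on the VoD dataset and outperforms state-of-the-art work. Experiments on ZJUODset also show a 5.12% mAP improvement over the baseline at the moderate difficulty level when extra unlabeled data is available. Code is available at https://github.com/Ruoyu-Xu/SCKD.|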
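|**2024-12-18**|[Super-Resolution Generative Adversarial Network for Data Compression of Direct Numerical Simulations](http://arxiv.org/abs/2412.14150)|null|Advances in high-performance computing have made it possible to generate large direct numerical simulation (DNS) datasets of turbulent flows, driving the need for efficient compression/decompression techniques that reduce storage demands while maintaining fidelity. For complex turbulent flows, traditional methods such as the discrete wavelet transform cannot achieve compression ratios of 8 or higher without introducing significant encoding/decoding errors. Super-resolution generative adversarial networks (SR-GANs), on the other hand, can accurately reconstruct fine-scale features and preserve velocity gradients and structural details even at a compression ratio of 512, thanks to the adversarial training enabled by the discriminator. Their long training time can be significantly reduced with a progressive transfer learning approach, and once trained they can be applied independently of the Reynolds number. It is shown that SR-GANs can enhance a dataset's temporal resolution by generating high-quality intermediate fields from compressed snapshots, without additional simulation overhead. The SR-GAN discriminator can reliably evaluate the quality of decoded fields, ensuring fidelity even when the original DNS fields are absent. SR-GAN-based compression/decompression therefore offers an efficient and scalable alternative for large-scale DNS storage and transfer, with significant advantages in compression efficiency, reconstruction fidelity, and temporal resolution enhancement.|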
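|**2024-12-18**|[Object Style Diffusion for Generalized Object Detection in Urban Scene](http://arxiv.org/abs/2412.13815)|null|Object detection is a critical task in computer vision, with applications in domains such as autonomous driving and urban scene monitoring. However, deep-learning-based approaches often demand large volumes of annotated data, which are costly and difficult to acquire, particularly in complex and unpredictable real-world environments. This dependency significantly hampers the generalization capability of existing object detection techniques. To address this issue, we introduce GoDiff, a novel single-domain object detection generalization method that uses a pre-trained model to improve generalization in unseen domains. Central to our approach is the Pseudo Target Data Generation (PTDG) module, which employs a latent diffusion model to generate pseudo-target domain data that preserves source domain characteristics while introducing stylistic variation. By combining this pseudo data with source domain data, we diversify the training dataset. Furthermore, we introduce a cross-style instance normalization technique that blends style features from the different domains generated by the PTDG module, increasing detector robustness. Experimental results show that our method not only enhances the generalization ability of existing detectors but also works as a plug-and-play enhancement for other single-domain generalization methods, achieving state-of-the-art performance in autonomous driving scenarios.|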
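|**2024-12-18**|[MBInception: A new Multi-Block Inception Model for Enhancing Image Processing Efficiency](http://arxiv.org/abs/2412.13703)|null|Deep learning models, specifically convolutional neural networks, have transformed image classification by autonomously extracting features directly from raw pixel data. This article introduces an image classification model that employs three consecutive inception blocks within a convolutional neural network framework, and provides a comprehensive comparative analysis with well-established architectures such as Visual Geometry Group (VGG), Residual Network (ResNet), and MobileNet. Using benchmark datasets, including the Canadian Institute for Advanced Research (CIFAR) datasets, the Modified National Institute of Standards and Technology database (MNIST), and the Fashion Modified National Institute of Standards and Technology database (Fashion-MNIST), we evaluate the performance of the proposed model against these reference models. The results show that our model consistently outperforms the other models across the datasets, underscoring its effectiveness and its potential to advance the current state of the art in image classification. The evaluation metrics further indicate that the proposed model surpasses the other compared architectures, improving the efficiency of image classification on standard datasets.|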
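|**2024-12-18**|[MMO-IG: Multi-Class and Multi-Scale Object Image Generation for Remote Sensing](http://arxiv.org/abs/2412.13684)|null|The rapid advancement of deep generative models (DGMs) has significantly advanced research in computer vision, providing a cost-effective alternative to acquiring large quantities of expensive imagery. However, existing methods predominantly focus on synthesizing remote sensing (RS) images that match real images in a global layout view, which limits their applicability to RS image object detection (RSIOD) research. To address these challenges, we propose MMO-IG, a DGM-based multi-class and multi-scale object image generator that produces RS images with supervised object labels from global and local aspects simultaneously. Specifically, from the local view, MMO-IG encodes various RS instances using an iso-spacing instance map (ISIM). During generation, it decodes each instance region with iso-spacing values in the ISIM, corresponding to both background and foreground instances, to produce RS images through the denoising process of a diffusion model. Considering the complex interdependencies among multi-class and multi-scale objects (MMOs), we construct a spatial-cross dependency knowledge graph (SCDKG). This ensures a realistic and reliable multidirectional distribution among MMOs for region embedding, thereby reducing the discrepancy between source and target domains. In addition, we propose a structured object distribution instruction (SODI) that, together with the SCDKG-based ISIM, guides the generation of synthesized RS image content from a global aspect. Extensive experimental results show that MMO-IG has strong generation capabilities for RS images with dense MMO-supervised labels, and that RS detectors pre-trained with MMO-IG perform very well on real-world datasets.|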
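|**2024-12-18**|[MambaLCT: Boosting Tracking via Long-term Context State Space Model](http://arxiv.org/abs/2412.13615)|**[link](https://github.com/gxnu-zhonglab/mambalct)**|Effectively constructing context information with long-term dependencies from video sequences is crucial for object tracking. However, the context constructed by existing work is limited in length and considers only object information from adjacent frames or video clips, so context information is underused. To address this issue, we propose MambaLCT, which constructs and uses target variation cues from the first frame to the current frame for robust tracking. First, a novel unidirectional Context Mamba module is designed to scan frame features along the temporal dimension, gathering target variation cues throughout the entire sequence. Specifically, target-related information in frame features is compressed into a hidden state space through a selective scanning mechanism. The target information across the entire video is continuously aggregated into target variation cues. Next, we inject the target variation cues into the attention mechanism, providing temporal information for modeling the relationship between the template and search frames. The advantage of MambaLCT is its ability to keep extending the length of the context and capture complete target variation cues, which improves the stability and robustness of the tracker. Extensive experiments show that long-term context information enhances the model's ability to perceive targets in complex scenarios. MambaLCT achieves new state-of-the-art performance on six benchmarks while maintaining real-time running speed.|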
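|**2024-12-17**|[Continuous Patient Monitoring with AI: Real-Time Analysis of Video in Hospital Care Settings](http://arxiv.org/abs/2412.13152)|**[link](https://github.com/lookdeep/ai-norms-2024)**|This study introduces an AI-driven platform for continuous, passive patient monitoring in hospital settings, developed by LookDeep Health. Using advanced computer vision, the platform provides real-time insight into patient behavior and interactions through video analysis and securely stores inference results in the cloud for retrospective evaluation. The dataset, compiled with 11 hospital partners, covers more than 300 high-risk fall patients and over 1,000 days of inference, and supports applications such as fall detection and safety monitoring for vulnerable patient populations. To foster innovation and reproducibility, an anonymized subset of the dataset has been made publicly available. The AI system detects key components in hospital rooms, including individual presence and role, furniture location, motion magnitude, and boundary crossings. Performance evaluation shows high accuracy in object detection (macro F1-score = 0.92) and patient-role classification (F1-score = 0.98), as well as reliable trend analysis for the "patient alone" metric (mean logistic regression accuracy = 0.82 ± 0.15). These capabilities enable automated detection of patient isolation, wandering, or unsupervised movement, all of which are key indicators of fall risk and other adverse events. This work establishes benchmarks for validating AI-driven patient monitoring systems and highlights the platform's potential to improve patient safety and care by providing continuous, data-driven insight into patient behavior and interactions.|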
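|**2024-12-17**|[Identifying Bias in Deep Neural Networks Using Image Transforms](http://arxiv.org/abs/2412.13079)|**[link](https://github.com/SaiTeja-Erukude/identifying-bias-in-dnn-classification)**|Convolutional neural networks (CNNs) have become one of the most commonly used computational tools of the past two decades. One of the primary downsides of CNNs is that they work as a "black box": the user cannot necessarily know how the image data are analyzed and therefore has to rely on empirical evaluation to test the efficacy of a trained CNN. This can lead to hidden biases that affect the performance evaluation of neural networks but are difficult to identify. This paper discusses examples of such hidden biases in common and widely used benchmark datasets and proposes techniques for identifying dataset biases that can affect standard performance evaluation metrics. One effective approach to identifying dataset bias is to perform image classification using only the blank background parts of the original images. In some situations, however, an image contains no blank background, which makes it more difficult to separate foreground or contextual information from the bias. To overcome this, we propose a method for identifying dataset bias without the need to crop background information from the images. The method is based on applying several image transforms to the original images, including the Fourier transform, wavelet transforms, the median filter, and their combinations. These transforms are applied to recover the background bias information that the CNN can use to classify images. They affect contextual visual information differently from the way they affect systemic background bias. The method can therefore distinguish contextual information from bias, and can alert to the presence of background bias even without separating sub-image parts from the blank background of the original images. The code used in the experiments is publicly available.|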
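|**2024-12-17**|[What is YOLOv6? A Deep Insight into the Object Detection Model](http://arxiv.org/abs/2412.13006)|null|This work explores the YOLOv6 object detection model in depth, focusing on its design framework, optimization techniques, and detection capabilities. The core elements of YOLOv6 include the EfficientRep backbone for robust feature extraction and the Rep-PAN neck for seamless feature aggregation, ensuring high-performance object detection. Evaluated on the COCO dataset, YOLOv6-N achieves 37.5% AP at 1187 FPS on an NVIDIA Tesla T4 GPU. YOLOv6-S reaches 45.0% AP at 484 FPS, outperforming models in the same class such as PPYOLOE-S, YOLOv5-S, YOLOX-S, and YOLOv8-S. Moreover, YOLOv6-M and YOLOv6-L also achieve higher accuracy (50.0% and 52.8%) while maintaining inference speeds comparable to other detectors. With an upgraded backbone and neck structure, YOLOv6-L6 delivers cutting-edge accuracy in real-time detection.||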
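|**2024-12-17**|[RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion](http://arxiv.org/abs/2412.12725)|null|We propose the Radar-Camera fusion Transformer (RaCFormer) to improve the accuracy of 3D object detection, based on the following insight: radar-camera fusion in outdoor 3D scene perception is capped by the image-to-BEV transformation. If pixel depths are not estimated accurately, a naive combination of BEV features actually integrates misaligned visual content. To avoid this problem, we propose a query-based framework that adaptively samples instance-relevant features from both the BEV and the original image view. Furthermore, we enhance system performance through two key designs: optimizing query initialization and strengthening the representational capacity of the BEV. For the former, we introduce an adaptive circular distribution in polar coordinates to refine the initialization of object queries, allowing query density to be adjusted according to distance. For the latter, we first incorporate a radar-guided depth head to refine the transformation from image view to BEV. We then focus on exploiting the Doppler effect of radar and introduce an implicit dynamic catcher to capture temporal elements in the BEV. Extensive experiments on the nuScenes and View-of-Delft (VoD) datasets validate the merits of our design. Notably, our method achieves 64.9% mAP and 70.2% NDS on nuScenes, even outperforming several LiDAR-based detectors. RaCFormer also ranks first on the VoD dataset. The code will be released.||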
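|**2024-12-17**|[ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries](http://arxiv.org/abs/2412.12675)|null|Existing work on human-centric video understanding typically focuses on analyzing specific moments or entire videos. However, many applications require higher precision at the frame level. In this work, we propose a new task, BestShot, which aims to locate highlight frames in human-centric videos via language queries. The task demands not only a deep semantic understanding of human actions but also precise temporal localization. To support it, we introduce the BestShot Benchmark. The benchmark is carefully constructed by combining human-annotated highlight frames, detailed textual descriptions, and duration labels. The descriptions cover three key elements: (1) visual content; (2) fine-grained actions; and (3) human pose descriptions. Together, these elements provide the precision needed to identify exact highlight frames in videos. To tackle the problem, we collected two distinct datasets: (i) the ShotGPT4o dataset, generated by the GPT-4o algorithm; and (ii) the Image-SMPLText dataset, a large-scale and accurate per-frame pose description dataset built on PoseScript and existing pose estimation datasets. Based on these datasets, we present a strong baseline model, ShotVL, fine-tuned from InternVL for BestShot. We highlight the impressive zero-shot capabilities of our model and provide a comparative analysis against existing SOTA models. ShotVL improves over InternVL by 52% on the BestShot Benchmark and by 57% on the THUMOS14 benchmark, while maintaining SOTA performance in general image classification and retrieval.||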
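|**2024-12-17**|[Structural Pruning via Spatial-aware Information Redundancy for Semantic Segmentation](http://arxiv.org/abs/2412.12672)|**[link](https://github.com/dywu98/SIRFP)**|In recent years, semantic segmentation has flourished in a wide range of applications. However, its high computational cost remains a significant obstacle to further adoption. Filter pruning methods for structured network slimming offer a direct and effective way to reduce the size of segmentation networks. Nevertheless, we argue that most existing pruning methods were originally designed for image classification and ignore the fact that segmentation is a location-sensitive task, which leads to suboptimal performance when they are applied to segmentation networks. To address this issue, this paper proposes a new approach, Spatial-aware Information Redundancy Filter Pruning (SIRFP), which aims to reduce feature redundancy between channels. First, we formulate the pruning process as a maximum edge weight clique problem (MEWCP) in graph theory, thereby minimizing the redundancy among the features that remain after pruning. Within this framework, we introduce a spatial-aware redundancy metric based on feature maps, giving the pruning process location sensitivity so that it is better suited to pruning segmentation networks. In addition, based on the MEWCP, we propose a low-complexity greedy strategy to solve this NP-hard problem, making it feasible and efficient for structured pruning. To validate the effectiveness of our method, we conduct extensive comparative experiments on various challenging datasets. The results show that SIRFP achieves excellent performance on semantic segmentation tasks.||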
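|**2024-12-17**|[RemoteTrimmer: Adaptive Structural Pruning for Remote Sensing Image Classification](http://arxiv.org/abs/2412.12603)|**[link](https://github.com/1e12Leon/RemoteTrimmer)**|Because classification of high-resolution remote sensing images usually involves high computational complexity, lightweight models tend to be more practical and efficient. Model pruning is an effective approach to model compression. However, existing methods rarely consider the particular characteristics of remote sensing images, resulting in significant accuracy loss after pruning. To this end, we propose an effective structural pruning approach for remote sensing image classification. Specifically, we introduce a pruning strategy that amplifies the differences in channel importance within the model. We then design an adaptive mining loss function for the fine-tuning process of the pruned model. Finally, we conduct experiments on two remote sensing classification datasets. The results show that our method incurs minimal accuracy loss after compressing remote sensing classification models, achieving state-of-the-art (SoTA) performance.||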
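|**2024-12-17**|[Efficient Oriented Object Detection with Enhanced Small Object Recognition in Aerial Images](http://arxiv.org/abs/2412.12562)|null|In rotated bounding box object detection for aerial imagery, striking a balance between computational efficiency and detection accuracy is a significant challenge. Although previous research has aimed at creating lightweight models that improve computational performance and feature extraction, a performance gap remains when these networks detect small and multi-scale objects in remote sensing (RS) imagery. To address these challenges, we present a novel enhancement of the YOLOv8 model, tailored for oriented object detection and optimized for environments with limited computational resources. Our model uses a wavelet-transform-based C2f module to capture associative features and an Adaptive Scale Feature Pyramid (ASFP) module to exploit detail information from the P2 layer. In addition, incorporating GhostDynamicConv contributes significantly to the lightweight nature of the model, ensuring efficient aerial image analysis. With 21.6M parameters, our approach offers a more efficient architectural design than DecoupleNet, which has 23.3M parameters, while maintaining detection accuracy. On the DOTAv1.0 dataset, our model achieves a mean average precision (mAP) comparable to leading methods such as DecoupleNet. The model's efficiency and reduced parameter count make it a strong candidate for aerial object detection, especially in resource-constrained environments.||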
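|**2024-12-17**|[Tell Me What to Track: Infusing Robust Language Guidance for Enhanced Referring Multi-Object Tracking](http://arxiv.org/abs/2412.12561)|null|Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to localize an arbitrary number of targets based on a language expression and to track them continuously in a video. This intricate task involves reasoning over multi-modal data and precise target localization with temporal association. However, previous studies overlook the imbalanced data distribution between newborn targets and existing targets that arises from the nature of the task. In addition, they fuse multi-modal features only indirectly, which makes it hard to provide clear guidance for detecting newborn targets. To solve these issues, we adopt a collaborative matching strategy to alleviate the impact of the imbalance, improving the ability to detect newborn targets while maintaining tracking performance. In the encoder, we integrate and enhance cross-modal and multi-scale fusion, overcoming the bottleneck in previous work where multi-modal information sharing and interaction between feature maps were limited. In the decoder, we also develop a referring-infused adaptation that provides explicit referring guidance through the query tokens. Experiments show that our model outperforms prior work (+3.42%), demonstrating the effectiveness of our design.||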
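|**2024-12-17**|[Addressing Small and Imbalanced Medical Image Datasets Using Generative Models: A Comparative Study of DDPM and PGGANs with Random and Greedy K Sampling](http://arxiv.org/abs/2412.12532)|**[link](https://github.com/imankhazrak/DDPM_X-Ray)**|The development of accurate medical image classification models is often constrained by privacy concerns and data scarcity for certain conditions, leading to small and imbalanced datasets. To address these limitations, this study explores the use of generative models, such as Denoising Diffusion Probabilistic Models (DDPM) and Progressive Growing Generative Adversarial Networks (PGGANs), for dataset augmentation. The study introduces a framework for assessing the impact of synthetic images generated by DDPM and PGGANs on the performance of four models: a custom CNN, an untrained VGG16, a pretrained VGG16, and a pretrained ResNet50. Experiments use Random Sampling and Greedy K Sampling to create small, imbalanced datasets. The synthetic images are evaluated with the Fréchet Inception Distance (FID) and compared with the original datasets through classification metrics. The results show that DDPM consistently generates more realistic images with lower FID scores and significantly outperforms PGGANs in improving classification metrics across all models and datasets. Incorporating DDPM-generated images into the original datasets increases accuracy by up to 6%, improving model robustness and stability, especially in imbalanced scenarios. Random Sampling shows superior stability, whereas Greedy K Sampling offers diversity at the cost of higher FID scores. This study highlights the effectiveness of DDPM in augmenting small, imbalanced medical image datasets, improving model performance by balancing the dataset and expanding its size.||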
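|**2024-12-13**|[A dual contrastive framework](http://arxiv.org/abs/2412.10348)|null|In current multimodal tasks, models typically freeze the encoder and decoder while adapting intermediate layers to task-specific goals such as region captioning. Region-level visual understanding poses significant challenges for large-scale vision-language models. While limited spatial awareness is a known issue, coarse-grained pretraining in particular exacerbates the difficulty of optimizing latent representations for effective encoder-decoder alignment. We propose AlignCap, a framework designed to enhance region-level understanding through fine-grained alignment of latent spaces. Our approach introduces a novel latent feature refinement module that strengthens conditioned latent space representations to improve region-level captioning performance. We also propose an innovative alignment strategy, the semantic space alignment module, which improves the quality of multimodal representations. In addition, we incorporate contrastive learning into both modules in a novel way to further enhance region-level captioning performance. To address spatial limitations, we employ a General Object Detection (GOD) method as a data preprocessing pipeline that enhances spatial reasoning at the region level. Extensive experiments show that our approach significantly improves region-level captioning performance across various tasks.||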
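|**2024-12-13**|[Copy-Move Detection in Optical Microscopy: A Segmentation Network and A Dataset](http://arxiv.org/abs/2412.10258)|null|As more and more cases of academic fraud are exposed, detecting forged experimental images in the biomedical field has become a public concern. The challenge is that copy-moved targets may include background tissue, small foreground objects, or both; they may fall outside the training domain and be subject to unforeseen attacks, which makes standard object-detection-based approaches less effective. To address this, we reformulate the detection of biomedical copy-move forgery regions as an intra-image co-saliency detection task and propose CMSeg-Net, a copy-move forgery segmentation network capable of identifying unseen duplicated regions. CMSeg-Net is built on a multi-resolution encoder-decoder architecture and incorporates self-correlation and correlation-assisted spatial-attention modules to detect intra-image regional similarities within feature tensors at each observation scale. This design helps distinguish even small copy-move targets from other similar objects in complex microscopic images. Furthermore, using public data from the ICIP 2022 Challenge, we created a copy-move forgery dataset of optical microscopic images, named FakeParaEgg, to support the development of CMSeg-Net and verify its performance. Extensive experiments show that our approach outperforms previous state-of-the-art methods on the FakeParaEgg dataset and on other public copy-move detection datasets, including CASIA-CMFD, CoMoFoD, and CMF. The FakeParaEgg dataset, our source code, and the CMF dataset with our manually defined segmentation ground truths are available at https://github.com/YoursEver/FakeParaEgg.||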
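|**2024-12-13**|[UN-DETR: Promoting Objectness Learning via Joint Supervision for Unknown Object Detection](http://arxiv.org/abs/2412.10176)|**[link](https://github.com/ndwxhmzz/un-detr)**|Unknown Object Detection (UOD) aims to identify objects of unseen categories, unlike traditional detection paradigms that are limited by the closed-world assumption. A key component of UOD is learning a generalized representation, namely objectness for both known and unknown categories, so that objects can be distinguished and localized from the background in a class-agnostic manner. However, previous methods obtain the supervision signals for learning objectness in isolation, from either localization or classification information, which leads to poor UOD performance. To address this issue, we propose a transformer-based UOD framework, UN-DETR. Building on it, we design the Instance Presence Score (IPS) to represent the probability that an object is present. For the sake of information complementarity, IPS adopts a joint supervised learning strategy that integrates attributes representing general objectness from the positional and categorical latent spaces as supervision signals. To strengthen IPS learning, we introduce a one-to-many assignment strategy to incorporate more supervision. We then propose Unbiased Query Selection to provide the decoder with high-quality initial query vectors. In addition, we propose an IPS-guided post-processing strategy to filter redundant boxes and correct classification predictions for known and unknown objects. Finally, we pretrain the entire UN-DETR in an unsupervised manner to obtain an objectness prior. UN-DETR is comprehensively evaluated on multiple UOD and known-object detection benchmarks, demonstrating its effectiveness and achieving state-of-the-art performance.||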
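|**2024-12-13**|[HS-FPN: High Frequency and Spatial Perception FPN for Tiny Object Detection](http://arxiv.org/abs/2412.10116)|null|The introduction of the Feature Pyramid Network (FPN) has significantly improved object detection performance. However, substantial challenges remain in detecting tiny objects, because their features occupy only a very small fraction of the feature maps. Although FPN integrates multi-scale features, it does not directly enhance or enrich the features of tiny objects. Furthermore, FPN lacks spatial perception ability. To address these issues, we propose a novel High Frequency and Spatial Perception Feature Pyramid Network (HS-FPN) with two innovative modules. First, we design a high frequency perception module (HFP) that generates high-frequency responses through high-pass filters. These responses are used as mask weights from both spatial and channel perspectives to enrich and highlight the features of tiny objects in the original feature maps. Second, we develop a spatial dependency perception module (SDP) to capture the spatial dependencies that FPN lacks. Our experiments show that detectors based on HS-FPN are competitive with state-of-the-art models on the AI-TOD dataset for tiny object detection.||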
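|**2024-12-13**|[RemDet: Rethinking Efficient Model Design for UAV Object Detection](http://arxiv.org/abs/2412.10040)|**[link](https://github.com/hzai-zjnu/remdet)**|Object detection in Unmanned Aerial Vehicle (UAV) images has become a research focus, and it presents two significant challenges: i) objects are typically small and dense within very large images; ii) limited computational resources make most models unsuitable for real-time deployment. Current real-time object detectors are not optimized for UAV images, while complex methods designed for small object detection often lack real-time capability. To address these challenges, we propose a new detector, RemDet (Reparameter efficient multiplication Detector). Our contributions are as follows: 1) We rethink the challenges that existing detectors face with small, dense UAV images and propose information loss as a design guideline for efficient models. 2) We introduce the ChannelC2f module to improve small object detection, showing that high-dimensional representations can effectively reduce information loss. 3) We design the GatedFFN module, which provides not only strong performance but also low latency, effectively addressing the challenge of real-time detection. Our research shows that, by using multiplication, GatedFFN is more cost-effective than feed-forward networks for high-dimensional representation. 4) We propose the CED module, which combines the advantages of ViT and CNN downsampling to effectively reduce information loss. It specifically enhances context information for small and dense objects. Extensive experiments on the large UAV datasets VisDrone and UAVDT validate the real-time efficiency and superior performance of our method. On the challenging VisDrone dataset, our method not only delivers state-of-the-art results, improving detection accuracy by more than 3.4%, but also reaches 110 FPS on a single 4090. The code is available at https://github.com/HZAI-ZJNU/RemDet.||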
|**2024-12-13**|[Object-Focused Data Selection for Dense Prediction Tasks](http://arxiv.org/abs/2412.10032)|null|Dense prediction tasks such as object detection and segmentation require high-quality labels at pixel level, which are costly to obtain. Recent advances in foundation models have enabled the generation of autolabels, which we find to be competitive but not yet sufficient to fully replace human annotations, especially for more complex datasets. Thus, we consider the challenge of selecting a representative subset of images for labeling from a large pool of unlabeled images under a constrained annotation budget. This task is further complicated by imbalanced class distributions, as rare classes are often underrepresented in selected subsets. We propose object-focused data selection (OFDS) which leverages object-level representations to ensure that the selected image subsets semantically cover the target classes, including rare ones. We validate OFDS on PASCAL VOC and Cityscapes for object detection and semantic segmentation tasks. Our experiments demonstrate that prior methods which employ image-level representations fail to consistently outperform random selection. In contrast, OFDS consistently achieves state-of-the-art performance with substantial improvements over all baselines in scenarios with imbalanced class distributions. Moreover, we demonstrate that pre-training with autolabels on the full datasets before fine-tuning on human-labeled subsets selected by OFDS further enhances the final performance.||
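|**2024-12-13**|[A Single-Frame and Multi-Frame Cascaded Image Super-Resolution Method](http://arxiv.org/abs/2412.09846)|null|The objective of image super-resolution is to reconstruct a high-resolution (HR) image from one or more low-resolution (LR) images using prior knowledge. In the real world, however, because complementary information is limited, the performance of both single-frame and multi-frame super-resolution reconstruction degrades rapidly as the magnification factor increases. In this paper, we propose a novel two-step image super-resolution method that concatenates multi-frame super-resolution (MFSR) with single-frame super-resolution (SFSR) to progressively upsample images to the desired resolution. The proposed method consists of an L0-norm constrained reconstruction scheme and an enhanced residual back-projection network, combining the flexibility of variational model-based methods with the feature learning capacity of deep learning-based methods. To verify the effectiveness of the proposed algorithm, we conducted extensive experiments on both simulated and real-world sequences. The results show that the method delivers superior performance in both objective and perceptual quality measurements. The cascaded model achieves average PSNRs of 33.413 dB and 29.658 dB on Set5 and Set14, which are 0.76 dB and 0.621 dB higher than the baseline method, respectively. In addition, the experiments indicate that the cascaded model can be robustly applied to different SFSR and MFSR methods.||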
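|**2024-12-13**|[Super-Resolution for Remote Sensing Imagery via the Coupling of a Variational Model and Deep Learning](http://arxiv.org/abs/2412.09841)|null|Image super-resolution (SR) is an effective way to enhance the spatial resolution and detail of remote sensing imagery and obtain better visual quality. Because SR is a severely ill-posed problem, effective image priors are necessary to regularize the solution space and generate the corresponding high-resolution (HR) image. This paper proposes a novel gradient-guided multi-frame super-resolution (MFSR) framework for remote sensing image reconstruction. The framework integrates a learned gradient prior as a regularization term into a model-based optimization method. Specifically, the local gradient regularization (LGR) prior is derived from a deep residual attention network (DRAN) through gradient profile transformation. The non-local total variation (NLTV) prior is characterized using the spatial structural similarity of gradient patches with a maximum a posteriori (MAP) model. The modeled prior performs well in preserving edge smoothness and suppressing visual artifacts, while the learned prior is effective in enhancing sharp edges and recovering fine structures. By incorporating these two complementary priors into an adaptive-norm-based reconstruction framework, the mixed L1 and L2 regularization minimization problem is optimized to obtain the desired HR remote sensing image. Extensive experimental results on remote sensing data show that the proposed method produces visually pleasing images and outperforms several state-of-the-art SR algorithms in quantitative evaluation.||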
|**2024-12-13**|[CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection](http://arxiv.org/abs/2412.09799)|null|Recent research on universal object detection aims to introduce language in a SoTA closed-set detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to genericise objects and (ii) how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios, with only one pre-training weight. Specifically, we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules. Then, the hybrid encoder is facilitated to fully utilize the prompted information by prompt multi-label loss and auxiliary detection head. In addition to text prompts, we have designed two practical concept prompt generation methods, visual prompt and optimized prompt, to extract abstract concepts through concrete visual examples and stably reduce alignment bias in downstream tasks. With these effective designs, CP-DETR demonstrates superior universal detection performance in a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val by interactive detection, and the optimized prompt achieves 73.1 fully-shot AP on ODinW13.||
|**2024-12-12**|[DQA: An Efficient Method for Deep Quantization of Deep Neural Network Activations](http://arxiv.org/abs/2412.09687)|null|Quantization of Deep Neural Network (DNN) activations is a commonly used technique to reduce compute and memory demands during DNN inference, which can be particularly beneficial on resource-constrained devices. To achieve high accuracy, existing methods for quantizing activations rely on complex mathematical computations or perform extensive searches for the best hyper-parameters. However, these expensive operations are impractical on devices with limited computation capabilities, memory capacities, and energy budgets. Furthermore, many existing methods do not focus on sub-6-bit (or deep) quantization. To fill these gaps, in this paper we propose DQA (Deep Quantization of DNN Activations), a new method that focuses on sub-6-bit quantization of activations and leverages simple shifting-based operations and Huffman coding to be efficient and achieve high accuracy. We evaluate DQA with 3, 4, and 5-bit quantization levels and three different DNN models for two different tasks, image classification and image segmentation, on two different datasets. DQA shows significantly better accuracy (up to 29.28%) compared to the direct quantization method and the state-of-the-art NoisyQuant for sub-6-bit quantization.||
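|**2024-12-12**|[OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs](http://arxiv.org/abs/2412.09465)|**[link](https://github.com/yuanzhi-zhu/oftsr)**|Recent advances in diffusion and flow-based generative models have achieved remarkable success in image restoration tasks, delivering superior perceptual quality compared with traditional deep learning approaches. However, these methods either require many sampling steps to generate high-quality images, resulting in significant computational overhead, or rely on model distillation, which usually imposes a fixed fidelity-realism trade-off and therefore lacks flexibility. In this paper, we introduce OFTSR, a novel flow-based framework for one-step image super-resolution that can produce outputs with tunable levels of fidelity and realism. Our approach first trains a conditional flow-based super-resolution model to serve as the teacher model. We then distill this teacher model by applying a specialized constraint. Specifically, we force the predictions of our one-step student model for a given input to lie on the same sampling ordinary differential equation (ODE) trajectory as the teacher model. This alignment ensures that the student model's single-step prediction from the initial state matches the teacher's prediction from a closer intermediate state. Through extensive experiments on challenging datasets including FFHQ (256×256), DIV2K, and ImageNet (256×256), we demonstrate that OFTSR achieves state-of-the-art performance for one-step image super-resolution while allowing the fidelity-realism trade-off to be adjusted flexibly. Code and pretrained models are available at https://github.com/yuanzhi-zhu/OFTSR and https://huggingface.co/Yuanzhi/OFTSR, respectively.||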
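|**2024-12-12**|[Distribution free uncertainty quantification in neuroscience-inspired deep operators](http://arxiv.org/abs/2412.09369)|null|Energy-efficient deep learning algorithms are essential for a sustainable future and for feasible edge computing setups. Spiking neural networks (SNNs), inspired by neuroscience, are a positive step toward achieving the required energy efficiency. However, in exchange for lower energy demand, a small amount of accuracy is sacrificed. The predictions of such deep learning algorithms therefore need an uncertainty measure that informs users of the bounds on a given output. In this paper, we introduce the Conformalized Randomized Prior Operator (CRP-O) framework, which leverages Randomized Prior (RP) networks and Split Conformal Prediction (SCP) to quantify uncertainty in both conventional and spiking neural operators. To further enable zero-shot super-resolution in uncertainty quantification (UQ), we propose an extension that incorporates Gaussian Process Regression. This enhanced, super-resolution-enabled CRP-O framework is integrated with the recently developed Variable Spiking Wavelet Neural Operator (VSWNO). To test the performance of the resulting calibrated uncertainty bounds, we discuss four different examples covering both one-dimensional and two-dimensional partial differential equations. The results show that the uncertainty bounds produced by the conformalized RP-VSWNO significantly improve UQ estimates compared with vanilla RP-VSWNO, Quantile WNO (Q-WNO), and Conformalized Quantile WNO (CQ-WNO). These findings underscore the potential of the proposed approach for practical applications.||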
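|**2024-12-12**|[Advancing Attribution-Based Neural Network Explainability through Relative Absolute Magnitude Layer-Wise Relevance Propagation and Multi-Component Evaluation](http://arxiv.org/abs/2412.09311)|**[link](https://github.com/davor10105/relative-absolute-magnitude-propagation)**|Recent advances in deep neural network performance have led to new state-of-the-art approaches in many areas. However, the black-box nature of neural networks often prevents their use in areas where model explainability and model transparency are crucial. Over the years, researchers have proposed many algorithms to aid neural network understanding and provide additional information to human experts. One of the most popular methods is Layer-Wise Relevance Propagation (LRP). This method assigns local relevance based on the pixel-wise decomposition of nonlinear classifiers. With the rise of research on attribution methods, there is a pressing need to assess their performance. Numerous metrics have been proposed, each assessing an individual property of attribution methods, such as faithfulness, robustness, or localization. Unfortunately, no single metric is considered better than the others for all cases, and researchers usually use several metrics to test the quality of attribution maps. In this work, we address the shortcomings of current LRP formulations and introduce a novel method for determining the relevance of input neurons through layer-wise relevance propagation. Furthermore, we apply this approach to the recently developed Vision Transformer architecture and evaluate its performance against existing methods on two image classification datasets, namely ImageNet and PascalVOC. Our results clearly demonstrate the advantages of the proposed method. In addition, we discuss the shortcomings of current evaluation metrics for attribution-based explainability and propose a new evaluation metric that combines the notions of faithfulness, robustness, and contrastiveness. We use this new metric to evaluate the performance of various attribution-based methods. Our code is available at: https://github.com/davor10105/relative-absolute-magnitude-propagation||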
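|**2024-12-12**|[FD2-Net: Frequency-Driven Feature Decomposition Network for Infrared-Visible Object Detection](http://arxiv.org/abs/2412.09258)|null|Infrared-visible object detection (IVOD) seeks to exploit the complementary information in infrared and visible images, thereby improving detector performance in complex environments. However, existing methods often neglect the frequency characteristics of this complementary information, such as the abundant high-frequency details in visible images and the valuable low-frequency thermal information in infrared images, which limits detection performance. To solve this problem, we propose a novel Frequency-Driven Feature Decomposition Network for IVOD, called FD2-Net, which effectively captures the distinctive frequency representations of complementary information across the multimodal visual spaces. Specifically, we propose a feature decomposition encoder in which the high-frequency unit (HFU) uses the discrete cosine transform to capture representative high-frequency features, while the low-frequency unit (LFU) employs dynamic receptive fields to model the multi-scale context of diverse objects. Next, we adopt a parameter-free complementary strengths strategy to enhance multimodal features through seamless inter-frequency recoupling. Furthermore, we design a multimodal reconstruction mechanism that recovers image details lost during feature extraction, further exploiting the complementary information in infrared and visible images to strengthen overall representational capacity. Extensive experiments show that FD2-Net outperforms state-of-the-art (SOTA) models on various IVOD benchmarks, namely LLVIP (96.2% mAP), FLIR (82.9% mAP), and M3FD (83.5% mAP).||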
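|**2024-12-12**|[UADet: A Remarkably Simple Yet Effective Uncertainty-Aware Open-Set Object Detection Framework](http://arxiv.org/abs/2412.09229)|null|We tackle the challenging problem of Open-Set Object Detection (OSOD), which aims to detect both known and unknown objects in unlabelled images. The main difficulty lies in the absence of supervision for these unknown classes, which makes it hard to distinguish them from the background. Existing OSOD detectors either fail to properly exploit, or inadequately leverage, the abundant unlabelled unknown objects in the training data, which limits their performance. To address these limitations, we propose UADet, an Uncertainty-Aware Open-Set Object Detector that considers both appearance and geometric uncertainty. By integrating these uncertainty measures, UADet effectively reduces the number of unannotated instances that previous methods incorrectly used or omitted. Extensive experiments on OSOD benchmarks show that UADet substantially outperforms previous state-of-the-art (SOTA) methods in detecting both known and unknown objects, achieving a 1.8x improvement in unknown recall while maintaining high performance on known classes. When extended to Open World Object Detection (OWOD), our method shows significant advantages over the current SOTA methods, with average improvements in unknown recall of 13.8% and 6.9% on the M-OWODB and S-OWODB benchmarks, respectively. Extensive results validate the effectiveness of our uncertainty-aware approach across different open-set scenarios.||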
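|**2024-12-12**|[An Efficient Framework for Enhancing Discriminative Models via Diffusion Techniques](http://arxiv.org/abs/2412.09063)|null|Image classification is a cornerstone of computer vision and has traditionally been achieved with discriminative models based on deep neural networks. Recent advances have introduced classification methods derived from generative models, which offer the advantage of zero-shot classification. However, these methods suffer from two main drawbacks: high computational overhead and inferior performance compared with discriminative models. Inspired by the coordinated cognitive processes of fast and slow pathway interactions in the human brain during visual signal recognition, we propose the Diffusion-Based Discriminative Model Enhancement Framework (DBMEF). This framework seamlessly integrates discriminative and generative models in a training-free manner, using the discriminative model for the initial prediction and endowing deep neural networks with the ability to reflect through diffusion models. As a result, DBMEF can effectively improve the classification accuracy and generalization capability of discriminative models in a plug-and-play manner. We conducted extensive experiments on 17 popular deep model architectures with different training methods, including CNN-based models such as ResNet and Transformer-based models such as ViT, to demonstrate the effectiveness of the proposed DBMEF. Specifically, the framework improves the performance of ResNet-50 by 1.51% on the ImageNet dataset and by 3.02% on the ImageNet-A dataset. In conclusion, our research introduces a new paradigm for image classification, showing stable improvements across different datasets and neural networks.||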
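|**2024-12-12**|[ContextHOI: Spatial Context Learning for Human-Object Interaction Detection](http://arxiv.org/abs/2412.09050)|null|Spatial context, such as the background and surroundings, is considered critical in Human-Object Interaction (HOI) recognition, especially when the instance-centric foreground is blurred or occluded. Recent advances in HOI detectors are usually built on detection transformer pipelines. While this object-detection-oriented paradigm is promising for localizing objects, its exploration of spatial context is often insufficient for accurately recognizing human actions. To enhance the capabilities of object detectors for HOI detection, we present a dual-branch framework named ContextHOI, which efficiently captures both object detection features and spatial context. In the context branch, we train the model to extract informative spatial context without requiring additional hand-crafted background labels. Furthermore, we introduce context-aware spatial and semantic supervision to the context branch to filter out irrelevant noise and capture informative context. ContextHOI achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks. For further validation, we construct a new benchmark, HICO-ambiguous, a subset of HICO-DET that contains images with occluded or impaired instance cues. Extensive experiments across all benchmarks, together with visualizations, underscore the improvements brought by ContextHOI, especially in recognizing interactions that involve occluded or blurred instances.||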
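|**2024-12-12**|[Arbitrary-steps Image Super-resolution via Diffusion Inversion](http://arxiv.org/abs/2412.09013)|**[link](https://github.com/zsyoaoa/invsr)**|This study presents a new image super-resolution (SR) technique based on diffusion inversion, which aims to harness the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point. Central to our approach is a deep noise predictor that estimates the optimal noise maps for the forward diffusion process. Once trained, this noise predictor can be used to partially initialize the sampling process along the diffusion trajectory, generating the desired high-resolution result. Compared with existing approaches, our method offers a flexible and efficient sampling mechanism that supports an arbitrary number of sampling steps, ranging from one to five. Even with a single sampling step, our method demonstrates performance superior or comparable to recent state-of-the-art approaches. The code and model are publicly available at https://github.com/zsyOAOA/InvSR.||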
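|**2024-12-11**|[DALI: Domain Adaptive LiDAR Object Detection via Distribution-level and Instance-level Pseudo Label Denoising](http://arxiv.org/abs/2412.08806)|**[link](https://github.com/xiaohulugo/t-ro2024-dali)**|Object detection on LiDAR point clouds relies on a large number of annotated samples to train the deep neural networks underlying the detectors. However, generating 3D bounding box annotations for a large-scale dataset is costly and time-consuming. An alternative is unsupervised domain adaptation (UDA), which enables a given object detector to operate on a new, unlabeled training dataset by transferring the knowledge learned from training on labeled source domain data to the new unlabeled target domain. Pseudo label strategies, which train the 3D object detector using bounding boxes predicted on the target domain by a pre-trained model, are widely used in UDA. However, these pseudo labels often introduce noise that degrades performance. In this paper, we introduce the Domain Adaptive LiDAR (DALI) object detection framework to address noise at both the distribution level and the instance level. First, we develop a post-training size normalization (PTSN) strategy that mitigates bias in the pseudo label size distribution by identifying an unbiased scale after network training. To address instance-level noise between pseudo labels and their corresponding point clouds, we develop two pseudo point cloud generation (PPCG) strategies, ray-constrained and constraint-free, which generate pseudo point clouds for each instance and ensure consistency between pseudo labels and pseudo points during training. We demonstrate the effectiveness of our method on the popular, publicly available datasets KITTI, Waymo, and nuScenes. The results show that the proposed DALI framework achieves state-of-the-art results on most domain adaptation tasks and outperforms leading approaches. Our code is available at https://github.com/xiaohulugo/T-RO2024-DALI.||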
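|**2024-12-11**|[Utilizing Multi-step Loss for Single Image Reflection Removal](http://arxiv.org/abs/2412.08582)|**[link](https://github.com/AbdelrhmanElnenaey/SIRR_MSloss_RefGAN_RDM)**|Image reflection removal is crucial for restoring image quality. Distorted images can negatively affect tasks such as object detection and image segmentation. In this paper, we present a novel approach for removing image reflections using a single image. Instead of focusing on model architecture, we introduce a new training technique that can be generalized to image-to-image problems whose input and output are similar in nature. This technique is embodied in our multi-step loss mechanism, which has proven effective in the reflection removal task. In addition, we address the scarcity of reflection removal training data by synthesizing a high-quality, non-linear synthetic dataset called RefGAN using Pix2Pix GAN. This dataset significantly enhances the model's ability to learn better patterns for reflection removal. We also use a ranged depth map, extracted from the depth estimation of the ambient image, as an auxiliary feature, exploiting the fact that it lacks depth estimations of reflections. Our approach demonstrates superior performance on the SIR^2 benchmark and other real-world datasets, proving its effectiveness by outperforming other state-of-the-art models.||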
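|**2024-12-10**|[Leveraging Content and Context Cues for Low-Light Image Enhancement](http://arxiv.org/abs/2412.07693)|**[link](https://github.com/igor-morawski/tmm-sem)**|Low-light conditions adversely affect machine cognition, limiting the performance of computer vision systems in real life. Since low-light data is limited and difficult to annotate, we focus on image processing to enhance low-light images and improve the performance of any downstream task model, rather than fine-tuning each model, which can be prohibitively expensive. We propose to improve existing zero-reference low-light enhancement by leveraging the CLIP model to capture an image prior and provide semantic guidance. Specifically, we propose a data augmentation strategy that learns an image prior via prompt learning based on image sampling, so that the prior can be learned without any paired or unpaired normal-light data. Next, we propose a semantic guidance strategy that makes maximal use of existing low-light annotations by introducing both content and context cues about the image training patches. We show experimentally, in a qualitative study, that the proposed prior and semantic guidance help improve overall image contrast and hue, as well as background-foreground discrimination, reducing the over-saturation and noise over-amplification that are common in related zero-reference methods. Because we target machine cognition rather than relying on an assumed correlation between human perception and downstream task performance, we conduct ablation studies and comparisons with related zero-reference methods based on task-based performance across many low-light datasets, including image classification and object and face detection, demonstrating the effectiveness of the proposed method.||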
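|**2024-12-10**|[Enhancing 3D Object Detection in Autonomous Vehicles Based on Synthetic Virtual Environment Analysis](http://arxiv.org/abs/2412.07509)|null|Autonomous Vehicles (AVs) use natural images and videos as input to understand the real world by overlaying and inferring digital elements, facilitating proactive detection to ensure safety. A crucial aspect of this process is real-time, accurate object recognition through automatic scene analysis. While traditional methods have focused primarily on 2D object detection, exploring 3D object detection, which involves projecting 3D bounding boxes into the three-dimensional environment, is of significant importance and can be notably enhanced using the AR ecosystem. This study examines an AI model's ability to infer 3D bounding boxes in real-time scene analysis, while producing and evaluating the model's performance and processing time in the virtual domain, which is then applied to AVs. The work also uses a synthetic dataset containing artificially generated images that mimic various environmental, lighting, and spatiotemporal states. The evaluation is geared toward handling images of objects captured with different camera settings under diverse weather conditions. These variations pose more challenging detection and recognition scenarios, and the outcomes of this work help achieve competitive results under most of the tested conditions.||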
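|**2024-12-10**|[DSFEC: Efficient and Deployable Deep Radar Object Detection](http://arxiv.org/abs/2412.07411)|null|Deploying radar object detection models on resource-constrained edge devices such as the Raspberry Pi poses significant challenges because of the large size of the models and the limited compute and memory of the Raspberry Pi. In this work, we explore the efficiency of depthwise separable convolutions in radar object detection networks and integrate them into our model. In addition, we introduce a novel Feature Enhancement and Compression (FEC) module into the PointPillars feature encoder to further improve model performance. Building on these innovations, we propose the DSFEC-L model and two versions of it, which outperform the baseline (23.9 mAP for the Car class, 20.72 GFLOPs) on the nuScenes dataset: 1) the efficient DSFEC-M model, with a 14.6% performance improvement and a 60% reduction in GFLOPs; 2) the deployable DSFEC-S model, with a 3.76% performance improvement and a substantial 78.5% reduction in GFLOPs. Although its performance gain is marginal, our deployable model achieves an impressive 74.5% reduction in runtime on the Raspberry Pi compared with the baseline.||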
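|**2024-12-10**|[Benchmarking Vision-Based Object Tracking for USVs in Complex Maritime Environments](http://arxiv.org/abs/2412.07392)|null|Vision-based target tracking is crucial for unmanned surface vehicles (USVs) performing tasks such as inspection, monitoring, and surveillance. However, real-time tracking in complex maritime environments is challenging because of dynamic camera movement, low visibility, and scale variation. Typically, object detection methods combined with filtering techniques are used for tracking, but they often lack robustness, particularly in the presence of camera motion and missed detections. Although advanced tracking methods have been proposed recently, their application in maritime scenarios remains limited. To address this gap, this study proposes a vision-guided object-tracking framework for USVs that integrates state-of-the-art tracking algorithms with low-level control systems to enable precise tracking in dynamic maritime environments. We benchmark the performance of seven distinct trackers, developed using advanced deep learning techniques such as Siamese networks and Transformers, by evaluating them on both simulated and real-world maritime datasets. In addition, we evaluate the robustness of various control algorithms when used in conjunction with these tracking systems. The proposed framework is validated through simulations and real-world sea experiments, demonstrating its effectiveness in handling dynamic maritime conditions. The results show that SeqTrack, a Transformer-based tracker, performs best in adverse conditions such as dust storms. Among the control algorithms evaluated, the linear quadratic regulator (LQR) controller delivers the most robust and smooth control, allowing stable tracking by the USV.||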
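|**2024-12-10**|[Image Classification Using Singular Value Decomposition and Optimization](http://arxiv.org/abs/2412.07288)|**[link](https://github.com/isabelayepes/Image-Classification-Using-Singular-Value-Decomposition-and-Optimization)**|This study investigates the applicability of Singular Value Decomposition to the image classification of specific breeds of cats and dogs, using fur color as the primary identifying feature. Sequential Quadratic Programming (SQP) is employed to construct optimally weighted templates. The method achieves an accuracy of 69% using the Frobenius norm at rank 10. The results partially validate the assumption that low-rank approximations can effectively capture dominant features such as fur color. However, the accuracy suggests that additional features or methods may be required for more robust classification, highlighting the trade-off between simplicity and performance in resource-constrained environments.||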
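|**2024-12-10**|[MPSI: Mamba enhancement model for pixel-wise sequential interaction Image Super-Resolution](http://arxiv.org/abs/2412.07222)|null|Single image super-resolution (SR) has long been a challenge in computer vision. While the advent of deep learning has brought numerous methods aimed at this persistent problem, current approaches still struggle to model long-sequence information, which limits their ability to capture global pixel interactions effectively. To tackle this challenge and achieve better SR results, we propose the Mamba pixel-wise sequential interaction network (MPSI), which aims to strengthen long-range connections of information, with a particular focus on pixel-wise sequential interaction. We propose the Channel-Mamba Block (CMB), which captures comprehensive pixel interaction information by effectively modeling long-sequence information. Moreover, existing SR methods still tend to neglect features extracted by earlier layers, leading to the loss of valuable feature information. Although some existing models try to preserve these features, they often struggle to establish connections across all layers. To overcome this limitation, MPSI introduces the Mamba channel recursion module (MCRM), which maximizes the retention of valuable feature information from early layers and thereby facilitates the acquisition of pixel sequence interaction information from multi-level layers. Through extensive experiments, we show that MPSI outperforms existing super-resolution methods in image reconstruction results, achieving state-of-the-art performance.||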
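|**2024-12-10**|[A Progressive Image Restoration Network for High-order Degradation Imaging in Remote Sensing](http://arxiv.org/abs/2412.07195)|null|In recent years, deep learning methods have achieved remarkable results in remote sensing (RS) image restoration. However, most existing RS image restoration methods focus mainly on conventional first-order degradation models, which may not effectively capture the imaging mechanisms of remote sensing images. Furthermore, many deep learning-based RS image restoration approaches are often criticized for their lack of architectural transparency and model interpretability. To address these problems, we propose a novel progressive restoration network for high-order degradation imaging (HDI-PRNet) that restores the different image degradations step by step. HDI-PRNet is developed on the theoretical framework of degradation imaging, offering the benefit of mathematical interpretability within the unfolding network. The framework consists of three main components: an image denoising module that relies on proximal mapping prior learning, an image deblurring module that integrates Neumann series expansion with dual-domain degradation learning, and a module for super-resolution. Extensive experiments show that our method achieves superior performance on both synthetic and real remote sensing images.||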
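|**2024-12-10**|[Hero-SR: One-Step Diffusion for Super-Resolution with Human Perception Priors](http://arxiv.org/abs/2412.07152)|null|Owing to the powerful priors of diffusion models, recent approaches have shown promise in addressing real-world super-resolution (Real-SR). However, achieving the semantic consistency and perceptual naturalness needed to meet human perception demands remains difficult, especially under severe image degradation and varied input complexity. To address this, we propose Hero-SR, a one-step diffusion-based SR framework that explicitly incorporates human perception priors. Hero-SR consists of two novel modules: the Dynamic Time-Step Module (DTSM) and the Open-World Multi-modality Supervision (OWMS). DTSM adaptively selects the optimal diffusion step to flexibly meet human perceptual standards, while OWMS integrates guidance from both the image and text domains through CLIP to improve semantic consistency and perceptual naturalness. With these modules, Hero-SR generates high-resolution images that not only preserve intricate details but also reflect human perceptual preferences. Extensive experiments confirm that Hero-SR achieves state-of-the-art performance in Real-SR. The code will be publicly released upon acceptance of the paper.||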
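|**2024-12-10**|[RAP-SR: RestorAtion Prior Enhancement in Diffusion Models for Realistic Image Super-Resolution](http://arxiv.org/abs/2412.07149)|**[link](https://github.com/W-JG/RAP-SR)**|Benefiting from their powerful generative capabilities, pretrained diffusion models have attracted wide attention for real-world image super-resolution (Real-SR). Existing diffusion-based SR approaches typically use semantic information from degraded images together with restoration prompts to activate the prior and generate realistic high-resolution images. However, general-purpose pretrained diffusion models, which are not designed for restoration tasks, often have suboptimal priors, and manually defined prompts may fail to fully exploit their generative potential. To address these limitations, we introduce RAP-SR, a novel restoration prior enhancement approach for pretrained diffusion models in Real-SR. First, we develop the High-Fidelity Aesthetic Image Dataset (HFAID) through a Quality-Driven Aesthetic Image Selection Pipeline (QDAISP). Our dataset surpasses existing datasets not only in fidelity but also in aesthetic quality. Second, we propose the Restoration Priors Enhancement Framework, which includes the Restoration Priors Refinement (RPR) and Restoration-Oriented Prompt Optimization (ROPO) modules. RPR refines the restoration prior using HFAID, while ROPO optimizes a unique restoration identifier, improving the quality of the generated images. By enhancing the restoration prior, RAP-SR effectively bridges the gap between general-purpose models and the demands of Real-SR. Thanks to the plug-and-play nature of RAP-SR, our approach can be seamlessly integrated into existing diffusion-based SR methods and boost their performance. Extensive experiments demonstrate its broad applicability and state-of-the-art results. Code and datasets will be made available upon acceptance of the paper.||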
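|**2024-12-09**|[Convolution goes higher-order: a biologically inspired mechanism empowers image classification](http://arxiv.org/abs/2412.06740)|null|We propose a novel approach to image classification inspired by complex nonlinear biological visual processing, in which classical convolutional neural networks (CNNs) are equipped with learnable higher-order convolutions. Our model incorporates a Volterra-like expansion of the convolution operator, capturing multiplicative interactions akin to those observed in the early and advanced stages of biological visual processing. We evaluate this approach on synthetic datasets by measuring sensitivity to testing higher-order correlations, and through performance on standard benchmarks (MNIST, FashionMNIST, CIFAR10, CIFAR100, and Imagenette). Our architecture outperforms traditional CNN baselines and achieves optimal performance with expansions up to 3rd/4th order, which aligns remarkably well with the distribution of pixel intensities in natural images. Through systematic perturbation analysis, we validate this alignment by isolating the contributions of specific image statistics to model performance, showing how convolutions of different orders process distinct aspects of visual information. Furthermore, Representational Similarity Analysis reveals distinct geometries across network layers, indicating qualitatively different modes of visual information processing. Our work bridges neuroscience and deep learning, offering a path toward more effective, biologically inspired computer vision models. It provides insights into visual information processing and lays the groundwork for neural networks that better capture complex visual patterns, particularly in resource-constrained scenarios.||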
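|**2024-12-06**|[From classical techniques to convolution-based models: A review of object detection algorithms](http://arxiv.org/abs/2412.05252)|null|Object detection is a fundamental task in computer vision and image understanding, with the goal of identifying and localizing objects of interest within an image and assigning them corresponding class labels. Traditional methods, which rely on handcrafted features and shallow models, struggle with complex visual data and show limited performance. These methods combine low-level features with contextual information and lack the ability to capture high-level semantics. Deep learning, especially Convolutional Neural Networks (CNNs), addresses these limitations by automatically learning rich, hierarchical features directly from data. These features include both semantic and high-level representations that are essential for accurate object detection. This paper reviews object detection frameworks, starting with classical computer vision methods. We categorize object detection approaches into two groups: (1) classical computer vision techniques and (2) CNN-based detectors. We compare the major CNN models and discuss their strengths and limitations. In conclusion, this review highlights the significant advances that deep learning has brought to object detection and identifies key areas for further research to improve performance.||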
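|**2024-12-06**|[MSECG: Incorporating Mamba for Robust and Efficient ECG Super-Resolution](http://arxiv.org/abs/2412.04861)|null|Electrocardiogram (ECG) signals play a crucial role in diagnosing cardiovascular diseases. To reduce the power consumption of wearable or portable devices used for long-term ECG monitoring, researchers have developed super-resolution (SR) techniques that allow these devices to collect and transmit signals at a lower sampling rate. In this study, we propose MSECG, a compact neural network model designed for ECG SR. MSECG combines the strengths of the recurrent Mamba model with convolutional layers to capture both local and global dependencies in ECG waveforms, allowing high-resolution signals to be reconstructed effectively. We also assess the model's performance under realistic noisy conditions, using ECG data from the PTB-XL database and noise data from the MIT-BIH Noise Stress Test Database. Experimental results show that MSECG outperforms two contemporary ECG SR models under both clean and noisy conditions while using fewer parameters, offering a more powerful and robust solution for long-term ECG monitoring applications.||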
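|**2024-12-06**|[MTSpark: Enabling Multi-Task Learning with Spiking Neural Networks for Generalist Agents](http://arxiv.org/abs/2412.04847)|null|Currently, state-of-the-art reinforcement learning (RL) methods excel in single-task settings, but they still struggle to generalize across multiple tasks because of the challenge of catastrophic forgetting, in which previously learned tasks are forgotten as new tasks are introduced. This multi-task learning capability is essential for generalist agents, where adaptive features are highly required (for example, autonomous robots). Meanwhile, spiking neural networks (SNNs) have emerged as an alternative, energy-efficient class of neural network algorithms thanks to their sparse, spike-based operations. To this end, we propose MTSpark, a novel methodology for enabling multi-task RL with spiking networks. Specifically, MTSpark develops a deep spiking Q-network (DSQN) with active dendrites and a dueling structure by leveraging task-specific context signals. Each neuron computes task-dependent activations that dynamically modulate the inputs, forming specialized sub-networks for each task. Moreover, this biologically plausible network model also benefits from SNNs, improving energy efficiency and making the model suitable for hardware implementation. Experimental results show that MTSpark can effectively learn multiple tasks with higher performance than the state of the art. Specifically, MTSpark achieves high scores in three Atari games (Pong: -5.4, Breakout: 0.6, and Enduro: 371.2), reaching human-level performance (Pong: -3, Breakout: 31, and Enduro: 368), which the state of the art struggles to achieve. In addition, MTSpark also shows higher accuracy than the state of the art in image classification tasks. These results highlight the potential of the MTSpark methodology for developing generalist agents that can learn multiple tasks by leveraging both RL and SNN concepts.||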
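|**2024-12-05**|[2.5D Super-Resolution Approaches for X-ray Computed Tomography-based Inspection of Additively Manufactured Parts](http://arxiv.org/abs/2412.04525)|null|X-ray computed tomography (XCT) is a key tool for the non-destructive evaluation of additively manufactured (AM) parts, allowing internal inspection and defect detection. Despite the widespread use of XCT, acquiring high-resolution CT scans is very time-consuming. This issue can be mitigated by scanning at lower resolution; however, reducing the resolution compromises spatial detail and limits the accuracy of defect detection. Super-resolution algorithms offer a promising way to overcome resolution limitations in XCT reconstructions of AM parts, enabling more accurate detection of defects. While 2D super-resolution methods achieve state-of-the-art performance on natural images, they tend to underperform when applied directly to XCT slices. On the other hand, 3D super-resolution methods are computationally expensive, making them infeasible for large-scale applications. To address these challenges, we propose a 2.5D super-resolution approach tailored to XCT of AM parts. Our method uses information from multiple adjacent 2D slices to increase the resolution of individual slices without the heavy computational overhead of full 3D methods. Specifically, we use neighboring low-resolution slices to super-resolve the center slice, exploiting inter-slice spatial context while maintaining computational efficiency. This approach bridges the gap between 2D and 3D methods, offering a practical solution for high-throughput defect detection in AM parts.||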
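|**2024-12-05**|[Cubify Anything: Scaling Indoor 3D Object Detection](http://arxiv.org/abs/2412.04458)|null|We study indoor 3D object detection from a single RGB(-D) frame acquired with a commodity handheld device. We aim to significantly advance the status quo in terms of both data and modeling. First, we establish that existing datasets have significant limitations in scale, accuracy, and diversity of objects. We therefore introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects in over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer (CuTR), a fully Transformer-based 3D object detection baseline which, rather than operating on point-based or voxel-based 3D representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that, paired with CA-1M, CuTR outperforms point-based methods, accurately recalling over 62% of objects in 3D, and is significantly more capable of handling the noise and uncertainty present in commodity LiDAR-derived depth maps, while also providing promising RGB-only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D, which supports the notion that although 3D inductive biases are useful at the smaller scale of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving toward models that can effectively "Cubify Anything".||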
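|**2024-12-05**|[Grounding Descriptions in Images informs Zero-Shot Visual Recognition](http://arxiv.org/abs/2412.04429)|**[link](https://github.com/shaunak27/grain-clip)**|Vision-language models (VLMs) like CLIP are prized for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation has the highest similarity with the query image. While successful in some domains, this approach struggles to identify fine-grained entities and to generalize to unseen concepts not covered by the training distribution. Recent work attempts to mitigate these challenges by integrating category descriptions at test time, albeit with only modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy that aims to align representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions and to align overarching captions with global image representations. To drive this pretraining, we leverage frozen Multimodal Large Language Models (MLLMs) to obtain large-scale synthetic annotations. We demonstrate the enhanced zero-shot performance of our model compared with current state-of-the-art methods across 11 diverse image classification datasets. In addition, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and demonstrate our model's ability to recognize these concepts by benchmarking on it. Significant improvements achieved by our model on other downstream tasks such as retrieval further highlight the superior quality of the representations learned by our approach. Code is available at https://github.com/shaunak27/grain-clip.||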
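|**2024-12-05**|[FedDUAL: A Dual-Strategy with Adaptive Loss and Dynamic Aggregation for Mitigating Data Heterogeneity in Federated Learning](http://arxiv.org/abs/2412.04416)|**[link](https://github.com/Pranabiitp/FedDUAL)**|Federated Learning (FL) marks a transformative approach to distributed model training by combining locally optimized models from various clients into a unified global model. While FL preserves data privacy by eliminating centralized storage, it faces significant challenges such as performance degradation, slower convergence, and reduced robustness of the global model caused by heterogeneity in client data distributions. Among the various forms of data heterogeneity, label skew emerges as a particularly formidable and prevalent issue, especially in domains such as image classification. To address these challenges, we begin with comprehensive experiments to pinpoint the underlying issues in the FL training process. Based on our findings, we introduce an innovative dual-strategy approach designed to effectively resolve these issues. First, we introduce an adaptive loss function for client-side training, carefully crafted to preserve previously acquired knowledge while maintaining an optimal balance between local optimization and global model coherence. Second, we develop a dynamic aggregation strategy for aggregating client models at the server. This approach adapts to each client's unique learning patterns, effectively addressing the challenges of diverse data across the network. Our comprehensive evaluation on three diverse real-world datasets, coupled with theoretical convergence guarantees, demonstrates the superior efficacy of our method compared with several established state-of-the-art approaches.||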
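|**2024-12-05**|[Reflective Teacher: Semi-Supervised Multimodal 3D Object Detection in Bird's-Eye-View via Uncertainty Measure](http://arxiv.org/abs/2412.04337)|null|Applying pseudo-labeling techniques has proven advantageous for semi-supervised 3D object detection (SSOD) in Bird's-Eye-View (BEV) for autonomous driving, particularly when labeled data is limited. In the existing literature, the exponential moving average (EMA) has been used to adjust the weights of the teacher network based on the student network. However, this causes catastrophic forgetting in the teacher network. In this work, we address this issue by introducing a novel Reflective Teacher concept, in which the student is trained on both labeled and pseudo-labeled data while its knowledge is progressively passed to the teacher through a regularizer to ensure that previous knowledge is retained. In addition, we propose Geometry Aware BEV Fusion (GA-BEVFusion) for efficient alignment of multi-modal BEV features, thereby reducing the disparity between the camera and LiDAR modalities. This helps map the precise geometric information embedded in LiDAR points reliably onto the spatial priors used to extract semantic information from camera images. Our experiments on the nuScenes and Waymo datasets demonstrate: 1) improved performance over the state-of-the-art methods in both fully supervised and semi-supervised settings; 2) Reflective Teacher achieves performance equivalent to other fully supervised methods that use the full labeled dataset while using only 25% and 22% of the labeled data for nuScenes and Waymo, respectively.||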
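|**2024-12-05**|[YOLO-CCA: A Context-Based Approach for Traffic Sign Detection](http://arxiv.org/abs/2412.04289)|**[link](https://github.com/zippiest/yolo-cca)**|Traffic sign detection is crucial for improving road safety and advancing autonomous driving technology. Because of the complexity of driving environments, traffic sign detection frequently faces a range of challenges, including low resolution, limited feature information, and small object sizes. These challenges significantly hinder the effective extraction of features from traffic signs, resulting in false positives and false negatives in object detection. To address these challenges, it is essential to explore more efficient and accurate approaches to traffic sign detection. This paper proposes a context-based algorithm for traffic sign detection that uses YOLOv7 as the baseline model. First, we propose an adaptive local context feature enhancement (LCFE) module that uses multi-scale dilated convolution to capture potential relationships between the object and its surrounding areas. This module supplements the network with additional local context information. Second, we propose a global context feature collection (GCFC) module to extract key location features from the entire image scene as global context information. Finally, we build a Transformer-based context collection augmentation (CCA) module to process the collected local and global context, which achieves superior multi-level feature fusion results for YOLOv7 without adding extra complexity. Extensive experimental studies on the Tsinghua-Tencent 100K dataset show that the mAP of our method is 92.1%. Compared with YOLOv7, our approach improves mAP by 3.9% while reducing the number of parameters by 2.7M. On the CCTSDB2021 dataset, mAP improves by 0.9%. These results show that our approach achieves higher detection accuracy with fewer parameters. The source code is available at https://github.com/zippiest/yolo-cca.||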
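|**2024-12-05**|[DEIM: DETR with Improved Matching for Fast Convergence](http://arxiv.org/abs/2412.04234)|**[link](https://github.com/shihuahuang95/deim)**|We introduce DEIM, an innovative and efficient training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR). To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy. This approach increases the number of positive samples per image by incorporating additional targets using standard data augmentation techniques. While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance. To address this, we propose the Matchability-Aware Loss (MAL), a novel loss function that optimizes matches across various quality levels, improving the effectiveness of Dense O2O. Extensive experiments on the COCO dataset validate the efficacy of DEIM. When integrated with RT-DETR and D-FINE, it consistently boosts performance while cutting training time by 50%. Notably, paired with RT-DETRv2, DEIM reaches 53.2% AP after a single day of training on an NVIDIA 4090 GPU. In addition, DEIM-trained real-time models outperform leading real-time object detectors, with DEIM-D-FINE-L and DEIM-D-FINE-X achieving 54.7% and 56.5% AP at 124 and 78 FPS on an NVIDIA T4 GPU, respectively, without the need for additional data. We believe DEIM sets a new baseline for advances in real-time object detection. Our code and pre-trained models are available at https://github.com/ShihuaHuang95/DEIM.||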
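|**2024-12-05**|[Frequency-Adaptive Low-Latency Object Detection Using Events and Frames](http://arxiv.org/abs/2412.04149)|null|Fusing events and RGB images for object detection leverages the robustness of event cameras in adverse environments and the rich semantic information provided by RGB cameras. However, two critical mismatches severely hinder high-frequency fusion-based object detection: low-latency events versus high-latency RGB frames, and temporally sparse labels in training versus continuous event streams in inference. To address these challenges, we propose the Frequency-Adaptive Low-Latency Object Detector (FAOD). FAOD aligns low-frequency RGB frames with high-frequency events through an Align Module, which reinforces cross-modal style and spatial proximity to address the Event-RGB mismatch. We further propose a training strategy, Time Shift, which enforces the module to align the predictions from temporally shifted Event-RGB pairs with their original representation, that is, consistent with Event-aligned annotations. This strategy enables the network to use high-frequency event data as the primary reference while treating low-frequency RGB images as supplementary information, retaining the low-latency nature of the event stream for high-frequency detection. Furthermore, we observe that these corrected Event-RGB pairs generalize better from low training frequency to higher inference frequencies than event data alone. Extensive experiments on the PKU-DAVIS-SOD and DSEC-Detection datasets demonstrate that FAOD achieves state-of-the-art performance. Specifically, on the PKU-DAVIS-SOD dataset, FAOD achieves a 9.8-point improvement in mAP on fully paired Event-RGB data with only a quarter of the parameters of SODFormer, and even maintains robust performance (only a 3-point drop in mAP) under an 80x Event-RGB frequency mismatch.||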
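|**2024-12-05**|[Deep priors for satellite image restoration with accurate uncertainties](http://arxiv.org/abs/2412.04130)|null|Satellite optical images, upon their on-ground receipt, offer a distorted view of the observed scene. They must be restored before being used; classical restoration involves denoising, deblurring, and sometimes super-resolution. Moreover, quantifying the uncertainty associated with this restoration can be valuable by lowering the risk of hallucination and avoiding the propagation of these biases into downstream applications. Deep learning methods are currently the state of the art for satellite image restoration. However, they require training a specific network for each sensor and do not provide the associated uncertainties. This paper proposes a generic method that uses a single network to restore images from several sensors, together with a scalable way to derive the uncertainties. We focus on deep regularization (DR) methods, which learn a deep prior on target images before plugging it into a model-based optimization scheme. First, we introduce VBLE-xz, which solves the inverse problem in the latent space of a variational compressive autoencoder, estimating the uncertainty jointly in the latent and image spaces. It enables scalable posterior sampling with relevant and calibrated uncertainties. Second, we propose the denoiser-based method SatDPIR, adapted from DPIR, which efficiently computes accurate point estimates. We conduct a comprehensive set of experiments on very high resolution simulated and real Pleiades images, asserting both the performance and the robustness of the proposed methods. VBLE-xz and SatDPIR achieve state-of-the-art results compared with direct inversion methods. In particular, VBLE-xz is a scalable method that yields realistic posterior samples and accurate uncertainties, while SatDPIR represents a compelling alternative to direct inversion methods when uncertainty quantification is not required.||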
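|**2024-12-05**|[SoRA: Singular Value Decomposed Low-Rank Adaptation for Domain Generalizable Representation Learning](http://arxiv.org/abs/2412.04077)|**[link](https://github.com/ysj9909/DG-SoRA)**|Domain generalization (DG) aims to adapt a model using one or multiple source domains to ensure robust performance in unseen target domains. Recently, parameter-efficient fine-tuning (PEFT) of foundation models has shown promising results in the context of the DG problem. Nevertheless, existing PEFT methods still struggle to strike a balance between preserving the generalizable components of the pre-trained model and learning task-specific features. To gain insight into the distribution of generalizable components, we begin by analyzing the pre-trained weights through the lens of singular value decomposition. Building on these insights, we introduce Singular Value Decomposed Low-Rank Adaptation (SoRA), an approach that selectively tunes minor singular components while keeping the remaining parts frozen. SoRA effectively retains the generalization ability of the pre-trained model while efficiently acquiring task-specific skills. Furthermore, we freeze domain-generalizable blocks and employ an annealing weight decay strategy, thereby achieving an optimal balance in the delicate trade-off between generalizability and discriminability. SoRA attains state-of-the-art results on multiple benchmarks that span both domain-generalized semantic segmentation and domain-generalized object detection. In addition, our method introduces no additional inference overhead or regularization loss, maintains compatibility with any backbone or head, and is designed to be versatile, allowing easy integration into a wide range of tasks.||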
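|**2024-12-05**|[Space to Policy: Scalable Brick Kiln Detection and Automatic Compliance Monitoring with Geospatial Data](http://arxiv.org/abs/2412.04065)|null|Air pollution kills 7 million people annually. The brick kiln industry contributes significantly to economic development but is also responsible for 8-14% of air pollution in India. Policymakers have implemented compliance measures to regulate brick kilns. Emission inventories are critical for air quality modeling and source apportionment studies. However, the largely unorganized nature of the brick kiln sector necessitates labor-intensive survey efforts for monitoring. Air quality researchers have recently attempted to build emission inventories by hand-annotating brick kilns in satellite imagery, but this approach lacks scalability. Machine-learning-based object detection methods have shown promise for detecting brick kilns; however, previous studies often rely on costly high-resolution imagery and fail to integrate with government policies. In this work, we developed a scalable machine learning pipeline that detected and classified 30638 brick kilns across five states in the Indo-Gangetic Plain using free, moderate-resolution satellite imagery from Planet Labs. Our detections correlate highly with on-ground surveys. We performed automated compliance analysis based on government policies. In the Delhi airshed, stricter policy enforcement has led to the adoption of efficient brick kiln technologies. This study highlights the need for inclusive policies that balance environmental sustainability with the livelihoods of workers.||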
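|**2024-12-03**|[Efficient Algorithms for Low Tubal Rank Tensor Approximation with Applications to Image Compression, Super-Resolution and Deep Learning](http://arxiv.org/abs/2412.02598)|null|In this paper, we propose efficient randomized fixed-precision techniques for low tubal rank tensor approximation. The proposed methods are faster and more efficient than existing fixed-precision algorithms for approximating the truncated tensor singular value decomposition (T-SVD). Moreover, there are few works on randomized single-pass algorithms for computing low tubal rank approximations of tensors, and none of them experimentally reports the robustness of such algorithms for low-rank approximation of real-world data tensors such as images and videos. The current single-pass algorithms for tensors are generalizations of those developed for matrices. However, single-pass randomized algorithms for matrices have recently been improved and stabilized. Motivated by this progress, in this paper we also generalize them to the tensor case based on the tubal product (T-product). We conduct extensive simulations to study their robustness compared with the existing single-pass randomized algorithms. In particular, we find experimentally that single-pass algorithms with sketching parameters of equal size usually lead to ill-conditioned tensor least-squares problems and inaccurate results. Experiments show that our proposed single-pass algorithms are robust in this sense. Numerical results demonstrate that, under the same conditions (with the same hyper-parameters), our proposed algorithms provide better performance. Three applications, to image compression, super-resolution, and deep learning, are also presented.||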
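|**2024-12-03**|[Randomized algorithms for Kroncecker tensor decomposition and applications](http://arxiv.org/abs/2412.02597)|null|This paper proposes fast randomized algorithms for computing the Kronecker Tensor Decomposition (KTD). The proposed algorithms can decompose a given tensor into the KTD format much faster than the existing state-of-the-art algorithms. Our principal idea is to use the randomization framework to reduce computational complexity significantly. We provide extensive simulations to verify the effectiveness and performance of the proposed randomized algorithms, which achieve speed-ups of several orders of magnitude over the deterministic ones. Our simulations use synthetic and real-world datasets with applications to tensor completion, video/image compression, image denoising, and image super-resolution.||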
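|**2024-12-03**|[SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection](http://arxiv.org/abs/2412.02565)|**[link](https://github.com/jw-chae/sjtu)**|Despite advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which primarily rely on backbone architectures or CLIP-based embedding learning, show inherent limitations in fine-grained spatial localization and operational capabilities. This paper introduces SJTU (Spatial Judgments in multimodal models Towards Unified segmentation through coordinate detection), a novel framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification through natural language instructions. The framework proposes a new approach for integrating segmentation techniques with vision-language models based on multimodal spatial inference. By leveraging normalized coordinate detection for bounding boxes and translating it into actionable segmentation outputs, we explore the possibility of integrating multimodal spatial and language representations. Based on the proposed technical approach, the framework demonstrates superior performance on various benchmark datasets as well as accurate object segmentation. Results on the COCO 2017 dataset for general object detection and the Pascal VOC dataset for semantic segmentation demonstrate the generalization capabilities of the framework.||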
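|**2024-12-03**|[GenMix: Effective Data Augmentation with Generative Diffusion Model Image Editing](http://arxiv.org/abs/2412.02366)|null|Data augmentation is widely used to enhance generalization in visual classification tasks. However, traditional methods struggle when source and target domains differ, as in domain adaptation, because they cannot address domain gaps. This paper introduces GenMix, a generalizable prompt-guided generative data augmentation approach that enhances both in-domain and cross-domain image classification. Our technique leverages image editing to generate augmented images based on custom conditional prompts, designed specifically for each problem type. By blending portions of the input image with its edited generative counterpart and incorporating fractal patterns, our approach mitigates unrealistic images and label ambiguity, improving the performance and adversarial robustness of the resulting models. We conduct extensive experiments on eight public datasets covering general and fine-grained classification, in both in-domain and cross-domain settings, confirming the efficacy of our approach. Additionally, we demonstrate performance improvements for self-supervised learning, learning with data scarcity, and adversarial robustness. Our technique achieves stronger performance across the board compared with existing state-of-the-art methods.||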
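|**2024-12-03**|[Active Learning via Classifier Impact and Greedy Selection for Interactive Image Retrieval](http://arxiv.org/abs/2412.02310)|**[link](https://github.com/barleah/greedyal)**|Active Learning (AL) is a user-interactive approach aimed at reducing annotation costs by selecting the most crucial examples to label. Although AL has been extensively studied for image classification tasks, the specific scenario of interactive image retrieval has received relatively little attention. This scenario presents unique characteristics, including an open-set and class-imbalanced binary classification that starts with very few labeled samples. We introduce a novel batch-mode active learning framework named GAL (Greedy Active Learning) that better copes with this application. It incorporates a new acquisition function for sample selection that measures the impact of each unlabeled sample on the classifier. We further embed this strategy in a greedy selection approach, better exploiting the samples within each batch. We evaluate our framework with linear and non-linear MLP/Gaussian Process classifiers. For the Gaussian Process case, we show a theoretical guarantee on the greedy approximation. Finally, we assess our performance on interactive content-based image retrieval tasks across several benchmarks and demonstrate its superiority over existing approaches and common baselines. Code is available at https://github.com/barleah/GreedyAL.||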
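|**2024-12-03**|[CubeFormer: A Simple yet Effective Baseline for Lightweight Image Super-Resolution](http://arxiv.org/abs/2412.02234)|null|Lightweight image super-resolution (SR) methods aim to increase the resolution and restore the details of an image using a lightweight neural network. However, current lightweight SR methods still suffer from inferior performance and unpleasant details. Our analysis reveals that these methods are hindered by insufficient feature diversity, which adversely impacts feature representation and detail recovery. To address this issue, we propose a simple yet effective baseline called CubeFormer, designed to enhance feature richness through complete global information aggregation. Specifically, we introduce cube attention, which expands 2D attention to 3D space, facilitating more comprehensive information interaction, further encouraging comprehensive information extraction and promoting feature diversity. In addition, we inject block and grid sampling strategies to construct intra-cube Transformer blocks (Intra-CTB) and inter-cube Transformer blocks (Inter-CTB), which perform local and global modeling, respectively. Extensive experiments show that CubeFormer achieves state-of-the-art performance on commonly used SR benchmarks. Our source code and models will be publicly available.||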
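|**2024-12-03**|[GSOT3D: Towards Generic 3D Single Object Tracking in the Wild](http://arxiv.org/abs/2412.02129)|**[link](https://github.com/ailovejinx/gsot3d)**|In this paper, we present a novel benchmark, GSOT3D, that aims to facilitate the development of generic 3D single object tracking (SOT) in the wild. Specifically, GSOT3D offers 620 sequences with 123K frames, covering 54 object categories. Each sequence is provided with multiple modalities, including point cloud (PC), RGB image, and depth. This allows GSOT3D to support various 3D tracking tasks, such as single-modal 3D SOT on PC and multi-modal 3D SOT on RGB-PC or RGB-D, and thus greatly broadens research directions for 3D object tracking. To provide high-quality per-frame 3D annotations, all sequences undergo multiple rounds of meticulous manual inspection and refinement. To the best of our knowledge, GSOT3D is the largest benchmark dedicated to various generic 3D object tracking tasks. To understand how existing 3D trackers perform and to provide comparisons for future research on GSOT3D, we evaluate eight representative point-cloud-based tracking models. Our evaluation results show that these models degrade heavily on GSOT3D and that more effort is required for robust and generic 3D object tracking. In addition, to encourage future research, we present a simple yet effective generic 3D tracker, named PROT3D, that localizes the target object via a progressive spatial-temporal network and outperforms all existing solutions by a large margin. By releasing GSOT3D, we expect to further advance 3D tracking in future research and applications. Our benchmark, model, and evaluation results will be publicly released on our webpage at https://github.com/ailovejinx/GSOT3D.||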
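|**2024-12-03**|[Redundant Queries in DETR-Based 3D Detection Methods: Unnecessary and Prunable](http://arxiv.org/abs/2412.02054)|null|Query-based models are extensively used in 3D object detection tasks, with a wide range of pre-trained checkpoints readily available online. However, despite their popularity, these models often require an excessive number of object queries, far surpassing the actual number of objects to detect. The redundant queries result in unnecessary computational and memory costs. In this paper, we find that not all queries contribute equally: a significant portion of queries have a much smaller impact than others. Based on this observation, we propose a very simple approach called Gradually Pruning Queries (GPQ), which prunes queries incrementally based on their classification scores. It is straightforward to apply to any query-based method, as it can be seamlessly integrated as a fine-tuning step using an existing checkpoint after training. With GPQ, users can easily generate multiple models with fewer queries, starting from a checkpoint with an excessive number of queries. Experiments on various advanced 3D detectors show that GPQ effectively reduces redundant queries while maintaining performance. Using our method, model inference on desktop GPUs can be accelerated by up to 1.31x. Moreover, after deployment on edge devices, it achieves up to a 67.86% reduction in FLOPs and a 76.38% decrease in inference time. The code will be available at https://github.com/iseri27/Gpq.||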
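|**2024-12-02**|[HPRM: High-Performance Robotic Middleware for Intelligent Autonomous Systems](http://arxiv.org/abs/2412.01799)|null|The rise of intelligent autonomous systems, especially in robotics and autonomous agents, has created a critical need for robust communication middleware that can ensure real-time processing of extensive sensor data. Current robotics middleware such as Robot Operating System (ROS) 2 faces challenges with nondeterminism and high communication latency when dealing with large data across multiple subscribers on a multi-core compute platform. To address these issues, we present High-Performance Robotic Middleware (HPRM), built on top of the deterministic coordination language Lingua Franca (LF). HPRM employs optimizations including an in-memory object store for efficient zero-copy transfer of large payloads, adaptive serialization to minimize serialization overhead, and an eager protocol with real-time sockets to reduce handshake latency. Benchmarks show that HPRM achieves 173x lower latency than ROS2 when broadcasting large messages to multiple nodes. We then demonstrate the benefits of HPRM by integrating it with the CARLA simulator and running reinforcement learning agents along with object detection workloads. In the CARLA autonomous driving application, HPRM attains 91.1% lower latency than ROS2. The deterministic coordination semantics of HPRM, combined with its optimized inter-process communication mechanisms, enable efficient and predictable real-time communication for intelligent autonomous systems.||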
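|**2024-12-02**|[Identifying Reliable Predictions in Detection Transformers](http://arxiv.org/abs/2412.01782)|null|DEtection TRansformer (DETR) has emerged as a promising architecture for object detection, offering an end-to-end prediction pipeline. In practice, however, DETR generates hundreds of predictions that far outnumber the actual number of objects present in an image. This raises the question: can we trust and use all of these predictions? Addressing this concern, we present empirical evidence highlighting how different predictions within the same image play distinct roles, resulting in varying reliability levels across those predictions. More specifically, while multiple predictions are often made for a single object, our findings show that most often one such prediction is well-calibrated and the others are poorly calibrated. Based on these insights, we demonstrate that identifying a reliable subset of DETR's predictions is crucial for accurately assessing the reliability of the model at both object and image levels. Building on this viewpoint, we first tackle the shortcomings of widely used performance and calibration metrics, such as average precision and various forms of expected calibration error. Specifically, they are inadequate for determining which subset of DETR's predictions should be trusted and used. In response, we present Object-level Calibration Error (OCE), which is capable of assessing the calibration quality both across different models and among various configurations within a specific model. As a final contribution, we introduce a post hoc Uncertainty Quantification (UQ) framework that predicts the accuracy of the model on a per-image basis. By contrasting the average confidence scores of positive (i.e., likely to be matched) and negative predictions determined by OCE, the framework assesses the reliability of the DETR model for each test image.||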
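|**2024-11-29**|[Real-Time Anomaly Detection in Video Streams](http://arxiv.org/abs/2411.19731)|null|This thesis is part of a CIFRE agreement between the company Othello and the LIASD laboratory. The objective is to develop an artificial intelligence system that can detect real-time dangers in a video stream. To achieve this, a novel approach combining temporal and spatial analysis has been proposed. Several avenues have been explored to improve anomaly detection, including integrating object detection, human pose detection, and motion analysis. For result interpretability, techniques commonly used for image analysis, such as activation and saliency maps, have been extended to videos, and an original method has been proposed. The proposed architecture performs binary or multiclass classification depending on whether an alert or the cause of the alert needs to be identified. Numerous deep neural network models have been tested, and three of them have been selected. You Only Look Once (YOLO) is used for spatial analysis, a Convolutional Recurrent Neural Network (CRNN) composed of VGG19 and a Gated Recurrent Unit (GRU) for temporal analysis, and a multi-layer perceptron for classification. These models handle different types of data and can be combined in parallel or in series. Although the parallel mode is faster, the serial mode is generally more reliable. For training these models, supervised learning was chosen, and two proprietary datasets were created. The first dataset focuses on objects that may play a potential role in anomalies, while the second consists of videos containing anomalies or non-anomalies. This approach allows the processing of both continuous video streams and finite videos, providing greater flexibility in detection.||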
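|**2024-11-29**|[LDA-AQU: Adaptive Query-guided Upsampling via Local Deformable Attention](http://arxiv.org/abs/2411.19585)|**[link](https://github.com/duzw9311/lda-aqu)**|Feature upsampling is an essential operation in constructing deep convolutional neural networks. However, existing upsamplers either lack specific feature guidance or need to use high-resolution feature maps, resulting in a loss of performance and flexibility. In this paper, we find that local self-attention naturally has the feature guidance capability, and its computational paradigm aligns closely with the essence of feature upsampling (i.e., feature reassembly of neighboring points). Therefore, we introduce local self-attention into the upsampling task and demonstrate that the majority of existing upsamplers can be regarded as special cases of upsamplers based on local self-attention. Considering the potential semantic gap between upsampled points and their neighboring points, we further introduce the deformation mechanism into the upsampler based on local self-attention, thereby proposing LDA-AQU. As a novel dynamic kernel-based upsampler, LDA-AQU utilizes the query features to guide the model in adaptively adjusting the position and aggregation weight of neighboring points, thereby meeting the upsampling requirements across various complex scenarios. In addition, LDA-AQU is lightweight and can be easily integrated into various model architectures. We evaluate the effectiveness of LDA-AQU across four dense prediction tasks: object detection, instance segmentation, panoptic segmentation, and semantic segmentation. LDA-AQU consistently outperforms previous state-of-the-art upsamplers, achieving performance gains of 1.7 AP, 1.5 AP, 2.0 PQ, and 2.5 mIoU over the baseline models on these four tasks, respectively. Code is available at https://github.com/duzw9311/LDA-AQU.||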
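|**2024-11-29**|[Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding](http://arxiv.org/abs/2411.19551)|null|Injecting semantics into 3D Gaussian Splatting (3DGS) has recently garnered significant attention. While current approaches typically distill 3D semantic features from 2D foundational models (e.g., CLIP and SAM) to facilitate novel view segmentation and semantic understanding, their heavy reliance on 2D supervision can undermine cross-view semantic consistency and necessitate complex data preparation processes, thereby hindering view-consistent scene understanding. In this work, we present FreeGS, an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels. Instead of directly learning semantic features, we introduce the IDentity-coupled Semantic Field (IDSF) into 3DGS, which captures both semantic representations and view-consistent instance indices for each Gaussian. We optimize IDSF with a two-step alternating strategy: semantics help to extract coherent instances in 3D space, while the resulting instances regularize the injection of stable semantics from 2D space. Additionally, we adopt a 2D-3D joint contrastive loss to enhance the complementarity between view-consistent 3D geometry and rich semantics during the bootstrapping process, enabling FreeGS to uniformly perform tasks such as novel-view semantic segmentation, object selection, and 3D object detection. Extensive experiments on the LERF-Mask, 3D-OVS, and ScanNet datasets demonstrate that FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.||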
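|**2024-11-29**|[Contextual Checkerboard Denoise -- A Novel Neural Network-Based Approach for Classification-Aware OCT Image Denoising](http://arxiv.org/abs/2411.19549)|**[link](https://github.com/abtahimajeed/checkerboarddenoiser)**|In contrast to non-medical image denoising, where enhancing image clarity is the primary goal, medical image denoising warrants preservation of crucial features without introducing new artifacts. However, many denoising methods that improve the clarity of the image inadvertently alter critical information of the denoised images, potentially compromising classification performance and diagnostic quality. Additionally, supervised denoising methods are not very practical for medical images, since a ground truth denoised version of a noisy medical image is often extremely difficult to obtain. In this paper, we tackle both of these problems by introducing a novel neural-network-based method, Contextual Checkerboard Denoising, that can learn denoising from only a dataset of noisy images while preserving the crucial anatomical details necessary for image classification and analysis. We perform our experiments on real Optical Coherence Tomography (OCT) images and empirically demonstrate that our proposed method significantly improves image quality, providing clearer and more detailed OCT images, while enhancing diagnostic accuracy.||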
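|**2024-11-28**|[CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections](http://arxiv.org/abs/2411.19346)|**[link](https://github.com/fazliimam/NoLA)**|In the era of foundation models, CLIP has emerged as a powerful tool for aligning text and visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, models pretrained with self-supervised learning (SSL), such as DINO, excel at extracting rich visual features due to their specialized training paradigm. Yet, these SSL models require an additional supervised linear probing step, which relies on fully labeled data that is often expensive and difficult to obtain at scale. In this paper, we propose a label-free prompt-tuning method that leverages the rich visual features of self-supervised learning models (DINO) and the broad textual knowledge of large language models (LLMs) to largely enhance CLIP-based image classification performance using unlabeled images. Our approach unfolds in three key steps: (1) We generate robust textual feature embeddings that more accurately represent object classes by leveraging class-specific descriptions from LLMs, enabling more effective zero-shot classification compared to CLIP's default name-specific prompts. (2) These textual embeddings are then used to produce pseudo-labels to train an alignment module that integrates the complementary strengths of LLM description-based textual embeddings and DINO's visual features. (3) Finally, we prompt-tune CLIP's vision encoder through DINO-assisted supervision using the trained alignment module. This three-step process allows us to harness the best of visual and textual foundation models, resulting in a powerful and efficient approach that surpasses state-of-the-art label-free classification methods. Notably, our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the state-of-the-art LaFter across 11 diverse image classification datasets.||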
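|**2024-11-28**|[Quantum Neural Networks in Practice: A Comparative Study with Classical Models from Standard Data Sets to Industrial Images](http://arxiv.org/abs/2411.19276)|null|Image classification tasks are among the most prominent examples that can be reliably solved by classical machine learning models. In this study, we compare the performance of randomized classical and quantum neural networks as well as classical and quantum-classical hybrid convolutional neural networks for the task of binary image classification. To this end, we employ various datasets of increasing complexity: (i) an artificial hypercube dataset, (ii) MNIST handwritten digits, and (iii) real-world industrial images from laser cutting machines. We analyze the performance of the employed quantum models with respect to the correlation between classification accuracy and various hyperparameters. For the random quantum neural networks, we additionally compare their performance with some known literature models and examine how top-performing models from one dataset perform on the others. In general, we observe fairly similar performance of classical and quantum or hybrid models. Our study provides an industry perspective on the prospects of quantum machine learning for practical image classification tasks.||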
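|**2024-11-28**|[On Moving Object Segmentation from Monocular Video with Transformers](http://arxiv.org/abs/2411.19141)|null|Moving object detection and segmentation from a single moving camera is a challenging task, requiring an understanding of recognition, motion, and 3D geometry. Combining recognition and reconstruction boils down to a fusion problem, where appearance and motion features need to be combined for classification and segmentation. In this paper, we present a novel fusion architecture for monocular motion segmentation, M3Former, which leverages the strong performance of Transformers for segmentation and multi-modal fusion. As reconstructing motion from monocular video is ill-posed, we systematically analyze different 2D and 3D motion representations for this problem and their importance for segmentation performance. Finally, we analyze the effect of training data and show that diverse datasets are required to achieve state-of-the-art performance on Kitti and Davis.||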
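|**2024-11-28**|[Comprehensive Performance Evaluation of YOLOv11, YOLOv10, YOLOv9, YOLOv8 and YOLOv5 on Object Detection of Power Equipment](http://arxiv.org/abs/2411.18871)|null|With the rapid development of global industrial production, the demand for reliability in power equipment has been continuously increasing. Ensuring the stability of power system operations requires accurate methods to detect potential faults in power equipment, thereby guaranteeing the normal supply of electrical energy. This article comprehensively evaluates the performance of YOLOv5, YOLOv8, YOLOv9, YOLOv10, and the state-of-the-art YOLOv11 methods for power equipment object detection. Experimental results show that the mean average precision (mAP) on a public dataset for power equipment is 54.4%, 55.5%, 43.8%, 48.0%, and 57.2%, respectively, with YOLOv11 achieving the highest detection performance. Moreover, YOLOv11 outperforms the other methods in terms of recall and exhibits superior performance in reducing false detections. In conclusion, the findings indicate that the YOLOv11 model provides a reliable and effective solution for power equipment object detection, representing a promising approach to enhancing the operational reliability of power systems.||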
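|**2024-11-28**|[Improving Batch Normalization with TTA for Robust Object Detection in Self-Driving](http://arxiv.org/abs/2411.18860)|null|In current open real-world autonomous driving scenarios, challenges such as sensor failure and extreme weather conditions hinder the generalization of most autonomous driving perception models to these unseen domains, due to the domain shifts between the test and training data. As the parameter scale of autonomous driving perception models grows, traditional test-time adaptation (TTA) methods become unstable and often degrade model performance in most scenarios. To address these challenges, this paper proposes two new robust methods to improve Batch Normalization (BN) with TTA for object detection in autonomous driving: (1) We introduce a learnable BN layer based on Generalized-search Entropy Minimization (GSEM). Specifically, we modify the traditional BN layer by incorporating auxiliary learnable parameters, which enables the BN layer to dynamically update the statistics according to different input data. (2) We propose a new semantic-consistency-based dual-stage adaptation strategy, which encourages the model to iteratively search for the optimal solution and eliminates unstable samples during the adaptation process. Extensive experiments on the NuScenes-C dataset show that our method achieves a maximum improvement of about 8% using BEVFormer as the baseline model across six corruption types and three levels of severity. We will make our source code available soon.||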
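|**2024-11-28**|[COMPrompter: reconceptualized segment anything model with multiprompt network for camouflaged object detection](http://arxiv.org/abs/2411.18858)|**[link](https://github.com/guobaoxiao/comprompter)**|We rethink the Segment Anything Model (SAM) and propose a novel multiprompt network called COMPrompter for camouflaged object detection (COD). SAM has zero-shot generalization ability beyond other models and can provide an ideal framework for COD. Our network aims to enhance the single-prompt strategy in SAM to a multiprompt strategy. To achieve this, we propose an edge gradient extraction module, which generates a mask containing gradient information regarding the boundaries of camouflaged objects. This gradient mask is then used as a novel boundary prompt, enhancing the segmentation process. Thereafter, we design a box-boundary mutual guidance module, which fosters more precise and comprehensive feature extraction via mutual guidance between a boundary prompt and a box prompt. This collaboration enhances the model's ability to accurately detect camouflaged objects. Moreover, we employ the discrete wavelet transform to extract high-frequency features from image embeddings. The high-frequency features serve as a supplementary component to the multiprompt system. Finally, our COMPrompter guides the network to achieve enhanced segmentation results, thereby advancing the development of SAM in terms of COD. Experimental results across COD benchmarks demonstrate that COMPrompter achieves state-of-the-art performance, surpassing the current leading model by an average positive metric of 2.2% on COD10K. In the specific application of COD, the experimental results in polyp segmentation show that our model is superior to top-tier methods as well. The code will be made available at https://github.com/guobaoxiao/COMPrompter.||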
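|**2024-11-27**|[Leveraging Semi-Supervised Learning to Enhance Data Mining for Image Classification under Limited Labeled Data](http://arxiv.org/abs/2411.18622)|null|In the 21st-century information age, with the development of big data technologies, effectively extracting valuable information from massive data has become a key issue. Traditional data mining methods are inadequate when faced with large-scale, high-dimensional, and complex data, and their performance is severely limited, especially when labeled data is scarce. This study optimizes data mining algorithms by introducing semi-supervised learning methods, aiming to improve the algorithm's ability to utilize unlabeled data, thereby achieving more accurate data analysis and pattern recognition under limited labeled data conditions. Specifically, we adopt a self-training method and combine it with a convolutional neural network (CNN) for image feature extraction and classification, continuously improving the model's prediction performance through an iterative process. The experimental results demonstrate that the proposed method significantly outperforms traditional machine learning techniques such as Support Vector Machine (SVM), XGBoost, and Multi-Layer Perceptron (MLP) on the CIFAR-10 image classification dataset. Notable improvements were observed in key performance metrics, including accuracy, recall, and F1 score. Furthermore, the robustness and noise resistance of the semi-supervised CNN model were validated through experiments under varying noise levels, confirming its practical applicability in real-world scenarios.||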
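|**2024-11-27**|[Pruning Deep Convolutional Neural Network Using Conditional Mutual Information](http://arxiv.org/abs/2411.18578)|null|Convolutional Neural Networks (CNNs) achieve high performance in image classification tasks but are challenging to deploy on resource-limited hardware due to their large model sizes. To address this issue, we leverage Mutual Information, a metric that provides valuable insights into how deep learning models retain and process information by measuring the shared information between input features or output labels and network layers. In this study, we propose a structured filter-pruning approach for CNNs that identifies and selectively retains the most informative features in each layer. Our approach successively evaluates each layer by ranking the importance of its feature maps based on Conditional Mutual Information (CMI) values, computed using a matrix-based Renyi α-order entropy numerical method. We propose several formulations of CMI to capture correlation among features across different layers. We then develop various strategies to determine the cutoff point for CMI values to prune unimportant features. This approach allows parallel pruning in both forward and backward directions and significantly reduces model size while preserving accuracy. Tested on the VGG16 architecture with the CIFAR-10 dataset, the proposed method reduces the number of filters by more than a third, with only a 0.32% drop in test accuracy.||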
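|**2024-11-27**|[A comparison of extended object tracking with multi-modal sensors in indoor environment](http://arxiv.org/abs/2411.18476)|null|This paper presents a preliminary study of an efficient object tracking approach, comparing the performance of two different 3D point cloud sensory sources: LiDAR and stereo cameras, which have significant price differences. In this preliminary work, we focus on single object tracking. We first developed a fast heuristic object detector that utilizes prior information about the environment and target. The resulting target points are subsequently fed into an extended object tracking framework, where the target shape is parameterized using a star-convex hypersurface model. Experimental results show that our object tracking method using a stereo camera achieves performance similar to that of a LiDAR sensor, with a cost difference of more than tenfold.||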
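|**2024-11-27**|[Efficient Dynamic LiDAR Odometry for Mobile Robots with Structured Point Clouds](http://arxiv.org/abs/2411.18443)|**[link](https://github.com/tu-darmstadt-ros-pkg/dynamic_direct_lidar_odometry)**|We propose a real-time dynamic LiDAR odometry pipeline for mobile robots in Urban Search and Rescue (USAR) scenarios. Existing approaches to dynamic object detection often rely on pretrained learned networks or computationally expensive volumetric maps. To enhance efficiency on computationally limited robots, we reuse data between the odometry and detection modules. Utilizing a range image segmentation technique and a novel residual-based heuristic, our method distinguishes dynamic from static objects before integrating them into the point cloud map. The approach demonstrates robust object tracking and improved map accuracy in environments with numerous dynamic objects. Even highly non-rigid objects, such as running humans, are accurately detected at point level without prior downsampling of the point cloud and hence without loss of information. Evaluation on simulated and real-world data validates its computational efficiency. Compared with a state-of-the-art volumetric method, our approach shows comparable detection performance at a fraction of the processing time, adding only 14 ms to the odometry module for dynamic object detection and tracking. The implementation and a new real-world dataset are available as open source for further research.||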
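|**2024-11-27**|[Uncertainty-driven Sampling for Efficient Pairwise Comparison Subjective Assessment](http://arxiv.org/abs/2411.18372)|**[link](https://github.com/shimamohammadi/LBPS-EIC)**|Image quality assessment is crucial in image processing tasks such as compression, super-resolution, and denoising. While subjective assessments involving human evaluators provide the most accurate quality scores, they are impractical for large-scale or continuous evaluations due to their high cost and time requirements. Pairwise comparison subjective assessment tests, which rank image pairs instead of assigning scores, offer more reliability and accuracy but require numerous comparisons, leading to high costs. Although objective quality metrics are more efficient, they lack the precision of subjective tests, which are essential for benchmarking and training learning-based quality metrics. This paper proposes an uncertainty-based sampling method to optimize the pairwise comparison subjective assessment process. By utilizing deep learning models to estimate human preferences and identify pairs that need human labeling, the approach reduces the number of required comparisons while maintaining high accuracy. The key contributions include modeling uncertainty for accurate preference predictions and for pairwise sampling. The experimental results demonstrate superior performance of the proposed approach compared with traditional active sampling methods. Software is publicly available at shimamohammadi/LBPS-EIC.||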
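|**2024-11-27**|[Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks](http://arxiv.org/abs/2411.18288)|**[link](https://github.com/cpboost/double-co-detr)**|Multispectral object detection, utilizing RGB and TIR (thermal infrared) modalities, is widely recognized as a challenging task. It requires not only the effective extraction of features from both modalities and robust fusion strategies, but also the ability to address issues such as spectral discrepancies, spatial misalignment, and environmental dependencies between RGB and TIR images. These challenges significantly hinder the generalization of multispectral detection systems across diverse scenarios. Although numerous studies have attempted to overcome these limitations, it remains difficult to clearly distinguish the performance gains of multispectral detection systems from the impact of these "optimization techniques". Worse still, despite the rapid emergence of high-performing single-modality detection models, there is still a lack of specialized training techniques that can effectively adapt these models for multispectral detection tasks. The absence of a standardized benchmark with fair and consistent experimental setups also poses a significant barrier to evaluating the effectiveness of new approaches. To this end, we propose the first fair and reproducible benchmark specifically designed to evaluate training "techniques", which systematically classifies existing multispectral object detection methods, investigates their sensitivity to hyper-parameters, and standardizes the core configurations. A comprehensive evaluation is conducted across multiple representative multispectral object detection datasets, utilizing various backbone networks and detection frameworks. Additionally, we introduce an efficient and easily deployable multispectral object detection framework that can seamlessly optimize high-performing single-modality models into dual-modality models, integrating our advanced training techniques.||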
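|**2024-11-27**|[TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution](http://arxiv.org/abs/2411.18263)|**[link](https://github.com/Microtreei/TSD-SR)**|Pre-trained text-to-image diffusion models are increasingly applied to the real-world image super-resolution (Real-ISR) task. Given the iterative refinement nature of diffusion models, most existing approaches are computationally expensive. While methods such as SinSR and OSEDiff have emerged to reduce inference steps via distillation, their performance in image restoration or detail recovery is not satisfactory. To address this, we propose TSD-SR, a distillation framework specifically designed for real-world image super-resolution, aiming to construct an efficient and effective one-step model. We first introduce Target Score Distillation, which leverages the priors of diffusion models and real image references to achieve more realistic image restoration. Second, we propose a Distribution-Aware Sampling Module to make detail-oriented gradients more readily accessible, addressing the challenge of recovering fine details. Extensive experiments demonstrate that TSD-SR achieves better restoration results (the best performance on most metrics) and the fastest inference speed (e.g., 40 times faster than SeeSR) compared with past Real-ISR approaches based on pre-trained diffusion priors.||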
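|**2024-11-27**|[KANs for Computer Vision: An Experimental Study](http://arxiv.org/abs/2411.18224)|null|This paper presents an experimental study of Kolmogorov-Arnold Networks (KANs) applied to computer vision tasks, particularly image classification. KANs introduce learnable activation functions on edges, offering more flexible non-linear transformations than the predefined activation functions used in traditional neural networks such as Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). While KANs have shown some promise on simplified or small-scale datasets, their effectiveness for more complex real-world tasks such as computer vision remains to be explored. To fill this gap, this experimental study aims to provide broader observations and insights into the strengths and limitations of KANs. We find that although KANs can perform well in certain vision tasks, they face significant challenges, including increased hyperparameter sensitivity and higher computational costs. These limitations suggest that KANs require architectural adaptations, such as integration with other architectures, to be practical for large-scale vision problems. This study focuses on empirical findings rather than proposing new methods, aiming to inform future research on optimizing KANs, in particular for computer vision applications or the like.||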
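|**2024-11-27**|[From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects](http://arxiv.org/abs/2411.18207)|**[link](https://github.com/343gltysprk/ovow)**|Traditional object detection methods operate under the closed-set assumption, where models can only detect a fixed number of objects predefined in the training set. Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an unbounded vocabulary, which reduces the cost of training models for specific tasks. However, OVD heavily relies on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception. OVD models tend to misclassify near-out-of-distribution (NOOD) objects that have similar semantics to known classes, and ignore far-out-of-distribution (FOOD) objects. To address these limitations, we propose a framework that enables OVD models to operate in open world settings by identifying and incrementally learning novel objects. To detect FOOD objects, we propose Open World Embedding Learning (OWEL) and introduce the concept of Pseudo Unknown Embedding, which infers the location of unknown classes in a continuous semantic space based on the information of known classes. We also propose Multi-Scale Contrastive Anchor Learning (MSCAL), which enables the identification of misclassified unknown objects by promoting the intra-class consistency of object embeddings at different scales. The proposed method achieves state-of-the-art performance in common open world object detection and autonomous driving benchmarks.||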
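|**2024-11-27**|[HAAT: Hybrid Attention Aggregation Transformer for Image Super-Resolution](http://arxiv.org/abs/2411.18003)|null|In the research area of image super-resolution, Swin-Transformer-based models are favored for their global spatial modeling and shifting window attention mechanism. However, existing methods often limit self-attention to non-overlapping windows to cut computational costs and ignore the useful information that exists across channels. To address this issue, this paper introduces a novel model, the Hybrid Attention Aggregation Transformer (HAAT), designed to better leverage feature information. HAAT is constructed by integrating Swin-Dense-Residual-Connected Blocks (SDRCB) with Hybrid Grid Attention Blocks (HGAB). SDRCB expands the receptive field while maintaining a streamlined architecture, resulting in enhanced performance. HGAB incorporates channel attention, sparse attention, and window attention to improve non-local feature fusion and achieve more visually compelling results. Experimental evaluations demonstrate that HAAT surpasses state-of-the-art methods on benchmark datasets. Keywords: image super-resolution, computer vision, attention mechanism, Transformer.||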
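|**2024-11-26**|[A Distractor-Aware Memory for Visual Object Tracking with SAM2](http://arxiv.org/abs/2411.17576)|**[link](https://github.com/jovanavidenovic/dam4sam)**|Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While they had already achieved top performance on many benchmarks, it was the recent release of SAM2 that placed memory-based trackers into the focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly address segmentation accuracy and tracking robustness. The resulting tracker is denoted SAM2.1++. We also propose a new distractor-distilled DiDi dataset to study the distractor problem better. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a new state of the art on six of them.||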
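|**2024-11-26**|[TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba](http://arxiv.org/abs/2411.17473)|**[link](https://github.com/xwmaxwma/tinyvim)**|Mamba has shown great potential for computer vision due to its linear complexity in modeling the global context with respect to the input length. However, existing lightweight Mamba-based backbones cannot match the performance of Convolution- or Transformer-based methods. We observe that simply modifying the scanning path in the image domain is not conducive to fully exploiting the potential of vision Mamba. In this paper, we first perform comprehensive spectral and quantitative analyses, and verify that the Mamba block mainly models low-frequency information under a Convolution-Mamba hybrid architecture. Based on these analyses, we introduce a novel Laplace mixer to decouple the features in terms of frequency and input only the low-frequency components into the Mamba block. In addition, considering the redundancy of the features and the different requirements for high-frequency details and low-frequency global information at different stages, we introduce a frequency ramp inception, i.e., gradually reducing the input dimensions of the high-frequency branches, so as to efficiently trade off the high-frequency and low-frequency components at different layers. By integrating mobile-friendly convolution and an efficient Laplace mixer, we build a series of tiny hybrid vision Mamba models called TinyViM. The proposed TinyViM achieves impressive performance on several downstream tasks, including image classification, semantic segmentation, object detection, and instance segmentation. In particular, TinyViM outperforms Convolution-, Transformer-, and Mamba-based models of similar scale, and its throughput is about 2-3 times higher than that of other Mamba-based models. Code is available at https://github.com/xwmaxwma/TinyViM.||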
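|**2024-11-26**|[SpikeAtConv: An Integrated Spiking-Convolutional Attention Architecture for Energy-Efficient Neuromorphic Vision Processing](http://arxiv.org/abs/2411.17439)|null|Spiking Neural Networks (SNNs) offer a biologically inspired alternative to conventional artificial neural networks, with potential advantages in energy efficiency due to their event-driven computation. Despite their promise, SNNs have yet to achieve competitive performance on complex visual tasks such as image classification. This study introduces a novel SNN architecture designed to enhance computational efficacy and task accuracy. The architecture features optimized spiking modules that facilitate the processing of spatio-temporal patterns in visual data, aiming to reconcile the computational demands of high-level vision tasks with the energy-efficient processing of SNNs. Our evaluations on standard image classification benchmarks indicate that the proposed architecture narrows the performance gap with traditional neural networks, providing insights into the design of more efficient and capable neuromorphic computing systems.||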
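|**2024-11-26**|[Communication-Efficient Cooperative SLAMMOT via Determining the Number of Collaboration Vehicles](http://arxiv.org/abs/2411.17432)|null|SLAMMOT, i.e., simultaneous localization, mapping, and moving object (detection and) tracking, represents an emerging technology for autonomous vehicles in dynamic environments. Such single-vehicle systems still have inherent limitations, such as occlusion issues. Inspired by SLAMMOT and the rapidly evolving cooperative technologies, it is natural to investigate cooperative simultaneous localization, mapping, and moving object (detection and) tracking (C-SLAMMOT) to enhance state estimation for ego vehicles and moving objects. C-SLAMMOT can significantly upgrade single-vehicle performance by utilizing and integrating information shared among multiple vehicles through communication. This inevitably leads to a fundamental trade-off between performance and communication cost, especially regarding how to proceed in a scalable manner as the number of collaboration vehicles increases. To address this challenge, we propose a LiDAR-based communication-efficient C-SLAMMOT (CE C-SLAMMOT) method that works by determining the number of collaboration vehicles. In CE C-SLAMMOT, we adopt a descriptor-based approach for enhancing ego-vehicle pose estimation and a spatial-confidence-map-based approach for cooperative object perception, allowing continuous and dynamic selection of the corresponding critical collaboration vehicles and interaction content. Compared with the baseline approach of exchanging raw observation information among all vehicles, this approach avoids wasting precious communication resources by not sharing information from collaboration vehicles that may contribute little or no performance gain. Comparative experiments in various aspects have confirmed that the proposed method achieves a good trade-off between performance and communication cost, while also outperforming previous state-of-the-art methods in cooperative perception performance.||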
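|**2024-11-26**|[CoA: Chain-of-Action for Generative Semantic Labels](http://arxiv.org/abs/2411.17406)|**[link](https://github.com/WilsonMqz/CoA)**|Recent advances in vision-language models (VLMs) have demonstrated remarkable capability in image classification. These VLMs leverage a predefined set of categories to construct text prompts for zero-shot reasoning. However, in more open-ended domains like autonomous driving, using a predefined set of labels becomes impractical, as the semantic label space is unknown and constantly evolving. Additionally, fixed embedding text prompts often tend to predict a single label (while in reality, multiple labels commonly exist per image). In this paper, we introduce CoA, an innovative Chain-of-Action (CoA) method that generates labels aligned with all contextually relevant features of an image. CoA is designed based on the observation that enriched and valuable contextual information improves generative performance during inference. Traditional vision-language models tend to output singular and redundant responses. Therefore, we employ a tailored CoA to alleviate this problem. We first break down the generative labeling task into detailed actions and construct a CoA leading to the final generative objective. Each action extracts and merges key information from the previous action and passes the enriched information as context to the next action, ultimately improving the VLM's ability to generate comprehensive and accurate semantic labels. We assess the effectiveness of CoA through comprehensive evaluations on widely used benchmark datasets, and the results demonstrate significant improvements across key performance metrics.||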
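|**2024-11-26**|[BadScan: An Architectural Backdoor Attack on Visual State Space Models](http://arxiv.org/abs/2411.17283)|null|The newly introduced Visual State Space Model (VMamba), which employs State Space Mechanisms (SSM) to interpret images as sequences of patches, has shown exceptional performance compared with Vision Transformers (ViT) across various computer vision tasks. However, recent studies have highlighted that deep models are susceptible to adversarial attacks. One common approach is to embed a trigger in the training data to retrain the model, causing it to misclassify data samples into a target class, a phenomenon known as a backdoor attack. In this paper, we first evaluate the robustness of the VMamba model against existing backdoor attacks. Based on this evaluation, we introduce a novel architectural backdoor attack, termed BadScan, designed to deceive the VMamba model. This attack utilizes bit plane slicing to create visually imperceptible backdoored images. During testing, if a trigger is detected by performing XOR operations on the k-th bit planes of the modified triggered patches, the traditional 2D selective scan (SS2D) mechanism in the visual state space (VSS) block of VMamba is replaced with our newly designed BadScan block, which incorporates four newly developed scanning patterns. We demonstrate that the BadScan backdoor attack represents a significant threat to visual state space models and remains effective even after complete retraining from scratch. Experimental results on two widely used image classification datasets, CIFAR-10 and ImageNet-1K, reveal that while visual state space models generally exhibit robustness against current backdoor attacks, the BadScan attack is particularly effective, achieving a higher Triggered Accuracy Ratio (TAR) in misleading the VMamba model and its variants.||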
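|**2024-11-26**|[MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution](http://arxiv.org/abs/2411.17214)|**[link](https://github.com/stella-von/MAT)**|Recent advances in image super-resolution (SR) have significantly benefited from the incorporation of Transformer architectures. However, conventional techniques that enlarge the self-attention window to capture broader contexts come with inherent drawbacks, especially significantly increased computational demands. Moreover, the feature perception within a fixed-size window of existing models restricts the effective receptive field and the diversity of intermediate features. This study demonstrates that a flexible integration of attention across diverse spatial extents can yield significant performance enhancements. In line with this insight, we introduce the Multi-Range Attention Transformer (MAT), tailored for SR tasks. MAT leverages the computational advantages inherent in the dilation operation, in conjunction with the self-attention mechanism, to facilitate both multi-range attention (MA) and sparse multi-range attention (SMA), enabling efficient capture of both regional and sparse global features. Combined with local feature extraction, MAT adeptly captures dependencies across various spatial ranges, improving the diversity and efficacy of feature representations. We also introduce the MSConvStar module, which augments the model's ability for multi-range representation learning. Comprehensive experiments show that MAT exhibits superior performance to existing state-of-the-art SR models with remarkable efficiency (about 3.3 times faster than SRFormer-light).||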
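|**2024-11-26**|[PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution](http://arxiv.org/abs/2411.17106)|**[link](https://github.com/libozhu03/passionsr)**|Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps. However, even when the number of denoising steps has been reduced to one, they still incur high computational costs and storage requirements, making them difficult to deploy on hardware devices. To address these issues, we propose PassionSR, a novel post-training quantization approach with adaptive scale for one-step diffusion (OSD) image SR. First, we simplify the OSD model to two core components, UNet and Variational Autoencoder (VAE), by removing the CLIPEncoder. Second, we propose the Learnable Boundary Quantizer (LBQ) and Learnable Equivalent Transformation (LET) to optimize the quantization process and manipulate activation distributions for better quantization. Finally, we design a Distributed Quantization Calibration (DQC) strategy that stabilizes the training of quantized parameters for rapid convergence. Comprehensive experiments demonstrate that PassionSR with 8-bit and 6-bit quantization obtains visual results comparable to those of the full-precision model. Moreover, PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR. Our code will be available at https://github.com/libozhu03/PassionSR.||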
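|**2024-11-26**|[ΩSFormer: Dual-Modal Ω-like Super-Resolution Transformer Network for Cross-scale and High-accuracy Terraced Field Vectorization Extraction](http://arxiv.org/abs/2411.17088)|null|Terraced fields are a significant engineering practice for soil and water conservation (SWC). Terraced field extraction from remotely sensed imagery is the foundation for monitoring and evaluating SWC. This study is the first to propose a novel dual-modal Ω-like super-resolution Transformer network for intelligent terraced field vectorization extraction (TFVE), offering the following advantages: (1) reducing the edge segmentation errors of conventional multi-scale downsampling encoders by fusing original high-resolution features with downsampled features at each step of the encoder and leveraging a multi-head attention mechanism; (2) improving the accuracy of TFVE by proposing an Ω-like network structure, which fully integrates rich high-level features from both spectral and terrain data to form cross-scale super-resolution features; (3) validating an optimal fusion scheme for cross-modal and cross-scale (i.e., inconsistent spatial resolution between remotely sensed imagery and DEM) super-resolution feature extraction; (4) mitigating uncertainty between segmentation edge pixels through a coarse-to-fine and spatial topological semantic relationship optimization (STSRO) segmentation strategy; (5) leveraging a contour vibration neural network to continuously optimize parameters and iteratively vectorize terraced fields from the semantic segmentation results. Moreover, a remotely sensed imagery and DEM vector dataset (DMRVD) for deep-learning-based TFVE was created for the first time, covering nine study areas in four provinces of China with a total coverage area of 22441 square kilometers. To assess the performance of ΩSFormer, classic and SOTA networks were compared. The mIOU of ΩSFormer improved by 0.165, 0.297, and 0.128 over the best-accuracy single-modal remotely sensed imagery, single-modal DEM, and dual-modal results, respectively.||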
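|**2024-11-26**|[Event-based Spiking Neural Networks for Object Detection: A Review of Datasets, Architectures, Learning Rules, and Implementation](http://arxiv.org/abs/2411.17006)|**[link](https://github.com/radlab-sketch/Event-SNN-Resources)**|Spiking Neural Networks (SNNs) represent a biologically inspired paradigm offering an energy-efficient alternative to conventional artificial neural networks (ANNs) for Computer Vision (CV) applications. This paper presents a systematic review of the datasets, architectures, learning methods, implementation techniques, and evaluation methodologies used in CV-based object detection tasks using SNNs. Based on an analysis of 151 journal and conference articles, the review codifies: 1) the effectiveness of fully connected, convolutional, and recurrent architectures; 2) the performance of direct unsupervised, direct supervised, and indirect learning methods; and 3) the trade-offs among energy consumption, latency, and memory in neuromorphic hardware implementations. An open-source repository is also provided, containing Python code examples and detailed resources for building SNN models, event-based data processing, and SNN simulations. Key challenges in SNN training, hardware integration, and future directions for CV applications are also identified.||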
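|**2024-11-22**|[A Real-Time DETR Approach to Bangladesh Road Object Detection for Autonomous Vehicles](http://arxiv.org/abs/2411.15110)|null|In recent years, we have witnessed a paradigm shift in the field of Computer Vision with the advent of the Transformer architecture. Detection Transformers have become a state-of-the-art solution to object detection and are a potential candidate for road object detection in autonomous vehicles. Despite the abundance of object detection schemes, real-time DETR models have been shown to perform significantly better on inference times, with minimal loss of accuracy and performance. In our work, we apply Real-Time DETR (RTDETR) object detection to the BadODD road object detection dataset based in Bangladesh and perform the necessary experimentation and testing. Our results gave an mAP50 score of 0.41518 on the public 60% test set and 0.28194 on the private 40% test set.||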
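|**2024-11-22**|[VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving](http://arxiv.org/abs/2411.14716)|null|This paper introduces VisionPAD, a novel self-supervised pre-training paradigm designed for vision-centric algorithms in autonomous driving. In contrast to previous approaches that employ neural rendering with explicit depth supervision, VisionPAD utilizes more efficient 3D Gaussian Splatting to reconstruct multi-view representations using only images as supervision. Specifically, we introduce a self-supervised method for voxel velocity estimation. By warping voxels to adjacent frames and supervising the rendered outputs, the model effectively learns motion cues in the sequential data. Furthermore, we adopt a multi-frame photometric consistency approach to enhance geometric perception. It projects adjacent frames to the current frame based on rendered depths and relative poses, boosting the 3D geometric representation through pure image supervision. Extensive experiments on autonomous driving datasets demonstrate that VisionPAD significantly improves performance in 3D object detection, occupancy prediction, and map segmentation, surpassing state-of-the-art pre-training strategies by a considerable margin.||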
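|**2024-11-21**|[Unveiling the Hidden: A Comprehensive Evaluation of Underwater Image Enhancement and Its Impact on Object Detection](http://arxiv.org/abs/2411.14626)|null|Underwater imagery often suffers from severe degradation that results in low visual quality and poor object detection performance. This work aims to evaluate state-of-the-art image enhancement models, investigate their impact on underwater object detection, and explore their potential to improve detection performance. To this end, we selected representative underwater image enhancement models covering the major enhancement categories and applied them separately to two recent datasets: 1) the Real-World Underwater Object Detection Dataset (RUOD), and 2) the Challenging Underwater Plant Detection Dataset (CUPDD). Following this, we conducted qualitative and quantitative analyses on the enhanced images and developed a quality index (Q-index) to compare the quality distribution of the original and enhanced images. Subsequently, we compared the performance of several YOLO-NAS detection models that were separately trained and tested on the original and enhanced image sets. Then, we performed a correlation study to examine the relationship between enhancement metrics and detection performance. We also analyzed the inference results from the trained detectors, presenting cases where enhancement increased the detection performance as well as cases where enhancement revealed objects missed by human annotators. This study suggests that although enhancement generally deteriorates the detection performance, it can still be harnessed in some cases for increased detection performance and more accurate human annotation.||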
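|**2024-11-21**|[DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding](http://arxiv.org/abs/2411.14347)|**[link](https://github.com/idea-research/dino-x-api)**|In this paper, we introduce DINO-X, a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extends its input options to support text prompts, visual prompts, and customized prompts. With such flexible prompt options, we develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model's core grounding capability, we have constructed a large-scale dataset with over 100 million high-quality grounding samples, referred to as Grounding-100M, for advancing the model's open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset leads to a foundational object-level representation, which enables DINO-X to integrate multiple perception heads to simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, and object-based question answering. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of the LVIS-minival and LVIS-val benchmarks, both improving on the previous SOTA performance by 5.8 AP. This result underscores its significantly improved capacity for recognizing long-tailed objects.||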
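|**2024-11-21**|[Transforming Static Images Using Generative Models for Video Salient Object Detection](http://arxiv.org/abs/2411.13975)|**[link](https://github.com/suhwan-cho/realflow)**|In many video processing tasks, leveraging large-scale image datasets is a common strategy, as image data is more abundant and facilitates comprehensive knowledge transfer. A typical approach for simulating video from static images involves applying spatial transformations, such as affine transformations and spline warping, to create sequences that mimic temporal progression. However, in tasks like video salient object detection, where both appearance and motion cues are critical, these basic image-to-video techniques fail to produce realistic optical flows that capture the independent motion properties of each object. In this study, we show that image-to-video diffusion models can generate realistic transformations of static images while understanding the contextual relationships between image components. This ability allows the model to generate plausible optical flows, preserving semantic integrity while reflecting the independent motion of scene elements. By augmenting individual images in this way, we create large-scale image-flow pairs that significantly enhance model training. Our approach achieves state-of-the-art performance across all public benchmark datasets, outperforming existing approaches.||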
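|**2024-11-20**|[MambaDETR: Query-based Temporal Modeling using State Space Model for Multi-View 3D Object Detection](http://arxiv.org/abs/2411.13628)|null|Utilizing temporal information to improve 3D detection performance has recently made great progress in the field of autonomous driving. Traditional Transformer-based temporal fusion methods suffer from quadratic computational cost and information decay as the length of the frame sequence increases. In this paper, we propose a novel method called MambaDETR, whose main idea is to implement temporal fusion in the efficient state space. Moreover, we design a Motion Elimination module to remove relatively static objects for temporal fusion. On the standard nuScenes benchmark, our proposed MambaDETR achieves remarkable results in the 3D object detection task, exhibiting state-of-the-art performance among existing temporal fusion methods.||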
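|**2024-11-20**|[HF-Diff: High-Frequency Perceptual Loss and Distribution Matching for One-Step Diffusion-Based Image Super-Resolution](http://arxiv.org/abs/2411.13548)|null|Although recent diffusion-based single-step super-resolution methods achieve better performance than SinSR, they are computationally complex. To improve the performance of SinSR, we investigate preserving the high-frequency detail features during super-resolution (SR), because the degraded images lack detailed information. For this purpose, we introduce a high-frequency perceptual loss by utilizing an invertible neural network (INN) pretrained on the ImageNet dataset. Different feature maps of the pretrained INN produce different high-frequency aspects of an image. During the training phase, we enforce preservation of the high-frequency features of super-resolved and ground truth (GT) images, which improves the SR image quality during inference. Furthermore, we also utilize the Jensen-Shannon divergence between GT and SR images in the pretrained DINO-v2 embedding space to match their distributions. By introducing the high-frequency preserving loss and distribution matching constraint in single-step diffusion-based SR (HF-Diff), we achieve state-of-the-art CLIPIQA scores on the benchmark RealSR, RealSet65, DIV2K-Val, and ImageNet datasets. Furthermore, experimental results on several datasets demonstrate that our high-frequency perceptual loss yields better SR image quality than LPIPS and VGG-based perceptual losses. Our code will be released at https://github.com/shoaib-sami/HF-Diff.||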
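|**2024-11-20**|[DIS-Mine: Instance Segmentation for Disaster-Awareness in Poor-Light Condition in Underground Mines](http://arxiv.org/abs/2411.13544)|null|Detecting disasters in underground mining, such as explosions and structural damage, has been a persistent challenge over the years. This problem is compounded for first responders, who often have no clear information about the extent or nature of the damage within the mine. The poor light or even total darkness inside the mines makes rescue efforts incredibly difficult, leading to a tragic loss of life. In this paper, we propose a novel instance segmentation method called DIS-Mine, specifically designed to identify disaster-affected areas within underground mines under low-light or poor-visibility conditions, aiding first responders in rescue efforts. DIS-Mine is capable of detecting objects in images, even in complete darkness, by addressing challenges such as high noise, color distortion, and reduced contrast. The key innovations of DIS-Mine are built upon four core components: i) image brightness improvement, ii) instance segmentation with SAM integration, iii) Mask R-CNN-based segmentation, and iv) mask alignment with feature matching. In addition, we have collected real-world images from an experimental underground mine, introducing a new dataset named ImageMine, specifically gathered in low-visibility conditions. This dataset serves to validate the performance of DIS-Mine in realistic, challenging environments. Our comprehensive experiments on the ImageMine dataset, as well as on various other datasets, demonstrate that DIS-Mine achieves a superior F1 score of 86.0% and an mIoU of 72.0%, outperforming state-of-the-art instance segmentation methods with at least a 15x improvement and up to 80% higher precision in object detection.||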
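|**2024-11-20**|[Adversarial Diffusion Compression for Real-World Image Super-Resolution](http://arxiv.org/abs/2411.13383)|**[link](https://github.com/guaishou74851/adcsr)**|Real-world image super-resolution (Real-ISR) aims to reconstruct high-resolution images from low-resolution inputs degraded by complex, unknown processes. While many Stable Diffusion (SD)-based Real-ISR methods have achieved remarkable success, their slow, multi-step inference hinders practical deployment. Recent SD-based one-step networks like OSEDiff and S3Diff alleviate this issue but still incur high computational costs due to their reliance on large pretrained SD models. This paper proposes a novel Real-ISR method, AdcSR, by distilling the one-step diffusion network OSEDiff into a streamlined diffusion-GAN model under our Adversarial Diffusion Compression (ADC) framework. We meticulously examine the modules of OSEDiff, categorizing them into two types: (1) removable (VAE encoder, prompt extractor, text encoder, etc.) and (2) prunable (denoising UNet and VAE decoder). Since direct removal and pruning can degrade the model's generation capability, we pretrain our pruned VAE decoder to restore its ability to decode images and employ adversarial distillation to compensate for performance loss. This ADC-based diffusion-GAN hybrid design effectively reduces complexity by 73% in inference time, 78% in computation, and 74% in parameters, while preserving the model's generation capability. Experiments show that our proposed AdcSR achieves competitive recovery quality on both synthetic and real-world datasets, offering up to a 9.3x speedup over previous one-step diffusion-based methods. Code and models will be made available.||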
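|**2024-11-20**|[RTSR: A Real-Time Super-Resolution Model for AV1 Compressed Content](http://arxiv.org/abs/2411.13362)|null|Super-resolution (SR) is a key technique for improving the visual quality of video content by increasing its spatial resolution while reconstructing fine details. SR has been employed in many applications, including video streaming, where compressed low-resolution content is typically transmitted to end users and then reconstructed with a higher resolution and enhanced quality. To support real-time playback, it is important to implement fast SR models while preserving reconstruction quality; however, most existing solutions, in particular those based on complex deep neural networks, fail to do so. To address this issue, this paper proposes a low-complexity SR method, RTSR, designed to enhance the visual quality of compressed video content, focusing on resolution up-scaling from a) 360p to 1080p and from b) 540p to 4K. The proposed approach utilizes a CNN-based network architecture, which was optimized for AV1 (SVT)-encoded content at various quantization levels based on a dual-teacher knowledge distillation method. This method was submitted to the AIM 2024 Video Super-Resolution Challenge, specifically targeting the Efficient/Mobile Real-Time Video Super-Resolution competition. It achieved the best trade-off between complexity and coding performance (measured in PSNR, SSIM, and VMAF) among all six submissions. The code will be available soon.||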
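|**2024-11-20**|[Teaching VLMs to Localize Specific Objects from In-context Examples](http://arxiv.org/abs/2411.13317)|**[link](https://github.com/sivandoveh/iploc)**|Vision-language models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and visual question answering (VQA), when explicitly trained for these tasks. Despite these advances, we find that current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking the context into account. In this work, we focus on few-shot personalized localization, where a model is given a small set of annotated images (in-context examples), each with a category label and a bounding box, and is asked to localize the same type of object in a query image. To elicit personalized localization abilities in models, we present a data-centric solution that fine-tunes them on carefully curated data from video object tracking datasets. By leveraging sequences of frames that track the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly improves few-shot localization performance without sacrificing generalization, as demonstrated on several benchmarks tailored to personalized localization. This work is the first to explore and benchmark personalized few-shot localization for VLMs, laying a foundation for future research on context-driven vision-language applications. The code for our project is available at https://github.com/SivanDoveh/IPLoc.||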
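|**2024-11-20**|[A Resource Efficient Fusion Network for Object Detection in Bird's-Eye View using Camera and Raw Radar Data](http://arxiv.org/abs/2411.13311)|**[link](https://github.com/tue-mps/refnet)**|Cameras can be used to perceive the environment around a vehicle, while affordable radar sensors are popular in autonomous driving systems because, unlike cameras, they withstand adverse weather conditions. However, radar point clouds are sparse, with low azimuth and elevation resolution, and lack semantic and structural information about the scene, resulting in generally lower radar detection performance. In this work, we directly use the raw range-Doppler (RD) spectrum of radar data, thus avoiding radar signal processing. We process camera images independently using our proposed comprehensive image processing pipeline. Specifically, we first transform the camera images to the bird's-eye view (BEV) polar domain and extract the corresponding features with our camera encoder-decoder architecture. The resulting feature maps are fused with range-azimuth (RA) features recovered from the RD spectrum input by the radar decoder to perform object detection. We evaluate our fusion strategy against other existing methods on the RADIal dataset, assessing not only accuracy but also computational complexity metrics.||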
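|**2024-11-20**|[Click; Single Object Tracking; Video Object Segmentation; Real-time Interaction](http://arxiv.org/abs/2411.13183)|null|Single object tracking (SOT) relies on precise initialization of the object bounding box. This paper re-examines the shortcomings of current initialization approaches for single object trackers and proposes ClickTrack, a new paradigm for SOT algorithms that uses click interaction in real-time scenarios. A click as an input type, however, inherently lacks hierarchical information. To resolve ambiguity in certain special scenarios, we design the Guided Click Refiner (GCR), which accepts a point and optional textual information as input and transforms the point into the bounding box expected by the operator. This bounding box is then used as the input to the single object tracker. Experiments on the LaSOT and GOT-10k benchmarks show that trackers combined with GCR achieve stable performance in real-time interactive scenarios. Furthermore, we explore integrating GCR into the Segment Anything Model (SAM), significantly reducing the ambiguity that arises when SAM receives point inputs.||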
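|**2024-11-19**|[GaussianPretrain: A Simple Unified 3D Gaussian Representation for Visual Pre-training in Autonomous Driving](http://arxiv.org/abs/2411.12452)|**[link](https://github.com/public-bots/gaussianpretrain)**|Self-supervised learning has made substantial strides in image processing, while visual pre-training for autonomous driving is still in its infancy. Existing methods often focus on learning geometric scene information while neglecting texture, or treat the two separately, hindering comprehensive scene understanding. In this context, we introduce GaussianPretrain, a novel pre-training paradigm that achieves a holistic understanding of the scene by uniformly integrating geometric and texture representations. Conceptualizing 3D Gaussian anchors as volumetric LiDAR points, the method learns a deeper understanding of scenes and exploits detailed spatial structure and texture to enhance pre-training performance, running 40.6% faster than the NeRF-based method UniPAD while using only 70% of the GPU memory. We demonstrate the effectiveness of GaussianPretrain across multiple 3D perception tasks, showing significant performance improvements, such as a 7.05% increase in NDS for 3D object detection, a 1.9% increase in mAP for HD map construction, and a 0.8% improvement in occupancy prediction. These results highlight GaussianPretrain's theoretical innovation and strong practical potential for advancing visual pre-training in autonomous driving. Source code will be available at https://github.com/Public-BOTs/GaussianPretrain.||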
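|**2024-11-19**|[Physics-Guided Detector for SAR Airplanes](http://arxiv.org/abs/2411.12301)|**[link](https://github.com/xai4sar/pgd)**|The scattered structural distribution (discreteness) and varying scattering characteristics (variability) of airplane targets in synthetic aperture radar (SAR) images pose particular challenges for object detection and recognition. Current deep learning-based detectors struggle to distinguish fine-grained SAR airplanes against complex backgrounds. To address this, we propose a novel physics-guided detector (PGD) learning paradigm for SAR airplanes that jointly considers their discreteness and variability to improve detection performance. It is a general learning paradigm that can be extended to various existing deep learning-based detectors with a "backbone-neck-head" architecture. The main contributions of PGD are physics-guided self-supervised learning, feature enhancement, and instance perception, denoted PGSSL, PGFE, and PGIP, respectively. PGSSL constructs a self-supervised learning task based on a wide range of SAR airplane targets, encoding prior knowledge of various discrete structural distributions into an embedding space. PGFE then enhances the detector's multi-scale feature representation, guided by the physics-aware information learned by PGSSL. PGIP is built on the detection head and learns the refined and dominant scattering points of each SAR airplane instance, thereby alleviating interference from complex backgrounds. We propose two implementations, denoted PGD and PGD-Lite, and apply them to various existing detectors with different backbones and detection heads. Experimental results demonstrate the flexibility and effectiveness of the proposed PGD, which improves existing detectors on SAR airplane fine-grained classification (by up to 3.1% mAP) and achieves state-of-the-art performance (90.7% mAP) on the SAR-AIRcraft-1.0 dataset. The project is open source at https://github.com/XAI4SAR/PGD.||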
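|**2024-11-19**|[Invariant Shape Representation Learning For Image Classification](http://arxiv.org/abs/2411.12201)|**[link](https://github.com/tonmoy-hossain/isrl)**|Geometric shape features have been widely used as strong predictors for image classification. Nevertheless, most existing classifiers, such as deep neural networks (DNNs), directly exploit the statistical correlations between these shape features and target variables. Such correlations are often spurious and unstable across environments (for example, the relationship between certain types of brain changes and neurodegenerative disease is unstable across age groups), leading to biased or inaccurate predictions. In this paper, we introduce a novel framework that, for the first time, develops invariant shape representation learning (ISRL) to further strengthen the robustness of image classifiers. In contrast to existing approaches that mainly derive features in the image space, our model ISRL is designed to jointly capture invariant features in a latent shape space parameterized by deformable transformations. To achieve this, we develop a new learning paradigm based on invariant risk minimization (IRM) to learn invariant representations of image and shape features across multiple training distributions/environments. By embedding features that are invariant with respect to the target variable across environments, our model consistently delivers more accurate predictions. We validate our method on classification tasks using simulated 2D images, real 3D brain images, and cine cardiovascular magnetic resonance images (MRIs). Our code is publicly available at https://github.com/tonmoy-hossain/ISRL.||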
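|**2024-11-19**|[Self-Supervised Learning in Deep Networks: A Pathway to Robust Few-Shot Classification](http://arxiv.org/abs/2411.12151)|null|This study combines self-supervised learning with the deep network model ResNet-101 to optimize few-shot image classification and improve the model's feature extraction and classification performance. During training, we first pretrain the model with a self-supervised method so that it learns general feature representations from large amounts of unlabeled data; we then fine-tune it on the few-shot dataset Mini-ImageNet to improve its accuracy and generalization under limited data. Experimental results show that, compared with traditional convolutional neural networks, ResNet-50, DenseNet, and other models, our method achieves excellent performance of about 95.12% in both classification accuracy (ACC) and F1 score, verifying the effectiveness of self-supervised learning for few-shot classification. The method provides an efficient and reliable solution for few-shot image classification.||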
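|**2024-11-18**|[Scaling Deep Learning Research with Kubernetes on the NRP Nautilus HyperCluster](http://arxiv.org/abs/2411.12038)|null|Throughout scientific computing, deep learning algorithms have shown excellent performance in a wide range of applications. As these deep neural networks (DNNs) continue to mature, the computation required to train them keeps growing. Today, modern DNNs require millions of FLOPs and days to weeks of training to produce a well-trained model. DNN training time is often a bottleneck in DNN research across deep learning applications, so accelerating and scaling DNN training enables more robust and faster research. To this end, in this work we explore using the NRP Nautilus HyperCluster to automate and scale deep learning model training across three separate DNN applications: overhead object detection, burned area segmentation, and deforestation detection. In total, we trained 234 deep neural models on Nautilus, for a total training time of 4,040 hours.||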
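|**2024-11-18**|[Fair Distillation: Teaching Fairness from Biased Teachers in Medical Imaging](http://arxiv.org/abs/2411.11939)|null|Deep learning has achieved remarkable success in image classification and segmentation tasks. However, fairness concerns persist, because models often exhibit disproportionate biases toward demographic groups defined by sensitive attributes such as race, gender, or age. Existing bias mitigation techniques, including subgroup re-balancing, adversarial training, and domain generalization, aim to balance accuracy across demographic groups, but they often fail to simultaneously improve overall accuracy, group-specific accuracy, and fairness because these interdependent objectives conflict. We propose Fair Distillation (FairDi), a novel fairness approach that decomposes these objectives by leveraging biased "teacher" models, each optimized for a specific demographic group. These teacher models then guide the training of a unified "student" model, which distills their knowledge to maximize overall and group-specific accuracy while minimizing inter-group disparity. Experiments on medical imaging datasets show that FairDi achieves significant gains in overall accuracy, group-specific accuracy, and fairness compared with existing methods. FairDi is applicable to various medical tasks, such as classification and segmentation, and offers an effective solution for equitable model performance.||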
|**2024-11-18**|[LightFFDNets: Lightweight Convolutional Neural Networks for Rapid Facial Forgery Detection](http://arxiv.org/abs/2411.11826)|null|Accurate and fast recognition of forgeries is of great importance in artificial intelligence, image processing, and object detection. Recognizing forgeries of facial imagery is the process of classifying and defining the faces in it by analyzing real-world facial images. This is usually accomplished by extracting features from an image, applying classifier algorithms, and correctly interpreting the results. Recognizing facial forgeries correctly faces many challenges. For example, changing lighting conditions and viewing faces from different angles can affect recognition performance, and background complexity and perspective changes in facial images can make accurate recognition difficult. Despite these difficulties, significant progress has been made in forgery detection. Deep learning algorithms, especially Convolutional Neural Networks (CNNs), have significantly improved forgery detection performance. This study focuses on image processing-based forgery detection using the Fake-Vs-Real-Faces (Hard) [10] and 140k Real and Fake Faces [61] data sets. Both data sets consist of two classes containing real and fake facial images. We propose two lightweight deep learning models to perform forgery detection on these images. Additionally, 8 different pretrained CNN architectures were tested on both data sets, and the results were compared with the newly developed lightweight CNN models. The proposed lightweight models are shown to have a minimal number of layers and to detect forgeries of facial imagery accurately and with low computational cost. Although the data sets consist only of face images, the developed models can also be used in other two-class object recognition problems.||
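|**2024-11-18**|[WoodYOLO: A Novel Object Detector for Wood Species Detection in Microscopic Images](http://arxiv.org/abs/2411.11738)|null|Wood species identification plays a crucial role in various industries, from ensuring the legality of timber products to advancing ecological conservation efforts. This paper introduces WoodYOLO, a novel object detection algorithm designed specifically for microscopic wood fiber analysis. Our approach adapts the YOLO architecture to address the challenges posed by large, high-resolution microscopy images and the need for high recall in localizing the cell type of interest (vessel elements). Our results show that WoodYOLO significantly outperforms state-of-the-art models, improving F2 score by 12.9% and 6.5% over YOLOv10 and YOLOv7, respectively. This improvement in automated wood cell type localization contributes to better regulatory compliance, supports sustainable forestry practices, and promotes biodiversity conservation efforts worldwide.||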
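|**2024-11-18**|[Learning a Neural Association Network for Self-supervised Multi-Object Tracking](http://arxiv.org/abs/2411.11514)|null|This paper introduces a novel framework for learning data association in multi-object tracking in a self-supervised manner. Fully supervised learning methods are known to achieve excellent tracking performance, but acquiring identity-level annotations is tedious and time-consuming. Motivated by the fact that object motion in real-world scenes can usually be represented by a Markov process, we present a novel expectation maximization (EM) algorithm that trains a neural network to associate detections for tracking without requiring prior knowledge of their temporal correspondences. At the core of our method lies a neural Kalman filter whose observation model is conditioned on detection associations parameterized by a neural network. Given a batch of frames as input, data associations between detections in adjacent frames are predicted by the neural network and then normalized with Sinkhorn normalization, which determines the assignment probabilities of detections to states. Kalman smoothing is then used to obtain the marginal probability of the observations given the inferred states, yielding a training objective that maximizes this marginal probability using gradient descent. The proposed framework is fully differentiable, allowing the underlying neural model to be trained end to end. We evaluate our approach on the challenging MOT17 and MOT20 datasets and achieve state-of-the-art results among self-supervised trackers using public detections. We further demonstrate the ability of the learned model to generalize across datasets.||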
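|**2024-11-18**|[SL-YOLO: A Stronger and Lighter Drone Target Detection Model](http://arxiv.org/abs/2411.11477)|null|Detecting small objects in complex scenes, such as those captured by drones, is a daunting challenge because the intricate features of small targets are hard to capture. While the YOLO family has achieved great success in large-target detection, its performance on small targets is unsatisfactory. This paper therefore proposes a revolutionary model, SL-YOLO (Stronger and Lighter YOLO), that aims to break the bottleneck of small-target detection. We propose the Hierarchical Extended Path Aggregation Network (HEPAN), a pioneering cross-scale feature fusion method that ensures unparalleled detection accuracy even in the most challenging environments. Meanwhile, without sacrificing detection capability, we design the C2fDCB lightweight module and add the SCDown downsampling module, greatly reducing the model's parameters and computational complexity. Our experimental results on the VisDrone2019 dataset show a significant performance improvement, with mAP@0.5 jumping from 43.0% to 46.9% and mAP@0.5:0.95 increasing from 26.0% to 28.9%. At the same time, the model parameters are reduced from 11.1M to 9.6M and the FPS can reach 132, making it an ideal solution for real-time small-object detection in resource-constrained environments.||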
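|**2024-11-15**|[On the Cost of Model-Serving Frameworks: An Experimental Evaluation](http://arxiv.org/abs/2411.10337)|null|In machine learning (ML), the inference phase is the process of applying pretrained models to new, unseen data in order to make predictions. During inference, end users interact with ML services to obtain insights, recommendations, or actions based on the input data. Serving strategies are therefore essential for deploying and managing models effectively in production environments. These strategies ensure that models are available, scalable, reliable, and performant for real-world applications such as time series forecasting, image classification, and natural language processing. In this paper, we evaluate the performance of five widely used model serving frameworks (TensorFlow Serving, TorchServe, MLServer, MLflow, and BentoML) under four different scenarios (malware detection, cryptocoin price forecasting, image classification, and sentiment analysis). We demonstrate that TensorFlow Serving outperforms all the other frameworks in serving deep learning (DL) models. Moreover, we show that DL-specific frameworks (TensorFlow Serving and TorchServe) exhibit significantly lower latencies than the three general-purpose ML frameworks (BentoML, MLflow, and MLServer).||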
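|**2024-11-15**|[Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning](http://arxiv.org/abs/2411.10252)|null|Multimodal large language models (MLLMs) excel at image description tasks but often fall short in precise object localization, a key element of reliable visual understanding. In contrast, traditional object detection models localize accurately but often produce detections that lack contextual coherence because they model inter-object relationships only to a limited extent. To address this fundamental limitation, we introduce the Visual-Linguistic Agent (VLA), a collaborative framework that combines the relational reasoning strengths of MLLMs with the precise localization capabilities of traditional object detectors. In the VLA paradigm, the MLLM serves as a central linguistic agent that collaborates with vision agents specialized in object detection and classification. The linguistic agent evaluates and refines detections by reasoning over the spatial and contextual relationships among objects, while the classification vision agent provides corrective feedback to improve classification accuracy. This collaborative approach enables VLA to significantly enhance both spatial reasoning and object localization, addressing key challenges in multimodal understanding. Extensive evaluations on the COCO dataset show that VLA substantially improves the performance of multiple detection models, highlighting its potential to set a new benchmark for accurate and contextually coherent object detection.||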
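|**2024-11-15**|[A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift](http://arxiv.org/abs/2411.10231)|null|Transformer-based super-resolution (SR) models have recently advanced image reconstruction quality, yet challenges remain due to computational complexity and an over-reliance on large patch sizes, which limits fine-grained detail enhancement. In this work, we propose TaylorIR to address these limitations. It uses a patch size of 1x1, enabling pixel-level processing in any transformer-based SR model. To cope with the heavy computational demands under the traditional self-attention mechanism, we employ the TaylorShift attention mechanism, a memory-efficient alternative based on Taylor series expansion that achieves full token-to-token interactions with linear complexity. Experimental results show that our approach achieves new state-of-the-art SR performance while reducing memory consumption by up to 60% compared with traditional self-attention-based transformers.||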
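|**2024-11-15**|[Embedding Byzantine Fault Tolerance into Federated Learning via Virtual Data-Driven Consistency Scoring Plugin](http://arxiv.org/abs/2411.10212)|**[link](https://github.com/NAVER-INTEL-Co-Lab/gaudi-byzantine)**|Given sufficient data from multiple edge devices, federated learning (FL) enables training a shared model without transmitting private data to a central server. However, FL is generally vulnerable to Byzantine attacks from compromised edge devices, which can significantly degrade model performance. In this paper, we propose an intuitive plugin that can be integrated into existing FL techniques to achieve Byzantine fault tolerance. The key idea is to generate virtual data samples and evaluate model consistency scores across local updates, effectively filtering out compromised edge devices. By applying this scoring mechanism before the aggregation phase, the proposed plugin makes existing FL techniques robust against Byzantine attacks while retaining their original benefits. Numerical results on a medical image classification task validate that plugging the proposed approach into representative FL algorithms effectively achieves Byzantine fault tolerance. Furthermore, the plugin preserves the original convergence properties of the base FL algorithms when no Byzantine attacks are present.||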
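|**2024-11-15**|[MOT\_FCG++: Enhanced Representation of Motion and Appearance Features](http://arxiv.org/abs/2411.10028)|null|The goal of multi-object tracking (MOT) is to detect and track all objects in a scene across frames while maintaining a unique identity for each object. Most existing methods rely on the spatial motion features and appearance embedding features of the objects detected in consecutive frames. Representing the spatial and appearance features of long trajectories effectively and robustly has become a critical factor affecting MOT performance. We propose a novel approach to appearance and spatial feature representation that improves upon the clustering association method MOT\_FCG. For spatial motion features, we propose Diagonal Modulated GIoU, which more accurately represents the relationship between the position and shape of objects. For appearance features, we use a dynamic appearance representation that incorporates confidence information, making trajectory appearance features more robust and global. Building on the baseline model MOT\_FCG, we achieve 76.1 HOTA, 80.4 MOTA, and 81.3 IDF1 on the MOT17 validation set, and also obtain competitive performance on the MOT20 and DanceTrack validation sets.||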
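|**2024-11-14**|[Local-Global Attention: An Adaptive Mechanism for Multi-Scale Feature Integration](http://arxiv.org/abs/2411.09604)|**[link](https://github.com/ziyueqingwan/localglobalattention)**|In recent years, attention mechanisms have significantly improved object detection performance by focusing on key feature information. However, prevailing methods still struggle to balance local and global features effectively. This imbalance hampers their ability to capture both fine-grained details and broader contextual information, two critical elements for accurate object detection. To address these challenges, we propose a novel attention mechanism, termed Local-Global Attention, designed to better integrate local and global contextual features. Specifically, our approach combines multi-scale convolutions with positional encoding, enabling the model to focus on local details while also accounting for the broader global context. Additionally, we introduce learnable parameters that allow the model to dynamically adjust the relative importance of local and global attention according to the specific requirements of the task, thereby optimizing feature representations across multiple scales. We thoroughly evaluate the Local-Global Attention mechanism on several widely used object detection and classification datasets. Our experimental results show that this approach significantly enhances the detection of objects at various scales, with particularly strong performance on multi-class and small object detection tasks. Compared with existing attention mechanisms, Local-Global Attention consistently outperforms them on several key metrics while maintaining computational efficiency.||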
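|**2024-11-14**|[GAN-Based Architecture for Low-dose Computed Tomography Imaging Denoising](http://arxiv.org/abs/2411.09512)|null|Generative adversarial networks (GANs) have emerged as a revolutionary element in low-dose computed tomography (LDCT) imaging, offering an advanced solution to the long-standing problem of reconciling radiation exposure with image quality. This review synthesizes the rapid progress of GAN-based LDCT denoising techniques, examining the evolution from foundational architectures to state-of-the-art models that incorporate advanced features such as anatomical priors, perceptual loss functions, and innovative regularization strategies. We critically analyze various GAN architectures, including conditional GANs (cGANs), CycleGANs, and super-resolution GANs (SRGANs), elucidating their unique strengths and limitations in the context of LDCT denoising. The evaluation provides both qualitative and quantitative results related to performance improvements on benchmark and clinical datasets, using metrics such as peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and learned perceptual image patch similarity (LPIPS). After highlighting the positive results, we discuss some of the challenges that prevent wider clinical adoption, including the interpretability of GAN-generated images, synthetic artifacts, and the need for clinically relevant metrics. The review concludes by underscoring the significance of GAN-based methods for advancing precision medicine through tailored LDCT denoising models, and by emphasizing the transformative possibilities that artificial intelligence brings to contemporary radiological practice.||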
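|**2024-11-14**|[ISAC Super-Resolution Receiver via Lifted Atomic Norm Minimization](http://arxiv.org/abs/2411.09495)|null|This paper introduces an off-the-grid estimator for integrated sensing and communication (ISAC) systems that uses lifted atomic norm minimization (LANM). The key challenge in this setting is that both the transmitted signals and the radar-communication channels are unknown. We prove that LANM can simultaneously achieve radar target localization and communication symbol decoding when the number of observations is proportional to the degrees of freedom of the ISAC system. Although the problem is inherently ill-posed, we apply the lifting technique to initially encode the transmitted signals. We then use the atomic norm to promote the structured low-rankness of the ISAC channel. Using a dual technique, we transform LANM into an infinite-dimensional search over the signal domain, and subsequently use semidefinite relaxation (SDR) to implement the dual problem. We extend our approach to practical scenarios in which the received signals are contaminated by additive white Gaussian noise (AWGN) and jamming signals. Furthermore, we derive the computational complexity of the proposed estimator and show that it is equivalent to that of conventional pilot-aided ANM for estimating the channel parameters. Our simulation experiments show that the proposed LANM approach can estimate both communication data and target parameters with performance comparable to traditional radar-only super-resolution techniques.||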
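|**2024-11-14**|[ResidualDroppath: Enhancing Feature Reuse over Residual Connections](http://arxiv.org/abs/2411.09475)|null|Residual connections are one of the most important components in neural network architectures, mitigating the vanishing gradient problem and facilitating the training of much deeper networks. One possible explanation for how residual connections help deeper network training is that they promote feature reuse. However, we identify and analyze the limitations of feature reuse with vanilla residual connections. To address these limitations, we propose modifications to the training method. Specifically, we give the model additional opportunities to learn feature reuse over residual connections through two types of iterations during training. The first type of iteration uses droppath, which enforces feature reuse by randomly dropping a subset of layers. The second type of iteration focuses on training the dropped parts of the model while freezing the undropped parts. As a result, the dropped parts learn in a way that encourages feature reuse, because the model relies on the undropped parts, which were trained with feature reuse in mind. Overall, we demonstrate performance improvements in image classification for models with residual connections in certain cases.||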
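|**2024-11-14**|[SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers](http://arxiv.org/abs/2411.09420)|**[link](https://github.com/shravan-18/SAG-ViT)**|Image classification is a computer vision task in which a model analyzes an image and assigns it to a specific label. Vision Transformers (ViTs) improve this task by using self-attention to capture complex patterns and long-range relationships between image patches. However, a key challenge for ViTs is efficiently incorporating multi-scale feature representations, which is inherent in CNNs through their hierarchical structure. In this paper, we introduce the Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework that addresses this challenge by integrating multi-scale features. The model uses EfficientNet as a backbone to extract multi-scale feature maps, which are divided into patches to preserve semantic information. These patches are organized into a graph based on spatial and feature similarities, and a Graph Attention Network (GAT) refines the node embeddings. Finally, a Transformer encoder captures long-range dependencies and complex interactions. SAG-ViT is evaluated on benchmark datasets, demonstrating its effectiveness in improving image classification performance.||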
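|**2024-11-14**|[Instruction-Driven Fusion of Infrared-Visible Images: Tailoring for Diverse Downstream Tasks](http://arxiv.org/abs/2411.09387)|null|The core value of infrared and visible image fusion lies in applying the fused results to downstream tasks. However, when handling multiple downstream tasks simultaneously, existing methods face challenges such as increased training complexity and significantly degraded performance on individual tasks. To address this, we propose Task-Oriented Adaptive Regulation (T-OAR), an adaptive mechanism designed specifically for multi-task environments. Additionally, we introduce the Task-related Dynamic Prompt Injection (T-DPI) module, which generates task-specific dynamic prompts from user-supplied text instructions and integrates them into the target representations. This guides the feature extraction module to produce representations that are better aligned with the specific requirements of downstream tasks. By incorporating the T-DPI module into the T-OAR framework, our approach generates fusion images tailored to task-specific requirements without separate training or task-specific weights. This not only reduces computational costs but also enhances adaptability and performance across multiple tasks. Experimental results show that our method performs well in object detection, semantic segmentation, and salient object detection, demonstrating strong adaptability, flexibility, and task specificity. This provides an efficient solution for image fusion in multi-task environments and highlights the technology's potential across diverse applications.||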
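|**2024-11-14**|[Cross-Modal Consistency in Multimodal Large Language Models](http://arxiv.org/abs/2411.09273)|null|Recent developments in multimodal methods mark the start of an exciting era for models that can handle diverse data types, including text, audio, and visual content. Models such as GPT-4V, which merge computer vision with advanced language processing, show extraordinary ability on complex tasks that require understanding both textual and visual information at once. Prior research has carefully evaluated the effectiveness of these vision large language models (VLLMs) in various domains, including object detection, image captioning, and other related fields. However, existing analyses are often limited: they mainly evaluate the performance of each modality in isolation and neglect the intricate cross-modal interactions. Specifically, the question of whether these models achieve the same level of accuracy when confronted with identical task instances in different modalities remains unanswered. In this study, we take the initiative to examine the interaction and comparison among these modalities of interest by introducing a novel concept termed cross-modal consistency. We further propose a quantitative evaluation framework founded on this concept. Our experimental findings, drawn from a curated collection of parallel vision-language datasets that we developed, reveal a pronounced inconsistency between the vision and language modalities within GPT-4V, despite its portrayal as a unified multimodal model. Our research yields insights into the appropriate use of such models and hints at potential avenues for improving their design.||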
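|**2024-11-14**|[LEAP:D -- A Novel Prompt-based Approach for Domain-Generalized Aerial Object Detection](http://arxiv.org/abs/2411.09180)|null|Drone-captured images present significant challenges for object detection because varying shooting conditions alter the appearance and shape of objects. Factors such as drone altitude, angle, and weather cause these variations and affect the performance of object detection algorithms. To tackle these challenges, we introduce an innovative vision-language approach that uses learnable prompts. This shift from conventional manual prompts aims to reduce the interference of domain-specific knowledge and ultimately improve object detection capability. Furthermore, we streamline the training process with a one-step approach that updates the learnable prompt concurrently with model training, improving efficiency without compromising performance. Our study contributes to domain-generalized object detection by leveraging learnable prompts and optimizing the training procedure. This enhances the model's robustness and adaptability across diverse environments, leading to more effective aerial object detection.||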
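|**2024-11-14**|[Performance Boundaries and Tradeoffs in Super-Resolution Imaging Technologies for Space Targets](http://arxiv.org/abs/2411.09155)|null|Inverse synthetic aperture radar (ISAR) super-resolution imaging is widely used for imaging space targets. However, the performance limits of super-resolution imaging algorithms remain a rarely explored issue. This paper investigates these limits by analyzing the boundaries of super-resolution algorithms for space targets and examining the relationships between the key contributing factors. In particular, drawing on the established mathematical theory of the computational resolution limit (CRL) for line spectrum reconstruction, we derive mathematical expressions for the upper and lower bounds of cross-range super-resolution imaging, based on a transformation of the ISAR imaging model. Using these explicit expressions, we first explore the factors that influence these bounds, such as the traditional Rayleigh limit, the number of scatterers, and the peak signal-to-noise ratio (PSNR) of the scatterers. We then clarify the minimum resource requirements that CRL theory imposes on ISAR imaging to meet a desired cross-range resolution; without these resources, studying super-resolution algorithms becomes unnecessary in practice. Furthermore, we analyze the trade-offs among the cumulative rotation angle, the radar transmit energy, and other factors that affect resolution. Simulations are conducted to demonstrate these trade-offs across various ISAR imaging scenarios, revealing that they depend strongly on the specific imaging targets.||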
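|**2024-11-14**|[Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery](http://arxiv.org/abs/2411.09101)|**[link](https://github.com/ashimdahal/vit-vs-cnn-image-segmentation)**|Vision Transformers (ViTs) have recently brought a new wave of research to computer vision. These models have performed particularly well in image classification and segmentation. Research on semantic and instance segmentation has accelerated with the arrival of the new architecture, with over 80% of the top 20 benchmarks for the iSAID dataset based on either the ViT architecture or the attention mechanism behind it. This paper presents a heuristic comparison of three key factors in using (or not using) ViTs for semantic segmentation of remote sensing aerial images on the iSAID dataset. The experimental results were examined against the following objectives: 1. use of a weighted fused loss function to maximize the mean Intersection over Union (mIoU) score and the Dice score, and to minimize or conserve entropy or class representation; 2. comparison of transfer learning on Meta's MaskFormer, a ViT-based semantic segmentation model, against a generic UNet convolutional neural network (CNN), judged by mIoU, Dice scores, training efficiency, and inference time; and 3. what do we lose for what we gain? That is, comparing the two models against current state-of-the-art segmentation models. We show that the novel combined weighted loss function significantly boosts the CNN model's performance compared with transfer learning on the ViT. The code for this implementation can be found at https://github.com/ashimdahal/ViT-vs-CNN-ImageSegmentation.||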
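|**2024-11-12**|[Large-scale Remote Sensing Image Target Recognition and Automatic Annotation](http://arxiv.org/abs/2411.07802)|**[link](https://github.com/anaerovane/lrsaa)**|This paper presents LRSAA, a method for object recognition and automatic annotation in large-area remote sensing images. The method integrates the YOLOv11 and MobileNetV3-SSD object detection algorithms through ensemble learning to improve model performance. It also employs Poisson disk sampling segmentation and the EIOU metric to optimize the training and inference of segmented images, followed by integration of the results. This approach not only reduces the demand for computational resources but also strikes a good balance between accuracy and speed. The source code for this project is publicly available at https://github.com/anaerovane/LRSAA.||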
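|**2024-11-12**|[ALANINE: A Novel Decentralized Personalized Federated Learning For Heterogeneous LEO Satellite Constellation](http://arxiv.org/abs/2411.07752)|null|Low Earth Orbit (LEO) satellite constellations have grown significantly in scale and functionality in recent years, integrating capabilities such as communication, navigation, and remote sensing. However, the heterogeneity of the data collected by different satellites and the problem of efficient inter-satellite collaborative computation pose significant obstacles to realizing the potential of these constellations. Existing approaches struggle with data heterogeneity, varying image resolutions, and the need for efficient on-orbit model training. To address these challenges, we propose a novel decentralized personalized federated learning framework for heterogeneous LEO satellite constellations (ALANINE). ALANINE incorporates decentralized federated learning (DFL) for satellite image super-resolution (SR), which improves the quality of the input data. It then uses personalized federated learning (PFL) to implement a personalized approach that accounts for the unique characteristics of satellite data. In addition, the framework employs advanced model pruning to optimize model complexity and transmission efficiency. The framework enables efficient data acquisition and processing while improving the accuracy of PFL image processing models. Simulation results show that ALANINE outperforms traditional centralized approaches in on-orbit training of SR and PFL image processing models. The approach yields substantial improvements in data acquisition efficiency, processing accuracy, and model adaptability to local satellite conditions.||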
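|**2024-11-12**|[Efficient 3D Perception on Multi-Sweep Point Cloud with Gumbel Spatial Pruning](http://arxiv.org/abs/2411.07742)|null|This paper studies point cloud perception in outdoor environments. Because outdoor point clouds are sparse, existing methods are limited in recognizing distant or occluded objects. In this work, we observe that this problem can be significantly mitigated by accumulating multiple temporally consecutive LiDAR sweeps, which markedly improves perception accuracy. However, the computational cost also increases, preventing previous approaches from using a large number of LiDAR sweeps. To tackle this challenge, we find that a considerable portion of the points in the accumulated point cloud is redundant, and discarding these points has minimal impact on perception accuracy. We introduce a simple yet effective Gumbel Spatial Pruning (GSP) layer that dynamically prunes points based on learned end-to-end sampling. The GSP layer is decoupled from other network components and can therefore be integrated seamlessly into existing point cloud network architectures. Without incurring additional computational overhead, we increase the number of LiDAR sweeps from the commonly used 10 to as many as 40. As a result, perception performance improves significantly. For instance, in nuScenes 3D object detection and BEV map segmentation tasks, our pruning strategy improves the vanilla TransL baseline and other baseline methods.||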
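|**2024-11-12**|[Numerical Homogenization by Continuous Super-Resolution](http://arxiv.org/abs/2411.07576)|null|Finite element methods typically require a high resolution to satisfactorily approximate the micro and even macro patterns of an underlying physical model. This issue can be circumvented by appropriate numerical homogenization or multiscale strategies that are able to obtain reasonable approximations on under-resolved scales. In this paper, we study implicit neural representations and propose a continuous super-resolution network as a numerical homogenization strategy. It can take coarse finite element data and learn both in-distribution and out-of-distribution high-resolution finite element predictions. Our highlight is the design of a local implicit transformer that is able to learn multiscale features. We also propose Gabor wavelet-based coordinate encodings, which can overcome the bias of neural networks toward learning low-frequency features. Finally, scientists often prefer perception over distortion so that they can recognize visual patterns for further investigation. A drawback of implicit neural representations, however, is the lack of supervision on local patterns. We propose using stochastic cosine similarities to compare local feature differences between the prediction and the ground truth, which yields better structural alignment. Our experiments show that the proposed strategy achieves superior performance as both an in-distribution and an out-of-distribution super-resolution strategy.||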
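|**2024-11-12**|[Depthwise Separable Convolutions with Deep Residual Convolutions](http://arxiv.org/abs/2411.07544)|null|Recent advances in edge computing have enabled researchers to optimize various deep learning architectures for deployment on edge devices. This study aims to optimize the Xception architecture, one of the most popular deep learning algorithms for computer vision applications. The Xception architecture is highly effective for object detection tasks, but it comes with a significant computational cost. The computational complexity of Xception sometimes hinders its deployment on resource-constrained edge devices. To address this, we propose an Xception architecture optimized for edge devices, aiming for lightweight and efficient deployment. We combine depthwise separable convolutions with the deep residual convolutions of the Xception architecture to develop a small and efficient model for edge devices. The resulting architecture reduces the number of parameters, memory usage, and computational load. We evaluate the proposed architecture on the CIFAR 10 object detection dataset. Our experimental evaluation also shows that the proposed architecture has fewer parameters and requires less training time while outperforming the Xception architecture.||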
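|**2024-11-11**|[Ensemble Learning for Microbubble Localization in Super-Resolution Ultrasound](http://arxiv.org/abs/2411.07376)|null|Super-resolution ultrasound (SR-US) is a powerful imaging technique for capturing microvasculature and blood flow at high spatial resolution. However, accurate microbubble (MB) localization remains a key challenge, because localization errors propagate through the subsequent stages of the super-resolution process and affect overall performance. In this paper, we explore the potential of ensemble learning techniques to enhance MB localization by increasing detection sensitivity and reducing false positives. Our study evaluates the effectiveness of ensemble methods on both in vivo and simulated outputs of a Deformable DEtection TRansformer (Deformable DETR) network. As a result of our study, we demonstrate the advantages of these ensemble approaches, namely improved precision and recall in MB detection, and offer insights into their application in SR-US.||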
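|**2024-11-11**|[General Geospatial Inference with a Population Dynamics Foundation Model](http://arxiv.org/abs/2411.07207)|**[link](https://github.com/google-research/population-dynamics)**|To support the health and well-being of dynamic populations around the world, government agencies, organizations, and researchers need to understand and reason over the complex relationships between human behavior and local contexts in order to identify high-risk groups and strategically allocate limited resources. Traditional approaches to these classes of problems often entail developing manually curated, task-specific features and models to represent human behavior and the natural and built environment, which can be difficult to adapt to new, or even related, tasks. To address this, we introduce a Population Dynamics Foundation Model (PDFM) that aims to capture the relationships between diverse data modalities and is applicable to a broad range of geospatial tasks. We first construct a geo-indexed dataset for postal codes and counties across the United States, capturing rich aggregated information on human behavior from maps, busyness, and aggregated search trends, as well as environmental factors such as weather and air quality. We then model this data and the complex relationships between locations using a graph neural network, producing embeddings that can be adapted to a wide range of downstream tasks using relatively simple models. We evaluate the effectiveness of our approach by benchmarking it on 27 downstream tasks spanning three distinct domains: health indicators, socioeconomic factors, and environmental measurements. The approach achieves state-of-the-art performance on all 27 geospatial interpolation tasks, and on 25 out of the 27 extrapolation and super-resolution tasks. We combine the PDFM with a state-of-the-art forecasting foundation model, TimesFM, to predict unemployment and poverty, achieving performance that surpasses fully supervised forecasting. The full set of embeddings and sample code are publicly available for researchers.||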
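|**2024-11-11**|[Transformers for Charged Particle Track Reconstruction in High Energy Physics](http://arxiv.org/abs/2411.07149)|null|Reconstructing charged particle tracks is a fundamental task in modern collider experiments. The unprecedented particle multiplicities expected at the High-Luminosity Large Hadron Collider (HL-LHC) pose significant challenges for track reconstruction, where traditional algorithms will struggle with the computational load. To address this challenge, we present a novel learned approach to track reconstruction that adapts recent advances in computer vision and object detection. Our architecture combines a Transformer hit filtering network with a MaskFormer reconstruction model that jointly optimizes hit assignments and the estimation of charged particle properties. Evaluated on the TrackML dataset, our best performing model achieves state-of-the-art tracking performance with 97% efficiency, a fake rate of 0.6%, and an inference time of 100 ms. Our tunable approach enables specialization for specific applications such as triggering systems, while its underlying principles can be extended to other reconstruction challenges in high energy physics. This work demonstrates the potential of modern deep learning architectures to address emerging computational challenges in particle physics while maintaining the precision required for groundbreaking physics analysis.||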
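|**2024-11-11**|[The Inherent Adversarial Robustness of Analog In-Memory Computing](http://arxiv.org/abs/2411.07023)|null|A key challenge for deep neural network (DNN) algorithms is their vulnerability to adversarial attacks. Inherently non-deterministic compute substrates, such as those based on analog in-memory computing (AIMC), have been speculated to provide significant adversarial robustness when performing DNN inference. In this paper, we experimentally validate this conjecture for the first time on an AIMC chip based on phase change memory (PCM) devices. We demonstrate higher adversarial robustness against different types of adversarial attacks when implementing an image classification network. Additional robustness is also observed when performing hardware-in-the-loop attacks, in which the attacker is assumed to have full access to the hardware. A careful study of the various noise sources indicates that a combination of stochastic noise sources (both recurrent and non-recurrent) is responsible for the adversarial robustness, and that their type and magnitude disproportionately affect this property. Finally, simulations show that the additional robustness is still observed when a much larger transformer network is used to perform a natural language processing (NLP) task.||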
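|**2024-11-11**|[BuckTales : A multi-UAV dataset for multi-object tracking and re-identification of wild antelopes](http://arxiv.org/abs/2411.06896)|null|Understanding animal behavior is central to predicting, understanding, and mitigating the impacts of natural and anthropogenic changes on animal populations and ecosystems. However, the challenges of acquiring and processing long-term, ecologically relevant data in wild settings have constrained the scope of behavioral research. The increasing availability of unmanned aerial vehicles (UAVs), coupled with advances in machine learning, has opened new opportunities for wildlife monitoring using aerial tracking. However, the lack of datasets featuring wild animals in natural habitats has hindered progress on automated computer vision solutions for long-term animal tracking. Here we introduce BuckTales, the first large-scale UAV dataset designed to address the multi-object tracking (MOT) and re-identification (Re-ID) problems in wild animals, specifically the mating behavior (or lekking) of blackbuck antelopes. Collected in collaboration with biologists, the MOT dataset includes over 1.2 million annotations, including 680 tracks across 12 high-resolution (5.4K) videos, each averaging 66 seconds and featuring 30 to 130 individuals. The Re-ID dataset includes 730 individuals captured simultaneously with two UAVs. The dataset is designed to drive scalable, long-term animal behavior tracking using multiple camera sensors. By providing baseline performance with two detectors and benchmarking several state-of-the-art tracking methods, our dataset reflects the real-world challenges of tracking wild animals in socially and ecologically relevant contexts. By making this data widely available, we hope to catalyze progress in wildlife MOT and Re-ID, fostering insights into animal behavior, conservation efforts, and ecosystem dynamics through automated, long-term monitoring.||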
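|**2024-11-08**|[Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification](http://arxiv.org/abs/2411.05698)|null|Convolutional neural networks (CNNs) have seen significant performance improvements in recent years. However, due to their size and complexity, they function as black boxes, which raises transparency concerns. State-of-the-art saliency methods generate local explanations that highlight the region of the input image where a class is identified, but they cannot explain how a concept of interest contributes to the prediction, which is essential for bias mitigation. On the other hand, concept-based methods such as TCAV (Testing with Concept Activation Vectors) provide insight into how sensitive the network is to a concept, but they cannot compute its attribution in a specific prediction or show its location within the input image. This paper introduces Visual-TCAV, a novel post-hoc explainability framework that aims to bridge the gap between these methods by providing both local and global explanations for CNN-based image classification. Visual-TCAV uses Concept Activation Vectors (CAVs) to generate saliency maps that show where the network recognizes concepts. Moreover, it can estimate the attribution of these concepts to the output of any class using a generalization of Integrated Gradients. The framework is evaluated on popular CNN architectures, and its validity is further confirmed through experiments in which the ground truth for the explanations is known and through a comparison with TCAV. Our code will be made available soon.||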
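|**2024-11-08**|[Open-set object detection: towards unified problem formulation and benchmarking](http://arxiv.org/abs/2411.05564)|null|In real-world applications where confidence is key, such as autonomous driving, accurately detecting and appropriately handling classes that differ from those used during training is crucial. Although various unknown object detection approaches have been proposed, we observe widespread inconsistencies among them in the datasets, metrics, and scenarios used, along with a notable absence of a clear definition of unknown objects, which hampers meaningful evaluation. To counter these issues, we introduce two benchmarks: a unified VOC-COCO evaluation and the new OpenImagesRoad benchmark, which provides clear hierarchical object definitions in addition to new evaluation metrics. Complementing the benchmarks, we exploit the performance of recent self-supervised Vision Transformers to improve pseudo-labeling-based open-set object detection (OSOD) through OW-DETR++. State-of-the-art methods are extensively evaluated on the proposed benchmarks. This study provides a clear problem definition, ensures consistent evaluations, and draws new conclusions about the effectiveness of OSOD strategies.||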
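|**2024-11-08**|[Training objective drives the consistency of representational similarity across datasets](http://arxiv.org/abs/2411.05561)|**[link](https://github.com/lciernik/similarity_consistency)**|The Platonic Representation Hypothesis claims that recent foundation models are converging to a shared representation space as a function of their downstream task performance, irrespective of the objectives and data modalities used to train them. Representational similarity is generally measured for individual datasets and is not necessarily consistent across datasets. One may therefore wonder whether this convergence of model representations is confounded by the datasets commonly used in machine learning. Here, we propose a systematic way to measure how representational similarity between models varies with the set of stimuli used to construct the representations. We find that the objective function is the most crucial factor in determining the consistency of representational similarities across datasets. Specifically, self-supervised vision models learn representations whose relative pairwise similarities generalize better from one dataset to another than those of image classification or image-text models. Moreover, the correspondence between representational similarities and the models' task behavior is dataset-dependent, being most strongly pronounced for single-domain datasets. Our work provides a framework for systematically measuring similarities of model representations across datasets and for linking those similarities to differences in task behavior.||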
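|**2024-11-08**|[WeatherGFM: Learning A Weather Generalist Foundation Model via In-context Learning](http://arxiv.org/abs/2411.05420)|null|The Earth's weather system encompasses intricate weather data modalities and diverse weather understanding tasks, which are of vital importance to human life. Existing data-driven models focus on single weather understanding tasks (for example, weather forecasting). Although these models have achieved promising results, they cannot handle a variety of complex tasks within a single, unified model. Moreover, the paradigm that relies on limited real observations for a single scenario hinders improvements in the model's performance upper bound. To address these limitations, we draw inspiration from the in-context learning paradigm used in state-of-the-art visual foundation models and large language models. In this paper, we introduce the first generalist weather foundation model (WeatherGFM), designed to address a wide spectrum of weather understanding tasks in a unified manner. More specifically, we first unify the representation and definition of the diverse weather understanding tasks. We then design weather prompt formats to manage different weather data modalities, namely single, multiple, and temporal modalities. Finally, we adopt a visual prompting question-answering paradigm to train the unified weather understanding tasks. Extensive experiments indicate that WeatherGFM can effectively handle up to ten weather understanding tasks, including weather forecasting, super-resolution, weather image translation, and post-processing. Our method also shows generalization ability on unseen tasks.||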
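|**2024-11-08**|[SimpleBEV: Improved LiDAR-Camera Fusion Architecture for 3D Object Detection](http://arxiv.org/abs/2411.05292)|null|More and more research fuses LiDAR and camera information to improve 3D object detection in autonomous driving systems. Recently, a simple yet effective fusion framework achieved excellent detection performance by fusing LiDAR and camera features in a unified bird's-eye-view (BEV) space. In this paper, we propose a LiDAR-camera fusion framework named SimpleBEV for accurate 3D object detection, which follows the BEV-based fusion framework and improves the camera and LiDAR encoders, respectively. Specifically, we perform camera-based depth estimation with a cascade network and rectify the depth results using depth information derived from the LiDAR points. Meanwhile, an auxiliary branch that performs 3D object detection using only camera-BEV features is introduced to fully exploit the camera information during the training phase. In addition, we improve the LiDAR feature extractor by fusing multi-scale sparse convolutional features. Experimental results demonstrate the effectiveness of the proposed method. Our method achieves 77.6% NDS on the nuScenes dataset, showing superior performance in the 3D object detection track.||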
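|**2024-11-07**|[Zero-Shot Temporal Resolution Domain Adaptation for Spiking Neural Networks](http://arxiv.org/abs/2411.04760)|null|Spiking neural networks (SNNs) are biologically inspired deep neural networks that efficiently extract temporal information while offering significant advantages in energy efficiency and latency when deployed on neuromorphic devices. However, SNN model parameters are sensitive to temporal resolution, leading to significant performance drops when the temporal resolution of the target data at the edge differs from that of the pre-deployment source data used for training, especially when fine-tuning is not possible at the edge. To address this challenge, we propose three novel domain adaptation methods for adapting neuron parameters to changes in temporal resolution without re-training on the target temporal resolution. The proposed methods are based on a mapping between neuron dynamics in SNNs and state space models (SSMs), and they are applicable to general neuron models. We evaluate the proposed methods on spatio-temporal data tasks, namely the audio keyword spotting datasets SHD and MSWC and the image classification dataset NMINST. Our methods provide an alternative to, and in the majority of cases significantly outperform, the existing reference method that simply scales the time constant. Moreover, our results show that high accuracy on high temporal resolution data can be obtained through time-efficient training on lower temporal resolution data followed by model adaptation.||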
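|**2024-11-07**|[ESC-MISR: Enhancing Spatial Correlations for Multi-Image Super-Resolution in Remote Sensing](http://arxiv.org/abs/2411.04706)|null|Multi-image super-resolution (MISR) is a crucial yet challenging research task in the remote sensing community. This paper addresses the problem of multi-image super-resolution in remote sensing (MISR-RS), which aims to generate a high-resolution (HR) image from multiple low-resolution (LR) images obtained by satellites. Recently, the weak temporal correlations among LR images have attracted increasing attention in the MISR-RS task. However, existing MISR methods treat the LR images as sequences with strong temporal correlations, overlooking spatial correlations and imposing temporal dependencies. To address this problem, we propose a novel end-to-end framework named Enhancing Spatial Correlations in MISR (ESC-MISR), which fully exploits the spatial-temporal relations of multiple images for HR image reconstruction. Specifically, we first introduce a novel fusion module named Multi-Image Spatial Transformer (MIST), which emphasizes the parts with clearer global spatial features and enhances the spatial correlations between LR images. In addition, we apply a random shuffle strategy to the sequential inputs of LR images to attenuate temporal dependencies and capture weak temporal correlations during training. Compared with state-of-the-art methods, ESC-MISR achieves cPSNR improvements of 0.70 dB and 0.76 dB on the two bands of the PROBA-V dataset, respectively, demonstrating the superiority of our method.||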
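|**2024-11-07**|[Is network fragmentation a useful complexity measure?](http://arxiv.org/abs/2411.04695)|null|It has been observed that the input space of deep neural network classifiers can exhibit "fragmentation", where the model function rapidly changes class as the input space is traversed. The severity of this fragmentation tends to follow the double descent curve, reaching its maximum in the interpolation regime. We study this phenomenon in the context of image classification and ask whether fragmentation can be predictive of generalization performance. Using a fragmentation-based complexity measure, we show that this is possible by achieving good performance on the PGDL (Predicting Generalization in Deep Learning) benchmark. In addition, we report new observations related to fragmentation, namely that (i) fragmentation is not limited to the input space but also occurs in the hidden representations, (ii) fragmentation follows the trends in validation error throughout training, and (iii) fragmentation is not a direct result of an increase in weight norms. Together, these observations indicate that fragmentation is a phenomenon worth investigating further when studying the generalization ability of deep neural networks.||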
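|**2024-11-07**|[On the Inherent Robustness of One-Stage Object Detection against Out-of-Distribution Data](http://arxiv.org/abs/2411.04586)|null|Robustness is a fundamental aspect of developing safe and trustworthy models, particularly when they are deployed in the open world. In this work, we analyze the inherent capability of one-stage object detectors to operate robustly in the presence of out-of-distribution (OoD) data. Specifically, we propose a novel detection algorithm for detecting unknown objects in image data, which leverages the features extracted by the model from each sample. Unlike other recent approaches in the literature, our proposal does not require retraining the object detector, thereby allowing the use of pretrained models. Our proposed OoD detector uses supervised dimensionality reduction techniques to mitigate the effects of the curse of dimensionality on the features extracted by the model. Furthermore, it uses high-resolution feature maps to identify potential unknown objects in an unsupervised fashion. Our experiments analyze the Pareto trade-off between the performance of detecting known and unknown objects that results from different algorithmic configurations and inference confidence thresholds. We also compare the performance of our proposed algorithm with that of logits-based post-hoc OoD methods, as well as possible fusion strategies. Finally, we discuss the competitiveness of all tested methods against state-of-the-art OoD approaches for object detection models on recently published unknown object detection benchmarks. The results verify that the performance of avant-garde post-hoc OoD detectors can be further improved when combined with our proposed algorithm.||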
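|**2024-11-07**|[Neural Fingerprints for Adversarial Attack Detection](http://arxiv.org/abs/2411.04533)|**[link](https://github.com/HaimFisher/fingerprints-armor)**|Deep learning models for image classification have become standard tools in recent years. A well-known vulnerability of these models is their susceptibility to adversarial examples. These are generated by slightly altering an image of a certain class in a way that is imperceptible to humans but causes the model to classify it wrongly as another class. Many algorithms have been proposed to address this problem, falling generally into two categories: (i) building robust classifiers and (ii) directly detecting attacked images. Despite the good performance of these detectors, we argue that in a white-box setting, where the attacker knows the configuration and weights of the network and the detector, they can overcome the detector by running many examples on a local copy and sending only those that were not detected to the actual model. This problem is common in security applications, where even a very good model is not sufficient to ensure safety. In this paper, we propose to overcome this inherent limitation of any static defense with randomization. To do so, one must generate a very large family of detectors with consistent performance and select one or more of them at random for each input. For the individual detectors, we suggest the method of neural fingerprints. In the training phase, for each class we repeatedly sample a tiny random subset of neurons from certain layers of the network; if their average differs sufficiently between clean and attacked images of the focal class, they are considered a fingerprint and added to the detector bank. At test time, we sample fingerprints from the bank associated with the label predicted by the model and detect attacks using a likelihood ratio test. We evaluate our detectors on ImageNet with different attack methods and model architectures, and show near-perfect detection with low rates of false detection.||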
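|**2024-11-07**|[UEVAVD: A Dataset for Developing UAV's Eye View Active Object Detection](http://arxiv.org/abs/2411.04348)|null|Occlusion is a long-standing difficulty in object detection based on unmanned aerial vehicles (UAVs). Many research efforts tackle this problem by adapting the detection model. However, few of them exploit the fact that a UAV can fundamentally improve detection performance by changing its viewpoint. Active object detection (AOD) offers an effective way to achieve this. Through deep reinforcement learning (DRL), AOD endows the UAV with the ability to plan its path autonomously in search of observation viewpoints that are more favorable for target identification. Unfortunately, no dataset is currently available for developing UAV AOD methods. To fill this gap, we release a UAV's eye view active vision dataset named UEVAVD and hope that it will facilitate research on the UAV AOD problem. Additionally, we improve the existing DRL-based AOD method by incorporating inductive bias when learning the state representation. First, because of partial observability, we use a gated recurrent unit to extract the state representation from the observation sequence instead of a single-view observation. Second, we pre-decompose the scene with the Segment Anything Model (SAM) and filter out irrelevant information using the derived masks. With these practices, the agent can learn an active viewing policy with better generalization capability. Experiments on the UEVAVD dataset validate the effectiveness of our improvements. Our dataset will soon be available at https://github.com/Leo000ooo/UEVAVD_dataset.||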
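|**2024-11-07**|[GazeGen: Gaze-Driven User Interaction for Visual Content Generation](http://arxiv.org/abs/2411.04335)|null|We present GazeGen, a user interaction system that generates visual content (images and videos) for locations indicated by the user's eye gaze. GazeGen allows intuitive manipulation of visual content by targeting regions of interest with gaze. Using advanced object detection and generative AI techniques, GazeGen performs gaze-controlled image object addition/deletion, repositioning, and surface material changes, and converts static images into videos. Central to GazeGen is the DFT Gaze (Distilled and Fine-Tuned Gaze) agent, an ultra-lightweight model with only 281K parameters that performs accurate real-time gaze predictions tailored to individual users' eyes on small edge devices. GazeGen is the first system to combine visual content generation with real-time gaze estimation, made possible exclusively by DFT Gaze. This real-time gaze estimation enables various visual content generation tasks, all controlled by the user's gaze. The input for DFT Gaze is the user's eye images, while the inputs for visual content generation are the user's view and the gaze point predicted by DFT Gaze. To achieve efficient gaze predictions, we derive the small model from a large model (10x larger) via novel knowledge distillation and personal adaptation techniques. We integrate knowledge distillation with a masked autoencoder to develop a compact yet powerful gaze estimation model. This model is further fine-tuned with Adapters, enabling highly accurate and personalized gaze predictions with minimal user input. DFT Gaze ensures low-latency and precise gaze tracking, supporting a wide range of gaze-driven tasks. We validate the performance of DFT Gaze on the AEA and OpenEDS2020 benchmarks, demonstrating low angular gaze error and low latency on an edge device (Raspberry Pi 4). Furthermore, we describe applications of GazeGen, illustrating its versatility and effectiveness in various usage scenarios.||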
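|**2024-11-06**|[Multimodal Structure-Aware Quantum Data Processing](http://arxiv.org/abs/2411.04242)|**[link](https://github.com/halaa901/qnlp-thesis)**|While large language models (LLMs) have advanced the field of natural language processing (NLP), their "black box" nature obscures their decision-making processes. To address this, researchers developed structured approaches using higher-order tensors. These approaches can model linguistic relations, but they stall when trained on classical computers because of their excessive size. Tensors are natural inhabitants of quantum systems, and training on quantum computers offers a solution by translating text into variational quantum circuits. In this paper, we develop MultiQ-NLP: a framework for structure-aware data processing with multimodal text+image data. Here, "structure" refers to syntactic and grammatical relationships in language, as well as the hierarchical organization of visual elements in images. We enrich the translation with new types and type homomorphisms and develop novel architectures to represent structure. When tested on a mainstream image classification task (SVO Probes), our best model performed on par with state-of-the-art classical models; moreover, the best model was fully structured.||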
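|**2024-11-06**|[RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models](http://arxiv.org/abs/2411.04097)|**[link](https://github.com/stanford-aimi/ravl)**|Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image level rather than intervening directly on fine-grained image features, and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify the precise image features that contribute to zero-shot classification errors. RaVL then mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement in worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.||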
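|**2024-11-06**|[Overcoming label shift in targeted federated learning](http://arxiv.org/abs/2411.03799)|null|Federated learning enables multiple actors to collaboratively train models without sharing private data. This unlocks the potential for scaling machine learning to diverse applications. Existing algorithms are justified when clients and the intended target domain share the same distribution of features and labels, but this assumption is often violated in real-world scenarios. One common violation is label shift, where the label distributions differ across clients or between clients and the target domain, which can significantly degrade model performance. To address this problem, we propose FedPALS, a novel model aggregation scheme that adapts to label shift by leveraging knowledge of the target label distribution at the central server. Our approach ensures unbiased updates under stochastic gradient descent, ensuring robust generalization across clients with diverse, label-shifted data. Extensive experiments on image classification show that FedPALS consistently outperforms standard baselines by aligning model aggregation with the target domain. Our findings reveal that conventional federated learning methods suffer severely in cases of extreme client sparsity, highlighting the critical need for target-aware aggregation. FedPALS offers a principled and practical solution for mitigating label distribution mismatch, ensuring that models trained in federated settings can generalize effectively to label-shifted target domains.||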
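|**2024-11-05**|[CRT-Fusion: Camera, Radar, Temporal Fusion Using Motion Information for 3D Object Detection](http://arxiv.org/abs/2411.03013)|**[link](https://github.com/mjseong0414/CRT-Fusion)**|Accurate and robust 3D object detection is a critical component of autonomous vehicles and robotics. Although recent radar-camera fusion methods have made significant progress by fusing information in the bird's-eye view (BEV) representation, they often struggle to capture the motion of dynamic objects effectively, leading to limited performance in real-world scenarios. In this paper, we introduce CRT-Fusion, a novel framework that integrates temporal information into radar-camera fusion to address this challenge. Our approach comprises three key modules: Multi-View Fusion (MVF), Motion Feature Estimator (MFE), and Motion Guided Temporal Fusion (MGTF). The MVF module fuses radar and image features within both the camera view and the bird's-eye view, thereby generating a more precise unified BEV representation. The MFE module performs two tasks simultaneously: estimation of pixel-wise velocity information and BEV segmentation. Based on the velocity and occupancy score maps obtained from the MFE module, the MGTF module aligns and fuses feature maps across multiple timestamps in a recurrent manner. By accounting for the motion of dynamic objects, CRT-Fusion produces robust BEV feature maps, thereby improving detection accuracy and robustness. Extensive evaluations on the challenging nuScenes dataset show that CRT-Fusion achieves state-of-the-art performance for radar-camera-based 3D object detection. Our approach outperforms the previous best method by 1.7% in NDS, while also surpassing the leading approach by 1.4% in mAP. These significant improvements in both metrics demonstrate the effectiveness of the proposed fusion strategy in enhancing the reliability and accuracy of 3D object detection.||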
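|**2024-11-05**|[Domain Expansion and Boundary Growth for Open-Set Single-Source Domain Generalization](http://arxiv.org/abs/2411.02920)|null|Open-set single-source domain generalization aims to learn a robust model from a single source domain that generalizes to unknown target domains with both domain shift and label shift. The scarcity of source-domain data and the unknown data distribution of the target domain pose great challenges for domain-invariant feature learning and unknown-class recognition. In this paper, we propose a novel learning approach based on domain expansion and boundary growth, which expands the scarce source samples and enlarges the boundaries across known classes, thereby indirectly broadening the boundary between known and unknown classes. Specifically, we achieve domain expansion by applying background suppression and style augmentation to the source data to synthesize new samples. We then force the model to distill consistent knowledge from the synthesized samples so that it learns domain-invariant information. In addition, we realize boundary growth across classes by using edge maps as an additional modality of samples when training multi-binary classifiers. This enlarges the boundary between inliers and outliers and thus improves unknown-class recognition during open-set generalization. Extensive experiments show that our approach achieves significant improvements and reaches state-of-the-art performance on several cross-domain image classification datasets.|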
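|**2024-11-05**|[Applications of Automatic Differentiation in Image Registration](http://arxiv.org/abs/2411.02806)|**[link](https://github.com/wdwatson2/ImgRegPytorchProject)**|We demonstrate that automatic differentiation, which has become commonly available in machine learning frameworks, is an efficient way to explore ideas for improving algorithms for multi-scale affine image registration and affine super-resolution problems. In our first experiment, on multi-scale registration, we implement an ODE predictor-corrector method involving a derivative with respect to the scale parameter and the Hessian of an image registration objective function, both of which would be difficult to compute without automatic differentiation. Our findings indicate that an exact Hessian is necessary for the method to improve over a traditional multi-scale method; a Gauss-Newton Hessian approximation fails to provide such an improvement. In our second experiment, we implement a variable projected Gauss-Newton method for super-resolution and use automatic differentiation to differentiate through the iteratively computed projection, an approach not previously addressed in the literature. We show that Jacobians obtained without differentiating through the projection are poor approximations of the true Jacobian of the variable projected forward map, and we explore the performance of some other approximations. By addressing these problems, this work contributes to the application of automatic differentiation in image registration and sets a precedent for further use of machine learning tools in this field.|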
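|**2024-11-05**|[ERUP-YOLO: Enhancing Object Detection Robustness for Adverse Weather Condition by Unified Image-Adaptive Processing](http://arxiv.org/abs/2411.02799)|null|We propose an image-adaptive object detection method for adverse weather conditions such as fog and low light. Our framework uses differentiable preprocessing filters to perform image enhancement suited to the subsequent object detection stage. It introduces two differentiable filters: a Bézier curve-based pixel-wise (BPW) filter and a kernel-based local (KBL) filter. These filters unify the functions of classical image processing filters and improve object detection performance. We also propose a domain-agnostic data augmentation strategy using the BPW filter. Our method does not require data-specific customization of filter combinations, parameter ranges, or data augmentation. We evaluate the proposed approach, called Enhanced Robustness by Unified Image Processing (ERUP)-YOLO, by applying it to the YOLOv3 detector. Experiments on adverse weather datasets show that our proposed filters match or exceed the expressiveness of conventional methods and that ERUP-YOLO achieves superior performance across a range of adverse weather conditions, including fog and low light.|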
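|**2024-11-05**|[Efficient Feature Aggregation and Scale-Aware Regression for Monocular 3D Object Detection](http://arxiv.org/abs/2411.02747)|**[link](https://github.com/WYFDUT/MonoASRH)**|Monocular 3D object detection has attracted considerable attention for its simplicity and low cost. Existing methods typically follow the conventional 2D detection paradigm: they first locate object centers and then predict 3D attributes from neighboring features. However, these methods rely mainly on progressive cross-scale feature aggregation and focus only on local information, which can lead to a lack of global awareness and missed small-scale objects. In addition, because object scales vary widely across scenes and depths, inaccurate receptive fields often cause background noise and degraded feature representations. To address these issues, we introduce MonoASRH, a novel monocular 3D detection framework composed of an Efficient Hybrid Feature Aggregation Module (EH-FAM) and an Adaptive Scale-Aware 3D Regression Head (ASRH). Specifically, EH-FAM uses multi-head attention with a global receptive field to extract semantic features for small-scale objects and uses lightweight convolutional modules to efficiently aggregate visual features across scales. ASRH encodes 2D bounding box dimensions and then fuses the scale features with the semantic features aggregated by EH-FAM through a scale-semantic feature fusion module. This module guides ASRH in learning dynamic receptive field offsets, incorporating scale priors into 3D position prediction for better scale awareness. Extensive experiments on the KITTI and Waymo datasets show that MonoASRH achieves state-of-the-art performance.|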
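|**2024-11-05**|[Integrated lithium niobate photonic computing circuit based on efficient and high-speed electro-optic conversion](http://arxiv.org/abs/2411.02734)|null|Here we demonstrate a photonic computing accelerator that uses a system-level thin-film lithium niobate circuit and overcomes this limitation. Leveraging the strong electro-optic (Pockels) effect and the scalability of this platform, we demonstrate photonic computation speeds of up to 1.36 TOPS while consuming only 0.057 pJ/OP. Our system features more than 100 high-performance thin-film lithium niobate components working synergistically, surpassing state-of-the-art systems on this platform. We further demonstrate binary classification, handwritten digit classification, and image classification with remarkable accuracy, showcasing the system's ability to execute real algorithms. Finally, we investigate the possibility of combining our system with a hybrid-integrated distributed-feedback laser source and a heterogeneous-integrated modified uni-traveling carrier photodiode. Our results illustrate the promise of thin-film lithium niobate as a computational platform that addresses current bottlenecks in both electronic and photonic computation. Its unique properties of high-performance electro-optic weight encoding and conversion, wafer-scale scalability, and compatibility with integrated lasers and detectors position thin-film lithium niobate photonics as a valuable complement to silicon photonics, with extensions to applications such as ultrafast and power-efficient signal processing and ranging.|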
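|**2024-11-04**|[Intelligent Video Recording Optimization using Activity Detection for Surveillance Systems](http://arxiv.org/abs/2411.02632)|null|Surveillance systems often struggle to manage large volumes of footage, much of it irrelevant, which leads to inefficient storage and difficult event retrieval. This paper addresses these issues with an optimized video recording solution focused on activity detection. The solution uses a hybrid approach that combines frame-subtraction-based motion detection with object detection using YOLOv9. The strategy records only scenes involving human or car activity, reducing unnecessary footage and optimizing storage usage. The developed model shows strong performance, achieving a precision of 0.855 for car detection and 0.884 for person detection, and reduces storage requirements by two-thirds compared with traditional surveillance systems that rely on motion detection alone. This substantial reduction in storage highlights the solution's effectiveness in improving surveillance system efficiency. Nevertheless, some limitations remain, in particular false positives and false negatives under adverse weather conditions such as strong winds.|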
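|**2024-11-04**|[MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D](http://arxiv.org/abs/2411.02336)|null|Texturing is a crucial step in the 3D asset production workflow that enhances the visual appeal and diversity of 3D assets. Despite recent advances in Text-to-Texture (T2T) generation, existing methods often yield subpar results, mainly because of local discontinuities, inconsistencies across views, and their heavy dependence on UV unwrapping outcomes. To tackle these challenges, we propose MVPaint, a novel generation-refinement 3D texturing framework that generates high-resolution, seamless textures while emphasizing multi-view consistency. MVPaint consists of three key modules. 1) Synchronized Multi-view Generation (SMG): given a 3D mesh model, MVPaint first generates multi-view images simultaneously with an SMG model, which yields coarse texturing results with unpainted parts caused by missing observations. 2) Spatial-aware 3D Inpainting (S3I): to ensure complete 3D texturing, we introduce the S3I method, designed specifically to texture previously unobserved areas effectively. 3) UV Refinement (UVR): MVPaint further employs a UVR module to improve texture quality in UV space; it first performs UV-space super-resolution and then applies a spatial-aware seam-smoothing algorithm to correct spatial texturing discontinuities caused by UV unwrapping. In addition, we establish two T2T evaluation benchmarks, the Objaverse T2T benchmark and the GSO T2T benchmark, based on selected high-quality 3D meshes from the Objaverse dataset and on the entire GSO dataset, respectively. Extensive experimental results show that MVPaint surpasses existing state-of-the-art methods. Notably, MVPaint generates high-fidelity textures with minimal Janus issues and markedly improved cross-view consistency.|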
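|**2024-11-04**|[Toward Integrating Semantic-aware Path Planning and Reliable Localization for UAV Operations](http://arxiv.org/abs/2411.01816)|null|Localization is one of the most crucial tasks for unmanned aerial vehicle (UAV) systems and directly affects overall performance. It can be achieved with various sensors and applies to numerous tasks related to search and rescue operations, object tracking, construction, and more. However, challenging environments can cause a UAV to lose the signals it uses for localization. In this paper, we present an effective path-planning system that leverages semantic segmentation information from a monocular camera to navigate around texture-less and problematic areas such as lakes, oceans, and high-rise buildings. We introduce a real-time semantic segmentation architecture and a novel keyframe decision pipeline that optimizes image inputs based on pixel distribution, reducing processing time. A hierarchical planner based on the Dynamic Window Approach (DWA) algorithm, integrated with a cost map, is designed to facilitate efficient path planning. The system is implemented in a photo-realistic simulation environment built with Unity and aligned with the segmentation model parameters. Comprehensive qualitative and quantitative evaluations validate the effectiveness of our approach, showing significant improvements in the reliability and efficiency of UAV localization in challenging environments.|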
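|**2024-11-04**|[ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model](http://arxiv.org/abs/2411.01756)|null|Visual object tracking aims to locate a target object in a video sequence given an initial bounding box. Recently, vision-language (VL) trackers have proposed using additional natural language descriptions to improve versatility in various applications. However, VL trackers still lag behind state-of-the-art (SoTA) visual trackers in tracking performance. We find that this gap stems mainly from their heavy reliance on manual textual annotations, which frequently provide ambiguous language descriptions. In this paper, we propose ChatTracker, which leverages the rich knowledge in a Multimodal Large Language Model (MLLM) to generate high-quality language descriptions and improve tracking performance. To this end, we propose a novel reflection-based prompt optimization module that iteratively refines ambiguous and inaccurate target descriptions using tracking feedback. To further exploit the semantic information produced by the MLLM, we propose a simple yet effective VL tracking framework that can be easily integrated as a plug-and-play module into both VL and visual trackers to boost their performance. Experimental results show that ChatTracker achieves performance comparable to existing methods.|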
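|**2024-10-31**|[DiffPAD: Denoising Diffusion-based Adversarial Patch Decontamination](http://arxiv.org/abs/2410.24006)|**[link](https://github.com/jasonfu1998/diffpad)**|In the ever-evolving field of adversarial machine learning, developing effective defenses against patch attacks has become a critical challenge that calls for reliable solutions to safeguard real-world AI systems. Although diffusion models have shown remarkable capacity for image synthesis and have recently been used to counter $\ell_p$-norm bounded attacks, their potential for mitigating localized patch attacks remains largely underexplored. In this work, we propose DiffPAD, a novel framework that harnesses diffusion models for adversarial patch decontamination. DiffPAD first performs super-resolution restoration on downsampled input images, then applies binarization, a dynamic thresholding scheme, and a sliding window to localize adversarial patches effectively. This design is inspired by a theoretically derived correlation between patch size and diffusion restoration error that generalizes across diverse patch attack scenarios. Finally, DiffPAD applies inpainting to the original input images with the estimated patch region masked. By integrating closed-form solutions for super-resolution restoration and image inpainting into the conditional reverse sampling process of a pre-trained diffusion model, DiffPAD removes the need for text guidance or fine-tuning. Through comprehensive experiments, we show that DiffPAD not only achieves state-of-the-art adversarial robustness against patch attacks but also excels at recovering naturalistic images without patch remnants.|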
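|**2024-10-31**|[ImOV3D: Learning Open-Vocabulary Point Clouds 3D Object Detection from Only 2D Images](http://arxiv.org/abs/2410.24001)|**[link](https://github.com/yangtiming/imod)**|Open-vocabulary 3D object detection (OV-3Det) aims to generalize beyond the limited number of base categories labeled during training. The biggest bottleneck is the scarcity of annotated 3D data, whereas 2D image datasets are abundant and richly annotated. It is therefore intuitive to leverage the wealth of 2D image annotations to alleviate the inherent data scarcity in OV-3Det. In this paper, we push the task setup to its limits by exploring the potential of learning OV-3Det from 2D images only. The main challenge in this setup is the modality gap between training images and testing point clouds, which prevents effective integration of 2D knowledge into OV-3Det. To address this challenge, we propose a novel framework, ImOV3D, that uses a pseudo multimodal representation containing both images and point clouds (PC) to close the modality gap. The key to ImOV3D is flexible modality conversion: 2D images can be lifted to 3D using monocular depth estimation and can also be derived from 3D scenes through rendering. This unifies training images and testing point clouds into a common image-PC representation that contains both rich 2D semantic information and the depth and structural characteristics of 3D spatial data. We carefully perform this conversion to minimize the domain gap between training and test cases. Extensive experiments on two benchmark datasets, SUNRGBD and ScanNet, show that ImOV3D significantly outperforms existing methods even without ground-truth 3D training data. With a small amount of real 3D data included for fine-tuning, its performance also far exceeds the previous state of the art. Code and pre-trained models are released at https://github.com/yangtiming/ImOV3D.|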
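|**2024-10-31**|[Uncertainty Estimation for 3D Object Detection via Evidential Learning](http://arxiv.org/abs/2410.23910)|null|3D object detection is an essential task for computer vision applications in autonomous driving and robotics. However, models often struggle to quantify detection reliability, leading to poor performance in unfamiliar scenes. We introduce a framework for quantifying uncertainty in 3D object detection by leveraging an evidential learning loss on bird's-eye view representations in the 3D detector. These uncertainty estimates require minimal computational overhead and generalize across different architectures. We demonstrate both the efficacy and the importance of these uncertainty estimates for identifying out-of-distribution scenes, poorly localized objects, and missing (false negative) detections; our framework improves over the baselines by 10-20% on average. Finally, we integrate this suite of tasks into a system in which a 3D object detector auto-labels driving scenes and our uncertainty estimates verify label correctness before the labels are used to train a second model. Here, our uncertainty-driven verification results in a 1% improvement in mAP and a 1-2% improvement in NDS.|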
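|**2024-10-31**|[From Web Data to Real Fields: Low-Cost Unsupervised Domain Adaptation for Agricultural Robots](http://arxiv.org/abs/2410.23906)|null|In precision agriculture, vision models often struggle with new, unseen fields where crops and weeds have been influenced by external factors, causing their composition and appearance to differ from the learned distribution. This paper aims to adapt to specific fields at low cost using Unsupervised Domain Adaptation (UDA). We explore a novel domain shift from a diverse, large pool of internet-sourced data to a small set of data collected by a robot at specific locations, minimizing the need for extensive in-field data collection. In addition, we introduce a novel module, the Multi-level Attention-based Adversarial Discriminator (MAAD), which can be integrated at the feature extractor level of any detection model. In this study, we incorporate MAAD into CenterNet to simultaneously detect leaf, stem, and vein instances. Our results show significant performance improvements on the unlabeled target domain compared with baseline models, with a 7.5% increase in object detection accuracy and a 5.1% improvement in keypoint detection.|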
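|**2024-10-31**|[Open-Set 3D object detection in LiDAR data as an Out-of-Distribution problem](http://arxiv.org/abs/2410.23767)|null|3D object detection from LiDAR data has reached industry-ready performance in controlled environments through advanced deep learning methods. However, these neural network models are limited to a finite set of inlier object categories. Our work reframes open-set 3D object detection in LiDAR data as an out-of-distribution (OOD) problem to detect outlier objects. Compared with traditional object detection, this approach brings additional information. We establish a comparative benchmark and show that two-stage OOD methods, notably autolabelling, give promising results for 3D OOD object detection. Our contributions include setting up a rigorous evaluation protocol by examining the evaluation of hyperparameters and by evaluating strategies for generating additional data to train an OOD-aware 3D object detector. This comprehensive analysis is essential for developing robust 3D object detection systems that perform reliably in diverse and unpredictable real-world scenarios.|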
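|**2024-10-31**|[Context-Aware Token Selection and Packing for Enhanced Vision Transformer](http://arxiv.org/abs/2410.23608)|null|In recent years, the long-range attention mechanism of vision transformers has driven significant performance breakthroughs across various computer vision tasks. However, traditional self-attention must process both informative and non-informative tokens, which is inefficient and inaccurate. Sparse attention mechanisms have been introduced to mitigate these issues by pruning the tokens involved in attention, but they often lack context awareness and intelligence. These mechanisms frequently apply a uniform token selection strategy across different inputs for batch training or optimize efficiency only for the inference stage. To overcome these challenges, we propose a novel algorithm: Select and Pack Attention (SPA). SPA dynamically selects informative tokens using a low-cost gating layer supervised by selection labels and packs these tokens into new batches, allowing a variable number of tokens in parallelized GPU batch training and inference. Extensive experiments across diverse datasets and computer vision tasks show that SPA delivers superior performance and efficiency, including a 0.6 mAP improvement in object detection and a 16.4% reduction in computational cost.|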
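|**2024-10-31**|[QUEST-A: Untrained Filtering with Trained Focusing led to Enhanced Quantum Architectures](http://arxiv.org/abs/2410.23560)|**[link](https://github.com/uestc-ylh/quest-a)**|Quantum architecture search (QAS) is a fundamental challenge in quantum machine learning, and current state-of-the-art methods fall mainly into training-free and gradient-guided categories. However, treating QAS solely as either a discrete pruning process or a continuous optimization problem fails to balance accuracy and efficiency. This work decomposes QAS into two alternately solved subproblems: optimal circuit structure retrieval and parameter optimization. Based on this insight, we propose Quantum Untrained-Explored Synergistic Trained Architecture (QUEST-A), which performs rapid architecture pruning using inherent circuit properties and focused optimization using a parameter reuse strategy. QUEST-A unifies discrete structure search and continuous parameter optimization within an evolutionary framework that integrates fast pruning with fine-grained optimization. Experiments show that QUEST-A outperforms existing methods: it enhances model expressivity in signal representation, maintains high performance across varying complexities in image classification, and achieves order-of-magnitude precision improvements in variational quantum eigensolver tasks. These results validate the effectiveness of QUEST-A and provide transferable methodologies for QAS.|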
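|**2024-10-30**|[Multilingual Vision-Language Pre-training for the Remote Sensing Domain](http://arxiv.org/abs/2410.23370)|**[link](https://github.com/DannielSilva/RS-M-CLIP)**|Methods based on Contrastive Language-Image Pre-training (CLIP) are now widely used to support vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. Adapting CLIP to this specific domain has relied on fine-tuning with the standard contrastive objective, using either existing human-labeled image-caption datasets or synthetic data consisting of image-caption pairs derived from other annotations over remote sensing images (e.g., object classes). Different pre-training mechanisms have received less attention, and only a few exceptions have considered multilingual inputs. This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model and testing a self-supervised method based on aligning local and global representations from individual input images, together with the standard CLIP objective. Model training relied on assembling pre-existing datasets of remote sensing images paired with English captions, followed by automated machine translation into nine additional languages. We show that the translated data is indeed helpful, for example improving performance on English as well. The resulting model, which we name Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks, including cross-modal and multilingual image-text retrieval and zero-shot image classification.|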
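|**2024-10-30**|[CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP](http://arxiv.org/abs/2410.23330)|null|Machine unlearning (MU) has gained significant attention as a way to remove specific data from a trained model without full retraining. Although progress has been made in unimodal domains such as text and image classification, unlearning in multimodal models remains relatively underexplored. This work addresses the unique challenges of unlearning in CLIP, a prominent multimodal model that aligns visual and textual representations. We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations, ensuring that unlearning does not compromise model performance. CLIPErase consists of three key modules: a Forgetting Module that disrupts the associations in the forget set, a Retention Module that preserves performance on the retain set, and a Consistency Module that maintains consistency with the original model. Extensive experiments on the CIFAR-100 and Flickr30K datasets across four CLIP downstream tasks show that CLIPErase effectively forgets designated associations in zero-shot tasks for multimodal samples while preserving the model's performance on the retain set after unlearning.|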
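|**2024-10-30**|[EMMA: End-to-End Multimodal Model for Autonomous Driving](http://arxiv.org/abs/2410.23262)|null|We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multimodal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the use of world knowledge from pre-trained large language models by representing all non-sensor inputs (e.g., navigation instructions and ego vehicle status) and outputs (e.g., trajectories and 3D locations) as natural language text. This approach lets EMMA jointly process various driving tasks in a unified language space and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes and competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA on planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also has certain limitations: it can process only a small number of image frames, does not incorporate accurate 3D sensing modalities such as LiDAR or radar, and is computationally expensive. We hope our results will inspire further research to mitigate these issues and to further advance the state of the art in autonomous driving model architectures.|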
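|**2024-10-29**|[Active Learning for Vision-Language Models](http://arxiv.org/abs/2410.22187)|null|Pre-trained vision-language models (VLMs) such as CLIP have shown impressive zero-shot performance on a wide range of downstream computer vision tasks. However, a considerable performance gap remains between these models and supervised deep models trained on a downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation. To achieve this, our approach first calibrates the predicted entropy of the VLM and then combines self-uncertainty and neighbor-aware uncertainty to compute a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets and significantly enhances the zero-shot performance of VLMs.|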
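|**2024-10-29**|[Lighten CARAFE: Dynamic Lightweight Upsampling with Guided Reassemble Kernels](http://arxiv.org/abs/2410.22139)|**[link](https://github.com/fu0511/dynamic-lightweight-upsampling)**|As a fundamental operation in modern machine vision models, feature upsampling has been widely used and studied in the literature. An ideal upsampling operation should be lightweight, with low computational complexity. That is, it should not only improve overall performance but also leave model complexity unaffected. Content-aware Reassembly of Features (CARAFE) is a well-designed learnable operation for feature upsampling. Despite its encouraging performance, this method requires generating large-scale kernels, which brings a large number of extra redundant parameters and inherently limits scalability. To this end, we propose a lightweight upsampling operation in this paper, termed Dynamic Lightweight Upsampling (DLU). Specifically, it first constructs a small-scale source kernel space and then samples the large-scale kernels from that space by introducing learnable guidance offsets, thereby avoiding a large collection of trainable parameters in upsampling. Experiments on several mainstream vision tasks show that DLU achieves performance comparable to or even better than the original CARAFE at much lower complexity; for example, at 16x upsampling DLU has 91% fewer parameters and at least 63% fewer FLOPs (floating point operations) than CARAFE, yet outperforms CARAFE by 0.3% mAP in object detection. Code is available at https://github.com/Fu0511/Dynamic-Lightweight-Upsampling.|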
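|**2024-10-29**|[Data Generation for Hardware-Friendly Post-Training Quantization](http://arxiv.org/abs/2410.22110)|**[link](https://github.com/sony/model_optimization)**|Zero-shot quantization (ZSQ) using synthetic data is a key approach for post-training quantization (PTQ) under privacy and security constraints. However, existing data generation methods often struggle to generate data suitable for hardware-friendly quantization, where all model layers are quantized. We analyze existing data generation methods based on batch normalization (BN) matching and identify several gaps between synthetic and real data: 1) current generation algorithms do not optimize the entire synthetic dataset simultaneously; 2) data augmentations applied during training are often overlooked; and 3) a distribution shift occurs in the final model layers because those layers lack BN. These gaps negatively affect ZSQ performance, particularly in hardware-friendly quantization scenarios. In this work, we propose Data Generation for Hardware-friendly quantization (DGH), a novel method that addresses these gaps. DGH jointly optimizes all generated images, regardless of image set size or GPU memory constraints. To address the data augmentation mismatch, DGH includes a preprocessing stage that mimics the augmentation process and improves image quality by incorporating natural image priors. Finally, we propose a new distribution-stretching loss that aligns the support of the feature map distribution between real and synthetic data. This loss is applied to the model's output and can be adapted to various tasks. DGH shows significant improvements in quantization performance across multiple tasks, achieving up to a 30% increase in accuracy for hardware-friendly ZSQ in both classification and object detection, often performing on par with real data.|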
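|**2024-10-29**|[FakeFormer: Efficient Vulnerability-Driven Transformers for Generalisable Deepfake Detection](http://arxiv.org/abs/2410.21964)|null|Recently, Vision Transformers (ViTs) have achieved unprecedented effectiveness in the general domain of image classification. Nonetheless, these models remain underexplored in deepfake detection, given their lower performance compared with Convolutional Neural Networks (CNNs) in that specific context. In this paper, we start by investigating why plain ViT architectures perform suboptimally on face forgery detection. Our analysis reveals that, compared with CNNs, ViTs struggle to model the localized forgery artifacts that typically characterize deepfakes. Based on this observation, we propose a deepfake detection framework called FakeFormer, which extends ViTs to strengthen the extraction of subtle inconsistency-prone information. To this end, we introduce an explicit attention learning mechanism guided by artifact-vulnerable patches and tailored to ViTs. We conduct extensive experiments on several well-known benchmark datasets, including FF++, Celeb-DF, WildDeepfake, DFD, DFDCP, and DFDC. The results show that FakeFormer outperforms the state of the art in both generalization and computational cost, without requiring large-scale training datasets. The code is available at https://github.com/10Ring/FakeFormer.|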
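|**2024-10-29**|[Cognitive Semantic Augmentation LEO Satellite Networks for Earth Observation](http://arxiv.org/abs/2410.21916)|null|Earth observation (EO) systems are essential for mapping, catastrophe monitoring, and resource management, but they have trouble processing and transmitting large volumes of EO data efficiently, especially for specialized applications such as agriculture and real-time disaster response. This paper presents a novel framework for semantic communication in EO satellite networks, aimed at improving data transmission efficiency and system performance through cognitive processing techniques. The proposed system uses Discrete Task-Oriented Joint Source-Channel Coding (DT-JSCC) and Semantic Data Augmentation (SA) to integrate cognitive semantic processing with inter-satellite links, enabling efficient analysis and transmission of multispectral imagery for improved object detection, pattern recognition, and real-time decision-making. Cognitive Semantic Augmentation (CSA) is introduced to enhance the system's capability to process and transmit semantic information, improving feature prioritization, consistency, and adaptation to changing communication and application needs. The end-to-end architecture is designed for next-generation satellite networks, such as those supporting 6G, and demonstrates significant improvements over federated learning, with fewer communication rounds and higher accuracy.|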
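|**2024-10-29**|[Bayesian Optimization for Hyperparameters Tuning in Neural Networks](http://arxiv.org/abs/2410.21886)|null|This study explores the application of Bayesian Optimization (BO) to hyperparameter tuning of neural networks, specifically targeting the performance of Convolutional Neural Networks (CNNs) on image classification tasks. Bayesian Optimization is a derivative-free global optimization method suited to expensive black-box functions with continuous inputs and a limited evaluation budget. The BO algorithm leverages Gaussian process regression and acquisition functions such as Upper Confidence Bound (UCB) and Expected Improvement (EI) to identify optimal configurations effectively. Using the Ax and BoTorch frameworks, this work demonstrates the efficiency of BO in reducing the number of hyperparameter tuning trials while achieving competitive model performance. Experimental results show that BO effectively balances exploration and exploitation and converges rapidly toward optimal settings for CNN architectures. This approach underlines the potential of BO for automating neural network tuning, contributing to improved accuracy and computational efficiency in machine learning pipelines.|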
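|**2024-10-29**|[PK-YOLO: Pretrained Knowledge Guided YOLO for Brain Tumor Detection in Multiplanar MRI Slices](http://arxiv.org/abs/2410.21822)|**[link](https://github.com/mkang315/pk-yolo)**|Brain tumor detection in multiplanar Magnetic Resonance Imaging (MRI) slices is a challenging task because of the varied appearances and relationships in the structure of multiplanar images. In this paper, we propose a new You Only Look Once (YOLO)-based detection model that incorporates Pretrained Knowledge (PK), called PK-YOLO, to improve brain tumor detection in multiplanar MRI slices. To the best of our knowledge, PK-YOLO is the first pretrained-knowledge-guided YOLO-based object detector. The main components of the new method are a pure lightweight convolutional neural network backbone pretrained via sparse masked modeling, a YOLO architecture with the pretrained backbone, and a regression loss function for improving small object detection. The pretrained backbone allows feature transfer of object queries on individual plane MRI slices into the model encoders, and the learned domain knowledge base can improve in-domain detection. The improved loss function further boosts detection performance on small brain tumors in multiplanar two-dimensional MRI slices. Experimental results show that PK-YOLO achieves competitive performance on multiplanar MRI brain tumor detection datasets compared with state-of-the-art YOLO-like and DETR-like object detectors. The code is available at https://github.com/mkang315/PK-YOLO.|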
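|**2024-10-28**|[MVSDet: Multi-View Indoor 3D Object Detection via Efficient Plane Sweeps](http://arxiv.org/abs/2410.21566)|**[link](https://github.com/pixie8888/mvsdet)**|The key challenge of multi-view indoor 3D object detection is inferring accurate geometry from images for precise 3D detection. Previous methods rely on neural radiance fields (NeRF) for geometry reasoning. However, the geometry extracted from NeRF is generally inaccurate, which leads to suboptimal detection performance. In this paper, we propose MVSDet, which uses plane sweeps for geometry-aware 3D object detection. To circumvent the need for a large number of depth planes for accurate depth prediction, we design a probabilistic sampling and soft weighting mechanism to decide the placement of pixel features on the 3D volume. For each pixel, we select the multiple locations that score highest in the probability volume and use their probability scores to indicate confidence. We further apply recent pixel-aligned Gaussian Splatting to regularize depth prediction and improve detection performance with little computational overhead. Extensive experiments on the ScanNet and ARKitScenes datasets demonstrate the superiority of our model. Our code is available at https://github.com/Pixie8888/MVSDet.|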
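|**2024-10-28**|[TACO: Adversarial Camouflage Optimization on Trucks to Fool Object Detectors](http://arxiv.org/abs/2410.21443)|null|Adversarial attacks threaten the reliability of machine learning models in critical applications such as autonomous vehicles and defense systems. As object detectors become more robust with models like YOLOv8, developing effective adversarial methods is increasingly challenging. We present Truck Adversarial Camouflage Optimization (TACO), a novel framework that generates adversarial camouflage patterns on 3D vehicle models to deceive state-of-the-art object detectors. Using Unreal Engine 5, TACO integrates differentiable rendering with a photorealistic rendering network to optimize adversarial textures targeted at YOLOv8. To ensure that the generated textures are both effective at deceiving detectors and visually plausible, we introduce the Convolutional Smooth Loss function, a generalized smooth loss function. Experimental evaluations show that TACO significantly degrades YOLOv8's detection performance, achieving an AP@0.5 of 0.0099 on unseen test data. Furthermore, these adversarial patterns transfer strongly to other object detection models such as Faster R-CNN and earlier YOLO versions.|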
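|**2024-10-28**|[Synthetica: Large Scale Synthetic Data for Robot Perception](http://arxiv.org/abs/2410.21153)|null|Vision-based object detectors are a crucial basis for robotics applications because they provide valuable information about object localization in the environment. These detectors need to be highly reliable under varying lighting conditions, occlusions, and visual artifacts, all while running in real time. Collecting and annotating real-world data for these networks is prohibitively time-consuming and costly, especially for custom assets such as industrial objects, which makes generalization to real-world scenarios untenable. To this end, we present Synthetica, a method for large-scale synthetic data generation for training robust state estimators. This paper focuses on object detection, an important problem that can serve as the front end for most state estimation problems, such as pose estimation. Leveraging data from a photorealistic ray-tracing renderer, we scale up data generation to 2.7 million images to train highly accurate real-time detection transformers. We present a collection of rendering randomization and training-time data augmentation techniques conducive to robust sim-to-real performance for vision tasks. We demonstrate state-of-the-art performance on object detection while the detectors run at 50-100 Hz, which is 9 times faster than the prior SOTA. We further demonstrate the usefulness of our training methodology for robotics applications by showcasing a pipeline for custom objects in the real world for which no prior datasets exist. Our work highlights the importance of scaling synthetic data generation for robust sim-to-real transfer while achieving the fastest real-time inference speeds. Videos and supplementary information can be found at https://sites.google.com/view/synthetica-vision.|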
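|**2024-10-25**|[Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models](http://arxiv.org/abs/2410.19635)|null|Recent vision foundation models can extract universal representations and show impressive abilities in various tasks. However, their application to object detection has been largely overlooked, especially without fine-tuning. In this work, we show that frozen foundation models can be versatile feature enhancers even though they are not pre-trained for object detection. Specifically, we explore directly transferring the high-level image understanding of foundation models to detectors in two ways. First, the class token in foundation models provides an in-depth understanding of the complex scene, which facilitates decoding object queries in the detector's decoder by providing a compact context. Second, the patch tokens in foundation models can enrich the features in the detector's encoder by providing semantic details. Using frozen foundation models as plug-and-play modules rather than as the commonly used backbone can significantly improve detector performance while avoiding the problems caused by the architectural discrepancy between the detector's backbone and the foundation model. With this novel paradigm, we boost the state-of-the-art query-based detector DINO from 49.0% AP to 51.9% AP (+2.9% AP), and further to 53.8% AP (+4.8% AP), by integrating one or two foundation models, respectively, on the COCO validation set after training for 12 epochs with R50 as the detector's backbone.|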
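|**2024-10-25**|[MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors](http://arxiv.org/abs/2410.19590)|**[link](https://github.com/pufanqi23/monodgp)**|Perspective projection has been widely used in monocular 3D object detection methods. It introduces geometric priors from 2D bounding boxes and 3D object dimensions to reduce the uncertainty of depth estimation. However, because of depth errors originating from the object's visual surface, the height of the bounding box often fails to represent the actual projected central height, which undermines the effectiveness of geometric depth. Directly predicting the projected height unavoidably sacrifices 2D priors, while multi-depth prediction with complex branches does not fully exploit geometric depth. This paper presents a Transformer-based monocular 3D object detection method called MonoDGP, which adopts perspective-invariant geometry errors to modify the projection formula. We also attempt to systematically discuss and explain the mechanisms and efficacy behind geometry errors, which serve as a simple but effective alternative to multi-depth prediction. In addition, MonoDGP decouples the depth-guided decoder and constructs a 2D decoder that depends only on visual features, providing 2D priors and initializing object queries without interference from 3D detection. To further optimize and fine-tune the input tokens of the Transformer decoder, we also introduce a Region Segment Head (RSH) that generates enhanced features and segment embeddings. Our monocular method shows state-of-the-art performance on the KITTI benchmark without extra data. Code is available at https://github.com/PuFanqi23/MonoDGP.|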
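|**2024-10-25**|[DECADE: Towards Designing Efficient-yet-Accurate Distance Estimation Modules for Collision Avoidance in Mobile Advanced Driver Assistance Systems](http://arxiv.org/abs/2410.19336)|null|The proliferation of smartphones and other mobile devices provides a unique opportunity to make Advanced Driver Assistance Systems (ADAS) accessible to everyone, in the form of applications powered by low-cost machine/deep learning (ML/DL) models, to enhance road safety. For the critical feature of collision avoidance in mobile ADAS, lightweight deep neural networks (DNNs) exist for object detection, but conventional pixel-wise depth/distance estimation DNNs are far more computationally expensive, making them unsuitable for real-time applications on resource-constrained devices. In this paper, we present DECADE, a distance estimation model that processes each detector output instead of constructing pixel-wise depth/disparity maps. In it, we propose a pose estimation DNN that estimates the allocentric orientation of detections to supplement the distance estimation DNN, which predicts distance from bounding box features. We show that these modules can be attached to any detector to extend object detection with fast distance estimation. The proposed modules are evaluated by attaching them to the outputs of the YOLO object detector and fine-tuning on the KITTI 3D Object Detection dataset, achieving state-of-the-art performance with a mean absolute error of 1.38 meters and a mean relative error of 7.3% over the 0-150 meter distance range. Our extensive evaluation scheme assesses not only class-wise performance but also range-wise accuracy, especially in the critical 0-70 meter range.|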
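|**2024-10-24**|[HUE Dataset: High-Resolution Event and Frame Sequences for Low-Light Vision](http://arxiv.org/abs/2410.19164)|null|Low-light environments pose significant challenges for image enhancement methods. To address these challenges, in this work we introduce the HUE dataset, a comprehensive collection of high-resolution event and frame sequences captured in diverse and challenging low-light conditions. Our dataset includes 106 sequences covering indoor, cityscape, twilight, night, driving, and controlled scenarios, each carefully recorded to address various illumination levels and dynamic ranges. Using a hybrid RGB and event camera setup, we collect a dataset that combines high-resolution event data with complementary frame data. We evaluate state-of-the-art low-light enhancement and event-based image reconstruction methods through qualitative and quantitative evaluations with no-reference metrics. In addition, we evaluate these methods on a downstream object detection task. Our findings show that although event-based methods perform well on specific metrics, they may produce false positives in practical applications. This dataset and our comprehensive analysis provide valuable insights for future research in low-light vision and hybrid camera systems.|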
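|**2024-10-24**|[Optimizing Edge Offloading Decisions for Object Detection](http://arxiv.org/abs/2410.18919)|**[link](https://github.com/qiujiaming315/edgeml-object-detection)**|Recent advances in machine learning and hardware have produced embedded devices capable of real-time object detection with commendable accuracy. We consider a scenario in which embedded devices rely on an onboard object detector but can offload detection to a more powerful edge server when local accuracy is deemed too low. Resource constraints, however, limit the number of images that can be offloaded to the edge. Our goal is to identify which images to offload to maximize overall detection accuracy under those constraints. To that end, the paper introduces a reward metric designed to quantify the potential accuracy improvement from offloading individual images and proposes an approach to make offloading decisions efficiently by estimating this reward based only on local detection results. The approach is computationally frugal enough to run on embedded devices, and empirical findings indicate that it outperforms existing alternatives in improving detection accuracy even when the fraction of offloaded images is small.|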
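|**2024-10-24**|[Hybrid Quantum-Classical Feature Extraction approach for Image Classification using Autoencoders and Quantum SVMs](http://arxiv.org/abs/2410.18814)|null|To use quantum computers for machine learning tasks such as image classification, careful consideration is needed: quantum computers in the NISQ (Noisy Intermediate-Scale Quantum) era have limitations that include noise, scalability, read-in and read-out times, and gate operation times. Strategies should therefore be devised to mitigate the impact that complex datasets can have on the overall efficiency of a quantum machine learning pipeline, which may otherwise lead to excessive resource demands or increased noise. We apply a classical feature extraction method using a ResNet10-inspired convolutional autoencoder to both reduce the dimensionality of the dataset and extract abstract, meaningful features before feeding them into a quantum machine learning block. The quantum block of choice is a quantum-enhanced support vector machine (QSVM), because support vector machines typically do not require large sample sizes to identify patterns in data and have short-depth quantum circuits, which limits the impact of noise. The autoencoder is trained to extract meaningful features through image reconstruction, aiming to minimize the mean squared error on a training set. Three image datasets are used to illustrate the pipeline: HTRU-1, MNIST, and CIFAR-10. We also include a quantum-enhanced one-class support vector machine (QOCSVM) for the highly unbalanced HTRU-1 set, as well as classical machine learning results to serve as a benchmark. Finally, the HTRU-2 dataset is also included to serve as a benchmark for a dataset with well-correlated features. The autoencoder achieved near-perfect reconstruction and high classification accuracy for MNIST, while CIFAR-10 showed poorer performance because of image complexity and HTRU-1 struggled because of dataset imbalance. This highlights the need for a balance between dimensionality reduction through classical feature extraction and prediction performance using quantum methods.|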
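|**2024-10-25**|[Transferring Knowledge from High-Quality to Low-Quality MRI for Adult Glioma Diagnosis](http://arxiv.org/abs/2410.18698)|null|Glioma, a common and deadly brain tumor, requires early diagnosis for improved prognosis. However, low-quality Magnetic Resonance Imaging (MRI) technology in Sub-Saharan Africa (SSA) hinders accurate diagnosis. This paper presents our work in the BraTS Challenge on SSA Adult Glioma. We adopt the model from the BraTS-GLI 2021 winning solution and train it with three strategies: (1) initially training on the BraTS-GLI 2021 dataset with fine-tuning on the BraTS-Africa dataset, (2) training solely on the BraTS-Africa dataset, and (3) training solely on the BraTS-Africa dataset with 2x super-resolution enhancement. Results show that initial training on the BraTS-GLI 2021 dataset followed by fine-tuning on the BraTS-Africa dataset yields the best results. This suggests the importance of high-quality datasets in providing prior knowledge during training. Our top-performing model achieved Dice scores of 0.882, 0.840, and 0.926 and Hausdorff Distance (95%) scores of 15.324, 37.518, and 13.971 for enhancing tumor, tumor core, and whole tumor, respectively, in the validation phase. In the final phase of the competition, our approach secured second place overall, reflecting the strength and effectiveness of our model and training strategies. Our approach provides insights into improving glioma diagnosis in SSA, showing the potential of deep learning in resource-limited settings and the importance of transfer learning from high-quality datasets.|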
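|**2024-10-24**|[Spatial-Temporal Search for Spiking Neural Networks](http://arxiv.org/abs/2410.18580)|null|Spiking Neural Networks (SNNs), with appealing properties such as sparse computation and inherent temporal dynamics, are considered potential candidates for the next generation of artificial intelligence. By adopting architectures from Artificial Neural Networks (ANNs), SNNs achieve competitive performance on benchmark tasks such as image classification. However, architectures that succeed for ANNs are not optimal for SNNs. In this work, we apply Neural Architecture Search (NAS) to find architectures suited to SNNs. Previous NAS methods for SNNs focus primarily on the spatial dimension, with a notable lack of consideration for the temporal dynamics that are crucial to SNNs. Drawing inspiration from the heterogeneity of biological neural networks, we propose a differentiable approach that optimizes SNNs along both spatial and temporal dimensions. At the spatial level, we develop a spike-based differentiable hierarchical search (SpikeDHS) framework, in which spike-based operations are optimized at both the cell and the layer level under computational constraints. We further propose a differentiable surrogate gradient search (DGS) method to evolve local SG functions independently during training. At the temporal level, we explore the optimal configuration of diverse temporal dynamics for different types of spiking neurons by evolving their time constants, and on that basis we further develop hybrid networks combining SNNs and ANNs that balance accuracy and efficiency. Our methods achieve comparable classification performance on CIFAR10/100 and ImageNet, with accuracies of 96.43%, 78.96%, and 70.21%, respectively. On event-based deep stereo, our methods find optimal layer variation and surpass the accuracy of specially designed ANNs at 26 times lower computational cost (6.7 mJ), demonstrating the potential of SNNs for processing highly sparse and dynamic signals.|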
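|**2024-10-25**|[Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks](http://arxiv.org/abs/2410.18387)|null|Several medical Multimodal Large Language Models (MLLMs) have been developed to address tasks involving visual images with textual instructions across various medical modalities, achieving impressive results. Most current medical generalist models are region-agnostic, treating the entire image as a holistic representation. However, they struggle to identify which specific regions they are focusing on when generating a sentence. To mimic the behavior of doctors, who typically begin by reviewing the entire image before concentrating on specific regions for a thorough evaluation, we aim to enhance the capability of medical MLLMs to understand anatomical regions within entire medical scans. To achieve this, we first formulate Region-Centric tasks and construct a large-scale dataset, MedRegInstruct, to incorporate regional information into training. Combining our collected dataset with other medical multimodal corpora for training, we propose a Region-Aware medical MLLM, MedRegA, which is the first bilingual generalist medical AI system to simultaneously handle image-level and region-level medical vision-language tasks across a broad range of modalities. MedRegA not only supports three region-centric tasks but also achieves the best performance on visual question answering, report generation, and medical image classification over 8 modalities, showing significant versatility. Experiments show that our model can not only accomplish diverse medical vision-language tasks in bilingual settings but also recognize and detect structures in multimodal medical scans, improving the interpretability and user interactivity of medical MLLMs. Our project page is https://medrega.github.io.|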
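|**2024-10-24**|[Thermal Chameleon: Task-Adaptive Tone-mapping for Radiometric Thermal-Infrared images](http://arxiv.org/abs/2410.18340)|**[link](https://github.com/donkeymouse/thermalchameleon)**|Thermal Infrared (TIR) imaging provides robust perception for navigating challenging outdoor environments but suffers from poor texture and low image contrast because of its 14/16-bit format. Conventional methods use various tone-mapping methods to enhance contrast and photometric consistency of TIR images; however, the choice of tone-mapping depends largely on knowledge of the task and on good temperature-dependent priors. In this paper, we present the Thermal Chameleon Network (TCNet), a task-adaptive tone-mapping approach for RAW 14-bit TIR images. Given the same image, TCNet tone-maps different representations of TIR images tailored to each specific task, eliminating heuristic image rescaling preprocessing and the reliance on extensive prior knowledge of the scene temperature or task-specific characteristics. TCNet shows improved generalization performance across object detection and monocular depth estimation with minimal computational overhead and can be integrated modularly into existing architectures for various tasks. Project page: https://github.com/donkeymouse/ThermalChameleon|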
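|**2024-10-23**|[Backdoor in Seconds: Unlocking Vulnerabilities in Large Pre-trained Models via Model Editing](http://arxiv.org/abs/2410.18267)|null|Large pre-trained models have achieved notable success across a range of downstream tasks. However, recent research shows that a type of adversarial attack (i.e., the backdoor attack) can manipulate the behavior of machine learning models by contaminating their training dataset, posing a significant threat to the real-world application of large pre-trained models, especially customized ones. It is therefore vital to address the unique challenges of exploring the vulnerabilities of pre-trained models. Through empirical studies on the capability of performing backdoor attacks on large pre-trained models (e.g., ViT), we find the following unique challenges of attacking large pre-trained models: 1) the inability to manipulate or even access large training datasets, and 2) the substantial computational resources required for training or fine-tuning these models. To address these challenges, we establish new standards for an effective and feasible backdoor attack in the context of large pre-trained models. In line with these standards, we introduce our EDT model, an Efficient, Data-free, Training-free backdoor attack method. Inspired by model editing techniques, EDT injects an editing-based lightweight codebook into the backdoor of large pre-trained models, which replaces the embedding of the poisoned image with that of the target image without poisoning the training dataset or training the victim model. Our experiments, conducted across various pre-trained models such as ViT, CLIP, BLIP, and Stable Diffusion, and on downstream tasks including image classification, image captioning, and image generation, demonstrate the effectiveness of our method. Our code is available in the supplementary material.|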
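|**2024-10-23**|[FIPER: Generalizable Factorized Fields for Joint Image Compression and Super-Resolution](http://arxiv.org/abs/2410.18083)|null|In this work, we propose a unified representation for Super-Resolution (SR) and Image Compression, termed Factorized Fields, motivated by the shared principles of the two tasks. Both single-image SR (SISR) and image compression require recovering and preserving fine image details, whether by enhancing resolution or by reconstructing compressed data. Unlike previous methods that focus mainly on network architecture, our approach uses a basis-coefficient decomposition to explicitly capture multi-scale visual features and structural components in images, addressing the core challenges of both tasks. We first derive our SR model, which includes a Coefficient Backbone and a Basis Swin Transformer for generalizable Factorized Fields. Then, to further unify the two tasks, we use the strong information-recovery capability of the trained SR modules as a prior in the compression pipeline, improving both compression efficiency and detail reconstruction. In addition, we introduce a merged-basis compression branch that consolidates shared structures, further optimizing the compression process. Extensive experiments show that our unified representation delivers state-of-the-art performance, achieving an average improvement of 204.4% in PSNR over the baseline in SR and a 9.35% BD-rate reduction in image compression compared with the previous SOTA.|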
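|**2024-10-23**|[DREB-Net: Dual-stream Restoration Embedding Blur-feature Fusion Network for High-mobility UAV Object Detection](http://arxiv.org/abs/2410.17822)|**[link](https://github.com/eeic-lab/dreb-net)**|Object detection algorithms are pivotal components of unmanned aerial vehicle (UAV) imaging systems and are used extensively in complex fields. However, images captured by high-mobility UAVs often suffer from motion blur, which significantly impedes the performance of advanced object detection algorithms. To address these challenges, we propose an innovative object detection algorithm designed specifically for blurry images, named DREB-Net (Dual-stream Restoration Embedding Blur-feature Fusion Network). First, DREB-Net addresses the particularities of object detection in blurry images by incorporating a Blurry image Restoration Auxiliary Branch (BRAB) during the training phase. Second, it fuses the extracted shallow features through a Multi-level Attention-Guided Feature Fusion (MAGFF) module to extract richer features. Here, the MAGFF module comprises local attention modules and global attention modules, which assign different weights to the branches. Then, during the inference phase, the deep feature extraction of the BRAB can be removed to reduce computational complexity and improve detection speed. In the loss function, a combined MSE and SSIM loss is added to the BRAB to restore blurry images. Finally, DREB-Net introduces the Fast Fourier Transform in the early stages of feature extraction through a Learnable Frequency domain Amplitude Modulation Module (LFAMM) to adjust feature amplitudes and enhance feature processing capability. Experimental results indicate that DREB-Net can still perform object detection effectively on motion-blurred captured images, showing excellent performance and broad application prospects. Our source code will be available at https://github.com/EEIC-Lab/DREB-Net.git.|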
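|**2024-10-23**|[Deep Learning for Active Region Classification: A Systematic Study from Convolutional Neural Networks to Vision Transformers](http://arxiv.org/abs/2410.17816)|null|Solar active regions can significantly disrupt the Sun-Earth space environment, often leading to severe space weather events such as solar flares and coronal mass ejections. The automatic classification of active region groups is therefore the crucial starting point for forecasting solar activity accurately and promptly. This study presents the results of applying deep learning techniques to the classification of active region images based on the Mount Wilson classification scheme. Specifically, we explore the latest advances in image classification architectures, from Convolutional Neural Networks to Vision Transformers, and report on their performance for the active region classification task, showing that the key to their effectiveness is a robust training process based on the latest advances in the field.|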
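|**2024-10-22**|[Altogether: Image Captioning via Re-aligning Alt-text](http://arxiv.org/abs/2410.17251)|**[link](https://github.com/facebookresearch/metaclip)**|This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings. First, they caption images from scratch, ignoring existing alt-text metadata; second, they lack transparency when the captioner's training data (e.g., GPT) is unknown. In this paper, we study a principled approach, Altogether, based on the key idea of editing and re-aligning existing alt-texts associated with the images. To generate training data, we perform human annotation in which annotators start with the existing alt-text and re-align it to the image content over multiple rounds, consequently constructing captions with rich visual concepts. This differs from prior work, which carries out human annotation as a one-time description task based solely on images and annotator knowledge. We train a captioner on this data that generalizes the process of re-aligning alt-texts at scale. Our results show that the Altogether approach leads to richer image captions and also improves text-to-image generation and zero-shot image classification tasks.|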
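|**2024-10-22**|[KANICE: Kolmogorov-Arnold Networks with Interactive Convolutional Elements](http://arxiv.org/abs/2410.17172)|**[link](https://github.com/m-ferdaus/kanice)**|We introduce KANICE (Kolmogorov-Arnold Networks with Interactive Convolutional Elements), a novel neural architecture that combines Convolutional Neural Networks (CNNs) with Kolmogorov-Arnold Network (KAN) principles. KANICE integrates Interactive Convolutional Blocks (ICBs) and KAN linear layers into a CNN framework. This leverages the universal approximation capabilities of KANs and the adaptive feature learning of ICBs. Building on the Kolmogorov-Arnold representation theorem, KANICE captures complex, non-linear data relationships while enabling dynamic, context-dependent feature extraction. We evaluated KANICE on four datasets, MNIST, Fashion-MNIST, EMNIST, and SVHN, comparing it against standard CNNs, CNN-KAN hybrids, and ICB variants. KANICE consistently outperformed the baseline models, achieving 99.35% accuracy on MNIST and 90.05% on the SVHN dataset. Furthermore, we introduce KANICE-mini, a compact variant designed for efficiency. A comprehensive ablation study shows that KANICE-mini achieves performance comparable to KANICE with significantly fewer parameters: KANICE-mini reached 90.00% accuracy on SVHN with 2,337,828 parameters, compared with 25,432,000 for KANICE. This study highlights the potential of KAN-based architectures for balancing performance and computational efficiency in image classification tasks. Our work contributes to research on adaptive neural networks, integrates mathematical theorems into deep learning architectures, and explores the trade-offs between model complexity and performance, advancing computer vision and pattern recognition. The source code for this paper is publicly available through our GitHub repository (https://github.com/m-ferdaus/kanice).|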
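|**2024-10-22**|[YOLO-TS: Real-Time Traffic Sign Detection with Enhanced Accuracy Using Optimized Receptive Fields and Anchor-Free Fusion](http://arxiv.org/abs/2410.17144)|null|Ensuring safety in both autonomous driving and advanced driver-assistance systems (ADAS) depends critically on the effective deployment of traffic sign recognition technology. Although existing methods are effective to a degree, they often compromise between speed and accuracy. To address this, we present YOLO-TS, a novel, real-time, and efficient road sign detection network. The network significantly improves performance by optimizing the receptive fields of multi-scale feature maps so that they align more closely with the size distribution of traffic signs across datasets. Moreover, our feature-fusion strategy, which leverages the flexibility of anchor-free methods, allows multi-scale object detection on a high-resolution feature map rich in contextual information, yielding notable gains in both accuracy and speed. To mitigate the adverse effects of the gridding pattern caused by dilated convolutions on small object detection, we devise a module that not only mitigates this gridding effect but also widens the receptive field to cover a broader range of spatial context, improving the efficiency of information usage. Evaluations on the challenging public datasets TT100K and CCTSDB2021 show that YOLO-TS surpasses existing state-of-the-art methods in both accuracy and speed. The code for our method will be made publicly available.|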
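|**2024-10-22**|[AttriPrompter: Auto-Prompting with Attribute Semantics for Zero-shot Nuclei Detection via Visual-Language Pre-trained Models](http://arxiv.org/abs/2410.16820)|**[link](https://github.com/wuyongjiancode/attriprompter)**|Large-scale visual-language pre-trained models (VLPMs) have shown exceptional performance in downstream object detection through text prompts for natural scenes. However, their application to zero-shot nuclei detection on histopathology images remains relatively unexplored, mainly because of the significant gap between the characteristics of medical images and the web-originated text-image pairs used for pre-training. This paper investigates the potential of the object-level VLPM, the Grounded Language-Image Pre-training (GLIP) model, for zero-shot nuclei detection. Specifically, we propose an innovative auto-prompting pipeline named AttriPrompter, comprising attribute generation, attribute augmentation, and relevance sorting, to avoid subjective manual prompt design. AttriPrompter uses VLPMs' text-to-image alignment to create semantically rich text prompts, which are then fed into GLIP for initial zero-shot nuclei detection. In addition, we propose a self-trained knowledge distillation framework in which GLIP serves as the teacher and its initial predictions are used as pseudo labels, to address the challenges posed by high nuclei density, including missed detections, false positives, and overlapping instances. Our method shows remarkable performance in label-free nuclei detection, outperforming all existing unsupervised methods and demonstrating excellent generality. Notably, this work highlights the astonishing potential of VLPMs pre-trained on natural image-text pairs for downstream tasks in the medical field. Code will be released at https://github.com/wuyongjianCODE/AttriPrompter.|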
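|**2024-10-22**|[DSORT-MCU: Detecting Small Objects in Real-Time on Microcontroller Units](http://arxiv.org/abs/2410.16769)|null|Advances in lightweight neural networks have revolutionized computer vision in a broad range of IoT applications, including remote monitoring and process automation. However, small object detection, which is crucial for many of these applications, remains an underexplored area in current computer vision research, particularly for low-power embedded devices that host resource-constrained processors. To address this gap, this paper proposes an adaptive tiling method for lightweight and energy-efficient object detection networks, including YOLO-based models and the popular FOMO network. The proposed tiling enables object detection on low-power MCUs with no compromise on accuracy compared with large-scale detection models. The benefit of the proposed method is demonstrated by applying it to FOMO and TinyissimoYOLO networks on a novel RISC-V-based MCU with built-in ML accelerators. Extensive experimental results show that the proposed tiling method boosts the F1-score by up to 225% for both FOMO and TinyissimoYOLO networks while reducing the average object count error by up to 76% with FOMO and up to 89% with TinyissimoYOLO. Furthermore, the findings of this work indicate that using a soft F1 loss over the popular binary cross-entropy loss can serve as an implicit non-maximum suppression for FOMO networks. To evaluate real-world performance, the networks are deployed on the RISC-V-based GAP9 microcontroller from GreenWaves Technologies, showcasing the proposed method's ability to strike a balance between detection performance (58% - 95% F1 score), low latency (0.6 ms/inference - 16.2 ms/inference), and energy efficiency (31 µJ/inference - 1.27 mJ/inference) while performing multiple predictions using high-resolution images on an MCU.|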
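|**2024-10-22**|[DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model](http://arxiv.org/abs/2410.16707)|null|This paper is motivated by an interesting phenomenon: when we investigate the intermediate results from the beginning transformer decoder layer of MaskDINO (i.e., the SOTA model for joint detection and segmentation), the performance of object detection lags behind that of instance segmentation (i.e., a performance imbalance). This phenomenon prompts a question: does the performance imbalance at the beginning layer of the transformer decoder constrain the upper bound of the final performance? With this question in mind, we conduct further qualitative and quantitative pre-experiments, which validate the negative impact of the detection-segmentation imbalance on model performance. To address this issue, this paper proposes the DI-MaskDINO model, whose core idea is to improve final performance by alleviating the detection-segmentation imbalance. DI-MaskDINO is implemented by configuring our proposed De-Imbalance (DI) module and Balance-Aware Tokens Optimization (BATO) module into MaskDINO. The DI module is responsible for generating the balance-aware query, and the BATO module uses the balance-aware query to guide the optimization of the initial feature tokens. The balance-aware query and the optimized feature tokens are taken as the Query and the Key&Value of the transformer decoder, respectively, to perform joint object detection and instance segmentation. DI-MaskDINO outperforms existing joint object detection and instance segmentation models on the COCO and BDD100K benchmarks, achieving +1.2 $AP^{box}$ and +0.9 $AP^{mask}$ over the SOTA joint detection and segmentation model MaskDINO. In addition, DI-MaskDINO obtains +1.0 $AP^{box}$ over the SOTA object detection model DINO and +3.0 $AP^{mask}$ over the SOTA segmentation model Mask2Former.|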
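|**2024-10-22**|[Fire and Smoke Detection with Burning Intensity Representation](http://arxiv.org/abs/2410.16642)|**[link](https://github.com/xiaoyihan6/fsdmethod)**|An effective Fire and Smoke Detection (FSD) and analysis system is of paramount importance because of the destructive potential of fire disasters. However, many existing FSD methods directly employ generic object detection techniques without considering the transparency of fire and smoke, which leads to imprecise localization and reduces detection performance. To address this issue, this paper proposes a new Attentive Fire and Smoke Detection Model (a-FSDM). The model not only retains the strong feature extraction and fusion capabilities of conventional detection algorithms but also redesigns the detection head specifically for transparent targets in FSD, termed the Attentive Transparency Detection Head (ATDH). In addition, Burning Intensity (BI) is introduced as a pivotal feature for fire-related downstream risk assessments in traditional FSD methodologies. Extensive experiments on multiple FSD datasets showcase the effectiveness and versatility of the proposed FSD model. The project is available at https://xiaoyihan6.github.io/FSD/.|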
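|**2024-10-21**|[Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models](http://arxiv.org/abs/2410.16163)|**[link](https://github.com/jefferyzhan/griffon)**|Large Multimodal Models (LMMs) have achieved significant breakthroughs in various vision-language and vision-centric tasks based on auto-regressive modeling. However, these models typically focus either on vision-centric tasks, such as visual grounding and region description, or on vision-language tasks, such as image captioning and multi-scenario visual question answering (VQA). No existing LMM comprehensively unifies both types of tasks within a single model, as large language models do in natural language processing. Furthermore, even with abundant multi-task instruction-following data, directly stacking these data to extend universal capabilities remains challenging. To address these issues, we introduce a novel multi-dimension curated and consolidated multimodal dataset, named CCMD-8M, which overcomes the data barriers to unifying vision-centric and vision-language tasks through multi-level data curation and multi-task consolidation. More importantly, we present Griffon-G, a general LMM that addresses both vision-centric and vision-language tasks within a single end-to-end paradigm. Griffon-G resolves the training collapse encountered during the joint optimization of these tasks, achieving better training efficiency. Evaluations across multimodal benchmarks, general VQA tasks, scene text-centric VQA tasks, document-related VQA tasks, referring expression comprehension, and object detection show that Griffon-G surpasses advanced LMMs and achieves expert-level performance on complicated vision-centric tasks.|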
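|**2024-10-21**|[Few-shot target-driven instance detection based on open-vocabulary object detection models](http://arxiv.org/abs/2410.16028)|null|Current large open vision models could be useful for one-shot and few-shot object recognition. Nevertheless, gradient-based re-training solutions are costly. On the other hand, open-vocabulary object detection models bring visual and textual concepts closer together in the same latent space, allowing zero-shot detection via prompting at small computational cost. We propose a lightweight method to turn the latter into a one-shot or few-shot object recognition model without requiring textual descriptions. Our experiments on the TEgO dataset, using the YOLO-World model as a base, show that performance increases with model size, the number of examples, and the use of image augmentation.|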
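|**2024-10-21**|[Visual Representation Learning Guided By Multi-modal Prior Knowledge](http://arxiv.org/abs/2410.15981)|null|Despite the remarkable success of deep neural networks (DNNs) in computer vision, their performance degrades when there is a distribution shift between training and test data. In this paper, we propose Knowledge-Guided Visual representation learning (KGV), a distribution-based learning approach that leverages multi-modal prior knowledge to improve generalization under distribution shift. We use prior knowledge from two distinct modalities: 1) a knowledge graph (KG) with hierarchical and association relationships; and 2) synthetic images generated from visual elements semantically represented in the KG. The respective embeddings are generated from the given modalities in a common latent space, i.e., visual embeddings from original and synthetic images as well as knowledge graph embeddings (KGEs). These embeddings are aligned via a novel variant of translation-based KGE methods, in which the node and relation embeddings of the KG are modeled as Gaussian distributions and translations, respectively. We claim that incorporating multi-modal prior knowledge enables more regularized learning of image representations. The models can therefore generalize better across different data distributions. We evaluate KGV on different image classification tasks with major or minor distribution shifts, namely road sign classification across datasets from Germany, China, and Russia, image classification with the mini-ImageNet dataset and its variants, and the DVM-CAR dataset. The results show that KGV consistently achieves higher accuracy and data efficiency than the baselines across all experiments.|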
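|**2024-10-18**|[MultiOrg: A Multi-rater Organoid-detection Dataset](http://arxiv.org/abs/2410.14612)|null|High-throughput image analysis in the biomedical domain has gained significant attention in recent years, driving advances in drug discovery, disease prediction, and personalized medicine. Organoids, which are excellent models of human organs and their functions, are an active area of research. Automating the quantification of organoids in microscopy images would provide an effective solution to the bottleneck of extensive manual quantification, particularly in high-throughput image analysis. However, there is a notable lack of open biomedical datasets compared with other domains such as autonomous driving, and, notably, only a few of them attempt to quantify annotation uncertainty. In this work, we present MultiOrg, a comprehensive organoid dataset tailored for object detection tasks with uncertainty quantification. The dataset comprises over 400 high-resolution 2D microscopy images and curated annotations of more than 60,000 organoids. Most importantly, it includes three label sets for the test data, independently annotated by two experts at distinct time points. We additionally provide a benchmark for organoid detection and make the best model available through an easily installable, interactive plugin for the popular image visualization tool Napari to perform organoid quantification.|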
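|**2024-10-18**|[A Hybrid Feature Fusion Deep Learning Framework for Leukemia Cancer Detection in Microscopic Blood Sample Using Gated Recurrent Unit and Uncertainty Quantification](http://arxiv.org/abs/2410.14536)|null|Acute lymphoblastic leukemia (ALL) is the most malignant form of leukemia and the most common cancer in adults and children. Traditionally, leukemia is diagnosed by analyzing blood and bone marrow smears under a microscope, with additional cytochemical tests for confirmation. However, these methods are expensive, time-consuming, and highly dependent on expert knowledge. In recent years, deep learning, particularly convolutional neural networks (CNNs), has provided advanced methods for classifying microscopic smear images, aiding the detection of leukemic cells. These methods are fast, cost-effective, and not subject to human bias. However, most lack the ability to quantify uncertainty, which can lead to critical misdiagnoses. In this study, hybrid deep learning models (InceptionV3-GRU, EfficientNetB3-GRU, MobileNetV2-GRU) are used to classify ALL. Bayesian optimization is used to fine-tune the models' hyperparameters and improve their performance. In addition, Deep Ensemble uncertainty quantification is applied to address uncertainty in leukemia image classification. The proposed models were trained on the publicly available datasets ALL-IDB1 and ALL-IDB2, and their results were then aggregated at the score level using the sum rule. The parallel architecture used in these models provides a high level of confidence in differentiating between ALL and non-ALL cases. The proposed method achieved a detection accuracy of 100% on ALL-IDB1, 98.07% on ALL-IDB2, and 98.64% on the combined dataset, demonstrating its potential for accurate and reliable leukemia diagnosis.||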
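|**2024-10-18**|[Ultrasound matrix imaging for transcranial in-vivo localization microscopy](http://arxiv.org/abs/2410.14499)|null|Transcranial ultrasound imaging is usually limited by skull-induced attenuation and high-order aberrations. By using contrast agents such as microbubbles in combination with ultrafast imaging, not only can the signal-to-noise ratio be improved, but super-resolved images of the brain vessels down to the micrometer scale can also be obtained. However, ultrasound localization microscopy (ULM) remains impacted by wavefront distortions that limit the microbubble detection rate and hamper their localization. In this work, we show how matrix imaging, which relies on the prior recording of the reflection matrix, can provide a solution to these fundamental issues. As an experimental proof of concept, an in-vivo reconstruction of deep brain microvessels is performed on three anesthetized sheep. The results show that compensating for wave distortions drastically enhances the contrast and resolution of ULM. This experimental study opens up promising prospects for transcranial, non-ionizing observation of human cerebral microvascular pathologies, such as stroke.||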
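|**2024-10-18**|[ClearSR: Latent Low-Resolution Image Embeddings Help Diffusion-Based Real-World Super Resolution Models See Clearer](http://arxiv.org/abs/2410.14279)|null|We present ClearSR, a new method that makes better use of latent low-resolution (LR) image embeddings for diffusion-based real-world image super-resolution (Real-ISR). Previous Real-ISR models mostly focus on activating more of the generative priors of text-to-image diffusion models to make the output high-resolution (HR) images look better. However, because these methods rely too heavily on generative priors, the content of the output images is often inconsistent with the input LR images. To mitigate this issue, in this work we explore using latent LR embeddings to constrain the control signals from ControlNet and to extract LR information at both the detail and structure levels. We show that properly using latent LR embeddings produces higher-quality control signals, which makes the super-resolution results more consistent with the LR image and leads to clearer visual results. In addition, we show that latent LR embeddings can be used to control the inference stage, improving fidelity and generation ability simultaneously. Experiments show that our model achieves better performance across multiple metrics on several test sets and generates SR results that are more consistent with the LR images than existing methods. Our code will be made publicly available.||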
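|**2024-10-18**|[Comparative Evaluation of Clustered Federated Learning Method](http://arxiv.org/abs/2410.14212)|**[link](https://github.com/leahcimali/Comparative-Evaluation-of-Clustered-Federated-Learning-Methods)**|Over recent years, federated learning (FL) has proven to be one of the most promising methods of distributed learning that preserves data privacy. As the method has evolved and been applied in various real-world scenarios, new challenges have emerged. One such challenge is the presence of highly heterogeneous (often referred to as non-IID) data distributions among participants in the FL protocol. A popular solution to this hurdle is clustered federated learning (CFL), which aims to partition clients into groups whose distributions are homogeneous. In the literature, state-of-the-art CFL algorithms are often tested on a few cases of data heterogeneity, without systematically justifying the choices. Further, the taxonomy used to differentiate heterogeneity scenarios is not always straightforward. In this paper, we examine the performance of two state-of-the-art CFL algorithms with respect to a proposed taxonomy of data heterogeneity in FL. We work with three image classification datasets and analyze the resulting clusters against the heterogeneity classes using extrinsic clustering metrics. Our objective is to provide a clearer understanding of the relationship between CFL performance and data heterogeneity scenarios.||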
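|**2024-10-17**|[MMAD-Purify: A Precision-Optimized Framework for Efficient and Scalable Multi-Modal Attacks](http://arxiv.org/abs/2410.14089)|null|Neural networks have achieved remarkable performance across a wide range of tasks, yet they remain susceptible to adversarial perturbations, which pose significant risks in safety-critical applications. With the rise of multimodality, diffusion models have emerged as powerful tools not only for generative tasks but also for various applications such as image editing, inpainting, and super-resolution. However, these models still lack robustness because research on attacking them to strengthen their resilience has been limited. Traditional attack techniques, such as gradient-based adversarial attacks and diffusion-model-based methods, are hindered by computational inefficiency and scalability issues due to their iterative nature. To address these challenges, we introduce an innovative framework that leverages the distilled backbone of diffusion models and incorporates a precision-optimized noise predictor to enhance the effectiveness of our attack framework. This approach not only increases the attack's potency but also significantly reduces computational cost. Our framework provides a cutting-edge solution for multimodal adversarial attacks, ensuring lower latency and the generation of high-fidelity adversarial examples with higher success rates. Furthermore, we demonstrate that our framework achieves excellent transferability and robustness against purification defenses, outperforming existing gradient-based attack models in both effectiveness and efficiency.||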
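|**2024-10-17**|[Reproducibility study of "LICO: Explainable Models with Language-Image Consistency"](http://arxiv.org/abs/2410.13989)|**[link](https://github.com/robertdvdk/lico-fact)**|The growing reproducibility crisis in machine learning calls for careful examination of research findings. This paper investigates the LICO method proposed by Lei et al. (2023), which aims to enhance post-hoc interpretability techniques and improve image classification performance. LICO leverages natural language supervision from a vision-language model to enrich feature representations and guide the learning process. We conduct a comprehensive reproducibility study, employing (Wide) ResNets and established interpretability methods such as Grad-CAM and RISE. We were mostly unable to reproduce the authors' results. In particular, we did not find that LICO consistently led to improved classification performance or to improvements in quantitative and qualitative measures of interpretability. Our findings therefore underscore the importance of rigorous evaluation and transparent reporting in interpretability research.||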
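|**2024-10-17**|[ConsisSR: Delving Deep into Consistency in Diffusion-based Image Super-Resolution](http://arxiv.org/abs/2410.13807)|null|Real-world image super-resolution (Real-ISR) aims to restore high-quality (HQ) images from low-quality (LQ) inputs corrupted by unknown and complex degradations. In particular, pretrained text-to-image (T2I) diffusion models provide strong generative priors for reconstructing credible and intricate details. However, T2I generation focuses on semantic consistency while Real-ISR emphasizes pixel-level reconstruction, which hinders existing methods from fully exploiting diffusion priors. To address this challenge, we introduce ConsisSR to handle both semantic and pixel-level consistency. Specifically, compared with coarse-grained text prompts, we exploit the more powerful CLIP image embedding and effectively leverage both modalities for semantic guidance through our Hybrid Prompt Adapter (HPA). Second, we introduce Time-aware Latent Augmentation (TALA) to mitigate the inherent gap between T2I generation and the consistency requirements of Real-ISR. By randomly mixing LQ and HQ latent inputs, our model not only handles timestep-specific diffusion noise but also refines the accumulated latent representations. Last but not least, our GAN-Embedding strategy employs the pretrained Real-ESRGAN model to refine the diffusion start point. This accelerates the inference process to 10 steps without training while preserving sampling quality. Our method demonstrates state-of-the-art performance among both full-scale and accelerated models. The code will be made publicly available.||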
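|**2024-10-17**|[LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning](http://arxiv.org/abs/2410.13618)|**[link](https://github.com/skddj/loldu)**|The rapid growth of model scale has increased the computational resources required for fine-tuning. Existing approaches such as low-rank adaptation (LoRA) seek to address the problem of handling the large number of updated parameters in full fine-tuning. However, LoRA uses random initialization and optimization of low-rank matrices to approximate the updated weights, which can result in slower convergence and an accuracy gap compared with full fine-tuning. To address these issues, we propose LoLDU, a parameter-efficient fine-tuning (PEFT) approach that reduces trainable parameters by 2600 times compared with regular PEFT methods while maintaining comparable performance. LoLDU leverages Lower-Diag-Upper (LDU) decomposition to initialize low-rank matrices for faster convergence and orthogonality. We focus on optimizing the diagonal matrix for scaling transformations. To the best of our knowledge, LoLDU has the fewest parameters among all PEFT approaches. We conducted extensive experiments on 4 instruction-following datasets, 6 natural language understanding (NLU) datasets, 8 image classification datasets, and image generation datasets with multiple model types (LLaMA2, RoBERTa, ViT, and Stable Diffusion), providing a comprehensive and detailed analysis. Our open-source code is available at https://github.com/SKDDJ/LoLDU.||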
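|**2024-10-17**|[Spatiotemporal Object Detection for Improved Aerial Vehicle Detection in Traffic Monitoring](http://arxiv.org/abs/2410.13616)|null|This work advances multi-class vehicle detection with UAV cameras by developing spatiotemporal object detection models. The study introduces a Spatio-Temporal Vehicle Detection Dataset (STVD) containing 6,600 annotated sequential frame images captured by UAVs, enabling comprehensive training and evaluation of algorithms for holistic spatiotemporal perception. A YOLO-based object detection algorithm is enhanced to incorporate temporal dynamics, improving performance over single-frame models. Integrating attention mechanisms into the spatiotemporal models further improves performance. Experimental validation shows significant progress: the best spatiotemporal model improves on single-frame models by 16.22%, while attention mechanisms are shown to hold potential for additional performance gains.||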
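|**2024-10-17**|[Augmentation Policy Generation for Image Classification Using Large Language Models](http://arxiv.org/abs/2410.13453)|null|Automated data augmentation methods have significantly improved the performance and generalization capability of deep learning models in image classification. Yet most state-of-the-art methods are optimized on common benchmark datasets, limiting their applicability to more diverse or domain-specific data, such as medical datasets. In this paper, we propose a strategy that uses large language models (LLMs) to automatically generate efficient augmentation policies, customized to the specific characteristics of any dataset and model architecture. The proposed method iteratively interacts with an LLM to obtain and refine augmentation policies based on model performance feedback, creating a dataset-agnostic data augmentation pipeline. The proposed method was evaluated on medical imaging datasets, showing a clear improvement over existing methods. The proposed approach offers an adaptive and scalable solution. Although it increases computational cost, it significantly improves model robustness, automates the process, and minimizes human involvement during model development.||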
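|**2024-10-17**|[Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation](http://arxiv.org/abs/2410.13437)|null|Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to locate an arbitrary number of target objects referred to by a language expression in a video and to maintain their identities. This intricate task involves reasoning over linguistic and visual modalities, along with temporal association of the target objects. However, existing studies employ only loose feature fusion and overlook the use of long-term information about tracked objects. In this study, we introduce a compact Transformer-based method, termed TenRMOT. We conduct feature fusion at both the encoding and decoding stages to fully exploit the advantages of the Transformer architecture. Specifically, we perform cross-modal fusion incrementally, layer by layer, during the encoding phase. In the decoding phase, we use language-guided queries to probe memory features and accurately predict the desired objects. Moreover, we introduce a query update module that explicitly leverages the temporal prior information of tracked objects to enhance the coherence of their trajectories. In addition, we introduce a novel task called Referring Multi-Object Tracking and Segmentation (RMOTS) and construct a new dataset named Ref-KITTI Segmentation. Our dataset consists of 18 videos with 818 expressions, and each expression averages 10.7 masks, which poses a greater challenge than the typical single mask in most existing referring video segmentation datasets. TenRMOT demonstrates superior performance on both the referring multi-object tracking task and the segmentation task.||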
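|**2024-10-17**|[Unsupervised Skull Segmentation via Contrastive MR-to-CT Modality Translation](http://arxiv.org/abs/2410.13427)|null|Segmenting the skull from CT scans can be regarded as an already solved problem. In MRI, however, the task is significantly more complex because of the presence of soft tissues rather than bones. Capturing bone structures from MR images of the head is very difficult, since the main visualization target of head MRI is the brain. Attempts to use skull-stripping methods seem poorly suited to this task and fail in many cases. On the other hand, supervised approaches require costly and time-consuming skull annotations. To overcome these difficulties, we propose a fully unsupervised approach in which we do not segment MR images directly, but instead generate synthetic CT data through MRI-to-CT translation and perform the segmentation there. We address many issues associated with unsupervised skull segmentation, including the unpaired nature of the MRI and CT datasets (contrastive learning), low resolution and poor quality (super-resolution), and generalization capability. This research is of significant value for downstream tasks that require skull segmentation from MR volumes, such as craniectomy or surgical planning, and can be seen as an important step towards the use of synthetic data in medical imaging.||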
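|**2024-10-16**|[Interpreting and Analyzing CLIP's Zero-Shot Image Classification via Mutual Knowledge](http://arxiv.org/abs/2410.13016)|**[link](https://github.com/fawazsammani/clip-interpret-mutual-knowledge)**|Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representations into a shared embedding space and then retrieving the class closest to the image. This work provides a new approach for interpreting CLIP models for image classification through the lens of mutual knowledge between the two modalities. Specifically, we ask: which concepts do the vision and language CLIP encoders both learn in common that influence the joint embedding space, causing points to be closer or farther apart? We answer this question through textual concept-based explanations, show their effectiveness, and perform an analysis on a pool of 13 CLIP models that vary in architecture, size, and pretraining dataset. We explore these different aspects in relation to mutual knowledge and analyze zero-shot predictions. Our approach demonstrates an effective and human-friendly way of understanding CLIP's zero-shot classification decisions.||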
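|**2024-10-16**|[PND-Net: Plant Nutrition Deficiency and Disease Classification using Graph Convolutional Network](http://arxiv.org/abs/2410.12742)|null|Crop yield and agricultural growth can be improved if the various plant nutrition deficiencies and diseases are identified and detected at early stages. Deep learning methods have shown excellent performance in the automated detection of plant diseases and nutrition deficiencies from visual symptoms in leaves. This paper proposes a new deep learning method that classifies plant nutrition deficiencies and diseases using a graph convolutional network (GCN) added on top of a base convolutional neural network (CNN). Sometimes a global feature descriptor may fail to capture the key regions of a diseased leaf, which leads to inaccurate disease classification. To address this issue, regional feature learning is crucial for holistic feature aggregation. In this work, we explore region-based feature summarization at multiple scales using spatial pyramid pooling to obtain a discriminative feature representation. We develop a GCN that can learn finer details for classifying plant diseases and nutrition deficiencies. The proposed method, called Plant Nutrition Deficiency and Disease Network (PND-Net), is evaluated on two public datasets for nutrition deficiency and two for disease classification using four CNNs. The best classification performances are: (a) 90.00% on the banana and 90.54% on the coffee nutrition deficiency datasets; and (b) 96.18% on the potato disease dataset and 84.30% on the PlantDoc dataset, using the Xception backbone. In addition, further generalization experiments were conducted, and the proposed method achieved state-of-the-art performance on two public datasets, namely breast cancer histopathology image classification (BreakHis 40X: 95.50% accuracy; BreakHis 100X: 96.79% accuracy) and single-cell classification in Pap smear images for cervical cancer (SIPaKMeD: 99.18% accuracy). PND-Net also achieves improved performance under five-fold cross-validation.||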
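|**2024-10-16**|[Transformer based super-resolution downscaling for regional reanalysis: Full domain vs tiling approaches](http://arxiv.org/abs/2410.12728)|null|Super-resolution (SR) is a promising cost-effective downscaling methodology for producing high-resolution climate information from coarser counterparts. A particular application is downscaling regional reanalysis outputs (predictand) from the driving global counterparts (predictor). This study compares various SR downscaling methods in this context, focusing on temperature and using the CERRA reanalysis (5.5 km resolution, produced with a regional atmospheric model driven by ERA5). The method proposed in this work is the Swin Transformer, and two alternative methods are used as benchmarks (the fully convolutional U-Net and the convolutional and dense DeepESD), as well as simple bicubic interpolation. We compare two approaches: a standard one that uses the full domain as input, and a more scalable tiling approach that divides the full domain into tiles used as input. The methods are trained to downscale CERRA surface temperature based on temperature information from the driving ERA5; in addition, the tiling approach includes static orographic information. We show that the tiling approach, which requires spatial transferability, comes at the cost of lower performance (although it outperforms some full-domain benchmarks), but it provides an efficient, scalable solution that allows SR downscaling on a pan-European scale and is valuable for real-time applications.||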
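|**2024-10-16**|[MambaBEV: An efficient 3D detection model with Mamba2](http://arxiv.org/abs/2410.12673)|null|A stable 3D object detection model based on the BEV paradigm with temporal information is essential for autonomous driving systems. However, current temporal fusion models that use convolutional layers or deformable self-attention do not favor the exchange of global information in BEV space and have a higher computational cost. Recently, a newly proposed Mamba-based model specialized in processing sequences has shown great potential in multiple downstream tasks. In this work, we propose a Mamba2-based BEV 3D object detection model named MambaBEV. We also adopt an end-to-end autonomous driving paradigm to test the model's performance. Our work performs quite well on the nuScenes dataset: our base version achieves 51.7% NDS. Our code will be open-sourced soon.||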
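|**2024-10-15**|[Fractal Calibration for long-tailed object detection](http://arxiv.org/abs/2410.11774)|**[link](https://github.com/kostas1515/FRACAL)**|Real-world datasets follow an imbalanced distribution, which poses significant challenges for rare-category object detection. Recent studies tackle this problem by developing re-weighting and re-sampling methods that use the class frequencies of the dataset. However, these techniques focus solely on frequency statistics and ignore the distribution of classes in image space, missing important information. In contrast, we propose FRActal CALibration (FRACAL): a novel post-calibration method for long-tailed object detection. FRACAL devises a logit adjustment method that uses the fractal dimension to estimate how uniformly classes are distributed in image space. During inference, it uses the fractal dimension to inversely down-weight the predicted probabilities of uniformly spaced classes, achieving balance along two axes: between frequent and rare categories, and between uniformly spaced and sparsely spaced classes. FRACAL is a post-processing method that requires no training and can be combined with many off-the-shelf models, such as one-stage sigmoid detectors and two-stage instance segmentation models. FRACAL boosts rare-class performance by 8.6% and surpasses all previous methods on the LVIS dataset, while also showing good generalization to other datasets such as COCO, V3Det, and OpenImages. The code will be released.||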
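|**2024-10-15**|[YOLO-ELA: Efficient Local Attention Modeling for High-Performance Real-Time Insulator Defect Detection](http://arxiv.org/abs/2410.11727)|null|Existing detection methods for insulator defects in unmanned aerial vehicle (UAV) imagery struggle with complex backgrounds and small objects, leading to suboptimal accuracy and a high false-positive rate. To address this issue, and building on the concept of local attention modeling, this paper proposes YOLO-ELA, a new attention-based foundation architecture. The architecture adds the Efficient Local Attention (ELA) module to the neck of the one-stage YOLOv8 architecture to shift the model's attention from background features to the features of defective insulators. The SCYLLA Intersection-over-Union (SIoU) criterion function is used to reduce detection loss, accelerate model convergence, and increase the model's sensitivity to small insulator defects, yielding more true positives. Because the dataset is limited, we use data augmentation techniques to increase its diversity. In addition, we use a transfer learning strategy to improve the model's performance. Experimental results on high-resolution UAV images show that our method achieves state-of-the-art performance, with 96.9% mAP0.5 and a real-time detection speed of 74.63 frames per second, outperforming the baseline. This further demonstrates the effectiveness of attention-based convolutional neural networks (CNNs) in object detection tasks.||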
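|**2024-10-15**|[Degradation Oriented and Regularized Network for Real-World Depth Super-Resolution](http://arxiv.org/abs/2410.11666)|**[link](https://github.com/yanzq95/dornet)**|Recent RGB-guided depth super-resolution methods have achieved impressive performance under the assumption of fixed and known degradation (e.g., bicubic downsampling). However, in real-world scenarios, captured depth often suffers from unconventional and unknown degradation due to sensor limitations and the complexity of imaging environments (e.g., low-reflective surfaces, illumination). The performance of these methods drops significantly when the real degradation differs from their assumptions. To address these issues, we propose the Degradation Oriented and Regularized Network (DORNet), which pays more attention to learning degradation representations of low-resolution depth and can thereby provide targeted guidance for depth recovery. Specifically, we first design a self-supervised degradation learning method that models discriminative degradation representations of low-resolution depth using routing-selection-based degradation regularization. We then present a degradation-aware method that recursively conducts multiple degradation-oriented feature transformations, each of which selectively embeds RGB information into the depth according to the learned degradation representations. Extensive experimental results on both real and synthetic datasets show that our method achieves state-of-the-art performance.||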
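|**2024-10-15**|[LoKO: Low-Rank Kalman Optimizer for Online Fine-Tuning of Large Models](http://arxiv.org/abs/2410.11551)|null|Training large models with millions or even billions of parameters from scratch incurs substantial computational costs. Parameter-efficient fine-tuning (PEFT) methods, particularly low-rank adaptation (LoRA), address this challenge by adapting only a small number of parameters to specific tasks with gradient-based optimizers. In this paper, we cast PEFT as an optimal filtering/state estimation problem and present the Low-Rank Kalman Optimizer (LoKO) to estimate the optimal trainable parameters in an online manner. We leverage the low-rank decomposition in LoRA to significantly reduce the matrix sizes in the Kalman iterations, and further use a diagonal approximation of the covariance matrix to effectively reduce the computational complexity from quadratic to linear in the number of trainable parameters. Moreover, we find that the initialization of the covariance matrix in the Kalman algorithm and the accurate estimation of the observation noise covariance are key to this formulation, and we propose robust approaches that work well across a wide range of well-established computer vision and language models. Our results show that LoKO converges in fewer iterations and yields better-performing models than the optimizers commonly used with LoRA in image classification and language tasks. Our study opens up the possibility of using the Kalman filter as an effective optimizer for online fine-tuning of large models.||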
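|**2024-10-15**|[Spatio-Temporal Distortion Aware Omnidirectional Video Super-Resolution](http://arxiv.org/abs/2410.11506)|**[link](https://github.com/nichenxingmeng/STDAN)**|Omnidirectional video (ODV) provides an immersive experience and is widely used in virtual reality and augmented reality. However, limited capture devices and transmission bandwidth lead to low-resolution ODVs. Video super-resolution (VSR) methods have been proposed to increase video resolution, but directly applying such methods does not adequately address the projection distortion of ODVs in applications. To obtain better super-resolution reconstruction quality, we propose a novel Spatio-Temporal Distortion Aware Network (STDAN) oriented to the characteristics of ODVs. Specifically, a spatio-temporal distortion modulation module is introduced to mitigate spatial ODV projection distortion and exploit temporal correlation based on intra-frame and inter-frame alignment. Next, we design a multi-frame reconstruction and fusion mechanism to improve the consistency of reconstructed ODV frames. Furthermore, we incorporate latitude-saliency adaptive maps into the loss function to concentrate on important viewpoint regions with higher texture complexity and human viewing interest. In addition, we collect a new ODV-SR dataset covering diverse scenarios. Extensive experimental results show that the proposed STDAN achieves superior super-resolution performance on ODVs and outperforms state-of-the-art methods.||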
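|**2024-10-15**|[SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection](http://arxiv.org/abs/2410.11358)|null|Multimodal object detection leverages information from multiple modalities to improve the accuracy and robustness of detectors. By learning long-range dependencies, Transformers can effectively fuse multimodal features during the feature extraction stage, greatly improving the performance of multimodal object detection. However, current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at different depth layers of the network, which limits further gains in detection performance. In this paper, we introduce an accurate and efficient object detection method named SeaDATE. First, we propose a novel dual-attention feature fusion (DTF) module that, under Transformer guidance, integrates local and global information through a dual attention mechanism, using spatial and channel tokens to strengthen the fusion of modal features from orthogonal perspectives. Meanwhile, our theoretical analysis and empirical validation show that the Transformer-guided fusion approach, which treats images as sequences of pixels for fusion, performs better on the detail information of shallow features than on deep semantic information. To address this, we design a contrastive learning (CL) module that learns the features of multimodal samples, remedies the shortcomings of Transformer-guided fusion in extracting deep semantic features, and effectively exploits cross-modal information. Extensive experiments and ablation studies on the FLIR, LLVIP, and M3FD datasets demonstrate the effectiveness of our method, which achieves state-of-the-art detection performance.||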
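|**2024-10-15**|[Representation Similarity: A Better Guidance of DNN Layer Sharing for Edge Computing without Training](http://arxiv.org/abs/2410.11233)|null|Edge computing has emerged as an alternative for reducing transmission and processing delay and for preserving the privacy of video streams. However, the ever-increasing complexity of the deep neural networks (DNNs) used in video-based applications (e.g., object detection) puts pressure on memory-constrained edge devices. Model merging has been proposed to reduce the DNNs' memory footprint by keeping only one copy of the merged layers' weights in memory. In existing model merging techniques, (i) only architecturally identical layers can be shared; (ii) computationally expensive retraining in the cloud is required; and (iii) the availability of ground truth for retraining is assumed. However, re-evaluating the performance of merged models requires a validation dataset with ground truth, typically run in the cloud. Common metrics for guiding the selection of shared layers include the size or computational cost of the shared layers, or the representation size. We propose a new model merging scheme that shares representations (i.e., the outputs of layers) at the edge, guided by the representation similarity S. We find that S correlates extremely highly with the merged model's accuracy compared with other metrics, with a Pearson correlation coefficient \|r\|||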
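|**2024-10-15**|[TEOcc: Radar-camera Multi-modal Occupancy Prediction via Temporal Enhancement](http://arxiv.org/abs/2410.11228)|**[link](https://github.com/vdigpku/teocc)**|As a novel 3D scene representation, semantic occupancy has gained much attention in autonomous driving. However, existing occupancy prediction methods mainly focus on designing better occupancy representations, such as tri-perspective view or neural radiance fields, while ignoring the advantages of using long-term temporal information. In this paper, we propose a radar-camera multi-modal temporal enhanced occupancy prediction network, dubbed TEOcc. Our method is inspired by the success of using temporal information in 3D object detection. Specifically, we introduce a temporal enhancement branch to learn temporal occupancy prediction. In this branch, we randomly discard the t-k input frame of the multi-view camera and predict its 3D occupancy with long-term and short-term temporal decoders, respectively, using the information from other adjacent frames and multi-modal inputs. In addition, to reduce computational costs and incorporate multi-modal inputs, we specially design 3D convolutional layers for the long-term and short-term temporal decoders. Furthermore, since the lightweight occupancy prediction head is a dense classification head, we propose using a shared occupancy prediction head for the temporal enhancement branch and the main branch. Notably, the temporal enhancement branch is only performed during training and is discarded during inference. Experimental results show that TEOcc achieves state-of-the-art occupancy prediction on the nuScenes benchmark. Moreover, the proposed temporal enhancement branch is a plug-and-play module that can be easily integrated into existing occupancy prediction methods to improve their performance. The code and models will be released at https://github.com/VDIGPKU/TEOcc.||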
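|**2024-10-15**|[CVCP-Fusion: On Implicit Depth Estimation for 3D Bounding Box Prediction](http://arxiv.org/abs/2410.11211)|**[link](https://github.com/safetylab24/FusionCVCP)**|Combining LiDAR and camera-view data has become a common approach for 3D object detection. However, previous approaches fuse the two input streams at the point level, discarding the semantic information derived from camera features. In this paper, we propose Cross-View Center Point-Fusion, a state-of-the-art model that performs 3D object detection by fusing camera- and LiDAR-derived features in BEV space, preserving the semantic density of the camera stream while incorporating spatial data from the LiDAR. Our architecture draws on aspects of previously established algorithms (Cross-View Transformers and CenterPoint) and runs their backbones in parallel, allowing efficient computation for real-time processing and application. In this paper, we find that while an implicitly calculated depth estimate may be sufficiently accurate in a 2D map-view representation, explicitly calculated geometric and spatial information is needed for precise bounding box prediction in 3D world-view space.||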
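|**2024-10-15**|[Multiview Scene Graph](http://arxiv.org/abs/2410.11187)|**[link](https://github.com/ai4ce/MSG)**|A proper scene representation is central to the pursuit of spatial intelligence, in which agents can robustly reconstruct and efficiently understand 3D scenes. A scene representation is either metric, such as landmark maps in 3D reconstruction, 3D bounding boxes in object detection, or voxel grids in occupancy prediction, or topological, such as pose graphs with loop closures in SLAM or visibility graphs in SfM. In this work, we propose building Multiview Scene Graphs (MSG) from unposed images, representing a scene topologically with interconnected place and object nodes. The task of building an MSG is challenging for existing representation learning methods because it requires jointly addressing visual place recognition, object detection, and object association from images with limited fields of view and potentially large viewpoint changes. To evaluate any method tackling this task, we developed an MSG dataset and annotations based on a public 3D dataset. We also propose an evaluation metric based on the intersection-over-union score of MSG edges. Moreover, we develop a novel baseline method built on mainstream pretrained vision models, combining visual place recognition and object association into one Transformer decoder architecture. Experiments show that our method achieves superior performance compared with existing relevant baselines.||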
|**2024-10-11**|[Efficient Hyperparameter Importance Assessment for CNNs](http://arxiv.org/abs/2410.08920)|null|Hyperparameter selection is an essential aspect of the machine learning pipeline, profoundly impacting models' robustness, stability, and generalization capabilities. Given the complex hyperparameter spaces associated with Neural Networks and the constraints of computational resources and time, optimizing all hyperparameters becomes impractical. In this context, leveraging hyperparameter importance assessment (HIA) can provide valuable guidance by narrowing down the search space. This enables machine learning practitioners to focus their optimization efforts on the hyperparameters with the most significant impact on model performance while conserving time and resources. This paper aims to quantify the importance weights of some hyperparameters in Convolutional Neural Networks (CNNs) with an algorithm called N-RReliefF, laying the groundwork for applying HIA methodologies in the Deep Learning field. We conduct an extensive study by training over ten thousand CNN models across ten popular image classification datasets, thereby acquiring a comprehensive dataset containing hyperparameter configuration instances and their corresponding performance metrics. It is demonstrated that among the investigated hyperparameters, the top five important hyperparameters of the CNN model are the number of convolutional layers, learning rate, dropout rate, optimizer and epoch.||
|**2024-10-11**|[Efficient Multi-Object Tracking on Edge Devices via Reconstruction-Based Channel Pruning](http://arxiv.org/abs/2410.08769)|null|The advancement of multi-object tracking (MOT) technologies presents the dual challenge of maintaining high performance while addressing critical security and privacy concerns. In applications such as pedestrian tracking, where sensitive personal data is involved, the potential for privacy violations and data misuse becomes a significant issue if data is transmitted to external servers. To mitigate these risks, processing data directly on an edge device, such as a smart camera, has emerged as a viable solution. Edge computing ensures that sensitive information remains local, thereby aligning with stringent privacy principles and significantly reducing network latency. However, the implementation of MOT on edge devices is not without its challenges. Edge devices typically possess limited computational resources, necessitating the development of highly optimized algorithms capable of delivering real-time performance under these constraints. The disparity between the computational requirements of state-of-the-art MOT algorithms and the capabilities of edge devices emphasizes a significant obstacle. To address these challenges, we propose a neural network pruning method specifically tailored to compress complex networks, such as those used in modern MOT systems. This approach optimizes MOT performance by ensuring high accuracy and efficiency within the constraints of limited edge devices, such as NVIDIA's Jetson Orin Nano. By applying our pruning method, we achieve model size reductions of up to 70% while maintaining a high level of accuracy and further improving performance on the Jetson Orin Nano, demonstrating the effectiveness of our approach for edge computing applications.||
|**2024-10-11**|[MMLF: Multi-modal Multi-class Late Fusion for Object Detection with Uncertainty Estimation](http://arxiv.org/abs/2410.08739)|null|Autonomous driving necessitates advanced object detection techniques that integrate information from multiple modalities to overcome the limitations associated with single-modal approaches. The challenges of aligning diverse data in early fusion and the complexities, along with overfitting issues introduced by deep fusion, underscore the efficacy of late fusion at the decision level. Late fusion ensures seamless integration without altering the original detector's network structure. This paper introduces a pioneering Multi-modal Multi-class Late Fusion method, designed for late fusion to enable multi-class detection. Fusion experiments conducted on the KITTI validation and official test datasets illustrate substantial performance improvements, presenting our model as a versatile solution for multi-modal object detection in autonomous driving. Moreover, our approach incorporates uncertainty analysis into the classification fusion process, rendering our model more transparent and trustworthy and providing more reliable insights into category predictions.||
|**2024-10-11**|[Boosting Open-Vocabulary Object Detection by Handling Background Samples](http://arxiv.org/abs/2410.08645)|null|Open-vocabulary object detection is the task of accurately detecting objects from a candidate vocabulary list that includes both base and novel categories. Currently, numerous open-vocabulary detectors have achieved success by leveraging the impressive zero-shot capabilities of CLIP. However, we observe that CLIP models struggle to effectively handle background images (i.e. images without corresponding labels) due to their language-image learning methodology. This limitation results in suboptimal performance for open-vocabulary detectors that rely on CLIP when processing background samples. In this paper, we propose Background Information Representation for open-vocabulary Detector (BIRDet), a novel approach to address the limitations of CLIP in handling background samples. Specifically, we design Background Information Modeling (BIM) to replace the single, fixed background embedding in mainstream open-vocabulary detectors with dynamic scene information, and prompt it into image-related background representations. This method effectively enhances the ability to classify oversized regions as background. Besides, we introduce Partial Object Suppression (POS), an algorithm that utilizes the ratio of overlap area to address the issue of misclassifying partial regions as foreground. Experiments on OV-COCO and OV-LVIS benchmarks demonstrate that our proposed model is capable of achieving performance enhancements across various open-vocabulary detectors.||
|**2024-10-11**|[DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention](http://arxiv.org/abs/2410.08582)|**[link](https://github.com/maclong01/DeBiFormer)**|Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity-adaptive attention, such as in DAT, has yielded strong results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation tasks. The query-aware sparsity attention in BiFormer seeks to focus each query on top-k routed regions. However, during attention calculation, the selected key-value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which optimizes the selection of key-value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness. Code is available at https://github.com/maclong01/DeBiFormer.||
|**2024-10-11**|[Quality Prediction of AI Generated Images and Videos: Emerging Trends and Opportunities](http://arxiv.org/abs/2410.08534)|null|The advent of AI has influenced many aspects of human life, from self-driving cars and intelligent chatbots to text-based image and video generation models capable of creating realistic images and videos based on user prompts (text-to-image, image-to-image, and image-to-video). AI-based methods for image and video super resolution, video frame interpolation, denoising, and compression have already gathered significant attention and interest in the industry and some solutions are already being implemented in real-world products and services. However, to achieve widespread integration and acceptance, AI-generated and enhanced content must be visually accurate, adhere to intended use, and maintain high visual quality to avoid degrading the end user's quality of experience (QoE). One way to monitor and control the visual "quality" of AI-generated and -enhanced content is by deploying Image Quality Assessment (IQA) and Video Quality Assessment (VQA) models. However, most existing IQA and VQA models measure visual fidelity in terms of "reconstruction" quality against a pristine reference content and were not designed to assess the quality of "generative" artifacts. To address this, newer metrics and models have recently been proposed, but their performance evaluation and overall efficacy have been limited by datasets that were too small or otherwise lack representative content and/or distortion capacity; and by performance measures that can accurately report the success of an IQA/VQA model for "GenAI". This paper examines the current shortcomings and possibilities presented by AI-generated and enhanced image and video content, with a particular focus on end-user perceived quality. Finally, we discuss open questions and make recommendations for future work on the "GenAI" quality assessment problems, towards further progressing on this interesting and relevant field of research.||
|**2024-10-11**|[Accelerated Distributed Stochastic Non-Convex Optimization over Time-Varying Directed Networks](http://arxiv.org/abs/2410.08508)|null|Distributed stochastic non-convex optimization problems have recently received attention due to the growing interest of signal processing, computer vision, and natural language processing communities in applications deployed over distributed learning systems (e.g., federated learning). We study the setting where the data is distributed across the nodes of a time-varying directed network, a topology suitable for modeling dynamic networks experiencing communication delays and straggler effects. The network nodes, which can access only their local objectives and query a stochastic first-order oracle to obtain gradient estimates, collaborate to minimize a global objective function by exchanging messages with their neighbors. We propose an algorithm, novel to this setting, that leverages stochastic gradient descent with momentum and gradient tracking to solve distributed non-convex optimization problems over time-varying networks. To analyze the algorithm, we tackle the challenges that arise when analyzing dynamic network systems which communicate gradient acceleration components. We prove that the algorithm's oracle complexity is $\mathcal{O}(1/\epsilon^{1.5})$, and that under the Polyak-Łojasiewicz (PL) condition the algorithm converges linearly to a steady error state. The proposed scheme is tested on several learning tasks: a non-convex logistic regression experiment on the MNIST dataset, an image classification task on the CIFAR-10 dataset, and an NLP classification test on the IMDB dataset. We further present numerical simulations with an objective that satisfies the PL condition. The results demonstrate superior performance of the proposed framework compared to the existing related methods.||
|**2024-10-10**|[Bilinear MLPs enable weight-based mechanistic interpretability](http://arxiv.org/abs/2410.08417)|**[link](https://github.com/tdooms/bilinear-decomposition)**|A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the MLP layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity that nevertheless achieves competitive performance. Bilinear MLPs can be fully expressed in terms of linear operations using a third-order tensor, allowing flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights using eigendecomposition reveals interpretable low-rank structure across toy tasks, image classification, and language modeling. We use this understanding to craft adversarial examples, uncover overfitting, and identify small language model circuits directly from the weights alone. Our results demonstrate that bilinear layers serve as an interpretable drop-in replacement for current activation functions and that weight-based interpretability is viable for understanding deep-learning models.||
|**2024-10-10**|[What is Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias](http://arxiv.org/abs/2410.08407)|null|Knowledge Distillation is a commonly used Deep Neural Network compression method, which often maintains overall generalization performance. However, we show that even for balanced image classification datasets, such as CIFAR-100, Tiny ImageNet and ImageNet, as many as 41% of the classes are statistically significantly affected by distillation when comparing class-wise accuracy (i.e. class bias) between a teacher/distilled student or distilled student/non-distilled student model. Changes in class bias are not necessarily an undesirable outcome when considered outside of the context of a model's usage. Using two common fairness metrics, Demographic Parity Difference (DPD) and Equalized Odds Difference (EOD) on models trained with the CelebA, Trifeature, and HateXplain datasets, our results suggest that increasing the distillation temperature improves the distilled student model's fairness -- for DPD, the distilled student even surpasses the fairness of the teacher model at high temperatures. This study highlights the uneven effects of Knowledge Distillation on certain classes and its potentially significant role in fairness, emphasizing that caution is warranted when using distilled models for sensitive application domains.||
|**2024-10-10**|[Are We Ready for Real-Time LiDAR Semantic Segmentation in Autonomous Driving?](http://arxiv.org/abs/2410.08365)|null|Within a perception framework for autonomous mobile and robotic systems, semantic analysis of 3D point clouds typically generated by LiDARs is key to numerous applications, such as object detection and recognition, and scene reconstruction. Scene semantic segmentation can be achieved by directly integrating 3D spatial data with specialized deep neural networks. Although this type of data provides rich geometric information regarding the surrounding environment, it also presents numerous challenges: its unstructured and sparse nature, its unpredictable size, and its demanding computational requirements. These characteristics hinder the real-time semantic analysis, particularly on resource-constrained hardware architectures that constitute the main computational components of numerous robotic applications. Therefore, in this paper, we investigate various 3D semantic segmentation methodologies and analyze their performance and capabilities for resource-constrained inference on embedded NVIDIA Jetson platforms. We evaluate them for a fair comparison through a standardized training protocol and data augmentations, providing benchmark results on the Jetson AGX Orin and AGX Xavier series for two large-scale outdoor datasets: SemanticKITTI and nuScenes.||
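|**2024-10-10**|[Dynamic Object Catching with Quadruped Robot Front Legs](http://arxiv.org/abs/2410.08065)|null|This paper presents a framework for dynamic object catching using a quadruped robot's front legs while it stands on its rear legs. The system integrates computer vision, trajectory prediction, and leg control to enable the quadruped to visually detect, track, and successfully catch a thrown object using an onboard camera. Leveraging a fine-tuned YOLOv8 model for object detection and a regression-based trajectory prediction module, the quadruped iteratively adapts its front leg positions to anticipate and intercept the object. The catching maneuver involves identifying the optimal catching position, controlling the front legs with Cartesian PD control, and closing the legs together at the right moment. We propose and validate three different methods for selecting the optimal catching position: 1) intersecting the predicted trajectory with a vertical plane; 2) selecting the point on the predicted trajectory with the minimal distance to the center of the robot's legs in their nominal position; and 3) selecting the point on the predicted trajectory with the highest likelihood under a Gaussian mixture model (GMM) of the robot's reachable space. Experimental results demonstrate robust catching capabilities across various scenarios, with the GMM method performing best and achieving an 80% catching success rate. A video demonstration of the system in action can be found at https://youtu.be/sm7RdxRfIYg.||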
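|**2024-10-10**|[When the Small-Loss Trick is Not Enough: Multi-Label Image Classification with Noisy Labels Applied to CCTV Sewer Inspections](http://arxiv.org/abs/2410.07689)|null|The maintenance of sewer networks, with their millions of kilometers of pipe, relies heavily on efficient closed-circuit television (CCTV) inspections. Many promising approaches based on multi-label image classification have leveraged databases of historical inspection reports to automate these inspections. However, the significant presence of label noise in these databases, although known, has not been addressed. While extensive research has explored label noise in single-label classification (SLC), little attention has been paid to label noise in multi-label classification (MLC). To address this, we first adapted three sample selection SLC methods (Co-teaching, CoSELFIE, and DISC) that have proven robust to label noise. Our findings show that sample selection based solely on the small-loss trick can handle complex label noise, but it is sub-optimal. Adapting hybrid sample selection methods to noisy MLC appears to be a more promising approach. In light of this, we developed a novel method named MHSS (Multi-label Hybrid Sample Selection) based on CoSELFIE. Through an in-depth comparative study, we demonstrate the superior performance of our approach in dealing with both synthetic complex noise and real noise, thus contributing to the ongoing efforts towards effective automation of CCTV sewer pipe inspections.||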
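|**2024-10-10**|[TDDSR: Single-Step Diffusion with Two Discriminators for Super Resolution](http://arxiv.org/abs/2410.07663)|null|Super-resolution methods are increasingly being specialized for both real-world and face-specific tasks. However, many existing approaches rely on simplistic degradation models, which limits their ability to handle complex and unknown degradation patterns effectively. While diffusion-based super-resolution techniques have recently shown impressive results, they are still constrained by the need for numerous inference steps. To address this, we propose TDDSR, an efficient single-step diffusion-based super-resolution method. Our method is distilled from a pretrained teacher model and, built on a diffusion network, performs super-resolution in a single step. It integrates a learnable downsampler to capture diverse degradation patterns and employs two discriminators, one for high-resolution and one for low-resolution images, to enhance overall performance. Experimental results demonstrate its effectiveness on real-world and face-specific SR tasks, with performance comparable to or better than another single-step method, previous state-of-the-art models, and the teacher model.||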
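|**2024-10-10**|[Explainability of Deep Neural Networks for Brain Tumor Detection](http://arxiv.org/abs/2410.07613)|**[link](https://github.com/sunyoung98/Brain_Tumor_Detection_XAI)**|Medical image classification is crucial for supporting healthcare professionals in decision-making and training. While convolutional neural networks (CNNs) have traditionally dominated this field, Transformer-based models are gaining attention. In this study, we apply explainable AI (XAI) techniques to assess the performance of various models on real-world medical data and identify areas for improvement. We compare CNN models such as VGG-16, ResNet-50, and EfficientNetV2L with a Transformer model, ViT-Base-16. Our results show that data augmentation has little impact, but hyperparameter tuning and advanced modeling improve performance. CNNs, particularly VGG-16 and ResNet-50, outperform ViT-Base-16 and EfficientNetV2L, likely because of underfitting caused by limited data. XAI methods such as LIME and SHAP further show that better-performing models visualize tumors more effectively. These findings suggest that CNNs with shallower architectures are more effective for small datasets and can support medical decision-making.||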
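|**2024-10-10**|[O1O: Grouping of Known Classes to Identify Unknown Objects as Odd-One-Out](http://arxiv.org/abs/2410.07514)|null|Object detection methods trained on a fixed set of known classes struggle to detect objects of unknown classes in open-world settings. Current fixes add approximate supervision with pseudo-labels corresponding to candidate object locations, typically obtained in a class-agnostic manner. While previous approaches rely mainly on the visual appearance of objects, we find that geometric cues improve the recall of unknown objects. Although the additional supervision from pseudo-labels helps to detect unknown objects, it also introduces confusion for known classes. We observe a notable decline in the model's performance on known objects in the presence of noisy pseudo-labels. Drawing inspiration from studies on human cognition, we propose to group known classes into superclasses. By identifying similarities between classes within a superclass, we can identify unknown classes through an odd-one-out scoring mechanism. Our experiments on open-world detection benchmarks demonstrate significant improvements in unknown recall across all tasks. Crucially, we achieve this without compromising known-object performance, thanks to a better partitioning of the feature space with superclasses.||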
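|**2024-10-09**|[Progressive Multi-Modal Fusion for Robust 3D Object Detection](http://arxiv.org/abs/2410.07475)|null|Multi-sensor fusion is crucial for accurate 3D object detection in autonomous driving, with cameras and LiDAR being the most commonly used sensors. However, existing methods perform sensor fusion in a single view by projecting features from both modalities into either Bird's Eye View (BEV) or Perspective View (PV), thus sacrificing complementary information such as height or geometric proportions. To address this limitation, we propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at the intermediate and object-query levels. Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection. Additionally, we introduce a self-supervised mask modeling pre-training strategy that improves multi-modal representation learning and data efficiency through three novel objectives. Extensive experiments on the nuScenes and Argoverse2 datasets conclusively demonstrate the efficacy of ProFusion3D. Moreover, ProFusion3D is robust to sensor failure and performs strongly when only one modality is available.||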
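|**2024-10-09**|[Self-Supervised Learning for Real-World Object Detection: a Survey](http://arxiv.org/abs/2410.07442)|null|Self-Supervised Learning (SSL) has emerged as a promising approach in computer vision, enabling networks to learn meaningful representations from large unlabeled datasets. SSL methods fall into two main categories: instance discrimination and Masked Image Modeling (MIM). While instance discrimination is fundamental to SSL, it was originally designed for classification and may be less effective for object detection, particularly for small objects. In this survey, we focus on SSL methods specifically tailored for real-world object detection, with an emphasis on detecting small objects in complex environments. Unlike previous surveys, we offer a detailed comparison of SSL strategies, including object-level instance discrimination and MIM methods, and assess their effectiveness for small object detection using both CNN- and ViT-based architectures. Specifically, our benchmark is performed on the widely used COCO dataset, as well as on a specialized real-world dataset focused on vehicle detection in infrared remote sensing imagery. We also assess the impact of pre-training on custom domain-specific datasets, highlighting how certain SSL strategies are better suited to uncurated data. Our findings show that instance discrimination methods perform well with CNN-based encoders, while MIM methods are better suited to ViT-based architectures and custom dataset pre-training. This survey provides a practical guide for selecting the best SSL strategy, taking into account factors such as backbone architecture, object size, and custom pre-training requirements. Finally, we show that choosing an appropriate SSL pre-training strategy together with a suitable encoder significantly improves real-world object detection, particularly small object detection in resource-limited environments.||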
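|**2024-10-09**|[Robust infrared small target detection using self-supervised and a contrario paradigms](http://arxiv.org/abs/2410.07437)|null|In defense applications, detecting small targets in infrared images poses significant challenges because of complex backgrounds and the small size of the targets. Traditional object detection methods often struggle to balance high detection rates with low false alarm rates, especially for small targets. In this paper, we introduce a novel approach that combines an a contrario paradigm with Self-Supervised Learning (SSL) to improve Infrared Small Target Detection (IRSTD). On the one hand, integrating an a contrario criterion into a YOLO detection head enhances feature map responses for small and unexpected targets while effectively controlling false alarms. On the other hand, we explore SSL techniques to overcome the limited annotated data that is common in IRSTD tasks. Specifically, we benchmark several representative SSL strategies for their effectiveness in improving small target detection performance. Our findings show that instance discrimination methods outperform masked image modeling strategies when applied to YOLO-based small target detection. Moreover, combining the a contrario and SSL paradigms leads to significant performance improvements, narrowing the gap with state-of-the-art segmentation methods and even outperforming them in resource-limited settings. This two-pronged approach offers a robust solution for improving IRSTD performance, particularly under challenging conditions.||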
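|**2024-10-09**|[One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation](http://arxiv.org/abs/2410.07170)|**[link](https://github.com/ml-jku/EVA)**|Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned on a downstream task for a specific application. The most successful and most commonly used fine-tuning method is to update the pre-trained weights via low-rank adaptation (LoRA). LoRA introduces new weight matrices that are usually initialized at random with a uniform rank distribution across model weights. Recent works focus on weight-driven initialization or on learning adaptive ranks during training. Both approaches have only been investigated in isolation, resulting in slow convergence or a uniform rank distribution, which in turn leads to sub-optimal performance. We propose to enhance LoRA by initializing the new weights in a data-driven manner, computing a singular value decomposition on minibatches of activation vectors. We then initialize the LoRA matrices with the obtained right-singular vectors and redistribute ranks among all weight matrices to explain the maximal amount of variance, before continuing with the standard LoRA fine-tuning procedure. This results in our new method, Explained Variance Adaptation (EVA). We apply EVA to a variety of fine-tuning tasks ranging from language generation and understanding to image classification and reinforcement learning. EVA converges faster than competitors and attains the highest average score across a multitude of tasks per domain.||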
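|**2024-10-09**|[JPEG Inspired Deep Learning](http://arxiv.org/abs/2410.07081)|**[link](https://github.com/jpeginspireddl/jpeg-inspired-dl)**|Although it is traditionally believed that lossy image compression, such as JPEG compression, has a negative impact on the performance of deep neural networks (DNNs), recent work has shown that well-crafted JPEG compression can actually improve the performance of deep learning (DL). Inspired by this, we propose JPEG-DL, a novel DL framework that prepends a trainable JPEG compression layer to any underlying DNN architecture. To make the quantization operation in JPEG compression trainable, we employ a new differentiable soft quantizer at the JPEG layer and then train the quantization operation and the underlying DNN jointly. Extensive experiments show that, compared with standard DL, JPEG-DL delivers significant accuracy improvements across various datasets and model architectures while enhancing robustness against adversarial attacks. In particular, on some fine-grained image classification datasets, JPEG-DL can increase prediction accuracy by as much as 20.9%. Our code is available at https://github.com/JpegInspiredDl/JPEG-Inspired-DL.git.||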
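|**2024-10-07**|[LoTLIP: Improving Language-Image Pre-training for Long Text Understanding](http://arxiv.org/abs/2410.05249)|null|Understanding long text is in great demand in practice but is beyond the reach of most language-image pre-training (LIP) models. In this work, we empirically confirm that the key reason for this issue is that training images are usually paired with short captions, leaving certain tokens easily overshadowed by salient tokens. To address this, our initial attempt is to relabel the data with long captions; however, learning directly from long captions may degrade performance in understanding short text (e.g., in the image classification task). Then, by incorporating corner tokens to aggregate diverse textual information, we help the model catch up to its original level of short-text understanding while greatly enhancing its capability for long-text understanding. We further examine whether the model can continue to benefit from longer captions and observe a clear trade-off between performance and efficiency. Finally, we validate the effectiveness of our approach using a self-constructed large-scale dataset of 100M long-caption-oriented text-image pairs. Notably, on the task of long-text image retrieval, we beat the competitor using long captions by 11.1% (i.e., from 72.62% to 83.72%). We will release the code, the model, and the new dataset to facilitate reproducibility and further research. The project page is available at https://wuw2019.github.io/lotlip.||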
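|**2024-10-07**|[Control-oriented Clustering of Visual Latent Representation](http://arxiv.org/abs/2410.05063)|null|We study the geometry of the visual representation space, the information channel from the vision encoder to the action decoder, in an image-based control pipeline learned from behavior cloning. Inspired by the phenomenon of neural collapse (NC) in image classification, we investigate whether a similar law of clustering emerges in the visual representation space. Since image-based control is a regression task without explicitly defined classes, the central question is according to which implicit classes the visual features cluster, if such a law exists. Focusing on image-based planar pushing, we posit that the most important role of the visual representation in a control task is to convey a goal to the action decoder. We then classify training samples of expert demonstrations into eight control-oriented classes based on (a) the relative pose between the object and the target in the input or (b) the relative pose of the object induced by expert actions in the output, where one class corresponds to one relative pose orthant (REPO). Across four different instantiations of the architecture, we report the prevalent emergence of control-oriented clustering in the visual representation space according to the eight REPOs. Beyond empirical observation, we show that this law of clustering can be used as an algorithmic tool to improve test-time performance when training a policy with limited expert demonstrations. In particular, we pretrain the vision encoder using NC as a regularizer to encourage control-oriented clustering of the visual features. Surprisingly, such an NC-pretrained vision encoder, when fine-tuned end-to-end with the action decoder, boosts test-time performance by 10% to 35% in the low-data regime. Real-world vision-based planar pushing experiments confirm the surprising advantage of control-oriented visual representation pretraining.||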
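|**2024-10-07**|[Improving Object Detection via Local-global Contrastive Learning](http://arxiv.org/abs/2410.05058)|null|Visual domain gaps often impact object detection performance. Image-to-image translation can mitigate this effect, and contrastive approaches make it possible to learn the image-to-image mapping in unsupervised regimes. However, existing methods often fail to handle content-rich scenes with multiple object instances, which manifests as unsatisfactory detection performance. Sensitivity to such instance-level content is typically gained only through object annotations, which can be expensive to obtain. To address this issue, we present a novel image-to-image translation method that specifically targets cross-domain object detection. We formulate our approach as a contrastive learning framework with an inductive prior that optimizes the appearance of object instances through spatial attention masks, implicitly dividing the scene into foreground regions associated with the target object instances and background non-object regions. Instead of relying on object annotations to explicitly account for object instances during translation, our approach learns to represent objects by contrasting local-global information. This makes it possible to investigate an under-explored challenge: obtaining performant detection under domain shifts without relying on object annotations or detector model fine-tuning. We experiment with multiple cross-domain object detection settings across three challenging benchmarks and report state-of-the-art performance. Project page: https://local-global-detection.github.io||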
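|**2024-10-07**|[Near-Field ISAC in 6G: Addressing Phase Nonlinearity via Lifted Super-Resolution](http://arxiv.org/abs/2410.04930)|null|Integrated sensing and communications (ISAC) is a promising component of 6G networks, fusing communication and radar technologies to facilitate new services. Additionally, the use of extremely large-scale antenna arrays (ELLA) at the ISAC common receiver not only facilitates terahertz-rate communication links but also significantly improves the accuracy of target detection in radar applications. In practical scenarios, communication scatterers and radar targets are often located close to the ISAC receiver. This, combined with the use of ELLA, fundamentally alters the electromagnetic characteristics of wireless and radar channels, shifting from far-field planar-wave propagation to near-field spherical-wave propagation. Under the far-field planar-wave model, the phase of the array response vector varies linearly with the antenna index. In contrast, in the near-field spherical-wave model, this phase relationship becomes nonlinear. This shift presents a fundamental challenge: the widely used Fourier analysis can no longer be applied directly to target detection and communication channel estimation at the ISAC common receiver. In this work, we propose a feasible solution to this fundamental issue. Specifically, we show that there exists a high-dimensional space in which the phase nonlinearity can be expressed as linear. Leveraging this insight, we develop a lifted super-resolution framework that simultaneously performs communication channel estimation and extracts target parameters with high precision.||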
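|**2024-10-07**|[Improved detection of discarded fish species through BoxAL active learning](http://arxiv.org/abs/2410.04880)|**[link](https://github.com/pieterblok/boxal)**|In recent years, powerful data-driven deep-learning techniques have been developed and applied to automated catch registration. However, these methods depend on labelled data, which is time-consuming, labour-intensive, and expensive to collect, and requires expert knowledge. In this study, we present an active learning technique named BoxAL, which includes estimation of the epistemic uncertainty of the Faster R-CNN object detection model. The method selects the most uncertain training images from a pool of unlabeled images, and these images are then used to train the object detection model. To evaluate the method, we used an open-source image dataset obtained with a dedicated image acquisition system developed for commercial trawlers targeting demersal species. We demonstrate that our approach reaches the same object detection performance as random sampling while using 400 fewer labelled images. In addition, at the last training iteration with 1,100 training images, the mean AP score was significantly higher for certainty-based sampling than for random sampling (39.0±1.6 versus 34.8±1.8). We also show that epistemic uncertainty is a suitable criterion for sampling images that the current iteration of the model cannot handle. Our study further shows that the newly sampled data is more valuable for training than the remaining unlabeled data. Our software is available at https://github.com/pieterblok/boxal.||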
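|**2024-10-06**|[Learning De-Biased Representations for Remote-Sensing Imagery](http://arxiv.org/abs/2410.04546)|**[link](https://github.com/doem97/deblora)**|Remote sensing (RS) imagery requires specialized satellites to collect and is difficult to annotate, so it suffers from data scarcity and class imbalance in certain spectrums. Because of data scarcity, training any large-scale RS model from scratch is unrealistic, and the alternative is to transfer pre-trained models by fine-tuning or by the more data-efficient LoRA method. Because of class imbalance, transferred models exhibit strong bias, with features of the major classes dominating those of the minor classes. In this paper, we propose debLoRA, a generic training approach that works with any LoRA variant to yield debiased features. It is an unsupervised learning approach that diversifies minor-class features based on attributes shared with major classes, where the attributes are obtained by a simple clustering step. To evaluate it, we conduct extensive experiments in two transfer learning scenarios in the RS domain: from natural images to optical RS images, and from optical RS images to multi-spectrum RS images. We perform object classification and oriented object detection tasks on the optical RS dataset DOTA and the SAR dataset FUSRS. Results show that debLoRA consistently surpasses prior art across these RS adaptation settings, improving tail-class performance by 3.3 and 4.7 percentage points for natural-to-optical RS and optical-RS-to-multi-spectrum RS adaptation, respectively, while preserving head-class performance, which demonstrates its efficacy and adaptability.||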
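|**2024-10-05**|[Distillation-Free One-Step Diffusion for Real-World Image Super-Resolution](http://arxiv.org/abs/2410.04224)|**[link](https://github.com/jianzeli-114/dfosd)**|Diffusion models have achieved excellent performance in real-world image super-resolution (Real-ISR), but at considerable computational cost. Current approaches try to derive one-step diffusion models from multi-step counterparts through knowledge distillation. However, these methods incur substantial training costs and may be constrained by the teacher model, which limits the student model's performance. To tackle these issues, we propose DFOSD, a distillation-free one-step diffusion model. Specifically, we propose a noise-aware discriminator (NAD) that participates in adversarial training, further enhancing the realism of the generated content. Additionally, we improve the perceptual loss with edge-aware DISTS (EA-DISTS) to enhance the model's ability to generate fine details. Our experiments demonstrate that, compared with previous diffusion-based methods requiring dozens or even hundreds of steps, DFOSD attains comparable or even better results in both quantitative metrics and qualitative evaluations. DFOSD also achieves higher performance and efficiency than other one-step diffusion methods. We will release code and models at https://github.com/JianzeLi-114/DFOSD.||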
|**2024-10-05**|[Exploring Strengths and Weaknesses of Super-Resolution Attack in Deepfake Detection](http://arxiv.org/abs/2410.04205)|null|Image manipulation is rapidly evolving, allowing the creation of credible content that can be used to bend reality. Although the results of deepfake detectors are promising, deepfakes can be made even more complicated to detect through adversarial attacks. They aim to further manipulate the image to camouflage deepfakes' artifacts or to insert signals making the image appear pristine. In this paper, we further explore the potential of super-resolution attacks based on different super-resolution techniques and with different scales that can impact the performance of deepfake detectors with more or less intensity. We also evaluated the impact of the attack on more diverse datasets discovering that the super-resolution process is effective in hiding the artifacts introduced by deepfake generation models but fails in hiding the traces contained in fully synthetic images. Finally, we propose some changes to the detectors' training process to improve their robustness to this kind of attack.||
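|**2024-10-05**|[Fast Object Detection with a Machine Learning Edge Device](http://arxiv.org/abs/2410.04173)|null|This machine learning study investigates a low-cost edge device integrated with an embedded system that has computer vision capabilities, resulting in improved inference time and precision for object detection and classification. The primary aims of this study are to reduce inference time and power consumption, and to enable the embedded device of a competition-ready autonomous humanoid robot to support real-time object recognition, scene understanding, visual navigation, motion planning, and autonomous navigation of the robot. This study compares inference time performance between a central processing unit (CPU), a graphics processing unit (GPU), and a tensor processing unit (TPU). CPUs, GPUs, and TPUs are all processors that can be used for machine learning tasks. In support of the autonomous humanoid robot, we also sought to observe whether there is a significant difference between using a camera with monocular vision and one with stereo vision. The TPU inference time results in this study reflect a 25% reduction in time compared with the GPU and a striking 87.5% reduction in inference time compared with the CPU. Much of the information in this paper contributes to the final selection of Google's Coral brand Edge TPU device. The Arduino Nano 33 BLE Sense Tiny ML Kit was also considered for comparison, but because of initial incompatibilities and in the interest of completing this study in a timely manner, we decided to review the kit in a future experiment.||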
|**2024-10-05**|[Robust Task-Oriented Communication Framework for Real-Time Collaborative Vision Perception](http://arxiv.org/abs/2410.04168)|**[link](https://github.com/fangzr/R-ACP)**|Cooperative perception enhances sensing in multi-robot and vehicular networks by aggregating information from multiple agents, improving perception accuracy and range. However, mobility and non-rigid sensor mounts introduce extrinsic calibration errors, necessitating online calibration, which is complicated by limited overlap in sensing regions. Maintaining fresh information is crucial for timely and accurate sensing. To address calibration errors and ensure both perception accuracy and transmission timeliness, we propose a Robust Task-Oriented Communication framework (R-TOCOM) that optimizes calibration and feature transmission in both deployment and streaming phases. First, we formulate an Age of Perceived Targets (AoPT) minimization problem to capture information freshness. Then, in the deployment phase, we introduce a channel-aware self-calibration technique based on re-identification (Re-ID). This technique adaptively compresses key-point features according to channel capacities, effectively addressing calibration issues via spatial and temporal cross-camera correlations. In the streaming phase, we tackle the trade-off between bandwidth and inference accuracy by integrating an Information Bottleneck (IB)-based encoding method that adjusts video compression rates based on task relevance, thereby reducing communication overhead and latency. To mitigate performance degradation from packet loss, we introduce a priority network that filters corrupted features. Extensive studies demonstrate our framework outperforms five baselines, improving multiple object detection accuracy (MODA) by 25.49% and reducing communication costs by 51.36% under severe channel condition.||
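|**2024-10-04**|[Classification-Denoising Networks](http://arxiv.org/abs/2410.03505)|null|Image classification and denoising suffer from complementary issues: a lack of robustness, or partially ignoring conditioning information. We argue that these issues can be alleviated by unifying both tasks through a model of the joint probability of (noisy) images and class labels. Classification is performed with a forward pass followed by conditioning. Using the Tweedie-Miyasawa formula, we evaluate the denoising function with the score, which can be computed by marginalization and back-propagation. The training objective is then a combination of cross-entropy loss and denoising score matching loss integrated over noise levels. Numerical experiments on CIFAR-10 and ImageNet show competitive classification and denoising performance compared with reference deep convolutional classifiers/denoisers, and significantly improved efficiency compared with previous joint approaches. Our model is more robust to adversarial perturbations than a standard discriminative classifier, and allows for a novel interpretation of adversarial gradients as a difference of denoisers.||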
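|**2024-10-04**|[Sm: enhanced localization in Multiple Instance Learning for medical imaging classification](http://arxiv.org/abs/2410.03276)|**[link](https://github.com/franblueee/smmil)**|Multiple Instance Learning (MIL) is widely used in medical imaging classification to reduce the labeling effort. While only bag labels are available for training, one typically seeks predictions at both the bag and instance levels (classification and localization tasks, respectively). Early MIL methods treated the instances in a bag independently. Recent methods account for global and local dependencies among instances. Although they have yielded excellent results in classification, their performance in localization is comparatively limited. We argue that these models have been designed to target the classification task, while implications at the instance level have not been deeply investigated. Motivated by a simple observation, that neighboring instances are likely to have the same label, we propose a novel, principled, and flexible mechanism to model local dependencies. It can be used alone or combined with any mechanism that models global dependencies (e.g., Transformers). A thorough empirical validation shows that our module achieves state-of-the-art performance in localization while also being competitive or superior in classification. Our code is at https://github.com/Franblueee/SmMIL.||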
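|**2024-10-04**|[DRAFTS: A Deep Learning-Based Radio Fast Transient Search Pipeline](http://arxiv.org/abs/2410.03200)|**[link](https://github.com/SukiYume/DRAFTS)**|The detection of fast radio bursts (FRBs) in radio astronomy is a complex task because of the challenges posed by radio frequency interference (RFI) and signal dispersion in the interstellar medium. Traditional search algorithms are often inefficient and time-consuming, and they generate a high number of false positives. In this paper, we present DRAFTS, a deep learning-based radio fast transient search pipeline. DRAFTS integrates object detection and binary classification techniques to accurately identify FRBs in radio data. We developed a large, real-world FRB dataset for training the deep learning models. Search tests on real FAST observation data show that DRAFTS performs exceptionally well in terms of accuracy, completeness, and search speed. In a search of FRB 20190520B observation data, DRAFTS detected more than three times as many bursts as Heimdall, highlighting its potential for future FRB detection and analysis.||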
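|**2024-10-03**|[PixelShuffler: A Simple Image Translation Through Pixel Rearrangement](http://arxiv.org/abs/2410.03021)|**[link](https://github.com/OmarSZamzam/PixelShuffler)**|Image-to-image translation is a topic in computer vision with a vast range of applications, from medical image translation (such as converting MRI scans to CT scans or to other MRI contrasts) to image colorization, super-resolution, domain adaptation, and generating photorealistic images from sketches or semantic maps. Image style transfer is also a widely researched application of image-to-image translation, where the goal is to synthesize an image that combines the content of one image with the style of another. Existing state-of-the-art methods often rely on complex neural networks, including diffusion models and language models, to achieve high-quality style transfer, but these methods can be computationally expensive and intricate to implement. In this paper, we propose a novel pixel shuffle method that addresses the general image-to-image translation problem, with a specific demonstrative application in style transfer. The method performs style transfer by shuffling the pixels of the style image so that the mutual information between the shuffled image and the content image is maximized. This approach inherently preserves the colors of the style image while ensuring that the structural details of the content image are retained in the stylized output. We demonstrate that this simple and straightforward method produces results comparable to state-of-the-art techniques, as measured by the Learned Perceptual Image Patch Similarity (LPIPS) loss for content preservation and the Fréchet Inception Distance (FID) score for style similarity. Our experiments validate that the proposed pixel shuffle method achieves competitive performance with significantly reduced complexity, offering a promising alternative for efficient image style transfer and showing promise for general image-to-image translation tasks.||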
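|**2024-10-03**|[On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions](http://arxiv.org/abs/2410.02935)|null|With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels at handling complex inputs and improving performance on targeted tasks. Our investigation highlights the advantages of using varied gating functions, moving beyond softmax gating within the HMoE framework. We theoretically demonstrate that applying tailored gating functions to each expert group allows HMoE to achieve robust results, even when optimal gating functions are applied only at select hierarchical levels. Empirical validation across diverse scenarios supports these theoretical claims. This includes large-scale multimodal tasks, image classification, and latent domain discovery and prediction tasks, where our modified HMoE models show substantial performance improvements.||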
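|**2024-10-04**|[Learning 3D Perception from Others' Predictions](http://arxiv.org/abs/2410.02646)|null|Accurate 3D object detection in real-world environments requires a huge amount of high-quality annotated data. Acquiring such data is tedious and expensive, and often needs repeated effort when a new sensor is adopted or when the detector is deployed in a new environment. We investigate a new scenario for constructing 3D object detectors: learning from the predictions of a nearby unit that is equipped with an accurate detector. For example, when a self-driving car enters a new area, it may learn from other traffic participants whose detectors have been optimized for that area. This setting is label-efficient, sensor-agnostic, and communication-efficient: nearby units only need to share their predictions with the ego agent (e.g., a car). However, naively using the received predictions as ground truth to train the ego car's detector leads to inferior performance. We systematically study the problem and identify viewpoint mismatches and mislocalization (due to synchronization and GPS errors) as the main causes, which unavoidably result in false positives, false negatives, and inaccurate pseudo-labels. We propose a distance-based curriculum that first learns from closer units with similar viewpoints and then improves the quality of other units' predictions via self-training. We further demonstrate that an effective pseudo-label refinement module can be trained with a small amount of annotated data, greatly reducing the quantity of data needed to train an object detector. We validate our approach on a recently released real-world collaborative driving dataset, using reference cars' predictions as pseudo-labels for the ego car. Extensive experiments covering several scenarios (e.g., different sensors, detectors, and domains) demonstrate the effectiveness of our approach for label-efficient learning of 3D perception from other units' predictions.||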
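|**2024-10-03**|[LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model](http://arxiv.org/abs/2410.02615)|null|State-of-the-art medical multi-modal large language models (med-MLLMs), such as LLaVA-Med or BioMedGPT, leverage instruction-following data in pre-training. However, these models primarily focus on scaling model size and data volume to boost performance while relying mainly on autoregressive learning objectives. Surprisingly, we find that such learning schemes can result in weak alignment between the vision and language modalities, making these models highly reliant on extensive pre-training datasets. This is a significant challenge in medical domains, where curating high-quality instruction-following instances is expensive and time-consuming. We address this with LoGra-Med, a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions. This helps the model capture contextual meaning, handle linguistic variability, and build cross-modal associations between visuals and text. To scale our approach, we designed an efficient end-to-end learning scheme using black-box gradient estimation, enabling faster LLaMa 7B training. Our results show that LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for medical VQA and significantly outperforms it when trained on 10% of the data. For example, on VQA-RAD, we exceed LLAVA-Med by 20.13% and nearly match the 100% pre-training score (72.52% versus 72.64%). We also surpass SOTA methods such as BiomedGPT on visual chatbots and RadFM on zero-shot image classification with VQA, highlighting the effectiveness of multi-graph alignment.||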
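|**2024-10-03**|[Personalized Quantum Federated Learning for Privacy Image Classification](http://arxiv.org/abs/2410.02547)|null|Quantum federated learning improves privacy-preserving image classification, but the lack of personalization in client models can make quantum federated learning sub-optimal. To enhance the personalization of client models under imbalanced image distributions, a personalized quantum federated learning algorithm for privacy image classification is proposed. First, a personalized quantum federated learning model is constructed, in which a personalized layer is added to the client model to maintain personalized parameters. Second, a personalized quantum federated learning algorithm is introduced to secure the information exchanged between the client and the server. Third, personalized federated learning is applied to image classification on the FashionMNIST dataset, and the experimental results show that the personalized quantum federated learning algorithm obtains global and local models with excellent performance, even when local training samples are imbalanced. With 8 clients and a distribution parameter of 100, the server's accuracy reaches 100%, which is 7% higher than that of the non-personalized model. With 2 clients and a distribution parameter of 1, the average client accuracy is 2.9% higher than that of the non-personalized model. Compared with previous quantum federated learning algorithms, the proposed algorithm requires no additional local training while protecting the privacy of both the model and the data. This may facilitate broader adoption and application of quantum technologies and pave the way for more secure, scalable, and efficient quantum distributed machine learning solutions.||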
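|**2024-10-03**|[DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM](http://arxiv.org/abs/2410.02492)|null|Visual language tracking (VLT) has emerged as a cutting-edge research area that harnesses linguistic data to enhance algorithms with multi-modal inputs and broadens the scope of traditional single object tracking (SOT) to video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fail to capture the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has made it possible to generate diverse text. This work uses LLMs to generate varied semantic annotations (in terms of text length and granularity) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, (1) we propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks and comprising three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) Our benchmark offers texts at four granularities, considering the extent and density of semantic information. We expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance, and hope that the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results, and toolkit will be released gradually at http://videocube.aitestunion.com/.||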
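|**2024-10-03**|[PnP-Flow: Plug-and-Play Image Restoration with Flow Matching](http://arxiv.org/abs/2410.02423)|**[link](https://github.com/annegnx/PnP-Flow)**|In this paper, we introduce Plug-and-Play (PnP) Flow Matching, an algorithm for solving imaging inverse problems. PnP methods leverage the strength of pre-trained denoisers, often deep neural networks, by integrating them into optimization schemes. While they achieve state-of-the-art performance on various imaging inverse problems, PnP approaches face inherent limitations on more generative tasks such as inpainting. On the other hand, generative models such as Flow Matching have pushed the boundary of image sampling, yet lack a clear method for efficient use in image restoration. We propose to combine the PnP framework with Flow Matching (FM) by defining a time-dependent denoiser using a pre-trained FM model. Our algorithm alternates between gradient descent steps on the data-fidelity term, reprojections onto the learned FM path, and denoising. Notably, our method is computationally efficient and memory-friendly, as it avoids backpropagation through ODEs and trace computations. We evaluate its performance on denoising, super-resolution, deblurring, and inpainting tasks, demonstrating superior results compared with existing PnP algorithms and state-of-the-art Flow Matching based methods.||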
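|**2024-10-03**|[Spiking Neural Network as Adaptive Event Stream Slicer](http://arxiv.org/abs/2410.02249)|**[link](https://github.com/andycao1125/spikeslicer)**|Event-based cameras are attracting significant interest because they provide rich edge information, high dynamic range, and high temporal resolution. Many state-of-the-art event-based algorithms rely on splitting the events into fixed groups, which loses crucial temporal information, particularly when dealing with diverse motion scenarios (e.g., high/low speed). In this work, we propose SpikeSlicer, a novel plug-and-play event processing method capable of splitting event streams adaptively. SpikeSlicer uses a lightweight (0.41M) and low-energy spiking neural network (SNN) to trigger event slicing. To guide the SNN to fire spikes at optimal time steps, we propose the Spiking Position-aware Loss (SPA-Loss) to modulate the neuron's state. Additionally, we develop a Feedback-Update training strategy that refines the slicing decisions using feedback from the downstream artificial neural network (ANN). Extensive experiments demonstrate that our method yields significant performance improvements in event-based object tracking and recognition. Notably, SpikeSlicer provides a brand-new SNN-ANN cooperation paradigm, where the SNN acts as an efficient, low-energy data processor that assists the ANN in improving downstream performance, opening new perspectives and potential avenues of exploration.||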
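|**2024-10-02**|[Kolmogorov-Arnold Network Autoencoders](http://arxiv.org/abs/2410.02077)|**[link](https://github.com/aminmoradixl/kan_ae)**|Deep learning models have revolutionized various domains, with Multi-Layer Perceptrons (MLPs) being a cornerstone for tasks such as data regression and image classification. However, a recent study introduced Kolmogorov-Arnold Networks (KANs) as a promising alternative to MLPs, placing activation functions on edges rather than nodes. This structural shift aligns KANs closely with the Kolmogorov-Arnold representation theorem, potentially enhancing both model accuracy and interpretability. In this study, we explore the efficacy of KANs for data representation via autoencoders, comparing their performance with traditional Convolutional Neural Networks (CNNs) on the MNIST, SVHN, and CIFAR-10 datasets. Our results show that KAN-based autoencoders achieve competitive reconstruction accuracy, suggesting that they can serve as effective tools in data analysis tasks.||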
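|**2024-10-02**|[Stochastic Deep Restoration Priors for Imaging Inverse Problems](http://arxiv.org/abs/2410.02057)|null|Deep neural networks trained as image denoisers are widely used as priors for solving imaging inverse problems. While Gaussian denoising is thought to be sufficient for learning image priors, we show that priors from deep models pre-trained as more general restoration operators can perform better. We introduce Stochastic deep Restoration Priors (ShaRP), a novel method that leverages an ensemble of such restoration models to regularize inverse problems. ShaRP improves upon methods using Gaussian denoiser priors by better handling structured artifacts and by enabling self-supervised training even without fully sampled data. We prove that ShaRP minimizes an objective function involving a regularizer derived from the score functions of minimum mean square error (MMSE) restoration operators, and we theoretically analyze its convergence. Empirically, ShaRP achieves state-of-the-art performance on tasks such as magnetic resonance imaging reconstruction and single-image super-resolution, surpassing both denoiser-based and diffusion-model-based methods without requiring retraining.||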
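|**2024-10-02**|[Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking](http://arxiv.org/abs/2410.01806)|null|Multiple object tracking in complex scenarios, such as coordinated dance performances, team sports, or dynamic animal groups, presents unique challenges. In these settings, objects frequently move in coordinated patterns, occlude each other, and exhibit long-term dependencies in their trajectories. However, how to model long-range dependencies within tracklets, interdependencies among tracklets, and the associated temporal occlusions remains a key open research question. To this end, we introduce Samba, a novel linear-time set-of-sequences model designed to jointly process multiple tracklets by synchronizing the multiple selective state-spaces used to model each tracklet. Samba autoregressively predicts the future track query for each sequence while maintaining synchronized long-term memory representations across tracklets. By integrating Samba into a tracking-by-propagation framework, we propose SambaMOTR, the first tracker that effectively addresses the aforementioned issues, including long-range dependencies, tracklet interdependencies, and temporal occlusions. Additionally, we introduce an effective technique for dealing with uncertain observations (MaskObs) and an efficient training recipe to scale SambaMOTR to longer sequences. By modeling long-range dependencies and interactions among tracked objects, SambaMOTR implicitly learns to track objects accurately through occlusions without any hand-crafted heuristics. Our approach significantly surpasses the prior state of the art on the DanceTrack, BFT, and SportsMOT datasets.||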
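|**2024-10-02**|[Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking](http://arxiv.org/abs/2410.01678)|**[link](https://github.com/ayesha-ishaq/open3dtrack)**|3D multi-object tracking plays a critical role in autonomous driving by enabling real-time monitoring and prediction of the movements of multiple objects. Traditional 3D tracking systems are typically constrained by predefined object categories, which limits their adaptability to novel, unseen objects in dynamic environments. To address this limitation, we introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to objects beyond predefined categories. We formulate the problem of open-vocabulary 3D tracking and introduce dataset splits designed to represent various open-vocabulary scenarios. We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing generalization to unseen object classes. Our method effectively reduces the performance gap between tracking known and novel objects through strategic adaptation. Experimental results demonstrate the robustness and adaptability of our method in diverse outdoor driving scenarios. To the best of our knowledge, this work is the first to address open-vocabulary 3D tracking, a significant advance for autonomous systems in real-world settings. Code, trained models, and dataset splits are publicly available.||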
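|**2024-09-30**|[NUTRIVISION: A System for Automatic Diet Management in Smart Healthcare](http://arxiv.org/abs/2409.20508)|null|Maintaining health and fitness through a balanced diet is essential for preventing non-communicable diseases such as heart disease, diabetes, and cancer. NutriVision combines smart healthcare with computer vision and machine learning to address the challenges of nutrition and dietary management. This paper introduces a novel system that can identify food items, estimate quantities, and provide comprehensive nutritional information. NutriVision employs the Faster Region-based Convolutional Neural Network, a deep learning algorithm that improves object detection by generating region proposals and then classifying those regions, which lets it identify food efficiently and accurately even in complex and disorganized meal settings. Through smartphone-based image capture, NutriVision delivers instant nutritional data, including macronutrient breakdown, calorie count, and micronutrient details. One of NutriVision's standout features is its personalized nutritional analysis and diet recommendations, which are tailored to each user's dietary preferences, nutritional needs, and health history. By providing customized advice, NutriVision helps users achieve specific health and fitness goals, such as managing dietary restrictions or controlling weight. In addition to precise food detection and nutritional assessment, NutriVision supports smarter dietary decisions by combining user data with recommendations that promote a balanced, healthy diet. The system offers a practical and advanced solution for nutrition management and has the potential to significantly influence how people make dietary choices, promoting healthier eating habits and overall well-being. This paper discusses the design, performance evaluation, and prospective applications of the NutriVision system.||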
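|**2024-09-30**|[POMONAG: Pareto-Optimal Many-Objective Neural Architecture Generator](http://arxiv.org/abs/2409.20447)|null|Neural Architecture Search (NAS) automates neural network design, reducing dependence on human expertise. While NAS methods are computationally intensive and dataset-specific, auxiliary predictors reduce the number of models that need training, decreasing search time. This strategy is used to generate architectures that satisfy multiple computational constraints. Recently, Transferable NAS has emerged, generalizing the search process from dataset-dependent to task-dependent. In this field, DiffusionNAG is a state-of-the-art method. This diffusion-based approach streamlines computation, generating architectures optimized for accuracy on unseen datasets without further adaptation. However, by focusing solely on accuracy, DiffusionNAG overlooks other crucial objectives such as model complexity, computational efficiency, and inference latency, which are essential for deploying models in resource-constrained environments. This paper introduces the Pareto-Optimal Many-Objective Neural Architecture Generator (POMONAG), extending DiffusionNAG via a many-objective diffusion process. POMONAG simultaneously considers accuracy, number of parameters, multiply-accumulate operations (MACs), and inference latency. It integrates Performance Predictor models to estimate these metrics and guide diffusion gradients. POMONAG's optimization is enhanced by expanding its training Meta-Dataset, applying Pareto Front Filtering, and refining the embeddings for conditional generation. These enhancements enable POMONAG to generate Pareto-optimal architectures that outperform the previous state of the art in performance and efficiency. Results were validated on two search spaces, NASBench201 and MobileNetV3, and evaluated across 15 image classification datasets.||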
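|**2024-09-30**|[Fine-Tuning Personalization in Federated Learning to Mitigate Adversarial Clients](http://arxiv.org/abs/2409.20329)|null|Federated learning (FL) is an appealing paradigm that allows a group of machines (also known as clients) to learn collectively while keeping their data local. However, because of heterogeneity among the clients' data distributions, the model obtained through FL algorithms may perform poorly on some clients' data. Personalization addresses this issue by enabling each client to have a different model tailored to its own data while still benefiting from the other clients' data. We consider an FL setting in which some clients can be adversarial, and we derive conditions under which full collaboration fails. Specifically, we analyze the generalization performance of an interpolated personalized FL framework in the presence of adversarial clients, and we precisely characterize situations in which full collaboration performs strictly worse than fine-tuned personalization. Our analysis determines how much the level of collaboration should be scaled down, as a function of data heterogeneity and the tolerable fraction of adversarial clients. We support our findings with empirical results on mean estimation and binary classification problems, considering synthetic and benchmark image classification datasets.||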
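|**2024-09-30**|[Classroom-Inspired Multi-Mentor Distillation with Adaptive Learning Strategies](http://arxiv.org/abs/2409.20237)|null|We propose ClassroomKD, a novel multi-mentor knowledge distillation framework inspired by classroom environments to enhance knowledge transfer between the student and multiple mentors. Unlike traditional methods that rely on fixed mentor-student relationships, our framework dynamically selects and adapts the teaching strategies of diverse mentors based on their effectiveness for each data sample. ClassroomKD comprises two main modules: the Knowledge Filtering (KF) module and the Mentoring module. The KF module dynamically ranks mentors based on their performance on each input, activating only high-quality mentors to minimize error accumulation and prevent information loss. The Mentoring module adjusts the distillation strategy by tuning each mentor's influence according to the performance gap between the student and the mentors, effectively modulating the learning pace. Extensive experiments on image classification (CIFAR-100 and ImageNet) and 2D human pose estimation (COCO Keypoints and MPII Human Pose) demonstrate that ClassroomKD significantly outperforms existing knowledge distillation methods. Our results show that a dynamic and adaptive approach to mentor selection and guidance leads to more effective knowledge transfer, improving model performance through distillation.||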
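|**2024-09-30**|[Training a Computer Vision Model for Commercial Bakeries with Primarily Synthetic Images](http://arxiv.org/abs/2409.20122)|null|In the food industry, reprocessing returned products is a vital step toward increasing resource efficiency. [SBB23] presented an AI application that automates the tracking of returned bread buns. We extend their work by creating an expanded dataset comprising 2432 images and a wider range of baked goods. To increase model robustness, we use the generative models pix2pix and CycleGAN to create synthetic images. We train the state-of-the-art object detection models YOLOv9 and YOLOv8 on our detection task. Our overall best-performing model achieved an average precision AP@0.5 of 90.3% on our test set.||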
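|**2024-09-30**|[TSdetector: Temporal-Spatial Self-correction Collaborative Learning for Colonoscopy Video Detection](http://arxiv.org/abs/2409.19983)|**[link](https://github.com/soleilssss/tsdetector)**|CNN-based object detection models strike a balance between performance and speed and have gradually been adopted for polyp detection tasks. Nevertheless, accurately locating polyps in complex colonoscopy video scenes remains challenging because existing methods ignore two key issues: intra-sequence distribution heterogeneity and precision-confidence discrepancy. To address these challenges, we propose a novel Temporal-Spatial self-correction detector (TSdetector), which integrates temporal-level consistency learning and spatial-level reliability learning to detect objects continuously. Specifically, we first propose a global temporal-aware convolution that assembles preceding information to dynamically guide the current convolution kernel to focus on global features across sequences. In addition, we design a hierarchical queue integration mechanism that combines multi-temporal features through progressive accumulation, fully leveraging contextual consistency information while retaining long-sequence-dependency features. Meanwhile, at the spatial level, we propose a position-aware clustering that explores the spatial relationships among candidate boxes to adaptively recalibrate prediction confidence, thus effectively eliminating redundant bounding boxes. Experimental results on three publicly available polyp video datasets show that TSdetector achieves the highest polyp detection rate and outperforms other state-of-the-art methods. The code is available at https://github.com/soleilssss/TSdetector.||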
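|**2024-09-30**|[DAOcc: 3D Object Detection Assisted Multi-Sensor Fusion for 3D Occupancy Prediction](http://arxiv.org/abs/2409.19972)|**[link](https://github.com/alphaplustt/daocc)**|Multi-sensor fusion significantly enhances the accuracy and robustness of 3D semantic occupancy prediction, which is crucial for autonomous driving and robotics. However, existing approaches depend on large image resolutions and complex networks to achieve top performance, which hinders their application in practical scenarios. Additionally, most multi-sensor fusion approaches focus on improving the fused features while overlooking supervision strategies for those features. To this end, we propose DAOcc, a novel multi-sensor fusion occupancy network that leverages 3D object detection supervision to help achieve superior performance, while using a deployment-friendly image feature extraction network and a practical input image resolution. Furthermore, we introduce a BEV View Range Extension strategy to mitigate the adverse effects of reduced image resolution. As a result, our approach achieves new state-of-the-art results on the Occ3D-nuScenes and SurroundOcc datasets using ResNet50 and a 256x704 input image resolution. Code will be made available at https://github.com/AlphaPlusTT/DAOcc.||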
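|**2024-09-30**|[SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers](http://arxiv.org/abs/2409.19850)|null|Over the past few years, vision transformers (ViTs) have consistently demonstrated remarkable performance across various visual recognition tasks. However, attempts to enhance their robustness have had limited success, focusing mainly on different training strategies, input patch augmentation, or network structural enhancements. These approaches often involve extensive training and fine-tuning, which are time-consuming and resource-intensive. To overcome these obstacles, we introduce a novel approach named Spatial Autocorrelation Token Analysis (SATA). By exploiting spatial relationships between token features, SATA enhances both the representational capacity and the robustness of ViT models. This is achieved by analyzing and grouping tokens according to their spatial autocorrelation scores before they are fed into the Feed-Forward Network (FFN) block of the self-attention mechanism. Importantly, SATA integrates seamlessly into existing pre-trained ViT baselines without retraining or additional fine-tuning, while improving efficiency by reducing the computational load of the FFN units. Experimental results show that baseline ViTs enhanced with SATA not only achieve a new state-of-the-art top-1 accuracy on ImageNet-1K image classification (94.9%) but also establish new state-of-the-art performance across multiple robustness benchmarks, including ImageNet-A (top-1=63.6%), ImageNet-R (top-1=79.2%), and ImageNet-C (mCE=13.6%), all without additional training or fine-tuning of the baseline models.||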
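|**2024-09-30**|[HazyDet: Open-source Benchmark for Drone-view Object Detection with Depth-cues in Hazy Scenes](http://arxiv.org/abs/2409.19833)|**[link](https://github.com/grokcv/hazydet)**|Drone-based object detection in adverse weather conditions is crucial for enhancing drones' environmental perception, yet it remains largely unexplored because of the lack of relevant benchmarks. To bridge this gap, we introduce HazyDet, a large-scale dataset tailored for drone-based object detection in hazy scenes. It contains 383,000 real-world instances, collected from both naturally hazy environments and normal scenes with synthetically imposed haze effects that simulate adverse weather conditions. Observing the significant variations in object scale and clarity under different depth and haze conditions, we designed a Depth Conditioned Detector (DeCoDet) to incorporate this prior knowledge. DeCoDet features a Multi-scale Depth-aware Detection Head that seamlessly integrates depth perception, with the resulting depth cues exploited by a dynamic Depth Condition Kernel module. Furthermore, we propose a Scale Invariant Refurbishment Loss to facilitate the learning of robust depth cues from pseudo-labels. Extensive evaluations on the HazyDet dataset demonstrate the flexibility and effectiveness of our method, which yields significant performance improvements. Our dataset and toolkit are available at https://github.com/GrokCV/HazyDet.||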
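|**2024-09-29**|[Applying the Lower-Biased Teacher Model in Semi-Suepervised Object Detection](http://arxiv.org/abs/2409.19703)|null|I present the Lower Biased Teacher model, an enhancement of the Unbiased Teacher model that is specifically tailored for semi-supervised object detection tasks. The primary innovation of this model is the integration of a localization loss into the teacher model, which significantly improves the accuracy of pseudo-label generation. By addressing key issues such as class imbalance and the precision of bounding boxes, the Lower Biased Teacher model demonstrates superior performance in object detection tasks. Extensive experiments on multiple semi-supervised object detection datasets show that the Lower Biased Teacher model not only reduces the pseudo-labeling bias caused by class imbalance but also mitigates errors arising from incorrect bounding boxes. As a result, the model achieves higher mAP scores and more reliable detection results than existing methods. This research underscores the importance of accurate pseudo-label generation and provides a robust framework for future advances in semi-supervised learning for object detection.||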
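|**2024-09-27**|[Spectral Wavelet Dropout: Regularization in the Wavelet Domain](http://arxiv.org/abs/2409.18951)|null|Regularization techniques help prevent overfitting and therefore improve the ability of convolutional neural networks (CNNs) to generalize. One cause of overfitting is the complex co-adaptation among different parts of the network, which makes the CNN depend on their joint response rather than encouraging each part to learn a useful feature representation independently. Frequency domain manipulation is a powerful strategy that uses frequency decomposition to modify data with temporal and spatial coherence. This work introduces Spectral Wavelet Dropout (SWD), a novel regularization method that includes two variants: 1D-SWD and 2D-SWD. These variants improve CNN generalization by randomly dropping detailed frequency bands in the discrete wavelet decomposition of feature maps. Our approach differs from the pre-existing Spectral "Fourier" Dropout (2D-SFD), which eliminates coefficients in the Fourier domain. Notably, SWD requires only a single hyperparameter, unlike the two required by SFD. We also extend the literature by implementing a one-dimensional version of Spectral "Fourier" Dropout (1D-SFD), setting the stage for a comprehensive comparison. Our evaluation shows that both the 1D and 2D SWD variants perform competitively on the CIFAR-10/100 benchmarks relative to both 1D-SFD and 2D-SFD. Specifically, 1D-SWD has a significantly lower computational complexity than 1D/2D-SFD. On the Pascal VOC object detection benchmark, the SWD variants outperform 1D-SFD and 2D-SFD and show lower computational complexity during training.||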
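|**2024-09-27**|[Unconditional stability of a recurrent neural circuit implementing divisive normalization](http://arxiv.org/abs/2409.18946)|**[link](https://github.com/martiniani-lab/dynamic-divisive-norm)**|Stability in recurrent neural models poses a significant challenge, particularly in developing biologically plausible neurodynamical models that can be seamlessly trained. Traditional cortical circuit models are difficult to train because of expansive nonlinearities in the dynamical system, leading to an optimization problem with nonlinear stability constraints that are difficult to impose. Conversely, recurrent neural networks (RNNs) excel in tasks involving sequential data but lack biological plausibility and interpretability. In this work, we address these challenges by linking dynamic divisive normalization (DN) to the stability of ORGaNICs, a biologically plausible recurrent cortical circuit model that dynamically achieves DN and has been shown to simulate a wide range of neurophysiological phenomena. Using the indirect method of Lyapunov, we prove the remarkable property of unconditional local stability for an arbitrary-dimensional ORGaNICs circuit when the recurrent weight matrix is the identity. We thus connect ORGaNICs to a system of coupled damped harmonic oscillators, which enables us to derive the circuit's energy function, providing a normative principle of what the circuit and individual neurons aim to accomplish. Further, for a generic recurrent weight matrix, we prove the stability of the 2D model and demonstrate empirically that stability holds in higher dimensions. Finally, we show that ORGaNICs can be trained by backpropagation through time without gradient clipping/scaling, thanks to its intrinsic stability property and adaptive time constants, which address the problems of exploding, vanishing, and oscillating gradients. By evaluating the model's performance on RNN benchmarks, we find that ORGaNICs outperform alternative neurodynamical models on static image classification tasks and perform comparably to LSTMs on sequential tasks.||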
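|**2024-09-27**|[Subspace Preserving Quantum Convolutional Neural Network Architectures](http://arxiv.org/abs/2409.18918)|null|Subspace-preserving quantum circuits are a class of quantum algorithms that rely on certain symmetries in the computation and can offer theoretical guarantees for their training. These algorithms have gained extensive interest because they can offer polynomial speed-ups and can be used to mimic classical machine learning algorithms. In this work, we propose a novel convolutional neural network architecture model based on Hamming-weight-preserving quantum circuits. In particular, we introduce convolutional layers and measurement-based pooling layers that preserve the symmetries of the quantum states while realizing non-linearity using gates that are not subspace preserving. Our proposal offers significant polynomial running-time advantages over classical deep learning architectures. We provide an open-source simulation library for Hamming-weight-preserving quantum circuits that can simulate our techniques more efficiently with GPU-oriented libraries. Using this code, we provide examples of architectures that highlight strong performance on complex image classification tasks with a limited number of qubits and with fewer parameters than classical deep learning architectures.||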
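|**2024-09-27**|[MCUBench: A Benchmark of Tiny Object Detectors on MCUs](http://arxiv.org/abs/2409.18866)|**[link](https://github.com/deeplite/deeplite-torch-zoo)**|We introduce MCUBench, a benchmark featuring over 100 YOLO-based object detection models evaluated on the VOC dataset across seven different MCUs. The benchmark provides detailed data on average precision, latency, RAM, and Flash usage for various input resolutions and YOLO-based one-stage detectors. By conducting a controlled comparison with a fixed training pipeline, we collect comprehensive performance metrics. Our Pareto-optimal analysis shows that integrating modern detection heads and training techniques allows various YOLO architectures, including legacy models such as YOLOv3, to achieve an efficient trade-off between mean Average Precision (mAP) and latency. MCUBench is a valuable tool for benchmarking the MCU performance of contemporary object detectors and aids model selection under specific constraints.||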
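|**2024-09-27**|[A Novel Unified Architecture for Low-Shot Counting by Detection and Segmentation](http://arxiv.org/abs/2409.18686)|**[link](https://github.com/jerpelhan/GeCo)**|Low-shot object counters estimate the number of objects in an image using few or no annotated exemplars. Objects are localized by matching them to prototypes, which are constructed by unsupervised image-wide aggregation of object appearance. Because object appearances can be diverse, existing approaches often lead to overgeneralization and false positive detections. Furthermore, the best-performing methods train object localization with a surrogate loss that predicts a unit Gaussian at each object center. This loss is sensitive to annotation error and hyperparameters, and it does not directly optimize the detection task, leading to suboptimal counts. We introduce GeCo, a novel low-shot counter that achieves accurate object detection, segmentation, and count estimation in a unified architecture. GeCo robustly generalizes the prototypes across object appearances through a novel dense object query formulation. In addition, we propose a novel counting loss that directly optimizes the detection task and avoids the issues of the standard surrogate loss. GeCo surpasses the leading few-shot detection-based counters by about 25% in total count MAE, achieves superior detection accuracy, and sets a new state-of-the-art result across all low-shot counting setups.||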
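|**2024-09-27**|[Query matching for spatio-temporal action detection with query-based object detector](http://arxiv.org/abs/2409.18408)|null|In this paper, we propose a method that extends the query-based object detection model DETR to spatio-temporal action detection, a task that requires maintaining temporal consistency in videos. Our proposed method applies DETR to each frame and uses feature shift to incorporate temporal information. However, DETR's object queries in each frame may correspond to different objects, making a simple feature shift ineffective. To overcome this issue, we propose query matching across different frames, ensuring that queries for the same object are matched and used for the feature shift. Experimental results show that performance on the JHMDB21 dataset improves substantially when query features are shifted using the proposed query matching.||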
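|**2024-09-27**|[Simpler Gradient Methods for Blind Super-Resolution with Lower Iteration Complexity](http://arxiv.org/abs/2409.18387)|**[link](https://github.com/Jinshengg/SimplerGDs-VHL)**|We study the problem of blind super-resolution, which can be formulated as a low-rank matrix recovery problem via vectorized Hankel lift (VHL). The previous VHL-based gradient descent method, named PGD-VHL, relies on additional regularization such as the projection and balancing penalty, and exhibits a suboptimal iteration complexity. In this paper, we propose a simpler unconstrained optimization problem without these two types of regularization and develop two new provable gradient methods, named VGD-VHL and ScalGD-VHL. We provide a novel and sharp analysis for the theoretical guarantees of our algorithms, which demonstrates that our methods have lower iteration complexity than PGD-VHL. In addition, ScalGD-VHL has the lowest iteration complexity while being independent of the condition number. Furthermore, our analysis reveals that the blind super-resolution problem is less incoherence-demanding, which eliminates the need for incoherent projections to achieve linear convergence. Empirical results show that our methods are more computationally efficient while achieving recovery performance comparable to prior art.||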
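|**2024-09-26**|[Realistic Evaluation of Model Merging for Compositional Generalization](http://arxiv.org/abs/2409.18314)|**[link](https://github.com/r-three/realistic_evaluation_of_model_merging_for_compositional_generalization)**|Merging has become a widespread way to cheaply combine individual models into a single model that inherits their capabilities and attains better performance. This popularity has spurred the rapid development of many new merging methods, which are typically validated in disparate experimental settings and frequently differ in the assumptions they make about model architecture, data availability, and computational budget. In this work, we characterize the relative merits of different merging methods by evaluating them in a shared experimental setting and precisely identifying the practical requirements of each method. Specifically, our setting focuses on using merging for compositional generalization of capabilities in image classification, image generation, and natural language processing. Additionally, we measure the computational costs of different merging methods as well as how they perform when scaling the number of models being merged. Taken together, our results clarify the state of the field of model merging and provide a comprehensive and rigorous experimental setup for testing new methods.||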
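|**2024-09-26**|[Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing](http://arxiv.org/abs/2409.18286)|null|This study aims to comprehensively review and empirically evaluate the application of multimodal large language models (MLLMs) and large vision models (VLMs) to object detection for transportation systems. First, we describe the potential benefits of MLLMs in transportation applications and conduct a comprehensive review of MLLM technologies used in previous studies. We highlight their effectiveness and limitations for object detection across various transportation scenarios. Second, we provide an overview of the taxonomy of end-to-end object detection in transportation applications and of future directions. Building on this, we propose an empirical analysis that tests MLLMs on three real-world transportation problems involving object detection tasks, namely road safety attribute extraction, safety-critical event detection, and visual reasoning on thermal images. Our findings provide a detailed assessment of MLLM performance, uncovering both strengths and areas for improvement. Finally, we discuss the practical limitations and challenges of MLLMs in enhancing object detection in transportation, offering a roadmap for future research and development in this critical area.||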
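|**2024-09-26**|[DARE: Diverse Visual Question Answering with Robustness Evaluation](http://arxiv.org/abs/2409.18023)|null|Vision language models (VLMs) extend the remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. Moreover, while they may be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). To couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse categories and includes four robustness-oriented evaluations based on variations of the prompts, the subsets of answer options, the output format, and the number of correct answers. Among a spectrum of other findings, we report that state-of-the-art VLMs still struggle with questions in most categories and are unable to consistently deliver their peak performance across the tested robustness evaluations. The worst-case performance across the subsets of options is 34% below the performance in the standard case. The robustness of open-source VLMs such as LLaVA 1.6 and Idefics2 cannot match that of closed-source models such as GPT-4 and Gemini, but even the latter remain very brittle to different variations.||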
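|**2024-09-26**|[A New Dataset for Monocular Depth Estimation Under Viewpoint Shifts](http://arxiv.org/abs/2409.17851)|null|Monocular depth estimation is a critical task for autonomous driving and many other computer vision applications. While significant progress has been made in this field, the effect of viewpoint shifts on depth estimation models remains largely underexplored. This paper introduces a new dataset and evaluation methodology for quantifying the impact of different camera positions and orientations on monocular depth estimation performance. We propose a ground-truth strategy based on homography estimation and object detection, which removes the need for expensive LiDAR sensors. We collect a diverse dataset of road scenes from multiple viewpoints and use it to assess the robustness of modern depth estimation models to geometric shifts. After assessing the validity of our strategy on a public dataset, we provide valuable insights into the limitations of current models and highlight the importance of considering viewpoint variations in real-world applications.||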
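|**2024-09-26**|[Cascade Prompt Learning for Vision-Language Model Adaptation](http://arxiv.org/abs/2409.17805)|**[link](https://github.com/megvii-research/caspl)**|Prompt learning has emerged as an effective way to improve the performance of vision-language models (VLMs) such as CLIP on downstream tasks. However, current learnable prompt tokens serve mainly the single phase of adapting to the task (i.e., the adapting prompt), which easily leads to overfitting. In this work, we propose a novel Cascade Prompt Learning (CasPL) framework that enables prompt learning to serve both generic and task-specific expertise (i.e., boosting and adapting prompts) simultaneously. Specifically, CasPL is a new learning paradigm comprising two distinct phases of learnable prompts: the first, boosting prompt is designed to extract domain-general knowledge from a senior, larger CLIP teacher model by aligning their predicted logits using extensive unlabeled domain images. The second, adapting prompt is then cascaded with the frozen first set to fine-tune on downstream tasks, following the approaches employed in prior research. In this way, CasPL can effectively capture both domain-general and task-specific representations in explicitly different, progressive groups of prompts, potentially alleviating overfitting in the target domain. Notably, CasPL is a plug-and-play module that can be seamlessly integrated into any existing prompt learning approach. CasPL achieves a significantly better balance between performance and inference speed, which is especially beneficial for deploying smaller VLMs in resource-constrained environments. Compared to the previous state-of-the-art method PromptSRC, CasPL shows an average improvement of 1.85% for base classes, 3.44% for novel classes, and 2.72% for the harmonic mean over 11 image classification datasets. Code is publicly available at https://github.com/megvii-research/CasPL.||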
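|**2024-09-26**|[Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs](http://arxiv.org/abs/2409.17778)|**[link](https://github.com/qinpengcui/dossr)**|Diffusion-based image super-resolution (SR) models have attracted substantial interest due to their powerful image restoration capabilities. However, existing diffusion models often struggle to strike an optimal balance between efficiency and performance. Typically, they either neglect the potential of existing large pretrained models, which limits their generative capacity, or they require dozens of forward passes starting from random noise, which compromises inference efficiency. In this paper, we present DoSSR, a domain shift diffusion-based SR model that capitalizes on the generative power of pretrained diffusion models while significantly improving efficiency by initializing the diffusion process with low-resolution (LR) images. At the core of our approach is a domain shift equation that integrates seamlessly with existing diffusion models. This integration not only improves the use of the diffusion prior but also boosts inference efficiency. Moreover, we advance our method by converting the discrete shift process into a continuous formulation, termed DoS-SDEs. This advancement leads to fast, customized solvers that further improve sampling efficiency. Experimental results show that our method achieves state-of-the-art performance on both synthetic and real-world datasets while requiring only 5 sampling steps. Compared to previous diffusion-prior-based methods, our approach achieves a remarkable 5-7x speedup, demonstrating its superior efficiency. Code: https://github.com/QinpengCui/DoSSR.||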
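|**2024-09-26**|[LGFN: Lightweight Light Field Image Super-Resolution using Local Convolution Modulation and Global Attention Feature Extraction](http://arxiv.org/abs/2409.17759)|null|A light field (LF) can encode 3D scene information into a 4D light field image and is widely used in areas such as post-capture refocusing and depth sensing. Light field image super-resolution (SR) aims to raise image resolution, which is limited by the performance of the LF camera sensor. Although existing methods have achieved promising results, their practical use is limited because the models are not lightweight enough. This paper proposes a lightweight model named LGFN, which integrates local and global features from different views together with features from different channels for LF image SR. Specifically, because neighboring regions at the same pixel position in different sub-aperture images exhibit similar structural relationships, we design a lightweight CNN-based feature extraction module (DGCE) that extracts local features better through feature modulation. Meanwhile, because pixel positions beyond the boundaries in LF images exhibit large disparity, we propose an efficient spatial attention module (ESAM), which uses decomposable large-kernel convolution to obtain a larger receptive field, and an efficient channel attention module (ECAM). Compared with existing LF image SR models with large parameter counts, our model has 0.45M parameters and 19.33G FLOPs and achieves competitive results. Extensive ablation experiments validate the effectiveness of the proposed method, which ranked second in Track 2 (Fidelity and Efficiency) and seventh in Track 1 (Fidelity) of the NTIRE2024 Light Field Super-Resolution Challenge.||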
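|**2024-09-26**|[Scene Understanding in Pick-and-Place Tasks: Analyzing Transformations Between Initial and Final Scenes](http://arxiv.org/abs/2409.17720)|null|As robots increasingly collaborate with humans on everyday tasks, it becomes essential to take steps toward robotic systems that understand their environment. This work focuses on scene understanding to detect pick-and-place tasks given an initial and a final image of the scene. To this end, we collect a dataset for object detection and pick-and-place task detection. A YOLOv5 network is subsequently trained to detect the objects in the initial and final scenes. Given the detected objects and their bounding boxes, we propose two methods to detect the pick-and-place tasks that transform the initial scene into the final scene. The first is a geometric method that tracks the objects' movements across the two scenes and works based on the intersection of the bounding boxes that moved within the scenes. In contrast, the CNN-based method uses a convolutional neural network to classify objects with intersecting bounding boxes into 5 classes that indicate the spatial relationship between the involved objects. The performed pick-and-place tasks are then derived by analyzing the experiments with both scenes. Results show that the CNN-based method, using a VGG16 backbone, outperforms the geometric method by roughly 12 percentage points in certain scenarios, with an overall success rate of 84.3%.||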
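|**2024-09-26**|[Unifying Dimensions: A Linear Adaptive Approach to Lightweight Image Super-Resolution](http://arxiv.org/abs/2409.17597)|**[link](https://github.com/zononhzy/lamnet)**|Window-based Transformers have demonstrated outstanding performance in super-resolution tasks thanks to their capacity for adaptive modeling through local self-attention (SA). However, they exhibit higher computational complexity and inference latency than convolutional neural networks. In this paper, we first identify that the adaptability of Transformers derives from their adaptive spatial aggregation and advanced structural design, while their high latency results from the computational costs and memory layout transformations associated with local SA. To emulate this aggregation approach, we propose an effective convolution-based linear focal separable attention (FSA), which allows long-range dynamic modeling with linear complexity. In addition, we introduce an effective dual-branch structure combined with an ultra-lightweight information exchange module (IEM) to enhance information aggregation by the Token Mixer. Finally, on the structural side, we modify the existing spatial-gate-based feedforward neural network by incorporating a self-gate mechanism to preserve high-dimensional channel information, enabling the modeling of more complex relationships. Building on these improvements, we construct a convolution-based Transformer framework named the linear adaptive mixer network (LAMNet). Extensive experiments show that LAMNet achieves better performance than existing SA-based Transformer methods while maintaining the computational efficiency of convolutional neural networks, with up to a $3\times$ speedup in inference time. The code will be publicly available at https://github.com/zononhzy/LAMNet.||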
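|**2024-09-26**|[Let the Quantum Creep In: Designing Quantum Neural Network Models by Gradually Swapping Out Classical Components](http://arxiv.org/abs/2409.17583)|**[link](https://github.com/peiyong-addwater/let-the-quantum-creep-in)**|Artificial intelligence (AI), with its multiplier effect and wide applications across many areas, could become an important application area for quantum computing. Since modern AI systems are often built on neural networks, the design of quantum neural networks becomes a key challenge in integrating quantum computing into AI. To characterize the impact of quantum components on neural network performance at a finer grain, we propose a framework in which classical neural network layers are gradually replaced by quantum layers that have the same type of input and output while keeping the flow of information between layers unchanged. This differs from most current research on quantum neural networks, which favors end-to-end quantum models. We start with a simple three-layer classical neural network without any normalization layers or activation functions, and gradually change the classical layers to their corresponding quantum versions. We conduct numerical experiments on image classification datasets such as MNIST, FashionMNIST, and CIFAR-10 to demonstrate the change in performance brought by the systematic introduction of quantum components. Through this framework, our research offers new insights for the design of future quantum neural network models, where it may be more favorable to search for methods and frameworks that harness the advantages of both the classical and quantum worlds.||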
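|**2024-09-26**|[General Compression Framework for Efficient Transformer Object Tracking](http://arxiv.org/abs/2409.17564)|null|Transformer-based trackers dominate the field of visual object tracking. While these trackers perform well, their deployment on resource-constrained devices remains challenging because of their inefficiency. To improve inference efficiency and reduce computation cost, prior approaches have aimed either to design lightweight trackers or to distill knowledge from larger teacher models into more compact student models. However, these solutions often gain speed at the expense of accuracy. We therefore propose CompressTracker, a general model compression framework for efficient Transformer object tracking, which compresses a pretrained tracking model into a lightweight tracker with minimal performance degradation. Our approach features a novel stage division strategy that partitions the Transformer layers of the teacher model into distinct stages, enabling the student model to emulate each corresponding teacher stage more effectively. In addition, we design a unique replacement training technique that randomly substitutes specific stages in the student model with the corresponding stages from the teacher model, rather than training the student model in isolation. Replacement training enhances the student model's ability to replicate the teacher model's behavior. To further push the student model to emulate the teacher model, we introduce prediction guidance and stage-wise feature mimicking, which provide additional supervision while the teacher model is being compressed. CompressTracker is structurally agnostic, making it compatible with any Transformer architecture. We conduct a series of experiments to verify the effectiveness and generalizability of CompressTracker. Our CompressTracker-4, with 4 Transformer layers and compressed from OSTrack, retains about 96% of the performance on LaSOT (66.1% AUC) while achieving a 2.17x speedup.||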
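|**2024-09-26**|[CAMOT: Camera Angle-aware Multi-Object Tracking](http://arxiv.org/abs/2409.17533)|null|This paper proposes CAMOT, a simple camera angle estimator for multi-object tracking that tackles two problems: 1) occlusion and 2) inaccurate distance estimation in the depth direction. Under the assumption that multiple objects in each video frame lie on a flat plane, CAMOT estimates the camera angle using object detection. In addition, it gives the depth of each object, enabling pseudo-3D MOT. We evaluate its performance by adding it to various 2D MOT methods on the MOT17 and MOT20 datasets and confirm its effectiveness. Applying CAMOT to ByteTrack, we obtain 63.8% HOTA, 80.6% MOTA, and 78.5% IDF1 on MOT17, which are state-of-the-art results. Its computational cost is significantly lower than that of existing deep-learning-based depth estimators for tracking.||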
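|**2024-09-18**|[Applications of Knowledge Distillation in Remote Sensing: A Survey](http://arxiv.org/abs/2409.12111)|null|With the ever-growing complexity of models in remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling knowledge transfer from large, complex models to smaller, more efficient ones without significant loss in performance. This survey extensively examines KD and its innovative applications in RS. KD, a technique that transfers knowledge from a complex, often cumbersome model (the teacher) to a more compact and efficient model (the student), has seen significant evolution and application across various domains. We first introduce the fundamental concepts and historical progression of KD methods. The article highlights the advantages of adopting KD, particularly in model compression, improved computational efficiency, and improved performance, which are pivotal for practical deployment in RS scenarios. It provides a comprehensive taxonomy of KD techniques, in which each category is critically analyzed to demonstrate the breadth and depth of the alternatives, and presents concrete case studies that show the practical use of KD methods in RS tasks such as instance segmentation and object detection. The survey also discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions, giving RS researchers and practitioners a comprehensive overview. Through this organization, the paper not only clarifies the current state of KD research but also lays the groundwork for future research, thereby contributing significantly to both academic research and real-world applications.||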
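|**2024-09-18**|[Unraveling the Hessian: A Key to Smooth Convergence in Loss Function Landscapes](http://arxiv.org/abs/2409.11995)|**[link](https://github.com/kisnikser/landscape-hessian)**|The loss landscape of a neural network is a critical aspect of its training, and understanding its properties is essential for improving performance. In this paper, we investigate how the loss surface changes as the sample size increases, a previously unexplored issue. We theoretically analyze the convergence of the loss landscape in a fully connected neural network and derive upper bounds on the difference in loss function values when a new object is added to the sample. Our empirical study confirms these results on various datasets, demonstrating the convergence of the loss function surface for image classification tasks. Our findings provide insights into the local geometry of neural loss landscapes and have implications for the development of sample size determination techniques.||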
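|**2024-09-18**|[Agglomerative Token Clustering](http://arxiv.org/abs/2409.11923)|null|We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection and segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without introducing any extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks and can even perform on par with the prior state of the art when applied off the shelf, i.e., without fine-tuning. ATC is particularly effective at low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.||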
|**2024-09-18**|[Distillation-free Scaling of Large SSMs for Images and Videos](http://arxiv.org/abs/2409.11867)|null|State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.||
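|**2024-09-18**|[RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework](http://arxiv.org/abs/2409.11749)|null|3D multi-object tracking (MOT) has obtained significant performance improvements with the rapid advances in 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers yields detector-specific models, which limits their versatility. Moreover, existing generic trackers overlook the unique features of multi-camera detectors, namely the unreliability of motion observations and the availability of visual information. To address these challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors. Following the tracking-by-detection framework, RockTrack is compatible with various off-the-shelf detectors. RockTrack incorporates a confidence-guided preprocessing module to extract reliable motion and image observations from the distinct representation spaces of a single detector. These observations are then fused in an association module that leverages geometric and appearance cues to minimize mismatches. The resulting matches are propagated through a staged estimation process, forming the basis for heuristic noise modeling. In addition, we introduce a novel appearance similarity metric for explicitly characterizing object affinities in multi-camera settings. RockTrack achieves state-of-the-art performance on the nuScenes vision-only tracking leaderboard with 59.1% AMOTA while demonstrating impressive computational efficiency.||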
|**2024-09-18**|[Few-Shot Learning Approach on Tuberculosis Classification Based on Chest X-Ray Images](http://arxiv.org/abs/2409.11644)|null|Tuberculosis (TB) is caused by the bacterium Mycobacterium tuberculosis, primarily affecting the lungs. Early detection is crucial for improving treatment effectiveness and reducing transmission risk. Artificial intelligence (AI), particularly through image classification of chest X-rays, can assist in TB detection. However, class imbalance in TB chest X-ray datasets presents a challenge for accurate classification. In this paper, we propose a few-shot learning (FSL) approach using the Prototypical Network algorithm to address this issue. We compare the performance of ResNet-18, ResNet-50, and VGG16 in feature extraction from the TBX11K Chest X-ray dataset. Experimental results demonstrate classification accuracies of 98.93% for ResNet-18, 98.60% for ResNet-50, and 33.33% for VGG16. These findings indicate that the proposed method outperforms others in mitigating data imbalance, which is particularly beneficial for disease classification applications.||
|**2024-09-17**|[VALO: A Versatile Anytime Framework for LiDAR-based Object Detection Deep Neural Networks](http://arxiv.org/abs/2409.11542)|**[link](https://github.com/csl-ku/valo)**|This work addresses the challenge of adapting to dynamic deadline requirements for LiDAR object detection deep neural networks (DNNs). The computing latency of object detection is critically important to ensure safe and efficient navigation. However, state-of-the-art LiDAR object detection DNNs often exhibit significant latency, hindering their real-time performance on resource-constrained edge platforms. Therefore, a tradeoff between detection accuracy and latency should be dynamically managed at runtime to achieve optimum results. In this paper, we introduce VALO (Versatile Anytime algorithm for LiDAR Object detection), a novel data-centric approach that enables anytime computing of 3D LiDAR object detection DNNs. VALO employs a deadline-aware scheduler to selectively process input regions, making execution time and accuracy tradeoffs without architectural modifications. Additionally, it leverages efficient forecasting of past detection results to mitigate possible loss of accuracy due to partial processing of input. Finally, it utilizes a novel input reduction technique within its detection heads to significantly accelerate execution without sacrificing accuracy. We implement VALO on state-of-the-art 3D LiDAR object detection networks, namely CenterPoint and VoxelNext, and demonstrate its dynamic adaptability to a wide range of time constraints while achieving higher accuracy than the prior state-of-the-art. Code is available at https://github.com/CSL-KU/VALO.||
|**2024-09-17**|[Enhancing the Reliability of LiDAR Point Cloud Sampling: A Colorization and Super-Resolution Approach Based on LiDAR-Generated Images](http://arxiv.org/abs/2409.11532)|null|In recent years, Light Detection and Ranging (LiDAR) technology, a critical sensor in robotics and autonomous systems, has seen significant advancements. These improvements include enhanced resolution of point clouds and the capability to provide 360° low-resolution images. These images encode various data such as depth, reflectivity, and near-infrared light within the pixels. However, an excessive density of points and conventional point cloud sampling can be counterproductive, particularly in applications such as LiDAR odometry, where misleading points and degraded geometry information may induce drift errors. Currently, extensive research efforts are being directed towards leveraging LiDAR-generated images to improve situational awareness. This paper presents a comprehensive review of current deep learning (DL) techniques, including colorization and super-resolution, which are traditionally utilized in conventional computer vision tasks. These techniques are applied to LiDAR-generated images and are analyzed qualitatively. Based on this analysis, we have developed a novel approach that selectively integrates the most suited colorization and super-resolution methods with LiDAR imagery to sample reliable points from the LiDAR point cloud. This approach aims to not only improve the accuracy of point cloud registration but also avoid mismatching caused by lacking geometry information, thereby augmenting the utility and precision of LiDAR systems in practical applications. In our evaluation, the proposed approach demonstrates superior performance compared to our previous work, achieving lower translation and rotation errors with a reduced number of points.||
|**2024-09-19**|[Super Resolution On Global Weather Forecasts](http://arxiv.org/abs/2409.11502)|null|Weather forecasting is a vitally important tool for tasks ranging from planning day-to-day activities to disaster response planning. However, modeling weather has proven to be a challenging task due to its chaotic and unpredictable nature. Every variable, from temperature to precipitation to wind, influences the path the environment will take. As a result, all models tend to rapidly lose accuracy as the temporal range of their forecasts increases. Classical forecasting methods use a myriad of physics-based, numerical, and stochastic techniques to predict the change in weather variables over time. However, such forecasts often require a very large amount of data and are extremely computationally expensive. Furthermore, as climate and global weather patterns change, classical models are substantially more difficult and time-consuming to update for changing environments. Fortunately, with recent advances in deep learning and publicly available high-quality weather datasets, deploying learning methods for estimating these complex systems has become feasible. The current state-of-the-art deep learning models have comparable accuracy to the industry-standard numerical models and are becoming more ubiquitous in practice due to their adaptability. Our group seeks to improve upon existing deep-learning-based forecasting methods by increasing the spatial resolution of global weather predictions. Specifically, we are interested in performing super resolution (SR) on GraphCast temperature predictions by increasing the global precision from 1 degree to 0.5 degrees, which correspond to approximately 111km and 55km respectively.||
|**2024-09-17**|[SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking](http://arxiv.org/abs/2409.11235)|**[link](https://github.com/siyuanliii/slack)**|Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories not in the training set. Currently, the best-performing methods are mainly based on pure appearance matching. Due to the complexity of motion patterns in the large-vocabulary scenarios and unstable classification of the novel objects, the motion and semantics cues are either ignored or applied based on heuristics in the final matching steps by existing methods. In this paper, we present a unified framework SLAck that jointly considers semantics, location, and appearance priors in the early steps of association and learns how to integrate all valuable information through a lightweight spatial and temporal object graph. Our method eliminates complex post-processing heuristics for fusing different cues and boosts the association performance significantly for large-scale open-vocabulary tracking. Without bells and whistles, we outperform previous state-of-the-art methods for novel classes tracking on the open-vocabulary MOT and TAO TETA benchmarks. Our code is available at https://github.com/siyuanliii/SLAck.||
|**2024-09-17**|[STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking](http://arxiv.org/abs/2409.11234)|**[link](https://github.com/ydhcg-bobo/stcmot)**|Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise matching of target reidentification (ReID). These methods focus on optimizing target spatial attributes while overlooking temporal cues in modelling object relationships, especially under challenging tracking conditions such as object deformation and blurring. To address these issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in a sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embeddings based on adjacent frame cooperation. The trajectory embedding is then propagated by a temporal detection refinement module to mine salient target locations in the temporal field. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that our STCMOT sets a new state-of-the-art performance in MOTA and IDF1 metrics. The source code is released at https://github.com/ydhcg-BoBo/STCMOT.||
|**2024-09-17**|[Vision foundation models: can they be applied to astrophysics data?](http://arxiv.org/abs/2409.11175)|**[link](https://github.com/elastufka/fm4astro)**|Vision foundation models, which have demonstrated significant potential in many multimedia applications, are often underutilized in the natural sciences. This is primarily due to mismatches between the nature of domain-specific scientific data and the typical training data used for foundation models, leading to distribution shifts. Scientific data often differ substantially in structure and characteristics; researchers frequently face the challenge of optimizing model performance with limited labeled data of only a few hundred or thousand images. To adapt foundation models effectively requires customized approaches in preprocessing, data augmentation, and training techniques. Additionally, each vision foundation model exhibits unique strengths and limitations, influenced by differences in architecture, training procedures, and the datasets used for training. In this work, we evaluate the application of various vision foundation models to astrophysics data, specifically images from optical and radio astronomy. Our results show that using features extracted by specific foundation models improves the classification accuracy of optical galaxy images compared to conventional supervised training. Similarly, these models achieve equivalent or better performance in object detection tasks with radio images. However, their performance in classifying radio galaxy images is generally poor and often inferior to traditional supervised training results. These findings suggest that selecting suitable vision foundation models for astrophysics applications requires careful consideration of the model characteristics and alignment with the specific requirements of the downstream tasks.||
|**2024-09-17**|[Unleashing the Potential of Mamba: Boosting a LiDAR 3D Sparse Detector by Using Cross-Model Knowledge Distillation](http://arxiv.org/abs/2409.11018)|null|A LiDAR-based 3D object detector that strikes a balance between accuracy and speed is crucial for achieving real-time perception in autonomous driving and robotic navigation systems. To enhance the accuracy of point cloud detection, integrating global context for visual understanding improves the point cloud's ability to grasp overall spatial information. However, many existing LiDAR detection models depend on intricate feature transformation and extraction processes, leading to poor real-time performance and high resource consumption, which limits their practical effectiveness. In this work, we propose a Faster LiDAR 3D object detection framework, called FASD, which implements heterogeneous model distillation by adaptively uniform cross-model voxel features. We aim to distill the transformer's capacity for high-performance sequence modeling into Mamba models with low FLOPs, achieving a significant improvement in accuracy through knowledge transfer. Specifically, Dynamic Voxel Group and Adaptive Attention strategies are integrated into the sparse backbone, creating a robust teacher model with scale-adaptive attention for effective global visual context modeling. Following feature alignment with the Adapter, we transfer knowledge from the Transformer to the Mamba through latent space feature supervision and span-head distillation, resulting in improved performance and an efficient student model. We evaluated the framework on the Waymo and nuScenes datasets, achieving a 4x reduction in resource consumption and a 1-2% performance improvement over the current SoTA methods.||
|**2024-09-17**|[TrajSSL: Trajectory-Enhanced Semi-Supervised 3D Object Detection](http://arxiv.org/abs/2409.10901)|null|Semi-supervised 3D object detection is a common strategy employed to circumvent the challenge of manually labeling large-scale autonomous driving perception datasets. Pseudo-labeling approaches to semi-supervised learning adopt a teacher-student framework in which machine-generated pseudo-labels on a large unlabeled dataset are used in combination with a small manually-labeled dataset for training. In this work, we address the problem of improving pseudo-label quality through leveraging long-term temporal information captured in driving scenes. More specifically, we leverage pre-trained motion-forecasting models to generate object trajectories on pseudo-labeled data to further enhance the student model training. Our approach improves pseudo-label quality in two distinct manners: first, we suppress false positive pseudo-labels through establishing consistency across multiple frames of motion forecasting outputs. Second, we compensate for false negative detections by directly inserting predicted object tracks into the pseudo-labeled scene. Experiments on the nuScenes dataset demonstrate the effectiveness of our approach, improving the performance of standard semi-supervised approaches in a variety of settings.||
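|**2024-09-17**|[Single-Layer Learnable Activation for Implicit Neural Representation (SL$^{2}$A-INR)](http://arxiv.org/abs/2409.10836)|null|Implicit Neural Representation (INR), which uses a neural network to map coordinate inputs to the corresponding attributes, has driven significant advances in several vision-related domains in recent years. However, the performance of INR is heavily influenced by the choice of the nonlinear activation function used in its multilayer perceptron (MLP) architecture. Multiple nonlinearities have been investigated; yet current INRs face limitations in capturing high-frequency components and diverse signal types and in handling inverse problems. We have identified that these problems can be greatly alleviated by introducing a paradigm shift in INRs. We find that an architecture with learnable activations in the initial layers can represent fine details in the underlying signals. Specifically, we propose SL$^{2}$A-INR, a hybrid network for INR with a single-layer learnable activation function, which improves the effectiveness of traditional ReLU-based MLPs. Our method performs strongly across diverse tasks, including image representation, 3D shape reconstruction, inpainting, single image super-resolution, CT reconstruction, and novel view synthesis. Through comprehensive experiments, SL$^{2}$A-INR sets new benchmarks in accuracy, quality, and convergence rate for INR.||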
|**2024-09-17**|[Context-Dependent Interactable Graphical User Interface Element Detection for VR Applications](http://arxiv.org/abs/2409.10811)|null|In recent years, Virtual Reality (VR) has emerged as a transformative technology, offering users immersive and interactive experiences across diversified virtual environments. Users can interact with VR apps through interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D) graphical user interface (GUI). The accurate recognition of these IGEs is instrumental, serving as the foundation of many software engineering tasks, including automated testing and effective GUI search. The most recent IGE detection approaches for 2D mobile apps typically train a supervised object detection model based on a large-scale manually-labeled GUI dataset, usually with a pre-defined set of clickable GUI element categories like buttons and spinners. Such approaches can hardly be applied to IGE detection in VR apps, due to a multitude of challenges including complexities posed by open-vocabulary and heterogeneous IGE categories, intricacies of context-sensitive interactability, and the necessities of precise spatial perception and visual-semantic alignment for accurate IGE detection results. Thus, it is necessary to embark on the IGE research tailored to VR apps. In this paper, we propose the first zero-shot cOntext-sensitive inteRactable GUI ElemeNT dEtection framework for virtual Reality apps, named Orienter. By imitating human behaviors, Orienter observes and understands the semantic contexts of VR app scenes first, before performing the detection. The detection process is iterated within a feedback-directed validation and reflection loop. Specifically, Orienter contains three components, including (1) Semantic context comprehension, (2) Reflection-directed IGE candidate detection, and (3) Context-sensitive interactability classification. Extensive experiments on the dataset demonstrate that Orienter is more effective than the state-of-the-art GUI element detection approaches.||
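|**2024-09-16**|[Are Deep Learning Models Robust to Partial Object Occlusion in Visual Recognition Tasks?](http://arxiv.org/abs/2409.10775)|null|Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under partial occlusion, i.e., conditions in which objects are partially hidden from the camera's view. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures such as Vision Transformer (ViT) models, have to some extent been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, which are typically computer-generated and therefore inexpensive to label. Additionally, the methods are rarely compared against each other, and many are compared against early, now outdated, deep learning models. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the recently developed Occluded Video Instance Segmentation (OVIS) dataset (arXiv:2102.01558). IRUO uses real-world and artificially occluded images to test and benchmark the robustness of leading methods to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study using images from IRUO that evaluates human classification performance at multiple levels and types of occlusion. We find that modern CNN-based models show improved recognition accuracy on occluded images compared to earlier CNN-based models, and that ViT-based models are more accurate than CNN-based models on occluded images, performing only modestly worse than human accuracy. We also find that certain types of occlusion, including diffuse occlusion, where the relevant objects are seen through "holes" in occluders such as fences and leaves, can greatly reduce the accuracy of deep recognition models compared to humans, especially those with CNN backbones.||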
|**2024-09-16**|[CoMamba: Real-time Cooperative Perception Unlocked with State Space Models](http://arxiv.org/abs/2409.10699)|null|Cooperative perception systems play a vital role in enhancing the safety and efficiency of vehicular autonomy. Although recent studies have highlighted the efficacy of vehicle-to-everything (V2X) communication techniques in autonomous driving, a significant challenge persists: how to efficiently integrate multiple high-bandwidth features across an expanding network of connected agents such as vehicles and infrastructure. In this paper, we introduce CoMamba, a novel cooperative 3D detection framework designed to leverage state-space models for real-time onboard vehicle perception. Compared to prior state-of-the-art transformer-based models, CoMamba enjoys being a more scalable 3D model using bidirectional state space models, bypassing the quadratic complexity pain-point of attention mechanisms. Through extensive experimentation on V2X/V2V datasets, CoMamba achieves superior performance compared to existing methods while maintaining real-time processing capabilities. The proposed framework not only enhances object detection accuracy but also significantly reduces processing time, making it a promising solution for next-generation cooperative perception systems in intelligent transportation networks.||
|**2024-09-16**|[Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning](http://arxiv.org/abs/2409.10362)|**[link](https://github.com/amink8/folk)**|We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances its efficacy for pre-training. Prior work in this direction masks out pre-defined frequencies in the input image and employs a reconstruction loss to pre-train the model. While achieving promising results, such an implementation has two fundamental limitations as identified in our paper. First, using pre-defined frequencies overlooks the variability of image frequency responses. Second, pre-trained with frequency-filtered images, the resulting model needs relatively more data to adapt to naturally looking images during fine-tuning. To address these drawbacks, we propose FOurier transform compression with seLf-Knowledge distillation (FOLK), integrating two dedicated ideas. First, inspired by image compression, we adaptively select the masked-out frequencies based on image frequency responses, creating more suitable SSL tasks for pre-training. Second, we employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input, largely reducing the burden of downstream tasks. Our experimental results demonstrate the effectiveness of FOLK in achieving competitive performance to many state-of-the-art SSL methods across various downstream tasks, including image classification, few-shot learning, and semantic segmentation.||
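|**2024-09-13**|[Optically-Validated Microvascular Phantom for Super-Resolution Ultrasound Imaging](http://arxiv.org/abs/2409.09031)|null|Super-resolution ultrasound (SRUS) visualizes microvasculature beyond the ultrasound diffraction limit (wavelength ($λ$)/2) by localizing and tracking spatially isolated microbubble contrast agents. SRUS phantoms typically consist of simple tube structures, in which channels below 100 µm in diameter are not available. Furthermore, these phantoms are generally fragile and unstable, offer limited ground-truth validation, and have simple structures that limit the evaluation of SRUS algorithms. To aid SRUS development, robust and durable phantoms with known, physiologically relevant microvasculature are needed for repeatable SRUS testing. This work proposes a method to fabricate durable microvascular phantoms that allow optical gauging for SRUS validation. The method uses a microvasculature negative print embedded in polydimethylsiloxane to fabricate a microvascular phantom. Branching microvascular phantoms with variable microvascular density are demonstrated, with optically validated vessel diameters down to about 60 µm (λ/5.8; λ = about 350 µm). SRUS imaging was performed and validated with optical measurements. The average SRUS error was 15.61 µm (λ/22), with a standard deviation error of 11.44 µm. The average error decreased to 7.93 µm (λ/44) once the number of localized microbubbles surpassed 1000 per estimated diameter. In addition, a change of less than 10% in the acoustic and optical properties measured one year after fabrication, together with the mechanical toughness of the phantoms, demonstrated their long-term durability. This work presents a method to fabricate durable, optically validated, complex microvascular phantoms that can be used to quantify SRUS performance and facilitate its further development.||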
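|**2024-09-13**|[Pushing Joint Image Denoising and Classification to the Edge](http://arxiv.org/abs/2409.08943)|null|In this paper, we jointly combine image classification and image denoising, aiming to enhance human perception of noisy images captured by edge devices such as low-light security cameras. In such settings, it is important to retain the ability of humans to verify the automatic classification decision, and thus to jointly denoise the image to enhance human perception. Since edge devices have little computational power, we explicitly optimize for efficiency by proposing a novel architecture that integrates the two tasks. In addition, we modify a Neural Architecture Search (NAS) method, which searches for classifiers, so that it searches for the integrated model while optimizing for a target latency, classification accuracy, and denoising performance. The NAS architectures outperform our manually designed alternatives in both denoising and classification, offering a significant improvement to human perception. Our approach empowers users to construct architectures tailored to domains such as medical imaging, surveillance systems, and industrial inspection.||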
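|**2024-09-13**|[Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing](http://arxiv.org/abs/2409.08885)|null|Object detection in remote sensing imagery plays a vital role in various Earth observation applications. However, unlike object detection in natural scene images, this task is particularly challenging because of the abundance of small, often barely visible objects across diverse terrains. To address these challenges, multimodal learning can be used to integrate features from different data modalities, thereby improving detection accuracy. Nonetheless, the performance of multimodal learning is often constrained by the limited size of labeled datasets. In this paper, we propose to use Masked Image Modeling (MIM) as a pre-training technique, leveraging self-supervised learning on unlabeled data to enhance detection performance. However, conventional MIM methods such as MAE use masked tokens that carry no contextual information and struggle to capture fine-grained details because they lack interactions with other parts of the image. To address this, we propose a new interactive MIM method that can establish interactions between different tokens, which is particularly beneficial for object detection in remote sensing. Extensive ablation studies and evaluations demonstrate the effectiveness of our approach.||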
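|**2024-09-13**|[Direct-CP: Directed Collaborative Perception for Connected and Autonomous Vehicles via Proactive Attention](http://arxiv.org/abs/2409.08840)|null|Collaborative perception (CP) leverages visual data from connected and autonomous vehicles (CAVs) to enhance an ego vehicle's field of view (FoV). Despite recent progress, current CP methods expand the ego vehicle's 360-degree perceptual range almost equally in all directions, which faces two key challenges. First, in areas with uneven traffic distribution, focusing on directions with little traffic offers limited benefit. Second, under limited communication budgets, allocating excessive bandwidth to less critical directions lowers the perception accuracy in more vital areas. To address these issues, we propose Direct-CP, a proactive and direction-aware CP system that aims to improve CP in specific directions. Our key idea is to enable the ego vehicle to proactively signal its directions of interest and readjust its attention to enhance local directional CP performance. To achieve this, we first propose an RSU-aided direction masking mechanism that helps the ego vehicle identify vital directions. We then design a direction-aware selective attention module that judiciously aggregates pertinent features based on the ego vehicle's directional priorities, the communication budget, and the positional data of CAVs. Moreover, we introduce a direction-weighted detection loss (DWLoss) to capture the divergence between directional CP outcomes and the ground truth, facilitating effective model training. Extensive experiments on the V2X-Sim 2.0 dataset show that, compared with state-of-the-art methods for collaborative 3D object detection, our approach achieves 19.8% higher local perception accuracy in the directions of interest and 2.5% higher overall perception accuracy.||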
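|**2024-09-13**|[Test-time Training for Hyperspectral Image Super-resolution](http://arxiv.org/abs/2409.08667)|null|Progress on hyperspectral image (HSI) super-resolution (SR) still lags behind research on RGB image SR. HSIs usually have a large number of spectral bands, so accurately modeling inter-band interactions for HSI SR is hard. In addition, training data for HSI SR is hard to obtain, so datasets are usually rather small. In this work, we propose a new test-time training method to tackle this problem. Specifically, we develop a novel self-training framework that generates more accurate pseudo-labels and more accurate LR-HR relationships, so that the model can be further trained with them to improve performance. To better support our test-time training method, we also propose a new network architecture that learns HSI SR without modeling inter-band interactions, and a new data augmentation method, Spectral Mixup, to increase the diversity of the training data at test time. We also collect a new HSI dataset with images of a diverse set of interesting objects, ranging from food to vegetation, materials, and general scenes. Extensive experiments on multiple datasets show that our method can significantly improve the performance of pretrained models after test-time training and significantly outperform competing methods for HSI SR.||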
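|**2024-09-13**|[Low Complexity DoA-ToA Signature Estimation for Multi-Antenna Multi-Carrier Systems](http://arxiv.org/abs/2409.08650)|null|Accurate direction of arrival (DoA) and time of arrival (ToA) estimation is a stringent requirement for several wireless systems such as sonar, radar, communications, and dual-function radar communication (DFRC). Because they use high carrier frequencies and bandwidths, most of these systems are designed with multiple antennas and subcarriers. Although resolution is high in the large-array regime, the DoA-ToA estimation accuracy of practical on-grid estimation methods still suffers from spectral leakage effects. In this article, we propose DoA-ToA estimation methods for multi-antenna multi-carrier systems with orthogonal frequency division multiplexing (OFDM) signals. In the first method, we apply a discrete Fourier transform (DFT) based coarse signature estimation and propose a low-complexity multistage fine-tuning method that greatly improves the estimation accuracy. The second method is based on compressed sensing, where we achieve super-resolution by using a two-dimensional over-complete angle-delay dictionary that is larger than the actual numbers of antennas and subcarriers. Unlike the vectorized one-dimensional orthogonal matching pursuit (OMP) method, we apply a low-complexity two-dimensional OMP method to the matrix data model, which makes compressed sensing practical in the large-array regime. Through numerical simulations, we show that the proposed methods achieve performance similar to that of the subspace-based 2D multiple signal classification (MUSIC) method at a significantly lower computational complexity.||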
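|**2024-09-13**|[Byzantine-Robust and Communication-Efficient Distributed Learning via Compressed Momentum Filtering](http://arxiv.org/abs/2409.08640)|null|Distributed learning has become the standard approach for training large-scale machine learning models across private data silos. While distributed learning enhances privacy preservation and training efficiency, it faces critical challenges related to Byzantine robustness and communication reduction. Existing Byzantine-robust and communication-efficient methods rely on full gradient information either at every iteration or at certain iterations with some probability, and they only converge to an unnecessarily large neighborhood around the solution. Motivated by these issues, we propose a novel Byzantine-robust and communication-efficient stochastic distributed learning method that imposes no requirement on batch size and converges to a smaller neighborhood around the optimal solution than all existing methods, matching the theoretical lower bound. Our key innovation is to use Polyak momentum to mitigate the noise caused by both biased compressors and stochastic gradients, thereby defending against Byzantine workers under information compression. We prove tight complexity bounds for our algorithm in the setting of non-convex smooth loss functions and show that these bounds match the lower bounds of Byzantine-free scenarios. Finally, we validate the practical significance of our algorithm through an extensive series of experiments, benchmarking its performance on binary classification and image classification tasks.||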
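|**2024-09-13**|[Think Twice Before You Act: Improving Inverse Problem Solving With MCMC](http://arxiv.org/abs/2409.08551)|null|Recent studies show that diffusion models can serve as strong priors for solving inverse problems. A prominent example is Diffusion Posterior Sampling (DPS), which uses Tweedie's formula to approximate the posterior distribution of the data given the measurement. Although DPS has the merit of solving various inverse problems without re-training, its performance is limited because this posterior approximation can be inaccurate, especially at high noise levels. We therefore propose Diffusion Posterior MCMC (DPMC), a novel inference algorithm based on annealed MCMC for solving inverse problems with a pretrained diffusion model. We define a series of intermediate distributions inspired by the approximated conditional distributions used by DPS. Through annealed MCMC sampling, we encourage the samples to follow each intermediate distribution more closely before moving on to the next distribution at a lower noise level, thereby reducing the error accumulated along the path. We test our algorithm on a variety of inverse problems, including super-resolution, Gaussian deblurring, motion deblurring, inpainting, and phase retrieval. Our algorithm outperforms DPS on nearly all tasks with fewer evaluations, and is competitive with existing approaches.||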
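|**2024-09-12**|[Learned Compression for Images and Point Clouds](http://arxiv.org/abs/2409.08376)|**[link](https://github.com/multimedialabsfu/learned-point-cloud-compression-for-classification)**|Over the last decade, deep learning has excelled at computer vision tasks including classification, super-resolution, and style transfer. We now apply it to data compression to help build the next generation of multimedia codecs. This thesis makes three primary contributions to the emerging field of learned compression. First, we present an efficient low-complexity entropy model that dynamically adapts the encoding distribution to a specific input by compressing and transmitting the encoding distribution itself as side information. Second, we propose a novel lightweight low-complexity point cloud codec that is highly specialized for classification and attains significant bitrate reductions compared with non-specialized codecs. Finally, we explore how motion within the input domain between consecutive video frames is manifested in the corresponding convolutionally derived latent space.||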
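|**2024-09-12**|[FACT: Feature Adaptive Continual-learning Tracker for Multiple Object Tracking](http://arxiv.org/abs/2409.07904)|null|Multiple object tracking (MOT) involves identifying multiple targets in a video sequence and assigning each a corresponding ID, where occlusions are frequently encountered. Recent methods address occlusions either with online learning techniques that improve adaptivity or with offline learning techniques that exploit temporal information in videos. However, most existing online learning-based MOT methods cannot learn from all past tracking information to improve adaptivity to long-term occlusions while maintaining real-time tracking speed. On the other hand, temporal information-based offline learning methods maintain a long-term memory to store past tracking information, but this approach restricts them to using only local past information during tracking. To address these challenges, we propose a new MOT framework called the Feature Adaptive Continual-learning Tracker (FACT), which enables real-time tracking and feature learning for targets by using all past tracking information. We demonstrate that the framework can be integrated with various state-of-the-art feature-based trackers, thereby improving their tracking ability. Specifically, we develop the feature adaptive continual-learning (FAC) module, a neural network that can be trained online to learn features adaptively using all past tracking information during tracking. We also introduce a two-stage association module designed specifically for the proposed continual learning-based tracking. Extensive experimental results show that the proposed method achieves state-of-the-art online tracking performance on the MOT17 and MOT20 benchmarks. The code will be released upon acceptance.||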
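|**2024-09-12**|[Microscopic-Mamba: Revealing the Secrets of Microscopic Images with Just 4M Parameters](http://arxiv.org/abs/2409.07896)|**[link](https://github.com/zs1314/microscopic-mamba)**|In the field of medical microscopic image classification (MIC), CNN-based and Transformer-based models have been extensively studied. However, CNNs struggle to model long-range dependencies, which limits their ability to fully exploit the semantic information in images. Conversely, Transformers are hampered by quadratic computational complexity. To address these challenges, we propose a model based on the Mamba architecture: Microscopic-Mamba. Specifically, we design the Partially Selected Feed-Forward Network (PSFFN) to replace the last linear layer of the Visual State Space Module (VSSM), enhancing Mamba's local feature extraction ability. In addition, we introduce the Modulation Interaction Feature Aggregation (MIFA) module to effectively modulate and dynamically aggregate global and local features. We also incorporate a parallel VSSM mechanism to improve inter-channel information interaction while reducing the number of parameters. Extensive experiments show that our method achieves state-of-the-art performance on five public datasets. Code is available at https://github.com/zs1314/Microscopic-Mamba.||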
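|**2024-09-12**|[What is YOLOv9: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector](http://arxiv.org/abs/2409.07813)|null|This study provides a comprehensive analysis of the YOLOv9 object detection model, focusing on its architectural innovations, training methodologies, and performance improvements over its predecessors. Key advancements, such as the Generalized Efficient Layer Aggregation Network (GELAN) and Programmable Gradient Information (PGI), significantly enhance feature extraction and gradient flow, leading to improved accuracy and efficiency. By incorporating depthwise convolutions and the lightweight C3Ghost architecture, YOLOv9 reduces computational complexity while maintaining high precision. Benchmark tests on Microsoft COCO show superior mean Average Precision (mAP) and faster inference times, outperforming YOLOv8 across multiple metrics. The model's versatility is reflected in its seamless deployment across hardware platforms ranging from edge devices to high-performance GPUs, with built-in support for PyTorch and TensorRT integration. This paper offers the first in-depth exploration of YOLOv9's internal features and their real-world applicability, establishing it as a state-of-the-art solution for real-time object detection across industries, from IoT devices to large-scale industrial applications.||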
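|**2024-09-12**|[Mesh-based Super-Resolution of Fluid Flows with Multiscale Graph Neural Networks](http://arxiv.org/abs/2409.07769)|null|This work introduces a graph neural network (GNN) approach capable of mesh-based three-dimensional super-resolution of fluid flows. In this framework, the GNN is designed to operate not on the full mesh-based field at once, but directly on localized meshes of elements (or cells). To facilitate mesh-based GNN representations in a manner similar to spectral (or finite) element discretizations, a baseline GNN layer (termed a message passing layer, which updates local node properties) is modified to account for synchronization of coincident graph nodes, making it compatible with commonly used element-based mesh connectivities. The architecture is multiscale in nature and is composed of a combination of coarse-scale and fine-scale message passing layer sequences (termed processors) separated by a graph unpooling layer. The coarse-scale processor embeds a query element (alongside a set of neighboring coarse elements) into a single latent graph representation using coarse-scale synchronized message passing over the element neighborhood, and the fine-scale processor applies additional message passing operations on this latent graph to correct for interpolation errors. Demonstration studies are performed using hexahedral mesh-based data from Taylor-Green Vortex flow simulations at Reynolds numbers of 1600 and 3200. Through analysis of both global and local errors, the results show how the GNN is able to produce accurate super-resolved fields relative to the targets in both coarse-scale and multiscale model configurations. Reconstruction errors for fixed architectures were found to increase in proportion to the Reynolds number, while the inclusion of surrounding coarse element neighbors was found to improve predictions at Re=1600 but not at Re=3200.||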
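|**2024-09-12**|[DFDG: Data-Free Dual-Generator Adversarial Distillation for One-Shot Federated Learning](http://arxiv.org/abs/2409.07734)|null|Federated Learning (FL) is a distributed machine learning scheme in which clients jointly participate in the collaborative training of a global model by sharing model information rather than their private datasets. In light of concerns about communication and privacy, one-shot FL with a single communication round has emerged as a de facto promising solution. However, existing one-shot FL methods either require public datasets, focus on model-homogeneous settings, or distill limited knowledge from local models, making it difficult or even impractical to train a robust global model. To address these limitations, we propose a new data-free dual-generator adversarial distillation method (namely DFDG) for one-shot FL, which can explore a broader training space of the local models by training dual generators. DFDG is executed in an adversarial manner and comprises two parts: dual-generator training and dual-model distillation. In dual-generator training, we examine each generator with respect to fidelity, transferability, and diversity to ensure its utility, and additionally tailor a cross-divergence loss to reduce the overlap between the output spaces of the dual generators. In dual-model distillation, the trained dual generators work together to provide the training data for updating the global model. Finally, our extensive experiments on various image classification tasks show that DFDG achieves significant accuracy gains over SOTA baselines.||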
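|**2024-09-12**|[Cooperative Inference with Interleaved Operator Partitioning for CNNs](http://arxiv.org/abs/2409.07693)|null|Deploying deep learning models on Internet of Things (IoT) devices often faces challenges due to limited memory resources and computing capability. Cooperative inference is an important way to address this problem, and it requires partitioning and distributing an intelligent model across devices. To perform horizontal partitioning, existing cooperative inference methods use either the output channels of operators or the height and width of feature maps as the partition dimensions. In this manner, because an operator's activations are distributed, they must be concatenated before being fed to the next operator, which adds latency to cooperative inference. In this paper, we propose the Interleaved Operator Partitioning (IOP) strategy for CNN models. By partitioning one operator along the output channel dimension and its successor along the input channel dimension, activation concatenation can be avoided, which reduces the number of communication connections and consequently the cooperative inference latency. Based on IOP, we further present a model segmentation algorithm for minimizing cooperative inference time, which greedily selects operators for IOP pairing according to the inference latency benefit obtained. Experimental results show that, compared with the state-of-the-art partitioning approach used in CoEdge, the IOP strategy achieves a 6.39% to 16.83% speedup and reduces peak memory footprint by 21.22% to 49.98% on three classical image classification models.||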
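|**2024-09-11**|[Minimizing Embedding Distortion for Robust Out-of-Distribution Performance](http://arxiv.org/abs/2409.07582)|null|Foundation models trained on vast and diverse datasets have demonstrated a remarkable ability to generalize across different domains and distributions in a variety of zero-shot tasks. Our work addresses the challenge of retaining these powerful generalization capabilities when adapting foundation models to specific downstream tasks through fine-tuning. To this end, we introduce a novel approach called "similarity loss", which can be incorporated into the fine-tuning process of any task. By minimizing the distortion of the fine-tuned embeddings relative to the pre-trained embeddings, our method strikes a balance between task-specific adaptation and the preservation of broad generalization abilities. We evaluate our approach on two different tasks, image classification on satellite imagery and face recognition, focusing on open-class and domain-shift scenarios to assess out-of-distribution (OOD) performance. We demonstrate that this approach significantly improves OOD performance while maintaining strong in-distribution (ID) performance.||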
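|**2024-09-11**|[ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers](http://arxiv.org/abs/2409.07541)|**[link](https://github.com/gsavathrakis/enact)**|Transformers demonstrate competitive accuracy on vision-based object detection problems. However, they require considerable computational resources because of the quadratic size of the attention weights. In this work, we propose clustering the transformer input on the basis of its entropy. The rationale is that the self-information of each pixel (whose sum is the entropy) is likely to be similar among pixels that correspond to the same object. Clustering reduces the amount of data given as input to the transformer and therefore reduces training time and GPU memory usage, while preserving the meaningful information to be passed through the rest of the network. The proposed process is organized in a module called ENACT, which can be plugged into any transformer architecture that includes multi-head self-attention computation in its encoder. We ran extensive experiments using the COCO object detection dataset and three detection transformers. The results show a consistent reduction in the required computational resources in all tested cases, with only a slight decrease in detection accuracy. The code of the ENACT module will be made available at https://github.com/GSavathrakis/ENACT.||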
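|**2024-09-11**|[A Contrastive Symmetric Forward-Forward Algorithm (SFFA) for Continual Learning Tasks](http://arxiv.org/abs/2409.07387)|null|The so-called Forward-Forward Algorithm (FFA) has recently gained attention as an alternative to the conventional back-propagation algorithm for neural network learning, showing competitive performance across various modeling tasks. By replacing the backward pass of gradient back-propagation with two contrastive forward passes, the FFA avoids several shortcomings of its predecessor (e.g., vanishing/exploding gradients) by enabling layer-wise training heuristics. In classification tasks, this contrastive method has been shown to effectively create a latent sparse representation of the input data, ultimately favoring discriminability. However, the FFA exhibits inherently asymmetric gradient behavior because of an imbalanced loss function between positive and negative data, which harms the model's generalization and leads to an accuracy degradation. To address this issue, this work proposes the Symmetric Forward-Forward Algorithm (SFFA), a novel modification of the original FFA that partitions each layer into positive and negative neurons. This allows the local fitness function to be defined as the ratio between the activation of the positive neurons and the overall layer activity, resulting in a symmetric loss landscape during the training phase. To evaluate the enhanced convergence of our method, we conduct several experiments on multiple image classification benchmarks, comparing the accuracy of models trained with SFFA against their FFA-trained counterparts. As a byproduct of this reformulation, we explore the advantages of using a layer-wise training algorithm for Continual Learning (CL) tasks. The specialization of neurons and the sparsity of their activations induced by layer-wise training algorithms enable efficient CL strategies that incorporate new knowledge (classes) into the neural network while preventing catastrophic forgetting of previously...||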
|**2024-09-11**|[Three-Dimensional, Multimodal Synchrotron Data for Machine Learning Applications](http://arxiv.org/abs/2409.07322)|**[link](https://github.com/calum-green/xct-xdrct_paper_code)**|Machine learning techniques are being increasingly applied in medical and physical sciences across a variety of imaging modalities; however, an important issue when developing these tools is the availability of good quality training data. Here we present a unique, multimodal synchrotron dataset of a bespoke zinc-doped Zeolite 13X sample that can be used to develop advanced deep learning and data fusion pipelines. Multi-resolution micro X-ray computed tomography was performed on a zinc-doped Zeolite 13X fragment to characterise its pores and features, before spatially resolved X-ray diffraction computed tomography was carried out to characterise the homogeneous distribution of sodium and zinc phases. Zinc absorption was controlled to create a simple, spatially isolated, two-phase material. Both raw and processed data are available as a series of Zenodo entries. Altogether we present a spatially resolved, three-dimensional, multimodal, multi-resolution dataset that can be used for the development of machine learning techniques. Such techniques include super-resolution, multimodal data fusion, and 3D reconstruction algorithms.||
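|**2024-09-10**|[A comprehensive study on Blood Cancer detection and classification using Convolutional Neural Network](http://arxiv.org/abs/2409.06689)|null|Over the years, several efficient Convolutional Neural Networks (CNNs) in the field of object detection, such as DenseNet201, InceptionV3, ResNet152v2, SEresNet152, VGG19, and Xception, have gained significant attention for their performance. Moreover, the CNN paradigm has expanded to transfer learning and ensemble models built from the original CNN architectures. Research has shown that transfer learning and ensemble models can improve the accuracy of deep learning (DL) models. However, very few studies have conducted comprehensive experiments that use these techniques to detect and localize blood malignancies. Recognizing this gap, this study conducted three experiments: in the first, six original CNNs were used; in the second, transfer learning was applied; and in the third, a novel ensemble model, DIX (DenseNet201, InceptionV3, and Xception), was developed to detect and classify blood cancer. The statistical results suggest that DIX outperformed the original models and transfer learning, achieving an accuracy of 99.12%. However, this study also reports a negative result for transfer learning, as it did not improve the accuracy of the original CNNs. Like many other cancers, blood cancer requires timely identification for effective treatment planning and an increased chance of survival. The high accuracy achieved in detecting and classifying blood cancer with CNNs suggests that CNN models are promising for blood cancer detection. This research is significant for the fields of biomedical engineering, computer-aided disease diagnosis, and ML-based disease detection.||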
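|**2024-09-10**|[Lightweight Multiscale Feature Fusion Super-Resolution Network Based on Two-branch Convolution and Transformer](http://arxiv.org/abs/2409.06590)|null|Deep learning-based single image super-resolution (SISR) algorithms currently fall into two main families: models based on convolutional neural networks and models based on Transformers. The former stack convolutional layers with different kernel sizes, which lets the model better extract local image features. The latter use the self-attention mechanism, which lets the model establish long-range dependencies between image pixels and thus better extract global image features. However, both approaches face problems of their own. On this basis, this paper proposes a new lightweight multi-scale feature fusion network based on two-way complementary convolution and Transformer. Through a two-branch architecture, the model combines the respective strengths of the Transformer and the convolutional neural network to fuse global and local information. Meanwhile, to account for the loss of local information caused by the low-resolution images used to train deep neural networks, this paper designs a modular connection scheme with multi-stage feature supplementation that fuses the feature maps extracted in the shallow stages of the model with those extracted in the deep stages, minimizing the loss of information in the feature maps and facilitating higher-quality image restoration. Experimental results show that, compared with other lightweight models with the same number of parameters, the proposed model achieves the best image restoration performance.||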
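|**2024-09-10**|[Transtreaming: Adaptive Delay-aware Transformer for Real-time Streaming Perception](http://arxiv.org/abs/2409.06584)|null|Real-time object detection is critical to the decision-making process of many real-world applications, such as collision avoidance and path planning in autonomous driving. This work presents an innovative real-time streaming perception method, Transtreaming, which addresses the challenge of real-time object detection under dynamic computational delay. The core innovation of Transtreaming lies in its adaptive delay-aware transformer, which concurrently predicts multiple future frames and selects the output that best matches the real-world present time, compensating for any system-induced computational delay. Even in single-frame detection scenarios, the proposed model outperforms existing state-of-the-art methods by leveraging a transformer-based methodology. It demonstrates robust performance across a range of devices, from the powerful V100 to the modest 2080Ti, achieving the highest level of perceptual accuracy on all platforms. Unlike most state-of-the-art methods, which struggle to complete computation within a single frame on less powerful devices, Transtreaming meets stringent real-time processing requirements on all kinds of devices. The experimental results highlight the system's adaptability and its potential to significantly improve the safety and reliability of many real-world systems, such as autonomous driving.||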
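|**2024-09-10**|[Semi-Supervised 3D Object Detection with Chanel Augmentation using Transformation Equivariance](http://arxiv.org/abs/2409.06583)|null|Accurate 3D object detection is crucial for autonomous vehicles and robots to navigate and interact with their environment safely and effectively. Meanwhile, the performance of 3D detectors depends on data scale and annotation, which are usually expensive. Consequently, the demand for training with limited labeled data is growing. This paper explores a novel teacher-student framework that employs channel augmentation for semi-supervised 3D object detection. Teacher-student SSL typically applies weak augmentation to the teacher and strong augmentation to the student. In this work, we apply multiple channel augmentations to both networks using the Transformation Equivariance Detector (TED). TED allows us to explore different combinations of augmentations on point clouds and efficiently aggregates multi-channel transformation-equivariant features. In principle, by adopting fixed channel augmentations for the teacher network, the student can be trained stably on reliable pseudo-labels. Adopting strong channel augmentations enriches the diversity of the data, fosters robustness to transformations, and improves the generalization performance of the student network. We use SOTA hierarchical supervision as a baseline and adapt its dual-threshold to TED, which we call channel IoU consistency. We evaluate our method on the KITTI dataset and achieve a significant performance gain, surpassing SOTA semi-supervised 3D object detection models.||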
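|**2024-09-10**|[Dynamic Decoupling of Placid Terminal Attractor-based Gradient Descent Algorithm](http://arxiv.org/abs/2409.06542)|null|Gradient descent (GD) and stochastic gradient descent (SGD) have been widely used in a large number of application domains. Understanding the dynamics of GD and improving its convergence speed therefore remain of great importance. This paper carefully analyzes the dynamics of GD based on the terminal attractor at different stages of its gradient flow. On the basis of terminal sliding mode theory and terminal attractor theory, four adaptive learning rates are designed. Their performance is investigated through a detailed theoretical study, and the running times of the learning procedures are evaluated and compared. The total times of their learning processes are also studied in detail. To evaluate their effectiveness, various simulation results are examined on a function approximation problem and an image classification problem.||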
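|**2024-09-10**|[Knowledge Distillation via Query Selection for Detection Transformer](http://arxiv.org/abs/2409.06443)|null|Transformers have revolutionized the object detection landscape through the introduction of DETRs, which are acclaimed for their simplicity and efficacy. Despite these advantages, the substantial size of these models poses significant challenges for practical deployment, particularly in resource-constrained environments. This paper addresses the challenge of compressing DETR by leveraging knowledge distillation, a technique that holds promise for reducing model size while maintaining performance. A pivotal aspect of DETRs' performance is their reliance on queries to interpret object representations accurately. Traditional distillation techniques often focus exclusively on positive queries, identified through bipartite matching, and neglect the information present in hard-negative queries. Our visual analysis indicates that hard-negative queries that focus on foreground elements are crucial for enhancing distillation outcomes. To this end, we introduce a novel Group Query Selection strategy, which diverges from traditional query selection in DETR distillation by segmenting queries according to their Generalized Intersection over Union (GIoU) with ground-truth objects, thereby uncovering valuable hard-negative queries for distillation. Furthermore, we present the Knowledge Distillation via Query Selection for DETR (QSKD) framework, which incorporates Attention-Guided Feature Distillation (AGFD) and Local Alignment Prediction Distillation (LAPD). These components optimize the distillation process by focusing on the most informative parts of the teacher model's intermediate features and outputs. Our comprehensive experimental evaluation on the MS-COCO dataset demonstrates the effectiveness of our approach, significantly improving average precision (AP) across various DETR architectures without incurring substantial computational cost. Specifically, the AP of Conditional DETR ResNet-18 increased from 35.8 to 39.9.||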
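|**2024-09-10**|[Seam Carving as Feature Pooling in CNN](http://arxiv.org/abs/2409.06311)|null|This work investigates the potential of seam carving as a feature pooling technique within Convolutional Neural Networks (CNNs) for image classification tasks. We propose replacing the traditional max pooling layer with a seam carving operation. Our experiments on the Caltech-UCSD Birds 200-2011 dataset show that the seam carving-based CNN outperforms the max pooling model on metrics including accuracy, precision, recall, and F1-score. We further analyze the behavior of both approaches through feature map visualizations, which suggest that seam carving may preserve more structural information during pooling. We also discuss the limitations of our approach and propose potential directions for future research.||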
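|**2024-09-10**|[An Attribute-Enriched Dataset and Auto-Annotated Pipeline for Open Detection](http://arxiv.org/abs/2409.06300)|null|Detecting objects of interest through language often runs into difficulties, particularly for objects that are uncommon or hard to describe, because of perceptual discrepancies between automated models and human annotators. These challenges highlight the need for comprehensive datasets that go beyond standard object labels by incorporating detailed attribute descriptions. To address this need, we introduce the Objects365-Attr dataset, an extension of the existing Objects365 dataset distinguished by its attribute annotations. The dataset reduces inconsistencies in object detection by integrating a broad spectrum of attributes, including color, material, state, texture, and tone. It contains an extended collection of 5.6M object-level attribute descriptions, meticulously annotated across 1.4M bounding boxes. In addition, to validate the dataset's effectiveness, we conduct a rigorous evaluation of YOLO-World at different scales, measuring detection performance and demonstrating the dataset's contribution to advancing object detection.||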
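|**2024-09-09**|[Replay Consolidation with Label Propagation for Continual Object Detection](http://arxiv.org/abs/2409.05650)|null|Object detection is a highly relevant computer vision problem with many applications, such as robotics and autonomous driving. Continual Learning (CL) considers a setting in which a model incrementally learns new information while retaining previously acquired knowledge. This is particularly challenging because deep learning models tend to catastrophically forget old knowledge while training on new data. In particular, Continual Learning for Object Detection (CLOD) poses additional difficulties compared with CL for classification. In CLOD, images from previous tasks may contain unknown classes that could reappear, labeled, in future tasks. These missing annotations cause task interference issues for replay-based approaches. As a result, most works in the literature have focused on distillation-based approaches. However, these approaches are effective only when there is a strong overlap of classes across tasks. To address the issues of current methods, we propose a novel technique for CLOD called Replay Consolidation with Label Propagation for Object Detection (RCLPOD). Building on the replay method, our solution avoids task interference issues by enhancing the buffer memory samples. Our method is evaluated against existing techniques from the CLOD literature and demonstrates superior performance on established benchmarks such as VOC and COCO.||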
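|**2024-09-09**|[LEROjD: Lidar Extended Radar-Only Object Detection](http://arxiv.org/abs/2409.05564)|**[link](https://github.com/rst-tu-dortmund/lerojd)**|Accurate 3D object detection is vital for automated driving. Lidar sensors are well suited to this task, but they are expensive and have limitations in adverse weather conditions. 3+1D imaging radar sensors offer a cost-effective, robust alternative but face challenges due to their low resolution and high measurement noise. Existing 3+1D imaging radar datasets include both radar and lidar data, which enables the improvement of cross-modal models. Although lidar should not be used during inference, it can aid the training of radar-only object detectors. We explore two strategies for transferring knowledge from the lidar domain to the radar domain and to radar-only object detectors: (1) multi-stage training with sequential lidar point cloud thin-out, and (2) cross-modal knowledge distillation. In the multi-stage process, we investigate three thin-out methods. Our results show significant improvements of up to 4.2 percentage points in mean Average Precision (mAP) with multi-stage training, and up to 3.9 percentage points with knowledge distillation when the student is initialized with the teacher's weights. The main benefit of these approaches is that they apply to other 3D object detection networks without altering their architecture, as we show by analyzing them on two different object detectors. Our code is available at https://github.com/rst-tu-dortmund/lerojd.||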
|**2024-09-08**|[Can OOD Object Detectors Learn from Foundation Models?](http://arxiv.org/abs/2409.05162)|**[link](https://github.com/cvmi-lab/syncood)**|Out-of-distribution (OOD) object detection is a challenging task due to the absence of open-set OOD data. Inspired by recent advancements in text-to-image generative models, such as Stable Diffusion, we study the potential of generative models trained on large-scale open-set data to synthesize OOD samples, thereby enhancing OOD object detection. We introduce SyncOOD, a simple data curation method that capitalizes on the capabilities of large foundation models to automatically extract meaningful OOD data from text-to-image generative models. This offers the model access to open-world knowledge encapsulated within off-the-shelf foundation models. The synthetic OOD samples are then employed to augment the training of a lightweight, plug-and-play OOD detector, thus effectively optimizing the in-distribution (ID)/OOD decision boundaries. Extensive experiments across multiple benchmarks demonstrate that SyncOOD significantly outperforms existing methods, establishing new state-of-the-art performance with minimal synthetic data usage.||
|**2024-09-08**|[Visual Grounding with Multi-modal Conditional Adaptation](http://arxiv.org/abs/2409.04999)|**[link](https://github.com/mr-bigworth/mmca)**|Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task. They typically extract visual and textual features separately using independent visual and textual encoders, then fuse these features in a multi-modal decoder for final prediction. However, visual grounding presents unique challenges. It often involves locating objects with different text descriptions within the same image. Existing methods struggle with this task because the independent visual encoder produces identical visual features for the same image, limiting detection performance. Some recent approaches propose various language-guided visual encoders to address this issue, but they mostly rely solely on textual information and require sophisticated designs. In this paper, we introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update weights, directing its focus towards text-relevant regions. Specifically, we first integrate information from different modalities to obtain multi-modal embeddings. Then we use a set of weighting coefficients, generated from the multi-modal embeddings, to reorganize the weight update matrices and apply them to the visual encoder of the visual grounding model. Extensive experiments on four widely used datasets demonstrate that MMCA achieves significant improvements and state-of-the-art results. Ablation experiments further demonstrate that our method is lightweight and efficient. Our source code is available at: https://github.com/Mr-Bigworth/MMCA.||
|**2024-09-08**|[RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network](http://arxiv.org/abs/2409.04979)|null|Perceiving the surrounding environment is a fundamental task in autonomous driving. To obtain highly accurate perception results, modern autonomous driving systems typically employ multi-modal sensors to collect comprehensive environmental data. Among these, the radar-camera multi-modal perception system is especially favored for its excellent sensing capabilities and cost-effectiveness. However, the substantial modality differences between radar and camera sensors pose challenges in fusing information. To address this problem, this paper presents RCBEVDet, a radar-camera fusion 3D object detection framework. Specifically, RCBEVDet is developed from an existing camera-based 3D object detector, supplemented by a specially designed radar feature extractor, RadarBEVNet, and a Cross-Attention Multi-layer Fusion (CAMF) module. Firstly, RadarBEVNet encodes sparse radar points into a dense bird's-eye-view (BEV) feature using a dual-stream radar backbone and a Radar Cross Section aware BEV encoder. Secondly, the CAMF module utilizes a deformable attention mechanism to align radar and camera BEV features and adopts channel and spatial fusion layers to fuse them. To further enhance RCBEVDet's capabilities, we introduce RCBEVDet++, which advances the CAMF through sparse fusion, supports query-based multi-view camera perception models, and adapts to a broader range of perception tasks. Extensive experiments on the nuScenes show that our method integrates seamlessly with existing camera-based 3D perception models and improves their performance across various perception tasks. Furthermore, our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks. Notably, with ViT-L as the image backbone, RCBEVDet++ achieves 72.73 NDS and 67.34 mAP in 3D object detection without test-time augmentation or model ensembling.||
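|**2024-09-08**|[PatchAlign:Fair and Accurate Skin Disease Image Classification by Alignment with Clinical Labels](http://arxiv.org/abs/2409.04975)|**[link](https://github.com/aayushmanace/patchalign24)**|Deep learning models have achieved great success in automating skin lesion diagnosis. However, the ethnic disparity in these models' predictions needs to be addressed before they are deployed. We introduce a novel approach, PatchAlign, that improves the accuracy and fairness of skin condition image classification by aligning with clinical text representations of skin conditions. PatchAlign uses Graph Optimal Transport (GOT) loss as a regularizer to perform cross-domain alignment. The representations obtained are robust and generalize well across skin tones, even with limited training samples. To reduce the effect of noise and artifacts in clinical dermatology images, we propose a learnable Masked Graph Optimal Transport for cross-domain alignment, which further improves the fairness metrics. We compare our model with the state-of-the-art FairDisCo on two skin lesion datasets with different skin types: Fitzpatrick17k and Diverse Dermatology Images (DDI). Compared with FairDisCo, PatchAlign improves the accuracy of skin condition image classification by 2.8% (in-domain) and 6.2% (out-of-domain) on Fitzpatrick17k, and by 4.2% (in-domain) on DDI. Additionally, it consistently improves the fairness of true positive rates across skin tones. The source code for the implementation is available in the following GitHub repository, enabling easy reproduction and further experimentation: https://github.com/aayushmanace/PatchAlign24.||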
|**2024-09-07**|[Activation Function Optimization Scheme for Image Classification](http://arxiv.org/abs/2409.04915)|**[link](https://github.com/abdurrahman1828/afos)**|The activation function has a significant impact on the dynamics, convergence, and performance of deep neural networks. The search for a consistent and high-performing activation function has always been a pursuit during deep learning model development. Existing state-of-the-art activation functions are manually designed with human expertise, except for Swish, which was developed using a reinforcement learning-based search strategy. In this study, we propose an evolutionary approach for optimizing activation functions specifically for image classification tasks, aiming to discover functions that outperform current state-of-the-art options. Through this optimization framework, we obtain a series of high-performing activation functions denoted as Exponential Error Linear Unit (EELU). The developed activation functions are evaluated for image classification tasks from two perspectives: (1) five state-of-the-art neural network architectures, namely ResNet50, AlexNet, VGG16, MobileNet, and Compact Convolutional Transformer, which range from computationally heavy to light networks, and (2) eight standard datasets, namely CIFAR10, Imagenette, MNIST, Fashion MNIST, Beans, Colorectal Histology, CottonWeedID15, and TinyImageNet, which range from typical machine vision benchmarks to agricultural and medical image applications. Finally, we statistically investigate the generalization of the resultant activation functions developed through the optimization scheme. With a Friedman test, we conclude that the optimization scheme is able to generate activation functions that outperform the existing standard ones in 92.8% of the 28 cases studied, and $-x\cdot \operatorname{erf}(e^{-x})$ is found to be the best activation function for image classification generated by the optimization scheme.||
|**2024-09-07**|[SSFam: Scribble Supervised Salient Object Detection Family](http://arxiv.org/abs/2409.04817)|**[link](https://github.com/liuzywen/ssfam)**|Scribble supervised salient object detection (SSSOD) learns to segment attractive objects from their surroundings under the supervision of sparse scribble labels. For better segmentation in complex scenes, depth and thermal infrared modalities serve as supplements to RGB images. Existing methods design separate feature extraction and multi-modal fusion strategies for RGB, RGB-Depth, RGB-Thermal, and Visual-Depth-Thermal image inputs, leading to a flood of similar models. Because the recently proposed Segment Anything Model (SAM) possesses extraordinary segmentation and prompt-interaction capability, we propose an SSSOD family based on SAM, named SSFam, for combined inputs of different modalities. First, different modal-aware modulators are designed to attain modal-specific knowledge, which cooperates with the modal-agnostic information extracted from the frozen SAM encoder for a better feature ensemble. Second, a siamese decoder is tailored to bridge the gap between training with scribble prompts and testing with no prompt, for stronger decoding ability. Our model demonstrates remarkable performance across combinations of different modalities, sets a new best among scribble supervised methods, and comes close to fully supervised methods. Code: https://github.com/liuzywen/SSFam||
|**2024-09-07**|[SpotActor: Training-Free Layout-Controlled Consistent Image Generation](http://arxiv.org/abs/2409.04801)|null|Text-to-image diffusion models significantly enhance the efficiency of artistic creation with high-fidelity image generation. However, in typical application scenarios like comic book production, they can neither place each subject into its expected spot nor maintain the consistent appearance of each subject across images. For these issues, we pioneer a novel task, Layout-to-Consistent-Image (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts. To accomplish this challenging task, we present a new formalization of dual energy guidance with optimization in a dual semantic-latent space and thus propose a training-free pipeline, SpotActor, which features a layout-conditioned backward update stage and a consistent forward sampling stage. In the backward stage, we innovate a nuanced layout energy function to mimic the attention activations with a sigmoid-like objective. While in the forward stage, we design Regional Interconnection Self-Attention (RISA) and Semantic Fusion Cross-Attention (SFCA) mechanisms that allow mutual interactions across images. To evaluate the performance, we present ActorBench, a specified benchmark with hundreds of reasonable prompt-box pairs stemming from object detection datasets. Comprehensive experiments are conducted to demonstrate the effectiveness of our method. The results prove that SpotActor fulfills the expectations of this task and showcases the potential for practical applications with superior layout alignment, subject consistency, prompt conformity and background diversity.||
|**2024-09-07**|[LoCa: Logit Calibration for Knowledge Distillation](http://arxiv.org/abs/2409.04778)|null|Knowledge Distillation (KD), aiming to train a better student model by mimicking the teacher model, plays an important role in model compression. One typical way is to align the output logits. However, we find a common issue named mis-instruction, that the student would be misled when the predictions based on teacher logits do not follow the labels. Meanwhile, there is other useful dark knowledge in the logits such as the class discriminability, which is vital for distillation. In this paper, we propose a simple yet effective Logit Calibration (LoCa) method, which calibrates the logits from the teacher model based on the ground-truth labels. The key insight is to correct the prediction (to address the mis-instruction issue) and maintain useful dark knowledge simultaneously. Our proposed LoCa does not require any additional parameters. Empirical results on image classification and text generation tasks demonstrate that LoCa can effectively improve the performance of baselines.||
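|**2024-09-05**|[Use of triplet loss for facial restoration in low-resolution images](http://arxiv.org/abs/2409.03530)|null|In recent years, face recognition (FR) models have become the most widely used biometric tool, achieving impressive results on numerous datasets. However, inherent hardware challenges or shooting distances often result in low-resolution images, which significantly impair the performance of FR models. To address this issue, several solutions have been proposed, including super-resolution (SR) models that generate highly realistic faces. Despite these efforts, FR algorithms have not improved significantly. We propose a novel SR model, FTLGAN, which focuses on generating high-resolution images that preserve individual identities rather than merely improving image quality, thereby maximizing the performance of FR models. The results are compelling, showing a mean value of d' 21% above the best current state-of-the-art models: specifically, d' = 1.099 and AUC = 0.78 for 14x14 pixels, d' = 2.112 and AUC = 0.92 for 28x28 pixels, and d' = 3.049 and AUC = 0.98 for 56x56 pixels. The contributions of this study are significant in several key areas. First, a notable improvement in face recognition performance is achieved on low-resolution images, specifically at resolutions of 14x14, 28x28, and 56x56 pixels. Second, the enhancements delivered by FTLGAN show a consistent response across all resolutions, providing outstanding performance uniformly, unlike the other models compared. Third, an innovative approach based on triplet loss logic makes it possible to train the super-resolution model using only real images, in contrast to current models, and expands the potential real-world applications. Finally, this study introduces a novel model that specifically addresses the challenge of improving the classification performance of face recognition systems by integrating face recognition quality as a loss during model training.||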
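|**2024-09-05**|[Have Large Vision-Language Models Mastered Art History?](http://arxiv.org/abs/2409.03521)|null|The emergence of large Vision-Language Models (VLMs) has recently established new baselines in image classification across multiple domains. However, the performance of VLMs on the specific task of artwork classification, and in particular art style classification of paintings, a domain traditionally mastered by art historians, has not yet been explored. Compared with natural images, artworks pose a unique challenge because of their inherently complex and diverse structures, characterized by variable compositions and styles. Art historians have long studied the unique aspects of artworks, and style prediction is a crucial component of their discipline. This paper investigates whether large VLMs, which integrate visual and textual data, can effectively predict the art historical attributes of paintings. We conduct an in-depth analysis of four VLMs, namely CLIP, LLaVA, OpenFlamingo, and GPT-4o, focusing on zero-shot classification of art style, author, and time period using two public benchmarks of artworks. Additionally, we present ArTest, a well-curated test set of artworks that includes pivotal paintings studied by art historians.||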
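|**2024-09-05**|[LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution](http://arxiv.org/abs/2409.03516)|**[link](https://github.com/jwgdmkj/lmlt)**|Recent Vision Transformer (ViT)-based methods for image super-resolution have demonstrated impressive performance. However, they suffer from high complexity, resulting in long inference times and high memory usage. Additionally, ViT models that use Window Self-Attention (WSA) struggle to process information outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which applies attention with a different feature size for each head. LMLT divides image features along the channel dimension, gradually reduces the spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issue in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based image super-resolution methods. Our code is available at https://github.com/jwgdmkj/LMLT.||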
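|**2024-09-05**|[Non-Uniform Illumination Attack for Fooling Convolutional Neural Networks](http://arxiv.org/abs/2409.03458)|**[link](https://github.com/Akshayjain97/Non-Uniform_Illumination)**|Convolutional Neural Networks (CNNs) have made remarkable strides, yet they remain susceptible to attacks, particularly minor image perturbations that humans can easily see through. This weakness, often termed an "attack", underscores the limited robustness of CNNs and the need for research into strengthening their resistance to such manipulations. This study introduces a novel Non-Uniform Illumination (NUI) attack technique in which images are subtly altered using varying NUI masks. We conduct extensive experiments on widely accepted datasets, including CIFAR10, TinyImageNet, and CalTech256, focusing on image classification with 12 different NUI attack models. The resilience of the VGG, ResNet, MobilenetV3-small, and InceptionV3 models against NUI attacks is evaluated. Our results show a substantial decline in the classification accuracy of CNN models when subjected to NUI attacks, indicating their vulnerability under non-uniform illumination. To mitigate this, we propose a defense strategy that includes NUI-attacked images, generated through the new NUI transformation, in the training set. The results show a significant improvement in CNN model performance when confronted with images perturbed by NUI attacks. The strategy seeks to bolster the resilience of CNN models against NUI attacks.||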
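|**2024-09-05**|[Raw Speech Enhancement with Deep State Space Modeling](http://arxiv.org/abs/2409.03377)|**[link](https://github.com/Brainchip-Inc/aTENNuate)**|We present aTENNuate, a simple deep state-space autoencoder configured for efficient online raw speech enhancement in an end-to-end fashion. The network's performance is evaluated primarily on raw speech denoising, with additional evaluations on tasks such as super-resolution and de-quantization. We benchmark aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets. The network outperforms previous real-time denoising models in terms of PESQ score, parameter count, MACs, and latency. Even as a raw waveform processing model, it maintains high fidelity to the clean signal with minimal audible artifacts. In addition, the model remains performant even when the noisy input is compressed down to 4000Hz and 4 bits, suggesting general speech enhancement capabilities in low-resource environments.||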
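|**2024-09-05**|[Training-free Conversion of Pretrained ANNs to SNNs for Low-Power and High-Performance Applications](http://arxiv.org/abs/2409.03368)|**[link](https://github.com/putshua/inference-scale-ann-snn)**|Spiking Neural Networks (SNNs) have emerged as a promising alternative to Artificial Neural Networks (ANNs) because of their fast inference and low power consumption. However, the lack of efficient training algorithms has hindered their widespread adoption. Existing supervised learning algorithms for SNNs require significantly more memory and time than their ANN counterparts. Even commonly used ANN-SNN conversion methods require re-training the ANN to improve conversion efficiency, which incurs additional computational cost. To address these challenges, we propose a novel training-free ANN-SNN conversion pipeline. Our approach converts pre-trained ANN models directly into high-performance SNNs without any additional training. The conversion pipeline includes a local-learning-based threshold balancing algorithm, which efficiently computes the optimal thresholds and adjusts them at a fine granularity through channel-wise scaling. We demonstrate the scalability of our framework on three typical computer vision tasks: image classification, semantic segmentation, and object detection. This showcases its applicability to both classification and regression tasks. Moreover, we evaluate the energy consumption of the converted SNNs and demonstrate their superior low-power advantage over conventional ANNs. Our training-free algorithm outperforms existing methods, highlighting its practical applicability and efficiency. This approach simplifies the deployment of SNNs by leveraging open-source pre-trained ANN models and neuromorphic hardware, enabling fast, low-power inference with negligible performance loss.||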
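|**2024-09-05**|[YOLO-PPA based Efficient Traffic Sign Detection for Cruise Control in Autonomous Driving](http://arxiv.org/abs/2409.03320)|null|Detecting traffic signs efficiently and accurately is vital in autonomous driving systems. However, the farther away a traffic sign is, the smaller it appears, and existing object detection algorithms can hardly detect such small signs. In addition, the performance of in-vehicle embedded devices limits the scale of the detection model. To address these challenges, this paper proposes a traffic sign detection algorithm based on YOLO PPA. Experimental results on the GTSDB dataset show that, compared with the original YOLO, the proposed method improves inference efficiency by 11.2%, and mAP 50 is also improved by 93.2%, demonstrating the effectiveness of the proposed YOLO PPA.||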
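|**2024-09-05**|[PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning](http://arxiv.org/abs/2409.03192)|null|Fine-grained image classification has advanced significantly with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce the Precision-Enhanced Pseudo-Labeling (PEPL) approach, designed specifically for fine-grained image classification within a semi-supervised learning framework. Our method leverages abundant unlabeled data by generating high-quality pseudo-labels that are progressively refined through two key phases: initial pseudo-label generation and semantic-mixed pseudo-label generation. These phases use Class Activation Maps (CAMs) to accurately estimate the semantic content and generate refined labels that capture the essential details needed for fine-grained classification. By focusing on semantic-level information, our approach effectively addresses the limitations of standard data augmentation and image-mixing techniques in preserving critical fine-grained features. We achieve state-of-the-art performance on benchmark datasets, demonstrating significant improvements over existing semi-supervised strategies, with notable gains in both accuracy and robustness. Our code has been open sourced at https://github.com/TianSuya/SemiFG.||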
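|**2024-09-05**|[The AdEMAMix Optimizer: Better, Faster, Older](http://arxiv.org/abs/2409.03137)|**[link](https://github.com/apple/ml-ademamix)**|Momentum-based optimizers are central to a wide range of machine learning applications. These optimizers typically rely on an Exponential Moving Average (EMA) of gradients, which exponentially decays the contribution of older gradients to the current one. This is motivated by the fact that gradients are local linear approximations that lose their relevance as the iterate moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and shows empirically that this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight to the immediate past and a non-negligible weight to older gradients. Building on this observation, we propose AdEMAMix, a simple modification of the Adam optimizer that mixes two EMAs to make better use of past gradients. Our experiments on language modeling and image classification show, quite surprisingly, that gradients can stay relevant for tens of thousands of steps. They help to converge faster, and often to lower minima: for example, a 1.3B-parameter AdEMAMix LLM trained on 101B tokens performs comparably to an AdamW model trained on 197B tokens (+95%). Moreover, our method significantly slows down model forgetting during training. Our work motivates further exploration of different types of functions for leveraging past gradients, beyond EMAs.||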
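|**2024-09-04**|[Boundless: Generating Photorealistic Synthetic Data for Object Detection in Urban Streetscapes](http://arxiv.org/abs/2409.03022)|**[link](https://github.com/zk2172-columbia/boundless)**|We introduce Boundless, a photo-realistic synthetic data generation system for enabling highly accurate object detection in dense urban streetscapes. Boundless can replace massive real-world data collection and manual ground-truth object annotation (labeling) with an automated and configurable process. Boundless is based on the Unreal Engine 5 (UE5) City Sample project, with improvements that enable accurate collection of 3D bounding boxes across different lighting and scene variability conditions. We evaluate the performance of object detection models trained on the dataset generated by Boundless when used for inference on a real-world dataset acquired from medium-altitude cameras. We compare the performance of the Boundless-trained model against the CARLA-trained model and observe an improvement of 7.8 mAP. The results we achieved support the premise that synthetic data generation is a credible methodology for training and fine-tuning scalable object detection models for urban scenes.||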
|**2024-09-04**|[iConFormer: Dynamic Parameter-Efficient Tuning with Input-Conditioned Adaptation](http://arxiv.org/abs/2409.02838)|null|基于预训练编码器的完整微调（FFT）和任务特定解码器的迁移学习随着深度模型的指数级增长而变得越来越复杂。使用由小型可学习层组成的适配器的参数高效微调（PEFT）方法已成为 FFT 的替代方案，在保持高训练效率的同时实现了可比的性能。然而，适配器对输入实例的不灵活限制了其在不同下游任务中学习任务特定信息的能力。在本文中，我们提出了一种新的 PEFT 方法，即输入条件化的 Transformer，称为 iConFormer，它利用了以输入实例为条件的动态适配器。为了确保在各种下游任务中对输入实例的灵活学习能力，我们在动态适配器中引入了输入条件化网络（iCoN），从而实现实例级特征转换。具体来说，iCoN 为每个特征生成通道级的卷积核，并使用自适应卷积过程对其进行转换，以有效捕获针对下游任务的任务特定和细粒度细节。实验结果表明，通过仅调整 Transformer 主干参数的 1.6% 到 2.8%，iConFormer 在单目深度估计和语义分割方面实现了与 FFT 相当的性能，同时在图像分类和实例分割方面优于 FFT。此外，所提出的方法在所有上述任务中始终优于最近的 PEFT 方法。||
|**2024-09-04**|[Real-Time Dynamic Scale-Aware Fusion Detection Network: Take Road Damage Detection as an example](http://arxiv.org/abs/2409.02546)|null|基于无人机的道路损坏检测 (RDD) 对城市的日常维护和安全至关重要，特别是在显著降低劳动力成本方面。然而，当前基于无人机的 RDD 研究仍面临许多挑战。例如，形状和方向不规则的损坏、背景对损坏的遮挡以及难以区分损坏和背景，这些因素都显著影响了无人机在日常巡检中检测道路损坏的能力。为了解决这些问题并提高无人机实时道路损坏检测的性能，我们设计并提出了三个相应的模块：一个能够灵活适应形状和背景的特征提取模块；一个融合多尺度感知并适应形状和背景的模块；一个高效的下采样模块。基于这些模块，我们设计了一种具有自动去除背景干扰能力的多尺度自适应道路损坏检测模型，称为动态尺度感知融合检测模型 (RT-DSAFDet)。在 UAV-PDD2023 公开数据集上的实验结果表明，我们的模型 RT-DSAFDet 的 mAP50 达到了 54.2%，比最新实时目标检测模型 YOLOv10 的高效变体 YOLOv10-m 高 11.1%，而参数量减少到 1.8M，FLOPs 减少到 4.6G，分别降低了 88% 和 93%。此外，在大型通用目标检测公开数据集 MS COCO2017 上也展现了我们模型的优越性，其 mAP50-95 与 YOLOv9-t 相同，但 mAP50 高出 0.5%，参数量减少 10%，FLOPs 减少 40%。||
|**2024-09-04**|[Boosting Generalizability towards Zero-Shot Cross-Dataset Single-Image Indoor Depth by Meta-Initialization](http://arxiv.org/abs/2409.02486)|null|室内机器人的导航或障碍物检测等任务依赖于深度信息，而单图像深度估计被广泛用于辅助感知。大多数室内单图像深度预测较少关注模型对未见数据集的泛化能力，而更关注系统部署的野外鲁棒性。这项工作利用基于梯度的元学习在零样本跨数据集推理中获得更高的泛化能力。与研究最多的、与显式类别标签相关的图像分类元学习不同，对于与物体排列和场景构成方面高度变化的室内环境相关的连续深度值，不存在明确的任务边界。我们提出了细粒度任务，在我们的元学习公式中将每个RGB-D小批量视为一个任务。我们首先展示了我们的方法在有限数据上诱导出更好的先验（RMSE 最高降低 27.8%）。然后，在元学习初始化上进行微调始终优于没有元方法的基线。为了实现泛化，我们提出了零样本跨数据集协议，并验证了由我们的元初始化诱导的更高泛化能力，作为许多现有深度估计方法的简单而有用的插件。深度和元学习交叉领域的工作有可能推动这两项研究更接近实际的机器人和机器感知应用。||
|**2024-09-03**|[Site Selection for the Second Flyeye Telescope: A Simulation Study for Optimizing Near-Earth Object Discovery](http://arxiv.org/abs/2409.02329)|null|欧洲航天局 (ESA) 正在开发一个名为 Flyeye 的广域巡天望远镜网络，以改进近地天体 (NEO) 的发现。该网络中的第一个望远镜将位于北半球的穆法拉山（意大利），而第二个具有增强探测能力的 Flyeye 望远镜刚刚开始关键设计阶段。通过对撞击轨迹上的近地天体进行模拟，研究了第二个 Flyeye 望远镜的潜在位置。对大约 3000 个撞击小行星（绝对星等为 H=25 和 H=28）进行了传播，并测试了主要现有巡天项目（Catalina、Pan-STARRS、ATLAS）、即将投入使用的薇拉·鲁宾天文台 (LSST) 以及 Flyeye 可能选址的可探测性。考虑了智利、南非和北半球的第二个设施。对于每个天文台，在模拟中都考虑了它们过去或计划的指向策略。在 LSST 部署之前，南半球的一个 Flyeye 的性能与北半球的一个望远镜相似。结合起来，在北方和南方各放置一台望远镜可以最大限度地提高探测率和探测到的独特物体的数量。LSST 之后，南部和北部的 Flyeye 望远镜仍然是互补的。总体而言，模拟表明，无论是在 LSST 之前还是之后，位于南部的第二个 Flyeye 都可以补充位于北部的 Flyeye 望远镜。位于拉西拉的 Flyeye 将利用其优越的大气条件，同时平衡南北半球的资产。||
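|**2024-09-03**|[K-Origins: Better Colour Quantification for Neural Networks](http://arxiv.org/abs/2409.02281)|**[link](https://github.com/lewismmason/Thesis-Public)**|K-Origins is a neural network layer designed to improve image-based network performance when learning colour or intensity is beneficial. Over 250 encoder-decoder convolutional networks are trained and tested on 16-bit synthetic data, showing that K-Origins improves semantic segmentation accuracy in two scenarios: object detection at low signal-to-noise ratios, and segmentation of multiple objects that are identical in shape but differ in colour. For each trainable parameter $w_k$, K-Origins generates output features from the input features $\textbf{X}$ by the equation $\textbf{Y}_k = \textbf{X}-\textbf{J}\cdot w_k$, where $\textbf{J}$ is a matrix of ones. Networks with varying receptive fields were also trained to determine the optimal network depth from the dimensions of the target classes, which suggests that the receptive field length should exceed the object size. Ensuring a sufficient receptive field length and incorporating K-Origins yields better semantic network performance.|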
|**2024-09-03**|[Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems](http://arxiv.org/abs/2409.02278)|null|近年来，视觉语言模型（VLM）的快速发展展现出其在图像理解相关应用方面的巨大潜力。本研究探索了最先进的VLM模型在基于视觉的交通工程任务中的应用，例如图像分类和目标检测。图像分类任务包括拥堵检测和裂缝识别，而目标检测任务则用于识别未佩戴头盔的行为。我们应用了开源模型（如CLIP、BLIP、OWL-ViT、Llava-Next）和闭源模型GPT-4o，评估了这些最先进的VLM模型的性能，以利用语言理解能力来完成基于视觉的交通任务。这些任务通过对VLM模型应用零样本提示来完成，因为零样本提示可以在不对任务进行任何训练的情况下执行任务。这消除了对特定任务进行标注数据集或微调的需求。虽然这些模型在图像分类任务中取得了与基准卷积神经网络（CNN）模型相当的结果，但在目标定位任务中仍有改进的空间。因此，本研究对最先进的VLM模型进行了全面评估，突出了这些模型的优势和局限性，可以作为未来改进和广泛实施的基准。||
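|**2024-09-03**|[A Modern Take on Visual Relationship Reasoning for Grasp Planning](http://arxiv.org/abs/2409.02035)|null|Interacting with cluttered real-world scenes poses several challenges to robotic agents, which need to understand the complex spatial dependencies among observed objects to determine an optimal picking order or an efficient object retrieval strategy. Existing solutions typically handle simplified scenarios and focus on predicting pairwise object relationships after an initial object detection phase, but they often overlook the global context or struggle with redundant and missing object relations. In this work, we present a modern take on visual relationship reasoning for grasp planning. We introduce D3GD, a new testbed of bin-picking scenes with up to 35 objects from 97 distinct categories. We also propose D3G, a new end-to-end transformer-based dependency graph generation model that simultaneously detects objects and produces an adjacency matrix representing their spatial relationships. Recognizing the limitations of standard metrics, we use the Average Precision of Relationships for the first time to evaluate model performance, and we conduct an extensive experimental benchmark. The results establish our approach as the new state of the art for this task and lay the foundation for future research in robotic manipulation. We publicly release the code and dataset at https://paolotron.github.io/d3g.github.io.|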
|**2024-09-03**|[Compressed learning based onboard semantic compression for remote sensing platforms](http://arxiv.org/abs/2409.01988)|**[link](https://github.com/protim1191/glodismo_classifier)**|地球观测 (EO) 在创建和维持一个具有弹性和繁荣的社会方面发挥着至关重要的作用，这对所有生命和地球本身都具有深远的影响。卫星、航空平台以及最近的无人机和无人驾驶飞行器等遥感平台都用于 EO。它们收集大量数据，需要将其下传到地球进行进一步处理和分析。这种高吞吐量采集的瓶颈是下行链路带宽。需要以数据为中心的图像压缩解决方案来应对这种海量数据。在这项工作中，通过压缩学习框架研究了语义压缩，该框架仅利用快速和稀疏的矩阵向量乘法来编码数据。相机噪声和通信信道是造成失真的主要来源。然后，完整的语义通信管道由一个学习到的低复杂度压缩矩阵组成，该矩阵作用于噪声相机输出，以在机载生成一个观测向量，该向量通过通信信道下行链路传输，通过展开网络处理，然后馈送到执行必要下游任务的深度学习模型；研究了图像分类。通过使用小波稀疏先验展开 NA-ALISTA 的层来补偿失真。因此，解码是一种根据相机/环境信息和下游任务设计的即插即用方法。用于下游任务的深度学习模型通过端到端方式的损失函数与压缩矩阵和展开网络联合微调。结果表明，在低压缩比的噪声环境中，添加恢复损失以及任务相关损失可以提高下游性能。||
|**2024-09-03**|[Latent Distillation for Continual Object Detection at the Edge](http://arxiv.org/abs/2409.01872)|**[link](https://github.com/pastifra/Continual_Nanodet)**|虽然在目标检测文献中存在许多性能卓越的方法，但解决数据分布偏移仍然具有挑战性。持续学习（CL）为这个问题提供了解决方案，使模型能够适应新数据，同时保持对先前数据的性能。这对于边缘设备尤其重要，这些设备在汽车和机器人等动态环境中很常见。在这项工作中，我们解决了目标检测持续学习（CLOD）场景中边缘设备的内存和计算限制。具体来说，（i）我们研究了一种开源、轻量级和快速的检测器 NanoDet 对边缘设备上 CLOD 的适用性，改进了文献中使用的较大架构。此外，（ii）我们提出了一种名为潜在蒸馏（LD）的新型 CL 方法，该方法在不显着影响检测性能的情况下减少了最先进的 CL 方法所需的运算次数和内存。我们的方法使用著名的 VOC 和 COCO 基准测试集进行了验证，与其他蒸馏方法相比，每次模型更新可将蒸馏参数开销减少 74%，将浮点运算（FLOPs）减少 56%。||
|**2024-09-03**|[GeoBEV: Learning Geometric BEV Representation for Multi-view 3D Object Detection](http://arxiv.org/abs/2409.01816)|**[link](https://github.com/mengtan00/geobev)**|鸟瞰图 (BEV) 表示已成为多视图 3D 对象检测的主流范式，展现出令人印象深刻的感知能力。然而，现有方法忽略了 BEV 表示的几何质量，使其处于低分辨率状态，无法恢复场景真实的几何信息。在本文中，我们确定了先前方法受限于低 BEV 表示分辨率的原因，并提出了径向-笛卡尔 BEV 采样 (RC-Sampling)，从而能够高效生成高分辨率密集 BEV 表示，而无需复杂的算子。此外，我们设计了一种新颖的盒内标签来替代从激光雷达点生成的传统深度标签。此标签反映了对象的实际几何结构，而不仅仅是它们的表面，将现实世界的几何信息注入 BEV 表示中。此外，结合盒内标签，开发了一种质心感知内部损失 (CAI 损失) 来捕捉对象的细粒度内部几何结构。最后，我们将上述模块集成到一个名为 GeoBEV 的新型多视图 3D 对象检测框架中。在 nuScenes 数据集上的大量实验表明，GeoBEV 实现了最先进的性能，突出了其有效性。||

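The K-Origins entry above states its layer equation explicitly, $\textbf{Y}_k = \textbf{X}-\textbf{J}\cdot w_k$, so it can be illustrated directly. The following is a minimal pure-Python sketch of that equation only, not the authors' implementation: the function name `k_origins`, the nested-list representation, and the example weights are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def k_origins(x: Matrix, weights: List[float]) -> List[Matrix]:
    """Return one shifted copy of x per weight: Y_k = X - J * w_k.

    J is a matrix of ones with the same shape as x, so subtracting
    J * w_k is the same as subtracting the scalar w_k from every entry.
    """
    return [[[value - w_k for value in row] for row in x] for w_k in weights]


# Hypothetical 2x2 single-channel input and two hypothetical "origins".
x = [[0.0, 0.5], [1.0, 0.25]]
weights = [0.25, 0.5]

for w_k, y_k in zip(weights, k_origins(x, weights)):
    print(f"w_k={w_k}: {y_k}")
```

Each output channel is the input re-expressed relative to one value $w_k$. In the paper the $w_k$ are trainable, whereas this sketch treats them as fixed numbers.
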

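The AdEMAMix entry above argues that a single EMA cannot give a high weight to recent gradients and a non-negligible weight to old ones at the same time. A small calculation makes this concrete. In an EMA with decay `beta`, a gradient that is `age` steps old has weight `(1 - beta) * beta**age`. The sketch below is illustrative only: the decay values, the mixing coefficient `alpha`, and the helper names are assumptions, not the settings or the code from the paper.

```python
def ema_weight(beta: float, age: int) -> float:
    """Weight an EMA with decay `beta` gives to a gradient `age` steps old."""
    return (1.0 - beta) * beta ** age


def mixed_weight(age: int, fast: float = 0.9, slow: float = 0.999,
                 alpha: float = 5.0) -> float:
    """Weight under a mixture of a fast EMA and an alpha-scaled slow EMA.

    `fast`, `slow` and `alpha` are hypothetical values chosen for illustration.
    """
    return ema_weight(fast, age) + alpha * ema_weight(slow, age)


for age in (0, 10, 1000):
    print(
        f"age={age:5d}  "
        f"fast={ema_weight(0.9, age):.3e}  "
        f"slow={ema_weight(0.999, age):.3e}  "
        f"mixed={mixed_weight(age):.3e}"
    )
```

With these assumed values, the fast EMA has effectively forgotten a gradient after 1000 steps, and the slow EMA gives little weight to the newest gradient. The mixture keeps a meaningful weight at both ends, which is the motivation the abstract gives for combining two EMAs.
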
## Generative Models

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
|**2025-04-08**|[Transfer between Modalities with MetaQueries](http://arxiv.org/abs/2504.06256)|null|统一多模态模型旨在整合理解（文本输出）和生成（像素输出），但将这些不同的模态整合到单个架构中通常需要复杂的训练方法和仔细的数据平衡。我们引入了元查询（MetaQueries），这是一组可学习的查询，充当自回归多模态大型语言模型 (MLLM) 和扩散模型之间的有效接口。元查询将 MLLM 的潜在表示连接到扩散解码器，通过利用 MLLM 的深度理解和推理能力实现知识增强的图像生成。我们的方法简化了训练，只需要配对的图像-标题数据和标准的扩散目标函数。值得注意的是，即使 MLLM 主干保持冻结状态，这种迁移也很有效，从而在保持其最先进的多模态理解能力的同时实现强大的生成性能。此外，我们的方法灵活，可以轻松进行指令微调，以用于图像编辑和主题驱动生成等高级应用。|
|**2025-04-08**|[From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models](http://arxiv.org/abs/2504.06214)|null|长上下文能力对于广泛的应用至关重要，包括文档和视频理解、上下文学习和推理时扩展，所有这些都需要模型处理和推理长序列的文本和多模态数据。在这项工作中，我们介绍了一种高效的训练方法，用于从对齐的指令模型构建超长上下文大型语言模型 (LLM)，将上下文长度的边界从128K扩展到1M、2M和4M个token。我们的方法利用高效的持续预训练策略来扩展上下文窗口，并采用有效的指令微调来保持指令遵循和推理能力。我们的UltraLong-8B模型基于Llama3.1-Instruct并使用我们的方法构建，在一系列不同的长上下文基准测试中实现了最先进的性能。重要的是，使用我们的方法训练的模型在标准基准测试中保持了竞争力，证明了在长上下文和短上下文任务上的均衡改进。我们进一步深入分析了关键的设计选择，强调了扩展策略和数据组成的影响。我们的研究结果建立了一个稳健的框架，可以有效地扩展上下文长度，同时保留模型的通用能力。我们在以下网址发布所有模型权重：https://ultralong.github.io/。|
|**2025-04-08**|[FaceCloak: Learning to Protect Face Templates](http://arxiv.org/abs/2504.06131)|null|生成模型可以从编码表示（模板）重建人脸图像，其与原始人脸的相似度非常高，引发了安全和隐私方面的担忧。我们提出了FaceCloak，一个通过生成智能的、可更新的二进制掩码来保护人脸模板的神经网络框架。我们的方法通过使用从单个人脸模板动态合成的独特干扰器来掩盖人脸模板，从而主动阻止反演攻击，同时可证地保留生物识别效用和不可链接性。我们经过掩盖的模板可以抑制敏感属性，同时泛化到新的特征提取方案，并且在生物特征匹配和抵御重建攻击方面的性能优于领先的基线方法。基于FaceCloak的匹配速度极快（推理时间成本=0.28毫秒），并且轻量级（0.57MB）。|
|**2025-04-08**|[CamContextI2V: Context-aware Controllable Video Generation](http://arxiv.org/abs/2504.06022)|null|近来，图像到视频 (I2V) 扩散模型在场景理解和生成质量方面展现出令人印象深刻的能力，通过结合图像条件来引导生成过程。然而，这些模型主要用于使静态图像动画化，而没有扩展到其提供的上下文之外。引入额外的约束，例如相机轨迹，可以增强多样性，但通常会降低视觉质量，从而限制其在需要忠实场景表示的任务中的适用性。我们提出了 CamContextI2V，这是一个 I2V 模型，它集成了多个图像条件、3D 约束以及相机控制，以丰富全局语义和细粒度的视觉细节。这使得能够生成更连贯且上下文感知的视频。此外，我们还解释了对有效上下文表示的时间感知的必要性。我们在 RealEstate10K 数据集上进行的全面研究证明了视觉质量和相机可控性的改进。我们的代码和模型已公开发布：https://github.com/LDenninger/CamContextI2V。|
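|**2025-04-08**|[Note on the Universality of Parameterized IQP Circuits with Hidden Units for Generating Probability Distributions](http://arxiv.org/abs/2504.05997)|null|In a series of recent works, an interesting quantum generative model based on parameterized instantaneous quantum polynomial (IQP) circuits has emerged, because such circuits can be trained efficiently on classical hardware using any loss function that depends only on the expectation values of the model's observables. The model has been proven not to be universal for generating arbitrary distributions, but it is suspected that its marginals can be, much as Boltzmann machines achieve universality by using hidden layers (traced out, in quantum terms). In this short note, we provide two simple proofs of this fact. The first is near-trivial and asymptotic; the second shows that universality can be achieved with a reasonable number of additional qubits.|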
|**2025-04-08**|[An Empirical Study of GPT-4o Image Generation Capabilities](http://arxiv.org/abs/2504.05979)|null|图像生成领域发展迅速，从早期的基于GAN的方法到扩散模型，以及最近出现的旨在连接理解和生成任务的统一生成架构。近期的进展，特别是GPT-4o，已经证明了高保真多模态生成的可能性，但其架构设计仍然神秘且未公开发表。这引发了一个问题：对于这些方法，图像和文本生成是否已经成功地整合到一个统一的框架中。在这项工作中，我们对GPT-4o的图像生成能力进行了实证研究，并将其与领先的开源和商业模型进行了基准测试。我们的评估涵盖四大类，包括文本到图像、图像到图像、图像到3D以及图像到X的生成，共涉及20多项任务。我们的分析突出了GPT-4o在各种设置下的优势和局限性，并将其置于生成建模的更广泛的演变背景中。通过这项研究，我们确定了未来统一生成模型的发展方向，强调了架构设计和数据规模的作用。|
|**2025-04-08**|[Diffusion Based Ambiguous Image Segmentation](http://arxiv.org/abs/2504.05977)|null|医学图像分割通常由于专家标注的差异而存在固有的不确定性。捕捉这种不确定性是一个重要的目标，之前的研究已经使用各种生成图像模型来表示合理的专家真值分布。在这项工作中，我们探索了用于生成分割的扩散模型的设计空间，研究了噪声调度、预测类型和损失权重的影响。值得注意的是，我们发现通过输入缩放使噪声调度更难会显著提高性能。我们得出结论，x和v预测优于epsilon预测，可能是因为扩散过程处于离散分割域中。只要对扩散过程的末尾给予足够的权重，许多损失权重就能达到类似的性能。我们的实验基于LIDC-IDRI肺部病灶数据集，并获得了最先进的（SOTA）性能。此外，我们引入了LIDC-IDRI数据集的随机裁剪变体，它更适合于图像分割中的不确定性。我们的模型在这个更难的设置中也达到了SOTA性能。|
|**2025-04-08**|[Physics-aware generative models for turbulent fluid flows through energy-consistent stochastic interpolants](http://arxiv.org/abs/2504.05852)|null|生成模型在文本、图像和视频合成等领域展现出显著的成功。本研究探索了生成模型在流体动力学中的应用，特别是针对计算成本高昂的湍流模拟。我们提出了一种基于随机插值的新型随机生成模型，该模型能够进行概率预测，同时结合能量稳定性和无散度等物理约束。与通常忽略底层物理定律的传统随机生成模型不同，我们的方法通过使随机插值的参数成为可学习的系数来嵌入能量一致性。我们在基准湍流问题——科尔莫戈罗夫流——上评估了我们的方法，证明其相较于最先进的替代方案（如自回归条件扩散模型 (ACDM) 和 PDE-Refiner）具有更高的精度和稳定性。此外，我们实现了比标准随机插值更长时间的稳定滚动预测。我们的结果突出了物理感知生成模型在加速和增强湍流模拟，同时保持基本守恒特性的潜力。|
|**2025-04-08**|[On the Importance of Conditioning for Privacy-Preserving Data Augmentation](http://arxiv.org/abs/2504.05849)|null|潜扩散模型可以作为一种强大的增强方法，人为地扩展数据集以增强训练效果。对人眼来说，这些增强图像看起来与原始图像截然不同。先前的工作建议使用这种数据增强技术进行数据匿名化。然而，我们发现以深度图或边缘等特征为条件来引导扩散过程的潜扩散模型并不适合作为一种隐私保护方法。我们使用对比学习方法训练了一个模型，该模型可以从候选池中正确识别人物。此外，我们证明了使用条件扩散模型进行匿名化容易受到黑盒攻击。我们将所述方法的成功归因于匿名化过程中潜扩散模型的条件设置。扩散模型被指示为匿名图像生成相似的边缘。因此，模型可以学习识别这些模式以进行身份识别。|
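|**2025-04-08**|[Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking](http://arxiv.org/abs/2504.05838)|null|Recently, the Image Prompt Adapter (IP-Adapter) has been increasingly integrated into text-to-image diffusion models (T2I-DMs) to improve controllability. In this paper, however, we reveal that T2I-DMs equipped with the IP-Adapter (T2I-IP-DMs) are vulnerable to a new adversarial attack named the hijacking attack. We demonstrate that, by uploading imperceptible image-space adversarial examples (AEs), an adversary can hijack large numbers of benign users to attack an image generation service (IGS) driven by T2I-IP-DMs and mislead the public into discrediting the service provider. Worse still, the IP-Adapter's dependence on open-source image encoders reduces the knowledge required to craft AEs. Extensive experiments verify the technical feasibility of the hijacking attack. In light of the revealed threat, we investigate several existing defenses and explore combining the IP-Adapter with adversarially trained models to overcome the limitations of existing defenses. Our code is available at https://github.com/fhdnskfbeuv/attackIPA.|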
|**2025-04-04**|[MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models](http://arxiv.org/abs/2504.03641)|null|现有的MLLM基准在评估统一MLLM（U-MLLM）方面面临重大挑战，原因在于：1）缺乏针对传统任务的标准化基准，导致比较结果不一致；2）缺乏针对混合模态生成的基准，无法评估多模态推理能力。我们提出了一个旨在系统评估U-MLLM的综合评估框架。我们的基准包括：标准化传统任务评估。我们从12个数据集取样，涵盖10个任务和30个子任务，确保不同研究之间的一致性和公平比较。统一任务评估。我们引入了五个测试多模态推理的新任务，包括图像编辑、带有图像生成的常识问答和几何推理。全面模型基准测试。我们评估了12个领先的U-MLLM，例如Janus-Pro、EMU3、VILA-U和Gemini2-flash，以及专门的理解模型（例如Claude-3.5-Sonnet）和生成模型（例如DALL-E-3）。我们的研究结果揭示了现有U-MLLM的显著性能差距，突出了对能够有效处理混合模态任务的更强大模型的需求。代码和评估数据可在https://mme-unify.github.io/ 中找到。|
|**2025-04-04**|[Enhancing Causal Effect Estimation with Diffusion-Generated Data](http://arxiv.org/abs/2504.03630)|null|从观察数据中估计因果效应本质上具有挑战性，这是由于缺乏可观察的反事实结果，甚至存在未测量的混杂因素。传统方法通常依赖于限制性、不可检验的假设或需要有效的工具变量，这显著限制了它们的适用性和稳健性。在本文中，我们介绍了增强因果效应估计 (ACEE)，这是一种利用扩散模型生成的合成数据来增强因果效应估计的创新方法。通过微调预训练的生成模型，ACEE 模拟了否则无法观察到的反事实场景，即使在存在未测量的混杂因素的情况下，也有助于准确估计个体和平均治疗效果。与传统方法不同，ACEE 放宽了严格的无混杂假设，而是依赖于一个经验上可检验的条件。此外，还引入了偏差校正机制以减轻合成数据的不准确性。我们提供了理论保证，证明了 ACEE 估计器的一致性和效率，并通过模拟研究和基准数据集进行了全面的经验验证。结果证实，ACEE 显着提高了因果估计的准确性，尤其是在以非线性关系和异方差噪声为特征的复杂环境中。|
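|**2025-04-04**|[Quantifying the uncertainty of model-based synthetic image quality metrics](http://arxiv.org/abs/2504.03623)|null|The quality of synthetic images (for example, those produced by diffusion models) is often evaluated using information about image content encoded by pretrained auxiliary models. For example, the Fréchet Inception Distance (FID) uses embeddings from an InceptionV3 model pretrained to classify ImageNet. The effectiveness of this feature embedding model has a considerable impact on the trustworthiness of the computed metric (affecting its suitability in several domains, including medical imaging). Here, uncertainty quantification (UQ) is used to provide a heuristic measure of the trustworthiness of the feature embedding model and of an FID-like metric called the Fréchet Autoencoder Distance (FAED). We apply Monte Carlo dropout to a feature embedding model (a convolutional autoencoder) to model the uncertainty in its embeddings. The distribution of embeddings for each input is then used to compute a distribution of FAED values. We express uncertainty as the predictive variance of the embeddings and as the standard deviation of the computed FAED values. We find that their magnitude correlates with how far the inputs deviate from the model's training data, which provides some validation of the method's ability to assess the trustworthiness of the FAED.|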
|**2025-04-04**|[VISTA-OCR: Towards generative and interactive end to end OCR models](http://arxiv.org/abs/2504.03621)|null|我们推出了VISTA-OCR（视觉和空间感知文本分析OCR），这是一种轻量级架构，它在单个生成模型中统一了文本检测和识别。与需要单独分支并使用专用参数进行文本识别和检测的传统方法不同，我们的方法利用Transformer解码器在统一分支中顺序生成文本转录及其空间坐标。VISTA-OCR构建于编码器-解码器架构之上，并经过渐进式训练，首先是视觉特征提取阶段，然后是使用多模态标记生成的多任务学习。为了满足对能够执行高级任务（例如基于内容的文本定位）的多功能OCR系统的日益增长的需求，我们在预训练期间引入了新的提示可控OCR任务。为了增强模型的功能，我们构建了一个新数据集，该数据集由丰富的边界框注释的真实示例和合成样本组成。尽管最近的视觉大型语言模型（VLLM）可以有效地执行这些任务，但它们的高计算成本仍然是实际部署的障碍。相比之下，我们的VISTA omni变体仅使用1.5亿个参数，通过提示即可交互式地处理手写和打印文档。在多个数据集上进行的大量实验表明，与最先进的专用模型相比，VISTA-OCR在标准OCR任务上实现了更好的性能，同时在更复杂的OCR应用中显示出强大的潜力，从而满足了对交互式OCR系统日益增长的需求。VISTA-OCR的所有代码和注释将在被接受后公开发布。|
|**2025-04-04**|[Autonomous and Self-Adapting System for Synthetic Media Detection and Attribution](http://arxiv.org/abs/2504.03615)|null|生成式人工智能的快速发展使得创建高度逼真的合成图像成为可能，这虽然在许多领域有益，但也给虚假信息、欺诈和其他恶意应用带来了严重的风险。目前的合成图像识别系统通常是静态的，依赖于从已知生成器学习到的特征表示；随着新的生成模型的出现，这些系统的性能会严重下降。在本文中，我们引入了自主自适应合成媒体识别系统的概念——它不仅可以检测合成图像并将其归因于已知来源，还可以自主识别和合并新的生成器，而无需人工干预。我们的方法利用了具有可进化嵌入空间的开放集识别策略，该策略可以区分已知和未知来源。通过采用无监督聚类方法将未知样本聚合到高置信度聚类中，并不断改进其决策边界，即使在生成环境不断发展的情况下，我们的系统也能保持稳健的检测和归因性能。大量实验表明，我们的方法明显优于现有方法，标志着在生成模型快速发展的时代，朝着通用、自适应的取证系统迈出了关键一步。|
|**2025-04-04**|[Multimodal Diffusion Bridge with Attention-Based SAR Fusion for Satellite Image Cloud Removal](http://arxiv.org/abs/2504.03607)|null|深度学习通过融合合成孔径雷达 (SAR) 图像，在解决光学卫星图像中的云去除挑战方面取得了一些成功。最近，扩散模型已成为去除云的强大工具，与早期方法相比，通过从无云分布中采样，可以提供更高质量的估计。然而，扩散模型从纯高斯噪声开始采样，这使得采样轨迹变得复杂并导致性能欠佳。此外，目前的方法在有效融合SAR和光学数据方面存在不足。为了解决这些限制，我们提出了用于云去除的扩散桥 (Diffusion Bridges for Cloud Removal, DB-CR)，它直接桥接了有云和无云图像分布。此外，我们提出了一种新颖的多模态扩散桥架构，该架构具有用于多模态图像恢复的双分支主干网络，并结合了高效的主干网络和专用的跨模态融合块，以有效地从合成孔径雷达 (SAR) 和光学图像中提取和融合特征。通过将云去除公式化为扩散桥问题并利用这种定制架构，DB-CR 实现了高保真结果，同时计算效率很高。我们在 SEN12MS-CR 云去除数据集上评估了 DB-CR，证明其达到了最先进的结果。|
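|**2025-04-04**|[HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration](http://arxiv.org/abs/2504.03536)|null|Single-image human reconstruction is vital for digital human modeling applications but remains an extremely challenging task. Current approaches rely on generative models to synthesize multi-view images for subsequent 3D reconstruction and animation. However, directly generating multiple views from a single human image suffers from geometric inconsistencies, which leads to problems such as fragmented or blurred limbs in the reconstructed models. To address these limitations, we introduce HumanDreamer-X, a novel framework that integrates multi-view human generation and reconstruction into a unified pipeline and significantly enhances the geometric consistency and visual fidelity of the reconstructed 3D models. In this framework, 3D Gaussian Splatting (3DGS) serves as an explicit 3D representation that provides initial geometry and appearance priority. Building on this foundation, we train HumanFixer to restore 3DGS renderings, which guarantees photorealistic results. Furthermore, we examine the inherent challenges associated with attention mechanisms in multi-view human generation and propose an attention modulation strategy that effectively enhances geometric detail and identity consistency across views. Experimental results show that our approach improves the generation and reconstruction PSNR quality metrics by 16.45% and 12.65%, respectively, achieving a PSNR of up to 25.62 dB, while also generalizing to in-the-wild data and remaining applicable to various human reconstruction backbone models.|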
|**2025-04-04**|[Diffusion Active Learning: Towards Data-Driven Experimental Design in Computed Tomography](http://arxiv.org/abs/2504.03491)|null|我们提出了一种名为“扩散主动学习”的新方法，它将生成式扩散模型与数据驱动的序列实验设计相结合，以自适应地获取用于解决逆问题的数据。尽管该方法具有广泛的适用性，但我们专注于科学计算断层扫描 (CT) 的实验验证，因为在该领域中，结构化的先验数据集是可用的，并且减少数据需求可以直接转化为更短的测量时间和更低的 X 射线剂量。我们首先在特定领域的 CT 重建数据集上预训练一个无条件扩散模型。该扩散模型充当一个学习到的先验，它依赖于数据并捕获底层数据分布的结构，然后以两种方式使用：它驱动主动学习过程，同时也提高重建的质量。在主动学习循环中，我们采用扩散后验采样的变体，从后验分布生成条件数据样本，确保与当前测量的一致性。使用这些样本，我们量化当前估计中的不确定性，以选择信息量最大的下一个测量。我们的结果表明，在多个真实世界的断层扫描数据集上，数据采集需求显著减少，对应于更低的 X 射线剂量，同时图像重建质量也得到了提高。|
|**2025-04-04**|[BUFF: Bayesian Uncertainty Guided Diffusion Probabilistic Model for Single Image Super-Resolution](http://arxiv.org/abs/2504.03490)|null|超分辨率 (SR) 技术对于增强图像质量至关重要，尤其是在硬件限制导致高分辨率图像获取受限的情况下。现有的 SR 扩散模型主要依赖于高斯模型来生成噪声，这种方法在处理自然场景中固有的复杂多变纹理时往往不足。为了解决这些缺陷，我们引入了贝叶斯不确定性引导扩散概率模型 (BUFF)。BUFF 的独特之处在于它结合了贝叶斯网络来生成高分辨率不确定性掩膜。这些掩膜引导扩散过程，允许以上下文感知和自适应的方式调整噪声强度。这种新颖的方法不仅增强了超分辨率图像与其原始高分辨率图像的保真度，而且还显著减少了复杂纹理和精细细节区域的伪影和模糊。该模型表现出对复杂噪声模式的出色鲁棒性，并展示了在处理图像中的纹理和边缘方面的卓越适应性。视觉结果支持的经验证据表明了该模型的鲁棒性，尤其是在挑战性场景中，以及其在解决常见 SR 问题（例如模糊）方面的有效性。在 DIV2K 数据集上进行的实验评估表明，BUFF 取得了显著改进，在 BSD100 上的 SSIM 比基线增加了 +0.61，超过了传统的扩散方法平均 +0.20dB 的 PSNR 增益。这些发现强调了贝叶斯方法在增强 SR 扩散过程中的潜力，为该领域的未来发展铺平了道路。|
|**2025-04-04**|[Structured Legal Document Generation in India: A Model-Agnostic Wrapper Approach with VidhikDastaavej](http://arxiv.org/abs/2504.03486)|null|自动化法律文件起草可以显著提高效率，减少手动工作，并简化法律工作流程。虽然之前的研究探索了诸如判决预测和案件摘要等任务，但在印度法律领域，私人法律文件的结构化生成在很大程度上仍未得到解决。为了弥合这一差距，我们引入了VidhikDastaavej，这是一个新颖的、匿名化的私人法律文件数据集，并开发了NyayaShilp，这是一个专门针对印度法律文本进行微调的法律文件生成模型。我们提出了一个模型无关的包装器（MAW），这是一个两步框架，首先生成结构化的章节标题，然后迭代地生成内容，同时利用基于检索的机制来确保连贯性和事实准确性。我们将多个开源大型语言模型（LLM）进行基准测试，包括指令微调和领域适应版本，并与专有模型进行比较。我们的研究结果表明，虽然在小型数据集上直接进行微调并不总是能带来改进，但我们的结构化包装器显著增强了连贯性、事实依从性和整体文档质量，同时减少了幻觉。为了确保在现实世界中的适用性，我们开发了一个人工参与的文档生成系统（HITL），这是一个交互式用户界面，允许用户指定文档类型、细化章节细节并生成结构化的法律草稿。该工具允许法律专业人员和研究人员高效地生成、验证和完善人工智能生成的法律文件。广泛的评估，包括专家评估，证实了我们的框架在结构化法律起草方面实现了高可靠性。这项研究为印度人工智能辅助法律起草奠定了可扩展和可适应的基础，为结构化法律文件生成提供了一种有效的方法。|
|**2025-04-03**|[Concept Lancet: Image Editing with Compositional Representation Transplant](http://arxiv.org/abs/2504.02828)|null|扩散模型广泛用于图像编辑任务。现有的编辑方法通常通过在文本嵌入或分数空间中选择编辑方向来设计表示操作程序。然而，这种程序面临一个关键挑战：高估编辑强度会损害视觉一致性，而低估则会使编辑任务失败。值得注意的是，每个源图像可能需要不同的编辑强度，并且通过反复试验来搜索合适的强度成本很高。为了应对这一挑战，我们提出了Concept Lancet (CoLan)，这是一个用于基于扩散的图像编辑中原则性表示操作的零样本即插即用框架。在推理时，我们将潜在（文本嵌入或扩散分数）空间中的源输入分解为收集到的视觉概念表示的稀疏线性组合。这使我们能够准确估计每个图像中概念的存在，从而为编辑提供信息。根据编辑任务（替换/添加/删除），我们执行定制的概念移植过程以施加相应的编辑方向。为了充分建模概念空间，我们整理了一个概念表示数据集 CoLan-150K，其中包含用于潜在字典的视觉术语和短语的各种描述和场景。在多个基于扩散的图像编辑基线上进行的实验表明，配备 CoLan 的方法在编辑有效性和一致性保持方面实现了最先进的性能。||
|**2025-04-03**|[Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization](http://arxiv.org/abs/2504.02817)|null|许多三维生成模型依赖于变分自编码器 (VAE) 来学习紧凑的形状表示。然而，现有方法将所有形状编码为固定大小的标记，忽略了三维数据中固有的尺度和复杂性差异。这导致低效的潜在表示，可能会影响下游生成。我们通过引入基于八叉树的自适应标记化来解决这一挑战，这是一个根据形状复杂性调整潜在表示维度的新框架。我们的方法构建了一个由基于二次误差的细分准则指导的自适应八叉树结构，并使用基于查询的Transformer为每个八叉树单元分配一个形状潜在向量。基于这种标记化，我们开发了一个基于八叉树的自回归生成模型，在形状生成中有效地利用了这些可变大小的表示。大量实验表明，与固定大小的方法相比，我们的方法在保持相当视觉质量的同时，将标记数量减少了 50%。当使用相似的标记长度时，我们的方法可以生成质量明显更高的形状。当与我们的下游生成模型结合使用时，我们的方法可以创建比现有方法更详细、更多样的三维内容。||
|**2025-04-03**|[F-ViTA: Foundation Model Guided Visible to Thermal Translation](http://arxiv.org/abs/2504.02801)|null|热成像对于场景理解至关重要，尤其是在低光和夜间条件下。然而，由于红外图像采集需要专门的设备，收集大型热数据集成本高昂且劳动密集。为了应对这一挑战，研究人员探索了可见光到热成像的转换。大多数现有方法依赖于生成对抗网络 (GAN) 或扩散模型 (DM)，将该任务视为风格迁移问题。因此，这些方法试图从有限的训练数据中学习模态分布偏移和底层物理原理。在本文中，我们提出了 F-ViTA，这是一种利用基础模型中嵌入的通用世界知识来指导扩散过程以改进转换的新方法。具体来说，我们使用来自基础模型（如 SAM 和 Grounded DINO）的零样本掩码和标签来调节 InstructPix2Pix 扩散模型。这使得模型能够学习场景对象与其在红外图像中的热特征之间的有意义的相关性。在五个公共数据集上的大量实验表明，F-ViTA 的性能优于现有最先进 (SOTA) 方法。此外，我们的模型可以很好地泛化到分布外 (OOD) 场景，并且可以从同一可见图像生成长波红外 (LWIR)、中波红外 (MWIR) 和近红外 (NIR) 转换。代码：https://github.com/JayParanjape/F-ViTA/tree/master。||
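|**2025-04-03**|[Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model](http://arxiv.org/abs/2504.02764)|null|In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion that generates generic scenes from a single image. Existing methods typically use video generation models to synthesize novel views, but they are limited by video length and scene inconsistency, which leads to artifacts and distortions during subsequent reconstruction. To address this issue, we construct noisy samples from the original features as momentum to enhance video details and maintain scene consistency. However, for latent features whose receptive field spans both known and unknown regions, this latent-level momentum restricts the generative ability of video diffusion in unknown regions. We therefore further introduce the consistent video described above as pixel-level momentum for a video generated directly without momentum, so that unseen regions are recovered better. Our cascaded momentum enables video diffusion models to generate novel views that are both high-fidelity and consistent. We further fine-tune the global Gaussian representations with the enhanced frames and render new frames for the momentum update in the next step. In this way, we can recover a 3D scene iteratively and avoid the video length limitation. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity, consistent scene generation.|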
|**2025-04-03**|[MD-ProjTex: Texturing 3D Shapes with Multi-Diffusion Projection](http://arxiv.org/abs/2504.02762)|null|我们提出MD-ProjTex，一种使用预训练的文本到图像扩散模型为3D形状快速且一致地生成文本引导纹理的方法。我们方法的核心是在UV空间中实现多视图一致性机制，这确保了不同视点之间的纹理连贯性。具体来说，MD-ProjTex在每个扩散步骤中融合来自多个视图的噪声预测，并共同更新每个视图的去噪方向以保持3D一致性。与现有的依赖于优化或顺序视图合成的最先进方法相比，MD-ProjTex的计算效率更高，并实现了更好的定量和定性结果。||
|**2025-04-03**|[Echoes of the hidden: Uncovering coordination beyond network structure](http://arxiv.org/abs/2504.02757)|null|近几十年来，连接性和协调性的研究因其在驱动市场、塑造社会动态和影响生物系统方面的核心作用而受到越来越多的关注。传统上，人们利用可观察到的连接，例如电话、金融交易或社交媒体连接，来推断协调性和连接性。然而，不完整、加密或碎片化的数据，以及通信平台的普遍性和故意的混淆，往往导致许多现实世界的连接隐藏起来。在本研究中，我们证明了协调的个体表现出共享的突发活动模式，即使它们之间可观察到的联系稀疏或完全不存在，也能够检测到它们。我们进一步提出了一个基于网络的网络形式主义的生成模型，以解释驱动这种协作突发性的机制，将其归因于跨网络的冲击传播，而不是孤立的个体行为。模型模拟表明，当可观察连接密度低于70%时，与最先进的时间和结构方法相比，突发性显著提高了协调检测能力。这项工作为理解社群和协调动态提供了新的视角，推进了理论理解和实际检测。通过为识别可观察网络结构之外的隐藏连接奠定基础，它能够跨不同平台进行检测，同时增强系统行为理解、知情决策和风险缓解。||
|**2025-04-03**|[RBR4DNN: Requirements-based Testing of Neural Networks](http://arxiv.org/abs/2504.02737)|null|深度神经网络 (DNN) 测试对于关键系统的可靠性和安全性至关重要，因为故障可能会导致严重后果。尽管已开发出各种技术来创建鲁棒性测试套件，但基于需求的 DNN 测试在很大程度上仍未得到探索——然而，此类测试被认为是关键系统软件验证的重要组成部分。在这项工作中，我们提出了一种基于需求的测试套件生成方法，该方法使用在语义特征空间中制定的结构化自然语言需求，通过用需求前置条件提示文本条件潜在扩散模型，然后使用相关的后置条件定义测试预言机来判断被测 DNN 的输出来创建测试套件。我们使用预训练生成模型的微调变体来研究该方法。我们在 MNIST、CelebA-HQ、ImageNet 和自动驾驶汽车数据集上的实验表明，生成的测试套件是真实的、多样化的、与前置条件一致的，并且能够揭示故障。||
|**2025-04-03**|[Variational Online Mirror Descent for Robust Learning in Schrödinger Bridge](http://arxiv.org/abs/2504.02618)|null|薛定谔桥 (SB) 已发展成为一类通用的概率生成模型。然而，在实践中，估计的学习信号通常是不确定的，并且现有方法所承诺的可靠性通常基于推测性的最佳情况场景。最近关于通过镜像下降 (MD) 的 Sinkhorn 算法的研究获得了关注，揭示了SB问题解获取的几何见解。在本文中，我们为SB问题提出了一个变分在线镜像下降 (OMD) 框架，它为SB求解器提供了进一步的稳定性。我们正式证明了SB获取的新型OMD公式的收敛性和遗憾界。因此，我们利用Schrödinger势能高斯混合参数化的Wasserstein-Fisher-Rao几何，提出了一种称为变分镜像薛定谔桥 (VMSB) 的免仿真SB算法。基于Wasserstein梯度流理论，该算法提供了易于处理的学习动力学，可以精确地逼近每个OMD步骤。在实验中，我们在一系列基准测试中验证了所提出的VMSB算法的性能。VMSB在一系列SB问题上始终优于当代SB求解器，证明了我们理论预测的稳健性。||
|**2025-04-03**|[Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation](http://arxiv.org/abs/2504.02612)|null|近年来，文本到图像生成模型的进步催生了众多实际应用，其中包括主体驱动生成，它通过微调预训练模型来仅从少量示例中捕获主体语义。虽然基于扩散的模型能够生成高质量的图像，但其大量的去噪步骤会导致显著的计算开销，从而限制了实际应用。视觉自回归（VAR）模型预测的是下一个尺度的标记而不是空间相邻的标记，提供了更快的推理速度，适用于实际部署。在本文中，我们提出了第一个基于VAR的主体驱动生成方法。然而，简单地微调VAR会导致计算开销、语言漂移和多样性降低。为了应对这些挑战，我们引入了选择性层调整以降低复杂性，并使用先验蒸馏来缓解语言漂移。此外，我们发现早期阶段对主体生成的影响大于后期阶段，后期阶段仅合成局部细节。基于这一发现，我们提出了尺度加权调整，它优先考虑较粗糙的分辨率，以促使模型关注与主体相关的信息而不是局部细节。大量实验验证了我们的方法在各种指标上都显著优于基于扩散的基线模型，并展示了其实际应用价值。||
|**2025-04-03**|[Bridging the Gap between Gaussian Diffusion Models and Universal Quantization for Image Compression](http://arxiv.org/abs/2504.02579)|null|生成式神经图像压缩支持极低比特率下的数据表示，在客户端合成细节并持续生成高度逼真的图像。通过利用量化误差和加性噪声之间的相似性，可以使用潜在扩散模型来构建基于扩散的生成式图像压缩编解码器，以“去噪”量化引入的伪影。然而，我们发现了先前遵循这种范式的方案中的三个关键差距（即噪声水平、噪声类型和离散化差距），这些差距导致量化数据落入扩散模型已知数据分布之外。在这项工作中，我们提出了一种基于量化的新型前向扩散过程，该过程具有理论基础，可以解决所有上述三个差距。我们通过精心设计的量化计划的通用量化和使用均匀噪声训练的扩散模型来实现这一点。与先前的工作相比，我们的方案即使在非常低的比特率下也能生成始终如一的逼真和详细的重建图像。在这种情况下，我们实现了最佳的率失真真实感性能，优于先前相关的工作。||
|**2025-03-31**|[Consistent Subject Generation via Contrastive Instantiated Concepts](http://arxiv.org/abs/2503.24387)|null|虽然文本到图像生成模型可以合成多样化且逼真的内容，但跨多个创作的主体变化限制了其在长内容生成中的应用。现有方法需要耗时的调整、所有主体的参考或访问其他创作。我们引入了对比概念实例化（CoCoIns），以有效地在多个独立创作中合成一致的主体。该框架由一个生成模型和一个映射网络组成，该网络将输入的潜在代码转换为与某些概念实例相关的伪词。用户可以使用相同的潜在代码生成一致的主体。为了构建这种关联，我们提出了一种对比学习方法，用于训练网络以区分提示和潜在代码的组合。对单一主体的多张人脸进行的大量评估表明，CoCoIns 的性能与现有方法相当，同时保持了更高的灵活性。我们还展示了将 CoCoIns 扩展到多个主体和其他对象类别的潜力。||
|**2025-03-31**|[Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation](http://arxiv.org/abs/2503.24379)|null|为了解决当前视频生成领域中用户意图精确理解的瓶颈问题，我们提出了Any2Caption，这是一个在任意条件下可控视频生成的新颖框架。其核心思想是将各种条件解释步骤与视频合成步骤解耦。通过利用现代多模态大型语言模型 (MLLM)，Any2Caption 将不同的输入（文本、图像、视频以及诸如区域、运动和相机姿态等特定线索）解释为密集的、结构化的描述，为骨干视频生成器提供更好的指导。我们还引入了Any2CapIns，这是一个包含337k个实例和407k个条件的大规模数据集，用于任意条件到描述的指令微调。综合评估表明，我们的系统在现有视频生成模型的各个方面，在可控性和视频质量方面均有显著提高。项目页面：https://sqwu.top/Any2Cap/||
|**2025-03-31**|[Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach](http://arxiv.org/abs/2503.24271)|**[link](https://github.com/fpramunno/ldm_superresolution)**|太阳磁场的空间特性对于解码太阳内部的物理过程及其行星际效应至关重要。然而，来自旧仪器（如迈克尔逊多普勒成像仪 (MDI)）的观测数据空间或时间分辨率有限，这阻碍了对小尺度太阳特征的详细研究。对这些旧数据集进行超分辨率处理对于跨不同太阳周期进行统一分析至关重要，能够更好地表征太阳耀斑、活动区和磁网络动力学。在这项工作中，我们介绍了一种用于超分辨率的新型扩散模型方法，并将其应用于 MDI 磁图，以匹配日震和磁成像仪 (HMI) 的更高分辨率能力。通过使用降尺度 HMI 数据的残差训练潜在扩散模型 (LDM)，并使用配对的 MDI/HMI 数据对其进行微调，我们可以将 MDI 观测的分辨率从 2"/像素提高到 0.5"/像素。我们通过经典指标（例如，PSNR、SSIM、FID 和 LPIPS）评估重建图像的质量，并检查是否保留了物理特性，例如无符号磁通量或活动区的尺寸。我们将我们的模型与 LDM 和去噪扩散概率模型 (DDPM) 的不同变体进行比较，还与过去用于执行超分辨率任务的两种确定性架构进行比较。此外，我们通过傅里叶域分析表明，具有残差的 LDM 可以分辨小于 2" 的特征，并且由于 LDM 的概率性质，我们可以评估它们的可靠性，这与确定性模型形成对比。未来的研究旨在提高太阳 MDI 仪器的时间尺度超分辨率，以便我们更好地了解旧事件的动态。||
|**2025-04-01**|[Visual Acoustic Fields](http://arxiv.org/abs/2503.24270)|null|物体被敲击时会发出不同的声音，而人类可以根据物体的外观和材质属性直观地推断出它可能发出的声音。受这种直觉的启发，我们提出了视觉声场（Visual Acoustic Fields），这是一个使用三维高斯 splatting (3DGS) 将敲击声和视觉信号在三维空间中联系起来的框架。我们的方法有两个关键模块：声音生成和声音定位。声音生成模块利用一个条件扩散模型，该模型接收从特征增强的 3DGS 渲染的多尺度特征来生成逼真的敲击声。同时，声音定位模块支持查询由特征增强的 3DGS 表示的三维场景，以根据声源定位敲击位置。为了支持这个框架，我们引入了一种新的流程来收集场景级视觉-声音样本对，从而实现捕获的图像、撞击位置和相应声音之间的对齐。据我们所知，这是第一个在三维环境中连接视觉和声学信号的数据集。在我们数据集上进行的大量实验表明，视觉声场在生成合理的撞击声和准确定位撞击源方面是有效的。我们的项目页面位于https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/。||
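|**2025-03-31**|[Pre-training with 3D Synthetic Data: Learning 3D Point Cloud Instance Segmentation from 3D Synthetic Scenes](http://arxiv.org/abs/2503.24229)|null|In recent years, the research community has seen growing use of 3D point cloud data in a variety of real-world applications because of its high applicability. As a modality, 3D point clouds make it possible to account for actual size and spatial understanding. Application areas include the mechanical control of robots, vehicles, and other real-world systems. Along this line, we aim to improve 3D point cloud instance segmentation, which has emerged as a particularly promising approach for these applications. However, creating 3D point cloud datasets is far more costly than creating 2D image datasets. Training a 3D point cloud instance segmentation model requires not only category labels but also detailed annotations for every point in a large-scale 3D space. Meanwhile, the recent increase in proposed generative models for the 3D domain has prompted proposals to use generative models to create 3D point cloud data. In this work, we propose pre-training with 3D synthetic data, produced by a generative model, to train a 3D point cloud instance segmentation model for 3D scenes represented as point clouds. We generate 3D point cloud data directly with Point-E and insert the generated data into 3D scenes. As of 2025, although other, more accurate 3D generative models exist, even Point-E, an early 3D generative model, can effectively support pre-training with 3D synthetic data. In the experiments, we compare our pre-training method with baseline methods and show improved performance, demonstrating the efficacy of 3D generative models for 3D point cloud instance segmentation.|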
|**2025-03-31**|[AI-Assisted Colonoscopy: Polyp Detection and Segmentation using Foundation Models](http://arxiv.org/abs/2503.24138)|**[link](https://github.com/udelaqui/foundation_models_for_polyp_detection_segmentation)**|在结肠镜检查中，深度学习模型可以帮助检测到80%的漏诊息肉。在寻找能够解决这一挑战的算法时，基础模型成为有希望的候选者。它们的零样本或少样本学习能力有助于泛化到新的数据或任务，而无需大量的微调。这一概念在医学影像领域尤其有利，因为用于传统训练的大型标注数据集非常稀缺。在此背景下，我们对用于息肉分割的基础模型进行了全面评估，评估了检测和界定。在这项研究中，我们使用了三个不同的结肠镜检查数据集来比较五种不同的基础模型（DINOv2、YOLO-World、GroundingDINO、SAM和MedSAM）与两个基准网络（YOLOv8和Mask R-CNN）的性能。结果表明，基础模型在息肉表征方面的成功很大程度上取决于领域专业化。为了在医学应用中获得最佳性能，特定领域的模型至关重要，通用模型需要微调才能获得有效结果。通过这种专业化，基础模型表现出比最先进的检测和分割模型更优越的性能，一些模型甚至在零样本评估中表现出色，在未见数据上优于微调模型。||
|**2025-03-31**|[Controlled Latent Diffusion Models for 3D Porous Media Reconstruction](http://arxiv.org/abs/2503.24083)|**[link](https://github.com/Lacadame/PoreGen)**|多孔介质的三维数字重建是地球科学中的一项基本挑战，需要在捕获代表性单元体积的同时解析精细的孔隙结构。我们引入了一个计算框架，通过在 EDM 框架内运行的潜在扩散模型来应对这一挑战。我们的方法通过在二元地质体积中训练的定制变分自动编码器来降低维度，从而提高效率，并能够生成比以往使用扩散模型更大的体积。一个关键的创新是我们受控的无条件采样方法，它通过首先从经验分布中采样目标统计数据，然后根据这些值生成样本，来增强分布覆盖率。对四种不同岩石类型的广泛测试表明，以孔隙度（一种易于计算的统计量）为条件足以确保多种复杂特性的具有一致性的表示，包括渗透率、两点相关函数和孔径分布。该框架实现了比像素空间扩散更好的生成质量，同时能够以大幅减少的计算需求实现更大的体积重建（256 立方体素），为数字岩石物理应用确立了新的最先进水平。||
|**2025-03-31**|[DenseFormer: Learning Dense Depth Map from Sparse Depth and Image via Conditional Diffusion Model](http://arxiv.org/abs/2503.23993)|null|深度补全任务是自动驾驶中的一个关键问题，涉及从稀疏深度图和RGB图像生成密集深度图。大多数现有方法采用空间传播网络在获得初始密集深度后迭代地细化深度图。在本文中，我们提出了DenseFormer，一种将扩散模型集成到深度补全任务中的新方法。通过结合扩散模型的去噪机制，DenseFormer通过多次迭代逐步细化初始随机深度分布来生成密集深度图。我们提出了一个特征提取模块，它利用特征金字塔结构以及多层可变形注意力，有效地从稀疏深度图和RGB图像中提取和整合特征，作为扩散过程的指导条件。此外，本文提出了一个深度细化模块，对扩散过程生成的密集深度结果进行多步迭代细化，涵盖各种范围。该模块利用富含多尺度信息的图像特征和稀疏深度输入，进一步提高预测深度图的精度。在KITTI户外场景数据集上的大量实验表明，DenseFormer的性能优于经典的深度补全方法。||
|**2025-03-31**|[JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation](http://arxiv.org/abs/2503.23951)|null|近期的文本到视频生成技术已实现从提示生成连贯视频，并扩展到对外观和运动的细粒度控制。然而，现有方法要么由于简单的解耦优化导致的特征域不匹配而遭受概念干扰，要么由于参考视频重建中运动和外观的纠缠导致的空间特征泄漏而出现外观污染。在本文中，我们提出了 JointTuner，一种新颖的自适应联合训练框架，以缓解这些问题。具体来说，我们开发了自适应 LoRA，它包含一个上下文感知门控机制，并将门控 LoRA 组件集成到扩散模型中的空间和时间 Transformer 中。这些组件能够同时优化外观和运动，消除概念干扰。此外，我们引入了外观无关的时间损失，它通过与外观无关的噪声预测任务将参考视频重建中的运动模式与其固有外观解耦。关键创新在于将逐帧偏移噪声添加到真实高斯噪声中，扰乱其分布，从而破坏与帧相关的空间属性，同时保留时间一致性。此外，我们构建了一个包含 90 种外观-运动自定义组合和跨越四个维度的 10 种多类型自动指标的基准，促进了对此自定义任务的更全面评估。大量实验表明，我们的方法比现有先进方法具有更优越的性能。||
|**2025-03-31**|[DiffuSE: Cross-Layer Design Space Exploration of DNN Accelerator via Diffusion-Driven Optimization](http://arxiv.org/abs/2503.23945)|null|深度学习加速器的普及呼吁高效且经济的硬件设计解决方案，其中参数化模块化硬件生成器和电子设计自动化 (EDA) 工具在提高生产力和最终结果质量 (QoR) 方面发挥着至关重要的作用。为了在多个目标 QoR（例如性能、功耗和面积）之间取得良好的平衡，设计人员需要浏览一个巨大的设计空间，其中包含硬件生成器和 EDA 综合工具的可调参数。然而，EDA 工具调用所需的大量时间以及众多设计参数之间复杂的相互作用使得这项任务极具挑战性，即使对于经验丰富的设计人员也是如此。为了应对这些挑战，我们引入了 DiffuSE，这是一个用于 DNN 加速器跨层优化的扩散驱动设计空间探索框架。DiffuSE 利用条件扩散模型来捕捉从 QoR 目标到参数组合的逆向一对多映射，从而允许在设计空间中有希望的区域内进行定向探索。通过仔细选择条件 QoR 值，该框架能够以样本高效的方式促进多个 QoR 指标之间的有效权衡。在 7nm 技术下的实验结果证明了所提出的框架与现有技术的优越性。||
|**2025-03-28**|[DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness](http://arxiv.org/abs/2503.22677)|null|大多数3D物体生成器专注于审美质量，常常忽略应用中必要的物理约束。其中一个约束是3D物体应该是自支撑的，即在重力作用下保持平衡。先前生成稳定3D物体的方法使用可微分物理模拟器在测试时优化几何形状，这种方法速度慢、不稳定且容易陷入局部最优。受生成模型与外部反馈对齐的文献启发，我们提出了直接模拟优化（DSO）框架，利用来自（不可微分）模拟器的反馈来直接增加3D生成器输出稳定3D物体的可能性。我们构建了一个3D物体数据集，并用从物理模拟器获得的稳定性分数进行标记。然后，我们可以使用稳定性分数作为对齐指标，通过直接偏好优化（DPO）或直接奖励优化（DRO）来微调3D生成器，DRO是我们引入的一种新颖的目标函数，用于在不需要成对偏好的情况下对齐扩散模型。我们的实验表明，使用DPO或DRO目标函数微调的前馈生成器比测试时优化更快，并且更有可能生成稳定的物体。值得注意的是，DSO框架即使在没有任何用于训练的真实3D物体的情况下也能工作，允许3D生成器通过自动收集自身输出的模拟反馈来进行自我改进。||
|**2025-03-28**|[Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model](http://arxiv.org/abs/2503.22622)|null|近年来，多视角或4D视频生成已成为一个重要的研究课题。然而，目前的4D生成方法仍然面临一些根本性的局限，因为它们主要依赖于利用多个视频扩散模型并进行额外的训练，或者对完整的4D扩散模型进行计算密集型训练，而现实世界中的4D数据有限且计算成本高昂。为了应对这些挑战，我们提出了第一个无需训练的4D视频生成方法，该方法利用现成的视频扩散模型从单个输入视频生成多视角视频。我们的方法包含两个关键步骤：（1）通过将时空采样网格中的边缘帧指定为关键帧，我们首先使用视频扩散模型合成这些关键帧，并利用基于深度的变形技术进行引导。这种方法确保了生成帧之间的结构一致性，保持了空间和时间上的连贯性。（2）然后，我们使用视频扩散模型对剩余帧进行插值，构建一个填充完整且时间连贯的采样网格，同时保持空间和时间一致性。通过这种方法，我们将单个视频扩展为沿着新相机轨迹的多视角视频，同时保持时空一致性。我们的方法无需训练，并且充分利用了现成的视频扩散模型，为多视角视频生成提供了一种实用且有效的解决方案。||
|**2025-03-28**|[Generative Latent Neural PDE Solver using Flow Matching](http://arxiv.org/abs/2503.22600)|null|自回归下一步预测模型已成为构建数据驱动的神经求解器以预测时间相关的偏微分方程 (PDE) 的事实标准。与扩散概率模型密切相关的去噪训练已被证明可以增强神经求解器的时间稳定性，而其随机推理机制则支持集成预测和不确定性量化。原则上，这种训练在训练和推理过程中都涉及对一系列离散的扩散时间步进行采样，这不可避免地会增加计算开销。此外，大多数扩散模型在结构化的均匀网格上应用各向同性高斯噪声，限制了它们对不规则域的适应性。我们提出了一种用于 PDE 模拟的潜在扩散模型，它将 PDE 状态嵌入到低维潜在空间中，从而显着降低了计算成本。我们的框架使用自动编码器将不同类型的网格映射到统一的结构化潜在网格上，从而捕获复杂的几何形状。通过分析常见的扩散路径，我们建议在训练和测试中使用来自流匹配的粗略采样的噪声调度。数值实验表明，所提出的模型在准确性和长期稳定性方面均优于几个确定性基线，突出了基于扩散的方法在稳健的数据驱动 PDE 学习中的潜力。||
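|**2025-03-28**|[RELD: Regularization by Latent Diffusion Models for Image Restoration](http://arxiv.org/abs/2503.22563)|null|In recent years, diffusion models have become the new state of the art in deep generative modeling, ending the long-standing dominance of generative adversarial networks. Inspired by the Regularization by Denoising principle, we introduce an approach that integrates a latent diffusion model, trained for the denoising task, into a variational framework using half-quadratic splitting, exploiting its regularization properties. Under appropriate conditions that are easily met in various imaging applications, this approach achieves high-quality results at reduced computational cost. The proposed strategy, called Regularization by Latent Denoising (RELD), is then tested on a dataset of natural images for image denoising, deblurring, and super-resolution. Numerical experiments show that RELD is competitive with other state-of-the-art methods and achieves particularly strong results when evaluated with perceptual quality metrics.|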
|**2025-03-28**|[Deterministic Medical Image Translation via High-fidelity Brownian Bridges](http://arxiv.org/abs/2503.22531)|null|最近的研究表明，扩散模型生成的合成图像比生成对抗网络 (GAN) 更优。然而，由于其固有的随机性，它们的输出通常是非确定性的，并且缺乏对真实数据的高保真度。在本文中，我们提出了一种用于确定性医学图像转换的新型高保真布朗桥模型 (HiFi-BBrg)。我们的模型包含两个不同但互惠互利的映射：生成映射和重建映射。布朗桥训练过程由重建映射中的保真度损失和对抗训练指导。这确保了转换后的图像可以准确地还原为原始形式，从而实现与真实数据高度一致的转换。我们对多个数据集进行的大量实验表明，HiFi-BBrg 在多模态图像转换和多图像超分辨率方面优于现有最先进的方法。||
|**2025-03-28**|[Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments](http://arxiv.org/abs/2503.22496)|null|我们推出了场景梦想家（Scenario Dreamer），这是一个完全数据驱动的自动驾驶规划生成式模拟器，它可以生成初始交通场景（包括车道图和智能体边界框）以及闭环智能体行为。现有的驾驶模拟环境生成方法将初始交通场景编码为栅格化图像，因此需要参数繁多的网络，由于栅格化场景中存在许多空像素，这些网络会执行不必要的计算。此外，我们发现采用基于规则的智能体行为的现有方法缺乏多样性和真实性。场景梦想家则采用一种新颖的矢量化潜在扩散模型来生成初始场景，该模型直接对矢量化场景元素进行操作，并使用自回归Transformer进行数据驱动的智能体行为模拟。场景梦想家还支持通过扩散修复进行场景外推，从而能够生成无限的模拟环境。大量实验表明，场景梦想家在真实性和效率方面优于现有的生成式模拟器：矢量化场景生成基础模型实现了更高的生成质量，同时参数减少了约2倍，生成延迟降低了6倍，GPU训练时间减少了10倍。我们通过强化学习规划智能体在场景梦想家环境中比在传统的非生成式模拟环境中更具挑战性来证实其实用性，尤其是在长距离和对抗性驾驶环境中。||
|**2025-03-28**|[Volumetric Material Decomposition Using Spectral Diffusion Posterior Sampling with a Compressed Polychromatic Forward Model](http://arxiv.org/abs/2503.22392)|null|我们先前引入了光谱扩散后验采样（Spectral DPS）框架，通过将解析光谱系统模型与从大型数据集中学习的先验知识相结合，实现精确的一步式材料分解。这项工作将二维Spectral DPS算法扩展到三维，通过使用预训练的二维扩散模型进行逐层处理来解决潜在的大内存需求限制，并使用压缩的多色前向模型来确保精确的物理建模。仿真研究表明，所提出的内存高效的三维Spectral DPS能够对临床重要体积大小的材料进行分解。定量分析表明，Spectral DPS在对比度量化、层间连续性和分辨率保持方面优于其他深度学习算法，例如InceptNet和条件DDPM。这项研究为推进体积光谱CT中的一步式材料分解奠定了基础。||
|**2025-03-28**|[Meta-LoRA: Meta-Learning LoRA Components for Domain-Aware ID Personalization](http://arxiv.org/abs/2503.22352)|null|近年来，文本到图像生成模型，特别是潜在扩散模型 (LDM) 在根据文本提示合成高质量图像方面展现出显著的能力。然而，实现身份个性化——确保模型从有限的参考图像中一致地生成特定主题的输出——仍然是一个根本性挑战。为了解决这个问题，我们引入了元低秩适应 (Meta-LoRA)，这是一个利用元学习将特定领域先验编码到基于 LoRA 的身份个性化中的新框架。我们的方法引入了一个结构化的三层 LoRA 架构，将身份无关的知识与身份特定的适应分开。在第一阶段，LoRA Meta-Down 层在多个主题上进行元训练，学习一个捕获一般身份相关特征的共享流形。在第二阶段，仅优化 LoRA-Mid 和 LoRA-Up 层以专门针对给定主题，从而显著减少适应时间并提高身份保真度。为了评估我们的方法，我们引入了 Meta-PHD，一个用于身份个性化的新的基准数据集，并将 Meta-LoRA 与最先进的方法进行了比较。我们的结果表明，Meta-LoRA 在不同的身份条件下实现了卓越的身份保持能力、计算效率和适应性。代码、模型权重和数据集将在论文被接收后公开发布。||
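|**2025-03-28**|[GCRayDiffusion: Pose-Free Surface Reconstruction via Geometric Consistent Ray Diffusion](http://arxiv.org/abs/2503.22349)|null|Accurate surface reconstruction from unposed images is crucial for efficient creation of 3D objects or scenes. However, it remains challenging, particularly because of the joint camera pose estimation. Previous approaches have achieved impressive pose-free surface reconstruction results in dense-view settings but can easily fail in sparse-view scenarios without sufficient visual overlap. In this paper, we propose a new technique for pose-free surface reconstruction that follows triplane-based signed distance field (SDF) learning but regularizes the learning with explicit points sampled from ray-based diffusion of camera pose estimation. Our key contribution is a novel Geometric Consistent Ray Diffusion model (GCRayDiffusion), in which we represent camera poses as neural bundle rays and regress the distribution of noisy rays via a diffusion model. More importantly, we further condition the denoising process of GCRayDiffusion on the triplane-based SDF of the entire scene, which provides effective 3D-consistent regularization for multi-view consistent camera pose estimation. Finally, we incorporate GCRayDiffusion into triplane-based SDF learning by introducing surface geometry regularization from the points sampled along the neural bundle rays, which leads to highly accurate pose-free surface reconstruction even for sparse-view inputs. Extensive evaluations on public datasets show that GCRayDiffusion achieves more accurate camera pose estimation than previous approaches and produces geometrically more consistent surface reconstructions, especially with sparse-view inputs.||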
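|**2025-03-28**|[Semantix: An Energy Guided Sampler for Semantic Style Transfer](http://arxiv.org/abs/2503.22344)|null|Style and appearance transfer have made significant progress in recent years, but most methods treat global style and local appearance transfer in isolation and neglect semantic correspondence. In addition, image and video tasks are typically handled separately, with little attention to integrating them for video transfer. To address these limitations, we introduce a new task, Semantic Style Transfer, which transfers the style and appearance features of a reference image to target visual content based on semantic correspondence. We then propose a training-free method, Semantix, an energy-guided sampler designed for Semantic Style Transfer that simultaneously guides both style and appearance transfer using the semantic understanding of pre-trained diffusion models. Moreover, as a sampler, Semantix can be seamlessly applied to both image and video models, making semantic style transfer generic across visual media. Specifically, once the reference and context images or videos have been inverted into noise space by SDEs, Semantix guides the sampling process with a carefully designed energy function that has three key components: style feature guidance, spatial feature guidance, and semantic distance as a regularization term. Experimental results show that Semantix not only accomplishes semantic style transfer for images and videos effectively but also surpasses existing state-of-the-art solutions in both domains. Project website: https://huiang-he.github.io/semantix/||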
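|**2025-03-27**|[StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion](http://arxiv.org/abs/2503.21775)|null|We present StyleMotif, a novel stylized motion latent diffusion model that generates motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or on transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments show that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance. Project page: https://stylemotif.github.io||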
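|**2025-03-27**|[Optimal Stepsize for Diffusion Sampling](http://arxiv.org/abs/2503.21774)|**[link](https://github.com/bebebe666/optimalsteps)**|Diffusion models achieve remarkable generation quality but suffer from computationally intensive sampling because of suboptimal step discretization. While existing work focuses on optimizing denoising directions, we address the principled design of stepsize schedules. This paper proposes Optimal Stepsize Distillation, a dynamic programming framework that extracts theoretically optimal schedules by distilling knowledge from reference trajectories. By reformulating stepsize optimization as recursive error minimization, our method guarantees global discretization bounds through exploitation of optimal substructure. Crucially, the distilled schedules are robust across architectures, ODE solvers, and noise schedules. Experiments show a 10x speedup in text-to-image generation while preserving 99.4% of the performance on GenEval. Our code is available at https://github.com/bebebe666/OptimalSteps.||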
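|**2025-03-27**|[Exploring the Evolution of Physics Cognition in Video Generation: A Survey](http://arxiv.org/abs/2503.21765)|**[link](https://github.com/minnie-lin/awesome-physics-cognition-based-video-generation)**|Recent advances in video generation have brought significant progress, especially with the rapid development of diffusion models. Nevertheless, their deficiencies in physical cognition have gradually drawn wide attention: generated content often violates fundamental physical laws, falling into the dilemma of "visually realistic but physically absurd". Researchers increasingly recognize the importance of physical fidelity in video generation and have tried to integrate heuristic physical cognition, such as motion representations and physical knowledge, into generative systems to simulate real-world dynamic scenes. Given the lack of a systematic overview in this field, this survey aims to provide a comprehensive summary of architecture designs and their applications to fill this gap. Specifically, we discuss and organize the evolution of physical cognition in video generation from a cognitive science perspective and propose a three-tier taxonomy: 1) basic schema perception for generation, 2) passive cognition of physical knowledge for generation, and 3) active cognition for world simulation, covering state-of-the-art methods, classical paradigms, and benchmarks. We then highlight the key challenges inherent in this domain and outline potential directions for future research, helping to advance the frontier of discussion in both academia and industry. Through structured review and interdisciplinary analysis, this survey aims to provide directional guidance for developing interpretable, controllable, and physically consistent video generation paradigms, moving generative models from the stage of "visual mimicry" toward a new stage of "human-like physical comprehension".||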
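|**2025-03-27**|[A Unified Framework for Diffusion Bridge Problems: Flow Matching and Schrödinger Matching into One](http://arxiv.org/abs/2503.21756)|null|The bridge problem is to find a stochastic differential equation (SDE), or sometimes an ordinary differential equation (ODE), that connects two given distributions. Bridge problems have a wide range of applications, among which recent generative modeling (for example, conditional or unconditional image generation) is the most popular. The famous Schrödinger bridge problem, a widely known problem that has existed for a century, is also a special case of the bridge problem. The two most popular algorithms for tackling bridge problems in the deep learning era are (conditional) flow matching and iterative fitting algorithms; the former is confined to ODE solutions and the latter is specific to the Schrödinger bridge problem. The main contribution of this article is twofold: i) we provide a concise review of these algorithms with technical details to some extent; ii) we propose a novel unified perspective and framework that subsumes these seemingly unrelated algorithms (and their variants) into one. In particular, we show that our unified framework can instantiate the Flow Matching (FM) algorithm, the (mini-batch) optimal transport FM algorithm, the (mini-batch) Schrödinger bridge FM algorithm, and the deep Schrödinger bridge matching (DSBM) algorithm as special cases. We believe this unified framework will help in viewing bridge problems from a more general and flexible perspective, which in turn can help researchers and practitioners develop new bridge algorithms in their fields.||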
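|**2025-03-27**|[VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness](http://arxiv.org/abs/2503.21755)|**[link](https://github.com/vchitect/vbench)**|Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focuses on whether the video looks visually convincing rather than whether it adheres to real-world principles. Although recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness, which ensures that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored to the individual dimensions, our evaluation framework integrates generalists such as state-of-the-art VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive annotations to ensure alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models.||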
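|**2025-03-27**|[3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models](http://arxiv.org/abs/2503.21745)|null|3D generation is developing rapidly, while the development of 3D evaluation has not kept pace. How to keep automatic evaluation fairly aligned with human perception has become a well-recognized challenge. Recent advances in language and image generation have explored human preferences and shown respectable fitting ability. However, the 3D domain still lacks such a comprehensive preference dataset over generative models. To fill this gap, we develop 3DGen-Arena, an integrated platform designed in a battle manner. We then carefully design diverse text and image prompts and use the arena platform to gather human preferences from both public users and expert annotators, resulting in a large-scale, multi-dimensional human preference dataset, 3DGen-Bench. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and an MLLM-based automatic evaluator, 3DGen-Eval. These two models innovatively unify the quality evaluation of text-to-3D and image-to-3D generation and, with their respective strengths, jointly form our automated evaluation system. Extensive experiments show that our scoring model is effective at predicting human preferences, exhibiting a higher correlation with human rankings than existing metrics. We believe that the 3DGen-Bench dataset and automated evaluation system will foster fairer evaluation in 3D generation and further promote the development of 3D generative models and their downstream applications.||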
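|**2025-03-27**|[Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data](http://arxiv.org/abs/2503.21694)|**[link](https://github.com/theericma/triplaneturbo)**|It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (for example, Triplane), they often suffer from poor quality because of the lack of sufficient high-quality 3D training data. To overcome the data shortage, we propose a novel training scheme, termed Progressive Rendering Distillation (PRD), which eliminates the need for 3D ground truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each training iteration, PRD uses the U-Net to progressively denoise the latent from random noise, and at each step it decodes the denoised latent into a 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used together with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Because PRD supports training without 3D ground truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD can accelerate inference of the generation model to just a few steps. With PRD, we train a Triplane generator, TriplaneTurbo, which adds only 2.5% trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalizes well to challenging text input. The code is available at https://github.com/theEricMa/TriplaneTurbo.||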
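|**2025-03-27**|[A friendly introduction to triangular transport](http://arxiv.org/abs/2503.21673)|null|Decision making under uncertainty is a cross-cutting challenge in science and engineering. Most approaches to this challenge use probabilistic representations to characterize uncertainty. In complex systems that are accessible only through data or black-box models, however, these representations are usually unknown. We discuss how to characterize and manipulate such representations using triangular transport maps, which approximate any complex probability distribution as a transformation of a simple, well-understood distribution. The particular structure of triangular transport guarantees many desirable mathematical and computational properties that translate well into solving practical problems. Triangular maps are actively used for density estimation, (conditional) generative modeling, Bayesian inference, data assimilation, optimal experimental design, and related tasks. While there is ample literature on the development and theory of triangular transport methods, this manuscript provides a detailed introduction for scientists who are interested in using measure transport but lack a formal mathematical background. We build intuition for the key foundations of triangular transport, discuss many aspects of its practical implementation, and outline the frontiers of the field.||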
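|**2025-03-27**|[Audio-driven Gesture Generation via Deviation Feature in the Latent Space](http://arxiv.org/abs/2503.21616)|null|Gestures are essential for enhancing co-speech communication, offering visual emphasis and complementing verbal interaction. Whereas prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures and advocate weakly supervised learning and pixel-level motion deviations. We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. By leveraging weakly supervised deviations in latent space, we effectively generate hand gestures and mouth movements, which are crucial for realistic video production. Experiments show that our method significantly improves video quality, surpassing current state-of-the-art techniques.||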
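|**2025-03-27**|[Critical Iterative Denoising: A Discrete Generative Model Applied to Graphs](http://arxiv.org/abs/2503.21592)|null|Discrete diffusion and flow matching models have significantly advanced generative modeling for discrete structures, including graphs. However, the time dependencies in the denoising process of these models lead to error accumulation and propagation during the reverse process. This issue, particularly pronounced in mask diffusion, is a known limitation in sequence modeling and, as we demonstrate, also affects discrete diffusion models for graphs. To address this problem, we propose a novel framework called Iterative Denoising, which simplifies discrete diffusion and circumvents the issue by assuming conditional independence across time. In addition, we enhance our model with a Critic, which during generation selectively retains or corrupts elements in an instance based on their likelihood under the data distribution. Our empirical evaluations show that the proposed method significantly outperforms existing discrete diffusion baselines in graph generation tasks.||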
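|**2025-03-25**|[Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models](http://arxiv.org/abs/2503.19914)|null|We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture plausible and realistic OOR cues, enabling efficient collection of a 3D dataset for learning OOR across various unbounded object categories. Our approach first synthesizes diverse images that capture plausible OOR cues and then lifts them into 3D samples. Using our diverse collection of plausible 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. In addition, we extend pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations and preventing object collisions. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, as well as its applicability to real-world 3D scene arrangement tasks using the OOR diffusion model.||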
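|**2025-03-25**|[PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model](http://arxiv.org/abs/2503.19913)|null|As interest grows in world models that predict future states from current observations and actions, accurately modeling part-level dynamics has become increasingly relevant for various applications. Existing approaches, such as Puppet-Master, rely on fine-tuning large pre-trained video diffusion models, which is impractical in real-world use because of the limitations of 2D video representation and slow processing times. To overcome these challenges, we present PartRM, a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object. PartRM builds on large 3D Gaussian reconstruction models, leveraging their extensive knowledge of appearance and geometry in static objects. To address data scarcity in 4D, we introduce the PartDrag-4D dataset, which provides multi-view observations of part-level dynamics across over 20,000 states. We enhance the model's understanding of interaction conditions with a multi-scale drag embedding module that captures dynamics at varying granularities. To prevent catastrophic forgetting during fine-tuning, we implement a two-stage training process that focuses sequentially on motion and appearance learning. Experimental results show that PartRM establishes a new state of the art in part-level motion learning and can be applied to manipulation tasks in robotics. Our code, data, and models are publicly available to facilitate future research.||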
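|**2025-03-25**|[AvatarArtist: Open-Domain 4D Avatarization](http://arxiv.org/abs/2503.19906)|null|This work focuses on open-domain 4D avatarization, with the purpose of creating a 4D avatar from a portrait image in an arbitrary style. We select parametric triplanes as the intermediate 4D representation and propose a practical training paradigm that takes advantage of both generative adversarial networks (GANs) and diffusion models. Our design stems from the observation that 4D GANs excel at bridging images and triplanes without supervision yet usually face challenges in handling diverse data distributions. A robust 2D diffusion prior emerges as the solution, assisting the GAN in transferring its expertise across various domains. The synergy between these experts permits the construction of a multi-domain image-triplane dataset, which drives the development of a general 4D avatar creator. Extensive experiments suggest that our model, AvatarArtist, is capable of producing high-quality 4D avatars with strong robustness to various source image domains. The code, data, and models will be made publicly available to facilitate future studies.||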
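|**2025-03-25**|[ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models](http://arxiv.org/abs/2503.19902)|null|The inherent ambiguity in defining visual concepts poses significant challenges for modern generative models, such as diffusion-based text-to-image (T2I) models, in accurately learning concepts from a single image. Existing methods lack a systematic way to reliably extract the interpretable underlying intrinsic concepts. To address this challenge, we present ICE, short for Intrinsic Concept Extraction, a novel framework that uses only a T2I model to automatically and systematically extract intrinsic concepts from a single image. ICE consists of two pivotal stages. In the first stage, ICE devises an automatic concept localization module to pinpoint relevant text-based concepts and their corresponding masks within the image. This critical stage streamlines concept initialization and provides precise guidance for subsequent analysis. The second stage delves deeper into each identified mask, decomposing the object-level concepts into intrinsic concepts and general concepts. This decomposition allows for a more granular and interpretable breakdown of visual elements. Our framework demonstrates superior performance on intrinsic concept extraction from a single image in an unsupervised manner. Project page: https://visual-ai.github.io/ice||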
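|**2025-03-25**|[Scaling Down Text Encoders of Text-to-Image Diffusion Models](http://arxiv.org/abs/2503.19897)|**[link](https://github.com/LifuWang-66/DistillT5)**|Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it has also led to a substantial increase in the number of parameters. Although T5 series encoders are trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with a T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power. This raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit its capabilities, we construct our dataset based on three criteria: image quality, semantic understanding, and text rendering. Our results demonstrate a scaling-down pattern in which the distilled T5-base model can generate images of quality comparable to those produced by T5-XXL while being 50 times smaller. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.||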
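|**2025-03-25**|[FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model](http://arxiv.org/abs/2503.19839)|null|Instruction-based image editing methods have recently made significant progress by leveraging the powerful cross-modal understanding capabilities of vision language models (VLMs). However, they still face challenges in three key areas: 1) complex scenarios; 2) semantic consistency; and 3) fine-grained editing. To address these issues, we propose FireEdit, an innovative fine-grained instruction-based image editing framework that exploits a region-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Specifically, we enhance the fine-grained visual perception capabilities of the VLM by introducing additional region tokens. Relying solely on the output of the LLM to guide the diffusion model may lead to suboptimal editing results. Therefore, we propose a Time-Aware Target Injection module and a Hybrid Visual Cross Attention module. The former dynamically adjusts the guidance strength at various denoising stages by integrating timestep embeddings with the text embeddings. The latter enhances visual details for image editing, thereby preserving semantic consistency between the edited result and the source image. By combining the VLM enhanced with fine-grained region tokens and the time-dependent diffusion model, FireEdit demonstrates significant advantages in comprehending editing instructions and maintaining high semantic consistency. Extensive experiments indicate that our approach surpasses state-of-the-art instruction-based image editing methods. Our project is available at https://zjgans.github.io/fireedit.github.io.||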
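|**2025-03-25**|[TopoGEN: topology-driven microstructure generation for in silico modeling of fiber network mechanics](http://arxiv.org/abs/2503.19832)|null|The fields of mechanobiology and biomechanics are expanding our understanding of the complex behavior of soft biological tissues across multiple scales. Given the intricate connection between tissue microstructure and its macroscale mechanical behavior, unraveling this mechanistic relationship remains an ongoing challenge. Reconstituted fiber networks serve as valuable in vitro models that simplify the complexity of in vivo systems for targeted investigations. Concurrently, advances in imaging enable microstructure visualization and, through generative pipelines, its modeling as discrete element networks. These mesoscale models provide insights into macroscale tissue behavior. However, a systematic investigation of how microstructural changes affect nonlinear tissue mechanics is still lacking. In this work, we develop a novel framework to generate topology-driven discrete fiber networks. Using these networks, we generate models of interconnected load-bearing fiber components that exhibit softening under compression and are bending-resistant. By virtually replicating microstructural features of reconstituted collagen networks, such as fiber volume fraction and cross-link concentration, we assess the robustness of the simulations. By analyzing the nonlinear elastic behavior at different polymerization temperatures, we find that the in silico results are consistent with in vitro data from the literature. We extend our study beyond empirically measurable factors to explore microstructural effects at the single-fiber level (that is, fiber morphology and stiffness), which are difficult to investigate experimentally. TopoGEN allows us to mechanistically explore local microstructural phenomena and link microstructural changes to the bulk mechanical response of soft biological materials, providing an indispensable tool for advancing tissue biomechanics and engineering.||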
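|**2025-03-25**|[IgCraft: A versatile sequence generation framework for antibody discovery and engineering](http://arxiv.org/abs/2503.19821)|**[link](https://github.com/mgreenig/igcraft)**|Designing antibody sequences that better resemble those observed in natural human repertoires is a key challenge in biologics development. We introduce IgCraft, a multi-purpose model for paired human antibody sequence generation, built on Bayesian Flow Networks. IgCraft is one of the first unified generative modeling frameworks capable of addressing multiple antibody sequence design tasks with a single model, including unconditional sampling, sequence inpainting, inverse folding, and CDR motif scaffolding. Our approach achieves competitive results across all of these tasks while constraining generation to the space of human antibody sequences, exhibiting particular strength in CDR motif scaffolding (grafting), where we achieve state-of-the-art performance in humanness and preservation of structural properties. By integrating previously separate tasks into a single scalable generative model, IgCraft provides a versatile platform for sampling human antibody sequences under a variety of contexts relevant to antibody discovery and engineering. Model code and weights are publicly available at github.com/mgreenig/IgCraft.||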
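|**2025-03-25**|[Unpaired Object-Level SAR-to-Optical Image Translation for Aircraft with Keypoints-Guided Diffusion Models](http://arxiv.org/abs/2503.19798)|null|Synthetic Aperture Radar (SAR) imagery provides all-weather, all-day, high-resolution imaging, but its unique imaging mechanism makes interpretation heavily reliant on expert knowledge, limiting interpretability, especially for complex target tasks. Translating SAR images into optical images is an effective way to enhance SAR interpretation and support downstream tasks. Most existing research focuses on scene-level translation; research on object-level translation is limited because paired data are scarce and accurately preserving contour and texture details is challenging. To address these issues, this study proposes a keypoint-guided diffusion model (KeypointDiff) for SAR-to-optical image translation of unpaired aircraft targets. The framework introduces supervision on target class and azimuth angle via keypoints, together with a training strategy for unpaired data. Based on a classifier-free guidance diffusion architecture, a class-angle guidance module (CAGM) is designed to integrate class and angle information into the diffusion generation process. Furthermore, adversarial loss and consistency loss are employed to improve image fidelity and detail quality, tailored to the characteristics of aircraft targets. During sampling, aided by a pre-trained keypoint detector, the model eliminates the need for manually labeled class and azimuth information, enabling automated SAR-to-optical translation. Experimental results show that the proposed method outperforms existing approaches across multiple metrics, providing an efficient and effective solution for object-level SAR-to-optical translation and downstream tasks. Moreover, with the keypoint detector, the method exhibits strong zero-shot generalization to untrained aircraft types.||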
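|**2025-03-25**|[In the Blink of an Eye: Instant Game Map Editing using a Generative-AI Smart Brush](http://arxiv.org/abs/2503.19793)|null|With video games steadily increasing in complexity, automated generation of game content has attracted wide interest. However, the task of 3D game map art creation has so far remained underexplored because of its unique complexity and domain-specific challenges. While some recent works have addressed related topics such as retro-style level generation and procedural terrain creation, they focus primarily on simpler data distributions. To the best of our knowledge, we are the first to demonstrate the application of modern AI techniques to high-resolution texture manipulation in complex, highly detailed AAA 3D game environments. We introduce a novel Smart Brush for map editing, designed to help artists seamlessly modify selected areas of a game map with minimal effort. By leveraging generative adversarial networks and diffusion models, we propose two brush variants that enable efficient and context-aware generation. Our hybrid workflow aims to enhance both artistic flexibility and production efficiency, enabling the refinement of environments without manually reworking every detail, and thereby helping to bridge the gap between automation and creative control in game development. A comparative evaluation of our two methods against adapted versions of several state-of-the-art models shows that our GAN-based brush produces the sharpest and most detailed outputs while preserving image context, whereas the evaluated state-of-the-art models tend toward blurrier results and exhibit difficulties in maintaining contextual consistency.||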
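|**2025-03-21**|[Position: Interactive Generative Video as Next-Generation Game Engine](http://arxiv.org/abs/2503.17359)|null|Modern game development faces significant challenges in creativity and cost because of the predetermined content in traditional game engines. Recent breakthroughs in video generation models, which can synthesize realistic and interactive virtual environments, present an opportunity to revolutionize game creation. In this paper, we propose Interactive Generative Video (IGV) as the foundation for Generative Game Engines (GGE), enabling unlimited novel content generation in next-generation gaming. GGE leverages IGV's unique strengths in unlimited high-quality content synthesis, physics-aware world modeling, user-controlled interactivity, long-term memory capabilities, and causal reasoning. We present a comprehensive framework detailing GGE's core modules and a hierarchical maturity roadmap (L0-L4) to guide its evolution. Our work charts a new course for game development in the AI era, envisioning a future in which AI-powered generative systems fundamentally reshape how games are created and experienced.||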
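|**2025-03-21**|[Preference-Guided Diffusion for Multi-Objective Offline Optimization](http://arxiv.org/abs/2503.17299)|null|Offline multi-objective optimization aims to identify Pareto-optimal solutions given a dataset of designs and their objective values. In this work, we propose a preference-guided diffusion model that generates Pareto-optimal designs by leveraging a classifier-based guidance mechanism. Our guidance classifier is a preference model trained to predict the probability that one design dominates another, directing the diffusion model toward optimal regions of the design space. Crucially, this preference model generalizes beyond the training distribution, enabling the discovery of Pareto-optimal solutions outside the observed dataset. We introduce a novel diversity-aware preference guidance, augmenting Pareto dominance preference with diversity criteria. This ensures that generated solutions are optimal and well distributed across the objective space, a capability absent in prior generative methods for offline multi-objective optimization. We evaluate our approach on various continuous offline multi-objective optimization tasks and find that it consistently outperforms other inverse/generative approaches while remaining competitive with forward/surrogate-based optimization methods. Our results highlight the effectiveness of classifier-guided diffusion models in generating diverse and high-quality solutions that approximate the Pareto front well.||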
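|**2025-03-21**|[Offline Model-Based Optimization: Comprehensive Review](http://arxiv.org/abs/2503.17286)|**[link](https://github.com/mila-iqia/Awesome-Offline-Model-Based-Optimization)**|Offline optimization is a fundamental challenge in science and engineering, where the goal is to optimize black-box functions using only offline datasets. This setting is particularly relevant when querying the objective function is prohibitively expensive or infeasible, with applications spanning protein engineering, material discovery, neural architecture search, and more. The main difficulty lies in accurately estimating the objective landscape beyond the range of the available data, where extrapolations are fraught with significant epistemic uncertainty. This uncertainty can lead to objective hacking (reward hacking), in which model inaccuracies in unseen regions are exploited, or to other spurious optimizations that yield misleadingly high performance estimates outside the training distribution. Recent advances in model-based optimization (MBO) have harnessed the generalization capabilities of deep neural networks to develop offline-specific surrogate and generative models. Trained with carefully designed strategies, these models are more robust against out-of-distribution issues, facilitating the discovery of improved designs. Despite its growing impact in accelerating scientific discovery, the field lacks a comprehensive review. To bridge this gap, we present the first thorough review of offline MBO. We begin by formalizing the problem for both single-objective and multi-objective settings and by reviewing recent benchmarks and evaluation metrics. We then categorize existing approaches into two key areas: surrogate modeling, which emphasizes accurate function approximation in out-of-distribution regions, and generative modeling, which explores high-dimensional design spaces to identify high-performing designs. Finally, we examine the key challenges and propose promising directions for future research in this rapidly evolving field, including safe control of superintelligent systems.||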
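|**2025-03-21**|[Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras](http://arxiv.org/abs/2503.17262)|**[link](https://github.com/tub-rip/e2fai)**|Event cameras rely on motion to obtain information about scene appearance. In other words, for event cameras, motion and appearance are either both seen or both not seen, and both are encoded in the output event stream. Previous works treat the recovery of these two visual quantities as separate tasks, which does not fit the nature of event cameras and neglects the inherent relation between the two tasks. In this paper, we propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) with a single network. Starting from the event generation model, we newly derive an event-based photometric error as a function of optical flow and image intensity, and combine it with the contrast maximization framework, yielding a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Extensive experiments show that our model achieves state-of-the-art performance in both optical flow (improving EPE and AE by 20% and 25%, respectively, in the unsupervised learning category) and intensity estimation (producing results competitive with other baselines, particularly in high dynamic range scenarios). Last but not least, our model achieves shorter inference time than all other optical flow models and many image reconstruction models, even though those output only one quantity. Project page: https://github.com/tub-rip/e2fai||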
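|**2025-03-21**|[Deep End-to-End Posterior ENergy (DEEPEN) for image recovery](http://arxiv.org/abs/2503.17244)|null|Current end-to-end (E2E) and plug-and-play (PnP) image reconstruction algorithms approximate the maximum a posteriori (MAP) estimate but cannot offer sampling from the posterior distribution as diffusion models do. By contrast, diffusion models are difficult to train in an E2E fashion. This paper introduces a Deep End-to-End Posterior ENergy (DEEPEN) framework, which enables both MAP estimation and sampling. We learn the parameters of the posterior, which is the sum of the data-consistency error and the negative log-prior distribution, in an E2E fashion using maximum likelihood optimization. The proposed approach does not require algorithm unrolling and hence has a smaller computational and memory footprint than current E2E methods, and it does not require the contraction constraints typically needed by current PnP methods. Our results show that DEEPEN offers improved performance over current E2E and PnP models in the MAP setting, while also offering faster sampling than diffusion models. In addition, the learned energy-based model is observed to be more robust to changes in image acquisition settings.||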
|**2025-03-21**|[Leveraging Text-to-Image Generation for Handling Spurious Correlation](http://arxiv.org/abs/2503.17226)|null|Deep neural networks trained with Empirical Risk Minimization (ERM) perform well when both training and test data come from the same domain, but they often fail to generalize to out-of-distribution samples. In image classification, these models may rely on spurious correlations that often exist between labels and irrelevant features of images, making predictions unreliable when those features do not exist. We propose a technique to generate training samples with text-to-image (T2I) diffusion models for addressing the spurious correlation problem. First, we compute the best describing token for the visual features pertaining to the causal components of samples by a textual inversion mechanism. Then, leveraging a language segmentation method and a diffusion model, we generate new samples by combining the causal component with the elements from other classes. We also meticulously prune the generated samples based on the prediction probabilities and attribution scores of the ERM model to ensure their correct composition for our objective. Finally, we retrain the ERM model on our augmented dataset. This process reduces the model's reliance on spurious correlations by learning from carefully crafted samples in which this correlation does not exist. Our experiments show that across different benchmarks, our technique achieves better worst-group accuracy than the existing state-of-the-art methods.||
|**2025-03-21**|[Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation](http://arxiv.org/abs/2503.17224)|null|As machine learning models increase in scale and complexity, obtaining sufficient training data has become a critical bottleneck due to acquisition costs, privacy constraints, and data scarcity in specialised domains. While synthetic data generation has emerged as a promising alternative, a notable performance gap remains compared to models trained on real data, particularly as task complexity grows. Concurrently, Neuro-Symbolic methods, which combine neural networks' learning strengths with symbolic reasoning's structured representations, have demonstrated significant potential across various cognitive tasks. This paper explores the utility of Neuro-Symbolic conditioning for synthetic image dataset generation, focusing specifically on improving the performance of Scene Graph Generation models. The research investigates whether structured symbolic representations in the form of scene graphs can enhance synthetic data quality through explicit encoding of relational constraints. The results demonstrate that Neuro-Symbolic conditioning yields significant improvements of up to +2.59% in standard Recall metrics and +2.83% in No Graph Constraint Recall metrics when used for dataset augmentation. These findings establish that merging Neuro-Symbolic and generative approaches produces synthetic data with complementary structural information that enhances model performance when combined with real data, providing a novel approach to overcome data scarcity limitations even for complex visual reasoning tasks.||
|**2025-03-21**|[UniCon: Unidirectional Information Flow for Effective Control of Large-Scale Diffusion Models](http://arxiv.org/abs/2503.17221)|null|We introduce UniCon, a novel architecture designed to enhance control and efficiency in training adapters for large-scale diffusion models. Unlike existing methods that rely on bidirectional interaction between the diffusion model and control adapter, UniCon implements a unidirectional flow from the diffusion network to the adapter, allowing the adapter alone to generate the final output. UniCon reduces computational demands by eliminating the need for the diffusion model to compute and store gradients during adapter training. Our results indicate that UniCon reduces GPU memory usage by one-third and increases training speed by 2.3 times, while maintaining the same adapter parameter size. Additionally, without requiring extra computational resources, UniCon enables the training of adapters with double the parameter volume of existing ControlNets. In a series of image conditional generation tasks, UniCon has demonstrated precise responsiveness to control inputs and exceptional generation capabilities.||
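|**2025-03-21**|[FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy](http://arxiv.org/abs/2503.17197)|null|Recovering high-quality 3D facial textures from single-view 2D images is a challenging task, especially under the constraints of limited data and complex facial details such as makeup, wrinkles, and occlusions. This paper introduces FreeUV, a novel ground-truth-free UV texture recovery framework that requires no annotated or synthetic UV data. FreeUV leverages a pre-trained Stable Diffusion model together with a Cross-Assembly inference strategy to achieve this. In FreeUV, separately trained networks each focus on realistic appearance or structural consistency, and these networks are combined during inference to generate coherent textures. Our approach accurately captures intricate facial features and shows robust performance across diverse poses and occlusions. Extensive experiments validate FreeUV's effectiveness, with results surpassing state-of-the-art methods in both quantitative and qualitative metrics. In addition, FreeUV enables new applications, including local editing, facial feature interpolation, and multi-view texture recovery. By reducing data requirements, FreeUV offers a scalable solution for generating high-fidelity 3D facial textures suitable for real-world scenarios.||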
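|**2025-03-21**|[D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens](http://arxiv.org/abs/2503.17155)|null|In the domain of image generation, latent-based generative models occupy a dominant position; however, these models rely heavily on image tokenizers. To meet modeling requirements, autoregressive models, which are scalable and flexible, adopt discrete-valued tokenizers but face the challenge of poor image generation quality. By contrast, diffusion models take advantage of continuous-valued tokenizers to achieve better generation quality but are inefficient and complex. Existing hybrid models are mainly used to compensate for information loss and to simplify the diffusion learning process. The potential of merging discrete-valued and continuous-valued tokens in image generation has not yet been explored. In this paper, we propose D2C, a novel two-stage method to enhance the generation capacity of the model. In the first stage, discrete-valued tokens representing coarse-grained image features are sampled with a small discrete-valued generator. In the second stage, continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence. In addition, we design two kinds of fusion modules for seamless interaction. On the ImageNet-256 benchmark, extensive experimental results validate that our model achieves superior performance compared with several continuous-valued and discrete-valued generative models on the class-conditional image generation task.||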
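|**2025-03-20**|[Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation](http://arxiv.org/abs/2503.16430)|null|Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging the discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: https://yuqingwang1029.github.io/TokenBridge.||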
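|**2025-03-20**|[SynCity: Training-Free Generation of 3D Worlds](http://arxiv.org/abs/2503.16420)|null|We address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training- and optimization-free approach that leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based approach, we allow fine-grained control over the layout and appearance of scenes. The world is generated tile by tile, and each new tile is generated within its world context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.||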
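|**2025-03-20**|[DreamTexture: Shape from Virtual Texture with Analysis by Augmentation](http://arxiv.org/abs/2503.16412)|null|DreamFusion established a new paradigm for unsupervised 3D reconstruction from virtual views by combining advances in generative models and differentiable rendering. However, the underlying multi-view rendering, along with supervision from large-scale generative models, is computationally expensive and under-constrained. We propose DreamTexture, a novel Shape-from-Virtual-Texture approach that leverages monocular depth cues to reconstruct 3D objects. Our method textures an input image by aligning a virtual texture with the real depth cues in the input, exploiting the inherent understanding of monocular geometry encoded in modern diffusion models. We then reconstruct depth from the virtual texture deformation with a new conformal map optimization, which alleviates memory-intensive volumetric representations. Our experiments reveal that generative models possess an understanding of monocular shape cues, which can be extracted by augmenting and aligning texture cues, a novel monocular reconstruction paradigm that we call Analysis by Augmentation.||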
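|**2025-03-20**|[VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness](http://arxiv.org/abs/2503.16406)|**[link](https://github.com/seungjucha/verbdiff)**|Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects because of their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture the semantics of distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared with previous approaches.||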
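|**2025-03-20**|[ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos](http://arxiv.org/abs/2503.16400)|null|Video diffusion models (VDMs) facilitate the generation of high-quality videos, with current research predominantly concentrated on scaling efforts during training through improvements in data quality, computational resources, and model complexity. However, inference-time scaling has received less attention, with most approaches restricting models to a single generation attempt. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. Building on this, we find that guiding the scaling inference-time search of VDMs to identify better noise candidates not only evaluates the quality of the frames generated in the current step but also preserves high-level object features by referencing the anchor frame from previous multi-chunks, thereby delivering long-term value. Our analysis reveals that diffusion models can flexibly adjust their computation by varying denoising steps, and even a one-step denoising approach, when guided by a reward signal, yields significant long-term benefits. Based on this observation, we propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process to improve global content consistency and visual diversity. Specifically, we perform one-step denoising to convert initial noises into a clip and subsequently evaluate its long-term value, leveraging a reward model anchored by previously generated content. Moreover, to preserve diversity, we sample candidates from a tilted noise distribution that up-weights promising noises. In this way, ScalingNoise significantly reduces noise-induced errors, ensuring more coherent and spatiotemporally consistent video generation. Extensive experiments on benchmark datasets demonstrate that the proposed ScalingNoise effectively improves long video generation.||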
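|**2025-03-20**|[Scale-wise Distillation of Diffusion Models](http://arxiv.org/abs/2503.16397)|null|We present SwD, a scale-wise distillation framework for diffusion models (DMs), which effectively applies next-scale prediction ideas to diffusion-based few-step generators. In more detail, SwD is inspired by recent insights relating diffusion processes to implicit spectral autoregression. We suppose that DMs can initiate generation at lower data resolutions and gradually upscale the samples at each denoising step without loss in performance while significantly reducing computational costs. SwD naturally integrates this idea into existing diffusion distillation methods based on distribution matching. We also enrich the family of distribution matching approaches by introducing a novel patch loss that enforces finer-grained similarity to the target distribution. When applied to state-of-the-art text-to-image diffusion models, SwD approaches the inference time of two full-resolution steps and significantly outperforms its counterparts under the same computation budget, as evidenced by automated metrics and human preference studies.||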
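|**2025-03-20**|[SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation](http://arxiv.org/abs/2503.16396)|null|We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared with its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this by introducing key improvements in multiple aspects: 1) network architecture: eliminating the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention; 2) data: enhancing the quality and quantity of training data; 3) training strategy: adopting progressive 3D-4D training for better generalization; and 4) 4D optimization: handling 3D inconsistency and large motion via two-stage refinement and progressive frame sampling. Extensive experiments demonstrate significant performance gains by SV4D 2.0 both visually and quantitatively, achieving better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis, and better 4D optimization (-12% LPIPS and -24% FV4D) compared with SV4D. Project page: https://sv4d2.0.github.io.||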
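|**2025-03-20**|[Do Visual Imaginations Improve Vision-and-Language Navigation Agents?](http://arxiv.org/abs/2503.16394)|null|Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study whether visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations, or imaginations, we apply a text-to-image diffusion model to landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality that acts as a landmark cue, and an auxiliary loss is added to explicitly encourage relating them to their corresponding referring expressions. Our findings reveal an increase of around 1 percentage point in success rate (SR) and 0.5 percentage points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared with relying on language instructions alone. Code and data for our work can be found at https://www.akhilperincherry.com/VLN-Imagine-website/.||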
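|**2025-03-20**|[LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images](http://arxiv.org/abs/2503.16376)|null|The success of modern machine learning, particularly in facial translation networks, is highly dependent on the availability of high-quality, paired, large-scale datasets. However, acquiring sufficient data is often challenging and costly. Inspired by the recent success of diffusion models in high-quality image synthesis and by advances in large language models (LLMs), we propose a novel framework called LLM-assisted Paired Image Generation (LaPIG). This framework enables the construction of comprehensive, high-quality paired visible and thermal images using captions generated by LLMs. Our method has three parts: visible image synthesis with ArcFace embeddings, thermal image translation using latent diffusion models (LDMs), and caption generation with LLMs. Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while maintaining identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.||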
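|**2025-03-20**|[Heat transfer and mixing in initiated Chemical Vapor Deposition analyzed by in-situ gas composition sensing](http://arxiv.org/abs/2503.16373)|null|Kinetic studies of initiated chemical vapor deposition (iCVD) often make assumptions about the iCVD gas phase that are difficult to verify experimentally. We bridge this gap by studying heat transfer and mixing in an iCVD reactor using in-situ gas composition sensing. Our work allows practitioners of iCVD to estimate the degree of mixing and the temperature profile in the gas phase from reactor dimensions and process variables such as chamber pressure. We begin by using dimensional analysis to simplify the parameter space governing heat transfer and mixing, identifying key dimensionless groups that capture the dominant physics. We find that the degree of mixing in the reactor is governed primarily by the Peclet number (Pe) and that heat transfer is governed primarily by the Knudsen number (Kn). Multiphysics simulations of an iCVD reactor provide visualization of non-ideal mixing and allow us to identify a critical Pe above which the well-mixed assumption may begin to break down. To verify that the well-mixed assumption holds for iCVD at low Pe, we measure the residence time distribution (RTD) of argon and isopropanol at a flow rate of 1 sccm and find close agreement with a well-mixed model. We measure the RTD using composition sensing with a Pirani gauge. We demonstrate that the Pirani gauge is an accurate gas composition sensor that is more modular and less expensive than methods such as in-situ infrared spectroscopy and mass spectrometry. We then address heat transfer by measuring thermal diffusion from the filament in argon and isopropanol at different pressures. We quantify the effect of non-ideal heat transfer at low pressure and identify a critical Kn above which heat transfer is independent of pressure. Measuring thermally driven pressure changes in batch mode allows us to define and model an "effective temperature" of the reactor, matched to a thermal diffusion model.||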
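|**2025-03-14**|[From few to many maps: A fast map-level emulator for extreme augmentation of CMB systematics datasets](http://arxiv.org/abs/2503.11643)|**[link](https://github.com/jmdelouis/healpixml)**|We introduce a novel, fast, and efficient generative model built upon scattering covariances, the most recent iteration of scattering transform statistics. The model is designed to augment by several orders of magnitude the number of map simulations in datasets of computationally expensive CMB instrumental systematics simulations, including their non-Gaussian and inhomogeneous features. Unlike conventional neural-network-based algorithms, this generative model requires only a minimal number of training samples, making it highly compatible with the computational constraints of typical CMB simulation campaigns. We validate the method using realistic simulations of CMB systematics, which are particularly challenging to emulate, and perform extensive statistical tests to confirm its ability to produce new, statistically independent approximate realizations. Remarkably, even when trained on as few as 10 simulations, the emulator closely reproduces key summary statistics, including the angular power spectrum, scattering coefficients, and Minkowski functionals, and provides pixel-to-pixel covariance estimates with substantially reduced sample noise compared with those obtained without augmentation. The proposed approach has the potential to shift the paradigm in simulation campaign design. Instead of producing large numbers of low- or medium-accuracy simulations, future pipelines can focus on generating a few high-accuracy simulations that are then efficiently augmented using such a generative model. This promises significant benefits for current and forthcoming cosmological surveys such as $Planck$, $LiteBIRD$, Simons Observatory, CMB-S4, Euclid, and Rubin-LSST. We provide a general framework for scattering transform statistics at https://github.com/jmdelouis/HealpixML and the emulator at https://github.com/pcampeti/CMBSCAT.||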
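|**2025-03-14**|[Gradient-bridged Posterior: Bayesian Inference for Models with Implicit Functions](http://arxiv.org/abs/2503.11637)|null|Many statistical problems include model parameters that are defined as the solutions to optimization sub-problems. These include classical approaches such as profile likelihood as well as modern applications involving flow networks or Procrustes distances. In such cases, the likelihood of the data involves an implicit function, which often complicates inference and entails prohibitive computational cost. In this article, we propose an intuitive and tractable posterior inference approach for this setting. We introduce a class of continuous models that handle implicit function values using the first-order optimality of the sub-problems. Specifically, we apply a shrinkage kernel to the gradient norm, which retains a probabilistic interpretation within a generative model. This can be understood as a generalization of the Gibbs posterior framework that newly allows concentration around partial minimizers in a subset of the parameters. We show that this method, termed the gradient-bridged posterior, is amenable to efficient posterior computation and enjoys theoretical guarantees, establishing a Bernstein-von Mises theorem for asymptotic normality. The advantages of our approach are highlighted on a synthetic flow network experiment and an application to data integration using Procrustes distances.||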
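|**2025-03-14**|[Pathology Image Compression with Pre-trained Autoencoders](http://arxiv.org/abs/2503.11591)|null|The growing volume of high-resolution whole slide images in digital histopathology poses significant challenges for storage, transmission, and computational efficiency. Standard compression methods, such as JPEG, reduce file sizes but often fail to preserve the fine-grained phenotypic details critical for downstream tasks. In this work, we repurpose autoencoders (AEs) designed for latent diffusion models as an efficient learned compression framework for pathology images. We systematically benchmark three AE models with varying compression levels and evaluate their reconstruction ability using pathology foundation models. We introduce a fine-tuning strategy that further enhances reconstruction fidelity by optimizing a pathology-specific learned perceptual metric. We validate our approach on downstream tasks including segmentation, patch classification, and multiple instance learning, showing that replacing the original images with AE-compressed reconstructions leads to only minimal performance degradation. In addition, we propose a K-means clustering-based quantization method for AE latents, improving storage efficiency while maintaining reconstruction quality. We provide the weights of the fine-tuned autoencoders at https://huggingface.co/collections/StonyBrook-CVLab/pathology-fine-tuned-aes-67d45f223a659ff2e3402dd0.||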
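|**2025-03-14**|[Dynamics of a coupled nonlocal PDE-ODE system with spatial memory: well-posedness, stability, and bifurcation analysis](http://arxiv.org/abs/2503.11550)|null|Nonlocal aggregation-diffusion models, when coupled with a spatial map, can capture the effects of cognition and memory on animal movement and population-level patterns. In this work, we study a one-dimensional reaction-diffusion-aggregation system in which the spatiotemporal dynamics of a population are tightly linked to a separate, dynamically updating map. Depending on the local population density, the map amplifies and suppresses certain landscape regions and promotes directed movement through a nonlocal spatial kernel. After establishing the well-posedness of the coupled PDE-ODE system, we perform a linear stability analysis to identify critical aggregation strengths. We then carry out a rigorous bifurcation analysis to determine the precise solution behavior at steady states near these critical thresholds, deciding whether the bifurcation is sub- or supercritical and the stability of the emergent branch. Based on our analytical findings, we highlight several interesting biological consequences. First, we observe that whether the spatial map is attractive or repulsive depends on the map's relative excitation rate versus its adaptation rate: when the excitatory effect is larger (smaller) than the adaptatory effect, the map is attractive (repulsive). Second, in the absence of growth dynamics, populations can only form a single aggregate. Therefore, the presence of intraspecific competition is necessary to drive multi-peaked aggregations, reflecting higher-frequency spatial patterns. Finally, we show how subcritical bifurcations can trigger abrupt shifts in average population abundance, suggesting a tipping-point phenomenon in which moderate changes in movement parameters can cause sudden population declines.||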
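|**2025-03-14**|[AugGen: Synthetic Augmentation Can Improve Discriminative Models](http://arxiv.org/abs/2503.11544)|null|The increasing reliance of machine learning on large-scale datasets poses significant privacy and ethical challenges. Synthetic data generation offers a promising solution; however, most existing methods rely on external datasets or pre-trained models, which adds complexity and raises resource demands. In this work, we introduce a novel self-contained synthetic augmentation technique that strategically samples from a conditional generative model trained exclusively on the target dataset. This approach eliminates the need for auxiliary data sources. Applied to face recognition datasets, our method achieves 1-12% performance improvements on the IJB-C and IJB-B benchmarks. It outperforms models trained solely on real data and exceeds the performance of state-of-the-art synthetic data generation baselines. Notably, these enhancements often surpass those achieved through architectural improvements, underscoring the significant impact of synthetic augmentation in data-scarce environments. These findings suggest that carefully integrated synthetic data not only addresses privacy and resource constraints but also substantially boosts model performance. Project page: https://parsa-ra.github.io/auggen||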
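|**2025-03-14**|[Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models](http://arxiv.org/abs/2503.11519)|null|Current cross-modality generation models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of visual-modality inputs in real-world scenarios, cross-vision tasks, encompassing vision-language perception (VLP) and image-to-image (I2I), have attracted significant attention. Large vision-language models (LVLMs) and I2I GMs are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to generate disruptive outputs semantically related to those words. In addition, visual prompts, as a more sophisticated form of typography, have also been found to pose security risks to various applications of VLP tasks when injected into images. In this paper, we comprehensively investigate the performance impact of typographic visual prompt injection (TVPI) on various LVLMs and I2I GMs. To better observe the performance changes and characteristics of this threat, we also introduce the TVPI dataset. Through extensive exploration, we deepen the understanding of the underlying causes of the TVPI threat in various GMs and offer valuable insights into its potential origins.||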
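|**2025-03-14**|[Perfect Stabilization of Biomolecular Adhesions under Load](http://arxiv.org/abs/2503.11510)|null|Cells attach to their environment through focal adhesions, molecular complexes that show a remarkable ability to adapt their size to changes in mechanical load. Drawing on the biomolecular mechanisms underlying this behavior, we present a generic model for mechanically induced adhesion growth in which the conformational states of molecules already in the adhesion cluster are coupled to the adsorption of further molecules. If the coupling is sufficiently strong, an instability arises in the absence of mechanical load and leads to unbounded growth. Unexpectedly, the same type of instability can lead to perfect stability under mechanical load, whereby the adhesion cluster adjusts its size to withstand arbitrarily large forces without breaking, a phenomenon we term perfect stabilization. We derive state diagrams that characterize adhesion stability under stationary loads and show that perfect stabilization also occurs under dynamic loading on physiologically relevant timescales. Finally, we show that perfect stabilization and the related instability are generic phenomena that can be realized in many different ways if internal molecular states are coupled to adsorption in nonequilibrium.||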
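|**2025-03-14**|[Exponential Quantum Advantage for Simulating Open Classical Systems](http://arxiv.org/abs/2503.11483)|null|A promising area for quantum advantage in the near term is the simulation of exponentially large classical systems. Here, we show how this advantage can be used to compute the dynamics of open classical systems, including the effects of non-Markovian baths. This is a particularly interesting class of systems because dissipation plays a key role in settings ranging from fluid dynamics to thermalization. We adopt the Caldeira-Leggett Hamiltonian, a generic model of dissipation in which the system is coupled to a bath of harmonic oscillators with a large number of degrees of freedom. To date, the most efficient classical algorithms for simulating such systems have a polynomial dependence on the size of the bath. In this work, we present a quantum algorithm with an exponential speedup, capable of simulating $d$ system degrees of freedom coupled to $N = 2^n\gg d$ bath degrees of freedom to within error $\varepsilon$ using $O({\rm poly}(d, n, t, \varepsilon^{-1}))$ gates.||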
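|**2025-03-14**|[T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation](http://arxiv.org/abs/2503.11481)|null|Although recent text-to-image generative models have achieved impressive performance, they still often struggle to capture the compositional complexity of prompts, including attribute binding and spatial relationships between different entities. This misalignment is not revealed by common evaluation metrics such as CLIPScore. Recent works have proposed evaluation metrics that use visual question answering (VQA), decomposing prompts into questions about the generated image for more robust compositional evaluation. Although these methods align better with human evaluations, they still fail to fully cover the compositionality within the image. To address this, we propose a novel metric that breaks images down into components and texts into fine-grained questions about the generated image for evaluation. Our method outperforms previous state-of-the-art metrics, demonstrating its effectiveness in evaluating text-to-image generative models. Code is available at https://github.com/hadi-hosseini/T2I-FineEval.||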
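|**2025-03-14**|[TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation](http://arxiv.org/abs/2503.11423)|null|We address key limitations in existing datasets and models for task-oriented hand-object interaction video generation, a critical approach to generating video demonstrations for robotic imitation learning. Current datasets, such as Ego4D, often suffer from inconsistent viewpoints and misaligned interactions, leading to reduced video quality and limiting their applicability to precise imitation learning tasks. To this end, we introduce TASTE-Rob, a pioneering large-scale dataset of 100,856 ego-centric hand-object interaction videos. Each video is meticulously processed, aligned with language instructions, and recorded from a consistent camera viewpoint to ensure interaction clarity. By fine-tuning a Video Diffusion Model (VDM) on TASTE-Rob, we achieve realistic object interactions, although we observe occasional inconsistencies in hand grasping postures. To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand posture accuracy in generated videos. Our curated dataset, coupled with the specialized pose-refinement framework, provides notable performance gains in generating high-quality, task-oriented hand-object interaction videos, resulting in superior generalizable robotic manipulation. The TASTE-Rob dataset will be made publicly available upon publication to foster further advances in the field.||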
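|**2025-03-13**|[GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing](http://arxiv.org/abs/2503.10639)|**[link](https://github.com/rongyaofang/got)**|Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9 million samples with detailed reasoning chains that capture semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show that our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. In addition, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pre-trained models publicly available at https://github.com/rongyaofang/GoT.||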
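|**2025-03-13**|[Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective](http://arxiv.org/abs/2503.10638)|null|Classifier-free guidance has become the standard approach to conditional generation with denoising diffusion models. However, a comprehensive understanding of classifier-free guidance is still lacking. In this work, we carry out an empirical study that offers a fresh perspective on it. Specifically, rather than focusing only on classifier-free guidance, we trace it back to its root, classifier guidance, make the key assumptions of the derivation explicit, and conduct a systematic study to understand the role of the classifier. We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, that is, regions where conditional information is usually entangled and hard to learn. Based on this classifier-centric understanding, we propose a generic postprocessing step built on flow matching to shrink the gap between the learned distribution of a pretrained denoising diffusion model and the real data distribution, mainly around the decision boundaries. Experiments on various datasets verify the effectiveness of the proposed approach.|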
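|**2025-03-13**|[Distilling Diversity and Control in Diffusion Models](http://arxiv.org/abs/2503.10637)|null|Distilled diffusion models suffer from a critical limitation: reduced sample diversity compared with their base models. In this work, we find that despite this loss of diversity, distilled models retain the fundamental concept representations of the base models. We demonstrate control distillation: control mechanisms trained on base models, such as Concept Sliders and LoRAs, can be transferred seamlessly to distilled models and vice versa, effectively distilling control without any retraining. This preservation of representational structure prompted us to investigate the mechanisms of diversity collapse during distillation. To understand how distillation affects diversity, we introduce Diffusion Target (DT) Visualization, an analysis and debugging tool that reveals how models predict final outputs at intermediate steps. Using DT Visualization, we identify generation artifacts and inconsistencies, and show that the initial diffusion timesteps disproportionately determine output diversity, while later steps mainly refine details. Based on these insights, we introduce diversity distillation, a hybrid inference approach that strategically uses the base model only for the first critical timestep before switching to the efficient distilled model. Our experiments show that this simple modification not only restores the diversity of the base model in the distilled model but, surprisingly, exceeds it, while retaining nearly the computational efficiency of distilled inference, all without additional training or model modifications. Our code and data are available at https://distillation.baulab.info.|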
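|**2025-03-13**|[HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model](http://arxiv.org/abs/2503.10631)|null|Recent advances in vision-language models (VLMs) for common-sense reasoning have driven the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods exploit large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods add a separate diffusion head to predict continuous actions, which relies solely on VLM-extracted features and limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, we propose a collaborative training recipe that injects diffusion modeling directly into next-token prediction. With this recipe, we find that the two forms of action prediction not only reinforce each other but also perform differently across tasks. We therefore design a collaborative action ensemble mechanism that adaptively fuses the two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across a variety of simulated and real-world tasks, including single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.|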
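|**2025-03-13**|[NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models](http://arxiv.org/abs/2503.10626)|null|Acquiring physically plausible motor skills across diverse and unconventional morphologies, including humanoid robots, quadrupeds, and animals, is essential for advancing character simulation and robotics. Traditional methods such as reinforcement learning (RL) are task- and body-specific, require extensive reward function engineering, and generalize poorly. Imitation learning offers an alternative but depends heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, can generate realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach to skill acquisition that learns 3D motor skills from generated 2D videos and generalizes to unconventional and non-human forms. Specifically, we guide the imitation learning process with vision transformers for video-based comparisons, computing pairwise distances between video embeddings. In addition to the video-encoding distance, we use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. In humanoid robot locomotion tasks, we show that No-data Imitation Learning (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of generative video models for physically plausible skill learning across diverse morphologies, effectively replacing data collection with data generation for imitation learning.|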
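|**2025-03-13**|[MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction](http://arxiv.org/abs/2503.10604)|null|Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) for autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods degrade substantially when viewpoints deviate significantly from the training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, an innovative framework that integrates a multi-modal diffusion model with Gaussian Splatting (GS) for urban scene reconstruction. MuDG uses aggregated LiDAR point clouds together with RGB and geometric priors to condition a multi-modal video diffusion model, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization and provides comprehensive supervision signals for refining 3DGS representations, improving rendering robustness under extreme viewpoint changes. Experiments on the Waymo Open Dataset show that MuDG outperforms existing methods in both reconstruction and synthesis quality.|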
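|**2025-03-13**|[CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models](http://arxiv.org/abs/2503.10592)|null|This paper introduces CameraCtrl II, a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and a limited range of viewpoints when generating videos with large camera movement. We take an approach that progressively expands the generation of dynamic scenes: first enhancing the dynamic content within individual video clips, then extending this capability to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset with a large degree of dynamics and camera parameter annotations for training, and design a lightweight camera injection module and training scheme that preserve the dynamics of the pretrained model. Building on these improved single-clip techniques, we allow users to iteratively specify camera trajectories to generate coherent video sequences, enabling extended scene exploration. Experiments across diverse scenarios show that CameraCtrl II supports camera-controlled dynamic scene synthesis with substantially wider spatial exploration than previous approaches.|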
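|**2025-03-13**|[Long Context Tuning for Video Generation](http://arxiv.org/abs/2503.10589)|null|Recent advances in video generation make it possible to produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pretrained single-shot video diffusion models to learn scene-level consistency directly from data. Our method extends full attention from individual shots to all shots within a scene, and incorporates interleaved 3D position embeddings and an asynchronous noise strategy, enabling both joint and autoregressive shot generation without additional parameters. Models with bidirectional attention after LCT can be further fine-tuned with context-causal attention, which facilitates autoregressive generation with an efficient KV cache. Experiments show that single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. More details are available at https://guoyww.github.io/projects/long-context-video/.|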
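|**2025-03-13**|[Sample and Map from a Single Convex Potential: Generation using Conjugate Moment Measures](http://arxiv.org/abs/2503.10576)|null|A common approach to generative modeling splits model fitting into two blocks: first define how to sample noise (for example, Gaussian), then choose what to do with it (for example, using a single map or flows). In this paper, we explore an alternative route that ties sampling and mapping together. We draw inspiration from moment measures, a result stating that for any measure $\rho$ supported on a compact convex set of $\mathbb{R}^d$, there exists a unique convex potential $u$ such that $\rho=\nabla u\,\sharp\,e^{-u}$. While this does seem to tie sampling (from the log-concave distribution $e^{-u}$) and action (pushing particles through $\nabla u$) together effectively, we observe on simple examples (for example, Gaussians or 1D distributions) that this choice is ill-suited for practical tasks. We study an alternative factorization, in which $\rho$ is factorized as $\nabla w^*\,\sharp\,e^{-w}$, where $w^*$ is the convex conjugate of $w$. We call this approach conjugate moment measures and show far more intuitive results on these examples. Because $\nabla w^*$ is the Monge map between the log-concave distribution $e^{-w}$ and $\rho$, we rely on optimal transport solvers to propose an algorithm that recovers $w$ from samples of $\rho$, and we parameterize $w$ as an input-convex neural network.|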
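|**2025-03-13**|[Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression](http://arxiv.org/abs/2503.10512)|null|We consider the problem of generating valid and small prediction sets by sampling outputs (for example, software code and natural language text) from a black-box deep generative model for a given input (for example, a text prompt). The validity of a prediction set is determined by a user-defined binary admissibility function that depends on the target application. For example, in a code generation application, at least one program in the set must pass all test cases. To address this problem, we develop a simple and effective conformal inference algorithm called Generative Prediction Sets (GPS). Given a set of calibration examples and black-box access to a deep generative model, GPS can generate prediction sets with provable guarantees. The key insight behind GPS is to exploit the inherent structure in the distribution over the minimum number of samples needed to obtain an admissible output, yielding a simple conformal regression approach over that minimum number of samples. Experiments on multiple datasets for code and math word problems, using different large language models, demonstrate the effectiveness of GPS over state-of-the-art methods.|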
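|**2025-03-11**|[OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models](http://arxiv.org/abs/2503.08686)|**[link](https://github.com/hustvl/omnimamba)**|Recent progress in unified multimodal understanding and visual generation (or multimodal generation) models has been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first multimodal generation model based on a linear architecture, which generates both text and images through a unified next-token prediction paradigm. The model fully exploits the high computational and memory efficiency of Mamba-2, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. We also introduce a decoupled two-stage training strategy to mitigate data imbalance between the two tasks. With these techniques, OmniMamba achieves performance comparable to JanusFlow on benchmarks while surpassing Show-o, despite being trained on only 2M image-text pairs, 1,000 times fewer than Show-o. Notably, OmniMamba stands out for its inference efficiency, achieving up to a 119.2x speedup and a 63% reduction in GPU memory for long-sequence generation compared with Transformer-based counterparts. Code and models are released at https://github.com/hustvl/OmniMamba.|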
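|**2025-03-11**|[GarmentCrafter: Progressive Novel View Synthesis for Single-View 3D Garment Reconstruction and Editing](http://arxiv.org/abs/2503.08678)|null|We introduce GarmentCrafter, a new approach that lets non-professional users create and modify 3D garments from a single-view image. While recent advances in image generation have facilitated 2D garment design, creating and editing 3D garments remains challenging for non-professional users. Existing single-view 3D reconstruction methods often rely on pretrained generative models to synthesize novel views conditioned on a reference image and camera pose, but they lack cross-view consistency and fail to capture the internal relationships across different views. In this paper, we tackle this challenge by approximating novel views through progressive depth prediction and image warping. We then train a multi-view diffusion model to complete occluded and unknown clothing regions, informed by the evolving camera pose. By jointly inferring RGB and depth, GarmentCrafter enforces inter-view coherence and reconstructs precise geometry and fine details. Extensive experiments show that our method achieves superior visual fidelity and inter-view coherence compared with state-of-the-art single-view 3D garment reconstruction methods.|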
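|**2025-03-11**|[OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting](http://arxiv.org/abs/2503.08677)|null|Diffusion-based generative models have revolutionized object-oriented image editing, yet their deployment for realistic object removal and insertion remains hampered by challenges such as the intricate interplay of physical effects and insufficient paired training data. In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. Using a pretrained diffusion prior together with a progressive training pipeline that consists of initial paired-sample optimization followed by large-scale unpaired refinement via CycleFlow, OmniPaint achieves precise foreground elimination and seamless object insertion while faithfully preserving scene geometry and intrinsic properties. Furthermore, our novel CFD metric offers a robust, reference-free evaluation of context consistency and object hallucination, establishing a new benchmark for high-fidelity image editing. Project page: https://github.com/yeates/OmniPaint-Page/|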
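|**2025-03-11**|[Language-Depth Navigated Thermal and Visible Image Fusion](http://arxiv.org/abs/2503.08676)|null|Depth-guided multimodal fusion combines depth information from visible and infrared images, significantly improving the performance of 3D reconstruction and robotics applications. Existing thermal-visible image fusion focuses mainly on detection tasks and ignores other critical information such as depth. By addressing the limitations of single modalities in low-light and complex environments, the depth information of fused images not only produces more accurate point cloud data, improving the completeness and precision of 3D reconstruction, but also provides comprehensive scene understanding for robot navigation, localization, and environmental perception. This supports precise recognition and efficient operation in applications such as autonomous driving and rescue missions. We introduce a text-guided and depth-driven infrared and visible image fusion network. The model consists of an image fusion branch, a text-guided module, and two auxiliary depth estimation branches. The image fusion branch extracts multi-channel complementary information through a diffusion model and is equipped with the text-guided module. The fusion branch uses CLIP to extract semantic information and parameters from depth-enriched image descriptions, guiding the diffusion model to extract multi-channel features and generate fused images. These fused images are then fed into the depth estimation branches to compute a depth-driven loss, which optimizes the image fusion network. The framework aims to integrate vision-language and depth information to generate color fused images directly from multimodal inputs.|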
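|**2025-03-11**|[Modeling Stock Return Distributions and Pricing Options](http://arxiv.org/abs/2503.08666)|null|This paper provides evidence that truncated stock returns can be modeled by a special type of continuous mixture of normal distributions, the so-called $q$-Gaussians. The negative binomial distribution can model the counts of extreme returns. A generalized jump-diffusion model is proposed, and an explicit option pricing formula is obtained.|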
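|**2025-03-11**|[REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder](http://arxiv.org/abs/2503.08665)|null|We present a new perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of the input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion allows substantially higher compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that uses a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Within it, we develop a dedicated latent conditioning module that conditions the DiT decoder on the encoded video latent embedding. Our experiments show that our approach achieves superior encoding-decoding performance compared with state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we report results from video embedders that reach a temporal compression ratio of up to 32x (8x higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, significantly improving the training and inference efficiency of latent diffusion models.|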
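|**2025-03-11**|[MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention](http://arxiv.org/abs/2503.08664)|**[link](https://github.com/johannwyh/meat)**|Multiview diffusion models have achieved considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver satisfactory results, largely because of the challenge of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024x1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention uses rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this, we design a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we offer valuable insights into applying multiview human motion videos to diffusion training, addressing the long-standing issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods.|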
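|**2025-03-11**|[MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input](http://arxiv.org/abs/2503.08650)|null|Recent advances in virtual try-on (VITON) have significantly improved image realism and garment detail preservation, driven by powerful text-to-image (T2I) diffusion models. However, as shown in Fig. 1(a) of the paper, existing methods often rely on user-provided masks, which introduces complexity and degrades performance because of imperfect inputs. To address this, we propose a Mask-Free VITON (MF-VITON) framework that achieves realistic VITON using only a single person image and a target garment, eliminating the need for auxiliary masks. Our approach introduces a novel two-stage pipeline: (1) We use existing mask-based VITON models to synthesize a high-quality dataset. This dataset contains diverse, realistic pairs of person images and corresponding garments, augmented with varied backgrounds to mimic real-world scenarios. (2) The pretrained mask-based model is fine-tuned on the generated dataset, enabling garment transfer without mask dependencies. This stage simplifies the input requirements while preserving garment texture and shape fidelity. Our framework achieves state-of-the-art (SOTA) performance in garment transfer accuracy and visual realism. Notably, the proposed mask-free model significantly outperforms existing mask-based approaches, setting a new benchmark and showing a substantial lead over previous approaches. For more details, visit our project page: https://zhenchenwan.github.io/MF-VITON/.|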
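|**2025-03-11**|[Rethinking Diffusion Model in High Dimension](http://arxiv.org/abs/2503.08643)|null|The curse of dimensionality is an unavoidable challenge in statistical probability models, yet diffusion models seem to overcome this limitation, achieving impressive results in high-dimensional data generation. Diffusion models assume that they can learn the statistical properties of the underlying probability distribution, enabling sampling from this distribution to generate realistic samples. But is this really how they work? To address this question, this paper conducts a detailed analysis of the objective function and inference methods of diffusion models, leading to several important conclusions that help answer the question: 1) In high-dimensional sparse scenarios, the target of the objective function fitting degrades from a weighted sum of multiple samples to a single sample. 2) The mainstream inference methods can all be represented within a simple unified framework, without statistical concepts such as Markov chains and stochastic differential equations (SDEs). 3) Guided by this simple framework, more efficient inference methods can be discovered.|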
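|**2025-03-11**|[LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization](http://arxiv.org/abs/2503.08619)|**[link](https://github.com/xianfengwu01/lightgen)**|Recent advances in text-to-image generation have relied primarily on extensive datasets and parameter-heavy architectures. These requirements severely limit accessibility for researchers and practitioners who lack substantial computational resources. In this paper, we introduce LightGen, an efficient training paradigm for image generation models that uses knowledge distillation (KD) and Direct Preference Optimization (DPO). Drawing inspiration from the success of data KD techniques widely adopted in multimodal large language models (MLLMs), LightGen distills knowledge from state-of-the-art (SOTA) text-to-image models into a compact Masked Autoregressive (MAR) architecture with only 0.7B parameters. Using a compact synthetic dataset of just 2M high-quality images generated from varied captions, we show that data diversity matters significantly more than data volume in determining model performance. This strategy dramatically reduces computational demands and cuts pretraining time from potentially thousands of GPU-days to merely 88 GPU-days. Furthermore, to address the inherent shortcomings of synthetic data, especially poor high-frequency details and spatial inaccuracies, we integrate DPO to refine image fidelity and positional accuracy. Comprehensive experiments confirm that LightGen achieves image generation quality comparable to SOTA models while significantly reducing computational resources and expanding accessibility for resource-constrained environments. Code is available at https://github.com/XianfengWu01/LightGen.|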
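|**2025-03-07**|[AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data](http://arxiv.org/abs/2503.05665)|**[link](https://github.com/zengqunzhao/aim-fair)**|Recent advances in generative models have sparked research on improving model fairness with AI-generated data. However, existing methods often face limitations in the diversity and quality of synthetic data, which compromises both fairness and overall model accuracy. Moreover, many approaches rely on the availability of demographic group labels, which are often costly to annotate. This paper proposes AIM-Fair, which aims to overcome these limitations and harness the potential of cutting-edge generative models to promote algorithmic fairness. We investigate a fine-tuning paradigm that starts from a biased model initially trained on real-world data without demographic annotations. This model is then fine-tuned on unbiased synthetic data generated by a state-of-the-art diffusion model to improve its fairness. Two key challenges are identified in this fine-tuning paradigm: 1) the low quality of synthetic data, which can occur even with advanced generative models, and 2) the domain and bias gap between real and synthetic data. To address the limitation of synthetic data quality, we propose Contextual Synthetic Data Generation (CSDG), which generates data with a text-to-image diffusion model (T2I) using prompts produced by a context-aware LLM, ensuring both data diversity and control over bias in the synthetic data. To resolve domain and bias shifts, we introduce a novel selective fine-tuning scheme in which only the model parameters that are more sensitive to bias and less sensitive to domain shift are updated. Experiments on the CelebA and UTKFace datasets show that AIM-Fair improves model fairness while maintaining utility, outperforming both fully and partially fine-tuned approaches to model fairness.|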
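|**2025-03-07**|[TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models](http://arxiv.org/abs/2503.05638)|null|We present TrajectoryCrafter, a novel approach for redirecting camera trajectories in monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that integrates both point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of relying on scarce multi-view videos, we curate a hybrid training dataset that combines web-scale monocular videos with static multi-view datasets through our innovative double-reprojection strategy, significantly fostering robust generalization across diverse scenes. Extensive evaluations on multi-view and large-scale monocular videos demonstrate the superior performance of our method.|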
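|**2025-03-07**|[A functional approach for curve alignment and shape analysis](http://arxiv.org/abs/2503.05632)|null|The shape $\tilde{\mathbf{X}}$ of a random planar curve $\mathbf{X}$ is what remains after removing deformation effects such as scaling, rotation, translation, and parametrization. Previous studies in statistical shape analysis have mainly focused on analyzing $\tilde{\bf X}$ through discrete observations of the curve ${\bf X}$. While this approach has some computational advantages, it overlooks the continuous nature of both ${\bf X}$ and its shape $\tilde{\bf X}$. It also ignores potential dependencies among the deformation variables and their effect on $\tilde{\bf X}$, which may result in information loss and reduced interpretability. In this paper, we introduce a new framework for analyzing $\bf X$ in the context of Functional Data Analysis (FDA). Basis expansion techniques are used to derive analytic solutions for estimating the deformation variables, such as rotation and reparametrization, thereby achieving shape alignment. A generative model of $\bf X$ is then investigated using a joint principal component analysis approach. Numerical experiments on simulated data and the MPEG-7 database show that our new approach successfully identifies the deformation parameters and captures the underlying distribution of planar curves in situations where traditional FDA methods fail to do so.|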
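|**2025-03-07**|[Diffusion Models for Cayley Graphs](http://arxiv.org/abs/2503.05558)|null|We review the problem of finding paths in Cayley graphs of groups and group actions, using the Rubik's cube as an example, and we list several examples of greater mathematical significance. We then show how to formulate these problems in the framework of diffusion models. The exploration of the graph is carried out by the forward process, while finding the target nodes is done by the inverse (backward) process. This systematizes the discussion and suggests many generalizations. To improve exploration, we propose a "reversed score" ansatz, which substantially improves over previous comparable algorithms.|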
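|**2025-03-07**|[Accelerating db-A\* for Kinodynamic Motion Planning Using Diffusion](http://arxiv.org/abs/2503.05539)|null|We present a novel approach for generating motion primitives for kinodynamic motion planning using diffusion models. The motions generated by our approach are adapted to each problem instance by exploiting problem-specific parameters, allowing better-quality solutions to be found faster. The diffusion models used in our approach are trained on randomly cut solution trajectories. These trajectories are created by solving randomly generated problem instances with a kinodynamic motion planner. Experimental results show that, across various robot dynamics such as a second-order unicycle or a car with trailer, both computation time and solution quality improve significantly, by 30%.|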
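|**2025-03-07**|[Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations](http://arxiv.org/abs/2503.05522)|**[link](https://github.com/erenerogullari/cav-disentanglement)**|Concept Activation Vectors (CAVs) are widely used to model human-understandable concepts as directions within the latent space of neural networks. They are trained by identifying directions from the activations of concept samples to those of non-concept samples. However, this method often produces similar, non-orthogonal directions for correlated concepts, such as "beard" and "necktie" in the CelebA dataset, which frequently co-occur in images of men. This entanglement complicates the interpretation of concepts in isolation and can lead to undesired effects in CAV applications, such as activation steering. To address this issue, we introduce a post-hoc concept disentanglement method that employs a non-orthogonality loss, facilitating the identification of orthogonal concept directions while preserving directional correctness. We evaluate our approach with real-world and controlled correlated concepts in CelebA and the synthetic FunnyBirds dataset with VGG16 and ResNet18 architectures. We further demonstrate the superiority of orthogonalized concept representations in activation steering tasks, allowing (1) the insertion of isolated concepts into input images through generative models and (2) the removal of concepts for effective shortcut suppression with reduced impact on correlated concepts compared with baseline CAVs.|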
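|**2025-03-07**|[Noise-Robust Radio Frequency Fingerprint Identification Using Denoise Diffusion Model](http://arxiv.org/abs/2503.05514)|null|Securing Internet of Things (IoT) devices is increasingly challenging because of their limited computational and energy resources. Radio Frequency Fingerprint Identification (RFFI) is a promising authentication technique that identifies wireless devices through their hardware impairments. RFFI performance degrades significantly in low signal-to-noise ratio (SNR) scenarios, because the subtle hardware features are easily swamped by noise. In this paper, we use a diffusion model to effectively restore the RF fingerprint under low-SNR conditions. Specifically, we train a powerful noise predictor and tailor a noise removal algorithm to effectively reduce the noise level in the received signal and restore the device fingerprints. We use Wi-Fi as a case study and build a testbed with 6 commercial off-the-shelf Wi-Fi dongles and a USRP N210 software-defined radio (SDR) platform. We conduct experimental evaluations in various SNR scenarios. The results show that the proposed algorithm can improve classification accuracy by up to 34.9%.|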
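|**2025-03-07**|[Statistical Deficiency for Task Inclusion Estimation](http://arxiv.org/abs/2503.05491)|null|Tasks are central to machine learning because they are the most natural objects for assessing the capabilities of current models. The current trend is to build general models that can address any task. Although transfer learning and multitask learning try to exploit the underlying task space, no well-founded tools are available to study its structure. This study proposes a theoretically grounded setup to define the notion of a task and to compute the **inclusion** between two tasks from the perspective of statistical deficiency. We propose information sufficiency as a tractable proxy for estimating the degree of inclusion between tasks, show its soundness on synthetic data, and use it to empirically reconstruct the classic NLP pipeline.|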
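|**2025-03-07**|[De Novo Design of Protein-Binding Peptides by Quantum Computing](http://arxiv.org/abs/2503.05458)|null|Computational de novo design can substantially reduce the cost and time of drug development. In particular, a key advantage of bottom-up, physics-based approaches is that, unlike generative models, they do not depend on training datasets. However, they require simultaneous exploration of chemical space and conformational space. In this work, we tackle this formidable challenge with a quantum annealer. Focusing on the de novo design of peptides, we introduce a multi-scale framework that integrates classical and quantum computing for predictions at atomic resolution. We assess this scheme by designing binders for several protein targets. The D-Wave quantum annealer rapidly generates a chemically diverse set of binders whose primary structures and binding poses correlate well with experimental results. These results show that, even at its current early stage, quantum technology can already empower physics-based drug design.|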
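|**2025-03-07**|[PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?](http://arxiv.org/abs/2503.05333)|null|The image-to-image translation abilities of generative learning models have recently made significant progress in estimating complex (steered) mappings between image distributions. While appearance-based tasks such as image inpainting or style transfer have been studied at length, we propose to investigate the potential of generative models for physical simulation. We provide a dataset of 300k image pairs and baseline evaluations for three different physical simulation tasks, and propose a benchmark to investigate the following research questions: i) Are generative models able to learn complex physical relations from input-output image pairs? ii) What speedups can be achieved by replacing differential-equation-based simulations? While baseline evaluations of different current models show the potential for high speedups (ii), these results also show strong limitations with respect to physical correctness (i). This underlines the need for new methods that enforce physical correctness. Data, baseline models, and evaluation code: http://www.physics-gen.org.|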
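|**2025-03-06**|[Compositional World Knowledge leads to High Utility Synthetic data](http://arxiv.org/abs/2503.04687)|null|Machine learning systems struggle with robustness under subpopulation shifts. This problem becomes especially pronounced when only a subset of attribute combinations is observed during training, a severe form of subpopulation shift referred to as compositional shift. To address this problem, we ask the following question: can we improve robustness by training on synthetic data that spans all possible attribute combinations? We first show that training conditional diffusion models on limited data leads to an incorrect underlying distribution. Synthetic data sampled from such models will therefore yield unfaithful samples and will not improve the performance of downstream machine learning systems. To address this, we propose CoInD, which reflects the compositional nature of the world by enforcing conditional independence through minimizing Fisher's divergence between joint and marginal distributions. We show that the synthetic data generated by CoInD is faithful, and that this translates to state-of-the-art worst-group accuracy on compositional shift tasks on CelebA.|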
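|**2025-03-06**|[What Are You Doing? A Closer Look at Controllable Human Video Generation](http://arxiv.org/abs/2503.04666)|null|High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset for evaluating human generation. Humans can perform a wide variety of actions and interactions, but existing datasets, such as TikTok and TED-Talks, lack the diversity and complexity needed to fully capture the capabilities of video generation models. To close this gap, we introduce "What Are You Doing?" (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1,544 captioned videos that have been carefully collected and annotated with 56 fine-grained categories. These categories allow us to systematically measure 9 aspects of human generation, including actions, interactions, and motion. We also propose and validate automatic metrics that use our annotations and better capture human evaluations. Equipped with our dataset and metrics, we perform in-depth analyses of seven state-of-the-art models for controllable image-to-video generation, showing how WYD provides new insights into the capabilities of these models. We release our data and code at https://github.com/google-deepmind/wyd-benchmark to drive further progress in human video generation modeling.|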
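|**2025-03-06**|[Simulating the Real World: A Unified Survey of Multimodal Generative Models](http://arxiv.org/abs/2503.04641)|**[link](https://github.com/aleeehu/world-simulator)**|Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and more meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains and overlook their interdependencies. In addition, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified review of multimodal generative models that examines the progression of data dimensionality in real-world simulation. Specifically, the survey starts from 2D generation (appearance), then moves to video (appearance + dynamics) and 3D generation (appearance + geometry), and culminates in 4D generation, which integrates all dimensions. To our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D, and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics, and future directions, and offer insights for newcomers. This survey aims to serve as a bridge for advancing the study of multimodal generative models and real-world simulation within a unified framework.|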
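|**2025-03-06**|[3HANDS Dataset: Learning from Humans for Generating Naturalistic Handovers with Supernumerary Robotic Limbs](http://arxiv.org/abs/2503.04635)|null|Supernumerary robotic limbs (SRLs) are robotic structures integrated closely with the user's body, which augment human physical capabilities and require seamless, naturalistic human-machine interaction. For effective assistance in physical tasks, it is essential that SRLs can hand over objects to humans. Yet designing heuristic-based policies for robots is time-consuming, difficult to generalize across tasks, and results in less human-like motion. When trained with suitable datasets, generative models are a powerful alternative for creating naturalistic handover motions. We introduce 3HANDS, a novel dataset of object handover interactions between a participant performing a daily activity and another participant enacting a hip-mounted SRL in a naturalistic manner. 3HANDS captures the unique characteristics of SRL interactions: operating in intimate personal space, with asymmetric object origins, implicit motion synchronization, and the user's engagement in a primary task during the handover. To demonstrate the effectiveness of our dataset, we present three models: one that generates naturalistic handover trajectories, another that determines the appropriate handover endpoints, and a third that predicts the moment to initiate a handover. In a user study (N=10), we compare handover interactions performed with our method against a baseline. The findings show that our method was perceived as significantly more natural, less physically demanding, and more comfortable.|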
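|**2025-03-06**|[The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation](http://arxiv.org/abs/2503.04606)|null|Recent advances in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that combines the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a compression ratio of about 14,000x; (2) a language model that generates semantic tokens with high-level semantic relationships; and (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source model Hunyuan Video (13B) and other commercial models such as Sora, Keling, and Hailuo. Our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.|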
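|**2025-03-06**|[Learning Object Placement Programs for Indoor Scene Synthesis with Iterative Self Training](http://arxiv.org/abs/2503.04496)|null|Data-driven, autoregressive indoor scene synthesis systems generate indoor scenes automatically by suggesting and then placing objects one at a time. Empirical observations show that current systems tend to produce incomplete next-object location distributions. We introduce a system that addresses this problem. We design a Domain-Specific Language (DSL) that specifies functional constraints. Programs in our language take as input a partial scene and an object to place. Upon execution, they predict possible object placements. We design a generative model that writes these programs automatically. Available 3D scene datasets do not contain programs to train on, so we build on previous work in unsupervised program induction to introduce a new program bootstrapping algorithm. To quantify our empirical observations, we introduce a new evaluation procedure that captures how well a system models per-object location distributions. We ask human annotators to label all the possible places an object can go in a scene, and show that our system produces per-object location distributions that are more consistent with human annotators. Our system also generates indoor scenes of comparable quality to previous systems, and while previous systems degrade in performance when training data is sparse, our system does not degrade to the same extent.|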
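|**2025-03-06**|[InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference](http://arxiv.org/abs/2503.04483)|null|Inferring Gene Regulatory Networks (GRNs) from gene expression data is crucial for understanding biological processes. While supervised models are reported to achieve high performance on this task, they rely on costly ground truth (GT) labels and risk learning gene-specific biases, such as class imbalances of GT interactions, rather than true regulatory mechanisms. To address these issues, we introduce InfoSEM, an unsupervised generative model that uses textual gene embeddings as informative priors, improving GRN inference without GT labels. InfoSEM can also integrate GT labels as an additional prior when they are available, avoiding biases and further improving performance. In addition, we propose a biologically motivated benchmarking framework that better reflects real-world applications, such as biomarker discovery, and reveals the learned biases of existing supervised methods. InfoSEM outperforms existing models by 38.5% across four datasets using textual embedding priors, and further boosts performance by 11.1% when integrating labeled data as priors.|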
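|**2025-03-06**|[Semantic Alignment of Unimodal Medical Text and Vision Representations](http://arxiv.org/abs/2503.04478)|null|General-purpose AI models, particularly those designed for text and vision, show impressive versatility across a wide range of deep learning tasks. However, they often underperform in specialized domains such as medical imaging, where domain-specific solutions or alternative knowledge transfer approaches are typically required. Recent studies have noted that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not occur naturally. Building on this insight, it has been shown that applying a simple transformation (at most affine), estimated from a subset of semantically corresponding samples known as anchors, enables model stitching across different training paradigms, architectures, and modalities. In this paper, we explore how semantic alignment (estimating transformations between anchors) can bridge general-purpose AI with specialized medical knowledge. Using multiple public chest X-ray datasets, we show that model stitching across model architectures allows general models to integrate domain-specific knowledge without additional training, leading to improved performance on medical tasks. We also introduce a new zero-shot classification approach for unimodal vision encoders that exploits semantic alignment across modalities. Our results show that our method not only outperforms general multimodal models but also approaches the performance levels of fully trained, medical-specific multimodal solutions.|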
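|**2025-03-06**|[Can Large Language Models Predict Antimicrobial Resistance Gene?](http://arxiv.org/abs/2503.04413)|null|This study shows that generative large language models can be used more flexibly for DNA sequence analysis and classification tasks than traditional Transformer encoder-based models. Although recent encoder-based models such as DNABERT and Nucleotide Transformer have shown significant performance in DNA sequence classification, Transformer decoder-based generative models have not yet been extensively explored in this field. This study evaluates how effectively generative large language models handle DNA sequences with various labels and analyzes how performance changes when additional textual information is provided. Experiments were conducted on antimicrobial resistance genes, and the results show that generative large language models can offer comparable or potentially better predictions, demonstrating flexibility and accuracy when incorporating both sequence and textual information. The code and data used in this work are available at the following GitHub repository: https://github.com/biocomgit/llm4dna.|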
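|**2025-03-06**|[LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding](http://arxiv.org/abs/2503.04359)|null|Current advanced long-context language models (LCLMs) offer great potential for real-world software engineering applications. However, progress in this critical domain remains hampered by a fundamental limitation: the absence of a rigorous evaluation framework for long code understanding. To bridge this gap, we propose LONGCODEU, a long code understanding benchmark that evaluates the long code understanding ability LCLMs need for practical applications from four aspects (8 tasks): code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long code documentation understanding. We evaluate 9 popular LCLMs on LONGCODEU (6 general models and 3 code models). Our experimental results reveal key limitations in the long code understanding capabilities of current LCLMs. In particular, the performance of LCLMs drops dramatically when the long code length exceeds 32K, falling far short of their claimed 128K-1M context windows. Among the four aspects, inter-code unit relation understanding is the most challenging for LCLMs. Our study provides valuable insights for optimizing LCLMs and driving advances in software engineering.|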
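|**2025-03-04**|[ARINAR: Bi-Level Autoregressive Feature-by-Feature Generative Models](http://arxiv.org/abs/2503.02883)|**[link](https://github.com/qinyu-allen-zhao/arinar)**|Existing autoregressive (AR) image generative models use a token-by-token generation scheme. That is, they predict a per-token probability distribution and sample the next token from that distribution. The main challenge is how to model the complex distribution of high-dimensional tokens. Previous methods are either too simplistic to fit the distribution or result in slow generation. Instead of fitting the distribution of whole tokens, we explore using an AR model to generate each token in a feature-by-feature way, that is, taking the generated features as input and generating the next feature. Based on that, we propose ARINAR (AR-in-AR), a bi-level AR model. The outer AR layer takes previous tokens as input and predicts a condition vector z for the next token. The inner layer, conditioned on z, generates the features of the next token autoregressively. In this way, the inner layer only needs to model the distribution of a single feature, for example, with a simple Gaussian Mixture Model. On the ImageNet 256x256 image generation benchmark, ARINAR-B with 213M parameters achieves an FID of 2.75, which is comparable to the state-of-the-art MAR-B model (FID=2.31), while being five times faster than the latter.|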
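|**2025-03-04**|[SeqFusion: Sequential Fusion of Pre-Trained Models for Zero-Shot Time-Series Forecasting](http://arxiv.org/abs/2503.02836)|**[link](https://github.com/Tingji2419/SeqFusion)**|Unlike traditional time-series forecasting methods that require extensive in-task data for training, zero-shot forecasting predicts future values directly from a target time series without additional training data. Current zero-shot approaches rely primarily on pretrained generalized models, whose performance often depends on the variety and relevance of the pretraining data, which can raise privacy concerns. In this work, instead of collecting diverse pretraining data, we propose SeqFusion, a novel framework that collects and fuses diverse pretrained models (PTMs) sequentially for zero-shot forecasting. Based on the specific temporal characteristics of the target time series, SeqFusion selects the most suitable PTMs from a batch of pre-collected PTMs, performs sequential predictions, and fuses all the predictions while using minimal data to protect privacy. Each of these PTMs specializes in different temporal patterns and forecasting tasks, allowing SeqFusion to select by measuring distances in a shared representation space between the target time series and each PTM. Experiments show that SeqFusion achieves competitive accuracy in zero-shot forecasting compared with state-of-the-art methods.|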
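|**2025-03-04**|[A Multimodal Symphony: Integrating Taste and Sound through Generative AI](http://arxiv.org/abs/2503.02823)|null|In recent decades, neuroscientific and psychological research has traced direct relationships between taste and auditory perception. Building on this foundational research, this article explores multimodal generative models capable of converting taste information into music. We provide a brief review of the state of the art in this field, highlighting key findings and methodologies. We present an experiment in which a fine-tuned version of a generative music model (MusicGEN) is used to generate music based on detailed taste descriptions provided for each musical piece. The results are promising: according to the participants' (n=111) evaluation, the fine-tuned model produces music that reflects the input taste descriptions more coherently than the non-fine-tuned model. This study represents a significant step toward understanding and developing embodied interactions between AI, sound, and taste, opening new possibilities in the field of generative AI. We release our dataset, code, and pretrained model at https://osf.io/xs5jy/.|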
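|**2025-03-04**|[Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts](http://arxiv.org/abs/2503.02819)|**[link](https://github.com/martaskrt/fkc-diffusion)**|While score-based generative models are the model of choice across diverse domains, there are limited tools for controlling inference-time behavior in a principled manner, for example for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, which necessitates additional "corrector" steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme, which we call Feynman-Kac Correctors (FKCs), based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that use inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation. Our code is available at https://github.com/martaskrt/fkc-diffusion.|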
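|**2025-03-04**|[Generating Reliable Initial Velocity Models for Full-waveform Inversion with Well and Structural Constraints](http://arxiv.org/abs/2503.02815)|null|Full-waveform inversion (FWI) plays an important role in velocity modeling because of its high-resolution advantages. However, its highly nonlinear nature leads to many local minima, which is known as the cycle-skipping problem. Effectively addressing cycle skipping is therefore crucial to the success of FWI. Well-log data contain rich information about subsurface medium parameters, providing inherent advantages for velocity modeling. Traditional well-log interpolation methods for building velocity models have limited accuracy and adapt poorly to complex geological structures. This study introduces a well-log interpolation algorithm based on a generative diffusion model (GDM) to generate initial models for FWI and thereby address the cycle-skipping problem. Existing methods based on convolutional neural networks (CNNs) struggle with complex feature distributions and lack effective uncertainty quantification, which limits the reliability of their outputs. The proposed GDM-based approach overcomes these challenges by providing geologically consistent well-log interpolation together with uncertainty assessment. Numerical experiments show that the method produces accurate and reliable initial models, improving FWI performance and mitigating cycle skipping.|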
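|**2025-03-04**|[Zero-Shot Complex Question-Answering on Long Scientific Documents](http://arxiv.org/abs/2503.02695)|**[link](https://github.com/wendywangwwt/zero-shot-complex-question-answering-on-long-scientific-documents)**|With the rapid development of Transformer-based language models, reading comprehension tasks on short documents and simple questions have been largely addressed. Long documents, particularly scientific documents densely packed with knowledge discovered and developed by humans, remain relatively unexplored. These documents often come with a set of complex and more realistic questions, adding to their complexity. We present a zero-shot pipeline framework that enables social science researchers to perform question-answering tasks that are complex yet of predetermined question formats on full-length research papers, without requiring machine learning expertise. Our approach integrates pretrained language models to handle challenging scenarios including multi-span extraction, multi-hop reasoning, and long-answer generation. Evaluating on MLPsych, a novel dataset of social psychology papers with annotated complex questions, we show that our framework achieves strong performance through a combination of extractive and generative models. This work advances document understanding capabilities for social sciences while providing practical tools for researchers.|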
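|**2025-03-04**|[Generative Modeling of Microweather Wind Velocities for Urban Air Mobility](http://arxiv.org/abs/2503.02690)|**[link](https://github.com/nasa/wind-generative-modeling)**|Motivated by the pursuit of safe, reliable, and weather-tolerant urban air mobility (UAM) solutions, this work proposes a generative modeling approach for characterizing microweather wind velocities. Microweather, or the weather conditions in highly localized areas, is particularly complex in urban environments because of the chaotic and turbulent nature of wind flows. Furthermore, traditional means of assessing local wind fields are typically not feasible for UAM applications: 1) in situ measurements that rely on permanent wind profiling systems in operational airspace are not practical; 2) physics-based models that simulate fluid dynamics at a sufficiently high resolution are not computationally tractable; and 3) data-driven modeling approaches that are largely deterministic ignore the inherent variability in turbulent flows that dictates UAM reliability. Improved predictive capabilities are therefore needed to help mitigate the unique operational safety risks that microweather winds pose for smaller, lighter UAM aircraft. This work aims to model microweather wind velocities in a manner that is computationally efficient, captures random variability, and requires only temporary, rather than permanent, field measurement campaigns. Inspired by recent breakthroughs in conditional generative AI such as text-to-image generation, the proposed approach learns a probabilistic macro-to-microweather mapping between regional weather forecasts and measured local wind velocities using generative modeling (denoising diffusion probabilistic models, flow matching, and Gaussian mixture models). A simple proof of concept was implemented using a dataset of local (micro) measurements from a Sonic Detection and Ranging (SoDAR) wind profiler along with coincident (macro) forecast data from a nearby weather station.|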
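|**2025-03-04**|[YARE-GAN: Yet Another Resting State EEG-GAN](http://arxiv.org/abs/2503.02636)|**[link](https://github.com/Yeganehfrh/EEGModalNet)**|Generative Adversarial Networks (GANs) have shown promise in synthesizing realistic neural data, yet their potential for unsupervised representation learning in resting-state EEG remains underexplored. In this study, we implement a Wasserstein GAN with Gradient Penalty (WGAN-GP) to generate multi-channel resting-state EEG data and assess the quality of the synthesized signals through both visual and feature-based evaluations. Our results indicate that the model effectively captures the statistical and spectral characteristics of real EEG data, although challenges remain in replicating high-frequency oscillations in the frontal region. Additionally, we show that the representations learned by the critic can be fine-tuned for age group classification, achieving an out-of-sample accuracy that is better than a shuffled-label baseline. These findings suggest that generative models can serve not only as EEG data generators but also as unsupervised feature extractors, reducing the need for manual feature engineering. This study highlights the potential of GAN-based unsupervised learning for EEG analysis, suggesting avenues for more efficient deep learning applications in neuroscience.|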
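|**2025-03-04**|[StageDesigner: Artistic Stage Generation for Scenography via Theater Scripts](http://arxiv.org/abs/2503.02595)|null|In this work, we introduce StageDesigner, the first comprehensive framework for artistic stage generation that uses large language models combined with layout-controlled diffusion models. Given the professional requirements of stage scenography, StageDesigner simulates the workflows of seasoned artists to generate immersive 3D stage scenes. Specifically, our approach is divided into three primary modules: Script Analysis, which extracts thematic and spatial cues from input scripts; Foreground Generation, which constructs and arranges essential 3D objects; and Background Generation, which produces a background that is harmonious with the narrative atmosphere and maintains spatial coherence by managing occlusions between foreground and background elements. Furthermore, we introduce the StagePro-V1 dataset, a dedicated dataset with 276 unique stage scenes spanning different historical styles and annotated with scripts, images, and detailed 3D layouts, specifically tailored for this task. Finally, evaluations using both standard and newly proposed metrics, along with extensive user studies, demonstrate the effectiveness of StageDesigner. Project page: https://deadsmither5.github.io/2025/01/03/StageDesigner/|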
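|**2025-03-04**|[TS-CGNet: Temporal-Spatial Fusion Meets Centerline-Guided Diffusion for BEV Mapping](http://arxiv.org/abs/2503.02578)|**[link](https://github.com/krabs-h/ts-cgnet)**|Bird's Eye View (BEV) perception technology is crucial for autonomous driving, as it generates top-down 2D maps for environment perception, navigation, and decision-making. Nevertheless, most current BEV map generation studies that focus on visual map generation lack depth-aware reasoning capabilities. They are limited in how well they manage occlusions and handle complex environments, and their perception performance declines notably under adverse weather or low-light conditions. This paper therefore proposes TS-CGNet, which combines temporal-spatial fusion with centerline-guided diffusion. This visual framework, based on prior knowledge, is designed to integrate into any existing network for building BEV maps. Specifically, the framework is decoupled into three parts: a local mapping system that generates semantic maps from purely visual information; the Temporal-Spatial Aligner Module (TSAM), which integrates historical information into map generation by applying transformation matrices; and the Centerline-Guided Diffusion Model (CGDM), a prediction module based on the diffusion model. CGDM incorporates centerline information through spatial-attention mechanisms to enhance semantic segmentation reconstruction. We construct BEV semantic segmentation maps with our method on the public nuScenes dataset and on robustness benchmarks under various corruptions. Our method improves performance by 1.90%, 1.73%, and 2.87% on BEV HD mapping tasks for perceived ranges of 60x30m, 120x60m, and 240x60m, respectively. TS-CGNet achieves an improvement of 1.92% on the BEV semantic mapping task for a perceived range of 100x100m. Moreover, for a perceived range of 240x60m, TS-CGNet achieves an average improvement of 2.92% in detection accuracy under varying weather conditions and sensor interference. The source code will be publicly available at https://github.com/krabs-H/TS-CGNet.|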
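|**2025-02-28**|[How far can we go with ImageNet for Text-to-Image generation?](http://arxiv.org/abs/2502.21318)|null|Recent text-to-image (T2I) generation models have achieved remarkable results by training on billion-scale datasets, following a "bigger is better" paradigm that prioritizes data quantity over quality. We challenge this established paradigm by showing that strategic data augmentation of small, well-curated datasets can match or outperform models trained on massive web-scraped collections. Using only ImageNet enhanced with well-designed text and image augmentations, we achieve a +2 overall score over SD-XL on GenEval and +5 on DPGBench, while using just 1/10 of the parameters and 1/1000 of the training images. Our results suggest that strategic data augmentation, rather than massive datasets, could offer a more sustainable path forward for T2I generation.|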
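|**2025-02-28**|[Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos](http://arxiv.org/abs/2502.21314)|null|Text-to-video generation has made significant progress with the advent of diffusion models, yet existing methods remain limited by dataset quality and computational resources. To address these limitations, this paper presents a comprehensive approach that improves both data curation and model design. We introduce CFC-VIDS-1M, a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. The pipeline first evaluates video quality across multiple dimensions, then performs a fine-grained stage that uses vision-language models to enhance text-video alignment and semantic richness. Building on the curated dataset's emphasis on visual quality and temporal coherence, we develop RACCOON, a Transformer-based architecture with decoupled spatial-temporal attention mechanisms. The model is trained through a progressive four-stage strategy designed to handle the complexities of video generation efficiently. Extensive experiments show that our integrated approach of high-quality data curation and efficient training strategy generates visually appealing and temporally coherent videos while maintaining computational efficiency. We will release our dataset, code, and models.|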
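|**2025-02-28**|[Does Generation Require Memorization? Creative Diffusion Models using Ambient Diffusion](http://arxiv.org/abs/2502.21278)|null|There is strong empirical evidence that the state-of-the-art diffusion modeling paradigm leads to models that memorize the training set, especially when the training set is small. Prior methods for mitigating memorization often lead to a decrease in image quality. Is it possible to obtain strong and creative generative models, that is, models that achieve high generation quality and low memorization at the same time? Despite the current pessimistic landscape of results, we make significant progress in pushing the trade-off between fidelity and memorization. We first provide theoretical evidence that memorization in diffusion models is only necessary for denoising problems at low noise scales (usually used in generating high-frequency details). Using this theoretical insight, we propose a simple, principled method to train diffusion models using noisy data at large noise scales. We show that our method significantly reduces memorization without decreasing image quality, for both text-conditional and unconditional models and across a variety of data availability settings.|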
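|**2025-02-28**|[Dynamic Markov Blanket Detection for Macroscopic Physics Discovery](http://arxiv.org/abs/2502.21217)|**[link](https://github.com/bayesianempirimancer/pyDMBD)**|The free energy principle (FEP), along with the associated constructs of Markov blankets and ontological potentials, has recently been presented as the core component of a generalized modeling method capable of mathematically describing arbitrary objects that persist in random dynamical systems; that is, a mathematical theory of "every" "thing". Here, we use the FEP to develop a mathematical physics approach to identifying objects, object types, and the macroscopic, object-type-specific rules that govern their behavior. We take a generative modeling approach and use variational Bayesian expectation maximization to develop a dynamic Markov blanket detection algorithm that can identify and classify macroscopic objects given partial observation of microscopic dynamics. This unsupervised algorithm uses Bayesian attention to explicitly label observable microscopic elements according to their current role in a given system, as either the internal or boundary elements of a given macroscopic object; and it identifies the macroscopic physical laws that govern how the object interacts with its environment. Because these labels are dynamic, or evolve over time, the algorithm can identify complex objects that travel through fixed media or exchange matter with their environment. This approach leads directly to a flexible class of structured, unsupervised algorithms that sensibly partition complex many-particle or many-component systems into collections of interacting macroscopic subsystems, namely "objects" or "things". We derive several examples of this kind of macroscopic physics discovery algorithm and demonstrate its utility with simple numerical experiments, in which the algorithm correctly labels the components of Newton's cradle, a burning fuse, the Lorenz attractor, and a simulated cell.|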
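|**2025-02-28**|[SYN-LUNGS: Towards Simulating Lung Nodules with Anatomy-Informed Digital Twins for AI Training](http://arxiv.org/abs/2502.21187)|null|AI models for lung cancer screening are limited by data scarcity, which affects their generalizability and clinical applicability. Generative models address this issue but are constrained by the variability of their training data. We introduce SYN-LUNGS, a framework for generating high-quality 3D CT images with detailed annotations. SYN-LUNGS integrates XCAT3 phantoms for digital twin generation, X-Lesions for nodule simulation (varying size, location, and appearance), and DukeSim for simulating CT image formation with vendor and parameter variability. The dataset includes 3,072 nodule images from 1,044 simulated CT scans, with 512 lesions and 174 digital twins. Models trained on clinical plus simulated data outperform models trained on clinical data alone, achieving a 10% improvement in detection, a 2-9% improvement in segmentation and classification, and enhanced synthesis. By incorporating anatomy-informed simulations, SYN-LUNGS provides a scalable approach to AI model development, particularly for representing rare diseases and improving model reliability.|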
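|**2025-02-28**|[A Review on Generative AI For Text-To-Image and Image-To-Image Generation and Implications To Scientific Images](http://arxiv.org/abs/2502.21151)|null|This review surveys the state of the art in text-to-image and image-to-image generation within the scope of generative AI. We provide a comparative analysis of three prominent architectures: Variational Autoencoders, Generative Adversarial Networks, and Diffusion Models. For each, we elucidate the core concepts, architectural innovations, and practical strengths and limitations, particularly for applications in scientific image understanding. Finally, we discuss critical open challenges and potential future research directions in this rapidly evolving field.|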
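|**2025-02-28**|[Rare event modeling with self-regularized normalizing flows: what can we learn from a single failure?](http://arxiv.org/abs/2502.21110)|null|Increased deployment of autonomous systems in fields such as transportation and robotics has been accompanied by a corresponding increase in safety-critical failures. These failures are difficult to model and debug because of the relative lack of data: compared with tens of thousands of examples of normal operation, we may have only seconds of data leading up to the failure. This scarcity makes it challenging to train generative models of rare failure events, because existing methods risk either overfitting to noise in the limited failure dataset or underfitting because of an overly strong prior. We address this challenge with CalNF, or calibrated normalizing flows, a self-regularized framework for posterior learning from limited data. CalNF achieves state-of-the-art performance on data-limited failure modeling and inverse problems, and enables a first-of-a-kind case study into the root causes of the 2022 Southwest Airlines scheduling crisis.|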
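|**2025-02-28**|[Spatial Reasoning with Denoising Models](http://arxiv.org/abs/2502.21075)|null|We introduce Spatial Reasoning Models (SRMs), a framework for reasoning over sets of continuous variables via denoising generative models. SRMs infer continuous representations of a set of unobserved variables, given observations of the observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse to hallucination in the case of complex distributions. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. The SRM framework allows us to report key findings about the importance of sequentialization in generation, the associated order, and the sampling strategies during training. It demonstrates, for the first time, that the order of generation can be successfully predicted by the denoising network itself. Using these findings, we can increase the accuracy of specific reasoning tasks from <1% to >50%.|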
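|**2025-02-28**|[Synthesizing Individualized Aging Brains in Health and Disease with Generative Models and Parallel Transport](http://arxiv.org/abs/2502.21049)|**[link](https://github.com/fjr9516/inbrainsyn)**|Simulating prospective magnetic resonance imaging (MRI) scans from a given individual brain image is challenging, because it requires accounting for canonical changes in aging and/or disease progression while also considering the individual brain's current status and unique characteristics. While current deep generative models can produce high-resolution, anatomically accurate templates for population-wide studies, their ability to predict future aging trajectories for individuals remains limited, particularly in capturing subject-specific neuroanatomical variations over time. In this study, we introduce Individualized Brain Synthesis (InBrainSyn), a framework for synthesizing high-resolution, subject-specific longitudinal MRI scans that simulate neurodegeneration in both Alzheimer's disease (AD) and normal aging. InBrainSyn uses a parallel transport algorithm to adapt the population-level aging trajectories learned by a generative deep template network, enabling individualized aging synthesis. Because InBrainSyn uses diffeomorphic transformations to simulate aging, the synthesized images are topologically consistent with the original anatomy. We evaluate InBrainSyn both quantitatively and qualitatively on AD and healthy control cohorts from the Open Access Series of Imaging Studies version 3 dataset. Experiments show that InBrainSyn can also model neuroanatomical transitions between normal aging and AD. An evaluation on an external dataset supports its generalizability. Overall, with only a single baseline scan, InBrainSyn synthesizes realistic 3D spatiotemporal T1w MRI scans, producing personalized longitudinal aging trajectories. The code for InBrainSyn is available at: https://github.com/Fjr9516/InBrainSyn.|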
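|**2025-02-28**|[Generative Uncertainty in Diffusion Models](http://arxiv.org/abs/2502.20946)|null|Diffusion models have recently driven significant breakthroughs in generative modeling. While state-of-the-art models produce high-quality samples on average, individual samples can still be of low quality. Detecting such samples without human inspection remains a challenging task. To address this, we propose a Bayesian framework for estimating the generative uncertainty of synthetic samples. We outline how to make Bayesian inference practical for large, modern generative models and introduce a new semantic likelihood (evaluated in the latent space of a feature extractor) to address the challenges posed by high-dimensional sample spaces. Through our experiments, we show that the proposed generative uncertainty effectively identifies poor-quality samples and significantly outperforms existing uncertainty-based methods. Notably, our Bayesian framework can be applied post hoc to any pretrained diffusion or flow matching model (via the Laplace approximation), and we propose simple yet effective techniques to minimize its computational overhead during sampling.|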
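|**2025-02-27**|[InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions](http://arxiv.org/abs/2502.20390)|null|Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging because of intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy: perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts that provide direct supervision as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to go beyond mere demonstration replication and achieve higher-quality solutions. Our experiments show that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and integrates seamlessly with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.|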
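|**2025-02-27**|[Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation](http://arxiv.org/abs/2502.20388)|**[link](https://github.com/OliverRensu/xAR)**|Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, using flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20x faster inference. Meanwhile, xAR-H sets a new state of the art with an FID of 1.24, running 2.2x faster than the previous best-performing model without relying on vision foundation modules (for example, DINOv2) or advanced guidance interval sampling.|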
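|**2025-02-27**|[Tight Inversion: Image-Conditioned Inversion for Real Image Editing](http://arxiv.org/abs/2502.20376)|null|Text-to-image diffusion models offer powerful image editing capabilities. To edit real images, many methods rely on inverting the image into Gaussian noise. A common approach to image inversion is to gradually add noise to the image, where the noise is determined by reversing the sampling equation. This process has an inherent tradeoff between reconstruction and editability, which limits the editing of challenging images such as highly detailed ones. Recognizing that text-to-image model inversion depends on the text condition, this work explores the importance of the choice of condition. We show that a condition that precisely matches the input image significantly improves inversion quality. Based on our findings, we introduce Tight Inversion, an inversion method that uses the most precise condition possible: the input image itself. This tight condition narrows the distribution of the model's output and enhances both reconstruction and editability. Through extensive experiments that evaluate reconstruction accuracy as well as integration with various editing methods, we demonstrate the effectiveness of our approach when combined with existing inversion methods.||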
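|**2025-02-27**|[Constrained Generative Modeling with Manually Bridged Diffusion Models](http://arxiv.org/abs/2502.20371)|null|This paper describes a novel diffusion-based framework for generative modeling on constrained spaces. Specifically, we introduce manual bridges, a framework that expands the kinds of constraints that can be used to form so-called diffusion bridges. We develop a mechanism for combining multiple such constraints so that the resulting multiply-constrained model remains a manual bridge that respects all constraints. We also develop a mechanism for training a diffusion model that respects these multiple constraints while also adapting it to match a data distribution. We develop and extend theory demonstrating the mathematical validity of our mechanisms. Additionally, we demonstrate our mechanisms on constrained generative modeling tasks, highlighting a high-value application: modeling trajectory initializations for path planning and control in autonomous vehicles.||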
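|**2025-02-27**|[FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction](http://arxiv.org/abs/2502.20313)|**[link](https://github.com/jiaosiyu1999/FlexVAR)**|This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new flexible visual autoregressive image generation paradigm. FlexVAR facilitates autoregressive learning with ground-truth prediction, enabling each step to independently produce plausible images. This simple, intuitive approach learns visual distributions quickly and makes the generation process more flexible and adaptable. Trained solely on low-resolution images (≤ 256 px), FlexVAR can: (1) generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images; (2) support various image-to-image tasks, including image refinement, in/out-painting, and image expansion; (3) adapt to various autoregressive steps, allowing faster inference with fewer steps or higher image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256×256 benchmark. Moreover, when the image generation process is transferred zero-shot to 13 steps, performance further improves to 2.08 FID, outperforming the state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and the popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When our 1.0B model is transferred zero-shot to the ImageNet 512×512 benchmark, FlexVAR achieves results competitive with the VAR 2.3B model, which is fully supervised and trained at 512×512 resolution.||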
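|**2025-02-27**|[Mobius: Text to Seamless Looping Video Generation via Latent Shift](http://arxiv.org/abs/2502.20307)|**[link](https://github.com/yisuitt/mobius)**|We present Mobius, a novel method for generating seamlessly looping videos directly from text descriptions without any user annotations, thereby creating new visual material for multimedia presentations. Our method repurposes a pretrained video latent diffusion model to generate looping videos from text prompts without any training. During inference, we first construct a latent cycle by connecting the starting and ending noise of the video. Given that temporal consistency can be maintained by the context of the video diffusion model, we perform multi-frame latent denoising by gradually shifting the first-frame latent to the end at each step. As a result, the denoising context varies at each step while consistency is maintained throughout the inference process. Moreover, the latent cycle in our method can be of any length. This extends our latent-shift approach to generate seamless looping videos beyond the context length of the video diffusion model. Unlike previous cinemagraphs, the proposed method does not require an image as appearance, which would restrict the motion of the generated results. Instead, our method can produce more dynamic motion and better visual quality. We conduct multiple experiments and comparisons to verify the effectiveness of the proposed method across different scenarios. All the code will be made available.||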
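|**2025-02-27**|[Explainable, Multi-modal Wound Infection Classification from Images Augmented with Generated Captions](http://arxiv.org/abs/2502.20277)|null|Infections in diabetic foot ulcers (DFUs) can cause severe complications, including tissue death and limb amputation, highlighting the need for accurate, timely diagnosis. Previous machine learning methods have focused on identifying infections by analyzing wound images alone, without using additional metadata such as medical notes. In this study, we aim to improve infection detection by introducing Synthetic Caption Augmented Retrieval for Wound Infection Detection (SCARWID), a novel deep learning framework that uses synthetic textual descriptions to augment DFU images. SCARWID consists of two components: (1) Wound-BLIP, a vision-language model (VLM) fine-tuned on GPT-4o-generated descriptions to synthesize consistent captions from images; and (2) an image-text fusion module that uses cross-attention to extract cross-modal embeddings from an image and its corresponding Wound-BLIP caption. Infection status is determined by retrieving the top-k similar items from a labeled support set. To enhance the diversity of the training data, we used a latent diffusion model to generate additional wound images. As a result, SCARWID outperformed state-of-the-art models, achieving average sensitivity, specificity, and accuracy of 0.85, 0.78, and 0.81, respectively, for wound infection classification. Displaying the generated captions alongside the wound images and infection detection results enhances interpretability and trust, enabling nurses to align SCARWID outputs with their medical knowledge. This is particularly valuable when wound notes are unavailable or when assisting novice nurses who may find it difficult to identify the visual attributes of wound infection.||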
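|**2025-02-27**|[Do computer vision foundation models learn the low-level characteristics of the human visual system?](http://arxiv.org/abs/2502.20256)|null|Computer vision foundation models, such as DINO or OpenCLIP, are trained in a self-supervised manner on large image datasets. Analogously, substantial evidence suggests that the human visual system (HVS) is influenced by the statistical distribution of colors and patterns in the natural world, characteristics that are also present in the training data of foundation models. The question we address in this paper is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system, such as contrast detection, contrast masking, and contrast constancy. Specifically, we designed a protocol comprising nine test types to evaluate the image encoders of 45 foundation and generative models. Our results indicate that some foundation models (e.g., DINO, DINOv2, and OpenCLIP) share some characteristics of human vision, while other models show little resemblance. Foundation models tend to be less sensitive to low contrast and to respond rather irregularly to contrast across frequencies. The foundation models show the best agreement with human data in terms of contrast masking. Our findings suggest that human vision and computer vision may take both similar and different paths when learning to interpret images of the real world. Overall, while differences remain, foundation models trained on vision tasks are starting to align with low-level human vision, with DINOv2 showing the closest resemblance.||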
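|**2025-02-27**|[Beyond Natural Language Perplexity: Detecting Dead Code Poisoning in Code Generation Datasets](http://arxiv.org/abs/2502.20246)|null|The increasing adoption of large language models (LLMs) for code-related tasks has raised concerns about the security of their training datasets. One critical threat is dead code poisoning, where syntactically valid but functionally redundant code is injected into training data to manipulate model behavior. Such attacks can degrade the performance of neural code search systems, leading to biased or insecure code suggestions. Existing detection methods, such as token-level perplexity analysis, fail to identify dead code effectively because of the structural and contextual characteristics of programming languages. In this paper, we propose DePA (Dead Code Perplexity Analysis), a novel line-level detection and cleansing method tailored to the structural properties of code. DePA computes line-level perplexity by leveraging the contextual relationships between code lines and identifies anomalous lines by comparing their perplexity to the overall distribution within the file. Our experiments on benchmark datasets demonstrate that DePA significantly outperforms existing methods, improving detection F1-score by 0.14-0.19 and poisoned-segment localization precision by 44-65%. Furthermore, DePA speeds up detection by 0.62-23x, making it practical for large-scale dataset cleansing. Overall, by addressing the unique challenges of dead code poisoning, DePA provides a robust and efficient solution for safeguarding the integrity of code generation model training datasets.||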
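|**2025-02-27**|[From Retrieval to Generation: Comparing Different Approaches](http://arxiv.org/abs/2502.20245)|null|Knowledge-intensive tasks, particularly open-domain question answering (ODQA), document reranking, and retrieval-augmented language modeling, require a balance between retrieval accuracy and generative flexibility. Traditional retrieval models such as BM25 and Dense Passage Retrieval (DPR) efficiently retrieve information from large corpora but often lack semantic depth. Generative models like GPT-4-o provide richer contextual understanding but face challenges in maintaining factual consistency. In this work, we conduct a systematic evaluation of retrieval-based, generation-based, and hybrid models, with a primary focus on their performance in ODQA and related retrieval-augmented tasks. Our results show that dense retrievers, particularly DPR, achieve strong performance in ODQA with a top-1 accuracy of 50.17% on NQ, while hybrid models improve nDCG@10 scores on BEIR from 43.42 (BM25) to 52.59, demonstrating their strength in document reranking. Additionally, we analyze language modeling tasks using WikiText-103, showing that retrieval-based approaches like BM25 achieve lower perplexity than generative and hybrid methods, highlighting their utility in retrieval-augmented generation. By providing detailed comparisons and practical insights into the conditions under which each approach excels, we aim to facilitate future optimization of retrieval, reranking, and generative models for ODQA and related knowledge-intensive applications.||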
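|**2025-02-25**|[K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs](http://arxiv.org/abs/2502.18461)|null|Recent studies have explored combining different LoRAs to jointly generate learned style and content. However, existing methods either fail to preserve both the original subject and style effectively at the same time or require additional training. In this paper, we argue that the intrinsic properties of LoRA can effectively guide diffusion models in merging learned subject and style. Building on this insight, we propose K-LoRA, a simple yet effective training-free LoRA fusion approach. In each attention layer, K-LoRA compares the Top-K elements in each LoRA to be fused and determines which LoRA to select for optimal fusion. This selection mechanism ensures that the most representative features of both subject and style are retained during the fusion process, effectively balancing their contributions. Experimental results demonstrate that the proposed method effectively integrates the subject and style information learned by the original LoRAs, outperforming state-of-the-art training-based approaches in both qualitative and quantitative results.||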
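|**2025-02-25**|[ToMCAT: Theory-of-Mind for Cooperative Agents in Teams via Multiagent Diffusion Policies](http://arxiv.org/abs/2502.18438)|null|This paper presents ToMCAT (Theory-of-Mind for Cooperative Agents in Teams), a new framework for generating ToM-conditioned trajectories. It combines a meta-learning mechanism with a multiagent denoising diffusion model. The meta-learning mechanism performs ToM reasoning over teammates' underlying goals and future behavior; the multiagent denoising diffusion model generates plans for an agent and its teammates conditioned on the agent's goals and on its teammates' characteristics as computed via ToM. We implemented an online planning system that dynamically samples new trajectories from the diffusion model (replanning) whenever it detects a divergence between a previously generated plan and the current state of the world. We conducted several experiments using ToMCAT in a simulated cooking domain. Our results highlight the importance of the dynamic replanning mechanism in reducing resource usage without sacrificing team performance. We also show that recent observations about the world and teammates' behavior, collected by an agent over the course of an episode, combined with ToM inferences, are crucial for generating team-aware plans that adapt dynamically to teammates, especially when no prior information about a teammate is provided.||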
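|**2025-02-25**|[Sparse Bayesian Generative Modeling for Joint Parameter and Channel Estimation](http://arxiv.org/abs/2502.18369)|null|Leveraging the inherent connection between sensing systems and wireless communications can improve their overall performance, and this is the core objective of joint communications and sensing. For effective communications, the channel has to be estimated frequently. Sensing, on the other hand, infers properties of the environment mostly from estimated physical channel parameters, such as directions of arrival or delays. This work presents a low-complexity generative modeling approach that estimates the wireless channel and its physical parameters simultaneously, without additional computational overhead. To this end, we leverage a recently proposed physics-informed generative model for wireless channels based on sparse Bayesian generative modeling and exploit the properties of conditionally Gaussian generative models to approximate the conditional mean estimator.||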
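|**2025-02-25**|[ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation](http://arxiv.org/abs/2502.18364)|null|Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby transforming interaction with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which directly generates variable multi-layer transparent images from a global text prompt and an anonymous region layout. Inspired by schema theory, which holds that knowledge is organized in frameworks (schemas) that let people interpret and learn from new information by linking it to prior knowledge, this anonymous region layout allows the generative model to determine autonomously which set of visual tokens should align with which text tokens, in contrast to the semantic layout that previously dominated image generation tasks. In addition, a layer-wise region crop mechanism, which selects only the visual tokens belonging to each anonymous region, significantly reduces attention computation costs and enables the efficient generation of images with many distinct layers (e.g., 50+). Compared with the full-attention approach, our method is over 12 times faster and exhibits fewer layer conflicts. Furthermore, we propose a high-quality multi-layer transparent image autoencoder that supports the direct joint encoding and decoding of the transparency of variable multi-layer images. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.||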
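|**2025-02-25**|[LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation](http://arxiv.org/abs/2502.18302)|null|In this paper, we introduce LDGen, a novel method for integrating large language models (LLMs) into existing text-to-image diffusion models while minimizing computational demands. Traditional text encoders, such as CLIP and T5, have limitations in multilingual processing, which hinders image generation across diverse languages. We address these challenges by leveraging the advanced capabilities of LLMs. Our approach employs a language representation strategy that applies hierarchical caption optimization and human instruction techniques to derive precise semantic information. We then incorporate a lightweight adapter and a cross-modal refiner to facilitate efficient feature alignment and interaction between LLMs and image features. LDGen reduces training time and enables zero-shot multilingual image generation. Experimental results indicate that our method surpasses baseline models in both prompt adherence and image aesthetic quality, while seamlessly supporting multiple languages. Project page: https://zrealli.github.io/LDGen.||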
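|**2025-02-25**|[Bayesian Computation in Deep Learning](http://arxiv.org/abs/2502.18300)|null|This review paper is intended as a contribution to the second edition of the Handbook of Markov Chain Monte Carlo. We provide an introduction to approximate inference techniques as Bayesian computation methods applied to deep learning models. The chapter is organized around popular computational methods for (1) Bayesian neural networks and (2) deep generative models, explaining their unique challenges in posterior inference as well as the corresponding solutions.||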
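|**2025-02-25**|[Imperfect Knowledge Management (IKM) in GEFRED (GENeralized model for Fuzzy RElational Databases)](http://arxiv.org/abs/2502.18255)|null|Imperfect Knowledge Management (IKM) helps manage the imprecise, uncertain, or incomplete aspects of meaning. IKM acknowledges that a company's knowledge is often imperfect and is characterized by varying degrees of imprecision, uncertainty, or incompleteness. In this context, knowledge is viewed as an object described by attributes and values. We focus on the domain of competence (know-how) in coated cardboard production, specifically the process of converting manufactured cardboard into finished products. This process involves both classical and fuzzy attributes used to evaluate cardboard quality. This article introduces a set of protocols for modeling a fuzzy meta-knowledge base in the Oracle 8i relational database system using GEFRED (GENeralized model for Fuzzy RElational Databases).||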
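|**2025-02-25**|[Beyond the convexity assumption: Realistic tabular data generation under quantifier-free real linear constraints](http://arxiv.org/abs/2502.18237)|**[link](https://github.com/mihaela-stoian/drl_dgm)**|Synthetic tabular data generation has traditionally been a challenging problem because of the high complexity of the underlying distributions of such data. Despite recent advances in deep generative models (DGMs), existing methods often fail to produce realistic data points that are well aligned with available background knowledge. In this paper, we address this limitation by introducing the Disjunctive Refinement Layer (DRL), a novel layer designed to enforce the alignment of generated data with the background knowledge specified in user-defined constraints. DRL is the first method able to automatically make deep learning models inherently compliant with constraints as expressive as quantifier-free linear formulas, which can define non-convex and even disconnected spaces. Our experimental analysis shows that DRL not only guarantees constraint satisfaction but also improves efficacy in downstream tasks. Notably, when applied to DGMs that frequently violate constraints, DRL eliminates violations entirely. It also improves performance by up to 21.4% in F1-score and 20.9% in area under the ROC curve, demonstrating its practical impact on data generation.||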
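|**2025-02-25**|[Synthesizing Consistent Novel Views via 3D Epipolar Attention without Re-Training](http://arxiv.org/abs/2502.18219)|null|Large diffusion models demonstrate remarkable zero-shot capabilities in novel view synthesis from a single image. However, these models often struggle to maintain consistency between novel and reference views. A crucial factor behind this issue is the limited use of contextual information from the reference view. Specifically, when the view frustums of two views overlap, the corresponding regions must remain consistent in both geometry and appearance. Based on this observation, we propose a simple yet effective method that uses epipolar geometry to locate and retrieve overlapping information from the input view. This information is then incorporated into the generation of the target view, with no need for training or fine-tuning, because the process requires no learnable parameters. Furthermore, to enhance the overall consistency of the generated views, we extend epipolar attention to a multi-view setting, allowing overlapping information to be retrieved from the input view and from other target views. Qualitative and quantitative experimental results show that our method significantly improves the consistency of synthesized views without any fine-tuning. Moreover, this enhancement also boosts the performance of downstream applications such as 3D reconstruction. The code is available at https://github.com/botaoye/ConsisSyn.||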
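|**2025-02-25**|[Training Consistency Models with Variational Noise Coupling](http://arxiv.org/abs/2502.18197)|**[link](https://github.com/sony/vct)**|Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and improving its training dynamics is an active area of research. In this work, we propose a novel CT training approach based on the flow matching framework. Our main contribution is a trained noise-coupling scheme inspired by the architecture of variational autoencoders (VAEs). By training a data-dependent noise emission model implemented as an encoder architecture, our method can indirectly learn the geometry of the noise-to-data mapping, which in classical CT is fixed by the choice of the forward process. Empirical results across diverse image datasets show significant generative improvements, with our model outperforming baselines and achieving the state-of-the-art (SoTA) non-distillation CT FID on CIFAR-10, and attaining an FID on par with SoTA on ImageNet at 64x64 resolution in 2-step generation. Our code is available at https://github.com/sony/vct.||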
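|**2025-02-21**|[One-step Diffusion Models with $f$-Divergence Distribution Matching](http://arxiv.org/abs/2502.15681)|null|Sampling from diffusion models involves a slow iterative process that hinders their practical deployment, especially for interactive applications. To accelerate generation, recent approaches distill a multi-step diffusion model into a single-step student generator via variational score distillation, which matches the distribution of samples generated by the student to the teacher's distribution. However, these approaches use the reverse Kullback-Leibler (KL) divergence for distribution matching, which is known to be mode-seeking. In this paper, we generalize the distribution matching approach using a novel $f$-divergence minimization framework, termed $f$-distill, that covers different divergences with different trade-offs in terms of mode coverage and training variance. We derive the gradient of the $f$-divergence between the teacher and student distributions and show that it is expressed as the product of their score differences and a weighting function determined by their density ratio. This weighting function naturally emphasizes samples with higher density in the teacher distribution when a less mode-seeking divergence is used. We observe that the popular variational score distillation approach using the reverse KL divergence is a special case within our framework. Empirically, we demonstrate that alternative $f$-divergences, such as forward KL and Jensen-Shannon divergences, outperform the current best variational score distillation methods across image generation tasks. In particular, when using the Jensen-Shannon divergence, $f$-distill achieves current state-of-the-art one-step generation performance on ImageNet64 and zero-shot text-to-image generation on MS-COCO. Project page: https://research.nvidia.com/labs/genair/f-distill||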
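|**2025-02-21**|[Enhancing RWKV-based Language Models for Long-Sequence Text Generation](http://arxiv.org/abs/2502.15485)|**[link](https://github.com/PStarH/long-seq-rwkv)**|This paper presents an enhanced RWKV-based language generation model designed to improve long-sequence text processing. We propose an adaptive token shift and gating mechanism to better capture long-range dependencies in text generation. Through a series of experiments, we compare the baseline RWKV model with the enhanced model, evaluating performance in terms of forward propagation time, text generation quality, and automatic evaluation metrics such as perplexity, BLEU, and ROUGE. Experimental results show that the enhanced model significantly improves generation quality, especially in BLEU and ROUGE scores, and demonstrates a stronger ability to capture context in long-text generation tasks.||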
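|**2025-02-21**|[Modeling Infectious Diseases: From SIR Models to Diffusion-Based Approaches and Numerical Solutions](http://arxiv.org/abs/2502.15439)|null|As global living standards improve and medical technology advances, many infectious diseases have been effectively controlled. However, certain diseases, such as the recent COVID-19 pandemic, continue to pose significant threats to public health. This paper explores the evolution of infectious disease modeling, from early models based on ordinary differential equations, such as the SIR framework, to more complex reaction-diffusion models that incorporate both temporal and spatial dynamics. The study highlights the importance of numerical methods, such as the Runge-Kutta method, implicit-explicit time-discretization techniques, and the finite difference method, in solving these models. By analyzing the development and application of these methods, this research underscores their critical role in predicting disease spread, informing public health strategies, and mitigating the impact of future pandemics.||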
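|**2025-02-21**|[Efficiently Solving Discounted MDPs with Predictions on Transition Matrices](http://arxiv.org/abs/2502.15345)|null|We study infinite-horizon discounted Markov decision processes (DMDPs) under a generative model. Motivated by the algorithms-with-advice framework of Mitzenmacher and Vassilvitskii (2022), we propose a novel framework for investigating how predictions of the transition matrix can improve the sample efficiency of solving DMDPs and tighten sample complexity bounds. We focus on DMDPs with $N$ state-action pairs and discount factor $\gamma$. First, we provide an impossibility result: without prior knowledge of the prediction accuracy, no sampling policy can compute an $\epsilon$-optimal policy with a sample complexity bound better than $\tilde{O}((1-\gamma)^{-3} N\epsilon^{-2})$, which matches the state-of-the-art minimax sample complexity bound without predictions. As a complement, we propose an algorithm based on minimax optimization techniques that leverages the predictions of the transition matrix. Our algorithm achieves a sample complexity bound that depends on the prediction error, and the bound is uniformly better than $\tilde{O}((1-\gamma)^{-4} N \epsilon^{-2})$, the previous best result derived from convex optimization methods. These theoretical findings are further supported by our numerical experiments.||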
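|**2025-02-21**|[BundleFlow: Deep Menus for Combinatorial Auctions by Diffusion-Based Optimization](http://arxiv.org/abs/2502.15283)|null|Differentiable economics, the use of deep learning for auction design, has driven progress in the automated design of multi-item auctions with additive or unit-demand valuations. However, little progress has been made on optimal combinatorial auctions (CAs), even in the single-bidder case, because of the need to overcome the challenge of a bundle space that grows exponentially with the number of items. For example, when learning a menu of allocation-price choices for a bidder in a CA, each menu element needs to specify a probability distribution over bundles efficiently and flexibly. In this paper, we solve this problem in the single-bidder CA setting by generating a bundle distribution through an ordinary differential equation (ODE) applied to a tractable initial distribution, drawing inspiration from generative models, especially score-based diffusion models and continuous normalizing flows. Our method, BundleFlow, uses deep learning to find suitable ODE-based transforms of the initial distribution, one transform per menu element, so that the overall menu achieves high expected revenue. On single-bidder versions of CATS, a standard CA testbed, our method achieves 1.11-2.23$\times$ higher revenue than automated mechanism design baselines and scales to problems with up to 150 items. Relative to a baseline that also learns allocations in menu elements, our method reduces the number of training iterations by 3.6-9.5$\times$ and cuts training time by about 80% in settings with 50 and 100 items.||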
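|**2025-02-21**|[CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models](http://arxiv.org/abs/2502.15278)|null|Assessing whether AI-generated images are substantially similar to copyrighted works is a crucial step in resolving copyright disputes. In this paper, we propose CopyJudge, an automated copyright infringement identification framework that leverages large vision-language models (LVLMs) to simulate practical court processes for determining substantial similarity between copyrighted images and images generated by text-to-image diffusion models. Specifically, we employ an abstraction-filtration-comparison test framework with multi-LVLM debate to assess the likelihood of infringement and provide detailed rationales for the judgment. Based on these judgments, we further introduce a general LVLM-based mitigation strategy that automatically optimizes infringing prompts by avoiding sensitive expressions while preserving the non-infringing content. In addition, our approach can be enhanced by exploring non-infringing noise vectors within the diffusion latent space via reinforcement learning, even without modifying the original prompts. Experimental results show that our identification method achieves performance comparable to existing state-of-the-art approaches while offering superior generalization and interpretability across various forms of infringement, and that our mitigation method more effectively reduces memorization and IP infringement without losing non-infringing expressions.||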
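|**2025-02-21**|[Lung-DDPM: Semantic Layout-guided Diffusion Models for Thoracic CT Image Synthesis](http://arxiv.org/abs/2502.15204)|**[link](https://github.com/manem-lab/lung-ddpm)**|With the rapid development of artificial intelligence (AI), AI-assisted medical imaging analysis has demonstrated remarkable performance in early lung cancer screening. However, costly annotation and privacy concerns limit the construction of large-scale medical datasets, hampering the further application of AI in healthcare. To address data scarcity in lung cancer screening, we propose Lung-DDPM, a thoracic CT image synthesis approach that effectively generates high-fidelity 3D synthetic CT images, which prove helpful in downstream lung nodule segmentation tasks. Our method is based on semantic layout-guided denoising diffusion probabilistic models (DDPM), enabling anatomically reasonable, seamless, and consistent sample generation even from incomplete semantic layouts. Our results suggest that the proposed method outperforms other state-of-the-art (SOTA) generative models in image quality evaluation and in downstream lung nodule segmentation tasks. Specifically, Lung-DDPM achieved superior performance on our large validation cohort, with a Fréchet inception distance (FID) of 0.0047, a maximum mean discrepancy (MMD) of 0.0070, and a mean squared error (MSE) of 0.0024. These results were 7.4$\times$, 3.1$\times$, and 29.5$\times$ better than the second-best competitors, respectively. Furthermore, the lung nodule segmentation model trained on a dataset combining real samples and Lung-DDPM-generated synthetic samples attained a Dice coefficient (Dice) of 0.3914 and a sensitivity of 0.4393. This represents improvements of 8.8% in Dice and 18.6% in sensitivity over the model trained solely on real samples. The experimental results highlight Lung-DDPM's potential for a broader range of medical imaging applications, such as general tumor segmentation, cancer survival estimation, and risk prediction.||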
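|**2025-02-21**|[Methods and Trends in Detecting Generated Images: A Comprehensive Review](http://arxiv.org/abs/2502.15176)|null|The proliferation of generative models, such as generative adversarial networks (GANs), diffusion models, and variational autoencoders (VAEs), has enabled the synthesis of high-quality multimedia data. However, these advances have also raised significant concerns about adversarial attacks, unethical usage, and societal harm. Recognizing these challenges, researchers have increasingly focused on developing methods to detect synthesized data effectively, aiming to mitigate potential risks. Prior reviews have primarily focused on deepfake detection and often lack coverage of recent advances in synthetic image detection, particularly methods that leverage multimodal frameworks for improved forensic analysis. To address this gap, this survey provides a comprehensive review of state-of-the-art methods for detecting and classifying synthetic images generated by advanced generative AI models. The review systematically examines core detection methodologies, identifies commonalities among approaches, and categorizes them into meaningful taxonomies. Furthermore, given the crucial role of large-scale datasets in this field, we present an overview of publicly available datasets that facilitate further research and benchmarking in synthetic data detection.||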
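|**2025-02-20**|[Pseudoinverse Diffusion Models for Generative CT Image Reconstruction from Low Dose Data](http://arxiv.org/abs/2502.15064)|null|Score-based diffusion models have significantly advanced generative deep learning for image processing. Measurement-conditioned models have also been applied to inverse problems such as CT reconstruction. However, the conventional approach, which converges to white noise, often requires a large number of reverse-process update steps and score function evaluations. To address this limitation, we propose an alternative forward process in score-based diffusion models that aligns with the noise characteristics of low-dose CT reconstructions rather than converging to white noise. This approach significantly reduces the number of required score function evaluations, improving efficiency and preserving the noise textures familiar to radiologists. Our method not only accelerates the generative process but also retains CT noise correlations, an aspect for which clinicians often criticize deep learning reconstructions. In this work, we rigorously define a matrix-controlled stochastic process for this purpose and validate it through computational experiments. Using a dataset from The Cancer Genome Atlas Liver Hepatocellular Carcinoma (TCGA-LIHC), we simulate low-dose CT measurements and train our model, comparing it with a baseline scalar diffusion process and with conditional diffusion models. Our results demonstrate the superiority of our pseudoinverse diffusion model in terms of efficiency and its ability to produce high-quality reconstructions, with textures familiar to medical professionals, in a small number of score function evaluations. This advance paves the way for more efficient and clinically practical diffusion models in medical imaging, particularly where rapid reconstruction or lower radiation exposure is needed.||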
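|**2025-02-20**|[FIP: Endowing Robust Motion Capture on Daily Garment by Fusing Flex and Inertial Sensors](http://arxiv.org/abs/2502.15058)|null|What if our clothes could capture our body motion accurately? This paper introduces the Flexible Inertial Poser (FIP), a novel motion-capturing system that uses a daily garment fitted with two elbow-attached flex sensors and four inertial measurement units (IMUs). To address the inevitable sensor displacements in loose wearables, which significantly degrade joint tracking accuracy, we identify the distinct characteristics of flex and inertial sensor displacements and, based on these observations, develop a Displacement Latent Diffusion Model and a Physics-informed Calibrator to compensate for sensor displacement, substantially improving motion capture accuracy. We also introduce a Pose Fusion Predictor to enhance multimodal sensor fusion. Extensive experiments demonstrate that our method achieves robust performance across varying body shapes and motions, significantly outperforming state-of-the-art IMU approaches with a 19.5% reduction in angular error, a 26.4% reduction in elbow angular error, and a 30.1% reduction in positional error. FIP opens up opportunities for ubiquitous human-computer interaction and for diverse interactive applications such as the Metaverse, rehabilitation, and fitness analysis.||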
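|**2025-02-20**|[Improving the Diffusability of Autoencoders](http://arxiv.org/abs/2502.14831)|null|Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, using compressed latent representations to reduce the computational burden of the diffusion process. While recent advances have focused primarily on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively little attention. In this work, we perform a spectral analysis of modern autoencoders and identify excessive high-frequency components in their latent spaces, which are especially pronounced in autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and at most 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation on Kinetics-700 17x256x256.||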
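|**2025-02-20**|[A Survey on Text-Driven 360-Degree Panorama Generation](http://arxiv.org/abs/2502.14799)|null|The advent of text-driven 360-degree panorama generation, which enables the synthesis of 360-degree panoramic images directly from textual descriptions, marks a transformative advance in immersive visual content creation. This innovation significantly simplifies the traditionally complex process of producing panoramic content. Recent progress in text-to-image diffusion models has accelerated rapid development in this emerging field. This survey presents a comprehensive review of text-driven 360-degree panorama generation, offering an in-depth analysis of state-of-the-art algorithms and their expanding applications in 360-degree 3D scene generation. Furthermore, we critically examine current limitations and propose promising directions for future research. A curated project page with relevant resources and research papers is available at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/.||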
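|**2025-02-20**|[DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models](http://arxiv.org/abs/2502.14779)|null|In this paper, we introduce DC (Decouple)-ControlNet, a highly flexible and precisely controllable framework for multi-condition image generation. The core idea behind DC-ControlNet is to decouple control conditions, transforming global control into a hierarchical system that integrates distinct elements, contents, and layouts. This enables users to mix these individual conditions with greater flexibility, leading to more efficient and accurate control over image generation. Previous ControlNet-based models rely solely on global conditions, which affect the entire image and lack the ability to control specific elements or regions. This limitation reduces flexibility and can cause condition misunderstandings in multi-condition image generation. To address these challenges, we propose both an intra-element controller and an inter-element controller in DC-ControlNet. The intra-element controller handles different types of control signals within individual elements, accurately describing the content and layout characteristics of an object. For interactions between elements, we introduce the inter-element controller, which accurately handles multi-element interactions and occlusion based on user-defined relationships. Extensive evaluations show that DC-ControlNet significantly outperforms existing ControlNet models and layout-to-image generative models in the flexibility and precision of multi-condition control.||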
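|**2025-02-20**|[ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation](http://arxiv.org/abs/2502.14637)|**[link](https://github.com/AngxiaoYue/ReQFlow)**|Protein backbone generation plays a central role in de novo protein design and is significant for many biological and medical applications. Although diffusion- and flow-based generative models provide potential solutions to this challenging task, they often generate proteins with poor designability and suffer from computational inefficiency. In this study, we propose a novel rectified quaternion flow (ReQFlow) matching method for fast, high-quality protein backbone generation. Specifically, our method generates a local translation and a 3D rotation from random noise for each residue in a protein chain, representing each 3D rotation as a unit quaternion and constructing its flow by spherical linear interpolation (SLERP) in an exponential format. We train the model by quaternion flow (QFlow) matching with guaranteed numerical stability and rectify the QFlow model to accelerate its inference and improve the designability of generated protein backbones, leading to the proposed ReQFlow model. Experiments show that ReQFlow achieves state-of-the-art performance in protein backbone generation while requiring far fewer sampling steps and much less inference time (e.g., it is 37$\times$ faster than RFDiffusion and 62$\times$ faster than Genie2 when generating a backbone of length 300), demonstrating its effectiveness and efficiency. The code is available at https://github.com/AngxiaoYue/ReQFlow.||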
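|**2025-02-20**|[A Theory for Conditional Generative Modeling on Multiple Data Sources](http://arxiv.org/abs/2502.14583)|**[link](https://github.com/ml-gsai/multi-source-gm)**|The success of large generative models has driven a paradigm shift toward leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper takes a first step toward a rigorous analysis of multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation based on the bracketing number. Our result shows that when source distributions share certain similarities and the model is expressive enough, multi-source training guarantees a sharper error bound than single-source training. We further instantiate the general theory on conditional Gaussian estimation and deep generative models, including autoregressive and flexible energy-based models, by characterizing their bracketing numbers. The results highlight that the number of sources and the similarity among source distributions increase the advantage of multi-source training. Simulations and real-world experiments validate our theory. Code is available at https://github.com/ML-GSAI/Multi-Source-GM.||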
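|**2025-02-20**|[How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?](http://arxiv.org/abs/2502.14502)|**[link](https://github.com/AIRI-Institute/knowledge-packing)**|The performance of large language models (LLMs) on many tasks is greatly limited by the knowledge learned during pretraining and stored in the model's parameters. Low-rank adaptation (LoRA) is a popular and efficient training technique for updating LLMs or adapting them to specific domains. In this study, we investigate how new facts can be incorporated into an LLM using LoRA without compromising previously learned knowledge. We fine-tuned Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our experiments show that the best results are obtained when the training data contains a mixture of known and new facts. However, this approach can still be harmful, because the model's performance on external question-answering benchmarks declines after such fine-tuning. When the training data is biased toward certain entities, the model tends to regress to a few overrepresented answers. In addition, we found that the model becomes more confident and refuses to provide an answer in only a few cases. These findings highlight the potential pitfalls of LoRA-based LLM updates and underscore the importance of training data composition and of tuning parameters to balance new knowledge integration against general model capabilities.||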
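|**2025-02-20**|[How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation](http://arxiv.org/abs/2502.14486)|null|Jailbreak attacks, in which harmful prompts bypass a generative model's built-in safety mechanisms, raise serious concerns about model vulnerability. Although many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to large vision-language models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem in order to assess a model's refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish harmful from benign inputs. Using these mechanisms, we develop two ensemble defense strategies, inter-mechanism ensembles and intra-mechanism ensembles, to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.||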
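|**2025-02-20**|[Enhancing Portuguese Variety Identification with Cross-Domain Approaches](http://arxiv.org/abs/2502.14394)|null|Recent advances in natural language processing have raised expectations that generative models can produce coherent text across diverse language varieties. In the case of Portuguese, the predominance of Brazilian Portuguese corpora online introduces linguistic biases into these models, limiting their applicability outside Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and studied the effectiveness of transformer-based LVI classifiers in cross-domain scenarios. Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages. We open-source the code, corpus, and models to foster further research on this task.||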
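|**2025-02-20**|[PPO-MI: Efficient Black-Box Model Inversion via Proximal Policy Optimization](http://arxiv.org/abs/2502.14370)|null|Model inversion attacks pose a significant privacy risk by attempting to reconstruct private training data from trained models. Most existing methods either depend on gradient estimation or require white-box access to model parameters, which limits their applicability in practical scenarios. In this paper, we propose PPO-MI, a novel reinforcement learning-based framework for black-box model inversion attacks. Our approach formulates the inversion task as a Markov decision process in which an agent navigates the latent space of a generative model to reconstruct private training samples using only model predictions. By employing Proximal Policy Optimization (PPO) with a momentum-based state transition mechanism, along with a reward function that balances prediction accuracy and exploration, PPO-MI ensures efficient latent space exploration and high query efficiency. Extensive experiments show that PPO-MI outperforms existing methods while requiring less attack knowledge, and that it is robust across various model architectures and datasets. These results underline its effectiveness and generalizability in practical black-box scenarios and raise important considerations about the privacy vulnerabilities of deployed machine learning models.||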
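|**2025-02-20**|[Entropy-UID: A Method for Optimizing Information Density](http://arxiv.org/abs/2502.14366)|null|Balanced and efficient information flow is essential for optimizing language generation models. In this work, we propose Entropy-UID, a new token selection method that balances entropy and Uniform Information Density (UID) principles to make text generation more efficient. Our approach adaptively adjusts token selection by jointly minimizing entropy and surprisal, promoting a more even distribution of information across generated sequences. Theoretical validation demonstrates that Entropy-UID optimally reduces information spikes while maintaining fluency and coherence. The method has been evaluated using information-theoretic metrics on multiple benchmark datasets, including WikiText-2, OpenWebText, and WMT. Experimental results show that Entropy-UID achieves lower surprisal and entropy variance than standard GPT-2 and other heuristic methods, leading to more balanced and human-like text generation. Our findings point to the potential of leveraging information-theoretic constraints to refine token selection strategies in autoregressive language models.||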
|**2025-02-18**|[AV-Flow: Transforming Text to Audio-Visual Human-like Interactions](http://arxiv.org/abs/2502.13133)|null|We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose; all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus, synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In case of dyadic conversations, AV-Flow produces an always-on avatar, that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars. Project page: https://aggelinacha.github.io/AV-Flow/||
|**2025-02-18**|[Is Noise Conditioning Necessary for Denoising Generative Models?](http://arxiv.org/abs/2502.13129)|null|It is widely believed that noise conditioning is indispensable for denoising diffusion models to work successfully. This work challenges this belief. Motivated by research on blind image denoising, we investigate a variety of denoising-based generative models in the absence of noise conditioning. To our surprise, most models exhibit graceful degradation, and in some cases, they even perform better without noise conditioning. We provide a theoretical analysis of the error caused by removing noise conditioning and demonstrate that our analysis aligns with empirical observations. We further introduce a noise-unconditional model that achieves a competitive FID of 2.23 on CIFAR-10, significantly narrowing the gap to leading noise-conditional models. We hope our findings will inspire the community to revisit the foundations and formulations of denoising generative models.||
|**2025-02-18**|[Score Matching Riemannian Diffusion Means](http://arxiv.org/abs/2502.13106)|null|Estimating means on Riemannian manifolds is generally computationally expensive because the Riemannian distance function is not known in closed-form for most manifolds. To overcome this, we show that Riemannian diffusion means can be efficiently estimated using score matching with the gradient of Brownian motion transition densities using the same principle as in Riemannian diffusion models. Empirically, we show that this is more efficient than Monte Carlo simulation while retaining accuracy and is also applicable to learned manifolds. Our method, furthermore, extends to computing the Fréchet mean and the logarithmic map for general Riemannian manifolds. We illustrate the applicability of the estimation of diffusion mean by efficiently extending Euclidean algorithms to general Riemannian manifolds with a Riemannian $k$-means algorithm and maximum likelihood Riemannian regression.||
|**2025-02-18**|[A Neural Difference-of-Entropies Estimator for Mutual Information](http://arxiv.org/abs/2502.13085)|null|Estimating Mutual Information (MI), a key measure of dependence of random quantities without specific modelling assumptions, is a challenging problem in high dimensions. We propose a novel mutual information estimator based on parametrizing conditional densities using normalizing flows, a deep generative model that has gained popularity in recent years. This estimator leverages a block autoregressive structure to achieve improved bias-variance trade-offs on standard benchmark tasks.||
|**2025-02-18**|[Personalized Image Generation with Deep Generative Models: A Decade Survey](http://arxiv.org/abs/2502.13081)|**[link](https://github.com/csyxwei/awesome-personalized-image-generation)**|Recent advancements in generative models have significantly facilitated the development of personalized content creation. Given a small set of images with a user-specific concept, personalized image generation makes it possible to create images that incorporate the specified concept and adhere to provided text descriptions. Due to its wide applications in content creation, significant effort has been devoted to this field in recent years. Nonetheless, the technologies used for personalization have evolved alongside the development of generative models, with their distinct and interrelated components. In this survey, we present a comprehensive review of generalized personalized image generation across various generative models, including traditional GANs, contemporary text-to-image diffusion models, and emerging multi-model autoregressive models. We first define a unified framework that standardizes the personalization process across different generative models, encompassing three key components, i.e., inversion spaces, inversion methods, and personalization schemes. This unified framework offers a structured approach to dissecting and comparing personalization techniques across different generative architectures. Building upon this unified framework, we further provide an in-depth analysis of personalization techniques within each generative model, highlighting their unique contributions and innovations. Through comparative analysis, this survey elucidates the current landscape of personalized image generation, identifying commonalities and distinguishing features among existing methods. Finally, we discuss the open challenges in the field and propose potential directions for future research. We keep tracing related works at https://github.com/csyxwei/Awesome-Personalized-Image-Generation.||
|**2025-02-18**|[Towards Variational Flow Matching on General Geometries](http://arxiv.org/abs/2502.12981)|null|We introduce Riemannian Gaussian Variational Flow Matching (RG-VFM), an extension of Variational Flow Matching (VFM) that leverages Riemannian Gaussian distributions for generative modeling on structured manifolds. We derive a variational objective for probability flows on manifolds with closed-form geodesics, making RG-VFM comparable to, though fundamentally different from, Riemannian Flow Matching (RFM) in this geometric setting. Experiments on a checkerboard dataset wrapped on the sphere demonstrate that RG-VFM captures geometric structure more effectively than Euclidean VFM and baseline methods, establishing it as a robust framework for manifold-aware generative modeling.||
|**2025-02-18**|[Does Training with Synthetic Data Truly Protect Privacy?](http://arxiv.org/abs/2502.12976)|**[link](https://github.com/yunpeng-zhao/syndata-privacy)**|As synthetic data becomes increasingly popular in machine learning tasks, numerous methods that lack formal differential privacy guarantees use synthetic data for training. These methods often claim, either explicitly or implicitly, to protect the privacy of the original training data. In this work, we explore four different training paradigms: coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models. While all these methods utilize synthetic data for training, they lead to vastly different conclusions regarding privacy preservation. We caution that empirical approaches to preserving data privacy require careful and rigorous evaluation; otherwise, they risk providing a false sense of privacy.||
|**2025-02-18**|[Guaranteed Conditional Diffusion: 3D Block-based Models for Scientific Data Compression](http://arxiv.org/abs/2502.12951)|null|This paper proposes a new compression paradigm, Guaranteed Conditional Diffusion with Tensor Correction (GCDTC), for lossy scientific data compression. The framework is based on recent conditional diffusion (CD) generative models, and it consists of a conditional diffusion model, tensor correction, and error guarantee. Our diffusion model is a mixture of 3D conditioning and 2D denoising U-Net. The approach leverages a 3D block-based compressing module to address spatiotemporal correlations in structured scientific data. Then, the reverse diffusion process for 2D spatial data is conditioned on the "slices" of content latent variables produced by the compressing module. After training, the denoising decoder reconstructs the data with zero noise and content latent variables, and thus it is entirely deterministic. The reconstructed outputs of the CD model are further post-processed by our tensor correction and error guarantee steps to control and ensure a maximum error distortion, which is an inevitable requirement in lossy scientific data compression. Our experiments involving two datasets generated by climate and chemical combustion simulations show that our framework outperforms standard convolutional autoencoders and yields competitive compression quality with an existing scientific data compression algorithm.||
|**2025-02-18**|[Probabilistic neural operators for functional uncertainty quantification](http://arxiv.org/abs/2502.12902)|**[link](https://github.com/cbuelt/pfno)**|Neural operators aim to approximate the solution operator of a system of differential equations purely from data. They have shown immense success in modeling complex dynamical systems across various domains. However, the occurrence of uncertainties inherent in both model and data has so far rarely been taken into account, a critical limitation in complex, chaotic systems such as weather forecasting. In this paper, we introduce the probabilistic neural operator (PNO), a framework for learning probability distributions over the output function space of neural operators. PNO extends neural operators with generative modeling based on strictly proper scoring rules, integrating uncertainty information directly into the training process. We provide a theoretical justification for the approach and demonstrate improved performance in quantifying uncertainty across different domains and with respect to different baselines. Furthermore, PNO requires minimal adjustment to existing architectures, shows improved performance for most probabilistic prediction tasks, and leads to well-calibrated predictive distributions and adequate uncertainty representations even for long dynamical trajectories. Implementing our approach into large-scale models for physical applications can lead to improvements in corresponding uncertainty quantification and extreme event identification, ultimately leading to a deeper understanding of the prediction of such surrogate models.||
|**2025-02-18**|[CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image](http://arxiv.org/abs/2502.12894)|null|Recovering high-quality 3D scenes from a single RGB image is a challenging task in computer graphics. Current methods often struggle with domain-specific limitations or low-quality object generation. To address these, we propose CAST (Component-Aligned 3D Scene Reconstruction from a Single RGB Image), a novel method for 3D scene reconstruction and recovery. CAST starts by extracting object-level 2D segmentation and relative depth information from the input image, followed by using a GPT-based model to analyze inter-object spatial relationships. This enables the understanding of how objects relate to each other within the scene, ensuring more coherent reconstruction. CAST then employs an occlusion-aware large-scale 3D generation model to independently generate each object's full geometry, using MAE and point cloud conditioning to mitigate the effects of occlusions and partial object information, ensuring accurate alignment with the source image's geometry and texture. To align each object with the scene, the alignment generation model computes the necessary transformations, allowing the generated meshes to be accurately placed and integrated into the scene's point cloud. Finally, CAST incorporates a physics-aware correction step that leverages a fine-grained relation graph to generate a constraint graph. This graph guides the optimization of object poses, ensuring physical consistency and spatial coherence. By utilizing Signed Distance Fields (SDF), the model effectively addresses issues such as occlusions, object penetration, and floating objects, ensuring that the generated scene accurately reflects real-world physical interactions. CAST can be leveraged in robotics, enabling efficient real-to-simulation workflows and providing realistic, scalable simulation environments for robotic systems.||
|**2025-02-14**|[Region-Adaptive Sampling for Diffusion Transformers](http://arxiv.org/abs/2502.10389)|null|Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling a variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups of up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable quality under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.||
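|**2025-02-14**|[ReStyle3D: Scene-Level Appearance Transfer with Semantic Correspondences](http://arxiv.org/abs/2502.10377)|null|We introduce ReStyle3D, a novel framework for scene-level appearance transfer from a single style image to a real-world scene represented by multiple views. The method combines explicit semantic correspondences with multi-view consistency to achieve precise and coherent stylization. Unlike conventional stylization methods that apply a reference style globally, ReStyle3D uses open-vocabulary segmentation to establish dense, instance-level correspondences between the style image and the real-world images. This ensures that each object is stylized with semantically matched textures. It first transfers the style to a single view using a training-free semantic-attention mechanism in a diffusion model. It then lifts the stylization to the other views through a learned warp-and-refine network guided by monocular depth and pixel-wise correspondences. Experiments show that ReStyle3D consistently outperforms prior methods in structure preservation, perceptual style similarity, and multi-view consistency. User studies further validate its ability to produce photorealistic, semantically accurate results. Our code, pretrained models, and dataset will be publicly released to support new applications in interior design, virtual staging, and 3D-consistent stylization.||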
|**2025-02-14**|[AffinityFlow: Guided Flows for Antibody Affinity Maturation](http://arxiv.org/abs/2502.10365)|null|Antibodies are widely used as therapeutics, but their development requires costly affinity maturation, involving iterative mutations to enhance binding affinity. This paper explores a sequence-only scenario for affinity maturation, using solely antibody and antigen sequences. Recently, AlphaFlow wrapped AlphaFold within flow matching to generate diverse protein structures, enabling a sequence-conditioned generative model of structure. Building on this, we propose an alternating optimization framework that (1) fixes the sequence to guide structure generation toward high binding affinity using a structure-based affinity predictor, then (2) applies inverse folding to create sequence mutations, refined by a sequence-based affinity predictor for post selection. To address this, we develop a co-teaching module that incorporates valuable information from noisy biophysical energies into predictor refinement. The sequence-based predictor selects consensus samples to teach the structure-based predictor, and vice versa. Our method, AffinityFlow, achieves state-of-the-art performance in affinity maturation experiments. We plan to open-source our code after acceptance.||
|**2025-02-14**|[Dimension-free Score Matching and Time Bootstrapping for Diffusion Models](http://arxiv.org/abs/2502.10354)|null|Diffusion models generate samples by estimating the score function of the target distribution at various noise levels. The model is trained using samples drawn from the target distribution, progressively adding noise. In this work, we establish the first (nearly) dimension-free sample complexity bounds for learning these score functions, achieving a double exponential improvement in dimension over prior results. A key aspect of our analysis is the use of a single function approximator to jointly estimate scores across noise levels, a critical feature of diffusion models in practice which enables generalization across timesteps. Our analysis introduces a novel martingale-based error decomposition and sharp variance bounds, enabling efficient learning from dependent data generated by Markov processes, which may be of independent interest. Building on these insights, we propose Bootstrapped Score Matching (BSM), a variance reduction technique that utilizes previously learned scores to improve accuracy at higher noise levels. These results provide crucial insights into the efficiency and effectiveness of diffusion models for generative modeling.||
|**2025-02-14**|[DiOpt: Self-supervised Diffusion for Constrained Optimization](http://arxiv.org/abs/2502.10330)|null|Recent advances in diffusion models show promising potential for learning-based optimization by leveraging their multimodal sampling capability to escape local optima. However, existing diffusion-based optimization approaches, often reliant on supervised training, lack a mechanism to ensure strict constraint satisfaction, which is often required in real-world applications. One resulting observation is distributional misalignment, i.e. the generated solution distribution often exhibits small overlap with the feasible domain. In this paper, we propose DiOpt, a novel diffusion paradigm that systematically learns near-optimal feasible solution distributions through iterative self-training. Our framework introduces several key innovations: a target distribution specifically designed to maximize overlap with the constrained solution manifold; a bootstrapped self-training mechanism that adaptively weights candidate solutions based on the severity of constraint violations and optimality gaps; and a dynamic memory buffer that accelerates convergence by retaining high-quality solutions over training iterations. To our knowledge, DiOpt represents the first successful integration of self-supervised diffusion with hard constraint satisfaction. Evaluations on diverse tasks, including power grid control, motion retargeting, and wireless allocation, demonstrate its superiority in terms of both optimality and constraint satisfaction.||
|**2025-02-14**|[Generalised Parallel Tempering: Flexible Replica Exchange via Flows and Diffusions](http://arxiv.org/abs/2502.10328)|null|Parallel Tempering (PT) is a classical MCMC algorithm designed for leveraging parallel computation to sample efficiently from high-dimensional, multimodal or otherwise complex distributions via annealing. One limitation of the standard formulation of PT is the growth of computational resources required to generate high-quality samples, as measured by effective sample size or round trip rate, for increasingly challenging distributions. To address this issue, we propose the framework: Generalised Parallel Tempering (GePT) which allows for the incorporation of recent advances in modern generative modelling, such as normalising flows and diffusion models, within Parallel Tempering, while maintaining the same theoretical guarantees as MCMC-based methods. For instance, we show that this allows us to utilise diffusion models in a parallelised manner, bypassing the usual computational cost of a large number of steps to generate quality samples. Further, we empirically demonstrate that GePT can improve sample quality and reduce the growth of computational resources required to handle complex distributions over the classical algorithm.||
|**2025-02-14**|[Probabilistic Super-Resolution for High-Fidelity Physical System Simulations with Uncertainty Quantification](http://arxiv.org/abs/2502.10280)|null|Super-resolution (SR) is a promising tool for generating high-fidelity simulations of physical systems from low-resolution data, enabling fast and accurate predictions in engineering applications. However, existing deep-learning-based SR methods require large labeled datasets and lack reliable uncertainty quantification (UQ), limiting their applicability in real-world scenarios. To overcome these challenges, we propose a probabilistic SR framework that leverages the Statistical Finite Element Method and energy-based generative modeling. Our method enables efficient high-resolution predictions with inherent UQ, while eliminating the need for extensive labeled datasets. The method is validated on a 2D Poisson example and compared with bicubic interpolation upscaling. Results demonstrate a computational speed-up over high-resolution numerical solvers while providing reliable uncertainty estimates.||
|**2025-02-14**|[Dark Matter Attenuation Effects: Sensitivity Ceilings for Spin-Dependent and Spin-Independent Interactions](http://arxiv.org/abs/2502.10251)|null|Direct detection experiments aimed at uncovering the elusive nature of dark matter (DM) have made significant progress in probing ever lower cross-sections for DM-nucleon interactions. At the same time, an upper limit in the cross-section sensitivity region is present due to DM scattering in the Earth and atmosphere and, as a result, never reaching the detector. We investigate the impact of this effect for both spin-dependent and spin-independent interactions. In contrast to previous studies that assume a straight-line path for DM scattering, we employ a semi-analytic diffusion model that takes into account the impact of potentially large angle deviations prevalent for light DM masses. We find that for sufficiently low energy thresholds, this difference in modelling impacts the DM interaction cross-section sensitivity. This study evaluates the impact in the context of the QUEST-DMC experiment, which utilises surface-based detectors with superfluid Helium-3 bolometers to search for sub-GeV DM, exploiting a low energy threshold. At masses below 1 GeV$/c^2$, the deviation between the two frameworks becomes pronounced. The ceiling sensitivity limit for QUEST-DMC on spin-dependent DM-neutron cross-sections is $\sim 3 \times 10^{-24}$ cm$^2$ using the diffusive framework and approximately doubles with the straight-line path DM scattering. Similarly, for spin-independent DM-nucleon cross-sections, the ceiling limit is $\sim 7.5 \times 10^{-27}$ cm$^2$ under the diffusive framework and also increases by about a factor of two with the straight-line path approximation, within the mass range of 0.025-5 GeV$/c^2$.||
|**2025-02-14**|[Shaping Inductive Bias in Diffusion Models through Frequency-Based Noise Control](http://arxiv.org/abs/2502.10236)|null|Diffusion Probabilistic Models (DPMs) are powerful generative models that have achieved unparalleled success in a number of generative tasks. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. For topologically structured data, we devise a frequency-based noising operator to purposefully manipulate, and set, these inductive biases. We first show that appropriate manipulations of the noising forward process can lead DPMs to focus on particular aspects of the distribution to learn. We show that different datasets necessitate different inductive biases, and that appropriate frequency-based noise control induces increased generative performance compared to standard diffusion. Finally, we demonstrate the possibility of ignoring information at particular frequencies while learning. We show this in an image corruption and recovery task, where we train a DPM to recover the original target distribution after severe noise corruption.||
|**2025-02-14**|[VideoDiff: Human-AI Video Co-Creation with Alternatives](http://arxiv.org/abs/2502.10190)|null|To make an engaging video, people sequence interesting moments and add visuals such as B-rolls or text. While video editing requires time and effort, AI has recently shown strong potential to make editing easier through suggestions and automation. A key strength of generative models is their ability to quickly generate multiple variations, but when provided with many alternatives, creators struggle to compare them to find the best fit. We propose VideoDiff, an AI video editing tool designed for editing with alternatives. With VideoDiff, creators can generate and review multiple AI recommendations for each editing process: creating a rough cut, inserting B-rolls, and adding text effects. VideoDiff simplifies comparisons by aligning videos and highlighting differences through timelines, transcripts, and video previews. Creators have the flexibility to regenerate and refine AI suggestions as they compare alternatives. Our study participants (N=12) could easily compare and customize alternatives, creating more satisfying results.||
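|**2025-02-13**|[Theoretical Benefit and Limitation of Diffusion Language Model](http://arxiv.org/abs/2502.09622)|null|Diffusion language models have emerged as a promising approach to text generation. One would naturally expect this approach to be an efficient replacement for autoregressive models, since multiple tokens can be sampled in parallel at each diffusion step. However, its trade-off between efficiency and accuracy is not yet well understood. In this paper, we present a rigorous theoretical analysis of a widely used diffusion language model, the Masked Diffusion Model (MDM), and find that its effectiveness depends heavily on the target evaluation metric. Under mild conditions, we prove that when perplexity is used as the metric, MDMs can achieve near-optimal perplexity in a small number of sampling steps regardless of sequence length, which shows that MDMs can be highly efficient without sacrificing performance. However, when the sequence error rate is used (a metric that matters for the "correctness" of a sequence, such as a reasoning chain), we show that the required number of sampling steps must scale linearly with sequence length to obtain "correct" sequences, which eliminates the efficiency advantage of MDMs over autoregressive models. Our analysis establishes the first theoretical foundation for understanding the benefits and limitations of MDMs. All theoretical findings are supported by empirical studies.||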
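|**2025-02-13**|[RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets](http://arxiv.org/abs/2502.09615)|null|We present RigAnything, a novel autoregressive transformer-based model that makes 3D assets rig-ready by probabilistically generating joints and skeleton topologies and assigning skinning weights in a template-free manner. Unlike most existing auto-rigging methods, which rely on predefined skeleton templates and are limited to specific categories such as humanoids, RigAnything treats rigging autoregressively, iteratively predicting the next joint from the global input shape and the previous predictions. While autoregressive models are typically used to generate sequential data, RigAnything extends their application to effectively learn and represent skeletons, which are inherently tree structures. To this end, we organize the joints in breadth-first search (BFS) order, so that a skeleton can be defined as a sequence of 3D positions and parent indices. In addition, our model improves the accuracy of position prediction by leveraging diffusion modeling, ensuring precise and consistent placement of joints within the hierarchy. This formulation allows the autoregressive model to efficiently capture both spatial and hierarchical relationships within the skeleton. Trained end-to-end on the RigNet and Objaverse datasets, RigAnything demonstrates state-of-the-art performance across diverse object types, including humanoids, quadrupeds, marine creatures, insects, and more, surpassing prior methods in quality, robustness, generalizability, and efficiency. More details are available on our website: https://www.liuisabella.com/RigAnything.||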
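|**2025-02-13**|[Designing a Conditional Prior Distribution for Flow-Based Generative Models](http://arxiv.org/abs/2502.09611)|null|Flow-based generative models have recently shown impressive performance on conditional generation tasks such as text-to-image generation. However, current methods transform a general unimodal noise distribution into a specific mode of the target data distribution. As a result, every point in the initial source distribution can be mapped to every point in the target distribution, which leads to long average paths. To address this, we exploit an underused property of conditional flow-based models: the ability to design a non-trivial prior distribution. Given an input condition, such as a text prompt, we first map it to a point in data space that represents an "average" data point, one with the minimal average distance to all data points of the same conditional mode (e.g., class). We then use the flow matching formulation to map samples from a parametric distribution centered on this point to the conditional target distribution. Experiments show that, compared with baselines, our method significantly improves training time and generation efficiency (FID, KID, and CLIP alignment scores) and produces high-quality samples with fewer sampling steps.||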
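|**2025-02-13**|[Score-of-Mixture Training: Training One-Step Generative Models Made Simple](http://arxiv.org/abs/2502.09609)|null|We propose Score-of-Mixture Training (SMT), a novel framework for training one-step generative models by minimizing a class of divergences called the $\alpha$-skew Jensen-Shannon divergence. At its core, SMT estimates the score of mixture distributions between real and generated samples across multiple noise levels. Similar to consistency models, our approach supports both training from scratch (SMT) and distillation from a pretrained diffusion model, which we call Score-of-Mixture Distillation (SMD). It is simple to implement, requires minimal hyperparameter tuning, and ensures stable training. Experiments on CIFAR-10 and ImageNet 64x64 show that SMT/SMD are competitive with, and can even outperform, existing methods.||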
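|**2025-02-13**|[Rolling Ahead Diffusion for Traffic Scene Simulation](http://arxiv.org/abs/2502.09587)|null|Realistic driving simulation requires that NPCs not only mimic natural driving behaviors but also react to the behavior of other simulated agents. Recent advances in diffusion-based scenario generation focus on creating diverse and realistic traffic scenarios by jointly modeling the motion of all agents in the scene. However, these traffic scenarios do not react when the motion of an agent deviates from its modeled trajectory. For example, the ego agent may be controlled by a standalone motion planner. To produce reactive scenarios with a joint scenario model, the model must regenerate the scenario at every timestep from the new observations, in a Model Predictive Control (MPC) fashion. Although reactive, this approach is time-consuming, because one complete possible future for all NPCs is generated at each simulation step. Alternatively, an autoregressive (AR) model can be used to predict only the next-timestep future of all NPCs. Although faster, this approach lacks the capability for advanced planning. We present a rolling-diffusion-based traffic scene generation model that combines the benefits of both approaches by predicting the next-step future while simultaneously predicting partially noised steps further into the future. We show that this model is more efficient than a diffusion-model-based AR model, achieving a beneficial compromise between reactivity and computational efficiency.||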
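|**2025-02-13**|[Memorization and Generalization in Generative Diffusion under the Manifold Hypothesis](http://arxiv.org/abs/2502.09578)|null|We study the memorization and generalization capabilities of a diffusion model (DM) in the case of structured data defined on a latent manifold. We specifically consider a set of $P$ mono-modal data points in $N$ dimensions that lie on a latent subspace of dimension $D = \alpha_D N$, according to the Hidden Manifold Model (HMM). Our analysis leverages a recently introduced formalism based on the statistical physics of the Random Energy Model (REM). We provide evidence for the existence of an onset time $t_{o} > t_c$ at which traps appear in the potential without affecting the typical diffusive trajectory. The size of the basins of attraction of these traps is computed as a function of time. Moreover, we derive the collapse time $t_{c}$ at which trajectories fall into the basin of one of the training points, implying memorization. An explicit formula for $t_c$ is given as a function of $P$ and the ratio $\alpha_D$, proving that the curse of dimensionality does not hold for highly structured data, i.e. $\alpha_D\ll 1$, regardless of the non-linearity of the manifold surface. We also prove that collapse coincides with the condensation transition in the REM. Finally, the degree of generalization of DMs is formulated in terms of the Kullback-Leibler divergence between the exact and the empirical distribution of the sampled configurations: we show the existence of an additional time $t_{g}$ … 50% amplitude). Future detectors such as IceCube-Gen2 have the potential to be sensitive to weaker modulations (>20% amplitude), particularly when wavelength shifters are used. For all detector configurations, the frequency and central time of the fast-time feature within the 5σ detection horizon can be measured with resolutions of 7.0 Hz and 17 ms, respectively.||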
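|**2025-02-07**|[Robust Graph Learning Against Adversarial Evasion Attacks via Prior-Free Diffusion-Based Structure Purification](http://arxiv.org/abs/2502.05000)|**[link](https://github.com/RingBDStack/DiffSP)**|Adversarial evasion attacks pose a significant threat to graph learning, and a large body of work has sought to improve the robustness of Graph Neural Networks (GNNs). However, existing works rely on priors about clean graphs or attack strategies, which are often heuristic and inconsistent. To achieve robust graph learning across different types of evasion attacks and diverse datasets, we study this problem from a prior-free structure purification perspective. Specifically, we propose a novel Diffusion-based Structure Purification framework named DiffSP, which incorporates a graph diffusion model to learn the intrinsic distribution of clean graphs and purifies perturbed structures by removing adversarial perturbations under the guidance of the captured predictive patterns, without relying on priors. DiffSP consists of a forward diffusion process and a reverse denoising process, during which structure purification is achieved. To avoid losing valuable information in the forward process, we propose an LID-driven non-isotropic diffusion mechanism that selectively injects noise in a non-isotropic manner. To promote semantic alignment between the clean graph and the purified graph generated in the reverse process, we reduce generation uncertainty with the proposed graph transfer entropy guided denoising mechanism. Extensive experiments demonstrate the superior robustness of DiffSP against evasion attacks.||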
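|**2025-02-07**|[C2GM: Cascading Conditional Generation of Multi-scale Maps from Remote Sensing Images Constrained by Geographic Features](http://arxiv.org/abs/2502.04991)|null|Multi-scale maps are an essential representation of surveying and cartographic results and a fundamental component of geographic information services. Current image generation networks can quickly produce map tiles from remote sensing images. However, generative models designed for natural images usually focus on texture features and neglect the unique characteristics of remote sensing features and the scale attributes of tile maps. This limitation impairs the accurate representation of geographic information, and the quality of generated tile maps still needs improvement. Diffusion models have achieved remarkable success in various image generation tasks, highlighting their potential to address this challenge. This paper presents C2GM, a novel framework for generating multi-scale tile maps through conditional guided diffusion and multi-scale cascaded generation. Specifically, we implement a conditional feature fusion encoder that extracts object priors from remote sensing images and cascades reference dual-branch inputs, ensuring an accurate representation of complex features. Tiles generated at lower levels act as constraints for map generation at higher levels, enhancing visual continuity. In addition, we use CLIP to incorporate map-scale modality information and model the relationship between map scale and cartographic generalization in tile maps. Extensive experimental evaluations show that C2GM consistently achieves state-of-the-art (SOTA) performance on all metrics, facilitating the rapid and effective generation of multi-scale, large-scale maps for emergency response and remote mapping applications.||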
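|**2025-02-07**|[Generative-enhanced optimization for knapsack problems: an industry-relevant study](http://arxiv.org/abs/2502.04928)|null|Optimization is a crucial task in industries such as logistics, aviation, manufacturing, chemicals, pharmaceuticals, and insurance, where finding the best solution to a problem can yield significant cost savings and increased efficiency. In recent years, tensor networks (TNs) have gained prominence for modeling classical systems with quantum-inspired approaches. More recently, TN generative-enhanced optimization (TN-GEO) has been proposed as a strategy that uses generative modeling to efficiently sample valid solutions with respect to certain constraints of an optimization problem. Moreover, it has been shown that symmetric TNs (STNs) can encode certain constraints of optimization problems, thereby aiding the solution process. In this work, we investigate the applicability of TN-GEO and STN-GEO to an industry-relevant problem class, the multi-knapsack problem, in which each object must be assigned to an available knapsack. We detail a prescription for practitioners to use the TN-GEO and STN-GEO methodology and study its scaling behavior and its dependence on hyperparameters. We benchmark 60 different problem instances and find that TN-GEO and STN-GEO produce results of similar quality to simulated annealing.||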
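|**2025-02-06**|[Can Grammarly and ChatGPT accelerate language change? AI-powered technologies and their impact on the English language: wordiness vs. conciseness](http://arxiv.org/abs/2502.04324)|null|The proliferation of NLP-powered language technologies, the development of AI-based natural language generation models, and the status of English as a mainstream means of communication among both native and non-native speakers make the output of AI-powered tools especially intriguing to linguists. This paper investigates how Grammarly and ChatGPT affect the English language with regard to wordiness versus conciseness. A case study focusing on the purpose subordinator "in order to" is presented to illustrate how Grammarly and ChatGPT recommend shorter grammatical structures over longer, more elaborate ones. Although the analyzed sentences were produced by native speakers, are perfectly correct, and were extracted from a corpus of contemporary English, both Grammarly and ChatGPT suggest more conciseness and less verbosity, even for relatively short sentences. The paper argues that technologies such as Grammarly not only mirror language change but also have the potential to facilitate or accelerate it.||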
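|**2025-02-06**|[HOG-Diff: Higher-Order Guided Diffusion for Graph Generation](http://arxiv.org/abs/2502.04308)|**[link](https://github.com/yiminghh/hog-diff)**|Graph generation is a critical yet challenging task, because empirical analyses require a deep understanding of complex, non-Euclidean structures. Although diffusion models have recently achieved remarkable results in graph generation, these models are typically adapted from frameworks designed for image generation, which makes them ill-suited to capturing the topological properties of graphs. In this work, we propose a novel Higher-order Guided Diffusion (HOG-Diff) model that follows a coarse-to-fine generation curriculum and is guided by higher-order information, enabling the progressive generation of plausible graphs with inherent topological structures. We further prove that our model has stronger theoretical guarantees than classical diffusion frameworks. Extensive experiments on both molecular and generic graph generation tasks show that our method consistently outperforms or remains competitive with state-of-the-art baselines. Our code is available at https://github.com/Yiminghh/HOG-Diff.||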
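|**2025-02-06**|[MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation](http://arxiv.org/abs/2502.04299)|null|This paper presents a method that allows users to design cinematic video shots in the context of image-to-video generation. Shot design, a critical aspect of filmmaking, involves carefully planning both camera movements and object motions in a scene. However, enabling intuitive shot design in modern image-to-video generation systems poses two main challenges: first, effectively capturing the user's intent for the motion design, where camera movements and scene-space object motions must be specified jointly; and second, representing the motion information so that a video diffusion model can use it effectively to synthesize the image animation. To address these challenges, we introduce MotionCanvas, a method that integrates user-driven controls into image-to-video (I2V) generation models, allowing users to control both object and camera motions in a scene-aware manner. By combining insights from classical computer graphics and contemporary video generation techniques, we demonstrate 3D-aware motion control in I2V synthesis without requiring costly 3D-related training data. MotionCanvas lets users intuitively depict scene-space motion intentions and translates them into spatiotemporal motion-conditioning signals for video diffusion models. We demonstrate the effectiveness of our method on a wide range of real-world image content and shot-design scenarios, highlighting its potential to enhance creative workflows in digital content creation and to adapt to various image and video editing applications.||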
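|**2025-02-06**|[Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression](http://arxiv.org/abs/2502.04296)|null|We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics, in order to generate high-quality data and support evaluation when scaling robot learning. Building interactive video world models and policies for robotics is difficult because of the challenge of handling diverse settings while maintaining the computational efficiency needed to run in real time. HMA uses heterogeneous pre-training on observations and action sequences from different robot embodiments, domains, and tasks. HMA uses masked autoregression to generate quantized or soft tokens for video prediction. Our method achieves better visual fidelity and controllability than previous robotic video generation models, with a 15x speedup in the real world. After post-training, the model can be used as a video simulator driven by low-level action inputs for evaluating policies and generating synthetic data. More information is available at https://liruiw.github.io/hma.||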
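|**2025-02-06**|[Realistic Image-to-Image Machine Unlearning via Decoupling and Knowledge Retention](http://arxiv.org/abs/2502.04260)|null|Machine unlearning allows participants to remove their data from a trained machine learning model in order to protect their privacy and security. However, the machine unlearning literature for generative models is rather limited. The literature on image-to-image generative models (I2I models) treats minimizing the distance between Gaussian noise and the I2I model's output on forget samples as machine unlearning. However, we argue that machine learning models perform fairly well on unseen data, i.e., a retrained model would still capture generic patterns in the data and hence would not produce output equivalent to Gaussian noise. In this paper, we argue that the model after unlearning should treat forget samples as out-of-distribution (OOD) data, i.e., the unlearned model should no longer recognize or encode the specific patterns found in the forget samples. To achieve this, we propose a framework that decouples the model parameters through gradient ascent, ensuring with theoretical guarantees that forget samples are OOD for the unlearned model. We also provide an $(\epsilon, \delta)$-unlearning guarantee for model updates performed with gradient ascent. The unlearned model is further fine-tuned on the remaining samples to maintain its performance. We also propose an attack model to verify that the unlearned model has effectively removed the influence of the forget samples. Extensive empirical evaluation on two large-scale datasets, ImageNet-1K and Places365, highlights the superiority of our approach. To show performance comparable to a retrained model, we also compare a simple autoencoder against various baselines on the CIFAR-10 dataset.||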
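|**2025-02-06**|[MRAMG-Bench: A BeyondText Benchmark for Multimodal Retrieval-Augmented Multimodal Generation](http://arxiv.org/abs/2502.04176)|null|Recent advances in Retrieval-Augmented Generation (RAG) have shown that integrating external knowledge into generative models can significantly improve the accuracy and relevance of responses. However, existing RAG methods primarily focus on providing text-only answers, even in multimodal retrieval-augmented generation scenarios. In this work, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, which aims to generate answers that combine text and images, fully leveraging the multimodal data within a corpus. Despite the growing importance of this task, a comprehensive benchmark for effectively evaluating MRAMG performance is still lacking. To bridge this gap, we introduce MRAMG-Bench, a carefully curated, human-annotated dataset comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, sourced from three categories: web data, academic papers, and lifestyle. The dataset covers different difficulty levels and complex multi-image scenarios, providing a solid foundation for evaluating multimodal generation tasks. To facilitate rigorous evaluation, MRAMG-Bench includes a comprehensive suite of statistical and LLM-based metrics, enabling a thorough analysis of the performance of popular generative models on the MRAMG task. In addition, we propose an efficient multimodal answer generation framework that leverages both LLMs and multimodal LLMs (MLLMs) to generate multimodal responses. Our dataset is available at https://huggingface.co/MRAMG.||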
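|**2025-02-06**|[Diffusion-based mass map reconstruction from weak lensing data](http://arxiv.org/abs/2502.04158)|null|Diffusion models have been used in cosmological applications as generative models for fast simulations and for reconstructing underlying cosmological fields or astrophysical images from noisy data. These two tasks are usually treated as separate: a diffusion model trained for one purpose does not generalize to the other. In this paper, we develop a single diffusion model that can be used for both tasks. Using the Diffusion Posterior Sampling (DPS) approach, we take a diffusion model trained to simulate weak lensing maps and apply it to the inverse problem of reconstructing mass maps from noisy weak lensing data. We find that the standard DPS method leads to biased inference, but we correct this bias by down-weighting the likelihood term at the early sampling time steps of the diffusion. Our method provides a way to reconstruct accurate high-resolution (sub-arcminute) mass maps that have the correct power spectrum and a range of non-Gaussian summary statistics. We discuss several applications enabled by the computational efficiency and accuracy of our model. These include generating simulation-quality mass maps, aiding covariance estimation for higher-order statistics, and finding filaments, voids, and clusters from noisy lensing shear data.||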
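|**2025-02-06**|[Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis](http://arxiv.org/abs/2502.04128)|null|Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both train-time and inference-time compute. However, current state-of-the-art text-to-speech (TTS) systems that leverage LLMs are often multi-stage and require separate models (e.g., a diffusion model after the LLM), which complicates the decision of whether to scale a particular model during training or testing. This work makes the following contributions. First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose Llasa, a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments show that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we use speech understanding models as verifiers during search and find that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we have publicly released the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model.||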
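|**2025-02-06**|[Generative Adversarial Networks Bridging Art and Machine Intelligence](http://arxiv.org/abs/2502.04116)|null|This book begins with a detailed introduction to the fundamental principles and historical development of Generative Adversarial Networks (GANs), contrasting them with traditional generative models and elucidating the core adversarial mechanism through illustrative Python examples. It systematically covers the relevant mathematical and theoretical foundations, including probability theory, statistics, and game theory, providing a solid framework for understanding the objectives, loss functions, and optimization challenges of GAN training. Subsequent chapters review classic GAN variants, such as Conditional GANs, Deep Convolutional GANs (DCGANs), InfoGAN, and Laplacian Pyramid GAN (LAPGAN), before moving on to advanced training methods such as Wasserstein GANs, GANs with gradient penalty, least squares GANs, and spectral normalization techniques. The book further examines architectural enhancements and task-specific adaptations of generators and discriminators, showcasing practical implementations in high-resolution image generation, artistic style transfer, video synthesis, text-to-image generation, and other multimedia applications. The concluding chapters offer insights into emerging research trends, including self-attention mechanisms, Transformer-based generative models, and a comparative analysis with diffusion models, thereby charting promising directions for future developments in both academic and applied settings.||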
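|**2025-02-06**|[Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency](http://arxiv.org/abs/2502.04076)|**[link](https://github.com/littlespray/crave)**|The advent of next-generation video generation models such as Sora poses challenges for AI-generated content (AIGC) video quality assessment (VQA). These models substantially reduce the flickering artifacts prevalent in earlier models, support longer and more complex text prompts, and generate longer videos with intricate, diverse motion patterns. Conventional VQA methods, designed for simple text and basic motion patterns, struggle to evaluate these content-rich videos. To this end, we propose CRAVE (Content-Rich AIGC Video Evaluator), designed specifically for evaluating Sora-era AIGC videos. CRAVE introduces multi-granularity text-temporal fusion, which aligns long-form, complex textual semantics with video dynamics. In addition, CRAVE uses hybrid motion-fidelity modeling to assess temporal artifacts. Given the simplicity of the prompts and content in current AIGC VQA datasets, we also introduce CRAVE-DB, a benchmark featuring content-rich videos from next-generation models paired with elaborate prompts. Extensive experiments show that CRAVE achieves excellent results on multiple AIGC VQA benchmarks and aligns closely with human perception. All data and code will be publicly available at https://github.com/littlespray/CRAVE.||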
|**2025-01-31**|[Beyond Fixed Horizons: A Theoretical Framework for Adaptive Denoising Diffusions](http://arxiv.org/abs/2501.19373)|null|We introduce a new class of generative diffusion models that, unlike conventional denoising diffusion models, achieve a time-homogeneous structure for both the noising and denoising processes, allowing the number of steps to adaptively adjust based on the noise level. This is accomplished by conditioning the forward process using Doob's $h$ -transform, which terminates the process at a suitable sampling distribution at a random time. The model is particularly well suited for generating data with lower intrinsic dimensions, as the termination criterion simplifies to a first-hitting rule. A key feature of the model is its adaptability to the target data, enabling a variety of downstream tasks using a pre-trained unconditional generative model. These tasks include natural conditioning through appropriate initialization of the denoising process and classification of noisy data.||
|**2025-01-31**|[Addressing the correlation of Stokes-shifted photons emitted from two quantum emitters](http://arxiv.org/abs/2501.19356)|null|In resonance fluorescence excitation experiments, light emitted from solid-state quantum emitters is typically filtered to eliminate the laser photons, ensuring that only red-shifted Stokes photons are detected. Theoretical analyses of the fluorescence intensity correlation often model emitters as two-level systems, focusing on light emitted exclusively from the purely electronic transition (the Zero-Phonon Line), or rely on statistical approaches based on conditional probabilities that do not account for quantum coherences. Here, we propose a general model to characterize the correlation of either Zero-Phonon Line photons or Stokes-shifted photons. This model successfully reproduces the experimental correlation of Stokes-shifted photons emitted from two interacting molecules and predicts that this correlation is affected by quantum coherence. Besides, we analyze the role of quantum coherence in light emission from two uncorrelated emitters, which helps to clarify the discrepancy between theory and experiments regarding the value of the correlation of photons emitted from this system at zero delay time.||
|**2025-01-31**|[Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SCICAP Challenge 2023](http://arxiv.org/abs/2501.19353)|null|Since the SCICAP dataset's launch in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SCICAP Challenge took place, inviting global teams to use an expanded SCICAP dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advanced quickly, with many powerful pre-trained large multimodal models (LMMs) emerging that showed impressive capabilities in various vision-and-language tasks. This paper presents an overview of the first SCICAP Challenge and details the performance of various models on its data, capturing a snapshot of the field's state. We found that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even the original captions written by authors. Following this key finding, we conducted detailed analyses to answer this question: Have advanced LMMs solved the task of generating captions for scientific figures?||
|**2025-01-31**|[Pathological MRI Segmentation by Synthetic Pathological Data Generation in Fetuses and Neonates](http://arxiv.org/abs/2501.19338)|null|Developing new methods for the automated analysis of clinical fetal and neonatal MRI data is limited by the scarcity of annotated pathological datasets and privacy concerns that often restrict data sharing, hindering the effectiveness of deep learning models. We address this in two ways. First, we introduce Fetal&Neonatal-DDPM, a novel diffusion model framework designed to generate high-quality synthetic pathological fetal and neonatal MRIs from semantic label images. Second, we enhance training data by modifying healthy label images through morphological alterations to simulate conditions such as ventriculomegaly, cerebellar and pontocerebellar hypoplasia, and microcephaly. By leveraging Fetal&Neonatal-DDPM, we synthesize realistic pathological MRIs from these modified pathological label images. Radiologists rated the synthetic MRIs as significantly (p < 0.05) superior in quality and diagnostic value compared to real MRIs, demonstrating features such as blood vessels and choroid plexus, and improved alignment with label annotations. Synthetic pathological data enhanced state-of-the-art nnUNet segmentation performance, particularly for severe ventriculomegaly cases, with the greatest improvements achieved in ventricle segmentation (Dice scores: 0.9253 vs. 0.7317). This study underscores the potential of generative AI as a transformative tool for data augmentation, offering improved segmentation performance in pathological cases. This development represents a significant step towards improving analysis and segmentation accuracy in prenatal imaging, and also offers new ways for data anonymization through the generation of pathological image data.||
|**2025-01-31**|[Medical Semantic Segmentation with Diffusion Pretrain](http://arxiv.org/abs/2501.19265)|null|Recent advances in deep learning have shown that learning robust feature representations is critical for the success of many computer vision tasks, including medical image segmentation. In particular, both transformer and convolutional-based architectures have benefited from leveraging pretext tasks for pretraining. However, the adoption of pretext tasks in 3D medical imaging has been less explored and remains a challenge, especially in the context of learning generalizable feature representations. We propose a novel pretraining strategy using diffusion models with anatomical guidance, tailored to the intricacies of 3D medical image data. We introduce an auxiliary diffusion process to pretrain a model that produces generalizable feature representations, useful for a variety of downstream segmentation tasks. We employ an additional model that predicts 3D universal body-part coordinates, providing guidance during the diffusion process and improving spatial awareness in generated representations. This approach not only aids in resolving localization inaccuracies but also enriches the model's ability to understand complex anatomical structures. Empirical validation on a 13-class organ segmentation task demonstrates the effectiveness of our pretraining technique. It surpasses existing restorative pretraining methods in 3D medical image segmentation by $7.5\%$, and is competitive with the state-of-the-art contrastive pretraining approach, achieving an average Dice coefficient of 67.8 in a non-linear evaluation scenario.||
|**2025-01-31**|[Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search](http://arxiv.org/abs/2501.19252)|null|The remarkable progress in text-to-video diffusion models enables photorealistic generations, although the contents of the generated video often include unnatural movement or deformation, reverse playback, and motionless scenes. Recently, an alignment problem has attracted huge attention, where we steer the output of diffusion models based on some quantity on the goodness of the content. Because there is a large room for improvement of perceptual quality along the frame direction, we should address which metrics we should optimize and how we can optimize them in the video generation. In this paper, we propose diffusion latent beam search with lookahead estimator, which can select better diffusion latent to maximize a given alignment reward, at inference time. We then point out that the improvement of perceptual video quality considering the alignment to prompts requires reward calibration by weighting existing metrics. When evaluating outputs by using vision language models as a proxy of humans, many previous metrics to quantify the naturalness of video do not always correlate with evaluation and also depend on the degree of dynamic descriptions in evaluation prompts. We demonstrate that our method improves the perceptual quality based on the calibrated reward, without model parameter update, and outputs the best generation compared to greedy search and best-of-N sampling. We provide practical guidelines on which axes, among search budget, lookahead steps for reward estimate, and denoising steps, in the reverse diffusion process, we should allocate the inference-time computation.||
|**2025-01-31**|[PSyDUCK: Training-Free Steganography for Latent Diffusion](http://arxiv.org/abs/2501.19172)|null|Recent advances in AI-generated steganography highlight its potential for safeguarding the privacy of vulnerable democratic actors, including aid workers, journalists, and whistleblowers operating in oppressive regimes. In this work, we address current limitations and establish the foundations for large-throughput generative steganography. We introduce a novel approach that enables secure and efficient steganography within latent diffusion models. We show empirically that our methods perform well across a variety of open-source latent diffusion models, particularly in generative image and video tasks.||
|**2025-01-31**|[RMDM: Radio Map Diffusion Model with Physics Informed](http://arxiv.org/abs/2501.19160)|**[link](https://github.com/Hxxxz0/RMDM)**|With the rapid development of wireless communication technology, the efficient utilization of spectrum resources, optimization of communication quality, and intelligent communication have become critical. Radio map reconstruction is essential for enabling advanced applications, yet challenges such as complex signal propagation and sparse data hinder accurate reconstruction. To address these issues, we propose the **Radio Map Diffusion Model (RMDM)**, a physics-informed framework that integrates **Physics-Informed Neural Networks (PINNs)** to incorporate constraints like the **Helmholtz equation**. RMDM employs a dual U-Net architecture: the first ensures physical consistency by minimizing PDE residuals, boundary conditions, and source constraints, while the second refines predictions via diffusion-based denoising. By leveraging physical laws, RMDM significantly enhances accuracy, robustness, and generalization. Experiments demonstrate that RMDM outperforms state-of-the-art methods, achieving **NMSE of 0.0031** and **RMSE of 0.0125** under the Static RM (SRM) setting, and **NMSE of 0.0047** and **RMSE of 0.0146** under the Dynamic RM (DRM) setting. These results establish a novel paradigm for integrating physics-informed and data-driven approaches in radio map reconstruction, particularly under sparse data conditions.||
|**2025-01-31**|[A theoretical framework for overfitting in energy-based modeling](http://arxiv.org/abs/2501.19158)|null|We investigate the impact of limited data on training pairwise energy-based models for inverse problems aimed at identifying interaction networks. Utilizing the Gaussian model as testbed, we dissect training trajectories across the eigenbasis of the coupling matrix, exploiting the independent evolution of eigenmodes and revealing that the learning timescales are tied to the spectral decomposition of the empirical covariance matrix. We see that optimal points for early stopping arise from the interplay between these timescales and the initial conditions of training. Moreover, we show that finite data corrections can be accurately modeled through asymptotic random matrix theory calculations and provide the counterpart of generalized cross-validation in the energy based model context. Our analytical framework extends to binary-variable maximum-entropy pairwise models with minimal variations. These findings offer strategies to control overfitting in discrete-variable models through empirical shrinkage corrections, improving the management of overfitting in energy-based generative models.||
|**2025-01-31**|[Ambient Denoising Diffusion Generative Adversarial Networks for Establishing Stochastic Object Models from Noisy Image Data](http://arxiv.org/abs/2501.19094)|null|It is widely accepted that medical imaging systems should be objectively assessed via task-based image quality (IQ) measures that ideally account for all sources of randomness in the measured image data, including the variation in the ensemble of objects to be imaged. Stochastic object models (SOMs) that can randomly draw samples from the object distribution can be employed to characterize object variability. To establish realistic SOMs for task-based IQ analysis, it is desirable to employ experimental image data. However, experimental image data acquired from medical imaging systems are subject to measurement noise. Previous work investigated the ability of deep generative models (DGMs) that employ an augmented generative adversarial network (GAN), AmbientGAN, for establishing SOMs from noisy measured image data. Recently, denoising diffusion models (DDMs) have emerged as a leading DGM for image synthesis and can produce higher image quality than GANs. However, original DDMs possess a slow image-generation process because of the Gaussian assumption in the denoising steps. More recently, denoising diffusion GAN (DDGAN) was proposed to permit fast image generation while maintaining high generated image quality that is comparable to the original DDMs. In this work, we propose an augmented DDGAN architecture, Ambient DDGAN (ADDGAN), for learning SOMs from noisy image data. Numerical studies that consider clinical computed tomography (CT) images and digital breast tomosynthesis (DBT) images are conducted. The ability of the proposed ADDGAN to learn realistic SOMs from noisy image data is demonstrated. It has been shown that the ADDGAN significantly outperforms the advanced AmbientGAN models for synthesizing high resolution medical images with complex textures.||
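|**2025-01-30**|[Diffusion Autoencoders are Scalable Image Tokenizers](http://arxiv.org/abs/2501.18593)|null|Tokenizing images into compact visual representations is a key step in learning efficient, high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, the diffusion L2 loss, can be used to train scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies the training of such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, and therefore need a complex training recipe that depends on non-trivially balancing different losses and pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to the current state-of-the-art supervised image tokenizers. DiTo achieves quality comparable to or better than the state of the art in image reconstruction and downstream image generation tasks.||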
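|**2025-01-30**|[DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models](http://arxiv.org/abs/2501.18590)|null|Understanding and modeling lighting effects are fundamental tasks in computer vision and graphics. Classic physically-based rendering (PBR) accurately simulates light transport, but relies on precise scene representations (explicit 3D geometry, high-quality material properties, and lighting conditions) that are often difficult to obtain in real-world scenarios. Therefore, we introduce DiffusionRenderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. Leveraging powerful video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing an interface for image editing tasks and training data for the rendering model. Conversely, our rendering model generates photorealistic images from G-buffers without explicit light transport simulation. Experiments demonstrate that DiffusionRenderer effectively approximates inverse and forward rendering, consistently outperforming the state of the art. Our model enables practical applications from a single video input, including relighting, material editing, and realistic object insertion.||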
|**2025-01-30**|[WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training](http://arxiv.org/abs/2501.18511)|**[link](https://github.com/penfever/wildchat-50m)**|Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.||
|**2025-01-30**|[Examining the Expanding Role of Synthetic Data Throughout the AI Development Pipeline](http://arxiv.org/abs/2501.18493)|null|Alongside the growth of generative AI, we are witnessing a surge in the use of synthetic data across all stages of the AI development pipeline. It is now common practice for researchers and practitioners to use one large generative model (which we refer to as an auxiliary model) to generate synthetic data that is used to train or evaluate another, reconfiguring AI workflows and reshaping the very nature of data. While scholars have raised concerns over the risks of synthetic data, policy guidance and best practices for its responsible use have not kept up with these rapidly evolving industry trends, in part because we lack a clear picture of current practices and challenges. Our work aims to address this gap. Through 29 interviews with AI practitioners and responsible AI experts, we examine the expanding role of synthetic data in AI development. Our findings reveal how auxiliary models are now widely used across the AI development pipeline. Practitioners describe synthetic data as crucial for addressing data scarcity and providing a competitive edge, noting that evaluation of generative AI systems at scale would be infeasible without auxiliary models. However, they face challenges controlling the outputs of auxiliary models, generating data that accurately depict underrepresented groups, and scaling data validation practices that are based primarily on manual inspection. We detail general limitations of and ethical considerations for synthetic data and conclude with a proposal of concrete steps towards the development of best practices for its responsible use.||
|**2025-01-30**|[How to Select Datapoints for Efficient Human Evaluation of NLG Models?](http://arxiv.org/abs/2501.18251)|**[link](https://github.com/zouharvi/subset2evaluate)**|Human evaluation is the gold-standard for evaluating text generation models. It is also expensive, and to fit budgetary constraints, a random subset of the test data is often chosen in practice. The randomly selected data may not accurately represent test performance, making this approach economically inefficient for model comparison. Thus, in this work, we develop a suite of selectors to get the most informative datapoints for human evaluation while taking the evaluation costs into account. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. We further develop an approach to distill these selectors to the scenario where the model outputs are not yet available. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts. We demonstrate the efficacy of our selectors in two common NLG tasks, machine translation and summarization, and show that up to only ~50% of the test data is needed to produce the same evaluation result as the entire data. Our implementations are published in the subset2evaluate package.||
|**2025-01-30**|[Free-T2M: Frequency Enhanced Text-to-Motion Diffusion Model With Consistency Loss](http://arxiv.org/abs/2501.18232)|**[link](https://github.com/Hxxxz0/Free-T2m)**|Rapid progress in text-to-motion generation has been largely driven by diffusion models. However, existing methods focus solely on temporal modeling, thereby overlooking frequency-domain analysis. We identify two key phases in motion denoising: the **semantic planning stage** and the **fine-grained improving stage**. To address these phases effectively, we propose **Fre**quency **e**nhanced **t**ext-**to**-**m**otion diffusion model (**Free-T2M**), incorporating stage-specific consistency losses that enhance the robustness of static features and improve fine-grained accuracy. Extensive experiments demonstrate the effectiveness of our method. Specifically, on StableMoFusion, our method reduces the FID from **0.189** to **0.051**, establishing a new SOTA performance within the diffusion architecture. These findings highlight the importance of incorporating frequency-domain insights into text-to-motion generation for more precise and robust results.||
|**2025-01-30**|[Inverse source problem of sub-diffusion of variable exponent](http://arxiv.org/abs/2501.18228)|null|This work investigates both direct and inverse problems of the variable-exponent sub-diffusion model, which attracts increasing attention in both practical applications and theoretical aspects. Based on the perturbation method, which transfers the original model to an equivalent but more tractable form, the analytical extensibility of the solutions and the weak unique continuation principle are proved, which results in the uniqueness of the inverse space-dependent source problem from local internal observation. Then, based on the variational identity connecting the inversion input data with the unknown source function, we propose a weak norm and prove the conditional stability for the inverse problem in this norm. The iterative thresholding algorithm and Nesterov iteration scheme are employed to numerically reconstruct the smooth and non-smooth sources, respectively. Numerical experiments are performed to investigate their effectiveness.||
|**2025-01-29**|[SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders](http://arxiv.org/abs/2501.18052)|**[link](https://github.com/cywinski/saeuron)**|Recent machine unlearning approaches offer a promising solution for removing unwanted concepts from diffusion models. However, traditional methods, which largely rely on fine-tuning, provide little insight into the changes they introduce to the base model, making it unclear whether concepts are truly removed or only masked. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to unlearn unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a method of selecting concept-specific features. This enables precise interventions on the model's activations to block targeted content while preserving the model's overall performance. Evaluation on the competitive UnlearnCanvas benchmark on object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron dismisses the possibility of generating unwanted content, even under adversarial attack.||
|**2025-01-29**|[Generative Unordered Flow for Set-Structured Data Generation](http://arxiv.org/abs/2501.17770)|null|Flow-based generative models have demonstrated promising performance across a broad spectrum of data modalities (e.g., image and text). However, there are few works exploring their extension to unordered data (e.g., spatial point set), which is not trivial because previous models are mostly designed for vector data that are naturally ordered. In this paper, we present unordered flow, a type of flow-based generative model for set-structured data generation. Specifically, we convert unordered data into an appropriate function representation, and learn the probability measure of such representations through function-valued flow matching. For the inverse map from a function representation to unordered data, we propose a method similar to particle filtering, with Langevin dynamics to first warm-up the initial particles and gradient-based search to update them until convergence. We have conducted extensive experiments on multiple real-world datasets, showing that our unordered flow model is very effective in generating set-structured data and significantly outperforms previous baselines.||
|**2025-01-29**|[A technical review of multi-omics data integration methods: from classical statistical to deep generative approaches](http://arxiv.org/abs/2501.17729)|null|The rapid advancement of high-throughput sequencing and other assay technologies has resulted in the generation of large and complex multi-omics datasets, offering unprecedented opportunities for advancing precision medicine strategies. However, multi-omics data integration presents significant challenges due to the high dimensionality, heterogeneity, experimental gaps, and frequency of missing values across data types. Computational methods have been developed to address these issues, employing statistical and machine learning approaches to uncover complex biological patterns and provide deeper insights into our understanding of disease mechanisms. Here, we comprehensively review state-of-the-art multi-omics data integration methods with a focus on deep generative models, particularly variational autoencoders (VAEs) that have been widely used for data imputation and augmentation, joint embedding creation, and batch effect correction. We explore the technical aspects of loss functions and regularisation techniques including adversarial training, disentanglement and contrastive learning. Moreover, we discuss recent advancements in foundation models and the integration of emerging data modalities, while describing the current limitations and outlining future directions for enhancing multi-modal methodologies in biomedical research.||
|**2025-01-28**|[CubeDiff: Repurposing Diffusion-Based Image Models for Panorama Generation](http://arxiv.org/abs/2501.17162)|null|We introduce a novel method for generating 360° panoramas from text prompts or images. Our approach leverages recent advances in 3D generation by employing multi-view diffusion models to jointly synthesize the six faces of a cubemap. Unlike previous methods that rely on processing equirectangular projections or autoregressive generation, our method treats each face as a standard perspective image, simplifying the generation process and enabling the use of existing multi-view diffusion models. We demonstrate that these models can be adapted to produce high-quality cubemaps without requiring correspondence-aware attention layers. Our model allows for fine-grained text control, generates high resolution panorama images and generalizes well beyond its training set, whilst achieving state-of-the-art results, both qualitatively and quantitatively. Project page: https://cubediff.github.io/||
|**2025-01-28**|[IC-Portrait: In-Context Matching for View-Consistent Personalized Portrait](http://arxiv.org/abs/2501.17159)|null|Existing diffusion models show great potential for identity-preserving generation. However, personalized portrait generation remains challenging due to the diversity in user profiles, including variations in appearance and lighting conditions. To address these challenges, we propose IC-Portrait, a novel framework designed to accurately encode individual identities for personalized portrait generation. Our key insight is that pre-trained diffusion models are fast learners (e.g.,100 ~ 200 steps) for in-context dense correspondence matching, which motivates the two major designs of our IC-Portrait framework. Specifically, we reformulate portrait generation into two sub-tasks: 1) Lighting-Aware Stitching: we find that masking a high proportion of the input image, e.g., 80%, yields a highly effective self-supervisory representation learning of reference image lighting. 2) View-Consistent Adaptation: we leverage a synthetic view-consistent profile dataset to learn the in-context correspondence. The reference profile can then be warped into arbitrary poses for strong spatial-aligned view conditioning. Coupling these two designs by simply concatenating latents to form ControlNet-like supervision and modeling, enables us to significantly enhance the identity preservation fidelity and stability. Extensive evaluations demonstrate that IC-Portrait consistently outperforms existing state-of-the-art methods both quantitatively and qualitatively, with particularly notable improvements in visual qualities. Furthermore, IC-Portrait even demonstrates 3D-aware relighting capabilities.||
|**2025-01-28**|[Generative diffusion models from a PDE perspective](http://arxiv.org/abs/2501.17054)|null|Diffusion models have become the de facto framework for generating new datasets. The core of these models lies in the ability to reverse a diffusion process in time. The goal of this manuscript is to explain, from a PDE perspective, how this method works and how to derive the PDE governing the reverse dynamics as well as to study its solution analytically. By linking forward and reverse dynamics, we show that the reverse process's distribution has its support contained within the original distribution. Consequently, diffusion methods, in their analytical formulation, do not inherently regularize the original distribution, and thus, there is no generalization principle. This raises a question: where does generalization arise, given that in practice it does occur? Moreover, we derive an explicit solution to the reverse process's SDE under the assumption that the starting point of the forward process is fixed. This provides a new derivation that links two popular approaches to generative diffusion models: stable diffusion (discrete dynamics) and the score-based approach (continuous dynamics). Finally, we explore the case where the original distribution consists of a finite set of data points. In this scenario, the reverse dynamics are explicit (i.e., the loss function has a clear minimizer), and solving the dynamics fails to generate new samples: the dynamics converge to the original samples. In a sense, solving the minimization problem exactly is "too good for its own good" (i.e., an overfitting regime).||
|**2025-01-28**|[Generative quantum combinatorial optimization by means of a novel conditional generative quantum eigensolver](http://arxiv.org/abs/2501.16986)|null|Quantum computing is entering a transformative phase with the emergence of logical quantum processors, which hold the potential to tackle complex problems beyond classical capabilities. While significant progress has been made, applying quantum algorithms to real-world problems remains challenging. Hybrid quantum-classical techniques have been explored to bridge this gap, but they often face limitations in expressiveness, trainability, or scalability. In this work, we introduce conditional Generative Quantum Eigensolver (conditional-GQE), a context-aware quantum circuit generator powered by an encoder-decoder Transformer. Focusing on combinatorial optimization, we train our generator for solving problems with up to 10 qubits, exhibiting nearly perfect performance on new problems. By leveraging the high expressiveness and flexibility of classical generative models, along with an efficient preference-based training scheme, conditional-GQE provides a generalizable and scalable framework for quantum circuit generation. Our approach advances hybrid quantum-classical computing and contributes to accelerate the transition toward fault-tolerant quantum computing.||
|**2025-01-28**|[Adversarial Masked Autoencoder Purifier with Defense Transferability](http://arxiv.org/abs/2501.16904)|null|The study of adversarial defense still struggles to combat advanced adversarial attacks. In contrast to most prior studies that rely on the diffusion model for test-time defense to remarkably increase the inference time, we propose Masked AutoEncoder Purifier (MAEP), which integrates Masked AutoEncoder (MAE) into an adversarial purifier framework for test-time purification. While MAEP achieves promising adversarial robustness, it particularly features model defense transferability and attack generalization without relying on using additional data that is different from the training dataset. To our knowledge, MAEP is the first study of adversarial purifier based on MAE. Extensive experimental results demonstrate that our method can not only maintain clean accuracy with only a slight drop but also exhibit a close gap between the clean and robust accuracy. Notably, MAEP trained on CIFAR10 achieves state-of-the-art performance even when tested directly on ImageNet, outperforming existing diffusion-based models trained specifically on ImageNet.||
|**2025-01-28**|[DIRIGENt: End-To-End Robotic Imitation of Human Demonstrations Based on a Diffusion Model](http://arxiv.org/abs/2501.16800)|null|There has been substantial progress in humanoid robots, with new skills continuously being taught, ranging from navigation to manipulation. While these abilities may seem impressive, the teaching methods often remain inefficient. To enhance the process of teaching robots, we propose leveraging a mechanism effectively used by humans: teaching by demonstrating. In this paper, we introduce DIRIGENt (DIrect Robotic Imitation GENeration model), a novel end-to-end diffusion approach that directly generates joint values from observing human demonstrations, enabling a robot to imitate these actions without any existing mapping between it and humans. We create a dataset in which humans imitate a robot and then use this collected data to train a diffusion model that enables a robot to imitate humans. The following three aspects are the core of our contribution. First is our novel dataset with natural pairs between human and robot poses, allowing our approach to imitate humans accurately despite the gap between their anatomies. Second, the diffusion input to our model alleviates the challenge of redundant joint configurations, limiting the search space. And finally, our end-to-end architecture from perception to action leads to an improved learning capability. Through our experimental analysis, we show that combining these three aspects allows DIRIGENt to outperform existing state-of-the-art approaches in the field of generating joint values from RGB images.||
|**2025-01-28**|[Algorithm for Automatic Legislative Text Consolidation](http://arxiv.org/abs/2501.16794)|null|This study introduces a method for automating the consolidation process in a legal context, a time-consuming task traditionally performed by legal professionals. We present a generative approach that processes legislative texts to automatically apply amendments. Our method employs a light quantized generative model, fine-tuned with LoRA, to generate accurate and reliable amended texts. To the authors' knowledge, this is the first time generative models are used on legislative text consolidation. Our dataset is publicly available on HuggingFace. Experimental results demonstrate a significant improvement in efficiency, offering faster updates to legal documents. A fully automated pipeline of legislative text consolidation can be done in a few hours, with a success rate of more than 63% on a difficult bill.||
|**2025-01-28**|[Exponential Family Attention](http://arxiv.org/abs/2501.16790)|**[link](https://github.com/yixinw-lab/efa)**|The self-attention mechanism is the backbone of the transformer neural network underlying most large language models. It can capture complex word patterns and long-range dependencies in natural language. This paper introduces exponential family attention (EFA), a probabilistic generative model that extends self-attention to handle high-dimensional sequence, spatial, or spatial-temporal data of mixed data types, including both discrete and continuous observations. The key idea of EFA is to model each observation conditional on all other existing observations, called the context, whose relevance is learned in a data-driven way via an attention-based latent factor model. In particular, unlike static latent embeddings, EFA uses the self-attention mechanism to capture dynamic interactions in the context, where the relevance of each context observation depends on other observations. We establish an identifiability result and provide a generalization guarantee on excess loss for EFA. Across real-world and synthetic data sets -- including U.S. city temperatures, Instacart shopping baskets, and MovieLens ratings -- we find that EFA consistently outperforms existing models in capturing complex latent structures and reconstructing held-out data.||
|**2025-01-28**|[FlexMotion: Lightweight, Physics-Aware, and Controllable Human Motion Generation](http://arxiv.org/abs/2501.16778)|null|Lightweight, controllable, and physically plausible human motion synthesis is crucial for animation, virtual reality, robotics, and human-computer interaction applications. Existing methods often compromise between computational efficiency, physical realism, or spatial controllability. We propose FlexMotion, a novel framework that leverages a computationally lightweight diffusion model operating in the latent space, eliminating the need for physics simulators and enabling fast and efficient training. FlexMotion employs a multimodal pre-trained Transformer encoder-decoder, integrating joint locations, contact forces, joint actuations and muscle activations to ensure the physical plausibility of the generated motions. FlexMotion also introduces a plug-and-play module, which adds spatial controllability over a range of motion parameters (e.g., joint locations, joint actuations, contact forces, and muscle activations). Our framework achieves realistic motion generation with improved efficiency and control, setting a new benchmark for human motion synthesis. We evaluate FlexMotion on extended datasets and demonstrate its superior performance in terms of realism, physical plausibility, and controllability.||
|**2025-01-28**|[DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation](http://arxiv.org/abs/2501.16764)|null|Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models. It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian splat grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views. The compatibility with image diffusion models enables seamless adaptions of numerous techniques for image generation to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism.||
|**2025-01-24**|[Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning](http://arxiv.org/abs/2501.14680)|null|Diffusion based Text-To-Music (TTM) models generate music corresponding to text descriptions. Typically UNet based diffusion models condition on text embeddings generated from a pre-trained large language model or from a cross-modality audio-language representation model. This work proposes a diffusion based TTM, in which the UNet is conditioned on both (i) a uni-modal language model (e.g., T5) via cross-attention and (ii) a cross-modal audio-language representation model (e.g., CLAP) via Feature-wise Linear Modulation (FiLM). The diffusion model is trained to exploit both a local text representation from the T5 and a global representation from the CLAP. Furthermore, we propose modifications that extract both global and local representations from the T5 through pooling mechanisms that we call mean pooling and self-attention pooling. This approach mitigates the need for an additional encoder (e.g., CLAP) to extract a global representation, thereby reducing the number of model parameters. Our results show that incorporating the CLAP global embeddings to the T5 local embeddings enhances text adherence (KL=1.47) compared to a baseline model solely relying on the T5 local embeddings (KL=1.54). Alternatively, extracting global text embeddings directly from the T5 local embeddings through the proposed mean pooling approach yields superior generation quality (FAD=1.89) while exhibiting marginally inferior text adherence (KL=1.51) against the model conditioned on both CLAP and T5 text embeddings (FAD=1.94 and KL=1.47). Our proposed solution is not only efficient but also compact in terms of the number of parameters required.||
|**2025-01-24**|[Towards Scalable Topological Regularizers](http://arxiv.org/abs/2501.14641)|null|Latent space matching, which consists of matching distributions of features in latent space, is a crucial component for tasks such as adversarial attacks and defenses, domain adaptation, and generative modelling. Metrics for probability measures, such as Wasserstein and maximum mean discrepancy, are commonly used to quantify the differences between such distributions. However, these are often costly to compute, or do not appropriately take the geometric and topological features of the distributions into consideration. Persistent homology is a tool from topological data analysis which quantifies the multi-scale topological structure of point clouds, and has recently been used as a topological regularizer in learning tasks. However, computation costs preclude larger scale computations, and discontinuities in the gradient lead to unstable training behavior such as in adversarial tasks. We propose the use of principal persistence measures, based on computing the persistent homology of a large number of small subsamples, as a topological regularizer. We provide a parallelized GPU implementation of this regularizer, and prove that gradients are continuous for smooth densities. Furthermore, we demonstrate the efficacy of this regularizer on shape matching, image generation, and semi-supervised learning tasks, opening the door towards a scalable regularizer for topological features.||
|**2025-01-24**|[Single-neuron deep generative model uncovers underlying physics of neuronal activity in Ca imaging data](http://arxiv.org/abs/2501.14615)|null|Calcium imaging has become a powerful alternative to electrophysiology for studying neuronal activity, offering spatial resolution and the ability to measure large populations of neurons in a minimally invasive manner. This technique has broad applications in neuroscience, neuroengineering, and medicine, enabling researchers to explore the relationship between neuron location and activity. Recent advancements in deep generative models (DGMs) have facilitated the modeling of neuronal population dynamics, uncovering latent representations that provide insights into behavior prediction and neuronal variance. However, these models often rely on spike inference algorithms and primarily focus on population-level dynamics, limiting their applicability for single-neuron analyses. To address this gap, we propose a novel framework for single-neuron representation learning using autoregressive variational autoencoders (AVAEs). Our approach embeds individual neurons' spatiotemporal signals into a reduced-dimensional space without the need for spike inference algorithms. The AVAE excels over traditional linear methods by generating more informative and discriminative latent representations, improving tasks such as visualization, clustering, and the understanding of neuronal activity. Additionally, the reconstruction performance of the AVAE outperforms the state of the art, demonstrating its ability to accurately recover the original fluorescence signal from the learned representation. Using realistic simulations, we show that our model captures underlying physical properties and connectivity patterns, enabling it to distinguish between different firing and connectivity types. These findings position the AVAE as a versatile and powerful tool for advancing single-neuron analysis and lay the groundwork for future integration of multimodal single-cell datasets in neuroscience.||
|**2025-01-24**|[Training-Free Style and Content Transfer by Leveraging U-Net Skip Connections in Stable Diffusion 2.*](http://arxiv.org/abs/2501.14524)|null|Despite significant recent advances in image generation with diffusion models, their internal latent representations remain poorly understood. Existing works focus on the bottleneck layer (h-space) of Stable Diffusion's U-Net or leverage the cross-attention, self-attention, or decoding layers. Our model, SkipInject, takes advantage of U-Net's skip connections. We conduct thorough analyses on the role of the skip connections and find that the residual connections passed by the third encoder block carry most of the spatial information of the reconstructed image, splitting the content from the style. We show that injecting the representations from this block can be used for text-based editing, precise modifications, and style transfer. We compare our method with state-of-the-art style transfer and image editing methods and demonstrate that our method obtains the best content alignment and optimal structural preservation tradeoff.||
|**2025-01-24**|[Pesti-Gen: Unleashing a Generative Molecule Approach for Toxicity Aware Pesticide Design](http://arxiv.org/abs/2501.14469)|null|Global climate change has reduced crop resilience and pesticide efficacy, making reliance on synthetic pesticides inevitable, even though their widespread use poses significant health and environmental risks. While these pesticides remain a key tool in pest management, previous machine-learning applications in pesticide and agriculture have focused on classification or regression, leaving the fundamental challenge of generating new molecular structures or designing novel candidates unaddressed. In this paper, we propose Pesti-Gen, a novel generative model based on variational auto-encoders, designed to create pesticide candidates with optimized properties for the first time. Specifically, Pesti-Gen leverages a two-stage learning process: an initial pre-training phase that captures a generalized chemical structure representation, followed by a fine-tuning stage that incorporates toxicity-specific information. The model simultaneously optimizes over multiple toxicity metrics, such as (1) livestock toxicity and (2) aqua toxicity to generate environmentally friendly pesticide candidates. Notably, Pesti-Gen achieves approximately 68% structural validity in generating new molecular structures, demonstrating the model's effectiveness in producing optimized and feasible pesticide candidates, thereby providing a new way for safer and more sustainable pest management solutions.||
|**2025-01-24**|[CENTS: Generating synthetic electricity consumption time series for rare and unseen scenarios](http://arxiv.org/abs/2501.14426)|null|Recent breakthroughs in large-scale generative modeling have demonstrated the potential of foundation models in domains such as natural language, computer vision, and protein structure prediction. However, their application in the energy and smart grid sector remains limited due to the scarcity and heterogeneity of high-quality data. In this work, we propose a method for creating high-fidelity electricity consumption time series data for rare and unseen context variables (e.g. location, building type, photovoltaics). Our approach, Context Encoding and Normalizing Time Series Generation, or CENTS, includes three key innovations: (i) A context normalization approach that enables inverse transformation for time series context variables unseen during training, (ii) a novel context encoder to condition any state-of-the-art time-series generator on arbitrary numbers and combinations of context variables, (iii) a framework for training this context encoder jointly with a time-series generator using an auxiliary context classification loss designed to increase expressivity of context embeddings and improve model performance. We further provide a comprehensive overview of different evaluation metrics for generative time series models. Our results highlight the efficacy of the proposed method in generating realistic household-level electricity consumption data, paving the way for training larger foundation models in the energy domain on synthetic as well as real-world data.||
|**2025-01-24**|[Uncovering the bias in the evidence for dynamical dark energy through minimal and generalized modeling approaches](http://arxiv.org/abs/2501.14366)|null|In this letter we argue that the CPL parameterisation for the dark energy equation of state is biased towards preferring such a model over the constant $w$ while the latter bounds are still compatible with LCDM. For that we compare constraints on the EoS parameters $w_0$ and early time type $w_a$ (CPL) against those with a late time parameterisation on $w_a$ (LZ) and the constant $w$ model, using CMB, Supernovae and BAO from DESI datasets. As with the CPL model, we found a preference for dynamical dark energy within the LZ model, but for values almost symmetrically distributed with respect to their LCDM limits. This is because the presence of $w_0$ allows each parametrisation to be recast so that it compensates the preference for $w\sim -1$ in the opposite direction. To further test our hypothesis, we fixed $w_0$ to -1 and followed a minimal approach by considering models that deviate by one free parameter, or we extended to more general models that either group both late and early effects, or allow the presence of two dark energy components, fluid-like and constant-like. We found that all the variants, except the original CPL, are still compatible with LCDM, with likelihoods peaking close to $w_0 = -1$, $w_a = 0$, or 0.68 for $\Omega_{\rm CC}$, with the constant $w$ and the late time $w_a$ having the smallest constraints. Although the evidence from CPL is stronger than that for the more minimal cases, the preference increases further for the more generalized parameterizations, while still staying compatible with LCDM in terms of the significance levels. We conclude that considering the CPL model is not sufficient to test deviation from the standard model and that it is necessary to conduct further minimal or more general approaches to better understand the outcomes from model testing and inference methods. (abridged)||
|**2025-01-24**|[Stochastic Method for Delayed Neutron Precursors Transport in Liquid Fuel](http://arxiv.org/abs/2501.14332)|null|This paper presents a novel stochastic method for modeling the transport of Delayed Neutron Precursors (DNPs) in liquid nuclear fuel. The method incorporates advection and diffusion effects into the Monte Carlo solution of the neutron balance equation by leveraging the Green's function of the advection-diffusion-reaction (ADR) equation. For a 1D system, we demonstrate that the Green's function can be interpreted as the Probability Density Function (PDF) of the position increment of a Brownian motion with drift. Using this interpretation, the position of DNPs is sampled via a time-of-flight process combined with a drift and diffusion model. The method is validated on a modified 1D rod problem, where results from the Monte Carlo implementation are compared against those obtained using a deterministic approach. The comparison confirms that the method accurately captures the impact of fuel velocity and diffusion on neutron flux. As expected, the fuel velocity shifts the neutron flux. Reactivity decreases as a function of speed while diffusion can counteract this decrease under certain conditions. While the current study is limited to 1D systems, the approach could be extended to higher dimensions and more complex geometries by replacing the Green's function with the Stochastic Differential Equation (SDE) associated with the ADR equation.||
|**2025-01-24**|[PAID: A Framework of Product-Centric Advertising Image Design](http://arxiv.org/abs/2501.14316)|null|In E-commerce platforms, a full advertising image is composed of a background image and marketing taglines. Automatic ad image design reduces human costs and plays a crucial role. For the convenience of users, a novel automatic framework named Product-Centric Advertising Image Design (PAID) is proposed in this work. PAID takes the product foreground image, required taglines, and target size as input and creates an ad image automatically. PAID consists of four sequential stages: prompt generation, layout generation, background image generation, and graphics rendering. Different expert models are trained to conduct these sub-tasks. A visual language model (VLM) based prompt generation model is leveraged to produce a product-matching background prompt. The layout generation model jointly predicts text and image layout according to the background prompt, product, and taglines to achieve the best harmony. An SDXL-based layout-controlled inpainting model is trained to generate an aesthetic background image. Previous ad image design methods take a background image as input and then predict the layout of taglines, which limits the spatial layout due to fixed image content. Innovatively, our PAID adjusts the stages to produce an unrestricted layout. To complete the PAID framework, we created two high-quality datasets, PITA and PIL. Extensive experimental results show that PAID creates more visually pleasing advertising images than previous methods.||
|**2025-01-24**|[CDI: Blind Image Restoration Fidelity Evaluation based on Consistency with Degraded Image](http://arxiv.org/abs/2501.14264)|null|Recent advancements in Blind Image Restoration (BIR) methods, based on Generative Adversarial Networks and Diffusion Models, have significantly improved visual quality. However, they present significant challenges for Image Quality Assessment (IQA), as the existing Full-Reference IQA methods often rate images with high perceptual quality poorly. In this paper, we reassess the Solution Non-Uniqueness and Degradation Indeterminacy issues of BIR, and propose constructing a specific BIR IQA system. Instead of directly comparing a restored image with a reference image, the BIR IQA evaluates fidelity by calculating the Consistency with Degraded Image (CDI). Specifically, we propose a wavelet domain Reference Guided CDI algorithm, which can acquire the consistency with a degraded image for various degradation types without requiring knowledge of degradation parameters. The supported degradation types include downsampling, blur, noise, JPEG, and complex combined degradations. In addition, we propose a Reference Agnostic CDI, enabling BIR fidelity evaluation without reference images. Finally, in order to validate the rationality of CDI, we create a new Degraded Images Switch Display Comparison Dataset (DISDCD) for subjective evaluation of BIR fidelity. Experiments conducted on DISDCD verify that CDI is markedly superior to common Full Reference IQA methods for BIR fidelity evaluation. The source code and the DISDCD dataset will be publicly available shortly.||
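|**2025-01-23**|[IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models](http://arxiv.org/abs/2501.13920)|null|With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showing impressive abilities in prompt following and image generation. Recently released models such as FLUX.1 and Ideogram2.0, along with others such as Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across a variety of complex tasks, raising the question of whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities in many domains, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks such as semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across these expanding domains. To evaluate these models thoroughly, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific-domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific-domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.||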
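|**2025-01-23**|[Improving Video Generation with Human Feedback](http://arxiv.org/abs/2501.13918)|null|Video generation has achieved significant advances through rectified flow techniques, but issues such as unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that uses human feedback to mitigate these problems and refine the video generation model. Specifically, we first construct a large-scale human preference dataset focused on modern video generation models, with pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices affect its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies, direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and that Flow-DPO outperforms both Flow-RWR and standard supervised fine-tuning. In addition, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.||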
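|**2025-01-23**|[Binary Diffusion Probabilistic Model](http://arxiv.org/abs/2501.13915)|null|We introduce the Binary Diffusion Probabilistic Model (BDPM), a novel generative model optimized for binary data representations. While denoising diffusion probabilistic models (DDPMs) have achieved notable success in tasks such as image synthesis and restoration, traditional DDPMs rely on continuous data representations and mean squared error (MSE) loss for training, applying Gaussian noise models that may not be optimal for discrete or binary data structures. BDPM addresses this by decomposing images into bitplanes and employing XOR-based noise transformations, with a denoising model trained using binary cross-entropy loss. This approach enables precise noise control and computationally efficient inference, significantly lowering computational costs and improving model convergence. When evaluated on image restoration tasks such as image super-resolution, inpainting, and blind image restoration, BDPM outperforms state-of-the-art methods on the FFHQ, CelebA, and CelebA-HQ datasets. Notably, BDPM requires fewer inference steps than traditional DDPM models to reach optimal results, showing improved inference efficiency.||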
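|**2025-01-23**|[A RAG-Based Institutional Assistant](http://arxiv.org/abs/2501.13880)|null|Although large language models (LLMs) demonstrate strong text generation capabilities, they struggle in scenarios that require access to structured knowledge bases or specific documents, which limits their effectiveness in knowledge-intensive tasks. To address this limitation, retrieval-augmented generation (RAG) models have been developed, enabling generative models to incorporate relevant document fragments into their inputs. In this paper, we design and evaluate a RAG-based virtual assistant specifically tailored for the University of São Paulo. Our system architecture comprises two key modules: a retriever and a generative model. We experiment with different types of models for both components, adjusting hyperparameters such as chunk size and the number of retrieved documents. Our best retriever model achieves a Top-5 accuracy of 30%, while our most effective generative model scores 22.04% against ground-truth answers. Notably, when the correct document chunks are supplied to the LLMs, accuracy improves significantly to 54.02%, an increase of over 30 percentage points. Conversely, without contextual input, performance drops to 13.68%. These findings highlight the critical role of database access in enhancing LLM performance. They also reveal the limitations of current semantic search methods in accurately identifying relevant documents and underscore the ongoing challenges LLMs face in generating precise responses.||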
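|**2025-01-23**|[Unveiling the Power of Noise Priors: Enhancing Diffusion Models for Mobile Traffic Prediction](http://arxiv.org/abs/2501.13794)|null|Accurate prediction of mobile traffic, i.e., network traffic from cellular base stations, is crucial for optimizing network performance and supporting urban development. However, the non-stationary nature of mobile traffic, driven by human activity and environmental changes, leads to both regular patterns and abrupt variations. Diffusion models excel at capturing such complex temporal dynamics because of their ability to capture inherent uncertainty. Most existing approaches prioritize designing novel denoising networks but often neglect the critical role of noise itself, potentially leading to sub-optimal performance. In this paper, we introduce a novel perspective by emphasizing the role of noise in the denoising process. Our analysis reveals that noise fundamentally shapes mobile traffic predictions, exhibiting distinct and consistent patterns. We propose NPDiff, a framework that decomposes noise into "prior" and "residual" components, with the "prior" derived from data dynamics, enhancing the model's ability to capture both regular and abrupt variations. NPDiff can be seamlessly integrated with various diffusion-based prediction models, delivering predictions that are effective, efficient, and robust. Extensive experiments demonstrate that it achieves superior performance with an improvement of over 30%, offering a new perspective on leveraging diffusion models in this domain.||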
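|**2025-01-23**|[An Efficient Diffusion-based Non-Autoregressive Solver for Traveling Salesman Problem](http://arxiv.org/abs/2501.13767)|**[link](https://github.com/deitsp/deitsp)**|Recent advances in neural models have shown considerable promise in solving the Traveling Salesman Problem (TSP) without relying on much hand-crafted engineering. However, while non-autoregressive (NAR) approaches benefit from the speed of parallel inference, they typically deliver solutions of inferior quality compared to autoregressive ones. To improve solution quality while maintaining fast inference, we propose DEITSP, a diffusion model with efficient iterations tailored for TSP that operates in a NAR manner. First, we introduce a one-step diffusion model that integrates a controlled discrete noise addition process with self-consistency enhancement, enabling optimal solution prediction through simultaneous denoising of multiple solutions. Second, we design a dual-modality graph transformer to strengthen the extraction and fusion of features from the node and edge modalities, while further accelerating inference with fewer layers. Third, we develop an efficient iterative strategy that alternates between adding and removing noise, improving exploration compared to previous diffusion methods. Additionally, we devise a scheduling framework that progressively refines the solution space by adjusting noise levels, facilitating a smooth search for optimal solutions. Extensive experiments on real-world and large-scale TSP instances demonstrate that DEITSP outperforms existing neural approaches in solution quality, inference latency, and generalization ability. Our code is available at https://github.com/DEITSP/DEITSP.||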
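|**2025-01-23**|[A Mutual Information Perspective on Multiple Latent Variable Generative Models for Positive View Generation](http://arxiv.org/abs/2501.13718)|null|In image generation, Multiple Latent Variable Generative Models (MLVGMs) employ multiple latent variables to gradually shape the final image, from global characteristics to finer local details (e.g., StyleGAN, NVAE), emerging as powerful tools for diverse applications. Yet their generative dynamics and latent variable utilization remain only empirically observed. In this work, we propose a novel framework that systematically quantifies the impact of each latent variable in MLVGMs, using Mutual Information (MI) as a guiding metric. Our analysis reveals underutilized variables and can guide the use of MLVGMs in downstream applications. Building on this foundation, we introduce a method for generating synthetic data for Self-Supervised Contrastive Representation Learning (SSCRL). By leveraging the hierarchical and disentangled variables of MLVGMs, and guided by the preceding analysis, we apply tailored latent perturbations to produce diverse views for SSCRL without relying on real data at all. Additionally, we introduce a Continuous Sampling (CS) strategy, in which the generator dynamically creates new samples during SSCRL training, greatly increasing data variability. Our comprehensive experiments demonstrate the effectiveness of these contributions, showing that views generated by MLVGMs perform on par with or better than views generated from real data. This work establishes a principled approach to understanding and exploiting MLVGMs, advancing both generative modeling and self-supervised learning.||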
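|**2025-01-23**|[Training-Free Consistency Pipeline for Fashion Repose](http://arxiv.org/abs/2501.13692)|null|Recent advances in diffusion models have significantly broadened the possibilities for editing images of real-world objects. However, performing non-rigid transformations, such as changing the pose of objects or image-based conditioning, remains challenging. Maintaining object identity during these edits is difficult, and current methods often fall short of the precision needed for industrial applications, where consistency is critical. Additionally, fine-tuning diffusion models requires custom training data, which is not always accessible in real-world scenarios. This work introduces FashionRepose, a training-free pipeline for non-rigid pose editing designed specifically for the fashion industry. The approach integrates off-the-shelf models to adjust the pose of long-sleeve garments while preserving identity and branding attributes. FashionRepose uses a zero-shot approach to perform these edits in near real time, requiring no specialized training and providing consistent image editing. The solution has potential applications in the fashion industry and in other fields that demand identity preservation in image editing.||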
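|**2025-01-23**|[One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt](http://arxiv.org/abs/2501.13554)|**[link](https://github.com/byliutao/1prompt1story)**|Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving content for storytelling. Existing approaches typically require extensive training on large datasets or additional modifications to the original model architecture, which limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe an inherent capability of language models, which we call context consistency: the ability to understand identity through context within a single prompt. Drawing inspiration from this inherent context consistency, we propose a novel training-free method for consistent text-to-image generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach concatenates all prompts into a single input for the text-to-image diffusion model, initially preserving character identities. We then refine the generation process using two novel techniques, Singular-Value Reweighting and Identity-Preserving Cross-Attention, to ensure better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent text-to-image generation approaches and demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at https://github.com/byliutao/1Prompt1Story.||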
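|**2025-01-23**|[Diffusion-based Perceptual Neural Video Compression with Temporal Diffusion Information Reuse](http://arxiv.org/abs/2501.13528)|null|Recently, foundational diffusion models have attracted considerable attention in image compression tasks, whereas their application to video compression remains largely unexplored. In this article, we introduce DiffVC, a diffusion-based perceptual neural video compression framework that effectively integrates a foundational diffusion model with the video conditional coding paradigm. The framework uses temporal context from the previously decoded frame and the reconstructed latent representation of the current frame to guide the diffusion model in generating high-quality results. To accelerate the iterative inference process of the diffusion model, we propose the Temporal Diffusion Information Reuse (TDIR) strategy, which significantly improves inference efficiency with minimal performance loss by reusing diffusion information from previous frames. Additionally, to address the challenges posed by distortion differences across bitrates, we propose the Quantization Parameter-based Prompting (QPP) mechanism, which uses quantization parameters as prompts fed into the foundational diffusion model to explicitly modulate intermediate features, thereby enabling a robust variable-bitrate diffusion-based neural compression framework. Experimental results demonstrate that our proposed solution delivers excellent performance in both perceptual metrics and visual quality.||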
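|**2025-01-21**|[Towards Affordance-Aware Articulation Synthesis for Rigged Objects](http://arxiv.org/abs/2501.12393)|null|Rigged objects are common in artist pipelines, as they can flexibly adapt to different scenes and postures. However, articulating rigs into realistic affordance-aware postures (e.g., following the context, respecting the physics and the characteristics of the object) remains time-consuming and relies heavily on manual labor from experienced artists. In this paper, we tackle this novel problem and design A3Syn. Given a context, such as an environment mesh and a text prompt of the desired posture, A3Syn synthesizes articulation parameters for arbitrary, open-domain rigged objects obtained from the Internet. The task is extremely challenging due to the lack of training data, and we make no topological assumptions about open-domain rigs. We propose using a 2D inpainting diffusion model and several control techniques to synthesize in-context affordance information. We then develop an efficient bone correspondence alignment method that combines differentiable rendering and semantic correspondence. A3Syn converges stably, completes in minutes, and synthesizes plausible affordances across different combinations of in-the-wild object rigs and scenes.||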
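|**2025-01-21**|[GPS as a Control Signal for Image Generation](http://arxiv.org/abs/2501.12390)|null|We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.||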
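|**2025-01-21**|[Audio Texture Manipulation by Exemplar-Based Analogy](http://arxiv.org/abs/2501.12385)|null|Audio texture manipulation involves modifying the perceptual characteristics of a sound to achieve specific transformations, such as adding, removing, or replacing auditory elements. In this paper, we propose an exemplar-based analogy model for audio texture manipulation. Instead of conditioning on text-based instructions, our method uses paired speech examples, where one clip represents the original sound and the other illustrates the desired transformation. The model learns to apply the same transformation to new input, allowing for the manipulation of sound textures. We construct a quadruplet dataset representing various editing tasks and train a latent diffusion model in a self-supervised manner. We show through quantitative evaluations and perceptual studies that our model outperforms text-conditioned baselines and generalizes to real-world, out-of-distribution, and non-speech scenarios. Project page: https://berkeley-speech-group.github.io/audio-texture-analogy/||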
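|**2025-01-21**|[Accelerating Pulsar Parameter Estimation Using Convolutional Neural Networks](http://arxiv.org/abs/2501.12383)|null|Accurate neutron star models are essential for constraining the dense matter equation of state. However, realistic models that incorporate magnetic field structures are exceedingly computationally intensive. In this work, we develop a neural network (NN) emulator to generate model thermal X-ray light curves for millisecond pulsars with multipolar magnetic fields. We assess the NN's predictive and computational performance across a broad parameter space. We find that, for a static vacuum field (SVF) model, the NN provides a speedup of more than 400 times. We integrate this NN emulator into a Markov Chain Monte Carlo (MCMC) framework to replace the computationally expensive physical model during parameter exploration. Applied to PSR J0030+0451, this approach allows the MCMC to reach equilibrium, which is not achievable using the original physical model alone. We compare posterior distributions by running equivalent MCMC iterations with both the NN and the physical model, and we evaluate the differences in the distributions when continuing the physical model MCMC from the NN MCMC equilibrium state. Our NN architecture is agnostic to the underlying physics of the physical model and can be trained for any other physical model. The NN's speed remains the same regardless of the complexity of the physical model it was trained to emulate, allowing for even greater speedups relative to physical models that are more complex than the SVF model.||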
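|**2025-01-21**|[DiffDoctor: Diagnosing Image Diffusion Models Before Treating](http://arxiv.org/abs/2501.12382)|null|Despite recent progress, image diffusion models still produce artifacts. A common solution is to refine an established model with a quality assessment system, which generally rates an image as a whole. In this work, we argue that problem-solving starts with identification, so the model should be aware not only of whether an image contains defects but also of where exactly they are. Motivated by this, we propose DiffDoctor, a two-stage pipeline that helps image diffusion models generate fewer artifacts. Concretely, the first stage aims to develop a robust artifact detector, for which we collect a dataset of over 1M flawed synthesized images and set up an efficient human-in-the-loop annotation process that incorporates a carefully designed class-balance strategy. The learned artifact detector is then used in the second stage to tune the diffusion model by assigning a per-pixel confidence map to each synthesis. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness of our artifact detector as well as the soundness of our diagnose-then-treat design.||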
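|**2025-01-21**|[Video Depth Anything: Consistent Depth Estimation for Super-Long Videos](http://arxiv.org/abs/2501.12375)|null|Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue, for example by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (< 10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a simple yet effective temporal consistency loss that constrains the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, we develop a novel key-frame-based strategy for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state of the art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.||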
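|**2025-01-21**|[VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models](http://arxiv.org/abs/2501.12267)|null|Recent video inpainting methods have achieved encouraging improvements by using optical flow to guide pixel propagation from reference frames, either in image space or in feature space. However, they produce severe artifacts at the center of the mask when the masked area is too large and no pixel correspondences can be found there. Recently, diffusion models have demonstrated impressive performance in generating diverse and high-quality images, and they have been exploited in a number of image inpainting works. These methods, however, cannot be applied directly to videos to produce temporally coherent inpainting results. In this paper, we propose a training-free framework, named VipDiff, for conditioning the diffusion model during the reverse diffusion process to produce temporally coherent inpainting results without requiring any training data or fine-tuning of the pre-trained diffusion model. VipDiff takes optical flow as guidance to extract valid pixels from reference frames, which serve as constraints for optimizing the randomly sampled Gaussian noise, and uses the generated results for further pixel propagation and conditional generation. VipDiff also allows diverse video inpainting results to be generated from different sampled noise. Experiments demonstrate that VipDiff substantially outperforms state-of-the-art video inpainting methods in terms of both spatial-temporal coherence and fidelity.||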
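|**2025-01-21**|[Joint Reconstruction and Motion Estimation in Sparse-View 4DCT Using Diffusion Models within a Blind Inverse Problem Framework](http://arxiv.org/abs/2501.12249)|null|Four-dimensional computed tomography (4DCT) is essential for medical imaging applications such as radiotherapy, which demand a precise representation of respiratory motion. Traditional methods for reconstructing 4DCT data suffer from artifacts and noise, especially in sparse-view, low-dose settings. We propose a novel framework that integrates motion correction and diffusion models (DMs) within a blind inverse problem formulation. By leveraging prior probability distributions from DMs, we enhance the joint reconstruction and motion estimation process, improving image quality and preserving resolution. Experiments on extended cardiac-torso (XCAT) phantom data demonstrate that our method outperforms existing techniques, yielding artifact-free, high-resolution reconstructions even under irregular breathing conditions. These results showcase the potential of combining DMs with motion correction to advance sparse-view 4DCT imaging.||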
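|**2025-01-21**|[InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models](http://arxiv.org/abs/2501.12231)|null|The improved competence of generative models can help in building multi-modal virtual assistants that leverage modalities beyond language. By observing humans performing multi-step tasks, one can build assistants that have situational awareness of the actions and tasks being performed, enabling them to tailor their assistance to this understanding. In this paper, we develop a Context-aware Instructional Task Assistant with Multi-modal Large Language Models (InsTALL) that uses an online visual stream (e.g., a user's screen share or video recording) to respond in real time to user queries related to the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal model on task videos and paired textual data, and 2) automatically extracts a task graph from video data and leverages it at training and inference time. We show that InsTALL achieves state-of-the-art performance on the sub-tasks considered for multimodal activity understanding, namely task recognition (TR), action recognition (AR), next action prediction (AP), and plan prediction (PP), and outperforms existing baselines on two novel sub-tasks related to automatic error identification.||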
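|**2025-01-21**|[TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space](http://arxiv.org/abs/2501.12224)|null|We present TokenVerse, a method for multi-concept personalization that leverages a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. Unlike existing works, TokenVerse can handle multiple images with multiple concepts each, and it supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes an image and a text description as input and finds, for each word, a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings and show its advantages over existing methods. The project webpage is at https://token-verse.github.io/||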
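|**2025-01-17**|[Zero-Shot Monocular Scene Flow Estimation in the Wild](http://arxiv.org/abs/2501.10357)|null|Large models have shown generalization across datasets for many low-level vision tasks, such as depth estimation, but no such general model exists yet for scene flow. Even though scene flow has wide potential use, it is not widely used in practice because current predictive models do not generalize well. We identify three key challenges and propose a solution for each. First, we create a method that jointly estimates geometry and motion for accurate prediction. Second, we alleviate the scarcity of scene flow data with a data recipe that provides 1M annotated training samples across diverse synthetic scenes. Third, we evaluate different parameterizations for scene flow prediction and adopt a natural and effective one. Our resulting model outperforms existing methods, as well as baselines built on large-scale models, in terms of 3D end-point error, and it shows zero-shot generalization to casually captured videos from DAVIS and robotic manipulation scenes from RoboTAP. Overall, our approach makes scene flow prediction more practical in the wild.||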
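|**2025-01-17**|[DiffStereo: High-Frequency Aware Diffusion Model for Stereo Image Restoration](http://arxiv.org/abs/2501.10325)|null|Diffusion models (DMs) have achieved promising results in image restoration but have not yet been applied to stereo images. Applying DMs to stereo image restoration raises a series of challenges. The need to reconstruct two images exacerbates the computational cost of DMs. Moreover, existing latent DMs usually focus on semantic information and, during latent compression, discard high-frequency details as redundancy, even though these details are precisely what matters for image restoration. To address these problems, we propose DiffStereo, a high-frequency aware diffusion model for stereo image restoration, as the first attempt at applying DMs in this domain. Specifically, DiffStereo first learns latent high-frequency representations (LHFR) of high-quality images. A DM is then trained in the learned space to estimate the LHFR for stereo images, which are fused into a transformer-based stereo image restoration network to provide beneficial high-frequency information from the corresponding high-quality images. The resolution of the LHFR is kept the same as that of the input images, which protects the restoration from texture distortion, and channel-wise compression alleviates the computational burden of the DM. Furthermore, we devise a position encoding scheme for integrating the LHFR into the restoration network, enabling distinctive guidance at different depths of the network. Comprehensive experiments verify that, by combining a generative DM and a transformer, DiffStereo achieves both higher reconstruction accuracy and better perceptual quality than state-of-the-art methods on stereo super-resolution, deblurring, and low-light enhancement.||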
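|**2025-01-17**|[AI-Generated Music Detection and its Challenges](http://arxiv.org/abs/2501.10111)|**[link](https://github.com/deezer/deepfake-detector)**|In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. In particular, the ability to create credible minute-long synthetic music in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and artificial reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of an AI-music detector, a tool that will help in the regulation of synthetic media. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that a good test score is not the end of the story. We expose and discuss several issues that such a deployed detector may face: robustness to audio manipulation and generalization to unseen models. This second part serves as a position statement on future research steps in the field and as a caveat to the flourishing market of artificial content checkers.||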
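|**2025-01-17**|[DiffVSR: Enhancing Real-World Video Super-Resolution with Diffusion Models for Advanced Visual Quality and Temporal Consistency](http://arxiv.org/abs/2501.10110)|null|Diffusion models have demonstrated exceptional capabilities in image generation and restoration, yet their application to video super-resolution faces significant challenges in maintaining both high fidelity and temporal consistency. We present DiffVSR, a diffusion-based framework for real-world video super-resolution that effectively addresses these challenges through key innovations. For intra-sequence coherence, we develop a multi-scale temporal attention module and a temporal-enhanced VAE decoder that capture fine-grained motion details. To ensure inter-sequence stability, we introduce a noise rescheduling mechanism with an interweaved latent transition approach, which enhances temporal consistency without additional training overhead. We propose a progressive learning strategy that transitions from simple to complex degradations, enabling robust optimization even when high-quality video data is limited. Extensive experiments demonstrate that DiffVSR delivers superior results in both visual quality and temporal consistency, setting a new performance standard for real-world video super-resolution.||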
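|**2025-01-17**|[Conditional Latent Diffusion-Based Speech Enhancement Via Dual Context Learning](http://arxiv.org/abs/2501.10052)|**[link](https://github.com/alibabasglab/cldm-dcl)**|Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or spectral domains, leading to increased generation complexity and slower inference. Additionally, these methods have primarily modelled the distribution of clean speech, with limited exploration of noise distributions, which constrains the discriminative capability of diffusion models for speech enhancement. To address these issues, we propose a novel approach that integrates a conditional latent diffusion model (cLDM) with dual-context learning (DCL). Our method uses a variational autoencoder (VAE) to compress mel-spectrograms into a low-dimensional latent space. We then apply the cLDM to transform the latent representations of both clean speech and background noise into Gaussian noise through the DCL process, and we train a parameterized model to reverse this process, conditioned on noisy latent representations and text embeddings. By operating in a lower-dimensional space, the latent representations reduce the complexity of the generation process, while the DCL process enhances the model's ability to handle diverse and unseen noise environments. Our experiments show that the proposed method performs strongly compared to existing diffusion-based methods, even with fewer iterative steps, and they highlight the superior generalization of our model to out-of-domain noise datasets (https://github.com/modelscope/ClearerVoice-Studio).||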
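|**2025-01-17**|[DiffuEraser: A Diffusion Model for Video Inpainting](http://arxiv.org/abs/2501.10018)|**[link](https://github.com/lixiaowen-xw/diffueraser)**|Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation, using optical flow to restore textures and objects from information in neighboring frames while completing masked regions through visual Transformers. However, these approaches often encounter blurring and temporal inconsistencies when dealing with large masks, highlighting the need for models with stronger generative capabilities. Recently, diffusion models have emerged as a prominent technique in image and video generation due to their impressive performance. In this paper, we introduce DiffuEraser, a video inpainting model based on Stable Diffusion, designed to fill masked regions with greater detail and more coherent structures. We incorporate prior information to provide initialization and weak conditioning, which helps reduce noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the temporal receptive fields of both the prior model and DiffuEraser, and we further enhance consistency by leveraging the temporal smoothing property of video diffusion models. Experimental results demonstrate that our proposed method outperforms state-of-the-art techniques in both content completeness and temporal consistency while maintaining acceptable efficiency.||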
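|**2025-01-17**|[Enhancing Crash Frequency Modeling Based on Augmented Multi-Type Data by Hybrid VAE-Diffusion-Based Generative Neural Networks](http://arxiv.org/abs/2501.10017)|null|Crash frequency modelling analyzes the impact of factors such as traffic volume, road geometry, and environmental conditions on crash occurrences. Inaccurate predictions can distort our understanding of these factors, leading to misguided policies and wasted resources, which jeopardize traffic safety. A key challenge in crash frequency modelling is the prevalence of excessive zero observations, caused by underreporting, the low probability of crashes, and high data collection costs. These zero observations often reduce model accuracy and introduce bias, complicating safety decision making. While existing approaches, such as statistical methods, data aggregation, and resampling, attempt to address this issue, they either rely on restrictive assumptions or cause significant information loss, distorting crash data. To overcome these limitations, we propose a hybrid VAE-Diffusion neural network, designed to reduce zero observations and handle the complexities of multi-type tabular crash data (count, ordinal, nominal, and real-valued variables). We assess the quality of the synthetic data generated by this model through metrics such as similarity, accuracy, diversity, and structural consistency, and we compare its predictive performance against traditional statistical models. Our findings demonstrate that the hybrid VAE-Diffusion model outperforms baseline models across all metrics, offering a more effective approach to augmenting crash data and improving the accuracy of crash frequency predictions. This study highlights the potential of synthetic data to enhance traffic safety by improving crash frequency modelling and informing better policy decisions.||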
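|**2025-01-17**|[RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation](http://arxiv.org/abs/2501.09982)|null|Text-to-video generation models have made impressive progress, but they still struggle to generate videos with complex features. This limitation often arises from the inability of the text encoder to produce accurate embeddings, which hinders the video generation model. In this work, we propose a novel approach to overcome this challenge by selecting the optimal text embedding through interpolation in the embedding space. We demonstrate that this method enables the video generation model to produce the desired videos. Additionally, we introduce a simple algorithm that uses perpendicular foot embeddings and cosine similarity to identify the optimal interpolation embedding. Our findings highlight the importance of accurate text embeddings and offer a pathway for improving text-to-video generation performance.||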
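|**2025-01-17**|[GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions](http://arxiv.org/abs/2501.09972)|null|Composing music for video is essential yet challenging, leading to growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present the General Video-to-Music Generation model (GVMGen), designed to generate music that is highly related to the video input. Our model employs hierarchical attentions to extract video features and align them with music in both spatial and temporal dimensions, ensuring that pertinent features are preserved while redundancy is minimized. Remarkably, our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing video-music alignment. Additionally, we have compiled a large-scale dataset comprising diverse types of video-music pairs. Experimental results demonstrate that GVMGen surpasses previous models in music-video correspondence, generative diversity, and application universality.||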
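|**2025-01-17**|[Physics-informed DeepCT: Sinogram Wavelet Decomposition Meets Masked Diffusion](http://arxiv.org/abs/2501.09935)|**[link](https://github.com/yqx7150/swarm)**|Diffusion models show remarkable potential for sparse-view computed tomography (SVCT) reconstruction. However, when a network is trained on a limited sample space, its generalization capability may be constrained, which degrades performance on unfamiliar data. For image generation tasks, this can lead to issues such as blurry details and inconsistencies between regions. To alleviate this problem, we propose a Sinogram-based Wavelet random decomposition And Random mask diffusion Model (SWARM) for SVCT reconstruction. Specifically, introducing a random mask strategy in the sinogram effectively expands the limited training sample space. This enables the model to learn a broader range of data distributions, enhancing its understanding of, and generalization to, data uncertainty. In addition, applying a random training strategy to the high-frequency components of the sinogram wavelet enhances feature representation and improves the ability to capture details in different frequency bands, thereby improving performance and robustness. A two-stage iterative reconstruction method is adopted to ensure the global consistency of the reconstructed image while refining its details. Experimental results demonstrate that SWARM outperforms competing approaches in both quantitative and qualitative performance across various datasets.||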
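|**2025-01-16**|[SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces](http://arxiv.org/abs/2501.09756)|null|We introduce SynthLight, a diffusion model for portrait relighting. Our approach frames image relighting as a re-rendering problem, in which pixels are transformed in response to changes in environmental lighting conditions. Using a physically based rendering engine, we synthesize a dataset that simulates this lighting-conditioned transformation with 3D head assets under varying lighting. We propose two training and inference strategies to bridge the gap between the synthetic and real image domains: (1) multi-task training that takes advantage of real human portraits without lighting labels; (2) an inference-time diffusion sampling procedure based on classifier-free guidance that leverages the input portrait to better preserve details. Our method generalizes to diverse real photographs and produces realistic illumination effects, including specular highlights and cast shadows, while preserving the subject's identity. Our quantitative experiments on Light Stage data demonstrate results comparable to state-of-the-art relighting methods. Our qualitative results on in-the-wild images showcase rich and unprecedented illumination effects. Project page: https://vrroom.github.io/synthlight/||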
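|**2025-01-16**|[Learnings from Scaling Visual Tokenizers for Reconstruction and Generation](http://arxiv.org/abs/2501.09755)|null|Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its reconstruction objective and downstream generative performance. Our work explores scaling in auto-encoders to fill this gap. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation, and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but has mixed benefits for generation. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves performance competitive with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.||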
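|**2025-01-16**|[Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps](http://arxiv.org/abs/2501.09732)|null|Generative models have made significant impacts across various domains, largely because of their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by scaling laws. Recent research has begun to explore inference-time scaling behavior in large language models (LLMs), revealing how performance can be improved further with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation through the number of denoising steps, although the performance gains typically flatten after a few dozen steps. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how generation performance can improve further with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditional and text-conditional image generation benchmarks, our findings show that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and that, given the complicated nature of images, combinations of the components in the framework can be chosen specifically to suit different application scenarios.||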
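|**2025-01-16**|[Comparative Insights from 12 Machine Learning Models in Extracting Economic Ideology from Political Text](http://arxiv.org/abs/2501.09719)|null|This study systematically assesses the capabilities of 12 machine learning models and model variations in detecting economic ideology. As an evaluation benchmark, I use manifesto data spanning six elections in the United Kingdom that has been pre-annotated by expert and crowd coders. The analysis assesses the performance of several generative, fine-tuned, and zero-shot models at the granular and aggregate levels. The results show that generative models such as GPT-4o and Gemini 1.5 Flash consistently outperform other models on all benchmarks. However, they raise issues of accessibility and resource availability. Fine-tuned models yield competitive performance and offer a reliable alternative through domain-specific optimization, but their dependence on training data severely limits scalability. Zero-shot models consistently struggle to identify signals of economic ideology, often resulting in negative associations with human coding. Using general knowledge for the domain-specific task of ideology scaling proved unreliable. Other key findings include considerable within-party variation, fine-tuning that benefits from larger training data, and the sensitivity of zero-shot models to prompt content. The assessment covers the strengths and limitations of each model and derives best practices for automated analyses of political content.||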
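|**2025-01-16**|[Reward-Guided Controlled Generation for Inference-Time Alignment in Diffusion Models: Tutorial and Review](http://arxiv.org/abs/2501.09685)|null|This tutorial provides an in-depth guide to inference-time guidance and alignment methods for optimizing downstream reward functions in diffusion models. While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability, affinity in proteins, closeness to target structures). In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine-tuning. This tutorial explores the foundational aspects of such inference-time algorithms. We review these methods from a unified perspective, demonstrating that current techniques, such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling, and classifier guidance, aim to approximate soft optimal denoising processes (also known as policies in reinforcement learning) that combine pre-trained denoising processes with value functions serving as look-ahead functions that predict terminal rewards from intermediate states. Within this framework, we present several novel algorithms not yet covered in the literature. Furthermore, we discuss (1) fine-tuning methods combined with inference-time techniques, (2) inference-time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference-time algorithms in language models and diffusion models. The code for this tutorial on protein design is available at https://github.com/masa-ue/AlignInversePro.||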
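|**2025-01-16**|[AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation](http://arxiv.org/abs/2501.09503)|null|Recently, large-scale generative models have demonstrated outstanding text-to-image generation capabilities. However, generating high-fidelity personalized images with specific subjects still presents challenges, especially in cases involving multiple subjects. In this paper, we propose AnyStory, a unified approach for personalized subject generation. AnyStory achieves high-fidelity personalization not only for single subjects but also for multiple subjects, without sacrificing subject fidelity. Specifically, AnyStory models the subject personalization problem in an "encode-then-route" manner. In the encoding step, AnyStory uses a universal and powerful image encoder, namely ReferenceNet, in conjunction with a CLIP vision encoder to achieve high-fidelity encoding of subject features. In the routing step, AnyStory uses a decoupled, instance-aware subject router to accurately perceive and predict the potential location of the corresponding subject in the latent space, and to guide the injection of subject conditions. Detailed experimental results demonstrate the excellent performance of our method in retaining subject details, aligning with text descriptions, and personalizing multiple subjects. The project page is at https://aigcdesigngroup.github.io/AnyStory/.||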
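|**2025-01-16**|[Pruning for Sparse Diffusion Models based on Gradient Flow](http://arxiv.org/abs/2501.09464)|null|Diffusion Models (DMs) have impressive capabilities among generative models, but they suffer from slow inference and high computational cost. Previous works use one-shot structural pruning to derive lightweight DMs from pre-trained ones, but this approach often leads to a significant drop in generation quality and may remove crucial weights. We therefore propose an iterative pruning method based on gradient flow, comprising a gradient flow pruning process and a gradient flow pruning criterion. We employ a progressive soft pruning strategy that maintains the continuity of the mask matrix and guides it along the gradient flow of the energy function, according to the pruning criterion in sparse space, thereby avoiding the sudden information loss typically caused by one-shot pruning. The gradient-flow-based criterion prunes parameters whose removal increases the gradient norm of the loss function, which enables fast convergence of the pruned model during the iterative pruning stage. Our extensive experiments on widely used datasets show that our method achieves superior performance in efficiency and in consistency with pre-trained models.||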
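|**2025-01-16**|[CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation](http://arxiv.org/abs/2501.09433)|**[link](https://github.com/ncsoft/CaPa)**|The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite the proliferation of 3D generation algorithms, they frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While some studies have addressed some of these issues, a comprehensive solution remains elusive. In this paper, we introduce CaPa, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process that decouples geometry generation from texture synthesis. First, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Subsequently, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for the given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, yielding cohesive results across the entire model. The pipeline generates high-quality 3D assets in under 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results show that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.||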
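|**2025-01-16**|[Contract-Inspired Contest Theory for Controllable Image Generation in Mobile Edge Metaverse](http://arxiv.org/abs/2501.09391)|null|The rapid advancement of immersive technologies has propelled the development of the Metaverse, where the convergence of virtual and physical realities requires the generation of high-quality, photorealistic images to enhance user experience. However, generating these images, especially through Generative Diffusion Models (GDMs), in mobile edge computing environments presents significant challenges because of the limited computing resources of edge devices and the dynamic nature of wireless networks. This paper proposes a novel framework that integrates contract-inspired contest theory, Deep Reinforcement Learning (DRL), and GDMs to optimize image generation in these resource-constrained environments. The framework addresses the critical challenges of resource allocation and semantic data transmission quality by incentivizing edge devices to efficiently transmit high-quality semantic data, which is essential for creating realistic and immersive images. The use of contest and contract theory ensures that edge devices are motivated to allocate resources effectively, while DRL dynamically adapts to network conditions, optimizing the overall image generation process. Experimental results show that, compared with traditional methods, the proposed approach not only improves the quality of generated images but also achieves faster convergence and greater stability. This makes the framework particularly effective for optimizing complex resource allocation tasks in mobile edge Metaverse applications, offering enhanced performance and efficiency in creating immersive virtual environments.||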
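|**2025-01-16**|[UVRM: A Scalable 3D Reconstruction Model from Unposed Videos](http://arxiv.org/abs/2501.09347)|null|Large Reconstruction Models (LRMs) have recently become a popular method for creating 3D foundation models. Training 3D reconstruction models conventionally requires 2D visual data together with prior knowledge of the camera poses for the training samples, a process that is both time-consuming and error-prone. Consequently, 3D reconstruction training has been confined to either synthetic 3D datasets or small-scale datasets with annotated poses. In this study, we investigate the feasibility of 3D reconstruction using unposed video data of various objects. We introduce UVRM, a novel 3D reconstruction model that can be trained and evaluated on monocular videos without requiring any pose information. UVRM uses a Transformer network to implicitly aggregate video frames into a pose-invariant latent feature space, which is then decoded into a tri-plane 3D representation. To remove the need for ground-truth pose annotations during training, UVRM combines the score distillation sampling (SDS) method with an analysis-by-synthesis approach, progressively synthesizing pseudo novel views using a pre-trained diffusion model. We evaluate the performance of UVRM qualitatively and quantitatively on the G-Objaverse and CO3D datasets without relying on pose information. Extensive experiments show that UVRM can effectively and efficiently reconstruct a wide range of 3D objects from unposed videos.||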
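|**2025-01-14**|[DAViD: Modeling Dynamic Affordance of 3D Objects using Pre-trained Video Diffusion Models](http://arxiv.org/abs/2501.08333)|null|Understanding the ability of humans to use objects is crucial for AI to improve daily life. Existing studies on learning this ability focus on human-object patterns in static situations (e.g., contact, spatial relation, orientation), while learning Human-Object Interaction (HOI) patterns over time (i.e., the movement of the human and the object) remains relatively less explored. In this paper, we introduce a novel type of affordance named Dynamic Affordance. For a given input 3D object mesh, we learn dynamic affordance, which models the distribution of both (1) human motion and (2) human-guided object pose during interactions. As a core idea, we present a method to learn the 3D dynamic affordance from synthetically generated 2D videos, leveraging a pre-trained video diffusion model. Specifically, we propose a pipeline that first generates 2D HOI videos from the 3D object and then lifts them into 3D to generate 4D HOI samples. Once we have generated diverse 4D HOI samples on various target objects, we train our DAViD, for which we present a method based on Low-Rank Adaptation (LoRA) modules for a pre-trained human motion diffusion model (MDM) and an object pose diffusion model with human pose guidance. Our motion diffusion model extends to multi-object interactions, demonstrating the advantage of our LoRA pipeline for combining the concepts of object usage. Through extensive experiments, we show that our DAViD outperforms the baselines in generating human motion with HOIs.||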
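|**2025-01-14**|[MangaNinja: Line Art Colorization with Precise Reference Following](http://arxiv.org/abs/2501.08332)|null|Derived from diffusion models, MangaNinjia specializes in the task of reference-guided line art colorization. We incorporate two carefully considered designs to ensure precise transcription of character details: a patch shuffling module that facilitates correspondence learning between the reference color image and the target line art, and a point-driven control scheme that enables fine-grained color matching. Experiments on a self-collected benchmark show that our model outperforms current solutions in precise colorization. We further showcase the potential of the proposed interactive point control in handling challenging cases, cross-character colorization, and multi-reference harmonization, which are beyond the reach of existing algorithms.||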
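|**2025-01-14**|[Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise](http://arxiv.org/abs/2501.08331)|**[link](https://github.com/vgenai-netflix-eyeline-research/go-with-the-flow)**|Generative modeling aims to transform random noise into structured outputs. In this work, we enhance video diffusion models by enabling motion control through structured latent noise sampling. This is achieved by a change in data alone: we pre-process training videos to yield structured noise. Consequently, our method is agnostic to diffusion model design and requires no changes to model architectures or training pipelines. Specifically, we propose a novel noise warping algorithm, fast enough to run in real time, that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields while preserving spatial Gaussianity. The efficiency of our algorithm lets us fine-tune modern video diffusion base models using warped noise with minimal overhead, and it provides a one-stop solution for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. The harmonization between temporal coherence and spatial Gaussianity in our warped noise leads to effective motion control while maintaining per-frame pixel quality. Extensive experiments and user studies demonstrate the advantages of our method, making it a robust and scalable approach for controlling motion in video diffusion models. Video results are available on our webpage: https://vgenai-netflix-eyeline-research.github.io/Go-with-the-Flow/; source code and model checkpoints are available on GitHub: https://github.com/VGenAI-Netflix-Eyeline-Research/Go-with-the-Flow.||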
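|**2025-01-14**|[GameFactory: Creating New Games with Generative Interactive Videos](http://arxiv.org/abs/2501.08325)|null|Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, which limits their applicability to existing games with fixed styles and scenes. In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation. To enable the creation of entirely new and diverse games, we leverage video diffusion models pre-trained on open-domain video data. To bridge the domain gap between open-domain priors and small-scale game datasets, we propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization while achieving action controllability. Using Minecraft as our data source, we release GF-Minecraft, a high-quality, diverse, action-annotated video dataset for research. Furthermore, we extend our framework to enable autoregressive, action-controllable game video generation, allowing the production of interactive game videos of unlimited length. Experimental results show that GameFactory effectively generates open-domain, diverse, and action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and project page are publicly available at https://vvictoryuki.github.io/gamefactory/.||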
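|**2025-01-14**|[Diffusion Adversarial Post-Training for One-Step Video Generation](http://arxiv.org/abs/2501.08316)|null|Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data, following diffusion pre-training, for one-step video generation. To improve training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Experimental results show that our adversarial post-trained model, Seaweed-APT, can generate 2-second, 1280x720, 24fps videos in real time using a single forward evaluation step. In addition, our model can generate 1024px images in a single step, achieving quality comparable to state-of-the-art methods.||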
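|**2025-01-14**|[LayerAnimate: Layer-specific Control for Animation](http://arxiv.org/abs/2501.08295)|null|Animated video separates foreground and background elements into layers, with distinct processes for sketching, refining, coloring, and in-betweening. Existing video generation methods typically treat animation as a monolithic data domain and lack fine-grained control over individual layers. In this paper, we introduce LayerAnimate, a novel architectural approach that enhances fine-grained control over individual animation layers within a video diffusion model, allowing users to independently manipulate foreground and background elements in distinct layers. To address the challenge of limited layer-specific data, we propose a data curation pipeline that features automated element segmentation, motion-state hierarchical merging, and motion coherence refinement. Through quantitative and qualitative comparisons and user studies, we show that LayerAnimate outperforms current methods in animation quality, control precision, and usability, making it a well-suited tool for both professional animators and amateur enthusiasts. This framework opens up new possibilities for layer-specific animation applications and creative flexibility. Our code is available at https://layeranimate.github.io.||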
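|**2025-01-14**|[HALoGEN: Fantastic LLM Hallucinations and Where to Find Them](http://arxiv.org/abs/2501.08292)|null|Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or with the provided input context. However, measuring hallucination can be challenging, because having humans verify model generations on the fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains, including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units and verify each unit against a high-quality knowledge source. We use this framework to evaluate approximately 150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts, depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), incorrect knowledge in training data (Type B errors), or fabrication (Type C errors). We hope our framework provides a foundation for the principled study of why generative models hallucinate and advances the development of trustworthy large language models.||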
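|**2025-01-14**|[Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints](http://arxiv.org/abs/2501.08246)|null|Recent work has proposed automated red-teaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in the target LLM. In this paper, we study red-teaming strategies that enable a targeted security assessment. We propose an optimization framework for red-teaming with proximity constraints, in which the discovered prompts must be similar to reference prompts from a given dataset. This dataset serves as a template for the discovered prompts, anchoring the search for test cases to specific topics, writing styles, or types of harmful behavior. We show that established auto-regressive model architectures do not perform well in this setting. We therefore introduce a black-box red-teaming method inspired by text-diffusion models: Diffusion for Auditing and Red-Teaming (DART). DART modifies the reference prompt by perturbing it in the embedding space, directly controlling the amount of change introduced. We systematically evaluate our method by comparing its effectiveness with established methods based on model fine-tuning and on zero-shot and few-shot prompting. Our results show that DART is significantly more effective at discovering harmful inputs in close proximity to the reference prompt.||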
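|**2025-01-14**|[CodecFake-Omni: A Large-Scale Codec-based Deepfake Speech Dataset](http://arxiv.org/abs/2501.08238)|null|With the rapid advancement of codec-based speech generation (CoSG) systems, creating fake speech that mimics an individual's identity and spreads misinformation has become remarkably easy. Addressing the risks posed by such deepfake speech has attracted significant attention. However, most existing studies focus on detecting fake data generated by traditional speech generation models. Research on detecting fake speech generated by CoSG systems remains limited and largely unexplored. In this paper, we introduce CodecFake-Omni, a large-scale dataset specifically designed to advance the study of neural codec-based deepfake speech (CodecFake) detection and to promote progress within the anti-spoofing community. To the best of our knowledge, CodecFake-Omni is the largest dataset of its kind at the time of writing, encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using nearly all publicly available open-source neural audio codec models, 31 in total (spanning 21 different codec families; one codec family with different configurations yields multiple different codec models). The evaluation set includes web-sourced data collected from the websites of 17 advanced CoSG models, covering eight codec families. Using this large-scale dataset, we reaffirm our previous findings that anti-spoofing models trained on traditional spoofing datasets generated by vocoders struggle to detect speech synthesized by current CoSG systems. Additionally, we propose a comprehensive neural audio codec taxonomy that categorizes neural audio codecs by their root components: vector quantizer, auxiliary objectives, and decoder type, with detailed explanations and representative examples for each. Using this comprehensive taxonomy, we conduct stratified analysis to provide valuable insights for future CodecFake detection research.||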
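|**2025-01-14**|[FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors](http://arxiv.org/abs/2501.08225)|**[link](https://github.com/ybybzhang/framepainter)**|Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct such supervision signals from videos, because videos capture how objects change with various physical interactions. However, these models are usually built upon text-to-image diffusion models and therefore require (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem so as to inherit powerful video diffusion priors, which reduces training costs and ensures temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it uses only a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across various editing signals: it markedly outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, for example automatically adjusting the reflection of a cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, for example transforming a clownfish into a shark-like shape. Our code will be available at https://github.com/YBYBZhang/FramePainter.||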
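|**2025-01-10**|[VideoAuteur: Towards Long Narrative Video Generation](http://arxiv.org/abs/2501.06173)|null|Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, which limits their ability to support coherent narration. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of the proposed dataset in terms of visual fidelity and textual caption accuracy, using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both the visual and semantic coherence of generated videos, and we emphasize the role of aligning visual embeddings in improving overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by fine-tuning techniques that integrate text and image embeddings within the video generation process. Project page: https://videoauteur.github.io/||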
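|**2025-01-10**|[GenMol: A Drug Discovery Generalist with Discrete Diffusion](http://arxiv.org/abs/2501.06158)|null|Drug discovery is a complex process that involves multiple scenarios and stages, such as fragment-constrained molecule generation, hit generation, and lead optimization. However, existing molecular generative models can handle only one or two of these scenarios and lack the flexibility to address the various aspects of the drug discovery pipeline. In this paper, we present Generalist Molecular generative model (GenMol), a versatile framework that addresses these limitations by applying discrete diffusion to the Sequential Attachment-based Fragment Embedding (SAFE) molecular representation. GenMol generates SAFE sequences through non-autoregressive bidirectional parallel decoding, which allows it to use molecular context that does not rely on a specific token ordering and improves computational efficiency. Moreover, under the discrete diffusion framework, we introduce fragment remasking, a strategy that optimizes molecules by replacing fragments with masked tokens and regenerating them, enabling effective exploration of chemical space. GenMol significantly outperforms the previous GPT-based model trained on SAFE representations in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These experimental results show that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach to molecular design.||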
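|**2025-01-10**|[From discrete-time policies to continuous-time diffusion samplers: Asymptotic equivalences and faster training](http://arxiv.org/abs/2501.06148)|**[link](https://github.com/gfnorg/gfn-diffusion)**|We study the problem of training neural stochastic differential equations, or diffusion models, to sample from a Boltzmann distribution without access to target samples. Existing methods for training such models enforce time-reversal of the generative and noising processes, using either differentiable simulation or off-policy reinforcement learning (RL). We prove equivalences between families of objectives in the limit of infinitesimal discretization steps, linking entropic RL methods (GFlowNets) with continuous-time objects (partial differential equations and path space measures). We further show that an appropriate choice of coarse time discretization during training allows greatly improved sample efficiency and the use of time-local objectives, achieving competitive performance on standard sampling benchmarks at reduced computational cost.||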
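|**2025-01-10**|[Photokinetics of Photothermal Reactions](http://arxiv.org/abs/2501.06057)|null|Photothermal reactions, which involve both photochemical and thermal reaction steps, are the most abundant reaction sequences in photochemistry. The derivation of their rate laws is standardized, but the integration of these rate laws has not yet been achieved. Indeed, the field still lacks integrated rate laws for describing the behavior of these reactions and/or identifying their reaction order. This leaves a comprehensive account of the photokinetics of photothermal reactions as a gap in knowledge. This paper addresses that gap by introducing an unprecedented general model equation capable of mapping out the kinetic traces of such reactions under light exposure or in the dark. The integrated rate-law model equation also applies when the reactive medium is exposed to either monochromatic or polychromatic light irradiation. The validity of the model equation was established against simulated data obtained by a fourth-order Runge-Kutta method. It was then used to describe and quantify several situations of photothermal reactions, such as the effects of initial concentration, spectator molecules, and incident radiation intensity, as well as the impact of the latter on the photonic yield. The model equation provides a general elucidation method for determining the intrinsic reaction parameters (quantum yields and absorptivities of the reactive species) of a photothermal mechanism with any known number of species. This paper contributes to rationalizing photokinetics along the same general guidelines adopted in chemical kinetics.||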
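|**2025-01-10**|[Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction](http://arxiv.org/abs/2501.06035)|null|Probabilistic human motion prediction aims to forecast multiple possible future movements from past observations. While current approaches report high diversity and realism, they often generate motions with undetected limb stretching and jitter. To address this, we introduce SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body within its architecture and training. Our model is trained with a novel nonisotropic Gaussian diffusion formulation that aligns with the natural kinematic structure of the human skeleton. Results show that our approach outperforms conventional isotropic alternatives, consistently generating realistic predictions while avoiding artifacts such as limb distortion. Additionally, we identify a limitation in commonly used diversity metrics, which may inadvertently favor models that produce inconsistent limb lengths within the same sequence. SkeletonDiffusion sets a new benchmark on three real-world datasets, outperforming various baselines across multiple evaluation metrics. Visit our project page: https://ceveloper.github.io/publications/skeletondiffusion/||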
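|**2025-01-10**|[CamCtrl3D: Single-Image Scene Exploration with Precise 3D Camera Control](http://arxiv.org/abs/2501.06006)|null|We propose a method for generating fly-through videos of a scene from a single image and a given camera trajectory. We build upon an image-to-video latent diffusion model. We condition its UNet denoiser on the camera trajectory using four techniques. (1) We condition the UNet's temporal blocks on raw camera extrinsics, similar to MotionCtrl. (2) We use images containing camera rays and directions, similar to CameraCtrl. (3) We reproject the initial image to subsequent frames and use the resulting video as a condition. (4) We use 2D<=>3D transformers to introduce a global 3D representation, which implicitly conditions on the camera poses. We combine all conditions in a ControlNet-style architecture. We then propose a metric that evaluates overall video quality and the ability to preserve details under view changes, which we use to analyze the trade-offs of individual and combined conditions. Finally, we identify an optimal combination of conditions. We calibrate camera positions in our datasets for scale consistency across scenes, and we train our scene exploration model, CamCtrl3D, demonstrating state-of-the-art results.||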
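|**2025-01-10**|[Model Inversion in Split Learning for Personalized LLMs: New Insights from Information Bottleneck Theory](http://arxiv.org/abs/2501.05965)|null|Personalized Large Language Models (LLMs) have become increasingly prevalent, showcasing the impressive capabilities of models such as GPT-4. This trend has also spurred extensive research on deploying LLMs on mobile devices. Feasible approaches for such edge-cloud deployment include using split learning. However, previous research has largely overlooked the privacy leakage associated with the intermediate representations transmitted from devices to servers. This work is the first to identify model inversion attacks in the split learning framework for LLMs, emphasizing the necessity of secure defense. For the first time, we introduce mutual information entropy to understand the information propagation of Transformer-based LLMs and to assess the privacy attack performance on LLM blocks. To address the issue that representations are sparser and contain less information than embeddings, we propose a two-stage attack system in which the first part projects representations into the embedding space and the second part uses a generative model to recover text from these embeddings. This design reduces complexity and achieves attack scores of 38%-75% in various scenarios, an improvement of more than 60% over the state-of-the-art method. This work comprehensively highlights the potential privacy risks of deploying personalized LLMs on the edge side.||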
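|**2025-01-10**|[Estimation and Restoration of Unknown Nonlinear Distortion using Diffusion](http://arxiv.org/abs/2501.05959)|**[link](https://github.com/michalsvento/NLDistortionDiff)**|This paper studies the restoration of nonlinearly distorted audio signals, together with the identification of the applied memoryless nonlinear operation. The paper focuses on the difficult but practically important case in which both the nonlinearity and the original input signal are unknown. The proposed method uses a generative diffusion model trained unconditionally on guitar or speech signals to jointly model and invert the nonlinear system at inference time. Both the memoryless nonlinear function model and the restored audio signal are obtained as output. Successful example case studies are presented, including inversion of hard and soft clipping, digital quantization, half-wave rectification, and wavefolding nonlinearities. Our results suggest that, of the nonlinear functions tested here, the cubic Catmull-Rom spline is best suited to approximating these nonlinearities. For guitar recordings, comparisons with informed and supervised methods show that the proposed blind method is at least as good as they are in terms of objective metrics. Experiments on distorted speech show that the proposed blind method outperforms general-purpose speech enhancement techniques and restores the original voice quality. The proposed method can be applied to audio effects modeling, restoration of music and speech recordings, and characterization of analog recording media.||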
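|**2025-01-10**|[DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information](http://arxiv.org/abs/2501.05932)|**[link](https://github.com/raiiyf/diffusets_exp)**|Heart disease remains a significant threat to human health. As a non-invasive diagnostic tool, the electrocardiogram (ECG) is one of the most widely used methods for cardiac screening. However, the scarcity of high-quality ECG data, driven by privacy concerns and limited medical resources, creates a pressing need for effective ECG signal generation. Existing approaches for generating ECG signals typically rely on small training datasets, lack comprehensive evaluation frameworks, and overlook potential applications beyond data augmentation. To address these challenges, we propose DiffuSETS, a novel framework capable of generating ECG signals with high semantic alignment and fidelity. DiffuSETS accepts various modalities of clinical text reports and patient-specific information as inputs, enabling the creation of clinically meaningful ECG signals. Additionally, to address the lack of standardized evaluation in ECG generation, we introduce a comprehensive benchmarking methodology for assessing the effectiveness of generative models in this domain. Our model achieves excellent results in tests, demonstrating its strength on the ECG generation task. Furthermore, we showcase its potential to mitigate data scarcity while exploring novel applications in cardiology education and medical knowledge discovery, highlighting the broader impact of our work.||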
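|**2025-01-10**|[Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation](http://arxiv.org/abs/2501.05892)|null|In real-world images, slanted or curved text, especially on cans, banners, or badges, appears as frequently as flat text, if not more so, because of artistic design or layout constraints. While high-quality visual text generation has become possible with the advanced generative capabilities of diffusion models, these models often produce distorted text and inharmonious text backgrounds when given slanted or curved text layouts, owing to limitations in the training data. In this paper, we introduce a new training-free framework, STGen, which accurately generates visual text in challenging scenarios (e.g., slanted or curved text layouts) while harmonizing it with the text background. Our framework decomposes the visual text generation process into two branches: (i) a Semantic Rectification Branch, which leverages the model's ability to generate flat but accurate visual text to guide generation in challenging scenarios. The latent of the generated flat text is rich in accurate semantic information related both to the text itself and to its background. By incorporating this information, we rectify the semantic information of the text in complex layouts and harmonize the integration of the text with its background. (ii) a Structure Injection Branch, which reinforces the visual text structure during inference. We incorporate the latent information of a glyph image, which is rich in glyph structure, as a new condition to further strengthen the text structure. To enhance image harmony, we also apply an effective combination method to merge the priors, providing a solid foundation for generation. Extensive experiments across a variety of visual text layouts show that our framework achieves superior accuracy and outstanding quality.||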
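|**2025-01-09**|[Decentralized Diffusion Models](http://arxiv.org/abs/2501.05450)|null|Large-scale AI model training divides work across thousands of GPUs and synchronizes their gradients at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up infrastructure costs and straining power systems. We propose Decentralized Diffusion Models, a scalable framework that distributes diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. Our method trains a set of expert diffusion models over partitions of the dataset, each in full isolation from the others. At inference time, the experts ensemble through a lightweight router. We show that the ensemble collectively optimizes the same objective as a single model trained over the whole dataset. This means we can divide the training burden among a number of "compute islands," lowering infrastructure costs and improving resilience to localized GPU failures. Decentralized diffusion models let researchers take advantage of smaller, more cost-effective, and more readily available compute, such as on-demand GPU nodes, rather than centralized integrated systems. We conduct extensive experiments on ImageNet and LAION Aesthetics, showing that decentralized diffusion models outperform standard diffusion models FLOP-for-FLOP. Finally, we scale our approach to 24 billion parameters, demonstrating that high-quality diffusion models can now be trained with just eight individual GPU nodes in less than a week.||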
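|**2025-01-09**|[Consistent Flow Distillation for Text-to-3D Generation](http://arxiv.org/abs/2501.05445)|null|Score Distillation Sampling (SDS) has made significant strides in distilling image-generative models for 3D generation. However, its maximum-likelihood-seeking behavior often leads to degraded visual quality and diversity, limiting its effectiveness in 3D applications. In this work, we propose Consistent Flow Distillation (CFD), which addresses these limitations. We begin by leveraging the gradient of the diffusion ODE or SDE sampling process to guide the 3D generation. From this gradient-based sampling perspective, we find that the consistency of 2D image flows across different viewpoints is important for high-quality 3D generation. To achieve this, we introduce multi-view consistent Gaussian noise on the 3D object, which can be rendered from various viewpoints to compute the flow gradient. Our experiments show that, by leveraging consistent flows, CFD significantly outperforms previous methods in text-to-3D generation.||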
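|**2025-01-09**|[Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces](http://arxiv.org/abs/2501.05442)|null|Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to the original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and to guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation on video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression, compared with direct extensions of existing video tokenizers. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a reduced token budget.||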
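|**2025-01-09**|[The GAN is dead; long live the GAN! A Modern GAN Baseline](http://arxiv.org/abs/2501.05441)|**[link](https://github.com/brownvc/r3gan)**|There is a widespread claim that GANs are difficult to train and that GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses the issues of mode dropping and non-convergence that were previously tackled with a bag of ad hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad hoc tricks and replace the outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline, R3GAN. Despite being simple, our approach surpasses StyleGAN2 on the FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.||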
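|**2025-01-09**|[Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation](http://arxiv.org/abs/2501.05427)|null|Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively use pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.||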
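|**2025-01-09**|[Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation](http://arxiv.org/abs/2501.05413)|null|Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Because cross-modal semantic correspondence is an inherent property of such data, it is almost always curated from in-the-wild videos. In this work, we hypothesize that insisting on the absolute need for ground-truth audio-visual correspondence is not only unnecessary but also leads to severe restrictions in the scale, quality, and diversity of the data, ultimately impairing its use in modern generative models. That is, we propose a scalable image sonification framework in which instances from a variety of high-quality yet disjoint uni-modal origins can be artificially paired through a retrieval process that is empowered by the reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against the state of the art. Finally, through a series of ablation studies, we exhibit several intriguing auditory capabilities, such as semantic mixing and interpolation, loudness calibration, and acoustic space modeling through reverberation, that our model has implicitly developed to guide the image generation process.||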
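|**2025-01-09**|[TimeDP: Learning to Generate Multi-Domain Time Series with Domain Prompts](http://arxiv.org/abs/2501.05403)|**[link](https://github.com/yukhoy/timedp)**|Time series generation models are crucial for applications such as data augmentation and privacy preservation. Most existing time series generation models are designed to generate data from a single specified domain. While leveraging data from other domains for better generalization has proven to work in other application areas, this approach remains challenging for time series modeling because of the large divergence in patterns among different real-world time series categories. In this paper, we propose a multi-domain time series diffusion model with domain prompts, named TimeDP. In TimeDP, we use a time series semantic prototype module that defines time series prototypes to represent a time series basis, with each prototype vector serving as a "word" that represents some elementary time series feature. A prototype assignment module is applied to extract domain-specific prototype weights, which are used to learn domain prompts as the generation condition. During sampling, we extract a "domain prompt" from a few samples of the target domain and use it as the condition to generate time series samples. Experiments show that our method outperforms baselines, providing state-of-the-art in-domain generation quality and strong generation capability on unseen domains.||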
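|**2025-01-09**|[Accelerated Diffusion Models via Speculative Sampling](http://arxiv.org/abs/2501.05370)|null|Speculative sampling is a popular technique for accelerating inference in large language models: a fast draft model generates candidate tokens, which are then accepted or rejected based on the target model's distribution. While speculative sampling was previously limited to discrete sequences, we extend it to diffusion models, which generate samples via continuous, vector-valued Markov chains. In this context, the target model is a high-quality but computationally expensive diffusion model. We propose various drafting strategies, including a simple and effective approach that does not require training a draft model and is applicable out of the box to any diffusion model. Our experiments demonstrate significant generation speedup on various diffusion models, halving the number of function evaluations while generating exact samples from the target model.||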
|**2025-01-09**|[CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models](http://arxiv.org/abs/2501.05359)|null|With advances in diffusion models, image generation has shown significant performance improvements. This raises concerns about the potential abuse of image generation, such as the creation of explicit or violent images, commonly referred to as Not Safe For Work (NSFW) content. To address this, the Stable Diffusion model includes several safety checkers to censor initial text prompts and final output images generated from the model. However, recent research has shown that these safety checkers have vulnerabilities against adversarial attacks, allowing them to generate NSFW images. In this paper, we find that these adversarial attacks are not robust to small changes in text prompts or input latents. Based on this, we propose CROPS (Circular or RandOm Prompts for Safety), a model-agnostic framework that easily defends against adversarial attacks generating NSFW images without requiring additional training. Moreover, we develop an approach that utilizes one-step diffusion models for efficient NSFW detection (CROPS-1), further reducing computational resources. We demonstrate the superiority of our method in terms of performance and applicability.||
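|**2025-01-09**|[Patch-GAN Transfer Learning with Reconstructive Models for Cloud Removal](http://arxiv.org/abs/2501.05265)|null|Cloud removal plays a crucial role in enhancing remote sensing image analysis, yet accurately reconstructing cloud-obscured regions remains a significant challenge. Recent advances in generative models have made the generation of realistic images increasingly accessible, offering new opportunities for this task. Given the conceptual alignment between image generation and cloud removal tasks, generative models present a promising approach for addressing cloud removal in remote sensing. In this work, we propose a deep transfer learning approach built on a generative adversarial network (GAN) framework to explore the potential of the novel masked autoencoder (MAE) image reconstruction model in cloud removal. Because of the complexity of remote sensing imagery, we further propose using a patch-wise discriminator to determine whether each patch of the image is real or fake. Compared with other GAN-based methods, the proposed reconstructive transfer learning approach shows significant improvements in cloud removal performance. Additionally, while direct comparisons with some state-of-the-art cloud removal techniques are limited by unclear details of their train/test data splits, the proposed model achieves competitive results on the available benchmarks.||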
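|**2025-01-07**|[Synthetic Data for Portfolios: A Throw of the Dice Will Never Abolish Chance](http://arxiv.org/abs/2501.03993)|null|Simulation methods have always been instrumental in finance, and data-driven generative models with minimal model specification are receiving increasing attention, especially after the success of deep learning in a broad range of fields. However, the adoption of these models in financial applications has not kept pace with the growing interest, probably because of the unique complexities and challenges of financial markets. This paper aims to contribute to a deeper understanding of the limitations of generative models, particularly in portfolio and risk management. To this end, we begin by presenting theoretical results on the importance of initial sample size and point out the potential pitfalls of generating far more data than was originally available. We then highlight the inseparable nature of model development and the intended use case by means of a paradox: generic generative models inherently care less about what is important for constructing portfolios (in particular long-short ones). Based on these findings, we propose a pipeline for generating multivariate returns that meets conventional evaluation standards on a large universe of US equities while complying with the stylized facts observed in asset returns and avoiding the pitfalls we previously identified. Moreover, we insist on the need for more refined evaluation methods and suggest, through an example of mean-reversion strategies, a method for identifying poor models for a given application based on regurgitative training, i.e., retraining the model using the data it has itself generated, which is commonly referred to in statistics as identifiability.||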
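|**2025-01-07**|[NeuralSVG: An Implicit Representation for Text-to-Vector Generation](http://arxiv.org/abs/2501.03992)|null|Vector graphics are essential in design, providing artists with a versatile medium for creating resolution-independent and highly editable visual content. Recent advances in vision-language and diffusion models have fueled interest in text-to-vector graphics generation. However, existing approaches often suffer from over-parameterized outputs or treat the layered structure, a core feature of vector graphics, as a secondary goal, which diminishes their practical use. Recognizing the importance of layered SVG representations, we propose NeuralSVG, an implicit neural representation for generating vector graphics from text prompts. Inspired by Neural Radiance Fields (NeRFs), NeuralSVG encodes the entire scene into the weights of a small MLP network, optimized using Score Distillation Sampling (SDS). To encourage a layered structure in the generated SVG, we introduce a dropout-based regularization technique that strengthens the standalone meaning of each shape. We additionally show that a neural representation provides the added benefit of inference-time control, enabling users to dynamically adapt the generated SVG based on user-provided inputs, all with a single learned representation. Through extensive qualitative and quantitative evaluations, we show that NeuralSVG outperforms existing methods in generating structured and flexible SVG.||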
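|**2025-01-07**|[Stabilising effect of generic anomalous diffusion independent of the Rayleigh number](http://arxiv.org/abs/2501.03990)|null|This work investigates the effects of a generic anomalous diffusion model on mass convection in a fluid-saturated porous medium, with a focus on the superdiffusive regime. A mathematical model is formulated, and both linear and nonlinear stability analyses are conducted. The results show that the specific form of the time function describing anomalous diffusion can significantly influence the system's stability, allowing stability to persist beyond the classical Rayleigh-Bénard neutral threshold. Furthermore, under certain conditions, transient growth of perturbations is observed, followed by eventual decay. A range of memory functions, including power-law, exponential, and logarithmic forms, is examined systematically, with emphasis on their influence on the perturbation dynamics. The findings underscore the importance of anomalous diffusion in modulating stability and provide new insights into the transient behavior induced by non-Fickian mass transport.||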
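|**2025-01-07**|[Synthetic Data Privacy Metrics](http://arxiv.org/abs/2501.03941)|null|Recent advances in generative AI have made it possible to create synthetic datasets that can be as accurate as real-world data for training AI models, powering statistical insights, and fostering collaboration with sensitive datasets, all while offering strong privacy guarantees. Effectively measuring the empirical privacy of synthetic data is an important step in the process. However, while a multitude of new privacy metrics are published every day, there is currently no standardization. In this paper, we review the pros and cons of popular metrics, including simulations of adversarial attacks. We also review current best practices for amending generative models to enhance the privacy of the data they create (e.g., differential privacy).||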
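|**2025-01-07**|[A precise asymptotic analysis of learning diffusion models: theory and insights](http://arxiv.org/abs/2501.03937)|**[link](https://github.com/hugocui/ae_diffusion)**|In this manuscript, we consider the problem of learning a flow-based or diffusion-based generative model parametrized by a two-layer auto-encoder, trained with online stochastic gradient descent, on a high-dimensional target density with an underlying low-dimensional manifold structure. We derive a tight asymptotic characterization of low-dimensional projections of the distribution of samples generated by the learned model, ascertaining in particular its dependence on the number of training samples. Building on this analysis, we discuss how mode collapse can arise and lead to model collapse when the generative model is re-trained on generated synthetic data.||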
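|**2025-01-07**|[Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers](http://arxiv.org/abs/2501.03931)|**[link](https://github.com/dvlab-research/magicmirror)**|We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining a consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy that combines synthetic identity pairs with video data. Extensive experiments show that Magic Mirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while adding only minimal parameters. The code and model will be made publicly available at: https://github.com/dvlab-research/MagicMirror/||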
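|**2025-01-07**|[Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control](http://arxiv.org/abs/2501.03847)|**[link](https://github.com/igl-hkust/diffusionasshader)**|Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type and lack the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control requires leveraging 3D control signals, because videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS uses 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls simply by manipulating the 3D tracking videos. A further advantage of 3D tracking videos is their ability to effectively link frames, which significantly enhances the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using fewer than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.||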
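|**2025-01-07**|[Deep Sylvester Posterior Inference for Adaptive Compressed Sensing in Ultrasound Imaging](http://arxiv.org/abs/2501.03825)|null|Ultrasound images are formed by sequentially acquiring beam-steered scan lines. Minimizing the number of required scan lines can significantly improve frame rate, field of view, energy efficiency, and data transfer speeds. Existing approaches typically use static subsampling schemes in combination with sparsity-based or, more recently, deep-learning-based recovery. In this work, we introduce an adaptive subsampling method that maximizes intrinsic information gain in situ, employing a Sylvester Normalizing Flow encoder to infer an approximate Bayesian posterior under partial observation in real time. Using the Bayesian posterior and a deep generative model for future observations, we determine the subsampling scheme that maximizes the mutual information between the subsampled observations and the next frame of the video. We evaluate our approach on the EchoNet cardiac ultrasound video dataset and show that our active sampling method outperforms competitive baselines, including uniform and variable-density random sampling as well as equidistantly spaced scan lines, reducing the mean absolute reconstruction error by 15%. Moreover, posterior inference and the generation of the sampling scheme take only 0.015 seconds (66 Hz), making the method fast enough for real-time 2D ultrasound imaging applications.||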
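|**2025-01-07**|[Impact of diffusion mechanisms on persistence and spreading](http://arxiv.org/abs/2501.03816)|null|We examine a generalized KPP equation with a "$q$-diffusion", a framework that unifies various standard linear diffusion regimes: Fickian diffusion ($q = 0$), Stratonovich diffusion ($q = 1/2$), Fokker-Planck diffusion ($q = 1$), and nonstandard diffusion regimes for general $q\in\mathbb{R}$. Using both analytical methods and numerical simulations, we explore how persistence (measured by a principal eigenvalue) and asymptotic spreading speed depend on the parameter $q$ and on the phase shift between the growth rate $r(x)$ and the diffusion coefficient $D(x)$. Our results show that persistence and spreading properties generally depend on $q$: for example, appropriate configurations of $r(x)$ and $D(x)$ can be constructed such that $q$-diffusion either enhances or diminishes persistence and spreading speed relative to traditional Fickian diffusion. We find that the spatial arrangement of $r(x)$ relative to $D(x)$ has markedly different effects depending on whether $q > 0$, $q = 0$, or $q < 0$. The case where $r$ is constant is an exception: persistence is independent of $q$, while the spreading speed is symmetric about $q = 1/2$. This work underscores the importance of carefully selecting diffusion models in ecological and epidemiological contexts, highlighting their potential implications for persistence, spreading, and control strategies.||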
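|**2025-01-07**|[Mixing by Internal Gravity Waves in Stars: Assessing Numerical Simulations Against Theory](http://arxiv.org/abs/2501.03796)|null|We present a study of radial chemical mixing driven by internal gravity waves (IGWs) in non-rotating massive main-sequence stars, based on multi-dimensional hydrodynamical simulations with the fully compressible code MUSIC. We examine two commonly cited mechanisms by which IGWs mix stellar material, related to thermal diffusion and to sub-wavelength shearing. Thermal diffusion provides a non-restorative effect on the waves, displacing material from its previous equilibrium position, while shear generated within the waves drives weak local flows that mix the fluid there. Using the IGW spectra from the simulations, we evaluate theoretical predictions of the mixing rates due to these two mechanisms. We find that, for a 20 solar mass main-sequence star, neither mechanism is likely to produce mixing sufficient to correct the errors in current stellar evolution models. In addition, we compare these predictions with results from Lagrangian tracer particles, a method that has recently been used in global simulations of stellar interiors to measure IGW-induced mixing in their radiative zones. We show that the tracer particle method faces significant numerical challenges in measuring the small diffusion coefficients predicted by the theories above and is prone to producing artificially enhanced coefficients. Diffusion coefficients based on this method are currently used in stellar evolution codes for asteroseismic studies, but they should be viewed with caution. Finally, in a case where the tracer particles are not affected by numerical artifacts, we argue that a diffusion model is not appropriate on the timescales typically considered in two-dimensional numerical simulations.||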
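|**2025-01-03**|[Metadata Conditioning Accelerates Language Model Pre-training](http://arxiv.org/abs/2501.01956)|**[link](https://github.com/princeton-pli/meco)**|The vast diversity of styles, domains, and quality levels present in language model pre-training corpora is essential for developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo first provides metadata (e.g., URLs such as en.wikipedia.org) alongside the text during training, and later uses a cooldown phase with only the standard text, enabling the model to function normally even without metadata. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data. Additionally, MeCo lets us steer language models by conditioning the inference prompt on either real or fabricated metadata that encodes the desired properties of the output: for example, prepending wikipedia.org to reduce harmful generations, or factquizmaster.com (fabricated) to improve common knowledge task performance. We also show that MeCo is compatible with different types of metadata, such as model-generated topics. MeCo is remarkably simple, adds no computational overhead, and shows promise in producing more capable and steerable language models.||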
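|**2025-01-03**|[MADGEN -- Mass-Spec attends to De Novo Molecular generation](http://arxiv.org/abs/2501.01950)|**[link](https://github.com/HassounLab/MADGEN)**|The annotation (assigning structural chemical identities) of MS/MS spectra remains a significant challenge because of the enormous molecular diversity in biological samples and the limited scope of reference databases. Currently, the vast majority of spectral measurements remain in the "dark chemical space" without structural annotations. To improve annotation, we propose MADGEN (Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval, and spectra-conditioned molecular generation starting from the scaffold. In the first stage, given an MS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ contrastive learning to align mass spectra with candidate molecular scaffolds. In the second stage, starting from the retrieved scaffold, we use the MS/MS spectrum to guide an attention-based generative model that produces the final molecule. Our approach constrains the molecular generation search space, reducing its complexity and improving generation accuracy. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym) and assess its performance with both a predictive scaffold retriever and an oracle retriever. We demonstrate the effectiveness of using attention to integrate spectral information throughout the generation process, achieving strong results with the oracle retriever.||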
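|**2025-01-03**|[EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation](http://arxiv.org/abs/2501.01895)|null|We introduce EnerVerse, a comprehensive framework for embodied future space generation designed specifically for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, ensuring low-level consistency and continuity. Recognizing the inherent redundancy in video data, we propose a sparse memory context combined with a chunkwise unidirectional generative paradigm to enable the generation of infinitely long sequences. To further augment robotic capabilities, we introduce the Free Anchor View (FAV) space, which provides flexible perspectives to enhance observation and analysis. The FAV space reduces motion modeling ambiguity, removes physical constraints in confined environments, and significantly improves the robot's generalization and adaptability across various tasks and settings. To address the prohibitive cost and labor intensity of acquiring multi-camera observations, we present a data engine pipeline that integrates a generative model with 4D Gaussian Splatting (4DGS). This pipeline leverages the generative model's robust generalization capabilities and the spatial constraints provided by 4DGS, enabling iterative enhancement of data quality and diversity and thus creating a data flywheel effect that effectively narrows the sim-to-real gap. Finally, our experiments show that the embodied future space generation prior substantially enhances policy predictive capabilities, resulting in improved overall performance, particularly in long-range robotic manipulation tasks.||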
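|**2025-01-03**|[Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation](http://arxiv.org/abs/2501.01793)|**[link](https://github.com/mohdkhalil/repository-supplementary-for-lak-25-paper--creating-artificial-students-that-never-existed)**|In this study, we explore the growing potential of AI and deep learning technologies, particularly Generative Adversarial Networks (GANs) and Large Language Models (LLMs), for generating synthetic tabular data. Access to quality student data is critical for advancing learning analytics, but privacy concerns and stricter data protection regulations worldwide limit its availability and use. Synthetic data offers a promising alternative. We investigate whether synthetic data can be used to create artificial students that serve learning analytics models. Using the popular GAN model CTGAN and three LLMs (GPT2, DistilGPT2, and DialoGPT), we generate synthetic tabular student data. Our results show the strong potential of these methods to produce high-quality synthetic datasets that resemble real student data. To validate our findings, we apply a comprehensive set of utility evaluation metrics to assess the statistical and predictive performance of the synthetic data, and we compare the different generator models used, especially the performance of the LLMs. Our study aims to provide the learning analytics community with valuable insights into the use of synthetic data, laying the groundwork for expanding the field's methodological toolbox with new, innovative approaches to learning analytics data generation.||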
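|**2025-01-03**|[Nonparametric estimation of a factorizable density using diffusion models](http://arxiv.org/abs/2501.01783)|null|In recent years, diffusion models, and more generally score-based deep generative models, have achieved remarkable success in various applications, including image and audio generation. In this paper, we view diffusion models as an implicit approach to nonparametric density estimation and study them within a statistical framework to analyze their surprising performance. A key challenge in high-dimensional statistical inference is leveraging low-dimensional structures inherent in the data to mitigate the curse of dimensionality. We assume that the underlying density exhibits a low-dimensional structure by factorizing into low-dimensional components, a property common in examples such as Bayesian networks and Markov random fields. Under suitable assumptions, we show that an implicit density estimator constructed from diffusion models adapts to the factorization structure and achieves the minimax optimal rate with respect to the total variation distance. In constructing the estimator, we design a sparse weight-sharing neural network architecture, where sparsity and weight-sharing are key features of practical architectures such as convolutional neural networks and recurrent neural networks.||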
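|**2025-01-03**|[Adverse Weather Conditions Augmentation of LiDAR Scenes with Latent Diffusion Models](http://arxiv.org/abs/2501.01761)|null|LiDAR scenes constitute a fundamental source of data for several autonomous driving applications. Despite the existence of several datasets, scenes captured in adverse weather conditions remain scarce. This limits the robustness of downstream machine learning models and restricts the reliability of autonomous driving systems in particular locations and seasons. Collecting feature-diverse scenes under adverse weather conditions is challenging because of seasonal limitations. Generative models are therefore essential, especially for generating adverse weather conditions for specific driving scenarios. In our work, we propose a latent diffusion process constituted by an autoencoder and a latent diffusion model. Moreover, we leverage clear-condition LiDAR scenes together with a postprocessing step to improve the realism of the generated adverse weather scenes.||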
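|**2025-01-03**|[MusicGen-Stem: Multi-stem music generation and edition through autoregressive modeling](http://arxiv.org/abs/2501.01757)|null|While most music generation models generate a mixture of stems (in mono or stereo), we propose to train a multi-stem generative model with three stems (bass, drums, and other) that learns the musical dependencies between them. To do so, we train one specialized compression algorithm per stem to tokenize the music into parallel streams of tokens. Then, we leverage recent improvements in the task of music source separation to train a multi-stream text-to-music language model on a large dataset. Finally, thanks to a particular conditioning method, our model can edit the bass, drums, or other stems of existing or generated songs, and can also perform iterative composition (e.g., generating bass on top of existing drums). This gives more flexibility to music generation algorithms, and it is, to the best of our knowledge, the first open-source multi-stem autoregressive music generation model that can perform high-quality generation and coherent source editing. Code and model weights will be released, and samples are available at https://simonrouard.github.io/musicgenstem/.||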
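|**2025-01-03**|[AR4D: Autoregressive 4D Generation from Monocular Videos](http://arxiv.org/abs/2501.01722)|null|Recent advances in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, which typically leads to limited diversity, spatial-temporal inconsistency, and poor prompt alignment because of the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Specifically, our paradigm consists of three stages. To begin with, for a monocular video that is either generated or captured, we first use pre-trained expert models to create a 3D representation of the first frame, which is further fine-tuned to serve as the canonical space. Subsequently, motivated by the fact that videos arise naturally in an autoregressive manner, we propose to generate each frame's 3D representation based on the previous frame's representation, because this autoregressive generation facilitates more accurate geometry and motion estimation. Meanwhile, to prevent overfitting during this process, we introduce a progressive view sampling strategy that uses priors from pre-trained large-scale 3D reconstruction models. To avoid the appearance drift introduced by autoregressive generation, we further incorporate a refinement stage based on a global deformation field and the geometry of each frame's 3D representation. Extensive experiments show that AR4D achieves state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.||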
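|**2025-01-03**|[iCBIR-Sli: Interpretable Content-Based Image Retrieval with 2D Slice Embeddings](http://arxiv.org/abs/2501.01642)|null|Current methods for searching brain MR images rely on text-based approaches, highlighting a significant need for content-based image retrieval (CBIR) systems. Directly applying 3D brain MR images to machine learning models offers the benefit of effectively learning brain structure; however, building a generalized model requires large amounts of training data. While models that consider the depth direction and use continuous 2D slices have succeeded in segmentation and classification tasks involving 3D data, concerns remain. Specifically, using general 2D slices may lead to overlooked pathological features and discontinuities in depth-direction information. Furthermore, to the best of the authors' knowledge, there have been no attempts to develop a practical CBIR system that preserves information about the entire brain structure. In this study, we propose an interpretable CBIR method for brain MR images, named iCBIR-Sli (Interpretable CBIR with 2D Slice Embedding), which is the first to make global use of a series of 2D slices. iCBIR-Sli addresses the challenges associated with using 2D slices by effectively aggregating slice information, thereby achieving low-dimensional representations with high completeness, usability, robustness, and interoperability, which are qualities essential for effective CBIR. In retrieval evaluation experiments using five publicly available brain MR datasets (ADNI2/3, OASIS3/4, AIBL) for Alzheimer's disease and cognitively normal subjects, iCBIR-Sli achieved top-1 retrieval performance (macro F1 = 0.859) comparable to existing deep learning models explicitly designed for classification, without the need for an external classifier. Additionally, the method provides high interpretability by clearly identifying the brain regions indicative of the searched-for disease.||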
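|**2025-01-03**|[Uncertainty and Energy based Loss Guided Semi-Supervised Semantic Segmentation](http://arxiv.org/abs/2501.01640)|null|Semi-supervised (SS) semantic segmentation exploits both labeled and unlabeled images to overcome the tedious and costly problem of pixel-level annotation. Pseudo-label supervision is one of the core approaches for training networks with both pseudo labels and ground-truth labels. This work uses data uncertainty (also known as aleatoric uncertainty) and energy-based modeling in an intersection-union pseudo-supervised network. Aleatoric uncertainty models the inherent noise variations of the data in a network with two predictive branches. The per-pixel variance parameter obtained from the network gives a quantitative measure of the data uncertainty. Moreover, the energy-based loss realizes the potential of generative modeling for the downstream SS segmentation task. The aleatoric and energy losses are applied, in conjunction with pseudo-intersection labels, pseudo-union labels, and ground-truth labels, on the respective network branches. A comparative analysis with state-of-the-art methods shows improvement in performance metrics.||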
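|**2024-12-30**|[The Gaussian Kicked Rotor: Periodic forcing with finite-width pulses and the role of shifting the kick](http://arxiv.org/abs/2412.21186)|null|The kicked rotor is perhaps the simplest physical model in classical mechanics that illustrates the transition from regular to chaotic motion. It is also widely used as a model of light-matter interaction. In the conventional treatment, the infinitesimal width of each kick allows the equations of motion to be integrated immediately. If the dynamics are viewed only stroboscopically, this in turn allows them to be described completely by a discrete map (the standard map). It turns out that this model is only part of a richer story once the finite temporal width of the kick is taken into account. In this article, we formulate a general model of finite-width periodic forcing and derive a continuous family of maps that depend on a shift parameter Δ and can therefore capture motion in both the driven and the kicked regimes. Analytical and numerical results show that the fixed points and symmetries of the maps depend on the value of the shift parameter.||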
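|**2025-01-02**|[Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation](http://arxiv.org/abs/2412.21117)|null|In this work, we introduce Prometheus, a 3D-aware latent diffusion model that performs text-to-3D generation at both the object and scene level within seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model on a pretrained text-to-image generation model with only minimal adjustments, and further train it on a large number of images from both single-view and multi-view datasets. In addition, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation. Project page: https://freemty.github.io/project-prometheus/||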
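|**2024-12-30**|[Quantum Diffusion Model for Quark and Gluon Jet Generation](http://arxiv.org/abs/2412.21082)|**[link](https://github.com/mashathepotato/GSoC-Quantum-Diffusion-Model)**|Diffusion models have shown remarkable success in image generation, but their training is computationally intensive and time-consuming. In this paper, we introduce a novel diffusion model that uses quantum computing techniques to ease these computational challenges and improve generative performance on high-energy physics data. The fully quantum diffusion model replaces Gaussian noise with random unitary matrices in the forward process and incorporates a variational quantum circuit into the U-Net of the denoising architecture. We evaluate it on structurally complex quark and gluon jet datasets from the Large Hadron Collider. The results show that the fully quantum and hybrid models are competitive with a comparable classical model for jet generation, highlighting the potential of quantum techniques for machine learning problems.||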
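|**2025-01-02**|[Edicho: Consistent Image Editing in the Wild](http://arxiv.org/abs/2412.21079)|**[link](https://github.com/ezioby/edicho)**|Consistent editing across in-the-wild images is a well-recognized need, yet it remains technically challenging because of hard-to-control factors such as object pose, lighting conditions, and the shooting environment. Edicho offers a training-free solution based on diffusion models, whose fundamental design principle is to use explicit image correspondence to guide editing. Specifically, its key components are an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take pre-estimated correspondence into account. This inference-time algorithm is plug-and-play and compatible with most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive experimental results demonstrate the effectiveness of Edicho for consistent cross-image editing under diverse settings. We will release the code to facilitate future research.||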
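|**2024-12-30**|[Varformer: Adapting VAR's Generative Prior for Image Restoration](http://arxiv.org/abs/2412.21063)|**[link](https://github.com/siywang541/Varformer)**|Generative models trained on extensive high-quality datasets effectively capture the structural and statistical properties of clean images, making them powerful priors for transforming degraded features into clean ones. VAR, a novel image generation paradigm, surpasses diffusion models in generation quality by applying a next-scale prediction approach. It progressively captures both global structure and fine-grained detail through an autoregressive process, consistent with the multi-scale restoration principle widely acknowledged in the restoration community. Furthermore, we observe that during image reconstruction with VAR, scale predictions automatically modulate the input, which helps align the representations at subsequent scales with the distribution of clean images. To exploit VAR's adaptive distribution alignment capability in image restoration tasks, we formulate the multi-scale latent representations within VAR as the restoration prior, which drives our carefully designed VarFormer framework. The strategic application of these priors enables VarFormer to generalize remarkably well to unseen tasks while also reducing the computational cost of training. Extensive experiments show that VarFormer outperforms existing multi-task image restoration methods across various restoration tasks.||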
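|**2024-12-30**|[VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation](http://arxiv.org/abs/2412.21059)|**[link](https://github.com/thudm/visionreward)**|We present a general strategy for aligning visual generation models, covering both image and video generation, with human preferences. First, we build VisionReward, a fine-grained, multi-dimensional reward model. We decompose human preferences for images and videos into multiple dimensions, each represented by a series of judgment questions, which are linearly weighted and summed into an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% in video preference prediction and achieve top performance. Based on VisionReward, we develop a multi-objective preference learning algorithm that effectively addresses confounding factors in preference data. Our approach significantly outperforms existing image and video scoring methods on both machine metrics and human evaluation. All code and datasets are available at https://github.com/THUDM/VisionReward.||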
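|**2024-12-30**|[E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models](http://arxiv.org/abs/2412.21044)|null|Diffusion models have emerged as a powerful framework for generative modeling, achieving state-of-the-art performance across a variety of tasks. However, they face several inherent limitations, including a training-sampling gap, information leakage in the progressive noising process, and the inability to incorporate advanced loss functions such as perceptual and adversarial losses during training. To address these challenges, we propose an innovative end-to-end training framework that aligns the training and sampling processes by directly optimizing the final reconstruction output. Our method eliminates the training-sampling gap, mitigates information leakage by treating training as a direct mapping from pure noise to the target data distribution, and allows perceptual and adversarial losses to be integrated into the objective. Extensive experiments on benchmarks such as COCO30K and HW30K show that our approach consistently outperforms conventional diffusion models, achieving better FID and CLIP scores even with fewer sampling steps. These findings highlight the potential of end-to-end training to move diffusion-based generative models toward more robust and efficient solutions.||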
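|**2024-12-30**|[Visual Style Prompt Learning Using Diffusion Models for Blind Face Restoration](http://arxiv.org/abs/2412.21042)|**[link](https://github.com/longlongaaago/vspbfr)**|Blind face restoration aims to recover high-quality facial images from various unidentified sources of degradation, which is highly challenging because very little information can be retrieved from the degraded images. Previous prior-based methods, which use geometric priors and facial features, have advanced face restoration but often fail to capture fine details. To address this, we introduce a visual style prompt learning framework that uses diffusion probabilistic models to explicitly generate visual prompts within the latent space of a pretrained generative model. These prompts are designed to guide the restoration process. To make full use of the visual prompts and enhance the extraction of informative patterns, we introduce a style-modulated aggregation transformation layer. Extensive experiments and applications demonstrate the superiority of our method in achieving high-quality blind face restoration. The source code is available at https://github.com/LonglongaaaGo/VSPBFR.||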
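|**2024-12-30**|[TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization](http://arxiv.org/abs/2412.21037)|**[link](https://github.com/declare-lab/TangoFlux)**|We introduce TangoFlux, an efficient text-to-audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models is the difficulty of creating preference pairs, because TTA lacks structured mechanisms such as the verifiable rewards or gold-standard answers available for large language models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to improve TTA alignment. We show that the audio preference dataset generated with CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance on both objective and subjective benchmarks. We open-source all code and models to support further research in TTA generation.||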
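|**2024-12-30**|[AlignAb: Pareto-Optimal Energy Alignment for Designing Nature-Like Antibodies](http://arxiv.org/abs/2412.20984)|null|We present a three-stage framework for training deep learning models that specialize in antibody sequence-structure co-design. First, we pretrain a language model on millions of antibody sequences. Then, we use the learned representations to guide the training of a diffusion model that jointly optimizes antibody sequences and structures. In the final alignment stage, we optimize the model to favor antibodies with low repulsion and high attraction to the antigen binding site, improving the rationality and functionality of the designs. To mitigate conflicts between energy preferences, we extend AbDPO (Antibody Direct Preference Optimization) to guide the model toward Pareto optimality under multiple energy-based alignment objectives. In addition, we adopt an iterative learning paradigm with temperature scaling, which lets the model benefit from diverse online datasets without requiring additional data. In practice, our method achieves high stability and efficiency in producing a better Pareto front of antibody designs than the top samples generated by baselines and previous alignment techniques. Through extensive experiments, we demonstrate the superior performance of our method in consistently generating nature-like antibodies with high binding affinity.||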
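|**2024-12-27**|[Tensor Network Estimation of Distribution Algorithms](http://arxiv.org/abs/2412.19780)|null|Tensor networks, originally applied in many-body quantum physics, are now used throughout computational science, from numerical methods to machine learning. Several methods that integrate tensor networks into evolutionary optimization algorithms have appeared in the recent literature. In essence, these methods can be understood as replacing the traditional crossover operation of a genetic algorithm with a tensor-network-based generative model. We study these methods from the perspective of estimation of distribution algorithms (EDAs). We find that their optimization performance is not related in any simple way to the power of the generative model. A better generative model (one that better models the distribution from which its training data is drawn) does not necessarily yield a better-performing optimization algorithm. This raises the question of how best to incorporate powerful generative models into optimization routines. In light of this, we find that adding an explicit mutation operator to the output of the generative model often improves optimization performance.||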
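|**2024-12-27**|[Generative Video Propagation](http://arxiv.org/abs/2412.19761)|null|Large-scale video generation models have an inherent ability to model natural scenes realistically. In this paper, we show that by carefully designing a generative video propagation framework, the generative power of such models can be used to handle a variety of video tasks in a unified way. Specifically, our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image-to-video generation model. We propose a data generation scheme that covers multiple video tasks based on instance-level video segmentation datasets. Our model is trained with a mask prediction decoder head and a region-aware loss, which help the encoder preserve the original content while the generation model propagates the modified region. This novel design opens up new possibilities: in editing scenarios, GenProp allows substantial changes to an object's shape; for insertion, the inserted objects can exhibit independent motion; for removal, GenProp effectively removes effects such as shadows and reflections from the whole video; for tracking, GenProp can track objects together with their associated effects. Experimental results demonstrate the leading performance of our model across various video tasks, and we also provide an in-depth analysis of the proposed framework.||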
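|**2024-12-27**|[From Elements to Design: A Layered Approach for Automatic Graphic Design Composition](http://arxiv.org/abs/2412.19712)|null|In this work, we investigate automatic design composition from multimodal graphic elements. Although recent studies have developed various generative models for graphic design, they usually face the following limitations: they focus only on certain subtasks and are far from achieving the design composition task, and they do not consider the hierarchical information of graphic designs during generation. To tackle these issues, we introduce the layered design principle into large multimodal models (LMMs) and propose a novel approach, called LaDeCo, to accomplish this challenging task. Specifically, LaDeCo first performs layer planning for a given element set, dividing the input elements into different semantic layers according to their content. Based on the planning results, it then predicts the element attributes that control the design composition in a layer-wise manner, and includes the rendered image of previously generated layers in the context. With this design, LaDeCo decomposes the difficult task into smaller, manageable steps, making the generation process smoother and clearer. Experimental results demonstrate the effectiveness of LaDeCo in design composition. We also show that LaDeCo enables some interesting applications in graphic design, such as resolution adjustment, element filling, and design variation. Moreover, it even outperforms specialized models on some design subtasks without any task-specific training.||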
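|**2024-12-27**|[VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models](http://arxiv.org/abs/2412.19645)|null|Zero-shot customized video generation has attracted significant attention because of its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the video diffusion model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain a consistent subject appearance because of suboptimal feature extraction and injection techniques. In this paper, we reveal that the VDM inherently has the ability to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that uses the VDM's inherent capability to achieve high-quality zero-shot customized video generation. Specifically, for feature extraction, we feed reference images directly into the VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also aligns closely with the VDM's pretrained knowledge. For feature injection, we design an innovative bidirectional interaction between subject features and generated content through the spatial self-attention within the VDM, ensuring better subject fidelity while maintaining the diversity of the generated video. Experiments on customized human and object video generation validate the effectiveness of our framework.||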
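|**2024-12-27**|[Diverse Rare Sample Generation with Pretrained GANs](http://arxiv.org/abs/2412.19543)|**[link](https://github.com/sbrblee/DivRareGen)**|Deep generative models excel at generating realistic data but struggle to produce rare samples in low-density regions because such data is scarce in training datasets and because of mode collapse. Although recent methods aim to improve the fidelity of generated samples, they often overlook rare and novel samples, which reduces diversity and coverage. This study proposes a novel approach for generating diverse rare samples from high-resolution image datasets using pretrained GANs. Our method employs gradient-based optimization of latent vectors within a multi-objective framework and uses normalizing flows for density estimation in the feature space. This enables the generation of diverse rare images, with controllable parameters for rarity, diversity, and similarity to a reference image. We demonstrate the effectiveness of our approach both qualitatively and quantitatively across various datasets and GANs, without retraining or fine-tuning the pretrained GANs.||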
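|**2024-12-27**|[StyleRWKV: High-Quality and High-Efficiency Style Transfer with RWKV-like Architecture](http://arxiv.org/abs/2412.19535)|null|Style transfer aims to generate a new image that preserves the content of one image while adopting the artistic representation of a style source. Most existing methods are based on Transformers or diffusion models, but they suffer from quadratic computational complexity and high inference time. RWKV, an emerging deep sequence model, has shown great potential for long-context sequence modeling in NLP tasks. In this work, we present a novel framework, StyleRWKV, that achieves high-quality style transfer with limited memory usage and linear time complexity. Specifically, we propose a Recurrent WKV (Re-WKV) attention mechanism that incorporates bidirectional attention to establish a global receptive field. We also develop a Deformable Shifting (Deform-Shifting) layer that introduces learnable offsets to the sampling grid of the convolution kernel, allowing tokens to shift flexibly and adaptively from the region of interest and thereby improving the model's ability to capture local dependencies. Finally, we propose a Skip Scanning (S-Scanning) method that effectively establishes global contextual dependencies. Extensive experiments with qualitative and quantitative evaluations show that our approach outperforms state-of-the-art methods in stylization quality, model complexity, and inference efficiency.||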
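|**2024-12-27**|[Lévy Score Function and Score-Based Particle Algorithm for Nonlinear Lévy-Fokker-Planck Equations](http://arxiv.org/abs/2412.19520)|null|The score function of a diffusion process, also known as the gradient of the log-density, is a basic concept for characterizing the probability flow, with important applications in score-based diffusion generative modeling and in the simulation of Itô stochastic differential equations. However, neither the probability flow nor the corresponding score function is known for diffusion-jump processes. This paper presents the mathematical derivation, numerical algorithm, and error analysis for the corresponding score function in non-Gaussian systems with jumps and discontinuities, represented by the nonlinear Lévy-Fokker-Planck equations. We propose the Lévy score function for such stochastic equations, which features a nonlocal double-integral term, and develop its training algorithm by minimizing the proposed loss function over samples. Based on the equivalence between the probability flow and deterministic dynamics, we develop a self-consistent score-based transport particle algorithm to sample the interactive Lévy stochastic process at discrete time grid points. By overcoming the nonlocal challenge in the Lévy score, we provide an error bound for the Kullback-Leibler divergence between the numerical and true probability density functions. A full error analysis covering both the Monte Carlo error and the time discretization error is also established. To demonstrate the usefulness and efficiency of our approach, numerical examples from applications in biology and finance are tested.||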
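|**2024-12-27**|[Estimation of System Parameters Including Repeated Cross-Sectional Data through Emulator-Informed Deep Generative Model](http://arxiv.org/abs/2412.19517)|null|Differential equations (DEs) are crucial for modeling the evolution of natural or engineered systems. Traditionally, the parameters in DEs are adjusted to fit data from system observations. However, in fields such as politics, economics, and biology, the available data are often collected independently at distinct time points from different subjects (i.e., repeated cross-sectional (RCS) data). When RCS data exhibit various kinds of heterogeneity, conventional optimization techniques struggle to estimate DE parameters accurately, causing a significant loss of information. To address this issue, we propose a new estimation method called the emulator-informed deep generative model (EIDGM), designed to handle RCS data. Specifically, EIDGM integrates a physics-informed neural network-based emulator that immediately generates DE solutions with a Wasserstein generative adversarial network-based parameter generator that can effectively mimic the RCS data. We evaluate EIDGM on exponential growth, logistic population models, and the Lorenz system, demonstrating its superior ability to capture parameter distributions accurately. We also apply EIDGM to an experimental dataset of amyloid beta 40 and beta 42, successfully capturing diverse parameter distribution shapes. This shows that EIDGM can be applied to model a wide range of systems and extended to uncover the operating principles of systems from limited data.||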
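|**2024-12-27**|[DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT](http://arxiv.org/abs/2412.19505)|**[link](https://github.com/yvanyin/drivingworld)**|The recent success of autoregressive (AR) generative models, such as the GPT series in natural language processing, has motivated efforts to replicate this success in visual tasks. Some works attempt to extend this approach to autonomous driving by building video-based world models that generate realistic future video sequences and predict ego states. However, prior works tend to produce unsatisfactory results, because the classic GPT framework is designed to handle 1D contextual information such as text and lacks the inherent ability to model the spatial and temporal dynamics essential for video generation. In this paper, we present DrivingWorld, a GPT-style world model for autonomous driving that features several spatial-temporal fusion mechanisms. This design models both spatial and temporal dynamics effectively, facilitating high-fidelity, long-duration video generation. Specifically, we propose a next-state prediction strategy to model temporal coherence between consecutive frames and apply a next-token prediction strategy to capture spatial information within each frame. To further improve generalization, we propose a novel masking strategy and a reweighting strategy for token prediction to mitigate long-term drifting and enable precise control. Our work demonstrates the ability to produce high-fidelity and consistent video clips of over 40 seconds, which is more than 2 times longer than state-of-the-art driving world models. Experiments show that, compared with prior works, our method achieves superior visual quality and significantly more accurate controllable future video generation. Our code is available at https://github.com/YvanYin/DrivingWorld.||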
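|**2024-12-27**|[RobotDiffuse: Motion Planning for Redundant Manipulator based on Diffusion Model](http://arxiv.org/abs/2412.19500)|**[link](https://github.com/acrobot-buaa/robotdiffuse)**|Redundant manipulators, with their higher degrees of freedom (DOFs), offer enhanced kinematic performance and versatility, making them suitable for applications such as manufacturing, surgical robotics, and human-robot collaboration. However, motion planning for these manipulators is challenging because of the increased DOFs and complex, dynamic environments. While traditional motion planning algorithms struggle with high-dimensional spaces, deep learning-based methods often face instability and inefficiency in complex tasks. This paper introduces RobotDiffuse, a diffusion model-based approach to motion planning for redundant manipulators. By integrating physical constraints with a point cloud encoder and replacing the U-Net structure with an encoder-only Transformer, RobotDiffuse improves the model's ability to capture temporal dependencies and generate smoother, more coherent motion plans. We validate the approach using a complex simulator and release a new dataset with 35M robot poses and 0.14M obstacle avoidance scenarios. Experimental results demonstrate the effectiveness of RobotDiffuse and the promise of diffusion models for motion planning tasks. The code is available at https://github.com/ACRoboT-buaa/RobotDiffuse.||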
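|**2024-12-24**|[PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models](http://arxiv.org/abs/2412.18608)|null|Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, such as an implicit neural field, a Gaussian mixture, or a mesh, without any useful structure. However, most applications and creative workflows require assets to be made of several meaningful parts that can be manipulated independently. To address this gap, we introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. First, given multiple views of a generated or rendered 3D object, a multi-view diffusion model extracts a set of plausible and view-consistent part segmentations, dividing the object into parts. Then, a second multi-view diffusion model takes each part separately, fills in the occlusions, and uses the completed views for 3D reconstruction by feeding them to a 3D reconstruction network. This completion process considers the context of the entire object to ensure that the parts integrate cohesively. The generative completion model can make up for information missing because of occlusions; in extreme cases, it can hallucinate entirely invisible parts based on the input 3D asset. We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin. We also showcase downstream applications such as 3D part editing.||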
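|**2024-12-24**|[DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers](http://arxiv.org/abs/2412.18607)|null|World model-based searching and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models rely primarily on video diffusion models, which excel at visual generation but lack the flexibility to incorporate other modalities such as actions. In contrast, autoregressive Transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify driving model simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. DrivingGPT shows strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on the large-scale nuPlan and NAVSIM benchmarks.||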
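|**2024-12-24**|[Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models](http://arxiv.org/abs/2412.18604)|null|Classifiers are important components in many computer vision tasks, serving as the foundational backbone of a wide variety of models used in diverse applications. However, understanding the decision-making process of classifiers remains a significant challenge. We propose DiffEx, a novel method that uses the capabilities of text-to-image diffusion models to explain classifier decisions. Unlike traditional GAN-based explainability models, which are limited to simple, single-concept analyses and typically require training a new model for each classifier, our approach can explain classifiers that focus on single concepts (such as faces or animals) as well as those that handle complex scenes involving multiple concepts. DiffEx uses vision-language models to create a hierarchical list of semantics, allowing users to identify not only the overall semantic influences on a classifier (e.g., the "beard" semantic in a facial classifier) but also their sub-types, such as "goatee" or "Balbo" beard. Our experiments show that DiffEx covers a significantly broader spectrum of semantics than its GAN-based counterparts, providing a hierarchical tool for a more detailed and fine-grained understanding of classifier decisions.||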
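|**2024-12-24**|[Long-Form Speech Generation with Spoken Language Models](http://arxiv.org/abs/2412.18603)|**[link](https://github.com/google-deepmind/librispeech-long)**|We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, for reasons that include the high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. With these considerations in mind, we propose SpeechSSM, the first speech language model to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose new embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long. Speech samples and the dataset are released at https://google.github.io/tacotron/publications/speechssm/||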
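|**2024-12-24**|[ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation](http://arxiv.org/abs/2412.18600)|null|Human-scene interaction (HSI) generation is crucial for applications in embodied AI, virtual reality, and robotics. Although existing methods can synthesize realistic human motions in 3D scenes and generate plausible human-object interactions, they rely heavily on datasets containing paired 3D scene and motion capture data, which are expensive and time-consuming to collect and hard to extend to diverse environments and interactions. We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis by integrating video generation and neural human rendering. Our key insight is to exploit the rich motion priors learned by state-of-the-art video generation models, which have been trained on vast amounts of natural human movement and interaction, and to use differentiable rendering to reconstruct human-scene interactions. ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects, without requiring any ground-truth motion data. We evaluate ZeroHSI on a curated dataset of diverse indoor and outdoor scenes with different interaction prompts, demonstrating its ability to generate diverse and contextually appropriate human-scene interactions.||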
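|**2024-12-24**|[DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation](http://arxiv.org/abs/2412.18597)|**[link](https://github.com/tencentarc/ditctrl)**|Sora-like video generation models have made remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models focus mainly on single-prompt generation and struggle to produce coherent scenes from multiple sequential prompts, which better reflect real-world dynamic scenarios. Although some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method under the MM-DiT architecture. Our key idea is to treat multi-prompt video generation as temporal video editing with smooth transitions. To achieve this, we first analyze MM-DiT's attention mechanism and find that its 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models, which enables mask-guided, precise semantic control across different prompts through attention sharing for multi-prompt video generation. Based on this careful design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts, without additional training. We also present MPVBench, a new benchmark specially designed for multi-prompt video generation, to evaluate multi-prompt generation performance. Extensive experiments show that our method achieves state-of-the-art performance without additional training.||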
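|**2024-12-24**|[LatentCRF: Continuous CRF for Efficient Latent Diffusion](http://arxiv.org/abs/2412.18596)|null|Latent diffusion models (LDMs) produce high-quality, photo-realistic images, but the latency of multiple costly inference iterations can restrict their applicability. We introduce LatentCRF, a continuous conditional random field (CRF) model implemented as a neural network layer, which models the spatial and semantic relationships among the latent vectors in the LDM. By replacing some of the computationally intensive LDM inference iterations with the lightweight LatentCRF, we achieve a better balance between quality, speed, and diversity. Compared with the full LDM, we increase inference efficiency by 33% with no loss in image quality or diversity. LatentCRF is an easy add-on that does not require modifying the LDM.||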
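|**2024-12-24**|[Resolution-Robust 3D MRI Reconstruction with 2D Diffusion Priors: Diverse-Resolution Training Outperforms Interpolation](http://arxiv.org/abs/2412.18584)|null|Deep learning-based 3D imaging, in particular magnetic resonance imaging (MRI), is challenging because of limited availability of 3D training data. Therefore, 2D diffusion models trained on 2D slices are starting to be used for 3D MRI reconstruction. However, as we show in this paper, existing methods typically target a fixed voxel size, and performance degrades when the voxel size varies, as is common in clinical practice. In this paper, we propose and study several approaches for resolution-robust 3D MRI reconstruction with 2D diffusion priors. From this investigation, we obtain a simple resolution-robust variational 3D reconstruction approach based on diffusion-guided regularization of randomly sampled 2D slices. This method provides reconstruction quality competitive with posterior sampling baselines. To address the sensitivity to resolution shifts, we investigate state-of-the-art model-based approaches, including Gaussian splatting, neural representations, and infinite-dimensional diffusion models, as well as a simple data-centric approach of training the diffusion model on several resolutions. Our experiments show that the model-based approaches fail to close the performance gap in 3D MRI. In contrast, the data-centric approach of training the diffusion model on various resolutions effectively provides a resolution-robust method without compromising accuracy.||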
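|**2024-12-24**|[3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement](http://arxiv.org/abs/2412.18565)|null|Despite advances in neural rendering, view synthesis and 3D model generation remain restricted to low resolutions with suboptimal multi-view consistency, owing to the scarcity of high-quality 3D datasets and the inherent limitations of multi-view diffusion models. In this study, we present a novel 3D enhancement pipeline, dubbed 3DEnhancer, which employs a multi-view latent diffusion model to enhance coarse 3D inputs while preserving multi-view consistency. Our method includes a pose-aware encoder and a diffusion-based denoiser to refine low-quality multi-view images, along with data augmentation and a multi-view attention module with epipolar aggregation to maintain consistent, high-quality 3D outputs across views. Unlike existing video-based approaches, our model supports seamless multi-view enhancement with improved coherence across diverse viewing angles. Extensive evaluations show that 3DEnhancer significantly outperforms existing methods, improving both multi-view enhancement and per-instance 3D optimization tasks.||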
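|**2024-12-24**|[GeFL: Model-Agnostic Federated Learning with Generative Models](http://arxiv.org/abs/2412.18460)|null|Federated learning (FL) is a promising paradigm in distributed learning that preserves user privacy. However, the growing size of recent models makes it unaffordable for some users to train the whole model. This leads users to adopt heterogeneous models according to their differing computing capabilities and network bandwidth. Correspondingly, because FL typically involves training a single global model, FL with heterogeneous models needs to be addressed. In this paper, we propose Generative Model-Aided Federated Learning (GeFL), which incorporates a generative model that aggregates global knowledge across users with heterogeneous models. Our experiments on various classification tasks show notable performance improvements of GeFL over baselines, as well as limitations in privacy and scalability. To tackle these concerns, we introduce a novel framework, GeFL-F, which uses feature-generative models to train target networks. We empirically demonstrate the consistent performance gains of GeFL-F, along with better privacy preservation and robustness to a large number of clients. Code is available at [1].||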
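|**2024-12-20**|[Personalized Representation from Personalized Generation](http://arxiv.org/abs/2412.16156)|**[link](https://github.com/ssundaram21/personalized-rep)**|Modern vision models excel at general-purpose downstream tasks. It is unclear, however, how they can be used for personalized vision tasks, which are both fine-grained and data-scarce. Recent work has successfully applied synthetic data to general-purpose representation learning, while advances in T2I diffusion models have made it possible to generate personalized images from just a few real examples. Here, we explore a potential connection between these ideas and formalize the challenge of using personalized synthetic data to learn personalized representations, which encode knowledge about an object of interest and can be flexibly applied to any downstream task relating to that object. We introduce an evaluation suite for this challenge, including reformulations of two existing datasets and a novel dataset explicitly constructed for this purpose, and propose a contrastive learning approach that makes creative use of image generators. We show that our method improves personalized representation learning for diverse downstream tasks, from recognition to segmentation, and analyze the characteristics of image generation approaches that are key to this gain.||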
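|**2024-12-20**|[Can Generative Video Models Help Pose Estimation?](http://arxiv.org/abs/2412.16155)|null|Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Even existing methods trained on large-scale datasets struggle in these scenarios because of the lack of identifiable correspondences or visual overlap. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose, that exploits the rich priors encoded in pretrained generative video models. We propose using a video model to hallucinate intermediate frames between two input images, effectively creating a dense visual transition that significantly simplifies the pose estimation problem. Because current video models can still produce implausible motion or inconsistent geometry, we introduce a self-consistency score that evaluates the consistency of pose predictions from sampled videos. We show that our approach generalizes across three state-of-the-art video models and yields consistent improvements over the state-of-the-art DUSt3R on four diverse datasets covering indoor, outdoor, and object-centric scenes. Our findings suggest that using large generative models trained on vast amounts of video data, which is more readily available than 3D data, is a promising avenue for improving pose estimation models. See our project page for results: https://inter-pose.github.io/.||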
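|**2024-12-20**|[Predicting human cooperation: sensitizing drift-diffusion model to interaction and external stimuli](http://arxiv.org/abs/2412.16121)|null|As humans perceive and actively engage with the world, they adjust their decisions in response to shifting group dynamics and are influenced by social interactions. This study aims to identify which aspects of interaction affect the choice between cooperation and defection. Specifically, we investigate human cooperation in the Prisoner's Dilemma game, using a drift-diffusion model to describe the decision-making process. We introduce a novel Bayesian model that captures the evolution of the model's parameters based on the nature of interactions with other players. This approach enables us to predict the evolution of the population's expected cooperation rate. We successfully validate our model on an unseen test dataset and apply it to explore three strategic scenarios: co-player manipulation, the use of rewards and punishments, and time pressure. These results suggest that our model could serve as a foundational tool for developing and testing strategies aimed at enhancing cooperation, ultimately contributing to improved social welfare.||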
|**2024-12-20**|[Label-Efficient Data Augmentation with Video Diffusion Models for Guidewire Segmentation in Cardiac Fluoroscopy](http://arxiv.org/abs/2412.16050)|null|The accurate segmentation of guidewires in interventional cardiac fluoroscopy videos is crucial for computer-aided navigation tasks. Although deep learning methods have demonstrated high accuracy and robustness in wire segmentation, they require substantial annotated datasets for generalizability, underscoring the need for extensive labeled data to enhance model performance. To address this challenge, we propose the Segmentation-guided Frame-consistency Video Diffusion Model (SF-VD) to generate large collections of labeled fluoroscopy videos, augmenting the training data for wire segmentation networks. SF-VD leverages videos with limited annotations by independently modeling scene distribution and motion distribution. It first samples the scene distribution by generating 2D fluoroscopy images with wires positioned according to a specified input mask, and then samples the motion distribution by progressively generating subsequent frames, ensuring frame-to-frame coherence through a frame-consistency strategy. A segmentation-guided mechanism further refines the process by adjusting wire contrast, ensuring a diverse range of visibility in the synthesized image. Evaluation on a fluoroscopy dataset confirms the superior quality of the generated videos and shows significant improvements in guidewire segmentation.||
|**2024-12-20**|[SafeCFG: Redirecting Harmful Classifier-Free Guidance for Safe Generation](http://arxiv.org/abs/2412.16039)|null|Diffusion models (DMs) have demonstrated exceptional performance in text-to-image (T2I) tasks, leading to their widespread use. The introduction of classifier-free guidance (CFG) improved the quality of images generated by DMs. However, CFG can also be used maliciously to guide the image generation process so that DMs produce more harmful images. Some safe guidance methods aim to mitigate the risk of generating harmful images but often reduce the quality of clean image generation. To address this issue, we introduce the Harmful Guidance Redirector (HGR), which redirects harmful CFG directions while preserving clean CFG directions during image generation, transforming CFG into SafeCFG and achieving generation that is both safe and high in quality. We train HGR to redirect multiple harmful CFG directions simultaneously, demonstrating its ability to eliminate various harmful elements while preserving high-quality generation. Additionally, we find that HGR can detect image harmfulness, allowing for unsupervised fine-tuning of safe diffusion models without pre-defined clean or harmful labels. Experimental results show that by incorporating HGR, images generated by diffusion models achieve both high quality and strong safety, and safe DMs trained through unsupervised methods according to the harmfulness detected by HGR also exhibit good safety performance. The code will be made publicly available.||
|**2024-12-20**|[Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation](http://arxiv.org/abs/2412.15939)|**[link](https://github.com/gautierevn/blip2idc)**|The rising quality of generative models in recent years has enabled the generation of edited image variations at a large scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims to describe the differences between two images. While this task is handled successfully for simple 3D-rendered images, it remains difficult on real-world images. The reason is twofold: the scarcity of training data, and the difficulty of capturing fine-grained differences between complex images. To address these issues, we propose in this paper a simple yet effective framework to both adapt existing image captioning models to the IDC task and augment IDC datasets. We introduce BLIP2IDC, an adaptation of BLIP2 to the IDC task at low computational cost, and show that it outperforms two-stream approaches by a significant margin on real-world IDC datasets. We also propose using synthetic augmentation to improve the performance of IDC models in an agnostic fashion. We show that our synthetic augmentation strategy provides high-quality data, leading to a challenging new dataset well suited for IDC, named Syned1.||
|**2024-12-20**|[RiTTA: Modeling Event Relations in Text-to-Audio Generation](http://arxiv.org/abs/2412.15922)|**[link](https://github.com/yuhanghe01/ritta)**|Although Text-to-Audio (TTA) generation models have advanced significantly, achieving high-fidelity audio with fine-grained context understanding, they struggle to model the relations between audio events described in the input text. Previous TTA methods have neither systematically explored audio event relation modeling nor proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard audios; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance the ability of existing TTA models to model audio event relations. Code is available at: https://github.com/yuhanghe01/RiTTA||
|**2024-12-20**|[Semi-Supervised Adaptation of Diffusion Models for Handwritten Text Generation](http://arxiv.org/abs/2412.15853)|null|Generating images of realistic-looking, readable handwritten text is a challenging task referred to as handwritten text generation (HTG). Given a string and examples from a writer, the goal is to synthesize an image depicting the correctly spelled word in handwriting with the calligraphic style of the desired writer. An important application of HTG is the generation of training images in order to adapt downstream models to new data sets. With their success in natural image generation, diffusion models (DMs) have become the state-of-the-art approach in HTG. In this work, we present an extension of a latent DM for HTG that enables generation of writing styles not seen during training by learning style conditioning with a masked autoencoder. Our proposed content encoder allows for different ways of conditioning the DM on textual and calligraphic features. Additionally, we employ classifier-free guidance and explore its influence on the quality of the generated training images. To adapt the model to a new unlabeled data set, we propose a semi-supervised training scheme. We evaluate our approach on the IAM database and use the RIMES database to examine the generation of data not seen during training, achieving improvements in this particularly promising application of DMs for HTG.||
|**2024-12-20**|[Electromagnetic particle-in-cell modeling of an electron cyclotron resonance plasma discharge in hydrogen](http://arxiv.org/abs/2412.15802)|null|A low pressure discharge sustained in molecular hydrogen with help of the electron cyclotron resonance heating at a frequency of 2.45 GHz is simulated using a fully electromagnetic implicit charge- and energy-conserving particle-in-cell/Monte Carlo code. The simulations show a number of kinetic effects, and the results are in good agreement with various experimentally measured data such as electron density, electron temperature and degree of dissociation. The electron energy distribution shows a tri-Maxwellian form due to a number of different electron heating mechanisms, agreeing with the experimental data in the measured electron energy interval. The simulation results are also verified against a drift-diffusion model and proximity is observed between the computational results for the plasma density at the location of experimental measurement. However, the fluid approximation fails to accurately predict radical density and electron temperature because of the assumption of a single electron temperature. Special attention is paid to the characteristics of hydrogen radicals, whose production is strongly underestimated by the fluid model, whereas it is well predicted by the model considered here. The energy distribution of such radicals demonstrates the presence of a relatively large number of energetic hydrogen atoms produced by the dissociation of molecular hydrogen. The new insights are of significance for practical applications of hydrogen plasmas.||
|**2024-12-20**|[Diffusion-Based Conditional Image Editing through Optimized Inference with Guidance](http://arxiv.org/abs/2412.15798)|null|We present a simple but effective training-free approach for text-driven image-to-image translation based on a pretrained text-to-image diffusion model. Our goal is to generate an image that aligns with the target task while preserving the structure and background of a source image. To this end, we derive the representation guidance from a combination of two objectives: maximizing the similarity to the target prompt based on the CLIP score and minimizing the structural distance to the source latent variable. This guidance improves the fidelity of the generated target image to the given target prompt while maintaining the structural integrity of the source image. To incorporate the representation guidance component, we optimize the target latent variable in the diffusion model's reverse process with the guidance. Experimental results demonstrate that our method achieves outstanding image-to-image translation performance on various tasks when combined with the pretrained Stable Diffusion model.||
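|**2024-12-19**|[LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis](http://arxiv.org/abs/2412.15214)|**[link](https://github.com/qiuyu96/LeviTor)**|The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. However, existing methods that perform dragging in 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, depth, so that users can assign a relative depth to each point on the trajectory. In this way, our new interaction paradigm not only inherits the convenience of 2D dragging but also facilitates trajectory control in 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis that abstracts object masks into a few cluster points. These points, together with depth and instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely controlling object movements when producing photo-realistic videos from static images. Project page: https://ppetrichor.github.io/levitor.github.io/||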
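|**2024-12-19**|[Flowing from Words to Pixels: A Framework for Cross-Modality Evolution](http://arxiv.org/abs/2412.15213)|null|Diffusion models and their generalization, flow matching, have had a remarkable impact on media generation. The conventional approach is to learn a complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, the same noise-to-image mapping is learned while a conditioning mechanism is included in the model. One key and so far relatively unexplored feature of flow matching is that, unlike diffusion models, it does not require the source distribution to be noise. Hence, in this paper, we propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, removing the need for both the noise distribution and the conditioning mechanism. We present CrossFlow, a general and simple framework for cross-modal flow matching. We show the importance of applying variational encoders to the input data and introduce a method to enable classifier-free guidance. Surprisingly, for text-to-image generation, CrossFlow with a vanilla Transformer and no cross-attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing interesting latent arithmetic that results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state of the art on various cross-modal/intra-modal mapping tasks, such as image captioning, depth estimation, and image super-resolution. We hope this paper helps accelerate progress in cross-modal media generation.||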
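|**2024-12-19**|[Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation](http://arxiv.org/abs/2412.15211)|null|Reconstructing the geometry and appearance of objects from photographs taken in different environments is difficult, because the illumination, and therefore the object's appearance, varies across the captured images. This is particularly challenging for more specular objects, whose appearance depends strongly on the viewing direction. Some prior approaches model appearance variation across images using a per-image embedding vector, while others use physically based rendering to recover the materials and per-image illumination. Such approaches fail to faithfully recover view-dependent appearance under significant variation in input illumination and tend to produce mostly diffuse results. We present an approach that reconstructs objects from images taken under different illuminations by first relighting the images under a single reference illumination with a multiview relighting diffusion model, and then reconstructing the object's geometry and appearance with a radiance field architecture that is robust to the small remaining inconsistencies among the relit images. We validate our proposed approach on both synthetic and real datasets and show that it greatly outperforms existing techniques at reconstructing high-fidelity appearance from images taken under extreme illumination variation. Moreover, our approach is particularly effective at recovering view-dependent "shiny" appearance that prior methods cannot reconstruct.||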
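|**2024-12-19**|[AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation](http://arxiv.org/abs/2412.15191)|null|We propose AV-Link, a unified framework for video-to-audio and audio-to-video generation that uses the activations of frozen video and audio diffusion models for temporally aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally aligned self-attention operation. Unlike prior work that uses feature extractors pretrained for other tasks as the conditioning signal, AV-Link can directly use features obtained from the complementary modality within a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate that our method produces synchronized, high-quality audiovisual content, showing its potential for immersive media generation applications. Project page: snap-research.github.io/AVLink/||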
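|**2024-12-19**|[LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation](http://arxiv.org/abs/2412.15188)|null|We present LlamaFusion, a framework that gives pretrained text-only large language models (LLMs) multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LlamaFusion uses existing Llama-3 weights to process text autoregressively while introducing additional, parallel transformer modules that process images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and training only the image-specific modules, LlamaFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. Compared with methods that pretrain multimodal generative models from scratch, our experiments show that LlamaFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs, while maintaining Llama-3's language capabilities. We also show that this framework can equip existing vision-language models with multimodal generation ability. Overall, the framework not only makes use of the existing computational investment in text-only LLMs but also allows language and vision capabilities to be developed in parallel, offering a promising direction for efficient multimodal model development.||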
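|**2024-12-19**|[Tiled Diffusion](http://arxiv.org/abs/2412.15185)|null|Image tiling, the seamless connection of disparate images to create a coherent visual field, is crucial for applications such as texture creation, video game asset development, and digital art. Traditionally, tiles have been constructed manually, an approach with significant limitations in scalability and flexibility. Recent research has attempted to automate this process with generative models. However, current approaches focus mainly on tiling textures and on manipulating models for single-image generation, without inherently supporting the creation of multiple interconnected tiles across diverse domains. This paper presents Tiled Diffusion, a novel approach that extends the capabilities of diffusion models to generate cohesive tiling patterns across the various image synthesis domains that require tiling. Our method supports a wide range of tiling scenarios, from self-tiling to complex many-to-many connections, enabling seamless integration of multiple images. Tiled Diffusion automates the tiling process, removes the need for manual intervention, and expands creative possibilities in applications such as seamless tiling of existing images, tiled texture creation, and 360° synthesis.||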
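|**2024-12-19**|[OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization](http://arxiv.org/abs/2412.15159)|null|The field of text-to-video (T2V) generation has made significant progress in recent years. Despite this progress, a gap remains between theoretical advances and practical application, amplified by issues such as degraded image quality and flickering artifacts. Recent work on enhancing video diffusion models (VDMs) through feedback learning has shown promising results. However, these methods still have notable limitations, such as misaligned feedback and poor scalability. To tackle these issues, we introduce OnlineVPO, a more efficient preference learning approach tailored specifically to video diffusion models. Our method features two novel designs. First, instead of directly using image-based reward feedback, we use a video quality assessment (VQA) model trained on synthetic data as the reward model, providing distribution- and modality-aligned feedback to the video diffusion model. Second, we introduce an online DPO algorithm to address the off-policy optimization and scalability issues in existing video preference learning frameworks. By using the video reward model to provide concise video feedback on the fly, OnlineVPO offers effective and efficient preference guidance. Extensive experiments on an open-source video diffusion model show that OnlineVPO is a simple yet effective and, more importantly, scalable preference learning algorithm for video diffusion models, offering valuable insights for future work in this area.||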
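|**2024-12-19**|[Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM](http://arxiv.org/abs/2412.15156)|**[link](https://github.com/jiyt17/prompt-a-video)**|Text-to-video models have made remarkable progress by optimizing on high-quality text-video pairs, where the textual prompt plays a pivotal role in determining the quality of the output video. However, obtaining the desired output often requires multiple revisions and iterative inference to refine user-provided prompts. Current automatic methods for refining prompts run into challenges such as modality inconsistency, cost discrepancy, and model unawareness when applied to text-to-video diffusion models. To address these problems, we introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video, which excels at crafting video-centric, labor-free, and preference-aligned prompts tailored to a specific video diffusion model. Our approach involves a carefully designed two-stage optimization and alignment system. First, we run a reward-guided prompt evolution pipeline to automatically create an optimal prompt pool, and use it for supervised fine-tuning (SFT) of the LLM. Then, multi-dimensional rewards are used to generate pairwise data for the SFT model, followed by the direct preference optimization (DPO) algorithm to further facilitate preference alignment. Through extensive experiments and comparative analyses, we validate the effectiveness of Prompt-A-Video across diverse generation models, highlighting its potential to push the boundaries of video generation.||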
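|**2024-12-19**|[Jet: A Modern Transformer-Based Normalizing Flow](http://arxiv.org/abs/2412.15129)|null|In the past, normalizing flows emerged as a promising class of generative models for natural images. This type of model has many modeling advantages: the ability to compute the log-likelihood of the input data efficiently, fast generation, and a simple overall structure. Normalizing flows were once an active research topic but later fell out of favor, because the visual quality of their samples was not competitive with other model classes such as GANs, VQ-VAE-based approaches, or diffusion models. In this paper, we revisit the design of coupling-based normalizing flow models, carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture rather than convolutional neural networks. As a result, we achieve state-of-the-art quantitative and qualitative performance with a much simpler architecture. While the overall visual quality still lags behind the current state-of-the-art models, we argue that strong normalizing flow models can help advance the research frontier by serving as building components of more powerful generative models.||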
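|**2024-12-19**|[Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation](http://arxiv.org/abs/2412.15109)|**[link](https://github.com/openrobotlab/seer)**|Current efforts to learn scalable policies in robotic manipulation fall primarily into two categories: one focuses on "action", which involves behavior cloning from extensive collections of robotic data, while the other emphasizes "vision", enhancing model generalization by pre-training representations or generative models (also called world models) on large-scale visual datasets. This paper presents an end-to-end paradigm that predicts actions using inverse dynamics models conditioned on the robot's forecasted visual states, named Predictive Inverse Dynamics Models (PIDM). By closing the loop between vision and action, the end-to-end PIDM can be a better scalable action learner. In practice, we use Transformers to process both visual states and actions, and name the model Seer. It is first pre-trained on large-scale robotic datasets such as DROID and can be adapted to real-world scenarios with a small amount of fine-tuning data. Thanks to large-scale, end-to-end training and the synergy between vision and action, Seer significantly outperforms previous methods in both simulation and real-world experiments. It achieves improvements of 13% on the LIBERO-LONG benchmark, 21% on CALVIN ABC-D, and 43% in real-world tasks. Notably, Seer sets a new state of the art on the CALVIN ABC-D benchmark with an average length of 4.28, and exhibits superior generalization to novel objects, lighting conditions, and environments under high-intensity disturbances in real-world scenarios. Code and models are publicly available at https://github.com/OpenRobotLab/Seer/.||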
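|**2024-12-17**|[CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models](http://arxiv.org/abs/2412.13195)|**[link](https://github.com/blurgyy/compass)**|Text-to-image diffusion models excel at generating photorealistic images but commonly struggle to render the spatial relationships described in text prompts. We identify two core issues underlying this common failure: 1) the ambiguous nature of spatial-related data in existing datasets, and 2) the inability of current text encoders to accurately interpret the spatial semantics of input descriptions. We address these issues with CoMPaSS, a versatile training framework that enhances the spatial understanding of any T2I diffusion model. CoMPaSS resolves the ambiguity of spatial-related data with the Spatial Constraints-Oriented Pairing (SCOP) data engine, which curates spatially accurate training data through a set of principled spatial constraints. To better exploit the curated high-quality spatial priors, CoMPaSS further introduces a Token Encoding Ordering (TENOR) module, which effectively compensates for the shortcomings of text encoders. Extensive experiments on four popular open-weight T2I diffusion models covering both UNet and MMDiT architectures show that CoMPaSS sets a new state of the art, with substantial relative gains on well-known benchmarks for spatial relationship generation, including VISOR (+98%), T2I-CompBench Spatial (+67%), and GenEval Position (+131%). Code will be available at https://github.com/blurgyy/CoMPaSS.||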
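|**2024-12-17**|[StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models](http://arxiv.org/abs/2412.13188)|null|This paper tackles the problem of photorealistic view synthesis from vehicle sensor data. Recent neural scene representations have made notable progress in rendering high-quality autonomous driving scenes, but performance degrades significantly as the viewpoint deviates from the training trajectory. To mitigate this problem, we introduce StreetCrafter, a novel controllable video diffusion model that uses LiDAR point cloud renderings as pixel-level conditions, fully exploiting the generative prior for novel view synthesis while preserving precise camera control. Moreover, the pixel-level LiDAR conditions allow accurate pixel-level edits to target scenes. In addition, the generative prior of StreetCrafter can be effectively incorporated into dynamic scene representations to achieve real-time rendering. Experiments on the Waymo Open Dataset and PandaSet show that our model enables flexible control over viewpoint changes, enlarges the view synthesis region for satisfying rendering, and outperforms existing methods.||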
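|**2024-12-17**|[Move-in-2D: 2D-Conditioned Human Motion Generation](http://arxiv.org/abs/2412.13185)|null|Generating realistic human videos remains a challenging task, with the most effective current methods relying on a human motion sequence as a control signal. Existing approaches often use existing motion extracted from other videos, which restricts applications to specific motion types and global scene matching. We propose Move-in-2D, a novel approach that generates human motion sequences conditioned on a scene image, allowing diverse motion that adapts to different scenes. Our approach uses a diffusion model that accepts both a scene image and a text prompt as inputs and produces a motion sequence suited to the scene. To train this model, we collect a large-scale video dataset of single-person activities, annotating each video with the corresponding human motion as the target output. Experiments show that our method effectively predicts human motion that aligns with the scene image after projection. Furthermore, we show that the generated motion sequences improve human motion quality in video synthesis tasks.||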
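|**2024-12-17**|[F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration](http://arxiv.org/abs/2412.13155)|null|AI generative models exhibit remarkable capabilities in content creation, particularly in face image generation, customization, and restoration. However, because of unique distortions, unrealistic details, and unexpected identity shifts, current AI-generated faces (AIGFs) often fall short of human expectations, underscoring the need for a comprehensive quality evaluation framework for AIGFs. To meet this need, we introduce FaceQ, a large-scale, comprehensive database of AI-generated face images with fine-grained quality annotations reflecting human preferences. The FaceQ database comprises 12,255 images generated by 29 models across three tasks: (1) face generation, (2) face customization, and (3) face restoration. It includes 32,742 mean opinion scores (MOSs) from 180 annotators, covering multiple dimensions: quality, authenticity, identity (ID) fidelity, and text-image correspondence. Using the FaceQ database, we establish F-Bench, a benchmark for comparing and evaluating face generation, customization, and restoration models, highlighting strengths and weaknesses across various prompts and evaluation dimensions. In addition, we assess the performance of existing image quality assessment (IQA), face quality assessment (FQA), AI-generated content image quality assessment (AIGCIQA), and preference evaluation metrics, showing that these standard metrics are relatively ineffective at evaluating authenticity, ID fidelity, and text-image correspondence. The FaceQ database will be made publicly available upon publication.||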
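|**2024-12-17**|[Prompt Augmentation for Self-supervised Text-guided Image Manipulation](http://arxiv.org/abs/2412.13081)|null|Text-guided image editing finds applications in various creative and practical fields. Although recent studies in image generation have advanced the field, they often struggle with the dual challenges of coherent image transformation and context preservation. In response, our work introduces prompt augmentation, a method that amplifies a single input prompt into several target prompts, strengthening textual context and enabling localized image editing. Specifically, we use the augmented prompts to delineate the intended manipulation area. We propose a Contrastive Loss tailored to drive effective image editing by displacing edited areas and drawing preserved regions closer. Acknowledging the continuous nature of image manipulations, we further refine our approach by incorporating a notion of similarity, yielding a Soft Contrastive Loss. The new losses are incorporated into the diffusion model, demonstrating image editing results that improve on or are competitive with state-of-the-art approaches on public datasets and generated images.||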
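|**2024-12-17**|[Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance](http://arxiv.org/abs/2412.12974)|**[link](https://github.com/anonym0u3/attentiveeraser)**|Recently, diffusion models have emerged as promising newcomers in the field of generative models, excelling at image generation. However, when applied to object removal tasks, they still suffer from issues such as generating random artifacts and failing to repaint the foreground object area with appropriate content after removal. To tackle these problems, we propose Attentive Eraser, a tuning-free method that equips pre-trained diffusion models with stable and effective object removal. First, since self-attention maps influence the structure and shape details of generated images, we propose Attention Activation and Suppression (ASS), which re-engineers the self-attention mechanism within the pre-trained diffusion model based on the given mask, prioritizing the background over the foreground object during the reverse generation process. Moreover, we introduce Self-Attention Redirection Guidance (SARG), which uses the self-attention redirected by ASS to guide the generation process, effectively removing foreground objects within the mask while generating plausible and coherent content. Experiments demonstrate the stability and effectiveness of Attentive Eraser in object removal across a variety of pre-trained diffusion models, outperforming even training-based methods. Furthermore, Attentive Eraser can be applied to various diffusion model architectures and checkpoints, enabling excellent scalability. Code is available at https://github.com/Anonym0u3/AttentiveEraser.||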
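|**2024-12-17**|[ArchesWeather & ArchesWeatherGen: a deterministic and generative model for efficient ML weather forecasting](http://arxiv.org/abs/2412.12971)|**[link](https://github.com/inria/geoarches)**|Weather forecasting plays a vital role in today's society, from agriculture and logistics to predicting the output of renewable energies and preparing for extreme weather events. Deep learning weather forecasting models trained on the ERA5 dataset with a next-state prediction objective have achieved great success compared with numerical global circulation models. However, for a wide range of applications, being able to provide representative samples from the distribution of possible future weather states is critical. In this paper, we propose a methodology that leverages deterministic weather models in the design of probabilistic weather models, leading to improved performance and reduced computing costs. We first introduce ArchesWeather, a Transformer-based deterministic model that improves on Pangu-Weather by removing overrestrictive inductive priors. We then design a probabilistic weather model called ArchesWeatherGen, based on flow matching (a modern variant of diffusion models) and trained to project ArchesWeather's predictions onto the distribution of ERA5 weather states. ArchesWeatherGen is a true stochastic emulator of ERA5 and surpasses IFS ENS and NeuralGCM on all WeatherBench headline variables (except for NeuralGCM's geopotential). Our work also aims to democratize the use of deterministic and generative machine learning models in weather forecasting research with academic computing resources. All models are trained at 1.5-degree resolution, with a training budget of about 9 V100-days for ArchesWeather and about 45 V100-days for ArchesWeatherGen. For inference, ArchesWeatherGen generates 15-day weather trajectories at a rate of one ensemble member per minute on an A100 GPU card. To make our work fully reproducible, our code and models are open source, including the complete pipeline for data preparation, training, and evaluation, at https://github.com/INRIA/geoarches.||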
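|**2024-12-17**|[MOPO: Multi-Objective Prompt Optimization for Affective Text Generation](http://arxiv.org/abs/2412.12948)|null|How emotions are expressed depends on the context and domain. On X (formerly Twitter), for instance, an author might simply use the hashtag #anger, while in a news headline, emotions are typically expressed in a more polite, indirect manner. To enable conditional text generation models to create emotionally connotated texts that fit a domain, users need a parameter for choosing the appropriate way to express an emotion. To achieve this, we introduce MOPO, a Multi-Objective Prompt Optimization methodology. MOPO optimizes prompts according to multiple objectives, which correspond to the output probabilities assigned by emotion classifiers trained for different domains. In contrast to single-objective optimization, MOPO outputs a set of prompts, each with a different weighting of the multiple objectives. Users can then choose the prompt best suited to their context. We evaluate MOPO using three objectives determined by various domain-specific emotion classifiers. Compared with single-objective optimization, MOPO improves performance by up to 15 percentage points across all objectives, with a minimal loss (1-2 percentage points) for any single objective. These small losses are offset by broader generalization across multiple objectives, which single-objective optimization cannot achieve. Additionally, MOPO reduces computational requirements by optimizing multiple objectives simultaneously, eliminating the need for a separate optimization procedure for each objective.||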
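|**2024-12-17**|[Generation of cosmic ray trajectories by a Diffusion Model trained on test particles in 3D magnetohydrodynamic turbulence](http://arxiv.org/abs/2412.12923)|null|Models for the transport of high-energy charged particles through strong magnetic turbulence play a key role in space and astrophysical studies, for example in describing the propagation of solar energetic particles and high-energy cosmic rays. Inspired by recent advances in high-performance machine learning techniques, we investigate the application of generative diffusion models to synthesizing test particle trajectories obtained from a turbulent magnetohydrodynamics simulation. We consider velocity increment, spatial transport, and curvature statistics, and find excellent agreement with the baseline trajectories for fixed particle energies. In addition, we consider two synthetic turbulence models for comparison. Finally, we discuss the challenges facing an application-ready transport model based on our approach.||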
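|**2024-12-17**|[Unsupervised Region-Based Image Editing of Denoising Diffusion Models](http://arxiv.org/abs/2412.12912)|null|Although diffusion models have achieved remarkable success in image generation, their latent space remains under-explored. Current methods for identifying semantics within the latent space often rely on external supervision, such as textual information and segmentation masks. In this paper, we propose a method to identify semantic attributes in the latent space of pre-trained diffusion models without any further training. By projecting the Jacobian of the targeted semantic region into a low-dimensional subspace that is orthogonal to the non-masked regions, our approach facilitates precise semantic discovery and control over local masked areas, eliminating the need for annotations. We conducted extensive experiments across multiple datasets and various diffusion model architectures, achieving state-of-the-art performance. In particular, for some specific face attributes, our method even surpasses supervised approaches, demonstrating its superior ability in editing local image properties.||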
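|**2024-12-13**|[Towards a foundation model for heavy-ion collision experiments through point cloud diffusion](http://arxiv.org/abs/2412.10352)|null|This paper introduces a novel point cloud diffusion model for relativistic heavy-ion collisions, capable of ultra-fast generation of event-by-event collision output. After training on UrQMD cascade simulations, the model generates realistic collision event output containing 26 distinct hadron species, in the form of a list of particle momentum vectors and their particle IDs. From solving inverse problems to accelerating model calculations or detector simulations, the model can be a promising general-purpose tool for heavy-ion collisions, benefiting both theoretical studies and experimental applications.||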
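|**2024-12-13**|[BrushEdit: All-In-One Image Inpainting and Editing](http://arxiv.org/abs/2412.10316)|null|Image editing has advanced significantly with the development of inversion-based and instruction-based diffusion models. However, current inversion-based approaches struggle with large modifications (e.g., adding or removing objects) because the structured nature of inversion noise hinders substantial changes. Meanwhile, instruction-based methods often confine users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based, instruction-guided image editing paradigm that leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system that enables free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics, including mask region preservation and editing effect coherence.||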
|**2024-12-13**|[Coherent 3D Scene Diffusion From a Single RGB Image](http://arxiv.org/abs/2412.10294)|null|We present a novel diffusion-based approach for coherent 3D scene reconstruction from a single RGB image. Our method utilizes an image-conditioned 3D scene diffusion model to simultaneously denoise the 3D poses and geometries of all objects within the scene. Motivated by the ill-posed nature of the task and to obtain consistent scene reconstruction results, we learn a generative scene prior by conditioning on all scene objects simultaneously to capture the scene context and by allowing the model to learn inter-object relationships throughout the diffusion process. We further propose an efficient surface alignment loss to facilitate training even in the absence of full ground-truth annotation, which is common in publicly available datasets. This loss leverages an expressive shape representation, which enables direct point sampling from intermediate shape predictions. By framing the task of single RGB image 3D scene reconstruction as a conditional diffusion process, our approach surpasses current state-of-the-art methods, achieving a 12.04% improvement in AP3D on SUN RGB-D and a 13.43% increase in F-Score on Pix3D.||
|**2024-12-13**|[Adversarial Robustness of Bottleneck Injected Deep Neural Networks for Task-Oriented Communication](http://arxiv.org/abs/2412.10265)|null|This paper investigates the adversarial robustness of Deep Neural Networks (DNNs) using Information Bottleneck (IB) objectives for task-oriented communication systems. We empirically demonstrate that while IB-based approaches provide baseline resilience against attacks targeting downstream tasks, the reliance on generative models for task-oriented communication introduces new vulnerabilities. Through extensive experiments on several datasets, we analyze how bottleneck depth and task complexity influence adversarial robustness. Our key findings show that Shallow Variational Bottleneck Injection (SVBI) provides less adversarial robustness compared to Deep Variational Information Bottleneck (DVIB) approaches, with the gap widening for more complex tasks. Additionally, we reveal that IB-based objectives exhibit stronger robustness against attacks focusing on salient pixels with high intensity compared to those perturbing many pixels with lower intensity. Lastly, we demonstrate that task-oriented communication systems that rely on generative models to extract and recover salient information have an increased attack surface. The results highlight important security considerations for next-generation communication systems that leverage neural networks for goal-oriented compression.||
|**2024-12-13**|[Targeted Angular Reversal of Weights (TARS) for Knowledge Removal in Large Language Models](http://arxiv.org/abs/2412.10257)|null|The sheer scale of data required to train modern large language models (LLMs) poses significant risks, as models are likely to gain knowledge of sensitive topics such as bio-security, as well as the ability to replicate copyrighted works. Methods designed to remove such knowledge must do so from all prompt directions, in a multi-lingual capacity and without degrading general model performance. To this end, we introduce the targeted angular reversal (TARS) method of knowledge removal from LLMs. The TARS method firstly leverages the LLM in combination with a detailed prompt to aggregate information about a selected concept in the internal representation space of the LLM. It then refines this approximate concept vector to trigger the concept token with high probability, by perturbing the approximate concept vector with noise and transforming it into token scores with the language model head. The feedforward weight vectors in the LLM which operate directly on the internal representation space, and have the highest cosine similarity with this targeting vector, are then replaced by a reversed targeting vector, thus limiting the ability of the concept to propagate through the model. The modularity of the TARS method allows for a sequential removal of concepts from Llama 3.1 8B, such as the famous literary detective Sherlock Holmes, and the planet Saturn. It is demonstrated that the probability of triggering target concepts can be reduced to 0.00 with as few as 1 TARS edit, whilst simultaneously removing the knowledge bi-directionally. Moreover, knowledge is shown to be removed across all languages despite only being targeted in English. Importantly, TARS has minimal impact on the general model capabilities, as after removing 5 diverse concepts in a modular fashion, there is minimal KL divergence in the next token probabilities of the LLM on large corpora of Wikipedia text (median of 0.002).||
|**2024-12-13**|[Exploring the Frontiers of Animation Video Generation in the Sora Era: Method, Dataset and Benchmark](http://arxiv.org/abs/2412.10255)|**[link](https://github.com/bilibili/index-anisora)**|Animation has gained significant interest in the recent film and TV industry. Despite the success of advanced video generation models like Sora, Kling, and CogVideoX in generating natural videos, they lack the same effectiveness in handling animation videos. Evaluating animation video generation is also a great challenge due to its unique artistic styles, violations of the laws of physics, and exaggerated motions. In this paper, we present a comprehensive system, AniSora, designed for animation video generation, which includes a data processing pipeline, a controllable generation model, and an evaluation dataset. Supported by the data processing pipeline with over 10M high-quality data, the generation model incorporates a spatiotemporal mask module to facilitate key animation production functions such as image-to-video generation, frame interpolation, and localized image-guided animation. We also collect an evaluation benchmark of 948 various animation videos; evaluation on VBench and a human double-blind test demonstrates consistency in character and motion, achieving state-of-the-art results in animation video generation. Our model access API and evaluation benchmark will be publicly available.||
|**2024-12-13**|[GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion](http://arxiv.org/abs/2412.10209)|null|We propose a novel approach for reconstructing animatable 3D Gaussian avatars from monocular videos captured by commodity devices like smartphones. Photorealistic 3D head avatar reconstruction from such recordings is challenging due to limited observations, which leaves unobserved regions under-constrained and can lead to artifacts in novel views. To address this problem, we introduce a multi-view head diffusion model, leveraging its priors to fill in missing regions and ensure view consistency in Gaussian splatting renderings. To enable precise viewpoint control, we use normal maps rendered from FLAME-based head reconstruction, which provides pixel-aligned inductive biases. We also condition the diffusion model on VAE features extracted from the input image to preserve details of facial identity and appearance. For Gaussian avatar reconstruction, we distill multi-view diffusion priors by using iteratively denoised images as pseudo-ground truths, effectively mitigating over-saturation issues. To further improve photorealism, we apply latent upsampling to refine the denoised latent before decoding it into an image. We evaluate our method on the NeRSemble dataset, showing that GAF outperforms the previous state-of-the-art methods in novel view synthesis by a 5.34% higher SSIM score. Furthermore, we demonstrate higher-fidelity avatar reconstructions from monocular videos captured on commodity devices.||
|**2024-12-13**|[Efficient Generative Modeling with Residual Vector Quantization-Based Tokens](http://arxiv.org/abs/2412.10208)|null|We explore the use of Residual Vector Quantization (RVQ) for high-fidelity generation in vector-quantized generative models. This quantization technique maintains higher data fidelity by employing more in-depth tokens. However, increasing the token number in generative models leads to slower inference speeds. To this end, we introduce ResGen, an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed. Our key idea is a direct prediction of vector embedding of collective tokens rather than individual ones. Moreover, we demonstrate that our proposed token masking and multi-token prediction method can be formulated within a principled probabilistic framework using a discrete diffusion process and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation on ImageNet 256x256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models. The project page can be found at https://resgen-genai.github.io||
|**2024-12-13**|[Simple Guidance Mechanisms for Discrete Diffusion Models](http://arxiv.org/abs/2412.10193)|**[link](https://github.com/kuleshov-group/discrete-diffusion-guidance)**|Diffusion models for continuous data gained widespread adoption owing to their high quality generation and control mechanisms. However, controllable diffusion on discrete data faces challenges given that continuous guidance methods do not directly apply to discrete diffusion. Here, we provide a straightforward derivation of classifier-free and classifier-based guidance for discrete diffusion, as well as a new class of diffusion models that leverage uniform noise and that are more guidable because they can continuously edit their outputs. We improve the quality of these models with a novel continuous-time variational lower bound that yields state-of-the-art performance, especially in settings involving guidance or fast generation. Empirically, we demonstrate that our guidance mechanisms combined with uniform noise diffusion improve controllable generation relative to autoregressive and diffusion baselines on several discrete data domains, including genomic sequences, small molecule design, and discretized image generation.||
|**2024-12-13**|[SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models](http://arxiv.org/abs/2412.10178)|null|Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. While significant advances have been made in image-based virtual try-ons, extending these successes to video often results in frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to the repeated processing of the same frames, especially for long video sequences. To address these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we introduce ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets. Extensive experiments show that our approach outperforms current baselines, particularly in terms of video consistency and inference speed. Data and code are available at https://github.com/VinAIResearch/swift-try||
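|**2024-12-12**|[FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion](http://arxiv.org/abs/2412.09626)|null|Visual diffusion models have made remarkable progress, yet they are typically trained at limited resolutions because of the lack of high-resolution data and constrained computational resources, which hampers their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle is the inevitable increase in high-frequency information when the model generates visual content beyond its training resolution, which leads to undesirable repetitive patterns arising from accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm that enables higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting the desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Notably, compared with the previous best-performing method, FreeScale achieves the generation of 8k-resolution images for the first time.||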
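|**2024-12-12**|[Illusion3D: 3D Multiview Illusion with 2D Diffusion Priors](http://arxiv.org/abs/2412.09625)|null|Automatically generating multiview illusions is a compelling challenge, in which a single piece of visual content offers distinct interpretations from different viewing perspectives. Traditional methods, such as shadow art and wire art, create interesting 3D illusions but are limited to simple visual outputs (i.e., figure-ground or line drawing), restricting their artistic expressiveness and practical versatility. Recent diffusion-based illusion generation methods can generate more intricate designs but are confined to 2D images. In this work, we present a simple yet effective approach for creating 3D multiview illusions based on user-provided text prompts or images. Our method leverages a pre-trained text-to-image diffusion model to optimize the textures and geometry of neural 3D representations through differentiable rendering. When viewed from multiple angles, the result yields different interpretations. We develop several techniques to improve the quality of the generated 3D multiview illusions. We demonstrate the effectiveness of our approach through extensive experiments and showcase illusion generation with diverse 3D forms.||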
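|**2024-12-12**|[GenEx: Generating an Explorable World](http://arxiv.org/abs/2412.09624)|null|Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms priors (expectations) about the surrounding environment. GenEx generates an entire 3D-consistent imaginative environment from as little as a single RGB image, bringing it to life through panoramic video streams. Leveraging scalable 3D world data curated from Unreal Engine, our generative model is grounded in the physical world. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation and robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. Powered by the generative imagination of the world, GPT-assisted agents are equipped to perform complex embodied tasks, including both goal-agnostic exploration and goal-driven navigation. These agents use predictive expectations about unseen parts of the physical world to refine their beliefs, simulate different outcomes based on potential decisions, and make more informed choices. In summary, we demonstrate that GenEx provides a transformative platform for advancing embodied AI in imaginative spaces and has the potential to extend these capabilities to real-world exploration.||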
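|**2024-12-12**|[OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation](http://arxiv.org/abs/2412.09623)|null|As virtual reality gains popularity, the demand for controllable creation of immersive and dynamic omnidirectional videos (ODVs) is increasing. While previous text-to-ODV generation methods achieve impressive results, they struggle with content inaccuracies and inconsistencies because they rely solely on textual inputs. Although recent motion control techniques provide fine-grained control for video generation, directly applying these methods to ODVs often results in spatial distortion and unsatisfactory performance, especially with complex spherical motions. To tackle these challenges, we propose OmniDrag, the first approach enabling both scene- and object-level motion control for accurate, high-quality omnidirectional image-to-video generation. Building on pretrained video diffusion models, we introduce an omnidirectional control module, jointly fine-tuned with temporal attention layers, to handle complex spherical motion effectively. In addition, we develop a novel spherical motion estimator that accurately extracts motion-control signals and allows users to perform drag-style ODV generation by simply drawing handle and target points. We also present a new dataset, named Move360, addressing the scarcity of ODV data with large scene and object motions. Experiments demonstrate the significant superiority of OmniDrag in achieving holistic scene-level and fine-grained object-level control for ODV generation. The project page is available at https://lwq20020127.github.io/OmniDrag.||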
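|**2024-12-12**|[SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training](http://arxiv.org/abs/2412.09619)|null|Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution, high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model SnapGen demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model with merely 379M parameters surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL and 14x smaller than IF-XL).||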
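|**2024-12-12**|[EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM](http://arxiv.org/abs/2412.09618)|null|Personalization of diffusion models has seen significant progress in recent years. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the injection condition, but such an image-independent operation cannot perform interaction among images to capture consistent visual elements across multiple references. Although tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent elements from multiple images through the training process, it requires specific fine-tuning for each distinct image group. This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and a text prompt. To effectively exploit consistent visual elements across multiple images, we leverage the multi-image comprehension and instruction-following capabilities of a multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Moreover, injecting the MLLM's representations into the diffusion process through adapters generalizes easily to unseen domains, mining the consistent visual elements in unseen data. To reduce computational cost and enhance the preservation of fine-grained details, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results show that EasyRef surpasses both tuning-free methods such as IP-Adapter and tuning-based methods such as LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.||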
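|**2024-12-12**|[Olympus: A Universal Task Router for Computer Vision Tasks](http://arxiv.org/abs/2412.09612)|**[link](https://github.com/yuanze-lin/olympus_page)**|We introduce Olympus, a new approach that transforms multimodal large language models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Using a controller MLLM, Olympus delegates more than 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need to train heavy generative models. Olympus integrates easily with existing MLLMs, expanding their capabilities while maintaining comparable performance. Experimental results show that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and an accuracy of 91.82% in chained action scenarios, demonstrating its effectiveness as a universal task router for diverse computer vision tasks. Project page: https://github.com/yuanze-lin/Olympus_page||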
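|**2024-12-12**|[Owl-1: Omni World Model for Consistent Long Video Generation](http://arxiv.org/abs/2412.09600)|**[link](https://github.com/huang-yh/owl)**|Video generation models (VGMs) have received extensive attention recently and are promising candidates for general-purpose large vision models. While they can only generate short videos each time, existing methods achieve long video generation by iteratively calling the VGMs, using the last-frame output as the condition for the next round of generation. However, the last frame only contains short-term, fine-grained information about the scene, resulting in insufficient long-term consistency. To address this, we propose an Omni World Model (Owl-1) to produce long-term coherent and comprehensive conditions for consistent long video generation. Since videos are observations of an underlying evolving world, we propose to model long-term developments in a latent space and use VGMs to film them into videos. Specifically, we represent the world with a latent state variable that can be decoded into explicit video observations. These observations serve as a basis for anticipating temporal dynamics, which in turn update the state variable. The interaction between evolving dynamics and the persistent state enhances the diversity and consistency of long videos. Extensive experiments show that Owl-1 achieves performance comparable to SOTA methods on VBench-I2V and VBench-Long, validating its ability to generate high-quality video observations. Code: https://github.com/huang-yh/Owl.||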
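|**2024-12-12**|[LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors](http://arxiv.org/abs/2412.09597)|null|Single-image 3D reconstruction remains a fundamental challenge in computer vision because of inherent geometric ambiguities and limited viewpoint information. Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data. However, leveraging these priors effectively faces three key challenges: (1) degradation in quality across large camera motions, (2) difficulties in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. We address these challenges by proposing LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency. Specifically, we design an articulated trajectory strategy to generate video frames, which decomposes video sequences with large camera motions into ones with controllable small motions. We then use a robust neural matching model, MASt3R, to calibrate the camera poses of the generated frames and produce corresponding point clouds. Finally, we propose a distortion-aware 3D Gaussian splatting representation, which learns independent distortions between frames and outputs undistorted canonical Gaussians. Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on challenging datasets (LLFF, DL3DV, and Tanks and Temples) and generalizes well to diverse in-the-wild images, from cartoon illustrations to complex real-world scenes.||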
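|**2024-12-12**|[Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion](http://arxiv.org/abs/2412.09593)|null|Recovering the geometry and materials of objects from a single image is challenging because it is inherently under-constrained. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs. This diffusion model generates multiple consistent images, each illuminated by a point light source from a different direction. 2) Using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials. Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects. Code and dataset are available on our project page at https://projects.zxhezexin.com/neural-lightrig.||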
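|**2024-12-10**|[Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets](http://arxiv.org/abs/2412.07775)|null|While large diffusion models are typically trained by collecting datasets for target downstream tasks, it is often desirable to align and fine-tune pretrained diffusion models on reward functions that are either designed by experts or learned from small-scale datasets. Existing methods for fine-tuning diffusion models typically suffer from a lack of diversity in generated samples, a lack of prior preservation, and/or slow convergence in fine-tuning. Inspired by recent successes of generative flow networks (GFlowNets), a class of probabilistic models that sample with the unnormalized density of a reward function, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as $\nabla$-GFlowNet), the first GFlowNet method that exploits the rich signal in reward gradients, together with an objective called $\nabla$-DB and its variant, residual $\nabla$-DB, designed for prior-preserving diffusion alignment. We show that our proposed method achieves fast yet diversity- and prior-preserving alignment of Stable Diffusion, a large-scale text-conditioned image diffusion model, on different realistic reward functions.||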
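|**2024-12-10**|[UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics](http://arxiv.org/abs/2412.07774)|null|We introduce UniReal, a unified framework designed to address a variety of image generation and editing tasks. Existing solutions often vary by task, yet they share fundamental principles: preserving consistency between inputs and outputs while capturing visual variations. Inspired by recent video generation models that effectively balance consistency and variation across frames, we propose a unifying approach that treats image-level tasks as discontinuous video generation. Specifically, we treat varying numbers of input and output images as frames, enabling seamless support for tasks such as image generation, editing, customization, and composition. Although designed for image-level tasks, UniReal leverages videos as a scalable source of universal supervision. UniReal learns world dynamics from large-scale videos, demonstrating advanced capability in handling shadows, reflections, pose variation, and object interaction, while also exhibiting emergent capability for novel applications.||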
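|**2024-12-10**|[From Slow Bidirectional to Fast Causal Video Generators](http://arxiv.org/abs/2412.07772)|null|Current video diffusion models achieve impressive generation quality but struggle in interactive applications because of their dependence on bidirectional attention. Generating a single frame requires the model to process the entire sequence, including future frames. We address this limitation by adapting a pretrained bidirectional diffusion Transformer into a causal Transformer that generates frames on the fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling a 50-step diffusion model into a 4-step generator. To enable stable and high-quality distillation, we introduce a student initialization scheme based on the teacher's ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. This approach effectively mitigates error accumulation in autoregressive generation, allowing long-duration video synthesis despite training on short clips. Thanks to KV caching, our model supports fast streaming generation of high-quality videos at 9.4 FPS on a single GPU. Our approach also enables streaming video-to-video translation, image-to-video generation, and dynamic prompting in a zero-shot manner. We will release the code based on an open-source model in the future.||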
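|**2024-12-10**|[Make-A-Texture: Fast Shape-Aware Texture Generation in 3 Seconds](http://arxiv.org/abs/2412.07766)|null|We present Make-A-Texture, a new framework that efficiently synthesizes high-resolution texture maps from textual prompts for given 3D geometries. Our approach progressively generates textures that are consistent across multiple viewpoints using a depth-aware inpainting diffusion model, in an optimized sequence of viewpoints determined by an automatic view selection algorithm. A notable feature of our method is its remarkable efficiency: it generates a full texture end to end in only 3.07 seconds on a single NVIDIA H100 GPU, significantly outperforming existing methods. This speedup is achieved through optimizations in the diffusion model and a specialized backprojection method. Moreover, our method reduces artifacts in the backprojection phase by selectively masking out non-frontal faces and the internal faces of open-surfaced objects. Experimental results show that Make-A-Texture matches or exceeds the quality of other state-of-the-art methods. Our work significantly improves the applicability and practicality of texture generation models for real-world 3D content creation, including interactive creation and text-guided texture editing.||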
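|**2024-12-10**|[Bayesian Optimization of Antibodies Informed by a Generative Model of Evolving Sequences](http://arxiv.org/abs/2412.07763)|**[link](https://github.com/alannawzadamin/clonebo)**|To build effective therapeutics, biologists iteratively mutate antibody sequences to improve binding and stability. Proposed mutations can be informed by previous measurements or by learning from large antibody databases to predict only typical antibodies. Unfortunately, the space of typical antibodies is enormous to search, and experiments often fail to find suitable antibodies within budget. We introduce Clone-informed Bayesian Optimization (CloneBO), a Bayesian optimization procedure that efficiently optimizes antibodies in the lab by teaching a generative model how our immune system optimizes antibodies. Our immune system makes antibodies by iteratively evolving specific portions of their sequences to bind their target strongly and stably, resulting in a set of related, evolving sequences known as a clonal family. We train a large language model, CloneLM, on hundreds of thousands of clonal families and use it to design sequences with mutations that are most likely to optimize an antibody within the human immune system. We propose to guide our designs to fit previous measurements with a twisted sequential Monte Carlo procedure. We show that CloneBO optimizes antibodies substantially more efficiently than previous methods in realistic in silico experiments and designs stronger and more stable binders in in vitro wet-lab experiments.||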
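|**2024-12-10**|[Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation](http://arxiv.org/abs/2412.07761)|null|Video frame interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate video. However, without additional guidance, large motion between frames makes this problem ill-posed. Event-based video frame interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance. This guidance allows EVFI methods to significantly outperform frame-only methods. However, to date, EVFI methods have relied on limited training sets of paired event-frame data, severely limiting their performance and generalization capabilities. In this work, we overcome the limited-data challenge by adapting pre-trained video diffusion models trained on internet-scale datasets to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one that we introduce. Our method outperforms existing methods and generalizes across cameras far better than existing approaches.||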
|**2024-12-10**|[SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints](http://arxiv.org/abs/2412.07760)|**[link](https://github.com/kwaivgi/syncammaster)**|Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency. This progress inspires us to investigate the potential of these models to ensure dynamic consistency across various viewpoints, a highly desirable feature for applications such as virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating 6 DoF camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we design a hybrid training scheme that leverages multi-camera images and monocular videos to supplement Unreal Engine-rendered multi-camera videos. Furthermore, our method enables intriguing extensions, such as re-rendering a video from novel viewpoints. We also release a multi-view synchronized video dataset, named SynCamVideo-Dataset. Project page: https://jianhongbai.github.io/SynCamMaster/.||
|**2024-12-10**|[Multi-Shot Character Consistency for Text-to-Video Generation](http://arxiv.org/abs/2412.07750)|null|Text-to-video models have made significant strides in generating short video clips from textual descriptions. Yet, a significant challenge remains: generating several video shots of the same characters, preserving their identity without hurting video quality, dynamics, and responsiveness to text prompts. We present Video Storyboarding, a training-free method to enable pretrained text-to-video models to generate multiple shots with consistent characters, by sharing features between them. Our key insight is that self-attention query features (Q) encode both motion and identity. This creates a hard-to-avoid trade-off between preserving character identity and making videos dynamic, when features are shared. To address this issue, we introduce a novel query injection strategy that balances identity preservation and natural motion retention. This approach improves upon naive consistency techniques applied to videos, which often struggle to maintain this delicate equilibrium. Our experiments demonstrate significant improvements in character consistency across scenes while maintaining high-quality motion and text alignment. These results offer insights into critical stages of video generation and the interplay of structure and motion in video diffusion models.||
|**2024-12-10**|[StyleMaster: Stylize Your Video with Artistic Generation and Translation](http://arxiv.org/abs/2412.07744)|null|Style control has been popular in video generation models. Existing methods often generate videos far from the given style, cause content leakage, and struggle to transfer one video to the desired style. Our first observation is that the style extraction stage matters, whereas existing methods emphasize global style but ignore local textures. To bring in texture features while preventing content leakage, we filter content-related patches while retaining style ones based on prompt-patch similarity; for global style extraction, we generate a paired style dataset through model illusion to facilitate contrastive learning, which greatly enhances the absolute style consistency. Moreover, to fill the image-to-video gap, we train a lightweight motion adapter on still videos, which implicitly enhances the extent of stylization and enables our image-trained model to be seamlessly applied to videos. Benefiting from these efforts, our approach, StyleMaster, not only achieves significant improvement in both style resemblance and temporal coherence, but can also easily generalize to video style transfer with a gray tile ControlNet. Extensive experiments and visualizations demonstrate that StyleMaster significantly outperforms competitors, effectively generating high-quality stylized videos that align with textual content and closely resemble the style of reference images. Our project page is at https://zixuan-ye.github.io/stylemaster||
|**2024-12-10**|[STIV: Scalable Text and Image Conditioned Video Generation](http://arxiv.org/abs/2412.07730)|null|The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method, named STIV. Our framework integrates image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning via a joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view generation, and long video generation. With comprehensive ablation studies on T2I, T2V, and TI2V, STIV demonstrates strong performance, despite its simple design. An 8.7B model with 512 resolution achieves 83.1 on VBench T2V, surpassing both leading open and closed-source models like CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on VBench I2V task at 512 resolution. By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.||
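|**2024-12-06**|[Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model](http://arxiv.org/abs/2412.05280)|**[link](https://github.com/wzzheng/stag)**|4D driving simulation is essential for developing realistic autonomous driving simulators. Despite advances in existing driving scene generation methods, significant challenges remain in view transformation and spatial-temporal dynamics modeling. To address these limitations, we propose a spatial-temporal simulation model for driving (Stag-1) to reconstruct real-world scenes and design a controllable generative network to achieve 4D simulation. Stag-1 constructs continuous 4D point cloud scenes using surround-view data from autonomous vehicles. It decouples spatial-temporal relationships and produces coherent keyframe videos. In addition, Stag-1 leverages video generation models to obtain photorealistic and controllable 4D driving simulation videos from any viewpoint. To expand the range of view generation, we train vehicle motion videos based on decomposed camera poses, which improves the modeling of distant scenes. Furthermore, we reconstruct vehicle camera trajectories to integrate 3D points across consecutive views, enabling comprehensive scene understanding along the temporal dimension. After extensive multi-level scene training, Stag-1 can simulate from any desired viewpoint and achieves a deep understanding of scene evolution under static spatial-temporal conditions. Compared with existing methods, our approach shows promising performance in multi-view scene consistency, background coherence, and accuracy, and contributes to ongoing progress in realistic autonomous driving simulation. Code: https://github.com/wzzheng/Stag||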
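|**2024-12-06**|[Perturb-and-Revise: Flexible 3D Editing with Generative Trajectories](http://arxiv.org/abs/2412.05279)|null|With the development of text-based diffusion models, the fields of 3D reconstruction and text-based 3D editing have advanced significantly. While existing 3D editing methods excel at modifying color, texture, and style, they struggle with extensive geometric or appearance changes, which limits their applications. We propose Perturb-and-Revise, which makes a wide range of NeRF edits possible. First, we perturb the NeRF parameters with random initializations to create a versatile initialization. We determine the perturbation magnitude automatically by analyzing the local loss landscape. Then, we revise the edited NeRF via generative trajectories. Combined with the generative process, we impose an identity-preserving gradient to refine the edited NeRF. Extensive experiments show that Perturb-and-Revise enables flexible, effective, and consistent editing of color, appearance, and geometry in 3D. Project page with 360-degree results: https://susunghong.github.io/Perturb-and-Revise||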
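|**2024-12-06**|[Birth and Death of a Rose](http://arxiv.org/abs/2412.05278)|null|We study how to generate temporal object intrinsics, that is, temporally evolving sequences of object geometry, reflectance, and texture, such as a blooming rose, from pre-trained 2D foundation models. Unlike conventional 3D modeling and animation techniques, which require extensive manual effort and expertise, we introduce a method that generates such assets using signals distilled from pre-trained 2D diffusion models. To ensure the temporal consistency of object intrinsics, we propose Neural Templates for temporal-state-guided distillation, which are derived automatically from image features obtained through self-supervised learning. Our method can generate high-quality temporal object intrinsics for several natural phenomena and supports sampling and controllable rendering of these dynamic objects from any viewpoint, under any environmental lighting conditions, and at any time in their lifespan. Project website: https://chen-geng.com/rose4d||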
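|**2024-12-06**|[MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models](http://arxiv.org/abs/2412.05275)|null|Text-to-video models have demonstrated impressive capabilities in producing diverse and captivating video content, marking a notable advance in generative AI. However, these models generally lack fine-grained control over motion patterns, which limits their practical applicability. We introduce MotionFlow, a novel framework designed for motion transfer in video diffusion models. Our method uses cross-attention maps to accurately capture and manipulate spatial and temporal dynamics, enabling seamless motion transfer across various contexts. Our approach requires no training and works at test time by leveraging the inherent capabilities of pre-trained video diffusion models. In contrast to traditional approaches, which struggle with complex scene changes while maintaining consistent motion, MotionFlow handles such transformations through its attention-based mechanism. Our qualitative and quantitative experiments show that MotionFlow significantly outperforms existing models in both fidelity and versatility, even under drastic scene alterations.||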
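|**2024-12-06**|[Go-or-Grow Models in Biology: a Monster on a Leash](http://arxiv.org/abs/2412.05191)|null|Go-or-grow approaches represent a specific class of mathematical models used to describe populations in which individuals either migrate or reproduce, but not both at the same time. These models have a wide range of applications in biology and medicine, chiefly in modeling the spread of brain cancer. The analysis of go-or-grow models has inspired new mathematical developments, and this paper aims to highlight the interesting and challenging mathematical properties of reaction-diffusion models of the go-or-grow type. We begin with a detailed review of applications in biology and medicine, then highlight key results on the existence and uniqueness of solutions, pattern formation, critical domain size problems, and traveling waves. We present new general results related to the critical domain size and traveling wave problems, and we connect these findings to the existing literature. Moreover, we demonstrate the high level of instability inherent in go-or-grow models. We argue that no accurate numerical solver currently exists for these models, and we emphasize that special care must be taken when dealing with this "monster on a leash".||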
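|**2024-12-06**|[DNF: Unconditional 4D Generation with Dictionary-based Neural Fields](http://arxiv.org/abs/2412.05161)|null|While diffusion-based 3D generative models have achieved remarkable success in shape generation, 4D generative modeling remains challenging because of the complexity of object deformations over time. We propose DNF, a new 4D representation for unconditional generative modeling that efficiently models deformable shapes, disentangling shape and motion while capturing high-fidelity details in the deforming objects. To this end, we propose a dictionary learning approach that disentangles 4D motion from shape and represents it as neural fields. Both shape and motion are represented as learned latent spaces, where each deformable shape is represented by its global shape and motion latent codes, shape-specific coefficient vectors, and shared dictionary information. This captures both shape-specific detail and globally shared information in the learned dictionary. Our dictionary-based representation balances fidelity, contiguity, and compression well; combined with a Transformer-based diffusion model, our method generates effective, high-fidelity 4D animations.||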
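|**2024-12-06**|[LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation](http://arxiv.org/abs/2412.05148)|**[link](https://github.com/donaldssh/LoRA.rar)**|Recent advances in image generation models have enabled personalized image creation with user-defined subjects (content) and styles. Prior work achieved personalization by merging the corresponding low-rank adaptation parameters (LoRAs) through optimization-based methods, which are computationally demanding and unsuitable for real-time use on resource-constrained devices such as smartphones. To address this, we introduce LoRA.rar, a method that not only improves image quality but also achieves a speedup of more than 4000x in the merging process. LoRA.rar pre-trains a hypernetwork on a diverse set of content-style LoRA pairs, learning an efficient merging strategy that generalizes to new, unseen content-style pairs and enables fast, high-quality personalization. Moreover, we identify limitations of existing evaluation metrics for content-style quality and propose a new protocol that uses multimodal large language models (MLLMs) for more accurate assessment. As validated by MLLM assessments and human evaluations, our method significantly outperforms the current state of the art in both content and style fidelity.||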
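|**2024-12-06**|[Probabilistic Galaxy Field Generation with Diffusion Models](http://arxiv.org/abs/2412.05131)|null|In the era of precision cosmology, generating accurate large-scale galaxy catalogs is crucial for advancing our understanding of the universe. With the flood of cosmological data from current and upcoming missions, generating theoretical predictions to compare with these observations is essential for constraining key cosmological parameters. While traditional methods such as the Halo Occupation Distribution (HOD) have provided foundational insights, they struggle to balance accuracy against computational efficiency. High-fidelity hydrodynamic simulations offer greater precision but are computationally expensive and resource-intensive. In this work, we introduce a novel machine learning approach that uses convolutional neural networks (CNNs) and diffusion models, trained on the CAMELS simulation suite, to bridge the gap between inexpensive dark matter simulations and the galaxy distributions of more costly hydrodynamic simulations. Our method not only outperforms traditional HOD techniques in accuracy but also significantly accelerates the simulation process, offering a scalable solution for next-generation cosmological surveys. This advance has the potential to transform the generation of galaxy catalogs, enabling more precise, data-driven cosmological analyses.||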
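|**2024-12-06**|[The Silent Prompt: Initial Noise as Implicit Guidance for Goal-Driven Image Generation](http://arxiv.org/abs/2412.05101)|null|Text-to-image synthesis (T2I) has advanced remarkably with the emergence of large-scale diffusion models. In the conventional setup, the text prompt provides explicit, user-defined guidance, directing the generation process by denoising randomly sampled Gaussian noise. In this work, we reveal that the often-overlooked noise itself encodes inherent generative tendencies, acting as a "silent prompt" that implicitly guides the output. This implicit guidance, embedded in the noise scheduler design of diffusion model formulations and in their training stages, generalizes across a wide range of T2I models and backbones. Building on this insight, we introduce NoiseQuery, a novel strategy that selects the optimal initial noise from a pre-built noise library to meet diverse user needs. Our approach not only enhances high-level semantic alignment with text prompts but also allows nuanced adjustment of low-level visual attributes such as texture, sharpness, shape, and color, which are typically hard to control through text alone. Extensive experiments across various models and target attributes demonstrate the strong performance and zero-shot transferability of our approach, with no additional optimization required.||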
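|**2024-12-06**|[Reconstructing Quantitative Cerebral Perfusion Images Directly From Measured Sinogram Data Acquired Using C-arm Cone-Beam CT](http://arxiv.org/abs/2412.05084)|null|To shorten the time to puncture for patients with acute ischemic stroke and thereby improve treatment outcomes, it is highly desirable to obtain quantitative cerebral perfusion images using the C-arm cone-beam computed tomography (CBCT) system installed in the interventional suite. However, limited by the slow gantry rotation speed, the temporal resolution and temporal sampling density of typical C-arm CBCT are much poorer than those of multi-detector-row CT in the diagnostic imaging suite. Current quantitative perfusion imaging comprises two cascaded steps: time-resolved image reconstruction and perfusion parameter estimation. For time-resolved image reconstruction, the technical challenges posed by low temporal resolution and low sampling density lead to inaccurate quantification of the temporal variation of cerebral artery and tissue attenuation values. For perfusion parameter estimation, it remains a technical challenge to design handcrafted regularization appropriately so that the associated deconvolution problem is solved better. Together, these two challenges prevent quantitatively accurate perfusion images from being obtained with C-arm CBCT. The purpose of this work is to address both challenges simultaneously by combining the two cascaded steps into a single joint optimization problem and reconstructing quantitative perfusion images directly from the measured sinogram data. In the developed direct cerebral perfusion parametric image reconstruction technique (TRAINER for short), the quantitative perfusion images are represented as a subject-specific conditional generative model trained under the constraints of the time-resolved CT forward model, the perfusion convolution model, and the subject's own measured sinogram data. The results presented in this paper show that, using TRAINER, quantitative cerebral perfusion images can be obtained accurately with C-arm CBCT in the interventional suite.||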
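|**2024-12-05**|[PaintScene4D: Consistent 4D Scene Generation from Text Prompts](http://arxiv.org/abs/2412.04471)|null|Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge. Existing dynamic 4D generation methods typically rely on distilling knowledge from pre-trained 3D generative models, which are often fine-tuned on synthetic object datasets. Consequently, the resulting scenes tend to be object-centric and lack photorealism. While text-to-video models can generate more realistic scenes with motion, they often struggle with spatial understanding and provide limited control over camera viewpoints during rendering. To address these limitations, we present PaintScene4D, a novel text-to-4D scene generation framework that departs from conventional multi-view generative models in favor of a streamlined architecture built on video generative models trained on diverse real-world datasets. Our method first generates a reference video using a video generation model and then employs a strategic camera array selection for rendering. We apply a progressive warping and inpainting technique to ensure both spatial and temporal consistency across multiple viewpoints. Finally, we optimize multi-view images using a dynamic renderer, enabling flexible camera control based on user preferences. With its training-free architecture, PaintScene4D efficiently produces realistic 4D scenes that can be viewed from arbitrary trajectories. The code will be made publicly available. Our project page is at https://paintscene4d.github.io/||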
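|**2024-12-05**|[LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors](http://arxiv.org/abs/2412.04460)|null|Large-scale diffusion models have achieved remarkable success in generating high-quality images from textual descriptions and have gained popularity across various applications. However, the generation of layered content, such as transparent images with foreground and background layers, remains an underexplored area. Layered content generation is crucial for creative workflows in fields such as graphic design, animation, and digital art, where layer-based approaches are fundamental to flexible editing and composition. In this paper, we propose a novel image generation pipeline based on latent diffusion models (LDMs) that generates images with two layers: a foreground layer (RGBA) with transparency information and a background layer (RGB). Unlike existing methods that generate these layers sequentially, our approach introduces a harmonized generation mechanism that enables dynamic interactions between the layers for more coherent outputs. We demonstrate the effectiveness of our method through extensive qualitative and quantitative experiments, showing significant improvements in visual coherence, image quality, and layer consistency over baseline methods.||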
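|**2024-12-05**|[Four-Plane Factorized Video Autoencoders](http://arxiv.org/abs/2412.04452)|null|Latent variable generative models have emerged as powerful tools for generative tasks including image and video synthesis. These models are enabled by pre-trained autoencoders that map high-resolution data into a compressed lower-dimensional latent space, where the generative models can subsequently be developed with fewer computational resources. Despite their effectiveness, directly applying latent variable models to higher-dimensional domains such as video continues to pose challenges for efficient training and inference. In this paper, we propose an autoencoder that projects volumetric data onto a four-plane factorized latent space that grows sublinearly with the input size, making it well suited to higher-dimensional data such as video. The design of our factorized model supports straightforward adoption in a number of conditional generation tasks with latent diffusion models (LDMs), such as class-conditional generation, frame prediction, and video interpolation. Our results show that the proposed four-plane latent space retains the rich representation needed for high-fidelity reconstruction even under heavy compression, while enabling LDMs to operate with significant improvements in speed and memory.||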
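|**2024-12-05**|[MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation](http://arxiv.org/abs/2412.04448)|null|Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose the Memory-guided EMOtion-aware diffusion model (MEMO), an end-to-end audio-driven portrait animation approach for generating identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states that store information from a longer past context and guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross-attention with multimodal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion-adaptive layer normalization. Extensive quantitative and qualitative results show that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.||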
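|**2024-12-05**|[DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models](http://arxiv.org/abs/2412.04446)|null|Videos are inherently temporal sequences. In this work, inspired by the success of autoregressive (AR) language models in natural language processing, we explore the potential of modeling videos in a chronological and scalable manner with autoregressive language models. We introduce DiCoDe, a novel approach that uses Diffusion-Compressed Deep Tokens to generate videos autoregressively. Unlike existing methods that employ low-level representations with limited compression rates, DiCoDe uses deep tokens with a considerably higher compression rate (a 1000x reduction in token count). This significant compression is made possible by a tokenizer trained by leveraging the prior knowledge of video diffusion models. Deep tokens enable DiCoDe to use vanilla AR language models for video generation, akin to translating one visual "language" into another. By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation. DiCoDe is scalable with readily available AR architectures and can generate videos ranging from a few seconds to one minute after training on only 4 A100 GPUs. We evaluate DiCoDe both quantitatively and qualitatively, showing that it performs comparably to existing methods in quality while ensuring efficient training. To showcase its scalability, we release a series of DiCoDe configurations with varying parameter sizes and observe a consistent improvement in performance as the model size increases from 100M to 3B. We believe that DiCoDe's exploration in academia represents a promising first step toward scalable video modeling with AR language models, paving the way for larger and more powerful video generation models.||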
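|**2024-12-05**|[Learning Artistic Signatures: Symmetry Discovery and Style Transfer](http://arxiv.org/abs/2412.04441)|null|Despite nearly a decade of literature on style transfer, there is still no undisputed definition of artistic style. State-of-the-art models produce impressive results but are difficult to interpret, because without a coherent definition of style, the problem of style transfer is inherently ill-posed. Early work framed style transfer as an optimization problem but treated style only as a measure of texture. This led to artifacts in the outputs of early models, where content features from the style image sometimes bled into the output image. Conversely, more recent work with diffusion models offers compelling empirical results but provides little theoretical grounding. To address these issues, we propose an alternative definition of artistic style. We suggest that style should be thought of as a set of global symmetries that dictate the arrangement of local textures. We validate this perspective empirically by learning the symmetries of a large dataset of paintings and showing that symmetries are predictive of the artistic movement to which each painting belongs. Finally, we show that by considering both local and global features, using both Lie generators and traditional measures of texture, we can quantify stylistic similarity between artists better than with either set of features alone. This approach not only aligns well with art historians' consensus but also offers a robust framework for distinguishing subtle stylistic differences, enabling more interpretable and theoretically grounded approaches to style transfer.||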
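|**2024-12-05**|[GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration](http://arxiv.org/abs/2412.04440)|null|Text-to-video generation models have shown significant progress in recent years. However, they still struggle to generate complex dynamic scenes from compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized multimodal large language model (MLLM) agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging; it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next round of generation. To avoid hallucination by a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: a verification agent, a suggestion agent, a correction agent, and an output structuring agent. Furthermore, to handle the diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism that adaptively selects the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, which achieves state-of-the-art performance in compositional text-to-video generation.||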
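|**2024-12-05**|[Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation](http://arxiv.org/abs/2412.04432)|**[link](https://github.com/tencentarc/divot)**|In recent years, there has been a significant surge of interest in unifying image comprehension and generation within large language models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and the temporal dynamics of videos to obtain representations for LLMs, and whose representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a diffusion-powered video tokenizer that leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively denoise video clips when conditioned on the features of a video tokenizer, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building on the Divot tokenizer, we present Divot-Vicuna, which handles video-to-text autoregression and enables text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results show that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction-tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.||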
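|**2024-12-05**|[Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis](http://arxiv.org/abs/2412.04431)|**[link](https://github.com/FoundationVision/Infinity)**|We present Infinity, a bitwise visual autoregressive modeling approach capable of generating high-resolution, photorealistic images from language instructions. Infinity redefines the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary tokenizer and classifier and a bitwise self-correction mechanism, which markedly improves generation capacity and detail. By theoretically scaling the tokenizer vocabulary size to infinity while concurrently scaling the Transformer size, our method unleashes significantly stronger scaling capabilities than conventional visual autoregressive models. Infinity sets a new record for autoregressive text-to-image models, outperforming top-tier diffusion models such as SD3-Medium and SDXL. Notably, Infinity surpasses SD3-Medium by improving the GenEval benchmark score from 0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a win rate of 66%. Without extra optimization, Infinity generates a high-quality 1024x1024 image in 0.8 seconds, 2.6x faster than SD3-Medium, making it the fastest text-to-image model. Models and code will be released to promote further exploration of Infinity for visual generation and unified tokenizer modeling.||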
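|**2024-12-05**|[Reversible molecular simulation for training classical and machine learning force fields](http://arxiv.org/abs/2412.04374)|**[link](https://github.com/greener-group/rev-sim)**|The next generation of force fields for molecular dynamics will be developed using a wealth of data. However, training systematically on experimental data remains a challenge, especially for machine learning potentials. Differentiable molecular simulation calculates the gradients of observables with respect to parameters through molecular dynamics trajectories. Here we improve this approach by explicitly calculating gradients using a reverse-time simulation, which effectively keeps the memory cost constant. The approach is applied to learn all-atom water and gas diffusion models with different functional forms, and to train a machine learning potential for diamond from scratch. Comparison with ensemble reweighting indicates that reversible simulation can provide more accurate gradients and can train to match time-dependent observables.||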
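|**2024-12-03**|[Motion Prompting: Controlling Video Generation with Motion Trajectories](http://arxiv.org/abs/2412.02700)|null|Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal composition. To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; because of this flexibility, we refer to this conditioning as motion prompts. While users may specify sparse trajectories directly, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term motion prompt expansion. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance. Video results are available on our webpage: https://motion-prompting.github.io/||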
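|**2024-12-03**|[Diffusion-based Visual Anagram as Multi-task Learning](http://arxiv.org/abs/2412.02693)|**[link](https://github.com/pixtella/anagram-mtl)**|Visual anagrams are images that change appearance under transformations such as flipping or rotation. With the advent of diffusion models, such optical illusions can be generated by averaging the noise across multiple views during the reverse denoising process. However, we observe two critical failure modes in this approach: (i) concept segregation, where concepts in different views are generated independently, which cannot be considered a true anagram, and (ii) concept domination, where certain concepts overpower others. In this work, we cast the visual anagram generation problem in a multi-task learning setting, where different viewpoint prompts are analogous to different tasks, and we derive denoising trajectories that align well across tasks simultaneously. At the core of our framework are two newly introduced techniques: (i) an anti-segregation optimization strategy that promotes overlap in cross-attention maps between different concepts, and (ii) a noise vector balancing method that adaptively adjusts the influence of different tasks. Additionally, we observe that directly averaging noise predictions yields suboptimal performance because statistical properties may not be preserved, which prompts us to derive a noise variance rectification method. Extensive qualitative and quantitative experiments demonstrate our method's superior ability to generate visual anagrams spanning diverse concepts.||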
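|**2024-12-03**|[FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation](http://arxiv.org/abs/2412.02690)|null|Despite remarkable progress in image generative models, generating realistic hands remains a persistent challenge because of their complex articulation, varying viewpoints, and frequent occlusions. We present FoundHand, a large-scale domain-specific diffusion model for synthesizing single-hand and dual-hand images. To train our model, we introduce FoundHand-10M, a large-scale hand dataset annotated with 2D keypoints and segmentation masks. Our insight is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint. FoundHand learns from image pairs to capture physically plausible hand articulations, natively enables precise control through 2D keypoints, and supports appearance control. Our model exhibits core capabilities that include reposing hands, transferring hand appearance, and even synthesizing novel views. This leads to zero-shot capabilities for fixing malformed hands in previously generated images and for synthesizing hand video sequences. We present extensive experiments and evaluations that demonstrate the state-of-the-art performance of our method.||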
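|**2024-12-03**|[SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance](http://arxiv.org/abs/2412.02687)|null|Recent approaches have yielded promising results in distilling multi-step text-to-image diffusion models into one-step models. The state-of-the-art efficient distillation technique, SwiftBrushv2 (SBv2), even surpasses the teacher model's performance with limited resources. However, our study reveals that it is unstable across different diffusion model backbones because it uses a fixed guidance scale within the Variational Score Distillation (VSD) loss. Another weakness of existing one-step diffusion models is the lack of support for negative prompt guidance, which is crucial in practical image generation. This paper presents SNOOPI, a novel framework designed to address these limitations by enhancing the guidance in one-step diffusion models during both training and inference. First, we effectively enhance training stability through Proper Guidance-SwiftBrush (PG-SB), which employs a random-scale classifier-free guidance approach. By varying the guidance scale of both teacher models, we broaden their output distributions, resulting in a more robust VSD loss that allows SB to perform effectively across diverse backbones while maintaining competitive performance. Second, we propose a training-free method called Negative-Away Steer Attention (NASA), which integrates negative prompts into one-step diffusion models via cross-attention to suppress undesired elements in generated images. Our experimental results show that the proposed methods significantly improve baseline models across various metrics. Remarkably, we achieve an HPSv2 score of 31.08, setting a new state-of-the-art benchmark for one-step diffusion models.||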
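|**2024-12-03**|[AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction](http://arxiv.org/abs/2412.02684)|null|Generating animatable human avatars from a single image is essential for various digital human modeling applications. Existing 3D reconstruction methods often struggle to capture fine details in animatable models, while generative approaches for controllable animation, though they avoid explicit 3D modeling, suffer from viewpoint inconsistencies in extreme poses and from computational inefficiency. In this paper, we address these challenges by leveraging the power of generative models to produce detailed multi-view canonical pose images, which help resolve ambiguities in animatable human reconstruction. We then propose a robust method for 3D reconstruction from inconsistent images, enabling real-time rendering during inference. Specifically, we adapt a Transformer-based video generation model to generate multi-view canonical pose images and normal maps, pre-training on a large-scale video dataset to improve generalization. To handle view inconsistencies, we recast the reconstruction problem as a 4D task and introduce an efficient 3D modeling approach using 4D Gaussian splatting. Experiments show that our method achieves photorealistic, real-time animation of 3D human avatars from in-the-wild images, demonstrating its effectiveness and generalization capability.||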
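|**2024-12-03**|[Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation](http://arxiv.org/abs/2412.02631)|null|Advances in text-to-image diffusion models have led to significant progress in fast 3D content creation. One common approach is to generate a set of multi-view images of an object and then reconstruct it into a 3D model. However, this approach bypasses the use of a native 3D representation of the object and is therefore prone to geometric artifacts and limited in controllability and manipulation capabilities. An alternative approach involves native 3D generative models that directly produce 3D representations. These models, however, are typically limited in resolution, resulting in lower-quality 3D objects. In this work, we bridge the quality gap between methods that directly generate 3D representations and those that reconstruct 3D objects from multi-view images. We introduce a multi-view to multi-view diffusion model called Sharp-It, which takes a 3D-consistent set of multi-view images rendered from a low-quality object and enriches its geometric details and texture. The diffusion model operates on the multi-view set in parallel, in the sense that it shares features across the generated views. A high-quality 3D model can then be reconstructed from the enriched multi-view set. By leveraging the advantages of both 2D and 3D approaches, our method offers an efficient and controllable route to high-quality 3D content creation. We demonstrate that Sharp-It enables various 3D applications, such as fast synthesis, editing, and controlled generation, while attaining high-quality assets.||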
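|**2024-12-03**|[The effect of priors on Learning with Restricted Boltzmann Machines](http://arxiv.org/abs/2412.02623)|null|Restricted Boltzmann Machines (RBMs) are generative models designed to learn from data with a rich underlying structure. In this work, we explore a teacher-student setting in which a student RBM learns from examples generated by a teacher RBM, with a focus on the effect of the unit priors on learning efficiency. We consider a parametric class of priors that interpolate between continuous (Gaussian) and binary variables. This approach models various possible choices of visible units, hidden units, and weights for both the teacher and student RBMs. By analyzing the phase diagram of the posterior distribution in both the Bayes-optimal and mismatched regimes, we demonstrate the existence of a triple point that defines the critical dataset size necessary for learning through generalization. The critical size is strongly influenced by the properties of the teacher, and thus of the data, but is unaffected by the properties of the student RBM. Nevertheless, a prudent choice of student priors can facilitate training by expanding the so-called signal retrieval region, where the machine generalizes effectively.||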
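|**2024-12-03**|[Unveiling Concept Attribution in Diffusion Models](http://arxiv.org/abs/2412.02542)|**[link](https://github.com/mail-research/cad-attribution4diffusion)**|Diffusion models have shown remarkable abilities in generating realistic, high-quality images from text prompts. However, a trained model remains a black box; little is known about the role its components play in exhibiting a concept such as an object or a style. Recent work employs causal tracing to localize the layers that store knowledge in generative models, without showing how those layers contribute to the target concept. In this work, we approach the model interpretability problem from a more general perspective and pose a question: "How do model components work jointly to demonstrate knowledge?" We adapt component attribution to decompose diffusion models, unveiling how a component contributes to a concept. Our framework allows effective model editing; in particular, we can erase a concept from diffusion models by removing positive components while retaining knowledge of other concepts. Surprisingly, we also show that there exist components that contribute negatively to a concept, which has not been discovered by knowledge localization approaches. Experimental results confirm the roles of the positive and negative components pinpointed by our framework, depicting a complete view of interpreting generative models. Our code is available at https://github.com/mail-research/CAD-attribution4diffusion||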
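|**2024-12-03**|[GerPS-Compare: Comparing NER methods for legal norm analysis](http://arxiv.org/abs/2412.02427)|null|We apply Named Entity Recognition (NER) to a particular sub-genre of German legal texts: legal norms that regulate administrative processes in public service administration. The analysis of such texts involves identifying stretches of text that instantiate one of ten classes identified by public service administration professionals. We investigate and compare three methods for performing NER to detect these classes: a rule-based system, deep discriminative models, and a deep generative model. Our results show that deep discriminative models outperform both the rule-based system and the deep generative model, the latter two performing roughly on par with each other, each outperforming the other in some classes. The main reason for this somewhat surprising result may be that the classes used in the analysis are semantically and syntactically heterogeneous, in contrast to the classes used in more standard NER tasks. Deep discriminative models appear to handle this heterogeneity better than generic large language models or human linguists designing rule-based NER systems.||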
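|**2024-12-03**|[Social patch foraging theory in an egalitarian group](http://arxiv.org/abs/2412.02381)|null|Foraging is a widespread behavior, and foraging in groups can confer several benefits compared with solitary foraging, such as collective pooling of information and reduced environmental uncertainty. Theoretical models of collective behavior often use coarse-grained representations or are too complex for analytical treatment, and they generally do not take into account the noisy decision-making processes implemented by individual agents. This calls for a mechanistic, analytically tractable, and stochastic framework for studying the processes underlying social foraging, one that ties the microscopic level to the macroscopic level. Based on an evidence accumulation framework, we develop a model of patch-leaving decisions in a large egalitarian group. Across a variety of environmental statistics and information-sharing mechanisms, we are able to derive optimal agent strategies analytically. The environmental statistics considered are either two non-depleting patches or several successive depleting patches. The social information-sharing mechanisms operate either through observation of the food rewards of others or through belief sharing, with continuous sharing, pulsatile observation of others' departures or arrivals, or counting of the number of individuals in a patch. Under all these conditions, we quantify how cohesive a group is over time, how much time agents spend on average in a patch, and what their group equilibrium dynamics are. We find that social coupling strongly modulates these features across a variety of environmental statistics. This general modeling framework is crucial both for designing social foraging experiments and for generating testable hypotheses. Moreover, the framework can be extended to groups with hierarchical relationships.||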
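|**2024-11-29**|[Input-Output Optics as a Causal Time Series Mapping: A Generative Machine Learning Solution](http://arxiv.org/abs/2411.19897)|null|Modeling the response of many-body quantum systems to optical pulses is extremely challenging. This paper explores the use of conventional and generative neural networks to learn and thereby simulate the response of such systems from data. The quantum system can be viewed as performing a complex mapping from an input time series (the optical pulse) to an output time series (the system's response), which is often also an optical pulse. Using both the transverse and the non-integrable Ising models as examples, we show that temporal convolutional networks can capture the input/output mapping generated by the system and can also be used to characterize the complexity of the mapping. This measure of complexity is given by the size of the smallest latent space that is able to model the mapping accurately. We further find that a generative model, in particular a variational autoencoder, significantly outperforms traditional autoencoders at learning the complex response of many-body quantum systems. For the example that generates the most complex mapping, the variational autoencoder produces outputs with less than 10% error for more than 90% of the inputs in our test data.||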
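|**2024-11-29**|[MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks](http://arxiv.org/abs/2411.19786)|null|Recently, human motion analysis has improved greatly, inspired by generative models such as denoising diffusion models and large language models. However, existing approaches mainly focus on generating motions from textual descriptions and overlook the reciprocal task. In this paper, we present MoTe, a unified multimodal model that handles diverse tasks by learning the marginal, conditional, and joint distributions of motion and text simultaneously. MoTe lets us handle paired text-motion generation, motion captioning, and text-driven motion generation simply by modifying the input context. Specifically, MoTe is composed of three components: a Motion Encoder-Decoder (MED), a Text Encoder-Decoder (TED), and a Motion-Text Diffusion Model (MTDM). MED and TED are trained to extract latent embeddings and then to reconstruct the motion sequences and textual descriptions from the extracted embeddings, respectively. MTDM, in turn, performs an iterative denoising process on the input context to handle the different tasks. Experimental results on benchmark datasets show that the proposed method performs strongly on text-to-motion generation and competitively on motion captioning.||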
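|**2024-11-29**|[Riemannian Denoising Score Matching for Molecular Structure Optimization with Accurate Energy](http://arxiv.org/abs/2411.19769)|null|This study introduces a modified score matching method aimed at generating molecular structures with high energy accuracy. The denoising process of score matching or diffusion models mirrors molecular structure optimization, where scores act like physical force fields that guide particles toward equilibrium states. To achieve energetically accurate structures, it is advantageous for the scores to closely approximate the gradient of the actual potential energy surface. Unlike conventional methods that design the target score based solely on structural differences in Euclidean space, we propose a Riemannian score matching approach. This method represents molecular structures on a manifold defined by physics-informed internal coordinates to efficiently mimic the energy landscape, and it performs noising and denoising within this space. Our method has been evaluated by refining several types of starting structures on the QM9 and GEOM datasets, and the results show that the proposed Riemannian score matching method significantly improves the accuracy of the generated molecular structures, attaining chemical accuracy. The implications of this study extend to various applications in computational chemistry, offering a robust tool for accurate molecular structure prediction.||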
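|**2024-11-29**|[JetFormer: An Autoregressive Generative Model of Raw Images and Text](http://arxiv.org/abs/2411.19722)|null|Removing modeling constraints and unifying architectures across domains has been a key driver of recent progress in training large multimodal models. However, most of these models still rely on many separately trained components, such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only Transformer, JetFormer, which is trained to directly maximize the likelihood of raw data without relying on any separately pre-trained components, and which can understand and generate both text and images. Specifically, we leverage a normalizing flow model to obtain a soft-token image representation that is trained jointly with an autoregressive multimodal Transformer. The normalizing flow model serves both as an image encoder for perception tasks and as an image decoder for image generation tasks during inference. JetFormer achieves text-to-image generation quality competitive with recent VQ-VAE-based and VAE-based baselines. These baselines rely on pre-trained image autoencoders, which are trained with a complex mixture of losses, including perceptual ones. At the same time, JetFormer demonstrates robust image understanding capabilities. To the best of our knowledge, JetFormer is the first model capable of generating high-fidelity images while producing strong log-likelihood bounds.||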
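|**2024-11-29**|[TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting](http://arxiv.org/abs/2411.19654)|**[link](https://github.com/ymxbj/TexGaussian)**|Physically Based Rendering (PBR) materials play a crucial role in modern graphics, enabling photorealistic rendering across diverse environment maps. An effective and efficient algorithm that automatically generates high-quality PBR materials, rather than RGB textures, for 3D meshes can significantly streamline 3D content creation. Most existing methods use pre-trained 2D diffusion models for multi-view image synthesis, which often leads to severe inconsistency between the generated textures and the input 3D meshes. This paper presents TexGaussian, a novel method that uses octree-aligned 3D Gaussian splatting for rapid PBR material generation. Specifically, we place each 3D Gaussian on the finest leaf node of the octree built from the input 3D mesh to render multi-view images not only for the albedo map but also for roughness and metallic. Moreover, our model is trained in a regression manner instead of diffusion denoising and can generate the PBR material for a 3D mesh in a single feed-forward pass. Extensive experiments on publicly available benchmarks show that our method synthesizes more visually pleasing PBR materials and runs faster than previous methods in both unconditional and text-conditional scenarios, exhibiting better consistency with the given geometry. Our code and trained models are available at https://3d-aigc.github.io/TexGaussian||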
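|**2024-11-29**|[Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing](http://arxiv.org/abs/2411.19652)|**[link](https://github.com/mowenyii/uniform-attention-maps)**|Text-guided image generation and editing using diffusion models have achieved remarkable progress. Among these, tuning-free methods have gained attention for their ability to perform edits without extensive model adjustments, offering both simplicity and efficiency. However, existing tuning-free approaches often struggle to balance fidelity and editing precision. Reconstruction errors in DDIM inversion are partly attributable to the cross-attention mechanism in U-Net, which introduces misalignments during the inversion and reconstruction process. To address this, we analyze reconstruction from a structural perspective and propose a novel approach that replaces traditional cross-attention with uniform attention maps, significantly enhancing image reconstruction fidelity. Our method effectively reduces the distortions caused by varying text conditions during noise prediction. To complement this improvement, we introduce an adaptive mask-guided editing technique that integrates seamlessly with our reconstruction approach, ensuring consistency and accuracy in editing tasks. Experimental results show that our approach not only excels at high-fidelity image reconstruction but also performs robustly in real image composition and editing scenarios. This study underscores the potential of uniform attention maps to enhance the fidelity and versatility of diffusion-based image processing methods. Code is available at https://github.com/Mowenyii/Uniform-Attention-Maps||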
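|**2024-11-29**|[Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings](http://arxiv.org/abs/2411.19628)|**[link](https://github.com/doubtedsteam/dyvte)**|Existing multimodal large language models (MLLMs) make excessive use of visual tokens, which often exhibit obvious redundancy and incur prohibitively high computational cost. To gain insight into this problem, we first conduct an extensive empirical study of the attention behavior of MLLMs and summarize three main stages of MLLM inference: (i) early fusion between tokens is completed quickly first; (ii) intra-modality modeling then comes into play; (iii) multimodal reasoning resumes and lasts until the end of inference. In particular, we find that visual tokens stop contributing to reasoning once the text tokens have received enough image information, which yields obvious visual redundancy. Based on these general observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide to remove all visual tokens after a certain layer, thereby addressing the observed visual redundancy. To validate VTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle, and InternVL, and conduct extensive experiments on a range of benchmarks. The experimental results not only show the effectiveness of our VTE in improving the efficiency of MLLMs but also reveal general modeling patterns of MLLMs, which aids an in-depth understanding of them. Our code is anonymously released at https://github.com/DoubtedSteam/DyVTE||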
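|**2024-11-29**|[Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook](http://arxiv.org/abs/2411.19537)|**[link](https://github.com/croitorualin/biodeep)**|With recent advances in generative modeling, the realism of deepfake content has been increasing at a steady pace, even reaching the point where people often fail to detect manipulated media content online and are thus deceived into various kinds of scams. In this paper, we survey deepfake generation and detection techniques, including the most recent developments in the field, such as diffusion models and Neural Radiance Fields. Our literature review covers all deepfake media types, comprising image, video, audio, and multimodal (audio-visual) content. We identify various kinds of deepfakes according to the procedure used to alter or generate the fake content. We further construct a taxonomy of deepfake generation and detection methods, illustrating the important groups of methods and the domains where these methods are applied. Next, we gather datasets used for deepfake detection and provide updated rankings of the best-performing deepfake detectors on the most popular datasets. In addition, we develop a novel multimodal benchmark to evaluate deepfake detectors on out-of-distribution content. The results indicate that state-of-the-art detectors fail to generalize to deepfake content generated by unseen deepfake generators. Finally, we propose future directions for obtaining robust and powerful deepfake detectors. Our project page and new benchmark are available at https://github.com/CroitoruAlin/biodeep||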
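|**2024-11-29**|[DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding](http://arxiv.org/abs/2411.19527)|null|Human motion, inherently continuous and dynamic, presents significant challenges for generative models. Although discrete quantization methods such as VQ-VAEs are dominant, they suffer from inherent limitations, including restricted expressiveness and frame-wise noise artifacts. Continuous approaches, while producing smoother and more natural motions, often falter because of high-dimensional complexity and limited training data. To resolve this "discord" between discrete and continuous representations, we introduce DisCoRD (Discrete Tokens to Continuous Motion via Rectified Flow Decoding), a novel method that decodes discrete motion tokens into continuous motion through rectified flow. By employing an iterative refinement process in continuous space, DisCoRD captures fine-grained dynamics and ensures smoother and more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals. Extensive evaluations show that DisCoRD achieves state-of-the-art performance, with an FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results establish DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Our project page is available at: https://whwjdqls.github.io/discord.github.io/||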
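|**2024-11-29**|[Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis](http://arxiv.org/abs/2411.19509)|**[link](https://github.com/antgroup/ditto-talkinghead)**|Recent advances in diffusion models have revolutionized audio-driven talking head synthesis. Beyond precise lip synchronization, diffusion-based methods excel at generating subtle expressions and natural head movements that are well aligned with the audio signal. However, these methods suffer from slow inference, insufficient fine-grained control over facial motions, and occasional visual artifacts, largely because of an implicit latent space derived from Variational Auto-Encoders (VAEs), which hinders their adoption in real-time interactive applications. To address these issues, we introduce Ditto, a diffusion-based framework that enables controllable real-time talking head synthesis. Our key innovation lies in bridging motion generation and photorealistic neural rendering through an explicit, identity-agnostic motion space that replaces conventional VAE representations. This design substantially reduces the complexity of diffusion learning while enabling precise control over the synthesized talking heads. We further propose an inference strategy that jointly optimizes three key components: audio feature extraction, motion generation, and video synthesis. This optimization enables streaming processing, real-time inference, and low first-frame delay, which are essential for interactive applications such as AI assistants. Extensive experimental results show that Ditto generates compelling talking head videos and substantially outperforms existing methods in both motion control and real-time performance.||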
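|**2024-11-27**|[GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data](http://arxiv.org/abs/2411.18624)|null|Given a single in-the-wild human photo, reconstructing a high-fidelity 3D human model remains a challenging task. Existing methods face difficulties including (a) the varying body proportions captured by in-the-wild human images; (b) the diverse personal belongings within the shot; and (c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a generalizable image-to-3D human reconstruction framework, dubbed GeneMAN, built on a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN comprises three key modules. (1) GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, which serve as its 2D human prior and 3D human prior for reconstruction, respectively, without relying on parametric human models such as SMPL. (2) With the help of the pre-trained human prior models, a Geometry Initialization-and-Sculpting pipeline is used to recover high-quality 3D human geometry. (3) To achieve high-fidelity 3D human textures, GeneMAN employs a Multi-Space Texture Refinement pipeline that consecutively refines textures in the latent space and the pixel space. Extensive experimental results show that GeneMAN generates high-quality 3D human models from a single image input, outperforming prior state-of-the-art methods. Notably, GeneMAN generalizes better to in-the-wild images, yielding high-quality 3D human models in natural poses with common items, even when body proportions in the input images vary.||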
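|**2024-11-27**|[Diffusion Self-Distillation for Zero-Shot Customized Image Generation](http://arxiv.org/abs/2411.18616)|null|Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, that is, "identity-preserving generation". This setting, along with many other tasks (for example, relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method that uses a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a vision-language model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preserving generation tasks, without requiring test-time optimization.||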
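|**2024-11-27**|[CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models](http://arxiv.org/abs/2411.18613)|null|We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. CAT4D leverages a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, this model can transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks, and we highlight the creative capability of generating 4D scenes from real or generated videos. See our project page for results and interactive demos: cat-4d.github.io||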
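|**2024-11-27**|[Evaluating and Improving the Effectiveness of Synthetic Chest X-Rays for Medical Image Analysis](http://arxiv.org/abs/2411.18602)|null|Purpose: To explore best-practice approaches for generating synthetic chest X-ray images and augmenting medical imaging datasets so as to optimize the performance of deep learning models in downstream tasks such as classification and segmentation. Materials and Methods: We used a latent diffusion model to condition the generation of synthetic chest X-rays on text prompts and/or segmentation masks. We explored methods such as using a proxy model and using radiologist feedback to improve the quality of the synthetic data. We then generated these synthetic images from relevant disease information or geometrically transformed segmentation masks and added them to ground-truth training set images from the CheXpert, CANDID-PTX, SIIM, and RSNA Pneumonia datasets to measure improvements in classification and segmentation model performance on the test sets. F1 and Dice scores were used to evaluate classification and segmentation performance, respectively. One-tailed t-tests with Bonferroni correction assessed the statistical significance of the performance improvements obtained with synthetic data. Results: Across all experiments, the synthetic data we generated yielded a maximum mean classification F1 score improvement of 0.150453 (CI: 0.099108-0.201798; P=0.0031) compared with using only real data. For segmentation, the maximum Dice score improvement was 0.14575 (CI: 0.108267-0.183233; P=0.0064). Conclusion: Best practices for generating synthetic chest X-ray images for downstream tasks include conditioning on single-disease labels or geometrically transformed segmentation masks, as well as potentially using proxy modeling for fine-tuning.||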
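|**2024-11-27**|[Bit symmetry entails the symmetry of the quantum transition probability](http://arxiv.org/abs/2411.18589)|null|It is quite common to use generalized probabilistic theories (GPTs) as generic models to reconstruct quantum theory from a few basic principles and to gain a better understanding of the probabilistic or information-theoretic foundations of quantum physics and quantum computing. In this framework, various symmetry postulates have been introduced and studied, including the transitivity of the automorphism group on (1) the pure states, (2) the pairs of orthogonal pure states (these pairs are called 2-frames), and (3) any frames of the same size. The second postulate is Müller and Ududec's bit symmetry, which they motivate by the needs of quantum computation. Here we explore these three postulates in the transition probability framework, which is more specific than GPTs because it presupposes the existence of transition probabilities for the quantum logical atoms, either directly or indirectly through a certain geometric property of the state space. The author introduced this property of compact convex sets in a recent paper. We show that bit symmetry implies the symmetry of the transition probabilities between the atoms. Using a result of Barnum and Hilgert, we can then conclude that the third, rather strong symmetry postulate rules out all models except the classical cases and the simple Euclidean Jordan algebras.||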
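|**2024-11-27**|[Building Confidence in Deep Generative Protein Design](http://arxiv.org/abs/2411.18568)|**[link](https://github.com/ecburx/proteval)**|Deep generative models show promise for de novo protein design, but their effectiveness within specific protein families remains underexplored. In this study, we evaluate two 3D rigid-body generative methods, score matching and flow matching, for generating monomeric protein backbones in SE(3) space. Our aim is to provide new insights and build confidence in the broader applicability of deep generative models to protein design. Optimal amino acid sequences are predicted from the generated backbones, followed by side-chain homology modeling. The results show that the generated proteins have high structural integrity, with conserved key residues that align with known proteins. Structural phylogenetic analysis reveals evolutionary links between the generated samples and members of their protein family. Further molecular dynamics simulations and protein-ligand docking confirm the dynamic stability and functional potential of these samples, with ligand binding inducing conformational changes consistent with those of wild-type proteins.||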
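|**2024-11-27**|[FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion](http://arxiv.org/abs/2411.18552)|null|Diffusion models are proficient at generating high-quality images. They are, however, effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions. Retraining at higher resolutions quickly becomes prohibitive. Methods that enable pre-existing diffusion models to operate at flexible test-time resolutions are therefore highly desirable. Previous works suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that together resolve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve global structure consistency, and an Attention Modulation (AM) module that improves the consistency of local texture patterns, a problem largely ignored in prior work. Our method, called FAM diffusion, integrates seamlessly into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Our method also avoids redundant inference tricks for improved consistency, such as patch-based or progressive generation, which leads to negligible latency overhead.||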
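|**2024-11-27**|[Enhancing weed detection performance by means of GenAI-based image augmentation](http://arxiv.org/abs/2411.18513)|null|Precise weed management is essential for sustaining crop productivity and ecological balance. Traditional herbicide application faces economic and environmental challenges, which underscores the need for intelligent weed control systems powered by deep learning. These systems require large amounts of high-quality training data. In practice, however, well-annotated training data are scarce, a problem commonly addressed by generating more data through data augmentation. Nevertheless, conventional augmentation techniques such as random flipping, color changes, and blurring lack sufficient fidelity and diversity. This paper investigates a generative AI-based augmentation technique that uses the Stable Diffusion model to generate diverse synthetic images, improving the quantity and quality of training datasets for weed detection models. The work also examines the impact of these synthetic images on the performance of real-time detection systems and therefore focuses on compact CNN-based models, such as YOLO nano, for edge devices. Experimental results show substantial improvements in mean average precision (mAP50 and mAP50-95) for YOLO models trained with generative AI-augmented datasets, demonstrating the promise of synthetic data for enhancing model robustness and accuracy.||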
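|**2024-11-27**|[GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation](http://arxiv.org/abs/2411.18499)|null|Multimodal large language models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, as it requires integrated multimodal understanding and generation abilities. While progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods because of limitations in data size and diversity. To bridge this gap, we introduce GATE OpenING (OpenING), a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse everyday scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. We further present several key findings on interleaved image-text generation to guide the development of next-generation models. OpenING is open-sourced at https://opening.github.io||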
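|**2024-11-27**|[Synthetic ECG Generation for Data Augmentation and Transfer Learning in Arrhythmia Classification](http://arxiv.org/abs/2411.18456)|null|Deep learning models need sufficient data to find the hidden patterns within it. The purpose of generative modeling is to learn the data distribution, which allows us to sample more data and augment the original dataset. In the context of physiological data, and more specifically electrocardiogram (ECG) data, given its sensitive nature and expensive data collection, we can exploit the benefits of generative models to enlarge existing datasets and improve downstream tasks, in our case the classification of heart rhythm. In this work, we explore the usefulness of synthetic data generated with different generative models from deep learning, namely Diffwave, Time-Diffusion, and Time-VQVAE, for obtaining better classification results on two open-source multivariate ECG datasets. Moreover, we investigate the effects of transfer learning by fine-tuning a model pre-trained on synthetic data and then progressively adding increasing proportions of real data. We conclude that although the synthetic samples resemble the real ones, the classification improvement from simply augmenting the real dataset is barely noticeable on individual datasets, but when the two datasets are merged, all classifier metrics improve when synthetic samples are used as augmented data. In the fine-tuning results, the Time-VQVAE generative model outperforms the others, but it is not powerful enough to come close to the results of a classifier trained with real data only. In addition, as a by-product of the main research question of this work, we explore methods and metrics for measuring the closeness between synthetic and real data.||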
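|**2024-11-26**|[StableAnimator: High-Quality Identity-Preserving Human Image Animation](http://arxiv.org/abs/2411.17697)|**[link](https://github.com/Francis-Rings/StableAnimator)**|Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building on a video diffusion model, StableAnimator contains carefully designed modules for both training and inference that strive for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, respectively, and the face embeddings are further refined by interacting with the image embeddings through a global content-aware Face Encoder. StableAnimator then introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel optimization based on the Hamilton-Jacobi-Bellman (HJB) equation to further enhance face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and that the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.||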
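|**2024-11-26**|[ScribbleLight: Single Image Indoor Relighting with Scribbles](http://arxiv.org/abs/2411.17696)|null|Image-based relighting of indoor rooms creates an immersive virtual understanding of the space, which is useful for interior design, virtual staging, and real estate. Relighting indoor rooms from a single image is especially challenging because of complex illumination interactions between multiple lights and cluttered objects, which feature a large variety in geometry and material. Recently, generative models have been applied successfully to image-based relighting conditioned on a target image or a latent code, albeit without detailed local lighting control. In this paper, we introduce ScribbleLight, a generative model that supports local fine-grained control of lighting effects through scribbles that describe changes in lighting. Our key technical novelties are an Albedo-conditioned Stable Image Diffusion model that preserves the intrinsic color and texture of the original image after relighting, and an encoder-decoder-based ControlNet architecture that enables geometry-preserving lighting effects with normal map and scribble annotations. We demonstrate ScribbleLight's ability to create different lighting effects (for example, turning lights on or off, adding highlights, casting shadows, or indirect lighting from unseen lights) from sparse scribble annotations.||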
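|**2024-11-26**|[Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis](http://arxiv.org/abs/2411.17690)|null|In this paper, we propose a new task, generating speech from videos of people and their transcripts (VTTS), to motivate new techniques for multimodal speech generation. This task generalizes the task of generating speech from cropped lip videos and is more complicated than the task of generating generic audio clips (for example, a dog barking) from videos and text. Multilingual versions of the task could lead to new techniques for cross-lingual dubbing. We also present a decoder-only multimodal model for this task, which we call Visatronic. This model embeds vision, text, and speech directly into the common subspace of a Transformer model and uses an autoregressive loss to learn a generative model of discretized mel-spectrograms conditioned on speaker videos and transcripts of their speech. By embedding all modalities into a common subspace, Visatronic achieves better results than models that use only text or video as input. Further, it offers a much simpler approach to multimodal speech generation than prevailing approaches that rely on lip detectors and complicated architectures to fuse modalities, while producing better results. Because the model is flexible enough to accommodate different ways of ordering inputs as a sequence, we carefully explore different strategies to better understand the best way to propagate information to the generative steps. To facilitate further research on VTTS, we will release (i) our code, (ii) clean transcriptions for the large-scale VoxCeleb2 dataset, and (iii) a standardized evaluation protocol for VTTS that incorporates both objective and subjective metrics.||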
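|**2024-11-26**|[GenDeg: Diffusion-Based Degradation Synthesis for Generalizable All-in-One Image Restoration](http://arxiv.org/abs/2411.17687)|null|Deep learning-based models for All-In-One Image Restoration (AIOR) have achieved significant advances in recent years. However, their practical applicability is limited by poor generalization to samples outside the training distribution. This limitation arises primarily from insufficient diversity in degradation variations and scenes within existing datasets, resulting in inadequate representation of real-world scenarios. In addition, capturing large-scale real-world paired data for degradations such as haze, low light, and raindrops is often cumbersome and sometimes infeasible. In this paper, we leverage the generative capabilities of latent diffusion models to synthesize high-quality degraded images from their clean counterparts. Specifically, we introduce GenDeg, a degradation- and intensity-aware conditional diffusion model capable of producing diverse degradation patterns on clean images. Using GenDeg, we synthesize over 550k samples across six degradation types: haze, rain, snow, motion blur, low light, and raindrops. These generated samples are integrated with existing datasets to form the GenDS dataset, which comprises over 750k samples. Our experiments reveal that image restoration models trained on the GenDS dataset show significant improvements in out-of-distribution performance compared with those trained solely on existing datasets. Furthermore, we provide comprehensive analyses of the impact of diffusion-based synthetic degradations on AIOR. The code will be made publicly available.||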
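|**2024-11-26**|[Accelerating Vision Diffusion Transformers with Skip Branches](http://arxiv.org/abs/2411.17616)|**[link](https://github.com/opensparsellms/skip-dit)**|Diffusion Transformers (DiT), an emerging architecture for image and video generation, have shown great potential because of their high generation quality and scalability. Despite this impressive performance, their practical deployment is constrained by computational complexity and by redundancy in the sequential denoising process. While feature caching across timesteps has proven effective in accelerating diffusion models, its application to DiT is limited by fundamental architectural differences from U-Net-based approaches. Through an empirical analysis of DiT feature dynamics, we find that significant feature variation between DiT blocks poses a key challenge for feature reusability. To address this, we convert standard DiT into Skip-DiT with skip branches to enhance feature smoothness. Further, we introduce Skip-Cache, which uses the skip branches to cache DiT features across timesteps at inference time. We validate the proposed method on different DiT backbones for video and image generation, showing that skip branches help preserve generation quality and achieve higher speedup. Experimental results indicate that Skip-DiT achieves a 1.5x speedup almost for free and a 2.2x speedup with only a minor reduction in quantitative metrics. Code is available at https://github.com/OpenSparseLLMs/Skip-DiT.git.||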
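|**2024-11-26**|[Mixed-State Quantum Denoising Diffusion Probabilistic Model](http://arxiv.org/abs/2411.17608)|null|Generative quantum machine learning has gained significant attention for its ability to produce quantum states with desired distributions. Among various quantum generative models, quantum denoising diffusion probabilistic models (QuDDPMs) [Phys. Rev. Lett. 132, 100602 (2024)] offer a promising approach with stepwise learning that resolves training issues. However, the requirement of high-fidelity scrambling unitaries in QuDDPM poses a challenge for near-term implementation. We propose the mixed-state quantum denoising diffusion probabilistic model (MSQuDDPM) to eliminate the need for scrambling unitaries. Our approach focuses on adapting quantum noise channels to the model architecture, which integrates depolarizing noise channels in the forward diffusion process and parameterized quantum circuits with projective measurements in the backward denoising steps. We also introduce several techniques to improve MSQuDDPM, including a cosine-exponent schedule of noise interpolation, the use of single-qubit random ancilla, and superfidelity-based cost functions to enhance convergence. We evaluate MSQuDDPM on quantum ensemble generation tasks and demonstrate its successful performance.||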
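|**2024-11-26**|[VideoDirector: Precise Video Editing via Text-to-Video Models](http://arxiv.org/abs/2411.17592)|null|Although the typical inversion-then-editing paradigm using text-to-image (T2I) models has shown promising results, directly extending it to text-to-video (T2V) models still suffers from severe artifacts such as color flickering and content distortion. Consequently, current video editing methods primarily rely on T2I models, which inherently lack temporally coherent generative ability and therefore often produce inferior editing results. In this paper, we attribute the failure of the typical editing paradigm to two factors: 1) Tight spatial-temporal coupling. The vanilla pivotal-based inversion strategy struggles to disentangle spatial-temporal information in the video diffusion model. 2) Complicated spatial-temporal layout. Vanilla cross-attention control is insufficient to preserve the unedited content. To address these limitations, we propose spatial-temporal decoupled guidance (STDG) and a multi-frame null-text optimization strategy that provide pivotal temporal cues for more precise pivotal inversion. Furthermore, we introduce a self-attention control strategy to maintain higher fidelity for precise partial content editing. Experimental results show that our method (termed VideoDirector) effectively harnesses the powerful temporal generation capabilities of T2V models, producing edited videos with state-of-the-art performance in accuracy, motion smoothness, realism, and fidelity to unedited content.||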
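|**2024-11-26**|[Metaverse Innovation Canvas: A Tool for Extended Reality Product/Service Development](http://arxiv.org/abs/2411.17541)|null|This study investigates the factors behind the failure of augmented reality (AR) and virtual reality (VR) startups in the emerging metaverse landscape. Through an in-depth analysis of 29 AR/VR startups that failed between 2016 and 2022, it identifies key pitfalls such as a lack of scalability, poor usability, unclear value propositions, and failure to address specific user problems. Based on these findings, we develop the Metaverse Innovation Canvas (MIC), a business ideation framework tailored to XR products and services. The canvas guides founders to define user problems, articulate unique XR value propositions, evaluate usability factors (such as the motion-based interaction load), consider social and virtual economy opportunities, and plan for long-term scalability. Unlike generic models, its specialized blocks prompt consideration of critical XR factors from the outset. The canvas was evaluated through expert testing with startup consultants on five failed venture cases. The results highlight the tool's effectiveness in proactively surfacing overlooked usability issues and technology constraints, thereby improving the viability of future metaverse startups.||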
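|**2024-11-26**|[IMPROVE: Improving Medical Plausibility without Reliance on HumanValidation -- An Enhanced Prototype-Guided Diffusion Framework](http://arxiv.org/abs/2411.17535)|null|Generative models have proven very effective at producing synthetic medical images and find applications in downstream tasks such as enhancing rare-disease datasets, augmenting long-tailed datasets, and scaling machine learning algorithms. For medical applications, the synthetic images produced by such models remain of reasonable quality when evaluated with traditional metrics such as FID score, precision, and recall. However, these metrics fail to capture the medical or biological plausibility of the generated images. Human expert feedback has been used to assess biological plausibility, and it shows that these generated images have very low plausibility. Recently, the research community has further integrated this human feedback through Reinforcement Learning from Human Feedback (RLHF), which produces more medically plausible images. However, incorporating human feedback is a costly and slow process. In this work, we propose a novel approach that improves the medical plausibility of generated images without any human feedback. We introduce IMPROVE: Improving Medical Plausibility without Reliance on Human Validation - An Enhanced Prototype-Guided Diffusion Framework, a prototype-guided diffusion process for medical image generation, and show that it substantially enhances the biological plausibility of generated medical images without any human feedback. We run experiments on the Bone Marrow and HAM10000 datasets and show that medical accuracy can be substantially improved without human feedback.||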
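|**2024-11-26**|[FTMoMamba: Motion Generation with Frequency and Text State Space Models](http://arxiv.org/abs/2411.17532)|null|Diffusion models achieve impressive performance in human motion generation. However, current approaches typically ignore the importance of frequency-domain information for capturing fine-grained motions within the latent space (e.g., low frequencies correlate with static poses, and high frequencies align with fine-grained motions). In addition, there is a semantic discrepancy between text and motion, which leads to inconsistency between the generated motions and the text descriptions. In this work, we propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM). Specifically, to learn fine-grained representations, FreqSSM decomposes sequences into low-frequency and high-frequency components, which guide the generation of static poses (e.g., sitting, lying) and fine-grained motions (e.g., transitions, stumbling), respectively. To ensure consistency between text and motion, TextSSM encodes text features at the sentence level, aligning textual semantics with sequential features. Extensive experiments show that FTMoMamba achieves superior performance on the text-to-motion generation task, notably obtaining the lowest FID of 0.181 on the HumanML3D dataset (far lower than the 0.421 of MLD).||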
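|**2024-11-22**|[DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving](http://arxiv.org/abs/2411.15139)|**[link](https://github.com/hustvl/diffusiondrive)**|Recently, the diffusion model has emerged as a powerful generative technique for robotic policy learning, capable of modeling multi-mode action distributions. Leveraging it for end-to-end autonomous driving is a promising direction. However, the numerous denoising steps in robotic diffusion policies and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at real-time speed. To address these challenges, we propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule, enabling the model to learn denoising from an anchored Gaussian distribution to the multi-mode driving action distribution. In addition, we design an efficient cascade diffusion decoder for enhanced interaction with the conditional scene context. The proposed model, DiffusionDrive, reduces the number of denoising steps by a factor of 10 compared with the vanilla diffusion policy, delivering superior diversity and quality in just 2 steps. On the planning-oriented NAVSIM dataset, with an aligned ResNet-34 backbone, DiffusionDrive achieves 88.1 PDMS without bells and whistles, setting a new record, while running at a real-time speed of 45 FPS on an NVIDIA 4090. Qualitative results in challenging scenarios further confirm that DiffusionDrive can robustly generate diverse plausible driving actions. Code and model will be available at https://github.com/hustvl/DiffusionDrive.||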
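|**2024-11-22**|[Material Anything: Generating Materials for Any 3D Object via Diffusion](http://arxiv.org/abs/2411.15138)|null|We present Material Anything, a fully automated, unified diffusion framework designed to generate physically based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head architecture and a rendering loss to improve stability and material quality. In addition, we introduce confidence masks as a dynamic switcher within the diffusion model, enabling it to effectively handle both textured and texture-less objects across varying lighting conditions. By employing a progressive material generation strategy guided by these confidence masks, along with a UV-space material refiner, our method ensures consistent, UV-ready material outputs. Extensive experiments show that our approach outperforms existing methods across a wide range of object categories and lighting conditions.||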
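|**2024-11-22**|[VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement](http://arxiv.org/abs/2411.15115)|null|Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, the videos these models generate are often misaligned with the text prompt, especially when the prompt describes complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of four stages: (1) video evaluation, where we detect misalignments by generating fine-grained evaluation questions and answering them with a multimodal large language model (MLLM); (2) refinement planning, where we identify accurately generated objects and then create localized prompts to refine other areas of the video; (3) region decomposition, where we segment the correctly generated area using a combined grounding module; and (4) localized refinement, where we regenerate the video by adjusting the misaligned regions while preserving the correct ones. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.||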
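|**2024-11-22**|[Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion](http://arxiv.org/abs/2411.15113)|null|As text-to-image models grow increasingly powerful and complex, their growing size presents a significant obstacle to widespread adoption, especially on resource-constrained devices. This paper presents a pioneering study on post-training pruning of Stable Diffusion 2, addressing the critical need for model compression in the text-to-image domain. Our study tackles pruning techniques for previously unexplored multimodal generative models, and in particular examines the impact of pruning on the textual component and the image generation component separately. We conduct a comprehensive comparison of pruning the model, or individual components of the model, at various sparsities. Our results yield previously undocumented findings. For example, contrary to established trends in language model pruning, we find that simple magnitude pruning outperforms more advanced techniques in the text-to-image setting. Furthermore, our results show that Stable Diffusion 2 can be pruned to 38.5% sparsity with minimal quality loss, significantly reducing model size. We propose an optimal pruning configuration that prunes the text encoder to 47.5% and the diffusion generator to 35%. This configuration maintains image generation quality while substantially reducing computational requirements. In addition, our work raises intriguing questions about information encoding in text-to-image models: we observe that pruning beyond certain thresholds leads to sudden performance drops (unreadable images), suggesting that specific weights encode critical semantic information. This finding opens new avenues for future research in model compression, interoperability, and bias identification in text-to-image models. By providing crucial insights into the pruning behavior of text-to-image models, our study lays the groundwork for developing more efficient and accessible AI-driven image generation systems.||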
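|**2024-11-22**|[Leapfrog Latent Consistency Model (LLCM) for Medical Images Generation](http://arxiv.org/abs/2411.15084)|**[link](https://github.com/lskdsjy/leapfroglcm)**|Because of privacy concerns, hospitals are reluctant to share data, and the resulting scarcity of accessible medical image data poses a significant obstacle to effectively training deep learning models for medical diagnosis. In response, we collected a diverse dataset named MedImgs, which comprises over 250,127 images from open-source repositories, spanning 61 disease types and 159 classes of both humans and animals. We propose a Leapfrog Latent Consistency Model (LLCM) that is distilled from a diffusion model retrained on the collected MedImgs dataset, which enables our model to generate high-resolution images in real time. We formulate the reverse diffusion process as a probability flow ordinary differential equation (PF-ODE) and solve it in latent space using the Leapfrog algorithm. This formulation enables fast sampling without additional iterations. Our model demonstrates state-of-the-art performance in generating medical images. Furthermore, our model can be fine-tuned with any custom medical image dataset, facilitating the generation of a wide variety of images. Our experimental results outperform those of existing models on unseen dog cardiac X-ray images. Source code is available at https://github.com/lskdsjy/LeapfrogLCM.||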
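|**2024-11-22**|[The 1D nonlocal Fisher-KPP equation with a top hat kernel. Part 3. The effect of perturbations in the kernel](http://arxiv.org/abs/2411.15054)|null|In the third part of this series of papers, we address the same Cauchy problem that was considered in part 1, namely the nonlocal Fisher-KPP equation in one spatial dimension, $u_t = D u_{xx} + u(1-\phi_T*u)$, where $\phi_T*u$ is a spatial convolution with the top hat kernel $\phi_T(y) \equiv H\left(\frac{1}{4}-y^2\right)$, except that now we include a specified perturbation to this kernel, which we denote as $\overline{\phi}:\mathbb{R}\to \mathbb{R}$. Thus the top hat kernel $\phi_T$ is now replaced by the perturbed kernel $\phi:\mathbb{R} \to \mathbb{R}$, where $\phi(x) = \phi_T(x) + \overline{\phi}(x)~~\forall~~x\in \mathbb{R}$. When the size of the kernel perturbation is small in a suitable norm, and the diffusion coefficient $D$ is formally of O(1) or larger, this is in general a regular perturbation problem. However, when $D$ becomes small, and in particular comparable to the size of the kernel perturbation, this becomes a strongly singular perturbation problem, with considerable changes in overall structure. This situation is uncovered in detail. In its generic sense, the model is a natural extension of the classical Fisher-KPP model, with the introduction of the simplest possible nonlocal effect into the saturation term. Nonlocal reaction-diffusion models arise naturally in a variety of (frequently biological or ecological) contexts, so it is of fundamental interest to examine their properties in detail and to compare and contrast them with the well-known properties of the classical Fisher-KPP model.||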
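|**2024-11-22**|[FloAt: Flow Warping of Self-Attention for Clothing Animation Generation](http://arxiv.org/abs/2411.15028)|null|We propose FloAtControlNet, a diffusion-model-based method for generating cinemagraphs composed of animations of human clothing. We focus on human clothing such as dresses, skirts, and pants. The input to our model is a text prompt describing the type of clothing and its texture, such as leopard, striped, or plain, together with a sequence of normal maps that capture the underlying animation we want in the output. The backbone of our method is a normal-map-conditioned ControlNet that operates in a training-free regime. The key observation is that the underlying animation is embedded in the flow of the normal maps. We use the flow obtained this way to manipulate the self-attention maps of appropriate layers. Specifically, the self-attention map of a particular layer and frame is recomputed as a linear combination of itself and the self-attention map of the same layer in the previous frame, warped by the flow on the normal maps of the two frames. We show that manipulating the self-attention maps greatly enhances the quality of the clothing animation, making it look more natural and suppressing background artifacts. Through extensive experiments, we show that the proposed method beats all baselines both in visual results and in a user study. In particular, our method alleviates the background flickering present in the other diffusion-model-based baselines that we consider. We also show that our method beats all baselines in terms of RMSE and PSNR computed between the input normal map sequences and the normal map sequences obtained from the output RGB frames. Finally, we show that well-established visual quality metrics such as LPIPS, SSIM, and CLIP scores are not necessarily suitable for capturing the subtle motions in human clothing animations.||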
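|**2024-11-22**|[Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation](http://arxiv.org/abs/2411.14913)|null|Learning diverse policies for non-prehensile manipulation is essential for improving skill transfer and generalization to out-of-distribution scenarios. In this work, we enhance exploration through a two-fold approach within a hybrid framework that handles both discrete and continuous action spaces. First, we model the continuous motion parameter policy as a diffusion model; second, we incorporate it into a maximum entropy reinforcement learning framework that unifies the discrete and continuous components. The discrete action space, such as contact point selection, is optimized through Q-value function maximization, while the continuous part is guided by a diffusion-based policy. This hybrid approach leads to a principled objective in which the maximum entropy term is derived as a lower bound using structured variational inference. We propose the Hybrid Diffusion Policy algorithm (HyDo) and evaluate its performance on both simulation and zero-shot sim2real tasks. Our results show that HyDo encourages more diverse behavior policies, leading to significantly improved success rates across tasks, for example increasing from 53% to 72% on a real-world 6D pose alignment task. Project page: https://leh2rng.github.io/hydo||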
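|**2024-11-22**|[Prioritize Denoising Steps on Diffusion Model Preference Alignment via Explicit Denoised Distribution Estimation](http://arxiv.org/abs/2411.14871)|null|Diffusion models have achieved remarkable success in text-to-image generation, which makes alignment methods for these models increasingly important. A key challenge is the sparsity of preference labels, which are typically available only at the end of denoising trajectories. This raises the question of how to assign credit across denoising steps based on these sparse labels. In this paper, we propose Denoised Distribution Estimation (DDE), a novel method for credit assignment. Unlike previous approaches that rely on auxiliary models or hand-crafted schemes, DDE follows a more explicit strategy. The proposed DDE directly estimates the terminal denoised distribution from the perspective of each step. It is equipped with two estimation strategies and can represent the entire denoising trajectory with a single model inference. We show theoretically and empirically that DDE prioritizes optimizing the middle part of the denoising trajectory, resulting in a novel and effective credit assignment scheme. Extensive experiments demonstrate that our approach achieves superior performance, both quantitatively and qualitatively.||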
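|**2024-11-22**|[Latent Schrodinger Bridge: Prompting Latent Diffusion for Fast Unpaired Image-to-Image Translation](http://arxiv.org/abs/2411.14863)|null|Diffusion models (DMs), which enable both image generation from noise and inversion from data, have inspired powerful unpaired image-to-image (I2I) translation algorithms. However, they often require a large number of neural function evaluations (NFEs), which limits their practical applicability. In this paper, we tackle this problem with Schrodinger Bridges (SBs), which are stochastic differential equations (SDEs) between distributions with minimal transport cost. We analyze the probability flow ordinary differential equation (ODE) formulation of SBs and observe that its vector field can be decomposed into a linear combination of source predictor, target predictor, and noise predictor. Inspired by this observation, we propose Latent Schrodinger Bridges (LSBs), which approximate the SB ODE via pre-trained Stable Diffusion, and we develop appropriate prompt optimization and change-of-variables formulae to match the training and inference between distributions. We demonstrate that our algorithm successfully performs competitive I2I translation in the unsupervised setting, at only a fraction of the computational cost required by previous DM-based I2I methods.||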
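|**2024-11-21**|[Stable Flow: Vital Layers for Training-Free Image Editing](http://arxiv.org/abs/2411.14430)|**[link](https://github.com/snap-research/stable-flow)**|Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT) and employed flow matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike UNet-based models, DiT lacks a coarse-to-fine synthesis structure, which makes it unclear in which layers to perform the injection. We therefore propose an automatic method to identify "vital layers" within DiT that are crucial for image formation, and we demonstrate how these layers facilitate a range of controlled, stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications. The project page is available at https://omriavrahami.com/stable-flow||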
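|**2024-11-21**|[Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation](http://arxiv.org/abs/2411.14384)|null|Existing feed-forward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when the prompt view direction changes, and they mainly handle object-centric prompt images. In this paper, we propose DiffusionGS, a novel single-stage 3D diffusion model for object and scene generation from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency, and it allows the model to generate robustly given prompt views of any direction, beyond object-centric inputs. In addition, to improve the capability and generalization ability of DiffusionGS, we scale up the 3D training data by developing a scene-object mixed training strategy. Experiments show that our method achieves better generation quality (2.20 dB higher in PSNR and 23.25 lower in FID) and is over 5x faster (about 6 s on an A100 GPU) than SOTA methods. The user study and text-to-3D applications also reveal the practical value of our method. Our project page at https://caiyuanhao1998.github.io/project/DiffusionGS/ shows the video and interactive generation results.||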
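|**2024-11-21**|[CoNFiLD-inlet: Synthetic Turbulence Inflow Using Generative Latent Diffusion Models with Neural Fields](http://arxiv.org/abs/2411.14378)|null|Eddy-resolving turbulence simulations require stochastic inflow conditions that accurately replicate the complex, multi-scale structures of turbulence. Traditional recycling-based methods rely on computationally expensive precursor simulations, while existing synthetic inflow generators often fail to reproduce realistic coherent structures of turbulence. Recent advances in deep learning (DL) have opened new possibilities for inflow turbulence generation, yet many DL-based methods rely on deterministic, autoregressive frameworks that are prone to error accumulation, resulting in poor robustness for long-term predictions. In this work, we present CoNFiLD-inlet, a novel DL-based inflow turbulence generator that integrates diffusion models with a latent space encoded by conditional neural fields (CNF) to produce realistic, stochastic inflow turbulence. By parameterizing inflow conditions with Reynolds numbers, CoNFiLD-inlet generalizes effectively across a wide range of Reynolds numbers ( $Re_τ$ between $10^3$ and $10^4$) without requiring retraining or parameter tuning. Comprehensive validation through a priori and a posteriori tests in Direct Numerical Simulation (DNS) and Wall-Modeled Large Eddy Simulation (WMLES) demonstrates its high fidelity, robustness, and scalability, positioning it as an efficient and versatile solution for inflow turbulence synthesis.||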
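|**2024-11-21**|[Enhancing Medical Image Segmentation with Deep Learning and Diffusion Models](http://arxiv.org/abs/2411.14353)|null|Medical image segmentation is crucial for accurate clinical diagnosis, yet it faces challenges such as low contrast between lesions and normal tissue, unclear boundaries, and high variability across patients. Deep learning has improved segmentation accuracy and efficiency, but it still relies heavily on expert annotations and struggles with the complexity of medical images. The small size of medical image datasets and the high cost of data acquisition further limit the performance of segmentation networks. Diffusion models, with their iterative denoising process, offer a promising alternative for better capturing segmentation detail. However, they have difficulty accurately segmenting small targets and maintaining the precision of boundary details. This article discusses the importance of medical image segmentation, the limitations of current deep learning approaches, and the potential of diffusion models to address these challenges.||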
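|**2024-11-21**|[StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart](http://arxiv.org/abs/2411.14295)|**[link](https://github.com/shijianjian/stereocrafter-zero)**|Generating high-quality stereo videos that mimic human binocular vision requires maintaining consistent depth perception and temporal coherence across frames. Although diffusion models have advanced image and video synthesis, generating high-quality stereo videos remains challenging because of the difficulty of maintaining consistent temporal and spatial coherence between the left and right views. We introduce StereoCrafter-Zero, a novel framework for zero-shot stereo video generation that leverages video diffusion priors without requiring paired training data. Key innovations include a noisy restart strategy to initialize stereo-aware latent representations and an iterative refinement process that progressively harmonizes the latent space, addressing issues such as temporal flickering and view inconsistencies. Comprehensive evaluations, including quantitative metrics and user studies, show that StereoCrafter-Zero produces high-quality stereo videos with improved depth consistency and temporal smoothness, even when depth estimations are imperfect. Our framework is robust and adaptable across various diffusion models, setting a new benchmark for zero-shot stereo video generation and enabling more immersive visual experiences. Our code can be found at https://github.com/shijianjian/StereoCrafter-Zero.||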
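|**2024-11-21**|[Efficient Aspect-Based Summarization of Climate Change Reports with Small Language Models](http://arxiv.org/abs/2411.14272)|**[link](https://github.com/ighina/llmclimate2024)**|The use of Natural Language Processing (NLP) to help decision-makers act on climate change has recently been highlighted as a use case that aligns with the broader drive toward NLP technologies for social good. In this context, Aspect-Based Summarization (ABS) systems that extract and summarize relevant information are particularly useful, as they give stakeholders a convenient way of finding relevant information in expert-curated reports. In this work, we release a new dataset for ABS of climate change reports, and we employ different Large Language Models (LLMs) and so-called Small Language Models (SLMs) to tackle this problem in an unsupervised way. Considering the problem at hand, we also show how SLMs are not significantly worse on the task while leading to a reduced carbon footprint; we do so by applying, for the first time, an existing framework that considers both energy efficiency and task performance to the evaluation of zero-shot generative models for ABS. Overall, our results show that modern language models, both large and small, can effectively tackle ABS for climate change reports, but more research is needed when we frame the problem as a Retrieval Augmented Generation (RAG) problem, and our work and dataset will help foster efforts in this direction.||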
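|**2024-11-21**|[Guided MRI Reconstruction via Schrödinger Bridge](http://arxiv.org/abs/2411.14269)|null|Magnetic Resonance Imaging (MRI) is a multi-contrast imaging technique in which images of different contrasts share similar structural information. However, conventional diffusion models struggle to exploit this structural similarity effectively. Recently, the Schrödinger Bridge (SB), a nonlinear extension of the diffusion model, has been proposed to establish diffusion paths between arbitrary distributions, which allows guided priors to be incorporated. This study proposes an SB-based, multi-contrast image-guided reconstruction framework that establishes a diffusion bridge between the guiding and target image distributions. By using the guiding image together with data consistency during sampling, the target image is reconstructed more accurately. To better address structural differences between images, we introduce an inversion strategy from the field of image editing, termed $\mathbf{I}^2$SB-inversion. Experiments on paired T1 and T2-FLAIR datasets show that $\mathbf{I}^2$SB-inversion achieves an acceleration of up to 14.4x and outperforms existing methods in terms of both reconstruction accuracy and stability.||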
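|**2024-11-21**|[TaQ-DiT: Time-aware Quantization for Diffusion Transformers](http://arxiv.org/abs/2411.14172)|null|Transformer-based diffusion models, dubbed Diffusion Transformers (DiTs), have achieved state-of-the-art performance in image and video generation tasks. However, their large model size and slow inference speed limit their practical applications, calling for model compression methods such as quantization. Unfortunately, existing DiT quantization methods overlook (1) the impact of reconstruction and (2) the varying quantization sensitivities across different layers, which hinders their achievable performance. To tackle these issues, we propose innovative time-aware quantization for DiTs (TaQ-DiT). Specifically, (1) we observe a non-convergence issue when reconstructing weights and activations separately during quantization, and we introduce a joint reconstruction method to resolve it. (2) We find that Post-GELU activations are particularly sensitive to quantization because of their significant variability across different denoising steps, as well as extreme asymmetries and variations within each step. To address this, we propose time-variance-aware transformations to facilitate more effective quantization. Experimental results show that when quantizing DiT weights to 4 bits and activations to 8 bits (W4A8), our method significantly surpasses previous quantization methods.||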
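|**2024-11-21**|[RestorerID: Towards Tuning-Free Face Restoration with ID Preservation](http://arxiv.org/abs/2411.14125)|**[link](https://github.com/yingjiacheng/restorerid)**|Blind face restoration has made great progress in producing high-quality and lifelike images. Yet preserving identity information remains challenging, especially when the degradation is severe. Current reference-guided face restoration approaches require either face alignment or personalized test-time tuning, and are therefore either unfaithful to the original image or time-consuming. In this paper, we propose a tuning-free method named RestorerID that incorporates ID preservation during face restoration. RestorerID is a diffusion-model-based method that restores low-quality images with varying levels of degradation using a single reference image. To achieve this, we propose a unified framework that combines ID injection with a base blind face restoration model. In addition, we design a novel Face ID Rebalancing Adapter (FIR-Adapter) to tackle the content inconsistency and contour misalignment caused by information conflicts between the low-quality input and the reference image. Furthermore, by employing an Adaptive ID-Scale Adjusting strategy, RestorerID can produce high-quality restorations across various levels of degradation. Experimental results on the Celeb-Ref dataset and real-world scenarios show that RestorerID effectively delivers high-quality face restoration with ID preservation, achieving superior performance compared with test-time tuning approaches and other reference-guided ones. The code of RestorerID is available at https://github.com/YingJiacheng/RestorerID.||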
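|**2024-11-21**|[Point Cloud Resampling with Learnable Heat Diffusion](http://arxiv.org/abs/2411.14120)|null|Generative diffusion models have shown empirical success in point cloud resampling, generating a denser and more uniform distribution of points from sparse or noisy 3D point clouds by progressively refining noise into structure. However, existing diffusion models employ manually predefined schemes, which often fail to recover the underlying point cloud structure because of the rigid and disruptive nature of the geometric degradation. To address this issue, we propose a novel learnable heat diffusion framework for point cloud resampling, which directly parameterizes the marginal distribution of the forward process by learning the adaptive heat diffusion schedules and local filtering scales of the time-varying heat kernel, and consequently generates an adaptive conditional prior for the reverse process. Unlike previous diffusion models with a fixed prior, the adaptive conditional prior selectively preserves geometric features of the point cloud by minimizing a refined variational lower bound, guiding the points to evolve toward the underlying surface during the reverse process. Extensive experimental results show that the proposed point cloud resampling method achieves state-of-the-art performance in representative reconstruction tasks, including point cloud denoising and upsampling.||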
|**2024-11-19**|[Auto-Evaluation with Few Labels through Post-hoc Regression](http://arxiv.org/abs/2411.12665)|null|Continually evaluating large generative models provides a unique challenge. Often, human annotations are necessary to evaluate high-level properties of these models (e.g. in text or images). However, collecting human annotations of samples can be resource intensive, and using other machine learning systems to provide the annotations, or automatic evaluation, can introduce systematic errors into the evaluation. The Prediction Powered Inference (PPI) framework provides a way of leveraging both the statistical power of automatic evaluation and a small pool of labelled data to produce a low-variance, unbiased estimate of the quantity being evaluated for. However, most work on PPI considers a relatively sizable set of labelled samples, which is not always practical to obtain. To this end, we present two new PPI-based techniques that leverage robust regressors to produce even lower variance estimators in the few-label regime.||
|**2024-11-19**|[PoM: Efficient Image and Video Generation with the Polynomial Mixer](http://arxiv.org/abs/2411.12663)|**[link](https://github.com/davidpicard/homm)**|Diffusion models based on Multi-Head Attention (MHA) have become ubiquitous to generate high quality images and videos. However, encoding an image or a video as a sequence of patches results in costly attention patterns, as the requirements both in terms of memory and compute grow quadratically. To alleviate this problem, we propose a drop-in replacement for MHA called the Polynomial Mixer (PoM) that has the benefit of encoding the entire sequence into an explicit state. PoM has a linear complexity with respect to the number of tokens. This explicit state also allows us to generate frames in a sequential fashion, minimizing memory and compute requirement, while still being able to train in parallel. We show the Polynomial Mixer is a universal sequence-to-sequence approximator, just like regular MHA. We adapt several Diffusion Transformers (DiT) for generating images and videos with PoM replacing MHA, and we obtain high quality samples while using less computational resources. The code is available at https://github.com/davidpicard/HoMM.||
|**2024-11-19**|[Improving Controllability and Editability for Pretrained Text-to-Music Generation Models](http://arxiv.org/abs/2411.12641)|null|The field of AI-assisted music creation has made significant strides, yet existing systems often struggle to meet the demands of iterative and nuanced music production. These challenges include providing sufficient control over the generated content and allowing for flexible, precise edits. This thesis tackles these issues by introducing a series of advancements that progressively build upon each other, enhancing the controllability and editability of text-to-music generation models. First, we introduce Loop Copilot, a system that tries to address the need for iterative refinement in music creation. Loop Copilot leverages a large language model (LLM) to coordinate multiple specialised AI models, enabling users to generate and refine music interactively through a conversational interface. Central to this system is the Global Attribute Table, which records and maintains key musical attributes throughout the iterative process, ensuring that modifications at any stage preserve the overall coherence of the music. While Loop Copilot excels in orchestrating the music creation process, it does not directly address the need for detailed edits to the generated content. To overcome this limitation, MusicMagus is presented as a further solution for editing AI-generated music. MusicMagus introduces a zero-shot text-to-music editing approach that allows for the modification of specific musical attributes, such as genre, mood, and instrumentation, without the need for retraining. By manipulating the latent space within pre-trained diffusion models, MusicMagus ensures that these edits are stylistically coherent and that non-targeted attributes remain unchanged. This system is particularly effective in maintaining the structural integrity of the music during edits, but it encounters challenges with more complex and real-world audio scenarios. ...||
|**2024-11-19**|[Data Pruning in Generative Diffusion Models](http://arxiv.org/abs/2411.12523)|**[link](https://github.com/briqr/diffusion_data_pruning)**|Data pruning is the problem of identifying a core subset that is most beneficial to training and discarding the remainder. While pruning strategies are well studied for discriminative models like those used in classification, little research has gone into their application to generative models. Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets. In this work we aim to shed light on the accuracy of this statement, specifically answering the question of whether data pruning for generative diffusion models could have a positive impact. Contrary to intuition, we show that eliminating redundant or noisy data in large datasets is beneficial, particularly when done strategically. We experiment with several pruning methods, including recent state-of-the-art methods, and evaluate on the CelebA-HQ and ImageNet datasets. We demonstrate that a simple clustering method outperforms other sophisticated and computationally demanding methods. We further show how clustering can be leveraged to balance skewed datasets in an unsupervised manner, allowing fair sampling for underrepresented populations in the data distribution, which is a crucial problem in generative models.||
|**2024-11-19**|[Empirical Privacy Evaluations of Generative and Predictive Machine Learning Models -- A review and challenges for practice](http://arxiv.org/abs/2411.12451)|null|Synthetic data generators, when trained using privacy-preserving techniques like differential privacy, promise to produce synthetic data with formal privacy guarantees, facilitating the sharing of sensitive data. However, it is crucial to empirically assess the privacy risks associated with the generated synthetic data before deploying generative technologies. This paper outlines the key concepts and assumptions underlying empirical privacy evaluation in machine learning-based generative and predictive models. Then, this paper explores the practical challenges for privacy evaluations of generative models for use cases with millions of training records, such as data from statistical agencies and healthcare providers. Our findings indicate that methods designed to verify the correct operation of the training algorithm are effective for large datasets, but they often assume an adversary that is unrealistic in many scenarios. Based on the findings, we highlight a crucial trade-off between the computational feasibility of the evaluation and the level of realism of the assumed threat model. Finally, we conclude with ideas and suggestions for future research.||
|**2024-11-19**|[Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models](http://arxiv.org/abs/2411.12450)|null|Blind image restoration remains a significant challenge in low-level vision tasks. Recently, denoising diffusion models have shown remarkable performance in image synthesis. Guided diffusion models, leveraging the potent generative priors of pre-trained models along with a differential guidance loss, have achieved promising results in blind image restoration. However, these models typically consider data consistency solely in the spatial domain, often resulting in distorted image content. In this paper, we propose a novel frequency-aware guidance loss that can be integrated into various diffusion models in a plug-and-play manner. Our proposed guidance loss, based on the 2D discrete wavelet transform, simultaneously enforces content consistency in both the spatial and frequency domains. Experimental results demonstrate the effectiveness of our method in three blind restoration tasks: blind image deblurring, imaging through turbulence, and blind restoration for multiple degradations. Notably, our method achieves a significant improvement in PSNR score, with a remarkable enhancement of 3.72 dB in image deblurring. Moreover, our method exhibits superior capability in generating images with rich details and reduced distortion, leading to the best visual quality.||
|**2024-11-19**|[A general modeling and simulation framework for dynamic vehicle routing](http://arxiv.org/abs/2411.12406)|**[link](https://github.com/sztaki-hu/dvrpsim)**|In dynamic vehicle routing problems (DVRPs), some part of the information is revealed or changed on the fly, and the decision maker has the opportunity to re-plan the vehicle routes during their execution in response to the changes. Accordingly, the solution to a DVRP is a flexible policy rather than a set of fixed routes. A policy is essentially a problem-specific algorithm that is invoked at various decision points in the planning horizon and returns a decision according to the current state. Since DVRPs involve dynamic decision making, a simulator is an essential tool for dynamically testing and evaluating policies. Despite this, few tools are available that are specifically designed for this purpose. To fill this gap, we have developed a simulation framework that is suitable for a wide range of dynamic vehicle routing problems and allows different policies for a given problem to be tested dynamically. In this paper, we present the background of this simulation tool, for which we propose a general modeling framework suitable for formalizing DVRPs independently of simulation purposes. Our open-source simulation tool is already available, easy to use, and easily customizable, making it a useful tool for the research community.||
|**2024-11-19**|[Combinational Backdoor Attack against Customized Text-to-Image Models](http://arxiv.org/abs/2411.12389)|null|Recently, Text-to-Image (T2I) synthesis technology has made tremendous strides. Numerous representative T2I models have emerged and achieved promising application outcomes, such as DALL-E, Stable Diffusion, Imagen, etc. In practice, it has become increasingly popular for model developers to selectively adopt various pre-trained text encoders and conditional diffusion models from third-party platforms, integrating them to build customized (personalized) T2I models. However, such an adoption approach is vulnerable to backdoor attacks. In this work, we propose a Combinational Backdoor Attack against Customized T2I models (CBACT2I) targeting this application scenario. Different from previous backdoor attacks against T2I models, CBACT2I embeds the backdoor into the text encoder and the conditional diffusion model separately. The customized T2I model exhibits backdoor behaviors only when the backdoor text encoder is used in combination with the backdoor conditional diffusion model. These properties make CBACT2I more stealthy and flexible than prior backdoor attacks against T2I models. Extensive experiments demonstrate the effectiveness of CBACT2I with different backdoor triggers and different backdoor targets on the open-sourced Stable Diffusion model. This work reveals the backdoor vulnerabilities of customized T2I models and urges countermeasures to mitigate backdoor threats in this scenario.||
|**2024-11-19**|[Scalable and Effective Negative Sample Generation for Hyperedge Prediction](http://arxiv.org/abs/2411.12354)|null|Hyperedge prediction is crucial in hypergraph analysis for understanding complex multi-entity interactions in various web-based applications, including social networks and e-commerce systems. Traditional methods often face difficulties in generating high-quality negative samples due to the imbalance between positive and negative instances. To address this, we present the Scalable and Effective Negative Sample Generation for Hyperedge Prediction (SEHP) framework, which utilizes diffusion models to tackle these challenges. SEHP employs a boundary-aware loss function that iteratively refines negative samples, moving them closer to decision boundaries to improve classification performance. SEHP samples positive instances to form sub-hypergraphs for scalable batch processing. By using structural information from sub-hypergraphs as conditions within the diffusion process, SEHP effectively captures global patterns. To enhance efficiency, our approach operates directly in latent space, avoiding the need for discrete ID generation and resulting in significant speed improvements while preserving accuracy. Extensive experiments show that SEHP outperforms existing methods in accuracy, efficiency, and scalability, representing a substantial advancement in hyperedge prediction techniques. Our code is available here.||
|**2024-11-19**|[Diffusion Product Quantization](http://arxiv.org/abs/2411.12306)|null|In this work, we explore the quantization of diffusion models in extreme compression regimes to reduce model size while maintaining performance. We begin by investigating classical vector quantization but find that diffusion models are particularly susceptible to quantization error, with the codebook size limiting generation quality. To address this, we introduce product quantization, which offers improved reconstruction precision and larger capacity -- crucial for preserving the generative capabilities of diffusion models. Furthermore, we propose a method to compress the codebook by evaluating the importance of each vector and removing redundancy, ensuring that the model size remains within the desired range. We also introduce an end-to-end calibration approach that adjusts assignments during the forward pass and optimizes the codebook using the DDPM loss. By compressing the model to as low as 1 bit (resulting in an over 24-fold reduction in model size), we achieve a balance between compression and quality. We apply our compression method to the DiT model on ImageNet and consistently outperform other quantization approaches, demonstrating competitive generative performance.||
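|**2024-11-15**|[M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation](http://arxiv.org/abs/2411.10433)|**[link](https://github.com/oliverrensu/mvar)**|A new autoregressive image generation paradigm named VAR has recently emerged in computer vision. Unlike traditional next-pixel prediction, VAR structurally reformulates image generation as coarse-to-fine next-scale prediction. This paper shows that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling, which captures local spatial dependencies within each scale, and inter-scale modeling, which progressively models cross-scale relationships from coarse to fine scales. This decoupled structure allows VAR to be rebuilt in a more computationally efficient manner. Specifically, for intra-scale modeling, which is crucial for generating high-fidelity images, we retain the original bidirectional self-attention design to ensure comprehensive modeling; for inter-scale modeling, which semantically connects different scales but is computationally intensive, we apply linear-complexity mechanisms such as Mamba to substantially reduce computational overhead. We term this new framework M-VAR. Extensive experiments show that our method outperforms existing models in both image quality and generation speed. For example, our 1.5B model, with fewer parameters and faster inference, outperforms the largest VAR-d30-2B. Moreover, our largest model, M-VAR-d32, impressively achieves an FID of 1.78 on ImageNet 256×256, outperforming the prior state-of-the-art autoregressive models LlamaGen/VAR by 0.4/0.19 and the popular diffusion models LDM/DiT by 1.82/0.49, respectively. Code is available at https://github.com/OliverRensu/MVAR.||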
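|**2024-11-15**|[Mitigating Parameter Degeneracy using Joint Conditional Diffusion Model for WECC Composite Load Model in Power Systems](http://arxiv.org/abs/2411.10431)|null|Data-driven modeling of dynamical systems has received wide attention in recent years. Its inverse formulation, parameter estimation, aims to infer the inherent model parameters from observations. However, parameter degeneracy, where different combinations of parameters yield the same observable output, poses a critical barrier to identifying model parameters accurately and uniquely. In the context of the WECC composite load model (CLM) in power systems, utility practitioners have observed that CLM parameters carefully selected for one fault event may not perform satisfactorily in another. Here, we develop a joint conditional diffusion model-based inverse problem solver (JCDI), which incorporates a joint conditioning architecture that takes multi-event observations as simultaneous inputs to improve parameter generalizability. Simulation studies on the WECC CLM show that the proposed JCDI effectively reduces the uncertainty of degenerate parameters, lowering the parameter estimation error by 42.1% compared with a single-event learning scheme. This enables the model to predict power trajectories with high accuracy under different fault events, including electronic load tripping and motor stalling, outperforming standard deep reinforcement learning and supervised learning approaches. We anticipate that this work will help mitigate parameter degeneracy in system dynamics, providing a general parameter estimation framework across scientific domains.||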
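|**2024-11-15**|[Towards High-Fidelity 3D Portrait Generation with Rich Details by Cross-View Prior-Aware Diffusion](http://arxiv.org/abs/2411.10369)|null|Recent diffusion-based methods for single-image 3D portrait generation typically employ 2D diffusion models to provide multi-view knowledge, which is then distilled into 3D representations. However, these methods usually struggle to produce high-fidelity 3D models and frequently yield excessively blurred textures. We attribute this issue to insufficient consideration of cross-view consistency during the diffusion process, which results in significant disparities between different views and ultimately leads to blurred 3D representations. In this paper, we address this issue by comprehensively exploiting multi-view priors in both the conditioning and diffusion procedures to produce consistent, detail-rich portraits. From the conditioning standpoint, we propose a Hybrid Priors Diffusion model, which explicitly and implicitly incorporates multi-view priors as conditions to enhance the status consistency of the generated multi-view portraits. From the diffusion perspective, considering the significant impact of the diffusion noise distribution on detailed texture generation, we propose a Multi-View Noise Resampling Strategy integrated within the optimization process, which leverages cross-view priors to enhance representation consistency. Extensive experiments show that our method can produce 3D portraits with accurate geometry and rich details from a single image. The project page is at https://haoran-wei.github.io/Portrait-Diffusion.||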
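|**2024-11-15**|[Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding](http://arxiv.org/abs/2411.10329)|null|In recent years, text-to-image (T2I) generation models have made significant progress in generating high-quality images that align with text descriptions. However, these models also face the risk of unsafe generation, potentially producing harmful content that violates usage policies, such as explicit material. Existing safe generation methods typically focus on suppressing inappropriate content by erasing undesired concepts from visual representations, while neglecting to sanitize the textual representation. Although these methods reduce the risk of misuse to some extent, their robustness remains insufficient against adversarial attacks. Given that semantic consistency between input text and output image is a fundamental requirement of T2I models, we identify that textual representations (i.e., prompt embeddings) are likely the primary source of unsafe generation. To this end, we propose a vision-agnostic safe generation framework, Embedding Sanitizer (ES), which focuses on erasing inappropriate concepts from prompt embeddings and uses the sanitized embeddings to guide the model toward safe generation. ES is applied to the output of the text encoder as a plug-and-play module, enabling seamless integration with different T2I models as well as other safeguards. In addition, ES's unique scoring mechanism assigns a score to each token in the prompt to indicate its potential harmfulness, and it dynamically adjusts the sanitization intensity to balance defensive performance and generation quality. Through extensive evaluation on five prompt benchmarks, our approach achieves state-of-the-art robustness compared with nine baseline methods by sanitizing the source of unsafe generation (the prompt embedding). It significantly outperforms existing safeguards in interpretability and controllability while maintaining generation quality.||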
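|**2024-11-15**|[Modification Takes Courage: Seamless Image Stitching via Reference-Driven Inpainting](http://arxiv.org/abs/2411.10309)|**[link](https://github.com/yayoyo66/rdistitcher)**|Current image stitching methods often produce noticeable seams in challenging scenarios such as uneven hue and large parallax. To tackle this problem, we propose the Reference-Driven Inpainting Stitcher (RDIStitcher), which reformulates image fusion and rectangling as a reference-based inpainting model, incorporating a larger modification fusion area and stronger modification intensity than previous methods. Furthermore, we introduce a self-supervised model training method that enables RDIStitcher to be implemented without labeled data by fine-tuning a text-to-image (T2I) diffusion model. Recognizing the difficulty of assessing the quality of stitched images, we present metrics based on Multimodal Large Language Models (MLLMs), offering a new perspective on evaluating stitched image quality. Extensive experiments show that, compared with state-of-the-art (SOTA) methods, our method significantly enhances content coherence and seamless transitions in the stitched images. In the zero-shot experiments in particular, our method exhibits strong generalization capabilities. Code: https://github.com/yayoyo66/RDIStitcher||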
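|**2024-11-15**|[The Unreasonable Effectiveness of Guidance for Diffusion Models](http://arxiv.org/abs/2411.10257)|null|Guidance is an error-correcting technique used to improve the perceptual quality of images generated by diffusion models. Typically, the correction is achieved by linear extrapolation, using an auxiliary diffusion model that performs worse than the primary model. Using a 2D toy example, we show that this approach is highly beneficial when the auxiliary model exhibits errors similar to, but stronger than, those of the primary model. We verify this finding in higher dimensions, where we show that generative performance competitive with state-of-the-art guidance methods can be achieved when the auxiliary model differs from the primary one only by having stronger weight regularization. As an independent contribution, we investigate whether upweighting long-range spatial dependencies improves visual fidelity. The result is a novel guidance method, which we call sliding window guidance (SWG), that guides the primary model with itself by constraining its receptive field. Intriguingly, SWG aligns better with human preferences than state-of-the-art guidance methods while requiring neither training, architectural modifications, nor class conditioning. The code will be released.||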
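|**2024-11-15**|[Smooth transport map via diffusion process](http://arxiv.org/abs/2411.10235)|null|We extend the classical regularity theory of optimal transport to non-optimal transport maps generated by heat flow for perturbations of Gaussian measures. Consider measures on $\mathbb{R}^d$ of the form $ d\mu(x) = \exp\left(-\frac{|x|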
|**2024-11-15**|[ColorEdit: Training-free Image-Guided Color editing with diffusion model](http://arxiv.org/abs/2411.10232)|null|Text-to-image (T2I) diffusion models, with their impressive generative capabilities, have been adopted for image editing tasks, demonstrating remarkable efficacy. However, due to attention leakage and collision between the cross-attention map of the object and the new color attribute from the text prompt, text-guided image editing methods may fail to change the color of an object, resulting in a misalignment between the resulting image and the text prompt. In this paper, we conduct an in-depth analysis of the process of text-guided image synthesis and of the semantic information that different cross-attention blocks have learned. We observe that the visual representation of an object is determined in the up-block of the diffusion model in the early stage of the denoising process, and color adjustment can be achieved through value matrices alignment in the cross-attention layer. Based on our findings, we propose a straightforward, yet stable and effective, image-guided method to modify the color of an object without requiring any additional fine-tuning or training. Lastly, we present a benchmark dataset called COLORBENCH, the first benchmark to evaluate the performance of color change methods. Extensive experiments validate the effectiveness of our method in object-level color editing and show that it surpasses popular text-guided image editing approaches on both synthesized and real images.||
|**2024-11-15**|[Evaluating Text-to-Image Diffusion Models for Texturing Synthetic Data](http://arxiv.org/abs/2411.10164)|**[link](https://github.com/tlpss/diffusing-synthetic-data)**|Building generic robotic manipulation systems often requires large amounts of real-world data, which can be difficult to collect. Synthetic data generation offers a promising alternative, but limiting the sim-to-real gap requires significant engineering efforts. To reduce this engineering effort, we investigate the use of pretrained text-to-image diffusion models for texturing synthetic images and compare this approach with using random textures, a common domain randomization technique in synthetic data generation. We focus on generating object-centric representations, such as keypoints and segmentation masks, which are important for robotic manipulation and require precise annotations. We evaluate the efficacy of the texturing methods by training models on the synthetic data and measuring their performance on real-world datasets for three object categories: shoes, T-shirts, and mugs. Surprisingly, we find that texturing using a diffusion model performs on par with random textures, despite generating seemingly more realistic images. Our results suggest that, for now, using diffusion models for texturing does not benefit synthetic data generation for robotics. The code, data and trained models are available at https://github.com/tlpss/diffusing-synthetic-data.git.||
|**2024-11-15**|[Towards Multi-View Consistent Style Transfer with One-Step Diffusion via Vision Conditioning](http://arxiv.org/abs/2411.10130)|null|The stylization of 3D scenes is an increasingly attractive topic in 3D vision. Although image style transfer has been extensively researched with promising results, directly applying 2D style transfer methods to 3D scenes often fails to preserve the structural and multi-view properties of 3D environments, resulting in unpleasant distortions in images from different viewpoints. To address these issues, we leverage the remarkable generative prior of diffusion-based models and propose a novel style transfer method, OSDiffST, based on a pre-trained one-step diffusion model (i.e., SD-Turbo) for rendering diverse styles in multi-view images of 3D scenes. To efficiently adapt the pre-trained model for multi-view style transfer on small datasets, we introduce a vision condition module that extracts style information from the reference style image to serve as conditional input for the diffusion model, and we employ LoRA in the diffusion model for adaptation. Additionally, we consider color distribution alignment and structural similarity between the stylized and content images using two specific loss functions. As a result, our method effectively preserves the structural information and multi-view consistency in stylized images without any 3D information. Experiments show that our method surpasses other promising style transfer methods in synthesizing various styles for multi-view images of 3D scenes. Stylized images from different viewpoints generated by our method achieve superior visual quality, with better structural integrity and less distortion. The source code is available at https://github.com/YushenZuo/OSDiffST.||
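|**2024-11-14**|[A Bayesian Optimization Approach to Machine Translation Reranking](http://arxiv.org/abs/2411.09694)|null|Reranking a list of candidates from a machine translation system with an external scoring model and returning the highest-scoring candidate remains a simple and effective way to improve overall output quality. Translation scoring models continue to grow in size, with the best models now comparable in size to generation models. Reranking can therefore add substantial computational cost to the translation pipeline. In this work, we pose reranking as a Bayesian optimization (BayesOpt) problem. By strategically selecting which candidates to score based on a balance of exploration and exploitation, we show that it is possible to find the top-scoring candidate while scoring only a fraction of the candidate list. For instance, our method reaches the same CometKiwi score with only 70 scoring evaluations that the baseline system reaches with 180. We present a multi-fidelity setting for BayesOpt, in which candidates are first scored with a cheaper but noisier proxy scoring model, which further improves the cost-performance tradeoff when using smaller but well-trained distilled proxy scorers.||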
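|**2024-11-14**|[Golden Noise for Diffusion Models: A Learning Framework](http://arxiv.org/abs/2411.09502)|**[link](https://github.com/xie-lab-ml/golden-noise-for-diffusion-models)**|Text-to-image diffusion models are a popular paradigm that synthesizes personalized images from a text prompt and random Gaussian noise. While it has been observed that some noises are "golden noises" that achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework for obtaining them. To learn golden noises for diffusion sampling, we make three main contributions in this paper. First, we identify a new concept termed the "noise prompt", which aims to turn a random Gaussian noise into a golden noise by adding a small desirable perturbation derived from the text prompt. Following this concept, we formulate the "noise prompt learning" framework, which systematically learns "prompted" golden noise associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale "noise prompt dataset" (NPD) containing 100k pairs of random noises and golden noises with their associated text prompts. With the prepared NPD as the training dataset, we train a small "noise prompt network" (NPNet) that directly learns to transform a random noise into a golden noise. The learned golden noise perturbation can be regarded as a kind of prompt for noise, as it is rich in semantic information and tailored to the given text prompt. Third, our extensive experiments demonstrate the impressive effectiveness and generalization of NPNet in improving the quality of synthesized images across various diffusion models, including SDXL, DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and efficient controller that acts as a plug-and-play module with very limited additional inference and computational cost, as it simply provides a golden noise in place of a random noise without accessing the original pipeline.||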
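|**2024-11-14**|[Sparse Bayesian Generative Modeling for Compressive Sensing](http://arxiv.org/abs/2411.09483)|**[link](https://github.com/beneboeck/sparse-bayesian-gen-mod)**|This work addresses the fundamental linear inverse problem in compressive sensing (CS) by introducing a new type of regularizing generative prior. Our proposed method uses ideas from classical dictionary-based CS, in particular sparse Bayesian learning (SBL), to integrate a strong regularization toward sparse solutions. At the same time, by leveraging the notion of conditional Gaussianity, it also incorporates the adaptability of generative models to training data. However, unlike most state-of-the-art generative models, it is able to learn from a few compressed and noisy data samples and requires no optimization algorithm for solving the inverse problem. Additionally, similar to Dirichlet prior networks, our model parameterizes a conjugate prior, enabling its application to uncertainty quantification. We support our approach theoretically through the concept of variational inference and validate it empirically using different types of compressible signals.||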
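|**2024-11-14**|[DiffRoad: Realistic and Diverse Road Scenario Generation for Autonomous Vehicle Testing](http://arxiv.org/abs/2411.09451)|null|Generating realistic and diverse road scenarios is essential for testing and validating autonomous vehicles. However, owing to the complexity and variability of real-world road environments, creating authentic and varied scenarios for intelligent driving testing is challenging. In this paper, we propose DiffRoad, a novel diffusion model designed to produce controllable, high-fidelity 3D road scenarios. DiffRoad leverages the generative capabilities of diffusion models to synthesize road layouts from white noise through an inverse denoising process, preserving real-world spatial features. To improve the quality of generated scenarios, we design the Road-UNet architecture, which optimizes the balance between the backbone and skip connections to produce highly realistic scenarios. Furthermore, we introduce a road scenario evaluation module that screens for adequate and reasonable scenarios for intelligent driving testing using two key metrics: road continuity and road reasonableness. Experimental results on multiple real-world datasets show that DiffRoad generates realistic and smooth road structures while maintaining the original distribution. In addition, the generated scenarios can be converted to the OpenDRIVE format fully automatically, facilitating general-purpose autonomous vehicle simulation testing. DiffRoad provides a rich and diverse scenario library for large-scale autonomous vehicle testing and offers valuable insights for future infrastructure designs better suited to autonomous vehicles.||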
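|**2024-11-14**|[Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models](http://arxiv.org/abs/2411.09449)|null|Diffusion models have revitalized the image generation field, playing a crucial role in both academic research and artistic expression. With the emergence of new diffusion models, assessing the performance of text-to-image models has become increasingly important. Current metrics focus on directly matching the input text with the generated image, but because of cross-modal information asymmetry, this leads to unreliable or incomplete assessment results. Motivated by this, we introduce the Image Regeneration task in this study, which assesses text-to-image models by tasking them with generating an image according to a reference image. We use GPT4V to bridge the gap between the reference image and the text input of the text-to-image model, allowing the model to understand image content. The evaluation process is simplified because comparing the generated image with the reference image is straightforward. We introduce two regeneration datasets, one content-diverse and one style-diverse, to evaluate leading diffusion models. Additionally, we present the ImageRepainter framework, which improves content comprehension through MLLM-guided iterative generation and revision, thereby enhancing the quality of generated images. Our comprehensive experiments demonstrate the effectiveness of this framework in assessing the generative capabilities of models. By leveraging an MLLM, we show that a robust text-to-image model can produce images that more closely resemble the reference image.||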
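|**2024-11-14**|[A survey of probabilistic generative frameworks for molecular simulations](http://arxiv.org/abs/2411.09388)|**[link](https://github.com/shams-mehdi/aib9_openmm)**|Generative artificial intelligence is now a widely used tool in molecular science. Despite the popularity of probabilistic generative models, numerical experiments benchmarking their performance on molecular data are lacking. In this work, we introduce and explain several classes of generative models, broadly sorted into two categories: flow-based models and diffusion models. We select three representative models, Neural Spline Flows, Conditional Flow Matching, and Denoising Diffusion Probabilistic Models, and examine their accuracy, computational cost, and generation speed on datasets with tunable dimensionality, complexity, and modal asymmetry. Our findings are varied, with no single framework being best for all purposes. In a nutshell, (i) Neural Spline Flows do best at capturing mode asymmetry present in low-dimensional data, (ii) Conditional Flow Matching outperforms other models for high-dimensional data with low complexity, and (iii) Denoising Diffusion Probabilistic Models appear best suited to low-dimensional data with high complexity. Our datasets include a Gaussian mixture model and the dihedral torsion angle distribution of the Aib₉ peptide, generated via molecular dynamics simulation. We hope our taxonomy of probabilistic generative frameworks and numerical results can guide model selection for a wide range of molecular tasks.||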
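|**2024-11-14**|[Multi-scale Generative Modeling for Fast Sampling](http://arxiv.org/abs/2411.09356)|null|While working in the spatial domain can lead to ill-conditioned scores caused by power-law decay, recent advances in diffusion-based generative models have shown that transitioning to the wavelet domain offers a promising alternative. However, within the wavelet domain, we encounter unique challenges, especially the sparse representation of high-frequency coefficients, which deviates significantly from the Gaussian assumptions of the diffusion process. To this end, we propose a multi-scale generative model in the wavelet domain that employs distinct strategies for handling low- and high-frequency bands. In the wavelet domain, we apply score-based generative modeling with well-conditioned scores to the low-frequency bands, while using multi-scale generative adversarial learning for the high-frequency bands. Theoretical analysis and experimental results show that our model significantly improves performance and reduces the number of trainable parameters, sampling steps, and time.||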
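|**2024-11-14**|[ParaLBench: A Large-Scale Benchmark for Computational Paralinguistics over Acoustic Foundation Models](http://arxiv.org/abs/2411.09349)|null|Computational paralinguistics (ComParal) aims to develop algorithms and models that automatically detect, analyze, and interpret non-verbal information in speech communication, such as emotion, health state, age, and gender. Despite its rapid progress, it relies heavily on sophisticated models designed for specific paralinguistic tasks. As a result, the heterogeneity and diversity of ComParal models largely hinder their real-world application. In recent years, with the rise of self-supervised learning and the emergence of acoustic foundation models, developing more generic models that can efficiently perceive a plethora of paralinguistic information has become an active topic in speech processing. However, the field lacks a unified evaluation framework for fair and consistent performance comparison. To bridge this gap, we conduct a large-scale benchmark, ParaLBench, which concentrates on standardizing the evaluation process across diverse paralinguistic tasks, including critical aspects of affective computing such as emotion recognition and emotion dimension prediction, over different acoustic foundation models. This benchmark contains ten datasets with thirteen distinct paralinguistic tasks, covering short-, medium-, and long-term characteristics. Each task is carried out on 14 acoustic foundation models under a unified evaluation framework, which allows unbiased methodological comparison and offers a reliable reference for the ComParal community. Based on the insights gained from ParaLBench, we also point out potential research directions, such as cross-corpus generalizability, to advance future ComParal research. The code associated with this study will be made public to improve the transparency and replicability of this work for the benefit of later researchers.||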
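|**2024-11-14**|[Approximate Probabilistic Inference for Time-Series Data A Robust Latent Gaussian Model With Temporal Awareness](http://arxiv.org/abs/2411.09312)|null|Developing robust generative models for highly varied non-stationary time-series data is a complex and important problem. Traditional models for time-series prediction, such as long short-term memory networks (LSTMs), are inefficient and generalize poorly because they cannot capture complex temporal relationships. In this paper, we present a probabilistic generative model that can be trained to capture temporal information and that is robust to data errors. We call it the Time Deep Latent Gaussian Model (tDLGM). Its novel architecture is inspired by the Deep Latent Gaussian Model (DLGM). Our model is trained to minimize a loss function based on the negative log-likelihood. One contributing factor to tDLGM's robustness is our regularizer, which accounts for data trends. Experiments show that tDLGM is able to reconstruct and generate complex time-series data and that it is robust to noise and faulty data.||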
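|**2024-11-14**|[LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion Space](http://arxiv.org/abs/2411.09268)|null|Existing one-shot talking head generation models have made progress in coarse-grained emotion editing, but fine-grained emotion editing models with high interpretability are still lacking. We argue that for an approach to be considered fine-grained, it needs to provide clear definitions and sufficiently detailed differentiation. We present LES-Talker, a novel one-shot talking head generation model with high interpretability, to achieve fine-grained emotion editing across emotion types, emotion levels, and facial units. We propose a Linear Emotion Space (LES) definition based on facial action units that characterizes emotion transformations as vector transformations. We design the Cross-Dimension Attention Net (CDAN) to deeply mine the correlation between the LES representation and the 3D model representation. By mining multiple relationships across different feature and structure dimensions, we enable the LES representation to guide controllable deformation of the 3D model. To adapt multimodal data with deviations to the LES and enhance visual quality, we use specialized network design and training strategies. Experiments show that our method provides high visual quality along with multilevel, interpretable fine-grained emotion editing, outperforming mainstream methods.||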
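|**2024-11-12**|[Scaling Properties of Diffusion Models for Perceptual Tasks](http://arxiv.org/abs/2411.08034)|null|In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm not only for generation but also for visual perception tasks. We unify tasks such as depth estimation, optical flow, and segmentation under an image-to-image translation framework, and show how diffusion models benefit from scaling training and test-time compute for these perception tasks. Through careful analysis of these scaling behaviors, we present various techniques for efficiently training diffusion models for visual perception tasks. Our models achieve performance comparable to or better than state-of-the-art methods with significantly less data and compute. To use our code and models, see https://scaling-diffusion-perception.github.io.||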
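|**2024-11-12**|[GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation](http://arxiv.org/abs/2411.08033)|null|Although 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges and offers scalable, high-quality 3D generation with an interactive point cloud-structured latent space. Our framework employs a variational autoencoder (VAE) that takes multi-view posed RGB-D (depth)-N (normal) renderings as input and uses a unique latent space design that preserves 3D shape information, combined with a cascaded latent diffusion model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, accepting point clouds, text descriptions, and single- or multi-view images as input. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing methods in both text- and image-conditioned 3D generation.||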
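|**2024-11-12**|[Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings](http://arxiv.org/abs/2411.08017)|**[link](https://github.com/autodeskailab/wala)**|Large-scale 3D generative models require substantial computational resources yet often fall short in capturing fine details and complex geometries at high resolutions. We attribute this limitation to the inefficiency of current representations, which lack the compactness that effective generative modeling requires. To address this, we introduce a novel approach called Wavelet Latent Diffusion (WaLa) that encodes 3D shapes into wavelet-based, compact latent encodings. Specifically, we compress a $256^3$ signed distance field into a $12^3 \times 4$ latent grid, achieving an impressive 2427x compression ratio with minimal loss of detail. This high level of compression allows our method to train large-scale generative networks efficiently without increasing inference time. Our models, both conditional and unconditional, contain approximately one billion parameters and successfully generate high-quality 3D shapes at $256^3$ resolution. Moreover, WaLa offers rapid inference, producing shapes within two to four seconds depending on the condition, despite the model's scale. We demonstrate state-of-the-art performance across multiple datasets, with significant improvements in generation quality, diversity, and computational efficiency. We open-source our code and, to the best of our knowledge, release the largest pretrained 3D generative models across different modalities.||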
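|**2024-11-12**|[JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation](http://arxiv.org/abs/2411.07975)|**[link](https://github.com/deepseek-ai/janus)**|We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding is that rectified flow can be trained straightforwardly within the large language model framework, with no need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves performance comparable or superior to specialized models in their respective domains, while significantly outperforming existing unified approaches on standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.||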
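|**2024-11-12**|[Diverse capability and scaling of diffusion and auto-regressive models when learning abstract rules](http://arxiv.org/abs/2411.07873)|null|Humans excel at discovering regular structures from limited samples and applying inferred rules to novel settings. We investigate whether modern generative models can similarly learn underlying rules from finite samples and perform reasoning through conditional sampling. Inspired by Raven's Progressive Matrices, we designed the GenRAVEN dataset, in which each sample consists of three rows and one of 40 relational rules governing object position, number, or attributes applies to all rows. We encode samples as integer arrays to focus on rule learning and train generative models to learn the data distribution. We compare two generative model families: diffusion models (EDM, DiT, SiT) and autoregressive models (GPT2, Mamba). We evaluate their ability to generate structurally consistent samples and to perform panel completion via unconditional and conditional sampling. We find that diffusion models excel at unconditional generation, producing more novel and consistent samples from scratch and memorizing less, but perform worse at panel completion, even with advanced conditional sampling methods. Conversely, autoregressive models excel at completing missing panels in a rule-consistent manner but generate less consistent samples unconditionally. We observe diverse data scaling behaviors: for both model families, rule learning emerges at a certain dataset size, around 1,000 examples per rule. With more training data, diffusion models improve in both unconditional and conditional generation. For autoregressive models, however, panel completion improves with more training data while unconditional generation consistency declines. Our findings highlight the complementary capabilities and limitations of diffusion and autoregressive models in rule learning and reasoning tasks, suggesting avenues for further research into their mechanisms and potential for human-like reasoning.||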
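|**2024-11-12**|[Novel View Synthesis with Pixel-Space Diffusion Models](http://arxiv.org/abs/2411.07765)|null|Synthesizing a novel view from a single input image is a challenging task. Traditionally, this task was approached by estimating scene depth, warping, and inpainting, with machine learning models enabling parts of the pipeline. More recently, generative models have been increasingly employed in novel view synthesis (NVS), often encompassing the entire end-to-end system. In this work, we adapt a modern diffusion model architecture for end-to-end NVS in the pixel space, substantially outperforming the previous state of the art (SOTA). We explore different ways to encode geometric information into the network. Our experiments show that while these methods may enhance performance, their impact is minor compared with using improved generative models. Moreover, we introduce a novel NVS training scheme that uses single-view datasets, capitalizing on their relative abundance compared with multi-view datasets. This leads to improved generalization to scenes with out-of-domain content.||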
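|**2024-11-12**|[Nanosecond nanothermometry in an electron microscope](http://arxiv.org/abs/2411.07764)|null|Thermal transport in nanostructures plays a critical role in modern technologies. As devices shrink, techniques that can measure thermal properties at nanometer and nanosecond scales are increasingly needed to capture transient, out-of-equilibrium phenomena. We present a novel pump-probe photon-electron method in a scanning transmission electron microscope (STEM) to map temperature dynamics with unprecedented spatial and temporal resolution. By combining focused laser-induced heating with synchronized time-resolved monochromated electron energy-loss spectroscopy (EELS), we track phonon, exciton, and plasmon signals in various materials, including silicon nitride, aluminum thin film, and transition metal dichalcogenides. Our results demonstrate the technique's ability to follow temperature changes at both nanometer and nanosecond scales. The experimental data closely match theoretical heat diffusion models, confirming the method's validity. This approach opens new opportunities to investigate transient thermal phenomena in nanoscale materials, offering valuable insights for applications in thermoelectric devices and nanoelectronics.||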
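|**2024-11-12**|[LapGSR: Laplacian Reconstructive Network for Guided Thermal Super-Resolution](http://arxiv.org/abs/2411.07750)|null|In recent years, multimodal data fusion has been widely studied and applied in areas such as robotics, gesture recognition, and autonomous navigation. High-quality visual sensors are expensive, however, and consumer-grade sensors produce low-resolution images. Researchers have developed methods that combine RGB color images with non-visual data, such as thermal images, to overcome this limitation and improve resolution. Fusing multiple modalities to produce visually appealing, high-resolution images often requires dense models with millions of parameters and a heavy computational load, which is commonly attributed to the intricate architecture of the model. We propose LapGSR, a multimodal, lightweight generative model incorporating Laplacian image pyramids for guided thermal super-resolution. This approach uses a Laplacian pyramid on RGB color images to extract vital edge information, which is then combined with a pixel loss and an adversarial loss to bypass heavy feature-map computation in the higher layers of the model. LapGSR preserves the spatial and structural details of the image while also being efficient and compact. This results in a model with significantly fewer parameters than other state-of-the-art models while demonstrating excellent results on two cross-domain datasets, ULB17-VT and VGTSR.||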
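|**2024-11-12**|[Evaluating the Generation of Spatial Relations in Text and Image Generative Models](http://arxiv.org/abs/2411.07664)|null|Understanding spatial relations is a crucial cognitive ability for both humans and AI. While current research has predominantly focused on benchmarking text-to-image (T2I) models, we propose a more comprehensive evaluation that includes both T2I models and large language models (LLMs). As spatial relations are naturally understood in a visual space, we develop an approach to convert LLM outputs into images, allowing us to evaluate both T2I models and LLMs visually. We examine the spatial relation understanding of 8 prominent generative models (3 T2I models and 5 LLMs) on a set of 10 common prepositions and assess the feasibility of automatic evaluation methods. Surprisingly, we find that T2I models perform poorly on spatial relation understanding despite their general image-generation abilities. Even more surprisingly, our results show that LLMs are significantly more accurate than T2I models at generating spatial relations, despite being trained primarily on textual data. We examine the reasons for model failures and highlight gaps that can be filled to enable more spatially faithful generations.||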
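|**2024-11-12**|[Leveraging Previous Steps: A Training-free Fast Solver for Flow Diffusion](http://arxiv.org/abs/2411.07627)|null|Flow diffusion models (FDMs) have recently shown potential in generation tasks owing to their high generation quality. However, the current ordinary differential equation (ODE) solvers for FDMs, e.g., the Euler solver, still suffer from slow generation because ODE solvers need a large number of function evaluations (NFE) to maintain high-quality generation. In this paper, we propose a novel training-free flow solver that reduces NFE while maintaining high-quality generation. The key insight of the flow solver is to leverage previous steps to reduce NFE, where a cache is created to reuse the results of previous steps. Specifically, a Taylor expansion is first used to approximate the ODE. To calculate the high-order derivatives of the Taylor expansion, the flow solver approximates them using previous steps and polynomial interpolation, where the order we can approximate equals the number of previous steps we cache. We also prove that the flow solver has a smaller approximation error and faster generation speed. Experimental results on CIFAR-10, CelebA-HQ, LSUN-Bedroom, LSUN-Church, ImageNet, and real text-to-image generation demonstrate the efficiency of the flow solver. Specifically, on CIFAR-10 and LSUN-Church with NFE=10, the flow solver improves FID-30K from 13.79 to 6.75 and from 46.64 to 19.49, respectively.||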
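|**2024-11-08**|[StdGEN: Semantic-Decomposed 3D Character Generation from Single Images](http://arxiv.org/abs/2411.05738)|null|We present StdGEN, an innovative pipeline for generating semantically decomposed, high-quality 3D characters from single images, enabling broad applications in virtual reality, gaming, and filmmaking. Unlike previous methods, which struggle with limited decomposability, unsatisfactory quality, and long optimization times, StdGEN features decomposability, effectiveness, and efficiency; that is, it generates intricately detailed 3D characters with separated semantic components such as the body, clothes, and hair in three minutes. At the core of StdGEN is our proposed Semantic-aware Large Reconstruction Model (S-LRM), a transformer-based generalizable model that jointly reconstructs geometry, color, and semantics from multi-view images in a feed-forward manner. We introduce a differentiable multi-layer semantic surface extraction scheme to acquire meshes from the hybrid implicit fields reconstructed by S-LRM. Additionally, the pipeline integrates a specialized efficient multi-view diffusion model and an iterative multi-layer surface refinement module to facilitate high-quality, decomposable 3D character generation. Extensive experiments show that we achieve state-of-the-art performance in 3D anime character generation, surpassing existing baselines by a significant margin in geometry, texture, and decomposability. StdGEN offers ready-to-use semantically decomposed 3D characters and enables flexible customization for a wide range of applications. Project page: https://stdgen.github.io||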
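|**2024-11-08**|[Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models](http://arxiv.org/abs/2411.05706)|null|Evaluating the quality of automatically generated image descriptions is a complex task that requires metrics capturing various dimensions, such as grammaticality, coverage, accuracy, and truthfulness. Although human evaluation provides valuable insights, its cost and time-consuming nature pose limitations. Existing automated metrics such as BLEU, ROUGE, METEOR, and CIDEr attempt to fill this gap, but they often correlate weakly with human judgment. To address this challenge, we propose a novel evaluation framework called Image2Text2Image, which leverages diffusion models such as Stable Diffusion or DALL-E for text-to-image generation. In the Image2Text2Image framework, an input image is first processed by the selected image captioning model (the model under evaluation) to generate a textual description. Using this generated description, a diffusion model then creates a new image. By comparing features extracted from the original and generated images, we measure their similarity using a designated similarity metric. A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies, revealing potential weaknesses in the model's performance. Notably, our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models. Extensive experiments and human evaluations validate the efficacy of our proposed Image2Text2Image evaluation framework. The code and dataset will be published to support further research in the community.||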
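|**2024-11-08**|[Improving Molecular Graph Generation with Flow Matching and Optimal Transport](http://arxiv.org/abs/2411.05676)|null|Generating molecular graphs is crucial in drug design and discovery but remains challenging because of the complex interdependencies between nodes and edges. While diffusion models have demonstrated their potential in molecular graph design, they often suffer from unstable training and inefficient sampling. To enhance generation performance and training stability, we propose GGFlow, a discrete flow matching generative model incorporating optimal transport for molecular graph generation, which includes an edge-augmented graph transformer to enable direct communication among chemical bonds. Additionally, GGFlow introduces a novel goal-guided generation framework to control the model's generative trajectory, aiming to design novel molecular structures with desired properties. GGFlow demonstrates superior performance on both unconditional and conditional molecule generation tasks, outperforming existing baselines and underscoring its effectiveness and potential for wider application.||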
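|**2024-11-08**|[Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion](http://arxiv.org/abs/2411.05544)|null|Lifelong few-shot customization of text-to-image diffusion aims to continually generalize existing models to new tasks with minimal data while preserving old knowledge. Current customized diffusion models excel at few-shot tasks but suffer from catastrophic forgetting in lifelong generation. In this study, we identify and categorize the catastrophic forgetting problem into two types: relevant concept forgetting and previous concept forgetting. To address these challenges, we first devise a data-free knowledge distillation strategy to tackle relevant concept forgetting. Unlike existing methods that rely on additional real data or offline replay of original concept data, our approach enables on-the-fly knowledge distillation that retains previous concepts while learning new ones, without accessing any previous data. Second, we develop an In-Context Generation (ICGen) paradigm that allows the diffusion model to be conditioned on the input vision context, which facilitates few-shot generation and mitigates previous concept forgetting. Extensive experiments show that the proposed Lifelong Few-Shot Diffusion (LFS-Diffusion) method can produce high-quality and accurate images while maintaining previously learned knowledge.||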
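|**2024-11-08**|[Improving image synthesis with diffusion-negative sampling](http://arxiv.org/abs/2411.05473)|null|For image generation with diffusion models (DMs), a negative prompt n can be used to complement the text prompt p, helping define properties not desired in the synthesized image. While this improves prompt adherence and image quality, finding good negative prompts is challenging. We argue that this is due to a semantic gap between humans and DMs, which makes negative prompts that are effective for DMs appear unintuitive to humans. To bridge this gap, we propose a new diffusion-negative prompting (DNP) strategy. DNP is based on a new procedure, denoted diffusion-negative sampling (DNS), for sampling the images least compliant with p under the distribution of the DM. Given p, one such image is sampled and then translated into natural language by the user or a captioning model to produce the negative prompt n*. The pair (p, n*) is finally used to prompt the DM. DNS is straightforward to implement and requires no training. Experiments and human evaluations show that DNP performs well both quantitatively and qualitatively and can be easily combined with several DM variants.||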
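|**2024-11-08**|[Bridging the Gap between Learning and Inference for Diffusion-Based Molecule Generation](http://arxiv.org/abs/2411.05472)|**[link](https://github.com/hughnew/gapdiff)**|The efficacy of diffusion models in generating a spectrum of data modalities, including images, text, and videos, has spurred inquiries into their utility in molecular generation, yielding significant advances in the field. However, molecular generation with diffusion models involves multiple autoregressive steps over a finite time horizon, which inherently leads to exposure bias. To address exposure bias, we propose a training framework named GapDiff. The core idea of GapDiff is to probabilistically use model-predicted conformations as ground truth during training, aiming to mitigate the data distribution disparity between training and inference and thereby enhance the affinity of generated molecules. We conduct experiments using a 3D molecular generation model on the CrossDocked2020 dataset, and the Vina energy and diversity results demonstrate the potency of our framework, with generated molecules showing superior affinity. The GapDiff code is available at https://github.com/HUGHNew/gapdiff.||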
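|**2024-11-08**|[IntellBot: Retrieval Augmented LLM Chatbot for Cyber Threat Knowledge Delivery](http://arxiv.org/abs/2411.05442)|null|In the rapidly evolving landscape of cybersecurity, intelligent chatbots are gaining prominence. Artificial intelligence, machine learning, and natural language processing empower these chatbots to handle user queries and deliver threat intelligence. This helps make cybersecurity knowledge readily accessible to both professionals and the public. Traditional rule-based chatbots often lack flexibility and struggle to adapt to user interactions. In contrast, large language model-based chatbots offer contextually relevant information across multiple domains and adapt to evolving conversational contexts. In this work, we develop IntellBot, an advanced cybersecurity chatbot built on top of cutting-edge technologies such as large language models and Langchain, alongside a retrieval-augmented generation model, to deliver superior capabilities. The chatbot gathers information from diverse data sources to create a comprehensive knowledge base covering known vulnerabilities, recent cyberattacks, and emerging threats. It delivers tailored responses, serving as a primary hub for cybersecurity insights. By providing instant access to relevant information and resources, IntellBot enhances threat intelligence, incident response, and overall security posture, saving time and equipping users with knowledge of cybersecurity best practices. Moreover, we analyzed the performance of our assistant using a two-stage evaluation strategy. We achieved a BERT score above 0.8 via an indirect approach and cosine similarity scores ranging from 0.8 to 1, which affirms the accuracy of our assistant. Additionally, we used RAGAS to evaluate the RAG model, and all evaluation metrics consistently produced scores above 0.77, highlighting the efficacy of our system.||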
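|**2024-11-08**|[RED: Residual Estimation Diffusion for Low-Dose PET Sinogram Reconstruction](http://arxiv.org/abs/2411.05354)|null|Diffusion models have recently demonstrated exceptional performance in generative tasks across various fields. In positron emission tomography (PET), reducing the tracer dose leads to information loss in sinograms. Using diffusion models to reconstruct the missing information can improve imaging quality. Traditional diffusion models effectively use Gaussian noise for image reconstruction. In low-dose PET reconstruction, however, Gaussian noise can worsen the already sparse data by introducing artifacts and inconsistencies. To address this issue, we propose a diffusion model named residual estimation diffusion (RED). From the perspective of the diffusion mechanism, RED uses the residual between sinograms in place of Gaussian noise in the diffusion process, setting the low-dose and full-dose sinograms as the starting point and endpoint of reconstruction, respectively. This mechanism helps preserve the original information in the low-dose sinogram, thereby enhancing reconstruction reliability. From the perspective of data consistency, RED introduces a drift correction strategy to reduce the accumulated prediction errors during the reverse process. Calibrating the intermediate results of reverse iterations helps maintain data consistency and enhances the stability of the reconstruction process. Experimental results show that RED effectively improves the quality of low-dose sinograms as well as the reconstruction results. The code is available at https://github.com/yqx7150/RED.||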
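|**2024-11-08**|[Social balance in directed networks](http://arxiv.org/abs/2411.05327)|null|Social networks inherently exhibit complex relationships that can be positive or negative, as well as directional. Understanding balance in these networks is crucial for unraveling social dynamics, yet traditional theories struggle to incorporate directed interactions. This paper presents a comprehensive roadmap for understanding balance in signed directed networks, extending traditional balance theory to account for directed interactions. Balance is indicated by the enrichment of higher-order patterns such as triads compared with an adequate null model, in which the network is randomized while some key aspects are preserved. Finding an appropriate null model is challenging even without considering directionality, and directionality greatly expands the space of potential null models. Recently, it has been shown that in the undirected case both the network topology and the signed degrees are key factors to preserve. Therefore, we introduce a maximally constrained null model that preserves the directed topology as well as node-level features given by the signed unidirectional, reciprocated, and conflicting degrees. Our null model is based on the maximum-entropy principle and reveals consistent patterns across large-scale social networks. We also consider directed generalizations of balance theory and find that the observed patterns align well with two proposed directed notions of strong balance. Our approach not only sheds light on balance in signed directed networks but can also serve as a starting point for generative models of signed directed social networks, advancing our understanding of complex social systems and their dynamics.||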
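|**2024-11-08**|[Differentiable Calibration of Inexact Stochastic Simulation Models via Kernel Score Minimization](http://arxiv.org/abs/2411.05315)|null|Stochastic simulation models are generative models that mimic complex systems to support decision-making. The reliability of these models heavily depends on well-calibrated input model parameters. However, in many practical scenarios, only output-level data are available for learning the input model parameters, which is challenging because the likelihood function of a stochastic simulation model is usually intractable. Moreover, stochastic simulation models are frequently inexact, with discrepancies between the model and the target system. No existing method can effectively learn and quantify the uncertainty of input parameters using only output-level data. In this paper, we propose to learn differentiable input parameters of stochastic simulation models from output-level data via kernel score minimization with stochastic gradient descent. We quantify the uncertainties of the learned input parameters using a frequentist confidence set procedure based on a new asymptotic normality result that accounts for model inexactness. The proposed method is evaluated on exact and inexact G/G/1 queueing models.||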
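|**2024-11-07**|[SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models](http://arxiv.org/abs/2411.05007)|**[link](https://github.com/mit-han-lab/deepcompressor)**|Diffusion models have proven highly effective at generating high-quality images. However, as these models grow larger, they require more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, and conventional post-training quantization methods for large language models, such as smoothing, become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Unlike smoothing, which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to handle the weight outliers via singular value decomposition (SVD). This process eases quantization on both sides. However, naively running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine, Nunchaku, that fuses the kernels of the low-rank branch into those of the low-bit branch to cut redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without re-quantization. Extensive experiments on SDXL, PixArt-$\Sigma$, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage of the 12B FLUX.1 model by 3.5x, achieving a 3.0x speedup over the 4-bit weight-only quantized baseline on a 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.||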
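|**2024-11-07**|[ProEdit: Simple Progression is All You Need for High-Quality 3D Scene Editing](http://arxiv.org/abs/2411.05006)|null|This paper proposes ProEdit, a simple yet effective framework for high-quality 3D scene editing guided by diffusion distillation in a novel progressive manner. Inspired by the crucial observation that multi-view inconsistency is rooted in the diffusion model's large feasible output space (FOS), our framework controls the size of the FOS and reduces inconsistency by decomposing the overall editing task into several subtasks, which are then executed progressively on the scene. Within this framework, we design a difficulty-aware subtask decomposition scheduler and an adaptive 3D Gaussian splatting (3DGS) training strategy, ensuring high quality and efficiency in performing each subtask. Extensive evaluation shows that ProEdit achieves state-of-the-art results in various scenes and challenging editing tasks, all through a simple framework without any expensive or sophisticated add-ons such as distillation losses, components, or training procedures. Notably, ProEdit also provides a new way to control, preview, and select the "strength" of the edit operation during the editing process.||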
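|**2024-11-07**|[Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models](http://arxiv.org/abs/2411.05005)|null|Beyond high-fidelity image synthesis, diffusion models have recently shown promising results in dense visual perception tasks. However, most existing work treats diffusion models as a standalone component for perception tasks, employing them either solely for off-the-shelf data augmentation or as mere feature extractors. In contrast to these isolated and thus sub-optimal efforts, we introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, that can simultaneously handle both multi-modal data generation and dense visual perception through a unique exploitation of the diffusion-denoising process. Within this framework, we further enhance discriminative visual perception via multi-modal generation, by using the denoising network to create multi-modal data that mirror the distribution of the original training set. Importantly, Diff-2-in-1 optimizes the use of the created diverse and faithful data by leveraging a novel self-improving learning mechanism. Comprehensive experimental evaluations validate the effectiveness of our framework, showing consistent performance improvements across various discriminative backbones and high-quality multi-modal data generation characterized by both realism and usefulness.||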
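|**2024-11-07**|[ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning](http://arxiv.org/abs/2411.05003)|null|Recently, breakthroughs in video modeling have made it possible to control camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that were not generated by a video model. In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. Our method allows us to re-generate the reference video, with all its existing scene motion, from vastly different angles and with cinematic camera motion. Notably, using our method we can also plausibly infer parts of the scene that were not observable in the reference video. Our method works by (1) generating a noisy anchor video with a new camera trajectory using multiview diffusion models or depth-based point cloud rendering, and then (2) regenerating the anchor video into a clean and temporally consistent reangled video using our proposed masked video fine-tuning technique.||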
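|**2024-11-07**|[SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation](http://arxiv.org/abs/2411.04989)|null|Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided, offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model, without fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines in visual quality and motion fidelity while being competitive with supervised models.||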
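|**2024-11-07**|[Few-Shot Task Learning through Inverse Generative Modeling](http://arxiv.org/abs/2411.04987)|null|Learning the intents of an agent, defined by its goals or motion style, is often extremely challenging from just a few examples. We refer to this problem as task concept learning and present our approach, Few-Shot Task Learning through Inverse Generative Modeling (FTL-IGM), which learns new task concepts by leveraging invertible neural generative models. The core idea is to pretrain a generative model on a set of basic concepts and their demonstrations. Then, given a few demonstrations of a new concept (such as a new goal or a new action), our method learns the underlying concept through backpropagation without updating the model weights, thanks to the invertibility of the generative model. We evaluate our method in five domains: object rearrangement, goal-oriented navigation, motion captioning of human actions, autonomous driving, and real-world table-top manipulation. Our experimental results show that, via the pretrained generative model, we successfully learn novel concepts and generate agent plans or motion corresponding to these concepts (1) in unseen environments and (2) in composition with training concepts.||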
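|**2024-11-07**|[Uncovering Hidden Subspaces in Video Diffusion Models Using Re-Identification](http://arxiv.org/abs/2411.04956)|null|Latent video diffusion models can easily deceive casual observers and domain experts alike thanks to the image quality and temporal consistency of their output. Beyond entertainment, this creates opportunities for safe data sharing of fully synthetic datasets, which are crucial in healthcare as well as other domains that rely on sensitive personal information. However, the privacy concerns of this approach have not been fully addressed, and models trained on synthetic data for specific downstream tasks still perform worse than those trained on real data. This discrepancy may be partly because the sampling space is a subspace of the training videos, effectively reducing the training data size for downstream models. Additionally, the reduced temporal consistency when generating long videos could be a contributing factor. In this paper, we first show that training privacy-preserving models in latent space is computationally more efficient and generalizes better. Furthermore, to investigate the factors behind degraded downstream performance, we propose to use a re-identification model previously employed as a privacy-preservation filter. We demonstrate that it is sufficient to train this model on the latent space of the video generator. Subsequently, we use these models to evaluate the subspace covered by synthetic video datasets, thus introducing a new way to measure the faithfulness of generative machine learning models. We focus on a specific application in healthcare, echocardiography, to illustrate the effectiveness of our novel methods. Our findings indicate that only up to 30.8% of the training videos are learned by latent video diffusion models, which could explain the lack of performance when downstream tasks are trained on synthetic data.||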
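|**2024-11-07**|[DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion](http://arxiv.org/abs/2411.04928)|null|In this paper, we introduce DimensionX, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. While recent video diffusion models have shown remarkable success in producing vivid visuals, they face limitations in directly recovering 3D/4D scenes because of limited spatial and temporal controllability during generation. To overcome this, we propose ST-Director, which decouples spatial and temporal factors in video diffusion by learning dimension-aware LoRAs from dimension-variant data. This controllable video diffusion approach enables precise manipulation of spatial structure and temporal dynamics, allowing us to reconstruct both 3D and 4D representations from sequential frames by combining the spatial and temporal dimensions. Additionally, to bridge the gap between generated videos and real-world scenes, we introduce a trajectory-aware mechanism for 3D generation and an identity-preserving denoising strategy for 4D generation. Extensive experiments on various real-world and synthetic datasets show that DimensionX achieves superior results in controllable video generation, as well as in 3D and 4D scene generation, compared with previous methods.||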
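|**2024-11-07**|[Stem-OB: Generalizable Visual Imitation Learning with Stem-Like Convergent Observation through Diffusion Inversion](http://arxiv.org/abs/2411.04919)|**[link](https://github.com/hukz18/Stem-Ob-Code)**|Visual imitation learning methods demonstrate strong performance, yet they lack generalization when faced with visual input perturbations, including variations in lighting and textures, which impedes their real-world application. We propose Stem-OB, which uses pretrained image diffusion models to suppress low-level visual differences while maintaining high-level scene structures. This image inversion process is akin to transforming the observation into a shared representation from which other observations stem, with extraneous details removed. Stem-OB contrasts with data augmentation approaches in that it is robust to various unspecified appearance changes without the need for additional training. Our method is a simple yet highly effective plug-and-play solution. Empirical results confirm the effectiveness of our approach in simulated tasks and show an exceptionally significant improvement in real-world applications, with an average increase of 22.2% in success rates compared with the best baseline. See https://hukz18.github.io/Stem-Ob/ for more information.||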
|**2024-11-07**|[GASE: Generatively Augmented Sentence Encoding](http://arxiv.org/abs/2411.04914)|null|We propose an approach to enhance sentence embeddings by applying generative text models for data augmentation at inference time. Unlike conventional data augmentation that utilises synthetic training data, our approach does not require access to model parameters or the computational resources typically required for fine-tuning state-of-the-art models. Generatively Augmented Sentence Encoding uses diverse linguistic synthetic variants of input texts generated by paraphrasing, summarising, or extracting keywords, followed by pooling the original and synthetic embeddings. Experimental results on the Massive Text Embedding Benchmark for Semantic Textual Similarity (STS) demonstrate performance improvements across a range of embedding models using different generative models for augmentation. We find that generative augmentation leads to larger performance improvements for embedding models with lower baseline performance. These findings suggest that integrating generative augmentation at inference time adds semantic diversity and can enhance the robustness and generalizability of sentence embeddings for embedding models. Our results show that the degree to which generative augmentation can improve STS performance depends not only on the embedding model but also on the dataset. From a broader perspective, the approach allows trading training for inference compute.||
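|**2024-11-05**|[DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models](http://arxiv.org/abs/2411.03250)|null|Recent advances in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in using LLMs for high-quality data synthesis. However, generating synthetic data by prompting LLMs remains challenging because of LLMs' limited understanding of target data distributions and the complexity of prompt engineering, especially for structured formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on a variational autoencoder (VAE), which further (1) leverages diffusion models to preserve more information about the original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM's generative objectives via a plug-and-play latent feature injection module. As we observed significant discrepancies between the VAE's latent representations and the real data distribution, a latent diffusion module is introduced into our framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., tabular, code, and tool data) show that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2-7 percentage points in certain cases. The data and code will be publicly available upon completion of internal review.||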
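|**2024-11-05**|[On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models](http://arxiv.org/abs/2411.03177)|null|Large-scale training of latent diffusion models (LDMs) has enabled unprecedented quality in image generation. However, the key components of the best-performing LDM training recipes are often not available to the research community, preventing like-for-like comparisons and hindering the validation of progress in the field. In this work, we perform an in-depth study of LDM training recipes, focusing on the performance of models and their training efficiency. To ensure like-for-like comparisons, we re-implement five previously published models with their corresponding recipes. Through our study, we explore the effects of (i) the mechanisms used to condition the generative model on semantic information (e.g., a text prompt) and control metadata (e.g., crop size, random flip flag, etc.) on model performance, and (ii) the transfer of representations learned on smaller and lower-resolution datasets to larger ones on training efficiency and model performance. We then propose a novel conditioning mechanism that disentangles semantic and control metadata conditioning and sets a new state of the art in class-conditional generation on the ImageNet-1k dataset, with FID improvements of 7% at 256 resolution and 8% at 512 resolution, as well as in text-to-image generation on the CC12M dataset, with FID improvements of 8% at 256 resolution and 23% at 512 resolution.||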
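|**2024-11-05**|[Unleashing the power of novel conditional generative approaches for new materials discovery](http://arxiv.org/abs/2411.03156)|**[link](https://github.com/AIRI-Institute/conditional-crystal-generation)**|For a long time, computational approaches to new materials design have relied on an iterative process of finding candidate materials and modeling their properties. Artificial intelligence has played a crucial role in this regard, helping to accelerate the discovery and optimization of crystal properties and structures through advanced computational methodologies and data-driven approaches. To address the problem of new materials design and speed up the search for new materials, we apply the latest generative approaches to the problem of crystal structure design, trying to solve the inverse problem: given the properties, generate a structure that satisfies them, without using supercomputer power. In our work, we propose two approaches: 1) conditional structure modification, which optimizes the stability of an arbitrary atomic configuration using the energy difference between the most energetically favorable structure and all of its less stable polymorphs; and 2) conditional structure generation. We used a representation of materials that includes the following information: lattice, atom coordinates, atom types, chemical features, space group, and formation energy of the structure. The loss function was optimized to take into account the periodic boundary conditions of crystal structures. We applied the diffusion model approach, flow matching, and an ordinary autoencoder (AE), and compared the results of the models and approaches. As a metric for the study, we used the physical PyMatGen matcher: we compare the target structure with the generated one using default tolerances. So far, our modifier and generator produce structures with the required properties with an accuracy of 41% and 82%, respectively. To demonstrate the effectiveness of the proposed methodology, we ran inference and obtained several potentially new structures with formation energy below the AFLOW-derived convex hulls.||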
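|**2024-11-05**|[Local Lesion Generation is Effective for Capsule Endoscopy Image Data Augmentation in a Limited Data Setting](http://arxiv.org/abs/2411.03098)|null|Limited medical imaging datasets challenge deep learning models by increasing the risks of overfitting and reduced generalization, particularly in generative adversarial networks (GANs), where discriminators may overfit, leading to training divergence. This constraint also impairs classification models trained on small datasets. Generative data augmentation (GDA) addresses this by expanding training datasets with synthetic data, although it requires training a generative model. We propose and evaluate two local lesion generation approaches to address the challenge of augmenting small medical image datasets. The first approach employs the Poisson image editing algorithm, a classical image processing technique, to create realistic image composites that outperform current state-of-the-art methods. The second approach introduces a novel generative method that leverages a fine-tuned image inpainting GAN to synthesize realistic lesions within specified regions of real training images. A comprehensive comparison of the two approaches shows that effective local lesion generation in a data-constrained setting achieves new state-of-the-art results in capsule endoscopy lesion classification. Combining our techniques achieves a macro F1-score of 33.07% on the highly imbalanced Kvasir Capsule Dataset, a benchmark for capsule endoscopy, surpassing the previous best result by 7.84 percentage points. To the best of our knowledge, this work is the first to apply a fine-tuned image inpainting GAN to GDA in medical imaging, demonstrating that an image-conditional GAN can be adapted effectively to limited datasets to generate high-quality samples, thereby facilitating effective data augmentation. Additionally, we show that combining this GAN-based approach with classical image processing techniques further enhances the results.||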
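|**2024-11-05**|[Gradient-Guided Conditional Diffusion Models for Private Image Reconstruction: Analyzing Adversarial Impacts of Differential Privacy and Denoising](http://arxiv.org/abs/2411.03053)|null|We investigate the construction of gradient-guided conditional diffusion models for reconstructing private images, focusing on the adversarial interplay between differential privacy noise and the denoising capabilities of diffusion models. Whereas current gradient-based reconstruction methods struggle with high-resolution images because of computational complexity and prior knowledge requirements, we propose two novel methods that require only minimal modifications to the diffusion model's generation process and eliminate the need for prior knowledge. Our approach leverages the strong image generation capabilities of diffusion models to reconstruct private images starting from randomly generated noise, even when a small amount of differentially private noise has been added to the gradients. We also conduct a comprehensive theoretical analysis of the impact of differential privacy noise on the quality of reconstructed images, revealing the relationship among noise magnitude, the architecture of the attacked model, and the attacker's reconstruction capability. Additionally, extensive experiments validate the effectiveness of our proposed methods and the accuracy of our theoretical findings, suggesting new directions for privacy risk auditing using conditional diffusion models.||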
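|**2024-11-05**|[GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details](http://arxiv.org/abs/2411.03047)|null|Neural implicit functions have brought impressive advances to clothed human digitization from multiple or even single images. Despite this progress, however, current techniques still have difficulty generalizing to unseen images with complex cloth deformation and body poses. In this work, we present GarVerseLOD, a new dataset and framework that paves the way to unprecedented robustness in high-fidelity 3D garment reconstruction from a single unconstrained image. Inspired by the recent success of large generative models, we believe that one key to addressing the generalization challenge lies in the quantity and quality of 3D garment data. To this end, GarVerseLOD collects 6,000 high-quality cloth models with fine-grained geometry details manually created by professional artists. Beyond the scale of the training data, we observe that having disentangled granularities of geometry can play an important role in boosting the generalization capability and inference accuracy of the learned model. We therefore craft GarVerseLOD as a hierarchical dataset with levels of details (LOD), spanning from detail-free stylized shapes to pose-blended garments with pixel-aligned details. This allows us to make this highly under-constrained problem tractable by factorizing the inference into easier tasks, each with a narrower search space. To ensure that GarVerseLOD generalizes well to in-the-wild images, we propose a novel labeling paradigm based on conditional diffusion models to generate extensive paired images with high photorealism for each garment model. We evaluate our method on a large number of in-the-wild images. Experimental results show that GarVerseLOD can generate standalone garment pieces with significantly better quality than prior approaches. Project page: https://garverselod.github.io/||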
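|**2024-11-05**|[IMUDiffusion: A Diffusion Model for Multivariate Time Series Synthetisation for Inertial Motion Capturing Systems](http://arxiv.org/abs/2411.02954)|null|Kinematic sensors are often used to analyze movement behavior in sports and daily activities because they are easy to use and free of spatial restrictions, unlike video-based motion capturing systems. Still, generating motion data, and especially labeling it for specific activities, can be time-consuming and costly. Additionally, many models struggle with limited data, which limits their performance in recognizing complex movement patterns. To address these issues, generating synthetic data can help expand the diversity and variability of the data. In this work, we propose IMUDiffusion, a probabilistic diffusion model specifically designed for multivariate time series generation. Our approach generates high-quality time series that accurately capture the dynamics of human activities. Moreover, by combining our dataset with synthetic data, we significantly improve the performance of our baseline human activity classifier. In some cases, we are able to improve the macro F1-score by almost 30%. IMUDiffusion provides a valuable tool for generating realistic human activity movements and enhances the robustness of models in scenarios with limited training data.||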
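|**2024-11-05**|[LDPM: Towards undersampled MRI reconstruction with MR-VAE and Latent Diffusion Prior](http://arxiv.org/abs/2411.02951)|null|Diffusion models, as powerful generative models, have found a wide range of applications, including MRI reconstruction. However, most existing diffusion-based MRI reconstruction methods operate directly in pixel space, which makes their optimization and inference computationally expensive. Latent diffusion models were introduced to address this problem in natural image processing, but directly applying them to MRI reconstruction still faces many challenges, including the lack of control over the generated results, the adaptability of the Variational AutoEncoder (VAE) to MRI, and the exploration of applicable data consistency in latent space. To address these challenges, this paper proposes a Latent Diffusion Prior based undersampled MRI reconstruction method (LDPM). The method uses a sketcher module to provide appropriate control and to balance the quality and fidelity of the reconstructed MR images. The paper also explores a VAE adapted to MRI tasks (MR-VAE), which can serve as the backbone for future MR-related tasks. Furthermore, it proposes a variant of the DDIM sampler, called the Dual-Stage Sampler, to achieve high-fidelity reconstruction in the latent space. The proposed method achieves competitive results on the fastMRI dataset, and ablation experiments demonstrate the effectiveness of each module.||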
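|**2024-11-05**|[Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey](http://arxiv.org/abs/2411.02914)|null|World models and video generation are pivotal technologies in autonomous driving, each playing a critical role in enhancing the robustness and reliability of autonomous systems. World models simulate the dynamics of real-world environments, while video generation models produce realistic video sequences; the two are increasingly being integrated to improve situational awareness and decision-making in autonomous vehicles. This paper investigates the relationship between these two technologies, focusing on how their structural parallels, particularly in diffusion-based models, contribute to more accurate and coherent simulations of driving scenarios. We examine leading works such as JEPA, Genie, and Sora, which exemplify different approaches to world model design, thereby highlighting the lack of a universally accepted definition of world models. These diverse interpretations underscore the field's evolving understanding of how world models can be optimized for various autonomous driving tasks. Furthermore, this paper discusses the key evaluation metrics employed in this domain, such as Chamfer distance for 3D scene reconstruction and Fréchet Inception Distance (FID) for assessing the quality of generated video content. By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions, emphasizing the potential of these technologies to jointly advance the performance of autonomous driving systems. The findings presented in this paper aim to provide a comprehensive understanding of how the integration of video generation and world models can drive innovation in the development of safer and more reliable autonomous vehicles.||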
|**2024-11-05**|[ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate](http://arxiv.org/abs/2411.02853)|**[link](https://github.com/ishohei220/adopt)**|Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless choosing a hyperparameter, i.e., $\beta_2$, in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require an impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $\beta_2$ without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments, and verify that our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at https://github.com/iShohei220/adopt.||
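|**2024-10-31**|[Bridging Geometric States via Geometric Diffusion Bridge](http://arxiv.org/abs/2410.24220)|null|Accurately predicting the evolution of geometric states in complex systems is critical for advancing scientific domains such as quantum chemistry and material modeling. Traditional experimental and computational methods face challenges in terms of environmental constraints and computational demands, while current deep learning approaches still fall short in precision and generality. In this work, we introduce the Geometric Diffusion Bridge (GDB), a novel generative modeling framework that accurately bridges initial and target geometric states. GDB takes a probabilistic approach to evolving geometric state distributions, employing an equivariant diffusion bridge derived from a modified version of Doob's $h$-transform to connect geometric states. This tailored diffusion process is anchored by the initial and target geometric states as fixed endpoints and is governed by equivariant transition kernels. Moreover, trajectory data can be seamlessly incorporated into the GDB framework by using a chain of equivariant diffusion bridges, providing a more detailed and accurate characterization of the evolution dynamics. Theoretically, we conduct a thorough examination to confirm that our framework preserves the joint distributions of geometric states and can completely model trajectory distributions with negligible error. Experimental evaluations across various real-world scenarios show that GDB surpasses existing state-of-the-art approaches, opening a new pathway for accurately bridging geometric states and tackling crucial scientific challenges with improved accuracy and applicability.||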
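|**2024-10-31**|[Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning](http://arxiv.org/abs/2410.24219)|**[link](https://github.com/pr-ryan/demo)**|Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture the complex motions described by text. This issue stems from internal biases in text encoding that overlook motion, and from inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion. Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate that DEMO produces videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions. Project page: https://PR-Ryan.github.io/DEMO-project/||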
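|**2024-10-31**|[DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion](http://arxiv.org/abs/2410.24203)|**[link](https://github.com/zju3dv/diffpano)**|Diffusion-based methods have achieved remarkable results in 2D image and 3D object generation; however, the generation of 3D scenes and even 360-degree images remains constrained by the limited number of scene datasets, the complexity of 3D scenes themselves, and the difficulty of generating consistent multi-view images. To address these issues, we first establish a large-scale panoramic video-text dataset containing millions of consecutive panoramic keyframes with corresponding panoramic depths, camera poses, and text descriptions. We then propose a novel text-driven panoramic generation framework, termed DiffPano, to achieve scalable, consistent, and diverse panoramic scene generation. Specifically, benefiting from the powerful generative capabilities of Stable Diffusion, we fine-tune a single-view text-to-panorama diffusion model with LoRA on the established panoramic video-text dataset. We further design a spherical epipolar-aware multi-view diffusion model to ensure the multi-view consistency of the generated panoramic images. Extensive experiments demonstrate that DiffPano can generate scalable, consistent, and diverse panoramic images given unseen text descriptions and camera poses.||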
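|**2024-10-31**|[Multi-Attribute Linguistic Tuning for Controlled Paraphrase Generation](http://arxiv.org/abs/2410.24199)|null|We present a novel approach to paraphrase generation that enables precise control and fine-tuning of 40 linguistic attributes for English. Our model uses an encoder-decoder architecture that takes as input a source sentence and the desired linguistic attributes, and produces paraphrases of the source that satisfy those attributes. To guarantee high-quality outputs at inference time, our method is equipped with a quality control mechanism that gradually adjusts the embedding of the linguistic attributes to find the nearest and most attainable configuration of desired attributes for paraphrase generation. We evaluate the effectiveness of our method by comparing it to recent controllable generation models. Experimental results demonstrate that the proposed model outperforms baselines in generating paraphrases that satisfy the desired linguistic attributes.||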
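|**2024-10-31**|[AR-Pro: Counterfactual Explanations for Anomaly Repair with Formal Properties](http://arxiv.org/abs/2410.24178)|**[link](https://github.com/xjiae/arpro)**|Anomaly detection is widely used for identifying critical errors and suspicious behaviors, but current methods lack interpretability. We leverage common properties of existing methods and recent advances in generative models to introduce counterfactual explanations for anomaly detection. Given an input, we generate its counterfactual as a diffusion-based repair that shows what a non-anomalous version should look like. A key advantage of this approach is that it enables a domain-independent formal specification of explainability desiderata, offering a unified framework for generating and evaluating explanations. We demonstrate the effectiveness of our anomaly explainability framework, AR-Pro, on vision (MVTec, VisA) and time-series (SWaT, WADI, HAI) anomaly datasets. The code used for the experiments is accessible at: https://github.com/xjiae/arpro.||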
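|**2024-10-31**|[Redefining in Dictionary: Towards a Enhanced Semantic Understanding of Creative Generation](http://arxiv.org/abs/2410.24160)|null|Creativity, in both humans and diffusion models, remains an inherently abstract concept; thus, simply adding "creative" to a prompt does not guarantee that the model reliably recognizes its semantics. In this work, we concretize the abstract notion of "creative" through the TP2O task, which aims to merge two unrelated concepts, and introduce CreTok, which redefines "creative" as the token `<CreTok>`. This redefinition offers a more concrete and universally adaptable representation for concept blending. The redefinition is performed continuously: text pairs with different concepts are repeatedly sampled at random, and the cosine similarity between target and constant prompts is optimized. This enables `<CreTok>` to learn a method for creative concept fusion. Extensive experiments demonstrate that the creative capability enabled by `<CreTok>` substantially surpasses recent SOTA diffusion models and achieves superior creative generation. CreTok exhibits greater flexibility and lower time overhead, because `<CreTok>` can serve as a universal token for any concept, facilitating creative generation without retraining.||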
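|**2024-10-31**|[Scaling Concept With Text-Guided Diffusion Models](http://arxiv.org/abs/2410.24151)|null|Text-guided diffusion models have revolutionized generative tasks by producing high-fidelity content from text descriptions. They have also enabled an editing paradigm in which concepts can be replaced through text conditioning (e.g., replacing a dog with a tiger). In this work, we explore a novel approach: instead of replacing a concept, can we enhance or suppress the concept itself? Through an empirical study, we identify a trend whereby concepts can be decomposed in text-guided diffusion models. Leveraging this insight, we introduce ScalingConcept, a simple yet effective method for scaling decomposed concepts up or down in real input without introducing new elements. To systematically evaluate our approach, we present the WeakConcept-10 dataset, in which concepts are imperfect and need to be enhanced. More importantly, ScalingConcept enables a variety of novel zero-shot applications across the image and audio domains, including tasks such as canonical pose generation and generative sound highlighting or removal.||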
|**2024-10-31**|[Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure](http://arxiv.org/abs/2410.24060)|**[link](https://github.com/Morefre/Understanding-Generalizability-of-Diffusion-Models-Requires-Rethinking-the-Hidden-Gaussian-Structure)**|In this work, we study the generalizability of diffusion models by looking into the hidden properties of the learned score functions, which are essentially a series of deep denoisers trained on various noise levels. We observe that as diffusion models transition from memorization to generalization, their corresponding nonlinear diffusion denoisers exhibit increasing linearity. This discovery leads us to investigate the linear counterparts of the nonlinear diffusion models, which are a series of linear models trained to match the function mappings of the nonlinear diffusion denoisers. Surprisingly, these linear denoisers are approximately the optimal denoisers for a multivariate Gaussian distribution characterized by the empirical mean and covariance of the training dataset. This finding implies that diffusion models have the inductive bias towards capturing and utilizing the Gaussian structure (covariance information) of the training dataset for data generation. We empirically demonstrate that this inductive bias is a unique property of diffusion models in the generalization regime, which becomes increasingly evident when the model's capacity is relatively small compared to the training dataset size. In the case that the model is highly overparameterized, this inductive bias emerges during the initial training phases before the model fully memorizes its training data. Our study provides crucial insights into understanding the notable strong generalization phenomenon recently observed in real-world diffusion models.||
|**2024-10-31**|[TPC: Test-time Procrustes Calibration for Diffusion-based Human Image Animation](http://arxiv.org/abs/2410.24037)|null|Human image animation aims to generate a human motion video from the inputs of a reference human image and a target motion video. Current diffusion-based image animation systems exhibit high precision in transferring human identity into targeted motion, yet they still exhibit irregular quality in their outputs. Their optimal precision is achieved only when the physical compositions (i.e., scale and rotation) of the human shapes in the reference image and target pose frame are aligned. In the absence of such alignment, there is a noticeable decline in fidelity and consistency. Especially, in real-world environments, this compositional misalignment commonly occurs, posing significant challenges to the practical usage of current systems. To this end, we propose Test-time Procrustes Calibration (TPC), which enhances the robustness of diffusion-based image animation systems by maintaining optimal performance even when faced with compositional misalignment, effectively addressing real-world scenarios. The TPC provides a calibrated reference image for the diffusion model, enhancing its capability to understand the correspondence between human shapes in the reference and target images. Our method is simple and can be applied to any diffusion-based image animation system in a model-agnostic manner, improving the effectiveness at test time without additional training.||
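|**2024-10-31**|[Unveiling Synthetic Faces: How Synthetic Datasets Can Expose Real Identities](http://arxiv.org/abs/2410.24015)|null|Synthetic data generation is gaining popularity in different computer vision applications. Existing state-of-the-art face recognition models are trained on large-scale face datasets that are crawled from the Internet, raising privacy and ethical concerns. To address such concerns, several works have proposed generating synthetic face datasets to train face recognition models. However, these methods depend on generative models that are themselves trained on real face images. In this work, we design a simple yet effective membership inference attack to systematically study whether any of the existing synthetic face recognition datasets leak information from the real data used to train the generator model. We provide an extensive study of 6 state-of-the-art synthetic face recognition datasets and show that, in all of them, several samples from the original real dataset are leaked. To our knowledge, this paper is the first work to show leakage from the training data of generator models into the generated synthetic face recognition datasets. Our study reveals privacy pitfalls in synthetic face recognition datasets and paves the way for future research on generating responsible synthetic face datasets.||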
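|**2024-10-29**|[A Gaussian Process Generative Model for QCD Equation of State](http://arxiv.org/abs/2410.22160)|null|We develop a generative model for the nuclear matter equation of state at zero net baryon density using the Gaussian Process Regression method. We impose first-principles theoretical constraints from lattice QCD and the hadron resonance gas in the high- and low-temperature regions, respectively. By allowing the trained Gaussian Process Regression model to vary freely near the phase transition region, we generate random smooth cross-over equations of state with different speeds of sound that do not rely on specific parameterizations. We explore how a large set of experimental observables depends on the generated equations of state, which lays the groundwork for future Bayesian inference studies that use experimental measurements from relativistic heavy-ion collisions to constrain the nuclear matter equation of state.||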
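|**2024-10-29**|[Capacity Control is an Effective Memorization Mitigation Mechanism in Text-Conditional Diffusion Models](http://arxiv.org/abs/2410.22149)|**[link](https://github.com/raman1121/diffusion_memorization_hpo)**|In this work, we present compelling evidence that controlling model capacity during fine-tuning can effectively mitigate memorization in diffusion models. Specifically, we demonstrate that adopting Parameter-Efficient Fine-Tuning (PEFT) within the pre-train fine-tune paradigm significantly reduces memorization compared to traditional full fine-tuning. Our experiments use the MIMIC dataset, which comprises image-text pairs of chest X-rays and their corresponding reports. The results, evaluated through a range of memorization and generation quality metrics, indicate that PEFT not only reduces memorization but also improves downstream generation quality. Additionally, PEFT methods can be seamlessly combined with existing memorization mitigation techniques for further improvement. The code for our experiments is available at: https://github.com/Raman1121/Diffusion_Memorization_HPO||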
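|**2024-10-29**|[AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts](http://arxiv.org/abs/2410.22143)|null|Although large language models (LLMs) are typically well aligned, they remain vulnerable to attacks through carefully crafted natural-language prompts or strange adversarial suffixes. However, gibberish tokens have received relatively little attention despite their success in attacking aligned LLMs. Recent work, AmpleGCG (Liao et al., 2024), demonstrates that a generative model can quickly produce numerous customizable gibberish adversarial suffixes for any harmful query, exposing a range of alignment gaps in out-of-distribution (OOD) language spaces. To bring more attention to this area, we introduce AmpleGCG-Plus, an enhanced version that achieves better performance in fewer attempts. Through a series of exploratory experiments, we identify several training strategies that improve the learning of gibberish suffixes. Our results, verified under a strict evaluation setting, show that it outperforms AmpleGCG on both open-weight and closed-source models, increasing the attack success rate (ASR) by up to 17% in the white-box setting against Llama-2-7B-chat and more than tripling ASR in the black-box setting against GPT-4. Notably, AmpleGCG-Plus jailbreaks the newer GPT-4o series of models at rates similar to GPT-4 and uncovers vulnerabilities in the recently proposed circuit breakers defense. We publicly release AmpleGCG-Plus along with our collected training datasets.||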
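|**2024-10-29**|[Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench](http://arxiv.org/abs/2410.22108)|**[link](https://github.com/franciscoliu/MLLMU-Bench)**|Generative models such as Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), trained on massive web corpora, can memorize and disclose individuals' confidential and private data, raising legal and ethical concerns. While many previous works have addressed this issue in LLMs via machine unlearning, it remains largely unexplored for MLLMs. To tackle this challenge, we introduce the Multimodal Large Language Model Unlearning Benchmark (MLLMU-Bench), a novel benchmark aimed at advancing the understanding of multimodal machine unlearning. MLLMU-Bench consists of 500 fictitious profiles and 153 profiles of public figures, each featuring over 14 customized question-answer pairs, evaluated from both multimodal (image+text) and unimodal (text) perspectives. The benchmark is divided into four sets that assess unlearning algorithms in terms of efficacy, generalizability, and model utility. Finally, we provide baseline results using existing generative model unlearning algorithms. Surprisingly, our experiments show that unimodal unlearning algorithms excel in generation and cloze tasks, while multimodal unlearning approaches perform better in classification tasks with multimodal inputs.||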
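|**2024-10-29**|[Variational inference for pile-up removal at hadron colliders with diffusion models](http://arxiv.org/abs/2410.22074)|null|In this paper, we present a method for removing pile-up from pp interactions using variational inference with diffusion models, called Vipr. Instead of using classification methods to identify which particles come from the primary collision, a generative model is trained to predict the constituents of the hard-scatter particle jets with pile-up removed. This yields an estimate of the full posterior over hard-scatter jet constituents, which has not yet been explored in the context of pile-up removal. We evaluate the performance of Vipr on a sample of simulated $t\bar{t}$ events overlaid with pile-up contamination. Vipr outperforms SoftDrop in predicting the substructure of the hard-scatter jets over a range of pile-up scenarios.||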
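|**2024-10-29**|[PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement](http://arxiv.org/abs/2410.22059)|null|Scene rearrangement, such as tidying a table, is a challenging task in robotic manipulation because of the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can help by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scene with the generated goal and to compute object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspective of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages a perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop a representation that integrates generation, segmentation, and feature encoding into a single step to produce object-level representations. Additionally, we introduce perspective control, enabling the matching of 6-DoF camera views and extending past approaches that were limited to 3-DoF top-down views. The efficacy of our method is demonstrated by its zero-shot performance in real robot experiments across various scenes, achieving an average matching accuracy of 87% and an execution success rate of 67%.||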
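|**2024-10-29**|[Dual Conditional Diffusion Models for Sequential Recommendation](http://arxiv.org/abs/2410.21967)|null|Recent advances in diffusion models have shown promising results in sequential recommendation (SR). However, current diffusion-based methods still have two key limitations. First, they implicitly model target item embeddings rather than the discrete target items themselves, leading to inconsistency in the recommendation process. Second, existing methods rely on either implicit or explicit conditional diffusion models, which limits their ability to fully capture the context of user behavior and leads to less robust target item embeddings. In this paper, we propose Dual Conditional Diffusion Models for Sequential Recommendation (DCRec), introducing a discrete-to-continuous sequential recommendation diffusion framework. Our framework introduces a complete Markov chain to model the transition from the reversed target item representation to the discrete item index, bridging the discrete and continuous item spaces of diffusion models and ensuring consistency with the diffusion framework. Building on this framework, we present the Dual Conditional Diffusion Transformer (DCDT), which incorporates both implicit and explicit conditioning for diffusion-based SR. Extensive experiments on public benchmark datasets demonstrate that DCRec outperforms state-of-the-art methods.||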
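|**2024-10-29**|[PrefPaint: Aligning Image Inpainting Diffusion Model with Human Preference](http://arxiv.org/abs/2410.21966)|null|In this paper, we make the first attempt to align diffusion models for image inpainting with human aesthetic standards via a reinforcement learning framework, significantly improving the quality and visual appeal of inpainted images. Specifically, instead of directly measuring the divergence from paired images, we train a reward model on a dataset we construct, consisting of nearly 51,000 images annotated with human preferences. We then adopt a reinforcement learning process to fine-tune the distribution of a pre-trained image inpainting diffusion model in the direction of higher reward. Moreover, we theoretically derive an upper bound on the error of the reward model, which indicates the potential confidence of reward estimation throughout the reinforcement alignment process and thereby facilitates accurate regularization. Extensive experiments on inpainting comparison and downstream tasks, such as image extension and 3D reconstruction, demonstrate the effectiveness of our approach, showing significantly better alignment of inpainted images with human preference than state-of-the-art methods. This research not only advances the field of image inpainting but also provides a framework for incorporating human preference into the iterative refinement of generative models based on modeling reward accuracy, with broad implications for the design of visually driven AI applications. Our code and dataset are publicly available at https://prefpaint.github.io.||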
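|**2024-10-29**|[CT to PET Translation: A Large-scale Dataset and Domain-Knowledge-Guided Diffusion Approach](http://arxiv.org/abs/2410.21932)|**[link](https://github.com/thanhhff/CPDM)**|Positron Emission Tomography (PET) and Computed Tomography (CT) are essential for diagnosing, staging, and monitoring various diseases, particularly cancer. Despite their importance, the use of PET/CT systems is limited by the need for radioactive materials, the scarcity of PET scanners, and the high cost of PET imaging. In contrast, CT scanners are more widely available and significantly less expensive. In response to these challenges, our study addresses the problem of generating PET images from CT images, aiming to reduce both the cost of medical examinations and the associated health risks for patients. Our contributions are twofold. First, we introduce a conditional diffusion model named CPDM, which, to our knowledge, is the first attempt to use a diffusion model to translate CT images into PET images. Second, we provide the largest CT-PET dataset to date, comprising 2,028,628 paired CT-PET images, which facilitates the training and evaluation of CT-to-PET translation models. For the CPDM model, we incorporate domain knowledge to develop two conditional maps: the Attention map and the Attenuation map. The former helps the diffusion process focus on regions of interest, while the latter improves PET data correction and ensures accurate diagnostic information. Experimental evaluations across various benchmarks demonstrate that CPDM surpasses existing methods on multiple metrics in generating high-quality PET images. The source code and data samples are available at https://github.com/thanhhff/CPDM.||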
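|**2024-10-29**|[Guided Diffusion-based Counterfactual Augmentation for Robust Session-based Recommendation](http://arxiv.org/abs/2410.21892)|null|Session-based recommendation (SR) models aim to recommend top-K items to a user based on the user's behavior during the current session. Several SR models have been proposed in the literature; however, concerns have been raised about their susceptibility to inherent biases in the training data (observed data), such as popularity bias. SR models trained on biased training data may face performance challenges on out-of-distribution data in real-world scenarios. One way to mitigate popularity bias is counterfactual data augmentation. In contrast to prior works that rely on generating data with SR models, we focus on using state-of-the-art diffusion models to generate counterfactual data. We propose a guided diffusion-based counterfactual augmentation framework for SR. Through a combination of offline and online experiments, on a real-world and a simulated dataset respectively, we show that our approach performs better than the baseline SR models and other state-of-the-art augmentation frameworks. More importantly, our framework shows significant improvement on less popular target items, achieving a 20% gain in Recall and a 13% gain in CTR on the real-world and simulated datasets, respectively.||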
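|**2024-10-25**|[Model merging with SVD to tie the Knots](http://arxiv.org/abs/2410.19735)|**[link](https://github.com/gstoica27/knots)**|Recent model merging methods demonstrate that the parameters of fully fine-tuned models specializing in distinct tasks can be combined into one model capable of solving all tasks without retraining. Yet this success does not transfer well when merging LoRA fine-tuned models. We study this phenomenon and observe that the weights of LoRA fine-tuned models exhibit a lower degree of alignment than those of their fully fine-tuned counterparts. We hypothesize that improving this alignment is key to obtaining better LoRA model merges, and propose KnOTS to address this problem. KnOTS uses the SVD to jointly transform the weights of different LoRA models into an aligned space, where existing merging methods can be applied. In addition, we introduce a new benchmark that explicitly evaluates whether merged models are general models. Notably, KnOTS consistently improves LoRA merging by up to 4.3% across several vision and language benchmarks, including our new setting. We release our code at: https://github.com/gstoica27/KnOTS.||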
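|**2024-10-25**|[Adversarial Environment Design via Regret-Guided Diffusion Models](http://arxiv.org/abs/2410.19715)|null|Training agents that are robust to environmental changes remains a significant challenge in deep reinforcement learning (RL). Unsupervised environment design (UED) has recently emerged to address this issue by generating a set of training environments tailored to the agent's capabilities. While prior works demonstrate that UED has the potential to learn a robust policy, their performance is constrained by the capabilities of the environment generation. To this end, we propose a novel UED algorithm, adversarial environment design via regret-guided diffusion models (ADD). The proposed method guides the diffusion-based environment generator with the agent's regret to produce environments that the agent finds challenging but conducive to further improvement. By exploiting the representation power of diffusion models, ADD can directly generate adversarial environments while maintaining the diversity of training environments, enabling the agent to effectively learn a robust policy. Our experimental results demonstrate that the proposed method successfully generates an instructive curriculum of environments, outperforming UED baselines in zero-shot generalization to novel, out-of-distribution environments. Project page: https://github.com/rllab-snu.github.io/projects/ADD||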
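|**2024-10-25**|[DiffGS: Functional Gaussian Splatting Diffusion](http://arxiv.org/abs/2410.19657)|null|3D Gaussian Splatting (3DGS) has shown convincing performance in rendering speed and fidelity, yet generating Gaussian Splatting remains a challenge because of its discrete and unstructured nature. In this work, we propose DiffGS, a general Gaussian generator based on latent diffusion models. DiffGS is a powerful and efficient 3D generative model capable of generating an arbitrary number of Gaussian primitives for high-fidelity rendering with rasterization. The key insight is to represent Gaussian Splatting in a disentangled manner via three novel functions that model Gaussian probabilities, colors, and transforms. Through this novel disentanglement of 3DGS, we represent the discrete and unstructured 3DGS with continuous Gaussian Splatting functions, and we then train a latent diffusion model to generate these Gaussian Splatting functions both unconditionally and conditionally. Meanwhile, we introduce a discretization algorithm that extracts an arbitrary number of Gaussians from the generated functions via octree-guided sampling and optimization. We explore DiffGS on various tasks, including unconditional generation, conditional generation from text, images, and partial 3DGS, as well as Point-to-Gaussian generation. We believe that DiffGS provides a new direction for flexibly modeling and generating Gaussian Splatting.||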
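|**2024-10-25**|[Diffusion models for lattice gauge field simulations](http://arxiv.org/abs/2410.19602)|null|We develop diffusion models for lattice gauge theories that build on the concept of stochastic quantization. This framework is applied to $U(1)$ gauge theory in $1+1$ dimensions. We show that a model trained at a small inverse coupling can be effectively transferred to larger inverse couplings without encountering issues related to topological freezing; that is, by introducing the Boltzmann factors as physics conditions, the model can generate configurations corresponding to different couplings while maintaining the correct physical distributions, without any additional training. This demonstrates the potential of physics-conditioned diffusion models for efficient and flexible lattice gauge theory simulations.||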
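|**2024-10-25**|[Utilizing Image Transforms and Diffusion Models for Generative Modeling of Short and Long Time Series](http://arxiv.org/abs/2410.19538)|null|Lately, there has been a surge of interest in generative modeling of time series data. Most existing approaches are designed either to process short sequences or to handle long-range sequences. This dichotomy can be attributed to gradient issues in recurrent networks, the computational costs associated with transformers, and the limited expressiveness of state space models. Toward a unified generative model for time series of varying length, we propose in this work to transform sequences into images. By employing invertible transforms such as the delay embedding and the short-time Fourier transform, we unlock three main advantages: i) we can exploit advanced diffusion vision models; ii) we can process short- and long-range inputs within the same framework; and iii) we can harness recent and established tools proposed in the time-series-to-image literature. We validate the effectiveness of our method through a comprehensive evaluation across multiple tasks, including unconditional generation, interpolation, and extrapolation. We show that our approach consistently achieves state-of-the-art results against strong baselines. In the unconditional generation tasks, we show mean improvements of 58.17% over previous diffusion models in the short discriminative score and 132.61% in the (ultra-)long classification scores. Code is at https://github.com/azencot-group/ImagenTime.||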
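|**2024-10-25**|[Ensemble Data Assimilation for Particle-based Methods](http://arxiv.org/abs/2410.19525)|null|This study presents a novel approach to applying data assimilation techniques to particle-based simulations using the Ensemble Kalman Filter. While data assimilation methods have been applied effectively to Eulerian simulations, their application to Lagrangian solution discretizations has not been properly explored. We introduce two specific methodologies to address this gap. The first employs an intermediary Eulerian transformation that combines a projection with a remeshing process. The second is a purely Lagrangian scheme designed for situations where remeshing is not appropriate. These methods are evaluated using a one-dimensional advection-diffusion model with periodic boundary conditions. Performance in the one-dimensional scenario is benchmarked against a grid-based assimilation filter. Subsequently, the assimilation schemes are applied to a non-linear two-dimensional incompressible flow problem solved via the Vortex-In-Cell method. The results demonstrate the applicability of these methods in more complex scenarios, highlighting their effectiveness in both the one-dimensional and two-dimensional settings.||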
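|**2024-10-25**|[Marked Temporal Bayesian Flow Point Processes](http://arxiv.org/abs/2410.19512)|null|Marked event data captures events by recording their continuous-valued occurrence timestamps along with their corresponding discrete-valued types. Such data appears in various real-world scenarios, such as social media, financial transactions, and healthcare records, and has been effectively modeled with Marked Temporal Point Process (MTPP) models. Recently, generative models for MTPP have developed rapidly thanks to their powerful generative capability and less restrictive functional forms. However, existing generative MTPP models usually struggle to jointly model events' timestamps and types because (1) mainstream methods design generative mechanisms for timestamps only and do not include event types, and (2) the complex interdependence between timestamps and event types is overlooked. In this paper, we propose a novel generative MTPP model called BMTPP. Unlike existing generative MTPP models, BMTPP flexibly models the marked temporal joint distribution using a parameter-based approach. Additionally, by adding joint noise to the marked temporal data space, BMTPP effectively captures and explicitly reveals the interdependence between timestamps and event types. Extensive experiments validate the superiority of our approach over other state-of-the-art models and its ability to effectively capture marked-temporal interdependence.||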
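|**2024-10-25**|[NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction](http://arxiv.org/abs/2410.19452)|**[link](https://github.com/gongzix/neuroclips)**|Reconstructing static visual stimuli from non-invasive brain activity (fMRI) has achieved great success, owing to advanced deep learning models such as CLIP and Stable Diffusion. However, research on fMRI-to-video reconstruction remains limited, because decoding the spatiotemporal perception of continuous visual experiences is formidably challenging. We contend that the key to addressing these challenges lies in accurately decoding both the high-level semantics and the low-level perception flows perceived by the brain in response to video stimuli. To this end, we propose NeuroClips, an innovative framework for decoding high-fidelity and smooth video from fMRI. NeuroClips uses a semantics reconstructor to reconstruct video keyframes, guiding semantic accuracy and consistency, and employs a perception reconstructor to capture low-level perceptual details, ensuring video smoothness. During inference, it adopts a pre-trained T2V diffusion model injected with both keyframes and low-level perception flows for video reconstruction. Evaluated on a publicly available fMRI-video dataset, NeuroClips achieves smooth, high-fidelity video reconstruction of up to 6 s at 8 FPS, with significant improvements over state-of-the-art models on various metrics, e.g., a 128% improvement in SSIM and an 81% improvement in spatiotemporal metrics. Our project is available at https://github.com/gongzix/NeuroClips.||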
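|**2024-10-25**|[Learned Reference-based Diffusion Sampling for multi-modal distributions](http://arxiv.org/abs/2410.19449)|null|Over the past few years, several approaches based on score-based diffusion have been proposed for sampling from probability distributions, that is, without access to exact samples and relying solely on evaluations of unnormalized densities. The resulting samplers approximate the time reversal of a noising diffusion process, bridging the target distribution to an easy-to-sample base distribution. In practice, the performance of these methods depends heavily on key hyperparameters that require ground-truth samples to be tuned accurately. Our work aims to highlight and address this fundamental issue, focusing in particular on multi-modal distributions, which pose significant challenges for existing sampling methods. Building on existing approaches, we introduce the Learned Reference-based Diffusion Sampler (LRDS), a methodology specifically designed to leverage prior knowledge of the location of the target modes in order to bypass the obstacle of hyperparameter tuning. LRDS proceeds in two steps: (i) learning a reference diffusion model on samples located in high-density regions of the space and tailored for multimodality, and (ii) using this reference model to foster the training of a diffusion-based sampler. We demonstrate experimentally that, on a variety of challenging distributions, LRDS exploits prior knowledge of the target distribution better than competing algorithms.||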
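|**2024-10-25**|[Generative Diffusion Models for Sequential Recommendations](http://arxiv.org/abs/2410.19429)|null|Generative models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) have shown promise in sequential recommendation tasks. However, they face challenges, including posterior collapse and limited representation capacity. The work by Li et al. (2023) introduces a novel approach that leverages diffusion models to address these challenges by representing item embeddings as distributions rather than fixed vectors. This approach allows a more adaptive reflection of users' diverse interests and the various aspects of items. During the diffusion phase, the model converts the target item embedding into a Gaussian distribution by adding noise, facilitating the representation of sequential item distributions and the injection of uncertainty. An Approximator then processes this noisy item representation to reconstruct the target item. In the reverse phase, the model uses the user's past interactions to reverse the noise and finalizes the item prediction through a rounding operation. This research introduces enhancements to the DiffuRec architecture, in particular adding offset noise in the diffusion process to improve robustness and incorporating a cross-attention mechanism in the Approximator to better capture relevant user-item interactions. These contributions led to the development of a new model, DiffuRecSys, which improves performance. Extensive experiments on three public benchmark datasets demonstrate that these modifications enhance item representation, effectively capture diverse user preferences, and outperform existing baselines in sequential recommendation research.||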
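|**2024-10-24**|[MotionCLR: Motion Generation and Training-free Editing via Understanding Attention Mechanisms](http://arxiv.org/abs/2410.18977)|null|This research delves into the problem of interactive editing of human motion generation. Previous motion diffusion models lack explicit modeling of the word-level text-motion correspondence and good explainability, which restricts their fine-grained editing ability. To address this issue, we propose an attention-based motion diffusion model, namely MotionCLR, with CLeaR modeling of attention mechanisms. Technically, MotionCLR models the in-modality and cross-modality interactions with self-attention and cross-attention, respectively. More specifically, the self-attention mechanism measures the sequential similarity between frames and affects the order of motion features. By contrast, the cross-attention mechanism finds the fine-grained word-sequence correspondence and activates the corresponding timesteps in the motion sequence. Based on these key properties, we develop a versatile set of simple yet effective motion editing methods that manipulate attention maps, such as motion (de-)emphasizing, in-place motion replacement, and example-based motion generation. To further verify the explainability of the attention mechanism, we additionally explore the potential of action counting and grounded motion generation via attention maps. Our experimental results show that our method offers good generation and editing ability along with good explainability.||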
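|**2024-10-24**|[Unbounded: A Generative Infinite Game of Character Life Simulation](http://arxiv.org/abs/2410.18975)|null|We introduce the concept of a generative infinite game, a video game that transcends the traditional boundaries of finite, hard-coded systems by using generative models. Inspired by James P. Carse's distinction between finite and infinite games, we leverage recent advances in generative AI to create Unbounded: a game of character life simulation that is fully encapsulated in generative models. Unbounded draws inspiration from sandbox life simulations and allows you to interact with your autonomous virtual character in a virtual world by feeding, playing with, and guiding it, with open-ended mechanics generated by an LLM, some of which can be emergent. To develop Unbounded, we propose technical innovations in both the LLM and visual generation domains. Specifically, we present: (1) a specialized, distilled large language model (LLM) that dynamically generates game mechanics, narratives, and character interactions in real time, and (2) a new dynamic regional image prompt adapter (IP-Adapter) for vision models that ensures consistent yet flexible visual generation of a character across multiple environments. We evaluate our system through both qualitative and quantitative analysis, showing significant improvements in character life simulation, user instruction following, narrative coherence, and visual consistency for both characters and environments compared to traditional related approaches.||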
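|**2024-10-24**|[3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation](http://arxiv.org/abs/2410.18974)|**[link](https://github.com/Lakonik/MVEdit)**|Multi-view image diffusion models have significantly advanced open-domain 3D object generation. However, most existing models rely on 2D network architectures that lack inherent 3D biases, resulting in compromised geometric consistency. To address this challenge, we introduce 3D-Adapter, a plug-in module designed to infuse 3D geometry awareness into pretrained image diffusion models. Central to our approach is 3D feedback augmentation: for each denoising step in the sampling loop, 3D-Adapter decodes intermediate multi-view features into a coherent 3D representation, then re-encodes the rendered RGBD views to augment the pretrained base model through feature addition. We study two variants of 3D-Adapter: a fast feed-forward version based on Gaussian splatting and a versatile training-free version utilizing neural fields and meshes. Our extensive experiments demonstrate that 3D-Adapter not only greatly enhances the geometry quality of text-to-multi-view models such as Instant3D and Zero123++, but also enables high-quality 3D generation using the plain text-to-image model Stable Diffusion. Furthermore, we showcase the broad application potential of 3D-Adapter by presenting high-quality results in text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks.||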
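|**2024-10-24**|[On the Crucial Role of Initialization for Matrix Factorization](http://arxiv.org/abs/2410.18965)|null|This work revisits the classical low-rank matrix factorization problem and unveils the critical role of initialization in shaping convergence rates for such nonconvex and nonsmooth optimization. We introduce Nystrom initialization, which significantly improves the global convergence of Scaled Gradient Descent (ScaledGD) in both symmetric and asymmetric matrix factorization tasks. Specifically, we prove that ScaledGD with Nystrom initialization achieves quadratic convergence in cases where only linear rates were previously known. Furthermore, we extend this initialization to the low-rank adapters (LoRA) commonly used for fine-tuning foundation models. Our approach, NoRA, i.e., LoRA with Nystrom initialization, demonstrates superior performance across various downstream tasks and model scales, from 1B to 7B parameters, in large language and diffusion models.||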
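|**2024-10-24**|[Stable Consistency Tuning: Understanding and Improving Consistency Models](http://arxiv.org/abs/2410.18958)|**[link](https://github.com/G-U-N/Stable-Consistency-Tuning)**|Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising. In contrast, consistency models, a new family of generative models, achieve competitive performance with significantly faster sampling. These models are trained either through consistency distillation, which leverages pretrained diffusion models, or through consistency training/tuning directly from raw data. In this work, we propose a novel framework for understanding consistency models by modeling the denoising process of the diffusion model as a Markov Decision Process (MDP) and framing consistency model training as value estimation through Temporal Difference (TD) Learning. More importantly, this framework allows us to analyze the limitations of current consistency training/tuning strategies. Building on Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT), which incorporates variance-reduced learning using the score identity. SCT leads to significant performance improvements on benchmarks such as CIFAR-10 and ImageNet-64. On ImageNet-64, SCT achieves a 1-step FID of 2.42 and a 2-step FID of 1.55, the best results among current consistency models.||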
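|**2024-10-24**|[Generation of synthetic financial time series by diffusion models](http://arxiv.org/abs/2410.18897)|null|Despite its practical significance, generating realistic synthetic financial time series remains challenging because of their statistical properties, known as stylized facts, such as fat tails, volatility clustering, and seasonality patterns. Various generative models, including generative adversarial networks (GANs) and variational autoencoders (VAEs), have been employed to address this challenge, although no model yet satisfies all the stylized facts. We propose an alternative approach that uses diffusion models, specifically denoising diffusion probabilistic models (DDPMs), to generate synthetic financial time series. This approach employs the wavelet transform to convert multiple time series, such as stock prices, trading volumes, and spreads, into images. Given these converted images, the model is able to generate images that can be transformed back into realistic time series by the inverse wavelet transform. We demonstrate that our proposed approach satisfies the stylized facts.||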
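|**2024-10-24**|[Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences](http://arxiv.org/abs/2410.18881)|null|One-step text-to-image generator models offer advantages such as fast inference, flexible architectures, and state-of-the-art generation performance. In this paper, we study for the first time the problem of aligning one-step generator models with human preferences. Inspired by the success of reinforcement learning from human feedback (RLHF), we formulate the alignment problem as maximizing the expected human reward function while adding an Integral Kullback-Leibler divergence term to prevent the generator from diverging. By overcoming technical challenges, we introduce Diff-Instruct++ (DI++), the first fast-converging and image-data-free human preference alignment method for one-step text-to-image generators. We also introduce novel theoretical insights, showing that using CFG for diffusion distillation is in effect doing RLHF with DI++. This finding aids understanding and may benefit future research involving CFG. In the experiments, we use DI++ to align both UNet-based and DiT-based one-step generators, which use Stable Diffusion 1.5 and PixelArt-$\alpha$, respectively, as the reference diffusion processes. The resulting DiT-based one-step text-to-image model achieves a strong Aesthetic Score of 6.19 and an Image Reward of 1.24 on the COCO validation prompt dataset. It also achieves a leading Human Preference Score (HPSv2.0) of 28.48, outperforming other open-source models such as Stable Diffusion XL, DMD2, SD-Turbo, and PixelArt-$\alpha$. Both the theoretical contributions and the empirical evidence indicate that DI++ is a strong human preference alignment approach for one-step text-to-image models.||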
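|**2024-10-24**|[The Cat and Mouse Game: The Ongoing Arms Race Between Diffusion Models and Detection Methods](http://arxiv.org/abs/2410.18866)|null|The emergence of diffusion models has transformed synthetic media generation, offering unmatched realism and control over content creation. These advances have driven innovation in fields such as art, design, and scientific visualization. However, they also introduce significant ethical and societal challenges, particularly through the creation of hyper-realistic images that can facilitate deepfakes, misinformation, and unauthorized reproduction of copyrighted material. As a result, the need for effective detection mechanisms has become increasingly urgent. This review examines the evolving adversarial relationship between diffusion model development and the advancement of detection methods. We present a thorough analysis of contemporary detection strategies, including frequency- and spatial-domain techniques, deep learning-based approaches, and hybrid models that combine multiple methodologies. We also highlight the importance of diverse datasets and standardized evaluation metrics in improving detection accuracy and generalizability. Our discussion explores the practical applications of these detection systems in copyright protection, misinformation prevention, and forensic analysis, while also addressing the ethical implications of synthetic media. Finally, we identify key research gaps and propose future directions for enhancing the robustness and adaptability of detection methods so that they keep pace with the rapid advancement of diffusion models. This review emphasizes the necessity of a comprehensive approach to mitigating the risks associated with AI-generated content in an increasingly digital world.||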
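|**2024-10-24**|[From Efficiency to Equity: Measuring Fairness in Preference Learning](http://arxiv.org/abs/2410.18841)|null|As AI systems, particularly generative models, increasingly influence decision-making, ensuring that they can fairly represent diverse human preferences becomes crucial. This paper introduces a novel framework for evaluating epistemic fairness in preference learning models, inspired by economic theories of inequality and Rawlsian justice. We propose metrics adapted from the Gini Coefficient, the Atkinson Index, and the Kuznets Ratio to quantify fairness in these models. We validate our approach using two datasets: a custom visual preference dataset (AI-EDI-Space) and the Jester Jokes dataset. Our analysis reveals variations in model performance across users, highlighting potential epistemic injustices. We explore pre-processing and in-processing techniques to mitigate these inequalities, demonstrating a complex relationship between model efficiency and fairness. This work contributes to AI ethics by providing a framework for evaluating and improving epistemic fairness in preference learning models, offering insights for developing more inclusive AI systems in contexts where diverse human preferences are crucial.||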
|**2024-10-24**|[Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation](http://arxiv.org/abs/2410.18830)|null|Diffusion models have recently gained recognition for generating diverse and high-quality content, especially in the domain of image synthesis. These models excel not only in creating fixed-size images but also in producing panoramic images. However, existing methods often struggle with spatial layout consistency when producing high-resolution panoramas, due to the lack of guidance of the global image layout. In this paper, we introduce the Multi-Scale Diffusion (MSD) framework, a plug-and-play module that extends the existing panoramic image generation framework to multiple resolution levels. By utilizing gradient descent techniques, our method effectively incorporates structural information from low-resolution images into high-resolution outputs. A comprehensive evaluation of the proposed method was conducted, comparing it with the prior works in qualitative and quantitative dimensions. The evaluation results demonstrate that our method significantly outperforms others in generating coherent high-resolution panoramas.||
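|**2024-10-22**|[Creativity in AI: Progresses and Challenges](http://arxiv.org/abs/2410.17218)|**[link](https://github.com/mismayil/creativity-in-AI)**|Creativity is the ability to produce novel, useful, and surprising ideas, and it has been widely studied as a crucial aspect of human cognition. Machine creativity, on the other hand, has been a long-standing challenge. With the rise of advanced generative AI, there has been renewed interest and debate regarding AI's creative capabilities. It is therefore necessary to revisit the state of creativity in AI and to identify key progress and remaining challenges. In this work, we survey leading works that study the creative capabilities of AI systems, focusing on creative problem-solving and on linguistic, artistic, and scientific creativity. Our review suggests that while the latest AI models are largely capable of producing linguistically and artistically creative outputs such as poems, images, and musical pieces, they struggle with tasks that require creative problem-solving, abstract thinking, and compositionality, and their generations suffer from a lack of diversity and originality, long-range incoherence, and hallucinations. We also discuss copyright and authorship issues related to generative models. Furthermore, we highlight the need for a comprehensive evaluation of creativity that is process-driven and considers multiple dimensions of creativity. Finally, drawing inspiration from cognitive science and psychology, we propose future research directions for improving the creativity of AI outputs.||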
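|**2024-10-22**|[Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding](http://arxiv.org/abs/2410.17173)|**[link](https://github.com/flagshippioneering/pi-rldif)**|Protein inverse folding, that is, predicting an amino acid sequence that will fold into a desired 3D structure, is an important problem in structure-based protein design. Machine learning based methods for inverse folding typically use recovery of the original sequence as the optimization objective. However, inverse folding is a one-to-many problem in which several sequences can fold into the same structure. Moreover, for many practical applications it is desirable to have multiple, diverse sequences that fold into the target structure, since this provides more candidate sequences for downstream optimization. Here, we demonstrate that although recent inverse folding methods show increased sequence recovery, their "foldable diversity", i.e., their ability to generate multiple non-similar sequences that fold into structures consistent with the target, does not increase. To address this, we present RL-DIF, a categorical diffusion model for inverse folding that is pre-trained on sequence recovery and tuned via reinforcement learning on structural consistency. We find that RL-DIF achieves sequence recovery and structural consistency comparable to benchmark models but shows greater foldable diversity: experiments show that RL-DIF can achieve a foldable diversity of 29% on CATH 4.2, compared to 23% for models trained on the same dataset. The PyTorch model weights and sampling code are available on GitHub.||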
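|**2024-10-22**|[Hybrid Generative AI for De Novo Design of Co-Crystals with Enhanced Tabletability](http://arxiv.org/abs/2410.17005)|**[link](https://github.com/ai-chem/gemcode)**|Co-crystallization is an accessible way to control the physicochemical properties of organic crystals and has many biomedical applications. In this work, we present Generative Method for Co-crystal Design (GEMCODE), a novel pipeline for automated co-crystal screening based on a hybrid of deep generative models and evolutionary optimization that enables broader exploration of the target chemical space. GEMCODE enables fast de novo design of co-crystals with target tabletability profiles, which is crucial for pharmaceutical development. Through a series of experimental studies highlighting validation and discovery cases, we show that GEMCODE is effective even under realistic computational constraints. Furthermore, we explore the potential of language models for generating co-crystals. Finally, we present numerous previously unknown co-crystals predicted by GEMCODE and discuss its potential for accelerating drug development.||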
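|**2024-10-22**|[DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization](http://arxiv.org/abs/2410.16942)|null|Diffusion models have achieved remarkable progress in image generation thanks to their outstanding capabilities. However, these models require substantial computing resources because of the multi-step denoising process during inference. While traditional pruning methods have been used to optimize these models, the retraining process requires large-scale training datasets and extensive computation to maintain generalization ability, making it neither convenient nor efficient. Recent studies attempt to exploit the similarity of features across adjacent denoising stages to reduce computational cost through simple, static strategies. However, these strategies cannot fully harness the potential of the similar feature patterns across adjacent timesteps. In this work, we propose a novel pruning method that derives an efficient diffusion model via a more intelligent, differentiable pruner. At the core of our approach is casting the model pruning process as a SubNet search process. Specifically, we first introduce a SuperNet built on standard diffusion by adding backup connections based on similar features. We then construct a plugin pruner network and design optimization losses to identify redundant computation. Finally, our method identifies an optimal SubNet through few-step gradient optimization and a simple post-processing procedure. We conduct extensive experiments on various diffusion models, including the Stable Diffusion series and DiTs. Our DiP-GO approach achieves a 4.4x speedup for SD-1.5 without any loss of accuracy, significantly outperforming previous state-of-the-art methods.||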
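|**2024-10-22**|[Hierarchical Clustering for Conditional Diffusion in Image Generation](http://arxiv.org/abs/2410.16910)|**[link](https://github.com/jogo175/treediffusion)**|Finding clusters of data points with similar characteristics and generating new cluster-specific samples can significantly enhance our understanding of complex data distributions. While clustering has been widely explored using Variational Autoencoders, these models often lack generation quality on real-world datasets. This paper addresses this gap by introducing TreeDiffusion, a deep generative model that conditions diffusion models on hierarchical clusters to obtain high-quality, cluster-specific generations. The proposed pipeline consists of two steps: a VAE-based clustering model that learns the hierarchical structure of the data, and a conditional diffusion model that generates realistic images for each cluster. We propose this two-stage process to ensure that the generated samples remain representative of their respective clusters while raising image fidelity to the level of diffusion models. A key strength of our method is its ability to create images for each cluster, which, as the qualitative results show, provides better visualization of the representations learned by the clustering model. This method effectively addresses the generative limitations of VAE-based approaches while preserving their clustering performance. Empirically, we demonstrate that conditioning diffusion models on hierarchical clusters significantly enhances generative performance, thereby advancing the state of generative clustering models.||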
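|**2024-10-22**|[Bayes without Underfitting: Fully Correlated Deep Learning Posteriors via Alternating Projections](http://arxiv.org/abs/2410.16901)|null|Bayesian deep learning all too often underfits, so that the Bayesian prediction is less accurate than a simple point estimate. Uncertainty quantification then comes at the cost of accuracy. For linearized models, the null space of the generalized Gauss-Newton matrix corresponds to parameters that preserve the training predictions of the point estimate. We propose to build Bayesian approximations in this null space, thereby guaranteeing that the Bayesian predictive does not underfit. We present a matrix-free algorithm for projecting onto this null space, which scales linearly with the number of parameters and quadratically with the number of output dimensions. To make the method applicable to generative models, we further propose an approximation that scales only linearly with the number of parameters. An extensive empirical evaluation shows that the approach scales to large models, including vision transformers with 28 million parameters.||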
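|**2024-10-22**|[VistaDream: Sampling multiview consistent images for single-view scene reconstruction](http://arxiv.org/abs/2410.16892)|null|In this paper, we propose VistaDream, a novel framework for reconstructing a 3D scene from a single-view image. Recent diffusion models can generate high-quality novel-view images from a single-view input image. Most existing methods concentrate only on building consistency between the input image and the generated images, while neglecting consistency among the generated images themselves. VistaDream addresses this problem with a two-stage pipeline. In the first stage, VistaDream builds a global coarse 3D scaffold by zooming out slightly, inpainting the boundaries, and estimating a depth map. Then, on this global scaffold, we use iterative diffusion-based RGB-D inpainting to generate novel-view images that fill the holes in the scaffold. In the second stage, we further enhance the consistency among the generated novel-view images with a novel training-free Multiview Consistency Sampling (MCS), which introduces multi-view consistency constraints into the reverse sampling process of diffusion models. Experimental results demonstrate that, without training or fine-tuning existing diffusion models, VistaDream achieves consistent and high-quality novel view synthesis using only single-view images and outperforms baseline methods by a large margin. The code, videos, and interactive demos are available at https://vistadream-project-page.github.io/.||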
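|**2024-10-22**|[CK4Gen: A Knowledge Distillation Framework for Generating High-Utility Synthetic Survival Datasets in Healthcare](http://arxiv.org/abs/2410.16872)|null|Access to real clinical data is heavily restricted by privacy regulations, hindering both healthcare research and education. These constraints slow progress in developing new treatments and data-driven healthcare solutions, while also limiting students' access to real-world datasets and leaving them without essential practical skills. High-utility synthetic datasets are therefore critical for advancing research and providing meaningful training material. However, current generative models, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), produce surface-level realism at the expense of healthcare utility, blending distinct patient profiles and producing synthetic data of limited practical relevance. To overcome these limitations, we introduce CK4Gen (Cox Knowledge for Generation), a novel framework that leverages knowledge distillation from Cox Proportional Hazards (CoxPH) models to create synthetic survival datasets that preserve key clinical characteristics, including hazard ratios and survival curves. CK4Gen avoids the interpolation issues seen in VAEs and GANs by maintaining distinct patient risk profiles, ensuring realistic and reliable outputs for research and educational use. Validated on four benchmark datasets (GBSG2, ACTG320, WHAS500, and FLChain), CK4Gen outperforms competing techniques by better aligning real and synthetic data and, through data augmentation, improves survival model performance in both discrimination and calibration. Because CK4Gen scales across clinical conditions and its code will be made publicly available, future researchers can apply it to their own datasets to generate synthetic versions suitable for open sharing.||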
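|**2024-10-22**|[MPDS: A Movie Posters Dataset for Image Generation with Diffusion Model](http://arxiv.org/abs/2410.16840)|null|Movie posters are vital for attracting audiences, conveying themes, and driving market competition in the film industry. While traditional design is laborious and time-consuming, intelligent generation technology can improve efficiency and enhance design. Despite exciting progress in image generation, current models often fall short of producing satisfactory poster results. The primary issue lies in the absence of specialized poster datasets for model training. In this work, we propose a Movie Posters DataSet (MPDS), tailored for text-to-image generation models, with the aim of revolutionizing poster production. To the best of our knowledge, MPDS is the first image-text pair dataset dedicated to posters, comprising over 373k image-text pairs and over 8k actor images (covering over 4k actors). Detailed poster descriptions, such as movie titles, genres, casts, and synopses, are carefully organized and standardized based on public movie synopses, referred to as movie-synopsis prompts. To enrich the poster descriptions and reduce their discrepancy with the movie synopsis, we use a large vision-language model to automatically generate a vision-perceptive prompt for each poster, followed by manual rectification and integration with the movie-synopsis prompt. In addition, we introduce a poster-caption prompt to represent the text elements in posters, such as actor names and movie titles. For movie poster generation, we develop a multi-condition diffusion framework that takes the poster prompt, poster caption, and actor image (for personalization) as inputs and yields excellent results through the learning of a diffusion model. Experiments demonstrate the valuable role of our proposed MPDS dataset in advancing personalized movie poster generation. MPDS is available at https://anonymous.4open.science/r/MPDS-373k-BD3B.||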
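|**2024-10-22**|[Bridging Search and Recommendation in Generative Retrieval: Does One Task Help the Other?](http://arxiv.org/abs/2410.16823)|null|Generative retrieval for search and recommendation is a promising paradigm for retrieving items, offering an alternative to traditional methods that depend on external indexes and nearest-neighbor searches. Instead, generative models directly associate inputs with item IDs. Given the breakthroughs of Large Language Models (LLMs), these generative systems can play a crucial role in centralizing a variety of Information Retrieval (IR) tasks in a single model that performs tasks such as query understanding, retrieval, recommendation, explanation, re-ranking, and response generation. Despite the growing interest in such a unified generative approach for IR systems, the advantages of using a single, multi-task model over multiple specialized models are not well established in the literature. This paper investigates whether and when such a unified approach can outperform task-specific models in the IR tasks of search and recommendation, which are broadly present in multiple industrial online platforms such as Spotify, YouTube, and Netflix. Previous work shows that (1) the latent representations of items learned by generative recommenders are biased towards popularity, and (2) content-based and collaborative-filtering-based information can improve an item's representations. Motivated by this, our study is guided by two hypotheses: [H1] joint training regularizes the estimation of each item's popularity, and [H2] joint training regularizes the item's latent representations, where search captures content-based aspects of an item and recommendation captures collaborative-filtering aspects. Our extensive experiments with both simulated and real-world data support both [H1] and [H2] as key contributors to the effectiveness improvements observed in the unified search and recommendation generative models over the single-task approaches.||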
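|**2024-10-18**|[BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities](http://arxiv.org/abs/2410.14672)|**[link](https://github.com/haoosz/BiGR)**|We introduce BiGR, a novel conditional image generation model that uses compact binary latent codes for generative training, focusing on enhancing both generation and representation capabilities. BiGR is the first conditional generative model that unifies generation and discrimination within the same framework. BiGR features a binary tokenizer, a masked modeling mechanism, and a binary transcoder for binary code prediction. Additionally, we introduce a novel entropy-ordered sampling method to enable efficient image generation. Extensive experiments validate BiGR's superior performance in generation quality, as measured by FID-50k, and in representation capability, as evidenced by linear-probe accuracy. Moreover, BiGR demonstrates zero-shot generalization across various vision tasks, enabling applications such as image inpainting, outpainting, editing, interpolation, and enrichment without the need for structural modifications. Our findings suggest that BiGR unifies generative and discriminative tasks effectively, paving the way for further advancements in the field.||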
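|**2024-10-18**|[How Does Data Diversity Shape the Weight Landscape of Neural Networks?](http://arxiv.org/abs/2410.14602)|null|To enhance the generalization of machine learning models to unseen data, techniques such as dropout, weight decay ($L_2$ regularization), and noise augmentation are commonly employed. While regularization methods (i.e., dropout and weight decay) are geared toward adjusting model parameters to prevent overfitting, data augmentation increases the diversity of the input training set, a method purported to improve accuracy and calibration error. In this paper, we investigate the impact of each of these techniques on the parameter space of neural networks, with the goal of understanding how they alter the weight landscape in transfer learning scenarios. To this end, we employ Random Matrix Theory to analyze the eigenvalue distributions of pre-trained models fine-tuned with these techniques, using different levels of data diversity, for the same downstream tasks. We observe that diverse data influences the weight landscape in a similar fashion to dropout. Additionally, we compare commonly used data augmentation methods with synthetic data created by generative models. We conclude that synthetic data can bring more diversity to real input data, resulting in better performance on out-of-distribution test instances.||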
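|**2024-10-18**|[Bayesian Multi-wavelength Imaging of the LMC SN1987A with SRG/eROSITA](http://arxiv.org/abs/2410.14599)|null|The EDR and eRASS1 data have already revealed a remarkable number of previously undiscovered X-ray sources. Using Bayesian inference and generative modeling techniques for X-ray imaging, we aim to increase the sensitivity and scientific value of these observations by denoising, deconvolving, and decomposing the X-ray sky. Leveraging information field theory, we can exploit the spatial and spectral correlation structures of the different physical components of the sky with non-parametric priors to enhance the image reconstruction. By incorporating instrumental effects into the forward model, we develop a comprehensive Bayesian imaging algorithm for eROSITA pointing observations. Finally, we apply the developed algorithm to EDR data of the LMC SN1987A, fusing datasets observed by five different telescope modules. The final result is a denoised, deconvolved, and decomposed view of the LMC, which enables the analysis of its fine-scale structures, the creation of point source catalogues of this region, and enhanced calibration for future work.||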
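|**2024-10-18**|[Neuro-Symbolic Traders: Assessing the Wisdom of AI Crowds in Markets](http://arxiv.org/abs/2410.14587)|null|Deep generative models are becoming increasingly used as tools for financial analysis. However, it is unclear how these models will influence financial markets, especially when they infer financial value in a semi-autonomous way. In this work, we explore the interplay between deep generative models and market dynamics. We develop a form of virtual traders that use deep generative models to make buy/sell decisions, which we term neuro-symbolic traders, and expose them to a virtual market. Under our framework, neuro-symbolic traders are agents that use vision-language models to discover a model of the fundamental value of an asset. Agents develop this model as a stochastic differential equation, calibrated to market data using gradient descent. We test our neuro-symbolic traders on both synthetic data and real financial time series, including stocks, commodities, and foreign exchange pairs. We then expose several groups of neuro-symbolic traders to a virtual market environment. This market environment allows for feedback between the traders' beliefs about the underlying value and the observed price dynamics. We find that this leads to price suppression compared to the historical data, highlighting a future risk to market stability. Our work is a first step toward quantifying the effect of deep generative agents on market dynamics and sets out some of the potential risks and benefits of this approach in the future.||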
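|**2024-10-18**|[Multi-modal Pose Diffuser: A Multimodal Generative Conditional Pose Prior](http://arxiv.org/abs/2410.14540)|null|The Skinned Multi-Person Linear (SMPL) model plays a crucial role in 3D human pose estimation, providing a streamlined yet effective representation of the human body. However, ensuring the validity of SMPL configurations in tasks such as human mesh regression remains a significant challenge, highlighting the need for a robust human pose prior capable of discerning realistic human poses. To address this, we introduce MOPED: Multi-mOdal PosE Diffuser. MOPED is the first method to leverage a novel multi-modal conditional diffusion model as a prior for SMPL pose parameters. Our method offers powerful unconditional pose generation with the ability to condition on multi-modal inputs such as images and text. This capability enhances the applicability of our approach by incorporating additional context often overlooked in traditional pose priors. Extensive experiments across three distinct tasks (pose estimation, pose denoising, and pose completion) demonstrate that our multi-modal diffusion model-based prior significantly outperforms existing methods. These results indicate that our model captures a broader spectrum of plausible human poses.||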
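|**2024-10-18**|[LEAD: Latent Realignment for Human Motion Diffusion](http://arxiv.org/abs/2410.14508)|null|Our goal is to generate realistic human motion from natural language. Modern methods often face a trade-off between model expressiveness and text-to-motion alignment. Some align the text and motion latent spaces but sacrifice expressiveness; others rely on diffusion models that produce impressive motions but lack semantic meaning in their latent space. This may compromise realism, diversity, and applicability. Here, we address this by combining latent diffusion with a realignment mechanism, producing a novel, semantically structured space that encodes the semantics of language. Leveraging this capability, we introduce the task of textual motion inversion to capture novel motion concepts from a few examples. For motion synthesis, we evaluate LEAD on HumanML3D and KIT-ML and show performance comparable to the state of the art in terms of realism, diversity, and text-motion consistency. Our qualitative analysis and user study reveal that our synthesized motions are sharper, more human-like, and comply better with the text than those of modern methods. For textual motion inversion, our method demonstrates an improved capacity for capturing out-of-distribution characteristics compared to traditional VAEs.||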
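|**2024-10-18**|[Reinforcement Learning in Non-Markov Market-Making](http://arxiv.org/abs/2410.14504)|null|We develop a deep reinforcement learning (RL) framework for an optimal market-making (MM) trading problem, focusing specifically on price processes with semi-Markov and Hawkes jump-diffusion dynamics. We begin by discussing the basics of RL and the deep RL framework used, in which we deploy the state-of-the-art Soft Actor-Critic (SAC) algorithm for the deep learning part. SAC is an off-policy entropy-maximization algorithm well suited to complex, high-dimensional problems with continuous state and action spaces, such as optimal market making. We then introduce the optimal MM problem considered, detailing all the deterministic and stochastic processes that go into setting up an environment for simulating this strategy. Here we also give an in-depth overview of the jump-diffusion pricing dynamics used and our method for dealing with adverse selection within the limit order book, and we highlight the individual components of the optimization problem. Next, we discuss training and testing results, with visualizations of how important deterministic and stochastic processes, such as the bid/ask spread, trade executions, inventory, and the reward function, evolve. We also discuss the limitations of these results, which are important points to note for most diffusion models in this setting.||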
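|**2024-10-18**|[Data-driven topology design with persistent homology for enhancing population diversity](http://arxiv.org/abs/2410.14496)|null|This paper proposes a selection strategy for enhancing population diversity in data-driven topology design (DDTD), a topology optimization framework based on evolutionary algorithms (EAs) using a deep generative model. While population diversity is essential for global search with EAs, conventional selection operators that preserve diverse solutions based on objective values may still lead to a loss of population diversity in topology optimization problems, owing to the high dimensionality of the design variable space and the strong nonlinearity of the evaluation functions. Motivated by the idea that topology is what characterizes the inherent diversity among material distributions, we employ a topological data analysis method called persistent homology. As a specific operation, a Wasserstein distance sorting between persistence diagrams is introduced into the selection algorithm to maintain intrinsic population diversity. We apply the proposed selection operation, incorporated into DDTD, to a stress-based topology optimization problem as a numerical example. The results confirm that topology can be analyzed using persistent homology and that the proposed selection operation significantly enhances the search performance of DDTD.||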
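|**2024-10-18**|[ANT: Adaptive Noise Schedule for Time Series Diffusion Models](http://arxiv.org/abs/2410.14488)|**[link](https://github.com/seunghan96/ant)**|Advances in diffusion models for generative artificial intelligence have recently propagated to the time series (TS) domain, demonstrating state-of-the-art performance on various tasks. However, prior work on TS diffusion models often borrows the framework of existing work proposed in other domains without considering the characteristics of TS data, leading to suboptimal performance. In this work, we propose Adaptive Noise schedule for Time series diffusion models (ANT), which automatically predetermines a proper noise schedule for a given TS dataset based on statistics representing its non-stationarity. Our intuition is that an optimal noise schedule should satisfy the following desiderata: 1) it linearly reduces the non-stationarity of the TS data so that all diffusion steps are equally meaningful, 2) the data is corrupted to random noise at the final step, and 3) the number of steps is sufficiently large. The proposed method is practical because it eliminates the need to search for the optimal noise schedule, at the extra cost of computing statistics for a given dataset, which can be done offline before training. We validate the effectiveness of our method across various tasks, including TS forecasting, refinement, and generation, on datasets from diverse domains. Code is available at this repository: https://github.com/seunghan96/ANT.||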
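|**2024-10-18**|[CaTs and DAGs: Integrating Directed Acyclic Graphs with Transformers and Fully-Connected Neural Networks for Causally Constrained Predictions](http://arxiv.org/abs/2410.14485)|**[link](https://github.com/matthewvowels1/causal_transformer)**|Artificial Neural Networks (ANNs), including fully-connected networks and transformers, are highly flexible and powerful function approximators, widely applied in fields like computer vision and natural language processing. However, their inability to inherently respect causal structures can limit their robustness, making them vulnerable to covariate shift and difficult to interpret. This poses significant challenges for their reliability in real-world applications. In this paper, we introduce Causal Fully-Connected Neural Networks (CFCNs) and Causal Transformers (CaTs), two general model families designed to operate under predefined causal constraints, as specified by a Directed Acyclic Graph (DAG). These models retain the powerful function approximation abilities of traditional neural networks while adhering to the underlying structural constraints, improving robustness, reliability, and interpretability at inference time. This approach opens new avenues for deploying neural networks in more demanding, real-world scenarios where robustness and explainability are critical.||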
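|**2024-10-17**|[Diffusing States and Matching Scores: A New Framework for Imitation Learning](http://arxiv.org/abs/2410.13855)|**[link](https://github.com/ziqian2000/smiling)**|Adversarial Imitation Learning is traditionally framed as a two-player zero-sum game between a learner and an adversarially chosen cost function, and can therefore be thought of as the sequential generalization of a Generative Adversarial Network (GAN). A prominent example of this framework is Generative Adversarial Imitation Learning (GAIL). However, in recent years, diffusion models have emerged as a non-adversarial alternative to GANs that merely require training a score function via regression, yet produce generations of higher quality. In response, we investigate how to lift insights from diffusion modeling to the sequential setting. We propose diffusing states and performing score matching along the diffused states to measure the discrepancy between the expert's and the learner's states. Our approach therefore only requires training score functions to predict noise via standard regression, making it significantly easier and more stable to train than adversarial methods. Theoretically, we prove first- and second-order instance-dependent bounds with linear scaling in the horizon, showing that our approach avoids the compounding errors that hinder offline approaches to imitation learning. Empirically, we show that our approach outperforms GAN-style imitation learning baselines across various continuous control problems, including complex tasks such as controlling humanoids to walk, sit, and crawl.||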
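|**2024-10-17**|[Influence Functions for Scalable Data Attribution in Diffusion Models](http://arxiv.org/abs/2410.13850)|null|Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by developing an *influence functions* framework. Influence-function-based data attribution methods approximate how a model's output would have changed if some training data were removed. In supervised learning, this is usually used to predict how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we systematically develop K-FAC approximations based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We recast previously proposed methods as specific design choices in our framework and show that our recommended method outperforms previous data attribution approaches on common evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.||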
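|**2024-10-17**|[VidPanos: Generative Panoramic Videos from Casual Panning Videos](http://arxiv.org/abs/2410.13832)|null|Panoramic image stitching provides a unified, wide-angle view of a scene that extends beyond the camera's field of view. Stitching frames of a panning video into a panoramic photograph is a well-understood problem for stationary scenes, but when objects are moving, a still panorama cannot capture the scene. We present a method for synthesizing a panoramic video from a casually captured panning video, as if the original video had been captured with a wide-angle camera. We pose panorama synthesis as a space-time outpainting problem, where we aim to create a full panoramic video of the same length as the input video. Consistent completion of the space-time volume requires a powerful, realistic prior over video content and motion, for which we adapt generative video models. Existing generative models do not, however, immediately extend to panorama completion, as we show. We instead apply video generation as a component of our panorama synthesis system and demonstrate how to exploit the strengths of the models while minimizing their limitations. Our system can create video panoramas for a range of in-the-wild scenes, including people, vehicles, and flowing water, as well as stationary background features.||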
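|**2024-10-17**|[Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning](http://arxiv.org/abs/2410.13823)|**[link](https://github.com/junzhin/dgm-vlc)**|Deep generative models have significantly advanced medical imaging analysis by enhancing dataset size and quality. Beyond mere data augmentation, our research focuses on another important capability of deep generative models: uncovering and demonstrating patterns in medical images. We employ a generative structure with hybrid conditions, combining clinical data and segmentation masks to guide the image synthesis process. Furthermore, we innovatively transform tabular clinical data into textual descriptions. This approach simplifies the handling of missing values and enables us to leverage large pre-trained vision-language models that investigate the relations between independent clinical entries and comprehend general terms such as gender and smoking status. Because the visual correlation between our clinical information and the images is low, our approach differs from conventional medical-report-guided synthesis and presents a more challenging task. To overcome this, we introduce a text-visual embedding mechanism to strengthen the conditions, ensuring the network effectively utilizes the provided information. Our pipeline is generalizable to both GAN-based and diffusion models. Experiments on chest CT, with a particular focus on smoking status, demonstrate a consistent intensity shift in the lungs that agrees with clinical observations, indicating that our method can effectively capture and visualize the impact of specific attributes on medical image patterns. Our methods offer a new avenue for the early detection and precise visualization of complex clinical conditions with deep generative models. All code is available at https://github.com/junzhin/DGM-VLC.||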
|**2024-10-17**|[ConsisSR: Delving Deep into Consistency in Diffusion-based Image Super-Resolution](http://arxiv.org/abs/2410.13807)|null|Real-world image super-resolution (Real-ISR) aims at restoring high-quality (HQ) images from low-quality (LQ) inputs corrupted by unknown and complex degradations. In particular, pretrained text-to-image (T2I) diffusion models provide strong generative priors to reconstruct credible and intricate details. However, T2I generation focuses on semantic consistency while Real-ISR emphasizes pixel-level reconstruction, which hinders existing methods from fully exploiting diffusion priors. To address this challenge, we introduce ConsisSR to handle both semantic and pixel-level consistency. Specifically, compared to coarse-grained text prompts, we exploit the more powerful CLIP image embedding and effectively leverage both modalities through our Hybrid Prompt Adapter (HPA) for semantic guidance. Secondly, we introduce Time-aware Latent Augmentation (TALA) to mitigate the inherent gap between T2I generation and Real-ISR consistency requirements. By randomly mixing LQ and HQ latent inputs, our model not only handles timestep-specific diffusion noise but also refines the accumulated latent representations. Last but not least, our GAN-Embedding strategy employs the pretrained Real-ESRGAN model to refine the diffusion start point. This accelerates the inference process to 10 steps while preserving sampling quality, in a training-free manner. Our method demonstrates state-of-the-art performance among both full-scale and accelerated models. The code will be made publicly available.||
|**2024-10-17**|[Probing the Latent Hierarchical Structure of Data via Diffusion Models](http://arxiv.org/abs/2410.13770)|null|High-dimensional data must be highly structured to be learnable. Although the compositional and hierarchical nature of data is often put forward to explain learnability, quantitative measurements establishing these properties are scarce. Likewise, accessing the latent variables underlying such a data structure remains a challenge. In this work, we show that forward-backward experiments in diffusion-based models, where data is noised and then denoised to generate new samples, are a promising tool to probe the latent structure of data. We predict in simple hierarchical models that, in this process, changes in data occur by correlated chunks, with a length scale that diverges at a noise level where a phase transition is known to take place. Remarkably, we confirm this prediction in both text and image datasets using state-of-the-art diffusion models. Our results show how latent variable changes manifest in the data and establish how to measure these effects in real data using diffusion models.||
|**2024-10-17**|[Theory on Score-Mismatched Diffusion Models and Zero-Shot Conditional Samplers](http://arxiv.org/abs/2410.13746)|null|The denoising diffusion model has recently emerged as a powerful generative technique, capable of transforming noise into meaningful data. While theoretical convergence guarantees for diffusion models are well established when the target distribution aligns with the training distribution, practical scenarios often present mismatches. One common case is in zero-shot conditional diffusion sampling, where the target conditional distribution is different from the (unconditional) training distribution. These score-mismatched diffusion models remain largely unexplored from a theoretical perspective. In this paper, we present the first performance guarantee with explicit dimensional dependencies for general score-mismatched diffusion samplers, focusing on target distributions with finite second moments. We show that score mismatches result in an asymptotic distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions. This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise. Interestingly, the derived convergence upper bound offers useful guidance for designing a novel bias-optimal zero-shot sampler in linear conditional models that minimizes the asymptotic bias. For such bias-optimal samplers, we further establish convergence guarantees with explicit dependencies on dimension and conditioning, applied to several interesting target distributions, including those with bounded support and Gaussian mixtures. Our findings are supported by numerical studies.||
|**2024-10-17**|[Improved Convergence Rate for Diffusion Probabilistic Models](http://arxiv.org/abs/2410.13738)|null|Score-based diffusion models have achieved remarkable empirical performance in the field of machine learning and artificial intelligence for their ability to generate high-quality new data instances from complex distributions. Improving our understanding of diffusion models, chiefly through convergence analysis, has attracted considerable interest. Despite many theoretical attempts, a significant gap between theory and practice remains. To close this gap, we establish an iteration complexity of order $d^{1/3}\varepsilon^{-2/3}$, which is better than $d^{5/12}\varepsilon^{-1}$, the best known complexity achieved before our work. This convergence analysis is based on a randomized midpoint method, which was first proposed for log-concave sampling (Shen and Lee, 2019) and then extended to diffusion models by Gupta et al. (2024). Our theory accommodates $\varepsilon$-accurate score estimates and does not require log-concavity of the target distribution. Moreover, the algorithm can also be parallelized to run in only $O(\log^2(d/\varepsilon))$ parallel rounds, in a similar way to prior works.||
|**2024-10-17**|[Optimizing Probabilistic Conformal Prediction with Vectorized Non-Conformity Scores](http://arxiv.org/abs/2410.13735)|null|Generative models have shown significant promise in critical domains such as medical diagnosis, autonomous driving, and climate science, where reliable decision-making hinges on accurate uncertainty quantification. While probabilistic conformal prediction (PCP) offers a powerful framework for this purpose, its coverage efficiency -- the size of the uncertainty set -- is limited when dealing with complex underlying distributions and a finite number of generated samples. In this paper, we propose a novel PCP framework that enhances efficiency by first vectorizing the non-conformity scores with ranked samples and then optimizing the shape of the prediction set by varying the quantiles for samples at the same rank. Our method delivers valid coverage while producing discontinuous and more efficient prediction sets, making it particularly suited for high-stakes applications. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets.||
|**2024-10-17**|[DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation](http://arxiv.org/abs/2410.13726)|**[link](https://github.com/hanbo-cheng/dawn-pytorch)**|Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at https://github.com/Hanbo-Cheng/DAWN-pytorch.||
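|**2024-10-15**|[High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion](http://arxiv.org/abs/2410.11838)|null|Despite recent progress, existing frame interpolation methods still struggle with extremely high-resolution input and with challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce HiFI, a patch-based cascaded pixel diffusion model for frame interpolation that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low to high resolution, can help significantly with large or complex motion that requires both global context for a coarse solution and detailed context for high-resolution output. However, contrary to prior work on cascaded diffusion models, which perform diffusion at increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and the prior solution. We show that this technique drastically reduces memory usage at inference time and also allows us to use a single model at test time, solving both frame interpolation and spatial upsampling and thereby saving training cost. We show that HiFI helps significantly with high-resolution input and complex repeated textures that require global context. HiFI demonstrates comparable or better performance than the state of the art on several benchmarks (Vimeo, Xiph, X-Test, SEPE-8K). On our newly introduced dataset, which focuses on particularly challenging cases, HiFI also significantly outperforms other baselines. Please visit our project page for video results: https://hifi-diffusion.github.io||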
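|**2024-10-15**|[On the Effectiveness of Dataset Alignment for Fake Image Detection](http://arxiv.org/abs/2410.11835)|null|As latent diffusion models (LDMs) democratize image generation capabilities, there is a growing need to detect fake images. A good detector should focus on the generative model's fingerprints while ignoring image properties such as semantic content, resolution, and file format. Fake image detectors are usually built in a data-driven way, where a model is trained to separate real from fake images. Existing works primarily investigate network architecture choices and training recipes. In this work, we argue that in addition to these algorithmic choices, a well-aligned dataset of real and fake images is also needed to train a robust detector. For the family of LDMs, we propose a very simple way to achieve this: we reconstruct all the real images using the LDM's autoencoder, without any denoising operation. We then train a model to separate these real images from their reconstructions. Fake images created this way are extremely similar to the real ones in almost every aspect (e.g., size, aspect ratio, semantic content), which forces the model to look for the LDM decoder's artifacts. We empirically show that this way of creating aligned real/fake datasets, which also sidesteps the computationally expensive denoising process, helps build a detector that focuses less on spurious correlations, to which a very popular existing method is susceptible. Finally, to demonstrate the effectiveness of alignment in a dataset, we build a detector using images that are not natural objects and present promising results. Overall, our work identifies the subtle but significant issues that arise when training a fake image detector and proposes a simple and inexpensive solution to address them.||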
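|**2024-10-15**|[Bayesian Experimental Design via Contrastive Diffusions](http://arxiv.org/abs/2410.11826)|**[link](https://github.com/jcopo/ContrastiveDiffusions)**|Bayesian Optimal Experimental Design (BOED) is a powerful tool for reducing the cost of running a sequence of experiments. When based on the Expected Information Gain (EIG), design optimization corresponds to the maximization of some intractable expected "contrast" between prior and posterior distributions. Scaling this maximization to high-dimensional and complex settings has been an issue due to BOED's inherent computational complexity. In this work, we introduce an "expected posterior" distribution with cost-effective sampling properties and provide tractable access to the EIG contrast maximization via a new EIG gradient expression. Diffusion-based samplers are used to compute the dynamics of the expected posterior, and ideas from bi-level optimization are leveraged to derive an efficient joint sampling-optimization loop, without resorting to lower-bound approximations of the EIG. The resulting efficiency gain allows BOED to be extended to the well-tested generative capabilities of diffusion models. By incorporating generative models into the BOED framework, we expand its scope and its use in scenarios that were previously impractical. Numerical experiments and comparisons with state-of-the-art methods show the potential of the approach.||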
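|**2024-10-15**|[KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities](http://arxiv.org/abs/2410.11824)|null|Recent advancements in text-to-image generation have significantly enhanced the quality of synthesized images. Despite this progress, evaluations predominantly focus on aesthetic appeal or alignment with text prompts. Consequently, there is limited understanding of whether these models can accurately represent a wide variety of realistic visual entities, a task requiring real-world knowledge. To address this gap, we propose a benchmark focused on evaluating knowledge-intensive image generation on real-world entities (i.e., KITTEN). Using KITTEN, we conduct a systematic study of entity fidelity in text-to-image generation models, focusing on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals. We evaluate the latest text-to-image models and retrieval-augmented customization models using both automatic metrics and carefully designed human evaluations, with an emphasis on the fidelity of entities in the generated images. Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details. Although retrieval-augmented models can enhance the fidelity of entities by incorporating reference images during testing, they often over-rely on these references and struggle to produce novel configurations of the entity as requested in creative text prompts.||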
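|**2024-10-15**|[Improving Long-Text Alignment for Text-to-Image Diffusion Models](http://arxiv.org/abs/2410.11817)|**[link](https://github.com/luping-liu/longalign)**|The rapid advancement of text-to-image (T2I) diffusion models has enabled them to generate unprecedented results from given texts. However, as text inputs become longer, existing encoding methods like CLIP face limitations, and aligning the generated images with long texts becomes challenging. To tackle these issues, we propose LongAlign, which includes a segment-level encoding method for processing long texts and a decomposed preference optimization method for effective alignment training. For segment-level encoding, long texts are divided into multiple segments and processed separately. This method overcomes the maximum input length limits of pretrained encoding models. For preference optimization, we provide decomposed CLIP-based preference models to fine-tune diffusion models. Specifically, to utilize CLIP-based preference models for T2I alignment, we delve into their scoring mechanisms and find that the preference scores can be decomposed into two components: a text-relevant part that measures T2I alignment and a text-irrelevant part that assesses other visual aspects of human preference. Additionally, we find that the text-irrelevant part contributes to a common overfitting problem during fine-tuning. To address this, we propose a reweighting strategy that assigns different weights to these two components, thereby reducing overfitting and enhancing alignment. After fine-tuning $512 \times 512$ Stable Diffusion (SD) v1.5 for about 20 hours using our method, the fine-tuned SD outperforms stronger foundation models in T2I alignment, such as PixArt-$\alpha$ and Kandinsky v2.2. The code is available at https://github.com/luping-liu/LongAlign.||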
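|**2024-10-15**|[SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing](http://arxiv.org/abs/2410.11815)|null|Scene graphs offer a structured, hierarchical representation of images, with nodes and edges symbolizing objects and the relationships among them. They can serve as a natural interface for image editing, dramatically improving precision and flexibility. Leveraging this benefit, we introduce a new framework that integrates a large language model (LLM) with a Text2Image generative model for scene graph-based image editing. This integration enables precise modifications at the object level and creative recomposition of scenes without compromising overall image integrity. Our approach involves two primary stages: 1) Utilizing an LLM-driven scene parser, we construct an image's scene graph, capturing key objects and their interrelationships, as well as parsing fine-grained attributes such as object masks and descriptions. These annotations facilitate concept learning with a fine-tuned diffusion model, representing each object with an optimized token and a detailed description prompt. 2) During the image editing phase, an LLM editing controller guides the edits towards specific areas. These edits are then implemented by an attention-modulated diffusion editor, utilizing the fine-tuned model to perform object additions, deletions, replacements, and adjustments. Through extensive experiments, we demonstrate that our framework significantly outperforms existing image editing methods in terms of editing precision and scene aesthetics.||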
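|**2024-10-15**|[Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices](http://arxiv.org/abs/2410.11795)|null|As one of the most popular and sought-after generative models of recent years, diffusion models have sparked the interest of many researchers thanks to their solid theoretical foundation and reliable application practices, and they have shown excellent performance on various generative tasks such as image synthesis, video generation, molecule design, 3D scene rendering, and multimodal generation. The remarkable success of these recent diffusion-based efforts comes largely from progressive design principles and efficient architecture, training, inference, and deployment methodologies. However, there has not been a comprehensive and in-depth review that summarizes these principles and practices to support the rapid understanding and application of diffusion models. In this survey, we provide a new efficiency-oriented perspective on existing efforts, focusing mainly on the profound principles and efficient practices in architecture design, model training, fast inference, and reliable deployment, to guide further theoretical research, algorithm migration, and model application for new scenarios in a reader-friendly way. https://github.com/ponyzym/Efficient-DMs-Survey||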
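|**2024-10-15**|[Probabilistic Principles for Biophysics and Neuroscience: Entropy Production, Bayesian Mechanics & the Free-Energy Principle](http://arxiv.org/abs/2410.11735)|null|This thesis focuses on three fundamental aspects of biological systems: entropy production, Bayesian mechanics, and the free-energy principle. The contributions are threefold: 1) We compute the entropy production for a greater class of systems than before, including almost any stationary diffusion process, such as degenerate diffusions where the driving noise does not act on all coordinates of the system. Importantly, this class of systems encompasses Markovian approximations of stochastic differential equations driven by colored noise, which is significant since biological systems at the macro- and meso-scale are generally subject to colored fluctuations. 2) We develop a Bayesian mechanics for biological and physical entities that interact with their environment, in which we give sufficient and necessary conditions for the internal states of something to infer its external states, consistently with variational Bayesian inference in statistics and theoretical neuroscience. 3) We refine the constraints on Bayesian mechanics to obtain a description that is more specific to biological systems, called the free-energy principle. This says that the active and internal states of biological systems unfold as minimizing a quantity known as free energy. The mathematical foundation of the free-energy principle presented here, which proceeds by minimizing free energy given a generative model of external and sensory states, provides a first-principles approach to modeling and simulating behavior in neurobiology and artificial intelligence.||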
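|**2024-10-15**|[Patch-Based Diffusion Models Beat Whole-Image Models for Mismatched Distribution Inverse Problems](http://arxiv.org/abs/2410.11730)|null|Diffusion models have achieved excellent success in solving inverse problems thanks to their ability to learn strong image priors, but existing approaches require a large training dataset of images that should come from the same distribution as the test dataset. When the training and test distributions are mismatched, artifacts and hallucinations can occur in reconstructed images due to the incorrect priors. In this work, we systematically study out-of-distribution (OOD) problems where a known training distribution is first provided. We first study the setting where only a single measurement obtained from the unknown test distribution is available. Next, we study the setting where a very small sample of data belonging to the test distribution is available, and our goal is still to reconstruct an image from a measurement that came from the test distribution. In both settings, we use a patch-based diffusion prior that learns the image distribution solely from patches. Furthermore, in the first setting, we include a self-supervised loss that helps the network output maintain consistency with the measurement. Extensive experiments show that in both settings, the patch-based method can obtain high-quality image reconstructions that outperform whole-image models and can compete with methods that have access to large in-distribution training datasets. Furthermore, we show how whole-image models are prone to memorization and overfitting, leading to artifacts in the reconstructions, while a patch-based model can resolve these issues.||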
|**2024-10-15**|[DeformPAM: Data-Efficient Learning for Long-horizon Deformable Object Manipulation via Preference-based Action Alignment](http://arxiv.org/abs/2410.11584)|**[link](https://github.com/xiaoxiaoxh/DeformPAM)**|In recent years, imitation learning has made progress in the field of robotic manipulation. However, it still faces challenges when dealing with complex long-horizon deformable object tasks, such as high-dimensional state spaces, complex dynamics, and multimodal action distributions. Traditional imitation learning methods often require a large amount of data and encounter distributional shifts and accumulative errors in these tasks. To address these issues, we propose a data-efficient general learning framework (DeformPAM) based on preference learning and reward-guided action selection. DeformPAM decomposes long-horizon tasks into multiple action primitives, utilizes 3D point cloud inputs and diffusion models to model action distributions, and trains an implicit reward model using human preference data. During the inference phase, the reward model scores multiple candidate actions, selecting the optimal action for execution, thereby reducing the occurrence of anomalous actions and improving task completion quality. Experiments conducted on three challenging real-world long-horizon deformable object manipulation tasks demonstrate the effectiveness of this method. Results show that DeformPAM improves both task completion quality and efficiency compared to baseline methods even with limited data. Code and data will be available at https://deform-pam.robotflow.ai.||
|**2024-10-11**|[SceneCraft: Layout-Guided 3D Scene Generation](http://arxiv.org/abs/2410.09049)|**[link](https://github.com/orangesodahub/scenecraft)**|The creation of complex 3D scenes tailored to user specifications has been a tedious and challenging task with traditional 3D modeling tools. Although some pioneering methods have achieved automatic text-to-3D generation, they are generally limited to small-scale scenes with restricted control over the shape and texture. We introduce SceneCraft, a novel method for generating detailed indoor scenes that adhere to textual descriptions and spatial layout preferences provided by users. Central to our method is a rendering-based technique, which converts 3D semantic layouts into multi-view 2D proxy maps. Furthermore, we design a semantic and depth conditioned diffusion model to generate multi-view images, which are used to learn a neural radiance field (NeRF) as the final scene representation. Without the constraints of panorama image generation, we surpass previous methods in supporting complicated indoor space generation beyond a single room, even as complicated as a whole multi-bedroom apartment with irregular shapes and layouts. Through experimental analysis, we demonstrate that our method significantly outperforms existing approaches in complex indoor scene generation with diverse textures, consistent geometry, and realistic visual quality. Code and more results are available at: https://orangesodahub.github.io/SceneCraft||
|**2024-10-11**|[Linear Convergence of Diffusion Models Under the Manifold Hypothesis](http://arxiv.org/abs/2410.09046)|null|Score-matching generative models have proven successful at sampling from complex high-dimensional data distributions. In many applications, this distribution is believed to concentrate on a much lower $d$-dimensional manifold embedded into $D$-dimensional space; this is known as the manifold hypothesis. The current best-known convergence guarantees are either linear in $D$ or polynomial (superlinear) in $d$. The latter exploits a novel integration scheme for the backward SDE. We take the best of both worlds and show that the number of steps diffusion models require in order to converge in Kullback-Leibler (KL) divergence is linear (up to logarithmic terms) in the intrinsic dimension $d$. Moreover, we show that this linear dependency is sharp.||
|**2024-10-11**|[Semantic Score Distillation Sampling for Compositional Text-to-3D Generation](http://arxiv.org/abs/2410.09009)|**[link](https://github.com/yangling0818/semanticsds-3d)**|Generating high-quality 3D assets from textual descriptions remains a pivotal challenge in computer graphics and vision research. Due to the scarcity of 3D data, state-of-the-art approaches utilize pre-trained 2D diffusion priors, optimized through Score Distillation Sampling (SDS). Despite progress, crafting complex 3D scenes featuring multiple objects or intricate interactions is still difficult. To tackle this, recent methods have incorporated box or layout guidance. However, these layout-guided compositional methods often struggle to provide fine-grained control, as they are generally coarse and lack expressiveness. To overcome these challenges, we introduce a novel SDS approach, Semantic Score Distillation Sampling (SemanticSDS), designed to effectively improve the expressiveness and accuracy of compositional text-to-3D generation. Our approach integrates new semantic embeddings that maintain consistency across different rendering views and clearly differentiate between various objects and parts. These embeddings are transformed into a semantic map, which directs a region-specific SDS process, enabling precise optimization and compositional generation. By leveraging explicit semantic guidance, our method unlocks the compositional capabilities of existing pre-trained diffusion models, thereby achieving superior quality in 3D content generation, particularly for complex objects and scenes. Experimental results demonstrate that our SemanticSDS framework is highly effective for generating state-of-the-art complex 3D content. Code: https://github.com/YangLing0818/SemanticSDS-3D||
|**2024-10-11**|[WaveDiffusion: Exploring Full Waveform Inversion via Joint Diffusion in the Latent Space](http://arxiv.org/abs/2410.09002)|null|Full Waveform Inversion (FWI) is a vital technique for reconstructing high-resolution subsurface velocity maps from seismic waveform data, governed by partial differential equations (PDEs) that model wave propagation. Traditional machine learning approaches typically map seismic data to velocity maps by encoding seismic waveforms into latent embeddings and decoding them into velocity maps. In this paper, we introduce a novel framework that reframes FWI as a joint diffusion process in a shared latent space, bridging seismic waveform data and velocity maps. Our approach has two key components: first, we merge the bottlenecks of two separate autoencoders (one for seismic data and one for velocity maps) into a unified latent space using vector quantization to establish a shared codebook. Second, we train a diffusion model in this latent space, enabling the simultaneous generation of seismic and velocity map pairs by sampling and denoising the latent representations, followed by decoding each modality with its respective decoder. Remarkably, our jointly generated seismic-velocity pairs approximately satisfy the governing PDE without any additional constraint, offering a new geometric interpretation of FWI. The diffusion process learns to score the latent space according to its deviation from the PDE, with higher scores representing smaller deviations from the true solutions. By following this diffusion process, the model traces a path from random initialization to a valid solution of the governing PDE. Our experiments on the OpenFWI dataset demonstrate that the generated seismic and velocity map pairs not only exhibit high fidelity and diversity but also adhere to the physical constraints imposed by the governing PDE.||
|**2024-10-11**|[Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory](http://arxiv.org/abs/2410.08942)|null|Synthetic data has gained attention for training large language models, but poor-quality data can harm performance (see, e.g., Shumailov et al. (2023); Seddik et al. (2024)). A potential solution is data pruning, which retains only high-quality data based on a score function (human or machine feedback). Previous work (Feng et al., 2024) analyzed models trained on synthetic data as sample size increases. We extend this by using random matrix theory to derive the performance of a binary classifier trained on a mix of real and pruned synthetic data in a high-dimensional setting. Our findings identify conditions where synthetic data could improve performance, focusing on the quality of the generative model and verification strategy. We also show a smooth phase transition in synthetic label noise, contrasting with prior sharp behavior in infinite sample limits. Experiments with toy models and large language models validate our theoretical results.||
|**2024-10-11**|[DiffPO: A causal diffusion model for learning distributions of potential outcomes](http://arxiv.org/abs/2410.08924)|null|Predicting potential outcomes of interventions from observational data is crucial for decision-making in medicine, but the task is challenging due to the fundamental problem of causal inference. Existing methods are largely limited to point estimates of potential outcomes with no uncertainty quantification; thus, the full information about the distributions of potential outcomes is typically ignored. In this paper, we propose a novel causal diffusion model called DiffPO, which is carefully designed for reliable inferences in medicine by learning the distribution of potential outcomes. In our DiffPO, we leverage a tailored conditional denoising diffusion model to learn complex distributions, where we address the selection bias through a novel orthogonal diffusion loss. Another strength of our DiffPO method is that it is highly flexible (e.g., it can also be used to estimate different causal quantities such as CATE). Across a wide range of experiments, we show that our method achieves state-of-the-art performance.||
|**2024-10-11**|[Conditional Generative Models for Contrast-Enhanced Synthesis of T1w and T1 Maps in Brain MRI](http://arxiv.org/abs/2410.08894)|**[link](https://github.com/Janspiry/Palette-Image-to-Image-Diffusion-Models)**|Contrast enhancement by Gadolinium-based contrast agents (GBCAs) is a vital tool for tumor diagnosis in neuroradiology. Based on brain MRI scans of glioblastoma before and after Gadolinium administration, we address enhancement prediction by neural networks with two new contributions. Firstly, we study the potential of generative models, more precisely conditional diffusion and flow matching, for uncertainty quantification in virtual enhancement. Secondly, we examine the performance of T1 scans from quantitative MRI versus T1-weighted scans. In contrast to T1-weighted scans, these scans have the advantage of a physically meaningful and thereby comparable voxel range. To compare network prediction performance of these two modalities with incompatible gray-value scales, we propose to evaluate segmentations of contrast-enhanced regions of interest using Dice and Jaccard scores. Across models, we observe better segmentations with T1 scans than with T1-weighted scans.||
|**2024-10-11**|[On-Chip Learning via Transformer In-Context Learning](http://arxiv.org/abs/2410.08711)|null|Autoregressive decoder-only transformers have become key components for scalable sequence processing and generation models. However, the transformer's self-attention mechanism requires transferring prior token projections from the main memory at each time step (token), thus severely limiting their performance on conventional processors. Self-attention can be viewed as a dynamic feed-forward layer whose matrix depends on the input sequence, similar to the result of local synaptic plasticity. Using this insight, we present a neuromorphic decoder-only transformer model that utilizes an on-chip plasticity processor to compute self-attention. Interestingly, the training of transformers enables them to "learn" the input context during inference. We demonstrate this in-context learning ability of transformers on the Loihi 2 processor by solving a few-shot classification problem. With this, we emphasize the importance of pretrained models, especially their ability to find simple, local, backpropagation-free learning rules that enable on-chip learning and adaptation in a hardware-friendly manner.||
|**2024-10-11**|[Distillation of Discrete Diffusion through Dimensional Correlations](http://arxiv.org/abs/2410.08709)|null|Diffusion models have demonstrated exceptional performance in various fields of generative modeling. While they often outperform competitors including VAEs and GANs in sample quality and diversity, they suffer from slow sampling speed due to their iterative nature. Recently, distillation techniques and consistency models have been mitigating this issue in continuous domains, but discrete diffusion models face some specific challenges on the way to faster generation. Most notably, in the current literature, correlations between different dimensions (pixels, locations) are ignored by both the models and the loss functions, due to computational limitations. In this paper, we propose "mixture" models in discrete diffusion that are capable of treating dimensional correlations while remaining scalable, and we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: first, that dimensionally independent models can approximate the data distribution well if they are allowed to conduct many sampling steps, and second, that our loss functions enable mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. We empirically demonstrate that our proposed method for discrete diffusion works in practice, by distilling a continuous-time discrete diffusion model pretrained on the CIFAR-10 dataset.||
|**2024-10-11**|[E-Motion: Future Motion Simulation via Event Sequence Diffusion](http://arxiv.org/abs/2410.08649)|**[link](https://github.com/p4r4mount/E-Motion)**|Forecasting a typical object's future motion is a critical task for interpreting and interacting with dynamic environments in computer vision. Event-based sensors, which could capture changes in the scene with exceptional temporal granularity, may potentially offer a unique opportunity to predict future motion with a level of detail and precision previously unachievable. Inspired by that, we propose to integrate the strong learning capacity of the video diffusion model with the rich motion information of an event camera as a motion simulation framework. Specifically, we initially employ pre-trained stable video diffusion models to adapt the event sequence dataset. This process facilitates the transfer of extensive knowledge from RGB videos to an event-centric domain. Moreover, we introduce an alignment mechanism that utilizes reinforcement learning techniques to enhance the reverse generation trajectory of the diffusion model, ensuring improved performance and accuracy. Through extensive testing and validation, we demonstrate the effectiveness of our method in various complex scenarios, showcasing its potential to revolutionize motion flow prediction in computer vision applications such as autonomous vehicle guidance, robotic navigation, and interactive media. Our findings suggest a promising direction for future research in enhancing the interpretative power and predictive accuracy of computer vision systems.||
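|**2024-10-10**|[DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models](http://arxiv.org/abs/2410.08207)|null|Discrete diffusion models have achieved success in tasks such as image generation and masked language modeling, but they face limitations in controllable content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without predefined masks or attention manipulation. We demonstrate the effectiveness of DICE in both the image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. The results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces. The project webpage is at https://hexiaoxiao-cs.github.io/DICE/.||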
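|**2024-10-10**|[HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation](http://arxiv.org/abs/2410.08192)|null|Text-to-image diffusion models have recently achieved remarkable results in creating content from text prompts, but generating personalized instances based on a specific subject, known as subject-driven generation, remains challenging. To address this, we propose a new hybrid framework called HybridBooth, which combines the strengths of optimization-based and direct-regression methods. HybridBooth operates in two stages: word embedding probing and word embedding refinement. Word embedding probing uses a fine-tuned encoder to generate a robust initial word embedding; word embedding refinement further adapts the encoder to the specific subject images by optimizing key parameters. This approach inverts visual concepts into text embeddings effectively and quickly, even from a single image, while preserving the model's generalization ability.||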
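|**2024-10-10**|[DifFRelight: Diffusion-Based Facial Performance Relighting](http://arxiv.org/abs/2410.08188)|null|We present a novel framework for free-viewpoint facial performance relighting based on diffusion-based image-to-image translation. Using a subject-specific dataset containing diverse facial expressions captured under various lighting conditions, including flat-lit and one-light-at-a-time (OLAT) scenarios, we train a diffusion model for precise lighting control that generates high-fidelity relit facial images from flat-lit inputs. Our framework includes spatially aligned conditioning on flat-lit captures and random noise, together with integrated lighting information for global control, and it draws on prior knowledge from a pre-trained Stable Diffusion model. The model is then applied to dynamic facial performances captured in a consistent flat-lit environment and reconstructed for novel-view synthesis with a scalable dynamic 3D Gaussian Splatting method, which maintains the quality and consistency of the relit results. In addition, we introduce unified lighting control by combining a novel area lighting representation with directional lighting, allowing joint adjustment of light size and direction. We also support high dynamic range imaging (HDRI) composition using multiple directional lights to produce dynamic sequences under complex lighting conditions. Our evaluations demonstrate the model's efficiency in achieving precise lighting control and generalizing across facial expressions while preserving detailed features such as skin texture and hair. The model accurately reproduces complex lighting effects such as eye reflections, subsurface scattering, self-shadowing, and translucency, improving photorealism within our framework.||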
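|**2024-10-10**|[ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion](http://arxiv.org/abs/2410.08168)|**[link](https://github.com/lvsn/ZeroComp)**|We present ZeroComp, an effective zero-shot 3D object compositing approach that does not require paired composite-scene images during training. Our method uses ControlNet to condition on intrinsic images and combines it with a Stable Diffusion model to exploit its scene priors; together they act as an effective rendering engine. During training, ZeroComp uses intrinsic images based on geometry, albedo, and masked shading, without needing paired images of scenes with and without composited objects. Once trained, it seamlessly integrates virtual 3D objects into scenes, adjusting shading to create realistic composites. We developed a high-quality evaluation dataset and show that ZeroComp outperforms methods that use explicit lighting estimation and generative techniques on both quantitative and human-perception benchmarks. ZeroComp also extends to real and outdoor image compositing, even though it is trained only on synthetic indoor data, demonstrating its effectiveness for image compositing.||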
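|**2024-10-10**|[DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation](http://arxiv.org/abs/2410.08159)|null|Diffusion models have become the dominant approach to visual generation. They are trained by denoising a Markovian process that gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully use the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a Transformer-based model that unifies autoregressive (AR) modeling and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model with the same architecture as standard language models. DART does not rely on image quantization, which enables more effective image modeling while maintaining flexibility. Furthermore, DART can be trained seamlessly on both text and image data in a unified model. Our approach shows competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models. Through this unified framework, DART sets a new benchmark for scalable, high-quality image synthesis.||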
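|**2024-10-10**|[Progressive Autoregressive Video Diffusion Models](http://arxiv.org/abs/2410.08151)|**[link](https://github.com/desaixie/pa_vdm)**|Current frontier video diffusion models have demonstrated remarkable results in generating high-quality videos. However, because of computational limits during training, they can only generate short clips, typically around 10 seconds or 240 frames. In this work, we show that existing models can be naturally extended to autoregressive video diffusion models without changing their architectures. Our key idea is to assign the latent frames progressively increasing noise levels rather than a single noise level, which allows fine-grained conditioning among the latent frames and large overlaps between the attention windows. This progressive video denoising lets our models generate video frames autoregressively without quality degradation or abrupt scene changes. We present state-of-the-art results on long video generation at 1 minute (1440 frames at 24 FPS). Videos from this paper are available at https://desaixie.github.io/pa-vdm/.||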
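|**2024-10-10**|[Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction](http://arxiv.org/abs/2410.08134)|null|Generative modeling of discrete data underlies many important applications, from text-based agents such as ChatGPT to the design of the basic building blocks of life in protein sequences. However, application domains need to control the generated data by steering the generative process, typically via RLHF, to satisfy a specified property, reward, or affinity metric. In this paper, we study the problem of steering Masked Diffusion Models (MDMs), an emerging class of discrete diffusion models that offers a compelling alternative to traditional autoregressive models. We introduce Discrete Denoising Posterior Prediction (DDPP), a new framework that casts the task of steering pre-trained MDMs as a probabilistic inference problem by learning to sample from a target Bayesian posterior. The DDPP framework yields a family of three new objectives that are all simulation-free, and therefore scalable, while applying to general non-differentiable reward functions. In experiments, we instantiate DDPP by steering MDMs to perform class-conditional pixel-level image modeling, RLHF-based alignment of MDMs using text-based rewards, and fine-tuning of protein language models to generate more diverse secondary structures and shorter proteins. We confirm our designs through wet-lab validation, where we observe transient expression of the reward-optimized protein sequences.||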
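|**2024-10-10**|[Robust AI-Generated Text Detection by Restricted Embeddings](http://arxiv.org/abs/2410.08113)|**[link](https://github.com/silversolver/robustatd)**|The growing amount and quality of AI-generated text makes detecting such content more difficult. In most real-world scenarios, the domain (style and topic) of the generated data and the generator model are not known in advance. In this work, we focus on the robustness of classifier-based detectors of AI-generated text, namely their ability to transfer to unseen generators or semantic domains. We investigate the geometry of the embedding space of Transformer-based text encoders and show that removing harmful linear subspaces helps to train a robust classifier that ignores domain-specific spurious features. We study several subspace decomposition and feature selection strategies and achieve significant improvements over state-of-the-art methods in cross-domain and cross-generator transfer. Our best approaches for head-wise and coordinate-based subspace removal increase the mean out-of-distribution (OOD) classification score by up to 9% and 14% for RoBERTa and BERT embeddings, respectively. We release our code and data: https://github.com/SilverSolver/RobustATD||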
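|**2024-10-10**|[Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models](http://arxiv.org/abs/2410.08074)|null|Text-to-image diffusion models rely on massive web-scale datasets. Training them from scratch is computationally expensive, so developers often prefer to make incremental updates to existing models. These updates typically combine fine-tuning steps (to learn new concepts or improve model performance) with "unlearning" steps (to "forget" existing concepts, such as copyrighted works or explicit content). In this work, we demonstrate a critical and previously unknown vulnerability that arises in this paradigm: even under benign, non-adversarial conditions, fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to "relearn" concepts that were previously "unlearned". We comprehensively investigate the causes and scope of this phenomenon, which we term concept resurgence, through a series of experiments that combine "Mass Concept Erasure" (the current state of the art for unlearning in text-to-image diffusion models (Lu et al., 2024)) with subsequent fine-tuning of Stable Diffusion v1.4. Our findings underscore the fragility of composing incremental model updates and raise serious new concerns about current approaches to ensuring the safety and alignment of text-to-image diffusion models.||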
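|**2024-10-10**|[A Target-Aware Analysis of Data Augmentation for Hate Speech Detection](http://arxiv.org/abs/2410.08053)|null|Hate speech is one of the main threats posed by the widespread use of social networks, despite efforts to limit it. Although the problem has received attention, the lack of datasets and case studies centered on rarely represented phenomena, such as ableism or ageism, can lead hate speech detection systems to perform poorly on underrepresented identity groups. Given the unprecedented ability of large language models (LLMs) to produce high-quality data, we investigate the possibility of augmenting existing data with generative language models in order to reduce target imbalance. We experiment with augmenting 1,000 posts from the Measuring Hate Speech corpus, an English dataset annotated with target identity information, adding around 30,000 synthetic examples using both simple data augmentation methods and different types of generative models, and comparing autoregressive and sequence-to-sequence approaches. We find that traditional data augmentation methods are often preferable to generative models, but that combining the two tends to give the best results. Indeed, for some hate categories, such as origin, religion, and disability, hate speech classification trained with augmented data improves by more than 10% F1 over the no-augmentation baseline. This work contributes to the development of hate speech detection systems that not only perform better but are also fairer and more inclusive toward targets that have been neglected so far.||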
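|**2024-10-07**|[DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control](http://arxiv.org/abs/2410.05260)|null|Text-conditioned human motion generation, which lets users interact through natural language, has become increasingly popular. Existing methods typically generate short, isolated motions based on a single input sentence. However, human motions are continuous, can extend over long periods, and carry rich semantics. Creating long, complex motions that respond precisely to a stream of text descriptions, particularly in an online, real-time setting, remains a significant challenge. Furthermore, incorporating spatial constraints into text-conditioned motion generation poses additional challenges, because it requires aligning the motion semantics specified by text with geometric information such as goal locations and 3D scene geometry. To address these limitations, we propose DART, a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control. DART uses a latent diffusion model to effectively learn a compact motion primitive space jointly conditioned on motion history and text inputs. By autoregressively generating motion primitives from the preceding history and the current text input, DART enables real-time, continuous motion generation driven by natural language descriptions. In addition, the learned motion primitive space allows precise spatial motion control, which we formulate either as a latent noise optimization problem or as a Markov decision process solved with reinforcement learning. We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance across various motion synthesis tasks. Experiments show that our method outperforms existing baselines in motion realism, efficiency, and controllability. Video results are available on the project page: https://zkf1997.github.io/DART/.||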
|**2024-10-07**|[GS-VTON: Controllable 3D Virtual Try-on with Gaussian Splatting](http://arxiv.org/abs/2410.05259)|null|Diffusion-based 2D virtual try-on (VTON) techniques have recently demonstrated strong performance, while the development of 3D VTON has largely lagged behind. Despite recent advances in text-guided 3D scene editing, integrating 2D VTON into these pipelines to achieve vivid 3D VTON remains challenging. The reasons are twofold. First, text prompts cannot provide sufficient details in describing clothing. Second, 2D VTON results generated from different viewpoints of the same 3D scene lack coherence and spatial relationships, hence frequently leading to appearance inconsistencies and geometric distortions. To resolve these problems, we introduce an image-prompted 3D VTON method (dubbed GS-VTON) which, by leveraging 3D Gaussian Splatting (3DGS) as the 3D representation, enables the transfer of pre-trained knowledge from 2D VTON models to 3D while improving cross-view consistency. (1) Specifically, we propose a personalized diffusion model that utilizes low-rank adaptation (LoRA) fine-tuning to incorporate personalized information into pre-trained 2D VTON models. To achieve effective LoRA training, we introduce a reference-driven image editing approach that enables the simultaneous editing of multi-view images while ensuring consistency. (2) Furthermore, we propose a persona-aware 3DGS editing framework to facilitate effective editing while maintaining consistent cross-view appearance and high-quality 3D geometry. (3) Additionally, we have established a new 3D VTON benchmark, 3D-VTONBench, which facilitates comprehensive qualitative and quantitative 3D VTON evaluations. Through extensive experiments and comparative analyses with existing methods, the proposed GS-VTON has demonstrated superior fidelity and advanced editing capabilities, affirming its effectiveness for 3D VTON.||
|**2024-10-07**|[SePPO: Semi-Policy Preference Optimization for Diffusion Alignment](http://arxiv.org/abs/2410.05255)|**[link](https://github.com/dwanzhang-ai/seppo)**|Reinforcement learning from human feedback (RLHF) methods are emerging as a way to fine-tune diffusion models (DMs) for visual generation. However, commonly used on-policy strategies are limited by the generalization capability of the reward model, while off-policy approaches require large amounts of difficult-to-obtain paired human-annotated data, particularly in visual generation tasks. To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data. Specifically, we introduce a Semi-Policy Preference Optimization (SePPO) method. SePPO leverages previous checkpoints as reference models while using them to generate on-policy reference samples, which replace "losing images" in preference pairs. This approach allows us to optimize using only off-policy "winning images." Furthermore, we design a strategy for reference model selection that expands the exploration in the policy space. Notably, we do not simply treat reference samples as negative examples for learning. Instead, we design an anchor-based criterion to assess whether the reference samples are likely to be winning or losing images, allowing the model to selectively learn from the generated reference samples. This approach mitigates performance degradation caused by the uncertainty in reference sample quality. We validate SePPO across both text-to-image and text-to-video benchmarks. SePPO surpasses all previous approaches on the text-to-image benchmarks and also demonstrates outstanding performance on the text-to-video benchmarks. Code will be released in https://github.com/DwanZhang-AI/SePPO.||
|**2024-10-07**|[DiffuseReg: Denoising Diffusion Model for Obtaining Deformation Fields in Unsupervised Deformable Image Registration](http://arxiv.org/abs/2410.05234)|**[link](https://github.com/yutazhuo/diffusereg)**|Deformable image registration aims to precisely align medical images from different modalities or times. Traditional deep learning methods, while effective, often lack interpretability, real-time observability and adjustment capacity during registration inference. Denoising diffusion models present an alternative by reformulating registration as iterative image denoising. However, existing diffusion registration approaches do not fully harness capabilities, neglecting the critical sampling phase that enables continuous observability during the inference. Hence, we introduce DiffuseReg, an innovative diffusion-based method that denoises deformation fields instead of images for improved transparency. We also propose a novel denoising network upon Swin Transformer, which better integrates moving and fixed images with diffusion time step throughout the denoising process. Furthermore, we enhance control over the denoising registration process with a novel similarity consistency regularization. Experiments on ACDC datasets demonstrate DiffuseReg outperforms existing diffusion registration methods by 1.32 in Dice score. The sampling process in DiffuseReg enables real-time output observability and adjustment unmatched by previous deep models.||
|**2024-10-07**|[Avoiding Deadlocks via Weak Deadlock Sets](http://arxiv.org/abs/2410.05175)|null|A deadlock occurs in a network when two or more items prevent each other from moving and are stalled. In a general model, items are stored at vertices and each vertex $v$ has a buffer with $b(v)$ slots. Given a route for each item toward its destination, the Deadlock Safety Problem asks whether the current state is safe, i.e., it is possible to deliver each item to its destination, or is bound to deadlock, i.e., any sequence of moves will end up with a set of items stalled. When $b \geq 2$, the problem is solvable in polynomial time, building upon a nice characterization of YES/NO-instances, whereas it is NP-hard on quite simple graphs such as grids when $b=1$ and on trees when $b\leq 3$. We improve on these results by means of two new tools, weak deadlock sets and wise states. We show that, for general networks and $b$, a state that is wise and has no weak deadlock sets (a condition that can be recognized in polynomial time) is safe; this strengthens the result for $b\geq 2$. We sharpen this result for trees, where we show that a wise state is safe if and only if it has no weak deadlock set. This is particularly interesting in the context of rail transportation, where networks are often single-tracked and deadlock detection and avoidance focuses on local sub-networks, mostly with a tree-like structure. We pose some research questions for future investigations.||
|**2024-10-07**|[Presto! Distilling Steps and Layers for Accelerating Music Generation](http://arxiv.org/abs/2410.05167)|null|Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers that reduces both the number of sampling steps and the cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning by better preserving hidden state variance. Finally, we combine our step and layer distillation methods for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show that each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32 second mono/stereo 44.1kHz, 15x faster than comparable SOTA), making it the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.||
|**2024-10-07**|[A Simulation-Free Deep Learning Approach to Stochastic Optimal Control](http://arxiv.org/abs/2410.05163)|null|We propose a simulation-free algorithm for the solution of generic problems in stochastic optimal control (SOC). Unlike existing methods, our approach does not require the solution of an adjoint problem, but rather leverages the Girsanov theorem to directly calculate the gradient of the SOC objective on-policy. This allows us to speed up the optimization of control policies parameterized by neural networks since it completely avoids the expensive back-propagation step through stochastic differential equations (SDEs) used in the Neural SDE framework. In particular, it enables us to solve SOC problems in high dimension and on long time horizons. We demonstrate the efficiency of our approach in various domains of applications, including standard stochastic optimal control problems, sampling from unnormalized distributions via construction of a Schrödinger-Föllmer process, and fine-tuning of pre-trained diffusion models. In all cases our method is shown to outperform the existing methods in both computing time and memory efficiency.||
|**2024-10-07**|[Leveraging Multimodal Diffusion Models to Accelerate Imaging with Side Information](http://arxiv.org/abs/2410.05143)|null|Diffusion models have found phenomenal success as expressive priors for solving inverse problems, but their extension beyond natural images to more structured scientific domains remains limited. Motivated by applications in materials science, we aim to reduce the number of measurements required from an expensive imaging modality of interest by leveraging side information from an auxiliary modality that is much cheaper to obtain. To deal with the non-differentiable and black-box nature of the forward model, we propose a framework to train a multimodal diffusion model over the joint modalities, turning inverse problems with black-box forward models into simple linear inpainting problems. Numerically, we demonstrate the feasibility of training diffusion models over materials imagery data, and show that our approach achieves superior image reconstruction by leveraging the available side information, requiring significantly less data from the expensive microscopy modality.||
|**2024-10-07**|[Agnostic Smoothed Online Learning](http://arxiv.org/abs/2410.05124)|null|Classical results in statistical learning typically consider two extreme data-generating models: i.i.d. instances from an unknown distribution, or fully adversarial instances, often much more challenging statistically. To bridge the gap between these models, recent work introduced the smoothed framework, in which at each iteration an adversary generates instances from a distribution constrained to have density bounded by $\sigma^{-1}$ compared to some fixed base measure $\mu$. This framework interpolates between the i.i.d. and adversarial cases, depending on the value of $\sigma$. For the classical online prediction problem, most prior results in smoothed online learning rely on the arguably strong assumption that the base measure $\mu$ is known to the learner, contrasting with standard settings in the PAC learning or consistency literature. We consider the general agnostic problem in which the base measure is unknown and values are arbitrary. Along this direction, Block et al. showed that empirical risk minimization has sublinear regret under the well-specified assumption. We propose an algorithm R-Cover based on recursive coverings which is the first to guarantee sublinear regret for agnostic smoothed online learning without prior knowledge of $\mu$. For classification, we prove that R-Cover has adaptive regret $\tilde O(\sqrt{dT/\sigma})$ for function classes with VC dimension $d$ , which is optimal up to logarithmic factors. For regression, we establish that R-Cover has sublinear oblivious regret for function classes with polynomial fat-shattering dimension growth.||
|**2024-10-07**|[Synthetic Generation of Dermatoscopic Images with GAN and Closed-Form Factorization](http://arxiv.org/abs/2410.05114)|null|In the realm of dermatological diagnoses, where the analysis of dermatoscopic and microscopic skin lesion images is pivotal for the accurate and early detection of various medical conditions, the costs associated with creating diverse and high-quality annotated datasets have hampered the accuracy and generalizability of machine learning models. We propose an innovative unsupervised augmentation solution that harnesses Generative Adversarial Network (GAN) based models and associated techniques over their latent space to generate controlled semiautomatically-discovered semantic variations in dermatoscopic images. We created synthetic images to incorporate the semantic variations and augmented the training data with these images. With this approach, we were able to increase the performance of machine learning models and set a new benchmark amongst non-ensemble based models in skin lesion classification on the HAM10000 dataset; and used the observed analytics and generated models for detailed studies on model explainability, affirming the effectiveness of our solution.||
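|**2024-10-04**|[Estimating Body and Hand Motion in an Ego-sensed World](http://arxiv.org/abs/2410.03665)|null|We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture the wearer's actions in the global coordinate frame of the scene. Our key insight lies in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head-motion conditioning parameterization that improves estimation accuracy by up to 18%. We also show how the bodies estimated by our system can improve hand estimation: the resulting kinematic and temporal constraints reduce hand estimation errors by more than 40% compared with noisy monocular estimates. Project page: https://egoallo.github.io/||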
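|**2024-10-04**|[Geometric Representation Condition Improves Equivariant Molecule Generation](http://arxiv.org/abs/2410.03655)|null|Recent advances in molecular generative models have shown great promise for accelerating scientific discovery, particularly in drug design. However, these models often struggle to generate high-quality molecules, especially in conditional settings where specific molecular properties must be satisfied. In this work, we introduce GeoRCG, a general framework that improves the performance of molecular generative models by integrating geometric representation conditions. We decompose molecule generation into two stages: first, generating an informative geometric representation; second, generating a molecule conditioned on that representation. Compared with generating a molecule directly, the representation produced in the first stage is relatively easy to generate, and it guides the second stage toward high-quality molecules in a more goal-oriented and much faster way. Using EDM as the base generator, we observe significant quality improvements in unconditional molecule generation on the widely used QM9 and GEOM-DRUG datasets. More notably, on the challenging conditional molecule generation task, our framework achieves an average 31% performance improvement over state-of-the-art approaches, highlighting the advantage of conditioning on semantically rich geometric representations over conditioning on individual property values as in prior approaches. Furthermore, we show that with such representation guidance the number of diffusion steps can be reduced to as few as 100 while maintaining higher generation quality than that achieved with 1,000 steps, significantly accelerating generation.||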
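|**2024-10-04**|[Real-World Benchmarks Make Membership Inference Attacks Fail on Diffusion Models](http://arxiv.org/abs/2410.03640)|**[link](https://github.com/caradryanl/copymark)**|Membership inference attacks (MIAs) on diffusion models have emerged as potential evidence of unauthorized data usage in training pre-trained diffusion models. These attacks aim to detect whether a specific image is present in a diffusion model's training dataset. Our study evaluates state-of-the-art MIAs on diffusion models in depth and reveals critical flaws and overly optimistic performance estimates in existing MIA evaluation. We introduce CopyMark, a more realistic MIA benchmark that distinguishes itself through support for pre-trained diffusion models, unbiased datasets, and a fair evaluation pipeline. Through extensive experiments, we demonstrate that the effectiveness of current MIA methods degrades significantly under these more practical conditions. Based on our results, we caution that MIA, in its current state, is not a reliable approach for identifying unauthorized data usage in pre-trained diffusion models. To the best of our knowledge, we are the first to discover the performance overestimation of MIAs on diffusion models and to present a unified benchmark for more realistic evaluation. Our code is available on GitHub: https://github.com/caradryanl/CopyMark.||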
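|**2024-10-04**|[Conditional Enzyme Generation Using Protein Language Models with Adapters](http://arxiv.org/abs/2410.03634)|null|Generating proteins conditioned on desired functions and/or properties is a key goal for generative models. Existing methods based on prompting of language models can generate proteins conditioned on a target functionality, such as a desired enzyme family. However, these methods are limited to simple, tokenized conditioning and have not been shown to generalize to unseen functions. In this study, we propose ProCALM (Protein Conditionally Adapted Language Model), an approach for conditional protein generation that uses adapters on protein language models. Our specific implementation of ProCALM involves fine-tuning ProGen2 to incorporate conditioning representations of enzyme function and taxonomy. ProCALM matches existing methods at conditionally generating sequences from target enzyme families. Impressively, it can also generate within the joint distribution of enzymatic function and taxonomy, and it can generalize to rare and unseen enzyme families and taxonomies. Overall, ProCALM is a flexible and computationally efficient approach, and we expect that it can be extended to a wide range of generative language models.||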
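|**2024-10-04**|[How Discrete and Continuous Diffusion Meet: Comprehensive Analysis of Discrete Diffusion Models via a Stochastic Integral Framework](http://arxiv.org/abs/2410.03601)|null|Discrete diffusion models have attracted growing attention for their ability to model complex distributions with tractable sampling and inference. However, the error analysis of discrete diffusion models remains poorly understood. In this work, we propose a comprehensive framework for the error analysis of discrete diffusion models based on Lévy-type stochastic integrals. By generalizing the Poisson random measure to one with time-independent and state-dependent intensity, we rigorously establish a stochastic integral formulation of discrete diffusion models and provide the corresponding change-of-measure theorems, which are strikingly analogous to Itô integrals and the Girsanov theorem for their continuous counterparts. Our framework unifies and strengthens the current theoretical results on discrete diffusion models and obtains the first error bound for the τ-leaping scheme in KL divergence. By clearly identifying the sources of error, our analysis offers new insight into the mathematical properties of discrete diffusion models and provides guidance for designing efficient and accurate algorithms for real-world discrete diffusion model applications.||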
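|**2024-10-04**|[Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features](http://arxiv.org/abs/2410.03558)|**[link](https://github.com/darkbblue/generic-diffusion-feature)**|Diffusion models were originally designed for image generation. Recent research shows that the internal signals within their backbones, called activations, can also serve as dense features for various discriminative tasks such as semantic segmentation. Given the large number of activations, selecting a small yet effective subset is a fundamental problem. To this end, early studies in this field performed a large-scale quantitative comparison of the discriminative ability of the activations. However, we find that many potential activations have not been evaluated, such as the queries and keys used to compute attention scores. Moreover, recent advances in diffusion architectures bring many new activations, such as those within embedded ViT modules. Taken together, activation selection remains unresolved yet overlooked. To tackle this issue, this paper takes a further step by evaluating a broader range of activations. Given the significant increase in the number of activations, a full-scale quantitative comparison is no longer feasible. Instead, we seek to understand the properties of these activations, so that clearly inferior activations can be filtered out in advance through simple qualitative evaluation. After careful analysis, we discover three properties that are universal among diffusion models, which allows this study to go beyond specific models. On top of this, we present effective feature selection solutions for several popular diffusion models. Finally, experiments across multiple discriminative tasks validate the superiority of our method over SOTA competitors. Our code is available at https://github.com/Darkbblue/generic-diffusion-feature.||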
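|**2024-10-04**|[NRGBoost: Energy-Based Generative Boosted Trees](http://arxiv.org/abs/2410.03535)|null|Despite the dominance of deep learning on unstructured data, tree-based methods such as Random Forests (RF) and Gradient Boosted Decision Trees (GBDT) remain the workhorses for discriminative tasks on tabular data. We explore generative extensions of these popular algorithms, with a focus on explicitly modeling the data density (up to a normalization constant), which enables applications beyond sampling. As our main contribution, we propose an energy-based generative boosting algorithm that is analogous to the second-order boosting implemented in popular packages such as XGBoost. We show that, despite producing a generative model capable of handling inference tasks over any input variable, our proposed algorithm can achieve discriminative performance similar to GBDT on a number of real-world tabular datasets, outperforming alternative generative approaches. At the same time, we show that it is also competitive with neural-network-based models for sampling.||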
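|**2024-10-04**|[Generative Artificial Intelligence for Navigating Synthesizable Chemical Space](http://arxiv.org/abs/2410.03494)|**[link](https://github.com/wenhao-gao/synformer)**|We introduce SynFormer, a generative modeling framework designed to efficiently explore and navigate synthesizable chemical space. Unlike traditional molecular generation approaches, we generate synthetic pathways for molecules to ensure that the designs are synthetically tractable. By combining a scalable Transformer architecture with a diffusion module for building block selection, SynFormer surpasses existing models in synthesizable molecular design. We demonstrate SynFormer's effectiveness in two key applications: (1) local chemical space exploration, where the model generates synthesizable analogs of a reference molecule, and (2) global chemical space exploration, where the model aims to identify optimal molecules according to a black-box property prediction oracle. Additionally, we demonstrate the scalability of our approach through the improvement in performance as more computational resources become available. By making our code and trained models openly available, we hope SynFormer will find use in applications across drug discovery and materials science.||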
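|**2024-10-04**|[Diffusion State-Guided Projected Gradient for Inverse Problems](http://arxiv.org/abs/2410.03463)|**[link](https://github.com/anima-lab/diffstategrad)**|Recent advances in diffusion models have been effective at learning data priors for solving inverse problems. These methods use diffusion sampling steps to induce a data prior while applying a measurement guidance gradient at each step to impose data consistency. For general inverse problems, approximations are needed when an unconditionally trained diffusion model is used, because the measurement likelihood is intractable, and this leads to inaccurate posterior sampling. In other words, because of their approximations, these methods fail to preserve the generation process on the data manifold defined by the diffusion prior, leading to artifacts in applications such as image restoration. To improve the performance and robustness of diffusion models in solving inverse problems, we propose Diffusion State-Guided Projected Gradient (DiffStateGrad), which projects the measurement gradient onto a subspace that is a low-rank approximation of an intermediate state of the diffusion process. DiffStateGrad, as a module, can be added to a wide range of diffusion-based inverse solvers to better preserve the diffusion process on the prior manifold and to filter out artifact-inducing components. We highlight that DiffStateGrad improves the robustness of diffusion models with respect to the choice of measurement guidance step size and noise, while also improving worst-case performance. Finally, we demonstrate that DiffStateGrad improves upon the state of the art on linear and nonlinear image restoration inverse problems.||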
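|**2024-10-04**|[Generative Semantic Communication for Text-to-Speech Synthesis](http://arxiv.org/abs/2410.03459)|null|Semantic communication is a promising technology that improves communication efficiency by transmitting only the semantic information of the source data. However, traditional semantic communication methods focus primarily on data reconstruction tasks, which may not be efficient for emerging generative tasks such as text-to-speech (TTS) synthesis. To address this limitation, this paper develops a novel generative semantic communication framework for TTS synthesis, leveraging generative artificial intelligence technologies. First, we use a pre-trained large speech model, WavLM, and the residual vector quantization method to construct two semantic knowledge bases (KBs), at the transmitter and the receiver respectively. The KB at the transmitter enables effective semantic extraction, while the KB at the receiver facilitates lifelike speech synthesis. Then, we employ a Transformer encoder and a diffusion model to achieve efficient semantic coding without introducing significant communication overhead. Finally, numerical results demonstrate that our framework achieves much higher fidelity for the generated speech than four baselines, under both an additive white Gaussian noise channel and a Rayleigh fading channel.||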
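|**2024-10-03**|[Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models](http://arxiv.org/abs/2410.02740)|null|Recent advances in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide higher quality and better image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with the original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have distinct preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. Using Short Synthetic Captions (SSC) and Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects on models such as CLIP, multimodal LLMs, and diffusion models, as well as their interactions with AltTexts. Our findings show that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, and that each model shows a preference for particular caption formats. This comprehensive analysis provides valuable insights for optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.||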
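|**2024-10-03**|[A Photonic Parameter-shift Rule: Enabling Gradient Computation for Photonic Quantum Computers](http://arxiv.org/abs/2410.02726)|null|We present a method for gradient computation in quantum algorithms implemented on linear optical quantum computing platforms. While parameter-shift rules have become a staple of gradient computation in qubit gate-based quantum computing, their direct application to photonic platforms has been hindered by the non-unitary nature of differentiated phase-shift operators in Fock space. We introduce a photonic parameter-shift rule that overcomes this limitation, providing an exact formula for gradient computation in linear optical quantum processors. Our method scales linearly with the number of input photons and uses the same parameterized photonic circuit, with shifted parameters, for each evaluation. This advance closes a crucial gap in photonic quantum computing, enabling efficient gradient-based optimization of variational quantum algorithms on near-term photonic quantum processors. We demonstrate the efficacy of our approach through numerical simulations on quantum chemistry and generative modeling tasks, showing superior optimization performance as well as robustness to finite sampling and photon distinguishability noise compared with other gradient-based and gradient-free methods.||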
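|**2024-10-03**|[SteerDiff: Steering towards Safe Text-to-Image Diffusion Models](http://arxiv.org/abs/2410.02710)|null|Text-to-image (T2I) diffusion models have drawn attention for their ability to generate high-quality images with precise text alignment. However, these models can also be misused to produce inappropriate content. Existing safety measures, which typically rely on text classifiers or ControlNet-like approaches, are often insufficient. Traditional text classifiers depend on large-scale labeled datasets and can be easily bypassed by rephrasing. As diffusion models continue to scale, fine-tuning these safeguards becomes increasingly challenging and lacks flexibility. Recent red-teaming attack research further underscores the need for a new paradigm to prevent the generation of inappropriate content. In this paper, we introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model, ensuring that generated images adhere to ethical and safety standards with little to no impact on usability. SteerDiff identifies and manipulates inappropriate concepts within the text embedding space to guide the model away from harmful outputs. We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach. Furthermore, we benchmark SteerDiff against multiple red-teaming strategies to assess its robustness. Finally, we explore the potential of SteerDiff for concept forgetting tasks, demonstrating its versatility in text-conditioned image generation.||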
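|**2024-10-03**|[ControlAR: Controllable Image Generation with Autoregressive Models](http://arxiv.org/abs/2410.02705)|**[link](https://github.com/hustvl/controlar)**|Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. Although a natural approach, inspired by advances in large language models, is to tokenize control images into tokens and prefill them into the autoregressive model before decoding image tokens, this approach still falls short of ControlNet in generation quality and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. First, we explore control encoding for AR models and propose a lightweight control encoder that transforms spatial inputs (e.g., Canny edges or depth maps) into control tokens. ControlAR then uses a conditional decoding method that generates the next image token conditioned on the per-token fusion between control and image tokens, similar to positional encodings. Compared with prefilling tokens, conditional decoding significantly strengthens the control capability of AR models while maintaining the model's efficiency. Furthermore, and surprisingly, the proposed ControlAR enables AR models to generate images at arbitrary resolution through conditional decoding and specific controls. Extensive experiments demonstrate that ControlAR supports autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. In addition, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and a demo will soon be available at https://github.com/hustvl/ControlAR.||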
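|**2024-10-03**|[GUD: Generation with Unified Diffusion](http://arxiv.org/abs/2410.02667)|null|Diffusion generative models transform noise into data by inverting a process that progressively adds noise to data samples. Inspired by concepts from the renormalization group in physics, which analyzes systems across different scales, we revisit diffusion models by exploring three key design aspects: 1) the choice of representation in which the diffusion process operates (e.g., pixel, PCA, Fourier, or wavelet basis), 2) the prior distribution that data is transformed into during diffusion (e.g., Gaussian with covariance $\Sigma$), and 3) the scheduling of noise levels applied separately to different parts of the data, captured by a component-wise noise schedule. Incorporating the flexibility in these choices, we develop a unified framework for diffusion generative models with greatly enhanced design freedom. In particular, we introduce soft-conditioning models that smoothly interpolate between standard diffusion models and autoregressive models (in any basis), conceptually bridging these two approaches. Our framework opens up a wide design space that may lead to more efficient training and data generation, and paves the way to novel architectures integrating different generative approaches and generation tasks.||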
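|**2024-10-03**|[Grounded Answers for Multi-agent Decision-making Problem through Generative World Model](http://arxiv.org/abs/2410.02664)|null|Recent progress in generative models has stimulated significant innovations in many fields, such as image generation and chatbots. Despite their success, these models often produce sketchy and misleading solutions to complex multi-agent decision-making problems because they lack the trial-and-error experience and reasoning that humans have. To address this limitation, we explore a paradigm that integrates a language-guided simulator into the multi-agent reinforcement learning pipeline to improve the quality of the generated answers. The simulator is a world model that learns dynamics and reward separately: the dynamics model comprises an image tokenizer and a causal transformer that generates interaction transitions autoregressively, and the reward model is a bidirectional transformer learned by maximizing the likelihood of trajectories in the expert demonstrations under language guidance. Given an image of the current state and the task description, we use the world model to train the joint policy and produce the image sequence as the answer by running the converged policy on the dynamics model. The empirical results demonstrate that this framework can improve the answers to multi-agent decision-making problems, showing superior performance on both training and unseen tasks of the StarCraft Multi-Agent Challenge benchmark. In particular, it can generate consistent interaction sequences and explainable reward functions at interaction states, opening a path for training future generative models.||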
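|**2024-10-03**|[Scalable Simulation-free Entropic Unbalanced Optimal Transport](http://arxiv.org/abs/2410.02656)|null|The Optimal Transport (OT) problem seeks a transport map that bridges two distributions while minimizing a given cost function. Finding such a transport map has diverse applications in machine learning, such as generative modeling and image-to-image translation. In this paper, we introduce a scalable and simulation-free approach for solving the Entropic Unbalanced Optimal Transport (EUOT) problem. We derive the dynamical form of this EUOT problem, which is a generalization of the Schrödinger bridge (SB) problem. Based on this, we derive the dual formulation and optimality conditions of the EUOT problem from the stochastic optimal control perspective. By leveraging these properties, we propose a simulation-free algorithm to solve EUOT, called Simulation-free EUOT (SF-EUOT). While existing SB models require expensive simulation during training and evaluation, our model achieves simulation-free training and one-step generation by utilizing the reciprocal property. Our model demonstrates significantly improved scalability in generative modeling and image-to-image translation tasks compared to previous SB methods.||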
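|**2024-10-03**|[Measuring and Improving Persuasiveness of Generative Models](http://arxiv.org/abs/2410.02653)|null|Large language models (LLMs) are increasingly being used in workflows that involve generating content to be consumed by humans (e.g., marketing) and in direct interaction with humans (e.g., through chatbots). The development of such systems that are capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could positively impact domains like advertising and social good, such as addressing drug addiction; on the other hand, they could be misused to spread misinformation and shape political opinions. To channel LLMs' impact on society, we need to develop systems to measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first large-scale benchmark and arena containing a battery of tasks to automatically measure the persuasion ability of generative models. We investigate to what extent LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be made more persuasive than much larger models. Notably, targeted training using synthetic and natural datasets significantly enhances smaller models' persuasive capabilities, challenging scale-dependent assumptions. Our findings carry key implications for both model developers and policymakers. For instance, while the EU AI Act and California's SB-1047 aim to regulate AI models based on the number of floating-point operations, we demonstrate that simple metrics like this alone fail to capture the full scope of AI's societal impact. We invite the community to explore and contribute to PersuasionArena and PersuasionBench, available at https://bit.ly/measure-persuasion, to advance our understanding of AI-driven persuasion and its societal implications.||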
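|**2024-10-03**|[Beyond Squared Error: Exploring Loss Design for Enhanced Training of Generative Flow Networks](http://arxiv.org/abs/2410.02596)|null|Generative Flow Networks (GFlowNets) are a novel class of generative models designed to sample from unnormalized distributions. They have found applications in various important tasks, and their training algorithms have attracted great research interest. In general, GFlowNets are trained by fitting the forward flow to the backward flow on sampled training objects. Prior work focused on the choice of training objects, parameterizations, sampling and resampling strategies, and backward policies, aiming to enhance credit assignment, exploration, or exploitation during training. However, the choice of regression loss, which can strongly influence the exploration and exploitation behavior of the under-training policy, has been overlooked. Due to the lack of theoretical understanding for choosing an appropriate regression loss, most existing algorithms train the flow network by minimizing the squared error of the forward and backward flows in log-space, i.e., using the quadratic regression loss. In this work, we rigorously prove that distinct regression losses correspond to specific divergence measures, enabling us to design and analyze regression losses according to the desired properties of the corresponding divergence measures. Specifically, we examine two key properties: zero-forcing and zero-avoiding, where the former promotes exploitation and higher rewards, and the latter encourages exploration and enhances diversity. Based on our theoretical framework, we propose three novel regression losses, namely Shifted-Cosh, Linex(1/2), and Linex(1). We evaluate them across three benchmarks: hyper-grid, bit-sequence generation, and molecule generation. Our proposed losses are compatible with most existing training algorithms and significantly improve their performance in terms of convergence speed, sample diversity, and robustness.||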
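|**2024-10-03**|[Local Flow Matching Generative Models](http://arxiv.org/abs/2410.02548)|**[link](https://github.com/hamrel-cxu/localflowmatching)**|Flow Matching (FM) is a simulation-free method for learning a continuous and invertible flow to interpolate between two distributions, and in particular to generate data from noise in generative modeling. In this paper, we introduce Local Flow Matching (LFM), which learns a sequence of FM sub-models, each matching a diffusion process up to the time of the step size in the data-to-noise direction. In each step, the two distributions to be interpolated by the sub-model are closer to each other than data is to noise, which enables the use of smaller models with faster training. The stepwise structure of LFM naturally lends itself to distillation, and different distillation techniques can be adopted to speed up generation. Theoretically, we prove a generation guarantee for the proposed flow model in terms of the $\chi^2$-divergence between the generated and true data distributions. In experiments, we demonstrate the improved training efficiency and competitive generative performance of LFM compared to FM on the unconditional generation of tabular data and image datasets, and on the conditional generation of robotic manipulation policies.||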
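|**2024-09-30**|[SpaceMesh: A Continuous Representation for Learning Manifold Surface Meshes](http://arxiv.org/abs/2409.20562)|null|Meshes are ubiquitous in visual computing and simulation, yet most existing machine learning techniques represent meshes only indirectly, e.g., as the level set of a scalar field or a deformation of a template, or as a disordered triangle soup lacking local structure. This work presents a scheme to directly generate manifold, polygonal meshes of complex connectivity as the output of a neural network. Our key innovation is to define a continuous latent connectivity space at each mesh vertex, which implies the discrete mesh. In particular, our vertex embeddings generate cyclic neighbor relationships in a halfedge mesh representation, which guarantees edge-manifoldness and provides the ability to represent general polygonal meshes. This representation is well suited to machine learning and stochastic optimization, without restriction on connectivity or topology. We first explore the basic properties of this representation, then use it to fit distributions of meshes from large datasets. The resulting models generate diverse meshes with tessellation structure learned from the dataset population, with concise details and high-quality mesh elements. In applications, this approach not only yields high-quality outputs from generative models, but also enables directly learning challenging geometry processing tasks such as mesh repair.||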
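|**2024-09-30**|[COLLAGE: Collaborative Human-Agent Interaction Generation using Hierarchical Latent Diffusion and Language Models](http://arxiv.org/abs/2409.20502)|null|We propose a novel framework, COLLAGE, for generating collaborative agent-object-agent interactions by leveraging large language models (LLMs) and hierarchical motion-specific vector-quantized variational autoencoders (VQ-VAEs). Our model addresses the lack of rich datasets in this domain by incorporating the knowledge and reasoning abilities of LLMs to guide a generative diffusion model. The hierarchical VQ-VAE architecture captures different motion-specific characteristics at multiple levels of abstraction, avoiding redundant concepts and enabling efficient multi-resolution representation. We introduce a diffusion model that operates in the latent space and incorporates LLM-generated motion planning cues to guide the denoising process, resulting in prompt-specific motion generation with greater control and diversity. Experimental results on the CORE-4D and InterHuman datasets demonstrate the effectiveness of our approach in generating realistic and diverse collaborative human-object-human interactions, outperforming state-of-the-art methods. Our work opens up new possibilities for modeling complex interactions in various domains, such as robotics, graphics, and computer vision.||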
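|**2024-09-30**|[FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing](http://arxiv.org/abs/2409.20500)|null|Text-to-video diffusion models have made remarkable advancements. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing using these foundation models has expanded rapidly. To enhance editing quality, structural controls are frequently employed in video editing. Among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we propose the metric Mask Matching Cost (MMC), which quantifies this variability, and FreeMask, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism within comprehensive attention features, e.g., temporal, cross-, and self-attention modules. Our approach can be seamlessly integrated into existing zero-shot video editing frameworks with better performance, requiring no control assistance or parameter fine-tuning while enabling adaptive decoupling of unedited semantic layouts through mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.||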
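|**2024-09-30**|[All-optical autoencoder machine learning framework using diffractive processors](http://arxiv.org/abs/2409.20346)|null|Diffractive deep neural networks (D2NNs), known for their high speed, low power consumption, and strong parallelism, have been widely applied across various fields, including pattern recognition, image processing, and image transmission. However, existing network architectures primarily focus on data representation within the original domain, with limited exploration of the latent space, thereby restricting the information mining capabilities and multifunctional integration of D2NNs. Here, we propose an all-optical autoencoder (OAE) framework that can encode the input wavefield into a prior shape distribution in the latent space and decode the encoded pattern back to the original wavefield. By leveraging the non-reciprocal property of D2NNs, the OAE models function as encoders in one direction of wave propagation and as decoders in the opposite direction. We further apply the models to three key areas: image denoising, noise-resistant reconfigurable image classification, and image generation. Proof-of-concept experiments have been conducted to validate the numerical simulations. Our OAE framework fully exploits the potential of latent space representations, enabling a single set of diffractive processors to simultaneously achieve image reconstruction, representation, and generation. It can be viewed as both a counterpart and an extension of the electronic autoencoder model. This work not only offers fresh insights into the design of optical generative models but also paves the way for developing and applying multifunctional, highly integrated, and general optical intelligent systems.||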
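|**2024-09-30**|[Devil is in Details: Locality-Aware 3D Abdominal CT Volume Generation for Self-Supervised Organ Segmentation](http://arxiv.org/abs/2409.20332)|null|In medical image analysis, self-supervised learning (SSL) techniques have emerged to alleviate labeling demands, yet training data scarcity remains a challenge owing to escalating resource requirements and privacy constraints. Numerous efforts employ generative models to generate high-fidelity, unlabeled 3D volumes across diverse modalities and anatomical regions. However, the intricate and hard-to-distinguish anatomical structures within the abdomen pose a unique challenge to abdominal CT volume generation compared to other anatomical regions. To address this overlooked challenge, we introduce Locality-Aware Diffusion (Lad), a novel method tailored for fine-grained 3D abdominal CT volume generation. We design a locality loss to refine crucial anatomical regions and devise a condition extractor to integrate abdominal priors into generation, thereby enabling the generation of large quantities of high-quality abdominal CT volumes essential for SSL tasks without the need for additional data such as labels or radiology reports. Volumes generated with our method demonstrate remarkable fidelity in reproducing abdominal structures, reducing the FID score from 0.0034 to 0.0002 on the AbdomenCT-1K dataset, closely mirroring real data and surpassing current methods. Extensive experiments demonstrate the effectiveness of our method in self-supervised organ segmentation tasks, improving mean Dice scores on two abdominal datasets. These results underscore the potential of synthetic data to advance self-supervised learning in medical image analysis.||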
|**2024-09-30**|[UIR-LoRA: Achieving Universal Image Restoration through Multiple Low-Rank Adaptation](http://arxiv.org/abs/2409.20197)|**[link](https://github.com/justones/uir-lora)**|Existing unified methods typically treat multi-degradation image restoration as a multi-task learning problem. Despite performing effectively compared to single degradation restoration methods, they overlook the utilization of commonalities and specificities within multi-task restoration, thereby impeding the model's performance. Inspired by the success of deep generative models and fine-tuning techniques, we propose a universal image restoration framework based on multiple low-rank adapters (LoRA) from multi-domain transfer learning. Our framework leverages the pre-trained generative model as the shared component for multi-degradation restoration and transfers it to specific degradation image restoration tasks using low-rank adaptation. Additionally, we introduce a LoRA composing strategy based on the degradation similarity, which adaptively combines trained LoRAs and enables our model to be applicable for mixed degradation restoration. Extensive experiments on multiple and mixed degradations demonstrate that the proposed universal image restoration method not only achieves higher fidelity and perceptual image quality but also has better generalization ability than other unified image restoration models. Our code is available at https://github.com/Justones/UIR-LoRA.||
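|**2024-09-30**|[Ensemble Kalman Diffusion Guidance: A Derivative-free Method for Inverse Problems](http://arxiv.org/abs/2409.20175)|null|When solving inverse problems, it is increasingly popular to use pre-trained diffusion models as plug-and-play priors. This framework can accommodate different forward models without re-training while preserving the generative capability of diffusion models. Despite their success in many imaging inverse problems, most existing methods rely on privileged information such as derivatives, pseudo-inverses, or full knowledge of the forward model. This reliance is a substantial limitation that restricts their use in a wide range of problems where such information is unavailable, such as in many scientific applications. To address this issue, we propose Ensemble Kalman Diffusion Guidance (EnKG) for diffusion models, a derivative-free approach that can solve inverse problems by only accessing forward model evaluations and a pre-trained diffusion model prior. We study the empirical effectiveness of our method across various inverse problems, including scientific settings such as inferring fluid flows and astronomical objects, which are highly non-linear inverse problems that often only permit black-box access to the forward model.||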
|**2024-09-30**|[Erase, then Redraw: A Novel Data Augmentation Approach for Free Space Detection Using Diffusion Model](http://arxiv.org/abs/2409.20164)|null|Data augmentation is one of the most common tools in deep learning, underpinning many recent advances in tasks such as classification, detection, and semantic segmentation. The standard approach to data augmentation involves simple transformations like rotation and flipping to generate new images. However, these new images often lack diversity along the main semantic dimensions within the data. Traditional data augmentation methods cannot alter high-level semantic attributes such as the presence of vehicles, trees, and buildings in a scene to enhance data diversity. In recent years, the rapid development of generative models has injected new vitality into the field of data augmentation. In this paper, we address the lack of diversity in data augmentation for the road detection task by using a pre-trained text-to-image diffusion model to parameterize image-to-image transformations. Our method involves editing images using these diffusion models to change their semantics. In essence, we achieve this goal by erasing instances of real objects from the original dataset and generating new instances with similar semantics in the erased regions using the diffusion model, thereby expanding the original dataset. We evaluate our approach on the KITTI road dataset and achieve the best results compared to other data augmentation methods, which demonstrates the effectiveness of our proposed approach.||
|**2024-09-30**|[Conditional Diffusion Models are Minimax-Optimal and Manifold-Adaptive for Conditional Distribution Estimation](http://arxiv.org/abs/2409.20124)|null|We consider a class of conditional forward-backward diffusion models for conditional generative modeling, that is, generating new data given a covariate (or control variable). To formally study the theoretical properties of these conditional generative models, we adopt a statistical framework of distribution regression to characterize the large sample properties of the conditional distribution estimators induced by these conditional forward-backward diffusion models. Here, the conditional distribution of data is assumed to smoothly change over the covariate. In particular, our derived convergence rate is minimax-optimal under the total variation metric within the regimes covered by the existing literature. Additionally, we extend our theory by allowing both the data and the covariate variable to potentially admit a low-dimensional manifold structure. In this scenario, we demonstrate that the conditional forward-backward diffusion model can adapt to both manifold structures, meaning that the derived estimation error bound (under the Wasserstein metric) depends only on the intrinsic dimensionalities of the data and the covariate.||
|**2024-09-30**|[Training a Computer Vision Model for Commercial Bakeries with Primarily Synthetic Images](http://arxiv.org/abs/2409.20122)|null|In the food industry, reprocessing returned product is a vital step to increase resource efficiency. [SBB23] presented an AI application that automates the tracking of returned bread buns. We extend their work by creating an expanded dataset comprising 2432 images and a wider range of baked goods. To increase model robustness, we use the generative models pix2pix and CycleGAN to create synthetic images. We train the state-of-the-art object detection models YOLOv9 and YOLOv8 on our detection task. Our overall best-performing model achieved an average precision AP@0.5 of 90.3% on our test set.||
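|**2024-09-27**|[ $O(d/T)$ Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions](http://arxiv.org/abs/2409.18959)|null|Score-based diffusion models, which generate new data by learning to reverse a diffusion process that perturbs data from the target distribution into noise, have achieved remarkable success across various generative tasks. Despite their superior empirical performance, existing theoretical guarantees are often constrained by stringent assumptions or suboptimal convergence rates. In this paper, we establish a fast convergence theory for a popular SDE-based sampler under minimal assumptions. Our analysis shows that, given $\ell_{2}$-accurate estimates of the score functions, the total variation distance between the target and generated distributions is upper bounded by $O(d/T)$ (ignoring logarithmic factors), where $d$ is the data dimensionality and $T$ is the number of steps. This result holds for any target distribution with a finite first-order moment. To our knowledge, this improves upon existing convergence theory for both the SDE-based sampler and another ODE-based sampler, while imposing minimal assumptions on the target data distribution and score estimates. This is achieved through a novel set of analytical tools that provides a fine-grained characterization of how the error propagates at each step of the reverse process.||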
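|**2024-09-27**|[ReviveDiff: A Universal Diffusion Model for Restoring Images in Adverse Weather Conditions](http://arxiv.org/abs/2409.18932)|null|Images captured in challenging environments, such as nighttime, foggy, rainy, and underwater conditions, often suffer from significant degradation, resulting in a substantial loss of visual quality. Effective restoration of these degraded images is critical for subsequent vision tasks. While many existing approaches have successfully incorporated specific priors for individual tasks, these tailored solutions limit their applicability to other degradations. In this work, we propose a universal network architecture, dubbed "ReviveDiff", which can address a wide range of degradations and bring images back to life by enhancing and restoring their quality. Our approach is inspired by the observation that, unlike degradation caused by motion or electronic issues, quality degradation under adverse conditions primarily stems from natural media (such as fog, water, and low luminance), which generally preserve the original structures of objects. To restore the quality of such images, we leverage the latest advances in diffusion models and develop ReviveDiff to restore image quality at both the macro and micro levels, covering key factors that determine image quality, such as sharpness, distortion, noise level, dynamic range, and color accuracy. We rigorously evaluate ReviveDiff on seven benchmark datasets covering five types of degrading conditions: rainy, underwater, low-light, smoke, and nighttime hazy. Our experimental results demonstrate that ReviveDiff outperforms state-of-the-art methods both quantitatively and visually.||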
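|**2024-09-27**|[Unsupervised Low-light Image Enhancement with Lookup Tables and Diffusion Priors](http://arxiv.org/abs/2409.18899)|null|Low-light image enhancement (LIE) aims to precisely and efficiently recover an image degraded in poor illumination environments. Recent advanced LIE techniques use deep neural networks, which require many low-light and normal-light image pairs, network parameters, and computational resources. As a result, their practicality is limited. In this work, we devise a novel unsupervised LIE framework based on diffusion priors and lookup tables (DPLUT) to achieve efficient low-light image recovery. The proposed approach comprises two critical components: a light adjustment lookup table (LLUT) and a noise suppression lookup table (NLUT). LLUT is optimized with a set of unsupervised losses and aims to predict pixel-wise curve parameters for the dynamic range adjustment of a specific image. NLUT is designed to remove the noise amplified after the image is brightened. As diffusion models are sensitive to noise, diffusion priors are introduced to achieve high-performance noise suppression. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in terms of visual quality and efficiency.||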
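|**2024-09-27**|[Detecting Dataset Abuse in Fine-Tuning Stable Diffusion Models for Text-to-Image Synthesis](http://arxiv.org/abs/2409.18897)|null|Text-to-image synthesis has become highly popular for generating realistic and stylized images, often requiring generative models to be fine-tuned on domain-specific datasets for specialized tasks. However, these valuable datasets face risks of unauthorized usage and unapproved sharing, compromising the rights of their owners. In this paper, we address the issue of dataset abuse during the fine-tuning of Stable Diffusion models for text-to-image synthesis. We present a dataset watermarking framework designed to detect unauthorized usage and trace data leaks. The framework employs two key strategies across multiple watermarking schemes and is effective for large-scale dataset authorization. Extensive experiments demonstrate the framework's effectiveness, its minimal impact on the dataset (only 2% of the data needs to be modified for high detection accuracy), and its ability to trace data leaks. Our results also highlight the robustness and transferability of the framework, demonstrating its practical applicability in detecting dataset abuse.||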
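|**2024-09-27**|[Explainable Artifacts for Synthetic Western Blot Source Attribution](http://arxiv.org/abs/2409.18881)|**[link](https://github.com/phillipecardenuto/ai-wblots-detector)**|Recent advances in artificial intelligence have enabled generative models to produce synthetic scientific images that are hard to distinguish from real ones, posing a challenge even for expert scientists accustomed to working with such content. When exploited by organizations known as paper mills, which systematically generate fraudulent articles, these technologies can contribute to the spread of misinformation about ungrounded science, potentially undermining trust in scientific research. While previous studies have explored black-box solutions, such as convolutional neural networks, for identifying synthetic content, only some have addressed the challenge of generalizing across different models and providing insight into the artifacts in synthetic images that inform the detection process. This study aims to identify explainable artifacts produced by state-of-the-art generative models (e.g., generative adversarial networks and diffusion models) and to leverage them for open-set identification and source attribution (i.e., pointing to the model that created the image).||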
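|**2024-09-27**|[Emu3: Next-Token Prediction is All You Need](http://arxiv.org/abs/2409.18869)|null|While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video by predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus, tokens, unlocking great potential for scaling during both training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.||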
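|**2024-09-27**|[Challenges of Generating Structurally Diverse Graphs](http://arxiv.org/abs/2409.18859)|**[link](https://github.com/Abusagit/Challenges-on-generating-structurally-diverse-graphs)**|For many graph-related problems, it can be essential to have a set of structurally diverse graphs. For instance, such graphs can be used for testing graph algorithms or their neural approximations. However, to the best of our knowledge, the problem of generating structurally diverse graphs has not been explored in the literature. In this paper, we fill this gap. First, we discuss how to define diversity for a set of graphs, why this task is non-trivial, and how one can choose a proper diversity measure. Then, for a given diversity measure, we propose and compare several algorithms that optimize it: we consider approaches based on standard random graph models, local graph optimization, genetic algorithms, and neural generative models. We show that it is possible to significantly improve diversity over basic random graph generators. Additionally, our analysis of the generated graphs allows us to better understand the properties of graph distances: depending on which diversity measure is used for optimization, the obtained graphs may possess very different structural properties, which gives insight into the sensitivity of the graph distance underlying the diversity measure.||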
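|**2024-09-27**|[Convergence of Diffusion Models Under the Manifold Hypothesis in High-Dimensions](http://arxiv.org/abs/2409.18804)|null|Denoising Diffusion Probabilistic Models (DDPMs) are powerful state-of-the-art methods for generating synthetic data from high-dimensional data distributions and are widely used for image, audio, and video generation, as well as many more applications in science and beyond. The manifold hypothesis states that high-dimensional data often lie on lower-dimensional manifolds within the ambient space, and it is widely believed to hold in provided examples. While recent results have provided invaluable insight into how diffusion models adapt to the manifold hypothesis, they do not capture the great empirical success of these models, making this a very fruitful research direction. In this work, we study DDPMs under the manifold hypothesis and prove that they achieve rates independent of the ambient dimension in terms of learning the score. In terms of sampling, we obtain rates independent of the ambient dimension with respect to the Kullback-Leibler divergence, and $O(\sqrt{D})$ with respect to the Wasserstein distance. We do this by developing a new framework that connects diffusion models to the well-studied theory of extrema of Gaussian processes.||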
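|**2024-09-27**|[Geometric deep learning for galaxy-halo connection: a case study for galaxy intrinsic alignments](http://arxiv.org/abs/2409.18761)|null|Forthcoming cosmological imaging surveys, such as the Rubin Observatory LSST, require large-scale simulations encompassing realistic galaxy populations for a variety of scientific applications. Of particular concern is the phenomenon of intrinsic alignments (IA), whereby galaxies orient themselves towards overdensities, potentially introducing significant systematic biases in weak gravitational lensing analyses if they are not properly modeled. Due to computational constraints, simulating the intricate details of galaxy formation and evolution relevant to IA across vast volumes is impractical. As an alternative, we propose a deep generative model trained on the IllustrisTNG-100 simulation to sample 3D galaxy shapes and orientations so as to accurately reproduce intrinsic alignments along with correlated scalar features. We model the cosmic web as a set of graphs, each graph representing a halo with nodes representing the subhalos/galaxies. The architecture consists of an SO(3) $\times$ $\mathbb{R}^n$ diffusion generative model, for galaxy orientations and $n$ scalars, implemented with E(3) equivariant graph neural networks that explicitly respect the Euclidean symmetries of our Universe. The model is able to learn and predict features such as galaxy orientations that are statistically consistent with the reference simulation. Notably, our model demonstrates the ability to jointly model Euclidean-valued scalars (galaxy sizes, shapes, and colors) along with non-Euclidean-valued SO(3) quantities (galaxy orientations) that are governed by highly complex galactic physics at non-linear scales.||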
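|**2024-09-27**|[Unsupervised Fingerphoto Presentation Attack Detection With Diffusion Models](http://arxiv.org/abs/2409.18636)|null|Smartphone-based contactless fingerphoto authentication has become a reliable alternative to traditional contact-based fingerprint biometric systems owing to rapid advances in smartphone camera technology. Despite its convenience, fingerprint authentication through fingerphotos is more vulnerable to presentation attacks, which has motivated recent research efforts towards developing fingerphoto Presentation Attack Detection (PAD) techniques. However, prior PAD approaches used supervised learning methods that require labeled training data for both bona fide and attack samples. This can suffer from two key issues, namely (i) generalization: detecting novel presentation attack instruments (PAIs) unseen in the training data, and (ii) scalability: collecting a large dataset of attack samples using different PAIs. To address these challenges, we propose a novel unsupervised approach based on a state-of-the-art deep-learning-based diffusion model, the Denoising Diffusion Probabilistic Model (DDPM), which is trained solely on bona fide samples. The proposed approach detects Presentation Attacks (PA) by calculating the reconstruction similarity between the input and output pairs of the DDPM. We present extensive experiments across three PAI datasets to test the accuracy and generalization capability of our approach. The results show that the proposed DDPM-based PAD method achieves significantly better detection error rates on several PAI classes compared to other baseline unsupervised approaches.||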
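|**2024-09-26**|[FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner](http://arxiv.org/abs/2409.18128)|**[link](https://github.com/shiml20/flowturbo)**|Building on the success of diffusion models in visual generation, flow-based models have re-emerged as another prominent family of generative models, achieving competitive or better performance in terms of both visual quality and inference speed. By learning the velocity field through flow matching, flow-based models tend to produce a straighter sampling trajectory, which is advantageous during the sampling process. However, unlike diffusion models, for which fast samplers are well developed, efficient sampling of flow-based generative models has rarely been explored. In this paper, we propose a framework called FlowTurbo to accelerate the sampling of flow-based models while still enhancing the sampling quality. Our primary observation is that the velocity predictor's outputs in flow-based models become stable during sampling, enabling the estimation of velocity via a lightweight velocity refiner. Additionally, we introduce several techniques, including a pseudo corrector and sample-aware compilation, to further reduce inference time. Since FlowTurbo does not change the multi-step sampling paradigm, it can be effectively applied to various tasks such as image editing and inpainting. By integrating FlowTurbo into different flow-based models, we obtain an acceleration ratio of 53.1% $\sim$ 58.3% on class-conditional generation and 29.8% $\sim$ 38.5% on text-to-image generation. Notably, FlowTurbo reaches an FID of 2.12 on ImageNet at 100 (ms / img) and an FID of 3.93 at 38 (ms / img), achieving real-time image generation and establishing a new state of the art. Code is available at https://github.com/shiml20/FlowTurbo.||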
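|**2024-09-26**|[Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction](http://arxiv.org/abs/2409.18124)|null|Leveraging the visual priors of pre-trained text-to-image diffusion models offers a promising solution for enhancing zero-shot generalization in dense prediction tasks. However, existing methods often uncritically use the original diffusion formulation, which may not be optimal due to the fundamental differences between dense prediction and image generation. In this paper, we provide a systematic analysis of the diffusion formulation for dense prediction, focusing on both quality and efficiency. We find that the original parameterization type for image generation, which learns to predict noise, is harmful for dense prediction; the multi-step noising/denoising diffusion process is also unnecessary and challenging to optimize. Based on these insights, we introduce Lotus, a diffusion-based visual foundation model with a simple yet effective adaptation protocol for dense prediction. Specifically, Lotus is trained to directly predict annotations instead of noise, thereby avoiding harmful variance. We also reformulate the diffusion process into a single-step procedure, simplifying optimization and significantly boosting inference speed. Additionally, we introduce a novel tuning strategy called detail preserver, which achieves more accurate and fine-grained predictions. Without scaling up the training data or model capacity, Lotus achieves state-of-the-art performance in zero-shot depth and normal estimation across various datasets. It also significantly improves efficiency, being hundreds of times faster than most existing diffusion-based methods.||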
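|**2024-09-26**|[EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation](http://arxiv.org/abs/2409.18114)|null|Current auto-regressive mesh generation methods suffer from issues such as incomplete meshes, insufficient detail, and poor generalization. In this paper, we propose an Auto-regressive Auto-encoder (ArAE) model capable of generating high-quality 3D meshes with up to 4,000 faces at a spatial resolution of $512^3$. We introduce a novel mesh tokenization algorithm that efficiently compresses triangular meshes into 1D token sequences, significantly improving training efficiency. Furthermore, our model compresses variable-length triangular meshes into a fixed-length latent space, enabling the training of latent diffusion models for better generalization. Extensive experiments demonstrate the superior quality, diversity, and generalization capabilities of our model in both point-cloud-conditioned and image-conditioned mesh generation tasks.||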
|**2024-09-26**|[StackGen: Generating Stable Structures from Silhouettes via Diffusion](http://arxiv.org/abs/2409.18098)|null|Humans naturally obtain intuition about the interactions between and the stability of rigid objects by observing and interacting with the world. It is this intuition that governs the way in which we regularly configure objects in our environment, allowing us to build complex structures from simple, everyday objects. Robotic agents, on the other hand, traditionally require an explicit model of the world that includes the detailed geometry of each object and an analytical model of the environment dynamics, which are difficult to scale and preclude generalization. Instead, robots would benefit from an awareness of intuitive physics that enables them to similarly reason over the stable interaction of objects in their environment. Towards that goal, we propose StackGen, a diffusion model that generates diverse stable configurations of building blocks matching a target silhouette. To demonstrate the capability of the method, we evaluate it in a simulated environment and deploy it in the real setting using a robotic arm to assemble structures generated by the model.||
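|**2024-09-26**|[DiffSSC: Semantic LiDAR Scan Completion using Denoising Diffusion Probabilistic Models](http://arxiv.org/abs/2409.18092)|null|Perception systems play a crucial role in autonomous driving, incorporating multiple sensors and corresponding computer vision algorithms. 3D LiDAR sensors are widely used to capture sparse point clouds of the vehicle's surroundings. However, such systems struggle to perceive occluded areas and gaps in the scene due to the sparsity of these point clouds and their lack of semantics. To address these challenges, Semantic Scene Completion (SSC) jointly predicts unobserved geometry and semantics in the scene given raw LiDAR measurements, aiming for a more complete scene representation. Building on the promising results of diffusion models in image generation and super-resolution tasks, we propose extending them to SSC by implementing the noising and denoising diffusion processes in the point and semantic spaces individually. To control the generation, we employ semantic LiDAR point clouds as conditional input and design local and global regularization losses to stabilize the denoising process. We evaluate our approach on autonomous driving datasets, where it outperforms the state of the art for SSC.||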
|**2024-09-26**|[Stable Video Portraits](http://arxiv.org/abs/2409.18083)|null|Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present SVP, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any fine-tuning at test time. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.||
|**2024-09-26**|[PhoCoLens: Photorealistic and Consistent Reconstruction in Lensless Imaging](http://arxiv.org/abs/2409.17996)|null|Lensless cameras offer significant advantages in size, weight, and cost compared to traditional lens-based systems. Without a focusing lens, lensless cameras rely on computational algorithms to recover the scenes from multiplexed measurements. However, current algorithms struggle with inaccurate forward imaging models and insufficient priors to reconstruct high-quality images. To overcome these limitations, we introduce a novel two-stage approach for consistent and photorealistic lensless image reconstruction. The first stage of our approach ensures data consistency by focusing on accurately reconstructing the low-frequency content with a spatially varying deconvolution method that adjusts to changes in the Point Spread Function (PSF) across the camera's field of view. The second stage enhances photorealism by incorporating a generative prior from pre-trained diffusion models. By conditioning on the low-frequency content retrieved in the first stage, the diffusion model effectively reconstructs the high-frequency details that are typically lost in the lensless imaging process, while also maintaining image fidelity. Our method achieves a superior balance between data fidelity and visual quality compared to existing methods, as demonstrated with two popular lensless systems, PhlatCam and DiffuserCam. Project website: https://phocolens.github.io/.||
|**2024-09-26**|[Joint Localization and Planning using Diffusion](http://arxiv.org/abs/2409.17995)|null|Diffusion models have been successfully applied to robotics problems such as manipulation and vehicle path planning. In this work, we explore their application to end-to-end navigation -- including both perception and planning -- by considering the problem of jointly performing global localization and path planning in known but arbitrary 2D environments. In particular, we introduce a diffusion model which produces collision-free paths in a global reference frame given an egocentric LIDAR scan, an arbitrary map, and a desired goal position. To this end, we implement diffusion in the space of paths in SE(2), and describe how to condition the denoising process on both obstacles and sensor observations. In our evaluation, we show that the proposed conditioning techniques enable generalization to realistic maps of considerably different appearance than the training environment, demonstrate our model's ability to accurately describe ambiguous solutions, and run extensive simulation experiments showcasing our model's use as a real-time, end-to-end localization and planning stack.||
|**2024-09-26**|[CNCA: Toward Customizable and Natural Generation of Adversarial Camouflage for Vehicle Detectors](http://arxiv.org/abs/2409.17963)|**[link](https://github.com/SeRAlab/CNCA)**|Prior works on physical adversarial camouflage against vehicle detectors mainly focus on the effectiveness and robustness of the attack. The current most successful methods optimize 3D vehicle texture at a pixel level. However, this results in conspicuous and attention-grabbing patterns in the generated camouflage, which humans can easily identify. To address this issue, we propose a Customizable and Natural Camouflage Attack (CNCA) method by leveraging an off-the-shelf pre-trained diffusion model. By sampling the optimal texture image from the diffusion model with a user-specific text prompt, our method can generate natural and customizable adversarial camouflage while maintaining high attack performance. With extensive experiments on the digital and physical worlds and user studies, the results demonstrate that our proposed method can generate significantly more natural-looking camouflage than the state-of-the-art baselines while achieving competitive attack performance. Our code is available at https://anonymous.4open.science/r/CNCA-1D54.||
|**2024-09-26**|[Relativistic diffusion model for hadron production in p-Pb collisions at the LHC](http://arxiv.org/abs/2409.17960)|null|We investigate charged-hadron production in relativistic heavy-ion collisions of asymmetric systems within a nonequilibrium-statistical framework. Calculated centrality-dependent pseudorapidity distributions for p-Pb collisions at $\sqrt{s_{NN}}=5.02$ and 8.16 TeV are compared with data from the Large Hadron Collider (LHC). Our approach combines a relativistic diffusion model with formulations based on quantum chromodynamics while utilizing numerical solutions of a Fokker-Planck equation to account for the shift and broadening of the fragmentation sources for particle-production with respect to the stopping (net-baryon) rapidity distributions. To represent the centrality dependence of charged-hadron production in asymmetric systems over a broad region of pseudorapidities, the consideration and precise modelling of the fragmentation sources, along with the central gluon-gluon source, is found to be essential. Specifically, this results in an inversion of the particle-production amplitude from backward- to forward-dominance when transitioning from central to peripheral collisions, in agreement with recent ATLAS and ALICE p-Pb data at $\sqrt{s_{NN}}=5.02$ TeV.||
|**2024-09-18**|[Massively Multi-Person 3D Human Motion Forecasting with Scene Context](http://arxiv.org/abs/2409.12189)|**[link](https://github.com/felixbmuller/sast)**|Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information on the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion. Unlike previous models, our approach can model interactions between both widely varying numbers of people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study. Code is available at https://github.com/felixbmuller/SAST.||
|**2024-09-18**|[MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion](http://arxiv.org/abs/2409.12140)|**[link](https://github.com/Motion-RAG/MoRAG)**|We introduce MoRAG, a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation. The method enhances motion diffusion models by leveraging additional knowledge obtained through an improved motion retrieval process. By effectively prompting large language models (LLMs), we address spelling errors and rephrasing issues in motion retrieval. Our approach utilizes a multi-part retrieval strategy to improve the generalizability of motion retrieval across the language space. We create diverse samples through the spatial composition of the retrieved motions. Furthermore, by utilizing low-level, part-specific motion information, we can construct motion samples for unseen text descriptions. Our experiments demonstrate that our framework can serve as a plug-and-play module, improving the performance of motion diffusion models. Code, pretrained models and sample videos will be made available at: https://motion-rag.github.io/||
|**2024-09-18**|[Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance](http://arxiv.org/abs/2409.12099)|null|Understanding how humans process visual information is one of the crucial steps for unraveling the underlying mechanism of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI data from visual stimuli, it aims to reconstruct the corresponding visual stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, the incorporation of additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit how modern LDMs effectively incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generations. Specifically, inspired by the two-streams hypothesis suggesting that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI data.||
|**2024-09-18**|[Design of Ligand-Binding Proteins with Atomic Flow Matching](http://arxiv.org/abs/2409.12080)|null|Designing novel proteins that bind to small molecules is a long-standing challenge in computational biology, with applications in developing catalysts, biosensors, and more. Current computational methods rely on the assumption that the binding pose of the target molecule is known, which is not always feasible, as conformations of novel targets are often unknown and tend to change upon binding. In this work, we formulate proteins and molecules as unified biotokens, and present AtomFlow, a novel deep generative model under the flow-matching framework for the design of ligand-binding proteins from the 2D target molecular graph alone. Operating on representative atoms of biotokens, AtomFlow captures the flexibility of ligands and generates ligand conformations and protein backbone structures iteratively. We consider the multi-scale nature of biotokens and demonstrate that AtomFlow can be effectively trained on a subset of structures from the Protein Data Bank, by matching flow vector field using an SE(3) equivariant structure prediction network. Experimental results show that our method can generate high fidelity ligand-binding proteins and achieve performance comparable to the state-of-the-art model RFDiffusionAA, while not requiring bound ligand structures. As a general framework, AtomFlow holds the potential to be applied to various biomolecule generation tasks in the future.||
|**2024-09-18**|[LEMON: Localized Editing with Mesh Optimization and Neural Shaders](http://arxiv.org/abs/2409.12024)|null|In practical use cases, polygonal mesh editing can be faster than generating new ones, but it can still be challenging and time-consuming for users. Existing solutions for this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and view. In this work, we propose LEMON, a mesh editing pipeline that combines neural deferred shading with localized mesh optimization. Our approach begins by identifying the most important vertices in the mesh for editing, utilizing a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. By using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process results in a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline using the DTU dataset, demonstrating that it generates finely-edited meshes more rapidly than the current state-of-the-art methods. We include our code and additional results in the supplementary material.||
|**2024-09-18**|[Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models](http://arxiv.org/abs/2409.11920)|null|In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach involves decomposing complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPT models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. This method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state-of-the-art.||
|**2024-09-18**|[Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation](http://arxiv.org/abs/2409.11904)|null|Efficiently evaluating the performance of text-to-image models is difficult as it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata's technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators and show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.||
|**2024-09-18**|[NT-ViT: Neural Transcoding Vision Transformers for EEG-to-fMRI Synthesis](http://arxiv.org/abs/2409.11836)|null|This paper introduces the Neural Transcoding Vision Transformer (NT-ViT), a generative model designed to estimate high-resolution functional Magnetic Resonance Imaging (fMRI) samples from simultaneous Electroencephalography (EEG) data. A key feature of NT-ViT is its Domain Matching (DM) sub-module which effectively aligns the latent EEG representations with those of fMRI volumes, enhancing the model's accuracy and reliability. Unlike previous methods that tend to struggle with fidelity and reproducibility of images, NT-ViT addresses these challenges by ensuring methodological integrity and higher-quality reconstructions which we showcase through extensive evaluation on two benchmark datasets; NT-ViT outperforms the current state-of-the-art by a significant margin in both cases, e.g. achieving a $10\times$ reduction in RMSE and a $3.14\times$ increase in SSIM on the Oddball dataset. An ablation study also provides insights into the contribution of each component to the model's overall effectiveness. This development is critical in offering a new approach to lessen the time and financial constraints typically linked with high-resolution brain imaging, thereby aiding in the swift and precise diagnosis of neurological disorders. Although it is not a replacement for actual fMRI but rather a step towards making such imaging more accessible, we believe that it represents a pivotal advancement in clinical practice and neuroscience research. Code is available at https://github.com/rom42pla/ntvit.||
|**2024-09-18**|[DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech](http://arxiv.org/abs/2409.11835)|null|In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.||
|**2024-09-18**|[RaggeDi: Diffusion-based State Estimation of Disordered Rags, Sheets, Towels and Blankets](http://arxiv.org/abs/2409.11831)|null|Cloth state estimation is an important problem in robotics. It is essential for the robot to know the accurate state to manipulate cloth and execute tasks such as robotic dressing, stitching, and covering/uncovering human beings. However, estimating cloth state accurately remains challenging due to its high flexibility and self-occlusion. This paper proposes a diffusion model-based pipeline that formulates the cloth state estimation as an image generation problem by representing the cloth state as an RGB image that describes the point-wise translation (translation map) between a pre-defined flattened mesh and the deformed mesh in a canonical space. Then we train a conditional diffusion-based image generation model to predict the translation map based on an observation. Experiments are conducted in both simulation and the real world to validate the performance of our method. Results indicate that our method outperforms two recent methods in both accuracy and speed.||
|**2024-09-17**|[Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion](http://arxiv.org/abs/2409.11406)|null|In 3D modeling, designers often use an existing 3D model as a reference to create new ones. This practice has inspired the development of Phidias, a novel generative model that uses diffusion for reference-augmented 3D generation. Given an image, our method leverages a retrieved or user-provided 3D reference model to guide the generation process, thereby enhancing the generation quality, generalization ability, and controllability. Our model integrates three key components: 1) meta-ControlNet that dynamically modulates the conditioning strength, 2) dynamic reference routing that mitigates misalignment between the input image and 3D reference, and 3) self-reference augmentations that enable self-supervised training with a progressive curriculum. Collectively, these designs result in a clear improvement over existing methods. Phidias establishes a unified framework for 3D generation using text, image, and 3D conditions with versatile applications.||
|**2024-09-17**|[Teaching dark matter simulations to speak the halo language](http://arxiv.org/abs/2409.11401)|**[link](https://github.com/shivampcosmo/gotham)**|We develop a transformer-based conditional generative model for discrete point objects and their properties. We use it to build a model for populating cosmological simulations with gravitationally collapsed structures called dark matter halos. Specifically, we condition our model with dark matter distribution obtained from fast, approximate simulations to recover the correct three-dimensional positions and masses of individual halos. This leads to a first model that can recover the statistical properties of the halos at small scales to better than 3% level using an accelerated dark matter simulation. This trained model can then be applied to simulations with significantly larger volumes which would otherwise be computationally prohibitive with traditional simulations, and also provides a crucial missing link in making end-to-end differentiable cosmological simulations. The code, named GOTHAM (Generative cOnditional Transformer for Halo's Auto-regressive Modeling) is publicly available at https://github.com/shivampcosmo/GOTHAM.||
|**2024-09-17**|[Ultrasound Image Enhancement with the Variance of Diffusion Models](http://arxiv.org/abs/2409.11380)|**[link](https://github.com/yuxin-zhang-jasmine/ius2024_diffusion)**|Ultrasound imaging, despite its widespread use in medicine, often suffers from various sources of noise and artifacts that impact the signal-to-noise ratio and overall image quality. Enhancing ultrasound images requires a delicate balance between contrast, resolution, and speckle preservation. This paper introduces a novel approach that integrates adaptive beamforming with denoising diffusion-based variance imaging to address this challenge. By applying Eigenspace-Based Minimum Variance (EBMV) beamforming and employing a denoising diffusion model fine-tuned on ultrasound data, our method computes the variance across multiple diffusion-denoised samples to produce high-quality despeckled images. This approach leverages both the inherent multiplicative noise of ultrasound and the stochastic nature of diffusion models. Experimental results on a publicly available dataset demonstrate the effectiveness of our method in achieving superior image reconstructions from single plane-wave acquisitions. The code is available at: https://github.com/Yuxin-Zhang-Jasmine/IUS2024_Diffusion.||
|**2024-09-17**|[OSV: One Step is Enough for High-Quality Image to Video Generation](http://arxiv.org/abs/2409.11367)|null|Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. While efforts have been made to accelerate video diffusion by reducing inference steps, through techniques such as consistency distillation and GAN training, these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency distillation based method, AnimateLCM (FVD 184.79), and approaches the 25-step performance of advanced Stable Video Diffusion (FVD 156.94).||
|**2024-09-17**|[Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think](http://arxiv.org/abs/2409.11355)|**[link](https://github.com/VisualComputingInstitute/diffusion-e2e-ft)**|Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200 $\times$ faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.||
|**2024-09-17**|[OmniGen: Unified Image Generation](http://arxiv.org/abs/2409.11340)|**[link](https://github.com/vectorspacelab/omnigen)**|In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports other downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classical computer vision tasks by transforming them into image generation tasks, such as edge detection and human pose recognition. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly compared to existing diffusion models, enabling complex tasks to be accomplished through instructions without the need for extra preprocessing steps (e.g., human pose estimation), thereby significantly simplifying the workflow of image generation. 3) Knowledge Transfer: Through learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and there remain several unresolved issues. We will open-source the related resources at https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.||
|**2024-09-17**|[fMRI-3D: A Comprehensive Dataset for Enhancing fMRI-based 3D Reconstruction](http://arxiv.org/abs/2409.11315)|null|Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) data, introduced as Recon3DMind in our conference work, is of significant interest to both cognitive neuroscience and computer vision. To advance this task, we present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4768 3D objects. The dataset comprises two components: fMRI-Shape, previously introduced and accessible at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Shape, and fMRI-Objaverse, proposed in this paper and available at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Objaverse. fMRI-Objaverse includes data from 5 subjects, 4 of whom are also part of the Core set in fMRI-Shape, with each subject viewing 3142 3D objects across 117 categories, all accompanied by text captions. This significantly enhances the diversity and potential applications of the dataset. Additionally, we propose MinD-3D, a novel framework designed to decode 3D visual information from fMRI signals. The framework first extracts and aggregates features from fMRI data using a neuro-fusion encoder, then employs a feature-bridge diffusion model to generate visual features, and finally reconstructs the 3D object using a generative transformer decoder. We establish new benchmarks by designing metrics at both semantic and structural levels to evaluate model performance. Furthermore, we assess our model's effectiveness in an Out-of-Distribution setting and analyze the attribution of the extracted features and the visual ROIs in fMRI signals. Our experiments demonstrate that MinD-3D not only reconstructs 3D objects with high semantic and spatial accuracy but also deepens our understanding of how the human brain processes 3D visual information. Project page at: https://jianxgao.github.io/MinD-3D.||
|**2024-09-17**|[SpMis: An Investigation of Synthetic Spoken Misinformation Detection](http://arxiv.org/abs/2409.11308)|null|In recent years, speech generation technology has advanced rapidly, fueled by generative models and large-scale training techniques. While these developments have enabled the production of high-quality synthetic speech, they have also raised concerns about the misuse of this technology, particularly for generating synthetic misinformation. Current research primarily focuses on distinguishing machine-generated speech from human-produced speech, but the more urgent challenge is detecting misinformation within spoken content. This task requires a thorough analysis of factors such as speaker identity, topic, and synthesis. To address this need, we conduct an initial investigation into synthetic spoken misinformation detection by introducing an open-source dataset, SpMis. SpMis includes speech synthesized from over 1,000 speakers across five common topics, utilizing state-of-the-art text-to-speech systems. Although our results show promising detection capabilities, they also reveal substantial challenges for practical implementation, underscoring the importance of ongoing research in this critical area.||
|**2024-09-17**|[DroneDiffusion: Robust Quadrotor Dynamics Learning with Diffusion Models](http://arxiv.org/abs/2409.11292)|null|An inherent fragility of quadrotor systems stems from model inaccuracies and external disturbances. These factors hinder performance and compromise the stability of the system, making precise control challenging. Existing model-based approaches either make deterministic assumptions, utilize Gaussian-based representations of uncertainty, or rely on nominal models, all of which often fall short in capturing the complex, multimodal nature of real-world dynamics. This work introduces DroneDiffusion, a novel framework that leverages conditional diffusion models to learn quadrotor dynamics, formulated as a sequence generation task. DroneDiffusion achieves superior generalization to unseen, complex scenarios by capturing the temporal nature of uncertainties and mitigating error propagation. We integrate the learned dynamics with an adaptive controller for trajectory tracking with stability guarantees. Extensive experiments in both simulation and real-world flights demonstrate the robustness of the framework across a range of scenarios, including unfamiliar flight paths and varying payloads, velocities, and wind disturbances.||
|**2024-09-17**|[Learning Source Disentanglement in Neural Audio Codec](http://arxiv.org/abs/2409.11228)|null|Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.||
|**2024-09-13**|[Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation](http://arxiv.org/abs/2409.09016)|**[link](https://github.com/OpenDriveLab/CLOVER)**|Despite significant progress in robotics and embodied AI in recent years, deploying robots for long-horizon tasks remains a great challenge. The majority of prior work adheres to an open-loop philosophy and lacks real-time feedback, leading to error accumulation and poor robustness. A handful of approaches have endeavored to establish feedback mechanisms leveraging pixel-level differences or pre-trained visual representations, yet their efficacy and adaptability have been found to be constrained. Inspired by classic closed-loop control systems, we propose CLOVER, a closed-loop visuomotor control framework that incorporates feedback mechanisms to improve adaptive robotic control. CLOVER consists of a text-conditioned video diffusion model for generating visual plans as reference inputs, a measurable embedding space for accurate error quantification, and a feedback-driven controller that refines actions from feedback and initiates replans as needed. Our framework exhibits notable advancement in real-world robotic tasks and achieves state-of-the-art results on the CALVIN benchmark, improving by 8% over previous open-loop counterparts. Code and checkpoints are maintained at https://github.com/OpenDriveLab/CLOVER.||
|**2024-09-13**|[A Diffusion Approach to Radiance Field Relighting using Multi-Illumination Synthesis](http://arxiv.org/abs/2409.08947)|null|Relighting radiance fields is severely underconstrained for multi-view data, which is most often captured under a single illumination condition; it is especially hard for full scenes containing multiple objects. We introduce a method to create relightable radiance fields using such single-illumination data by exploiting priors extracted from 2D image diffusion models. We first fine-tune a 2D diffusion model on a multi-illumination dataset conditioned by light direction, allowing us to augment a single-illumination capture into a realistic -- but possibly inconsistent -- multi-illumination dataset from directly defined light directions. We use this augmented data to create a relightable radiance field represented by 3D Gaussian splats. To allow direct control of light direction for low-frequency lighting, we represent appearance with a multi-layer perceptron parameterized on light direction. To enforce multi-view consistency and overcome inaccuracies, we optimize a per-image auxiliary feature vector. We show results on synthetic and real multi-view data under single illumination, demonstrating that our method successfully exploits 2D diffusion model priors to allow realistic 3D relighting for complete scenes. Project site: https://repo-sam.inria.fr/fungraph/generative-radiance-field-relighting/||
|**2024-09-13**|[Latent Space Score-based Diffusion Model for Probabilistic Multivariate Time Series Imputation](http://arxiv.org/abs/2409.08917)|**[link](https://github.com/gorgen2020/LSSDM_imputation)**|Accurate imputation is essential for the reliability and success of downstream tasks. Recently, diffusion models have attracted great attention in this field. However, these models neglect the latent distribution in a lower-dimensional space derived from the observed data, which limits the generative capacity of the diffusion model. Additionally, dealing with the original missing data without labels becomes particularly problematic. To address these issues, we propose the Latent Space Score-Based Diffusion Model (LSSDM) for probabilistic multivariate time series imputation. Observed values are projected onto low-dimensional latent space and coarse values of the missing data are reconstructed without knowing their ground truth values by this unsupervised learning approach. Finally, the reconstructed values are fed into a conditional diffusion model to obtain the precise imputed values of the time series. In this way, LSSDM not only possesses the power to identify the latent distribution but also seamlessly integrates the diffusion model to obtain the high-fidelity imputed values and assess the uncertainty of the dataset. Experimental results demonstrate that LSSDM achieves superior imputation performance while also providing a better explanation and uncertainty analysis of the imputation mechanism. The code is available at https://github.com/gorgen2020/LSSDM_imputation.||
|**2024-09-13**|[Gaussian is All You Need: A Unified Framework for Solving Inverse Problems via Diffusion Posterior Sampling](http://arxiv.org/abs/2409.08906)|null|Diffusion models can generate a variety of high-quality images by modeling complex data distributions. Trained diffusion models can also be very effective image priors for solving inverse problems. Most of the existing diffusion-based methods integrate data consistency steps within the diffusion reverse sampling process. The data consistency steps rely on an approximate likelihood function. In this paper, we show that the existing approximations are either insufficient or computationally inefficient. To address these issues, we propose a unified likelihood approximation method that incorporates a covariance correction term to enhance the performance and avoids propagating gradients through the diffusion model. The correction term, when integrated into the reverse diffusion sampling process, achieves better convergence towards the true data posterior for selected distributions and improves performance on real-world natural image datasets. Furthermore, we present an efficient way to factorize and invert the covariance matrix of the likelihood function for several inverse problems. We present comprehensive experiments to demonstrate the effectiveness of our method over several existing approaches.||
|**2024-09-13**|[Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control](http://arxiv.org/abs/2409.08861)|null|Dynamical generative models that produce samples through an iterative process, such as Flow Matching and denoising diffusion models, have seen widespread use, but there have not been many theoretically sound methods for improving these models with reward fine-tuning. In this work, we cast reward fine-tuning as stochastic optimal control (SOC). Critically, we prove that a very specific memoryless noise schedule must be enforced during fine-tuning, in order to account for the dependency between the noise variable and the generated samples. We also propose a new algorithm named Adjoint Matching which outperforms existing SOC algorithms, by casting SOC problems as a regression problem. We find that our approach significantly improves over existing methods for reward fine-tuning, achieving better consistency, realism, and generalization to unseen human preference reward models, while retaining sample diversity.||
|**2024-09-13**|[InstantDrag: Improving Interactivity in Drag-based Image Editing](http://arxiv.org/abs/2409.08857)|null|Drag-based image editing has recently gained popularity for its interactivity and precision. However, despite the ability of text-to-image models to generate samples within a second, drag editing still lags behind due to the challenge of accurately reflecting user interaction while maintaining image content. Some existing approaches rely on computationally intensive per-image optimization or intricate guidance-based methods, requiring additional inputs such as masks for movable regions and text prompts, thereby compromising the interactivity of the editing process. We introduce InstantDrag, an optimization-free pipeline that enhances interactivity and speed, requiring only an image and a drag instruction as input. InstantDrag consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns motion dynamics for drag-based image editing in real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. We demonstrate InstantDrag's capability to perform fast, photo-realistic edits without masks or text prompts through experiments on facial video datasets and general scenes. These results highlight the efficiency of our approach in handling drag-based image editing, making it a promising solution for interactive, real-time applications.||
|**2024-09-13**|[DX2CT: Diffusion Model for 3D CT Reconstruction from Bi or Mono-planar 2D X-ray(s)](http://arxiv.org/abs/2409.08850)|null|Computed tomography (CT) provides high-resolution medical imaging, but it can expose patients to high radiation. X-ray scanners have low radiation exposure, but their resolutions are low. This paper proposes a new conditional diffusion model, DX2CT, that reconstructs three-dimensional (3D) CT volumes from bi- or mono-planar X-ray image(s). The proposed DX2CT consists of two key components: 1) modulating feature maps extracted from two-dimensional (2D) X-ray(s) with 3D positions of the CT volume using a new transformer and 2) effectively using the modulated 3D position-aware feature maps as conditions of DX2CT. In particular, the proposed transformer can provide conditions with rich information about a target CT slice to the conditional diffusion model, enabling high-quality CT reconstruction. Our experiments with the bi- or mono-planar X-ray(s) benchmark datasets show that the proposed DX2CT outperforms several state-of-the-art methods. Our codes and model will be available at: https://www.github.com/intyeger/DX2CT.||
|**2024-09-13**|[DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset](http://arxiv.org/abs/2409.08731)|**[link](https://github.com/dfadd-dataset/dfadd_demo_pages)**|Mainstream zero-shot TTS production systems like Voicebox and Seed-TTS achieve human parity speech by leveraging Flow-matching and Diffusion models, respectively. Unfortunately, human-level audio synthesis leads to identity misuse and information security issues. Currently, many anti-spoofing models have been developed against deepfake audio. However, the efficacy of current state-of-the-art anti-spoofing models in countering audio synthesized by diffusion and flow-matching based TTS systems remains unknown. In this paper, we propose the Diffusion and Flow-matching based Audio Deepfake (DFADD) dataset. The DFADD dataset collects deepfake audio generated by advanced diffusion and flow-matching TTS models. Additionally, we reveal that current anti-spoofing models lack sufficient robustness against highly human-like audio generated by diffusion and flow-matching TTS systems. The proposed DFADD dataset addresses this gap and provides a valuable resource for developing more resilient anti-spoofing models.||
|**2024-09-13**|[STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment](http://arxiv.org/abs/2409.08601)|null|Visual and auditory perception are two crucial ways humans experience the world. Text-to-video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated video limits its broader applications. In this paper, we propose Semantic and Temporal Aligned Video-to-Audio (STA-V2A), an approach that enhances audio generation from videos by extracting both local temporal and global semantic video features and combining these refined video features with text as cross-modal guidance. To address the issue of information redundancy in videos, we propose an onset prediction pretext task for local temporal feature extraction and an attentive pooling module for global semantic feature extraction. To supplement the insufficient semantic information in videos, we propose a Latent Diffusion Model with Text-to-Audio priors initialization and cross-modal guidance. We also introduce Audio-Audio Align, a new metric to assess audio-temporal alignment. Subjective and objective metrics demonstrate that our method surpasses existing Video-to-Audio models in generating audio with better quality, semantic consistency, and temporal alignment. The ablation experiment validated the effectiveness of each module. Audio samples are available at https://y-ren16.github.io/STAV2A.||
|**2024-09-13**|[LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling](http://arxiv.org/abs/2409.08583)|null|Singing Voice Conversion (SVC) has emerged as a significant subfield of Voice Conversion (VC), enabling the transformation of one singer's voice into another while preserving musical elements such as melody, rhythm, and timbre. Traditional SVC methods have limitations in terms of audio quality, data requirements, and computational complexity. In this paper, we propose LHQ-SVC, a lightweight, CPU-compatible model based on the SVC framework and diffusion model, designed to reduce model size and computational demand without sacrificing performance. We incorporate features to improve inference quality, and optimize for CPU execution by using performance tuning tools and parallel computing frameworks. Our experiments demonstrate that LHQ-SVC maintains competitive performance, with significant improvements in processing speed and efficiency across different devices. The results suggest that LHQ-SVC can meet||
|**2024-09-12**|[DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors](http://arxiv.org/abs/2409.08278)|null|We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.||
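|**2024-09-12**|[Hand-Object Interaction Pretraining from Videos](http://arxiv.org/abs/2409.08273)|null|We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework that uses in-the-wild videos to generate sensorimotor robot trajectories. To do so, we lift both the human hand and the manipulated object into a shared 3D space and retarget human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that fine-tuning this policy with both reinforcement learning (RL) and behavior cloning (BC) enables sample-efficient adaptation to downstream tasks while improving robustness and generalization compared to prior approaches. Qualitative experiments are available at: https://hgaurav2k.github.io/hop/.||
|**2024-09-12**|[Click2Mask: Local Editing with Dynamic Mask Generation](http://arxiv.org/abs/2409.08272)|**[link](https://github.com/omeregev/click2mask)**|Recent advances in generative models have revolutionized image generation and editing, making these tasks accessible to non-experts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods often require a precise mask or a detailed description of the location, which can be cumbersome and error-prone. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). During a Blended Latent Diffusion (BLD) process, a mask is dynamically grown around this point, guided by a CLIP-based semantic loss. Click2Mask overcomes the limitations of segmentation-based and fine-tuning-dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that, according to both human judgment and automatic metrics, Click2Mask not only minimizes user effort but also delivers competitive or superior local image manipulation results compared to SoTA methods. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the potential to integrate our dynamic mask approach into other editing methods.||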
|**2024-09-12**|[DreamBeast: Distilling 3D Fantastical Animals with Part-Aware Knowledge Transfer](http://arxiv.org/abs/2409.08271)|null|We present DreamBeast, a novel method based on score distillation sampling (SDS) for generating fantastical 3D animal assets composed of distinct parts. Existing SDS methods often struggle with this generation task due to a limited understanding of part-level semantics in text-to-image diffusion models. While recent diffusion models, such as Stable Diffusion 3, demonstrate a better part-level understanding, they are prohibitively slow and exhibit other common problems associated with single-view diffusion models. DreamBeast overcomes this limitation through a novel part-aware knowledge transfer mechanism. For each generated asset, we efficiently extract part-level knowledge from the Stable Diffusion 3 model into a 3D Part-Affinity implicit representation. This enables us to instantly generate Part-Affinity maps from arbitrary camera views, which we then use to modulate the guidance of a multi-view diffusion model during SDS to create 3D assets of fantastical animals. DreamBeast significantly enhances the quality of generated 3D creatures with user-specified part compositions while reducing computational overhead, as demonstrated by extensive quantitative and qualitative evaluations.||
|**2024-09-12**|[Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation](http://arxiv.org/abs/2409.08269)|null|Today's touch sensors come in many shapes and sizes. Because models are generally tied to one specific sensor design, developing general-purpose touch processing methods is challenging. We address this problem by performing cross-modal prediction between touch sensors: given the tactile signal from one sensor, we use a generative model to estimate how the same physical contact would be perceived by another sensor. This allows us to apply sensor-specific algorithms to the generated signal. We implement this idea by training a diffusion model to translate between the popular GelSlim and Soft Bubble sensors. As a downstream task, we perform in-hand object pose estimation using GelSlim sensors while using an algorithm that operates only on Soft Bubble signals. The dataset, the code, and additional details can be found at https://www.mmintlab.com/research/touch2touch/.||
|**2024-09-12**|[Improving Text-guided Object Inpainting with Semantic Pre-inpainting](http://arxiv.org/abs/2409.08260)|**[link](https://github.com/nnn-s/catdiffusion)**|Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is not trivial in two respects: 1) Relying solely on a single U-Net to align the text prompt and visual object across all denoising timesteps is insufficient to generate the desired objects; 2) The controllability of object generation is not guaranteed in the intricate sampling space of the diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and text prompt. The outputs of the semantic inpainter then act as informative visual prompts to guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.||
|**2024-09-12**|[Improving Virtual Try-On with Garment-focused Diffusion Models](http://arxiv.org/abs/2409.08258)|**[link](https://github.com/siqi0905/gardiff)**|Diffusion models have revolutionized generative modeling in numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models to synthesizing an image of a target person wearing a given in-shop garment, i.e., the image-based virtual try-on (VTON) task. The difficulty is that the diffusion process should not only produce a holistically high-fidelity photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both basic visual appearance and detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of the diffusion model, pursuing local fine-grained alignment with the visual appearance of the reference garment and human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial, high-frequency details. Extensive experiments on the VITON-HD and DressCode datasets demonstrate the superiority of GarDiff over state-of-the-art VTON approaches. Code is publicly available at: https://github.com/siqi0905/GarDiff/tree/master.||
|**2024-09-12**|[LoRID: Low-Rank Iterative Diffusion for Adversarial Purification](http://arxiv.org/abs/2409.08255)|null|This work presents an information-theoretic examination of diffusion-based purification methods, the state-of-the-art adversarial defenses that utilize diffusion models to remove malicious perturbations in adversarial examples. By theoretically characterizing the inherent purification errors associated with the Markov-based diffusion purifications, we introduce LoRID, a novel Low-Rank Iterative Diffusion purification method designed to remove adversarial perturbation with low intrinsic purification errors. LoRID centers around a multi-stage purification process that leverages multiple rounds of diffusion-denoising loops at the early time-steps of the diffusion models, and the integration of Tucker decomposition, an extension of matrix factorization, to remove adversarial noise at high-noise regimes. Consequently, LoRID increases the effective diffusion time-steps and overcomes strong adversarial attacks, achieving superior robustness performance in CIFAR-10/100, CelebA-HQ, and ImageNet datasets under both white-box and black-box settings.||
|**2024-09-12**|[Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding](http://arxiv.org/abs/2409.08251)|null|Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment by panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image Diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps and improved general segmentation performance. However, the direct use of phrase features as static prompts to apply frozen Diffusion models to the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. Therefore, we propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features and inject the multimodal cues back, which leverages the fine-grained image-text alignment capability of Diffusion models more sufficiently. In addition, we also design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.||
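|**2024-09-12**|[IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation](http://arxiv.org/abs/2409.08240)|null|While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position multiple instances and control the generation of their features. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. In response, we propose the Instance Feature Generation (IFG) task, which aims to ensure both positional accuracy and feature fidelity in generated instances. To address the IFG task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models' abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.||
|**2024-09-10**|[SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation](http://arxiv.org/abs/2409.06633)|null|In recent years, the development of diffusion models has led to significant progress in image and video generation tasks, with pre-trained models such as the Stable Diffusion series playing a crucial role. Inspired by model pruning, which lightens large pre-trained models by removing unimportant parameters, we propose a novel model fine-tuning method that makes full use of these ineffective parameters and equips the pre-trained model with new task-specific capabilities. We first investigate the importance of parameters in pre-trained diffusion models and find that the smallest 10% to 20% of parameters by absolute value do not contribute to the generation process. Based on this observation, we propose a method termed SaRA that re-utilizes these temporarily ineffective parameters, which amounts to optimizing a sparse weight matrix to learn task-specific knowledge. To mitigate overfitting, we propose a nuclear-norm-based low-rank sparse training scheme for efficient fine-tuning. Furthermore, we design a new progressive parameter adjustment strategy to make full use of the re-trained/fine-tuned parameters. Finally, we propose a novel unstructured backpropagation strategy that significantly reduces memory costs during fine-tuning. Our method enhances the generative capabilities of pre-trained models in downstream applications and outperforms traditional fine-tuning methods such as LoRA in maintaining the model's generalization ability. We validate our approach through fine-tuning experiments on SD models, which show that SaRA achieves significant improvements. SaRA also offers a practical advantage: it requires only a single line of code modification for efficient implementation and is seamlessly compatible with existing methods.||
|**2024-09-10**|[MVGaussian: High-Fidelity text-to-3D Content Generation with Multi-View Guidance and Surface Densification](http://arxiv.org/abs/2409.06620)|null|The field of text-to-3D content generation has made significant progress in generating realistic 3D objects, with existing methods such as Score Distillation Sampling (SDS) offering promising guidance. However, because the guidance is imprecise, these methods often encounter the "Janus" problem, i.e., multi-face ambiguities. Additionally, while recent advances in 3D Gaussian splatting have shown its efficacy in representing 3D volumes, optimization of this representation remains largely unexplored. This paper introduces a unified framework for text-to-3D content generation that addresses these critical gaps. Our approach uses multi-view guidance to iteratively form the structure of the 3D model, progressively enhancing detail and accuracy. We also introduce a novel densification algorithm that aligns Gaussians close to the surface, optimizing the structural integrity and fidelity of the generated models. Extensive experiments validate our approach, showing that it produces high-quality visual outputs with minimal time cost. Notably, our method achieves high-quality results within half an hour of training, a substantial efficiency gain over most existing methods, which require hours of training to achieve comparable results.||
|**2024-09-10**|[A Primer on Variational Inference for Physics-Informed Deep Generative Modelling](http://arxiv.org/abs/2409.06560)|null|Variational inference (VI) is a computationally efficient and scalable methodology for approximate Bayesian inference. It strikes a balance between the accuracy of uncertainty quantification and practical tractability. Owing to its built-in Bayesian regularisation and flexibility, it excels at generative modelling and inversion tasks, which is essential for physics-related problems. Deriving the central learning objective for VI must often be tailored to new learning tasks, where the nature of the problem dictates the conditional dependence between variables of interest, as arises in physics problems. In this paper, we provide an accessible and thorough technical introduction to VI for forward and inverse problems, guiding the reader through standard derivations of the VI framework and how it can best be realized through deep learning. We then review and unify recent literature exemplifying the creative flexibility allowed by VI. This paper is designed for a general scientific audience looking to solve physics-based problems with an emphasis on uncertainty quantification.||
|**2024-09-10**|[From LIMA to DeepLIMA: following a new path of interoperability](http://arxiv.org/abs/2409.06550)|null|This article describes the architecture of the LIMA (Libre Multilingual Analyzer) framework and its recent evolution with the addition of new text analysis modules based on deep neural networks. We extended the functionality of LIMA in terms of the number of supported languages while preserving the existing configurable architecture and the availability of previously developed rule-based and statistical analysis components. We trained models for more than 60 languages on the Universal Dependencies 2.5 corpora, the WikiNer corpora, and the CoNLL-03 dataset. Universal Dependencies allowed us to increase the number of supported languages and to generate models that can be integrated into other platforms. This integration of ubiquitous deep learning natural language processing models, together with the use of standard annotated collections based on Universal Dependencies, can be viewed as a new path of interoperability through the normalization of models and data. It complements the more standard technical interoperability, which LIMA implements through services available in Docker containers on Docker Hub.||
|**2024-09-10**|[Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models](http://arxiv.org/abs/2409.06451)|null|While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine control over the emotion rendering of the output speech remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, we train a diffusion model to generate emotional embeddings from textual emotional style descriptions. Our framework first trains on reference audio using the audio encoder, then fine-tunes the diffusion model to process textual inputs from ParaCLAP's text encoder. During inference, speech attributes such as pitch, jitter, and loudness can be manipulated using only textual conditioning. Our experiments demonstrate that ParaEVITS effectively controls emotion rendering without compromising speech quality. Speech demos are publicly available.||
|**2024-09-10**|[Prompt2Fashion: An automatically generated fashion dataset](http://arxiv.org/abs/2409.06442)|**[link](https://github.com/georgiarg/prompt2fashion)**|Despite the rapid evolution and increasing efficacy of language and vision generative models, there remains a lack of comprehensive datasets that bridge the gap between personalized fashion needs and AI-driven design, limiting the potential for truly inclusive and customized fashion solutions. In this work, we leverage generative models to automatically construct a fashion image dataset tailored to various occasions, styles, and body types as instructed by users. We use different Large Language Models (LLMs) and prompting strategies to offer both expert and non-expert users personalized outfits of high aesthetic quality, detail, and relevance, as demonstrated by qualitative analysis. Up to now, the evaluation of the generated outfits has been conducted by non-expert human subjects. While this provides fine-grained insights into the quality and relevance of the generation, we extend the discussion on the importance of expert knowledge for evaluating artistic AI-generated datasets such as this one. Our dataset is publicly available on GitHub at https://github.com/georgiarg/Prompt2Fashion.||
|**2024-09-10**|[Fast nonparametric inference of network backbones for graph sparsification](http://arxiv.org/abs/2409.06417)|**[link](https://github.com/aleckirkley/mdl-network-backbones)**|A network backbone provides a useful sparse representation of a weighted network by keeping only its most important links, permitting a range of computational speedups and simplifying complex network visualizations. There are many possible criteria for a link to be considered important, and hence many methods have been developed for network backboning for graph sparsification. These methods can be classified as global or local depending on whether they evaluate the importance of an edge in the context of the whole network or of an individual node neighborhood. A key limitation of existing network backboning methods is that they either artificially restrict the topology of the backbone to a specific form (e.g., a tree) or require the specification of a free parameter (e.g., a significance level) that determines the number of edges to keep in the backbone. Here we develop a completely nonparametric framework for inferring the backbone of a weighted network that overcomes these limitations by automatically selecting the optimal number of edges to retain in the backbone using the Minimum Description Length (MDL) principle from information theory. We develop two encoding schemes that serve as objective functions for global and local network backbones, as well as efficient optimization algorithms that identify the optimal backbones according to these objectives with runtime complexity log-linear in the number of edges. We show that the proposed framework generalizes to any discrete weight distribution on the edges using a maximum a posteriori (MAP) estimation procedure with an asymptotically equivalent Bayesian generative model of the backbone. We compare the proposed method with existing methods in a range of tasks on real and synthetic networks.||
|**2024-09-10**|[Distilling Generative-Discriminative Representations for Very Low-Resolution Face Recognition](http://arxiv.org/abs/2409.06371)|null|Very low-resolution face recognition is challenging because resolution degradation causes a severe loss of informative facial details. In this paper, we propose a generative-discriminative representation distillation approach that combines generative representation with cross-resolution aligned knowledge distillation. This approach facilitates very low-resolution face recognition by jointly distilling generative and discriminative models via two distillation modules. First, generative representation distillation takes the encoder of a diffusion model pretrained for face super-resolution as the generative teacher to supervise the learning of the student backbone via feature regression, and then freezes the student backbone. After that, discriminative representation distillation further takes a pretrained face recognizer as the discriminative teacher to supervise the learning of the student head via cross-resolution relational contrastive distillation. In this way, the general backbone representation can be transformed into a discriminative head representation, yielding a robust and discriminative student model for very low-resolution face recognition. Our approach improves the recovery of missing details in very low-resolution faces and achieves better knowledge transfer. Extensive experiments on face datasets demonstrate that our approach enhances the recognition accuracy of very low-resolution faces, showcasing its effectiveness and adaptability.||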
|**2024-09-10**|[What happens to diffusion model likelihood when your model is conditional?](http://arxiv.org/abs/2409.06364)|null|Diffusion Models (DMs) iteratively denoise random samples to produce high-quality data. The iterative sampling process is derived from Stochastic Differential Equations (SDEs), allowing a speed-quality trade-off chosen at inference. Another advantage of sampling with differential equations is exact likelihood computation. These likelihoods have been used to rank unconditional DMs and for out-of-domain classification. Despite the many existing and possible uses of DM likelihoods, the distinct properties captured are unknown, especially in conditional contexts such as Text-To-Image (TTI) or Text-To-Speech synthesis (TTS). Surprisingly, we find that TTS DM likelihoods are agnostic to the text input. TTI likelihood is more expressive but cannot discern confounding prompts. Our results show that applying DMs to conditional tasks reveals inconsistencies and strengthens claims that the properties of DM likelihood are unknown. This impact sheds light on the previously unknown nature of DM likelihoods. Although conditional DMs maximise likelihood, the likelihood in question is not as sensitive to the conditioning input as one expects. This investigation provides a new point-of-view on diffusion likelihoods.||
|**2024-09-10**|[DiffQRCoder: Diffusion-based Aesthetic QR Code Generation with Scanning Robustness Guided Iterative Refinement](http://arxiv.org/abs/2409.06355)|null|Following the success of diffusion models in image generation, these techniques have also revolutionized aesthetic Quick Response (QR) code generation. Despite significant improvements in the visual attractiveness of the beautified codes, their scannability is usually sacrificed, which hinders their practical use in real-world scenarios. To address this issue, we propose a novel Diffusion-based QR Code generator (DiffQRCoder) to effectively craft QR codes that are both scannable and visually pleasing. The proposed approach introduces Scanning-Robust Perceptual Guidance (SRPG), a new diffusion guidance that ensures the generated aesthetic codes follow the ground-truth QR codes while maintaining their attractiveness during the denoising process. Additionally, we present a post-processing technique, Scanning Robust Manifold Projected Gradient Descent (SR-MPGD), to further enhance scanning robustness through iterative latent space optimization. Extensive experiments demonstrate that our approach not only outperforms other compared methods in Scanning Success Rate (SSR) with better or comparable CLIP aesthetic score (CLIP-aes.) but also significantly improves the SSR of the ControlNet-only approach from 60% to 99%. The subjective evaluation indicates that our approach also achieves promising visual attractiveness to users. Finally, even with different scanning angles and the most rigorous error tolerance settings, our approach robustly achieves over 95% SSR, demonstrating its capability for real-world applications.||
|**2024-09-09**|[Enhancing Preference-based Linear Bandits via Human Response Time](http://arxiv.org/abs/2409.05798)|null|Binary human choice feedback is widely used in interactive preference learning for its simplicity, but it provides limited information about preference strength. To overcome this limitation, we leverage human response times, which inversely correlate with preference strength, as complementary information. Our work integrates the EZ-diffusion model, which jointly models human choices and response times, into preference-based linear bandits. We introduce a computationally efficient utility estimator that reformulates the utility estimation problem using both choices and response times as a linear regression problem. Theoretical and empirical comparisons with traditional choice-only estimators reveal that for queries with strong preferences ("easy" queries), choices alone provide limited information, while response times offer valuable complementary information about preference strength. As a result, incorporating response times makes easy queries more useful. We demonstrate this advantage in the fixed-budget best-arm identification problem, with simulations based on three real-world datasets, consistently showing accelerated learning when response times are incorporated.||
|**2024-09-09**|[Predicting Critical Heat Flux with Uncertainty Quantification and Domain Generalization Using Conditional Variational Autoencoders and Deep Neural Networks](http://arxiv.org/abs/2409.05790)|null|Deep generative models (DGMs) have proven to be powerful in generating realistic data samples. Their capability to learn the underlying distribution of a dataset enable them to generate synthetic data samples that closely resemble the original training dataset, thus addressing the challenge of data scarcity. In this work, we investigated the capabilities of DGMs by developing a conditional variational autoencoder (CVAE) model to augment the critical heat flux (CHF) measurement data that was used to generate the 2006 Groeneveld lookup table. To determine how this approach compared to traditional methods, a fine-tuned deep neural network (DNN) regression model was created and evaluated with the same dataset. Both the CVAE and DNN models achieved small mean absolute relative errors, with the CVAE model maintaining more favorable results. To quantify the uncertainty in the model's predictions, uncertainty quantification (UQ) was performed with repeated sampling of the CVAE model and ensembling of the DNN model. Following UQ, the DNN ensemble notably improved performance when compared to the baseline DNN model, while the CVAE model achieved similar results to its non-UQ results. The CVAE model was shown to have significantly less variability and a higher confidence after assessment of the prediction-wise relative standard deviations. Evaluating domain generalization, both models achieved small mean error values when predicting both inside and outside the training domain, with predictions outside the training domain showing slightly larger errors. Overall, the CVAE model was comparable to the DNN regression model in predicting CHF values but with better uncertainty behavior.||
|**2024-09-09**|[Vector Quantized Diffusion Model Based Speech Bandwidth Extension](http://arxiv.org/abs/2409.05784)|null|神经音频编解码器 (NAC) 的最新进展为音频信号处理解锁了新的潜力。越来越多的研究探索利用 NAC 的潜在特征来完成各种语音信号处理任务。本文介绍了第一种利用从 NAC 获得的离散特征进行语音带宽扩展 (BWE) 的方法。通过恢复高度压缩的离散标记中的高频细节，该方法增强了语音的清晰度和自然度。所提出的框架基于矢量量化扩散，结合了先进 NAC、扩散模型和 Mamba-2 的优势，以重建高频语音成分。大量实验表明，该方法在对数谱距离和 ViSQOL 方面均表现出优异的性能，显着提高了语音质量。||
|**2024-09-09**|[AS-Speech: Adaptive Style For Speech Synthesis](http://arxiv.org/abs/2409.05730)|null|近年来，文本到语音（TTS）合成技术取得了显著进展，能够在常见场景下合成高质量的语音。在未知情况下，自适应TTS需要强大的泛化能力来适应说话人的风格特征。然而，现有的自适应方法只能分别提取和整合粗粒度的音色或混合的韵律属性。在本文中，我们提出了AS-Speech，一种将说话人音色特征和韵律属性整合到一个统一框架中的自适应风格方法，用于文本到语音合成。具体来说，AS-Speech可以通过细粒度的基于文本的音色特征和全局韵律信息准确地模拟风格特征，并通过扩散模型实现高保真语音合成。实验表明，与一系列自适应TTS模型相比，该模型生成的语音在音色和韵律方面具有更高的自然度和相似性。||
|**2024-09-09**|[pFedGPA: Diffusion-based Generative Parameter Aggregation for Personalized Federated Learning](http://arxiv.org/abs/2409.05701)|null|联邦学习 (FL) 是一种去中心化的模型训练方法，数据保留在本地，只有模型参数在客户端和中心服务器之间共享。传统的联邦平均 (FedAvg) 等方法对这些通常在异构数据分布上训练的参数进行线性聚合，这可能忽略了参数空间复杂、高维的性质，导致聚合模型的性能下降。虽然个性化联邦学习方法可以在一定程度上缓解异构数据问题，但线性聚合的局限性仍然没有解决。为了缓解这个问题，我们研究了扩散模型的生成方法，并提出了一种新的个性化联邦学习生成参数聚合框架，即 pFedGPA。在这个框架中，我们在服务器上部署了一个扩散模型，以整合不同的参数分布，并提出了一种参数反演方法，为每个客户端有效地生成一组个性化参数。这种反演方法将上传的参数转换为一个潜在代码，然后通过去噪采样进行聚合，生成最终的个性化参数。通过使用高容量扩散模型对客户端模型参数对其特定数据分布的依赖性进行编码，pFedGPA 可以有效地将所有客户端模型参数的总体分布的复杂性与每个客户端参数分布的复杂性解耦。我们的实验结果一致地证明了所提出的方法在多个数据集上的优越性能，超过了基线方法。||
|**2024-09-09**|[Unlearning or Concealment? A Critical Analysis and Evaluation Metrics for Unlearning in Diffusion Models](http://arxiv.org/abs/2409.05668)|null|近期的研究已经看到人们对扩散模型中概念去除和目标遗忘方法的浓厚兴趣。在本文中，我们对现有的扩散模型遗忘方法进行了全面的白盒分析，以揭示其存在的重大漏洞。我们发现，现有方法中用于遗忘的目标函数导致了要遗忘的目标概念与相应提示之间的解耦。这是一种隐蔽行为，而不是真正的遗忘，而真正的遗忘才是最初的目标。当前方法的无效性主要源于它们只关注降低特定提示集的生成概率，而忽略了推理过程中使用的中间引导的多种形式。本文对四种常用的扩散模型遗忘技术进行了严格的理论和实证检验。我们引入了两个新的评估指标：概念检索分数（CRS）和概念置信度分数（CCS）。这些指标基于一个成功的对抗攻击设置，可以从遗忘的扩散模型中恢复被遗忘的概念。CRS 衡量的是遗忘后的遗忘模型和完全训练模型的潜在表示之间的相似性。它反映了随着引导量增加，被遗忘概念的检索程度。CCS 量化了模型将目标概念分配给被操纵数据的置信度。它反映了随着引导量增加，未遗忘模型的生成结果与原始领域知识一致的概率。我们使用提出的针对扩散模型的严格指标对现有的遗忘方法进行评估，结果揭示了它们在真正遗忘概念方面的重大缺陷。源代码：https://respailab.github.io/unlearning-or-concealment||
|**2024-09-09**|[Forward KL Regularized Preference Optimization for Aligning Diffusion Policies](http://arxiv.org/abs/2409.05622)|null|扩散模型通过在策略学习中利用高度表达的模型能力，在序列决策中取得了显著的成功。学习扩散策略的一个核心问题是如何在各种任务中使策略输出与人类意图保持一致。为了实现这一点，先前的方法进行了回报条件策略生成或基于强化学习（RL）的策略优化，但它们都依赖于预先定义的奖励函数。在这项工作中，我们提出了一种新的框架，即用于对齐扩散策略的前向 KL 正则化偏好优化，以直接将扩散策略与偏好对齐。我们首先从离线数据集中训练一个不考虑偏好的扩散策略，然后通过直接偏好优化将该策略与偏好数据对齐。在对齐阶段，我们在扩散策略中制定了直接偏好学习，其中在前向偏好优化中采用了 KL 正则化，以避免生成分布外动作。我们对 MetaWorld 操作和 D4RL 任务进行了广泛的实验。结果表明，我们的方法在偏好一致性方面表现出色，并且优于先前最先进的算法。||
|**2024-09-09**|[Latent 3D Brain MRI Counterfactual](http://arxiv.org/abs/2409.05585)|null|结构性脑部MRI研究中的样本数量通常过小，无法充分训练深度学习模型。生成模型通过有效学习数据分布和生成高保真MRI，为解决这一问题带来了希望。然而，它们难以生成训练数据分布之外的多样化、高质量数据。解决这一问题的一种方法是使用针对3D体积反事实开发的因果模型。然而，在高维空间中准确建模因果关系是一项挑战，因此这些模型通常生成质量较低的3D脑部MRI。为了应对这些挑战，我们提出了一种两阶段方法，在潜在空间内构建结构因果模型（SCM）。在第一阶段，我们采用VQ-VAE学习MRI体积的紧凑嵌入。随后，我们将因果模型整合到这个潜在空间中，并使用封闭形式的广义线性模型（GLM）执行三步反事实程序。我们对真实世界的高分辨率MRI数据（1mm）进行的实验表明，我们的方法可以生成高质量的3D MRI反事实。||
|**2024-09-09**|[Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation](http://arxiv.org/abs/2409.05583)|**[link](https://github.com/gmuraleekrishna/sas)**|具身人工智能旨在开发能够理解和执行人类语言指令并以自然语言进行交流的机器人。为此，我们研究了生成高度详细的导航指令以供具身机器人遵循的任务。尽管最近的研究表明，从图像序列生成逐步指令方面取得了重大进展，但生成的指令在指称物体和地标方面缺乏多样性。现有的说话者模型学习了一些策略来规避评估指标，即使对于低质量的句子也能获得更高的分数。在这项工作中，我们提出了SAS（空间感知说话者），这是一种指令生成器或“说话者”模型，它利用环境的结构和语义知识来生成更丰富的指令。为了进行训练，我们在对抗性设置中采用了奖励学习方法，以避免语言评估指标引入的系统性偏差。根据经验，我们的方法优于现有的指令生成模型，并使用标准指标进行了评估。我们的代码可在以下网址获得：https://github.com/gmuraleekrishna/SAS。||
|**2024-09-09**|[A Taxonomy of Miscompressions: Preparing Image Forensics for Neural Compression](http://arxiv.org/abs/2409.05490)|null|神经压缩有可能彻底改变有损图像压缩技术。基于生成模型，最近的方案在高感知质量下实现了前所未有的压缩率，但牺牲了语义保真度。解压缩图像的细节可能看起来在视觉上是完美的，但在语义上与原始图像不同，这使得压缩错误难以或不可能被检测到。我们探索了这个问题的空间，并提出了一个暂定的错误压缩分类法。它定义了三种类型的“发生了什么”，并有一个二进制的“高影响”标志，表示改变符号的错误压缩。我们讨论了该分类法如何促进风险沟通和缓解措施的研究。||
|**2024-09-05**|[Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding](http://arxiv.org/abs/2409.03757)|**[link](https://github.com/yunzeman/lexicon3d)**|复杂三维场景理解近年来备受关注，场景编码策略在其中发挥着至关重要的作用。然而，针对不同场景的最佳场景编码策略仍不明确，特别是与基于图像的编码策略相比。为了解决这个问题，我们对用于三维场景理解的各种视觉编码模型进行了全面研究，确定了每个模型在不同场景下的优势和局限性。我们的评估涵盖了七种视觉基础编码器，包括基于图像、基于视频和三维基础模型。我们在四个任务中评估这些模型：视觉语言场景推理、视觉定位、分割和配准，每个任务都侧重于场景理解的不同方面。我们的评估得出了以下主要发现：DINOv2 表现出优越的性能，视频模型在对象级任务中表现出色，扩散模型有利于几何任务，而语言预训练模型在语言相关任务中表现出意想不到的局限性。这些见解挑战了一些传统认知，为利用视觉基础模型提供了新的视角，并强调了在未来的视觉语言和场景理解任务中需要更灵活的编码器选择。||
|**2024-09-05**|[ArtiFade: Learning to Generate High-quality Subject from Blemished Images](http://arxiv.org/abs/2409.03745)|null|以主题为主导的文本到图像生成技术在学习和捕捉主题特征方面取得了显著进步，即使只使用有限数量的图像。然而，现有方法通常依赖于高质量的图像进行训练，当输入图像存在瑕疵时，可能难以生成合理的图像。这主要归因于当前技术在区分主题相关特征和干扰性瑕疵方面的能力不足。在本文中，我们引入了ArtiFade来解决这个问题，并成功地从有瑕疵的数据集中生成了高质量的无瑕疵图像。具体来说，ArtiFade利用预先训练的文本到图像模型的微调来消除瑕疵。通过在微调过程中使用包含无瑕疵图像及其对应的有瑕疵图像的专门数据集来实现瑕疵的消除。ArtiFade还确保了保留扩散模型中固有的原始生成能力，从而提高了主题驱动方法在生成高质量和无瑕疵图像方面的整体性能。我们进一步为这项任务设计了评估基准。通过广泛的定性和定量实验，我们证明了ArtiFade在分布内和分布外情况下都能有效去除瑕疵的泛化能力。||
|**2024-09-05**|[RealisHuman: A Two-Stage Approach for Refining Malformed Human Parts in Generated Images](http://arxiv.org/abs/2409.03644)|**[link](https://github.com/wangbenzhi/realishuman)**|近年来，扩散模型彻底改变了视觉生成领域，其性能超越了生成对抗网络 (GANs) 等传统框架。然而，由于人类及其语义部分（如手和脸）复杂的结构，生成具有真实感的人类图像仍然是一项重大挑战。为了解决这个问题，我们提出了一种名为 RealisHuman 的新型后处理解决方案。RealisHuman 框架分两个阶段运行。首先，它使用原始的畸形部分作为参考，生成逼真的人体部位（如手或脸），确保细节与原始图像一致。其次，它通过重新绘制周围区域将校正后的人体部位无缝地融入到其对应的位置，以确保平滑逼真的融合。RealisHuman 框架显著增强了人类生成的真实感，这可以通过定性和定量指标的显著改进得到证明。代码可在 https://github.com/Wangbenzhi/RealisHuman 获取。||
|**2024-09-05**|[DiffEVC: Any-to-Any Emotion Voice Conversion with Expressive Guidance](http://arxiv.org/abs/2409.03636)|null|情感语音转换 (EVC) 通过放大积极线索和减少消极线索来改变语音情感，从而增强沟通。这项复杂的任务涉及语音质量、说话者特征和内容等纠缠不清的因素。传统的深度学习模型（如 GAN 和自动编码器）通过学习映射或解耦特征在 EVC 中取得了一定的成功，但面临着不稳定性和语音质量下降等挑战。扩散模型提供了稳定的训练和高质量的生成。我们提出了一个基于扩散的 EVC 框架，该框架使用互信息损失和辅助模型来解耦情感和说话者身份。引入了一种表达性引导机制，以改善情感转换，同时保持说话者特征。实验结果表明，我们的方法对于未知说话者和情感的有效性，在 EVC 任务中实现了最先进的性能。||
|**2024-09-05**|[TCDiff: Triple Condition Diffusion Model with 3D Constraints for Stylizing Synthetic Faces](http://arxiv.org/abs/2409.03600)|**[link](https://github.com/bovifocr/tcdiff)**|一个鲁棒的人脸识别模型需要使用包含大量个体以及每个个体在不同条件（例如姿态、表情、年龄、噪声和遮挡）下的大量样本的数据集进行训练。由于伦理和隐私问题，大型真实人脸数据集（例如 MS1MV3）已被停用，并且已经提出了利用 GAN 和扩散模型的合成人脸生成器，例如 SYNFace、SFace、DigiFace-1M、IDiff-Face、DCFace 和 GANDiffFace，旨在满足这一需求。其中一些方法可以生成高保真度的真实人脸，但类内差异较低，而另一些方法则生成具有高差异性但身份一致性较低的人脸。在本文中，我们提出了一种三重条件扩散模型（TCDiff），通过 2D 和 3D 人脸约束来改进从真实人脸到合成人脸的人脸风格迁移，在保持必要的类内高差异性的同时增强人脸身份一致性。使用我们新的数据集的 1k、2k 和 5k 类进行训练的人脸识别实验在 LFW、CFP-FP、AgeDB 和 BUPT 等真实人脸基准测试中优于最先进的合成数据集。我们的源代码可在以下网址获得：https://github.com/BOVIFOCR/tcdiff。||
|**2024-09-05**|[DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture](http://arxiv.org/abs/2409.03550)|**[link](https://github.com/qianlong0502/DKDM)**|扩散模型 (DM) 在各个领域都表现出卓越的生成能力，但其部署过程中的推理速度慢和计算需求高却阻碍了其发展。加速DM最常用的方法是减少生成过程中的去噪步骤，这可以通过更快的采样求解器或知识蒸馏 (KD) 来实现。与先前的方法不同，我们提出了一种新方法，可以将大型预训练DM的功能迁移到更快的架构中。具体来说，我们以独特的方式使用KD，通过将生成能力提炼到更快的变体中来压缩DM。此外，考虑到源数据不可访问或对于当前的生成模型来说存储量太大，我们引入了一种新的无源数据蒸馏范式，称为扩散模型的无数据知识蒸馏 (DKDM)。通常，我们建立的DKDM框架包含两个主要组件：1) DKDM目标函数，它使用预训练DM生成的合成去噪数据来优化更快的DM，而无需源数据；2) 动态迭代蒸馏方法，它可以灵活地组织去噪数据的合成，防止由于生成速度慢而减慢优化过程。据我们所知，这是首次尝试使用KD以无数据的方式将DM提炼到任何架构中。重要的是，我们的DKDM与大多数现有的加速方法（例如减少去噪步骤、量化和剪枝）是正交的。实验表明，我们的DKDM能够推导出速度提高2倍的DM，其性能与基线保持一致。值得注意的是，我们的DKDM使预训练的DM能够作为“数据集”来训练新的DM。||
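|**2024-09-05**|[Blended Latent Diffusion under Attention Control for Real-World Video Editing](http://arxiv.org/abs/2409.03514)|null|Due to the lack of fully publicly available text-to-video models, current video editing methods tend to build on pre-trained text-to-image generation models; however, they still face major challenges in handling local editing of videos with temporal information. First, although existing methods attempt to focus on local area editing through a pre-defined mask, the background outside the edited area is poorly preserved because each frame is generated in its spatial entirety. In addition, having the user provide a mask is an extra, costly step, so an autonomous masking strategy integrated into the editing process is desirable. Last but not least, image-level pretrained models have not learned temporal information across video frames, which is vital for expressing motion and dynamics. In this paper, we propose to adapt an image-level blended latent diffusion model to perform local video editing tasks. Specifically, we leverage DDIM inversion to acquire latents as the background latents, instead of randomly noised ones, to better preserve the background information of the input video. We further introduce an autonomous mask manufacturing mechanism derived from cross-attention maps in the diffusion steps. Finally, we enhance temporal consistency across video frames by transforming the self-attention blocks of the U-Net into temporal-spatial blocks. Through extensive experiments, our proposed approach demonstrates effectiveness in different real-world video editing tasks.|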
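|**2024-09-05**|[Data-free Distillation with Degradation-prompt Diffusion for Multi-weather Image Restoration](http://arxiv.org/abs/2409.03455)|null|Multi-weather image restoration has made remarkable progress, but growing model capacity and expensive data acquisition limit its application on memory-limited devices. Data-free distillation provides an alternative, allowing a lightweight student model to be learned from a pre-trained teacher model without relying on the original training data. Existing data-free learning methods mainly optimize models with pseudo data generated by GANs or real data collected from the Internet. However, they inevitably suffer from unstable training or domain shifts relative to the original data. In this paper, we propose a novel Data-free Distillation with Degradation-prompt Diffusion framework for multi-weather Image Restoration (D4IR). It replaces GANs with pre-trained diffusion models to avoid model collapse and incorporates a degradation-aware prompt adapter to facilitate content-driven conditional diffusion for generating domain-related images. Specifically, a contrast-based degradation prompt adapter is first designed to capture degradation-aware prompts from web-collected degraded images. Then, the collected unpaired clean images are perturbed to the latent features of Stable Diffusion and conditioned on the degradation-aware prompts to synthesize new domain-related degraded images for knowledge distillation. Experiments show that our method achieves performance comparable to the model distilled with the original training data, and even outperforms other mainstream unsupervised methods.|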
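|**2024-09-05**|[Convergence Rates for the Maximum A Posteriori Estimator in PDE-Regression Models with Random Design](http://arxiv.org/abs/2409.03417)|null|We consider the statistical inverse problem of recovering a parameter $\theta\in H^\alpha$ from data arising from the Gaussian regression problem $Y = \mathscr{G}(\theta)(Z)+\varepsilon$, with nonlinear forward map $\mathscr{G}:\mathbb{L}^2\to\mathbb{L}^2$, random design points $Z$, and Gaussian noise $\varepsilon$. The estimation strategy is based on a least squares approach under $\Vert\cdot\Vert_{H^\alpha}$-constraints. We establish the existence of a least squares estimator $\hat{\theta}$ as a maximizer of a given functional under Lipschitz-type assumptions on the forward map $\mathscr{G}$. A general concentration result is shown and used to prove consistency and upper bounds for the prediction error. The corresponding rates of convergence reflect not only the smoothness of the parameter of interest but also the ill-posedness of the underlying inverse problem. We apply the general model to the Darcy problem, where the recovery of an unknown coefficient function $f$ of a PDE is of interest. For this example, we also provide corresponding rates of convergence for the prediction and estimation errors. Additionally, we briefly discuss the applicability of the general model to other problems.|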
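|**2024-09-05**|[RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning](http://arxiv.org/abs/2409.03403)|null|Scaling up robot learning requires large and diverse datasets, and how to efficiently reuse collected data and transfer policies to new robot platforms remains an open question. Emerging research such as the Open-X Embodiment (OXE) project has shown promise in leveraging skills by combining datasets that include different robots. However, imbalances in the distribution of robot types and camera angles in many datasets make policies prone to overfitting. To mitigate this issue, we propose RoVi-Aug, which leverages state-of-the-art image-to-image generative models to augment robot data by synthesizing demonstrations with different robots and camera views. Through extensive physical experiments, we show that, by training on robot- and viewpoint-augmented data, RoVi-Aug can be deployed zero-shot on an unseen robot with significantly different camera angles. Compared to test-time adaptation algorithms such as Mirage, RoVi-Aug requires no extra processing at test time, does not assume known camera angles, and allows policy fine-tuning. Moreover, by co-training on both the original and augmented robot datasets, RoVi-Aug can learn multi-robot and multi-task policies, enabling more efficient transfer between robots and skills and improving success rates by up to 30%.|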
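|**2024-09-04**|[HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts](http://arxiv.org/abs/2409.02919)|**[link](https://github.com/Liuxinyv/HiPrompt)**|The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with object repetition and structural artifacts, especially when scaling to 4K resolution and higher. We find that the problem arises because a single prompt is an inefficient way to drive generation across multiple scales. In response, we propose HiPrompt, a new tuning-free solution that tackles the above problems by introducing hierarchical prompts. The hierarchical prompts offer both global and local guidance. Specifically, the global guidance comes from the user input that describes the overall content, while the local guidance uses patch-wise descriptions from MLLMs to carefully guide the generation of local structure and texture. Furthermore, during the inverse denoising process, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, facilitating prompt-guided denoising under hierarchical semantic guidance. It further allows the generation process to focus more on local spatial regions and ensures that the generated images maintain coherent local and global semantics, structures, and textures at high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality.|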
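|**2024-09-04**|[Latent Watermarking of Audio Generative Models](http://arxiv.org/abs/2409.02915)|null|Advances in audio generative models have introduced new challenges for their responsible disclosure and the detection of their misuse. To address these challenges, we introduce a method to mark latent generative models by specifically watermarking their training data. The resulting watermarked models produce latent representations whose decoded outputs are detected with high confidence, regardless of the decoding method used. This approach enables the detection of generated content without the need for a post-hoc watermarking step. It provides a more secure solution for open-sourced models and facilitates the identification of derivative works that fine-tune or use these models without adhering to their license terms. Our results indicate, for instance, that generated outputs are detected with an accuracy of more than 75% at a false positive rate of $10^{-3}$, even after fine-tuning the latent generative model.|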
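|**2024-09-04**|[Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling](http://arxiv.org/abs/2409.02908)|null|Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their superior performance over other discrete diffusion models, and are rivaling auto-regressive models (ARMs) on language modeling tasks. Recent efforts to simplify the masked diffusion framework further align it with continuous-space diffusion models and yield more principled training and sampling recipes. In this paper, however, we reveal that both training and sampling of MDMs are theoretically free from the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling side is drawn by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs' original generation process while significantly reducing time-consuming categorical sampling and achieving a 20x speedup. In addition, our investigation challenges previous claims that MDMs can surpass ARMs in generative perplexity. We identify, for the first time, an underlying numerical issue, present even with 32-bit floating-point precision, that results in inaccurate categorical sampling. We show that the numerical issue lowers the effective temperature both theoretically and empirically, leading to unfair assessments of MDMs' generation results in the previous literature.|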
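|**2024-09-04**|[Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models](http://arxiv.org/abs/2409.02851)|**[link](https://github.com/Human-VDM/Human-VDM)**|Generating lifelike 3D humans from a single RGB image is a challenging task in computer vision, as it requires accurate modeling of geometry, high-quality texture, and plausible unseen parts. Existing methods typically use multi-view diffusion models for 3D human generation, but they often suffer from inconsistent views, which hinders high-quality 3D human generation. To address this, we propose Human-VDM, a novel method for generating 3D humans from a single RGB image using video diffusion models. Human-VDM provides temporally consistent views for 3D human generation using Gaussian Splatting. It consists of three modules: a view-consistent human video diffusion module, a video augmentation module, and a Gaussian Splatting module. First, a single image is fed into the human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and video interpolation to enhance the textures and geometric smoothness of the generated video. Finally, the 3D Human Gaussian Splatting module learns lifelike humans under the guidance of these high-resolution and view-consistent images. Experiments demonstrate that Human-VDM generates high-quality 3D humans from a single image, outperforming state-of-the-art methods in both generation quality and quantity. Project page: https://human-vdm.github.io/Human-VDM/|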
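|**2024-09-04**|[Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model](http://arxiv.org/abs/2409.02845)|null|Diffusion models have shown great potential in cross-modal generation tasks involving audio and music, such as text-to-sound and text-to-music generation. These text-controlled music generation models typically focus on generating music by capturing global musical attributes like genre and mood. However, music composition is a complex, multilayered task that often involves musical arrangement as an integral part of the process. This process involves composing each instrument part so that it aligns with existing parts in terms of beat, dynamics, harmony, and melody, which requires more precise control over tracks than text prompts typically provide. In this work, we address these challenges by extending MusicLDM, a latent diffusion model for music, into a multi-track generative model. By learning the joint probability of tracks sharing a context, our model is capable of generating music across several tracks that correspond well to each other, either conditionally or unconditionally. Additionally, our model is capable of arrangement generation, where the model can generate any subset of tracks given the others (e.g., generating a piano track complementing given bass and drum tracks). We compared our model with an existing multi-track generative model and demonstrated that our model achieves considerable improvements across objective metrics for both total and arrangement generation tasks.|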
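|**2024-09-04**|[Rethinking HTG Evaluation: Bridging Generation and Recognition](http://arxiv.org/abs/2409.02683)|**[link](https://github.com/koninik/htg_evaluation)**|The evaluation of generative models for natural image tasks has been extensively studied. Similar protocols and metrics are used in cases with unique particularities, such as Handwriting Generation (HTG), even though they might not be completely appropriate. In this work, we introduce three measures tailored for HTG evaluation, $\text{HTG}_{\text{HTR}}$, $\text{HTG}_{\text{style}}$, and $\text{HTG}_{\text{OOV}}$, and argue that they are more expedient for evaluating the quality of generated handwritten images. The metrics rely on the recognition error/accuracy of Handwriting Text Recognition and Writer Identification models and emphasize writing style, textual content, and diversity as the main aspects that adhere to the content of handwritten images. We conduct comprehensive experiments on the IAM handwriting database, showing that widely used metrics such as FID fail to properly quantify the diversity and the practical utility of generated handwriting samples. Our findings show that our metrics are richer in information and underscore the necessity of standardized evaluation protocols in HTG. The proposed metrics provide a more robust and informative protocol for assessing HTG quality, contributing to improved performance in HTR. Code for the evaluation protocol is available at: https://github.com/koninik/HTG_evaluation.|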
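|**2024-09-04**|[Introduction to Machine Learning](http://arxiv.org/abs/2409.02668)|null|This book introduces the mathematical foundations and techniques that underlie the development and analysis of many algorithms in machine learning. It starts by describing the notation used throughout the book and reviewing basic concepts in calculus, linear algebra, and probability; it also introduces some measure-theoretic terminology, which can serve as a reading guide for the sections that use these tools. The introductory chapters also provide background material on matrix analysis and optimization. Later chapters provide theoretical support for many algorithms used in the book, including stochastic gradient descent, proximal methods, etc. After discussing basic concepts for statistical prediction, the book introduces reproducing kernel theory and Hilbert space techniques, which are used in many places, before introducing various supervised statistical learning algorithms, including linear methods, support vector machines, decision trees, boosting, and neural networks. The subject then switches to generative methods, starting with a presentation of sampling methods and an introduction to the theory of Markov chains. The following chapters describe the theory of graphical models, introduce variational methods for models with latent variables, and cover deep-learning-based generative models. The next chapters focus on unsupervised learning methods, including clustering, factor analysis, and manifold learning. The final chapter of the book is theory-oriented and discusses concentration inequalities and generalization bounds.|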
|**2024-09-04**|[Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection](http://arxiv.org/abs/2409.02664)|null|The proliferation of deepfake faces poses huge potential negative impacts on our daily lives. Despite substantial advancements in deepfake detection over these years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via data perturbations, our method can reprogram a pretrained VLM model (e.g., CLIP) solely based on manipulating its input without tuning the inner parameters. Furthermore, we insert a pseudo-word guided by facial identity into the text prompt. Extensive experiments on several popular benchmarks demonstrate that (1) the cross-dataset and cross-manipulation performances of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in cross-dataset setting from FF++ to WildDeepfake) using a pre-trained CLIP model with our proposed reprogramming method; (2) our superior performances are at less cost of trainable parameters, making it a promising approach for real-world applications.||
|**2024-09-04**|[PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation](http://arxiv.org/abs/2409.02657)|null|While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose PoseTalk, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is using head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio offers short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to generate motion latent from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4% of the total reconstruction loss caused by both pose and lip, making optimization lean towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy to synthesize natural talking videos using two cascaded networks, i.e., CoarseNet and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses and the RefineNet focuses on learning finer lip motions by progressively estimating lip motions from low-to-high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate our pose prediction strategy achieves better pose diversity and realness compared to text-only or audio-only, and our video generator model outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk.|
|**2024-09-04**|[Skip-and-Play: Depth-Driven Pose-Preserved Image Generation for Any Objects](http://arxiv.org/abs/2409.02653)|null|The emergence of diffusion models has enabled the generation of diverse high-quality images solely from text, prompting subsequent efforts to enhance the controllability of these models. Despite the improvement in controllability, pose control remains limited to specific objects (e.g., humans) or poses (e.g., frontal view) due to the fact that pose is generally controlled via camera parameters (e.g., rotation angle) or keypoints (e.g., eyes, nose). Specifically, camera parameters-conditional pose control models generate unrealistic images depending on the object, owing to the small size of 3D datasets for training. Also, keypoint-based approaches encounter challenges in acquiring reliable keypoints for various objects (e.g., church) or poses (e.g., back view). To address these limitations, we propose depth-based pose control, as depth maps are easily obtainable from a single depth estimation model regardless of objects and poses, unlike camera parameters and keypoints. However, depth-based pose control confronts issues of shape dependency, as depth maps influence not only the pose but also the shape of the generated images. To tackle this issue, we propose Skip-and-Play (SnP), designed via analysis of the impact of three components of depth-conditional ControlNet on the pose and the shape of the generated images. To be specific, based on the analysis, we selectively skip parts of the components to mitigate shape dependency on the depth map while preserving the pose. Through various experiments, we demonstrate the superiority of SnP over baselines and showcase the ability of SnP to generate images of diverse objects and poses. Remarkably, SnP exhibits the ability to generate images even when the objects in the condition (e.g., a horse) and the prompt (e.g., a hedgehog) differ from each other.||

(back to top)

## LLM

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
|**2025-04-08**|[Knowledge Graph Completion with Relation-Aware Anchor Enhancement](http://arxiv.org/abs/2504.06129)|null|Text-based knowledge graph completion methods take advantage of pre-trained language models (PLM) to enhance intrinsic semantic connections of raw triplets with detailed text descriptions. Typical methods in this branch map an input query (textual descriptions associated with an entity and a relation) and its candidate entities into feature vectors, respectively, and then maximize the probability of valid triples. These methods are achieving promising performance and attracting increasing attention with the rapid development of large language models. According to the property of the language models, the more related and specific context information the input query provides, the more discriminative the resultant embedding will be. In this paper, through observation and validation, we find a neglected fact that the relation-aware neighbors of the head entities in queries could act as effective contexts for more precise link prediction. Driven by this finding, we propose a relation-aware anchor enhanced knowledge graph completion method (RAA-KGC). Specifically, in our method, to provide a reference for what the target entity might be like, we first generate anchor entities within the relation-aware neighborhood of the head entity. Then, by pulling the query embedding towards the neighborhoods of the anchors, it is tuned to be more discriminative for target entity matching. The results of our extensive experiments not only validate the efficacy of RAA-KGC but also reveal that by integrating our relation-aware anchor enhancement strategy, the performance of current leading methods can be notably enhanced without substantial modifications.|
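|**2025-04-07**|[Post-Training Language Models for Continual Relation Extraction](http://arxiv.org/abs/2504.05214)|null|Real-world data, such as news articles, social media posts, and chatbot conversations, is inherently dynamic and non-stationary, presenting significant challenges for constructing real-time structured representations through knowledge graphs (KGs). Relation Extraction (RE), a fundamental component of KG creation, often struggles to adapt to evolving data when traditional models rely on static, outdated datasets. Continual Relation Extraction (CRE) methods tackle this issue by incrementally learning new relations while preserving previously acquired knowledge. This study investigates the application of pre-trained language models (PLMs), specifically large language models (LLMs), to CRE, with a focus on leveraging memory replay to address catastrophic forgetting. We evaluate decoder-only models (e.g., Mistral-7B and Llama2-7B) and encoder-decoder models (e.g., Flan-T5 Base) on the TACRED and FewRel datasets. On TACRED, task-incremental fine-tuning of LLMs outperforms earlier approaches that use encoder-only models like BERT, excelling in seen-task accuracy and overall performance (measured by whole and average accuracy), especially with the Mistral and Flan-T5 models. Results on FewRel are similarly promising, achieving second place in whole and average accuracy metrics. This work underscores critical factors in knowledge transfer, language model architecture, and KG completeness, advancing CRE with LLMs and memory replay for dynamic, real-time relation extraction.|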
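|**2025-04-06**|[Pre-trained Language Models and Few-shot Learning for Medical Entity Extraction](http://arxiv.org/abs/2504.04385)|null|This study proposes a Transformer-based medical entity extraction method to enhance information extraction from medical literature. Considering the specialized and complex nature of medical texts, we compare the performance of different pre-trained language models (BERT, BioBERT, PubMedBERT, ClinicalBERT) on medical entity extraction tasks. Experimental results show that PubMedBERT achieves the best performance (F1-score = 88.8%), indicating that a language model pre-trained on biomedical literature is more effective in the medical domain. In addition, we analyze the impact of different entity extraction methods (CRF, Span-based, Seq2Seq) and find that the Span-based approach performs best in medical entity extraction tasks (F1-score = 88.6%), demonstrating superior accuracy in identifying entity boundaries. In low-resource scenarios, we further explore the application of few-shot learning to medical entity extraction. Experimental results show that even with only 10 training samples, the model achieves an F1-score of 79.1%, verifying the effectiveness of few-shot learning under limited data conditions. This study confirms that combining pre-trained language models with few-shot learning can improve the accuracy of medical entity extraction. Future research can integrate knowledge graphs and active learning strategies to improve the model's generalization and stability, providing a more effective solution for medical NLP research. Keywords: natural language processing, medical named entity recognition, pre-trained language model, few-shot learning, information extraction, deep learning|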
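|**2025-03-19**|[DCA: Dividing and Conquering Amnesia in Incremental Object Detection](http://arxiv.org/abs/2503.15295)|null|Incremental object detection (IOD) aims to cultivate an object detector that can continuously localize and recognize novel classes while preserving its performance on previous classes. Existing methods achieve certain success by improving knowledge distillation and exemplar replay for transformer-based detection frameworks, but the intrinsic forgetting mechanisms remain underexplored. In this paper, we investigate the cause of forgetting and discover a forgetting imbalance between localization and recognition in transformer-based IOD: localization forgets less and can generalize to future classes, whereas catastrophic forgetting occurs primarily in recognition. Based on these insights, we propose a Divide-and-Conquer Amnesia (DCA) strategy, which redesigns transformer-based IOD into a localization-then-recognition process. DCA can maintain and transfer the localization ability well, leaving the decoupled, fragile recognition to be handled separately. To reduce feature drift in recognition, we leverage semantic knowledge encoded in pre-trained language models to anchor class representations within a unified feature space across incremental tasks. This involves designing a duplex classifier fusion and embedding class semantic features into the recognition decoding process in the form of queries. Extensive experiments validate that our approach achieves state-of-the-art performance, especially in long-term incremental scenarios. For example, under the four-step setting on MS-COCO, our DCA strategy significantly improves the final AP by 6.9%.|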
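|**2025-03-18**|[RWKV-7 "Goose" with Expressive Dynamic State Evolution](http://arxiv.org/abs/2503.14456)|**[link](https://github.com/rwkv/rwkv-lm)**|We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models. These models establish a new state of the art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current state-of-the-art English language performance despite being trained on far fewer tokens than other top 3B models. Moreover, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open-source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproducibility, and adoption, we release our models and dataset component listing (https://huggingface.co/RWKV) and our training and inference code (https://github.com/RWKV/RWKV-LM) under the Apache 2.0 License.|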
|**2025-03-17**|[CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings](http://arxiv.org/abs/2503.13733)|null|Large language models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. However, these advancements challenge programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards. While there has been some research on this problem, it generally lacks domain coverage and robustness, and only covers a small number of programming languages. To this end, we propose a framework capable of distinguishing between human- and LLM-written code across multiple programming languages, code generators, and domains. We use a large-scale dataset from renowned platforms and LLM-based code generators, and apply rigorous data quality checks, feature engineering, and a comparative evaluation of traditional machine learning models, pre-trained language models (PLMs), and LLMs for code detection. We perform an evaluation on out-of-domain scenarios, such as detecting the authorship and hybrid authorship of generated code and generalizing to unseen models, domains, and programming languages. Moreover, our extensive experiments show that our framework effectively distinguishes human- from LLM-written code and sets a new benchmark for this task.|
|**2025-03-16**|[Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning](http://arxiv.org/abs/2503.13543)|null|Federated Prototype Learning (FedPL) has emerged as an effective strategy for handling data heterogeneity in Federated Learning (FL). In FedPL, clients collaboratively construct a set of global feature centers (prototypes), and let local features align with these prototypes to mitigate the effects of data heterogeneity. The performance of FedPL highly depends on the quality of prototypes. Existing methods assume that larger inter-class distances among prototypes yield better performance, and thus design different methods to increase these distances. However, we observe that while these methods increase prototype distances to enhance class discrimination, they inevitably disrupt essential semantic relationships among classes, which are crucial for model generalization. This raises an important question: how to construct prototypes that inherently preserve semantic relationships among classes? Directly learning these relationships from limited and heterogeneous client data can be problematic in FL. Recently, the success of pre-trained language models (PLMs) demonstrates their ability to capture semantic relationships from vast textual corpora. Motivated by this, we propose FedTSP, a novel method that leverages PLMs to construct semantically enriched prototypes from the textual modality, enabling more effective collaboration in heterogeneous data settings. We first use a large language model (LLM) to generate fine-grained textual descriptions for each class, which are then processed by a PLM on the server to form textual prototypes. To address the modality gap between client image models and the PLM, we introduce trainable prompts, allowing prototypes to adapt better to client tasks. Extensive experiments demonstrate that FedTSP mitigates data heterogeneity while significantly accelerating convergence.|
|**2025-03-17**|[Auto-Configuring Entity Resolution Pipelines](http://arxiv.org/abs/2503.13226)|null|The same real-world entity (e.g., a movie, a restaurant, a person) may be described in various ways on different datasets. Entity Resolution (ER) aims to find such different descriptions of the same entity, thereby improving data quality and, therefore, data value. However, an ER pipeline typically involves several steps (e.g., blocking, similarity estimation, clustering), with each step requiring its own configurations and tuning. The choice of the best configuration, among a vast number of possible combinations, is a dataset-specific and labor-intensive task both for novice and expert users, while it often requires some ground truth knowledge of real matches. In this work, we examine ways of automatically configuring a state-of-the-art end-to-end ER pipeline based on pre-trained language models under two settings: (i) When ground truth is available. In this case, sampling strategies that are typically used for hyperparameter optimization can significantly restrict the search of the configuration space. We experimentally compare their relative effectiveness and time efficiency, applying them to ER pipelines for the first time. (ii) When no ground truth is available. In this case, labelled data extracted from other datasets with available ground truth can be used to train a regression model that predicts the relative effectiveness of parameter configurations. Experimenting with 11 ER benchmark datasets, we evaluate the relative performance of existing techniques that address each problem, but have not been applied to ER before.|
|**2025-03-17**|[Quantum-Enhanced LLM Efficient Fine Tuning](http://arxiv.org/abs/2503.12790)|null|Low-Rank Adaptation (LoRA) enables efficient fine-tuning of pre-trained language models via low-rank matrix approximation, which is effective in many scenarios. However, its low-rank representation capacity is constrained in complex tasks or high-rank dependency settings, potentially limiting model adaptability. Addressing the expressive bottleneck of classical low-rank approximation in fine-tuning large language models, this paper proposes a parameter-efficient fine-tuning method based on a Quantum Weighted Tensor Hybrid Network (QWTHN), which leverages Quantum Neural Network (QNN). The study investigates quantum-classical hybrid parameter-efficient fine-tuning in low-rank spaces. QWTHN decomposes pre-trained weights into quantum neural network and tensor network representations, utilizing quantum state superposition and other methods to break through classical rank limitations. Experiments show that the proposed quantum fine-tuning technique for large models approaches or even surpasses the parameter efficiency of LoRA. On the CPsyCounD and R1-Distill-SFT datasets, QWTHN, compared to classical LoRA, reduces training loss by up to 15% while using 76% fewer parameters, and achieves an 8.4% performance improvement on the CPsyCounD test set. This research not only realizes lightweight and efficient adaptation of quantum resources to billion-parameter models but also validates the practical path of quantum hardware driven by large model tasks, laying the first engineering-ready technical foundation for future quantum-enhanced AGI systems.|
|**2025-03-12**|[Multimodal Language Modeling for High-Accuracy Single Cell Transcriptomics Analysis and Generation](http://arxiv.org/abs/2503.09427)|**[link](https://github.com/syr-cn/scmmgpt)**|Pre-trained language models (PLMs) have revolutionized scientific research, yet their application to single-cell analysis remains limited. Text PLMs cannot process single-cell RNA sequencing data, while cell PLMs lack the ability to handle free text, restricting their use in multimodal tasks. Existing efforts to bridge these modalities often suffer from information loss or inadequate single-modal pre-training, leading to suboptimal performances. To address these challenges, we propose Single-Cell MultiModal Generative Pre-trained Transformer (scMMGPT), a unified PLM for joint cell and text modeling. scMMGPT effectively integrates the state-of-the-art cell and text PLMs, facilitating cross-modal knowledge sharing for improved performance. To bridge the text-cell modality gap, scMMGPT leverages dedicated cross-modal projectors, and undergoes extensive pre-training on 27 million cells -- the largest dataset for multimodal cell-text PLMs to date. This large-scale pre-training enables scMMGPT to excel in joint cell-text tasks, achieving an 84% relative improvement of textual discrepancy for cell description generation, 20.5% higher accuracy for cell type annotation, and 4% improvement in $k$-NN accuracy for text-conditioned pseudo-cell generation, outperforming baselines.|
|**2025-03-08**|[Multi-Attribute Multi-Grained Adaptation of Pre-Trained Language Models for Text Understanding from Bayesian Perspective](http://arxiv.org/abs/2503.06085)|**[link](https://github.com/yoyo-yun/M2A)**|Current neural networks often employ multi-domain-learning or attribute-injecting mechanisms to incorporate non-independent and identically distributed (non-IID) information for text understanding tasks by capturing individual characteristics and the relationships among samples. However, the extent of the impact of non-IID information and how these methods affect pre-trained language models (PLMs) remains unclear. This study revisits the assumption that non-IID information enhances PLMs to achieve performance improvements from a Bayesian perspective, which unearths and integrates non-IID and IID features. Furthermore, we proposed a multi-attribute multi-grained framework for PLM adaptations (M2A), which combines multi-attribute and multi-grained views to mitigate uncertainty in a lightweight manner. We evaluate M2A through prevalent text-understanding datasets and demonstrate its superior performance, mainly when data are implicitly non-IID, and PLMs scale larger.|
|**2025-03-08**|[A Survey on Post-training of Large Language Models](http://arxiv.org/abs/2503.06072)|null|The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. However, their pre-trained architectures often reveal limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. These challenges necessitate advanced post-training language models (PoLMs) to address these shortcomings, such as OpenAI-o1/o3 and DeepSeek-R1 (collectively known as Large Reasoning Models, or LRMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Efficiency, which optimizes resource utilization amidst increasing complexity; and Integration and Adaptation, which extend capabilities across diverse modalities while addressing coherence issues. Charting progress from ChatGPT's foundational alignment strategies to DeepSeek-R1's innovative reasoning advancements, we illustrate how PoLMs leverage datasets to mitigate biases, deepen reasoning capabilities, and enhance domain adaptability. Our contributions include a pioneering synthesis of PoLM evolution, a structured taxonomy categorizing techniques and datasets, and a strategic agenda emphasizing the role of LRMs in improving reasoning proficiency and domain flexibility. As the first survey of its scope, this work consolidates recent PoLM advancements and establishes a rigorous intellectual framework for future research, fostering the development of LLMs that excel in precision, ethical robustness, and versatility across scientific and societal applications.|
|**2025-03-07**|[Fine-Grained Evaluation for Implicit Discourse Relation Recognition](http://arxiv.org/abs/2503.05326)|null|Implicit discourse relation recognition is a challenging task in discourse analysis due to the absence of explicit discourse connectives between spans of text. Recent pre-trained language models have achieved great success on this task. However, there is no fine-grained analysis of the performance of these pre-trained language models for this task. Therefore, the difficulty and possible directions of this task are unclear. In this paper, we analyze the model predictions in depth, attempting to identify the difficulties for pre-trained language models and the possible directions of this task. In addition to an in-depth analysis of this task using pre-trained language models, we semi-manually annotate data to add relatively high-quality data for the relations with few annotated examples in PDTB 3.0. The annotated data significantly help improve implicit discourse relation recognition for level-2 senses.|
|**2025-03-06**|[Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning](http://arxiv.org/abs/2503.04611)|null|In this work, we explain the approach we employed in the BabyLM Challenge, which uses various methods of training language models (LMs) with significantly less data than traditional large language models (LLMs) and is inspired by how human children learn. While a human child is exposed to far less linguistic input than an LLM, they still achieve remarkable language understanding and generation abilities. To this end, we develop a model trained on a curated dataset consisting of 10 million words, primarily sourced from child-directed transcripts. The 2024 BabyLM Challenge initial dataset of 10M words is filtered to 8.5M. Next, it is supplemented with a randomly selected subset of the TVR dataset consisting of 1.5M words of television dialogues. The latter dataset ensures that, similar to children, the model is also exposed to language through media. Furthermore, we reduce the vocabulary size to 32,000 tokens, aligning it with the limited vocabulary of children in the early stages of language acquisition. Using curriculum learning, our model matches the baseline on certain benchmarks while surpassing it on others. Additionally, incorporating common LLM training datasets, such as MADLAD-400, degrades performance. These findings underscore the importance of dataset selection, vocabulary scaling, and curriculum learning in creating more data-efficient language models that better mimic human learning processes.|
|**2025-03-04**|[Zero-Shot Complex Question-Answering on Long Scientific Documents](http://arxiv.org/abs/2503.02695)|**[link](https://github.com/wendywangwwt/zero-shot-complex-question-answering-on-long-scientific-documents)**|With the rapid development in Transformer-based language models, the reading comprehension tasks on short documents and simple questions have been largely addressed. Long documents, specifically the scientific documents that are densely packed with knowledge discovered and developed by humans, remain relatively unexplored. These documents often come with a set of complex and more realistic questions, adding to their complexity. We present a zero-shot pipeline framework that enables social science researchers to perform question-answering tasks that are complex yet of predetermined question formats on full-length research papers without requiring machine learning expertise. Our approach integrates pre-trained language models to handle challenging scenarios including multi-span extraction, multi-hop reasoning, and long-answer generation. Evaluating on MLPsych, a novel dataset of social psychology papers with annotated complex questions, we demonstrate that our framework achieves strong performance through combination of extractive and generative models. This work advances document understanding capabilities for social sciences while providing practical tools for researchers.||
|**2025-02-28**|[Attend or Perish: Benchmarking Attention in Algorithmic Reasoning](http://arxiv.org/abs/2503.01909)|null|Can transformers learn to perform algorithmic tasks reliably across previously unseen input/output domains? While pre-trained language models show solid accuracy on benchmarks incorporating algorithmic reasoning, assessing the reliability of these results necessitates an ability to cleanse models' functional capabilities from memorization. In this paper, we propose an algorithmic benchmark comprising six tasks of infinite input domains where we can also disentangle and trace the correct, robust algorithm necessary for the task. This allows us to assess (i) models' ability to extrapolate to unseen types of inputs, including new lengths, value ranges or input domains, but also (ii) to assess the robustness of the functional mechanism in recent models through the lens of their attention maps. We make the implementation of all our tasks and interoperability methods publicly available at https://github.com/michalspiegel/AttentionSpan .||
|**2025-02-28**|[Retrieval Backward Attention without Additional Training: Enhance Embeddings of Large Language Models via Repetition](http://arxiv.org/abs/2502.20726)|**[link](https://github.com/cqdyf099/ReBA)**|Language models can be viewed as functions that embed text into Euclidean space, where the quality of the embedding vectors directly determines model performance. Training such neural networks involves various uncertainties. This paper focuses on improving the performance of pre-trained language models in zero-shot settings through a simple and easily implementable method. We propose a novel backward attention mechanism to enhance contextual information encoding. Evaluated on the Chinese Massive Text Embedding Benchmark (C-MTEB), our approach achieves significant improvements across multiple tasks, providing valuable insights for advancing zero-shot learning capabilities.||
|**2025-02-27**|[Unlocking Multi-Modal Potentials for Dynamic Text-Attributed Graph Representation](http://arxiv.org/abs/2502.19651)|null|Dynamic Text-Attributed Graphs (DyTAGs) are a novel graph paradigm that captures evolving temporal edges alongside rich textual attributes. A prior approach to representing DyTAGs leverages pre-trained language models to encode text attributes and subsequently integrates them into dynamic graph models. However, it follows edge-centric modeling, as in dynamic graph learning, which is limited in local structures and fails to exploit the unique characteristics of DyTAGs, leading to suboptimal performance. We observe that DyTAGs inherently comprise three distinct modalities-temporal, textual, and structural-often exhibiting dispersed or even orthogonal distributions, with the first two largely overlooked in existing research. Building on this insight, we propose MoMent, a model-agnostic multi-modal framework that can seamlessly integrate with dynamic graph models for structural modality learning. The core idea is to shift from edge-centric to node-centric modeling, fully leveraging three modalities for node representation. Specifically, MoMent presents non-shared node-centric encoders based on the attention mechanism to capture global temporal and semantic contexts from temporal and textual modalities, together with local structure learning, thus generating modality-specific tokens. To prevent disjoint latent space, we propose a symmetric alignment loss, an auxiliary objective that aligns temporal and textual tokens, ensuring global temporal-semantic consistency with a theoretical guarantee. Last, we design a lightweight adaptor to fuse these tokens, generating comprehensive and cohesive node representations. We theoretically demonstrate that MoMent enhances discriminative power over exclusive edge-centric modeling. Extensive experiments across seven datasets and two downstream tasks show that MoMent achieves up to 33.62% improvement against the baseline using four dynamic graph models.||
|**2025-02-25**|[DBR: Divergence-Based Regularization for Debiasing Natural Language Understanding Models](http://arxiv.org/abs/2502.18353)|null|Pre-trained language models (PLMs) have achieved impressive results on various natural language processing tasks. However, recent research has revealed that these models often rely on superficial features and shortcuts instead of developing a genuine understanding of language, especially for natural language understanding (NLU) tasks. Consequently, the models struggle to generalize to out-of-domain data. In this work, we propose Divergence Based Regularization (DBR) to mitigate this shortcut learning behavior. Our method measures the divergence between the output distributions for original examples and examples where shortcut tokens have been masked. This process prevents the model's predictions from being overly influenced by shortcut features or biases. We evaluate our model on three NLU tasks and find that it improves out-of-domain performance with little loss of in-domain accuracy. Our results demonstrate that reducing the reliance on shortcuts and superficial features can enhance the generalization ability of large pre-trained language models.||
|**2025-02-24**|[FedBM: Stealing Knowledge from Pre-trained Language Models for Heterogeneous Federated Learning](http://arxiv.org/abs/2502.16832)|**[link](https://github.com/cuhk-aim-group/fedbm)**|Federated learning (FL) has shown great potential in medical image computing since it provides a decentralized learning paradigm that allows multiple clients to train a model collaboratively without privacy leakage. However, current studies have shown that data heterogeneity incurs local learning bias in classifiers and feature extractors of client models during local training, leading to the performance degradation of a federation system. To address these issues, we propose a novel framework called Federated Bias eliMinating (FedBM) to get rid of local learning bias in heterogeneous federated learning (FL), which mainly consists of two modules, i.e., Linguistic Knowledge-based Classifier Construction (LKCC) and Concept-guided Global Distribution Estimation (CGDE). Specifically, LKCC exploits class concepts, prompts and pre-trained language models (PLMs) to obtain concept embeddings. These embeddings are used to estimate the latent concept distribution of each class in the linguistic space. Based on the theoretical derivation, we can rely on these distributions to pre-construct a high-quality classifier for clients to achieve classification optimization, which is frozen to avoid classifier bias during local training. CGDE samples probabilistic concept embeddings from the latent concept distributions to learn a conditional generator to capture the input space of the global model. Three regularization terms are introduced to improve the quality and utility of the generator. The generator is shared by all clients and produces pseudo data to calibrate updates of local feature extractors. Extensive comparison experiments and ablation studies on public datasets demonstrate the superior performance of FedBM over state-of-the-art methods and confirm the effectiveness of each module, respectively. The code is available at https://github.com/CUHK-AIM-Group/FedBM.||
|**2025-02-21**|[Extraction multi-étiquettes de relations en utilisant des couches de Transformer](http://arxiv.org/abs/2502.15619)|null|In this article, we present the BTransformer18 model, a deep learning architecture designed for multi-label relation extraction in French texts. Our approach combines the contextual representation capabilities of pre-trained language models from the BERT family - such as BERT, RoBERTa, and their French counterparts CamemBERT and FlauBERT - with the power of Transformer encoders to capture long-term dependencies between tokens. Experiments conducted on the dataset from the TextMine'25 challenge show that our model achieves superior performance, particularly when using CamemBERT-Large, with a macro F1 score of 0.654, surpassing the results obtained with FlauBERT-Large. These results demonstrate the effectiveness of our approach for the automatic extraction of complex relations in intelligence reports.||
|**2025-02-20**|[Rapid Word Learning Through Meta In-Context Learning](http://arxiv.org/abs/2502.14791)|null|Humans can quickly learn a new word from a few illustrative examples, and then systematically and flexibly use it in novel contexts. Yet the abilities of current language models for few-shot word learning, and methods for improving these abilities, are underexplored. In this study, we introduce a novel method, Meta-training for IN-context learNing Of Words (Minnow). This method trains language models to generate new examples of a word's usage given a few in-context examples, using a special placeholder token to represent the new word. This training is repeated on many new words to develop a general word-learning ability. We find that training models from scratch with Minnow on human-scale child-directed language enables strong few-shot word learning, comparable to a large language model (LLM) pre-trained on orders of magnitude more data. Furthermore, through discriminative and generative evaluations, we demonstrate that finetuning pre-trained LLMs with Minnow improves their ability to discriminate between new words, identify syntactic categories of new words, and generate reasonable new usages and definitions for new words, based on one or a few in-context examples. These findings highlight the data efficiency of Minnow and its potential to improve language model performance in word learning tasks.||
|**2025-02-18**|[B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability](http://arxiv.org/abs/2502.12992)|null|Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural models. Meanwhile, B-cos networks have been introduced to improve model explainability through architectural and computational adaptations, but their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous B-cos methods. Our automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we provide practical guidelines for effectively building B-cos LMs based on our findings. Our code is available at https://anonymous.4open.science/r/bcos_lm.||
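|**2025-02-18**|[Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text](http://arxiv.org/abs/2502.12953)|**[link](https://github.com/jarcaandrei/tiacbm)**|Masked language modeling has become a widely adopted unsupervised technique for pre-training language models. However, the process of selecting tokens to mask is random, and the percentage of masked tokens is typically fixed throughout training. In this paper, we propose to adjust the masking ratio and to decide which tokens to mask according to a novel task-informed anti-curriculum learning scheme. First, we use task-specific knowledge about useful and harmful tokens to determine which tokens to mask. Second, we propose a cyclic decaying masking ratio, which corresponds to an anti-curriculum schedule (from hard to easy). We exemplify our novel task-informed anti-curriculum by masking (TIACBM) approach on three diverse downstream tasks: sentiment analysis, text classification by topic, and authorship attribution. Our findings suggest that TIACBM enhances the model's ability to focus on key task-relevant features, yielding statistically significant performance gains across tasks. We release our code at https://github.com/JarcaAndrei/TIACBM.||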
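|**2025-02-18**|[RM-PoT: Reformulating Mathematical Problems and Solving via Program of Thoughts](http://arxiv.org/abs/2502.12589)|null|In recent years, substantial progress has been made in training language models to perform step-by-step reasoning on complex numerical reasoning tasks. Beyond the methods used to solve these problems, the structure and formulation of the problems themselves also play a crucial role in determining the performance of large language models. We observe that even small changes in the surface form of a mathematical problem can have a profound impact on both its answer distribution and its solve rate. This highlights the vulnerability of large language models to surface-level variations and reveals their limited robustness when reasoning through complex problems. In this paper, we propose RM-PoT, a three-stage framework that integrates problem reformulation (RM), code-aided reasoning (PoT), and domain-aware few-shot learning to address these limitations. Our approach first reformulates the input problem into diverse surface forms to reduce structural bias, then retrieves five semantically aligned examples from a pre-constructed domain-specific question bank to provide contextual guidance, and finally generates executable Python code for precise computation.||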
|**2025-02-16**|[Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product](http://arxiv.org/abs/2502.12200)|null|Prompt tuning (PT) offers a cost-effective alternative to fine-tuning large-scale pre-trained language models (PLMs), requiring only a few parameters in soft prompt tokens added before the input text. However, existing PT approaches face two significant issues: (i) They overlook intrinsic semantic associations between soft prompt tokens, leading to high discreteness and limited interactions, thus reducing the model's comprehension and effectiveness in complex tasks. (ii) Due to the complexity of downstream tasks, long soft prompt is necessitated to improve performance, but prompt length correlates positively with memory usage and computational costs. Achieving high efficiency and performance remains an ongoing challenge. To address these issues, we propose a novel Low-parameters prompt tuning (LAMP) method, which leverages prompt decomposition and compressed outer product. Specifically, the prompt decomposition module employs Truncated SVD to reduce training parameters and significantly lower the dimensionality of the soft prompt parameter space. It then utilizes a compressed outer product module to facilitate multiple interactions among prompt tokens, exploring their intrinsic associations to enhance knowledge representation. Finally, LAMP uses average pooling to reduce memory usage and training/inference time. Extensive experiments across six architectures and eight datasets demonstrate that LAMP outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.||
|**2025-02-17**|[Text Classification in the LLM Era -- Where do we stand?](http://arxiv.org/abs/2502.11830)|null|Large Language Models revolutionized NLP and showed dramatic performance improvements across several tasks. In this paper, we investigated the role of such language models in text classification and how they compare with other approaches relying on smaller pre-trained language models. Considering 32 datasets spanning 8 languages, we compared zero-shot classification, few-shot fine-tuning and synthetic data based classifiers with classifiers built using the complete human labeled dataset. Our results show that zero-shot approaches do well for sentiment classification, but are outperformed by other approaches for the rest of the tasks, and synthetic data sourced from multiple LLMs can build better classifiers than zero-shot open LLMs. We also see wide performance disparities across languages in all the classification scenarios. We expect that these findings would guide practitioners working on developing text classification systems across languages.||
|**2025-02-17**|[Chinese Spelling Correction: A Comprehensive Survey of Progress, Challenges, and Opportunities](http://arxiv.org/abs/2502.11508)|null|Chinese Spelling Correction (CSC) is a critical task in natural language processing, aimed at detecting and correcting spelling errors in Chinese text. This survey provides a comprehensive overview of CSC, tracing its evolution from pre-trained language models to large language models, and critically analyzing their respective strengths and weaknesses in this domain. Moreover, we further present a detailed examination of existing benchmark datasets, highlighting their inherent challenges and limitations. Finally, we propose promising future research directions, particularly focusing on leveraging the potential of LLMs and their reasoning capabilities for improved CSC performance. To the best of our knowledge, this is the first comprehensive survey dedicated to the field of CSC. We believe this work will serve as a valuable resource for researchers, fostering a deeper understanding of the field and inspiring future advancements.||
|**2025-02-06**|[Lexical Substitution is not Synonym Substitution: On the Importance of Producing Contextually Relevant Word Substitutes](http://arxiv.org/abs/2502.04173)|null|Lexical Substitution is the task of replacing a single word in a sentence with a similar one. This should ideally be one that is not necessarily only synonymous, but also fits well into the surrounding context of the target word, while preserving the sentence's grammatical structure. Recent advances in Lexical Substitution have leveraged the masked token prediction task of Pre-trained Language Models to generate replacements for a given word in a sentence. With this technique, we introduce ConCat, a simple augmented approach which utilizes the original sentence to bolster contextual information sent to the model. Compared to existing approaches, it proves to be very effective in guiding the model to make contextually relevant predictions for the target word. Our study includes a quantitative evaluation, measured via sentence similarity and task performance. In addition, we conduct a qualitative human analysis to validate that users prefer the substitutions proposed by our method, as opposed to previous methods. Finally, we test our approach on the prevailing benchmark for Lexical Substitution, CoInCo, revealing potential pitfalls of the benchmark. These insights serve as the foundation for a critical discussion on the way in which Lexical Substitution is evaluated.||
|**2025-02-01**|[Social media polarization during conflict: Insights from an ideological stance dataset on Israel-Palestine Reddit comments](http://arxiv.org/abs/2502.00414)|null|In politically sensitive scenarios like wars, social media serves as a platform for polarized discourse and expressions of strong ideological stances. While prior studies have explored ideological stance detection in general contexts, limited attention has been given to conflict-specific settings. This study addresses this gap by analyzing 9,969 Reddit comments related to the Israel-Palestine conflict, collected between October 2023 and August 2024. The comments were categorized into three stance classes: Pro-Israel, Pro-Palestine, and Neutral. Various approaches, including machine learning, pre-trained language models, neural networks, and prompt engineering strategies for open source large language models (LLMs), were employed to classify these stances. Performance was assessed using metrics such as accuracy, precision, recall, and F1-score. Among the tested methods, the Scoring and Reflective Re-read prompt in Mixtral 8x7B demonstrated the highest performance across all metrics. This study provides comparative insights into the effectiveness of different models for detecting ideological stances in highly polarized social media contexts. The dataset used in this research is publicly available for further exploration and validation.||
|**2025-02-01**|[Contrastive Private Data Synthesis via Weighted Multi-PLM Fusion](http://arxiv.org/abs/2502.00245)|null|Substantial quantity and high quality are the golden rules of making a good training dataset, with sample privacy protection equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods relying on pre-trained models for data synthesis often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise and existing pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained language models (PLM) framework, named WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-Q voting mechanism, and leverages low-quality synthetic samples for contrastive generation via collaboration among dynamically weighted multiple pre-trained models. Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance over diverse downstream tasks. Code is available at https://anonymous.4open.science/r/WASP.||
|**2025-01-31**|[An Efficient Approach for Machine Translation on Low-resource Languages: A Case Study in Vietnamese-Chinese](http://arxiv.org/abs/2501.19314)|null|Despite the rise of recent neural networks in machine translation, those networks do not work well if the training data is insufficient. In this paper, we propose an approach for machine translation in low-resource language pairs such as Vietnamese-Chinese. Our proposed method leverages the power of the multilingual pre-trained language model (mBART) and both Vietnamese and Chinese monolingual corpora. Firstly, we build an early bird machine translation model using the bilingual training dataset. Secondly, we use the TF-IDF technique to select the sentences from the monolingual corpora that are most related to the domains of the parallel dataset. Finally, the first model is used to synthesize augmented training data from the selected monolingual corpora for the translation model. Our proposed scheme outperformed the transformer model by 8%. The augmented dataset also improved model performance.||
|**2025-01-31**|[MPLinker: Multi-template Prompt-tuning with Adversarial Training for Issue-commit Link Recovery](http://arxiv.org/abs/2501.19026)|null|In recent years, the pre-training, prompting and prediction paradigm, known as prompt-tuning, has achieved significant success in Natural Language Processing (NLP). Issue-commit Link Recovery (ILR) in Software Traceability (ST) plays an important role in improving the reliability, quality, and security of software systems. Current ILR methods convert ILR into a classification task using pre-trained language models (PLMs) and dedicated neural networks. However, these methods do not fully utilize the semantic information embedded in PLMs and therefore fail to achieve acceptable performance. To address this limitation, we introduce a novel paradigm: Multi-template Prompt-tuning with adversarial training for issue-commit Link recovery (MPLinker). MPLinker redefines the ILR task as a cloze task via template-based prompt-tuning and incorporates adversarial training to enhance model generalization and reduce overfitting. We evaluated MPLinker on six open-source projects using a comprehensive set of performance metrics. The experiment results demonstrate that MPLinker achieves an average F1-score of 96.10%, Precision of 96.49%, Recall of 95.92%, MCC of 94.04%, AUC of 96.05%, and ACC of 98.15%, significantly outperforming existing state-of-the-art methods. Overall, MPLinker improves the performance and generalization of ILR models, and introduces innovative concepts and methods for ILR. The replication package for MPLinker is available at https://github.com/WTU-intelligent-software-development/MPLinker||
|**2025-01-31**|[Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities](http://arxiv.org/abs/2501.18845)|null|The increasing size and complexity of pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training sets could unexpectedly make the model overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have prominent text generation capabilities, which improve the quality and quantity of data and play a crucial role in data augmentation. Specifically, distinctive prompt templates are given in personalised tasks to guide LLMs in generating the required content. Recent promising retrieval-based techniques further improve the expressive performance of LLMs in data augmentation by introducing external knowledge to enable them to produce more grounded-truth data. This survey provides an in-depth analysis of data augmentation in LLMs, classifying the techniques into Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation and Hybrid Augmentation. We summarise the post-processing approaches in data augmentation, which contributes significantly to refining the augmented data and enabling the model to filter out unfaithful content. Then, we provide the common tasks and evaluation metrics. Finally, we introduce existing challenges and future opportunities that could bring further improvement to data augmentation.||
|**2025-02-05**|[Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate](http://arxiv.org/abs/2501.17703)|null|Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding-traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of ([query; noisy response], critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks with different base models like Qwen2.5, Qwen2.5-Math and DeepSeek-Math. We further expand to MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our model Qwen2.5-Math-CFT only requires 1 hour training on 8xH100 over the 50K examples. It can match or outperform strong competitors like Qwen2.5-Math-Instruct on most benchmarks, which use over 2M samples. Moreover, it can match the performance of SimpleRL, which is a deepseek-r1 replication trained with 140x more compute. Ablation studies show that CFT is robust to the source of noisy response and teacher critique model. Through these findings, we argue that CFT offers a more effective alternative to advance the reasoning of language models.||
|**2025-01-25**|[Pre-training a Transformer-Based Generative Model Using a Small Sepedi Dataset](http://arxiv.org/abs/2501.15281)|null|Due to the scarcity of data in low-resourced languages, the development of language models for these languages has been very slow. Currently, pre-trained language models have gained popularity in natural language processing, especially, in developing domain-specific models for low-resourced languages. In this study, we experiment with the impact of using occlusion-based techniques when training a language model for a text generation task. We curate 2 new datasets, the Sepedi monolingual (SepMono) dataset from several South African resources and the Sepedi radio news (SepNews) dataset from the radio news domain. We use the SepMono dataset to pre-train transformer-based models using the occlusion and non-occlusion pre-training techniques and compare performance. The SepNews dataset is specifically used for fine-tuning. Our results show that the non-occlusion models perform better compared to the occlusion-based models when measuring validation loss and perplexity. However, analysis of the generated text using the BLEU score metric, which measures the quality of the generated text, shows a slightly higher BLEU score for the occlusion-based models compared to the non-occlusion models.||
|**2025-01-24**|[Code Change Intention, Development Artifact and History Vulnerability: Putting Them Together for Vulnerability Fix Detection by LLM](http://arxiv.org/abs/2501.14983)|null|Detecting vulnerability fix commits in open-source software is crucial for maintaining software security. To help OSS identify vulnerability fix commits, several automated approaches have been developed. However, existing approaches like VulFixMiner and CoLeFunDa focus solely on code changes, neglecting essential context from development artifacts. Tools like Vulcurator, which integrates issue reports, fail to leverage semantic associations between different development artifacts (e.g., pull requests and history vulnerability fixes). Moreover, they miss vulnerability fixes in tangled commits and lack explanations, limiting practical use. To address these limitations, we propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning to improve the accuracy of vulnerability fix detection. LLM4VFD comprises three components: (1) Code Change Intention, which analyzes commit summaries, purposes, and implications using Chain-of-Thought reasoning; (2) Development Artifact, which incorporates context from related issue reports and pull requests; (3) Historical Vulnerability, which retrieves similar past vulnerability fixes to enrich context. More importantly, on top of the prediction, LLM4VFD also provides a detailed analysis and explanation to help security experts understand the rationale behind the decision. We evaluated LLM4VFD against state-of-the-art techniques, including Pre-trained Language Model-based approaches and vanilla LLMs, using a newly collected dataset, BigVulFixes. Experimental results demonstrate that LLM4VFD significantly outperforms the best-performing existing approach by 68.1%--145.4%. Furthermore, we conducted a user study with security experts, showing that the analysis generated by LLM4VFD improves the efficiency of vulnerability fix identification.||
|**2025-01-24**|[Adaptive Rank Allocation for Federated Parameter-Efficient Fine-Tuning of Language Models](http://arxiv.org/abs/2501.14406)|null|Pre-trained Language Models (PLMs) have demonstrated their superiority and versatility in modern Natural Language Processing (NLP), effectively adapting to various downstream tasks through further fine-tuning. Federated Parameter-Efficient Fine-Tuning (FedPEFT) has emerged as a promising solution to address privacy and efficiency challenges in distributed training for PLMs on mobile devices. However, our measurements reveal two key limitations of FedPEFT: heterogeneous data leads to significant performance degradation, and a fixed parameter configuration results in communication inefficiency. To overcome these limitations, we propose FedARA, a novel Federated Adaptive Rank Allocation for parameter-efficient fine-tuning of language models. Specifically, FedARA employs truncated singular value decomposition (SVD) adaptation to enhance flexibility and expressiveness, significantly mitigating the adverse effects of data heterogeneity. Subsequently, it utilizes dynamic rank allocation to progressively identify critical ranks, effectively improving communication efficiency. Lastly, it leverages rank-based module pruning to remove inactive modules, steadily reducing local training time and peak memory usage in each round. Extensive experiments show that FedARA consistently outperforms weak baselines by an average of 8.49% and strong baselines by 6.95% across various datasets under data heterogeneity while significantly improving communication efficiency by 2.40×. Moreover, experiments on AGX Orin, Orin Nano and Raspberry Pi 5 devices demonstrate substantial decreases in total training time and energy consumption by up to 48.90% and 46.95%, respectively.||
|**2025-01-22**|[The potential -- and the pitfalls -- of using pre-trained language models as cognitive science theories](http://arxiv.org/abs/2501.12651)|null|Many studies have evaluated the cognitive alignment of Pre-trained Language Models (PLMs), i.e., their correspondence to adult performance across a range of cognitive domains. Recently, the focus has expanded to the developmental alignment of these models: identifying phases during training where improvements in model performance track improvements in children's thinking over development. However, there are many challenges to the use of PLMs as cognitive science theories, including different architectures, different training data modalities and scales, and limited model interpretability. In this paper, we distill lessons learned from treating PLMs, not as engineering artifacts but as cognitive science and developmental science models. We review assumptions used by researchers to map measures of PLM performance to measures of human performance. We identify potential pitfalls of this approach to understanding human thinking, and we end by enumerating criteria for using PLMs as credible accounts of cognition and cognitive development.||
|**2025-01-20**|[Revisiting Language Models in Neural News Recommender Systems](http://arxiv.org/abs/2501.11391)|**[link](https://github.com/go0day/lm4newsrec)**|Neural news recommender systems (RSs) have integrated language models (LMs) to encode news articles with rich textual information into representations, thereby improving the recommendation process. Most studies suggest that (i) news RSs achieve better performance with larger pre-trained language models (PLMs) than shallow language models (SLMs), and (ii) that large language models (LLMs) outperform PLMs. However, other studies indicate that PLMs sometimes lead to worse performance than SLMs. Thus, it remains unclear whether using larger LMs consistently improves the performance of news RSs. In this paper, we revisit, unify, and extend these comparisons of the effectiveness of LMs in news RSs using the real-world MIND dataset. We find that (i) larger LMs do not necessarily translate to better performance in news RSs, and (ii) they require stricter fine-tuning hyperparameter selection and greater computational resources to achieve optimal recommendation performance than smaller LMs. On the positive side, our experiments show that larger LMs lead to better recommendation performance for cold-start users: they alleviate dependency on extensive user interaction history and make recommendations more reliant on the news content.||
|**2025-01-15**|[Expanding Vietnamese SentiWordNet to Improve Performance of Vietnamese Sentiment Analysis Models](http://arxiv.org/abs/2501.08758)|null|Sentiment analysis is one of the most crucial tasks in Natural Language Processing (NLP), involving the training of machine learning models to classify text based on the polarity of opinions. Pre-trained Language Models (PLMs) can be applied to downstream tasks through fine-tuning, eliminating the need to train the model from scratch. Specifically, PLMs have been employed for Sentiment Analysis, a process that involves detecting, analyzing, and extracting the polarity of text sentiments. Numerous models have been proposed to address this task, with pre-trained PhoBERT-V2 models standing out as the state-of-the-art language models for Vietnamese. The PhoBERT-V2 pre-training approach is based on RoBERTa, optimizing the BERT pre-training method for more robust performance. In this paper, we introduce a novel approach that combines PhoBERT-V2 and SentiWordnet for Sentiment Analysis of Vietnamese reviews. Our proposed model utilizes PhoBERT-V2 for Vietnamese, offering a robust optimization for the prominent BERT model in the context of Vietnamese language, and leverages SentiWordNet, a lexical resource explicitly designed to support sentiment classification applications. Experimental results on the VLSP 2016 and AIVIVN 2019 datasets demonstrate that our sentiment analysis system has achieved excellent performance in comparison to other models.||
|**2025-01-13**|[Evaluating Pre-Trained Models for Multi-Language Vulnerability Patching](http://arxiv.org/abs/2501.07339)|null|Software vulnerabilities pose critical security risks, demanding prompt and effective mitigation strategies. While advancements in Automated Program Repair (APR) have primarily targeted general software bugs, the domain of vulnerability patching, which is a security-critical subset of APR, remains underexplored. This paper investigates the potential of pre-trained language models, CodeBERT and CodeT5, for automated vulnerability patching across diverse datasets and five programming languages. We evaluate these models on their accuracy, computational efficiency, and how the length of vulnerable code patches impacts performance. Our findings reveal promising accuracy levels, particularly for CodeT5 on datasets with complex vulnerability patterns, while CodeBERT demonstrates strengths in handling fragmented or context-limited datasets. CodeT5 further showcases superior efficiency, making it well-suited for large-scale applications. However, both models face challenges in maintaining performance as patch length increases, highlighting the complexity of addressing extended in program repair specifically aimed at fixing vulnerabilities. This study benchmarks model performance, highlights key limitations, and offers insights to improve automated vulnerability patching for practical security applications.||
|**2025-01-12**|[Language Fusion for Parameter-Efficient Cross-lingual Transfer](http://arxiv.org/abs/2501.06892)|**[link](https://github.com/pnborchert/flare)**|Limited availability of multilingual text corpora for training language models often leads to poor performance on downstream tasks due to undertrained representation spaces for languages other than English. This 'under-representation' has motivated recent cross-lingual transfer methods to leverage the English representation space by e.g. mixing English and 'non-English' tokens at the input level or extending model parameters to accommodate new languages. However, these approaches often come at the cost of increased computational complexity. We propose Fusion for Language Representations (FLARE) in adapters, a novel method that enhances representation quality and downstream performance for languages other than English while maintaining parameter efficiency. FLARE integrates source and target language representations within low-rank (LoRA) adapters using lightweight linear transformations, maintaining parameter efficiency while improving transfer performance. A series of experiments across representative cross-lingual natural language understanding tasks, including natural language inference, question-answering and sentiment analysis, demonstrate FLARE's effectiveness. FLARE achieves performance improvements of 4.9% for Llama 3.1 and 2.2% for Gemma 2 compared to standard LoRA fine-tuning on question-answering tasks, as measured by the exact match metric.||
|**2025-01-10**|[Linguistic Entity Masking to Improve Cross-Lingual Representation of Multilingual Language Models for Low-Resource Languages](http://arxiv.org/abs/2501.05700)|null|Multilingual Pre-trained Language models (multiPLMs), trained on the Masked Language Modelling (MLM) objective are commonly being used for cross-lingual tasks such as bitext mining. However, the performance of these models is still suboptimal for low-resource languages (LRLs). To improve the language representation of a given multiPLM, it is possible to further pre-train it. This is known as continual pre-training. Previous research has shown that continual pre-training with MLM and subsequently with Translation Language Modelling (TLM) improves the cross-lingual representation of multiPLMs. However, during masking, both MLM and TLM give equal weight to all tokens in the input sequence, irrespective of the linguistic properties of the tokens. In this paper, we introduce a novel masking strategy, Linguistic Entity Masking (LEM) to be used in the continual pre-training step to further improve the cross-lingual representations of existing multiPLMs. In contrast to MLM and TLM, LEM limits masking to the linguistic entity types nouns, verbs and named entities, which hold a higher prominence in a sentence. Secondly, we limit masking to a single token within the linguistic entity span thus keeping more context, whereas, in MLM and TLM, tokens are masked randomly. We evaluate the effectiveness of LEM using three downstream tasks, namely bitext mining, parallel data curation and code-mixed sentiment analysis using three low-resource language pairs English-Sinhala, English-Tamil, and Sinhala-Tamil. Experiment results show that continually pre-training a multiPLM with LEM outperforms a multiPLM continually pre-trained with MLM+TLM for all three tasks.||
|**2025-01-08**|[From Superficial Patterns to Semantic Understanding: Fine-Tuning Language Models on Contrast Sets](http://arxiv.org/abs/2501.02683)|null|Large-scale pre-trained language models have demonstrated high performance on standard datasets for natural language inference (NLI) tasks. Unfortunately, these evaluations can be misleading, as although the models can perform well on in-distribution data, they perform poorly on out-of-distribution test sets, such as contrast sets. Contrast sets consist of perturbed instances of data that have very minor, but meaningful, changes to the input that alter the gold label, revealing how models can learn superficial patterns in the training data rather than learning more sophisticated language nuances. As an example, the ELECTRA-small language model achieves nearly 90% accuracy on an SNLI dataset but drops to 75% when tested on an out-of-distribution contrast set. The research carried out in this study explores how the robustness of a language model can be improved by exposing it to small amounts of more complex contrast sets during training to help it better learn language patterns. With this approach, the model recovers performance and achieves nearly 90% accuracy on contrast sets, highlighting the importance of diverse and challenging training data.||
|**2025-01-04**|[CPTuning: Contrastive Prompt Tuning for Generative Relation Extraction](http://arxiv.org/abs/2501.02196)|null|Generative relation extraction (RE) commonly involves first reformulating RE as a linguistic modeling problem easily tackled with pre-trained language models (PLM) and then fine-tuning a PLM with supervised cross-entropy loss. Although having achieved promising performance, existing approaches assume only one deterministic relation between each pair of entities without considering real scenarios where multiple relations may be valid, i.e., entity pair overlap, causing their limited applications. To address this problem, we introduce a novel contrastive prompt tuning method for RE, CPTuning, which learns to associate a candidate relation between two in-context entities with a probability mass above or below a threshold, corresponding to whether the relation exists. Beyond learning schema, CPTuning also organizes RE as a verbalized relation generation task and uses Trie-constrained decoding to ensure a model generates valid relations. It adaptively picks out the generated candidate relations with a high estimated likelihood in inference, thereby achieving multi-relation extraction. We conduct extensive experiments on four widely used datasets to validate our method. Results show that T5-large fine-tuned with CPTuning significantly outperforms previous methods, regardless of single or multiple relations extraction.||
|**2025-01-02**|[BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion](http://arxiv.org/abs/2501.01069)|**[link](https://github.com/akabircs/belin)**|Automatic text summarization, particularly headline generation, remains a critical yet underexplored area for Bengali religious news. Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect. This limitation significantly hinders their effectiveness and overall performance. This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News) - comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen - a contextual multi-input feature fusion headline generation approach. Leveraging transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0, MultiGen integrates additional contextual features - including category, aspect, and sentiment - with the news content. This fusion enables the model to capture critical contextual information often overlooked by traditional methods. Experimental results demonstrate the superiority of MultiGen over the baseline approach that uses only news content, achieving a BLEU score of 18.61 and ROUGE-L score of 24.19, compared to baseline approach scores of 16.08 and 23.08, respectively. These findings underscore the importance of incorporating contextual features in headline generation for low-resource languages. By bridging linguistic and cultural gaps, this research advances natural language processing for Bengali and other underrepresented languages. To promote reproducibility and further exploration, the dataset and implementation code are publicly accessible at https://github.com/akabircs/BeliN.||
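|**2025-01-01**|[U-GIFT: Uncertainty-Guided Firewall for Toxic Speech in Few-Shot Scenario](http://arxiv.org/abs/2501.00907)|null|With the widespread adoption of social media, user-generated content has surged on online platforms. When such content includes hateful, abusive, offensive, or cyberbullying behavior, it is classified as toxic speech, posing a significant threat to the integrity and safety of the online ecosystem. While manual content moderation remains prevalent, the sheer volume of content and the psychological strain on human moderators underscore the need for automated toxic speech detection. Previously proposed detection methods often rely on large annotated datasets; however, acquiring such datasets is both costly and challenging in practice. To address this issue, we propose U-GIFT, an uncertainty-guided firewall for toxic speech detection in few-shot scenarios, which uses self-training to improve detection performance even when labeled data is limited. Specifically, U-GIFT combines active learning with Bayesian Neural Networks (BNNs) to automatically identify high-quality samples from unlabeled data, prioritizing pseudo-labels with higher confidence for training based on uncertainty estimates derived from model predictions. Extensive experiments demonstrate that U-GIFT significantly outperforms competitive baselines in few-shot detection scenarios. In the 5-shot setting, it improves performance by 14.92% over the basic model. Importantly, U-GIFT is user-friendly and adaptable to various pre-trained language models (PLMs). It also exhibits robust performance under sample imbalance and in cross-domain settings, while showing strong generalization across various language applications. We believe that U-GIFT provides an effective solution for few-shot toxic speech detection, offering substantial support for automated content moderation in cyberspace and thereby acting as a firewall to promote advancements in cybersecurity.||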
|**2024-12-31**|[TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment](http://arxiv.org/abs/2501.00522)|**[link](https://github.com/empathyang/tinyhelen)**|Training language models (LMs) and their application agents is increasingly costly due to large datasets and models, making test failures difficult to bear. Simplified language environments serve as primordial training and testing grounds, retaining essential commonsense and communication skills but in a more digestible form, potentially enhancing the learning efficiency of LMs, and thus reducing the required model size and data volume for effective training and evaluation. In these simplified language environments, workable strategies for small models, datasets, and agents may be adaptable to larger models, datasets, and agents in complex language environments. To create such environments, we focus on two aspects: i) minimizing language dataset noise and complexity, and ii) preserving the essential text distribution characteristics. Unlike previous methods, we propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns (e.g., for books, conversation, code, etc.). Implementing this pipeline with large LMs, we have created a leaner suite of LM training and evaluation datasets: 71M Leaner-Pretrain, 7M Leaner-Instruct, Leaner-Glue for assessing linguistic proficiency, and Leaner-Eval for testing instruction-following ability. Our experiments show that leaner pre-training boosts LM learning efficiency. Tiny LMs trained on these datasets outperform those trained on original datasets in instruction-following across different language granularity levels. Moreover, the Leaner-Pretrain dataset's alignment with conventional large LM training sets enables resource-optimized analysis of how learning objectives, model architectures, and training techniques impact performance on language modeling and downstream tasks. Our code and datasets are available at https://github.com/EmpathYang/TinyHelen.git.||
|**2024-12-31**|[Loss-Aware Curriculum Learning for Chinese Grammatical Error Correction](http://arxiv.org/abs/2501.00334)|null|Chinese grammatical error correction (CGEC) aims to detect and correct errors in input Chinese sentences. Recently, Pre-trained Language Models (PLMs) have been employed to improve performance. However, current approaches ignore that correction difficulty varies across instances and treat all samples equally, which makes model learning harder. To address this problem, we propose a multi-granularity Curriculum Learning (CL) framework. Specifically, we first calculate the correction difficulty of these samples and feed them into the model from easy to hard, batch by batch. Then Instance-Level CL is employed to help the model optimize in the appropriate direction automatically by regulating the loss function. Extensive experimental results and comprehensive analyses on various datasets demonstrate the effectiveness of our method.||
|**2024-12-29**|[Controlling Out-of-Domain Gaps in LLMs for Genre Classification and Generated Text Detection](http://arxiv.org/abs/2412.20595)|**[link](https://github.com/dminus1/LLM-OOD-control)**|This study demonstrates that the modern generation of Large Language Models (LLMs, such as GPT-4) suffers from the same out-of-domain (OOD) performance gap observed in prior research on pre-trained Language Models (PLMs, such as BERT). We demonstrate this across two non-topical classification tasks: 1) genre classification and 2) generated text detection. Our results show that when demonstration examples for In-Context Learning (ICL) come from one domain (e.g., travel) and the system is tested on another domain (e.g., history), classification performance declines significantly. To address this, we introduce a method that controls which predictive indicators are used and which are excluded during classification. For the two tasks studied here, this ensures that topical features are omitted, while the model is guided to focus on stylistic rather than content-based attributes. This approach reduces the OOD gap by up to 20 percentage points in a few-shot setup. Straightforward Chain-of-Thought (CoT) methods, used as the baseline, prove insufficient, while our approach consistently enhances domain transfer performance.||
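|**2024-12-28**|[Adaptive Parameter-Efficient Federated Fine-Tuning on Heterogeneous Devices](http://arxiv.org/abs/2412.20004)|null|Federated fine-tuning (FedFT) has been proposed to fine-tune pre-trained language models in a distributed manner. However, in practical applications, efficient FedFT faces two critical challenges: resource constraints and system heterogeneity. Existing works rely on parameter-efficient fine-tuning methods, e.g., low-rank adaptation (LoRA), but have major limitations. Herein, based on the inherent characteristics of FedFT, we observe that adding LoRA layers with higher ranks close to the output layer helps to save resource consumption while achieving comparable fine-tuning performance. We then propose a novel LoRA-based FedFT framework, termed LEGEND, which faces the difficulty of determining the number of LoRA layers (called LoRA depth) and the rank of each LoRA layer (called rank distribution). We analyze the coupled relationship between LoRA depth and rank distribution, and design an efficient LoRA configuration algorithm for heterogeneous devices, thereby improving fine-tuning efficiency. We conduct extensive experiments on a physical platform with 80 commercial devices. The results show that, compared with advanced solutions, LEGEND achieves a 1.5-2.8x speedup and saves about 42.3% of communication costs when reaching the target accuracy.||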
|**2024-12-24**|[Exploring Embedding Priors in Prompt-Tuning for Improved Interpretability and Control](http://arxiv.org/abs/2412.18582)|null|Prompt-Tuning is an efficient method for adapting pre-trained language models to new tasks with minimal computational overhead by modifying prompt embeddings. In this work, we investigate how crucial the phenomenon of embedding collapse, frequently observed in Prompt-Tuning, is for the final performance of the model. To address this question, we designed embedding priors and compared them with posteriors of the converged Soft and Deep Prompt-Tuning methods. Our findings suggest that priors strongly affect the position of the tuned embeddings, and models can effectively work with embeddings from different parts of activation spaces, including completely new regions. As the final Prompt-Tuning capabilities are limited, we hypothesize that controllable Prompt-Tuning posteriors may serve as a good starting point for tasks such as chain-of-thought (COT) distillation. Our experiments also show that generated trajectories are not localized in the activation space of the models. However, there are distinct clusters of activations for distant tasks (e.g., NLP and arithmetic), while activations between NLP tasks (e.g., Question-Answering and MLM) lie in the same cluster. These observations raise questions about the importance of a single activation cluster for the generalization abilities of large language models.||
|**2025-01-05**|[Investigating Large Language Models for Code Vulnerability Detection: An Experimental Study](http://arxiv.org/abs/2412.18260)|**[link](https://github.com/sakirinn/llm4cvd)**|Code vulnerability detection (CVD) is essential for addressing and preventing system security issues, playing a crucial role in ensuring software security. Previous learning-based vulnerability detection methods rely on either fine-tuning medium-size sequence models or training smaller neural networks from scratch. Recent advancements in large pre-trained language models (LLMs) have showcased remarkable capabilities in various code intelligence tasks including code understanding and generation. However, the effectiveness of LLMs in detecting code vulnerabilities is largely under-explored. This work aims to investigate the gap by fine-tuning LLMs for the CVD task, involving four widely-used open-source LLMs. We also implement other five previous graph-based or medium-size sequence models for comparison. Experiments are conducted on five commonly-used CVD datasets, including both the part of short samples and long samples. In addition, we conduct quantitative experiments to investigate the class imbalance issue and the model's performance on samples of different lengths, which are rarely studied in previous works. To better facilitate communities, we open-source all codes and resources of this study in https://github.com/SakiRinn/LLM4CVD and https://huggingface.co/datasets/xuefen/VulResource.||
|**2024-12-24**|[Neuron Empirical Gradient: Connecting Neurons' Linear Controllability and Representational Capacity](http://arxiv.org/abs/2412.18053)|null|Although neurons in the feed-forward layers of pre-trained language models (PLMs) can store factual knowledge, most prior analyses remain qualitative, leaving the quantitative relationship among knowledge representation, neuron activations, and model output poorly understood. In this study, by performing neuron-wise interventions using factual probing datasets, we first reveal the linear relationship between neuron activations and output token probabilities. We refer to the gradient of this linear relationship as "neuron empirical gradients" and propose NeurGrad, an efficient method for their calculation to facilitate quantitative neuron analysis. We next investigate whether neuron empirical gradients in PLMs encode general task knowledge by probing skill neurons. To this end, we introduce MCEval8k, a multi-choice knowledge evaluation benchmark spanning six genres and 22 tasks. Our experiments confirm that neuron empirical gradients effectively capture knowledge, while skill neurons exhibit efficiency, generality, inclusivity, and interdependency. These findings link knowledge to PLM outputs via neuron empirical gradients, shedding light on how PLMs store knowledge. The code and dataset are released.||
|**2024-12-23**|[FedTLU: Federated Learning with Targeted Layer Updates](http://arxiv.org/abs/2412.17692)|null|Federated learning (FL) addresses privacy concerns in language modeling by enabling multiple clients to contribute to training language models. However, non-IID (identically and independently distributed) data across clients often limits FL's performance. This issue is especially challenging during model fine-tuning, as noise due to variations in clients' data distributions can harm model convergence near the optimum. This paper proposes a targeted layer update strategy for fine-tuning in FL. Instead of randomly updating layers of the language model, as often done in practice, we use a scoring mechanism to identify and update the most critical layers, avoiding excessively noisy or even poisoned updates by freezing the parameters in other layers. We show in extensive experiments that our method improves convergence and performance in non-IID settings, offering a more efficient approach to fine-tuning federated language models.||
|**2024-12-23**|[EM-MIAs: Enhancing Membership Inference Attacks in Large Language Models through Ensemble Modeling](http://arxiv.org/abs/2412.17249)|null|With the widespread application of large language models (LLM), concerns about the privacy leakage of model training data have increasingly become a focus. Membership Inference Attacks (MIAs) have emerged as a critical tool for evaluating the privacy risks associated with these models. Although existing attack methods, such as LOSS, Reference-based, min-k, and zlib, perform well in certain scenarios, their effectiveness on large pre-trained language models often approaches random guessing, particularly in the context of large-scale datasets and single-epoch training. To address this issue, this paper proposes a novel ensemble attack method that integrates several existing MIAs techniques (LOSS, Reference-based, min-k, zlib) into an XGBoost-based model to enhance overall attack performance (EM-MIAs). Experimental results demonstrate that the ensemble model significantly improves both AUC-ROC and accuracy compared to individual attack methods across various large language models and datasets. This indicates that by combining the strengths of different methods, we can more effectively identify members of the model's training data, thereby providing a more robust tool for evaluating the privacy risks of LLM. This study offers new directions for further research in the field of LLM privacy protection and underscores the necessity of developing more powerful privacy auditing methods.||
|**2024-12-22**|[Learning to Adapt to Low-Resource Paraphrase Generation](http://arxiv.org/abs/2412.17111)|null|Paraphrase generation is a longstanding NLP task and achieves great success with the aid of large corpora. However, transferring a paraphrasing model to another domain encounters the problem of domain shifting especially when the data is sparse. At the same time, widely using large pre-trained language models (PLMs) faces the overfitting problem when training on scarce labeled data. To mitigate these two issues, we propose, LAPA, an effective adapter for PLMs optimized by meta-learning. LAPA has three-stage training on three types of related resources to solve this problem: 1. pre-training PLMs on unsupervised corpora, 2. inserting an adapter layer and meta-training on source domain labeled data, and 3. fine-tuning adapters on a small amount of target domain labeled data. This method enables paraphrase generation models to learn basic language knowledge first, then learn the paraphrasing task itself later, and finally adapt to the target task. Our experimental results demonstrate that LAPA achieves state-of-the-art in supervised, unsupervised, and low-resource settings on three benchmark datasets. With only 2\% of trainable parameters and 1\% labeled data of the target task, our approach can achieve a competitive performance with previous work.||
|**2024-12-17**|[The Reliability Paradox: Exploring How Shortcut Learning Undermines Language Model Calibration](http://arxiv.org/abs/2412.15269)|null|The advent of pre-trained language models (PLMs) has enabled significant performance gains in the field of natural language processing. However, recent studies have found PLMs to suffer from miscalibration, indicating a lack of accuracy in the confidence estimates provided by these models. Current evaluation methods for PLM calibration often assume that lower calibration error estimates indicate more reliable predictions. However, fine-tuned PLMs often resort to shortcuts, leading to overconfident predictions that create the illusion of enhanced performance but lack generalizability in their decision rules. The relationship between PLM reliability, as measured by calibration error, and shortcut learning, has not been thoroughly explored thus far. This paper aims to investigate this relationship, studying whether lower calibration error implies reliable decision rules for a language model. Our findings reveal that models with seemingly superior calibration portray higher levels of non-generalizable decision rules. This challenges the prevailing notion that well-calibrated models are inherently reliable. Our study highlights the need to bridge the current gap between language model calibration and generalization objectives, urging the development of comprehensive frameworks to achieve truly robust and reliable language models.||
|**2024-12-19**|[How to Synthesize Text Data without Model Collapse?](http://arxiv.org/abs/2412.14689)|null|Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how to synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover distributional shift phenomenon and over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.||
|**2024-12-18**|[Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali](http://arxiv.org/abs/2412.13860)|null|Continual learning has emerged as an important research direction due to the infeasibility of retraining large language models (LLMs) from scratch in the event of new data availability. Of great interest is the domain-adaptive pre-training (DAPT) paradigm, which focuses on continually training a pre-trained language model to adapt it to a domain it was not originally trained on. In this work, we evaluate the feasibility of DAPT in a low-resource setting, namely the Nepali language. We use synthetic data to continue training Llama 3 8B to adapt it to the Nepali language in a 4-bit QLoRA setting. We evaluate the adapted model on its performance, forgetting, and knowledge acquisition. We compare the base model and the final model on their Nepali generation abilities, their performance on popular benchmarks, and run case-studies to probe their linguistic knowledge in Nepali. We see some unsurprising forgetting in the final model, but also surprisingly find that increasing the number of shots during evaluation yields better percent increases in the final model (as high as 19.29% increase) compared to the base model (4.98%), suggesting latent retention. We also explore layer-head self-attention heatmaps to establish dependency resolution abilities of the final model in Nepali.||
|**2024-12-17**|[COSEE: Consistency-Oriented Signal-Based Early Exiting via Calibrated Sample Weighting Mechanism](http://arxiv.org/abs/2412.13236)|**[link](https://github.com/he-jianing/cosee)**|Early exiting is an effective paradigm for improving the inference efficiency of pre-trained language models (PLMs) by dynamically adjusting the number of executed layers for each sample. However, in most existing works, easy and hard samples are treated equally by each classifier during training, which neglects the test-time early exiting behavior, leading to inconsistency between training and testing. Although some methods have tackled this issue under a fixed speed-up ratio, the challenge of flexibly adjusting the speed-up ratio while maintaining consistency between training and testing is still under-explored. To bridge the gap, we propose a novel Consistency-Oriented Signal-based Early Exiting (COSEE) framework, which leverages a calibrated sample weighting mechanism to enable each classifier to emphasize the samples that are more likely to exit at that classifier under various acceleration scenarios. Extensive experiments on the GLUE benchmark demonstrate the effectiveness of our COSEE across multiple exiting signals and backbones, yielding a better trade-off between performance and efficiency.||
|**2024-12-17**|[Token-Level Graphs for Short Text Classification](http://arxiv.org/abs/2412.12754)|**[link](https://github.com/dogregor/tokengraph)**|The classification of short texts is a common subtask in Information Retrieval (IR). Recent advances in graph machine learning have led to interest in graph-based approaches for low resource scenarios, showing promise in such settings. However, existing methods face limitations such as not accounting for different meanings of the same words or constraints from transductive approaches. We propose an approach which constructs text graphs entirely based on tokens obtained through pre-trained language models (PLMs). By applying a PLM to tokenize and embed the texts when creating the graph(-nodes), our method captures contextual and semantic information, overcomes vocabulary constraints, and allows for context-dependent word meanings. Our approach also makes classification more efficient with reduced parameters compared to classical PLM fine-tuning, resulting in more robust training with few samples. Experimental results demonstrate how our method consistently achieves higher scores or on-par performance with existing methods, presenting an advancement in graph-based text classification techniques. To support reproducibility of our work we make all implementations publicly available to the community at https://github.com/doGregor/TokenGraph.||
|**2024-12-12**|[MGM: Global Understanding of Audience Overlap Graphs for Predicting the Factuality and the Bias of News Media](http://arxiv.org/abs/2412.10467)|**[link](https://github.com/marslanm/mgm_code)**|In the current era of rapidly growing digital data, evaluating the political bias and factuality of news outlets has become more important for seeking reliable information online. In this work, we study the classification problem of profiling news media from the lens of political bias and factuality. Traditional profiling methods, such as Pre-trained Language Models (PLMs) and Graph Neural Networks (GNNs), have shown promising results, but they face notable challenges. PLMs focus solely on textual features, causing them to overlook the complex relationships between entities, while GNNs often struggle with media graphs containing disconnected components and insufficient labels. To address these limitations, we propose MediaGraphMind (MGM), an effective solution within a variational Expectation-Maximization (EM) framework. Instead of relying on limited neighboring nodes, MGM leverages features, structural patterns, and label information from globally similar nodes. Such a framework not only enables GNNs to capture long-range dependencies for learning expressive node representations but also enhances PLMs by integrating structural information and therefore improving the performance of both models. The extensive experiments demonstrate the effectiveness of the proposed framework and achieve new state-of-the-art results. Further, we share our repository, which contains the dataset, code, and documentation.||
|**2024-12-13**|[Low-Resource Fast Text Classification Based on Intra-Class and Inter-Class Distance Calculation](http://arxiv.org/abs/2412.09922)|null|In recent years, text classification methods based on neural networks and pre-trained models have gained increasing attention and demonstrated excellent performance. However, these methods still have some limitations in practical applications: (1) They typically focus only on the matching similarity between sentences. However, there exists implicit high-value information both within sentences of the same class and across different classes, which is very crucial for classification tasks. (2) Existing methods such as pre-trained language models and graph-based approaches often consume substantial memory for training and text-graph construction. (3) Although some low-resource methods can achieve good performance, they often suffer from excessively long processing times. To address these challenges, we propose a low-resource and fast text classification model called LFTC. Our approach begins by constructing a compressor list for each class to fully mine the regularity information within intra-class data. We then remove redundant information irrelevant to the target classification to reduce processing time. Finally, we compute the similarity distance between text pairs for classification. We evaluate LFTC on 9 publicly available benchmark datasets, and the results demonstrate significant improvements in performance and processing time, especially under limited computational and data resources, highlighting its superior advantages.||
|**2024-12-11**|[Mitigating Out-of-Entity Errors in Named Entity Recognition: A Sentence-Level Strategy](http://arxiv.org/abs/2412.08434)|null|Many previous models of named entity recognition (NER) suffer from the problem of Out-of-Entity (OOE), i.e., the tokens in the entity mentions of the test samples have not appeared in the training samples, which hinders the achievement of satisfactory performance. To improve OOE-NER performance, in this paper, we propose a new framework, namely S+NER, which fully leverages sentence-level information. Our S+NER achieves better OOE-NER performance mainly due to the following two particular designs. 1) It first exploits the pre-trained language model's capability of understanding the target entity's sentence-level context with a template set. 2) Then, it refines the sentence-level representation based on the positive and negative templates, through a contrastive learning strategy and template pooling method, to obtain better NER results. Our extensive experiments on five benchmark datasets have demonstrated that, our S+NER outperforms some state-of-the-art OOE-NER models.||
|**2024-12-09**|[Leveraging Prompt Learning and Pause Encoding for Alzheimer's Disease Detection](http://arxiv.org/abs/2412.06259)|null|Compared to other clinical screening techniques, speech-and-language-based automated Alzheimer's disease (AD) detection methods are characterized by their non-invasiveness, cost-effectiveness, and convenience. Previous studies have demonstrated the efficacy of fine-tuning pre-trained language models (PLMs) for AD detection. However, the objective of this traditional fine-tuning method, which involves inputting only transcripts, is inconsistent with the masked language modeling (MLM) task used during the pre-training phase of PLMs. In this paper, we investigate prompt-based fine-tuning of PLMs, converting the classification task into a MLM task by inserting prompt templates into the transcript inputs. We also explore the impact of incorporating pause information from forced alignment into manual transcripts. Additionally, we compare the performance of various automatic speech recognition (ASR) models and select the Whisper model to generate ASR-based transcripts for comparison with manual transcripts. Furthermore, majority voting and ensemble techniques are applied across different PLMs (BERT and RoBERTa) using different random seeds. Ultimately, we obtain maximum detection accuracy of 95.8% (with mean 87.9%, std 3.3%) using manual transcripts, achieving state-of-the-art performance for AD detection using only transcripts on the ADReSS test set.||
|**2024-12-07**|[SMI-Editor: Edit-based SMILES Language Model with Fragment-level Supervision](http://arxiv.org/abs/2412.05569)|null|SMILES, a crucial textual representation of molecular structures, has garnered significant attention as a foundation for pre-trained language models (LMs). However, most existing pre-trained SMILES LMs focus solely on the single-token level supervision during pre-training, failing to fully leverage the substructural information of molecules. This limitation makes the pre-training task overly simplistic, preventing the models from capturing richer molecular semantic information. Moreover, during pre-training, these SMILES LMs only process corrupted SMILES inputs, never encountering any valid SMILES, which leads to a train-inference mismatch. To address these challenges, we propose SMI-Editor, a novel edit-based pre-trained SMILES LM. SMI-Editor disrupts substructures within a molecule at random and feeds the resulting SMILES back into the model, which then attempts to restore the original SMILES through an editing process. This approach not only introduces fragment-level training signals, but also enables the use of valid SMILES as inputs, allowing the model to learn how to reconstruct complete molecules from these incomplete structures. As a result, the model demonstrates improved scalability and an enhanced ability to capture fragment-level molecular information. Experimental results show that SMI-Editor achieves state-of-the-art performance across multiple downstream molecular tasks, and even outperforming several 3D molecular representation models.||
|**2024-12-05**|[GRAF: Graph Retrieval Augmented by Facts for Legal Question Answering](http://arxiv.org/abs/2412.04119)|null|Pre-trained Language Models (PLMs) have shown remarkable performances in recent years, setting a new paradigm for NLP research and industry. The legal domain has received some attention from the NLP community partly due to its textual nature. Some tasks from this domain are represented by question-answering (QA) tasks. This work explores the legal domain Multiple-Choice QA (MCQA) for a low-resource language. The contribution of this work is multi-fold. We first introduce JuRO, the first openly available Romanian legal MCQA dataset, comprising three different examinations and a number of 10,836 total questions. Along with this dataset, we introduce CROL, an organized corpus of laws that has a total of 93 distinct documents with their modifications from 763 time spans, that we leveraged in this work for Information Retrieval (IR) techniques. Moreover, we are the first to propose Law-RoG, a Knowledge Graph (KG) for the Romanian language, and this KG is derived from the aforementioned corpus. Lastly, we propose a novel approach for MCQA, Graph Retrieval Augmented by Facts (GRAF), which achieves competitive results with generally accepted SOTA methods and even exceeds them in most settings.||
|**2024-12-05**|[Hostility Detection in UK Politics: A Dataset on Online Abuse Targeting MPs](http://arxiv.org/abs/2412.04046)|null|Numerous politicians use social media platforms, particularly X, to engage with their constituents. This interaction allows constituents to pose questions and offer feedback but also exposes politicians to a barrage of hostile responses, especially given the anonymity afforded by social media. They are typically targeted in relation to their governmental role, but the comments also tend to attack their personal identity. This can discredit politicians and reduce public trust in the government. It can also incite anger and disrespect, leading to offline harm and violence. While numerous models exist for detecting hostility in general, they lack the specificity required for political contexts. Furthermore, addressing hostility towards politicians demands tailored approaches due to the distinct language and issues inherent to each country (e.g., Brexit for the UK). To bridge this gap, we construct a dataset of 3,320 English tweets spanning a two-year period manually annotated for hostility towards UK MPs. Our dataset also captures the targeted identity characteristics (race, gender, religion, none) in hostile tweets. We perform linguistic and topical analyses to delve into the unique content of the UK political data. Finally, we evaluate the performance of pre-trained language models and large language models on binary hostility detection and multi-class targeted identity type classification tasks. Our study offers valuable data and insights for future research on the prevalence and nature of politics-related hostility specific to the UK.||
|**2024-11-29**|[Enhancing Sentiment Analysis in Bengali Texts: A Hybrid Approach Using Lexicon-Based Algorithm and Pretrained Language Model Bangla-BERT](http://arxiv.org/abs/2411.19584)|null|Sentiment analysis (SA) is a process of identifying the emotional tone or polarity within a given text and aims to uncover the user's complex emotions and inner feelings. While sentiment analysis has been extensively studied for languages like English, research in Bengali, remains limited, particularly for fine-grained sentiment categorization. This work aims to connect this gap by developing a novel approach that integrates rule-based algorithms with pre-trained language models. We developed a dataset from scratch, comprising over 15,000 manually labeled reviews. Next, we constructed a Lexicon Data Dictionary, assigning polarity scores to the reviews. We developed a novel rule based algorithm Bangla Sentiment Polarity Score (BSPS), an approach capable of generating sentiment scores and classifying reviews into nine distinct sentiment categories. To assess the performance of this method, we evaluated the classified sentiments using BanglaBERT, a pre-trained transformer-based language model. We also performed sentiment classification directly with BanglaBERT on the original data and evaluated this model's results. Our analysis revealed that the BSPS + BanglaBERT hybrid approach outperformed the standalone BanglaBERT model, achieving higher accuracy, precision, and nuanced classification across the nine sentiment categories. The results of our study emphasize the value and effectiveness of combining rule-based and pre-trained language model approaches for enhanced sentiment analysis in Bengali and suggest pathways for future research and application in languages with similar linguistic complexities.||
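|**2024-11-28**|[PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning](http://arxiv.org/abs/2411.19335)|null|Federated Parameter-Efficient Fine-Tuning (FedPEFT) has emerged as a promising paradigm for privacy-preserving and efficient adaptation of pre-trained language models (PLMs) in federated learning (FL) settings. It protects data privacy by keeping data decentralized and training models on local devices, ensuring that raw data never leaves the user's device. Moreover, integrating PEFT methods such as LoRA significantly reduces the number of trainable parameters compared with fine-tuning the entire model, thereby minimizing communication costs and computational overhead. Despite its potential, the security implications of FedPEFT remain underexplored. This paper introduces a new security threat to FedPEFT, termed PEFT-as-an-Attack (PaaA), which reveals how PEFT can be exploited as an attack vector to bypass the safety alignment of PLMs and generate harmful content in response to malicious prompts. Our evaluation of PaaA shows that, with less than 1% of the model's parameters set as trainable and a small subset of clients acting maliciously, the attack achieves an attack success rate of approximately 80% using representative PEFT methods such as LoRA. To mitigate this threat, we further investigate potential defense strategies, including robust aggregation schemes (RAS) and post-PEFT safety alignment (PPSA). However, our empirical analysis highlights the limitations of these defenses: even the most advanced RAS, such as DnC and ClippedClustering, struggle to defend against PaaA when data distributions are highly heterogeneous. Similarly, while PPSA can reduce the attack success rate to below 10%, it severely degrades the model's accuracy on the target task. Our results underscore the urgent need for more effective defense mechanisms that simultaneously ensure the security and performance of the FedPEFT paradigm.||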
|**2024-11-26**|[Pretrained LLM Adapted with LoRA as a Decision Transformer for Offline RL in Quantitative Trading](http://arxiv.org/abs/2411.17900)|**[link](https://github.com/syyunn/finrl-dt)**|Developing effective quantitative trading strategies using reinforcement learning (RL) is challenging due to the high risks associated with online interaction with live financial markets. Consequently, offline RL, which leverages historical market data without additional exploration, becomes essential. However, existing offline RL methods often struggle to capture the complex temporal dependencies inherent in financial time series and may overfit to historical patterns. To address these challenges, we introduce a Decision Transformer (DT) initialized with pre-trained GPT-2 weights and fine-tuned using Low-Rank Adaptation (LoRA). This architecture leverages the generalization capabilities of pre-trained language models and the efficiency of LoRA to learn effective trading policies from expert trajectories solely from historical data. Our model performs competitively with established offline RL algorithms, including Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), and Behavior Cloning (BC), as well as a baseline Decision Transformer with randomly initialized GPT-2 weights and LoRA. Empirical results demonstrate that our approach effectively learns from expert trajectories and secures superior rewards in certain trading scenarios, highlighting the effectiveness of integrating pre-trained language models and parameter-efficient fine-tuning in offline RL for quantitative trading. Replication code for our experiments is publicly available at https://github.com/syyunn/finrl-dt||
|**2024-11-26**|[Push the Limit of Multi-modal Emotion Recognition by Prompting LLMs with Receptive-Field-Aware Attention Weighting](http://arxiv.org/abs/2411.17674)|null|Understanding the emotions in a dialogue usually requires external knowledge to accurately understand the contents. As the LLMs become more and more powerful, we do not want to settle on the limited ability of the pre-trained language model. However, the LLMs either can only process text modality or are too expensive to process the multimedia information. We aim to utilize both the power of LLMs and the supplementary features from the multimedia modalities. In this paper, we present a framework, Lantern, that can improve the performance of a certain vanilla model by prompting large language models with receptive-field-aware attention weighting. This framework trained a multi-task vanilla model to produce probabilities of emotion classes and dimension scores. These predictions are fed into the LLMs as references to adjust the predicted probabilities of each emotion class with its external knowledge and contextual understanding. We slice the dialogue into different receptive fields, and each sample is included in exactly t receptive fields. Finally, the predictions of LLMs are merged with a receptive-field-aware attention-driven weighting module. In the experiments, vanilla models CORECT and SDT are deployed in Lantern with GPT-4 or Llama-3.1-405B. The experiments in IEMOCAP with 4-way and 6-way settings demonstrated that the Lantern can significantly improve the performance of current vanilla models by up to 1.23% and 1.80%.||
|**2024-12-02**|[Scaling Speech-Text Pre-training with Synthetic Interleaved Data](http://arxiv.org/abs/2411.17607)|null|Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction compared to text-based large language models (LLMs). Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their scalability as LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing corresponding speech spans using a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. This supervised training approach results in discrete speech tokens with strong semantic preservation even at lower frame rates (e.g. 12.5Hz), while still maintaining speech reconstruction quality. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving performance on spoken questions tasks from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot that achieves competitive performance comparable to existing baselines in both conversational abilities and speech quality, even operating exclusively in the speech domain.||
|**2024-11-24**|[Development of Pre-Trained Transformer-based Models for the Nepali Language](http://arxiv.org/abs/2411.15734)|null|Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models i.e., BERT, RoBERTa, and GPT-2, exclusively for the Nepali Language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on Nep-gLUE benchmark, scoring 95.60 and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.||
|**2024-11-21**|[Transformer-Based Contextualized Language Models Joint with Neural Networks for Natural Language Inference in Vietnamese](http://arxiv.org/abs/2411.13407)|null|Natural Language Inference (NLI) is a task within Natural Language Processing (NLP) that holds value for various AI applications. However, there have been limited studies on Natural Language Inference in Vietnamese that explore the concept of joint models. Therefore, we conducted experiments using various combinations of contextualized language models (CLM) and neural networks. We use CLM to create contextualized work presentations and use Neural Networks for classification. Furthermore, we have evaluated the strengths and weaknesses of each joint model and identified the model failure points in the Vietnamese context. The highest F1 score in this experiment, up to 82.78% in the benchmark dataset (ViNLI). By conducting experiments with various models, the most considerable size of the CLM is XLM-R (355M). That combination has consistently demonstrated superior performance compared to fine-tuning strong pre-trained language models like PhoBERT (+6.58%), mBERT (+19.08%), and XLM-R (+0.94%) in terms of F1-score. This article aims to introduce a novel approach or model that attains improved performance for Vietnamese NLI. Overall, we find that the joint approach of CLM and neural networks is simple yet capable of achieving high-quality performance, which makes it suitable for applications that require efficient resource utilization.||
|**2024-11-19**|[Enhancing Multi-Class Disease Classification: Neoplasms, Cardiovascular, Nervous System, and Digestive Disorders Using Advanced LLMs](http://arxiv.org/abs/2411.12712)|null|In this research, we explored the improvement in terms of multi-class disease classification via pre-trained language models over Medical-Abstracts-TC-Corpus that spans five medical conditions. We excluded non-cancer conditions and examined four specific diseases. We assessed four LLMs, BioBERT, XLNet, and BERT, as well as a novel base model (Last-BERT). BioBERT, which was pre-trained on medical data, demonstrated superior performance in medical text classification (97% accuracy). Surprisingly, XLNet followed closely (96% accuracy), demonstrating its generalizability across domains even though it was not pre-trained on medical data. LastBERT, a custom model based on the lighter version of BERT, also proved competitive with 87.10% accuracy (just under BERT's 89.33%). Our findings confirm the importance of specialized models such as BioBERT and also support impressions around more general solutions like XLNet and well-tuned transformer architectures with fewer parameters (in this case, LastBERT) in medical domain tasks.||
|**2024-11-20**|[Predicting User Intents and Musical Attributes from Music Discovery Conversations](http://arxiv.org/abs/2411.12254)|**[link](https://github.com/daeyongkwon98/intent_classification)**|Intent classification is a text understanding task that identifies user needs from input text queries. While intent classification has been extensively studied in various domains, it has not received much attention in the music domain. In this paper, we investigate intent classification models for music discovery conversation, focusing on pre-trained language models. Rather than only predicting functional needs: intent classification, we also include a task for classifying musical needs: musical attribute classification. Additionally, we propose a method of concatenating previous chat history with just single-turn user queries in the input text, allowing the model to understand the overall conversation context better. Our proposed model significantly improves the F1 score for both user intent and musical attribute classification, and surpasses the zero-shot and few-shot performance of the pretrained Llama 3 model.||
|**2024-11-18**|[Zero-Shot Load Forecasting with Large Language Models](http://arxiv.org/abs/2411.11350)|null|Deep learning models have shown strong performance in load forecasting, but they generally require large amounts of data for model training before being applied to new scenarios, which limits their effectiveness in data-scarce scenarios. Inspired by the great success of pre-trained language models (LLMs) in natural language processing, this paper proposes a zero-shot load forecasting approach using an advanced LLM framework denoted as the Chronos model. By utilizing its extensive pre-trained knowledge, the Chronos model enables accurate load forecasting in data-scarce scenarios without the need for extensive data-specific training. Simulation results across five real-world datasets demonstrate that the Chronos model significantly outperforms nine popular baseline models for both deterministic and probabilistic load forecasting with various forecast horizons (e.g., 1 to 48 hours), even though the Chronos model is neither tailored nor fine-tuned to these specific load datasets. Notably, Chronos reduces root mean squared error (RMSE), continuous ranked probability score (CRPS), and quantile score (QS) by approximately 7.34%-84.30%, 19.63%-60.06%, and 22.83%-54.49%, respectively, compared to baseline models. These results highlight the superiority and flexibility of the Chronos model, positioning it as an effective solution in data-scarce scenarios.||
|**2024-11-11**|[TempCharBERT: Keystroke Dynamics for Continuous Access Control Based on Pre-trained Language Models](http://arxiv.org/abs/2411.07224)|null|With the widespread of digital environments, reliable authentication and continuous access control has become crucial. It can minimize cyber attacks and prevent frauds, specially those associated with identity theft. A particular interest lies on keystroke dynamics (KD), which refers to the task of recognizing individuals' identity based on their unique typing style. In this work, we propose the use of pre-trained language models (PLMs) to recognize such patterns. Although PLMs have shown high performance on multiple NLP benchmarks, the use of these models on specific tasks requires customization. BERT and RoBERTa, for instance, rely on subword tokenization, and they cannot be directly applied to KD, which requires temporal-character information to recognize users. Recent character-aware PLMs are able to process both subwords and character-level information and can be an alternative solution. Notwithstanding, they are still not suitable to be directly fine-tuned for KD as they are not optimized to account for user's temporal typing information (e.g., hold time and flight time). To overcome this limitation, we propose TempCharBERT, an architecture that incorporates temporal-character information in the embedding layer of CharBERT. This allows modeling keystroke dynamics for the purpose of user identification and authentication. Our results show a significant improvement with this customization. We also showed the feasibility of training TempCharBERT on a federated learning settings in order to foster data privacy.||
|**2024-11-11**|[Model Fusion through Bayesian Optimization in Language Model Fine-Tuning](http://arxiv.org/abs/2411.06710)|**[link](https://github.com/chaeyoon-jang/bomf)**|Fine-tuning pre-trained models for downstream tasks is a widely adopted technique known for its adaptability and reliability across various domains. Despite its conceptual simplicity, fine-tuning entails several troublesome engineering choices, such as selecting hyperparameters and determining checkpoints from an optimization trajectory. To tackle the difficulty of choosing the best model, one effective solution is model fusion, which combines multiple models in a parameter space. However, we observe a large discrepancy between loss and metric landscapes during the fine-tuning of pre-trained language models. Building on this observation, we introduce a novel model fusion technique that optimizes both the desired metric and loss through multi-objective Bayesian optimization. In addition, to effectively select hyperparameters, we establish a two-stage procedure by integrating Bayesian optimization processes into our framework. Experiments across various downstream tasks show considerable performance improvements using our Bayesian optimization-guided method.||
|**2024-11-11**|[Bridge: A Unified Framework to Knowledge Graph Completion via Language Models and Knowledge Representation](http://arxiv.org/abs/2411.06660)|null|Knowledge graph completion (KGC) is a task of inferring missing triples based on existing Knowledge Graphs (KGs). Both structural and semantic information are vital for successful KGC. However, existing methods only use either the structural knowledge from the KG embeddings or the semantic information from pre-trained language models (PLMs), leading to suboptimal model performance. Moreover, since PLMs are not trained on KGs, directly using PLMs to encode triples may be inappropriate. To overcome these limitations, we propose a novel framework called Bridge, which jointly encodes structural and semantic information of KGs. Specifically, we strategically encode entities and relations separately by PLMs to better utilize the semantic knowledge of PLMs and enable structured representation learning via a structural learning principle. Furthermore, to bridge the gap between KGs and PLMs, we employ a self-supervised representation learning method called BYOL to fine-tune PLMs with two different views of a triple. Unlike BYOL, which uses augmentation methods to create two semantically similar views of the same image, potentially altering the semantic information. We strategically separate the triple into two parts to create different views, thus avoiding semantic alteration. Experiments demonstrate that Bridge outperforms the SOTA models on three benchmark datasets.||
|**2024-11-01**|[Improving Few-Shot Cross-Domain Named Entity Recognition by Instruction Tuning a Word-Embedding based Retrieval Augmented Large Language Model](http://arxiv.org/abs/2411.00451)|null|Few-Shot Cross-Domain NER is the process of leveraging knowledge from data-rich source domains to perform entity recognition on data scarce target domains. Most previous state-of-the-art (SOTA) approaches use pre-trained language models (PLMs) for cross-domain NER. However, these models are often domain specific. To successfully use these models for new target domains, we need to modify either the model architecture or perform model finetuning using data from the new domains. Both of these result in the creation of entirely new NER models for each target domain which is infeasible for practical scenarios. Recently, several works have attempted to use LLMs to solve Few-Shot Cross-Domain NER. However, most of these are either too expensive for practical purposes or struggle to follow LLM prompt instructions. In this paper, we propose IF-WRANER (Instruction Finetuned Word-embedding based Retrieval Augmented large language model for Named Entity Recognition), a retrieval augmented LLM, finetuned for the NER task. By virtue of the regularization techniques used during LLM finetuning and the adoption of word-level embedding over sentence-level embedding during the retrieval of in-prompt examples, IF-WRANER is able to outperform previous SOTA Few-Shot Cross-Domain NER approaches. We have demonstrated the effectiveness of our model by benchmarking its performance on the open source CrossNER dataset, on which it shows more than 2% F1 score improvement over the previous SOTA model. We have deployed the model for multiple customer care domains of an enterprise. Accurate entity prediction through IF-WRANER helps direct customers to automated workflows for the domains, thereby reducing escalations to human agents by almost 15% and leading to millions of dollars in yearly savings for the company.||
|**2024-11-01**|[Enhancing Authorship Attribution through Embedding Fusion: A Novel Approach with Masked and Encoder-Decoder Language Models](http://arxiv.org/abs/2411.00411)|null|The increasing prevalence of AI-generated content alongside human-written text underscores the need for reliable discrimination methods. To address this challenge, we propose a novel framework with textual embeddings from Pre-trained Language Models (PLMs) to distinguish AI-generated and human-authored text. Our approach utilizes Embedding Fusion to integrate semantic information from multiple Language Models, harnessing their complementary strengths to enhance performance. Through extensive evaluation across publicly available diverse datasets, our proposed approach demonstrates strong performance, achieving classification accuracy greater than 96% and a Matthews Correlation Coefficient (MCC) greater than 0.93. This evaluation is conducted on a balanced dataset of texts generated from five well-known Large Language Models (LLMs), highlighting the effectiveness and robustness of our novel methodology.||
|**2024-11-01**|[C2A: Client-Customized Adaptation for Parameter-Efficient Federated Learning](http://arxiv.org/abs/2411.00311)|**[link](https://github.com/yeachan-kr/c2a)**|Despite the versatility of pre-trained language models (PLMs) across domains, their large memory footprints pose significant challenges in federated learning (FL), where the training model has to be distributed between a server and clients. One potential solution to bypass such constraints might be the use of parameter-efficient fine-tuning (PEFT) in the context of FL. However, we have observed that typical PEFT tends to severely suffer from heterogeneity among clients in FL scenarios, resulting in unstable and slow convergence. In this paper, we propose Client-Customized Adaptation (C2A), a novel hypernetwork-based FL framework that generates client-specific adapters by conditioning the client information. With the effectiveness of the hypernetworks in generating customized weights through learning to adopt the different characteristics of inputs, C2A can maximize the utility of shared model parameters while minimizing the divergence caused by client heterogeneity. To verify the efficacy of C2A, we perform extensive evaluations on FL scenarios involving heterogeneity in label and language distributions. Comprehensive evaluation results clearly support the superiority of C2A in terms of both efficiency and effectiveness in FL scenarios.||
|**2024-11-01**|[Large Language Models for Patient Comments Multi-Label Classification](http://arxiv.org/abs/2410.23528)|null|Patient experience and care quality are crucial for a hospital's sustainability and reputation. The analysis of patient feedback offers valuable insight into patient satisfaction and outcomes. However, the unstructured nature of these comments poses challenges for traditional machine learning methods following a supervised learning paradigm. This is due to the unavailability of labeled data and the nuances these texts encompass. This research explores leveraging Large Language Models (LLMs) in conducting Multi-label Text Classification (MLTC) of inpatient comments shared after a stay in the hospital. GPT-4 Turbo was leveraged to conduct the classification. However, given the sensitive nature of patients' comments, a security layer is introduced before feeding the data to the LLM through a Protected Health Information (PHI) detection framework, which ensures patients' de-identification. Additionally, using the prompt engineering framework, zero-shot learning, in-context learning, and chain-of-thought prompting were experimented with. Results demonstrate that GPT-4 Turbo, whether following a zero-shot or few-shot setting, outperforms traditional methods and Pre-trained Language Models (PLMs) and achieves the highest overall performance with an F1-score of 76.12% and a weighted F1-score of 73.61% followed closely by the few-shot learning results. Subsequently, the results' association with other patient experience structured variables (e.g., rating) was conducted. The study enhances MLTC through the application of LLMs, offering healthcare practitioners an efficient method to gain deeper insights into patient feedback and deliver prompt, appropriate responses.||
|**2024-10-28**|[Relation-based Counterfactual Data Augmentation and Contrastive Learning for Robustifying Natural Language Inference Models](http://arxiv.org/abs/2410.20710)|null|Although pre-trained language models show good performance on various natural language processing tasks, they often rely on non-causal features and patterns to determine the outcome. For natural language inference tasks, previous results have shown that even a model trained on a large amount of data fails to perform well on counterfactually revised data, indicating that the model is not robustly learning the semantics of the classes. In this paper, we propose a method in which we use token-based and sentence-based augmentation methods to generate counterfactual sentence pairs that belong to each class, and apply contrastive learning to help the model learn the difference between sentence pairs of different classes with similar contexts. Evaluation results with a counterfactually revised dataset and general NLI datasets show that the proposed method can improve the performance and robustness of the NLI model.||
|**2024-10-28**|[SubjECTive-QA: Measuring Subjectivity in Earnings Call Transcripts' QA Through Six-Dimensional Feature Analysis](http://arxiv.org/abs/2410.20651)|**[link](https://github.com/gtfintechlab/subjective-qa)**|Fact-checking is extensively studied in the context of misinformation and disinformation, addressing objective inaccuracies. However, a softer form of misinformation involves responses that are factually correct but lack certain features such as clarity and relevance. This challenge is prevalent in formal Question-Answer (QA) settings such as press conferences in finance, politics, sports, and other domains, where subjective answers can obscure transparency. Despite this, there is a lack of manually annotated datasets for subjective features across multiple dimensions. To address this gap, we introduce SubjECTive-QA, a human annotated dataset on Earnings Call Transcripts' (ECTs) QA sessions as the answers given by company representatives are often open to subjective interpretations and scrutiny. The dataset includes 49,446 annotations for long-form QA pairs across six features: Assertive, Cautious, Optimistic, Specific, Clear, and Relevant. These features are carefully selected to encompass the key attributes that reflect the tone of the answers provided during QA sessions across different domains. Our findings are that the best-performing Pre-trained Language Model (PLM), RoBERTa-base, has similar weighted F1 scores to Llama-3-70b-Chat on features with lower subjectivity, such as Relevant and Clear, with a mean difference of 2.17% in their weighted F1 scores. The models perform significantly better on features with higher subjectivity, such as Specific and Assertive, with a mean difference of 10.01% in their weighted F1 scores. Furthermore, testing SubjECTive-QA's generalizability using QAs from White House Press Briefings and Gaggles yields an average weighted F1 score of 65.97% using our best models for each feature, demonstrating broader applicability beyond the financial domain. SubjECTive-QA is publicly available under the CC BY 4.0 license.||
|**2024-10-27**|[Effective Instruction Parsing Plugin for Complex Logical Query Answering on Knowledge Graphs](http://arxiv.org/abs/2410.20321)|null|Knowledge Graph Query Embedding (KGQE) aims to embed First-Order Logic (FOL) queries in a low-dimensional KG space for complex reasoning over incomplete KGs. To enhance the generalization of KGQE models, recent studies integrate various external information (such as entity types and relation context) to better capture the logical semantics of FOL queries. The whole process is commonly referred to as Query Pattern Learning (QPL). However, current QPL methods typically suffer from the pattern-entity alignment bias problem, leading to the learned defective query patterns limiting KGQE models' performance. To address this problem, we propose an effective Query Instruction Parsing Plugin (QIPP) that leverages the context awareness of Pre-trained Language Models (PLMs) to capture latent query patterns from code-like query instructions. Unlike the external information introduced by previous QPL methods, we first propose code-like instructions to express FOL queries in an alternative format. This format utilizes textual variables and nested tuples to convey the logical semantics within FOL queries, serving as raw materials for a PLM-based instruction encoder to obtain complete query patterns. Building on this, we design a query-guided instruction decoder to adapt query patterns to KGQE models. To further enhance QIPP's effectiveness across various KGQE models, we propose a query pattern injection mechanism based on compressed optimization boundaries and an adaptive normalization component, allowing KGQE models to utilize query patterns more efficiently. Extensive experiments demonstrate that our plug-and-play method improves the performance of eight basic KGQE models and outperforms two state-of-the-art QPL methods.||
|**2024-10-25**|[A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection](http://arxiv.org/abs/2410.19898)|null|This review paper explores recent advances in deep learning approaches for non-invasive cognitive impairment detection. We examine various non-invasive indicators of cognitive decline, including speech and language, facial, and motoric mobility. The paper provides an overview of relevant datasets, feature-extracting techniques, and deep-learning architectures applied to this domain. We have analyzed the performance of different methods across modalities and observed that speech and language-based methods generally achieved the highest detection performance. Studies combining acoustic and linguistic features tended to outperform those using a single modality. Facial analysis methods showed promise for visual modalities but were less extensively studied. Most papers focused on binary classification (impaired vs. non-impaired), with fewer addressing multi-class or regression tasks. Transfer learning and pre-trained language models emerged as popular and effective techniques, especially for linguistic analysis. Despite significant progress, several challenges remain, including data standardization and accessibility, model explainability, longitudinal analysis limitations, and clinical adaptation. Lastly, we propose future research directions, such as investigating language-agnostic speech analysis methods, developing multi-modal diagnostic systems, and addressing ethical considerations in AI-assisted healthcare. By synthesizing current trends and identifying key obstacles, this review aims to guide further development of deep learning-based cognitive impairment detection systems to improve early diagnosis and ultimately patient outcomes.||
|**2024-10-25**|[Intelligent Understanding of Large Language Models in Traditional Chinese Medicine Based on Prompt Engineering Framework](http://arxiv.org/abs/2410.19451)|null|This paper explores the application of prompt engineering to enhance the performance of large language models (LLMs) in the domain of Traditional Chinese Medicine (TCM). We propose TCM-Prompt, a framework that integrates various pre-trained language models (PLMs), templates, tokenization, and verbalization methods, allowing researchers to easily construct and fine-tune models for specific TCM-related tasks. We conducted experiments on disease classification, syndrome identification, herbal medicine recommendation, and general NLP tasks, demonstrating the effectiveness and superiority of our approach compared to baseline methods. Our findings suggest that prompt engineering is a promising technique for improving the performance of LLMs in specialized domains like TCM, with potential applications in digitalization, modernization, and personalized medicine.||
|**2024-10-22**|[All Entities are Not Created Equal: Examining the Long Tail for Fine-Grained Entity Typing](http://arxiv.org/abs/2410.17355)|null|Pre-trained language models (PLMs) are trained on large amounts of data, which helps capture world knowledge alongside linguistic competence. Due to this, they are extensively used for ultra-fine entity typing tasks, where they provide the entity knowledge held in their parameter space. Given that PLMs learn from co-occurrence patterns, they likely contain more or less knowledge about entities depending on how frequent they are in the pre-training data. In this work, we probe PLMs to elicit encoded entity probabilities and demonstrate that they highly correlate with their frequency in large-scale internet data. Then, we demonstrate that entity-typing approaches that rely on PLMs struggle with entities at the long tail of the distribution. Our findings suggest that we need to go beyond PLMs to produce solutions that perform well for rare, new or infrequent entities.||
|**2024-10-21**|[ComPO: Community Preferences for Language Model Personalization](http://arxiv.org/abs/2410.16027)|null|Conventional algorithms for training language models (LMs) with human feedback rely on preferences that are assumed to account for an "average" user, disregarding subjectivity and finer-grained variations. Recent studies have raised concerns that aggregating such diverse and often contradictory human feedback to finetune models results in generic models that generate outputs not preferred by many user groups, as they tend to average out styles and norms. To address this issue, we draw inspiration from recommendation systems and propose ComPO, a method to personalize preference optimization in LMs by contextualizing the probability distribution of model outputs with the preference provider. Focusing on group-level preferences rather than individuals, we collect and release ComPRed, a question answering dataset with community-level preferences from Reddit. This dataset facilitates studying diversity in preferences without incurring privacy concerns associated with individual feedback. Our experiments reveal that conditioning language models on a community identifier (i.e., subreddit name) during preference tuning substantially enhances model performance. Conversely, replacing this context with random subreddit identifiers significantly diminishes performance, highlighting the effectiveness of our approach in tailoring responses to communities' preferences.||
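|**2024-10-21**|[Learning-to-Defer for Extractive Question Answering](http://arxiv.org/abs/2410.15761)|null|Pre-trained language models have profoundly impacted the field of extractive question answering, leveraging large-scale text corpora to enhance contextual language understanding. Despite their success, these models still struggle in complex scenarios that demand nuanced interpretation or reasoning beyond immediate textual cues. Their size also poses challenges for deployment on resource-constrained devices. To address these limitations, we introduce an adapted two-stage Learning-to-Defer mechanism that enhances decision-making by selectively deferring questions to human experts or larger models, without retraining the language models in the question-answering setting. This approach not only maintains computational efficiency but also significantly improves model reliability and accuracy in ambiguous contexts. We establish the theoretical soundness of our method by proving Bayes and $(\mathcal{H}, \mathcal{R})$ consistency of the surrogate loss function, guaranteeing the optimality of the final solution. Empirical evaluations on the SQuADv2 dataset show that integrating human expertise and leveraging larger models improves performance. Our results further show that by deferring only a small number of queries, the smaller model can achieve performance comparable to its larger counterparts while preserving computational efficiency, thereby broadening the applicability of pre-trained language models across diverse operational environments.||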
|**2024-10-21**|[Who's Who: Large Language Models Meet Knowledge Conflicts in Practice](http://arxiv.org/abs/2410.15737)|**[link](https://github.com/vinairesearch/whoqa)**|Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, the language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset to examine models' behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers. The WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrade LLMs' performance in RAG settings.||
|**2024-10-21**|[DomainSum: A Hierarchical Benchmark for Fine-Grained Domain Shift in Abstractive Text Summarization](http://arxiv.org/abs/2410.15687)|**[link](https://github.com/hpzhang94/DomainSum)**|Most research on abstractive summarization focuses on single-domain applications, often neglecting how domain shifts between documents affect performance and the generalization ability of summarization models. To address this issue, we introduce DomainSum, a hierarchical benchmark designed to capture fine-grained domain shifts in abstractive summarization. We categorize these shifts into three levels: genre, style, and topic, and demonstrate through comprehensive benchmark analysis that they follow a hierarchical structure. Furthermore, we evaluate the domain generalization capabilities of commonly used pre-trained language models (PLMs) and large language models (LLMs) in in-domain and cross-domain settings.||
|**2024-10-21**|[Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding](http://arxiv.org/abs/2410.15609)|null|Recently, pre-trained language models (PLMs) have been increasingly adopted in spoken language understanding (SLU). However, automatic speech recognition (ASR) systems frequently produce inaccurate transcriptions, leading to noisy inputs for SLU models, which can significantly degrade their performance. To address this, our objective is to train SLU models to withstand ASR errors by exposing them to noises commonly observed in ASR systems, referred to as ASR-plausible noises. Speech noise injection (SNI) methods have pursued this objective by introducing ASR-plausible noises, but we argue that these methods are inherently biased towards specific ASR systems, or ASR-specific noises. In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. Experimental results and analyses demonstrate the effectiveness of our proposed methods in enhancing the robustness and generalizability of SLU models against unseen ASR systems by introducing more diverse and plausible ASR noises in advance.||
|**2024-10-19**|[MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science](http://arxiv.org/abs/2410.15126)|null|We introduce a novel continued pre-training method, MELT (MatEriaLs-aware continued pre-Training), specifically designed to efficiently adapt the pre-trained language models (PLMs) for materials science. Unlike previous adaptation strategies that solely focus on constructing domain-specific corpus, MELT comprehensively considers both the corpus and the training strategy, given that materials science corpus has distinct characteristics from other domains. To this end, we first construct a comprehensive materials knowledge base from the scientific corpus by building semantic graphs. Leveraging this extracted knowledge, we integrate a curriculum into the adaptation process that begins with familiar and generalized concepts and progressively moves toward more specialized terms. We conduct extensive experiments across diverse benchmarks to verify the effectiveness and generality of MELT. A comprehensive evaluation convincingly supports the strength of MELT, demonstrating superior performance compared to existing continued pre-training methods. The in-depth analysis also shows that MELT enables PLMs to effectively represent materials entities compared to the existing adaptation methods, thereby highlighting its broad applicability across a wide spectrum of materials science.||
|**2024-10-19**|[BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation](http://arxiv.org/abs/2410.14971)|null|Recent advances in decoding language from brain signals (EEG and MEG) have been significantly driven by pre-trained language models, leading to remarkable progress on publicly available non-invasive EEG/MEG datasets. However, previous works predominantly utilize teacher forcing during text generation, leading to significant performance drops without its use. A fundamental issue is the inability to establish a unified feature space correlating textual data with the corresponding evoked brain signals. Although some recent studies attempt to mitigate this gap using an audio-text pre-trained model, Whisper, which is favored for its signal input modality, they still largely overlook the inherent differences between audio signals and brain signals in directly applying Whisper to decode brain signals. To address these limitations, we propose a new multi-stage strategy for semantic brain signal decoding via vEctor-quantized speCtrogram reconstruction for WHisper-enhanced text generatiOn, termed BrainECHO. Specifically, BrainECHO successively conducts: 1) Discrete autoencoding of the audio spectrogram; 2) Brain-audio latent space alignment; and 3) Semantic text generation via Whisper finetuning. Through this autoencoding--alignment--finetuning process, BrainECHO outperforms state-of-the-art methods under the same data split settings on two widely accepted resources: the EEG dataset (Brennan) and the MEG dataset (GWilliams). The innovation of BrainECHO, coupled with its robustness and superiority at the sentence, session, and subject-independent levels across public datasets, underscores its significance for language-based brain-computer interfaces.||
|**2024-10-18**|[Reasoning, Memorization, and Fine-Tuning Language Models for Non-Cooperative Games](http://arxiv.org/abs/2410.14890)|null|We develop a method that integrates the tree of thoughts and multi-agent framework to enhance the capability of pre-trained language models in solving complex, unfamiliar games. The method decomposes game-solving into four incremental tasks -- game summarization, area selection, action extraction, and action validation -- each assigned to a specific language-model agent. By constructing a tree of thoughts, the method simulates reasoning paths and allows agents to collaboratively distill game representations and tactics, mitigating the limitations of language models in reasoning and long-term memorization. Additionally, an automated fine-tuning process further optimizes the agents' performance by ranking query-response pairs based on game outcomes, e.g., winning or losing. We apply the method to a non-cooperative game and demonstrate a 65 percent winning rate against benchmark algorithms, with an additional 10 percent improvement after fine-tuning. In contrast to existing deep learning algorithms for game solving that require millions of training samples, the proposed method consumes approximately 1000 training samples, highlighting its efficiency and scalability.||
|**2024-10-18**|[PTR: A Pre-trained Language Model for Trajectory Recovery](http://arxiv.org/abs/2410.14281)|null|Spatiotemporal trajectory data is vital for web-of-things services and is extensively collected and analyzed by web-based hardware and platforms. However, issues such as service interruptions and network instability often lead to sparsely recorded trajectories, resulting in a loss of detailed movement data. As a result, recovering these trajectories to restore missing information becomes essential. Despite progress, several challenges remain unresolved. First, the lack of large-scale dense trajectory data hampers the performance of existing deep learning methods, which rely heavily on abundant data for supervised training. Second, current methods struggle to generalize across sparse trajectories with varying sampling intervals, necessitating separate re-training for each interval and increasing computational costs. Third, external factors crucial for the recovery of missing points are not fully incorporated. To address these challenges, we propose a framework called PTR. This framework mitigates the issue of limited dense trajectory data by leveraging the capabilities of pre-trained language models (PLMs). PTR incorporates an explicit trajectory prompt and is trained on datasets with multiple sampling intervals, enabling it to generalize effectively across different intervals in sparse trajectories. To capture external factors, we introduce an implicit trajectory prompt that models road conditions, providing richer information for recovering missing points. Additionally, we present a trajectory embedder that encodes trajectory points and transforms the embeddings of both observed and missing points into a format comprehensible to PLMs. Experimental results on two public trajectory datasets with three sampling intervals demonstrate the efficacy and scalability of PTR.||
|**2024-10-16**|[NSmark: Null Space Based Black-box Watermarking Defense Framework for Pre-trained Language Models](http://arxiv.org/abs/2410.13907)|**[link](https://github.com/dongdongzhaoup/nsmark)**|Pre-trained language models (PLMs) have emerged as critical intellectual property (IP) assets that necessitate protection. Although various watermarking strategies have been proposed, they remain vulnerable to Linear Functionality Equivalence Attacks (LFEA), which can invalidate most existing white-box watermarks without prior knowledge of the watermarking scheme or training data. This paper further analyzes and extends the attack scenarios of LFEA to the commonly employed black-box settings for PLMs by considering Last-Layer outputs (dubbed LL-LFEA). We discover that the null space of the output matrix remains invariant against LL-LFEA attacks. Based on this finding, we propose NSmark, a task-agnostic, black-box watermarking scheme capable of resisting LL-LFEA attacks. NSmark consists of three phases: (i) watermark generation using the digital signature of the owner, enhanced by spread spectrum modulation for increased robustness; (ii) watermark embedding through an output mapping extractor that preserves PLM performance while maximizing watermark capacity; (iii) watermark verification, assessed by extraction rate and null space conformity. Extensive experiments on both pre-training and downstream tasks confirm the effectiveness, reliability, fidelity, and robustness of our approach. Code is available at https://github.com/dongdongzhaoUP/NSmark.||
|**2024-10-17**|[Meta-DiffuB: A Contextualized Sequence-to-Sequence Text Diffusion Model with Meta-Exploration](http://arxiv.org/abs/2410.13201)|**[link](https://github.com/meta-diffub/meta-diffub)**|The diffusion model, a new generative modeling paradigm, has achieved significant success in generating images, audio, video, and text. It has been adapted for sequence-to-sequence text generation (Seq2Seq) through DiffuSeq, termed S2S Diffusion. Existing S2S-Diffusion models predominantly rely on fixed or hand-crafted rules to schedule noise during the diffusion and denoising processes. However, these models are limited by non-contextualized noise, which fails to fully consider the characteristics of Seq2Seq tasks. In this paper, we propose the Meta-DiffuB framework - a novel scheduler-exploiter S2S-Diffusion paradigm designed to overcome the limitations of existing S2S-Diffusion models. We employ Meta-Exploration to train an additional scheduler model dedicated to scheduling contextualized noise for each sentence. Our exploiter model, an S2S-Diffusion model, leverages the noise scheduled by our scheduler model for updating and generation. Meta-DiffuB achieves state-of-the-art performance compared to previous S2S-Diffusion models and fine-tuned pre-trained language models (PLMs) across four Seq2Seq benchmark datasets. We further investigate and visualize the impact of Meta-DiffuB's noise scheduling on the generation of sentences with varying difficulties. Additionally, our scheduler model can function as a "plug-and-play" model to enhance DiffuSeq without the need for fine-tuning during the inference stage.||
|**2024-10-16**|[Negative-Prompt-driven Alignment for Generative Language Model](http://arxiv.org/abs/2410.12194)|null|Large language models have achieved remarkable capabilities, but aligning their outputs with human values and preferences remains a significant challenge. Existing alignment methods primarily focus on positive examples while overlooking the importance of negative responses in guiding models away from undesirable behaviors. For instance, widely used alignment datasets reveal a scarcity of explicit negative examples that contradict human values, hindering their ability to discourage harmful or biased outputs during training. To address this limitation, we propose NEAT, i.e., NEgative-prompt-driven AlignmenT, to introduce negative prompts to generate undesirable responses alongside positive examples during the optimization process. NEAT explicitly penalizes the model for producing harmful outputs, guiding it not only toward desirable behaviors but also steering it away from generating undesirable, biased responses. This dual feedback mechanism enables better alignment with human preferences, crucial in contexts where avoiding harm is paramount. Starting from a pre-trained language model, NEAT performs online alignment by incorporating a ranking loss derived from an expanded preference dataset containing both positive and negative examples. Extensive experiments validate NEAT's effectiveness in significantly enhancing language models' alignment with human values and preferences.||
|**2024-10-15**|[Bridging Large Language Models and Graph Structure Learning Models for Robust Representation Learning](http://arxiv.org/abs/2410.12096)|null|Graph representation learning, involving both node features and graph structures, is crucial for real-world applications but often encounters pervasive noise. State-of-the-art methods typically address noise by focusing separately on node features with large language models (LLMs) and on graph structures with graph structure learning models (GSLMs). In this paper, we introduce LangGSL, a robust framework that integrates the complementary strengths of pre-trained language models and GSLMs to jointly enhance both node feature and graph structure learning. In LangGSL, we first leverage LLMs to filter noise in the raw data and extract valuable cleaned information as features, enhancing the synergy of downstream models. During the mutual learning phase in LangGSL, the core idea is to leverage the relatively small language model (LM) to process local attributes and generate reliable pseudo-labels and informative node embeddings, which are then integrated into the GSLM's prediction phase. This approach enriches the global context and enhances overall performance. Meanwhile, GSLM refines the evolving graph structure constructed from the LM's output, offering updated labels back to the LM as additional guidance, thus facilitating a more effective mutual learning process. The LM and GSLM work synergistically, complementing each other's strengths and offsetting weaknesses within a variational information-maximizing framework, resulting in enhanced node features and a more robust graph structure. Extensive experiments on diverse graph datasets of varying scales and across different task scenarios demonstrate the scalability and effectiveness of the proposed approach.||
|**2024-10-15**|[LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text](http://arxiv.org/abs/2410.12064)|null|This paper presents the results of the LegalLens Shared Task, focusing on detecting legal violations within text in the wild across two sub-tasks: LegalLens-NER for identifying legal violation entities and LegalLens-NLI for associating these violations with relevant legal contexts and affected individuals. Using an enhanced LegalLens dataset covering labor, privacy, and consumer protection domains, 38 teams participated in the task. Our analysis reveals that while a mix of approaches was used, the top-performing teams in both tasks consistently relied on fine-tuning pre-trained language models, outperforming legal-specific models and few-shot methods. The top-performing team achieved a 7.11% improvement in NER over the baseline, while NLI saw a more marginal improvement of 5.7%. Despite these gains, the complexity of legal texts leaves room for further advancements.||
|**2024-10-15**|[A Survey on Deep Tabular Learning](http://arxiv.org/abs/2410.12034)|null|Tabular data, widely used in industries like healthcare, finance, and transportation, presents unique challenges for deep learning due to its heterogeneous nature and lack of spatial structure. This survey reviews the evolution of deep learning models for tabular data, from early fully connected networks (FCNs) to advanced architectures like TabNet, SAINT, TabTranSELU, and MambaNet. These models incorporate attention mechanisms, feature embeddings, and hybrid architectures to address tabular data complexities. TabNet uses sequential attention for instance-wise feature selection, improving interpretability, while SAINT combines self-attention and intersample attention to capture complex interactions across features and data points, both advancing scalability and reducing computational overhead. Hybrid architectures such as TabTransformer and FT-Transformer integrate attention mechanisms with multi-layer perceptrons (MLPs) to handle categorical and numerical data, with FT-Transformer adapting transformers for tabular datasets. Research continues to balance performance and efficiency for large datasets. Graph-based models like GNN4TDL and GANDALF combine neural networks with decision trees or graph structures, enhancing feature representation and mitigating overfitting in small datasets through advanced regularization techniques. Diffusion-based models like the Tabular Denoising Diffusion Probabilistic Model (TabDDPM) generate synthetic data to address data scarcity, improving model robustness. Similarly, models like TabPFN and Ptab leverage pre-trained language models, incorporating transfer learning and self-supervised techniques into tabular tasks. This survey highlights key advancements and outlines future research directions on scalability, generalization, and interpretability in diverse tabular data applications.||
|**2024-10-14**|[Improve Meta-learning for Few-Shot Text Classification with All You Can Acquire from the Tasks](http://arxiv.org/abs/2410.10454)|**[link](https://github.com/yvogao/laqda)**|Meta-learning has emerged as a prominent technology for few-shot text classification and has achieved promising performance. However, existing methods often encounter difficulties in drawing accurate class prototypes from support set samples, primarily due to probable large intra-class differences and small inter-class differences within the task. Recent approaches attempt to incorporate external knowledge or pre-trained language models to augment data, but this requires additional resources and thus does not suit many few-shot scenarios. In this paper, we propose a novel solution to address this issue by adequately leveraging the information within the task itself. Specifically, we utilize label information to construct a task-adaptive metric space, thereby adaptively reducing the intra-class differences and magnifying the inter-class differences. We further employ the optimal transport technique to estimate class prototypes with query set samples together, mitigating the problem of inaccurate and ambiguous support set samples caused by large intra-class differences. We conduct extensive experiments on eight benchmark datasets, and our approach shows obvious advantages over state-of-the-art models across all the tasks on all the datasets. For reproducibility, all the datasets and codes are available at https://github.com/YvoGao/LAQDA.||
|**2024-10-14**|[Scalable Multi-Domain Adaptation of Language Models using Modular Experts](http://arxiv.org/abs/2410.10181)|null|Domain-specific adaptation is critical to maximizing the performance of pre-trained language models (PLMs) on one or multiple targeted tasks, especially under resource-constrained use cases, such as edge devices. However, existing methods often struggle to balance domain-specific performance, retention of general knowledge, and efficiency for training and inference. To address these challenges, we propose Modular Domain Experts (MoDE). MoDE is a mixture-of-experts architecture that augments a general PLM with modular, domain-specialized experts. These experts are trained independently and composed together via a lightweight training process. In contrast to standard low-rank adaptation methods, each MoDE expert consists of several transformer layers which scale better with more training examples and larger parameter counts. Our evaluation demonstrates that MoDE achieves comparable target performances to full parameter fine-tuning while achieving 1.65% better retention performance. Moreover, MoDE's architecture enables flexible sharding configurations and improves training speeds by up to 38% over state-of-the-art distributed training configurations.||
|**2024-10-11**|[Lifelong Event Detection via Optimal Transport](http://arxiv.org/abs/2410.08905)|null|Continual Event Detection (CED) poses a formidable challenge due to the catastrophic forgetting phenomenon, where learning new tasks (with new coming event types) hampers performance on previous ones. In this paper, we introduce a novel approach, Lifelong Event Detection via Optimal Transport (LEDOT), that leverages optimal transport principles to align the optimization of our classification module with the intrinsic nature of each class, as defined by their pre-trained language modeling. Our method integrates replay sets, prototype latent representations, and an innovative Optimal Transport component. Extensive experiments on MAVEN and ACE datasets demonstrate LEDOT's superior performance, consistently outperforming state-of-the-art baselines. The results underscore LEDOT as a pioneering solution in continual event detection, offering a more effective and nuanced approach to addressing catastrophic forgetting in evolving environments.||
|**2024-10-10**|[Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity](http://arxiv.org/abs/2410.08198)|**[link](https://github.com/mohamad-amin/adam-coordinate-adaptivity)**|Adam outperforms SGD when training language models. Yet this advantage is not well-understood theoretically -- previous convergence analysis for Adam and SGD mainly focuses on the number of steps $T$ and is already minimax-optimal in non-convex cases, which are both $\widetilde{O}(T^{-1/4})$. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under novel assumptions that loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$ -geometry is changed while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.||
|**2024-10-10**|[Do Current Language Models Support Code Intelligence for R Programming Language?](http://arxiv.org/abs/2410.07793)|null|Recent advancements in developing Pre-trained Language Models for Code (Code-PLMs) have urged many areas of Software Engineering (SE) and brought breakthrough results for many SE tasks. Though these models have achieved the state-of-the-art performance for SE tasks for many popular programming languages, such as Java and Python, the Scientific Software and its related languages like R programming language have rarely benefited or even been evaluated with the Code-PLMs. Research has shown that R has many differences with other programming languages and requires specific techniques. In this study, we provide the first insights for code intelligence for R. For this purpose, we collect and open source an R dataset, and evaluate Code-PLMs for the two tasks of code summarization and method name prediction using several settings and strategies, including the differences in two R styles, Tidy-verse and Base R. Our results demonstrate that the studied models have experienced varying degrees of performance degradation when processing R programming language code, which is supported by human evaluation. Additionally, not all models show performance improvement in R-specific tasks even after multi-language fine-tuning. The dual syntax paradigms in R significantly impact the models' performance, particularly in code summarization tasks. Furthermore, the project-specific context inherent in R codebases significantly impacts the performance when attempting cross-project training.||
|**2024-10-09**|[Multi-Task Program Error Repair and Explanatory Diagnosis](http://arxiv.org/abs/2410.07271)|null|Program errors can occur in any type of programming, and can manifest in a variety of ways, such as unexpected output, crashes, or performance issues. And program error diagnosis can often be too abstract or technical for developers to understand, especially for beginners. The goal of this paper is to present a novel machine-learning approach for Multi-task Program Error Repair and Explanatory Diagnosis (mPRED). A pre-trained language model is used to encode the source code, and a downstream model is specifically designed to identify and repair errors. Programs and test cases will be augmented and optimized from several perspectives. Additionally, our approach incorporates a "chain of thoughts" method, which enables the models to produce intermediate reasoning explanations before providing the final correction. To aid in visualizing and analyzing the program structure, we use a graph neural network for program structure visualization. Overall, our approach offers a promising approach for repairing program errors across different programming languages and providing helpful explanations to programmers.||
|**2024-10-08**|[Manual Verbalizer Enrichment for Few-Shot Text Classification](http://arxiv.org/abs/2410.06173)|null|With the continuous development of pre-trained language models, prompt-based training becomes a well-adopted paradigm that drastically improves the exploitation of models for many natural language processing tasks. Prompting also shows great performance compared to traditional fine-tuning when adapted to zero-shot or few-shot scenarios where the number of annotated data is limited. In this framework, the role of verbalizers is essential, as an interpretation from masked word distributions into output predictions. In this work, we propose MAVE, an approach for verbalizer construction by enrichment of class labels using neighborhood relation in the embedding space of words for the text classification task. In addition, we elaborate a benchmarking procedure to evaluate typical baselines of verbalizers for document classification in few-shot learning contexts. Our model achieves state-of-the-art results while using significantly fewer resources. We show that our approach is particularly effective in cases with extremely limited supervision data.||
|**2024-10-08**|[Enhancing SPARQL Generation by Triplet-order-sensitive Pre-training](http://arxiv.org/abs/2410.05731)|**[link](https://github.com/LUMIA-Group/TosT5)**|Semantic parsing that translates natural language queries to SPARQL is of great importance for Knowledge Graph Question Answering (KGQA) systems. Although pre-trained language models like T5 have achieved significant success in the Text-to-SPARQL task, their generated outputs still exhibit notable errors specific to the SPARQL language, such as triplet flips. To address this challenge and further improve the performance, we propose an additional pre-training stage with a new objective, Triplet Order Correction (TOC), along with the commonly used Masked Language Modeling (MLM), to collectively enhance the model's sensitivity to triplet order and SPARQL syntax. Our method achieves state-of-the-art performances on three widely-used benchmarks.||
|**2024-10-05**|[Persona Knowledge-Aligned Prompt Tuning Method for Online Debate](http://arxiv.org/abs/2410.04239)|**[link](https://github.com/HKUST-KnowComp/PersonaPrompt)**|Debate is the process of exchanging viewpoints or convincing others on a particular issue. Recent research has provided empirical evidence that the persuasiveness of an argument is determined not only by language usage but also by communicator characteristics. Researchers have paid much attention to aspects of languages, such as linguistic features and discourse structures, but combining argument persuasiveness and impact with the social personae of the audience has not been explored due to the difficulty and complexity. We have observed the impressive simulation and personification capability of ChatGPT, indicating a giant pre-trained language model may function as an individual to provide personae and exert unique influences based on diverse background knowledge. Therefore, we propose a persona knowledge-aligned framework for argument quality assessment tasks from the audience side. This is the first work that leverages the emergence of ChatGPT and injects such audience personae knowledge into smaller language models via prompt tuning. The performance of our pipeline demonstrates significant and consistent improvement compared to competitive architectures.||
|**2024-10-05**|[Overview of Factify5WQA: Fact Verification through 5W Question-Answering](http://arxiv.org/abs/2410.04236)|null|Researchers have found that fake news spreads many times faster than real news. This is a major problem, especially in today's world where social media is the key source of news for many among the younger population. Fact verification, thus, becomes an important task and many media sites contribute to the cause. Manual fact verification is a tedious task, given the volume of fake news online. The Factify5WQA shared task aims to increase research towards automated fake news detection by providing a dataset with an aspect-based question answering based fact verification method. Each claim and its supporting document is associated with 5W questions that help compare the two information sources. The objective performance measure in the task is done by comparing answers using BLEU score to measure the accuracy of the answers, followed by an accuracy measure of the classification. The task had submissions using custom training setup and pre-trained language models among others. The best performing team posted an accuracy of 69.56%, which is a near 35% improvement over the baseline.||
|**2024-10-05**|[On Eliciting Syntax from Language Models via Hashing](http://arxiv.org/abs/2410.04074)|**[link](https://github.com/speedcell4/parserker)**|Unsupervised parsing, also known as grammar induction, aims to infer syntactic structure from raw text. Recently, binary representation has exhibited remarkable information-preserving capabilities at both lexicon and syntax levels. In this paper, we explore the possibility of leveraging this capability to deduce parsing trees from raw text, relying solely on the implicitly induced grammars within models. To achieve this, we upgrade the bit-level CKY from zero-order to first-order to encode the lexicon and syntax in a unified binary representation space, switch training from supervised to unsupervised under the contrastive hashing framework, and introduce a novel loss function to impose stronger yet balanced alignment signals. Our model shows competitive performance on various datasets, therefore, we claim that our method is effective and efficient enough to acquire high-quality parsing trees from pre-trained language models at a low cost.||
|**2024-10-03**|[Reward-RAG: Enhancing RAG with Reward Driven Supervision](http://arxiv.org/abs/2410.03780)|null|In this paper, we introduce Reward-RAG, a novel approach designed to enhance the Retrieval-Augmented Generation (RAG) model through Reward-Driven Supervision. Unlike previous RAG methodologies, which focus on training language models (LMs) to utilize external knowledge retrieved from external sources, our method adapts retrieval information to specific domains by employing CriticGPT to train a dedicated reward model. This reward model generates synthesized datasets for fine-tuning the RAG encoder, aligning its outputs more closely with human preferences. The versatility of our approach allows it to be effectively applied across various domains through domain-specific fine-tuning. We evaluate Reward-RAG on publicly available benchmarks from multiple domains, comparing it to state-of-the-art methods. Our experimental results demonstrate significant improvements in performance, highlighting the effectiveness of Reward-RAG in improving the relevance and quality of generated responses. These findings underscore the potential of integrating reward models with RAG to achieve superior outcomes in natural language generation tasks.||
|**2024-10-04**|[Vulnerability Detection via Topological Analysis of Attention Maps](http://arxiv.org/abs/2410.03470)|**[link](https://github.com/Snopoff/Vulnerability-Detection-via-Topological-Analysis-of-Attention-Maps)**|Recently, deep learning (DL) approaches to vulnerability detection have gained significant traction. These methods demonstrate promising results, often surpassing traditional static code analysis tools in effectiveness. In this study, we explore a novel approach to vulnerability detection utilizing the tools from topological data analysis (TDA) on the attention matrices of the BERT model. Our findings reveal that traditional machine learning (ML) techniques, when trained on the topological features extracted from these attention matrices, can perform competitively with pre-trained language models (LLMs) such as CodeBERTa. This suggests that TDA tools, including persistent homology, are capable of effectively capturing semantic information critical for identifying vulnerabilities.||
|**2024-10-09**|[What do Large Language Models Need for Machine Translation Evaluation?](http://arxiv.org/abs/2410.03278)|**[link](https://github.com/surrey-nlp/LLM4MT_eval)**|Leveraging large language models (LLMs) for various natural language processing tasks has led to superlative claims about their performance. For the evaluation of machine translation (MT), existing research shows that LLMs are able to achieve results comparable to fine-tuned multilingual pre-trained language models. In this paper, we explore what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate MT quality. In addition, we investigate prompting techniques such as zero-shot, Chain of Thought (CoT) and few-shot prompting for eight language pairs covering high-, medium- and low-resource languages, leveraging varying LLM variants. Our findings indicate the importance of reference translations for an LLM-based evaluation. While larger models do not necessarily fare better, they tend to benefit more from CoT prompting, than smaller models. We also observe that LLMs do not always provide a numerical score when generating evaluations, which poses a question on their reliability for the task. Our work presents a comprehensive analysis for resource-constrained and training-less LLM-based evaluation of machine translation. We release the accrued prompt templates, code and data publicly for reproducibility.||
|**2024-10-04**|[Generating bilingual example sentences with large language models as lexicography assistants](http://arxiv.org/abs/2410.03182)|**[link](https://github.com/raphaelmerx/llm-bilingual-examples)**|We present a study of LLMs' performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels: French (high-resource), Indonesian (mid-resource), and Tetun (low-resource), with English as the target language. We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility. Our findings reveal that while LLMs can generate reasonably good dictionary examples, their performance degrades significantly for lower-resourced languages. We also observe high variability in human preferences for example quality, reflected in low inter-annotator agreement rates. To address this, we demonstrate that in-context learning can successfully align LLMs with individual annotator preferences. Additionally, we explore the use of pre-trained language models for automated rating of examples, finding that sentence perplexity serves as a good proxy for typicality and intelligibility in higher-resourced languages. Our study also contributes a novel dataset of 600 ratings for LLM-generated sentence pairs, and provides insights into the potential of LLMs in reducing the cost of lexicographic work, particularly for low-resource languages.||
|**2024-10-03**|[Guided Stream of Search: Learning to Better Search with Language Models via Optimal Path Guidance](http://arxiv.org/abs/2410.02992)|**[link](https://github.com/symoon11/guided-stream-of-search)**|While language models have demonstrated impressive capabilities across a range of tasks, they still struggle with tasks that require complex planning and reasoning. Recent studies have proposed training language models on search processes rather than optimal solutions, resulting in better generalization performance even though search processes are noisy and even suboptimal. However, these studies overlook the value of optimal solutions, which can serve as step-by-step landmarks to guide more effective search. In this work, we explore how to leverage optimal solutions to enhance the search and planning abilities of language models. To this end, we propose guided stream of search (GSoS), which seamlessly incorporates optimal solutions into the self-generation process in a progressive manner, producing high-quality search trajectories. These trajectories are then distilled into the pre-trained model via supervised fine-tuning. Our approach significantly enhances the search and planning abilities of language models on Countdown, a simple yet challenging mathematical reasoning task. Notably, combining our method with RL fine-tuning yields further improvements, whereas previous supervised fine-tuning methods do not benefit from RL. Furthermore, our approach exhibits greater effectiveness than leveraging optimal solutions in the form of subgoal rewards.||
|**2024-10-03**|[Does the Order of Fine-tuning Matter and Why?](http://arxiv.org/abs/2410.02915)|null|To improve the performance on a target task, researchers have fine-tuned language models with an intermediate task before the target task of interest. However, previous works have focused on the pre-trained language models and downstream tasks in Natural Language Processing (NLP) and considered only one intermediate task. The effect of fine-tuning multiple intermediate tasks and their ordering on target task performance has not been fully explored in Software Engineering. In this study, we perform the first empirical study on analyzing the impact of task ordering on target task performance. Experimental results show that there is an impact of task ordering on target task performance by up to 6% of performance gain and up to 4% of performance loss. To explain such an impact, we consider a variety of potential factors, including the characteristics of dataset (syntactic similarity and semantic similarity analysis, dataset size), model (probing task and attention analysis), and task (task affinity analysis). Our study provides Software Engineering researchers and practitioners with insights into the effect of task orderings and how to select the one that is cost-effective while achieving the best performance gain.||
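|**2024-10-02**|[SciPrompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics](http://arxiv.org/abs/2410.01946)|**[link](https://github.com/zhiwenyou103/SciPrompt)**|Prompt-based fine-tuning has become an essential method for eliciting information encoded in pre-trained language models for a variety of tasks, including text classification. For multi-class classification tasks, prompt-based fine-tuning under low-resource scenarios has achieved performance levels comparable to those of full fine-tuning methods. Previous studies have used carefully crafted prompt templates and verbalizers, mapping from the label term space to the class space, to solve the classification problem as a masked language modeling task. However, cross-domain and fine-grained prompt-based fine-tuning with an automatically enriched verbalizer remains unexplored, mainly due to the difficulty and cost of manually selecting domain label terms for the verbalizer, which requires people with domain expertise. To address this challenge, we introduce SciPrompt, a framework designed to automatically retrieve scientific topic-related terms for low-resource text classification tasks. To this end, we select semantically correlated and domain-specific label terms within the context of scientific literature for verbalizer augmentation. Furthermore, we propose a new verbalization strategy that uses correlation scores as additional weights to enhance the prediction performance of the language model during model tuning. Our method outperforms state-of-the-art prompt-based fine-tuning methods on scientific text classification tasks under few-shot and zero-shot settings, especially in classifying fine-grained and emerging scientific topics.||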
|**2024-10-01**|[PclGPT: A Large Language Model for Patronizing and Condescending Language Detection](http://arxiv.org/abs/2410.00361)|**[link](https://github.com/dut-laowang/emnlp24-PclGPT)**|Disclaimer: Samples in this paper may be harmful and cause discomfort! Patronizing and condescending language (PCL) is a form of speech directed at vulnerable groups. As an essential branch of toxic language, this type of language exacerbates conflicts and confrontations among Internet communities and detrimentally impacts disadvantaged groups. Traditional pre-trained language models (PLMs) perform poorly in detecting PCL due to its implicit toxicity traits like hypocrisy and false sympathy. With the rise of large language models (LLMs), we can harness their rich emotional semantics to establish a paradigm for exploring implicit toxicity. In this paper, we introduce PclGPT, a comprehensive LLM benchmark designed specifically for PCL. We collect, annotate, and integrate the Pcl-PT/SFT dataset, and then develop a bilingual PclGPT-EN/CN model group through a comprehensive pre-training and supervised fine-tuning staircase process to facilitate implicit toxic detection. Group detection results and fine-grained detection from PclGPT and other models reveal significant variations in the degree of bias in PCL towards different vulnerable groups, necessitating increased societal attention to protect them.||
|**2024-10-03**|[Enhancing Pre-Trained Language Models for Vulnerability Detection via Semantic-Preserving Data Augmentation](http://arxiv.org/abs/2410.00249)|null|With the rapid development and widespread use of advanced network systems, software vulnerabilities pose a significant threat to secure communications and networking. Learning-based vulnerability detection systems, particularly those leveraging pre-trained language models, have demonstrated significant potential in promptly identifying vulnerabilities in communication networks and reducing the risk of exploitation. However, the shortage of accurately labeled vulnerability datasets hinders further progress in this field. Failing to represent real-world vulnerability data variety and preserve vulnerability semantics, existing augmentation approaches provide limited or even counterproductive contributions to model training. In this paper, we propose a data augmentation technique aimed at enhancing the performance of pre-trained language models for vulnerability detection. Given the vulnerability dataset, our method performs natural semantic-preserving program transformation to generate a large volume of new samples with enriched data diversity and variety. By incorporating our augmented dataset in fine-tuning a series of representative code pre-trained models (i.e., CodeBERT, GraphCodeBERT, UnixCoder, and PDBERT), up to 10.1% increase in accuracy and 23.6% increase in F1 can be achieved in the vulnerability detection task. Comparison results also show that our proposed method can substantially outperform other prominent vulnerability augmentation approaches.||
|**2024-09-29**|[Adversarial Examples for DNA Classification](http://arxiv.org/abs/2409.19788)|null|Pre-trained language models such as DNABERT2 and Nucleotide Transformer, which are trained on DNA sequences, have shown promising performance in DNA sequence classification tasks. The classification ability of these models stems from language models trained on vast amounts of DNA sequence samples, followed by fine-tuning with relatively smaller classification datasets. However, these text-based systems are not robust enough and can be vulnerable to adversarial examples. While adversarial attacks have been widely studied in text classification, there is limited research in DNA sequence classification. In this paper, we adapt commonly used attack algorithms in text classification for DNA sequence classification. We evaluated the impact of various attack methods on DNA sequence classification at the character, word, and sentence levels. Our findings indicate that actual DNA language model sequence classifiers are vulnerable to these attacks.||
|**2024-09-29**|[NeuroMax: Enhancing Neural Topic Modeling via Maximizing Mutual Information and Group Topic Regularization](http://arxiv.org/abs/2409.19749)|null|Recent advances in neural topic models have concentrated on two primary directions: the integration of the inference network (encoder) with a pre-trained language model (PLM) and the modeling of the relationship between words and topics in the generative model (decoder). However, the use of large PLMs significantly increases inference costs, making them less practical for situations requiring low inference times. Furthermore, it is crucial to simultaneously model the relationships between topics and words as well as the interrelationships among topics themselves. In this work, we propose a novel framework called NeuroMax (Neural Topic Model with Maximizing Mutual Information with Pretrained Language Model and Group Topic Regularization) to address these challenges. NeuroMax maximizes the mutual information between the topic representation obtained from the encoder in neural topic models and the representation derived from the PLM. Additionally, NeuroMax employs optimal transport to learn the relationships between topics by analyzing how information is transported among them. Experimental results indicate that NeuroMax reduces inference time, generates more coherent topics and topic groups, and produces more representative document embeddings, thereby enhancing performance on downstream tasks.||
|**2024-09-27**|[Suicide Phenotyping from Clinical Notes in Safety-Net Psychiatric Hospital Using Multi-Label Classification with Pre-Trained Language Models](http://arxiv.org/abs/2409.18878)|null|Accurate identification and categorization of suicidal events can yield better suicide precautions, reducing operational burden, and improving care quality in high-acuity psychiatric settings. Pre-trained language models offer promise for identifying suicidality from unstructured clinical narratives. We evaluated the performance of four BERT-based models using two fine-tuning strategies (multiple single-label and single multi-label) for detecting coexisting suicidal events from 500 annotated psychiatric evaluation notes. The notes were labeled for suicidal ideation (SI), suicide attempts (SA), exposure to suicide (ES), and non-suicidal self-injury (NSSI). RoBERTa outperformed other models using binary relevance (acc=0.86, F1=0.78). MentalBERT (F1=0.74) also exceeded BioClinicalBERT (F1=0.72). RoBERTa fine-tuned with a single multi-label classifier further improved performance (acc=0.88, F1=0.81), highlighting that models pre-trained on domain-relevant data and the single multi-label classification strategy enhance efficiency and performance. Keywords: EHR-based Phenotyping; Natural Language Processing; Secondary Use of EHR Data; Suicide Classification; BERT-based Model; Psychiatry; Mental Health||
|**2024-09-26**|[Infer Human's Intentions Before Following Natural Language Instructions](http://arxiv.org/abs/2409.18073)|**[link](https://github.com/simon-wan/fiser)**|For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.||
|**2024-09-26**|[Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study](http://arxiv.org/abs/2409.17750)|null|In this study, we delve into the efficacy of transformers within pre-trained language models (PLMs) when repurposed as encoders for Automatic Speech Recognition (ASR). Our underlying hypothesis posits that, despite being initially trained on text-based corpora, these transformers possess a remarkable capacity to extract effective features from the input sequence. This inherent capability, we argue, is transferrable to speech data, thereby augmenting the acoustic modeling ability of ASR. Through rigorous empirical analysis, our findings reveal a notable improvement in Character Error Rate (CER) and Word Error Rate (WER) across diverse ASR tasks when transformers from pre-trained LMs are incorporated. Particularly, they serve as an advantageous starting point for initializing ASR encoders. Furthermore, we uncover that these transformers, when integrated into a well-established ASR encoder, can significantly boost performance, especially in scenarios where profound semantic comprehension is pivotal. This underscores the potential of leveraging the semantic prowess embedded within pre-trained transformers to advance ASR systems' capabilities.||
|**2024-09-24**|[HLB: Benchmarking LLMs' Humanlikeness in Language Use](http://arxiv.org/abs/2409.15890)|null|As synthetic data becomes increasingly prevalent in training language models, particularly through generated dialogue, concerns have emerged that these models may deviate from authentic human language patterns, potentially losing the richness and creativity inherent in human communication. This highlights the critical need to assess the humanlikeness of language models in real-world language use. In this paper, we present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs) using 10 psycholinguistic experiments designed to probe core linguistic aspects, including sound, word, syntax, semantics, and discourse (see https://huggingface.co/spaces/XufengDuan/HumanLikeness). To anchor these comparisons, we collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments. For rigorous evaluation, we developed a coding algorithm that accurately identified language use patterns, enabling the extraction of response distributions for each task. By comparing the response distributions between human participants and LLMs, we quantified humanlikeness through distributional similarity. Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels. Importantly, we found that improvements in other performance metrics did not necessarily lead to greater humanlikeness, and in some cases, even resulted in a decline. By introducing psycholinguistic methods to model evaluation, this benchmark offers the first framework for systematically assessing the humanlikeness of LLMs in language use.||
|**2024-09-23**|[DSG-KD: Knowledge Distillation from Domain-Specific to General Language Models](http://arxiv.org/abs/2409.14904)|**[link](https://github.com/josangyeon/dsg-kd)**|The use of pre-trained language models fine-tuned to address specific downstream tasks is a common approach in natural language processing (NLP). However, acquiring domain-specific knowledge via fine-tuning is challenging. Traditional methods involve pretraining language models using vast amounts of domain-specific data before fine-tuning for particular tasks. This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea. Our findings reveal that existing domain-specific pre-trained language models underperform compared to general language models in handling N-lingual free-text data characteristics of non-English-speaking regions. To address these limitations, we propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning. This study demonstrates the effective transfer of specialized knowledge between models by defining a general language model as the student model and a domain-specific pre-trained model as the teacher model. In particular, we address the complexities of EMR data obtained from PEDs in non-English-speaking regions, such as Korea, and demonstrate that the proposed method enhances classification performance in such contexts. The proposed methodology not only outperforms baseline models on Korean PED EMR data, but also promises broader applicability in various professional and technical domains. In future work, we intend to extend this methodology to include diverse non-English-speaking regions and address additional downstream tasks, with the aim of developing advanced model architectures using state-of-the-art KD techniques. The code is available at https://github.com/JoSangYeon/DSG-KD.||
|**2024-09-23**|[Pre-trained Language Model and Knowledge Distillation for Lightweight Sequential Recommendation](http://arxiv.org/abs/2409.14810)|null|Sequential recommendation models user interests based on historical behaviors to provide personalized recommendations. Previous sequential recommendation algorithms primarily employ neural networks to extract features of user interests, achieving good performance. However, due to the sparsity of recommendation system datasets, these algorithms often employ small-scale network frameworks, resulting in weaker generalization capability. Recently, a series of sequential recommendation algorithms based on large pre-trained language models have been proposed. Nonetheless, given the real-time demands of recommendation systems, the challenge remains in applying pre-trained language models for rapid recommendations in real scenarios. To address this, we propose a sequential recommendation algorithm based on a pre-trained language model and knowledge distillation. The key to the proposed algorithm is to transfer pre-trained knowledge across domains and achieve lightweight inference by knowledge distillation. The algorithm operates in two stages: in the first stage, we fine-tune the pre-trained language model on the recommendation dataset to transfer the pre-trained knowledge to the recommendation task; in the second stage, we distill the trained language model to transfer the learned knowledge to a lightweight model. Extensive experiments on multiple public recommendation datasets show that the proposed algorithm enhances recommendation accuracy and provides timely recommendation services.||
|**2024-09-21**|[Probing Context Localization of Polysemous Words in Pre-trained Language Model Sub-Layers](http://arxiv.org/abs/2409.14097)|null|In the era of high performing Large Language Models, researchers have widely acknowledged that contextual word representations are one of the key drivers in achieving top performances in downstream tasks. In this work, we investigate the degree of contextualization encoded in the fine-grained sub-layer representations of a Pre-trained Language Model (PLM) by empirical experiments using linear probes. Unlike previous work, we are particularly interested in identifying the strength of contextualization across PLM sub-layer representations (i.e. Self-Attention, Feed-Forward Activation and Output sub-layers). To identify the main contributions of sub-layers to contextualisation, we first extract the sub-layer representations of polysemous words in minimally different sentence pairs, and compare how these representations change through the forward pass of the PLM network. Second, by probing on a sense identification classification task, we try to empirically localize the strength of contextualization information encoded in these sub-layer representations. With these probing experiments, we also try to gain a better understanding of the influence of context length and context richness on the degree of contextualization. Our main conclusion is cautionary: BERT demonstrates a high degree of contextualization in the top sub-layers if the word in question is in a specific position in the sentence with a shorter context window, but this does not systematically generalize across different word positions and context sizes.||
|**2024-09-20**|[Eliciting Instruction-tuned Code Language Models' Capabilities to Utilize Auxiliary Function for Code Generation](http://arxiv.org/abs/2409.13928)|null|We study the code generation behavior of instruction-tuned models built on top of code pre-trained language models when they could access an auxiliary function to implement a function. We design several ways to provide auxiliary functions to the models by adding them to the query or providing a response prefix to incorporate the ability to utilize auxiliary functions with the instruction-following capability. Our experimental results show the effectiveness of combining the base models' auxiliary function utilization ability with the instruction following ability. In particular, the performance of adopting our approaches with the open-sourced language models surpasses that of the recent powerful proprietary language models, i.e., gpt-4o.||
|**2024-09-20**|[Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis](http://arxiv.org/abs/2409.13561)|**[link](https://github.com/jun-jie-huang/lofi)**|Logs are imperative in the maintenance of online service systems, which often encompass important information for effective failure mitigation. While existing anomaly detection methodologies facilitate the identification of anomalous logs within extensive runtime data, manual investigation of log messages by engineers remains essential to comprehend faults, which is labor-intensive and error-prone. Upon examining the log-based troubleshooting practices at CloudA, we find that engineers typically prioritize two categories of log information for diagnosis. These include fault-indicating descriptions, which record abnormal system events, and fault-indicating parameters, which specify the associated entities. Motivated by this finding, we propose an approach to automatically extract such fault-indicating information from logs for fault diagnosis, named LoFI. LoFI comprises two key stages. In the first stage, LoFI performs coarse-grained filtering to collect logs related to the faults based on semantic similarity. In the second stage, LoFI leverages a pre-trained language model with a novel prompt-based tuning method to extract fine-grained information of interest from the collected logs. We evaluate LoFI on logs collected from Apache Spark and an industrial dataset from CloudA. The experimental results demonstrate that LoFI outperforms all baseline methods by a significant margin, achieving an absolute improvement of 25.8 to 37.9 in F1 over the best baseline method, ChatGPT. This highlights the effectiveness of LoFI in recognizing fault-indicating information. Furthermore, the successful deployment of LoFI at CloudA and user studies validate the utility of our method. The code and data are available at https://github.com/Jun-jie-Huang/LoFI.||
|**2024-09-20**|[HUT: A More Computation Efficient Fine-Tuning Method With Hadamard Updated Transformation](http://arxiv.org/abs/2409.13501)|null|Fine-tuning pre-trained language models for downstream tasks has achieved impressive results in NLP. However, fine-tuning all parameters becomes impractical due to the rapidly increasing size of model parameters. To address this, Parameter Efficient Fine-Tuning (PEFT) methods update only a subset of parameters. Most PEFT methods, such as LoRA, use incremental updates, which involve adding learned weight matrix increments to the original parameters. Although effective, these methods face limitations in capturing complex parameter dynamics and do not maintain a strong correlation between the original and updated parameters. To overcome these challenges, we propose the direct Updated Transformation (UT) paradigm, which constructs a transformation directly from the original to the updated parameters. This approach ensures that the correlation between the original and updated parameters is preserved, leveraging the semantic features learned during pre-training. Building on this paradigm, we present the Hadamard Updated Transformation (HUT) method. HUT efficiently updates the original weight matrix using the Hadamard transformation with two low-rank matrices, offering a more expressive and flexible update mechanism. This allows HUT to capture richer parameter features through functional transformations, reducing computational complexity while maintaining or improving model quality. Theoretical analysis and extensive experiments on RoBERTa and GPT-2 validate the effectiveness of HUT. Results show that HUT performs on par with or better than other PEFT methods in terms of model quality, while significantly reducing computational complexity.||
|**2024-09-19**|[Exploring Large Language Models for Product Attribute Value Identification](http://arxiv.org/abs/2409.12695)|null|Product attribute value identification (PAVI) involves automatically identifying attributes and their values from product information, enabling features like product search, recommendation, and comparison. Existing methods primarily rely on fine-tuning pre-trained language models, such as BART and T5, which require extensive task-specific training data and struggle to generalize to new attributes. This paper explores large language models (LLMs), such as LLaMA and Mistral, as data-efficient and robust alternatives for PAVI. We propose various strategies: comparing one-step and two-step prompt-based approaches in zero-shot settings and utilizing parametric and non-parametric knowledge through in-context learning examples. We also introduce a dense demonstration retriever based on a pre-trained T5 model and perform instruction fine-tuning to explicitly train LLMs on task-specific instructions. Extensive experiments on two product benchmarks show that our two-step approach significantly improves performance in zero-shot settings, and instruction fine-tuning further boosts performance when using training data, demonstrating the practical benefits of using LLMs for PAVI.||
|**2024-09-16**|[Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models](http://arxiv.org/abs/2409.10695)|null|We introduce Playground v3 (PGv3), our latest text-to-image model that achieves state-of-the-art (SoTA) performance across multiple testing benchmarks, excels in graphic design abilities and introduces new capabilities. Unlike traditional text-to-image generative models that rely on pre-trained language models like T5 or CLIP text encoders, our approach fully integrates Large Language Models (LLMs) with a novel structure that leverages text conditions exclusively from a decoder-only LLM. Additionally, to enhance image captioning quality, we developed an in-house captioner, capable of generating captions with varying levels of detail, enriching the diversity of text structures. We also introduce a new benchmark CapsBench to evaluate detailed image captioning performance. Experimental results demonstrate that PGv3 excels in text prompt adherence, complex reasoning, and accurate text rendering. User preference studies indicate the super-human graphic design ability of our model for common design applications, such as stickers, posters, and logo designs. Furthermore, PGv3 introduces new capabilities, including precise RGB color control and robust multilingual understanding.||
|**2024-09-14**|[Protecting Copyright of Medical Pre-trained Language Models: Training-Free Backdoor Watermarking](http://arxiv.org/abs/2409.10570)|null|Pre-training language models followed by fine-tuning on specific tasks is standard in NLP, but traditional models often underperform when applied to the medical domain, leading to the development of specialized medical pre-trained language models (Med-PLMs). These models are valuable assets but are vulnerable to misuse and theft, requiring copyright protection. However, no existing watermarking methods are tailored for Med-PLMs, and adapting general PLMs watermarking techniques to the medical domain faces challenges such as task incompatibility, loss of fidelity, and inefficiency. To address these issues, we propose the first training-free backdoor watermarking method for Med-PLMs. Our method uses rare special symbols as trigger words, which do not impact downstream task performance, embedding watermarks by replacing their original embeddings with those of specific medical terms in the Med-PLMs' word embeddings layer. After fine-tuning the watermarked Med-PLMs on various medical downstream tasks, the final models (FMs) respond to the trigger words in the same way they would to the corresponding medical terms. This property can be utilized to extract the watermark. Experiments demonstrate that our method achieves high fidelity while effectively extracting watermarks across various medical downstream tasks. Additionally, our method demonstrates robustness against various attacks and significantly enhances the efficiency of watermark embedding, reducing the embedding time from 10 hours to 10 seconds.||
|**2024-09-14**|[Synthetic4Health: Generating Annotated Synthetic Clinical Letters](http://arxiv.org/abs/2409.09501)|**[link](https://github.com/hecta-uom/synthetic4health)**|Since clinical letters contain sensitive information, clinical-related datasets can not be widely applied in model training, medical research, and teaching. This work aims to generate reliable, various, and de-identified synthetic clinical letters. To achieve this goal, we explored different pre-trained language models (PLMs) for masking and generating text. After that, we worked on Bio\_ClinicalBERT, a high-performing model, and experimented with different masking strategies. Both qualitative and quantitative methods were used for evaluation. Additionally, a downstream task, Named Entity Recognition (NER), was also implemented to assess the usability of these synthetic letters. The results indicate that 1) encoder-only models outperform encoder-decoder models. 2) Among encoder-only models, those trained on general corpora perform comparably to those trained on clinical data when clinical information is preserved. 3) Additionally, preserving clinical entities and document structure better aligns with our objectives than simply fine-tuning the model. 4) Furthermore, different masking strategies can impact the quality of synthetic clinical letters. Masking stopwords has a positive impact, while masking nouns or verbs has a negative effect. 5) For evaluation, BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references. 6) Contextual information does not significantly impact the models' understanding, so the synthetic clinical letters have the potential to replace the original ones in downstream tasks.||
|**2024-09-12**|[Knowledge Tagging with Large Language Model based Multi-Agent System](http://arxiv.org/abs/2409.08406)|null|Knowledge tagging for questions is vital in modern intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations have been performed by pedagogical experts, as the task demands not only a deep semantic understanding of question stems and knowledge definitions but also a strong ability to link problem-solving logic with relevant knowledge concepts. With the advent of advanced natural language processing (NLP) algorithms, such as pre-trained language models and large language models (LLMs), pioneering studies have explored automating the knowledge tagging process using various machine learning models. In this paper, we investigate the use of a multi-agent system to address the limitations of previous algorithms, particularly in handling complex cases involving intricate knowledge definitions and strict numerical constraints. By demonstrating its superior performance on the publicly available math question knowledge tagging dataset, MathKnowCT, we highlight the significant potential of an LLM-based multi-agent system in overcoming the challenges that previous methods have encountered. Finally, through an in-depth discussion of the implications of automating knowledge tagging, we underscore the promising results of deploying LLM-based algorithms in educational contexts.||
|**2024-09-12**|[Fine-tuning Large Language Models for Entity Matching](http://arxiv.org/abs/2409.08185)|**[link](https://github.com/wbsg-uni-mannheim/tailormatch)**|Generative large language models (LLMs) are a promising alternative to pre-trained language models for entity matching due to their high zero-shot performance and their ability to generalize to unseen entities. Existing research on using LLMs for entity matching has focused on prompt engineering and in-context learning. This paper explores the potential of fine-tuning LLMs for entity matching. We analyze fine-tuning along two dimensions: 1) The representation of training examples, where we experiment with adding different types of LLM-generated explanations to the training set, and 2) the selection and generation of training examples using LLMs. In addition to the matching performance on the source dataset, we investigate how fine-tuning affects the model's ability to generalize to other in-domain datasets as well as across topical domains. Our experiments show that fine-tuning significantly improves the performance of the smaller models while the results for the larger models are mixed. Fine-tuning also improves the generalization to in-domain datasets while hurting cross-domain transfer. We show that adding structured explanations to the training set has a positive impact on the performance of three out of four LLMs, while the proposed example selection and generation methods only improve the performance of Llama 3.1 8B while decreasing the performance of GPT-4o Mini.||
|**2024-09-10**|[Exploring Italian sentence embeddings properties through multi-tasking](http://arxiv.org/abs/2409.06622)|**[link](https://github.com/clcl-geneva/blm-snfdisentangling)**|We investigate to what degree existing LLMs encode abstract linguistic information in Italian in a multi-task setting. We exploit curated synthetic data on a large scale -- several Blackbird Language Matrices (BLMs) problems in Italian -- and use them to study how sentence representations built using pre-trained language models encode specific syntactic and semantic information. We use a two-level architecture to model separately a compression of the sentence embeddings into a representation that contains relevant information for a task, and a BLM task. We then investigate whether we can obtain compressed sentence representations that encode syntactic and semantic information relevant to several BLM tasks. While we expected that the sentence structure -- in terms of sequence of phrases/chunks -- and chunk properties could be shared across tasks, performance and error analysis show that the clues for the different tasks are encoded in different manners in the sentence embeddings, suggesting that abstract linguistic notions such as constituents or thematic roles do not seem to be present in the pretrained sentence embeddings.||
|**2024-09-09**|[TransformerRanker: A Tool for Efficiently Finding the Best-Suited Language Models for Downstream Classification Tasks](http://arxiv.org/abs/2409.05997)|**[link](https://github.com/flairnlp/transformer-ranker)**|Classification tasks in NLP are typically addressed by selecting a pre-trained language model (PLM) from a model hub, and fine-tuning it for the task at hand. However, given the very large number of PLMs that are currently available, a practical challenge is to determine which of them will perform best for a specific downstream task. With this paper, we introduce TransformerRanker, a lightweight library that efficiently ranks PLMs for classification tasks without the need for computationally costly fine-tuning. Our library implements current approaches for transferability estimation (LogME, H-Score, kNN), in combination with layer aggregation options, which we empirically showed to yield state-of-the-art rankings of PLMs (Garbas et al., 2024). We designed the interface to be lightweight and easy to use, allowing users to directly connect to the HuggingFace Transformers and Dataset libraries. Users need only select a downstream classification task and a list of PLMs to create a ranking of likely best-suited PLMs for their task. We make TransformerRanker available as a pip-installable open-source library https://github.com/flairNLP/transformer-ranker.||
|**2024-09-08**|[Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?](http://arxiv.org/abs/2409.05197)|**[link](https://github.com/zawedcvg/are-large-language-models-attentive-readers)**|State-of-the-art Large Language Models (LLMs) are accredited with an increasing number of different capabilities, ranging from reading comprehension, over advanced mathematical and reasoning skills to possessing scientific knowledge. In this paper we focus on their multi-hop reasoning capability: the ability to identify and integrate information from multiple textual sources. Given the concerns with the presence of simplifying cues in existing multi-hop reasoning benchmarks, which allow models to circumvent the reasoning requirement, we set out to investigate, whether LLMs are prone to exploiting such simplifying cues. We find evidence that they indeed circumvent the requirement to perform multi-hop reasoning, but they do so in more subtle ways than what was reported about their fine-tuned pre-trained language model (PLM) predecessors. Motivated by this finding, we propose a challenging multi-hop reasoning benchmark, by generating seemingly plausible multi-hop reasoning chains, which ultimately lead to incorrect answers. We evaluate multiple open and proprietary state-of-the-art LLMs, and find that their performance to perform multi-hop reasoning is affected, as indicated by up to 45% relative decrease in F1 score when presented with such seemingly plausible alternatives. We conduct a deeper analysis and find evidence that while LLMs tend to ignore misleading lexical cues, misleading reasoning paths indeed present a significant challenge.||
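|**2024-08-21**|[CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction](http://arxiv.org/abs/2409.03773)|**[link](https://github.com/hanrthu/copra)**|Accurately measuring protein-RNA binding affinity is crucial in many biological processes and in drug design. Previous computational methods for protein-RNA binding affinity prediction rely on either sequence or structure features and cannot capture the binding mechanisms comprehensively. Recently emerging pre-trained language models, trained on massive unsupervised protein and RNA sequences, have shown strong representation ability on various in-domain downstream tasks, including binding site prediction. However, applying language models from different domains collaboratively to complex-level tasks remains unexplored. In this paper, we propose CoPRA, which bridges pre-trained language models from different biological domains via complex structure for protein-RNA binding affinity prediction. We demonstrate for the first time that cross-biological-modal language models can collaborate to improve binding affinity prediction. We propose a Co-Former to combine cross-modal sequence and structure information, and a bi-scope pre-training strategy to improve the Co-Former's interaction understanding. Meanwhile, we build the largest protein-RNA binding affinity dataset, PRA310, for performance evaluation. We also test our model's mutation effect prediction ability on a public dataset. CoPRA achieves state-of-the-art performance on all datasets. We provide extensive analyses and verify that CoPRA can (1) accurately predict protein-RNA binding affinity; (2) understand binding affinity changes caused by mutations; and (3) benefit from scaling data and model size.||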
|**2024-09-03**|[LUK: Empowering Log Understanding with Expert Knowledge from Large Language Models](http://arxiv.org/abs/2409.01909)|**[link](https://github.com/LeaperOvO/LUK)**|Logs play a critical role in providing essential information for system monitoring and troubleshooting. Recently, with the success of pre-trained language models (PLMs) and large language models (LLMs) in natural language processing (NLP), smaller PLMs (such as BERT) and LLMs (like ChatGPT) have become the current mainstream approaches for log analysis. While LLMs possess rich knowledge, their high computational costs and unstable performance make LLMs impractical for analyzing logs directly. In contrast, smaller PLMs can be fine-tuned for specific tasks even with limited computational resources, making them more practical. However, these smaller PLMs face challenges in understanding logs comprehensively due to their limited expert knowledge. To better utilize the knowledge embedded within LLMs for log understanding, this paper introduces a novel knowledge enhancement framework, called LUK, which acquires expert knowledge from LLMs to empower log understanding on a smaller PLM. Specifically, we design a multi-expert collaboration framework based on LLMs consisting of different roles to acquire expert knowledge. In addition, we propose two novel pre-training tasks to enhance the log pre-training with expert knowledge. LUK achieves state-of-the-art results on different log analysis tasks and extensive experiments demonstrate expert knowledge from LLMs can be utilized more effectively to understand logs.||
|**2024-09-04**|[MARS: Matching Attribute-aware Representations for Text-based Sequential Recommendation](http://arxiv.org/abs/2409.00702)|**[link](https://github.com/junieberry/mars)**|Sequential recommendation aims to predict the next item a user is likely to prefer based on their sequential interaction history. Recently, text-based sequential recommendation has emerged as a promising paradigm that uses pre-trained language models to exploit textual item features to enhance performance and facilitate knowledge transfer to unseen datasets. However, existing text-based recommender models still struggle with two key challenges: (i) representing users and items with multiple attributes, and (ii) matching items with complex user interests. To address these challenges, we propose a novel model, Matching Attribute-aware Representations for Text-based Sequential Recommendation (MARS). MARS extracts detailed user and item representations through attribute-aware text encoding, capturing diverse user intents with multiple attribute-aware representations. It then computes user-item scores via attribute-wise interaction matching, effectively capturing attribute-level user preferences. Our extensive experiments demonstrate that MARS significantly outperforms existing sequential models, achieving improvements of up to 24.43% and 29.26% in Recall@10 and NDCG@10 across five benchmark datasets. Code is available at https://github.com/junieberry/MARS||
|**2024-08-31**|[From Prediction to Application: Language Model-based Code Knowledge Tracing with Domain Adaptive Pre-Training and Automatic Feedback System with Pedagogical Prompting for Comprehensive Programming Education](http://arxiv.org/abs/2409.00323)|null|Knowledge Tracing (KT) is a critical component in online learning, but traditional approaches face limitations in interpretability and cross-domain adaptability. This paper introduces Language Model-based Code Knowledge Tracing (CodeLKT), an innovative application of Language model-based Knowledge Tracing (LKT) to programming education. CodeLKT leverages pre-trained language models to process learning data, demonstrating superior performance over existing KT and Code KT models. We explore Domain Adaptive Pre-Training (DAPT) and Task Adaptive Pre-Training (TAPT), showing enhanced performance in the coding domain and investigating cross-domain transfer between mathematics and coding. Additionally, we present a theoretically informed integrated system combining CodeLKT with large language models to generate personalized, in-depth feedback to support students' programming learning. This work advances the field of Code Knowledge Tracing by expanding the knowledge base with a language model-based approach and offering practical implications for programming education through data-informed feedback.||
|**2024-08-30**|[Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage](http://arxiv.org/abs/2408.17354)|null|Fine-tuning large language models on private data for downstream applications poses significant privacy risks in potentially exposing sensitive information. Several popular community platforms now offer convenient distribution of a large variety of pre-trained models, allowing anyone to publish without rigorous verification. This scenario creates a privacy threat, as pre-trained models can be intentionally crafted to compromise the privacy of fine-tuning datasets. In this study, we introduce a novel poisoning technique that uses model-unlearning as an attack tool. This approach manipulates a pre-trained language model to increase the leakage of private data during the fine-tuning process. Our method enhances both membership inference and data extraction attacks while preserving model utility. Experimental results across different models, datasets, and fine-tuning setups demonstrate that our attacks significantly surpass baseline performance. This work serves as a cautionary note for users who download pre-trained models from unverified sources, highlighting the potential risks involved.||
|**2024-08-24**|[Empowering Pre-Trained Language Models for Spatio-Temporal Forecasting via Decoupling Enhanced Discrete Reprogramming](http://arxiv.org/abs/2408.14505)|null|Spatio-temporal time series forecasting plays a critical role in various real-world applications, such as transportation optimization, energy management, and climate analysis. The recent advancements in Pre-trained Language Models (PLMs) have inspired efforts to reprogram these models for time series forecasting tasks, by leveraging their superior reasoning and generalization capabilities. However, existing approaches fall short in handling complex spatial inter-series dependencies and intrinsic intra-series frequency components, limiting their spatio-temporal forecasting performance. Moreover, the linear mapping of continuous time series to a compressed subset vocabulary in reprogramming constrains the spatio-temporal semantic expressivity of PLMs and may lead to a potential information bottleneck. To overcome the above limitations, we propose RePST, a tailored PLM reprogramming framework for spatio-temporal forecasting. The key insight of RePST is to decouple the spatio-temporal dynamics in the frequency domain, allowing better alignment with the PLM text space. Specifically, we first decouple spatio-temporal data in Fourier space and devise a structural diffusion operator to obtain temporal intrinsic and spatial diffusion signals, making the dynamics more comprehensible and predictable for PLMs. To avoid an information bottleneck from a limited vocabulary, we further propose a discrete reprogramming strategy that selects relevant discrete textual information from an expanded vocabulary space in a differentiable manner. Extensive experiments on four real-world datasets show that our proposed approach significantly outperforms state-of-the-art spatio-temporal forecasting models, particularly in data-scarce scenarios.||
|**2024-08-23**|[SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks](http://arxiv.org/abs/2408.13040)|null|Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM's inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneering research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experimental results show that the prompting method can achieve competitive performance compared to the strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, as more advanced speech LMs emerge, the proposed prompting framework shows great potential.||
|**2024-08-23**|[Investigating LLM Applications in E-Commerce](http://arxiv.org/abs/2408.12779)|null|The emergence of Large Language Models (LLMs) has revolutionized natural language processing in various applications, especially in e-commerce. One crucial step before applying such LLMs in these fields is to understand and compare their performance across different use cases. This paper explored the efficacy of LLMs in the e-commerce domain, focusing on instruction-tuning an open source LLM with public e-commerce datasets of varying sizes and comparing the performance with the conventional models prevalent in industrial applications. We conducted a comprehensive comparison between LLMs and traditional pre-trained language models across specific tasks intrinsic to the e-commerce domain, namely classification, generation, summarization, and named entity recognition (NER). Furthermore, we examined the effectiveness of the current niche industrial application of very large LLMs, using in-context learning, in e-commerce specific tasks. Our findings indicate that few-shot inference with very large LLMs often does not outperform fine-tuning smaller pre-trained models, underscoring the importance of task-specific model optimization. Additionally, we investigated different training methodologies such as single-task training, mixed-task training, and LoRA merging both within domain/tasks and between different tasks. Through rigorous experimentation and analysis, this paper offers valuable insights into the potential effectiveness of LLMs to advance natural language processing capabilities within the e-commerce industry.||
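|**2024-08-22**|[AutoTest: Evolutionary Code Solution Selection with Test Cases](http://arxiv.org/abs/2408.12125)|null|With the development of code generation techniques, selecting the correct code solution from multiple candidates has become a crucial task. This study proposes a novel technique called AutoTest, which combines automated test case generation with code solution execution and uses an evolutionary genetic algorithm to optimize the selection process. First, AutoTest uses large pre-trained language models such as codegen-16B, code-davinci-002, and incoder-6B to provide code solutions and their corresponding test cases. Then, by executing the code solutions and evaluating their performance on the test cases, a consensus set is formed. Fine-grained ranking is achieved through the selection, mutation, and crossover mechanisms of the evolutionary genetic algorithm, with the alpha and beta parameters adjusted. Finally, the best code solution is chosen. AutoTest demonstrates a significant performance improvement on the HumanEval benchmark. The HumanEval dataset contains 164 programming problems, and AutoTest improves the pass@1 score by approximately 10% over the baseline method.||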
|**2024-08-24**|[SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding](http://arxiv.org/abs/2408.11319)|null|In the era of large language models (LLMs), the tasks of "System I" (the fast, unconscious, and intuitive tasks, e.g., sentiment analysis, text classification, etc.) have been argued to be successfully solved. However, sarcasm, as a subtle linguistic phenomenon, often employs rhetorical devices like hyperbole and figuration to convey true sentiments and intentions, involving a higher level of abstraction than sentiment analysis. There is growing concern that the argument about LLMs' success may not be fully tenable when considering sarcasm understanding. To address this question, we select eleven SOTA LLMs and eight SOTA pre-trained language models (PLMs) and present comprehensive evaluations on six widely used benchmark datasets through different prompting approaches, i.e., zero-shot input/output (IO) prompting, few-shot IO prompting, chain of thought (CoT) prompting. Our results highlight three key findings: (1) current LLMs underperform supervised PLM-based sarcasm detection baselines across six sarcasm benchmarks. This suggests that significant efforts are still required to improve LLMs' understanding of human sarcasm. (2) GPT-4 consistently and significantly outperforms other LLMs across various prompting methods, with an average improvement of 14.0%. Claude 3 and ChatGPT demonstrate the next best performance after GPT-4. (3) Few-shot IO prompting method outperforms the other two methods: zero-shot IO and few-shot CoT. The reason is that sarcasm detection, being a holistic, intuitive, and non-rational cognitive process, is argued not to adhere to step-by-step logical reasoning, making CoT less effective in understanding sarcasm compared to its effectiveness in mathematical reasoning tasks.||
|**2024-08-20**|[Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution](http://arxiv.org/abs/2408.10548)|**[link](https://github.com/lanxiang1017/language-modeling-on-tabular-data-survey)**|Tabular data, a prevalent data type across various domains, presents unique challenges due to its heterogeneous nature and complex structural relationships. Achieving high predictive performance and robustness in tabular data analysis holds significant promise for numerous applications. Influenced by recent advancements in natural language processing, particularly transformer architectures, new methods for tabular data modeling have emerged. Early techniques concentrated on pre-training transformers from scratch, often encountering scalability issues. Subsequently, methods leveraging pre-trained language models like BERT have been developed, which require less data and yield enhanced performance. The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning. Despite the growing interest, a comprehensive survey of language modeling techniques for tabular data remains absent. This paper fills this gap by providing a systematic review of the development of language modeling for tabular data, encompassing: (1) a categorization of different tabular data structures and data types; (2) a review of key datasets used in model training and tasks used for evaluation; (3) a summary of modeling techniques including widely-adopted data processing methods, popular architectures, and training objectives; (4) the evolution from adapting traditional Pre-training/Pre-trained language models to the utilization of large language models; (5) an identification of persistent challenges and potential future research directions in language modeling for tabular data analysis. The GitHub page associated with this survey is available at: https://github.com/lanxiang1017/Language-Modeling-on-Tabular-Data-Survey.git.||


## Transformer

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
|**2025-04-08**|[Rethinking the Nested U-Net Approach: Enhancing Biomarker Segmentation with Attention Mechanisms and Multiscale Feature Fusion](http://arxiv.org/abs/2504.06158)|**[link](https://github.com/saadwazir/ReN-UNet)**|Identifying biomarkers in medical images is vital for a wide range of biotech applications. However, recent Transformer and CNN based methods often struggle with variations in morphology and staining, which limits their feature extraction capabilities. In medical image segmentation, where data samples are often limited, state-of-the-art (SOTA) methods improve accuracy by using pre-trained encoders, while end-to-end approaches typically fall short due to difficulties in transferring multiscale features effectively between encoders and decoders. To handle these challenges, we introduce a nested UNet architecture that captures both local and global context through Multiscale Feature Fusion and Attention Mechanisms. This design improves feature integration from encoders, highlights key channels and regions, and restores spatial details to enhance segmentation performance. Our method surpasses SOTA approaches, as evidenced by experiments across four datasets and detailed ablation studies. Code: https://github.com/saadwazir/ReN-UNet|
|**2025-04-08**|[Transferable Mask Transformer: Cross-domain Semantic Segmentation with Region-adaptive Transferability Estimation](http://arxiv.org/abs/2504.05774)|null|Recent advances in Vision Transformers (ViTs) have set new benchmarks in semantic segmentation. However, when adapting pretrained ViTs to new target domains, significant performance degradation often occurs due to distribution shifts, resulting in suboptimal global attention. Since self-attention mechanisms are inherently data-driven, they may fail to effectively attend to key objects when source and target domains exhibit differences in texture, scale, or object co-occurrence patterns. While global and patch-level domain adaptation methods provide partial solutions, region-level adaptation with dynamically shaped regions is crucial due to spatial heterogeneity in transferability across different image areas. We present Transferable Mask Transformer (TMT), a novel region-level adaptation framework for semantic segmentation that aligns cross-domain representations through spatial transferability analysis. TMT consists of two key components: (1) An Adaptive Cluster-based Transferability Estimator (ACTE) that dynamically segments images into structurally and semantically coherent regions for localized transferability assessment, and (2) A Transferable Masked Attention (TMA) module that integrates region-specific transferability maps into ViTs' attention mechanisms, prioritizing adaptation in regions with low transferability and high semantic uncertainty. Comprehensive evaluations across 20 cross-domain pairs demonstrate TMT's superiority, achieving an average 2% MIoU improvement over vanilla fine-tuning and a 1.28% increase compared to state-of-the-art baselines. The source code will be publicly available.|
|**2025-04-07**|[TRATSS: Transformer-Based Task Scheduling System for Autonomous Vehicles](http://arxiv.org/abs/2504.05407)|null|Efficient scheduling remains a critical challenge in various domains, requiring solutions to complex NP-hard optimization problems to achieve optimal resource allocation and maximize productivity. In this paper, we introduce a framework called Transformer-Based Task Scheduling System (TRATSS), designed to address the intricacies of single agent scheduling in graph-based environments. By integrating the latest advancements in reinforcement learning and transformer architecture, TRATSS provides a novel system that outputs optimized task scheduling decisions while dynamically adapting to evolving task requirements and resource availability. Leveraging the self-attention mechanism in transformers, TRATSS effectively captures complex task dependencies, thereby providing solutions with enhanced resource utilization and task completion efficiency. Experimental evaluations on benchmark datasets demonstrate TRATSS's effectiveness in providing high-quality solutions to scheduling problems that involve multiple action profiles.|
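|**2025-04-07**|[RCCFormer: A Robust Crowd Counting Network Based on Transformer](http://arxiv.org/abs/2504.04935)|null|Crowd counting, a key computer vision task, has become a fundamental technology in crowd analysis and public safety management. However, challenges such as scale variations and complex backgrounds significantly affect counting accuracy. To mitigate these issues, this paper proposes a robust Transformer-based crowd counting network, termed RCCFormer, specifically designed for background suppression and scale awareness. The method incorporates a Multi-level Feature Fusion Module (MFFM), which meticulously integrates features extracted at different stages of the backbone architecture. It establishes a strong baseline capable of capturing intricate and comprehensive feature representations, surpassing traditional baselines. Furthermore, the introduced Detail-Embedded Attention Block (DEAB) captures contextual information and local details through global self-attention and local attention, combined in a learnable manner. This enhances the model's ability to focus on foreground regions while effectively mitigating background noise interference. Additionally, we develop an Adaptive Scale-Aware Module (ASAM), with our novel Input-dependent Deformable Convolution (IDConv) as its fundamental building block. This module dynamically adapts to changes in head target shape and scale, significantly improving the network's ability to accommodate large scale variations. The effectiveness of the proposed method is validated on the ShanghaiTech Part_A and Part_B, NWPU-Crowd, and QNRF datasets. The results demonstrate that RCCFormer achieves excellent performance across all four datasets, showcasing state-of-the-art results.|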
|**2025-04-07**|[Content-Aware Transformer for All-in-one Image Restoration](http://arxiv.org/abs/2504.04869)|null|Image restoration has witnessed significant advancements with the development of deep learning models. Although Transformer architectures have progressed considerably in recent years, challenges remain, particularly the limited receptive field in window-based self-attention. In this work, we propose DSwinIR, a Deformable Sliding window Transformer for Image Restoration. DSwinIR introduces a novel deformable sliding window self-attention that adaptively adjusts receptive fields based on image content, enabling the attention mechanism to focus on important regions and enhance feature extraction aligned with salient features. Additionally, we introduce a central ensemble pattern to reduce the inclusion of irrelevant content within attention windows. In this way, the proposed DSwinIR model integrates the deformable sliding window Transformer and central ensemble pattern to amplify the strengths of both CNNs and Transformers while mitigating their limitations. Extensive experiments on various image restoration tasks demonstrate that DSwinIR achieves state-of-the-art performance. For example, in image deraining, compared to DRSformer on the SPA dataset, DSwinIR achieves a 0.66 dB PSNR improvement. In all-in-one image restoration, compared to PromptIR, DSwinIR achieves over a 0.66 dB and 1.04 dB improvement on three-task and five-task settings, respectively. Pretrained models and code are available at our project https://github.com/Aitical/DSwinIR.|
|**2025-04-07**|[Disentangling Instruction Influence in Diffusion Transformers for Parallel Multi-Instruction-Guided Image Editing](http://arxiv.org/abs/2504.04784)|null|Instruction-guided image editing enables users to specify modifications using natural language, offering more flexibility and control. Among existing frameworks, Diffusion Transformers (DiTs) outperform U-Net-based diffusion models in scalability and performance. However, while real-world scenarios often require concurrent execution of multiple instructions, step-by-step editing suffers from accumulated errors and degraded quality, and integrating multiple instructions with a single prompt usually results in incomplete edits due to instruction conflicts. We propose Instruction Influence Disentanglement (IID), a novel framework enabling parallel execution of multiple instructions in a single denoising process, designed for DiT-based models. By analyzing self-attention mechanisms in DiTs, we identify distinctive attention patterns in multi-instruction settings and derive instruction-specific attention masks to disentangle each instruction's influence. These masks guide the editing process to ensure localized modifications while preserving consistency in non-edited regions. Extensive experiments on open-source and custom datasets demonstrate that IID reduces diffusion steps while improving fidelity and instruction completion compared to existing baselines. The codes will be publicly released upon the acceptance of the paper.|
|**2025-04-07**|[Exploring Kernel Transformations for Implicit Neural Representations](http://arxiv.org/abs/2504.04728)|null|Implicit neural representations (INRs), which leverage neural networks to represent signals by mapping coordinates to their corresponding attributes, have garnered significant attention. They are extensively utilized for image representation, with pixel coordinates as input and pixel values as output. In contrast to prior works focusing on investigating the effect of the model's inside components (activation function, for instance), this work pioneers the exploration of the effect of kernel transformation of input/output while keeping the model itself unchanged. A byproduct of our findings is a simple yet effective method that combines scale and shift to significantly boost INR with negligible computation overhead. Moreover, we present two perspectives, depth and normalization, to interpret the performance benefits caused by scale and shift transformation. Overall, our work provides a new avenue for future works to understand and improve INR through the lens of kernel transformation.|
|**2025-04-06**|[Hallucination Detection using Multi-View Attention Features](http://arxiv.org/abs/2504.04335)|null|This study tackles token-level hallucination detection in outputs of large language models. Previous studies revealed that attention exhibits irregular patterns when hallucination occurs. Inspired by this, we extract features from the attention matrix that provide complementary views of (a) the average attention each token receives, which helps identify whether certain tokens are overly influential or ignored, (b) the diversity of attention each token receives, which reveals whether attention is biased toward specific subsets, and (c) the diversity of tokens a token attends to during generation, which indicates whether the model references a narrow or broad range of information. These features are input to a Transformer-based classifier to conduct token-level classification to identify hallucinated spans. Experimental results indicate that the proposed method outperforms strong baselines on hallucination detection with longer input contexts, i.e., data-to-text and summarization tasks.|
|**2025-04-05**|[ADA-Net: Attention-Guided Domain Adaptation Network with Contrastive Learning for Standing Dead Tree Segmentation Using Aerial Imagery](http://arxiv.org/abs/2504.04271)|null|Information on standing dead trees is important for understanding forest ecosystem functioning and resilience but has been lacking over large geographic regions. Climate change has caused large-scale tree mortality events that can remain undetected due to limited data. In this study, we propose a novel method for segmenting standing dead trees using aerial multispectral orthoimages. Because access to annotated datasets has been a significant problem in forest remote sensing due to the need for forest expertise, we introduce a method for domain transfer by leveraging domain adaptation to learn a transformation from a source domain X to target domain Y. In this Image-to-Image translation task, we aim to utilize available annotations in the target domain by pre-training a segmentation network. When images from a new study site without annotations are introduced (source domain X), these images are transformed into the target domain. Then, transfer learning is applied by inferring the pre-trained network on domain-adapted images. In addition to investigating the feasibility of current domain adaptation approaches for this objective, we propose a novel approach called the Attention-guided Domain Adaptation Network (ADA-Net) with enhanced contrastive learning. Accordingly, the ADA-Net approach provides new state-of-the-art domain adaptation performance levels outperforming existing approaches. We have evaluated the proposed approach using two datasets from Finland and the US. The USA images are converted to the Finland domain, and we show that the synthetic USA2Finland dataset exhibits similar characteristics to the Finland domain images. The software implementation is shared at https://github.com/meteahishali/ADA-Net. The data is publicly available at https://www.kaggle.com/datasets/meteahishali/aerial-imagery-for-standing-dead-tree-segmentation.|
|**2025-04-05**|[Resilience of Vision Transformers for Domain Generalisation in the Presence of Out-of-Distribution Noisy Images](http://arxiv.org/abs/2504.04225)|null|Modern AI models excel in controlled settings but often fail in real-world scenarios where data distributions shift unpredictably - a challenge known as domain generalisation (DG). This paper tackles this limitation by rigorously evaluating vision transformers, specifically the BEIT architecture, which is a model pre-trained with masked image modelling (MIM), against synthetic out-of-distribution (OOD) benchmarks designed to mimic real-world noise and occlusions. We introduce a novel framework to generate OOD test cases by strategically masking object regions in images using grid patterns (25\%, 50\%, 75\% occlusion) and leveraging cutting-edge zero-shot segmentation via Segment Anything and Grounding DINO to ensure precise object localisation. Experiments across three benchmarks (PACS, Office-Home, DomainNet) demonstrate BEIT's known robustness while maintaining 94\% accuracy on PACS and 87\% on Office-Home, despite significant occlusions, outperforming CNNs and other vision transformers by margins of up to 37\%. Analysis of self-attention distances reveals that BEIT's dependence on global features correlates with its resilience. Furthermore, our synthetic benchmarks expose critical failure modes: performance degrades sharply when occlusions disrupt object shapes, e.g., a 68\% drop for external grid masking vs. 22\% for internal masking. This work provides two key advances: (1) a scalable method to generate OOD benchmarks using controllable noise, and (2) empirical evidence that MIM and the self-attention mechanism in vision transformers enhance DG by learning invariant features. These insights bridge the gap between lab-trained models and real-world deployment and offer a blueprint for building AI systems that generalise reliably under uncertainty.|
|**2025-04-04**|[Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](http://arxiv.org/abs/2504.03624)|null|As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly becoming important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3 $\times$ faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression via pruning and distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it can achieve on par results with BF16-based training. This recipe is used to train the 56B model. All Nemotron-H models will be released, with support in Hugging Face, NeMo, and Megatron-LM.|
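|**2025-04-04**|[AdaViT: Adaptive Vision Transformer for Flexible Pretrain and Finetune with Variable 3D Medical Image Modalities](http://arxiv.org/abs/2504.03589)|null|Pretraining techniques, whether supervised or self-supervised, are widely used in deep learning to enhance model performance. In real-world clinical scenarios, different sets of magnetic resonance (MR) contrasts are often acquired for different subjects/cases, creating challenges for deep learning models that assume consistent input modalities across all cases and between pretraining and finetuning. Existing methods struggle to maintain performance when the input modality/contrast set does not match that of the pretrained model, often resulting in degraded accuracy. We propose an adaptive Vision Transformer (AdaViT) framework capable of handling a variable set of input modalities for each case. We use a dynamic tokenizer to encode different input image modalities into tokens and leverage the characteristics of the Transformer to build attention across variable-length tokens. Through extensive experiments, we demonstrate that this architecture effectively transfers supervised pretrained models to new datasets with different input modality/contrast sets, yielding superior performance on zero-shot testing, few-shot finetuning, and backward transferring in brain infarct and brain tumor segmentation tasks. Additionally, for self-supervised pretraining, the proposed method is able to maximize the use of pretraining data and facilitate transfer to diverse downstream tasks with variable sets of input modalities.|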
|**2025-04-04**|[JanusDDG: A Thermodynamics-Compliant Model for Sequence-Based Protein Stability via Two-Fronts Multi-Head Attention](http://arxiv.org/abs/2504.03278)|null|Understanding how residue variations affect protein stability is crucial for designing functional proteins and deciphering the molecular mechanisms underlying disease-related mutations. Recent advances in protein language models (PLMs) have revolutionized computational protein analysis, enabling, among other things, more accurate predictions of mutational effects. In this work, we introduce JanusDDG, a deep learning framework that leverages PLM-derived embeddings and a bidirectional cross-attention transformer architecture to predict $\Delta \Delta G$ of single and multiple-residue mutations while simultaneously being constrained to respect fundamental thermodynamic properties, such as antisymmetry and transitivity. Unlike conventional self-attention, JanusDDG computes queries (Q) and values (V) as the difference between wild-type and mutant embeddings, while keys (K) alternate between the two. This cross-interleaved attention mechanism enables the model to capture mutation-induced perturbations while preserving essential contextual information. Experimental results show that JanusDDG achieves state-of-the-art performance in predicting $\Delta \Delta G$ from sequence alone, matching or exceeding the accuracy of structure-based methods for both single and multiple mutations.|
|**2025-04-04**|[A Survey of Quantum Transformers: Approaches, Advantages, Challenges, and Future Directions](http://arxiv.org/abs/2504.03192)|null|Quantum Transformer models represent a significant research direction in quantum machine learning (QML), leveraging the parallelism and entanglement properties of quantum computing to overcome the computational complexity and expressive limitations of classical Transformers. Parameterized quantum circuit (PQC)-based Transformer models are the primary focus of current research, employing PQCs to achieve varying degrees of quantumization, including strategies such as QKV-only Quantum mapping, Quantum Pairwise Attention, Quantum Global Attention, and Quantum-Assisted Acceleration. These approaches are well-suited to Noisy Intermediate-Scale Quantum (NISQ) devices, demonstrating potential in small-scale tasks to reduce complexity or enhance performance. The strength of PQC-based methods lies in their compatibility with existing quantum hardware, positioning them as the main pathway toward the practical implementation of quantum Transformers. However, these methods face challenges such as limited scalability, the absence of standardized testing benchmarks, and the "barren plateau" problem during training. As a complementary approach, Quantum Linear Algebra (QLA)-based Transformer models rely on future fault-tolerant quantum computing, utilizing techniques like block-encoding and Quantum Singular Value Transformation (QSVT) to achieve efficient matrix operations and theoretically significant complexity reductions, though they remain in the theoretical exploration stage. Future research should prioritize optimizing PQC-based hybrid architectures and quantum global attention models, establishing unified evaluation frameworks, and addressing training difficulties, while also exploring hybrid PQC-QLA approaches to advance the development of quantum Transformers.|
|**2025-04-04**|[Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion](http://arxiv.org/abs/2504.03135)|null|Medical Visual Question Answering (Med-VQA) answers clinical questions using medical images, aiding diagnosis. Designing the Med-VQA system holds profound importance in assisting clinical diagnosis and enhancing diagnostic accuracy. Building upon this foundation, Hierarchical Med-VQA extends Med-VQA by organizing medical questions into a hierarchical structure and making level-specific predictions to handle fine-grained distinctions. Recently, many studies have proposed hierarchical Med-VQA tasks and established datasets. However, several issues remain: (1) imperfect hierarchical modeling leads to poor differentiation between question levels, causing semantic fragmentation across hierarchies; (2) Transformer-based cross-modal self-attention fusion methods rely excessively on implicit learning, which obscures crucial local semantic correlations in medical scenarios. To address these issues, this study proposes a HiCA-VQA method, including two modules: Hierarchical Prompting for fine-grained medical questions and Hierarchical Answer Decoders. The hierarchical prompting module pre-aligns hierarchical text prompts with image features to guide the model in focusing on specific image regions according to question types, while the hierarchical decoder performs separate predictions for questions at different levels to improve accuracy across granularities. The framework also incorporates a cross-attention fusion module where images serve as queries and text as key-value pairs. Experiments on the Rad-Restruct benchmark demonstrate that the HiCA-VQA framework outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions. This study provides an effective pathway for hierarchical visual question answering systems, advancing medical image understanding.|
|**2025-04-03**|[Attention-Aware Multi-View Pedestrian Tracking](http://arxiv.org/abs/2504.03047)|null|In spite of the recent advancements in multi-object tracking, occlusion poses a significant challenge. Multi-camera setups have been used to address this challenge by providing a comprehensive coverage of the scene. Recent multi-view pedestrian detection models have highlighted the potential of an early-fusion strategy, projecting feature maps of all views to a common ground plane or the Bird's Eye View (BEV), and then performing detection. This strategy has been shown to improve both detection and tracking performance. However, the perspective transformation results in significant distortion on the ground plane, affecting the robustness of the appearance features of the pedestrians. To tackle this limitation, we propose a novel model that incorporates attention mechanisms in a multi-view pedestrian tracking scenario. Our model utilizes an early-fusion strategy for detection, and a cross-attention mechanism to establish robust associations between pedestrians in different frames, while efficiently propagating pedestrian features across frames, resulting in a more robust feature representation for each pedestrian. Extensive experiments demonstrate that our model outperforms state-of-the-art models, with an IDF1 score of $96.1\%$ on Wildtrack dataset, and $85.7\%$ on MultiviewX dataset.|
|**2025-04-03**|[Graph Attention for Heterogeneous Graphs with Positional Encoding](http://arxiv.org/abs/2504.02938)|null|Graph Neural Networks (GNNs) have emerged as the de facto standard for modeling graph data, with attention mechanisms and transformers significantly enhancing their performance on graph-based tasks. Despite these advancements, the performance of GNNs on heterogeneous graphs often remains complex, with networks generally underperforming compared to their homogeneous counterparts. This work benchmarks various GNN architectures to identify the most effective methods for heterogeneous graphs, with a particular focus on node classification and link prediction. Our findings reveal that graph attention networks excel in these tasks. As a main contribution, we explore enhancements to these attention networks by integrating positional encodings for node embeddings. This involves utilizing the full Laplacian spectrum to accurately capture both the relative and absolute positions of each node within the graph, further enhancing performance on downstream tasks such as node classification and link prediction.|
|**2025-04-03**|[On Vanishing Variance in Transformer Length Generalization](http://arxiv.org/abs/2504.02827)|null|It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are real reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that even for today's frontier models, a longer sequence length results in a decrease in variance in the output of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.|
|**2025-04-03**|[HQViT: Hybrid Quantum Vision Transformer for Image Classification](http://arxiv.org/abs/2504.02730)|null|Transformer-based architectures have revolutionized the landscape of deep learning. In the computer vision domain, Vision Transformer demonstrates remarkable performance on par with or even surpassing that of convolutional neural networks. However, the quadratic computational complexity of its self-attention mechanism poses challenges for classical computing, making model training with high-dimensional input data, e.g., images, particularly expensive. To address such limitations, we propose a Hybrid Quantum Vision Transformer (HQViT) that leverages the principles of quantum computing to accelerate model training while enhancing model performance. HQViT introduces whole-image processing with amplitude encoding to better preserve global image information without additional positional encoding. By leveraging quantum computation on the most critical steps and selectively handling other components in a classical way, we lower the cost of quantum resources for HQViT. The qubit requirement is minimized to $O(\log_2 N)$ and the number of parameterized quantum gates is only $O(\log_2 d)$, making it well-suited for Noisy Intermediate-Scale Quantum devices. By offloading the computationally intensive attention coefficient matrix calculation to the quantum framework, HQViT reduces the classical computational load by $O(T^2d)$. Extensive experiments across various computer vision datasets demonstrate that HQViT outperforms existing models, achieving a maximum improvement of up to $10.9\%$ (on the MNIST 10-classification task) over the state of the art. This work highlights the great potential to combine quantum and classical computing to cope with complex image classification tasks.|
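|**2025-04-03**|[A Hybrid Similarity-Aware Graph Neural Network with Transformer for Node Classification](http://arxiv.org/abs/2504.02615)|null|Node classification has become increasingly important in graph deep learning, with real-world applications such as recommendation systems, drug discovery, and citation networks. Graph Convolutional Networks and Graph Transformers have achieved strong performance on node classification tasks. However, the main problem with Graph Convolutional Networks is over-squashing, which limits their ability to capture long-range dependencies in the network. In addition, Graph Transformers face scalability challenges and struggle to process large graphs efficiently. To address this, we propose a new framework, a Hybrid Similarity-Aware Graph Neural Network with Transformer for Node Classification (SIGNNet), which exploits local and global structural information to enhance the model's ability to capture both fine-grained relationships and broader contextual patterns in the graph structure. The proposed method uses Graph Convolutional Networks together with a score-based mechanism to effectively capture local and global node interactions while addressing the limitations of over-squashing. Our method employs a new Personalized PageRank-based node sampling method that addresses scalability by generating subgraphs of nodes. Furthermore, SIGNNet incorporates a new attention mechanism, Structure-Aware Multi-Head Attention (SA-MHA), which integrates node structural information for informed attention weighting, enabling the model to prioritize nodes according to their topological importance. Extensive experiments show that the method achieves significant improvements over existing state-of-the-art methods, with average accuracy gains of 6.03%, 5.47%, 4.78%, 19.10%, 19.61%, 7.22%, 19.54%, and 14.94% on the Cora, Citeseer, CS, Wisconsin, Texas, Actor, Cornell, and Chameleon datasets, respectively.|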
|**2025-04-03**|[HGFormer: Topology-Aware Vision Transformer with HyperGraph Learning](http://arxiv.org/abs/2504.02440)|null|The computer vision community has witnessed an extensive exploration of vision transformers in the past two years. Drawing inspiration from traditional schemes, numerous works focus on introducing vision-specific inductive biases. However, the implicit modeling of permutation invariance and fully-connected interaction with individual tokens disrupts the regional context and spatial topology, further hindering higher-order modeling. This deviates from the principle of perceptual organization that emphasizes the local groups and overall topology of visual elements. Thus, we introduce the concept of hypergraph for perceptual exploration. Specifically, we propose a topology-aware vision transformer called HyperGraph Transformer (HGFormer). Firstly, we present a Center Sampling K-Nearest Neighbors (CS-KNN) algorithm for semantic guidance during hypergraph construction. Secondly, we present a topology-aware HyperGraph Attention (HGA) mechanism that integrates hypergraph topology as perceptual indications to guide the aggregation of global and unbiased information during hypergraph messaging. Using HGFormer as visual backbone, we develop an effective and unitive representation, achieving distinct and detailed scene depictions. Empirical experiments show that the proposed HGFormer achieves competitive performance compared to the recent SoTA counterparts on various visual benchmarks. Extensive ablation and visualization studies provide comprehensive explanations of our ideas and contributions.||
|**2025-04-03**|[Beyond Conventional Transformers: The Medical X-ray Attention (MXA) Block for Improved Multi-Label Diagnosis Using Knowledge Distillation](http://arxiv.org/abs/2504.02277)|null|Medical imaging, particularly X-ray analysis, often involves detecting multiple conditions simultaneously within a single scan, making multi-label classification crucial for real-world clinical applications. We present the Medical X-ray Attention (MXA) block, a novel attention mechanism tailored specifically to address the unique challenges of X-ray abnormality detection. The MXA block enhances traditional Multi-Head Self Attention (MHSA) by integrating a specialized module that efficiently captures both detailed local information and broader global context. To the best of our knowledge, this is the first work to propose a task-specific attention mechanism for diagnosing chest X-rays, as well as to attempt multi-label classification using an Efficient Vision Transformer (EfficientViT). By embedding the MXA block within the EfficientViT architecture and employing knowledge distillation, our proposed model significantly improves performance on the CheXpert dataset, a widely used benchmark for multi-label chest X-ray abnormality detection. Our approach achieves an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 compared to our baseline model's AUC of 0.66, corresponding to a substantial approximate 233% relative improvement over random guessing (AUC = 0.5).||
|**2025-04-03**|[FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention](http://arxiv.org/abs/2504.02211)|null|Transformer models leverage self-attention mechanisms to capture complex dependencies, demonstrating exceptional performance in various applications. However, the long-duration high-load computations required for model inference impose stringent reliability demands on the computing platform, as soft errors that occur during execution can significantly degrade model performance. Existing fault tolerance methods protect each operation separately using decoupled kernels, incurring substantial computational and memory overhead. In this paper, we propose a novel error-resilient framework for Transformer models, integrating end-to-end fault tolerant attention (EFTA) to improve inference reliability against soft errors. Our approach enables error detection and correction within a fully fused attention kernel, reducing redundant data access and thereby mitigating memory faults. To further enhance error coverage and reduce overhead, we design a hybrid fault tolerance scheme tailored for the EFTA, introducing for the first time: 1) architecture-aware algorithm-based fault tolerance (ABFT) using tensor checksum, which minimizes inter-thread communication overhead on tensor cores during error detection; 2) selective neuron value restriction, which selectively applies adaptive fault tolerance constraints to neuron values, balancing error coverage and overhead; 3) unified verification, reusing checksums to streamline multiple computation steps into a single verification process. Experimental results show that EFTA achieves up to 7.56x speedup over traditional methods with an average fault tolerance overhead of 13.9%.||
|**2025-04-02**|[Attention Mamba: Time Series Modeling with Adaptive Pooling Acceleration and Receptive Field Enhancements](http://arxiv.org/abs/2504.02013)|null|"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible." Time series modeling serves as the cornerstone of real-world applications, such as weather forecasting and transportation management. Recently, Mamba has become a promising model that combines near-linear computational complexity with high prediction accuracy in time series modeling, while facing challenges such as insufficient modeling of nonlinear dependencies in attention and restricted receptive fields caused by convolutions. To overcome these limitations, this paper introduces an innovative framework, Attention Mamba, featuring a novel Adaptive Pooling block that accelerates attention computation and incorporates global information, effectively overcoming the constraints of limited receptive fields. Furthermore, Attention Mamba integrates a bidirectional Mamba block, efficiently capturing long-short features and transforming inputs into the Value representations for attention mechanisms. Extensive experiments conducted on diverse datasets underscore the effectiveness of Attention Mamba in extracting nonlinear dependencies and enhancing receptive fields, establishing superior performance among leading counterparts. Our codes will be available on GitHub.||
|**2025-04-02**|[Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies](http://arxiv.org/abs/2504.01470)|**[link](https://github.com/skrantidatta/lipinc-)**|Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are one of the most challenging deepfakes to detect. In these videos, a person's lip movements are synthesized to match altered or entirely new audio using AI models. Therefore, unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are confined to the mouth region, making them more subtle and, thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that leverages a combination of vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model can successfully capture both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, which was generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2 .||
|**2025-04-02**|[Prompt-Guided Attention Head Selection for Focus-Oriented Image Retrieval](http://arxiv.org/abs/2504.01348)|null|The goal of this paper is to enhance pretrained Vision Transformer (ViT) models for focus-oriented image retrieval with visual prompting. In real-world image retrieval scenarios, both query and database images often exhibit complexity, with multiple objects and intricate backgrounds. Users often want to retrieve images with specific object, which we define as the Focus-Oriented Image Retrieval (FOIR) task. While a standard image encoder can be employed to extract image features for similarity matching, it may not perform optimally in the multi-object-based FOIR task. This is because each image is represented by a single global feature vector. To overcome this, a prompt-based image retrieval solution is required. We propose an approach called Prompt-guided attention Head Selection (PHS) to leverage the head-wise potential of the multi-head attention mechanism in ViT in a promptable manner. PHS selects specific attention heads by matching their attention maps with user's visual prompts, such as a point, box, or segmentation. This empowers the model to focus on specific object of interest while preserving the surrounding visual context. Notably, PHS does not necessitate model re-training and avoids any image alteration. Experimental results show that PHS substantially improves performance on multiple datasets, offering a practical and training-free solution to enhance model performance in the FOIR task.||
|**2025-04-01**|[Multi-Token Attention](http://arxiv.org/abs/2504.00927)|null|Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and key token vector. This "single token attention" bottlenecks the amount of information used in distinguishing a relevant part from the rest of the context. To address this issue, we propose a new attention method, Multi-Token Attention (MTA), which allows LLMs to condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other's attention weights for more precise attention. As a result, our method can locate relevant context using richer, more nuanced information that can exceed a single vector's capacity. Through extensive evaluations, we demonstrate that MTA achieves enhanced performance on a range of popular benchmarks. Notably, it outperforms Transformer baseline models on standard language modeling tasks, and on tasks that require searching for information within long contexts, where our method's ability to leverage richer information proves particularly beneficial.||
|**2025-03-31**|[DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description](http://arxiv.org/abs/2503.24096)|null|Audio Description is a narrated commentary designed to aid vision-impaired audiences in perceiving key visual elements in a video. While short-form video understanding has advanced rapidly, a solution for maintaining coherent long-term visual storytelling remains unresolved. Existing methods rely solely on frame-level embeddings, effectively describing object-based content but lacking contextual information across scenes. We introduce DANTE-AD, an enhanced video description model leveraging a dual-vision Transformer-based architecture to address this gap. DANTE-AD sequentially fuses both frame and scene level embeddings to improve long-term contextual understanding. We propose a novel, state-of-the-art method for sequential cross-attention to achieve contextual grounding for fine-grained audio description generation. Evaluated on a broad range of key scenes from well-known movie clips, DANTE-AD outperforms existing methods across traditional NLP metrics and LLM-based evaluations.||
|**2025-03-31**|[From Colors to Classes: Emergence of Concepts in Vision Transformers](http://arxiv.org/abs/2503.24071)|**[link](https://github.com/teresa-sc/concepts_in_vits)**|Vision Transformers (ViTs) are increasingly utilized in various computer vision tasks due to their powerful representation capabilities. However, it remains understudied how ViTs process information layer by layer. Numerous studies have shown that convolutional neural networks (CNNs) extract features of increasing complexity throughout their layers, which is crucial for tasks like domain adaptation and transfer learning. ViTs, lacking the same inductive biases as CNNs, can potentially learn global dependencies from the first layers due to their attention mechanisms. Given the increasing importance of ViTs in computer vision, there is a need to improve the layer-wise understanding of ViTs. In this work, we present a novel, layer-wise analysis of concepts encoded in state-of-the-art ViTs using neuron labeling. Our findings reveal that ViTs encode concepts with increasing complexity throughout the network. Early layers primarily encode basic features such as colors and textures, while later layers represent more specific classes, including objects and animals. As the complexity of encoded concepts increases, the number of concepts represented in each layer also rises, reflecting a more diverse and specific set of features. Additionally, different pretraining strategies influence the quantity and category of encoded concepts, with finetuning to specific downstream tasks generally reducing the number of encoded concepts and shifting the concepts to more relevant categories.||
|**2025-03-31**|[TransMamba: Flexibly Switching between Transformer and Mamba](http://arxiv.org/abs/2503.24067)|null|Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx), and thus could dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design the Memory converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at TransPoints where the transformation happens. The TransPoint scheduling is also thoroughly explored for further improvements. We conducted extensive experiments demonstrating that TransMamba achieves superior training efficiency and performance compared to baselines, and validated the deeper consistency between Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.||
|**2025-03-31**|[CITRAS: Covariate-Informed Transformer for Time Series Forecasting](http://arxiv.org/abs/2503.24007)|null|Covariates play an indispensable role in practical time series forecasting, offering rich context from the past and sometimes extending into the future. However, their availability varies depending on the scenario, and situations often involve multiple target variables simultaneously. Moreover, the cross-variate dependencies between them are multi-granular, with some covariates having a short-term impact on target variables and others showing long-term correlations. This heterogeneity and the intricate dependencies arising in covariate-informed forecasting present significant challenges to existing deep models. To address these issues, we propose CITRAS, a patch-based Transformer that flexibly leverages multiple targets and covariates covering both the past and the future forecasting horizon. While preserving the strong autoregressive capabilities of the canonical Transformer, CITRAS introduces two novel mechanisms in patch-wise cross-variate attention: Key-Value (KV) Shift and Attention Score Smoothing. KV Shift seamlessly incorporates future known covariates into the forecasting of target variables based on their concurrent dependencies. Additionally, Attention Score Smoothing transforms locally accurate patch-wise cross-variate dependencies into global variate-level dependencies by smoothing the past series of attention scores. Experimentally, CITRAS achieves state-of-the-art performance in both covariate-informed and multivariate forecasting, demonstrating its versatile ability to leverage cross-variate dependency for improved forecasting accuracy.||
|**2025-03-31**|[PupiNet: Seamless OCT-OCTA Interconversion Through Wavelet-Driven and Multi-Scale Attention Mechanisms](http://arxiv.org/abs/2503.23933)|null|Optical Coherence Tomography (OCT) and Optical Coherence Tomography Angiography (OCTA) are key diagnostic tools for clinical evaluation and management of retinal diseases. Compared to traditional OCT, OCTA provides richer microvascular information, but its acquisition requires specialized sensors and high-cost equipment, creating significant challenges for the clinical deployment of hardware-dependent OCTA imaging methods. Given the technical complexity of OCTA image acquisition and potential mechanical artifacts, this study proposes a bidirectional image conversion framework called PupiNet, which accurately achieves bidirectional transformation between 3D OCT and 3D OCTA. The generator module of this framework innovatively integrates wavelet transformation and multi-scale attention mechanisms, significantly enhancing image conversion quality. Meanwhile, an Adaptive Discriminator Augmentation (ADA) module has been incorporated into the discriminator to optimize model training stability and convergence efficiency. To ensure clinical accuracy of vascular structures in the converted images, we designed a Vessel Structure Matcher (VSM) supervision module, achieving precise matching of vascular morphology between generated images and target images. Additionally, the Hierarchical Feature Calibration (HFC) module further guarantees high consistency of texture details between generated images and target images across different depth levels. To rigorously validate the clinical effectiveness of the proposed method, we conducted a comprehensive evaluation on a paired OCT-OCTA image dataset containing 300 eyes with various retinal pathologies. Experimental results demonstrate that PupiNet not only reliably achieves high-quality bidirectional transformation between the two modalities but also shows significant advantages in image fidelity, vessel structure preservation, and clinical usability.||
|**2025-03-31**|[An extension of linear self-attention for in-context learning](http://arxiv.org/abs/2503.23814)|null|In-context learning is a remarkable property of transformers and has been the focus of recent research. An attention mechanism is a key component in transformers, in which an attention matrix encodes relationships between words in a sentence and is used as weights for words in a sentence. This mechanism is effective for capturing language representations. However, it is questionable whether naive self-attention is suitable for in-context learning in general tasks, since the computation implemented by self-attention is somewhat restrictive in terms of matrix multiplication. In fact, we may need appropriate input form designs when considering heuristic implementations of computational algorithms. In this paper, in case of linear self-attention, we extend it by introducing a bias matrix in addition to a weight matrix for an input. Despite the simple extension, the extended linear self-attention can output any constant matrix, input matrix and multiplications of two or three matrices in the input. Note that the second property implies that it can be a skip connection. Therefore, flexible matrix manipulations can be implemented by connecting the extended linear self-attention components. As an example of implementation using the extended linear self-attention, we show a heuristic construction of a batch-type gradient descent of ridge regression under a reasonable input form.||
|**2025-03-30**|[DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution](http://arxiv.org/abs/2503.23580)|null|Large-scale pre-trained diffusion models are becoming increasingly popular in solving the Real-World Image Super-Resolution (Real-ISR) problem because of their rich generative priors. The recent development of diffusion transformer (DiT) has witnessed overwhelming performance over the traditional UNet-based architecture in image generation, which also raises the question: Can we adopt the advanced DiT-based diffusion model for Real-ISR? To this end, we propose our DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR. Instead of directly injecting embeddings extracted from low-resolution (LR) images like ControlNet, we integrate the LR embeddings into the original attention mechanism of DiT, allowing for the bidirectional flow of information between the LR latent and the generated latent. The sufficient interaction of these two streams allows the LR stream to evolve with the diffusion process, producing progressively refined guidance that better aligns with the generated latent at each diffusion step. Additionally, the LR guidance is injected into the generated latent via a cross-stream convolution layer, compensating for DiT's limited ability to capture local information. These simple but effective designs endow the DiT model with superior performance in Real-ISR, which is demonstrated by extensive experiments. Project Page: https://adam-duan.github.io/projects/dit4sr/.||
|**2025-03-30**|[Efficient Token Compression for Vision Transformer with Spatial Information Preserved](http://arxiv.org/abs/2503.23455)|**[link](https://github.com/nust-machine-intelligence-laboratory/prune_and_merge)**|Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token compression method called Prune and Merge. Our approach integrates token pruning and merging operations within transformer models to achieve layer-wise token compression. By introducing trainable merge and reconstruct matrices and utilizing shortcut connections, we efficiently merge tokens while preserving important information and enabling the restoration of pruned tokens. Additionally, we introduce a novel gradient-weighted attention scoring mechanism that computes token importance scores during the training phase, eliminating the need for separate computations during inference and enhancing compression efficiency. We also leverage gradient information to capture the global impact of tokens and automatically identify optimal compression structures. Extensive experiments on the ImageNet-1k and ADE20K datasets validate the effectiveness of our approach, achieving significant speed-ups with minimal accuracy degradation compared to state-of-the-art methods. For instance, on DeiT-Small, we achieve a 1.64 $\times$ speed-up with only a 0.2\% drop in accuracy on ImageNet-1k. Moreover, by compressing segmenter models and comparing with existing methods, we demonstrate the superior performance of our approach in terms of efficiency and effectiveness. Code and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/prune_and_merge.||
|**2025-03-30**|[CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition](http://arxiv.org/abs/2503.23447)|null|We propose Cross-Attention in Audio, Space, and Time (CA^2ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate the CAST on benchmarks with different characteristics, EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400, consistently showing balanced performance. We also validate the CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS. With a favorable performance of CAVA across these datasets, we demonstrate the effective information exchange among multiple experts within the B-CA module. In summary, CA^2ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.||
|**2025-03-30**|[Filtering with Time-frequency Analysis: An Adaptive and Lightweight Model for Sequential Recommender Systems Based on Discrete Wavelet Transform](http://arxiv.org/abs/2503.23436)|null|Sequential Recommender Systems (SRS) aim to model sequential behaviors of users to capture their interests which usually evolve over time. Transformer-based SRS have achieved distinguished successes recently. However, studies reveal self-attention mechanism in Transformer-based models is essentially a low-pass filter and ignores high frequency information potentially including meaningful user interest patterns. This motivates us to seek better filtering technologies for SRS, and finally we find Discrete Wavelet Transform (DWT), a famous time-frequency analysis technique from digital signal processing field, can effectively process both low-frequency and high-frequency information. We design an adaptive time-frequency filter with DWT technique, which decomposes user interests into multiple signals with different frequency and time, and can automatically learn weights of these signals. Furthermore, we develop DWTRec, a model for sequential recommendation all based on the adaptive time-frequency filter. Thanks to fast DWT technique, DWTRec has a lower time complexity and space complexity theoretically, and is proficient in modeling long sequences. Experiments show that our model outperforms state-of-the-art baseline models in datasets with different domains, sparsity levels and average sequence lengths. Especially, our model shows great performance increase in contrast with previous models when the sequence grows longer, which demonstrates another advantage of our model.||
|**2025-03-28**|[EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices](http://arxiv.org/abs/2503.22196)|null|Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices due to the quadratic complexity of attention mechanisms and growing memory demands from Key-Value (KV) cache. Existing KV cache optimizations struggle with irreversible token eviction in long-output tasks, while alternative sequence modeling architectures prove costly to adopt within established Transformer infrastructure. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs through a trainable memory-gating module. This approach maintains full compatibility with standard Transformer architectures, requiring fine-tuning only a small part of parameters, and enables selective activation of the memory-gating module for long and short context task routing. The experimental result shows that EdgeInfinite achieves comparable performance to baseline Transformer-based LLM on long context benchmarks while optimizing memory consumption and time to first token.||
|**2025-03-27**|[Molecular Quantum Transformer](http://arxiv.org/abs/2503.21686)|null|The Transformer model, renowned for its powerful attention mechanism, has achieved state-of-the-art performance in various artificial intelligence tasks but faces challenges such as high computational cost and memory usage. Researchers are exploring quantum computing to enhance the Transformer's design, though it still shows limited success with classical data. With a growing focus on leveraging quantum machine learning for quantum data, particularly in quantum chemistry, we propose the Molecular Quantum Transformer (MQT) for modeling interactions in molecular quantum systems. By utilizing quantum circuits to implement the attention mechanism on the molecular configurations, MQT can efficiently calculate ground-state energies for all configurations. Numerical demonstrations show that in calculating ground-state energies for H_2, LiH, BeH_2, and H_4, MQT outperforms the classical Transformer, highlighting the promise of quantum effects in Transformer structures. Furthermore, its pretraining capability on diverse molecular data facilitates the efficient learning of new molecules, extending its applicability to complex molecular systems with minimal additional effort. Our method offers an alternative to existing quantum algorithms for estimating ground-state energies, opening new avenues in quantum chemistry and materials science.||
|**2025-03-27**|[vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition](http://arxiv.org/abs/2503.21262)|**[link](https://github.com/yunusa2k2/vGamba)**|Capturing long-range dependencies efficiently is essential for visual recognition tasks, yet existing methods face limitations. Convolutional neural networks (CNNs) struggle with restricted receptive fields, while Vision Transformers (ViTs) achieve global context and long-range modeling at a high computational cost. State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. At its core is the Gamba bottleneck block, which includes the Gamba Cell, an adaptation of Mamba for 2D spatial structures, alongside a Multi-Head Self-Attention (MHSA) mechanism and a Gated Fusion Module for effective feature representation. The interplay of these components ensures that vGamba leverages the low computational demands of SSMs while maintaining the accuracy of attention mechanisms for modeling long-range dependencies in vision tasks. Additionally, the Fusion module enables seamless interaction between these components. Extensive experiments on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.||
|**2025-03-26**|[Optimal Scaling Laws for Efficiency Gains in a Theoretical Transformer-Augmented Sectional MoE Framework](http://arxiv.org/abs/2503.20750)|null|This paper introduces a theoretical framework for a Transformer-augmented, sectional Mixture-of-Experts (MoE) architecture that aims to enhance computational efficiency while preserving model scalability. Unlike conventional MoE models, which route entire token embeddings to selected experts, our approach portions the embedding dimension itself -- assigning segments of each token's representation to dedicated experts. To combat losses in token representation, we utilize a pre-expert transformer layer to recompute attention across tokens and reduce the sequence length dimensionality. We extend our theory by deriving optimal scaling laws that capture a non-linear relationship between the number of experts and factors such as model dimensionality, sequence length, and system overhead. These formulations yield closed-form and numerically-solvable expressions for identifying the optimal expert count under given architectural and hardware constraints. As a result, our framework not only provides theoretical bounds for computing efficiency with varying frameworks but also guides practical design choices for scaling large models effectively. While empirical validation is pending, we present a comprehensive experimental road map to evaluate the framework's efficiency, scalability, and practicality in future work.||
|**2025-03-27**|[Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification](http://arxiv.org/abs/2503.20652)|null|The rapid increase in the number of Computed Tomography (CT) scan examinations has created an urgent need for automated tools, such as organ segmentation, anomaly classification, and report generation, to assist radiologists with their growing workload. Multi-label classification of Three-Dimensional (3D) CT scans is a challenging task due to the volumetric nature of the data and the variety of anomalies to be detected. Existing deep learning methods based on Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies effectively, while Vision Transformers require extensive pre-training, posing challenges for practical use. Additionally, these existing methods do not explicitly model the radiologist's navigational behavior while scrolling through CT scan slices, which requires both global context understanding and local detail awareness. In this study, we present CT-Scroll, a novel global-local attention model specifically designed to emulate the scrolling behavior of radiologists during the analysis of 3D CT scans. Our approach is evaluated on two public datasets, demonstrating its efficacy through comprehensive experiments and an ablation study that highlights the contribution of each model component.||
|**2025-03-26**|[RSRWKV: A Linear-Complexity 2D Attention Mechanism for Efficient Remote Sensing Vision Task](http://arxiv.org/abs/2503.20382)|null|High-resolution remote sensing analysis faces challenges in global context modeling due to scene complexity and scale diversity. While CNNs excel at local feature extraction via parameter sharing, their fixed receptive fields fundamentally restrict long-range dependency modeling. Vision Transformers (ViTs) effectively capture global semantic relationships through self-attention mechanisms but suffer from quadratic computational complexity relative to image resolution, creating critical efficiency bottlenecks for high-resolution imagery. The RWKV model's linear-complexity sequence modeling achieves breakthroughs in NLP but exhibits anisotropic limitations in vision tasks due to its 1D scanning mechanism. To address these challenges, we propose RSRWKV, featuring a novel 2D-WKV scanning mechanism that bridges sequential processing and 2D spatial reasoning while maintaining linear complexity. This enables isotropic context aggregation across multiple directions. The MVC-Shift module enhances multi-scale receptive field coverage, while the ECA module strengthens cross-channel feature interaction and semantic saliency modeling. Experimental results demonstrate RSRWKV's superior performance over CNN and Transformer baselines in classification, detection, and segmentation tasks on NWPU RESISC45, VHR-10.v2, and GLH-Water datasets, offering a scalable solution for high-resolution remote sensing analysis.||
|**2025-03-26**|[Progressive Focused Transformer for Single Image Super-Resolution](http://arxiv.org/abs/2503.20337)|**[link](https://github.com/labshuhanggu/pft-sr)**|Transformer-based methods have achieved remarkable results in image super-resolution tasks because they can capture non-local dependencies in low-quality input images. However, this feature-intensive modeling approach is computationally expensive because it calculates the similarities between numerous features that are irrelevant to the query features when obtaining attention weights. These unnecessary similarity calculations not only degrade the reconstruction performance but also introduce significant computational overhead. How to accurately identify the features that are important to the current query features and avoid similarity calculations between irrelevant features remains an urgent problem. To address this issue, we propose a novel and effective Progressive Focused Transformer (PFT) that links all isolated attention maps in the network through Progressive Focused Attention (PFA) to focus attention on the most important tokens. PFA not only enables the network to capture more critical similar features, but also significantly reduces the computational cost of the overall network by filtering out irrelevant features before calculating similarities. Extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance on various single image super-resolution benchmarks.||
|**2025-03-26**|[VESTA: A Versatile SNN-Based Transformer Accelerator with Unified PEs for Multiple Computational Layers](http://arxiv.org/abs/2503.20246)|null|Spiking Neural Networks (SNNs) and transformers represent two powerful paradigms in neural computation, known for their low power consumption and ability to capture feature dependencies, respectively. However, transformer architectures typically involve multiple types of computational layers, including linear layers for MLP modules and classification heads, convolution layers for tokenizers, and dot product computations for self-attention mechanisms. These diverse operations pose significant challenges for hardware accelerator design, and to our knowledge, there is not yet a hardware solution that leverages spike-form data from SNNs for transformer architectures. In this paper, we introduce VESTA, a novel hardware design that synergizes these technologies, presenting unified Processing Elements (PEs) capable of efficiently performing all three types of computations crucial to transformer structures. VESTA uniquely benefits from the spike-form outputs of the Spike Neuron Layers, simplifying multiplication operations by reducing them from handling two 8-bit integers to handling one 8-bit integer and a binary spike. This reduction enables the use of multiplexers in the PE module, significantly enhancing computational efficiency while maintaining the low-power advantage of SNNs. Experimental results show that the core area of VESTA is 0.844 mm$^2$. It operates at 500 MHz and is capable of real-time image classification at 30 fps.||
|**2025-03-26**|[Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration](http://arxiv.org/abs/2503.20174)|null|Transformer-based approaches have gained significant attention in image restoration, where the core component, i.e., Multi-Head Attention (MHA), plays a crucial role in capturing diverse features and recovering high-quality results. In MHA, heads perform attention calculation independently from uniform split subspaces, and a redundancy issue is triggered to hinder the model from achieving satisfactory outputs. In this paper, we propose to improve MHA by exploring diverse learners and introducing various interactions between heads, which results in a Hierarchical multI-head atteNtion driven Transformer model, termed HINT, for image restoration. HINT contains two modules, i.e., the Hierarchical Multi-Head Attention (HMHA) and the Query-Key Cache Updating (QKCU) module, to address the redundancy problem that is rooted in vanilla MHA. Specifically, HMHA extracts diverse contextual features by employing heads to learn from subspaces of varying sizes and containing different information. Moreover, QKCU, comprising intra- and inter-layer schemes, further reduces the redundancy problem by facilitating enhanced interactions between attention heads within and across layers. Extensive experiments are conducted on 12 benchmarks across 5 image restoration tasks, including low-light enhancement, dehazing, desnowing, denoising, and deraining, to demonstrate the superiority of HINT. The source code is available in the supplementary materials.||
|**2025-03-25**|[Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation](http://arxiv.org/abs/2503.19881)|null|Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.||
|**2025-03-25**|[A multitask transformer to sign language translation using motion gesture primitives](http://arxiv.org/abs/2503.19668)|null|The absence of effective communication with the deaf population represents the main social gap in this community. Furthermore, sign language, the main deaf communication tool, is unlettered, i.e., there is no formal written representation. Consequently, the main challenge today is automatic translation between spatiotemporal sign representations and natural text language. Recent approaches are based on encoder-decoder architectures, where the most relevant strategies integrate attention modules to enhance non-linear correspondences. In addition, many of these approaches require complex training and architectural schemes to achieve reasonable predictions because of the absence of intermediate text projections. However, they are still limited by the redundant background information of the video sequences. This work introduces a multitask transformer architecture that includes a gloss learning representation to achieve a more suitable translation. The proposed approach also includes a dense motion representation that enhances gestures and includes kinematic information, a key component in sign language. This representation makes it possible to avoid background information and exploit the geometry of the signs. In addition, it includes spatiotemporal representations that facilitate the alignment between gestures and glosses as an intermediate textual representation. The proposed approach outperforms the state of the art on the CoL-SLTD dataset, achieving a BLEU-4 of 72.64% in split 1 and a BLEU-4 of 14.64% in split 2. Additionally, the strategy was validated on the RWTH-PHOENIX-Weather 2014 T dataset, achieving a competitive BLEU-4 of 11.58%.||
|**2025-03-25**|[An Efficient Data Reuse with Tile-Based Adaptive Stationary for Transformer Accelerators](http://arxiv.org/abs/2503.19640)|null|Transformer-based models have become the *de facto* backbone across many fields, such as computer vision and natural language processing. However, as these models scale in size, external memory access (EMA) for weight and activations becomes a critical bottleneck due to its significantly higher energy consumption compared to internal computations. While most prior work has focused on optimizing the self-attention mechanism, little attention has been given to optimizing data transfer during linear projections, where EMA costs are equally important. In this paper, we propose the Tile-based Adaptive Stationary (TAS) scheme that selects the input or weight stationary in a tile granularity, based on the input sequence length. Our experimental results demonstrate that TAS can significantly reduce EMA by more than 97% compared to traditional stationary schemes, while being compatible with various attention optimization techniques and hardware accelerators.||
|**2025-03-25**|[A novel forecasting framework combining virtual samples and enhanced Transformer models for tourism demand forecasting](http://arxiv.org/abs/2503.19423)|null|Accurate tourism demand forecasting is hindered by limited historical data and complex spatiotemporal dependencies among tourist origins. A novel forecasting framework integrating virtual sample generation and a novel Transformer predictor addresses constraints arising from restricted data availability. A spatiotemporal GAN produces realistic virtual samples by dynamically modeling spatial correlations through a graph convolutional network, and an enhanced Transformer captures local patterns with causal convolutions and long-term dependencies with self-attention, eliminating autoregressive decoding. A joint training strategy refines virtual sample generation based on predictor feedback to maintain robust performance under data-scarce conditions. Experimental evaluations on real-world daily and monthly tourism demand datasets indicate a reduction in average MASE by 18.37% compared to conventional Transformer-based models, demonstrating improved forecasting accuracy. The integration of adaptive spatiotemporal sample augmentation with a specialized Transformer can effectively address limited-data forecasting scenarios in tourism management.||
|**2025-03-25**|[No Black Box Anymore: Demystifying Clinical Predictive Modeling with Temporal-Feature Cross Attention Mechanism](http://arxiv.org/abs/2503.19285)|null|Despite the outstanding performance of deep learning models in clinical prediction tasks, explainability remains a significant challenge. Inspired by transformer architectures, we introduce the Temporal-Feature Cross Attention Mechanism (TFCAM), a novel deep learning framework designed to capture dynamic interactions among clinical features across time, enhancing both predictive accuracy and interpretability. In an experiment with 1,422 patients with Chronic Kidney Disease, predicting progression to End-Stage Renal Disease, TFCAM outperformed LSTM and RETAIN baselines, achieving an AUROC of 0.95 and an F1-score of 0.69. Beyond performance gains, TFCAM provides multi-level explainability by identifying critical temporal periods, ranking feature importance, and quantifying how features influence each other across time before affecting predictions. Our approach addresses the "black box" limitations of deep learning in healthcare, offering clinicians transparent insights into disease progression mechanisms while maintaining state-of-the-art predictive performance.||
|**2025-03-24**|[Mechanistic Interpretability of Fine-Tuned Vision Transformers on Distorted Images: Decoding Attention Head Behavior for Transparent and Trustworthy AI](http://arxiv.org/abs/2503.18762)|null|Mechanistic interpretability improves the safety, reliability, and robustness of large AI models. This study examined individual attention heads in vision transformers (ViTs) fine-tuned on distorted 2D spectrogram images containing non-relevant content (axis labels, titles, color bars). By introducing extraneous features, the study analyzed how transformer components processed unrelated information, using mechanistic interpretability to debug issues and reveal insights into transformer architectures. Attention maps assessed head contributions across layers. Heads in early layers (1 to 3) showed minimal task impact, with ablation increasing MSE loss only slightly (μ=0.11%, σ=0.09%), indicating a focus on less critical low-level features. In contrast, deeper heads (e.g., layer 6) caused a threefold higher loss increase (μ=0.34%, σ=0.02%), demonstrating greater task importance. Intermediate layers (6 to 11) exhibited monosemantic behavior, attending exclusively to chirp regions. Some early heads (1 to 4) were monosemantic but not task-relevant (e.g., text detectors, edge or corner detectors). Attention maps distinguished monosemantic heads (precise chirp localization) from polysemantic heads (multiple irrelevant regions). These findings revealed functional specialization in ViTs, showing how heads processed relevant vs. extraneous information. By decomposing transformers into interpretable components, this work enhanced model understanding, identified vulnerabilities, and advanced safer, more transparent AI.||
|**2025-03-24**|[Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures](http://arxiv.org/abs/2503.18565)|null|The current era of Natural Language Processing (NLP) is dominated by Transformer models. However, novel architectures relying on recurrent mechanisms, such as xLSTM and Mamba, have been proposed as alternatives to attention-based models. Although computation is done differently than with the attention mechanism, these recurrent models yield good results and sometimes even outperform state-of-the-art attention-based models. In this work, we propose Distil-xLSTM, an xLSTM-based Small Language Model (SLM) trained by distilling knowledge from a Large Language Model (LLM) that shows promising results while being compute and scale efficient. Our Distil-xLSTM focuses on approximating a transformer-based model attention parametrization using its recurrent sequence mixing components and shows good results with minimal training.||
|**2025-03-24**|[Exploring State Space Model in Wavelet Domain: An Infrared and Visible Image Fusion Network via Wavelet Transform and State Space Model](http://arxiv.org/abs/2503.18378)|null|Deep learning techniques have revolutionized the infrared and visible image fusion (IVIF), showing remarkable efficacy on complex scenarios. However, current methods do not fully combine frequency domain features with global semantic information, which will result in suboptimal extraction of global features across modalities and insufficient preservation of local texture details. To address these issues, we propose Wavelet-Mamba (W-Mamba), which integrates wavelet transform with the state-space model (SSM). Specifically, we introduce Wavelet-SSM module, which incorporates wavelet-based frequency domain feature extraction and global information extraction through SSM, thereby effectively capturing both global and local features. Additionally, we propose a cross-modal feature attention modulation, which facilitates efficient interaction and fusion between different modalities. The experimental results indicate that our method achieves both visually compelling results and superior performance compared to current state-of-the-art methods. Our code is available at https://github.com/Lmmh058/W-Mamba.||
|**2025-03-24**|[Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models](http://arxiv.org/abs/2503.18337)|null|Transformer-based large pre-trained models have shown remarkable generalization ability, and various parameter-efficient fine-tuning (PEFT) methods have been proposed to customize these models on downstream tasks with minimal computational and memory budgets. Previous PEFT methods are primarily designed from a tensor-decomposition perspective that tries to effectively tune the linear transformation by finding the smallest subset of parameters to train. Our study adopts an orthogonal view by representing the attention operation as a graph convolution and formulating the multi-head attention maps as a convolutional filter subspace, with each attention map as a subspace element. In this paper, we propose to tune the large pre-trained transformers by learning a small set of combination coefficients that construct a more expressive filter subspace from the original multi-head attention maps. We show analytically and experimentally that the tuned filter subspace can effectively expand the feature space of the multi-head attention and further enhance the capacity of transformers. We further stabilize the fine-tuning with a residual parameterization of the tunable subspace coefficients, and enhance the generalization with a regularization design by directly applying dropout on the tunable coefficient during training. The tunable coefficients take a tiny number of parameters and can be combined with previous PEFT methods in a plug-and-play manner. Extensive experiments show that our approach achieves better performance than PEFT baselines with negligible additional parameters.||
|**2025-03-24**|[Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module](http://arxiv.org/abs/2503.18297)|null|Medical report generation requires specialized expertise that general large models often fail to accurately capture. Moreover, the inherent repetition and similarity in medical data make it difficult for models to extract meaningful features, resulting in a tendency to overfit. So in this paper, we propose a multimodal model, Co-Attention Triple-LSTM Network (CA-TriNet), a deep learning model that combines transformer architectures with a Multi-LSTM network. Its Co-Attention module synergistically links a vision transformer with a text transformer to better differentiate medical images with similarities, augmented by an adaptive weight operator to catch and amplify image labels with minor similarities. Furthermore, its Triple-LSTM module refines generated sentences using targeted image objects. Extensive evaluations over three public datasets have demonstrated that CA-TriNet outperforms state-of-the-art models in terms of comprehensive ability, even pre-trained large language models on some metrics.||
|**2025-03-21**|[Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer](http://arxiv.org/abs/2503.17350)|null|The motion transfer task involves transferring motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention mechanisms within 3D U-Net. In contrast, state-of-the-art video Diffusion Transformers (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. Thus, the interaction between spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. Meanwhile, the temporal kernel effectively captures temporal variations in DiT features, which are closely related to motion. Moreover, we introduce explicit supervision along dense trajectories in the latent feature space to further enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer. We also introduce a hybrid motion fidelity metric that considers both the global and local motion similarity. Therefore, our work provides a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity.||
|**2025-03-21**|[Vision Transformer Based Semantic Communications for Next Generation Wireless Networks](http://arxiv.org/abs/2503.17275)|null|In the evolving landscape of 6G networks, semantic communications are poised to revolutionize data transmission by prioritizing the transmission of semantic meaning over raw data accuracy. This paper presents a Vision Transformer (ViT)-based semantic communication framework that has been deliberately designed to achieve high semantic similarity during image transmission while simultaneously minimizing the demand for bandwidth. By equipping ViT as the encoder-decoder framework, the proposed architecture can proficiently encode images into a high semantic content at the transmitter and precisely reconstruct the images at the receiver while accounting for real-world fading and noise. Building on the attention mechanisms inherent to ViTs, our model outperforms Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) tailored for generating such images. The architecture based on the proposed ViT network achieves a Peak Signal-to-Noise Ratio (PSNR) of 38 dB, which is higher than other Deep Learning (DL) approaches in maintaining semantic similarity across different communication environments. These findings establish our ViT-based approach as a significant breakthrough in semantic communications.||
|**2025-03-21**|[Halton Scheduler For Masked Generative Image Transformer](http://arxiv.org/abs/2503.17076)|**[link](https://github.com/valeoai/halton-maskgit)**|Masked Generative Image Transformers (MaskGIT) have emerged as a scalable and efficient image generation framework, able to deliver high-quality visuals with low inference costs. However, MaskGIT's token unmasking scheduler, an essential component of the framework, has not received the attention it deserves. We analyze the sampling objective in MaskGIT, based on the mutual information between tokens, and elucidate its shortcomings. We then propose a new sampling strategy based on our Halton scheduler instead of the original Confidence scheduler. More precisely, our method selects the token's position according to a quasi-random, low-discrepancy Halton sequence. Intuitively, that method spreads the tokens spatially, progressively covering the image uniformly at each step. Our analysis shows that it allows reducing non-recoverable sampling errors, leading to simpler hyper-parameters tuning and better quality images. Our scheduler does not require retraining or noise injection and may serve as a simple drop-in replacement for the original sampling strategy. Evaluation of both class-to-image synthesis on ImageNet and text-to-image generation on the COCO dataset demonstrates that the Halton scheduler outperforms the Confidence scheduler quantitatively by reducing the FID and qualitatively by generating more diverse and more detailed images. Our code is at https://github.com/valeoai/Halton-MaskGIT.||
|**2025-03-21**|[Rankformer: A Graph Transformer for Recommendation based on Ranking Objective](http://arxiv.org/abs/2503.16927)|**[link](https://github.com/stupidthree/rankformer)**|Recommender Systems (RS) aim to generate personalized ranked lists for each user and are evaluated using ranking metrics. Although personalized ranking is a fundamental aspect of RS, this critical property is often overlooked in the design of model architectures. To address this issue, we propose Rankformer, a ranking-inspired recommendation model. The architecture of Rankformer is inspired by the gradient of the ranking objective, embodying a unique (graph) transformer architecture -- it leverages global information from all users and items to produce more informative representations and employs specific attention weights to guide the evolution of embeddings towards improved ranking performance. We further develop an acceleration algorithm for Rankformer, reducing its complexity to a linear level with respect to the number of positive instances. Extensive experimental results demonstrate that Rankformer outperforms state-of-the-art methods. The code is available at https://github.com/StupidThree/Rankformer.||
|**2025-03-21**|[Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Multi-Culture Sign Language Recognition](http://arxiv.org/abs/2503.16855)|null|Hand gesture-based Sign Language Recognition (SLR) serves as a crucial communication bridge between deaf and non-deaf individuals. Existing SLR systems perform well for their cultural SL but may struggle with multi-cultural sign languages (McSL). To address these challenges, this paper proposes a Stack Spatial-Temporal Transformer Network that leverages multi-head attention mechanisms to capture both spatial and temporal dependencies with hierarchical features using the Stack Transfer concept. In this procedure, we first apply a fully connected layer to produce an embedding vector with high expressive power from the original dataset, then feed it to a stack of the newly proposed transformer blocks to obtain hierarchical features with short-range and long-range dependencies. The network architecture is composed of several stages that process spatial and temporal relationships sequentially, ensuring effective feature extraction. After the fully connected layer, the embedding vector is processed by the Spatial Multi-Head Attention Transformer, which captures spatial dependencies between joints. In the next stage, the Temporal Multi-Head Attention Transformer captures long-range temporal dependencies, and again, the features are concatenated with the output using another skip connection. The processed features are then passed to the Feed-Forward Network (FFN), which refines the feature representations further. After the FFN, additional skip connections are applied to combine the output with earlier layers, followed by a final normalization layer to produce the final output feature tensor. This process is repeated for 10 transformer blocks. Extensive experiments show good accuracy on the JSL, KSL, and ASL datasets. Our approach demonstrates improved performance in McSL and can be considered a novel work in this domain.||
|**2025-03-20**|[Design and Implementation of an FPGA-Based Tiled Matrix Multiplication Accelerator for Transformer Self-Attention on the Xilinx KV260 SoM](http://arxiv.org/abs/2503.16731)|**[link](https://github.com/Richielee630/TMMA)**|Transformer-based LLMs spend most of their compute in large matrix multiplications for attention and feed-forward layers. Recognizing that the Q, K, and V linear projections within the Multi-Head Self-Attention (MHA) module represent a critical computational bottleneck, we strategically focused our efforts on accelerating these operations. We present a tiled matrix multiplication accelerator optimized for such workloads on a Xilinx KV260 on-board FPGA. Key innovations include persistent on-chip storage for one matrix operand, two-level tiling for data reuse, and a systolic-like unrolled compute engine. Implemented via high-level synthesis (HLS) and integrated with DistilBERT for Q, K, V projections, our accelerator achieves significant speedup and energy efficiency gains over CPU baselines. Standalone GEMM benchmarks show up to a 7x speedup over an ARM CPU (PyTorch) and ~200x over naive numpy, with a throughput of up to 3.1 GFLOPs on 768x3072 matrices. Although the overall end-to-end DistilBERT acceleration is more modest, our results validate the potential of FPGA-based acceleration for critical components of Transformer models.||
|**2025-03-20**|[EDiT: Efficient Diffusion Transformers with Linear Compressed Attention](http://arxiv.org/abs/2503.16726)|null|Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling properties of the attention in DiTs hinder image generation with higher resolution or on devices with limited resources. This work introduces an efficient diffusion transformer (EDiT) to alleviate these efficiency bottlenecks in conventional DiTs and Multimodal DiTs (MM-DiTs). First, we present a novel linear compressed attention method that uses a multi-layer convolutional network to modulate queries with local information while keys and values are spatially aggregated. Second, we formulate a hybrid attention scheme for multi-modal inputs that combines linear attention for image-to-image interactions and standard scaled dot-product attention for interactions involving prompts. Merging these two approaches leads to an expressive, linear-time Multimodal Efficient Diffusion Transformer (MM-EDiT). We demonstrate the effectiveness of the EDiT and MM-EDiT architectures by integrating them into PixArt-Sigma (conventional DiT) and Stable Diffusion 3.5-Medium (MM-DiT), achieving up to 2.2x speedup with comparable image quality after distillation.||
|**2025-03-20**|[iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation](http://arxiv.org/abs/2503.16653)|null|This paper proposes iFlame, a novel transformer-based network architecture for mesh generation. While attention-based models have demonstrated remarkable performance in mesh generation, their quadratic computational complexity limits scalability, particularly for high-resolution 3D data. Conversely, linear attention mechanisms offer lower computational costs but often struggle to capture long-range dependencies, resulting in suboptimal outcomes. To address this trade-off, we propose an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention mechanisms. To further enhance efficiency and leverage the inherent structure of mesh representations, we integrate this interleaving approach into an hourglass architecture, which significantly boosts efficiency. Our approach reduces training time while achieving performance comparable to pure attention-based models. To improve inference efficiency, we implemented a caching algorithm that almost doubles the speed and reduces the KV cache size by seven-eighths compared to the original Transformer. We evaluate our framework on ShapeNet and Objaverse, demonstrating its ability to generate high-quality 3D meshes efficiently. Our results indicate that the proposed interleaving framework effectively balances computational efficiency and generative performance, making it a practical solution for mesh generation. The training takes only 2 days with 4 GPUs on 39k data with a maximum of 4k faces on Objaverse.||
|**2025-03-20**|[XAttention: Block Sparse Attention with Antidiagonal Scoring](http://arxiv.org/abs/2503.16428)|**[link](https://github.com/mit-han-lab/x-attention)**|Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle with balancing accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks, including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation, XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code is available at https://github.com/mit-han-lab/x-attention.||
|**2025-03-20**|[Iterative Optimal Attention and Local Model for Single Image Rain Streak Removal](http://arxiv.org/abs/2503.16165)|**[link](https://github.com/ghfkahfk/EMResformer)**|High-fidelity imaging is crucial for the successful safety supervision and intelligent deployment of vision-based measurement systems (VBMS). It ensures high-quality imaging in VBMS, which is fundamental for reliable visual measurement and analysis. However, imaging quality can be significantly impaired by adverse weather conditions, particularly rain, leading to blurred images and reduced contrast. Such impairments increase the risk of inaccurate evaluations and misinterpretations in VBMS. To address these limitations, we propose an Expectation Maximization Reconstruction Transformer (EMResformer) for single image rain streak removal. The EMResformer retains the key self-attention values for feature aggregation, enhancing local features to produce superior image reconstruction. Specifically, we propose an Expectation Maximization Block seamlessly integrated into the single image rain streak removal network, enhancing its ability to eliminate superfluous information and restore a cleaner background image. Additionally, to further enhance local information for improved detail rendition, we introduce a Local Model Residual Block, which integrates two local model blocks along with a sequence of convolutions and activation functions. This integration synergistically facilitates the extraction of more pertinent features for enhanced single image rain streak removal. Extensive experiments validate that our proposed EMResformer surpasses current state-of-the-art single image rain streak removal methods on both synthetic and real-world datasets, achieving an improved balance between model complexity and single image deraining performance. Furthermore, we evaluate the effectiveness of our method in VBMS scenarios, demonstrating that high-quality imaging significantly improves the accuracy and reliability of VBMS tasks.||
|**2025-03-20**|[Temporal-Spatial Attention Network (TSAN) for DoS Attack Detection in Network Traffic](http://arxiv.org/abs/2503.16047)|null|Denial-of-Service (DoS) attacks remain a critical threat to network security, disrupting services and causing significant economic losses. Traditional detection methods, including statistical and rule-based models, struggle to adapt to evolving attack patterns. To address this challenge, we propose a novel Temporal-Spatial Attention Network (TSAN) architecture for detecting Denial of Service (DoS) attacks in network traffic. By leveraging both temporal and spatial features of network traffic, our approach captures complex traffic patterns and anomalies that traditional methods might miss. The TSAN model incorporates transformer-based temporal encoding, convolutional spatial encoding, and a cross-attention mechanism to fuse these complementary feature spaces. Additionally, we employ multi-task learning with auxiliary tasks to enhance the model's robustness. Experimental results on the NSL-KDD dataset demonstrate that TSAN outperforms state-of-the-art models, achieving superior accuracy, precision, recall, and F1-score while maintaining computational efficiency for real-time deployment. The proposed architecture offers an optimal balance between detection accuracy and computational overhead, making it highly suitable for real-world network security applications.||
|**2025-03-20**|[SpiLiFormer: Enhancing Spiking Transformers with Lateral Inhibition](http://arxiv.org/abs/2503.15986)|null|Spiking Neural Networks (SNNs) based on Transformers have garnered significant attention due to their superior performance and high energy efficiency. However, the spiking attention modules of most existing Transformer-based SNNs are adapted from those of analog Transformers, failing to fully address the issue of over-allocating attention to irrelevant contexts. To fix this fundamental yet overlooked issue, we propose a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer). It emulates the brain's lateral inhibition mechanism, guiding the model to enhance attention to relevant tokens while suppressing attention to irrelevant ones. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including CIFAR-10 (+0.45%), CIFAR-100 (+0.48%), CIFAR10-DVS (+2.70%), N-Caltech101 (+1.94%), and ImageNet-1K (+1.6%). Notably, on the ImageNet-1K dataset, SpiLiFormer (69.9M parameters, 4 time steps, 384 resolution) outperforms E-SpikeFormer (173.0M parameters, 8 time steps, 384 resolution), a SOTA spiking Transformer, by 0.46% using only 39% of the parameters and half the time steps. Our code and training checkpoints will be released upon acceptance.||
|**2025-03-20**|[InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer](http://arxiv.org/abs/2503.15983)|null|This work explores optimizing transformer-based language models by integrating model compression techniques with inhibitor attention, a novel alternative attention mechanism. Inhibitor attention employs Manhattan distances and ReLU activations instead of the matrix multiplications and softmax activation of the conventional scaled dot-product attention. This shift offers potential computational and energy savings while maintaining model effectiveness. We propose further adjustments to improve the inhibitor mechanism's training efficiency and evaluate its performance on the DistilBERT architecture. Our knowledge distillation experiments indicate that the modified inhibitor transformer model can achieve competitive performance on standard NLP benchmarks, including General Language Understanding Evaluation (GLUE) and sentiment analysis tasks.||
|**2025-03-20**|[ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism](http://arxiv.org/abs/2503.15758)|null|Transformer-based models have emerged as a leading architecture for natural language processing, natural language generation, and image generation tasks. A fundamental element of the transformer architecture is self-attention, which allows the model to capture intricate dependencies within the data. However, the self-attention mechanism also incurs significant computational and memory costs, particularly for long sequences. In this paper, we introduce ATTENTION2D, a novel approach that exploits parallelism along two dimensions - query and key/value - of the self-attention operation. This method enables efficient distribution and parallelization of computations across multiple devices. Our approach facilitates asymptotically faster training and inference phases compared to previous methods, without relying on approximations or incurring additional computational or memory overheads. Furthermore, unlike existing techniques that struggle to scale with an increasing number of processing units, our approach effectively scales with additional processing units. Our experimental results confirm the effectiveness of our method in improving communication efficiency and scalability. Compared to Ring Attention, our approach demonstrated up to a 5x performance boost on a GPT-3-like model using 64 NVIDIA A100 GPUs across 16 nodes, and up to a 9.4x performance boost on 64 NVIDIA H100 GPUs across 64 nodes.||
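|**2025-03-19**|[Sparseformer: a Transferable Transformer with Multi-granularity Token Sparsification for Medical Time Series Classification](http://arxiv.org/abs/2503.15578)|null|Medical time series (MedTS) classification is crucial for improving medical diagnosis, but it is challenging due to varying pattern granularity, complex inter-channel correlations, information redundancy, and label scarcity. While existing Transformer-based models have shown promise in time series analysis, they mainly focus on forecasting and fail to fully exploit the distinctive characteristics of MedTS data. In this paper, we propose Sparseformer, a Transformer designed specifically for MedTS classification. We propose a sparse token-based dual-attention mechanism that supports global modeling and token compression, allowing dynamic focus on the most informative tokens while distilling redundant features. This mechanism is then applied to multi-granularity, cross-channel encoding of medical signals, capturing intra- and inter-granularity correlations as well as inter-channel connections. The sparsification design allows our model to directly handle heterogeneous inputs of varying lengths and channels. Furthermore, we introduce an adaptive label encoder to address label space misalignment across datasets, giving our model cross-dataset transferability and thereby alleviating the medical label scarcity problem. Under supervised learning, our model outperforms 12 baselines across seven medical datasets. In few-shot learning experiments, our model also achieves superior average results. In addition, in-domain and cross-domain experiments on three diagnostic scenarios demonstrate our model's zero-shot learning capability. Collectively, these findings highlight the robustness and transferability of our model in various medical applications.||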
|**2025-03-20**|[Dynamic Bi-Elman Attention Networks (DBEAN): Dual-Directional Context-Aware Representation Learning for Enhanced Text Classification](http://arxiv.org/abs/2503.15469)|**[link](https://github.com/Bearisbug/Bi-Elman)**|Text classification, a fundamental task in natural language processing (NLP), aims to categorize textual data into predefined labels. Traditional methods struggled with complex linguistic structures and semantic dependencies. The advent of deep learning, particularly recurrent neural networks (RNNs) and Transformer-based models, has significantly advanced the field by enabling nuanced feature extraction and context-aware predictions. Despite improvements, existing models exhibit limitations in balancing interpretability, computational efficiency, and long-range contextual understanding. This paper proposes the Dynamic Bidirectional Elman with Attention Network (DBEAN), which integrates bidirectional temporal modelling with self-attention mechanisms. DBEAN dynamically assigns weights to critical segments of input, improving contextual representation while maintaining computational efficiency.||
|**2025-03-19**|[Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement](http://arxiv.org/abs/2503.15404)|**[link](https://github.com/ryc-98/fpr)**|Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is by refining the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement by up to 7.0\% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods. Codes and appendix are available at https://github.com/RYC-98/FPR.||
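|**2025-03-19**|[A Novel Channel Boosted Residual CNN-Transformer with Regional-Boundary Learning for Breast Cancer Detection](http://arxiv.org/abs/2503.15008)|null|Recent years have seen significant progress in research on tumor detection in breast ultrasound images (BUSI) using deep learning. Deep convolutional neural networks (CNNs) and vision transformers (ViTs) have each shown good initial performance. However, model complexity and challenges such as variations in contrast, texture, and tumor morphology introduce uncertainty and hinder the effectiveness of existing methods. This study proposes a hybrid framework called CB-Res-RBCMT, which combines customized residual CNNs with new ViT components for detailed BUSI cancer analysis. The proposed RBCMT uses stem convolution blocks with CNN Meet Transformer (CMT) blocks, followed by new Regional and Boundary (RB) feature extraction operations to capture contrast and morphological variations. Moreover, the CMT block incorporates global contextual interactions through multi-head attention and improves computational efficiency through a lightweight design. In addition, the customized inverse residual and stem CNNs within the CMT effectively extract local texture information and handle vanishing gradients. Finally, the new channel-boosted (CB) strategy enriches the feature diversity of the limited dataset by combining the original RBCMT channels with feature maps generated by transfer learning-based residual CNNs. These diverse channels are processed through a spatial attention block for optimal pixel selection, reducing redundancy and improving discrimination of subtle contrast and texture variations. On a standard, harmonized, and stringent BUSI dataset, the proposed CB-Res-RBCMT achieves an F1-score of 95.57%, accuracy of 95.63%, sensitivity of 96.42%, and precision of 94.79%, outperforming existing ViT and CNN methods. These results demonstrate the versatility of our integrated CNN-Transformer framework in capturing diverse features and delivering superior performance in BUSI cancer diagnosis.||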
|**2025-03-14**|[Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers](http://arxiv.org/abs/2503.11579)|null|State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640 $\times$ 360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.||
|**2025-03-14**|[APLA: A Simple Adaptation Method for Vision Transformers](http://arxiv.org/abs/2503.11335)|**[link](https://github.com/moeinsorkhei/apla)**|Existing adaptation techniques typically require architectural modifications or added parameters, leading to high computational costs and complexity. We introduce Attention Projection Layer Adaptation (APLA), a simple approach to adapt vision transformers (ViTs) without altering the architecture or adding parameters. Through a systematic analysis, we find that the layer immediately after the attention mechanism is crucial for adaptation. By updating only this projection layer, or even just a random subset of this layer's weights, APLA achieves state-of-the-art performance while reducing GPU memory usage by up to 52.63% and training time by up to 43.0%, with no extra cost at inference. Across 46 datasets covering a variety of tasks including scene classification, medical imaging, satellite imaging, and fine-grained classification, APLA consistently outperforms 17 other leading adaptation methods, including full fine-tuning, on classification, segmentation, and detection tasks. The code is available at https://github.com/MoeinSorkhei/APLA.||
|**2025-03-14**|[Brain Effective Connectivity Estimation via Fourier Spatiotemporal Attention](http://arxiv.org/abs/2503.11283)|**[link](https://github.com/XiongWenXww/FSTA)**|Estimating brain effective connectivity (EC) from functional magnetic resonance imaging (fMRI) data can aid in comprehending the neural mechanisms underlying human behavior and cognition, providing a foundation for disease diagnosis. However, current spatiotemporal attention modules handle temporal and spatial attention separately, extracting temporal and spatial features either sequentially or in parallel. These approaches overlook the inherent spatiotemporal correlations present in real world fMRI data. Additionally, the presence of noise in fMRI data further limits the performance of existing methods. In this paper, we propose a novel brain effective connectivity estimation method based on Fourier spatiotemporal attention (FSTA-EC), which combines Fourier attention and spatiotemporal attention to simultaneously capture inter-series (spatial) dynamics and intra-series (temporal) dependencies from high-noise fMRI data. Specifically, Fourier attention is designed to convert the high-noise fMRI data to frequency domain, and map the denoised fMRI data back to physical domain, and spatiotemporal attention is crafted to simultaneously learn spatiotemporal dynamics. Furthermore, through a series of proofs, we demonstrate that incorporating learnable filter into fast Fourier transform and inverse fast Fourier transform processes is mathematically equivalent to performing cyclic convolution. The experimental results on simulated and real-resting-state fMRI datasets demonstrate that the proposed method exhibits superior performance when compared to state-of-the-art methods.||
|**2025-03-14**|[When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective](http://arxiv.org/abs/2503.11272)|**[link](https://github.com/mousavih/transformers-separation)**|Theoretical efforts to prove advantages of Transformers in comparison with classical architectures such as feedforward and recurrent neural networks have mostly focused on representational power. In this work, we take an alternative perspective and prove that even with infinite compute, feedforward and recurrent networks may suffer from larger sample complexity compared to Transformers, as the latter can adapt to a form of dynamic sparsity. Specifically, we consider a sequence-to-sequence data generating model on sequences of length $N$, in which the output at each position depends only on $q$ relevant tokens with $q \ll N$, and the positions of these tokens are described in the input prompt. We prove that a single-layer Transformer can learn this model if and only if its number of attention heads is at least $q$, in which case it achieves a sample complexity almost independent of $N$, while recurrent networks require $N^{\Omega(1)}$ samples on the same problem. If we simplify this model, recurrent networks may achieve a complexity almost independent of $N$, while feedforward networks still require $N$ samples. Consequently, our proposed sparse retrieval model illustrates a natural hierarchy in sample complexity across these architectures.||
|**2025-03-14**|[Addressing Information Loss and Interaction Collapse: A Dual Enhanced Attention Framework for Feature Interaction](http://arxiv.org/abs/2503.11233)|null|The Transformer has proven to be a significant approach in feature interaction for CTR prediction, achieving considerable success in previous works. However, it also presents potential challenges in handling feature interactions. Firstly, Transformers may encounter information loss when capturing feature interactions. By relying on inner products to represent pairwise relationships, they compress raw interaction information, which can result in a degradation of fidelity. Secondly, due to the long-tail features distribution, feature fields with low information-abundance embeddings constrain the information abundance of other fields, leading to collapsed embedding matrices. To tackle these issues, we propose a Dual Attention Framework for Enhanced Feature Interaction, known as Dual Enhanced Attention. This framework integrates two attention mechanisms: the Combo-ID attention mechanism and the collapse-avoiding attention mechanism. The Combo-ID attention mechanism directly retains feature interaction pairs to mitigate information loss, while the collapse-avoiding attention mechanism adaptively filters out low information-abundance interaction pairs to prevent interaction collapse. Extensive experiments conducted on industrial datasets have shown the effectiveness of Dual Enhanced Attention.||
|**2025-03-14**|[X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression](http://arxiv.org/abs/2503.11132)|null|Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory overhead while maintaining the performance. While MLA improves memory efficiency without compromising language model accuracy, its major limitation lies in its integration during the pre-training phase, requiring models to be trained from scratch. This raises a key question: can we use MLA's benefits fully or partially in models that have already been pre-trained with different attention mechanisms? In this paper, we propose X-EcoMLA to deploy post training distillation to enable the upcycling of Transformer-based attention into an efficient hybrid (i.e., combination of regular attention and MLA layers) or full MLA variant through lightweight post-training adaptation, bypassing the need for extensive pre-training. We demonstrate that leveraging the dark knowledge of a well-trained model can enhance training accuracy and enable extreme KV cache compression in MLA without compromising model performance. Our results show that using an 8B teacher model allows us to compress the KV cache size of the Llama3.2-1B-Inst baseline by 6.4x while preserving 100% of its average score across multiple tasks on the LM Harness Evaluation benchmark. This is achieved with only 3.6B training tokens and about 70 GPU hours on AMD MI300 GPUs, compared to the 370K GPU hours required for pre-training the Llama3.2-1B model.||
|**2025-03-14**|[Limits of KV Cache Compression for Tensor Attention based Autoregressive Transformers](http://arxiv.org/abs/2503.11108)|null|The key-value (KV) cache in autoregressive transformers presents a significant bottleneck during inference, which restricts the context length capabilities of large language models (LLMs). While previous work analyzes the fundamental space complexity barriers in standard attention mechanism [Haris and Onak, 2025], our work generalizes the space complexity barriers result to tensor attention version. Our theoretical contributions rely on a novel reduction from communication complexity and deduce the memory lower bound for tensor-structured attention mechanisms when $d = \Omega(\log n)$. In the low dimensional regime where $d = o(\log n)$ , we analyze the theoretical bounds of the space complexity as well. Overall, our work provides a theoretical foundation for us to understand the compression-expressivity tradeoff in tensor attention mechanisms and offers more perspectives in developing more memory-efficient transformer architectures.||
|**2025-03-14**|[FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection](http://arxiv.org/abs/2503.11030)|null|Camouflaged Object Detection (COD) is challenging due to the strong similarity between camouflaged objects and their surroundings, which complicates identification. Existing methods mainly rely on spatial local features, failing to capture global information, while Transformers increase computational costs. To address this, the Frequency-Assisted Mamba-Like Linear Attention Network (FMNet) is proposed, which leverages frequency-domain learning to efficiently capture global features and mitigate ambiguity between objects and the background. FMNet introduces the Multi-Scale Frequency-Assisted Mamba-Like Linear Attention (MFM) module, integrating frequency and spatial features through a multi-scale structure to handle scale variations while reducing computational complexity. Additionally, the Pyramidal Frequency Attention Extraction (PFAE) module and the Frequency Reverse Decoder (FRD) enhance semantics and reconstruct features. Experimental results demonstrate that FMNet outperforms existing methods on multiple COD datasets, showcasing its advantages in both performance and efficiency. Code available at https://anonymous.4open.science/r/FMNet-3CE5.||
|**2025-03-13**|[Predicting Stock Movement with BERTweet and Transformers](http://arxiv.org/abs/2503.10957)|null|Applying deep learning and computational intelligence to finance has been a popular area of applied research, both within academia and industry, and continues to attract active attention. The inherently high volatility and non-stationarity of the data pose substantial challenges to machine learning models, especially so for today's expressive and highly-parameterized deep learning models. Recent work that combines natural language processing on social media data with models based purely on historic price data to improve performance has received particular attention. Previous work has achieved state-of-the-art performance on this task by combining techniques such as bidirectional GRUs, variational autoencoders, word and document embeddings, self-attention, graph attention, and adversarial training. In this paper, we demonstrated the efficacy of BERTweet, a variant of BERT pre-trained specifically on a Twitter corpus, and the transformer architecture by achieving competitive performance with the existing literature and setting a new baseline for Matthews Correlation Coefficient on the Stocknet dataset without auxiliary data sources.||
|**2025-03-13**|[Towards Efficient Large Scale Spatial-Temporal Time Series Forecasting via Improved Inverted Transformers](http://arxiv.org/abs/2503.10858)|null|Time series forecasting at scale presents significant challenges for modern prediction systems, particularly when dealing with large sets of synchronized series, such as in a global payment network. In such systems, three key challenges must be overcome for accurate and scalable predictions: 1) emergence of new entities, 2) disappearance of existing entities, and 3) the large number of entities present in the data. The recently proposed Inverted Transformer (iTransformer) architecture has shown promising results by effectively handling variable entities. However, its practical application in large-scale settings is limited by quadratic time and space complexity ( $O(N^2)$) with respect to the number of entities $N$. In this paper, we introduce EiFormer, an improved inverted transformer architecture that maintains the adaptive capabilities of iTransformer while reducing computational complexity to linear scale ($O(N)$ ). Our key innovation lies in restructuring the attention mechanism to eliminate redundant computations without sacrificing model expressiveness. Additionally, we incorporate a random projection mechanism that not only enhances efficiency but also improves prediction accuracy through better feature representation. Extensive experiments on the public LargeST benchmark dataset and a proprietary large-scale time series dataset demonstrate that EiFormer significantly outperforms existing methods in both computational efficiency and forecasting accuracy. Our approach enables practical deployment of transformer-based forecasting in industrial applications where handling time series at scale is essential.||
|**2025-03-13**|[Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?](http://arxiv.org/abs/2503.10632)|null|Kolmogorov-Arnold networks (KANs) are a remarkable innovation consisting of learnable activation functions with the potential to capture more complex relationships from data. Although KANs are useful in finding symbolic representations and continual learning of one-dimensional functions, their effectiveness in diverse machine learning (ML) tasks, such as vision, remains questionable. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep network architectures, including advanced architectures such as vision Transformers (ViTs). In this paper, we are the first to design a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate on any choice of basis. However, the computing and memory costs of training them motivated us to propose a more modular version, and we designed particular learnable attention, called Fourier-KArAt. Fourier-KArAt and its variants either outperform their ViT counterparts or show comparable performance on CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures' performance and generalization capacity by analyzing their loss landscapes, weight distributions, optimizer path, attention visualization, and spectral behavior, and contrast them with vanilla ViTs. The goal of this paper is not to produce parameter- and compute-efficient attention, but to encourage the community to explore KANs in conjunction with more advanced architectures that require a careful understanding of learnable activations. Our open-source code and implementation details are available on: https://subhajitmaity.me/KArAt||
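|**2025-03-13**|[Radar: Fast Long-Context Decoding for Any Transformer](http://arxiv.org/abs/2503.10571)|**[link](https://github.com/BorealisAI/radar-decoding)**|Transformer models have demonstrated exceptional performance across a wide range of applications. Although dot-product attention is the foundation of Transformer models, it does not scale well to long-context data because its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with previous methods on a wide range of tasks. The results show that Radar achieves state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers.||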
|**2025-03-13**|[GalProTE: Galactic Properties Mapping using Transformer Encoder](http://arxiv.org/abs/2503.10106)|null|This work presents GalProTE, a proof-of-concept Machine Learning model utilizing a Transformer Encoder to determine stellar age, metallicity, and dust attenuation from optical spectra. Designed for large astronomical surveys, GalProTE significantly accelerates processing while maintaining accuracy. Using the E-MILES spectral library, we construct a dataset of 111,936 diverse templates by expanding 636 simple stellar population models with varying extinction, spectral combinations, and noise modifications. This ensures robust training over 4750 to 7100 Angstrom at 2.5 Angstrom resolution. GalProTE employs four parallel attention-based encoders with varying kernel sizes to capture spectral features. On synthetic test data, it achieves a mean squared error (MSE) of 0.27% between input and predicted spectra. Validation on PHANGS-MUSE galaxies NGC4254 and NGC5068 confirms its ability to extract physical parameters efficiently, with residuals averaging -0.02% and 0.28% and standard deviations of 4.3% and 5.3%, respectively. To contextualize these results, we compare GalProTE's age, metallicity, and dust attenuation maps with pPXF, a state-of-the-art spectral fitting tool. While pPXF requires approximately 11 seconds per spectrum, GalProTE processes one in less than 4 milliseconds, offering a 2750 times speedup and consuming 68 times less power per spectrum. The strong agreement between pPXF and GalProTE highlights the potential of machine learning to enhance traditional methods, paving the way for faster, energy-efficient, and scalable analyses of galactic properties in modern surveys.||
|**2025-03-13**|[Edge-Fog Computing-Enabled EEG Data Compression via Asymmetrical Variational Discrete Cosine Transform Network](http://arxiv.org/abs/2503.09961)|null|The large volume of electroencephalograph (EEG) data produced by brain-computer interface (BCI) systems presents challenges for rapid transmission over bandwidth-limited channels in Internet of Things (IoT) networks. To address the issue, we propose a novel multi-channel asymmetrical variational discrete cosine transform (DCT) network for EEG data compression within an edge-fog computing framework. At the edge level, low-complexity DCT compression units are designed using parallel trainable hard-thresholding and scaling operators to remove redundant data and extract the effective latent space representation. At the fog level, an adaptive filter bank is applied to merge important features from adjacent channels into each individual channel by leveraging inter-channel correlations. Then, the inverse DCT reconstructed multi-head attention is developed to capture both local and global dependencies and reconstruct the original signals. Furthermore, by applying the principles of variational inference, a new evidence lower bound is formulated as the loss function, driving the model to balance compression efficiency and reconstruction accuracy. Experimental results on two public datasets demonstrate that the proposed method achieves superior compression performance without sacrificing any useful information for BCI detection compared with state-of-the-art techniques, indicating a feasible solution for EEG data compression.||
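|**2025-03-13**|[Target-aware Bidirectional Fusion Transformer for Aerial Object Tracking](http://arxiv.org/abs/2503.09951)|null|Trackers based on lightweight neural networks have achieved great success in aerial remote sensing, and most of them aggregate multi-level deep features to improve tracking quality. However, existing algorithms usually generate only a single-level fused feature for state decisions, ignoring that identifying and locating the target require different kinds of features, which limits tracking robustness and precision. In this paper, we propose a novel target-aware Bidirectional Fusion Transformer (BFTrans) for UAV tracking. Specifically, we first present a two-stream fusion network based on linear self-attention and cross-attention, which combines shallow and deep features from both the forward and backward directions, providing adjusted local details for localization and global semantics for recognition. In addition, we design a target-aware positional encoding strategy for this fusion model, which helps it perceive target-related attributes during the fusion stage. Finally, we evaluate the proposed method on several popular UAV benchmarks, including UAV-123, UAV20L, and UAVTrack112. Extensive experimental results show that our approach outperforms other state-of-the-art trackers and runs at an average speed of 30.5 FPS on an embedded platform, which is suitable for practical drone deployment.||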
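|**2025-03-12**|[Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments](http://arxiv.org/abs/2503.09820)|null|We introduce Vision-Language Attention Distillation (Vi-LAD), a novel approach for distilling socially compliant navigation knowledge from a large Vision-Language Model (VLM) into a lightweight transformer model for real-time robotic navigation. Unlike traditional methods that rely on expert demonstrations or human-annotated datasets, Vi-LAD performs knowledge distillation and fine-tuning at the intermediate layer representation level (i.e., attention maps) by leveraging the backbone of a pre-trained vision-action model. These attention maps highlight key navigational regions in a given scene and serve as implicit guidance for socially aware motion planning. Vi-LAD fine-tunes a transformer-based model using intermediate attention maps extracted from the pre-trained vision-action model, combined with attention-like semantic maps constructed from a large VLM. To achieve this, we introduce a novel attention-level distillation loss that fuses knowledge from both sources, generating augmented attention maps with enhanced social awareness. These refined attention maps are then used as a traversability costmap within a socially aware model predictive controller (MPC) for navigation. We validate our approach through real-world experiments on a Husky wheeled robot, demonstrating significant improvements over state-of-the-art (SOTA) navigation methods. Our results show a 14.2% to 50% improvement in success rate, which highlights the effectiveness of Vi-LAD in enabling socially compliant and efficient robot navigation.||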
|**2025-03-12**|[4D-ACFNet: A 4D Attention Mechanism-Based Prognostic Framework for Colorectal Cancer Liver Metastasis Integrating Multimodal Spatiotemporal Features](http://arxiv.org/abs/2503.09652)|null|Postoperative prognostic prediction for colorectal cancer liver metastasis (CRLM) remains challenging due to tumor heterogeneity, dynamic evolution of the hepatic microenvironment, and insufficient multimodal data fusion. To address these issues, we propose 4D-ACFNet, the first framework that synergistically integrates lightweight spatiotemporal modeling, cross-modal dynamic calibration, and personalized temporal prediction within a unified architecture. Specifically, it incorporates a novel 4D spatiotemporal attention mechanism, which employs spatiotemporal separable convolution (reducing parameter count by 41%) and virtual timestamp encoding to model the interannual evolution patterns of postoperative dynamic processes, such as liver regeneration and steatosis. For cross-modal feature alignment, Transformer layers are integrated to jointly optimize modality alignment loss and disentanglement loss, effectively suppressing scale mismatch and redundant interference in clinical-imaging data. Additionally, we design a dynamic prognostic decision module that generates personalized interannual recurrence risk heatmaps through temporal upsampling and a gated classification head, overcoming the limitations of traditional methods in temporal dynamic modeling and cross-modal alignment. Experiments on 197 CRLM patients demonstrate that the model achieves 100% temporal adjacency accuracy (TAA), with performance significantly surpassing existing approaches. This study establishes the first spatiotemporal modeling paradigm for postoperative dynamic monitoring of CRLM. The proposed framework can be extended to prognostic analysis of multi-cancer metastases, advancing precision surgery from "spatial resection" to "spatiotemporal cure."||
|**2025-03-12**|[Cost-Optimal Grouped-Query Attention for Long-Context LLMs](http://arxiv.org/abs/2503.09579)|**[link](https://github.com/thunlp/cost-optimal-gqa)**|Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, requiring maximizing model language capabilities and minimizing training and deployment costs. Existing efforts have primarily described complex relationships among model performance, parameter size, and data size, as well as searched for the optimal compute allocation to train LLMs. However, they overlook the impacts of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference. In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. Then, we extend the existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs during both training and inference. Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs. Our findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios. We will publicly release our code and data.||
|**2025-03-12**|[Evaluating Visual Explanations of Attention Maps for Transformer-based Medical Imaging](http://arxiv.org/abs/2503.09535)|null|Although Vision Transformers (ViTs) have recently demonstrated superior performance in medical imaging problems, they face explainability issues similar to previous architectures such as convolutional neural networks. Recent research efforts suggest that attention maps, which are part of the decision-making process of ViTs, can potentially address the explainability issue by identifying regions influencing predictions, especially in models pretrained with self-supervised learning. In this work, we compare the visual explanations of attention maps to other commonly used methods for medical imaging problems. To do so, we employ four distinct medical imaging datasets that involve the identification of (1) colonic polyps, (2) breast tumors, (3) esophageal inflammation, and (4) bone fractures and hardware implants. Through large-scale experiments on the aforementioned datasets using various supervised and self-supervised pretrained ViTs, we find that although attention maps show promise under certain conditions and generally surpass GradCAM in explainability, they are outperformed by transformer-specific interpretability methods. Our findings indicate that the efficacy of attention maps as a method of interpretability is context-dependent and may be limited, as they do not consistently provide the comprehensive insights required for robust medical decision-making.||
|**2025-03-12**|[Performance Modeling for Correlation-based Neural Decoding of Auditory Attention to Speech](http://arxiv.org/abs/2503.09349)|null|Correlation-based auditory attention decoding (AAD) algorithms exploit neural tracking mechanisms to determine listener attention among competing speech sources via, e.g., electroencephalography signals. The correlation coefficients between the decoded neural responses and encoded speech stimuli of the different speakers then serve as AAD decision variables. A critical trade-off exists between the temporal resolution (the decision window length used to compute these correlations) and the AAD accuracy. This trade-off is typically characterized by evaluating AAD accuracy across multiple window lengths, leading to the performance curve. We propose a novel method to model this trade-off curve using labeled correlations from only a single decision window length. Our approach models the (un)attended correlations with a normal distribution after applying the Fisher transformation, enabling accurate AAD accuracy prediction across different window lengths. We validate the method on two distinct AAD implementations: a linear decoder and the non-linear VLAAI deep neural network, evaluated on separate datasets. Results show consistently low modeling errors of approximately 2 percentage points, with 94% of true accuracies falling within estimated 95% confidence intervals. The proposed method enables efficient performance curve modeling without extensive multi-window length evaluation, facilitating practical applications in, e.g., performance tracking in neuro-steered hearing devices to continuously adapt the system parameters over time.||
|**2025-03-11**|[Vision Transformer for Intracranial Hemorrhage Classification in CT Scans Using an Entropy-Aware Fuzzy Integral Strategy for Adaptive Scan-Level Decision Fusion](http://arxiv.org/abs/2503.08609)|null|Intracranial hemorrhage (ICH) is a critical medical emergency caused by the rupture of cerebral blood vessels, leading to internal bleeding within the skull. Accurate and timely classification of hemorrhage subtypes is essential for effective clinical decision-making. To address this challenge, we propose an advanced pyramid vision transformer (PVT)-based model, leveraging its hierarchical attention mechanisms to capture both local and global spatial dependencies in brain CT scans. Instead of processing all extracted features indiscriminately, a SHAP-based feature selection method is employed to identify the most discriminative components, which are then used as a latent feature space to train a boosting neural network, reducing computational complexity. We introduce an entropy-aware aggregation strategy along with a fuzzy integral operator to fuse information across multiple CT slices, ensuring a more comprehensive and reliable scan-level diagnosis by accounting for inter-slice dependencies. Experimental results show that our PVT-based framework significantly outperforms state-of-the-art deep learning architectures in terms of classification accuracy, precision, and robustness. By combining SHAP-driven feature selection, transformer-based modeling, and an entropy-aware fuzzy integral operator for decision fusion, our method offers a scalable and computationally efficient AI-driven solution for automated ICH subtype classification.||
|**2025-03-11**|[ChromaFormer: A Scalable and Accurate Transformer Architecture for Land Cover Classification](http://arxiv.org/abs/2503.08534)|null|Remote sensing imagery from systems such as Sentinel provides full coverage of the Earth's surface at around 10-meter resolution. The remote sensing community has transitioned to extensive use of deep learning models due to their high performance on benchmarks such as the UCMerced and ISPRS Vaihingen datasets. Convolutional models such as UNet and ResNet variations are commonly employed for remote sensing but typically only accept three channels, as they were developed for RGB imagery, while satellite systems provide more than ten. Recently, several transformer architectures have been proposed for remote sensing, but they have not been extensively benchmarked and are typically used on small datasets such as Salinas Valley. Meanwhile, it is becoming feasible to obtain dense spatial land-use labels for entire first-level administrative divisions of some countries. Scaling law observations suggest that substantially larger multi-spectral transformer models could provide a significant leap in remote sensing performance in these settings. In this work, we propose ChromaFormer, a family of multi-spectral transformer models, which we evaluate across orders of magnitude differences in model parameters to assess their performance and scaling effectiveness on a densely labeled imagery dataset of Flanders, Belgium, covering more than 13,500 km^2 and containing 15 classes. We propose a novel multi-spectral attention strategy and demonstrate its effectiveness through ablations. Furthermore, we show that models many orders of magnitude larger than conventional architectures, such as UNet, lead to substantial accuracy improvements: a UNet++ model with 23M parameters achieves less than 65% accuracy, while a multi-spectral transformer with 655M parameters achieves over 95% accuracy on the Biological Valuation Map of Flanders.||
|**2025-03-11**|[EnergyFormer: Energy Attention with Fourier Embedding for Hyperspectral Image Classification](http://arxiv.org/abs/2503.08239)|null|Hyperspectral imaging (HSI) provides rich spectral-spatial information across hundreds of contiguous bands, enabling precise material discrimination in applications such as environmental monitoring, agriculture, and urban analysis. However, the high dimensionality and spectral variability of HSI data pose significant challenges for feature extraction and classification. This paper presents EnergyFormer, a transformer-based framework designed to address these challenges through three key innovations: (1) Multi-Head Energy Attention (MHEA), which optimizes an energy function to selectively enhance critical spectral-spatial features, improving feature discrimination; (2) Fourier Position Embedding (FoPE), which adaptively encodes spectral and spatial dependencies to reinforce long-range interactions; and (3) Enhanced Convolutional Block Attention Module (ECBAM), which selectively amplifies informative wavelength bands and spatial structures, enhancing representation learning. Extensive experiments on the WHU-Hi-HanChuan, Salinas, and Pavia University datasets demonstrate that EnergyFormer achieves exceptional overall accuracies of 99.28%, 98.63%, and 98.72%, respectively, outperforming state-of-the-art CNN, transformer, and Mamba-based models. The source code will be made available at https://github.com/mahmad000.||
|**2025-03-11**|[HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views](http://arxiv.org/abs/2503.08140)|null|We present HOTFormerLoc, a novel and versatile Hierarchical Octree-based Transformer, for large-scale 3D place recognition in both ground-to-ground and ground-to-aerial scenarios across urban and forest environments. We propose an octree-based multi-scale attention mechanism that captures spatial and semantic features across granularities. To address the variable density of point distributions from spinning lidar, we present cylindrical octree attention windows to reflect the underlying distribution during attention. We introduce relay tokens to enable efficient global-local interactions and multi-scale representation learning at reduced computational cost. Our pyramid attentional pooling then synthesises a robust global descriptor for end-to-end place recognition in challenging environments. In addition, we introduce CS-Wild-Places, a novel 3D cross-source dataset featuring point cloud data from aerial and ground lidar scans captured in dense forests. Point clouds in CS-Wild-Places contain representational gaps and distinctive attributes such as varying point densities and noise patterns, making it a challenging benchmark for cross-view localisation in the wild. HOTFormerLoc achieves a top-1 average recall improvement of 5.5% to 11.5% on the CS-Wild-Places benchmark. Furthermore, it consistently outperforms SOTA 3D place recognition methods, with an average performance gain of 5.8% on well-established urban and forest datasets. The code and the CS-Wild-Places benchmark are available at https://csiro-robotics.github.io/HOTFormerLoc.||
|**2025-03-11**|[Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning](http://arxiv.org/abs/2503.08101)|**[link](https://github.com/iseri27/tg_gbc)**|Query-based methods with dense features have demonstrated remarkable success in 3D object detection tasks. However, the computational demands of these models, particularly with large image sizes and multiple transformer layers, pose significant challenges for efficient running on edge devices. Existing pruning and distillation methods either need retraining or are designed for ViT models, which are hard to migrate to 3D detectors. To address this issue, we propose a zero-shot runtime pruning method for transformer decoders in 3D object detection models. The method, termed tgGBC (trim keys gradually Guided By Classification scores), systematically trims keys in transformer modules based on their importance. We expand the classification score to multiply it with the attention map to get the importance score of each key and then prune certain keys after each transformer layer according to their importance scores. Our method achieves a 1.99x speedup in the transformer decoder of the latest ToC3D model, with only a minimal performance loss of less than 1%. Interestingly, for certain models, our method even enhances their performance. Moreover, we deploy 3D detectors with tgGBC on an edge device, further validating the effectiveness of our method. The code can be found at https://github.com/iseri27/tg_gbc.||
|**2025-03-11**|[From Slices to Sequences: Autoregressive Tracking Transformer for Cohesive and Consistent 3D Lymph Node Detection in CT Scans](http://arxiv.org/abs/2503.07933)|null|Lymph node (LN) assessment is an essential task in the routine radiology workflow, providing valuable insights for cancer staging, treatment planning and beyond. Identifying scattered, low-contrast LNs in 3D CT scans is highly challenging, even for experienced clinicians. Previous lesion and LN detection methods demonstrate the effectiveness of 2.5D approaches (i.e., using a 2D network with multi-slice inputs), leveraging pretrained 2D model weights and showing improved accuracy compared to separate 2D or 3D detectors. However, slice-based 2.5D detectors do not explicitly model inter-slice consistency for an LN as a 3D object, requiring heuristic post-merging steps to generate final 3D LN instances, which can involve tuning a set of parameters for each dataset. In this work, we formulate 3D LN detection as a tracking task and propose LN-Tracker, a novel LN tracking transformer, for joint end-to-end detection and 3D instance association. Built upon a DETR-based detector, LN-Tracker decouples the transformer decoder's queries into track and detection groups, where the track query autoregressively follows previously tracked LN instances along the z-axis of a CT scan. We design a new transformer decoder with a masked attention module to align the track query's content to the context of the current slice, while preserving the detection query's high accuracy in the current slice. An inter-slice similarity loss is introduced to encourage cohesive LN association between slices. Extensive evaluation on four lymph node datasets shows LN-Tracker's superior performance, with at least a 2.7% gain in average sensitivity when compared to other top 3D/2.5D detectors. Further validation on public lung nodule and prostate tumor detection tasks confirms the generalizability of LN-Tracker as it achieves top performance on both tasks. Datasets will be released upon acceptance.||
|**2025-03-10**|[PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity](http://arxiv.org/abs/2503.07677)|null|Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or neural function evaluations (NFEs), making them incompatible with guidance-distilled models. Also, they rely on heuristic approaches that need identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, our PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they once struggled with newfound effectiveness. It integrates seamlessly with guidance techniques, including guidance-distilled models. Extensive experiments show notable improvements in text alignment and human preference, offering a highly efficient and universally applicable solution.||
|**2025-03-10**|[Distilling Knowledge into Quantum Vision Transformers for Biomedical Image Classification](http://arxiv.org/abs/2503.07294)|null|Quantum vision transformers (QViTs) build on vision transformers (ViTs) by replacing linear layers within the self-attention mechanism with parameterised quantum neural networks (QNNs), harnessing quantum mechanical properties to improve feature representation. This hybrid approach aims to achieve superior performance, with significantly reduced model complexity as a result of the enriched feature representation, requiring fewer parameters. This paper proposes a novel QViT model for biomedical image classification and investigates its performance against comparable ViTs across eight diverse datasets, encompassing various modalities and classification tasks. We assess models trained from scratch and those pre-trained using knowledge distillation (KD) from high-quality teacher models. Our findings demonstrate that QViTs outperform comparable ViTs with average ROC AUC (0.863 vs 0.846) and accuracy (0.710 vs 0.687) when trained from scratch, and even compete with state-of-the-art classical models in multiple tasks, whilst being significantly more efficient (89% reduction in GFLOPs and 99.99% in parameter number). Additionally, we find that QViTs and ViTs respond equally well to KD, with QViT pre-training performance scaling with model complexity. This is the first investigation into the efficacy of deploying QViTs with KD for computer-aided diagnosis. Our results highlight the enormous potential of quantum machine learning (QML) in biomedical image analysis.||
|**2025-03-10**|[A LSTM-Transformer Model for pulsation control of pVADs](http://arxiv.org/abs/2503.07110)|null|Methods: A pulsation control method for a pVAD is proposed (AP-pVAD Model). The AP-pVAD Model consists of two parts: the NPQ Model and the LSTM-Transformer Model. (1) The NPQ Model determines the mathematical relationship between motor speed, pressure, and flow rate for the pVAD. (2) The attention module of the Transformer neural network is integrated into the LSTM neural network to form the new LSTM-Transformer Model, which predicts the pulsation time characteristic points used to adjust the motor speed of the pVAD. Results: The AP-pVAD Model is validated in three hydraulic experiments and an animal experiment. (1) The pressure provided by the pVAD, as calculated with the NPQ Model, has a maximum error of only 2.15 mmHg compared to the expected values. (2) The pulsation time characteristic points predicted by the LSTM-Transformer Model show a maximum prediction error of 1.78 ms, which is significantly lower than that of other methods. (3) The in-vivo test of the pVAD in the animal experiment shows significant improvements in aortic pressure. Animals survive for over 27 hours after the initiation of pVAD operation. Conclusion: (1) For a given pVAD, motor speed has a linear relationship with pressure and a quadratic relationship with flow. (2) Deep learning can be used to predict pulsation characteristic time points, with the LSTM-Transformer Model demonstrating minimal prediction error and more robust performance under conditions of limited dataset sizes, elevated noise levels, and diverse hyperparameter combinations, demonstrating its feasibility and effectiveness.||
|**2025-03-10**|[EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer](http://arxiv.org/abs/2503.07027)|null|Recent advancements in Unet-based diffusion models, such as ControlNet and IP-Adapter, have introduced effective spatial and subject control mechanisms. However, the DiT (Diffusion Transformer) architecture still struggles with efficient and flexible control. To tackle this issue, we propose EasyControl, a novel framework designed to unify condition-guided diffusion transformers with high efficiency and flexibility. Our framework is built on three key innovations. First, we introduce a lightweight Condition Injection LoRA Module. This module processes conditional signals in isolation, acting as a plug-and-play solution. It avoids modifying the base model weights, ensuring compatibility with customized models and enabling the flexible injection of diverse conditions. Notably, this module also supports harmonious and robust zero-shot multi-condition generalization, even when trained only on single-condition data. Second, we propose a Position-Aware Training Paradigm. This approach standardizes input conditions to fixed resolutions, allowing the generation of images with arbitrary aspect ratios and flexible resolutions. At the same time, it optimizes computational efficiency, making the framework more practical for real-world applications. Third, we develop a Causal Attention Mechanism combined with the KV Cache technique, adapted for conditional generation tasks. This innovation significantly reduces the latency of image synthesis, improving the overall efficiency of the framework. Through extensive experiments, we demonstrate that EasyControl achieves exceptional performance across various application scenarios. These innovations collectively make our framework highly efficient, flexible, and suitable for a wide range of tasks.||
|**2025-03-07**|[Deep Frequency Attention Networks for Single Snapshot Sparse Array Interpolation](http://arxiv.org/abs/2503.05486)|null|Sparse arrays have been widely exploited in radar systems because of their advantages in achieving large array aperture at low hardware cost, while significantly reducing mutual coupling. However, sparse arrays suffer from high sidelobes which may lead to false detections. Missing elements in sparse arrays can be interpolated using the sparse array measurements. In snapshot-limited scenarios, such as automotive radar, it is challenging to utilize difference coarrays which require a large number of snapshots to construct a covariance matrix for interpolation. For single snapshot sparse array interpolation, traditional model-based methods, while effective, require expert knowledge for hyperparameter tuning, lack task-specific adaptability, and incur high computational costs. In this paper, we propose a novel deep learning-based single snapshot sparse array interpolation network that addresses these challenges by leveraging a frequency-domain attention mechanism. The proposed approach transforms the sparse signal into the frequency domain, where the attention mechanism focuses on key spectral regions, enabling improved interpolation of missing elements even in low signal-to-noise ratio (SNR) conditions. By minimizing computational costs and enhancing interpolation accuracy, the proposed method demonstrates superior performance compared to traditional approaches, making it well-suited for automotive radar applications.||
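|**2025-03-07**|[ColFigPhotoAttnNet: Reliable Finger Photo Presentation Attack Detection Leveraging Window-Attention on Color Spaces](http://arxiv.org/abs/2503.05247)|**[link](https://github.com/avurity/ColFigPhotoAttnNet)**|Finger photo presentation attack detection (PAD) can significantly strengthen the security of smartphone devices. However, these algorithms are trained to detect specific types of attacks. In addition, they are designed to operate on images acquired by specific capture devices, which leads to poor generalization and a lack of robustness to the evolving nature of mobile hardware. This study is the first to systematically analyze the performance degradation of existing deep learning PAD systems (convolutional and Transformer-based) in cross-capture-device settings. In this paper, we introduce the ColFigPhotoAttnNet architecture, which is based on window attention over color channels, followed by a nested residual network as the predictor, to achieve reliable PAD. We conducted extensive experiments with a variety of capture devices, including iPhone 13 Pro, Google Pixel 3, Nokia C5, and OnePlus One, to evaluate the performance of the proposed and existing methods on three publicly available databases. The findings underscore the effectiveness of our approach.||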
|**2025-03-06**|[HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization](http://arxiv.org/abs/2503.04598)|**[link](https://github.com/brycezhuo/hybridnorm)**|Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, challenges remain in training deep transformer networks, especially regarding the location of layer normalization. While Pre-Norm structures facilitate easier training due to their more prominent identity path, they often yield suboptimal performance compared to Post-Norm. In this paper, we propose **HybridNorm**, a straightforward yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm approaches. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. This design not only stabilizes training but also enhances performance, particularly in the context of LLMs. Comprehensive experiments in both dense and sparse architectures show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches, achieving state-of-the-art results across various benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.||
|**2025-03-06**|[Interpretable Transformation and Analysis of Timelines through Learning via Surprisability](http://arxiv.org/abs/2503.04502)|null|The analysis of high-dimensional timeline data and the identification of outliers and anomalies is critical across diverse domains, including sensor readings, biological and medical data, historical records, and global statistics. However, conventional analysis techniques often struggle with challenges such as high dimensionality, complex distributions, and sparsity. These limitations hinder the ability to extract meaningful insights from complex temporal datasets, making it difficult to identify trending features, outliers, and anomalies effectively. Inspired by surprisability -- a cognitive science concept describing how humans instinctively focus on unexpected deviations - we propose Learning via Surprisability (LvS), a novel approach for transforming high-dimensional timeline data. LvS quantifies and prioritizes anomalies in time-series data by formalizing deviations from expected behavior. LvS bridges cognitive theories of attention with computational methods, enabling the detection of anomalies and shifts in a way that preserves critical context, offering a new lens for interpreting complex datasets. We demonstrate the usefulness of LvS on three high-dimensional timeline use cases: a time series of sensor data, a global dataset of mortality causes over multiple years, and a textual corpus containing over two centuries of State of the Union Addresses by U.S. presidents. Our results show that the LvS transformation enables efficient and interpretable identification of outliers, anomalies, and the most variable features along the timeline.||
|**2025-03-06**|[Learning Transformer-based World Models with Contrastive Predictive Coding](http://arxiv.org/abs/2503.04416)|null|The DreamerV3 algorithm recently obtained remarkable performance across diverse environment domains by learning an accurate world model based on Recurrent Neural Networks (RNNs). Following the success of model-based reinforcement learning algorithms and the rapid adoption of the Transformer architecture for its superior training efficiency and favorable scaling properties, recent works such as STORM have proposed replacing RNN-based world models with Transformer-based world models using masked self-attention. However, despite the improved training efficiency of these methods, their impact on performance remains limited compared to the Dreamer algorithm, struggling to learn competitive Transformer-based world models. In this work, we show that the next state prediction objective adopted in previous approaches is insufficient to fully exploit the representation capabilities of Transformers. We propose to extend world model predictions to longer time horizons by introducing TWISTER (Transformer-based World model wIth contraSTivE Representations), a world model using action-conditioned Contrastive Predictive Coding to learn high-level temporal feature representations and improve the agent performance. TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ look-ahead search.||
|**2025-03-07**|[LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding](http://arxiv.org/abs/2503.04344)|null|Diffusion transformers (DiTs) struggle to generate images at resolutions higher than their training resolutions. The primary obstacle is that explicit positional encodings (PE), such as RoPE, need extrapolation, which degrades performance when the inference resolution differs from training. In this paper, we propose a Length-Extrapolatable Diffusion Transformer (LEDiT), a simple yet powerful architecture to overcome this limitation. LEDiT needs no explicit PEs, thereby avoiding extrapolation. The key innovations of LEDiT are introducing causal attention to implicitly impart global positional information to tokens, while enhancing locality to precisely distinguish adjacent tokens. Experiments on 256x256 and 512x512 ImageNet show that LEDiT can scale the inference resolution to 512x512 and 1024x1024, respectively, while achieving better image quality compared to current state-of-the-art length extrapolation methods (NTK-aware, YaRN). Moreover, LEDiT achieves strong extrapolation performance with just 100K steps of fine-tuning on a pretrained DiT, demonstrating its potential for integration into existing text-to-image DiTs. Project page: https://shenzhang2145.github.io/ledit/||
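|**2025-03-06**|[Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation](http://arxiv.org/abs/2503.04151)|null|In recent years, multi-view learning (MVL) has attracted considerable attention for its ability to fuse discriminative information from multiple views. However, real-world multi-view datasets are often heterogeneous and imperfect, which usually leaves MVL methods designed for specific view combinations with little application potential and limits their effectiveness. To address this issue, we propose a novel robust MVL method (RML for short) that performs representation fusion and alignment simultaneously. Specifically, we introduce a simple yet effective multi-view Transformer fusion network that converts heterogeneous multi-view data into homogeneous word embeddings and then integrates the views through a sample-level attention mechanism to obtain a fused representation. Furthermore, we propose a multi-view contrastive learning framework based on simulated perturbation, which dynamically generates noise and unusable perturbations to simulate imperfect data conditions. The simulated noisy and unusable data yield two distinct fused representations, and we use contrastive learning to align them in order to learn discriminative and robust representations. Our RML is self-supervised and can also be applied to downstream tasks as a regularizer. In experiments, we apply it to unsupervised multi-view clustering and noisy-label classification, and use it as a plug-and-play module for cross-modal hashing retrieval. Extensive comparison experiments and ablation studies validate the effectiveness of RML.||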
|**2025-03-06**|[SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer](http://arxiv.org/abs/2503.04119)|null|Attention-based arbitrary style transfer methods, including CNN-based, Transformer-based, and Diffusion-based, have flourished and produced high-quality stylized images. However, they perform poorly on content and style images with the same semantics, i.e., the style of the corresponding semantic region of the generated stylized image is inconsistent with that of the style image. We argue that the root cause lies in their failure to consider the relationship between local regions and semantic regions. To address this issue, we propose a plug-and-play semantic continuous-sparse attention, dubbed SCSA, for arbitrary semantic style transfer -- each query point considers certain key points in the corresponding semantic region. Specifically, semantic continuous attention ensures each query point fully attends to all the continuous key points in the same semantic region that reflect the overall style characteristics of that region; semantic sparse attention allows each query point to focus on the most similar sparse key point in the same semantic region that exhibits the specific stylistic texture of that region. By combining the two modules, the resulting SCSA aligns the overall style of the corresponding semantic regions while transferring the vivid textures of these regions. Qualitative and quantitative results show that SCSA enables attention-based arbitrary style transfer methods to produce high-quality semantic stylized images.||
|**2025-03-06**|[DTU-Net: A Multi-Scale Dilated Transformer Network for Nonlinear Hyperspectral Unmixing](http://arxiv.org/abs/2503.03465)|null|Transformers have shown significant success in hyperspectral unmixing (HU). However, challenges remain. While multi-scale and long-range spatial correlations are essential in unmixing tasks, current Transformer-based unmixing networks, built on Vision Transformer (ViT) or Swin-Transformer, struggle to capture them effectively. Additionally, current Transformer-based unmixing networks rely on the linear mixing model, which lacks the flexibility to accommodate scenarios where nonlinear effects are significant. To address these limitations, we propose a multi-scale Dilated Transformer-based unmixing network for nonlinear HU (DTU-Net). The encoder employs two branches. The first one performs multi-scale spatial feature extraction using Multi-Scale Dilated Attention (MSDA) in the Dilated Transformer, which varies dilation rates across attention heads to capture long-range and multi-scale spatial correlations. The second one performs spectral feature extraction utilizing 3D-CNNs with channel attention. The outputs from both branches are then fused to integrate multi-scale spatial and spectral information, which is subsequently transformed to estimate the abundances. The decoder is designed to accommodate both linear and nonlinear mixing scenarios. Its interpretability is enhanced by explicitly modeling the relationships between endmembers, abundances, and nonlinear coefficients in accordance with the polynomial post-nonlinear mixing model (PPNMM). Experiments on synthetic and real datasets validate the effectiveness of the proposed DTU-Net compared to PPNMM-derived methods and several advanced unmixing networks.||
|**2025-03-05**|[ScaleFusionNet: Transformer-Guided Multi-Scale Feature Fusion for Skin Lesion Segmentation](http://arxiv.org/abs/2503.03327)|**[link](https://github.com/sqbqamar/scalefusionnet)**|Melanoma is a malignant tumor originating from skin cell lesions. Accurate and efficient segmentation of skin lesions is essential for quantitative medical analysis but remains challenging. To address this, we propose ScaleFusionNet, a segmentation model that integrates Cross-Attention Transformer Module (CATM) and AdaptiveFusionBlock to enhance feature extraction and fusion. The model employs a hybrid architecture encoder that effectively captures both local and global features. We introduce CATM, which utilizes Swin Transformer Blocks and Cross Attention Fusion (CAF) to adaptively refine encoder-decoder feature fusion, reducing semantic gaps and improving segmentation accuracy. Additionally, the AdaptiveFusionBlock is improved by integrating adaptive multi-scale fusion, where Swin Transformer-based attention complements deformable convolution-based multi-scale feature extraction. This enhancement refines lesion boundaries and preserves fine-grained details. ScaleFusionNet achieves Dice scores of 92.94% and 91.65% on ISIC-2016 and ISIC-2018 datasets, respectively, demonstrating its effectiveness in skin lesion analysis. Our code implementation is publicly available at GitHub.||
|**2025-03-05**|[See What You Are Told: Visual Attention Sink in Large Multimodal Models](http://arxiv.org/abs/2503.03321)|null|Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder. Ideally, these models should focus on key visual information relevant to the text token. However, recent findings indicate that LMMs have an extraordinary tendency to consistently allocate high attention weights to specific visual tokens, even when these tokens are irrelevant to the corresponding text. In this study, we investigate the property behind the appearance of these irrelevant visual tokens and examine their characteristics. Our findings show that this behavior arises due to the massive activation of certain hidden state dimensions, which resembles the attention sink found in language models. Hence, we refer to this phenomenon as the visual attention sink. In particular, our analysis reveals that removing the irrelevant visual sink tokens does not impact model performance, despite receiving high attention weights. Consequently, we recycle the attention to these tokens as surplus resources, redistributing the attention budget to enhance focus on the image. To achieve this, we introduce Visual Attention Redistribution (VAR), a method that redistributes attention in image-centric heads, which we identify as innately focusing on visual information. VAR can be seamlessly applied across different LMMs to improve performance on a wide range of tasks, including general vision-language tasks, visual hallucination tasks, and vision-centric tasks, all without the need for additional training, models, or inference steps. Experimental results demonstrate that VAR enables LMMs to process visual information more effectively by adjusting their internal attention mechanisms, offering a new direction to enhancing the multimodal capabilities of LMMs.||
|**2025-03-05**|[Conformal Transformations for Symmetric Power Transformers](http://arxiv.org/abs/2503.03269)|null|Transformers with linear attention offer significant computational advantages over softmax-based transformers but often suffer from degraded performance. The symmetric power (sympow) transformer, a particular type of linear transformer, addresses some of this performance gap by leveraging symmetric tensor embeddings, achieving comparable performance to softmax transformers. However, the finite capacity of the recurrent state in sympow transformers limits their ability to retain information, leading to performance degradation when scaling the training or evaluation context length. To address this issue, we propose the conformal-sympow transformer, which dynamically frees up capacity using data-dependent multiplicative gating and adaptively stores information using data-dependent rotary embeddings. Preliminary experiments on the LongCrawl64 dataset demonstrate that conformal-sympow overcomes the limitations of sympow transformers, achieving robust performance across scaled training and evaluation contexts.||
|**2025-03-04**|[Boltzmann Attention Sampling for Image Analysis with Small Objects](http://arxiv.org/abs/2503.02841)|null|Detecting and segmenting small objects, such as lung nodules and tumor lesions, remains a critical challenge in image analysis. These objects often occupy less than 0.1% of an image, making traditional transformer architectures inefficient and prone to performance degradation due to redundant attention computations on irrelevant regions. Existing sparse attention mechanisms rely on rigid hierarchical structures, which are poorly suited for detecting small, variable, and uncertain object locations. In this paper, we propose BoltzFormer, a novel transformer-based architecture designed to address these challenges through dynamic sparse attention. BoltzFormer identifies and focuses attention on relevant areas by modeling uncertainty using a Boltzmann distribution with an annealing schedule. Initially, a higher temperature allows broader area sampling in early layers, when object location uncertainty is greatest. As the temperature decreases in later layers, attention becomes more focused, enhancing efficiency and accuracy. BoltzFormer seamlessly integrates into existing transformer architectures via a modular Boltzmann attention sampling mechanism. Comprehensive evaluations on benchmark datasets demonstrate that BoltzFormer significantly improves segmentation performance for small objects while reducing attention computation by an order of magnitude compared to previous state-of-the-art methods.||
|**2025-03-04**|[TReND: Transformer derived features and Regularized NMF for neonatal functional network Delineation](http://arxiv.org/abs/2503.02685)|null|Precise parcellation of functional networks (FNs) of the early developing human brain is the fundamental basis for identifying biomarkers of developmental disorders and understanding functional development. Resting-state fMRI (rs-fMRI) enables in vivo exploration of functional changes, but adult FN parcellations cannot be directly applied to neonates due to incomplete network maturation. No standardized neonatal functional atlas is currently available. To solve this fundamental issue, we propose TReND, a novel and fully automated self-supervised transformer-autoencoder framework that integrates regularized nonnegative matrix factorization (RNMF) to unveil the FNs in neonates. TReND effectively disentangles spatiotemporal features in voxel-wise rs-fMRI data. The framework integrates confidence-adaptive masks into transformer self-attention layers to mitigate noise influence. A self-supervised decoder acts as a regulator to refine the encoder's latent embeddings, which serve as reliable temporal features. For spatial coherence, we incorporate brain surface-based geodesic distances as spatial encodings along with functional connectivity from temporal features. The TReND clustering approach processes these features under sparsity and smoothness constraints, producing robust and biologically plausible parcellations. We extensively validated our TReND framework on three different rs-fMRI datasets (simulated, dHCP, and HCP-YA) against comparable traditional feature extraction and clustering techniques. Our results demonstrated the superiority of the TReND framework in the delineation of neonate FNs with significantly better spatial contiguity and functional homogeneity. Collectively, we established TReND, a novel and robust framework, for neonatal FN delineation. TReND-derived neonatal FNs could serve as a neonatal functional atlas for perinatal populations in health and disease.||
|**2025-03-04**|[Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer](http://arxiv.org/abs/2503.02495)|**[link](https://github.com/yujiaoyang-work/uoe)**|Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, each expert in existing MoE paradigms works as an individual, thereby lacking high-quality expert interactions. Moreover, MoE has not been effectively extended to attention blocks, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes a transformer into an equivalent group of experts and then implements dynamic routing on input data and experts. Our approach advances MoE design with four key innovations: (1) We conduct equivalent expert decomposition on both MLP blocks and attention blocks based on matrix partition in tensor parallelism. (2) We develop two routing paradigms, patch-wise data selection and expert selection, to apply routing across different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop a parallel implementation of UoE's routing and computation operations, and optimize efficiency based on hardware processing analysis. The experiments demonstrate that models equipped with UoE surpass Full Attention, state-of-the-art MoEs, and efficient transformers in several tasks across image and natural language domains. The source code is available at https://github.com/YujiaoYang-work/UoE.||
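|**2025-03-04**|[Exploring Token-Level Augmentation in Vision Transformer for Semi-Supervised Semantic Segmentation](http://arxiv.org/abs/2503.02459)|null|Semi-supervised semantic segmentation has made remarkable progress in recent years. However, existing algorithms are based on convolutional neural networks, and applying them directly to Vision Transformers has certain limitations because of conceptual differences between the two. To this end, we propose TokenMix, a data augmentation technique designed specifically for semi-supervised semantic segmentation with Vision Transformers. By mixing images at the token level, TokenMix fits well with the global attention mechanism and strengthens the learning of contextual information among image patches. We further combine image augmentation and feature augmentation to increase the diversity of augmentation. Moreover, to enhance consistency regularization, we propose a dual-branch framework in which each branch applies both image augmentation and feature augmentation to the input image. We conduct extensive experiments on multiple benchmark datasets, including Pascal VOC 2012, Cityscapes, and COCO. The results show that the proposed method outperforms existing state-of-the-art algorithms and notably improves accuracy, especially when fine annotations are limited.||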
|**2025-03-04**|[BHViT: Binarized Hybrid Vision Transformer](http://arxiv.org/abs/2503.02394)|**[link](https://github.com/IMRL/BHViT)**|Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNN), offering a potential solution to the deployment challenges faced by Vision Transformers (ViTs) on edge devices. However, due to the structural differences between CNN and Transformer architectures, simply applying binary CNN strategies to the ViT models will lead to a significant performance drop. To tackle this challenge, we propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations. Initially, BHViT utilizes the local information interaction and hierarchical feature aggregation technique from coarse to fine levels to address redundant computations stemming from excessive tokens. Then, a novel module based on shift operations is proposed to enhance the performance of the binary Multilayer Perceptron (MLP) module without significantly increasing computational overhead. In addition, an innovative attention matrix binarization method based on quantization decomposition is proposed to evaluate the token's importance in the binarized attention matrix. Finally, we propose a regularization loss to address the inadequate optimization caused by the incompatibility between the weight oscillation in the binary layers and the Adam Optimizer. Extensive experimental results demonstrate that our proposed algorithm achieves SOTA performance among binary ViT methods.||
|**2025-03-04**|[BdSLW401: Transformer-Based Word-Level Bangla Sign Language Recognition Using Relative Quantization Encoding (RQE)](http://arxiv.org/abs/2503.02360)|null|Sign language recognition (SLR) for low-resource languages like Bangla suffers from signer variability, viewpoint variations, and limited annotated datasets. In this paper, we present BdSLW401, a large-scale, multi-view, word-level Bangla Sign Language (BdSL) dataset with 401 signs and 102,176 video samples from 18 signers in front and lateral views. To improve transformer-based SLR, we introduce Relative Quantization Encoding (RQE), a structured embedding approach that anchors landmarks to physiological reference points and quantizes motion trajectories. RQE improves attention allocation by decreasing spatial variability, resulting in a 44.3% WER reduction on WLASL100, 21.0% on SignBD-200, and significant gains on BdSLW60 and SignBD-90. However, fixed quantization becomes insufficient on large-scale datasets (e.g., WLASL2000), indicating the need for adaptive encoding strategies. Further, RQE-SF, an extended variant that stabilizes shoulder landmarks, achieves improvements in pose consistency at the cost of small trade-offs in lateral view recognition. The attention graphs indicate that RQE improves model interpretability by focusing on the major articulatory features (fingers, wrists) and the more distinctive frames instead of global pose changes. By introducing BdSLW401 and demonstrating the effectiveness of RQE-enhanced structured embeddings, this work advances transformer-based SLR for low-resource languages and sets a benchmark for future research in this area.||
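|**2025-03-03**|[Forgetting Transformer: Softmax Attention with a Forget Gate](http://arxiv.org/abs/2503.02130)|**[link](https://github.com/zhixuan-lin/forgetting-transformer)**|An essential component of modern recurrent sequence models is the forget gate. Although Transformers have no explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-a-haystack test, show that FoX also retains the Transformer's superior long-context capabilities relative to recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some architectural components common in recurrent sequence models, and find that it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.||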
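|**2025-03-03**|[Attention Condensation via Sparsity Induced Regularized Training](http://arxiv.org/abs/2503.01564)|null|As the context window expands, self-attention increasingly dominates the inference time of Transformer models. Accelerating attention computation while minimizing performance degradation is therefore essential for the efficient deployment of large language models (LLMs). In this study, we extend a theoretical framework of attention sparsity in LLMs. We design a customized loss function that enforces sparsity by restricting the number of top elements in the attention matrix. We perform an initial set of evaluations with GPT-2 to show the effectiveness of our sparsification approach. The attention matrices of models trained with the proposed loss are both sparse and effective at capturing relevant input dependencies. We are continuing this work to demonstrate the value of our approach on larger models and different architectures.||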
|**2025-03-04**|[Hierarchical Causal Transformer with Heterogeneous Information for Expandable Sequential Recommendation](http://arxiv.org/abs/2503.01469)|null|Sequential recommendation systems leveraging transformer architectures have demonstrated exceptional capabilities in capturing user behavior patterns. At the core of these systems lies the critical challenge of constructing effective item representations. Traditional approaches employ feature fusion through simple concatenation or basic neural architectures to create uniform representation sequences. However, these conventional methods fail to address the intrinsic diversity of item attributes, thereby constraining the transformer's capacity to discern fine-grained patterns and hindering model extensibility. Although recent research has begun incorporating user-related heterogeneous features into item sequences, the equally crucial item-side heterogeneous features continue to be neglected. To bridge this methodological gap, we present HeterRec, an innovative framework featuring two novel components: the Heterogeneous Token Flattening Layer (HTFL) and the Hierarchical Causal Transformer (HCT). HTFL introduces a tokenization mechanism that decomposes items into multi-dimensional token sets and structures them into heterogeneous sequences, enabling scalable performance enhancement through model expansion. The HCT architecture further enhances pattern discovery through token-level and item-level attention mechanisms. Furthermore, we develop a Listwise Multi-step Prediction (LMP) objective function to optimize the learning process. Rigorous validation, including on real-world industrial platforms, confirms HeterRec's state-of-the-art performance in both effectiveness and efficiency.||
|**2025-03-03**|[Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning](http://arxiv.org/abs/2503.01329)|null|Recent advancements in large language models (LLMs) based on transformer architectures have sparked significant interest in understanding their inner workings. In this paper, we introduce a novel approach to modeling transformer architectures using highly flexible non-autonomous neural ordinary differential equations (ODEs). Our proposed model parameterizes all weights of attention and feed-forward blocks through neural networks, expressing these weights as functions of a continuous layer index. Through spectral analysis of the model's dynamics, we uncover an increase in eigenvalue magnitude that challenges the weight-sharing assumption prevalent in existing theoretical studies. We also leverage the Lyapunov exponent to examine token-level sensitivity, enhancing model interpretability. Our neural ODE transformer demonstrates performance comparable to or better than vanilla transformers across various configurations and datasets, while offering flexible fine-tuning capabilities that can adapt to different architectural constraints.||
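|**2025-02-28**|[Training-free and Adaptive Sparse Attention for Efficient Long Video Generation](http://arxiv.org/abs/2502.21079)|null|Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency, primarily due to the computational demands of attention mechanisms. For instance, generating an 8-second 720p video (110K tokens) with HunyuanVideo takes about 600 PFLOPs, of which around 500 PFLOPs are consumed by attention computation. To address this issue, we propose AdaSpa, the first sparse attention method with Dynamic Pattern and Online Precise Search. First, to realize the Dynamic Pattern, we introduce a blockified pattern that efficiently captures the hierarchical sparsity inherent in DiTs. This is based on our observation that the sparse characteristics of DiTs exhibit hierarchical and blockified structures between and within different modalities. This blockified approach significantly reduces the complexity of attention computation while maintaining high fidelity in the generated videos. Second, to enable Online Precise Search, we propose Fused LSE-Cached Search with Head-adaptive Hierarchical Block Sparse Attention. This method is motivated by our finding that the sparse pattern and LSE of DiTs vary with inputs, layers, and heads but remain invariant across denoising steps. By leveraging this invariance across denoising steps, it adapts to the dynamic nature of DiTs and allows precise, real-time identification of sparse indices with minimal overhead. AdaSpa is implemented as an adaptive, plug-and-play solution that integrates seamlessly with existing DiTs, requiring neither additional fine-tuning nor dataset-dependent profiling. Extensive experiments validate that AdaSpa delivers substantial acceleration across various models while preserving video quality, making it a robust and scalable approach to efficient video generation.||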
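|**2025-02-28**|[Efficient Transformer-based Decoder for Varshamov-Tenengolts Codes](http://arxiv.org/abs/2502.21060)|null|In recent years, the rise of DNA data storage technology has drawn significant attention to the challenge of correcting insertion, deletion, and substitution (IDS) errors. Among the various coding methods for IDS correction, Varshamov-Tenengolts (VT) codes, primarily designed for single-error correction, have become a central research focus. While existing decoding methods achieve high accuracy in correcting a single error, they often fail to correct multiple IDS errors. In this work, we observe that VT codes retain some capability for handling multiple errors by introducing a Transformer-based VT decoder (TVTD) together with symbol- and statistic-based codeword embeddings. Experimental results show that the proposed TVTD achieves perfect correction of a single error. Furthermore, when decoding multiple errors across various codeword lengths, the bit error rate and frame error rate are significantly improved compared with existing hard-decision and soft-in soft-out algorithms. Additionally, through model architecture optimization, the proposed method reduces time consumption by an order of magnitude compared with other soft decoders.||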
|**2025-02-28**|[MagNet: Multi-Level Attention Graph Network for Predicting High-Resolution Spatial Transcriptomics](http://arxiv.org/abs/2502.21011)|null|The rapid development of spatial transcriptomics (ST) offers new opportunities to explore the gene expression patterns within the spatial microenvironment. Current research integrates pathological images to infer gene expression, addressing the high costs and time-consuming processes to generate spatial transcriptomics data. However, as spatial transcriptomics resolution continues to improve, existing methods remain primarily focused on gene expression prediction at low-resolution spot levels. These methods face significant challenges, especially the information bottleneck, when they are applied to high-resolution HD data. To bridge this gap, this paper introduces MagNet, a multi-level attention graph network designed for accurate prediction of high-resolution HD data. MagNet employs cross-attention layers to integrate features from multi-resolution image patches hierarchically and utilizes a GAT-Transformer module to aggregate neighborhood information. By integrating multilevel features, MagNet overcomes the limitations posed by low-resolution inputs in predicting high-resolution gene expression. We systematically evaluated MagNet and existing ST prediction models on both a private spatial transcriptomics dataset and a public dataset at three different resolution levels. The results demonstrate that MagNet achieves state-of-the-art performance at both spot level and high-resolution bin levels, providing a novel methodology and benchmark for future research and applications in high-resolution HD-level spatial transcriptomics. Code is available at https://github.com/Junchao-Zhu/MagNet.||
|**2025-02-28**|[Visual Attention Exploration in Vision-Based Mamba Models](http://arxiv.org/abs/2502.20764)|null|State space models (SSMs) have emerged as an efficient alternative to transformer-based models, offering linear complexity that scales better than transformers. One of the latest advances in SSMs, Mamba, introduces a selective scan mechanism that assigns trainable weights to input tokens, effectively mimicking the attention mechanism. Mamba has also been successfully extended to the vision domain by decomposing 2D images into smaller patches and arranging them as 1D sequences. However, it remains unclear how these patches interact with (or attend to) each other in relation to their original 2D spatial location. Additionally, the order used to arrange the patches into a sequence also significantly impacts their attention distribution. To better understand the attention between patches and explore the attention patterns, we introduce a visual analytics tool specifically designed for vision-based Mamba models. This tool enables a deeper understanding of how attention is distributed across patches in different Mamba blocks and how it evolves throughout a Mamba model. Using the tool, we also investigate the impact of different patch-ordering strategies on the learned attention, offering further insights into the model's behavior.||
|**2025-02-28**|[Variational Transformer Ansatz for the Density Operator of Steady States in Dissipative Quantum Many-Body Systems](http://arxiv.org/abs/2502.20723)|null|The transformer architecture, known for capturing long-range dependencies and intricate patterns, has extended beyond natural language processing. Recently, it has attracted significant attention in quantum information and condensed matter physics. In this work, we propose the transformer density operator ansatz for determining the steady states of dissipative quantum many-body systems. By vectorizing the density operator as a many-body state in a doubled Hilbert space, the transformer encodes the amplitude and phase of the state's coefficients, with its parameters serving as variational variables. Our design preserves translation invariance while leveraging attention mechanisms to capture diverse long-range correlations. We demonstrate the effectiveness of our approach by numerically calculating the steady states of dissipative Ising and Heisenberg spin chain models, showing that our method achieves excellent accuracy in predicting steady states.||
|**2025-02-28**|[Disentangling Feature Structure: A Mathematically Provable Two-Stage Training Dynamics in Transformers](http://arxiv.org/abs/2502.20681)|null|Transformers may exhibit two-stage training dynamics during the real-world training process. For instance, when training GPT-2 on the Counterfact dataset, the answers progress from syntactically incorrect to syntactically correct to semantically correct. However, existing theoretical analyses hardly account for this two-stage phenomenon. In this paper, we theoretically demonstrate how such two-stage training dynamics occur in transformers. Specifically, we analyze the dynamics of transformers using feature learning techniques under in-context learning regimes, based on a disentangled two-type feature structure. Such disentanglement of feature structure is general in practice, e.g., natural languages contain syntax and semantics, and proteins contain primary and secondary structures. To the best of our knowledge, this is the first rigorous result regarding a two-stage optimization process in transformers. Additionally, a corollary indicates that such a two-stage process is closely related to the spectral properties of the attention weights, which accords well with empirical findings.||
|**2025-02-27**|[An Integrated Deep Learning Framework Leveraging NASNet and Vision Transformer with MixProcessing for Accurate and Precise Diagnosis of Lung Diseases](http://arxiv.org/abs/2502.20570)|null|The lungs are the essential organs of respiration, and the respiratory system is central to the exchange of oxygen and carbon dioxide that sustains human life. However, several lung diseases, including pneumonia, tuberculosis, COVID-19, and lung cancer, are serious health challenges and demand early and precise diagnostics. This study proposes a new deep learning framework called NASNet-ViT, which combines the convolutional capability of NASNet with the global attention mechanism of the Vision Transformer (ViT). The proposed model classifies lung conditions into five classes: lung cancer, COVID-19, pneumonia, TB, and normal. A multi-faceted preprocessing strategy called MixProcessing is used to improve diagnostic accuracy. This preprocessing combines wavelet transform, adaptive histogram equalization, and morphological filtering techniques. The NASNet-ViT model achieves state-of-the-art performance, with an accuracy of 98.9%, sensitivity of 0.99, an F1-score of 0.989, and specificity of 0.987, outperforming other state-of-the-art architectures such as MixNet-LD, D-ResNet, MobileNet, and ResNet50. The model's efficiency is further emphasized by its compact size, 25.6 MB, and a low computational time of 12.4 seconds, making it suitable for real-time, clinically constrained environments. These results reflect the capability of NASNet-ViT to extract meaningful features and recognize various types of lung diseases with very high accuracy. This work contributes to medical image analysis by providing a robust and scalable solution for diagnostics in lung diseases.||
|**2025-02-27**|[Revisiting Kernel Attention with Correlated Gaussian Process Representation](http://arxiv.org/abs/2502.20525)|null|Transformers have increasingly become the de facto method to model sequential data with state-of-the-art performance. Due to its widespread use, being able to estimate and calibrate its modeling uncertainty is important to understand and design robust transformer models. To achieve this, previous works have used Gaussian processes (GPs) to perform uncertainty calibration for the attention units of transformers and attained notable successes. However, such approaches have to confine the transformers to the space of symmetric attention to ensure the necessary symmetric requirement of their GP's kernel specification, which reduces the representation capacity of the model. To mitigate this restriction, we propose the Correlated Gaussian Process Transformer (CGPT), a new class of transformers whose self-attention units are modeled as cross-covariance between two correlated GPs (CGPs). This allows asymmetries in attention and can enhance the representation capacity of GP-based transformers. We also derive a sparse approximation for CGP to make it scale better. Our empirical studies show that both CGP-based and sparse CGP-based transformers achieve better performance than state-of-the-art GP-based transformers on a variety of benchmark tasks. The code for our experiments is available at https://github.com/MinhLong210/CGP-Transformers.||
|**2025-02-27**|[Space Rotation with Basis Transformation for Training-free Test-Time Adaptation](http://arxiv.org/abs/2502.19946)|null|With the development of visual-language models (VLM) in downstream task applications, test-time adaptation methods based on VLM have attracted increasing attention for their ability to address distribution shifts at test time. Although prior approaches have achieved some progress, they typically either demand substantial computational resources or are constrained by the limitations of the original feature space, rendering them less effective for test-time adaptation tasks. To address these challenges, we propose a training-free feature space rotation with basis transformation for test-time adaptation. By leveraging the inherent distinctions among classes, we reconstruct the original feature space and map it to a new representation, thereby enhancing the clarity of class differences and providing more effective guidance for the model during testing. Additionally, to better capture relevant information from various classes, we maintain a dynamic queue to store representative samples. Experimental results across multiple benchmarks demonstrate that our method outperforms state-of-the-art techniques in terms of both performance and efficiency.||
|**2025-02-27**|[CirT: Global Subseasonal-to-Seasonal Forecasting with Geometry-inspired Transformer](http://arxiv.org/abs/2502.19750)|**[link](https://github.com/compasszzn/CirT)**|Accurate Subseasonal-to-Seasonal (S2S) climate forecasting is pivotal for decision-making including agriculture planning and disaster preparedness but is known to be challenging due to its chaotic nature. Although recent data-driven models have shown promising results, their performance is limited by inadequate consideration of geometric inductive biases. Usually, they treat the spherical weather data as planar images, resulting in an inaccurate representation of locations and spatial relations. In this work, we propose the geometric-inspired Circular Transformer (CirT) to model the cyclic characteristic of the graticule, consisting of two key designs: (1) Decomposing the weather data by latitude into circular patches that serve as input tokens to the Transformer; (2) Leveraging Fourier transform in self-attention to capture the global information and model the spatial periodicity. Extensive experiments on the Earth Reanalysis 5 (ERA5) reanalysis dataset demonstrate our model yields a significant improvement over the advanced data-driven models, including PanguWeather and GraphCast, as well as skillful ECMWF systems. Additionally, we empirically show the effectiveness of our model designs and high-quality prediction over spatial and temporal dimensions.||
|**2025-02-26**|[Introduction to Sequence Modeling with Transformers](http://arxiv.org/abs/2502.19597)|**[link](https://github.com/kamarain/transformer_intro)**|Understanding the transformer architecture and its workings is essential for machine learning (ML) engineers. However, truly understanding the transformer architecture can be demanding, even if you have a solid background in machine learning or deep learning. The main workhorse is attention, which gives rise to the transformer encoder-decoder structure. However, putting attention aside leaves several programming components that are easy to implement but whose role in the whole is unclear. These components are 'tokenization', 'embedding' ('un-embedding'), 'masking', 'positional encoding', and 'padding'. The focus of this work is on understanding them. To keep things simple, the understanding is built incrementally by adding components one by one, and after each step investigating what is doable and what is undoable with the current model. Simple sequences of zeros (0) and ones (1) are used to study the workings of each step.||
|**2025-02-26**|[Integrating Biological and Machine Intelligence: Attention Mechanisms in Brain-Computer Interfaces](http://arxiv.org/abs/2502.19281)|null|With the rapid advancement of deep learning, attention mechanisms have become indispensable in electroencephalography (EEG) signal analysis, significantly enhancing Brain-Computer Interface (BCI) applications. This paper presents a comprehensive review of traditional and Transformer-based attention mechanisms, their embedding strategies, and their applications in EEG-based BCI, with a particular emphasis on multimodal data fusion. By capturing EEG variations across time, frequency, and spatial channels, attention mechanisms improve feature extraction, representation learning, and model robustness. These methods can be broadly categorized into traditional attention mechanisms, which typically integrate with convolutional and recurrent networks, and Transformer-based multi-head self-attention, which excels in capturing long-range dependencies. Beyond single-modality analysis, attention mechanisms also enhance multimodal EEG applications, facilitating effective fusion between EEG and other physiological or sensory data. Finally, we discuss existing challenges and emerging trends in attention-based EEG modeling, highlighting future directions for advancing BCI technology. This review aims to provide valuable insights for researchers seeking to leverage attention mechanisms for improved EEG interpretation and application.||
|**2025-02-27**|[ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding](http://arxiv.org/abs/2502.19247)|null|Embodied intelligence requires agents to interact with 3D environments in real time based on language instructions. A foundational task in this domain is ego-centric 3D visual grounding. However, the point clouds rendered from RGB-D images retain a large amount of redundant background data and inherent noise, both of which can interfere with the manifold structure of the target regions. Existing point cloud enhancement methods often require a tedious process to improve the manifold, which is not suitable for real-time tasks. We propose Proxy Transformation, suitable for multimodal tasks, to efficiently improve the point cloud manifold. Our method first leverages Deformable Point Clustering to identify the point cloud sub-manifolds in target regions. Then, we propose a Proxy Attention module that utilizes multimodal proxies to guide point cloud transformation. Built upon Proxy Attention, we design a submanifold transformation generation module where textual information globally guides translation vectors for different submanifolds, optimizing relative spatial relationships of target regions. Simultaneously, image information guides linear transformations within each submanifold, refining the local point cloud manifold of target regions. Extensive experiments demonstrate that Proxy Transformation significantly outperforms all existing methods, achieving an impressive improvement of 7.49% on easy targets and 4.60% on hard targets, while reducing the computational overhead of attention blocks by 40.6%. These results establish a new SOTA in ego-centric 3D visual grounding, showcasing the effectiveness and robustness of our approach.||
|**2025-02-26**|[A Hybrid Transformer Architecture with a Quantized Self-Attention Mechanism Applied to Molecular Generation](http://arxiv.org/abs/2502.19214)|**[link](https://github.com/anthonysmaldone/quantum-transformer)**|The success of the self-attention mechanism in classical machine learning models has inspired the development of quantum analogs aimed at reducing computational overhead. Self-attention integrates learnable query and key matrices to calculate attention scores between all pairs of tokens in a sequence. These scores are then multiplied by a learnable value matrix to obtain the output self-attention matrix, enabling the model to effectively capture long-range dependencies within the input sequence. Here, we propose a hybrid quantum-classical self-attention mechanism as part of a transformer decoder, the architecture underlying large language models (LLMs). To demonstrate its utility in chemistry, we train this model on the QM9 dataset for conditional generation, using SMILES strings as input, each labeled with a set of physicochemical properties that serve as conditions during inference. Our theoretical analysis shows that the time complexity of the query-key dot product is reduced from $\mathcal{O}(n^2 d)$ in a classical model to $\mathcal{O}(n^2\log d)$ in our quantum model, where $n$ and $d$ represent the sequence length and embedding dimension, respectively. We perform simulations using NVIDIA's CUDA-Q platform, which is designed for efficient GPU scalability. This work provides a promising avenue for quantum-enhanced natural language processing (NLP).||
|**2025-02-26**|[A HEART for the environment: Transformer-Based Spatiotemporal Modeling for Air Quality Prediction](http://arxiv.org/abs/2502.19042)|null|Accurate and reliable air pollution forecasting is crucial for effective environmental management and policy-making. llull-environment is a sophisticated and scalable forecasting system for air pollution, inspired by previous models currently operational in Madrid and Valladolid (Spain). It contains (among other key components) an encoder-decoder convolutional neural network to forecast mean pollution levels for four key pollutants (NO$_2$, O$_3$, PM$_{10}$, PM$_{2.5}$) using historical data, external forecasts, and other contextual features. This paper investigates the augmentation of this neural network with an attention mechanism to improve predictive accuracy. The proposed attention mechanism pre-processes tensors containing the input features before passing them to the existing mean forecasting model. The resulting model is a combination of several architectures and ideas and can be described as a "Hybrid Enhanced Autoregressive Transformer", or HEART. The effectiveness of the approach is evaluated by comparing the mean square error (MSE) across different attention layouts against the system without such a mechanism. We observe a significant reduction in MSE of up to 22%, with an average of 7.5% across tested cities and pollutants. The performance of a given attention mechanism turns out to depend on the pollutant, highlighting the differences in their creation and dissipation processes. Our findings are not restricted to optimizing air quality prediction models, but are applicable generally to (fixed length) time series forecasting.||
|**2025-02-26**|[The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training](http://arxiv.org/abs/2502.19002)|null|Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and interactions among these blocks is important. In this paper, we uncover a clear Sharpness Disparity across these blocks, which emerges early in training and intriguingly persists throughout the training process. Motivated by this finding, we propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness, accelerating large language model (LLM) pre-training. By integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 1.1B and datasets of OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory saving. These results underscore the potential of exploiting the sharpness disparity to improve LLM training.||
|**2025-02-26**|[Sliding Window Attention Training for Efficient Large Language Models](http://arxiv.org/abs/2502.18845)|null|Recent advances in transformer-based Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their quadratic computational complexity concerning sequence length remains a significant bottleneck for processing long documents. As a result, many efforts like sparse attention and state space models have been proposed to improve the efficiency of LLMs over long sequences. Though effective, these approaches compromise the performance or introduce structural complexity. This calls for a simple yet efficient model that preserves the fundamental Transformer architecture. To this end, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of softmax operation. Then, we replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention. Experiments demonstrate that SWAT achieves SOTA performance compared with state-of-the-art linear recurrent architectures on eight benchmarks. Code is available at https://anonymous.4open.science/r/SWAT-attention.||
|**2025-02-26**|[The FFT Strikes Back: An Efficient Alternative to Self-Attention](http://arxiv.org/abs/2502.18394)|**[link](https://github.com/jacobfa/fft)**|Conventional self-attention mechanisms incur quadratic complexity, limiting their scalability on long sequences. We introduce FFTNet, an adaptive spectral filtering framework that leverages the Fast Fourier Transform (FFT) to achieve global token mixing in $\mathcal{O}(n\log n)$ time. By transforming inputs into the frequency domain, FFTNet exploits the orthogonality and energy preservation guaranteed by Parseval's theorem to capture long-range dependencies efficiently. A learnable spectral filter and modReLU activation dynamically emphasize salient frequency components, providing a rigorous and adaptive alternative to traditional self-attention. Experiments on the Long Range Arena and ImageNet benchmarks validate our theoretical insights and demonstrate superior performance over fixed Fourier and standard attention models.||
|**2025-02-25**|[ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation](http://arxiv.org/abs/2502.18364)|null|Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. Inspired by Schema theory, which suggests that knowledge is organized in frameworks (schemas) that enable people to interpret and learn from new information by linking it to prior knowledge, this anonymous region layout allows the generative model to autonomously determine which set of visual tokens should align with which text tokens, which is in contrast to the previously dominant semantic layout for the image generation task. In addition, the layer-wise region crop mechanism, which only selects the visual tokens belonging to each anonymous region, significantly reduces attention computation costs and enables the efficient generation of images with numerous distinct layers (e.g., 50+). When compared to the full attention approach, our method is over 12 times faster and exhibits fewer layer conflicts. Furthermore, we propose a high-quality multi-layer transparent image autoencoder that supports the direct encoding and decoding of the transparency of variable multi-layer images in a joint manner. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.||
|**2025-02-25**|[FwNet-ECA: Facilitating Window Attention with Global Receptive Fields through Fourier Filtering Operations](http://arxiv.org/abs/2502.18094)|**[link](https://github.com/qingxiaoli/fwnet-eca)**|Windowed attention mechanisms were introduced to mitigate the issue of excessive computation inherent in global attention mechanisms. In this paper, we present FwNet-ECA, a novel method that utilizes Fourier transforms paired with learnable weight matrices to enhance the spectral features of images. This strategy facilitates inter-window connectivity, thereby maximizing the receptive field. Additionally, we incorporate the Efficient Channel Attention (ECA) module to improve communication between different channels. Instead of relying on physically shifted windows, our approach leverages frequency domain enhancement to implicitly bridge information across spatial regions. We validate our model on the iCartoonFace dataset and conduct downstream tasks on ImageNet, demonstrating that our model achieves lower parameter counts and computational overheads compared to shifted window approaches, while maintaining competitive accuracy. This work offers a more efficient and effective alternative for leveraging attention mechanisms in visual processing tasks, alleviating the challenges associated with windowed attention models. Code is available at https://github.com/qingxiaoli/FwNet-ECA.||
|**2025-02-25**|[Automatic Vehicle Detection using DETR: A Transformer-Based Approach for Navigating Treacherous Roads](http://arxiv.org/abs/2502.17843)|null|Automatic Vehicle Detection (AVD) in diverse driving environments presents unique challenges due to varying lighting conditions, road types, and vehicle types. Traditional methods, such as YOLO and Faster R-CNN, often struggle to cope with these complexities. As computer vision evolves, combining Convolutional Neural Networks (CNNs) with Transformer-based approaches offers promising opportunities for improving detection accuracy and efficiency. This study is the first to experiment with Detection Transformer (DETR) for automatic vehicle detection in complex and varied settings. We employ a Collaborative Hybrid Assignments Training scheme, Co-DETR, to enhance feature learning and attention mechanisms in DETR. By leveraging versatile label assignment strategies and introducing multiple parallel auxiliary heads, we provide more effective supervision during training and extract positive coordinates to boost training efficiency. Through extensive experiments on DETR variants and YOLO models, conducted using the BadODD dataset, we demonstrate the advantages of our approach. Our method achieves superior results, and improved accuracy in diverse conditions, making it practical for real-world deployment. This work significantly advances autonomous navigation technology and opens new research avenues in object detection for autonomous vehicles. By integrating the strengths of CNNs and Transformers, we highlight the potential of DETR for robust and efficient vehicle detection in challenging driving environments.||
|**2025-02-24**|[CalibRefine: Deep Learning-Based Online Automatic Targetless LiDAR-Camera Calibration with Iterative and Attention-Driven Post-Refinement](http://arxiv.org/abs/2502.17648)|**[link](https://github.com/radar-lab/Lidar)**|Accurate multi-sensor calibration is essential for deploying robust perception systems in applications such as autonomous driving, robotics, and intelligent transportation. Existing LiDAR-camera calibration methods often rely on manually placed targets, preliminary parameter estimates, or intensive data preprocessing, limiting their scalability and adaptability in real-world settings. In this work, we propose a fully automatic, targetless, and online calibration framework, CalibRefine, which directly processes raw LiDAR point clouds and camera images. Our approach is divided into four stages: (1) a Common Feature Discriminator that trains on automatically detected objects--using relative positions, appearance embeddings, and semantic classes--to generate reliable LiDAR-camera correspondences, (2) a coarse homography-based calibration, (3) an iterative refinement to incrementally improve alignment as additional data frames become available, and (4) an attention-based refinement that addresses non-planar distortions by leveraging a Vision Transformer and cross-attention mechanisms. Through extensive experiments on two urban traffic datasets, we show that CalibRefine delivers high-precision calibration results with minimal human involvement, outperforming state-of-the-art targetless methods and remaining competitive with, or surpassing, manually tuned baselines. Our findings highlight how robust object-level feature matching, together with iterative and self-supervised attention-based adjustments, enables consistent sensor fusion in complex, real-world conditions without requiring ground-truth calibration matrices or elaborate data preprocessing.||
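|**2025-02-24**|[Neural Attention: A Novel Mechanism for Enhanced Expressive Power in Transformer Models](http://arxiv.org/abs/2502.17206)|null|Transformer models typically compute the attention matrix using dot products, which is limited in capturing nonlinear relationships between embedding vectors. We propose Neural Attention, a technique that replaces the dot product with a feed-forward network, enabling a more expressive representation of relationships between tokens. This approach modifies only how the attention matrix is computed while preserving the matrix dimensions, making it easy to adapt to existing Transformer-based architectures. We provide a detailed mathematical justification for why Neural Attention increases representational capacity and conduct controlled experiments to validate this claim. NLP experiments on WikiText-103 show that Neural Attention reduces perplexity by more than 5% compared with dot-product attention. Similarly, experiments on CIFAR-10 and CIFAR-100 show comparable improvements on image classification tasks. Although Neural Attention introduces higher computational demands, we develop techniques to mitigate these challenges, ensuring practicality without sacrificing its enhanced expressivity. This work establishes Neural Attention as an effective means of enhancing the predictive capabilities of Transformer models across a variety of applications.||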
|**2025-02-24**|[Disentangling Visual Transformers: Patch-level Interpretability for Image Classification](http://arxiv.org/abs/2502.17196)|null|Visual transformers have achieved remarkable performance in image classification tasks, but this performance gain has come at the cost of interpretability. One of the main obstacles to the interpretation of transformers is the self-attention mechanism, which mixes visual information across the whole image in a complex way. In this paper, we propose Hindered Transformer (HiT), a novel interpretable by design architecture inspired by visual transformers. Our proposed architecture rethinks the design of transformers to better disentangle patch influences at the classification stage. Ultimately, HiT can be interpreted as a linear combination of patch-level information. We show that the advantages of our approach in terms of explicability come with a reasonable trade-off in performance, making it an attractive alternative for applications where interpretability is paramount.||
|**2025-02-24**|[MaxGlaViT: A novel lightweight vision transformer-based approach for early diagnosis of glaucoma stages from fundus images](http://arxiv.org/abs/2502.17154)|**[link](https://github.com/ymyurdakul/MaxGlaViT)**|Glaucoma is a prevalent eye disease that progresses silently without symptoms. If not detected and treated early, it can cause permanent vision loss. Computer-assisted diagnosis systems play a crucial role in timely and efficient identification. This study introduces MaxGlaViT, a lightweight model based on the restructured Multi-Axis Vision Transformer (MaxViT) for early glaucoma detection. First, MaxViT was scaled to optimize block and channel numbers, resulting in a lighter architecture. Second, the stem was enhanced by adding attention mechanisms (CBAM, ECA, SE) after convolution layers to improve feature learning. Third, MBConv structures in MaxViT blocks were replaced by advanced DL blocks (ConvNeXt, ConvNeXtV2, InceptionNeXt). The model was evaluated using the HDV1 dataset, containing fundus images of different glaucoma stages. Additionally, 40 CNN and 40 ViT models were tested on HDV1 to validate MaxGlaViT's efficiency. Among CNN models, EfficientB6 achieved the highest accuracy (84.91%), while among ViT models, MaxViT-Tiny performed best (86.42%). The scaled MaxViT reached 87.93% accuracy. Adding ECA to the stem block increased accuracy to 89.01%. Replacing MBConv with ConvNeXtV2 further improved it to 89.87%. Finally, integrating ECA in the stem and ConvNeXtV2 in MaxViT blocks resulted in 92.03% accuracy. Testing 80 DL models for glaucoma stage classification, this study presents a comprehensive and comparative analysis. MaxGlaViT outperforms experimental and state-of-the-art models, achieving 92.03% accuracy, 92.33% precision, 92.03% recall, 92.13% f1-score, and 87.12% Cohen's kappa score.||
|**2025-02-24**|[Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems](http://arxiv.org/abs/2502.17019)|**[link](https://github.com/maxxxzdn/erwin)**|Large-scale physical systems defined on irregular grids pose significant scalability challenges for deep learning methods, especially in the presence of long-range interactions and multi-scale coupling. Traditional approaches that compute all pairwise interactions, such as attention, become computationally prohibitive as they scale quadratically with the number of nodes. We present Erwin, a hierarchical transformer inspired by methods from computational many-body physics, which combines the efficiency of tree-based algorithms with the expressivity of attention mechanisms. Erwin employs ball tree partitioning to organize computation, which enables linear-time attention by processing nodes in parallel within local neighborhoods of fixed size. Through progressive coarsening and refinement of the ball tree structure, complemented by a novel cross-ball interaction mechanism, it captures both fine-grained local details and global features. We demonstrate Erwin's effectiveness across multiple domains, including cosmology, molecular dynamics, and particle fluid dynamics, where it consistently outperforms baseline methods both in accuracy and computational efficiency.||
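|**2025-02-24**|[Quantifying Logical Consistency in Transformers via Query-Key Alignment](http://arxiv.org/abs/2502.17017)|null|Large language models (LLMs) have demonstrated impressive performance on a variety of natural language processing tasks, yet their ability to perform multi-step logical reasoning remains an open challenge. Although chain-of-thought prompting improves logical reasoning by enabling models to generate intermediate steps, it lacks a mechanism for assessing the coherence of these logical transitions. In this paper, we propose a novel, lightweight evaluation strategy for logical reasoning that uses query-key alignment within Transformer attention heads. By computing a single forward pass and extracting a "QK-score" from carefully chosen attention heads, our method reveals latent representations that reliably separate valid from invalid inferences, offering a scalable alternative to traditional ablation-based techniques. We also provide empirical validation on multiple logical reasoning benchmarks, demonstrating improved robustness of our evaluation method against distractors and increased reasoning depth. The experiments were conducted on a diverse set of models ranging from 1.5B to 70B parameters.||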
|**2025-02-21**|[Estimating Vehicle Speed on Roadways Using RNNs and Transformers: A Video-based Approach](http://arxiv.org/abs/2502.15545)|null|This project explores the application of advanced machine learning models, specifically Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Transformers, to the task of vehicle speed estimation using video data. Traditional methods of speed estimation, such as radar and manual systems, are often constrained by high costs, limited coverage, and potential disruptions. In contrast, leveraging existing surveillance infrastructure and cutting-edge neural network architectures presents a non-intrusive, scalable solution. Our approach utilizes LSTM and GRU to effectively manage long-term dependencies within the temporal sequence of video frames, while Transformers are employed to harness their self-attention mechanisms, enabling the processing of entire sequences in parallel and focusing on the most informative segments of the data. This study demonstrates that both LSTM and GRU outperform basic Recurrent Neural Networks (RNNs) due to their advanced gating mechanisms. Furthermore, increasing the sequence length of input data consistently improves model accuracy, highlighting the importance of contextual information in dynamic environments. Transformers, in particular, show exceptional adaptability and robustness across varied sequence lengths and complexities, making them highly suitable for real-time applications in diverse traffic conditions. The findings suggest that integrating these sophisticated neural network models can significantly enhance the accuracy and reliability of automated speed detection systems, thus promising to revolutionize traffic management and road safety.||
|**2025-02-21**|[Enhancing Vehicle Make and Model Recognition with 3D Attention Modules](http://arxiv.org/abs/2502.15398)|null|Vehicle make and model recognition (VMMR) is a crucial component of the Intelligent Transport System, garnering significant attention in recent years. VMMR has been widely utilized for detecting suspicious vehicles, monitoring urban traffic, and autonomous driving systems. The complexity of VMMR arises from the subtle visual distinctions among vehicle models and the wide variety of classes produced by manufacturers. Convolutional Neural Networks (CNNs), a prominent type of deep learning model, have been extensively employed in various computer vision tasks, including VMMR, yielding remarkable results. As VMMR is a fine-grained classification problem, it primarily faces inter-class similarity and intra-class variation challenges. In this study, we implement an attention module to address these challenges and enhance the model's focus on critical areas containing distinguishing features. This module, which does not increase the parameters of the original model, generates three-dimensional (3-D) attention weights to refine the feature map. Our proposed model integrates the attention module into two different locations within the middle section of a convolutional model, where the feature maps from these sections offer sufficient information about the input frames without being overly detailed or overly coarse. The performance of our proposed model, along with state-of-the-art (SOTA) convolutional and transformer-based models, was evaluated using the Stanford Cars dataset. Our proposed model achieved the highest accuracy, 90.69%, among the compared models.||
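|**2025-02-21**|[AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms](http://arxiv.org/abs/2502.15349)|**[link](https://github.com/microsoft/attentionengine)**|Transformers and large language models (LLMs) have revolutionized machine learning, with attention mechanisms at the core of their success. As the range of attention variants keeps expanding, so does the challenge of optimizing their performance, particularly across different hardware platforms. Current optimization strategies are often narrowly scoped and require extensive manual intervention to accommodate changes in model configuration or hardware environment. In this paper, we introduce AttentionEngine, a comprehensive framework designed to streamline the optimization of attention mechanisms across heterogeneous hardware backends. By decomposing attention computation into modular operations with customizable components, AttentionEngine adapts flexibly to diverse algorithmic requirements. The framework also automates kernel optimization by combining programmable templates with a robust cross-platform scheduling strategy. Empirical results show performance gains of up to 10x on configurations that existing methods cannot reach. AttentionEngine offers a scalable, efficient foundation for developing and deploying attention mechanisms with minimal manual tuning. Our code is open-sourced and available at https://github.com/microsoft/AttentionEngine.||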
|**2025-02-21**|[Lightweight yet Efficient: An External Attentive Graph Convolutional Network with Positional Prompts for Sequential Recommendation](http://arxiv.org/abs/2502.15331)|**[link](https://github.com/JinyuZ1996/EA-GPS)**|Graph-based Sequential Recommender systems (GSRs) have gained significant research attention due to their ability to simultaneously handle user-item interactions and sequential relationships between items. Current GSRs often utilize composite or in-depth structures for graph encoding (e.g., the Graph Transformer). Nevertheless, they have high computational complexity, hindering the deployment on resource-constrained edge devices. Moreover, the relative position encoding in Graph Transformer has difficulty in considering the complicated positional dependencies within sequence. To this end, we propose an External Attentive Graph convolutional network with Positional prompts for Sequential recommendation, namely EA-GPS. Specifically, we first introduce an external attentive graph convolutional network that linearly measures the global associations among nodes via two external memory units. Then, we present a positional prompt-based decoder that explicitly treats the absolute item positions as external prompts. By introducing length-adaptive sequential masking and a soft attention network, such a decoder facilitates the model to capture the long-term positional dependencies and contextual relationships within sequences. Extensive experimental results on five real-world datasets demonstrate that the proposed EA-GPS outperforms the state-of-the-art methods. Remarkably, it achieves the superior performance while maintaining a smaller parameter size and lower training overhead. The implementation of this work is publicly available at https://github.com/ZZY-GraphMiningLab/EA-GPS.||
|**2025-02-21**|[SentiFormer: Metadata Enhanced Transformer for Image Sentiment Analysis](http://arxiv.org/abs/2502.15322)|**[link](https://github.com/MET4ISA/SentiFormer)**|As more and more internet users post images online to express their daily emotions, image sentiment analysis has attracted increasing attention. Recently, researchers generally tend to design different neural networks to extract visual features from images for sentiment analysis. Despite the significant progress, metadata, the data (e.g., text descriptions and keyword tags) for describing the image, has not been sufficiently explored in this task. In this paper, we propose a novel Metadata Enhanced Transformer for sentiment analysis (SentiFormer) to fuse multiple metadata and the corresponding image into a unified framework. Specifically, we first obtain multiple metadata of the image and unify the representations of diverse data. To adaptively learn the appropriate weights for each metadata, we then design an adaptive relevance learning module to highlight more effective information while suppressing weaker ones. Moreover, we further develop a cross-modal fusion module to fuse the adaptively learned representations and make the final prediction. Extensive experiments on three publicly available datasets demonstrate the superiority and rationality of our proposed method.||
|**2025-02-21**|[SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention](http://arxiv.org/abs/2502.15304)|null|For the efficient inference of Large Language Models (LLMs), the effective compression of key-value (KV) cache is essential. Three main types of KV cache compression techniques, namely sparsity, channel compression, and quantization, have been identified. This study presents SVDq, a Singular Value Decomposition (SVD) - based mixed precision quantization method for K cache. Initially, K cache is transformed into latent channels using SVD basis representations. Since the values in latent channels decay rapidly and become negligible after only a few latent channels, our method then incorporates importance-aware quantization and compression for latent channels. This enables the effective allocation of higher precision to more significant channels. Theoretically, we prove that SVDq results in quantization errors (x0.1 or even lower) that are much lower than those of per-channel key quantization in the original space. Our findings based on RULER and LongBench benchmarks demonstrate that SVDq can achieve an equivalent key cache precision as low as 1.25-bit. When combined with key sparsity, it can reach a key compression ratio of up to 410x for attention computation, all while maintaining comparable model performance. Notably, our method is nearly lossless for LongBench datasets. This indicates that SVDq enables high-precision low-bit quantization, providing a more efficient solution for KV cache compression in LLMs.||
|**2025-02-21**|[TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba](http://arxiv.org/abs/2502.15130)|null|Transformers have been favored in both uni-modal and multi-modal foundation models for their flexible scalability in attention modules. Consequently, a number of pre-trained Transformer models, e.g., LLaVA, CLIP, and DEIT, are publicly available. Recent research has introduced subquadratic architectures like Mamba, which enables global awareness with linear complexity. Nevertheless, training specialized subquadratic architectures from scratch for certain tasks is both resource-intensive and time-consuming. As a motivator, we explore cross-architecture training to transfer the ready knowledge in existing Transformer models to the alternative architecture Mamba, termed TransMamba. Our approach employs a two-stage strategy to expedite training new Mamba models, ensuring effectiveness across uni-modal and cross-modal tasks. Concerning architecture disparities, we project the intermediate features into an aligned latent space before transferring knowledge. On top of that, a Weight Subcloning and Adaptive Bidirectional distillation method (WSAB) is introduced for knowledge transfer without limitations on varying layer counts. For cross-modal learning, we propose a cross-Mamba module that integrates language awareness into Mamba's visual features, enhancing the cross-modal interaction capabilities of the Mamba architecture. Despite using less than 75% of the training data typically required for training from scratch, TransMamba boasts substantially stronger performance across various network architectures and downstream tasks, including image classification, visual question answering, and text-video retrieval. The code will be publicly available.||
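|**2025-02-21**|[Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps](http://arxiv.org/abs/2502.15120)|null|This study investigates the in-context learning capabilities of various decoder-based Transformer language models spanning different model sizes and training data, including GPT2, SmolLM2, OpenELM, TinyLlama, Stable LM, and Gemma 2. We identify a critical parameter threshold (about 1.6 billion) beyond which reasoning performance improves significantly on several tasks, such as commonsense reasoning in multiple-choice question answering and deductive reasoning. Specifically, models above this threshold achieve higher success rates with chain-of-thought (CoT) prompting on deductive reasoning tasks, especially those requiring longer reasoning chains, such as proof by contradiction and disjunctive syllogism. To address the limitations of sub-threshold models, we demonstrate that fine-tuning with task-specific exemplars substantially improves reasoning performance, enabling accurate CoT generation on tasks with shorter reasoning chains even without additional exemplars in the prompt. Finally, our analysis of attention maps shows that models capable of generating correct CoTs exhibit higher token-level attention scores on subsequent correct tokens and on the correct parts of speech, providing interpretability insights into the reasoning process. Together, these findings advance the understanding of reasoning capabilities in decoder-based Transformer models. The code is available at: https://github.com/AnnonymousForPapers/CoT_Reasoning_Test.||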
|**2025-02-20**|[GeoAggregator: An Efficient Transformer Model for Geo-Spatial Tabular Data](http://arxiv.org/abs/2502.15032)|null|Modeling geospatial tabular data with deep learning has become a promising alternative to traditional statistical and machine learning approaches. However, existing deep learning models often face challenges related to scalability and flexibility as datasets grow. To this end, this paper introduces GeoAggregator, an efficient and lightweight algorithm based on transformer architecture designed specifically for geospatial tabular data modeling. GeoAggregators explicitly account for spatial autocorrelation and spatial heterogeneity through Gaussian-biased local attention and global positional awareness. Additionally, we introduce a new attention mechanism that uses the Cartesian product to manage the size of the model while maintaining strong expressive power. We benchmark GeoAggregator against spatial statistical models, XGBoost, and several state-of-the-art geospatial deep learning methods using both synthetic and empirical geospatial datasets. The results demonstrate that GeoAggregators achieve the best or second-best performance compared to their competitors on nearly all datasets. GeoAggregator's efficiency is underscored by its reduced model size, making it both scalable and lightweight. Moreover, ablation experiments offer insights into the effectiveness of the Gaussian bias and Cartesian attention mechanism, providing recommendations for further optimizing the GeoAggregator's performance.||
|**2025-02-20**|[Simpler Fast Vision Transformers with a Jumbo CLS Token](http://arxiv.org/abs/2502.15021)|null|We introduce a simple enhancement to the global processing of vision transformers (ViTs) to improve accuracy while maintaining throughput. Our approach, Jumbo, creates a wider CLS token, which is split to match the patch token width before attention, processed with self-attention, and reassembled. After attention, Jumbo applies a dedicated, wider FFN to this token. Jumbo significantly improves over ViT+Registers on ImageNet-1K at high speeds (by 3.2% for ViT-tiny and 13.5% for ViT-nano); these Jumbo models even outperform specialized compute-efficient models while preserving the architectural advantages of plain ViTs. Although Jumbo sees no gains for ViT-small on ImageNet-1K, it gains 3.4% on ImageNet-21K over ViT+Registers. Both findings indicate that Jumbo is most helpful when the ViT is otherwise too narrow for the task. Finally, we show that Jumbo can be easily adapted to excel on data beyond images, e.g., time series.||
|**2025-02-20**|[RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers](http://arxiv.org/abs/2502.14377)|null|The Diffusion Transformer plays a pivotal role in advancing text-to-image and text-to-video generation, owing primarily to its inherent scalability. However, existing controlled diffusion transformer methods incur significant parameter and computational overheads and suffer from inefficient resource allocation due to their failure to account for the varying relevance of control information across different transformer layers. To address this, we propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl, enabling efficient and resource-optimized integration of control signals into the Diffusion Transformer. First, we evaluate the relevance of each layer in the Diffusion Transformer to the control information by assessing the "ControlNet Relevance Score"-i.e., the impact of skipping each control layer on both the quality of generation and the control effectiveness during inference. Based on the strength of the relevance, we then tailor the positioning, parameter scale, and modeling capacity of the control layers to reduce unnecessary parameters and redundant computations. Additionally, to further improve efficiency, we replace the self-attention and FFN in the commonly used copy block with the carefully designed Two-Dimensional Shuffle Mixer (TDSM), enabling efficient implementation of both the token mixer and channel mixer. Both qualitative and quantitative experimental results demonstrate that our approach achieves superior performance with only 15% of the parameters and computational complexity compared to PixArt-delta. More examples are available at https://relactrl.github.io/RelaCtrl/.||
|**2025-02-20**|[Designing Parameter and Compute Efficient Diffusion Transformers using Distillation](http://arxiv.org/abs/2502.14226)|null|Diffusion Transformers (DiTs) with billions of model parameters form the backbone of popular image and video generation models like DALL.E, Stable-Diffusion and SORA. Though these models are necessary in many low-latency applications like Augmented/Virtual Reality, they cannot be deployed on resource-constrained Edge devices (like Apple Vision Pro or Meta Ray-Ban glasses) due to their huge computational complexity. To overcome this, we turn to knowledge distillation and perform a thorough design-space exploration to achieve the best DiT for a given parameter size. In particular, we provide principles for how to choose design knobs such as depth, width, attention heads and distillation setup for a DiT. During the process, a three-way trade-off emerges between model performance, size and speed that is crucial for Edge implementation of diffusion. We also propose two distillation approaches - Teaching Assistant (TA) method and Multi-In-One (MI1) method - to perform feature distillation in the DiT context. Unlike existing solutions, we demonstrate and benchmark the efficacy of our approaches on practical Edge devices such as NVIDIA Jetson Orin Nano.||
|**2025-02-19**|[Learning Novel Transformer Architecture for Time-series Forecasting](http://arxiv.org/abs/2502.13721)|null|Despite the success of Transformer-based models in time-series prediction (TSP) tasks, existing Transformer architectures still face limitations, and the literature lacks comprehensive explorations into alternative architectures. To address these challenges, we propose AutoFormer-TS, a novel framework that leverages a comprehensive search space for Transformer architectures tailored to TSP tasks. Our framework introduces a differentiable neural architecture search (DNAS) method, AB-DARTS, which improves upon existing DNAS approaches by enhancing the identification of optimal operations within the architecture. AutoFormer-TS systematically explores alternative attention mechanisms, activation functions, and encoding operations, moving beyond the traditional Transformer design. Extensive experiments demonstrate that AutoFormer-TS consistently outperforms state-of-the-art baselines across various TSP benchmarks, achieving superior forecasting accuracy while maintaining reasonable training efficiency.||
|**2025-02-19**|[Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention](http://arxiv.org/abs/2502.13693)|**[link](https://github.com/Omid-Nejati/MedViTV2)**|Convolutional networks, transformers, hybrid models, and Mamba-based architectures have demonstrated strong performance across various medical image classification tasks. However, these methods were primarily designed to classify clean images using labeled data. In contrast, real-world clinical data often involve image corruptions that are unique to multi-center studies and stem from variations in imaging equipment across manufacturers. In this paper, we introduce the Medical Vision Transformer (MedViTV2), a novel architecture incorporating Kolmogorov-Arnold Network (KAN) layers into the transformer architecture for the first time, aiming for generalized medical image classification. We have developed an efficient KAN block to reduce computational load while enhancing the accuracy of the original MedViT. Additionally, to counteract the fragility of our MedViT when scaled up, we propose an enhanced Dilated Neighborhood Attention (DiNA), an adaptation of the efficient fused dot-product attention kernel capable of capturing global context and expanding receptive fields to scale the model effectively and addressing feature collapse issues. Moreover, a hierarchical hybrid strategy is introduced to stack our Local Feature Perception and Global Feature Perception blocks in an efficient manner, which balances local and global feature perceptions to boost performance. Extensive experiments on 17 medical image classification datasets and 12 corrupted medical image datasets demonstrate that MedViTV2 achieved state-of-the-art results in 27 out of 29 experiments with reduced computational complexity. MedViTV2 is 44% more computationally efficient than the previous version and significantly enhances accuracy, achieving improvements of 4.6% on MedMNIST, 5.8% on NonMNIST, and 13.4% on the MedMNIST-C benchmark.||
|**2025-02-20**|[Neural Attention Search](http://arxiv.org/abs/2502.13251)|null|We present Neural Attention Search (NAtS), a framework that automatically evaluates the importance of each token within a sequence and determines if the corresponding token can be dropped after several steps. This approach can efficiently reduce the KV cache sizes required by transformer-based models during inference and thus reduce inference costs. In this paper, we design a search space that contains three token types: (i) Global Tokens will be preserved and queried by all the following tokens. (ii) Local Tokens survive until the next global token appears. (iii) Sliding Window Tokens have an impact on the inference of a fixed size of the next following tokens. Similar to the One-Shot Neural Architecture Search approach, this token-type information can be learned jointly with the architecture weights via a learnable attention mask. Experiments on both training a new transformer from scratch and fine-tuning existing large language models show that NAtS can efficiently reduce the KV cache size required for the models while maintaining the models' performance.||
|**2025-02-18**|[RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals](http://arxiv.org/abs/2502.13181)|null|Transformers have achieved great success in effectively processing sequential data such as text. Their architecture consisting of several attention and feedforward blocks can model relations between elements of a sequence in parallel manner, which makes them very efficient to train and effective in sequence modeling. Even though they have shown strong performance in processing sequential data, the size of their parameters is considerably larger when compared to other architectures such as RNN and CNN based models. Therefore, several approaches have explored parameter sharing and recurrence in Transformer models to address their computational demands. However, such methods struggle to maintain high performance compared to the original transformer model. To address this challenge, we propose our novel approach, RingFormer, which employs one Transformer layer that processes input repeatedly in a circular, ring-like manner, while utilizing low-rank matrices to generate input-dependent level signals. This allows us to reduce the model parameters substantially while maintaining high performance in a variety of tasks such as translation and image classification, as validated in the experiments.||
|**2025-02-18**|[Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing](http://arxiv.org/abs/2502.12962)|null|Limited by the context window size of Large Language Models (LLMs), handling various tasks with input tokens exceeding the upper limit has been challenging, whether it is a simple direct retrieval task or a complex multi-hop reasoning task. Although various methods have been proposed to enhance the long-context processing capabilities of LLMs, they either incur substantial post-training costs, require additional tool modules (e.g., RAG), or have not shown significant improvement in realistic tasks. Our work observes the correlation between the attention distribution and generated answers across each layer, and establishes through experiments that attention allocation aligns with retrieval-augmented capabilities. Drawing on the above insights, we propose a novel method, InfiniRetri, that leverages the LLM's own attention information to enable accurate retrieval across inputs of infinite length. Our evaluations indicate that InfiniRetri achieves 100% accuracy in the Needle-In-a-Haystack (NIH) test over 1M tokens using a 0.5B parameter model, surpassing other methods or larger models and setting a new state-of-the-art (SOTA). Moreover, our method achieves significant performance improvements on real-world benchmarks, with a maximum 288% improvement. In addition, InfiniRetri can be applied to any Transformer-based LLM without additional training and substantially reduces inference latency and compute overhead in long texts. In summary, our comprehensive studies show InfiniRetri's potential for practical applications and create a paradigm for retrieving information using an LLM's own capabilities over inputs of infinite length. Code will be released in link.||
|**2025-02-19**|[SparkAttention: High-Performance Multi-Head Attention for Large Models on Volta GPU Architecture](http://arxiv.org/abs/2502.12784)|null|Transformers are widely used in various fields such as natural language processing and computer vision. However, the training time for large Transformer models can be challenging due to the Multi-Head Attention (MHA) mechanism. As models become larger, training becomes more costly, so it is crucial to utilize various resources for efficient model training. Currently, the NVIDIA Volta GPU is still widely used. However, because the computational shapes supported by the Tensor Core Units (TCU) of the Volta GPU differ from other GPU architectures, most efforts have not focused on using them to accelerate Transformer training. To address this issue, we propose SparkAttention, an acceleration library designed to speed up MHA training on the Volta GPU. SparkAttention leverages TCU and kernel fusion to reduce the number of high bandwidth memory (HBM) accesses and overhead. Our end-to-end experimental results on an NVIDIA V100 GPU show that SparkAttention achieves on average a 1.80$\times$ (up to 2.46$\times$) speedup compared to using PyTorch.||
|**2025-02-18**|[Spiking Vision Transformer with Saccadic Attention](http://arxiv.org/abs/2502.12677)|null|The combination of Spiking Neural Networks (SNNs) and Vision Transformers (ViTs) holds potential for achieving both energy efficiency and high performance, particularly suitable for edge vision applications. However, a significant performance gap still exists between SNN-based ViTs and their ANN counterparts. Here, we first analyze why SNN-based ViTs suffer from limited performance and identify a mismatch between the vanilla self-attention mechanism and spatio-temporal spike trains. This mismatch results in degraded spatial relevance and limited temporal interactions. To address these issues, we draw inspiration from biological saccadic attention mechanisms and introduce an innovative Saccadic Spike Self-Attention (SSSA) method. Specifically, in the spatial domain, SSSA employs a novel spike distribution-based method to effectively assess the relevance between Query and Key pairs in SNN-based ViTs. Temporally, SSSA employs a saccadic interaction module that dynamically focuses on selected visual areas at each timestep and significantly enhances whole scene understanding through temporal interactions. Building on the SSSA mechanism, we develop a SNN-based Vision Transformer (SNN-ViT). Extensive experiments across various visual tasks demonstrate that SNN-ViT achieves state-of-the-art performance with linear computational complexity. The effectiveness and efficiency of the SNN-ViT highlight its potential for power-critical edge vision applications.||
|**2025-02-18**|[MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation](http://arxiv.org/abs/2502.12632)|null|Diffusion models are successful for synthesizing high-quality videos but are limited to generating short clips (e.g., 2-10 seconds). Synthesizing sustained footage (e.g. over minutes) still remains an open research question. In this paper, we propose MALT Diffusion (using Memory-Augmented Latent Transformers), a new diffusion model specialized for long video generation. MALT Diffusion (or just MALT) handles long videos by subdividing them into short segments and doing segment-level autoregressive generation. To achieve this, we first propose recurrent attention layers that encode multiple segments into a compact memory latent vector; by maintaining this memory vector over time, MALT is able to condition on it and continuously generate new footage based on a long temporal context. We also present several training techniques that enable the model to generate frames over a long horizon with consistent quality and minimal degradation. We validate the effectiveness of MALT through experiments on long video benchmarks. We first perform extensive analysis of MALT in long-contextual understanding capability and stability using popular long video benchmarks. For example, MALT achieves an FVD score of 220.4 on 128-frame video generation on UCF-101, outperforming the previous state-of-the-art of 648.4. Finally, we explore MALT's capabilities in a text-to-video generation setting and show that it can produce long videos compared with recent techniques for long text-to-video generation.||
|**2025-02-18**|[Mixture of Attention Yields Accurate Results for Tabular Data](http://arxiv.org/abs/2502.12507)|null|Tabular data inherently exhibits significant feature heterogeneity, but existing transformer-based methods lack specialized mechanisms to handle this property. To bridge the gap, we propose MAYA, an encoder-decoder transformer-based framework. In the encoder, we design a Mixture of Attention (MOA) that constructs multiple parallel attention branches and averages the features at each branch, effectively fusing heterogeneous features while limiting parameter growth. Additionally, we employ collaborative learning with a dynamic consistency weight constraint to produce more robust representations. In the decoder stage, cross-attention is utilized to seamlessly integrate tabular data with corresponding label features. This dual-attention mechanism effectively captures both intra-instance and inter-instance interactions. We evaluate the proposed method on a wide range of datasets and compare it with other state-of-the-art transformer-based methods. Extensive experiments demonstrate that our model achieves superior performance among transformer-based methods in both tabular classification and regression tasks.||
|**2025-02-18**|[Efficient OpAmp Adaptation for Zoom Attention to Golden Contexts](http://arxiv.org/abs/2502.12502)|null|Large language models (LLMs) have shown significant promise in question-answering (QA) tasks, particularly in retrieval-augmented generation (RAG) scenarios and long-context applications. However, their performance is hindered by noisy reference documents, which often distract from essential information. Despite fine-tuning efforts, Transformer-based architectures struggle to prioritize relevant content. This is evidenced by their tendency to allocate disproportionate attention to irrelevant or later-positioned documents. Recent work proposes the differential attention mechanism to address this issue, but this mechanism is limited by an unsuitable common-mode rejection ratio (CMRR) and high computational costs. Inspired by the operational amplifier (OpAmp), we propose the OpAmp adaptation to address these challenges, which is implemented with adapters efficiently. By integrating the adapter into pre-trained Transformer blocks, our approach enhances focus on the golden context without costly training from scratch. Empirical evaluations on noisy-context benchmarks reveal that our Qwen2.5-OpAmp-72B model, trained with our OpAmp adaptation, surpasses the performance of state-of-the-art LLMs, including DeepSeek-V3 and GPT-4o.||
|**2025-02-17**|[Towards Mechanistic Interpretability of Graph Transformers via Attention Graphs](http://arxiv.org/abs/2502.12352)|**[link](https://github.com/batu-el/understanding-inductive-biases-of-gnns)**|We introduce Attention Graphs, a new tool for mechanistic interpretability of Graph Neural Networks (GNNs) and Graph Transformers based on the mathematical equivalence between message passing in GNNs and the self-attention mechanism in Transformers. Attention Graphs aggregate attention matrices across Transformer layers and heads to describe how information flows among input nodes. Through experiments on homophilous and heterophilous node classification tasks, we analyze Attention Graphs from a network science perspective and find that: (1) When Graph Transformers are allowed to learn the optimal graph structure using all-to-all attention among input nodes, the Attention Graphs learned by the model do not tend to correlate with the input/original graph structure; and (2) For heterophilous graphs, different Graph Transformer variants can achieve similar performance while utilising distinct information flow patterns. Open source code: https://github.com/batu-el/understanding-inductive-biases-of-gnns||
|**2025-02-17**|[Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory](http://arxiv.org/abs/2502.12344)|null|Transformers have become the backbone of neural network architecture for most machine learning applications. Their widespread use has resulted in multiple efforts on accelerating attention, the basic building block of transformers. This paper tackles the challenges associated with accelerating attention through a hardware-software co-design approach while leveraging compute-in-memory (CIM) architecture. In particular, our energy- and area-efficient CIM-based accelerator, named HASTILY, aims to accelerate softmax computation, an integral operation in attention, and to minimize its high on-chip memory requirements, which grow quadratically with input sequence length. Our architecture consists of novel CIM units called unified compute and lookup modules (UCLMs) that integrate both lookup and multiply-accumulate functionality within the same SRAM array, incurring minimal area overhead over standard CIM arrays. Designed in TSMC 65nm, UCLMs can be used to concurrently perform exponential and matrix-vector multiplication operations. Complementing the proposed architecture, HASTILY features a fine-grained pipelining strategy for scheduling both attention and feed-forward layers, to reduce the quadratic dependence on sequence length to linear dependence. Further, for fast softmax computation, which involves computing the maxima and sum of exponential values, such operations are parallelized across multiple cores using a reduce and gather strategy. We evaluate our proposed architecture using a compiler tailored towards attention computation and a standard cycle-level CIM simulator. Our evaluation shows end-to-end throughput (TOPS) improvement of 4.4x-9.8x and 1.7x-5.9x over Nvidia A40 GPU and baseline CIM hardware, respectively, for BERT models with INT-8 precision. Additionally, it shows gains of 16x-36x in energy-efficiency (TOPS/W) over A40 GPU and similar energy-efficiency as baseline CIM hardware.||
|**2025-02-17**|[AdaSplash: Adaptive Sparse Flash Attention](http://arxiv.org/abs/2502.12082)|**[link](https://github.com/deep-spin/adasplash)**|The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$ -entmax implementations. It approaches -- and in some cases surpasses -- the efficiency of highly optimized softmax implementations like FlashAttention-2, enabling long-context training while maintaining strong task performance.||
|**2025-02-17**|[JotlasNet: Joint Tensor Low-Rank and Attention-based Sparse Unrolling Network for Accelerating Dynamic MRI](http://arxiv.org/abs/2502.11749)|**[link](https://github.com/yhao-z/JotlasNet)**|Joint low-rank and sparse unrolling networks have shown superior performance in dynamic MRI reconstruction. However, existing works mainly utilized matrix low-rank priors, neglecting the tensor characteristics of dynamic MRI images, and only a global threshold is applied for the sparse constraint to the multi-channel data, limiting the flexibility of the network. Additionally, most of them have inherently complex network structure, with intricate interactions among variables. In this paper, we propose a novel deep unrolling network, JotlasNet, for dynamic MRI reconstruction by jointly utilizing tensor low-rank and attention-based sparse priors. Specifically, we utilize tensor low-rank prior to exploit the structural correlations in high-dimensional data. Convolutional neural networks are used to adaptively learn the low-rank and sparse transform domains. A novel attention-based soft thresholding operator is proposed to assign a unique learnable threshold to each channel of the data in the CNN-learned sparse domain. The network is unrolled from the elaborately designed composite splitting algorithm and thus features a simple yet efficient parallel structure. Extensive experiments on two datasets (OCMR, CMRxRecon) demonstrate the superior performance of JotlasNet in dynamic MRI reconstruction.||
|**2025-02-14**|[(How) Can Transformers Predict Pseudo-Random Numbers?](http://arxiv.org/abs/2502.10390)|null|Transformers excel at discovering patterns in sequential data, yet their fundamental limitations and learning mechanisms remain crucial topics of investigation. In this paper, we study the ability of Transformers to learn pseudo-random number sequences from linear congruential generators (LCGs), defined by the recurrence relation $x_{t+1} = a x_t + c \;\mathrm{mod}\; m$. Our analysis reveals that with sufficient architectural capacity and training data variety, Transformers can perform in-context prediction of LCG sequences with unseen moduli ($m$) and parameters ($a,c$). Through analysis of embedding layers and attention patterns, we uncover how Transformers develop algorithmic structures to learn these sequences in two scenarios of increasing complexity. First, we analyze how Transformers learn LCG sequences with unseen ($a, c$) but fixed modulus, and we demonstrate successful learning up to $m = 2^{32}$. Our analysis reveals that models learn to factorize the modulus and utilize digit-wise number representations to make sequential predictions. In the second, more challenging scenario of unseen moduli, we show that Transformers can generalize to unseen moduli up to $m_{\text{test}} = 2^{16}$. In this case, the model employs a two-step strategy: first estimating the unknown modulus from the context, then utilizing prime factorizations to generate predictions. For this task, we observe a sharp transition in the accuracy at a critical depth $=3$ . We also find that the number of in-context sequence elements needed to reach high accuracy scales sublinearly with the modulus.||
|**2025-02-14**|[PromptArtisan: Multi-instruction Image Editing in Single Pass with Complete Attention Control](http://arxiv.org/abs/2502.10258)|null|We present PromptArtisan, a groundbreaking approach to multi-instruction image editing that achieves remarkable results in a single pass, eliminating the need for time-consuming iterative refinement. Our method empowers users to provide multiple editing instructions, each associated with a specific mask within the image. This flexibility allows for complex edits involving mask intersections or overlaps, enabling the realization of intricate and nuanced image transformations. PromptArtisan leverages a pre-trained InstructPix2Pix model in conjunction with a novel Complete Attention Control Mechanism (CACM). This mechanism ensures precise adherence to user instructions, granting fine-grained control over the editing process. Furthermore, our approach is zero-shot, requiring no additional training, and boasts improved processing complexity compared to traditional iterative methods. By seamlessly integrating multi-instruction capabilities, single-pass efficiency, and complete attention control, PromptArtisan unlocks new possibilities for creative and efficient image editing workflows, catering to both novice and expert users alike.||
|**2025-02-14**|[Compress image to patches for Vision Transformer](http://arxiv.org/abs/2502.10120)|**[link](https://github.com/fanchy/ci2pvit)**|The Vision Transformer (ViT) has made significant strides in the field of computer vision. However, as the depth of the model and the resolution of the input images increase, the computational cost associated with training and running ViT models has surged dramatically. This paper proposes a hybrid model based on CNN and Vision Transformer, named CI2P-ViT. The model incorporates a module called CI2P, which utilizes the CompressAI encoder to compress images and subsequently generates a sequence of patches through a series of convolutions. CI2P can replace the Patch Embedding component in the ViT model, enabling seamless integration into existing ViT models. Compared to ViT-B/16, CI2P-ViT reduces the number of patches input to the self-attention layer to a quarter of the original. This design not only significantly reduces the computational cost of the ViT model but also markedly enhances the model's accuracy by introducing the inductive bias properties of CNN. When trained from the ground up on the Animals-10 dataset, CI2P-ViT achieved an accuracy rate of 92.37%, representing a 3.3% improvement over the ViT-B/16 baseline. Additionally, the model's computational cost, measured in floating-point operations (FLOPs), was reduced by 63.35%, and it exhibited a 2-fold increase in training speed on identical hardware configurations.||
|**2025-02-14**|[TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation](http://arxiv.org/abs/2502.09931)|null|Skip connection engineering is primarily employed to address the semantic gap between the encoder and decoder, while also integrating global dependencies to understand the relationships among complex anatomical structures in medical image segmentation. Although several models have proposed transformer-based approaches to incorporate global dependencies within skip connections, they often face limitations in capturing detailed local features with high computational complexity. In contrast, graph neural networks (GNNs) exploit graph structures to effectively capture local and global features. Leveraging these properties, we introduce an attentional cross-scale graph neural network (ACS-GNN), which enhances the skip connection framework by converting cross-scale feature maps into a graph structure and capturing complex anatomical structures through node attention. Additionally, we observed that deep learning models often produce uninformative feature maps, which degrades the quality of spatial attention maps. To address this problem, we integrated entropy-driven feature selection (EFS) with spatial attention, calculating an entropy score for each channel and filtering out high-entropy feature maps. Our innovative framework, TransGUNet, comprises ACS-GNN and EFS-based spatial attention to effectively enhance domain generalizability across various modalities by leveraging GNNs alongside a reliable spatial attention map, ensuring more robust features within the skip connection. Through comprehensive experiments and analysis, TransGUNet achieved superior segmentation performance on six seen and eight unseen datasets, demonstrating significantly higher efficiency compared to previous methods.||
|**2025-02-14**|[AttenGluco: Multimodal Transformer-Based Blood Glucose Forecasting on AI-READI Dataset](http://arxiv.org/abs/2502.09919)|null|Diabetes is a chronic metabolic disorder characterized by persistently high blood glucose levels (BGLs), leading to severe complications such as cardiovascular disease, neuropathy, and retinopathy. Predicting BGLs enables patients to maintain glucose levels within a safe range and allows caregivers to take proactive measures through lifestyle modifications. Continuous Glucose Monitoring (CGM) systems provide real-time tracking, offering a valuable tool for monitoring BGLs. However, accurately forecasting BGLs remains challenging because of fluctuations caused by physical activity, diet, and other factors. Recent deep learning models show promise in improving BGL prediction. Nonetheless, forecasting BGLs accurately from multimodal, irregularly sampled data over long prediction horizons remains a challenging research problem. In this paper, we propose AttenGluco, a multimodal Transformer-based framework for long-term blood glucose prediction. AttenGluco employs cross-attention to effectively integrate CGM and activity data, addressing challenges in fusing data with different sampling rates. Moreover, it employs multi-scale attention to capture long-term dependencies in temporal data, enhancing forecasting accuracy. To evaluate the performance of AttenGluco, we conduct forecasting experiments on the recently released AI-READI dataset, analyzing its predictive accuracy across different subject cohorts including healthy individuals, people with prediabetes, and those with type 2 diabetes. Furthermore, we investigate its performance improvements and forgetting behavior as new cohorts are introduced. Our evaluations show that AttenGluco improves all error metrics, such as root mean square error (RMSE), mean absolute error (MAE), and correlation, compared to the multimodal LSTM model. AttenGluco outperforms this baseline model by about 10% and 15% in terms of RMSE and MAE, respectively.||
|**2025-02-13**|[AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization](http://arxiv.org/abs/2502.09503)|null|Transformer architectures have transformed AI applications but remain complex to customize for domain experts lacking low-level implementation expertise. We introduce AttentionSmithy, a modular software package that simplifies transformer innovation by breaking down key components into reusable building blocks: attention modules, feed-forward networks, normalization layers, and positional encodings. Users can rapidly prototype and evaluate transformer variants without extensive coding. Our framework supports four positional encoding strategies and integrates with neural architecture search for automated design. We validate AttentionSmithy by replicating the original transformer under resource constraints and optimizing translation performance by combining positional encodings. Additionally, we demonstrate its adaptability in gene-specific modeling, achieving over 95% accuracy in cell type classification. These case studies highlight AttentionSmithy's potential to accelerate research across diverse fields by removing framework implementation barriers.||
|**2025-02-13**|[Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction](http://arxiv.org/abs/2502.09423)|null|Crystal structure forms the foundation for understanding the physical and chemical properties of materials. Generative models have emerged as a new paradigm in crystal structure prediction (CSP); however, accurately capturing key characteristics of crystal structures, such as periodicity and symmetry, remains a significant challenge. In this paper, we propose a Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction (TransVAE-CSP), which learns the characteristic distribution space of stable materials, enabling both the reconstruction and generation of crystal structures. TransVAE-CSP integrates adaptive distance expansion with irreducible representation to effectively capture the periodicity and symmetry of crystal structures, and the encoder is a transformer network based on an equivariant dot product attention mechanism. Experimental results on the carbon_24, perov_5, and mp_20 datasets demonstrate that TransVAE-CSP outperforms existing methods in structure reconstruction and generation tasks under various modeling metrics, offering a powerful tool for crystal structure design and optimization.||
|**2025-02-13**|[Simple Path Structural Encoding for Graph Transformers](http://arxiv.org/abs/2502.09365)|**[link](https://github.com/LouisBearing/Graph-SPSE-Encoding)**|Graph transformers extend global self-attention to graph-structured data, achieving notable success in graph learning. Recently, random walk structural encoding (RWSE) has been found to further enhance their predictive power by encoding both structural and positional information into the edge representation. However, RWSE cannot always distinguish between edges that belong to different local graph patterns, which reduces its ability to capture the full structural complexity of graphs. This work introduces Simple Path Structural Encoding (SPSE), a novel method that utilizes simple path counts for edge encoding. We show theoretically and experimentally that SPSE overcomes the limitations of RWSE, providing a richer representation of graph structures, particularly for capturing local cyclic patterns. To make SPSE computationally tractable, we propose an efficient approximate algorithm for simple path counting. SPSE demonstrates significant performance improvements over RWSE on various benchmarks, including molecular and long-range graph datasets, achieving statistically significant gains in discriminative tasks. These results pose SPSE as a powerful edge encoding alternative for enhancing the expressivity of graph transformers.||
|**2025-02-13**|[Feature-based Graph Attention Networks Improve Online Continual Learning](http://arxiv.org/abs/2502.09143)|null|Online continual learning for image classification is crucial for models to adapt to new data while retaining knowledge of previously learned tasks. This capability is essential to address real-world challenges involving dynamic environments and evolving data distributions. Traditional approaches predominantly employ Convolutional Neural Networks, which are limited to processing images as grids and primarily capture local patterns rather than relational information. Although the emergence of transformer architectures has improved the ability to capture relationships, these models often require significantly larger resources. In this paper, we present a novel online continual learning framework based on Graph Attention Networks (GATs), which effectively capture contextual relationships and dynamically update the task-specific representation via learned attention weights. Our approach utilizes a pre-trained feature extractor to convert images into graphs using hierarchical feature maps, representing information at varying levels of granularity. These graphs are then processed by a GAT and incorporate an enhanced global pooling strategy to improve classification performance for continual learning. In addition, we propose the rehearsal memory duplication technique that improves the representation of the previous tasks while maintaining the memory budget. Comprehensive evaluations on benchmark datasets, including SVHN, CIFAR10, CIFAR100, and MiniImageNet, demonstrate the superiority of our method compared to the state-of-the-art methods.||
|**2025-02-13**|[Lowering the Error Floor of Error Correction Code Transformer](http://arxiv.org/abs/2502.09065)|null|With the success of transformer architectures across diverse applications, the error correction code transformer (ECCT) has gained significant attention for its superior decoding performance. In spite of its advantages, the error floor phenomenon in ECCT decoding remains unexplored. We present the first investigation of the error floor issue in ECCT and propose a hybrid decoding approach that integrates hard decision decoders as pre- and post-decoders with ECCT to effectively lower the error floor. In particular, we introduce a novel loss function for ECCT that considers the dynamics of hybrid decoding algorithm. Training ECCT with the proposed loss function enhances its ability to correct specific error patterns by taking into account its interaction with the auxiliary decoders. Simulation results demonstrate that the proposed hybrid decoder with the novel loss function significantly outperforms the original ECCT in both the waterfall and the error floor regions.||
|**2025-02-13**|[MTDP: Modulated Transformer Diffusion Policy Model](http://arxiv.org/abs/2502.09029)|null|Recent research on robot manipulation based on Behavior Cloning (BC) has made significant progress. By combining diffusion models with BC, diffusion policy has been proposed, enabling robots to quickly learn manipulation tasks with high success rates. However, integrating diffusion policy with a high-capacity Transformer presents challenges: traditional Transformer architectures struggle to effectively integrate guiding conditions, resulting in poor performance in manipulation tasks when using Transformer-based models. In this paper, we investigate key architectural designs of Transformers and improve the traditional Transformer architecture by proposing the Modulated Transformer Diffusion Policy (MTDP) model for diffusion policy. The core of this model is our proposed Modulated Attention module, which more effectively integrates the guiding conditions with the main input, improving the generative model's output quality and, consequently, increasing the robot's task success rate. In six experimental tasks, MTDP outperformed existing Transformer model architectures, particularly in the Toolhang experiment, where the success rate increased by 12%. To verify the generality of Modulated Attention, we applied it to the UNet architecture to construct the Modulated UNet Diffusion Policy model (MUDP), which also achieved higher success rates than existing UNet architectures across all six experiments. The Diffusion Policy uses Denoising Diffusion Probabilistic Models (DDPM) as the diffusion model. Building on this, we also explored Denoising Diffusion Implicit Models (DDIM) as the diffusion model, constructing the MTDP-I and MUDP-I models, which nearly doubled the generation speed while maintaining performance.||
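|**2025-02-13**|[Hierarchical Vision Transformer with Prototypes for Interpretable Medical Image Classification](http://arxiv.org/abs/2502.08997)|null|Explainability is a highly important requirement for applications in high-risk domains such as medicine. Vision Transformers are mostly limited to extracting attention to provide insight into the model's reasoning. Our approach combines the high performance of Vision Transformers with new explainability capabilities. We propose HierViT, a Vision Transformer that is inherently interpretable and adapts its reasoning to that of humans. It uses a hierarchical structure to process domain-specific features for prediction. It is interpretable by design, as it derives the target output from human-defined features (prototypes) that can be visualized with exemplar images. By incorporating domain knowledge about these decisive features, the reasoning is semantically similar to human reasoning and therefore more intuitive. Moreover, attention heatmaps visualize the crucial regions for identifying each feature, thereby providing HierViT with a versatile tool for validating predictions. Evaluated on two medical benchmark datasets, LIDC-IDRI for lung nodule assessment and derm7pt for skin lesion classification, HierViT achieves superior and comparable prediction accuracy, respectively, while offering explanations that align with human reasoning.||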
|**2025-02-13**|[Biologically Plausible Brain Graph Transformer](http://arxiv.org/abs/2502.08958)|**[link](https://github.com/pcyyyy/BioBGT)**|State-of-the-art brain graph analysis methods fail to fully encode the small-world architecture of brain graphs (accompanied by the presence of hubs and functional modules), and therefore lack biological plausibility to some extent. This limitation hinders their ability to accurately represent the brain's structural and functional properties, thereby restricting the effectiveness of machine learning models in tasks such as brain disorder detection. In this work, we propose a novel Biologically Plausible Brain Graph Transformer (BioBGT) that encodes the small-world architecture inherent in brain graphs. Specifically, we present a network entanglement-based node importance encoding technique that captures the structural importance of nodes in global information propagation during brain graph communication, highlighting the biological properties of the brain structure. Furthermore, we introduce a functional module-aware self-attention to preserve the functional segregation and integration characteristics of brain graphs in the learned representations. Experimental results on three benchmark datasets demonstrate that BioBGT outperforms state-of-the-art models, enhancing biologically plausible brain graph representations for various brain graph analytical tasks.||
|**2025-02-12**|[Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding](http://arxiv.org/abs/2502.08363)|**[link](https://github.com/kostyanoob/top-theta-attention)**|The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top- $\theta$, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; instead, it requires only a brief calibration phase to be resilient to distribution shifts, thus not requiring the thresholds for different datasets to be recalibrated. Unlike top-k attention, Top-$\theta$ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.||
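|**2025-02-12**|[HDT: Hierarchical Discrete Transformer for Multivariate Time Series Forecasting](http://arxiv.org/abs/2502.08302)|**[link](https://github.com/hdtkk/HDT)**|Generative models have gained significant attention in multivariate time series (MTS) forecasting, particularly due to their ability to generate high-fidelity samples. Forecasting the probability distribution of multivariate time series is a challenging yet practical task. Although some recent attempts have been made to handle this task, two major challenges persist: 1) some existing generative methods underperform in high-dimensional multivariate time series forecasting and are hard to scale to higher dimensions; 2) the inherent high dimensionality of multivariate attributes constrains the forecasting lengths of existing generative models. In this paper, we point out that discrete token representations can model high-dimensional MTS with faster inference, and that forecasting the target with its own long-term trend can extend the forecasting length while maintaining high accuracy. Motivated by this, we propose a vector-quantized framework called Hierarchical Discrete Transformer (HDT) that models time series as discrete token representations using an l2-normalization-enhanced vector quantization strategy, in which we transform MTS forecasting into discrete token generation. To address the limitations of generative models in long-term forecasting, we propose a hierarchical discrete Transformer. This model captures the discrete long-term trend of the target at the low level and uses this trend as a condition to generate the discrete representation of the target at the high level, introducing features of the target itself to extend the forecasting length in high-dimensional MTS. Extensive experiments on five popular MTS datasets verify the effectiveness of our proposed method.||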
|**2025-02-11**|[LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid](http://arxiv.org/abs/2502.07563)|**[link](https://github.com/opensparsellms/linear-moe)**|Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very long input sequences. Compared to previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. In this way, only a single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements in both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The code is released as part of https://github.com/OpenSparseLLMs/Linear-MoE.||
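|**2025-02-11**|[Attention Learning is Needed to Efficiently Learn Parity Function](http://arxiv.org/abs/2502.07553)|null|Transformers, with their attention mechanisms, have emerged as the state-of-the-art architectures for sequence modeling and empirically outperform feed-forward neural networks (FFNNs) across many fields, such as natural language processing and computer vision. However, their generalization ability, particularly for low-sensitivity functions, remains less studied. We bridge this gap by analyzing Transformers on the $k$-parity problem. Daniely and Malach (NeurIPS 2020) show that FFNNs with one hidden layer and $O(nk^7 \log k)$ parameters can learn $k$-parity, where the input length $n$ is typically much larger than $k$. In this paper, we prove that FFNNs require at least $\Omega(n)$ parameters to learn $k$-parity, while Transformers require only $O(k)$ parameters, surpassing the theoretical lower bound needed by FFNNs. We further prove that this parameter efficiency cannot be achieved with fixed attention heads. Our work establishes Transformers as theoretically superior to FFNNs in learning the parity function, showing how their attention mechanisms enable parameter-efficient generalization in functions with low sensitivity.||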
|**2025-02-11**|[Exoplanet Transit Candidate Identification in TESS Full-Frame Images via a Transformer-Based Algorithm](http://arxiv.org/abs/2502.07542)|**[link](https://github.com/helemysm/FII_TransformerNN)**|The Transiting Exoplanet Survey Satellite (TESS) is surveying a large fraction of the sky, generating a vast database of photometric time series data that requires thorough analysis to identify exoplanetary transit signals. Automated learning approaches have been successfully applied to identify transit signals. However, most existing methods focus on the classification and validation of candidates, while few efforts have explored new techniques for the search of candidates. To search for new exoplanet transit candidates, we propose an approach to identify exoplanet transit signals without the need for phase folding or assuming periodicity in the transit signals, such as those observed in multi-transit light curves. To achieve this, we implement a new neural network inspired by Transformers to directly process Full Frame Image (FFI) light curves to detect exoplanet transits. Transformers, originally developed for natural language processing, have recently demonstrated significant success in capturing long-range dependencies compared to previous approaches focused on sequential data. This ability allows us to employ multi-head self-attention to identify exoplanet transit signals directly from the complete light curves, combined with background and centroid time series, without requiring prior transit parameters. The network is trained to learn characteristics of the transit signal, like the dip shape, which helps distinguish planetary transits from other variability sources. Our model successfully identified 214 new planetary system candidates, including 122 multi-transit light curves, 88 single-transit and 4 multi-planet systems from TESS sectors 1-26 with a radius > 0.27 $R_{\mathrm{Jupiter}}$ , demonstrating its ability to detect transits regardless of their periodicity.||
|**2025-02-11**|[Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More](http://arxiv.org/abs/2502.07490)|**[link](https://github.com/scitix/MEAP)**|Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.||
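|**2025-02-11**|[Optimizing Knowledge Distillation in Transformers: Enabling Multi-Head Attention without Alignment Barriers](http://arxiv.org/abs/2502.07436)|null|Knowledge distillation (KD) in Transformers often faces challenges due to mismatches in the number of attention heads between teacher and student models. Existing methods either require identical head counts or introduce projectors to bridge dimensional gaps, limiting flexibility and efficiency. We propose Squeezing-Heads Distillation (SHD), a novel approach that enables seamless knowledge transfer between models with different head counts by compressing multi-head attention maps via efficient linear approximation. Unlike prior work, SHD eliminates alignment barriers without additional parameters or architectural modifications. Our method dynamically approximates the combined effect of multiple teacher heads into fewer student heads, preserving fine-grained attention patterns while reducing redundancy. Experiments across language (LLaMA, GPT) and vision (DiT, MDT) generative tasks and vision (DeiT) discriminative tasks demonstrate SHD's effectiveness: it outperforms logit-based and feature-alignment KD baselines, achieving state-of-the-art results in image classification, image generation, language fine-tuning, and language pre-training. Key innovations, namely flexible head compression, a projector-free design, and linear-time complexity, make SHD a versatile and scalable solution for distilling modern Transformers. This work bridges a critical gap in KD, enabling efficient deployment of compact models without compromising performance.||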
|**2025-02-11**|[Fast-COS: A Fast One-Stage Object Detector Based on Reparameterized Attention Vision Transformer for Autonomous Driving](http://arxiv.org/abs/2502.07417)|null|The perception system plays a critical role in ensuring the safety of an autonomous driving system. The driving scene perception system fundamentally represents an object detection task that requires achieving a balance between accuracy and processing speed. Many contemporary methods focus on improving detection accuracy but often overlook the importance of real-time detection capabilities when computational resources are limited. Thus, it is vital to investigate efficient object detection strategies for driving scenes. This paper introduces Fast-COS, a novel single-stage object detection framework crafted specifically for driving scene applications. The research begins with an analysis of the backbone, considering both macro and micro architectural designs, yielding the Reparameterized Attention Vision Transformer (RAViT). RAViT utilizes Reparameterized Multi-Scale Depth-Wise Convolution (RepMSDW) and Reparameterized Self-Attention (RepSA) to enhance computational efficiency and feature extraction. In extensive tests across GPU, edge, and mobile platforms, RAViT achieves 81.4% Top-1 accuracy on the ImageNet-1K dataset, demonstrating significant throughput improvements over comparable backbone models such as ResNet, FastViT, RepViT, and EfficientFormer. Additionally, integrating RepMSDW into a feature pyramid network forms RepFPN, enabling fast and multi-scale feature fusion. Fast-COS enhances object detection in driving scenes, attaining an AP50 score of 57.2% on the BDD100K dataset and 80.0% on the TJU-DHD Traffic dataset. It surpasses leading models in efficiency, delivering up to 75.9% faster GPU inference and 1.38 higher throughput on edge devices compared to FCOS, YOLOF, and RetinaNet. These findings establish Fast-COS as a highly scalable and reliable solution suitable for real-time applications, especially in resource-limited environments like autonomous driving systems.||
|**2025-02-11**|[MIGT: Memory Instance Gated Transformer Framework for Financial Portfolio Management](http://arxiv.org/abs/2502.07280)|null|Deep reinforcement learning (DRL) has been applied in financial portfolio management to improve returns in changing market conditions. However, unlike most fields where DRL is widely used, the stock market is more volatile and dynamic as it is affected by several factors such as global events and investor sentiment. Therefore, it remains a challenge to construct a DRL-based portfolio management framework with strong return capability, stable training, and generalization ability. This study introduces a new framework utilizing the Memory Instance Gated Transformer (MIGT) for effective portfolio management. By incorporating a novel Gated Instance Attention module, which combines a transformer variant, instance normalization, and a Lite Gate Unit, our approach aims to maximize investment returns while ensuring the learning process's stability and reducing outlier impacts. Tested on the Dow Jones Industrial Average 30, our framework's performance is evaluated against fifteen other strategies using key financial metrics like the cumulative return and risk-return ratios (Sharpe, Sortino, and Omega ratios). The results highlight MIGT's advantage, showcasing at least a 9.75% improvement in cumulative returns and a minimum 2.36% increase in risk-return ratios over competing strategies, marking a significant advancement in DRL for portfolio management.||
|**2025-02-11**|[Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting](http://arxiv.org/abs/2502.07244)|**[link](https://github.com/ljc-fvnr/structural-aligned-mixture-of-var)**|Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention sometimes outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency compared to SOTA TSF models.||
|**2025-02-11**|[SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer](http://arxiv.org/abs/2502.07216)|null|Recent years have seen an increase in the use of gigapixel-level image and video capture systems and benchmarks with high-resolution wide (HRW) shots. However, unlike close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, making existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap in object detection between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel Cross-slice non-maximum suppression (C-NMS) algorithm to precisely localize objects from noisy windows and a simple yet effective multi-scale strategy to improve accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (up to 5.8%) and speed (up to 3x) over the state-of-the-art approaches.||
|**2025-02-10**|[DeepCrossAttention: Supercharging Transformer Residual Connections](http://arxiv.org/abs/2502.06785)|null|Transformer networks have achieved remarkable success across diverse domains, leveraging a variety of architectural innovations, including residual connections. However, traditional residual connections, which simply sum the outputs of previous layers, can dilute crucial information. This work introduces DeepCrossAttention (DCA), an approach that enhances residual learning in transformers. DCA employs learnable, input-dependent weights to dynamically combine layer outputs, enabling the model to selectively focus on the most relevant information in any of the previous layers. Furthermore, DCA incorporates depth-wise cross-attention, allowing for richer interactions between layers at different depths. Our language modeling experiments show that DCA achieves improved perplexity for a given training time. Moreover, DCA obtains the same model quality up to 3x faster while adding a negligible number of parameters. Theoretical analysis confirms that DCA provides an improved trade-off between accuracy and model size when the ratio of collective layer ranks to the ambient dimension falls below a critical threshold.||
|**2025-02-07**|[In-context denoising with one-layer transformers: connections between attention and associative memory retrieval](http://arxiv.org/abs/2502.05164)|null|We introduce in-context denoising, a task that refines the connection between attention-based architectures and dense associative memory (DAM) networks, also known as modern Hopfield networks. Using a Bayesian framework, we show theoretically and empirically that certain restricted denoising problems can be solved optimally even by a single-layer transformer. We demonstrate that a trained attention layer processes each denoising prompt by performing a single gradient descent update on a context-aware DAM energy landscape, where context tokens serve as associative memories and the query token acts as an initial state. This one-step update yields better solutions than exact retrieval of either a context token or a spurious local minimum, providing a concrete example of DAM networks extending beyond the standard retrieval paradigm. Overall, this work solidifies the link between associative memory and attention mechanisms first identified by Ramsauer et al., and demonstrates the relevance of associative memory models in the study of in-context learning.||
|**2025-02-07**|[Paying Attention to Facts: Quantifying the Knowledge Capacity of Attention Layers](http://arxiv.org/abs/2502.05076)|null|In this paper, we investigate the ability of single-layer attention-only transformers (i.e. attention layers) to memorize facts contained in databases from a linear-algebraic perspective. We associate with each database a 3-tensor, propose the rank of this tensor as a measure of the size of the database, and provide bounds on the rank in terms of properties of the database. We also define a 3-tensor corresponding to an attention layer, and empirically demonstrate the relationship between its rank and database rank on a dataset of toy models and random databases. By highlighting the roles played by the value-output and query-key weights, and the effects of argmax and softmax on rank, our results shed light on the 'additive motif' of factual recall in transformers, while also suggesting a way of increasing layer capacity without increasing the number of parameters.||
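|**2025-02-07**|[Wavelet-Assisted Multi-Frequency Attention Network for Pansharpening](http://arxiv.org/abs/2502.04903)|**[link](https://github.com/jie-1203/wfanet)**|Pansharpening aims to combine a high-resolution panchromatic (PAN) image with a low-resolution multispectral (LRMS) image to produce a high-resolution multispectral (HRMS) image. Although frequency-domain pansharpening has clear advantages, most existing methods either continue to operate solely in the spatial domain or fail to fully exploit the advantages of the frequency domain. To address this, we propose Multi-Frequency Fusion Attention (MFFA), which uses the wavelet transform to cleanly separate frequencies and achieve lossless reconstruction across different frequency domains. We then generate a frequency query, a spatial key, and a fusion value based on the physical meaning represented by different features, which enables more effective capture of specific information in the frequency domain. In addition, we focus on preserving frequency features across different operations. At a broader level, our network employs a wavelet pyramid to progressively fuse information across multiple scales. Compared with previous frequency-domain approaches, our network better prevents confusion and loss of different frequency features during fusion. Quantitative and qualitative experiments on multiple datasets show that our method outperforms existing approaches and generalizes well to real-world scenarios.||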
|**2025-02-06**|[Fast Video Generation with Sliding Tile Attention](http://arxiv.org/abs/2502.04507)|null|Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.||
|**2025-02-06**|[ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features](http://arxiv.org/abs/2502.04320)|**[link](https://github.com/helblazer811/ConceptAttention)**|Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention mechanisms. Remarkably, ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 11 other zero-shot interpretability methods on the ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our work contributes the first evidence that the representations of multi-modal DiT models like Flux are highly transferable to vision tasks like segmentation, even outperforming multi-modal foundation models like CLIP.||
|**2025-02-06**|[XAttnMark: Learning Robust Audio Watermarking with Cross-Attention](http://arxiv.org/abs/2502.04230)|null|The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to achieve both robust detection and accurate attribution simultaneously. This paper introduces Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility. Our approach achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing with strong editing strength. The project webpage is available at https://liuyixin-louis.github.io/xattnmark/.||
|**2025-02-06**|[Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation](http://arxiv.org/abs/2502.04139)|null|3D instance segmentation aims to predict a set of object instances in a scene and represent them as binary foreground masks with corresponding semantic labels. Currently, transformer-based methods are gaining increasing attention due to their elegant pipelines, reduced manual selection of geometric properties, and superior performance. However, transformer-based methods fail to simultaneously maintain strong position and content information during query initialization. Additionally, due to supervision at each decoder layer, there exists a phenomenon of object disappearance with the deepening of layers. To overcome these hurdles, we introduce Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation (BFL). Specifically, an Agent-Interpolation Initialization Module is designed to generate resilient queries capable of achieving a balance between foreground coverage and content learning. Additionally, a Hierarchical Query Fusion Decoder is designed to retain low overlap queries, mitigating the decrease in recall with the deepening of layers. Extensive experiments on ScanNetV2, ScanNet200, ScanNet++ and S3DIS datasets demonstrate the superior performance of BFL.||
|**2025-02-07**|[Decoding Human Attentive States from Spatial-temporal EEG Patches Using Transformers](http://arxiv.org/abs/2502.03736)|**[link](https://github.com/yi-ding-cs/eeg-patchformer)**|Learning the spatial topology of electroencephalogram (EEG) channels and their temporal dynamics is crucial for decoding attention states. This paper introduces EEG-PatchFormer, a transformer-based deep learning framework designed specifically for EEG attention classification in Brain-Computer Interface (BCI) applications. By integrating a Temporal CNN for frequency-based EEG feature extraction, a pointwise CNN for feature enhancement, and Spatial and Temporal Patching modules for organizing features into spatial-temporal patches, EEG-PatchFormer jointly learns spatial-temporal information from EEG data. Leveraging the global learning capabilities of the self-attention mechanism, it captures essential features across brain regions over time, thereby enhancing EEG data decoding performance. Demonstrating superior performance, EEG-PatchFormer surpasses existing benchmarks in accuracy, area under the ROC curve (AUC), and macro-F1 score on a public cognitive attention dataset. The code can be found via: https://github.com/yi-ding-cs/EEG-PatchFormer .||
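|**2025-02-05**|[TruePose: Human-Parsing-guided Attention Diffusion for Full-ID Preserving Pose Transfer](http://arxiv.org/abs/2502.03426)|null|Pose-Guided Person Image Synthesis (PGPIS) generates images that preserve the identity of the person in a source image under a given target pose (e.g., a skeleton). Although diffusion-based PGPIS methods effectively preserve facial features during pose transfer, they often struggle to accurately maintain clothing details from the source image throughout the diffusion process. This limitation is especially pronounced when the source and target poses differ substantially, which severely affects PGPIS applications in the fashion industry, where preserving clothing style is crucial for copyright protection. Our analysis shows that this limitation mainly stems from the attention modules of conditional diffusion models failing to adequately capture and preserve clothing patterns. To address this, we propose human-parsing-guided attention diffusion, a novel approach that effectively preserves both facial and clothing appearance while generating high-quality results. We propose a human-parsing-aware Siamese network consisting of three key components: two identical UNets (TargetNet for diffusion denoising and SourceNet for source image embedding extraction), a human-parsing-guided fusion attention (HPFA) module, and a CLIP-guided attention alignment (CAA) module. The HPFA and CAA modules embed face and clothing patterns into target image generation adaptively and effectively. Extensive experiments on the in-shop clothes retrieval benchmark and the latest in-the-wild human editing dataset show that our method has a significant advantage over 13 baseline approaches in preserving both facial and clothing appearance from the source image.||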
|**2025-02-05**|[From Features to Transformers: Redefining Ranking for Scalable Impact](http://arxiv.org/abs/2502.03417)|null|We present LiGR, a large-scale ranking framework developed at LinkedIn that brings state-of-the-art transformer-based modeling architectures into production. We introduce a modified transformer architecture that incorporates learned normalization and simultaneous set-wise attention to user history and ranked items. This architecture enables several breakthrough achievements, including: (1) the deprecation of most manually designed feature engineering, outperforming the prior state-of-the-art system using only a few features (compared to hundreds in the baseline), (2) validation of the scaling law for ranking systems, showing improved performance with larger models, more training data, and longer context sequences, and (3) simultaneous joint scoring of items in a set-wise manner, leading to automated improvements in diversity. To enable efficient serving of large ranking models, we describe techniques to scale inference effectively using single-pass processing of user history and set-wise attention. We also summarize key insights from various ablation studies and A/B tests, highlighting the most impactful technical approaches.||
|**2025-02-05**|[Edge Attention Module for Object Classification](http://arxiv.org/abs/2502.03103)|null|A novel edge attention-based Convolutional Neural Network (CNN) is proposed in this research for the object classification task. With the advent of advanced computing technology, CNN models have achieved remarkable success, particularly in computer vision applications. Nevertheless, the efficacy of the conventional CNN is often hindered by class imbalance and inter-class similarity problems, which are particularly prominent in the computer vision field. In this research, we introduce for the first time an "Edge Attention Module (EAM)" consisting of a Max-Min pooling layer followed by convolutional layers. This Max-Min pooling is an entirely novel pooling technique, specifically designed to capture only the edge information that is crucial for any object classification task. Therefore, by integrating this novel pooling technique into the attention module, the CNN inherently prioritizes essential edge features, thereby boosting the accuracy and F1-score of the model significantly. We have implemented our proposed EAM or 2EAMs on several standard pre-trained CNN models for the Caltech-101, Caltech-256, CIFAR-100 and Tiny ImageNet-200 datasets. The extensive experiments reveal that our proposed framework (that is, EAM with CNN and 2EAMs with CNN) outperforms all pre-trained CNN models as well as recent models such as the Pooling-based Vision Transformer (PiT), Convolutional Block Attention Module (CBAM), and ConvNext by substantial margins. We achieve accuracies of 95.5% and 86% with the proposed framework on the Caltech-101 and Caltech-256 datasets, respectively. To the best of our knowledge, these are the best results on these datasets so far.||
|**2025-02-04**|[A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI)](http://arxiv.org/abs/2502.02659)|**[link](https://github.com/academycityl/gali)**|Transformer-based Large Language Models (LLMs) struggle to process inputs exceeding their training context window, with performance degrading due to positional out-of-distribution (O.O.D.) issues that disrupt attention computations. Existing solutions, both fine-tuning and training-free methods, are limited by computational inefficiency, attention logit outliers, or loss of local positional information. To address this, we propose Greedy Attention Logit Interpolation (GALI), a training-free length extrapolation method that maximizes the utilization of pretrained positional intervals while avoiding attention logit outliers through attention logit interpolation. The results demonstrate that GALI consistently outperforms state-of-the-art training-free methods. Our findings reveal that LLMs interpret positional intervals unevenly within their training context window, suggesting that extrapolating within a smaller positional interval range yields superior results, even for short-context tasks. GALI represents a significant step toward resolving the positional O.O.D. challenge, enabling more reliable long-text understanding in LLMs. Our implementation of GALI, along with the experiments from our paper, is open-sourced at https://github.com/AcademyCityL/GALI.||
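|**2025-02-04**|[Spatio-temporal transformer to support automatic sign language translation](http://arxiv.org/abs/2502.02587)|null|Sign Language Translation (SLT) systems support communication for hearing-impaired people by finding correspondences between signed and spoken languages. However, the task is challenging because of multiple sign variations, the complexity of the language, and the inherent richness of expression. Computational approaches have shown the ability to support SLT. Nonetheless, these approaches remain limited in covering gesture variability and supporting long-sequence translation. This paper introduces a Transformer-based architecture that encodes spatio-temporal motion gestures, preserving both local and long-range spatial information through multiple convolutional and attention mechanisms. The proposed approach was validated on the Colombian Sign Language Translation Dataset (CoL-SLTD), where it outperforms baseline approaches with a BLEU4 score of 46.84%. It was also validated on RWTH-PHOENIX-Weather-2014T (PHOENIX14T), achieving a BLEU4 score of 30.77%, which demonstrates its robustness and effectiveness in handling real-world variations.||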
|**2025-02-04**|[Distribution Transformers: Fast Approximate Bayesian Inference With On-The-Fly Prior Adaptation](http://arxiv.org/abs/2502.02463)|null|While Bayesian inference provides a principled framework for reasoning under uncertainty, its widespread adoption is limited by the intractability of exact posterior computation, necessitating the use of approximate inference. However, existing methods are often computationally expensive, or demand costly retraining when priors change, limiting their utility, particularly in sequential inference problems such as real-time sensor fusion. To address these challenges, we introduce the Distribution Transformer -- a novel architecture that can learn arbitrary distribution-to-distribution mappings. Our method can be trained to map a prior to the corresponding posterior, conditioned on some dataset -- thus performing approximate Bayesian inference. Our novel architecture represents a prior distribution as a (universally-approximating) Gaussian Mixture Model (GMM), and transforms it into a GMM representation of the posterior. The components of the GMM attend to each other via self-attention, and to the datapoints via cross-attention. We demonstrate that Distribution Transformers both maintain the flexibility to vary the prior and significantly reduce computation times, from minutes to milliseconds, while achieving log-likelihood performance on par with or superior to existing approximate inference methods across tasks such as sequential inference, quantum system parameter inference, and Gaussian Process predictive posterior inference with hyperpriors.||
|**2025-01-31**|[Scalable-Softmax Is Superior for Attention](http://arxiv.org/abs/2501.19399)|null|The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.||
|**2025-01-31**|[Transformer-Based Financial Fraud Detection with Cloud-Optimized Real-Time Streaming](http://arxiv.org/abs/2501.19267)|null|As the financial industry becomes more interconnected and reliant on digital systems, fraud detection systems must evolve to meet growing threats. Cloud-enabled Transformer models present a transformative opportunity to address these challenges. By leveraging the scalability, flexibility, and advanced AI capabilities of cloud platforms, companies can deploy fraud detection solutions that adapt to real-time data patterns and proactively respond to evolving threats. Using the Graph self-attention Transformer neural network module, we can directly extract gang fraud features from the transaction network without complicated feature engineering. Finally, the fraud prediction network is combined to optimize the topological pattern and the temporal transaction pattern to realize high-precision detection of fraudulent transactions. The results of antifraud experiments on credit card transaction data show that the proposed model outperforms the 7 baseline models on all evaluation indicators: in the transaction fraud detection task, the average precision (AP) increased by 20% and the area under the ROC curve (AUC) increased by 2.7% on average compared with the benchmark graph attention neural network (GAT), which verifies the effectiveness of the proposed model in detecting fraudulent credit card transactions.||
|**2025-01-31**|[Strassen Attention: Unlocking Compositional Abilities in Transformers Based on a New Lower Bound Method](http://arxiv.org/abs/2501.19215)|null|We propose a novel method to evaluate the theoretical limits of Transformers, allowing us to prove the first lower bounds against one-layer softmax Transformers with infinite precision. We establish those bounds for three tasks that require advanced reasoning. The first task, Match3 (Sanford et al., 2023), requires looking at all triples of positions. The second and third tasks address compositionality-based reasoning: one is composition of functions (Peng et al., 2024) and the other is composition of binary relations. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. In an attempt to overcome these limitations, we introduce Strassen attention and prove that with this mechanism a one-layer Transformer can in principle solve all these tasks. We also show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously proposed mechanisms, such as higher-order attention (Sanford et al., 2023). To complement our theoretical findings, we experimentally studied Strassen attention and compared it against standard attention (Vaswani et al., 2017), higher-order attention (Sanford et al., 2023), and triangular attention (Bergen et al., 2021). Our results help to disentangle all these attention mechanisms, highlighting their strengths and limitations. In particular, Strassen attention outperforms standard attention significantly on all the tasks. Altogether, understanding the theoretical limitations can guide research towards scalable attention mechanisms that improve the reasoning abilities of Transformers.||
|**2025-01-31**|[CAAT-EHR: Cross-Attentional Autoregressive Transformer for Multimodal Electronic Health Record Embeddings](http://arxiv.org/abs/2501.18891)|**[link](https://github.com/bozdaglab/caat-ehr)**|Electronic health records (EHRs) provide a comprehensive source of longitudinal patient data, encompassing structured modalities such as laboratory results, imaging data, and vital signs, and unstructured clinical notes. These datasets, after necessary preprocessing to clean and format the data for analysis, often remain in their raw EHR form, representing numerical or categorical values without further transformation into task-agnostic embeddings. While such raw EHR data enables predictive modeling, its reliance on manual feature engineering or downstream task-specific optimization limits its utility for general-purpose applications. Deep learning (DL) techniques, such as recurrent neural networks (RNNs) and Transformers, have facilitated predictive tasks like disease progression and diagnosis prediction. However, these methods often struggle to fully exploit the temporal and multimodal dependencies inherent in EHR data due to their reliance on pre-processed but untransformed raw EHR inputs. In this study, we introduce CAAT-EHR, a novel architecture designed to bridge this gap by generating robust, task-agnostic longitudinal embeddings from raw EHR data. CAAT-EHR leverages self- and cross-attention mechanisms in its encoder to integrate temporal and contextual relationships across multiple modalities, transforming the data into enriched embeddings that capture complex dependencies. An autoregressive decoder complements the encoder by predicting future time points data during pre-training, ensuring that the resulting embeddings maintain temporal consistency and alignment. CAAT-EHR eliminates the need for manual feature engineering and enables seamless transferability across diverse downstream tasks. Extensive evaluations on benchmark datasets demonstrate the superiority of CAAT-EHR-generated embeddings over pre-processed raw EHR data and other baseline approaches.||
|**2025-01-30**|[Rope to Nope and Back Again: A New Hybrid Attention Strategy](http://arxiv.org/abs/2501.18795)|null|Long-context large language models (LLMs) have achieved remarkable advancements, driven by techniques like Rotary Position Embedding (RoPE) (Su et al., 2023) and its extensions (Chen et al., 2023; Liu et al., 2024c; Peng et al., 2023). By adjusting RoPE parameters and incorporating training data with extended contexts, we can train performant models with considerably longer input sequences. However, existing RoPE-based methods exhibit performance limitations when applied to extended context lengths. This paper presents a comprehensive analysis of various attention mechanisms, including RoPE, No Positional Embedding (NoPE), and Query-Key Normalization (QK-Norm), identifying their strengths and shortcomings in long-context modeling. Our investigation identifies distinctive attention patterns in these methods and highlights their impact on long-context performance, providing valuable insights for architectural design. Building on these findings, we propose a novel architecture based on a hybrid attention mechanism that not only surpasses conventional RoPE-based transformer models in long-context tasks but also achieves competitive performance on benchmarks requiring shorter context lengths.||
|**2025-01-30**|[Structure Development in List-Sorting Transformers](http://arxiv.org/abs/2501.18666)|null|We study how a one-layer attention-only transformer develops relevant structures while learning to sort lists of numbers. At the end of training, the model organizes its attention heads in two main modes that we refer to as vocabulary-splitting and copy-suppression. Both represent simpler modes than having multiple heads handle overlapping ranges of numbers. Interestingly, vocabulary-splitting is present regardless of whether we use weight decay, a common regularization technique thought to drive simplification, supporting the thesis that neural networks naturally prefer simpler solutions. We relate copy-suppression to a mechanism in GPT-2 and investigate its functional role in our model. Guided by insights from a developmental analysis of the model, we identify features in the training data that drive the model's final acquired solution. This provides a concrete example of how the training data shape the internal organization of transformers, paving the way for future studies that could help us better understand how LLMs develop their internal structures.||
|**2025-01-30**|[Efficient Transformer for High Resolution Image Motion Deblurring](http://arxiv.org/abs/2501.18403)|**[link](https://github.com/hamzafer/image-deblurring)**|This paper presents a comprehensive study and improvement of the Restormer architecture for high-resolution image motion deblurring. We introduce architectural modifications that reduce model complexity by 18.4% while maintaining or improving performance through optimized attention mechanisms. Our enhanced training pipeline incorporates additional transformations including color jitter, Gaussian blur, and perspective transforms to improve model robustness as well as a new frequency loss term. Extensive experiments on the RealBlur-R, RealBlur-J, and Ultra-High-Definition Motion blurred (UHDM) datasets demonstrate the effectiveness of our approach. The improved architecture shows better convergence behavior and reduced training time while maintaining competitive performance across challenging scenarios. We also provide detailed ablation studies analyzing the impact of our modifications on model behavior and performance. Our results suggest that thoughtful architectural simplification combined with enhanced training strategies can yield more efficient yet equally capable models for motion deblurring tasks. Code and Data Available at: https://github.com/hamzafer/image-deblurring||
|**2025-01-31**|[MatIR: A Hybrid Mamba-Transformer Image Restoration Model](http://arxiv.org/abs/2501.18401)|**[link](https://github.com/wenjuan7275/matir)**|In recent years, Transformer-based models have made significant progress in the field of image restoration by leveraging their inherent ability to capture complex contextual features. Recently, Mamba models have made a splash in the field of computer vision due to their ability to handle long-range dependencies and their significant computational efficiency compared to Transformers. However, Mamba currently lags behind Transformers in contextual learning capabilities. To overcome the limitations of these two models, we propose a Mamba-Transformer hybrid image restoration model called MatIR. Specifically, MatIR cross-cycles the blocks of the Transformer layer and the Mamba layer to extract features, thereby exploiting the strengths of both architectures. In the Mamba module, we introduce the Image Inpainting State Space (IRSS) module, which traverses along four scan paths to achieve efficient processing of long sequence data. In the Transformer module, we combine triangular window-based local attention with channel-based global attention to effectively activate the attention mechanism over a wider range of image pixels. Extensive experimental results and ablation studies demonstrate the effectiveness of our approach.||
|**2025-01-30**|[A Unified Perspective on the Dynamics of Deep Transformers](http://arxiv.org/abs/2501.18322)|null|Transformers, which are state-of-the-art in most machine learning tasks, represent the data as sequences of vectors called tokens. This representation is then exploited by the attention function, which learns dependencies between tokens and is key to the success of Transformers. However, the iterative application of attention across layers induces complex dynamics that remain to be fully understood. To analyze these dynamics, we identify each input sequence with a probability measure and model its evolution as a Vlasov equation called Transformer PDE, whose velocity field is non-linear in the probability measure. Our first set of contributions focuses on compactly supported initial data. We show the Transformer PDE is well-posed and is the mean-field limit of an interacting particle system, thus generalizing and extending previous analysis to several variants of self-attention: multi-head attention, L2 attention, Sinkhorn attention, Sigmoid attention, and masked attention--leveraging a conditional Wasserstein framework. In a second set of contributions, we are the first to study non-compactly supported initial conditions, by focusing on Gaussian initial data. Again for different types of attention, we show that the Transformer PDE preserves the space of Gaussian measures, which allows us to analyze the Gaussian case theoretically and numerically to identify typical behaviors. This Gaussian analysis captures the evolution of data anisotropy through a deep Transformer. In particular, we highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.||
|**2025-01-30**|[In-Context Learning of Polynomial Kernel Regression in Transformers with GLU Layers](http://arxiv.org/abs/2501.18187)|null|Transformer-based models have demonstrated remarkable ability in in-context learning (ICL), where they can adapt to unseen tasks from a prompt with a few examples, without requiring parameter updates. Recent research has provided insight into how linear Transformers can perform ICL by implementing gradient descent estimators. In particular, it has been shown that the optimal linear self-attention (LSA) mechanism can implement one step of gradient descent with respect to a linear least-squares objective when trained on random linear regression tasks. However, the theoretical understanding of ICL for nonlinear function classes remains limited. In this work, we address this gap by first showing that LSA is inherently restricted to solving linear least-squares objectives and thus, the solutions in prior works cannot readily extend to nonlinear ICL tasks. To overcome this limitation, drawing inspiration from modern architectures, we study a mechanism that combines LSA with GLU-like feed-forward layers and show that this allows the model to perform one step of gradient descent on a polynomial kernel regression. Further, we characterize the scaling behavior of the resulting Transformer model, highlighting the necessary model size to effectively handle quadratic ICL tasks. Our findings highlight the distinct roles of attention and feed-forward layers in nonlinear ICL and identify key challenges when extending ICL to nonlinear function classes.||
|**2025-01-29**|[TransRAD: Retentive Vision Transformer for Enhanced Radar Object Detection](http://arxiv.org/abs/2501.17977)|**[link](https://github.com/radar-lab/transrad)**|Despite significant advancements in environment perception capabilities for autonomous driving and intelligent robotics, cameras and LiDARs remain notoriously unreliable in low-light conditions and adverse weather, which limits their effectiveness. Radar serves as a reliable and low-cost sensor that can effectively complement these limitations. However, radar-based object detection has been underexplored due to the inherent weaknesses of radar data, such as low resolution, high noise, and lack of visual information. In this paper, we present TransRAD, a novel 3D radar object detection model designed to address these challenges by leveraging the Retentive Vision Transformer (RMT) to more effectively learn features from information-dense radar Range-Azimuth-Doppler (RAD) data. Our approach leverages the Retentive Manhattan Self-Attention (MaSA) mechanism provided by RMT to incorporate explicit spatial priors, thereby enabling more accurate alignment with the spatial saliency characteristics of radar targets in RAD data and achieving precise 3D radar detection across Range-Azimuth-Doppler dimensions. Furthermore, we propose Location-Aware NMS to effectively mitigate the common issue of duplicate bounding boxes in deep radar object detection. The experimental results demonstrate that TransRAD outperforms state-of-the-art methods in both 2D and 3D radar detection tasks, achieving higher accuracy, faster inference speed, and reduced computational complexity. Code is available at https://github.com/radar-lab/TransRAD||
|**2025-01-29**|[Shared DIFF Transformer](http://arxiv.org/abs/2501.17900)|null|DIFF Transformer improves attention allocation by enhancing focus on relevant context while suppressing noise. It introduces a differential attention mechanism that calculates the difference between two independently generated attention distributions, effectively reducing noise and promoting sparse attention patterns. However, the independent signal generation in DIFF Transformer results in parameter redundancy and suboptimal utilization of information. In this work, we propose Shared DIFF Transformer, which draws on the idea of a differential amplifier by introducing a shared base matrix to model global patterns and incorporating low-rank updates to enhance task-specific flexibility. This design significantly reduces parameter redundancy, improves efficiency, and retains strong noise suppression capabilities. Experimental results show that, compared to DIFF Transformer, our method achieves better performance in tasks such as long-sequence modeling, key information retrieval, and in-context learning. Our work provides a novel and efficient approach to optimizing differential attention mechanisms and advancing robust Transformer architectures.||
|**2025-01-29**|[DINT Transformer](http://arxiv.org/abs/2501.17486)|null|DIFF Transformer addresses the issue of irrelevant context interference by introducing a differential attention mechanism that enhances the robustness of local attention. However, it has two critical limitations: the lack of global context modeling, which is essential for identifying globally significant tokens, and numerical instability due to the absence of strict row normalization in the attention matrix. To overcome these challenges, we propose DINT Transformer, which extends DIFF Transformer by incorporating a differential-integral mechanism. By computing global importance scores and integrating them into the attention matrix, DINT Transformer improves its ability to capture global dependencies. Moreover, the unified parameter design enforces row-normalized attention matrices, improving numerical stability. Experimental results demonstrate that DINT Transformer excels in accuracy and robustness across various practical applications, such as long-context language modeling and key information retrieval. These results position DINT Transformer as a highly effective and promising architecture.||
|**2025-01-28**|[Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models](http://arxiv.org/abs/2501.17088)|**[link](https://github.com/intellabs/hardware-aware-automated-machine-learning)**|Large pre-trained models have achieved outstanding results in sequence modeling. The Transformer block and its attention mechanism have been the main drivers of the success of these models. Recently, alternative architectures, such as Selective Structured State Space Models (SSMs), have been proposed to address the inefficiencies of Transformers. This paper explores the compression of SSM-based models, particularly Mamba and its hybrids. We study the sensitivity of these models to the removal of selected components at different granularities to reduce the model size and computational overhead, thus improving their efficiency while maintaining accuracy. The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that model efficiency can be improved by eliminating several redundancies with minimal impact on the overall model performance. The code is available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.||
|**2025-01-28**|[Graph Transformers for inverse physics: reconstructing flows around arbitrary 2D airfoils](http://arxiv.org/abs/2501.17081)|null|We introduce a Graph Transformer framework that serves as a general inverse physics engine on meshes, demonstrated through the challenging task of reconstructing aerodynamic flow fields from sparse surface measurements. While deep learning has shown promising results in forward physics simulation, inverse problems remain particularly challenging due to their ill-posed nature and the difficulty of propagating information from limited boundary observations. Our approach addresses these challenges by combining the geometric expressiveness of message-passing neural networks with the global reasoning of Transformers, enabling efficient learning of inverse mappings from boundary conditions to complete states. We evaluate this framework on a comprehensive dataset of steady-state RANS simulations around diverse airfoil geometries, where the task is to reconstruct full pressure and velocity fields from surface pressure measurements alone. The architecture achieves high reconstruction accuracy while maintaining fast inference times. We conduct experiments and provide insights into the relative importance of local geometric processing and global attention mechanisms in mesh-based inverse problems. We also find that the framework is robust to reduced sensor coverage. These results suggest that Graph Transformers can serve as effective inverse physics engines across a broader range of applications where complete system states must be reconstructed from limited boundary observations.||
|**2025-01-28**|[CASK: A Gauge Covariant Transformer for Lattice Gauge Theory](http://arxiv.org/abs/2501.16955)|null|We propose a Transformer neural network architecture specifically designed for lattice QCD, focusing on preserving the fundamental symmetries required in lattice gauge theory. The proposed architecture is gauge covariant/equivariant, ensuring it respects gauge symmetry on the lattice, and is also equivariant under spacetime symmetries such as rotations and translations on the lattice. A key feature of our approach lies in the attention matrix, which forms the core of the Transformer architecture. To preserve symmetries, we define the attention matrix using a Frobenius inner product between link variables and extended staples. This construction ensures that the attention matrix remains invariant under gauge transformations, thereby making the entire Transformer architecture covariant. We evaluated the performance of the gauge covariant Transformer in the context of self-learning HMC. Numerical experiments show that the proposed architecture achieves higher performance compared to the gauge covariant neural networks, demonstrating its potential to improve lattice QCD calculations.||
|**2025-01-28**|[RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception](http://arxiv.org/abs/2501.16803)|**[link](https://github.com/LantaoLi/RG-Attn)**|Cooperative perception offers an optimal solution to overcome the perception limitations of single-agent systems by leveraging Vehicle-to-Everything (V2X) communication for data sharing and fusion across multiple agents. However, most existing approaches focus on single-modality data exchange, limiting the potential of both homogeneous and heterogeneous fusion across agents. This overlooks the opportunity to utilize multi-modality data per agent, restricting the system's performance. In the automotive industry, manufacturers adopt diverse sensor configurations, resulting in heterogeneous combinations of sensor modalities across agents. To harness the potential of every possible data source for optimal performance, we design a robust LiDAR and camera cross-modality fusion module, Radian-Glue-Attention (RG-Attn), applicable to both intra-agent cross-modality fusion and inter-agent cross-modality fusion scenarios, owing to the convenient coordinate conversion by transformation matrix and the unified sampling/inversion mechanism. We also propose two different architectures, named Paint-To-Puzzle (PTP) and Co-Sketching-Co-Coloring (CoS-CoCo), for conducting cooperative perception. PTP aims for maximum precision performance and achieves smaller data packet size by limiting cross-agent fusion to a single instance, but requires all participants to be equipped with LiDAR. In contrast, CoS-CoCo supports agents with any configuration (LiDAR-only, camera-only, or both LiDAR and camera), offering better generalization ability. Our approach achieves state-of-the-art (SOTA) performance on both real and simulated cooperative perception datasets. The code will be released on GitHub in early 2025.||
|**2025-01-28**|[Exponential Family Attention](http://arxiv.org/abs/2501.16790)|**[link](https://github.com/yixinw-lab/efa)**|The self-attention mechanism is the backbone of the transformer neural network underlying most large language models. It can capture complex word patterns and long-range dependencies in natural language. This paper introduces exponential family attention (EFA), a probabilistic generative model that extends self-attention to handle high-dimensional sequence, spatial, or spatial-temporal data of mixed data types, including both discrete and continuous observations. The key idea of EFA is to model each observation conditional on all other existing observations, called the context, whose relevance is learned in a data-driven way via an attention-based latent factor model. In particular, unlike static latent embeddings, EFA uses the self-attention mechanism to capture dynamic interactions in the context, where the relevance of each context observation depends on other observations. We establish an identifiability result and provide a generalization guarantee on excess loss for EFA. Across real-world and synthetic data sets -- including U.S. city temperatures, Instacart shopping baskets, and MovieLens ratings -- we find that EFA consistently outperforms existing models in capturing complex latent structures and reconstructing held-out data.||
|**2025-01-28**|[ITVTON:Virtual Try-On Diffusion Transformer Model Based on Integrated Image and Text](http://arxiv.org/abs/2501.16757)|null|Recent advancements in virtual fitting for characters and clothing have leveraged diffusion models to improve the realism of garment fitting. However, challenges remain in handling complex scenes and poses, which can result in unnatural garment fitting and poorly rendered intricate patterns. In this work, we introduce ITVTON, a novel method that enhances clothing-character interactions by combining clothing and character images along spatial channels as inputs, thereby improving fitting accuracy for the inpainting model. Additionally, we incorporate integrated textual descriptions from multiple images to boost the realism of the generated visual effects. To optimize computational efficiency, we limit training to the attention parameters within a single diffusion transformer (Single-DiT) block. To more rigorously address the complexities of real-world scenarios, we curated training samples from the IGPair dataset, thereby enhancing ITVTON's performance across diverse environments. Extensive experiments demonstrate that ITVTON outperforms baseline methods both qualitatively and quantitatively, setting a new standard for virtual fitting tasks.||
|**2025-01-28**|[Overcoming Semantic Dilution in Transformer-Based Next Frame Prediction](http://arxiv.org/abs/2501.16753)|null|Next-frame prediction in videos is crucial for applications such as autonomous driving, object tracking, and motion prediction. The primary challenge in next-frame prediction lies in effectively capturing and processing both spatial and temporal information from previous video sequences. The transformer architecture, known for its prowess in handling sequence data, has made remarkable progress in this domain. However, transformer-based next-frame prediction models face notable issues: (a) The multi-head self-attention (MHSA) mechanism requires the input embedding to be split into $N$ chunks, where $N$ is the number of heads. Each segment captures only a fraction of the original embedding's information, which distorts the representation of the embedding in the latent space, resulting in a semantic dilution problem; (b) These models predict the embeddings of the next frames rather than the frames themselves, but the loss function is based on the errors of the reconstructed frames, not the predicted embeddings -- this creates a discrepancy between the training objective and the model output. We propose a Semantic Concentration Multi-Head Self-Attention (SCMHSA) architecture, which effectively mitigates semantic dilution in transformer-based next-frame prediction. Additionally, we introduce a loss function that optimizes SCMHSA in the latent space, aligning the training objective more closely with the model output. Our method demonstrates superior performance compared to the original transformer-based predictors.||
|**2025-01-28**|[Toward Relative Positional Encoding in Spiking Transformers](http://arxiv.org/abs/2501.16745)|null|Spiking neural networks (SNNs) are bio-inspired networks that model how neurons in the brain communicate through discrete spikes, which have great potential in various tasks due to their energy efficiency and temporal processing capabilities. SNNs with self-attention mechanisms (Spiking Transformers) have recently shown great advancements in various tasks such as sequential modeling and image classifications. However, integrating positional information, which is essential for capturing sequential relationships in data, remains a challenge in Spiking Transformers. In this paper, we introduce an approximate method for relative positional encoding (RPE) in Spiking Transformers, leveraging Gray Code as the foundation for our approach. We provide comprehensive proof of the method's effectiveness in partially capturing relative positional information for sequential tasks. Additionally, we extend our RPE approach by adapting it to a two-dimensional form suitable for image patch processing. We evaluate the proposed RPE methods on several tasks, including time series forecasting, text classification, and patch-based image classification. Our experimental results demonstrate that the incorporation of RPE significantly enhances performance by effectively capturing relative positional information.||
|**2025-01-28**|[Chinese Stock Prediction Based on a Multi-Modal Transformer Framework: Macro-Micro Information Fusion](http://arxiv.org/abs/2501.16621)|null|This paper proposes an innovative Multi-Modal Transformer framework (MMF-Trans) designed to significantly improve the prediction accuracy of the Chinese stock market by integrating multi-source heterogeneous information including macroeconomy, micro-market, financial text, and event knowledge. The framework consists of four core modules: (1) A four-channel parallel encoder that processes technical indicators, financial text, macro data, and event knowledge graph respectively for independent feature extraction of multi-modal data; (2) A dynamic gated cross-modal fusion mechanism that adaptively learns the importance of different modalities through differentiable weight allocation for effective information integration; (3) A time-aligned mixed-frequency processing layer that uses an innovative position encoding method to effectively fuse data of different time frequencies and solves the time alignment problem of heterogeneous data; (4) A graph attention-based event impact quantification module that captures the dynamic impact of events on the market through event knowledge graph and quantifies the event impact coefficient. We introduce a hybrid-frequency Transformer and Event2Vec algorithm to effectively fuse data of different frequencies and quantify the event impact. Experimental results show that in the prediction task of CSI 300 constituent stocks, the root mean square error (RMSE) of the MMF-Trans framework is reduced by 23.7% compared to the baseline model, the event response prediction accuracy is improved by 41.2%, and the Sharpe ratio is improved by 32.6%.||
|**2025-01-27**|[The Linear Attention Resurrection in Vision Transformer](http://arxiv.org/abs/2501.16182)|null|Vision Transformers (ViTs) have recently taken computer vision by storm. However, the softmax attention underlying ViTs comes with a quadratic complexity in time and memory, hindering the application of ViTs to high-resolution images. We revisit the attention design and propose a linear attention method to address the limitation, which, unlike existing methods (e.g. local window attention of Swin), doesn't sacrifice ViT's core advantage of capturing global representation. We further investigate the key difference between linear attention and softmax attention. Our empirical results suggest that linear attention lacks a fundamental property of concentrating the distribution of the attention matrix. Inspired by this observation, we introduce a local concentration module to enhance linear attention. By incorporating enhanced linear global attention and local window attention, we propose a new ViT architecture, dubbed L$^2$ViT. Notably, L$^2$ViT can effectively capture both global interactions and local representations while enjoying linear computational complexity. Extensive experiments demonstrate the strong performance of L$^2$ViT. On image classification, L$^2$ViT achieves 84.4% Top-1 accuracy on ImageNet-1K without any extra training data or labels. By further pre-training on ImageNet-22k, it attains 87.0% when fine-tuned with resolution 384$^2$. For downstream tasks, L$^2$ViT delivers favorable performance as a backbone on object detection as well as semantic segmentation.||
|**2025-01-24**|[ZETA: Leveraging Z-order Curves for Efficient Top-k Attention](http://arxiv.org/abs/2501.14577)|null|Over recent years, the Transformer has become a fundamental building block for sequence modeling architectures. Yet at its core is the use of self-attention, whose memory and computational cost grow quadratically with the sequence length $N$, rendering it prohibitively expensive for long sequences. A promising approach is top-$k$ attention, which selects only the $k$ most relevant tokens and achieves performance comparable to vanilla self-attention while significantly reducing space and computational demands. However, causal masks require the current query token to only attend to past tokens, preventing the existing top-$k$ attention method from efficiently searching for the most relevant tokens in parallel, thereby limiting training efficiency. In this work, we propose ZETA, leveraging **Z**-Order Curves for **E**fficient **T**op-$k$ **A**ttention, to enable parallel querying of past tokens for entire sequences. We first theoretically show that the choice of key and query dimensions involves a trade-off between the curse of dimensionality and the preservation of relative distances after projection. In light of this insight, we propose reducing the dimensionality of keys and queries in contrast to values and further leverage $Z$-order curves to map low-dimensional keys and queries into *one*-dimensional space, which permits parallel sorting, thereby largely improving the efficiency for top-$k$ token selection. Experimental results demonstrate that ZETA matches the performance of standard attention on the synthetic Multi-Query Associative Recall task and outperforms attention and its variants on Long Range Arena and WikiText-103 language modeling.||
|**2025-01-23**|[ME-CPT: Multi-Task Enhanced Cross-Temporal Point Transformer for Urban 3D Change Detection](http://arxiv.org/abs/2501.14004)|**[link](https://github.com/zhangluqi0209/me-cpt)**|The point clouds collected by the Airborne Laser Scanning (ALS) system provide accurate 3D information of urban land covers. By utilizing multi-temporal ALS point clouds, semantic changes in urban areas can be captured, demonstrating significant potential in urban planning, emergency management, and infrastructure maintenance. Existing 3D change detection methods struggle to efficiently extract multi-class semantic information and change features, still facing the following challenges: (1) the difficulty of accurately modeling spatial relationships between cross-temporal point clouds for effective change feature extraction; (2) class imbalance of change samples which hinders distinguishability of semantic features; (3) the lack of real-world datasets for 3D semantic change detection. To resolve these challenges, we propose the Multi-task Enhanced Cross-temporal Point Transformer (ME-CPT) network. ME-CPT establishes spatiotemporal correspondences between point clouds across different epochs and employs attention mechanisms to jointly extract semantic change features, facilitating information exchange and change comparison. Additionally, we incorporate a semantic segmentation task and, through the multi-task training strategy, further enhance the distinguishability of semantic features, reducing the impact of class imbalance in change types. Moreover, we release a 22.5 km$^2$ 3D semantic change detection dataset, offering diverse scenes for comprehensive evaluation. Experiments on multiple datasets show that the proposed ME-CPT achieves superior performance compared to existing state-of-the-art methods. The source code and dataset will be released upon acceptance at https://github.com/zhangluqi0209/ME-CPT.||
|**2025-01-23**|[FreEformer: Frequency Enhanced Transformer for Multivariate Time Series Forecasting](http://arxiv.org/abs/2501.13989)|null|This paper presents **FreEformer**, a simple yet effective model that leverages a **Fre**quency **E**nhanced Trans**former** for multivariate time series forecasting. Our work is based on the assumption that the frequency spectrum provides a global perspective on the composition of series across various frequencies and is highly suitable for robust representation learning. Specifically, we first convert time series into the complex frequency domain using the Discrete Fourier Transform (DFT). The Transformer architecture is then applied to the frequency spectra to capture cross-variate dependencies, with the real and imaginary parts processed independently. However, we observe that the vanilla attention matrix exhibits a low-rank characteristic, thus limiting representation diversity. This could be attributed to the inherent sparsity of the frequency domain and the strong-value-focused nature of Softmax in vanilla attention. To address this, we enhance the vanilla attention mechanism by introducing an additional learnable matrix to the original attention matrix, followed by row-wise L1 normalization. Theoretical analysis demonstrates that this enhanced attention mechanism improves both feature diversity and gradient flow. Extensive experiments demonstrate that FreEformer consistently outperforms state-of-the-art models on eighteen real-world benchmarks covering electricity, traffic, weather, healthcare and finance. Notably, the enhanced attention mechanism also consistently improves the performance of state-of-the-art Transformer-based forecasters.||
|**2025-01-23**|[Quantized Spike-driven Transformer](http://arxiv.org/abs/2501.13492)|**[link](https://github.com/bollossom/qsd-transformer)**|Spiking neural networks are emerging as a promising energy-efficient alternative to traditional artificial neural networks due to their spike-driven paradigm. However, recent research in the SNN domain has mainly focused on enhancing accuracy by designing large-scale Transformer structures, which typically rely on substantial computational resources, limiting their deployment on resource-constrained devices. To overcome this challenge, we propose a quantized spike-driven Transformer baseline (QSD-Transformer), which achieves reduced resource demands by utilizing a low bit-width parameter. Regrettably, the QSD-Transformer often suffers from severe performance degradation. In this paper, we first conduct empirical analysis and find that the bimodal distribution of quantized spike-driven self-attention (Q-SDSA) leads to spike information distortion (SID) during quantization, causing significant performance degradation. To mitigate this issue, we take inspiration from mutual information entropy and propose a bi-level optimization strategy to rectify the information distribution in Q-SDSA. Specifically, at the lower level, we introduce an information-enhanced LIF to rectify the information distribution in Q-SDSA. At the upper level, we propose a fine-grained distillation scheme for the QSD-Transformer to align the distribution in Q-SDSA with that in the counterpart ANN. By integrating the bi-level optimization strategy, the QSD-Transformer can attain enhanced energy efficiency without sacrificing its high-performance advantage. For instance, when compared to the prior SNN benchmark on ImageNet, the QSD-Transformer achieves 80.3% top-1 accuracy, accompanied by significant reductions of 6.0$\times$ and 8.1$\times$ in power consumption and model size, respectively. Code is available at https://github.com/bollossom/QSD-Transformer.||
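|**2025-01-23**|[Multi-Level Attention and Contrastive Learning for Enhanced Text Classification with an Optimized Transformer](http://arxiv.org/abs/2501.13467)|null|This paper studies a text classification algorithm based on an improved Transformer, aiming to improve the model's performance and efficiency on text classification tasks. To address the shortcomings of the traditional Transformer in capturing deep semantic relationships and in computational complexity, the paper introduces a multi-level attention mechanism and a contrastive learning strategy. The multi-level attention mechanism combines global and local attention to model both the global semantics and the local features of a text; the contrastive learning strategy constructs positive and negative sample pairs to strengthen the model's ability to distinguish between categories and thereby improve classification. In addition, to improve training and inference efficiency on large-scale text data, the paper designs a lightweight module that optimizes the feature transformation process and reduces computational cost. Experimental results on the dataset show that the improved Transformer outperforms comparison models such as BiLSTM, CNN, standard Transformer, and BERT in classification accuracy, F1 score, and recall, demonstrating stronger semantic representation ability and generalization performance. The proposed method offers a new idea for algorithm optimization in text classification and has good application potential and practical value. Future work will focus on studying the model's performance on multi-class imbalanced datasets and cross-domain tasks, and on exploring its integration with other techniques.||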
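|**2025-01-23**|[KAA: Kolmogorov-Arnold Attention for Enhancing Attentive Graph Neural Networks](http://arxiv.org/abs/2501.13456)|**[link](https://github.com/luckytiger123/kaa)**|In recent years, graph neural networks (GNNs) with attention mechanisms, often referred to as attentive GNNs, have become a prominent paradigm among advanced GNN models. However, our understanding of the critical process of scoring neighbor nodes remains limited, which leads to underperformance in many existing attentive GNNs. In this paper, we unify the scoring functions of current attentive GNNs and propose Kolmogorov-Arnold Attention (KAA), which integrates the Kolmogorov-Arnold Network (KAN) architecture into the scoring process. KAA improves the performance of scoring functions across the board and can be applied to nearly all existing attentive GNNs. To compare the expressive power of KAA with other scoring functions, we introduce Maximum Ranking Distance (MRD) to quantitatively estimate their upper bounds on ranking errors for node importance. Our analysis shows that, under limited parameters and constraints on width and depth, both linear transformation-based and MLP-based scoring functions exhibit finite expressive power. In contrast, our proposed KAA, even with a single-layer KAN parameterized by zero-order B-spline functions, demonstrates nearly infinite expressive power. Extensive experiments on both node-level and graph-level tasks using various backbone models show that KAA-enhanced scoring functions consistently outperform their original counterparts, with performance improvements of over 20% in some cases.||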
|**2025-01-23**|[Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models](http://arxiv.org/abs/2501.13428)|null|Large language models have achieved remarkable success in recent years, primarily due to the implementation of self-attention mechanisms. However, traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm. We identify the latter as essential for maintaining model performance. By replacing the non-linear transformation with the Softplus activation function and introducing a dynamic length scale factor for different token lengths based on invariance entropy, we create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths. To further improve the length extrapolation ability of the proposed attention mechanism, we introduce a re-weighting mechanism that amplifies significant attention weights while diminishing weaker ones, enabling the model to concentrate more effectively on relevant tokens. When combined with our proposed attention mechanism, this approach demonstrates significant promise in managing longer sequences, maintaining nearly constant validation loss even at 16$\times$ the training token length while ensuring numerical stability. Our code is available at: https://github.com/iminfine/freeatten.||
|**2025-01-23**|[M3PT: A Transformer for Multimodal, Multi-Party Social Signal Prediction with Person-aware Blockwise Attention](http://arxiv.org/abs/2501.13416)|**[link](https://github.com/abraranwar/masked-social-signals)**|Understanding social signals in multi-party conversations is important for human-robot interaction and artificial social intelligence. Multi-party interactions include social signals like body pose, head pose, speech, and context-specific activities like acquiring and taking bites of food when dining. Incorporating all the multimodal signals in a multi-party interaction is difficult, and past work tends to build task-specific models for predicting social signals. In this work, we address the challenge of predicting multimodal social signals in multi-party settings in a single model. We introduce M3PT, a causal transformer architecture with modality and temporal blockwise attention masking which allows for the simultaneous processing of multiple social cues across multiple participants and their temporal interactions. This approach better captures social dynamics over time by considering longer horizons of social signals between individuals. We train and evaluate our unified model on the Human-Human Commensality Dataset (HHCD), and demonstrate that using multiple modalities improves bite timing and speaking status prediction. Source code: https://github.com/AbrarAnwar/masked-social-signals/||
|**2025-01-23**|[Contrast: A Hybrid Architecture of Transformers and State Space Models for Low-Level Vision](http://arxiv.org/abs/2501.13353)|null|Transformers have become increasingly popular for image super-resolution (SR) tasks due to their strong global context modeling capabilities. However, their quadratic computational complexity necessitates the use of window-based attention mechanisms, which restricts the receptive field and limits effective context expansion. Recently, the Mamba architecture has emerged as a promising alternative with linear computational complexity, allowing it to avoid window mechanisms and maintain a large receptive field. Nevertheless, Mamba faces challenges in handling long-context dependencies when high pixel-level precision is required, as in SR tasks. This is due to its hidden state mechanism, which can compress and store a substantial amount of context but only in an approximate manner, leading to inaccuracies that transformers do not suffer from. In this paper, we propose **Contrast**, a hybrid SR model that combines **Con**volutional, **Tra**nsformer, and **St**ate Space components, effectively blending the strengths of transformers and Mamba to address their individual limitations. By integrating transformer and state space mechanisms, **Contrast** compensates for the shortcomings of each approach, enhancing both global context modeling and pixel-level accuracy. We demonstrate that combining these two architectures allows us to mitigate the problems inherent in each, resulting in improved performance on image super-resolution tasks.||
|**2025-01-22**|[LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation](http://arxiv.org/abs/2501.12976)|null|In commonly used sub-quadratic complexity modules, linear attention benefits from simplicity and high parallelism, making it promising for image synthesis tasks. However, the architectural design and learning strategy for linear attention remain underexplored in this field. In this paper, we offer a suite of ready-to-use solutions for efficient linear diffusion Transformers. Our core contributions include: (1) Simplified Linear Attention using few heads, observing the free-lunch effect of performance without latency increase. (2) Weight inheritance from a fully pre-trained diffusion Transformer: initializing linear Transformer using pre-trained diffusion Transformer and loading all parameters except for those related to linear attention. (3) Hybrid knowledge distillation objective: using a pre-trained diffusion Transformer to help the training of the student linear Transformer, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed Linear Diffusion Transformer (LiT), an efficient text-to-image Transformer that can be deployed offline on a laptop. Experiments show that in class-conditional 256*256 and 512*512 ImageNet benchmark LiT achieves highly competitive FID while reducing training steps by 80% and 77% compared to DiT. LiT also rivals methods based on Mamba or Gated Linear Attention. Besides, for text-to-image generation, LiT allows for the rapid synthesis of up to 1K resolution photorealistic images. Project page: https://techmonsterwang.github.io/LiT/.||
|**2025-01-22**|[Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference](http://arxiv.org/abs/2501.12959)|null|Although applications involving long-context inputs are crucial for the effective utilization of large language models (LLMs), they also result in increased computational costs and reduced performance. To address this challenge, we propose an efficient, training-free prompt compression method that retains key information within compressed prompts. We identify specific attention heads in transformer-based LLMs, which we designate as evaluator heads, that are capable of selecting tokens in long inputs that are most significant for inference. Building on this discovery, we develop EHPC, an Evaluator Head-based Prompt Compression method, which enables LLMs to rapidly "skim through" input prompts by leveraging only the first few layers with evaluator heads during the pre-filling stage, subsequently passing only the important tokens to the model for inference. EHPC achieves state-of-the-art results across two mainstream benchmarks: prompt compression and long-context inference acceleration. Consequently, it effectively reduces the complexity and costs associated with commercial API calls. We further demonstrate that EHPC attains competitive results compared to key-value cache-based acceleration methods, thereby highlighting its potential to enhance the efficiency of LLMs for long-context tasks.||
|**2025-01-22**|[Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi](http://arxiv.org/abs/2501.12900)|null|Convolutional neural networks (CNNs) evaluate short-range correlations in input images which progress along the layers, whereas vision transformer (ViT) architectures evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers. Both are designed to solve complex classification tasks but from different perspectives. This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism, which quantitatively measures the single-nodal performance (SNP) of each node in feedforward (FF) and multi-head attention (MHA) subblocks. Each node identifies small clusters of possible output labels, with additional noise represented as labels outside these clusters. These features are progressively sharpened along the transformer encoders, enhancing the signal-to-noise ratio. This unified underlying learning mechanism leads to two main findings. First, it enables an efficient applied nodal diagonal connection (ANDC) pruning technique without affecting the accuracy. Second, based on the SNP, spontaneous symmetry breaking occurs among the MHA heads, such that each head focuses its attention on a subset of labels through cooperation among its SNPs. Consequently, each head becomes an expert in recognizing its designated labels, representing a quantitative MHA modus vivendi mechanism. These results are based on a compact convolutional transformer architecture trained on the CIFAR-100 and Flowers-102 datasets and call for their extension to other architectures and applications, such as natural language processing.||
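|**2025-01-21**|[DLEN: Dual Branch of Transformer for Low-Light Image Enhancement in Dual Domains](http://arxiv.org/abs/2501.12235)|null|Low-light image enhancement (LLE) aims to improve the visual quality of images captured in poorly lit conditions, which often suffer from low brightness, low contrast, noise, and color distortion. These issues hinder the performance of computer vision tasks such as object detection, facial recognition, and autonomous driving. Traditional enhancement techniques, such as multi-scale fusion and histogram equalization, fail to preserve fine details and often struggle to maintain the natural appearance of enhanced images under complex lighting conditions. Although Retinex theory provides a foundation for image decomposition, it often amplifies noise, leading to suboptimal image quality. In this paper, we propose the Dual Light Enhancement Network (DLEN), a novel architecture that incorporates two distinct attention mechanisms and considers both the spatial and frequency domains. Our model introduces a learnable wavelet transform module in the illumination estimation phase, preserving high- and low-frequency components to enhance edge and texture details. Additionally, we design a dual-branch structure that leverages the power of the Transformer architecture to enhance both the illumination and structural components of the image. In extensive experiments, our model outperforms state-of-the-art methods on standard benchmarks. Code is available at https://github.com/LaLaLoXX/DLEN||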
|**2025-01-21**|[Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model](http://arxiv.org/abs/2501.12206)|**[link](https://github.com/hasanar1f/llava-hallunication-fix)**|Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models frequently exhibit hallucination behavior, where they generate descriptions containing objects or details absent in the input image. Our work investigates this phenomenon by analyzing attention patterns across transformer layers and heads, revealing that hallucinations often stem from progressive degradation of visual grounding in deeper layers. We propose a novel attention modification approach that combines selective token emphasis and head-specific modulation to maintain visual grounding throughout the generation process. Our method introduces two key components: (1) a dual-stream token selection mechanism that identifies and prioritizes both locally informative and spatially significant visual tokens, and (2) an attention head-specific modulation strategy that differentially amplifies visual information processing based on measured visual sensitivity of individual attention heads. Through extensive experimentation on the MSCOCO dataset, we demonstrate that our approach reduces hallucination rates by up to 62.3\% compared to baseline models while maintaining comparable task performance. Our analysis reveals that selectively modulating tokens across attention heads with varying levels of visual sensitivity can significantly improve visual grounding without requiring model retraining.||
|**2025-01-21**|[Speech Enhancement with Overlapped-Frame Information Fusion and Causal Self-Attention](http://arxiv.org/abs/2501.12004)|**[link](https://github.com/zhangyuewei98/ofif-net)**|For time-frequency (TF) domain speech enhancement (SE) methods, the overlap-and-add operation in the inverse TF transformation inevitably leads to an algorithmic delay equal to the window size. However, typical causal SE systems fail to utilize the future speech information within this inherent delay, thereby limiting SE performance. In this paper, we propose an overlapped-frame information fusion scheme. At each frame index, we construct several pseudo overlapped-frames, fuse them with the original speech frame, and then send the fused results to the SE model. Additionally, we introduce a causal time-frequency-channel attention (TFCA) block to boost the representation capability of the neural network. This block parallelly processes the intermediate feature maps through self-attention-based operations in the time, frequency, and channel dimensions. Experiments demonstrate the superiority of these improvements, and the proposed SE system outperforms the current advanced methods.||
|**2025-01-21**|[WaveNet-SF: A Hybrid Network for Retinal Disease Detection Based on Wavelet Transform in the Spatial-Frequency Domain](http://arxiv.org/abs/2501.11854)|null|Retinal diseases are a leading cause of vision impairment and blindness, with timely diagnosis being critical for effective treatment. Optical Coherence Tomography (OCT) has become a standard imaging modality for retinal disease diagnosis, but OCT images often suffer from issues such as speckle noise, complex lesion shapes, and varying lesion sizes, making interpretation challenging. In this paper, we propose a novel framework, WaveNet-SF, to enhance retinal disease detection by integrating spatial-domain and frequency-domain learning. The framework utilizes wavelet transforms to decompose OCT images into low- and high-frequency components, enabling the model to extract both global structural features and fine-grained details. To improve lesion detection, we introduce a multi-scale wavelet spatial attention (MSW-SA) module, which enhances the model's focus on regions of interest at multiple scales. Additionally, a high-frequency feature compensation block (HFFC) is incorporated to recover edge information lost during wavelet decomposition, suppress noise, and preserve fine details crucial for lesion detection. Our approach achieves state-of-the-art (SOTA) classification accuracies of 97.82% and 99.58% on the OCT-C8 and OCT2017 datasets, respectively, surpassing existing methods. These results demonstrate the efficacy of WaveNet-SF in addressing the challenges of OCT image analysis and its potential as a powerful tool for retinal disease diagnosis.||
|**2025-01-20**|[Is logical analysis performed by transformers taking place in self-attention or in the fully connected part?](http://arxiv.org/abs/2501.11765)|null|Transformer architectures apply self-attention to tokens represented as vectors, before a fully connected (neural network) layer. These two parts can be layered many times. Traditionally, self-attention is seen as a mechanism for aggregating information before logical operations are performed by the fully connected layer. In this paper, we show that, quite counter-intuitively, the logical analysis can also be performed within the self-attention. For this we implement a handcrafted single-level encoder layer which performs the logical analysis within self-attention. We then study the scenario in which a one-level transformer model undergoes self-learning using gradient descent. We investigate whether the model utilizes fully connected layers or self-attention mechanisms for logical analysis when it has the choice. Given that gradient descent can become stuck at undesired zeros, we explicitly calculate these unwanted zeros and find ways to avoid them. We do all this in the context of predicting grammatical category pairs of adjacent tokens in a text. We believe that our findings have broader implications for understanding the potential logical operations performed by self-attention.||
|**2025-01-20**|[CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation](http://arxiv.org/abs/2501.11325)|**[link](https://github.com/zheng-chong/catton)**|Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with reduced resource demands. We also present ViViD-S, a refined video try-on dataset, achieved by filtering back-facing frames and applying 3D mask smoothing for enhanced temporal consistency. Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-ons across diverse scenarios.||
|**2025-01-20**|[Hybrid Photonic-digital Accelerator for Attention Mechanism](http://arxiv.org/abs/2501.11286)|null|The wide adoption and substantial computational resource requirements of attention-based Transformers have spurred the demand for efficient hardware accelerators. Unlike digital-based accelerators, there is growing interest in exploring photonics due to its high energy efficiency and ultra-fast processing speeds. However, the significant signal conversion overhead limits the performance of photonic-based accelerators. In this work, we propose HyAtten, a photonic-based attention accelerator with minimized signal conversion overhead. HyAtten incorporates a signal comparator to classify signals into two categories based on whether they can be processed by low-resolution converters. HyAtten integrates low-resolution converters to process all low-resolution signals, thereby boosting the parallelism of photonic computing. For signals requiring high-resolution conversion, HyAtten uses digital circuits instead of signal converters to reduce area and latency overhead. Compared to a state-of-the-art photonic-based Transformer accelerator, HyAtten achieves 9.8X performance/area and 2.2X energy-efficiency/area improvement.||
|**2025-01-17**|[Enhancing the Reliability in Machine Learning for Gravitational Wave Parameter Estimation with Attention-Based Models](http://arxiv.org/abs/2501.10486)|null|We introduce a technique to enhance the reliability of gravitational wave parameter estimation results produced by machine learning. We develop two independent machine learning models based on the Vision Transformer to estimate effective spin and chirp mass from spectrograms of gravitational wave signals from binary black hole mergers. To enhance the reliability of these models, we utilize attention maps to visualize the areas our models focus on when making predictions. This approach enables demonstrating that both models perform parameter estimation based on physically meaningful information. Furthermore, by leveraging these attention maps, we demonstrate a method to quantify the impact of glitches on parameter estimation. We show that as the models focus more on glitches, the parameter estimation results become more strongly biased. This suggests that attention maps could potentially be used to distinguish between cases where the results produced by the machine learning model are reliable and cases where they are not.||
|**2025-01-16**|[Prompt-CAM: A Simpler Interpretable Transformer for Fine-Grained Analysis](http://arxiv.org/abs/2501.09333)|null|We present a simple usage of pre-trained Vision Transformers (ViTs) for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as different bird species or dog breeds. Pre-trained ViTs such as DINO have shown remarkable capabilities to extract localized, informative features. However, using saliency maps like Grad-CAM can hardly point out the traits: they often locate the whole object by a blurred, coarse heatmap, not traits. We propose a novel approach Prompt Class Attention Map (Prompt-CAM) to the rescue. Prompt-CAM learns class-specific prompts to a pre-trained ViT and uses the corresponding outputs for classification. To classify an image correctly, the true-class prompt must attend to the unique image patches not seen in other classes' images, i.e., traits. As such, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a free lunch by simply modifying the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM fairly easy to train and apply, sharply contrasting other interpretable methods that design specific models and training processes. It is even simpler than the recently published INterpretable TRansformer (INTR), whose encoder-decoder architecture prevents it from leveraging pre-trained ViTs. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate Prompt-CAM's superior interpretation capability.||
|**2025-01-16**|[Soft Knowledge Distillation with Multi-Dimensional Cross-Net Attention for Image Restoration Models Compression](http://arxiv.org/abs/2501.09321)|null|Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity, manifested in elevated FLOPs and parameter counts, limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL divergence loss with a contrastive learning loss at the image level. Experiments on three tasks (image deraining, deblurring, and denoising) demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.||
|**2025-01-15**|[Attention is All You Need Until You Need Retention](http://arxiv.org/abs/2501.09166)|null|This work introduces a novel Retention Layer mechanism for Transformer based architectures, addressing their inherent lack of intrinsic retention capabilities. Unlike human cognition, which can encode and dynamically recall symbolic templates, Generative Pretrained Transformers rely solely on fixed pretrained weights and ephemeral context windows, limiting their adaptability. The proposed Retention Layer incorporates a persistent memory module capable of real time data population, dynamic recall, and guided output generation. This enhancement allows models to store, update, and reuse observed patterns across sessions, enabling incremental learning and bridging the gap between static pretraining and dynamic, context sensitive adaptation. The Retention Layer design parallels social learning processes, encompassing attention, retention, reproduction, and motivation stages. Technically, it integrates a memory attention mechanism and episodic buffers to manage memory scalability, mitigate overfitting, and ensure efficient recall. Applications span adaptive personal assistants, real time fraud detection, autonomous robotics, content moderation, and healthcare diagnostics. In each domain, the retention mechanism enables systems to learn incrementally, personalize outputs, and respond to evolving real world challenges effectively. By emulating key aspects of human learning, this retention enhanced architecture fosters a more fluid and responsive AI paradigm, paving the way for dynamic, session aware models that extend the capabilities of traditional Transformers into domains requiring continual adaptation.||
|**2025-01-15**|[Multi-View Transformers for Airway-To-Lung Ratio Inference on Cardiac CT Scans: The C4R Study](http://arxiv.org/abs/2501.08902)|null|The ratio of airway tree lumen to lung size (ALR), assessed at full inspiration on high resolution full-lung computed tomography (CT), is a major risk factor for chronic obstructive pulmonary disease (COPD). There is growing interest to infer ALR from cardiac CT images, which are widely available in epidemiological cohorts, to investigate the relationship of ALR to severe COVID-19 and post-acute sequelae of SARS-CoV-2 infection (PASC). Previously, cardiac scans included approximately 2/3 of the total lung volume with 5-6x greater slice thickness than high-resolution (HR) full-lung (FL) CT. In this study, we present a novel attention-based Multi-view Swin Transformer to infer FL ALR values from segmented cardiac CT scans. For the supervised training we exploit paired full-lung and cardiac CTs acquired in the Multi-Ethnic Study of Atherosclerosis (MESA). Our network significantly outperforms a proxy direct ALR inference on segmented cardiac CT scans and achieves accuracy and reproducibility comparable with a scan-rescan reproducibility of the FL ALR ground-truth.||
|**2025-01-15**|[Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models](http://arxiv.org/abs/2501.08727)|null|Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assumption and desired fine-tuning weights prevents the simultaneous acquisition of ultra-parameter-efficiency and better performance. To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations. Specifically, we first apply a full-rank and dense transform to the pre-trained weight. This learnable transform is expected to align the pre-trained weight as closely as possible to the desired weight, thereby reducing the rank of the residual weight. Then, the residual part can be effectively approximated by more compact and parameter-efficient structures, with a smaller approximation error. To achieve ultra-parameter-efficiency in practice, we design highly flexible and effective tensor decompositions for both the transform and residual adaptations. Additionally, popular PEFT methods such as DoRA can be summarized under this transform plus residual adaptation scheme. Experiments are conducted on fine-tuning Stable Diffusion models in subject-driven and controllable generation. The results show that our method can achieve better performance and parameter efficiency compared to LoRA and several baselines.||
|**2025-01-15**|[Transformer-based Multivariate Time Series Anomaly Localization](http://arxiv.org/abs/2501.08628)|null|With the growing complexity of Cyber-Physical Systems (CPS) and the integration of Internet of Things (IoT), the use of sensors for online monitoring generates a large volume of multivariate time series (MTS) data. Consequently, the need for robust anomaly diagnosis in MTS is paramount to maintaining system reliability and safety. While significant advancements have been made in anomaly detection, localization remains a largely underexplored area, though crucial for intelligent decision-making. This paper introduces a novel transformer-based model for unsupervised anomaly diagnosis in MTS, with a focus on improving localization performance, through an in-depth analysis of the self-attention mechanism's learning behavior under both normal and anomalous conditions. We formulate the anomaly localization problem as a three-stage process: time-step, window, and segment-based. This leads to the development of the Space-Time Anomaly Score (STAS), a new metric inspired by the connection between transformer latent representations and space-time statistical models. STAS is designed to capture individual anomaly behaviors and inter-series dependencies, delivering enhanced localization performance. Additionally, the Statistical Feature Anomaly Score (SFAS) complements STAS by analyzing statistical features around anomalies, with their combination helping to reduce false alarms. Experiments on real-world and synthetic datasets illustrate the model's superiority over state-of-the-art methods in both detection and localization tasks.||
|**2025-01-15**|[MIAFEx: An Attention-based Feature Extraction Method for Medical Image Classification](http://arxiv.org/abs/2501.08562)|null|Feature extraction techniques are crucial in medical image classification; however, classical feature extractors in addition to traditional machine learning classifiers often exhibit significant limitations in providing sufficient discriminative information for complex image sets. While Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) have shown promise in feature extraction, they are prone to overfitting due to the inherent characteristics of medical imaging data, including small sample sizes or high intra-class variance. In this work, the Medical Image Attention-based Feature Extractor (MIAFEx) is proposed, a novel method that employs a learnable refinement mechanism to enhance the classification token within the Transformer encoder architecture. This mechanism adjusts the token based on learned weights, improving the extraction of salient features and enhancing the model's adaptability to the challenges presented by medical imaging data. The MIAFEx output features quality is compared against classical feature extractors using traditional and hybrid classifiers. Also, the performance of these features is compared against modern CNN and ViT models in classification tasks, demonstrating its superiority in accuracy and robustness across multiple complex classification medical imaging datasets. This advantage is particularly pronounced in scenarios with limited training data, where traditional and modern models often struggle to generalize effectively. The source code of this proposal can be found at https://github.com/Oscar-RamosS/Medical-Image-Attention-based-Feature-Extractor-MIAFEx||
|**2025-01-14**|[Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach](http://arxiv.org/abs/2501.08114)|null|Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to multistage fusion strategy, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To solve these challenges, we propose SAT-Cap based on the transformers model with a single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Compared to typical models that require multi-stage fusion in transformer encoder and fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in Spatial-Channel Attention Encoder, our approach significantly enhances the model's ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, achieving CIDEr scores of 140.23% on the LEVIR-CC dataset and 97.74% on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be available online.||
|**2025-01-14**|[A Comparative Analysis of Transformer-less Inverter Topologies for Grid-Connected PV Systems: Minimizing Leakage Current and THD](http://arxiv.org/abs/2501.08103)|null|The integration of distributed energy resources (DERs), particularly photovoltaic (PV) systems, into power grids has gained major attention due to their environmental and economic benefits. Although traditional transformer-based grid-connected PV inverters provide galvanic isolation for leakage current, they suffer from major drawbacks of high cost, lower efficiency, and increased size. Transformer-less grid-connected PV inverters (TLGI) have emerged as a prominent alternative, as they achieve higher efficiency, compact design, and lower cost. However, due to a lack of galvanic isolation, TLGIs are highly affected by leakage current caused by the fluctuation of common-mode voltage (CMV). This paper investigates three topologies H4, H5, and HERIC with comparisons between their CMV, differential-mode voltage (DMV), total harmonic distortion (THD), and leakage current. A simulation was conducted for each topology in MATLAB/Simulink R2023a, and the results demonstrate that the H5 topology achieves a balance between low leakage current, reduced THD, and optimal operational efficiency, making it suitable for practical application.||
|**2025-01-14**|[Dynamic Multimodal Sentiment Analysis: Leveraging Cross-Modal Attention for Enabled Classification](http://arxiv.org/abs/2501.08085)|null|This paper explores the development of a multimodal sentiment analysis model that integrates text, audio, and visual data to enhance sentiment classification. The goal is to improve emotion detection by capturing the complex interactions between these modalities, thereby enabling more accurate and nuanced sentiment interpretation. The study evaluates three feature fusion strategies -- late stage fusion, early stage fusion, and multi-headed attention -- within a transformer-based architecture. Experiments were conducted using the CMU-MOSEI dataset, which includes synchronized text, audio, and visual inputs labeled with sentiment scores. Results show that early stage fusion significantly outperforms late stage fusion, achieving an accuracy of 71.87%, while the multi-headed attention approach offers marginal improvement, reaching 72.39%. The findings suggest that integrating modalities early in the process enhances sentiment classification, while attention mechanisms may have limited impact within the current framework. Future work will focus on refining feature fusion techniques, incorporating temporal data, and exploring dynamic feature weighting to further improve model performance.||
|**2025-01-14**|[Comprehensive Metapath-based Heterogeneous Graph Transformer for Gene-Disease Association Prediction](http://arxiv.org/abs/2501.07970)|null|Discovering gene-disease associations is crucial for understanding disease mechanisms, yet identifying these associations remains challenging due to the time and cost of biological experiments. Computational methods are increasingly vital for efficient and scalable gene-disease association prediction. Graph-based learning models, which leverage node features and network relationships, are commonly employed for biomolecular predictions. However, existing methods often struggle to effectively integrate node features, heterogeneous structures, and semantic information. To address these challenges, we propose COmprehensive MEtapath-based heterogeneous graph Transformer(COMET) for predicting gene-disease associations. COMET integrates diverse datasets to construct comprehensive heterogeneous networks, initializing node features with BioGPT. We define seven Metapaths and utilize a transformer framework to aggregate Metapath instances, capturing global contexts and long-distance dependencies. Through intra- and inter-metapath aggregation using attention mechanisms, COMET fuses latent vectors from multiple Metapaths to enhance GDA prediction accuracy. Our method demonstrates superior robustness compared to state-of-the-art approaches. Ablation studies and visualizations validate COMET's effectiveness, providing valuable insights for advancing human health research.||
|**2025-01-14**|[An Efficient Sparse Hardware Accelerator for Spike-Driven Transformer](http://arxiv.org/abs/2501.07825)|null|Recently, large models, such as Vision Transformer and BERT, have garnered significant attention due to their exceptional performance. However, their extensive computational requirements lead to considerable power and hardware resource consumption. Brain-inspired computing, characterized by its spike-driven methods, has emerged as a promising approach for low-power hardware implementation. In this paper, we propose an efficient sparse hardware accelerator for Spike-driven Transformer. We first design a novel encoding method that encodes the position information of valid activations and skips non-spike values. This method enables us to use encoded spikes for executing the calculations of linear, maxpooling and spike-driven self-attention. Compared with the single spike input design of conventional SNN accelerators that primarily focus on convolution-based spiking computations, the specialized module for spike-driven self-attention is unique in its ability to handle dual spike inputs. By exclusively utilizing activated spikes, our design fully exploits the sparsity of Spike-driven Transformer, which diminishes redundant operations, lowers power consumption, and minimizes computational latency. Experimental results indicate that compared to existing SNNs accelerators, our design achieves up to 13.24 $\times$ and 1.33$\times$ improvements in terms of throughput and energy efficiency, respectively.||
|**2025-01-14**|[Transforming Indoor Localization: Advanced Transformer Architecture for NLOS Dominated Wireless Environments with Distributed Sensors](http://arxiv.org/abs/2501.07774)|null|Indoor localization in challenging non-line-of-sight (NLOS) environments often leads to mediocre accuracy with traditional approaches. Deep learning (DL) has been applied to tackle these challenges; however, many DL approaches overlook computational complexity, especially for floating-point operations (FLOPs), making them unsuitable for resource-limited devices. Transformer-based models have achieved remarkable success in natural language processing (NLP) and computer vision (CV) tasks, motivating their use in wireless applications. However, their use in indoor localization remains nascent, and directly applying Transformers for indoor localization can be both computationally intensive and exhibit limitations in accuracy. To address these challenges, in this work, we introduce a novel tokenization approach, referred to as Sensor Snapshot Tokenization (SST), which preserves variable-specific representations of power delay profile (PDP) and enhances attention mechanisms by effectively capturing multi-variate correlation. Complementing this, we propose a lightweight Swish-Gated Linear Unit-based Transformer (L-SwiGLU Transformer) model, designed to reduce computational complexity without compromising localization accuracy. Together, these contributions mitigate the computational burden and dependency on large datasets, making Transformer models more efficient and suitable for resource-constrained scenarios. The proposed tokenization method enables the Vanilla Transformer to achieve a 90th percentile positioning error of 0.388 m in a highly NLOS indoor factory, surpassing conventional tokenization methods. The L-SwiGLU ViT further reduces the error to 0.355 m, achieving an 8.51% improvement. Additionally, the proposed model outperforms a 14.1 times larger model with a 46.13% improvement, underscoring its computational efficiency.||
|**2025-01-13**|[D3MES: Diffusion Transformer with multihead equivariant self-attention for 3D molecule generation](http://arxiv.org/abs/2501.07077)|**[link](https://github.com/physilearn/d3mes)**|Understanding and predicting the diverse conformational states of molecules is crucial for advancing fields such as chemistry, material science, and drug development. Despite significant progress in generative models, accurately generating complex and biologically or material-relevant molecular structures remains a major challenge. In this work, we introduce a diffusion model for three-dimensional (3D) molecule generation that combines a classifiable diffusion model, Diffusion Transformer, with multihead equivariant self-attention. This method addresses two key challenges: correctly attaching hydrogen atoms in generated molecules through learning representations of molecules after hydrogen atoms are removed; and overcoming the limitations of existing models that cannot generate molecules across multiple classes simultaneously. The experimental results demonstrate that our model not only achieves state-of-the-art performance across several key metrics but also exhibits robustness and versatility, making it highly suitable for early-stage large-scale generation processes in molecular design, followed by validation and further screening to obtain molecules with specific properties.||
|**2025-01-13**|[Protego: Detecting Adversarial Examples for Vision Transformers via Intrinsic Capabilities](http://arxiv.org/abs/2501.07044)|null|Transformer models have excelled in natural language tasks, prompting the vision community to explore their implementation in computer vision problems. However, these models are still influenced by adversarial examples. In this paper, we investigate the attack capabilities of six common adversarial attacks on three pretrained ViT models to reveal the vulnerability of ViT models. To understand and analyse the bias in neural network decisions when the input is adversarial, we use two visualisation techniques: attention rollout and grad attention rollout. To protect ViT models from adversarial attacks, we propose Protego, a detection framework that leverages the transformer's intrinsic capabilities to detect adversarial examples of ViT models. Nonetheless, this is challenging due to the diversity of attack strategies that may be adopted by adversaries. Inspired by the attention mechanism, we know that the token of prediction contains all the information from the input sample. Additionally, the attention region for adversarial examples differs from that of normal examples. Given these points, we can train a detector that outperforms existing detection methods at identifying adversarial examples. Our experiments have demonstrated the high effectiveness of our detection method. For these six adversarial attack methods, our detector's AUC scores all exceed 0.95. Protego may advance investigations in metaverse security.||
|**2025-01-12**|[Temporal-Aware Spiking Transformer Hashing Based on 3D-DWT](http://arxiv.org/abs/2501.06786)|null|With the rapid growth of dynamic vision sensor (DVS) data, constructing a low-energy, efficient data retrieval system has become an urgent task. Hash learning is one of the most important retrieval technologies which can keep the distance between hash codes consistent with the distance between DVS data. As spiking neural networks (SNNs) can encode information through spikes, they demonstrate great potential in promoting energy efficiency. Based on the binary characteristics of SNNs, we first propose a novel supervised hashing method named Spikinghash with a hierarchical lightweight structure. Spiking WaveMixer (SWM) is deployed in shallow layers, utilizing a multilevel 3D discrete wavelet transform (3D-DWT) to decouple spatiotemporal features into various low-frequency and high frequency components, and then employing efficient spectral feature fusion. SWM can effectively capture the temporal dependencies and local spatial features. Spiking Self-Attention (SSA) is deployed in deeper layers to further extract global spatiotemporal information. We also design a hash layer utilizing binary characteristic of SNNs, which integrates information over multiple time steps to generate final hash codes. Furthermore, we propose a new dynamic soft similarity loss for SNNs, which utilizes membrane potentials to construct a learnable similarity matrix as soft labels to fully capture the similarity differences between classes and compensate information loss in SNNs, thereby improving retrieval performance. Experiments on multiple datasets demonstrate that Spikinghash can achieve state-of-the-art results with low energy consumption and fewer parameters.||
|**2025-01-11**|[TopoFormer: Integrating Transformers and ConvLSTMs for Coastal Topography Prediction](http://arxiv.org/abs/2501.06494)|null|This paper presents TopoFormer, a novel hybrid deep learning architecture that integrates transformer-based encoders with convolutional long short-term memory (ConvLSTM) layers for the precise prediction of topographic beach profiles referenced to elevation datums, with a particular focus on Mean Low Water Springs (MLWS) and Mean Low Water Neaps (MLWN). Accurate topographic estimation down to MLWS is critical for coastal management, navigation safety, and environmental monitoring. Leveraging a comprehensive dataset from the Wales Coastal Monitoring Centre (WCMC), consisting of over 2000 surveys across 36 coastal survey units, TopoFormer addresses key challenges in topographic prediction, including temporal variability and data gaps in survey measurements. The architecture uniquely combines multi-head attention mechanisms and ConvLSTM layers to capture both long-range dependencies and localized temporal patterns inherent in beach profiles data. TopoFormer's predictive performance was rigorously evaluated against state-of-the-art models, including DenseNet, 1D/2D CNNs, and LSTMs. While all models demonstrated strong performance, TopoFormer achieved the lowest mean absolute error (MAE), as low as 2 cm, and provided superior accuracy in both in-distribution (ID) and out-of-distribution (OOD) evaluations.||
|**2025-01-10**|[ELFATT: Efficient Linear Fast Attention for Vision Transformers](http://arxiv.org/abs/2501.06098)|null|The attention mechanism is the key to the success of transformers in different machine learning tasks. However, the quadratic complexity with respect to the sequence length of the vanilla softmax-based attention mechanism becomes the major bottleneck for the application of long sequence tasks, such as vision tasks. Although various efficient linear attention mechanisms have been proposed, they need to sacrifice performance to achieve high efficiency. What's more, memory-efficient methods, such as FlashAttention-1-3, still have quadratic computation complexity which can be further improved. In this paper, we propose a novel efficient linear fast attention (ELFATT) mechanism to achieve low memory input/output operations, linear computational complexity, and high performance at the same time. ELFATT offers 4-7x speedups over the vanilla softmax-based attention mechanism in high-resolution vision tasks without losing performance. ELFATT is FlashAttention friendly. Using FlashAttention-2 acceleration, ELFATT still offers 2-3x speedups over the vanilla softmax-based attention mechanism on high-resolution vision tasks without losing performance. Even on edge GPUs, ELFATT still offers 1.6x to 2.0x speedups compared to state-of-the-art attention mechanisms in various power modes from 5W to 60W. The code of ELFATT is available at [https://github.com/Alicewithrabbit/ELFATT].||
|**2025-01-10**|[MSCViT: A Small-size ViT architecture with Multi-Scale Self-Attention Mechanism for Tiny Datasets](http://arxiv.org/abs/2501.06040)|null|Vision Transformer (ViT) has demonstrated significant potential in various vision tasks due to its strong ability in modelling long-range dependencies. However, such success is largely fueled by training on massive samples. In real applications, the large-scale datasets are not always available, and ViT performs worse than Convolutional Neural Networks (CNNs) if it is only trained on small scale dataset (called tiny dataset), since it requires large amount of training data to ensure its representational capacity. In this paper, a small-size ViT architecture with multi-scale self-attention mechanism and convolution blocks is presented (dubbed MSCViT) to model different scales of attention at each layer. Firstly, we introduced wavelet convolution, which selectively combines the high-frequency components obtained by frequency division with our convolution channel to extract local features. Then, a lightweight multi-head attention module is developed to reduce the number of tokens and computational costs. Finally, the positional encoding (PE) in the backbone is replaced by a local feature extraction module. Compared with the original ViT, it is parameter-efficient and is particularly suitable for tiny datasets. Extensive experiments have been conducted on tiny datasets, in which our model achieves an accuracy of 84.68% on CIFAR-100 with 14.0M parameters and 2.5 GFLOPs, without pre-training on large datasets.||
|**2025-01-10**|[An Attention-Guided Deep Learning Approach for Classifying 39 Skin Lesion Types](http://arxiv.org/abs/2501.05991)|**[link](https://github.com/akabircs/skin-lesions-classification)**|The skin, as the largest organ of the human body, is vulnerable to a diverse array of conditions collectively known as skin lesions, which encompass various dermatoses. Diagnosing these lesions presents significant challenges for medical practitioners due to the subtle visual differences that are often imperceptible to the naked eye. While not all skin lesions are life-threatening, certain types can act as early indicators of severe diseases, including skin cancers, underscoring the critical need for timely and accurate diagnostic methods. Deep learning algorithms have demonstrated remarkable potential in facilitating the early detection and prognosis of skin lesions. This study advances the field by curating a comprehensive and diverse dataset comprising 39 categories of skin lesions, synthesized from five publicly available datasets. Using this dataset, the performance of five state-of-the-art deep learning models -- MobileNetV2, Xception, InceptionV3, EfficientNetB1, and Vision Transformer -- is rigorously evaluated. To enhance the accuracy and robustness of these models, attention mechanisms such as the Efficient Channel Attention (ECA) and the Convolutional Block Attention Module (CBAM) are incorporated into their architectures. Comprehensive evaluation across multiple performance metrics reveals that the Vision Transformer model integrated with CBAM outperforms others, achieving an accuracy of 93.46%, precision of 94%, recall of 93%, F1-score of 93%, and specificity of 93.67%. These results underscore the significant potential of the proposed system in supporting medical professionals with accurate and efficient prognostic tools for diagnosing a broad spectrum of skin lesions. The dataset and code used in this study can be found at https://github.com/akabircs/Skin-Lesions-Classification.||
|**2025-01-10**|[Swin-X2S: Reconstructing 3D Shape from 2D Biplanar X-ray with Swin Transformers](http://arxiv.org/abs/2501.05961)|**[link](https://github.com/liukuan5625/swin-x2s)**|The conversion from 2D X-ray to 3D shape holds significant potential for improving diagnostic efficiency and safety. However, existing reconstruction methods often rely on hand-crafted features, manual intervention, and prior knowledge, resulting in unstable shape errors and additional processing costs. In this paper, we introduce Swin-X2S, an end-to-end deep learning method for directly reconstructing 3D segmentation and labeling from 2D biplanar orthogonal X-ray images. Swin-X2S employs an encoder-decoder architecture: the encoder leverages 2D Swin Transformer for X-ray information extraction, while the decoder employs 3D convolution with cross-attention to integrate structural features from orthogonal views. A dimension-expanding module is introduced to bridge the encoder and decoder, ensuring a smooth conversion from 2D pixels to 3D voxels. We evaluate the proposed method through extensive qualitative and quantitative experiments across nine publicly available datasets covering four anatomies (femur, hip, spine, and rib), with a total of 54 categories. Significant improvements over previous methods have been observed not only in the segmentation and labeling metrics but also in the clinically relevant parameters that are of primary concern in practical applications, which demonstrates the promise of Swin-X2S to provide an effective option for anatomical shape reconstruction in clinical scenarios. Code implementation is available at: https://github.com/liukuan5625/Swin-X2S.||
|**2025-01-10**|[Weakly Supervised Segmentation of Hyper-Reflective Foci with Compact Convolutional Transformers and SAM2](http://arxiv.org/abs/2501.05933)|null|Weakly supervised segmentation has the potential to greatly reduce the annotation effort for training segmentation models for small structures such as hyper-reflective foci (HRF) in optical coherence tomography (OCT). However, most weakly supervised methods either involve a strong downsampling of input images, or only achieve localization at a coarse resolution, both of which are unsatisfactory for small structures. We propose a novel framework that increases the spatial resolution of a traditional attention-based Multiple Instance Learning (MIL) approach by using Layer-wise Relevance Propagation (LRP) to prompt the Segment Anything Model (SAM 2), and increases recall with iterative inference. Moreover, we demonstrate that replacing MIL with a Compact Convolutional Transformer (CCT), which adds a positional encoding, and permits an exchange of information between different regions of the OCT image, leads to a further and substantial increase in segmentation accuracy.||
|**2025-01-10**|[Binary Event-Driven Spiking Transformer](http://arxiv.org/abs/2501.05904)|null|Transformer-based Spiking Neural Networks (SNNs) introduce a novel event-driven self-attention paradigm that combines the high performance of Transformers with the energy efficiency of SNNs. However, the larger model size and increased computational demands of the Transformer structure limit their practicality in resource-constrained scenarios. In this paper, we integrate binarization techniques into Transformer-based SNNs and propose the Binary Event-Driven Spiking Transformer, i.e. BESTformer. The proposed BESTformer can significantly reduce storage and computational demands by representing weights and attention maps with a mere 1-bit. However, BESTformer suffers from a severe performance drop from its full-precision counterpart due to the limited representation capability of binarization. To address this issue, we propose a Coupled Information Enhancement (CIE) method, which consists of a reversible framework and information enhancement distillation. By maximizing the mutual information between the binary model and its full-precision counterpart, the CIE method effectively mitigates the performance degradation of the BESTformer. Extensive experiments on static and neuromorphic datasets demonstrate that our method achieves superior performance to other binary SNNs, showcasing its potential as a compact yet high-performance model for resource-limited edge devices.||
|**2025-01-09**|[MHAFF: Multi-Head Attention Feature Fusion of CNN and Transformer for Cattle Identification](http://arxiv.org/abs/2501.05209)|null|Convolutional Neural Networks (CNNs) have drawn researchers' attention to identifying cattle using muzzle images. However, CNNs often fail to capture long-range dependencies within the complex patterns of the muzzle. The transformers handle these challenges. This inspired us to fuse the strengths of CNNs and transformers in muzzle-based cattle identification. Addition and concatenation have been the most commonly used techniques for feature fusion. However, addition fails to preserve discriminative information, while concatenation results in an increase in dimensionality. Both methods are simple operations and cannot discover the relationships or interactions between fusing features. This research aims to overcome the issues faced by addition and concatenation. This research introduces a novel approach called Multi-Head Attention Feature Fusion (MHAFF) for the first time in cattle identification. MHAFF captures relations between the different types of fusing features while preserving their originality. The experiments show that MHAFF outperformed addition and concatenation techniques and the existing cattle identification methods in accuracy on two publicly available cattle datasets. MHAFF demonstrates excellent performance and quickly converges to achieve optimum accuracy of 99.88% and 99.52% in two cattle datasets simultaneously.||
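|**2025-01-09**|[SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection](http://arxiv.org/abs/2501.04916)|**[link](https://github.com/emit-sds/SpecTf)**|Current and upcoming visible-to-shortwave infrared (VSWIR) imaging spectrometers promise unprecedented capacity to quantify Earth system processes across the globe. However, reliable cloud screening remains a fundamental challenge for these instruments, where traditional spatial and temporal approaches are limited by cloud variability and limited temporal coverage. The Spectroscopic Transformer (SpecTf) addresses these challenges with a spectroscopy-specific deep learning architecture that performs cloud detection using only spectral information, with no spatial or temporal data required. By treating spectral measurements as sequences rather than image channels, SpecTf learns fundamental physical relationships without relying on spatial context. Our experiments demonstrate that SpecTf significantly outperforms the baseline approach currently implemented for the EMIT instrument, and performs comparably with other machine learning methods while using orders of magnitude fewer learned parameters. Critically, we demonstrate SpecTf's inherent interpretability through its attention mechanism, revealing physically meaningful spectral features learned by the model. Finally, we show SpecTf's potential for cross-instrument generalization by applying it, without modification, to a different instrument on a different platform, opening the door to instrument-agnostic, data-driven algorithms for future imaging spectroscopy tasks.||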
|**2025-01-07**|[AI-Driven Reinvention of Hydrological Modeling for Accurate Predictions and Interpretation to Transform Earth System Modeling](http://arxiv.org/abs/2501.04733)|null|Traditional equation-driven hydrological models often struggle to accurately predict streamflow in challenging regional Earth systems like the Tibetan Plateau, while hybrid and existing algorithm-driven models face difficulties in interpreting hydrological behaviors. This work introduces HydroTrace, an algorithm-driven, data-agnostic model that substantially outperforms these approaches, achieving a Nash-Sutcliffe Efficiency of 98% and demonstrating strong generalization on unseen data. Moreover, HydroTrace leverages advanced attention mechanisms to capture spatial-temporal variations and feature-specific impacts, enabling the quantification and spatial resolution of streamflow partitioning as well as the interpretation of hydrological behaviors such as glacier-snow-streamflow interactions and monsoon dynamics. Additionally, a large language model (LLM)-based application allows users to easily understand and apply HydroTrace's insights for practical purposes. These advancements position HydroTrace as a transformative tool in hydrological and broader Earth system modeling, offering enhanced prediction accuracy and interpretability.||
|**2025-01-08**|[Discrete Wavelet Transform-Based Capsule Network for Hyperspectral Image Classification](http://arxiv.org/abs/2501.04643)|null|Hyperspectral image (HSI) classification is a crucial technique for remote sensing to build large-scale earth monitoring systems. HSI contains much more information than traditional visual images for identifying the categories of land covers. One recent feasible solution for HSI is to leverage CapsNets for capturing spectral-spatial information. However, these methods have high computational requirements due to the full connection architecture between stacked capsule layers. To solve this problem, a DWT-CapsNet is proposed to identify partial but important connections in CapsNet for effective and efficient HSI classification. Specifically, we integrate a tailored attention mechanism into a Discrete Wavelet Transform (DWT)-based downsampling layer, alleviating the information loss problem of conventional downsampling operations in feature extractors. Moreover, we propose a novel multi-scale routing algorithm that prunes a large proportion of connections in CapsNet. A capsule pyramid fusion mechanism is designed to aggregate the spectral-spatial relationships at multiple levels of granularity, and then a self-attention mechanism is further applied in a partially and locally connected architecture to emphasize the meaningful relationships. As shown in the experimental results, our method achieves state-of-the-art accuracy while keeping lower computational demand regarding running time, FLOPs, and the number of parameters, rendering it an appealing choice for practical implementation in HSI classification.||
|**2025-01-08**|[MB-TaylorFormer V2: Improved Multi-branch Linear Transformer Expanded by Taylor Formula for Image Restoration](http://arxiv.org/abs/2501.04486)|**[link](https://github.com/fvl2020/mb-taylorformer)**|Recently, Transformer networks have demonstrated outstanding performance in the field of image restoration due to the global receptive field and adaptability to input. However, the quadratic computational complexity of Softmax-attention poses a significant limitation on its extensive application in image restoration tasks, particularly for high-resolution images. To tackle this challenge, we propose a novel variant of the Transformer. This variant leverages the Taylor expansion to approximate the Softmax-attention and utilizes the concept of norm-preserving mapping to approximate the remainder of the first-order Taylor expansion, resulting in a linear computational complexity. Moreover, we introduce a multi-branch architecture featuring multi-scale patch embedding into the proposed Transformer, which has four distinct advantages: 1) various sizes of the receptive field; 2) multi-level semantic information; 3) flexible shapes of the receptive field; 4) accelerated training and inference speed. Hence, the proposed model, named the second version of Taylor formula expansion-based Transformer (for short MB-TaylorFormer V2) has the capability to concurrently process coarse-to-fine features, capture long-distance pixel interactions with limited computational cost, and improve the approximation of the Taylor expansion remainder. Experimental results across diverse image restoration benchmarks demonstrate that MB-TaylorFormer V2 achieves state-of-the-art performance in multiple image restoration tasks, such as image dehazing, deraining, desnowing, motion deblurring, and denoising, with very little computational overhead. The source code is available at https://github.com/FVL2020/MB-TaylorFormerV2.||
|**2025-01-08**|[Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability of Decoder-Only Transformer Models](http://arxiv.org/abs/2501.04286)|**[link](https://github.com/tbahman/mapping_the_edge_of_chaos)**|In the realm of fractal geometry, intricate structures emerge from simple iterative processes that partition parameter spaces into regions of stability and instability. Likewise, training large language models involves iteratively applying update functions, such as Adam, where even slight hyperparameter adjustments can shift the training process from convergence to divergence. Recent evidence from miniature neural networks suggests that the boundary separating these outcomes displays fractal characteristics [1]. Building on these insights, this study extends them to medium-sized, decoder-only transformer architectures by employing a more consistent convergence measure and examining the learning rate hyperparameter landscape for attention and fully connected layers. The results show that the trainability frontier is not a simple threshold; rather, it forms a self-similar yet seemingly random structure at multiple scales, with statistically consistent and repeating patterns. Within this landscape, a region of stable convergence is surrounded by a complex chaotic border, illustrating the sensitive nature of the underlying training dynamics.||
|**2025-01-07**|[Three-dimensional attention Transformer for state evaluation in real-time strategy games](http://arxiv.org/abs/2501.03832)|null|Situation assessment in Real-Time Strategy (RTS) games is crucial for understanding decision-making in complex adversarial environments. However, existing methods remain limited in processing multi-dimensional feature information and temporal dependencies. Here we propose a tri-dimensional Space-Time-Feature Transformer (TSTF Transformer) architecture, which efficiently models battlefield situations through three independent but cascaded modules: spatial attention, temporal attention, and feature attention. On a dataset comprising 3,150 adversarial experiments, the 8-layer TSTF Transformer demonstrates superior performance: achieving 58.7% accuracy in the early game (~4% progress), significantly outperforming the conventional Timesformer's 41.8%; reaching 97.6% accuracy in the mid-game (~40% progress) while maintaining low performance variation (standard deviation 0.114). Meanwhile, this architecture requires fewer parameters (4.75M) compared to the baseline model (5.54M). Our study not only provides new insights into situation assessment in RTS games but also presents an innovative paradigm for Transformer-based multi-dimensional temporal modeling.||
|**2025-01-07**|[CFFormer: Cross CNN-Transformer Channel Attention and Spatial Feature Fusion for Improved Segmentation of Low Quality Medical Images](http://arxiv.org/abs/2501.03629)|null|Hybrid CNN-Transformer models are designed to combine the advantages of Convolutional Neural Networks (CNNs) and Transformers to efficiently model both local information and long-range dependencies. However, most research tends to focus on integrating the spatial features of CNNs and Transformers, while overlooking the critical importance of channel features. This is particularly significant for model performance in low-quality medical image segmentation. Effective channel feature extraction can significantly enhance the model's ability to capture contextual information and improve its representation capabilities. To address this issue, we propose a hybrid CNN-Transformer model, CFFormer, and introduce two modules: the Cross Feature Channel Attention (CFCA) module and the X-Spatial Feature Fusion (XFF) module. The model incorporates dual encoders, with the CNN encoder focusing on capturing local features and the Transformer encoder modeling global features. The CFCA module filters and facilitates interactions between the channel features from the two encoders, while the XFF module effectively reduces the significant semantic information differences in spatial features, enabling a smooth and cohesive spatial feature fusion. We evaluate our model across eight datasets covering five modalities to test its generalization capability. Experimental results demonstrate that our model outperforms current state-of-the-art (SOTA) methods, with particularly superior performance on datasets characterized by blurry boundaries and low contrast.||
|**2025-01-07**|[Efficient and Accurate Tuberculosis Diagnosis: Attention Residual U-Net and Vision Transformer Based Detection Framework](http://arxiv.org/abs/2501.03538)|null|Tuberculosis (TB), an infectious disease caused by Mycobacterium tuberculosis, continues to be a major global health threat despite being preventable and curable. This burden is particularly high in low- and middle-income countries. Microscopy remains essential for diagnosing TB by enabling direct visualization of Mycobacterium tuberculosis in sputum smear samples, offering a cost-effective approach for early detection and effective treatment. Given the labour-intensive nature of microscopy, automating the detection of bacilli in microscopic images is crucial to improve both the expediency and reliability of TB diagnosis. The current methodologies for detecting tuberculosis bacilli in bright field microscopic sputum smear images are hindered by limited automation capabilities, inconsistent segmentation quality, and constrained classification precision. This paper proposes a two-stage deep learning methodology for tuberculosis bacilli detection, comprising bacilli segmentation followed by classification. In the initial phase, an advanced U-Net model employing attention blocks and residual connections is proposed to segment microscopic sputum smear images, enabling the extraction of Regions of Interest (ROIs). The extracted ROIs are then classified using a Vision Transformer, which we specifically customized as TBViT to enhance the precise detection of bacilli within the images. For the experiments, a newly developed dataset of microscopic sputum smear images derived from Ziehl-Neelsen-stained slides is used in conjunction with existing public datasets. The qualitative and quantitative evaluation of the experiments using various metrics demonstrates that the proposed model achieves significantly improved segmentation performance, higher classification accuracy, and a greater level of automation, surpassing existing methods.||
|**2025-01-08**|[Entropy-Guided Attention for Private LLMs](http://arxiv.org/abs/2501.03489)|**[link](https://github.com/nandan91/entropy-guided-attention-llm)**|The pervasiveness of proprietary language models has raised critical privacy concerns, necessitating advancements in private inference (PI), where computations are performed directly on encrypted data without revealing users' sensitive information. While PI offers a promising solution, its practical deployment is hindered by substantial communication and latency overheads, primarily stemming from nonlinear operations. To address this, we introduce an information-theoretic framework to characterize the role of nonlinearities in decoder-only language models, laying a principled foundation for optimizing transformer architectures tailored to the demands of PI. By leveraging Shannon's entropy as a quantitative measure, we uncover the previously unexplored dual significance of nonlinearities: beyond ensuring training stability, they are crucial for maintaining attention head diversity. Specifically, we find that their removal triggers two critical failure modes: *entropy collapse* in deeper layers that destabilizes training, and *entropic overload* in earlier layers that leads to under-utilization of Multi-Head Attention's (MHA) representational capacity. We propose an entropy-guided attention mechanism paired with a novel entropy regularization technique to mitigate entropic overload. Additionally, we explore PI-friendly alternatives to layer normalization for preventing entropy collapse and stabilizing the training of LLMs with reduced nonlinearities. Our study bridges the gap between information theory and architectural design, establishing entropy dynamics as a principled guide for developing efficient PI architectures. The code and implementation are available at https://github.com/Nandan91/entropy-guided-attention-llm||
|**2025-01-06**|[Mixture-of-Experts Graph Transformers for Interpretable Particle Collision Detection](http://arxiv.org/abs/2501.03432)|null|The Large Hadron Collider at CERN produces immense volumes of complex data from high-energy particle collisions, demanding sophisticated analytical techniques for effective interpretation. Neural Networks, including Graph Neural Networks, have shown promise in tasks such as event classification and object identification by representing collisions as graphs. However, while Graph Neural Networks excel in predictive accuracy, their "black box" nature often limits their interpretability, making it difficult to trust their decision-making processes. In this paper, we propose a novel approach that combines a Graph Transformer model with Mixture-of-Expert layers to achieve high predictive performance while embedding interpretability into the architecture. By leveraging attention maps and expert specialization, the model offers insights into its internal decision-making, linking predictions to physics-informed features. We evaluate the model on simulated events from the ATLAS experiment, focusing on distinguishing rare Supersymmetric signal events from Standard Model background. Our results highlight that the model achieves competitive classification accuracy while providing interpretable outputs that align with known physics, demonstrating its potential as a robust and transparent tool for high-energy physics data analysis. This approach underscores the importance of explainability in machine learning methods applied to high energy physics, offering a path toward greater trust in AI-driven discoveries.||
|**2025-01-06**|[Sensorformer: Cross-patch attention with global-patch compression is effective for high-dimensional multivariate time series forecasting](http://arxiv.org/abs/2501.03284)|null|Among the existing Transformer-based multivariate time series forecasting methods, iTransformer, which treats each variable sequence as a token and only explicitly extracts cross-variable dependencies, and PatchTST, which adopts a channel-independent strategy and only explicitly extracts cross-time dependencies, both significantly outperform most Channel-Dependent Transformers that simultaneously extract cross-time and cross-variable dependencies. This indicates that existing Transformer-based multivariate time series forecasting methods still struggle to effectively fuse these two types of information. We attribute this issue to the dynamic time lags in the causal relationships between different variables. Therefore, we propose a new multivariate time series forecasting Transformer, Sensorformer, which first compresses the global patch information and then simultaneously extracts cross-variable and cross-time dependencies from the compressed representations. Sensorformer can effectively capture the correct inter-variable correlations and causal relationships, even in the presence of dynamic causal lags between variables, while also reducing the computational complexity of pure cross-patch self-attention from $O(D^2 \cdot Patch\_num^2 \cdot d\_model)$ to $O(D^2 \cdot Patch\_num \cdot d\_model)$. Extensive comparative and ablation experiments on 9 mainstream real-world multivariate time series forecasting datasets demonstrate the superiority of Sensorformer. The implementation of Sensorformer, following the style of the Time-series-library and scripts for reproducing the main results, is publicly available at https://github.com/BigYellowTiger/Sensorformer||
|**2025-01-06**|[Analysis of mean-field models arising from self-attention dynamics in transformer architectures with layer normalization](http://arxiv.org/abs/2501.03096)|**[link](https://github.com/timroith/transformerdynamics)**|The aim of this paper is to provide a mathematical analysis of transformer architectures using a self-attention mechanism with layer normalization. In particular, observed patterns in such architectures resembling either clusters or uniform distributions pose a number of challenging mathematical questions. We focus on a special case that admits a gradient flow formulation in the spaces of probability measures on the unit sphere under a special metric, which allows us to give at least partial answers in a rigorous way. The arising mathematical problems resemble those recently studied in aggregation equations, but with additional challenges emerging from restricting the dynamics to the sphere and the particular form of the interaction energy. We provide a rigorous framework for studying the gradient flow, which also suggests a possible metric geometry to study the general case (i.e. one that is not described by a gradient flow). We further analyze the stationary points of the induced self-attention dynamics. The latter are related to stationary points of the interaction energy in the Wasserstein geometry, and we further discuss energy minimizers and maximizers in different parameter settings.||
|**2025-01-06**|[Inverse receptive field attention for naturalistic image reconstruction from the brain](http://arxiv.org/abs/2501.03051)|**[link](https://github.com/neuralcodinglab/irfa)**|Visual perception in the brain largely depends on the organization of neuronal receptive fields. Although extensive research has delineated the coding principles of receptive fields, most studies have been constrained by their foundational assumptions. Moreover, while machine learning has successfully been used to reconstruct images from brain data, this approach faces significant challenges, including inherent feature biases in the model and the complexities of brain structure and function. In this study, we introduce an inverse receptive field attention (IRFA) model, designed to reconstruct naturalistic images from neurophysiological data in an end-to-end fashion. This approach aims to elucidate the tuning properties and representational transformations within the visual cortex. The IRFA model incorporates an attention mechanism that determines the inverse receptive field for each pixel, weighting neuronal responses across the visual field and feature spaces. This method allows for an examination of the dynamics of neuronal representations across stimuli in both spatial and feature dimensions. Our results show highly accurate reconstructions of naturalistic data, independent of pre-trained models. Notably, IRF models trained on macaque V1, V4, and IT regions yield remarkably consistent spatial receptive fields across different stimuli, while the features to which neuronal representations are selective exhibit significant variation. Additionally, we propose a data-driven method to explore representational clustering within various visual areas, further providing testable hypotheses.||
|**2025-01-06**|[Self-Attention as a Parametric Endofunctor: A Categorical Framework for Transformer Architectures](http://arxiv.org/abs/2501.02931)|null|Self-attention mechanisms have revolutionised deep learning architectures, but their mathematical foundations remain incompletely understood. We establish that these mechanisms can be formalised through categorical algebra, presenting a framework that focuses on the linear components of self-attention. We prove that the query, key, and value maps in self-attention naturally form a parametric endofunctor in the 2-category $\mathbf{Para}(\mathbf{Vect})$ of parametric morphisms. We show that stacking multiple self-attention layers corresponds to constructing the free monad on this endofunctor. For positional encodings, we demonstrate that strictly additive position embeddings constitute monoid actions on the embedding space, while standard sinusoidal encodings, though not additive, possess a universal property among faithful position-preserving functors. We establish that the linear portions of self-attention exhibit natural equivariance properties with respect to permutations of input tokens. Finally, we prove that the "circuits" identified in mechanistic interpretability correspond precisely to compositions of parametric morphisms in our framework. This categorical perspective unifies geometric, algebraic, and interpretability-based approaches to transformer analysis, while making explicit the mathematical structures underlying attention mechanisms. Our treatment focuses exclusively on linear maps, setting aside nonlinearities like softmax and layer normalisation, which require more sophisticated categorical structures. Our results extend recent work on categorical foundations for deep learning while providing insights into the algebraic structure of attention mechanisms.||
|**2025-01-06**|[A Novel Vision Transformer for Camera-LiDAR Fusion based Traffic Object Segmentation](http://arxiv.org/abs/2501.02858)|null|This paper presents Camera-LiDAR Fusion Transformer (CLFT) models for traffic object segmentation, which leverage the fusion of camera and LiDAR data using vision transformers. Building on the methodology of visual transformers that exploit the self-attention mechanism, we extend segmentation capabilities with additional classification options to a diverse class of objects including cyclists, traffic signs, and pedestrians across diverse weather conditions. Despite good performance, the models face challenges under adverse conditions which underscores the need for further optimization to enhance performance in darkness and rain. In summary, the CLFT models offer a compelling solution for autonomous driving perception, advancing the state-of-the-art in multimodal fusion and object segmentation, with ongoing efforts required to address existing limitations and fully harness their potential in practical deployments.||
|**2025-01-03**|[Transformer-Driven Inverse Problem Transform for Fast Blind Hyperspectral Image Dehazing](http://arxiv.org/abs/2501.01924)|null|Hyperspectral dehazing (HyDHZ) has become a crucial signal processing technology to facilitate the subsequent identification and classification tasks, as the airborne visible/infrared imaging spectrometer (AVIRIS) data portal reports a massive portion of haze-corrupted areas in typical hyperspectral remote sensing images. The idea of inverse problem transform (IPT) has been proposed in recent remote sensing literature in order to reformulate a hardly tractable inverse problem (e.g., HyDHZ) into a relatively simple one. Considering the emerging spectral super-resolution (SSR) technique, which spectrally upsamples multispectral data to hyperspectral data, we aim to solve the challenging HyDHZ problem by reformulating it as an SSR problem. Roughly speaking, the proposed algorithm first automatically selects some uncorrupted/informative spectral bands, from which SSR is applied to spectrally upsample the selected bands in the feature space, thereby obtaining a clean hyperspectral image (HSI). The clean HSI is then further refined by a deep transformer network to obtain the final dehazed HSI, where a global attention mechanism is designed to capture nonlocal information. There are very few HyDHZ works in existing literature, and this article introduces the powerful spatial-spectral transformer into HyDHZ for the first time. Remarkably, the proposed transformer-driven IPT-based HyDHZ (T2HyDHZ) is a blind algorithm without requiring the user to manually select the corrupted region. Extensive experiments demonstrate the superiority of T2HyDHZ with less color distortion.||
|**2025-01-03**|[VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer for Video-based Remote Physiological Measurement](http://arxiv.org/abs/2501.01691)|null|Remote physiological signal measurement based on facial videos, also known as remote photoplethysmography (rPPG), involves predicting changes in facial vascular blood flow from facial videos. While most deep learning-based methods have achieved good results, they often struggle to balance performance across small and large-scale datasets due to the inherent limitations of convolutional neural networks (CNNs) and Transformer. In this paper, we introduce VidFormer, a novel end-to-end framework that integrates 3-Dimension Convolutional Neural Network (3DCNN) and Transformer models for rPPG tasks. Initially, we conduct an analysis of the traditional skin reflection model and subsequently introduce an enhanced model for the reconstruction of rPPG signals. Based on this improved model, VidFormer utilizes 3DCNN and Transformer to extract local and global features from input data, respectively. To enhance the spatiotemporal feature extraction capabilities of VidFormer, we incorporate temporal-spatial attention mechanisms tailored for both 3DCNN and Transformer. Additionally, we design a module to facilitate information exchange and fusion between the 3DCNN and Transformer. Our evaluation on five publicly available datasets demonstrates that VidFormer outperforms current state-of-the-art (SOTA) methods. Finally, we discuss the essential roles of each VidFormer module and examine the effects of ethnicity, makeup, and exercise on its performance.||
|**2025-01-02**|[nnY-Net: Swin-NeXt with Cross-Attention for 3D Medical Images Segmentation](http://arxiv.org/abs/2501.01406)|null|This paper provides a novel 3D medical image segmentation model structure called nnY-Net. This name comes from the fact that our model adds a cross-attention module at the bottom of the U-net structure to form a Y structure. We integrate the advantages of the two latest SOTA models, MedNeXt and SwinUNETR, and use Swin Transformer as the encoder and ConvNeXt as the decoder to innovatively design the Swin-NeXt structure. Our model uses the lowest-level feature map of the encoder as Key and Value and uses patient features such as pathology and treatment information as Query to calculate the attention weights in a Cross Attention module. Moreover, we simplify some pre- and post-processing as well as data enhancement methods in 3D image segmentation based on the dynUnet and nnU-net frameworks. We integrate our proposed Swin-NeXt with Cross-Attention framework into this framework. Last, we construct a DiceFocalCELoss to improve the training efficiency for the uneven data convergence of voxel classification.||
|**2025-01-02**|[A Unified Hyperparameter Optimization Pipeline for Transformer-Based Time Series Forecasting Models](http://arxiv.org/abs/2501.01394)|**[link](https://github.com/jingjing-unilu/HPO_transformer_time_series)**|Transformer-based models for time series forecasting (TSF) have attracted significant attention in recent years due to their effectiveness and versatility. However, these models often require extensive hyperparameter optimization (HPO) to achieve the best possible performance, and a unified pipeline for HPO in transformer-based TSF remains lacking. In this paper, we present one such pipeline and conduct extensive experiments on several state-of-the-art (SOTA) transformer-based TSF models. These experiments are conducted on standard benchmark datasets to evaluate and compare the performance of different models, generating practical insights and examples. Our pipeline is generalizable beyond transformer-based architectures and can be applied to other SOTA models, such as Mamba and TimeMixer, as demonstrated in our experiments. The goal of this work is to provide valuable guidance to both industry practitioners and academic researchers in efficiently identifying optimal hyperparameters suited to their specific domain applications. The code and complete experimental results are available on GitHub.||
|**2025-01-02**|[SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration](http://arxiv.org/abs/2501.01320)|null|Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.||
|**2025-01-02**|[Multi-Head Explainer: A General Framework to Improve Explainability in CNNs and Transformers](http://arxiv.org/abs/2501.01311)|null|In this study, we introduce the Multi-Head Explainer (MHEX), a versatile and modular framework that enhances both the explainability and accuracy of Convolutional Neural Networks (CNNs) and Transformer-based models. MHEX consists of three core components: an Attention Gate that dynamically highlights task-relevant features, Deep Supervision that guides early layers to capture fine-grained details pertinent to the target class, and an Equivalent Matrix that unifies refined local and global representations to generate comprehensive saliency maps. Our approach demonstrates superior compatibility, enabling effortless integration into existing residual networks like ResNet and Transformer architectures such as BERT with minimal modifications. Extensive experiments on benchmark datasets in medical imaging and text classification show that MHEX not only improves classification accuracy but also produces highly interpretable and detailed saliency scores.||
|**2025-01-02**|[An Efficient Attention Mechanism for Sequential Recommendation Tasks: HydraRec](http://arxiv.org/abs/2501.01242)|null|Transformer based models are increasingly being used in various domains including recommender systems (RS). Pretrained transformer models such as BERT have shown good performance at language modelling. With the greater ability to model sequential tasks, variants of Encoder-only models (like BERT4Rec, SASRec etc.) have found success in sequential RS problems. Computing dot-product attention in traditional transformer models has quadratic complexity in sequence length. This is a bigger problem with RS because unlike language models, new items are added to the catalogue every day. User buying history is a dynamic sequence which depends on multiple factors. Recently, various linear attention models have tried to solve this problem by making the model linear in sequence length (token dimensions). Hydra attention is one such linear complexity model proposed for vision transformers which reduces the complexity of attention for both the number of tokens as well as model embedding dimensions. Building on the idea of Hydra attention, we introduce an efficient Transformer based Sequential RS (HydraRec) which significantly improves theoretical complexity of computing attention for longer sequences and bigger datasets while preserving the temporal context. Extensive experiments are conducted to evaluate other linear transformer-based RS models and compared with HydraRec across various evaluation metrics. HydraRec outperforms other linear attention-based models as well as dot-product based attention models when used with causal masking for sequential recommendation next item prediction tasks. For bi-directional models its performance is comparable to the BERT4Rec model with an improvement in running time.||
|**2025-01-02**|[RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer](http://arxiv.org/abs/2501.01182)|**[link](https://github.com/seongho608/ringformer)**|While transformers demonstrate outstanding performance across various audio tasks, their application to neural vocoders remains challenging. Neural vocoders require the generation of long audio signals at the sample level, which demands high temporal resolution. This results in significant computational costs for attention map generation and limits their ability to efficiently process both global and local information. Additionally, the sequential nature of sample generation in neural vocoders poses difficulties for real-time processing, making the direct adoption of transformers impractical. To address these challenges, we propose RingFormer, a neural vocoder that incorporates the ring attention mechanism into a lightweight transformer variant, the convolution-augmented transformer (Conformer). Ring attention effectively captures local details while integrating global information, making it well-suited for processing long sequences and enabling real-time audio generation. RingFormer is trained using adversarial training with two discriminators. The proposed model is applied to the decoder of the text-to-speech model VITS and compared with state-of-the-art vocoders such as HiFi-GAN, iSTFT-Net, and BigVGAN under identical conditions using various objective and subjective metrics. Experimental results show that RingFormer achieves comparable or superior performance to existing models, particularly excelling in real-time audio generation. Our code and audio samples are available on GitHub.||
|**2025-01-02**|[FAST: Fast Audio Spectrogram Transformer](http://arxiv.org/abs/2501.01104)|null|In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines convolutional neural networks (CNNs) and transformers to capitalize on the strengths of both. FAST integrates the local feature extraction efficiencies of CNNs with the global context modeling capabilities of transformers, resulting in a model that is powerful yet lightweight, well-suited to a real-time or mobile use case. Additionally, we incorporate Lipschitz continuous attention mechanisms to improve training stability and accelerate convergence. We evaluate FAST on the ADIMA dataset, a multilingual corpus towards real-time profanity and abuse detection, as well as on the more traditional AudioSet. Our results show that FAST achieves state-of-the-art performance on both the ADIMA and AudioSet classification tasks and in some cases surpasses existing benchmarks while using up to 150x fewer parameters.||
|**2025-01-02**|[EliGen: Entity-Level Controlled Image Generation with Regional Attention](http://arxiv.org/abs/2501.01097)|**[link](https://github.com/modelscope/DiffSynth-Studio)**|Recent advancements in diffusion models have significantly advanced text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-Level controlled Image Generation. We introduce regional attention, a mechanism for diffusion transformers that requires no additional parameters, seamlessly integrating entity prompts and arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both positional control precision and image quality. Additionally, we propose an inpainting fusion pipeline, extending EliGen to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with community models such as IP-Adapter and MLLM, unlocking new creative possibilities. The source code, dataset, and model will be released publicly.||
|**2024-12-30**|[Attention Is All You Need For Mixture-of-Depths Routing](http://arxiv.org/abs/2412.20875)|null|Advancements in deep learning are driven by training models with increasingly larger numbers of parameters, which in turn heightens the computational demands. To address this issue, Mixture-of-Depths (MoD) models have been proposed to dynamically assign computations only to the most relevant parts of the inputs, thereby enabling the deployment of large-parameter models with high efficiency during inference and training. These MoD models utilize a routing mechanism to determine which tokens should be processed by a layer, or skipped. However, conventional MoD models employ additional network layers specifically for the routing which are difficult to train, and add complexity and deployment overhead to the model. In this paper, we introduce a novel attention-based routing mechanism A-MoD that leverages the existing attention map of the preceding layer for routing decisions within the current layer. Compared to standard routing, A-MoD allows for more efficient training as it introduces no additional trainable parameters and can be easily adapted from pretrained transformer models. Furthermore, it can increase the performance of the MoD model. For instance, we observe up to 2% higher accuracy on ImageNet compared to standard routing and isoFLOP ViT baselines. Furthermore, A-MoD improves the MoD training convergence, leading to up to 2x faster transfer learning.||
|**2024-12-30**|[Metadata-Enhanced Speech Emotion Recognition: Augmented Residual Integration and Co-Attention in Two-Stage Fine-Tuning](http://arxiv.org/abs/2412.20707)|null|Speech Emotion Recognition (SER) involves analyzing vocal expressions to determine the emotional state of speakers, where the comprehensive and thorough utilization of audio information is paramount. Therefore, we propose a novel approach on self-supervised learning (SSL) models that employs all available auxiliary information -- specifically metadata -- to enhance performance. Through a two-stage fine-tuning method in multi-task learning, we introduce the Augmented Residual Integration (ARI) module, which enhances transformer layers in encoder of SSL models. The module efficiently preserves acoustic features across all different levels, thereby significantly improving the performance of metadata-related auxiliary tasks that require various levels of features. Moreover, the Co-attention module is incorporated due to its complementary nature with ARI, enabling the model to effectively utilize multidimensional information and contextual relationships from metadata-related auxiliary tasks. Under pre-trained base models and speaker-independent setup, our approach consistently surpasses state-of-the-art (SOTA) models on multiple SSL encoders for the IEMOCAP dataset.||
|**2024-12-30**|[Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA](http://arxiv.org/abs/2412.20677)|null|Large language models have been shown to perform well on a variety of natural language processing problems. However, as the model size and the input sequence length increase, the rapid growth of the KV cache significantly slows down inference. Therefore, grouped-query attention (GQA), as an alternative to multi-head attention (MHA), has been widely adopted in LLMs. In this work, we propose a low-cost method for pruning MHA models into GQA models with any compression ratio of key-value heads. Our method is based on $\mathit{L_0}$ masks to gradually remove redundant parameters. In addition, we apply orthogonal transformations to attention heads without changing the model to increase similarity between attention heads before pruning training, in order to further improve performance of the model. Our method is compatible with rotary position embedding (RoPE), which means the model after training can be fully adapted to the mainstream standard GQA framework. Experiments demonstrate that our strategy can compress up to 87.5% of key-value heads of the LLaMA2-7B model without too much performance degradation, achieved through supervised fine-tuning alone.||
|**2024-12-29**|[FreqMixFormerV2: Lightweight Frequency-aware Mixed Transformer for Human Skeleton Action Recognition](http://arxiv.org/abs/2412.20621)|**[link](https://github.com/wenhanwu95/freqmixformer)**|Transformer-based human skeleton action recognition has been developed for years. However, the complexity and high parameter count of these models hinder their practical applications, especially in resource-constrained environments. In this work, we propose FreqMixFormerV2, which was built upon the Frequency-aware Mixed Transformer (FreqMixFormer) for identifying subtle and discriminative actions with pioneered frequency-domain analysis. We design a lightweight architecture that maintains robust performance while significantly reducing the model complexity. This is achieved through a redesigned frequency operator that optimizes high-frequency and low-frequency parameter adjustments, and a simplified frequency-aware attention module. These improvements result in a substantial reduction in model parameters, enabling efficient deployment with only a minimal sacrifice in accuracy. Comprehensive evaluations on standard datasets (NTU RGB+D, NTU RGB+D 120, and NW-UCLA) demonstrate that the proposed model achieves a superior balance between efficiency and accuracy, outperforming state-of-the-art methods with only 60% of the parameters.||
|**2024-12-29**|[Segmentation of Muscularis Propria in Colon Histopathology Images Using Vision Transformers for Hirschsprung's Disease](http://arxiv.org/abs/2412.20571)|null|Hirschsprung's disease (HD) is a congenital birth defect diagnosed by identifying the lack of ganglion cells within the colon's muscularis propria, specifically within the myenteric plexus regions. There may be advantages for quantitative assessments of histopathology images of the colon, such as counting the ganglion and assessing their spatial distribution; however, this would be time-intensive for pathologists, costly, and subject to inter- and intra-rater variability. Previous research has demonstrated the potential for deep learning approaches to automate histopathology image analysis, including segmentation of the muscularis propria using convolutional neural networks (CNNs). Recently, Vision Transformers (ViTs) have emerged as a powerful deep learning approach due to their self-attention. This study explores the application of ViTs for muscularis propria segmentation in calretinin-stained histopathology images and compares their performance to CNNs and shallow learning methods. The ViT model achieved a DICE score of 89.9% and Plexus Inclusion Rate (PIR) of 100%, surpassing the CNN (DICE score of 89.2%; PIR of 96.0%) and k-means clustering method (DICE score of 80.7%; PIR 77.4%). Results assert that ViTs are a promising tool for advancing HD-related image analysis.||
|**2025-01-02**|[EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers](http://arxiv.org/abs/2412.20413)|null|Removing unwanted concepts from large-scale text-to-image (T2I) diffusion models while maintaining their overall generative quality remains an open challenge. This difficulty is especially pronounced in emerging paradigms, such as Stable Diffusion (SD) v3 and Flux, which incorporate flow matching and transformer-based architectures. These advancements limit the transferability of existing concept-erasure techniques that were originally designed for the previous T2I paradigm (e.g., SD v1.4). In this work, we introduce EraseAnything, the first method specifically developed to address concept erasure within the latest flow-based T2I framework. We formulate concept erasure as a bi-level optimization problem, employing LoRA-based parameter tuning and an attention map regularizer to selectively suppress undesirable activations. Furthermore, we propose a self-contrastive learning strategy to ensure that removing unwanted concepts does not inadvertently harm performance on unrelated ones. Experimental results demonstrate that EraseAnything successfully fills the research gap left by earlier methods in this new T2I paradigm, achieving state-of-the-art performance across a wide range of concept erasure tasks.||
|**2024-12-28**|[Transformer-Based Contrastive Meta-Learning For Low-Resource Generalizable Activity Recognition](http://arxiv.org/abs/2412.20290)|null|Deep learning has been widely adopted for human activity recognition (HAR), yet generalizing a trained model across diverse users and scenarios remains challenging due to distribution shifts (DS). The inherent low-resource challenge in HAR, i.e., collecting and labeling adequate human-involved data can be prohibitively costly, further raises the difficulty of tackling DS. We propose TACO, a novel transformer-based contrastive meta-learning approach for generalizable HAR. TACO addresses DS by synthesizing virtual target domains in training with explicit consideration of model generalizability. Additionally, we extract expressive features with the attention mechanism of the Transformer and incorporate the supervised contrastive loss function within our meta-optimization to enhance representation learning. Our evaluation demonstrates that TACO achieves notably better performance across various low-resource DS scenarios.||
|**2024-12-28**|[Distilled Transformers with Locally Enhanced Global Representations for Face Forgery Detection](http://arxiv.org/abs/2412.20156)|null|Face forgery detection (FFD) is devoted to detecting the authenticity of face images. Although current CNN-based works achieve outstanding performance in FFD, they are susceptible to capturing local forgery patterns generated by various manipulation methods. Though transformer-based detectors exhibit improvements in modeling global dependencies, they are not good at exploring local forgery artifacts. Hybrid transformer-based networks are designed to capture local and global manipulated traces, but they tend to suffer from the attention collapse issue as the transformer block goes deeper. Besides, soft labels are rarely available. In this paper, we propose a distilled transformer network (DTN) to capture both rich local and global forgery traces and learn general and common representations for different forgery faces. Specifically, we design a mixture-of-experts (MoE) module to mine various robust forgery embeddings. Moreover, a locally-enhanced vision transformer (LEVT) module is proposed to learn locally-enhanced global representations. We design a lightweight multi-attention scaling (MAS) module to avoid attention collapse, which can be plugged into any transformer-based model with only a slight increase in computational costs. In addition, we propose a deepfake self-distillation (DSD) scheme to provide the model with abundant soft label information. Extensive experiments show that the proposed method surpasses the state of the art on five deepfake datasets.||
|**2024-12-27**|[An Integrated Optimization and Deep Learning Pipeline for Predicting Live Birth Success in IVF Using Feature Optimization and Transformer-Based Models](http://arxiv.org/abs/2412.19696)|null|In vitro fertilization (IVF) is a widely utilized assisted reproductive technology, yet predicting its success remains challenging due to the multifaceted interplay of clinical, demographic, and procedural factors. This study develops a robust artificial intelligence (AI) pipeline aimed at predicting live birth outcomes in IVF treatments. The pipeline uses anonymized data from 2010 to 2018, obtained from the Human Fertilization and Embryology Authority (HFEA). We evaluated the prediction performance of live birth success as a binary outcome (success/failure) by integrating different feature selection methods, such as principal component analysis (PCA) and particle swarm optimization (PSO), with different traditional machine learning-based classifiers including random forest (RF) and decision tree, as well as deep learning-based classifiers including a custom transformer-based model and a TabTransformer model with an attention mechanism. Our research demonstrated that the best performance was achieved by combining PSO for feature selection with the TabTransformer-based deep learning model, yielding an accuracy of 99.50% and an AUC of 99.96%, highlighting its strong performance in predicting live births. This study establishes a highly accurate AI pipeline for predicting live birth outcomes in IVF, demonstrating its potential to enhance personalized fertility treatments.||
|**2024-12-26**|[Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning](http://arxiv.org/abs/2412.19200)|**[link](https://github.com/Littleor/Personalized-DMER)**|Dynamic Music Emotion Recognition (DMER) aims to predict the emotion of different moments in music, playing a crucial role in music information retrieval. The existing DMER methods struggle to capture long-term dependencies when dealing with sequence data, which limits their performance. Furthermore, these methods often overlook the influence of individual differences on emotion perception, even though everyone has their own personalized emotional perception in the real world. Motivated by these issues, we explore more effective sequence processing methods and introduce the Personalized DMER (PDMER) problem, which requires models to predict emotions that align with personalized perception. Specifically, we propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method. This method fuses features from a dual-scale feature extractor and captures both short and long-term dependencies using a dual-scale attention transformer, improving the performance in traditional DMER. To achieve PDMER, we design a novel task construction strategy that divides tasks by annotators. Samples in a task are annotated by the same annotator, ensuring consistent perception. Leveraging this strategy alongside meta-learning, DSAML can predict personalized perception of emotions with just one personalized annotation sample. Our objective and subjective experiments demonstrate that our method can achieve state-of-the-art performance in both traditional DMER and PDMER.||
|**2024-12-26**|[Dual Channel Multi-Attention in ViT for Biometric Authentication using Forehead Subcutaneous Vein Pattern and Periocular Pattern](http://arxiv.org/abs/2412.19160)|null|Traditional biometric systems, like face and fingerprint recognition, have encountered significant setbacks due to wearing face masks and hygiene concerns. To meet the challenges of the partially covered face due to face masks and hygiene concerns of fingerprint recognition, this paper proposes a novel dual-channel multi-attention Vision Transformer (ViT) framework for biometric authentication using forehead subcutaneous vein patterns and periocular patterns, offering a promising alternative to traditional methods, capable of performing well even with face masks and without any physical touch. The proposed framework leverages a dual-channel ViT architecture, designed to handle two distinct biometric traits. It can capture long-range dependencies of independent features from the vein and periocular patterns. A custom classifier is then designed to integrate the independently extracted features, producing a final class prediction. The performance of the proposed algorithm was rigorously evaluated using the Forehead Subcutaneous Vein Pattern and Periocular Biometric Pattern (FSVP-PBP) database. The results demonstrated the superiority of the algorithm over state-of-the-art methods, achieving remarkable classification accuracy of $99.3 \pm 0.02\%$ with the combined vein and periocular patterns.||
|**2024-12-25**|[Adopting Trustworthy AI for Sleep Disorder Prediction: Deep Time Series Analysis with Temporal Attention Mechanism and Counterfactual Explanations](http://arxiv.org/abs/2412.18971)|null|Sleep disorders have a major impact on both lifestyle and health. Effective sleep disorder prediction from lifestyle and physiological data can provide essential details for early intervention. This research utilizes three deep time series models and facilitates them with explainability approaches for sleep disorder prediction. Specifically, our approach adopts Temporal Convolutional Networks (TCN), Long Short-Term Memory (LSTM) for time series data analysis, and Temporal Fusion Transformer model (TFT). Meanwhile, the temporal attention mechanism and counterfactual explanation with SHapley Additive exPlanations (SHAP) approach are employed to ensure dependable, accurate, and interpretable predictions. Finally, using a large dataset of sleep health measures, our evaluation demonstrates the effect of our method in predicting sleep disorders.||
|**2024-12-25**|[TopoBDA: Towards Bezier Deformable Attention for Road Topology Understanding](http://arxiv.org/abs/2412.18951)|null|Understanding road topology is crucial for autonomous driving. This paper introduces TopoBDA (Topology with Bezier Deformable Attention), a novel approach that enhances road topology understanding by leveraging Bezier Deformable Attention (BDA). BDA utilizes Bezier control points to drive the deformable attention mechanism, significantly improving the detection and representation of elongated and thin polyline structures, such as lane centerlines. TopoBDA processes multi-camera 360-degree imagery to generate Bird's Eye View (BEV) features, which are refined through a transformer decoder employing BDA. This method enhances computational efficiency while maintaining high accuracy in centerline prediction. Additionally, TopoBDA incorporates an instance mask formulation and an auxiliary one-to-many set prediction loss strategy to further refine centerline detection and improve road topology understanding. Experimental evaluations on the OpenLane-V2 dataset demonstrate that TopoBDA outperforms existing methods, achieving state-of-the-art results in centerline detection and topology reasoning. The integration of multi-modal data, including lidar and radar, specifically for road topology understanding, further enhances the model's performance, underscoring its importance in autonomous driving applications.||
|**2024-12-25**|[UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation](http://arxiv.org/abs/2412.18928)|null|Recently, text-to-image generation models have achieved remarkable advancements, particularly with diffusion models facilitating high-quality image synthesis from textual descriptions. However, these models often struggle with achieving precise control over pixel-level layouts, object appearances, and global styles when using text prompts alone. To mitigate this issue, previous works introduce conditional images as auxiliary inputs for image generation, enhancing control but typically necessitating specialized models tailored to different types of reference inputs. In this paper, we explore a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture, to enable flexible and controllable generation across diverse conditions without the need for multiple specialized models. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions, injecting this information into the image generation process through a cross-attention mechanism enhanced by Rotary Position Embedding. Experimental results across a variety of tasks, including pixel-level spatial control, subject-driven image generation, and style-image-based image synthesis, demonstrate the effectiveness of our UNIC-Adapter in unified controllable image generation.||
|**2024-12-25**|[Accelerating Diffusion Transformers with Dual Feature Caching](http://arxiv.org/abs/2412.18911)|**[link](https://github.com/shenyi-z/duca)**|Diffusion Transformers (DiT) have become the dominant methods in image and video generation yet still suffer substantial computational costs. As an effective approach for DiT acceleration, feature caching methods are designed to cache the features of DiT in previous timesteps and reuse them in the next timesteps, allowing us to skip the computation in the next timesteps. However, on the one hand, aggressively reusing all the features cached in previous timesteps leads to a severe drop in generation quality. On the other hand, conservatively caching only the features in the redundant layers or tokens but still computing the important ones successfully preserves the generation quality but results in reductions in acceleration ratios. Observing such a tradeoff between generation quality and acceleration performance, this paper begins by quantitatively studying the accumulated error from cached features. Surprisingly, we find that aggressive caching does not introduce significantly more caching errors in the caching step, and the conservative feature caching can fix the error introduced by aggressive caching. Thereby, we propose a dual caching strategy that adopts aggressive and conservative caching iteratively, leading to significant acceleration and high generation quality at the same time. Besides, we further introduce a V-caching strategy for token-wise conservative caching, which is compatible with flash attention and requires no training and calibration data. Our code has been released on GitHub: https://github.com/Shenyi-Z/DuCa||
|**2024-12-25**|[Implicit factorized transformer approach to fast prediction of turbulent channel flows](http://arxiv.org/abs/2412.18840)|**[link](https://github.com/huiyu-2002/ifactformer-m)**|Transformer neural operators have recently become an effective approach for surrogate modeling of nonlinear systems governed by partial differential equations (PDEs). In this paper, we introduce a modified implicit factorized transformer (IFactFormer-m) model which replaces the original chained factorized attention with parallel factorized attention. The IFactFormer-m model successfully performs long-term predictions for turbulent channel flow, whereas the original IFactFormer (IFactFormer-o), Fourier neural operator (FNO), and implicit Fourier neural operator (IFNO) exhibit a poor performance. Turbulent channel flows are simulated by direct numerical simulation using fine grids at friction Reynolds numbers $\text{Re}_{\tau}\approx 180,395,590$ , and filtered to coarse grids for training neural operator. The neural operator takes the current flow field as input and predicts the flow field at the next time step, and long-term prediction is achieved in the posterior through an autoregressive approach. The prediction results show that IFactFormer-m, compared to other neural operators and the traditional large eddy simulation (LES) methods including dynamic Smagorinsky model (DSM) and the wall-adapted local eddy-viscosity (WALE) model, reduces prediction errors in the short term, and achieves stable and accurate long-term prediction of various statistical properties and flow structures, including the energy spectrum, mean streamwise velocity, root mean square (rms) values of fluctuating velocities, Reynolds shear stress, and spatial structures of instantaneous velocity. Moreover, the trained IFactFormer-m is much faster than traditional LES methods.||
|**2024-12-25**|[Ister: Inverted Seasonal-Trend Decomposition Transformer for Explainable Multivariate Time Series Forecasting](http://arxiv.org/abs/2412.18798)|null|In long-term time series forecasting, Transformer-based models have achieved great success, due to their ability to capture long-range dependencies. However, existing transformer-based methods face challenges in accurately identifying which variables play a pivotal role in the prediction process and tend to overemphasize noisy channels, thereby limiting the interpretability and practical effectiveness of the models. Besides, they face scalability issues due to the quadratic computational complexity of self-attention. In this paper, we propose a new model named Inverted Seasonal-Trend Decomposition Transformer (Ister), which addresses these challenges in long-term multivariate time series forecasting by designing an improved Transformer-based structure. Ister first decomposes the original time series into seasonal and trend components. Then we propose a new Dot-attention mechanism to process the seasonal component, which improves accuracy, computational complexity, and interpretability. Upon completion of the training phase, it allows users to intuitively visualize the significance of each feature in the overall prediction. We conduct comprehensive experiments, and the results show that Ister achieves state-of-the-art (SOTA) performance on multiple datasets, surpassing existing models in long-term prediction tasks.||
|**2024-12-25**|[Unified Local and Global Attention Interaction Modeling for Vision Transformers](http://arxiv.org/abs/2412.18778)|null|We present a novel method that extends the self-attention mechanism of a vision transformer (ViT) for more accurate object detection across diverse datasets. ViTs show strong capability for image understanding tasks such as object detection, segmentation, and classification. This is due in part to their ability to leverage global information from interactions among visual tokens. However, the self-attention mechanism in ViTs is limited because it does not allow visual tokens to exchange local or global information with neighboring features before computing global attention. This is problematic because tokens are treated in isolation when attending (matching) to other tokens, and valuable spatial relationships are overlooked. This isolation is further compounded by dot-product similarity operations that make tokens from different semantic classes appear visually similar. To address these limitations, we introduce two modifications to the traditional self-attention framework: a novel aggressive convolution pooling strategy for local feature mixing, and a new conceptual attention transformation to facilitate interaction and feature exchange between semantic concepts. Experimental results demonstrate that local and global information exchange among visual features before self-attention significantly improves performance on challenging object detection tasks and generalizes across multiple benchmark datasets and challenging medical datasets. We publish source code and a novel dataset of cancerous tumors (chimeric cell clusters).||
|**2024-12-24**|[DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation](http://arxiv.org/abs/2412.18597)|**[link](https://github.com/tencentarc/ditctrl)**|Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, the current video generation models predominantly focus on single-prompt generation, struggling to generate coherent scenes with multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, a training-free multi-prompt video generation method under MM-DiT architectures for the first time. Our key idea is to take the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models, enabling mask-guided precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts without additional training. Besides, we also present MPVBench, a new benchmark specially designed for multi-prompt video generation to evaluate the performance of multi-prompt generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.||
|**2024-12-24**|[Advancing Deformable Medical Image Registration with Multi-axis Cross-covariance Attention](http://arxiv.org/abs/2412.18545)|null|Deformable image registration is a fundamental requirement for medical image analysis. Recently, transformers have been widely used in deep learning-based registration methods for their ability to capture long-range dependency via self-attention (SA). However, the high computation and memory loads of SA (growing quadratically with the spatial resolution) hinder transformers from processing subtle textural information in high-resolution image features, e.g., at the full and half image resolutions. This limits deformable registration as the high-resolution textural information is crucial for finding precise pixel-wise correspondence between subtle anatomical structures. Cross-covariance Attention (XCA), as a "transposed" version of SA that operates across feature channels, has complexity growing linearly with the spatial resolution, providing the feasibility of capturing long-range dependency among high-resolution image features. However, existing XCA-based transformers merely capture coarse global long-range dependency, which are unsuitable for deformable image registration relying primarily on fine-grained local correspondence. In this study, we propose to improve existing deep learning-based registration methods by embedding a new XCA mechanism. To this end, we design an XCA-based transformer block optimized for deformable medical image registration, named Multi-Axis XCA (MAXCA). Our MAXCA serves as a general network block that can be embedded into various registration network architectures. It can capture both global and local long-range dependency among high-resolution image features by applying regional and dilated XCA in parallel via a multi-axis design. Extensive experiments on two well-benchmarked inter-/intra-patient registration tasks with seven public medical datasets demonstrate that our MAXCA block enables state-of-the-art registration performance.||
|**2024-12-24**|[Segment-Based Attention Masking for GPTs](http://arxiv.org/abs/2412.18487)|**[link](https://github.com/shacharKZ/MAS-Segment-Based-Attention-Masking)**|Modern Language Models (LMs) owe much of their success to masked causal attention, the backbone of Generative Pre-Trained Transformer (GPT) models. Although GPTs can process the entire user prompt at once, the causal masking is applied to all input tokens step-by-step, mimicking the generation process. This imposes an unnecessary constraint during the initial "prefill" phase when the model processes the input prompt and generates the internal representations before producing any output tokens. In this work, attention is masked based on the known block structure at the prefill phase, followed by the conventional token-by-token autoregressive process after that. For example, in a typical chat prompt, the system prompt is treated as one block, and the user prompt as the next one. Each of these is treated as a unit for the purpose of masking, such that the first tokens in each block can access the subsequent tokens in a non-causal manner. Then, the model answer is generated in the conventional causal manner. This Segment-by-Segment scheme entails no additional computational overhead. When integrating it into models such as Llama and Qwen, state-of-the-art performance is consistently achieved.||
|**2024-12-24**|[Towards understanding how attention mechanism works in deep learning](http://arxiv.org/abs/2412.18288)|null|Attention mechanism has been extensively integrated within mainstream neural network architectures, such as Transformers and graph attention networks. Yet, its underlying working principles remain somewhat elusive. What is its essence? Are there any connections between it and traditional machine learning algorithms? In this study, we inspect the process of computing similarity using classic metrics and vector space properties in manifold learning, clustering, and supervised learning. We identify the key characteristics of similarity computation and information propagation in these methods and demonstrate that the self-attention mechanism in deep learning adheres to the same principles but operates more flexibly and adaptively. We decompose the self-attention mechanism into a learnable pseudo-metric function and an information propagation process based on similarity computation. We prove that the self-attention mechanism converges to a drift-diffusion process through continuous modeling provided the pseudo-metric is a transformation of a metric and certain reasonable assumptions hold. This equation could be transformed into a heat equation under a new metric. In addition, we give a first-order analysis of attention mechanism with a general pseudo-metric function. This study aids in understanding the effects and principle of attention mechanism through physical intuition. Finally, we propose a modified attention mechanism called metric-attention by leveraging the concept of metric learning to facilitate the ability to learn desired metrics more effectively. Experimental results demonstrate that it outperforms self-attention regarding training efficiency, accuracy, and robustness.||
|**2024-12-24**|[Leveraging Convolutional Neural Network-Transformer Synergy for Predictive Modeling in Risk-Based Applications](http://arxiv.org/abs/2412.18222)|null|With the development of the financial industry, credit default prediction, as an important task in financial risk management, has received increasing attention. Traditional credit default prediction methods mostly rely on machine learning models, such as decision trees and random forests, but these methods have certain limitations in processing complex data and capturing potential risk patterns. To this end, this paper proposes a deep learning model based on the combination of convolutional neural networks (CNN) and Transformer for credit user default prediction. The model combines the advantages of CNN in local feature extraction with the ability of Transformer in global dependency modeling, effectively improving the accuracy and robustness of credit default prediction. Through experiments on public credit default datasets, the results show that the CNN+Transformer model outperforms traditional machine learning models, such as random forests and XGBoost, in multiple evaluation indicators such as accuracy, AUC, and KS value, demonstrating its powerful ability in complex financial data modeling. Further experimental analysis shows that appropriate optimizer selection and learning rate adjustment play a vital role in improving model performance. In addition, the ablation experiment of the model verifies the advantages of the combination of CNN and Transformer and proves the complementarity of the two in credit default prediction. This study provides a new idea for credit default prediction and provides strong support for risk assessment and intelligent decision-making in the financial field. Future research can further improve the prediction effect and generalization ability by introducing more unstructured data and improving the model architecture.||
|**2024-12-24**|[Leveraging Deep Learning with Multi-Head Attention for Accurate Extraction of Medicine from Handwritten Prescriptions](http://arxiv.org/abs/2412.18199)|null|Extracting medication names from handwritten doctor prescriptions is challenging due to the wide variability in handwriting styles and prescription formats. This paper presents a robust method for extracting medicine names using a combination of Mask R-CNN and Transformer-based Optical Character Recognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A novel dataset, featuring diverse handwritten prescriptions from various regions of Pakistan, was utilized to fine-tune the model on different handwriting styles. The Mask R-CNN model segments the prescription images to focus on the medicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and Positional Embeddings, transcribes the isolated text. The transcribed text is then matched against a pre-existing database for accurate identification. The proposed approach achieved a character error rate (CER) of 1.4% on standard benchmarks, highlighting its potential as a reliable and efficient tool for automating medicine name extraction.||
|**2024-12-23**|[Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers](http://arxiv.org/abs/2412.18040)|null|Tensor Attention extends traditional attention mechanisms by capturing high-order correlations across multiple modalities, addressing the limitations of classical matrix-based attention. Meanwhile, Rotary Position Embedding ($\mathsf{RoPE}$) has shown superior performance in encoding positional information in long-context scenarios, significantly enhancing transformer models' expressiveness. Despite these empirical successes, the theoretical limitations of these technologies remain underexplored. In this study, we analyze the circuit complexity of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention, showing that with polynomial precision, constant-depth layers, and linear or sublinear hidden dimension, they cannot solve fixed membership problems or $(A_{F,r})^*$ closure problems, under the assumption that $\mathsf{TC}^0 \neq \mathsf{NC}^1$. These findings highlight a gap between the empirical performance and theoretical constraints of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention Transformers, offering insights that could guide the development of more theoretically grounded approaches to Transformer model design and scaling.||
|**2024-12-23**|[Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction](http://arxiv.org/abs/2412.17810)|**[link](https://github.com/robinwu218/tost)**|The attention operator is arguably the key distinguishing factor of transformer architectures, which have demonstrated state-of-the-art performance on a variety of tasks. However, transformer attention operators often impose a significant computational burden, with the computational complexity scaling quadratically with the number of tokens. In this work, we propose a novel transformer attention operator whose computational complexity scales linearly with the number of tokens. We derive our network architecture by extending prior work which has shown that a transformer style architecture naturally arises by "white-box" architecture design, where each layer of the network is designed to implement an incremental optimization step of a maximal coding rate reduction objective (MCR$^2$). Specifically, we derive a novel variational form of the MCR$^2$ objective and show that the architecture that results from unrolled gradient descent of this variational objective leads to a new attention module called Token Statistics Self-Attention (TSSA). TSSA has linear computational and memory complexity and radically departs from the typical attention architecture that computes pairwise similarities between tokens. Experiments on vision, language, and long sequence tasks show that simply swapping TSSA for standard self-attention, which we refer to as the Token Statistics Transformer (ToST), achieves competitive performance with conventional transformers while being significantly more computationally efficient and interpretable. Our results also somewhat call into question the conventional wisdom that pairwise similarity style attention mechanisms are critical to the success of transformer architectures. Code will be available at https://github.com/RobinWu218/ToST.||
|**2024-12-23**|[Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization](http://arxiv.org/abs/2412.17739)|**[link](https://github.com/tsinghuac3i/fourier-position-embedding)**|Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs Fourier Series and zero-outs the destructive frequency components, increasing model robustness against the spectrum damage. Experiments across various model scales show that, within varying context windows, FoPE can maintain a more stable perplexity and a more consistent accuracy in a needle-in-haystack task compared to RoPE and ALiBi. Several analyses and ablations bring further support to our method and theoretical modeling.||
|**2024-12-23**|[URoadNet: Dual Sparse Attentive U-Net for Multiscale Road Network Extraction](http://arxiv.org/abs/2412.17573)|null|The challenges of road network segmentation demand an algorithm capable of adapting to the sparse and irregular shapes, as well as the diverse context, which often leads traditional encoding-decoding methods and simple Transformer embeddings to failure. We introduce a computationally efficient and powerful framework for elegant road-aware segmentation. Our method, called URoadNet, effectively encodes fine-grained local road connectivity and holistic global topological semantics while decoding multiscale road network information. URoadNet offers a novel alternative to the U-Net architecture by integrating connectivity attention, which can exploit intra-road interactions across multi-level sampling features with reduced computational complexity. This local interaction serves as valuable prior information for learning global interactions between road networks and the background through another integrality attention mechanism. The two forms of sparse attention are arranged alternatively and complementarily, and trained jointly, resulting in performance improvements without significant increases in computational complexity. Extensive experiments on various datasets with different resolutions, including Massachusetts, DeepGlobe, SpaceNet, and Large-Scale remote sensing images, demonstrate that URoadNet outperforms state-of-the-art techniques. Our approach represents a significant advancement in the field of road network extraction, providing a computationally feasible solution that achieves high-quality segmentation results.||
|**2024-12-20**|[CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up](http://arxiv.org/abs/2412.16112)|**[link](https://github.com/huage001/clear)**|Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this issue, we aim at a linear attention mechanism in this paper that reduces the complexity of pre-trained DiTs to linear. We begin our exploration with a comprehensive summary of existing efficient attention mechanisms and identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token, and thus achieves linear complexity. Our experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. Simultaneously, it reduces attention computations by 99.5% and accelerates generation by 6.3 times for generating 8K-resolution images. Furthermore, we investigate favorable properties in the distilled attention layers, such as zero-shot generalization across various models and plugins, and improved support for multi-GPU parallel inference. Models and code are available here: https://github.com/Huage001/CLEAR.||
|**2024-12-20**|[Multi-dimensional Visual Prompt Enhanced Image Restoration via Mamba-Transformer Aggregation](http://arxiv.org/abs/2412.15845)|**[link](https://github.com/12138-chr/mtair)**|Recent efforts on image restoration have focused on developing "all-in-one" models that can handle different degradation types and levels within a single model. However, most mainstream Transformer-based models face a dilemma between model capability and computational burden, since the computational complexity of self-attention grows quadratically with image size, and they have inadequacies in capturing long-range dependencies. Most Mamba-related models scan the feature map only in the spatial dimension for global modeling, failing to fully utilize information in the channel dimension. To address these problems, this paper proposes to fully utilize the complementary advantages of Mamba and Transformer without sacrificing computational efficiency. Specifically, the selective scanning mechanism of Mamba is employed for spatial modeling, capturing long-range spatial dependencies with linear complexity. The self-attention mechanism of Transformer is applied to channel modeling, avoiding the high computational burden that grows quadratically with the image's spatial dimensions. Moreover, to enrich informative prompts for effective image restoration, multi-dimensional prompt learning modules are proposed to learn prompt-flows from multi-scale encoder/decoder layers, which helps reveal the underlying characteristics of various degradations from both spatial and channel perspectives and thereby enhances the capability of the "all-in-one" model to solve various restoration tasks. Extensive experimental results on several image restoration benchmark tasks, such as image denoising, dehazing, and deraining, demonstrate that the proposed method achieves new state-of-the-art performance compared with many popular mainstream methods. Related source code and pre-trained parameters will be made public at https://github.com/12138-chr/MTAIR.||
|**2024-12-20**|[Mask-RadarNet: Enhancing Transformer With Spatial-Temporal Semantic Context for Radar Object Detection in Autonomous Driving](http://arxiv.org/abs/2412.15595)|null|As a cost-effective and robust technology, automotive radar has seen steady improvement during the last years, making it an appealing complement to commonly used sensors like camera and LiDAR in autonomous driving. Radio frequency data with rich semantic information are attracting more and more attention. Most current radar-based models take radio frequency image sequences as the input. However, these models heavily rely on convolutional neural networks and leave out the spatial-temporal semantic context during the encoding stage. To solve these problems, we propose a model called Mask-RadarNet to fully utilize the hierarchical semantic features from the input radar data. Mask-RadarNet exploits the combination of interleaved convolution and attention operations to replace the traditional architecture in transformer-based models. In addition, patch shift is introduced to the Mask-RadarNet for efficient spatial-temporal feature learning. By shifting part of patches with a specific mosaic pattern in the temporal dimension, Mask-RadarNet achieves competitive performance while reducing the computational burden of the spatial-temporal modeling. In order to capture the spatial-temporal semantic contextual information, we design the class masking attention module (CMAM) in our encoder. Moreover, a lightweight auxiliary decoder is added to our model to aggregate prior maps generated from the CMAM. Experiments on the CRUW dataset demonstrate the superiority of the proposed method to some state-of-the-art radar-based object detection algorithms. With relatively lower computational complexity and fewer parameters, the proposed Mask-RadarNet achieves higher recognition accuracy for object detection in autonomous driving.||
|**2024-12-19**|[Uncertainty-Guided Cross Attention Ensemble Mean Teacher for Semi-supervised Medical Image Segmentation](http://arxiv.org/abs/2412.15380)|null|This work proposes a novel framework, Uncertainty-Guided Cross Attention Ensemble Mean Teacher (UG-CEMT), for achieving state-of-the-art performance in semi-supervised medical image segmentation. UG-CEMT leverages the strengths of co-training and knowledge distillation by combining a Cross-attention Ensemble Mean Teacher framework (CEMT) inspired by Vision Transformers (ViT) with uncertainty-guided consistency regularization and Sharpness-Aware Minimization emphasizing uncertainty. UG-CEMT improves semi-supervised performance while maintaining a consistent network architecture and task setting by fostering high disparity between sub-networks. Experiments demonstrate significant advantages over existing methods like Mean Teacher and Cross-pseudo Supervision in terms of disparity, domain generalization, and medical image segmentation performance. UG-CEMT achieves state-of-the-art results on multi-center prostate MRI and cardiac MRI datasets, where object segmentation is particularly challenging. Our results show that using only 10% labeled data, UG-CEMT approaches the performance of fully supervised methods, demonstrating its effectiveness in exploiting unlabeled data for robust medical image segmentation. The code is publicly available at https://github.com/Meghnak13/UG-CEMT.||
|**2024-12-19**|[PCA-Featured Transformer for Jamming Detection in 5G UAV Networks](http://arxiv.org/abs/2412.15312)|null|Jamming attacks pose a threat to Unmanned Aerial Vehicle (UAV) wireless communication systems, potentially disrupting essential services and compromising network reliability. Current detection approaches struggle with sophisticated artificial intelligence (AI) jamming techniques that adapt their patterns, while existing machine learning solutions often require extensive feature engineering and fail to capture complex temporal dependencies in attack signatures. Furthermore, 5G networks using either Time Division Duplex (TDD) or Frequency Division Duplex (FDD) methods can face service degradation from intentional interference sources. To address these challenges, we present a novel transformer-based deep learning framework for jamming detection with Principal Component Analysis (PCA) added features. Our architecture leverages the transformer's self-attention mechanism to capture complex temporal dependencies and spatial correlations in wireless signal characteristics, enabling more robust jamming detection techniques. The U-shaped model incorporates a modified transformer encoder that processes signal features including received signal strength indicator (RSSI) and signal-to-noise ratio (SINR) measurements, alongside a specialized positional encoding scheme that accounts for the periodic nature of wireless signals. In addition, we propose a batch size scheduler and implement chunking techniques to optimize training convergence for time series data. These advancements contribute to achieving up to a ten times improvement in training speed within the advanced U-shaped encoder-decoder model introduced. Simulation results demonstrate that our approach achieves a detection accuracy of 90.33% in Line-of-Sight (LoS) and 84.35% in non-Line-of-Sight (NLoS) and outperforms machine learning methods and existing deep learning solutions such as the XGBoost (XGB) classifier by approximately 4%.||
|**2024-12-19**|[MIETT: Multi-Instance Encrypted Traffic Transformer for Encrypted Traffic Classification](http://arxiv.org/abs/2412.15306)|**[link](https://github.com/secilia-cxy/miett)**|Network traffic includes data transmitted across a network, such as web browsing and file transfers, and is organized into packets (small units of data) and flows (sequences of packets exchanged between two endpoints). Classifying encrypted traffic is essential for detecting security threats and optimizing network management. Recent advancements have highlighted the superiority of foundation models in this task, particularly for their ability to leverage large amounts of unlabeled data and demonstrate strong generalization to unseen data. However, existing methods that focus on token-level relationships fail to capture broader flow patterns, as tokens, defined as sequences of hexadecimal digits, typically carry limited semantic information in encrypted traffic. These flow patterns, which are crucial for traffic classification, arise from the interactions between packets within a flow, not just their internal structure. To address this limitation, we propose a Multi-Instance Encrypted Traffic Transformer (MIETT), which adopts a multi-instance approach where each packet is treated as a distinct instance within a larger bag representing the entire flow. This enables the model to capture both token-level and packet-level relationships more effectively through Two-Level Attention (TLA) layers, improving the model's ability to learn complex packet dynamics and flow patterns. We further enhance the model's understanding of temporal and flow-specific dynamics by introducing two novel pre-training tasks: Packet Relative Position Prediction (PRPP) and Flow Contrastive Learning (FCL). After fine-tuning, MIETT achieves state-of-the-art (SOTA) results across five datasets, demonstrating its effectiveness in classifying encrypted traffic and understanding complex network behaviors. Code is available at https://github.com/Secilia-Cxy/MIETT.||
|**2024-12-19**|[Associative memory inspires improvements for in-context learning using a novel attention residual stream architecture](http://arxiv.org/abs/2412.15113)|**[link](https://github.com/tfburns/amicl-and-residual-attention-streams)**|Large language models (LLMs) demonstrate an impressive ability to utilise information within the context of their input sequences to appropriately respond to data unseen by the LLM during its training procedure. This ability is known as in-context learning (ICL). Humans and non-human animals demonstrate similar abilities, however their neural architectures differ substantially from LLMs. Despite this, a critical component within LLMs, the attention mechanism, resembles modern associative memory models, widely used in and influenced by the computational neuroscience community to model biological memory systems. Using this connection, we introduce an associative memory model capable of performing ICL. We use this as inspiration for a novel residual stream architecture which allows information to directly flow between attention heads. We test this architecture during training within a two-layer Transformer and show its ICL abilities manifest more quickly than without this modification. We then apply our architecture in small language models with 8 million parameters, focusing on attention head values, with results also indicating improved ICL performance at this larger and more naturalistic scale.||
|**2024-12-19**|[A Full Transformer-based Framework for Automatic Pain Estimation using Videos](http://arxiv.org/abs/2412.15095)|null|The automatic estimation of pain is essential in designing an optimal pain management system offering reliable assessment and reducing the suffering of patients. In this study, we present a novel full transformer-based framework consisting of a Transformer in Transformer (TNT) model and a Transformer leveraging cross-attention and self-attention blocks. Elaborating on videos from the BioVid database, we demonstrate state-of-the-art performances, showing the efficacy, efficiency, and generalization capability across all the primary pain estimation tasks.||
|**2024-12-19**|[Mention Attention for Pronoun Translation](http://arxiv.org/abs/2412.14829)|null|Most pronouns are referring expressions, so computers need to resolve what the pronouns refer to, and pronoun usage diverges across languages. Thus, dealing with these divergences and translating pronouns is a challenge in machine translation. Mentions are referring candidates of pronouns and have closer relations with pronouns compared to general tokens. We assume that extracting additional mention features can help pronoun translation. Therefore, we introduce an additional mention attention module in the decoder to pay extra attention to source mentions but not non-mention tokens. Our mention attention module not only extracts features from source mentions, but also considers target-side context, which benefits pronoun translation. In addition, we introduce two mention classifiers to train models to recognize mentions, whose outputs guide the mention attention. We conduct experiments on the WMT17 English-German translation task, and evaluate our models on general translation and pronoun translation, using BLEU, APT, and contrastive evaluation metrics. Our proposed model outperforms the baseline Transformer model in terms of APT and BLEU scores. This confirms our hypothesis that we can improve pronoun translation by paying additional attention to source mentions, and shows that the additional modules do not have a negative effect on general translation quality.||
|**2024-12-19**|[MARIA: a Multimodal Transformer Model for Incomplete Healthcare Data](http://arxiv.org/abs/2412.14810)|null|In healthcare, the integration of multimodal data is pivotal for developing comprehensive diagnostic and predictive models. However, managing missing data remains a significant challenge in real-world applications. We introduce MARIA (Multimodal Attention Resilient to Incomplete datA), a novel transformer-based deep learning model designed to address these challenges through an intermediate fusion strategy. Unlike conventional approaches that depend on imputation, MARIA utilizes a masked self-attention mechanism, which processes only the available data without generating synthetic values. This approach enables it to effectively handle incomplete datasets, enhancing robustness and minimizing biases introduced by imputation methods. We evaluated MARIA against 10 state-of-the-art machine learning and deep learning models across 8 diagnostic and prognostic tasks. The results demonstrate that MARIA outperforms existing methods in terms of performance and resilience to varying levels of data incompleteness, underscoring its potential for critical healthcare applications.||
|**2024-12-19**|[FLAMe: Federated Learning with Attention Mechanism using Spatio-Temporal Keypoint Transformers for Pedestrian Fall Detection in Smart Cities](http://arxiv.org/abs/2412.14768)|null|In smart cities, detecting pedestrian falls is a major challenge to ensure the safety and quality of life of citizens. In this study, we propose a novel fall detection system using FLAMe (Federated Learning with Attention Mechanism), a federated learning (FL) based algorithm. FLAMe trains around important keypoint information and only transmits the trained important weights to the server, reducing communication costs and preserving data privacy. Furthermore, the lightweight keypoint transformer model is integrated into the FL framework to effectively learn spatio-temporal features. We validated the experiment using 22,672 video samples from the "Fall Accident Risk Behavior Video-Sensor Pair data" dataset from AI-Hub. As a result of the experiment, the FLAMe-based system achieved an accuracy of 94.02% with about 190,000 transmission parameters, maintaining performance similar to that of existing centralized learning while maximizing efficiency by reducing communication costs by about 40% compared to the existing FL algorithm, FedAvg. Therefore, the FLAMe algorithm has demonstrated that it provides robust performance in the distributed environment of smart cities and is a practical and effective solution for public safety.||
|**2024-12-19**|[Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning](http://arxiv.org/abs/2412.14640)|null|Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training (CLIP) model through adaptive prompt tuning, guided by real-time visual inputs. Unlike existing techniques such as Context Optimization (CoOp) and Visual Prompt Tuning (VPT), which are constrained by static prompts or visual token reliance, the proposed approach leverages a cross-attention mechanism to dynamically refine text prompts for the image at hand. This enables an image-specific alignment of textual features with image patches extracted from the Vision Transformer, making the model more effective for datasets with high intra-class variance and low inter-class differences. The method is evaluated on several datasets, including CUBirds, Oxford Flowers, and FGVC Aircraft, showing significant performance gains over static prompt tuning approaches. To ensure these performance gains translate into trustworthy predictions, we integrate Monte-Carlo Dropout in our approach to improve the reliability of the model predictions and uncertainty estimates. This integration provides valuable insights into the model's predictive confidence, helping to identify when predictions can be trusted and when additional verification is necessary. This dynamic approach offers a robust solution, advancing the state-of-the-art for few-shot fine-grained classification.||
|**2024-12-19**|[Progressive Fine-to-Coarse Reconstruction for Accurate Low-Bit Post-Training Quantization in Vision Transformers](http://arxiv.org/abs/2412.14633)|null|Due to its efficiency, Post-Training Quantization (PTQ) has been widely adopted for compressing Vision Transformers (ViTs). However, when quantized into low-bit representations, there is often a significant performance drop compared to their full-precision counterparts. To address this issue, reconstruction methods have been incorporated into the PTQ framework to improve performance in low-bit quantization settings. Nevertheless, existing related methods predefine the reconstruction granularity and seldom explore the progressive relationships between different reconstruction granularities, which leads to sub-optimal quantization results in ViTs. To this end, in this paper, we propose a Progressive Fine-to-Coarse Reconstruction (PFCR) method for accurate PTQ, which significantly improves the performance of low-bit quantized vision transformers. Specifically, we define multi-head self-attention and multi-layer perceptron modules along with their shortcuts as the finest reconstruction units. After reconstructing these two fine-grained units, we combine them to form coarser blocks and reconstruct them at a coarser granularity level. We iteratively perform this combination and reconstruction process, achieving progressive fine-to-coarse reconstruction. Additionally, we introduce a Progressive Optimization Strategy (POS) for PFCR to alleviate the difficulty of training, thereby further enhancing model performance. Experimental results on the ImageNet dataset demonstrate that our proposed method achieves the best Top-1 accuracy among state-of-the-art methods, particularly attaining 75.61% for 3-bit quantized ViT-B in PTQ. Besides, quantization results on the COCO dataset reveal the effectiveness and generalization of our proposed method on other computer vision tasks like object detection and instance segmentation.||
|**2024-12-19**|[Can We Get Rid of Handcrafted Feature Extractors? SparseViT: Nonsemantics-Centered, Parameter-Efficient Image Manipulation Localization Through Spare-Coding Transformer](http://arxiv.org/abs/2412.14598)|**[link](https://github.com/scu-zjz/sparsevit)**|Non-semantic features or semantic-agnostic features, which are irrelevant to image context but sensitive to image manipulations, are recognized as evidential to Image Manipulation Localization (IML). Since manual labels are impossible, existing works rely on handcrafted methods to extract non-semantic features. Handcrafted non-semantic features jeopardize the generalization ability of IML models in unseen or complex scenarios. Therefore, for IML, the elephant in the room is: How to adaptively extract non-semantic features? Non-semantic features are context-irrelevant and manipulation-sensitive. That is, within an image, they are consistent across patches unless manipulation occurs. Then, sparse and discrete interactions among image patches are sufficient for extracting non-semantic features. However, image semantics vary drastically on different patches, requiring dense and continuous interactions among image patches for learning semantic representations. Hence, in this paper, we propose a Sparse Vision Transformer (SparseViT), which reformulates the dense, global self-attention in ViT into a sparse, discrete manner. Such sparse self-attention breaks image semantics and forces SparseViT to adaptively extract non-semantic features for images. Besides, compared with existing IML models, the sparse self-attention mechanism largely reduces the model size (max 80% in FLOPs), achieving stunning parameter efficiency and computation reduction. Extensive experiments demonstrate that, without any handcrafted feature extractors, SparseViT is superior in both generalization and efficiency across benchmark datasets.||
|**2024-12-18**|[ViTmiX: Vision Transformer Explainability Augmented by Mixed Visualization Methods](http://arxiv.org/abs/2412.14231)|null|Recent advancements in Vision Transformers (ViT) have demonstrated exceptional results in various visual recognition tasks, owing to their ability to capture long-range dependencies in images through self-attention mechanisms. However, the complex nature of ViT models requires robust explainability methods to unveil their decision-making processes. Explainable Artificial Intelligence (XAI) plays a crucial role in improving model transparency and trustworthiness by providing insights into model predictions. Current approaches to ViT explainability, based on visualization techniques such as Layer-wise Relevance Propagation (LRP) and gradient-based methods, have shown promising but sometimes limited results. In this study, we explore a hybrid approach that mixes multiple explainability techniques to overcome these limitations and enhance the interpretability of ViT models. Our experiments reveal that this hybrid approach significantly improves the interpretability of ViT models compared to individual methods. We also introduce modifications to existing techniques, such as using geometric mean for mixing, which demonstrates notable results in object segmentation tasks. To quantify the explainability gain, we introduced a novel post-hoc explainability measure by applying the Pigeonhole principle. These findings underscore the importance of refining and optimizing explainability methods for ViT models, paving the way to reliable XAI-based segmentations.||
|**2024-12-18**|[Self-attentive Transformer for Fast and Accurate Postprocessing of Temperature and Wind Speed Forecasts](http://arxiv.org/abs/2412.13957)|**[link](https://github.com/uantwerpm4s/pp_eupp)**|Current postprocessing techniques often require separate models for each lead time and disregard possible inter-ensemble relationships by either correcting each member separately or by employing distributional approaches. In this work, we tackle these shortcomings with an innovative, fast and accurate Transformer which postprocesses each ensemble member individually while allowing information exchange across variables, spatial dimensions and lead times by means of multi-headed self-attention. Weather forecasts are postprocessed over 20 lead times simultaneously while including up to twelve meteorological predictors. We use the EUPPBench dataset for training which contains ensemble predictions from the European Center for Medium-range Weather Forecasts' integrated forecasting system alongside corresponding observations. The work presented here is the first to postprocess the ten and one hundred-meter wind speed forecasts within this benchmark dataset, while also correcting the two-meter temperature. Our approach significantly improves the original forecasts, as measured by the CRPS, with 17.5% for two-meter temperature, nearly 5% for ten-meter wind speed and 5.3% for one hundred-meter wind speed, outperforming a classical member-by-member approach employed as competitive benchmark. Furthermore, being up to 75 times faster, it fulfills the demand for rapid operational weather forecasts in various downstream applications, including renewable energy forecasting.||
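|**2024-12-17**|[Identification of Epileptic Spasms (ESES) Phases Using EEG Signals: A Vision Transformer Approach](http://arxiv.org/abs/2412.13028)|null|This work introduces a new method for detecting epileptic spasms (ESES) from electroencephalogram (EEG) signals using a Vision Transformer (ViT). Traditional ESES detection methods typically rely on manual processing or conventional algorithms and suffer from small sample sizes, single-channel analysis, and poor generalization. In contrast, the proposed ViT model overcomes these limitations by using attention mechanisms to focus on important features in multi-channel EEG data, which helps improve accuracy and efficiency. The model processes frequency-domain representations of EEG signals, such as spectrograms, as image data to capture long-term dependencies and complex patterns in the signal. The model achieves high performance, reaching 97% accuracy without extensive data preprocessing, which makes it suitable for large-scale real-time clinical applications. The method represents a significant advance in the detection and analysis of neurological disorders such as ESES.||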
|**2024-12-17**|[Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning](http://arxiv.org/abs/2412.12953)|null|Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models are becoming larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Therefore, continuing with the current architectures will present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing both active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.||
|**2024-12-17**|[CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image](http://arxiv.org/abs/2412.12906)|null|Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. However, unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from a single image. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under single-view settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.||
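|**2024-12-17**|[ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers](http://arxiv.org/abs/2412.12571)|**[link](https://github.com/ali-vilab/chatdit)**|Recent research arXiv:2410.15027 arXiv:2410.23775 has highlighted the inherent in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to adapt seamlessly to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self-attention tokens across multiple input and target images, combined with grouped and masked generation pipelines. Building on this foundation, we present ChatDiT, a zero-shot, general-purpose, and interactive visual generation framework that leverages pretrained diffusion transformers in their original form, requiring no additional tuning, adapters, or modifications. Users can interact with ChatDiT to create interleaved text-image articles, multi-page picture books, edit images, design IP derivatives, or develop character design settings, all through free-form natural language across one or more conversational rounds. At its core, ChatDiT employs a multi-agent system comprising three key components: an instruction-parsing agent that interprets user-uploaded images and instructions, a strategy-planning agent that devises single-step or multi-step generation actions, and an execution agent that performs these actions using an in-context toolkit of diffusion transformers. We thoroughly evaluate ChatDiT on IDEA-Bench arXiv:2412.11767, a benchmark comprising 100 real-world design tasks and 275 cases with diverse instructions and varying numbers of input and target images. Despite its simple, training-free approach, ChatDiT surpasses all competitors, including those specifically designed and trained on extensive multi-task datasets. We further identify key limitations of pretrained DiTs in zero-shot adaptation to tasks. We release all code, agents, results, and intermediate outputs at https://github.com/ali-vilab/ChatDiT to facilitate further research.||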
|**2024-12-17**|[Enhanced Momentum with Momentum Transformers](http://arxiv.org/abs/2412.12516)|null|The primary objective of this research is to build a Momentum Transformer that is expected to outperform benchmark time-series momentum and mean-reversion trading strategies. We extend the ideas introduced in the paper Trading with the Momentum Transformer: An Intelligent and Interpretable Architecture to equities, as the original paper primarily builds upon futures and equity indices. Unlike conventional Long Short-Term Memory (LSTM) models, which operate sequentially and are optimized for processing local patterns, an attention mechanism equips our architecture with direct access to all prior time steps in the training window. This hybrid design, combining attention with an LSTM, enables the model to capture long-term dependencies, enhance performance in scenarios accounting for transaction costs, and seamlessly adapt to evolving market conditions, such as those witnessed during the COVID-19 pandemic. We average 4.14% returns, which is similar to the original paper's results. Our Sharpe ratio is lower, at an average of 1.12, due to much higher volatility, which may be due to stocks being inherently more volatile than futures and indices.||
|**2024-12-17**|[Core Context Aware Attention for Long Context Language Modeling](http://arxiv.org/abs/2412.12465)|null|Transformer-based Large Language Models (LLMs) have exhibited remarkable success in various natural language processing tasks primarily attributed to self-attention mechanism, which requires a token to consider all preceding tokens as its context to compute the attention score. However, when the context length L becomes very large (e.g., 32K), more redundant context information will be included w.r.t. any tokens, making the self-attention suffer from two main limitations: 1) The computational and memory complexity scales quadratically w.r.t. L; 2) The presence of redundant context information may hamper the model to capture dependencies among crucial tokens, which may degrade the representation performance. In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling, which consists of two components: 1) Globality-pooling attention that divides input tokens into groups and then dynamically merges tokens within each group into one core token based on their significance; 2) Locality-preserved attention that incorporates neighboring tokens into the attention calculation. The two complementary attentions will then be fused to the final attention, maintaining comprehensive modeling ability as the full self-attention. In this way, the core context information w.r.t. a given token will be automatically focused and strengthened, while the context information in redundant groups will be diminished during the learning process. As a result, the computational and memory complexity will be significantly reduced. More importantly, the CCA-Attention can improve the long-context modeling ability by diminishing the redundant context information. Extensive experimental results demonstrate that our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.||
|**2024-12-16**|[Efficient Scaling of Diffusion Transformers for Text-to-Image Generation](http://arxiv.org/abs/2412.12391)|null|We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters on datasets up to 600M images. We find that U-ViT, a pure self-attention based DiT model, provides a simpler design and scales more effectively in comparison with cross-attention based DiT variants, which allows straightforward expansion for extra conditions and other modalities. We find that a 2.3B U-ViT model can achieve better performance than SDXL UNet and other DiT variants in a controlled setting. On the data scaling side, we investigate how increasing dataset size and enhanced long captions improve the text-image alignment performance and the learning efficiency.||
|**2024-12-16**|[EDformer: Embedded Decomposition Transformer for Interpretable Multivariate Time Series Predictions](http://arxiv.org/abs/2412.12227)|null|Time series forecasting is a crucial challenge with significant applications in areas such as weather prediction, stock market analysis, and scientific simulations. This paper introduces an embedded decomposed transformer, 'EDformer', for multivariate time series forecasting tasks. Without altering the fundamental elements, we reuse the Transformer architecture and consider the capable functions of its constituent parts in this work. EDformer first decomposes the input multivariate signal into seasonal and trend components. Next, the prominent multivariate seasonal component is reconstructed across the reverse dimensions, followed by applying the attention mechanism and feed-forward network in the encoder stage. In particular, the feed-forward network is used for each variable frame to learn nonlinear representations, while the attention mechanism uses the time points of individual seasonal series embedded within variate frames to capture multivariate correlations. Therefore, the trend signal is added with projection and performs the final forecasting. The EDformer model obtains state-of-the-art predicting results in terms of accuracy and efficiency on complex real-world time series datasets. This paper also addresses model explainability techniques to provide insights into how the model makes its predictions and why specific features or time steps are important, enhancing the interpretability and trustworthiness of the forecasting results.||
|**2024-12-16**|[Transformers Use Causal World Models in Maze-Solving Tasks](http://arxiv.org/abs/2412.11867)|null|Recent studies in interpretability have explored the inner workings of transformer models trained on tasks across various domains, often discovering that these networks naturally develop surprisingly structured representations. When such representations comprehensively reflect the task domain's structure, they are commonly referred to as "World Models" (WMs). In this work, we discover such WMs in transformers trained on maze tasks. In particular, by employing Sparse Autoencoders (SAEs) and analysing attention patterns, we examine the construction of WMs and demonstrate consistency between the circuit analysis and the SAE feature-based analysis. We intervene upon the isolated features to confirm their causal role and, in doing so, find asymmetries between certain types of interventions. Surprisingly, we find that models are able to reason with respect to a greater number of active features than they see during training, even if attempting to specify these in the input token sequence would lead the model to fail. Furthermore, we observe that varying positional encodings can alter how WMs are encoded in a model's residual stream. By analyzing the causal role of these WMs in a toy domain we hope to make progress toward an understanding of emergent structure in the representations acquired by Transformers, leading to the development of more interpretable and controllable AI systems.||
|**2024-12-16**|[UnMA-CapSumT: Unified and Multi-Head Attention-driven Caption Summarization Transformer](http://arxiv.org/abs/2412.11836)|null|Image captioning is the generation of natural language descriptions of images which have increased immense popularity in the recent past. With this different deep-learning techniques are devised for the development of factual and stylized image captioning models. Previous models focused more on the generation of factual and stylized captions separately providing more than one caption for a single image. The descriptions generated from these suffer from out-of-vocabulary and repetition issues. To the best of our knowledge, no such work exists that provided a description that integrates different captioning methods to describe the contents of an image with factual and stylized (romantic and humorous) elements. To overcome these limitations, this paper presents a novel Unified Attention and Multi-Head Attention-driven Caption Summarization Transformer (UnMA-CapSumT) based Captioning Framework. It utilizes both factual captions and stylized captions generated by the Modified Adaptive Attention-based factual image captioning model (MAA-FIC) and Style Factored Bi-LSTM with attention (SF-Bi-ALSTM) driven stylized image captioning model respectively. SF-Bi-ALSTM-based stylized IC model generates two prominent styles of expression- {romance, and humor}. The proposed summarizer UnMHA-ST combines both factual and stylized descriptions of an input image to generate styled rich coherent summarized captions. The proposed UnMHA-ST transformer learns and summarizes different linguistic styles efficiently by incorporating proposed word embedding fastText with Attention Word Embedding (fTA-WE) and pointer-generator network with coverage mechanism concept to solve the out-of-vocabulary issues and repetition problem. Extensive experiments are conducted on Flickr8K and a subset of FlickrStyle10K with supporting ablation studies to prove the efficiency and efficacy of the proposed framework.||
|**2024-12-13**|[Ultra-High Resolution Segmentation via Boundary-Enhanced Patch-Merging Transformer](http://arxiv.org/abs/2412.10181)|null|Segmentation of ultra-high resolution (UHR) images is a critical task with numerous applications, yet it poses significant challenges due to high spatial resolution and rich fine details. Recent approaches adopt a dual-branch architecture, where a global branch learns long-range contextual information and a local branch captures fine details. However, they struggle to handle the conflict between global and local information while adding significant extra computational cost. Inspired by the human visual system's ability to rapidly orient attention to important areas with fine details and filter out irrelevant information, we propose a novel UHR segmentation method called Boundary-enhanced Patch-merging Transformer (BPT). BPT consists of two key components: (1) Patch-Merging Transformer (PMT) for dynamically allocating tokens to informative regions to acquire global and local representations, and (2) Boundary-Enhanced Module (BEM) that leverages boundary information to enrich fine details. Extensive experiments on multiple UHR image segmentation benchmarks demonstrate that our BPT outperforms previous state-of-the-art methods without introducing extra computational overhead. Codes will be released to facilitate research.||
|**2024-12-13**|[Mr. DETR: Instructive Multi-Route Training for Detection Transformers](http://arxiv.org/abs/2412.10028)|null|Existing methods enhance the training of detection transformers by incorporating an auxiliary one-to-many assignment. In this work, we treat the model as a multi-task framework, simultaneously performing one-to-one and one-to-many predictions. We investigate the roles of each component in the transformer decoder across these two training targets, including self-attention, cross-attention, and feed-forward network. Our empirical results demonstrate that any independent component in the decoder can effectively learn both targets simultaneously, even when other components are shared. This finding leads us to propose a multi-route training mechanism, featuring a primary route for one-to-one prediction and two auxiliary training routes for one-to-many prediction. We enhance the training mechanism with a novel instructive self-attention that dynamically and flexibly guides object queries for one-to-many prediction. The auxiliary routes are removed during inference, ensuring no impact on model architecture or inference cost. We conduct extensive experiments on various baselines, achieving consistent improvements as shown in Figure 1.||
|**2024-12-13**|[Efficient Large-Scale Traffic Forecasting with Transformers: A Spatial Data Management Perspective](http://arxiv.org/abs/2412.09972)|**[link](https://github.com/lmissher/patchstg)**|Road traffic forecasting is crucial in real-world intelligent transportation scenarios like traffic dispatching and path planning in city management and personal traveling. Spatio-temporal graph neural networks (STGNNs) stand out as the mainstream solution in this task. Nevertheless, the quadratic complexity of remarkable dynamic spatial modeling-based STGNNs has become the bottleneck over large-scale traffic data. From the spatial data management perspective, we present a novel Transformer framework called PatchSTG to efficiently and dynamically model spatial dependencies for large-scale traffic forecasting with interpretability and fidelity. Specifically, we design a novel irregular spatial patching to reduce the number of points involved in the dynamic calculation of Transformer. The irregular spatial patching first utilizes the leaf K-dimensional tree (KDTree) to recursively partition irregularly distributed traffic points into leaf nodes with a small capacity, and then merges leaf nodes belonging to the same subtree into occupancy-equaled and non-overlapped patches through padding and backtracking. Based on the patched data, depth and breadth attention are used interchangeably in the encoder to dynamically learn local and global spatial knowledge from points in a patch and points with the same index of patches. Experimental results on four real world large-scale traffic datasets show that our PatchSTG achieves train speed and memory utilization improvements up to $10\times$ and $4\times$ with the state-of-the-art performance.||
|**2024-12-13**|[Simulating Hard Attention Using Soft Attention](http://arxiv.org/abs/2412.09925)|null|We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several variants of linear temporal logic, whose formulas have previously been shown to be computable using hard attention transformers. We demonstrate how soft attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate a large subclass of average-hard attention transformers, those that have what we call the uniform-tieless property.||
|**2024-12-13**|[CSL-L2M: Controllable Song-Level Lyric-to-Melody Generation Based on Conditional Transformer with Fine-Grained Lyric and Musical Controls](http://arxiv.org/abs/2412.09887)|null|Lyric-to-melody generation is a highly challenging task in the field of AI music generation. Due to the difficulty of learning strict yet weak correlations between lyrics and melodies, previous methods have suffered from weak controllability, low-quality and poorly structured generation. To address these challenges, we propose CSL-L2M, a controllable song-level lyric-to-melody generation method based on an in-attention Transformer decoder with fine-grained lyric and musical controls, which is able to generate full-song melodies matched with the given lyrics and user-specified musical attributes. Specifically, we first introduce REMI-Aligned, a novel music representation that incorporates strict syllable- and sentence-level alignments between lyrics and melodies, facilitating precise alignment modeling. Subsequently, sentence-level semantic lyric embeddings independently extracted from a sentence-wise Transformer encoder are combined with word-level part-of-speech embeddings and syllable-level tone embeddings as fine-grained controls to enhance the controllability of lyrics over melody generation. Then we introduce human-labeled musical tags, sentence-level statistical musical attributes, and learned musical features extracted from a pre-trained VQ-VAE as coarse-grained, fine-grained and high-fidelity controls, respectively, to the generation process, thereby enabling user control over melody generation. Finally, an in-attention Transformer decoder technique is leveraged to exert fine-grained control over the full-song melody generation with the aforementioned lyric and musical conditions. Experimental results demonstrate that our proposed CSL-L2M outperforms the state-of-the-art models, generating melodies with higher quality, better controllability and enhanced structure. Demos and source code are available at https://lichaiustc.github.io/CSL-L2M/.||
|**2024-12-13**|[MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion](http://arxiv.org/abs/2412.09828)|null|Diffusion transformers enable flexible generative modeling for video. However, it is still technically challenging and computationally expensive to generate high-resolution videos with rich semantics and complex motion. Similar to languages, video data are also auto-regressive by nature, so it is counter-intuitive to use attention mechanism with bi-directional dependency in the model. Here we propose a Multi-Scale Causal (MSC) framework to address these problems. Specifically, we introduce multiple resolutions in the spatial dimension and high-low frequencies in the temporal dimension to realize efficient attention calculation. Furthermore, attention blocks on multiple scales are combined in a controlled way to allow causal conditioning on noisy image frames for diffusion training, based on the idea that noise destroys information at different rates on different resolutions. We theoretically show that our approach can greatly reduce the computational complexity and enhance the efficiency of training. The causal attention diffusion framework can also be used for auto-regressive long video generation, without violating the natural order of frame sequences.||
|**2024-12-13**|[Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism](http://arxiv.org/abs/2412.09822)|null|Video try-on stands as a promising area for its tremendous real-world potential. Previous research on video try-on has primarily focused on transferring product clothing images to videos with simple human poses, while performing poorly with complex movements. To better preserve clothing details, those approaches are armed with an additional garment encoder, resulting in higher computational resource consumption. The primary challenges in this domain are twofold: (1) leveraging the garment encoder's capabilities in video try-on while lowering computational requirements; (2) ensuring temporal consistency in the synthesis of human body parts, especially during rapid movements. To tackle these issues, we propose a novel video try-on framework based on Diffusion Transformer(DiT), named Dynamic Try-On. To reduce computational overhead, we adopt a straightforward approach by utilizing the DiT backbone itself as the garment encoder and employing a dynamic feature fusion module to store and integrate garment features. To ensure temporal consistency of human body parts, we introduce a limb-aware dynamic attention module that enforces the DiT backbone to focus on the regions of human limbs during the denoising process. Extensive experiments demonstrate the superiority of Dynamic Try-On in generating stable and smooth try-on results, even for videos featuring complicated human postures.||
|**2024-12-12**|[SEGT: A General Spatial Expansion Group Transformer for nuScenes Lidar-based Object Detection Task](http://arxiv.org/abs/2412.09658)|null|In the technical report, we present a novel transformer-based framework for nuScenes lidar-based object detection task, termed Spatial Expansion Group Transformer (SEGT). To efficiently handle the irregular and sparse nature of point cloud, we propose migrating the voxels into distinct specialized ordered fields with the general spatial expansion strategies, and employ group attention mechanisms to extract the exclusive feature maps within each field. Subsequently, we integrate the feature representations across different ordered fields by alternately applying diverse expansion strategies, thereby enhancing the model's ability to capture comprehensive spatial information. The method was evaluated on the nuScenes lidar-based object detection test dataset, achieving an NDS score of 73.5 without Test-Time Augmentation (TTA) and 74.2 with TTA, demonstrating the effectiveness of the proposed method.||
|**2024-12-12**|[Audios Don't Lie: Multi-Frequency Channel Attention Mechanism for Audio Deepfake Detection](http://arxiv.org/abs/2412.09467)|null|With the rapid development of artificial intelligence technology, the application of deepfake technology in the audio field has gradually increased, resulting in a wide range of security risks. Especially in the financial and social security fields, the misuse of deepfake audios has raised serious concerns. To address this challenge, this study proposes an audio deepfake detection method based on multi-frequency channel attention mechanism (MFCA) and 2D discrete cosine transform (DCT). By processing the audio signal into a mel-spectrogram, using MobileNet V2 to extract deep features, and combining it with the MFCA module to weight different frequency channels in the audio signal, this method can effectively capture the fine-grained frequency domain features in the audio signal and enhance the classification capability of fake audios. Experimental results show that compared with traditional methods, the model proposed in this study shows significant advantages in accuracy, precision, recall, F1 score and other indicators. Especially in complex audio scenarios, this method shows stronger robustness and generalization capabilities, provides a new idea for audio deepfake detection, and has important practical application value. In the future, more advanced audio detection technologies and optimization strategies will be explored to further improve the accuracy and generalization capabilities of audio deepfake detection.||
|**2024-12-12**|[RingFormer: A Ring-Enhanced Graph Transformer for Organic Solar Cell Property Prediction](http://arxiv.org/abs/2412.09030)|**[link](https://github.com/TommyDzh/RingFormer)**|Organic Solar Cells (OSCs) are a promising technology for sustainable energy production. However, the identification of molecules with desired OSC properties typically involves laborious experimental research. To accelerate progress in the field, it is crucial to develop machine learning models capable of accurately predicting the properties of OSC molecules. While graph representation learning has demonstrated success in molecular property prediction, it remains underexplored for OSC-specific tasks. Existing methods fail to capture the unique structural features of OSC molecules, particularly the intricate ring systems that critically influence OSC properties, leading to suboptimal performance. To fill the gap, we present RingFormer, a novel graph transformer framework specially designed to capture both atom and ring level structural patterns in OSC molecules. RingFormer constructs a hierarchical graph that integrates atomic and ring structures and employs a combination of local message passing and global attention mechanisms to generate expressive graph representations for accurate OSC property prediction. We evaluate RingFormer's effectiveness on five curated OSC molecule datasets through extensive experiments. The results demonstrate that RingFormer consistently outperforms existing methods, achieving a 22.77% relative improvement over the nearest competitor on the CEPDB dataset.||
|**2024-12-12**|[STEAM: Squeeze and Transform Enhanced Attention Module](http://arxiv.org/abs/2412.09023)|null|Channel and spatial attention mechanisms introduced by earlier works enhance the representation abilities of deep convolutional neural networks (CNNs) but often lead to increased parameter and computation costs. While recent approaches focus solely on efficient feature context modeling for channel attention, we aim to model both channel and spatial attention comprehensively with minimal parameters and reduced computation. Leveraging the principles of relational modeling in graphs, we introduce a constant-parameter module, STEAM: Squeeze and Transform Enhanced Attention Module, which integrates channel and spatial attention to enhance the representation power of CNNs. To our knowledge, we are the first to propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers. Additionally, we introduce Output Guided Pooling (OGP), which efficiently captures spatial context to further enhance spatial attention. We extensively evaluate STEAM for large-scale image classification, object detection and instance segmentation on standard benchmark datasets. STEAM achieves a 2% increase in accuracy over the standard ResNet-50 model with only a meager increase in GFLOPs. Furthermore, STEAM outperforms leading modules ECA and GCT in terms of accuracy while achieving a three-fold reduction in GFLOPs.||
|**2024-12-12**|[A physics-informed transformer neural operator for learning generalized solutions of initial boundary value problems](http://arxiv.org/abs/2412.09009)|**[link](https://github.com/quest-lab-iisc/pinto)**|Initial boundary value problems arise commonly in applications with engineering and natural systems governed by nonlinear partial differential equations (PDEs). Operator learning is an emerging field for solving these equations by using a neural network to learn a map between infinite dimensional input and output function spaces. These neural operators are trained using a combination of data (observations or simulations) and PDE-residuals (physics-loss). A major drawback of existing neural approaches is the requirement to retrain with new initial/boundary conditions, and the necessity for a large amount of simulation data for training. We develop a physics-informed transformer neural operator (named PINTO) that efficiently generalizes to unseen initial and boundary conditions, trained in a simulation-free setting using only physics loss. The main innovation lies in our new iterative kernel integral operator units, implemented using cross-attention, to transform the PDE solution's domain points into an initial/boundary condition-aware representation vector, enabling efficient learning of the solution function for new scenarios. The PINTO architecture is applied to simulate the solutions of important equations used in engineering applications: advection, Burgers, and steady and unsteady Navier-Stokes equations (three flow scenarios). For these five test cases, we show that the relative errors during testing under challenging conditions of unseen initial/boundary conditions are only one-fifth to one-third of other leading physics informed operator learning methods. Moreover, our PINTO model is able to accurately solve the advection and Burgers equations at time steps that are not included in the training collocation points. The code is available at https://github.com/quest-lab-iisc/PINTO||
|**2024-12-12**|[Towards modeling evolving longitudinal health trajectories with a transformer-based deep learning model](http://arxiv.org/abs/2412.08873)|null|Health registers contain rich information about individuals' health histories. Here our interest lies in understanding how individuals' health trajectories evolve in a nationwide longitudinal dataset with coded features, such as clinical codes, procedures, and drug purchases. We introduce a straightforward approach for training a Transformer-based deep learning model in a way that lets us analyze how individuals' trajectories change over time. This is achieved by modifying the training objective and by applying a causal attention mask. We focus here on a general task of predicting the onset of a range of common diseases in a given future forecast interval. However, instead of providing a single prediction about diagnoses that could occur in this forecast interval, our approach enables the model to provide continuous predictions at every time point up until, and conditioned on, the time of the forecast period. We find that this model performs comparably to other models, including a bi-directional transformer model, in terms of basic prediction performance while at the same time offering promising trajectory modeling properties. We explore a couple of ways to use this model for analyzing health trajectories and aiding in early detection of events that forecast possible later disease onsets. We hypothesize that this method may be helpful in continuous monitoring of people's health trajectories and enabling interventions in ongoing health trajectories, as well as being useful in retrospective analyses.||
|**2024-12-12**|[HadaCore: Tensor Core Accelerated Hadamard Transform Kernel](http://arxiv.org/abs/2412.08832)|**[link](https://github.com/pytorch-labs/applied-ai)**|We present HadaCore, a modified Fast Walsh-Hadamard Transform (FWHT) algorithm optimized for the Tensor Cores present in modern GPU hardware. HadaCore follows the recursive structure of the original FWHT algorithm, achieving the same asymptotic runtime complexity but leveraging a hardware-aware work decomposition that benefits from Tensor Core acceleration. This reduces bottlenecks from compute and data exchange. On Nvidia A100 and H100 GPUs, HadaCore achieves speedups of 1.1-1.4x and 1.0-1.3x, with a peak gain of 3.5x and 3.6x respectively, when compared to the existing state-of-the-art implementation of the original algorithm. We also show that when using FP16 or BF16, our implementation is numerically accurate, enabling comparable accuracy on MMLU benchmarks when used in an end-to-end Llama3 inference run with quantized (FP8) attention.||
|**2024-12-11**|[From MLP to NeoMLP: Leveraging Self-Attention for Neural Fields](http://arxiv.org/abs/2412.08731)|**[link](https://github.com/mkofinas/neomlp)**|Neural fields (NeFs) have recently emerged as a state-of-the-art method for encoding spatio-temporal signals of various modalities. Despite the success of NeFs in reconstructing individual signals, their use as representations in downstream tasks, such as classification or segmentation, is hindered by the complexity of the parameter space and its underlying symmetries, in addition to the lack of powerful and scalable conditioning mechanisms. In this work, we draw inspiration from the principles of connectionism to design a new architecture based on MLPs, which we term NeoMLP. We start from an MLP, viewed as a graph, and transform it from a multi-partite graph to a complete graph of input, hidden, and output nodes, equipped with high-dimensional features. We perform message passing on this graph and employ weight-sharing via self-attention among all the nodes. NeoMLP has a built-in mechanism for conditioning through the hidden and output nodes, which function as a set of latent codes, and as such, NeoMLP can be used straightforwardly as a conditional neural field. We demonstrate the effectiveness of our method by fitting high-resolution signals, including multi-modal audio-visual data. Furthermore, we fit datasets of neural representations, by learning instance-specific sets of latent codes using a single backbone architecture, and then use them for downstream tasks, outperforming recent state-of-the-art methods. The source code is open-sourced at https://github.com/mkofinas/neomlp.||
|**2024-12-11**|[On-demand Quick Metasurface Design with Neighborhood Attention Transformer](http://arxiv.org/abs/2412.08405)|null|Metasurfaces are reshaping traditional optical paradigms and are increasingly required in complex applications that demand substantial computational resources to numerically solve Maxwell's equations, particularly for large-scale systems, inhomogeneous media, and densely packed metadevices. Conventional forward design using electromagnetic solvers is based on specific approximations, which may not effectively address complex problems. In contrast, existing inverse design methods are a stepwise process that is often time-consuming and involves repetitive computations. Here, we present an inverse design approach utilizing a surrogate Neighborhood Attention Transformer, MetaE-former, to predict the performance of metasurfaces with ultrafast speed and high accuracy. This method achieves global solutions for hundreds of nanostructures simultaneously, providing up to a 250,000-fold speedup compared with solving for individual meta-atoms based on the FDTD method. As examples, we demonstrate a binarized high-numerical-aperture (about 1.31) metalens and several optimized structured-light meta-generators. Our method significantly improves the beam shaping adaptability with metasurfaces and paves the way for fast designing of large-scale metadevices for shaping extreme light fields with high accuracy.||
|**2024-12-10**|[Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences](http://arxiv.org/abs/2412.07894)|null|To optimize large Transformer model training, efficient parallel computing and advanced data management are essential. However, current methods often assume a stable and uniform training workload, neglecting imbalances in data sampling and packing that can impede performance. Specifically, data sampling imbalance arises from uneven sequence length distribution of the training data, while data packing imbalance stems from the discrepancy between the linear memory complexity and quadratic time complexity of the attention mechanism. To address these imbalance issues, we develop Hydraulis, which jointly optimizes the parallel strategies and data assignment. For one thing, we introduce large model training with dynamic heterogeneous parallel strategies in response to the sequence length variations within and across training iterations. For another, we devise a two-stage data assignment approach, which strikes a good balance in terms of the training workloads both within and across model replicas. Empirical results demonstrate that Hydraulis outperforms existing systems by 1.32-2.66 times.||
|**2024-12-10**|[Video Motion Transfer with Diffusion Transformers](http://arxiv.org/abs/2412.07776)|**[link](https://github.com/ditflow/ditflow)**|We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We guide the latent denoising process in an optimization-based, training-free, manner by optimizing latents with our AMF loss to generate videos reproducing the motion of the reference one. We also apply our optimization strategy to transformer positional embeddings, granting us a boost in zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods, outperforming all across multiple metrics and human evaluation.||
|**2024-12-10**|[ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer](http://arxiv.org/abs/2412.07720)|**[link](https://github.com/thunlp/acdit)**|The recent surge of interest in comprehensive multimodal models has necessitated the unification of diverse modalities. However, the unification suffers from disparate methodologies. Continuous visual generation necessitates the full-sequence diffusion-based approach, despite its divergence from the autoregressive modeling in the text domain. We posit that autoregressive modeling, i.e., predicting the future based on past deterministic experience, remains crucial in developing both a visual generation model and a potential unified multimodal model. In this paper, we explore an interpolation between the autoregressive modeling and full-parameters diffusion to model visual information. At its core, we present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer, where the block size of diffusion, i.e., the size of autoregressive units, can be flexibly adjusted to interpolate between token-wise autoregression and full-sequence diffusion. ACDiT is easy to implement, as simple as creating a Skip-Causal Attention Mask (SCAM) during training. During inference, the process iterates between diffusion denoising and autoregressive decoding that can make full use of KV-Cache. We verify the effectiveness of ACDiT on image and video generation tasks. We also demonstrate that, benefiting from autoregressive modeling, ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective. The analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT to be used in long-horizon visual generation tasks. These strengths make it promising as the backbone of future unified models.||
|**2024-12-10**|[Scaling Sequential Recommendation Models with Transformers](http://arxiv.org/abs/2412.07585)|**[link](https://github.com/mercadolibre/srt)**|Modeling user preferences has been mainly addressed by looking at users' interaction history with the different elements available in the system. Tailoring content to individual preferences based on historical data is the main goal of sequential recommendation. The nature of the problem, as well as the good performance observed across various domains, has motivated the use of the transformer architecture, which has proven effective in leveraging increasingly larger amounts of training data when accompanied by an increase in the number of model parameters. This scaling behavior has brought a great deal of attention, as it provides valuable guidance in the design and training of even larger models. Taking inspiration from the scaling laws observed in training large language models, we explore similar principles for sequential recommendation. We use the full Amazon Product Data dataset, which has only been partially explored in other studies, and reveal scaling behaviors similar to those found in language models. Compute-optimal training is possible but requires a careful analysis of the compute-performance trade-offs specific to the application. We also show that performance scaling translates to downstream tasks by fine-tuning larger pre-trained models on smaller task-specific domains. Our approach and findings provide a strategic roadmap for model training and deployment in real high-dimensional preference spaces, facilitating better training and inference efficiency. We hope this paper bridges the gap between the potential of transformers and the intrinsic complexities of high-dimensional sequential recommendation in real-world recommender systems. Code and models can be found at https://github.com/mercadolibre/srt||
|**2024-12-10**|[Comateformer: Combined Attention Transformer for Semantic Sentence Matching](http://arxiv.org/abs/2412.07220)|null|Transformer-based models have made significant strides in semantic matching tasks by capturing connections between phrase pairs. However, to assess the relevance of sentence pairs, it is insufficient to just examine the general similarity between the sentences. It is crucial to also consider the tiny subtleties that differentiate them from each other. Regrettably, attention softmax operations in transformers tend to miss these subtle differences. To this end, in this work, we propose a novel semantic sentence matching model named Combined Attention Network based on the Transformer model (Comateformer). In the Comateformer model, we design a novel transformer-based quasi-attention mechanism with compositional properties. Unlike traditional attention mechanisms that merely adjust the weights of input tokens, our proposed method learns how to combine, subtract, or resize specific vectors when building a representation. Moreover, our proposed approach builds on the intuition of similarity and dissimilarity (negative affinity) when calculating dual affinity scores. This allows for a more meaningful representation of relationships between sentences. To evaluate the performance of our proposed model, we conducted extensive experiments on ten public real-world datasets and robustness testing. Experimental results show that our method achieves consistent improvements.||
|**2024-12-09**|[Static Key Attention in Vision](http://arxiv.org/abs/2412.07049)|null|The success of vision transformers is widely attributed to the expressive power of their dynamically parameterized multi-head self-attention mechanism. We examine the impact of substituting the dynamic parameterized key with a static key within the standard attention mechanism in Vision Transformers. Our findings reveal that static key attention mechanisms can match or even exceed the performance of standard self-attention. Integrating static key attention modules into a Metaformer backbone, we find that it serves as a better intermediate stage in hierarchical hybrid architectures, balancing the strengths of depth-wise convolution and self-attention. Experiments on several vision tasks underscore the effectiveness of the static key mechanism, indicating that the typical two-step dynamic parameterization in attention can be streamlined to a single step without impacting performance under certain circumstances.||
|**2024-12-09**|[SphereUFormer: A U-Shaped Transformer for Spherical 360 Perception](http://arxiv.org/abs/2412.06968)|null|This paper proposes a novel method for omnidirectional 360° perception. Most common previous methods relied on equirectangular projection. This representation is easily applicable to 2D operation layers but introduces distortions into the image. Other methods attempted to remove the distortions by maintaining a sphere representation but relied on complicated convolution kernels that failed to show competitive results. In this work, we introduce a transformer-based architecture that, by incorporating a novel "Spherical Local Self-Attention" and other spherically-oriented modules, successfully operates in the spherical domain and outperforms the state-of-the-art in 360° perception benchmarks for depth estimation and semantic segmentation.||
|**2024-12-09**|[Bridging the Divide: Reconsidering Softmax and Linear Attention](http://arxiv.org/abs/2412.06590)|**[link](https://github.com/leaplabthu/inline)**|Widely adopted in modern Vision Transformer designs, Softmax attention can effectively capture long-range visual information; however, it incurs excessive computational cost when dealing with high-resolution inputs. In contrast, linear attention naturally enjoys linear complexity and has great potential to scale up to higher-resolution images. Nonetheless, the unsatisfactory performance of linear attention greatly limits its practical application in various scenarios. In this paper, we take a step forward to close the gap between the linear and Softmax attention with novel theoretical analyses, which demystify the core factors behind the performance deviations. Specifically, we present two key perspectives to understand and alleviate the limitations of linear attention: the injective property and the local modeling ability. Firstly, we prove that linear attention is not injective, which is prone to assign identical attention weights to different query vectors, thus adding to severe semantic confusion since different queries correspond to the same outputs. Secondly, we confirm that effective local modeling is essential for the success of Softmax attention, in which linear attention falls short. The aforementioned two fundamental differences significantly contribute to the disparities between these two attention paradigms, which is demonstrated by our substantial empirical validation in the paper. In addition, more experiment results indicate that linear attention, as long as endowed with these two properties, can outperform Softmax attention across various tasks while maintaining lower computation complexity. Code is available at https://github.com/LeapLabTHU/InLine.||
|**2024-12-09**|[Understanding Factual Recall in Transformers via Associative Memories](http://arxiv.org/abs/2412.06538)|null|Large language models have demonstrated an impressive ability to perform factual recall. Prior work has found that transformers trained on factual recall tasks can store information at a rate proportional to their parameter count. In our work, we show that shallow transformers can use a combination of associative memories to obtain such near optimal storage capacity. We begin by proving that the storage capacities of both linear and MLP associative memories scale linearly with parameter count. We next introduce a synthetic factual recall task, and prove that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on the task whenever either the total number of self-attention parameters or MLP parameters scales (up to log factors) linearly with the number of facts. In particular, the transformer can trade off between using the value matrices or the MLP as an associative memory to store the dataset of facts. We complement these expressivity results with an analysis of the gradient flow trajectory of a simplified linear attention model trained on our factual recall task, where we show that the model exhibits sequential learning behavior.||
|**2024-12-09**|[Hybrid Attention Network: An efficient approach for anatomy-free landmark detection](http://arxiv.org/abs/2412.06499)|**[link](https://github.com/ecnuacrush/hyatt-net)**|Accurate anatomical landmark detection in medical images is crucial for clinical applications. Existing methods often struggle to balance global context with computational efficiency, particularly with high-resolution images. This paper introduces the Hybrid Attention Network (HAN), a novel hybrid architecture integrating CNNs and Transformers. Its core is the BiFormer module, utilizing Bi-Level Routing Attention (BRA) for efficient attention to relevant image regions. This, combined with Convolutional Attention Blocks (CAB) enhanced by CBAM, enables precise local feature refinement guided by the global context. A Feature Fusion Correction Module (FFCM) integrates multi-scale features, mitigating resolution loss. Deep supervision with MSE loss on multi-resolution heatmaps optimizes the model. Experiments on five diverse datasets demonstrate state-of-the-art performance, surpassing existing methods in accuracy, robustness, and efficiency. The HAN provides a promising solution for accurate and efficient anatomical landmark detection in complex medical images. Our codes and data will be released soon at: https://github.com/MIRACLE-Center/.||
|**2024-12-09**|[Local Attention Transformers for High-Detail Optical Flow Upsampling](http://arxiv.org/abs/2412.06439)|null|Most recent works on optical flow use convex upsampling as the last step to obtain high-resolution flow. In this work, we show and discuss several issues and limitations of this currently widely adopted convex upsampling approach. We propose a series of changes, in an attempt to resolve current issues. First, we propose to decouple the weights for the final convex upsampler, making it easier to find the correct convex combination. For the same reason, we also provide extra contextual features to the convex upsampler. Then, we increase the convex mask size by using an attention-based alternative convex upsampler; Transformers for Convex Upsampling. This upsampler is based on the observation that convex upsampling can be reformulated as attention, and we propose to use local attention masks as a drop-in replacement for convex masks to increase the mask size. We provide empirical evidence that a larger mask size increases the likelihood of the existence of the convex combination. Lastly, we propose an alternative training scheme to remove bilinear interpolation artifacts from the model output. Our proposed ideas could theoretically be applied to almost every current state-of-the-art optical flow architecture. On the FlyingChairs + FlyingThings3D training setting we reduce the Sintel Clean training end-point-error of RAFT from 1.42 to 1.26, GMA from 1.31 to 1.18, and that of FlowFormer from 0.94 to 0.90, by solely adapting the convex upsampler.||
|**2024-12-06**|[MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models](http://arxiv.org/abs/2412.05275)|null|Text-to-video models have demonstrated impressive capabilities in producing diverse and captivating video content, showcasing a notable advancement in generative AI. However, these models generally lack fine-grained control over motion patterns, limiting their practical applicability. We introduce MotionFlow, a novel framework designed for motion transfer in video diffusion models. Our method utilizes cross-attention maps to accurately capture and manipulate spatial and temporal dynamics, enabling seamless motion transfers across various contexts. Our approach does not require training and works at test time by leveraging the inherent capabilities of pre-trained video diffusion models. In contrast to traditional approaches, which struggle with comprehensive scene changes while maintaining consistent motion, MotionFlow successfully handles such complex transformations through its attention-based mechanism. Our qualitative and quantitative experiments demonstrate that MotionFlow significantly outperforms existing models in both fidelity and versatility even during drastic scene alterations.||
|**2024-12-06**|[Dynamics of Aggregation Processes and Electrophysical Properties of Transformer Oil-Based Magnetic Fluids](http://arxiv.org/abs/2412.04911)|null|Magnetic fluids exhibit tunable structures and electrophysical properties, making them promising for adaptive optical systems, biomedical sensors, and microelectromechanical devices. However, the dynamic evolution of their microstructure under varying magnetic fields remains insufficiently explored. This study investigates the structural and dielectric properties of transformer oil-based magnetic fluids containing 0.2-10 vol% magnetite nanoparticles, across a frequency range of 20 Hz to 10 MHz. Particular attention is given to the dynamics of aggregate reorientation in response to alternating magnetic fields. Experimental results demonstrate that low nanoparticle concentrations lead to a linear increase in dielectric permittivity and conductivity, consistent with the Maxwell-Wagner model. In contrast, higher concentrations exhibit conductivity saturation and dispersion effects due to the formation of elongated aggregates. An analysis based on the Boyle polarization model describes the relaxation and structural changes associated with aggregation dynamics. Changes in the magnetic field orientation induce aggregate reconfiguration and significant structural transformations. At early stages, elongated chains form, subsequently thickening until an equilibrium state is reached. Elevated temperatures accelerate these processes by reducing medium viscosity and aggregate order. The findings highlight the critical role of reorientation dynamics in designing high-speed magnetic sensors, vibration isolation systems, and adaptive devices operating in dynamic magnetic environments.||
|**2024-12-06**|[Megatron: Evasive Clean-Label Backdoor Attacks against Vision Transformer](http://arxiv.org/abs/2412.04776)|null|Vision transformers have achieved impressive performance in various vision-related tasks, but their vulnerability to backdoor attacks is under-explored. A handful of existing works focus on dirty-label attacks with wrongly-labeled poisoned training samples, which may fail if a benign model trainer corrects the labels. In this paper, we propose Megatron, an evasive clean-label backdoor attack against vision transformers, where the attacker injects the backdoor without manipulating the data-labeling process. To generate an effective trigger, we customize two loss terms based on the attention mechanism used in transformer networks, i.e., latent loss and attention diffusion loss. The latent loss aligns the last attention layer between triggered samples and clean samples of the target label. The attention diffusion loss emphasizes the attention diffusion area that encompasses the trigger. A theoretical analysis is provided to underpin the rationale behind the attention diffusion loss. Extensive experiments on CIFAR-10, GTSRB, CIFAR-100, and Tiny ImageNet demonstrate the effectiveness of Megatron. Megatron can achieve attack success rates of over 90% even when the position of the trigger is slightly shifted during testing. Furthermore, Megatron achieves better evasiveness than baselines regarding both human visual inspection and defense strategies (i.e., DBAVT, BAVT, Beatrix, TeCo, and SAGE).||
|**2024-12-06**|[DHIL-GT: Scalable Graph Transformer with Decoupled Hierarchy Labeling](http://arxiv.org/abs/2412.04738)|null|Graph Transformer (GT) has recently emerged as a promising neural network architecture for learning graph-structured data. However, its global attention mechanism with quadratic complexity concerning the graph scale prevents wider application to large graphs. While current methods attempt to enhance GT scalability by altering model architecture or encoding hierarchical graph data, our analysis reveals that these models still suffer from the computational bottleneck related to graph-scale operations. In this work, we target the GT scalability issue and propose DHIL-GT, a scalable Graph Transformer that simplifies network learning by fully decoupling the graph computation to a separate stage in advance. DHIL-GT effectively retrieves hierarchical information by exploiting the graph labeling technique, as we show that the graph label hierarchy is more informative than plain adjacency by offering global connections while promoting locality, and is particularly suitable for handling complex graph patterns such as heterophily. We further design subgraph sampling and positional encoding schemes for precomputing model input on top of graph labels in an end-to-end manner. The training stage thus favorably removes graph-related computations, leading to ideal mini-batch capability and GPU utilization. Notably, the precomputation and training processes of DHIL-GT achieve complexities linear to the number of graph edges and nodes, respectively. Extensive experiments demonstrate that DHIL-GT is efficient in terms of computational boost and mini-batch capability over existing scalable Graph Transformer designs on large-scale benchmarks, while achieving top-tier effectiveness on both homophilous and heterophilous graphs.||
|**2024-12-06**|[Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens](http://arxiv.org/abs/2412.04680)|**[link](https://github.com/jangsoohyuk/SuiT)**|Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships among tokens. While the tokenization process in NLP inherently ensures that a single token does not contain multiple semantics, the tokenization of Vision Transformer (ViT) utilizes tokens from uniformly partitioned square image patches, which may result in an arbitrary mixing of visual concepts in a token. In this work, we propose to substitute the grid-based tokenization in ViT with superpixel tokenization, which employs superpixels to generate a token that encapsulates a sole visual concept. Unfortunately, the diverse shapes, sizes, and locations of superpixels make integrating superpixels into ViT tokenization rather challenging. Our tokenization pipeline, comprised of pre-aggregate extraction and superpixel-aware aggregation, overcomes the challenges that arise in superpixel tokenization. Extensive experiments demonstrate that our approach, which exhibits strong compatibility with existing frameworks, enhances the accuracy and robustness of ViT on various downstream tasks.||
|**2024-12-05**|[MetaFormer: High-fidelity Metalens Imaging via Aberration Correcting Transformers](http://arxiv.org/abs/2412.04591)|null|Metalens is an emerging optical system with an irreplaceable merit in that it can be manufactured in ultra-thin and compact sizes, which shows great promise for various applications such as medical imaging and augmented/virtual reality (AR/VR). Despite its advantage in miniaturization, its practicality is constrained by severe aberrations and distortions, which significantly degrade the image quality. Several previous works have attempted to address different types of aberrations, yet most of them are designed for the traditional bulky lens and are not convincing enough to remedy the harsh aberrations of the metalens. Although aberration correction methods specific to metalenses exist, they still fall short in restoration quality. In this work, we propose MetaFormer, an aberration correction framework for metalens-captured images, harnessing Vision Transformers (ViT) that have shown remarkable restoration performance in diverse image restoration tasks. Specifically, we devise a Multiple Adaptive Filters Guidance (MAFG), where multiple Wiener filters enrich the degraded input images with various noise-detail balances, enhancing output restoration quality. In addition, we introduce a Spatial and Transposed self-Attention Fusion (STAF) module, which aggregates features from spatial self-attention and transposed self-attention modules to further ameliorate aberration correction. We conduct extensive experiments, including correcting aberrated images and videos, and clean 3D reconstruction from the degraded images. The proposed method outperforms previous works by a significant margin. We further fabricate a metalens and verify the practicality of MetaFormer by restoring the images captured with the manufactured metalens in the wild. Code and pre-trained models are available at https://benhenryl.github.io/MetaFormer||
|**2024-12-05**|[TransAdapter: Vision Transformer for Feature-Centric Unsupervised Domain Adaptation](http://arxiv.org/abs/2412.04073)|**[link](https://github.com/enesdoruk/TransAdapter)**|Unsupervised Domain Adaptation (UDA) aims to utilize labeled data from a source domain to solve tasks in an unlabeled target domain, often hindered by significant domain gaps. Traditional CNN-based methods struggle to fully capture complex domain relationships, motivating the shift to vision transformers like the Swin Transformer, which excel in modeling both local and global dependencies. In this work, we propose a novel UDA approach leveraging the Swin Transformer with three key modules. A Graph Domain Discriminator enhances domain alignment by capturing inter-pixel correlations through graph convolutions and entropy-based attention differentiation. An Adaptive Double Attention module combines Windows and Shifted Windows attention with dynamic reweighting to align long-range and local features effectively. Finally, a Cross-Feature Transform modifies Swin Transformer blocks to improve generalization across domains. Extensive benchmarks confirm the state-of-the-art performance of our versatile method, which requires no task-specific alignment modules, establishing its adaptability to diverse applications.||
|**2024-12-05**|[Exploring Real&Synthetic Dataset and Linear Attention in Image Restoration](http://arxiv.org/abs/2412.03814)|null|Image Restoration aims to restore degraded images, with deep learning, especially CNNs and Transformers, enhancing performance. However, there's a lack of a unified training benchmark for IR. We identified a bias in image complexity between training and testing datasets, affecting restoration quality. To address this, we created ReSyn, a large-scale IR dataset with balanced complexity, including real and synthetic images. We also established a unified training standard for IR models. Our RWKV-IR model integrates linear complexity RWKV into transformers for global and local receptive fields. It replaces Q-Shift with Depth-wise Convolution for local dependencies and combines Bi-directional attention for global-local awareness. The Cross-Bi-WKV module balances horizontal and vertical attention. Experiments show RWKV-IR's effectiveness in image restoration.||
|**2024-12-04**|[Interpreting Transformers for Jet Tagging](http://arxiv.org/abs/2412.03673)|**[link](https://github.com/aaronw5/Interpreting-Transformers-for-Jet-Tagging)**|Machine learning (ML) algorithms, particularly attention-based transformer models, have become indispensable for analyzing the vast data generated by particle physics experiments like ATLAS and CMS at the CERN LHC. Particle Transformer (ParT), a state-of-the-art model, leverages particle-level attention to improve jet-tagging tasks, which are critical for identifying particles resulting from proton collisions. This study focuses on interpreting ParT by analyzing attention heat maps and particle-pair correlations on the $\eta$-$\phi$ plane, revealing a binary attention pattern where each particle attends to at most one other particle. At the same time, we observe that ParT shows varying focus on important particles and subjets depending on decay, indicating that the model learns traditional jet substructure observables. These insights enhance our understanding of the model's internal workings and learning process, offering potential avenues for improving the efficiency of transformer architectures in future high-energy physics applications.||
|**2024-12-04**|[Advanced Risk Prediction and Stability Assessment of Banks Using Time Series Transformer Models](http://arxiv.org/abs/2412.03606)|null|This paper aims to study the prediction of the bank stability index based on the Time Series Transformer model. The bank stability index is an important indicator to measure the health status and risk resistance of financial institutions. Traditional prediction methods are difficult to adapt to complex market changes because they rely on single-dimensional macroeconomic data. This paper proposes a prediction framework based on the Time Series Transformer, which uses the self-attention mechanism of the model to capture the complex temporal dependencies and nonlinear relationships in financial data. Through experiments, we compare the model with LSTM, GRU, CNN, TCN and RNN-Transformer models. The experimental results show that the Time Series Transformer model outperforms other models in both mean square error (MSE) and mean absolute error (MAE) evaluation indicators, showing strong prediction ability. This shows that the Time Series Transformer model can better handle multidimensional time series data in bank stability prediction, providing new technical approaches and solutions for financial risk management.||
|**2024-12-04**|[Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention](http://arxiv.org/abs/2412.03520)|null|Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, i.e., Micro-Controller, which uses only 1.1% of the parameters of the standard ControlNet, enabling precise control over Bird's-Eye-View layouts. To enhance the generation of object instances crucial for autonomous driving, we propose a re-weighted learning objective, dynamically adjusting the learning weights for object instances during training. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos. The project can be found at https://luhannan.github.io/CogDrivingPage/.||
|**2024-12-04**|[Mapping using Transformers for Volumes -- Network for Super-Resolution with Long-Range Interactions](http://arxiv.org/abs/2412.03379)|**[link](https://github.com/augusthoeg/mtvnet)**|Until now, it has been difficult for volumetric super-resolution to utilize the recent advances in transformer-based models seen in 2D super-resolution. The memory required for self-attention in 3D volumes limits the receptive field. Therefore, long-range interactions are not used in 3D to the extent done in 2D and the strength of transformers is not realized. We propose a multi-scale transformer-based model based on hierarchical attention blocks combined with carrier tokens at multiple scales to overcome this. Here information from larger regions at coarse resolution is sequentially carried on to finer-resolution regions to predict the super-resolved image. Using transformer layers at each resolution, our coarse-to-fine modeling limits the number of tokens at each scale and enables attention over larger regions than what has previously been possible. We experimentally compare our method, MTVNet, against state-of-the-art volumetric super-resolution models on five 3D datasets demonstrating the advantage of an increased receptive field. This advantage is especially pronounced for images that are larger than what is seen in popularly used 3D datasets. Our code is available at https://github.com/AugustHoeg/MTVNet||
|**2024-12-05**|[Continual Low-Rank Scaled Dot-product Attention](http://arxiv.org/abs/2412.03214)|null|Transformers are widely used for their ability to capture data relations in sequence processing, with great success for a wide range of static tasks. However, the computational and memory footprint of their main component, i.e., the Scaled Dot-product Attention, is commonly overlooked. This makes their adoption in applications involving stream data processing with constraints in response latency, computational and memory resources infeasible. Some works have proposed methods to lower the computational cost of transformers, i.e. low-rank approximations, sparsity in attention, and efficient formulations for Continual Inference. In this paper, we introduce a new formulation of the Scaled Dot-product Attention based on the Nyström approximation that is suitable for Continual Inference. In experiments on Online Audio Classification and Online Action Detection tasks, the proposed Continual Scaled Dot-product Attention can lower the number of operations by up to three orders of magnitude compared to the original Transformers while retaining the predictive performance of competing models.||
|**2024-12-04**|[STDCformer: A Transformer-Based Model with a Spatial-Temporal Causal De-Confounding Strategy for Crowd Flow Prediction](http://arxiv.org/abs/2412.02942)|null|Existing works typically treat spatial-temporal prediction as the task of learning a function $F$ to transform historical observations to future observations. We further decompose this cross-time transformation into three processes: (1) Encoding ($E$): learning the intrinsic representation of observations, (2) Cross-Time Mapping ($M$): transforming past representations into future representations, and (3) Decoding ($D$): reconstructing future observations from the future representations. From this perspective, spatial-temporal prediction can be viewed as learning $F = E \cdot M \cdot D$, which includes learning the space transformations $\left\{{E},{D}\right\}$ between the observation space and the hidden representation space, as well as the spatial-temporal mapping $M$ from future states to past states within the representation space. This leads to two key questions: **Q1: What kind of representation space allows for mapping the past to the future? Q2: How can the past be mapped to the future within the representation space?** To address Q1, we propose a Spatial-Temporal Backdoor Adjustment strategy, which learns a Spatial-Temporal De-Confounded (STDC) representation space and estimates the de-confounding causal effect of historical data on future data. The causal relationship captured in this way serves as the foundation for subsequent spatial-temporal mapping. To address Q2, we design a Spatial-Temporal Embedding (STE) that fuses the information of temporal and spatial confounders, capturing the intrinsic spatial-temporal characteristics of the representations. Additionally, we introduce a Cross-Time Attention mechanism, which queries the attention between the future and the past to guide spatial-temporal mapping.||
|**2024-12-04**|[Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data](http://arxiv.org/abs/2412.02919)|null|Transformers are now ubiquitous for sequence modeling tasks, but their extension to multi-dimensional data remains a challenge due to the quadratic cost of the attention mechanism. In this paper, we propose Higher-Order Transformers (HOT), a novel architecture designed to efficiently process data with more than two axes, i.e. higher-order tensors. To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism that reduces the attention cost to quadratic in each axis' dimension, rather than quadratic in the total size of the input tensor. To further enhance efficiency, HOT leverages kernelized attention, reducing the complexity to linear. This strategy maintains the model's expressiveness while enabling scalable attention computation. We validate the effectiveness of HOT on two high-dimensional tasks, including multivariate time series forecasting, and 3D medical image classification. Experimental results demonstrate that HOT achieves competitive performance while significantly improving computational efficiency, showcasing its potential for tackling a wide range of complex, multi-dimensional data.||
|**2024-12-03**|[The Asymptotic Behavior of Attention in Transformers](http://arxiv.org/abs/2412.02682)|null|A key component of transformers is the attention mechanism orchestrating how each token influences the propagation of every other token through a transformer. In this paper we provide a rigorous, mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion, all tokens asymptotically converge to each other, a phenomenon that has been empirically reported in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.||
|**2024-12-03**|[FCL-ViT: Task-Aware Attention Tuning for Continual Learning](http://arxiv.org/abs/2412.02509)|null|Continual Learning (CL) involves adapting the prior Deep Neural Network (DNN) knowledge to new tasks, without forgetting the old ones. However, modern CL techniques focus on provisioning memory capabilities to existing DNN models rather than designing new ones that are able to adapt according to the task at hand. This paper presents the novel Feedback Continual Learning Vision Transformer (FCL-ViT) that uses a feedback mechanism to generate real-time dynamic attention features tailored to the current task. The FCL-ViT operates in two Phases. In phase 1, the generic image features are produced and determine where the Transformer should attend on the current image. In phase 2, task-specific image features are generated that leverage dynamic attention. To this end, Tunable self-Attention Blocks (TABs) and Task Specific Blocks (TSBs) are introduced that operate in both phases and are responsible for tuning the TABs attention, respectively. The FCL-ViT surpasses state-of-the-art performance on Continual Learning compared to benchmark methods, while retaining a small number of trainable DNN parameters.||
|**2024-12-03**|[UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices](http://arxiv.org/abs/2412.02344)|null|Transformer-based architectures have demonstrated remarkable success across various domains, but their deployment on edge devices remains challenging due to high memory and computational demands. In this paper, we introduce a novel Reuse Attention mechanism, tailored for efficient memory access and computational optimization, enabling seamless operation on resource-constrained platforms without compromising performance. Unlike traditional multi-head attention (MHA), which redundantly computes separate attention matrices for each head, Reuse Attention consolidates these computations into a shared attention matrix, significantly reducing memory overhead and computational complexity. Comprehensive experiments on ImageNet-1K and downstream tasks show that the proposed UniForm models leveraging Reuse Attention achieve state-of-the-art ImageNet classification accuracy while outperforming existing attention mechanisms, such as Linear Attention and Flash Attention, in inference speed and memory scalability. Notably, UniForm-l achieves a 76.7% Top-1 accuracy on ImageNet-1K with 21.8ms inference time on edge devices like the Jetson AGX Orin, representing up to a 5x speedup over competing benchmark methods. These results demonstrate the versatility of Reuse Attention across high-performance GPUs and edge platforms, paving the way for broader real-time applications.||
|**2024-12-03**|[GQWformer: A Quantum-based Transformer for Graph Representation Learning](http://arxiv.org/abs/2412.02285)|null|Graph Transformers (GTs) have demonstrated significant advantages in graph representation learning through their global attention mechanisms. However, the self-attention mechanism in GTs tends to neglect the inductive biases inherent in graph structures, making it challenging to effectively capture essential structural information. To address this issue, we propose a novel approach that integrates graph inductive bias into self-attention mechanisms by leveraging quantum technology for structural encoding. In this paper, we introduce the Graph Quantum Walk Transformer (GQWformer), a groundbreaking GNN framework that utilizes quantum walks on attributed graphs to generate node quantum states. These quantum states encapsulate rich structural attributes and serve as inductive biases for the transformer, thereby enabling the generation of more meaningful attention scores. By subsequently incorporating a recurrent neural network, our design amplifies the model's ability to focus on both local and global information. We conducted comprehensive experiments across five publicly available datasets to evaluate the effectiveness of our model. These results clearly indicate that GQWformer outperforms existing state-of-the-art graph classification algorithms. These findings highlight the significant potential of integrating quantum computing methodologies with traditional GNNs to advance the field of graph representation learning, providing a promising direction for future research and applications.||
|**2024-12-02**|[FGATT: A Robust Framework for Wireless Data Imputation Using Fuzzy Graph Attention Networks and Transformer Encoders](http://arxiv.org/abs/2412.01979)|null|Missing data is a pervasive challenge in wireless networks and many other domains, often compromising the performance of machine learning and deep learning models. To address this, we propose a novel framework, FGATT, that combines the Fuzzy Graph Attention Network (FGAT) with the Transformer encoder to perform robust and accurate data imputation. FGAT leverages fuzzy rough sets and graph attention mechanisms to capture spatial dependencies dynamically, even in scenarios where predefined spatial information is unavailable. The Transformer encoder is employed to model temporal dependencies, utilizing its self-attention mechanism to focus on significant time-series patterns. A self-adaptive graph construction method is introduced to enable dynamic connectivity learning, ensuring the framework's applicability to a wide range of wireless datasets. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in imputation accuracy and robustness, particularly in scenarios with substantial missing data. The proposed model is well-suited for applications in wireless sensor networks and IoT environments, where data integrity is critical.||
|**2024-12-03**|[Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis](http://arxiv.org/abs/2412.01819)|null|This work presents Switti, a scale-wise transformer for text-to-image generation. Starting from existing next-scale prediction AR models, we first explore them for T2I generation and propose architectural modifications to improve their convergence and overall performance. We then observe that self-attention maps of our pretrained scale-wise AR model exhibit weak dependence on preceding scales. Based on this insight, we propose a non-AR counterpart facilitating ~11% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~20% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7 times faster.||
|**2024-12-02**|[Efficient Semantic Communication Through Transformer-Aided Compression](http://arxiv.org/abs/2412.01817)|null|Transformers, known for their attention mechanisms, have proven highly effective in focusing on critical elements within complex data. This feature can effectively be used to address the time-varying channels in wireless communication systems. In this work, we introduce a channel-aware adaptive framework for semantic communication, where different regions of the image are encoded and compressed based on their semantic content. By employing vision transformers, we interpret the attention mask as a measure of the semantic contents of the patches and dynamically categorize the patches to be compressed at various rates as a function of the instantaneous channel bandwidth. Our method enhances communication efficiency by adapting the encoding resolution to the content's relevance, ensuring that even in highly constrained environments, critical information is preserved. We evaluate the proposed adaptive transmission framework using the TinyImageNet dataset, measuring both reconstruction quality and accuracy. The results demonstrate that our approach maintains high semantic fidelity while optimizing bandwidth, providing an effective solution for transmitting multi-resolution data in limited bandwidth conditions.||
|**2024-12-02**|[Epipolar Attention Field Transformers for Bird's Eye View Semantic Segmentation](http://arxiv.org/abs/2412.01595)|null|Spatial understanding of the semantics of the surroundings is a key capability needed by autonomous cars to enable safe driving decisions. Recently, purely vision-based solutions have gained increasing research interest. In particular, approaches extracting a bird's eye view (BEV) from multiple cameras have demonstrated great performance for spatial understanding. This paper addresses the dependency on learned positional encodings to correlate image and BEV feature map elements for transformer-based methods. We propose leveraging epipolar geometric constraints to model the relationship between cameras and the BEV by Epipolar Attention Fields. They are incorporated into the attention mechanism as a novel attribution term, serving as an alternative to learned positional encodings. Experiments show that our method EAFormer outperforms previous BEV approaches by 2% mIoU for map semantic segmentation and exhibits superior generalization capabilities compared to implicitly learning the camera configuration.||
|**2024-12-02**|[VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval](http://arxiv.org/abs/2412.01558)|**[link](https://github.com/dpaul06/VideoLights)**|Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis. Recent joint prediction transformer models often overlook their cross-task dynamics and video-text alignment and refinement. Moreover, most models typically use limited, uni-directional attention mechanisms, resulting in weakly integrated representations and suboptimal performance in capturing the interdependence between video and text modalities. Although large-language and vision-language models (LLM/LVLMs) have gained prominence across various domains, their application in this field remains relatively underexplored. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment, (ii) Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware clip representations, and (iii) Uni-directional joint-task feedback mechanism enhancing both tasks through correlation. In addition, (iv) we introduce hard positive/negative losses for adaptive error penalization and improved learning, and (v) leverage LVLMs like BLIP-2 for enhanced multimodal feature integration and intelligent pretraining using synthetic data generated from LVLMs. Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance. Codes and models are available at https://github.com/dpaul06/VideoLights .||
|**2024-12-02**|[ReHub: Linear Complexity Graph Transformers with Adaptive Hub-Spoke Reassignment](http://arxiv.org/abs/2412.01519)|null|We present ReHub, a novel graph transformer architecture that achieves linear complexity through an efficient reassignment technique between nodes and virtual nodes. Graph transformers have become increasingly important in graph learning for their ability to utilize long-range node communication explicitly, addressing limitations such as oversmoothing and oversquashing found in message-passing graph networks. However, their dense attention mechanism scales quadratically with the number of nodes, limiting their applicability to large-scale graphs. ReHub draws inspiration from the airline industry's hub-and-spoke model, where flights are assigned to optimize operational efficiency. In our approach, graph nodes (spokes) are dynamically reassigned to a fixed number of virtual nodes (hubs) at each model layer. Recent work, Neural Atoms (Li et al., 2024), has demonstrated impressive and consistent improvements over GNN baselines by utilizing such virtual nodes; their findings suggest that the number of hubs strongly influences performance. However, increasing the number of hubs typically raises complexity, requiring a trade-off to maintain linear complexity. Our key insight is that each node only needs to interact with a small subset of hubs to achieve linear complexity, even when the total number of hubs is large. To leverage all hubs without incurring additional computational costs, we propose a simple yet effective adaptive reassignment technique based on hub-hub similarity scores, eliminating the need for expensive node-hub computations. Our experiments on LRGB indicate a consistent improvement in results over the base method, Neural Atoms, while maintaining a linear complexity. Remarkably, our sparse model achieves performance on par with its non-sparse counterpart. Furthermore, ReHub outperforms competitive baselines and consistently ranks among top performers across various benchmarks.||
|**2024-11-29**|[KV Shifting Attention Enhances Language Modeling](http://arxiv.org/abs/2411.19574)|**[link](https://github.com/erogol/blagpt)**|Current large language models are mainly based on decoder-only transformers, which have strong in-context learning (ICL) capabilities. It is generally believed that an important foundation of this ICL capability is the induction heads mechanism, which requires at least two layers of attention. To implement the model's induction ability more efficiently, we revisit the induction heads mechanism and propose KV shifting attention. We theoretically prove that KV shifting attention reduces the model's requirements for the depth and width of the induction heads mechanism. Our experimental results demonstrate that KV shifting attention is beneficial to learning induction heads and language modeling, which leads to better performance or faster convergence from toy models to pre-training models with more than 10 B parameters.||
|**2024-11-28**|[Quantum feedback control with a transformer neural network architecture](http://arxiv.org/abs/2411.19253)|null|Attention-based neural networks such as transformers have revolutionized various fields such as natural language processing, genomics, and vision. Here, we demonstrate the use of transformers for quantum feedback control through a supervised learning approach. In particular, due to the transformer's ability to capture long-range temporal correlations and training efficiency, we show that it can surpass some of the limitations of previous control approaches, e.g., those based on recurrent neural networks trained using a similar approach or reinforcement learning. We numerically show, for the example of state stabilization of a two-level system, that our bespoke transformer architecture can achieve unit fidelity to a target state in a short time even in the presence of inefficient measurement and Hamiltonian perturbations that were not included in the training set. We also demonstrate that this approach generalizes well to the control of non-Markovian systems. Our approach can be used for quantum error correction, fast control of quantum states in the presence of colored noise, as well as real-time tuning, and characterization of quantum devices.||
|**2024-11-28**|[Pilot Contamination Aware Transformer for Downlink Power Control in Cell-Free Massive MIMO Networks](http://arxiv.org/abs/2411.19020)|null|Learning-based downlink power control in cell-free massive multiple-input multiple-output (CFmMIMO) systems offers a promising alternative to conventional iterative optimization algorithms, which are computationally intensive due to online iterative steps. Existing learning-based methods, however, often fail to exploit the intrinsic structure of channel data and neglect pilot allocation information, leading to suboptimal performance, especially in large-scale networks with many users. This paper introduces the pilot contamination-aware power control (PAPC) transformer neural network, a novel approach that integrates pilot allocation data into the network, effectively handling pilot contamination scenarios. PAPC employs the attention mechanism with a custom masking technique to utilize structural information and pilot data. The architecture includes tailored preprocessing and post-processing stages for efficient feature extraction and adherence to power constraints. Trained in an unsupervised learning framework, PAPC is evaluated against the accelerated proximal gradient (APG) algorithm, showing comparable spectral efficiency fairness performance while significantly improving computational efficiency. Simulations demonstrate PAPC's superior performance over fully connected networks (FCNs) that lack pilot information, its scalability to large-scale CFmMIMO networks, and its computational efficiency improvement over APG. Additionally, by employing padding techniques, PAPC adapts to the dynamically varying number of users without retraining.||
|**2024-11-27**|[TS3-Codec: Transformer-Based Simple Streaming Single Codec](http://arxiv.org/abs/2411.18803)|**[link](https://github.com/ga642381/speech-trident)**|Neural audio codecs (NACs) have garnered significant attention as key technologies for audio compression as well as audio representation for speech language models. While mainstream NAC models are predominantly convolution-based, the performance of NACs with a purely transformer-based, and convolution-free architecture remains unexplored. This paper introduces TS3-Codec, a Transformer-Based Simple Streaming Single Codec. TS3-Codec consists of only a stack of transformer layers with a few linear layers, offering greater simplicity and expressiveness by fully eliminating convolution layers that require careful hyperparameter tuning and large computations. Under the streaming setup, the proposed TS3-Codec achieves comparable or superior performance compared to the codec with state-of-the-art convolution-based architecture while requiring only 12% of the computation and 77% of bitrate. Furthermore, it significantly outperforms the convolution-based codec when using similar computational resources.||
|**2024-11-27**|[HDI-Former: Hybrid Dynamic Interaction ANN-SNN Transformer for Object Detection Using Frames and Events](http://arxiv.org/abs/2411.18658)|null|Combining the complementary benefits of frames and events has been widely used for object detection in challenging scenarios. However, most object detection methods use two independent Artificial Neural Network (ANN) branches, limiting cross-modality information interaction across the two visual streams and encountering challenges in extracting temporal cues from event streams with low power consumption. To address these challenges, we propose HDI-Former, a Hybrid Dynamic Interaction ANN-SNN Transformer, marking the first trial to design a directly trained hybrid ANN-SNN architecture for high-accuracy and energy-efficient object detection using frames and events. Technically, we first present a novel semantic-enhanced self-attention mechanism that strengthens the correlation between image encoding tokens within the ANN Transformer branch for better performance. Then, we design a Spiking Swin Transformer branch to model temporal cues from event streams with low power consumption. Finally, we propose a bio-inspired dynamic interaction mechanism between ANN and SNN sub-networks for cross-modality information interaction. The results demonstrate that our HDI-Former outperforms eleven state-of-the-art methods and our four baselines by a large margin. Our SNN branch also shows comparable performance to the ANN with the same architecture while consuming 10.57 $\times$ less energy on the DSEC-Detection dataset. Our open-source code is available in the supplementary material.||
|**2024-11-27**|[PATHS: A Hierarchical Transformer for Efficient Whole Slide Image Analysis](http://arxiv.org/abs/2411.18225)|**[link](https://github.com/zzbuzzard/paths)**|Computational analysis of whole slide images (WSIs) has seen significant research progress in recent years, with applications ranging across important diagnostic and prognostic tasks such as survival or cancer subtype prediction. Many state-of-the-art models process the entire slide - which may be as large as $150,000 \times 150,000$ pixels - as a bag of many patches, the size of which necessitates computationally cheap feature aggregation methods. However, a large proportion of these patches are uninformative, such as those containing only healthy or adipose tissue, adding significant noise and size to the bag. We propose Pathology Transformer with Hierarchical Selection (PATHS), a novel top-down method for hierarchical weakly supervised representation learning on slide-level tasks in computational pathology. PATHS is inspired by the cross-magnification manner in which a human pathologist examines a slide, recursively filtering patches at each magnification level to a small subset relevant to the diagnosis. Our method overcomes the complications of processing the entire slide, enabling quadratic self-attention and providing a simple interpretable measure of region importance. We apply PATHS to five datasets of The Cancer Genome Atlas (TCGA), and achieve superior performance on slide-level prediction tasks when compared to previous methods, despite processing only a small proportion of the slide.||
|**2024-11-27**|[Spectral-Spatial Transformer with Active Transfer Learning for Hyperspectral Image Classification](http://arxiv.org/abs/2411.18115)|**[link](https://github.com/mahmad000/atl-sst)**|The classification of hyperspectral images (HSI) is a challenging task due to the high spectral dimensionality and limited labeled data typically available for training. In this study, we propose a novel multi-stage active transfer learning (ATL) framework that integrates a Spatial-Spectral Transformer (SST) with an active learning process for efficient HSI classification. Our approach leverages a pre-trained (initially trained) SST model, fine-tuned iteratively on newly acquired labeled samples using an uncertainty-diversity (Spatial-Spectral Neighborhood Diversity) querying mechanism. This mechanism identifies the most informative and diverse samples, thereby optimizing the transfer learning process to reduce both labeling costs and model uncertainty. We further introduce a dynamic freezing strategy, selectively freezing layers of the SST model to minimize computational overhead while maintaining adaptability to spectral variations in new data. One of the key innovations in our work is the self-calibration of spectral and spatial attention weights, achieved through uncertainty-guided active learning. This not only enhances the model's robustness in handling dynamic and disjoint spectral profiles but also improves generalization across multiple HSI datasets. Additionally, we present a diversity-promoting sampling strategy that ensures the selected samples span distinct spectral regions, preventing overfitting to particular spectral classes. Experiments on benchmark HSI datasets demonstrate that the SST-ATL framework significantly outperforms existing CNN and SST-based methods, offering superior accuracy, efficiency, and computational performance. The source code can be accessed at https://github.com/mahmad000/ATL-SST.||
|**2024-11-27**|[HAAT: Hybrid Attention Aggregation Transformer for Image Super-Resolution](http://arxiv.org/abs/2411.18003)|null|In the research area of image super-resolution, Swin-transformer-based models are favored for their global spatial modeling and shifting window attention mechanism. However, existing methods often limit self-attention to non-overlapping windows to cut costs and ignore the useful information that exists across channels. To address this issue, this paper introduces a novel model, the Hybrid Attention Aggregation Transformer (HAAT), designed to better leverage feature information. HAAT is constructed by integrating Swin-Dense-Residual-Connected Blocks (SDRCB) with Hybrid Grid Attention Blocks (HGAB). SDRCB expands the receptive field while maintaining a streamlined architecture, resulting in enhanced performance. HGAB incorporates channel attention, sparse attention, and window attention to improve nonlocal feature fusion and achieve more visually compelling results. Experimental evaluations demonstrate that HAAT surpasses state-of-the-art methods on benchmark datasets. Keywords: Image super-resolution, Computer vision, Attention mechanism, Transformer||
|**2024-11-26**|[Geometric Point Attention Transformer for 3D Shape Reassembly](http://arxiv.org/abs/2411.17788)|null|Shape assembly, which aims to reassemble separate parts into a complete object, has gained significant interest in recent years. Existing methods primarily rely on networks to predict the poses of individual parts, but often fail to effectively capture the geometric interactions between the parts and their poses. In this paper, we present the Geometric Point Attention Transformer (GPAT), a network specifically designed to address the challenges of reasoning about geometric relationships. In the geometric point attention module, we integrate both global shape information and local pairwise geometric features, along with poses represented as rotation and translation vectors for each part. To enable iterative updates and dynamic reasoning, we introduce a geometric recycling scheme, where each prediction is fed into the next iteration for refinement. We evaluate our model on both the semantic and geometric assembly tasks, showing that it outperforms previous methods in absolute pose estimation, achieving accurate pose predictions and high alignment accuracy.||
|**2024-11-26**|[TAFM-Net: A Novel Approach to Skin Lesion Segmentation Using Transformer Attention and Focal Modulation](http://arxiv.org/abs/2411.17556)|null|Incorporating modern computer vision techniques into clinical protocols shows promise in improving skin lesion segmentation. The U-Net architecture has been a key model in this area, iteratively improved to address challenges arising from the heterogeneity of dermatologic images due to varying clinical settings, lighting, patient attributes, and hair density. To further improve skin lesion segmentation, we developed TAFM-Net, an innovative model leveraging self-adaptive transformer attention (TA) coupled with focal modulation (FM). Our model integrates an EfficientNetV2B1 encoder, which employs TA to enhance spatial and channel-related saliency, while a densely connected decoder integrates FM within skip connections, enhancing feature emphasis, segmentation performance, and interpretability crucial for medical image analysis. A novel dynamic loss function amalgamates region and boundary information, guiding effective model training. Our model achieves competitive performance, with Jaccard coefficients of 93.64%, 86.88% and 92.88% in the ISIC2016, ISIC2017 and ISIC2018 datasets, respectively, demonstrating its potential in real-world scenarios.||
|**2024-11-26**|[GrokFormer: Graph Fourier Kolmogorov-Arnold Transformers](http://arxiv.org/abs/2411.17296)|**[link](https://github.com/GGA23/GrokFormer)**|Graph Transformers (GTs) have demonstrated remarkable performance in incorporating various graph structure information, e.g., long-range structural dependency, into graph representation learning. However, self-attention -- the core module of GTs -- preserves only low-frequency signals on graph features, retaining only homophilic patterns that capture similar features among the connected nodes. Consequently, it has insufficient capacity in modeling complex node label patterns, such as the opposite of homophilic patterns -- heterophilic patterns. Some improved GTs deal with the problem by learning polynomial filters or performing self-attention over the first-order graph spectrum. However, these GTs either ignore rich information contained in the whole spectrum or neglect higher-order spectrum information, resulting in limited flexibility and frequency response in their spectral filters. To tackle these challenges, we propose a novel GT network, namely Graph Fourier Kolmogorov-Arnold Transformers (GrokFormer), to go beyond the self-attention in GTs. GrokFormer leverages learnable activation functions in order-$K$ graph spectrum through Fourier series modeling to i) learn eigenvalue-targeted filter functions producing learnable base that can capture a broad range of frequency signals flexibly, and ii) extract first- and higher-order graph spectral information adaptively. In doing so, GrokFormer can effectively capture intricate patterns hidden across different orders and levels of frequency signals, learning expressive, order-and-frequency-adaptive graph representations. Comprehensive experiments conducted on 10 node classification datasets across various domains, scales, and levels of graph heterophily, as well as 5 graph classification datasets, demonstrate that GrokFormer outperforms state-of-the-art GTs and other advanced graph neural networks.||
|**2024-11-26**|[MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution](http://arxiv.org/abs/2411.17214)|**[link](https://github.com/stella-von/MAT)**|Recent advances in image super-resolution (SR) have significantly benefited from the incorporation of Transformer architectures. However, conventional techniques aimed at enlarging the self-attention window to capture broader contexts come with inherent drawbacks, especially the significantly increased computational demands. Moreover, the feature perception within a fixed-size window of existing models restricts the effective receptive fields and the intermediate feature diversity. This study demonstrates that a flexible integration of attention across diverse spatial extents can yield significant performance enhancements. In line with this insight, we introduce Multi-Range Attention Transformer (MAT) tailored for SR tasks. MAT leverages the computational advantages inherent in dilation operation, in conjunction with self-attention mechanism, to facilitate both multi-range attention (MA) and sparse multi-range attention (SMA), enabling efficient capture of both regional and sparse global features. Further coupled with local feature extraction, MAT adeptly captures dependencies across various spatial ranges, improving the diversity and efficacy of its feature representations. We also introduce the MSConvStar module, which augments the model's ability for multi-range representation learning. Comprehensive experiments show that our MAT exhibits superior performance to existing state-of-the-art SR models with remarkable efficiency (~3.3× faster than SRFormer-light).||
|**2024-11-26**|[Star Attention: Efficient LLM Inference over Long Sequences](http://arxiv.org/abs/2411.17116)|**[link](https://github.com/NVIDIA/Star-Attention)**|Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.||
|**2024-11-26**|[ΩSFormer: Dual-Modal Ω-like Super-Resolution Transformer Network for Cross-scale and High-accuracy Terraced Field Vectorization Extraction](http://arxiv.org/abs/2411.17088)|null|Terraced field is a significant engineering practice for soil and water conservation (SWC). Terraced field extraction from remotely sensed imagery is the foundation for monitoring and evaluating SWC. This study is the first to propose a novel dual-modal Ω-like super-resolution Transformer network for intelligent TFVE, offering the following advantages: (1) reducing edge segmentation error from conventional multi-scale downsampling encoder, through fusing original high-resolution features with downsampling features at each step of encoder and leveraging a multi-head attention mechanism; (2) improving the accuracy of TFVE by proposing a Ω-like network structure, which fully integrates rich high-level features from both spectral and terrain data to form cross-scale super-resolution features; (3) validating an optimal fusion scheme for cross-modal and cross-scale (i.e., inconsistent spatial resolution between remotely sensed imagery and DEM) super-resolution feature extraction; (4) mitigating uncertainty between segmentation edge pixels by a coarse-to-fine and spatial topological semantic relationship optimization (STSRO) segmentation strategy; (5) leveraging contour vibration neural network to continuously optimize parameters and iteratively vectorize terraced fields from semantic segmentation results. Moreover, a DMRVD for deep-learning-based TFVE was created for the first time, which covers nine study areas in four provinces of China, with a total coverage area of 22441 square kilometers. To assess the performance of ΩSFormer, classic and SOTA networks were compared. The mIOU of ΩSFormer has improved by 0.165, 0.297 and 0.128 respectively, when compared with best accuracy single-modal remotely sensed imagery, single-modal DEM and dual-modal result.||
|**2024-11-26**|[SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation](http://arxiv.org/abs/2411.17061)|null|The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants extensively validated across various downstream tasks, including semantic segmentation. However, designed as general-purpose visual encoders, ViT backbones often overlook the specific needs of task decoders, revealing opportunities to design decoders tailored to efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head explicitly designed for semantic segmentation. Instead of relying on the simple conventional skip connections, we employ lateral connections between the encoder and decoder stages, using encoder features as Queries for the cross-attention modules. Additionally, we introduce a Cross-Layer Block that blends hierarchical feature maps from different encoder and decoder stages to create a unified representation for Keys and Values. To further boost computational efficiency, SCASeg compresses queries and keys into strip-like patterns to optimize memory usage and inference speed over the traditional vanilla cross-attention. Moreover, the Cross-Layer Block incorporates the local perceptual strengths of convolution, enabling SCASeg to capture both global and local context dependencies across multiple layers. This approach facilitates effective feature interaction at different scales, improving the overall performance. Experiments show that the adaptable decoder of SCASeg produces competitive performance across different setups, surpassing leading segmentation architectures on all benchmark datasets, including ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012, even under varying computational limitations.||
|**2024-11-25**|[Tree Transformers are an Ineffective Model of Syntactic Constituency](http://arxiv.org/abs/2411.16993)|null|Linguists have long held that a key aspect of natural language syntax is the recursive organization of language units into constituent structures, and research has suggested that current state-of-the-art language models lack an inherent bias towards this feature. A number of alternative models have been proposed to provide inductive biases towards constituency, including the Tree Transformer, which utilizes a modified attention mechanism to organize tokens into constituents. We investigate Tree Transformers to study whether they utilize meaningful and/or useful constituent structures. We pretrain a large Tree Transformer on language modeling in order to investigate the learned constituent tree representations of sentences, finding little evidence for meaningful structures. Next, we evaluate Tree Transformers with similar transformer models on error detection tasks requiring constituent structure. We find that while the Tree Transformer models may slightly outperform at these tasks, there is little evidence to suggest a meaningful improvement. In general, we conclude that there is little evidence to support Tree Transformer as an effective model of syntactic constituency.||
|**2024-11-25**|[CMAViT: Integrating Climate, Managment, and Remote Sensing Data for Crop Yield Estimation with Multimodel Vision Transformers](http://arxiv.org/abs/2411.16989)|null|Crop yield prediction is essential for agricultural planning but remains challenging due to the complex interactions between weather, climate, and management practices. To address these challenges, we introduce a deep learning-based multi-model called Climate-Management Aware Vision Transformer (CMAViT), designed for pixel-level vineyard yield predictions. CMAViT integrates both spatial and temporal data by leveraging remote sensing imagery and short-term meteorological data, capturing the effects of growing season variations. Additionally, it incorporates management practices, which are represented in text form, using a cross-attention encoder to model their interaction with time-series data. This innovative multi-modal transformer tested on a large dataset from 2016-2019 covering 2,200 hectares and eight grape cultivars including more than 5 million vines, outperforms traditional models like UNet-ConvLSTM, excelling in spatial variability capture and yield prediction, particularly for extreme values in vineyards. CMAViT achieved an R2 of 0.84 and a MAPE of 8.22% on an unseen test dataset. Masking specific modalities lowered performance: excluding management practices, climate data, and both reduced R2 to 0.73, 0.70, and 0.72, respectively, and raised MAPE to 11.92%, 12.66%, and 12.39%, highlighting each modality's importance for accurate yield prediction. Code is available at https://github.com/plant-ai-biophysics-lab/CMAViT.||
|**2024-11-25**|[StructFormer: Document Structure-based Masked Attention and its Impact on Language Model Pre-Training](http://arxiv.org/abs/2411.16618)|null|Most state-of-the-art techniques for Language Models (LMs) today rely on transformer-based architectures and their ubiquitous attention mechanism. However, the exponential growth in computational requirements with longer input sequences confines Transformers to handling short passages. Recent efforts have aimed to address this limitation by introducing selective attention mechanisms, notably local and global attention. While sparse attention mechanisms, akin to full attention in being Turing-complete, have been theoretically established, their practical impact on pre-training remains unexplored. This study focuses on empirically assessing the influence of global attention on BERT pre-training. The primary steps involve creating an extensive corpus of structure-aware text through arXiv data, alongside a text-only counterpart. We carry out pre-training on these two datasets, investigate shifts in attention patterns, and assess their implications for downstream tasks. Our analysis underscores the significance of incorporating document structure into LM models, demonstrating their capacity to excel in more abstract tasks, such as document understanding.||
|**2024-11-25**|[J-CaPA : Joint Channel and Pyramid Attention Improves Medical Image Segmentation](http://arxiv.org/abs/2411.16568)|null|Medical image segmentation is crucial for diagnosis and treatment planning. Traditional CNN-based models, like U-Net, have shown promising results but struggle to capture long-range dependencies and global context. To address these limitations, we propose a transformer-based architecture that jointly applies Channel Attention and Pyramid Attention mechanisms to improve multi-scale feature extraction and enhance segmentation performance for medical images. Increasing model complexity requires more training data, and we further improve model generalization with CutMix data augmentation. Our approach is evaluated on the Synapse multi-organ segmentation dataset, achieving a 6.9% improvement in Mean Dice score and a 39.9% improvement in Hausdorff Distance (HD95) over an implementation without our enhancements. Our proposed model demonstrates improved segmentation accuracy for complex anatomical structures, outperforming existing state-of-the-art methods.||
|**2024-11-22**|[OminiControl: Minimal and Universal Control for Diffusion Transformer](http://arxiv.org/abs/2411.15098)|**[link](https://github.com/Yuanshi9815/OminiControl)**|In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone and process them with its flexible multi-modal attention processors. Unlike existing methods, which rely heavily on additional encoder modules with complex architectures, OminiControl (1) effectively and efficiently incorporates injected image conditions with only ~0.1% additional parameters, and (2) addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions such as edges, depth, and more. Remarkably, these capabilities are achieved by training on images generated by the DiT itself, which is particularly beneficial for subject-driven generation. Extensive evaluations demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted models in both subject-driven and spatially-aligned conditional generation. Additionally, we release our training dataset, Subjects200K, a diverse collection of over 200,000 identity-consistent images, along with an efficient data synthesis pipeline to advance research in subject-consistent generation.||
|**2024-11-22**|[HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads](http://arxiv.org/abs/2411.15034)|null|Diffusion Transformers (DiTs) have exhibited robust capabilities in image generation tasks. However, accurate text-guided image editing for multimodal DiTs (MM-DiTs) still poses a significant challenge. Unlike UNet-based structures that could utilize self/cross-attention maps for semantic editing, MM-DiTs inherently lack support for explicit and consistent incorporated text guidance, resulting in semantic misalignment between the edited results and texts. In this study, we disclose the sensitivity of different attention heads to different image semantics within MM-DiTs and introduce HeadRouter, a training-free image editing framework that edits the source image by adaptively routing the text guidance to different attention heads in MM-DiTs. Furthermore, we present a dual-token refinement module to refine text/image token representations for precise semantic guidance and accurate region expression. Experimental results on multiple benchmarks demonstrate HeadRouter's performance in terms of editing fidelity and image quality.||
|**2024-11-22**|[Point Cloud Understanding via Attention-Driven Contrastive Learning](http://arxiv.org/abs/2411.14744)|null|Recently, Transformer-based models have advanced point cloud understanding by leveraging self-attention mechanisms. However, these methods often overlook latent information in less prominent regions, leading to increased sensitivity to perturbations and limited global comprehension. To solve this issue, we introduce PointACL, an attention-driven contrastive learning framework designed to address these limitations. Our method employs an attention-driven dynamic masking strategy that guides the model to focus on under-attended regions, enhancing the understanding of global structures within the point cloud. Then we combine the original pre-training loss with a contrastive learning loss, improving feature discrimination and generalization. Extensive experiments validate the effectiveness of PointACL, as it achieves state-of-the-art performance across a variety of 3D understanding tasks, including object classification, part segmentation, and few-shot learning. Specifically, when integrated with different Transformer backbones like Point-MAE and PointGPT, PointACL demonstrates improved performance on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart. This highlights its superior capability in capturing both global and local features, as well as its enhanced robustness against perturbations and incomplete data.||
|**2024-11-22**|[FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration](http://arxiv.org/abs/2411.14733)|null|Encoder-based transformers, powered by self-attention layers, have revolutionized machine learning with their context-aware representations. However, their quadratic growth in computational and memory demands presents significant bottlenecks. Analog-Mixed-Signal Process-in-Memory (AMS-PiM) architectures address these challenges by enabling efficient on-chip processing. Traditionally, AMS-PiM relies on Quantization-Aware Training (QAT), which is hardware-efficient but requires extensive retraining to adapt models to AMS-PiMs, making it increasingly impractical for transformer models. Post-Training Quantization (PTQ) mitigates this training overhead but introduces significant hardware inefficiencies. PTQ relies on dequantization-quantization (DQ-Q) processes, floating-point units (FPUs), and high-ENOB (Effective Number of Bits) analog-to-digital converters (ADCs). In particular, high-ENOB ADCs scale exponentially in area and energy ($2^{ENOB}$), reduce sensing margins, and increase susceptibility to process, voltage, and temperature (PVT) variations, further compounding PTQ's challenges in AMS-PiM systems. To overcome these limitations, we propose RAP, an AMS-PiM architecture that eliminates DQ-Q processes, introduces FPU- and division-free nonlinear processing, and employs a low-ENOB-ADC-based sparse Matrix Vector multiplication technique. Using the proposed techniques, RAP improves error resiliency, area/energy efficiency, and computational speed while preserving numerical stability. Experimental results demonstrate that RAP outperforms state-of-the-art GPUs and conventional PiM architectures in energy efficiency, latency, and accuracy, making it a scalable solution for the efficient deployment of transformers.||
|**2024-11-22**|[Multiset Transformer: Advancing Representation Learning in Persistence Diagrams](http://arxiv.org/abs/2411.14662)|**[link](https://github.com/minghuax/MST)**|To improve persistence diagram representation learning, we propose Multiset Transformer. This is the first neural network that utilizes attention mechanisms specifically designed for multisets as inputs and offers rigorous theoretical guarantees of permutation invariance. The architecture integrates multiset-enhanced attentions with a pool-decomposition scheme, allowing multiplicities to be preserved across equivariant layers. This capability enables full leverage of multiplicities while significantly reducing both computational and spatial complexity compared to the Set Transformer. Additionally, our method can greatly benefit from clustering as a preprocessing step to further minimize complexity, an advantage not possessed by the Set Transformer. Experimental results demonstrate that the Multiset Transformer outperforms existing neural network methods in the realm of persistence diagram representation learning.||
|**2024-11-21**|[CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs](http://arxiv.org/abs/2411.14611)|null|Machine Learning (ML) for software engineering (SE) has gained prominence due to its ability to significantly enhance the performance of various SE applications. This progress is largely attributed to the development of generalizable source code representations that effectively capture the syntactic and semantic characteristics of code. In recent years, pre-trained transformer-based models, inspired by natural language processing (NLP), have shown remarkable success in SE tasks. However, source code contains structural and semantic properties embedded within its grammar, which can be extracted from structured code-views like the Abstract Syntax Tree (AST), Data-Flow Graph (DFG), and Control-Flow Graph (CFG). These code-views can complement NLP techniques, further improving SE tasks. Unfortunately, there are no flexible frameworks to infuse arbitrary code-views into existing transformer-based models effectively. Therefore, in this work, we propose CodeSAM, a novel scalable framework to infuse multiple code-views into transformer-based models by creating self-attention masks. We use CodeSAM to fine-tune a small language model (SLM) like CodeBERT on the downstream SE tasks of semantic code search, code clone detection, and program classification. Experimental results show that by using this technique, we improve downstream performance when compared to SLMs like GraphCodeBERT and CodeBERT on all three tasks by utilizing individual code-views or a combination of code-views during fine-tuning. We believe that these results are indicative that techniques like CodeSAM can help create compact yet performant code SLMs that fit in resource constrained settings.||
|**2024-11-21**|[Revisiting the Integration of Convolution and Attention for Vision Backbone](http://arxiv.org/abs/2411.14429)|**[link](https://github.com/rayleizhu/glmix)**|Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per-pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel **at different granularity levels** instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the grid for local features, and MHSAs to the slots for global features. A pair of fully differentiable soft clustering and dispatching modules is introduced to bridge the grid and set representations, thus enabling local-global fusion. Through extensive experiments on various vision tasks, we empirically verify the potential of the proposed integration scheme, named *GLMix*: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few (e.g., 64) semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient. Our visualization results also demonstrate that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may induce better interpretability and inspire new weakly-supervised semantic segmentation approaches. Code will be available at https://github.com/rayleizhu/GLMix.||
|**2024-11-21**|[Global and Local Attention-Based Transformer for Hyperspectral Image Change Detection](http://arxiv.org/abs/2411.14109)|**[link](https://github.com/summitgao/glaformer)**|Recently, Transformer-based hyperspectral image (HSI) change detection methods have shown remarkable performance. Nevertheless, existing attention mechanisms in Transformers have limitations in local feature representation. To address this issue, we propose Global and Local Attention-based Transformer (GLAFormer), which incorporates a global and local attention module (GLAM) to combine high-frequency and low-frequency signals. Furthermore, we introduce a cross-gating mechanism, called cross-gated feed-forward network (CGFN), to emphasize salient features and suppress noise interference. Specifically, the GLAM splits attention heads into global and local attention components to capture comprehensive spatial-spectral features. The global attention component employs global attention on downsampled feature maps to capture low-frequency information, while the local attention component focuses on high-frequency details using non-overlapping window-based local attention. The CGFN enhances the feature representation via convolutions and cross-gating mechanism in parallel paths. The proposed GLAFormer is evaluated on three HSI datasets. The results demonstrate its superiority over state-of-the-art HSI change detection methods. The source code of GLAFormer is available at https://github.com/summitgao/GLAFormer.||
|**2024-11-20**|[Quantum Attention for Vision Transformers in High Energy Physics](http://arxiv.org/abs/2411.13520)|null|We present a novel hybrid quantum-classical vision transformer architecture incorporating quantum orthogonal neural networks (QONNs) to enhance performance and computational efficiency in high-energy physics applications. Building on advancements in quantum vision transformers, our approach addresses limitations of prior models by leveraging the inherent advantages of QONNs, including stability and efficient parameterization in high-dimensional spaces. We evaluate the proposed architecture using multi-detector jet images from CMS Open Data, focusing on the task of distinguishing quark-initiated from gluon-initiated jets. The results indicate that embedding quantum orthogonal transformations within the attention mechanism can provide robust performance while offering promising scalability for machine learning challenges associated with the upcoming High Luminosity Large Hadron Collider. This work highlights the potential of quantum-enhanced models to address the computational demands of next-generation particle physics experiments.||
|**2024-11-20**|[Transformers with Sparse Attention for Granger Causality](http://arxiv.org/abs/2411.13264)|null|Temporal causal analysis means understanding the underlying causes behind observed variables over time. Deep learning based methods such as transformers are increasingly used to capture temporal dynamics and causal relationships beyond mere correlations. Recent works suggest self-attention weights of transformers as a useful indicator of causal links. We leverage this to propose a novel modification to the self-attention module to establish causal links between the variables of multivariate time-series data with varying lag dependencies. Our Sparse Attention Transformer captures causal relationships using a two-fold approach - performing temporal attention first followed by attention between the variables across the time steps masking them individually to compute Granger Causality indices. The key novelty in our approach is the ability of the model to assert importance and pick the most significant past time instances for its prediction task against manually feeding a fixed time lag value. We demonstrate the effectiveness of our approach via extensive experimentation on several synthetic benchmark datasets. Furthermore, we compare the performance of our model with the traditional Vector Autoregression based Granger Causality method that assumes fixed lag length.||
|**2024-11-20**|[Topkima-Former: Low-energy, Low-Latency Inference for Transformers using top-k In-memory ADC](http://arxiv.org/abs/2411.13050)|null|The Transformer model has gained prominence as a popular deep neural network architecture for natural language processing (NLP) and computer vision (CV) applications. However, the extensive use of nonlinear operations, like softmax, poses a performance bottleneck during transformer inference and comprises up to 40% of the total latency. Hence, we propose innovations at the circuit, architecture, and algorithm levels to accelerate the transformer. At the circuit level, we propose topkima, which combines top-k activation selection with in-memory ADC (IMA) to implement a low-energy and low-latency softmax without any sorting latency. Only the k largest activations are sent to the softmax calculation block, reducing the huge computational cost of softmax. Using a modified training scheme with top-k only in the forward pass, experimental results demonstrate only a 0.4% to 1.2% reduction in accuracy across ViT, distilBERT, and BERT-base models when evaluated on CIFAR-10, CIFAR-100, and SQuAD datasets with k=5. At the architecture level, an improved scale-free technique is introduced to reduce the computational cost of attention. The combined system, dubbed Topkima-Former, achieves 1.8x-84x speedup and 1.3x-35x higher energy efficiency (EE) over prior in-memory computing (IMC) accelerators. Compared to a conventional softmax macro and a digital top-k (Dtopk) softmax macro, our proposed topkima softmax macro achieves about 15x and 8x faster speed, respectively.||
|**2024-11-20**|[A Theory for Compressibility of Graph Transformers for Transductive Learning](http://arxiv.org/abs/2411.13028)|null|Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks, as the independent and identically distributed (i.i.d.) assumption does not hold among samples. Instead, all train/test/validation samples are present during training, making them more akin to a semi-supervised task. These differences make the analysis of the models substantially different from other models. Recently, Graph Transformers have significantly improved results on these datasets by overcoming long-range dependency problems. However, the quadratic complexity of full Transformers has driven the community to explore more efficient variants, such as those with sparser attention patterns. While the attention matrix has been extensively discussed, the hidden dimension or width of the network has received less attention. In this work, we establish some theoretical bounds on how and under what conditions the hidden dimension of these networks can be compressed. Our results apply to both sparse and dense variants of Graph Transformers.||
|**2024-11-20**|[MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers](http://arxiv.org/abs/2411.12992)|null|In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models, such as linear attention and flash-attention. However, the model size and corresponding computational complexity are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective. We eliminate nearly all the computations of the transformer model except for the necessary computation required by the multi-head attention operation. This is made possible by utilizing an alternative method for feature transformation to replace the linear projection of fully-connected layers. Specifically, we first construct a group of in-memory lookup tables that store a large number of discrete vectors to replace the weight matrix used in linear projection. We then use a hash algorithm to retrieve a correlated subset of vectors dynamically based on the input embedding. The retrieved vectors are combined to form the output embedding, which provides an estimation of the result of the matrix multiplication operation in a fully-connected layer. Compared to conducting matrix multiplication, retrieving data blocks from memory is a much cheaper operation which requires little computation. We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.||
|**2024-11-19**|[Selective Attention: Enhancing Transformer through Principled Context Control](http://arxiv.org/abs/2411.12892)|**[link](https://github.com/umich-sota/selective_attention)**|The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query. While self-attention has enjoyed major success, it notably treats all queries $q$ in the same way by applying the mapping $V^\top\text{softmax}(Kq)$, where $V,K$ are the value and key embeddings respectively. In this work, we argue that this uniform treatment hinders the ability to control contextual sparsity and relevance. As a solution, we introduce the $\textit{Selective Self-Attention}$ (SSA) layer that augments the softmax nonlinearity with a principled temperature scaling strategy. By controlling temperature, SSA adapts the contextual sparsity of the attention map to the query embedding and its position in the context window. Through theory and experiments, we demonstrate that this alleviates attention dilution, aids the optimization process, and enhances the model's ability to control softmax spikiness of individual queries. We also incorporate temperature scaling for value embeddings and show that it boosts the model's ability to suppress irrelevant/noisy tokens. Notably, SSA is a lightweight method which introduces less than 0.5% new parameters through a weight-sharing strategy and can be fine-tuned on existing LLMs. Extensive empirical evaluations demonstrate that SSA-equipped models achieve a noticeable and consistent accuracy improvement on language modeling benchmarks.||
|**2024-11-19**|[Benchmarking Positional Encodings for GNNs and Graph Transformers](http://arxiv.org/abs/2411.12732)|**[link](https://github.com/ETH-DISCO/Benchmarking-PEs)**|Recent advances in Graph Neural Networks (GNNs) and Graph Transformers (GTs) have been driven by innovations in architectures and Positional Encodings (PEs), which are critical for augmenting node features and capturing graph topology. PEs are essential for GTs, where topological information would otherwise be lost without message-passing. However, PEs are often tested alongside novel architectures, making it difficult to isolate their effect on established models. To address this, we present a comprehensive benchmark of PEs in a unified framework that includes both message-passing GNNs and GTs. We also establish theoretical connections between MPNNs and GTs and introduce a sparsified GRIT attention mechanism to examine the influence of global connectivity. Our findings demonstrate that previously untested combinations of GNN architectures and PEs can outperform existing methods and offer a more comprehensive picture of the state-of-the-art. To support future research and experimentation in our framework, we make the code publicly available.||
|**2024-11-19**|[S3TU-Net: Structured Convolution and Superpixel Transformer for Lung Nodule Segmentation](http://arxiv.org/abs/2411.12547)|null|The irregular and challenging characteristics of lung adenocarcinoma nodules in computed tomography (CT) images complicate staging diagnosis, making accurate segmentation critical for clinicians to extract detailed lesion information. In this study, we propose a segmentation model, S3TU-Net, which integrates multi-dimensional spatial connectors and a superpixel-based visual transformer. S3TU-Net is built on a multi-view CNN-Transformer hybrid architecture, incorporating superpixel algorithms, structured weighting, and spatial shifting techniques to achieve superior segmentation performance. The model leverages structured convolution blocks (DWF-Conv/D2BR-Conv) to extract multi-scale local features while mitigating overfitting. To enhance multi-scale feature fusion, we introduce the S2-MLP Link, integrating spatial shifting and attention mechanisms at the skip connections. Additionally, the residual-based superpixel visual transformer (RM-SViT) effectively merges global and local features by employing sparse correlation learning and multi-branch attention to capture long-range dependencies, with residual connections enhancing stability and computational efficiency. Experimental results on the LIDC-IDRI dataset demonstrate that S3TU-Net achieves a DSC, precision, and IoU of 89.04%, 90.73%, and 90.70%, respectively. Compared to recent methods, S3TU-Net improves DSC by 4.52% and sensitivity by 3.16%, with other metrics showing an approximate 2% increase. In addition to comparison and ablation studies, we validated the generalization ability of our model on the EPDB private dataset, achieving a DSC of 86.40%.||
|**2024-11-19**|[Transformer Neural Processes -- Kernel Regression](http://arxiv.org/abs/2411.12502)|null|Stochastic processes model various natural phenomena from disease transmission to stock prices, but simulating and quantifying their uncertainty can be computationally challenging. For example, modeling a Gaussian Process with standard statistical methods incurs an $\mathcal{O}(n^3)$ penalty, and even using state-of-the-art Neural Processes (NPs) incurs an $\mathcal{O}(n^2)$ penalty due to the attention mechanism. We introduce the Transformer Neural Process - Kernel Regression (TNP-KR), a new architecture that incorporates a novel transformer block we call a Kernel Regression Block (KRBlock), which reduces the computational complexity of attention in transformer-based Neural Processes (TNPs) from $\mathcal{O}((n_C+n_T)^2)$ to $\mathcal{O}(n_C^2+n_Cn_T)$ by eliminating masked computations, where $n_C$ is the number of context points and $n_T$ is the number of test points. We also introduce a fast attention variant that further reduces all attention calculations to $\mathcal{O}(n_C)$ in space and time complexity. In benchmarks spanning such tasks as meta-regression, Bayesian optimization, and image completion, we demonstrate that the full variant matches the performance of state-of-the-art methods while training faster and scaling two orders of magnitude higher in number of test points, and the fast variant nearly matches that performance while scaling to millions of both test and context points on consumer hardware.||
|**2024-11-19**|[Robust 3D Semantic Occupancy Prediction with Calibration-free Spatial Transformation](http://arxiv.org/abs/2411.12177)|**[link](https://github.com/iceory/reo)**|3D semantic occupancy prediction, which seeks to provide accurate and comprehensive representations of environment scenes, is important to autonomous driving systems. For autonomous cars equipped with multi-camera and LiDAR, it is critical to aggregate multi-sensor information into a unified 3D space for accurate and robust predictions. Recent methods are mainly built on the 2D-to-3D transformation that relies on sensor calibration to project the 2D image information into the 3D space. These methods, however, suffer from two major limitations: First, they rely on accurate sensor calibration and are sensitive to the calibration noise, which limits their application in real complex environments. Second, the spatial transformation layers are computationally expensive and limit their running on an autonomous vehicle. In this work, we attempt to exploit a Robust and Efficient 3D semantic Occupancy (REO) prediction scheme. To this end, we propose a calibration-free spatial transformation based on vanilla attention to implicitly model the spatial correspondence. In this way, we robustly project the 2D features to a predefined BEV plane without using sensor calibration as input. Then, we introduce 2D and 3D auxiliary training tasks to enhance the discrimination power of 2D backbones on spatial, semantic, and texture features. Last, we propose a query-based prediction scheme to efficiently generate large-scale fine-grained occupancy predictions. By fusing point clouds that provide complementary spatial information, our REO surpasses the existing methods by a large margin on three benchmarks, including OpenOccupancy, Occ3D-nuScenes, and SemanticKITTI Scene Completion. For instance, our REO achieves 19.8 $\times$ speedup compared to Co-Occ, with 1.1 improvements in geometry IoU on OpenOccupancy. Our code will be available at https://github.com/ICEORY/REO.||
|**2024-11-18**|[Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers](http://arxiv.org/abs/2411.12118)|null|In this paper, I introduce the retrieval problem, a simple reasoning task that can be solved only by transformers with a minimum number of layers. The task has an adjustable difficulty that can further increase the required number of layers to any arbitrary value. I demonstrate that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. I find that successful learning occurs only under the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps in the trained transformers. I also study the training process, uncovering that attention heads always emerge in a specific sequence.||
|**2024-11-18**|[Edge-Enhanced Dilated Residual Attention Network for Multimodal Medical Image Fusion](http://arxiv.org/abs/2411.11799)|**[link](https://github.com/simonzhou86/en_dran)**|Multimodal medical image fusion is a crucial task that combines complementary information from different imaging modalities into a unified representation, thereby enhancing diagnostic accuracy and treatment planning. While deep learning methods, particularly Convolutional Neural Networks (CNNs) and Transformers, have significantly advanced fusion performance, some of the existing CNN-based methods fall short in capturing fine-grained multiscale and edge features, leading to suboptimal feature integration. Transformer-based models, on the other hand, are computationally intensive in both the training and fusion stages, making them impractical for real-time clinical use. Moreover, the clinical application of fused images remains unexplored. In this paper, we propose a novel CNN-based architecture that addresses these limitations by introducing a Dilated Residual Attention Network Module for effective multiscale feature extraction, coupled with a gradient operator to enhance edge detail learning. To ensure fast and efficient fusion, we present a parameter-free fusion strategy based on the weighted nuclear norm of softmax, which requires no additional computations during training or inference. Extensive experiments, including a downstream brain tumor classification task, demonstrate that our approach outperforms various baseline methods in terms of visual quality, texture preservation, and fusion speed, making it a possible practical solution for real-world clinical applications. The code will be released at https://github.com/simonZhou86/en_dran.||
|**2024-11-18**|[Transformer networks for Heavy flavor jet tagging](http://arxiv.org/abs/2411.11519)|null|In this article, we review recent machine learning methods used in challenging particle identification of heavy-boosted particles at high-energy colliders. Our primary focus is on attention-based Transformer networks. We report the performance of state-of-the-art deep learning networks and further improvement coming from the modification of networks based on physics insights. Additionally, we discuss interpretable methods to understand network decision-making, which are crucial when employing highly complex and deep networks.||
|**2024-11-18**|[DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery](http://arxiv.org/abs/2411.11214)|null|Human Mesh Recovery (HMR) is an important yet challenging problem with applications across various domains including motion capture, augmented reality, and biomechanics. Accurately predicting human pose parameters from a single image remains a challenging 3D computer vision task. In this work, we introduce DeforHMR, a novel regression-based monocular HMR framework designed to enhance the prediction of human pose parameters using deformable attention transformers. DeforHMR leverages a novel query-agnostic deformable cross-attention mechanism within the transformer decoder to effectively regress the visual features extracted from a frozen pretrained vision transformer (ViT) encoder. The proposed deformable cross-attention mechanism allows the model to attend to relevant spatial features more flexibly and in a data-dependent manner. Equipped with a transformer decoder capable of spatially-nuanced attention, DeforHMR achieves state-of-the-art performance for single-frame regression-based methods on the widely used 3D HMR benchmarks 3DPW and RICH. By pushing the boundary of the field of 3D human mesh recovery through deformable attention, we introduce a new, effective paradigm for decoding local spatial information from large pretrained vision encoders in computer vision.||
|**2024-11-17**|[Freqformer: Frequency-Domain Transformer for 3-D Visualization and Quantification of Human Retinal Circulation](http://arxiv.org/abs/2411.11189)|null|We introduce Freqformer, a novel Transformer-based architecture designed for 3-D, high-definition visualization of human retinal circulation from a single scan in commercial optical coherence tomography angiography (OCTA). Freqformer addresses the challenge of limited signal-to-noise ratio in OCTA volumes by utilizing a complex-valued frequency-domain module (CFDM) and a simplified multi-head attention (Sim-MHA) mechanism. Using merged volumes as ground truth, Freqformer enables accurate reconstruction of retinal vasculature across the depth planes, allowing for 3-D quantification of capillary segments (count, density, and length). Our method outperforms state-of-the-art convolutional neural networks (CNNs) and several Transformer-based models, with superior performance in peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and learned perceptual image patch similarity (LPIPS). Furthermore, Freqformer demonstrates excellent generalizability across lower scanning density, effectively enhancing OCTA scans with larger fields of view (from $3\times3$ mm$^{2}$ to $6\times6$ mm$^{2}$ and $12\times12$ mm$^{2}$). These results suggest that Freqformer can significantly improve the understanding and characterization of retinal circulation, offering potential clinical applications in diagnosing and managing retinal vascular diseases.||
|**2024-11-16**|[FIAS: Feature Imbalance-Aware Medical Image Segmentation with Dynamic Fusion and Mixing Attention](http://arxiv.org/abs/2411.10881)|null|With the growing application of transformers in computer vision, hybrid architectures that combine convolutional neural networks (CNNs) and transformers demonstrate competitive ability in medical image segmentation. However, direct fusion of features from CNNs and transformers often leads to feature imbalance and redundant information. To address these issues, we propose a Feature Imbalance-Aware Segmentation (FIAS) network, which incorporates a dual-path encoder and a novel Mixing Attention (MixAtt) decoder. The dual-branch encoder integrates a DilateFormer for long-range global feature extraction and a Depthwise Multi-Kernel (DMK) convolution for capturing fine-grained local details. A Context-Aware Fusion (CAF) block dynamically balances the contribution of these global and local features, preventing feature imbalance. The MixAtt decoder further enhances segmentation accuracy by combining self-attention and Monte Carlo attention, enabling the model to capture both small details and large-scale dependencies. Experimental results on the Synapse multi-organ and ACDC datasets demonstrate the strong competitiveness of our approach in medical image segmentation tasks.||
|**2024-11-15**|[Probabilistic Prior Driven Attention Mechanism Based on Diffusion Model for Imaging Through Atmospheric Turbulence](http://arxiv.org/abs/2411.10321)|null|Atmospheric turbulence introduces severe spatial and geometric distortions, challenging traditional image restoration methods. We propose the Probabilistic Prior Turbulence Removal Network (PPTRN), which combines probabilistic diffusion-based prior modeling with Transformer-driven feature extraction to address this issue. PPTRN employs a two-stage approach: first, a latent encoder and Transformer are jointly trained on clear images to establish robust feature representations. Then, a Denoising Diffusion Probabilistic Model (DDPM) models prior distributions over latent vectors, guiding the Transformer in capturing diverse feature variations essential for restoration. A key innovation in PPTRN is the Probabilistic Prior Driven Cross Attention mechanism, which integrates the DDPM-generated prior with feature embeddings to reduce artifacts and enhance spatial coherence. Extensive experiments validate that PPTRN significantly improves restoration quality on turbulence-degraded images, setting a new benchmark in clarity and structural fidelity.||
|**2024-11-15**|[Morpho-Aware Global Attention for Image Matting](http://arxiv.org/abs/2411.10251)|null|Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) face inherent challenges in image matting, particularly in preserving fine structural details. ViTs, with their global receptive field enabled by the self-attention mechanism, often lose local details such as hair strands. Conversely, CNNs, constrained by their local receptive field, rely on deeper layers to approximate global context but struggle to retain fine structures at greater depths. To overcome these limitations, we propose a novel Morpho-Aware Global Attention (MAGA) mechanism, designed to effectively capture the morphology of fine structures. MAGA employs Tetris-like convolutional patterns to align the local shapes of fine structures, ensuring optimal local correspondence while maintaining sensitivity to morphological details. The extracted local morphology information is used as query embeddings, which are projected onto global key embeddings to emphasize local details in a broader context. Subsequently, by projecting onto value embeddings, MAGA seamlessly integrates these emphasized morphological details into a unified global structure. This approach enables MAGA to simultaneously focus on local morphology and unify these details into a coherent whole, effectively preserving fine structures. Extensive experiments show that our MAGA-based ViT achieves significant performance gains, outperforming state-of-the-art methods across two benchmarks with average improvements of 4.3% in SAD and 39.5% in MSE.||
|**2024-11-15**|[A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift](http://arxiv.org/abs/2411.10231)|null|Transformer-based Super-Resolution (SR) models have recently advanced image reconstruction quality, yet challenges remain due to computational complexity and an over-reliance on large patch sizes, which constrain fine-grained detail enhancement. In this work, we propose TaylorIR to address these limitations by utilizing a patch size of 1x1, enabling pixel-level processing in any transformer-based SR model. To address the significant computational demands under the traditional self-attention mechanism, we employ the TaylorShift attention mechanism, a memory-efficient alternative based on Taylor series expansion, achieving full token-to-token interactions with linear complexity. Experimental results demonstrate that our approach achieves new state-of-the-art SR performance while reducing memory consumption by up to 60% compared to traditional self-attention-based transformers.||
|**2024-11-15**|[Memorization in Attention-only Transformers](http://arxiv.org/abs/2411.10115)|**[link](https://github.com/leodana2000/Transformer_Attentional_Memory)**|Recent research has explored the memorization capacity of multi-head attention, but these findings are constrained by unrealistic limitations on the context size. We present a novel proof for language-based Transformers that extends the current hypothesis to any context size. Our approach improves upon the state-of-the-art by achieving more effective exact memorization with an attention layer, while also introducing the concept of approximate memorization of distributions. Through experimental validation, we demonstrate that our proposed bounds more accurately reflect the true memorization capacity of language models, and provide a precise comparison with prior work.||
|**2024-11-14**|[On the Surprising Effectiveness of Attention Transfer for Vision Transformers](http://arxiv.org/abs/2411.09702)|**[link](https://github.com/alexlioralexli/attention-transfer)**|Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high-quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher also further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.||
|**2024-11-14**|[SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers](http://arxiv.org/abs/2411.09420)|**[link](https://github.com/shravan-18/SAG-ViT)**|Image classification is a computer vision task where a model analyzes an image to categorize it into a specific label. Vision Transformers (ViT) improve this task by leveraging self-attention to capture complex patterns and long range relationships between image patches. However, a key challenge for ViTs is efficiently incorporating multiscale feature representations, which is inherent in CNNs through their hierarchical structure. In this paper, we introduce the Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework that addresses this challenge by integrating multi-scale features. Using EfficientNet as a backbone, the model extracts multi-scale feature maps, which are divided into patches to preserve semantic information. These patches are organized into a graph based on spatial and feature similarities, with a Graph Attention Network (GAT) refining the node embeddings. Finally, a Transformer encoder captures long-range dependencies and complex interactions. The SAG-ViT is evaluated on benchmark datasets, demonstrating its effectiveness in enhancing image classification performance.||
|**2024-11-14**|[Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery](http://arxiv.org/abs/2411.09101)|**[link](https://github.com/ashimdahal/vit-vs-cnn-image-segmentation)**|Vision Transformers (ViT) have recently brought a new wave of research in the field of computer vision. These models have done particularly well in the field of image classification and segmentation. Research on semantic and instance segmentation has accelerated with the inception of the new architecture, with over 80% of the top 20 benchmarks for the iSAID dataset being either based on the ViT architecture or the attention mechanism behind its success. This paper focuses on the heuristic comparison of three key factors of using (or not using) ViT for semantic segmentation of remote sensing aerial images on the iSAID dataset. The experimental results were examined against the following objectives: 1. Use of a weighted fused loss function for the maximum mean Intersection over Union (mIoU) score, Dice score, and minimization or conservation of entropy or class representation, 2. Comparison of transfer learning on Meta's MaskFormer, a ViT-based semantic segmentation model, against generic UNet Convolutional Neural Networks (CNNs) judged over mIoU, Dice scores, training efficiency, and inference time, and 3. What do we lose for what we gain? i.e., the comparison of the two models against current state-of-the-art segmentation models. We show that the use of the novel combined weighted loss function significantly boosts the CNN model's performance as compared to transfer learning the ViT. The code for this implementation can be found at https://github.com/ashimdahal/ViT-vs-CNN-ImageSegmentation.||
|**2024-11-13**|[TRACE: Transformer-based Risk Assessment for Clinical Evaluation](http://arxiv.org/abs/2411.08701)|null|We present TRACE (Transformer-based Risk Assessment for Clinical Evaluation), a novel method for clinical risk assessment based on clinical data, leveraging the self-attention mechanism for enhanced feature interaction and result interpretation. Our approach is able to handle different data modalities, including continuous, categorical and multiple-choice (checkbox) attributes. The proposed architecture features a shared representation of the clinical data obtained by integrating specialized embeddings of each data modality, enabling the detection of high-risk individuals using Transformer encoder layers. To assess the effectiveness of the proposed method, a strong baseline based on non-negative multi-layer perceptrons (MLPs) is introduced. The proposed method outperforms various baselines widely used in the domain of clinical risk assessment, while effectively handling missing values. In terms of explainability, our Transformer-based method offers easily interpretable results via attention weights, further enhancing the clinicians' decision-making process.||
|**2024-11-12**|[Rendering-Oriented 3D Point Cloud Attribute Compression using Sparse Tensor-based Transformer](http://arxiv.org/abs/2411.07899)|null|The evolution of 3D visualization techniques has fundamentally transformed how we interact with digital content. At the forefront of this change is point cloud technology, offering an immersive experience that surpasses traditional 2D representations. However, the massive data size of point clouds presents significant challenges in data compression. Current methods for lossy point cloud attribute compression (PCAC) generally focus on reconstructing the original point clouds with minimal error. However, for point cloud visualization scenarios, the reconstructed point clouds with distortion still need to undergo a complex rendering process, which affects the final user-perceived quality. In this paper, we propose an end-to-end deep learning framework that seamlessly integrates PCAC with differentiable rendering, denoted as rendering-oriented PCAC (RO-PCAC), directly targeting the quality of rendered multiview images for viewing. In a differentiable manner, the impact of the rendering process on the reconstructed point clouds is taken into account. Moreover, we characterize point clouds as sparse tensors and propose a sparse tensor-based transformer, called SP-Trans. By aligning with the local density of the point cloud and utilizing an enhanced local attention mechanism, SP-Trans captures the intricate relationships within the point cloud, further improving feature analysis and synthesis within the framework. Extensive experiments demonstrate that the proposed RO-PCAC achieves state-of-the-art compression performance, compared to existing reconstruction-oriented methods, including traditional, learning-based, and hybrid methods.||
|**2024-11-12**|[Joint multi-dimensional dynamic attention and transformer for general image restoration](http://arxiv.org/abs/2411.07893)|**[link](https://github.com/house-yuyu/mdda-former)**|Outdoor images often suffer from severe degradation due to rain, haze, and noise, impairing image quality and challenging high-level tasks. Current image restoration methods struggle to handle complex degradation while maintaining efficiency. This paper introduces a novel image restoration architecture that combines multi-dimensional dynamic attention and self-attention within a U-Net framework. To leverage the global modeling capabilities of transformers and the local modeling capabilities of convolutions, we integrate sole CNNs in the encoder-decoder and sole transformers in the latent layer. Additionally, we design convolutional kernels with selected multi-dimensional dynamic attention to capture diverse degraded inputs efficiently. A transformer block with transposed self-attention further enhances global feature extraction while maintaining efficiency. Extensive experiments demonstrate that our method achieves a better balance between performance and computational complexity across five image restoration tasks: deraining, deblurring, denoising, dehazing, and enhancement, as well as superior performance for high-level vision tasks. The source code will be available at https://github.com/House-yuyu/MDDA-former.||
|**2024-11-14**|[Breaking the Low-Rank Dilemma of Linear Attention](http://arxiv.org/abs/2411.07635)|**[link](https://github.com/qhfan/rala)**|The Softmax attention mechanism in Transformer models is notoriously computationally expensive, particularly due to its quadratic complexity, posing significant challenges in vision applications. In contrast, linear attention provides a far more efficient solution by reducing the complexity to linear levels. However, compared to Softmax attention, linear attention often experiences significant performance degradation. Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map, which hinders its ability to adequately model complex spatial information. In this paper, to break the low-rank dilemma of linear attention, we conduct rank analysis from two perspectives: the KV buffer and the output features. Consequently, we introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency. Based on RALA, we construct the Rank-Augmented Vision Linear Transformer (RAVLT). Extensive experiments demonstrate that RAVLT achieves excellent performance across various vision tasks. Specifically, without using any additional labels, data, or supervision during training, RAVLT achieves an 84.4% Top-1 accuracy on ImageNet-1k with only 26M parameters and 4.6G FLOPs. This result significantly surpasses previous linear attention mechanisms, fully illustrating the potential of RALA. Code will be available at https://github.com/qhfan/RALA.||
|**2024-11-12**|[Circuit Complexity Bounds for RoPE-based Transformer Architecture](http://arxiv.org/abs/2411.07602)|null|Characterizing the expressive power of the Transformer architecture is critical to understanding its capacity limits and scaling law. Recent works provide circuit complexity bounds for Transformer-like architectures. On the other hand, Rotary Position Embedding ($\mathsf{RoPE}$) has emerged as a crucial technique in modern large language models, offering superior performance in capturing positional information compared to traditional position embeddings, which shows great potential in applications, particularly for the long-context scenario. Empirical evidence also suggests that $\mathsf{RoPE}$-based Transformer architectures demonstrate greater generalization capabilities compared to conventional Transformer models. In this work, we establish a tighter circuit complexity bound for Transformers with $\mathsf{RoPE}$ attention. Our key contribution is that we show that unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$-precision, $O(1)$ layers, hidden dimension $d \leq O(n)$ cannot solve the arithmetic problem or the Boolean formula value problem. This result demonstrates the fundamental limitation of the expressivity of the $\mathsf{RoPE}$-based Transformer architecture, although it achieves giant empirical success. Our theoretical framework not only establishes tighter complexity bounds but also may instruct further work on the $\mathsf{RoPE}$-based Transformer.||
|**2024-11-12**|[Unraveling the Gradient Descent Dynamics of Transformers](http://arxiv.org/abs/2411.07538)|null|While the Transformer architecture has achieved remarkable success across various domains, a thorough theoretical foundation explaining its optimization dynamics is yet to be fully developed. In this study, we aim to bridge this understanding gap by answering the following two core questions: (1) Which types of Transformer architectures allow Gradient Descent (GD) to achieve guaranteed convergence? and (2) Under what initial conditions and architectural specifics does the Transformer achieve rapid convergence during training? By analyzing the loss landscape of a single Transformer layer using Softmax and Gaussian attention kernels, our work provides concrete answers to these questions. Our findings demonstrate that, with appropriate weight initialization, GD can train a Transformer model (with either kernel type) to achieve a global optimal solution, especially when the input embedding dimension is large. Nonetheless, certain scenarios highlight potential pitfalls: training a Transformer using the Softmax attention kernel may sometimes lead to suboptimal local solutions. In contrast, the Gaussian attention kernel exhibits much more favorable behavior. Our empirical study further validates the theoretical findings.||
|**2024-11-11**|[Spiking Transformer Hardware Accelerators in 3D Integration](http://arxiv.org/abs/2411.07397)|null|Spiking neural networks (SNNs) are powerful models of spatiotemporal computation and are well suited for deployment on resource-constrained edge devices and neuromorphic hardware due to their low power consumption. Leveraging attention mechanisms similar to those found in their artificial neural network counterparts, recently emerged spiking transformers have showcased promising performance and efficiency by capitalizing on the binary nature of spiking operations. Recognizing the current lack of dedicated hardware support for spiking transformers, this paper presents the first work on 3D spiking transformer hardware architecture and design methodology. We present an architecture and physical design co-optimization approach tailored specifically for spiking transformers. Through memory-on-logic and logic-on-logic stacking enabled by 3D integration, we demonstrate significant energy and delay improvements compared to conventional 2D CMOS integration.||
|**2024-11-11**|[More Expressive Attention with Negative Weights](http://arxiv.org/abs/2411.07176)|**[link](https://github.com/trestad/cogattn)**|We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention can shift the token deletion and copying function from a static OV matrix to dynamic QK inner products, with the OV matrix now focusing more on refinement or modification. The attention head can simultaneously delete, copy, or retain tokens by assigning them negative, positive, or minimal attention weights, respectively. As a result, a single attention head becomes more flexible and expressive. (2) Cog Attention improves the model's robustness against representational collapse, which can occur when earlier tokens are over-squashed into later positions, leading to homogeneous representations. Negative weights reduce effective information paths from earlier to later tokens, helping to mitigate this issue. We develop Transformer-like models which use Cog Attention as attention modules, including decoder-only models for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention exhibit superior performance compared to those employing traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.||
|**2024-11-11**|[ConvMixFormer- A Resource-efficient Convolution Mixer for Transformer-based Dynamic Hand Gesture Recognition](http://arxiv.org/abs/2411.07118)|**[link](https://github.com/mallikagarg/convmixformer)**|Transformer models have demonstrated remarkable success in many domains such as natural language processing (NLP) and computer vision. With the growing interest in transformer-based architectures, they are now utilized for gesture recognition. We therefore explore and devise a novel ConvMixFormer architecture for dynamic hand gestures. Self-attention in transformers scales quadratically with the length of the sequential data, which makes these models computationally complex and heavy. To address this drawback, we design a resource-efficient model that replaces the self-attention in the transformer with a simple convolutional layer-based token mixer. The computational cost and the number of parameters of the convolution-based mixer are lower than those of quadratic self-attention. The convolution mixer helps the model capture the local spatial features that self-attention struggles to capture due to its sequential processing nature. Further, an efficient gate mechanism is employed instead of a conventional feed-forward network in the transformer to help the model control the flow of features within different stages of the proposed model. This design uses nearly half the learnable parameters of the vanilla transformer, which helps in fast and efficient training. The proposed method is evaluated on the NVidia Dynamic Hand Gesture and Briareo datasets, and our model achieves state-of-the-art results on single and multimodal inputs. We also show the parameter efficiency of the proposed ConvMixFormer model compared to other methods. The source code is available at https://github.com/mallikagarg/ConvMixFormer.||
|**2024-11-12**|[SPARTAN: A Sparse Transformer Learning Local Causation](http://arxiv.org/abs/2411.06890)|null|Causal structures play a central role in world models that flexibly adapt to changes in the environment. While recent works motivate the benefits of discovering local causal graphs for dynamics modelling, in this work we demonstrate that accurately capturing these relationships in complex settings remains challenging for the current state-of-the-art. To remedy this shortcoming, we postulate that sparsity is a critical ingredient for the discovery of such local causal structures. To this end we present the SPARse TrANsformer World model (SPARTAN), a Transformer-based world model that learns local causal structures between entities in a scene. By applying sparsity regularisation on the attention pattern between object-factored tokens, SPARTAN identifies sparse local causal models that accurately predict future object states. Furthermore, we extend our model to capture sparse interventions with unknown targets on the dynamics of the environment. This results in a highly interpretable world model that can efficiently adapt to changes. Empirically, we evaluate SPARTAN against the current state-of-the-art in object-centric world models on observation-based environments and demonstrate that our model can learn accurate local causal graphs and achieve significantly improved few-shot adaptation to changes in the dynamics of the environment as well as robustness against removing irrelevant distractors.||
|**2024-11-11**|[Spatially Constrained Transformer with Efficient Global Relation Modelling for Spatio-Temporal Prediction](http://arxiv.org/abs/2411.06836)|**[link](https://github.com/ashusao/st-samplenet)**|Accurate spatio-temporal prediction is crucial for the sustainable development of smart cities. However, current approaches often struggle to capture important spatio-temporal relationships, particularly overlooking global relations among distant city regions. Most existing techniques predominantly rely on Convolutional Neural Networks (CNNs) to capture global relations. However, CNNs exhibit neighbourhood bias, making them insufficient for capturing distant relations. To address this limitation, we propose ST-SampleNet, a novel transformer-based architecture that combines CNNs with self-attention mechanisms to capture both local and global relations effectively. Moreover, as the number of regions increases, the quadratic complexity of self-attention becomes a challenge. To tackle this issue, we introduce a lightweight region sampling strategy that prunes non-essential regions and enhances the efficiency of our approach. Furthermore, we introduce a spatially constrained position embedding that incorporates spatial neighbourhood information into the self-attention mechanism, aiding in semantic interpretation and improving the performance of ST-SampleNet. Our experimental evaluation on three real-world datasets demonstrates the effectiveness of ST-SampleNet. Additionally, our efficient variant achieves a 40% reduction in computational costs with only a marginal compromise in performance, approximately 1%.||
|**2024-11-08**|[AuthFormer: Adaptive Multimodal biometric authentication transformer for middle-aged and elderly people](http://arxiv.org/abs/2411.05395)|null|Multimodal biometric authentication methods address the limitations of unimodal biometric technologies in security, robustness, and user adaptability. However, most existing methods depend on fixed combinations and numbers of biometric modalities, which restricts flexibility and adaptability in real-world applications. To overcome these challenges, we propose an adaptive multimodal biometric authentication model, AuthFormer, tailored for elderly users. AuthFormer is trained on the LUTBIO multimodal biometric database, containing biometric data from elderly individuals. By incorporating a cross-attention mechanism and a Gated Residual Network (GRN), the model improves adaptability to physiological variations in elderly users. Experiments show that AuthFormer achieves an accuracy of 99.73%. Additionally, its encoder requires only two layers to perform optimally, reducing complexity compared to traditional Transformer-based models.||
|**2024-11-07**|[Clustering in Causal Attention Masking](http://arxiv.org/abs/2411.04990)|null|This work presents a modification of the self-attention dynamics proposed by Geshkovski et al. (arXiv:2312.10794) to better reflect the practically relevant, causally masked attention used in transformer architectures for generative AI. This modification translates into an interacting particle system that cannot be interpreted as a mean-field gradient flow. Despite this loss of structure, we significantly strengthen the results of Geshkovski et al. (arXiv:2312.10794) in this context: While previous rigorous results focused on cases where all three matrices (Key, Query, and Value) were scaled identities, we prove asymptotic convergence to a single cluster for arbitrary key-query matrices and a value matrix equal to the identity. Additionally, we establish a connection to the classical Rényi parking problem from combinatorial geometry to make initial theoretical steps towards demonstrating the existence of meta-stable states.||
|**2024-11-07**|[AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation](http://arxiv.org/abs/2411.04967)|null|Neural network architecture design requires making many crucial decisions. A common desideratum is that similar decisions, with few modifications, can be reused in a variety of tasks and applications. To satisfy that, architectures must provide promising latency and performance trade-offs, support a variety of tasks, scale efficiently with respect to the amounts of data and compute, leverage available data from other tasks, and efficiently support various hardware. To this end, we introduce AsCAN, a hybrid architecture combining both convolutional and transformer blocks. We revisit the key design principles of hybrid architectures and propose a simple and effective *asymmetric* architecture, where the distribution of convolutional and transformer blocks is *asymmetric*, containing more convolutional blocks in the earlier stages, followed by more transformer blocks in later stages. AsCAN supports a variety of tasks (recognition, segmentation, and class-conditional image generation) and features a superior trade-off between performance and latency. We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance compared to the most recent public and commercial models. Notably, even without any computation optimization for transformer blocks, our models still yield faster inference speed than existing works featuring efficient attention mechanisms, highlighting the advantages and the value of our approach.||
|**2024-11-07**|[High Entropy Alloy property predictions using Transformer-based language model](http://arxiv.org/abs/2411.04861)|null|This study introduces a language transformer-based machine learning model to predict key mechanical properties of high-entropy alloys (HEAs), addressing the challenges due to their complex, multi-principal element compositions and limited experimental data. By pre-training the transformer on extensive synthetic materials data and fine-tuning it with specific HEA datasets, the model effectively captures intricate elemental interactions through self-attention mechanisms. This approach mitigates data scarcity issues via transfer learning, enhancing predictive accuracy for properties like elongation (%) and ultimate tensile strength (UTS) compared to traditional regression models such as Random Forests and Gaussian Processes. The model's interpretability is enhanced by visualizing attention weights, revealing significant elemental relationships that align with known metallurgical principles. This work demonstrates the potential of transformer models to accelerate materials discovery and optimization, enabling accurate property predictions, thereby advancing the field of materials informatics.||
|**2024-11-07**|[How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis](http://arxiv.org/abs/2411.04105)|null|Large language models (LLMs) have shown amazing performance on tasks that require planning and reasoning. Motivated by this, we investigate the internal mechanisms that underpin a network's ability to perform complex logical reasoning. We first construct a synthetic propositional logic problem that serves as a concrete test-bed for network training and evaluation. Crucially, this problem demands nontrivial planning to solve, but we can train a small transformer to achieve perfect accuracy. Building on our set-up, we then pursue an understanding of precisely how a three-layer transformer, trained from scratch, solves this problem. We are able to identify certain "planning" and "reasoning" circuits in the network that necessitate cooperation between the attention blocks to implement the desired logic. To expand our findings, we then study a larger model, Mistral 7B. Using activation patching, we characterize internal components that are critical in solving our logic problem. Overall, our work systematically uncovers novel aspects of small and large transformers, and continues the study of how they plan and reason.||
|**2024-11-07**|[$k$NN Attention Demystified: A Theoretical Exploration for Scalable Transformers](http://arxiv.org/abs/2411.04013)|**[link](https://github.com/sansui-123/knn_attention)**|Despite their power, Transformers face challenges with long sequences due to the quadratic complexity of self-attention. To address this limitation, methods like $k$-Nearest-Neighbor ($k$NN) attention have been introduced [Roy, Saffar, Vaswani, Grangier, 2021], enabling each token to attend to only its $k$ closest tokens. While $k$NN attention has shown empirical success in making Transformers more efficient, its exact approximation guarantees have not been theoretically analyzed. In this work, we establish a theoretical framework for $k$NN attention, reformulating self-attention as expectations over softmax distributions and leveraging lazy Gumbel sampling [Mussmann, Levy, Ermon, 2017] with $k$NN indices for efficient approximation. Building on this framework, we also propose novel sub-quadratic algorithms that approximate self-attention gradients by leveraging efficient sampling techniques, such as Markov Chain-based estimation. Finally, we demonstrate the practical effectiveness of these algorithms through empirical experiments, showcasing their benefits in both training and inference.||
|**2024-11-05**|[LASER: Attention with Exponential Transformation](http://arxiv.org/abs/2411.03493)|null|Transformers have had tremendous impact on several sequence-related tasks, largely due to their ability to retrieve from any part of the sequence via softmax-based dot-product attention. This mechanism plays a crucial role in the Transformer's performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This poor gradient signal backpropagation can lead to inefficient learning of parameters preceding the attention operations. To this end, we introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal. We show that LASER Attention can be implemented by making small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 2.2 billion parameters where we show up to 3.38% and an average of ~1% improvement over standard attention on downstream evaluations. Using LASER gives the following relative improvements in generalization performance across a variety of tasks (vision, text and speech): 4.67% accuracy in Vision Transformer (ViT) on ImageNet, 2.25% error rate in Conformer on the Librispeech speech-to-text and 0.93% fraction of incorrect predictions in BERT with 2.2 billion parameters.||
|**2024-11-05**|[Enhanced Real-Time Threat Detection in 5G Networks: A Self-Attention RNN Autoencoder Approach for Spectral Intrusion Analysis](http://arxiv.org/abs/2411.03365)|null|In the rapidly evolving landscape of 5G technology, safeguarding Radio Frequency (RF) environments against sophisticated intrusions is paramount, especially in dynamic spectrum access and management. This paper presents an enhanced experimental model that integrates a self-attention mechanism with a Recurrent Neural Network (RNN)-based autoencoder for the detection of anomalous spectral activities in 5G networks at the waveform level. Our approach, grounded in time-series analysis, processes in-phase and quadrature (I/Q) samples to identify irregularities that could indicate potential jamming attacks. The model's architecture, augmented with a self-attention layer, extends the capabilities of RNN autoencoders, enabling a more nuanced understanding of temporal dependencies and contextual relationships within the RF spectrum. Utilizing a simulated 5G Radio Access Network (RAN) test-bed constructed with srsRAN 5G and Software Defined Radios (SDRs), we generated a comprehensive stream of data that reflects real-world RF spectrum conditions and attack scenarios. The model is trained to reconstruct standard signal behavior, establishing a normative baseline against which deviations, indicative of security threats, are identified. The proposed architecture is designed to balance between detection precision and computational efficiency, so the LSTM network, enriched with self-attention, continues to optimize for minimal execution latency and power consumption. Conducted on a real-world SDR-based testbed, our results demonstrate the model's improved performance and accuracy in threat detection. Keywords: self-attention, real-time intrusion detection, RNN autoencoder, Transformer architecture, LSTM, time series anomaly detection, 5G Security, spectrum access security.||
|**2024-11-07**|[DiT4Edit: Diffusion Transformer for Image Editing](http://arxiv.org/abs/2411.03286)|null|Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate superior capabilities to effectively capture the long-range dependencies among patches, leading to higher-quality image generation. In this paper, we propose DiT4Edit, the first Diffusion Transformer-based image editing framework. Specifically, DiT4Edit uses the DPM-Solver inversion algorithm to obtain the inverted latents, reducing the number of steps compared to the DDIM inversion algorithm commonly used in UNet-based frameworks. Additionally, we design unified attention control and patches merging, tailored for transformer computation streams. This integration allows our framework to generate higher-quality edited images faster. Our design leverages the advantages of DiT, enabling it to surpass UNet structures in image editing, especially in high-resolution and arbitrary-size images. Extensive experiments demonstrate the strong performance of DiT4Edit across various editing scenarios, highlighting the potential of Diffusion Transformers in supporting image editing.||
|**2024-11-05**|[Rethinking Decoders for Transformer-based Semantic Segmentation: Compression is All You Need](http://arxiv.org/abs/2411.03033)|**[link](https://github.com/qishuaiwen/depict)**|State-of-the-art methods for Transformer-based semantic segmentation typically adopt Transformer decoders that are used to extract additional embeddings from image embeddings via cross-attention, refine either or both types of embeddings via self-attention, and project image embeddings onto the additional embeddings via dot-product. Despite their remarkable success, these empirical designs still lack theoretical justifications or interpretations, thus hindering potentially principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between the Transformer decoders and Principal Component Analysis (PCA). From such a perspective, we derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT), with the interpretations as follows: 1) the self-attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most information; 2) the cross-attention operator seeks to find a low-rank approximation of the refined image embeddings, which is expected to be a set of orthonormal bases of the principal subspace and corresponds to the predefined classes; 3) the dot-product operation yields compact representation for image embeddings as segmentation masks. Experiments on the ADE20K dataset show that DEPICT consistently outperforms its black-box counterpart, Segmenter, and that it is lightweight and more robust.||
|**2024-11-05**|[Transformer-Based Fault-Tolerant Control for Fixed-Wing UAVs Using Knowledge Distillation and In-Context Adaptation](http://arxiv.org/abs/2411.02975)|null|This study presents a transformer-based approach for fault-tolerant control in fixed-wing Unmanned Aerial Vehicles (UAVs), designed to adapt in real time to dynamic changes caused by structural damage or actuator failures. Unlike traditional Flight Control Systems (FCSs) that rely on classical control theory and struggle under severe alterations in dynamics, our method directly maps outer-loop reference values -- altitude, heading, and airspeed -- into control commands using the in-context learning and attention mechanisms of transformers, thus bypassing inner-loop controllers and fault-detection layers. Employing a teacher-student knowledge distillation framework, the proposed approach trains a student agent with partial observations by transferring knowledge from a privileged expert agent with full observability, enabling robust performance across diverse failure scenarios. Experimental results demonstrate that our transformer-based controller outperforms industry-standard FCS and state-of-the-art reinforcement learning (RL) methods, maintaining high tracking accuracy and stability in nominal conditions and extreme failure cases, highlighting its potential for enhancing UAV operational safety and reliability.||
|**2024-11-04**|[Adaptive Caching for Faster Video Generation with Diffusion Transformers](http://arxiv.org/abs/2411.02397)|null|Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": meaning, some videos require fewer denoising steps to attain a reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing the generation quality, across multiple video DiT baselines.||
|**2024-11-04**|[Training-free Regional Prompting for Diffusion Transformers](http://arxiv.org/abs/2411.02395)|**[link](https://github.com/antonioo-c/regional-prompting-flux)**|Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which enables DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.||
|**2024-11-04**|[Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning](http://arxiv.org/abs/2411.02199)|null|Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities. Existing empirical studies have revealed a strong connection between these LLMs' impressive emergence abilities and their in-context learning (ICL) capacity, allowing them to solve new tasks using only task-specific prompts without further fine-tuning. On the other hand, existing empirical and theoretical studies also show that there is a linear regularity of the multi-concept encoded semantic representation behind transformer-based LLMs. However, existing theoretical work fails to build up an understanding of the connection between this regularity and the innovative power of ICL. Additionally, prior work often focuses on simplified, unrealistic scenarios involving linear transformers or unrealistic loss functions, and achieves only linear or sub-linear convergence rates. In contrast, this work provides a fine-grained mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities, offering insights into how transformers innovate solutions for certain unseen tasks encoded with multiple cross-concept semantics. Inspired by empirical studies on the linear latent geometry of LLMs, the analysis is based on a concept-based low-noise sparse coding prompt model. Leveraging advanced techniques, this work showcases the exponential 0-1 loss convergence over the highly non-convex training dynamics, which pioneeringly incorporates the challenges of softmax self-attention, ReLU-activated MLPs, and cross-entropy loss. Empirical simulations corroborate the theoretical findings.||
|**2024-11-04**|[Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention](http://arxiv.org/abs/2411.02063)|**[link](https://github.com/tsinghuac3i/lpa)**|Improving the effectiveness and efficiency of large language models (LLMs) simultaneously is a critical yet challenging research goal. In this paper, we find that low-rank pre-training, normally considered an efficient method that compromises performance, can be scalably effective when reduced parameters are precisely targeted. Specifically, applying the low-dimensional module only to the attention layer resolves this issue and enhances both effectiveness and efficiency. We refer to this structure as Low-dimensional Projected Attention (LPA) and provide an explanatory analysis. Through extensive experimentation at parameter scales of 130M, 370M, and scaling up to 3B, we have validated the effectiveness and scalability of LPA. Our results show that the LPA model can save up to 12.4% in time while achieving an approximate 5% improvement in test perplexity (ppl) and on downstream tasks compared with the vanilla Transformer.||
|**2024-11-04**|[UnSegMedGAT: Unsupervised Medical Image Segmentation using Graph Attention Networks Clustering](http://arxiv.org/abs/2411.01966)|**[link](https://github.com/mudit-adityaja/unsegmedgat)**|The data-intensive nature of supervised classification drives the interest of the researchers towards unsupervised approaches, especially for problems such as medical image segmentation, where labeled data is scarce. Building on the recent advancements of Vision transformers (ViT) in computer vision, we propose an unsupervised segmentation framework using a pre-trained Dino-ViT. In the proposed method, we leverage the inherent graph structure within the image to realize a significant performance gain for segmentation in medical images. For this, we introduce a modularity-based loss function coupled with a Graph Attention Network (GAT) to effectively capture the inherent graph topology within the image. Our method achieves state-of-the-art performance, even significantly surpassing or matching that of existing (semi)supervised techniques such as MedSAM, which is a Segment Anything Model for medical images. We demonstrate this using two challenging medical image datasets, ISIC-2018 and CVC-ColonDB. This work underscores the potential of unsupervised approaches in advancing medical image analysis in scenarios where labeled data is scarce. The GitHub repository of the code is available at https://github.com/mudit-adityaja/UnSegMedGAT.||
|**2024-11-04**|[ElasTST: Towards Robust Varied-Horizon Forecasting with Elastic Time-Series Transformer](http://arxiv.org/abs/2411.01842)|**[link](https://github.com/microsoft/probts)**|Numerous industrial sectors necessitate models capable of providing robust forecasts across various horizons. Despite the recent strides in crafting specific architectures for time-series forecasting and developing pre-trained universal models, a comprehensive examination of their capability in accommodating varied-horizon forecasting during inference is still lacking. This paper bridges this gap through the design and evaluation of the Elastic Time-Series Transformer (ElasTST). The ElasTST model incorporates a non-autoregressive design with placeholders and structured self-attention masks, warranting future outputs that are invariant to adjustments in inference horizons. A tunable version of rotary position embedding is also integrated into ElasTST to capture time-series-specific periods and enhance adaptability to different horizons. Additionally, ElasTST employs a multi-scale patch design, effectively integrating both fine-grained and coarse-grained information. During the training phase, ElasTST uses a horizon reweighting strategy that approximates the effect of random sampling across multiple horizons with a single fixed horizon setting. Through comprehensive experiments and comparisons with state-of-the-art time-series architectures and contemporary foundation models, we demonstrate the efficacy of ElasTST's unique design elements. Our findings position ElasTST as a robust solution for the practical necessity of varied-horizon forecasting.||
|**2024-11-05**|[MSTA3D: Multi-scale Twin-attention for 3D Instance Segmentation](http://arxiv.org/abs/2411.01781)|null|Recently, transformer-based techniques incorporating superpoints have become prevalent in 3D instance segmentation. However, they often encounter an over-segmentation problem, especially noticeable with large objects. Additionally, unreliable mask predictions stemming from superpoint mask prediction further compound this issue. To address these challenges, we propose a novel framework called MSTA3D. It leverages multi-scale feature representation and introduces a twin-attention mechanism to effectively capture them. Furthermore, MSTA3D integrates a box query with a box regularizer, offering a complementary spatial constraint alongside semantic queries. Experimental evaluations on ScanNetV2, ScanNet200 and S3DIS datasets demonstrate that our approach surpasses state-of-the-art 3D instance segmentation methods.||
|**2024-10-31**|[Length-Induced Embedding Collapse in Transformer-based Models](http://arxiv.org/abs/2410.24200)|null|Text embeddings enable various applications, but their performance deteriorates on longer texts. In this paper, we find that the performance degradation is due to a phenomenon called Length Collapse, where longer text embeddings collapse into a narrow space. This collapse results in a distributional inconsistency between embeddings of different text lengths, ultimately hurting the performance of downstream tasks. Theoretically, by considering that the self-attention mechanism inherently functions as a low-pass filter, we prove that long sequences increase the attenuation rate of the low-pass filter effect of the self-attention mechanism. With layers going deeper, excessive low-pass filtering causes the token signals to retain only their Direct-Current (DC) component, which means the input token feature maps will collapse into a narrow space, especially in long texts. Based on the above analysis, we propose to mitigate the undesirable length collapse limitation by introducing a temperature in softmax(), which achieves a higher low-filter attenuation rate. The tuning-free method, called TempScale, can be plugged into multiple transformer-based embedding models. Empirically, we demonstrate that TempScale can improve existing embedding models, especially on long text inputs, bringing up to 0.53% performance gains on 40 datasets from Massive Text Embedding Benchmark (MTEB) and 0.82% performance gains on 4 datasets from LongEmbed, which specifically focuses on long context retrieval.||
|**2024-10-31**|[Ada-MSHyper: Adaptive Multi-Scale Hypergraph Transformer for Time Series Forecasting](http://arxiv.org/abs/2410.23992)|**[link](https://github.com/shangzongjiang/Ada-MSHyper)**|Although transformer-based methods have achieved great success in multi-scale temporal pattern interaction modeling, two key challenges limit their further development: (1) Individual time points contain less semantic information, and leveraging attention to model pair-wise interactions may cause the information utilization bottleneck. (2) Multiple inherent temporal variations (e.g., rising, falling, and fluctuating) are entangled in temporal patterns. To this end, we propose Adaptive Multi-Scale Hypergraph Transformer (Ada-MSHyper) for time series forecasting. Specifically, an adaptive hypergraph learning module is designed to provide foundations for modeling group-wise interactions, then a multi-scale interaction module is introduced to promote more comprehensive pattern interactions at different scales. In addition, a node and hyperedge constraint mechanism is introduced to cluster nodes with similar semantic information and differentiate the temporal variations within each scale. Extensive experiments on 11 real-world datasets demonstrate that Ada-MSHyper achieves state-of-the-art performance, reducing prediction errors by an average of 4.56%, 10.38%, and 4.97% in MSE for long-range, short-range, and ultra-long-range time series forecasting, respectively. Code is available at https://github.com/shangzongjiang/Ada-MSHyper.||
|**2024-10-31**|[Weight decay induces low-rank attention layers](http://arxiv.org/abs/2410.23819)|null|The effect of regularizers such as weight decay when training deep neural networks is not well understood. We study the influence of weight decay as well as $L2$-regularization when training neural network models in which parameter matrices interact multiplicatively. This combination is of particular interest as this parametrization is common in attention layers, the workhorse of transformers. Here, key-query, as well as value-projection parameter matrices, are multiplied directly with each other: $W_K^TW_Q$ and $PW_V$. We extend previous results and show on one hand that any local minimum of a $L2$-regularized loss of the form$ L(AB^\top) + \lambda (\|A\|
|**2024-11-01**|[Human Action Recognition (HAR) Using Skeleton-based Spatial Temporal Relative Transformer Network: ST-RTR](http://arxiv.org/abs/2410.23806)|null|Human Action Recognition (HAR) is an interesting research area in human-computer interaction used to monitor the activities of elderly and disabled individuals affected by physical and mental health. In the recent era, skeleton-based HAR has received much attention because skeleton data has shown that it can handle changes in striking, body size, camera views, and complex backgrounds. One key characteristic of ST-GCN is automatically learning spatial and temporal patterns from skeleton sequences. It has some limitations, as this method only works for short-range correlation due to its limited receptive field. However, understanding human action requires long-range interconnection. To address this issue, we developed a spatial-temporal relative transformer (ST-RTR) model. The ST-RTR includes joint and relay nodes, which allow efficient communication and data transmission within the network. These nodes help to break the inherent spatial and temporal skeleton topologies, which enables the model to understand long-range human action better. Furthermore, we combine ST-RTR with a fusion model for further performance improvements. To assess the performance of the ST-RTR method, we conducted experiments on three skeleton-based HAR benchmarks: NTU RGB+D 60, NTU RGB+D 120, and UAV-Human. It boosted CS and CV by 2.11% and 1.45% on NTU RGB+D 60, and by 1.25% and 1.05% on NTU RGB+D 120. On the UAV-Human dataset, accuracy improved by 2.54%. The experimental outcomes show that the proposed ST-RTR model significantly improves action recognition compared with the standard ST-GCN method.||
|**2024-10-31**|[EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching](http://arxiv.org/abs/2410.23788)|**[link](https://github.com/xinwangchen/edt)**|Transformer-based Diffusion Probabilistic Models (DPMs) have shown more potential than CNN-based DPMs, yet their extensive computational requirements hinder widespread practical applications. To reduce the computation budget of transformer-based DPMs, this work proposes the Efficient Diffusion Transformer (EDT) framework. The framework includes a lightweight-design diffusion model architecture, and a training-free Attention Modulation Matrix and its alternation arrangement in EDT inspired by human-like sketching. Additionally, we propose a token relation-enhanced masking training strategy tailored explicitly for EDT to augment its token relation learning capability. Our extensive experiments demonstrate the efficacy of EDT. The EDT framework reduces training and inference costs and surpasses existing transformer-based diffusion models in image synthesis performance, thereby achieving a significant overall enhancement. With lower FID, EDT-S, EDT-B, and EDT-XL attained speed-ups of 3.93x, 2.84x, and 1.92x respectively in the training phase, and 2.29x, 2.29x, and 2.22x respectively in inference, compared to the corresponding sizes of MDTv2. The source code is released at https://github.com/xinwangChen/EDT.||
|**2024-11-01**|[In-Context LoRA for Diffusion Transformers](http://arxiv.org/abs/2410.23775)|**[link](https://github.com/ali-vilab/In-Context-LoRA)**|Recent research arXiv:2410.15027 has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., $20\sim 100$ samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at https://github.com/ali-vilab/In-Context-LoRA||
|**2024-10-31**|[MLLA-UNet: Mamba-like Linear Attention in an Efficient U-Shape Model for Medical Image Segmentation](http://arxiv.org/abs/2410.23738)|**[link](https://github.com/csyfjiang/mlla-unet)**|Recent advancements in medical imaging have resulted in more complex and diverse images, with challenges such as high anatomical variability, blurred tissue boundaries, low organ contrast, and noise. Traditional segmentation methods struggle to address these challenges, making deep learning approaches, particularly U-shaped architectures, increasingly prominent. However, the quadratic complexity of standard self-attention makes Transformers computationally prohibitive for high-resolution images. To address these challenges, we propose MLLA-UNet (Mamba-Like Linear Attention UNet), a novel architecture that achieves linear computational complexity while maintaining high segmentation accuracy through its innovative combination of linear attention and Mamba-inspired adaptive mechanisms, complemented by an efficient symmetric sampling structure for enhanced feature processing. Our architecture effectively preserves essential spatial features while capturing long-range dependencies at reduced computational complexity. Additionally, we introduce a novel sampling strategy for multi-scale feature fusion. Experiments demonstrate that MLLA-UNet achieves state-of-the-art performance on six challenging datasets with 24 different segmentation tasks, including but not limited to FLARE22, AMOS CT, and ACDC, with an average DSC of 88.32%. These results underscore the superiority of MLLA-UNet over existing methods. Our contributions include the novel 2D segmentation architecture and its empirical validation. The code is available via https://github.com/csyfjiang/MLLA-UNet.||
|**2024-10-31**|[Context-Aware Token Selection and Packing for Enhanced Vision Transformer](http://arxiv.org/abs/2410.23608)|null|In recent years, the long-range attention mechanism of vision transformers has driven significant performance breakthroughs across various computer vision tasks. However, the traditional self-attention mechanism, which processes both informative and non-informative tokens, suffers from inefficiency and inaccuracies. While sparse attention mechanisms have been introduced to mitigate these issues by pruning tokens involved in attention, they often lack context-awareness and intelligence. These mechanisms frequently apply a uniform token selection strategy across different inputs for batch training or optimize efficiency only for the inference stage. To overcome these challenges, we propose a novel algorithm: Select and Pack Attention (SPA). SPA dynamically selects informative tokens using a low-cost gating layer supervised by selection labels and packs these tokens into new batches, enabling a variable number of tokens to be used in parallelized GPU batch training and inference. Extensive experiments across diverse datasets and computer vision tasks demonstrate that SPA delivers superior performance and efficiency, including a 0.6 mAP improvement in object detection and a 16.4% reduction in computational costs.||
|**2024-10-30**|[A Neural Transformer Framework for Simultaneous Tasks of Segmentation, Classification, and Caller Identification of Marmoset Vocalization](http://arxiv.org/abs/2410.23279)|null|Marmoset, a highly vocalized primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanism. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous CNN-based work has achieved a joint model for call segmentation, classification, and caller identification for marmoset vocalizations. However, the CNN has limitations in modeling long-range acoustic patterns; the Transformer architecture, which has been shown to outperform CNNs, utilizes the self-attention mechanism that efficiently segregates information in parallel over long distances and captures the global structure of marmoset vocalization. We propose using the Transformer to jointly segment and classify the marmoset calls and identify the callers for each vocalization.||
|**2024-10-30**|[DiaMond: Dementia Diagnosis with Multi-Modal Vision Transformers Using MRI and PET](http://arxiv.org/abs/2410.23219)|**[link](https://github.com/ai-med/diamond)**|Diagnosing dementia, particularly for Alzheimer's Disease (AD) and frontotemporal dementia (FTD), is complex due to overlapping symptoms. While magnetic resonance imaging (MRI) and positron emission tomography (PET) data are critical for the diagnosis, integrating these modalities in deep learning faces challenges, often resulting in suboptimal performance compared to using single modalities. Moreover, the potential of multi-modal approaches in differential diagnosis, which holds significant clinical importance, remains largely unexplored. We propose a novel framework, DiaMond, to address these issues with vision Transformers to effectively integrate MRI and PET. DiaMond is equipped with self-attention and a novel bi-attention mechanism that synergistically combine MRI and PET, alongside a multi-modal normalization to reduce redundant dependency, thereby boosting the performance. DiaMond significantly outperforms existing multi-modal methods across various datasets, achieving a balanced accuracy of 92.4% in AD diagnosis, 65.2% for AD-MCI-CN classification, and 76.5% in differential diagnosis of AD and FTD. We also validated the robustness of DiaMond in a comprehensive ablation study. The code is available at https://github.com/ai-med/DiaMond.||
|**2024-10-29**|[Abrupt Learning in Transformers: A Case Study on Matrix Completion](http://arxiv.org/abs/2410.22244)|null|Recent analysis on the training dynamics of Transformers has unveiled an interesting characteristic: the training loss plateaus for a significant number of training steps, and then suddenly (and sharply) drops to near-optimal values. To understand this phenomenon in depth, we formulate the low-rank matrix completion problem as a masked language modeling (MLM) task, and show that it is possible to train a BERT model to solve this task to low error. Furthermore, the loss curve shows a plateau early in training followed by a sudden drop to near-optimal values, despite no changes in the training procedure or hyper-parameters. To gain interpretability insights into this sudden drop, we examine the model's predictions, attention heads, and hidden states before and after this transition. Concretely, we observe that (a) the model transitions from simply copying the masked input to accurately predicting the masked entries; (b) the attention heads transition to interpretable patterns relevant to the task; and (c) the embeddings and hidden states encode information relevant to the problem. We also analyze the training dynamics of individual model components to understand the sudden drop in loss.||
|**2024-10-29**|[MAPUNetR: A Hybrid Vision Transformer and U-Net Architecture for Efficient and Interpretable Medical Image Segmentation](http://arxiv.org/abs/2410.22223)|null|Medical image segmentation is pivotal in healthcare, enhancing diagnostic accuracy, informing treatment strategies, and tracking disease progression. This process allows clinicians to extract critical information from visual data, enabling personalized patient care. However, developing neural networks for segmentation remains challenging, especially when preserving image resolution, which is essential in detecting subtle details that influence diagnoses. Moreover, the lack of transparency in these deep learning models has slowed their adoption in clinical practice. Efforts in model interpretability are increasingly focused on making these models' decision-making processes more transparent. In this paper, we introduce MAPUNetR, a novel architecture that synergizes the strengths of transformer models with the proven U-Net framework for medical image segmentation. Our model addresses the resolution preservation challenge and incorporates attention maps highlighting segmented regions, increasing accuracy and interpretability. MAPUNetR achieved a Dice score of 0.88 on the BraTS 2020 dataset and a Dice coefficient of 0.92 on the ISIC 2018 dataset. Our experiments show that the model maintains stable performance and potential as a powerful tool for medical image segmentation in clinical practice.||
|**2024-10-29**|[Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech](http://arxiv.org/abs/2410.22179)|**[link](https://github.com/google/sequence-layers/blob/main/examples/very_attentive_tacotron.py)**|Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.||
|**2024-10-29**|[PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement](http://arxiv.org/abs/2410.22059)|null|Scene rearrangement, like table tidying, is a challenging task in robotic manipulation due to the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspectives of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop a representation that integrates generation, segmentation, and feature encoding into a single step to produce object-level representations. Additionally, we introduce perspective control, thus enabling the matching of 6-DoF camera views and extending past approaches that were limited to 3-DoF top-down views. The efficacy of our method is demonstrated through its zero-shot performance in real robot experiments across various scenes, achieving an average matching accuracy and execution success rate of 87% and 67%, respectively.||
|**2024-10-29**|[FakeFormer: Efficient Vulnerability-Driven Transformers for Generalisable Deepfake Detection](http://arxiv.org/abs/2410.21964)|null|Recently, Vision Transformers (ViTs) have achieved unprecedented effectiveness in the general domain of image classification. Nonetheless, these models remain underexplored in the field of deepfake detection, given their lower performance as compared to Convolution Neural Networks (CNNs) in that specific context. In this paper, we start by investigating why plain ViT architectures exhibit a suboptimal performance when dealing with the detection of facial forgeries. Our analysis reveals that, as compared to CNNs, ViT struggles to model localized forgery artifacts that typically characterize deepfakes. Based on this observation, we propose a deepfake detection framework called FakeFormer, which extends ViTs to enforce the extraction of subtle inconsistency-prone information. For that purpose, an explicit attention learning guided by artifact-vulnerable patches and tailored to ViTs is introduced. Extensive experiments are conducted on diverse well-known datasets, including FF++, Celeb-DF, WildDeepfake, DFD, DFDCP, and DFDC. The results show that FakeFormer outperforms the state-of-the-art in terms of generalization and computational cost, without the need for large-scale training datasets. The code is available at https://github.com/10Ring/FakeFormer.||
|**2024-10-29**|[Spatio-temporal Transformers for Action Unit Classification with Event Cameras](http://arxiv.org/abs/2410.21958)|null|Face analysis has been studied from different angles to infer emotion, poses, shapes, and landmarks. Traditionally RGB cameras are used, yet for fine-grained tasks standard sensors might not be up to the task due to their latency, making it impossible to record and detect micro-movements that carry a highly informative signal, which is necessary for inferring the true emotions of a subject. Event cameras have been increasingly gaining interest as a possible solution to this and similar high-frame rate tasks. We propose a novel spatiotemporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered one of the main causes of an existing gap between the maturity of RGB and neuromorphic vision models. Gathering data is harder in the event domain since it cannot be crawled from the web and labeling frames should take into account event aggregation rates and the fact that static parts might not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of RGB videos and event streams. The dataset is annotated at a video level with facial Action Units and contains streams collected with various possible applications, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision bridging the domain gap by representing face shapes in a 3D space. Our proposed model outperforms baseline methods by effectively capturing spatial and temporal information, crucial for recognizing subtle facial micro-expressions.||
|**2024-10-28**|[On Inductive Biases That Enable Generalization of Diffusion Transformers](http://arxiv.org/abs/2410.21273)|**[link](https://github.com/dit-generalization/dit-generalization.github.io)**|Recent work studying the generalization of diffusion models with UNet-based denoisers reveals inductive biases that can be expressed via geometry-adaptive harmonic bases. However, in practice, more recent denoising networks are often based on transformers, e.g., the diffusion transformer (DiT). This raises the question: do transformer-based denoising networks exhibit inductive biases that can also be expressed via geometry-adaptive harmonic bases? To our surprise, we find that this is not the case. This discrepancy motivates our search for the inductive bias that can lead to good generalization in DiT models. Investigating the pivotal attention modules of a DiT, we find that the locality of attention maps is closely associated with generalization. To verify this finding, we modify the generalization of a DiT by restricting its attention windows. We inject local attention windows into a DiT and observe an improvement in generalization. Furthermore, we empirically find that both the placement and the effective attention size of these local attention windows are crucial factors. Experimental results on the CelebA, ImageNet, and LSUN datasets show that strengthening the inductive bias of a DiT can improve both generalization and generation quality when less training data is available. Source code will be released publicly upon paper publication. Project page: dit-generalization.github.io/.||
|**2024-10-29**|[Enhancing Learned Image Compression via Cross Window-based Attention](http://arxiv.org/abs/2410.21144)|**[link](https://github.com/prmudgal/CWAM_IC_ISVC)**|In recent years, learned image compression methods have demonstrated superior rate-distortion performance compared to traditional image compression methods. Recent methods utilize convolutional neural networks (CNN), variational autoencoders (VAE), invertible neural networks (INN), and transformers. Despite their significant contributions, a main drawback of these models is their poor performance in capturing local redundancy. Therefore, to leverage global features along with local redundancy, we propose a CNN-based solution integrated with a feature encoding module. The feature encoding module encodes important features before feeding them to the CNN and then utilizes cross-scale window-based attention, which further captures local redundancy. Cross-scale window-based attention is inspired by the attention mechanism in transformers and effectively enlarges the receptive field. Both the feature encoding module and the cross-scale window-based attention module in our architecture are flexible and can be incorporated into any other network architecture. We evaluate our method on the Kodak and CLIC datasets and demonstrate that our approach is effective and on par with state-of-the-art methods.||
|**2024-10-28**|[LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition](http://arxiv.org/abs/2410.21108)|null|Group Activity Recognition (GAR) remains challenging in computer vision due to the complex nature of multi-agent interactions. This paper introduces LiGAR, a LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition. LiGAR leverages LiDAR data as a structural backbone to guide the processing of visual and textual information, enabling robust handling of occlusions and complex spatial arrangements. Our framework incorporates a Multi-Scale LIDAR Transformer, Cross-Modal Guided Attention, and an Adaptive Fusion Module to integrate multi-modal data at different semantic levels effectively. LiGAR's hierarchical architecture captures group activities at various granularities, from individual actions to scene-level dynamics. Extensive experiments on the JRDB-PAR, Volleyball, and NBA datasets demonstrate LiGAR's superior performance, achieving state-of-the-art results with improvements of up to 10.6% in F1-score on JRDB-PAR and 5.9% in Mean Per Class Accuracy on the NBA dataset. Notably, LiGAR maintains high performance even when LiDAR data is unavailable during inference, showcasing its adaptability. Our ablation studies highlight the significant contributions of each component and the effectiveness of our multi-modal, multi-scale approach in advancing the field of group activity recognition.||
|**2024-10-28**|[Pay Attention to Attention for Sequential Recommendation](http://arxiv.org/abs/2410.21048)|null|Transformer-based approaches have demonstrated remarkable success in various sequence-based tasks. However, traditional self-attention models may not sufficiently capture the intricate dependencies within items in sequential recommendation scenarios. This is due to the lack of explicit emphasis on attention weights, which play a critical role in allocating attention and understanding item-to-item correlations. To better exploit the potential of attention weights and improve the capability of sequential recommendation in learning high-order dependencies, we propose a novel sequential recommendation (SR) approach called attention weight refinement (AWRSR). AWRSR enhances the effectiveness of self-attention by additionally paying attention to attention weights, allowing for more refined attention distributions of correlations among items. We conduct comprehensive experiments on multiple real-world datasets, demonstrating that our approach consistently outperforms state-of-the-art SR models. Moreover, we provide a thorough analysis of AWRSR's effectiveness in capturing higher-level dependencies. These findings suggest that AWRSR offers a promising new direction for enhancing the performance of self-attention architecture in SR tasks, with potential applications in other sequence-based problems as well.||
|**2024-10-25**|[Capsule Endoscopy Multi-classification via Gated Attention and Wavelet Transformations](http://arxiv.org/abs/2410.19363)|**[link](https://github.com/09srinivas2005/capsule-endoscopy-multi-classification-via-gated-attention-and-wavelet-transformations)**|Abnormalities in the gastrointestinal tract significantly influence the patient's health and require a timely diagnosis for effective treatment. With such consideration, an effective automatic classification of these abnormalities from a video capsule endoscopy (VCE) frame is crucial for improvement in diagnostic workflows. The work presents the process of developing and evaluating a novel model designed to classify gastrointestinal anomalies from a VCE video frame. Integration of Omni Dimensional Gated Attention (OGA) mechanism and Wavelet transformation techniques into the model's architecture allowed the model to focus on the most critical areas in the endoscopy images, reducing noise and irrelevant features. This is particularly advantageous in capsule endoscopy, where images often contain a high degree of variability in texture and color. Wavelet transformations contributed by efficiently capturing spatial and frequency-domain information, improving feature extraction, especially for detecting subtle features from the VCE frames. Furthermore, the features extracted from the Stationary Wavelet Transform and Discrete Wavelet Transform are concatenated channel-wise to capture multiscale features, which are essential for detecting polyps, ulcerations, and bleeding. This approach improves classification accuracy on imbalanced capsule endoscopy datasets. The proposed model achieved 92.76% and 91.19% as training and validation accuracies respectively. At the same time, Training and Validation losses are 0.2057 and 0.2700. The proposed model achieved a Balanced Accuracy of 94.81%, AUC of 87.49%, F1-score of 91.11%, precision of 91.17%, recall of 91.19% and specificity of 98.44%. Additionally, the model's performance is benchmarked against two base models, VGG16 and ResNet50, demonstrating its enhanced ability to identify and classify a range of gastrointestinal abnormalities accurately.||
|**2024-10-24**|[DCT-HistoTransformer: Efficient Lightweight Vision Transformer with DCT Integration for histopathological image analysis](http://arxiv.org/abs/2410.19166)|null|In recent years, the integration of advanced imaging techniques and deep learning methods has significantly advanced computer-aided diagnosis (CAD) systems for breast cancer detection and classification. Transformers, which have shown great promise in computer vision, are now being applied to medical image analysis. However, their application to histopathological images presents challenges due to the need for extensive manual annotations of whole-slide images (WSIs), as these models require large amounts of data to work effectively, which is costly and time-consuming. Furthermore, the quadratic computational cost of Vision Transformers (ViTs) is particularly prohibitive for large, high-resolution histopathological images, especially on edge devices with limited computational resources. In this study, we introduce a novel lightweight breast cancer classification approach using transformers that operates effectively without large datasets. By incorporating parallel processing pathways for Discrete Cosine Transform (DCT) Attention and MobileConv, we convert image data from the spatial domain to the frequency domain to utilize the benefits such as filtering out high frequencies in the image, which reduces computational cost. Our proposed model achieves an accuracy of 96.00% $\pm$ 0.48% for binary classification and 87.85% $\pm$ 0.93% for multiclass classification, which is comparable to state-of-the-art models while significantly reducing computational costs. This demonstrates the potential of our approach to improve breast cancer classification in histopathological images, offering a more efficient solution with reduced reliance on extensive annotated datasets.||
|**2024-10-24**|[Attention-based Citywide Electric Vehicle Charging Demand Prediction Approach Considering Urban Region and Dynamic Influences](http://arxiv.org/abs/2410.18766)|null|Electric vehicle charging demand prediction is important for vacant charging pile recommendation and charging infrastructure planning, thus facilitating vehicle electrification and green energy development. The performance of previous spatio-temporal studies is still far from satisfactory because traditional graphs struggle to model non-pairwise spatial relationships and multivariate temporal features are not adequately taken into account. To tackle these issues, we propose an attention-based heterogeneous multivariate data fusion approach (AHMDF) for citywide electric vehicle charging demand prediction, which incorporates a geo-based clustered hypergraph and a multivariate gated Transformer to consider both static and dynamic influences. To learn non-pairwise relationships, we cluster service areas by the types and numbers of points of interest in the areas and develop attentive hypergraph networks accordingly. Graph attention mechanisms are used for information propagation between neighboring areas. Additionally, we improve the Transformer encoder utilizing gated mechanisms so that it can selectively learn dynamic auxiliary information and temporal features. Experiments on an electric vehicle charging benchmark dataset demonstrate the effectiveness of our proposed approach compared with a broad range of competing baselines. Furthermore, we demonstrate the impact of dynamic influences on prediction results in different areas of the city and the effectiveness of our clustering method.||
|**2024-10-24**|[Rethinking Softmax: Self-Attention with Polynomial Activations](http://arxiv.org/abs/2410.18613)|null|This paper challenges the conventional belief that softmax attention in transformers is effective primarily because it generates a probability distribution for attention allocation. Instead, we theoretically show that its success lies in its ability to implicitly regularize the Frobenius norm of the attention matrix during training. We then explore alternative activations that regularize the Frobenius norm of the attention matrix, demonstrating that certain polynomial activations can achieve this effect, making them suitable for attention-based architectures. Empirical results indicate these activations perform comparably or better than softmax across various computer vision and language tasks, suggesting new possibilities for attention mechanisms beyond softmax.||
|**2024-10-24**|[Taipan: Efficient and Expressive State Space Language Models with Selective Attention](http://arxiv.org/abs/2410.18572)|null|Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.||
|**2024-10-24**|[Local and Global Graph Modeling with Edge-weighted Graph Attention Network for Handwritten Mathematical Expression Recognition](http://arxiv.org/abs/2410.18555)|null|In this paper, we present a novel approach to Handwritten Mathematical Expression Recognition (HMER) by leveraging graph-based modeling techniques. We introduce an End-to-end model with an Edge-weighted Graph Attention Mechanism (EGAT), designed to perform simultaneous node and edge classification. This model effectively integrates node and edge features, facilitating the prediction of symbol classes and their relationships within mathematical expressions. Additionally, we propose a stroke-level Graph Modeling method for both local (LGM) and global (GGM) information, which applies an end-to-end model to Online HMER tasks, transforming the recognition problem into node and edge classification tasks in graph structure. By capturing both local and global graph features, our method ensures comprehensive understanding of the expression structure. Through the combination of these components, our system demonstrates superior performance in symbol detection, relation classification, and expression-level recognition.||
|**2024-10-24**|[On Explaining with Attention Matrices](http://arxiv.org/abs/2410.18541)|**[link](https://github.com/omyokun/on-explaining-with-attention-matrices)**|This paper explores the much discussed, possible explanatory link between attention weights (AW) in transformer models and predicted output. Contrary to intuition and early research on attention, more recent prior research has provided formal arguments and empirical evidence that AW are not explanatorily relevant. We show that the formal arguments are incorrect. We introduce and effectively compute efficient attention, which isolates the effective components of attention matrices in tasks and models in which AW play an explanatory role. We show that efficient attention has a causal role (provides minimally necessary and sufficient conditions) for predicting model output in NLP tasks requiring contextual information, and we show, contrary to [7], that efficient attention matrices are probability distributions and are effectively calculable. Thus, they should play an important part in the explanation of attention based model behavior. We offer empirical experiments in support of our method illustrating various properties of efficient attention with various metrics on four datasets.||
|**2024-10-24**|[SFB-net for cardiac segmentation: Bridging the semantic gap with attention](http://arxiv.org/abs/2410.18503)|null|In the past few years, deep learning algorithms have been widely used for cardiac image segmentation. However, most of these architectures rely on convolutions that hardly model long-range dependencies, limiting their ability to extract contextual information. In order to tackle this issue, this article introduces the Swin Filtering Block network (SFB-net) which takes advantage of both conventional and swin transformer layers. The former are used to introduce spatial attention at the bottom of the network, while the latter are applied to focus on high level semantically rich features between the encoder and decoder. An average Dice score of 92.4 was achieved on the ACDC dataset. To the best of our knowledge, this result outperforms any other work on this dataset. The average Dice score of 87.99 obtained on the M&M's dataset demonstrates that the proposed method generalizes well to data from different vendors and centres.||
|**2024-10-23**|[Value Residual Learning For Alleviating Attention Concentration In Transformers](http://arxiv.org/abs/2410.17897)|**[link](https://github.com/Zcchill/Value-Residual-Learning)**|Transformers can capture long-range dependencies using self-attention, allowing tokens to attend to all others directly. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers. However, this approach is computationally expensive. To address this problem, we propose Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers. Based on this method, one variant is the Transformer with single layer value (SVFormer), where all layers share the same value embedding from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention concentration problem in deeper layers and enhances representation across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as downstream tasks. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods like GQA and CLA, with performance influenced by sequence length and cumulative learning rate.||
|**2024-10-23**|[Anomaly Resilient Temporal QoS Prediction using Hypergraph Convoluted Transformer Network](http://arxiv.org/abs/2410.17762)|null|Quality-of-Service (QoS) prediction is a critical task in the service lifecycle, enabling precise and adaptive service recommendations by anticipating performance variations over time in response to evolving network uncertainties and user preferences. However, contemporary QoS prediction methods frequently encounter data sparsity and cold-start issues, which hinder accurate QoS predictions and limit the ability to capture diverse user preferences. Additionally, these methods often assume QoS data reliability, neglecting potential credibility issues such as outliers and the presence of greysheep users and services with atypical invocation patterns. Furthermore, traditional approaches fail to leverage diverse features, including domain-specific knowledge and complex higher-order patterns, essential for accurate QoS predictions. In this paper, we introduce a real-time, trust-aware framework for temporal QoS prediction to address the aforementioned challenges, featuring an end-to-end deep architecture called the Hypergraph Convoluted Transformer Network (HCTN). HCTN combines a hypergraph structure with graph convolution over hyper-edges to effectively address high-sparsity issues by capturing complex, high-order correlations. Complementing this, the transformer network utilizes multi-head attention along with parallel 1D convolutional layers and fully connected dense blocks to capture both fine-grained and coarse-grained dynamic patterns. Additionally, our approach includes a sparsity-resilient solution for detecting greysheep users and services, incorporating their unique characteristics to improve prediction accuracy. Trained with a robust loss function resistant to outliers, HCTN demonstrated state-of-the-art performance on the large-scale WSDREAM-2 datasets for response time and throughput.||
|**2024-10-23**|[PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers in a resource-limited Context](http://arxiv.org/abs/2410.17661)|null|Following their success in natural language processing (NLP), there has been a shift towards transformer models in computer vision. While transformers perform well and offer promising multi-tasking performance, due to their high compute requirements, many resource-constrained applications still rely on convolutional or hybrid models that combine the benefits of convolution and attention layers and achieve the best results in the sub 100M parameter range. Simultaneously, task adaptation techniques that allow for the use of one shared transformer backbone for multiple downstream tasks, resulting in great storage savings at negligible cost in performance, have not yet been adopted for hybrid transformers. In this work, we investigate how to achieve the best task-adaptation performance and introduce PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers. We further combine PETAH adaptation with pruning to achieve highly performant and storage friendly models for multi-tasking. In our extensive evaluation on classification and other vision tasks, we demonstrate that our PETAH-adapted hybrid models outperform established task-adaptation techniques for ViTs while requiring fewer parameters and being more efficient on mobile hardware.||
|**2024-10-23**|[Surgical Scene Segmentation by Transformer With Asymmetric Feature Enhancement](http://arxiv.org/abs/2410.17642)|**[link](https://github.com/cyuan-sjtu/vit-asym)**|Surgical scene segmentation is a fundamental task for robotic-assisted laparoscopic surgery understanding. It often contains various anatomical structures and surgical instruments, where similar local textures and fine-grained structures make the segmentation a difficult task. Vision-specific transformer method is a promising way for surgical scene understanding. However, there are still two main challenges. Firstly, the absence of inner-patch information fusion leads to poor segmentation performance. Secondly, the specific characteristics of anatomy and instruments are not specifically modeled. To tackle the above challenges, we propose a novel Transformer-based framework with an Asymmetric Feature Enhancement module (TAFE), which enhances local information and then actively fuses the improved feature pyramid into the embeddings from transformer encoders by a multi-scale interaction attention strategy. The proposed method outperforms the SOTA methods in several different surgical segmentation tasks and additionally proves its ability of fine-grained structure recognition. Code is available at https://github.com/cyuan-sjtu/ViT-asym.||
|**2024-10-22**|[From Attention to Activation: Unravelling the Enigmas of Large Language Models](http://arxiv.org/abs/2410.17174)|null|We study two strange phenomena in auto-regressive Transformers: (1) the dominance of the first token in attention heads; (2) the occurrence of large outlier activations in the hidden states. We find that popular large language models, such as Llama, attend maximally to the first token in 98% of attention heads, a behaviour we attribute to the softmax function. To mitigate this issue, we propose a reformulation of softmax to softmax-1. Furthermore, we identify adaptive optimisers, e.g. Adam, as the primary contributor to the large outlier activations and introduce OrthoAdam, a novel optimiser that utilises orthogonal matrices to transform gradients, to address this issue. Finally, not only do our methods prevent these phenomena from occurring, but they also enable Transformers to sustain their performance when quantised using basic algorithms, something that standard methods are unable to do. In summary, our methods reduce the attention proportion on the first token from 65% to 3.3%, the activation kurtosis in the hidden states from 1657 to 3.1, and perplexity penalty under 4-bit weight quantisation from 3565 to 0.3.||
|**2024-10-22**|[A Comparison of Baseline Models and a Transformer Network for SOC Prediction in Lithium-Ion Batteries](http://arxiv.org/abs/2410.17049)|null|Accurately predicting the state of charge of Lithium-ion batteries is essential to the performance of battery management systems of electric vehicles. One of the main reasons for the slow global adoption of electric cars is driving range anxiety. The ability of a battery management system to accurately estimate the state of charge can help alleviate this problem. In this paper, a comparison between data-driven state-of-charge estimation methods is conducted. The paper compares different neural network-based models and common regression models for SOC estimation. These models include several ablated transformer networks, a neural network, a lasso regression model, a linear regression model and a decision tree. Results of various experiments conducted on data obtained from natural driving cycles of the BMW i3 battery show that the decision tree outperformed all other models including the more complex transformer network with self-attention and positional encoding.||
|**2024-10-20**|[Advancing Gasoline Consumption Forecasting: A Novel Hybrid Model Integrating Transformers, LSTM, and CNN](http://arxiv.org/abs/2410.16336)|null|Iran, endowed with abundant hydrocarbon resources, plays a crucial role in the global energy landscape. Gasoline, as a critical fuel, significantly supports the nation's transportation sector. Accurate forecasting of gasoline consumption is essential for strategic resource management and environmental planning. This research introduces a novel approach to predicting monthly gasoline consumption using a hybrid Transformer-LSTM-CNN model, which integrates the strengths of Transformer networks, Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNN). This advanced architecture offers a superior alternative to conventional methods such as artificial neural networks and regression models by capturing both short- and long-term dependencies in time series data. By leveraging the self-attention mechanism of Transformers, the temporal memory of LSTMs, and the local pattern detection of CNNs, our hybrid model delivers improved prediction accuracy. Implemented using Python, the model provides precise future gasoline consumption forecasts and evaluates the environmental impact through the analysis of greenhouse gas emissions. This study examines gasoline consumption trends from 2007 to 2021, which rose from 64.5 million liters per day in 2007 to 99.80 million liters per day in 2021. Our proposed model forecasts consumption levels up to 2031, offering a valuable tool for policymakers and energy analysts. The results highlight the superiority of this hybrid model in improving the accuracy of gasoline consumption forecasts, reinforcing the need for advanced machine learning techniques to optimize resource management and mitigate environmental risks in the energy sector.||
|**2024-10-21**|[MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report](http://arxiv.org/abs/2410.16239)|**[link](https://github.com/svthapa/more)**|In this paper, we introduce a novel Multi-Modal Contrastive Pre-training Framework that synergistically combines X-rays, electrocardiograms (ECGs), and radiology/cardiology reports. Our approach leverages transformers to encode these diverse modalities into a unified representation space, aiming to enhance diagnostic accuracy and facilitate comprehensive patient assessments. We utilize LoRA-Peft to significantly reduce trainable parameters in the LLM and incorporate a recent linear attention dropping strategy in the Vision Transformer (ViT) for smoother attention. Furthermore, we provide novel multimodal attention explanations and retrieval for our model. To the best of our knowledge, we are the first to propose an integrated model that combines X-ray, ECG, and Radiology/Cardiology Report with this approach. By utilizing contrastive loss, MoRE effectively aligns modality-specific features into a coherent embedding, which supports various downstream tasks such as zero-shot classification and multimodal retrieval. Employing our proposed methodology, we achieve state-of-the-art (SOTA) on the Mimic-IV, CheXpert, Edema Severity, and PtbXl downstream datasets, surpassing existing multimodal approaches. Our proposed framework shows significant improvements in capturing intricate inter-modal relationships and in robustness for medical diagnosis, establishing a framework for future research in multimodal learning in the healthcare sector.||
|**2024-10-21**|[An Explainable Contrastive-based Dilated Convolutional Network with Transformer for Pediatric Pneumonia Detection](http://arxiv.org/abs/2410.16143)|null|Pediatric pneumonia remains a significant global threat, posing a larger mortality risk than any other communicable disease. According to UNICEF, it is a leading cause of mortality in children under five and requires prompt diagnosis. Early diagnosis using chest radiographs is the prevalent standard, but limitations include low radiation levels in unprocessed images and data imbalance issues. This necessitates the development of efficient, computer-aided diagnosis techniques. To this end, we propose a novel EXplainable Contrastive-based Dilated Convolutional Network with Transformer (XCCNet) for pediatric pneumonia detection. XCCNet harnesses the spatial power of dilated convolutions and the global insights from contrastive-based transformers for effective feature refinement. A robust chest X-ray processing module tackles low-intensity radiographs, while adversarial-based data augmentation mitigates the skewed distribution of chest X-rays in the dataset. Furthermore, we actively integrate an explainability approach through feature visualization, directly aligning it with the attention region that pinpoints the presence of pneumonia or normality in radiographs. The efficacy of XCCNet is comprehensively assessed on four publicly available datasets. Extensive performance evaluation demonstrates the superiority of XCCNet compared to state-of-the-art methods.||
|**2024-10-21**|[START: A Generalized State Space Model with Saliency-Driven Token-Aware Transformation](http://arxiv.org/abs/2410.16020)|**[link](https://github.com/lingeringlight/start)**|Domain Generalization (DG) aims to enable models to generalize to unseen target domains by learning from multiple source domains. Existing DG methods primarily rely on convolutional neural networks (CNNs), which inherently learn texture biases due to their limited receptive fields, making them prone to overfitting source domains. While some works have introduced transformer-based methods (ViTs) for DG to leverage the global receptive field, these methods incur high computational costs due to the quadratic complexity of self-attention. Recently, advanced state space models (SSMs), represented by Mamba, have shown promising results in supervised learning tasks by achieving linear complexity in sequence length during training and fast RNN-like computation during inference. Inspired by this, we investigate the generalization ability of the Mamba model under domain shifts and find that input-dependent matrices within SSMs could accumulate and amplify domain-specific features, thus hindering model generalization. To address this issue, we propose a novel SSM-based architecture with saliency-based token-aware transformation (namely START), which achieves state-of-the-art (SOTA) performances and offers a competitive alternative to CNNs and ViTs. Our START can selectively perturb and suppress domain-specific features in salient tokens within the input-dependent matrices of SSMs, thus effectively reducing the discrepancy between different domains. Extensive experiments on five benchmarks demonstrate that START outperforms existing SOTA DG methods with efficient linear complexity. Our code is available at https://github.com/lingeringlight/START.||
|**2024-10-21**|[All You Need is an Improving Column: Enhancing Column Generation for Parallel Machine Scheduling via Transformers](http://arxiv.org/abs/2410.15601)|null|We present a neural network-enhanced column generation (CG) approach for a parallel machine scheduling problem. The proposed approach utilizes an encoder-decoder attention model, namely the transformer and pointer architectures, to develop job sequences with negative reduced cost and thus generate columns to add to the master problem. By training the neural network offline and using it in inference mode to predict negative reduced costs columns, we achieve significant computational time savings compared to dynamic programming (DP). Since the exact DP procedure is used to verify that no further columns with negative reduced cost can be identified at termination, the optimality guarantee of the original CG procedure is preserved. For small to medium-sized instances, our approach achieves an average 45% reduction in computation time compared to solving the subproblems with DP. Furthermore, the model generalizes not only to unseen, larger problem instances from the same probability distribution but also to instances from different probability distributions than those presented at training time. For large-sized instances, the proposed approach achieves an 80% improvement in the objective value in under 500 seconds, demonstrating both its scalability and efficiency.||
|**2024-10-21**|[Generalized Probabilistic Attention Mechanism in Transformers](http://arxiv.org/abs/2410.15578)|null|The Transformer architecture has become widely adopted due to its demonstrated success, attributed to the attention mechanism at its core. Despite these successes, the attention mechanism of Transformers is associated with two well-known issues: rank-collapse and gradient vanishing. In this paper, we present a theoretical analysis that it is inherently difficult to address both issues simultaneously in the conventional attention mechanism. To handle these issues, we introduce a novel class of attention mechanism, referred to as generalized probabilistic attention mechanism (GPAM), and its dual-attention implementation within the Transformer architecture. Unlike conventional attention mechanisms, GPAM allows for negative attention scores while preserving a fixed total sum. We provide theoretical evidence that the proposed dual-attention GPAM (daGPAM) effectively mitigates both the rank-collapse and gradient vanishing issues which are difficult to resolve simultaneously with the conventional attention mechanisms. Furthermore, we empirically validate this theoretical evidence, demonstrating the superiority of daGPAM compared to other alternative attention mechanisms that were proposed to address the same issues. Additionally, we demonstrate the practical benefits of GPAM in natural language processing tasks, such as language modeling and neural machine translation.||
|**2024-10-20**|[SEA: State-Exchange Attention for High-Fidelity Physics-Based Transformers](http://arxiv.org/abs/2410.15495)|**[link](https://github.com/parsaesmati/sea)**|Current approaches using sequential networks have shown promise in estimating field variables for dynamical systems, but they are often limited by high rollout errors. The unresolved issue of rollout error accumulation results in unreliable estimations as the network predicts further into the future, with each step's error compounding and leading to an increase in inaccuracy. Here, we introduce the State-Exchange Attention (SEA) module, a novel transformer-based module enabling information exchange between encoded fields through multi-head cross-attention. The cross-field multidirectional information exchange design enables all state variables in the system to exchange information with one another, capturing physical relationships and symmetries between fields. In addition, we incorporate a ViT-like architecture to generate spatially coherent mesh embeddings, further improving the model's ability to capture spatial dependencies in the data. This enhances the model's ability to represent complex interactions between the field variables, resulting in reduced rollout error accumulation. Our results show that the Transformer model integrated with the State-Exchange Attention (SEA) module outperforms competitive baseline models, including the PbGMR-GMUS Transformer-RealNVP and GMR-GMUS Transformer, with a reduction in error of 88% and 91%, respectively, achieving state-of-the-art performance. Furthermore, we demonstrate that the SEA module alone can reduce errors by 97% for state variables that are highly dependent on other states of the system.||
|**2024-10-19**|[EViT-Unet: U-Net Like Efficient Vision Transformer for Medical Image Segmentation on Mobile and Edge Devices](http://arxiv.org/abs/2410.15036)|**[link](https://github.com/Retinal-Research/EVIT-UNET)**|With the rapid development of deep learning, CNN-based U-shaped networks have succeeded in medical image segmentation and are widely applied for various tasks. However, their limitations in capturing global features hinder their performance in complex segmentation tasks. The rise of Vision Transformer (ViT) has effectively compensated for this deficiency of CNNs and promoted the application of ViT-based U-networks in medical image segmentation. However, the high computational demands of ViT make it unsuitable for many medical devices and mobile platforms with limited resources, restricting its deployment on resource-constrained and edge devices. To address this, we propose EViT-UNet, an efficient ViT-based segmentation network that reduces computational complexity while maintaining accuracy, making it ideal for resource-constrained medical devices. EViT-UNet is built on a U-shaped architecture, comprising an encoder, decoder, bottleneck layer, and skip connections, combining convolutional operations with self-attention mechanisms to optimize efficiency. Experimental results demonstrate that EViT-UNet achieves high accuracy in medical image segmentation while significantly reducing computational complexity.||
|**2024-10-18**|[SignAttention: On the Interpretability of Transformer Models for Sign Language Translation](http://arxiv.org/abs/2410.14506)|**[link](https://github.com/pedroodb/sign_attention)**|This paper presents the first comprehensive interpretability analysis of a Transformer-based Sign Language Translation (SLT) model, focusing on the translation from video-based Greek Sign Language to glosses and text. Leveraging the Greek Sign Language Dataset, we examine the attention mechanisms within the model to understand how it processes and aligns visual input with sequential glosses. Our analysis reveals that the model pays attention to clusters of frames rather than individual ones, with a diagonal alignment pattern emerging between poses and glosses, which becomes less distinct as the number of glosses increases. We also explore the relative contributions of cross-attention and self-attention at each decoding step, finding that the model initially relies on video frames but shifts its focus to previously predicted tokens as the translation progresses. This work contributes to a deeper understanding of SLT models, paving the way for the development of more transparent and reliable translation systems essential for real-world applications.||
|**2024-10-18**|[Mixed Attention Transformer Enhanced Channel Estimation for Extremely Large-Scale MIMO Systems](http://arxiv.org/abs/2410.14439)|null|Extremely large-scale massive multiple-input multiple-output (XL-MIMO) is one of the key technologies for next-generation wireless communication systems. However, acquiring the accurate high-dimensional channel matrix of XL-MIMO remains a pressing challenge due to the intractable channel property and the high complexity. In this paper, a Mixed Attention Transformer based Channel Estimation Neural Network (MAT-CENet) is developed, which is inspired by the Transformer encoder structure as well as organically integrates the feature map attention and spatial attention mechanisms to better grasp the unique characteristics of the XL-MIMO channel. By incorporating the multi-head attention layer as the core enabler, the insightful feature importance is captured and exploited effectively. A comprehensive complexity analysis for the proposed MAT-CENet is also provided. Simulation results show that MAT-CENet outperforms the state of the art in different propagation scenarios of near-, far- and hybrid-fields.||
|**2024-10-18**|[Rethinking Transformer for Long Contextual Histopathology Whole Slide Image Analysis](http://arxiv.org/abs/2410.14195)|**[link](https://github.com/invoker-ll/long-mil)**|Histopathology Whole Slide Image (WSI) analysis serves as the gold standard for clinical cancer diagnosis in the daily routines of doctors. To develop computer-aided diagnosis model for WSIs, previous methods typically employ Multi-Instance Learning to enable slide-level prediction given only slide-level labels. Among these models, vanilla attention mechanisms without pairwise interactions have traditionally been employed but are unable to model contextual information. More recently, self-attention models have been utilized to address this issue. To alleviate the computational complexity of long sequences in large WSIs, methods like HIPT use region-slicing, and TransMIL employs approximation of full self-attention. Both approaches suffer from suboptimal performance due to the loss of key information. Moreover, their use of absolute positional embedding struggles to effectively handle long contextual dependencies in shape-varying WSIs. In this paper, we first analyze how the low-rank nature of the long-sequence attention matrix constrains the representation ability of WSI modelling. Then, we demonstrate that the rank of attention matrix can be improved by focusing on local interactions via a local attention mask. Our analysis shows that the local mask aligns with the attention patterns in the lower layers of the Transformer. Furthermore, the local attention mask can be implemented during chunked attention calculation, reducing the quadratic computational complexity to linear with a small local bandwidth. Building on this, we propose a local-global hybrid Transformer for both computational acceleration and local-global information interactions modelling. Our method, Long-contextual MIL (LongMIL), is evaluated through extensive experiments on various WSI tasks to validate its superiority. Our code will be available at github.com/invoker-LL/Long-MIL.||
|**2024-10-18**|[Provable In-context Learning for Mixture of Linear Regressions using Transformers](http://arxiv.org/abs/2410.14183)|null|We theoretically investigate the in-context learning capabilities of transformers in the context of learning mixtures of linear regression models. For the case of two mixtures, we demonstrate the existence of transformers that can achieve an accuracy, relative to the oracle predictor, of order $\mathcal{\tilde{O}}((d/n)^{1/4})$ in the low signal-to-noise ratio (SNR) regime and $\mathcal{\tilde{O}}(\sqrt{d/n})$ in the high SNR regime, where $n$ is the length of the prompt, and $d$ is the dimension of the problem. Additionally, we derive in-context excess risk bounds of order $\mathcal{O}(L/\sqrt{B})$, where $B$ denotes the number of (training) prompts, and $L$ represents the number of attention layers. The order of $L$ depends on whether the SNR is low or high. In the high SNR regime, we extend the results to $K$-component mixture models for finite $K$. Extensive simulations also highlight the advantages of transformers for this task, outperforming other baselines such as the Expectation-Maximization algorithm.||
|**2024-10-17**|[MarineFormer: A Transformer-based Navigation Policy Model for Collision Avoidance in Marine Environment](http://arxiv.org/abs/2410.13973)|null|In this work, we investigate the problem of Unmanned Surface Vehicle (USV) navigation in a dense marine environment with a high-intensity current flow. The complexities arising from static and dynamic obstacles and the disturbance forces caused by current flow render existing navigation protocols inadequate for ensuring safety and avoiding collisions at sea. To learn a safe and efficient robot policy, we propose a novel methodology that leverages attention mechanisms to capture heterogeneous interactions of the agents with the static and moving obstacles and the flow disturbances from the environment in space and time. In particular, we refine a temporal function with MarineFormer, a Transformer navigation policy for spatially variable Marine environment, trained end-to-end with reinforcement learning (RL). MarineFormer uses foundational spatio-temporal graph attention with transformer architecture to process spatial attention and temporal sequences in an environment that simulates a 2D turbulent marine condition. We propose architectural modifications that improve the stability and learning speed of the recurrent models. The flow velocity estimation, which can be derived from flow simulations or sensors, is incorporated into a model-free RL framework to prevent the robot from entering into high-intensity current flow regions including intense vortices, while potentially leveraging the flow to assist in transportation. The investigated 2D marine environment encompasses flow singularities, including vortices, sinks, and sources, representing fundamental planar flow patterns associated with flood or maritime thunderstorms. Our proposed method is trained with a new reward model to deal with static and dynamic obstacles and disturbances from the current flow.||
|**2024-10-17**|[Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs](http://arxiv.org/abs/2410.13835)|**[link](https://github.com/guotianyu2000/active-dormant-attention)**|Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.||
|**2024-10-17**|[Reducing the Transformer Architecture to a Minimum](http://arxiv.org/abs/2410.13732)|null|Transformers are a widespread and successful model architecture, particularly in Natural Language Processing (NLP) and Computer Vision (CV). The essential innovation of this architecture is the Attention Mechanism, which solves the problem of extracting relevant context information from long sequences in NLP and realistic scenes in CV. A classical neural network component, a Multi-Layer Perceptron (MLP), complements the attention mechanism. Its necessity is frequently justified by its capability of modeling nonlinear relationships. However, the attention mechanism itself is nonlinear through its internal use of similarity measures. A possible hypothesis is that this nonlinearity is sufficient for modeling typical application problems. As the MLPs usually contain the most trainable parameters of the whole model, their omission would substantially reduce the parameter set size. Further components can also be reorganized to reduce the number of parameters. Under some conditions, query and key matrices can be collapsed into a single matrix of the same size. The same is true about value and projection matrices, which can also be omitted without eliminating the substance of the attention mechanism. Initially, the similarity measure was defined asymmetrically, with peculiar properties such as that a token is possibly dissimilar to itself. A possible symmetric definition requires only half of the parameters. We have laid the groundwork by testing widespread CV benchmarks: MNIST and CIFAR-10. The tests have shown that simplified transformer architectures (a) without MLP, (b) with collapsed matrices, and (c) symmetric similarity matrices exhibit similar performance as the original architecture, saving up to 90% of parameters without hurting the classification performance.||
|**2024-10-17**|[DiRecNetV2: A Transformer-Enhanced Network for Aerial Disaster Recognition](http://arxiv.org/abs/2410.13663)|null|The integration of Unmanned Aerial Vehicles (UAVs) with artificial intelligence (AI) models for aerial imagery processing in disaster assessment necessitates models that demonstrate exceptional accuracy, computational efficiency, and real-time processing capabilities. Traditionally, Convolutional Neural Networks (CNNs) demonstrate efficiency in local feature extraction but are limited in their potential for global context interpretation. On the other hand, Vision Transformers (ViTs) show promise for improved global context interpretation through the use of attention mechanisms, although they remain underinvestigated in UAV-based disaster response applications. Bridging this research gap, we introduce DiRecNetV2, an improved hybrid model that utilizes convolutional and transformer layers. It merges the inductive biases of CNNs for robust feature extraction with the global context understanding of Transformers, maintaining a low computational load ideal for UAV applications. Additionally, we introduce a new, compact multi-label dataset of disasters to set an initial benchmark for future research, exploring how models trained on single-label data perform in a multi-label test set. The study assesses lightweight CNNs and ViTs on the AIDERSv2 dataset, based on the frames per second (FPS) for efficiency and the weighted F1 scores for classification performance. DiRecNetV2 not only achieves a weighted F1 score of 0.964 on a single-label test set but also demonstrates adaptability, with a score of 0.614 on a complex multi-label test set, while functioning at 176.13 FPS on the Nvidia Orin Jetson device.||
|**2024-10-17**|[360U-Former: HDR Illumination Estimation with Panoramic Adapted Vision Transformers](http://arxiv.org/abs/2410.13566)|null|Recent illumination estimation methods have focused on enhancing the resolution and improving the quality and diversity of the generated textures. However, few have explored tailoring the neural network architecture to the Equirectangular Panorama (ERP) format utilised in image-based lighting. Consequently, high dynamic range images (HDRI) results usually exhibit a seam at the side borders and textures or objects that are warped at the poles. To address this shortcoming we propose a novel architecture, 360U-Former, based on a U-Net style Vision-Transformer which leverages the work of PanoSWIN, an adapted shifted window attention tailored to the ERP format. To the best of our knowledge, this is the first purely Vision-Transformer model used in the field of illumination estimation. We train 360U-Former as a GAN to generate HDRI from a limited field of view low dynamic range image (LDRI). We evaluate our method using current illumination estimation evaluation protocols and datasets, demonstrating that our approach outperforms existing and state-of-the-art methods without the artefacts typically associated with the use of the ERP format.||
|**2024-10-17**|[Precipitation Nowcasting Using Diffusion Transformer with Causal Attention](http://arxiv.org/abs/2410.13314)|null|Short-term precipitation forecasting remains challenging due to the difficulty in capturing long-term spatiotemporal dependencies. Current deep learning methods fall short in establishing effective dependencies between conditions and forecast results, while also lacking interpretability. To address this issue, we propose a Precipitation Nowcasting Using Diffusion Transformer with Causal Attention model. Our model leverages Transformer and combines causal attention mechanisms to establish spatiotemporal queries between conditional information (causes) and forecast results (results). This design enables the model to effectively capture long-term dependencies, allowing forecast results to maintain strong causal relationships with input conditions over a wide range of time and space. We explore four variants of spatiotemporal information interactions for DTCA, demonstrating that global spatiotemporal labeling interactions yield the best performance. In addition, we introduce a Channel-To-Batch shift operation to further enhance the model's ability to represent complex rainfall dynamics. We conducted experiments on two datasets. Compared to state-of-the-art U-Net-based methods, our approach improved the CSI (Critical Success Index) for predicting heavy precipitation by approximately 15% and 8% respectively, achieving state-of-the-art performance.||
|**2024-10-17**|[DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis](http://arxiv.org/abs/2410.13288)|null|This paper proposes an improved version of DurIAN-E (DurIAN-E 2), which is also a duration informed attention neural network for expressive and high-fidelity text-to-speech (TTS) synthesis. Similar to the DurIAN-E model, multiple stacked SwishRNN-based Transformer blocks are utilized as linguistic encoders, and Style-Adaptive Instance Normalization (SAIN) layers are also incorporated into frame-level encoders to improve the modeling ability of expressiveness in the proposed DurIAN-E 2. Meanwhile, motivated by other TTS models using generative models such as VITS, the proposed DurIAN-E 2 utilizes variational autoencoders (VAEs) augmented with normalizing flows and a BigVGAN waveform generator with an adversarial training strategy, which further improve the synthesized speech quality and expressiveness. Both objective test and subjective evaluation results show that the proposed expressive TTS model DurIAN-E 2 can achieve better performance than several state-of-the-art approaches besides DurIAN-E.||
|**2024-10-17**|[An Evolved Universal Transformer Memory](http://arxiv.org/abs/2410.13166)|**[link](https://github.com/sakanaai/evo-memory)**|Prior methods propose to offset the escalating costs of modern foundation models by dropping specific parts of their contexts with hand-designed rules, while attempting to preserve their original performance. We overcome this trade-off with Neural Attention Memory Models (NAMMs), introducing a learned network for memory management that improves both the performance and efficiency of transformers. We evolve NAMMs atop pre-trained transformers to provide different latent contexts focusing on the most relevant information for individual layers and attention heads. NAMMs are universally applicable to any model using self-attention as they condition exclusively on the values in the produced attention matrices. Learning NAMMs on a small set of problems, we achieve substantial performance improvements across multiple long-context benchmarks while cutting the model's input contexts up to a fraction of the original sizes. We show the generality of our conditioning enables zero-shot transfer of NAMMs trained only on language to entirely new transformer architectures even across input modalities, with their benefits carrying over to vision and reinforcement learning.||
|**2024-10-16**|[SWIM: An Attention-Only Model for Speech Quality Assessment Under Subjective Variance](http://arxiv.org/abs/2410.12675)|null|Speech quality is best evaluated by human feedback using mean opinion scores (MOS). However, variance in ratings between listeners can introduce noise in the true quality label of an utterance. Currently, deep learning networks including convolutional, recurrent, and attention-based architectures have been explored for quality estimation. This paper proposes an exclusively attention-based model involving a Swin Transformer for MOS estimation (SWIM). Our network captures local and global dependencies that reflect the acoustic properties of an utterance. To counteract subjective variance in MOS labels, we propose a normal distance-based objective that accounts for standard deviation in each label, and we avail a multistage self-teaching strategy to improve generalization further. Our model is significantly more compact than existing attention-based networks for quality estimation. Finally, our experiments on the Samsung Open Mean Opinion Score (SOMOS) dataset show improvement over existing baseline models when trained from scratch.||
|**2024-10-16**|[ExoTST: Exogenous-Aware Temporal Sequence Transformer for Time Series Prediction](http://arxiv.org/abs/2410.12184)|null|Accurate long-term predictions are the foundations for many machine learning applications and decision-making processes. Traditional time series approaches for prediction often focus on either autoregressive modeling, which relies solely on past observations of the target "endogenous variables", or forward modeling, which considers only current covariate drivers "exogenous variables". However, effectively integrating past endogenous and past exogenous with current exogenous variables remains a significant challenge. In this paper, we propose ExoTST, a novel transformer-based framework that effectively incorporates current exogenous variables alongside past context for improved time series prediction. To integrate exogenous information efficiently, ExoTST leverages the strengths of attention mechanisms and introduces a novel cross-temporal modality fusion module. This module enables the model to jointly learn from both past and current exogenous series, treating them as distinct modalities. By considering these series separately, ExoTST provides robustness and flexibility in handling data uncertainties that arise from the inherent distribution shift between historical and current exogenous variables. Extensive experiments on real-world carbon flux datasets and time series benchmarks demonstrate ExoTST's superior performance compared to state-of-the-art baselines, with improvements of up to 10% in prediction accuracy. Moreover, ExoTST exhibits strong robustness against missing values and noise in exogenous drivers, maintaining consistent performance in real-world situations where these imperfections are common.||
|**2024-10-15**|[MoH: Multi-Head Attention as Mixture-of-Head Attention](http://arxiv.org/abs/2410.11842)|**[link](https://github.com/skyworkai/moh)**|In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.||
|**2024-10-15**|[Light-Weight Fault Tolerant Attention for Large Language Model Training](http://arxiv.org/abs/2410.11720)|null|Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, the training of these models is computationally intensive and susceptible to faults, particularly in the attention mechanism, which is a critical component of transformer-based LLMs. In this paper, we investigate the impact of faults on LLM training, focusing on INF, NaN, and near-INF values in the computation results with systematic fault injection experiments. We observe the propagation patterns of these errors, which can trigger non-trainable states in the model and disrupt training, forcing the procedure to load from checkpoints. To mitigate the impact of these faults, we propose ATTNChecker, the first Algorithm-Based Fault Tolerance (ABFT) technique tailored for the attention mechanism in LLMs. ATTNChecker is designed based on fault propagation patterns of LLM and incorporates performance optimization to adapt to both system reliability and model vulnerability while providing lightweight protection for fast LLM training. Evaluations on four LLMs show that ATTNChecker incurs on average 7% overhead on training while detecting and correcting all extreme errors. Compared with the state-of-the-art checkpoint/restore approach, ATTNChecker reduces recovery overhead by up to 49x.||
|**2024-10-15**|[CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction](http://arxiv.org/abs/2410.11428)|null|Convolutional neural networks (CNNs) and vision transformers (ViTs) have become essential in computer vision for local and global feature extraction. However, aggregating these architectures in existing methods often results in inefficiencies. To address this, the CNN-Transformer Aggregation Network (CTA-Net) was developed. CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features. This integration enables efficient processing of detailed local and broader contextual information. CTA-Net introduces the Light Weight Multi-Scale Feature Fusion Multi-Head Self-Attention (LMF-MHSA) module for effective multi-scale feature integration with reduced parameters. Additionally, the Reverse Reconstruction CNN-Variants (RRCV) module enhances the embedding of CNNs within the transformer architecture. Extensive experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance (TOP-1 Acc 86.76%), fewer parameters (20.32M), and greater efficiency (FLOPs 2.83B), making it a highly efficient and lightweight solution for visual tasks on small-scale datasets (fewer than 100,000).||
|**2024-10-15**|[Implementing Derivations of Definite Logic Programs with Self-Attention Networks](http://arxiv.org/abs/2410.11396)|null|In this paper we propose that a restricted version of logical inference can be implemented with self-attention networks. We are aiming at showing that LLMs (Large Language Models) constructed with transformer networks can make logical inferences. We would reveal the potential of LLMs by analyzing self-attention networks, which are main components of transformer networks. Our approach is not based on semantics of natural languages but operations of logical inference. We show that hierarchical constructions of self-attention networks with feed forward networks (FFNs) can implement top-down derivations for a class of logical formulae. We also show that bottom-up derivations are implemented for the same class. We believe that our results show that LLMs implicitly have the power of logical inference.||
|**2024-10-15**|[SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection](http://arxiv.org/abs/2410.11358)|null|Multimodal object detection leverages diverse modal information to enhance the accuracy and robustness of detectors. By learning long-term dependencies, Transformer can effectively integrate multimodal features in the feature extraction stage, which greatly improves the performance of multimodal object detection. However, current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at various depth layers of network, thus limiting the improvements in detection performance. In this paper, we introduce an accurate and efficient object detection method named SeaDATE. Initially, we propose a novel dual attention Feature Fusion (DTF) module that, under Transformer's guidance, integrates local and global information through a dual attention mechanism, strengthening the fusion of modal features from orthogonal perspectives using spatial and channel tokens. Meanwhile, our theoretical analysis and empirical validation demonstrate that the Transformer-guided fusion method, treating images as sequences of pixels for fusion, performs better on shallow features' detail information compared to deep semantic information. To address this, we designed a contrastive learning (CL) module aimed at learning features of multimodal samples, remedying the shortcomings of Transformer-guided fusion in extracting deep semantic features, and effectively utilizing cross-modal information. Extensive experiments and ablation studies on the FLIR, LLVIP, and M3FD datasets have proven our method to be effective, achieving state-of-the-art detection performance.||
|**2024-10-15**|[Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix](http://arxiv.org/abs/2410.11261)|null|Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives, from conversational AI to search and AI assistants. However, their growing capabilities come at the cost of extremely large model sizes, making deployment on edge devices challenging due to memory and computational constraints. This paper introduces a novel approach to LLM weight pruning that directly optimizes for approximating the attention matrix, a core component of transformer architectures. Unlike existing methods that focus on linear approximations, our approach accounts for the non-linear nature of the Softmax attention mechanism. We provide theoretical guarantees for the convergence of our Gradient Descent-based optimization method to a near-optimal pruning mask solution. Our preliminary empirical results demonstrate the effectiveness of this approach in maintaining model performance while significantly reducing computational costs. This work establishes a new theoretical foundation for pruning algorithm design in LLMs, potentially paving the way for more efficient LLM inference on resource-constrained devices.||
|**2024-10-15**|[Rethinking Graph Transformer Architecture Design for Node Classification](http://arxiv.org/abs/2410.11189)|null|Graph Transformer (GT), as a special type of Graph Neural Networks (GNNs), utilizes multi-head attention to facilitate high-order message passing. However, this also imposes several limitations in node classification applications: 1) nodes are susceptible to global noise; 2) self-attention computation cannot scale well to large graphs. In this work, we conduct extensive observational experiments to explore the adaptability of the GT architecture in node classification tasks and draw several conclusions: the current multi-head self-attention module in GT can be completely replaceable, while the feed-forward neural network module proves to be valuable. Based on this, we decouple the propagation (P) and transformation (T) of GNNs and explore a powerful GT architecture, named GNNFormer, which is based on the P/T combination message passing and adapted for node classification in both homophilous and heterophilous scenarios. Extensive experiments on 12 benchmark datasets demonstrate that our proposed GT architecture can effectively adapt to node classification tasks without being affected by global noise and computational efficiency limitations.||
|**2024-10-14**|[What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis](http://arxiv.org/abs/2410.10986)|**[link](https://github.com/dalab/transformer-hessian)**|The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning -- to the extent that Transformers are often accompanied by adaptive optimizers, layer normalization, learning rate warmup, and more, in comparison to MLPs/CNNs. The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from the other architectures -- grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first entirely derive the Transformer's Hessian and express it in matrix derivatives; (b) we then characterize it in terms of data, weight, and attention moment dependencies; and (c) while doing so further highlight the important structural differences to the Hessian of classical networks. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters. Ultimately, our findings provide a deeper understanding of the Transformer's unique optimization landscape and the challenges it poses.||
|**2024-10-14**|[Hybrid Transformer for Early Alzheimer's Detection: Integration of Handwriting-Based 2D Images and 1D Signal Features](http://arxiv.org/abs/2410.10547)|null|Alzheimer's Disease (AD) is a prevalent neurodegenerative condition where early detection is vital. Handwriting, often affected early in AD, offers a non-invasive and cost-effective way to capture subtle motor changes. State-of-the-art research on handwriting, mostly online, based AD detection has predominantly relied on manually extracted features, fed as input to shallow machine learning models. Some recent works have proposed deep learning (DL)-based models, either 1D-CNN or 2D-CNN architectures, with performance comparing favorably to handcrafted schemes. These approaches, however, overlook the intrinsic relationship between the 2D spatial patterns of handwriting strokes and their 1D dynamic characteristics, thus limiting their capacity to capture the multimodal nature of handwriting data. Moreover, the application of Transformer models remains basically unexplored. To address these limitations, we propose a novel approach for AD detection, consisting of a learnable multimodal hybrid attention model that integrates simultaneously 2D handwriting images with 1D dynamic handwriting signals. Our model leverages a gated mechanism to combine similarity and difference attention, blending the two modalities and learning robust features by incorporating information at different scales. Our model achieved state-of-the-art performance on the DARWIN dataset, with an F1-score of 90.32% and accuracy of 90.91% in Task 8 ('L' writing), surpassing the previous best by 4.61% and 6.06% respectively.||
|**2024-10-14**|[Domain-Conditioned Transformer for Fully Test-time Adaptation](http://arxiv.org/abs/2410.10442)|**[link](https://github.com/yushuntang/dct)**|Fully test-time adaptation aims to adapt a network model online based on sequential analysis of input samples during the inference stage. We observe that, when applying a transformer network model into a new domain, the self-attention profiles of image samples in the target domain deviate significantly from those in the source domain, which results in large performance degradation during domain changes. To address this important issue, we propose a new structure for the self-attention modules in the transformer. Specifically, we incorporate three domain-conditioning vectors, called domain conditioners, into the query, key, and value components of the self-attention module. We learn a network to generate these three domain conditioners from the class token at each transformer network layer. We find that, during fully online test-time adaptation, these domain conditioners at each transformer network layer are able to gradually remove the impact of domain shift and largely recover the original self-attention profile. Our extensive experimental results demonstrate that the proposed domain-conditioned transformer significantly improves the online fully test-time domain adaptation performance and outperforms existing state-of-the-art methods by large margins.||
|**2024-10-11**|[AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation](http://arxiv.org/abs/2410.09040)|**[link](https://github.com/ucsc-vlaa/attngcg-attack)**|This paper studies the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks, focusing specifically on the optimization-based Greedy Coordinate Gradient (GCG) strategy. We first observe a positive correlation between the effectiveness of attacks and the internal behaviors of the models. For instance, attacks tend to be less effective when models pay more attention to system prompts designed to ensure LLM safety alignment. Building on this discovery, we introduce an enhanced method that manipulates models' attention scores to facilitate LLM jailbreaking, which we term AttnGCG. Empirically, AttnGCG shows consistent improvements in attack efficacy across diverse LLMs, achieving an average increase of ~7% in the Llama-2 series and ~10% in the Gemma series. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs like GPT-3.5 and GPT-4. Moreover, we note our attention-score visualization is more interpretable, allowing us to gain better insights into how our targeted attention manipulation facilitates more effective jailbreaking. We release the code at https://github.com/UCSC-VLAA/AttnGCG-attack.||
|**2024-10-11**|[Extra Global Attention Designation Using Keyword Detection in Sparse Transformer Architectures](http://arxiv.org/abs/2410.08971)|null|In this paper, we propose an extension to Longformer Encoder-Decoder, a popular sparse transformer architecture. One common challenge with sparse transformers is that they can struggle with encoding of long range context, such as connections between topics discussed at a beginning and end of a document. A method to selectively increase global attention is proposed and demonstrated for abstractive summarization tasks on several benchmark data sets. By prefixing the transcript with additional keywords and encoding global attention on these keywords, improvement in zero-shot, few-shot, and fine-tuned cases is demonstrated for some benchmark data sets.||
|**2024-10-11**|[On-Chip Learning via Transformer In-Context Learning](http://arxiv.org/abs/2410.08711)|null|Autoregressive decoder-only transformers have become key components for scalable sequence processing and generation models. However, the transformer's self-attention mechanism requires transferring prior token projections from the main memory at each time step (token), thus severely limiting their performance on conventional processors. Self-attention can be viewed as a dynamic feed-forward layer, whose matrix is input sequence-dependent similarly to the result of local synaptic plasticity. Using this insight, we present a neuromorphic decoder-only transformer model that utilizes an on-chip plasticity processor to compute self-attention. Interestingly, the training of transformers enables them to "learn" the input context during inference. We demonstrate this in-context learning ability of transformers on the Loihi 2 processor by solving a few-shot classification problem. With this we emphasize the importance of pretrained models especially their ability to find simple, local, backpropagation free, learning rules enabling on-chip learning and adaptation in a hardware friendly manner.||
|**2024-10-11**|[Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation](http://arxiv.org/abs/2410.08626)|null|Recently, symbolic music generation has become a focus of much deep learning research. Structure, as an important part of music, contributes to improving the quality of music, and an increasing number of works have started to study the hierarchical structure. In this study, we delve into the multi-level structures within music from macro-level and micro-level hierarchies. At the macro-level hierarchy, we apply a phrase segmentation algorithm to explore how phrases influence the overall development of music, and at the micro-level hierarchy, we design a skeleton note extraction strategy to explore how skeleton notes within each phrase guide the melody generation. Furthermore, we propose a novel Phrase-level Cross-Attention mechanism to capture the intrinsic relationship between macro-level hierarchy and micro-level hierarchy. Moreover, in response to the current lack of research on Chinese-style music, we construct our Small Tunes Dataset: a substantial collection of MIDI files comprising 10088 Small Tunes, a category of traditional Chinese Folk Songs. This dataset serves as the focus of our study. We generate Small Tunes songs utilizing the extracted skeleton notes as conditions, and experimental results indicate that our proposed model, Small Tunes Transformer, outperforms other state-of-the-art models. Besides, we design three novel objective evaluation metrics to evaluate music from both rhythm and melody dimensions.||
|**2024-10-11**|[DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention](http://arxiv.org/abs/2410.08582)|**[link](https://github.com/maclong01/DeBiFormer)**|Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity-adaptive attention, such as in DAT, has yielded strong results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation tasks. The query-aware sparsity attention in BiFormer seeks to focus each query on top-k routed regions. However, during attention calculation, the selected key-value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which optimizes the selection of key-value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness. Code is available at https://github.com/maclong01/DeBiFormer.||
|**2024-10-10**|[Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow](http://arxiv.org/abs/2410.08243)|null|Banking Transaction Flow (BTF) is sequential data found in a number of banking activities such as marketing, credit risk or banking fraud. It is multimodal data composed of three modalities: a date, a numerical value and a wording. We propose in this work an application of the self-attention mechanism to the processing of BTFs. We trained two general models on a large amount of BTFs in a self-supervised way: one RNN-based model and one Transformer-based model. We proposed a specific tokenization in order to be able to process BTFs. The performance of these two models was evaluated on two banking downstream tasks: a transaction categorization task and a credit risk task. The results show that fine-tuning these two pre-trained models allowed them to perform better than the state-of-the-art approaches for both tasks.||
|**2024-10-10**|[Pretraining Graph Transformers with Atom-in-a-Molecule Quantum Properties for Improved ADMET Modeling](http://arxiv.org/abs/2410.08024)|**[link](https://github.com/aidd-msca/GraphQPT)**|We evaluate the impact of pretraining Graph Transformer architectures on atom-level quantum-mechanical features for the modeling of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of drug-like compounds. We compare this pretraining strategy with two others: one based on molecular quantum properties (specifically the HOMO-LUMO gap) and one using a self-supervised atom masking technique. After fine-tuning on Therapeutic Data Commons ADMET datasets, we evaluate the performance improvement in the different models observing that models pretrained with atomic quantum mechanical properties produce in general better results. We then analyse the latent representations and observe that the supervised strategies preserve the pretraining information after finetuning and that different pretrainings produce different trends in latent expressivity across layers. Furthermore, we find that models pretrained on atomic quantum mechanical properties capture more low-frequency laplacian eigenmodes of the input graph via the attention weights and produce better representations of atomic environments within the molecule. Application of the analysis to a much larger non-public dataset for microsomal clearance illustrates generalizability of the studied indicators. In this case the performances of the models are in accordance with the representation analysis and highlight, especially for the case of masking pretraining and atom-level quantum property pretraining, how model types with similar performance on public benchmarks can have different performances on large scale pharmaceutical data.||
|**2024-10-11**|[BA-Net: Bridge Attention in Deep Neural Networks](http://arxiv.org/abs/2410.07860)|null|Attention mechanisms, particularly channel attention, have become highly influential in numerous computer vision tasks. Despite their effectiveness, many existing methods primarily focus on optimizing performance through complex attention modules applied at individual convolutional layers, often overlooking the synergistic interactions that can occur across multiple layers. In response to this gap, we introduce bridge attention, a novel approach designed to facilitate more effective integration and information flow between different convolutional layers. Our work extends the original bridge attention model (BAv1) by introducing an adaptive selection operator, which reduces information redundancy and optimizes the overall information exchange. This enhancement results in the development of BAv2, which achieves substantial performance improvements in the ImageNet classification task, obtaining Top-1 accuracies of 80.49% and 81.75% when using ResNet50 and ResNet101 as backbone networks, respectively. These results surpass the retrained baselines by 1.61% and 0.77%, respectively. Furthermore, BAv2 outperforms other existing channel attention techniques, such as the classical SENet101, exceeding its retrained performance by 0.52%. Additionally, integrating BAv2 into advanced convolutional networks and vision transformers has led to significant gains in performance across a wide range of computer vision tasks, underscoring its broad applicability.||
|**2024-10-10**|[Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers](http://arxiv.org/abs/2410.07799)|null|Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. However, softmax-based attention puts transformers' trainability at risk. Even *at initialisation*, the propagation of signals and gradients through the random network can be pathological, resulting in known issues such as (i) vanishing/exploding gradients and (ii) *rank collapse*, i.e. when all tokens converge to a single representation *with depth*. This paper examines signal propagation in *attention-only* transformers from a random matrix perspective, illuminating the origin of such issues, as well as unveiling a new phenomenon -- (iii) rank collapse *in width*. Modelling softmax-based attention at initialisation with Random Markov matrices, our theoretical analysis reveals that a *spectral gap* between the two largest singular values of the attention matrix causes (iii), which, in turn, exacerbates (i) and (ii). Building on this insight, we propose a novel, yet simple, practical solution to resolve rank collapse in width by removing the spectral gap. Moreover, we validate our findings and discuss the training benefits of the proposed fix through experiments that also motivate a revision of some of the default parameter scaling. Our attention model accurately describes the standard key-query attention in a single-layer transformer, making this work a significant first step towards a better understanding of the initialisation dynamics in the multi-layer case.||
|**2024-10-10**|[Benign Overfitting in Single-Head Attention](http://arxiv.org/abs/2410.07746)|null|The phenomenon of benign overfitting, where a trained neural network perfectly fits noisy training data but still achieves near-optimal test performance, has been extensively studied in recent years for linear models and fully-connected/convolutional networks. In this work, we study benign overfitting in a single-head softmax attention model, which is the fundamental building block of Transformers. We prove that under appropriate conditions, the model exhibits benign overfitting in a classification setting already after two steps of gradient descent. Moreover, we show conditions where a minimum-norm/maximum-margin interpolator exhibits benign overfitting. We study how the overfitting behavior depends on the signal-to-noise ratio (SNR) of the data distribution, namely, the ratio between norms of signal and noise tokens, and prove that a sufficiently large SNR is both necessary and sufficient for benign overfitting.||
|**2024-10-10**|[Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM](http://arxiv.org/abs/2410.07531)|null|Dropout, a network operator, when enabled is likely to dramatically impact the performance of Flash-Attention, which in turn increases the end-to-end training time of Large-Language-Models (LLMs). The main contributor to such performance degradation is the Random Number Generation (RNG) phase that is traditionally fused into the Flash-Attention kernel. As RNG and Attention have the same hardware bottlenecks, RNG latency can hardly be hidden within the Attention kernel. We propose overlapping RNG with previous GEMM layers in the network to hide RNG runtime and improve end-to-end performance. RNG and GEMM have distinct resource requirements and hardware bottlenecks, so they can run in parallel without compromising each other's performance. Our fine-grained performance model, cross-validated by silicon results, shows 1.14x speedup on one transformer block (including multi-head attention and feed-forward layers) for Llama2, and up to 1.23x speedup when varying workload sizes, on GH100 GPUs with FP8 precision. Further, we extend our theoretical model to different RNG implementations and hardware architectures, and discuss the widely applicable benefits for overlapping RNG with GEMM layers.||
|**2024-10-09**|[VIRT: Vision Instructed Transformer for Robotic Manipulation](http://arxiv.org/abs/2410.07169)|null|Robotic manipulation, owing to its multi-modal nature, often faces significant training ambiguity, necessitating explicit instructions to clearly delineate the manipulation details in tasks. In this work, we highlight that vision instruction is naturally more comprehensible to recent robotic policies than the commonly adopted text instruction, as these policies are born with some vision understanding ability like human infants. Building on this premise and drawing inspiration from cognitive science, we introduce the robotic imagery paradigm, which realizes large-scale robotic data pre-training without text annotations. Additionally, we propose the robotic gaze strategy that emulates the human eye gaze mechanism, thereby guiding subsequent actions and focusing the attention of the policy on the manipulated object. Leveraging these innovations, we develop VIRT, a fully Transformer-based policy. We design comprehensive tasks using both a physical robot and simulated environments to assess the efficacy of VIRT. The results indicate that VIRT can complete very competitive tasks like "opening the lid of a tightly sealed bottle", and the proposed techniques boost the success rates of the baseline policy on diverse challenging tasks from nearly 0% to more than 65%.||
|**2024-10-09**|[Stanceformer: Target-Aware Transformer for Stance Detection](http://arxiv.org/abs/2410.07083)|**[link](https://github.com/kgarg8/stanceformer)**|The task of Stance Detection involves discerning the stance expressed in a text towards a specific subject or target. Prior works have relied on existing transformer models that lack the capability to prioritize targets effectively. Consequently, these models yield similar performance regardless of whether we utilize or disregard target information, undermining the task's significance. To address this challenge, we introduce Stanceformer, a target-aware transformer model that incorporates enhanced attention towards the targets during both training and inference. Specifically, we design a *Target Awareness* matrix that increases the self-attention scores assigned to the targets. We demonstrate the efficacy of the Stanceformer with various BERT-based models, including state-of-the-art models and Large Language Models (LLMs), and evaluate its performance across three stance detection datasets, alongside a zero-shot dataset. Our approach Stanceformer not only provides superior performance but also generalizes even to other domains, such as Aspect-based Sentiment Analysis. We make the code publicly available at https://github.com/kgarg8/Stanceformer.||
|**2024-10-09**|[InAttention: Linear Context Scaling for Transformers](http://arxiv.org/abs/2410.07063)|null|VRAM requirements for transformer models scale quadratically with context length due to the self-attention mechanism. In this paper we modify the decoder-only transformer, replacing self-attention with InAttention, which scales linearly with context length during inference by having tokens attend only to initial states. Benchmarking shows that InAttention significantly reduces VRAM usage during inference, enabling handling of long sequences on consumer GPUs. We corroborate that fine-tuning extends context length efficiently, improving performance on long sequences without high training costs. InAttention offers a scalable solution for long-range dependencies in transformer models, paving the way for further optimization.||
|**2024-10-09**|[Dynamic metastability in the self-attention model](http://arxiv.org/abs/2410.06833)|**[link](https://github.com/hugokoubbi/2024-transformers-dotm)**|We consider the self-attention model - an interacting particle system on the unit sphere, which serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23] - although particles collapse to a single cluster in infinite time, they remain trapped near a configuration of several clusters for an exponentially long period of time. By leveraging a gradient flow interpretation of the system, we also connect our result to an overarching framework of slow motion of gradient flows proposed by Otto and Reznikoff [OR07] in the context of coarsening and the Allen-Cahn equation. We finally probe the dynamics beyond the exponentially long period of metastability, and illustrate that, under an appropriate time-rescaling, the energy reaches its global maximum in finite time and has a staircase profile, with trajectories manifesting saddle-to-saddle-like behavior, reminiscent of recent works in the analysis of training dynamics via gradient descent for two-layer neural networks.||
|**2024-10-09**|[Cluster-wise Graph Transformer with Dual-granularity Kernelized Attention](http://arxiv.org/abs/2410.06746)|**[link](https://github.com/lumia-group/cluster-wise-graph-transformer)**|In the realm of graph learning, there is a category of methods that conceptualize graphs as hierarchical structures, utilizing node clustering to capture broader structural information. While generally effective, these methods often rely on a fixed graph coarsening routine, leading to overly homogeneous cluster representations and loss of node-level information. In this paper, we envision the graph as a network of interconnected node sets without compressing each cluster into a single embedding. To enable effective information transfer among these node sets, we propose the Node-to-Cluster Attention (N2C-Attn) mechanism. N2C-Attn incorporates techniques from Multiple Kernel Learning into the kernelized attention framework, effectively capturing information at both node and cluster levels. We then devise an efficient form for N2C-Attn using the cluster-wise message-passing framework, achieving linear time complexity. We further analyze how N2C-Attn combines bi-level feature maps of queries and keys, demonstrating its capability to merge dual-granularity information. The resulting architecture, Cluster-wise Graph Transformer (Cluster-GT), which uses node clusters as tokens and employs our proposed N2C-Attn module, shows superior performance on various graph-level tasks. Code is available at https://github.com/LUMIA-Group/Cluster-wise-Graph-Transformer.||
|**2024-10-07**|[Differential Transformer](http://arxiv.org/abs/2410.05258)|**[link](https://github.com/microsoft/unilm/blob/master/Diff-Transformer/)**|Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.||
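|**2024-10-07**|[TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention](http://arxiv.org/abs/2410.05076)|**[link](https://github.com/DerrickYLJ/TidalDecode)**|Large language models (LLMs) have driven significant progress across a range of natural language processing tasks, with long-context models standing out for their handling of extended inputs. However, the growing key-value (KV) cache size required by Transformer architectures intensifies memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the tokens most relevant for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention over the pre-selected tokens. This design enables TidalDecode to substantially reduce the token selection overhead of sparse attention without sacrificing the quality of the generated results. Evaluation on a variety of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing LLM decoding latency by up to 2.1x.||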
|**2024-10-07**|[On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent](http://arxiv.org/abs/2410.04870)|null|The Adam optimizer is widely used for transformer optimization in practice, which makes understanding the underlying optimization mechanisms an important problem. However, due to Adam's complexity, theoretical analysis of how it optimizes transformers remains a challenging task. Fortunately, Sign Gradient Descent (SignGD) serves as an effective surrogate for Adam. Despite its simplicity, theoretical understanding of how SignGD optimizes transformers still lags behind. In this work, we study how SignGD optimizes a two-layer transformer -- consisting of a softmax attention layer with trainable query-key parameterization followed by a linear layer -- on a linearly separable noisy dataset. We identify four stages in the training dynamics, each exhibiting intriguing behaviors. Based on the training dynamics, we prove the fast convergence but poor generalization of the learned transformer on the noisy dataset. We also show that Adam behaves similarly to SignGD in terms of both optimization and generalization in this setting. Additionally, we find that the poor generalization of SignGD is not solely due to data noise, suggesting that both SignGD and Adam require high-quality data for real-world tasks. Finally, experiments on synthetic and real-world datasets empirically support our theoretical results.||
|**2024-10-07**|[Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering](http://arxiv.org/abs/2410.04801)|null|The goal of this paper is to improve the performance of pretrained Vision Transformer (ViT) models, particularly DINOv2, in image clustering tasks without requiring re-training or fine-tuning. As model size increases, a high-norm artifact anomaly appears in the patches of multi-head attention. We observe that this anomaly leads to reduced accuracy in zero-shot image clustering. These artifacts are characterized by disproportionately large values in the attention map compared to other patch tokens. To address these artifacts, we propose an approach called Inference-Time Attention Engineering (ITAE), which manipulates the attention function during inference. Specifically, we identify the artifacts by investigating one of the Query-Key-Value (QKV) patches in the multi-head attention and attenuate their corresponding attention values inside the pretrained models. ITAE shows improved clustering accuracy on multiple datasets by exhibiting more expressive features in latent space. Our findings highlight the potential of ITAE as a practical solution for reducing artifacts in pretrained ViT models and improving model performance in clustering tasks without the need for re-training or fine-tuning.||
|**2024-10-07**|[DAPE V2: Process Attention Score as Feature Map for Length Extrapolation](http://arxiv.org/abs/2410.04798)|**[link](https://github.com/chuanyang-zheng/dape)**|The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens, in contrast to earlier feed-forward neural networks. In general, the attention scores are determined simply by the key-query products. However, this work's occasional trial (combining DAPE and NoPE) of including additional MLPs on attention scores without position encoding indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (for neighboring attention scores across different heads) to mimic the processing methods in computer vision. Specifically, the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query and key dot product, and we successfully translate the length extrapolation issue into a well-understood feature map processing problem. The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.||
|**2024-10-07**|[PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners](http://arxiv.org/abs/2410.04733)|**[link](https://github.com/yyyujintang/predformer)**|Spatiotemporal predictive learning methods generally fall into two categories: recurrent-based approaches, which face challenges in parallelization and performance, and recurrent-free methods, which employ convolutional neural networks (CNNs) as encoder-decoder architectures. These methods benefit from strong inductive biases but often at the expense of scalability and generalization. This paper proposes PredFormer, a pure transformer-based framework for spatiotemporal predictive learning. Motivated by the Vision Transformers (ViT) design, PredFormer leverages carefully designed Gated Transformer blocks, following a comprehensive analysis of 3D attention mechanisms, including full-, factorized-, and interleaved- spatial-temporal attention. With its recurrent-free, transformer-based design, PredFormer is both simple and efficient, significantly outperforming previous methods by large margins. Extensive experiments on synthetic and real-world datasets demonstrate that PredFormer achieves state-of-the-art performance. On Moving MNIST, PredFormer achieves a 51.3% reduction in MSE relative to SimVP. For TaxiBJ, the model decreases MSE by 33.1% and boosts FPS from 533 to 2364. Additionally, on WeatherBench, it reduces MSE by 11.1% while enhancing FPS from 196 to 404. These performance gains in both accuracy and efficiency demonstrate PredFormer's potential for real-world applications. The source code will be released at https://github.com/yyyujintang/PredFormer.||
|**2024-10-07**|[Efficient transformer with reinforced position embedding for language models](http://arxiv.org/abs/2410.04731)|null|In this paper, we propose an efficient transformer architecture that uses reinforced positional embedding to obtain superior performance with half the number of encoder decoder layers. We demonstrate that concatenating positional encoding with trainable token embeddings, normalizing columns in the token embedding matrix, and using the normalized token embedding matrix as the value of the attention layer improve the training and validation loss and the training time in an encoder-decoder Transformer model for a Portuguese-English translation task with 10 epochs or 12 hours of training across 10 trials. Our method, with roughly a threefold parameter reduction compared to the baseline model, yields a mean training loss of 1.21, a mean validation loss of 1.51, and an average training time of 1352.27 seconds per epoch, surpassing the baseline model with the same embedding dimension that employs addition of positional encoding and token embeddings, which achieves a mean training loss of 1.96, a validation loss of 2.18, and an average training time of 4297.79 seconds per epoch. Additionally, we evaluated our proposed architecture and the baseline across 14 diverse translation datasets from TensorFlow. The results indicate that our method consistently achieves lower or comparable training and validation losses, suggesting enhanced learning efficiency.||
|**2024-10-07**|[Low-Rank Continual Pyramid Vision Transformer: Incrementally Segment Whole-Body Organs in CT with Light-Weighted Adaptation](http://arxiv.org/abs/2410.04689)|null|Deep segmentation networks achieve high performance when trained on specific datasets. However, in clinical practice, it is often desirable that pretrained segmentation models can be dynamically extended to enable segmenting new organs without access to previous training datasets or without training from scratch. This would ensure a much more efficient model development and deployment paradigm accounting for the patient privacy and data storage issues. This clinically preferred process can be viewed as a continual semantic segmentation (CSS) problem. Previous CSS works would either experience catastrophic forgetting or lead to unaffordable memory costs as models expand. In this work, we propose a new continual whole-body organ segmentation model with light-weighted low-rank adaptation (LoRA). We first train and freeze a pyramid vision transformer (PVT) base segmentation model on the initial task, then continually add light-weighted trainable LoRA parameters to the frozen model for each new learning task. Through a holistic exploration of the architecture modification, we identify the three most important layers (i.e., patch-embedding, multi-head attention and feed forward layers) that are critical in adapting to the new segmentation tasks, while keeping the majority of the pretrained parameters fixed. Our proposed model continually segments new organs without catastrophic forgetting while maintaining a low parameter increasing rate. Continually trained and tested on four datasets covering different body parts of a total of 121 organs, results show that our model achieves high segmentation accuracy, closely reaching the PVT and nnUNet upper bounds, and significantly outperforms other regularization-based CSS methods. Compared to the leading architecture-based CSS method, our model has a substantially lower parameter increasing rate while achieving comparable performance.||
|**2024-10-06**|[DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination](http://arxiv.org/abs/2410.04514)|null|Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. As we know, both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, allowing the model to extract visual information and generate text outputs via attention mechanisms. We find that the attention distribution of the LLM decoder on image tokens is highly consistent with that of the visual encoder, and both distributions tend to focus on particular background tokens rather than the referred objects in the image. We attribute the unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides LLMs to overemphasize the redundant information and generate object hallucination. To address the issue, we propose DAMRO, a novel training-free strategy that Dives into the Attention Mechanism of LVLM to Reduce Object Hallucination. Specifically, our approach employs the classification token (CLS) of ViT to filter out high-attention outlier tokens scattered in the background and then eliminates their influence during the decoding stage. We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using various benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating the hallucination of LVLMs. The code of our method will be released soon.||
|**2024-10-05**|[Fundamental Limitations on Subquadratic Alternatives to Transformers](http://arxiv.org/abs/2410.04271)|null|The Transformer architecture is widely deployed in many popular and impactful Large Language Models. At its core is the attention mechanism for calculating correlations between pairs of tokens. Performing an attention computation takes quadratic time in the input size, and has become the time bottleneck for transformer operations. In order to circumvent this, researchers have used a variety of approaches, including designing heuristic algorithms for performing attention computations faster, and proposing alternatives to the attention mechanism which can be computed more quickly. For instance, state space models such as Mamba were designed to replace attention with an almost linear time alternative. In this paper, we prove that any such approach cannot perform important tasks that Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm. Thus, any model which can be evaluated in subquadratic time - whether because of subquadratic-time heuristics for attention, faster attention replacements like Mamba, or any other reason - cannot perform this task. In other words, in order to perform tasks that (implicitly or explicitly) involve document similarity, one may as well use Transformer and cannot avoid its quadratic running time.||
|**2024-10-04**|[Linear Transformer Topological Masking with Graph Random Features](http://arxiv.org/abs/2410.03462)|null|When training transformers on graph-structured data, incorporating information about the underlying topology is crucial for good performance. Topological masking, a type of relative position encoding, achieves this by upweighting or downweighting attention depending on the relationship between the query and keys in a graph. In this paper, we propose to parameterise topological masks as a learnable function of a weighted adjacency matrix -- a novel, flexible approach which incorporates a strong structural inductive bias. By approximating this mask with graph random features (for which we prove the first known concentration bounds), we show how this can be made fully compatible with linear attention, preserving $\mathcal{O}(N)$ time and space complexity with respect to the number of input tokens. The fastest previous alternative was $\mathcal{O}(N \log N)$ and only suitable for specific graphs. Our efficient masking algorithms provide strong performance gains for tasks on image and point cloud data, including with $>30$k nodes.||
|**2024-10-04**|[Error Correction Code Transformer: From Non-Unified to Unified](http://arxiv.org/abs/2410.03364)|null|Channel coding is vital for reliable data transmission in modern wireless systems, and its significance will increase with the emergence of sixth-generation (6G) networks, which will need to support various error correction codes. However, traditional decoders were typically designed as fixed hardware circuits tailored to specific decoding algorithms, leading to inefficiencies and limited flexibility. To address these challenges, this paper proposes a unified, code-agnostic Transformer-based decoding architecture capable of handling multiple linear block codes, including Polar, Low-Density Parity-Check (LDPC), and Bose-Chaudhuri-Hocquenghem (BCH), within a single framework. To achieve this, standardized units are employed to harmonize parameters across different code types, while the redesigned unified attention module compresses the structural information of various codewords. Additionally, a sparse mask, derived from the sparsity of the parity-check matrix, is introduced to enhance the model's ability to capture inherent constraints between information and parity-check bits, resulting in improved decoding accuracy and robustness. Extensive experimental results demonstrate that the proposed unified Transformer-based decoder not only outperforms existing methods but also provides a flexible, efficient, and high-performance solution for next-generation wireless communication systems.||
|**2024-10-04**|[Selective Transformer for Hyperspectral Image Classification](http://arxiv.org/abs/2410.03171)|null|Transformer has achieved satisfactory results in the field of hyperspectral image (HSI) classification. However, existing Transformer models face two key challenges when dealing with HSI scenes characterized by diverse land cover types and rich spectral information: (1) fixed receptive field representation overlooks effective contextual information; (2) redundant self-attention feature representation. To address these limitations, we propose a novel Selective Transformer (SFormer) for HSI classification. The SFormer is designed to dynamically select receptive fields for capturing both spatial and spectral contextual information, while mitigating the impact of redundant data by prioritizing the most relevant features. This enables a highly accurate classification of the land covers of the HSI. Specifically, a Kernel Selective Transformer Block (KSTB) is first utilized to dynamically select an appropriate receptive field range to effectively extract spatial-spectral features. Furthermore, to capture the most crucial tokens, a Token Selective Transformer Block (TSTB) is introduced, which selects the most relevant tokens based on the ranking of attention scores for each query. Extensive experiments on four benchmark HSI datasets demonstrate that the proposed SFormer outperforms the state-of-the-art HSI classification models. The codes will be released.||
|**2024-10-04**|[Autoregressive Moving-average Attention Mechanism for Time Series Forecasting](http://arxiv.org/abs/2410.03159)|**[link](https://github.com/ljc-fvnr/arma-attention)**|We propose an Autoregressive (AR) Moving-average (MA) attention structure that can adapt to various linear attention mechanisms, enhancing their ability to capture long-range and local temporal patterns in time series. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that incorporating the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.||
|**2024-10-03**|[Towards Understanding the Universality of Transformers for Next-Token Prediction](http://arxiv.org/abs/2410.03011)|null|Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this in-context autoregressive learning ability remains unclear. In this paper, we take a step towards understanding this phenomenon by studying the approximation ability of Transformers for next-token prediction. Specifically, we explore the capacity of causal Transformers to predict the next token $x_{t+1}$ given an autoregressive sequence $(x_1, \dots, x_t)$ as a prompt, where $x_{t+1} = f(x_t)$, and $f$ is a context-dependent function that varies with each sequence. On the theoretical side, we focus on specific instances, namely when $f$ is linear or when $(x_t)_{t \geq 1}$ is periodic. We explicitly construct a Transformer (with linear, exponential, or softmax attention) that learns the mapping $f$ in-context through a causal kernel descent method. The causal kernel descent method we propose provably estimates $x_{t+1}$ based solely on past and current observations $(x_1, \dots, x_t)$, with connections to the Kaczmarz algorithm in Hilbert spaces. We present experimental results that validate our theoretical findings and suggest their applicability to more general mappings $f$.||
|**2024-10-03**|[Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient](http://arxiv.org/abs/2410.02984)|null|We introduce refined variants of the Local Learning Coefficient (LLC), a measure of model complexity grounded in singular learning theory, to study the development of internal structure in transformer language models during training. By applying these *refined LLCs* (rLLCs) to individual components of a two-layer attention-only transformer, we gain novel insights into the progressive differentiation and specialization of attention heads. Our methodology reveals how attention heads differentiate into distinct functional roles over the course of training, analyzes the types of data these heads specialize to process, and discovers a previously unidentified multigram circuit. These findings demonstrate that rLLCs provide a principled, quantitative toolkit for *developmental interpretability*, which aims to understand models through their evolution across the learning process. More broadly, this work takes a step towards establishing the correspondence between data distributional structure, geometric properties of the loss landscape, learning dynamics, and emergent computational structures in neural networks.||
|**2024-10-03**|[GABIC: Graph-based Attention Block for Image Compression](http://arxiv.org/abs/2410.02981)|**[link](https://github.com/EIDOSLAB/GABIC)**|While standardized codecs like JPEG and HEVC-intra represent the industry standard in image compression, neural Learned Image Compression (LIC) codecs represent a promising alternative. In detail, integrating attention mechanisms from Vision Transformers into LIC models has shown improved compression efficiency. However, extra efficiency often comes at the cost of aggregating redundant features. This work proposes a Graph-based Attention Block for Image Compression (GABIC), a method to reduce feature redundancy based on a k-Nearest Neighbors enhanced attention mechanism. Our experiments show that GABIC outperforms comparable methods, particularly at high bit rates, enhancing compression performance.||
|**2024-10-03**|[Selective Attention Improves Transformer](http://arxiv.org/abs/2410.02703)|null|Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention improves language modeling performance in a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention than those without selective attention, at the same validation perplexity.||
|**2024-10-03**|[Deconstructing Recurrence, Attention, and Gating: Investigating the transferability of Transformers and Gated Recurrent Neural Networks in forecasting of dynamical systems](http://arxiv.org/abs/2410.02654)|null|Machine learning architectures, including transformers and recurrent neural networks (RNNs), have revolutionized forecasting in applications ranging from text processing to extreme weather. Notably, advanced network architectures, tuned for applications such as natural language processing, are transferable to other tasks such as spatiotemporal forecasting tasks. However, there is a scarcity of ablation studies to illustrate the key components that enable this forecasting accuracy. The absence of such studies, although explainable due to the associated computational cost, intensifies the belief that these models ought to be considered as black boxes. In this work, we decompose the key architectural components of the most powerful neural architectures, namely gating and recurrence in RNNs, and attention mechanisms in transformers. Then, we synthesize and build novel hybrid architectures from the standard blocks, performing ablation studies to identify which mechanisms are effective for each task. The importance of considering these components as hyper-parameters that can augment the standard architectures is exhibited on various forecasting datasets, including the spatiotemporal chaotic dynamics of the multiscale Lorenz 96 system, the Kuramoto-Sivashinsky equation, and standard real world time-series benchmarks. A key finding is that neural gating and attention improve the performance of all standard RNNs in most tasks, while the addition of a notion of recurrence in transformers is detrimental. Furthermore, our study reveals that a novel, sparsely used architecture, which integrates Recurrent Highway Networks with neural gating and attention mechanisms, emerges as the best performing architecture in high-dimensional spatiotemporal forecasting of dynamical systems.||
|**2024-10-03**|[NestedMorph: Enhancing Deformable Medical Image Registration with Nested Attention Mechanisms](http://arxiv.org/abs/2410.02550)|**[link](https://github.com/as-lab/marthi-et-al-2024-nestedmorph-deformable-medical-image-registration)**|Deformable image registration is crucial for aligning medical images in a non-linear fashion across different modalities, allowing for precise spatial correspondence between varying anatomical structures. This paper presents NestedMorph, a novel network utilizing a Nested Attention Fusion approach to improve intra-subject deformable registration between T1-weighted (T1w) MRI and diffusion MRI (dMRI) data. NestedMorph integrates high-resolution spatial details from an encoder with semantic information from a decoder using a multi-scale framework, enhancing both local and global feature extraction. Our model notably outperforms existing methods, including CNN-based approaches like VoxelMorph, MIDIR, and CycleMorph, as well as Transformer-based models such as TransMorph and ViT-V-Net, and traditional techniques like NiftyReg and SyN. Evaluations on the HCP dataset demonstrate that NestedMorph achieves superior performance across key metrics, including SSIM, HD95, and SDlogJ, with the highest SSIM of 0.89, and the lowest HD95 of 2.5 and SDlogJ of 0.22. These results highlight NestedMorph's ability to capture both local and global image features effectively, leading to superior registration performance. The promising outcomes of this study underscore NestedMorph's potential to significantly advance deformable medical image registration, providing a robust framework for future research and clinical applications. The source code and our implementation are available at: https://bit.ly/3zdVqcg||
|**2024-10-03**|[SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration](http://arxiv.org/abs/2410.02367)|**[link](https://github.com/thu-ml/SageAttention)**|The transformer architecture predominates across various models. As the heart of the transformer, attention has a computational complexity of O(N^2), compared to O(N) for linear transformations. When handling large sequence lengths, attention becomes the primary time-consuming component. Although quantization has proven to be an effective method for accelerating model inference, existing quantization methods primarily focus on optimizing the linear layer. In response, we first analyze the feasibility of quantization in attention in detail. Following that, we propose SageAttention, a highly efficient and accurate quantization method for attention. The OPS (operations per second) of our approach outperforms FlashAttention2 and xformers by about 2.1 times and 2.7 times, respectively. SageAttention also achieves superior accuracy performance over FlashAttention3. Comprehensive experiments confirm that our approach incurs almost no end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation.||
|**2024-10-03**|[Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization](http://arxiv.org/abs/2410.02247)|**[link](https://github.com/chen123CtrlS/LightweightAtt)**|Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive parameterization. In this paper, we investigate two remarkable phenomena observed during the fine-tuning of LLMs, particularly focusing on the attention mechanism: (1) Different Impact: optimizing the $\mathbf{W}_v$ matrix significantly improves performance over optimizing the $\mathbf{W}_k$ matrix. Fine-tuning only the $\mathbf{W}_q$ and $\mathbf{W}_v$ matrices is computationally efficient, delivering results that are comparable to, or even better than, fine-tuning all three matrices $\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$. (2) Efficient Convergence: employing distinct learning rates for these matrices is crucial for optimal performance, with a higher learning rate for the $\mathbf{W}_v$ matrix expediting convergence. However, theoretical analyses of these phenomena are still relatively limited. We present a theoretical analysis of these phenomena from two perspectives: (i) Generalization, where we demonstrate that fine-tuning only $\mathbf{W}_q$ and $\mathbf{W}_v$ improves generalization bounds and enhances memory efficiency, and (ii) Optimization, where we emphasize that the feature learning of the attention mechanism is efficient, particularly when using distinct learning rates for the matrices, which leads to more effective fine-tuning. Building on these insights, we propose a new strategy that improves fine-tuning efficiency in terms of both storage and time. Experimental results on benchmark datasets validate the effectiveness of this approach, supporting our theoretical findings. Our analysis lays the theoretical groundwork for configuring and improving lightweight algorithms in LLM fine-tuning.||
|**2024-10-03**|[HATFormer: Historic Handwritten Arabic Text Recognition with Transformers](http://arxiv.org/abs/2410.02179)|null|Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer's attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.||
|**2024-10-03**|[Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis](http://arxiv.org/abs/2410.02167)|null|Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability of large language models by augmenting the query using multiple examples with multiple intermediate steps. Despite the empirical success, the theoretical understanding of how to train a Transformer to achieve the CoT ability remains less explored. This is primarily due to the technical challenges involved in analyzing the nonconvex optimization on nonlinear attention models. To the best of our knowledge, this work provides the first theoretical study of training Transformers with nonlinear attention to obtain the CoT generalization capability, so that the resulting model can perform inference on unseen tasks when the input is augmented by examples of the new task. We first quantify the required training samples and iterations to train a Transformer model towards CoT ability. We then prove the success of its CoT generalization on unseen tasks with distribution-shifted testing data. Moreover, we theoretically characterize the conditions for an accurate reasoning output by CoT even when the provided reasoning examples contain noise and are not always accurate. In contrast, in-context learning (ICL), which can be viewed as one-step CoT without intermediate steps, may fail to provide an accurate output when CoT does. These theoretical findings are justified through experiments.||
|**2024-10-02**|[Positional Attention: Out-of-Distribution Generalization and Expressivity for Neural Algorithmic Reasoning](http://arxiv.org/abs/2410.01686)|**[link](https://github.com/opallab/positional_attention)**|There has been a growing interest in the ability of neural networks to solve algorithmic tasks, such as arithmetic, summary statistics, and sorting. While state-of-the-art models like Transformers have demonstrated good generalization performance on in-distribution tasks, their out-of-distribution (OOD) performance is poor when trained end-to-end. In this paper, we focus on value generalization, a common instance of OOD generalization where the test distribution has the same input sequence length as the training distribution, but the value ranges in the training and test distributions do not necessarily overlap. To address this issue, we propose that using fixed positional encodings to determine attention weights (referred to as positional attention) enhances empirical OOD performance while maintaining expressivity. We support our claim about expressivity by proving that Transformers with positional attention can effectively simulate parallel algorithms.||
|**2024-10-02**|[On The Adaptation of Unlimiformer for Decoder-Only Transformers](http://arxiv.org/abs/2410.01637)|null|One of the prominent issues stifling the current generation of large language models is their limited context length. Recent proprietary models such as GPT-4 and Claude 2 have introduced longer context lengths, 8k/32k and 100k, respectively; however, despite the efforts in the community, most common models, such as LLama-2, have a context length of 4k or less. Unlimiformer (Bertsch et al., 2023) is a recently popular vector-retrieval augmentation method that offloads cross-attention computations to a kNN index. However, its main limitation is incompatibility with decoder-only transformers out of the box. In this work, we explore practical considerations of adapting Unlimiformer to decoder-only transformers and introduce a series of modifications to overcome this limitation. Moreover, we expand the original experimental setup on summarization to include a new task (i.e., free-form Q&A) and an instruction-tuned model (i.e., a custom 6.7B GPT model). Our results showcase the effectiveness of these modifications on summarization, performing on par with a model with 2x the context length. Moreover, we discuss limitations and future directions for free-form Q&A and instruction-tuned models.||
|**2024-10-02**|[Attention layers provably solve single-location regression](http://arxiv.org/abs/2410.01537)|**[link](https://github.com/pierremarion23/single-location-regression)**|Attention-based models, such as Transformer, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear representations. To address this gap, we introduce the single-location regression task, where only one token in a sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input. To solve this task, we propose a dedicated predictor, which turns out to be a simplified version of a non-linear self-attention layer. We study its theoretical properties, by showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and internal linear structures.||
|**2024-09-30**|[CBAM-SwinT-BL: Small Rail Surface Detect Detection Method Based on Swin Transformer with Block Level CBAM Enhancement](http://arxiv.org/abs/2409.20113)|null|Under high-intensity rail operations, rail tracks endure considerable stresses resulting in various defects such as corrugation and spalling. Failure to effectively detect defects and provide maintenance in time would compromise service reliability and public safety. While advanced models have been developed in recent years, efficiently identifying small-scale rail defects has not yet been studied, especially for categories such as Dirt or Squat on the rail surface. To address this challenge, this study utilizes Swin Transformer (SwinT) as the baseline and incorporates the Convolutional Block Attention Module (CBAM) for enhancement. Our proposed method integrates CBAM successively within the Swin Transformer blocks, resulting in significant performance improvement in rail defect detection, particularly for categories with small instance sizes. The proposed framework is named CBAM-Enhanced Swin Transformer in Block Level (CBAM-SwinT-BL). Experiments and ablation studies demonstrate the effectiveness of the framework. The proposed framework notably improves accuracy on small-size defects, such as the dirt and dent categories in the RIII dataset, with mAP-50 increasing by +23.0% and +38.3% respectively, and the squat category in the MUET dataset also reaches +13.2% higher than the original model. Compared to the original SwinT, CBAM-SwinT-BL increases overall precision by around +5% on the MUET dataset and +7% on the RIII dataset, reaching 69.1% and 88.1% respectively. Meanwhile, the additional CBAM module extends model training time by an average of only +0.04s/iteration, which is acceptable compared to the significant improvement in system performance.||
|**2024-09-30**|[SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers](http://arxiv.org/abs/2409.19850)|null|Over the past few years, vision transformers (ViTs) have consistently demonstrated remarkable performance across various visual recognition tasks. However, attempts to enhance their robustness have yielded limited success, mainly focusing on different training strategies, input patch augmentation, or network structural enhancements. These approaches often involve extensive training and fine-tuning, which are time-consuming and resource-intensive. To tackle these obstacles, we introduce a novel approach named Spatial Autocorrelation Token Analysis (SATA). By harnessing spatial relationships between token features, SATA enhances both the representational capacity and robustness of ViT models. This is achieved through the analysis and grouping of tokens according to their spatial autocorrelation scores prior to their input into the Feed-Forward Network (FFN) block of the self-attention mechanism. Importantly, SATA seamlessly integrates into existing pre-trained ViT baselines without requiring retraining or additional fine-tuning, while concurrently improving efficiency by reducing the computational load of the FFN units. Experimental results show that the baseline ViTs enhanced with SATA not only achieve a new state-of-the-art top-1 accuracy on ImageNet-1K image classification (94.9%) but also establish new state-of-the-art performance across multiple robustness benchmarks, including ImageNet-A (top-1=63.6%), ImageNet-R (top-1=79.2%), and ImageNet-C (mCE=13.6%), all without requiring additional training or fine-tuning of baseline models.||
|**2024-09-29**|[Spiking Transformer with Spatial-Temporal Attention](http://arxiv.org/abs/2409.19764)|**[link](https://github.com/intelligent-computing-lab-yale/statten)**|Spiking Neural Networks (SNNs) present a compelling and energy-efficient alternative to traditional Artificial Neural Networks (ANNs) due to their sparse binary activation. Leveraging the success of the transformer architecture, the spiking transformer architecture is explored to scale up dataset size and performance. However, existing works only consider the spatial self-attention in spiking transformer, neglecting the inherent temporal context across the timesteps. In this work, we introduce Spiking Transformer with Spatial-Temporal Attention (STAtten), a simple and straightforward architecture designed to integrate spatial and temporal information in self-attention with negligible additional computational load. The STAtten divides the temporal or token index and calculates the self-attention in a cross-manner to effectively incorporate spatial-temporal information. We first verify our spatial-temporal attention mechanism's ability to capture long-term temporal dependencies using sequential datasets. Moreover, we validate our approach through extensive experiments on varied datasets, including CIFAR10/100, ImageNet, CIFAR10-DVS, and N-Caltech101. Notably, our cross-attention mechanism achieves an accuracy of 78.39% on the ImageNet dataset.||
|**2024-09-29**|[OrientedFormer: An End-to-End Transformer-Based Oriented Object Detector in Remote Sensing Images](http://arxiv.org/abs/2409.19648)|**[link](https://github.com/wokaikaixinxin/OrientedFormer)**|Oriented object detection in remote sensing images is a challenging task due to objects being distributed in multiple orientations. Recently, end-to-end transformer-based methods have achieved success by eliminating the need for post-processing operators compared to traditional CNN-based methods. However, directly extending transformers to oriented object detection presents three main issues: 1) objects rotate arbitrarily, necessitating the encoding of angles along with position and size; 2) the geometric relations of oriented objects are lacking in self-attention, due to the absence of interaction between content and positional queries; and 3) oriented objects cause misalignment, mainly between values and positional queries in cross-attention, making accurate classification and localization difficult. In this paper, we propose an end-to-end transformer-based oriented object detector, consisting of three dedicated modules to address these issues. First, Gaussian positional encoding is proposed to encode the angle, position, and size of oriented boxes using Gaussian distributions. Second, Wasserstein self-attention is proposed to introduce geometric relations and facilitate interaction between content and positional queries by utilizing Gaussian Wasserstein distance scores. Third, oriented cross-attention is proposed to align values and positional queries by rotating sampling points around the positional query according to their angles. Experiments on six datasets (DIOR-R, a series of DOTA, HRSC2016, and ICDAR2015) show the effectiveness of our approach. Compared with previous end-to-end detectors, the OrientedFormer gains 1.16 and 1.21 AP$_{50}$ on DIOR-R and DOTA-v1.0 respectively, while reducing training epochs from 3$\times$ to 1$\times$. The codes are available at https://github.com/wokaikaixinxin/OrientedFormer.||
|**2024-09-28**|[Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization](http://arxiv.org/abs/2409.19345)|null|Transformers have demonstrated great power in the recent development of large foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary changes to the field of vision, achieving significant accomplishments on the experimental side. However, their theoretical capabilities, particularly in terms of generalization when trained to overfit training data, are still not fully understood. To address this gap, this work delves deeply into the benign overfitting perspective of transformers in vision. To this end, we study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model. By developing techniques that address the challenges posed by softmax and the interdependent nature of multiple weights in transformer optimization, we successfully characterized the training dynamics and achieved generalization in post-training. Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model. The theoretical results are further verified by experimental simulation.||
|**2024-09-28**|[Intelligent Fish Detection System with Similarity-Aware Transformer](http://arxiv.org/abs/2409.19323)|**[link](https://github.com/vision4robotics/fishvit)**|Fish detection in water-land transfer has significantly contributed to the fishery. However, manual fish detection through crowd collaboration is inefficient, expensive, and insufficiently accurate. To further enhance the water-land transfer efficiency, improve detection accuracy, and reduce labor costs, this work designs a new type of lightweight and plug-and-play edge intelligent vision system to automatically conduct fast fish detection with a high-speed camera. Moreover, a novel similarity-aware vision Transformer for fast fish detection (FishViT) is proposed to identify, onboard, every single fish in a dense and similar group. Specifically, a novel similarity-aware multi-level encoder is developed to enhance multi-scale features in parallel, thereby yielding discriminative representations for varying-size fish. Additionally, a new soft-threshold attention mechanism is introduced, which not only effectively eliminates background noise from images but also accurately recognizes both the edge details and overall features of different similar fish. A benchmark of 85 challenging high-frame-rate, high-resolution video sequences is collected from real fish water-land transfer scenarios. Exhaustive evaluation conducted with this challenging benchmark has proved the robustness and effectiveness of FishViT with over 80 FPS. Tests in real working scenarios validate the practicality of the proposed method. The code and demo video are available at https://github.com/vision4robotics/FishViT.||
|**2024-09-28**|[Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models](http://arxiv.org/abs/2409.19315)|**[link](https://github.com/NathanLeroux-git/GainCellAttention)**|Transformer neural networks, driven by self-attention mechanisms, are core components of foundational and Large Language Models. In generative transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks for long sequences. In this work, we propose a fast and energy-efficient hardware implementation of self-attention using analog in-memory computing based on gain cell memories. Volatile gain cell memories can be efficiently written to store new tokens during sequence generation, while performing analog signed weight multiplications to compute the dot-products required for self-attention. We implement Sliding Window Attention, which keeps memory of a finite set of past steps. A charge-to-pulse converter for array readout eliminates the need for analog-to-digital conversion between self-attention stages. Using a co-designed initialization algorithm to adapt pre-trained weights to gain cell non-idealities, we achieve NLP performance comparable to GPT-2 with minimal training iterations, despite hardware constraints. Our end-to-end hardware design includes digital controls, estimating area, latency, and energy. The system reduces attention latency by up to two orders of magnitude and energy consumption by up to five orders compared to GPUs, marking a significant step toward ultra-fast, low-power sequence generation in Large Language Models.||
|**2024-09-27**|[Feature Estimation of Global Language Processing in EEG Using Attention Maps](http://arxiv.org/abs/2409.19174)|null|Understanding the correlation between EEG features and cognitive tasks is crucial for elucidating brain function. Brain activity synchronizes during speaking and listening tasks. However, it is challenging to estimate task-dependent brain activity characteristics with methods with low spatial resolution but high temporal resolution, such as EEG, rather than methods with high spatial resolution, like fMRI. This study introduces a novel approach to EEG feature estimation that utilizes the weights of deep learning models to explore this association. We demonstrate that attention maps generated from Vision Transformers and EEGNet effectively identify features that align with findings from prior studies. EEGNet emerged as the most accurate model regarding subject independence and the classification of Listening and Speaking tasks. The application of Mel-Spectrogram with ViTs enhances the resolution of temporal and frequency-related EEG characteristics. Our findings reveal that the characteristics discerned through attention maps vary significantly based on the input data, allowing for tailored feature extraction from EEG signals. By estimating features, our study reinforces known attributes and predicts new ones, potentially offering fresh perspectives in utilizing EEG for medical purposes, such as early disease detection. These techniques will make substantial contributions to cognitive neuroscience.||
|**2024-09-27**|[Cottention: Linear Transformers With Cosine Attention](http://arxiv.org/abs/2409.18747)|**[link](https://github.com/gmongaras/Cottention_Transformer)**|Attention mechanisms, particularly softmax attention, have been instrumental in the success of transformer-based models such as GPT. However, the quadratic memory complexity of softmax attention with respect to sequence length poses significant challenges for processing longer sequences. We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity. By leveraging the properties of cosine similarity and rearranging the attention equation, Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention. We demonstrate that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, allowing for constant memory usage during inference. We evaluate Cottention on both the bidirectional BERT and causal GPT tasks, demonstrating comparable performance to softmax attention while significantly reducing memory requirements. To ensure efficient computation, we develop a custom CUDA kernel for Cottention. Our results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.||
|**2024-09-27**|[Token Caching for Diffusion Transformer Acceleration](http://arxiv.org/abs/2409.18523)|null|Diffusion transformers have gained substantial interest in diffusion generative modeling due to their outstanding performance. However, their high computational cost, arising from the quadratic computational complexity of attention mechanisms and multi-step inference, presents a significant bottleneck. To address this challenge, we propose TokenCache, a novel post-training acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computations among tokens across inference steps. TokenCache specifically addresses three critical questions in the context of diffusion transformers: (1) which tokens should be pruned to eliminate redundancy, (2) which blocks should be targeted for efficient pruning, and (3) at which time steps caching should be applied to balance speed and quality. In response to these challenges, TokenCache introduces a Cache Predictor that assigns importance scores to tokens, enabling selective pruning without compromising model performance. Furthermore, we propose an adaptive block selection strategy to focus on blocks with minimal impact on the network's output, along with a Two-Phase Round-Robin (TPRR) scheduling policy to optimize caching intervals throughout the denoising process. Experimental results across various models demonstrate that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers. Our code will be publicly available.||
|**2024-09-26**|[Decomposable Transformer Point Processes](http://arxiv.org/abs/2409.18158)|null|The standard paradigm of modeling marked point processes is by parameterizing the intensity function using an attention-based (Transformer-style) architecture. Despite the flexibility of these methods, their inference is based on the computationally intensive thinning algorithm. In this work, we propose a framework where the advantages of the attention-based architecture are maintained and the limitation of the thinning algorithm is circumvented. The framework depends on modeling the conditional distribution of inter-event times with a mixture of log-normals satisfying a Markov property and the conditional probability mass function for the marks with a Transformer-based architecture. The proposed method attains state-of-the-art performance in predicting the next event of a sequence given its history. The experiments also reveal the efficacy of the methods that do not rely on the thinning algorithm during inference over those that do. Finally, we test our method on the challenging long-horizon prediction task and find that it outperforms a baseline developed specifically for tackling this task; importantly, inference requires just a fraction of the time compared to the thinning-based baseline.||
|**2024-09-26**|[Supra-Laplacian Encoding for Transformer on Dynamic Graphs](http://arxiv.org/abs/2409.17986)|**[link](https://github.com/ykrmm/slate)**|Fully connected Graph Transformers (GT) have rapidly become prominent in the static graph community as an alternative to Message-Passing models, which suffer from a lack of expressivity, oversquashing, and under-reaching. However, in a dynamic context, by interconnecting all nodes at multiple snapshots with self-attention, GTs lose both structural and temporal information. In this work, we introduce Supra-LAplacian encoding for spatio-temporal TransformErs (SLATE), a new spatio-temporal encoding to leverage the GT architecture while keeping spatio-temporal information. Specifically, we transform Discrete Time Dynamic Graphs into multi-layer graphs and take advantage of the spectral properties of their associated supra-Laplacian matrix. Our second contribution explicitly models nodes' pairwise relationships with a cross-attention mechanism, providing an accurate edge representation for dynamic link prediction. SLATE outperforms numerous state-of-the-art methods based on Message-Passing Graph Neural Networks combined with recurrent models (e.g., LSTM), and Dynamic Graph Transformers, on 9 datasets. Code and instructions to reproduce our results will be open-sourced.||
|**2024-09-26**|[Self-supervised Monocular Depth Estimation with Large Kernel Attention](http://arxiv.org/abs/2409.17895)|null|Self-supervised monocular depth estimation has emerged as a promising approach since it does not rely on labeled training data. Most methods combine convolution and Transformer to model long-distance dependencies to estimate depth accurately. However, Transformer treats 2D image features as 1D sequences, and although positional encoding somewhat mitigates the loss of spatial information between different feature blocks, it tends to overlook channel features, which limits the performance of depth estimation. In this paper, we propose a self-supervised monocular depth estimation network to get finer details. Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies without compromising the two-dimensional structure of features while maintaining feature channel adaptivity. In addition, we introduce an up-sampling module to accurately recover the fine details in the depth map. Our method achieves competitive results on the KITTI dataset.||
|**2024-09-26**|[CASPFormer: Trajectory Prediction from BEV Images with Deformable Attention](http://arxiv.org/abs/2409.17790)|null|Motion prediction is an important aspect of Autonomous Driving (AD) and Advanced Driver Assistance Systems (ADAS). Current state-of-the-art motion prediction methods rely on High Definition (HD) maps for capturing the surrounding context of the ego vehicle. Such systems lack scalability in real-world deployment as HD maps are expensive to produce and update in real-time. To overcome this issue, we propose Context Aware Scene Prediction Transformer (CASPFormer), which can perform multi-modal motion prediction from rasterized Bird's-Eye-View (BEV) images. Our system can be integrated with any upstream perception module that is capable of generating BEV images. Moreover, CASPFormer directly decodes vectorized trajectories without any postprocessing. Trajectories are decoded recurrently using deformable attention, as it is computationally efficient and provides the network with the ability to focus its attention on the important spatial locations of the BEV images. In addition, we also address the issue of mode collapse for generating multiple scene-consistent trajectories by incorporating learnable mode queries. We evaluate our model on the nuScenes dataset and show that it reaches state-of-the-art across multiple metrics.||
|**2024-09-26**|[Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition](http://arxiv.org/abs/2409.17746)|null|Attention-based encoder-decoder models, e.g., the transformer and its variants, generate the output sequence in an autoregressive (AR) manner. Despite its superior performance, the AR model is computationally inefficient as its generation requires as many iterations as the output length. In this paper, we propose Paraformer-v2, an improved version of Paraformer, for fast, accurate, and noise-robust non-autoregressive speech recognition. In Paraformer-v2, we use a CTC module to extract the token embeddings, as an alternative to the continuous integrate-and-fire module in Paraformer. Extensive experiments demonstrate that Paraformer-v2 outperforms Paraformer on multiple datasets, especially on the English datasets (over 14% improvement on WER), and is more robust in noisy environments.||
|**2024-09-26**|[Optimal Memorization Capacity of Transformers](http://arxiv.org/abs/2409.17677)|null|Recent research in the field of machine learning has increasingly focused on the memorization capacity of Transformers, but how efficient they are is not yet well understood. We demonstrate that Transformers can memorize labels with $\tilde{O}(\sqrt{N})$ parameters in a next-token prediction setting for $N$ input sequences of length $n$, which is proved to be optimal up to logarithmic factors. This indicates that Transformers can efficiently perform memorization with little influence from the input length $n$ owing to the benefit of parameter sharing. We also analyze the memorization capacity in the sequence-to-sequence setting, and find that $\tilde{O}(\sqrt{nN})$ parameters are not only sufficient, but also necessary at least for Transformers with hardmax. These results suggest that while self-attention mechanisms can efficiently identify input sequences, the feed-forward network becomes a bottleneck when associating a label to each token.||
|**2024-09-26**|[Benign or Not-Benign Overfitting in Token Selection of Attention Mechanism](http://arxiv.org/abs/2409.17625)|**[link](https://github.com/keitaroskmt/benign-attention)**|Modern over-parameterized neural networks can be trained to fit the training data perfectly while still maintaining a high generalization performance. This "benign overfitting" phenomenon has been studied in a surge of recent theoretical work; however, most of these studies have been limited to linear models or two-layer neural networks. In this work, we analyze benign overfitting in the token selection mechanism of the attention architecture, which characterizes the success of transformer models. We first show the existence of a benign overfitting solution and explain its mechanism in the attention architecture. Next, we discuss whether the model converges to such a solution, raising the difficulties specific to the attention architecture. We then present benign overfitting cases and not-benign overfitting cases by conditioning different scenarios based on the behavior of attention probabilities during training. To the best of our knowledge, this is the first study to characterize benign overfitting for the attention mechanism.||
|**2024-09-26**|[Dynamic Subframe Splitting and Spatio-Temporal Motion Entangled Sparse Attention for RGB-E Tracking](http://arxiv.org/abs/2409.17560)|null|Event-based bionic cameras asynchronously capture dynamic scenes with high temporal resolution and high dynamic range, offering potential for the integration of events and RGB under conditions of illumination degradation and fast motion. Existing RGB-E tracking methods model event characteristics utilising the attention mechanism of the Transformer before integrating both modalities. Nevertheless, these methods involve aggregating the event stream into a single event frame, lacking the utilisation of the temporal information inherent in the event stream. Moreover, the traditional attention mechanism is well-suited for dense semantic features, while the attention mechanism for sparse event features requires revolution. In this paper, we propose a dynamic event subframe splitting strategy to split the event stream into more fine-grained event clusters, aiming to capture spatio-temporal features that contain motion cues. Based on this, we design an event-based sparse attention mechanism to enhance the interaction of event features in temporal and spatial dimensions. The experimental results indicate that our method outperforms existing state-of-the-art methods on the FE240 and COESOT datasets, providing an effective processing manner for the event data.||
|**2024-09-26**|[MASSFormer: Mobility-Aware Spectrum Sensing using Transformer-Driven Tiered Structure](http://arxiv.org/abs/2409.17546)|null|In this paper, we develop a novel mobility-aware transformer-driven tiered structure (MASSFormer) based cooperative spectrum sensing method that effectively models the spatio-temporal dynamics of user movements. Unlike existing methods, our method considers a dynamic scenario involving mobile primary users (PUs) and secondary users (SUs) and addresses the complexities introduced by user mobility. The transformer architecture utilizes an attention mechanism, enabling the proposed method to adeptly model the temporal dynamics of user mobility by effectively capturing long-range dependencies within the input data. The proposed method first computes tokens from the sequence of covariance matrices (CMs) for each SU and processes them in parallel using the SU-transformer network to learn the spatio-temporal features at SU-level. Subsequently, the collaborative transformer network learns the group-level PU state from all SU-level feature representations. The attention-based sequence pooling method followed by the transformer encoder adjusts the contributions of all tokens. The main goal of predicting the PU states at each SU-level and group-level is to further improve detection performance. We conducted extensive simulations and compared the detection performance of different SS methods. The proposed method is tested under imperfect reporting channel scenarios to show robustness. The efficacy of our method is validated with the simulation results demonstrating its higher performance compared with existing methods in terms of detection probability, sensing error, and classification accuracy.||
|**2024-09-26**|[NeuroPath: A Neural Pathway Transformer for Joining the Dots of Human Connectomes](http://arxiv.org/abs/2409.17510)|**[link](https://github.com/Chrisa142857/neuro_detour)**|Although modern imaging technologies allow us to study connectivity between two distinct brain regions in-vivo, an in-depth understanding of how anatomical structure supports brain function and how spontaneous functional fluctuations emerge remarkable cognition is still elusive. Meanwhile, tremendous efforts have been made in the realm of machine learning to establish the nonlinear mapping between neuroimaging data and phenotypic traits. However, the absence of neuroscience insight in the current approaches poses significant challenges in understanding cognitive behavior from transient neural activities. To address this challenge, we put the spotlight on the coupling mechanism of structural connectivity (SC) and functional connectivity (FC) by formulating such network neuroscience question into an expressive graph representation learning problem for high-order topology. Specifically, we introduce the concept of topological detour to characterize how a ubiquitous instance of FC (direct link) is supported by neural pathways (detour) physically wired by SC, which forms a cyclic loop interacted by brain structure and function. In the cliché of machine learning, the multi-hop detour pathway underlying SC-FC coupling allows us to devise a novel multi-head self-attention mechanism within Transformer to capture multi-modal feature representation from paired graphs of SC and FC. Taken together, we propose a biological-inspired deep model, coined as NeuroPath, to find putative connectomic feature representations from the unprecedented amount of neuroimages, which can be plugged into various downstream applications such as task recognition and disease diagnosis. We have evaluated NeuroPath on large-scale public datasets including HCP and UK Biobank under supervised and zero-shot learning, where the state-of-the-art performance by our NeuroPath indicates great potential in network neuroscience.||
|**2024-09-25**|[Non-asymptotic Convergence of Training Transformers for Next-token Prediction](http://arxiv.org/abs/2409.17335)|null|Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data, especially in next-token prediction (NTP) tasks. However, the theoretical understanding of their performance in NTP is limited, with existing studies focusing mainly on asymptotic performance. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer consisting of a self-attention module followed by a feed-forward layer. We first characterize the essential structural properties of training datasets for NTP using a mathematical framework based on partial orders. Then, we design a two-stage training algorithm, where the pre-processing stage for training the feed-forward layer and the main stage for training the attention layer exhibit fast convergence performance. Specifically, both layers converge sub-linearly to the direction of their corresponding max-margin solutions. We also show that the cross-entropy loss enjoys a linear convergence rate. Furthermore, we show that the trained transformer presents non-trivial prediction ability with dataset shift, which sheds light on the remarkable generalization performance of transformers. Our analysis technique involves the development of novel properties on the attention gradient and further in-depth analysis of how these properties contribute to the convergence of the training process. Our experiments further validate our theoretical findings.||
|**2024-09-24**|[MonoFormer: One Transformer for Both Diffusion and Autoregression](http://arxiv.org/abs/2409.16280)|**[link](https://github.com/MonoFormer/MonoFormer)**|Most existing multimodality methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or the same backbone by discretizing the visual data to use autoregression for both text and visual generation. In this paper, we propose to study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) Transformer is successfully applied to diffusion for visual generation, and (ii) transformer training for autoregression and diffusion is very similar, and the difference merely lies in that diffusion uses bidirectional attention mask and autoregression uses causal attention mask. Experimental results show that our approach achieves comparable image generation performance to current state-of-the-art methods as well as maintains the text generation capability. The project is publicly available at https://monoformer.github.io/.||
|**2024-09-24**|[TE-PINN: Quaternion-Based Orientation Estimation using Transformer-Enhanced Physics-Informed Neural Networks](http://arxiv.org/abs/2409.16214)|null|This paper introduces a Transformer-Enhanced Physics-Informed Neural Network (TE-PINN) designed for accurate quaternion-based orientation estimation in high-dynamic environments, particularly within the field of robotics. By integrating transformer networks with physics-informed learning, our approach innovatively captures temporal dependencies in sensor data while enforcing the fundamental physical laws governing rotational motion. TE-PINN leverages a multi-head attention mechanism to handle sequential data from inertial sensors, such as accelerometers and gyroscopes, ensuring temporal consistency. Simultaneously, the model embeds quaternion kinematics and rigid body dynamics into the learning process, aligning the network's predictions with mechanical principles like Euler's laws of motion. The physics-informed loss function incorporates the dynamics of angular velocity and external forces, enhancing the network's ability to generalize in complex scenarios. Our experimental evaluation demonstrates that TE-PINN consistently outperforms traditional methods such as Extended Kalman Filters (EKF) and LSTM-based estimators, particularly in scenarios characterized by high angular velocities and noisy sensor data. The results show a significant reduction in mean quaternion error and improved gyroscope bias estimation compared to the state-of-the-art. An ablation study further isolates the contributions of both the transformer architecture and the physics-informed constraints, highlighting the synergistic effect of both components in improving model performance. The proposed model achieves real-time performance on embedded systems typical of mobile robots, offering a scalable and efficient solution for orientation estimation in autonomous systems.||
|**2024-09-24**|[Self-attention as an attractor network: transient memories without backpropagation](http://arxiv.org/abs/2409.16112)|**[link](https://github.com/francill99/self_attention_attractor_network)**|Transformers are one of the most successful architectures of modern neural networks. At their core there is the so-called attention mechanism, which recently interested the physics community as it can be written as the derivative of an energy function in certain cases: while it is possible to write the cross-attention layer as a modern Hopfield network, the same is not possible for the self-attention, which is used in the GPT architectures and other autoregressive models. In this work we show that it is possible to obtain the self-attention layer as the derivative of local energy terms, which resemble a pseudo-likelihood. We leverage the analogy with pseudo-likelihood to design a recurrent model that can be trained without backpropagation: the dynamics shows transient states that are strongly correlated with both train and test examples. Overall we present a novel framework to interpret self-attention as an attractor network, potentially paving the way for new theoretical approaches inspired from physics to understand transformers.||
|**2024-09-24**|[Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR](http://arxiv.org/abs/2409.15869)|**[link](https://github.com/aiola-lab/whisper-medusa)**|Large transformer-based models have significant potential for speech transcription and translation. Their self-attention mechanisms and parallel processing enable them to capture complex patterns and dependencies in audio sequences. However, this potential comes with challenges, as these large and computationally intensive models lead to slow inference speeds. Various optimization strategies have been proposed to improve performance, including efficient hardware utilization and algorithmic enhancements. In this paper, we introduce Whisper-Medusa, a novel approach designed to enhance processing speed with minimal impact on Word Error Rate (WER). The proposed model extends OpenAI's Whisper architecture by predicting multiple tokens per iteration, resulting in a 50% reduction in latency. We showcase the effectiveness of Whisper-Medusa across different learning setups and datasets.||
|**2024-09-23**|[SOFI: Multi-Scale Deformable Transformer for Camera Calibration with Enhanced Line Queries](http://arxiv.org/abs/2409.15553)|**[link](https://github.com/sebastianjanampa/sofi)**|Camera calibration consists of estimating camera parameters such as the zenith vanishing point and horizon line. Estimating the camera parameters allows other tasks like 3D rendering, artificial reality effects, and object insertion in an image. Transformer-based models have provided promising results; however, they lack cross-scale interaction. In this work, we introduce *multi-Scale defOrmable transFormer for camera calibratIon with enhanced line queries*, SOFI. SOFI improves the line queries used in CTRL-C and MSCC by using both line content and line geometric features. Moreover, SOFI's line queries allow transformer models to adopt the multi-scale deformable attention mechanism to promote cross-scale interaction between the feature maps produced by the backbone. SOFI outperforms existing methods on the *Google Street View*, *Horizon Line in the Wild*, and *Holicity* datasets while keeping a competitive inference speed.||
|**2024-09-23**|[Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer](http://arxiv.org/abs/2409.15117)|null|Vision-based perception and reasoning is essential for scene understanding in any autonomous system. RGB and depth images are commonly used to capture both the semantic and geometric features of the environment. Developing methods to reliably interpret this data is critical for real-world applications, where noisy measurements are often unavoidable. In this work, we introduce a diffusion-based framework to address the RGB-D semantic segmentation problem. Additionally, we demonstrate that utilizing a Deformable Attention Transformer as the encoder to extract features from depth images effectively captures the characteristics of invalid regions in depth measurements. Our generative framework shows a greater capacity to model the underlying distribution of RGB-D images, achieving robust performance in challenging scenarios with significantly less training time compared to discriminative methods. Experimental results indicate that our approach achieves State-of-the-Art performance on both the NYUv2 and SUN-RGBD datasets in general and especially in the most challenging of their image data. Our project page will be available at https://diffusionmms.github.io/||
|**2024-09-24**|[Efficiently Dispatching Flash Attention For Partially Filled Attention Masks](http://arxiv.org/abs/2409.15097)|null|Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices. Examples include attention masks designed to reduce the quadratic complexity of attention, sequence packing techniques, and recent innovations like tree masking for fast validation in MEDUSA. Despite the inherent sparsity in these matrices, the state-of-the-art algorithm Flash Attention still processes them with quadratic complexity as though they were dense. In this paper, we introduce Binary Block Masking, a highly efficient modification that enhances Flash Attention by making it mask-aware. We further propose two optimizations: one tailored for masks with contiguous non-zero patterns and another for extremely sparse masks. Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement. The implementation will be publicly released to foster further research and application.||
|**2024-09-23**|[Kriformer: A Novel Spatiotemporal Kriging Approach Based on Graph Transformers](http://arxiv.org/abs/2409.14906)|null|Accurately estimating data in sensor-less areas is crucial for understanding system dynamics, such as traffic state estimation and environmental monitoring. This study addresses challenges posed by sparse sensor deployment and unreliable data by framing the problem as a spatiotemporal kriging task and proposing a novel graph transformer model, Kriformer. This model estimates data at locations without sensors by mining spatial and temporal correlations, even with limited resources. Kriformer utilizes transformer architecture to enhance the model's perceptual range and solve edge information aggregation challenges, capturing spatiotemporal information effectively. A carefully constructed positional encoding module embeds the spatiotemporal features of nodes, while a sophisticated spatiotemporal attention mechanism enhances estimation accuracy. The multi-head spatial interaction attention module captures subtle spatial relationships between observed and unobserved locations. During training, a random masking strategy prompts the model to learn with partial information loss, allowing the spatiotemporal embedding and multi-head attention mechanisms to synergistically capture correlations among locations. Experimental results show that Kriformer excels in representation learning for unobserved locations, validated on two real-world traffic speed datasets, demonstrating its effectiveness in spatiotemporal kriging tasks.||
|**2024-09-23**|[A-VL: Adaptive Attention for Large Vision-Language Models](http://arxiv.org/abs/2409.14846)|null|The Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention techniques can dynamically reduce computational redundancy and thus improve efficiency. Although current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, they are not tailored for LVLMs. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns. This observation inspires us to manage the attention for each modality separately. Specifically, for visual input, we store the cache of potentially useful information but only compute the most critical parts. For language input, we care more about local information. Based on our observation and analysis of vision-language attention patterns, we develop A-VL, a plug-and-play adaptive attention tailored for LVLM inference. Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. Our approach A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.||
|**2024-09-23**|[RoWSFormer: A Robust Watermarking Framework with Swin Transformer for Enhanced Geometric Attack Resilience](http://arxiv.org/abs/2409.14829)|null|In recent years, digital watermarking techniques based on deep learning have been widely studied. To achieve both imperceptibility and robustness of image watermarks, most current methods employ convolutional neural networks to build robust watermarking frameworks. However, despite the success of CNN-based watermarking models, they struggle to achieve robustness against geometric attacks due to the limitations of convolutional neural networks in capturing global and long-range relationships. To address this limitation, we propose a robust watermarking framework based on the Swin Transformer, named RoWSFormer. Specifically, we design the Locally-Channel Enhanced Swin Transformer Block as the core of both the encoder and decoder. This block utilizes the self-attention mechanism to capture global and long-range information, thereby significantly improving adaptation to geometric distortions. Additionally, we construct the Frequency-Enhanced Transformer Block to extract frequency domain information, which further strengthens the robustness of the watermarking framework. Experimental results demonstrate that our RoWSFormer surpasses existing state-of-the-art watermarking methods. For most non-geometric attacks, RoWSFormer improves the PSNR by 3 dB while maintaining the same extraction accuracy. In the case of geometric attacks (such as rotation, scaling, and affine transformations), RoWSFormer achieves over a 6 dB improvement in PSNR, with extraction accuracy exceeding 97%.||
|**2024-09-18**|[On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery](http://arxiv.org/abs/2409.12026)|null|Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor due to the complex and varied underwater environments. Historically, experts have manually interpreted SSS images, relying on conventional machine learning techniques with hand-crafted features. While Convolutional Neural Networks (CNNs) significantly advanced automated classification in this domain, they often fall short when dealing with diverse seafloor textures, such as rocky or ripple sand bottoms, where false positive rates may increase. Recently, Vision Transformers (ViTs) have shown potential in addressing these limitations by utilizing a self-attention mechanism to capture global information in image patches, offering more flexibility in processing spatial hierarchies. This paper rigorously compares the performance of ViT models alongside commonly used CNN architectures, such as ResNet and ConvNext, for binary classification tasks in SSS imagery. The dataset encompasses diverse geographical seafloor types and is balanced between the presence and absence of man-made objects. ViT-based models exhibit superior classification performance across f1-score, precision, recall, and accuracy metrics, although at the cost of greater computational resources. CNNs, with their inductive biases, demonstrate better computational efficiency, making them suitable for deployment in resource-constrained environments like underwater vehicles. Future research directions include exploring self-supervised learning for ViTs and multi-modal fusion to further enhance performance in challenging underwater environments.||
|**2024-09-17**|[A short trajectory is all you need: A transformer-based model for long-time dissipative quantum dynamics](http://arxiv.org/abs/2409.11320)|**[link](https://github.com/kananenka-group/Transformer-spin-boson)**|In this communication we demonstrate that a deep artificial neural network based on a transformer architecture with self-attention layers can predict the long-time population dynamics of a quantum system coupled to a dissipative environment provided that the short-time population dynamics of the system is known. The transformer neural network model developed in this work predicts the long-time dynamics of spin-boson model efficiently and very accurately across different regimes, from weak system-bath coupling to strong coupling non-Markovian regimes. Our model is more accurate than classical forecasting models, such as recurrent neural networks and is comparable to the state-of-the-art models for simulating the dynamics of quantum dissipative systems, based on kernel ridge regression.||
|**2024-09-17**|[Linear Recency Bias During Training Improves Transformers' Fit to Reading Times](http://arxiv.org/abs/2409.11250)|null|Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates with ALiBi show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi's mixture of slopes -- which determine the rate of memory decay in each attention head -- may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.||
|**2024-09-17**|[Contrasformer: A Brain Network Contrastive Transformer for Neurodegenerative Condition Identification](http://arxiv.org/abs/2409.10944)|**[link](https://github.com/angusmonroe/contrasformer)**|Understanding neurological disorders is a fundamental problem in neuroscience, which often requires the analysis of brain networks derived from functional magnetic resonance imaging (fMRI) data. Despite the prevalence of Graph Neural Networks (GNNs) and Graph Transformers in various domains, applying them to brain networks faces challenges. Specifically, the datasets are severely impacted by the noise caused by distribution shifts across sub-populations and by the neglect of node identities, both of which obstruct the identification of disease-specific patterns. To tackle these challenges, we propose Contrasformer, a novel contrastive brain network Transformer. It generates a prior-knowledge-enhanced contrast graph to address the distribution shifts across sub-populations by a two-stream attention mechanism. A cross attention with identity embedding highlights the identity of nodes, and three auxiliary losses ensure group consistency. Evaluated on 4 functional brain network datasets over 4 different diseases, Contrasformer outperforms the state-of-the-art methods for brain networks by achieving up to 10.8% improvement in accuracy, which demonstrates its efficacy in neurological disorder identification. Case studies illustrate its interpretability, especially in the context of neuroscience. This paper provides a solution for analyzing brain networks, offering valuable insights into neurological disorders. Our code is available at https://github.com/AngusMonroe/Contrasformer.||
|**2024-09-17**|[Adaptive Large Language Models By Layerwise Attention Shortcuts](http://arxiv.org/abs/2409.10870)|null|Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we propose to challenge this and introduce adaptive computations for LLM-like setups, which allow the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism, thereby introducing computational **attention shortcuts**. These shortcuts can thus make the architecture depth and context adaptive. We showcase four different datasets, namely acoustic tokens, natural language, and symbolic music, and we achieve superior performance for GPT-like architecture. We give evidence via attention maps that the models learn complex dependencies across layers that are adaptive in context and depth depending on the input tokens.||
|**2024-09-16**|[Recurrent Graph Transformer Network for Multiple Fault Localization in Naval Shipboard Systems](http://arxiv.org/abs/2409.10792)|null|The integration of power electronics building blocks in modern MVDC 12kV Naval ship systems enhances energy management and functionality but also introduces complex fault detection and control challenges. These challenges strain traditional fault diagnostic methods, making it difficult to detect and manage faults across multiple locations while maintaining system stability and performance. This paper proposes a temporal recurrent graph transformer network for fault diagnosis in naval MVDC 12kV shipboard systems. The deep graph neural network uses gated recurrent units to capture temporal features and a multi-head attention mechanism to extract spatial features, enhancing diagnostic accuracy. The approach effectively identifies and evaluates successive multiple faults with high precision. The method is implemented and validated on the MVDC 12kV shipboard system designed by the ESDRC team, incorporating all key components. Results show significant improvements in fault localization accuracy, with a 1-4% increase in performance metrics compared to other machine learning methods.||
|**2024-09-16**|[Self-Attention Limits Working Memory Capacity of Transformer-Based Models](http://arxiv.org/abs/2409.10715)|null|Recent work on Transformer-based large language models (LLMs) has revealed striking limits in their working memory capacity, similar to what has been found in human behavioral studies. Specifically, these models' performance drops significantly on N-back tasks as N increases. However, there is still a lack of mechanistic interpretability as to why this phenomenon would arise. Inspired by the executive attention theory from behavioral sciences, we hypothesize that the self-attention mechanism within Transformer-based models might be responsible for their working memory capacity limits. To test this hypothesis, we train vanilla decoder-only transformers to perform N-back tasks and find that attention scores gradually aggregate to the N-back positions over training, suggesting that the model masters the task by learning a strategy to pay attention to the relationship between the current position and the N-back position. Critically, we find that the total entropy of the attention score matrix increases as N increases, suggesting that the dispersion of attention scores might be the cause of the capacity limit observed in N-back tasks.||
|**2024-09-16**|[Logic Synthesis Optimization with Predictive Self-Supervision via Causal Transformers](http://arxiv.org/abs/2409.10653)|null|Contemporary hardware design benefits from the abstraction provided by high-level logic gates, streamlining the implementation of logic circuits. Logic Synthesis Optimization (LSO) operates at one level of abstraction within the Electronic Design Automation (EDA) workflow, targeting improvements in logic circuits with respect to performance metrics such as size and speed in the final layout. Recent trends in the field show a growing interest in leveraging Machine Learning (ML) for EDA, notably through ML-guided logic synthesis utilizing policy-based Reinforcement Learning (RL) methods. Despite these advancements, existing models face challenges such as overfitting and limited generalization, attributed to constrained public circuits and the expressiveness limitations of graph encoders. To address these hurdles, and tackle data scarcity issues, we introduce LSOformer, a novel approach harnessing Autoregressive transformer models and predictive SSL to predict the trajectory of Quality of Results (QoR). LSOformer integrates cross-attention modules to merge insights from circuit graphs and optimization sequences, thereby enhancing prediction accuracy for QoR metrics. Experimental studies validate the effectiveness of LSOformer, showcasing its superior performance over baseline architectures in QoR prediction tasks, where it achieves improvements of 5.74%, 4.35%, and 17.06% on the EPFL, OABCD, and proprietary circuits datasets, respectively, in inductive setup.||
|**2024-09-16**|[Garment Attribute Manipulation with Multi-level Attention](http://arxiv.org/abs/2409.10206)|null|In the rapidly evolving field of online fashion shopping, the need for more personalized and interactive image retrieval systems has become paramount. Existing methods often struggle with precisely manipulating specific garment attributes without inadvertently affecting others. To address this challenge, we propose GAMMA (Garment Attribute Manipulation with Multi-level Attention), a novel framework that integrates attribute-disentangled representations with a multi-stage attention-based architecture. GAMMA enables targeted manipulation of fashion image attributes, allowing users to refine their searches with high accuracy. By leveraging a dual-encoder Transformer and memory block, our model achieves state-of-the-art performance on popular datasets like Shopping100k and DeepFashion.||
|**2024-09-14**|[Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens](http://arxiv.org/abs/2409.09513)|null|Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of auto-regressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent's future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model's policy through the interpretable plan visualisations and attention map.||
|**2024-09-14**|[TransformerMPC: Accelerating Model Predictive Control via Transformers](http://arxiv.org/abs/2409.09266)|null|In this paper, we address the problem of reducing the computational burden of Model Predictive Control (MPC) for real-time robotic applications. We propose TransformerMPC, a method that enhances the computational efficiency of MPC algorithms by leveraging the attention mechanism in transformers for both online constraint removal and better warm start initialization. Specifically, TransformerMPC accelerates the computation of optimal control inputs by selecting only the active constraints to be included in the MPC problem, while simultaneously providing a warm start to the optimization process. This approach ensures that the original constraints are satisfied at optimality. TransformerMPC is designed to be seamlessly integrated with any MPC solver, irrespective of its implementation. To guarantee constraint satisfaction after removing inactive constraints, we perform an offline verification to ensure that the optimal control inputs generated by the MPC solver meet all constraints. The effectiveness of TransformerMPC is demonstrated through extensive numerical simulations on complex robotic systems, achieving up to 35x improvement in runtime without any loss in performance.||
|**2024-09-13**|[SGFormer: Single-Layer Graph Transformers with Approximation-Free Linear Complexity](http://arxiv.org/abs/2409.09007)|**[link](https://github.com/qitianwu/sgformer)**|Learning representations on large graphs is a long-standing challenge due to the inter-dependence nature. Transformers recently have shown promising performance on small graphs thanks to its global attention for capturing all-pair interactions beyond observed structures. Existing approaches tend to inherit the spirit of Transformers in language and vision tasks, and embrace complicated architectures by stacking deep attention-based propagation layers. In this paper, we attempt to evaluate the necessity of adopting multi-layer attentions in Transformers on graphs, which considerably restricts the efficiency. Specifically, we analyze a generic hybrid propagation layer, comprised of all-pair attention and graph-based propagation, and show that multi-layer propagation can be reduced to one-layer propagation, with the same capability for representation learning. It suggests a new technical path for building powerful and efficient Transformers on graphs, particularly through simplifying model architectures without sacrificing expressiveness. As exemplified by this work, we propose a Simplified Single-layer Graph Transformers (SGFormer), whose main component is a single-layer global attention that scales linearly w.r.t. graph sizes and requires none of any approximation for accommodating all-pair interactions. Empirically, SGFormer successfully scales to the web-scale graph ogbn-papers100M, yielding orders-of-magnitude inference acceleration over peer Transformers on medium-sized graphs, and demonstrates competitiveness with limited labeled data.||
|**2024-09-13**|[Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry](http://arxiv.org/abs/2409.08769)|**[link](https://github.com/ybkurt/vift)**|In recent years, transformer-based architectures have become the de facto standard for sequence modeling in deep learning frameworks. Inspired by the successful examples, we propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry. This study aims to improve pose estimation accuracy by leveraging the attention mechanisms in transformers, which better utilize historical data compared to the recurrent neural network (RNN) based methods used in recent work. Transformers typically require large-scale data for training. To address this issue, we utilize inductive biases for deep VIO networks. Since latent visual-inertial feature vectors encompass essential information for pose estimation, we employ transformers to refine pose estimates by updating latent vectors temporally. Our study also examines the impact of data imbalance and rotation learning methods in supervised end-to-end learning of visual inertial odometry by utilizing specialized gradients in backpropagation for the elements of the SE(3) group. The proposed method is end-to-end trainable and requires only a monocular camera and IMU during inference. Experimental results demonstrate that VIFT increases the accuracy of monocular VIO networks, achieving state-of-the-art results when compared to previous methods on the KITTI dataset. The code will be made available at https://github.com/ybkurt/VIFT.||
|**2024-09-13**|[SkinFormer: Learning Statistical Texture Representation with Transformer for Skin Lesion Segmentation](http://arxiv.org/abs/2409.08652)|**[link](https://github.com/rongtao-xu/skinformer)**|Accurate skin lesion segmentation from dermoscopic images is of great importance for skin cancer diagnosis. However, automatic segmentation of melanoma remains a challenging task because it is difficult to incorporate useful texture representations into the learning process. Texture representations are not only related to the local structural information learned by CNN, but also include the global statistical texture information of the input image. In this paper, we propose a trans**Former** network (**SkinFormer**) that efficiently extracts and fuses statistical texture representation for **Skin** lesion segmentation. Specifically, to quantify the statistical texture of input features, a Kurtosis-guided Statistical Counting Operator is designed. We propose Statistical Texture Fusion Transformer and Statistical Texture Enhance Transformer with the help of Kurtosis-guided Statistical Counting Operator by utilizing the transformer's global attention mechanism. The former fuses structural texture information and statistical texture information, and the latter enhances the statistical texture of multi-scale features. Extensive experiments on three publicly available skin lesion datasets validate that our SkinFormer outperforms other SOTA methods, and our method achieves 93.2% Dice score on ISIC 2018. SkinFormer can be easily extended to segment 3D images in the future. Our code is available at https://github.com/Rongtao-Xu/SkinFormer.||
|**2024-09-13**|[VistaFormer: Scalable Vision Transformers for Satellite Image Time Series Segmentation](http://arxiv.org/abs/2409.08461)|**[link](https://github.com/macdonaldezra/VistaFormer)**|We introduce VistaFormer, a lightweight Transformer-based model architecture for the semantic segmentation of remote-sensing images. This model uses a multi-scale Transformer-based encoder with a lightweight decoder that aggregates global and local attention captured in the encoder blocks. VistaFormer uses position-free self-attention layers which simplifies the model architecture and removes the need to interpolate temporal and spatial codes, which can reduce model performance when training and testing image resolutions differ. We investigate simple techniques for filtering noisy input signals like clouds and demonstrate that improved model scalability can be achieved by substituting Multi-Head Self-Attention (MHSA) with Neighbourhood Attention (NA). Experiments on the PASTIS and MTLCC crop-type segmentation benchmarks show that VistaFormer achieves better performance than comparable models and requires only 8% of the floating point operations using MHSA and 11% using NA while also using fewer trainable parameters. VistaFormer with MHSA improves on state-of-the-art mIoU scores by 0.1% on the PASTIS benchmark and 3% on the MTLCC benchmark while VistaFormer with NA improves on the MTLCC benchmark by 3.7%.||
|**2024-09-12**|[SDformer: Efficient End-to-End Transformer for Depth Completion](http://arxiv.org/abs/2409.08159)|**[link](https://github.com/jamesqian11/sdformer-for-depth-completion)**|Depth completion aims to predict dense depth maps with sparse depth measurements from a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks. However, despite the excellent high-end performance, they suffer from a limited representation area. To overcome the drawbacks of CNNs, a more effective and powerful method has been presented: the Transformer, which is an adaptive self-attention setting sequence-to-sequence model. While the standard Transformer quadratically increases the computational cost from the key-query dot-product of input resolution which improperly employs depth completion tasks. In this work, we propose a different window-based Transformer architecture for depth completion tasks named Sparse-to-Dense Transformer (SDformer). The network consists of an input module for the depth map and RGB image features extraction and concatenation, a U-shaped encoder-decoder Transformer for extracting deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features through the input model. Then, instead of calculating self-attention with the whole feature maps, we apply different window sizes to extract the long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to get the enriching depth features and employ a convolution layer to obtain the dense depth map. In practice, the SDformer obtains state-of-the-art results against the CNN-based depth completion models with lower computing loads and parameters on the NYU Depth V2 and KITTI DC datasets.||
|**2024-09-12**|[InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bimanual Manipulation](http://arxiv.org/abs/2409.07914)|null|We present InterACT: Inter-dependency aware Action Chunking with Hierarchical Attention Transformers, a novel imitation learning framework for bimanual manipulation that integrates hierarchical attention to capture inter-dependencies between dual-arm joint states and visual inputs. InterACT consists of a Hierarchical Attention Encoder and a Multi-arm Decoder, both designed to enhance information aggregation and coordination. The encoder processes multi-modal inputs through segment-wise and cross-segment attention mechanisms, while the decoder leverages synchronization blocks to refine individual action predictions, providing the counterpart's prediction as context. Our experiments on a variety of simulated and real-world bimanual manipulation tasks demonstrate that InterACT significantly outperforms existing methods. Detailed ablation studies validate the contributions of key components of our work, including the impact of CLS tokens, cross-segment encoders, and synchronization blocks.||
|**2024-09-12**|[Lagrange Duality and Compound Multi-Attention Transformer for Semi-Supervised Medical Image Segmentation](http://arxiv.org/abs/2409.07793)|**[link](https://github.com/lzeeorno/lagrange-duality-and-cmaformer)**|Medical image segmentation, a critical application of semantic segmentation in healthcare, has seen significant advancements through specialized computer vision techniques. While deep learning-based medical image segmentation is essential for assisting in medical diagnosis, the lack of diverse training data causes the long-tail problem. Moreover, most previous hybrid CNN-ViT architectures have limited ability to combine various attentions in different layers of the Convolutional Neural Network. To address these issues, we propose a Lagrange Duality Consistency (LDC) Loss, integrated with Boundary-Aware Contrastive Loss, as the overall training objective for semi-supervised learning to mitigate the long-tail problem. Additionally, we introduce CMAformer, a novel network that synergizes the strengths of ResUNet and Transformer. The cross-attention block in CMAformer effectively integrates spatial attention and channel attention for multi-scale feature fusion. Overall, our results indicate that CMAformer, combined with the feature fusion framework and the new consistency loss, demonstrates strong complementarity in semi-supervised learning ensembles. We achieve state-of-the-art results on multiple public medical image datasets. Example code is available at: https://github.com/lzeeorno/Lagrange-Duality-and-CMAformer.||
|**2024-09-11**|[ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers](http://arxiv.org/abs/2409.07541)|**[link](https://github.com/gsavathrakis/enact)**|Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection. However, they require considerable computational resources due to the quadratic size of the attention weights. In this work, we propose to cluster the transformer input on the basis of its entropy. The reason for this is that the self-information of each pixel (whose sum is the entropy) is likely to be similar among pixels corresponding to the same objects. Clustering reduces the size of data given as input to the transformer and therefore reduces training time and GPU memory usage, while at the same time preserving meaningful information to be passed through the remaining parts of the network. The proposed process is organized in a module called ENACT, which can be plugged into any transformer architecture that consists of a multi-head self-attention computation in its encoder. We ran extensive experiments using the COCO object detection dataset, and three detection transformers. The obtained results demonstrate that in all tested cases, there is consistent reduction in the required computational resources, while the precision of the detection task is only slightly reduced. The code of the ENACT module will become available at https://github.com/GSavathrakis/ENACT||
|**2024-09-11**|[Gated Slot Attention for Efficient Linear-Time Sequence Modeling](http://arxiv.org/abs/2409.07146)|**[link](https://github.com/sustcsonglin/flash-linear-attention)**|Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.||
|**2024-09-11**|[Enhancing Cross-domain Pre-Trained Decision Transformers with Adaptive Attention](http://arxiv.org/abs/2409.06985)|null|Recently, the pre-training of decision transformers (DT) using a different domain, such as natural language text, has generated significant attention in offline reinforcement learning (Offline RL). Although this cross-domain pre-training approach achieves superior performance compared to training from scratch in environments that require short-term planning ability, the mechanisms by which pre-training benefits the fine-tuning phase remain unclear. Furthermore, we point out that the cross-domain pre-training approach hinders the extraction of distant information in environments like PointMaze that require long-term planning ability, leading to performance that is much worse than training DT from scratch. This work first analyzes these issues and finds that the Markov Matrix, a component that exists in pre-trained attention heads, is the key to explaining the significant performance disparity of pre-trained models in different planning abilities. Inspired by our analysis, we propose a general method GPT-DTMA, which equips a pre-trained DT with Mixture of Attention (MoA), to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, and in long-term environments, it mitigates the negative impact caused by the Markov Matrix, achieving results comparable to those of DT trained from scratch.||
|**2024-09-11**|[Brain-Inspired Stepwise Patch Merging for Vision Transformers](http://arxiv.org/abs/2409.06963)|null|The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain's ability to integrate global and local information for comprehensive visual understanding, we propose a novel technique called Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better. SPM comprises two critical modules: Multi-Scale Aggregation (MSA) and Guided Local Enhancement (GLE). The MSA module integrates multi-scale features to enrich feature representation, while the GLE module focuses on refining local detail extraction, thus achieving an optimal balance between long-range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. These results underscore the efficacy of SPM in enhancing model accuracy and robustness across a wide range of computer vision tasks.||
|**2024-09-10**|[A Practical Gated Recurrent Transformer Network Incorporating Multiple Fusions for Video Denoising](http://arxiv.org/abs/2409.06603)|null|State-of-the-art (SOTA) video denoising methods employ multi-frame simultaneous denoising mechanisms, resulting in significant delays (e.g., 16 frames), making them impractical for real-time cameras. To overcome this limitation, we propose a multi-fusion gated recurrent Transformer network (GRTN) that achieves SOTA denoising performance with only a single-frame delay. Specifically, the spatial denoising module extracts features from the current frame, while the reset gate selects relevant information from the previous frame and fuses it with current frame features via the temporal denoising module. The update gate then further blends this result with the previous frame features, and the reconstruction module integrates it with the current frame. To robustly compute attention for noisy features, we propose a residual simplified Swin Transformer with Euclidean distance (RSSTE) in the spatial and temporal denoising modules. Comparative objective and subjective results show that our GRTN achieves denoising performance comparable to SOTA multi-frame delay networks, with only a single-frame delay.||
|**2024-09-10**|[Lightweight Multiscale Feature Fusion Super-Resolution Network Based on Two-branch Convolution and Transformer](http://arxiv.org/abs/2409.06590)|null|The single image super-resolution (SISR) algorithms under deep learning currently have two main models, one based on convolutional neural networks and the other based on Transformer. The former uses the stacking of convolutional layers with different convolutional kernel sizes to design the model, which enables the model to better extract the local features of the image; the latter uses the self-attention mechanism to design the model, which allows the model to establish long-distance dependencies between image pixel points and then better extract the global features of the image. However, both of the above methods face their own problems. Based on this, this paper proposes a new lightweight multi-scale feature fusion network model based on two-way complementary convolution and Transformer, which integrates the respective features of Transformer and convolutional neural networks through a two-branch network architecture, to realize the mutual fusion of global and local information. Meanwhile, considering the partial loss of information caused by the low-pixel images trained by the deep neural network, this paper designs a modular connection method of multi-stage feature supplementation to fuse the feature maps extracted from the shallow stage of the model with those extracted from the deep stage of the model, to minimize the loss of information in the feature images that is beneficial to image restoration and to facilitate obtaining a higher-quality restored image. The practical results finally show that the model proposed in this paper is optimal in image recovery performance when compared with other lightweight models with the same amount of parameters.||
|**2024-09-10**|[Knowledge Distillation via Query Selection for Detection Transformer](http://arxiv.org/abs/2409.06443)|null|Transformers have revolutionized the object detection landscape by introducing DETRs, acclaimed for their simplicity and efficacy. Despite their advantages, the substantial size of these models poses significant challenges for practical deployment, particularly in resource-constrained environments. This paper addresses the challenge of compressing DETR by leveraging knowledge distillation, a technique that holds promise for maintaining model performance while reducing size. A critical aspect of DETRs' performance is their reliance on queries to interpret object representations accurately. Traditional distillation methods often focus exclusively on positive queries, identified through bipartite matching, neglecting the rich information present in hard-negative queries. Our visual analysis indicates that hard-negative queries, focusing on foreground elements, are crucial for enhancing distillation outcomes. To this end, we introduce a novel Group Query Selection strategy, which diverges from traditional query selection in DETR distillation by segmenting queries based on their Generalized Intersection over Union (GIoU) with ground truth objects, thereby uncovering valuable hard-negative queries for distillation. Furthermore, we present the Knowledge Distillation via Query Selection for DETR (QSKD) framework, which incorporates Attention-Guided Feature Distillation (AGFD) and Local Alignment Prediction Distillation (LAPD). These components optimize the distillation process by focusing on the most informative aspects of the teacher model's intermediate features and output. Our comprehensive experimental evaluation of the MS-COCO dataset demonstrates the effectiveness of our approach, significantly improving average precision (AP) across various DETR architectures without incurring substantial computational costs. Specifically, the AP of Conditional DETR ResNet-18 increased from 35.8 to 39.9.||
|**2024-09-10**|[AgileIR: Memory-Efficient Group Shifted Windows Attention for Agile Image Restoration](http://arxiv.org/abs/2409.06206)|null|Image Transformers have shown remarkable success in Image Restoration tasks. Nevertheless, most transformer-based models are strictly bounded by exorbitant memory occupancy. Our goal is to reduce the memory consumption of Swin Transformer and at the same time speed up the model during training. Thus, we introduce AgileIR, a group shifted attention mechanism along with window attention, which sparsely simplifies the model architecture. We propose Group Shifted Window Attention (GSWA) to decompose Shift Window Multi-head Self Attention (SW-MSA) and Window Multi-head Self Attention (W-MSA) into groups across their attention heads, which shrinks memory usage in back propagation. In addition, we keep shifted window masking and its shifted learnable biases during training, in order to induce the model to interact across windows within the channel. We also re-allocate projection parameters to accelerate attention matrix calculation, which we found causes a negligible decrease in performance. In our experiments, compared with our baseline SwinIR and other efficient quantization models, AgileIR keeps the performance at 32.20 dB on the Set5 evaluation dataset, exceeding other methods with tailor-made efficient methods, and saves over 50% memory when a large batch size is employed.||
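|**2024-09-09**|[ReL-SAR: Representation Learning for Skeleton Action Recognition with Convolutional Transformers and BYOL](http://arxiv.org/abs/2409.05749)|**[link](https://github.com/safwennaimi/representation-learning-for-skeleton-action-recognition-with-convolutional-transformers-and-byol)**|Extracting robust and generalizable features for skeleton action recognition usually requires large amounts of carefully annotated data, and the constraints of annotation and computation costs make this task highly challenging. Unsupervised representation learning that leverages unlabeled skeleton data is therefore essential. This work investigates unsupervised representation learning for skeleton action recognition. To this end, we design a lightweight convolutional Transformer framework, named ReL-SAR, which exploits the complementarity of convolutional and attention layers to jointly model spatial and temporal cues in skeleton sequences. We also adopt a Selection-Permutation strategy for skeleton joints to ensure that more information is obtained from skeletal data. Finally, we use Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data. We achieve very competitive results on limited-size datasets (MCAD, IXMAS, JHMDB, and NW-UCLA), demonstrating the effectiveness of our proposed method relative to the state of the art in terms of both performance and computational efficiency. To ensure reproducibility and reusability, the source code, including all implementation parameters, is provided at: https://github.com/SafwenNaimi/Representation-Learning-for-Skeleton-Action-Recognition-with-Convolutional-Transformers-and-BYOL||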
|**2024-09-09**|[DSDFormer: An Innovative Transformer-Mamba Framework for Robust High-Precision Driver Distraction Identification](http://arxiv.org/abs/2409.05587)|null|Driver distraction remains a leading cause of traffic accidents, posing a critical threat to road safety globally. As intelligent transportation systems evolve, accurate and real-time identification of driver distraction has become essential. However, existing methods struggle to capture both global contextual and fine-grained local features while contending with noisy labels in training datasets. To address these challenges, we propose DSDFormer, a novel framework that integrates the strengths of Transformer and Mamba architectures through a Dual State Domain Attention (DSDA) mechanism, enabling a balance between long-range dependencies and detailed feature extraction for robust driver behavior recognition. Additionally, we introduce Temporal Reasoning Confident Learning (TRCL), an unsupervised approach that refines noisy labels by leveraging spatiotemporal correlations in video sequences. Our model achieves state-of-the-art performance on the AUC-V1, AUC-V2, and 100-Driver datasets and demonstrates real-time processing efficiency on the NVIDIA Jetson AGX Orin platform. Extensive experimental results confirm that DSDFormer and TRCL significantly improve both the accuracy and robustness of driver distraction detection, offering a scalable solution to enhance road safety.||
|**2024-09-10**|[Retrofitting Temporal Graph Neural Networks with Transformer](http://arxiv.org/abs/2409.05477)|**[link](https://github.com/qianghuangwhu/tf-tgn)**|Temporal graph neural networks (TGNNs) outperform regular GNNs by incorporating time information into graph-based operations. However, TGNNs adopt specialized models (e.g., TGN, TGAT, and APAN) and require tailored training frameworks (e.g., TGL and ETC). In this paper, we propose TF-TGN, which uses Transformer decoder as the backbone model for TGNN to enjoy Transformer's codebase for efficient training. In particular, Transformer achieves tremendous success for language modeling, and thus the community developed high-performance kernels (e.g., flash-attention and memory-efficient attention) and efficient distributed training schemes (e.g., PyTorch FSDP, DeepSpeed, and Megatron-LM). We observe that TGNN resembles language modeling, i.e., the message aggregation operation between chronologically occurring nodes and their temporal neighbors in TGNNs can be structured as sequence modeling. Besides this similarity, we also incorporate a series of algorithm designs including suffix infilling, temporal graph attention with self-loop, and causal masking self-attention to make TF-TGN work. During training, existing systems are slow in transforming the graph topology and conducting graph sampling. As such, we propose methods to parallelize the CSR format conversion and graph sampling. We also adapt Transformer codebase to train TF-TGN efficiently with multiple GPUs. We experiment with 9 graphs and compare with 2 state-of-the-art TGNN training frameworks. The results show that TF-TGN can accelerate training by over 2.20× while providing comparable or even superior accuracy to existing SOTA TGNNs. TF-TGN is available at https://github.com/qianghuangwhu/TF-TGN.||
|**2024-09-08**|[Low Latency Transformer Inference on FPGAs for Physics Applications with hls4ml](http://arxiv.org/abs/2409.05207)|null|This study presents an efficient implementation of transformer architectures in Field-Programmable Gate Arrays (FPGAs) using hls4ml. We demonstrate the strategy for implementing the multi-head attention, softmax, and normalization layer and evaluate three distinct models. Their deployment on a VU13P FPGA chip achieved latency of less than 2 μs, demonstrating the potential for real-time applications. The compatibility of hls4ml with any TensorFlow-built transformer model further enhances the scalability and applicability of this work. Index Terms: FPGAs, machine learning, transformers, high energy physics, LIGO||
|**2024-09-08**|[MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework](http://arxiv.org/abs/2409.05136)|null|Social media has a significant impact on people's lives. Hate speech on social media has emerged as one of society's most serious issues recently. Text and pictures are two forms of multimodal data distributed within articles. Unimodal analysis has been the primary emphasis of earlier approaches. Additionally, when doing multimodal analysis, researchers neglect to preserve the distinctive qualities associated with each modality. The present article suggests a scalable architecture for multimodal hate content detection called transformer-based multilevel attention (STMA) to address these shortcomings. This architecture consists of three main parts: a combined attention-based deep learning mechanism, a vision attention mechanism encoder, and a caption attention-mechanism encoder. To identify hate content, each component uses various attention processes and uniquely handles multimodal data. Several studies employing multiple assessment criteria on three hate speech datasets: Hateful memes, MultiOff, and MMHS150K, validate the suggested architecture's efficacy. The outcomes demonstrate that on all three datasets, the suggested strategy performs better than the baseline approaches.||
|**2024-09-08**|[An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing](http://arxiv.org/abs/2409.04940)|null|The attention mechanism is a key computing kernel of Transformers, calculating pairwise correlations across the entire input sequence. The computing complexity and frequent memory access in computing self-attention put a huge burden on the system especially when the sequence length increases. This paper presents an analog and digital hybrid processor to accelerate the attention mechanism for transformers in 65nm CMOS technology. We propose an analog computing-in-memory (CIM) core, which prunes ~75% of low-score tokens on average during runtime at ultra-low power and delay. Additionally, a digital processor performs precise computations only for ~25% unpruned tokens selected by the analog CIM core, preventing accuracy degradation. Measured results show peak energy efficiency of 14.8 and 1.65 TOPS/W, and peak area efficiency of 976.6 and 79.4 GOPS/mm$^2$ in the analog core and the system-on-chip (SoC), respectively.||
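|**2024-09-07**|[Efficient Training of Transformers for Molecule Property Prediction on Small-scale Datasets](http://arxiv.org/abs/2409.04909)|null|The blood-brain barrier (BBB) is a protective barrier that separates the brain from the circulatory system and regulates the passage of substances into the central nervous system. Assessing the BBB permeability of potential drugs is crucial for effective drug targeting. However, traditional experimental methods for measuring BBB permeability are challenging and impractical for large-scale screening. Consequently, there is a need to develop computational approaches to predict BBB permeability. This paper proposes a GPS Transformer architecture augmented with self-attention, designed to perform well in the low-data regime. The proposed approach achieves state-of-the-art performance on the BBB permeability prediction task using the BBBP dataset, surpassing existing models. The approach reaches a ROC-AUC of 78.8%, a 5.5% improvement over the previous best. We demonstrate that standard self-attention coupled with GPS Transformer performs better than other attention variants coupled with GPS Transformer.||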
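|**2024-09-07**|[Cross-attention Inspired Selective State Space Models for Target Sound Extraction](http://arxiv.org/abs/2409.04803)|**[link](https://github.com/WuDH2000/CrossMamba)**|Transformer models, particularly their cross-attention modules, are widely used for feature fusion in target sound extraction, a task that extracts the signal of interest based on given clues. Despite its effectiveness, this approach suffers from low computational efficiency. Recent advances in state space models, notably the recent Mamba model, have shown performance comparable to Transformer-based methods across various tasks while significantly reducing computational complexity. However, Mamba's applicability to target sound extraction is limited because it cannot capture dependencies between different sequences as cross-attention does. In this paper, we propose CrossMamba for target sound extraction, which leverages the hidden attention mechanism of Mamba to compute dependencies between the given clues and the audio mixture. The computation of Mamba can be divided into the query, key, and value. We use the clue to generate the query and the audio mixture to derive the key and value, following the principle of the cross-attention mechanism in Transformers. Experimental results from two representative target sound extraction methods validate the efficacy of the proposed CrossMamba.||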
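|**2024-09-06**|[Theory, Analysis, and Best Practices for Sigmoid Self-Attention](http://arxiv.org/abs/2409.04431)|**[link](https://github.com/apple/ml-sigmoid-attention)**|Attention is a key component of the Transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in Transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that Transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention that yields a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which previous attempts at sigmoid attention were unable to fully achieve. Our work unifies prior art and establishes best practices for sigmoid attention as a drop-in softmax replacement in Transformers.||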
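|**2024-09-09**|[AttentionX: Exploiting Consensus Discrepancy In Attention from A Distributed Optimization Perspective](http://arxiv.org/abs/2409.04275)|null|In this paper, we extend the standard attention mechanism in Transformers by exploiting the consensus discrepancy from a distributed optimization perspective, an approach we refer to as AttentionX. Notably, the primal-dual method of multipliers (PDMM) \cite{Zhang16PDMM} is designed to iteratively solve a broad class of distributed optimization problems over a peer-to-peer (P2P) network, where neighbouring nodes gradually reach consensus as specified by predefined linear edge constraints in the optimization process. In particular, at each iteration of PDMM, each node in the network first collects information from its neighbours and then performs local information fusion. From a high-level point of view, the $KQ$-softmax-based weighted summation of $V$-representations in attention corresponds to information collection from neighbours, while the feature processing via the feed-forward network (FFN) in a Transformer corresponds to local information fusion. PDMM exploits Lagrangian multipliers to capture the historical consensus discrepancy in the form of residual errors of the linear edge constraints, which plays a crucial role in the convergence of the algorithm. Inspired by PDMM, we propose AttentionX to incorporate the consensus discrepancy into the output update expression of the standard attention mechanism. The consensus discrepancy in AttentionX refers to the difference between the weighted summation of $V$-representations and the scaled $V$-representations themselves. Experiments on ViT and nanoGPT show promising performance.||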
|**2024-09-05**|[Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers](http://arxiv.org/abs/2409.03621)|**[link](https://github.com/schwartz-lab-NLP/Attend-First-Consolidate-Later)**|In decoder-based LLMs, the representation of a given layer serves two purposes: as input to the next layer during the computation of the current token; and as input to the attention mechanism of future tokens. In this work, we show that the importance of the latter role might be overestimated. To show that, we start by manipulating the representations of previous tokens; e.g. by replacing the hidden states at some layer k with random vectors. Our experiments with four LLMs and four tasks show that this operation often leads to a small to negligible drop in performance. Importantly, this happens if the manipulation occurs in the top part of the model, i.e., when k is in the final 30-50% of the layers. In contrast, doing the same manipulation in earlier layers might lead to chance level performance. We continue by switching the hidden state of certain tokens with hidden states of other tokens from another prompt; e.g., replacing the word "Italy" with "France" in "What is the capital of Italy?". We find that when applying this switch in the top 1/3 of the model, the model ignores it (answering "Rome"). However, if we apply it earlier, the model conforms to the switch ("Paris"). Our results hint at a two-stage process in transformer-based LLMs: the first part gathers input from previous tokens, while the second mainly processes that information internally.||
|**2024-09-05**|[LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution](http://arxiv.org/abs/2409.03516)|**[link](https://github.com/jwgdmkj/lmlt)**|Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our code is available at https://github.com/jwgdmkj/LMLT.||
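|**2024-09-05**|[Blended Latent Diffusion under Attention Control for Real-World Video Editing](http://arxiv.org/abs/2409.03514)|null|Due to the lack of fully publicly available text-to-video models, current video editing methods tend to build on pre-trained text-to-image generation models; however, they still face major challenges in local editing of videos with temporal information. First, although existing methods attempt to focus on local region editing through a pre-defined mask, the background outside the region is poorly preserved because each frame is generated as a spatial whole. In addition, requiring the user to supply a mask is costly extra work, so an autonomous masking strategy integrated into the editing process is desirable. Last but not least, image-level pre-trained models have not learned temporal information across video frames, which is vital for expressing motion and dynamics. In this paper, we propose to adapt an image-level blended latent diffusion model to perform local video editing tasks. Specifically, we leverage DDIM inversion to obtain the background latents, instead of randomly noised latents, to better preserve the background information of the input video. We further introduce an autonomous mask generation mechanism derived from cross-attention maps in the diffusion steps. Finally, we enhance temporal consistency across video frames by transforming the self-attention blocks of the U-Net into temporal-spatial blocks. Through extensive experiments, our proposed approach demonstrates effectiveness in different real-world video editing tasks.||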
|**2024-09-05**|[Characterizing Massive Activations of Attention Mechanism in Graph Neural Networks](http://arxiv.org/abs/2409.03463)|**[link](https://github.com/msorbi/gnn-ma)**|Graph Neural Networks (GNNs) have become increasingly popular for effectively modeling data with graph structures. Recently, attention mechanisms have been integrated into GNNs to improve their ability to capture complex patterns. This paper presents the first comprehensive study revealing a critical, unexplored consequence of this integration: the emergence of Massive Activations (MAs) within attention layers. We introduce a novel method for detecting and analyzing MAs, focusing on edge features in different graph transformer architectures. Our study assesses various GNN models using benchmark datasets, including ZINC, TOX21, and PROTEINS. Key contributions include (1) establishing the direct link between attention mechanisms and MAs generation in GNNs, (2) developing a robust definition and detection method for MAs based on activation ratio distributions, (3) introducing the Explicit Bias Term (EBT) as a potential countermeasure and exploring it as an adversarial framework to assess models robustness based on the presence or absence of MAs. Our findings highlight the prevalence and impact of attention-induced MAs across different architectures, such as GraphTransformer, GraphiT, and SAN. The study reveals the complex interplay between attention mechanisms, model architecture, dataset characteristics, and MAs emergence, providing crucial insights for developing more robust and reliable graph models.||
|**2024-09-05**|[LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones](http://arxiv.org/abs/2409.03460)|**[link](https://github.com/altair199797/lowformer)**|Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks. A smart combination of both, architecture-wise and component-wise, is mandatory to excel in the speed-accuracy trade-off. Most publications focus on maximizing accuracy and utilize MACs (multiply accumulate operations) as an efficiency metric. The latter, however, often does not accurately measure how fast a model actually is due to factors like memory access cost and degree of parallelism. We analyzed common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency, as the combination of the latter two is a better representation of the efficiency of models in real applications. We applied the conclusions taken from that analysis to create a recipe for increasing hardware-efficiency in macro design. Additionally, we introduce a simple slimmed-down version of MultiHead Self-Attention that aligns with our analysis. We combine both macro and micro design to create a new family of hardware-efficient backbone networks called LowFormer. LowFormer achieves a remarkable speedup in terms of throughput and latency, while achieving similar or better accuracy than current state-of-the-art efficient backbones. In order to prove the generalizability of our hardware-efficient design, we evaluate our method on GPU, mobile GPU and ARM CPU. We further show that the downstream tasks object detection and semantic segmentation profit from our hardware-efficient architecture. Code and models are available at https://github.com/altair199797/LowFormer.||
|**2024-09-05**|[Masked Sensory-Temporal Attention for Sensor Generalization in Quadruped Locomotion](http://arxiv.org/abs/2409.03332)|null|With the rising focus on quadrupeds, a generalized policy capable of handling different robot models and sensory inputs will be highly beneficial. Although several methods have been proposed to address different morphologies, it remains a challenge for learning-based policies to manage various combinations of proprioceptive information. This paper presents Masked Sensory-Temporal Attention (MSTA), a novel transformer-based model with masking for quadruped locomotion. It employs direct sensor-level attention to enhance sensory-temporal understanding and handle different combinations of sensor data, serving as a foundation for incorporating unseen information. This model can effectively understand its states even with a large portion of missing information, and is flexible enough to be deployed on a physical system despite the long input sequence.||
|**2024-09-05**|[Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion](http://arxiv.org/abs/2409.03223)|null|Multi-modality image fusion aims to integrate the merits of images from different sources and render high-quality fusion images. However, existing feature extraction and fusion methods are either constrained by inherent local reduction bias and static parameters during inference (CNN) or limited by quadratic computational complexity (Transformers), and cannot effectively extract and fuse features. To solve this problem, we propose a dual-branch image fusion network called Tmamba. It consists of linear Transformer and Mamba, which has global modeling capabilities while maintaining linear complexity. Due to the difference between the Transformer and Mamba structures, the features extracted by the two branches carry channel and position information respectively. T-M interaction structure is designed between the two branches, using global learnable parameters and convolutional layers to transfer position and channel information respectively. We further propose cross-modal interaction at the attention level to obtain cross-modal attention. Experiments show that our Tmamba achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. Code with checkpoints will be available after the peer-review process.||
|**2024-09-04**|[Probing self-attention in self-supervised speech models for cross-linguistic differences](http://arxiv.org/abs/2409.03115)|null|Speech models have gained traction thanks to increase in accuracy from novel transformer architectures. While this impressive increase in performance across automatic speech recognition (ASR) benchmarks is noteworthy, there is still much that is unknown about the use of attention mechanisms for speech-related tasks. For example, while it is assumed that these models are learning language-independent (i.e., universal) speech representations, there has not yet been an in-depth exploration of what it would mean for the models to be language-independent. In the current paper, we explore this question within the realm of self-attention mechanisms of one small self-supervised speech transformer model (TERA). We find that even with a small model, the attention heads learned are diverse ranging from almost entirely diagonal to almost entirely global regardless of the training language. We highlight some notable differences in attention patterns between Turkish and English and demonstrate that the models do learn important phonological information during pretraining. We also present a head ablation study which shows that models across languages primarily rely on diagonal heads to classify phonemes.||
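|**2024-09-04**|[Leveraging Interpretability in the Transformer to Automate the Proactive Scaling of Cloud Resources](http://arxiv.org/abs/2409.03103)|null|Modern web services adopt cloud-native principles to leverage the advantages of microservices. To consistently guarantee high Quality of Service (QoS) according to Service Level Agreements (SLAs), ensure satisfactory user experiences, and minimize operational costs, each microservice must be provisioned with the right amount of resources. However, accurately provisioning microservices with adequate resources is complex and depends on many factors, including workload intensity and the intricate interconnections between microservices. To address this challenge, we develop a model that captures the relationship between end-to-end latency, requests at the front-end level, and resource utilization. We then use the developed model to predict the end-to-end latency. Our solution leverages the Temporal Fusion Transformer (TFT), an attention-based architecture equipped with interpretability features. When the prediction indicates SLA non-compliance, we use the feature importance provided by the TFT as covariates in Kernel Ridge Regression (KRR), with the response variable set to the desired latency, to learn the parameters associated with the feature importance. These learned parameters reflect the adjustments to the features required to ensure SLA compliance. We demonstrate the merit of our approach with a microservice-based application and provide a roadmap to deployment.||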
|**2024-09-05**|[Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?](http://arxiv.org/abs/2409.02727)|**[link](https://github.com/yixuantt/poolingandattn)**|The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.||
|**2024-09-04**|[UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching](http://arxiv.org/abs/2409.02545)|null|Unlike other vision tasks where Transformer-based approaches are becoming increasingly common, stereo depth estimation is still dominated by convolution-based approaches. This is mainly due to the limited availability of real-world ground truth for stereo matching, which is a limiting factor in improving the performance of Transformer-based stereo approaches. In this paper, we propose UniTT-Stereo, a method to maximize the potential of Transformer-based stereo architectures by unifying self-supervised learning used for pre-training with stereo matching framework based on supervised learning. To be specific, we explore the effectiveness of reconstructing features of masked portions in an input image and at the same time predicting corresponding points in another image from the perspective of locality inductive bias, which is crucial in training models with limited training data. Moreover, to address these challenging tasks of reconstruction-and-prediction, we present a new strategy to vary a masking ratio when training the stereo model with stereo-tailored losses. State-of-the-art performance of UniTT-Stereo is validated on various benchmarks such as ETH3D, KITTI 2012, and KITTI 2015 datasets. Lastly, to investigate the advantages of the proposed approach, we provide a frequency analysis of feature maps and the analysis of locality inductive bias based on attention maps.||
|**2024-09-03**|[F2former: When Fractional Fourier Meets Deep Wiener Deconvolution and Selective Frequency Transformer for Image Deblurring](http://arxiv.org/abs/2409.02056)|null|Recent progress in image deblurring techniques focuses mainly on operating in both frequency and spatial domains using the Fourier transform (FT) properties. However, their performance is limited due to the dependency of FT on stationary signals and its lack of capability to extract spatial-frequency properties. In this paper, we propose a novel approach based on the Fractional Fourier Transform (FRFT), a unified spatial-frequency representation leveraging both spatial and frequency components simultaneously, making it ideal for processing non-stationary signals like images. Specifically, we introduce a Fractional Fourier Transformer (F2former), where we combine the classical fractional Fourier based Wiener deconvolution (F2WD) as well as a multi-branch encoder-decoder transformer based on a new fractional frequency aware transformer block (F2TB). We design F2TB consisting of a fractional frequency aware self-attention (F2SA) to estimate element-wise product attention based on important frequency components and a novel feed-forward network based on frequency division multiplexing (FM-FFN) to refine high and low frequency features separately for efficient latent clear image restoration. Experimental results for the cases of both motion deblurring as well as defocus deblurring show that the performance of our proposed method is superior to other state-of-the-art (SOTA) approaches.||
|**2024-09-03**|[TransDAE: Dual Attention Mechanism in a Hierarchical Transformer for Efficient Medical Image Segmentation](http://arxiv.org/abs/2409.02018)|null|In healthcare, medical image segmentation is crucial for accurate disease diagnosis and the development of effective treatment strategies. Early detection can significantly aid in managing diseases and potentially prevent their progression. Machine learning, particularly deep convolutional neural networks, has emerged as a promising approach to addressing segmentation challenges. Traditional methods like U-Net use encoding blocks for local representation modeling and decoding blocks to uncover semantic relationships. However, these models often struggle with multi-scale objects exhibiting significant variations in texture and shape, and they frequently fail to capture long-range dependencies in the input data. Transformers designed for sequence-to-sequence predictions have been proposed as alternatives, utilizing global self-attention mechanisms. Yet, they can sometimes lack precise localization due to insufficient granular details. To overcome these limitations, we introduce TransDAE: a novel approach that reimagines the self-attention mechanism to include both spatial and channel-wise associations across the entire feature space, while maintaining computational efficiency. Additionally, TransDAE enhances the skip connection pathway with an inter-scale interaction module, promoting feature reuse and improving localization accuracy. Remarkably, TransDAE outperforms existing state-of-the-art methods on the Synapse multi-organ dataset, even without relying on pre-trained weights.||
|**2024-09-03**|[TASL-Net: Tri-Attention Selective Learning Network for Intelligent Diagnosis of Bimodal Ultrasound Video](http://arxiv.org/abs/2409.01557)|null|In the intelligent diagnosis of bimodal (gray-scale and contrast-enhanced) ultrasound videos, medical domain knowledge such as the way sonographers browse videos, the particular areas they emphasize, and the features they pay special attention to, plays a decisive role in facilitating precise diagnosis. Embedding medical knowledge into the deep learning network can not only enhance performance but also boost clinical confidence and reliability of the network. However, it is an intractable challenge to automatically focus on these person- and disease-specific features in videos and to enable networks to encode bimodal information comprehensively and efficiently. This paper proposes a novel Tri-Attention Selective Learning Network (TASL-Net) to tackle this challenge and automatically embed three types of diagnostic attention of sonographers into a mutual transformer framework for intelligent diagnosis of bimodal ultrasound videos. Firstly, a time-intensity-curve-based video selector is designed to mimic the temporal attention of sonographers, thus removing a large amount of redundant information while improving computational efficiency of TASL-Net. Then, to introduce the spatial attention of the sonographers for contrast-enhanced video analysis, we propose the earliest-enhanced position detector based on structural similarity variation, on which the TASL-Net is made to focus on the differences of perfusion variation inside and outside the lesion. Finally, by proposing a mutual encoding strategy that combines convolution and transformer, TASL-Net possesses bimodal attention to structure features on gray-scale videos and to perfusion variations on contrast-enhanced videos. These modules work collaboratively and contribute to superior performance. We conduct a detailed experimental validation of TASL-Net's performance on three datasets, including lung, breast, and liver.||
|**2024-09-02**|[Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement](http://arxiv.org/abs/2409.01352)|**[link](https://github.com/tatban/Spectron)**|Recently, attention-based transformers have become a de facto standard in many deep learning applications including natural language processing, computer vision, signal processing, etc. In this paper, we propose a transformer-based end-to-end model to extract a target speaker's speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives to impose speaker embedding consistency and waveform encoder invertibility and jointly train both speaker encoder and speech separator to better capture the speaker conditional embedding. Furthermore, we leverage a multi-scale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that the use of a dual path transformer in the separator backbone along with the proposed training paradigm improves the CNN baseline by $3.12$ dB points. Finally, we compare our approach with recent state-of-the-art methods and show that our model outperforms existing methods by $4.1$ dB points on average without creating additional data dependency.||
|**2024-09-02**|[CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models](http://arxiv.org/abs/2409.01193)|**[link](https://github.com/raytsang123/clibe)**|Backdoors can be injected into NLP models to induce misbehavior when the input text contains a specific feature, known as a trigger, which the attacker secretly selects. Unlike fixed words, phrases, or sentences used in the static text trigger, NLP dynamic backdoor attacks design triggers associated with abstract and latent text features, making them considerably stealthier than traditional static backdoor attacks. However, existing research on NLP backdoor detection primarily focuses on defending against static backdoor attacks, while detecting dynamic backdoors in NLP models remains largely unexplored. This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models. CLIBE injects a "few-shot perturbation" into the suspect Transformer model by crafting optimized weight perturbation in the attention layers to make the perturbed model classify a limited number of reference samples as a target label. Subsequently, CLIBE leverages the generalization ability of this few-shot perturbation to determine whether the original model contains a dynamic backdoor. Extensive evaluation on three advanced NLP dynamic backdoor attacks, two widely-used Transformer frameworks, and four real-world classification tasks strongly validates the effectiveness of CLIBE. We also demonstrate the robustness of CLIBE against various adaptive attacks. Furthermore, we employ CLIBE to scrutinize 49 popular Transformer models on Hugging Face and discover one exhibiting a high probability of containing a dynamic backdoor. We have contacted Hugging Face and provided detailed evidence of this model's backdoor behavior. Moreover, we extend CLIBE to detect backdoor text generation models modified to exhibit toxic behavior. To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without access to trigger input test samples.||
|**2024-09-02**|[Progressive Retinal Image Registration via Global and Local Deformable Transformations](http://arxiv.org/abs/2409.01068)|**[link](https://github.com/lyp-deeplearning/awesome-retinal-registration)**|Retinal image registration plays an important role in the ophthalmological diagnosis process. Since there exist variances in viewing angles and anatomical structures across different retinal images, keypoint-based approaches become the mainstream methods for retinal image registration thanks to their robustness and low latency. These methods typically assume the retinal surfaces are planar, and adopt feature matching to obtain the homography matrix that represents the global transformation between images. Yet, such a planar hypothesis inevitably introduces registration errors since retinal surface is approximately curved. This limitation is more prominent when registering image pairs with significant differences in viewing angles. To address this problem, we propose a hybrid registration framework called HybridRetina, which progressively registers retinal images with global and local deformable transformations. For that, we use a keypoint detector and a deformation network called GAMorph to estimate the global transformation and local deformable transformation, respectively. Specifically, we integrate multi-level pixel relation knowledge to guide the training of GAMorph. Additionally, we utilize an edge attention module that includes the geometric priors of the images, ensuring the deformation field focuses more on the vascular regions of clinical interest. Experiments on two widely-used datasets, FIRE and FLoRI21, show that our proposed HybridRetina significantly outperforms some state-of-the-art methods. The code is available at https://github.com/lyp-deeplearning/awesome-retinal-registration.||
|**2024-09-02**|[Multi-scale Temporal Fusion Transformer for Incomplete Vehicle Trajectory Prediction](http://arxiv.org/abs/2409.00904)|null|Motion prediction plays an essential role in autonomous driving systems, enabling autonomous vehicles to achieve more accurate local-path planning and driving decisions based on predictions of the surrounding vehicles. However, existing methods neglect the potential missing values caused by object occlusion, perception failures, etc., which inevitably degrades the trajectory prediction performance in real traffic scenarios. To address this limitation, we propose a novel end-to-end framework for incomplete vehicle trajectory prediction, named Multi-scale Temporal Fusion Transformer (MTFT), which consists of the Multi-scale Attention Head (MAH) and the Continuity Representation-guided Multi-scale Fusion (CRMF) module. Specifically, the MAH leverages the multi-head attention mechanism to parallelly capture multi-scale motion representation of trajectory from different temporal granularities, thus mitigating the adverse effect of missing values on prediction. Furthermore, the multi-scale motion representation is input into the CRMF module for multi-scale fusion to obtain the robust temporal feature of the vehicle. During the fusion process, the continuity representation of vehicle motion is first extracted across time steps to guide the fusion, ensuring that the resulting temporal feature incorporates both detailed information and the overall trend of vehicle motion, which facilitates the accurate decoding of future trajectory that is consistent with the vehicle's motion trend. We evaluate the proposed model on four datasets derived from highway and urban traffic scenarios. The experimental results demonstrate its superior performance in the incomplete vehicle trajectory prediction task compared with state-of-the-art models, e.g., a comprehensive performance improvement of more than 39% on the HighD dataset.||
|**2024-09-01**|[Attention-Guided Multi-scale Interaction Network for Face Super-Resolution](http://arxiv.org/abs/2409.00591)|null|Recently, CNN and Transformer hybrid networks have demonstrated excellent performance in face super-resolution (FSR) tasks. Since hybrid networks contain numerous features at different scales, how to fuse these multi-scale features and promote their complementarity is crucial for enhancing FSR. However, existing hybrid network-based FSR methods ignore this and simply combine the Transformer and CNN. To address this issue, we propose an attention-guided Multi-scale interaction network (AMINet), which contains local and global feature interactions as well as encoder-decoder phase feature interactions. Specifically, we propose a Local and Global Feature Interaction Module (LGFI) to promote fusions of global features and different receptive fields' local features extracted by our Residual Depth Feature Extraction Module (RDFE). Additionally, we propose a Selective Kernel Attention Fusion Module (SKAF) to adaptively select fusions of different features within LGFI and encoder-decoder phases. This design allows the free flow of multi-scale features within modules and between the encoder and decoder, which can promote the complementarity of different scale features to enhance FSR. Comprehensive experiments confirm that our method consistently performs well with less computational consumption and faster inference.||

(back to top)
