# awesome_ai_paper

https://github.com/chenin-wang/awesome_ai_paper | paper.cheninweb.asia
## Updated on 2024.11.06
> Usage instructions: [here](./docs/README.md#usage)

Table of Contents

  1. Multimodal
  2. 6DOF Object Pose
  3. NeRF
  4. Classification / Detection / Recognition / Segmentation
  5. Generative Models
  6. LLM
  7. Transformer
## Multimodal

|Publish Date|Title|Code|Abstract|
|---|---|---|---|
|**2024-11-05**|[MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning](http://arxiv.org/abs/2411.03314)|null|In recent years, general-domain multimodal benchmarks have driven the rapid development of multimodal models for general tasks. The financial domain, however, has its own particularities: it features distinctive graphical images (e.g., candlestick charts, technical indicator charts) and a wealth of specialized financial knowledge (e.g., futures, turnover rate). General-domain benchmarks therefore usually fail to measure the performance of multimodal models in finance and cannot effectively guide the development of large financial models. To promote the development of large financial multimodal models, we propose MME-Finance, a bilingual open-ended visual question answering (VQA) benchmark oriented toward practical usage. The benchmark is characterized by its financial focus and expertise: it constructs charts that reflect users' actual needs (e.g., computer screenshots and mobile-phone photos), creates questions according to query preferences in the financial domain, and has questions annotated by experts with more than 10 years of experience in the financial industry. In addition, we develop a customized financial evaluation system that, for the first time, introduces visual information into the multimodal evaluation process. Extensive experiments on 19 mainstream MLLMs test their perception, reasoning, and cognition capabilities. The results show that models performing well on general benchmarks do not do well on MME-Finance; for instance, the top-performing open-source and closed-source models obtain scores of 65.69 (Qwen2VL-72B) and 63.18 (GPT-4o), respectively. They perform particularly poorly on the categories most relevant to finance, such as candlestick charts and technical indicator charts. We also propose a Chinese version, which helps compare the performance of MLLMs in a Chinese context.|
|**2024-11-05**|[Inference Optimal VLMs Need Only One Visual Token but Larger Models](http://arxiv.org/abs/2411.03312)|**[link](https://github.com/locuslab/llava-token-compression)**|Vision-language models (VLMs) have demonstrated strong capabilities across a variety of visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high latency during inference, due to the substantial compute required by the large language model (LLM) to process the large number of input tokens, which predominantly come from the image. To reduce inference costs, one can either downsize the LLM or reduce the number of input image tokens, the latter being the focus of many recent works on token compression. Since both factors directly affect VLM performance, the optimal trade-off is unclear. We first characterize this trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture how performance varies with both factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs, i.e., minimum downstream error at any given fixed inference compute, is achieved by using the largest LLM that fits within the inference budget while minimizing the number of visual tokens, often down to a single token. While the token-reduction literature has mostly focused on maintaining base-model performance with modest reductions (e.g., 5-10x), our results indicate that the compute-optimal inference regime requires operating at even higher token compression ratios. Based on these insights, we take initial steps toward building approaches tailored for high token-compression settings. Code is available at https://github.com/locuslab/llava-token-compression.|
|**2024-11-05**|[HumanVLM: Foundation for Human-Scene Vision-Language Model](http://arxiv.org/abs/2411.03034)|null|Human-scene vision-language tasks are increasingly prevalent in diverse social applications, yet recent advances mostly rely on models specifically tailored to individual tasks. Emerging research shows that large vision-language models (VLMs) can enhance performance across various downstream vision-language understanding tasks; however, general-domain models often underperform in specialized domains. This study introduces a domain-specific large vision-language model, the Human-Scene Vision-Language Model (HumanVLM), designed to serve as a foundation for human-scene vision-language tasks. Specifically, (1) we create a large-scale human-scene multimodal image-text dataset (HumanCaption-10M) sourced from the Internet to facilitate domain-specific alignment; (2) we develop a captioning approach for human-centered images that captures faces, bodies, and backgrounds, and construct a high-quality human-scene image-text dataset (HumanCaptionHQ, about 311k pairs) containing as much detailed person-related information as possible; and (3) using HumanCaption-10M and HumanCaptionHQ, we train HumanVLM. In experiments, we evaluate HumanVLM across various downstream tasks, where it shows superior overall performance among multimodal models of comparable scale, excelling particularly on human-related tasks and significantly outperforming similar models, including Qwen2VL and ChatGPT-4o. HumanVLM, together with the data introduced, will stimulate research in human-related fields.|
|**2024-11-05**|[Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning](http://arxiv.org/abs/2411.02793)|null|Multimodal sentiment analysis (MSA) is an important research area that aims to understand and recognize human sentiment through multiple modalities. The complementary information provided by multimodal fusion makes sentiment analysis more effective than using a single modality alone. In real-world applications, however, many unavoidable factors can lead to situations where modalities are uncertainly missing, hindering effective multimodal modeling and degrading model performance. To this end, we propose a Hierarchical Representation Learning Framework (HRLF) for the MSA task under uncertain missing modalities. Specifically, we propose a fine-grained representation factorization module that fully extracts valuable sentiment information by factorizing modalities into sentiment-relevant and modality-specific representations via cross-modal translation and sentiment semantic reconstruction. Moreover, we introduce a hierarchical mutual-information maximization mechanism that incrementally maximizes the mutual information between multi-scale representations, aligning and reconstructing high-level semantics in the representations. Finally, we propose a hierarchical adversarial learning mechanism that further aligns and adapts the latent distributions of the sentiment-relevant representations to produce robust joint multimodal representations. Comprehensive experiments on three datasets demonstrate that HRLF significantly improves MSA performance under uncertain missing-modality conditions.|
|**2024-11-05**|[DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark](http://arxiv.org/abs/2411.02733)|**[link](https://github.com/haodongli2024/rspope)**|With the rapid development of large vision-language models (LVLMs), these models have shown excellent results across various multimodal tasks. Because LVLMs are prone to hallucinations, and few dedicated datasets and evaluation methods currently exist for remote sensing, their performance is typically poor when applied to remote sensing tasks. To address these issues, this paper introduces DDFAV, a high-quality remote sensing LVLM dataset created using data augmentation and data mixing strategies. Next, a training instruction set is produced based on a selection of high-quality remote sensing images from the proposed dataset. Finally, we develop RSPOPE, a remote sensing LVLM hallucination evaluation method based on the proposed dataset, and evaluate the zero-shot capabilities of different LVLMs. The proposed dataset, instruction set, and evaluation method files are available at https://github.com/HaodongLi2024/rspope.|
|**2024-11-04**|[INQUIRE: A Natural World Text-to-Image Retrieval Benchmark](http://arxiv.org/abs/2411.02537)|**[link](https://github.com/inquire-benchmark/INQUIRE)**|We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural-world images, along with 250 expert-level retrieval queries. These queries are comprehensively paired and labeled with all relevant images in iNat24, comprising 33,000 matches in total. The queries span categories such as species identification, context, behavior, and appearance, emphasizing tasks that require nuanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full-dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals. A detailed evaluation of a range of recent multimodal models demonstrates that INQUIRE poses a significant challenge, with even the best models failing to exceed an mAP@50 of 50%. In addition, we show that reranking with more powerful multimodal models can improve retrieval performance, yet significant room for improvement remains. By focusing on scientifically motivated ecological challenges, INQUIRE aims to bridge the gap between AI capabilities and the needs of real-world scientific inquiry, encouraging the development of retrieval systems that can help accelerate ecological and biodiversity research. Our dataset and code are available at https://inquire-benchmark.github.io.|
|**2024-11-04**|[One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering](http://arxiv.org/abs/2411.02210)|null|Vision-language models (VLMs) have shown significant promise in visual question answering (VQA) tasks by leveraging web-scale multimodal datasets. However, these models often struggle with continual learning due to catastrophic forgetting when adapting to new tasks. As an effective remedy, rehearsal strategies use data from past tasks when learning new ones. Such strategies, though, require storing past data, which may not be feasible due to hardware constraints or privacy concerns. In this work, we propose the first data-free method that leverages the language-generation capability of a VLM, instead of relying on external models, to produce pseudo-rehearsal data for addressing continual VQA. Our proposal, named GaB, generates pseudo-rehearsal data by posing previous-task questions on new-task data. Although effective, the distribution of generated questions skews toward the most frequently posed ones, owing to the limited and task-specific training data. To mitigate this, we introduce a pseudo-rehearsal balancing module that aligns the generated data with the ground-truth data distribution using either question meta-statistics or an unsupervised clustering method. We evaluate the proposed method on two recent benchmarks, the VQACL-VQAv2 and CLOVE-function benchmarks. GaB outperforms all data-free baselines, with substantial improvements in maintaining VQA performance across evolving tasks, while remaining on par with methods that have access to past data.|
|**2024-11-04**|[TableGPT2: A Large Multimodal Model with Tabular Data Integration](http://arxiv.org/abs/2411.02059)|null|The emergence of models like GPT, Claude, LLaMA, and Qwen has reshaped AI applications, opening vast new opportunities across industries. Yet the integration of tabular data with these models remains notably underdeveloped, despite its foundational role in numerous real-world domains. This gap is critical for three main reasons. First, database and data-warehouse integration is essential for advanced applications; second, the vast and largely untapped resource of tabular data offers immense analytical potential; and third, the business intelligence domain in particular demands adaptable, precise solutions that many current LLMs may struggle to provide. In response, we introduce TableGPT2, a model rigorously pre-trained and fine-tuned on more than 5.938 million tables and 2.36 million high-quality query-table-output tuples, a scale of table-related data unprecedented in prior research. This extensive training enables TableGPT2 to excel at table-centric tasks while maintaining strong general language and coding capabilities. One of TableGPT2's key innovations is its novel table encoder, specifically designed to capture schema-level and cell-level information. The encoder strengthens the model's ability to handle ambiguous queries, missing column names, and irregular tables commonly encountered in real-world applications. Similarly to vision-language models, this pioneering approach integrates with the decoder to form a robust large multimodal model. We believe the results are compelling: across 23 benchmark metrics, TableGPT2 achieves average performance improvements of 35.20% (7B model) and 49.32% (72B model) over prior benchmark-neutral LLMs, while preserving strong general-purpose capabilities.|
|**2024-11-04**|[Foundations and Recent Trends in Multimodal Mobile Agents: A Survey](http://arxiv.org/abs/2411.02006)|null|Mobile agents are essential for automating tasks in complex and dynamic mobile environments. As foundation models evolve, demand is growing for agents that can adapt in real time and process multimodal data. This survey provides a comprehensive review of mobile-agent technologies, focusing on recent advances that enhance real-time adaptability and multimodal interaction. Recently developed evaluation benchmarks better capture the static and interactive environments of mobile tasks, offering more accurate assessments of agent performance. We categorize these advances into two main approaches: prompt-based methods, which utilize large language models (LLMs) for instruction-based task execution, and training-based methods, which fine-tune multimodal models for mobile-specific applications. In addition, we explore complementary techniques that augment agent performance. By discussing key challenges and outlining future research directions, this survey offers valuable insights for advancing mobile-agent technologies. A comprehensive resource list is available at https://github.com/aialt/awesome-mobile-agents.|
|**2024-11-03**|[EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark](http://arxiv.org/abs/2411.01492)|null|Recent studies on large language models (LLMs) and large multimodal models (LMMs) have demonstrated promising skills in various domains, including science and mathematics. However, their capability in more challenging and real-world-relevant scenarios such as engineering has not been systematically studied. To bridge this gap, we propose EEE-Bench, a multimodal benchmark aimed at assessing LMMs' capabilities in solving practical engineering tasks, using electrical and electronics engineering (EEE) as the testbed. Our benchmark consists of 2860 carefully curated problems spanning 10 essential subdomains, such as analog circuits and control systems. Compared to benchmarks in other domains, engineering problems are intrinsically (1) more visually complex and diverse and (2) less deterministic in their solutions. Successfully solving them typically demands a tighter integration of visual and textual information than before, as models must understand intricate images such as abstract circuit and system diagrams while also taking professional instructions into account, making them excellent candidates for LMM evaluation. Alongside EEE-Bench, we provide extensive quantitative evaluation and fine-grained analysis of 17 widely used open- and closed-source LLMs and LMMs. Our results reveal notable deficiencies of current foundation models in EEE, with average performance ranging from 19.48% to 46.78%. Finally, we identify and explore a critical shortcoming in LMMs that we term "laziness": the tendency to take shortcuts by relying on text while overlooking visual context when reasoning about technical image-based problems. In summary, we believe EEE-Bench not only reveals some noteworthy limitations of LMMs but also provides a valuable resource for advancing research on their application to practical engineering tasks, driving future improvements in their ability to handle complex real-world scenarios.|
|**2024-10-31**|[$π_0$: A Vision-Language-Action Flow Model for General Robot Control](http://arxiv.org/abs/2410.24164)|null|Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist policies for complex and highly dexterous tasks. We propose a novel flow-matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate the model on its ability to perform tasks zero-shot after pre-training, follow language instructions from people and from a high-level VLM policy, and acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and box assembly.|
|**2024-10-31**|[Exploring Vision Language Models for Facial Attribute Recognition: Emotion, Race, Gender, and Age](http://arxiv.org/abs/2410.24148)|null|Facial attribute recognition technologies, such as race, gender, age, and emotion recognition, have wide applications in surveillance, advertising content, sentiment analysis, and the study of demographic trends and social behaviors. Analyzing demographic characteristics and facial expressions from images is challenging due to the complexity of facial attributes. Traditional approaches employ convolutional neural networks (CNNs) and various other deep learning techniques trained on large labeled image sets. While these methods demonstrate effective performance, there remains room for improvement. In this paper, we propose to utilize vision-language models (VLMs), such as the Generative Pre-trained Transformer (GPT), GEMINI, Large Language and Vision Assistant (LLAVA), PaliGemma, and Microsoft Florence2, to recognize facial attributes such as race, gender, age, and emotion from face images. We evaluate these approaches on datasets including FairFace, AffectNet, and UTKFace. The results show that VLMs are competitive with, if not superior to, traditional techniques. Additionally, we propose "FaceScanPaliGemma", a fine-tuned PaliGemma model, for race, gender, age, and emotion recognition. It achieves accuracies of 81.1%, 95.8%, 80%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming the pre-trained PaliGemma, other VLMs, and state-of-the-art methods. Finally, we propose "FaceScanGPT", a GPT-4o model for recognizing the above attributes when several individuals are present in an image, using prompts designed for a person with specific facial and/or body attributes. The results underscore FaceScanGPT's superior multitasking capability, with detection and recognition tasks driven solely by prompts, including detecting an individual's attributes such as hairstyle, clothing color, posture, etc.|
|**2024-10-31**|[Nearest Neighbor Normalization Improves Multimodal Retrieval](http://arxiv.org/abs/2410.24114)|**[link](https://github.com/multimodal-interpretability/nnn)**|Multimodal models leverage large-scale pre-training to achieve strong, but still imperfect, performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method, Nearest Neighbor Normalization (NNN), for correcting errors in trained contrastive image-text retrieval models with no additional training. We show improvements on text- and image-retrieval metrics for all of the contrastive models we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and on both of the datasets we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on that database, and can even improve a model's retrieval accuracy after fine-tuning.|
|**2024-10-31**|[Bayesian-guided Label Mapping for Visual Reprogramming](http://arxiv.org/abs/2410.24018)|**[link](https://github.com/tmlr-group/bayesianlm)**|Visual reprogramming (VR) leverages the intrinsic capabilities of pre-trained vision models by adapting their input or output interfaces to solve downstream tasks whose labels (i.e., downstream labels) may be entirely different from those associated with the pre-trained models (i.e., pre-trained labels). When adapting the output interface, label mapping methods transform pre-trained labels into downstream labels by establishing a gradient-free one-to-one correspondence between the two label sets. In this paper, however, we reveal that one-to-one mappings may overlook the complex relationships between pre-trained and downstream labels. Motivated by this observation, we propose a Bayesian-guided Label Mapping (BLM) method. BLM constructs an iteratively updated probabilistic label-mapping matrix, in which each element quantifies the pairwise relationship between a pre-trained label and a downstream label. The assignment of values in this matrix is guided by Bayesian conditional probability, considering the joint distribution of the downstream labels and the labels predicted by the pre-trained model on downstream samples. Experiments on both pre-trained vision models (e.g., ResNeXt) and vision-language models (e.g., CLIP) demonstrate that BLM outperforms existing label mapping methods. The success of BLM also offers a probabilistic lens through which to understand and analyze the effectiveness of VR. Our code is available at https://github.com/tmlr-group/BayesianLM.|
|**2024-10-31**|[EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection](http://arxiv.org/abs/2410.23904)|**[link](https://github.com/chelsielei/ez-hoi)**|Detecting human-object interactions (HOI) in zero-shot settings, where models must handle unseen classes, poses significant challenges. Existing methods rely on aligning visual encoders with large vision-language models (VLMs) to tap into the VLMs' extensive knowledge, which requires large, computationally expensive models and runs into training difficulties. Adapting VLMs with prompt learning offers an alternative to direct alignment. However, because labels for unseen classes are absent, fine-tuning on task-specific datasets often leads to overfitting to seen classes and suboptimal performance on unseen ones. To address these challenges, we introduce a novel prompt-learning-based framework for efficient zero-shot HOI detection (EZ-HOI). First, we introduce large language model (LLM) and VLM guidance for learnable prompts, integrating detailed HOI descriptions and visual semantics to adapt VLMs to the HOI task. However, since training datasets contain labels for seen classes only, fine-tuning VLMs on such datasets tends to optimize the learnable prompts for seen rather than unseen classes. We therefore approach prompt learning for unseen classes using information from related seen classes, with LLMs employed to highlight the differences between unseen classes and their related seen classes. Quantitative evaluations on benchmark datasets demonstrate that our EZ-HOI achieves state-of-the-art performance across various zero-shot settings while using only 10.35% to 33.95% of the trainable parameters required by existing methods. Code is available at https://github.com/ChelsieLei/EZ-HOI.|
|**2024-10-31**|[Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP](http://arxiv.org/abs/2410.23698)|null|Large pre-trained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the visual concepts are unseen or under-represented during pre-training. Prompt learning offers a parameter-efficient fine-tuning framework that can adapt CLIP to downstream tasks even with limited annotated data. In this paper, we improve prompt learning by distilling textual knowledge from natural language prompts (either human- or LLM-generated) to provide rich priors for these under-represented concepts. We first obtain a prompt "summary" aligned with each input image via a learned prompt aggregator. Then we jointly train a prompt generator so that its generated prompt embedding stays as close as possible to the aggregated summary while minimizing the task loss. We dub this prompt embedding the Aggregate-and-Adapted Prompt Embedding (AAPE). AAPE is shown to generalize to different downstream data distributions and tasks, including vision-language understanding tasks (e.g., few-shot classification, VQA) and generation tasks (image captioning), where it achieves competitive performance. We also show that AAPE is particularly helpful for handling non-canonical and OOD examples. Furthermore, AAPE learning eliminates the LLM-based inference cost required by baselines and scales better with data and LLM model size.|
|**2024-10-31**|[SuctionPrompt: Visual-assisted Robotic Picking with a Suction Cup Using Vision-Language Models and Facile Hardware Design](http://arxiv.org/abs/2410.23640)|null|The development of large language models and vision-language models (VLMs) has led to the increasing use of robots across various domains. However, effectively integrating these models into real-world robotic tasks is a key challenge. We developed a versatile robotic system called SuctionPrompt that utilizes VLM prompting combined with 3D detection to perform product-picking tasks in diverse and dynamic environments. Our approach emphasizes the importance of integrating 3D spatial information with adaptive action planning, enabling a robot to approach and manipulate objects in novel environments. In validation experiments, the system accurately selected suction points 75.4% of the time and achieved a 65.0% success rate in picking common items. This study highlights the effectiveness of VLMs in robotic manipulation tasks, even with only simple 3D processing.|
|**2024-10-30**|[CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP](http://arxiv.org/abs/2410.23330)|null|Machine unlearning (MU) has gained significant attention as a means to remove specific data from trained models without requiring full retraining. While progress has been made in unimodal domains such as text and image classification, unlearning in multimodal models remains relatively under-explored. This work addresses the unique challenges of unlearning in CLIP, a prominent multimodal model that aligns visual and textual representations. We introduce CLIPErase, a novel approach that disentangles and selectively forgets visual and textual associations, ensuring that unlearning does not compromise model performance. CLIPErase consists of three key modules: a Forgetting Module that disrupts associations among samples in the forget set, a Retention Module that preserves performance on the retain set, and a Consistency Module that maintains consistency with the original model. Extensive experiments on the CIFAR-100 and Flickr30K datasets across four CLIP downstream tasks demonstrate that CLIPErase effectively forgets designated associations of multimodal samples in zero-shot tasks, while preserving the model's performance on the retain set after unlearning.|
|**2024-10-30**|[EMMA: End-to-End Multimodal Model for Autonomous Driving](http://arxiv.org/abs/2410.23262)|null|We introduce EMMA, an end-to-end multimodal model for autonomous driving. Built on a multimodal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road-graph elements. EMMA maximizes the utility of world knowledge from pre-trained large language models by representing all non-sensor inputs (e.g., navigation instructions and ego-vehicle status) and outputs (e.g., trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space and to generate outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art motion-planning performance on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner-trajectory, object-detection, and road-graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small number of image frames, does not incorporate accurate 3D sensing modalities such as LiDAR or radar, and is computationally expensive. We hope our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.|
|**2024-10-30**|[Keypoint Abstraction using Large Models for Object-Relative Imitation Learning](http://arxiv.org/abs/2410.23254)|null|Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have proven effective as a succinct representation for capturing essential object features and for establishing a reference frame in action prediction, enabling data-efficient learning of robot skills. However, their manual-design nature and reliance on additional human labels limit their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance-consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating keypoint proposals using LMs and verifying them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong real-world performance, adapting to different tasks and environments from only a handful of demonstrations and requiring no additional labels. Website: https://kalm-il.github.io/|
|**2024-10-29**|[Natural Language Inference Improves Compositionality in Vision-Language Models](http://arxiv.org/abs/2410.22315)|null|Compositional reasoning in vision-language models (VLMs) remains challenging, as these models often struggle to relate objects, attributes, and spatial relationships. Recent methods aim to address these limitations by relying on the semantics of textual descriptions, using large language models (LLMs) to break them down into subsets of questions and answers. However, these methods operate mainly at the surface level, failing to incorporate deeper lexical understanding while introducing incorrect assumptions generated by the LLM. To address these issues, we propose Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages natural language inference (NLI) to generate entailments and contradictions from a given premise. CECE produces lexically diverse sentences while maintaining their core meaning. Through extensive experiments, we show that CECE enhances interpretability and reduces over-reliance on biased or superficial features. By balancing CECE and the original premise, we achieve significant improvements over prior methods without additional fine-tuning, achieving state-of-the-art results on human-judgment benchmarks for evaluating image-text alignment consistency, and gains of +19.2% (group score) on Winoground and +12.9% (group score) on EqBen over the previous best work (which was fine-tuned with targeted data).|
|**2024-10-29**|[Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving](http://arxiv.org/abs/2410.22313)|**[link](https://github.com/hustvl/senna)**|End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data, but still struggles in complex and rare scenarios due to limited common sense. In contrast, large vision-language models (LVLMs) excel at scene understanding and reasoning. The path forward lies in merging the strengths of both. Previous methods that use LVLMs to predict trajectories or control signals yield suboptimal results, because LVLMs are not well suited for precise numerical predictions. This paper presents Senna, an autonomous driving system that combines an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction: Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM utilizes a multi-image encoding approach and multi-view prompts for efficient scene understanding. In addition, we introduce planning-oriented QA along with a three-stage training strategy, which improves Senna-VLM's planning performance while preserving common sense. Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance. Notably, with pre-training on the large-scale dataset DriveX and fine-tuning on nuScenes, Senna reduces the average planning error by 27.12% and the collision rate by 33.33% relative to a model without pre-training. We believe Senna's cross-scenario generalization and transferability are essential for achieving fully autonomous driving. Code and models will be released at https://github.com/hustvl/Senna.|
|**2024-10-29**|[ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding](http://arxiv.org/abs/2410.22211)|**[link](https://github.com/kimihiroh/promqa)**|Multimodal systems hold great potential for assisting humans in procedural activities, where people follow instructions to achieve their goals. Despite the diversity of application scenarios, systems are typically evaluated on traditional classification tasks, such as action recognition or temporal action segmentation. In this paper, we present a new evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs based on user recordings of procedural activities and their corresponding instructions. For QA annotation, we take a cost-effective human-LLM collaborative approach, in which LLM-generated QA pairs, later verified by humans, augment the existing annotations. We then provide benchmark results to set baseline performance on ProMQA. Our experiments reveal a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models' multimodal understanding capabilities.|
|**2024-10-29**|[Active Learning for Vision-Language Models](http://arxiv.org/abs/2410.22187)|null|Pre-trained vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, a considerable performance gap still exists between these models and supervised deep models trained on downstream datasets. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation. To achieve this, our approach first calibrates the predicted entropy of the VLM and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to compute a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL methods on several image classification datasets and significantly enhances the zero-shot performance of VLMs.|
|**2024-10-29**|[Are VLMs Really Blind](http://arxiv.org/abs/2410.22029)|**[link](https://github.com/vlgiitr/Are-VLMs-Really-Blind)**|Vision-language models excel at handling a variety of complex tasks, including optical character recognition (OCR), visual question answering (VQA), and advanced geometric reasoning. However, these models underperform on low-level basic visual tasks that are especially easy for humans. Our goal in this work is to determine whether these models are truly "blind" to geometric reasoning, or whether there are ways to enhance their capabilities in this area. Our work presents a novel automatic pipeline designed to extract key information from images in response to specific questions. Instead of relying solely on direct VQA, we use keywords extracted from the question to create a caption that highlights important details in the image relevant to the question. A language model then uses this caption to provide a precise answer to the question without external fine-tuning.|
|**2024-10-29**|[Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications](http://arxiv.org/abs/2410.21943)|**[link](https://github.com/riedlerm/multimodal_rag_for_industry)**|Large language models (LLMs) have demonstrated impressive capabilities in answering questions, but they lack domain-specific knowledge and are prone to hallucinations. Retrieval-augmented generation (RAG) is one approach to address these challenges, while multimodal models are emerging as promising AI assistants for processing both text and images. In this paper, we describe a series of experiments aimed at determining how best to integrate multimodal models into RAG systems for the industrial domain. The purpose of the experiments is to determine whether including images alongside the text of documents from the industrial domain improves RAG performance, and to find the optimal configuration for such a multimodal RAG system. Our experiments cover two approaches to image processing and retrieval, and two LLMs (GPT4-Vision and LLaVA) for answer synthesis. The image-processing strategies involve using multimodal embeddings and generating textual summaries from images. We evaluate our experiments with an LLM-as-a-judge approach. Our results reveal that multimodal RAG can outperform single-modality RAG setups, although image retrieval poses a greater challenge than text retrieval. Moreover, leveraging textual summaries of images offers a more promising approach than multimodal embeddings, providing more opportunities for future advancement.|
|**2024-10-29**|[Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models](http://arxiv.org/abs/2410.21802)|**[link](https://github.com/zhyblue424/tga-zsr)**|Due to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and been adopted across various domains. However, CLIP has been observed to be vulnerable to adversarial examples. Through experimental analysis, we observe a phenomenon in which adversarial perturbations induce shifts in text-guided attention. Building on this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). The framework incorporates two components: an Attention Refinement module and an Attention-based Model Constraint module. Our goal is to maintain the generalization of the CLIP model while enhancing its adversarial robustness: the Attention Refinement module aligns the text-guided attention obtained from the target model on adversarial examples with the text-guided attention obtained from the original model on clean examples. This alignment enhances the model's robustness. Additionally, the Attention-based Model Constraint module obtains text-guided attention from both the target and original models on clean examples; its goal is to maintain performance on clean samples while improving overall robustness. Experiments validate that our method improves zero-shot robust accuracy by 9.58% over the current state of the art across 16 datasets. Our code is available at https://github.com/zhyblue424/TGA-ZSR.|
|**2024-10-29**|[AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?](http://arxiv.org/abs/2410.21259)|**[link](https://github.com/wad3birch/AutoBench-V)**|Large vision-language models (LVLMs) have become essential for advancing the integration of visual and linguistic information, facilitating a wide range of complex applications and tasks. However, evaluating LVLMs presents significant challenges, because constructing evaluation benchmarks always demands substantial human effort, and the benchmarks remain static once built, lacking flexibility. Although automatic evaluation has been explored for the text modality, the visual modality remains under-studied. In this work, we therefore ask: "Can LVLMs serve as a path to automatic benchmarking?" We introduce AutoBench-V, an automated framework for evaluation on demand, i.e., benchmarking LVLMs on specific aspects of model capability. Upon receiving an evaluation capability, AutoBench-V leverages text-to-image models to generate relevant image samples, and then utilizes LVLMs to orchestrate visual question answering (VQA) tasks, completing the evaluation process efficiently and flexibly. Through an extensive evaluation of seven popular LVLMs on five user inputs (i.e., evaluation capabilities), the framework demonstrates effectiveness and reliability. We observe the following: (1) the constructed benchmark accurately reflects varying task difficulties; (2) the performance gap between models widens as task difficulty rises; (3) while models exhibit strong performance in abstract-level understanding, they underperform in detail-reasoning tasks; and (4) constructing datasets with varying levels of difficulty is critical for comprehensive and thorough evaluation. Overall, AutoBench-V not only successfully utilizes LVLMs for automated benchmarking but also reveals the substantial potential of LVLMs as judges.|
|**2024-10-28**|[Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines](http://arxiv.org/abs/2410.21220)|**[link](https://github.com/cnzzx/vsa)**|Search engines enable the retrieval of unknown information with text. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if a model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to a user's questions about that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical due to the heavy computational burden. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. The approach leverages the VLM's visual understanding capabilities and the web agent's real-time information access to perform open-world retrieval-augmented generation via the web. By integrating visual and textual representations through this collaboration, the model can provide well-informed responses even when an image is novel to the system. Extensive experiments on both open-set and closed-set QA benchmarks demonstrate that Vision Search Assistant significantly outperforms other models and can be broadly applied to existing VLMs.|
|**2024-10-28**|[Zero-Shot Action Recognition in Surveillance Videos](http://arxiv.org/abs/2410.21113)|null|The growing demand for surveillance in public spaces presents significant challenges due to the shortage of human resources. Current AI-based video surveillance systems rely heavily on core computer vision models that require extensive fine-tuning, which is particularly difficult in surveillance settings due to limited datasets and difficult conditions (viewpoint, low quality, etc.). In this work, we propose leveraging large vision-language models (LVLMs), known for their strong zero- and few-shot generalization, to tackle video understanding tasks in surveillance. Specifically, we explore VideoLLaMA2, a state-of-the-art LVLM, and an improved token-level sampling method, Self-Reflective Sampling (Self-ReS). Our experiments on the UCF-Crime dataset show that VideoLLaMA2 represents a significant leap in zero-shot performance, improving on the baseline by 20%. Self-ReS further raises zero-shot action recognition performance to 44.6%. These results highlight the potential of LVLMs, paired with improved sampling techniques, for advancing surveillance video analysis in diverse scenarios.|
|**2024-10-25**|[Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models](http://arxiv.org/abs/2410.19732)|null|Large vision-language models (LVLMs) excel at cross-modal tasks but underperform in long-context reasoning because they over-rely on textual information and reduce their reliance on visual input. In this study, we empirically analyze LVLMs in long-context reasoning and show that, as context length increases, the model's dependence on language rises while its dependence on vision declines. To address this issue, we propose a novel training-free context pruning method that selectively removes less important textual information. Our approach strengthens visual dependency and reduces textual noise, thereby improving LVLM performance in long-context reasoning. We validate the method's effectiveness by constructing a long-context dataset and demonstrating it across various LVLMs. Moreover, further analysis confirms the robustness of different token pruning strategies and preliminarily explores scaling relationships between pruning rates and context length.|
|**2024-10-25**|[OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization](http://arxiv.org/abs/2410.19609)|**[link](https://github.com/minorjerry/openwebvoyager)**|The rapid development of large language and multimodal models has sparked significant interest in using proprietary models, such as GPT-4o, to develop autonomous agents capable of handling real-world scenarios like web navigation. Although recent open-source efforts have tried to equip agents with the ability to explore environments and continuously improve over time, they build text-only agents in synthetic environments with clearly defined reward signals. Such agents struggle to generalize to realistic settings that require multimodal perception abilities and lack ground-truth signals. In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agents that can autonomously conduct real-world exploration and improve themselves. We first train a base model with imitation learning to acquire basic capabilities. We then let the agent explore the open web and collect feedback on its trajectories. After that, it further improves its policy by learning from well-performing trajectories judged by another general-purpose model. This exploration-feedback-optimization cycle can continue for several iterations. Experimental results show that our web agent successfully improves itself after each iteration, demonstrating strong performance across multiple test sets.|
|**2024-10-25**|[GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing](http://arxiv.org/abs/2410.19552)|null|Detecting temporal changes in geographical landscapes is critical for applications such as environmental monitoring and urban planning. While remote sensing data is abundant, existing vision-language models (VLMs) often fail to capture temporal dynamics effectively. This paper addresses these limitations by introducing an annotated dataset of video frame pairs for tracking geographical patterns as they evolve over time. Using fine-tuning techniques such as low-rank adaptation (LoRA), quantized LoRA (QLoRA), and model pruning on models including Video-LLaVA and LLaVA-NeXT-Video, we significantly improve VLM performance in processing remote-sensing temporal changes. The results show substantial gains, with the best model achieving a BERT score of 0.864 and a ROUGE-1 score of 0.576, demonstrating superior accuracy in describing land-use transformations.|
|**2024-10-25**|[COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training](http://arxiv.org/abs/2410.19313)|null|FP8 training has emerged as a promising method for improving training efficiency. Existing frameworks accelerate training by applying FP8 computation to linear layers while keeping optimizer states and activations in higher precision, which fails to fully optimize memory usage. This paper introduces COAT (Compressing Optimizer states and Activations for FP8 Training), a novel FP8 training framework designed to significantly reduce the memory footprint when training large models. COAT addresses current limitations through two key innovations: (1) Dynamic Range Expansion, which aligns optimizer-state distributions more closely with the FP8 representation range, thereby reducing quantization error, and (2) Mixed-Granularity Activation Quantization, which optimizes activation memory by combining per-tensor and per-group quantization strategies. Experiments demonstrate that COAT effectively reduces the end-to-end training memory footprint by 1.54x compared to BF16 while achieving nearly lossless performance across various tasks, such as large language model pre-training and fine-tuning and vision-language model training. COAT also achieves a 1.43x end-to-end training speedup compared to BF16, performing on par with or surpassing TransformerEngine's speedup. COAT enables efficient full-parameter training of large models on fewer GPUs and facilitates doubling the batch size in distributed training settings, providing a practical solution for scaling large-scale model training. Code is available at https://github.com/NVlabs/COAT.|
|**2024-10-25**|[Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting](http://arxiv.org/abs/2410.19294)|null|Vision-language models, such as CLIP, have shown impressive generalization capabilities when using appropriate text descriptions. While optimizing prompts on downstream labeled data has proven effective for improving performance, these methods entail annotation labor and are limited by annotation quality. Additionally, since CLIP is pre-trained on highly imbalanced web-scale data, it suffers from an inherent label bias that leads to suboptimal performance. To tackle these challenges, we propose a label-free prompt distribution learning and bias correction framework, dubbed **Frolic**, which boosts zero-shot performance without requiring labeled data. Specifically, Frolic learns distributions over prompt prototypes to capture diverse visual representations and adaptively fuses these with the original CLIP via confidence matching. The fused model is further enhanced by correcting the label bias with a label-free logit adjustment. Notably, our method is not only training-free but also circumvents the need for hyper-parameter tuning. Extensive experimental results across 16 datasets demonstrate the effectiveness of our approach, which in particular outperforms the state of the art by an average of 2.6% across 10 datasets with CLIP ViT-B/16 and achieves an average margin of 1.5% on ImageNet and its five distribution shifts with CLIP ViT-B/16. Code is available at https://github.com/zhuhsingyuu/Frolic.|
|**2024-10-24**|[Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant](http://arxiv.org/abs/2410.19144)|**[link](https://github.com/vl2g/KaLMA)**|We revisit knowledge-based text visual question answering, also known as Text-KVQA, in light of recent advances in large multimodal models (LMMs), and make the following contributions: (i) We propose VisTEL, a principled approach to perform visual text entity linking. The VisTEL module harnesses a state-of-the-art visual text recognition engine and the capabilities of a large multimodal model to jointly reason using textual and visual context obtained from surrounding cues in the image, linking a visual text entity to the correct knowledge-base entity. (ii) We present KaLMA, a knowledge-aware large multimodal assistant that augments an LMM with knowledge associated with the visual text entity in the image to arrive at an accurate answer. In addition, we provide a comprehensive experimental analysis and comparison of our approach with traditional visual question answering, pre-LMM models, LMMs, and prior top-performing approaches. Averaged over the three splits of Text-KVQA, our proposed approach surpasses the previous best method by a substantial 23.3% on an absolute scale and establishes a new state of the art. We will make our implementation publicly available.|
|**2024-10-24**|[VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks](http://arxiv.org/abs/2410.19100)|null|Videos are often used to learn or extract the information necessary to complete tasks in ways different from what text and static imagery alone can provide. However, many existing agent benchmarks neglect long-context video understanding, focusing instead on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the long-context video understanding capabilities of multimodal agents. VideoWA consists of 2,021 web-agent tasks based on manually crafted video tutorials, totaling almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. Skill-retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, while factual-retention tasks evaluate whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves a 13.3% success rate on factual-retention tasks and 45.8% on factual-retention QA pairs, far below human performance of 73.9% and 79.3%, respectively. On skill-retention tasks, long-context models perform worse with tutorials than without them, exhibiting a 5% performance decrease on WebArena tasks and a 10.3% decrease on VisualWebArena tasks. Our work highlights the need to improve the agentic abilities of long-context multimodal models and provides a testbed for future development of long-context video agents.|
|**2024-10-24**|[CAMEL-Bench: A Comprehensive Arabic LMM Benchmark](http://arxiv.org/abs/2410.18976)|**[link](https://github.com/mbzuai-oryx/CAMEL-Bench)**|Recent years have witnessed significant interest in developing large multimodal models (LMMs) capable of performing various visual reasoning and understanding tasks. This has led to the introduction of multiple LMM benchmarks for evaluating LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for Arabic, representing a population of over 400 million speakers. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote-sensing-based land-use understanding, to evaluate broad scenario generalizability. CAMEL-Bench contains approximately 29,036 questions filtered from a larger pool of samples, with quality manually verified by native speakers to ensure reliable model assessment. We evaluate both closed-source models, including the GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models; even the closed-source GPT-4o achieves an overall score of only 62%. Our benchmark and evaluation scripts are open-sourced.|
|**2024-10-24**|[Deep Insights into Cognitive Decline: A Survey of Leveraging Non-Intrusive Modalities with Deep Learning Techniques](http://arxiv.org/abs/2410.18972)|null|Cognitive decline is a natural part of aging, often resulting in reduced cognitive abilities. In some cases, however, this decline is more pronounced, typically due to disorders such as Alzheimer's disease. Early detection of anomalous cognitive decline is crucial, as it can facilitate timely professional intervention. While medical data can help in this detection, it often involves invasive procedures. An alternative is to employ non-intrusive techniques such as speech or handwriting analysis, which do not necessarily interfere with daily activities. This survey reviews the most relevant methodologies that use deep learning techniques to automate the estimation of cognitive decline, covering audio, text, and visual processing. We discuss the key features and advantages of each modality and methodology, including state-of-the-art approaches such as Transformer architectures and foundation models. In addition, we present works that integrate different modalities into multimodal models. We also highlight the most significant datasets and the quantitative results of studies using these resources. Several conclusions emerge from this review. In most cases, the textual modality achieves the best results and is the most relevant for detecting cognitive decline. Moreover, combining approaches from individual modalities into a multimodal model consistently improves performance in nearly all scenarios.|
|**2024-10-24**|[Zero-shot Object Navigation with Vision-Language Models Reasoning](http://arxiv.org/abs/2410.18570)|null|Object navigation is crucial for robots, but traditional methods require substantial training data and do not generalize to unknown environments. Zero-shot object navigation (ZSON) aims to address this challenge, enabling robots to interact with unknown objects without task-specific training data. Language-driven zero-shot object navigation (L-ZSON) extends ZSON by incorporating natural language instructions to guide robot navigation and object interaction. In this paper, we propose VLTNet, a novel Vision-Language model with a Tree-of-thought Network for L-ZSON. VLTNet comprises four main modules: vision-language model understanding, semantic mapping, tree-of-thought reasoning and exploration, and goal identification. Among these, the tree-of-thought (ToT) reasoning and exploration module serves as the core component, innovatively using the ToT reasoning framework for navigation frontier selection during robot exploration. Compared with conventional frontier selection without reasoning, navigation with ToT reasoning involves multi-path reasoning and backtracking when necessary, enabling globally informed decision-making with higher accuracy. Experimental results on the PASTURE and RoboTHOR benchmarks demonstrate our model's outstanding performance in L-ZSON, particularly in scenarios involving complex natural language instructions as targets.|
|**2024-10-24**|[Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data](http://arxiv.org/abs/2410.18558)|null|Vision-language models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance relative to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset of 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. With this data, we train a 2-billion-parameter VLM, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of comparable scale. This demonstrates that scaling instruction data and generating synthetic data can significantly improve the performance of open-source models.|
|**2024-10-24**|[Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics](http://arxiv.org/abs/2410.18537)|null|Traditionally, style has been considered mainly in terms of artistic elements such as colors, brushstrokes, and lighting. However, identical semantic subjects, such as people, boats, and houses, can vary greatly across different artistic traditions, indicating that style also encompasses underlying semantics. Therefore, in this study, we propose a zero-shot scheme for image variation with coordinated semantics. Specifically, our scheme transforms the image-to-image problem into an image-to-text-to-image problem. The image-to-text operation employs vision-language models (e.g., BLIP) to generate text describing the content of the input image, including the objects and their positions. Subsequently, the input style keyword is elaborated into a detailed description of that style, which is then merged with the content text using ChatGPT's reasoning capabilities. Finally, the text-to-image operation utilizes a diffusion model to generate images based on the text prompt. To enable the diffusion model to accommodate more styles, we propose a fine-tuning strategy that injects text and style constraints into cross-attention. This ensures that the output image exhibits similar semantics in the desired style. To validate the performance of the proposed scheme, we constructed a benchmark comprising images of various styles and scenes and introduced two novel metrics. Despite its simplicity, our scheme yields highly plausible results in a zero-shot manner, particularly for generating stylized images with high-fidelity semantics.|
|**2024-10-23**|[R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models](http://arxiv.org/abs/2410.17885)|**[link](https://github.com/dle666/r-cot)**|Existing large multimodal models (LMMs) struggle with mathematical geometric reasoning due to a lack of high-quality image-text paired data. Current geometric data generation approaches, which either apply preset templates to generate geometric data or use large language models (LLMs) to rephrase questions and answers (Q&A), unavoidably limit data accuracy and diversity. To synthesize higher-quality data, we propose a two-stage Reverse Chain-of-Thought (R-CoT) geometry problem generation pipeline. First, we introduce GeoChain to produce high-fidelity geometric images and corresponding descriptions highlighting the relations among geometric elements. We then design a reverse Q&A method that reasons step by step from the descriptions and generates questions in reverse from the reasoning results. Experiments demonstrate that the proposed method brings significant and consistent improvements across multiple LMM baselines, achieving new performance records in the 2B, 7B, and 8B settings. Notably, R-CoT-8B significantly outperforms previous state-of-the-art open-source mathematical models by 16.6% on MathVista and 9.2% on GeoQA, while also surpassing the closed-source model GPT-4o by an average of 13% across both datasets. Code is available at https://github.com/dle666/R-CoT.|
|**2024-10-23**|[Lightweight Neural App Control](http://arxiv.org/abs/2410.17883)|null|This paper introduces a novel mobile phone control architecture, "app agents", for efficient interaction and control across various Android apps. The proposed Lightweight Multi-modal App Control (LiMAC) takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, LiMAC introduces a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating that our small-form-factor approach outperforms fine-tuned versions of open-source VLMs such as Florence2 and Qwen2-VL. It also significantly surpasses prompt-engineering baselines that use closed-source foundation models like GPT-4o. More specifically, LiMAC improves overall action accuracy by 19% compared to fine-tuned VLMs and by 42% compared to prompt-engineering baselines.|
|**2024-10-23**|[MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models](http://arxiv.org/abs/2410.17637)|**[link](https://github.com/liuziyu77/mia-dpo)**|Visual preference alignment involves training large vision-language models (LVLMs) to predict human preferences between visual inputs. This is typically achieved using labeled datasets of chosen/rejected pairs and optimization algorithms such as direct preference optimization (DPO). Existing visual alignment methods, designed primarily for single-image scenarios, struggle to handle the complexity of multi-image tasks effectively due to the scarcity of diverse training data and the high cost of annotating chosen/rejected pairs. We present Multi-Image Augmented Direct Preference Optimization (MIA-DPO), a visual preference alignment approach that effectively handles multi-image inputs. MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid-collage or picture-in-picture formats, significantly reducing the costs associated with multi-image data annotation. Our observation reveals that the attention values of LVLMs vary considerably across different images. We use these attention values to identify and filter out rejected responses on which the model may have mistakenly focused. Our attention-based strategy for constructing chosen/rejected pairs works without relying on (i) human annotation, (ii) extra data, or (iii) external models or APIs. MIA-DPO is compatible with various architectures and outperforms existing methods on five multi-image benchmarks, achieving an average performance boost of 3.0% on LLaVA-v1.5 and 4.3% on the recent InternLM-XC2.5. Moreover, MIA-DPO has minimal effect on the model's ability to understand single images.|
|**2024-10-22**|[JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation](http://arxiv.org/abs/2410.17250)|null|Accelerating research on large multimodal models (LMMs) in non-English languages is crucial for improving user experiences across broader populations. In this paper, we introduce JMMMU (Japanese MMMU), the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks grounded in the Japanese cultural context. To facilitate comprehensive culture-aware evaluation, JMMMU features two complementary subsets: (i) a culture-agnostic (CA) subset, in which culture-independent subjects (e.g., mathematics) are selected and translated into Japanese, enabling one-to-one comparison with the English MMMU counterpart; and (ii) a culture-specific (CS) subset, comprising newly crafted subjects that reflect Japanese cultural context. Using the CA subset, we observe a performance drop for many LMMs when evaluated in Japanese, attributable purely to language variation. Using the CS subset, we reveal their inadequate understanding of Japanese culture. Furthermore, by combining both subsets, we find that some LMMs perform well on the CA subset but not on the CS subset, exposing a shallow understanding of the Japanese language that lacks cultural depth. We hope this work not only helps advance LMM performance in Japanese but also serves as a guideline for creating high-standard, culturally diverse benchmarks for multilingual LMM development. The project page is https://mmmu-japanese-benchmark.github.io/JMMMU/.|
|**2024-10-22**|[PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction](http://arxiv.org/abs/2410.17247)|**[link](https://github.com/cooperx521/pyramiddrop)**|In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the saying "a picture is worth a thousand words" suggests, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably discard crucial image information, ultimately degrading model performance. To address this challenge, we conduct an empirical study showing that all visual tokens are necessary in the shallow layers of LVLMs, while token redundancy progressively increases in the deeper layers. To this end, we propose PyramidDrop, a visual redundancy reduction strategy for LVLMs that improves their training and inference efficiency with negligible performance loss. Specifically, we partition the LVLM into several stages and drop a portion of the image tokens at the end of each stage at a pre-defined ratio, creating pyramid-like visual tokens across model layers. The dropping is based on a lightweight similarity computation with negligible time overhead. Extensive experiments demonstrate that PyramidDrop can achieve a 40% reduction in training time and a 55% reduction in inference FLOPs for LLaVA-NeXT with comparable performance. In addition, PyramidDrop can serve as a plug-and-play inference acceleration strategy that requires no training, with better performance and lower inference cost than comparable methods. We hope the insights and approach introduced by PyramidDrop will inspire future research to further investigate the role of image tokens in LVLMs.|
|**2024-10-22**|[An Eye for an AI: Evaluating GPT-4o's Visual Perception Skills and Geometric Reasoning Skills Using Computer Graphics Questions](http://arxiv.org/abs/2410.16991)|null|CG (computer graphics) is a popular field of CS (computer science), but many students find this topic difficult because it requires a substantial range of skills, such as mathematics, programming, geometric reasoning, and creativity. Over the past few years, researchers have investigated ways to harness the power of generative AI (GenAI) to improve teaching. In CS, much of the research has focused on introductory computing. A recent study evaluating the performance of a large language model (LLM), GPT-4 (text-only), on CG questions showed that GPT-4 performed poorly and relied on detailed descriptions of image content, which often required considerable insight from the user to return reasonable results. So far, no studies have investigated the abilities of large multimodal models (LMMs), or multimodal LLMs, to solve CG questions and how these abilities could be used to improve teaching. In this study, we construct two datasets of CG questions requiring varying degrees of visual perception skills and geometric reasoning skills, and evaluate the current state-of-the-art LMM, GPT-4o, on these two datasets. We find that although GPT-4o exhibits great potential in solving questions with visual information independently, major limitations remain in the accuracy and quality of the generated results. We propose several novel approaches for CG educators to incorporate GenAI into CG teaching despite these limitations. We hope our guidelines further encourage learning and engagement in CG classrooms.|
|**2024-10-22**|[MPDS: A Movie Posters Dataset for Image Generation with Diffusion Model](http://arxiv.org/abs/2410.16840)|null|Movie posters are vital for captivating audiences, conveying themes, and driving market competition in the film industry. While traditional poster design is time- and labor-intensive, intelligent generation technology can improve efficiency and enhance design quality. Despite exciting progress in image generation, current models often fall short of producing satisfactory poster results. The primary issue lies in the absence of specialized poster datasets for targeted model training. In this work, we propose a Movie Posters DataSet (MPDS), tailored for text-to-image generation models and aimed at revolutionizing poster production. Focused on posters, MPDS is, to our knowledge, the first image-text pair dataset of its kind, composed of 373k+ image-text pairs and 8k+ actor images (covering 4k+ actors). Detailed poster descriptions, such as movie titles, genres, casts, and synopses, are meticulously organized and standardized based on public movie synopses (also known as movie-synopsis prompts). To enrich poster descriptions and reduce discrepancies with movie synopses, we further leverage a large vision-language model to automatically produce vision-perceptive prompts for each poster, followed by manual correction and integration with the movie-synopsis prompts. In addition, we introduce poster-caption prompts to represent textual elements in posters, such as actor names and movie titles. For movie poster generation, we develop a multi-condition diffusion framework that takes poster prompts, poster captions, and actor images (for personalization) as inputs, yielding excellent results through the learned diffusion model. Experiments demonstrate the valuable role of our proposed MPDS dataset in advancing personalized movie poster generation. MPDS is available at https://anonymous.4open.science/r/MPDS-373k-BD3B.|
|**2024-10-21**|[DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding](http://arxiv.org/abs/2410.16472)|null|Document structure editing involves manipulating localized textual, visual, and layout components in document images based on user requests. Past works have shown that the multimodal grounding of user requests in the document image, along with accurate identification of structural components and their associated attributes, remain key challenges for this task. To address these, we introduce DocEdit-v2, a novel framework that performs end-to-end document editing by leveraging large multimodal models (LMMs). It consists of three novel components: (1) Doc2Command, which simultaneously localizes edit regions of interest (RoI) and disambiguates user edit requests into edit commands; (2) LLM-based command reformulation prompting, which tailors edit commands originally intended for specialized software into edit instructions suitable for generalist LMMs; and (3) a stage in which DocEdit-v2 processes these outputs via large multimodal models like GPT-4V and Gemini to parse the document layout, execute edits on the grounded regions of interest (RoI), and generate the edited document image. Extensive experiments on the DocEdit dataset show that DocEdit-v2 significantly outperforms strong baselines on edit command generation (2-33%), RoI bounding-box detection (12-31%), and overall document editing (1-12%) tasks.|
|**2024-10-21**|[Promoting cross-modal representations to improve multimodal foundation models for physiological signals](http://arxiv.org/abs/2410.16424)|null|Many healthcare applications are inherently multimodal, involving several physiological signals. As sensors for these signals become more common, improving machine learning methods for multimodal healthcare data is crucial. Pre-training foundation models is a promising avenue for success. However, methods for developing foundation models in healthcare are still in early exploration, and it is unclear which pre-training strategies are most effective given the diversity of physiological signals. This is partly due to challenges inherent to multimodal health data: obtaining data for many patients is difficult and expensive, there is substantial inter-subject variability, and modalities are often heterogeneously informative across downstream tasks. Here, we explore these challenges in the PhysioNet 2018 dataset. We use a masked autoencoding objective to pre-train a multimodal model, and show that the representations it learns can be linearly probed for a diverse set of downstream tasks. We hypothesize that cross-modal reconstruction objectives are important for successful multimodal training, as they encourage the model to integrate information across modalities. We demonstrate that modality dropout in the input space improves performance on downstream tasks. We also find that late-fusion models pre-trained with contrastive learning objectives are less effective across multiple tasks. Finally, we analyze the model's representations, showing that attention weights become more cross-modal and temporally aligned under our pre-training strategy. The learned embeddings also become more distributed in terms of the modalities that each unit encodes. Overall, our work demonstrates the utility of multimodal foundation models for health data, even across diverse physiological data sources. We further argue that explicit methods for inducing cross-modality may enhance multimodal pre-training strategies.|
|**2024-10-21**|[VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use](http://arxiv.org/abs/2410.16400)|null|While vision-language models (VLMs) have demonstrated remarkable performance across various tasks combining textual and visual information, they continue to struggle with fine-grained visual perception tasks that require detailed pixel-level analysis. Effectively eliciting comprehensive reasoning from VLMs on such intricate visual elements remains an open challenge. In this paper, we present VipAct, an agent framework that enhances VLMs by integrating multi-agent collaboration and vision expert models, enabling more precise visual understanding and comprehensive reasoning. VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks such as image captioning, and vision expert models that provide high-precision perceptual information. This multi-agent approach allows VLMs to better perform fine-grained visual perception tasks by synergizing planning, reasoning, and tool use. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements over state-of-the-art baselines across all tasks. Furthermore, comprehensive ablation studies reveal the critical role of multi-agent collaboration in eliciting more detailed System-2 reasoning and highlight the importance of image input for task planning. Additionally, our error analysis identifies patterns in VLMs' inherent limitations in visual perception, providing insights into potential future improvements. VipAct offers a flexible and extensible framework, paving the way for more advanced visual perception systems across various real-world applications.|
|**2024-10-21**|[Improve Vision Language Model Chain-of-thought Reasoning](http://arxiv.org/abs/2410.16198)|**[link](https://github.com/riflezhang/llava-reasoner-dpo)**|视觉语言模型 (VLM) 中的思维链 (CoT) 推理对于提高模型的可解释性和可信度至关重要。然而,目前的训练方法缺乏强大的 CoT 推理数据,依赖于以简短注释和少量推理过程为主的数据集。在这项工作中,我们发现,在简短答案上训练 VLM 并不能很好地泛化到需要更详细回答的推理任务。为了解决这个问题,我们提出了一种双重方法。首先,我们从 GPT-4o 模型中提取推理过程,以丰富训练数据并微调 VLM,从而提高其 CoT 性能。其次,我们应用强化学习来进一步校准推理质量。具体来说,我们通过将模型生成的推理链的预测结果与带注释的简短答案进行比较,构建正(正确)和负(错误)样本对。利用这些成对数据,我们应用直接偏好优化算法来改进模型的推理能力。我们的实验表明,在基准数据集上,CoT 推理得到了显著改进,并且对直接答案预测的泛化能力也更强。这项工作强调了在训练中纳入详细推理过程以及利用强化学习来增强 VLM 推理能力的重要性。||
|**2024-10-21**|[Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models](http://arxiv.org/abs/2410.16163)|**[link](https://github.com/jefferyzhan/griffon)**|大型多模态模型 (LMM) 在基于自回归建模的各种视觉语言和以视觉为中心的任务中取得了重大突破。然而,这些模型通常侧重于以视觉为中心的任务,例如视觉定位和区域描述,或者视觉语言任务,例如图像描述和多场景视觉问答 (VQA)。目前还没有哪个 LMM 能够像自然语言处理领域的大型语言模型那样,将这两种类型的任务全面统一在一个模型中。此外,即使有丰富的多任务指令遵循数据,直接堆叠这些数据来扩展通用能力仍然具有挑战性。为了解决这些问题,我们引入了一个名为 CCMD-8M 的新型多维度策划和整合的多模态数据集,它通过多级数据策划和多任务整合克服了统一以视觉为中心的任务和视觉语言任务的数据障碍。更重要的是,我们提出了 Griffon-G,这是一个通用的 LMM,它在单个端到端范式中同时解决了以视觉为中心的任务和视觉语言任务。Griffon-G 解决了在这些任务的联合优化过程中遇到的训练崩溃问题,实现了更好的训练效率。跨多模态基准、通用视觉问答 (VQA) 任务、场景文本中心 VQA 任务、文档相关 VQA 任务、指称表达式理解和目标检测的评估表明,Griffon-G 优于先进的 LMM,并在复杂的以视觉为中心的任务中达到了专家级的性能。||
|**2024-10-21**|[Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning](http://arxiv.org/abs/2410.16162)|null|视觉语言模型 (VLM) 在各种下游任务中表现出了令人印象深刻的性能。然而,尽管空间推理在涉及导航和与物理环境交互的任务中起着至关重要的作用,但VLM在这方面的能力仍然有限。具体来说,这些任务中的大部分空间推理发生在二维 (2D) 环境中,我们的评估表明,最先进的 VLM 经常对复合空间推理问题生成不合理和错误的响应,包括人类一眼就能轻松解决的简单寻路任务。为了解决这个问题,我们探索了一种有效的方法,通过训练模型的基本空间能力来增强 VLM 中的 2D 空间推理能力。我们首先将 2D 空间推理的关键组成部分分解为:方向理解、距离估计和定位。我们的核心假设是,掌握这些基本的空间能力可以显著提高模型在需要高级空间理解和组合问题解决能力的复合空间任务中的性能。为了验证这一假设,我们引入了 Sparkle,这是一个通过合成数据生成和目标监督对这三种基本空间能力进行微调的 VLM 框架,以便为每种能力形成一个指令数据集。我们的实验表明,使用 Sparkle 微调的 VLM 不仅在基本任务本身中取得了显著的性能提升,而且还可以泛化到复合和分布外的空间推理任务中(例如,在最短路径问题上的性能从 13.5% 提高到 40.0%)。这些发现强调了掌握基本空间能力在增强复合空间问题解决能力方面的有效性,为提高 VLM 的空间推理能力提供了见解。||
|**2024-10-18**|[NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples](http://arxiv.org/abs/2410.14669)|null|视觉语言模型(VLM)在最近的视觉问答(VQA)基准测试中取得了重大进展,这些基准测试评估了复杂的视觉语言推理能力。然而,这些模型真的有效吗?在这项工作中,我们发现VLM仍然难以处理人类可以轻松回答的自然图像和问题,我们将其称为自然对抗样本。我们还发现,使用 CLIP 和 ChatGPT 等现成模型从自然图像文本语料库中生成这些VQA样本非常容易。我们提出了一种半自动方法来收集一个新的基准测试集NaturalBench,该测试集包含10,000个经过人工验证的VQA样本,用于可靠地评估VLM。至关重要的是,我们采用以视觉为中心的设计,将每个问题与两张产生不同答案的图像配对,防止模型在不使用图像的情况下盲目作答。这使得NaturalBench比之前可以利用常识先验知识解决的基准测试更具挑战性。我们在NaturalBench上评估了53个最先进的VLM,结果表明,LLaVA-OneVision、Cambrian-1、Llama3.2-Vision、Molmo、Qwen2-VL,甚至GPT-4o等模型都比人类表现(超过90%)落后50%-70%。我们从两个角度分析了NaturalBench为何难以处理:(1)组合性:解决NaturalBench需要多种视觉语言技能,包括理解属性绑定、对象关系以及逻辑和计数等高级推理。为此,与先前的工作使用每个样本一个标签不同,我们为每个NaturalBench样本标记了1到8个技能标签,以便进行细粒度评估。(2)偏差:NaturalBench揭示了VLM中存在的严重偏差,因为模型通常会选择相同的答案,而不管图像如何。最后,我们将基准测试集构建方法应用于不同的数据源,包括长标题(超过100字)和中文、印地语等非英语语言,突出了其对VLM进行动态评估的潜力。||
|**2024-10-18**|[Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension](http://arxiv.org/abs/2410.14332)|**[link](https://github.com/deepglint/croc)**|近年来,大型语言模型(LLM)的进步推动了大型多模态模型(LMM)的发展。然而,现有的研究主要集中在调整语言和图像指令上,而忽略了模型学习联合处理文本和视觉模态的关键预训练阶段。在本文中,我们提出了一种新的LMM预训练范式,通过引入一种新颖的跨模态理解阶段来增强LLM的视觉理解能力。具体来说,我们设计了一个动态可学习的提示标记池,并采用匈牙利算法用最相关的提示标记替换部分原始视觉标记。然后,我们将视觉标记概念化为LLM的“外语”,并提出了一种混合注意力机制,结合双向视觉注意力和单向文本注意力,以全面增强对视觉标记的理解。同时,我们整合了详细的图像描述生成任务,利用丰富的描述来进一步促进LLM理解视觉语义信息。在150万条公开数据上进行预训练后,我们提出了一个名为Croc的新基础模型。实验结果表明,Croc在大型视觉语言基准测试中取得了新的最先进性能。为了支持可复现性并促进进一步的研究,我们在https://github.com/deepglint/Croc 上发布了训练代码和预训练模型权重。||
|**2024-10-18**|[E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model](http://arxiv.org/abs/2410.14200)|null|三维医学视觉语言模型的开发在疾病诊断和患者治疗方面具有巨大潜力。然而,与二维医学图像相比,三维医学图像(如CT扫描)面临着训练数据有限和维度高等挑战,这严重限制了三维医学视觉语言模型的进展。为了解决这些问题,我们收集了大量未标记的三维CT数据,并利用自监督学习构建了一个用于提取三维视觉特征的三维视觉基础模型。然后,我们应用三维空间卷积来聚合和投影高级图像特征,在降低计算复杂度的同时保留空间信息。我们还基于BIMCV-R和CT-RATE构建了两个指令微调数据集,用于微调三维视觉语言模型。我们的模型在报告生成、视觉问答和疾病诊断方面表现出优于现有方法的性能。代码和数据将很快公开发布。||
|**2024-10-18**|[LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs](http://arxiv.org/abs/2410.14182)|null|实验室事故对人类生命和财产构成重大风险,凸显了健全安全规程的重要性。尽管安全培训有所进步,但实验室人员仍可能在不知不觉中进行不安全的操作。随着各领域(包括实验室环境)越来越依赖大型语言模型 (LLM) 进行指导,人们越来越担心LLM在关键安全相关决策中的可靠性。与受过训练的人类研究人员不同,LLM缺乏正式的实验室安全教育,这引发了人们对其提供安全和准确指导的能力的质疑。现有关于LLM可信度的研究主要集中在道德合规性、真实性和公平性等问题上,但未能完全涵盖安全关键型现实应用,例如实验室安全。为了弥补这一差距,我们提出了实验室安全基准(LabSafety Bench),这是一个基于与职业安全与健康管理局 (OSHA) 协议相一致的新分类法的综合评估框架。该基准测试包括由人类专家验证的765道多项选择题,用于评估LLM和视觉语言模型 (VLM) 在实验室安全环境中的性能。我们的评估表明,虽然GPT-4o的表现优于人类参与者,但它仍然容易出现严重错误,这凸显了在安全关键型环境中依赖LLM的风险。我们的研究结果强调,需要专门的基准来准确评估LLM在现实安全应用中的可信度。||
|**2024-10-18**|[ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom](http://arxiv.org/abs/2410.14138)|null|大型视觉语言模型 (LVLM) 在视觉理解任务方面取得了重大进展。然而,它们在视觉推理任务中经常优先考虑语言知识而不是图像信息,从而导致性能下降。为了解决这个问题,我们首先确定了现有解决方案的缺点(即视觉描述不足且不相关,以及多模态能力有限)。然后,我们将视觉推理过程分解为两个阶段:视觉感知(即视力)和文本推理(即智慧),并介绍了一种名为 ProReason 的新型视觉推理框架。该框架具有多轮主动感知和解耦的视觉推理能力。简而言之,给定一个多模态问题,ProReason 会迭代主动信息收集和推理,直到可以用必要且充分的视觉描述得出答案。值得注意的是,能力的解耦允许无缝集成现有的大型语言模型 (LLM) 来弥补 LVLM 的推理缺陷。我们广泛的实验表明,ProReason 在开源和闭源模型的各种基准测试中都优于现有的多步推理框架和被动对等方法。此外,在 LLM 的帮助下,ProReason 在 MMMU 基准测试中实现了高达 15% 的性能提升。我们对现有解决方案的见解以及对 LLM 可行集成的解耦视角,为未来的视觉推理技术研究(尤其是 LLM 辅助技术)提供了启示。||
|**2024-10-17**|[Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers](http://arxiv.org/abs/2410.14072)|null|近年来,视觉语言模型 (VLM) 的进步扩展了其在现实世界应用中的潜力,使这些模型能够对图像进行复杂的推理。在像 LLaVA 这样广泛使用的完全自回归的基于 Transformer 的模型中,投影的视觉标记被添加到文本标记之前。通常,视觉标记比提示标记多得多,导致训练和推理过程中的计算开销增加。在本文中,我们提出了视觉压缩标记寄存器 (Victor),这是一种通过将视觉标记汇总到一组较小的寄存器标记来减少视觉标记数量的方法。Victor 在视觉标记之后添加了一些可学习的寄存器标记,并使用 VLM 语言塔中的前几层将视觉信息汇总到这些寄存器中。在这几层之后,所有视觉标记都将被丢弃,从而显着提高了训练和推理的计算效率。值得注意的是,我们的方法易于实现,并且只需要少量新的可训练参数,对模型性能的影响最小。在我们的实验中,Victor 仅使用 8 个视觉寄存器(约占原始标记的 1%),就将准确率下降控制在 4% 以内,同时将总训练时间减少了 43%,并将推理吞吐量提高了 3.3 倍。||
|**2024-10-17**|[Reproducibility study of "LICO: Explainable Models with Language-Image Consistency"](http://arxiv.org/abs/2410.13989)|**[link](https://github.com/robertdvdk/lico-fact)**|机器学习领域日益严重的复现性危机要求我们仔细审查研究结果。本文调查了 Lei 等人 (2023) 提出的 LICO 方法,该方法旨在增强事后可解释性技术并提高图像分类性能。LICO 利用来自视觉语言模型的自然语言监督来丰富特征表示并指导学习过程。我们进行了一项全面的可重复性研究,采用了 (Wide) ResNets 和已建立的可解释性方法,如 Grad-CAM 和 RISE。我们基本上无法复现作者的结果。特别是,我们没有发现 LICO 始终能够提高分类性能或改进可解释性的定量和定性指标。因此,我们的研究结果强调了在可解释性研究中进行严格评估和透明报告的重要性。||
|**2024-10-17**|[Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations](http://arxiv.org/abs/2410.13976)|null|大型视觉语言模型 (LVLM),例如 LLaVA,已经展示出作为通用聊天机器人的强大能力,能够就提供的输入图像进行对话。然而,它们的响应会受到训练数据集中存在的社会偏见的影响,导致模型在处理描绘不同人群图像时产生不希望的差异。在这项工作中,我们为 LVLM 提出了一种新的去偏见框架,通过在文本生成过程中直接消融偏见属性,以避免生成与受保护属性相关的文本,甚至在内部表示它们。我们的方法不需要训练,只需要相对少量的代表性偏见输出(约 1000 个样本)。我们的实验表明,我们不仅可以最大限度地降低 LVLM 生成与受保护属性相关的文本的倾向,而且甚至可以使用合成数据来指导消融,同时保持在真实数据(如 COCO)上的字幕性能。此外,我们发现,去偏 LVLM 的结果生成表现出与基线偏见模型相似的准确性,表明可以在不牺牲模型性能的情况下实现去偏效果。||
|**2024-10-17**|[Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation](http://arxiv.org/abs/2410.13848)|**[link](https://github.com/deepseek-ai/janus)**|在本文中,我们介绍了 Janus,这是一个统一了多模态理解和生成的自动回归框架。之前的研究通常依赖于单一视觉编码器来完成这两项任务,例如 Chameleon。然而,由于多模态理解和生成所需的信息粒度不同,这种方法会导致性能欠佳,尤其是在多模态理解方面。为了解决这个问题,我们将视觉编码分离成独立的路径,同时仍然利用单个统一的 Transformer 架构进行处理。这种分离不仅缓解了视觉编码器在理解和生成中角色之间的冲突,还增强了框架的灵活性。例如,多模态理解和生成组件都可以独立选择最合适的编码方法。实验表明,Janus 优于之前的统一模型,并且达到或超过了特定任务模型的性能。Janus 的简洁性、高灵活性和有效性使其成为下一代统一多模态模型的有力候选者。||
|**2024-10-17**|[VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks](http://arxiv.org/abs/2410.13666)|**[link](https://github.com/shailaja183/vl-glue)**|从异构输入(如图像、文本和音频)中推导出推理是人类执行日常任务的一项重要技能。对于开发先进的人工智能 (AI) 系统来说,类似的能力也是非常需要的。虽然最先进的模型在各种计算机视觉和自然语言处理任务上正在迅速缩小与人类水平性能的差距,但它们在解决需要对视觉和文本模态进行联合推理的任务时仍然很吃力。受 GLUE(Wang 等人,2018 年)的启发,GLUE 是一个用于自然语言理解的多任务基准测试,我们在本文中提出了 VL-GLUE。VL-GLUE 由跨越七个不同任务的超过 100k 个样本组成,这些任务的核心都需要视觉语言推理。此外,我们的基准测试包含了多样化的图像类型(从合成渲染的图形、日常场景到图表和复杂图表),并包含了广泛的特定领域文本(从烹饪、政治、体育到高中课程),证明了现实世界中对多模态理解的需求。我们表明,这个基准测试对于现有的大规模视觉语言模型来说相当具有挑战性,并鼓励开发具有鲁棒视觉语言推理能力的系统。||
|**2024-10-17**|[H2OVL-Mississippi Vision Language Models Technical Report](http://arxiv.org/abs/2410.13611)|null|由于能够在消费者硬件上高效运行以处理企业商业文档和图像,体积更小的视觉语言模型 (VLM) 对于注重隐私的设备上应用程序变得越来越重要。这些模型需要强大的语言理解和视觉能力来增强人机交互。为了满足这一需求,我们推出了 H2OVL-Mississippi,这是一对小型 VLM,使用 8 个 H100 GPU,在 240 小时的计算时间内,利用 3700 万个图文对进行训练。H2OVL-Mississippi-0.8B 是一款参数量为 8 亿的微型模型,专注于文本识别,在 OCRBench 的文本识别部分实现了最先进的性能,并在该领域超越了许多更大的模型。此外,我们还发布了 H2OVL-Mississippi-2B,这是一个包含 20 亿个参数的通用模型,在各种学术基准测试中均表现出极具竞争力的指标。这两个模型都建立在我们之前使用 H2O-Danube 语言模型的工作基础之上,将其功能扩展到视觉领域。我们将它们在 Apache 2.0 许可下发布,使所有人都可以使用 VLM,从而使文档 AI 和视觉 LLM 民主化。||
|**2024-10-17**|[GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models](http://arxiv.org/abs/2410.13510)|null|几何问题解决需要高级推理能力来处理多模态输入并有效地利用数学知识。视觉语言模型(VLM)在各种多模态任务中取得了重大进展。然而,它们仍然难以解决几何问题,并且由于无法执行预训练期间未见过的数学运算(例如计算任意角度的余弦)以及难以正确应用相关几何公式而受到很大限制。为了克服这些挑战,我们提出了 GeoCoder,它利用模块化代码微调来使用预定义的几何函数库生成和执行代码。通过执行代码,我们实现了准确和确定的计算,与自回归标记预测的随机性形成对比,而函数库最大限度地减少了公式使用中的错误。我们还提出了 GeoCoder 的多模态检索增强变体,名为 RAG-GeoCoder,它结合了一个非参数内存模块来从几何库中检索函数,从而减少对参数内存的依赖。我们的模块化代码微调方法增强了 VLM 的几何推理能力,与其他微调方法相比,在 GeomVerse 数据集上的各种问题复杂性方面平均提高了 16% 以上。||
|**2024-10-17**|[Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR](http://arxiv.org/abs/2410.13445)|null|由于缺乏标注的训练数据,低资源语言的自动语音识别 (ASR) 仍然是一个挑战。参数高效的微调和纯文本自适应是两种常用的方法,用于解决这种低资源环境下的问题。在这项工作中,我们研究了如何使用像 SeamlessM4T 这样的多语言多模态模型有效地结合这些技术。多模态模型能够通过纯文本自适应利用未标注的文本,并进一步进行参数高效的 ASR 微调,从而提高 ASR 性能。我们还展示了从高资源语言进行跨语言迁移,在没有任何标注语音的零样本设置中,相对于基线实现了高达 17% 的词错误率 (WER) 降低。||
|**2024-10-17**|[Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding](http://arxiv.org/abs/2410.13321)|null|大型视觉语言模型 (LVLM) 在根据视觉输入生成详细且连贯的响应方面表现出令人印象深刻的能力。然而,由于过度依赖语言先验,它们容易产生幻觉。为了解决这个问题,我们研究了 LVLM 中的语言先验,并得出两个关键观察结果:(1) 即使在预测与图像相关的词性 (POS) 相关的标记时,随着标记序列的增长,模型越来越依赖语言先验,从而放大了幻觉。(2) 直接校准 LVLM 的输出分布以减轻语言先验的方法可能会导致文本质量下降,甚至加剧幻觉。基于这些发现,我们提出了一种新方法,即摘要引导解码 (SGD)。该方法通过摘要减少文本上下文,自然地鼓励模型更多地关注图像信息,同时仅控制与图像相关的词性标记以保持文本质量。通过实验,我们证明了 SGD 在物体幻觉基准测试中实现了最先进的性能。此外,在精确率和召回率的权衡方面,SGD 在现有方法中实现了帕累托最优。最后,我们观察到,尽管现有方法难以在减少物体幻觉和保持文本质量之间取得平衡,但 SGD 在应对这一挑战方面表现出稳健性。||
|**2024-10-17**|[Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead](http://arxiv.org/abs/2410.13146)|**[link](https://github.com/kuleens/vlmbiaseval)**|随着视觉语言模型 (VLM) 得到广泛应用,其公平性仍然缺乏探索。在本文中,我们分析了五个模型和六个数据集的人口统计学偏差。我们发现,像 UTKFace 和 CelebA 这样的肖像数据集是检测偏差的最佳工具,可以发现 LLaVa 和 CLIP 模型之间在性能和公平性方面的差距。然而,像 PATA、VLStereoSet 这样的场景数据集由于其构建方式,无法成为有效的偏差基准。至于像 VisoGender 这样的基于代词的数据集,我们收到了混合信号,因为只有一部分数据子集对提供见解有用。为了缓解这个问题,我们引入了更难版本的 VisoGender,作为更严格的评估标准。基于这些结果,我们呼吁建立更有效、设计更仔细的数据集,以确保 VLM 的公平性和可靠性。||
|**2024-10-16**|[Sensitivity of Generative VLMs to Semantically and Lexically Altered Prompts](http://arxiv.org/abs/2410.13030)|null|尽管用于生成式视觉语言模型 (VLM) 的提示调整技术大量涌现,但这些模型对提示中的词汇和语义变化的敏感程度仍不清楚。在本文中,我们使用 SugarCrepe++ 数据集评估了生成式 VLM 理解文本中词汇和语义变化的能力。我们分析了 VLM 对提示中词汇变化的敏感性,而这些变化不对应于语义变化。我们的研究结果表明,生成式 VLM 对此类更改高度敏感。此外,我们还发现,这种脆弱性会影响旨在实现其输出一致性的技术性能。||
|**2024-10-16**|[Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models](http://arxiv.org/abs/2410.13002)|null|端到端学习将感官输入直接映射到动作,为复杂的机器人任务创建高度集成和高效的策略。然而,此类模型难以有效训练,并且通常难以泛化到其训练场景之外,从而限制了对新环境、任务和概念的适应性。在这项工作中,我们研究了在看不见的文本指令和视觉分布变化下,基于视觉的控制策略实现稳健的闭环性能所需的最小数据要求和架构适应。为此,我们设计了具有不同数据表示丰富度的数据集,通过利用多模态基础模型编码器来改进特征提取协议,并评估不同策略网络头的适用性。我们的研究结果在 Flex(Fly-lexically)中得到综合,这是一个使用预训练的视觉语言模型(VLM)作为冻结的逐块特征提取器的框架,生成整合语义和视觉信息的具有空间感知的嵌入。这些丰富的特征构成了训练高度稳健的下游策略的基础,这些策略能够跨平台、环境和文本指定的任务进行泛化。我们展示了这种方法在四旋翼飞行器飞往目标任务中的有效性,其中通过行为克隆在小型模拟数据集上训练的代理成功地泛化到现实世界场景,处理不同的新目标和命令公式。||
|**2024-10-16**|[The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio](http://arxiv.org/abs/2410.12787)|null|近年来,大型多模态模型 (LMM) 的进步显著提高了其在各种任务中的性能,并且人们一直在努力进一步整合视频和音频等其他模态。然而,大多数现有的 LMM 仍然容易出现幻觉,即事实上的多模态输入与生成的文本输出之间存在差异,这限制了它们在各种现实场景中的适用性。本文首次系统地研究了涉及三种最常见模态(语言、视觉和音频)的 LMM 中的幻觉问题。我们的研究揭示了导致幻觉的两个关键因素:过度依赖单模态先验和虚假的模态间相关性。为了应对这些挑战,我们引入了多模态诅咒 (CMM) 基准测试,该基准全面评估了 LMM 中的幻觉,并详细分析了其根本问题。我们的研究结果突出了关键的漏洞,包括模态整合的不平衡和训练数据的偏差,强调了平衡跨模态学习和增强幻觉缓解策略的必要性。根据我们的观察和发现,我们提出了一些潜在的研究方向,可以提高 LMM 的可靠性。||
|**2024-10-15**|[Unveiling the Mystery of Visual Attributes of Concrete and Abstract Concepts: Variability, Nearest Neighbors, and Challenging Categories](http://arxiv.org/abs/2410.11657)|**[link](https://github.com/TarunTater/AbstractConceptsInImages)**|一个概念的视觉表征会因其含义和出现语境的不同而发生显著变化,这对视觉和多模态模型都提出了多重挑战。我们的研究侧重于具象性,这是一个经过充分研究的词汇语义变量,并以此作为案例研究来检验视觉表征的可变性。我们依赖于从两个不同数据集(Bing 和 YFCC)中提取的与大约 1000 个抽象和具体概念相关的图像。我们的目标是:(i) 评估概念描述中的视觉多样性是否可以可靠地区分具体概念和抽象概念;(ii) 通过最近邻分析来分析同一概念的多幅图像的视觉特征的可变性;(iii) 通过对图像进行分类和注释来识别导致这种可变性的挑战性因素。我们的研究结果表明,对于抽象概念和具体概念图像的分类,颜色和纹理等基本视觉特征的组合比视觉Transformer(ViT)等更复杂模型提取的特征更有效。然而,ViT 在最近邻分析中表现出更好的性能,这强调了在通过文本以外的模态分析概念变量时,需要谨慎选择视觉特征。||
|**2024-10-15**|[On-the-fly Modulation for Balanced Multimodal Learning](http://arxiv.org/abs/2410.11582)|**[link](https://github.com/gewu-lab/bml_tpami2024)**|多模态学习旨在通过整合来自不同模态的信息来提升模型性能。然而,由于广泛使用的联合训练策略对所有模态采用统一目标,导致单模态表征不平衡和欠优化,因此多模态学习的潜力并未得到充分发挥。具体来说,我们指出通常存在具有更多判别信息的模态,例如踢足球的视觉和刮风的听觉。它们可能在联合训练过程中占据主导地位,导致其他模态严重欠优化。为了缓解这个问题,我们首先从优化的前馈和反向传播阶段分析了欠优化现象。然后,提出了动态预测调制(OPM)和动态梯度调制(OGM)策略,通过在训练过程中监控模态间的判别差异来调节每个模态的优化。具体而言,OPM在前馈阶段通过动态概率丢弃主导模态的特征来削弱其影响,而OGM在反向传播阶段减轻其梯度。在实验中,我们的方法在各种多模态任务中都表现出相当大的改进。这些简单而有效的策略不仅增强了普通和面向任务的多模态模型的性能,而且在更复杂的多模态任务中也表现出色,展示了它们的有效性和灵活性。源代码可在\url{https://github.com/GeWu-Lab/BML_TPAMI2024}获取。||
|**2024-10-15**|[Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference](http://arxiv.org/abs/2410.11403)|null|多模态变分自编码器 (VAE) 旨在通过整合来自不同数据模态的信息来捕获共享的潜在表示。一个重大挑战是在不需要为所有可能的模态组合训练不切实际数量 (2^M) 个推理网络的情况下,准确地从任何模态子集推断表示。基于混合的模型通过仅需要与模态数量一样多的推理模型来简化这一过程,从而聚合单模态推理。然而,当模态缺失时,它们会遭受信息丢失的困扰。基于对齐的 VAE 通过最小化 Kullback-Leibler (KL) 散度将单模态推理模型与多模态模型对齐来解决这个问题,但由于摊销差距导致推理精度下降,因此面临着问题。为了解决这些问题,我们在多模态 VAE 框架内引入了多模态迭代摊销推理,这是一种迭代细化机制。该方法通过使用所有可用模态迭代地细化多模态推理,从而克服了缺失模态造成的信息丢失,并最大程度地减少了摊销差距。通过将单模态推理与这种细化的多模态后验对齐,我们实现了单模态推理,该推理有效地结合了多模态信息,同时在推理过程中仅需要单模态输入。在基准数据集上的实验表明,我们的方法提高了推理性能,更高的线性分类精度和竞争性余弦相似性证明了这一点,并增强了跨模态生成,FID 得分较低表明了这一点。这表明我们的方法增强了从单模态输入推断的表示。||
|**2024-10-15**|[LargePiG: Your Large Language Model is Secretly a Pointer Generator](http://arxiv.org/abs/2410.11366)|null|最近关于查询生成的研究集中在使用大型语言模型(LLM)上,虽然LLM带来了最先进的性能,但也引入了生成查询中的幻觉问题。在这项工作中,我们将相关性幻觉和事实性幻觉作为一种新的类型学来描述基于LLM的查询生成带来的幻觉问题。我们提出了一种有效的方法来分离LLM生成查询中的内容和形式,该方法保留了从输入中提取和整合的事实知识,并利用LLM强大的语言能力编译了句法结构,包括功能词。具体来说,我们介绍了一种与模型无关且无需训练的方法,将大型语言模型转换为指针生成器(LargePiG),其中指针注意力分布利用了LLM固有的注意力权重,并且复制概率源自模型高层和最后一层的词汇分布差异。为了验证LargePiG的有效性,我们构建了两个数据集,用于评估查询生成中的幻觉问题,涵盖了文档和视频场景。对各种LLM的实证研究表明,LargePiG在两个数据集上都具有优越性。额外的实验还验证了LargePiG可以减少大型视觉语言模型中的幻觉,并提高基于文档的问答和事实性评估任务的准确性。||
|**2024-10-15**|[CLIP-DFGS: A Hard Sample Mining Method for CLIP in Generalizable Person Re-Identification](http://arxiv.org/abs/2410.11255)|null|近年来,像CLIP这样的预训练视觉语言模型的进步,已经显示出其在行人重识别(ReID)应用中的潜力。然而,它们在通用行人重识别任务中的性能仍然欠佳。CLIP预训练中使用的大规模多样化的图像-文本对可能导致某些细粒度特征的缺失或不足。针对这些挑战,我们提出了一种名为DFGS(深度优先图采样器)的困难样本挖掘方法,该方法基于深度优先搜索,旨在提供足够具有挑战性的样本,以增强CLIP提取细粒度特征的能力。DFGS可以应用于CLIP中的图像编码器和文本编码器。通过利用CLIP强大的跨模态学习能力,我们的目标是应用DFGS方法提取具有挑战性的样本,并形成具有高判别难度的mini-batches,为图像模型提供更有效、更具挑战性的难以区分的样本,从而增强模型区分个体的能力。我们的结果表明,与其他方法相比,DFGS有显著的改进,证实了DFGS在提供具有挑战性的样本以增强CLIP在通用行人重识别中的性能方面的有效性。||
|**2024-10-14**|[Locality Alignment Improves Vision-Language Models](http://arxiv.org/abs/2410.11087)|null|近年来,视觉语言模型 (VLM) 得到越来越多的应用,但许多模型仍然会犯基本的空间推理错误。我们假设这是由于 VLM 采用了预训练的视觉骨干网络,特别是使用图像级监督和最小归纳偏差训练的视觉变换器 (ViT)。此类模型可能无法编码图像中每个位置的类别内容,我们的目标是通过确保视觉骨干网络有效捕获局部和全局图像语义来解决此问题。我们的主要见解是,我们不需要新的监督来学习这种能力——预训练模型包含大量的局部语义知识,我们可以提取这些知识并将其用于可扩展的自监督。我们为 ViT 提出了一种新的高效的训练后阶段,称为局部性对齐,以及一种新的微调程序,称为 MaskEmbed,它使用掩蔽重建损失来学习每个图像块的语义贡献。我们首先使用仅视觉基准评估局部性对齐,发现它提高了模型在块级语义分割任务中的性能,特别是对于使用图像-标题对(例如,CLIP 和 SigLIP)训练的强骨干网络。然后,我们训练了一系列使用和不使用局部性对齐的 VLM,并表明局部性对齐的骨干网络提高了各种基准测试的性能,特别是那些涉及空间理解的基准测试(例如,RefCOCO、OCID-Ref、TallyQA、VSR、AI2D)。总的来说,我们证明了我们可以通过局部性对齐阶段有效地学习局部语义提取,并且此过程补充了使用现成视觉骨干网络的现有 VLM 训练方法。||
|**2024-10-14**|[Towards Foundation Models for 3D Vision: How Close Are We?](http://arxiv.org/abs/2410.10799)|null|构建用于 3D 视觉的基础模型是一个尚未解决的复杂挑战。为了实现这一目标,重要的是了解当前模型的 3D 推理能力,并确定这些模型与人类之间的差距。因此,我们构建了一个新的 3D 视觉理解基准,该基准涵盖了视觉问答 (VQA) 格式的基本 3D 视觉任务。我们评估了最先进的视觉语言模型 (VLM)、专门模型和人类受试者。我们的结果表明,VLM 的性能普遍较差,而专门模型虽然准确但不稳健,在几何扰动下会失败。相比之下,人类视觉仍然是最可靠的 3D 视觉系统。我们进一步证明,与经典计算机视觉方法相比,神经网络与人类 3D 视觉机制的一致性更高,并且基于 Transformer 的网络(如 ViT)比 CNN 与人类 3D 视觉机制的一致性更高。我们希望我们的研究能够有利于未来 3D 视觉基础模型的开发。||
|**2024-10-14**|[VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents](http://arxiv.org/abs/2410.10594)|**[link](https://github.com/openbmb/visrag)**|检索增强生成(RAG)是一种有效的技术,它使大型语言模型(LLM)能够利用外部知识源进行生成。然而,当前的RAG系统完全基于文本,无法利用在现实世界多模态文档中起着至关重要作用的视觉信息,如布局和图像。在本文中,我们介绍了VisRAG,它通过建立一个基于视觉语言模型(VLM)的RAG流程来解决这个问题。在这个流程中,不是先解析文档以获取文本,而是使用VLM将文档作为图像直接嵌入,然后检索以增强VLM的生成。与传统的基于文本的RAG相比,VisRAG最大限度地保留和利用了原始文档中的数据信息,消除了解析过程中引入的信息损失。我们收集了开源数据和合成数据来训练VisRAG中的检索器,并探索了各种生成方法。实验表明,VisRAG在检索和生成阶段都优于传统的RAG,相较于传统的基于文本的RAG流程,实现了25%-39%的端到端性能提升。进一步的分析表明,VisRAG可以有效地利用训练数据并表现出强大的泛化能力,这使其成为多模态文档上RAG的一个很有前景的解决方案。我们的代码和数据可在https://github.com/openbmb/visrag 获取。||
|**2024-10-14**|[LG-CAV: Train Any Concept Activation Vector with Language Guidance](http://arxiv.org/abs/2410.10308)|null|概念激活向量(CAV)通过将模型预测优雅地归因于特定概念,在可解释人工智能领域引起了广泛的研究兴趣。然而,CAV 的训练通常需要大量高质量的图像,这些图像的整理成本很高,因此仅限于一组预定义的概念。为了解决这个问题,我们提出了语言引导的 CAV(LG-CAV),以利用某些预训练的视觉语言模型(例如 CLIP)中丰富的概念知识。该方法允许在没有标记数据的情况下训练任何 CAV,方法是利用相应的概念描述作为指导。为了弥合视觉语言模型与目标模型之间的差距,我们使用视觉语言模型计算了一组通用图像(探测图像)上概念描述的激活值,并利用它们作为语言指导来训练 LG-CAV。此外,在训练了与目标模型中所有预测类别相关的高质量 LG-CAV 后,我们提出了激活样本重新加权(ASR)作为一种模型校正技术,以反过来提高目标模型的性能。在四个数据集上跨越九种架构的实验表明,LG-CAV 在给定任何概念的情况下,相较于以前的 CAV 方法实现了显著的质量提升,并且我们的模型校正方法与现有的基于概念的方法相比,实现了最先进的性能。我们的代码可在 https://github.com/hqhQAQ/LG-CAV 获取。||
|**2024-10-14**|[Saliency Guided Optimization of Diffusion Latents](http://arxiv.org/abs/2410.10257)|null|随着扩散模型的快速发展,从文本提示生成高质量图像已不再是挑战。文本到图像生成的重点是如何优化生成结果,使其更好地与人类意图或提示保持一致。现有的优化方法通常将整个图像视为一个整体,进行全局优化。这些方法忽略了一个事实:当人类观察图像时,视觉系统会自然地将注意力集中在显著区域,而忽略不太重要或不显著的区域。也就是说,人类很可能忽略对非显著区域的优化。因此,尽管在大型多模态模型的指导下进行了模型微调,但现有进行全局优化的方法得到的结果并不理想。为了有效且高效地解决这种对齐挑战,我们提出了显著性引导的扩散潜在空间优化方法(SGOOL)。我们首先使用显著性检测器来模拟人类视觉注意力系统,并标记出显著区域。为了避免重新训练额外的模型,我们的方法直接优化扩散模型的潜在空间。此外,SGOOL 利用了可逆扩散过程,并具有恒定内存实现的优点。因此,我们的方法成为了一种参数高效且即插即用的微调方法。我们使用多种指标和人工评估进行了大量实验。实验结果表明,SGOOL 在图像质量和提示对齐方面具有优越性。||
|**2024-10-11**|[SegGrasp: Zero-Shot Task-Oriented Grasping via Semantic and Geometric Guided Segmentation](http://arxiv.org/abs/2410.08901)|null|面向任务的抓取,即根据物体功能抓取其特定部位,对于开发能够在动态环境中执行复杂任务的先进机器人系统至关重要。在本文中,我们提出了一个免训练框架,该框架结合了语义和几何先验,用于零样本面向任务的抓取生成。所提出的框架名为 SegGrasp,首先利用 GLIP 等视觉语言模型进行粗分割。然后,它使用来自凸分解的详细几何信息,通过名为 GeoFusion 的融合策略来提高分割质量。通过改进分割的抓取网络可以生成有效的抓取姿态。我们在分割基准和真实世界机器人抓取上进行了实验。实验结果表明,SegGrasp 在抓取和分割性能方面均优于基线 15% 以上。||
|**2024-10-11**|[Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation](http://arxiv.org/abs/2410.08895)|null|基于缓存的方法在适应视觉语言模型 (VLM) 方面表现出色且高效。然而,现有的缓存模型忽略了三个关键方面。1) 预训练的 VLM 主要针对图像-文本相似性进行优化,忽略了图像-图像相似性的重要性,导致预训练和适应之间存在差距。2) 当前的缓存模型基于 Nadaraya-Watson (N-W) 估计器,它在构建权重函数时忽略了训练样本之间错综复杂的关系。3) 在样本有限的情况下,缓存模型生成的 logits 具有很高的不确定性,直接使用这些 logits 而不考虑置信度可能会有问题。为了解决上述挑战,本工作提出了三个校准模块。相似性校准通过使用未标记的图像来改进图像-图像相似性。我们在 CLIP 的预训练图像编码器之上添加了一个带有残差连接的可学习投影层,并通过最小化自监督对比损失来优化参数。权重校准在权重函数中引入了一个精度矩阵,以充分模拟训练样本之间的关系,将现有的缓存模型转换为高斯过程 (GP) 回归器,这可能比 N-W 估计器更准确。置信度校准利用 GP 回归计算的预测方差来动态地重新调整缓存模型的 logits,确保缓存模型的输出根据其置信度进行适当调整。此外,为了降低 GP 的高复杂度,我们进一步提出了一种基于组的学习策略。整合上述设计,我们提出了免训练和需要训练的两种变体。在 11 个少样本分类数据集上的大量实验表明,所提出的方法可以达到最先进的性能。||
|**2024-10-11**|[RoRA-VLM: Robust Retrieval-Augmented Vision Language Models](http://arxiv.org/abs/2410.08876)|null|目前的视觉语言模型 (VLM) 在知识密集型任务中仍然表现不佳,这主要是由于难以将视觉对象和场景与其对应的实体和背景知识之间的所有关联进行准确编码。虽然检索增强方法提供了一种集成外部知识的有效方法,但将其扩展到视觉语言领域存在着独特的挑战:(1) 由于多模态查询中固有的差异,难以从外部来源准确检索相关信息;(2) 难以抵抗检索到的多模态知识片段中包含的无关、多余和嘈杂的信息。在这项工作中,我们介绍了 RORA-VLM,这是一个专为 VLM 量身定制的新颖且强大的检索增强框架,它具有两项关键创新:(1) 一种采用图像锚定文本查询扩展的两阶段检索过程,以协同组合查询中的视觉和文本信息,并检索最相关的多模态知识片段;(2) 一种鲁棒的检索增强方法,通过在检索增强训练过程中注入对抗性噪声,增强 VLM 对检索到的多模态知识中无关信息的抵抗力,并通过面向查询的视觉标记优化策略过滤掉无关的视觉信息,例如图像中呈现的无关实体。我们进行了广泛的实验,以验证我们提出的方法在三个广泛采用的基准数据集上的有效性和鲁棒性。我们的结果表明,只需极少的训练实例,RORA-VLM 就可以使基础模型实现显著的性能提升,并在所有基准测试中始终优于最先进的检索增强 VLM,同时还展现出新颖的零样本域迁移能力。||
|**2024-10-11**|[VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model](http://arxiv.org/abs/2410.08792)|null|视觉语言模型 (VLM) 近期因其在常识推理和泛化能力方面的优势被应用于机器人领域。现有工作已将 VLM 应用于从自然语言指令生成任务和运动规划,以及为机器人学习模拟训练数据。在本工作中,我们探索使用 VLM 来解释人类演示视频并生成机器人任务规划。我们的方法将关键帧选择、视觉感知和 VLM 推理集成到一个管道中。我们将其命名为 SeeDo,因为它使 VLM 能够“看到”人类演示并向机器人解释相应的计划,以便机器人“执行”。为了验证我们的方法,我们收集了一组长时程人类视频,演示了三种不同类别中的拾放任务,并设计了一套指标,以全面比较 SeeDo 与几种基线方法(包括最先进的视频输入 VLM)的性能。实验结果表明 SeeDo 具有优越的性能。我们进一步在仿真环境和真实的机器人手臂上部署了生成的任务计划。||
|**2024-10-11**|[Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models](http://arxiv.org/abs/2410.08791)|**[link](https://github.com/abbasireza/super-pipeline)**|机器学习模型的快速发展,特别是在自然语言处理和计算机视觉领域,给在资源有限的硬件上运行这些模型带来了挑战。本文介绍了 Superpipeline,这是一个旨在优化大型 AI 模型在训练和推理过程中在受限硬件上执行的新框架。我们的方法涉及通过将模型划分为单独的层并有效地在 GPU 和 CPU 内存之间传输这些层来动态管理模型执行。在我们的实验中,Superpipeline 在保持模型精度和可接受的处理速度的同时,将 GPU 内存使用量减少了高达 60%。这使得原本会超出可用 GPU 内存的模型能够有效运行。与主要关注推理或特定模型类型的现有解决方案不同,Superpipeline 可以应用于大型语言模型 (LLM)、视觉语言模型 (VLM) 和基于视觉的模型。我们在各种模型和硬件设置中测试了 Superpipeline 的性能。该方法包括两个关键参数,允许微调 GPU 内存使用量和处理速度之间的平衡。重要的是,Superpipeline 不需要重新训练或更改模型参数,确保原始模型的输出保持不变。Superpipeline 的简单性和灵活性使其对在有限硬件上使用高级 AI 模型的研究人员和专业人士非常有用。它允许在现有硬件上使用更大的模型或更大的批次大小,从而有可能加快许多机器学习应用的创新。这项工作标志着朝着使高级 AI 模型更易于访问并在资源有限的环境中优化其部署迈出了重要一步。Superpipeline 的代码可在 https://github.com/abbasiReza/super-pipeline 获取。||
|**2024-10-11**|[Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping](http://arxiv.org/abs/2410.08695)|null|大型视觉语言模型(LVLM)在视觉感知和推理等多模态任务中表现出非凡的能力,在各种多模态评估基准测试中均取得了良好的性能。然而,这些基准测试保持着静态性,并且与预训练数据重叠,导致复杂度限制固定和数据污染问题。这引发了对评估有效性的担忧。为了应对这两项挑战,我们引入了一种称为视觉语言自举(VLB)的动态多模态评估协议。VLB 为 LVLM 提供了一个稳健且全面的评估,减少了数据污染,并具有灵活的复杂性。为此,VLB 通过多模态自举模块动态生成新的视觉问答样本,该模块修改图像和语言,同时通过判断模块确保新生成的样本与原始样本保持一致。通过组合各种自举策略,VLB 提供了具有不同复杂性的现有基准测试的动态变体,使评估能够随着 LVLM 不断发展的能力而共同发展。跨多个基准测试(包括 SEEDBench、MMBench 和 MME)的大量实验结果表明,VLB 显着减少了数据污染,并暴露了 LVLM 的性能局限性。||
|**2024-10-11**|[Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models](http://arxiv.org/abs/2410.08611)|**[link](https://github.com/mengyuanchen21/neurips2024-csp)**|零样本分布外 (OOD) 检测的直接 pipeline 涉及从广泛的语义库中选择潜在的 OOD 标签,然后利用预训练的视觉语言模型对分布内 (ID) 和 OOD 标签执行分类。在本文中,我们提出理论,认为提高性能需要扩展语义库,同时增加 OOD 样本激活所选 OOD 标签的预期概率,并确保这些 OOD 标签的激活之间相互依赖性低。一种自然的扩展方式是采用更大的词库;然而,不可避免地引入大量同义词和不常用词无法满足上述要求,这表明可行的扩展方式不仅仅是从词库中选择词语。由于 OOD 检测旨在将输入图像正确分类到 ID/OOD 类别组中,我们可以“编造”OOD 标签候选,这些候选不是标准类别名称,但有利于该过程。观察到原始语义库由未修改的特定类别名称组成,我们相应地构建了一个共轭语义库 (CSP),它由修改后的超类别名称组成,每个名称都充当跨不同类别共享相似属性的样本的聚类中心。与我们建立的理论一致,使用 CSP 扩展 OOD 标签候选满足要求,并且在 FPR95 中的性能比现有工作提高了 7.89%。代码可在 https://github.com/MengyuanChen21/NeurIPS2024-CSP 中获得。||
|**2024-10-11**|[ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression](http://arxiv.org/abs/2410.08584)|null|大型视觉语言模型 (LVLMs) 的效率受到预填充阶段注意力机制的计算瓶颈和解码阶段获取键值 (KV) 缓存的内存瓶颈的限制,尤其是在涉及高分辨率图像或视频的情况下。视觉内容通常表现出大量的冗余,导致 LVLMs 中的注意力图高度稀疏。可以利用这种稀疏性,通过各种方法来加速注意力计算或压缩 KV 缓存。然而,大多数研究只关注解决这些瓶颈中的一个,并且没有充分支持根据不同的层或任务动态调整稀疏性。在本文中,我们提出了 ZipVL,这是一个为 LVLMs 设计的高效推理框架,它通过重要标记的动态比率分配策略来解决计算和内存瓶颈。该比率是根据特定层的注意力分数分布自适应确定的,而不是固定的超参数,从而在较简单的任务中提高效率,同时在更具挑战性的任务中保持高性能。然后我们根据归一化后的注意力分数选择重要的标记,并仅对这些重要的标记执行注意力机制,以加速预填充阶段。为了缓解解码阶段的内存瓶颈,我们对 KV 缓存采用混合精度量化,其中对重要标记的缓存使用高比特量化,而对不那么重要的标记的缓存使用低比特量化。我们的实验表明,ZipVL 可以将预填充阶段的速度提高 2.6 倍,并将 GPU 内存使用量减少 50.0%,在 LongVA-7B 模型上的 Video-MME 基准测试中,准确率仅下降了 0.2%,有效地提高了 LVLMs 的生成效率。||
|**2024-10-10**|[LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts](http://arxiv.org/abs/2410.08211)|null|大规模视觉语言预训练 (VLP) 模型(例如 CLIP)以其多功能性而闻名,因为它们可以在零样本设置中应用于各种应用。然而,当这些模型用于特定领域时,由于领域差距或训练数据中这些领域的代表性不足,它们的性能往往不尽如人意。虽然在具有人工标注标签的自定义数据集上微调 VLP 模型可以解决这个问题,但即使是标注小规模数据集(例如,100k 个样本)也可能是一项昂贵的工作,如果任务复杂,通常需要专家标注员。为了应对这些挑战,我们提出了 LatteCLIP,这是一种无监督方法,用于在自定义领域中使用已知类名对 CLIP 模型进行分类微调,而无需依赖人工标注。我们的方法利用大型多模态模型 (LMM) 为单个图像和图像组生成富有表现力的文本描述。这些信息提供了额外的上下文信息,以指导自定义领域中的微调过程。由于 LMM 生成的描述容易出现幻觉或细节缺失,我们引入了一种新策略,仅提取有用信息并稳定训练过程。具体来说,我们从噪声生成的文本和双重伪标签中学习丰富的每类原型表示。我们在 10 个特定领域数据集上的实验表明,LatteCLIP 的性能优于预训练的零样本方法,平均提高了 +4.74 个百分点的 top-1 准确率,并且优于其他最先进的无监督方法 +3.45 个百分点。||
|**2024-10-10**|[Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision](http://arxiv.org/abs/2410.08209)|null|当前的大型多模态模型 (LMM) 面临着 grounding 的挑战, grounding 要求模型将语言成分与视觉实体相关联。与使用额外的 grounding 监督微调 LMM 的常见做法相反,我们发现 grounding 能力实际上可以在没有明确 grounding 监督的情况下训练的 LMM 中出现。为了揭示这种新兴的 grounding 能力,我们引入了一种“attend-and-segment”方法,该方法利用来自标准 LMM 的注意力图来执行像素级分割。此外,为了增强 grounding 能力,我们提出了 DIFFLMM,这是一种利用基于扩散的视觉编码器(而不是标准 CLIP 视觉编码器)的 LMM,并使用相同的弱监督进行训练。我们的方法不受特定于 grounding 的监督数据在偏差和规模上的限制,因此更具通用性和可扩展性。与专门的 grounding LMM 和通才 LMM 相比,我们在特定于 grounding 的基准和通用视觉问答基准测试中均取得了有竞争力的性能。值得注意的是,我们在没有任何 grounding 监督的情况下,在 grounded 对话生成方面实现了 44.2 的 grounding 掩码召回率,优于经过广泛监督的模型 GLaMM。项目页面:https://groundLMM.github.io。||
|**2024-10-10**|[MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models](http://arxiv.org/abs/2410.08182)|null|现有的多模态检索基准主要侧重于评估模型是否能够检索和利用外部文本知识来回答问题。然而,在某些情况下,检索视觉信息比文本数据更有益或更容易获取。在本文中,我们介绍了一个多模态检索增强生成基准 MRAG-Bench,在该基准中,我们系统地识别和分类了视觉增强知识优于文本知识的场景,例如,来自不同视角的更多图像。MRAG-Bench 由 16,130 张图像和 1,353 个人工标注的多项选择题组成,涵盖 9 个不同的场景。借助 MRAG-Bench,我们对 10 个开源和 4 个专有的超大型视觉语言模型 (LVLM) 进行了评估。我们的结果表明,与文本知识相比,所有 LVLM 在使用图像增强时都表现出更大的改进,这证实了 MRAG-Bench 以视觉为中心的特点。此外,我们使用 MRAG-Bench 进行了广泛的分析,为了解检索增强型 LVLM 提供了宝贵的见解。值得注意的是,表现最佳的模型 GPT-4o 在有效利用检索到的知识方面面临挑战,在使用真实信息的情况下仅实现了 5.82% 的改进,而人类参与者观察到的改进为 33.16%。这些发现突出了 MRAG-Bench 在鼓励社区增强 LVLM 更有效地利用检索到的视觉知识方面的能力的重要性。||
|**2024-10-10**|[Q-VLM: Post-training Quantization for Large Vision-Language Models](http://arxiv.org/abs/2410.08119)|**[link](https://github.com/changyuanwang17/qvlm)**|在本文中,我们提出了一种针对大型视觉语言模型 (LVLMs) 的训练后量化框架,以实现高效的多模态推理。传统的量化方法通过最小化激活离散化误差来顺序搜索逐层舍入函数,这种方法由于没有考虑跨层依赖性,因此无法获得最佳量化策略。相反,我们挖掘了对整个视觉语言模型的离散化误差有显著影响的跨层依赖性,并将这种依赖性嵌入到低搜索成本的最佳量化策略搜索中。具体来说,我们观察到激活熵和跨层依赖性之间存在强相关性,这与输出离散化误差有关。因此,我们采用熵作为代理来优化分区块,旨在在离散化误差和搜索成本之间取得令人满意的平衡。此外,我们优化了视觉编码器以解耦跨层依赖性,从而对搜索空间进行细粒度分解,从而在不损害量化精度的情况下进一步降低搜索成本。实验结果表明,我们的方法在不降低各种多模态推理任务性能的情况下,将大约 13B LLaVA 模型的内存压缩了 2.78 倍,并将生成速度提高了 1.44 倍。代码可在 https://github.com/ChangyuanWang17/QVLM 获取。||
|**2024-10-10**|[Unsupervised Data Validation Methods for Efficient Model Training](http://arxiv.org/abs/2410.07880)|null|本文探讨了改进低资源语言机器学习系统所面临的挑战和潜在解决方案。自然语言处理 (NLP)、文本到语音 (TTS)、语音到文本 (STT) 和视觉语言模型 (VLM) 中的最新模型严重依赖于大型数据集,而这些数据集通常不适用于低资源语言。本研究探讨了关键领域,例如定义“高质量数据”、开发生成适当数据的方法以及增强模型训练的可访问性。对当前方法的全面回顾,包括数据增强、多语言迁移学习、合成数据生成和数据选择技术,突出了进步和局限性。确定了几个开放的研究问题,为未来旨在优化数据利用、减少所需数据量和保持高质量模型性能的研究提供了框架。通过应对这些挑战,本文旨在使低资源语言更容易获得先进的机器学习模型,从而增强其在各个领域的效用和影响力。||
|**2024-10-10**|[HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter](http://arxiv.org/abs/2410.07854)|null|基于适配器的调优方法在将知识从预训练的视觉语言模型迁移到下游任务方面已显示出巨大潜力。然而,在回顾现有的适配器后,我们发现它们通常无法充分探索构建特定任务知识时不同模态之间的交互。此外,现有工作通常只关注正文本提示之间的相似性匹配,这使得区分具有高度相似视觉内容的类别变得具有挑战性。为了解决这些问题,在本文中,我们提出了一种新颖的异构图适配器来实现下游任务的视觉语言模型微调。具体来说,我们首先构建了一个统一的异构图模式,它包含 i) 视觉节点、正文本节点和负文本节点,以及 ii) 几种类型的边连接,以全面地对模态内、模态间和类间结构知识进行建模。接下来,我们采用特定的异构图神经网络来挖掘多模态结构知识,以便为下游任务调整视觉和文本特征。最后,在HeGraphAdapter之后,我们同时构建基于文本和基于视觉的分类器,以全面提升CLIP模型的性能。在 11 个基准数据集上的实验结果证明了所提出的 HeGraphAdapter 的有效性和优势。||
|**2024-10-10**|[FLIER: Few-shot Language Image Models Embedded with Latent Representations](http://arxiv.org/abs/2410.07648)|null|随着像对比语言-图像预训练 (CLIP) 这样的大型视觉语言模型的快速发展,许多类似 CLIP 的方法在视觉识别方面表现出了令人印象深刻的能力,尤其是在低数据场景下。然而,我们注意到大多数这些方法仅限于对文本和图像编码器进行新的修改。最近,潜在扩散模型 (LDM) 在图像生成方面表现出了良好的能力。LDM 的强大能力将我们的注意力引向了 UNet 采样的潜在表示。受 CoOp 中学习到的提示编码超出现有词汇量的含义的猜想的启发,我们假设,对于深度模型,潜在表示是对图像的简洁准确的理解,其中抽象掉了高频的、不可感知的细节。在本文中,我们提出了一种融合潜在表示的少样本语言图像模型 (FLIER),通过引入一个与 CLIP 的图像编码器联合训练的潜在编码器来进行图像识别,它结合了 CLIP 的预训练视觉语言知识和稳定扩散的潜在表示。我们首先通过稳定扩散使用 GPT-3 的文本输入生成图像和相应的潜在表示。将潜在表示作为“模型可理解的像素”,我们引入了一个具有两个卷积层的灵活卷积神经网络作为潜在编码器,它比视觉语言模型中的大多数编码器都简单。潜在编码器与 CLIP 的图像编码器联合训练,可以更好地将预训练的知识迁移到下游任务。在各种视觉分类任务上的实验和广泛的消融研究表明,FLIER 在大多数少样本分类的 11 个数据集上表现出最先进的性能。||
|**2024-10-10**|[A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks](http://arxiv.org/abs/2410.07593)|**[link](https://github.com/HoinJung/Unified-Debiaisng-VLM-SFID)**|视觉语言模型 (VLM) 的最新进展使得通过同时处理文本和图像数据来完成复杂的多模态任务成为可能,从而显著增强了人工智能领域。然而,这些模型经常表现出偏差,这些偏差会导致输出偏向社会刻板印象,因此需要去偏差策略。现有的去偏差方法狭隘地关注特定的模态或任务,并且需要大量的再训练。为了解决这些限制,本文介绍了用于去偏差的选择性特征插补 (SFID),这是一种集成了特征剪枝和低置信度插补 (LCI) 的新方法,可以有效减少 VLM 中的偏差。SFID 具有多种功能,可以保持输出的语义完整性,并且通过消除重新训练的需要来节省成本。我们的实验结果证明了 SFID 在各种 VLM 任务中的有效性,包括零样本分类、文本到图像检索、图像字幕和文本到图像生成,通过在不影响性能的情况下显着减少性别偏差。这种方法不仅增强了 VLM 应用的公平性,而且还保留了它们在不同场景中的效率和实用性。||
|**2024-10-10**|[3D Vision-Language Gaussian Splatting](http://arxiv.org/abs/2410.07577)|null|近年来,三维重建方法和视觉语言模型的进步推动了多模态三维场景理解的发展,这在机器人技术、自动驾驶以及虚拟/增强现实中具有至关重要的应用。然而,当前的多模态场景理解方法简单地将语义表示嵌入到三维重建方法中,而没有在视觉和语言模态之间取得平衡,这导致半透明或反射性物体的语义栅格化效果不理想,以及对颜色模态的过度拟合。为了缓解这些限制,我们提出了一种充分处理不同视觉和语义模态的解决方案,即用于场景理解的三维视觉语言高斯散射模型,以强调语言模态的表示学习。我们提出了一种新颖的跨模态栅格化器,使用模态融合以及平滑语义指示器来增强语义栅格化。我们还采用了相机视图混合技术来提高现有视图和合成视图之间的语义一致性,从而有效地减轻过度拟合。大量实验表明,我们的方法在开放词汇语义分割方面达到了最先进的性能,明显优于现有方法。||
|**2024-10-09**|[The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks](http://arxiv.org/abs/2410.07391)|null|人们越来越关注追踪通用人工智能基础模型的能力。本研究以韦氏成人智力量表(WAIS-IV)为基准,将领先的大型语言模型和视觉语言模型与人类表现进行了比较。WAIS-IV是一种全面、以人群为规范的潜在人类认知和智力能力评估,重点关注语言理解(VCI)、工作记忆(WMI)和知觉推理(PRI)领域。大多数模型在存储、检索和处理诸如字母和数字的任意序列等token方面表现出卓越的能力,与人类群体规范能力相比,工作记忆指数(WMI)的表现等于或大于99.5%。语言理解指数(VCI)衡量的是对获得信息的检索,以及对单词含义及其相互关系的语言理解,其表现也始终保持在98%或以上。尽管有这些广泛的优势,但我们观察到,多模态模型在知觉推理指数(PRI;范围0.1-10%)上的表现一直很差,这表明其在解释和推理视觉信息方面存在严重不足。较小和较旧的模型版本的表现始终较差,这表明训练数据、参数数量和微调方面的进步正在导致认知能力的显著进步。||
|**2024-10-07**|[Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia](http://arxiv.org/abs/2410.05270)|**[link](https://github.com/astra-vision/prolip)**|我们研究了如何将像 CLIP (Radford et al., 2021) 这样的对比预训练视觉语言模型应用于少样本分类问题。现有文献通过学习冻结视觉特征的线性分类器、优化词嵌入或学习外部特征适配器来解决这个问题。本文介绍了一种无需添加“外部”参数来优化 CLIP 自适应的替代方法。我们发现,与现有的基线相比,简单地微调视觉编码器的最后一个投影矩阵就能获得强大的性能。此外,我们发现,通过微调矩阵和预训练矩阵之间的距离对训练进行正则化,可以提高通过该层自适应 CLIP 的可靠性。也许令人惊讶的是,这种被称为 ProLIP 的方法在 11 个少样本分类基准测试、少样本域泛化、跨数据集迁移和测试时自适应方面取得了与最先进水平相当或更好的性能。代码将在 https://github.com/astra-vision/ProLIP 上提供。||
|**2024-10-07**|[TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens](http://arxiv.org/abs/2410.05261)|null|阅读密集文本和定位图像中的物体是大规模视觉语言模型 (LVLM) 执行高级任务的基本能力。以前的 LVLM,包括像 GPT-4o 这样的优秀专有模型,都难以同时在这两项任务中表现出色。此外,以前具有细粒度感知能力的 LVLM 每张图像需要消耗数千个标记,这使得它们非常消耗资源。我们提出了 TextHawk2,这是一种双语 LVLM,具有高效的细粒度感知能力,并在通用、OCR 和 grounding 任务中展现出最先进的性能,同时图像标记数量减少了 16 倍。关键改进包括:(1) 标记压缩:TextHawk2 建立在其前身的有效架构之上,将每张图像的标记数量显著减少了 16 倍,从而能够以最少的资源促进 TextHawk 系列的训练和部署。(2) 视觉编码器增强:我们通过 LVLM 联合训练增强了视觉编码器,从而释放了其在中文 OCR 和 grounding 等以前未见任务中的潜力。(3) 数据多样性:我们在保持 1 亿个样本的相当规模的同时,使预训练数据的来源多样化。我们在多个基准测试中评估了 TextHawk2,它始终如一地提供卓越的性能,并优于类似规模的闭源模型,例如在 OCRBench 上实现了 78.4% 的准确率,在 ChartQA 上实现了 81.4% 的准确率,在 DocVQA 上实现了 89.6% 的 ANLS,以及在 RefCOCOg-test 上实现了 88.1% 的 [email protected]。||
|**2024-10-07**|[TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models](http://arxiv.org/abs/2410.05239)|**[link](https://github.com/naamiinepal/tunevlseg)**|视觉语言模型 (VLM) 在视觉任务中表现出色,但将其应用于新领域通常需要昂贵的微调。提示调整技术,包括文本、视觉和多模态提示,通过利用可学习的提示提供了有效的替代方案。然而,它们在视觉语言分割模型 (VLSM) 中的应用以及在显著领域迁移下的评估仍有待探索。本研究提出了一个开源基准测试框架 TuneVLSeg,将各种单模态和多模态提示调整技术集成到 VLSM 中,使得提示调整适用于任何类别数量的下游分割数据集。TuneVLSeg 包括在 2 个 VLSM 中使用的不同提示深度上的 6 种提示调整策略,总共 8 种不同的组合。我们在 8 个不同的医学数据集上测试了各种提示调整,包括 3 个放射学数据集(乳腺肿瘤、超声心动图、胸部 X 光片病变)和 5 个非放射学数据集(息肉、溃疡、皮肤癌),以及两个自然领域分割数据集。我们的研究发现,文本提示调整在从自然领域图像到医学数据的显著领域迁移下表现不佳。此外,与多模态提示调整相比,视觉提示调整具有更少的超参数,通常可以实现与多模态方法相当的性能,使其成为一种有价值的首次尝试。我们的工作促进了对不同提示调整技术在鲁棒的特定领域分割中的理解和适用性。源代码可在 https://github.com/naamiinepal/tunevlseg 获取。||
|**2024-10-07**|[LADEV: A Language-Driven Testing and Evaluation Platform for Vision-Language-Action Models in Robotic Manipulation](http://arxiv.org/abs/2410.05191)|null|基于大型语言模型(LLMs)和视觉语言模型(VLMs)的进步,近期的研究引入了视觉-语言-动作(VLA)模型作为机器人操作任务的集成解决方案。这些模型将相机图像和自然语言任务指令作为输入,直接生成机器人的控制动作来执行指定任务,极大地提高了决策能力和与人类用户的交互。然而,VLA模型的数据驱动特性,加上其缺乏可解释性,使得确保其有效性和鲁棒性成为一项具有挑战性的任务。这突出了对可靠测试和评估平台的需求。为此,在这项工作中,我们提出了LADEV,这是一个专门为评估VLA模型而设计的综合高效平台。我们首先提出了一种语言驱动的方法,可以根据自然语言输入自动生成仿真环境,从而减少了手动调整的需求,并显著提高了测试效率。然后,为了进一步评估语言输入对VLA模型的影响,我们实现了一种释义机制,可以生成不同的自然语言任务指令进行测试。最后,为了加快评估过程,我们引入了一种批量式方法来对VLA模型进行大规模测试。使用LADEV,我们对几种最先进的VLA模型进行了实验,证明了其作为评估这些模型的工具的有效性。我们的结果表明,LADEV不仅提高了测试效率,而且为评估VLA模型建立了坚实的基础,为开发更智能、更先进的机器人系统铺平了道路。||
|**2024-10-07**|[HE-Drive: Human-Like End-to-End Driving with Vision Language Models](http://arxiv.org/abs/2410.05051)|null|本文提出了HE-Drive:第一个以类人为中心的端到端自动驾驶系统,用于生成时间一致且舒适的轨迹。最近的研究表明,基于模仿学习的规划器和基于学习的轨迹评分器可以有效地生成和选择与专家演示非常相似的精确轨迹。然而,这种轨迹规划器和评分器面临着生成时间不一致和不舒适轨迹的困境。为了解决上述问题,我们的HE-Drive首先通过稀疏感知提取关键的3D空间表示,然后将其作为基于条件去噪扩散概率模型(DDPMs)的运动规划器的条件输入,以生成时间一致的多模态轨迹。随后,视觉语言模型(VLMs)引导的轨迹评分器从这些候选轨迹中选择最舒适的轨迹来控制车辆,确保类人的端到端驾驶。实验表明,HE-Drive不仅在具有挑战性的nuScenes和OpenScene数据集上实现了最先进的性能(即平均碰撞率比VAD降低了71%)和效率(即比SparseDrive快1.9倍),而且在真实世界的数据上提供了最舒适的驾驶体验。更多信息请访问项目网站:https://jmwang0117.github.io/HE-Drive/。||
|**2024-10-07**|[Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models](http://arxiv.org/abs/2410.04884)|null|视觉语言预训练 (VLP) 模型在各个领域都取得了显著成功,但它们仍然容易受到对抗性攻击。解决这些对抗性漏洞对于增强多模态学习的安全性至关重要。传统上,针对 VLP 模型的对抗性方法涉及同时扰动图像和文本。然而,这种方法面临着显著的挑战:首先,对抗性扰动通常无法有效地转化为现实场景;其次,对文本的直接修改非常明显。为了克服这些限制,我们提出了一种新策略,该策略专门使用图像补丁进行攻击,从而保持原始文本的完整性。我们的方法利用来自扩散模型的先验知识来增强扰动的真实性和自然性。此外,为了优化补丁放置并提高攻击的效率,我们利用了交叉注意力机制,该机制通过生成注意力图来封装模态间交互,以指导战略性补丁放置。在图像到文本场景的白盒设置中进行的综合实验表明,我们提出的方法明显优于现有技术,实现了 100% 的攻击成功率。此外,它在涉及文本到图像配置的迁移任务中表现出出色的性能。||
|**2024-10-05**|[TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions](http://arxiv.org/abs/2410.04107)|**[link](https://github.com/nlpcode/tubench)**|大型视觉语言模型 (LVLM) 在视觉感知和语言理解方面取得了显著进展。尽管它们在各种任务中表现出色,但 LVLM 仍然存在幻觉问题,即生成与视觉或文本输入不正确或不忠实的内容。传统的基准测试,如 MME 和 POPE,使用可回答的问题在视觉问答 (VQA) 范围内评估 LVLM 中的幻觉。然而,由于图像中信息不足,有些问题无法回答,而 LVLM 在此类无法回答的问题上的表现仍未得到充分探索。为了弥合这一研究差距,我们提出了 TUBench,这是一个专门用于使用无法回答的问题评估 LVLM 可靠性的基准测试。TUBench 包含大量高质量的、无法回答的问题,这些问题是使用十种不同的策略精心制作的。为了全面评估 LVLM,TUBench 中的无法回答的问题基于来自四个不同领域的图像作为视觉上下文:代码片段的屏幕截图、自然图像、几何图形和统计表的屏幕截图。这些无法回答的问题分别用于测试 LVLM 在代码推理、常识推理、几何推理和与表格相关的数学推理方面的可信度。我们对 TUBench 上的 28 个领先基础模型进行了全面的定量评估,其中表现最佳的模型 Gemini-1.5-Pro 在确定问题是否可回答方面达到了 69.2% 的平均准确率,排名第三的模型 GPT-4o 则达到了 66.7% 的平均准确率。TUBench 可在 https://github.com/NLPCode/TUBench 获取。||
|**2024-10-05**|[Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks](http://arxiv.org/abs/2410.04055)|**[link](https://github.com/ivy3h/SCL)**|虽然视觉语言模型 (VLM) 在视觉和语言推理任务中表现出非凡的能力,但它们也不可避免地会产生错误的响应。自我纠正,即指导模型改进其输出,为解决这个问题提供了一种很有前景的解决方案。以往的研究主要集中在大型语言模型 (LLM) 上,而 VLM 的自我纠正能力,特别是在视觉和语言信息方面的能力,在很大程度上仍未得到检验。本研究调查了 VLM 在推理和微调阶段的自我纠正能力。我们介绍了一种自我纠正学习 (SCL) 方法,该方法使 VLM 能够通过直接偏好优化 (DPO) 从其自我生成的自我纠正数据中学习,而无需依赖外部反馈,从而促进自我改进。具体来说,我们根据初始和改进响应的正确性收集偏好和不偏好的样本,这些样本是通过在推理阶段使用 VLM 进行两轮自我纠正获得的。实验结果表明,虽然 VLM 在没有额外微调和外部反馈的情况下难以在迭代推理过程中有效地进行自我纠正,但当它们自我生成的自我纠正数据被分类为偏好和不偏好样本时,它们可以通过偏好微调来提高性能并避免以前的错误。这项研究强调,自我纠正不仅仅是一个改进过程;相反,它应该通过额外的训练来增强模型的推理能力,使其能够直接生成高质量的响应,而无需进一步改进。||
|**2024-10-05**|[Gamified crowd-sourcing of high-quality data for visual fine-tuning](http://arxiv.org/abs/2410.04038)|null|本文介绍了游戏化对抗提示 (GAP),这是一个为大型多模态模型的视觉指令微调进行众包高质量数据的框架。GAP 将数据收集过程转化为引人入胜的游戏,激励玩家提供针对模型知识差距的细粒度、具有挑战性的问题和答案。我们的贡献包括 (1) 一种从人类那里捕获问答对的方法,这些问答对直接针对模型知识中的弱点,(2) 一种评估和奖励玩家的方法,该方法成功地激励他们提供高质量的提交内容,以及 (3) 一个可扩展的游戏化平台,该平台成功地在几周内从超过 50,000 名参与者那里收集了这些数据。我们对 GAP 的实现显着提高了小型多模态模型 MiniCPM-Llama3-V-2.5-8B 的准确性,将其在我们数据集上的 GPT 分数从 0.147 提高到 0.477,接近更大的 GPT-4V 所设定的基准。此外,我们证明了使用 MiniCPM-Llama3-V-2.5-8B 生成的数据也增强了其在其他基准测试中的性能,并展现出跨模型的优势。具体来说,相同的数据提高了 QWEN2-VL-2B 和 QWEN2-VL-7B 在相同多个基准测试中的性能。||
|**2024-10-04**|[Model Developmental Safety: A Safety-Centric Method and Applications in Vision-Language Models](http://arxiv.org/abs/2410.03955)|**[link](https://github.com/ganglii/devsafety)**|在现实世界中,学习型系统通常会经历多个模型开发周期,以增强系统处理困难或新出现任务的能力。这种持续的模型开发过程提出了一个重要问题,即为获取新能力或改进现有能力而进行的模型开发可能会无意中失去旧模型的能力,也称为灾难性遗忘。现有的持续学习研究侧重于通过权衡先前任务和新任务的性能来减轻灾难性遗忘,以确保良好的平均性能。然而,它们不足以用于许多应用,特别是在安全关键领域,因为未能严格保持旧模型的性能不仅会带来安全风险和不确定性,还会在重新改进和重新验证现有属性方面造成巨大开销。为了解决这个问题,我们引入了模型开发安全作为学习系统的保证,即在模型开发过程中,新模型应严格保留旧模型现有的受保护能力,同时提高其在目标任务上的性能。为了确保模型开发安全,我们提出了一个以安全为中心的框架,将模型开发安全制定为依赖于数据的约束。在这个框架下,我们研究了如何开发一个预训练的视觉语言模型(又称 CLIP 模型),以获得新的能力或改进现有的图像分类能力。我们提出了一种具有理论保证的高效约束优化算法,并利用其见解微调具有任务依赖头的 CLIP 模型,以促进模型开发安全。我们在自动驾驶和场景识别数据集上改进视觉感知能力的实验结果证明了该方法的有效性。||
|**2024-10-04**|[Generalizable Prompt Tuning for Vision-Language Models](http://arxiv.org/abs/2410.03189)|null|针对诸如 CLIP 等视觉语言模型的提示调优涉及优化用于为特定下游任务生成图像-文本对的文本提示。虽然手工制作或基于模板的提示通常适用于更广泛的未见类别,但它们在下游任务(即已见类别)中往往表现不佳。另一方面,可学习的软提示通常在下游任务中表现良好,但缺乏泛化性。此外,先前的研究主要集中在文本模态上,很少有研究试图从视觉模态探索提示的泛化潜力。考虑到这些限制,我们研究了如何进行提示调优以获得具有竞争力的下游性能和泛化能力。研究表明,通过将软提示和手工提示视为文本模态的双重视图,并最大化它们的互信息,我们可以更好地集成特定任务的语义信息和通用语义信息。此外,为了生成更具表达力的提示,该研究引入了来自视觉模态的类别增强,从而显著提高了对更广泛的未见类别的鲁棒性。对多个基准的广泛评估表明,所提出的方法在特定任务性能和泛化能力方面都取得了具有竞争力的结果。||
|**2024-10-04**|[Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models](http://arxiv.org/abs/2410.03176)|**[link](https://github.com/yufang-liu/clip_hallucination)**|大型视觉语言模型 (LVLM) 已经取得了令人瞩目的性能,但研究指出,这些模型存在严重的物体幻觉问题。然而,对于这些幻觉源自模型的哪个部分,目前还没有明确的结论。在本文中,我们深入研究了 CLIP 模型中的物体幻觉问题,CLIP 模型是许多最先进的视觉语言系统的支柱。我们揭示了即使是单独使用,CLIP 模型也容易出现物体幻觉,这表明幻觉问题不仅仅是由于视觉和语言模态之间的交互造成的。为了解决这个问题,我们提出了一种反事实数据增强方法,通过创建具有各种幻觉问题的负样本来实现。我们证明了我们的方法可以有效地减轻 CLIP 模型的物体幻觉,并且我们展示了增强后的模型可以用作视觉编码器,有效地缓解了 LVLMs 中的物体幻觉问题。||
|**2024-10-04**|[AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark](http://arxiv.org/abs/2410.03051)|null|视频详细字幕生成是一项关键任务,旨在生成对视频内容全面而连贯的文本描述,有利于视频理解和生成。在本文中,我们提出了 AuroraCap,一个基于大型多模态模型的视频字幕生成器。我们遵循最简单的架构设计,没有为时间建模添加额外的参数。为了解决长视频序列带来的开销,我们实施了标记合并策略,减少了输入视觉标记的数量。令人惊讶的是,我们发现这种策略几乎没有造成性能损失。AuroraCap 在各种视频和图像字幕基准测试中表现出色,例如,在 Flickr30k 上获得了 88.9 的 CIDEr 分数,超过了 GPT-4V (55.3) 和 Gemini-1.5 Pro (82.2)。然而,现有的视频字幕基准测试只包含简单的描述,由几十个词组成,这限制了该领域的研究。因此,我们开发了 VDC,这是一个包含一千多个精心标注的结构化字幕的视频详细字幕基准测试。此外,我们提出了一种新的 LLM 辅助指标 VDCscore,用于改进评估,该指标采用分治策略将长字幕评估转化为多个简短的问答对。在人工 Elo 排名的帮助下,我们的实验表明,该基准测试与人类对视频详细字幕质量的判断具有更好的相关性。||
|**2024-10-03**|[CPFD: Confidence-aware Privileged Feature Distillation for Short Video Classification](http://arxiv.org/abs/2410.03038)|null|在短视频分类中,针对不同业务场景定制的密集特征至关重要。然而,它们的复杂性、特定的适应性要求和高计算成本使得它们在在线推理过程中资源密集且难以访问。因此,这些密集特征被称为“特权密集特征”。同时,端到端多模态模型在众多计算机视觉任务中显示出良好的效果。在工业应用中,优先考虑端到端多模态特征可以提高效率,但往往会导致丢失历史特权密集特征中的宝贵信息。为了在保持效率和可管理的资源成本的同时整合这两种特征,我们提出了置信度感知的特权特征蒸馏(CPFD),它通过在训练过程中自适应地提取特权特征来增强端到端多模态模型的特征。与现有的特权特征蒸馏(PFD)方法不同,后者在蒸馏过程中对所有实例应用统一的权重,这可能导致不同业务场景下的性能不稳定,以及教师模型(密集特征增强的多模态模型DF-X-VLM)与学生模型(仅使用多模态特征的X-VLM)之间存在显著的性能差距;CPFD则利用从教师模型中获得的置信度分数,自适应地减轻学生模型的性能差异。我们在五个不同的任务上进行了广泛的离线实验,结果表明,与端到端多模态模型(X-VLM)相比,CPFD将视频分类的F1分数提高了6.76%,与普通的PFD相比平均提高了2.31%。它将性能差距缩小了84.6%,并取得了与教师模型DF-X-VLM相当的结果。在线实验进一步证实了CPFD的有效性,我们的框架已经部署到生产系统中,用于十多个模型。||
|**2024-10-03**|[MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection](http://arxiv.org/abs/2410.03010)|null|多模态学习旨在结合来自多个输入源的数据,以提高不同下游任务的性能。在现实场景中,如果缺少某些输入模态,性能可能会大幅下降。现有的可以处理缺失模态的方法包括针对每个输入模态组合进行定制训练或适应步骤。这些方法要么绑定到特定的模态,要么随着输入模态数量的增加而变得计算成本高昂。在本文中,我们提出了掩蔽模态投影(MMP),这是一种旨在训练单个模型的方法,该模型对任何缺失模态场景都具有鲁棒性。我们通过在训练期间随机掩蔽一部分模态并学习投影可用的输入模态来估计掩蔽模态的标记来实现这一点。这种方法使模型能够有效地学习利用来自可用模态的信息来补偿缺失的模态,从而增强缺失模态的鲁棒性。我们使用各种基线模型和数据集进行了一系列实验,以评估该策略的有效性。实验表明,我们的方法提高了对不同缺失模态场景的鲁棒性,优于为缺失模态或特定模态组合设计的现有方法。||
|**2024-10-03**|[Real-World Cooking Robot System from Recipes Based on Food State Recognition Using Foundation Models and PDDL](http://arxiv.org/abs/2410.02874)|null|尽管机器人烹饪行为的需求日益增长,但基于机器人在现实世界中对新食谱描述的一系列烹饪行为尚未实现。在本研究中,我们提出了一种机器人系统,该系统集成了使用大型语言模型 (LLM) 和 PDDL 描述的经典规划的可执行的真实世界机器人烹饪行为规划,以及使用视觉语言模型 (VLM) 从少量数据中学习食物成分状态识别。我们成功地进行了实验,在实验中,双臂轮式机器人 PR2 在真实环境中根据安排的新食谱进行烹饪,并确认了所提出系统的有效性。||
|**2024-10-03**|[Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos](http://arxiv.org/abs/2410.02763)|null|最近,越来越多的人认为现代大型多模态模型 (LMM) 已经解决了与短视频理解相关的大多数关键挑战。因此,学术界和工业界都逐渐将注意力转向理解长视频带来的更复杂挑战。然而,事实真的如此吗?我们的研究表明,即使在处理短视频时,LMM 仍然缺乏许多基本的推理能力。我们介绍了 Vinoground,这是一个包含 1000 个短而自然的视频-字幕对的时间反事实 LMM 评估基准。我们证明,现有的 LMM 很难区分不同动作和对象转换之间的时间差异。例如,最佳模型 GPT-4o 在我们的文本和视频得分中仅获得约 50% 的分数,与约 90% 的人类基线相比存在较大差距。所有开源多模态模型和基于 CLIP 的模型表现更差,产生的结果大多是随机的。通过这项工作,我们揭示了短视频中的时间推理是一个尚未完全解决的问题。数据集和评估代码可在 https://vinoground.github.io 获取。||
|**2024-10-03**|[Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations](http://arxiv.org/abs/2410.02762)|**[link](https://github.com/nickjiang2378/vl-interp)**|我们研究了视觉语言模型 (VLM) 的内部表征,以解决幻觉问题,尽管模型规模和训练方面取得了进步,但这仍然是一个持续的挑战。我们将 VLM 的内部图像表征投影到它们的语言词汇表中,并观察到真实物体的输出概率比幻觉物体更有信心。我们还使用这些输出概率来对真实物体进行空间定位。在此方法的基础上,我们引入了一种知识擦除算法,通过线性正交化图像特征和幻觉物体特征来消除幻觉。我们表明,对模型潜在表征的有针对性的编辑可以将 COCO2014 数据集上的幻觉减少高达 25.7%,同时保持性能。我们的研究结果表明,更深入地理解 VLM 的潜在表征可以增强可靠性并实现新的功能,例如零样本分割。||
|**2024-10-03**|[Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models](http://arxiv.org/abs/2410.02740)|null|多模态模型的最新进展突出了重写图像描述对于提高性能的价值,但也存在一些关键挑战。例如,虽然合成图像描述通常提供更高的质量和图文对齐性,但尚不清楚它们是否可以完全替代 AltTexts:合成图像描述的作用及其与原始网络抓取的 AltTexts 在预训练中的交互作用仍不清楚。此外,不同的多模态基础模型可能对特定的图像描述格式有独特的偏好,但确定每个模型的最佳图像描述的努力仍然有限。在这项工作中,我们提出了一种新颖的、可控的和可扩展的图像描述生成流程,旨在生成适合各种多模态模型的不同图像描述格式。通过以简短合成图像描述 (SSC) 和密集合成图像描述 (DSC+) 作为案例研究,我们系统地探索了它们对 CLIP、多模态 LLM 和扩散模型等模型的影响以及与 AltTexts 的交互作用。我们的研究结果表明,保留合成图像描述和 AltTexts 的混合方法可以优于单独使用合成图像描述,从而提高对齐性和性能,并且每个模型都表现出对特定图像描述格式的偏好。这种全面的分析为优化图像描述策略提供了宝贵的见解,从而推进了多模态基础模型的预训练。||
|**2024-10-03**|[DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects](http://arxiv.org/abs/2410.02730)|**[link](https://github.com/zhaowei-wang-nlp/divscene)**|在未知环境中进行物体导航对于在现实世界应用中部署具身代理至关重要。虽然由于大规模场景数据集、更快的模拟器和更强大的模型,我们已经目睹了巨大的进步,但之前的研究主要集中在有限的场景类型和目标物体上。在本文中,我们研究了在大量场景类型中导航到不同目标物体的新任务。为了对该问题进行基准测试,我们提出了一个大规模场景数据集 DivScene,其中包含跨越 81 种不同类型的 4,614 个场景。利用该数据集,我们通过模仿学习微调大型视觉语言模型 (LVLM),构建了一个端到端的具身代理 NatVLM。LVLM 被训练用于获取来自环境的先前观察结果并生成下一步动作。我们还引入了动作预测的思维链 (CoT) 解释轨迹,以便在调整 LVLM 时获得更好的性能。我们广泛的实验发现,我们可以通过对由 BFS 规划器构建的最短路径进行模仿学习来构建性能良好的基于 LVLM 的代理,而无需任何人工监督。我们的代理的成功率比 GPT-4o 高出 20% 以上。同时,我们进行了各种分析,展示了我们代理的泛化能力。||
|**2024-10-03**|[Video Instruction Tuning With Synthetic Data](http://arxiv.org/abs/2410.02713)|null|视频大型多模态模型 (LMM) 的发展一直受到从网络获取大量高质量原始数据的难度的阻碍。为了解决这个问题,我们提出了一种替代方法,即创建一个专门用于视频指令遵循的高质量合成数据集,即 LLaVA-Video-178K。该数据集包括关键任务,例如详细字幕、开放式问答 (QA) 和多项选择 QA。通过结合现有的视觉指令调整数据对该数据集进行训练,我们推出了一个新的视频 LLM,即 LLaVA-Video。我们的实验表明,LLaVA-Video 在各种视频基准测试中均取得了出色的性能,突出了我们数据集的有效性。我们计划发布数据集、其生成管道和模型检查点。||
|**2024-10-03**|[LLaVA-Critic: Learning to Evaluate Multimodal Models](http://arxiv.org/abs/2410.02712)|null|我们推出了 LLaVA-Critic,这是第一个开源的大型多模态模型 (LMM),它被设计成一个通用的评估器,用于评估各种多模态任务的性能。LLaVA-Critic 使用高质量的批评指令遵循数据集进行训练,该数据集包含不同的评估标准和场景。我们的实验结果证明了该模型在两个关键领域的有效性:(1) LMM 作为评判者,LLaVA-Critic 提供可靠的评估分数,在多个评估基准上表现与 GPT 模型相当或更优;(2) 偏好学习,它为偏好学习生成奖励信号,增强模型对齐能力。这项工作强调了开源 LMM 在自我批评和评估方面的潜力,为未来研究 LMM 可扩展的、超人的对齐反馈机制奠定了基础。||
|**2024-10-03**|[Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models](http://arxiv.org/abs/2410.02681)|null|置信度校准对于机器学习模型在现实世界中的安全部署至关重要。然而,像 CLIP 这样的视觉语言模型,特别是在微调之后,尚未完全解决这个问题。本研究表明,现有的提示微调方法通常会导致基础类别和新类别之间校准的权衡:CoOp 中的交叉熵损失通过增加文本标签差异导致对新类别的过度自信,而 KgCoOp 的正则化保持了置信度水平,但由于准确性的提高,导致对基础类别的不自信。受这些观察结果的启发,我们引入了动态异常值正则化 (DOR) 来确保微调后对基础类别和新类别的置信度校准。特别是,我们建议最小化从大型词汇表中采样的新文本标签(而不是基础类别)的特征偏差。实际上,DOR 阻止了新标签的文本差异的增加,同时放宽了对基础类别的限制。大量实验表明,DOR 可以增强当前微调方法在基础类别和新类别上的校准性能。||
|**2024-10-03**|[Guiding Long-Horizon Task and Motion Planning with Vision Language Models](http://arxiv.org/abs/2410.02193)|null|视觉语言模型 (VLM) 能够在被提示目标、上下文、场景图像和任何规划约束时生成看似合理的高级计划。但是,无法保证预测的动作对于特定的机器人实施方案在几何和运动学上是可行的。因此,在他们的计划中,许多先决条件步骤(例如打开抽屉以获取物体)经常被省略。机器人任务和运动规划器可以生成尊重动作几何可行性的运动轨迹,并插入物理上必要的动作,但无法扩展到需要常识知识并涉及由许多变量组成的大状态空间的日常问题。我们提出了 VLM-TAMP,这是一种分层规划算法,它利用 VLM 生成语义上有意义且减少范围的中间子目标,从而指导任务和运动规划器。当子目标或动作无法细化时,将再次查询 VLM 以进行重新规划。我们在厨房任务中评估 VLM-TAMP,其中机器人必须完成需要按顺序执行 30-50 个动作并与多达 21 个物体交互的烹饪目标。VLM-TAMP 的性能大大优于严格且独立地执行 VLM 生成的动作序列的基线,无论是在成功率(50% 到 100% 对比 0%)还是平均任务完成百分比(72% 到 100% 对比 15% 到 45%)。有关更多信息,请参阅项目网站 https://zt-yang.github.io/vlm-tamp-robot/。||
|**2024-10-02**|[Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations](http://arxiv.org/abs/2410.02086)|null|多模态学习在使机器学习模型能够融合和利用文本、图像和音频等不同数据源以支持各种下游任务方面发挥着至关重要的作用。跨各种模态的统一表示对于提高效率和性能尤为重要。最近的绑定方法,如ImageBind(Girdhar等人,2023),通常使用固定的锚点模态来对齐锚点模态嵌入空间中的多模态数据。在本文中,我们对固定锚点绑定方法进行了数学分析,并发现了其显著的局限性:(1)过度依赖于锚点模态的选择,(2)无法捕获模态内信息,以及(3)无法解释非锚点模态之间的模态间相关性。为了解决这些局限性,我们提出了CentroBind,这是一种简单而强大的方法,它消除了对固定锚点的需求;相反,它采用从所有可用模态生成的动态可调的基于质心的锚点,从而产生平衡且丰富的表示空间。我们从理论上证明了我们的方法捕获了多模态学习的三个关键属性:模态内学习、模态间学习和多模态对齐,同时还在所有模态中构建了一个稳健的统一表示。我们在合成数据集和真实世界数据集上的实验都证明了该方法的优越性,表明动态锚点方法优于所有固定锚点绑定方法,因为前者捕获了更细微的多模态交互。||
|**2024-10-02**|[Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning](http://arxiv.org/abs/2410.02052)|null|自主智能体在自动化复杂的多步决策任务中展现出巨大潜力。然而,即使是最先进的视觉语言模型(VLM),例如GPT-4o,在复杂网络环境和长期规划任务中仍未达到人类水平。为了解决这些限制,我们引入了反射蒙特卡洛树搜索(R-MCTS),这是一种新颖的测试时算法,旨在增强人工智能体(例如由GPT-4o驱动的智能体)动态探索决策空间的能力。R-MCTS通过以下方式扩展了传统的MCTS:1)结合对比反射,使智能体能够从过去的交互中学习并动态提高其搜索效率;2)使用多智能体辩论来提供可靠的状态评估。此外,我们通过自我学习微调GPT-4o来提高智能体的性能,使用R-MCTS生成的树遍历,无需任何人工提供的标签。在具有挑战性的VisualWebArena基准测试中,我们基于GPT-4o的R-MCTS智能体在各种任务中比之前的最先进技术实现了6%到30%的相对改进。此外,我们还表明,从测试时搜索中获得的知识可以通过微调有效地转移回GPT-4o。经过微调的GPT-4o在测试时可以达到R-MCTS性能的97%,同时计算量减少了四倍。此外,定性结果表明,经过微调的GPT-4o模型能够探索环境、评估状态,并在检测到当前状态无法导致成功时回溯到可行的状态。此外,我们的工作展示了训练(使用R-MCTS收集数据)和测试时的计算扩展特性。这些结果为通过测试时搜索和自我学习来增强VLM的推理和规划能力,以用于智能体应用,提出了一个有希望的研究方向。||
|**2024-09-30**|[HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding](http://arxiv.org/abs/2409.20429)|null|大型视觉语言模型 (LVLM) 在许多视觉语言任务中都表现出了非凡的性能。然而,这些模型仍然受到多模态幻觉的影响,这意味着会生成违反图像内容的对象或内容。许多现有工作通过直接判断一个对象是否存在于图像中来检测幻觉,而忽略了对象与语义之间的关联。为了解决这个问题,我们提出了视觉增强惩罚解码的分层反馈学习 (HELPD)。该框架在对象和句子语义层面都纳入了幻觉反馈。值得注意的是,即使训练程度不高,这种方法也可以减少 15% 以上的幻觉。同时,HELPD 根据图像注意力窗口惩罚输出 logits,以避免过度受生成文本的影响。HELPD 可以无缝集成到任何 LVLMs 中。我们的实验表明,所提出的框架在多个幻觉基准测试中产生了良好的结果。它有效地减轻了不同 LVLMs 的幻觉,同时提高了它们的文本生成质量。||
|**2024-09-30**|[CableInspect-AD: An Expert-Annotated Anomaly Detection Dataset](http://arxiv.org/abs/2409.20353)|**[link](https://github.com/mila-iqia/cableinspect-ad-code)**|机器学习模型正越来越多地部署在现实环境中。然而,关于其对特定和关键应用的可迁移性的系统研究在研究文献中却鲜有报道。一个重要的例子是用于机器人电力线巡检的视觉异常检测 (VAD)。虽然现有的 VAD 方法在受控环境中表现良好,但现实场景中存在着当前数据集无法捕捉到的各种意外异常。为了弥补这一差距,我们推出了 $\textit{CableInspect-AD}$,这是一个由加拿大公用事业公司 Hydro-Québec 的领域专家创建和标注的高质量、公开可用的数据集。该数据集包含具有挑战性的现实世界异常的高分辨率图像,涵盖了不同严重程度的缺陷。为了解决为设置检测阈值而收集各种异常和正常样本的挑战,我们建议对著名的 PatchCore 算法进行增强。这种增强使其能够在标记数据有限的情况下使用。我们还提出了一个基于交叉验证的综合评估方案,以评估模型的性能。我们评估了我们的 $\textit{Enhanced-PatchCore}$ 在少样本和多样本检测方面的性能,以及视觉语言模型在零样本检测方面的性能。虽然这些模型很有前景,但它们难以检测所有异常,这突出了该数据集作为一个具有挑战性的基准对更广泛研究群体的价值。项目页面:https://mila-iqia.github.io/cableinspect-ad/。||
|**2024-09-30**|[Visual Context Window Extension: A New Perspective for Long Video Understanding](http://arxiv.org/abs/2409.20018)|null|大型多模态模型 (LMM) 在短视频理解任务中表现出色,但在应用于长视频理解时面临巨大挑战。相比之下,大型语言模型 (LLM) 在建模长文本方面表现出色。现有工作试图通过在训练期间引入长视频-文本对来解决这个问题。然而,这些方法需要大量的计算和数据资源。在本文中,我们从上下文窗口的角度来应对长视频理解的挑战,旨在将 LMM 应用于长视频任务,而无需在长视频数据集上重新训练。我们首先深入分析了预训练的 LMM 难以理解长视频内容的原因,发现视觉和语言模态之间的差异导致视觉和语言标记的上下文窗口不同,这使得直接扩展视觉标记以匹配语言上下文窗口变得困难。基于此,我们建议通过扩展视觉上下文窗口来调整 LMM 以适应长视频理解任务,从而无需在大型长视频数据集上重新训练。为了进一步减少长序列导致的大量内存消耗,我们引入了一种渐进式池化推理策略,该策略选择性地调整帧嵌入的空间分辨率,在保留重要空间信息的同时减少视觉标记的数量。在多个长视频理解基准测试中,我们的方法随着视频帧数量的增加而持续提高性能。在 MLVU 基准测试中,我们的方法优于 GPT-4o,即使我们的模型大小只有 7B。此外,在 256 帧设置中,与基线相比,我们的方法将内存使用量减少了大约 45%,而不会导致任何性能损失。||
|**2024-09-30**|[Towards Robust Multimodal Sentiment Analysis with Incomplete Data](http://arxiv.org/abs/2409.20012)|**[link](https://github.com/haoyu-ha/lnln)**|多模态情感分析(MSA)领域最近出现了一个新兴方向,旨在解决数据不完整性问题。认识到语言模态通常包含密集的情感信息,我们将其视为主要模态,并提出了一种创新的语言主导抗噪学习网络(LNLN),以实现稳健的MSA。所提出的LNLN具有主要模态校正(DMC)模块和基于主要模态的多模态学习(DMML)模块,通过确保主要模态表示的质量,增强了模型在各种噪声场景下的鲁棒性。除了方法论设计之外,我们还在随机数据缺失场景下进行了全面的实验,在几个流行的数据集(例如MOSI、MOSEI和SIMS)上使用了多样化且有意义的设置,与文献中的现有评估相比,提供了额外的统一性、透明度和公平性。根据经验,LNLN始终优于现有的基线,在这些具有挑战性和广泛的评估指标中表现出卓越的性能。||
|**2024-09-30**|[Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels](http://arxiv.org/abs/2409.19846)|null|像 CLIP 这样的大规模视觉语言模型在图像级任务中表现出了令人印象深刻的开放词汇能力,在识别物体方面表现出色。然而,它们在语义分割等像素级识别任务中却表现不佳,因为这些任务还需要理解物体的位置。在这项工作中,我们提出了一种名为 PixelCLIP 的新方法,通过使用从 SAM 和 DINO 等视觉基础模型生成的未标记图像和掩码来指导模型识别物体的位置,从而使 CLIP 图像编码器适应像素级理解。为了解决在没有语义标签的情况下利用掩码的挑战,我们设计了一种使用可学习类名的在线聚类算法来获取一般的语义概念。PixelCLIP 在开放词汇语义分割方面比 CLIP 显示出显著的性能提升,并且与字幕监督方法相比具有竞争力的结果。项目页面:https://cvlab-kaist.github.io/PixelCLIP||
|**2024-09-29**|[PALM: Few-Shot Prompt Learning for Audio Language Models](http://arxiv.org/abs/2409.19806)|null|音频语言模型(ALM)最近在零样本音频识别任务中取得了显著成果,其灵感来自视觉语言模型(VLM)的进步,将音频波形的特征与特定类别的文本提示特征相匹配。鉴于零样本性能对人工设计文本提示选择的敏感性,已经为VLM开发了许多提示学习技术。我们探索了这些方法在ALM中的有效性,并提出了一种名为“音频语言模型中的提示学习”(PALM)的新方法,该方法优化了文本编码器分支的特征空间。与在输入空间中工作的现有方法不同,我们的方法实现了更高的训练效率。我们在11个音频识别数据集上证明了我们方法的有效性,这些数据集涵盖了各种语音处理任务,并在少样本学习设置中将结果与三个基线进行了比较。我们的方法在计算量较小的同时,其性能与其他方法相当或更优。代码可在https://asif-hanif.github.io/palm/获取。||
|**2024-09-29**|[Vision-Language Models are Strong Noisy Label Detectors](http://arxiv.org/abs/2409.19696)|**[link](https://github.com/HotanLee/DeFT)**|最近关于视觉语言模型微调的研究表明,其在下游任务中表现出色。然而,在实际应用中获取准确标记数据的挑战给微调过程带来了重大障碍。为了应对这一挑战,本文提出了一种名为 DeFT 的去噪微调框架,用于视觉语言模型的适应性训练。DeFT 利用在数百万个辅助图像-文本对上预训练的文本和视觉特征的鲁棒对齐来筛选噪声标签。所提出的框架通过学习每个类别的正负文本提示来建立噪声标签检测器。正提示旨在揭示该类别的独特特征,而负提示则作为可学习的阈值,用于区分干净样本和噪声样本。我们采用参数高效的微调方法来调整预训练的视觉编码器,以促进其与学习到的文本提示对齐。作为一个通用框架,DeFT 可以通过利用精心挑选的干净样本,将许多预训练模型无缝地微调到下游任务。在七个合成和真实噪声数据集上的实验结果验证了 DeFT 在噪声标签检测和图像分类方面的有效性。||
|**2024-09-29**|[MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation](http://arxiv.org/abs/2409.19684)|**[link](https://github.com/MedHK23/MedViLaM)**|医学本质上是多模态和多任务的,具有涵盖文本、影像等多种数据模态。然而,目前大多数医学领域模型都是单模态单任务的,缺乏良好的泛化性和可解释性。在本研究中,我们介绍了MedViLaM,这是一个通用的医学数据视觉语言模型,它可以使用相同的模型权重灵活地编码和解释各种形式的医学数据,包括临床语言和影像。为了促进这种多任务模型的创建,我们策划了MultiMedBench,这是一个全面的预训练数据集和基准,包含多个不同的任务,即连续问答、多标签疾病分类、疾病定位、放射学报告的生成和总结。MedViLaM在所有MultiMedBench任务中都表现出色,经常大幅超越其他通用模型。此外,我们还展示了零样本泛化到新的医学概念和任务、跨不同任务的有效迁移学习以及零样本医学推理的出现。||
|**2024-09-29**|[Federated Learning from Vision-Language Foundation Models: Theoretical Analysis and Method](http://arxiv.org/abs/2409.19610)|**[link](https://github.com/PanBikang/PromptFolio)**|将CLIP等预训练的视觉语言基础模型整合到联邦学习中,以增强跨不同任务的泛化能力,引起了广泛关注。通常,视觉语言模型的联邦学习采用提示学习来降低通信和计算成本,即基于提示的联邦学习。然而,目前对基于提示的联邦学习性能的理论分析还很有限。在这项工作中,我们通过特征学习理论构建了一个基于提示的联邦学习的理论分析框架。具体来说,我们监控了基于提示的联邦学习中信号学习和噪声记忆的演变,证明了可以通过与任务相关和与任务无关的系数之比来评估性能。此外,我们将投资组合优化中的收益和风险与特征学习中的任务相关和任务无关项进行了类比。受投资组合优化理论的启发,即组合两种独立资产将保持收益,同时降低风险,我们引入了两种提示:全局提示和局部提示,以构建一个提示组合来平衡泛化性和个性化。因此,我们展示了提示组合的性能优势,并推导出了最佳混合系数。这些理论主张得到了进一步的实证实验的支持。||
|**2024-09-28**|[FairPIVARA: Reducing and Assessing Biases in CLIP-Based Multimodal Models](http://arxiv.org/abs/2409.19474)|**[link](https://github.com/hiaac-nlp/fairpivara)**|尽管视觉语言模型取得了重大进展并得到广泛应用,但很少有研究探讨其伦理含义。这些模型通常需要大量的训练数据,而这些数据往往来自仓促审查的文本和图像数据集,导致数据集高度失衡并引发伦理问题。此外,最初用英语训练的模型经常针对其他语言进行微调,例如 CLIP 模型,可以通过添加更多数据来增强其功能,但也可能引入新的偏差。CAPIVARA 是一种基于 CLIP 模型并适用于葡萄牙语的模型,在零样本任务中表现出色。在本文中,我们评估了视觉语言模型中的四种不同类型的歧视性做法,并介绍了 FairPIVARA,这是一种通过移除特征嵌入中受影响最大的维度来减少这些做法的方法。FairPIVARA 的应用显著减少了高达 98% 的观察到的偏差,同时促进了模型中更平衡的词语分布。我们的模型和代码可在以下网址获取:https://github.com/hiaac-nlp/FairPIVARA。||
|**2024-09-27**|[Image-guided topic modeling for interpretable privacy classification](http://arxiv.org/abs/2409.18674)|**[link](https://github.com/idiap/itm)**|用人类可理解的术语预测和解释图像中包含的隐私信息是一项复杂且依赖于上下文的任务。即使对于大型语言模型来说,这项任务也具有挑战性。为了促进对隐私决策的理解,我们建议根据一组自然语言内容描述符来预测图像隐私。这些内容描述符与隐私分数相关联,这些分数反映了人们如何看待图像内容。我们使用我们新颖的图像引导主题建模(ITM)方法生成描述符。ITM 通过多模态对齐,利用来自视觉语言模型的视觉信息和图像文本描述。我们使用 ITM 生成的描述符来学习隐私预测器 Priv×ITM,其决策在设计上是可解释的。我们的 Priv×ITM 分类器在准确率方面比参考的可解释方法高出 5 个百分点,并且性能与当前最先进的不可解释模型相当。||
|**2024-09-26**|[LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness](http://arxiv.org/abs/2409.18125)|null|大型多模态模型 (LMM) 近期的进步极大地提高了其在 2D 视觉理解任务中的能力,使其能够有效地处理和理解图像和视频。然而,由于缺乏大规模 3D 视觉语言数据集和强大的 3D 编码器,具有 3D 感知能力的 LMM 在 3D 场景理解方面的开发一直受到阻碍。在本文中,我们介绍了一种简单而有效的框架,称为 LLaVA-3D。LLaVA-3D 利用 LLaVA 强大的 2D 理解先验知识,有效地将 LLaVA 应用于 3D 场景理解,而不会影响其 2D 理解能力。为了实现这一点,我们采用了一种简单有效的表示方法,即 3D Patch,它将 2D CLIP 图像块特征与其在 3D 空间中的对应位置连接起来。通过将 3D Patch 集成到 2D LMM 中,并采用联合 2D 和 3D 视觉语言指令微调,我们建立了一个用于 2D 图像理解和 3D 场景理解的统一架构。实验结果表明,在 3D 视觉语言数据集上训练时,LLaVA-3D 的收敛速度比现有 3D LMM 快 3.5 倍。此外,LLaVA-3D 不仅在各种 3D 任务上实现了最先进的性能,而且还保持了与 LLaVA 相当的 2D 图像理解和视觉语言对话能力。||
|**2024-09-26**|[EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions](http://arxiv.org/abs/2409.18042)|null|GPT-4o,一个能够进行带有不同情感和语调的语音对话的多模态模型,标志着多模态基础模型的一个里程碑。然而,在开源社区中,使用公开可用的数据赋予大型语言模型以端到端的方式感知和生成图像、文本和语音仍然具有挑战性。现有的视觉语言模型依赖于外部工具进行语音处理,而语音语言模型仍然存在视觉理解能力有限甚至没有的问题。为了解决这个问题,我们提出了EMOVA(情感无所不在的语音助手),它使大型语言模型具备端到端的语音能力,同时保持领先的视觉语言性能。利用语义-声学解耦的语音标记器,我们惊奇地发现,与相应的双模态对齐模型相比,多模态对齐可以进一步增强视觉语言和语音能力。此外,我们还提出了一个轻量级的风格模块,用于灵活控制语音风格(例如情感和音调)。EMOVA首次在视觉语言和语音基准测试中均实现了最先进的性能,同时支持具有生动情感的多模态语音对话。||
|**2024-09-26**|[DARE: Diverse Visual Question Answering with Robustness Evaluation](http://arxiv.org/abs/2409.18023)|null|视觉语言模型 (VLM) 扩展了仅文本大型语言模型和仅视觉模型的卓越能力,并且能够从多模态视觉文本输入中学习和处理。虽然现代 VLM 在许多标准图像分类和图像文本匹配任务中表现良好,但它们仍然难以应对许多关键的视觉语言 (VL) 推理能力,例如计数和空间推理。此外,虽然它们可能对指令和/或评估协议的微小变化非常脆弱,但现有基准测试未能评估它们的稳健性(或者更确切地说是缺乏稳健性)。为了将具有挑战性的 VL 场景与全面的稳健性评估相结合,我们引入了 DARE,即具有稳健性评估的多样化视觉问答,这是一个精心创建和策划的多项选择 VQA 基准。DARE 评估 VLM 在五个不同类别上的性能,并包括四个基于以下变化的面向稳健性的评估:提示、答案选项子集、输出格式和正确答案的数量。在一系列其他发现中,我们报告说,最先进的 VLM 仍然难以回答大多数类别中的问题,并且无法在测试的稳健性评估中始终如一地提供其峰值性能。选项子集的最差情况性能比标准情况下的性能低 34%。诸如 LLaVA 1.6 和 Idefics2 等开源 VLM 的稳健性无法与 GPT-4 和 Gemini 等闭源模型相提并论,但即使是后者仍然非常容易受到不同变化的影响。||
|**2024-09-26**|[The Hard Positive Truth about Vision-Language Compositionality](http://arxiv.org/abs/2409.17958)|**[link](https://github.com/amitakamath/hard_positives)**|多项基准测试得出结论,我们最好的视觉语言模型(例如 CLIP)缺乏组合性。给定一张图像,这些基准测试会探测模型从一组组合干扰项中识别其关联标题的能力。作为回应,最近涌现出大量提案,表明通过使用干扰项作为强负例对 CLIP 进行微调可以改进模型。我们的调查表明,这些改进实际上被严重夸大了——因为现有的基准测试没有探究微调后的视觉语言模型是否对强正例保持不变。通过使用 112,382 个强负例和强正例整理评估数据集,我们发现包含强正例会使 CLIP 的性能降低 12.9%,而人类则可以毫不费力地达到 99% 的准确率。使用强负例微调 CLIP 会导致更大的性能下降,高达 38.7%。基于这一发现,我们制作了一个包含 1,775,259 个图像-文本对的训练集,其中含有强负例和强正例标题。通过同时使用两者进行训练,我们看到现有基准测试的性能有所提高,同时强正例的性能也有所提高,这表明组合性得到了更稳健的改进。我们的工作表明,未来的研究需要严格测试和改进 CLIP 对相关“正”概念之间语义关系的理解。||
|**2024-09-26**|[A Multimodal Single-Branch Embedding Network for Recommendation in Cold-Start and Missing Modality Scenarios](http://arxiv.org/abs/2409.17864)|**[link](https://github.com/hcai-mms/sibrar---single-branch-recommender)**|大多数推荐系统采用协同过滤 (CF) 并根据过去的集体交互提供推荐。因此,当可用交互很少或没有交互时,CF 算法的性能会下降,这种情况称为冷启动。为了解决这个问题,以前的工作依赖于利用协作数据和用户或项目辅助信息的模型。类似于多模态学习,这些模型旨在将协作和内容表示组合到共享嵌入空间中。在这项工作中,我们提出了一种新的多模态推荐技术,它依赖于用于推荐的多模态单分支嵌入网络 (SiBraR)。SiBraR 利用权重共享,在不同模态上使用相同的单分支嵌入网络对交互数据以及多模态辅助信息进行编码。这使得 SiBraR 在缺少模态的情况下(包括冷启动)非常有效。我们对来自三个不同推荐域(音乐、电影和电子商务)并提供多模态内容信息(音频、文本、图像、标签和交互)的大规模推荐数据集进行了广泛实验,结果表明,SiBraR 在冷启动场景下明显优于 CF 以及最先进的基于内容的 RS,并且在热启动场景下也具有竞争力。我们证明了 SiBraR 的推荐在缺少模态的情况下是准确的,并且该模型能够将不同的模态映射到共享嵌入空间的同一区域,从而减少了模态差距。||
|**2024-09-26**|[Cascade Prompt Learning for Vision-Language Model Adaptation](http://arxiv.org/abs/2409.17805)|**[link](https://github.com/megvii-research/caspl)**|提示学习已成为一种有效的方法,可以提高视觉语言模型 (VLM)(如 CLIP)在下游任务中的性能。然而,当前的可学习提示标记主要用于适应任务的单一阶段(即,调整提示),容易导致过拟合风险。在这项工作中,我们提出了一种新颖的级联提示学习 CasPL 框架,使提示学习能够同时服务于通用和特定专业知识(即,增强和调整提示)。具体来说,CasPL 是一种新的学习范式,包括两个不同阶段的可学习提示:第一个增强提示旨在通过使用大量未标记的域图像对齐其预测的 logits,从高级更大的 CLIP 教师模型中提取域通用知识。然后,第二个调整提示与冻结的第一组级联,以微调下游任务,遵循先前研究中采用的方法。通过这种方式,CasPL 可以有效地将域通用和任务特定表示捕获到明确不同的渐进提示组中,从而潜在地缓解目标域中的过拟合问题。值得注意的是,CasPL 作为一个即插即用的模块,可以无缝集成到任何现有的提示学习方法中。CasPL 在性能和推理速度之间实现了显著更好的平衡,这对于在资源受限的环境中部署较小的 VLM 模型特别有利。与先前最先进的方法 PromptSRC 相比,CasPL 在 11 个图像分类数据集上,基本类别平均提高了 1.85%,新类别平均提高了 3.44%,调和平均值平均提高了 2.72%。代码公开地址:https://github.com/megvii-research/CasPL。||
|**2024-09-26**|[Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification](http://arxiv.org/abs/2409.17777)|**[link](https://github.com/RaghavSinghal10/M3CoL)**|深度多模态学习通过利用对比学习来捕捉模态之间显式的一对一关系,已经展现出显著的成果。然而,现实世界的数据往往表现出超越简单成对关联的共享关系。我们提出了M3CoL,一种多模态混合对比学习方法,用于捕捉多模态数据中固有的细微共享关系。我们的主要贡献是一种基于混合的对比损失函数,它通过将来自一种模态的混合样本与其来自其他模态的对应样本对齐来学习鲁棒的表示,从而捕捉它们之间的共享关系。对于多模态分类任务,我们引入了一个框架,该框架将融合模块与单模态预测模块相结合,以便在训练期间进行辅助监督,并辅以我们提出的基于混合的对比损失函数。通过对不同数据集(N24News、ROSMAP、BRCA 和 Food-101)的广泛实验,我们证明了 M3CoL 可以有效地捕捉共享的多模态关系并在不同领域泛化。它在 N24News、ROSMAP 和 BRCA 上的表现优于最先进的方法,同时在 Food-101 上取得了可比的性能。我们的工作突出了学习共享关系对于鲁棒的多模态学习的重要性,为未来的研究开辟了有希望的途径。||
|**2024-09-26**|[Robotic-CLIP: Fine-tuning CLIP on Action Data for Robotic Applications](http://arxiv.org/abs/2409.17727)|null|视觉语言模型在为各种机器人应用提取有意义的特征方面发挥了关键作用。其中,对比语言-图像预训练 (CLIP) 广泛应用于需要视觉和自然语言理解的机器人任务。然而,CLIP 仅在与文本提示配对的静态图像上进行训练,尚未完全适应涉及动态动作的机器人任务。在本文中,我们介绍了 Robotic-CLIP 来增强机器人的感知能力。我们首先收集和标记大规模动作数据,然后使用对比学习在 309,433 个视频(约 740 万帧)的动作数据上微调 CLIP,构建我们的 Robotic-CLIP。通过利用动作数据,Robotic-CLIP 继承了 CLIP 强大的图像性能,同时获得了理解机器人环境中动作的能力。大量实验表明,我们的 Robotic-CLIP 在各种语言驱动的机器人任务中优于其他基于 CLIP 的模型。此外,我们还展示了 Robotic-CLIP 在现实世界抓取应用中的实际有效性。||
|**2024-09-26**|[MIO: A Foundation Model on Multimodal Tokens](http://arxiv.org/abs/2409.17692)|**[link](https://github.com/mio-team/mio)**|本文介绍了一种基于多模态token的新型基础模型MIO,它能够以端到端、自回归的方式理解和生成语音、文本、图像和视频。尽管大型语言模型(LLM)和多模态大型语言模型(MM-LLM)凭借其多功能性推动了人工智能通用性的进步,但它们仍然缺乏真正的任意模态之间理解和生成的能力。最近,GPT-4o的发布展示了任意模态之间LLM在处理复杂现实世界任务方面的巨大潜力,它能够实现图像、语音和文本之间的全向输入和输出。然而,它是一个闭源模型,并且不支持生成多模态交错序列。为了解决这个问题,我们提出了MIO,它使用因果多模态建模在四种模态的离散token混合数据集上进行训练。MIO经历了四个训练阶段:(1)对齐预训练,(2)交错预训练,(3)语音增强预训练,以及(4)针对不同文本、视觉和语音任务的综合监督微调。我们的实验结果表明,与之前的双模态基线、任意模态之间模型基线,甚至是特定模态基线相比,MIO表现出具有竞争力的性能,在某些情况下甚至更胜一筹。此外,MIO还展示了其任意模态之间功能所带来的高级能力,例如交错视频文本生成、视觉思维链推理、视觉指南生成、指令图像编辑等。||
|**2024-09-26**|[P4Q: Learning to Prompt for Quantization in Visual-language Models](http://arxiv.org/abs/2409.17634)|null|大规模预训练的视觉语言模型(VLM)在各种视觉和多模态任务中取得了显著成果,但由于其对训练样本和计算资源的巨大需求,将VLM部署到下游应用平台仍然具有挑战性。对VLM进行微调和量化可以显著降低样本和计算成本,因此迫切需要这方面的研究。量化领域目前存在两种主要范式:量化感知训练(QAT)可以有效地量化大规模VLM,但会产生巨大的训练成本;而低比特位后训练量化(PTQ)则存在明显的性能下降问题。我们提出了一种平衡微调和量化的方法,称为“量化提示”(P4Q),其中我们设计了一种轻量级架构,利用对比损失监督来增强PTQ模型的识别性能。我们的方法可以有效地减少由低比特位量化引起的图像特征和文本特征之间的差距,其方法是基于可学习的提示来重组文本表示,并使用低比特位适配器重新调整图像和文本特征的分布。我们还引入了一种基于余弦相似度预测的蒸馏损失,以使用全精度教师模型对量化模型进行蒸馏。大量的实验结果表明,我们的P4Q方法优于现有技术,甚至可以达到与其全精度模型相当的结果。例如,我们的8位P4Q理论上可以将CLIP-ViT/B-32压缩4倍,同时在ImageNet数据集上实现66.94%的Top-1准确率,比可学习提示微调的全精度模型高出2.24%,而额外的参数可以忽略不计。||
|**2024-09-18**|[Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution](http://arxiv.org/abs/2409.12191)|**[link](https://github.com/qwenlm/qwen2-vl)**|我们推出了Qwen2-VL系列,这是对先前Qwen-VL模型的先进升级,它重新定义了视觉处理中传统的预定分辨率方法。Qwen2-VL引入了朴素动态分辨率机制,使模型能够将不同分辨率的图像动态处理成不同数量的视觉标记。这种方法允许模型生成更高效、更准确的视觉表示,与人类的感知过程紧密一致。该模型还集成了多模态旋转位置嵌入(M-RoPE),促进了文本、图像和视频中位置信息的有效融合。我们采用统一的范式来处理图像和视频,增强了模型的视觉感知能力。为了探索大型多模态模型的潜力,Qwen2-VL研究了大型视觉语言模型(LVLM)的缩放规律。通过扩展模型规模(包括2B、8B和72B参数的版本)和训练数据量,Qwen2-VL系列实现了极具竞争力的性能。值得注意的是,Qwen2-VL-72B模型在各种多模态基准测试中取得了与GPT-4o和Claude3.5-Sonnet等领先模型相当的结果,优于其他通用模型。代码可在\url{https://github.com/QwenLM/Qwen2-VL}获取。||
|**2024-09-18**|[GauTOAO: Gaussian-based Task-Oriented Affordance of Objects](http://arxiv.org/abs/2409.11941)|null|当您的机器人使用灵巧的手或抓手抓取物体时,它应该理解物体的面向任务的可操作性 (TOAO),因为不同的任务通常需要关注物体的特定部分。为了应对这一挑战,我们提出了 GauTOAO,这是一个基于高斯的物体面向任务可操作性框架,它以零样本的方式利用视觉语言模型,在给定自然语言查询的情况下预测物体上与可操作性相关的区域。我们的方法引入了一种新的范式:“静态相机,移动物体”,使机器人在操作过程中能够更好地观察和理解手中的物体。GauTOAO 解决了现有方法的局限性,这些方法通常缺乏有效的空间分组,它使用 DINO 特征提取完整的 3D 物体掩码。然后,该掩码用于有条件地查询高斯分布,从而生成针对特定任务的、在物体上的精细语义分布。这种方法可以更准确地提取 TOAO,增强机器人对物体的理解并提高任务性能。我们通过现实世界实验验证了 GauTOAO 的有效性,证明了它能够泛化到各种任务。||
|**2024-09-18**|[LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models](http://arxiv.org/abs/2409.11919)|null|视觉语言模型 (VLM) 在众多任务中都表现出色,但与其专用或微调模型相比,它们的零样本能力可能有限。然而,微调 VLM 存在局限性,因为它需要对模型架构和权重的“白盒”访问权限,以及设计微调目标和优化超参数的专业知识,这些都特定于每个 VLM 和下游任务。在这项工作中,我们提出了 LLM-wrapper,这是一种通过利用大型语言模型 (LLM) 来推理其输出,以“黑盒”方式调整 VLM 的新方法。我们通过指代表达理解 (REC) 证明了 LLM-wrapper 的有效性,这是一项需要空间和语义推理的具有挑战性的开放词汇任务。我们的方法显著提高了现成模型的性能,与经典微调相比获得了具有竞争力的结果。||
|**2024-09-17**|[NVLM: Open Frontier-Class Multimodal LLMs](http://arxiv.org/abs/2409.11402)|null|我们推出了 NVLM 1.0,这是一系列前沿的多模态大型语言模型 (LLM),在视觉语言任务上取得了最先进的结果,可与领先的专有模型(例如 GPT-4o)和开放访问模型(例如 Llama 3-V 405B 和 InternVL 2)相媲美。值得注意的是,NVLM 1.0 在多模态训练后,其纯文本性能优于其 LLM 骨干模型。在模型设计方面,我们对仅解码器多模态 LLM(例如 LLaVA)和基于交叉注意力的模型(例如 Flamingo)进行了全面比较。基于这两种方法的优缺点,我们提出了一种新颖的架构,可以提高训练效率和多模态推理能力。此外,我们为基于图块的动态高分辨率图像引入了 1-D 图块标记设计,这显著提高了多模态推理和 OCR 相关任务的性能。关于训练数据,我们精心策划并提供有关我们多模态预训练和监督微调数据集的详细信息。我们的研究结果表明,即使在预训练阶段,在所有架构中,数据集质量和任务多样性都比规模更重要。值得注意的是,我们为 NVLM-1.0 模型开发了生产级的多模态能力,使其能够在视觉语言任务中表现出色,同时保持甚至提升相对于其 LLM 骨干模型的纯文本性能。为此,我们将高质量的纯文本数据集与大量的多模态数学和推理数据一起制作并集成到多模态训练中,从而增强了跨模态的数学和编码能力。为了推动该领域的研究,我们将发布模型权重,并将开源代码供社区使用:https://nvlm-project.github.io/。||
|**2024-09-17**|[CAST: Cross-modal Alignment Similarity Test for Vision Language Models](http://arxiv.org/abs/2409.11007)|**[link](https://github.com/gautierdag/cast)**|视觉语言模型 (VLM) 通常通过视觉问答 (VQA) 任务进行评估,这些任务评估模型对场景的理解。良好的 VQA 性能被视为该模型能够在需要视觉和语言输入的更广泛任务中表现良好的证据。然而,场景感知 VQA 并不能完全捕捉输入偏差,也不能评估由模态之间错位引起的幻觉。为了解决这个问题,我们提出了跨模态对齐相似性测试 (CAST) 来探测 VLM 在不同模态之间的自洽性。该测试包括要求模型仅通过文本、仅通过图像或两者兼用来识别两个场景之间的相似性,然后评估它们生成的相似性的真实性。由于没有可供比较的真实情况,因此该评估的重点不是客观准确性,而是 VLM 在输出方面是否内部一致。我们认为,虽然并非所有自洽模型都具有能力或准确性,但所有有能力的 VLM 都必须是自洽的。||
|**2024-09-17**|[KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph](http://arxiv.org/abs/2409.10921)|**[link](https://github.com/yanbei-jiang/artwork-interpretation)**|Exploring the narratives conveyed by fine-art paintings is a challenge in image captioning, where the goal is to generate descriptions that not only precisely represent the visual content but also offer an in-depth interpretation of the artwork's meaning. The task is particularly complex for artwork images due to their diverse interpretations and varied aesthetic principles across different artistic schools and styles. In response to this, we present KALE (Knowledge-Augmented vision-Language model for artwork Elaborations), a novel approach that enhances existing vision-language models by integrating artwork metadata as additional knowledge. KALE incorporates the metadata in two ways: firstly as direct textual input, and secondly through a multimodal heterogeneous knowledge graph. To optimize the learning of graph representations, we introduce a new cross-modal alignment loss that maximizes the similarity between the image and its corresponding metadata. Experimental results demonstrate that KALE achieves strong performance (when evaluated with CIDEr, in particular) over existing state-of-the-art work across several artwork datasets. Source code of the project is available at https://github.com/Yanbei-Jiang/Artwork-Interpretation.||
|**2024-09-16**|[Do Pre-trained Vision-Language Models Encode Object States?](http://arxiv.org/abs/2409.10488)|null|For a vision-language model (VLM) to understand the physical world, such as cause and effect, a first step is to capture the temporal dynamics of the visual world, for example how the physical states of objects evolve over time (e.g. a whole apple into a sliced apple). Our paper aims to investigate if VLMs pre-trained on web-scale data learn to encode object states, which can be extracted with zero-shot text prompts. We curate an object state recognition dataset ChangeIt-Frames, and evaluate nine open-source VLMs, including models trained with contrastive and generative objectives. We observe that while these state-of-the-art vision-language models can reliably perform object recognition, they consistently fail to accurately distinguish the objects' physical states. Through extensive experiments, we identify three areas for improvements for VLMs to better encode object states, namely the quality of object localization, the architecture to bind concepts to objects, and the objective to learn discriminative visual and language encoders on object states. Data and code are released.||
|**2024-09-16**|[CtRNet-X: Camera-to-Robot Pose Estimation in Real-world Conditions Using a Single Camera](http://arxiv.org/abs/2409.10441)|null|Camera-to-robot calibration is crucial for vision-based robot control and requires effort to make it accurate. Recent advancements in markerless pose estimation methods have eliminated the need for time-consuming physical setups for camera-to-robot calibration. While the existing markerless pose estimation methods have demonstrated impressive accuracy without the need for cumbersome setups, they rely on the assumption that all the robot joints are visible within the camera's field of view. However, in practice, robots usually move in and out of view, and some portion of the robot may stay out-of-frame during the whole manipulation task due to real-world constraints, leading to a lack of sufficient visual features and subsequent failure of these approaches. To address this challenge and enhance the applicability to vision-based robot control, we propose a novel framework capable of estimating the robot pose with partially visible robot manipulators. Our approach leverages the Vision-Language Models for fine-grained robot components detection, and integrates it into a keypoint-based pose estimation network, which enables more robust performance in varied operational conditions. The framework is evaluated on both public robot datasets and self-collected partial-view datasets to demonstrate our robustness and generalizability. As a result, this method is effective for robot pose estimation in a wider range of real-world manipulation scenarios.||
|**2024-09-16**|[HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models](http://arxiv.org/abs/2409.10419)|null|Robots interacting with humans through natural language can unlock many applications such as Referring Grasp Synthesis (RGS). Given a text query, RGS determines a stable grasp pose to manipulate the referred object in the robot's workspace. RGS comprises two steps: visual grounding and grasp pose estimation. Recent studies leverage powerful Vision-Language Models (VLMs) for visually grounding free-flowing natural language in real-world robotic execution. However, comparisons in complex, cluttered environments with multiple instances of the same object are lacking. This paper introduces HiFi-CS, featuring hierarchical application of Featurewise Linear Modulation (FiLM) to fuse image and text embeddings, enhancing visual grounding for complex attribute-rich text queries encountered in robotic grasping. Visual grounding associates an object in 2D/3D space with natural language input and is studied in two scenarios: closed and open vocabulary. HiFi-CS features a lightweight decoder combined with a frozen VLM and outperforms competitive baselines in closed-vocabulary settings while being 100x smaller in size. Our model can effectively guide open-set object detectors like GroundedSAM to enhance open-vocabulary performance. We validate our approach through real-world RGS experiments using a 7-DOF robotic arm, achieving 90.33% visual grounding accuracy in 15 tabletop scenes. We include our codebase in the supplementary material.||
|**2024-09-19**|[IRIS: Interactive Responsive Intelligent Segmentation for 3D Affordance Analysis](http://arxiv.org/abs/2409.10078)|null|Recent advancements in large language and vision-language models have significantly enhanced multimodal understanding, yet translating high-level linguistic instructions into precise 3D spatial robot actions remains challenging. This paper introduces IRIS (Interactive Responsive Intelligent Segmentation), a novel training-free multimodal system for 3D affordance segmentation, along with a benchmark for evaluating interactive language-guided affordances in everyday environments. IRIS integrates a large multimodal model with a specialized 3D vision network, enabling seamless fusion of 2D and 3D visual understanding with language comprehension. To facilitate evaluation, we provide a dataset of 10 typical indoor environments, each with 50 images annotated with object actions and 3D affordance segmentations. Extensive experiments demonstrate that IRIS can handle interactive 3D affordance segmentation tasks across diverse environments, showing competitive performance on various metrics. Our results highlight IRIS's potential for enhancing affordance-based human-robot interaction in complex indoor environments, advancing the development of more intuitive and efficient robotic systems for real-world applications.||
|**2024-09-15**|[FSL-LVLM: Friction-Aware Safety Locomotion using Large Vision Language Model in Wheeled Robots](http://arxiv.org/abs/2409.09845)|null|Wheeled-legged robots offer significant advantages in mobility and versatility but face substantial challenges when operating on slippery terrain. Traditional model-based controllers for these robots assume no slipping. While reinforcement learning (RL) helps quadruped robots adapt to different surfaces, recovering from slips remains challenging, especially for systems with few contact points. Estimating the ground friction coefficient is another open challenge. In this paper, we propose a novel friction-aware safety locomotion framework that integrates a large vision-language model (LVLM) with an RL policy. Our approach explicitly incorporates the estimated friction coefficient into the RL policy, enabling the robot to adjust its behavior in advance based on the surface type before reaching it. We introduce a Friction-From-Vision (FFV) module that leverages the LVLM to estimate ground friction coefficients, eliminating the need for large datasets and extensive training. The framework was validated on a customized wheeled inverted pendulum, and experimental results show that it increases the success rate of completing driving tasks by adjusting speed according to terrain type, while achieving better tracking performance than baseline methods. Our framework can be easily integrated with any other RL policy.||
|**2024-09-15**|[Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models](http://arxiv.org/abs/2409.09788)|null|Although recent studies have shown that vision-language models (VLMs) can describe complex relationships in images using natural language, their ability to quantitatively reason about object sizes and distances remains underexplored. In this work, we introduce a manually annotated benchmark, Q-Spatial Bench, with 271 questions across five categories designed for quantitative spatial reasoning, and systematically study the performance of state-of-the-art VLMs on this task. Our analysis reveals that reasoning about distances between objects is particularly challenging for SoTA VLMs; however, some VLMs perform considerably better than others, with a gap of over 40 points between the two best-performing models. We also make the surprising observation that the success rate of the top-performing VLM increases by 19 points when a reasoning path using a reference object naturally emerges in the response. Inspired by this observation, we develop a zero-shot prompting technique, SpatialPrompt, which encourages VLMs to use reference objects as visual cues when answering quantitative spatial questions. By instructing VLMs to use reference objects in their reasoning paths via SpatialPrompt, Gemini 1.5 Pro, Gemini 1.5 Flash, and GPT-4V improve their success rates by over 40, 20, and 30 points, respectively. We emphasize that these significant improvements are obtained without more data, model architectural modifications, or fine-tuning.||
|**2024-09-15**|[Finetuning CLIP to Reason about Pairwise Differences](http://arxiv.org/abs/2409.09721)|**[link](https://github.com/dsam99/pc_clip)**|Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of its purely text-based alternatives. For instance, it has long been noted that text embeddings satisfy analogies in embedding space via vector arithmetic, a property CLIP does not have. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that differences in image embedding space correspond to text descriptions of the image differences, which we synthetically generate with large language models on image-caption paired datasets. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a given attribute (e.g., elephants are larger than cats), useful in retrieval or for constructing attribute-based classifiers, as well as improved zero-shot classification performance on many downstream image classification tasks. In addition, our approach enables a new inference mechanism that we call comparative prompting, where we leverage prior knowledge of text descriptions of differences between the classes of interest, achieving even larger performance gains in classification. Finally, we illustrate that the resulting embeddings obey a greater degree of geometric properties in embedding space, such as in text-to-image generation.||
|**2024-09-13**|[Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing](http://arxiv.org/abs/2409.08885)|null|Object detection in remote sensing imagery plays a vital role in various Earth observation applications. However, unlike object detection in natural scene images, this task is particularly challenging due to the abundance of small, often barely visible objects across diverse terrains. To address these challenges, multimodal learning can be used to integrate features from different data modalities, thereby improving detection accuracy. Nonetheless, the performance of multimodal learning is often constrained by the limited size of labeled datasets. In this paper, we propose to use Masked Image Modeling (MIM) as a pre-training technique, leveraging self-supervised learning on unlabeled data to improve detection performance. However, conventional MIM methods such as MAE use masked tokens without contextual information, struggling to capture fine-grained details due to the lack of interaction with other parts of the image. To address this, we propose a new interactive MIM method that establishes interactions between different tokens, which is particularly beneficial for object detection in remote sensing. Extensive ablation studies and evaluations demonstrate the effectiveness of our approach.||
|**2024-09-13**|[A Multimodal Approach for Fluid Overload Prediction: Integrating Lung Ultrasound and Clinical Data](http://arxiv.org/abs/2409.08790)|null|Maintaining fluid balance in dialysis patients is crucial, as mismanagement can lead to severe complications. In this paper, we propose a multimodal approach that integrates visual features from lung ultrasound images with clinical data to enhance the accuracy of excess body fluid prediction. Our framework employs independent encoders to extract features for each modality and combines them through a cross-domain attention mechanism to capture complementary information. By framing the prediction as a classification task, the model achieves better performance than regression. The results demonstrate that multimodal models consistently outperform single-modality models, especially when the attention mechanism prioritizes tabular data. Pseudo-sample generation further helps mitigate data imbalance in the classification problem, achieving the highest accuracy of 88.31%. This study underscores the effectiveness of multimodal learning for fluid overload management in dialysis patients, offering valuable insights for improving clinical outcomes.||
|**2024-09-13**|[ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning](http://arxiv.org/abs/2409.08582)|null|Remote sensing (RS) change analysis is vital for monitoring Earth's dynamic processes by detecting alterations in imagery over time. Traditional change detection excels at identifying pixel-level changes but lacks the ability to contextualize these alterations. While recent advances in change captioning provide natural-language descriptions of changes, they do not support interactive, user-specific queries. To address these limitations, we introduce ChangeChat, the first bitemporal vision-language model (VLM) designed specifically for RS change analysis. ChangeChat utilizes multimodal instruction tuning, enabling it to handle complex queries such as change captioning, category-specific quantification, and change localization. To enhance the model's performance, we developed the ChangeChat-87k dataset, generated using a combination of rule-based methods and GPT-assisted techniques. Experiments show that ChangeChat offers a comprehensive, interactive solution for RS change analysis, achieving performance comparable to or even better than state-of-the-art (SOTA) methods on specific tasks, and significantly surpassing the latest general-purpose model, GPT-4. Code and pre-trained weights are available at https://github.com/hanlinwu/ChangeChat.||
|**2024-09-13**|[Generalization Boosted Adapter for Open-Vocabulary Segmentation](http://arxiv.org/abs/2409.08468)|null|Vision-language models (VLMs) have demonstrated remarkable open-vocabulary object recognition capabilities, motivating their adaptation to dense prediction tasks such as segmentation. However, directly applying VLMs to such tasks remains challenging due to their lack of pixel-level granularity and the limited data available for fine-tuning, leading to overfitting and poor generalization. To address these limitations, we propose the Generalization Boosted Adapter (GBA), a novel adapter strategy that enhances the generalization and robustness of VLMs for open-vocabulary segmentation. GBA comprises two core components: (1) a Style Diversification Adapter (SDA), which decouples features into amplitude and phase components, operating solely on the amplitude to enrich the feature space representation while preserving semantic consistency; and (2) a Correlation Constraint Adapter (CCA), which employs cross-attention to establish tighter semantic associations between text categories and target regions, suppressing irrelevant low-frequency "noise" and avoiding erroneous associations. Through the synergy of the shallow SDA and the deep CCA, GBA effectively alleviates overfitting and enhances the semantic relevance of feature representations. As a simple, efficient, plug-and-play component, GBA can be flexibly integrated into various CLIP-based methods, demonstrating broad applicability and achieving state-of-the-art performance on multiple open-vocabulary segmentation benchmarks.||
|**2024-09-12**|[Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations](http://arxiv.org/abs/2409.08381)|null|Vision-language models (VLMs) like CLIP have been adapted to Multi-Label Recognition (MLR) with partial annotations through prompt learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, since the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space, without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and that learning only positive prompts combined with learned negative embeddings (PositiveCoOp) outperforms dual prompt learning approaches. Moreover, we quantify the performance benefit of prompt learning over a simple vision-features-only baseline, observing that the baseline displays strong performance comparable to the dual prompt learning approach (DualCoOp) when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.||
|**2024-09-12**|[What Makes a Maze Look Like a Maze?](http://arxiv.org/abs/2409.08202)|null|A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas, dependency-graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds the schema's concrete-to-abstract components onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different reasoning methods on a new visual abstractions dataset consisting of diverse real-world images of abstract concepts with corresponding human-labeled question-answer pairs. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.||
|**2024-09-13**|[A Comprehensive Survey on Deep Multimodal Learning with Missing Modality](http://arxiv.org/abs/2409.07825)|null|During multimodal model training and inference, data samples may miss certain modalities due to sensor limitations, cost constraints, privacy concerns, data loss, and temporal or spatial factors, leading to degraded model performance. This survey provides an overview of recent progress in Multimodal Learning with Missing Modality (MLMM), with a focus on deep learning techniques. It is the first comprehensive survey to cover the historical background and the distinction between MLMM and standard multimodal learning setups, followed by a detailed analysis of current MLMM methods, applications, and datasets, and concluding with a discussion of the field's challenges and potential future directions.||
|**2024-09-12**|[Top-down Activity Representation Learning for Video Question Answering](http://arxiv.org/abs/2409.07748)|null|Capturing complex hierarchical human activities, from atomic actions (e.g., picking up a gift, moving to the sofa, unwrapping the gift) to contextual events (e.g., celebrating Christmas), is crucial for achieving high-performance video question answering (VideoQA). Recent work has extended multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing their temporal reasoning capabilities. However, these approaches often fail to capture contextual events that decompose into multiple atomic actions distributed non-continuously over relatively long-term sequences. In this paper, to leverage the spatial visual context representation capability of the CLIP model to obtain non-continuous visual representations of contextual events in videos, we convert long-term video sequences into a spatial image domain and finetune the multimodal model LLaVA for the VideoQA task. Our approach achieves competitive performance on the STAR task and, in particular, an accuracy of 78.4% on the NExTQA task, exceeding the current state-of-the-art score by 2.8 points.||
|**2024-09-12**|[DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?](http://arxiv.org/abs/2409.07703)|**[link](https://github.com/liqiangjing/dsbench)**|Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning over large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further development of more practical, intelligent, and autonomous data science agents.||
|**2024-09-12**|[Open-Vocabulary Remote Sensing Image Semantic Segmentation](http://arxiv.org/abs/2409.07683)|**[link](https://github.com/caoql98/ovrs)**|Open-vocabulary image semantic segmentation (OVS) seeks to segment images into semantic regions across an open set of categories. Existing OVS methods commonly depend on foundational vision-language models and utilize similarity computation to tackle OVS tasks. However, these approaches are predominantly tailored to natural images and struggle with the distinctive characteristics of remote sensing images, such as rapidly changing orientations and significant scale variations. These challenges complicate OVS tasks in earth vision, requiring specialized approaches. To address this dilemma, we propose the first OVS framework specifically designed for remote sensing imagery, drawing inspiration from its distinct characteristics. In particular, to address the varying orientations, we introduce a rotation-aggregative similarity computation module that generates orientation-adaptive similarity maps as initial semantic maps. These maps are subsequently refined at both spatial and categorical levels to produce more accurate semantic maps. Additionally, to manage the significant scale variations, we integrate multi-scale image features into the upsampling process, resulting in the final scale-aware semantic masks. To advance OVS in earth vision and encourage reproducible research, we establish the first open-source OVS benchmark for remote sensing imagery, comprising four public remote sensing datasets. Extensive experiments on this benchmark demonstrate that our proposed method achieves state-of-the-art performance. All codes and datasets are available at https://github.com/caoql98/OVRS.||
|**2024-09-11**|[Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks](http://arxiv.org/abs/2409.07353)|**[link](https://github.com/speedlab-git/robust-encoder-against-jailbreak-attack)**|Large Vision-Language Models (LVLMs), trained on multimodal big datasets, have significantly advanced AI by excelling in vision-language tasks. However, these models remain vulnerable to adversarial attacks, particularly jailbreak attacks, which bypass safety protocols and cause the model to generate misleading or harmful responses. This vulnerability stems both from the inherent susceptibility of the underlying LLM and from the expanded attack surface introduced by the visual modality. We propose Sim-CLIP+, a novel defense mechanism that adversarially fine-tunes the CLIP vision encoder using a Siamese architecture. This approach maximizes cosine similarity between perturbed and clean samples, fostering resilience against adversarial manipulation. Sim-CLIP+ offers a plug-and-play solution, allowing seamless integration into existing LVLM architectures as a robust vision encoder. Unlike previous defenses, our method requires no structural modification of the LVLM and incurs minimal computational overhead. Sim-CLIP+ demonstrates effectiveness against both gradient-based adversarial attacks and various jailbreak techniques. We evaluate Sim-CLIP+ against three distinct jailbreak attack strategies and perform clean evaluations on standard downstream datasets, including COCO for image captioning and OKVQA for visual question answering. Extensive experiments demonstrate that Sim-CLIP+ maintains high clean accuracy while substantially improving robustness against gradient-based adversarial attacks and jailbreak techniques. Our code and robust vision encoder are available at https://github.com/speedlab-git/Robust-Encoder-against-Jailbreak-attack.git.||
|**2024-09-11**|[MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving](http://arxiv.org/abs/2409.07267)|**[link](https://github.com/emzucas/minidrive)**|Vision-language models (VLMs) serve as general-purpose end-to-end models in autonomous driving, performing subtasks such as prediction, planning, and perception through question-and-answer interactions. However, most existing methods rely on computationally expensive visual encoders and large language models (LLMs), making them difficult to deploy in real-world scenarios and real-time applications. Meanwhile, most existing VLMs lack the ability to process multiple images, making them hard to adapt to multi-camera perception in autonomous driving. To address these issues, we propose a novel framework named MiniDrive, which incorporates our proposed Feature Engineering Mixture of Experts (FE-MoE) module and Dynamic Instruction Adapter (DI-Adapter). The FE-MoE efficiently maps 2D features into visual token embeddings before they are input into the language model. The DI-Adapter enables the visual token embeddings to change dynamically with the instruction text embeddings, resolving the problem of static visual token embeddings for the same image in previous approaches. Compared to previous work, MiniDrive achieves state-of-the-art performance in terms of parameter size, floating-point operations, and response efficiency, with the smallest version containing only 83M parameters.||
|**2024-09-11**|[MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis](http://arxiv.org/abs/2409.07129)|null|This paper introduces MVLLaVA, an intelligent agent designed for novel view synthesis tasks. MVLLaVA integrates multiple multi-view diffusion models with the large multimodal model LLaVA, enabling it to handle a wide range of tasks efficiently. MVLLaVA represents a versatile and unified platform that adapts to diverse input types, including a single image, a descriptive caption, or a specific change in viewing azimuth, guided by language instructions for viewpoint generation. We carefully craft task-specific instruction templates, which are subsequently used to fine-tune LLaVA. As a result, MVLLaVA acquires the ability to generate novel view images based on user instructions, demonstrating its flexibility across diverse tasks. Experiments validate the effectiveness of MVLLaVA, demonstrating its robust performance and versatility in tackling diverse novel view synthesis challenges.||
|**2024-09-11**|[FSMDet: Vision-guided feature diffusion for fully sparse 3D detector](http://arxiv.org/abs/2409.06945)|null|Fully sparse 3D object detection has drawn increasing attention in recent years. However, the sparsity of features in these frameworks challenges the generation of proposals because of the limited diffusion process. In addition, the quest for efficiency has led to few works on vision-assisted fully sparse models. In this paper, we propose FSMDet (Fully Sparse Multi-modal Detection), which uses visual information to guide the LiDAR feature diffusion process while still maintaining the efficiency of the pipeline. Specifically, most fully sparse works focus on complex, customized center fusion diffusion/regression operators. We observe, however, that if appropriate object completion is performed, even the simplest interpolation operator yields satisfactory results. Inspired by this observation, we split the vision-guided diffusion process into two modules: a Shape Recovery Layer (SRLayer) and a Self Diffusion Layer (SDLayer). The former uses RGB information to recover the shape of the visible part of an object, and the latter uses a visual prior to further spread the features toward the center region. Experiments demonstrate that our approach successfully improves the performance of previous LiDAR-only fully sparse models and reaches SOTA performance among multimodal models. Meanwhile, thanks to the sparse architecture, our method is up to 5x more efficient than previous SOTA methods during inference.||
|**2024-09-10**|[ExIQA: Explainable Image Quality Assessment Using Distortion Attributes](http://arxiv.org/abs/2409.06853)|null|Blind Image Quality Assessment (BIQA) aims to develop methods that estimate the quality scores of images in the absence of a reference image. In this paper, we approach BIQA from a distortion identification perspective, where our primary goal is to predict distortion types and strengths using Vision-Language Models (VLMs) such as CLIP, owing to their extensive knowledge and generalizability. Based on these predicted distortions, we then estimate the quality score of the image. To this end, we propose an explainable approach to distortion identification based on attribute learning. Instead of prompting VLMs with distortion names, we prompt them with the attributes or effects of distortions and aggregate this information to infer the distortion strength. Additionally, we consider multiple distortions per image, making our method more scalable. To support this, we generate a dataset consisting of 100,000 images for efficient training. Finally, the attribute probabilities are retrieved and fed into a regressor to predict the image quality score. The results show that our approach, besides its explainability and transparency, achieves state-of-the-art (SOTA) performance on multiple datasets in both PLCC and SRCC metrics. Moreover, the zero-shot results demonstrate the generalizability of the proposed approach.||
|**2024-09-10**|[MAGDA: Multi-agent guideline-driven diagnostic assistance](http://arxiv.org/abs/2409.06351)|null|In emergency departments, rural hospitals, or clinics in less developed regions, clinicians often lack fast image analysis by trained radiologists, which can have a detrimental effect on patients' healthcare. Large Language Models (LLMs) have the potential to alleviate some of this pressure by providing insights that help clinicians in their decision-making. While these LLMs achieve high scores on medical exams, showcasing their extensive theoretical medical knowledge, they tend not to follow medical guidelines. In this work, we introduce a new approach for zero-shot guideline-driven decision support. We model a system of multiple LLM agents, augmented with a contrastive vision-language model, that collaborate to reach a patient diagnosis. After being provided with simple diagnostic guidelines, the agents synthesize prompts based on these guidelines and screen the image for findings. Finally, they provide understandable chain-of-thought reasoning for their diagnosis, which is then self-refined to account for inter-dependencies between diseases. As our method is zero-shot, it is adaptable to rare-disease settings where training data is limited but expert-crafted disease descriptions are available. We evaluate our method on two chest X-ray datasets, CheXpert and ChestX-ray 14 Longtail, demonstrating performance improvements over existing zero-shot methods and generalizability to rare diseases.||
|**2024-09-10**|[INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding](http://arxiv.org/abs/2409.06210)|null|Affordance denotes the potential interactions inherent in objects. Perceiving affordance can enable intelligent agents to navigate and interact with new environments efficiently. Weakly supervised affordance grounding teaches agents the concept of affordance without costly pixel-level annotations, but with exocentric images. Although recent advances in weakly supervised affordance grounding have yielded promising results, challenges remain, including the requirement for paired exocentric and egocentric image datasets and the complexity of grounding diverse affordances for a single object. To address them, we propose INTeraction Relationship-aware weakly supervised Affordance grounding (INTRA). Unlike prior arts, INTRA recasts this problem as representation learning, identifying the unique features of interactions through contrastive learning with exocentric images only, eliminating the need for paired datasets. Moreover, we leverage vision-language model embeddings to perform affordance grounding flexibly with any text, designing text-conditioned affordance map generation that reflects interaction relationships for contrastive learning, and enhancing robustness with our text synonym augmentation. Our method outperforms prior arts on diverse datasets such as AGD20K, IIT-AFF, CAD, and UMD. In addition, experimental results demonstrate that our method has remarkable domain scalability to synthesized images/illustrations and can ground affordances for novel interactions and objects.||
|**2024-09-10**|[Revisiting Prompt Pretraining of Vision-Language Models](http://arxiv.org/abs/2409.06166)|null|Prompt learning is an effective way to customize vision-language models (VLMs) for various downstream tasks by fine-tuning only a small number of parameters in the input prompt tokens. Recently, prompt pretraining on large-scale datasets (e.g., ImageNet-21K) has become crucial for prompt learning in universal visual recognition. However, we revisit this practice and observe that during prompt pretraining, the limited learnable prompts may risk underfitting given the sheer number of images, while simultaneously leading to poor generalization. To address these issues, this paper proposes a general framework termed Revisiting Prompt Pretraining (RPP), which aims to improve fitting and generalization ability from two aspects: prompt structure and prompt supervision. For prompt structure, we break the common-practice restriction in which the query, key, and value vectors are all derived from a shared learnable prompt token. Instead, we introduce unshared, individual learnable query, key, and value prompts, enhancing the model's fitting capacity through increased parameter diversity. For prompt supervision, we additionally utilize soft labels derived from the zero-shot probability predictions of a pretrained Contrastive Language-Image Pretraining (CLIP) teacher model. These soft labels yield more nuanced and comprehensive insight into inter-class relationships, endowing the pretraining process with better generalization. RPP produces a more robust prompt initialization, enhancing its transferability across diverse visual recognition tasks. Experiments across multiple benchmarks consistently confirm the state-of-the-art performance of our pretrained prompts. Code and models will be released soon.||
|**2024-09-09**|[PEERNet: An End-to-End Profiling Tool for Real-Time Networked Robotic Systems](http://arxiv.org/abs/2409.06078)|**[link](https://github.com/utaustin-swarmlab/peernet)**|Networked robotic systems balance compute, power, and latency constraints in applications such as self-driving vehicles, drone swarms, and teleoperated surgery. A core problem in this domain is deciding when to offload a computationally expensive task to the cloud (a remote server) at the cost of communication latency. Task offloading algorithms often rely on precise knowledge of system-specific performance metrics, such as sensor data rates, network bandwidth, and machine learning model latency. While these metrics can be modeled during system design, uncertainties in connection quality, server load, and hardware conditions create real-time performance variations that affect overall performance. We introduce PEERNet, an end-to-end, real-time profiling tool for cloud robotics. PEERNet enables performance monitoring on heterogeneous hardware through targeted yet adaptive profiling of system components such as sensors, networks, deep learning pipelines, and devices. We showcase PEERNet's capabilities through networked robotics tasks, such as image-based teleoperation of a Franka Emika Panda arm and querying vision-language models on an Nvidia Jetson Orin. PEERNet reveals non-intuitive behavior in robotic systems, such as asymmetric network transmission and bimodal language model output. Our evaluation underscores the effectiveness and importance of benchmarking in networked robotics, demonstrating PEERNet's adaptability. Our code is open-source and available at github.com/UTAustin-SwarmLab/PEERNet.||
|**2024-09-07**|[Unlocking Potential Binders: Multimodal Pretraining DEL-Fusion for Denoising DNA-Encoded Libraries](http://arxiv.org/abs/2409.05916)|null|In drug discovery, DNA-encoded library (DEL) screening has emerged as an efficient method for identifying high-affinity compounds. However, DEL screening faces a significant challenge: noise arising from nonspecific interactions within complex biological systems. Neural networks trained on DEL libraries have been employed to extract compound features, aiming to denoise the data and uncover potential binders to therapeutic targets. Nevertheless, the inherent structure of DELs, constrained by the limited diversity of building blocks, affects the performance of compound encoders. Moreover, existing methods capture compound features at only a single level, further limiting the effectiveness of denoising strategies. To mitigate these issues, we propose a Multimodal Pretraining DEL-Fusion model (MPDF) that enhances encoder capability through pretraining and integrates compound features across different scales. We develop pretraining tasks that apply contrastive objectives between different compound representations and their text descriptions, enhancing the compound encoders' ability to acquire generic features. Furthermore, we propose a novel DEL-fusion framework that amalgamates compound information at the atomic, submolecular, and molecular levels, as captured by the various compound encoders. The synergy of these innovations equips MPDF with rich, multi-scale features, enabling comprehensive downstream denoising. Evaluated on three DEL datasets, MPDF demonstrates superior performance in data processing and analysis for validation tasks. Notably, MPDF offers new insights into identifying high-affinity molecules, paving the way for improved DEL utility in drug discovery.||
|**2024-09-09**|[DexDiff: Towards Extrinsic Dexterity Manipulation of Ungraspable Objects in Unrestricted Environments](http://arxiv.org/abs/2409.05493)|null|Grasping large and flat objects (e.g., a book or a pan) is often regarded as an ungraspable task because the grasp poses are unreachable, which poses significant challenges. Previous works leverage extrinsic dexterity, such as walls or table edges, to grasp such objects. However, they are limited to task-specific policies and lack task planning to find pre-grasp conditions, making it difficult to adapt to various environments and extrinsic dexterity constraints. Therefore, we propose DexDiff, a robust robotic manipulation method for long-horizon planning with extrinsic dexterity. Specifically, we utilize a vision-language model (VLM) to perceive the environmental state and generate high-level task plans, followed by a goal-conditioned action diffusion (GCAD) model to predict the sequence of low-level actions. This model learns the low-level policy from offline data, using the cumulative reward guided by high-level planning as the goal condition, which enables improved prediction of robot actions. Experimental results demonstrate that our method not only effectively performs ungraspable tasks but also generalizes to previously unseen objects. It achieves a 47% higher success rate than the baseline in simulation and facilitates efficient deployment and manipulation in real-world scenarios.||
|**2024-09-08**|[PIP: Detecting Adversarial Examples in Large Vision-Language Models via Attention Patterns of Irrelevant Probe Questions](http://arxiv.org/abs/2409.05076)|**[link](https://github.com/btzyd/pip)**|Large Vision-Language Models (LVLMs) have demonstrated powerful multimodal capabilities. However, they also face serious safety problems, as adversaries can induce robustness issues in LVLMs through carefully crafted adversarial examples. LVLMs therefore urgently need detection tools for adversarial examples to prevent incorrect responses. In this work, we first discover that LVLMs exhibit regular attention patterns for clean images when presented with probe questions. We propose an unconventional method named PIP, which utilizes the attention pattern of one randomly selected irrelevant probe question (e.g., "Is there a clock?") to distinguish adversarial examples from clean examples. Regardless of the image under test and its corresponding question, PIP only needs one additional inference on the image under test and the probe question to successfully detect adversarial examples. Even under black-box attacks and open-dataset scenarios, our PIP, coupled with a simple SVM, still achieves over 98% recall and over 90% precision. PIP is the first attempt to detect adversarial attacks on LVLMs via simple irrelevant probe questions, shedding light on deeper understanding of and introspection within LVLMs. The code is available at https://github.com/btzyd/pip.||
|**2024-09-07**|[POINTS: Improving Your Vision-language Model with Affordable Strategies](http://arxiv.org/abs/2409.04828)|null|In recent years, vision-language models have made significant strides, excelling at tasks such as optical character recognition and geometric problem solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pretraining data in open-source work is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we make the following contributions: 1) We train a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filter pretraining data using perplexity, selecting the lowest-perplexity data for training. This approach allows us to train on a curated 1M dataset and achieve competitive performance. 3) During visual instruction tuning, we use model soup on different datasets when adding more datasets yields only marginal improvements. These innovations result in a 9B-parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easy for the community to adopt.||
|**2024-09-07**|[Enhancing Outlier Knowledge for Few-Shot Out-of-Distribution Detection with Extensible Local Prompts](http://arxiv.org/abs/2409.04796)|null|Out-of-distribution (OOD) detection, which aims to distinguish outliers from known categories, has become increasingly important in real-world scenarios. Recently, the emergence of vision-language models (VLMs) has sparked interest in enhancing OOD detection for VLMs through few-shot tuning. However, existing methods mainly focus on optimizing global prompts, neglecting the refined utilization of local information regarding outliers. Motivated by this, we freeze the global prompts and introduce a novel coarse-to-fine tuning paradigm that emphasizes regional enhancement with local prompts. Our method comprises two components: global-prompt-guided negative augmentation and local-prompt-enhanced regional regularization. The former utilizes the frozen, coarse global prompts as guiding cues to incorporate negative augmentation, thereby leveraging local outlier knowledge. The latter employs trainable local prompts and regional regularization to effectively capture local information, aiding outlier identification. We also propose a regional-related metric to enrich OOD detection. Moreover, since our approach only explores enhancing local prompts, it can be seamlessly integrated with trained global prompts during inference to boost performance. Comprehensive experimental results demonstrate the effectiveness and potential of our method. Notably, in 4-shot tuning on the ImageNet-1k dataset, our method reduces the average FPR95 by 5.17% relative to the state-of-the-art method, even outperforming previous methods' 16-shot results.||
|**2024-09-06**|[COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes](http://arxiv.org/abs/2409.04053)|null|While visual question answering (VQA) benchmarks have catalyzed the development of reasoning techniques, they have focused on vertical thinking. Effective problem solving also requires lateral thinking, which remains understudied in AI and has not been used to test visual perception systems. To bridge this gap, we formalize visual lateral thinking as a multiple-choice question-answering task and describe a taxonomy-driven, three-step methodology for instantiating task examples. We then develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles based on publicly available collections of compounds and common phrases. COLUMBUS comprises over 1,000 puzzles, each with four candidate answers. While state-of-the-art vision-language models (VLMs) achieve decent performance, our evaluation demonstrates a substantial gap between humans and models. VLMs benefit from human-curated descriptions but struggle to self-generate such representations at the right level of abstraction.||
|**2024-09-06**|[Generating Faithful and Salient Text from Multimodal Data](http://arxiv.org/abs/2409.03961)|**[link](https://github.com/TahsinaHashem/FaithD2T)**|While large multimodal models (LMMs) have achieved strong performance on many multimodal tasks, they may still hallucinate when generating text. Their performance at detecting salient features from visual data is also unclear. In this paper, we develop a framework to generate faithful and salient text from mixed-modal data, which includes images and structured data (represented as knowledge graphs or tables). Specifically, we train a small vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in a post-editing step to improve generation quality. Experiments on two datasets show that our framework improves LMMs' generation quality in terms of both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination.||
|**2024-09-05**|[Few-shot Adaptation of Medical Vision-Language Models](http://arxiv.org/abs/2409.03868)|**[link](https://github.com/fereshteshakeri/few-shot-medvlms)**|Integrating image and text data through multi-modal learning has emerged as a new approach in medical imaging research, following its successful deployment in computer vision. While considerable efforts have been dedicated to establishing medical foundation models and their zero-shot transfer to downstream tasks, the popular few-shot setting remains relatively unexplored. Following on from the currently strong emergence of this setting in computer vision, we introduce the first structured benchmark for adapting medical vision-language models (VLMs) in a strict few-shot regime and investigate various adaptation strategies commonly used in the context of natural images. Furthermore, we evaluate a simple generalization of the linear-probe adaptation baseline, which seeks an optimal blending of the visual prototypes and text embeddings via learnable class-wise multipliers. Surprisingly, such a text-informed linear probe yields competitive performances in comparison to convoluted prompt-learning and adapter-based strategies, while running considerably faster and accommodating the black-box setting. Our extensive experiments span three different medical modalities and specialized foundation models, nine downstream tasks, and several state-of-the-art few-shot adaptation methods. We made our benchmark and code publicly available to trigger further developments in this emergent subject: \url{https://github.com/FereshteShakeri/few-shot-MedVLMs}.||
|**2024-09-05**|[Have Large Vision-Language Models Mastered Art History?](http://arxiv.org/abs/2409.03521)|null|The emergence of large Vision-Language Models (VLMs) has recently established new baselines in image classification across multiple domains. However, the performance of VLMs in the specific task of artwork classification, particularly art style classification of paintings - a domain traditionally mastered by art historians - has not been explored yet. Artworks pose a unique challenge compared to natural images due to their inherently complex and diverse structures, characterized by variable compositions and styles. Art historians have long studied the unique aspects of artworks, with style prediction being a crucial component of their discipline. This paper investigates whether large VLMs, which integrate visual and textual data, can effectively predict the art historical attributes of paintings. We conduct an in-depth analysis of four VLMs, namely CLIP, LLaVA, OpenFlamingo, and GPT-4o, focusing on zero-shot classification of art style, author and time period using two public benchmarks of artworks. Additionally, we present ArTest, a well-curated test set of artworks, including pivotal paintings studied by art historians.||
|**2024-09-04**|[Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving](http://arxiv.org/abs/2409.02914)|null|Large Vision-Language Models (LVLMs) have recently garnered significant attention, with many efforts aimed at harnessing their general knowledge to enhance the interpretability and robustness of autonomous driving models. However, LVLMs typically rely on large, general-purpose datasets and lack the specialized expertise required for professional and safe driving. Existing vision-language driving datasets focus primarily on scene understanding and decision-making, without providing explicit guidance on traffic rules and driving skills, which are critical aspects directly related to driving safety. To bridge this gap, we propose IDKB, a large-scale dataset containing over one million data items collected from various countries, including driving handbooks, theory test data, and simulated road test data. Much like the process of obtaining a driver's license, IDKB encompasses nearly all the explicit knowledge needed for driving from theory to practice. In particular, we conducted comprehensive tests on 15 LVLMs using IDKB to assess their reliability in the context of autonomous driving and provided extensive analysis. We also fine-tuned popular models, achieving notable performance improvements, which further validate the significance of our dataset. The project page can be found at: \url{https://4dvlab.github.io/project_page/idkb.html}||
|**2024-09-04**|[Benchmarking Spurious Bias in Few-Shot Image Classifiers](http://arxiv.org/abs/2409.02882)|**[link](https://github.com/gtzheng/fewstab)**|Few-shot image classifiers are designed to recognize and classify new data with minimal supervision and limited data but often show reliance on spurious correlations between classes and spurious attributes, known as spurious bias. Spurious correlations commonly hold in certain samples and few-shot classifiers can suffer from spurious bias induced from them. There is an absence of an automatic benchmarking system to assess the robustness of few-shot classifiers against spurious bias. In this paper, we propose a systematic and rigorous benchmark framework, termed FewSTAB, to fairly demonstrate and quantify varied degrees of robustness of few-shot classifiers to spurious bias. FewSTAB creates few-shot evaluation tasks with biased attributes so that using them for predictions can demonstrate poor performance. To construct these tasks, we propose attribute-based sample selection strategies based on a pre-trained vision-language model, eliminating the need for manual dataset curation. This allows FewSTAB to automatically benchmark spurious bias using any existing test data. FewSTAB offers evaluation results in a new dimension along with a new design guideline for building robust classifiers. Moreover, it can benchmark spurious bias in varied degrees and enable designs for varied degrees of robustness. Its effectiveness is demonstrated through experiments on ten few-shot learning methods across three datasets. We hope our framework can inspire new designs of robust few-shot classifiers. Our code is available at https://github.com/gtzheng/FewSTAB.||
|**2024-09-06**|[CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models](http://arxiv.org/abs/2409.02834)|null|Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or opinions, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets.||
|**2024-09-04**|[MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark](http://arxiv.org/abs/2409.02813)|null|This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.||
|**2024-09-04**|[Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection](http://arxiv.org/abs/2409.02664)|null|The proliferation of deepfake faces poses huge potential negative impacts on our daily lives. Despite substantial advancements in deepfake detection over these years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via data perturbations, our method can reprogram a pretrained VLM model (e.g., CLIP) solely based on manipulating its input without tuning the inner parameters. Furthermore, we insert a pseudo-word guided by facial identity into the text prompt. Extensive experiments on several popular benchmarks demonstrate that (1) the cross-dataset and cross-manipulation performances of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in cross-dataset setting from FF++ to WildDeepfake) using a pre-trained CLIP model with our proposed reprogramming method; (2) our superior performances are at less cost of trainable parameters, making it a promising approach for real-world applications.||
|**2024-09-04**|[Understanding eGFR Trajectories and Kidney Function Decline via Large Multimodal Models](http://arxiv.org/abs/2409.02530)|null|The estimated Glomerular Filtration Rate (eGFR) is an essential indicator of kidney function in clinical practice. Although traditional equations and Machine Learning (ML) models using clinical and laboratory data can estimate eGFR, accurately predicting future eGFR levels remains a significant challenge for nephrologists and ML researchers. Recent advances demonstrate that Large Language Models (LLMs) and Large Multimodal Models (LMMs) can serve as robust foundation models for diverse applications. This study investigates the potential of LMMs to predict future eGFR levels with a dataset consisting of laboratory and clinical values from 50 patients. By integrating various prompting techniques and ensembles of LMMs, our findings suggest that these models, when combined with precise prompts and visual representations of eGFR trajectories, offer predictive performance comparable to existing ML models. This research extends the application of foundation models and suggests avenues for future studies to harness these models in addressing complex medical forecasting challenges.||
|**2024-09-03**|[Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems](http://arxiv.org/abs/2409.02278)|null|Recent developments in vision-language models (VLMs) have shown great potential for image-understanding-related applications. In this study, we explore the application of state-of-the-art VLMs to vision-based transportation engineering tasks such as image classification and object detection. The image classification tasks involve congestion detection and crack identification, whereas the object detection task is used to identify helmet violations. We apply open-source models such as CLIP, BLIP, OWL-ViT, and Llava-Next, as well as the closed-source GPT-4o, to evaluate the performance of these state-of-the-art VLMs in harnessing language understanding for vision-based transportation tasks. These tasks are performed by applying zero-shot prompting to the VLMs, which allows tasks to be performed without any task-specific training. It eliminates the need for annotated datasets or fine-tuning for specific tasks. Although these models achieve results comparable to benchmark convolutional neural network (CNN) models on the image classification tasks, there is room for improvement on object localization tasks. This study therefore provides a comprehensive evaluation of state-of-the-art VLMs, highlighting their strengths and limitations, and can serve as a baseline for future improvement and large-scale implementation.||
|**2024-09-03**|[How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?](http://arxiv.org/abs/2409.02253)|**[link](https://github.com/asgsaeid/cad_vqa)**|Large foundation models have revolutionized the field, yet challenges remain in optimizing multimodal models for specialized visual tasks. We propose a novel, generalizable methodology to identify the preferred image distribution of a black-box vision-language model (VLM) by measuring output consistency across varied input prompts. We apply this to different rendering types of 3D objects, demonstrating its efficacy across various domains requiring precise interpretation of complex structures, with a focus on Computer-Aided Design (CAD) as an exemplar field. We further refine VLM outputs using in-context learning with human feedback, significantly enhancing explanation quality. To address the lack of benchmarks in specialized domains, we introduce CAD-VQA, a new dataset for evaluating VLMs on CAD-related visual question answering tasks. Our evaluation of state-of-the-art VLMs on CAD-VQA establishes baseline performance levels, providing a framework for advancing VLM capabilities in complex visual reasoning tasks across various fields requiring expert-level visual interpretation. We release the dataset and evaluation code at \url{https://github.com/asgsaeid/cad_vqa}.||
|**2024-09-03**|[Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models](http://arxiv.org/abs/2409.02101)|**[link](https://github.com/jiaqixuac/WResVLM)**|This paper addresses the limitations of adverse weather image restoration approaches trained on synthetic data when applied to real-world scenarios. We formulate a semi-supervised learning framework that employs vision-language models to enhance restoration performance across diverse adverse weather conditions in real-world settings. Our approach uses vision-language models to assess image clearness and provide semantics on real data, serving as supervision signals for training restoration models. For clearness enhancement, we use real data with a dual strategy of pseudo-labels assessed by vision-language models and weather prompt learning. For semantic enhancement, we integrate real-world data by adjusting weather conditions in vision-language model descriptions while preserving semantic meaning. Additionally, we introduce an effective training strategy to bootstrap restoration performance. Our approach achieves superior results in real-world adverse weather image restoration, demonstrated through qualitative and quantitative comparisons with state-of-the-art work.||
|**2024-09-03**|[GraspSplats: Efficient Manipulation with 3D Feature Splatting](http://arxiv.org/abs/2409.02084)|null|The ability for robots to perform efficient and zero-shot grasping of object parts is crucial for practical applications and is becoming prevalent with recent advances in Vision-Language Models (VLMs). To bridge the 2D-to-3D gap for representations to support such a capability, existing methods rely on neural fields (NeRFs) via differentiable rendering or point-based projection methods. However, we demonstrate that NeRFs are inappropriate for scene changes due to their implicitness and point-based methods are inaccurate for part localization without rendering-based optimization. To amend these issues, we propose GraspSplats. Using depth supervision and a novel reference feature computation method, GraspSplats generates high-quality scene representations in under 60 seconds. We further validate the advantages of Gaussian-based representation by showing that the explicit and optimized geometry in GraspSplats is sufficient to natively support (1) real-time grasp sampling and (2) dynamic and articulated object manipulation with point trackers. With extensive experiments on a Franka robot, we demonstrate that GraspSplats significantly outperforms existing methods under diverse task settings. In particular, GraspSplats outperforms NeRF-based methods like F3RM and LERF-TOGO, and 2D detection methods.||
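Many of the CLIP-based entries above (zero-shot classification, similarity-based open-vocabulary segmentation, attribute prompting) share one core mechanism: embed the image and a set of text prompts into a shared space, then score classes by cosine similarity. The sketch below illustrates only that mechanism with stand-in embedding vectors instead of a real CLIP encoder; the function name and numbers are illustrative, not from any paper above.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """Pick the class whose text embedding is most similar to the image embedding.

    image_emb: (d,) image feature; text_embs: (k, d), one row per class prompt.
    Mirrors CLIP-style zero-shot classification: L2-normalize both sides, score
    by scaled cosine similarity, and softmax into class probabilities.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)        # (k,) scaled cosine similarities
    probs = np.exp(logits - logits.max())     # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Stand-in embeddings: class 1 is nearly parallel to the image feature.
image = np.array([1.0, 0.0, 0.2])
texts = np.array([[0.0, 1.0, 0.0],
                  [0.9, 0.1, 0.2],
                  [0.0, 0.0, 1.0]])
label, probs = zero_shot_classify(image, texts)
print(label)  # index of the most similar class prompt
```

In a real system, the stand-in vectors would be replaced by CLIP image/text encoder outputs, and `temperature` plays the role of CLIP's learned logit scale.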

(back to top)

## 6DOF Object Pose

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
|**2024-10-08**|[AIVIO: Closed-loop, Object-relative Navigation of UAVs with AI-aided Visual Inertial Odometry](http://arxiv.org/abs/2410.05996)|null|面向对象的移动机器人导航对于各种任务至关重要,例如自主关键基础设施检查,但这需要从原始传感器数据中提取有关感兴趣对象的语义信息的能力。虽然基于深度学习 (DL) 的方法擅长从图像中推断语义对象信息,例如类别和相对六自由度 (6-DoF) 位姿,但它们的计算要求很高,因此通常不适合有效载荷受限的移动机器人。在这篇文章中,我们提出了一种实时无人机 (UAV) 系统,用于具有最小传感器配置(包括惯性测量单元 (IMU) 和 RGB 相机)的、面向对象的闭环导航。利用仅在合成数据上训练并针对伴侣板部署进行优化的基于深度学习的对象位姿估计器,将对象相对位姿测量值与 IMU 数据融合以执行对象相对定位。我们进行了多次真实世界的实验,以验证我们的系统在具有挑战性的电线杆检查用例中的性能。补充视频中展示了一个闭环飞行的示例。|
|**2024-09-24**|[LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation](http://arxiv.org/abs/2409.15727)|null|虽然基于RGBD的类别级物体姿态估计方法很有前景,但其对深度数据的依赖限制了其在不同场景中的适用性。因此,最近的研究转向了基于RGB的方法;然而,由于缺乏深度信息,它们面临着巨大的挑战。一方面,深度信息的缺失加剧了处理类内形状变化的难度,导致形状预测的不确定性增加。另一方面,纯RGB输入引入了固有的尺度模糊性,使得物体大小和位移的估计成为一个不适定问题。为了应对这些挑战,我们提出了LaPose,一个新颖的框架,它将物体形状建模为用于姿态估计的拉普拉斯混合模型。通过将每个点表示为概率分布,我们显式地量化了形状的不确定性。LaPose利用一个广义3D信息流和一个专门的特征流来独立预测每个点的拉普拉斯分布,捕捉物体几何形状的不同方面。然后,这两个分布被整合为一个拉普拉斯混合模型,以建立2D-3D对应关系,这些对应关系用于通过PnP模块求解姿态。为了减轻尺度模糊性,我们引入了一种与尺度无关的物体大小和位移表示方法,从而提高了训练效率和整体鲁棒性。在NOCS数据集上的大量实验验证了LaPose的有效性,在基于RGB的类别级物体姿态估计方面取得了最先进的性能。代码已发布在https://github.com/lolrudy/LaPose。|
|**2024-09-22**|[Tactile Functasets: Neural Implicit Representations of Tactile Datasets](http://arxiv.org/abs/2409.14592)|null|现代触觉传感器产生的原始感官反馈是高维的,例如图像,这使得高效存储、处理和跨传感器泛化具有挑战性。为了解决这些问题,我们引入了一种新的用于触觉传感器反馈的隐函数表示方法。我们没有直接使用原始触觉图像,而是提出了经过训练以重建触觉数据集的神经隐函数,从而生成紧凑的表示来捕捉感官输入的底层结构。这些表示方法与其原始对应方法相比具有几个优势:它们紧凑,支持概率可解释的推断,并促进跨不同传感器的泛化。我们在手持物体姿态估计的下游任务中证明了这种表示方法的有效性,在简化下游模型的同时实现了比基于图像的方法更高的性能。我们在https://www.mmintlab.com/tactile-functasets发布了代码、演示和数据集。|
|**2024-09-18**|[FAST GDRNPP: Improving the Speed of State-of-the-Art 6D Object Pose Estimation](http://arxiv.org/abs/2409.12720)|null|6D物体姿态估计涉及确定场景中物体相对于所选坐标系的三维平移和旋转。这个问题在许多工业任务的实际应用中尤其重要,例如质量控制、零件拾取和机器人操作,在这些应用中,速度和精度对于实际部署都至关重要。当前的模型,无论是经典模型还是基于深度学习的模型,通常难以在精度和延迟之间取得平衡。我们的研究重点是在保持其高精度的同时,提高最先进的深度学习模型GDRNPP的速度。我们采用了几种技术来减小模型大小并缩短推理时间。这些技术包括使用更小、更快的骨干网络、修剪不必要的参数以及通过蒸馏将知识从大型高性能模型转移到更小、更高效的学生模型。我们的研究结果表明,所提出的配置在显著提高推理速度的同时,保持了与最先进技术相当的精度。这一进步可以使各种工业场景中的应用更高效、更实用,从而提高6D物体姿态估计模型在实际环境中的整体适用性。|
|**2024-09-12**|[Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation](http://arxiv.org/abs/2409.08269)|null|现今的触摸传感器种类繁多,形状各异。由于模型通常与特定的传感器设计绑定,这给通用触摸处理方法的开发带来了挑战。我们通过在触摸传感器之间进行跨模态预测来解决这个问题:给定来自一个传感器的触觉信号,我们使用生成模型来估计另一个传感器如何感知相同的物理接触。这使得我们可以将特定于传感器的处理方法应用于生成的信号。我们通过训练一个扩散模型来实现这个想法,该模型可以在流行的GelSlim和Soft Bubble传感器之间进行转换。作为一个下游任务,我们使用GelSlim传感器进行手持物体姿态估计,同时使用一种仅对Soft Bubble信号进行操作的算法。数据集、代码和更多详细信息可以在https://www.mmintlab.com/research/touch2touch/找到。|
|**2024-09-04**|[Object Gaussian for Monocular 6D Pose Estimation from Sparse Views](http://arxiv.org/abs/2409.02581)|null|单目物体姿态估计是计算机视觉和机器人技术中的一项关键任务,其很大程度上依赖于精确的2D-3D对应关系,而这通常需要昂贵的CAD模型,而这些模型可能并不容易获得。物体三维重建方法提供了一种替代方案,其中最近3D高斯 splatting (3DGS) 的进展展现了引人注目的潜力。然而,它的性能仍然欠佳,并且在输入视图较少时容易出现过拟合。为了应对这一挑战,我们引入了SGPose,这是一个使用基于高斯方法进行稀疏视图物体姿态估计的新颖框架。只需十个视图,SGPose 就能通过从随机长方体初始化开始生成几何感知表示,从而避免了对传统3DGS方法所需的基于运动恢复结构 (SfM) 流程的几何形状的依赖。SGPose 通过回归稀疏输入和随机初始化的图像与重建模型之间的密集2D-3D对应关系,消除了对CAD模型的依赖,而几何一致性深度监督和在线合成视图扭曲是其成功的关键。在典型基准数据集,尤其是在Occlusion LM-O数据集上的实验表明,即使在稀疏视图约束下,SGPose 的性能也优于现有方法,这凸显了其在实际应用中的潜力。|
|**2024-08-29**|[OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation](http://arxiv.org/abs/2408.16547)|**[link](https://github.com/yc-che/op-align)**|类别级铰接物体姿态估计专注于对已知类别中未知铰接物体的姿态估计。尽管意义重大,但由于物体的形状和姿态各异、数据集标注成本高昂以及复杂的现实环境,这项任务仍然具有挑战性。在本文中,我们提出了一种新的自监督方法,利用单帧点云来解决这个问题。我们的模型可以一致地生成具有规范姿态和关节状态的完整输入物体重建,并估计可以减少整体姿态差异的物体级姿态,以及将输入的每个部分与其对应的重建部分对齐的部件级姿态。实验结果表明,我们的方法显著优于之前的自监督方法,并且与最先进的监督方法相当。为了评估我们的模型在真实场景中的性能,我们还引入了一个新的真实世界铰接物体基准数据集。|
|**2024-08-19**|[RUMI: Rummaging Using Mutual Information](http://arxiv.org/abs/2408.10450)|null|本文提出了一种名为“基于互信息的翻找”(RUMI)的方法,用于在线生成机器人的动作序列,以便在视觉遮挡环境中收集有关已知可移动物体姿态的信息。该方法侧重于富接触翻找,利用物体姿态分布和机器人轨迹之间的互信息进行动作规划。RUMI从观测到的部分点云推断出兼容的物体姿态分布,并实时计算其与工作空间占用率的互信息近似值。基于此,我们开发了信息增益成本函数和可达性成本函数,以保持物体在机器人的可及范围内。这些成本函数被集成到具有随机动力学模型的模型预测控制(MPC)框架中,并在闭环中更新姿态分布。主要贡献包括一个新的用于物体姿态估计的置信框架,一个高效的信息增益计算策略,以及一个鲁棒的基于MPC的控制方案。与基线方法相比,RUMI在仿真和实际任务中均表现出优异的性能。|
|**2024-08-15**|[Comparative Evaluation of 3D Reconstruction Methods for Object Pose Estimation](http://arxiv.org/abs/2408.08234)|**[link](https://github.com/varunburde/reconstruction_pose_benchmark)**|物体姿态估计对于许多涉及机器人操作、导航和增强现实的工业应用至关重要。当前通用的物体姿态估计器,即不需要针对每个物体进行训练的方法,依赖于精确的3D模型。目前主要使用CAD模型,但在实践中很难获得。同时,通常可以获取物体的图像。自然,这就引出了一个问题:从图像重建的3D模型是否足以实现准确的物体姿态估计。我们旨在通过提出一个新的基准来回答这个问题,该基准用于衡量3D重建质量对姿态估计精度的影响。我们的基准提供了用于物体重建的校准图像,这些图像与YCB-V数据集的测试图像配准,以便在BOP基准格式下进行姿态评估。使用多种最先进的3D重建和物体姿态估计方法进行的详细实验表明,现代重建方法生成的几何形状通常足以进行准确的姿态估计。我们的实验得出了一些有趣的观察结果:(1) 用于衡量3D重建质量的标准指标不一定能指示姿态估计精度,这表明需要像我们这样的专用基准。(2) 传统的、非基于学习的方法可以与现代的基于学习的重建技术相媲美,甚至可以提供更好的重建时间-姿态精度权衡。(3) 使用重建模型和使用CAD模型的性能之间仍然存在相当大的差距。为了促进缩小这一差距的研究,我们的基准在https://github.com/VarunBurde/reconstruction_pose_benchmark公开可用。|
|**2024-07-16**|[NeuSurfEmb: A Complete Pipeline for Dense Correspondence-based 6D Object Pose Estimation without CAD Models](http://arxiv.org/abs/2407.12207)|**[link](https://github.com/ethz-asl/neusurfemb)**|目前最先进的6D物体姿态估计方法假设CAD模型可用,并且需要用户手动设置基于物理的渲染(PBR)流程以生成合成训练数据。这两个因素都限制了这些方法在实际场景中的应用。在这项工作中,我们提出了一个不需要CAD模型的流程,并且允许训练一个最先进的姿态估计器,只需要一小组真实图像作为输入。我们的方法基于NeuS2对象表示,我们通过基于运动恢复结构(SfM)和对象无关分割的半自动化程序来学习它。我们利用NeuS2的新视角合成能力和简单的剪切粘贴增强功能来自动生成逼真的对象渲染,我们使用这些渲染来训练基于对应的SurfEmb姿态估计器。我们在LINEMOD-Occlusion数据集上评估了我们的方法,广泛研究了其各个组件的影响,并展示了相对于基于CAD模型和PBR数据的方法的竞争性能。我们还展示了我们流程在自收集的真实世界对象上的易用性和有效性,表明我们的方法优于最先进的无CAD模型方法,具有更好的准确性和对轻微遮挡的鲁棒性。为了让机器人社区能够从该系统中受益,我们将在https://www.github.com/ethz-asl/neusurfemb公开发布它。|
|**2024-06-06**|[Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking](http://arxiv.org/abs/2406.04316)|null|6D物体姿态估计是计算机视觉中一项至关重要但极具挑战性的任务,其面临的主要问题是大规模数据集的严重缺乏。这种稀缺性阻碍了对模型性能的全面评估,限制了研究进展。此外,可用实例或类别的数量有限也限制了其应用。为了解决这些问题,本文提出了Omni6DPose,这是一个以对象类别多样性、规模大和对象材质多样性为特征的大型数据集。Omni6DPose主要由三个部分组成:ROPE(真实6D物体姿态估计数据集),包含332K张图像,涵盖149个类别、581个实例的超过150万个标注;SOPE(模拟6D物体姿态估计数据集),由混合现实环境中创建的475K张图像组成,利用深度模拟技术进行标注,涵盖与ROPE相同的149个类别、4162个实例的超过500万个标注;以及在ROPE和SOPE中均使用的、经过手动对齐的真实扫描物体。由于存在大量的变化和模糊性,Omni6DPose本身就具有很大的挑战性。为了应对这一挑战,我们引入了GenPose++,它是SOTA类别级姿态估计框架的增强版本,它包含两个关键改进:语义感知特征提取和基于聚类的聚合。此外,我们还提供了一个全面的基准测试分析,以评估先前方法在这个大规模数据集上在6D物体姿态估计和姿态跟踪方面的性能。||
|**2024-06-05**|[Sparse Color-Code Net: Real-Time RGB-Based 6D Object Pose Estimation on Edge Devices](http://arxiv.org/abs/2406.02977)|null|随着机器人和增强现实应用越来越依赖于精确高效的6D物体姿态估计,边缘设备上的实时性能对于实现更具交互性和响应能力的系统至关重要。我们提出的稀疏颜色代码网络(SCCN)体现了一种清晰简洁的流程设计,以有效满足这一需求。SCCN对RGB图像中的目标物体进行像素级预测,利用基本物体几何特征的稀疏性来加速Perspective-n-Point(PnP)计算过程。此外,它引入了一种新颖的基于像素级几何的物体对称表示,该表示与初始姿态预测无缝集成,有效地解决了对称物体歧义问题。SCCN在英伟达Jetson AGX Xavier上分别实现了在基准LINEMOD数据集和遮挡LINEMOD数据集上每秒19帧(FPS)和6帧的估计速率,同时在这些速率下始终保持较高的估计精度。||
|**2024-05-31**|[Deep Learning-Based Object Pose Estimation: A Comprehensive Survey](http://arxiv.org/abs/2405.07801)|**[link](https://github.com/cnjianliu/awesome-object-pose-estimation)**|物体姿态估计是计算机视觉中的一个基本问题,在增强现实和机器人技术中有着广泛的应用。在过去的十年中,深度学习模型由于其卓越的准确性和鲁棒性,越来越多地取代了依赖于工程点对特征的传统算法。然而,当代方法仍然存在若干挑战,包括它们对标记训练数据的依赖性、模型紧凑性、在挑战性条件下的鲁棒性以及泛化到未见过的新物体能力。目前缺乏一篇综述来讨论该领域的进展、面临的挑战和未来有希望的方向。为了填补这一空白,我们讨论了基于深度学习的物体姿态估计的最新进展,涵盖了该问题的所有三种形式,即实例级、类别级和未见过物体的姿态估计。我们的综述还涵盖了多种输入数据模态、输出姿态的自由度、物体属性和下游任务,为读者提供了对该领域的全面理解。此外,它还讨论了不同领域的训练范式、推理模式、应用领域、评估指标和基准数据集,并报告了当前最先进方法在这些基准上的性能,从而方便读者为其应用选择最合适的方法。最后,该综述指出了关键挑战,回顾了当前的趋势及其优缺点,并确定了未来研究的有希望的方向。我们还在 https://github.com/CNJianLiu/Awesome-Object-Pose-Estimation 上持续跟踪最新的工作。||
|**2024-03-28**|[Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation](http://arxiv.org/abs/2403.19527)|**[link](https://github.com/leeiieeo/ag-pose)**|类别级 6D 物体姿态估计旨在估计特定类别中未见实例的旋转、平移和大小。在这一领域,基于密集对应的方法取得了领先的性能。然而,它们没有明确考虑不同实例的局部和全局几何信息,导致对形状变化显著的未见实例的泛化能力较差。为了解决这个问题,我们提出了一种新颖的实例自适应和几何感知的关键点学习方法,用于类别级 6D 物体姿态估计 (AG-Pose),它包括两个关键设计:(1)第一个设计是实例自适应关键点检测模块,它可以自适应地检测一组稀疏的关键点,用于表示各种实例的几何结构。(2) 第二个设计是几何感知特征聚合模块,它可以有效地将局部和全局几何信息整合到关键点特征中。这两个模块可以协同工作,为未见实例建立鲁棒的关键点级对应关系,从而增强模型的泛化能力。在 CAMERA25 和 REAL275 数据集上的实验结果表明,所提出的 AG-Pose 在没有类别特定形状先验的情况下,大大优于最先进的方法。||
|**2024-06-01**|[Object Pose Estimation via the Aggregation of Diffusion Features](http://arxiv.org/abs/2403.18791)|**[link](https://github.com/tianfu18/diff-feats-pose)**|从图像中估计物体姿态是3D场景理解的关键任务,最近的方法在非常大的基准测试中显示出可喜的结果。然而,这些方法在处理未见过的物体时性能会显著下降。我们认为这是由于图像特征的泛化能力有限造成的。为了解决这个问题,我们对扩散模型(例如Stable Diffusion)的特征进行了深入分析,这些模型在对未见过的物体建模方面具有巨大潜力。在此分析的基础上,我们创新性地将这些扩散特征引入物体姿态估计。为此,我们提出了三种不同的架构,可以有效地捕获和聚合不同粒度的扩散特征,极大地提高了物体姿态估计的泛化能力。我们的方法在三个流行的基准数据集LM、O-LM和T-LESS上,以相当大的优势优于最先进的方法。特别是,我们的方法在未见过的物体上取得了比先前最佳结果更高的精度:在Unseen LM上为98.2%对93.5%,在Unseen O-LM上为85.9%对76.3%,显示了我们方法强大的泛化能力。我们的代码发布在https://github.com/Tianfu18/diff-feats-pose。||
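
上表中的多篇论文(如 LaPose、SGPose)都依赖点对应关系求解刚体 6-DoF 位姿:2D-3D 对应通过 PnP 求解,3D-3D 对应则有闭式解。作为背景补充,下面给出较简单的 3D-3D 情形的 Kabsch/SVD 最小示意实现;它仅用于说明概念,并非上述任何论文的官方代码,函数名 `kabsch_pose` 为本文假设:

```python
import numpy as np

def kabsch_pose(src, dst):
    """Kabsch/SVD: 由 N 组 3D-3D 对应点求刚体位姿 (R, t), 使 R @ src_i + t ≈ dst_i。"""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 互协方差矩阵
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # 修正可能出现的反射 (det = -1)
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# 用法: 用一个已知位姿生成对应点, 再将其恢复
rng = np.random.default_rng(0)
P = rng.standard_normal((10, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 2.0])
Q = P @ R_true.T + t_true
R_est, t_est = kabsch_pose(P, Q)
```

对应点带噪声时,该闭式解仍给出最小二乘意义下的最优刚体变换,因此常作为位姿估计流程中 RANSAC 内循环的基础步骤。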

(back to top)

## nerf

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
|**2024-11-05**|[HFGaussian: Learning Generalizable Gaussian Human with Integrated Human Features](http://arxiv.org/abs/2411.03086)|null|近年来,辐射场渲染技术的进步在三维场景表示方面展现出可喜的成果,其中基于高斯 splatting 的技术因其质量和效率成为最先进的方法。高斯 splatting 广泛应用于各种应用,包括三维人体表示。然而,以前的三维高斯 splatting 方法要么使用参数化人体模型作为附加信息,要么无法提供任何底层结构,例如人体生物力学特征,而这些特征对于不同的应用至关重要。在本文中,我们提出了一种名为 HFGaussian 的新方法,它可以从稀疏的输入图像中实时估计新视角和人体特征,例如 3D 骨骼、3D 关键点和密集姿态,速度可达 25 FPS。该方法利用可泛化的高斯 splatting 技术来表示人体及其相关特征,从而实现高效且可泛化的重建。通过结合姿态回归网络和特征 splatting 技术与高斯 splatting,HFGaussian 展示了比现有 3D 人体方法更强的能力,展现了集成生物力学的三维人体表示的潜力。我们针对人体高斯 splatting 和姿态估计领域的最先进技术,对 HFGaussian 方法进行了全面评估,证明了其实时且最先进的性能。|
|**2024-11-04**|[FewViewGS: Gaussian Splatting with Few View Matching and Multi-stage Training](http://arxiv.org/abs/2411.02229)|null|基于图像的新视角合成领域随着神经辐射场 (NeRF) 的引入以及最近 3D 高斯 splatting 的出现而取得了快速进展。由于其效率和准确渲染新视角的能力,高斯 splatting 得到了广泛采用。虽然在有足够训练图像的情况下高斯 splatting 表现良好,但其非结构化的显式表示在稀疏输入图像的情况下容易过拟合,导致渲染性能不佳。为了解决这个问题,我们提出了一种基于 3D 高斯的新视角合成方法,该方法使用稀疏输入图像,可以从训练图像未覆盖的视点准确地渲染场景。我们提出了一种多阶段训练方案,在不依赖于预训练深度估计或扩散模型的情况下,对新视角施加基于匹配的一致性约束。这是通过使用可用训练图像的匹配来监督在训练帧之间采样的新视角的生成,并施加颜色、几何和语义损失来实现的。此外,我们引入了一种局部性保留正则化方法用于 3D 高斯,通过保留场景的局部颜色结构来消除渲染伪影。在合成数据集和真实世界数据集上的评估表明,与现有的最先进方法相比,我们的方法在少样本新视角合成方面具有竞争力或更优的性能。|
|**2024-10-31**|[GaussianMarker: Uncertainty-Aware Copyright Protection of 3D Gaussian Splatting](http://arxiv.org/abs/2410.23718)|null|三维高斯 splatting (3DGS) 已成为获取三维资源的关键方法。为了保护这些资源的版权,可以应用数字水印技术将所有权信息谨慎地嵌入到 3DGS 模型中。然而,现有的用于网格、点云和隐式辐射场的数字水印方法不能直接应用于 3DGS 模型,因为 3DGS 模型使用具有独特结构的显式三维高斯函数,并且不依赖于神经网络。简单地将水印嵌入到预训练的 3DGS 中会导致渲染图像出现明显的失真。在我们的工作中,我们提出了一种基于不确定性的方法,该方法通过约束模型参数的扰动来实现 3DGS 的不可见水印。在消息解码阶段,即使在各种形式的三维和二维失真下,也可以从三维高斯函数和二维渲染图像中可靠地提取版权信息。我们在 Blender、LLFF 和 MipNeRF-360 数据集上进行了大量实验,以验证我们提出的方法的有效性,证明了其在消息解码精度和视图合成质量方面的最新性能。|
|**2024-10-23**|[VR-Splatting: Foveated Radiance Field Rendering via 3D Gaussian Splatting and Neural Points](http://arxiv.org/abs/2410.17932)|null|近年来,新视角合成(NVS)技术,特别是神经辐射场(NeRF)和高斯 splatting(3DGS),在逼真的场景渲染方面取得了令人瞩目的成果。这些技术在虚拟旅游和远程呈现等对沉浸式真实感至关重要的应用中拥有巨大的潜力。然而,虚拟现实(VR)系统的高性能需求对直接利用即使是像3DGS这样渲染速度很快的场景表示也带来了挑战,这主要是因为延迟和计算限制。在本文中,我们提出注视点渲染作为解决这些障碍的一个有前景的方案。我们分析了最先进的NVS方法的渲染性能及其与人类视觉系统的兼容性。我们的方法引入了一种新颖的用于虚拟现实的注视点渲染方法,该方法利用神经点渲染为中心凹区域生成的清晰、细节丰富的输出,并将其与3DGS为周边视觉生成的平滑渲染融合在一起。我们的评估证实,与标准的VR-ready 3DGS配置相比,我们的方法提高了感知的清晰度和细节丰富度。我们的系统满足实时VR交互所需的性能要求,最终增强了用户的沉浸式体验。项目页面:https://lfranke.github.io/vr_splatting|
|**2024-10-18**|[GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting](http://arxiv.org/abs/2410.17084)|null|本文介绍了GS-LIVM,一个面向户外场景的实时逼真激光雷达-惯性-视觉建图框架,该框架采用高斯 splatting 技术。与现有的基于神经辐射场 (NeRF) 和三维高斯 splatting (3DGS) 的方法相比,我们的方法能够实现实时逼真建图,同时确保在大规模无界户外环境中高质量的图像渲染。本文采用高斯过程回归 (GPR) 来缓解由稀疏且分布不均匀的激光雷达观测数据引起的问题。基于体素的三维高斯地图表示有助于在大型户外环境中进行实时密集建图,并通过自定义 CUDA 内核进行加速。此外,整个框架以协方差为中心进行设计,其中估计的协方差用于初始化三维高斯的尺度和旋转,以及更新 GPR 的参数。我们在多个户外数据集上评估了我们的算法,结果表明,我们的方法在建图效率和渲染质量方面达到了最先进的性能。源代码已在 GitHub 上发布。|
|**2024-10-22**|[E-3DGS: Gaussian Splatting with Exposure and Motion Events](http://arxiv.org/abs/2410.16995)|**[link](https://github.com/masterhow/e-3dgs)**|在视觉领域,基于理想条件下拍摄的图像估计神经辐射场(NeRFs)已被广泛研究。然而,机器人应用通常面临运动模糊、光照不足和计算开销高等挑战,这些挑战会对导航、检查和场景可视化等下游任务产生不利影响。为了应对这些挑战,我们提出了E-3DGS,一种基于事件的新方法,它将事件划分为运动事件(来自相机或物体运动)和曝光事件(来自相机曝光),前者用于处理快速运动场景,后者用于重建灰度图像,以实现基于事件的三维高斯 splatting (3DGS) 的高质量训练和优化。我们引入了一种将3DGS与曝光事件相结合的新方法,以实现显式场景表示的高质量重建。我们通用的框架可以单独处理运动事件以进行三维重建,使用曝光事件提高质量,或者采用混合模式,通过先用初始曝光事件再用高速运动事件进行优化来平衡质量和效率。我们还引入了EME-3D,一个包含曝光事件、运动事件、相机标定参数和稀疏点云的真实世界三维数据集。我们的方法比基于事件的NeRF更快,重建质量更好,同时比结合事件和RGB数据的NeRF方法更具成本效益,因为它只使用单个事件传感器。通过结合运动和曝光事件,E-3DGS为基于事件的三维重建设定了新的基准,在挑战性条件下具有稳健的性能和更低的硬件要求。源代码和数据集将在https://github.com/MasterHow/E-3DGS上发布。|
|**2024-10-18**|[DaRePlane: Direction-aware Representations for Dynamic Scene Reconstruction](http://arxiv.org/abs/2410.14169)|null|许多最近对动态场景建模和重新渲染的方法利用基于平面的显式表示,解决了与神经辐射场 (NeRF) 和高斯 splatting (GS) 等模型相关的训练时间慢的问题。然而,仅仅将 4D 动态场景分解成多个基于平面的 2D 表示不足以高保真地重新渲染具有复杂运动的场景。为此,我们提出了 DaRePlane,一种新颖的方向感知表示方法,它从六个不同的方向捕捉场景动态。这种学习到的表示经过逆双树复小波变换 (DTCWT) 来恢复基于平面的信息。在 NeRF 管道中,DaRePlane 通过融合来自这些恢复平面的向量来计算每个时空点的特征,然后将其传递给一个小型 MLP 进行颜色回归。当应用于高斯 splatting 时,DaRePlane 计算高斯点的特征,然后通过一个小型多头 MLP 进行时空变形预测。值得注意的是,为了解决由六个实部和六个虚部方向感知小波系数引入的冗余问题,我们引入了一种可训练的掩码方法,在不显著降低性能的情况下缓解了存储问题。为了证明 DaRePlane 的通用性和效率,我们在常规和手术动态场景上分别针对 NeRF 和 GS 系统对其进行了测试。大量实验表明,DaRePlane 在各种复杂动态场景的新颖视图合成中实现了最先进的性能。|
|**2024-10-16**|[3D Gaussian Splatting in Robotics: A Survey](http://arxiv.org/abs/2410.12262)|null|稠密的环境三维表示一直是机器人领域的长期目标。虽然以前基于坐标的隐式神经辐射场(NeRF)表示法很流行,但最近出现的3D高斯 splatting (3DGS) 在显式辐射场表示方面展现出了显著的潜力。通过利用3D高斯基元进行显式场景表示并支持可微渲染,3DGS 在实时渲染和逼真性能方面比其他辐射场表现出显著优势,这有利于机器人应用。在本综述中,我们提供了对机器人领域中3DGS的全面理解。我们将相关工作的讨论分为两大类:3DGS 的应用和 3DGS 技术的进步。在应用部分,我们探讨了 3DGS 如何从场景理解和交互的角度用于各种机器人任务。3DGS 的进展部分侧重于改进 3DGS 自身的适应性和效率特性,旨在提高其在机器人领域的性能。然后,我们总结了机器人领域中最常用的数据集和评估指标。最后,我们指出了当前 3DGS 方法的挑战和局限性,并讨论了 3DGS 在机器人领域的未来发展。|
|**2024-10-15**|[MCGS: Multiview Consistency Enhancement for Sparse-View 3D Gaussian Radiance Fields](http://arxiv.org/abs/2410.11394)|null|用三维高斯函数表示的辐射场在合成新视角方面表现出色,兼具高训练效率和快速渲染速度。然而,由于输入视角稀疏,缺乏多视角一致性约束会导致点云初始化不良以及优化和密集化过程中的启发式方法不可靠,从而导致性能欠佳。现有方法通常会结合来自密集估计网络的深度先验,但忽略了输入图像中固有的多视角一致性。此外,它们依赖于基于多视角立体视觉 (MVS) 的初始化,这限制了场景表示的效率。为了克服这些挑战,我们提出了一个基于三维高斯 splatting 的视图合成框架,名为 MCGS,可以从稀疏的输入视角实现逼真的场景重建。MCGS 在增强多视角一致性方面的关键创新如下:i) 我们引入了一种初始化方法,利用稀疏匹配器结合随机填充策略,生成一组紧凑但足以表示场景的初始点。这种方法增强了初始几何先验,促进了高效的场景表示。ii) 我们开发了一种多视角一致性引导的渐进式剪枝策略,通过加强一致性并消除低贡献的高斯函数来细化高斯场。这些模块化、即插即用的策略增强了对稀疏输入视角的鲁棒性,加快了渲染速度,并减少了内存消耗,使 MCGS 成为一个实用且高效的三维高斯 splatting 框架。|
|**2024-10-14**|[Few-shot Novel View Synthesis using Depth Aware 3D Gaussian Splatting](http://arxiv.org/abs/2410.11080)|**[link](https://github.com/raja-kumar/depth-aware-3dgs)**|三维高斯 splatting 技术在新型视图合成方面已经超越了神经辐射场方法,实现了更低的计算成本和实时高质量渲染。尽管在输入视图较多时可以生成高质量的渲染结果,但在只有少量视图可用时,其性能会显著下降。在本文中,我们提出了一种用于少样本新型视图合成的深度感知高斯 splatting 方法来解决这个问题。我们使用单目深度预测作为先验,并结合尺度不变的深度损失,在少量输入视图下约束三维形状。我们还使用低阶球谐函数对颜色进行建模,以避免过拟合。此外,我们观察到,像原始工作中那样周期性地移除低不透明度的 splat 会导致点云非常稀疏,从而降低渲染质量。为了缓解这个问题,我们保留了所有的 splat,从而在少量视图设置下实现了更好的重建效果。实验结果表明,我们的方法优于传统的三维高斯 splatting 方法,峰值信噪比提高了 10.5%,结构相似性指数提高了 6%,感知相似度提高了 14.1%,从而验证了我们方法的有效性。代码将在 https://github.com/raja-kumar/depth-aware-3DGS 上提供。|
|**2024-10-09**|[DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation](http://arxiv.org/abs/2410.06756)|null|近年来,2D/3D 生成技术的进步促进了从单目视频生成动态 3D 对象。先前的方法主要依赖于隐式神经辐射场 (NeRF) 或显式高斯 splatting 作为底层表示,难以实现令人满意的时空一致性和表面外观。受现代 3D 动画流程的启发,我们引入了 DreamMesh4D,这是一个结合了网格表示和几何蒙皮技术的新颖框架,可以从单目视频生成高质量的 4D 对象。我们没有使用经典的纹理贴图来表现外观,而是将高斯 splat 绑定到网格的三角面上,以便对纹理和网格顶点进行可微分优化。特别是,DreamMesh4D 从通过图像到 3D 生成过程获得的粗网格开始。然后在网格表面均匀采样稀疏点,并使用这些点构建变形图来驱动 3D 对象的运动,以提高计算效率并提供额外的约束。对于每个步骤,使用变形网络预测稀疏控制点的变换,并通过一种新颖的几何蒙皮算法对网格顶点和表面高斯进行变形,该算法结合了 LBS(线性混合蒙皮)和 DQS(双四元数蒙皮)的混合方法,减轻了两种方法相关的缺点。静态表面高斯和网格顶点以及变形网络通过参考视图光度损失、分数蒸馏损失以及其他正则化器以两阶段方式学习。大量实验表明我们的方法具有优越的性能。此外,我们的方法与现代图形流程兼容,展示了其在 3D 游戏和电影行业的潜力。|
|**2024-10-08**|[Comparative Analysis of Novel View Synthesis and Photogrammetry for 3D Forest Stand Reconstruction and extraction of individual tree parameters](http://arxiv.org/abs/2410.05772)|null|精确高效的三维树木重建对于森林资源评估和管理至关重要。近景摄影测量法 (CRP) 常用于重建森林场景,但面临效率低、质量差等挑战。近年来,包括神经辐射场 (NeRF) 和三维高斯 splatting (3DGS) 在内的新视角合成 (NVS) 技术已展现出利用有限图像进行三维植物重建的潜力。然而,现有研究主要集中在果园中的小型植物或单棵树木上,其在更大、更复杂的林分中的应用仍存在不确定性。在本研究中,我们收集了不同复杂程度的森林样地的序列图像,并使用 NeRF 和 3DGS 进行了密集重建。将所得点云与摄影测量和激光扫描的点云进行了比较。结果表明,NVS 方法显著提高了重建效率。摄影测量法在处理复杂林分时存在困难,导致点云树冠噪声过多,树木重建错误,例如树干重复。NeRF 虽然更适合树冠区域,但在视野有限的地面区域可能会产生错误。3DGS 方法生成的点云更稀疏,尤其是在树干区域,影响胸径 (DBH) 的精度。所有三种方法都可以提取树高信息,其中 NeRF 的精度最高;然而,摄影测量法在胸径精度方面仍然具有优势。这些发现表明,NVS 方法在林分三维重建方面具有巨大潜力,可为复杂的森林资源清查和可视化任务提供宝贵支持。|
|**2024-09-30**|[RL-GSBridge: 3D Gaussian Splatting Based Real2Sim2Real Method for Robotic Manipulation Learning](http://arxiv.org/abs/2409.20291)|null|Sim-to-Real 指的是将仿真环境中学习到的策略迁移到现实世界的过程,这对于实现实际机器人应用至关重要。然而,最近的 Sim2real 方法要么依赖大量的增强数据,要么依赖大型学习模型,这对于特定任务来说效率低下。近年来,基于辐射场的重建方法,尤其是 3D Gaussian Splatting 的出现,使得重现逼真的现实世界场景成为可能。为此,我们提出了一种新颖的 real-to-sim-to-real 强化学习框架 RL-GSBridge,该框架引入了基于网格的 3D Gaussian Splatting 方法,以实现基于视觉的深度强化学习的零样本 sim-to-real 迁移。我们通过使用软绑定约束改进了基于网格的 3D GS 建模方法,从而提高了网格模型的渲染质量。然后,我们采用 GS 编辑方法将渲染与物理模拟器同步,更准确地反映物理机器人的交互。通过一系列 sim-to-real 机械臂实验,包括抓取和拾放任务,我们证明了 RL-GSBridge 在 sim-to-real 迁移过程中保持了令人满意的实际任务完成成功率。此外,一系列渲染指标和可视化结果表明,我们提出的基于网格的 3D Gaussian 减少了非结构化对象中的伪影,展现了更逼真的渲染性能。||
|**2024-09-25**|[SeaSplat: Representing Underwater Scenes with 3D Gaussian Splatting and a Physically Grounded Image Formation Model](http://arxiv.org/abs/2409.17345)|null|我们介绍SeaSplat,这是一种利用最新3D辐射场技术实现水下场景实时渲染的方法。水下场景是具有挑战性的视觉环境,因为透过水等介质进行渲染会在图像捕获中引入距离和颜色相关的影响。我们使用物理基础的水下成像模型来约束3D高斯渲染(3DGS),这是一种最新的辐射场技术,可以实现完整3D场景的快速训练和实时渲染。将SeaSplat应用于SeaThru-NeRF数据集中的真实场景(由美属维尔京群岛的水下航行器收集的场景)和模拟退化的真实场景,我们不仅看到在存在介质的情况下渲染场景新视点的定量性能有所提高,而且还能够恢复场景的底层真实颜色,并将渲染恢复到不存在介入介质的状态。我们证明了水下成像模型有助于学习场景结构,获得更好的深度图,并表明我们的改进保持了利用3D高斯表示带来的显著计算优势。||
|**2024-09-25**|[Let's Make a Splan: Risk-Aware Trajectory Optimization in a Normalized Gaussian Splat](http://arxiv.org/abs/2409.16915)|null|神经辐射场和高斯 splatting 通过实现复杂场景的逼真表示,改变了计算机视觉领域。尽管取得了成功,但它们在现实世界机器人任务(如轨迹优化)中的应用仍然有限。造成这种有限成功有两个关键因素。首先,在辐射模型中难以推理碰撞。其次,很难足够快地执行辐射模型的推理以进行实时轨迹合成。本文提出了 SPLANNING,一种在高斯 splatting 模型中运行的风险感知轨迹优化器,以应对这些挑战。本文首先推导出一种严格限制机器人与辐射场之间碰撞概率上限的方法。其次,本文介绍了高斯 splatting 的归一化重构,以便在高斯 splat 中高效计算碰撞边界。第三,提出了一种在避免与高斯 splat 表示的场景发生碰撞的同时优化轨迹的方法。实验表明,在高度杂乱的环境中,SPLANNING 在生成无碰撞轨迹方面优于最先进的方法。所提出的系统还在现实世界的机器人机械臂上进行了测试。项目页面位于 https://roahmlab.github.io/splanning。||
|**2024-09-22**|[MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views](http://arxiv.org/abs/2409.14316)|null|近年来,神经辐射场(NeRF)的进步促进了少样本新视角合成(NVS)的发展,这是三维视觉应用中的一个重大挑战。尽管人们做了很多尝试来减少NeRF中对密集输入的需求,但它仍然面临着训练和渲染过程耗时的难题。最近,三维高斯散射(3DGS)通过基于点的显式表示实现了实时高质量渲染。然而,与NeRF类似,由于缺乏约束,它往往会对训练视图过拟合。在本文中,我们提出了MVPGS,一种基于三维高斯散射挖掘多视图先验的少样本NVS方法。我们利用最近基于学习的多视图立体(MVS)来提高3DGS几何初始化的质量。为了减轻过拟合,我们提出了一种前向扭曲方法,用于根据计算出的几何形状对场景进行额外的外观约束。此外,我们引入了一种视图一致性几何约束来约束高斯参数,以促进适当的优化收敛,并利用单目深度正则化作为补偿。实验表明,该方法在实时渲染速度下达到了最先进的性能。项目页面:https://zezeaaa.github.io/projects/MVPGS/||
|**2024-09-10**|[Sources of Uncertainty in 3D Scene Reconstruction](http://arxiv.org/abs/2409.06407)|**[link](https://github.com/aaltoml/uncertainty-nerf-gs)**|三维场景重建过程会受到现实世界场景中众多不确定性来源的影响。虽然神经辐射场 (NeRF) 和三维高斯散射 (GS) 可以实现高保真渲染,但它们缺乏内置机制来直接解决或量化由噪声、遮挡、混杂异常值和不精确的相机姿态输入引起的不确定性。在本文中,我们引入了一种分类法,对这些方法中固有的不同不确定性来源进行分类。此外,我们使用不确定性估计技术扩展了基于 NeRF 和 GS 的方法,包括学习不确定性输出和集成,并进行了实证研究来评估它们捕捉重建敏感性的能力。我们的研究强调了在设计基于 NeRF/GS 的不确定性感知三维重建方法时,需要解决各种不确定性方面的需求。||
|**2024-09-05**|[Optimizing 3D Gaussian Splatting for Sparse Viewpoint Scene Reconstruction](http://arxiv.org/abs/2409.03213)|null|三维高斯 splatting (3DGS) 已成为一种很有前景的三维场景表示方法,与神经辐射场 (NeRF) 相比,它可以降低计算开销。然而,3DGS 容易出现高频伪影,并且在稀疏视点条件下表现不佳,从而限制了其在机器人和计算机视觉中的应用。为了解决这些限制,我们引入了 SVS-GS,这是一种用于稀疏视点场景重建的新框架,它集成了三维高斯平滑滤波器来抑制伪影。此外,我们的方法结合了深度梯度剖面先验 (DGPP) 损失和动态深度掩码来锐化边缘,并结合了分数蒸馏采样 (SDS) 损失的二维扩散来增强新视图合成中的几何一致性。在 MipNeRF-360 和 SeaThru-NeRF 数据集上的实验评估表明,SVS-GS 显着改善了稀疏视点下的三维重建,为机器人和计算机视觉应用中的场景理解提供了一种稳健且高效的解决方案。||
|**2024-08-20**|[Gaussian in the Dark: Real-Time View Synthesis From Inconsistent Dark Images Using Gaussian Splatting](http://arxiv.org/abs/2408.09130)|**[link](https://github.com/yec22/Gaussian-DK)**|3D Gaussian Splatting has recently emerged as a powerful representation that can synthesize remarkable novel views using consistent multi-view images as input. However, we notice that images captured in dark environments where the scenes are not fully illuminated can exhibit considerable brightness variations and multi-view inconsistency, which poses great challenges to 3D Gaussian Splatting and severely degrades its performance. To tackle this problem, we propose Gaussian-DK. Observing that inconsistencies are mainly caused by camera imaging, we represent a consistent radiance field of the physical world using a set of anisotropic 3D Gaussians, and design a camera response module to compensate for multi-view inconsistencies. We also introduce a step-based gradient scaling strategy to constrain Gaussians near the camera, which turn out to be floaters, from splitting and cloning. Experiments on our proposed benchmark dataset demonstrate that Gaussian-DK produces high-quality renderings without ghosting and floater artifacts and significantly outperforms existing methods. Furthermore, we can also synthesize light-up images by controlling exposure levels that clearly show details in shadow areas.||
|**2024-09-05**|[EaDeblur-GS: Event assisted 3D Deblur Reconstruction with Gaussian Splatting](http://arxiv.org/abs/2407.13520)|null|3D deblurring reconstruction techniques have recently seen significant advancements with the development of Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Although these techniques can recover relatively clear 3D reconstructions from blurry image inputs, they still face limitations in handling severe blurring and complex camera motion. To address these issues, we propose Event-assisted 3D Deblur Reconstruction with Gaussian Splatting (EaDeblur-GS), which integrates event camera data to enhance the robustness of 3DGS against motion blur. By employing an Adaptive Deviation Estimator (ADE) network to estimate Gaussian center deviations and using novel loss functions, EaDeblur-GS achieves sharp 3D reconstructions in real-time, demonstrating performance comparable to state-of-the-art methods.||
|**2024-10-02**|[DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation](http://arxiv.org/abs/2407.11394)|**[link](https://github.com/kaist-cvml/dreamcatalyst)**|分数蒸馏采样(SDS)已成为文本驱动3D编辑任务中一种有效的框架,它利用扩散模型进行3D一致性编辑。然而,现有的基于SDS的3D编辑方法存在训练时间长、生成结果质量低的问题。我们发现,造成这种性能下降的根本原因是它们与扩散模型的采样动力学相冲突。解决这种冲突使我们能够将SDS视为通过从数据空间采样进行3D编辑的扩散逆过程。相比之下,现有方法简单地使用扩散模型提取分数函数。基于这些见解,我们提出了DreamCatalyst,这是一个在SDS框架中考虑了这些采样动力学的新框架。具体来说,我们设计了DreamCatalyst的优化过程来逼近编辑任务中的扩散逆过程,从而与扩散采样动力学保持一致。因此,DreamCatalyst成功地减少了训练时间并提高了编辑质量。我们的方法提供了两种模式:(1)快速模式,编辑神经辐射场(NeRF)场景的速度比当前最先进的NeRF编辑方法快约23倍;(2)高质量模式,生成的结果比这些方法好约8倍。值得注意的是,我们的高质量模式在速度和质量方面都优于当前最先进的NeRF编辑方法。DreamCatalyst还超越了最先进的3D高斯散射(3DGS)编辑方法,使其成为一种有效且与模型无关的3D编辑解决方案。请在我们的项目页面上查看更多结果:https://dream-catalyst.github.io。|
|**2024-07-10**|[3D Gaussian Ray Tracing: Fast Tracing of Particle Scenes](http://arxiv.org/abs/2407.07090)|null|基于粒子的辐射场表示法,例如 3D 高斯 splatting,在复杂场景的重建和重新渲染方面取得了巨大成功。大多数现有方法通过光栅化渲染粒子,将它们投影到屏幕空间图块中,以便按排序顺序进行处理。而这项工作则考虑对粒子进行光线追踪,构建边界体积层次结构,并使用高性能 GPU 光线追踪硬件为每个像素投射光线。为了有效处理大量半透明粒子,我们描述了一种专门的渲染算法,该算法使用边界网格封装粒子,以利用快速的光线三角形相交,并按深度顺序对成批的相交进行着色。光线追踪的优势在计算机图形学中是众所周知的:处理非相干光线以获得阴影和反射等二次照明效果、从机器人技术中常见的高度扭曲的相机进行渲染、随机采样光线等等。使用我们的渲染器,与光栅化相比,这种灵活性几乎没有成本。实验证明了我们方法的速度和准确性,以及在计算机图形学和视觉方面的几种应用。我们进一步提出了对基本高斯表示的相关改进,包括简单地使用广义核函数,这可以显着减少粒子命中次数。||
|**2024-07-07**|[GaussReg: Fast 3D Registration with Gaussian Splatting](http://arxiv.org/abs/2407.05254)|null|点云配准是大规模三维场景扫描和重建的基本问题。在深度学习的帮助下,配准方法得到了显著发展,已接近成熟阶段。随着神经辐射场(NeRF)的引入,它凭借强大的视图合成能力成为最受欢迎的三维场景表示方法。对于NeRF表示,大规模场景重建也需要对其进行配准。然而,这方面还缺乏深入的探索。这是因为对具有隐式表示的两个场景之间的几何关系进行建模存在固有的挑战。现有方法通常将隐式表示转换为显式表示以进行进一步配准。最近,引入了高斯 splatting(GS),它采用显式三维高斯函数。这种方法在保持高质量渲染效果的同时,显著提高了渲染速度。给定两个具有显式GS表示的场景,我们在这项工作中探索了它们之间的三维配准任务。为此,我们提出了GaussReg,一个快速且准确的由粗到精的框架。粗配准阶段遵循现有的点云配准方法,并估计来自GS的点云的粗略对齐。我们还提出了一种新的图像引导的精配准方法,该方法通过从GS渲染图像,为精确对齐提供更详细的几何信息。为了支持全面的评估,我们仔细构建了一个名为ScanNet-GSReg的场景级数据集,其中包含从ScanNet数据集中获得的1379个场景,并收集了一个名为GSReg的真实世界数据集。实验结果表明,我们的方法在多个数据集上实现了最先进的性能。我们的GaussReg比HLoc(SuperPoint作为特征提取器,SuperGlue作为匹配器)快44倍,并且具有相当的精度。||
|**2024-07-04**|[CRiM-GS: Continuous Rigid Motion-Aware Gaussian Splatting from Motion Blur Images](http://arxiv.org/abs/2407.03923)|null|由于神经辐射场 (NeRFs) 能够高质量地渲染新视角,因此备受关注,这促使人们对其在各种真实场景中的应用进行研究。其中一个关键挑战是相机在曝光时间内移动造成的相机运动模糊,这阻碍了精确的三维场景重建。在本研究中,我们提出了连续刚体运动感知高斯散射 (CRiM-GS),以实时渲染速度从模糊图像中重建精确的三维场景。考虑到实际的相机运动模糊过程包含复杂的运动模式,我们基于神经常微分方程 (ODEs) 预测相机的连续运动。具体来说,我们利用刚体变换来模拟相机运动并进行适当的正则化,以保持对象的形状和大小。此外,我们在 SE(3) 场中引入连续可变形三维变换,通过确保更高的自由度使刚体变换适应现实问题。通过重新审视基本相机理论并采用先进的神经网络训练技术,我们实现了对连续相机轨迹的精确建模。我们进行了大量的实验,在基准数据集上定量和定性地证明了其最先进的性能。|
|**2024-07-29**|[Trimming the Fat: Efficient Compression of 3D Gaussian Splats through Pruning](http://arxiv.org/abs/2406.18214)|**[link](https://github.com/salmanali96/trimming-the-fat)**|近年来,由于神经辐射场和最近出现的3D高斯散射(3DGS)模型提供了端到端训练的能力,3D模型的使用得到了推广。后者在训练过程中能够轻松地快速收敛并提供广泛的可编辑性,因此具有显著的优势。然而,尽管发展迅速,但关于这些模型可扩展性的文献仍处于起步阶段。在本研究中,我们为解决这一差距采取了一些初步措施,展示了一种能够实现此类模型内存和计算可扩展性的方法。具体来说,我们提出了“Trimming the fat”,这是一种基于梯度的迭代式后剪枝技术,用于消除模型中编码的冗余信息。我们在广泛认可的基准测试集上的实验结果证明了我们方法的有效性,结果表明,在保持甚至提高基线性能的同时,最多可以移除75%的高斯函数。我们的方法实现了大约50倍的压缩,同时保持了与基线模型相似的性能,并且能够将计算速度提高到600 FPS。|
|**2024-06-21**|[Gaussian Splatting to Real World Flight Navigation Transfer with Liquid Networks](http://arxiv.org/abs/2406.15149)|null|模拟器是自动机器人学习的强大工具,因为它们可以提供可扩展的数据生成、灵活的设计和轨迹优化。然而,将从模拟数据中学习到的行为迁移到现实世界中被证明是困难的,通常需要通过计算量大的域随机化方法或进一步的模型微调来缓解。我们提出了一种方法来提高模拟到真实视觉四旋翼导航任务中对分布变化的泛化能力和鲁棒性。为此,我们首先通过将高斯 splatting 与四旋翼飞行动力学相结合来构建模拟器,然后使用 Liquid 神经网络训练鲁棒的导航策略。通过这种方式,我们获得了一个完整的模仿学习协议,它结合了 3D 高斯 splatting 辐射场渲染的进步、专家演示训练数据的巧妙编程以及 Liquid 网络的任务理解能力。通过一系列定量飞行测试,我们证明了在单个模拟场景中学习到的导航技能可以直接稳健地迁移到现实世界。我们进一步展示了在剧烈的分布和物理环境变化下,在训练环境之外保持性能的能力。我们学习的 Liquid 策略,仅在从真实感室内模拟飞行中提取的单个目标操作上进行训练,可以泛化到户外真实硬件平台上的多步远足。||
|**2024-06-14**|[Wild-GS: Real-Time Novel View Synthesis from Unconstrained Photo Collections](http://arxiv.org/abs/2406.10373)|null|在非结构化的旅游环境中拍摄的照片经常表现出多变的外观和短暂的遮挡,这对准确的场景重建提出了挑战,并在新视角合成中导致了伪影。虽然先前的方法已经将神经辐射场 (NeRF) 与其他可学习模块相结合来处理动态外观并消除瞬态对象,但其大量的训练需求和缓慢的渲染速度限制了实际部署。最近,3D 高斯 splatting (3DGS) 已成为 NeRF 的一种有前途的替代方案,它提供了卓越的训练和推理效率以及更好的渲染质量。本文介绍了 Wild-GS,这是一种针对不受约束的照片集优化的 3DGS 创新改编,同时保留了其效率优势。Wild-GS 通过每张图像的固有材质属性、全局照明和相机属性以及逐点反射率的局部变化来确定每个 3D 高斯的外观。与先前在图像空间中对参考特征进行建模的方法不同,Wild-GS 通过对从参考图像中提取的三平面进行采样,将像素外观特征明确地与相应的局部高斯对齐。这种新颖的设计有效地将参考视图的高频细节外观转移到 3D 空间,并显着加快了训练过程。此外,2D 可见性图和深度正则化分别用于减轻瞬态效应和约束几何形状。大量实验表明,Wild-GS 在所有现有技术中实现了最先进的渲染性能以及最高的训练和推理效率。||
|**2024-06-06**|[A Survey on 3D Human Avatar Modeling -- From Reconstruction to Generation](http://arxiv.org/abs/2406.04253)|null|3D modeling has long been an important area in computer vision and computer graphics. Recently, thanks to the breakthroughs in neural representations and generative models, we witnessed a rapid development of 3D modeling. 3D human modeling, lying at the core of many real-world applications, such as gaming and animation, has attracted significant attention. Over the past few years, a large body of work on creating 3D human avatars has been introduced, forming a new and abundant knowledge base for 3D human modeling. The scale of the literature makes it difficult for individuals to keep track of all the works. This survey aims to provide a comprehensive overview of these emerging techniques for 3D human avatar modeling, from both reconstruction and generation perspectives. Firstly, we review representative methods for 3D human reconstruction, including methods based on pixel-aligned implicit function, neural radiance field, and 3D Gaussian Splatting, etc. We then summarize representative methods for 3D human generation, especially those using large language models like CLIP, diffusion models, and various 3D representations, which demonstrate state-of-the-art performance. Finally, we discuss our reflection on existing methods and open challenges for 3D human avatar modeling, shedding light on future research.||
|**2024-06-13**|[3D-HGS: 3D Half-Gaussian Splatting](http://arxiv.org/abs/2406.02720)|**[link](https://github.com/lihaolin88/3d-half-gaussian-splatting)**|照片级逼真的三维重建是三维计算机视觉中的一个基本问题。由于最近神经渲染技术的出现,该领域取得了相当大的进步。这些技术主要集中于学习三维场景的体积表示,并通过渲染得到的损失函数来细化这些表示。其中,三维高斯散射(3D-GS)已成为一种重要的方法,其性能超过了神经辐射场(NeRFs)。3D-GS使用参数化的三维高斯函数来建模空间位置和颜色信息,并结合基于图块的快速渲染技术。尽管其渲染性能和速度都很出色,但使用三维高斯核函数在准确表示不连续函数方面存在固有限制,特别是在形状不连续的边缘和角落,以及在颜色不连续的不同纹理之间。为了解决这个问题,我们建议采用三维半高斯(3D-HGS)核函数,它可以作为一种即插即用的核函数。我们的实验表明,它们能够提高当前与3D-GS相关方法的性能,并在不影响渲染速度的情况下,在各种数据集上实现最先进的渲染性能。||
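
上表中 NeRF 与 3DGS 方法的共同基础,是沿光线对样本按深度由近到远做 alpha 混合:C = Σ_i c_i·α_i·Π_{j<i}(1-α_j)。下面是该混合公式的纯 Python 标量示意(仅为说明概念,假设颜色为单通道灰度,与上述任何论文的实现无关):

```python
def composite(samples):
    """前向 (front-to-back) alpha 混合: C = Σ_i c_i·α_i·T_i, 其中 T_i = Π_{j<i}(1-α_j)。
    samples 为按深度由近到远排序的 (alpha, color) 列表, color 取标量灰度。"""
    color, transmittance = 0.0, 1.0
    for alpha, c in samples:
        color += transmittance * alpha * c
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:      # 提前终止: 光线已几乎完全被吸收
            break
    return color

print(composite([(1.0, 0.8)]))               # 完全不透明的首个样本直接决定颜色 → 0.8
print(composite([(0.5, 0.2), (0.5, 1.0)]))   # 0.5*0.2 + 0.5*0.5*1.0 ≈ 0.35
```

3DGS 与 NeRF 的区别主要在于样本从何而来:前者把排序后的三维高斯投影 (splat) 到像素上,后者沿光线查询神经网络;两者最终都按上述公式混合颜色。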

(back to top)

## 分类/检测/识别/分割

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
|**2024-11-05**|[CRT-Fusion: Camera, Radar, Temporal Fusion Using Motion Information for 3D Object Detection](http://arxiv.org/abs/2411.03013)|null|精确且鲁棒的三维目标检测是自动驾驶汽车和机器人技术中的关键组成部分。尽管最近的雷达-相机融合方法通过在鸟瞰图(BEV)表示中融合信息取得了显著进展,但它们往往难以有效捕捉动态物体的运动,从而导致在实际场景中的性能受限。在本文中,我们介绍了 CRT-Fusion,一个将时间信息整合到雷达-相机融合中的新型框架,以应对这一挑战。我们的方法包含三个关键模块:多视图融合(MVF)、运动特征估计器(MFE)和运动引导时间融合(MGTF)。MVF 模块在相机视图和鸟瞰图中融合雷达和图像特征,从而生成更精确的统一 BEV 表示。MFE 模块同时执行两项任务:像素级速度信息估计和 BEV 分割。基于从 MFE 模块获得的速度和占用率分数图,MGTF 模块以循环方式跨多个时间戳对齐和融合特征图。通过考虑动态物体的运动,CRT-Fusion 可以生成鲁棒的 BEV 特征图,从而提高检测精度和鲁棒性。在具有挑战性的 nuScenes 数据集上的大量评估表明,CRT-Fusion 在基于雷达-相机的三维目标检测方面实现了最先进的性能。我们的方法在 NDS 方面比之前的最佳方法高出 1.7%,同时在 mAP 方面也超过了领先方法 1.4%。这两个指标的显著改进展示了我们提出的融合策略在增强三维目标检测的可靠性和准确性方面的有效性。|
|**2024-11-05**|[Domain Expansion and Boundary Growth for Open-Set Single-Source Domain Generalization](http://arxiv.org/abs/2411.02920)|null|开放集单源域泛化旨在使用单一源域学习一个鲁棒的模型,该模型可以泛化到具有域偏移和标签偏移的未知目标域。源域数据的稀缺性和目标域的未知数据分布对域不变特征学习和未知类别识别提出了巨大的挑战。在本文中,我们提出了一种基于域扩展和边界增长的新型学习方法,以扩展稀缺的源样本并扩大已知类别之间的边界,从而间接地拓宽已知类别和未知类别之间的边界。具体来说,我们通过对源数据进行背景抑制和风格增强来合成新样本,从而实现域扩展。然后,我们强制模型从合成样本中提取一致的知识,以便模型能够学习域不变信息。此外,我们在训练多二元分类器时,通过使用边缘图作为样本的附加模态来实现跨类别的边界增长。这种方式扩大了内点和外点之间的边界,从而提高了开放集泛化期间的未知类别识别能力。大量实验表明,我们的方法可以在多个跨域图像分类数据集上实现显著的改进并达到最先进的性能。|
|**2024-11-05**|[Applications of Automatic Differentiation in Image Registration](http://arxiv.org/abs/2411.02806)|**[link](https://github.com/wdwatson2/ImgRegPytorchProject)**|我们论证了在机器学习框架中已普遍可用的自动微分技术,是探索改进多尺度仿射图像配准和仿射超分辨率问题算法的有效方法。在第一个关于多尺度配准的实验中,我们实现了一种常微分方程预测-校正方法,该方法涉及关于尺度参数的导数和图像配准目标函数的Hessian矩阵,这两者在没有自动微分的情况下都很难计算。我们的研究结果表明,精确的Hessian矩阵对于该方法比传统的多尺度方法有所改进是必要的;而高斯-牛顿Hessian近似未能提供这样的改进。在第二个实验中,我们实现了一种用于超分辨率的可变投影高斯-牛顿方法,并使用自动微分来对迭代计算的投影进行微分,这是一种文献中先前未涉及的方法。我们展示了不通过投影进行微分获得的雅可比矩阵是可变投影正向映射的真实雅可比矩阵的较差近似,并探讨了其他一些近似的性能。通过解决这些问题,这项工作促进了自动微分在图像配准中的应用,并为机器学习工具在该领域的进一步应用开创了先例。|
|**2024-11-05**|[ERUP-YOLO: Enhancing Object Detection Robustness for Adverse Weather Condition by Unified Image-Adaptive Processing](http://arxiv.org/abs/2411.02799)|null|我们提出了一种图像自适应的目标检测方法,用于应对雾霾和低光等恶劣天气条件。我们的框架采用可微分预处理滤波器来执行图像增强,以适应后续的目标检测阶段。我们的框架引入了两种可微分滤波器:基于贝塞尔曲线的逐像素(BPW)滤波器和基于核的局部(KBL)滤波器。这些滤波器统一了经典图像处理滤波器的功能,并提高了目标检测的性能。我们还提出了一种使用BPW滤波器的域无关数据增强策略。我们的方法不需要针对特定数据定制滤波器组合、参数范围和数据增强。我们通过将所提出的方法(称为ERUP-YOLO,即通过统一图像处理增强鲁棒性的YOLO)应用于YOLOv3检测器来评估其性能。在恶劣天气数据集上的实验表明,我们提出的滤波器在表达能力上与传统方法相当或更优,并且我们的ERUP-YOLO在各种恶劣天气条件下(包括雾霾和低光条件)都实现了卓越的性能。|
|**2024-11-05**|[Efficient Feature Aggregation and Scale-Aware Regression for Monocular 3D Object Detection](http://arxiv.org/abs/2411.02747)|null|单目3D目标检测因其简洁性和低成本而备受关注。现有方法通常遵循传统的2D检测范式,先定位目标中心,然后通过邻近特征预测3D属性。然而,这些方法主要依赖于渐进的跨尺度特征聚合,并且只关注局部信息,这可能导致缺乏全局感知和遗漏小尺度目标。此外,由于不同场景和深度下目标尺度的巨大变化,不准确的感受野通常会导致背景噪声和特征表示退化。为了解决这些问题,我们引入了MonoASRH,一种新颖的单目3D检测框架,由高效混合特征聚合模块(EH-FAM)和自适应尺度感知3D回归头(ASRH)组成。具体来说,EH-FAM采用具有全局感受野的多头注意力机制来提取小尺度目标的语义特征,并利用轻量级卷积模块高效地聚合不同尺度的视觉特征。ASRH对2D边界框维度进行编码,然后通过尺度-语义特征融合模块将尺度特征与EH-FAM聚合的语义特征融合。尺度-语义特征融合模块引导ASRH学习动态感受野偏移,将尺度先验融入3D位置预测,以获得更好的尺度感知能力。在KITTI和Waymo数据集上的大量实验表明,MonoASRH实现了最先进的性能。|
|**2024-11-05**|[Integrated lithium niobate photonic computing circuit based on efficient and high-speed electro-optic conversion](http://arxiv.org/abs/2411.02734)|null|我们展示了一种利用系统级薄膜铌酸锂电路的光计算加速器,克服了当前电子计算的瓶颈限制。利用强大的电光(普克尔斯)效应和该平台的可扩展性,我们展示了高达 1.36 TOPS 的光子计算速度,同时功耗仅为 0.057 pJ/OP。我们的系统具有 100 多个协同工作的薄膜铌酸锂高性能组件,超越了该平台上的最先进系统。我们进一步演示了二元分类、手写数字分类和图像分类,并实现了显著的准确性,展示了我们系统执行实际算法的能力。最后,我们研究了将我们的系统与混合集成的分布式反馈激光源和异质集成的改进单向行波载流子光电二极管相结合的可能性。我们的结果表明了薄膜铌酸锂作为计算平台的前景,解决了当前电子和光子计算中的瓶颈。其高性能电光权重编码和转换、晶圆级可扩展性以及与集成激光器和探测器的兼容性等独特特性,使薄膜铌酸锂光子学成为硅光子学的有力补充,并可扩展到超快速和低功耗信号处理和测距等应用领域。|
|**2024-11-04**|[Intelligent Video Recording Optimization using Activity Detection for Surveillance Systems](http://arxiv.org/abs/2411.02632)|null|监控系统通常难以管理大量的视频素材,其中很多素材无关紧要,导致存储效率低下且事件检索困难。本文提出了一种专注于活动检测的优化视频录制解决方案来解决这些问题。该方案利用了一种混合方法,结合了基于帧差法的运动检测和使用 YOLOv9 的目标检测。该策略专门针对涉及人类或汽车活动的场景进行录制,从而减少不必要的素材并优化存储空间使用。开发的模型展现出卓越的性能,汽车检测的精确率达到 0.855,行人检测的精确率达到 0.884,并且与仅依赖运动检测的传统监控系统相比,存储需求减少了三分之二。存储量的显著减少凸显了该方案在提高监控系统效率方面的有效性。尽管如此,仍然存在一些局限性,特别是在恶劣天气条件下(例如强风)会出现误报和漏报。|
|**2024-11-04**|[MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D](http://arxiv.org/abs/2411.02336)|null|纹理化是3D资产生产流程中的关键步骤,它可以增强3D资产的视觉吸引力和多样性。尽管文本到纹理(T2T)生成技术近期取得了一些进展,但现有方法生成的结果往往不尽如人意,主要原因是局部不连续性、多视图之间不一致以及它们对UV展开结果的严重依赖。为了应对这些挑战,我们提出了一种名为MVPaint的创新生成-细化3D纹理化框架,它可以生成高分辨率、无缝的纹理,同时强调多视图一致性。MVPaint主要由三个关键模块组成。1) 同步多视图生成(SMG)。给定一个3D网格模型,MVPaint首先使用SMG模型同时生成多视图图像,这会导致粗糙的纹理化结果,并且由于缺少观察而存在未上色的部分。2) 空间感知3D修复(S3I)。为了确保完整的3D纹理化,我们引入了S3I方法,专门用于有效地对先前未观察到的区域进行纹理化。3) UV细化(UVR)。此外,MVPaint采用UVR模块来提高UV空间中的纹理质量,该模块首先执行UV空间超分辨率,然后使用空间感知的接缝平滑算法来修正由UV展开引起的空间纹理不连续性。此外,我们基于从Objaverse数据集和整个GSO数据集中选择的优质3D网格,分别建立了两个T2T评估基准:Objaverse T2T基准和GSO T2T基准。大量的实验结果表明,MVPaint超越了现有的最先进方法。值得注意的是,MVPaint可以生成高保真纹理,同时最大限度地减少Janus问题,并显著增强跨视图一致性。|
|**2024-11-04**|[Toward Integrating Semantic-aware Path Planning and Reliable Localization for UAV Operations](http://arxiv.org/abs/2411.01816)|null|定位是无人机系统 (UAV) 最关键的任务之一,直接影响整体性能,它可以通过各种传感器实现,并应用于与搜索和救援行动、目标跟踪、建筑等相关的众多任务。然而,由于挑战性环境的负面影响,无人机可能会丢失用于定位的信号。在本文中,我们提出了一种有效的路径规划系统,利用语义分割信息,使用单目相机绕过纹理缺失和有问题的区域,如湖泊、海洋和高层建筑。我们介绍了一种实时语义分割架构和一种新颖的关键帧决策流程,以基于像素分布优化图像输入,从而减少处理时间。一个基于动态窗口方法 (DWA) 算法的分层规划器,与成本地图集成,旨在促进高效的路径规划。该系统在使用 Unity 的逼真模拟环境中实现,并与分割模型参数对齐。全面的定性和定量评估验证了我们方法的有效性,表明在挑战性环境中无人机定位的可靠性和效率得到了显著提高。|
|**2024-11-04**|[ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model](http://arxiv.org/abs/2411.01756)|null|视觉目标跟踪的目标是基于初始边界框在视频序列中定位目标物体。最近,视觉语言(VL)跟踪器提议利用额外的自然语言描述来增强其在各种应用中的多功能性。然而,在跟踪性能方面,VL 跟踪器仍然不如最先进的(SoTA)视觉跟踪器。我们发现这种劣势主要源于它们严重依赖手动文本标注,其中包括频繁提供的模糊语言描述。在本文中,我们提出了 ChatTracker,它利用多模态大型语言模型 (MLLM) 中丰富的知识来生成高质量的语言描述并提高跟踪性能。为此,我们提出了一种新颖的基于反思的提示优化模块,用跟踪反馈迭代地改进目标模糊和不准确的描述。为了进一步利用 MLLM 生成的语义信息,我们提出了一个简单而有效的 VL 跟踪框架,它可以轻松地作为即插即用模块集成到 VL 和视觉跟踪器中,以提高其性能。实验结果表明,我们提出的 ChatTracker 实现了与现有方法相当的性能。|
|**2024-10-31**|[DiffPAD: Denoising Diffusion-based Adversarial Patch Decontamination](http://arxiv.org/abs/2410.24006)|null|在不断发展的对抗性机器学习领域中,开发有效的防御补丁攻击的方法已成为一项关键挑战,需要可靠的解决方案来保护现实世界中的人工智能系统。尽管扩散模型在图像合成方面表现出非凡的能力,并且最近已被用于对抗 $\ell_p$ 范数有界攻击,但其在缓解局部补丁攻击方面的潜力很大程度上仍未得到充分探索。在这项工作中,我们提出了 DiffPAD,这是一个利用扩散模型的力量进行对抗性补丁去污的新框架。DiffPAD 首先对下采样的输入图像执行超分辨率恢复,然后采用二值化、动态阈值方案和滑动窗口来有效地定位对抗性补丁。这种设计灵感来自于理论上推导出的补丁大小和扩散恢复误差之间的相关性,该相关性在各种补丁攻击场景中得到了推广。最后,DiffPAD 将修复技术应用于原始输入图像,并将估计的补丁区域屏蔽。通过将超分辨率恢复和图像修复的闭式解集成到预训练扩散模型的条件反向采样过程中,DiffPAD 避免了对文本指导或微调的需求。通过全面的实验,我们证明了 DiffPAD 不仅实现了最先进的对抗补丁攻击的鲁棒性,而且在恢复自然图像方面表现出色,没有补丁残留。|
|**2024-10-31**|[ImOV3D: Learning Open-Vocabulary Point Clouds 3D Object Detection from Only 2D Images](http://arxiv.org/abs/2410.24001)|**[link](https://github.com/yangtiming/imov3d)**|开放词汇3D目标检测 (OV-3Det) 旨在泛化到训练阶段标记的有限数量的基本类别之外。最大的瓶颈是3D标注数据的稀缺性,而2D图像数据集丰富且标注详尽。因此,利用丰富的2D图像标注来缓解OV-3Det中固有的数据稀缺性是很直观的。在本文中,我们通过探索仅使用2D图像学习OV-3Det的潜力,将任务设置推向极限。这种设置的主要挑战是训练图像和测试点云之间的模态差距,这阻碍了将2D知识有效地整合到OV-3Det中。为了应对这一挑战,我们提出了一个名为ImOV3D的新颖框架,利用包含图像和点云 (PC) 的伪多模态表示来弥合模态差距。ImOV3D的关键在于灵活的模态转换,其中2D图像可以使用单目深度估计提升到3D,也可以通过渲染从3D场景派生。这允许将训练图像和测试点云统一到一个通用的图像-PC表示中,既包含丰富的2D语义信息,又包含了3D空间数据的深度和结构特征。我们谨慎地进行这种转换,以最大限度地减少训练和测试用例之间的域差距。在SUNRGBD和ScanNet这两个基准数据集上的大量实验表明,即使在没有真实3D训练数据的情况下,ImOV3D的性能也明显优于现有方法。通过包含少量真实的3D数据进行微调,其性能也大大超过了之前的最先进水平。代码和预训练模型已发布在https://github.com/yangtiming/ImOV3D。|
|**2024-10-31**|[Uncertainty Estimation for 3D Object Detection via Evidential Learning](http://arxiv.org/abs/2410.23910)|null|三维物体检测是自动驾驶和机器人技术中计算机视觉应用的一项重要任务。然而,模型通常难以量化检测可靠性,导致在不熟悉的场景中表现不佳。我们引入了一个框架,通过利用三维检测器中鸟瞰图表示上的证据学习损失来量化三维物体检测中的不确定性。这些不确定性估计所需的计算开销极小,并且可以推广到不同的架构。我们证明了这些不确定性估计在识别分布外场景、定位不良的物体和漏检(假阴性)方面的有效性和重要性;我们的框架在基准上平均提高了10-20%。最后,我们将这套任务集成到一个系统中,其中三维物体检测器自动标记驾驶场景,并且我们的不确定性估计在标签用于训练第二个模型之前验证标签的正确性。在此,我们基于不确定性的验证导致mAP提高了1%,NDS提高了1-2%。|
|**2024-10-31**|[From Web Data to Real Fields: Low-Cost Unsupervised Domain Adaptation for Agricultural Robots](http://arxiv.org/abs/2410.23906)|null|在精准农业中,视觉模型通常难以处理新的、未曾见过的田地,因为作物和杂草会受到外部因素的影响,导致它们的组成和外观与学习到的分布不同。本文旨在利用无监督域自适应(UDA)以低成本适应特定田地。我们探索了一种新的域迁移,从多样的大型互联网数据池迁移到机器人特定位置收集的小数据集,从而最大限度地减少对大量田间数据收集的需求。此外,我们引入了一个新的模块——多级基于注意力的对抗判别器(MAAD)——它可以集成到任何检测模型的特征提取器级别。在本研究中,我们将MAAD与CenterNet结合起来,同时检测叶片、茎和叶脉实例。我们的结果表明,与基线模型相比,未标记目标域的性能显著提高,目标检测精度提高了7.5%,关键点检测精度提高了5.1%。|
|**2024-10-31**|[Open-Set 3D object detection in LiDAR data as an Out-of-Distribution problem](http://arxiv.org/abs/2410.23767)|null|基于激光雷达数据的三维目标检测通过先进的深度学习方法在受控环境中已达到工业级性能。然而,这些神经网络模型受到有限的内围目标类别的限制。我们的工作将激光雷达数据中的开放集三维目标检测问题重新定义为分布外(OOD)检测问题,以检测异常目标。与传统的目标检测相比,这种方法带来了额外的信息。我们建立了一个比较基准,并表明两阶段OOD方法,特别是自动标记,在三维OOD目标检测中显示出有希望的结果。我们的贡献包括建立严格的评估协议:检验超参数对评估的影响,并评估生成额外数据以训练OOD感知三维目标检测器的策略。这种全面的分析对于开发能够在多样化和不可预测的现实场景中可靠执行的鲁棒的三维目标检测系统至关重要。|
|**2024-10-31**|[Context-Aware Token Selection and Packing for Enhanced Vision Transformer](http://arxiv.org/abs/2410.23608)|null|近年来,视觉Transformer的长距离注意力机制在各种计算机视觉任务中推动了显著的性能突破。然而,传统的自注意力机制需要处理信息丰富的和无信息的标记,效率低下且精度不高。虽然已引入稀疏注意力机制通过减少参与注意力的标记来缓解这些问题,但它们通常缺乏上下文感知能力和智能性。这些机制经常在不同的输入上应用统一的标记选择策略进行批量训练,或者仅针对推理阶段优化效率。为了克服这些挑战,我们提出了一种新颖的算法:选择并打包注意力(SPA)。SPA 使用一个由选择标签监督的低成本门控层动态选择信息丰富的标记,并将这些标记打包成新的批次,从而在并行化的 GPU 批量训练和推理中使用可变数量的标记。跨不同数据集和计算机视觉任务的大量实验表明,SPA 提供了卓越的性能和效率,包括目标检测的 mAP 提高了 0.6,计算成本降低了 16.4%。|
|**2024-10-31**|[QUEST-A: Untrained Filtering with Trained Focusing led to Enhanced Quantum Architectures](http://arxiv.org/abs/2410.23560)|**[link](https://github.com/uestc-ylh/quest-a)**|量子架构搜索(QAS)是量子机器学习中的一个基本挑战,目前最先进的方法主要分为免训练和梯度引导两类。然而,将QAS仅仅视为离散剪枝过程或连续优化问题都无法平衡准确性和效率。本工作将QAS分解为两个交替解决的子问题:最优电路结构检索和参数优化。基于此洞察,我们提出了量子未训练-探索协同训练架构(QUEST-A),它通过电路固有属性实现快速架构剪枝,并利用参数重用策略进行重点优化。QUEST-A在一个进化框架内统一了离散结构搜索和连续参数优化,该框架集成了快速剪枝和细粒度优化。实验表明,QUEST-A 优于现有方法:增强了信号表示中的模型表达能力,在图像分类的不同复杂度下保持了高性能,并在变分量子本征求解器任务中实现了数量级的精度提升。这些结果验证了QUEST-A的有效性,并为QAS提供了可迁移的方法。|
|**2024-10-30**|[Multilingual Vision-Language Pre-training for the Remote Sensing Domain](http://arxiv.org/abs/2410.23370)|null|基于对比语言-图像预训练 (CLIP) 的方法目前广泛用于支持涉及遥感数据的视觉和语言任务,例如跨模态检索。CLIP 在这一特定领域的适应依赖于使用标准对比目标的模型微调,使用现有的人工标注的图像-标题数据集,或使用从遥感图像上的其他注释(例如,对象类别)派生的图像-标题对对应的合成数据。使用不同的预训练机制受到的关注较少,只有少数例外情况考虑了多语言输入。这项工作提出了一种用于遥感领域的新型视觉和语言模型,探索了多语言 CLIP 模型的微调,并测试了使用基于对齐来自单个输入图像的局部和全局表示的自监督方法,以及标准的 CLIP 目标。模型训练依赖于汇集预先存在的遥感图像和英文标题配对的数据集,然后使用自动机器翻译成另外九种语言。我们表明,翻译后的数据确实是有帮助的,例如,也提高了英语的性能。我们由此产生的模型,我们将其命名为遥感多语言 CLIP (RS-M-CLIP),在各种视觉和语言任务中获得了最先进的结果,包括跨模态和多语言图像-文本检索,或零样本图像分类。|
|**2024-10-30**|[CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP](http://arxiv.org/abs/2410.23330)|null|机器遗忘 (MU) 作为一种无需完全重新训练即可从训练模型中移除特定数据的方法,受到了广泛关注。尽管在文本和图像分类等单模态领域取得了进展,但多模态模型中的遗忘仍然相对缺乏研究。本文探讨了在 CLIP(一种对齐视觉和文本表示的杰出多模态模型)中遗忘所面临的独特挑战。我们引入了 CLIPErase,一种新颖的方法,可以解开并选择性地遗忘视觉和文本关联,确保遗忘不会损害模型性能。CLIPErase 由三个关键模块组成:遗忘模块,用于破坏遗忘集中样本的关联;保留模块,用于保持模型在保留集上的性能;以及一致性模块,用于维持与原始模型的一致性。在 CIFAR-100 和 Flickr30K 数据集上对四个 CLIP 下游任务进行的大量实验表明,CLIPErase 可以有效地遗忘零样本任务中多模态样本的指定关联,同时在遗忘后保持模型在保留集上的性能。|
|**2024-10-30**|[EMMA: End-to-End Multimodal Model for Autonomous Driving](http://arxiv.org/abs/2410.23262)|null|我们推出了EMMA,一个用于自动驾驶的端到端多模态模型。基于多模态大型语言模型基础,EMMA将原始摄像头传感器数据直接映射到各种驾驶专用输出,包括规划轨迹、感知对象和道路图元素。EMMA通过将所有非传感器输入(例如导航指令和车辆自身状态)和输出(例如轨迹和3D位置)表示为自然语言文本,最大限度地利用了预训练大型语言模型的世界知识。这种方法允许EMMA在统一的语言空间中联合处理各种驾驶任务,并使用特定任务的提示生成每个任务的输出。根据经验,我们通过在nuScenes上实现最先进的运动规划性能以及在Waymo Open Motion Dataset (WOMD) 上取得有竞争力的结果来证明EMMA的有效性。EMMA还在Waymo Open Dataset (WOD) 上的摄像头主要3D目标检测中取得了有竞争力的结果。我们表明,使用规划轨迹、目标检测和道路图任务对EMMA进行联合训练可以在所有三个领域带来改进,突出了EMMA作为自动驾驶应用的通用模型的潜力。然而,EMMA也存在某些局限性:它只能处理少量图像帧,不包含像LiDAR或雷达这样的精确3D传感模态,并且计算成本高昂。我们希望我们的结果能够激发进一步的研究来缓解这些问题,并进一步发展自动驾驶模型架构的最新技术。|
|**2024-10-29**|[Active Learning for Vision-Language Models](http://arxiv.org/abs/2410.22187)|null|像CLIP这样的预训练视觉语言模型(VLM)在一系列下游计算机视觉任务中展现出令人印象深刻的零样本性能。然而,这些模型与在下游数据集上训练的有监督深度模型之间仍然存在相当大的性能差距。为了弥合这一差距,我们提出了一种新颖的主动学习(AL)框架,通过仅从未标记数据中选择少量信息丰富的样本进行标注来增强VLM的零样本分类性能。为此,我们的方法首先校准VLM的预测熵,然后结合自不确定性和邻居感知不确定性来计算可靠的不确定性度量,用于主动样本选择。我们的大量实验表明,所提出的方法在多个图像分类数据集上优于现有的AL方法,并显著提高了VLM的零样本性能。||
|**2024-10-29**|[Lighten CARAFE: Dynamic Lightweight Upsampling with Guided Reassemble Kernels](http://arxiv.org/abs/2410.22139)|**[link](https://github.com/fu0511/dynamic-lightweight-upsampling)**|特征上采样作为现代机器视觉模型中的基本操作,已在文献中得到广泛应用和研究。理想的上采样操作应轻量且计算复杂度低。也就是说,它不仅可以提高整体性能,而且不会影响模型的复杂性。内容感知特征重组 (CARAFE) 是一种精心设计的可学习操作,可实现特征上采样。尽管取得了令人鼓舞的性能,但该方法需要生成大规模内核,这带来了大量额外的冗余参数,并且固有地限制了可扩展性。为此,我们在本文中提出了一种轻量级上采样操作,称为动态轻量级上采样 (DLU)。具体来说,它首先构建一个小规模的源核空间,然后通过引入可学习的引导偏移量从核空间中采样大规模核,从而避免在上采样中引入大量可训练参数。在几个主流视觉任务上的实验表明,我们的 DLU 实现了与原始 CARAFE 相当甚至更好的性能,但复杂度要低得多,例如,在 16 倍上采样的情况下,DLU 比 CARAFE 的参数减少了 91%,FLOPs(浮点运算)至少减少了 63%,但在目标检测中,其 mAP 比 CARAFE 提高了 0.3%。代码可在 https://github.com/Fu0511/Dynamic-Lightweight-Upsampling 获取。||
|**2024-10-29**|[Data Generation for Hardware-Friendly Post-Training Quantization](http://arxiv.org/abs/2410.22110)|null|使用合成数据的零样本量化 (ZSQ) 是在隐私和安全约束下进行训练后量化 (PTQ) 的关键方法。然而,现有的数据生成方法通常难以有效地生成适用于硬件友好量化(所有模型层都量化)的数据。我们分析了现有的基于批量归一化 (BN) 匹配的数据生成方法,并确定了合成数据和真实数据之间的几个差距:1) 当前的生成算法无法同时优化整个合成数据集;2) 训练期间应用的数据增强通常被忽略;3) 由于这些层中缺少 BN,最终模型层中会出现分布偏移。这些差距会对 ZSQ 性能产生负面影响,尤其是在硬件友好量化场景中。在这项工作中,我们提出了面向硬件友好量化的数据生成 (DGH),这是一种解决这些差距的新方法。DGH 联合优化所有生成的图像,无论图像集大小或 GPU 内存限制如何。为了解决数据增强不匹配问题,DGH 包括一个预处理阶段,该阶段模仿增强过程,并通过结合自然图像先验来提高图像质量。最后,我们提出了一种新的分布拉伸损失,它可以对齐真实数据和合成数据之间特征图分布的支持度。此损失应用于模型的输出,并且可以适应各种任务。DGH 在多个任务的量化性能方面均有显著改进,在分类和目标检测中,硬件友好 ZSQ 的准确率提升高达 30%,其性能通常与真实数据相当。||
|**2024-10-29**|[FakeFormer: Efficient Vulnerability-Driven Transformers for Generalisable Deepfake Detection](http://arxiv.org/abs/2410.21964)|null|近来,视觉Transformer(ViT)在通用图像分类领域取得了前所未有的成效。然而,由于在深度伪造检测领域的性能相比卷积神经网络(CNN)较低,这些模型在该领域的探索仍然不足。本文首先研究了为什么普通的ViT架构在处理面部伪造检测时表现欠佳。我们的分析表明,与CNN相比,ViT难以对通常是深度伪造特征的局部伪造痕迹进行建模。基于这一观察,我们提出了一个名为FakeFormer的深度伪造检测框架,该框架扩展了ViT以增强对细微的不一致性信息的提取。为此,我们引入了一种由伪造痕迹易感区域引导并专为ViT设计的显式注意力学习机制。我们在多个著名的基准数据集上进行了大量实验,包括FF++、Celeb-DF、WildDeepfake、DFD、DFDCP和DFDC。结果表明,FakeFormer在泛化性和计算成本方面均优于现有最佳方法,且无需大规模训练数据集。代码可在 https://github.com/10Ring/FakeFormer 获取。|
|**2024-10-29**|[Cognitive Semantic Augmentation LEO Satellite Networks for Earth Observation](http://arxiv.org/abs/2410.21916)|null|对地观测 (EO) 系统对于地图绘制、灾难监测和资源管理至关重要,但它们难以高效地处理和传输大量的 EO 数据,特别是对于农业和实时灾难响应等专门应用而言。本文提出了一种用于 EO 卫星网络中语义通信的新型框架,旨在通过认知处理技术提高数据传输效率和系统性能。该系统利用离散任务导向联合信源信道编码 (DT-JSCC) 和语义数据增强 (SA) 将认知语义处理与星间链路相结合,从而实现多光谱图像的有效分析和传输,以改进目标检测、模式识别和实时决策。引入了认知语义增强 (CSA) 来增强系统处理和传输语义信息的能力,从而改进特征优先级排序、一致性以及对不断变化的通信和应用需求的适应性。端到端架构专为下一代卫星网络(例如支持 6G 的网络)而设计,与联邦学习相比,展示了在更少的通信轮次和更高的精度方面的显著改进。||
|**2024-10-29**|[Bayesian Optimization for Hyperparameters Tuning in Neural Networks](http://arxiv.org/abs/2410.21886)|null|本研究探讨了贝叶斯优化(BO)在神经网络超参数调整中的应用,特别针对增强卷积神经网络(CNN)在图像分类任务中的性能。贝叶斯优化是一种无导数的全局优化方法,适用于具有连续输入和有限评估预算的昂贵的黑盒函数。BO算法利用高斯过程回归和采集函数(如置信上限(UCB)和期望改进(EI))来有效地识别最佳配置。本研究使用Ax和BOTorch框架,展示了BO在减少超参数调整试验次数的同时实现具有竞争力的模型性能的效率。实验结果表明,BO有效地平衡了探索和利用,快速收敛到CNN架构的最佳设置。这种方法强调了BO在自动化神经网络调整方面的潜力,有助于提高机器学习流程的准确性和计算效率。||
|**2024-10-29**|[PK-YOLO: Pretrained Knowledge Guided YOLO for Brain Tumor Detection in Multiplanar MRI Slices](http://arxiv.org/abs/2410.21822)|**[link](https://github.com/mkang315/pk-yolo)**|多平面磁共振成像 (MRI) 切片中的脑肿瘤检测是一项具有挑战性的任务,因为多平面图像的结构中存在各种外观和关系。在本文中,我们提出了一种新的基于 YOLO(You Only Look Once)的检测模型,该模型结合了预训练知识 (PK),称为 PK-YOLO,以提高多平面 MRI 切片中脑肿瘤检测的性能。据我们所知,PK-YOLO 是第一个基于预训练知识引导的 YOLO 目标检测器。新方法的主要组成部分包括一个通过稀疏掩码建模预训练的纯轻量级卷积神经网络主干、一个带有预训练主干的 YOLO 架构和一个用于改进小目标检测的回归损失函数。预训练的主干允许将单个平面 MRI 切片上的目标查询的特征迁移到模型编码器中,并且学习到的领域知识库可以改进域内检测。改进的损失函数可以进一步提高多平面二维 MRI 切片中小尺寸脑肿瘤的检测性能。实验结果表明,与最先进的类 YOLO 和类 DETR 目标检测器相比,所提出的 PK-YOLO 在多平面 MRI 脑肿瘤检测数据集上实现了具有竞争力的性能。代码可在 https://github.com/mkang315/PK-YOLO 获取。||
|**2024-10-28**|[MVSDet: Multi-View Indoor 3D Object Detection via Efficient Plane Sweeps](http://arxiv.org/abs/2410.21566)|**[link](https://github.com/pixie8888/mvsdet)**|多视角室内三维物体检测的关键挑战在于从图像中推断准确的几何信息,以实现精确的三维检测。先前的方法依赖于神经辐射场(NeRF)进行几何推理。然而,从NeRF提取的几何信息通常不准确,导致检测性能欠佳。本文提出了MVSDet,它利用平面扫描进行几何感知的三维物体检测。为了规避对大量深度平面进行精确深度预测的要求,我们设计了一种概率采样和软加权机制来决定像素特征在三维体素上的放置。我们为每个像素选择概率体素中得分最高的多个位置,并使用它们的概率得分来表示置信度。我们进一步应用最新的像素对齐高斯溅射 (Gaussian Splatting) 来正则化深度预测,并在计算开销很小的情况下提高检测性能。我们在 ScanNet 和 ARKitScenes 数据集上进行了大量实验,以证明我们模型的优越性。我们的代码可在 https://github.com/Pixie8888/MVSDet 获取。|
|**2024-10-28**|[TACO: Adversarial Camouflage Optimization on Trucks to Fool Object Detectors](http://arxiv.org/abs/2410.21443)|null|对抗性攻击威胁着机器学习模型在自动驾驶和防御系统等关键应用中的可靠性。随着像YOLOv8这样的模型使目标检测器变得更加鲁棒,开发有效的对抗性方法也越来越具有挑战性。我们提出了卡车对抗性伪装优化(TACO),这是一个在3D车辆模型上生成对抗性伪装图案以欺骗最先进的目标检测器的新框架。TACO采用虚幻引擎5,将可微渲染与逼真的渲染网络相结合,以优化针对YOLOv8的对抗性纹理。为了确保生成的纹理既能有效地欺骗检测器,又在视觉上合理,我们引入了卷积平滑损失函数(一种通用的平滑损失)。实验评估表明,TACO显著降低了YOLOv8的检测性能,在未见测试数据上实现了 0.0099 的 mAP@0.5。此外,这些对抗性图案对其他目标检测模型(如Faster R-CNN和早期YOLO版本)表现出很强的迁移性。|
|**2024-10-28**|[Synthetica: Large Scale Synthetic Data for Robot Perception](http://arxiv.org/abs/2410.21153)|null|基于视觉的目标检测器是机器人应用的关键基础,因为它们提供有关环境中目标定位的宝贵信息。这些检测器需要确保在不同的照明条件、遮挡和视觉伪影下都具有高可靠性,同时还要实时运行。为这些网络收集和标注真实世界的数据非常耗时且成本高昂,尤其是对于工业物体等自定义资产,这使其难以推广到实际场景。为此,我们提出了Synthetica,一种用于训练鲁棒状态估计器的大规模合成数据生成方法。本文重点关注目标检测任务,这是一个重要问题,可以作为大多数状态估计问题(例如姿态估计)的前端。利用来自逼真的光线追踪渲染器的数据,我们扩大了数据生成规模,生成了270万张图像,以训练高精度实时检测Transformer。我们提出了一系列渲染随机化和训练时数据增强技术,有助于视觉任务的稳健的仿真到现实性能。我们展示了在目标检测任务中最先进的性能,同时检测器以50-100Hz的频率运行,比之前的SOTA快9倍。我们通过展示一个用于现实世界中自定义对象的管道,进一步证明了我们的训练方法对机器人应用的有用性,而这些对象之前并不存在数据集。我们的工作强调了扩展合成数据生成对于实现稳健的仿真到现实迁移以及实现最快的实时推理速度的重要性。视频和补充信息可以在以下URL找到:https://sites.google.com/view/synthetica-vision。|
|**2024-10-25**|[Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models](http://arxiv.org/abs/2410.19635)|null|最近的视觉基础模型可以提取通用表示并在各种任务中展现出令人印象深刻的能力。然而,它们在目标检测方面的应用在很大程度上被忽视了,尤其是在没有经过微调的情况下。在这项工作中,我们展示了冻结的基础模型可以成为通用的特征增强器,即使它们没有针对目标检测进行预训练。具体来说,我们探索了以下两种方式将基础模型的高级图像理解能力直接迁移到检测器中。首先,基础模型中的类别标记提供了对复杂场景的深入理解,这可以通过提供紧凑的上下文来促进解码检测器解码器中的目标查询。此外,基础模型中的补丁标记可以通过提供语义细节来丰富检测器编码器中的特征。利用冻结的基础模型作为即插即用的模块,而不是常用的骨干网络,可以显著提高检测器的性能,同时避免了由检测器骨干网络和基础模型之间的架构差异引起的问题。通过这种新颖的范式,我们通过集成一个或两个基础模型,在 COCO 验证集上,使用 R50 作为检测器骨干网络训练 12 个 epoch 后,将最先进的基于查询的检测器 DINO 的 AP 从 49.0% 提升到 51.9% (+2.9% AP),并进一步提升到 53.8% (+4.8% AP)。||
|**2024-10-25**|[MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors](http://arxiv.org/abs/2410.19590)|null|透视投影已被广泛应用于单目 3D 物体检测方法中。它引入了来自 2D 边界框和 3D 物体尺寸的几何先验,以减少深度估计的不确定性。然而,由于源于物体视觉表面的深度误差,边界框的高度通常无法代表实际的投影中心高度,这削弱了几何深度的有效性。直接预测投影高度不可避免地会导致 2D 先验信息的丢失,而使用复杂分支的多深度预测并不能充分利用几何深度。本文提出了一种基于 Transformer 的单目 3D 物体检测方法,称为 MonoDGP,该方法采用透视不变几何误差来修改投影公式。我们还尝试系统地讨论和解释几何误差背后的机制和功效,将其作为多深度预测的一种简单但有效的替代方案。此外,MonoDGP 将深度引导解码器解耦,并构建了一个仅依赖于视觉特征的 2D 解码器,提供了 2D 先验信息并在没有 3D 检测干扰的情况下初始化物体查询。为了进一步优化和微调 Transformer 解码器的输入标记,我们还引入了区域分割头 (RSH),以生成增强的特征和分割嵌入。我们的单目方法在 KITTI 基准测试中展现了最先进的性能,无需额外数据。代码可在 https://github.com/PuFanqi23/MonoDGP 获取。||
|**2024-10-25**|[DECADE: Towards Designing Efficient-yet-Accurate Distance Estimation Modules for Collision Avoidance in Mobile Advanced Driver Assistance Systems](http://arxiv.org/abs/2410.19336)|null|智能手机和其他移动设备的普及为通过低成本机器/深度学习 (ML/DL) 模型赋能的应用程序形式,以增强道路安全,为每个人提供先进驾驶辅助系统 (ADAS) 的独特机会。对于移动 ADAS 中碰撞避免的关键特性,存在用于物体检测的轻量级深度神经网络 (DNN),但传统的像素级深度/距离估计 DNN 的计算成本要高得多,因此不适用于资源受限设备上的实时应用。在本文中,我们提出了一种距离估计模型 DECADE,它处理每个检测器输出,而不是构建像素级深度/视差图。在该模型中,我们提出了一个姿态估计 DNN 来估计检测的非自我中心方向,以补充距离估计 DNN 使用边界框特征进行距离预测。我们证明了这些模块可以附加到任何检测器上,以通过快速距离估计来扩展物体检测。在 KITTI 3D 物体检测数据集上,通过附加到 YOLO 物体检测器输出并对其进行微调,对所提出的模块进行评估,实现了最先进的性能,在 0-150 米的距离范围内,平均绝对误差为 1.38 米,平均相对误差为 7.3%。我们广泛的评估方案不仅评估了类别性能,还评估了范围精度,特别是在 0-70 米的关键范围内。||
|**2024-10-24**|[HUE Dataset: High-Resolution Event and Frame Sequences for Low-Light Vision](http://arxiv.org/abs/2410.19164)|null|弱光环境对图像增强方法提出了重大挑战。为了应对这些挑战,在这项工作中,我们引入了HUE数据集,这是一个在多样化和具有挑战性的弱光条件下捕获的高分辨率事件和帧序列的综合集合。我们的数据集包括106个序列,涵盖室内、城市景观、暮光、夜间、驾驶和受控场景,每个序列都经过精心录制,以应对各种照度和动态范围。利用混合RGB和事件相机设置,我们收集了一个将高分辨率事件数据与互补帧数据相结合的数据集。我们采用无参考指标的定性和定量评估来评估最先进的弱光增强和基于事件的图像重建方法。此外,我们还在下游目标检测任务上评估了这些方法。我们的研究结果表明,虽然基于事件的方法在特定指标上表现良好,但在实际应用中可能会产生误报。该数据集和我们的综合分析为弱光视觉和混合相机系统的未来研究提供了宝贵的见解。||
|**2024-10-24**|[Optimizing Edge Offloading Decisions for Object Detection](http://arxiv.org/abs/2410.18919)|**[link](https://github.com/qiujiaming315/edgeml-object-detection)**|近年来机器学习和硬件的进步已经催生了能够执行实时目标检测且精度极高的嵌入式设备。我们考虑这样一种场景:嵌入式设备依赖于板载目标检测器,但可以选择在本地精度被认为过低时将检测任务卸载到更强大的边缘服务器。然而,资源限制了可以卸载到边缘的图像数量。我们的目标是在这些限制条件下确定要卸载哪些图像以最大限度地提高整体检测精度。为此,本文引入了一种奖励指标,旨在量化卸载单个图像带来的潜在精度提升,并提出了一种仅基于本地检测结果来估计此奖励,从而高效地做出卸载决策的方法。该方法的计算量很小,足以在嵌入式设备上运行,并且实证结果表明,即使在卸载图像的比例很小的情况下,它在提高检测精度方面也优于现有的替代方法。||
|**2024-10-24**|[Hybrid Quantum-Classical Feature Extraction approach for Image Classification using Autoencoders and Quantum SVMs](http://arxiv.org/abs/2410.18814)|null|为了利用量子计算机执行图像分类等机器学习任务,需要仔细考虑以下因素:NISQ(噪声中等规模量子)时代的量子计算机存在一些局限性,包括噪声、可扩展性、读入和读出时间以及门操作时间。因此,应该设计策略来减轻复杂数据集对量子机器学习管道整体效率的潜在影响,否则可能会导致资源需求过高或噪声增加。我们应用了一种使用 ResNet10 启发的卷积自编码器的经典特征提取方法,在将数据馈送到量子机器学习模块之前,既降低了数据集的维数,又提取了抽象且有意义的特征。我们选择的量子模块是量子增强支持向量机 (QSVM),因为支持向量机通常不需要大样本量来识别数据中的模式,并且具有短深度量子电路,这限制了噪声的影响。自编码器经过训练,可以通过图像重建来提取有意义的特征,旨在最小化训练集的均方误差。我们使用三个图像数据集来说明该管道:HTRU-1、MNIST 和 CIFAR-10。我们还为高度不平衡的 HTRU-1 数据集包含了一个量子增强的一类支持向量机 (QOCSVM),以及作为基准的经典机器学习结果。最后,还包括 HTRU-2 数据集,作为具有良好相关特征的数据集的基准。自编码器实现了近乎完美的重建,并且对 MNIST 实现了高分类精度,而 CIFAR-10 由于图像复杂性而表现出较差的性能,而 HTRU-1 由于数据集不平衡而表现不佳。这突出表明了通过经典特征提取进行降维与使用量子方法进行预测性能之间需要平衡。||
|**2024-10-25**|[Transferring Knowledge from High-Quality to Low-Quality MRI for Adult Glioma Diagnosis](http://arxiv.org/abs/2410.18698)|null|胶质瘤是一种常见且致命的脑肿瘤,需要早期诊断才能改善预后。然而,撒哈拉以南非洲 (SSA) 地区磁共振成像 (MRI) 技术落后,阻碍了准确诊断。本文介绍了我们参与 BraTS 挑战赛 SSA 成人胶质瘤项目的工作。我们采用了 BraTS-GLI 2021 获奖方案的模型,并利用三种训练策略对其进行训练:(1) 首先在 BraTS-GLI 2021 数据集上进行训练,然后在 BraTS-Africa 数据集上进行微调,(2) 仅在 BraTS-Africa 数据集上进行训练,(3) 仅在经过 2 倍超分辨率增强的 BraTS-Africa 数据集上进行训练。结果表明,首先在 BraTS-GLI 2021 数据集上进行训练,然后在 BraTS-Africa 数据集上进行微调,取得了最佳效果。这表明高质量数据集在训练过程中提供先验知识的重要性。我们性能最佳的模型在验证阶段分别实现了 0.882、0.840 和 0.926 的 Dice 分数,以及 15.324、37.518 和 13.971 的 Hausdorff 距离 (95%) 分数,用于增强肿瘤、肿瘤核心和全肿瘤。在比赛的最后阶段,我们的方法成功获得了总排名第二,体现了我们模型和训练策略的优势和有效性。我们的方法为改善 SSA 地区的胶质瘤诊断提供了见解,展示了深度学习在资源有限环境中的潜力以及从高质量数据集中进行迁移学习的重要性。||
|**2024-10-24**|[Spatial-Temporal Search for Spiking Neural Networks](http://arxiv.org/abs/2410.18580)|null|脉冲神经网络 (SNN) 具有稀疏计算和固有时间动态等吸引人的特性,被认为是下一代人工智能的潜在候选者。通过采用人工神经网络 (ANN) 的架构,SNN 在图像分类等基准测试任务中取得了具有竞争力的性能。然而,ANN 的成功架构对于 SNN 来说并非最佳。在这项工作中,我们应用神经架构搜索 (NAS) 来寻找适合 SNN 的架构。以前用于 SNN 的 NAS 方法主要关注空间维度,而明显缺乏对 SNN 至关重要的时域动态的考虑。受生物神经网络异质性的启发,我们提出了一种可微的方法来优化 SNN 的空间和时间维度。在空间层面,我们开发了一个基于脉冲的可微分层搜索 (SpikeDHS) 框架,其中基于脉冲的操作在计算约束下在细胞和层级上都得到了优化。我们进一步提出了一种可微分的代理梯度搜索 (DGS) 方法,以便在训练期间独立地演化局部 SG 函数。在时间层面,我们通过演化不同类型脉冲神经元的时间常数来探索其多样化时间动态的最佳配置,并在此基础上进一步开发了结合 SNN 和 ANN 的混合网络,平衡了准确性和效率。我们的方法在 CIFAR10/100 和 ImageNet 上实现了相当的分类性能,准确率分别为 96.43%、78.96% 和 70.21%。在基于事件的深度立体视觉方面,我们的方法找到了最佳的层变化,并以降低 26 倍的计算成本 (6.7 毫焦) 超越了专门设计的 ANN 的准确性,证明了 SNN 在处理高度稀疏和动态信号方面的潜力。||
|**2024-10-25**|[Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks](http://arxiv.org/abs/2410.18387)|null|为了解决涉及多种医学影像模式下文本指令与视觉图像的任务,研究人员开发了几种医学多模态大语言模型 (MLLM),并取得了令人瞩目的成果。目前大多数医学通才模型都是区域无关的,即将整个图像视为一个整体表征。然而,它们难以确定在生成句子时所关注的具体区域。为了模拟医生通常先浏览整个图像,然后集中于特定区域进行全面评估的行为,我们旨在增强医学 MLLM 对完整医学扫描图像中解剖区域的理解能力。为此,我们首先制定了以区域为中心的任务,并构建了一个大规模数据集 MedRegInstruct,将区域信息纳入训练。结合我们收集的数据集和其他医学多模态语料库进行训练,我们提出了一种区域感知的医学 MLLM,名为 MedRegA,它是第一个能够同时处理多种模态图像级和区域级医学视觉语言任务的双语通才医学人工智能系统。我们的 MedRegA 不仅支持三种以区域为中心的任务,而且在 8 种模态的视觉问答、报告生成和医学图像分类方面均取得了最佳性能,展现出显著的多功能性。实验表明,我们的模型不仅可以在双语环境下完成各种医学视觉语言任务,而且可以识别和检测多模态医学扫描图像中的结构,提高医学 MLLM 的可解释性和用户交互性。我们的项目页面是 https://medrega.github.io。||
|**2024-10-24**|[Thermal Chameleon: Task-Adaptive Tone-mapping for Radiometric Thermal-Infrared images](http://arxiv.org/abs/2410.18340)|**[link](https://github.com/donkeymouse/thermalchameleon)**|热红外 (TIR) 成像为在具有挑战性的户外环境中导航提供了强大的感知能力,但由于其采用 14/16 位格式,因此存在纹理不佳和图像对比度低的问题。传统方法利用各种色调映射方法来增强 TIR 图像的对比度和光度一致性,然而,色调映射的选择很大程度上取决于对任务的了解以及良好的温度依赖先验。在本文中,我们提出了热变色龙网络 (TCNet),这是一种针对 RAW 14 位 TIR 图像的任务自适应色调映射方法。给定相同的图像,TCNet 可以针对每个特定任务调整 TIR 图像的不同表示的色调映射,从而无需启发式图像重新缩放预处理,也不依赖于场景温度或特定任务特征的广泛先验知识。TCNet 在目标检测和单目深度估计方面表现出改进的泛化性能,同时计算开销最小,并且可以模块化地集成到各种任务的现有架构中。项目页面:https://github.com/donkeymouse/ThermalChameleon||
|**2024-10-23**|[Backdoor in Seconds: Unlocking Vulnerabilities in Large Pre-trained Models via Model Editing](http://arxiv.org/abs/2410.18267)|null|大型预训练模型在一系列下游任务中取得了显著成功。然而,最近的研究表明,一种对抗性攻击(即后门攻击)可以通过污染训练数据集来操纵机器学习模型的行为,这对大型预训练模型(尤其是那些定制模型)的实际应用构成了重大威胁。因此,应对探索预训练模型漏洞的独特挑战至关重要。通过对大型预训练模型(例如ViT)执行后门攻击能力的实证研究,我们发现了攻击大型预训练模型的以下独特挑战:1)无法操纵甚至访问大型训练数据集,以及2)训练或微调这些模型所需的巨大计算资源。为了应对这些挑战,我们针对大型预训练模型的背景,建立了有效且可行的后门攻击的新标准。根据这些标准,我们引入了EDT模型,一种高效、无需数据、无需训练的后门攻击方法。受模型编辑技术的启发,EDT将一个基于编辑的轻量级码本注入到大型预训练模型的后门中,它将中毒图像的嵌入替换为目标图像的嵌入,而无需污染训练数据集或训练受害者模型。我们在各种预训练模型(如ViT、CLIP、BLIP和稳定扩散)以及图像分类、图像描述和图像生成等下游任务上进行的实验,证明了我们方法的有效性。我们的代码可在补充材料中找到。||
|**2024-10-23**|[FIPER: Generalizable Factorized Fields for Joint Image Compression and Super-Resolution](http://arxiv.org/abs/2410.18083)|null|在这项工作中,我们提出了一种用于超分辨率 (SR) 和图像压缩的统一表示方法,称为“因子化场”,其动机源于这两个任务之间的共同原理。SISR 和图像压缩都需要恢复和保留精细的图像细节——无论是通过增强分辨率还是重建压缩数据。与以往主要关注网络架构的方法不同,我们提出的方法利用基系数分解来显式地捕捉图像中的多尺度视觉特征和结构成分,从而解决了这两个任务的核心挑战。我们首先推导了我们的 SR 模型,其中包括一个系数主干网络和一个用于泛化因子化场的基 Swin Transformer。然后,为了进一步统一这两个任务,我们将训练好的 SR 模块强大的信息恢复能力作为先验知识用于压缩流程,从而提高压缩效率和细节重建效果。此外,我们引入了一个合并基的压缩分支,以整合共享结构,进一步优化压缩过程。大量实验表明,我们的统一表示方法实现了最先进的性能,在超分辨率 (SR) 中,PSNR 相比基线平均提高了 204.4%,在图像压缩中,相比之前的 SOTA 方法,BD 率降低了 9.35%。||
|**2024-10-23**|[DREB-Net: Dual-stream Restoration Embedding Blur-feature Fusion Network for High-mobility UAV Object Detection](http://arxiv.org/abs/2410.17822)|**[link](https://github.com/eeic-lab/dreb-net)**|目标检测算法是无人机 (UAV) 成像系统的关键组成部分,广泛应用于复杂领域。然而,高机动性无人机拍摄的图像通常会受到运动模糊的影响,这严重阻碍了先进目标检测算法的性能。为了应对这些挑战,我们提出了一种专门为模糊图像设计的创新目标检测算法,称为 DREB-Net(双流恢复嵌入模糊特征融合网络)。首先,DREB-Net 通过在训练阶段加入模糊图像恢复辅助分支 (BRAB) 来解决模糊图像目标检测问题的特殊性。其次,它通过多级注意力引导特征融合 (MAGFF) 模块融合提取的浅层特征,以提取更丰富的特征。这里,MAGFF 模块包含局部注意力模块和全局注意力模块,它们为不同的分支分配不同的权重。然后,在推理阶段,可以移除 BRAB 的深度特征提取以降低计算复杂度并提高检测速度。在损失函数中,将 MSE 和 SSIM 的组合损失添加到 BRAB 以恢复模糊图像。最后,DREB-Net 在特征提取的早期阶段通过可学习频域幅度调制模块 (LFAMM) 引入快速傅里叶变换,以调整特征幅度并增强特征处理能力。实验结果表明,DREB-Net 在拍摄图像存在运动模糊的情况下仍然可以有效地执行目标检测任务,展现出优异的性能和广阔的应用前景。我们的源代码将在 https://github.com/EEIC-Lab/DREB-Net.git 上提供。||
|**2024-10-23**|[Deep Learning for Active Region Classification: A Systematic Study from Convolutional Neural Networks to Vision Transformers](http://arxiv.org/abs/2410.17816)|null|太阳活动区会严重扰乱日地空间环境,经常导致严重的太空天气事件,例如太阳耀斑和日冕物质抛射。因此,对活动区群进行自动分类是准确、及时预测太阳活动的关键起点。本研究展示了我们将深度学习技术应用于基于威尔逊山分类方案的活动区图像分类的结果。具体来说,我们探索了图像分类架构的最新进展,从卷积神经网络到视觉变换器,并报告了它们在活动区分类任务中的性能,表明其有效性的关键在于基于该领域最新进展的稳健训练过程。||
|**2024-10-22**|[Altogether: Image Captioning via Re-aligning Alt-text](http://arxiv.org/abs/2410.17251)|**[link](https://github.com/facebookresearch/metaclip)**|本文着重于创建合成数据以提高图像描述的质量。现有工作通常存在两个缺点。首先,它们从头开始描述图像,忽略了现有的替代文本元数据;其次,如果描述器的训练数据(例如 GPT)未知,则缺乏透明度。在本文中,我们研究了一种基于关键思想的原则性方法Altogether,即编辑和重新调整与图像相关的现有替代文本。为了生成训练数据,我们执行人工注释,注释者从现有的替代文本开始,并在多轮中将其重新调整到图像内容,从而构建具有丰富视觉概念的描述。这与先前的工作不同,先前的工作将人工注释作为一项一次性的描述任务,完全基于图像和注释者的知识。我们根据这些数据训练了一个描述器,该描述器可以大规模地概括重新调整替代文本的过程。我们的结果表明,我们的 Altogether 方法可以生成更丰富的图像描述,还可以改进文本到图像生成和零样本图像分类任务。||
|**2024-10-22**|[KANICE: Kolmogorov-Arnold Networks with Interactive Convolutional Elements](http://arxiv.org/abs/2410.17172)|**[link](https://github.com/m-ferdaus/kanice)**|我们介绍了一种名为KANICE(Kolmogorov-Arnold Networks with Interactive Convolutional Elements)的新型神经网络架构,它将卷积神经网络(CNN)与Kolmogorov-Arnold网络(KAN)原理相结合。KANICE将交互式卷积块(ICB)和KAN线性层集成到CNN框架中。这利用了KAN的通用逼近能力和ICB的自适应特征学习能力。基于Kolmogorov-Arnold表示定理,KANICE可以捕获复杂的非线性数据关系,同时实现动态的、上下文相关的特征提取。我们在四个数据集上评估了KANICE:MNIST、Fashion-MNIST、EMNIST和SVHN,并将其与标准CNN、CNN-KAN混合模型和ICB变体进行了比较。KANICE始终优于基线模型,在MNIST上实现了99.35%的准确率,在SVHN数据集上实现了90.05%的准确率。此外,我们还介绍了KANICE-mini,这是一种专为提高效率而设计的紧凑型变体。全面的消融研究表明,KANICE-mini可以用少得多的参数实现与KANICE相当的性能。KANICE-mini在SVHN上达到了90.00%的准确率,参数量为2,337,828,而KANICE的参数量为25,432,000。这项研究突出了基于KAN的架构在图像分类任务中平衡性能和计算效率的潜力。我们的工作为自适应神经网络的研究做出了贡献,将数学定理融入到深度学习架构中,并探索了模型复杂性和性能之间的权衡,推进了计算机视觉和模式识别领域的发展。本文的源代码可通过我们的GitHub存储库(https://github.com/m-ferdaus/kanice)公开获取。||
|**2024-10-22**|[YOLO-TS: Real-Time Traffic Sign Detection with Enhanced Accuracy Using Optimized Receptive Fields and Anchor-Free Fusion](http://arxiv.org/abs/2410.17144)|null|在自动驾驶和高级驾驶辅助系统 (ADAS) 中确保安全,很大程度上取决于交通标志识别技术的有效部署。虽然现有方法已具有一定成效,但它们往往需要在速度和准确性之间做出妥协。为了解决这个问题,我们提出了一种新颖的实时高效道路标志检测网络 YOLO-TS。该网络通过优化多尺度特征图的感受野,使其与各种数据集中交通标志的尺寸分布更加一致,从而显著提高了性能。此外,我们利用无锚框方法的灵活性,创新性地提出了特征融合策略,允许在包含丰富上下文信息的高分辨率特征图上进行多尺度目标检测,实现了准确性和速度的显著提升。为了减轻由空洞卷积引起的网格效应对小目标检测的不利影响,我们设计了一个独特的模块,该模块不仅可以减轻这种网格效应,还可以扩大感受野以涵盖更广泛的空间上下文信息,从而提高信息使用效率。在具有挑战性的公共数据集 TT100K 和 CCTSDB2021 上的评估表明,YOLO-TS 在准确性和速度方面均优于现有的最先进方法。我们将在未来公开此方法的代码。||
|**2024-10-22**|[AttriPrompter: Auto-Prompting with Attribute Semantics for Zero-shot Nuclei Detection via Visual-Language Pre-trained Models](http://arxiv.org/abs/2410.16820)|**[link](https://github.com/wuyongjiancode/attriprompter)**|大规模视觉语言预训练模型(VLPM)在自然场景中文本提示的目标检测下游任务中表现出色。然而,由于医学图像的特征与用于预训练的网络来源图文对之间存在显著差距,VLPM在组织病理学图像的零样本核检测中的应用仍处于相对未开发的状态。本文旨在探索目标级VLPM,即基于基础语言图像预训练(GLIP)模型,在零样本核检测中的潜力。具体来说,我们提出了一种名为AttriPrompter的创新性自动提示管道,它包括属性生成、属性增强和相关性排序,以避免主观的人工提示设计。AttriPrompter利用VLPM的文本-图像对齐能力创建语义丰富的文本提示,然后将其输入GLIP进行初始的零样本核检测。此外,我们提出了一个自训练的知识蒸馏框架,其中GLIP作为教师模型,其初始预测被用作伪标签,以解决高核密度带来的挑战,包括漏检、误检和实例重叠。我们的方法在无标签核检测方面表现出色,优于所有现有的无监督方法,并展现出优异的泛化能力。值得注意的是,这项工作凸显了基于自然图像-文本对预训练的VLPM在医学领域下游任务中的惊人潜力。代码将在https://github.com/wuyongjianCODE/AttriPrompter发布。||
|**2024-10-22**|[DSORT-MCU: Detecting Small Objects in Real-Time on Microcontroller Units](http://arxiv.org/abs/2410.16769)|null|轻量级神经网络的进步彻底改变了广泛物联网应用中的计算机视觉,包括远程监控和过程自动化。然而,对于许多此类应用至关重要的小目标检测仍然是当前计算机视觉研究中一个尚未充分探索的领域,特别是对于托管资源受限处理器的低功耗嵌入式设备而言。为了解决上述差距,本文提出了一种适用于轻量级和节能目标检测网络的自适应切片方法,包括基于 YOLO 的模型和流行的 FOMO 网络。与大规模检测模型相比,所提出的切片方法能够在不影响精度的情况下在低功耗 MCU 上进行目标检测。通过将所提出的方法应用于具有内置机器学习加速器的新型基于 RISC-V 的 MCU 上的 FOMO 和 TinyissimoYOLO 网络,证明了该方法的优势。大量的实验结果表明,所提出的切片方法在 FOMO 和 TinyissimoYOLO 网络上将 F1 分数提高了高达 225%,同时使用 FOMO 将平均目标计数误差降低了高达 76%,使用 TinyissimoYOLO 降低了高达 89%。此外,这项工作的研究结果表明,对流行的二元交叉熵损失使用软 F1 损失可以作为 FOMO 网络的隐式非极大值抑制。为了评估真实世界的性能,这些网络部署在 GreenWaves Technologies 的基于 RISC-V 的 GAP9 微控制器上,展示了所提出的方法在检测性能(58% - 95% F1 分数)、低延迟(0.6 毫秒/推理 - 16.2 毫秒/推理)和能效(31 微焦耳/推理 - 1.27 毫焦耳/推理)之间取得平衡的能力,同时在 MCU 上使用高分辨率图像执行多个预测。||
|**2024-10-22**|[DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model](http://arxiv.org/abs/2410.16707)|null|本文的研究动机源于一个有趣的现象:当我们探究MaskDINO(即目前最先进的联合检测和分割模型)中transformer解码器初始层的中间结果时,会发现目标检测的性能滞后于实例分割的性能(即性能不平衡)。这一现象促使我们思考一个问题:transformer解码器初始层的性能不平衡是否会限制最终性能的上限?带着这个问题,我们进一步进行了定性和定量的预实验,验证了检测-分割不平衡问题对模型性能的负面影响。为了解决这个问题,本文提出了DI-MaskDINO模型,其核心思想是通过缓解检测-分割不平衡来提高最终性能。DI-MaskDINO是通过将我们提出的去不平衡(DI)模块和平衡感知token优化(BATO)模块配置到MaskDINO中来实现的。DI模块负责生成平衡感知查询,BATO模块使用平衡感知查询来指导初始特征token的优化。平衡感知查询和优化后的特征token分别作为transformer解码器的查询和键值对,以执行联合目标检测和实例分割任务。DI-MaskDINO在COCO和BDD100K基准测试中优于现有的联合目标检测和实例分割模型,与目前最先进的联合检测和分割模型MaskDINO相比, $AP^{box}$提高了+1.2,$AP^{mask}$提高了+0.9。此外,与目前最先进的目标检测模型DINO相比,DI-MaskDINO的$AP^{box}$提高了+1.0,与目前最先进的分割模型Mask2Former相比,$AP^{mask}$ 提高了+3.0。||
|**2024-10-22**|[Fire and Smoke Detection with Burning Intensity Representation](http://arxiv.org/abs/2410.16642)|**[link](https://github.com/xiaoyihan6/fsdmethod)**|由于火灾的破坏性潜力,有效地进行火灾和烟雾检测 (FSD) 和分析系统至关重要。 然而,许多现有的 FSD 方法直接采用通用的目标检测技术,而没有考虑火灾和烟雾的透明性,这导致定位不准确并降低了检测性能。 为了解决这个问题,本文提出了一种新的注意力火灾和烟雾检测模型 (a-FSDM)。 该模型不仅保留了传统检测算法强大的特征提取和融合能力,还重新设计了专门针对 FSD 中透明目标的检测头,称为注意力透明度检测头 (ATDH)。 此外,燃烧强度 (BI) 被引入作为传统 FSD 方法中与火灾相关的下游风险评估的关键特征。 在多个 FSD 数据集上的大量实验展示了所提出的 FSD 模型的有效性和通用性。 该项目可在 https://xiaoyihan6.github.io/FSD/ 获取。|
|**2024-10-21**|[Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models](http://arxiv.org/abs/2410.16163)|**[link](https://github.com/jefferyzhan/griffon)**|大型多模态模型 (LMM) 基于自回归建模在各种视觉语言和以视觉为中心的的任务中取得了重大突破。然而,这些模型通常专注于以视觉为中心的的任务,例如视觉定位和区域描述,或者视觉语言任务,例如图像字幕和多场景视觉问答 (VQA)。现有的 LMM 都没有像自然语言处理领域的大型语言模型那样,在一个模型中全面统一这两类任务。此外,即使有丰富的多任务指令遵循数据,直接堆叠这些数据来扩展通用能力仍然具有挑战性。为了解决这些问题,我们引入了一个名为 CCMD-8M 的新型多维度策划和整合的多模态数据集,它通过多级数据策划和多任务整合克服了统一以视觉为中心的任务和视觉语言任务的数据障碍。更重要的是,我们提出了 Griffon-G,这是一种通用的 LMM,可以在单个端到端范例中同时解决以视觉为中心的任务和视觉语言任务。 Griffon-G 解决了在联合优化这些任务期间遇到的训练崩溃问题,实现了更好的训练效率。跨多模态基准、通用视觉问答 (VQA) 任务、以场景文本为中心的 VQA 任务、与文档相关的 VQA 任务、指代表达理解和目标检测的评估表明,Griffon-G 超越了先进的 LMM,并在复杂的以视觉为中心的的任务中实现了专家级的性能。||
|**2024-10-21**|[Few-shot target-driven instance detection based on open-vocabulary object detection models](http://arxiv.org/abs/2410.16028)|null|当前的大型开放视觉模型可以用于单样本和少样本目标识别。然而,基于梯度的重新训练方案成本高昂。另一方面,开放词汇目标检测模型在相同的潜在空间中拉近了视觉和文本概念,从而允许以较小的计算成本通过提示进行零样本检测。我们提出了一种轻量级的方法,可以在不需要文本描述的情况下将后者转换为单样本或少样本目标识别模型。我们在 TEgO 数据集上使用 YOLO-World 模型作为基础进行的实验表明,性能随着模型大小、示例数量和图像增强的使用而提高。||
|**2024-10-21**|[Visual Representation Learning Guided By Multi-modal Prior Knowledge](http://arxiv.org/abs/2410.15981)|null|尽管深度神经网络(DNN)在计算机视觉方面取得了显著成功,但当训练数据和测试数据之间存在分布偏移时,它们的表现就会下降。在本文中,我们提出了一种基于分布的学习方法——知识引导的视觉表征学习(KGV),它利用多模态先验知识来提高分布偏移下的泛化能力。我们使用了来自两种不同模态的先验知识:1)具有层次和关联关系的知识图谱(KG);2)根据知识图谱中语义表示的视觉元素生成的合成图像。在共同的潜在空间中,从给定的模态生成相应的嵌入,即来自原始图像和合成图像的视觉嵌入以及知识图谱嵌入(KGE)。这些嵌入通过一种新颖的基于翻译的KGE方法进行对齐,其中知识图谱的节点和关系嵌入分别被建模为高斯分布和平移。我们认为,结合多模态先验知识可以实现更规范化的图像表征学习。因此,模型能够更好地泛化到不同的数据分布。我们在具有较大或较小分布偏移的不同图像分类任务上评估了KGV,即来自德国、中国和俄罗斯的数据集上的道路标志分类、使用mini-ImageNet数据集及其变体的图像分类,以及DVM-CAR数据集。结果表明,在所有实验中,KGV始终比基线表现出更高的准确性和数据效率。|
|**2024-10-18**|[MultiOrg: A Multi-rater Organoid-detection Dataset](http://arxiv.org/abs/2410.14612)|null|近年来,生物医学领域的高通量图像分析备受关注,推动了药物发现、疾病预测和个性化医疗的进步。类器官作为人类器官及其功能的优秀模型,是一个活跃的研究领域。显微图像中类器官自动量化的实现将为克服大量手动量化瓶颈提供有效的解决方案,特别是在高通量图像分析中。然而,与自动驾驶等其他领域相比,开放生物医学数据集明显缺乏,而且值得注意的是,其中只有少数尝试量化标注的不确定性。在这项工作中,我们提出了MultiOrg,一个全面的类器官数据集,专为具有不确定性量化的目标检测任务而设计。该数据集包含超过400张高分辨率二维显微图像和超过60,000个类器官的精选注释。最重要的是,它包括三个用于测试数据的标签集,由两位专家在不同时间点独立标注。我们还提供了一个类器官检测的基准,并通过一个易于安装的交互式插件,将最佳模型应用于流行的图像可视化工具Napari,以执行类器官量化。||
|**2024-10-18**|[A Hybrid Feature Fusion Deep Learning Framework for Leukemia Cancer Detection in Microscopic Blood Sample Using Gated Recurrent Unit and Uncertainty Quantification](http://arxiv.org/abs/2410.14536)|null|急性淋巴细胞白血病 (ALL) 是最恶性的白血病,也是成人和儿童中最常见的癌症。传统上,白血病的诊断是通过在显微镜下分析血液和骨髓涂片,并通过额外的细胞化学测试来确认。然而,这些方法昂贵、耗时且高度依赖专家知识。近年来,深度学习,特别是卷积神经网络 (CNN),为显微镜涂片图像分类提供了先进的方法,有助于检测白血病细胞。这些方法快速、经济高效,并且不受人为偏差的影响。然而,大多数方法缺乏量化不确定性的能力,这可能导致严重的误诊。在这项研究中,混合深度学习模型(InceptionV3-GRU、EfficientNetB3-GRU、MobileNetV2-GRU)被用于对ALL进行分类。贝叶斯优化用于微调模型的超参数并提高其性能。此外,深度集成不确定性量化被应用于解决白血病图像分类过程中的不确定性。所提出的模型在公开可用的数据集 ALL-IDB1 和 ALL-IDB2 上进行了训练。然后使用求和规则在分数级别聚合它们的结果。这些模型中使用的并行架构在区分 ALL 和非 ALL 病例方面提供了高水平的置信度。所提出的方法在 ALL-IDB1 数据集上实现了 100% 的检测准确率,在 ALL-IDB2 数据集上实现了 98.07% 的检测准确率,在组合数据集上实现了 98.64% 的检测准确率,证明了其在准确可靠的白血病诊断方面的潜力。||
|**2024-10-18**|[Ultrasound matrix imaging for transcranial in-vivo localization microscopy](http://arxiv.org/abs/2410.14499)|null|经颅超声成像通常受到颅骨引起的衰减和高阶像差的限制。通过使用微泡等造影剂并结合超快成像,不仅可以提高信噪比,还可以获得分辨率低至脑血管微米级的超分辨率图像。然而,超声定位显微镜 (ULM) 仍然受到波前畸变的影响,这限制了微泡的检测率并阻碍了它们的定位。在这项工作中,我们展示了依赖于预先记录反射矩阵的矩阵成像如何为这些基本问题提供解决方案。作为实验性概念验证,对三只麻醉羊进行了深部脑微血管的体内重建。结果表明,波畸变的补偿可以显著增强 ULM 的对比度和分辨率。这项实验研究为经颅和非电离观测人类脑微血管病理学(如中风)开辟了广阔的前景。||
|**2024-10-18**|[ClearSR: Latent Low-Resolution Image Embeddings Help Diffusion-Based Real-World Super Resolution Models See Clearer](http://arxiv.org/abs/2410.14279)|null|我们提出了ClearSR,这是一种可以更好地利用潜在低分辨率图像(LR)嵌入进行基于扩散的真实世界图像超分辨率(Real-ISR)的新方法。以前的Real-ISR模型主要关注如何激活更多文本到图像扩散模型的生成先验,以使输出的高分辨率(HR)图像看起来更好。然而,由于这些方法过于依赖生成先验,输出图像的内容往往与输入的LR图像不一致。为了缓解上述问题,在这项工作中,我们探索使用潜在的LR嵌入来约束ControlNet的控制信号,并在细节和结构层面提取LR信息。我们表明,正确使用潜在的LR嵌入可以产生更高质量的控制信号,这使得超分辨率结果与LR图像更加一致,并产生更清晰的视觉结果。此外,我们还表明,潜在的LR嵌入可以用来控制推理阶段,从而同时提高保真度和生成能力。实验表明,我们的模型在多个测试集的多个指标上都能取得更好的性能,并且与现有方法相比,能够生成与LR图像更加一致的SR结果。我们的代码将公开发布。||
|**2024-10-18**|[Comparative Evaluation of Clustered Federated Learning Method](http://arxiv.org/abs/2410.14212)|**[link](https://github.com/leahcimali/Comparative-Evaluation-of-Clustered-Federated-Learning-Methods)**|近年来,联邦学习 (FL) 已被证明是最有前途的分布式学习方法之一,可以保护数据隐私。随着该方法的发展并在各种现实场景中的应用,出现了新的挑战。其中一个挑战是 FL 协议参与者之间存在高度异构(通常称为非独立同分布)的数据分布。解决这个障碍的一个流行方案是集群联邦学习 (CFL),其目的是将客户端划分为分布均匀的组。在文献中,最先进的 CFL 算法通常使用一些数据异构性案例进行测试,而没有系统地证明选择的合理性。此外,用于区分不同异构场景的分类法并不总是直截了当。在本文中,我们针对联邦学习 (FL) 中提出的数据异构性分类法,探讨了两种最先进的 CFL 算法的性能。我们使用三个图像分类数据集,并使用外部聚类指标针对异构性类别分析生成的聚类。我们的目标是更清楚地了解 CFL 性能与数据异构场景之间的关系。||
|**2024-10-17**|[MMAD-Purify: A Precision-Optimized Framework for Efficient and Scalable Multi-Modal Attacks](http://arxiv.org/abs/2410.14089)|null|神经网络在各种任务中都取得了显著的性能,但它们仍然容易受到对抗性扰动的影响,这对安全关键型应用构成了重大风险。随着多模态的兴起,扩散模型已成为强大的工具,不仅可用于生成任务,还可用于图像编辑、修复和超分辨率等各种应用。然而,由于对其攻击以增强其弹性的研究有限,这些模型仍然缺乏鲁棒性。传统的攻击技术,如基于梯度的对抗性攻击和基于扩散模型的方法,由于其迭代性质而受到计算效率低下和可扩展性问题的阻碍。为了应对这些挑战,我们引入了一个创新框架,该框架利用扩散模型的蒸馏骨干,并结合了精度优化的噪声预测器,以增强我们攻击框架的有效性。这种方法不仅增强了攻击的效力,而且还显著降低了计算成本。我们的框架为多模态对抗性攻击提供了一种前沿解决方案,确保了更低的延迟和生成具有更高成功率的高保真对抗性示例。此外,我们证明了我们的框架实现了出色的可迁移性和针对净化防御的鲁棒性,在有效性和效率方面都优于现有的基于梯度的攻击模型。||
|**2024-10-17**|[Reproducibility study of "LICO: Explainable Models with Language-Image Consistency"](http://arxiv.org/abs/2410.13989)|**[link](https://github.com/robertdvdk/lico-fact)**|The growing reproducibility crisis in machine learning calls for careful scrutiny of research findings. This paper investigates the LICO method proposed by Lei et al. (2023), which aims to enhance post-hoc interpretability techniques and improve image classification performance. LICO leverages natural language supervision from a vision-language model to enrich feature representations and guide the learning process. We conduct a comprehensive reproducibility study employing (Wide) ResNets and established interpretability methods such as Grad-CAM and RISE. We were largely unable to reproduce the authors' results. In particular, we did not find that LICO consistently led to improvements in classification performance or in quantitative and qualitative measures of interpretability. Our findings therefore underscore the importance of rigorous evaluation and transparent reporting in interpretability research.||
|**2024-10-17**|[ConsisSR: Delving Deep into Consistency in Diffusion-based Image Super-Resolution](http://arxiv.org/abs/2410.13807)|null|Real-world image super-resolution (Real-ISR) aims to restore high-quality (HQ) images from low-quality (LQ) inputs corrupted by unknown and complex degradations. In particular, pretrained text-to-image (T2I) diffusion models provide strong generative priors to reconstruct credible and intricate details. However, T2I generation focuses on semantic consistency while Real-ISR emphasizes pixel-level reconstruction, which hinders existing methods from fully exploiting diffusion priors. To address this challenge, we introduce ConsisSR to handle both semantic and pixel-level consistency. Specifically, compared to coarse-grained text prompts, we exploit the more powerful CLIP image embedding and effectively leverage both modalities for semantic guidance through our Hybrid Prompt Adapter (HPA). Second, we introduce Time-Aware Latent Augmentation (TALA) to mitigate the inherent gap between T2I generation and the consistency requirements of Real-ISR. By randomly mixing LQ and HQ latent inputs, our model not only handles timestep-specific diffusion noise but also refines the accumulated latent representations. Last but not least, our GAN-embedding strategy employs a pretrained Real-ESRGAN model to refine the diffusion starting point. This accelerates the inference process to 10 steps without training, while preserving sampling quality. Our method demonstrates state-of-the-art performance among both full-scale and accelerated models. The code will be made publicly available.||
|**2024-10-17**|[LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning](http://arxiv.org/abs/2410.13618)|**[link](https://github.com/skddj/loldu)**|The rapid growth of model scale has placed ever-greater demands on the computational resources required for fine-tuning. Existing approaches such as low-rank adaptation (LoRA) address the problem of handling the large number of updated parameters in full-parameter fine-tuning. However, LoRA relies on random initialization and optimization of low-rank matrices to approximate the updated weights, which can result in slower convergence and an accuracy gap compared to full-parameter fine-tuning. To address these issues, we propose LoLDU, a parameter-efficient fine-tuning (PEFT) method that reduces trainable parameters by a factor of 2600 compared to regular PEFT methods while maintaining comparable performance. LoLDU leverages the lower-diag-upper (LDU) decomposition to initialize the low-rank matrices for faster convergence and orthogonality. We focus on optimizing the diagonal matrix for scaling transformations. To the best of our knowledge, LoLDU has the fewest parameters among all PEFT methods. We conduct extensive experiments on 4 instruction-following datasets, 6 natural language understanding (NLU) datasets, 8 image classification datasets, and image generation datasets with multiple model types (LLaMA2, RoBERTa, ViT, and Stable Diffusion), providing a comprehensive and detailed analysis. Our open-source code is available at \href{https://github.com/SKDDJ/LoLDU}{https://github.com/SKDDJ/LoLDU}.||
|**2024-10-17**|[Spatiotemporal Object Detection for Improved Aerial Vehicle Detection in Traffic Monitoring](http://arxiv.org/abs/2410.13616)|null|This work advances multi-class vehicle detection from UAV cameras through the development of spatiotemporal object detection models. The study introduces a Spatio-Temporal Vehicle Detection dataset (STVD) containing 6,600 annotated sequential frame images captured by UAVs, enabling comprehensive training and evaluation of algorithms for holistic spatiotemporal perception. YOLO-based object detection algorithms are enhanced to incorporate temporal dynamics, improving performance over single-frame models. Integrating attention mechanisms into the spatiotemporal models yields further performance gains. Experimental validation demonstrates significant progress, with the best spatiotemporal model achieving a 16.22% improvement over the single-frame model, while also demonstrating the potential of attention mechanisms to further enhance performance.||
|**2024-10-17**|[Augmentation Policy Generation for Image Classification Using Large Language Models](http://arxiv.org/abs/2410.13453)|null|Automated data augmentation methods have significantly improved the performance and generalization of deep learning models for image classification. However, most state-of-the-art methods are optimized on common benchmark datasets, which limits their applicability to more diverse or domain-specific data, such as medical datasets. In this paper, we propose a strategy that uses large language models to automatically generate efficient augmentation policies, tailored to the specific characteristics of any dataset and model architecture. The proposed approach iteratively interacts with an LLM to obtain and refine augmentation policies based on model performance feedback, creating a dataset-agnostic data augmentation pipeline. The proposed approach was evaluated on medical imaging datasets, showing clear improvements over existing methods. The proposed method offers an adaptive and scalable solution. Although it increases computational cost, it significantly improves model robustness, automates the process, and minimizes human involvement during model development.||
|**2024-10-17**|[Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation](http://arxiv.org/abs/2410.13437)|null|Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to localize an arbitrary number of target objects referred to by a language expression in a video and maintain their identities. This intricate task involves reasoning over both linguistic and visual modalities, as well as temporal association of the target objects. However, existing studies only adopt loose feature fusion and overlook the exploitation of long-term information about the tracked targets. In this study, we introduce a compact Transformer-based method called TenRMOT. We perform feature fusion in both the encoding and decoding stages to fully exploit the advantages of the Transformer architecture. Specifically, we perform cross-modal fusion incrementally, layer by layer, during the encoding stage. In the decoding stage, we utilize language-guided queries to probe memory features for accurate prediction of the desired objects. In addition, we introduce a query update module that explicitly leverages previous temporal information of the tracked objects to enhance the consistency of their trajectories. Furthermore, we introduce a new task named Referring Multi-Object Tracking and Segmentation (RMOTS) and construct a new dataset named Ref-KITTI Segmentation. Our dataset consists of 18 videos with 818 expressions, and each expression averages 10.7 masks, which poses a greater challenge compared to the typical single mask found in most existing referring video segmentation datasets. TenRMOT demonstrates superior performance on both the referring multi-object tracking and the segmentation tasks.||
|**2024-10-17**|[Unsupervised Skull Segmentation via Contrastive MR-to-CT Modality Translation](http://arxiv.org/abs/2410.13427)|null|Skull segmentation from CT scans can be regarded as an already solved problem. However, in MRI this task has a significantly greater complexity due to the presence of soft tissues rather than bone. Capturing the bone structures from head MRI images is very difficult, as the primary visualization target of head MRI is the brain. Approaches that attempt to use skull-stripping methods seem ill-suited for this task and fail in many cases. On the other hand, supervised learning methods require costly and time-consuming skull annotations. To overcome these difficulties, we propose a fully unsupervised approach in which, instead of segmenting the MRI images directly, we generate synthetic CT data via MR-to-CT translation and perform the segmentation there. We address many of the issues associated with unsupervised skull segmentation, including the unpaired nature of MRI and CT datasets (contrastive learning), low resolution and quality (super-resolution), and generalization. This research is of significant value for downstream tasks that require skull segmentation from MRI volumes, such as craniectomy or surgical planning, and can be viewed as an important step toward the utilization of synthetic data in medical imaging.||
|**2024-10-16**|[Interpreting and Analyzing CLIP's Zero-Shot Image Classification via Mutual Knowledge](http://arxiv.org/abs/2410.13016)|**[link](https://github.com/fawazsammani/clip-interpret-mutual-knowledge)**|Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representations into a shared embedding space, then retrieving the class closest to the image. This work provides a new approach for interpreting CLIP models for image classification from the lens of mutual knowledge between the two modalities. Specifically, we ask: what common concepts have both the vision and language CLIP encoders learned that influence the joint embedding space, causing points to move closer or farther apart? We answer this question through an explanation method based on textual concepts, demonstrate its effectiveness, and perform an analysis on a pool of 13 CLIP models varying in architecture, size, and pretraining dataset. We explore these different aspects in relation to mutual knowledge and analyze zero-shot predictions. Our approach demonstrates an effective and human-friendly way of understanding CLIP's zero-shot classification decisions.||
|**2024-10-16**|[PND-Net: Plant Nutrition Deficiency and Disease Classification using Graph Convolutional Network](http://arxiv.org/abs/2410.12742)|null|Crop yield can be improved and agricultural growth promoted if various plant nutrition deficiencies and diseases are identified and detected at an early stage. Deep learning methods have demonstrated outstanding performance in automatically detecting plant diseases and nutrition deficiencies from visual symptoms in leaves. This paper proposes a new deep learning method for classifying plant nutrition deficiencies and diseases using a graph convolutional network (GCN) on top of a base convolutional neural network (CNN). Sometimes, a global feature descriptor may fail to capture the vital regions of a diseased leaf, leading to inaccurate disease classification. To address this issue, regional feature learning is crucial for holistic feature aggregation. In this work, we explore spatial pyramid pooling for multi-scale regional feature summarization to achieve a discriminative feature representation. We develop a GCN so that it can learn finer details for classifying plant diseases and nutrition deficiencies. The proposed method, named Plant Nutrition Deficiency and Disease Network (PND-Net), is evaluated on two public nutrition-deficiency datasets and two public disease-classification datasets using four CNNs. The best classification performances are: (a) 90.00% on the banana nutrition-deficiency dataset and 90.54% on the coffee nutrition-deficiency dataset; (b) 96.18% on the potato disease dataset and 84.30% on the PlantDoc dataset using the Xception backbone. Furthermore, generalization experiments were conducted, and the proposed method achieved state-of-the-art performance on two additional public datasets: breast cancer histopathology image classification (BreakHis 40X: 95.50% accuracy; BreakHis 100X: 96.79% accuracy) and single-cell classification in Pap smear images for cervical cancer (SIPaKMeD: 99.18% accuracy). PND-Net also achieves improved performance using five-fold cross-validation.||
|**2024-10-16**|[Transformer based super-resolution downscaling for regional reanalysis: Full domain vs tiling approaches](http://arxiv.org/abs/2410.12728)|null|Super-resolution (SR) is a promising, cost-effective downscaling approach for producing high-resolution climate information from coarser counterparts. One particular application is downscaling regional reanalysis outputs (predictand) from the driving global counterparts (predictor). This study conducts an intercomparison of various SR downscaling methods, focusing on temperature and using as an example the CERRA reanalysis (5.5 km resolution, produced with a regional atmospheric model driven by ERA5). The approach proposed in this work is the Swin Transformer, with two alternative methods used as benchmarks (the fully convolutional U-Net and the convolutional-and-dense DeepESD), along with simple bicubic interpolation. We compare two approaches: the standard one, which uses the full domain as input, and a more scalable tiling approach that divides the full domain into tiles used as input. The methods are trained to downscale CERRA surface temperature based on temperature information from the driving ERA5; in addition, the tiling approach incorporates static topographic information. We show that the tiling approach, which requires spatial transferability, comes at the cost of reduced performance (although it outperforms some full-domain benchmarks), but provides an efficient, scalable solution that allows SR to be applied at a pan-European scale and is valuable for real-time applications.||
|**2024-10-16**|[MambaBEV: An efficient 3D detection model with Mamba2](http://arxiv.org/abs/2410.12673)|null|A stable 3D object detection model based on the BEV paradigm that incorporates temporal information is critical for autonomous driving systems. However, current temporal-fusion models that use convolutional layers or deformable self-attention are not conducive to exchanging global information across the BEV space and incur higher computational cost. Recently, a new class of Mamba-based models, specialized for processing sequences, has shown great potential in several downstream tasks. In this work, we propose a Mamba2-based BEV 3D object detection model named MambaBEV. We also adopt an end-to-end autonomous driving paradigm to test the performance of the model. Our work achieves fairly good results on the nuScenes dataset: our base version achieves an NDS of 51.7%. Our code will be open-sourced soon.||
|**2024-10-15**|[Fractal Calibration for long-tailed object detection](http://arxiv.org/abs/2410.11774)|null|Real-world datasets follow imbalanced distributions, which pose significant challenges for rare-category object detection. Recent studies address this problem by developing re-weighting and re-sampling methods that exploit the class frequencies of the dataset. However, these techniques focus solely on frequency statistics and ignore the distribution of the classes in image space, missing important information. Unlike them, we propose FRActal CALibration (FRACAL): a novel post-calibration method for long-tailed object detection. FRACAL devises a logit-adjustment method that uses the fractal dimension to estimate how uniformly classes are distributed in image space. During inference, it uses the fractal dimension to inversely downweight the probabilities of uniformly spaced class predictions, achieving balance along two axes: between frequent and rare classes, and between uniformly spaced and sparsely spaced classes. FRACAL is a post-processing method that requires no training and can be combined with many off-the-shelf models, such as one-stage sigmoid detectors and two-stage instance segmentation models. FRACAL boosts rare-class performance by up to 8.6% and surpasses all previous methods on the LVIS dataset, while also generalizing well to other datasets such as COCO, V3Det, and OpenImages. The code will be released.||
|**2024-10-15**|[YOLO-ELA: Efficient Local Attention Modeling for High-Performance Real-Time Insulator Defect Detection](http://arxiv.org/abs/2410.11727)|null|Existing UAV-based insulator defect detection methods struggle with complex backgrounds and tiny objects, resulting in suboptimal accuracy and high false-positive rates. To address this, this paper proposes YOLO-ELA, a new attention-based foundation architecture built on the concept of local attention modeling. The architecture adds Efficient Local Attention (ELA) blocks to the neck of the one-stage YOLOv8 architecture, shifting the model's attention from background features toward features of defective insulators. The SCYLLA Intersection-over-Union (SIoU) criterion function is employed to reduce detection loss, accelerate model convergence, and increase the model's sensitivity to small insulator defects, yielding higher true-positive outcomes. Given the limited dataset, data augmentation techniques are utilized to increase the diversity of the dataset. In addition, a transfer learning strategy is leveraged to improve the model's performance. Experimental results on high-resolution UAV images show that our method achieves state-of-the-art performance of 96.9% mAP0.5 at a real-time detection speed of 74.63 frames per second, outperforming the baseline model. This further demonstrates the effectiveness of attention-based convolutional neural networks (CNNs) in object detection tasks.||
|**2024-10-15**|[Degradation Oriented and Regularized Network for Real-World Depth Super-Resolution](http://arxiv.org/abs/2410.11666)|**[link](https://github.com/yanzq95/dornet)**|In recent years, existing RGB-guided depth super-resolution methods have achieved excellent performance under the assumption of fixed and known degradation (e.g., bicubic downsampling). However, in real-world scenarios, captured depth often suffers from unconventional and unknown degradation due to sensor limitations and the complexity of imaging environments (e.g., low-reflectance surfaces, illumination). Their performance degrades significantly when these real degradations deviate from their assumptions. To address these issues, we propose a degradation-oriented and regularized network, DORNet, which pays closer attention to learning degradation representations of low-resolution depth, thereby providing targeted guidance for depth restoration. Specifically, we first design a self-supervised degradation learning method that models discriminative degradation representations of low-resolution depth using routing-selection-based degradation regularization. Then, we propose a degradation-aware approach that recursively performs multiple degradation-oriented feature transformations, each of which selectively embeds RGB information into the depth according to the learned degradation representations. Extensive experimental results on real-world and synthetic datasets demonstrate that our method achieves state-of-the-art performance.||
|**2024-10-15**|[LoKO: Low-Rank Kalman Optimizer for Online Fine-Tuning of Large Models](http://arxiv.org/abs/2410.11551)|null|Training large models with millions or even billions of parameters from scratch incurs substantial computational cost. Parameter-efficient fine-tuning (PEFT) methods, particularly low-rank adaptation (LoRA), address this challenge by adapting only a small number of parameters to specific tasks with gradient-based optimizers. In this paper, we cast PEFT as an optimal filtering/state-estimation problem and propose the Low-Rank Kalman Optimizer (LoKO) to estimate the optimal trainable parameters in an online manner. We leverage the low-rank decomposition in LoRA to significantly reduce matrix sizes in Kalman iterations and further exploit a diagonal approximation of the covariance matrix to effectively decrease the computational complexity from quadratic to linear in the number of trainable parameters. Moreover, we found that the initialization of the covariance matrix within the Kalman algorithm and the accurate estimation of the observation noise covariance are key to this formulation, and we propose robust approaches that work well across a wide range of well-established computer vision and language models. Our results show that LoKO converges with fewer iterations and yields better-performing models compared to commonly used optimizers with LoRA in both image classification and language tasks. Our study opens up the possibility of leveraging the Kalman filter as an effective optimizer for the online fine-tuning of large models.||
|**2024-10-15**|[Spatio-Temporal Distortion Aware Omnidirectional Video Super-Resolution](http://arxiv.org/abs/2410.11506)|**[link](https://github.com/nichenxingmeng/STDAN)**|Omnidirectional video (ODV) can provide an immersive experience and is widely utilized in virtual and augmented reality. However, restricted capture devices and transmission bandwidth lead to low-resolution ODVs. Video super-resolution (VSR) methods have been proposed to enhance video resolution, but applying such methods directly does not cope well with ODV projection distortions in practice. To achieve better super-resolution reconstruction quality, we propose a novel Spatio-Temporal Distortion Aware Network (STDAN) oriented to ODV characteristics. Specifically, a spatio-temporal distortion modulation module is introduced to improve spatial ODV projection distortions and exploit temporal correlations based on intra- and inter-frame alignment. Next, we design a multi-frame reconstruction and fusion mechanism to refine the consistency of the reconstructed ODV frames. Furthermore, we incorporate latitude-saliency adaptive maps into the loss function to concentrate on important viewpoint regions with higher texture complexity and human viewing interest. In addition, we collect a new ODV-SR dataset with various scenes. Extensive experimental results demonstrate that the proposed STDAN achieves superior super-resolution performance on ODVs and outperforms state-of-the-art methods.||
|**2024-10-15**|[SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection](http://arxiv.org/abs/2410.11358)|null|Multimodal object detection leverages information from multiple modalities to improve the accuracy and robustness of detectors. By learning long-term dependencies, Transformers can effectively fuse multimodal features in the feature-extraction stage, greatly improving the performance of multimodal object detection. However, current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at different depth layers of the network, thus limiting improvements in detection performance. In this paper, we introduce an accurate and efficient object-detection method named SeaDATE. First, we propose a novel dual-attention feature fusion (DTF) module that, under the guidance of the Transformer, fuses local and global information through a dual-attention mechanism, strengthening the fusion of modality features from orthogonal perspectives using spatial and channel tokens. Meanwhile, our theoretical analysis and empirical validation show that Transformer-guided fusion methods, which treat images as sequences of pixels for fusion, perform better on the detailed information of shallow features than on deep semantic information. To address this, we design a contrastive learning (CL) module aimed at learning features of multimodal samples, remedying the shortcomings of Transformer-guided fusion in extracting deep semantic features, and effectively utilizing cross-modal information. Extensive experiments and ablation studies on the FLIR, LLVIP, and M3FD datasets demonstrate the effectiveness of our method, achieving state-of-the-art detection performance.||
|**2024-10-15**|[Representation Similarity: A Better Guidance of DNN Layer Sharing for Edge Computing without Training](http://arxiv.org/abs/2410.11233)|null|Edge computing has emerged as an alternative to reduce transmission and processing latency and to preserve the privacy of video streams. However, the growing complexity of deep neural networks (DNNs) used in video-based applications (e.g., object detection) exerts pressure on memory-constrained edge devices. Model merging has been proposed to reduce the memory footprint of DNNs by keeping only a single copy of merged layers' weights in memory. In existing model-merging techniques, (i) only architecturally identical layers can be shared; (ii) computationally expensive retraining in the cloud is required; and (iii) the availability of ground-truth data for retraining is assumed. However, re-evaluating the performance of a merged model requires a validation dataset with ground-truth data, typically run in the cloud. Common metrics used to guide the selection of shared layers include the size or computational cost of shared layers, or the representation size. We propose a new model-merging scheme by sharing representations (i.e., layer outputs) at the edge, guided by representation similarity S. We find that S exhibits an extremely high correlation with the accuracy of the merged model compared to other metrics, with a Pearson correlation coefficient \|r\| ...||
|**2024-10-15**|[TEOcc: Radar-camera Multi-modal Occupancy Prediction via Temporal Enhancement](http://arxiv.org/abs/2410.11228)|**[link](https://github.com/vdigpku/teocc)**|As a novel 3D scene representation, semantic occupancy has gained much attention in autonomous driving. However, existing occupancy-prediction methods mainly focus on designing better occupancy representations, such as tri-perspective views or neural radiance fields, while ignoring the advantages of using long-term temporal information. In this paper, we propose a radar-camera multimodal temporal-enhanced occupancy prediction network, dubbed TEOcc. Our method is inspired by the success of utilizing temporal information in 3D object detection. Specifically, we introduce a temporal enhancement branch to learn temporal occupancy prediction. In this branch, we randomly discard the (t-k)-th frame of the multi-view camera input and predict its 3D occupancy using the information from the other adjacent frames and the multimodal inputs, via long-term and short-term temporal decoders, respectively. In addition, to reduce computational cost and incorporate multimodal inputs, we specially design 3D convolutional layers for the long-term and short-term temporal decoders. Furthermore, since the lightweight occupancy-prediction head is a dense classification head, we propose using a shared occupancy-prediction head for both the temporal enhancement branch and the main branch. Notably, the temporal enhancement branch is executed only during training and is discarded at inference. Experimental results demonstrate that TEOcc achieves state-of-the-art occupancy-prediction performance on the nuScenes benchmark. Moreover, the proposed temporal enhancement branch is a plug-and-play module that can be easily integrated into existing occupancy-prediction methods to improve their performance. The code and models will be released at https://github.com/VDIGPKU/TEOcc.||
|**2024-10-15**|[CVCP-Fusion: On Implicit Depth Estimation for 3D Bounding Box Prediction](http://arxiv.org/abs/2410.11211)|**[link](https://github.com/safetylab24/FusionCVCP)**|Combining LiDAR and camera-view data has become a common approach for 3D object detection. However, previous methods fuse the two input streams at the point level, discarding the semantic information extracted from camera features. In this paper, we propose Cross-View Center Point-Fusion, a state-of-the-art model that performs 3D object detection by fusing camera- and LiDAR-derived features in BEV space, preserving the semantic density of the camera stream while incorporating the spatial data of LiDAR. Our architecture utilizes aspects of previously established algorithms (Cross-View Transformers and CenterPoint) and runs their backbones in parallel, enabling efficient computation for real-time processing and application. In this paper, we find that while implicitly calculated depth estimates may be sufficiently accurate for 2D map-view representations, precise bounding-box prediction in 3D world-view space requires explicitly computed geometric and spatial information.||
|**2024-10-15**|[Multiview Scene Graph](http://arxiv.org/abs/2410.11187)|**[link](https://github.com/ai4ce/MSG)**|A proper scene representation is central to achieving spatial intelligence, where an agent can robustly reconstruct and efficiently understand 3D scenes. A scene representation can be metric, such as landmark maps in 3D reconstruction, 3D bounding boxes in object detection, or voxel grids in occupancy prediction, or topological, such as pose graphs with loop closures in SLAM or visibility graphs in SfM. In this work, we propose to build multiview scene graphs (MSGs) from unposed images, representing a scene topologically with interconnected place and object nodes. The task of building an MSG is challenging for existing representation-learning methods, since it needs to jointly address visual place recognition, object detection, and object association from images with limited fields of view and potentially large viewpoint changes. To evaluate any method tackling this task, we develop an MSG dataset and annotations based on a public 3D dataset. We also propose an evaluation metric based on the intersection-over-union score of MSG edges. Moreover, we develop a novel baseline method built on mainstream pretrained vision models, combining visual place recognition and object association into one Transformer decoder architecture. Experiments demonstrate that our method has superior performance compared to existing relevant baselines.||
|**2024-10-11**|[Efficient Hyperparameter Importance Assessment for CNNs](http://arxiv.org/abs/2410.08920)|null|Hyperparameter selection is an essential aspect of the machine learning pipeline, profoundly impacting models' robustness, stability, and generalization capabilities. Given the complex hyperparameter spaces associated with Neural Networks and the constraints of computational resources and time, optimizing all hyperparameters becomes impractical. In this context, leveraging hyperparameter importance assessment (HIA) can provide valuable guidance by narrowing down the search space. This enables machine learning practitioners to focus their optimization efforts on the hyperparameters with the most significant impact on model performance while conserving time and resources. This paper aims to quantify the importance weights of some hyperparameters in Convolutional Neural Networks (CNNs) with an algorithm called N-RReliefF, laying the groundwork for applying HIA methodologies in the Deep Learning field. We conduct an extensive study by training over ten thousand CNN models across ten popular image classification datasets, thereby acquiring a comprehensive dataset containing hyperparameter configuration instances and their corresponding performance metrics. It is demonstrated that among the investigated hyperparameters, the top five important hyperparameters of the CNN model are the number of convolutional layers, learning rate, dropout rate, optimizer and epoch.||
|**2024-10-11**|[Efficient Multi-Object Tracking on Edge Devices via Reconstruction-Based Channel Pruning](http://arxiv.org/abs/2410.08769)|null|The advancement of multi-object tracking (MOT) technologies presents the dual challenge of maintaining high performance while addressing critical security and privacy concerns. In applications such as pedestrian tracking, where sensitive personal data is involved, the potential for privacy violations and data misuse becomes a significant issue if data is transmitted to external servers. To mitigate these risks, processing data directly on an edge device, such as a smart camera, has emerged as a viable solution. Edge computing ensures that sensitive information remains local, thereby aligning with stringent privacy principles and significantly reducing network latency. However, the implementation of MOT on edge devices is not without its challenges. Edge devices typically possess limited computational resources, necessitating the development of highly optimized algorithms capable of delivering real-time performance under these constraints. The disparity between the computational requirements of state-of-the-art MOT algorithms and the capabilities of edge devices emphasizes a significant obstacle. To address these challenges, we propose a neural network pruning method specifically tailored to compress complex networks, such as those used in modern MOT systems. This approach optimizes MOT performance by ensuring high accuracy and efficiency within the constraints of limited edge devices, such as NVIDIA's Jetson Orin Nano. By applying our pruning method, we achieve model size reductions of up to 70% while maintaining a high level of accuracy and further improving performance on the Jetson Orin Nano, demonstrating the effectiveness of our approach for edge computing applications.||
|**2024-10-11**|[MMLF: Multi-modal Multi-class Late Fusion for Object Detection with Uncertainty Estimation](http://arxiv.org/abs/2410.08739)|null|Autonomous driving necessitates advanced object detection techniques that integrate information from multiple modalities to overcome the limitations associated with single-modal approaches. The challenges of aligning diverse data in early fusion and the complexities, along with overfitting issues introduced by deep fusion, underscore the efficacy of late fusion at the decision level. Late fusion ensures seamless integration without altering the original detector's network structure. This paper introduces a pioneering Multi-modal Multi-class Late Fusion method, designed for late fusion to enable multi-class detection. Fusion experiments conducted on the KITTI validation and official test datasets illustrate substantial performance improvements, presenting our model as a versatile solution for multi-modal object detection in autonomous driving. Moreover, our approach incorporates uncertainty analysis into the classification fusion process, rendering our model more transparent and trustworthy and providing more reliable insights into category predictions.||
|**2024-10-11**|[Boosting Open-Vocabulary Object Detection by Handling Background Samples](http://arxiv.org/abs/2410.08645)|null|Open-vocabulary object detection is the task of accurately detecting objects from a candidate vocabulary list that includes both base and novel categories. Currently, numerous open-vocabulary detectors have achieved success by leveraging the impressive zero-shot capabilities of CLIP. However, we observe that CLIP models struggle to effectively handle background images (i.e. images without corresponding labels) due to their language-image learning methodology. This limitation results in suboptimal performance for open-vocabulary detectors that rely on CLIP when processing background samples. In this paper, we propose Background Information Representation for open-vocabulary Detector (BIRDet), a novel approach to address the limitations of CLIP in handling background samples. Specifically, we design Background Information Modeling (BIM) to replace the single, fixed background embedding in mainstream open-vocabulary detectors with dynamic scene information, and prompt it into image-related background representations. This method effectively enhances the ability to classify oversized regions as background. Besides, we introduce Partial Object Suppression (POS), an algorithm that utilizes the ratio of overlap area to address the issue of misclassifying partial regions as foreground. Experiments on OV-COCO and OV-LVIS benchmarks demonstrate that our proposed model is capable of achieving performance enhancements across various open-vocabulary detectors.||
|**2024-10-11**|[DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention](http://arxiv.org/abs/2410.08582)|**[link](https://github.com/maclong01/DeBiFormer)**|Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity-adaptive attention, such as in DAT, has yielded strong results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation tasks. The query-aware sparsity attention in BiFormer seeks to focus each query on top-k routed regions. However, during attention calculation, the selected key-value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which optimizes the selection of key-value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness.Code is available at {https://github.com/maclong01/DeBiFormer}||
|**2024-10-11**|[Quality Prediction of AI Generated Images and Videos: Emerging Trends and Opportunities](http://arxiv.org/abs/2410.08534)|null|The advent of AI has influenced many aspects of human life, from self-driving cars and intelligent chatbots to text-based image and video generation models capable of creating realistic images and videos based on user prompts (text-to-image, image-to-image, and image-to-video). AI-based methods for image and video super resolution, video frame interpolation, denoising, and compression have already gathered significant attention and interest in the industry and some solutions are already being implemented in real-world products and services. However, to achieve widespread integration and acceptance, AI-generated and enhanced content must be visually accurate, adhere to intended use, and maintain high visual quality to avoid degrading the end user's quality of experience (QoE). One way to monitor and control the visual "quality" of AI-generated and -enhanced content is by deploying Image Quality Assessment (IQA) and Video Quality Assessment (VQA) models. However, most existing IQA and VQA models measure visual fidelity in terms of "reconstruction" quality against a pristine reference content and were not designed to assess the quality of "generative" artifacts. To address this, newer metrics and models have recently been proposed, but their performance evaluation and overall efficacy have been limited by datasets that were too small or otherwise lack representative content and/or distortion capacity; and by performance measures that can accurately report the success of an IQA/VQA model for "GenAI". This paper examines the current shortcomings and possibilities presented by AI-generated and enhanced image and video content, with a particular focus on end-user perceived quality. Finally, we discuss open questions and make recommendations for future work on the "GenAI" quality assessment problems, towards further progressing on this interesting and relevant field of research.||
|**2024-10-11**|[Accelerated Distributed Stochastic Non-Convex Optimization over Time-Varying Directed Networks](http://arxiv.org/abs/2410.08508)|null|Distributed stochastic non-convex optimization problems have recently received attention due to the growing interest of signal processing, computer vision, and natural language processing communities in applications deployed over distributed learning systems (e.g., federated learning). We study the setting where the data is distributed across the nodes of a time-varying directed network, a topology suitable for modeling dynamic networks experiencing communication delays and straggler effects. The network nodes, which can access only their local objectives and query a stochastic first-order oracle to obtain gradient estimates, collaborate to minimize a global objective function by exchanging messages with their neighbors. We propose an algorithm, novel to this setting, that leverages stochastic gradient descent with momentum and gradient tracking to solve distributed non-convex optimization problems over time-varying networks. To analyze the algorithm, we tackle the challenges that arise when analyzing dynamic network systems which communicate gradient acceleration components. We prove that the algorithm's oracle complexity is $\mathcal{O}(1/\epsilon^{1.5})$, and that under Polyak-$\L$ ojasiewicz condition the algorithm converges linearly to a steady error state. The proposed scheme is tested on several learning tasks: a non-convex logistic regression experiment on the MNIST dataset, an image classification task on the CIFAR-10 dataset, and an NLP classification test on the IMDB dataset. We further present numerical simulations with an objective that satisfies the PL condition. The results demonstrate superior performance of the proposed framework compared to the existing related methods.||
|**2024-10-10**|[Bilinear MLPs enable weight-based mechanistic interpretability](http://arxiv.org/abs/2410.08417)|**[link](https://github.com/tdooms/bilinear-decomposition)**|A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the MLP layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity that nevertheless achieves competitive performance. Bilinear MLPs can be fully expressed in terms of linear operations using a third-order tensor, allowing flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights using eigendecomposition reveals interpretable low-rank structure across toy tasks, image classification, and language modeling. We use this understanding to craft adversarial examples, uncover overfitting, and identify small language model circuits directly from the weights alone. Our results demonstrate that bilinear layers serve as an interpretable drop-in replacement for current activation functions and that weight-based interpretability is viable for understanding deep-learning models.||
|**2024-10-10**|[What is Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias](http://arxiv.org/abs/2410.08407)|null|Knowledge Distillation is a commonly used Deep Neural Network compression method, which often maintains overall generalization performance. However, we show that even for balanced image classification datasets, such as CIFAR-100, Tiny ImageNet and ImageNet, as many as 41% of the classes are statistically significantly affected by distillation when comparing class-wise accuracy (i.e. class bias) between a teacher/distilled student or distilled student/non-distilled student model. Changes in class bias are not necessarily an undesirable outcome when considered outside of the context of a model's usage. Using two common fairness metrics, Demographic Parity Difference (DPD) and Equalized Odds Difference (EOD) on models trained with the CelebA, Trifeature, and HateXplain datasets, our results suggest that increasing the distillation temperature improves the distilled student model's fairness -- for DPD, the distilled student even surpasses the fairness of the teacher model at high temperatures. This study highlights the uneven effects of Knowledge Distillation on certain classes and its potentially significant role in fairness, emphasizing that caution is warranted when using distilled models for sensitive application domains.||
|**2024-10-10**|[Are We Ready for Real-Time LiDAR Semantic Segmentation in Autonomous Driving?](http://arxiv.org/abs/2410.08365)|null|Within a perception framework for autonomous mobile and robotic systems, semantic analysis of 3D point clouds typically generated by LiDARs is key to numerous applications, such as object detection and recognition, and scene reconstruction. Scene semantic segmentation can be achieved by directly integrating 3D spatial data with specialized deep neural networks. Although this type of data provides rich geometric information regarding the surrounding environment, it also presents numerous challenges: its unstructured and sparse nature, its unpredictable size, and its demanding computational requirements. These characteristics hinder the real-time semantic analysis, particularly on resource-constrained hardware architectures that constitute the main computational components of numerous robotic applications. Therefore, in this paper, we investigate various 3D semantic segmentation methodologies and analyze their performance and capabilities for resource-constrained inference on embedded NVIDIA Jetson platforms. We evaluate them for a fair comparison through a standardized training protocol and data augmentations, providing benchmark results on the Jetson AGX Orin and AGX Xavier series for two large-scale outdoor datasets: SemanticKITTI and nuScenes.||
|**2024-10-10**|[Dynamic Object Catching with Quadruped Robot Front Legs](http://arxiv.org/abs/2410.08065)|null|This paper presents a framework for dynamic object catching using a quadrupedal robot's front legs while it stands on its rear legs. The system integrates computer vision, trajectory prediction, and leg control to enable the quadruped to visually detect, track, and successfully catch a thrown object using an onboard camera. Leveraging a fine-tuned YOLOv8 model for object detection and a regression-based trajectory-prediction module, the quadruped adapts its front-leg positions iteratively to anticipate and intercept the object. The catching maneuver involves identifying an optimal catching position, controlling the front legs with Cartesian PD control, and closing the legs together at the right moment. We propose and validate three different methods for selecting the optimal catching position: 1) intersecting the predicted trajectory with a vertical plane; 2) selecting the point on the predicted trajectory with the minimal distance to the center of the robot's legs in their nominal position; and 3) selecting the point on the predicted trajectory with the highest likelihood under a Gaussian mixture model (GMM) modeling the robot's reachable space. Experimental results demonstrate robust catching capabilities across various scenarios, with the GMM method achieving the best performance, reaching an 80% catching success rate. A video demonstration of the system in action can be found at https://youtu.be/sm7RdxRfIYg.||
|**2024-10-10**|[When the Small-Loss Trick is Not Enough: Multi-Label Image Classification with Noisy Labels Applied to CCTV Sewer Inspections](http://arxiv.org/abs/2410.07689)|null|The maintenance of sewer networks, comprising millions of kilometers of pipe, relies heavily on efficient closed-circuit television (CCTV) inspections. Many promising approaches based on multi-label image classification have leveraged databases of historical inspection reports to automate these inspections. However, the significant presence of label noise in these databases, although known, has not been addressed. While extensive research has explored the issue of label noise in single-label classification (SLC), little attention has been paid to label noise in multi-label classification (MLC). To address this, we first adapted three sample-selection SLC methods (Co-teaching, CoSELFIE, and DISC) that have proven robust to label noise. Our findings revealed that sample selection based solely on the small-loss trick can handle complex label noise, but it is suboptimal. Adapting hybrid sample-selection methods to noisy MLC appeared to be a more promising approach. In light of this, we developed a novel method named MHSS (Multi-label Hybrid Sample Selection) based on CoSELFIE. Through an in-depth comparative study, we demonstrated the superior performance of our approach in dealing with both synthetic complex noise and real noise, thus contributing to the ongoing efforts toward the effective automation of CCTV sewer-pipe inspections.||
|**2024-10-10**|[TDDSR: Single-Step Diffusion with Two Discriminators for Super Resolution](http://arxiv.org/abs/2410.07663)|null|Super-resolution methods are increasingly being specialized for both real-world and face-specific tasks. However, many existing approaches rely on oversimplified degradation models, which limits their ability to handle complex and unknown degradation patterns effectively. While diffusion-based super-resolution techniques have recently shown impressive results, they are still constrained by the need for numerous inference steps. To address this, we propose TDDSR, an efficient single-step diffusion-based super-resolution method. Our method, distilled from a pretrained teacher model and based on a diffusion network, performs super-resolution in a single step. It integrates a learnable downsampler to capture diverse degradation patterns and employs two discriminators, one for high-resolution images and one for low-resolution images, to enhance overall performance. Experimental results demonstrate the method's effectiveness across real-world and face-specific super-resolution tasks, achieving performance comparable to, or even better than, another single-step method, previous state-of-the-art models, and the teacher model.||
|**2024-10-10**|[Explainability of Deep Neural Networks for Brain Tumor Detection](http://arxiv.org/abs/2410.07613)|**[link](https://github.com/sunyoung98/Brain_Tumor_Detection_XAI)**|Medical image classification is crucial for supporting healthcare professionals in decision-making and training. While convolutional neural networks (CNNs) have traditionally dominated this field, Transformer-based models are receiving increasing attention. In this study, we apply explainable artificial intelligence (XAI) techniques to assess the performance of various models on real-world medical data and to identify areas for improvement. We compare CNN models such as VGG-16, ResNet-50, and EfficientNetV2L against a Transformer model, ViT-Base-16. Our results show that data augmentation has little impact, whereas hyperparameter tuning and advanced modeling improve performance. CNNs, particularly VGG-16 and ResNet-50, outperformed ViT-Base-16 and EfficientNetV2L, likely due to underfitting caused by limited data. XAI methods such as LIME and SHAP further revealed that the better-performing models visualize tumors more effectively. These findings suggest that CNNs with shallower architectures are more effective for small datasets and can support medical decision-making.||
|**2024-10-10**|[O1O: Grouping of Known Classes to Identify Unknown Objects as Odd-One-Out](http://arxiv.org/abs/2410.07514)|null|Object detection methods trained on a fixed set of known classes struggle to detect objects of unknown classes in the open-world setting. Current fixes involve adding approximate supervision with pseudo-labels corresponding to candidate object locations, which are typically obtained in a class-agnostic manner. While previous approaches mainly rely on the visual appearance of objects, we find that geometric cues improve unknown-object recall. Although additional supervision from pseudo-labels helps detect unknown objects, it also introduces confusion for known classes. We observe a significant decline in the model's performance on detecting known objects in the presence of noisy pseudo-labels. Drawing inspiration from studies of human cognition, we propose grouping known classes into superclasses. By identifying similarities between classes within a superclass, we can identify unknown classes through an odd-one-out scoring mechanism. Our experiments on open-world detection benchmarks demonstrate significant improvements in unknown-object recall across all tasks. Crucially, we achieve this without compromising known-object performance, thanks to a better partitioning of the feature space via superclasses.||
|**2024-10-09**|[Progressive Multi-Modal Fusion for Robust 3D Object Detection](http://arxiv.org/abs/2410.07475)|null|Multi-sensor fusion is crucial for accurate 3D object detection in autonomous driving, with cameras and LiDAR being the most commonly used sensors. However, existing methods perform sensor fusion in a single view by projecting features from both modalities into either bird's-eye view (BEV) or perspective view (PV), thus sacrificing complementary information such as height or geometric proportions. To address this limitation, we propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both the intermediate and object-query levels. Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection. Additionally, we introduce a self-supervised masked modeling pretraining strategy to improve multimodal representation learning and data efficiency through three novel objectives. Extensive experiments on the nuScenes and Argoverse2 datasets conclusively demonstrate the efficacy of ProFusion3D. Moreover, ProFusion3D is robust to sensor failure, demonstrating strong performance when only one modality is available.||
|**2024-10-09**|[Self-Supervised Learning for Real-World Object Detection: a Survey](http://arxiv.org/abs/2410.07442)|null|Self-supervised learning (SSL) has emerged as a promising approach in computer vision, enabling networks to learn meaningful representations from large unlabeled datasets. SSL methods fall primarily into two categories: instance discrimination and masked image modeling (MIM). While instance discrimination is fundamental to SSL, it was originally designed for classification and may be less effective for object detection, particularly for small objects. In this survey, we focus on SSL methods specifically designed for real-world object detection, with an emphasis on detecting small objects in complex environments. Unlike previous surveys, we offer a detailed comparison of SSL strategies, including object-level instance discrimination and MIM methods, and assess their effectiveness for small-object detection using both CNN- and ViT-based architectures. Specifically, our benchmark is performed on the widely used COCO dataset, as well as on a specialized real-world dataset focused on vehicle detection in infrared remote-sensing imagery. We also assess the impact of pretraining on custom domain-specific datasets, highlighting how certain SSL strategies are better suited to handling uncurated data. Our findings highlight that instance-discrimination methods perform well with CNN-based encoders, while MIM methods are better suited to ViT-based architectures and custom dataset pretraining. This survey provides a practical guide for selecting optimal SSL strategies, taking into account factors such as backbone architecture, object size, and custom pretraining requirements. Ultimately, we show that choosing an appropriate SSL pretraining strategy, along with a suitable encoder, significantly enhances performance in real-world object detection, particularly for small-object detection in resource-limited environments.||
|**2024-10-09**|[Robust infrared small target detection using self-supervised and a contrario paradigms](http://arxiv.org/abs/2410.07437)|null|Detecting small targets in infrared imagery poses significant challenges in defense applications due to the presence of complex backgrounds and the small size of the targets. Traditional object-detection methods often struggle to balance high detection rates with low false-alarm rates, especially when dealing with small objects. In this paper, we introduce a novel approach that combines the a contrario paradigm with self-supervised learning (SSL) to improve infrared small target detection (IRSTD). On the one hand, the integration of an a contrario criterion into a YOLO detection head enhances the feature-map responses to small and unexpected objects while effectively controlling false alarms. On the other hand, we explore SSL techniques to overcome the challenge of limited annotated data, common in IRSTD tasks. Specifically, we benchmark several representative SSL strategies to understand their effectiveness in improving small-object detection performance. Our findings show that instance-discrimination methods outperform masked-image-modeling strategies when applied to YOLO-based small-object detection. Moreover, the combination of the a contrario and SSL paradigms leads to significant performance improvements, narrowing the gap with state-of-the-art segmentation methods and even outperforming them in resource-limited settings. This two-pronged approach offers a robust solution for improving IRSTD performance, particularly under challenging conditions.||
|**2024-10-09**|[One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation](http://arxiv.org/abs/2410.07170)|**[link](https://github.com/ml-jku/EVA)**|基础模型 (FM) 在大规模数据集上进行预训练,然后针对特定应用在下游任务上进行微调。最成功和最常用的微调方法是通过低秩自适应 (LoRA) 更新预训练的权重。LoRA 引入了新的权重矩阵,这些矩阵通常使用跨模型权重的均匀秩分布随机初始化。最近的工作集中在权重驱动的初始化或在训练期间学习自适应秩。这两种方法都只是孤立地进行研究,导致收敛速度慢或秩分布均匀,进而导致性能欠佳。我们建议通过以数据驱动的方式初始化新权重来增强 LoRA,方法是在小批量激活向量上计算奇异值分解。然后,我们使用获得的右奇异向量初始化 LoRA 矩阵,并在所有权重矩阵之间重新分配秩,以解释最大量的方差,并继续标准的 LoRA 微调过程。这导致了我们的新方法,称为解释方差自适应 (EVA)。我们将 EVA 应用于各种微调任务,从语言生成和理解到图像分类和强化学习。与竞争对手相比,EVA 表现出更快的收敛速度,并在每个领域的众多任务中获得了最高的平均分数。||
|**2024-10-09**|[JPEG Inspired Deep Learning](http://arxiv.org/abs/2410.07081)|**[link](https://github.com/jpeginspireddl/jpeg-inspired-dl)**|尽管传统上认为有损图像压缩(例如JPEG压缩)会对深度神经网络(DNN)的性能产生负面影响,但最近的研究表明,精心设计的JPEG压缩实际上可以提高深度学习(DL)的性能。受此启发,我们提出了JPEG-DL,这是一种新颖的深度学习框架,它在任何底层DNN架构之前添加了一个可训练的JPEG压缩层。为了使JPEG压缩中的量化操作可训练,我们在JPEG层采用了一种新的可微分软量化器,然后联合训练量化操作和底层DNN。大量实验表明,与标准深度学习相比,JPEG-DL在各种数据集和模型架构上均可显著提高准确性,同时增强了对对抗性攻击的鲁棒性。特别是,在一些细粒度图像分类数据集上,JPEG-DL可以将预测精度提高多达20.9%。我们的代码可在https://github.com/JpegInspiredDl/JPEG-Inspired-DL.git获取。||
|**2024-10-07**|[LoTLIP: Improving Language-Image Pre-training for Long Text Understanding](http://arxiv.org/abs/2410.05249)|null|理解长文本在实践中有着巨大的需求,但这超出了大多数语言图像预训练 (LIP) 模型的能力范围。在本研究中,我们通过实证证实了造成这个问题的关键原因是训练图像通常与简短的标题配对,导致某些词语容易被突出的词语所掩盖。为了解决这个问题,我们最初尝试使用长标题重新标记数据,但是,直接使用长标题进行学习可能会导致理解短文本的性能下降(例如,在图像分类任务中)。然后,通过结合角点词语来聚合不同的文本信息,我们设法帮助模型在理解短文本方面赶上其原始水平,同时大大增强其理解长文本的能力。我们进一步研究了模型是否可以从更长的标题中持续受益,并注意到性能和效率之间存在明显的权衡。最后,我们使用一个自建的大规模数据集验证了我们方法的有效性,该数据集包含 1 亿个面向长标题的文本图像对。值得注意的是,在长文本图像检索任务中,我们比使用长标题的竞争对手提高了 11.1%(即从 72.62% 提高到 83.72%)。我们将发布代码、模型和新数据集,以促进可重复性和进一步的研究。项目页面可访问 https://wuw2019.github.io/lotlip。||
|**2024-10-07**|[Control-oriented Clustering of Visual Latent Representation](http://arxiv.org/abs/2410.05063)|null|我们对基于图像的控制管道中视觉表征空间(从视觉编码器到动作解码器的信道)的几何结构进行研究,该管道通过行为克隆学习得到。受图像分类中神经元崩溃(NC)现象的启发,我们研究了视觉表征空间中是否会出现类似的聚类规律。由于基于图像的控制是一项没有明确定义类别的回归任务,因此问题的关键在于确定视觉特征根据哪些隐含类别进行聚类(如果存在这种规律)。我们专注于基于图像的平面推动任务,假设视觉表征在控制任务中最重要作用是向动作解码器传递目标。然后,我们根据(a) 输入中物体和目标之间的相对姿态或(b) 输出中专家动作引起的物体的相对姿态,将专家演示的训练样本分为八个“面向控制”的类别,其中一个类别对应一个相对姿态卦限(REPO)。在架构的四种不同实例中,我们报告了根据八个REPO,视觉表征空间中普遍出现了面向控制的聚类。除了经验观察之外,我们还表明,当使用有限的专家演示训练策略时,这种聚类规律可以用作算法工具来提高测试时的性能。特别是,我们使用NC作为正则化方法对视觉编码器进行预训练,以鼓励视觉特征的面向控制的聚类。令人惊讶的是,这种经过NC预训练的视觉编码器在使用动作解码器进行端到端微调时,在低数据情况下将测试性能提高了10%到35%。现实世界中基于视觉的平面推动实验证实了面向控制的视觉表征预训练的惊人优势。||
|**2024-10-07**|[Improving Object Detection via Local-global Contrastive Learning](http://arxiv.org/abs/2410.05058)|null|视觉域差距通常会影响目标检测性能。图像到图像的转换可以减轻这种影响,其中对比方法能够在无监督情况下学习图像到图像的映射。然而,现有方法往往无法处理包含多个目标实例的内容丰富的场景,这表现为检测性能不理想。对这种实例级内容的敏感性通常只能通过目标标注来获得,而目标标注的获取成本可能很高。为了解决这个问题,我们提出了一种新的图像到图像转换方法,专门针对跨域目标检测。我们将我们的方法制定为一个对比学习框架,该框架具有归纳先验,通过空间注意掩码优化目标实例的外观,将场景隐式地划分为与目标目标实例相关的前景区域和背景非目标区域。我们的方法不是依靠目标标注在转换过程中明确地考虑目标实例,而是通过对比局部-全局信息来学习表示目标。这为探索一项未被充分挖掘的挑战提供了可能:在不依赖目标标注或检测器模型微调的情况下,在域转移下获得高性能检测。我们通过三个具有挑战性的基准测试,对多个跨域目标检测设置进行了实验,并报告了最先进的性能。项目页面:https://local-global-detection.github.io||
|**2024-10-07**|[Near-Field ISAC in 6G: Addressing Phase Nonlinearity via Lifted Super-Resolution](http://arxiv.org/abs/2410.04930)|null|集成传感与通信 (ISAC) 是 6G 网络的一个很有前景的组成部分,它融合了通信和雷达技术以促进新的服务。此外,在 ISAC 共用接收机上使用超大规模天线阵列 (ELLA) 不仅促进了太赫兹级通信链路,而且还显著提高了雷达应用中目标检测的精度。在实际场景中,通信散射体和雷达目标通常位于距离 ISAC 接收机很近的位置。这种情况,再加上 ELLA 的使用,从根本上改变了无线和雷达信道的电磁特性,从远场平面波传播转变为近场球面波传播。在远场平面波模型下,阵列响应向量的相位随天线索引线性变化。相反,在近场球面波模型中,这种相位关系变为非线性。这种转变提出了一个根本性的挑战:广泛使用的傅立叶分析不能再直接应用于 ISAC 共用接收机上的目标检测和通信信道估计。在这项工作中,我们提出了一个可行的解决方案来解决这个基本问题。具体来说,我们证明了存在一个高维空间,其中相位非线性可以表示为线性。利用这一见解,我们开发了一个提升的超分辨率框架,该框架可以同时执行通信信道估计并以高精度提取目标参数。||
|**2024-10-07**|[Improved detection of discarded fish species through BoxAL active learning](http://arxiv.org/abs/2410.04880)|**[link](https://github.com/pieterblok/boxal)**|近年来,强大的数据驱动深度学习技术已被开发并应用于自动化渔获登记。然而,这些方法依赖于标记数据,而标记数据的收集非常耗时、费力、昂贵,并且需要专业知识。在本研究中,我们提出了一种名为 BoxAL 的主动学习技术,该技术包括对 Faster R-CNN 目标检测模型的认知不确定性进行估计。该方法允许从未标记的图像池中选择最不确定的训练图像,然后使用这些图像来训练目标检测模型。为了评估该方法,我们使用了一个开源图像数据集,该数据集是通过专为捕捞底层鱼类的商业拖网渔船开发的专用图像采集系统获得的。我们证明,与随机抽样相比,我们的方法可以少用 400 张标注图像达到相同的目标检测性能。此外,在最后一次训练迭代中,使用 1100 张训练图像时,基于置信度的采样和随机采样的平均 AP 分数分别显著提高到 39.0±1.6 和 34.8±1.8。此外,我们还表明,认知不确定性是一种合适的采样方法,可以对当前迭代模型无法处理的图像进行采样。我们的研究还表明,采样得到的新数据比剩余的未标记数据对训练更有价值。我们的软件可在 https://github.com/pieterblok/boxal 获取。||
|**2024-10-06**|[Learning De-Biased Representations for Remote-Sensing Imagery](http://arxiv.org/abs/2410.04546)|**[link](https://github.com/doem97/deblora)**|遥感 (RS) 影像需要专门的卫星进行采集,而且标注难度大,因此存在数据稀缺和某些光谱类别不平衡的问题。由于数据稀缺,从头开始训练任何大规模 RS 模型都是不现实的,替代方案是通过微调或数据效率更高的 LoRA 方法来迁移预训练模型。由于类别不平衡,迁移后的模型表现出强烈的偏差,其中主要类别的特征支配着次要类别的特征。在本文中,我们提出了 debLoRA,这是一种通用的训练方法,可以与任何 LoRA 变体一起使用,以产生去偏差的特征。它是一种无监督学习方法,可以根据与主要类别共享的属性来实现次要类别特征的多样化,其中属性是通过简单的聚类步骤获得的。为了对其进行评估,我们在 RS 领域的两种迁移学习场景中进行了广泛的实验:从自然图像到光学 RS 图像,以及从光学 RS 图像到多光谱 RS 图像。我们在光学 RS 数据集 DOTA 和 SAR 数据集 FUSRS 上执行了目标分类和面向目标的检测任务。结果表明,我们的 debLoRA 在这些 RS 适应性设置中始终优于现有技术,在自然图像到光学 RS 和光学 RS 到多光谱 RS 的适应性方面,尾部类别的性能分别提高了 3.3 和 4.7 个百分点,同时保持了头部类别的性能,证明了其有效性和适应性。||
|**2024-10-05**|[Distillation-Free One-Step Diffusion for Real-World Image Super-Resolution](http://arxiv.org/abs/2410.04224)|**[link](https://github.com/jianzeli-114/dfosd)**|扩散模型在现实世界图像超分辨率(Real-ISR)方面取得了优异的性能,但计算成本很高。当前的方法试图通过知识蒸馏从多步模型中推导出一步扩散模型。然而,这些方法会导致大量的训练成本,并且可能会受到教师模型的限制,从而限制学生模型的性能。为了解决这些问题,我们提出了DFOSD,一种无需蒸馏的一步扩散模型。具体来说,我们提出了一个噪声感知鉴别器(NAD)来参与对抗训练,进一步增强生成内容的真实性。此外,我们利用边缘感知DISTS(EA-DISTS)改进了感知损失,以增强模型生成精细细节的能力。我们的实验表明,与之前需要数十步甚至数百步的基于扩散的方法相比,我们的DFOSD在定量指标和定性评估方面都取得了相当甚至更好的结果。与其他一步扩散方法相比,我们的DFOSD还获得了更高的性能和效率。我们将在\url{https://github.com/JianzeLi-114/DFOSD}发布代码和模型。||
|**2024-10-05**|[Exploring Strengths and Weaknesses of Super-Resolution Attack in Deepfake Detection](http://arxiv.org/abs/2410.04205)|null|图像处理技术正在迅速发展,使人们能够创建可信的内容来扭曲现实。尽管深度伪造检测器的结果令人鼓舞,但通过对抗性攻击可以使深度伪造变得更加难以检测。这些攻击旨在进一步处理图像,以掩盖深度伪造的伪影,或插入使图像看起来原始无损的信号。在本文中,我们进一步探索了超分辨率攻击的潜力,这些攻击基于不同的超分辨率技术和不同的尺度,能够以或强或弱的程度影响深度伪造检测器的性能。我们还在更多样化的数据集上评估了该攻击的影响,发现超分辨率过程能有效隐藏深度伪造生成模型引入的伪影,但无法隐藏完全合成图像中包含的痕迹。最后,我们对检测器的训练过程提出了一些改进,以提高其对此类攻击的鲁棒性。||
|**2024-10-05**|[Fast Object Detection with a Machine Learning Edge Device](http://arxiv.org/abs/2410.04173)|null|本机器学习研究调查了一种低成本边缘设备,该设备集成了一个具有计算机视觉功能的嵌入式系统,从而提高了目标检测和分类的推理时间和精度。本研究的主要目标是减少推理时间和降低功耗,并使竞赛级自主人形机器人的嵌入式设备能够支持实时目标识别、场景理解、视觉导航、运动规划和机器人的自主导航。本研究比较了中央处理器 (CPU)、图形处理器 (GPU) 和张量处理器 (TPU) 之间的推理时间性能。CPU、GPU 和 TPU 都是可用于机器学习任务的处理器。为了支持自主人形机器人,我们还努力观察使用具有单目视觉功能的相机与立体视觉功能的相机是否存在显著差异。本研究的 TPU 推理时间结果反映,与 GPU 相比,时间缩短了 25%,与 CPU 相比,推理时间惊人地缩短了 87.5%。本文的许多信息有助于最终选择 Google 的 Coral 品牌 Edge TPU 设备。Arduino Nano 33 BLE Sense Tiny ML 套件也被考虑用于比较,但由于初始不兼容性以及为了及时完成本研究,我们决定在未来的实验中再审查该套件。||
|**2024-10-05**|[Robust Task-Oriented Communication Framework for Real-Time Collaborative Vision Perception](http://arxiv.org/abs/2410.04168)|null|协同感知通过聚合来自多个智能体的信息来增强多机器人和车载网络中的感知能力,提高感知精度和范围。然而,移动性和非刚性传感器安装会引入外参标定误差,因此需要在线标定,而感知区域重叠有限又使在线标定变得复杂。保持信息新鲜度对于及时、准确的感知至关重要。为了解决标定误差并同时保证感知精度和传输及时性,我们提出了一种鲁棒的面向任务的通信框架 (R-TOCOM),它在部署和流式传输两个阶段优化标定和特征传输。首先,我们构建了一个感知目标年龄 (AoPT) 最小化问题来刻画信息新鲜度。然后,在部署阶段,我们引入了一种基于重识别 (Re-ID) 的信道感知自标定技术。该技术根据信道容量自适应地压缩关键点特征,通过跨相机的空间和时间关联有效解决标定问题。在流式传输阶段,我们通过集成基于信息瓶颈 (IB) 的编码方法来权衡带宽和推理精度,该方法根据任务相关性调整视频压缩率,从而降低通信开销和延迟。为了减轻丢包造成的性能下降,我们引入了一个优先级网络来过滤受损特征。大量研究表明,我们的框架优于五个基线方法,在恶劣信道条件下将多目标检测精度 (MODA) 提高了 25.49%,并将通信成本降低了 51.36%。||
|**2024-10-04**|[Classification-Denoising Networks](http://arxiv.org/abs/2410.03505)|null|图像分类和去噪面临着缺乏鲁棒性或部分忽略条件信息的互补问题。我们认为,可以通过 (噪声) 图像和类别标签的联合概率模型来统一这两个任务,从而缓解这些问题。分类通过前向传递和条件化来执行。使用 Tweedie-Miyasawa 公式,我们用分数来评估去噪函数,该分数可以通过边缘化和反向传播来计算。然后,训练目标是交叉熵损失和在噪声水平上积分的去噪分数匹配损失的组合。在 CIFAR-10 和 ImageNet 上的数值实验表明,与参考深度卷积分类器/去噪器相比,该方法具有竞争性的分类和去噪性能,并且与以前的联合方法相比,效率显著提高。与标准判别分类器相比,我们的模型对对抗性扰动的鲁棒性有所提高,并且可以将对抗性梯度新颖地解释为去噪器的差异。||
|**2024-10-04**|[Sm: enhanced localization in Multiple Instance Learning for medical imaging classification](http://arxiv.org/abs/2410.03276)|**[link](https://github.com/franblueee/smmil)**|多示例学习 (MIL) 广泛应用于医学图像分类,以减少标注工作量。虽然训练时只有包标签可用,但人们通常会在包和实例级别寻求预测(分别为分类和定位任务)。早期的 MIL 方法独立地处理包中的实例。最近的方法考虑了实例之间的全局和局部依赖关系。虽然它们在分类方面取得了很好的效果,但它们在定位方面的性能相对有限。我们认为,这些模型的设计目标是分类任务,而实例级别的含义尚未得到深入研究。基于一个简单的观察结果——相邻实例可能具有相同的标签——我们提出了一种新颖、有原则且灵活的机制来模拟局部依赖关系。它可以单独使用,也可以与任何模拟全局依赖关系的机制(例如,Transformer)结合使用。全面的实证验证表明,我们的模块在定位方面达到了最先进的性能,同时在分类方面也具有竞争力或优越性。我们的代码位于https://github.com/Franblueee/SmMIL。||
|**2024-10-04**|[DRAFTS: A Deep Learning-Based Radio Fast Transient Search Pipeline](http://arxiv.org/abs/2410.03200)|**[link](https://github.com/SukiYume/DRAFTS)**|在射电天文学中,快速射电暴 (FRB) 的探测是一项复杂的任务,因为它面临着射频干扰 (RFI) 和星际介质中信号色散带来的挑战。传统的搜索算法通常效率低下、耗时且会产生大量的误报。在本文中,我们提出了 DRAFTS,一个基于深度学习的快速射电瞬变搜索流程。DRAFTS 整合了目标检测和二元分类技术,以准确识别射电数据中的 FRB。我们开发了一个大型的真实 FRB 数据集,用于训练深度学习模型。对 FAST 真实观测数据的搜索测试表明,DRAFTS 在准确性、完整性和搜索速度方面表现出色。在 FRB 20190520B 观测数据的搜索中,DRAFTS 探测到的爆发次数是 Heimdall 的三倍多,这突出了其在未来 FRB 探测和分析方面的潜力。||
|**2024-10-03**|[PixelShuffler: A Simple Image Translation Through Pixel Rearrangement](http://arxiv.org/abs/2410.03021)|**[link](https://github.com/OmarSZamzam/PixelShuffler)**|图像到图像的转换是计算机视觉领域的一个课题,其应用范围十分广泛,从医学图像转换(例如将MRI扫描转换为CT扫描或其他MRI对比度)到图像着色、超分辨率、域适应以及从草图或语义图生成逼真图像。图像风格迁移也是图像到图像转换中一个被广泛研究的应用,其目标是合成一个结合了一幅图像的内容和另一幅图像风格的图像。现有的最先进方法通常依赖于复杂的神经网络(包括扩散模型和语言模型)来实现高质量的风格迁移,但这些方法的计算成本可能很高,而且实现起来也很复杂。在本文中,我们提出了一种新的像素洗牌方法,该方法解决了图像到图像转换的一般问题,并在风格迁移中有一个具体的演示应用。该方法通过对风格图像的像素进行洗牌来实现风格迁移,从而最大化洗牌后的图像与内容图像之间的互信息。这种方法本质上保留了风格图像的颜色,同时确保了内容图像的结构细节保留在风格化后的输出中。我们证明,这种简单直接的方法产生的结果可与最先进的技术相媲美,这可以通过学习感知图像块相似度(LPIPS)损失(用于内容保留)和Fréchet Inception 距离(FID)分数(用于风格相似度)来衡量。我们的实验验证了所提出的像素洗牌方法在显著降低复杂度的同时实现了具有竞争力的性能,为高效的图像风格迁移提供了一种很有前途的替代方案,同时也为该方法在一般图像到图像转换任务中的可用性带来了希望。||
|**2024-10-03**|[On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions](http://arxiv.org/abs/2410.02935)|null|随着混合专家模型 (MoE) 架构在开发大规模基础模型中的重要性日益凸显,我们研究了分层混合专家模型 (HMoE),这是 MoE 的一种特殊变体,擅长处理复杂输入和提高目标任务的性能。我们的研究强调了使用不同的门控函数的优势,超越了 HMoE 框架内的 softmax 门控。我们从理论上证明,即使仅在选定的层次级别应用最佳门控函数,对每个专家组应用定制的门控函数也允许 HMoE 实现稳健的结果。跨不同场景的经验验证支持了这些理论主张。这包括大规模多模态任务、图像分类以及潜在领域发现和预测任务,在这些任务中,我们改进的 HMoE 模型显示出巨大的性能提升。||
|**2024-10-04**|[Learning 3D Perception from Others' Predictions](http://arxiv.org/abs/2410.02646)|null|在现实环境中进行精确的三维目标检测需要大量高质量的标注数据。获取此类数据的过程既乏味又昂贵,并且在采用新传感器或将检测器部署到新环境中时,通常需要重复工作。我们研究了一种构建三维目标检测器的新方案:从配备精确检测器的附近单元的预测中学习。例如,当自动驾驶汽车进入一个新区域时,它可以从其他交通参与者那里学习,这些交通参与者的检测器已经针对该区域进行了优化。这种设置具有标签效率高、传感器无关性和通信效率高的特点:附近的单元只需要与自我代理(例如,汽车)共享预测结果。然而,简单地将接收到的预测作为真实值来训练自我车辆的检测器会导致性能下降。我们系统地研究了这个问题,并将视点不匹配和定位错误(由于同步和 GPS 错误)确定为主要原因,这些原因不可避免地会导致误报、漏报和不准确的伪标签。我们提出了一种基于距离的课程学习方法,首先从视点相似的较近单元学习,然后通过自我训练逐步提高其他单元预测的质量。我们进一步证明,可以使用少量标注数据训练有效的伪标签细化模块,从而大大减少训练目标检测器所需的数据量。我们在最近发布的真实世界协同驾驶数据集上验证了我们的方法,使用参考车辆的预测作为自我车辆的伪标签。包括多种场景(例如,不同的传感器、检测器和域)在内的大量实验表明,我们的方法可以有效地从其他单元的预测中进行标签高效的三维感知学习。||
|**2024-10-03**|[LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model](http://arxiv.org/abs/2410.02615)|null|当前最先进的医学多模态大型语言模型(med-MLLM),如 LLaVA-Med 或 BioMedGPT,在预训练中利用了指令遵循数据。然而,这些模型主要侧重于扩大模型规模和数据量以提高性能,而主要依赖于自回归学习目标。令人惊讶的是,我们发现这种学习方案可能导致视觉和语言模态之间的对齐较弱,使得这些模型高度依赖于大量的预训练数据集——这在医学领域是一个重大挑战,因为高质量指令遵循实例的整理既昂贵又耗时。我们使用 LoGra-Med 来解决这个问题,这是一种新的多图对齐算法,可在图像模态、基于对话的描述和扩展字幕之间强制执行三元组关联。这有助于模型捕捉上下文含义、处理语言变异性以及在视觉和文本之间建立跨模态关联。为了扩展我们的方法,我们设计了一种使用黑盒梯度估计的高效端到端学习方案,可以实现更快的 LLaMa 7B 训练。我们的结果表明,LoGra-Med 在 60 万个图像-文本对的医学 VQA 上与 LLAVA-Med 的性能相匹配,并且在接受 10% 数据训练时明显优于它。例如,在 VQA-RAD 上,我们比 LLAVA-Med 高出 20.13%,并且几乎达到了 100% 预训练分数(72.52% 对比 72.64%)。我们还在视觉聊天机器人上超越了像 BiomedGPT 这样的 SOTA 方法,并在使用 VQA 进行零样本图像分类方面超越了 RadFM,突出了多图对齐的有效性。||
|**2024-10-03**|[Personalized Quantum Federated Learning for Privacy Image Classification](http://arxiv.org/abs/2410.02547)|null|量子联邦学习提高了隐私图像分类的效果,但客户端模型缺乏个性化可能导致量子联邦学习的次优性。为了增强图像分布不平衡情况下客户端模型的个性化,提出了一种用于隐私图像分类的个性化量子联邦学习算法。首先,构建了个性化量子联邦学习模型,在客户端模型中设置了个性化层以维护个性化参数。其次,引入了个性化量子联邦学习算法,以确保客户端和服务器之间交换的信息安全。第三,将个性化联邦学习应用于 FashionMNIST 数据集上的图像分类,实验结果表明,即使在本地训练样本不平衡的情况下,个性化量子联邦学习算法也能获得性能优异的全局和局部模型。在8个客户端和分布参数为100的情况下,服务器的准确率达到了100%,比非个性化模型提高了7%。在2个客户端和分布参数为1的情况下,客户端的平均准确率比非个性化模型提高了2.9%。与之前的量子联邦学习算法相比,所提出的个性化量子联邦学习算法在保护模型和数据隐私的同时,无需额外的本地训练。这可能促进量子技术的更广泛采用和应用,并为更安全、可扩展和高效的量子分布式机器学习解决方案铺平道路。||
|**2024-10-03**|[DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM](http://arxiv.org/abs/2410.02492)|null|视觉语言跟踪 (VLT) 已成为一个前沿研究领域,它利用语言数据增强了多模态输入算法,并将传统单目标跟踪 (SOT) 的范围扩展到视频理解应用。 尽管如此,大多数 VLT 基准测试仍然依赖于人工标注的简洁文本描述来描述每个视频。 这些描述通常无法捕捉视频内容动态的细微差别,并且缺乏语言风格变化,受限于其统一的细节水平和固定的标注频率。 因此,算法倾向于默认采用“记住答案”的策略,偏离了深入理解视频内容的核心目标。 幸运的是,大型语言模型 (LLM) 的出现使生成多样化文本成为可能。 这项工作利用 LLM 为具有代表性的 SOT 基准生成不同的语义注释(在文本长度和粒度方面),从而建立了一个新的多模态基准。 具体来说,我们 (1) 基于五个著名的 VLT 和 SOT 基准,提出了一个新的具有不同文本的视觉语言跟踪基准,名为 DTVLT,包括三个子任务:短期跟踪、长期跟踪和全局实例跟踪。 (2) 我们的基准测试提供了四种粒度的文本,考虑了语义信息的范围和密度。 我们预计这种多粒度生成策略将为 VLT 和视频理解研究营造有利的环境。 (3) 我们对 DTVLT 进行了全面的实验分析,评估了不同文本对跟踪性能的影响,并希望识别出的现有算法的性能瓶颈能够支持 VLT 和视频理解的进一步研究。 提出的基准、实验结果和工具包将在 http://videocube.aitestunion.com/ 上逐步发布。||
|**2024-10-03**|[PnP-Flow: Plug-and-Play Image Restoration with Flow Matching](http://arxiv.org/abs/2410.02423)|**[link](https://github.com/annegnx/PnP-Flow)**|本文介绍了即插即用流匹配 (PnP Flow Matching),这是一种解决成像逆问题的算法。PnP 方法利用预训练去噪器(通常是深度神经网络)的优势,将它们集成到优化方案中。虽然它们在各种成像逆问题上实现了最先进的性能,但 PnP 方法在修复等更具生成性的任务中面临着固有的局限性。另一方面,流匹配等生成模型突破了图像采样的界限,但缺乏在图像恢复中有效使用的明确方法。我们建议通过使用预训练的 FM 模型定义时间相关的去噪器,将 PnP 框架与流匹配 (FM) 相结合。我们的算法在数据保真度项上的梯度下降步骤、对学习到的 FM 路径的重新投影和去噪之间交替进行。值得注意的是,我们的方法计算效率高且内存友好,因为它避免了通过 ODE 的反向传播和轨迹计算。我们评估了其在去噪、超分辨率、去模糊和修复任务上的性能,证明了其与现有 PnP 算法和基于流匹配的最先进方法相比具有优越的结果。||
|**2024-10-03**|[Spiking Neural Network as Adaptive Event Stream Slicer](http://arxiv.org/abs/2410.02249)|null|基于事件的相机由于其丰富的边缘信息、高动态范围和高时间分辨率而备受关注。许多最先进的基于事件的算法依赖于将事件分割成固定的组,这会导致关键时间信息的丢失,尤其是在处理不同的运动场景(例如,高速/低速)时。在这项工作中,我们提出了SpikeSlicer,一种新颖的即插即用事件处理方法,能够自适应地分割事件流。SpikeSlicer利用轻量级(0.41M)和低能耗的脉冲神经网络(SNN)来触发事件切片。为了引导SNN在最佳时间步长触发脉冲,我们提出了脉冲位置感知损失(SPA-Loss)来调节神经元的状态。此外,我们开发了一种反馈更新训练策略,利用来自下游人工神经网络(ANN)的反馈来改进切片决策。大量实验表明,我们的方法在基于事件的目标跟踪和识别方面取得了显著的性能提升。值得注意的是,SpikeSlicer提供了一种全新的SNN-ANN合作范式,其中SNN充当高效、低能耗的数据处理器,协助ANN提高下游性能,为探索新的视角和潜在途径注入了活力。||
|**2024-10-02**|[Kolmogorov-Arnold Network Autoencoders](http://arxiv.org/abs/2410.02077)|**[link](https://github.com/aminmoradixl/kan_ae)**|深度学习模型已经彻底改变了各个领域,其中多层感知器 (MLP) 是数据回归和图像分类等任务的基石。然而,最近的一项研究引入了 Kolmogorov-Arnold 网络 (KAN) 作为 MLP 的有前途的替代方案,它利用放置在边而不是节点上的激活函数。这种结构转变使 KAN 与 Kolmogorov-Arnold 表示定理紧密结合,有可能提高模型的准确性和可解释性。在这项研究中,我们探讨了 KAN 在通过自动编码器进行数据表示方面的功效,将它们在 MNIST、SVHN 和 CIFAR-10 数据集上的性能与传统卷积神经网络 (CNN) 进行了比较。我们的结果表明,基于 KAN 的自动编码器在重建精度方面取得了具有竞争力的性能,从而表明它们可以作为数据分析任务中的有效工具。||
|**2024-10-02**|[Stochastic Deep Restoration Priors for Imaging Inverse Problems](http://arxiv.org/abs/2410.02057)|null|作为图像去噪器的深度神经网络被广泛用作解决成像逆问题的先验。 虽然高斯去噪被认为足以学习图像先验,但我们表明,从预先训练为更通用的恢复算子的深度模型中获得的先验可以表现得更好。 我们引入了随机深度恢复先验 (ShaRP),这是一种利用此类恢复模型的集合来规范化逆问题的新方法。 ShaRP 通过更好地处理结构化伪影并在即使没有完全采样数据的情况下也能进行自监督训练,改进了使用高斯去噪器先验的方法。 我们证明了 ShaRP 最小化了一个目标函数,该函数涉及从最小均方误差 (MMSE) 恢复算子的得分函数导出的正则化器,并从理论上分析了其收敛性。 经验表明,ShaRP 在磁共振成像重建和单图像超分辨率等任务上实现了最先进的性能,超过了基于去噪器和扩散模型的方法,而无需重新训练。||
|**2024-10-02**|[Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking](http://arxiv.org/abs/2410.01806)|null|在复杂场景(例如,协作舞蹈表演、团队运动或动态动物群体)中进行多目标跟踪提出了独特的挑战。在这些场景中,目标经常以协调的模式移动、相互遮挡并在其轨迹中表现出长期依赖性。然而,如何对轨迹内的长期依赖性、轨迹间的相互依赖性以及相关的时序遮挡进行建模仍然是一个关键的开放性研究问题。为此,我们引入了 Samba,这是一种新颖的线性时间序列集模型,旨在通过同步用于对每个轨迹建模的多个选择性状态空间来联合处理多个轨迹。Samba 自回归地预测每个序列的未来轨迹查询,同时保持跨轨迹同步的长期记忆表示。通过将 Samba 集成到逐传播跟踪框架中,我们提出了 SambaMOTR,这是第一个有效解决上述问题的跟踪器,包括长期依赖性、轨迹相互依赖性和时间遮挡。此外,我们介绍了一种处理不确定观察结果的有效技术 (MaskObs) 和一种有效的训练方法,以将 SambaMOTR 扩展到更长的序列。通过对跟踪对象之间的长期依赖性和交互进行建模,SambaMOTR 隐式地学习在没有任何手工启发式的情况下准确地跟踪遮挡下的对象。我们的方法在 DanceTrack、BFT 和 SportsMOT 数据集上显着优于先前最先进的方法。||
|**2024-10-02**|[Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking](http://arxiv.org/abs/2410.01678)|**[link](https://github.com/ayesha-ishaq/open3dtrack)**|三维多目标跟踪通过实时监控和预测多个物体的运动,在自动驾驶中发挥着至关重要的作用。传统的三维跟踪系统通常受到预定义物体类别的限制,限制了它们对动态环境中新出现的、未见过的物体的适应性。为了解决这一限制,我们引入了开放词汇三维跟踪,它将三维跟踪的范围扩展到预定义类别之外的物体。我们将开放词汇三维跟踪问题进行公式化,并引入了旨在表示各种开放词汇场景的数据集划分。我们提出了一种新方法,将开放词汇能力集成到三维跟踪框架中,从而能够泛化到未见过的物体类别。我们的方法通过策略性适应有效地减少了跟踪已知物体和新物体之间的性能差距。实验结果表明,我们的方法在各种室外驾驶场景中具有鲁棒性和适应性。据我们所知,这项工作是第一个解决开放词汇三维跟踪问题的,为现实世界中的自主系统带来了重大进步。代码、经过训练的模型和数据集划分均已公开发布。||
|**2024-09-30**|[NUTRIVISION: A System for Automatic Diet Management in Smart Healthcare](http://arxiv.org/abs/2409.20508)|null|通过均衡饮食保持健康和强健体魄对于预防心脏病、糖尿病和癌症等非传染性疾病至关重要。NutriVision 将智能医疗保健与计算机视觉和机器学习相结合,以应对营养和饮食管理方面的挑战。本文介绍了一种新颖的系统,该系统可以识别食物种类,估算数量,并提供全面的营养信息。NutriVision 采用了基于 Faster Region 的卷积神经网络,这是一种深度学习算法,通过生成区域提议(region proposals)并对这些区域进行分类来改进对象检测,使其即使在复杂和无序的膳食环境中也能高效、准确地识别食物。通过基于智能手机的图像捕捉,NutriVision 可以提供即时营养数据,包括宏量营养素分解、卡路里计数和微量营养素详细信息。NutriVision 的突出特点之一是其个性化的营养分析和饮食建议,这些建议是根据每个用户的饮食偏好、营养需求和健康史量身定制的。通过提供定制化的建议,NutriVision 帮助用户实现特定的健康和健身目标,例如管理饮食限制或控制体重。除了提供精确的食物检测和营养评估外,NutriVision 还通过将用户数据与促进均衡健康饮食的建议相结合,支持更明智的饮食决策。该系统为营养管理提供了一种实用且先进的解决方案,并有可能显著影响人们的饮食选择方式,促进更健康的饮食习惯和整体健康。本文讨论了 NutriVision 系统的设计、性能评估和未来应用。||
|**2024-09-30**|[POMONAG: Pareto-Optimal Many-Objective Neural Architecture Generator](http://arxiv.org/abs/2409.20447)|null|神经架构搜索 (NAS) 自动化了神经网络设计,减少了对人类专业知识的依赖。虽然 NAS 方法计算量大且依赖于特定数据集,但辅助预测器减少了需要训练的模型数量,从而缩短了搜索时间。此策略用于生成满足多个计算约束的架构。最近,可迁移 NAS 应运而生,将搜索过程从依赖于数据集推广到依赖于任务。在该领域,DiffusionNAG 是一种最先进的方法。这种基于扩散的方法简化了计算,生成针对未见数据集的准确性进行优化的架构,而无需进一步调整。然而,DiffusionNAG 只关注准确性,而忽略了其他关键目标,如模型复杂性、计算效率和推理延迟,这些因素对于在资源受限环境中部署模型至关重要。本文介绍了帕累托最优多目标神经架构生成器 (POMONAG),通过多目标扩散过程扩展了 DiffusionNAG。POMONAG 同时考虑准确性、参数数量、乘积累加运算 (MAC) 和推理延迟。它集成了性能预测器模型来估计这些指标并指导扩散梯度。POMONAG 的优化通过扩展其训练元数据集、应用帕累托前沿过滤和改进条件生成的嵌入来增强。这些增强功能使 POMONAG 能够生成在性能和效率方面优于先前技术的帕累托最优架构。结果在两个搜索空间(NASBench201 和 MobileNetV3)上得到验证,并在 15 个图像分类数据集上进行了评估。||
|**2024-09-30**|[Fine-Tuning Personalization in Federated Learning to Mitigate Adversarial Clients](http://arxiv.org/abs/2409.20329)|null|联邦学习 (FL) 是一种颇具吸引力的范式,它允许多台机器(也称为客户端)在保持数据本地化的同时进行集体学习。然而,由于客户端数据分布的异构性,使用联邦学习算法获得的模型在某些客户端的数据上可能表现不佳。个性化通过使每个客户端能够拥有针对自身数据定制的不同模型,同时受益于其他客户端的数据来解决这个问题。我们考虑了一种联邦学习设置,其中某些客户端可能是对抗性的,并且我们推导出完全协作失败的条件。具体来说,我们分析了在存在对抗性客户端的情况下插值个性化联邦学习框架的泛化性能,并精确地描述了完全协作的性能严格低于微调个性化的情况。我们的分析根据数据异构性和可容忍的对抗性客户端比例,确定了我们应该将协作程度降低多少。我们通过对均值估计和二元分类问题的实证结果来支持我们的发现,并考虑了合成和基准图像分类数据集。||
|**2024-09-30**|[Classroom-Inspired Multi-Mentor Distillation with Adaptive Learning Strategies](http://arxiv.org/abs/2409.20237)|null|我们提出了ClassroomKD,这是一个受课堂环境启发的新型多导师知识蒸馏框架,旨在增强学生和多个导师之间的知识转移。与依赖固定导师-学生关系的传统方法不同,我们的框架根据每个数据样本的有效性动态选择和调整不同导师的教学策略。ClassroomKD 包含两个主要模块:知识过滤 (KF) 模块和指导模块。KF 模块根据每个输入的表现对导师进行动态排名,仅激活高质量的导师,以最大程度地减少误差累积并防止信息丢失。指导模块通过根据学生和导师之间的表现差距调整每个导师的影响力来调整蒸馏策略,从而有效地调节学习进度。在图像分类(CIFAR-100 和 ImageNet)和二维人体姿态估计(COCO Keypoints 和 MPII Human Pose)方面的大量实验表明,ClassroomKD 明显优于现有的知识蒸馏方法。我们的结果表明,导师选择和指导的动态和自适应方法可以实现更有效的知识转移,从而通过蒸馏提高模型性能。||
|**2024-09-30**|[Training a Computer Vision Model for Commercial Bakeries with Primarily Synthetic Images](http://arxiv.org/abs/2409.20122)|null|在食品工业中,重新加工退回的产品是提高资源效率的重要步骤。[SBB23] 提出了一种人工智能应用程序,可以自动跟踪退回的圆面包。我们通过创建一个包含 2432 张图像和更广泛烘焙食品的扩展数据集来扩展他们的工作。为了提高模型的鲁棒性,我们使用生成模型 pix2pix 和 CycleGAN 来创建合成图像。我们在检测任务上训练了最先进的对象检测模型 YOLOv9 和 YOLOv8。我们总体表现最佳的模型在我们的测试集上实现了 90.3% 的平均精度 AP@0.5。||
|**2024-09-30**|[TSdetector: Temporal-Spatial Self-correction Collaborative Learning for Colonoscopy Video Detection](http://arxiv.org/abs/2409.19983)|null|基于CNN的目标检测模型在性能和速度之间取得了平衡,并逐渐应用于息肉检测任务。然而,由于现有方法忽略了两个关键问题:帧内序列分布异质性和精度-置信度差异,因此在复杂的结肠镜视频场景中准确定位息肉仍然具有挑战性。为了应对这些挑战,我们提出了一种新颖的时空自校正检测器(TSdetector),它首先整合了时间层面的一致性学习(consistency learning)和空间层面的可靠性学习(reliability learning)来持续检测目标。具体来说,我们首先提出了一种全局时间感知卷积,它汇集了先前的信息,以动态引导当前的卷积核关注序列之间的全局特征。此外,我们设计了一种层次队列集成机制,通过渐进累积的方式组合多时间特征,充分利用上下文一致性信息,同时保留长序列依赖特征。同时,在空间层面上,我们提出了一种位置感知聚类,以探索候选框之间的空间关系,从而自适应地重新校准预测置信度,从而有效地消除冗余边界框。在三个公开可用的息肉视频数据集上的实验结果表明,TSdetector 实现了最高的息肉检测率,并优于其他最先进的方法。代码可在 https://github.com/soleilssss/TSdetector 获取。||
|**2024-09-30**|[DAOcc: 3D Object Detection Assisted Multi-Sensor Fusion for 3D Occupancy Prediction](http://arxiv.org/abs/2409.19972)|**[link](https://github.com/alphaplustt/daocc)**|多传感器融合显著提高了三维语义占用预测的准确性和鲁棒性,这对于自动驾驶和机器人技术至关重要。然而,现有方法依赖于大图像分辨率和复杂网络来实现最佳性能,这阻碍了它们在实际场景中的应用。此外,大多数多传感器融合方法侧重于改进融合特征,而忽略了对这些特征的监督策略的探索。为此,我们提出了 DAOcc,一种新颖的多传感器融合占用网络,它利用 3D 目标检测监督来帮助实现卓越的性能,同时使用部署友好的图像特征提取网络和实用的输入图像分辨率。此外,我们引入了 BEV 视域扩展策略来减轻降低图像分辨率带来的不利影响。因此,我们的方法在使用 ResNet50 和 256x704 输入图像分辨率的 Occ3D-nuScenes 和 SurroundOcc 数据集上取得了新的最先进的结果。代码将在 https://github.com/AlphaPlusTT/DAOcc 上提供。||
|**2024-09-30**|[SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers](http://arxiv.org/abs/2409.19850)|null|在过去的几年里,视觉Transformer(ViT)在各种视觉识别任务中一直表现出卓越的性能。然而,增强其鲁棒性的尝试收效甚微,主要集中在不同的训练策略、输入patch增强或网络结构增强。这些方法通常涉及大量的训练和微调,既耗时又耗费资源。为了克服这些障碍,我们引入了一种名为空间自相关Token分析(SATA)的新方法。通过利用Token特征之间的空间关系,SATA增强了ViT模型的表示能力和鲁棒性。这是通过在输入到自注意力机制的前馈网络(FFN)块之前,根据空间自相关分数对Token进行分析和分组来实现的。重要的是,SATA可以无缝集成到现有的预训练ViT基线中,无需重新训练或额外的微调,同时通过减少FFN单元的计算负载来提高效率。实验结果表明,经过SATA增强的基线ViT不仅在ImageNet-1K图像分类上实现了新的最先进的top-1准确率(94.9%),而且在多个鲁棒性基准测试中也建立了新的最先进的性能,包括ImageNet-A(top-1=63.6%)、ImageNet-R(top-1=79.2%)和ImageNet-C(mCE=13.6%),所有这些都不需要对基线模型进行额外的训练或微调。||
|**2024-09-30**|[HazyDet: Open-source Benchmark for Drone-view Object Detection with Depth-cues in Hazy Scenes](http://arxiv.org/abs/2409.19833)|**[link](https://github.com/grokcv/hazydet)**|基于无人机的恶劣天气条件下的目标检测对于增强无人机的环境感知至关重要,但由于缺乏相关的基准测试,这方面在很大程度上仍未得到探索。为了弥合这一差距,我们引入了 HazyDet,这是一个专为无人机在雾霾场景中进行目标检测而设计的大规模数据集。它包含 383,000 个真实世界实例,这些实例是从自然雾霾环境和具有合成叠加雾霾效果的正常场景中收集的,以模拟恶劣的天气条件。通过观察不同深度和雾霾条件下目标尺度和清晰度的显著变化,我们设计了一种深度条件检测器 (DeCoDet),以结合这种先验知识。DeCoDet 具有多尺度深度感知检测头,可无缝集成深度感知,并通过动态深度条件核模块利用由此产生的深度线索。此外,我们提出了一种尺度不变的细化损失,以促进从伪标签中学习鲁棒的深度线索。在 HazyDet 数据集上的大量评估证明了我们方法的灵活性和有效性,产生了显著的性能提升。我们的数据集和工具包可在 https://github.com/GrokCV/HazyDet 获取。||
|**2024-09-29**|[Applying the Lower-Biased Teacher Model in Semi-Suepervised Object Detection](http://arxiv.org/abs/2409.19703)|null|我提出了低偏差教师模型,这是对无偏差教师模型的增强,专门针对半监督目标检测任务进行了定制。该模型的主要创新在于将定位损失集成到教师模型中,从而显着提高了伪标签生成的准确性。通过解决类别不平衡和边界框精度等关键问题,低偏差教师模型在目标检测任务中表现出优异的性能。在多个半监督目标检测数据集上的大量实验表明,低偏差教师模型不仅减少了由类别不平衡引起的伪标签偏差,而且还减少了由错误边界框引起的错误。因此,与现有方法相比,该模型实现了更高的mAP分数和更可靠的检测结果。这项研究强调了准确的伪标签生成的重要性,并为未来半监督学习在目标检测中的进步提供了一个强大的框架。||
|**2024-09-27**|[Spectral Wavelet Dropout: Regularization in the Wavelet Domain](http://arxiv.org/abs/2409.18951)|null|正则化技术有助于防止过拟合,从而提高卷积神经网络 (CNN) 的泛化能力。过拟合的原因之一是网络不同部分之间复杂的相互适应,这使得 CNN 依赖于它们的联合响应,而不是鼓励每个部分独立学习有用的特征表示。频域处理是一种强大的策略,它利用频率分解来修改具有时间和空间一致性的数据。这项工作介绍了一种新颖的正则化方法——谱小波丢弃 (SWD),它包括两种变体:1D-SWD 和 2D-SWD。这些变体通过随机丢弃特征图的离散小波分解中的详细频带,从而提高 CNN 的泛化能力。我们的方法区别于预先存在的谱“傅立叶”丢弃 (2D-SFD),后者消除了傅立叶域中的系数。值得注意的是,SWD 只需要一个超参数,不像 SFD 需要两个。我们还通过实现一维版本的谱“傅立叶”丢弃 (1D-SFD) 来扩展文献,为全面比较奠定了基础。我们的评估表明,相对于 1D-SFD 和 2D-SFD,1D 和 2D SWD 变体在 CIFAR-10/100 基准测试中均具有竞争力的性能。具体来说,与 1D/2D-SFD 相比,1D-SWD 具有显著更低的计算复杂度。在 Pascal VOC 目标检测基准测试中,SWD 变体的性能优于 1D-SFD 和 2D-SFD,并且在训练期间表现出更低的计算复杂度。||
|**2024-09-27**|[Unconditional stability of a recurrent neural circuit implementing divisive normalization](http://arxiv.org/abs/2409.18946)|**[link](https://github.com/martiniani-lab/dynamic-divisive-norm)**|递归神经模型的稳定性是一个重大挑战,特别是在开发可以无缝训练的生物学上合理的神经动力学(neurodynamical)模型方面。传统的皮质回路模型由于动力系统中存在广泛的非线性,因此难以训练,导致优化问题具有难以施加的非线性稳定性约束。相反,递归神经网络 (RNN) 在涉及序列数据的任务中表现出色,但缺乏生物学上的合理性和可解释性。在这项工作中,我们通过将动态除法归一化 (DN) 与 ORGaNICs 的稳定性联系起来来解决这些挑战,ORGaNICs 是一种生物学上合理的递归皮质回路模型,它可以动态地实现 DN,并且已被证明可以模拟广泛的神经生理学现象。通过使用 Lyapunov 的间接方法,我们证明了当递归权重矩阵是单位矩阵时,任意维度的 ORGaNICs 电路具有无条件局部稳定性的显著特性。因此,我们将 ORGaNICs 连接到一个耦合阻尼谐振子的系统,这使我们能够推导出电路的能量函数,从而提供电路和单个神经元旨在实现的目标的规范原则。此外,对于一般的递归权重矩阵,我们证明了二维模型的稳定性,并通过经验证明了稳定性在更高维度上成立。最后,我们表明 ORGaNICs 可以通过时间反向传播进行训练,而无需梯度裁剪/缩放,这得益于其内在的稳定性特性和自适应时间常数,解决了梯度爆炸、消失和振荡的问题。通过评估模型在 RNN 基准测试中的性能,我们发现 ORGaNICs 在静态图像分类任务上优于其他神经动力学模型,并且在序列任务上的性能与 LSTM 相当。||
|**2024-09-27**|[Subspace Preserving Quantum Convolutional Neural Network Architectures](http://arxiv.org/abs/2409.18918)|null|子空间保持量子电路是一类量子算法,它依赖于计算中的某些对称性,可以为其训练提供理论上的保证。这些算法之所以受到广泛关注,是因为它们可以提供多项式加速,并且可以用来模拟经典的机器学习算法。在这项工作中,我们提出了一种基于汉明重量保持量子电路的新型卷积神经网络架构模型。特别是,我们引入了卷积层和基于测量的池化层,它们在保持量子态对称性的同时,使用非子空间保持的门来实现非线性。与经典的深度学习架构相比,我们的方案在多项式运行时间上具有显著的优势。我们提供了一个用于汉明重量保持量子电路的开源仿真库,可以使用面向GPU的库更有效地仿真我们的技术。使用此代码,我们提供了一些架构示例,这些示例突出了在量子比特数量有限且参数少于经典深度学习架构的情况下,在复杂图像分类任务上的出色性能。||
|**2024-09-27**|[MCUBench: A Benchmark of Tiny Object Detectors on MCUs](http://arxiv.org/abs/2409.18866)|**[link](https://github.com/deeplite/deeplite-torch-zoo)**|我们推出了 MCUBench,这是一个基准测试平台,涵盖了 100 多个基于 YOLO 的目标检测模型,这些模型在 VOC 数据集上针对七种不同的 MCU 进行了评估。该基准测试平台提供了各种输入分辨率和基于 YOLO 的单阶段检测器的平均精度、延迟、RAM 和 Flash 使用情况的详细信息。通过使用固定的训练流程进行受控比较,我们收集了全面的性能指标。我们的帕累托最优分析表明,集成现代检测头和训练技术可以让各种 YOLO 架构(包括 YOLOv3 等传统模型)在平均精度 (mAP) 和延迟之间实现高效的权衡。MCUBench 是一个有价值的工具,可用于对当代目标检测器的 MCU 性能进行基准测试,并根据特定限制条件帮助进行模型选择。||
|**2024-09-27**|[A Novel Unified Architecture for Low-Shot Counting by Detection and Segmentation](http://arxiv.org/abs/2409.18686)|null|少样本目标计数器可以使用少量甚至没有标注样本估计图像中的目标数量。目标定位通过将目标与原型进行匹配来实现,原型是通过对图像范围内的目标外观进行无监督聚合构建的。由于目标外观可能存在多样性,现有方法通常会导致过度泛化和误报。此外,性能最佳的方法通过预测每个目标中心的单位高斯分布的代理损失来训练目标定位。这种损失对标注误差和超参数很敏感,并且没有直接优化检测任务,导致计数结果欠佳。我们引入了GeCo,这是一种新颖的少样本计数器,可以在统一的架构中实现准确的目标检测、分割和计数估计。GeCo 通过一种新颖的密集目标查询公式,可以稳健地泛化不同目标外观的原型。此外,我们还提出了一种新的计数损失,它直接优化检测任务,避免了标准代理损失的问题。GeCo 在总计数平均绝对误差方面比领先的基于少样本检测的计数器高出约 25%,实现了卓越的检测精度,并在所有少样本计数设置中都树立了新的最先进的结果。||
|**2024-09-27**|[Query matching for spatio-temporal action detection with query-based object detector](http://arxiv.org/abs/2409.18408)|null|本文提出了一种扩展基于查询的目标检测模型DETR的方法,将其应用于时空动作检测,该任务需要在视频中保持时间一致性。我们提出的方法将DETR应用于每一帧,并使用特征偏移来整合时间信息。然而,每帧中DETR的对象查询可能对应于不同的对象,使得简单的特征偏移无效。为了克服这个问题,我们提出了跨不同帧的查询匹配,确保对同一对象的查询能够匹配并用于特征偏移。实验结果表明,当使用所提出的查询匹配对查询特征进行偏移时,JHMDB21数据集上的性能显著提高。||
|**2024-09-27**|[Simpler Gradient Methods for Blind Super-Resolution with Lower Iteration Complexity](http://arxiv.org/abs/2409.18387)|**[link](https://github.com/Jinshengg/SimplerGDs-VHL)**|我们研究了盲超分辨率问题,它可以通过向量化汉克尔提升(VHL)公式化为一个低秩矩阵恢复问题。先前基于VHL的名为PGD-VHL的梯度下降方法依赖于额外的正则化,例如投影和平衡惩罚,表现出次优的迭代复杂度。在本文中,我们提出了一个更简单的无约束优化问题,无需上述两种类型的正则化,并开发了两种新的可证梯度方法,分别名为VGD-VHL和ScalGD-VHL。我们为算法的理论保证提供了新颖而清晰的分析,证明了我们的方法比PGD-VHL具有更低的迭代复杂度。此外,ScalGD-VHL具有最低的迭代复杂度,同时与条件数无关。此外,我们的新分析表明,盲超分辨率问题对不相干性的要求较低,从而无需不相干投影即可实现线性收敛。实验结果表明,我们的方法在实现与现有技术相当的恢复性能的同时,还具有更高的计算效率。||
|**2024-09-26**|[Realistic Evaluation of Model Merging for Compositional Generalization](http://arxiv.org/abs/2409.18314)|**[link](https://github.com/r-three/realistic_evaluation_of_model_merging_for_compositional_generalization)**|模型融合已成为一种广泛使用的方法,可以将单个模型廉价地组合成一个模型,该模型继承了它们的性能并获得了更好的性能。这种流行促进了许多新融合方法的快速发展,这些方法通常在不同的实验环境中得到验证,并且经常在对模型架构、数据可用性和计算预算做出的假设方面有所不同。在这项工作中,我们通过在共享实验环境中评估不同的融合方法并精确识别每种方法的实际要求,来描述它们的相对优点。具体来说,我们的设置侧重于使用融合来实现图像分类、图像生成和自然语言处理中功能的组合泛化。此外,我们还测量了不同融合方法的计算成本,以及它们在扩展融合模型数量时的性能。总的来说,我们的结果阐明了模型融合领域的现状,并提供了一个全面而严谨的实验设置来测试新方法。||
|**2024-09-26**|[Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing](http://arxiv.org/abs/2409.18286)|null|本研究旨在全面回顾和实证评估多模态大型语言模型 (MLLM) 和大型视觉模型 (VLM) 在交通系统目标检测中的应用。首先,我们介绍了 MLLM 在交通应用中的潜在优势,并对以往研究中现有的 MLLM 技术进行了全面回顾。我们重点介绍了它们在各种交通场景下目标检测的有效性和局限性。其次,我们概述了交通应用中端到端目标检测的分类以及未来方向。在此基础上,我们提出了实证分析,在三个现实交通问题上测试 MLLM,这些问题包括目标检测任务,即道路安全属性提取、安全关键事件检测和热图像视觉推理。我们的研究结果提供了对 MLLM 性能的详细评估,揭示了其优势和需要改进的方面。最后,我们讨论了 MLLM 在增强交通目标检测方面的实际局限性和挑战,从而为该关键领域的未来研究和开发提供了路线图。||
|**2024-09-26**|[DARE: Diverse Visual Question Answering with Robustness Evaluation](http://arxiv.org/abs/2409.18023)|null|视觉语言模型 (VLM) 扩展了仅文本大型语言模型和仅视觉模型的卓越能力,并且能够从多模态视觉文本输入中学习和处理。 虽然现代 VLM 在许多标准图像分类和图像文本匹配任务中表现良好,但它们仍然难以应对许多关键的视觉语言 (VL) 推理能力,例如计数和空间推理。 此外,虽然它们可能对指令和/或评估协议的微小变化非常脆弱,但现有基准测试未能评估它们的稳健性(或者更确切地说是缺乏稳健性)。 为了将具有挑战性的 VL 场景与全面的稳健性评估相结合,我们引入了 DARE,即具有稳健性评估的多样化视觉问答,这是一个精心创建和策划的多项选择 VQA 基准测试。 DARE 评估 VLM 在五个不同类别上的性能,并包括四个基于以下变化的稳健性评估:提示、答案选项子集、输出格式和正确答案的数量。 在其他一系列发现中,我们报告说,最先进的 VLM 仍然难以回答大多数类别的问题,并且无法在测试的稳健性评估中始终如一地提供其峰值性能。 选项子集的最坏情况性能比标准情况下的性能低 34%。 LLaVA 1.6 和 Idefics2 等开源 VLM 的稳健性无法与 GPT-4 和 Gemini 等闭源模型相提并论,但即使是后者仍然非常容易受到不同变化的影响。||
|**2024-09-26**|[A New Dataset for Monocular Depth Estimation Under Viewpoint Shifts](http://arxiv.org/abs/2409.17851)|null|单目深度估计是自动驾驶和许多其他计算机视觉应用的关键任务。虽然该领域已经取得了重大进展,但视角变化对深度估计模型的影响在很大程度上仍未得到充分探索。本文介绍了一种新的数据集和评估方法,用于量化不同相机位置和方向对单目深度估计性能的影响。我们提出了一种基于单应性估计和目标检测的真值策略,无需昂贵的激光雷达传感器。我们从多个视点收集了道路场景的多样化数据集,并用它来评估现代深度估计模型对几何偏移的鲁棒性。在公共数据集上评估了我们策略的有效性后,我们提供了对当前模型局限性的宝贵见解,并强调了在实际应用中考虑视点变化的重要性。||
|**2024-09-26**|[Cascade Prompt Learning for Vision-Language Model Adaptation](http://arxiv.org/abs/2409.17805)|**[link](https://github.com/megvii-research/caspl)**|提示学习已成为一种有效的方法,可以提高视觉语言模型(VLM)在下游任务中的性能,例如CLIP。然而,当前可学习的提示标记主要用于适应任务的单一阶段(即,调整提示),容易导致过拟合风险。在这项工作中,我们提出了一种新颖的级联提示学习CasPL框架,使提示学习能够同时服务于通用和特定专业知识(即,增强和调整提示)。具体来说,CasPL是一种新的学习范式,包括两个不同阶段的可学习提示:第一个增强提示旨在通过使用大量未标记的域图像对齐其预测的logits,从高级更大的CLIP教师模型中提取域一般知识。然后,第二个调整提示与冻结的第一组级联,以微调下游任务,遵循先前研究中采用的方法。通过这种方式,CasPL可以有效地将域一般表示和任务特定表示捕获到明确不同的渐进提示组中,从而潜在地缓解目标域中的过拟合问题。值得注意的是,CasPL是一个即插即用模块,可以无缝集成到任何现有的提示学习方法中。CasPL在性能和推理速度之间取得了显著更好的平衡,这对于在资源受限的环境中部署较小的VLM模型尤其有利。与之前的最先进方法PromptSRC相比,CasPL在11个图像分类数据集上,基础类的平均改进率为1.85%,新类的平均改进率为3.44%,调和平均值的平均改进率为2.72%。代码公开地址:https://github.com/megvii-research/CasPL。||
|**2024-09-26**|[Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs](http://arxiv.org/abs/2409.17778)|**[link](https://github.com/qinpengcui/dossr)**|基于扩散的图像超分辨率 (SR) 模型因其强大的图像恢复能力而引起了广泛关注。然而,现有的扩散模型通常难以在效率和性能之间取得最佳平衡。它们通常要么忽略了利用现有大量预训练模型的潜力,限制了其生成能力,要么需要从随机噪声开始进行数十次前向传递,从而降低了推理效率。在本文中,我们提出了 DoSSR,一种基于域迁移扩散的 SR 模型,它利用预训练扩散模型的生成能力,并通过以低分辨率 (LR) 图像初始化扩散过程来显著提高效率。我们方法的核心是一个与现有扩散模型无缝集成的域迁移方程。这种集成不仅提高了扩散先验的利用,还提高了推理效率。此外,我们通过将离散迁移过程转换为连续公式(称为 DoS-SDE)来推进我们的方法。这一进步带来了快速且定制化的求解器,进一步提高了采样效率。实验结果表明,我们提出的方法在合成数据集和真实世界数据集上均达到了最先进的性能,同时仅需 5 个采样步骤。与之前基于扩散先验的方法相比,我们的方法实现了 5-7 倍的显著加速,证明了其卓越的效率。代码:https://github.com/QinpengCui/DoSSR。||
|**2024-09-26**|[LGFN: Lightweight Light Field Image Super-Resolution using Local Convolution Modulation and Global Attention Feature Extraction](http://arxiv.org/abs/2409.17759)|null|光场(LF)能够将三维场景信息编码成四维光场图像,在诸如后期重聚焦和深度感知等领域有着广泛的应用。光场图像超分辨率(SR)旨在提升受限于光场相机传感器性能的图像分辨率。尽管现有方法已经取得了可喜的成果,但由于模型不够轻量化,限制了其实际应用。本文提出了一种名为LGFN的轻量级模型,它集成了不同视角的局部和全局特征以及不同通道的特征,用于光场图像超分辨率。具体来说,由于不同子孔径图像中相同像素位置的相邻区域表现出相似的结构关系,我们设计了一个基于轻量级CNN的特征提取模块(DGCE),通过特征调制更好地提取局部特征。同时,由于光场图像中超出边界的像素位置存在较大差异,我们提出了一个高效的空间注意力模块(ESAM),它使用可分解的大核卷积来获得更大的感受野,以及一个高效的通道注意力模块(ECAM)。与现有参数量大的光场图像超分辨率模型相比,我们的模型参数量为0.45M,FLOPs为19.33G,取得了具有竞争力的效果。大量的消融实验验证了我们提出的方法的有效性,在NTIRE2024光场超分辨率挑战赛的Track 2保真度和效率赛道中排名第二,在Track 1保真度赛道中排名第七。||
|**2024-09-26**|[Scene Understanding in Pick-and-Place Tasks: Analyzing Transformations Between Initial and Final Scenes](http://arxiv.org/abs/2409.17720)|null|随着机器人在日常任务中越来越多地与人类合作,采取措施使机器人系统能够理解环境变得至关重要。这项工作侧重于场景理解,以根据场景的初始图像和最终图像检测拾取和放置任务。为此,我们收集了一个用于目标检测和拾取放置任务检测的数据集。随后训练了一个 YOLOv5 网络来检测初始场景和最终场景中的目标。给定检测到的目标及其边界框,我们提出了两种方法来检测将初始场景转换为最终场景的拾取和放置任务。一种是几何方法,它跟踪目标在两个场景中的运动,并根据场景内移动的边界框的交集进行工作。相反,基于 CNN 的方法利用卷积神经网络将具有相交边界框的目标分类为 5 类,显示相关目标之间的空间关系。然后,通过分析包含这两个场景的实验,得出执行的拾取和放置任务。结果表明,在某些场景下,使用 VGG16 骨干网络的基于 CNN 的方法的成功率比几何方法高出约 12 个百分点,总体成功率为 84.3%。||
|**2024-09-26**|[Unifying Dimensions: A Linear Adaptive Approach to Lightweight Image Super-Resolution](http://arxiv.org/abs/2409.17597)|null|Window-based transformers have demonstrated outstanding performance in super-resolution tasks thanks to their adaptive modeling capability via local self-attention (SA). However, they exhibit higher computational complexity and inference latency than convolutional neural networks. In this paper, we first identify that the adaptability of transformers derives from their adaptive spatial aggregation and advanced structural design, while their high latency stems from the computational cost and memory-layout transformations associated with local SA. To emulate this aggregation approach, we propose an effective convolution-based linear focal separable attention (FSA) mechanism, which allows long-range dynamic modeling with linear complexity. Additionally, we introduce an effective dual-branch structure combined with an ultra-lightweight information exchange module (IEM) to enhance information aggregation by the token mixer. Finally, on the structural side, we modify the existing spatial-gate-based feed-forward network by incorporating a self-gating mechanism to preserve high-dimensional channel information, enabling the modeling of more complex relationships. Based on these improvements, we construct a convolution-based transformer framework named the Linear Adaptive Mixer Network (LAMNet). Extensive experiments demonstrate that LAMNet achieves better performance than existing SA-based transformer methods while maintaining the computational efficiency of convolutional neural networks, with up to a \(3\times\) speedup in inference time. The code will be publicly available at: https://github.com/zononhzy/LAMNet.||
|**2024-09-26**|[Let the Quantum Creep In: Designing Quantum Neural Network Models by Gradually Swapping Out Classical Components](http://arxiv.org/abs/2409.17583)|**[link](https://github.com/peiyong-addwater/let-the-quantum-creep-in)**|Artificial intelligence (AI), with its multiplier effect and wide applications across multiple domains, could be an important application area for quantum computing. Since modern AI systems are often built on neural networks, the design of quantum neural networks becomes a key challenge in integrating quantum computing into AI. To provide a more fine-grained characterization of the impact of quantum components on the performance of neural networks, we propose a framework in which classical neural network layers are gradually replaced by quantum layers that have the same type of input and output while keeping the flow of information between layers unchanged, in contrast to most current research on quantum neural networks, which favors end-to-end quantum models. We start with a simple three-layer classical neural network without any normalization layers or activation functions, and gradually change the classical layers to their corresponding quantum versions. We conduct numerical experiments on image classification datasets such as MNIST, FashionMNIST, and CIFAR-10 to demonstrate the change of performance brought by the systematic introduction of quantum components. Through this framework, our research provides new perspectives for the design of future quantum neural network models, for which it may be more advantageous to search for methods and frameworks that harness the strengths of both the classical and quantum worlds.||
|**2024-09-26**|[General Compression Framework for Efficient Transformer Object Tracking](http://arxiv.org/abs/2409.17564)|null|Transformer-based trackers dominate the field of visual object tracking. While these trackers exhibit promising performance, their deployment on resource-constrained devices remains challenging due to inefficiency. To improve inference efficiency and reduce computational cost, prior approaches have aimed to design lightweight trackers or distill knowledge from larger teacher models into more compact student models. However, these solutions often sacrifice accuracy for speed. We therefore propose CompressTracker, a general model compression framework for efficient transformer object tracking, which compresses a pretrained tracking model into a lightweight tracker with minimal performance degradation. Our approach features a novel stage-division strategy that divides the transformer layers of the teacher model into distinct stages, enabling the student model to emulate each corresponding teacher stage more effectively. Additionally, we design a unique replacement training technique that randomly substitutes specific stages in the student model with the corresponding stages from the teacher model, rather than training the student model in isolation. Replacement training enhances the student model's ability to replicate the teacher model's behavior. To further push the student model to emulate the teacher, we introduce prediction guidance and stage-wise feature mimicking to provide additional supervision during the compression of the teacher model. Our framework CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiments to verify the effectiveness and generalizability of CompressTracker. Our CompressTracker-4, with 4 transformer layers compressed from OSTrack, retains about 96% of the performance on LaSOT (66.1% AUC) while achieving a 2.17x speedup.||
|**2024-09-26**|[CAMOT: Camera Angle-aware Multi-Object Tracking](http://arxiv.org/abs/2409.17533)|null|This paper proposes CAMOT, a simple camera-angle estimator for multi-object tracking that addresses two problems: 1) occlusion and 2) inaccurate distance estimation in the depth direction. Assuming that the multiple objects in each video frame lie on a flat plane, CAMOT estimates the camera angle using object detection. In addition, it provides the depth of each object, enabling pseudo-3D MOT. We evaluate its performance by adding it to various 2D MOT methods on the MOT17 and MOT20 datasets and confirm its effectiveness. Applying CAMOT to ByteTrack, we obtain 63.8% HOTA, 80.6% MOTA, and 78.5% IDF1 on MOT17, which are state-of-the-art results. Its computational cost is significantly lower than that of existing deep-learning-based depth estimators for tracking.||
|**2024-09-18**|[Applications of Knowledge Distillation in Remote Sensing: A Survey](http://arxiv.org/abs/2409.12111)|null|With the ever-growing complexity of models in the field of remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling the transfer of knowledge from large, complex models to smaller, more efficient ones without significant loss in performance. This review article extensively examines KD and its innovative applications in RS. KD, a technique for transferring knowledge from a complex, often cumbersome model (teacher) to a more compact, efficient model (student), has seen significant development and application across various domains. First, we introduce the fundamental concepts and historical progression of KD methods. The article highlights the advantages of adopting KD, particularly in model compression, enhanced computational efficiency, and improved performance, which are pivotal for practical deployment in RS scenarios. The article provides a comprehensive taxonomy of KD techniques, in which each category is critically analyzed to demonstrate the breadth and depth of the alternatives, and illustrates the practical application of KD methods to RS tasks, such as instance segmentation and object detection, through concrete case studies. Furthermore, the review discusses the challenges and limitations of KD in RS, including practical constraints and future directions, providing a comprehensive overview for researchers and practitioners in the field. Through this organization, the paper not only elucidates the current state of research on KD but also sets the stage for future research directions, thereby making a significant contribution to both academic research and real-world applications.||
|**2024-09-18**|[Unraveling the Hessian: A Key to Smooth Convergence in Loss Function Landscapes](http://arxiv.org/abs/2409.11995)|**[link](https://github.com/kisnikser/landscape-hessian)**|The loss landscape of neural networks is a critical aspect of their training, and understanding its properties is essential for improving their performance. In this paper, we investigate how the loss surface changes as the sample size increases, a previously unexplored issue. We theoretically analyze the convergence of the loss landscape in a fully connected neural network and derive upper bounds for the difference in loss function values when adding a new object to the sample. Our empirical study confirms these results on various datasets, demonstrating the convergence of the loss function surface for image classification tasks. Our findings provide insights into the local geometry of neural loss landscapes and have implications for the development of sample size determination techniques.||
|**2024-09-18**|[Agglomerative Token Clustering](http://arxiv.org/abs/2409.11923)|null|We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection and segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without introducing extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even rival the prior state of the art when applied off-the-shelf, i.e., without fine-tuning. ATC is particularly effective at low keep rates, where only a small fraction of tokens is retained and maintaining task performance is especially difficult.||
|**2024-09-18**|[Distillation-free Scaling of Large SSMs for Images and Videos](http://arxiv.org/abs/2409.11867)|null|State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.||
|**2024-09-18**|[RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework](http://arxiv.org/abs/2409.11749)|null|3D multi-object tracking (MOT) has gained significant performance improvements with the rapid advances in 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers results in detector-specific models, limiting their versatility. Moreover, current generic trackers overlook the unique features of multi-camera detectors, i.e., the unreliability of motion observations and the availability of visual information. To address these challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors. Following the tracking-by-detection framework, RockTrack is compatible with various off-the-shelf detectors. RockTrack incorporates a confidence-guided preprocessing module to extract reliable motion and image observations from the distinct representation spaces of a single detector. These observations are then fused in an association module that leverages geometric and appearance cues to minimize mismatches. The resulting matches are propagated through a staged estimation process, forming the basis for heuristic noise modeling. Additionally, we introduce a novel appearance similarity metric to explicitly characterize object affinities in multi-camera settings. RockTrack achieves state-of-the-art performance on the nuScenes vision-only tracking leaderboard with 59.1% AMOTA while demonstrating impressive computational efficiency.||
|**2024-09-18**|[Few-Shot Learning Approach on Tuberculosis Classification Based on Chest X-Ray Images](http://arxiv.org/abs/2409.11644)|null|Tuberculosis (TB) is caused by the bacterium Mycobacterium tuberculosis, primarily affecting the lungs. Early detection is crucial for improving treatment effectiveness and reducing transmission risk. Artificial intelligence (AI), particularly through image classification of chest X-rays, can assist in TB detection. However, class imbalance in TB chest X-ray datasets presents a challenge for accurate classification. In this paper, we propose a few-shot learning (FSL) approach using the Prototypical Network algorithm to address this issue. We compare the performance of ResNet-18, ResNet-50, and VGG16 in feature extraction from the TBX11K Chest X-ray dataset. Experimental results demonstrate classification accuracies of 98.93% for ResNet-18, 98.60% for ResNet-50, and 33.33% for VGG16. These findings indicate that the proposed method outperforms others in mitigating data imbalance, which is particularly beneficial for disease classification applications.||
|**2024-09-17**|[VALO: A Versatile Anytime Framework for LiDAR-based Object Detection Deep Neural Networks](http://arxiv.org/abs/2409.11542)|**[link](https://github.com/csl-ku/valo)**|This work addresses the challenge of adapting dynamic deadline requirements for LiDAR object detection deep neural networks (DNNs). The computing latency of object detection is critically important to ensure safe and efficient navigation. However, state-of-the-art LiDAR object detection DNNs often exhibit significant latency, hindering their real-time performance on resource-constrained edge platforms. Therefore, a tradeoff between detection accuracy and latency should be dynamically managed at runtime to achieve optimum results. In this paper, we introduce VALO (Versatile Anytime algorithm for LiDAR Object detection), a novel data-centric approach that enables anytime computing of 3D LiDAR object detection DNNs. VALO employs a deadline-aware scheduler to selectively process input regions, making execution time and accuracy tradeoffs without architectural modifications. Additionally, it leverages efficient forecasting of past detection results to mitigate possible loss of accuracy due to partial processing of input. Finally, it utilizes a novel input reduction technique within its detection heads to significantly accelerate execution without sacrificing accuracy. We implement VALO on state-of-the-art 3D LiDAR object detection networks, namely CenterPoint and VoxelNext, and demonstrate its dynamic adaptability to a wide range of time constraints while achieving higher accuracy than the prior state-of-the-art. Code is available at https://github.com/CSL-KU/VALO.||
|**2024-09-17**|[Enhancing the Reliability of LiDAR Point Cloud Sampling: A Colorization and Super-Resolution Approach Based on LiDAR-Generated Images](http://arxiv.org/abs/2409.11532)|null|In recent years, Light Detection and Ranging (LiDAR) technology, a critical sensor in robotics and autonomous systems, has seen significant advancements. These improvements include enhanced resolution of point clouds and the capability to provide 360{\deg} low-resolution images. These images encode various data such as depth, reflectivity, and near-infrared light within the pixels. However, an excessive density of points and conventional point cloud sampling can be counterproductive, particularly in applications such as LiDAR odometry, where misleading points and degraded geometry information may induce drift errors. Currently, extensive research efforts are being directed towards leveraging LiDAR-generated images to improve situational awareness. This paper presents a comprehensive review of current deep learning (DL) techniques, including colorization and super-resolution, which are traditionally utilized in conventional computer vision tasks. These techniques are applied to LiDAR-generated images and are analyzed qualitatively. Based on this analysis, we have developed a novel approach that selectively integrates the most suited colorization and super-resolution methods with LiDAR imagery to sample reliable points from the LiDAR point cloud. This approach aims to not only improve the accuracy of point cloud registration but also avoid mismatching caused by lacking geometry information, thereby augmenting the utility and precision of LiDAR systems in practical applications. In our evaluation, the proposed approach demonstrates superior performance compared to our previous work, achieving lower translation and rotation errors with a reduced number of points.||
|**2024-09-19**|[Super Resolution On Global Weather Forecasts](http://arxiv.org/abs/2409.11502)|null|Weather forecasting is a vitally important tool for tasks ranging from planning day to day activities to disaster response planning. However, modeling weather has proven to be challenging task due to its chaotic and unpredictable nature. Each variable, from temperature to precipitation to wind, all influence the path the environment will take. As a result, all models tend to rapidly lose accuracy as the temporal range of their forecasts increase. Classical forecasting methods use a myriad of physics-based, numerical, and stochastic techniques to predict the change in weather variables over time. However, such forecasts often require a very large amount of data and are extremely computationally expensive. Furthermore, as climate and global weather patterns change, classical models are substantially more difficult and time-consuming to update for changing environments. Fortunately, with recent advances in deep learning and publicly available high quality weather datasets, deploying learning methods for estimating these complex systems has become feasible. The current state-of-the-art deep learning models have comparable accuracy to the industry standard numerical models and are becoming more ubiquitous in practice due to their adaptability. Our group seeks to improve upon existing deep learning based forecasting methods by increasing spatial resolutions of global weather predictions. Specifically, we are interested in performing super resolution (SR) on GraphCast temperature predictions by increasing the global precision from 1 degree of accuracy to 0.5 degrees, which is approximately 111km and 55km respectively.||
|**2024-09-17**|[SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking](http://arxiv.org/abs/2409.11235)|**[link](https://github.com/siyuanliii/slack)**|Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories not in the training set. Currently, the best-performing methods are mainly based on pure appearance matching. Due to the complexity of motion patterns in the large-vocabulary scenarios and unstable classification of the novel objects, the motion and semantics cues are either ignored or applied based on heuristics in the final matching steps by existing methods. In this paper, we present a unified framework SLAck that jointly considers semantics, location, and appearance priors in the early steps of association and learns how to integrate all valuable information through a lightweight spatial and temporal object graph. Our method eliminates complex post-processing heuristics for fusing different cues and boosts the association performance significantly for large-scale open-vocabulary tracking. Without bells and whistles, we outperform previous state-of-the-art methods for novel classes tracking on the open-vocabulary MOT and TAO TETA benchmarks. Our code is available at https://github.com/siyuanliii/SLAck.||
|**2024-09-17**|[STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking](http://arxiv.org/abs/2409.11234)|**[link](https://github.com/ydhcg-bobo/stcmot)**|Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise matching of target reidentification (ReID). These methods focus on optimizing target spatial attributes while overlooking temporal cues in modelling object relationships, especially for challenging tracking conditions such as object deformation and blurring, etc. To address the above-mentioned issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in a sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embedding based on adjacent frame cooperation. While the trajectory embedding is then propagated by a temporal detection refinement module to mine salient target locations in the temporal field. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate our STCMOT sets a new state-of-the-art performance in MOTA and IDF1 metrics. The source codes are released at https://github.com/ydhcg-BoBo/STCMOT.||
|**2024-09-17**|[Vision foundation models: can they be applied to astrophysics data?](http://arxiv.org/abs/2409.11175)|**[link](https://github.com/elastufka/fm4astro)**|Vision foundation models, which have demonstrated significant potential in many multimedia applications, are often underutilized in the natural sciences. This is primarily due to mismatches between the nature of domain-specific scientific data and the typical training data used for foundation models, leading to distribution shifts. Scientific data often differ substantially in structure and characteristics; researchers frequently face the challenge of optimizing model performance with limited labeled data of only a few hundred or thousand images. To adapt foundation models effectively requires customized approaches in preprocessing, data augmentation, and training techniques. Additionally, each vision foundation model exhibits unique strengths and limitations, influenced by differences in architecture, training procedures, and the datasets used for training. In this work, we evaluate the application of various vision foundation models to astrophysics data, specifically images from optical and radio astronomy. Our results show that using features extracted by specific foundation models improves the classification accuracy of optical galaxy images compared to conventional supervised training. Similarly, these models achieve equivalent or better performance in object detection tasks with radio images. However, their performance in classifying radio galaxy images is generally poor and often inferior to traditional supervised training results. These findings suggest that selecting suitable vision foundation models for astrophysics applications requires careful consideration of the model characteristics and alignment with the specific requirements of the downstream tasks.||
|**2024-09-17**|[Unleashing the Potential of Mamba: Boosting a LiDAR 3D Sparse Detector by Using Cross-Model Knowledge Distillation](http://arxiv.org/abs/2409.11018)|null|The LiDAR-based 3D object detector that strikes a balance between accuracy and speed is crucial for achieving real-time perception in autonomous driving and robotic navigation systems. To enhance the accuracy of point cloud detection, integrating global context for visual understanding improves the point clouds ability to grasp overall spatial information. However, many existing LiDAR detection models depend on intricate feature transformation and extraction processes, leading to poor real-time performance and high resource consumption, which limits their practical effectiveness. In this work, we propose a Faster LiDAR 3D object detection framework, called FASD, which implements heterogeneous model distillation by adaptively uniform cross-model voxel features. We aim to distill the transformer's capacity for high-performance sequence modeling into Mamba models with low FLOPs, achieving a significant improvement in accuracy through knowledge transfer. Specifically, Dynamic Voxel Group and Adaptive Attention strategies are integrated into the sparse backbone, creating a robust teacher model with scale-adaptive attention for effective global visual context modeling. Following feature alignment with the Adapter, we transfer knowledge from the Transformer to the Mamba through latent space feature supervision and span-head distillation, resulting in improved performance and an efficient student model. We evaluated the framework on the Waymo and nuScenes datasets, achieving a 4x reduction in resource consumption and a 1-2\% performance improvement over the current SoTA methods.||
|**2024-09-17**|[TrajSSL: Trajectory-Enhanced Semi-Supervised 3D Object Detection](http://arxiv.org/abs/2409.10901)|null|Semi-supervised 3D object detection is a common strategy employed to circumvent the challenge of manually labeling large-scale autonomous driving perception datasets. Pseudo-labeling approaches to semi-supervised learning adopt a teacher-student framework in which machine-generated pseudo-labels on a large unlabeled dataset are used in combination with a small manually-labeled dataset for training. In this work, we address the problem of improving pseudo-label quality through leveraging long-term temporal information captured in driving scenes. More specifically, we leverage pre-trained motion-forecasting models to generate object trajectories on pseudo-labeled data to further enhance the student model training. Our approach improves pseudo-label quality in two distinct manners: first, we suppress false positive pseudo-labels through establishing consistency across multiple frames of motion forecasting outputs. Second, we compensate for false negative detections by directly inserting predicted object tracks into the pseudo-labeled scene. Experiments on the nuScenes dataset demonstrate the effectiveness of our approach, improving the performance of standard semi-supervised approaches in a variety of settings.||
|**2024-09-17**|[Single-Layer Learnable Activation for Implicit Neural Representation (SL $^{2}$A-INR)](http://arxiv.org/abs/2409.10836)|null|Implicit neural representations (INRs), which leverage neural networks to transform coordinate inputs into corresponding attributes, have recently driven significant advances in several vision-related domains. However, the performance of INRs is heavily influenced by the choice of the nonlinear activation function used in their multilayer perceptron (MLP) architecture. Multiple nonlinearities have been investigated, yet current INRs still face limitations in capturing high-frequency components, handling diverse signal types, and solving inverse problems. We have identified that these problems can be greatly alleviated by introducing a paradigm shift in INRs. We find that an architecture with a learnable activation function in its initial layer can represent fine details in the underlying signal. Specifically, we propose SL$^{2}$A-INR, a hybrid network for INRs with a single-layer learnable activation function, which improves the effectiveness of traditional ReLU-based MLPs. Our method performs superbly across diverse tasks, including image representation, 3D shape reconstruction, image inpainting, single-image super-resolution, CT reconstruction, and novel view synthesis. Through comprehensive experiments, SL$^{2}$A-INR sets new benchmarks in accuracy, quality, and convergence rates for INRs.||
|**2024-09-17**|[Context-Dependent Interactable Graphical User Interface Element Detection for VR Applications](http://arxiv.org/abs/2409.10811)|null|In recent years, Virtual Reality (VR) has emerged as a transformative technology, offering users immersive and interactive experiences across diversified virtual environments. Users can interact with VR apps through interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D) graphical user interface (GUI). The accurate recognition of these IGEs is instrumental, serving as the foundation of many software engineering tasks, including automated testing and effective GUI search. The most recent IGE detection approaches for 2D mobile apps typically train a supervised object detection model based on a large-scale manually-labeled GUI dataset, usually with a pre-defined set of clickable GUI element categories like buttons and spinners. Such approaches can hardly be applied to IGE detection in VR apps, due to a multitude of challenges including complexities posed by open-vocabulary and heterogeneous IGE categories, intricacies of context-sensitive interactability, and the necessities of precise spatial perception and visual-semantic alignment for accurate IGE detection results. Thus, it is necessary to embark on the IGE research tailored to VR apps. In this paper, we propose the first zero-shot cOntext-sensitive inteRactable GUI ElemeNT dEtection framework for virtual Reality apps, named Orienter. By imitating human behaviors, Orienter observes and understands the semantic contexts of VR app scenes first, before performing the detection. The detection process is iterated within a feedback-directed validation and reflection loop. Specifically, Orienter contains three components, including (1) Semantic context comprehension, (2) Reflection-directed IGE candidate detection, and (3) Context-sensitive interactability classification. Extensive experiments on the dataset demonstrate that Orienter is more effective than the state-of-the-art GUI element detection approaches.||
|**2024-09-16**|[Are Deep Learning Models Robust to Partial Object Occlusion in Visual Recognition Tasks?](http://arxiv.org/abs/2409.10775)|null|Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under partial occlusion, i.e., conditions in which objects are partially covered from the view of a camera. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures, including Vision Transformer (ViT) models, have to some extent been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, which are typically computer-generated and thus inexpensive to label. Additionally, methods are rarely compared against each other, and many methods are compared against early, now outdated, deep learning models. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the recently developed Occluded Video Instance Segmentation (OVIS) dataset (arXiv:2102.01558). IRUO utilizes real-world and artificially occluded images to test and benchmark leading methods' robustness to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study using images from IRUO that evaluates human classification performance at multiple levels and types of occlusion. We find that modern CNN-based models show improved recognition accuracy on occluded images relative to earlier CNN-based models, and ViT-based models are more accurate than CNN-based models on occluded images, performing only modestly worse than human accuracy. We also find that certain types of occlusion, including diffuse occlusion, where relevant objects are seen through "holes" in occluders such as fences and foliage, can greatly reduce the accuracy of deep recognition models as compared to humans, especially those with CNN backbones.||
|**2024-09-16**|[CoMamba: Real-time Cooperative Perception Unlocked with State Space Models](http://arxiv.org/abs/2409.10699)|null|Cooperative perception systems play a vital role in enhancing the safety and efficiency of vehicular autonomy. Although recent studies have highlighted the efficacy of vehicle-to-everything (V2X) communication techniques in autonomous driving, a significant challenge persists: how to efficiently integrate multiple high-bandwidth features across an expanding network of connected agents such as vehicles and infrastructure. In this paper, we introduce CoMamba, a novel cooperative 3D detection framework designed to leverage state-space models for real-time onboard vehicle perception. Compared to prior state-of-the-art transformer-based models, CoMamba enjoys being a more scalable 3D model using bidirectional state space models, bypassing the quadratic complexity pain-point of attention mechanisms. Through extensive experimentation on V2X/V2V datasets, CoMamba achieves superior performance compared to existing methods while maintaining real-time processing capabilities. The proposed framework not only enhances object detection accuracy but also significantly reduces processing time, making it a promising solution for next-generation cooperative perception systems in intelligent transportation networks.||
|**2024-09-16**|[Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning](http://arxiv.org/abs/2409.10362)|null|We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances its efficacy for pre-training. Prior work in this direction masks out pre-defined frequencies in the input image and employs a reconstruction loss to pre-train the model. While achieving promising results, such an implementation has two fundamental limitations as identified in our paper. First, using pre-defined frequencies overlooks the variability of image frequency responses. Second, pre-trained with frequency-filtered images, the resulting model needs relatively more data to adapt to naturally looking images during fine-tuning. To address these drawbacks, we propose FOurier transform compression with seLf-Knowledge distillation (FOLK), integrating two dedicated ideas. First, inspired by image compression, we adaptively select the masked-out frequencies based on image frequency responses, creating more suitable SSL tasks for pre-training. Second, we employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input, largely reducing the burden of downstream tasks. Our experimental results demonstrate the effectiveness of FOLK in achieving competitive performance to many state-of-the-art SSL methods across various downstream tasks, including image classification, few-shot learning, and semantic segmentation.||
|**2024-09-13**|[Optically-Validated Microvascular Phantom for Super-Resolution Ultrasound Imaging](http://arxiv.org/abs/2409.09031)|null|Super-resolution ultrasound (SRUS) visualises microvasculature beyond the ultrasound diffraction limit (wavelength ($λ$)/2) by localising and tracking spatially isolated microbubble contrast agents. SRUS phantoms typically consist of simple tube structures, where channel diameters below 100 micrometres are not available. Furthermore, these phantoms are generally fragile and unstable, have limited ground-truth validation, and their simple structure restricts the evaluation of SRUS algorithms. To aid SRUS development, robust and durable phantoms with known and physiologically relevant microvasculature are needed for repeatable SRUS testing. This work proposes a method to fabricate durable microvascular phantoms that allow optical gauging for SRUS validation. The method uses a microvasculature negative mould embedded in polydimethylsiloxane to fabricate a microvascular phantom. Branching microvascular phantoms with variable microvascular density are demonstrated, with optically validated vessel diameters down to approximately 60 micrometres (λ/5.8; λ = approximately 350 micrometres). SRUS imaging was performed and validated against optical measurements. The average SRUS error was 15.61 micrometres (λ/22) with a standard deviation error of 11.44 micrometres. The average error decreased to 7.93 micrometres (λ/44) once the number of localised microbubbles exceeded 1000 per estimated diameter. In addition, the less than 10% variation in acoustic and optical properties measured one year after fabrication, together with the mechanical toughness of the phantoms, demonstrated their long-term durability. This work presents a method to fabricate durable, optically validated complex microvascular phantoms that can be used to quantify SRUS performance and facilitate its further development.||
|**2024-09-13**|[Pushing Joint Image Denoising and Classification to the Edge](http://arxiv.org/abs/2409.08943)|null|In this paper, we jointly combine image classification and image denoising, aiming to enhance human perception of noisy images captured by edge devices, such as low-light security cameras. In such settings, it is important to retain the ability of humans to verify the automatic classification decision, and thus jointly denoise the image to enhance human perception. Since edge devices have limited computational power, we explicitly optimize for efficiency by proposing a novel architecture that integrates the two tasks. Additionally, we modify a neural architecture search (NAS) method, which searches for classifiers, to instead search for the integrated model while optimizing for a target latency, classification accuracy, and denoising performance. The NAS architectures outperform our manually designed alternatives in both denoising and classification, offering a significant improvement to human perception. Our approach empowers users to construct architectures tailored to domains such as medical imaging, surveillance systems, and industrial inspection.||
|**2024-09-13**|[Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing](http://arxiv.org/abs/2409.08885)|null|Object detection in remote sensing imagery plays a vital role in various Earth observation applications. However, unlike object detection in natural scene images, this task is particularly challenging due to the abundance of small, often barely visible objects across diverse terrains. To address these challenges, multimodal learning can be used to integrate features from different data modalities, thereby improving detection accuracy. Nonetheless, the performance of multimodal learning is often constrained by the limited size of labeled datasets. In this paper, we propose to use Masked Image Modeling (MIM) as a pretraining technique, leveraging self-supervised learning on unlabeled data to enhance detection performance. However, conventional MIM methods such as MAE use masked tokens that carry no contextual information, and struggle to capture fine-grained details due to the lack of interactions with other parts of the image. To address this, we propose a new interactive MIM method that can establish interactions between different tokens, which is particularly beneficial for object detection in remote sensing. Extensive ablation studies and evaluations demonstrate the effectiveness of our approach.||
|**2024-09-13**|[Direct-CP: Directed Collaborative Perception for Connected and Autonomous Vehicles via Proactive Attention](http://arxiv.org/abs/2409.08840)|null|Collaborative perception (CP) leverages visual data from connected and autonomous vehicles (CAVs) to enhance an ego vehicle's field of view (FoV). Despite recent progress, current CP methods expand the ego vehicle's 360-degree perceptual range almost equally, which faces two key challenges. First, in areas with uneven traffic distribution, focusing on directions with little traffic offers limited benefits. Second, under constrained communication budgets, allocating excessive bandwidth to less critical directions lowers the perception accuracy in more vital areas. To address these issues, we propose Direct-CP, a proactive and direction-aware CP system aiming to improve CP in specific directions. Our key idea is to enable an ego vehicle to proactively signal its interested directions and readjust its attention to enhance local directional CP performance. To achieve this, we first propose an RSU-aided direction masking mechanism that assists an ego vehicle in identifying vital directions. Additionally, we design a direction-aware selective attention module to wisely aggregate pertinent features based on the ego vehicle's directional priorities, the communication budget, and the positional data of CAVs. Furthermore, we introduce a direction-weighted detection loss (DWLoss) to capture the divergence between directional CP outcomes and the ground truth, facilitating effective model training. Extensive experiments on the V2X-Sim 2.0 dataset demonstrate that our approach achieves 19.8% higher local perception accuracy in interested directions and 2.5% higher overall perception accuracy than state-of-the-art collaborative 3D object detection methods.||
|**2024-09-13**|[Test-time Training for Hyperspectral Image Super-resolution](http://arxiv.org/abs/2409.08667)|null|Research on hyperspectral image (HSI) super-resolution (SR) still lags behind that of RGB image SR. HSIs usually have a high number of spectral bands, so accurately modeling inter-band interactions for HSI SR is difficult. In addition, training data for HSI SR is hard to obtain, so datasets are usually small. In this work, we propose a new test-time training method to tackle this problem. Specifically, we develop a novel self-training framework that generates more accurate pseudo-labels and more accurate LR-HR relationships, so that the model can be further trained with them to improve performance. To better support our test-time training method, we also propose a new network architecture that learns HSI SR without modeling inter-band interactions, along with a new data augmentation method, Spectral Mixup, to increase the diversity of training data at test time. We also collect a new HSI dataset containing images of a wide variety of interesting objects, ranging from food to vegetation, materials, and general scenes. Extensive experiments on multiple datasets show that our method can significantly improve the performance of pretrained models after test-time training and clearly outperforms competing methods for HSI SR.||
|**2024-09-13**|[Low Complexity DoA-ToA Signature Estimation for Multi-Antenna Multi-Carrier Systems](http://arxiv.org/abs/2409.08650)|null|Accurate direction-of-arrival (DoA) and time-of-arrival (ToA) estimation is a strict requirement for several wireless systems such as sonar, radar, communications, and dual-functional radar communications (DFRC). Owing to the use of high carrier frequencies and bandwidths, most of these systems are designed with multiple antennas and subcarriers. Despite the high resolution available in the large-array regime, practical on-grid estimation methods still suffer from DoA-ToA estimation inaccuracies due to spectral leakage effects. In this paper, we propose DoA-ToA estimation methods for multi-antenna, multi-carrier systems using orthogonal frequency-division multiplexing (OFDM) signaling. In the first method, we apply discrete Fourier transform (DFT)-based coarse signature estimation and propose a low-complexity multi-stage fine-tuning approach that greatly improves the estimation accuracy. The second method is based on compressed sensing, where we achieve super-resolution by adopting a two-dimensional overcomplete angle-delay dictionary larger than the actual cardinality of antennas and subcarriers. Unlike the vectorized one-dimensional orthogonal matching pursuit (OMP) method, we apply a low-complexity two-dimensional OMP method to the matrix data model, which makes compressed-sensing-based methods practically feasible in the large-array regime. Through numerical simulations, we show that our proposed methods achieve performance similar to the subspace-based two-dimensional multiple signal classification (MUSIC) method at a significantly reduced computational complexity.||
|**2024-09-13**|[Byzantine-Robust and Communication-Efficient Distributed Learning via Compressed Momentum Filtering](http://arxiv.org/abs/2409.08640)|null|Distributed learning has become the standard approach for training large-scale machine learning models across private data silos. While distributed learning enhances privacy preservation and training efficiency, it faces significant challenges related to Byzantine robustness and communication reduction. Existing Byzantine-robust and communication-efficient methods rely on full gradient information either at every iteration or at certain iterations with a probability, and they only converge to an unnecessarily large neighborhood around the solution. Motivated by these issues, we propose a novel Byzantine-robust and communication-efficient stochastic distributed learning method that imposes no requirements on batch size and converges to a smaller neighborhood around the optimal solution than all existing methods, aligning with the theoretical lower bound. Our key innovation is leveraging Polyak momentum to mitigate the noise caused by both biased compressors and stochastic gradients, thus defending against Byzantine workers under information compression. We provide proofs of tight complexity bounds for our algorithm in the context of non-convex smooth loss functions, demonstrating that these bounds match the lower bounds in Byzantine-free scenarios. Finally, we validate the practical significance of our algorithm through an extensive series of experiments, benchmarking its performance on both binary and image classification tasks.||
|**2024-09-13**|[Think Twice Before You Act: Improving Inverse Problem Solving With MCMC](http://arxiv.org/abs/2409.08551)|null|Recent studies have demonstrated that diffusion models can serve as strong priors for solving inverse problems. A prominent example is diffusion posterior sampling (DPS), which approximates the posterior distribution of the data given the measurements using Tweedie's formula. Despite the merit of being versatile in solving various inverse problems without retraining, the performance of DPS is hindered by the fact that this posterior approximation can be inaccurate, especially at high noise levels. Therefore, we propose Diffusion Posterior MCMC (DPMC), a novel inference algorithm based on annealed MCMC for solving inverse problems with pretrained diffusion models. We define a series of intermediate distributions inspired by the approximated conditional distributions used by DPS. Through annealed MCMC sampling, we encourage the samples to follow each intermediate distribution more closely before moving to the next distribution at a lower noise level, thereby reducing the accumulated error along the path. We test our algorithm on various inverse problems, including super-resolution, Gaussian deblurring, motion deblurring, inpainting, and phase retrieval. Our algorithm outperforms DPS with fewer evaluations across nearly all tasks and is competitive with existing approaches.||
|**2024-09-12**|[Learned Compression for Images and Point Clouds](http://arxiv.org/abs/2409.08376)|**[link](https://github.com/multimedialabsfu/learned-point-cloud-compression-for-classification)**|Over the last decade, deep learning has shown great success at performing computer vision tasks, including classification, super-resolution, and style transfer. Now, we apply it to data compression to help build the next generation of multimedia codecs. This thesis provides three primary contributions to this new field of learned compression. First, we present an efficient low-complexity entropy model that dynamically adapts the encoding distribution to a specific input by compressing and transmitting the encoding distribution itself as side information. Secondly, we propose a novel lightweight low-complexity point cloud codec that is highly specialized for classification, attaining significant reductions in bitrate compared to non-specialized codecs. Lastly, we explore how the motion within the input domain between consecutive video frames is manifested in the corresponding convolutionally derived latent space.||
|**2024-09-12**|[FACT: Feature Adaptive Continual-learning Tracker for Multiple Object Tracking](http://arxiv.org/abs/2409.07904)|null|Multiple object tracking (MOT) involves identifying multiple targets and assigning them corresponding IDs within a video sequence, where occlusions are often encountered. Recent methods address occlusions using online learning techniques to improve adaptivity, or offline learning techniques to exploit temporal information from videos. However, most existing online-learning-based MOT methods are unable to learn from all past tracking information to improve adaptivity to long-term occlusions while maintaining real-time tracking speed. On the other hand, temporal-information-based offline learning methods maintain a long-term memory to store past tracking information, but this approach restricts them to using only local past information during tracking. To address these challenges, we propose a new MOT framework called the Feature Adaptive Continual-learning Tracker (FACT), which enables real-time tracking and feature learning for targets by utilizing all past tracking information. We demonstrate that the framework can be integrated with various state-of-the-art feature-based trackers, thereby improving their tracking ability. Specifically, we develop the feature adaptive continual-learning (FAC) module, a neural network that can be trained online to learn features adaptively using all past tracking information during tracking. Moreover, we also introduce a two-stage association module specifically designed for the proposed continual-learning-based tracking. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art online tracking performance on the MOT17 and MOT20 benchmarks. The code will be released upon acceptance.||
|**2024-09-12**|[Microscopic-Mamba: Revealing the Secrets of Microscopic Images with Just 4M Parameters](http://arxiv.org/abs/2409.07896)|**[link](https://github.com/zs1314/microscopic-mamba)**|在医学显微图像分类 (MIC) 领域,基于 CNN 和 Transformer 的模型已被广泛研究。然而,CNN 难以建模远程依赖关系,限制了其充分利用图像语义信息的能力。相反,Transformer 则受到二次计算复杂性的阻碍。为了解决这些挑战,我们提出了一种基于 Mamba 架构的模型:Microscopic-Mamba。具体来说,我们设计了部分选择前馈网络(PSFFN)来替换视觉状态空间模块(VSSM)的最后一个线性层,增强了 Mamba 的局部特征提取能力。此外,我们引入了调制交互特征聚合(MIFA)模块,以有效地调制和动态聚合全局和局部特征。我们还结合了并行 VSSM 机制,以改善通道间的信息交互,同时减少参数数量。大量实验表明,我们的方法在五个公共数据集上实现了最先进的性能。代码可在 https://github.com/zs1314/Microscopic-Mamba 获取。||
|**2024-09-12**|[What is YOLOv9: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector](http://arxiv.org/abs/2409.07813)|null|本研究全面分析了 YOLOv9 对象检测模型,重点关注其架构创新、训练方法以及相较于先前版本的性能改进。关键的改进,例如广义高效层聚合网络 (GELAN) 和可编程梯度信息 (PGI),显著增强了特征提取和梯度流,从而提高了准确性和效率。通过结合深度卷积和轻量级 C3Ghost 架构,YOLOv9 在保持高精度的同时降低了计算复杂度。在 Microsoft COCO 上的基准测试表明,它具有优越的平均精度均值 (mAP) 和更快的推理时间,在多个指标上优于 YOLOv8。该模型的多功能性体现在它可以无缝部署到从边缘设备到高性能 GPU 的各种硬件平台上,并内置支持 PyTorch 和 TensorRT 集成。本文首次深入探讨了 YOLOv9 的内部特征及其在现实世界中的适用性,将其确立为跨行业的实时对象检测的最新解决方案,从物联网设备到大型工业应用。||
|**2024-09-12**|[Mesh-based Super-Resolution of Fluid Flows with Multiscale Graph Neural Networks](http://arxiv.org/abs/2409.07769)|null|这项工作介绍了一种图神经网络 (GNN) 方法,能够对流体流动进行基于网格的三维超分辨率重建。在此框架中,GNN 的设计不是一次性在整个基于网格的场上运行,而是直接在局部元素(或单元)网格上运行。为了以类似于谱(或有限)元素离散化的方式促进基于网格的 GNN 表示,修改了基线 GNN 层(称为消息传递层,用于更新局部节点属性)以考虑重合图节点的同步,从而使其与常用的基于元素的网格连接兼容。该架构本质上是多尺度的,由粗尺度和细尺度消息传递层序列(称为处理器)组合而成,这些序列之间通过图解池层进行分离。粗尺度处理器使用粗尺度同步消息传递在元素邻域上将查询元素(以及一组相邻的粗元素)嵌入到单个潜在图表示中,而细尺度处理器利用此潜在图上的其他消息传递操作来校正插值误差。使用来自雷诺数为 1600 和 3200 的泰勒-格林涡流模拟的六面体网格数据进行演示研究。通过分析全局和局部误差,结果最终表明,与粗尺度和多尺度模型配置中的目标相比,GNN 如何能够生成准确的超分辨率场。发现固定架构的重建误差与雷诺数成正比,而包含周围粗元素邻居被发现可以改善 Re=1600 时的预测,但在 Re=3200 时则不然。||
|**2024-09-12**|[DFDG: Data-Free Dual-Generator Adversarial Distillation for One-Shot Federated Learning](http://arxiv.org/abs/2409.07734)|null|联邦学习 (FL) 是一种分布式机器学习方案,其中客户端通过共享模型信息而不是其私有数据集来共同参与全局模型的协作训练。考虑到与通信和隐私相关的担忧,具有一轮通信的单次联邦学习已成为事实上的有希望的解决方案。然而,现有的单次联邦学习方法要么需要公共数据集,要么侧重于模型同构设置,要么从本地模型中提取的知识有限,这使得训练鲁棒的全局模型变得困难甚至不切实际。为了解决这些限制,我们提出了一种新的用于单次联邦学习的无数据双生成器对抗蒸馏方法 (即 DFDG),该方法可以通过训练双生成器来探索更广泛的本地模型训练空间。DFDG 以对抗方式执行,包括两部分:双生成器训练和双模型蒸馏。在双生成器训练中,我们深入研究了每个生成器在保真度、可迁移性和多样性方面的内容,以确保其效用,并额外定制了交叉散度损失以减少双生成器输出空间的重叠。在双模型蒸馏中,训练好的双生成器协同工作,为全局模型的更新提供训练数据。最后,我们对各种图像分类任务的广泛实验表明,与 SOTA 基线相比,DFDG 在准确性方面取得了显着的性能提升。||
|**2024-09-12**|[Cooperative Inference with Interleaved Operator Partitioning for CNNs](http://arxiv.org/abs/2409.07693)|null|将深度学习模型部署在物联网(IoT)设备上通常会面临内存资源和计算能力有限的挑战。协同推理是解决这一问题的重要方法,需要对智能模型进行分区和分布式部署。为了执行水平分区,现有的协同推理方法要么采用算子的输出通道,要么采用特征图的高度和宽度作为分区维度。在这种方式下,由于算子的激活是分布式的,因此必须将它们连接在一起,然后才能将其馈送到下一个算子,这会导致协同推理的延迟。在本文中,我们为CNN模型提出了交错算子分区(IOP)策略。通过基于输出通道维度对一个算子进行分区,并基于输入通道维度对其后续算子进行分区,可以避免激活连接,从而减少通信连接的数量,从而减少协同推理延迟。基于IOP,我们进一步提出了一种模型分割算法,用于最小化协同推理时间,该算法根据获得的推理延迟收益,贪婪地选择用于IOP配对的算子。实验结果表明,与CoEdge中使用的最先进的分区方法相比,IOP策略在三个经典图像分类模型上实现了6.39%~16.83%的加速,并将峰值内存占用减少了21.22%~49.98%。||
|**2024-09-11**|[Minimizing Embedding Distortion for Robust Out-of-Distribution Performance](http://arxiv.org/abs/2409.07582)|null|基于庞大且多样化数据集训练的基础模型在各种零样本任务中展现出跨不同领域和分布泛化的非凡能力。我们的工作解决了在通过微调使基础模型适应特定下游任务时,如何保留这些强大的泛化能力的挑战。为此,我们引入了一种名为“相似性损失”的新方法,它可以融入到任何任务的微调过程中。通过最小化微调嵌入与预训练嵌入之间的扭曲,我们的方法在特定任务适应和保持广泛泛化能力之间取得了平衡。我们在两个不同的任务上评估了我们的方法:卫星图像的图像分类和人脸识别,重点关注开放类别和领域迁移场景,以评估分布外 (OOD) 性能。我们证明,这种方法在保持强大的分布内 (ID) 性能的同时,显著提高了 OOD 性能。||
|**2024-09-11**|[ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers](http://arxiv.org/abs/2409.07541)|**[link](https://github.com/gsavathrakis/enact)**|Transformer在基于视觉的目标检测问题上表现出具有竞争力的精度。然而,由于注意力权重的平方大小,它们需要相当大的计算资源。在这项工作中,我们建议根据输入信息熵对transformer输入进行聚类。这样做的原因是,每个像素的自信息(其总和为熵)在对应于同一对象的像素之间可能是相似的。聚类减少了作为transformer输入的数据量,因此减少了训练时间和GPU内存使用量,同时保留了要传递到网络其余部分的有意义信息。建议的过程组织在一个名为ENACT的模块中,该模块可以插入任何在其编码器中包含多头自注意力计算的transformer架构。我们使用COCO目标检测数据集和三个检测transformer进行了广泛的实验。获得的结果表明,在所有测试案例中,所需的计算资源都持续减少,而检测任务的精度仅略有下降。ENACT模块的代码将在https://github.com/GSavathrakis/ENACT上提供。||
|**2024-09-11**|[A Contrastive Symmetric Forward-Forward Algorithm (SFFA) for Continual Learning Tasks](http://arxiv.org/abs/2409.07387)|null|所谓的“正向-正向算法”(FFA) 近期作为一种替代传统神经网络学习中反向传播算法的新方法获得了关注,在各种建模任务中展现出具有竞争力的性能。通过用两次对比正向传递代替梯度反向传播的反向传递,FFA 通过启用逐层训练启发式方法,避免了其前身所经历的几个缺点(例如梯度消失/爆炸)。在分类任务中,这种对比方法已被证明可以有效地创建输入数据的潜在稀疏表示,最终有利于区分性。然而,由于正负数据之间损失函数的不平衡,FFA 表现出固有的不对称梯度行为,这会对模型的泛化能力产生负面影响并导致准确性下降。为了解决这个问题,这项工作提出了对称正向-正向算法 (SFFA),这是对原始 FFA 的一种新颖改进,它将每一层划分为正神经元和负神经元。这允许将局部适应度函数定义为正神经元激活与整体层活动之间的比率,从而在训练阶段产生对称的损失情况。为了评估我们方法增强的收敛性,我们使用多个图像分类基准进行了多项实验,比较了使用 SFFA 训练的模型与其使用 FFA 训练的模型的准确性。作为这种重新表述的副产品,我们探索了将逐层训练算法用于持续学习 (CL) 任务的优势。逐层训练算法引起的神经元特化及其激活的稀疏性使得能够实现有效的 CL 策略,将新知识(类别)整合到神经网络中,同时防止灾难性地遗忘先前...||
|**2024-09-11**|[Three-Dimensional, Multimodal Synchrotron Data for Machine Learning Applications](http://arxiv.org/abs/2409.07322)|**[link](https://github.com/calum-green/xct-xdrct_paper_code)**|Machine learning techniques are being increasingly applied in medical and physical sciences across a variety of imaging modalities; however, an important issue when developing these tools is the availability of good quality training data. Here we present a unique, multimodal synchrotron dataset of a bespoke zinc-doped Zeolite 13X sample that can be used to develop advanced deep learning and data fusion pipelines. Multi-resolution micro X-ray computed tomography was performed on a zinc-doped Zeolite 13X fragment to characterise its pores and features, before spatially resolved X-ray diffraction computed tomography was carried out to characterise the homogeneous distribution of sodium and zinc phases. Zinc absorption was controlled to create a simple, spatially isolated, two-phase material. Both raw and processed data is available as a series of Zenodo entries. Altogether we present a spatially resolved, three-dimensional, multimodal, multi-resolution dataset that can be used for the development of machine learning techniques. Such techniques include development of super-resolution, multimodal data fusion, and 3D reconstruction algorithm development.||
|**2024-09-10**|[A comprehensive study on Blood Cancer detection and classification using Convolutional Neural Network](http://arxiv.org/abs/2409.06689)|null|多年来,在目标检测领域,一些高效的卷积神经网络 (CNN),如 DenseNet201、InceptionV3、ResNet152v2、SEresNet152、VGG19、Xception 因其性能而备受关注。此外,CNN 范式已经扩展到从原始 CNN 架构进行迁移学习和集成模型。研究表明,迁移学习和集成模型能够提高深度学习 (DL) 模型的准确性。然而,很少有研究利用这些技术对血液恶性肿瘤进行检测和定位的综合实验。意识到这一差距,本研究进行了三个实验;在第一个实验中,使用了六个原始 CNN,在第二个实验中,使用了迁移学习,在第三个实验中,开发了一个新的集成模型 DIX(DenseNet201、InceptionV3 和 Xception)来检测和分类血癌。统计结果表明,DIX 的性能优于原始模型和迁移学习,准确率达到 99.12%。然而,这项研究也提供了一个关于迁移学习的负面结果,因为迁移学习并没有提高原始 CNN 的准确性。与许多其他癌症一样,血癌疾病需要及时识别,才能制定有效的治疗方案并提高生存机会。使用 CNN 检测和分类血癌的高精度表明,CNN 模型在血癌检测中很有前景。这项研究在生物医学工程、计算机辅助疾病诊断和基于机器学习的疾病检测领域具有重要意义。||
|**2024-09-10**|[Lightweight Multiscale Feature Fusion Super-Resolution Network Based on Two-branch Convolution and Transformer](http://arxiv.org/abs/2409.06590)|null|目前,深度学习下的单图像超分辨率(SISR)算法主要有两大模型,一种是基于卷积神经网络的模型,另一种是基于Transformer的模型。前者采用不同卷积核大小的卷积层堆叠的方式来设计模型,使得模型能够更好地提取图像的局部特征;后者采用自注意力机制来设计模型,通过自注意力机制可以让模型建立图像像素点之间的长距离依赖关系,进而更好地提取图像的全局特征。然而,上述两种方法都面临着自己的问题。基于此,本文提出了一种基于双向互补卷积和Transformer的新型轻量级多尺度特征融合网络模型,该模型通过双分支网络架构,融合Transformer和卷积神经网络各自的特点,实现全局和局部信息的相互融合。同时,考虑到深度神经网络训练的低像素图像造成的局部信息丢失,本文设计了一种多阶段特征补充的模块化连接方式,将模型浅层阶段提取的特征图与模型深层阶段提取的特征图进行融合,以最大限度地减少特征图像中信息的丢失,有利于图像的复原,便于获得更高质量的复原图像。最终的实践结果表明,与其他参数量相同的轻量级模型相比,本文提出的模型在图像恢复性能方面是最优的。||
|**2024-09-10**|[Transtreaming: Adaptive Delay-aware Transformer for Real-time Streaming Perception](http://arxiv.org/abs/2409.06584)|null|实时目标检测对于许多现实应用(如自动驾驶中的防撞和路径规划)的决策过程至关重要。本研究提出了一种创新的实时流感知方法 Transtreaming,它解决了具有动态计算延迟的实时目标检测挑战。Transtreaming 的核心创新在于其自适应延迟感知转换器,它可以同时预测多个未来帧并选择与现实世界当前时间最匹配的输出,从而补偿任何系统引起的计算延迟。即使在单帧检测场景中,所提出的模型也通过利用基于转换器的方法优于现有的最先进方法。它在从强大的 V100 到适度的 2080Ti 的各种设备上均表现出强大的性能,在所有平台上都实现了最高水平的感知精度。与大多数难以在功能较弱的设备上在一帧内完成计算的最先进方法不同,Transtreaming 可以满足各种设备上的严格实时处理要求。实验结果强调了该系统的适应性和其显着提高许多现实系统(如自动驾驶)的安全性和可靠性的潜力。||
|**2024-09-10**|[Semi-Supervised 3D Object Detection with Chanel Augmentation using Transformation Equivariance](http://arxiv.org/abs/2409.06583)|null|对于自动驾驶汽车和机器人来说,精确的三维物体检测对于其安全有效地导航和与环境交互至关重要。同时,三维检测器的性能依赖于数据规模和标注,而这通常成本高昂。因此,使用有限的标注数据进行训练的需求日益增长。本文探索了一种新颖的师生框架,该框架采用通道增强技术进行三维半监督目标检测。师生SSL通常对教师和学生分别采用弱增强和强增强。在本工作中,我们使用变换等变检测器(TED)对两个网络应用了多通道增强。TED使我们能够探索点云上增强的不同组合,并有效地聚合多通道变换等变特征。原则上,通过对教师网络采用固定的通道增强,学生可以在可靠的伪标签上稳定地训练。采用强通道增强可以丰富数据的多样性,增强对变换的鲁棒性,提高学生网络的泛化性能。我们使用SOTA层次监督作为基线,并将其双阈值调整到TED,称为通道IoU一致性。我们使用KITTI数据集对我们的方法进行了评估,取得了显著的性能提升,超越了SOTA三维半监督目标检测模型。||
|**2024-09-10**|[Dynamic Decoupling of Placid Terminal Attractor-based Gradient Descent Algorithm](http://arxiv.org/abs/2409.06542)|null|梯度下降 (GD) 和随机梯度下降 (SGD) 已广泛应用于众多应用领域。因此,理解 GD 的动力学并提高其收敛速度仍然非常重要。本文根据梯度流不同阶段的终端吸引子,仔细分析了 GD 的动力学。基于终端滑模理论和终端吸引子理论,设计了四种自适应学习率。并通过详细的理论研究考察了它们的性能,并对学习过程的运行时间进行了评估和比较。此外,还详细研究了它们学习过程的总时间。为了评估其有效性,在函数逼近问题和图像分类问题上对各种仿真结果进行了研究。||
|**2024-09-10**|[Knowledge Distillation via Query Selection for Detection Transformer](http://arxiv.org/abs/2409.06443)|null|Transformer 通过引入 DETR 为目标检测领域带来了革命性的变化,DETR 以其简洁性和有效性而备受赞誉。尽管有这些优势,但这些模型的庞大规模对其在实际部署中,尤其是在资源受限的环境中,提出了重大挑战。本文利用知识蒸馏技术解决了压缩 DETR 的挑战,该技术有望在保持模型性能的同时减小模型规模。DETR 性能的一个关键方面是它们依赖查询来准确解释对象表示。传统的蒸馏方法通常只关注通过二分匹配识别的正查询,而忽略了硬负查询中存在的信息。我们的视觉分析表明,关注前景元素的硬负查询对于增强蒸馏结果至关重要。为此,我们引入了一种新颖的组查询选择策略,该策略通过根据查询与真实对象的广义交并比 (GIoU) 对查询进行分段,从而发现有价值的硬负查询用于蒸馏,这与 DETR 蒸馏中的传统查询选择不同。此外,我们提出了基于查询选择的 DETR 知识蒸馏 (QSKD) 框架,该框架结合了注意力引导特征蒸馏 (AGFD) 和局部对齐预测蒸馏 (LAPD)。这些组件通过关注教师模型中间特征和输出中最有信息的部分来优化蒸馏过程。我们对 MS-COCO 数据集的综合实验评估证明了我们方法的有效性,在不增加大量计算成本的情况下,显着提高了各种 DETR 架构的平均精度 (AP)。具体来说,Conditional DETR ResNet-18 的 AP 从 35.8 提高到 39.9。||
|**2024-09-10**|[Seam Carving as Feature Pooling in CNN](http://arxiv.org/abs/2409.06311)|null|这项工作研究了将接缝裁剪作为卷积神经网络 (CNN) 中的一种特征池化技术用于图像分类任务的潜力。我们建议用接缝裁剪操作替换传统的最大池化层。我们在 Caltech-UCSD Birds 200-2011 数据集上进行的实验表明,基于接缝裁剪的 CNN 与采用最大池化的模型相比,在准确率、精确率、召回率和 F1 分数等指标上均取得了更好的性能。我们通过特征图可视化进一步分析了这两种方法的行为,表明接缝裁剪在池化过程中可能保留了更多结构信息。此外,我们还讨论了我们方法的局限性,并提出了未来研究的潜在方向。||
|**2024-09-10**|[An Attribute-Enriched Dataset and Auto-Annotated Pipeline for Open Detection](http://arxiv.org/abs/2409.06300)|null|通过语言检测感兴趣的对象经常会遇到挑战,特别是对于那些不常见或难以描述的对象,因为自动化模型和人类标注者之间存在感知差异。这些挑战凸显了对综合数据集的需求,这些数据集需要超越标准的对象标签,并结合详细的属性描述。为了满足这一需求,我们引入了 Objects365-Attr 数据集,它是对现有 Objects365 数据集的扩展,其特点是具有属性标注。该数据集通过整合广泛的属性(包括颜色、材质、状态、纹理和色调)来减少对象检测中的不一致性。它包含 560 万个对象级属性描述的扩展集合,这些描述在 140 万个边界框中进行了精心标注。此外,为了验证数据集的有效性,我们对不同规模的 YOLO-World 进行了严格的评估,测量了它们的检测性能,并展示了该数据集对推进对象检测的贡献。||
|**2024-09-09**|[Replay Consolidation with Label Propagation for Continual Object Detection](http://arxiv.org/abs/2409.05650)|null|目标检测是一个与机器人技术和自动驾驶等许多应用高度相关的计算机视觉问题。持续学习 (CL) 考虑的是模型在保留先前获得的知识的同时逐步学习新信息的设置。这尤其具有挑战性,因为深度学习模型在训练新数据时往往会灾难性地忘记旧知识。特别是,与用于分类的持续学习相比,用于目标检测的持续学习 (CLOD) 带来了额外的困难。在 CLOD 中,来自先前任务的图像可能包含未知的类别,这些类别可能会在未来的任务中重新出现并被标记。这些缺失的注释会导致基于重放的方法出现任务干扰问题。因此,文献中的大多数工作都集中在基于蒸馏的方法上。然而,这些方法只有在不同任务之间存在强大的类别重叠时才有效。为了解决当前方法的问题,我们提出了一种解决 CLOD 的新技术,称为用于目标检测的标签传播重放整合 (RCLPOD)。基于重放方法,我们的解决方案通过增强缓冲区内存样本来避免任务干扰问题。我们的方法在 CLOD 文献中的现有技术基础上进行了评估,证明了其在 VOC 和 COCO 等既定基准测试中的优越性能。||
|**2024-09-09**|[LEROjD: Lidar Extended Radar-Only Object Detection](http://arxiv.org/abs/2409.05564)|**[link](https://github.com/rst-tu-dortmund/lerojd)**|对于自动驾驶而言,精确的三维物体检测至关重要。激光雷达传感器非常适合这项任务,但它们价格昂贵,并且在恶劣天气条件下存在局限性。3+1D 成像雷达传感器提供了一种经济高效且稳健的替代方案,但由于其分辨率低和测量噪声高而面临挑战。现有的 3+1D 成像雷达数据集包括雷达和激光雷达数据,可以改进跨模态模型。尽管不应在推理过程中使用激光雷达,但它可以帮助训练仅使用雷达的物体检测器。我们探索了两种将知识从激光雷达域迁移到雷达域和仅使用雷达的物体检测器的策略:1. 使用顺序激光雷达点云细化的多阶段训练,以及 2. 跨模态知识蒸馏。在多阶段过程中,我们研究了三种细化方法。我们的结果表明,通过多阶段训练,平均精度 (mAP) 显着提高了 4.2 个百分点,通过使用教师模型的权重初始化学生模型进行知识蒸馏,平均精度提高了 3.9 个百分点。这些方法的主要优点是它们适用于其他 3D 物体检测网络,而无需改变其架构,正如我们通过在两个不同的物体检测器上进行分析所展示的那样。我们的代码可在 https://github.com/rst-tu-dortmund/lerojd 获取。||
|**2024-09-08**|[Can OOD Object Detectors Learn from Foundation Models?](http://arxiv.org/abs/2409.05162)|**[link](https://github.com/cvmi-lab/syncood)**|Out-of-distribution (OOD) object detection is a challenging task due to the absence of open-set OOD data. Inspired by recent advancements in text-to-image generative models, such as Stable Diffusion, we study the potential of generative models trained on large-scale open-set data to synthesize OOD samples, thereby enhancing OOD object detection. We introduce SyncOOD, a simple data curation method that capitalizes on the capabilities of large foundation models to automatically extract meaningful OOD data from text-to-image generative models. This offers the model access to open-world knowledge encapsulated within off-the-shelf foundation models. The synthetic OOD samples are then employed to augment the training of a lightweight, plug-and-play OOD detector, thus effectively optimizing the in-distribution (ID)/OOD decision boundaries. Extensive experiments across multiple benchmarks demonstrate that SyncOOD significantly outperforms existing methods, establishing new state-of-the-art performance with minimal synthetic data usage.||
|**2024-09-08**|[Visual Grounding with Multi-modal Conditional Adaptation](http://arxiv.org/abs/2409.04999)|**[link](https://github.com/mr-bigworth/mmca)**|Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task. They typically extract visual and textual features separately using independent visual and textual encoders, then fuse these features in a multi-modal decoder for final prediction. However, visual grounding presents unique challenges. It often involves locating objects with different text descriptions within the same image. Existing methods struggle with this task because the independent visual encoder produces identical visual features for the same image, limiting detection performance. Some recently approaches propose various language-guided visual encoders to address this issue, but they mostly rely solely on textual information and require sophisticated designs. In this paper, we introduce Multi-modal Conditional Adaptation (MMCA), which enables the visual encoder to adaptively update weights, directing its focus towards text-relevant regions. Specifically, we first integrate information from different modalities to obtain multi-modal embeddings. Then we utilize a set of weighting coefficients, which generated from the multimodal embeddings, to reorganize the weight update matrices and apply them to the visual encoder of the visual grounding model. Extensive experiments on four widely used datasets demonstrate that MMCA achieves significant improvements and state-of-the-art results. Ablation experiments further demonstrate the lightweight and efficiency of our method. Our source code is available at: https://github.com/Mr-Bigworth/MMCA.||
|**2024-09-08**|[RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network](http://arxiv.org/abs/2409.04979)|null|Perceiving the surrounding environment is a fundamental task in autonomous driving. To obtain highly accurate perception results, modern autonomous driving systems typically employ multi-modal sensors to collect comprehensive environmental data. Among these, the radar-camera multi-modal perception system is especially favored for its excellent sensing capabilities and cost-effectiveness. However, the substantial modality differences between radar and camera sensors pose challenges in fusing information. To address this problem, this paper presents RCBEVDet, a radar-camera fusion 3D object detection framework. Specifically, RCBEVDet is developed from an existing camera-based 3D object detector, supplemented by a specially designed radar feature extractor, RadarBEVNet, and a Cross-Attention Multi-layer Fusion (CAMF) module. Firstly, RadarBEVNet encodes sparse radar points into a dense bird's-eye-view (BEV) feature using a dual-stream radar backbone and a Radar Cross Section aware BEV encoder. Secondly, the CAMF module utilizes a deformable attention mechanism to align radar and camera BEV features and adopts channel and spatial fusion layers to fuse them. To further enhance RCBEVDet's capabilities, we introduce RCBEVDet++, which advances the CAMF through sparse fusion, supports query-based multi-view camera perception models, and adapts to a broader range of perception tasks. Extensive experiments on the nuScenes show that our method integrates seamlessly with existing camera-based 3D perception models and improves their performance across various perception tasks. Furthermore, our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks. Notably, with ViT-L as the image backbone, RCBEVDet++ achieves 72.73 NDS and 67.34 mAP in 3D object detection without test-time augmentation or model ensembling.||
|**2024-09-08**|[PatchAlign:Fair and Accurate Skin Disease Image Classification by Alignment with Clinical Labels](http://arxiv.org/abs/2409.04975)|**[link](https://github.com/aayushmanace/patchalign24)**|深度学习模型在皮肤病变诊断自动化方面取得了巨大成功。然而,在部署这些模型之前,需要解决其预测中存在的种族差异问题。我们介绍了一种名为 PatchAlign 的新方法,通过与皮肤病临床文本表征对齐来提高皮肤病图像分类的准确性和公平性。PatchAlign 使用图最优传输 (GOT) 损失作为正则化器来执行跨域对齐。即使在训练样本有限的情况下,获得的表征也是稳健的,并且可以很好地泛化到不同的肤色。为了减少临床皮肤病图像中噪声和伪影的影响,我们提出了一种可学习的掩码图最优传输,用于跨域对齐,进一步改善了公平性指标。我们在两个具有不同皮肤类型的皮肤病变数据集上将我们的模型与最先进的 FairDisCo 进行了比较:Fitzpatrick17k 和 Diverse Dermatology Images (DDI)。与 FairDisCo 相比,PatchAlign 在 Fitzpatrick17k 上将皮肤病图像分类的准确性提高了 2.8%(域内)和 6.2%(跨域),在 DDI 上提高了 4.2%(域内)。此外,它持续改善了不同肤色真实阳性率的公平性。用于实现的源代码可在以下 GitHub 存储库中获取:https://github.com/aayushmanace/PatchAlign24,可以轻松复现和进一步试验。||
|**2024-09-07**|[Activation Function Optimization Scheme for Image Classification](http://arxiv.org/abs/2409.04915)|**[link](https://github.com/abdurrahman1828/afos)**|Activation function has a significant impact on the dynamics, convergence, and performance of deep neural networks. The search for a consistent and high-performing activation function has always been a pursuit during deep learning model development. Existing state-of-the-art activation functions are manually designed with human expertise except for Swish. Swish was developed using a reinforcement learning-based search strategy. In this study, we propose an evolutionary approach for optimizing activation functions specifically for image classification tasks, aiming to discover functions that outperform current state-of-the-art options. Through this optimization framework, we obtain a series of high-performing activation functions denoted as Exponential Error Linear Unit (EELU). The developed activation functions are evaluated for image classification tasks from two perspectives: (1) five state-of-the-art neural network architectures, such as ResNet50, AlexNet, VGG16, MobileNet, and Compact Convolutional Transformer which cover computationally heavy to light neural networks, and (2) eight standard datasets, including CIFAR10, Imagenette, MNIST, Fashion MNIST, Beans, Colorectal Histology, CottonWeedID15, and TinyImageNet which cover from typical machine vision benchmark, agricultural image applications to medical image applications. Finally, we statistically investigate the generalization of the resultant activation functions developed through the optimization scheme. With a Friedman test, we conclude that the optimization scheme is able to generate activation functions that outperform the existing standard ones in 92.8% cases among 28 different cases studied, and $-x\cdot erf(e^{-x})$ is found to be the best activation function for image classification generated by the optimization scheme.||
|**2024-09-07**|[SSFam: Scribble Supervised Salient Object Detection Family](http://arxiv.org/abs/2409.04817)|**[link](https://github.com/liuzywen/ssfam)**|Scribble supervised salient object detection (SSSOD) constructs segmentation ability of attractive objects from surroundings under the supervision of sparse scribble labels. For the better segmentation, depth and thermal infrared modalities serve as the supplement to RGB images in the complex scenes. Existing methods specifically design various feature extraction and multi-modal fusion strategies for RGB, RGB-Depth, RGB-Thermal, and Visual-Depth-Thermal image input respectively, leading to similar model flood. As the recently proposed Segment Anything Model (SAM) possesses extraordinary segmentation and prompt interactive capability, we propose an SSSOD family based on SAM, named SSFam, for the combination input with different modalities. Firstly, different modal-aware modulators are designed to attain modal-specific knowledge which cooperates with modal-agnostic information extracted from the frozen SAM encoder for the better feature ensemble. Secondly, a siamese decoder is tailored to bridge the gap between the training with scribble prompt and the testing with no prompt for the stronger decoding ability. Our model demonstrates the remarkable performance among combinations of different modalities and refreshes the highest level of scribble supervised methods and comes close to the ones of fully supervised methods. https://github.com/liuzywen/SSFam||
|**2024-09-07**|[SpotActor: Training-Free Layout-Controlled Consistent Image Generation](http://arxiv.org/abs/2409.04801)|null|Text-to-image diffusion models significantly enhance the efficiency of artistic creation with high-fidelity image generation. However, in typical application scenarios like comic book production, they can neither place each subject into its expected spot nor maintain the consistent appearance of each subject across images. For these issues, we pioneer a novel task, Layout-to-Consistent-Image (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts. To accomplish this challenging task, we present a new formalization of dual energy guidance with optimization in a dual semantic-latent space and thus propose a training-free pipeline, SpotActor, which features a layout-conditioned backward update stage and a consistent forward sampling stage. In the backward stage, we innovate a nuanced layout energy function to mimic the attention activations with a sigmoid-like objective. While in the forward stage, we design Regional Interconnection Self-Attention (RISA) and Semantic Fusion Cross-Attention (SFCA) mechanisms that allow mutual interactions across images. To evaluate the performance, we present ActorBench, a specified benchmark with hundreds of reasonable prompt-box pairs stemming from object detection datasets. Comprehensive experiments are conducted to demonstrate the effectiveness of our method. The results prove that SpotActor fulfills the expectations of this task and showcases the potential for practical applications with superior layout alignment, subject consistency, prompt conformity and background diversity.||
|**2024-09-07**|[LoCa: Logit Calibration for Knowledge Distillation](http://arxiv.org/abs/2409.04778)|null|Knowledge Distillation (KD), aiming to train a better student model by mimicking the teacher model, plays an important role in model compression. One typical way is to align the output logits. However, we find a common issue named mis-instruction, that the student would be misled when the predictions based on teacher logits do not follow the labels. Meanwhile, there is other useful dark knowledge in the logits such as the class discriminability, which is vital for distillation. In this paper, we propose a simple yet effective Logit Calibration (LoCa) method, which calibrates the logits from the teacher model based on the ground-truth labels. The key insight is to correct the prediction (to address the mis-instruction issue) and maintain useful dark knowledge simultaneously. Our proposed LoCa does not require any additional parameters. Empirical results on image classification and text generation tasks demonstrate that LoCa can effectively improve the performance of baselines.||
|**2024-09-05**|[Use of triplet loss for facial restoration in low-resolution images](http://arxiv.org/abs/2409.03530)|null|近年来,人脸识别 (FR) 模型已成为应用最广泛的生物识别工具,在众多数据集上取得了令人瞩目的成果。然而,硬件的固有挑战或拍摄距离往往导致低分辨率图像,这会严重影响人脸识别模型的性能。为了解决这个问题,人们提出了几种解决方案,包括生成高度逼真的人脸的超分辨率 (SR) 模型。尽管做出了这些努力,但人脸识别算法并未取得显著改进。我们提出了一种新颖的超分辨率模型 FTLGAN,它侧重于生成保留个人身份的高分辨率图像,而不仅仅是提高图像质量,从而最大限度地提高人脸识别模型的性能。结果令人信服,表明 d' 的平均值比当前最先进的模型高出 21%,具体而言,14x14 像素时 d' = 1.099,AUC = 0.78,28x28 像素时 d' = 2.112,AUC = 0.92,56x56 像素时 d' = 3.049,AUC = 0.98。这项研究的贡献在几个关键领域意义重大。首先,在低分辨率图像(特别是 14x14、28x28 和 56x56 像素的分辨率)中,人脸识别性能取得了显着提高。其次,FTLGAN 所展示的增强功能在所有分辨率下都表现出一致的响应,与其他比较模型不同,它始终如一地提供出色的性能。第三,使用三元组损失逻辑实施了一种创新方法,能够仅使用真实图像训练超分辨率模型,这与当前模型形成对比,并扩展了潜在的现实应用。最后,本研究引入了一种新颖的模型,该模型通过在模型训练期间将人脸识别质量作为损失纳入其中,专门解决了提高人脸识别系统分类性能的挑战。||
|**2024-09-05**|[Have Large Vision-Language Models Mastered Art History?](http://arxiv.org/abs/2409.03521)|null|大型视觉语言模型 (VLM) 的出现最近在跨多个领域的图像分类方面建立了新的基准。然而,VLM 在艺术品分类这一特定任务中的表现,特别是绘画艺术风格分类——传统上由艺术史学家掌握的领域——尚未得到探索。与自然图像相比,艺术品由于其固有的复杂性和多样性结构(以多变的构图和风格为特征)而构成了独特的挑战。艺术史学家长期以来一直在研究艺术品的独特方面,而风格预测是其学科的一个重要组成部分。本文研究了集成视觉和文本数据的大型 VLM 是否可以有效地预测绘画的艺术史属性。我们对四种 VLM(即 CLIP、LLaVA、OpenFlamingo 和 GPT-4o)进行了深入分析,重点关注使用两个公共艺术品基准对艺术风格、作者和时间段进行零样本分类。此外,我们还介绍了 ArTest,这是一个精心策划的艺术品测试集,其中包括艺术史学家研究的关键绘画作品。||
|**2024-09-05**|[LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution](http://arxiv.org/abs/2409.03516)|**[link](https://github.com/jwgdmkj/lmlt)**|近年来,基于视觉Transformer (ViT) 的图像超分辨率方法展现出令人印象深刻的性能。然而,它们存在复杂性高的问题,导致推理时间和内存使用量大。此外,使用窗口自注意力机制(WSA) 的ViT模型在处理窗口区域外的信息时面临挑战。为了解决这些问题,我们提出了低到高多级Transformer (LMLT),它对每个头采用不同特征大小的注意力机制。LMLT 沿通道维度划分图像特征,逐渐减小低层头的空间大小,并对每个头应用自注意力机制。这种方法有效地捕获了局部和全局信息。通过将低层头的结果整合到高层头中,LMLT 克服了自注意力机制中的窗口边界问题。大量实验表明,我们的模型在保持甚至超越最先进的基于 ViT 的图像超分辨率方法的性能的同时,显著减少了推理时间和 GPU 内存使用量。我们的代码可在 https://github.com/jwgdmkj/LMLT 获取。||
|**2024-09-05**|[Non-Uniform Illumination Attack for Fooling Convolutional Neural Networks](http://arxiv.org/abs/2409.03458)|**[link](https://github.com/Akshayjain97/Non-Uniform_Illumination)**|卷积神经网络(CNN)虽然取得了显著进步,但仍然容易受到攻击,特别是在面对人类容易识别的微小图像扰动时。这种弱点通常被称为“攻击”,突显了CNN的鲁棒性有限,需要研究如何增强其抵抗此类操纵的能力。本研究介绍了一种新颖的非均匀照明(NUI)攻击技术,该技术使用不同的NUI掩码对图像进行细微改动。我们在广泛接受的数据集(包括CIFAR10、TinyImageNet和CalTech256)上进行了大量实验,重点关注12种不同NUI攻击模型的图像分类。评估了VGG、ResNet、MobilenetV3-small和InceptionV3模型对NUI攻击的抵抗力。我们的结果表明,CNN模型在遭受NUI攻击时,分类精度大幅下降,表明它们在非均匀照明下的脆弱性。为了缓解这种情况,我们提出了一种防御策略,将通过新的NUI变换生成的NUI攻击图像包含到训练集中。结果表明,当CNN模型面对受NUI攻击影响的扰动图像时,其性能得到显著提升。该策略旨在增强CNN模型对NUI攻击的抵抗力。||
|**2024-09-05**|[Raw Speech Enhancement with Deep State Space Modeling](http://arxiv.org/abs/2409.03377)|**[link](https://github.com/Brainchip-Inc/aTENNuate)**|我们提出了 aTENNuate,这是一种简单的深度状态空间自编码器,专为高效的在线原始语音增强而配置,采用端到端的方式。该网络的性能主要在原始语音去噪方面进行评估,并在超分辨率和去量化等任务上进行了额外评估。我们在 VoiceBank + DEMAND 和 Microsoft DNS1 合成测试集上对 aTENNuate 进行了基准测试。该网络在 PESQ 分数、参数数量、MAC 和延迟方面优于以前的实时去噪模型。即使作为原始波形处理模型,该模型也能保持对干净信号的高保真度,并且可听见的伪影极少。此外,即使将噪声输入压缩至 4000Hz 和 4 位,该模型仍能保持良好的性能,这表明它在资源受限的环境中具有一般的语音增强能力。||
|**2024-09-05**|[Training-free Conversion of Pretrained ANNs to SNNs for Low-Power and High-Performance Applications](http://arxiv.org/abs/2409.03368)|null|脉冲神经网络 (SNN) 由于其推理速度快、功耗低等优势,已成为人工神经网络 (ANN) 的一种很有前途的替代方案。然而,缺乏有效的训练算法阻碍了它们的广泛应用。现有的 SNN 监督学习算法比 ANN 需要更多的内存和时间。即使是常用的 ANN-SNN 转换方法也需要重新训练 ANN 以提高转换效率,从而产生额外的计算成本。为了应对这些挑战,我们提出了一种新颖的免训练 ANN-SNN 转换流程。我们的方法将预先训练好的 ANN 模型直接转换为高性能 SNN,无需额外的训练。该转换流程包括一个基于局部学习的阈值平衡算法,该算法能够有效地计算最佳阈值并通过通道缩放对阈值进行细粒度调整。我们展示了我们的框架在三个典型的计算机视觉任务中的可扩展性:图像分类、语义分割和目标检测。这展示了其对分类和回归任务的适用性。此外,我们评估了转换后的 SNN 的能耗,证明了它们与传统 ANN 相比具有优越的低功耗优势。我们的免训练算法优于现有方法,突出了其实用性和效率。这种方法通过利用开源预训练 ANN 模型和神经形态硬件简化了 SNN 的部署,从而实现了快速、低功耗的推理,并且性能损失可以忽略不计。||
|**2024-09-05**|[YOLO-PPA based Efficient Traffic Sign Detection for Cruise Control in Autonomous Driving](http://arxiv.org/abs/2409.03320)|null|在自动驾驶系统中高效、准确地检测交通标志至关重要。然而,距离越远,交通标志越小。现有的目标检测算法很难检测到这些小尺寸的标志。此外,车载嵌入式设备的性能限制了检测模型的规模。为了应对这些挑战,本文提出了一种基于 YOLO PPA 的交通标志检测算法。在 GTSDB 数据集上的实验结果表明,与原始 YOLO 相比,该方法将推理效率提高了 11.2%,mAP 50 也提高了 93.2%,证明了所提出的 YOLO PPA 的有效性。||
|**2024-09-05**|[PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning](http://arxiv.org/abs/2409.03192)|null|细粒度图像分类随着深度学习和计算机视觉技术的出现取得了显著的进步。然而,详细标注的缺乏仍然是一个主要挑战,特别是在获取高质量标记数据的成本高昂或耗时的情况下。为了解决这一限制,我们引入了专为半监督学习框架内的细粒度图像分类设计的精度增强型伪标签(PEPL)方法。我们的方法通过生成高质量的伪标签来利用丰富的未标记数据,这些伪标签通过两个关键阶段逐步细化:初始伪标签生成和语义混合伪标签生成。这些阶段利用类激活图(CAM)来准确估计语义内容并生成细化标签,这些标签捕获了细粒度分类所需的基本细节。通过关注语义级信息,我们的方法有效地解决了标准数据增强和图像混合技术在保留关键细粒度特征方面的局限性。我们在基准数据集上实现了最先进的性能,证明了相对于现有半监督策略的显著改进,在准确性和鲁棒性方面都有显著提升。我们的代码已在https://github.com/TianSuya/SemiFG开源。||
|**2024-09-05**|[The AdEMAMix Optimizer: Better, Faster, Older](http://arxiv.org/abs/2409.03137)|**[link](https://github.com/apple/ml-ademamix)**|基于动量的优化器是众多机器学习应用的核心。这些优化器通常依赖于梯度的指数移动平均 (EMA),它会以指数方式衰减旧梯度对当前梯度的贡献。这是因为梯度是局部的线性近似,当迭代点在损失函数曲面上移动时,旧梯度的相关性会降低。这项工作对使用单个 EMA 来累积过去梯度的做法提出了质疑,并通过经验证明了这种选择可能是次优的:单个 EMA 无法同时对最近的梯度赋予高权重,并对较旧的梯度赋予不可忽略的权重。基于这一观察,我们提出了 AdEMAMix,它是对 Adam 优化器的一种简单修改,它混合了两个 EMA,以更好地利用过去的梯度。我们在语言建模和图像分类方面的实验表明,令人惊讶的是,梯度在数万步内仍然具有相关性。它们有助于更快地收敛,并且通常收敛到更低的最小值:例如,一个在 1010 亿个词符上训练的具有 13 亿个参数的 AdEMAMix LLM 的性能与在一个 1970 亿个词符上训练的 AdamW 模型相当(+95%)。此外,我们的方法显著减缓了训练过程中的模型遗忘。我们的工作鼓励进一步探索利用过去梯度的不同类型的函数,而不仅仅是 EMA。||
|**2024-09-04**|[Boundless: Generating Photorealistic Synthetic Data for Object Detection in Urban Streetscapes](http://arxiv.org/abs/2409.03022)|**[link](https://github.com/zk2172-columbia/boundless)**|我们介绍Boundless,这是一个用于在密集的城市街景中实现高度准确的目标检测的逼真合成数据生成系统。Boundless可以用自动化和可配置的过程取代大规模的现实世界数据收集和手动地面实况目标注释(标记)。Boundless基于虚幻引擎5 (UE5) 城市示例项目,并进行了改进,能够在不同的照明和场景变化条件下准确收集3D边界框。我们评估了在Boundless生成的数据集上训练的目标检测模型在从中空相机获取的真实数据集上进行推理时的性能。我们将Boundless训练模型的性能与CARLA训练模型的性能进行了比较,观察到7.8 mAP的改进。我们取得的结果支持了合成数据生成是一种可靠的方法,可以用于训练/微调用于城市场景的可扩展目标检测模型。||
|**2024-09-04**|[iConFormer: Dynamic Parameter-Efficient Tuning with Input-Conditioned Adaptation](http://arxiv.org/abs/2409.02838)|null|基于预训练编码器的完整微调(FFT)和任务特定解码器的迁移学习随着深度模型的指数级增长而变得越来越复杂。使用由小型可学习层组成的适配器的参数高效微调(PEFT)方法已成为 FFT 的替代方案,在保持高训练效率的同时实现了可比的性能。然而,适配器对输入实例的不灵活限制了其在不同下游任务中学习任务特定信息的能力。在本文中,我们提出了一种新的 PEFT 方法,即输入条件化的 Transformer,称为 iConFormer,它利用了以输入实例为条件的动态适配器。为了确保在各种下游任务中对输入实例的灵活学习能力,我们在动态适配器中引入了输入条件化网络(iCoN),从而实现实例级特征转换。具体来说,iCoN 为每个特征生成通道级的卷积核,并使用自适应卷积过程对其进行转换,以有效捕获针对下游任务的任务特定和细粒度细节。实验结果表明,通过仅调整 Transformer 主干参数的 1.6% 到 2.8%,iConFormer 在单目深度估计和语义分割方面实现了与 FFT 相当的性能,同时在图像分类和实例分割方面优于 FFT。此外,所提出的方法在所有上述任务中始终优于最近的 PEFT 方法。||
|**2024-09-04**|[Real-Time Dynamic Scale-Aware Fusion Detection Network: Take Road Damage Detection as an example](http://arxiv.org/abs/2409.02546)|null|基于无人机的道路损坏检测 (RDD) 对城市的日常维护和安全至关重要,特别是在显著降低劳动力成本方面。然而,当前基于无人机的 RDD 研究仍面临许多挑战。例如,形状和方向不规则的损坏、背景对损坏的遮挡以及难以区分损坏和背景,这些因素都显著影响了无人机在日常巡检中检测道路损坏的能力。为了解决这些问题并提高无人机实时道路损坏检测的性能,我们设计并提出了三个相应的模块:一个能够灵活适应形状和背景的特征提取模块;一个融合多尺度感知并适应形状和背景的模块;一个高效的下采样模块。基于这些模块,我们设计了一种具有自动去除背景干扰能力的多尺度自适应道路损坏检测模型,称为动态尺度感知融合检测模型 (RT-DSAFDet)。在 UAV-PDD2023 公开数据集上的实验结果表明,我们的模型 RT-DSAFDet 的 mAP50 达到了 54.2%,比最新实时目标检测模型 YOLOv10 的高效变体 YOLOv10-m 高 11.1%,而参数量减少到 1.8M,FLOPs 减少到 4.6G,分别降低了 88% 和 93%。此外,在大型通用目标检测公开数据集 MS COCO2017 上也展现了我们模型的优越性,其 mAP50-95 与 YOLOv9-t 相同,但 mAP50 高出 0.5%,参数量减少 10%,FLOPs 减少 40%。||
|**2024-09-04**|[Boosting Generalizability towards Zero-Shot Cross-Dataset Single-Image Indoor Depth by Meta-Initialization](http://arxiv.org/abs/2409.02486)|null|室内机器人的导航或障碍物检测等任务依赖于深度信息,而单图像深度估计被广泛用于辅助感知。大多数室内单图像深度预测较少关注模型对未见数据集的泛化能力,而更关注系统部署的野外鲁棒性。这项工作利用基于梯度的元学习在零样本跨数据集推理中获得更高的泛化能力。与研究最多的、与显式类别标签相关的图像分类元学习不同,对于与物体排列和场景构成方面高度变化的室内环境相关的连续深度值,不存在明确的任务边界。我们提出了细粒度任务,在我们的元学习公式中将每个RGB-D小批量视为一个任务。我们首先展示了我们的方法在有限数据上诱导出更好的先验(RMSE 最高降低 27.8%)。然后,在元学习初始化上进行微调始终优于没有元方法的基线。为了实现泛化,我们提出了零样本跨数据集协议,并验证了由我们的元初始化诱导的更高泛化能力,作为许多现有深度估计方法的简单而有用的插件。深度和元学习交叉领域的工作有可能推动这两项研究更接近实际的机器人和机器感知应用。||
|**2024-09-03**|[Site Selection for the Second Flyeye Telescope: A Simulation Study for Optimizing Near-Earth Object Discovery](http://arxiv.org/abs/2409.02329)|null|欧洲航天局 (ESA) 正在开发一个名为 Flyeye 的广域巡天望远镜网络,以改进近地天体 (NEO) 的发现。该网络中的第一个望远镜将位于北半球的穆法拉山(意大利),而第二个具有增强探测能力的 Flyeye 望远镜刚刚开始关键设计阶段。通过对撞击轨迹上的近地天体进行模拟,研究了第二个 Flyeye 望远镜的潜在位置。对大约 3000 个撞击小行星(绝对星等为 H=25 和 H=28)进行了传播,并测试了主要现有巡天项目(Catalina、Pan-STARRS、ATLAS)、即将投入使用的薇拉·鲁宾天文台 (LSST) 以及 Flyeye 可能选址的可探测性。 考虑了智利、南非和北半球的第二个设施。对于每个天文台,在模拟中都考虑了它们过去或计划的指向策略。在 LSST 部署之前,南半球的一个 Flyeye 的性能与北半球的一个望远镜相似。结合起来,在北方和南方各放置一台望远镜可以最大限度地提高探测率和探测到的独特物体的数量。LSST 之后,南部和北部的 Flyeye 望远镜仍然是互补的。总体而言,模拟表明,无论是在 LSST 之前还是之后,位于南部的第二个 Flyeye 都可以补充位于北部的 Flyeye 望远镜。位于拉西拉的 Flyeye 将利用其优越的大气条件,同时平衡南北半球的资产。||
|**2024-09-03**|[K-Origins: Better Colour Quantification for Neural Networks](http://arxiv.org/abs/2409.02281)|**[link](https://github.com/lewismmason/Thesis-Public)**|K-Origins是一种神经网络层,旨在在学习颜色或强度有利时提高基于图像的网络性能。超过 250 个编码器-解码器卷积网络在 16 位合成数据上进行了训练和测试,结果表明,在两种情况下,K-Origins 提高了语义分割精度:低信噪比下的目标检测,以及分割形状相同但颜色不同的多个目标。对于每个可训练参数 $w_k$,K-Origins 通过公式 $\textbf{Y}_k = \textbf{X}-\textbf{J}\cdot w_k$ 从输入特征 $\textbf{X}$ 生成输出特征,其中 $\textbf{J}$ 是一个全 1 矩阵。此外,还训练了具有不同感受野的网络,以根据目标类别的维度确定最佳网络深度,这表明感受野长度应超过目标大小。通过确保足够的感受野长度并结合 K-Origins,我们可以获得更好的语义网络性能。||
|**2024-09-03**|[Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems](http://arxiv.org/abs/2409.02278)|null|近年来,视觉语言模型(VLM)的快速发展展现出其在图像理解相关应用方面的巨大潜力。本研究探索了最先进的VLM模型在基于视觉的交通工程任务中的应用,例如图像分类和目标检测。图像分类任务包括拥堵检测和裂缝识别,而目标检测任务则用于识别未佩戴头盔的行为。我们应用了开源模型(如CLIP、BLIP、OWL-ViT、Llava-Next)和闭源模型GPT-4o,评估了这些最先进的VLM模型的性能,以利用语言理解能力来完成基于视觉的交通任务。这些任务通过对VLM模型应用零样本提示来完成,因为零样本提示可以在不对任务进行任何训练的情况下执行任务。这消除了对特定任务进行标注数据集或微调的需求。虽然这些模型在图像分类任务中取得了与基准卷积神经网络(CNN)模型相当的结果,但在目标定位任务中仍有改进的空间。因此,本研究对最先进的VLM模型进行了全面评估,突出了这些模型的优势和局限性,可以作为未来改进和广泛实施的基准。||
|**2024-09-03**|[A Modern Take on Visual Relationship Reasoning for Grasp Planning](http://arxiv.org/abs/2409.02035)|null|与现实世界杂乱场景交互对机器人代理提出了若干挑战,这些代理需要理解观察到的物体之间复杂的空间依赖性,以确定最佳拾取顺序或有效的物体检索策略。现有的解决方案通常处理简化的场景,并侧重于在初始物体检测阶段之后预测成对物体关系,但往往忽略全局上下文或难以处理冗余和缺失的物体关系。在这项工作中,我们提出了一种用于抓取规划的视觉关系推理的现代方法。我们介绍了 D3GD,这是一个新的测试平台,其分拣场景包含来自 97 个不同类别的多达 35 个物体。此外,我们还提出了 D3G,这是一种新的基于端到端 transformer 的依赖图生成模型,它可以同时检测物体并生成表示其空间关系的邻接矩阵。认识到标准指标的局限性,我们首次采用关系平均精度来评估模型性能,进行了广泛的实验基准测试。获得的结果表明我们的方法是这项任务的最新技术,为机器人操作的未来研究奠定了基础。我们在 https://paolotron.github.io/d3g.github.io 上公开发布代码和数据集。||
|**2024-09-03**|[Compressed learning based onboard semantic compression for remote sensing platforms](http://arxiv.org/abs/2409.01988)|**[link](https://github.com/protim1191/glodismo_classifier)**|地球观测 (EO) 在创建和维持一个具有弹性和繁荣的社会方面发挥着至关重要的作用,这对所有生命和地球本身都具有深远的影响。卫星、航空平台以及最近的无人机和无人驾驶飞行器等遥感平台都用于 EO。它们收集大量数据,需要将其下传到地球进行进一步处理和分析。这种高吞吐量采集的瓶颈是下行链路带宽。需要以数据为中心的图像压缩解决方案来应对这种海量数据。在这项工作中,通过压缩学习框架研究了语义压缩,该框架仅利用快速和稀疏的矩阵向量乘法来编码数据。相机噪声和通信信道是造成失真的主要来源。然后,完整的语义通信管道由一个学习到的低复杂度压缩矩阵组成,该矩阵作用于噪声相机输出,以在机载生成一个观测向量,该向量通过通信信道下行链路传输,通过展开网络处理,然后馈送到执行必要下游任务的深度学习模型;研究了图像分类。通过使用小波稀疏先验展开 NA-ALISTA 的层来补偿失真。因此,解码是一种根据相机/环境信息和下游任务设计的即插即用方法。用于下游任务的深度学习模型通过端到端方式的损失函数与压缩矩阵和展开网络联合微调。结果表明,在低压缩比的噪声环境中,添加恢复损失以及任务相关损失可以提高下游性能。||
|**2024-09-03**|[Latent Distillation for Continual Object Detection at the Edge](http://arxiv.org/abs/2409.01872)|**[link](https://github.com/pastifra/Continual_Nanodet)**|虽然在目标检测文献中存在许多性能卓越的方法,但解决数据分布偏移仍然具有挑战性。持续学习(CL)为这个问题提供了解决方案,使模型能够适应新数据,同时保持对先前数据的性能。这对于边缘设备尤其重要,这些设备在汽车和机器人等动态环境中很常见。在这项工作中,我们解决了目标检测持续学习(CLOD)场景中边缘设备的内存和计算限制。具体来说,(i)我们研究了一种开源、轻量级和快速的检测器 NanoDet 对边缘设备上 CLOD 的适用性,改进了文献中使用的较大架构。此外,(ii)我们提出了一种名为潜在蒸馏(LD)的新型 CL 方法,该方法在不显着影响检测性能的情况下减少了最先进的 CL 方法所需的运算次数和内存。我们的方法使用著名的 VOC 和 COCO 基准测试集进行了验证,与其他蒸馏方法相比,每次模型更新可将蒸馏参数开销减少 74%,将浮点运算(FLOPs)减少 56%。||
|**2024-09-03**|[GeoBEV: Learning Geometric BEV Representation for Multi-view 3D Object Detection](http://arxiv.org/abs/2409.01816)|null|鸟瞰图 (BEV) 表示已成为多视图 3D 对象检测的主流范式,展现出令人印象深刻的感知能力。然而,现有方法忽略了 BEV 表示的几何质量,使其处于低分辨率状态,无法恢复场景真实的几何信息。在本文中,我们确定了先前方法受限于低 BEV 表示分辨率的原因,并提出了径向-笛卡尔 BEV 采样 (RC-Sampling),从而能够高效生成高分辨率密集 BEV 表示,而无需复杂的算子。此外,我们设计了一种新颖的盒内标签来替代从激光雷达点生成的传统深度标签。此标签反映了对象的实际几何结构,而不仅仅是它们的表面,将现实世界的几何信息注入 BEV 表示中。此外,结合盒内标签,开发了一种质心感知内部损失 (CAI 损失) 来捕捉对象的细粒度内部几何结构。最后,我们将上述模块集成到一个名为 GeoBEV 的新型多视图 3D 对象检测框架中。在 nuScenes 数据集上的大量实验表明,GeoBEV 实现了最先进的性能,突出了其有效性。||
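上表中 K-Origins 一文(2024-09-03)给出了显式的层公式 $\textbf{Y}_k = \textbf{X}-\textbf{J}\cdot w_k$。下面是基于该公式的一个最小 NumPy 草图,仅为示意、并非官方实现;多个 $\textbf{Y}_k$ 的组合方式(此处按新维度堆叠)是本文的假设,细节以论文代码仓库为准:

```python
import numpy as np

def k_origins(X, weights):
    """K-Origins 层前向传播的简化草图(非官方实现)。

    对每个可训练参数 w_k,按公式 Y_k = X - J * w_k 生成一组
    输出特征,其中 J 是与 X 同形状的全 1 矩阵;这里将所有
    Y_k 沿新维度堆叠返回(堆叠方式为示意性假设)。
    """
    J = np.ones_like(X)
    return np.stack([X - J * w for w in weights], axis=0)
```

直观上,每个 $w_k$ 相当于一个可学习的强度"原点",网络据此更容易区分形状相同、仅颜色或强度不同的目标。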

(back to top)

## 生成模型

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
|**2024-11-05**|[DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models](http://arxiv.org/abs/2411.03250)|null|大型语言模型 (LLM) 近期的进展显著增强了它们的知识和生成能力,引发了人们对利用 LLM 合成高质量数据的浓厚兴趣。然而,通过提示 LLM 生成合成数据仍然具有挑战性,因为 LLM 对目标数据分布的理解有限,并且提示工程的复杂性较高,尤其是对于结构化格式的数据。为了解决这些问题,我们引入了 DiffLM,这是一个基于变分自编码器 (VAE) 的可控数据合成框架,它进一步 (1) 利用扩散模型在学习的潜在分布中保留更多原始分布和格式结构的信息,并且 (2) 通过即插即用的潜在特征注入模块将目标分布知识的学习与 LLM 的生成目标解耦。由于我们观察到 VAE 的潜在表示与真实数据分布之间存在显著差异,因此在我们的框架中引入了潜在扩散模块来学习完全表达的潜在分布。在七个具有结构化格式数据(即表格、代码和工具数据)的真实世界数据集上的评估表明,DiffLM 生成了高质量的数据,在某些情况下,下游任务的性能比真实数据高 2-7 个百分点。数据和代码将在内部审查完成后公开发布。|
|**2024-11-05**|[On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models](http://arxiv.org/abs/2411.03177)|null|大规模训练潜在扩散模型 (LDM) 使图像生成质量达到了前所未有的水平。然而,性能最佳的 LDM 训练方法的关键组成部分通常不对研究界开放,这阻碍了同类比较并妨碍了该领域进展的验证。在这项工作中,我们对 LDM 训练方法进行了深入研究,重点关注模型的性能及其训练效率。为了确保同类比较,我们重新实现了五个先前发布的模型及其相应的训练方法。通过我们的研究,我们探讨了 (i) 用于控制生成模型对语义信息(例如,文本提示)和控制元数据(例如,裁剪大小、随机翻转标志等)的条件机制对模型性能的影响,以及 (ii) 在较小和较低分辨率数据集上学习的表示迁移到较大数据集上对训练效率和模型性能的影响。然后,我们提出了一种新的条件机制,它将语义和控制元数据条件分离,并在 ImageNet-1k 数据集上的类条件生成方面树立了新的最先进水平——256 和 512 分辨率的 FID 分别提高了 7% 和 8%——以及在 CC12M 数据集上的文本到图像生成方面——256 和 512 分辨率的 FID 分别提高了 8% 和 23%。|
|**2024-11-05**|[Unleashing the power of novel conditional generative approaches for new materials discovery](http://arxiv.org/abs/2411.03156)|**[link](https://github.com/AIRI-Institute/conditional-crystal-generation)**|长期以来,新材料设计的计算方法依赖于寻找候选材料并对其性质进行建模的迭代过程。人工智能在这方面发挥了至关重要的作用,通过先进的计算方法和数据驱动的方法,帮助加速了晶体性质和结构的发现和优化。为了解决新材料设计问题并加快新材料的搜索过程,我们将最新的生成方法应用于晶体结构设计问题,试图解决逆问题:在给定性质的情况下生成满足这些性质的结构,而无需利用超级计算机的能力。在我们的工作中,我们提出了两种方法:1)条件结构修改:利用能量上最有利的结构与其所有不太稳定的多晶型物之间的能量差来优化任意原子构型的稳定性;2)条件结构生成。我们使用了包含以下信息的材料表示:晶格、原子坐标、原子类型、化学特征、空间群和结构的形成能。损失函数经过优化,以考虑晶体结构的周期性边界条件。我们应用了扩散模型方法、流匹配、普通的自动编码器(AE),并比较了模型和方法的结果。作为研究的度量标准,我们使用了物理PyMatGen匹配器:我们使用默认容差比较目标结构和生成的结构。到目前为止,我们的修改器和生成器分别以41%和82%的准确率生成了具有所需性质的结构。为了证明所提出的方法的有效性,我们进行了推断,得到了一些形成能低于AFLOW衍生凸包的潜在新结构。|
|**2024-11-05**|[Local Lesion Generation is Effective for Capsule Endoscopy Image Data Augmentation in a Limited Data Setting](http://arxiv.org/abs/2411.03098)|null|有限的医学影像数据集通过增加过拟合和泛化能力降低的风险来挑战深度学习模型,尤其是在生成对抗网络 (GAN) 中,判别器可能过拟合,导致训练发散。这种限制也损害了在小数据集上训练的分类模型。生成数据增强 (GDA) 通过使用合成数据扩展训练数据集来解决这个问题,尽管它需要训练一个生成模型。我们提出并评估了两种局部病灶生成方法,以应对增强小型医学图像数据集的挑战。第一种方法采用泊松图像编辑算法(一种经典的图像处理技术)来创建逼真的图像合成物,其性能优于当前最先进的方法。第二种方法引入了一种新的生成方法,利用微调的图像修复 GAN 在真实训练图像的指定区域内合成逼真的病灶。对这两种方法的全面比较表明,在数据受限的环境下有效的局部病灶生成能够在胶囊内窥镜病灶分类中达到新的最先进的结果。结合我们的技术,在高度不平衡的 Kvasir 胶囊数据集(胶囊内窥镜的基准)上实现了 33.07% 的宏观 F1 分数,比之前的最佳结果高出 7.84 个百分点。据我们所知,这项工作是第一个将微调的图像修复 GAN 应用于医学影像中的 GDA 的工作,证明了图像条件 GAN 可以有效地适应有限的数据集以生成高质量的样本,从而促进有效的数据增强。此外,我们还表明,将这种基于 GAN 的方法与经典图像处理技术相结合可以进一步增强结果。|
|**2024-11-05**|[Gradient-Guided Conditional Diffusion Models for Private Image Reconstruction: Analyzing Adversarial Impacts of Differential Privacy and Denoising](http://arxiv.org/abs/2411.03053)|null|我们研究了用于重建隐私图像的梯度引导条件扩散模型的构建方法,重点关注差分隐私噪声与扩散模型去噪能力之间的对抗性相互作用。当前基于梯度的重建方法由于计算复杂度和先验知识要求的限制,难以处理高分辨率图像,而我们提出了两种新方法,它们只需对扩散模型的生成过程进行少量修改,并且无需先验知识。我们的方法利用扩散模型强大的图像生成能力,即使在梯度中添加了少量差分隐私噪声的情况下,也能从随机生成的噪声开始重建隐私图像。我们还对差分隐私噪声对重建图像质量的影响进行了全面的理论分析,揭示了噪声幅度、受攻击模型的架构以及攻击者的重建能力之间的关系。此外,大量的实验验证了我们提出的方法的有效性和我们理论发现的准确性,为使用条件扩散模型进行隐私风险审计提出了新的方向。|
|**2024-11-05**|[GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details](http://arxiv.org/abs/2411.03047)|null|神经隐式函数为从多张甚至单张图像进行着装人体数字化带来了显著的进步。然而,尽管取得了进展,目前的技术仍然难以泛化到具有复杂布料变形和身体姿势的未见过图像。在这项工作中,我们提出了 GarVerseLOD,一个新的数据集和框架,为以前所未有的鲁棒性实现从单张不受约束的图像进行高保真 3D 服装重建铺平了道路。受大型生成模型近期成功的启发,我们认为解决泛化挑战的关键在于 3D 服装数据的数量和质量。为此,GarVerseLOD 收集了 6,000 个高质量的布料模型,这些模型具有由专业艺术家手动创建的精细几何细节。除了训练数据的规模外,我们观察到,拥有解耦粒度的几何细节可以在提升学习模型的泛化能力和推理精度方面发挥重要作用。因此,我们将 GarVerseLOD 设计为具有不同细节级别 (LOD) 的分层数据集,从无细节的程式化形状到具有像素对齐细节的姿势混合服装。这使我们能够通过将推理分解成更简单的任务来处理这个高度欠约束的问题,每个任务都缩小了搜索空间。为了确保 GarVerseLOD 能够很好地泛化到自然图像,我们提出了一种基于条件扩散模型的新颖标注范式,为每个服装模型生成大量具有高逼真度的配对图像。我们在大量自然图像上评估了我们的方法。实验结果表明,GarVerseLOD 可以生成独立的服装,其质量明显优于先前的方法。项目页面:https://garverselod.github.io/|
|**2024-11-05**|[IMUDiffusion: A Diffusion Model for Multivariate Time Series Synthetisation for Inertial Motion Capturing Systems](http://arxiv.org/abs/2411.02954)|null|由于运动传感器易于使用且不受空间限制(这与基于视频的动作捕捉系统不同),它们常用于分析体育和日常活动中的运动行为。然而,运动数据的生成,尤其是针对特定活动的标记,可能既耗时又昂贵。此外,许多模型难以处理有限的数据,这限制了它们识别复杂运动模式的性能。为了解决这些问题,生成合成数据有助于扩展数据的多样性和可变性。在这项工作中,我们提出了 IMUDiffusion,这是一种专门为多元时间序列生成设计的概率扩散模型。我们的方法能够生成高质量的时间序列,准确地捕捉人类活动的动态。此外,通过将我们的数据集与合成数据结合,我们显著提高了基线人类活动分类器的性能。在某些情况下,我们能够将宏观 F1 分数提高近 30%。IMUDiffusion 为生成逼真的人类活动运动提供了一个宝贵的工具,并增强了模型在训练数据有限的情况下的鲁棒性。|
|**2024-11-05**|[LDPM: Towards undersampled MRI reconstruction with MR-VAE and Latent Diffusion Prior](http://arxiv.org/abs/2411.02951)|null|扩散模型作为一种强大的生成模型,已在包括MRI重建在内的广泛领域得到应用。然而,大多数现有的基于扩散模型的MRI重建方法直接在像素空间中进行操作,这使得它们的优化和推理在计算上非常昂贵。潜在扩散模型的引入是为了解决自然图像处理中的这个问题,但将其直接应用于MRI重建仍然面临许多挑战,包括对生成结果缺乏控制、变分自动编码器 (VAE) 对MRI的适应性以及潜在空间中适用数据一致性的探索。为了应对这些挑战,本文提出了一种基于潜在扩散先验的欠采样MRI重建方法(LDPM)。该方法利用了一个草图模块来提供适当的控制,并平衡重建MRI图像的质量和保真度。本文还探索了一种适用于MRI任务的VAE(MR-VAE),它可以作为未来MRI相关任务的基础。此外,本文提出了一种DDIM采样器的变体,称为双阶段采样器,以在潜在空间中实现高保真重建。所提出的方法在fastMRI数据集上取得了具有竞争力的结果,并且消融实验也证明了每个模块的有效性。|
|**2024-11-05**|[Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey](http://arxiv.org/abs/2411.02914)|null|世界模型和视频生成是自动驾驶领域的关键技术,它们在增强自主系统的稳健性和可靠性方面发挥着至关重要的作用。世界模型模拟现实环境的动态,而视频生成模型则生成逼真的视频序列,二者正日益融合以提高自动驾驶汽车的态势感知和决策能力。本文研究了这两种技术之间的关系,重点关注它们在结构上的相似性(尤其是在基于扩散的模型中)如何促进对驾驶场景进行更准确、更一致的模拟。我们考察了JEPA、Genie和Sora等前沿工作,它们代表了世界模型设计的不同方法,从而突出了目前缺乏对世界模型普遍接受的定义。这些不同的解释强调了该领域对如何针对各种自动驾驶任务优化世界模型的理解仍在不断发展。此外,本文还讨论了该领域采用的关键评估指标,例如用于3D场景重建的Chamfer距离和用于评估生成视频内容质量的Fréchet Inception 距离 (FID)。通过分析视频生成和世界模型之间的相互作用,本综述指出了关键挑战和未来研究方向,强调了这些技术在共同提升自动驾驶系统性能方面的潜力。本文提出的研究结果旨在全面了解视频生成和世界模型的融合如何推动更安全、更可靠的自动驾驶汽车的创新发展。|
|**2024-11-05**|[ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate](http://arxiv.org/abs/2411.02853)|**[link](https://github.com/ishohei220/adopt)**|Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless choosing a hyperparameter, i.e., $\beta_2$, in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require an impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $\beta_2$ without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments, and verify that our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at https://github.com/iShohei220/adopt.|
|**2024-10-31**|[Bridging Geometric States via Geometric Diffusion Bridge](http://arxiv.org/abs/2410.24220)|null|在复杂的系统中准确预测几何状态演化对于推进量子化学和材料建模等科学领域至关重要。传统的实验和计算方法在环境限制和计算需求方面面临挑战,而目前的深度学习方法在精度和普适性方面仍然不足。在这项工作中,我们引入了几何扩散桥 (GDB),这是一个新颖的生成建模框架,可以准确地连接初始和目标几何状态。GDB 利用概率方法来演化几何状态分布,采用由修改版的 Doob $h$ -变换导出的等变扩散桥来连接几何状态。这个定制的扩散过程以初始和目标几何状态作为固定端点,并由等变转移核控制。此外,通过使用一系列等变扩散桥,轨迹数据可以无缝地融入我们的 GDB 框架中,从而提供更详细、更准确的演化动力学表征。理论上,我们进行了全面的检验,以确认我们的框架能够保持几何状态的联合分布,并能够以可忽略的误差对轨迹分布进行完整建模。跨各种实际场景的实验评估表明,GDB 超越了现有的最先进方法,为精确连接几何状态和以更高的精度和适用性应对关键科学挑战开辟了一条新途径。|
|**2024-10-31**|[Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning](http://arxiv.org/abs/2410.24219)|**[link](https://github.com/pr-ryan/demo)**|尽管文本到视频 (T2V) 生成技术取得了进步,但生成具有逼真运动的视频仍然具有挑战性。目前的模型通常产生静态或极少动态的输出,无法捕捉文本描述的复杂运动。这个问题源于文本编码中忽略运动的内部偏差,以及T2V生成模型中不充分的条件机制。为了解决这个问题,我们提出了一个名为分解运动 (DEMO) 的新框架,它通过将文本编码和条件机制分解为内容和运动组件来增强T2V生成中的运动合成。我们的方法包括用于静态元素的内容编码器和用于时间动态的运动编码器,以及单独的内容和运动条件机制。至关重要的是,我们引入了文本-运动和视频-运动监督来提高模型对运动的理解和生成能力。在MSR-VTT、UCF-101、WebVid-10M、EvalCrafter和VBench等基准上的评估表明,DEMO能够生成具有增强运动动态且保持高视觉质量的视频。我们的方法通过直接从文本描述中集成全面的运动理解,显著推进了T2V生成技术。项目页面:https://PR-Ryan.github.io/DEMO-project/|
|**2024-10-31**|[DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion](http://arxiv.org/abs/2410.24203)|**[link](https://github.com/zju3dv/diffpano)**|基于扩散的方法在2D图像或3D物体生成方面取得了显著成就,然而,3D场景乃至360度图像的生成仍然受到限制,这归因于场景数据集数量有限、3D场景本身的复杂性以及生成一致多视角图像的难度。为了解决这些问题,我们首先建立了一个大规模的全景视频-文本数据集,其中包含数百万个连续的全景关键帧以及相应的全景深度、相机姿态和文本描述。然后,我们提出了一种新的文本驱动的全景生成框架,称为DiffPano,以实现可扩展、一致且多样化的全景场景生成。具体而言,得益于稳定扩散强大的生成能力,我们在已建立的全景视频-文本数据集上使用LoRA微调了一个单视角文本到全景的扩散模型。我们进一步设计了一个球面极线感知的多视角扩散模型,以确保生成的全景图像的多视角一致性。大量实验表明,DiffPano可以根据给定的未见文本描述和相机姿态生成可扩展、一致且多样化的全景图像。|
|**2024-10-31**|[Multi-Attribute Linguistic Tuning for Controlled Paraphrase Generation](http://arxiv.org/abs/2410.24199)|null|我们提出了一种新颖的复述生成方法,可以精确控制和微调英语的40个语言属性。我们的模型采用编码器-解码器架构,输入源语句和所需的语言属性,并生成满足所需属性的源语句复述。为了保证推理时的高质量输出,我们的方法配备了质量控制机制,逐步调整语言属性的嵌入,以找到用于复述生成的最近且最可实现的所需属性配置。我们通过将其与最近的可控生成模型进行比较来评估我们方法的有效性。实验结果表明,所提出的模型在生成满足所需语言属性的复述方面优于基线模型。|
|**2024-10-31**|[AR-Pro: Counterfactual Explanations for Anomaly Repair with Formal Properties](http://arxiv.org/abs/2410.24178)|null|异常检测被广泛用于识别关键错误和可疑行为,但目前的方法缺乏可解释性。我们利用现有方法的共同特性和生成模型的最新进展,为异常检测引入了反事实解释。给定一个输入,我们生成其反事实解释,作为基于扩散的修复,展示非异常版本应该是什么样子。这种方法的一个关键优势是它支持对可解释性需求进行领域无关的正式规范,从而为生成和评估解释提供了一个统一的框架。我们在视觉(MVTec、VisA)和时间序列(SWaT、WADI、HAI)异常数据集上证明了我们的异常可解释性框架AR-Pro的有效性。实验代码可在以下网址访问:https://github.com/xjiae/arpro。|
|**2024-10-31**|[Redefining in Dictionary: Towards a Enhanced Semantic Understanding of Creative Generation](http://arxiv.org/abs/2410.24160)|null|创造力,无论是在人类还是在扩散模型中,本质上都是一个抽象的概念;因此,简单地在提示词中添加“creative”并不能保证模型能够可靠地识别其语义。在这项工作中,我们通过TP2O任务将“创造性”这一抽象概念具体化,该任务旨在融合两个不相关的概念,并引入了CreTok,将“创造性”重新定义为标记。这种重新定义为概念融合提供了一种更具体、更普遍适应的表示方法。这一重新定义过程是连续进行的,包括反复随机抽取具有不同概念的文本对,并优化目标提示词和常量提示词之间的余弦相似度。这种方法使 CreTok 能够学习一种创造性概念融合的方法。大量实验表明,CreTok 带来的创造能力大大超越了最近的SOTA扩散模型,并实现了更优越的创造性生成。CreTok展现出更大的灵活性和更低的时间开销,因为 CreTok 可以作为任何概念的通用标记,从而无需重新训练即可促进创造性生成。|
|**2024-10-31**|[Scaling Concept With Text-Guided Diffusion Models](http://arxiv.org/abs/2410.24151)|null|文本引导的扩散模型通过根据文本描述生成高保真内容,彻底改变了生成任务。它们还实现了一种编辑范式,可以通过文本条件替换概念(例如,将狗替换为老虎)。在这项工作中,我们探索了一种新颖的方法:我们能否增强或抑制概念本身,而不是替换概念?通过实证研究,我们发现了一个趋势,即在文本引导的扩散模型中,概念可以被分解。利用这一见解,我们引入了 ScalingConcept,这是一种简单而有效的方法,可以在不引入新元素的情况下放大或缩小真实输入中分解的概念。为了系统地评估我们的方法,我们提出了 WeakConcept-10 数据集,其中概念不完善,需要增强。更重要的是,ScalingConcept 能够在图像和音频领域实现各种新颖的零样本应用,包括诸如规范姿态生成和生成声音突出显示或移除等任务。|
|**2024-10-31**|[Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure](http://arxiv.org/abs/2410.24060)|**[link](https://github.com/Morefre/Understanding-Generalizability-of-Diffusion-Models-Requires-Rethinking-the-Hidden-Gaussian-Structure)**|In this work, we study the generalizability of diffusion models by looking into the hidden properties of the learned score functions, which are essentially a series of deep denoisers trained on various noise levels. We observe that as diffusion models transition from memorization to generalization, their corresponding nonlinear diffusion denoisers exhibit increasing linearity. This discovery leads us to investigate the linear counterparts of the nonlinear diffusion models, which are a series of linear models trained to match the function mappings of the nonlinear diffusion denoisers. Surprisingly, these linear denoisers are approximately the optimal denoisers for a multivariate Gaussian distribution characterized by the empirical mean and covariance of the training dataset. This finding implies that diffusion models have the inductive bias towards capturing and utilizing the Gaussian structure (covariance information) of the training dataset for data generation. We empirically demonstrate that this inductive bias is a unique property of diffusion models in the generalization regime, which becomes increasingly evident when the model's capacity is relatively small compared to the training dataset size. In the case that the model is highly overparameterized, this inductive bias emerges during the initial training phases before the model fully memorizes its training data. Our study provides crucial insights into understanding the notable strong generalization phenomenon recently observed in real-world diffusion models.|
|**2024-10-31**|[TPC: Test-time Procrustes Calibration for Diffusion-based Human Image Animation](http://arxiv.org/abs/2410.24037)|null|Human image animation aims to generate a human motion video from the inputs of a reference human image and a target motion video. Current diffusion-based image animation systems exhibit high precision in transferring human identity into targeted motion, yet they still exhibit irregular quality in their outputs. Their optimal precision is achieved only when the physical compositions (i.e., scale and rotation) of the human shapes in the reference image and target pose frame are aligned. In the absence of such alignment, there is a noticeable decline in fidelity and consistency. Especially, in real-world environments, this compositional misalignment commonly occurs, posing significant challenges to the practical usage of current systems. To this end, we propose Test-time Procrustes Calibration (TPC), which enhances the robustness of diffusion-based image animation systems by maintaining optimal performance even when faced with compositional misalignment, effectively addressing real-world scenarios. The TPC provides a calibrated reference image for the diffusion model, enhancing its capability to understand the correspondence between human shapes in the reference and target images. Our method is simple and can be applied to any diffusion-based image animation system in a model-agnostic manner, improving the effectiveness at test time without additional training.|
|**2024-10-31**|[Unveiling Synthetic Faces: How Synthetic Datasets Can Expose Real Identities](http://arxiv.org/abs/2410.24015)|null|合成数据生成在不同的计算机视觉应用中越来越受欢迎。现有的最先进的人脸识别模型使用大规模人脸数据集进行训练,这些数据集是从互联网上抓取的,引发了隐私和伦理方面的担忧。为了解决这些担忧,一些工作提出了生成合成人脸数据集来训练人脸识别模型。然而,这些方法依赖于生成模型,而这些模型是在真实人脸图像上训练的。在这项工作中,我们设计了一种简单而有效的成员推理攻击,系统地研究了任何现有的合成人脸识别数据集是否泄露了用于训练生成器模型的真实数据中的任何信息。我们对6个最先进的合成人脸识别数据集进行了广泛的研究,并表明在所有这些合成数据集中,原始真实数据集中的几个样本都被泄露了。据我们所知,本文是第一个展示生成器模型的训练数据泄露到生成的合成人脸识别数据集中的工作。我们的研究揭示了合成人脸识别数据集中的隐私陷阱,并为未来关于生成负责任的合成人脸数据集的研究铺平了道路。|
|**2024-10-29**|[A Gaussian Process Generative Model for QCD Equation of State](http://arxiv.org/abs/2410.22160)|null|我们利用高斯过程回归方法开发了一个零净重子密度下核物质状态方程的生成模型。我们分别在高温和低温区域施加了来自格点量子色动力学和强子共振气体的第一性原理理论约束。通过允许训练后的高斯过程回归模型在相变区域附近自由变化,我们生成了具有不同声速的随机平滑交叉状态方程,而不依赖于特定的参数化。我们探索了大量实验可观测量与生成的状态方程之间的依赖关系,这为未来使用相对论重离子碰撞的实验测量来约束核物质状态方程的贝叶斯推断研究奠定了基础。||
|**2024-10-29**|[Capacity Control is an Effective Memorization Mitigation Mechanism in Text-Conditional Diffusion Models](http://arxiv.org/abs/2410.22149)|**[link](https://github.com/raman1121/diffusion_memorization_hpo)**|在这项工作中,我们提出了令人信服的证据,表明在微调过程中控制模型容量可以有效地减轻扩散模型中的记忆效应。具体来说,我们证明了在预训练-微调范式中采用参数高效微调(PEFT)与传统的完整微调方法相比,可以显著减少记忆效应。我们的实验使用了MIMIC数据集,该数据集包含胸部X光图像及其相应报告的图像-文本对。通过一系列记忆效应和生成质量指标评估的结果表明,PEFT不仅减少了记忆效应,还提高了下游生成质量。此外,PEFT方法可以与现有的记忆效应缓解技术无缝结合,以进一步改进。我们的实验代码可在以下网址获取:https://github.com/Raman1121/Diffusion_Memorization_HPO||
|**2024-10-29**|[AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts](http://arxiv.org/abs/2410.22143)|null|虽然大型语言模型 (LLM) 通常对齐良好,但它们仍然容易受到精心设计的自然语言提示或奇怪的对抗性后缀的攻击。然而,尽管乱码标记在攻击对齐的 LLM 方面取得了成功,但它们受到的关注相对较少。最近的研究 AmpleGCG (Liao et al., 2024) 表明,生成模型可以针对任何有害查询快速生成大量可定制的乱码对抗性后缀,从而暴露分布外 (OOD) 语言空间中的一系列对齐差距。为了引起更多人关注这一领域,我们推出了 AmpleGCG-Plus,这是一个增强版本,可在更少的尝试次数下获得更好的性能。通过一系列探索性实验,我们确定了几种改进乱码后缀学习的训练策略。我们在严格的评估设置下验证的结果表明,它在开放权重和闭源模型上的性能均优于 AmpleGCG,在针对 Llama-2-7B-chat 的白盒设置中,攻击成功率 (ASR) 提升高达 17%,在针对 GPT-4 的黑盒设置中,ASR 提升了三倍以上。值得注意的是,AmpleGCG-Plus 以与 GPT-4 相似的比率攻击了较新的 GPT-4o 系列模型,并发现了针对最近提出的断路器防御的漏洞。我们公开发布了 AmpleGCG-Plus 以及我们收集的训练数据集。||
|**2024-10-29**|[Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench](http://arxiv.org/abs/2410.22108)|**[link](https://github.com/franciscoliu/MLLMU-Bench)**|像大型语言模型 (LLM) 和多模态大型语言模型 (MLLM) 这样的生成模型,在海量网络语料库上训练后,可能会记住并泄露个人的机密和隐私数据,引发法律和伦理方面的担忧。虽然之前的许多工作已经通过机器遗忘技术解决了 LLM 中的这个问题,但在 MLLM 中,这仍然是一个很大程度上未被探索的领域。为了应对这一挑战,我们引入了多模态大型语言模型遗忘基准 (MLLMU-Bench),这是一个旨在提升对多模态机器遗忘理解的新型基准。MLLMU-Bench 包含 500 个虚构人物和 153 个公众人物的个人资料,每个资料都包含超过 14 个定制的问答对,并从多模态(图像+文本)和单模态(文本)两个角度进行评估。该基准测试分为四组,用于评估遗忘算法的有效性、泛化能力和模型效用。最后,我们使用现有的生成模型遗忘算法提供了基线结果。令人惊讶的是,我们的实验表明,单模态遗忘算法在生成和完形填空任务中表现出色,而多模态遗忘方法在使用多模态输入的分类任务中表现更好。||
|**2024-10-29**|[Variational inference for pile-up removal at hadron colliders with diffusion models](http://arxiv.org/abs/2410.22074)|null|本文提出了一种使用扩散模型的变分推理方法来去除 $pp$ 相互作用中的堆积效应,称为Vipr。该方法并非使用分类方法来识别哪些粒子来自主碰撞,而是训练一个生成模型来预测去除堆积效应后的硬散射粒子射流的成分。这将得到对硬散射射流成分的完整后验估计,这在去除堆积效应的背景下尚未被探索。我们在模拟 $t\bar{t}$ 事件样本中评估了 Vipr 的性能,该样本叠加了堆积污染。在各种堆积场景下,Vipr 在预测硬散射射流的子结构方面优于 SoftDrop。||
|**2024-10-29**|[PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement](http://arxiv.org/abs/2410.22059)|null|场景重排,例如整理桌子,由于预测不同物体排列的复杂性,在机器人操作中是一项具有挑战性的任务。网络规模训练的生成模型,例如 Stable Diffusion,可以通过生成自然场景作为目标来提供帮助。为了便于机器人执行,必须提取对象级表示,以便将真实场景与生成的目标匹配,并计算对象姿态变换。目前的方法通常采用多步骤设计,涉及用于生成、分割和特征编码的单独模型,这可能由于误差累积而导致低成功率。此外,它们缺乏对生成目标视角的控制,将任务限制在 3 自由度设置中。在本文中,我们提出了 PACA,一个用于场景重排的零样本流水线,它利用从 Stable Diffusion 派生的透视感知交叉注意力表示。具体来说,我们开发了一种将生成、分割和特征编码集成到单个步骤中以生成对象级表示的表示方法。此外,我们引入了视角控制,从而能够匹配 6 自由度相机视角,并扩展了过去局限于 3 自由度俯视视角的方法。我们的方法的有效性通过其在各种场景的真实机器人实验中的零样本性能得到证明,分别实现了 87% 的平均匹配精度和 67% 的执行成功率。||
|**2024-10-29**|[Dual Conditional Diffusion Models for Sequential Recommendation](http://arxiv.org/abs/2410.21967)|null|扩散模型的最新进展在序列推荐(SR)中展现出可喜的成果。然而,当前基于扩散的方法仍然存在两个关键限制。首先,它们隐式地对目标项目嵌入而不是离散的目标项目本身进行建模,导致推荐过程中的不一致性。其次,现有方法依赖于隐式或显式条件扩散模型,限制了它们充分捕捉用户行为上下文的能力,并导致目标项目嵌入的鲁棒性较差。在本文中,我们提出了用于序列推荐的双条件扩散模型(DCRec),引入了一个离散到连续的序列推荐扩散框架。我们的框架引入了一个完整的马尔可夫链来模拟从反向目标项目表示到离散项目索引的转换,连接了扩散模型的离散和连续项目空间,并确保了与扩散框架的一致性。在此框架的基础上,我们提出了双条件扩散变换器(DCDT),它结合了基于扩散的SR的隐式条件和显式条件。在公共基准数据集上的大量实验表明,DCRec 的性能优于最先进的方法。||
|**2024-10-29**|[PrefPaint: Aligning Image Inpainting Diffusion Model with Human Preference](http://arxiv.org/abs/2410.21966)|null|在本文中,我们首次尝试通过强化学习框架将图像修复的扩散模型与人类审美标准对齐,从而显著提高修复图像的质量和视觉吸引力。具体来说,我们没有直接测量与配对图像的差异,而是使用我们构建的数据集训练了一个奖励模型,该数据集包含近51,000张带有注释人类偏好的图像。然后,我们采用强化学习过程微调预训练的图像修复扩散模型的分布,使其朝着更高奖励的方向发展。此外,我们从理论上推导了奖励模型的误差上限,这说明了在整个强化对齐过程中奖励估计的潜在置信度,从而促进了准确的正则化。在修复比较和下游任务(例如图像扩展和3D重建)上的大量实验,证明了我们方法的有效性,与最先进的方法相比,修复图像与人类偏好的对齐度显著提高。这项研究不仅推进了图像修复领域,还提供了一个框架,将人类偏好纳入基于建模奖励精度的生成模型的迭代改进中,对视觉驱动AI应用的设计具有广泛的意义。我们的代码和数据集已公开发布在https://prefpaint.github.io。||
|**2024-10-29**|[CT to PET Translation: A Large-scale Dataset and Domain-Knowledge-Guided Diffusion Approach](http://arxiv.org/abs/2410.21932)|**[link](https://github.com/thanhhff/CPDM)**|正电子发射断层扫描(PET)和计算机断层扫描(CT)对于诊断、分期和监测各种疾病(尤其是癌症)至关重要。尽管它们很重要,但PET/CT系统的使用受到放射性物质的必要性、PET扫描仪的稀缺性以及PET成像相关高成本的限制。相比之下,CT扫描仪更容易获得且成本低得多。为了应对这些挑战,我们的研究解决了从CT图像生成PET图像的问题,旨在降低医疗检查成本和患者的相关健康风险。我们的贡献有两个方面:首先,我们引入了一个名为CPDM的条件扩散模型,据我们所知,这是首次尝试使用扩散模型将CT图像转换为PET图像。其次,我们提供了迄今为止最大的CT-PET数据集,包含2,028,628对配对CT-PET图像,这有助于CT到PET转换模型的训练和评估。对于CPDM模型,我们结合领域知识开发了两个条件图:注意力图和衰减图。前者帮助扩散过程聚焦于感兴趣区域,而后者改进PET数据校正并确保准确的诊断信息。跨各种基准的实验评估表明,CPDM在生成高质量PET图像方面在多个指标上均优于现有方法。源代码和数据样本可在https://github.com/thanhhff/CPDM获取。||
|**2024-10-29**|[Guided Diffusion-based Counterfactual Augmentation for Robust Session-based Recommendation](http://arxiv.org/abs/2410.21892)|null|基于会话的推荐(SR)模型旨在根据用户在当前会话期间的行为向用户推荐top-K项目。文献中提出了几种SR模型,然而,人们对其易受训练数据(观察数据)中固有偏差(例如流行度偏差)的影响提出了担忧。在有偏差的训练数据上训练的SR模型在现实场景中可能会遇到分布外数据的性能挑战。减轻流行度偏差的一种方法是反事实数据增强。与先前依赖于使用SR模型生成数据的工作相比,我们专注于利用最先进的扩散模型来生成反事实数据。我们提出了一个用于SR的基于引导扩散的反事实增强框架。通过分别在真实世界和模拟数据集上进行的离线和在线实验的组合,我们表明我们的方法比基线SR模型和其他最先进的增强框架表现得更好。更重要的是,我们的框架在不太流行的目标项目上显示出显著的改进,在真实世界和模拟数据集上的召回率分别提高了20%,点击率提高了13%。||
|**2024-10-25**|[Model merging with SVD to tie the Knots](http://arxiv.org/abs/2410.19735)|**[link](https://github.com/gstoica27/knots)**|最近的模型合并方法表明,专门针对不同任务的完全微调模型的参数可以合并到一个模型中,该模型能够在不进行重新训练的情况下解决所有任务。然而,当合并 LoRA 微调模型时,这种成功并没有很好地迁移。我们研究了这一现象,并观察到与完全微调模型相比,LoRA 微调模型的权重表现出较低的对齐程度。我们假设提高这种对齐性是获得更好 LoRA 模型合并的关键,并提出了 KnOTS 来解决这个问题。KnOTS 使用 SVD 将不同 LoRA 模型的权重联合转换到一个对齐的空间中,现有的合并方法可以在该空间中应用。此外,我们引入了一个新的基准测试,该基准测试明确评估合并模型是否为通用模型。值得注意的是,KnOTS 在多个视觉和语言基准测试中,包括我们的新设置,始终将 LoRA 合并提高了 4.3%。我们在以下位置发布我们的代码:https://github.com/gstoica27/KnOTS。||
|**2024-10-25**|[Adversarial Environment Design via Regret-Guided Diffusion Models](http://arxiv.org/abs/2410.19715)|null|在深度强化学习 (RL) 中,训练对环境变化具有鲁棒性的智能体仍然是一项重大挑战。无监督环境设计 (UED) 近期应运而生,旨在通过生成一组针对智能体能力量身定制的训练环境来解决这个问题。尽管先前的工作表明 UED 有可能学习到鲁棒的策略,但其性能受到环境生成能力的限制。为此,我们提出了一种新颖的 UED 算法,即通过遗憾引导扩散模型进行对抗性环境设计 (ADD)。所提出的方法利用智能体的遗憾来指导基于扩散的环境生成器,以生成对智能体具有挑战性但有利于进一步改进的环境。通过利用扩散模型的表示能力,ADD 可以直接生成对抗性环境,同时保持训练环境的多样性,从而使智能体能够有效地学习鲁棒的策略。我们的实验结果表明,所提出的方法成功地生成了一个具有指导意义的环境课程,在对新颖的、超出分布的环境的零样本泛化方面优于 UED 基线。项目页面:https://github.com/rllab-snu.github.io/projects/ADD||
|**2024-10-25**|[DiffGS: Functional Gaussian Splatting Diffusion](http://arxiv.org/abs/2410.19657)|null|三维高斯 splatting (3DGS) 在渲染速度和保真度方面表现出了令人信服的性能,但由于其离散性和非结构化性质,高斯 splatting 的生成仍然是一个挑战。在这项工作中,我们提出了 DiffGS,一个基于潜在扩散模型的通用高斯生成器。DiffGS 是一种强大且高效的 3D 生成模型,能够生成任意数量的高斯基元,用于光栅化的高保真渲染。其关键见解是通过三个新颖的函数以解耦的方式表示高斯 splatting,分别对高斯概率、颜色和变换进行建模。通过对 3DGS 的新颖解耦,我们使用连续的高斯 splatting 函数表示离散和非结构化的 3DGS,然后我们训练一个潜在扩散模型,目标是无条件和有条件地生成这些高斯 splatting 函数。同时,我们引入了一种离散化算法,通过八叉树引导采样和优化,从生成的函数中提取任意数量的高斯函数。我们探索了 DiffGS 的各种任务,包括无条件生成、从文本、图像和部分 3DGS 进行条件生成,以及点到高斯的生成。我们相信,DiffGS 为灵活建模和生成高斯 splatting 提供了一个新的方向。||
|**2024-10-25**|[Diffusion models for lattice gauge field simulations](http://arxiv.org/abs/2410.19602)|null|我们为格点规范理论开发了基于随机量子化概念的扩散模型。这个框架被应用于 $1+1$维的$U(1)$ 规范理论。我们证明,在一个小的逆耦合常数下训练的模型可以有效地迁移到更大的逆耦合常数,而不会遇到与拓扑冻结相关的问题,即该模型可以通过引入玻尔兹曼因子作为物理条件来生成对应于不同耦合常数的构型,同时保持正确的物理分布,而无需任何额外的训练。这证明了物理条件扩散模型在高效灵活的格点规范理论模拟方面的潜力。||
|**2024-10-25**|[Utilizing Image Transforms and Diffusion Models for Generative Modeling of Short and Long Time Series](http://arxiv.org/abs/2410.19538)|null|近年来,围绕时间序列数据的生成模型的兴趣激增。大多数现有方法要么设计用于处理短序列,要么处理长程序列。这种二分法可归因于循环网络的梯度问题、与 Transformer 相关的计算成本以及状态空间模型的表达能力有限。为了构建一个适用于不同长度时间序列的统一生成模型,我们在这项工作中建议将序列转换为图像。通过采用可逆变换(例如延迟嵌入和短时傅里叶变换),我们获得了三个主要优势:i)我们可以利用先进的扩散视觉模型;ii)我们可以在同一框架内显著地处理短程和长程输入;iii)我们可以利用时间序列到图像文献中提出的最新和已建立的工具。我们通过对多个任务(包括无条件生成、插值和外推)的综合评估来验证我们方法的有效性。我们表明,我们的方法在与强大的基线相比始终如一地实现了最先进的结果。在无条件生成任务中,我们展示了与之前的扩散模型相比,在短期判别分数上取得了 58.17% 的显着平均改进,在(超)长期分类分数上取得了 132.61% 的显着平均改进。代码位于 https://github.com/azencot-group/ImagenTime。||
|**2024-10-25**|[Ensemble Data Assimilation for Particle-based Methods](http://arxiv.org/abs/2410.19525)|null|本研究提出了一种新颖的方法,将数据同化技术应用于基于粒子的模拟中,并使用了集合卡尔曼滤波器。虽然数据同化方法已有效地应用于欧拉模拟,但其在拉格朗日解离散化中的应用尚未得到适当的探索。我们引入了两种具体的方法来弥补这一差距。第一种方法采用了一种中间欧拉变换,它结合了投影和重新网格化过程。第二种方法是一种纯粹的拉格朗日方案,适用于重新网格化不适用的情况。这些方法使用具有周期边界条件的一维对流扩散模型进行评估。针对基于网格的同化滤波器对一维场景进行了性能基准测试。随后,将同化方案应用于通过涡度-单元法求解的非线性二维不可压缩流动问题。结果证明了这些方法在更复杂场景中的适用性,突出了它们在一维和二维环境中的有效性。||
|**2024-10-25**|[Marked Temporal Bayesian Flow Point Processes](http://arxiv.org/abs/2410.19512)|null|带标记事件数据通过记录事件的连续值发生时间戳及其对应的离散值类型来捕获事件。它们出现在各种现实场景中,例如社交媒体、金融交易和医疗保健记录,并且已经通过带标记时间点过程 (MTPP) 模型得到有效建模。最近,由于其强大的生成能力和限制较少的函数形式,针对这些 MTPP 模型开发生成模型发展迅速。然而,现有的生成性 MTPP 模型通常在联合建模事件的时间戳和类型方面面临挑战,因为:(1) 主流方法仅设计时间戳的生成机制,不包括事件类型;(2) 时间戳和事件类型之间复杂的相互依赖关系被忽略了。在本文中,我们提出了一种新的生成性 MTPP 模型,称为 BMTPP。与现有的生成性 MTPP 模型不同,BMTPP 使用基于参数的方法灵活地对标记的时间联合分布进行建模。此外,通过向标记的时间数据空间添加联合噪声,BMTPP 可以有效地捕获并明确揭示时间戳和事件类型之间的相互依赖关系。大量实验验证了我们的方法优于其他最先进模型的优越性及其有效捕获标记时间相互依赖性的能力。||
|**2024-10-25**|[NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction](http://arxiv.org/abs/2410.19452)|**[link](https://github.com/gongzix/neuroclips)**|利用非侵入性脑活动 fMRI 重建静态视觉刺激取得了巨大成功,这得益于诸如 CLIP 和 Stable Diffusion 等先进的深度学习模型。然而,由于解码对连续视觉体验的时空感知非常具有挑战性,因此关于 fMRI 到视频重建的研究仍然有限。我们认为,应对这些挑战的关键在于准确解码大脑对视频刺激所感知到的高级语义和低级感知流。为此,我们提出了 NeuroClips,这是一个从 fMRI 解码高保真、流畅视频的创新框架。NeuroClips 利用语义重建器来重建视频关键帧,指导语义准确性和一致性,并采用感知重建器来捕捉低级感知细节,确保视频流畅性。在推理过程中,它采用预先训练的 T2V 扩散模型,注入关键帧和低级感知流,用于视频重建。在公开可用的 fMRI 视频数据集上进行评估,NeuroClips 实现了高达 6 秒、8FPS 的流畅高保真视频重建,在各种指标上都比现有最佳模型取得了显著改进,例如,SSIM 提高了 128%,时空指标提高了 81%。我们的项目可在 https://github.com/gongzix/NeuroClips 获得。||
|**2024-10-25**|[Learned Reference-based Diffusion Sampling for multi-modal distributions](http://arxiv.org/abs/2410.19449)|null|在过去几年中,已经提出了一些利用基于分数的扩散方法从概率分布中采样的方法,即在无法获得精确样本的情况下,仅依靠对未归一化密度的评估。由此产生的采样器近似于噪声扩散过程的时间反转,将目标分布桥接到易于采样的基础分布。在实践中,这些方法的性能在很大程度上取决于关键的超参数,这些超参数需要真实样本才能进行精确调整。我们的工作旨在突出和解决这一基本问题,特别关注多模态分布,这对现有的采样方法提出了重大挑战。在现有方法的基础上,我们引入了基于学习参考的扩散采样器(LRDS),这是一种专门设计用于利用关于目标模态位置的先验知识的方法,以绕过超参数调整的障碍。LRDS 分两步进行:(i)学习位于高密度空间区域并针对多模态量身定制的样本上的参考扩散模型,以及(ii)使用该参考模型来促进基于扩散的采样器的训练。我们通过实验证明,在各种具有挑战性的分布上,与竞争算法相比,LRDS 最好地利用了目标分布的先验知识。||
|**2024-10-25**|[Generative Diffusion Models for Sequential Recommendations](http://arxiv.org/abs/2410.19429)|null|诸如变分自编码器 (VAE) 和生成对抗网络 (GAN) 等生成模型在序列推荐任务中已展现出前景。然而,它们也面临着挑战,包括后验坍缩和表示能力有限。Li 等人 (2023) 的工作引入了一种新颖的方法,利用扩散模型来应对这些挑战,将物品嵌入表示为分布而不是固定向量。这种方法允许更自适应地反映用户多样化的兴趣和物品的各个方面。在扩散阶段,模型通过添加噪声将目标物品嵌入转换为高斯分布,促进序列物品分布的表示并注入不确定性。然后,一个逼近器处理这个带有噪声的物品表示以重建目标物品。在反向阶段,模型利用用户的历史交互来逆转噪声,并通过舍入操作最终确定物品预测。这项研究对 DiffuRec 架构进行了增强,特别是在扩散过程中添加了偏移噪声以提高鲁棒性,并在逼近器中加入了交叉注意力机制以更好地捕获相关的用户-物品交互。这些贡献促成了一种名为 DiffuRecSys 的新模型的开发,该模型提高了性能。在三个公共基准数据集上进行的大量实验表明,这些改进增强了物品表示,有效地捕获了不同的用户偏好,并在序列推荐研究中优于现有基线。||
|**2024-10-24**|[MotionCLR: Motion Generation and Training-free Editing via Understanding Attention Mechanisms](http://arxiv.org/abs/2410.18977)|null|本研究深入探讨了人体动作生成的交互式编辑问题。以往的动作扩散模型缺乏对词级文本-动作对应关系的显式建模和良好的可解释性,从而限制了其细粒度的编辑能力。为了解决这个问题,我们提出了一个基于注意力的动作扩散模型,名为MotionCLR,它对注意力机制进行了清晰的建模(CLeaR)。从技术上讲,MotionCLR分别使用自注意力和交叉注意力机制对模态内和跨模态交互进行建模。更具体地说,自注意力机制旨在测量帧之间的序列相似性并影响运动特征的顺序。相比之下,交叉注意力机制致力于找到细粒度的词序列对应关系,并激活运动序列中相应的时刻。基于这些关键特性,我们开发了一套通用且简单有效的运动编辑方法,通过操纵注意力图来实现,例如运动(去)强调、原位运动替换和基于示例的动作生成等。为了进一步验证注意力机制的可解释性,我们还探索了通过注意力图进行动作计数和基于基础的动作生成的能力。我们的实验结果表明,我们的方法具有良好的生成和编辑能力以及良好的可解释性。||
|**2024-10-24**|[Unbounded: A Generative Infinite Game of Character Life Simulation](http://arxiv.org/abs/2410.18975)|null|我们引入了生成式无限游戏的概念,这是一种超越了有限的、硬编码的传统系统边界,使用生成模型的电子游戏。受James P. Carse的有限游戏和无限游戏区别的启发,我们利用生成式人工智能的最新进展创造了“无限”:一个完全封装在生成模型中的人物生活模拟游戏。“无限”从沙盒生活模拟游戏中汲取灵感,允许你通过喂养、玩耍和引导,与自主的虚拟角色在一个虚拟世界中互动——其开放式机制由大型语言模型生成,其中一些可能是涌现的。为了开发“无限”,我们提出了大型语言模型和视觉生成领域的技术创新。具体来说,我们提出了:(1)一个专门的、精简的大型语言模型(LLM),它可以实时动态地生成游戏机制、叙事和角色互动;(2)一个新的用于视觉模型的动态区域图像提示适配器(IP-Adapter),它确保了角色在多个环境中的视觉生成保持一致性和灵活性。我们通过定性和定量分析评估了我们的系统,结果表明,与传统的相关方法相比,在角色生活模拟、用户指令遵循、叙事连贯性以及角色和环境的视觉一致性方面都有显著改进。||
|**2024-10-24**|[3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation](http://arxiv.org/abs/2410.18974)|**[link](https://github.com/Lakonik/MVEdit)**|多视角图像扩散模型显著推进了开放域三维物体生成。然而,大多数现有模型依赖于缺乏固有三维偏差的二维网络架构,导致几何一致性受损。为了应对这一挑战,我们引入了3D-Adapter,一个插件模块,旨在将三维几何感知融入预训练的图像扩散模型。我们方法的核心是三维反馈增强:对于采样循环中的每个去噪步骤,3D-Adapter将中间多视角特征解码为一致的三维表示,然后重新编码渲染的RGBD视图,通过特征添加来增强预训练的基础模型。我们研究了3D-Adapter的两种变体:一种基于高斯 splatting 的快速前馈版本和一种利用神经场和网格的多功能免训练版本。我们广泛的实验表明,3D-Adapter不仅极大地提高了诸如Instant3D和Zero123++等文本到多视角模型的几何质量,还能够使用普通的文本到图像模型Stable Diffusion进行高质量的三维生成。此外,我们通过在文本到三维、图像到三维、文本到纹理和文本到头像任务中呈现高质量结果,展示了3D-Adapter广泛的应用潜力。||
|**2024-10-24**|[On the Crucial Role of Initialization for Matrix Factorization](http://arxiv.org/abs/2410.18965)|null|这项工作重新审视了经典的低秩矩阵分解问题,并揭示了初始化在塑造这种非凸非光滑优化收敛速度中的关键作用。我们引入了Nystrom初始化,它显著提高了缩放梯度下降(ScaledGD)在对称和非对称矩阵分解任务中的全局收敛性。具体来说,我们证明了在以前只知道线性收敛速度的情况下,使用Nystrom初始化的ScaledGD可以实现二次收敛。此外,我们将此初始化扩展到通常用于微调基础模型的低秩适配器(LoRA)。我们的方法NoRA,即带有Nystrom初始化的LoRA,在各种下游任务和模型规模(从10亿到70亿个参数)的大语言模型和扩散模型中展现出优越的性能。||
|**2024-10-24**|[Stable Consistency Tuning: Understanding and Improving Consistency Models](http://arxiv.org/abs/2410.18958)|**[link](https://github.com/G-U-N/Stable-Consistency-Tuning)**|扩散模型实现了卓越的生成质量,但由于去噪的迭代性质,生成速度较慢。相比之下,一致性模型作为一种新的生成模型系列,以显著更快的采样速度实现了具有竞争力的性能。这些模型要么通过一致性蒸馏(利用预训练的扩散模型)进行训练,要么直接从原始数据进行一致性训练/微调。在这项工作中,我们提出了一个新的框架来理解一致性模型,我们将扩散模型的去噪过程建模为马尔可夫决策过程 (MDP),并将一致性模型训练框架化为通过时间差学习 (TD Learning) 进行的价值估计。更重要的是,该框架使我们能够分析当前一致性训练/微调策略的局限性。在轻松一致性微调 (ECT) 的基础上,我们提出了稳定一致性微调 (SCT),它结合了使用分数恒等式的方差减少学习。SCT 在 CIFAR-10 和 ImageNet-64 等基准测试中带来了显著的性能提升。在 ImageNet-64 上,SCT 实现了 1 步 FID 2.42 和 2 步 FID 1.55,这是当前一致性模型的最佳结果。||
|**2024-10-24**|[Generation of synthetic financial time series by diffusion models](http://arxiv.org/abs/2410.18897)|null|尽管实际意义重大,但生成逼真的合成金融时间序列仍然具有挑战性,这是由于其统计特性,即所谓的程式化事实,例如厚尾、波动率聚集和季节性模式。各种生成模型,包括生成对抗网络 (GAN) 和变分自编码器 (VAE),已被用于解决这一挑战,尽管目前还没有模型能够满足所有程式化事实。我们提出另一种方法,利用扩散模型,特别是去噪扩散概率模型 (DDPM),来生成合成金融时间序列。这种方法采用小波变换将多个时间序列(例如股票价格、交易量和价差)转换为图像。给定这些转换后的图像,该模型能够生成可以通过逆小波变换转换回逼真的时间序列的图像。我们证明了我们提出的方法满足程式化事实。||
|**2024-10-24**|[Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences](http://arxiv.org/abs/2410.18881)|null|一步文本到图像生成模型具有推理效率高、架构灵活以及最先进的生成性能等优势。本文首次研究了一步生成模型与人类偏好的对齐问题。受人类反馈强化学习 (RLHF) 的成功启发,我们将对齐问题表述为最大化预期人类奖励函数,同时添加一个积分 Kullback-Leibler 散度项以防止生成器偏离。通过克服技术挑战,我们引入了 Diff-Instruct++ (DI++),这是第一个快速收敛且无需图像数据的人类偏好对齐方法,适用于一步文本到图像生成器。我们还引入了新的理论见解,表明使用 CFG 进行扩散蒸馏实际上是在使用 DI++ 进行 RLHF。这一有趣的发现有助于理解和促进未来涉及 CFG 的研究。在实验部分,我们使用 DI++ 对齐了基于 UNet 和基于 DiT 的一步生成器,它们分别使用 Stable Diffusion 1.5 和 PixArt-$\alpha$ 作为参考扩散过程。由此产生的基于 DiT 的一步文本到图像模型在 COCO 验证提示数据集上实现了 6.19 的高美学得分和 1.24 的图像奖励。它还实现了领先的人类偏好得分 (HPSv2.0) 28.48,优于其他开源模型,如 Stable Diffusion XL、DMD2、SD-Turbo 以及 PixArt-$\alpha$。理论贡献和实证证据都表明,DI++ 是一种强大的人类偏好对齐方法,适用于一步文本到图像模型。||
|**2024-10-24**|[The Cat and Mouse Game: The Ongoing Arms Race Between Diffusion Models and Detection Methods](http://arxiv.org/abs/2410.18866)|null|扩散模型的出现改变了合成媒体生成领域,在内容创作方面提供了无与伦比的真实感和控制力。这些进步推动了艺术、设计和科学可视化等领域的创新。然而,它们也带来了重大的伦理和社会挑战,特别是通过创建超逼真图像,这些图像可能助长深度伪造、虚假信息和未经授权的版权材料复制。因此,对有效检测机制的需求变得日益迫切。本综述探讨了扩散模型发展与检测方法进步之间不断演变的对抗关系。我们对当代检测策略进行了全面分析,包括频域和空域技术、基于深度学习的方法以及结合多种方法的混合模型。我们还强调了多样化数据集和标准化评估指标在提高检测精度和泛化能力方面的重要性。我们的讨论探讨了这些检测系统在版权保护、虚假信息预防和取证分析中的实际应用,同时也探讨了合成媒体的伦理影响。最后,我们确定了关键的研究差距,并提出了未来发展方向,以增强检测方法的鲁棒性和适应性,使其与扩散模型的快速发展保持同步。本综述强调了在日益数字化的世界中,采取全面方法来降低与人工智能生成内容相关的风险的必要性。||
|**2024-10-24**|[From Efficiency to Equity: Measuring Fairness in Preference Learning](http://arxiv.org/abs/2410.18841)|null|随着人工智能系统,特别是生成模型,越来越多地影响决策,确保它们能够公平地代表不同的人类偏好变得至关重要。本文介绍了一个新的框架,用于评估偏好学习模型中的认知公平性,其灵感来自经济学中的不平等理论和罗尔斯主义的正义理论。我们提出了根据基尼系数、阿特金森指数和库兹涅茨比率改编的指标来量化这些模型的公平性。我们使用两个数据集验证了我们的方法:一个自定义的视觉偏好数据集 (AI-EDI-Space) 和 Jester Jokes 数据集。我们的分析揭示了模型性能在不同用户之间的差异,突出了潜在的认知不公正现象。我们探索了预处理和进程中技术来减轻这些不平等,证明了模型效率和公平性之间的复杂关系。这项工作通过提供一个评估和改进偏好学习模型中认知公平性的框架,为人工智能伦理做出了贡献,为在人类偏好多样性至关重要的环境中开发更具包容性的人工智能系统提供了见解。||
|**2024-10-24**|[Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation](http://arxiv.org/abs/2410.18830)|null|Diffusion models have recently gained recognition for generating diverse and high-quality content, especially in the domain of image synthesis. These models excel not only in creating fixed-size images but also in producing panoramic images. However, existing methods often struggle with spatial layout consistency when producing high-resolution panoramas, due to the lack of guidance of the global image layout. In this paper, we introduce the Multi-Scale Diffusion (MSD) framework, a plug-and-play module that extends the existing panoramic image generation framework to multiple resolution levels. By utilizing gradient descent techniques, our method effectively incorporates structural information from low-resolution images into high-resolution outputs. A comprehensive evaluation of the proposed method was conducted, comparing it with the prior works in qualitative and quantitative dimensions. The evaluation results demonstrate that our method significantly outperforms others in generating coherent high-resolution panoramas.||
|**2024-10-22**|[Creativity in AI: Progresses and Challenges](http://arxiv.org/abs/2410.17218)|null|创造力是产生新颖、有用和令人惊讶的想法的能力,并且作为人类认知的一个重要方面已被广泛研究。另一方面,机器创造力一直是一项长期挑战。随着高级生成式人工智能的兴起,人们对人工智能的创造能力重新产生了兴趣和争论。因此,有必要重新审视人工智能创造力的现状,并确定关键进展和仍然存在的挑战。在这项工作中,我们调查了研究人工智能系统创造能力的主要工作,重点关注创造性问题解决、语言、艺术和科学创造力。我们的综述表明,虽然最新的人工智能模型在很大程度上能够生成具有语言和艺术创造力的输出,如诗歌、图像和音乐作品,但它们在需要创造性问题解决、抽象思维和组合性的任务中却步履维艰,而且它们的生成缺乏多样性、原创性和长期连贯性,并存在幻觉问题。我们还讨论了与生成模型相关的版权和作者身份问题。此外,我们强调需要对创造力进行全面的评估,这种评估应以流程为导向,并考虑创造力的多个维度。最后,我们从认知科学和心理学中汲取灵感,提出了未来改进人工智能输出创造力的研究方向。||
|**2024-10-22**|[Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding](http://arxiv.org/abs/2410.17173)|**[link](https://github.com/flagshippioneering/pi-rldif)**|蛋白质逆折叠,即预测折叠成所需 3D 结构的氨基酸序列,是基于结构的蛋白质设计中的一个重要问题。基于机器学习的逆折叠方法通常使用原始序列的恢复作为优化目标。然而,逆折叠是一个一对多问题,其中多个序列可以折叠成相同的结构。此外,对于许多实际应用来说,拥有多个折叠成目标结构的不同序列通常是可取的,因为它允许为下游优化提供更多候选序列。在这里,我们证明,尽管最近的逆折叠方法显示出更高的序列恢复率,但它们的“可折叠多样性”——即它们生成多个折叠成与目标一致的结构的非相似序列的能力——并没有提高。为了解决这个问题,我们提出了 RL-DIF,一种用于逆折叠的分类扩散模型,该模型在序列恢复上进行了预训练,并通过强化学习对结构一致性进行了调整。我们发现 RL-DIF 实现了与基准模型相当的序列恢复和结构一致性,但显示出更大的可折叠多样性:实验表明 RL-DIF 在 CATH 4.2 上可以实现 29% 的可折叠多样性,而使用相同数据集训练的模型为 23%。PyTorch 模型权重和采样代码可在 GitHub 上获取。||
|**2024-10-22**|[Hybrid Generative AI for De Novo Design of Co-Crystals with Enhanced Tabletability](http://arxiv.org/abs/2410.17005)|**[link](https://github.com/ai-chem/gemcode)**|共晶化是控制有机晶体物理化学性质的一种便捷方法,在生物医学领域有着广泛的应用。本研究提出了一种名为“生成式共晶设计”(GEMCODE)的新型自动化共晶筛选流程,该流程基于深度生成模型和进化优化的混合,以更广泛地探索目标化学空间。GEMCODE能够快速地从头设计具有目标成片性的共晶,这对药物开发至关重要。通过一系列突出验证和发现案例的实验研究,我们证明了GEMCODE即使在现实的计算限制下也是有效的。此外,我们还探索了语言模型在生成共晶方面的潜力。最后,我们展示了GEMCODE预测的许多以前未知的共晶,并讨论了其在加速药物开发方面的潜力。||
|**2024-10-22**|[DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization](http://arxiv.org/abs/2410.16942)|null|扩散模型凭借其出色的能力在图像生成领域取得了显著进展。然而,由于推理过程中需要多步去噪,这些模型需要大量的计算资源。虽然传统的剪枝方法已被用于优化这些模型,但重新训练过程需要大规模的训练数据集和大量的计算成本才能保持泛化能力,这既不方便也不高效。最近的研究试图利用相邻去噪阶段特征的相似性,通过简单、静态的策略来降低计算成本。然而,这些策略不能充分利用相邻时间步中相似特征模式的潜力。在这项工作中,我们提出了一种新的剪枝方法,该方法通过更智能、可微分的剪枝器得到一个高效的扩散模型。我们的方法的核心是将模型剪枝过程转化为子网络搜索过程。具体来说,我们首先在标准扩散的基础上引入了一个超级网络,通过添加一些基于相似特征的备份连接。然后,我们构建了一个插件式的剪枝器网络,并设计了优化损失来识别冗余计算。最后,我们的方法可以通过少量的梯度优化和简单的后处理步骤来确定一个最优的子网络。我们在包括稳定扩散系列和 DiT 在内的各种扩散模型上进行了广泛的实验。我们的 DiP-GO 方法在不损失准确率的情况下,实现了 SD-1.5 的 4.4 倍加速,显著优于以往最先进的方法。||
|**2024-10-22**|[Hierarchical Clustering for Conditional Diffusion in Image Generation](http://arxiv.org/abs/2410.16910)|**[link](https://github.com/jogo175/treediffusion)**|寻找具有相似特征的数据点簇并生成新的簇特定样本可以显著增强我们对复杂数据分布的理解。虽然已经使用变分自编码器对聚类进行了广泛的探索,但这些模型在现实世界的数据集中通常缺乏生成质量。本文通过引入 TreeDiffusion 来解决这一差距,TreeDiffusion 是一种深度生成模型,它使扩散模型以层次聚类为条件,以获得高质量的、特定于聚类的生成结果。所提出的流程包括两个步骤:一个基于 VAE 的聚类模型,学习数据的层次结构;以及一个条件扩散模型,为每个聚类生成逼真的图像。我们提出这个两阶段过程,以确保生成的样本保持其各自聚类的代表性,并将图像保真度提高到扩散模型的水平。我们方法的一个关键优势是它能够为每个聚类创建图像,通过定性结果证明,可以更好地可视化聚类模型学习到的表示。这种方法有效地解决了基于 VAE 的方法的生成限制,同时保留了它们的聚类性能。根据经验,我们证明了使扩散模型以层次聚类为条件可以显著提高生成性能,从而推进了生成聚类模型的发展。||
|**2024-10-22**|[Bayes without Underfitting: Fully Correlated Deep Learning Posteriors via Alternating Projections](http://arxiv.org/abs/2410.16901)|null|贝叶斯深度学习经常出现欠拟合问题,导致贝叶斯预测的准确性低于简单的点估计。因此,不确定性量化是以牺牲准确性为代价的。对于线性化模型,广义高斯-牛顿矩阵的零空间对应于保留点估计的训练预测的参数。我们建议在这个零空间中构建贝叶斯近似,从而保证贝叶斯预测不会欠拟合。我们提出了一种用于投影到该零空间的无矩阵算法,该算法的规模与参数数量呈线性关系,与输出维度数量呈平方关系。为了使该方法适用于生成模型,我们进一步提出了一种仅与参数呈线性关系的近似方法。广泛的实证评估表明,该方法可扩展到大型模型,包括具有 2800 万个参数的视觉Transformer。||
|**2024-10-22**|[VistaDream: Sampling multiview consistent images for single-view scene reconstruction](http://arxiv.org/abs/2410.16892)|null|在本文中,我们提出了VistaDream,这是一个从单视图图像重建三维场景的新框架。最近的扩散模型能够从单视图输入图像生成高质量的新视图图像。大多数现有方法只专注于建立输入图像和生成图像之间的一致性,而忽略了生成图像之间的一致性。VistaDream 通过两阶段流水线解决了这个问题。在第一阶段,VistaDream 首先通过稍微缩小视野、补全(inpainting)边界并估计深度图来构建全局粗糙三维框架。然后,在这个全局框架上,我们使用基于迭代扩散的RGB-D修复来生成新视图图像,以修复框架中的孔洞。在第二阶段,我们通过一种新的无需训练的多视图一致性采样(MCS)进一步增强了生成的新视图图像之间的一致性,该采样在扩散模型的反向采样过程中引入了多视图一致性约束。实验结果表明,在没有训练或微调现有扩散模型的情况下,VistaDream仅使用单视图图像就能实现一致且高质量的新视图合成,并且大幅度优于基线方法。代码、视频和交互式演示可在https://vistadream-project-page.github.io/获取。||
|**2024-10-22**|[CK4Gen: A Knowledge Distillation Framework for Generating High-Utility Synthetic Survival Datasets in Healthcare](http://arxiv.org/abs/2410.16872)|null|由于隐私法规的严格限制,获取真实的临床数据非常困难,这阻碍了医疗保健研究和教育的发展。这些限制减缓了新疗法和数据驱动型医疗解决方案的开发进程,同时也限制了学生接触真实世界数据集的机会,使他们缺乏必要的实践技能。因此,高实用性的合成数据集对于推进研究和提供有意义的培训材料至关重要。然而,当前的生成模型——例如变分自动编码器 (VAE) 和生成对抗网络 (GAN)——以牺牲医疗实用性为代价来产生表面上的真实感,混合不同的患者特征,并产生实际相关性有限的合成数据。为了克服这些限制,我们引入了 CK4Gen(Cox Knowledge for Generation),这是一种利用 Cox 比例风险 (CoxPH) 模型中的知识蒸馏来创建合成生存数据集的新框架,该框架保留了关键的临床特征,包括风险比和生存曲线。CK4Gen 通过维护不同的患者风险特征来避免 VAE 和 GAN 中出现的插值问题,确保为研究和教育用途提供真实可靠的输出。CK4Gen 在四个基准数据集(GBSG2、ACTG320、WHAS500 和 FLChain)中得到验证,通过更好地对齐真实数据和合成数据,通过数据增强提高了生存模型在区分和校准方面的性能,优于竞争技术。由于 CK4Gen 可扩展到各种临床条件,并且代码将公开可用,因此未来的研究人员可以将其应用于自己的数据集,以生成适合公开共享的合成版本。||
|**2024-10-22**|[MPDS: A Movie Posters Dataset for Image Generation with Diffusion Model](http://arxiv.org/abs/2410.16840)|null|电影海报对于吸引观众、传达主题和推动电影行业的市场竞争至关重要。虽然传统的设计费时费力,但智能生成技术可以提高效率并改进设计。尽管图像生成取得了令人兴奋的进展,但目前的模型在生成令人满意的海报结果方面往往存在不足。主要问题在于缺乏用于模型训练的专门海报数据集。在这项工作中,我们提出了一个电影海报数据集(MPDS),专为文本到图像生成模型而设计,旨在彻底改变海报制作。作为致力于海报的数据集,据我们所知,MPDS 是第一个图像-文本对数据集,由 37.3 万多个图像-文本对和 8 千多张演员图像(涵盖 4 千多名演员)组成。详细的海报描述,如电影标题、类型、演员阵容和剧情梗概,都根据公开的电影梗概(也称为电影梗概提示)进行了精心组织和标准化。为了增强海报描述并减少与电影梗概的差异,我们利用大型视觉语言模型自动生成每个海报的视觉感知提示,然后进行手动校正并与电影梗概提示进行整合。此外,我们还引入了一个海报标题提示,以展示海报中的文本元素,如演员姓名和电影标题。对于电影海报生成,我们开发了一个多条件扩散框架,将海报提示、海报标题和演员图像(用于个性化)作为输入,通过学习扩散模型产生出色的结果。实验表明,我们提出的 MPDS 数据集在推进个性化电影海报生成方面发挥着重要作用。MPDS 可在 https://anonymous.4open.science/r/MPDS-373k-BD3B 获取。||
|**2024-10-22**|[Bridging Search and Recommendation in Generative Retrieval: Does One Task Help the Other?](http://arxiv.org/abs/2410.16823)|null|生成式检索作为一种用于搜索和推荐的新兴范式,为传统的依赖外部索引和最近邻搜索的检索方法提供了一种替代方案。生成式模型直接将输入与项目ID相关联。鉴于大型语言模型(LLM)的突破,这些生成式系统可以在集中各种信息检索(IR)任务方面发挥至关重要的作用,在一个模型中执行查询理解、检索、推荐、解释、重新排序和响应生成等任务。尽管人们对这种用于信息检索系统的统一生成方法越来越感兴趣,但在文献中,使用单一、多任务模型优于多个专用模型的优势尚未得到很好的证实。本文探讨了这种统一的方法是否以及何时能够在搜索和推荐的信息检索任务中胜过特定于任务的模型,这些任务广泛存在于多个工业在线平台中,如Spotify、YouTube和Netflix。先前的工作表明:(1)生成式推荐系统学习到的项目潜在表示偏向于流行度,以及(2)基于内容和基于协同过滤的信息可以改进项目的表示。受此启发,我们的研究以两个假设为指导:[H1]联合训练规范了每个项目流行度的估计,以及[H2]联合训练规范了项目的潜在表示,其中搜索捕获项目的基于内容的方面,推荐捕获基于协同过滤的方面。我们使用模拟数据和真实世界数据进行的大量实验都支持[H1]和[H2],认为它们是统一搜索和推荐生成模型相对于单任务方法所观察到的有效性改进的关键因素。||
|**2024-10-18**|[BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities](http://arxiv.org/abs/2410.14672)|**[link](https://github.com/haoosz/BiGR)**|我们介绍了一种名为 BiGR 的新型条件图像生成模型,它使用紧凑的二进制潜码进行生成训练,专注于增强生成和表示能力。BiGR 是第一个将生成和判别统一在同一框架内的条件生成模型。BiGR 具有二进制分词器、掩码建模机制和用于二进制代码预测的二进制转换器。此外,我们引入了一种新颖的熵排序采样方法,以实现高效的图像生成。大量实验验证了 BiGR 在生成质量(通过 FID-50k 衡量)和表示能力(通过线性探针精度证明)方面的优越性能。此外,BiGR 展示了跨各种视觉任务的零样本泛化能力,可在无需结构修改的情况下实现图像修复、扩展、编辑、插值和丰富等应用。我们的研究结果表明,BiGR 有效地统一了生成和判别任务,为该领域的进一步发展铺平了道路。||
|**2024-10-18**|[How Does Data Diversity Shape the Weight Landscape of Neural Networks?](http://arxiv.org/abs/2410.14602)|null|为了增强机器学习模型对未见数据的泛化能力,通常采用dropout、权重衰减($L_2$ 正则化)和噪声增强等技术。正则化方法(即dropout和权重衰减)旨在调整模型参数以防止过拟合,而数据增强则增加了输入训练集的多样性,这是一种据称可以提高准确性并改善校准误差的方法。在本文中,我们研究了这些技术各自对神经网络参数空间的影响,目的是了解它们如何在迁移学习场景中改变权重情况。为此,我们采用随机矩阵理论分析了使用这些技术进行微调的预训练模型的特征值分布,这些模型使用不同级别的数据多样性,用于相同的下游任务。我们观察到,多样化数据对权重情况的影响与dropout类似。此外,我们将常用的数据增强方法与生成模型创建的合成数据进行了比较。我们得出结论,合成数据可以为真实输入数据带来更多样性,从而在分布外测试实例上获得更好的性能。||
|**2024-10-18**|[Bayesian Multi-wavelength Imaging of the LMC SN1987A with SRG/eROSITA](http://arxiv.org/abs/2410.14599)|null|EDR和eRASS1数据已经揭示了大量未被发现的X射线源。利用贝叶斯推理和X射线成像的生成模型技术,我们的目标是通过对X射线天空进行去噪、反卷积和分解来提高这些观测的灵敏度和科学价值。利用信息场理论,我们可以利用天空不同物理成分的空间和光谱相关结构以及非参数先验来增强图像重建。通过将仪器效应纳入正演模型,我们为eROSITA指向观测开发了一种全面的贝叶斯成像算法。最后,我们将开发的算法应用于大麦哲伦星云SN1987A的EDR数据,融合了五个不同望远镜模块观测到的数据集。最终结果是一个去噪、去卷积和分解的大麦哲伦星云视图,它可以分析其精细结构,创建该区域的点源目录,并为未来的工作增强校准。||
|**2024-10-18**|[Neuro-Symbolic Traders: Assessing the Wisdom of AI Crowds in Markets](http://arxiv.org/abs/2410.14587)|null|深度生成模型正越来越多地被用作金融分析工具。然而,目前尚不清楚这些模型将如何影响金融市场,尤其是在它们以半自主的方式推断金融价值的情况下。在这项工作中,我们探讨了深度生成模型与市场动态之间的相互作用。我们开发了一种虚拟交易者,他们使用深度生成模型进行买卖决策,我们称之为神经符号交易者,并将其暴露在虚拟市场中。在我们的框架下,神经符号交易者是使用视觉语言模型来发现资产基本价值模型的代理。代理将此模型开发为随机微分方程,使用梯度下降校准市场数据。我们在合成数据和真实金融时间序列(包括股票、商品和外汇对)上测试了我们的神经符号交易者。然后,我们将几组神经符号交易者置于虚拟市场环境中。这种市场环境允许交易者对基础价值的信念与观察到的价格动态之间进行反馈。我们发现,与历史数据相比,这会导致价格抑制,突出了未来市场稳定的风险。我们的工作是量化深度生成代理对市场动态影响的第一步,并阐述了这种方法未来的一些潜在风险和收益。||
|**2024-10-18**|[Multi-modal Pose Diffuser: A Multimodal Generative Conditional Pose Prior](http://arxiv.org/abs/2410.14540)|null|SMPL (Skinned Multi-Person Linear) 模型在 3D 人体姿态估计中扮演着至关重要的角色,它提供了一种简化但有效的人体表示方法。然而,在诸如人体网格回归等任务中,确保 SMPL 配置的有效性仍然是一项重大挑战,这凸显了对能够辨别人体姿态真实性的鲁棒人体姿态先验的需求。为了解决这个问题,我们引入了 MOPED(Multi-mOdal PosE Diffuser)。MOPED 是第一个利用新型多模态条件扩散模型作为 SMPL 姿态参数先验的方法。我们的方法提供了强大的无条件姿态生成能力,并能够以图像和文本等多模态输入作为条件。这种能力通过结合传统姿态先验中经常忽略的额外上下文信息,增强了我们方法的适用性。我们在姿态估计、姿态去噪和姿态补全这三个不同任务上的大量实验表明,我们基于多模态扩散模型的先验明显优于现有方法。这些结果表明,我们的模型捕获了更广泛的合理人体姿态。||
|**2024-10-18**|[LEAD: Latent Realignment for Human Motion Diffusion](http://arxiv.org/abs/2410.14508)|null|我们的目标是从自然语言生成逼真的人体动作。现代方法通常在模型表达能力和文本到动作的对齐之间进行权衡。一些方法对齐文本和动作的潜在空间,但牺牲了表达能力;另一些方法依赖于扩散模型,产生令人印象深刻的动作,但其潜在空间缺乏语义。这可能会损害真实性、多样性和适用性。在这里,我们通过将潜在扩散与重新对齐机制相结合来解决这个问题,产生一个新颖的、语义结构化的空间,该空间编码语言的语义。利用这种能力,我们引入了文本动作反演的任务,以从几个例子中捕捉新的动作概念。对于动作合成,我们在 HumanML3D 和 KIT-ML 上评估了 LEAD,并在真实性、多样性和文本-动作一致性方面表现出与最先进技术相当的性能。我们的定性分析和用户研究表明,与现代方法相比,我们合成的动作更清晰、更像人,并且更符合文本。对于动作文本反演,与传统的变分自编码器相比,我们的方法在捕捉分布外特征方面表现出更高的能力。||
|**2024-10-18**|[Reinforcement Learning in Non-Markov Market-Making](http://arxiv.org/abs/2410.14504)|null|我们开发了一个深度强化学习 (RL) 框架,用于解决最优做市 (MM) 交易问题,特别关注具有半马尔可夫和霍克斯跳跃扩散动力学的价格过程。我们首先讨论了 RL 的基础知识以及所使用的深度 RL 框架,其中我们部署了最先进的软行动者-评论家 (SAC) 算法进行深度学习部分。SAC 算法是一种离线策略熵最大化算法,更适合解决具有连续状态和动作空间的复杂、高维问题,例如最优做市 (MM)。我们介绍了所考虑的最优 MM 问题,详细说明了用于设置模拟此策略的环境的所有确定性和随机过程。在这里,我们还深入概述了使用的跳跃扩散定价动态、我们处理限价订单簿中逆向选择的方法,并重点介绍了优化问题的各个组成部分。接下来,我们讨论了训练和测试结果,并通过图表展示了重要的确定性和随机过程(例如买卖价差、交易执行、库存和奖励函数)是如何演变的。我们还讨论了这些结果的局限性,这些是大多数扩散模型在此设置中需要注意的重要点。||
|**2024-10-18**|[Data-driven topology design with persistent homology for enhancing population diversity](http://arxiv.org/abs/2410.14496)|null|本文提出了一种选择策略,用于增强数据驱动拓扑设计 (DDTD) 中的种群多样性,DDTD 是一种基于进化算法 (EA) 并使用深度生成模型的拓扑优化框架。虽然种群多样性对于 EA 的全局搜索至关重要,但由于设计变量空间的高维性和评估函数的强非线性,基于目标值保留多样性解决方案的传统选择算子仍可能导致拓扑优化问题中的种群多样性丧失。基于拓扑结构是材料分布之间固有多样性特征的理念,我们采用了一种称为持久同源性的拓扑数据分析方法。作为一项具体操作,在持久图之间引入了 Wasserstein 距离排序到选择算法中,以保持内在的种群多样性。我们将结合到 DDTD 中的所提出的选择操作应用于基于应力的拓扑优化问题作为数值示例。结果证实,可以使用持久同源性分析拓扑结构,并且所提出的选择操作显着提高了 DDTD 的搜索性能。||
|**2024-10-18**|[ANT: Adaptive Noise Schedule for Time Series Diffusion Models](http://arxiv.org/abs/2410.14488)|**[link](https://github.com/seunghan96/ant)**|生成式人工智能中扩散模型的进步最近已经扩展到时间序列(TS)领域,在各种任务上展现出最先进的性能。然而,先前关于时间序列扩散模型的研究工作往往借鉴了其他领域现有工作的框架,而没有考虑时间序列数据的特点,导致性能欠佳。在本研究中,我们提出了时间序列扩散模型的自适应噪声调度(ANT),它可以根据给定时间序列数据集的非平稳性统计数据,自动预先确定合适的噪声调度方案。我们的直觉是,一个最优的噪声调度方案应该满足以下要求:1)线性降低时间序列数据的非平稳性,使所有扩散步骤都具有同等意义;2)在最后一步将数据破坏为随机噪声;3)步骤数量足够多。所提出的方法具有很强的实用性,因为它消除了寻找最佳噪声调度的必要性,只需额外计算给定数据集的统计数据即可,这可以在训练前离线完成。我们在不同领域的数据集上验证了我们方法在各种任务上的有效性,包括时间序列预测、细化和生成。代码可在以下存储库中找到:https://github.com/seunghan96/ANT。||
|**2024-10-18**|[CaTs and DAGs: Integrating Directed Acyclic Graphs with Transformers and Fully-Connected Neural Networks for Causally Constrained Predictions](http://arxiv.org/abs/2410.14485)|**[link](https://github.com/matthewvowels1/causal_transformer)**|人工神经网络 (ANN),包括全连接网络和 Transformer,是高度灵活且强大的函数逼近器,广泛应用于计算机视觉和自然语言处理等领域。然而,它们天生无法遵循因果结构,这限制了它们的鲁棒性,使其容易受到协变量偏移的影响,并且难以解释。这对它们在现实应用中的可靠性构成了重大挑战。在本文中,我们介绍了因果全连接神经网络 (CFCN) 和因果 Transformer (CaT),这是两个通用的模型系列,旨在根据预定义的因果约束(由有向无环图 (DAG) 指定)进行操作。这些模型保留了传统神经网络强大的函数逼近能力,同时遵循底层结构约束,提高了推理时的鲁棒性、可靠性和可解释性。这种方法为在鲁棒性和可解释性至关重要的更苛刻的现实场景中部署神经网络开辟了新途径。||
|**2024-10-17**|[Diffusing States and Matching Scores: A New Framework for Imitation Learning](http://arxiv.org/abs/2410.13855)|**[link](https://github.com/ziqian2000/smiling)**|对抗性模仿学习传统上被构建为学习器和对抗性选择的成本函数之间的两人零和博弈,因此可以被认为是生成对抗网络 (GAN) 的顺序泛化。这种框架的一个突出例子是生成对抗性模仿学习 (GAIL)。然而,近年来,扩散模型已成为 GAN 的非对抗性替代方案,它只需要通过回归训练一个评分函数,就能产生更高质量的生成结果。为此,我们研究了如何将扩散模型的见解提升到序列设置中。我们建议对状态进行扩散,并沿扩散后的状态执行分数匹配,以衡量专家和学习者状态之间的差异。因此,我们的方法只需要训练评分函数以通过标准回归来预测噪声,这使得它比对抗性方法更容易训练且更稳定。理论上,我们证明了随时间跨度线性缩放的一阶和二阶实例相关误差界,表明我们的方法避免了阻碍离线模仿学习方法的复合误差。根据经验,我们展示了我们的方法在各种连续控制问题上优于 GAN 风格的模仿学习基线,包括控制仿人机器人行走、坐下和爬行的复杂任务。||
|**2024-10-17**|[Influence Functions for Scalable Data Attribution in Diffusion Models](http://arxiv.org/abs/2410.13850)|null|扩散模型在生成式建模方面取得了显著进展。然而,它们的广泛应用对数据溯源和可解释性提出了挑战。在本文中,我们的目标是通过开发一个\textit{影响函数}框架来帮助解决扩散模型中的此类挑战。基于影响函数的数据溯源方法近似于如果删除某些训练数据,模型的输出将如何变化。在监督学习中,这通常用于预测特定样本的损失将如何变化。对于扩散模型,我们专注于通过几个代理指标来预测生成特定样本的概率变化。我们展示了如何为此类量制定影响函数,以及如何将先前提出的方法解释为我们框架中的特定设计选择。为了确保影响函数中Hessian计算的可扩展性,我们系统地开发了基于广义高斯-牛顿矩阵的K-FAC近似,专门针对扩散模型量身定制。我们将先前提出的方法重新定义为我们框架中的特定设计选择,并表明我们推荐的方法在常见评估中优于先前的数据溯源方法,例如线性数据建模分数(LDS)或不包括顶部影响的重新训练,而无需针对特定方法进行超参数调整。||
|**2024-10-17**|[VidPanos: Generative Panoramic Videos from Casual Panning Videos](http://arxiv.org/abs/2410.13832)|null|全景图像拼接提供了一种统一的广角场景视图,超越了相机的视野范围。将平移视频的帧拼接成全景照片对于静态场景来说是一个很好理解的问题,但是当物体移动时,静态全景图无法捕捉场景。我们提出了一种从随意拍摄的平移视频合成全景视频的方法,就好像原始视频是用广角相机拍摄的一样。我们将全景合成视为一个时空外推问题,目标是创建一个与输入视频长度相同的完整全景视频。时空体积的一致性完成需要对视频内容和运动进行强大而真实的先验,为此我们采用了生成式视频模型。然而,现有的生成式模型并不能立即扩展到全景补全,正如我们所展示的那样。相反,我们将视频生成作为全景合成系统的一个组成部分,并演示了如何在最大限度地减少其局限性的同时利用模型的优势。我们的系统可以为各种野外场景创建视频全景图,包括人、车辆和流动的水,以及静止的背景特征。||
|**2024-10-17**|[Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning](http://arxiv.org/abs/2410.13823)|**[link](https://github.com/junzhin/dgm-vlc)**|深度生成模型通过增强数据集的大小和质量,极大地促进了医学图像分析的发展。除了单纯的数据增强之外,我们研究的重点在于深度生成模型的另一个重要能力:揭示和展示医学图像中的模式。我们采用了一种具有混合条件的生成结构,结合临床数据和分割掩码来指导图像合成过程。此外,我们创新地将表格化的临床数据转换为文本描述。这种方法简化了缺失值的处理,并使我们能够利用大型预训练的视觉语言模型,这些模型可以研究独立临床条目之间的关系,并理解性别和吸烟状况等一般术语。由于我们的临床信息与图像之间的视觉相关性较低,因此我们的方法不同于传统的医学报告指导的合成,并且提出了一项更具挑战性的任务。为了克服这个问题,我们引入了一种文本-视觉嵌入机制来加强条件,确保网络有效地利用所提供的信息。我们的流程可推广到基于 GAN 的模型和扩散模型。在胸部 CT 上进行的实验(特别关注吸烟状况)表明,肺部出现了一致的强度变化,这与临床观察结果一致,表明我们的方法可以有效地捕捉和可视化特定属性对医学图像模式的影响。我们的方法为利用深度生成模型早期检测和精确可视化复杂的临床状况开辟了新的途径。所有代码均可在 https://github.com/junzhin/DGM-VLC 获取。||
|**2024-10-17**|[ConsisSR: Delving Deep into Consistency in Diffusion-based Image Super-Resolution](http://arxiv.org/abs/2410.13807)|null|Real-world image super-resolution (Real-ISR) aims at restoring high-quality (HQ) images from low-quality (LQ) inputs corrupted by unknown and complex degradations. In particular, pretrained text-to-image (T2I) diffusion models provide strong generative priors to reconstruct credible and intricate details. However, T2I generation focuses on semantic consistency while Real-ISR emphasizes pixel-level reconstruction, which hinders existing methods from fully exploiting diffusion priors. To address this challenge, we introduce ConsisSR to handle both semantic and pixel-level consistency. Specifically, compared to coarse-grained text prompts, we exploit the more powerful CLIP image embedding and effectively leverage both modalities through our Hybrid Prompt Adapter (HPA) for semantic guidance. Secondly, we introduce Time-aware Latent Augmentation (TALA) to mitigate the inherent gap between T2I generation and Real-ISR consistency requirements. By randomly mixing LQ and HQ latent inputs, our model not only handles timestep-specific diffusion noise but also refines the accumulated latent representations. Last but not least, our GAN-Embedding strategy employs the pretrained Real-ESRGAN model to refine the diffusion start point. This accelerates the inference process to 10 steps while preserving sampling quality, in a training-free manner. Our method demonstrates state-of-the-art performance among both full-scale and accelerated models. The code will be made publicly available.||
|**2024-10-17**|[Probing the Latent Hierarchical Structure of Data via Diffusion Models](http://arxiv.org/abs/2410.13770)|null|High-dimensional data must be highly structured to be learnable. Although the compositional and hierarchical nature of data is often put forward to explain learnability, quantitative measurements establishing these properties are scarce. Likewise, accessing the latent variables underlying such a data structure remains a challenge. In this work, we show that forward-backward experiments in diffusion-based models, where data is noised and then denoised to generate new samples, are a promising tool to probe the latent structure of data. We predict in simple hierarchical models that, in this process, changes in data occur by correlated chunks, with a length scale that diverges at a noise level where a phase transition is known to take place. Remarkably, we confirm this prediction in both text and image datasets using state-of-the-art diffusion models. Our results show how latent variable changes manifest in the data and establish how to measure these effects in real data using diffusion models.||
|**2024-10-17**|[Theory on Score-Mismatched Diffusion Models and Zero-Shot Conditional Samplers](http://arxiv.org/abs/2410.13746)|null|The denoising diffusion model has recently emerged as a powerful generative technique, capable of transforming noise into meaningful data. While theoretical convergence guarantees for diffusion models are well established when the target distribution aligns with the training distribution, practical scenarios often present mismatches. One common case is in zero-shot conditional diffusion sampling, where the target conditional distribution is different from the (unconditional) training distribution. These score-mismatched diffusion models remain largely unexplored from a theoretical perspective. In this paper, we present the first performance guarantee with explicit dimensional dependencies for general score-mismatched diffusion samplers, focusing on target distributions with finite second moments. We show that score mismatches result in an asymptotic distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions. This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise. Interestingly, the derived convergence upper bound offers useful guidance for designing a novel bias-optimal zero-shot sampler in linear conditional models that minimizes the asymptotic bias. For such bias-optimal samplers, we further establish convergence guarantees with explicit dependencies on dimension and conditioning, applied to several interesting target distributions, including those with bounded support and Gaussian mixtures. Our findings are supported by numerical studies.||
|**2024-10-17**|[Improved Convergence Rate for Diffusion Probabilistic Models](http://arxiv.org/abs/2410.13738)|null|Score-based diffusion models have achieved remarkable empirical performance in the field of machine learning and artificial intelligence for their ability to generate high-quality new data instances from complex distributions. Improving our understanding of diffusion models, including mainly convergence analysis for such models, has attracted a lot of interest. Despite a lot of theoretical attempts, there still exists a significant gap between theory and practice. To close this gap, we establish an iteration complexity at the order of $d^{1/3}\varepsilon^{-2/3}$, which is better than $d^{5/12}\varepsilon^{-1}$, the best known complexity achieved before our work. This convergence analysis is based on a randomized midpoint method, which is first proposed for log-concave sampling (Shen and Lee, 2019), and then extended to diffusion models by Gupta et al. (2024). Our theory accommodates $\varepsilon$-accurate score estimates, and does not require log-concavity on the target distribution. Moreover, the algorithm can also be parallelized to run in only $O(\log^2(d/\varepsilon))$ parallel rounds in a similar way to prior works.||
|**2024-10-17**|[Optimizing Probabilistic Conformal Prediction with Vectorized Non-Conformity Scores](http://arxiv.org/abs/2410.13735)|null|Generative models have shown significant promise in critical domains such as medical diagnosis, autonomous driving, and climate science, where reliable decision-making hinges on accurate uncertainty quantification. While probabilistic conformal prediction (PCP) offers a powerful framework for this purpose, its coverage efficiency -- the size of the uncertainty set -- is limited when dealing with complex underlying distributions and a finite number of generated samples. In this paper, we propose a novel PCP framework that enhances efficiency by first vectorizing the non-conformity scores with ranked samples and then optimizing the shape of the prediction set by varying the quantiles for samples at the same rank. Our method delivers valid coverage while producing discontinuous and more efficient prediction sets, making it particularly suited for high-stakes applications. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets.||
|**2024-10-17**|[DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation](http://arxiv.org/abs/2410.13726)|**[link](https://github.com/hanbo-cheng/dawn-pytorch)**|Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at https://github.com/Hanbo-Cheng/DAWN-pytorch.||
|**2024-10-15**|[High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion](http://arxiv.org/abs/2410.11838)|null|尽管近期取得了进展,现有的帧插值方法在处理极高分辨率输入和处理重复纹理、细小物体和大运动等挑战性案例时仍然存在困难。为了解决这些问题,我们引入了一种基于补丁的级联像素扩散模型,用于帧插值,名为 HiFI,它在这些场景中表现出色,同时在标准基准测试中实现了具有竞争力的性能。级联模型可以生成一系列从低分辨率到高分辨率的图像,这有助于处理需要全局上下文以获得粗略解决方案以及需要详细上下文以获得高分辨率输出的大运动或复杂运动。然而,与先前在越来越大的分辨率上执行扩散的级联扩散模型工作相反,我们使用单个模型,该模型始终以相同的分辨率执行扩散,并通过处理输入和先前解决方案的补丁来进行上采样。我们表明,这种技术大大减少了推理时的内存使用量,并且还允许我们在测试时使用单个模型,同时解决帧插值和空间上采样问题,从而节省了训练成本。我们证明了 HiFI 对需要全局上下文的高分辨率和复杂重复纹理有很大帮助。HiFI 在多个基准测试(Vimeo、Xiph、X-Test、SEPE-8K)上展示了与最先进技术相当或更优的性能。在我们新引入的专注于特别具有挑战性的案例的数据集上,HiFI 在这些案例上的表现也明显优于其他基线模型。请访问我们的项目页面以获取视频结果:https://hifi-diffusion.github.io||
|**2024-10-15**|[On the Effectiveness of Dataset Alignment for Fake Image Detection](http://arxiv.org/abs/2410.11835)|null|随着潜在扩散模型 (LDM) 使图像生成能力大众化,对虚假图像检测的需求日益增长。一个好的检测器应该专注于生成模型的指纹,而忽略图像属性,如语义内容、分辨率、文件格式等。虚假图像检测器通常以数据驱动的方式构建,其中训练模型以区分真实图像和虚假图像。现有工作主要研究网络架构选择和训练方法。在这项工作中,我们认为除了这些算法选择之外,我们还需要一个良好对齐的真实/虚假图像数据集来训练鲁棒的检测器。对于 LDM 系列,我们提出了一种非常简单的方法来实现这一点:我们使用 LDM 自动编码器重建所有真实图像,无需任何去噪操作。然后,我们训练一个模型来将这些真实图像与其重建图像区分开来。以这种方式创建的虚假图像在几乎所有方面(例如,大小、纵横比、语义内容)都与真实图像极其相似,这迫使模型寻找 LDM 解码器的伪影。我们通过经验证明,这种创建对齐的真实/虚假数据集的方法(也绕过了计算量大的去噪过程)有助于构建一个较少关注虚假相关性的检测器,而现有的非常流行的方法很容易受到这种相关性的影响。最后,为了证明数据集中对齐的有效性,我们使用非自然对象的图像构建了一个检测器,并获得了可喜的结果。总的来说,我们的工作确定了在训练虚假图像检测器时出现的细微但重要的问题,并提出了一种简单且廉价的解决方案来解决这些问题。||
|**2024-10-15**|[Bayesian Experimental Design via Contrastive Diffusions](http://arxiv.org/abs/2410.11826)|**[link](https://github.com/jcopo/ContrastiveDiffusions)**|贝叶斯最优实验设计 (BOED) 是一种强大的工具,可以降低运行一系列实验的成本。当基于预期信息增益 (EIG) 时,设计优化对应于最大化先验分布和后验分布之间某些难以处理的预期“对比”。由于 BOED 固有的计算复杂性,将这种最大化扩展到高维和复杂的环境一直是一个问题。在这项工作中,我们介绍了一种具有成本效益的采样特性的“预期后验”分布,并通过新的 EIG 梯度表达式提供了对 EIG 对比度最大化的易处理访问。基于扩散的采样器用于计算预期后验的动态,并且利用双层优化的思想来推导出高效的联合采样优化循环,而无需诉诸 EIG 的下界近似。由此产生的效率提升允许将 BOED 扩展到经过充分测试的扩散模型的生成能力。通过将生成模型纳入 BOED 框架,我们扩展了它的范围及其在以前不切实际的场景中的使用。数值实验和与最先进方法的比较显示了该方法的潜力。||
|**2024-10-15**|[KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities](http://arxiv.org/abs/2410.11824)|null|最近文本到图像生成技术的进步显著提高了合成图像的质量。尽管取得了这些进展,但评估主要集中在审美情趣或与文本提示的一致性上。因此,人们对这些模型是否能够准确地表示各种现实世界的视觉实体——一项需要现实世界知识的任务——知之甚少。为了弥合这一差距,我们提出了一个基准测试,重点评估现实世界实体的知识密集型图像生成(即 KITTEN)。我们使用 KITTEN 对文本到图像生成模型中的实体保真度进行了系统研究,重点关注它们生成各种现实世界视觉实体的能力,如地标建筑、飞机、植物和动物。我们使用自动指标和精心设计的人工评估来评估最新的文本到图像模型和检索增强定制模型,重点关注生成图像中实体的保真度。我们的研究结果表明,即使是最先进的文本到图像模型也常常无法生成具有准确视觉细节的实体。尽管检索增强模型可以通过在测试期间合并参考图像来增强实体的保真度,但它们往往过度依赖于这些参考,并且难以根据创意文本提示生成实体的新颖配置。||
|**2024-10-15**|[Improving Long-Text Alignment for Text-to-Image Diffusion Models](http://arxiv.org/abs/2410.11817)|**[link](https://github.com/luping-liu/longalign)**|文本到图像 (T2I) 扩散模型的快速发展使其能够根据给定文本生成前所未有的结果。然而,随着文本输入变长,像 CLIP 这样的现有编码方法面临局限性,并且使生成的图像与长文本对齐变得具有挑战性。为了解决这些问题,我们提出了 LongAlign,它包括用于处理长文本的分段级编码方法和用于有效对齐训练的分解偏好优化方法。对于分段级编码,长文本被分成多个段并分别处理。此方法克服了预训练编码模型的最大输入长度限制。对于偏好优化,我们提供基于 CLIP 的分解偏好模型来微调扩散模型。具体来说,为了利用基于 CLIP 的偏好模型进行 T2I 对齐,我们深入研究了它们的评分机制,发现偏好分数可以分解为两个部分:衡量 T2I 对齐的文本相关部分和评估人类偏好的其他视觉方面的文本无关部分。此外,我们发现文本无关部分会导致微调期间出现常见的过拟合问题。为了解决这个问题,我们提出了一种重新加权策略,为这两个部分分配不同的权重,从而减少过拟合并增强对齐。在我们使用该方法对 $512 \times 512$ Stable Diffusion (SD) v1.5 进行约 20 小时的微调后,微调后的 SD 在 T2I 对齐方面优于更强大的基础模型,例如 PixArt-$\alpha$ 和 Kandinsky v2.2。代码可在 https://github.com/luping-liu/LongAlign 获取。||
|**2024-10-15**|[SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing](http://arxiv.org/abs/2410.11815)|null|场景图提供了一种结构化的图像层次表示,其中节点和边分别代表对象及其之间的关系。它可以作为图像编辑的自然接口,极大地提高精度和灵活性。利用这一优势,我们引入了一个新的框架,将大型语言模型(LLM)与 Text2Image 生成模型相结合,用于基于场景图的图像编辑。这种集成可以在不影响整体图像完整性的情况下,实现对象级别的精确修改和场景的创造性重组。我们的方法包括两个主要阶段:1)利用 LLM 驱动的场景解析器,我们构建图像的场景图,捕获关键对象及其相互关系,并解析细粒度属性,如对象掩码和描述。这些注释有助于使用微调的扩散模型进行概念学习,用优化的标记和详细的描述提示来表示每个对象。2)在图像编辑阶段,LLM 编辑控制器引导编辑特定区域。然后,这些编辑由注意力调制的扩散编辑器执行,利用微调模型执行对象添加、删除、替换和调整。通过大量实验,我们证明了我们的框架在编辑精度和场景美学方面明显优于现有的图像编辑方法。||
|**2024-10-15**|[Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices](http://arxiv.org/abs/2410.11795)|null|作为近年来最受欢迎和最受追捧的生成模型之一,扩散模型凭借其扎实的理论基础和可靠的应用实践,引起了众多研究者的兴趣,并在图像合成、视频生成、分子设计、3D场景渲染和多模态生成等各种生成任务中展现出优异的性能。这些基于扩散模型的最新研究成果的显著成功很大程度上归功于渐进式设计原则以及高效的架构、训练、推理和部署方法。然而,目前尚缺乏全面深入的综述来总结这些原则和实践,以帮助快速理解和应用扩散模型。在本综述中,我们以效率为导向,对现有工作进行了新的视角审视,主要关注架构设计、模型训练、快速推理和可靠部署方面的深刻原理和高效实践,以通俗易懂的方式指导进一步的理论研究、算法迁移和模型应用到新的场景中。https://github.com/ponyzym/Efficient-DMs-Survey||
|**2024-10-15**|[Probabilistic Principles for Biophysics and Neuroscience: Entropy Production, Bayesian Mechanics & the Free-Energy Principle](http://arxiv.org/abs/2410.11735)|null|本论文重点研究生物系统的三个基本方面:即熵产生、贝叶斯力学和自由能原理。贡献有三方面:1) 我们计算了比以往更大类别系统的熵产生,包括几乎所有稳态扩散过程,例如驱动噪声不作用于系统所有坐标的退化扩散。重要的是,这类系统包含了由有色噪声驱动的随机微分方程的马尔可夫近似,这一点意义重大,因为宏观和中尺度生物系统通常会受到有色噪声的影响。2) 我们为与环境相互作用的生物和物理实体开发了一种贝叶斯力学,其中我们为事物的内部状态推断其外部状态提供了充分必要条件,这与统计学和理论神经科学中的变分贝叶斯推理一致。3) 我们改进了对贝叶斯力学的约束,以获得对生物系统更具体的描述,称为自由能原理。这表明生物系统的活动状态和内部状态是通过最小化称为自由能的量来展开的。这里提出的自由能原理的数学基础,通过在给定外部状态和感觉状态的生成模型的情况下最小化自由能,为神经生物学和人工智能中的行为建模和仿真提供了一种第一性原理方法。||
|**2024-10-15**|[Patch-Based Diffusion Models Beat Whole-Image Models for Mismatched Distribution Inverse Problems](http://arxiv.org/abs/2410.11730)|null|扩散模型由于能够学习强大的图像先验,在解决逆问题方面取得了优异的成功,但现有方法需要大量的图像训练数据集,这些图像应该来自与测试数据集相同的分布。当训练和测试分布不匹配时,由于先验不正确,重建图像中会出现伪影和幻觉。在这项工作中,我们系统地研究了分布外 (OOD) 问题,其中首先提供已知的训练分布。我们首先研究了仅从未知测试分布获得单次测量的情况。接下来,我们研究了属于测试分布的非常小的数据样本可用的情况,我们的目标仍然是从来自测试分布的测量中重建图像。在这两种情况下,我们都使用基于补丁的扩散先验,它仅从补丁中学习图像分布。此外,在第一种情况下,我们包含一个自监督损失,帮助网络输出与测量保持一致。大量实验表明,在这两种情况下,基于补丁的方法都可以获得高质量的图像重建,其性能优于整幅图像模型,并可与能够使用大型分布内训练数据集的方法相媲美。此外,我们展示了整幅图像模型如何容易出现记忆和过拟合,从而导致重建中的伪影,而基于补丁的模型可以解决这些问题。||
|**2024-10-15**|[DeformPAM: Data-Efficient Learning for Long-horizon Deformable Object Manipulation via Preference-based Action Alignment](http://arxiv.org/abs/2410.11584)|**[link](https://github.com/xiaoxiaoxh/DeformPAM)**|In recent years, imitation learning has made progress in the field of robotic manipulation. However, it still faces challenges when dealing with complex long-horizon deformable object tasks, such as high-dimensional state spaces, complex dynamics, and multimodal action distributions. Traditional imitation learning methods often require a large amount of data and encounter distributional shifts and accumulative errors in these tasks. To address these issues, we propose a data-efficient general learning framework (DeformPAM) based on preference learning and reward-guided action selection. DeformPAM decomposes long-horizon tasks into multiple action primitives, utilizes 3D point cloud inputs and diffusion models to model action distributions, and trains an implicit reward model using human preference data. During the inference phase, the reward model scores multiple candidate actions, selecting the optimal action for execution, thereby reducing the occurrence of anomalous actions and improving task completion quality. Experiments conducted on three challenging real-world long-horizon deformable object manipulation tasks demonstrate the effectiveness of this method. Results show that DeformPAM improves both task completion quality and efficiency compared to baseline methods even with limited data. Code and data will be available at https://deform-pam.robotflow.ai.||
|**2024-10-11**|[SceneCraft: Layout-Guided 3D Scene Generation](http://arxiv.org/abs/2410.09049)|**[link](https://github.com/orangesodahub/scenecraft)**|The creation of complex 3D scenes tailored to user specifications has been a tedious and challenging task with traditional 3D modeling tools. Although some pioneering methods have achieved automatic text-to-3D generation, they are generally limited to small-scale scenes with restricted control over the shape and texture. We introduce SceneCraft, a novel method for generating detailed indoor scenes that adhere to textual descriptions and spatial layout preferences provided by users. Central to our method is a rendering-based technique, which converts 3D semantic layouts into multi-view 2D proxy maps. Furthermore, we design a semantic and depth conditioned diffusion model to generate multi-view images, which are used to learn a neural radiance field (NeRF) as the final scene representation. Without the constraints of panorama image generation, we surpass previous methods in supporting complicated indoor space generation beyond a single room, even as complicated as a whole multi-bedroom apartment with irregular shapes and layouts. Through experimental analysis, we demonstrate that our method significantly outperforms existing approaches in complex indoor scene generation with diverse textures, consistent geometry, and realistic visual quality. Code and more results are available at: https://orangesodahub.github.io/SceneCraft||
|**2024-10-11**|[Linear Convergence of Diffusion Models Under the Manifold Hypothesis](http://arxiv.org/abs/2410.09046)|null|Score-matching generative models have proven successful at sampling from complex high-dimensional data distributions. In many applications, this distribution is believed to concentrate on a much lower $d$-dimensional manifold embedded into $D$-dimensional space; this is known as the manifold hypothesis. The current best-known convergence guarantees are either linear in $D$ or polynomial (superlinear) in $d$. The latter exploits a novel integration scheme for the backward SDE. We take the best of both worlds and show that the number of steps diffusion models require in order to converge in Kullback-Leibler (KL) divergence is linear (up to logarithmic terms) in the intrinsic dimension $d$. Moreover, we show that this linear dependency is sharp.||
|**2024-10-11**|[Semantic Score Distillation Sampling for Compositional Text-to-3D Generation](http://arxiv.org/abs/2410.09009)|**[link](https://github.com/yangling0818/semanticsds-3d)**|Generating high-quality 3D assets from textual descriptions remains a pivotal challenge in computer graphics and vision research. Due to the scarcity of 3D data, state-of-the-art approaches utilize pre-trained 2D diffusion priors, optimized through Score Distillation Sampling (SDS). Despite progress, crafting complex 3D scenes featuring multiple objects or intricate interactions is still difficult. To tackle this, recent methods have incorporated box or layout guidance. However, these layout-guided compositional methods often struggle to provide fine-grained control, as they are generally coarse and lack expressiveness. To overcome these challenges, we introduce a novel SDS approach, Semantic Score Distillation Sampling (SemanticSDS), designed to effectively improve the expressiveness and accuracy of compositional text-to-3D generation. Our approach integrates new semantic embeddings that maintain consistency across different rendering views and clearly differentiate between various objects and parts. These embeddings are transformed into a semantic map, which directs a region-specific SDS process, enabling precise optimization and compositional generation. By leveraging explicit semantic guidance, our method unlocks the compositional capabilities of existing pre-trained diffusion models, thereby achieving superior quality in 3D content generation, particularly for complex objects and scenes. Experimental results demonstrate that our SemanticSDS framework is highly effective for generating state-of-the-art complex 3D content. Code: https://github.com/YangLing0818/SemanticSDS-3D||
|**2024-10-11**|[WaveDiffusion: Exploring Full Waveform Inversion via Joint Diffusion in the Latent Space](http://arxiv.org/abs/2410.09002)|null|Full Waveform Inversion (FWI) is a vital technique for reconstructing high-resolution subsurface velocity maps from seismic waveform data, governed by partial differential equations (PDEs) that model wave propagation. Traditional machine learning approaches typically map seismic data to velocity maps by encoding seismic waveforms into latent embeddings and decoding them into velocity maps. In this paper, we introduce a novel framework that reframes FWI as a joint diffusion process in a shared latent space, bridging seismic waveform data and velocity maps. Our approach has two key components: first, we merge the bottlenecks of two separate autoencoders-one for seismic data and one for velocity maps-into a unified latent space using vector quantization to establish a shared codebook. Second, we train a diffusion model in this latent space, enabling the simultaneous generation of seismic and velocity map pairs by sampling and denoising the latent representations, followed by decoding each modality with its respective decoder. Remarkably, our jointly generated seismic-velocity pairs approximately satisfy the governing PDE without any additional constraint, offering a new geometric interpretation of FWI. The diffusion process learns to score the latent space according to its deviation from the PDE, with higher scores representing smaller deviations from the true solutions. By following this diffusion process, the model traces a path from random initialization to a valid solution of the governing PDE. Our experiments on the OpenFWI dataset demonstrate that the generated seismic and velocity map pairs not only exhibit high fidelity and diversity but also adhere to the physical constraints imposed by the governing PDE.||
|**2024-10-11**|[Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory](http://arxiv.org/abs/2410.08942)|null|Synthetic data has gained attention for training large language models, but poor-quality data can harm performance (see, e.g., Shumailov et al. (2023); Seddik et al. (2024)). A potential solution is data pruning, which retains only high-quality data based on a score function (human or machine feedback). Previous work Feng et al. (2024) analyzed models trained on synthetic data as sample size increases. We extend this by using random matrix theory to derive the performance of a binary classifier trained on a mix of real and pruned synthetic data in a high dimensional setting. Our findings identify conditions where synthetic data could improve performance, focusing on the quality of the generative model and verification strategy. We also show a smooth phase transition in synthetic label noise, contrasting with prior sharp behavior in infinite sample limits. Experiments with toy models and large language models validate our theoretical results.||
|**2024-10-11**|[DiffPO: A causal diffusion model for learning distributions of potential outcomes](http://arxiv.org/abs/2410.08924)|null|Predicting potential outcomes of interventions from observational data is crucial for decision-making in medicine, but the task is challenging due to the fundamental problem of causal inference. Existing methods are largely limited to point estimates of potential outcomes with no uncertainty quantification; thus, the full information about the distributions of potential outcomes is typically ignored. In this paper, we propose a novel causal diffusion model called DiffPO, which is carefully designed for reliable inferences in medicine by learning the distribution of potential outcomes. In our DiffPO, we leverage a tailored conditional denoising diffusion model to learn complex distributions, where we address the selection bias through a novel orthogonal diffusion loss. Another strength of our DiffPO method is that it is highly flexible (e.g., it can also be used to estimate different causal quantities such as CATE). Across a wide range of experiments, we show that our method achieves state-of-the-art performance.||
|**2024-10-11**|[Conditional Generative Models for Contrast-Enhanced Synthesis of T1w and T1 Maps in Brain MRI](http://arxiv.org/abs/2410.08894)|**[link](https://github.com/Janspiry/Palette-Image-to-Image-Diffusion-Models)**|Contrast enhancement by Gadolinium-based contrast agents (GBCAs) is a vital tool for tumor diagnosis in neuroradiology. Based on brain MRI scans of glioblastoma before and after Gadolinium administration, we address enhancement prediction by neural networks with two new contributions. Firstly, we study the potential of generative models, more precisely conditional diffusion and flow matching, for uncertainty quantification in virtual enhancement. Secondly, we examine the performance of T1 scans from quantitative MRI versus T1-weighted scans. In contrast to T1-weighted scans, these scans have the advantage of a physically meaningful and thereby comparable voxel range. To compare network prediction performance of these two modalities with incompatible gray-value scales, we propose to evaluate segmentations of contrast-enhanced regions of interest using Dice and Jaccard scores. Across models, we observe better segmentations with T1 scans than with T1-weighted scans.||
|**2024-10-11**|[On-Chip Learning via Transformer In-Context Learning](http://arxiv.org/abs/2410.08711)|null|Autoregressive decoder-only transformers have become key components for scalable sequence processing and generation models. However, the transformer's self-attention mechanism requires transferring prior token projections from the main memory at each time step (token), thus severely limiting their performance on conventional processors. Self-attention can be viewed as a dynamic feed-forward layer, whose matrix is input sequence-dependent similarly to the result of local synaptic plasticity. Using this insight, we present a neuromorphic decoder-only transformer model that utilizes an on-chip plasticity processor to compute self-attention. Interestingly, the training of transformers enables them to "learn" the input context during inference. We demonstrate this in-context learning ability of transformers on the Loihi 2 processor by solving a few-shot classification problem. With this we emphasize the importance of pretrained models, especially their ability to find simple, local, backpropagation-free learning rules that enable on-chip learning and adaptation in a hardware-friendly manner.||
|**2024-10-11**|[Distillation of Discrete Diffusion through Dimensional Correlations](http://arxiv.org/abs/2410.08709)|null|Diffusion models have demonstrated exceptional performances in various fields of generative modeling. While they often outperform competitors including VAEs and GANs in sample quality and diversity, they suffer from slow sampling speed due to their iterative nature. Recently, distillation techniques and consistency models are mitigating this issue in continuous domains, but discrete diffusion models have some specific challenges towards faster generation. Most notably, in the current literature, correlations between different dimensions (pixels, locations) are ignored, in both the modeling and the loss functions, due to computational limitations. In this paper, we propose "mixture" models in discrete diffusion that are capable of treating dimensional correlations while remaining scalable, and we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: first, that dimensionally independent models can well approximate the data distribution if they are allowed to conduct many sampling steps, and second, that our loss functions enable mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. We empirically demonstrate that our proposed method for discrete diffusion works in practice, by distilling a continuous-time discrete diffusion model pretrained on the CIFAR-10 dataset.||
|**2024-10-11**|[E-Motion: Future Motion Simulation via Event Sequence Diffusion](http://arxiv.org/abs/2410.08649)|**[link](https://github.com/p4r4mount/E-Motion)**|Forecasting a typical object's future motion is a critical task for interpreting and interacting with dynamic environments in computer vision. Event-based sensors, which could capture changes in the scene with exceptional temporal granularity, may potentially offer a unique opportunity to predict future motion with a level of detail and precision previously unachievable. Inspired by that, we propose to integrate the strong learning capacity of the video diffusion model with the rich motion information of an event camera as a motion simulation framework. Specifically, we initially employ pre-trained stable video diffusion models to adapt the event sequence dataset. This process facilitates the transfer of extensive knowledge from RGB videos to an event-centric domain. Moreover, we introduce an alignment mechanism that utilizes reinforcement learning techniques to enhance the reverse generation trajectory of the diffusion model, ensuring improved performance and accuracy. Through extensive testing and validation, we demonstrate the effectiveness of our method in various complex scenarios, showcasing its potential to revolutionize motion flow prediction in computer vision applications such as autonomous vehicle guidance, robotic navigation, and interactive media. Our findings suggest a promising direction for future research in enhancing the interpretative power and predictive accuracy of computer vision systems.||
|**2024-10-10**|[DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models](http://arxiv.org/abs/2410.08207)|null|Discrete diffusion models have achieved success in tasks such as image generation and masked language modeling, but face limitations in controllable content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention-map manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capability, offering new opportunities for fine-grained content manipulation in discrete spaces. The project webpage is available at https://hexiaoxiao-cs.github.io/DICE/.||
|**2024-10-10**|[HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation](http://arxiv.org/abs/2410.08192)|null|Recent text-to-image diffusion models have shown remarkable creative capabilities with textual prompts, but generating personalized instances based on specific subjects, known as subject-driven generation, remains challenging. To address this, we present HybridBooth, a novel hybrid framework that merges the benefits of optimization-based and direct-regression methods. HybridBooth operates in two stages: a word-embedding probe, which uses a fine-tuned encoder to generate a robust initial word embedding, and word-embedding refinement, which further adapts the encoder to specific subject images by optimizing key parameters. This approach enables effective and fast inversion of visual concepts into a textual embedding, even from a single image, while preserving the model's generalization capability.||
|**2024-10-10**|[DifFRelight: Diffusion-Based Facial Performance Relighting](http://arxiv.org/abs/2410.08188)|null|We present a novel framework for free-viewpoint facial performance relighting using diffusion-based image-to-image translation. Leveraging a subject-specific dataset containing diverse facial expressions captured under various lighting conditions, including flat-lit and one-light-at-a-time (OLAT) setups, we train a diffusion model for precise lighting control, enabling high-fidelity relit facial images to be generated from flat-lit inputs. Our framework includes spatially-aligned conditioning on flat-lit captures and random noise, along with integrated lighting information for global control, utilizing prior knowledge from a pre-trained Stable Diffusion model. The model is then applied to dynamic facial performances captured in a consistent flat-lit environment and reconstructed for novel-view synthesis using a scalable dynamic 3D Gaussian rendering method, maintaining the quality and consistency of the relit results. In addition, we introduce unified lighting control by combining a novel area lighting representation with directional lighting, allowing joint adjustment of light size and direction. We also support high dynamic range imaging (HDRI) composition with multiple directional lights to produce dynamic sequences under complex lighting conditions. Our evaluations demonstrate the model's efficiency in achieving precise lighting control and generalizing across diverse facial expressions while preserving detailed features such as skin texture and hair. The model accurately reproduces complex lighting effects such as eye reflections, subsurface scattering, self-shadowing, and translucency, advancing photorealism within our framework.||
|**2024-10-10**|[ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion](http://arxiv.org/abs/2410.08168)|null|We present ZeroComp, an effective zero-shot 3D object compositing approach that does not require paired composite-scene images during training. Our method leverages ControlNet to condition on intrinsic images and combines it with a Stable Diffusion model to utilize its scene priors, together forming an effective rendering engine. During training, ZeroComp uses intrinsic images based on geometry, albedo, and masked shading, without requiring paired images of scenes with and without composited objects. Once trained, it seamlessly integrates virtual 3D objects into scenes and adjusts shading to create realistic composites. We developed a high-quality evaluation dataset and demonstrate that ZeroComp outperforms other methods using explicit lighting estimation and generative techniques in both quantitative and human-perception benchmarks. Additionally, ZeroComp extends to real and outdoor image compositing, even when trained solely on synthetic indoor data, demonstrating its effectiveness in image compositing.||
|**2024-10-10**|[DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation](http://arxiv.org/abs/2410.08159)|null|Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process that gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully exploit the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion approaches within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model with the same architecture as standard language models. DART does not rely on image quantization, enabling more effective image modeling while maintaining flexibility. Furthermore, DART can be seamlessly trained with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models. Through this unified framework, DART sets a new benchmark for scalable, high-quality image synthesis.||
|**2024-10-10**|[Progressive Autoregressive Video Diffusion Models](http://arxiv.org/abs/2410.08151)|**[link](https://github.com/desaixie/pa_vdm)**|Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, due to computational limitations during training, they can only generate short video clips, typically around 10 seconds or 240 frames. In this work, we show that existing models can be naturally extended to autoregressive video diffusion models without changing their architectures. Our key idea is to assign the latent frames progressively increasing noise levels rather than a single noise level, which allows for fine-grained conditioning among the latent frames and large overlaps between attention windows. Such progressive video denoising allows our models to autoregressively generate video frames without quality degradation or abrupt scene changes. We present state-of-the-art results on long video generation at 1 minute (1440 frames at 24 FPS). Videos from this paper are available at https://desaixie.github.io/pa-vdm/.||
|**2024-10-10**|[Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction](http://arxiv.org/abs/2410.08134)|null|Generative models of discrete data underlie many important applications, spanning text-based agents such as ChatGPT to the design of the very building blocks of life in protein sequences. However, application domains need to exert control over the generated data by steering the generative process, typically via RLHF, to satisfy a specified property, reward, or affinity metric. In this paper, we study the problem of steering masked diffusion models (MDMs), a recent class of discrete diffusion models that offer a compelling alternative to traditional autoregressive models. We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference by learning to sample from a target Bayesian posterior. Our DDPP framework yields a family of three novel objectives that are all simulation-free, and thus scalable, while applying to general non-differentiable reward functions. Empirically, we instantiate DDPP by steering MDMs to perform class-conditional pixel-level image modeling, RLHF-based alignment of MDMs using text-based rewards, and fine-tuning protein language models to generate more diverse secondary structures and shorter proteins. We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.||
|**2024-10-10**|[Robust AI-Generated Text Detection by Restricted Embeddings](http://arxiv.org/abs/2410.08113)|**[link](https://github.com/silversolver/robustatd)**|The growing amount and quality of AI-generated texts makes detecting such content more difficult. In most real-world scenarios, the domain (style and topic) of the generated data and the generator model are not known in advance. In this work, we focus on the robustness of classifier-based detectors of AI-generated text, namely their ability to transfer to unseen generators or semantic domains. We investigate the geometry of the embedding space of Transformer-based text encoders and show that clearing out harmful linear subspaces helps to train a robust classifier that ignores domain-specific spurious features. We investigate several subspace decomposition and feature selection strategies and achieve significant improvements over the state of the art in cross-domain and cross-generator transfer. Our best approaches for head-wise and coordinate-based subspace removal increase the mean out-of-distribution (OOD) classification score by up to 9% and 14% for RoBERTa and BERT embeddings, respectively. We release our code and data: https://github.com/SilverSolver/RobustATD||
|**2024-10-10**|[Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models](http://arxiv.org/abs/2410.08074)|null|Text-to-image diffusion models rely on massive, web-scale datasets. Training them from scratch is computationally expensive, so developers often prefer to make incremental updates to existing models. These updates typically compose fine-tuning steps (to learn new concepts or improve model performance) with "unlearning" steps (to "forget" existing concepts, such as copyrighted works or explicit content). In this work, we demonstrate a critical and previously unknown vulnerability that arises in this paradigm: even under benign, non-adversarial conditions, fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to "relearn" concepts that were previously "unlearned." We comprehensively investigate the causes and scope of this phenomenon, which we term concept resurgence, through a series of experiments composing "mass concept erasure" (the current state of the art for unlearning in text-to-image diffusion models (Lu et al., 2024)) with subsequent fine-tuning of Stable Diffusion v1.4. Our findings underscore the fragility of composing incremental model updates, and raise serious new concerns about current approaches to ensuring the safety and alignment of text-to-image diffusion models.||
|**2024-10-10**|[A Target-Aware Analysis of Data Augmentation for Hate Speech Detection](http://arxiv.org/abs/2410.08053)|null|Hate speech is one of the main threats posed by the widespread use of social networks, despite efforts to limit it. Although attention has been paid to this issue, the lack of datasets and case studies centered on scarcely represented phenomena, such as ableism or ageism, can lead to hate speech detection systems that underperform on underrepresented identity groups. Given the unprecedented capabilities of large language models (LLMs) in producing high-quality data, we investigate the possibility of augmenting existing data with generative language models to reduce target imbalance. We experiment with augmenting 1,000 posts from the Measuring Hate Speech corpus, an English dataset annotated with target identity information, adding around 30,000 synthetic examples using both simple data augmentation methods and different types of generative models, comparing autoregressive and sequence-to-sequence approaches. We find that traditional data augmentation methods are often preferable to generative models, but the combination of the two tends to yield the best results. Indeed, for some hate categories such as origin, religion, and disability, hate speech classification trained on augmented data improves by more than 10% F1 over the no-augmentation baseline. This work contributes to the development of hate speech detection systems that are not only better performing but also fairer and more inclusive toward targets that have been neglected so far.||
|**2024-10-07**|[DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control](http://arxiv.org/abs/2410.05260)|null|Text-conditioned human motion generation, which allows users to interact through natural language, has become increasingly popular in recent years. Existing methods typically generate short, isolated motions based on a single input sentence. However, human motions are continuous, can extend over long periods, and carry rich semantics. Creating long, complex motions that precisely respond to streams of text descriptions, particularly in online and real-time settings, remains a significant challenge. Furthermore, incorporating spatial constraints into text-conditioned motion generation presents additional challenges, as it requires aligning the motion semantics specified by text descriptions with geometric information, such as goal locations and 3D scene geometry. To address these limitations, we propose DART, a diffusion-based autoregressive motion primitive model for real-time text-driven motion control. Our model, DART, uses latent diffusion to effectively learn a compact motion primitive space jointly conditioned on motion history and text input. By autoregressively generating motion primitives based on the preceding history and the current text input, DART enables real-time, continuous motion generation driven by natural language descriptions. In addition, the learned motion primitive space allows for precise spatial motion control, which we formulate either as a latent noise optimization problem or as a Markov decision process addressed through reinforcement learning. We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance across various motion synthesis tasks. Experiments show that our method outperforms existing baselines in motion realism, efficiency, and controllability. Video results are available on the project page: https://zkf1997.github.io/DART/.||
|**2024-10-07**|[GS-VTON: Controllable 3D Virtual Try-on with Gaussian Splatting](http://arxiv.org/abs/2410.05259)|null|Diffusion-based 2D virtual try-on (VTON) techniques have recently demonstrated strong performance, while the development of 3D VTON has largely lagged behind. Despite recent advances in text-guided 3D scene editing, integrating 2D VTON into these pipelines to achieve vivid 3D VTON remains challenging. The reasons are twofold. First, text prompts cannot provide sufficient details in describing clothing. Second, 2D VTON results generated from different viewpoints of the same 3D scene lack coherence and spatial relationships, hence frequently leading to appearance inconsistencies and geometric distortions. To resolve these problems, we introduce an image-prompted 3D VTON method (dubbed GS-VTON) which, by leveraging 3D Gaussian Splatting (3DGS) as the 3D representation, enables the transfer of pre-trained knowledge from 2D VTON models to 3D while improving cross-view consistency. (1) Specifically, we propose a personalized diffusion model that utilizes low-rank adaptation (LoRA) fine-tuning to incorporate personalized information into pre-trained 2D VTON models. To achieve effective LoRA training, we introduce a reference-driven image editing approach that enables the simultaneous editing of multi-view images while ensuring consistency. (2) Furthermore, we propose a persona-aware 3DGS editing framework to facilitate effective editing while maintaining consistent cross-view appearance and high-quality 3D geometry. (3) Additionally, we have established a new 3D VTON benchmark, 3D-VTONBench, which facilitates comprehensive qualitative and quantitative 3D VTON evaluations. Through extensive experiments and comparative analyses with existing methods, the proposed GS-VTON has demonstrated superior fidelity and advanced editing capabilities, affirming its effectiveness for 3D VTON.||
|**2024-10-07**|[SePPO: Semi-Policy Preference Optimization for Diffusion Alignment](http://arxiv.org/abs/2410.05255)|**[link](https://github.com/dwanzhang-ai/seppo)**|Reinforcement learning from human feedback (RLHF) methods are emerging as a way to fine-tune diffusion models (DMs) for visual generation. However, commonly used on-policy strategies are limited by the generalization capability of the reward model, while off-policy approaches require large amounts of difficult-to-obtain paired human-annotated data, particularly in visual generation tasks. To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data. Specifically, we introduce a Semi-Policy Preference Optimization (SePPO) method. SePPO leverages previous checkpoints as reference models while using them to generate on-policy reference samples, which replace "losing images" in preference pairs. This approach allows us to optimize using only off-policy "winning images." Furthermore, we design a strategy for reference model selection that expands the exploration in the policy space. Notably, we do not simply treat reference samples as negative examples for learning. Instead, we design an anchor-based criterion to assess whether the reference samples are likely to be winning or losing images, allowing the model to selectively learn from the generated reference samples. This approach mitigates performance degradation caused by the uncertainty in reference sample quality. We validate SePPO across both text-to-image and text-to-video benchmarks. SePPO surpasses all previous approaches on the text-to-image benchmarks and also demonstrates outstanding performance on the text-to-video benchmarks. Code will be released in https://github.com/DwanZhang-AI/SePPO.||
|**2024-10-07**|[DiffuseReg: Denoising Diffusion Model for Obtaining Deformation Fields in Unsupervised Deformable Image Registration](http://arxiv.org/abs/2410.05234)|**[link](https://github.com/yutazhuo/diffusereg)**|Deformable image registration aims to precisely align medical images from different modalities or times. Traditional deep learning methods, while effective, often lack interpretability, real-time observability and adjustment capacity during registration inference. Denoising diffusion models present an alternative by reformulating registration as iterative image denoising. However, existing diffusion registration approaches do not fully harness their capabilities, neglecting the critical sampling phase that enables continuous observability during inference. Hence, we introduce DiffuseReg, an innovative diffusion-based method that denoises deformation fields instead of images for improved transparency. We also propose a novel denoising network upon Swin Transformer, which better integrates moving and fixed images with the diffusion time step throughout the denoising process. Furthermore, we enhance control over the denoising registration process with a novel similarity consistency regularization. Experiments on ACDC datasets demonstrate DiffuseReg outperforms existing diffusion registration methods by 1.32 in Dice score. The sampling process in DiffuseReg enables real-time output observability and adjustment unmatched by previous deep models.||
|**2024-10-07**|[Avoiding Deadlocks via Weak Deadlock Sets](http://arxiv.org/abs/2410.05175)|null|A deadlock occurs in a network when two or more items prevent each other from moving and are stalled. In a general model, items are stored at vertices and each vertex $v$ has a buffer with $b(v)$ slots. Given a route for each item toward its destination, the Deadlock Safety Problem asks whether the current state is safe, i.e., it is possible to deliver each item at its destination, or is bound to deadlock, i.e., any sequence of moves will end up with a set of items stalled. While when $b \geq 2$ the problem is solvable in polynomial time building upon a nice characterization of YES/NO-instances, it is NP-hard on quite simple graphs as grids when $b=1$ and on trees when $b\leq 3$. We improve on these results by means of two new tools, weak deadlock sets and wise states. We show that, for general networks and buffer sizes $b$, a state that is wise and without weak deadlock sets -- a condition that can be recognized in polynomial time -- is safe: this is indeed a strengthening of the result for $b \geq 2$. We sharpen this result for trees, where we show that a wise state is safe if and only if it has no weak deadlock set. That is interesting in particular in the context of rail transportation where networks are often single-tracked and deadlock detection and avoidance focuses on local sub-networks, mostly with a tree-like structure. We pose some research questions for future investigations.||
|**2024-10-07**|[Presto! Distilling Steps and Layers for Accelerating Music Generation](http://arxiv.org/abs/2410.05167)|null|Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple, but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yield best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32 second mono/stereo 44.1kHz, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.||
|**2024-10-07**|[A Simulation-Free Deep Learning Approach to Stochastic Optimal Control](http://arxiv.org/abs/2410.05163)|null|We propose a simulation-free algorithm for the solution of generic problems in stochastic optimal control (SOC). Unlike existing methods, our approach does not require the solution of an adjoint problem, but rather leverages Girsanov theorem to directly calculate the gradient of the SOC objective on-policy. This allows us to speed up the optimization of control policies parameterized by neural networks since it completely avoids the expensive back-propagation step through stochastic differential equations (SDEs) used in the Neural SDE framework. In particular, it enables us to solve SOC problems in high dimension and on long time horizons. We demonstrate the efficiency of our approach in various domains of applications, including standard stochastic optimal control problems, sampling from unnormalized distributions via construction of a Schr\"odinger-F\"ollmer process, and fine-tuning of pre-trained diffusion models. In all cases our method is shown to outperform the existing methods in both the computing time and memory efficiency.||
|**2024-10-07**|[Leveraging Multimodal Diffusion Models to Accelerate Imaging with Side Information](http://arxiv.org/abs/2410.05143)|null|Diffusion models have found phenomenal success as expressive priors for solving inverse problems, but their extension beyond natural images to more structured scientific domains remains limited. Motivated by applications in materials science, we aim to reduce the number of measurements required from an expensive imaging modality of interest, by leveraging side information from an auxiliary modality that is much cheaper to obtain. To deal with the non-differentiable and black-box nature of the forward model, we propose a framework to train a multimodal diffusion model over the joint modalities, turning inverse problems with black-box forward models into simple linear inpainting problems. Numerically, we demonstrate the feasibility of training diffusion models over materials imagery data, and show that our approach achieves superior image reconstruction by leveraging the available side information, requiring significantly less amount of data from the expensive microscopy modality.||
|**2024-10-07**|[Agnostic Smoothed Online Learning](http://arxiv.org/abs/2410.05124)|null|Classical results in statistical learning typically consider two extreme data-generating models: i.i.d. instances from an unknown distribution, or fully adversarial instances, often much more challenging statistically. To bridge the gap between these models, recent work introduced the smoothed framework, in which at each iteration an adversary generates instances from a distribution constrained to have density bounded by $\sigma^{-1}$ compared to some fixed base measure $\mu$. This framework interpolates between the i.i.d. and adversarial cases, depending on the value of $\sigma$. For the classical online prediction problem, most prior results in smoothed online learning rely on the arguably strong assumption that the base measure $\mu$ is known to the learner, contrasting with standard settings in the PAC learning or consistency literature. We consider the general agnostic problem in which the base measure is unknown and values are arbitrary. Along this direction, Block et al. showed that empirical risk minimization has sublinear regret under the well-specified assumption. We propose an algorithm R-Cover based on recursive coverings which is the first to guarantee sublinear regret for agnostic smoothed online learning without prior knowledge of $\mu$. For classification, we prove that R-Cover has adaptive regret $\tilde O(\sqrt{dT/\sigma})$ for function classes with VC dimension $d$, which is optimal up to logarithmic factors. For regression, we establish that R-Cover has sublinear oblivious regret for function classes with polynomial fat-shattering dimension growth.||
|**2024-10-07**|[Synthetic Generation of Dermatoscopic Images with GAN and Closed-Form Factorization](http://arxiv.org/abs/2410.05114)|null|In the realm of dermatological diagnoses, where the analysis of dermatoscopic and microscopic skin lesion images is pivotal for the accurate and early detection of various medical conditions, the costs associated with creating diverse and high-quality annotated datasets have hampered the accuracy and generalizability of machine learning models. We propose an innovative unsupervised augmentation solution that harnesses Generative Adversarial Network (GAN) based models and associated techniques over their latent space to generate controlled semiautomatically-discovered semantic variations in dermatoscopic images. We created synthetic images to incorporate the semantic variations and augmented the training data with these images. With this approach, we were able to increase the performance of machine learning models and set a new benchmark amongst non-ensemble based models in skin lesion classification on the HAM10000 dataset; and used the observed analytics and generated models for detailed studies on model explainability, affirming the effectiveness of our solution.||
|**2024-10-04**|[Estimating Body and Hand Motion in an Ego-sensed World](http://arxiv.org/abs/2410.03665)|null|我们提出了EgoAllo,一个基于头戴式设备的人体动作估计系统。EgoAllo仅使用以自我为中心的SLAM姿态和图像,引导从条件扩散模型中采样,以估计捕捉佩戴者在场景的全局坐标系中的动作的3D身体姿态、身高和手部参数。为了实现这一点,我们的关键见解在于表示:我们提出了用于提高模型性能的空间和时间不变性标准,并由此推导出一种头部运动条件参数化,该参数化将估计精度提高了18%。我们还展示了我们系统估计的身体如何改进手部估计:与嘈杂的单目估计相比,由此产生的运动学和时间约束使手部估计误差降低了40%以上。项目页面:https://egoallo.github.io/||
|**2024-10-04**|[Geometric Representation Condition Improves Equivariant Molecule Generation](http://arxiv.org/abs/2410.03655)|null|近年来,分子生成模型的进步展现了其在加速科学发现方面的巨大潜力,特别是在药物设计领域。然而,这些模型在生成高质量分子方面经常面临挑战,尤其是在必须满足特定分子特性的条件生成场景下。在这项工作中,我们介绍了 GeoRCG,这是一个通过整合几何表示条件来增强分子生成模型性能的通用框架。我们将分子生成过程分解为两个阶段:首先,生成信息丰富的几何表示;其次,根据该表示生成分子。与直接生成分子相比,在第一阶段生成相对容易的表示,以更目标导向和更快的速度引导第二阶段生成高质量分子。利用 EDM 作为基础生成器,我们观察到在广泛使用的 QM9 和 GEOM-DRUG 数据集上的无条件分子生成质量有显著提高。更值得注意的是,在具有挑战性的条件分子生成任务中,我们的框架相对于最先进的方法实现了平均 31% 的性能提升,这凸显了以语义丰富的几何表示为条件优于先前方法中以单个属性值为条件的优越性。此外,我们还发现,在这种表示指导下,扩散步骤的数量可以减少到仅 100 步,同时保持比 1000 步更高的生成质量,从而显著加速了生成过程。||
|**2024-10-04**|[Real-World Benchmarks Make Membership Inference Attacks Fail on Diffusion Models](http://arxiv.org/abs/2410.03640)|**[link](https://github.com/caradryanl/copymark)**|扩散模型的成员推断攻击 (MIA) 已成为潜在证据,表明在预训练扩散模型的训练中存在未经授权的数据使用。这些攻击旨在检测扩散模型训练数据集中是否存在特定图像。我们的研究深入评估了扩散模型中最先进的 MIA,并揭示了现有 MIA 评估中的严重缺陷和过于乐观的性能估计。我们介绍了 CopyMark,这是一个更现实的 MIA 基准测试,它通过支持预训练的扩散模型、无偏数据集和公平的评估管道来区分自己。通过广泛的实验,我们证明了当前 MIA 方法的有效性在这些更实际的条件下会显着降低。根据我们的结果,我们提醒,MIA 目前的状态并不是识别预训练扩散模型中未经授权数据使用的可靠方法。据我们所知,我们是首个发现 MIA 在扩散模型上性能被高估的工作,并提出了一个统一的基准以进行更现实的评估。我们的代码可在 GitHub 上获取:https://github.com/caradryanl/CopyMark。||
|**2024-10-04**|[Conditional Enzyme Generation Using Protein Language Models with Adapters](http://arxiv.org/abs/2410.03634)|null|以期望的功能和/或特性为条件生成蛋白质是生成模型的关键目标。现有的基于语言模型提示的方法可以生成以目标功能(例如所需的酶家族)为条件的蛋白质。然而,这些方法仅限于简单的标记化条件,并且尚未显示出对未见功能的泛化能力。在本研究中,我们提出了 ProCALM(蛋白质条件自适应语言模型),这是一种使用适配器对蛋白质语言模型进行条件生成蛋白质的方法。我们对 ProCALM 的具体实现涉及微调 ProGen2,以结合酶功能和分类法的条件表示。ProCALM 在有条件地从目标酶家族生成序列方面与现有方法相匹配。令人印象深刻的是,它还可以在酶功能和分类法的联合分布内生成,并且可以泛化到稀有和未见过的酶家族和分类法。总的来说,ProCALM 是一种灵活且计算效率高的方法,我们预计它可以扩展到广泛的生成语言模型。||
|**2024-10-04**|[How Discrete and Continuous Diffusion Meet: Comprehensive Analysis of Discrete Diffusion Models via a Stochastic Integral Framework](http://arxiv.org/abs/2410.03601)|null|离散扩散模型因其能够对具有易于处理的采样和推理的复杂分布进行建模而受到越来越多的关注。然而,离散扩散模型的误差分析仍然缺乏深入的理解。在这项工作中,我们提出了一个基于 Lévy 型随机积分的离散扩散模型误差分析综合框架。通过将泊松随机测度推广到具有时间无关和状态相关强度的测度,我们严格建立了离散扩散模型的随机积分公式,并提供了相应的测度变化定理,这些定理与 Itô 积分和 Girsanov 定理及其连续对应物有着惊人的相似之处。我们的框架统一并加强了当前关于离散扩散模型的理论结果,并获得了 KL 散度中 τ-leaping 方案的第一个误差界。通过明确识别误差来源,我们的分析为离散扩散模型的数学性质提供了新的见解,并为设计用于现实世界离散扩散模型应用的高效和准确算法提供了指导。||
|**2024-10-04**|[Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features](http://arxiv.org/abs/2410.03558)|**[link](https://github.com/darkbblue/generic-diffusion-feature)**|扩散模型最初是为图像生成而设计的。最近的研究表明,其主干内部的信号(称为激活)也可以作为密集特征,用于各种判别任务,例如语义分割。在众多激活中,选择一个有效的小子集是一个基本问题。为此,该领域的早期研究对激活的判别能力进行了大规模的定量比较。然而,我们发现许多潜在的激活还没有被评估,例如用于计算注意力分数的查询和键。此外,扩散架构的最新进展带来了许多新的激活,例如嵌入式 ViT 模块中的激活。两者结合在一起,激活选择仍然是一个尚未解决但被忽视的问题。为了解决这个问题,本文更进一步,评估了更广泛的激活。考虑到激活的显著增加,全面的定量比较已不再可行。相反,我们试图了解这些激活的属性,以便可以通过简单的定性评估预先过滤掉明显较差的激活。经过仔细分析,我们发现了扩散模型中普遍存在的三个属性,使这项研究能够超越特定的模型。在此基础上,我们针对几种流行的扩散模型提出了有效的特征选择解决方案。最后,跨多个判别任务的实验验证了我们的方法优于 SOTA 竞争对手。我们的代码可在 https://github.com/Darkbblue/generic-diffusion-feature 获取。||
|**2024-10-04**|[NRGBoost: Energy-Based Generative Boosted Trees](http://arxiv.org/abs/2410.03535)|null|尽管深度学习在非结构化数据领域占据主导地位,但基于树的方法,如随机森林(RF)和梯度提升决策树(GBDT),仍然是处理表格数据判别任务的主力军。我们探索了这些流行算法的生成式扩展,重点是对数据密度(直到归一化常数)进行显式建模,从而支持除采样之外的其他应用。作为我们的主要贡献,我们提出了一种基于能量的生成式提升算法,该算法类似于在 XGBoost 等流行软件包中实现的二阶提升。我们表明,尽管产生了一个能够处理任何输入变量的推理任务的生成模型,但我们提出的算法在许多真实世界的表格数据集上可以实现与 GBDT 相似的判别性能,优于其他生成方法。同时,我们也展示了它在采样方面也具有与基于神经网络的模型相媲美的竞争力。||
|**2024-10-04**|[Generative Artificial Intelligence for Navigating Synthesizable Chemical Space](http://arxiv.org/abs/2410.03494)|**[link](https://github.com/wenhao-gao/synformer)**|我们推出了 SynFormer,这是一个生成式建模框架,旨在有效地探索和导航可合成化学空间。与传统的分子生成方法不同,我们为分子生成合成路线,以确保设计具有合成可行性。通过结合可扩展的 Transformer 架构和用于构建块选择的扩散模块,SynFormer 在可合成分子设计方面超越了现有模型。我们通过两个关键应用展示了 SynFormer 的有效性:(1) 局部化学空间探索,其中模型生成参考分子的可合成类似物,以及 (2) 全局化学空间探索,其中模型旨在根据黑盒性质预测预言机识别最佳分子。此外,我们通过随着更多计算资源可用而提高性能来证明我们方法的可扩展性。通过公开我们的代码和训练模型,我们希望 SynFormer 能够在药物发现和材料科学的应用中得到应用。||
|**2024-10-04**|[Diffusion State-Guided Projected Gradient for Inverse Problems](http://arxiv.org/abs/2410.03463)|null|扩散模型的最新进展在学习用于解决反问题的先验数据方面非常有效。它们利用扩散采样步骤来引入数据先验,同时在每个步骤中使用测量引导梯度来施加数据一致性。对于一般的反问题,当使用无条件训练的扩散模型时,由于测量似然是难以处理的,因此需要进行近似,这会导致不准确的后验采样。换句话说,由于它们的近似性,这些方法无法在由扩散先验定义的数据流形上保留生成过程,从而导致图像恢复等应用中的伪影。为了提高扩散模型在解决反问题方面的性能和鲁棒性,我们提出了扩散状态引导投影梯度(DiffStateGrad),它将测量梯度投影到一个子空间上,该子空间是扩散过程中间状态的低秩近似。DiffStateGrad作为一个模块,可以添加到各种基于扩散的反求解器中,以改进对先验流形上扩散过程的保留,并滤除产生伪影的成分。我们强调,DiffStateGrad提高了扩散模型在测量引导步长和噪声选择方面的鲁棒性,同时提高了最坏情况下的性能。最后,我们证明了DiffStateGrad在线性和非线性图像恢复反问题上优于现有技术水平。||
|**2024-10-04**|[Generative Semantic Communication for Text-to-Speech Synthesis](http://arxiv.org/abs/2410.03459)|null|语义通信是一种很有前景的技术,它只传输源数据的语义信息,从而提高通信效率。然而,传统的语义通信方法主要集中在数据重建任务上,对于文本到语音(TTS)合成等新兴的生成任务来说,效率可能不高。为了解决这一局限性,本文利用生成式人工智能技术,开发了一种新的TTS合成生成式语义通信框架。首先,我们利用预先训练好的大型语音模型WavLM和残差矢量量化方法,分别在发送端和接收端构建了两个语义知识库(KB)。发送端的KB能够有效地提取语义,而接收端的KB则有助于逼真的语音合成。然后,我们采用Transformer编码器和扩散模型来实现高效的语义编码,而不会引入显著的通信开销。最后,数值结果表明,在加性高斯白噪声信道和瑞利衰落信道两种情况下,我们的框架在生成语音的保真度方面都比四种基线方法高得多。||
|**2024-10-03**|[Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models](http://arxiv.org/abs/2410.02740)|null|近年来,多模态模型的进步凸显了重写图像描述对于提高性能的价值,但关键挑战依然存在。例如,虽然合成图像描述通常能提供更高的质量和图文一致性,但尚不清楚它们是否可以完全替代替代文本:合成图像描述的作用以及它们在预训练中与原始网络爬取的替代文本的交互作用仍不清楚。此外,不同的多模态基础模型可能对特定的图像描述格式有独特的偏好,但识别每种模型最佳图像描述的工作仍然有限。在这项工作中,我们提出了一种新颖的、可控的、可扩展的图像描述生成流程,旨在生成针对各种多模态模型量身定制的不同图像描述格式。通过以短合成图像描述 (SSC) 和密集合成图像描述 (DSC+) 作为案例研究,我们系统地探索了它们对 CLIP、多模态大语言模型和扩散模型等模型的影响,以及它们与替代文本的交互作用。我们的研究结果表明,保留合成图像描述和替代文本的混合方法可以优于单独使用合成图像描述,从而提高一致性和性能,并且每个模型都表现出对特定图像描述格式的偏好。这种全面的分析为优化图像描述策略提供了宝贵的见解,从而促进了多模态基础模型的预训练。||
|**2024-10-03**|[A Photonic Parameter-shift Rule: Enabling Gradient Computation for Photonic Quantum Computers](http://arxiv.org/abs/2410.02726)|null|我们提出了一种在线性光量子计算平台上实现的量子算法中进行梯度计算的方法。虽然参数移位规则已成为基于量子比特门的量子计算中计算梯度的主要方法,但由于福克空间中微分相移算符的非幺正性,它们在光子平台上的直接应用受到了阻碍。我们引入了一种克服了这一限制的光子参数移位规则,为线性光量子处理器中的梯度计算提供了一个精确的公式。我们的方法与输入光子数呈线性比例,并且在每次评估中使用具有移位参数的相同参数化光子电路。这一进步弥合了光子量子计算中的一个关键差距,使得能够在近期光子量子处理器上对变分量子算法进行有效的基于梯度的优化。我们通过量子化学和生成模型任务中的数值模拟证明了我们方法的有效性,与其他基于梯度和无梯度的方法相比,该方法显示出优越的优化性能以及对有限采样和光子可分辨性噪声的鲁棒性。||
|**2024-10-03**|[SteerDiff: Steering towards Safe Text-to-Image Diffusion Models](http://arxiv.org/abs/2410.02710)|null|文本到图像 (T2I) 扩散模型因其能够生成具有精确文本对齐的高质量图像而备受关注。然而,这些模型也可能被滥用于制作不当内容。现有的安全措施通常依赖于文本分类器或类似 ControlNet 的方法,但往往不够充分。传统的文本分类器依赖于大规模标记数据集,并且很容易通过改写来绕过。随着扩散模型的不断扩展,微调这些安全措施变得越来越具有挑战性,并且缺乏灵活性。最近的红队攻击研究进一步强调了需要一种新的范式来防止生成不当内容。在本文中,我们介绍了 SteerDiff,这是一个轻量级的适配器模块,旨在充当用户输入和扩散模型之间的中介,确保生成的图像符合道德和安全标准,并且对可用性的影响微乎其微。SteerDiff 识别并操纵文本嵌入空间中的不当概念,以引导模型远离有害输出。我们进行了各种概念遗忘任务的广泛实验,以评估我们方法的有效性。此外,我们针对多种红队攻击策略对 SteerDiff 进行了基准测试,以评估其鲁棒性。最后,我们探讨了 SteerDiff 在概念遗忘任务中的潜力,展示了其在文本条件图像生成中的多功能性。||
|**2024-10-03**|[ControlAR: Controllable Image Generation with Autoregressive Models](http://arxiv.org/abs/2410.02705)|**[link](https://github.com/hustvl/controlar)**|自回归 (AR) 模型将图像生成重新定义为下一个标记预测任务,展现出惊人的潜力,并逐渐成为扩散模型的有力竞争者。然而,类似于 ControlNet 的控制到图像生成在 AR 模型中仍然很大程度上未被探索。尽管受大型语言模型进步的启发,一种自然而然的方法是将控制图像标记化为标记,并在解码图像标记之前将它们预填充到自回归模型中,但与 ControlNet 相比,这种方法在生成质量方面仍然不足,并且效率低下。为此,我们引入了 ControlAR,这是一个高效且有效的框架,用于将空间控制集成到自回归图像生成模型中。首先,我们探索了 AR 模型的控制编码,并提出了一种轻量级的控制编码器,将空间输入(例如,Canny 边缘或深度图)转换为控制标记。然后,ControlAR 利用条件解码方法,根据控制标记和图像标记之间的每个标记融合(类似于位置编码)生成下一个图像标记。与预填充标记相比,使用条件解码显着增强了 AR 模型的控制能力,同时保持了模型的效率。此外,令人惊讶的是,所提出的 ControlAR 通过条件解码和特定控制使 AR 模型能够生成任意分辨率的图像。大量实验表明,所提出的 ControlAR 能够在包括边缘、深度和分割掩码在内的不同输入上进行自回归控制到图像生成。此外,定量和定性结果都表明 ControlAR 超越了先前最先进的可控扩散模型,例如 ControlNet++。代码、模型和演示将很快在 https://github.com/hustvl/ControlAR 上提供。||
|**2024-10-03**|[GUD: Generation with Unified Diffusion](http://arxiv.org/abs/2410.02667)|null|扩散生成模型通过反转将噪声逐步添加到数据样本的过程,将噪声转换为数据。受物理学中重整化群概念的启发,该概念分析不同尺度的系统,我们通过探索三个关键设计方面来重新审视扩散模型:1)扩散过程在其上运行的表示的选择(例如,像素、PCA、傅里叶或小波基),2)数据在扩散过程中被转换成先验分布(例如,具有协方差 $\Sigma$ 的高斯分布),以及 3)应用于数据不同部分的噪声水平的调度,由组件级噪声调度捕获。结合这些选择的灵活性,我们为扩散生成模型开发了一个统一的框架,极大地增强了设计自由度。特别是,我们引入了软条件模型,可以在标准扩散模型和自回归模型(在任何基础上)之间平滑插值,从概念上连接了这两种方法。我们的框架开辟了一个广阔的设计空间,可以实现更高效的训练和数据生成,并为集成不同生成方法和生成任务的新颖架构铺平道路。||
|**2024-10-03**|[Grounded Answers for Multi-agent Decision-making Problem through Generative World Model](http://arxiv.org/abs/2410.02664)|null|生成模型的最新进展促进了图像生成和聊天机器人等许多领域的重大创新。尽管取得了成功,但这些模型在解决复杂的多智能体决策问题时,常常会产生粗略且误导性的解决方案,因为它们缺乏像人类一样的试错经验和推理能力。为了解决这一局限性,我们探索了一种将语言引导的模拟器集成到多智能体强化学习管道中的范式,以增强生成的答案质量。该模拟器是一个分别学习动力学和奖励的世界模型,其中动力学模型包括一个图像分词器和一个因果Transformer,用于自回归地生成交互转换,而奖励模型是一个双向Transformer,通过在语言指导下最大化专家演示中轨迹的可能性来学习。给定当前状态的图像和任务描述,我们使用世界模型来训练联合策略,并通过在动力学模型上运行收敛的策略来生成图像序列作为答案。实证结果表明,该框架可以通过在星际争霸多智能体挑战基准测试的训练和未见任务上表现出优异的性能,从而改进多智能体决策问题的答案。特别是,它可以生成一致的交互序列和交互状态下可解释的奖励函数,为未来训练生成模型开辟了道路。||
|**2024-10-03**|[Scalable Simulation-free Entropic Unbalanced Optimal Transport](http://arxiv.org/abs/2410.02656)|null|最优传输(OT)问题旨在寻找一个连接两个分布的传输映射,同时最小化给定的成本函数。寻找这样的传输映射在机器学习中有着广泛的应用,例如生成模型和图像到图像的转换。在本文中,我们介绍了一种可扩展且无需模拟的方法来解决熵非平衡最优传输(EUOT)问题。我们推导了该EUOT问题的动力学形式,它是薛定谔桥(SB)问题的推广。在此基础上,我们从随机最优控制的角度推导了EUOT问题的对偶形式和最优性条件。通过利用这些性质,我们提出了一种无需模拟的算法来求解EUOT,称为Simulation-free EUOT (SF-EUOT)。现有的SB模型在训练和评估过程中需要昂贵的模拟成本,而我们的模型利用互易性实现了无需模拟的训练和一步生成。与之前的SB方法相比,我们的模型在生成模型和图像到图像转换任务中显示出显著提高的可扩展性。||
|**2024-10-03**|[Measuring and Improving Persuasiveness of Generative Models](http://arxiv.org/abs/2410.02653)|null|大型语言模型 (LLM) 正越来越多地用于涉及生成人类消费内容(例如营销)以及直接与人类互动(例如通过聊天机器人)的工作流程中。开发能够生成可验证的说服性信息的此类系统,对社会来说既有机遇也有挑战。一方面,此类系统可以对广告和社会公益等领域产生积极影响,例如解决药物成瘾问题;另一方面,它们也可能被滥用于传播错误信息和塑造政治观点。为了引导 LLM 对社会的影响,我们需要开发系统来衡量和比较它们的说服力。出于这种动机,我们推出了 PersuasionBench 和 PersuasionArena,这是第一个包含一系列任务的大型基准和竞技场,用于自动衡量生成模型的说服能力。我们调查了 LLM 在多大程度上了解和利用了可以帮助它们生成更有说服力的语言的语言模式。我们的研究结果表明,LLM 的说服力与其模型规模呈正相关,但较小的模型也可以比更大的模型具有更高的说服力。值得注意的是,使用合成数据集和自然数据集进行的目标训练显着增强了较小模型的说服能力,这对依赖规模的假设提出了挑战。我们的研究结果对模型开发者和政策制定者都具有重要意义。例如,虽然欧盟人工智能法案和加州的 SB-1047 旨在根据浮点运算次数来监管人工智能模型,但我们证明,仅凭此类简单指标无法完全捕捉人工智能的社会影响。我们邀请社区探索并贡献 PersuasionArena 和 PersuasionBench(网址为 https://bit.ly/measure-persuasion),以促进我们对人工智能驱动型说服及其社会影响的理解。||
|**2024-10-03**|[Beyond Squared Error: Exploring Loss Design for Enhanced Training of Generative Flow Networks](http://arxiv.org/abs/2410.02596)|null|生成流网络 (GFlowNets) 是一类新颖的生成模型,旨在从非规范化分布中采样,并在各种重要任务中得到应用,其训练算法引起了人们极大的研究兴趣。通常,GFlowNets 的训练是通过将采样的训练对象上的前向流与反向流进行拟合来实现的。先前的工作重点关注训练对象的选择、参数化、采样和重采样策略以及反向策略,旨在增强训练过程中的信用分配、探索或利用。然而,回归损失的选择却被忽视了,而它极大地影响了训练不足策略的探索和利用行为。由于缺乏对选择合适的回归损失的理论理解,大多数现有算法通过最小化对数空间中前向流和反向流的平方误差来训练流网络,即使用二次回归损失。在这项工作中,我们严格证明了不同的回归损失对应于特定的散度度量,这使我们能够根据相应散度度量的期望属性来设计和分析回归损失。具体来说,我们研究了两个关键属性:零强制和零避免,前者促进利用和更高的奖励,而后者鼓励探索并增强多样性。基于我们的理论框架,我们提出了三种新的回归损失,即 Shifted-Cosh、Linex(1/2) 和 Linex(1)。我们通过三个基准测试来评估它们:超网格、位序列生成和分子生成。我们提出的损失函数与大多数现有训练算法兼容,并在收敛速度、样本多样性和鲁棒性方面显著提高了算法的性能。||
|**2024-10-03**|[Local Flow Matching Generative Models](http://arxiv.org/abs/2410.02548)|null|流匹配(FM)是一种无需模拟的方法,用于学习连续且可逆的流,以在两个分布之间进行插值,特别是在生成建模中从噪声生成数据。在本文中,我们介绍了局部流匹配(LFM),它学习一系列 FM 子模型,每个子模型都匹配一个扩散过程,直到数据到噪声方向上的步长时间。在每个步骤中,子模型要插值的两个分布比数据与噪声更接近,这使得可以使用更小的模型进行更快的训练。LFM 的逐步结构天然适合蒸馏,并且可以采用不同的蒸馏技术来加速生成。理论上,我们根据生成的和真实数据分布之间的 $\chi^2$ 散度证明了所提出的流模型的生成保证。在实验中,我们证明了 LFM 与 FM 相比,在表格数据和图像数据集的无条件生成以及机器人操作策略的条件生成方面,具有更高的训练效率和更具竞争力的生成性能。||
|**2024-09-30**|[SpaceMesh: A Continuous Representation for Learning Manifold Surface Meshes](http://arxiv.org/abs/2409.20562)|null|网格在视觉计算和模拟中无处不在,但大多数现有的机器学习技术只能间接地表示网格,例如,将其表示为标量场的水平集或模板的变形,或者表示为缺乏局部结构的无序三角形集合。这项工作提出了一种方案,可以直接生成具有复杂连接性的流形多边形网格作为神经网络的输出。我们的关键创新是在每个网格顶点定义一个连续的潜在连接空间,这意味着离散网格。特别是,我们的顶点嵌入在半边网格表示中生成循环邻居关系,这保证了边的流形性和表示一般多边形网格的能力。这种表示非常适合机器学习和随机优化,并且不受连通性或拓扑结构的限制。我们首先探索了这种表示的基本属性,然后使用它来拟合来自大型数据集的网格分布。生成的模型可以生成具有从数据集总体学习到的镶嵌结构的不同网格,并具有简洁的细节和高质量的网格元素。在应用中,这种方法不仅可以从生成模型中产生高质量的输出,还可以直接学习具有挑战性的几何处理任务,例如网格修复。||
|**2024-09-30**|[COLLAGE: Collaborative Human-Agent Interaction Generation using Hierarchical Latent Diffusion and Language Models](http://arxiv.org/abs/2409.20502)|null|我们提出了一个名为COLLAGE的新框架,用于生成协作式的“主体-客体-主体”交互,该框架利用了大型语言模型(LLM)和分层的、针对动作的矢量量化变分自编码器(VQ-VAE)。我们的模型通过结合LLM的知识和推理能力来指导生成扩散模型,解决了该领域缺乏丰富数据集的问题。分层VQ-VAE架构在多个抽象级别捕获不同的动作特定特征,避免了冗余概念,并实现了高效的多分辨率表示。我们引入了一种在潜在空间中运行的扩散模型,并结合了LLM生成的运动规划线索来指导去噪过程,从而产生更具控制力和多样性的、针对提示词的动作生成。在CORE-4D和InterHuman数据集上的实验结果表明,我们的方法在生成逼真且多样化的协作式“人-物体-人”交互方面非常有效,优于现有最佳方法。我们的工作为在机器人、图形和计算机视觉等各个领域对复杂交互进行建模开辟了新的可能性。||
|**2024-09-30**|[FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing](http://arxiv.org/abs/2409.20500)|null|文本到视频的扩散模型取得了显著的进步。由于其能够生成时间连贯的视频,使用这些基础模型进行零样本视频编辑的研究迅速扩展。为了提高编辑质量,结构化控制经常被用于视频编辑中。在这些技术中,交叉注意力掩码控制以其有效性和效率而著称。然而,当交叉注意力掩码被简单地应用于视频编辑时,它们会引入诸如模糊和闪烁之类的伪影。我们的实验发现了一个先前视频编辑研究中被忽视的关键因素:交叉注意力掩码并非始终清晰,而是随着模型结构和去噪时间步长而变化。为了解决这个问题,我们提出了度量掩码匹配成本 (MMC) 来量化这种可变性,并提出了 FreeMask,一种为特定视频编辑任务选择最佳掩码的方法。使用 MMC 选择的掩码,我们进一步改进了全面注意力特征(例如,时间、交叉和自注意力模块)中的掩码融合机制。我们的方法可以无缝集成到现有的零样本视频编辑框架中,并具有更好的性能,无需控制辅助或参数微调,但能够通过掩码精度控制自适应地解耦未编辑的语义布局。大量实验表明,与最先进的方法相比,FreeMask 实现了卓越的语义保真度、时间一致性和编辑质量。||
|**2024-09-30**|[All-optical autoencoder machine learning framework using diffractive processors](http://arxiv.org/abs/2409.20346)|null|衍射深度神经网络 (D2NN) 以其高速、低功耗和强大的并行性而闻名,已广泛应用于模式识别、图像处理和图像传输等各个领域。然而,现有的网络架构主要关注原始域内的数据表示,对潜在空间的探索有限,从而限制了 D2NN 的信息挖掘能力和多功能集成。在这里,我们提出了一种全光自动编码器 (OAE) 框架,它可以将输入波场编码到潜在空间中的先验形状分布,并将编码的模式解码回原始波场。通过利用 D2NN 的非互易性,OAE 模型在一个波传播方向上充当编码器,而在相反方向上充当解码器。我们进一步将这些模型应用于三个关键领域:图像去噪、抗噪声的可重构图像分类和图像生成。已经进行了概念验证实验以验证数值模拟。我们的 OAE 框架充分利用了潜在空间表示的潜力,使一组衍射处理器能够同时实现图像重建、表示和生成。它可以被视为电子自动编码器模型的对应物和扩展。这项工作不仅为光学生成模型的设计提供了新的见解,而且为开发和应用多功能、高度集成和通用的光学智能系统铺平了道路。||
|**2024-09-30**|[Devil is in Details: Locality-Aware 3D Abdominal CT Volume Generation for Self-Supervised Organ Segmentation](http://arxiv.org/abs/2409.20332)|null|在医学图像分析领域,自监督学习 (SSL) 技术已经出现,以减轻对标签的需求,但由于资源需求不断增加和隐私限制,训练数据的稀缺性仍然是一个挑战。许多努力都采用生成模型来生成跨越不同模态和解剖区域的高保真、未标记的 3D 体积数据。然而,与其他解剖区域相比,腹部内复杂且难以区分的解剖结构对腹部 CT 体积生成提出了独特的挑战。为了应对这一被忽视的挑战,我们引入了局部感知扩散 (Lad),这是一种专为生成精细的 3D 腹部 CT 体积数据而设计的新方法。我们设计了一个局部损失来细化关键的解剖区域,并设计了一个条件提取器将腹部先验信息整合到生成过程中,从而能够生成大量高质量的腹部 CT 体积数据,这些数据对于 SSL 任务至关重要,而无需额外的标签或放射学报告等数据。通过我们的方法生成的体积数据在再现腹部结构方面表现出非凡的保真度,在 AbdomenCT-1K 数据集上将 FID 分数从 0.0034 降低到 0.0002,与真实数据非常接近,并优于当前的方法。大量实验表明,我们的方法在自监督器官分割任务中的有效性,在两个腹部数据集上有效地提高了平均 Dice 分数。这些结果强调了合成数据在推进医学图像分析中的自监督学习方面的潜力。||
|**2024-09-30**|[UIR-LoRA: Achieving Universal Image Restoration through Multiple Low-Rank Adaptation](http://arxiv.org/abs/2409.20197)|**[link](https://github.com/justones/uir-lora)**|Existing unified methods typically treat multi-degradation image restoration as a multi-task learning problem. Despite performing effectively compared to single degradation restoration methods, they overlook the utilization of commonalities and specificities within multi-task restoration, thereby impeding the model's performance. Inspired by the success of deep generative models and fine-tuning techniques, we proposed a universal image restoration framework based on multiple low-rank adapters (LoRA) from multi-domain transfer learning. Our framework leverages the pre-trained generative model as the shared component for multi-degradation restoration and transfers it to specific degradation image restoration tasks using low-rank adaptation. Additionally, we introduce a LoRA composing strategy based on the degradation similarity, which adaptively combines trained LoRAs and enables our model to be applicable for mixed degradation restoration. Extensive experiments on multiple and mixed degradations demonstrate that the proposed universal image restoration method not only achieves higher fidelity and perceptual image quality but also has better generalization ability than other unified image restoration models. Our code is available at https://github.com/Justones/UIR-LoRA.||
|**2024-09-30**|[Ensemble Kalman Diffusion Guidance: A Derivative-free Method for Inverse Problems](http://arxiv.org/abs/2409.20175)|null|在解决逆问题时,使用预训练的扩散模型作为即插即用的先验越来越受欢迎。这种框架可以适应不同的前向模型,而无需重新训练,同时保留了扩散模型的生成能力。尽管它们在许多成像逆问题中取得了成功,但大多数现有方法都依赖于特权信息,例如导数、伪逆或关于前向模型的完整知识。这种依赖性构成了一个重大限制,限制了它们在无法获得此类信息的各种问题中的使用,例如在许多科学应用中。为了解决这个问题,我们提出了用于扩散模型的集成卡尔曼扩散引导 (EnKG),这是一种无导数方法,可以通过仅访问前向模型评估和预训练的扩散模型先验来解决逆问题。我们研究了我们的方法在各种逆问题中的经验有效性,包括科学环境,例如推断流体流动和天文物体,这些都是高度非线性的逆问题,通常只允许对前向模型进行黑盒访问。||
|**2024-09-30**|[Erase, then Redraw: A Novel Data Augmentation Approach for Free Space Detection Using Diffusion Model](http://arxiv.org/abs/2409.20164)|null|Data augmentation is one of the most common tools in deep learning, underpinning many recent advances including tasks such as classification, detection, and semantic segmentation. The standard approach to data augmentation involves simple transformations like rotation and flipping to generate new images. However, these new images often lack diversity along the main semantic dimensions within the data. Traditional data augmentation methods cannot alter high-level semantic attributes such as the presence of vehicles, trees, and buildings in a scene to enhance data diversity. In recent years, the rapid development of generative models has injected new vitality into the field of data augmentation. In this paper, we address the lack of diversity in data augmentation for the road detection task by using a pre-trained text-to-image diffusion model to parameterize image-to-image transformations. Our method involves editing images using these diffusion models to change their semantics. In essence, we achieve this goal by erasing instances of real objects from the original dataset and generating new instances with similar semantics in the erased regions using the diffusion model, thereby expanding the original dataset. We evaluate our approach on the KITTI road dataset and achieve the best results compared to other data augmentation methods, which demonstrates the effectiveness of our proposed method.||
|**2024-09-30**|[Conditional Diffusion Models are Minimax-Optimal and Manifold-Adaptive for Conditional Distribution Estimation](http://arxiv.org/abs/2409.20124)|null|We consider a class of conditional forward-backward diffusion models for conditional generative modeling, that is, generating new data given a covariate (or control variable). To formally study the theoretical properties of these conditional generative models, we adopt a statistical framework of distribution regression to characterize the large sample properties of the conditional distribution estimators induced by these conditional forward-backward diffusion models. Here, the conditional distribution of data is assumed to smoothly change over the covariate. In particular, our derived convergence rate is minimax-optimal under the total variation metric within the regimes covered by the existing literature. Additionally, we extend our theory by allowing both the data and the covariate variable to potentially admit a low-dimensional manifold structure. In this scenario, we demonstrate that the conditional forward-backward diffusion model can adapt to both manifold structures, meaning that the derived estimation error bound (under the Wasserstein metric) depends only on the intrinsic dimensionalities of the data and the covariate.||
|**2024-09-30**|[Training a Computer Vision Model for Commercial Bakeries with Primarily Synthetic Images](http://arxiv.org/abs/2409.20122)|null|In the food industry, reprocessing returned product is a vital step to increase resource efficiency. [SBB23] presented an AI application that automates the tracking of returned bread buns. We extend their work by creating an expanded dataset comprising 2432 images and a wider range of baked goods. To increase model robustness, we use the generative models pix2pix and CycleGAN to create synthetic images. We train the state-of-the-art object detection models YOLOv9 and YOLOv8 on our detection task. Our overall best-performing model achieved an average precision AP@0.5 of 90.3% on our test set.||
|**2024-09-27**|[ $O(d/T)$ Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions](http://arxiv.org/abs/2409.18959)|null|基于分数的扩散模型通过学习逆转将目标分布数据扰动为噪声的扩散过程来生成新数据,已经在各种生成任务中取得了显著成功。尽管它们具有优越的经验性能,但现有的理论保证通常受到严格假设或次优收敛速度的限制。在本文中,我们以最小的假设建立了流行的基于 SDE 的采样器的快速收敛理论。我们的分析表明,如果提供分数函数的 $\ell_{2}$ 精度估计,则目标分布和生成分布之间的总变差距离的上限为 $O(d/T)$(忽略对数因子),其中 $d$ 是数据维度,$T$ 是步数。该结果适用于任何具有一阶矩有限的目标分布。据我们所知,这改进了基于 SDE 的采样器和另一种基于 ODE 的采样器的现有收敛理论,同时对目标数据分布和分数估计施加了最小假设。这是通过一组新颖的分析工具实现的,该工具提供了对误差在反向过程的每个步骤中如何传播的细粒度表征。||
|**2024-09-27**|[ReviveDiff: A Universal Diffusion Model for Restoring Images in Adverse Weather Conditions](http://arxiv.org/abs/2409.18932)|null|在诸如夜间、雾天、雨天和水下等挑战性环境中拍摄的图像经常会遭受严重的质量下降,导致视觉质量大幅降低。有效地恢复这些退化的图像对于后续的视觉任务至关重要。虽然许多现有方法已经成功地结合了针对单个任务的特定先验知识,但这些定制解决方案限制了它们对其他退化的适用性。在这项工作中,我们提出了一个通用的网络架构,称为“ReviveDiff”,它可以解决各种退化问题,并通过增强和恢复图像质量使其恢复生机。我们的方法受到以下观察结果的启发:与运动或电子问题造成的退化不同,恶劣条件下的质量退化主要源于自然介质(如雾、水和低亮度),这些介质通常保留了物体的原始结构。为了恢复此类图像的质量,我们利用了扩散模型的最新进展,并开发了ReviveDiff,从宏观和微观层面恢复图像质量,涵盖决定图像质量的一些关键因素,如清晰度、失真、噪声水平、动态范围和色彩准确度。我们在涵盖五种退化条件(雨天、水下、低光、烟雾和夜间雾霾)的七个基准数据集上对ReviveDiff进行了严格评估。我们的实验结果表明,ReviveDiff在定量和视觉上都优于最先进的方法。||
|**2024-09-27**|[Unsupervised Low-light Image Enhancement with Lookup Tables and Diffusion Priors](http://arxiv.org/abs/2409.18899)|null|弱光图像增强 (LIE) 旨在精确有效地恢复在弱光环境下降质的图像。最近先进的 LIE 技术正在使用深度神经网络,这需要大量的弱光-正常光图像对、网络参数和计算资源。因此,它们的实用性受到限制。在这项工作中,我们设计了一种基于扩散先验和查找表 (DPLUT) 的新型无监督 LIE 框架,以实现高效的弱光图像恢复。所提出的方法包括两个关键组件:光照调整查找表 (LLUT) 和噪声抑制查找表 (NLUT)。LLUT 使用一组无监督损失进行优化。它旨在预测特定图像动态范围调整的逐像素曲线参数。NLUT 旨在去除光线变亮后放大的噪声。由于扩散模型对噪声很敏感,因此引入了扩散先验以实现高性能的噪声抑制。大量实验表明,我们的方法在视觉质量和效率方面优于最先进的方法。||
|**2024-09-27**|[Detecting Dataset Abuse in Fine-Tuning Stable Diffusion Models for Text-to-Image Synthesis](http://arxiv.org/abs/2409.18897)|null|文图生成在生成逼真和风格化的图像方面已经变得非常流行,这通常需要使用特定领域的数据集对生成模型进行微调以完成专门的任务。然而,这些有价值的数据集面临着未经授权使用和未经批准共享的风险,损害了所有者的权利。在本文中,我们解决了在对 Stable Diffusion 模型进行文图生成的微调过程中出现的数据集滥用问题。我们提出了一个数据集水印框架,旨在检测未经授权的使用并追踪数据泄露。该框架在多个水印方案中采用了两种关键策略,对大规模数据集授权有效。大量实验表明,该框架有效,对数据集的影响最小(只需修改 2% 的数据即可实现高检测精度),并且能够追踪数据泄露。我们的结果还突出了该框架的鲁棒性和可迁移性,证明了其在检测数据集滥用方面的实际适用性。||
|**2024-09-27**|[Explainable Artifacts for Synthetic Western Blot Source Attribution](http://arxiv.org/abs/2409.18881)|**[link](https://github.com/phillipecardenuto/ai-wblots-detector)**|人工智能领域的最新进展使得生成模型能够生成与真实图像难以区分的合成科学图像,这对习惯于处理此类内容的专业科学家也构成了挑战。当被称为“论文工厂”的组织利用这些技术系统地生成虚假文章时,它们可能会助长关于无根据科学的错误信息的传播,从而有可能破坏对科学研究的信任。虽然之前的研究已经探索了黑盒解决方案(例如卷积神经网络)来识别合成内容,但只有一部分研究解决了跨不同模型进行泛化并深入了解合成图像中可用于检测过程的人工痕迹的挑战。本研究旨在识别由最先进的生成模型(例如,生成对抗网络和扩散模型)产生的可解释的人工痕迹,并利用它们进行开放集识别和来源归因(即,指出创建图像的模型)。||
|**2024-09-27**|[Emu3: Next-Token Prediction is All You Need](http://arxiv.org/abs/2409.18869)|null|虽然下一词预测被认为是通向人工通用智能的有希望的途径,但它在多模态任务中一直难以取得优异表现,而多模态任务仍然由扩散模型(例如,Stable Diffusion)和组合方法(例如,CLIP 与 LLM 相结合)主导。在本文中,我们介绍了 Emu3,这是一套全新的最先进的多模态模型,仅使用下一词预测进行训练。通过将图像、文本和视频标记化为离散空间,我们在多模态序列的混合上从头开始训练单个变换器。Emu3 在生成和感知任务中均优于多个完善的特定任务模型,超越了 SDXL 和 LLaVA-1.6 等旗舰模型,同时无需扩散或组合架构。Emu3 还能够通过预测视频序列中的下一个标记来生成高保真视频。我们通过专注于单一焦点:标记,简化了复杂的多模态模型设计,从而在训练和推理过程中释放了巨大的扩展潜力。我们的结果表明,下一词预测是构建超越语言的通用多模态智能的有希望的途径。我们开源了关键技术和模型,以支持在该方向上的进一步研究。||
|**2024-09-27**|[Challenges of Generating Structurally Diverse Graphs](http://arxiv.org/abs/2409.18859)|**[link](https://github.com/Abusagit/Challenges-on-generating-structurally-diverse-graphs)**|对于许多与图相关的问题,拥有一组结构多样化的图至关重要。例如,此类图可用于测试图算法或其神经网络近似。然而,据我们所知,生成结构多样化图的问题尚未在文献中得到探讨。在本文中,我们填补了这一空白。首先,我们讨论了如何定义一组图的多样性,为什么这项任务不简单,以及如何选择合适的度量标准。然后,对于给定的多样性度量标准,我们提出并比较了几种优化它的算法:我们考虑了基于标准随机图模型、局部图优化、遗传算法和神经生成模型的方法。我们证明,相较于基本的随机图生成器,可以显著提高多样性。此外,我们对生成图的分析使我们能够更好地理解图距离的特性:根据用于优化的多样性度量标准,获得的图可能具有非常不同的结构特性,这为了解多样性度量标准中使用的图距离的敏感性提供了见解。||
|**2024-09-27**|[Convergence of Diffusion Models Under the Manifold Hypothesis in High-Dimensions](http://arxiv.org/abs/2409.18804)|null|去噪扩散概率模型 (DDPM) 是一种强大的最先进方法,用于从高维数据分布生成合成数据,并广泛用于图像、音频和视频生成以及科学及其他领域的更多应用。流形假设指出高维数据通常位于环境空间内的低维流形上,并且被广泛认为在提供的示例中成立。虽然最近的结果为了解扩散模型如何适应流形假设提供了宝贵的见解,但它们没有捕捉到这些模型的巨大经验成功,这使其成为一个非常富有成果的研究方向。在这项工作中,我们研究了流形假设下的 DDPM,并证明了它们在学习分数方面实现了与环境维度无关的速率。在采样方面,我们获得了关于 Kullback-Leibler 散度的与环境维度无关的速率,以及关于 Wasserstein 距离的 $O(\sqrt{D})$ 。我们通过开发一个新的框架来做到这一点,该框架将扩散模型连接到经过充分研究的高斯过程极值理论。||
|**2024-09-27**|[Geometric deep learning for galaxy-halo connection: a case study for galaxy intrinsic alignments](http://arxiv.org/abs/2409.18761)|null|即将进行的宇宙学成像巡天,例如 Rubin Observatory LSST,需要包含真实星系群的大规模模拟,以用于各种科学应用。其中一个特别值得关注的现象是内禀排列 (IA),即星系倾向于朝向超密度区域排列,如果不对其进行适当建模,可能会在弱引力透镜分析中引入显著的系统偏差。由于计算限制,在广阔的体积范围内模拟与 IA 相关的星系形成和演化的复杂细节是不切实际的。作为替代方案,我们提出了一种在 IllustrisTNG-100 模拟上训练的深度生成模型,用于对 3D 星系形状和方向进行采样,以准确地再现内禀排列以及相关的标量特征。我们将宇宙网建模为一组图,每个图代表一个晕,节点代表子晕/星系。该架构由一个 SO(3) $\times$ $\mathbb{R}^n$ 扩散生成模型组成,用于星系方向和 $n$ 个标量,并使用明确遵守宇宙欧几里德对称性的 E(3) 等变图神经网络实现。该模型能够学习和预测与参考模拟在统计上一致的特征,例如星系方向。值得注意的是,我们的模型展示了联合建模欧几里德值标量(星系大小、形状和颜色)以及非欧几里德值 SO(3) 量(星系方向)的能力,这些量受非线性尺度上高度复杂的星系物理支配。||
|**2024-09-27**|[Unsupervised Fingerphoto Presentation Attack Detection With Diffusion Models](http://arxiv.org/abs/2409.18636)|null|基于智能手机的非接触式指纹认证由于智能手机相机技术的快速发展,已成为传统接触式指纹生物识别系统的可靠替代方案。尽管其便利性很高,但通过指纹照片进行的指纹认证更容易受到伪造攻击,这促使最近的研究工作致力于开发指纹照片呈现攻击检测 (PAD) 技术。然而,先前的 PAD 方法利用了监督学习方法,这些方法需要真实和攻击样本的标记训练数据。这可能会遇到两个关键问题,即 (i) 泛化性:检测训练数据中未见过的呈现攻击工具 (PAI),以及 (ii) 可扩展性:使用不同的 PAI 收集大型攻击样本数据集。为了应对这些挑战,我们提出了一种基于最先进的深度学习扩散模型的新型无监督方法,即去噪扩散概率模型 (DDPM),该模型仅使用真实样本进行训练。所提出的方法通过计算 DDPM 的输入和输出对之间的重建相似性来检测呈现攻击 (PA)。我们展示了跨三个 PAI 数据集的大量实验,以测试我们方法的准确性和泛化能力。结果表明,与其他基线无监督方法相比,所提出的基于 DDPM 的 PAD 方法在多个 PAI 类别上实现了显着更好的检测错误率。||
|**2024-09-26**|[FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner](http://arxiv.org/abs/2409.18128)|**[link](https://github.com/shiml20/flowturbo)**|基于扩散模型在视觉生成方面的成功,基于流的模型作为另一类重要的生成模型重新兴起,在视觉质量和推理速度方面都取得了与之相当或更好的性能。通过流匹配学习速度场,基于流的模型倾向于产生更直的采样轨迹,这在采样过程中是有利的。然而,与快速采样器已经得到很好发展的扩散模型不同,基于流的生成模型的有效采样还很少被探索。在本文中,我们提出了一个名为FlowTurbo的框架,以加速基于流的模型的采样,同时提高采样质量。我们的主要观察结果是,基于流模型中的速度预测器输出在采样过程中会变得稳定,从而可以通过轻量级速度优化器估计速度。此外,我们还引入了一些技术,包括伪校正器和样本感知编译,以进一步减少推理时间。由于FlowTurbo没有改变多步采样范式,因此可以有效地应用于图像编辑、修复等各种任务。通过将FlowTurbo集成到不同的基于流的模型中,我们在类别条件生成上获得了53.1%$\sim$58.3%的加速比,在文本到图像生成上获得了29.8%$\sim$38.5%的加速比。值得注意的是,FlowTurbo在ImageNet上实现了100 (ms / img)时FID为2.12,38 (ms / img)时FID为3.93,实现了实时图像生成,并建立了新的最先进水平。代码可在https://github.com/shiml20/FlowTurbo获取。||
|**2024-09-26**|[Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction](http://arxiv.org/abs/2409.18124)|null|利用预训练文本到图像扩散模型的视觉先验知识为增强密集预测任务中的零样本泛化能力提供了一种很有前景的解决方案。然而,现有方法通常不加批判地使用原始的扩散公式,由于密集预测和图像生成之间的根本差异,这可能不是最佳选择。在本文中,我们对用于密集预测的扩散公式进行了系统分析,重点关注质量和效率。我们发现,用于图像生成的原始参数化类型(学习预测噪声)对密集预测是有害的;多步加噪/去噪扩散过程也是不必要的,并且难以优化。基于这些见解,我们推出了Lotus,这是一个基于扩散的视觉基础模型,它采用了一种简单而有效的密集预测适应协议。具体来说,Lotus被训练成直接预测注释而不是噪声,从而避免了有害的方差。我们还将扩散过程重新定义为单步过程,简化了优化并显著提高了推理速度。此外,我们引入了一种称为细节保留器的新型调整策略,它可以实现更准确、更细粒度的预测。在不扩大训练数据或模型容量的情况下,Lotus在各种数据集上的零样本深度和法线估计方面均达到了最先进的性能。它还显著提高了效率,比大多数现有的基于扩散的方法快数百倍。||
|**2024-09-26**|[EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation](http://arxiv.org/abs/2409.18114)|null|目前的自动回归网格生成方法存在着诸如网格不完整、细节不足和泛化能力差等问题。在本文中,我们提出了一种自回归自动编码器(ArAE)模型,能够生成高达4,000个面片、空间分辨率为 $512^3$ 的高质量三维网格。我们引入了一种新颖的网格标记化算法,可以有效地将三角网格压缩成一维标记序列,显著提高了训练效率。此外,我们的模型将变长三角网格压缩成固定长度的潜在空间,从而能够训练潜在扩散模型以获得更好的泛化能力。大量实验表明,我们的模型在点云和图像条件网格生成任务中均表现出优越的质量、多样性和泛化能力。||
|**2024-09-26**|[StackGen: Generating Stable Structures from Silhouettes via Diffusion](http://arxiv.org/abs/2409.18098)|null|Humans naturally obtain intuition about the interactions between and the stability of rigid objects by observing and interacting with the world. It is this intuition that governs the way in which we regularly configure objects in our environment, allowing us to build complex structures from simple, everyday objects. Robotic agents, on the other hand, traditionally require an explicit model of the world that includes the detailed geometry of each object and an analytical model of the environment dynamics, which are difficult to scale and preclude generalization. Instead, robots would benefit from an awareness of intuitive physics that enables them to similarly reason over the stable interaction of objects in their environment. Towards that goal, we propose StackGen, a diffusion model that generates diverse stable configurations of building blocks matching a target silhouette. To demonstrate the capability of the method, we evaluate it in a simulated environment and deploy it in the real setting using a robotic arm to assemble structures generated by the model.||
|**2024-09-26**|[DiffSSC: Semantic LiDAR Scan Completion using Denoising Diffusion Probabilistic Models](http://arxiv.org/abs/2409.18092)|null|感知系统在自动驾驶中起着至关重要的作用,它结合了多个传感器和相应的计算机视觉算法。3D 激光雷达传感器被广泛用于捕捉车辆周围环境的稀疏点云。然而,由于这些点云的稀疏性和缺乏语义信息,此类系统难以感知遮挡区域和场景中的间隙。为了应对这些挑战,语义场景补全 (SSC) 在给定原始激光雷达测量值的情况下,联合预测场景中未观察到的几何形状和语义信息,旨在实现更完整的场景表示。基于扩散模型在图像生成和超分辨率任务中的良好结果,我们建议将其扩展到 SSC,方法是在点空间和语义空间中分别实现去噪和加噪扩散过程。为了控制生成过程,我们采用语义激光雷达点云作为条件输入,并设计了局部和全局正则化损失来稳定去噪过程。我们在自动驾驶数据集上评估了我们的方法,我们的方法在 SSC 方面的性能优于最先进的方法。||
|**2024-09-26**|[Stable Video Portraits](http://arxiv.org/abs/2409.18083)|null|Rapid advances in the field of generative AI and text-to-image methods in particular have transformed the way we interact with and perceive computer-generated imagery today. In parallel, much progress has been made in 3D face reconstruction, using 3D Morphable Models (3DMM). In this paper, we present SVP, a novel hybrid 2D/3D generation method that outputs photorealistic videos of talking faces leveraging a large pre-trained text-to-image prior (2D), controlled via a 3DMM (3D). Specifically, we introduce a person-specific fine-tuning of a general 2D stable diffusion model which we lift to a video model by providing temporal 3DMM sequences as conditioning and by introducing a temporal denoising procedure. As an output, this model generates temporally smooth imagery of a person with 3DMM-based controls, i.e., a person-specific avatar. The facial appearance of this person-specific avatar can be edited and morphed to text-defined celebrities, without any fine-tuning at test time. The method is analyzed quantitatively and qualitatively, and we show that our method outperforms state-of-the-art monocular head avatar methods.||
|**2024-09-26**|[PhoCoLens: Photorealistic and Consistent Reconstruction in Lensless Imaging](http://arxiv.org/abs/2409.17996)|null|Lensless cameras offer significant advantages in size, weight, and cost compared to traditional lens-based systems. Without a focusing lens, lensless cameras rely on computational algorithms to recover the scenes from multiplexed measurements. However, current algorithms struggle with inaccurate forward imaging models and insufficient priors to reconstruct high-quality images. To overcome these limitations, we introduce a novel two-stage approach for consistent and photorealistic lensless image reconstruction. The first stage of our approach ensures data consistency by focusing on accurately reconstructing the low-frequency content with a spatially varying deconvolution method that adjusts to changes in the Point Spread Function (PSF) across the camera's field of view. The second stage enhances photorealism by incorporating a generative prior from pre-trained diffusion models. By conditioning on the low-frequency content retrieved in the first stage, the diffusion model effectively reconstructs the high-frequency details that are typically lost in the lensless imaging process, while also maintaining image fidelity. Our method achieves a superior balance between data fidelity and visual quality compared to existing methods, as demonstrated with two popular lensless systems, PhlatCam and DiffuserCam. Project website: https://phocolens.github.io/.||
|**2024-09-26**|[Joint Localization and Planning using Diffusion](http://arxiv.org/abs/2409.17995)|null|Diffusion models have been successfully applied to robotics problems such as manipulation and vehicle path planning. In this work, we explore their application to end-to-end navigation -- including both perception and planning -- by considering the problem of jointly performing global localization and path planning in known but arbitrary 2D environments. In particular, we introduce a diffusion model which produces collision-free paths in a global reference frame given an egocentric LIDAR scan, an arbitrary map, and a desired goal position. To this end, we implement diffusion in the space of paths in SE(2), and describe how to condition the denoising process on both obstacles and sensor observations. In our evaluation, we show that the proposed conditioning techniques enable generalization to realistic maps of considerably different appearance than the training environment, demonstrate our model's ability to accurately describe ambiguous solutions, and run extensive simulation experiments showcasing our model's use as a real-time, end-to-end localization and planning stack.||
|**2024-09-26**|[CNCA: Toward Customizable and Natural Generation of Adversarial Camouflage for Vehicle Detectors](http://arxiv.org/abs/2409.17963)|null|Prior works on physical adversarial camouflage against vehicle detectors mainly focus on the effectiveness and robustness of the attack. The current most successful methods optimize 3D vehicle texture at a pixel level. However, this results in conspicuous and attention-grabbing patterns in the generated camouflage, which humans can easily identify. To address this issue, we propose a Customizable and Natural Camouflage Attack (CNCA) method by leveraging an off-the-shelf pre-trained diffusion model. By sampling the optimal texture image from the diffusion model with a user-specific text prompt, our method can generate natural and customizable adversarial camouflage while maintaining high attack performance. With extensive experiments on the digital and physical worlds and user studies, the results demonstrate that our proposed method can generate significantly more natural-looking camouflage than the state-of-the-art baselines while achieving competitive attack performance. Our code is available at https://anonymous.4open.science/r/CNCA-1D54||
|**2024-09-26**|[Relativistic diffusion model for hadron production in p-Pb collisions at the LHC](http://arxiv.org/abs/2409.17960)|null|We investigate charged-hadron production in relativistic heavy-ion collisions of asymmetric systems within a nonequilibrium-statistical framework. Calculated centrality-dependent pseudorapidity distributions for p-Pb collisions at sqrt(s_NN)=5.02 and 8.16 TeV are compared with data from the Large Hadron Collider (LHC). Our approach combines a relativistic diffusion model with formulations based on quantum chromodynamics while utilizing numerical solutions of a Fokker-Planck equation to account for the shift and broadening of the fragmentation sources for particle-production with respect to the stopping (net-baryon) rapidity distributions. To represent the centrality dependence of charged-hadron production in asymmetric systems over a broad region of pseudorapidities, the consideration and precise modelling of the fragmentation sources - along with the central gluon-gluon source - is found to be essential. Specifically, this results in an inversion of the particle-production amplitude from backward- to forward-dominance when transitioning from central to peripheral collisions, in agreement with recent ATLAS and ALICE p-Pb data at sqrt(s_NN)=5.02 TeV.||
|**2024-09-18**|[Massively Multi-Person 3D Human Motion Forecasting with Scene Context](http://arxiv.org/abs/2409.12189)|**[link](https://github.com/felixbmuller/sast)**|Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information on the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion. Unlike previous models, our approach can model interactions between both widely varying numbers of people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study. Code is available at https://github.com/felixbmuller/SAST.||
|**2024-09-18**|[MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion](http://arxiv.org/abs/2409.12140)|null|We introduce MoRAG, a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation. The method enhances motion diffusion models by leveraging additional knowledge obtained through an improved motion retrieval process. By effectively prompting large language models (LLMs), we address spelling errors and rephrasing issues in motion retrieval. Our approach utilizes a multi-part retrieval strategy to improve the generalizability of motion retrieval across the language space. We create diverse samples through the spatial composition of the retrieved motions. Furthermore, by utilizing low-level, part-specific motion information, we can construct motion samples for unseen text descriptions. Our experiments demonstrate that our framework can serve as a plug-and-play module, improving the performance of motion diffusion models. Code, pretrained models and sample videos will be made available at: https://motion-rag.github.io/||
|**2024-09-18**|[Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance](http://arxiv.org/abs/2409.12099)|null|Understanding how humans process visual information is one of the crucial steps for unraveling the underlying mechanism of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI data from visual stimuli, it aims to reconstruct the corresponding visual stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, the incorporation of additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit how modern LDMs effectively incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generations. Specifically, inspired by the two-streams hypothesis suggesting that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI data.||
|**2024-09-18**|[Design of Ligand-Binding Proteins with Atomic Flow Matching](http://arxiv.org/abs/2409.12080)|null|Designing novel proteins that bind to small molecules is a long-standing challenge in computational biology, with applications in developing catalysts, biosensors, and more. Current computational methods rely on the assumption that the binding pose of the target molecule is known, which is not always feasible, as conformations of novel targets are often unknown and tend to change upon binding. In this work, we formulate proteins and molecules as unified biotokens, and present AtomFlow, a novel deep generative model under the flow-matching framework for the design of ligand-binding proteins from the 2D target molecular graph alone. Operating on representative atoms of biotokens, AtomFlow captures the flexibility of ligands and generates ligand conformations and protein backbone structures iteratively. We consider the multi-scale nature of biotokens and demonstrate that AtomFlow can be effectively trained on a subset of structures from the Protein Data Bank, by matching flow vector field using an SE(3) equivariant structure prediction network. Experimental results show that our method can generate high fidelity ligand-binding proteins and achieve performance comparable to the state-of-the-art model RFDiffusionAA, while not requiring bound ligand structures. As a general framework, AtomFlow holds the potential to be applied to various biomolecule generation tasks in the future.||
|**2024-09-18**|[LEMON: Localized Editing with Mesh Optimization and Neural Shaders](http://arxiv.org/abs/2409.12024)|null|In practical use cases, polygonal mesh editing can be faster than generating new ones, but it can still be challenging and time-consuming for users. Existing solutions for this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and view. In this work, we propose LEMON, a mesh editing pipeline that combines neural deferred shading with localized mesh optimization. Our approach begins by identifying the most important vertices in the mesh for editing, utilizing a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. By using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process results in a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline using the DTU dataset, demonstrating that it generates finely-edited meshes more rapidly than the current state-of-the-art methods. We include our code and additional results in the supplementary material.||
|**2024-09-18**|[Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models](http://arxiv.org/abs/2409.11920)|null|In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach involves decomposing complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPT models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. This method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state-of-the-art.||
|**2024-09-18**|[Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation](http://arxiv.org/abs/2409.11904)|null|Efficiently evaluating the performance of text-to-image models is difficult as it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata's technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators and show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.||
|**2024-09-18**|[NT-ViT: Neural Transcoding Vision Transformers for EEG-to-fMRI Synthesis](http://arxiv.org/abs/2409.11836)|null|This paper introduces the Neural Transcoding Vision Transformer (NT-ViT), a generative model designed to estimate high-resolution functional Magnetic Resonance Imaging (fMRI) samples from simultaneous Electroencephalography (EEG) data. A key feature of NT-ViT is its Domain Matching (DM) sub-module which effectively aligns the latent EEG representations with those of fMRI volumes, enhancing the model's accuracy and reliability. Unlike previous methods that tend to struggle with fidelity and reproducibility of images, NT-ViT addresses these challenges by ensuring methodological integrity and higher-quality reconstructions which we showcase through extensive evaluation on two benchmark datasets; NT-ViT outperforms the current state-of-the-art by a significant margin in both cases, e.g. achieving a 10× reduction in RMSE and a 3.14× increase in SSIM on the Oddball dataset. An ablation study also provides insights into the contribution of each component to the model's overall effectiveness. This development is critical in offering a new approach to lessen the time and financial constraints typically linked with high-resolution brain imaging, thereby aiding in the swift and precise diagnosis of neurological disorders. Although it is not a replacement for actual fMRI but rather a step towards making such imaging more accessible, we believe that it represents a pivotal advancement in clinical practice and neuroscience research. Code is available at https://github.com/rom42pla/ntvit.||
|**2024-09-18**|[DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech](http://arxiv.org/abs/2409.11835)|null|In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.||
|**2024-09-18**|[RaggeDi: Diffusion-based State Estimation of Disordered Rags, Sheets, Towels and Blankets](http://arxiv.org/abs/2409.11831)|null|Cloth state estimation is an important problem in robotics. It is essential for the robot to know the accurate state to manipulate cloth and execute tasks such as robotic dressing, stitching, and covering/uncovering human beings. However, estimating cloth state accurately remains challenging due to its high flexibility and self-occlusion. This paper proposes a diffusion model-based pipeline that formulates the cloth state estimation as an image generation problem by representing the cloth state as an RGB image that describes the point-wise translation (translation map) between a pre-defined flattened mesh and the deformed mesh in a canonical space. Then we train a conditional diffusion-based image generation model to predict the translation map based on an observation. Experiments are conducted in both simulation and the real world to validate the performance of our method. Results indicate that our method outperforms two recent methods in both accuracy and speed.||
|**2024-09-17**|[Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion](http://arxiv.org/abs/2409.11406)|null|In 3D modeling, designers often use an existing 3D model as a reference to create new ones. This practice has inspired the development of Phidias, a novel generative model that uses diffusion for reference-augmented 3D generation. Given an image, our method leverages a retrieved or user-provided 3D reference model to guide the generation process, thereby enhancing the generation quality, generalization ability, and controllability. Our model integrates three key components: 1) meta-ControlNet that dynamically modulates the conditioning strength, 2) dynamic reference routing that mitigates misalignment between the input image and 3D reference, and 3) self-reference augmentations that enable self-supervised training with a progressive curriculum. Collectively, these designs result in a clear improvement over existing methods. Phidias establishes a unified framework for 3D generation using text, image, and 3D conditions with versatile applications.||
|**2024-09-17**|[Teaching dark matter simulations to speak the halo language](http://arxiv.org/abs/2409.11401)|**[link](https://github.com/shivampcosmo/gotham)**|We develop a transformer-based conditional generative model for discrete point objects and their properties. We use it to build a model for populating cosmological simulations with gravitationally collapsed structures called dark matter halos. Specifically, we condition our model with dark matter distribution obtained from fast, approximate simulations to recover the correct three-dimensional positions and masses of individual halos. This leads to a first model that can recover the statistical properties of the halos at small scales to better than 3% level using an accelerated dark matter simulation. This trained model can then be applied to simulations with significantly larger volumes which would otherwise be computationally prohibitive with traditional simulations, and also provides a crucial missing link in making end-to-end differentiable cosmological simulations. The code, named GOTHAM (Generative cOnditional Transformer for Halo's Auto-regressive Modeling) is publicly available at https://github.com/shivampcosmo/GOTHAM.||
|**2024-09-17**|[Ultrasound Image Enhancement with the Variance of Diffusion Models](http://arxiv.org/abs/2409.11380)|**[link](https://github.com/yuxin-zhang-jasmine/ius2024_diffusion)**|Ultrasound imaging, despite its widespread use in medicine, often suffers from various sources of noise and artifacts that impact the signal-to-noise ratio and overall image quality. Enhancing ultrasound images requires a delicate balance between contrast, resolution, and speckle preservation. This paper introduces a novel approach that integrates adaptive beamforming with denoising diffusion-based variance imaging to address this challenge. By applying Eigenspace-Based Minimum Variance (EBMV) beamforming and employing a denoising diffusion model fine-tuned on ultrasound data, our method computes the variance across multiple diffusion-denoised samples to produce high-quality despeckled images. This approach leverages both the inherent multiplicative noise of ultrasound and the stochastic nature of diffusion models. Experimental results on a publicly available dataset demonstrate the effectiveness of our method in achieving superior image reconstructions from single plane-wave acquisitions. The code is available at: https://github.com/Yuxin-Zhang-Jasmine/IUS2024_Diffusion.||
|**2024-09-17**|[OSV: One Step is Enough for High-Quality Image to Video Generation](http://arxiv.org/abs/2409.11367)|null|Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. Efforts have been made to accelerate video diffusion by reducing inference steps (through techniques like consistency distillation) and by GAN training, but these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one-step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency distillation based method, AnimateLCM (FVD 184.79), and approaches the 25-step performance of advanced Stable Video Diffusion (FVD 156.94).||
|**2024-09-17**|[Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think](http://arxiv.org/abs/2409.11355)|**[link](https://github.com/VisualComputingInstitute/diffusion-e2e-ft)**|Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.||
|**2024-09-17**|[OmniGen: Unified Image Generation](http://arxiv.org/abs/2409.11340)|**[link](https://github.com/vectorspacelab/omnigen)**|In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports other downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classical computer vision tasks by transforming them into image generation tasks, such as edge detection and human pose recognition. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly compared to existing diffusion models, enabling complex tasks to be accomplished through instructions without the need for extra preprocessing steps (e.g., human pose estimation), thereby significantly simplifying the workflow of image generation. 3) Knowledge Transfer: Through learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and there remain several unresolved issues. We will open-source the related resources at https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.||
|**2024-09-17**|[fMRI-3D: A Comprehensive Dataset for Enhancing fMRI-based 3D Reconstruction](http://arxiv.org/abs/2409.11315)|null|Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) data, introduced as Recon3DMind in our conference work, is of significant interest to both cognitive neuroscience and computer vision. To advance this task, we present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4768 3D objects. The dataset comprises two components: fMRI-Shape, previously introduced and accessible at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Shape, and fMRI-Objaverse, proposed in this paper and available at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Objaverse. fMRI-Objaverse includes data from 5 subjects, 4 of whom are also part of the Core set in fMRI-Shape, with each subject viewing 3142 3D objects across 117 categories, all accompanied by text captions. This significantly enhances the diversity and potential applications of the dataset. Additionally, we propose MinD-3D, a novel framework designed to decode 3D visual information from fMRI signals. The framework first extracts and aggregates features from fMRI data using a neuro-fusion encoder, then employs a feature-bridge diffusion model to generate visual features, and finally reconstructs the 3D object using a generative transformer decoder. We establish new benchmarks by designing metrics at both semantic and structural levels to evaluate model performance. Furthermore, we assess our model's effectiveness in an Out-of-Distribution setting and analyze the attribution of the extracted features and the visual ROIs in fMRI signals. Our experiments demonstrate that MinD-3D not only reconstructs 3D objects with high semantic and spatial accuracy but also deepens our understanding of how human brain processes 3D visual information. Project page at: https://jianxgao.github.io/MinD-3D.||
|**2024-09-17**|[SpMis: An Investigation of Synthetic Spoken Misinformation Detection](http://arxiv.org/abs/2409.11308)|null|In recent years, speech generation technology has advanced rapidly, fueled by generative models and large-scale training techniques. While these developments have enabled the production of high-quality synthetic speech, they have also raised concerns about the misuse of this technology, particularly for generating synthetic misinformation. Current research primarily focuses on distinguishing machine-generated speech from human-produced speech, but the more urgent challenge is detecting misinformation within spoken content. This task requires a thorough analysis of factors such as speaker identity, topic, and synthesis. To address this need, we conduct an initial investigation into synthetic spoken misinformation detection by introducing an open-source dataset, SpMis. SpMis includes speech synthesized from over 1,000 speakers across five common topics, utilizing state-of-the-art text-to-speech systems. Although our results show promising detection capabilities, they also reveal substantial challenges for practical implementation, underscoring the importance of ongoing research in this critical area.||
|**2024-09-17**|[DroneDiffusion: Robust Quadrotor Dynamics Learning with Diffusion Models](http://arxiv.org/abs/2409.11292)|null|An inherent fragility of quadrotor systems stems from model inaccuracies and external disturbances. These factors hinder performance and compromise the stability of the system, making precise control challenging. Existing model-based approaches either make deterministic assumptions, utilize Gaussian-based representations of uncertainty, or rely on nominal models, all of which often fall short in capturing the complex, multimodal nature of real-world dynamics. This work introduces DroneDiffusion, a novel framework that leverages conditional diffusion models to learn quadrotor dynamics, formulated as a sequence generation task. DroneDiffusion achieves superior generalization to unseen, complex scenarios by capturing the temporal nature of uncertainties and mitigating error propagation. We integrate the learned dynamics with an adaptive controller for trajectory tracking with stability guarantees. Extensive experiments in both simulation and real-world flights demonstrate the robustness of the framework across a range of scenarios, including unfamiliar flight paths and varying payloads, velocities, and wind disturbances.||
|**2024-09-17**|[Learning Source Disentanglement in Neural Audio Codec](http://arxiv.org/abs/2409.11228)|null|Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.||
|**2024-09-13**|[Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation](http://arxiv.org/abs/2409.09016)|**[link](https://github.com/OpenDriveLab/CLOVER)**|Despite significant progress in robotics and embodied AI in recent years, deploying robots for long-horizon tasks remains a great challenge. The majority of prior art adheres to an open-loop philosophy and lacks real-time feedback, leading to error accumulation and poor robustness. A handful of approaches have endeavored to establish feedback mechanisms leveraging pixel-level differences or pre-trained visual representations, yet their efficacy and adaptability have been found to be constrained. Inspired by classic closed-loop control systems, we propose CLOVER, a closed-loop visuomotor control framework that incorporates feedback mechanisms to improve adaptive robotic control. CLOVER consists of a text-conditioned video diffusion model for generating visual plans as reference inputs, a measurable embedding space for accurate error quantification, and a feedback-driven controller that refines actions from feedback and initiates replans as needed. Our framework exhibits notable advancement in real-world robotic tasks and achieves state-of-the-art on CALVIN benchmark, improving by 8% over previous open-loop counterparts. Code and checkpoints are maintained at https://github.com/OpenDriveLab/CLOVER.||
|**2024-09-13**|[A Diffusion Approach to Radiance Field Relighting using Multi-Illumination Synthesis](http://arxiv.org/abs/2409.08947)|null|Relighting radiance fields is severely underconstrained for multi-view data, which is most often captured under a single illumination condition; it is especially hard for full scenes containing multiple objects. We introduce a method to create relightable radiance fields using such single-illumination data by exploiting priors extracted from 2D image diffusion models. We first fine-tune a 2D diffusion model on a multi-illumination dataset conditioned by light direction, allowing us to augment a single-illumination capture into a realistic -- but possibly inconsistent -- multi-illumination dataset from directly defined light directions. We use this augmented data to create a relightable radiance field represented by 3D Gaussian splats. To allow direct control of light direction for low-frequency lighting, we represent appearance with a multi-layer perceptron parameterized on light direction. To enforce multi-view consistency and overcome inaccuracies, we optimize a per-image auxiliary feature vector. We show results on synthetic and real multi-view data under single illumination, demonstrating that our method successfully exploits 2D diffusion model priors to allow realistic 3D relighting for complete scenes. Project site https://repo-sam.inria.fr/fungraph/generative-radiance-field-relighting/||
|**2024-09-13**|[Latent Space Score-based Diffusion Model for Probabilistic Multivariate Time Series Imputation](http://arxiv.org/abs/2409.08917)|**[link](https://github.com/gorgen2020/LSSDM_imputation)**|Accurate imputation is essential for the reliability and success of downstream tasks. Recently, diffusion models have attracted great attention in this field. However, these models neglect the latent distribution in a lower-dimensional space derived from the observed data, which limits the generative capacity of the diffusion model. Additionally, dealing with the original missing data without labels becomes particularly problematic. To address these issues, we propose the Latent Space Score-Based Diffusion Model (LSSDM) for probabilistic multivariate time series imputation. Observed values are projected onto low-dimensional latent space and coarse values of the missing data are reconstructed without knowing their ground truth values by this unsupervised learning approach. Finally, the reconstructed values are fed into a conditional diffusion model to obtain the precise imputed values of the time series. In this way, LSSDM not only possesses the power to identify the latent distribution but also seamlessly integrates the diffusion model to obtain the high-fidelity imputed values and assess the uncertainty of the dataset. Experimental results demonstrate that LSSDM achieves superior imputation performance while also providing a better explanation and uncertainty analysis of the imputation mechanism. The code is available at https://github.com/gorgen2020/LSSDM_imputation.||
|**2024-09-13**|[Gaussian is All You Need: A Unified Framework for Solving Inverse Problems via Diffusion Posterior Sampling](http://arxiv.org/abs/2409.08906)|null|Diffusion models can generate a variety of high-quality images by modeling complex data distributions. Trained diffusion models can also be very effective image priors for solving inverse problems. Most of the existing diffusion-based methods integrate data consistency steps within the diffusion reverse sampling process. The data consistency steps rely on an approximate likelihood function. In this paper, we show that the existing approximations are either insufficient or computationally inefficient. To address these issues, we propose a unified likelihood approximation method that incorporates a covariance correction term to enhance the performance and avoids propagating gradients through the diffusion model. The correction term, when integrated into the reverse diffusion sampling process, achieves better convergence towards the true data posterior for selected distributions and improves performance on real-world natural image datasets. Furthermore, we present an efficient way to factorize and invert the covariance matrix of the likelihood function for several inverse problems. We present comprehensive experiments to demonstrate the effectiveness of our method over several existing approaches.||
|**2024-09-13**|[Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control](http://arxiv.org/abs/2409.08861)|null|Dynamical generative models that produce samples through an iterative process, such as Flow Matching and denoising diffusion models, have seen widespread use, but there have not been many theoretically sound methods for improving these models with reward fine-tuning. In this work, we cast reward fine-tuning as stochastic optimal control (SOC). Critically, we prove that a very specific memoryless noise schedule must be enforced during fine-tuning, in order to account for the dependency between the noise variable and the generated samples. We also propose a new algorithm named Adjoint Matching which outperforms existing SOC algorithms, by casting SOC problems as a regression problem. We find that our approach significantly improves over existing methods for reward fine-tuning, achieving better consistency, realism, and generalization to unseen human preference reward models, while retaining sample diversity.||
|**2024-09-13**|[InstantDrag: Improving Interactivity in Drag-based Image Editing](http://arxiv.org/abs/2409.08857)|null|Drag-based image editing has recently gained popularity for its interactivity and precision. However, despite the ability of text-to-image models to generate samples within a second, drag editing still lags behind due to the challenge of accurately reflecting user interaction while maintaining image content. Some existing approaches rely on computationally intensive per-image optimization or intricate guidance-based methods, requiring additional inputs such as masks for movable regions and text prompts, thereby compromising the interactivity of the editing process. We introduce InstantDrag, an optimization-free pipeline that enhances interactivity and speed, requiring only an image and a drag instruction as input. InstantDrag consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns motion dynamics for drag-based image editing in real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. We demonstrate InstantDrag's capability to perform fast, photo-realistic edits without masks or text prompts through experiments on facial video datasets and general scenes. These results highlight the efficiency of our approach in handling drag-based image editing, making it a promising solution for interactive, real-time applications.||
|**2024-09-13**|[DX2CT: Diffusion Model for 3D CT Reconstruction from Bi or Mono-planar 2D X-ray(s)](http://arxiv.org/abs/2409.08850)|null|Computed tomography (CT) provides high-resolution medical imaging, but it can expose patients to high radiation. X-ray scanners have low radiation exposure, but their resolutions are low. This paper proposes a new conditional diffusion model, DX2CT, that reconstructs three-dimensional (3D) CT volumes from bi or mono-planar X-ray image(s). Proposed DX2CT consists of two key components: 1) modulating feature maps extracted from two-dimensional (2D) X-ray(s) with 3D positions of CT volume using a new transformer and 2) effectively using the modulated 3D position-aware feature maps as conditions of DX2CT. In particular, the proposed transformer can provide conditions with rich information of a target CT slice to the conditional diffusion model, enabling high-quality CT reconstruction. Our experiments with the bi or mono-planar X-ray(s) benchmark datasets show that proposed DX2CT outperforms several state-of-the-art methods. Our codes and model will be available at: https://www.github.com/intyeger/DX2CT.||
|**2024-09-13**|[DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset](http://arxiv.org/abs/2409.08731)|**[link](https://github.com/dfadd-dataset/dfadd_demo_pages)**|Mainstream zero-shot TTS production systems like Voicebox and Seed-TTS achieve human parity speech by leveraging Flow-matching and Diffusion models, respectively. Unfortunately, human-level audio synthesis leads to identity misuse and information security issues. Currently, many anti-spoofing models have been developed against deepfake audio. However, the efficacy of current state-of-the-art anti-spoofing models in countering audio synthesized by diffusion- and flow-matching-based TTS systems remains unknown. In this paper, we propose the Diffusion and Flow-matching based Audio Deepfake (DFADD) dataset. The DFADD dataset collects deepfake audio generated by advanced diffusion and flow-matching TTS models. Additionally, we reveal that current anti-spoofing models lack sufficient robustness against highly human-like audio generated by diffusion and flow-matching TTS systems. The proposed DFADD dataset addresses this gap and provides a valuable resource for developing more resilient anti-spoofing models.||
|**2024-09-13**|[STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment](http://arxiv.org/abs/2409.08601)|null|Visual and auditory perception are two crucial ways humans experience the world. Text-to-video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated video limits its broader applications. In this paper, we propose Semantic and Temporal Aligned Video-to-Audio (STA-V2A), an approach that enhances audio generation from videos by extracting both local temporal and global semantic video features and combining these refined video features with text as cross-modal guidance. To address the issue of information redundancy in videos, we propose an onset prediction pretext task for local temporal feature extraction and an attentive pooling module for global semantic feature extraction. To supplement the insufficient semantic information in videos, we propose a Latent Diffusion Model with Text-to-Audio priors initialization and cross-modal guidance. We also introduce Audio-Audio Align, a new metric to assess audio-temporal alignment. Subjective and objective metrics demonstrate that our method surpasses existing Video-to-Audio models in generating audio with better quality, semantic consistency, and temporal alignment. The ablation experiment validated the effectiveness of each module. Audio samples are available at https://y-ren16.github.io/STAV2A.||
|**2024-09-13**|[LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling](http://arxiv.org/abs/2409.08583)|null|Singing Voice Conversion (SVC) has emerged as a significant subfield of Voice Conversion (VC), enabling the transformation of one singer's voice into another while preserving musical elements such as melody, rhythm, and timbre. Traditional SVC methods have limitations in terms of audio quality, data requirements, and computational complexity. In this paper, we propose LHQ-SVC, a lightweight, CPU-compatible model based on the SVC framework and diffusion model, designed to reduce model size and computational demand without sacrificing performance. We incorporate features to improve inference quality, and optimize for CPU execution by using performance tuning tools and parallel computing frameworks. Our experiments demonstrate that LHQ-SVC maintains competitive performance, with significant improvements in processing speed and efficiency across different devices. The results suggest that LHQ-SVC can meet||
|**2024-09-12**|[DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors](http://arxiv.org/abs/2409.08278)|null|We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.||
|**2024-09-12**|[Hand-Object Interaction Pretraining from Videos](http://arxiv.org/abs/2409.08273)|null|We present an approach for learning general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework that uses in-the-wild videos to generate sensorimotor robot trajectories. To do so, we lift the human hand and the manipulated object into a shared 3D space and retarget human motions to robot actions. Generative modeling on this data yields a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that fine-tuning this policy with reinforcement learning (RL) and behavior cloning (BC) enables sample-efficient adaptation to downstream tasks while improving robustness and generalization compared to prior approaches. Qualitative results available at: https://hgaurav2k.github.io/hop/.||
|**2024-09-12**|[Click2Mask: Local Editing with Dynamic Mask Generation](http://arxiv.org/abs/2409.08272)|null|Recent advancements in generative models have revolutionized image generation and editing, making these tasks accessible to non-experts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods typically require a precise mask or a detailed description of the location, which can be cumbersome and prone to error. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). A mask is dynamically grown around this point during a Blended Latent Diffusion (BLD) process, guided by a CLIP-based semantic loss. Click2Mask surpasses the limitations of segmentation-based and fine-tuning-dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that, according to both human judgment and automatic metrics, Click2Mask not only minimizes user effort but also delivers competitive or superior local image manipulation results compared to SoTA methods. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the potential to integrate our dynamic mask approach into other editing methods.||
|**2024-09-12**|[DreamBeast: Distilling 3D Fantastical Animals with Part-Aware Knowledge Transfer](http://arxiv.org/abs/2409.08271)|null|We present DreamBeast, a novel method based on score distillation sampling (SDS) for generating fantastical 3D animal assets composed of distinct parts. Existing SDS methods often struggle with this generation task due to a limited understanding of part-level semantics in text-to-image diffusion models. While recent diffusion models, such as Stable Diffusion 3, demonstrate a better part-level understanding, they are prohibitively slow and exhibit other common problems associated with single-view diffusion models. DreamBeast overcomes this limitation through a novel part-aware knowledge transfer mechanism. For each generated asset, we efficiently extract part-level knowledge from the Stable Diffusion 3 model into a 3D Part-Affinity implicit representation. This enables us to instantly generate Part-Affinity maps from arbitrary camera views, which we then use to modulate the guidance of a multi-view diffusion model during SDS to create 3D assets of fantastical animals. DreamBeast significantly enhances the quality of generated 3D creatures with user-specified part compositions while reducing computational overhead, as demonstrated by extensive quantitative and qualitative evaluations.||
|**2024-09-12**|[Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation](http://arxiv.org/abs/2409.08269)|null|Today's touch sensors come in many shapes and sizes. This makes it challenging to develop general-purpose touch processing methods, since models are typically tied to one specific sensor design. We address this problem by performing cross-modal prediction between touch sensors: given the tactile signal from one sensor, we use a generative model to estimate how another sensor would perceive the same physical contact. This allows us to apply sensor-specific algorithms to the generated signal. We implement this idea by training a diffusion model that translates between the popular GelSlim and Soft Bubble sensors. As a downstream task, we perform in-hand object pose estimation using GelSlim sensors while running an algorithm that operates only on Soft Bubble signals. The dataset, code, and additional details can be found at https://www.mmintlab.com/research/touch2touch/.||
|**2024-09-12**|[Improving Text-guided Object Inpainting with Semantic Pre-inpainting](http://arxiv.org/abs/2409.08260)|**[link](https://github.com/nnn-s/catdiffusion)**|Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is not trivial from two aspects: 1) Solely relying on one single U-Net to align text prompt and visual object across all the denoising timesteps is insufficient to generate desired objects; 2) The controllability of object generation is not guaranteed in the intricate sampling space of diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioning on unmasked context and text prompt. The outputs of the semantic inpainter then act as the informative visual prompts to guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against the state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.||
|**2024-09-12**|[Improving Virtual Try-On with Garment-focused Diffusion Models](http://arxiv.org/abs/2409.08258)|null|Diffusion models have led to the revolutionizing of generative modeling in numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models for synthesizing an image of a target person wearing a given in-shop garment, i.e., image-based virtual try-on (VTON) task. The difficulty originates from the aspect that the diffusion process should not only produce holistically high-fidelity photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new Diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both basic visual appearance and detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of diffusion model, pursuing local fine-grained alignment with the visual appearance of reference garment and human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial, high-frequency details. Extensive experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches. Code is publicly available at: https://github.com/siqi0905/GarDiff/tree/master.||
|**2024-09-12**|[LoRID: Low-Rank Iterative Diffusion for Adversarial Purification](http://arxiv.org/abs/2409.08255)|null|This work presents an information-theoretic examination of diffusion-based purification methods, the state-of-the-art adversarial defenses that utilize diffusion models to remove malicious perturbations in adversarial examples. By theoretically characterizing the inherent purification errors associated with the Markov-based diffusion purifications, we introduce LoRID, a novel Low-Rank Iterative Diffusion purification method designed to remove adversarial perturbation with low intrinsic purification errors. LoRID centers around a multi-stage purification process that leverages multiple rounds of diffusion-denoising loops at the early time-steps of the diffusion models, and the integration of Tucker decomposition, an extension of matrix factorization, to remove adversarial noise at high-noise regimes. Consequently, LoRID increases the effective diffusion time-steps and overcomes strong adversarial attacks, achieving superior robustness performance in CIFAR-10/100, CelebA-HQ, and ImageNet datasets under both white-box and black-box settings.||
|**2024-09-12**|[Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding](http://arxiv.org/abs/2409.08251)|null|Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment by panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image Diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps and improved general segmentation performance. However, the direct use of phrase features as static prompts to apply frozen Diffusion models to the PNG task still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. Therefore, we propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features and inject the multimodal cues back, which leverages the fine-grained image-text alignment capability of Diffusion models more sufficiently. In addition, we also design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.||
|**2024-09-12**|[IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation](http://arxiv.org/abs/2409.08240)|null|While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position and control the features generation of multiple instances. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. In response, we propose the Instance Feature Generation (IFG) task, which aims to ensure both positional accuracy and feature fidelity in generated instances. To address the IFG task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models' abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.||
|**2024-09-10**|[SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation](http://arxiv.org/abs/2409.06633)|null|In recent years, the development of diffusion models has driven significant progress in image and video generation tasks, with pre-trained models like the Stable Diffusion series playing a crucial role. Inspired by model pruning, which lightens large pre-trained models by removing unimportant parameters, we propose a novel model fine-tuning method that makes full use of these ineffective parameters and endows the pre-trained model with new task-specific capabilities. In this work, we first investigate the importance of parameters in pre-trained diffusion models and discover that the smallest 10% to 20% of parameters by absolute value do not contribute to the generation process. Based on this observation, we propose a method termed SaRA that re-utilizes these temporarily ineffective parameters, which amounts to optimizing a sparse weight matrix to learn task-specific knowledge. To mitigate overfitting, we propose a nuclear-norm-based low-rank sparse training scheme for efficient fine-tuning. Furthermore, we design a new progressive parameter adjustment strategy to make full use of the re-trained/fine-tuned parameters. Finally, we propose a novel unstructured backpropagation strategy that significantly reduces memory costs during fine-tuning. Our method enhances the generative capabilities of pre-trained models in downstream applications and outperforms traditional fine-tuning methods like LoRA in maintaining the model's generalization ability. We validate our approach through fine-tuning experiments on SD models, demonstrating significant improvements. SaRA also offers the practical advantage of requiring only a single line of code modification for efficient implementation, and it is seamlessly compatible with existing methods.||
|**2024-09-10**|[MVGaussian: High-Fidelity text-to-3D Content Generation with Multi-View Guidance and Surface Densification](http://arxiv.org/abs/2409.06620)|null|The field of text-to-3D content generation has made significant progress in generating realistic 3D objects, with existing methods like Score Distillation Sampling (SDS) offering promising guidance. However, these methods often encounter the "Janus" problem -- multi-face ambiguities due to imprecise guidance. Additionally, while recent advancements in 3D Gaussian Splatting have shown its efficacy in representing 3D volumes, optimization of this representation remains largely unexplored. This paper introduces a unified framework for text-to-3D content generation that addresses these critical gaps. Our approach utilizes multi-view guidance to iteratively form the structure of the 3D model, progressively enhancing detail and accuracy. We also introduce a novel densification algorithm that aligns Gaussians close to the surface, optimizing the structural integrity and fidelity of the generated models. Extensive experiments validate our approach, demonstrating that it produces high-quality visual outputs with minimal time cost. Notably, our method achieves high-quality results within half an hour of training, offering a substantial efficiency gain over most existing methods, which require hours of training time to achieve comparable results.||
|**2024-09-10**|[A Primer on Variational Inference for Physics-Informed Deep Generative Modelling](http://arxiv.org/abs/2409.06560)|null|Variational inference (VI) is a computationally efficient and scalable methodology for approximate Bayesian inference. It strikes a balance between accuracy of uncertainty quantification and practical tractability. It excels at generative modelling and inversion tasks due to its built-in Bayesian regularisation and flexibility, essential qualities for physics-related problems. Deriving the central learning objective for VI must often be tailored to new learning tasks, where the nature of the problem shapes the conditional dependence between variables of interest, such as arises in physics problems. In this paper, we provide an accessible and thorough technical introduction to VI for forward and inverse problems, guiding the reader through standard derivations of the VI framework and how it can best be realized through deep learning. We then review and unify recent literature exemplifying the creative flexibility allowed by VI. This paper is designed for a general scientific audience looking to solve physics-based problems with an emphasis on uncertainty quantification.||
|**2024-09-10**|[From LIMA to DeepLIMA: following a new path of interoperability](http://arxiv.org/abs/2409.06550)|null|This paper describes the architecture of the LIMA (Libre Multilingual Analyzer) framework and its recent evolution with the addition of new text analysis modules based on deep neural networks. We extended the functionality of LIMA in terms of the number of supported languages while preserving the existing configurable architecture and the availability of previously developed rule-based and statistical analysis components. Models were trained for more than 60 languages on the Universal Dependencies 2.5 corpora, the WikiNer corpora, and the CoNLL-03 dataset. Universal Dependencies allowed us to increase the number of supported languages and to generate models that can be integrated into other platforms. This integration of ubiquitous deep learning NLP models, together with the use of standard annotated collections based on Universal Dependencies, can be viewed as a new path of interoperability through the normalization of models and data. It complements the more standard technical interoperability, implemented in LIMA through services available in Docker containers on Docker Hub.||
|**2024-09-10**|[Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models](http://arxiv.org/abs/2409.06451)|null|While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine-grained control over the emotion rendering of the output speech remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, we train a diffusion model to generate emotional embeddings based on textual emotional style descriptions. Our framework first trains on reference audio using the audio encoder, then fine-tunes the diffusion model to process textual inputs from ParaCLAP's text encoder. During inference, speech attributes such as pitch, jitter, and loudness can be manipulated using only textual conditioning. Our experiments demonstrate that ParaEVITS effectively controls emotion rendering without compromising speech quality. Speech demos are publicly available.||
|**2024-09-10**|[Prompt2Fashion: An automatically generated fashion dataset](http://arxiv.org/abs/2409.06442)|**[link](https://github.com/georgiarg/prompt2fashion)**|Despite the rapid evolution and increasing efficacy of language and vision generative models, there remains a lack of comprehensive datasets that bridge the gap between personalized fashion needs and AI-driven design, limiting the potential for truly inclusive and customized fashion solutions. In this work, we leverage generative models to automatically construct a fashion image dataset tailored to various occasions, styles, and body types as instructed by users. We use different large language models (LLMs) and prompting strategies to offer personalized outfits of high aesthetic quality, detail, and relevance for both expert and non-expert users, as demonstrated by qualitative analysis. Up until now, evaluation of the generated outfits has been conducted by non-expert human subjects. Despite the nuanced insights these provide on the quality and relevance of generation, we extend the discussion on the importance of expert knowledge for evaluating artistic AI-generated datasets such as this one. Our dataset is publicly available on GitHub at https://github.com/georgiarg/Prompt2Fashion.||
|**2024-09-10**|[Fast nonparametric inference of network backbones for graph sparsification](http://arxiv.org/abs/2409.06417)|**[link](https://github.com/aleckirkley/mdl-network-backbones)**|Network backbones provide a useful sparse representation of weighted networks by keeping only their most important links, permitting a range of computational speedups and simplifying complex network visualizations. There are many criteria for judging the importance of a link, and hence many methods have been developed for network backbone extraction for graph sparsification. These methods can be classified as global or local depending on whether they evaluate the importance of an edge in the context of the whole network or of an individual node neighborhood. A key limitation of existing network backboning methods is that they either artificially restrict the topology of the backbone to take a specific form (e.g. a tree) or require the specification of a free parameter (e.g. a significance level) that determines the number of edges to keep in the backbone. Here we develop a completely nonparametric framework for inferring the backbone of a weighted network that overcomes these limitations, automatically selecting the optimal number of edges to retain in the backbone using the Minimum Description Length (MDL) principle from information theory. We develop two encoding schemes that serve as objective functions for global and local network backbones, as well as efficient optimization algorithms to identify the optimal backbones according to these objectives, with runtime complexity that is log-linear in the number of edges. We show that the proposed framework generalizes to any discrete weight distribution on the edges using a maximum a posteriori (MAP) estimation procedure with an asymptotically equivalent Bayesian generative model of the backbone. We compare the proposed methods with existing methods across a range of tasks on real and synthetic networks.||
|**2024-09-10**|[Distilling Generative-Discriminative Representations for Very Low-Resolution Face Recognition](http://arxiv.org/abs/2409.06371)|null|Very low-resolution face recognition is extremely challenging due to the serious loss of informative facial details caused by resolution degradation. In this paper, we propose a generative-discriminative representation distillation approach that combines generative representation with cross-resolution aligned knowledge distillation. This approach facilitates very low-resolution face recognition by jointly distilling generative and discriminative models via two distillation modules. First, generative representation distillation takes the encoder of a diffusion model pretrained for face super-resolution as the generative teacher to supervise the learning of the student backbone via feature regression, and then freezes the student backbone. After that, discriminative representation distillation further takes a pretrained face recognizer as the discriminative teacher to supervise the learning of the student head via cross-resolution relational contrastive distillation. In this way, the general backbone representation can be transformed into a discriminative head representation, yielding a robust and discriminative student model for very low-resolution face recognition. Our approach improves the recovery of missing details in very low-resolution faces and achieves better knowledge transfer. Extensive experiments on face datasets demonstrate that our approach enhances the recognition accuracy of very low-resolution faces, showing its effectiveness and adaptability.||
|**2024-09-10**|[What happens to diffusion model likelihood when your model is conditional?](http://arxiv.org/abs/2409.06364)|null|Diffusion Models (DMs) iteratively denoise random samples to produce high-quality data. The iterative sampling process is derived from Stochastic Differential Equations (SDEs), allowing a speed-quality trade-off chosen at inference. Another advantage of sampling with differential equations is exact likelihood computation. These likelihoods have been used to rank unconditional DMs and for out-of-domain classification. Despite the many existing and possible uses of DM likelihoods, the distinct properties captured are unknown, especially in conditional contexts such as Text-To-Image (TTI) or Text-To-Speech synthesis (TTS). Surprisingly, we find that TTS DM likelihoods are agnostic to the text input. TTI likelihood is more expressive but cannot discern confounding prompts. Our results show that applying DMs to conditional tasks reveals inconsistencies and strengthens claims that the properties of DM likelihood are unknown. This impact sheds light on the previously unknown nature of DM likelihoods. Although conditional DMs maximise likelihood, the likelihood in question is not as sensitive to the conditioning input as one expects. This investigation provides a new point-of-view on diffusion likelihoods.||
|**2024-09-10**|[DiffQRCoder: Diffusion-based Aesthetic QR Code Generation with Scanning Robustness Guided Iterative Refinement](http://arxiv.org/abs/2409.06355)|null|With the success of Diffusion Models for image generation, the technologies also have revolutionized the aesthetic Quick Response (QR) code generation. Despite significant improvements in visual attractiveness for the beautified codes, their scannabilities are usually sacrificed and thus hinder their practical uses in real-world scenarios. To address this issue, we propose a novel Diffusion-based QR Code generator (DiffQRCoder) to effectively craft both scannable and visually pleasing QR codes. The proposed approach introduces Scanning-Robust Perceptual Guidance (SRPG), a new diffusion guidance for Diffusion Models to guarantee the generated aesthetic codes to obey the ground-truth QR codes while maintaining their attractiveness during the denoising process. Additionally, we present another post-processing technique, Scanning Robust Manifold Projected Gradient Descent (SR-MPGD), to further enhance their scanning robustness through iterative latent space optimization. With extensive experiments, the results demonstrate that our approach not only outperforms other compared methods in Scanning Success Rate (SSR) with better or comparable CLIP aesthetic score (CLIP-aes.) but also significantly improves the SSR of the ControlNet-only approach from 60% to 99%. The subjective evaluation indicates that our approach achieves promising visual attractiveness to users as well. Finally, even with different scanning angles and the most rigorous error tolerance settings, our approach robustly achieves over 95% SSR, demonstrating its capability for real-world applications.||
|**2024-09-09**|[Enhancing Preference-based Linear Bandits via Human Response Time](http://arxiv.org/abs/2409.05798)|null|Binary human choice feedback is widely used in interactive preference learning for its simplicity, but it provides limited information about preference strength. To overcome this limitation, we leverage human response times, which inversely correlate with preference strength, as complementary information. Our work integrates the EZ-diffusion model, which jointly models human choices and response times, into preference-based linear bandits. We introduce a computationally efficient utility estimator that reformulates the utility estimation problem using both choices and response times as a linear regression problem. Theoretical and empirical comparisons with traditional choice-only estimators reveal that for queries with strong preferences ("easy" queries), choices alone provide limited information, while response times offer valuable complementary information about preference strength. As a result, incorporating response times makes easy queries more useful. We demonstrate this advantage in the fixed-budget best-arm identification problem, with simulations based on three real-world datasets, consistently showing accelerated learning when response times are incorporated.||
|**2024-09-09**|[Predicting Critical Heat Flux with Uncertainty Quantification and Domain Generalization Using Conditional Variational Autoencoders and Deep Neural Networks](http://arxiv.org/abs/2409.05790)|null|Deep generative models (DGMs) have proven to be powerful in generating realistic data samples. Their capability to learn the underlying distribution of a dataset enable them to generate synthetic data samples that closely resemble the original training dataset, thus addressing the challenge of data scarcity. In this work, we investigated the capabilities of DGMs by developing a conditional variational autoencoder (CVAE) model to augment the critical heat flux (CHF) measurement data that was used to generate the 2006 Groeneveld lookup table. To determine how this approach compared to traditional methods, a fine-tuned deep neural network (DNN) regression model was created and evaluated with the same dataset. Both the CVAE and DNN models achieved small mean absolute relative errors, with the CVAE model maintaining more favorable results. To quantify the uncertainty in the model's predictions, uncertainty quantification (UQ) was performed with repeated sampling of the CVAE model and ensembling of the DNN model. Following UQ, the DNN ensemble notably improved performance when compared to the baseline DNN model, while the CVAE model achieved similar results to its non-UQ results. The CVAE model was shown to have significantly less variability and a higher confidence after assessment of the prediction-wise relative standard deviations. Evaluating domain generalization, both models achieved small mean error values when predicting both inside and outside the training domain, with predictions outside the training domain showing slightly larger errors. Overall, the CVAE model was comparable to the DNN regression model in predicting CHF values but with better uncertainty behavior.||
|**2024-09-09**|[Vector Quantized Diffusion Model Based Speech Bandwidth Extension](http://arxiv.org/abs/2409.05784)|null|Recent advances in neural audio codecs (NACs) have unlocked new potential for audio signal processing, and a growing body of work explores leveraging NAC latent features for various speech tasks. This paper introduces the first approach to speech bandwidth extension (BWE) that utilizes discrete features obtained from a NAC. By restoring high-frequency details from highly compressed discrete tokens, the method enhances the clarity and naturalness of speech. The proposed framework is based on vector-quantized diffusion and combines the strengths of advanced NACs, diffusion models, and Mamba-2 to reconstruct high-frequency speech components. Extensive experiments demonstrate superior performance in both log-spectral distance and ViSQOL, significantly improving speech quality.||
|**2024-09-09**|[AS-Speech: Adaptive Style For Speech Synthesis](http://arxiv.org/abs/2409.05730)|null|In recent years, text-to-speech (TTS) synthesis has made remarkable progress, producing high-quality speech in common scenarios. In unseen situations, adaptive TTS requires strong generalization to a speaker's style characteristics. However, existing adaptive methods can only separately extract and integrate coarse-grained timbre or entangled prosodic attributes. In this paper, we propose AS-Speech, an adaptive-style method that integrates speaker timbre characteristics and prosodic attributes into a unified framework for text-to-speech synthesis. Specifically, AS-Speech accurately models style characteristics through fine-grained text-based timbre features and global prosodic information, and achieves high-fidelity speech synthesis via a diffusion model. Experiments show that, compared with a range of adaptive TTS models, the proposed model produces speech with higher naturalness and similarity in both timbre and prosody.||
|**2024-09-09**|[pFedGPA: Diffusion-based Generative Parameter Aggregation for Personalized Federated Learning](http://arxiv.org/abs/2409.05701)|null|Federated learning (FL) is a decentralized approach to model training in which data remain local and only model parameters are shared between clients and a central server. Traditional methods such as Federated Averaging (FedAvg) linearly aggregate parameters that are usually trained on heterogeneous data distributions, potentially overlooking the complex, high-dimensional nature of the parameter space and degrading the aggregated model's performance. While personalized FL methods can mitigate the heterogeneous-data problem to some extent, the limitations of linear aggregation remain unresolved. To alleviate this, we investigate the generative approach of diffusion models and propose pFedGPA, a novel generative parameter-aggregation framework for personalized federated learning. In this framework, we deploy a diffusion model on the server to integrate the diverse parameter distributions, and propose a parameter-inversion method that efficiently generates a set of personalized parameters for each client. The inversion method transforms the uploaded parameters into a latent code, which is then aggregated through denoising sampling to produce the final personalized parameters. By encoding the dependence of each client's model parameters on its specific data distribution using a high-capacity diffusion model, pFedGPA effectively decouples the complexity of the overall distribution of all clients' parameters from the complexity of each individual client's distribution. Our experimental results consistently demonstrate the superior performance of the proposed method across multiple datasets, surpassing baseline approaches.||
|**2024-09-09**|[Unlearning or Concealment? A Critical Analysis and Evaluation Metrics for Unlearning in Diffusion Models](http://arxiv.org/abs/2409.05668)|null|Recent research has seen significant interest in methods for concept removal and targeted forgetting in diffusion models. In this paper, we conduct a comprehensive white-box analysis of existing unlearning methods for diffusion models to expose significant vulnerabilities. We find that the objective functions used for unlearning in existing methods lead to a decoupling of the target concept to be forgotten from its corresponding prompts. This is concealment rather than genuine unlearning, which was the original goal. The ineffectiveness of current methods stems primarily from their narrow focus on reducing generation probabilities for specific prompt sets, while ignoring the diverse forms of intermediate guidance used during inference. This paper presents a rigorous theoretical and empirical examination of four commonly used unlearning techniques for diffusion models. We introduce two new evaluation metrics: Concept Retrieval Score (CRS) and Concept Confidence Score (CCS). These metrics are based on a successful adversarial-attack setup that can recover forgotten concepts from unlearned diffusion models. CRS measures the similarity between the latent representations of the unlearned model and the fully trained model after unlearning; it reflects the degree to which the forgotten concept is retrieved as the amount of guidance increases. CCS quantifies the model's confidence in assigning the target concept to manipulated data; it reflects the probability that the unlearned model's generations align with the original domain knowledge as the amount of guidance increases. Evaluating existing unlearning methods with our proposed stringent metrics reveals significant shortcomings in their ability to genuinely unlearn concepts. Source code: https://respailab.github.io/unlearning-or-concealment||
|**2024-09-09**|[Forward KL Regularized Preference Optimization for Aligning Diffusion Policies](http://arxiv.org/abs/2409.05622)|null|Diffusion models have achieved remarkable success in sequential decision-making by leveraging their highly expressive model capacity for policy learning. A central problem in learning diffusion policies is aligning policy outputs with human intent across diverse tasks. To achieve this, prior methods perform return-conditioned policy generation or reinforcement learning (RL)-based policy optimization, but both rely on predefined reward functions. In this work, we propose a novel framework, forward KL regularized preference optimization for aligning diffusion policies, which aligns diffusion policies directly with preferences. We first train a preference-agnostic diffusion policy from an offline dataset, and then align the policy with preference data via direct preference optimization. During the alignment phase, we formulate direct preference learning within the diffusion policy, employing forward KL regularization in preference optimization to avoid generating out-of-distribution actions. We conduct extensive experiments on MetaWorld manipulation and D4RL tasks. The results show that our method exhibits superior preference alignment and outperforms prior state-of-the-art algorithms.||
|**2024-09-09**|[Latent 3D Brain MRI Counterfactual](http://arxiv.org/abs/2409.05585)|null|The number of samples in structural brain MRI studies is often too small to adequately train deep learning models. Generative models show promise here by effectively learning the data distribution and generating high-fidelity MRIs. However, they struggle to produce diverse, high-quality data outside the training distribution. One way to address this issue is with causal models developed for 3D volumetric counterfactuals. Yet accurately modeling causality in high-dimensional spaces is challenging, so these models often generate 3D brain MRIs of lower quality. To address these challenges, we propose a two-stage approach that constructs a structural causal model (SCM) within the latent space. In the first stage, we employ a VQ-VAE to learn compact embeddings of MRI volumes. We then integrate the causal model into this latent space and perform a three-step counterfactual procedure using a closed-form generalized linear model (GLM). Our experiments on real-world high-resolution MRI data (1 mm) demonstrate that our method can generate high-quality 3D MRI counterfactuals.||
|**2024-09-09**|[Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation](http://arxiv.org/abs/2409.05583)|**[link](https://github.com/gmuraleekrishna/sas)**|Embodied AI aims to develop robots that can understand and execute human language instructions and communicate in natural language. To this end, we study the task of generating highly detailed navigation instructions for embodied robots to follow. Although recent work has shown significant progress in generating step-by-step instructions from image sequences, the generated instructions lack diversity in their reference to objects and landmarks. Existing speaker models learn strategies to game the evaluation metrics, obtaining higher scores even for low-quality sentences. In this work, we propose SAS (Spatially-Aware Speaker), an instruction generator, or "speaker" model, that leverages both structural and semantic knowledge of the environment to produce richer instructions. For training, we adopt a reward-learning approach in an adversarial setting to avoid the systematic biases introduced by language evaluation metrics. Empirically, our method outperforms existing instruction-generation models, as evaluated with standard metrics. Our code is available at: https://github.com/gmuraleekrishna/SAS.||
|**2024-09-09**|[A Taxonomy of Miscompressions: Preparing Image Forensics for Neural Compression](http://arxiv.org/abs/2409.05490)|null|Neural compression has the potential to revolutionize lossy image compression. Based on generative models, recent schemes achieve unprecedented compression rates at high perceptual quality, but at the expense of semantic fidelity. Details of decompressed images may look visually perfect yet differ semantically from the original, making compression errors difficult or impossible to detect. We explore this problem space and propose a tentative taxonomy of miscompressions. It defines three types of "what happened" along with a binary "high impact" flag indicating miscompressions that alter symbols. We discuss how the taxonomy can facilitate risk communication and research into mitigations.||
|**2024-09-05**|[Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding](http://arxiv.org/abs/2409.03757)|**[link](https://github.com/yunzeman/lexicon3d)**|Complex 3D scene understanding has gained increasing attention in recent years, with scene-encoding strategies playing a crucial role. However, the optimal scene-encoding strategy for different scenarios remains unclear, particularly in comparison to image-based counterparts. To address this, we present a comprehensive study of various visual encoding models for 3D scene understanding, identifying each model's strengths and limitations across scenarios. Our evaluation spans seven visual foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models on four tasks: vision-language scene reasoning, visual grounding, segmentation, and registration, each probing a different aspect of scene understanding. Our evaluation yields the following key findings: DINOv2 demonstrates superior performance, video models excel at object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations on language-related tasks. These insights challenge some conventional understandings, provide new perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.||
|**2024-09-05**|[ArtiFade: Learning to Generate High-quality Subject from Blemished Images](http://arxiv.org/abs/2409.03745)|null|Subject-driven text-to-image generation has made significant progress in learning and capturing subject characteristics, even from a limited number of images. However, existing methods typically rely on high-quality images for training and may struggle to generate plausible images when the inputs are blemished. This is mainly attributable to the insufficient ability of current techniques to distinguish subject-related features from disruptive artifacts. In this paper, we introduce ArtiFade to tackle this issue, successfully generating high-quality, artifact-free images from blemished datasets. Specifically, ArtiFade removes artifacts by fine-tuning a pre-trained text-to-image model, using a specialized dataset of artifact-free images paired with their blemished counterparts during fine-tuning. ArtiFade also preserves the original generative capability inherent in the diffusion model, thereby improving the overall performance of subject-driven methods in generating high-quality, artifact-free images. We further devise evaluation benchmarks for this task. Through extensive qualitative and quantitative experiments, we demonstrate ArtiFade's generalizability in effectively removing artifacts in both in-distribution and out-of-distribution scenarios.||
|**2024-09-05**|[RealisHuman: A Two-Stage Approach for Refining Malformed Human Parts in Generated Images](http://arxiv.org/abs/2409.03644)|null|In recent years, diffusion models have revolutionized visual generation, outperforming traditional frameworks such as generative adversarial networks (GANs). However, generating images of humans with realistic semantic parts, such as hands and faces, remains a significant challenge owing to their intricate structure. To address this problem, we propose a novel post-processing solution named RealisHuman. The RealisHuman framework operates in two stages. First, it generates realistic human parts, such as hands or faces, using the original malformed parts as references, ensuring the details remain consistent with the original image. Second, it seamlessly integrates the rectified human parts back into their corresponding positions by repainting the surrounding areas to ensure smooth and realistic blending. RealisHuman significantly enhances the realism of human generation, as demonstrated by notable improvements in both qualitative and quantitative metrics. Code is available at https://github.com/Wangbenzhi/RealisHuman.||
|**2024-09-05**|[DiffEVC: Any-to-Any Emotion Voice Conversion with Expressive Guidance](http://arxiv.org/abs/2409.03636)|null|Emotion voice conversion (EVC) modifies speech emotion to enhance communication by amplifying positive cues and reducing negative ones. This complex task involves entangled factors such as voice quality, speaker traits, and content. Traditional deep learning models such as GANs and autoencoders have achieved some success in EVC by learning mappings or disentangling features, but face challenges of instability and degraded speech quality. Diffusion models offer stable training and high-quality generation. We propose a diffusion-based EVC framework that disentangles emotion and speaker identity using mutual-information losses and auxiliary models. An expressive guidance mechanism is introduced to improve emotion conversion while preserving speaker traits. Experimental results demonstrate our approach's effectiveness for unseen speakers and emotions, achieving state-of-the-art performance on EVC tasks.||
|**2024-09-05**|[TCDiff: Triple Condition Diffusion Model with 3D Constraints for Stylizing Synthetic Faces](http://arxiv.org/abs/2409.03600)|**[link](https://github.com/bovifocr/tcdiff)**|A robust face recognition model must be trained on datasets containing a large number of individuals and many samples per individual under different conditions (e.g., pose, expression, age, noise, and occlusion). Owing to ethical and privacy concerns, large real-face datasets such as MS1MV3 have been discontinued, and synthetic face generators exploiting GANs and diffusion models, such as SYNFace, SFace, DigiFace-1M, IDiff-Face, DCFace, and GANDiffFace, have been proposed to meet this need. Some of these methods generate high-fidelity realistic faces but with low intra-class variance, while others produce faces with high variance but low identity consistency. In this paper, we propose a Triple Condition Diffusion Model (TCDiff) to improve face style transfer from real to synthetic faces through 2D and 3D facial constraints, enhancing face identity consistency while maintaining the necessary high intra-class variance. Face-recognition experiments using 1k, 2k, and 5k classes of our new dataset for training outperform state-of-the-art synthetic datasets on real-face benchmarks such as LFW, CFP-FP, AgeDB, and BUPT. Our source code is available at: https://github.com/BOVIFOCR/tcdiff.||
|**2024-09-05**|[DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture](http://arxiv.org/abs/2409.03550)|null|Diffusion models (DMs) have demonstrated exceptional generative capabilities across various domains, but their deployment is hindered by slow inference and high computational demands. The most common way to accelerate DMs is to reduce the number of denoising steps during generation, achieved through faster sampling solvers or knowledge distillation (KD). In contrast to prior approaches, we propose a novel method that transfers the capability of large pre-trained DMs to faster architectures. Specifically, we employ KD in a distinctive manner to compress DMs by distilling their generative ability into faster variants. Furthermore, considering that the source data may be inaccessible or too large to store for current generative models, we introduce a new paradigm of source-data-free distillation, termed Data-Free Knowledge Distillation for Diffusion Models (DKDM). Generally, our established DKDM framework comprises two main components: 1) a DKDM objective that uses synthetic denoising data produced by the pre-trained DM to optimize faster DMs without source data, and 2) a dynamic iterative distillation method that flexibly organizes the synthesis of denoising data, preventing the slow generation from bottlenecking the optimization. To our knowledge, this is the first attempt to use KD to distill DMs into any architecture in a data-free manner. Importantly, our DKDM is orthogonal to most existing acceleration methods, such as denoising-step reduction, quantization, and pruning. Experiments show that our DKDM can derive 2x faster DMs whose performance remains on par with the baseline. Notably, our DKDM enables pre-trained DMs to function as "datasets" for training new DMs.||
|**2024-09-05**|[Blended Latent Diffusion under Attention Control for Real-World Video Editing](http://arxiv.org/abs/2409.03514)|null|Due to the lack of fully publicly available text-to-video models, current video-editing methods tend to build on pre-trained text-to-image generative models; however, they still face grand challenges in handling local editing of videos with temporal information. First, although existing methods attempt to focus on local-region editing via predefined masks, preservation of the background outside the region is unsatisfactory because each frame is generated spatially as a whole. In addition, specifically providing a mask is additional costly work for users, so an autonomous masking strategy integrated into the editing process is desirable. Last but not least, image-level pre-trained models do not learn temporal information across video frames, which is vital for expressing motion and dynamics. In this paper, we propose adopting an image-level blended latent diffusion model to perform local video-editing tasks. Specifically, we leverage DDIM inversion to acquire latents as background latents, instead of randomly noised ones, to better preserve the background information of the input video. We further introduce an autonomous mask-manufacturing mechanism derived from cross-attention maps at diffusion steps. Finally, we enhance temporal consistency across video frames by transforming the self-attention blocks of the U-Net into temporal-spatial blocks. Through extensive experiments, our proposed approach demonstrates effectiveness on different real-world video-editing tasks.||
|**2024-09-05**|[Data-free Distillation with Degradation-prompt Diffusion for Multi-weather Image Restoration](http://arxiv.org/abs/2409.03455)|null|Multi-weather image restoration has made impressive progress, but increasing model capacity and expensive data acquisition limit its application on memory-limited devices. Data-free distillation provides an alternative, allowing a lightweight student model to be learned from a pre-trained teacher model without relying on the original training data. Existing data-free learning methods mainly optimize models with pseudo data generated by GANs or real data collected from the internet. However, they inevitably suffer from training instability or domain shift relative to the original data. In this paper, we propose a novel data-free distillation framework with degradation-prompt diffusion for multi-weather image restoration (D4IR). It replaces GANs with pre-trained diffusion models to avoid model collapse, and incorporates a degradation-aware prompt adapter to facilitate content-driven conditional diffusion that generates domain-relevant images. Specifically, a contrast-based degradation prompt adapter is first designed to capture degradation-aware prompts from web-collected degraded images. Then, the collected unpaired clean images are perturbed into the latent features of stable diffusion and, conditioned on the degradation-aware prompts, synthesized into new domain-relevant degraded images for knowledge distillation. Experiments show that our method achieves performance comparable to models distilled with the original training data, and even surpasses other mainstream unsupervised approaches.||
|**2024-09-05**|[Convergence Rates for the Maximum A Posteriori Estimator in PDE-Regression Models with Random Design](http://arxiv.org/abs/2409.03417)|null|We consider the statistical inverse problem of recovering a parameter $\theta\in H^\alpha$ from data arising from the Gaussian regression problem $Y = \mathscr{G}(\theta)(Z)+\varepsilon$, where $\mathscr{G}:\mathbb{L}^2\to\mathbb{L}^2$ is a nonlinear forward map, $Z$ are random design points, and $\varepsilon$ is Gaussian noise. The estimation strategy is based on least squares under a $\Vert\cdot\Vert_{H^\alpha}$-constraint. We establish the existence of a least-squares estimator $\hat{\theta}$ as a maximizer of a given functional under Lipschitz-type assumptions on the forward map $\mathscr{G}$. A general concentration result is shown, which is used to prove consistency and upper bounds for the prediction error. The corresponding rates of convergence reflect not only the smoothness of the target parameter but also the ill-posedness of the underlying inverse problem. We apply the general model to the Darcy problem, where the recovery of the unknown coefficient function $f$ of a PDE is of interest. For this example, we also provide corresponding rates of convergence for the prediction and estimation errors. Additionally, we briefly discuss the applicability of the general model to other problems.||
|**2024-09-05**|[RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning](http://arxiv.org/abs/2409.03403)|null|Scaling up robot learning requires large and diverse datasets, and how to efficiently reuse collected data and transfer policies to new embodiments remains an open question. Emerging research such as the Open-X Embodiment (OXE) project has shown promise in leveraging skills by combining datasets that include different robots. However, the imbalanced distribution of robot types and camera angles in many datasets makes policies prone to overfitting. To mitigate this issue, we propose RoVi-Aug, which leverages state-of-the-art image-to-image generative models to augment robot data by synthesizing demonstrations with different robots and camera views. Through extensive physical experiments, we show that, by training on robot- and viewpoint-augmented data, RoVi-Aug can zero-shot deploy on an unseen robot with significantly different camera angles. Compared to test-time adaptation algorithms such as Mirage, RoVi-Aug requires no extra processing at test time, does not assume known camera angles, and allows policy fine-tuning. Moreover, by co-training on both the original and augmented robot datasets, RoVi-Aug can learn multi-robot and multi-task policies, enabling more efficient transfer between robots and skills and improving success rates by up to 30%.||
|**2024-09-04**|[HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts](http://arxiv.org/abs/2409.02919)|**[link](https://github.com/Liuxinyv/HiPrompt)**|The potential for generating higher-resolution images using pre-trained diffusion models is immense, yet these models often struggle with object repetition and structural artifacts, especially when scaling to 4K resolution and beyond. We identify the problem as the inefficient way a single prompt drives generation at multiple scales. In response, we propose HiPrompt, a new tuning-free solution that tackles the above issues by introducing hierarchical prompts. The hierarchical prompts offer both global and local guidance. Specifically, the global guidance comes from the user input describing the overall content, while the local guidance utilizes patch-wise descriptions from MLLMs to elaborately guide the generation of local structure and texture. Furthermore, during inverse denoising, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, facilitating prompt-guided denoising under hierarchical semantic guidance. It further allows the generation to focus more on local spatial regions and ensures that generated images maintain coherent local and global semantics, structures, and textures at high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality.||
|**2024-09-04**|[Latent Watermarking of Audio Generative Models](http://arxiv.org/abs/2409.02915)|null|Advances in audio generative models raise new challenges for their responsible disclosure and the detection of their misuse. In response, we introduce a method to watermark latent generative models by a specific watermarking of their training data. The resulting watermarked models produce latent representations whose decoded outputs are detected with high confidence, regardless of the decoding method. This approach enables the detection of generated content without a post-hoc watermarking step. It provides a more secure solution for open-sourced models and facilitates the identification of derivative works that fine-tune or use these models without complying with their license terms. For instance, our results indicate that, even after fine-tuning of the latent generative model, the detection of generated outputs achieves accuracy above 75% at a false-positive rate of $10^{-3}$.||
|**2024-09-04**|[Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling](http://arxiv.org/abs/2409.02908)|null|Masked diffusion models (MDMs) have become a popular research topic for generative modeling of discrete data thanks to their superior performance over other discrete diffusion models, and they rival autoregressive models (ARMs) on language-modeling tasks. Recent efforts to simplify the masked diffusion framework have further aligned it with continuous-space diffusion models and yielded more principled training and sampling recipes. In this paper, however, we reveal that both the training and sampling of MDMs are, in theory, free of the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling side is established by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs' original generation process while significantly alleviating the time-consuming categorical sampling and achieving a 20x speedup. In addition, our investigation raises doubts about whether MDMs can truly beat ARMs in generative perplexity, as claimed in prior work. We identify, for the first time, an underlying numerical issue, even with 32-bit floating-point precision, that results in inaccurate categorical sampling. We show that the numerical issue lowers the effective temperature both theoretically and empirically, and the resulting decrease leads to unfair assessments of MDMs' generation results in previous literature.||
|**2024-09-04**|[Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models](http://arxiv.org/abs/2409.02851)|**[link](https://github.com/Human-VDM/Human-VDM)**|Generating lifelike 3D humans from a single RGB image is a challenging task in computer vision, as it requires accurate geometry modeling, high-quality texture, and plausible generation of unseen parts. Existing methods typically use multi-view diffusion models for 3D human generation, but they often face the problem of inconsistent views, which hinders high-quality 3D human generation. To address this, we propose Human-VDM, a novel method for generating 3D humans from a single RGB image using video diffusion models. Human-VDM provides temporally consistent views for 3D human generation via Gaussian splatting. It consists of three modules: a view-consistent human video diffusion module, a video augmentation module, and a Gaussian splatting module. First, a single image is fed into the human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and video interpolation to enhance the texture and geometric smoothness of the generated video. Finally, the 3D human Gaussian splatting module learns a lifelike human under the guidance of these high-resolution, view-consistent images. Experiments demonstrate that Human-VDM generates high-quality 3D humans from a single image, outperforming state-of-the-art methods in both generation quality and quantity. Project page: https://human-vdm.github.io/Human-VDM/||
|**2024-09-04**|[Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model](http://arxiv.org/abs/2409.02845)|null|Diffusion models have shown great promise in cross-modal generation tasks involving audio and music, such as text-to-sound and text-to-music generation. These text-controlled music-generation models typically focus on generating music by capturing global musical attributes such as genre and mood. However, music composition is a complex, multi-layered task that often involves musical arrangement as an integral part of the creative process. This process entails composing each instrument part to align with existing parts in terms of rhythm, dynamics, harmony, and melody, demanding more precise control over individual tracks than text prompts usually provide. In this work, we address these challenges by extending MusicLDM, a latent diffusion model for music, into a multi-track generative model. By learning the joint probability of tracks sharing a context, our model is capable of generating music across several tracks that correspond well to each other, either conditionally or unconditionally. Additionally, our model is capable of arrangement generation, where the model can generate any subset of tracks given the others (e.g., generating a piano track complementing given bass and drum tracks). We compared our model with existing multi-track generative models and demonstrated that it achieves considerable improvements on objective metrics for both total and arrangement generation tasks.||
|**2024-09-04**|[Rethinking HTG Evaluation: Bridging Generation and Recognition](http://arxiv.org/abs/2409.02683)|**[link](https://github.com/koninik/htg_evaluation)**|The evaluation of generative models for natural-image tasks has been extensively studied. Similar protocols and metrics are used in cases with unique particularities, such as handwriting generation (HTG), even though they may not be completely appropriate. In this work, we introduce three measures tailored to HTG evaluation, $\text{HTG}_{\text{HTR}}$, $\text{HTG}_{\text{style}}$, and $\text{HTG}_{\text{OOV}}$, and argue that they are more expedient for evaluating the quality of generated handwriting images. The metrics rely on the recognition error/accuracy of handwriting text recognition and writer identification models and emphasize writing style, textual content, and diversity as the main aspects that adhere to the content of handwriting images. We conduct comprehensive experiments on the IAM handwriting database, showing that widely used metrics such as FID fail to properly quantify the diversity and practical utility of generated handwriting samples. Our findings show that our metrics are richer in information and underscore the necessity of standardized evaluation protocols in HTG. The proposed metrics provide a more robust and informative protocol for assessing HTG quality, contributing to improved performance in HTR. Code for the evaluation protocol is available at: https://github.com/koninik/HTG_evaluation.||
|**2024-09-04**|[Introduction to Machine Learning](http://arxiv.org/abs/2409.02668)|null|This book introduces the mathematical foundations and techniques that underlie the development and analysis of many machine learning algorithms. It starts with the notation used throughout the book and reviews basic concepts from calculus, linear algebra, and probability, also introducing some measure-theoretic terminology, which can serve as a reading guide for the sections that use these tools. The introductory chapters also provide background material on matrix analysis and optimization. The latter chapters give theoretical support for many of the algorithms used in the book, including stochastic gradient descent, approximation methods, and more. After discussing basic concepts of statistical prediction, the book covers reproducing kernel theory and Hilbert-space techniques, which are applied in many places, before describing various supervised statistical learning algorithms, including linear methods, support vector machines, decision trees, boosting, and neural networks. The book then turns to generative methods, starting with sampling methods and Markov chain theory. The following chapters describe the theory of graphical models, introduce variational methods for latent-variable models, and cover deep-learning-based generative models. Later chapters focus on unsupervised learning methods, including clustering, factor analysis, and manifold learning. The final chapter of the book is more theoretical and discusses concentration inequalities and generalization bounds.||
|**2024-09-04**|[Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection](http://arxiv.org/abs/2409.02664)|null|The proliferation of deepfake faces poses huge potential negative impacts on our daily lives. Despite substantial advancements in deepfake detection over these years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via data perturbations, our method can reprogram a pretrained VLM model (e.g., CLIP) solely based on manipulating its input without tuning the inner parameters. Furthermore, we insert a pseudo-word guided by facial identity into the text prompt. Extensive experiments on several popular benchmarks demonstrate that (1) the cross-dataset and cross-manipulation performances of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in cross-dataset setting from FF++ to WildDeepfake) using a pre-trained CLIP model with our proposed reprogramming method; (2) our superior performances are at less cost of trainable parameters, making it a promising approach for real-world applications.||
|**2024-09-04**|[PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation](http://arxiv.org/abs/2409.02657)|null|While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose \textbf{PoseTalk}, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is using head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio offers short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to generate motion latent from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4\% of the total reconstruction loss caused by both pose and lip, making optimization lean towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy to synthesize natural talking videos using two cascaded networks, i.e., CoarseNet, and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses and the RefineNet focuses on learning finer lip motions by progressively estimating lip motions from low-to-high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate our pose prediction strategy achieves better pose diversity and realness compared to text-only or audio-only, and our video generator model outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk.||
|**2024-09-04**|[Skip-and-Play: Depth-Driven Pose-Preserved Image Generation for Any Objects](http://arxiv.org/abs/2409.02653)|null|The emergence of diffusion models has enabled the generation of diverse high-quality images solely from text, prompting subsequent efforts to enhance the controllability of these models. Despite the improvement in controllability, pose control remains limited to specific objects (e.g., humans) or poses (e.g., frontal view) due to the fact that pose is generally controlled via camera parameters (e.g., rotation angle) or keypoints (e.g., eyes, nose). Specifically, camera parameters-conditional pose control models generate unrealistic images depending on the object, owing to the small size of 3D datasets for training. Also, keypoint-based approaches encounter challenges in acquiring reliable keypoints for various objects (e.g., church) or poses (e.g., back view). To address these limitations, we propose depth-based pose control, as depth maps are easily obtainable from a single depth estimation model regardless of objects and poses, unlike camera parameters and keypoints. However, depth-based pose control confronts issues of shape dependency, as depth maps influence not only the pose but also the shape of the generated images. To tackle this issue, we propose Skip-and-Play (SnP), designed via analysis of the impact of three components of depth-conditional ControlNet on the pose and the shape of the generated images. To be specific, based on the analysis, we selectively skip parts of the components to mitigate shape dependency on the depth map while preserving the pose. Through various experiments, we demonstrate the superiority of SnP over baselines and showcase the ability of SnP to generate images of diverse objects and poses. Remarkably, SnP exhibits the ability to generate images even when the objects in the condition (e.g., a horse) and the prompt (e.g., a hedgehog) differ from each other.||
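
Many of the entries above build on the same DDPM-style denoising loop. As a rough orientation only — this toy NumPy sketch is not taken from any of the listed papers, and `make_schedule`, `q_sample`, and `ddpm_step` are illustrative names — the forward noising process and one ancestral reverse step can be written as:

```python
import numpy as np

def make_schedule(T=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative alpha products (DDPM-style)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # decreasing in t: more noise at larger t
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, noise):
    """Forward process: noise a clean sample x0 directly to timestep t."""
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * noise

def ddpm_step(xt, t, eps_pred, betas, alphas, alpha_bars, rng):
    """One reverse (denoising) step given the model's noise prediction eps_pred."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_pred) / np.sqrt(alphas[t])
    if t > 0:  # the final step (t == 0) is deterministic
        return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
    return mean
```

In a real model, `eps_pred` comes from a trained network (typically a U-Net or transformer); samplers such as DDIM, mentioned in several abstracts above, replace the stochastic reverse step with a deterministic one over a subsequence of timesteps.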

(back to top)

## LLM

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
|**2024-11-01**|[Improving Few-Shot Cross-Domain Named Entity Recognition by Instruction Tuning a Word-Embedding based Retrieval Augmented Large Language Model](http://arxiv.org/abs/2411.00451)|null|Few-shot cross-domain named entity recognition (NER) is the process of leveraging knowledge from data-rich source domains to perform entity recognition on data-scarce target domains. Most previous state-of-the-art (SOTA) approaches use pre-trained language models (PLMs) for cross-domain NER. However, these models are often domain-specific. To successfully use these models for new target domains, we need to either modify the model architecture or perform model fine-tuning using data from the new domains. Both of these result in the creation of entirely new NER models for each target domain, which is infeasible in practical scenarios. Recently, several works have attempted to use large language models (LLMs) to solve few-shot cross-domain NER. However, most of them are either too expensive for practical use or struggle to follow LLM prompt instructions. In this paper, we propose IF-WRANER (Instruction Finetuned Word-embedding based Retrieval Augmented large language model for Named Entity Recognition), a retrieval-augmented LLM fine-tuned for the NER task. Owing to the regularization techniques used during LLM fine-tuning and the adoption of word-level embeddings rather than sentence-level embeddings during prompt-example retrieval, IF-WRANER is able to outperform previous SOTA few-shot cross-domain NER approaches. We demonstrate the effectiveness of our model by benchmarking its performance on the open-source CrossNER dataset, on which it improves the F1 score by more than 2% over the previous SOTA model. We have deployed the model across multiple customer-care domains of an enterprise. Accurate entity prediction through IF-WRANER helps direct customers to automated workflows for these domains, reducing escalations to human agents by almost 15% and saving the company millions of dollars annually.|
|**2024-11-01**|[Enhancing Authorship Attribution through Embedding Fusion: A Novel Approach with Masked and Encoder-Decoder Language Models](http://arxiv.org/abs/2411.00411)|null|The increasing prevalence of AI-generated content alongside human-written text underscores the need for reliable discrimination methods. To address this challenge, we propose a novel framework with textual embeddings from Pre-trained Language Models (PLMs) to distinguish AI-generated and human-authored text. Our approach utilizes Embedding Fusion to integrate semantic information from multiple Language Models, harnessing their complementary strengths to enhance performance. Through extensive evaluation across publicly available diverse datasets, our proposed approach demonstrates strong performance, achieving classification accuracy greater than 96% and a Matthews Correlation Coefficient (MCC) greater than 0.93. This evaluation is conducted on a balanced dataset of texts generated from five well-known Large Language Models (LLMs), highlighting the effectiveness and robustness of our novel methodology.|
|**2024-11-01**|[C2A: Client-Customized Adaptation for Parameter-Efficient Federated Learning](http://arxiv.org/abs/2411.00311)|null|Despite the versatility of pre-trained language models (PLMs) across domains, their large memory footprints pose significant challenges in federated learning (FL), where the training model has to be distributed between a server and clients. One potential solution to bypass such constraints might be the use of parameter-efficient fine-tuning (PEFT) in the context of FL. However, we have observed that typical PEFT tends to severely suffer from heterogeneity among clients in FL scenarios, resulting in unstable and slow convergence. In this paper, we propose Client-Customized Adaptation (C2A), a novel hypernetwork-based FL framework that generates client-specific adapters by conditioning the client information. With the effectiveness of the hypernetworks in generating customized weights through learning to adopt the different characteristics of inputs, C2A can maximize the utility of shared model parameters while minimizing the divergence caused by client heterogeneity. To verify the efficacy of C2A, we perform extensive evaluations on FL scenarios involving heterogeneity in label and language distributions. Comprehensive evaluation results clearly support the superiority of C2A in terms of both efficiency and effectiveness in FL scenarios.|
|**2024-11-01**|[Large Language Models for Patient Comments Multi-Label Classification](http://arxiv.org/abs/2410.23528)|null|Patient experience and care quality are crucial for a hospital's sustainability and reputation. The analysis of patient feedback offers valuable insight into patient satisfaction and outcomes. However, the unstructured nature of these comments poses challenges for traditional machine learning methods following a supervised learning paradigm. This is due to the unavailability of labeled data and the nuances these texts encompass. This research explores leveraging Large Language Models (LLMs) in conducting Multi-label Text Classification (MLTC) of inpatient comments shared after a stay in the hospital. GPT-4 Turbo was leveraged to conduct the classification. However, given the sensitive nature of patients' comments, a security layer is introduced before feeding the data to the LLM through a Protected Health Information (PHI) detection framework, which ensures patients' de-identification. Additionally, using the prompt engineering framework, zero-shot learning, in-context learning, and chain-of-thought prompting were experimented with. Results demonstrate that GPT-4 Turbo, whether following a zero-shot or few-shot setting, outperforms traditional methods and Pre-trained Language Models (PLMs) and achieves the highest overall performance with an F1-score of 76.12% and a weighted F1-score of 73.61% followed closely by the few-shot learning results. Subsequently, the results' association with other patient experience structured variables (e.g., rating) was conducted. The study enhances MLTC through the application of LLMs, offering healthcare practitioners an efficient method to gain deeper insights into patient feedback and deliver prompt, appropriate responses.|
|**2024-10-28**|[Relation-based Counterfactual Data Augmentation and Contrastive Learning for Robustifying Natural Language Inference Models](http://arxiv.org/abs/2410.20710)|null|Although pre-trained language models show good performance on various natural language processing tasks, they often rely on non-causal features and patterns to determine the outcome. For natural language inference tasks, previous results have shown that even a model trained on a large number of data fails to perform well on counterfactually revised data, indicating that the model is not robustly learning the semantics of the classes. In this paper, we propose a method in which we use token-based and sentence-based augmentation methods to generate counterfactual sentence pairs that belong to each class, and apply contrastive learning to help the model learn the difference between sentence pairs of different classes with similar contexts. Evaluation results with counterfactually-revised dataset and general NLI datasets show that the proposed method can improve the performance and robustness of the NLI model.|
|**2024-10-28**|[SubjECTive-QA: Measuring Subjectivity in Earnings Call Transcripts' QA Through Six-Dimensional Feature Analysis](http://arxiv.org/abs/2410.20651)|**[link](https://github.com/gtfintechlab/subjective-qa)**|Fact-checking is extensively studied in the context of misinformation and disinformation, addressing objective inaccuracies. However, a softer form of misinformation involves responses that are factually correct but lack certain features such as clarity and relevance. This challenge is prevalent in formal Question-Answer (QA) settings such as press conferences in finance, politics, sports, and other domains, where subjective answers can obscure transparency. Despite this, there is a lack of manually annotated datasets for subjective features across multiple dimensions. To address this gap, we introduce SubjECTive-QA, a human annotated dataset on Earnings Call Transcripts' (ECTs) QA sessions as the answers given by company representatives are often open to subjective interpretations and scrutiny. The dataset includes 49,446 annotations for long-form QA pairs across six features: Assertive, Cautious, Optimistic, Specific, Clear, and Relevant. These features are carefully selected to encompass the key attributes that reflect the tone of the answers provided during QA sessions across different domains. Our findings are that the best-performing Pre-trained Language Model (PLM), RoBERTa-base, has similar weighted F1 scores to Llama-3-70b-Chat on features with lower subjectivity, such as Relevant and Clear, with a mean difference of 2.17% in their weighted F1 scores. The models perform significantly better on features with higher subjectivity, such as Specific and Assertive, with a mean difference of 10.01% in their weighted F1 scores. Furthermore, testing SubjECTive-QA's generalizability using QAs from White House Press Briefings and Gaggles yields an average weighted F1 score of 65.97% using our best models for each feature, demonstrating broader applicability beyond the financial domain. SubjECTive-QA is publicly available under the CC BY 4.0 license.|
|**2024-10-27**|[Effective Instruction Parsing Plugin for Complex Logical Query Answering on Knowledge Graphs](http://arxiv.org/abs/2410.20321)|null|Knowledge Graph Query Embedding (KGQE) aims to embed First-Order Logic (FOL) queries in a low-dimensional KG space for complex reasoning over incomplete KGs. To enhance the generalization of KGQE models, recent studies integrate various external information (such as entity types and relation context) to better capture the logical semantics of FOL queries. The whole process is commonly referred to as Query Pattern Learning (QPL). However, current QPL methods typically suffer from the pattern-entity alignment bias problem, leading to the learned defective query patterns limiting KGQE models' performance. To address this problem, we propose an effective Query Instruction Parsing Plugin (QIPP) that leverages the context awareness of Pre-trained Language Models (PLMs) to capture latent query patterns from code-like query instructions. Unlike the external information introduced by previous QPL methods, we first propose code-like instructions to express FOL queries in an alternative format. This format utilizes textual variables and nested tuples to convey the logical semantics within FOL queries, serving as raw materials for a PLM-based instruction encoder to obtain complete query patterns. Building on this, we design a query-guided instruction decoder to adapt query patterns to KGQE models. To further enhance QIPP's effectiveness across various KGQE models, we propose a query pattern injection mechanism based on compressed optimization boundaries and an adaptive normalization component, allowing KGQE models to utilize query patterns more efficiently. Extensive experiments demonstrate that our plug-and-play method improves the performance of eight basic KGQE models and outperforms two state-of-the-art QPL methods.|
|**2024-10-25**|[A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection](http://arxiv.org/abs/2410.19898)|null|This review paper explores recent advances in deep learning approaches for non-invasive cognitive impairment detection. We examine various non-invasive indicators of cognitive decline, including speech and language, facial, and motoric mobility. The paper provides an overview of relevant datasets, feature-extracting techniques, and deep-learning architectures applied to this domain. We have analyzed the performance of different methods across modalities and observed that speech and language-based methods generally achieved the highest detection performance. Studies combining acoustic and linguistic features tended to outperform those using a single modality. Facial analysis methods showed promise for visual modalities but were less extensively studied. Most papers focused on binary classification (impaired vs. non-impaired), with fewer addressing multi-class or regression tasks. Transfer learning and pre-trained language models emerged as popular and effective techniques, especially for linguistic analysis. Despite significant progress, several challenges remain, including data standardization and accessibility, model explainability, longitudinal analysis limitations, and clinical adaptation. Lastly, we propose future research directions, such as investigating language-agnostic speech analysis methods, developing multi-modal diagnostic systems, and addressing ethical considerations in AI-assisted healthcare. By synthesizing current trends and identifying key obstacles, this review aims to guide further development of deep learning-based cognitive impairment detection systems to improve early diagnosis and ultimately patient outcomes.|
|**2024-10-25**|[Intelligent Understanding of Large Language Models in Traditional Chinese Medicine Based on Prompt Engineering Framework](http://arxiv.org/abs/2410.19451)|null|This paper explores the application of prompt engineering to enhance the performance of large language models (LLMs) in the domain of Traditional Chinese Medicine (TCM). We propose TCM-Prompt, a framework that integrates various pre-trained language models (PLMs), templates, tokenization, and verbalization methods, allowing researchers to easily construct and fine-tune models for specific TCM-related tasks. We conducted experiments on disease classification, syndrome identification, herbal medicine recommendation, and general NLP tasks, demonstrating the effectiveness and superiority of our approach compared to baseline methods. Our findings suggest that prompt engineering is a promising technique for improving the performance of LLMs in specialized domains like TCM, with potential applications in digitalization, modernization, and personalized medicine.|
|**2024-10-22**|[All Entities are Not Created Equal: Examining the Long Tail for Fine-Grained Entity Typing](http://arxiv.org/abs/2410.17355)|null|Pre-trained language models (PLMs) are trained on large amounts of data, which helps capture world knowledge alongside linguistic competence. Due to this, they are extensively used for ultra-fine entity typing tasks, where they provide the entity knowledge held in their parameter space. Given that PLMs learn from co-occurrence patterns, they likely contain more or less knowledge about entities depending on how frequent they are in the pre-training data. In this work, we probe PLMs to elicit encoded entity probabilities and demonstrate that they highly correlate with their frequency in large-scale internet data. Then, we demonstrate that entity-typing approaches that rely on PLMs struggle with entities at the long tail of the distribution. Our findings suggest that we need to go beyond PLMs to produce solutions that perform well for rare, new, or infrequent entities.|
|**2024-10-21**|[ComPO: Community Preferences for Language Model Personalization](http://arxiv.org/abs/2410.16027)|null|Conventional algorithms for training language models (LMs) with human feedback rely on preferences that are assumed to account for an "average" user, disregarding subjectivity and finer-grained variations. Recent studies have raised concerns that aggregating such diverse and often contradictory human feedback to finetune models results in generic models that generate outputs not preferred by many user groups, as they tend to average out styles and norms. To address this issue, we draw inspiration from recommendation systems and propose ComPO, a method to personalize preference optimization in LMs by contextualizing the probability distribution of model outputs with the preference provider. Focusing on group-level preferences rather than individuals, we collect and release ComPRed, a question answering dataset with community-level preferences from Reddit. This dataset facilitates studying diversity in preferences without incurring privacy concerns associated with individual feedback. Our experiments reveal that conditioning language models on a community identifier (i.e., subreddit name) during preference tuning substantially enhances model performance. Conversely, replacing this context with random subreddit identifiers significantly diminishes performance, highlighting the effectiveness of our approach in tailoring responses to communities' preferences.|
|**2024-10-21**|[Learning-to-Defer for Extractive Question Answering](http://arxiv.org/abs/2410.15761)|null|Pre-trained language models have profoundly impacted the field of extractive question answering, leveraging large-scale textual corpora to enhance contextual language understanding. Despite their success, these models struggle in complex scenarios that demand nuanced interpretation or inferential reasoning beyond immediate textual cues. Furthermore, their size poses deployment challenges on resource-constrained devices. To address these limitations, we introduce an adapted two-stage Learning-to-Defer mechanism that enhances decision-making by selectively deferring questions to human experts or larger models, without retraining the language model in the question-answering setting. This approach not only preserves computational efficiency but also significantly improves model reliability and accuracy in ambiguous contexts. We establish the theoretical soundness of our approach by proving the Bayes and $(\mathcal{H}, \mathcal{R})$-consistency of our surrogate loss function, guaranteeing the optimality of the final solution. Empirical evaluations on the SQuADv2 dataset demonstrate performance gains from integrating human expertise and leveraging larger models. Our results further show that by deferring only a small number of queries, smaller models can achieve performance comparable to their larger counterparts while maintaining computational efficiency, thereby broadening the applicability of pre-trained language models across diverse operational environments.|
|**2024-10-21**|[Who's Who: Large Language Models Meet Knowledge Conflicts in Practice](http://arxiv.org/abs/2410.15737)|**[link](https://github.com/vinairesearch/whoqa)**|Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, the language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset to examine models' behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinct answers. The WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrade LLMs' performance in RAG settings.|
|**2024-10-21**|[DomainSum: A Hierarchical Benchmark for Fine-Grained Domain Shift in Abstractive Text Summarization](http://arxiv.org/abs/2410.15687)|**[link](https://github.com/hpzhang94/DomainSum)**|Most research on abstractive summarization focuses on single-domain applications, often neglecting how domain shifts between documents affect performance and the generalization ability of summarization models. To address this issue, we introduce DomainSum, a hierarchical benchmark designed to capture fine-grained domain shifts in abstractive summarization. We categorize these shifts into three levels: genre, style, and topic, and demonstrate through comprehensive benchmark analysis that they follow a hierarchical structure. Furthermore, we evaluate the domain generalization capabilities of commonly used pre-trained language models (PLMs) and large language models (LLMs) in in-domain and cross-domain settings.||
|**2024-10-21**|[Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding](http://arxiv.org/abs/2410.15609)|null|Recently, pre-trained language models (PLMs) have been increasingly adopted in spoken language understanding (SLU). However, automatic speech recognition (ASR) systems frequently produce inaccurate transcriptions, leading to noisy inputs for SLU models, which can significantly degrade their performance. To address this, our objective is to train SLU models to withstand ASR errors by exposing them to noises commonly observed in ASR systems, referred to as ASR-plausible noises. Speech noise injection (SNI) methods have pursued this objective by introducing ASR-plausible noises, but we argue that these methods are inherently biased towards specific ASR systems, or ASR-specific noises. In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. Experimental results and analyses demonstrate the effectiveness of our proposed methods in enhancing the robustness and generalizability of SLU models against unseen ASR systems by introducing more diverse and plausible ASR noises in advance.||
|**2024-10-19**|[MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science](http://arxiv.org/abs/2410.15126)|null|We introduce a novel continued pre-training method, MELT (MatEriaLs-aware continued pre-Training), specifically designed to efficiently adapt the pre-trained language models (PLMs) for materials science. Unlike previous adaptation strategies that solely focus on constructing domain-specific corpus, MELT comprehensively considers both the corpus and the training strategy, given that materials science corpus has distinct characteristics from other domains. To this end, we first construct a comprehensive materials knowledge base from the scientific corpus by building semantic graphs. Leveraging this extracted knowledge, we integrate a curriculum into the adaptation process that begins with familiar and generalized concepts and progressively moves toward more specialized terms. We conduct extensive experiments across diverse benchmarks to verify the effectiveness and generality of MELT. A comprehensive evaluation convincingly supports the strength of MELT, demonstrating superior performance compared to existing continued pre-training methods. The in-depth analysis also shows that MELT enables PLMs to effectively represent materials entities compared to the existing adaptation methods, thereby highlighting its broad applicability across a wide spectrum of materials science.||
|**2024-10-19**|[BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation](http://arxiv.org/abs/2410.14971)|null|Recent advances in decoding language from brain signals (EEG and MEG) have been significantly driven by pre-trained language models, leading to remarkable progress on publicly available non-invasive EEG/MEG datasets. However, previous works predominantly utilize teacher forcing during text generation, leading to significant performance drops without its use. A fundamental issue is the inability to establish a unified feature space correlating textual data with the corresponding evoked brain signals. Although some recent studies attempt to mitigate this gap using an audio-text pre-trained model, Whisper, which is favored for its signal input modality, they still largely overlook the inherent differences between audio signals and brain signals in directly applying Whisper to decode brain signals. To address these limitations, we propose a new multi-stage strategy for semantic brain signal decoding via vEctor-quantized speCtrogram reconstruction for WHisper-enhanced text generatiOn, termed BrainECHO. Specifically, BrainECHO successively conducts: 1) Discrete autoencoding of the audio spectrogram; 2) Brain-audio latent space alignment; and 3) Semantic text generation via Whisper finetuning. Through this autoencoding--alignment--finetuning process, BrainECHO outperforms state-of-the-art methods under the same data split settings on two widely accepted resources: the EEG dataset (Brennan) and the MEG dataset (GWilliams). The innovation of BrainECHO, coupled with its robustness and superiority at the sentence, session, and subject-independent levels across public datasets, underscores its significance for language-based brain-computer interfaces.||
|**2024-10-18**|[Reasoning, Memorization, and Fine-Tuning Language Models for Non-Cooperative Games](http://arxiv.org/abs/2410.14890)|null|We develop a method that integrates the tree of thoughts and multi-agent framework to enhance the capability of pre-trained language models in solving complex, unfamiliar games. The method decomposes game-solving into four incremental tasks -- game summarization, area selection, action extraction, and action validation -- each assigned to a specific language-model agent. By constructing a tree of thoughts, the method simulates reasoning paths and allows agents to collaboratively distill game representations and tactics, mitigating the limitations of language models in reasoning and long-term memorization. Additionally, an automated fine-tuning process further optimizes the agents' performance by ranking query-response pairs based on game outcomes, e.g., winning or losing. We apply the method to a non-cooperative game and demonstrate a 65 percent winning rate against benchmark algorithms, with an additional 10 percent improvement after fine-tuning. In contrast to existing deep learning algorithms for game solving that require millions of training samples, the proposed method consumes approximately 1000 training samples, highlighting its efficiency and scalability.||
|**2024-10-18**|[PTR: A Pre-trained Language Model for Trajectory Recovery](http://arxiv.org/abs/2410.14281)|null|Spatiotemporal trajectory data is vital for web-of-things services and is extensively collected and analyzed by web-based hardware and platforms. However, issues such as service interruptions and network instability often lead to sparsely recorded trajectories, resulting in a loss of detailed movement data. As a result, recovering these trajectories to restore missing information becomes essential. Despite progress, several challenges remain unresolved. First, the lack of large-scale dense trajectory data hampers the performance of existing deep learning methods, which rely heavily on abundant data for supervised training. Second, current methods struggle to generalize across sparse trajectories with varying sampling intervals, necessitating separate re-training for each interval and increasing computational costs. Third, external factors crucial for the recovery of missing points are not fully incorporated. To address these challenges, we propose a framework called PTR. This framework mitigates the issue of limited dense trajectory data by leveraging the capabilities of pre-trained language models (PLMs). PTR incorporates an explicit trajectory prompt and is trained on datasets with multiple sampling intervals, enabling it to generalize effectively across different intervals in sparse trajectories. To capture external factors, we introduce an implicit trajectory prompt that models road conditions, providing richer information for recovering missing points. Additionally, we present a trajectory embedder that encodes trajectory points and transforms the embeddings of both observed and missing points into a format comprehensible to PLMs. Experimental results on two public trajectory datasets with three sampling intervals demonstrate the efficacy and scalability of PTR.||
|**2024-10-16**|[NSmark: Null Space Based Black-box Watermarking Defense Framework for Pre-trained Language Models](http://arxiv.org/abs/2410.13907)|**[link](https://github.com/dongdongzhaoup/nsmark)**|Pre-trained language models (PLMs) have emerged as critical intellectual property (IP) assets that necessitate protection. Although various watermarking strategies have been proposed, they remain vulnerable to Linear Functionality Equivalence Attacks (LFEA), which can invalidate most existing white-box watermarks without prior knowledge of the watermarking scheme or training data. This paper further analyzes and extends the attack scenarios of LFEA to the commonly employed black-box settings for PLMs by considering Last-Layer outputs (dubbed LL-LFEA). We discover that the null space of the output matrix remains invariant against LL-LFEA attacks. Based on this finding, we propose NSmark, a task-agnostic, black-box watermarking scheme capable of resisting LL-LFEA attacks. NSmark consists of three phases: (i) watermark generation using the digital signature of the owner, enhanced by spread spectrum modulation for increased robustness; (ii) watermark embedding through an output mapping extractor that preserves PLM performance while maximizing watermark capacity; (iii) watermark verification, assessed by extraction rate and null space conformity. Extensive experiments on both pre-training and downstream tasks confirm the effectiveness, reliability, fidelity, and robustness of our approach. Code is available at https://github.com/dongdongzhaoUP/NSmark.||
|**2024-10-17**|[Meta-DiffuB: A Contextualized Sequence-to-Sequence Text Diffusion Model with Meta-Exploration](http://arxiv.org/abs/2410.13201)|**[link](https://github.com/meta-diffub/meta-diffub)**|The diffusion model, a new generative modeling paradigm, has achieved significant success in generating images, audio, video, and text. It has been adapted for sequence-to-sequence text generation (Seq2Seq) through DiffuSeq, termed S2S Diffusion. Existing S2S-Diffusion models predominantly rely on fixed or hand-crafted rules to schedule noise during the diffusion and denoising processes. However, these models are limited by non-contextualized noise, which fails to fully consider the characteristics of Seq2Seq tasks. In this paper, we propose the Meta-DiffuB framework - a novel scheduler-exploiter S2S-Diffusion paradigm designed to overcome the limitations of existing S2S-Diffusion models. We employ Meta-Exploration to train an additional scheduler model dedicated to scheduling contextualized noise for each sentence. Our exploiter model, an S2S-Diffusion model, leverages the noise scheduled by our scheduler model for updating and generation. Meta-DiffuB achieves state-of-the-art performance compared to previous S2S-Diffusion models and fine-tuned pre-trained language models (PLMs) across four Seq2Seq benchmark datasets. We further investigate and visualize the impact of Meta-DiffuB's noise scheduling on the generation of sentences with varying difficulties. Additionally, our scheduler model can function as a "plug-and-play" model to enhance DiffuSeq without the need for fine-tuning during the inference stage.||
|**2024-10-16**|[Negative-Prompt-driven Alignment for Generative Language Model](http://arxiv.org/abs/2410.12194)|null|Large language models have achieved remarkable capabilities, but aligning their outputs with human values and preferences remains a significant challenge. Existing alignment methods primarily focus on positive examples while overlooking the importance of negative responses in guiding models away from undesirable behaviors. For instance, widely used alignment datasets reveal a scarcity of explicit negative examples that contradict human values, hindering their ability to discourage harmful or biased outputs during training. To address this limitation, we propose NEAT, i.e., NEgative-prompt-driven AlignmenT, to introduce negative prompts to generate undesirable responses alongside positive examples during the optimization process. NEAT explicitly penalizes the model for producing harmful outputs, guiding it not only toward desirable behaviors but also steering it away from generating undesirable, biased responses. This dual feedback mechanism enables better alignment with human preferences, crucial in contexts where avoiding harm is paramount. Starting from a pre-trained language model, NEAT performs online alignment by incorporating a ranking loss derived from an expanded preference dataset containing both positive and negative examples. Extensive experiments validate NEAT's effectiveness in significantly enhancing language models' alignment with human values and preferences.|
|**2024-10-15**|[Bridging Large Language Models and Graph Structure Learning Models for Robust Representation Learning](http://arxiv.org/abs/2410.12096)|null|Graph representation learning, involving both node features and graph structures, is crucial for real-world applications but often encounters pervasive noise. State-of-the-art methods typically address noise by focusing separately on node features with large language models (LLMs) and on graph structures with graph structure learning models (GSLMs). In this paper, we introduce LangGSL, a robust framework that integrates the complementary strengths of pre-trained language models and GSLMs to jointly enhance both node feature and graph structure learning. In LangGSL, we first leverage LLMs to filter noise in the raw data and extract valuable cleaned information as features, enhancing the synergy of downstream models. During the mutual learning phase in LangGSL, the core idea is to leverage the relatively small language model (LM) to process local attributes and generate reliable pseudo-labels and informative node embeddings, which are then integrated into the GSLM's prediction phase. This approach enriches the global context and enhances overall performance. Meanwhile, GSLM refines the evolving graph structure constructed from the LM's output, offering updated labels back to the LM as additional guidance, thus facilitating a more effective mutual learning process. The LM and GSLM work synergistically, complementing each other's strengths and offsetting weaknesses within a variational information-maximizing framework, resulting in enhanced node features and a more robust graph structure. Extensive experiments on diverse graph datasets of varying scales and across different task scenarios demonstrate the scalability and effectiveness of the proposed approach.||
|**2024-10-15**|[LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text](http://arxiv.org/abs/2410.12064)|null|This paper presents the results of the LegalLens Shared Task, focusing on detecting legal violations within text in the wild across two sub-tasks: LegalLens-NER for identifying legal violation entities and LegalLens-NLI for associating these violations with relevant legal contexts and affected individuals. Using an enhanced LegalLens dataset covering labor, privacy, and consumer protection domains, 38 teams participated in the task. Our analysis reveals that while a mix of approaches was used, the top-performing teams in both tasks consistently relied on fine-tuning pre-trained language models, outperforming legal-specific models and few-shot methods. The top-performing team achieved a 7.11% improvement in NER over the baseline, while NLI saw a more marginal improvement of 5.7%. Despite these gains, the complexity of legal texts leaves room for further advancements.||
|**2024-10-15**|[A Survey on Deep Tabular Learning](http://arxiv.org/abs/2410.12034)|null|Tabular data, widely used in industries like healthcare, finance, and transportation, presents unique challenges for deep learning due to its heterogeneous nature and lack of spatial structure. This survey reviews the evolution of deep learning models for tabular data, from early fully connected networks (FCNs) to advanced architectures like TabNet, SAINT, TabTranSELU, and MambaNet. These models incorporate attention mechanisms, feature embeddings, and hybrid architectures to address tabular data complexities. TabNet uses sequential attention for instance-wise feature selection, improving interpretability, while SAINT combines self-attention and intersample attention to capture complex interactions across features and data points, both advancing scalability and reducing computational overhead. Hybrid architectures such as TabTransformer and FT-Transformer integrate attention mechanisms with multi-layer perceptrons (MLPs) to handle categorical and numerical data, with FT-Transformer adapting transformers for tabular datasets. Research continues to balance performance and efficiency for large datasets. Graph-based models like GNN4TDL and GANDALF combine neural networks with decision trees or graph structures, enhancing feature representation and mitigating overfitting in small datasets through advanced regularization techniques. Diffusion-based models like the Tabular Denoising Diffusion Probabilistic Model (TabDDPM) generate synthetic data to address data scarcity, improving model robustness. Similarly, models like TabPFN and Ptab leverage pre-trained language models, incorporating transfer learning and self-supervised techniques into tabular tasks. This survey highlights key advancements and outlines future research directions on scalability, generalization, and interpretability in diverse tabular data applications.||
|**2024-10-14**|[Improve Meta-learning for Few-Shot Text Classification with All You Can Acquire from the Tasks](http://arxiv.org/abs/2410.10454)|**[link](https://github.com/yvogao/laqda)**|Meta-learning has emerged as a prominent technology for few-shot text classification and has achieved promising performance. However, existing methods often encounter difficulties in drawing accurate class prototypes from support set samples, primarily due to probable large intra-class differences and small inter-class differences within the task. Recent approaches attempt to incorporate external knowledge or pre-trained language models to augment data, but this requires additional resources and thus does not suit many few-shot scenarios. In this paper, we propose a novel solution to address this issue by adequately leveraging the information within the task itself. Specifically, we utilize label information to construct a task-adaptive metric space, thereby adaptively reducing the intra-class differences and magnifying the inter-class differences. We further employ the optimal transport technique to estimate class prototypes with query set samples together, mitigating the problem of inaccurate and ambiguous support set samples caused by large intra-class differences. We conduct extensive experiments on eight benchmark datasets, and our approach shows obvious advantages over state-of-the-art models across all the tasks on all the datasets. For reproducibility, all the datasets and codes are available at https://github.com/YvoGao/LAQDA.||
|**2024-10-14**|[Scalable Multi-Domain Adaptation of Language Models using Modular Experts](http://arxiv.org/abs/2410.10181)|null|Domain-specific adaptation is critical to maximizing the performance of pre-trained language models (PLMs) on one or multiple targeted tasks, especially under resource-constrained use cases, such as edge devices. However, existing methods often struggle to balance domain-specific performance, retention of general knowledge, and efficiency for training and inference. To address these challenges, we propose Modular Domain Experts (MoDE). MoDE is a mixture-of-experts architecture that augments a general PLM with modular, domain-specialized experts. These experts are trained independently and composed together via a lightweight training process. In contrast to standard low-rank adaptation methods, each MoDE expert consists of several transformer layers that scale better with more training examples and larger parameter counts. Our evaluation demonstrates that MoDE achieves comparable target performance to full parameter fine-tuning while achieving 1.65% better retention performance. Moreover, MoDE's architecture enables flexible sharding configurations and improves training speeds by up to 38% over state-of-the-art distributed training configurations.|
|**2024-10-11**|[Lifelong Event Detection via Optimal Transport](http://arxiv.org/abs/2410.08905)|null|Continual Event Detection (CED) poses a formidable challenge due to the catastrophic forgetting phenomenon, where learning new tasks (with newly arriving event types) hampers performance on previous ones. In this paper, we introduce a novel approach, Lifelong Event Detection via Optimal Transport (LEDOT), that leverages optimal transport principles to align the optimization of our classification module with the intrinsic nature of each class, as defined by their pre-trained language modeling. Our method integrates replay sets, prototype latent representations, and an innovative Optimal Transport component. Extensive experiments on MAVEN and ACE datasets demonstrate LEDOT's superior performance, consistently outperforming state-of-the-art baselines. The results underscore LEDOT as a pioneering solution in continual event detection, offering a more effective and nuanced approach to addressing catastrophic forgetting in evolving environments.|
|**2024-10-10**|[Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity](http://arxiv.org/abs/2410.08198)|**[link](https://github.com/mohamad-amin/adam-coordinate-adaptivity)**|Adam outperforms SGD when training language models. Yet this advantage is not well-understood theoretically -- previous convergence analysis for Adam and SGD mainly focuses on the number of steps $T$ and is already minimax-optimal in non-convex cases, which are both $\widetilde{O}(T^{-1/4})$. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under novel assumptions that loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.|
|**2024-10-10**|[Do Current Language Models Support Code Intelligence for R Programming Language?](http://arxiv.org/abs/2410.07793)|null|Recent advancements in developing Pre-trained Language Models for Code (Code-PLMs) have urged many areas of Software Engineering (SE) and brought breakthrough results for many SE tasks. Though these models have achieved the state-of-the-art performance for SE tasks for many popular programming languages, such as Java and Python, the Scientific Software and its related languages like R programming language have rarely benefited or even been evaluated with the Code-PLMs. Research has shown that R has many differences with other programming languages and requires specific techniques. In this study, we provide the first insights for code intelligence for R. For this purpose, we collect and open source an R dataset, and evaluate Code-PLMs for the two tasks of code summarization and method name prediction using several settings and strategies, including the differences in two R styles, Tidy-verse and Base R. Our results demonstrate that the studied models have experienced varying degrees of performance degradation when processing R programming language code, which is supported by human evaluation. Additionally, not all models show performance improvement in R-specific tasks even after multi-language fine-tuning. The dual syntax paradigms in R significantly impact the models' performance, particularly in code summarization tasks. Furthermore, the project-specific context inherent in R codebases significantly impacts the performance when attempting cross-project training.||
|**2024-10-09**|[Multi-Task Program Error Repair and Explanatory Diagnosis](http://arxiv.org/abs/2410.07271)|null|Program errors can occur in any type of programming, and can manifest in a variety of ways, such as unexpected output, crashes, or performance issues. Program error diagnosis can often be too abstract or technical for developers to understand, especially for beginners. The goal of this paper is to present a novel machine-learning approach for Multi-task Program Error Repair and Explanatory Diagnosis (mPRED). A pre-trained language model is used to encode the source code, and a downstream model is specifically designed to identify and repair errors. Programs and test cases are augmented and optimized from several perspectives. Additionally, our approach incorporates a "chain of thoughts" method, which enables the models to produce intermediate reasoning explanations before providing the final correction. To aid in visualizing and analyzing the program structure, we use a graph neural network for program structure visualization. Overall, our approach offers a promising solution for repairing program errors across different programming languages and providing helpful explanations to programmers.|
|**2024-10-08**|[Manual Verbalizer Enrichment for Few-Shot Text Classification](http://arxiv.org/abs/2410.06173)|null|With the continuous development of pre-trained language models, prompt-based training has become a well-adopted paradigm that drastically improves the exploitation of models for many natural language processing tasks. Prompting also shows great performance compared to traditional fine-tuning when adapted to zero-shot or few-shot scenarios where the amount of annotated data is limited. In this framework, the role of verbalizers is essential, as an interpretation from masked word distributions into output predictions. In this work, we propose MAVE, an approach for verbalizer construction by enrichment of class labels using neighborhood relations in the embedding space of words for the text classification task. In addition, we design a benchmarking procedure to evaluate typical baselines of verbalizers for document classification in few-shot learning contexts. Our model achieves state-of-the-art results while using significantly fewer resources. We show that our approach is particularly effective in cases with extremely limited supervision data.|
|**2024-10-08**|[Enhancing SPARQL Generation by Triplet-order-sensitive Pre-training](http://arxiv.org/abs/2410.05731)|**[link](https://github.com/LUMIA-Group/TosT5)**|Semantic parsing that translates natural language queries to SPARQL is of great importance for Knowledge Graph Question Answering (KGQA) systems. Although pre-trained language models like T5 have achieved significant success in the Text-to-SPARQL task, their generated outputs still exhibit notable errors specific to the SPARQL language, such as triplet flips. To address this challenge and further improve the performance, we propose an additional pre-training stage with a new objective, Triplet Order Correction (TOC), along with the commonly used Masked Language Modeling (MLM), to collectively enhance the model's sensitivity to triplet order and SPARQL syntax. Our method achieves state-of-the-art performances on three widely-used benchmarks.||
|**2024-10-05**|[Persona Knowledge-Aligned Prompt Tuning Method for Online Debate](http://arxiv.org/abs/2410.04239)|**[link](https://github.com/HKUST-KnowComp/PersonaPrompt)**|Debate is the process of exchanging viewpoints or convincing others on a particular issue. Recent research has provided empirical evidence that the persuasiveness of an argument is determined not only by language usage but also by communicator characteristics. Researchers have paid much attention to aspects of language, such as linguistic features and discourse structures, but combining argument persuasiveness and impact with the social personae of the audience has not been explored due to the difficulty and complexity. We have observed the impressive simulation and personification capability of ChatGPT, indicating that a giant pre-trained language model may function as an individual to provide personae and exert unique influences based on diverse background knowledge. Therefore, we propose a persona knowledge-aligned framework for argument quality assessment tasks from the audience side. This is the first work that leverages the emergence of ChatGPT and injects such audience persona knowledge into smaller language models via prompt tuning. The performance of our pipeline demonstrates significant and consistent improvement compared to competitive architectures.||
|**2024-10-05**|[Overview of Factify5WQA: Fact Verification through 5W Question-Answering](http://arxiv.org/abs/2410.04236)|null|Researchers have found that fake news spreads many times faster than real news. This is a major problem, especially in today's world where social media is the key source of news for many among the younger population. Fact verification thus becomes an important task, and many media sites contribute to the cause. Manual fact verification is a tedious task, given the volume of fake news online. The Factify5WQA shared task aims to increase research towards automated fake news detection by providing a dataset with an aspect-based, question-answering-based fact verification method. Each claim and its supporting document is associated with 5W questions that help compare the two information sources. Performance in the task is measured objectively by comparing answers using BLEU score to assess their accuracy, followed by an accuracy measure of the classification. The task received submissions using custom training setups and pre-trained language models, among others. The best performing team posted an accuracy of 69.56%, nearly a 35% improvement over the baseline.||
|**2024-10-05**|[On Eliciting Syntax from Language Models via Hashing](http://arxiv.org/abs/2410.04074)|**[link](https://github.com/speedcell4/parserker)**|Unsupervised parsing, also known as grammar induction, aims to infer syntactic structure from raw text. Recently, binary representation has exhibited remarkable information-preserving capabilities at both lexicon and syntax levels. In this paper, we explore the possibility of leveraging this capability to deduce parsing trees from raw text, relying solely on the implicitly induced grammars within models. To achieve this, we upgrade the bit-level CKY from zero-order to first-order to encode the lexicon and syntax in a unified binary representation space, switch training from supervised to unsupervised under the contrastive hashing framework, and introduce a novel loss function to impose stronger yet balanced alignment signals. Our model shows competitive performance on various datasets; we therefore claim that our method is effective and efficient enough to acquire high-quality parsing trees from pre-trained language models at a low cost.||
|**2024-10-03**|[Reward-RAG: Enhancing RAG with Reward Driven Supervision](http://arxiv.org/abs/2410.03780)|null|In this paper, we introduce Reward-RAG, a novel approach designed to enhance the Retrieval-Augmented Generation (RAG) model through Reward-Driven Supervision. Unlike previous RAG methodologies, which focus on training language models (LMs) to utilize external knowledge retrieved from external sources, our method adapts retrieval information to specific domains by employing CriticGPT to train a dedicated reward model. This reward model generates synthesized datasets for fine-tuning the RAG encoder, aligning its outputs more closely with human preferences. The versatility of our approach allows it to be effectively applied across various domains through domain-specific fine-tuning. We evaluate Reward-RAG on publicly available benchmarks from multiple domains, comparing it to state-of-the-art methods. Our experimental results demonstrate significant improvements in performance, highlighting the effectiveness of Reward-RAG in improving the relevance and quality of generated responses. These findings underscore the potential of integrating reward models with RAG to achieve superior outcomes in natural language generation tasks.||
|**2024-10-04**|[Vulnerability Detection via Topological Analysis of Attention Maps](http://arxiv.org/abs/2410.03470)|**[link](https://github.com/Snopoff/Vulnerability-Detection-via-Topological-Analysis-of-Attention-Maps)**|Recently, deep learning (DL) approaches to vulnerability detection have gained significant traction. These methods demonstrate promising results, often surpassing traditional static code analysis tools in effectiveness. In this study, we explore a novel approach to vulnerability detection utilizing the tools from topological data analysis (TDA) on the attention matrices of the BERT model. Our findings reveal that traditional machine learning (ML) techniques, when trained on the topological features extracted from these attention matrices, can perform competitively with pre-trained language models (PLMs) such as CodeBERTa. This suggests that TDA tools, including persistent homology, are capable of effectively capturing semantic information critical for identifying vulnerabilities.||
|**2024-10-09**|[What do Large Language Models Need for Machine Translation Evaluation?](http://arxiv.org/abs/2410.03278)|**[link](https://github.com/surrey-nlp/LLM4MT_eval)**|Leveraging large language models (LLMs) for various natural language processing tasks has led to superlative claims about their performance. For the evaluation of machine translation (MT), existing research shows that LLMs are able to achieve results comparable to fine-tuned multilingual pre-trained language models. In this paper, we explore what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate MT quality. In addition, we investigate prompting techniques such as zero-shot, Chain of Thought (CoT) and few-shot prompting for eight language pairs covering high-, medium- and low-resource languages, leveraging varying LLM variants. Our findings indicate the importance of reference translations for an LLM-based evaluation. While larger models do not necessarily fare better, they tend to benefit more from CoT prompting, than smaller models. We also observe that LLMs do not always provide a numerical score when generating evaluations, which poses a question on their reliability for the task. Our work presents a comprehensive analysis for resource-constrained and training-less LLM-based evaluation of machine translation. We release the accrued prompt templates, code and data publicly for reproducibility.||
|**2024-10-04**|[Generating bilingual example sentences with large language models as lexicography assistants](http://arxiv.org/abs/2410.03182)|null|We present a study of LLMs' performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels: French (high-resource), Indonesian (mid-resource), and Tetun (low-resource), with English as the target language. We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility. Our findings reveal that while LLMs can generate reasonably good dictionary examples, their performance degrades significantly for lower-resourced languages. We also observe high variability in human preferences for example quality, reflected in low inter-annotator agreement rates. To address this, we demonstrate that in-context learning can successfully align LLMs with individual annotator preferences. Additionally, we explore the use of pre-trained language models for automated rating of examples, finding that sentence perplexity serves as a good proxy for typicality and intelligibility in higher-resourced languages. Our study also contributes a novel dataset of 600 ratings for LLM-generated sentence pairs, and provides insights into the potential of LLMs in reducing the cost of lexicographic work, particularly for low-resource languages.||
|**2024-10-03**|[Guided Stream of Search: Learning to Better Search with Language Models via Optimal Path Guidance](http://arxiv.org/abs/2410.02992)|**[link](https://github.com/symoon11/guided-stream-of-search)**|While language models have demonstrated impressive capabilities across a range of tasks, they still struggle with tasks that require complex planning and reasoning. Recent studies have proposed training language models on search processes rather than optimal solutions, resulting in better generalization performance even though search processes are noisy and even suboptimal. However, these studies overlook the value of optimal solutions, which can serve as step-by-step landmarks to guide more effective search. In this work, we explore how to leverage optimal solutions to enhance the search and planning abilities of language models. To this end, we propose guided stream of search (GSoS), which seamlessly incorporates optimal solutions into the self-generation process in a progressive manner, producing high-quality search trajectories. These trajectories are then distilled into the pre-trained model via supervised fine-tuning. Our approach significantly enhances the search and planning abilities of language models on Countdown, a simple yet challenging mathematical reasoning task. Notably, combining our method with RL fine-tuning yields further improvements, whereas previous supervised fine-tuning methods do not benefit from RL. Furthermore, our approach exhibits greater effectiveness than leveraging optimal solutions in the form of subgoal rewards.||
|**2024-10-03**|[Does the Order of Fine-tuning Matter and Why?](http://arxiv.org/abs/2410.02915)|null|To improve the performance on a target task, researchers have fine-tuned language models with an intermediate task before the target task of interest. However, previous works have focused on the pre-trained language models and downstream tasks in Natural Language Processing (NLP) and considered only one intermediate task. The effect of fine-tuning multiple intermediate tasks and their ordering on target task performance has not been fully explored in Software Engineering. In this study, we perform the first empirical study on analyzing the impact of task ordering on target task performance. Experimental results show that task ordering affects target task performance, with up to a 6% performance gain and up to a 4% performance loss. To explain such an impact, we consider a variety of potential factors, including the characteristics of the dataset (syntactic similarity and semantic similarity analysis, dataset size), the model (probing task and attention analysis), and the task (task affinity analysis). Our study provides Software Engineering researchers and practitioners with insights into the effect of task orderings and how to select the one that is cost-effective while achieving the best performance gain.||
|**2024-10-02**|[SciPrompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics](http://arxiv.org/abs/2410.01946)|**[link](https://github.com/zhiwenyou103/SciPrompt)**|Prompt-based fine-tuning has become an essential method for eliciting information encoded in pre-trained language models for a variety of tasks, including text classification. For multi-class classification tasks, prompt-based fine-tuning under low-resource scenarios has achieved performance levels comparable to those of full fine-tuning methods. Previous studies have used crafted prompt templates and verbalizers, mapping from the label term space to the class space, to solve the classification problem as a masked language modeling task. However, cross-domain and fine-grained prompt-based fine-tuning with an automatically enriched verbalizer remains unexplored, mainly due to the difficulty and cost of manually selecting domain label terms for the verbalizer, which requires humans with domain expertise. To address this challenge, we introduce SciPrompt, a framework designed to automatically retrieve scientific topic-related terms for low-resource text classification tasks. To this end, we select semantically correlated and domain-specific label terms within the context of scientific literature for verbalizer augmentation. Furthermore, we propose a new verbalization strategy that uses correlation scores as additional weights to enhance the prediction performance of the language model during model tuning. Our method outperforms state-of-the-art prompt-based fine-tuning methods on scientific text classification tasks under few-shot and zero-shot settings, especially in classifying fine-grained and emerging scientific topics.||
|**2024-10-01**|[PclGPT: A Large Language Model for Patronizing and Condescending Language Detection](http://arxiv.org/abs/2410.00361)|**[link](https://github.com/dut-laowang/emnlp24-PclGPT)**|Disclaimer: Samples in this paper may be harmful and cause discomfort! Patronizing and condescending language (PCL) is a form of speech directed at vulnerable groups. As an essential branch of toxic language, this type of language exacerbates conflicts and confrontations among Internet communities and detrimentally impacts disadvantaged groups. Traditional pre-trained language models (PLMs) perform poorly in detecting PCL due to its implicit toxicity traits like hypocrisy and false sympathy. With the rise of large language models (LLMs), we can harness their rich emotional semantics to establish a paradigm for exploring implicit toxicity. In this paper, we introduce PclGPT, a comprehensive LLM benchmark designed specifically for PCL. We collect, annotate, and integrate the Pcl-PT/SFT dataset, and then develop a bilingual PclGPT-EN/CN model group through a comprehensive pre-training and supervised fine-tuning staircase process to facilitate implicit toxic detection. Group detection results and fine-grained detection from PclGPT and other models reveal significant variations in the degree of bias in PCL towards different vulnerable groups, necessitating increased societal attention to protect them.||
|**2024-10-03**|[Enhancing Pre-Trained Language Models for Vulnerability Detection via Semantic-Preserving Data Augmentation](http://arxiv.org/abs/2410.00249)|null|With the rapid development and widespread use of advanced network systems, software vulnerabilities pose a significant threat to secure communications and networking. Learning-based vulnerability detection systems, particularly those leveraging pre-trained language models, have demonstrated significant potential in promptly identifying vulnerabilities in communication networks and reducing the risk of exploitation. However, the shortage of accurately labeled vulnerability datasets hinders further progress in this field. Failing to represent real-world vulnerability data variety and preserve vulnerability semantics, existing augmentation approaches provide limited or even counterproductive contributions to model training. In this paper, we propose a data augmentation technique aimed at enhancing the performance of pre-trained language models for vulnerability detection. Given the vulnerability dataset, our method performs natural semantic-preserving program transformation to generate a large volume of new samples with enriched data diversity and variety. By incorporating our augmented dataset in fine-tuning a series of representative code pre-trained models (i.e., CodeBERT, GraphCodeBERT, UnixCoder, and PDBERT), up to 10.1% increase in accuracy and 23.6% increase in F1 can be achieved in the vulnerability detection task. Comparison results also show that our proposed method can substantially outperform other prominent vulnerability augmentation approaches.||
|**2024-09-29**|[Adversarial Examples for DNA Classification](http://arxiv.org/abs/2409.19788)|null|Pre-trained language models such as DNABERT2 and Nucleotide Transformer, which are trained on DNA sequences, have shown promising performance in DNA sequence classification tasks. The classification ability of these models stems from language models trained on vast amounts of DNA sequence samples, followed by fine-tuning with relatively smaller classification datasets. However, these text-based systems are not robust enough and can be vulnerable to adversarial examples. While adversarial attacks have been widely studied in text classification, there is limited research in DNA sequence classification. In this paper, we adapt commonly used attack algorithms in text classification for DNA sequence classification. We evaluated the impact of various attack methods on DNA sequence classification at the character, word, and sentence levels. Our findings indicate that actual DNA language model sequence classifiers are vulnerable to these attacks.||
|**2024-09-29**|[NeuroMax: Enhancing Neural Topic Modeling via Maximizing Mutual Information and Group Topic Regularization](http://arxiv.org/abs/2409.19749)|null|Recent advances in neural topic models have concentrated on two primary directions: the integration of the inference network (encoder) with a pre-trained language model (PLM) and the modeling of the relationship between words and topics in the generative model (decoder). However, the use of large PLMs significantly increases inference costs, making them less practical for situations requiring low inference times. Furthermore, it is crucial to simultaneously model the relationships between topics and words as well as the interrelationships among topics themselves. In this work, we propose a novel framework called NeuroMax (Neural Topic Model with Maximizing Mutual Information with Pretrained Language Model and Group Topic Regularization) to address these challenges. NeuroMax maximizes the mutual information between the topic representation obtained from the encoder in neural topic models and the representation derived from the PLM. Additionally, NeuroMax employs optimal transport to learn the relationships between topics by analyzing how information is transported among them. Experimental results indicate that NeuroMax reduces inference time, generates more coherent topics and topic groups, and produces more representative document embeddings, thereby enhancing performance on downstream tasks.||
|**2024-09-27**|[Suicide Phenotyping from Clinical Notes in Safety-Net Psychiatric Hospital Using Multi-Label Classification with Pre-Trained Language Models](http://arxiv.org/abs/2409.18878)|null|Accurate identification and categorization of suicidal events can yield better suicide precautions, reducing operational burden, and improving care quality in high-acuity psychiatric settings. Pre-trained language models offer promise for identifying suicidality from unstructured clinical narratives. We evaluated the performance of four BERT-based models using two fine-tuning strategies (multiple single-label and single multi-label) for detecting coexisting suicidal events from 500 annotated psychiatric evaluation notes. The notes were labeled for suicidal ideation (SI), suicide attempts (SA), exposure to suicide (ES), and non-suicidal self-injury (NSSI). RoBERTa outperformed other models using binary relevance (acc=0.86, F1=0.78). MentalBERT (F1=0.74) also exceeded BioClinicalBERT (F1=0.72). RoBERTa fine-tuned with a single multi-label classifier further improved performance (acc=0.88, F1=0.81), highlighting that models pre-trained on domain-relevant data and the single multi-label classification strategy enhance efficiency and performance. Keywords: EHR-based Phenotyping; Natural Language Processing; Secondary Use of EHR Data; Suicide Classification; BERT-based Model; Psychiatry; Mental Health||
|**2024-09-26**|[Infer Human's Intentions Before Following Natural Language Instructions](http://arxiv.org/abs/2409.18073)|**[link](https://github.com/simon-wan/fiser)**|For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.||
|**2024-09-26**|[Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study](http://arxiv.org/abs/2409.17750)|null|In this study, we delve into the efficacy of transformers within pre-trained language models (PLMs) when repurposed as encoders for Automatic Speech Recognition (ASR). Our underlying hypothesis posits that, despite being initially trained on text-based corpora, these transformers possess a remarkable capacity to extract effective features from the input sequence. This inherent capability, we argue, is transferrable to speech data, thereby augmenting the acoustic modeling ability of ASR. Through rigorous empirical analysis, our findings reveal a notable improvement in Character Error Rate (CER) and Word Error Rate (WER) across diverse ASR tasks when transformers from pre-trained LMs are incorporated. Particularly, they serve as an advantageous starting point for initializing ASR encoders. Furthermore, we uncover that these transformers, when integrated into a well-established ASR encoder, can significantly boost performance, especially in scenarios where profound semantic comprehension is pivotal. This underscores the potential of leveraging the semantic prowess embedded within pre-trained transformers to advance ASR systems' capabilities.||
|**2024-09-24**|[HLB: Benchmarking LLMs' Humanlikeness in Language Use](http://arxiv.org/abs/2409.15890)|null|As synthetic data becomes increasingly prevalent in training language models, particularly through generated dialogue, concerns have emerged that these models may deviate from authentic human language patterns, potentially losing the richness and creativity inherent in human communication. This highlights the critical need to assess the humanlikeness of language models in real-world language use. In this paper, we present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs) using 10 psycholinguistic experiments designed to probe core linguistic aspects, including sound, word, syntax, semantics, and discourse (see https://huggingface.co/spaces/XufengDuan/HumanLikeness). To anchor these comparisons, we collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments. For rigorous evaluation, we developed a coding algorithm that accurately identified language use patterns, enabling the extraction of response distributions for each task. By comparing the response distributions between human participants and LLMs, we quantified humanlikeness through distributional similarity. Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels. Importantly, we found that improvements in other performance metrics did not necessarily lead to greater humanlikeness, and in some cases, even resulted in a decline. By introducing psycholinguistic methods to model evaluation, this benchmark offers the first framework for systematically assessing the humanlikeness of LLMs in language use.||
|**2024-09-23**|[DSG-KD: Knowledge Distillation from Domain-Specific to General Language Models](http://arxiv.org/abs/2409.14904)|**[link](https://github.com/josangyeon/dsg-kd)**|The use of pre-trained language models fine-tuned to address specific downstream tasks is a common approach in natural language processing (NLP). However, acquiring domain-specific knowledge via fine-tuning is challenging. Traditional methods involve pretraining language models using vast amounts of domain-specific data before fine-tuning for particular tasks. This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea. Our findings reveal that existing domain-specific pre-trained language models underperform compared to general language models in handling the N-lingual free-text data characteristics of non-English-speaking regions. To address these limitations, we propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning. This study demonstrates the effective transfer of specialized knowledge between models by defining a general language model as the student model and a domain-specific pre-trained model as the teacher model. In particular, we address the complexities of EMR data obtained from PEDs in non-English-speaking regions, such as Korea, and demonstrate that the proposed method enhances classification performance in such contexts. The proposed methodology not only outperforms baseline models on Korean PED EMR data, but also promises broader applicability in various professional and technical domains. In future work, we intend to extend this methodology to include diverse non-English-speaking regions and address additional downstream tasks, with the aim of developing advanced model architectures using state-of-the-art KD techniques. The code is available at https://github.com/JoSangYeon/DSG-KD.||
|**2024-09-23**|[Pre-trained Language Model and Knowledge Distillation for Lightweight Sequential Recommendation](http://arxiv.org/abs/2409.14810)|null|Sequential recommendation models user interests based on historical behaviors to provide personalized recommendations. Previous sequential recommendation algorithms primarily employ neural networks to extract features of user interests, achieving good performance. However, due to the sparsity of recommendation system datasets, these algorithms often employ small-scale network frameworks, resulting in weaker generalization capability. Recently, a series of sequential recommendation algorithms based on large pre-trained language models have been proposed. Nonetheless, given the real-time demands of recommendation systems, the challenge remains in applying pre-trained language models for rapid recommendations in real scenarios. To address this, we propose a sequential recommendation algorithm based on a pre-trained language model and knowledge distillation. The key of the proposed algorithm is to transfer pre-trained knowledge across domains and achieve lightweight inference via knowledge distillation. The algorithm operates in two stages: in the first stage, we fine-tune the pre-trained language model on the recommendation dataset to transfer the pre-trained knowledge to the recommendation task; in the second stage, we distill the trained language model to transfer the learned knowledge to a lightweight model. Extensive experiments on multiple public recommendation datasets show that the proposed algorithm enhances recommendation accuracy and provides timely recommendation services.||
|**2024-09-21**|[Probing Context Localization of Polysemous Words in Pre-trained Language Model Sub-Layers](http://arxiv.org/abs/2409.14097)|null|In the era of high performing Large Language Models, researchers have widely acknowledged that contextual word representations are one of the key drivers in achieving top performances in downstream tasks. In this work, we investigate the degree of contextualization encoded in the fine-grained sub-layer representations of a Pre-trained Language Model (PLM) by empirical experiments using linear probes. Unlike previous work, we are particularly interested in identifying the strength of contextualization across PLM sub-layer representations (i.e. Self-Attention, Feed-Forward Activation and Output sub-layers). To identify the main contributions of sub-layers to contextualization, we first extract the sub-layer representations of polysemous words in minimally different sentence pairs, and compare how these representations change through the forward pass of the PLM network. Second, by probing on a sense identification classification task, we try to empirically localize the strength of contextualization information encoded in these sub-layer representations. With these probing experiments, we also try to gain a better understanding of the influence of context length and context richness on the degree of contextualization. Our main conclusion is cautionary: BERT demonstrates a high degree of contextualization in the top sub-layers if the word in question is in a specific position in the sentence with a shorter context window, but this does not systematically generalize across different word positions and context sizes.||
|**2024-09-20**|[Eliciting Instruction-tuned Code Language Models' Capabilities to Utilize Auxiliary Function for Code Generation](http://arxiv.org/abs/2409.13928)|null|We study the code generation behavior of instruction-tuned models built on top of code pre-trained language models when they could access an auxiliary function to implement a function. We design several ways to provide auxiliary functions to the models by adding them to the query or providing a response prefix to incorporate the ability to utilize auxiliary functions with the instruction-following capability. Our experimental results show the effectiveness of combining the base models' auxiliary function utilization ability with the instruction following ability. In particular, the performance of adopting our approaches with the open-sourced language models surpasses that of the recent powerful proprietary language models, i.e., gpt-4o.||
|**2024-09-20**|[Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis](http://arxiv.org/abs/2409.13561)|**[link](https://github.com/jun-jie-huang/lofi)**|Logs are imperative in the maintenance of online service systems, which often encompass important information for effective failure mitigation. While existing anomaly detection methodologies facilitate the identification of anomalous logs within extensive runtime data, manual investigation of log messages by engineers remains essential to comprehend faults, which is labor-intensive and error-prone. Upon examining the log-based troubleshooting practices at CloudA, we find that engineers typically prioritize two categories of log information for diagnosis. These include fault-indicating descriptions, which record abnormal system events, and fault-indicating parameters, which specify the associated entities. Motivated by this finding, we propose an approach to automatically extract such fault-indicating information from logs for fault diagnosis, named LoFI. LoFI comprises two key stages. In the first stage, LoFI performs coarse-grained filtering to collect logs related to the faults based on semantic similarity. In the second stage, LoFI leverages a pre-trained language model with a novel prompt-based tuning method to extract fine-grained information of interest from the collected logs. We evaluate LoFI on logs collected from Apache Spark and an industrial dataset from CloudA. The experimental results demonstrate that LoFI outperforms all baseline methods by a significant margin, achieving an absolute improvement of 25.8~37.9 in F1 over the best baseline method, ChatGPT. This highlights the effectiveness of LoFI in recognizing fault-indicating information. Furthermore, the successful deployment of LoFI at CloudA and user studies validate the utility of our method. The code and data are available at https://github.com/Jun-jie-Huang/LoFI.||
|**2024-09-20**|[HUT: A More Computation Efficient Fine-Tuning Method With Hadamard Updated Transformation](http://arxiv.org/abs/2409.13501)|null|Fine-tuning pre-trained language models for downstream tasks has achieved impressive results in NLP. However, fine-tuning all parameters becomes impractical due to the rapidly increasing size of model parameters. To address this, Parameter Efficient Fine-Tuning (PEFT) methods update only a subset of parameters. Most PEFT methods, such as LoRA, use incremental updates, which involve adding learned weight matrix increments to the original parameters. Although effective, these methods face limitations in capturing complex parameter dynamics and do not maintain a strong correlation between the original and updated parameters. To overcome these challenges, we propose the direct Updated Transformation (UT) paradigm, which constructs a transformation directly from the original to the updated parameters. This approach ensures that the correlation between the original and updated parameters is preserved, leveraging the semantic features learned during pre-training. Building on this paradigm, we present the Hadamard Updated Transformation (HUT) method. HUT efficiently updates the original weight matrix using the Hadamard transformation with two low-rank matrices, offering a more expressive and flexible update mechanism. This allows HUT to capture richer parameter features through functional transformations, reducing computational complexity while maintaining or improving model quality. Theoretical analysis and extensive experiments on RoBERTa and GPT-2 validate the effectiveness of HUT. Results show that HUT performs on par with or better than other PEFT methods in terms of model quality, while significantly reducing computational complexity.||
|**2024-09-19**|[Exploring Large Language Models for Product Attribute Value Identification](http://arxiv.org/abs/2409.12695)|null|Product attribute value identification (PAVI) involves automatically identifying attributes and their values from product information, enabling features like product search, recommendation, and comparison. Existing methods primarily rely on fine-tuning pre-trained language models, such as BART and T5, which require extensive task-specific training data and struggle to generalize to new attributes. This paper explores large language models (LLMs), such as LLaMA and Mistral, as data-efficient and robust alternatives for PAVI. We propose various strategies: comparing one-step and two-step prompt-based approaches in zero-shot settings and utilizing parametric and non-parametric knowledge through in-context learning examples. We also introduce a dense demonstration retriever based on a pre-trained T5 model and perform instruction fine-tuning to explicitly train LLMs on task-specific instructions. Extensive experiments on two product benchmarks show that our two-step approach significantly improves performance in zero-shot settings, and instruction fine-tuning further boosts performance when using training data, demonstrating the practical benefits of using LLMs for PAVI.||
|**2024-09-16**|[Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models](http://arxiv.org/abs/2409.10695)|null|We introduce Playground v3 (PGv3), our latest text-to-image model that achieves state-of-the-art (SoTA) performance across multiple testing benchmarks, excels in graphic design abilities and introduces new capabilities. Unlike traditional text-to-image generative models that rely on pre-trained language models like T5 or CLIP text encoders, our approach fully integrates Large Language Models (LLMs) with a novel structure that leverages text conditions exclusively from a decoder-only LLM. Additionally, to enhance image captioning quality-we developed an in-house captioner, capable of generating captions with varying levels of detail, enriching the diversity of text structures. We also introduce a new benchmark CapsBench to evaluate detailed image captioning performance. Experimental results demonstrate that PGv3 excels in text prompt adherence, complex reasoning, and accurate text rendering. User preference studies indicate the super-human graphic design ability of our model for common design applications, such as stickers, posters, and logo designs. Furthermore, PGv3 introduces new capabilities, including precise RGB color control and robust multilingual understanding.||
|**2024-09-14**|[Protecting Copyright of Medical Pre-trained Language Models: Training-Free Backdoor Watermarking](http://arxiv.org/abs/2409.10570)|null|Pre-training language models followed by fine-tuning on specific tasks is standard in NLP, but traditional models often underperform when applied to the medical domain, leading to the development of specialized medical pre-trained language models (Med-PLMs). These models are valuable assets but are vulnerable to misuse and theft, requiring copyright protection. However, no existing watermarking methods are tailored for Med-PLMs, and adapting general PLMs watermarking techniques to the medical domain faces challenges such as task incompatibility, loss of fidelity, and inefficiency. To address these issues, we propose the first training-free backdoor watermarking method for Med-PLMs. Our method uses rare special symbols as trigger words, which do not impact downstream task performance, embedding watermarks by replacing their original embeddings with those of specific medical terms in the Med-PLMs' word embeddings layer. After fine-tuning the watermarked Med-PLMs on various medical downstream tasks, the final models (FMs) respond to the trigger words in the same way they would to the corresponding medical terms. This property can be utilized to extract the watermark. Experiments demonstrate that our method achieves high fidelity while effectively extracting watermarks across various medical downstream tasks. Additionally, our method demonstrates robustness against various attacks and significantly enhances the efficiency of watermark embedding, reducing the embedding time from 10 hours to 10 seconds.||
|**2024-09-14**|[Synthetic4Health: Generating Annotated Synthetic Clinical Letters](http://arxiv.org/abs/2409.09501)|**[link](https://github.com/hecta-uom/synthetic4health)**|Since clinical letters contain sensitive information, clinical-related datasets can not be widely applied in model training, medical research, and teaching. This work aims to generate reliable, various, and de-identified synthetic clinical letters. To achieve this goal, we explored different pre-trained language models (PLMs) for masking and generating text. After that, we worked on Bio\_ClinicalBERT, a high-performing model, and experimented with different masking strategies. Both qualitative and quantitative methods were used for evaluation. Additionally, a downstream task, Named Entity Recognition (NER), was also implemented to assess the usability of these synthetic letters. The results indicate that 1) encoder-only models outperform encoder-decoder models. 2) Among encoder-only models, those trained on general corpora perform comparably to those trained on clinical data when clinical information is preserved. 3) Additionally, preserving clinical entities and document structure better aligns with our objectives than simply fine-tuning the model. 4) Furthermore, different masking strategies can impact the quality of synthetic clinical letters. Masking stopwords has a positive impact, while masking nouns or verbs has a negative effect. 5) For evaluation, BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references. 6) Contextual information does not significantly impact the models' understanding, so the synthetic clinical letters have the potential to replace the original ones in downstream tasks.||
|**2024-09-12**|[Knowledge Tagging with Large Language Model based Multi-Agent System](http://arxiv.org/abs/2409.08406)|null|Knowledge tagging for questions is vital in modern intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations have been performed by pedagogical experts, as the task demands not only a deep semantic understanding of question stems and knowledge definitions but also a strong ability to link problem-solving logic with relevant knowledge concepts. With the advent of advanced natural language processing (NLP) algorithms, such as pre-trained language models and large language models (LLMs), pioneering studies have explored automating the knowledge tagging process using various machine learning models. In this paper, we investigate the use of a multi-agent system to address the limitations of previous algorithms, particularly in handling complex cases involving intricate knowledge definitions and strict numerical constraints. By demonstrating its superior performance on the publicly available math question knowledge tagging dataset, MathKnowCT, we highlight the significant potential of an LLM-based multi-agent system in overcoming the challenges that previous methods have encountered. Finally, through an in-depth discussion of the implications of automating knowledge tagging, we underscore the promising results of deploying LLM-based algorithms in educational contexts.||
|**2024-09-12**|[Fine-tuning Large Language Models for Entity Matching](http://arxiv.org/abs/2409.08185)|**[link](https://github.com/wbsg-uni-mannheim/tailormatch)**|Generative large language models (LLMs) are a promising alternative to pre-trained language models for entity matching due to their high zero-shot performance and their ability to generalize to unseen entities. Existing research on using LLMs for entity matching has focused on prompt engineering and in-context learning. This paper explores the potential of fine-tuning LLMs for entity matching. We analyze fine-tuning along two dimensions: 1) The representation of training examples, where we experiment with adding different types of LLM-generated explanations to the training set, and 2) the selection and generation of training examples using LLMs. In addition to the matching performance on the source dataset, we investigate how fine-tuning affects the model's ability to generalize to other in-domain datasets as well as across topical domains. Our experiments show that fine-tuning significantly improves the performance of the smaller models while the results for the larger models are mixed. Fine-tuning also improves the generalization to in-domain datasets while hurting cross-domain transfer. We show that adding structured explanations to the training set has a positive impact on the performance of three out of four LLMs, while the proposed example selection and generation methods only improve the performance of Llama 3.1 8B while decreasing the performance of GPT-4o Mini.||
|**2024-09-10**|[Exploring Italian sentence embeddings properties through multi-tasking](http://arxiv.org/abs/2409.06622)|null|We investigate to what degree existing LLMs encode abstract linguistic information in Italian in a multi-task setting. We exploit curated synthetic data on a large scale -- several Blackbird Language Matrices (BLMs) problems in Italian -- and use them to study how sentence representations built using pre-trained language models encode specific syntactic and semantic information. We use a two-level architecture to model separately a compression of the sentence embeddings into a representation that contains relevant information for a task, and a BLM task. We then investigate whether we can obtain compressed sentence representations that encode syntactic and semantic information relevant to several BLM tasks. While we expected that the sentence structure -- in terms of sequence of phrases/chunks -- and chunk properties could be shared across tasks, performance and error analysis show that the clues for the different tasks are encoded in different manners in the sentence embeddings, suggesting that abstract linguistic notions such as constituents or thematic roles do not seem to be present in the pretrained sentence embeddings.||
|**2024-09-09**|[TransformerRanker: A Tool for Efficiently Finding the Best-Suited Language Models for Downstream Classification Tasks](http://arxiv.org/abs/2409.05997)|**[link](https://github.com/flairnlp/transformer-ranker)**|Classification tasks in NLP are typically addressed by selecting a pre-trained language model (PLM) from a model hub, and fine-tuning it for the task at hand. However, given the very large number of PLMs that are currently available, a practical challenge is to determine which of them will perform best for a specific downstream task. With this paper, we introduce TransformerRanker, a lightweight library that efficiently ranks PLMs for classification tasks without the need for computationally costly fine-tuning. Our library implements current approaches for transferability estimation (LogME, H-Score, kNN), in combination with layer aggregation options, which we empirically showed to yield state-of-the-art rankings of PLMs (Garbas et al., 2024). We designed the interface to be lightweight and easy to use, allowing users to directly connect to the HuggingFace Transformers and Dataset libraries. Users need only select a downstream classification task and a list of PLMs to create a ranking of likely best-suited PLMs for their task. We make TransformerRanker available as a pip-installable open-source library https://github.com/flairNLP/transformer-ranker.||
|**2024-09-08**|[Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?](http://arxiv.org/abs/2409.05197)|**[link](https://github.com/zawedcvg/are-large-language-models-attentive-readers)**|State-of-the-art Large Language Models (LLMs) are accredited with an increasing number of different capabilities, ranging from reading comprehension, over advanced mathematical and reasoning skills, to possessing scientific knowledge. In this paper we focus on their multi-hop reasoning capability: the ability to identify and integrate information from multiple textual sources. Given the concerns with the presence of simplifying cues in existing multi-hop reasoning benchmarks, which allow models to circumvent the reasoning requirement, we set out to investigate whether LLMs are prone to exploiting such simplifying cues. We find evidence that they indeed circumvent the requirement to perform multi-hop reasoning, but they do so in more subtle ways than what was reported about their fine-tuned pre-trained language model (PLM) predecessors. Motivated by this finding, we propose a challenging multi-hop reasoning benchmark, by generating seemingly plausible multi-hop reasoning chains, which ultimately lead to incorrect answers. We evaluate multiple open and proprietary state-of-the-art LLMs, and find that their ability to perform multi-hop reasoning is affected, as indicated by up to 45% relative decrease in F1 score when presented with such seemingly plausible alternatives. We conduct a deeper analysis and find evidence that while LLMs tend to ignore misleading lexical cues, misleading reasoning paths indeed present a significant challenge.||
|**2024-08-21**|[CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction](http://arxiv.org/abs/2409.03773)|null|Accurately measuring protein-RNA binding affinity is crucial in many biological processes and in drug design. Previous computational methods for protein-RNA binding affinity prediction rely on either sequence or structure features and cannot comprehensively capture the binding mechanisms. The recently emerged pre-trained language models, trained on massive unsupervised protein and RNA sequences, have shown strong representation ability for various in-domain downstream tasks, including binding site prediction. However, applying language models from different domains collaboratively for complex-level tasks remains unexplored. In this paper, we propose CoPRA, which bridges pre-trained language models from different biological domains via complex structure for protein-RNA binding affinity prediction. We demonstrate for the first time that cross-biological-modal language models can collaborate to improve binding affinity prediction. We propose a Co-Former to combine the cross-modal sequence and structure information, along with a bi-scope pre-training strategy to improve Co-Former's interaction understanding. Meanwhile, we construct the largest protein-RNA binding affinity dataset, PRA310, for performance evaluation. We also test our model's mutation effect prediction ability on a public dataset. CoPRA achieves state-of-the-art performance on all the datasets. We provide extensive analyses and verify that CoPRA can (1) accurately predict protein-RNA binding affinity; (2) understand binding affinity changes caused by mutations; and (3) benefit from scaling up data and model size.||
|**2024-09-03**|[LUK: Empowering Log Understanding with Expert Knowledge from Large Language Models](http://arxiv.org/abs/2409.01909)|**[link](https://github.com/LeaperOvO/LUK)**|Logs play a critical role in providing essential information for system monitoring and troubleshooting. Recently, with the success of pre-trained language models (PLMs) and large language models (LLMs) in natural language processing (NLP), smaller PLMs (such as BERT) and LLMs (like ChatGPT) have become the current mainstream approaches for log analysis. While LLMs possess rich knowledge, their high computational costs and unstable performance make LLMs impractical for analyzing logs directly. In contrast, smaller PLMs can be fine-tuned for specific tasks even with limited computational resources, making them more practical. However, these smaller PLMs face challenges in understanding logs comprehensively due to their limited expert knowledge. To better utilize the knowledge embedded within LLMs for log understanding, this paper introduces a novel knowledge enhancement framework, called LUK, which acquires expert knowledge from LLMs to empower log understanding on a smaller PLM. Specifically, we design a multi-expert collaboration framework based on LLMs consisting of different roles to acquire expert knowledge. In addition, we propose two novel pre-training tasks to enhance the log pre-training with expert knowledge. LUK achieves state-of-the-art results on different log analysis tasks and extensive experiments demonstrate expert knowledge from LLMs can be utilized more effectively to understand logs.||
|**2024-09-04**|[MARS: Matching Attribute-aware Representations for Text-based Sequential Recommendation](http://arxiv.org/abs/2409.00702)|**[link](https://github.com/junieberry/mars)**|Sequential recommendation aims to predict the next item a user is likely to prefer based on their sequential interaction history. Recently, text-based sequential recommendation has emerged as a promising paradigm that uses pre-trained language models to exploit textual item features to enhance performance and facilitate knowledge transfer to unseen datasets. However, existing text-based recommender models still struggle with two key challenges: (i) representing users and items with multiple attributes, and (ii) matching items with complex user interests. To address these challenges, we propose a novel model, Matching Attribute-aware Representations for Text-based Sequential Recommendation (MARS). MARS extracts detailed user and item representations through attribute-aware text encoding, capturing diverse user intents with multiple attribute-aware representations. It then computes user-item scores via attribute-wise interaction matching, effectively capturing attribute-level user preferences. Our extensive experiments demonstrate that MARS significantly outperforms existing sequential models, achieving improvements of up to 24.43% and 29.26% in Recall@10 and NDCG@10 across five benchmark datasets. Code is available at https://github.com/junieberry/MARS||
|**2024-08-31**|[From Prediction to Application: Language Model-based Code Knowledge Tracing with Domain Adaptive Pre-Training and Automatic Feedback System with Pedagogical Prompting for Comprehensive Programming Education](http://arxiv.org/abs/2409.00323)|null|Knowledge Tracing (KT) is a critical component in online learning, but traditional approaches face limitations in interpretability and cross-domain adaptability. This paper introduces Language Model-based Code Knowledge Tracing (CodeLKT), an innovative application of Language model-based Knowledge Tracing (LKT) to programming education. CodeLKT leverages pre-trained language models to process learning data, demonstrating superior performance over existing KT and Code KT models. We explore Domain Adaptive Pre-Training (DAPT) and Task Adaptive Pre-Training (TAPT), showing enhanced performance in the coding domain and investigating cross-domain transfer between mathematics and coding. Additionally, we present a theoretically-informed integrated system combining CodeLKT with large language models to generate personalized, in-depth feedback to support students' programming learning. This work advances the field of Code Knowledge Tracing by expanding the knowledge base with a language model-based approach and offering practical implications for programming education through data-informed feedback.||
|**2024-08-30**|[Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage](http://arxiv.org/abs/2408.17354)|null|Fine-tuning large language models on private data for downstream applications poses significant privacy risks in potentially exposing sensitive information. Several popular community platforms now offer convenient distribution of a large variety of pre-trained models, allowing anyone to publish without rigorous verification. This scenario creates a privacy threat, as pre-trained models can be intentionally crafted to compromise the privacy of fine-tuning datasets. In this study, we introduce a novel poisoning technique that uses model-unlearning as an attack tool. This approach manipulates a pre-trained language model to increase the leakage of private data during the fine-tuning process. Our method enhances both membership inference and data extraction attacks while preserving model utility. Experimental results across different models, datasets, and fine-tuning setups demonstrate that our attacks significantly surpass baseline performance. This work serves as a cautionary note for users who download pre-trained models from unverified sources, highlighting the potential risks involved.||
|**2024-08-24**|[Empowering Pre-Trained Language Models for Spatio-Temporal Forecasting via Decoupling Enhanced Discrete Reprogramming](http://arxiv.org/abs/2408.14505)|null|Spatio-temporal time series forecasting plays a critical role in various real-world applications, such as transportation optimization, energy management, and climate analysis. The recent advancements in Pre-trained Language Models (PLMs) have inspired efforts to reprogram these models for time series forecasting tasks, by leveraging their superior reasoning and generalization capabilities. However, existing approaches fall short in handling complex spatial inter-series dependencies and intrinsic intra-series frequency components, limiting their spatio-temporal forecasting performance. Moreover, the linear mapping of continuous time series to a compressed subset vocabulary in reprogramming constrains the spatio-temporal semantic expressivity of PLMs and may lead to potential information bottleneck. To overcome the above limitations, we propose RePST, a tailored PLM reprogramming framework for spatio-temporal forecasting. The key insight of RePST is to decouple the spatio-temporal dynamics in the frequency domain, allowing better alignment with the PLM text space. Specifically, we first decouple spatio-temporal data in Fourier space and devise a structural diffusion operator to obtain temporal intrinsic and spatial diffusion signals, making the dynamics more comprehensible and predictable for PLMs. To avoid information bottleneck from a limited vocabulary, we further propose a discrete reprogramming strategy that selects relevant discrete textual information from an expanded vocabulary space in a differentiable manner. Extensive experiments on four real-world datasets show that our proposed approach significantly outperforms state-of-the-art spatio-temporal forecasting models, particularly in data-scarce scenarios.||
|**2024-08-23**|[SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks](http://arxiv.org/abs/2408.13040)|null|Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM's inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneer research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experiment results show that the prompting method can achieve competitive performance compared to the strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, with the advanced speech LMs coming into the stage, the proposed prompting framework attains great potential.||
|**2024-08-23**|[Investigating LLM Applications in E-Commerce](http://arxiv.org/abs/2408.12779)|null|The emergence of Large Language Models (LLMs) has revolutionized natural language processing in various applications, especially in e-commerce. One crucial step before the application of such LLMs in these fields is to understand and compare their performance in different use cases in such tasks. This paper explored the efficacy of LLMs in the e-commerce domain, focusing on instruction-tuning an open source LLM model with public e-commerce datasets of varying sizes and comparing the performance with the conventional models prevalent in industrial applications. We conducted a comprehensive comparison between LLMs and traditional pre-trained language models across specific tasks intrinsic to the e-commerce domain, namely classification, generation, summarization, and named entity recognition (NER). Furthermore, we examined the effectiveness of the current niche industrial application of very large LLMs, using in-context learning, in e-commerce specific tasks. Our findings indicate that few-shot inference with very large LLMs often does not outperform fine-tuning smaller pre-trained models, underscoring the importance of task-specific model optimization. Additionally, we investigated different training methodologies such as single-task training, mixed-task training, and LoRA merging both within domain/tasks and between different tasks. Through rigorous experimentation and analysis, this paper offers valuable insights into the potential effectiveness of LLMs to advance natural language processing capabilities within the e-commerce industry.||
|**2024-08-22**|[AutoTest: Evolutionary Code Solution Selection with Test Cases](http://arxiv.org/abs/2408.12125)|null|With the development of code generation techniques, selecting the correct code solution from multiple candidates has become a crucial task. This study proposes a novel technique named AutoTest, which combines automatic test case generation with code solution execution and uses an evolutionary genetic algorithm to optimize the selection process. First, AutoTest employs large pre-trained language models such as codegen-16B, code-davinci-002, and incoder-6B to produce code solutions and their corresponding test cases. Then, a consensus set is formed by executing the code solutions and evaluating their performance on the test cases. Based on the selection, mutation, and crossover mechanisms of the evolutionary genetic algorithm, fine-grained ranking is achieved by tuning the alpha and beta parameters. Finally, the best code solution is selected. AutoTest demonstrates a notable performance improvement on the HumanEval benchmark, which contains 164 programming problems: AutoTest improves the pass@1 score by approximately 10% over the baseline method.||
|**2024-08-24**|[SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding](http://arxiv.org/abs/2408.11319)|null|In the era of large language models (LLMs), the tasks of "System I" - the fast, unconscious, and intuitive tasks, e.g., sentiment analysis, text classification, etc. - have been argued to be successfully solved. However, sarcasm, as a subtle linguistic phenomenon, often employs rhetorical devices like hyperbole and figuration to convey true sentiments and intentions, involving a higher level of abstraction than sentiment analysis. There is growing concern that the argument about LLMs' success may not be fully tenable when considering sarcasm understanding. To address this question, we select eleven SOTA LLMs and eight SOTA pre-trained language models (PLMs) and present comprehensive evaluations on six widely used benchmark datasets through different prompting approaches, i.e., zero-shot input/output (IO) prompting, few-shot IO prompting, and chain of thought (CoT) prompting. Our results highlight three key findings: (1) current LLMs underperform supervised PLM-based sarcasm detection baselines across six sarcasm benchmarks. This suggests that significant efforts are still required to improve LLMs' understanding of human sarcasm. (2) GPT-4 consistently and significantly outperforms other LLMs across various prompting methods, with an average improvement of 14.0%. Claude 3 and ChatGPT demonstrate the next best performance after GPT-4. (3) The few-shot IO prompting method outperforms the other two methods: zero-shot IO and few-shot CoT. The reason is that sarcasm detection, being a holistic, intuitive, and non-rational cognitive process, is argued not to adhere to step-by-step logical reasoning, making CoT less effective in understanding sarcasm compared to its effectiveness in mathematical reasoning tasks.||
|**2024-08-20**|[Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution](http://arxiv.org/abs/2408.10548)|**[link](https://github.com/lanxiang1017/language-modeling-on-tabular-data-survey)**|Tabular data, a prevalent data type across various domains, presents unique challenges due to its heterogeneous nature and complex structural relationships. Achieving high predictive performance and robustness in tabular data analysis holds significant promise for numerous applications. Influenced by recent advancements in natural language processing, particularly transformer architectures, new methods for tabular data modeling have emerged. Early techniques concentrated on pre-training transformers from scratch, often encountering scalability issues. Subsequently, methods leveraging pre-trained language models like BERT have been developed, which require less data and yield enhanced performance. The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning. Despite the growing interest, a comprehensive survey of language modeling techniques for tabular data remains absent. This paper fills this gap by providing a systematic review of the development of language modeling for tabular data, encompassing: (1) a categorization of different tabular data structures and data types; (2) a review of key datasets used in model training and tasks used for evaluation; (3) a summary of modeling techniques including widely-adopted data processing methods, popular architectures, and training objectives; (4) the evolution from adapting traditional Pre-training/Pre-trained language models to the utilization of large language models; (5) an identification of persistent challenges and potential future research directions in language modeling for tabular data analysis. GitHub page associated with this survey is available at: https://github.com/lanxiang1017/Language-Modeling-on-Tabular-Data-Survey.git.||
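Several entries above concern parameter-efficient fine-tuning (PEFT): LoRA adds a learned low-rank increment to the frozen weights, while HUT (2024-09-20) instead transforms the weights directly via a Hadamard (elementwise) product involving two low-rank matrices. A toy NumPy sketch of the two update shapes follows; the `1.0 + U @ V` multiplicative factor is an illustrative assumption, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 8, 8, 2                 # weight shape and low rank
W = rng.standard_normal((d_out, d_in))   # frozen pre-trained weight

# LoRA-style additive update: W' = W + A @ B
A = rng.standard_normal((d_out, r)) * 0.01
B = rng.standard_normal((r, d_in)) * 0.01
W_lora = W + A @ B

# HUT-style update (as sketched from the abstract): transform W directly,
# here via a Hadamard (elementwise) product with a low-rank factor.
U = rng.standard_normal((d_out, r)) * 0.01
V = rng.standard_normal((r, d_in)) * 0.01
W_hut = W * (1.0 + U @ V)                # multiplicative low-rank transformation

# Both variants train r*(d_out + d_in) parameters instead of d_out*d_in.
print(A.size + B.size, "trainable params vs", W.size, "full")
```

In both cases the trainable parameter count scales with the rank `r` rather than the full weight size, which is what makes these updates practical for large models.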

(back to top)

## Transformer

|Publish Date|Title|Code|Abstract|
|---|---|---|--------------------------------------------------|
|**2024-11-05**|[DiT4Edit: Diffusion Transformer for Image Editing](http://arxiv.org/abs/2411.03286)|null|Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate superior capabilities to effectively capture the long-range dependencies among patches, leading to higher-quality image generation. In this paper, we propose DiT4Edit, the first Diffusion Transformer-based image editing framework. Specifically, DiT4Edit uses the DPM-Solver inversion algorithm to obtain the inverted latents, reducing the number of steps compared to the DDIM inversion algorithm commonly used in UNet-based frameworks. Additionally, we design unified attention control and patches merging, tailored for transformer computation streams. This integration allows our framework to generate higher-quality edited images faster. Our design leverages the advantages of DiT, enabling it to surpass UNet structures in image editing, especially in high-resolution and arbitrary-size images. Extensive experiments demonstrate the strong performance of DiT4Edit across various editing scenarios, highlighting the potential of Diffusion Transformers in supporting image editing.|
|**2024-11-05**|[Rethinking Decoders for Transformer-based Semantic Segmentation: Compression is All You Need](http://arxiv.org/abs/2411.03033)|**[link](https://github.com/qishuaiwen/depict)**|State-of-the-art methods for Transformer-based semantic segmentation typically adopt Transformer decoders that are used to extract additional embeddings from image embeddings via cross-attention, refine either or both types of embeddings via self-attention, and project image embeddings onto the additional embeddings via dot-product. Despite their remarkable success, these empirical designs still lack theoretical justifications or interpretations, thus hindering potentially principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between the Transformer decoders and Principal Component Analysis (PCA). From such a perspective, we derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT), with the interpretations as follows: 1) the self-attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most information; 2) the cross-attention operator seeks to find a low-rank approximation of the refined image embeddings, which is expected to be a set of orthonormal bases of the principal subspace and corresponds to the predefined classes; 3) the dot-product operation yields a compact representation for image embeddings as segmentation masks. Experiments conducted on the ADE20K dataset find that DEPICT consistently outperforms its black-box counterpart, Segmenter, and it is lightweight and more robust.|
|**2024-11-05**|[Transformer-Based Fault-Tolerant Control for Fixed-Wing UAVs Using Knowledge Distillation and In-Context Adaptation](http://arxiv.org/abs/2411.02975)|null|This study presents a transformer-based approach for fault-tolerant control in fixed-wing Unmanned Aerial Vehicles (UAVs), designed to adapt in real time to dynamic changes caused by structural damage or actuator failures. Unlike traditional Flight Control Systems (FCSs) that rely on classical control theory and struggle under severe alterations in dynamics, our method directly maps outer-loop reference values -- altitude, heading, and airspeed -- into control commands using the in-context learning and attention mechanisms of transformers, thus bypassing inner-loop controllers and fault-detection layers. Employing a teacher-student knowledge distillation framework, the proposed approach trains a student agent with partial observations by transferring knowledge from a privileged expert agent with full observability, enabling robust performance across diverse failure scenarios. Experimental results demonstrate that our transformer-based controller outperforms industry-standard FCS and state-of-the-art reinforcement learning (RL) methods, maintaining high tracking accuracy and stability in nominal conditions and extreme failure cases, highlighting its potential for enhancing UAV operational safety and reliability.|
|**2024-11-04**|[Adaptive Caching for Faster Video Generation with Diffusion Transformers](http://arxiv.org/abs/2411.02397)|null|Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": meaning, some videos require fewer denoising steps to attain a reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing the generation quality, across multiple video DiT baselines.|
|**2024-11-04**|[Training-free Regional Prompting for Diffusion Transformers](http://arxiv.org/abs/2411.02395)|**[link](https://github.com/antonioo-c/regional-prompting-flux)**|Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which equips DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.|
|**2024-11-04**|[Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning](http://arxiv.org/abs/2411.02199)|null|Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities. Existing empirical studies have revealed a strong connection between these LLMs' impressive emergence abilities and their in-context learning (ICL) capacity, allowing them to solve new tasks using only task-specific prompts without further fine-tuning. On the other hand, existing empirical and theoretical studies also show that there is a linear regularity of the multi-concept encoded semantic representation behind transformer-based LLMs. However, existing theoretical work fails to build up an understanding of the connection between this regularity and the innovative power of ICL. Additionally, prior work often focuses on simplified, unrealistic scenarios involving linear transformers or unrealistic loss functions, achieving only linear or sub-linear convergence rates. In contrast, this work provides a fine-grained mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities, offering insights into how transformers innovate solutions for certain unseen tasks encoded with multiple cross-concept semantics. Inspired by empirical studies on the linear latent geometry of LLMs, the analysis is based on a concept-based low-noise sparse coding prompt model. Leveraging advanced techniques, this work showcases the exponential 0-1 loss convergence over the highly non-convex training dynamics, which pioneeringly incorporates the challenges of softmax self-attention, ReLU-activated MLPs, and cross-entropy loss. Empirical simulations corroborate the theoretical findings.|
|**2024-11-04**|[Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention](http://arxiv.org/abs/2411.02063)|**[link](https://github.com/tsinghuac3i/lpa)**|Improving the effectiveness and efficiency of large language models (LLMs) simultaneously is a critical yet challenging research goal. In this paper, we find that low-rank pre-training, normally considered an efficient method that compromises performance, can be scalably effective when the reduced parameters are precisely targeted. Specifically, applying the low-dimensional module only to the attention layer resolves this issue and enhances both effectiveness and efficiency. We refer to this structure as Low-dimensional Projected Attention (LPA) and provide an explanatory analysis. Through extensive experimentation at parameter scales of 130M, 370M, and scaling up to 3B, we have validated the effectiveness and scalability of LPA. Our results show that the LPA model can save up to 12.4% in time while achieving an approximate 5% improvement in test perplexity (ppl) and on downstream tasks compared with the vanilla Transformer.|
|**2024-11-04**|[UnSegMedGAT: Unsupervised Medical Image Segmentation using Graph Attention Networks Clustering](http://arxiv.org/abs/2411.01966)|**[link](https://github.com/mudit-adityaja/unsegmedgat)**|The data-intensive nature of supervised classification drives researchers' interest towards unsupervised approaches, especially for problems such as medical image segmentation, where labeled data is scarce. Building on the recent advancements of Vision Transformers (ViTs) in computer vision, we propose an unsupervised segmentation framework using a pre-trained DINO-ViT. In the proposed method, we leverage the inherent graph structure within the image to realize a significant performance gain for segmentation in medical images. For this, we introduce a modularity-based loss function coupled with a Graph Attention Network (GAT) to effectively capture the inherent graph topology within the image. Our method achieves state-of-the-art performance, even significantly surpassing or matching that of existing (semi-)supervised techniques such as MedSAM, a Segment Anything Model for medical images. We demonstrate this using two challenging medical image datasets, ISIC-2018 and CVC-ColonDB. This work underscores the potential of unsupervised approaches in advancing medical image analysis in scenarios where labeled data is scarce. The GitHub repository of the code is available at https://github.com/mudit-adityaja/UnSegMedGAT.|
|**2024-11-04**|[ElasTST: Towards Robust Varied-Horizon Forecasting with Elastic Time-Series Transformer](http://arxiv.org/abs/2411.01842)|**[link](https://github.com/microsoft/probts)**|Numerous industrial sectors necessitate models capable of providing robust forecasts across various horizons. Despite the recent strides in crafting specific architectures for time-series forecasting and developing pre-trained universal models, a comprehensive examination of their capability in accommodating varied-horizon forecasting during inference is still lacking. This paper bridges this gap through the design and evaluation of the Elastic Time-Series Transformer (ElasTST). The ElasTST model incorporates a non-autoregressive design with placeholders and structured self-attention masks, warranting future outputs that are invariant to adjustments in inference horizons. A tunable version of rotary position embedding is also integrated into ElasTST to capture time-series-specific periods and enhance adaptability to different horizons. Additionally, ElasTST employs a multi-scale patch design, effectively integrating both fine-grained and coarse-grained information. During the training phase, ElasTST uses a horizon reweighting strategy that approximates the effect of random sampling across multiple horizons with a single fixed horizon setting. Through comprehensive experiments and comparisons with state-of-the-art time-series architectures and contemporary foundation models, we demonstrate the efficacy of ElasTST's unique design elements. Our findings position ElasTST as a robust solution for the practical necessity of varied-horizon forecasting.|
|**2024-11-05**|[MSTA3D: Multi-scale Twin-attention for 3D Instance Segmentation](http://arxiv.org/abs/2411.01781)|null|Recently, transformer-based techniques incorporating superpoints have become prevalent in 3D instance segmentation. However, they often encounter an over-segmentation problem, especially noticeable with large objects. Additionally, unreliable mask predictions stemming from superpoint mask prediction further compound this issue. To address these challenges, we propose a novel framework called MSTA3D. It leverages multi-scale feature representation and introduces a twin-attention mechanism to effectively capture them. Furthermore, MSTA3D integrates a box query with a box regularizer, offering a complementary spatial constraint alongside semantic queries. Experimental evaluations on ScanNetV2, ScanNet200 and S3DIS datasets demonstrate that our approach surpasses state-of-the-art 3D instance segmentation methods.|
|**2024-10-31**|[Length-Induced Embedding Collapse in Transformer-based Models](http://arxiv.org/abs/2410.24200)|null|Text embeddings enable various applications, but their performance deteriorates on longer texts. In this paper, we find that the performance degradation is due to a phenomenon called Length Collapse, where longer text embeddings collapse into a narrow space. This collapse results in a distributional inconsistency between embeddings of different text lengths, ultimately hurting the performance of downstream tasks. Theoretically, by considering that the self-attention mechanism inherently functions as a low-pass filter, we prove that long sequences increase the attenuation rate of the low-pass filter effect of the self-attention mechanism. With layers going deeper, excessive low-pass filtering causes the token signals to retain only their Direct-Current (DC) component, which means the input token feature maps will collapse into a narrow space, especially in long texts. Based on the above analysis, we propose to mitigate the undesirable length collapse limitation by introducing a temperature into softmax(), which achieves a higher low-pass filter attenuation rate. The tuning-free method, called TempScale, can be plugged into multiple transformer-based embedding models. Empirically, we demonstrate that TempScale can improve existing embedding models, especially on long text inputs, bringing up to 0.53% performance gains on 40 datasets from the Massive Text Embedding Benchmark (MTEB) and 0.82% performance gains on 4 datasets from LongEmbed, which specifically focuses on long-context retrieval.|
|**2024-10-31**|[Ada-MSHyper: Adaptive Multi-Scale Hypergraph Transformer for Time Series Forecasting](http://arxiv.org/abs/2410.23992)|null|Although transformer-based methods have achieved great success in multi-scale temporal pattern interaction modeling, two key challenges limit their further development: (1) Individual time points contain less semantic information, and leveraging attention to model pair-wise interactions may cause an information utilization bottleneck. (2) Multiple inherent temporal variations (e.g., rising, falling, and fluctuating) are entangled in temporal patterns. To this end, we propose the Adaptive Multi-Scale Hypergraph Transformer (Ada-MSHyper) for time series forecasting. Specifically, an adaptive hypergraph learning module is designed to provide foundations for modeling group-wise interactions, then a multi-scale interaction module is introduced to promote more comprehensive pattern interactions at different scales. In addition, a node and hyperedge constraint mechanism is introduced to cluster nodes with similar semantic information and differentiate the temporal variations within each scale. Extensive experiments on 11 real-world datasets demonstrate that Ada-MSHyper achieves state-of-the-art performance, reducing prediction errors by an average of 4.56%, 10.38%, and 4.97% in MSE for long-range, short-range, and ultra-long-range time series forecasting, respectively. Code is available at https://github.com/shangzongjiang/Ada-MSHyper.|
|**2024-10-31**|[Weight decay induces low-rank attention layers](http://arxiv.org/abs/2410.23819)|null|The effect of regularizers such as weight decay when training deep neural networks is not well understood. We study the influence of weight decay as well as $L2$-regularization when training neural network models in which parameter matrices interact multiplicatively. This combination is of particular interest as this parametrization is common in attention layers, the workhorse of transformers. Here, key-query, as well as value-projection parameter matrices, are multiplied directly with each other: $W_K^TW_Q$ and $PW_V$. We extend previous results and show on one hand that any local minimum of a $L2$-regularized loss of the form $L(AB^\top) + \lambda (\|A\|^2 + \|B\|^2)$ coincides with a minimum of the nuclear norm-regularized loss $L(AB^\top) + \lambda\|AB^\top\|_*$, and on the other hand that the 2 losses become identical exponentially quickly during training. We thus complement existing works linking $L2$-regularization with low-rank regularization, and in particular, explain why such regularization on the matrix product affects early stages of training. Based on these theoretical insights, we verify empirically that the key-query and value-projection matrix products $W_K^TW_Q, PW_V$ within attention layers, when optimized with weight decay, as usually done in vision tasks and language modelling, indeed induce a significant reduction in the rank of $W_K^TW_Q$ and $PW_V$ , even in fully online training. We find that, in accordance with existing work, inducing low rank in attention matrix products can damage language model performance, and observe advantages when decoupling weight decay in attention layers from the rest of the parameters.|
|**2024-11-01**|[Human Action Recognition (HAR) Using Skeleton-based Spatial Temporal Relative Transformer Network: ST-RTR](http://arxiv.org/abs/2410.23806)|null|Human Action Recognition (HAR) is an interesting research area in human-computer interaction used to monitor the activities of elderly and disabled individuals affected by physical and mental health conditions. In the recent era, skeleton-based HAR has received much attention because skeleton data has shown that it can handle changes in striking, body size, camera views, and complex backgrounds. One key characteristic of ST-GCN is automatically learning spatial and temporal patterns from skeleton sequences. It has some limitations, as this method only works for short-range correlation due to its limited receptive field. Consequently, understanding human action requires long-range interconnection. To address this issue, we developed a spatial-temporal relative transformer ST-RTR model. The ST-RTR includes joint and relay nodes, which allow efficient communication and data transmission within the network. These nodes help to break the inherent spatial and temporal skeleton topologies, which enables the model to understand long-range human action better. Furthermore, we combine ST-RTR with a fusion model for further performance improvements. To assess the performance of the ST-RTR method, we conducted experiments on three skeleton-based HAR benchmarks: NTU RGB+D 60, NTU RGB+D 120, and UAV-Human. It boosted CS and CV by 2.11% and 1.45% on NTU RGB+D 60, and by 1.25% and 1.05% on NTU RGB+D 120. On the UAV-Human dataset, accuracy improved by 2.54%. The experimental outcomes show that the proposed ST-RTR model significantly improves action recognition compared with the standard ST-GCN method.|
|**2024-10-31**|[EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching](http://arxiv.org/abs/2410.23788)|**[link](https://github.com/xinwangchen/edt)**|Transformer-based Diffusion Probabilistic Models (DPMs) have shown more potential than CNN-based DPMs, yet their extensive computational requirements hinder widespread practical applications. To reduce the computation budget of transformer-based DPMs, this work proposes the Efficient Diffusion Transformer (EDT) framework. The framework includes a lightweight-design diffusion model architecture, and a training-free Attention Modulation Matrix and its alternation arrangement in EDT inspired by human-like sketching. Additionally, we propose a token relation-enhanced masking training strategy tailored explicitly for EDT to augment its token relation learning capability. Our extensive experiments demonstrate the efficacy of EDT. The EDT framework reduces training and inference costs and surpasses existing transformer-based diffusion models in image synthesis performance, thereby achieving a significant overall enhancement. With lower FID, EDT-S, EDT-B, and EDT-XL attained speed-ups of 3.93x, 2.84x, and 1.92x respectively in the training phase, and 2.29x, 2.29x, and 2.22x respectively in inference, compared to the corresponding sizes of MDTv2. The source code is released at https://github.com/xinwangChen/EDT.|
|**2024-11-01**|[In-Context LoRA for Diffusion Transformers](http://arxiv.org/abs/2410.23775)|**[link](https://github.com/ali-vilab/In-Context-LoRA)**|Recent research arXiv:2410.15027 has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., $20\sim 100$ samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at https://github.com/ali-vilab/In-Context-LoRA|
|**2024-10-31**|[MLLA-UNet: Mamba-like Linear Attention in an Efficient U-Shape Model for Medical Image Segmentation](http://arxiv.org/abs/2410.23738)|null|Recent advancements in medical imaging have resulted in more complex and diverse images, with challenges such as high anatomical variability, blurred tissue boundaries, low organ contrast, and noise. Traditional segmentation methods struggle to address these challenges, making deep learning approaches, particularly U-shaped architectures, increasingly prominent. However, the quadratic complexity of standard self-attention makes Transformers computationally prohibitive for high-resolution images. To address these challenges, we propose MLLA-UNet (Mamba-Like Linear Attention UNet), a novel architecture that achieves linear computational complexity while maintaining high segmentation accuracy through its innovative combination of linear attention and Mamba-inspired adaptive mechanisms, complemented by an efficient symmetric sampling structure for enhanced feature processing. Our architecture effectively preserves essential spatial features while capturing long-range dependencies at reduced computational complexity. Additionally, we introduce a novel sampling strategy for multi-scale feature fusion. Experiments demonstrate that MLLA-UNet achieves state-of-the-art performance on six challenging datasets with 24 different segmentation tasks, including but not limited to FLARE22, AMOS CT, and ACDC, with an average DSC of 88.32%. These results underscore the superiority of MLLA-UNet over existing methods. Our contributions include the novel 2D segmentation architecture and its empirical validation. The code is available via https://github.com/csyfjiang/MLLA-UNet.|
|**2024-10-31**|[Context-Aware Token Selection and Packing for Enhanced Vision Transformer](http://arxiv.org/abs/2410.23608)|null|In recent years, the long-range attention mechanism of vision transformers has driven significant performance breakthroughs across various computer vision tasks. However, the traditional self-attention mechanism, which processes both informative and non-informative tokens, suffers from inefficiency and inaccuracies. While sparse attention mechanisms have been introduced to mitigate these issues by pruning tokens involved in attention, they often lack context-awareness and intelligence. These mechanisms frequently apply a uniform token selection strategy across different inputs for batch training or optimize efficiency only for the inference stage. To overcome these challenges, we propose a novel algorithm: Select and Pack Attention (SPA). SPA dynamically selects informative tokens using a low-cost gating layer supervised by selection labels and packs these tokens into new batches, enabling a variable number of tokens to be used in parallelized GPU batch training and inference. Extensive experiments across diverse datasets and computer vision tasks demonstrate that SPA delivers superior performance and efficiency, including a 0.6 mAP improvement in object detection and a 16.4% reduction in computational costs.|
|**2024-10-30**|[A Neural Transformer Framework for Simultaneous Tasks of Segmentation, Classification, and Caller Identification of Marmoset Vocalization](http://arxiv.org/abs/2410.23279)|null|The marmoset, a highly vocal primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanisms. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous work with a CNN achieved a joint model for call segmentation, classification, and caller identification for marmoset vocalizations. However, the CNN has limitations in modeling long-range acoustic patterns; the Transformer architecture, which has been shown to outperform CNNs, utilizes the self-attention mechanism that efficiently segregates information in parallel over long distances and captures the global structure of marmoset vocalization. We propose using the Transformer to jointly segment and classify the marmoset calls and identify the callers for each vocalization.|
|**2024-10-30**|[DiaMond: Dementia Diagnosis with Multi-Modal Vision Transformers Using MRI and PET](http://arxiv.org/abs/2410.23219)|**[link](https://github.com/ai-med/diamond)**|Diagnosing dementia, particularly for Alzheimer's Disease (AD) and frontotemporal dementia (FTD), is complex due to overlapping symptoms. While magnetic resonance imaging (MRI) and positron emission tomography (PET) data are critical for the diagnosis, integrating these modalities in deep learning faces challenges, often resulting in suboptimal performance compared to using single modalities. Moreover, the potential of multi-modal approaches in differential diagnosis, which holds significant clinical importance, remains largely unexplored. We propose a novel framework, DiaMond, to address these issues with vision Transformers to effectively integrate MRI and PET. DiaMond is equipped with self-attention and a novel bi-attention mechanism that synergistically combine MRI and PET, alongside a multi-modal normalization to reduce redundant dependency, thereby boosting the performance. DiaMond significantly outperforms existing multi-modal methods across various datasets, achieving a balanced accuracy of 92.4% in AD diagnosis, 65.2% for AD-MCI-CN classification, and 76.5% in differential diagnosis of AD and FTD. We also validated the robustness of DiaMond in a comprehensive ablation study. The code is available at https://github.com/ai-med/DiaMond.|
|**2024-10-29**|[Abrupt Learning in Transformers: A Case Study on Matrix Completion](http://arxiv.org/abs/2410.22244)|null|Recent analysis on the training dynamics of Transformers has unveiled an interesting characteristic: the training loss plateaus for a significant number of training steps, and then suddenly (and sharply) drops to near--optimal values. To understand this phenomenon in depth, we formulate the low-rank matrix completion problem as a masked language modeling (MLM) task, and show that it is possible to train a BERT model to solve this task to low error. Furthermore, the loss curve shows a plateau early in training followed by a sudden drop to near-optimal values, despite no changes in the training procedure or hyper-parameters. To gain interpretability insights into this sudden drop, we examine the model's predictions, attention heads, and hidden states before and after this transition. Concretely, we observe that (a) the model transitions from simply copying the masked input to accurately predicting the masked entries; (b) the attention heads transition to interpretable patterns relevant to the task; and (c) the embeddings and hidden states encode information relevant to the problem. We also analyze the training dynamics of individual model components to understand the sudden drop in loss.||
|**2024-10-29**|[MAPUNetR: A Hybrid Vision Transformer and U-Net Architecture for Efficient and Interpretable Medical Image Segmentation](http://arxiv.org/abs/2410.22223)|null|Medical image segmentation is pivotal in healthcare, enhancing diagnostic accuracy, informing treatment strategies, and tracking disease progression. This process allows clinicians to extract critical information from visual data, enabling personalized patient care. However, developing neural networks for segmentation remains challenging, especially when preserving image resolution, which is essential in detecting subtle details that influence diagnoses. Moreover, the lack of transparency in these deep learning models has slowed their adoption in clinical practice. Efforts in model interpretability are increasingly focused on making these models' decision-making processes more transparent. In this paper, we introduce MAPUNetR, a novel architecture that synergizes the strengths of transformer models with the proven U-Net framework for medical image segmentation. Our model addresses the resolution preservation challenge and incorporates attention maps highlighting segmented regions, increasing accuracy and interpretability. Evaluated on the BraTS 2020 dataset, MAPUNetR achieved a dice score of 0.88 and a dice coefficient of 0.92 on the ISIC 2018 dataset. Our experiments show that the model maintains stable performance and potential as a powerful tool for medical image segmentation in clinical practice.||
|**2024-10-29**|[Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech](http://arxiv.org/abs/2410.22179)|null|Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.||
|**2024-10-29**|[PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement](http://arxiv.org/abs/2410.22059)|null|Scene rearrangement, like table tidying, is a challenging task in robotic manipulation due to the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspectives of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop a representation that integrates generation, segmentation, and feature encoding into a single step to produce object-level representations. Additionally, we introduce perspective control, thus enabling the matching of 6-DoF camera views and extending past approaches that were limited to 3-DoF top-down views. The efficacy of our method is demonstrated through its zero-shot performance in real robot experiments across various scenes, achieving an average matching accuracy and execution success rate of 87% and 67%, respectively.||
|**2024-10-29**|[FakeFormer: Efficient Vulnerability-Driven Transformers for Generalisable Deepfake Detection](http://arxiv.org/abs/2410.21964)|null|Recently, Vision Transformers (ViTs) have achieved unprecedented effectiveness in the general domain of image classification. Nonetheless, these models remain underexplored in the field of deepfake detection, given their lower performance as compared to Convolution Neural Networks (CNNs) in that specific context. In this paper, we start by investigating why plain ViT architectures exhibit a suboptimal performance when dealing with the detection of facial forgeries. Our analysis reveals that, as compared to CNNs, ViT struggles to model localized forgery artifacts that typically characterize deepfakes. Based on this observation, we propose a deepfake detection framework called FakeFormer, which extends ViTs to enforce the extraction of subtle inconsistency-prone information. For that purpose, an explicit attention learning guided by artifact-vulnerable patches and tailored to ViTs is introduced. Extensive experiments are conducted on diverse well-known datasets, including FF++, Celeb-DF, WildDeepfake, DFD, DFDCP, and DFDC. The results show that FakeFormer outperforms the state-of-the-art in terms of generalization and computational cost, without the need for large-scale training datasets. The code is available at https://github.com/10Ring/FakeFormer.||
|**2024-10-29**|[Spatio-temporal Transformers for Action Unit Classification with Event Cameras](http://arxiv.org/abs/2410.21958)|null|Face analysis has been studied from different angles to infer emotion, poses, shapes, and landmarks. Traditionally RGB cameras are used, yet for fine-grained tasks standard sensors might not be up to the task due to their latency, making it impossible to record and detect micro-movements that carry a highly informative signal, which is necessary for inferring the true emotions of a subject. Event cameras have been increasingly gaining interest as a possible solution to this and similar high-frame rate tasks. We propose a novel spatiotemporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered one of the main causes of an existing gap between the maturity of RGB and neuromorphic vision models. Gathering data is harder in the event domain since it cannot be crawled from the web and labeling frames should take into account event aggregation rates and the fact that static parts might not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of RGB videos and event streams. The dataset is annotated at a video level with facial Action Units and contains streams collected with various possible applications, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision bridging the domain gap by representing face shapes in a 3D space. Our proposed model outperforms baseline methods by effectively capturing spatial and temporal information, crucial for recognizing subtle facial micro-expressions.||
|**2024-10-28**|[On Inductive Biases That Enable Generalization of Diffusion Transformers](http://arxiv.org/abs/2410.21273)|**[link](https://github.com/dit-generalization/dit-generalization.github.io)**|Recent work studying the generalization of diffusion models with UNet-based denoisers reveals inductive biases that can be expressed via geometry-adaptive harmonic bases. However, in practice, more recent denoising networks are often based on transformers, e.g., the diffusion transformer (DiT). This raises the question: do transformer-based denoising networks exhibit inductive biases that can also be expressed via geometry-adaptive harmonic bases? To our surprise, we find that this is not the case. This discrepancy motivates our search for the inductive bias that can lead to good generalization in DiT models. Investigating the pivotal attention modules of a DiT, we find that the locality of attention maps is closely associated with generalization. To verify this finding, we modify the generalization of a DiT by restricting its attention windows. We inject local attention windows into a DiT and observe an improvement in generalization. Furthermore, we empirically find that both the placement and the effective attention size of these local attention windows are crucial factors. Experimental results on the CelebA, ImageNet, and LSUN datasets show that strengthening the inductive bias of a DiT can improve both generalization and generation quality when less training data is available. Source code will be released publicly upon paper publication. Project page: dit-generalization.github.io/.||
|**2024-10-29**|[Enhancing Learned Image Compression via Cross Window-based Attention](http://arxiv.org/abs/2410.21144)|null|In recent years, learned image compression methods have demonstrated superior rate-distortion performance compared to traditional image compression methods. Recent methods utilize convolutional neural networks (CNN), variational autoencoders (VAE), invertible neural networks (INN), and transformers. Despite their significant contributions, a main drawback of these models is their poor performance in capturing local redundancy. Therefore, to leverage global features along with local redundancy, we propose a CNN-based solution integrated with a feature encoding module. The feature encoding module encodes important features before feeding them to the CNN and then utilizes cross-scale window-based attention, which further captures local redundancy. Cross-scale window-based attention is inspired by the attention mechanism in transformers and effectively enlarges the receptive field. Both the feature encoding module and the cross-scale window-based attention module in our architecture are flexible and can be incorporated into any other network architecture. We evaluate our method on the Kodak and CLIC datasets and demonstrate that our approach is effective and on par with state-of-the-art methods.||
|**2024-10-28**|[LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition](http://arxiv.org/abs/2410.21108)|null|Group Activity Recognition (GAR) remains challenging in computer vision due to the complex nature of multi-agent interactions. This paper introduces LiGAR, a LIDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition. LiGAR leverages LiDAR data as a structural backbone to guide the processing of visual and textual information, enabling robust handling of occlusions and complex spatial arrangements. Our framework incorporates a Multi-Scale LIDAR Transformer, Cross-Modal Guided Attention, and an Adaptive Fusion Module to integrate multi-modal data at different semantic levels effectively. LiGAR's hierarchical architecture captures group activities at various granularities, from individual actions to scene-level dynamics. Extensive experiments on the JRDB-PAR, Volleyball, and NBA datasets demonstrate LiGAR's superior performance, achieving state-of-the-art results with improvements of up to 10.6% in F1-score on JRDB-PAR and 5.9% in Mean Per Class Accuracy on the NBA dataset. Notably, LiGAR maintains high performance even when LiDAR data is unavailable during inference, showcasing its adaptability. Our ablation studies highlight the significant contributions of each component and the effectiveness of our multi-modal, multi-scale approach in advancing the field of group activity recognition.||
|**2024-10-28**|[Pay Attention to Attention for Sequential Recommendation](http://arxiv.org/abs/2410.21048)|null|Transformer-based approaches have demonstrated remarkable success in various sequence-based tasks. However, traditional self-attention models may not sufficiently capture the intricate dependencies within items in sequential recommendation scenarios. This is due to the lack of explicit emphasis on attention weights, which play a critical role in allocating attention and understanding item-to-item correlations. To better exploit the potential of attention weights and improve the capability of sequential recommendation in learning high-order dependencies, we propose a novel sequential recommendation (SR) approach called attention weight refinement (AWRSR). AWRSR enhances the effectiveness of self-attention by additionally paying attention to attention weights, allowing for more refined attention distributions of correlations among items. We conduct comprehensive experiments on multiple real-world datasets, demonstrating that our approach consistently outperforms state-of-the-art SR models. Moreover, we provide a thorough analysis of AWRSR's effectiveness in capturing higher-level dependencies. These findings suggest that AWRSR offers a promising new direction for enhancing the performance of self-attention architecture in SR tasks, with potential applications in other sequence-based problems as well.||
|**2024-10-25**|[Capsule Endoscopy Multi-classification via Gated Attention and Wavelet Transformations](http://arxiv.org/abs/2410.19363)|null|Abnormalities in the gastrointestinal tract significantly influence the patient's health and require a timely diagnosis for effective treatment. With such consideration, an effective automatic classification of these abnormalities from a video capsule endoscopy (VCE) frame is crucial for improvement in diagnostic workflows. The work presents the process of developing and evaluating a novel model designed to classify gastrointestinal anomalies from a VCE video frame. Integration of the Omni Dimensional Gated Attention (OGA) mechanism and Wavelet transformation techniques into the model's architecture allowed the model to focus on the most critical areas in the endoscopy images, reducing noise and irrelevant features. This is particularly advantageous in capsule endoscopy, where images often contain a high degree of variability in texture and color. Wavelet transformations contributed by efficiently capturing spatial and frequency-domain information, improving feature extraction, especially for detecting subtle features from the VCE frames. Furthermore, the features extracted from the Stationary Wavelet Transform and Discrete Wavelet Transform are concatenated channel-wise to capture multiscale features, which are essential for detecting polyps, ulcerations, and bleeding. This approach improves classification accuracy on imbalanced capsule endoscopy datasets. The proposed model achieved training and validation accuracies of 92.76% and 91.19% respectively, with corresponding losses of 0.2057 and 0.2700. It also achieved a Balanced Accuracy of 94.81%, AUC of 87.49%, F1-score of 91.11%, precision of 91.17%, recall of 91.19% and specificity of 98.44%. Additionally, the model's performance is benchmarked against two base models, VGG16 and ResNet50, demonstrating its enhanced ability to identify and classify a range of gastrointestinal abnormalities accurately.||
|**2024-10-24**|[DCT-HistoTransformer: Efficient Lightweight Vision Transformer with DCT Integration for histopathological image analysis](http://arxiv.org/abs/2410.19166)|null|In recent years, the integration of advanced imaging techniques and deep learning methods has significantly advanced computer-aided diagnosis (CAD) systems for breast cancer detection and classification. Transformers, which have shown great promise in computer vision, are now being applied to medical image analysis. However, their application to histopathological images presents challenges due to the need for extensive manual annotations of whole-slide images (WSIs), as these models require large amounts of data to work effectively, which is costly and time-consuming. Furthermore, the quadratic computational cost of Vision Transformers (ViTs) is particularly prohibitive for large, high-resolution histopathological images, especially on edge devices with limited computational resources. In this study, we introduce a novel lightweight breast cancer classification approach using transformers that operates effectively without large datasets. By incorporating parallel processing pathways for Discrete Cosine Transform (DCT) Attention and MobileConv, we convert image data from the spatial domain to the frequency domain to utilize benefits such as filtering out high frequencies in the image, which reduces computational cost. Our proposed model achieves an accuracy of 96.00% $\pm$ 0.48% for binary classification and 87.85% $\pm$ 0.93% for multiclass classification, which is comparable to state-of-the-art models while significantly reducing computational costs. This demonstrates the potential of our approach to improve breast cancer classification in histopathological images, offering a more efficient solution with reduced reliance on extensive annotated datasets.||
|**2024-10-24**|[Attention-based Citywide Electric Vehicle Charging Demand Prediction Approach Considering Urban Region and Dynamic Influences](http://arxiv.org/abs/2410.18766)|null|Electric vehicle charging demand prediction is important for vacant charging pile recommendation and charging infrastructure planning, thus facilitating vehicle electrification and green energy development. The performance of previous spatio-temporal studies is still far from satisfactory because traditional graphs struggle to model non-pairwise spatial relationships and multivariate temporal features are not adequately taken into account. To tackle these issues, we propose an attention-based heterogeneous multivariate data fusion approach (AHMDF) for citywide electric vehicle charging demand prediction, which incorporates a geo-based clustered hypergraph and a multivariate gated Transformer to consider both static and dynamic influences. To learn non-pairwise relationships, we cluster service areas by the types and numbers of points of interest in the areas and develop attentive hypergraph networks accordingly. Graph attention mechanisms are used for information propagation between neighboring areas. Additionally, we improve the Transformer encoder with gated mechanisms so that it can selectively learn dynamic auxiliary information and temporal features. Experiments on an electric vehicle charging benchmark dataset demonstrate the effectiveness of our proposed approach compared with a broad range of competing baselines. Furthermore, we demonstrate the impact of dynamic influences on prediction results in different areas of the city and the effectiveness of our clustering method.||
|**2024-10-24**|[Rethinking Softmax: Self-Attention with Polynomial Activations](http://arxiv.org/abs/2410.18613)|null|This paper challenges the conventional belief that softmax attention in transformers is effective primarily because it generates a probability distribution for attention allocation. Instead, we theoretically show that its success lies in its ability to implicitly regularize the Frobenius norm of the attention matrix during training. We then explore alternative activations that regularize the Frobenius norm of the attention matrix, demonstrating that certain polynomial activations can achieve this effect, making them suitable for attention-based architectures. Empirical results indicate these activations perform comparably or better than softmax across various computer vision and language tasks, suggesting new possibilities for attention mechanisms beyond softmax.||
|**2024-10-24**|[Taipan: Efficient and Expressive State Space Language Models with Selective Attention](http://arxiv.org/abs/2410.18572)|null|Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.||
|**2024-10-24**|[Local and Global Graph Modeling with Edge-weighted Graph Attention Network for Handwritten Mathematical Expression Recognition](http://arxiv.org/abs/2410.18555)|null|In this paper, we present a novel approach to Handwritten Mathematical Expression Recognition (HMER) by leveraging graph-based modeling techniques. We introduce an End-to-end model with an Edge-weighted Graph Attention Mechanism (EGAT), designed to perform simultaneous node and edge classification. This model effectively integrates node and edge features, facilitating the prediction of symbol classes and their relationships within mathematical expressions. Additionally, we propose a stroke-level Graph Modeling method for both local (LGM) and global (GGM) information, which applies an end-to-end model to Online HMER tasks, transforming the recognition problem into node and edge classification tasks in graph structure. By capturing both local and global graph features, our method ensures comprehensive understanding of the expression structure. Through the combination of these components, our system demonstrates superior performance in symbol detection, relation classification, and expression-level recognition.||
|**2024-10-24**|[On Explaining with Attention Matrices](http://arxiv.org/abs/2410.18541)|**[link](https://github.com/omyokun/on-explaining-with-attention-matrices)**|This paper explores the much discussed, possible explanatory link between attention weights (AW) in transformer models and predicted output. Contrary to intuition and early research on attention, more recent prior research has provided formal arguments and empirical evidence that AW are not explanatorily relevant. We show that the formal arguments are incorrect. We introduce and effectively compute efficient attention, which isolates the effective components of attention matrices in tasks and models in which AW play an explanatory role. We show that efficient attention has a causal role (provides minimally necessary and sufficient conditions) for predicting model output in NLP tasks requiring contextual information, and we show, contrary to [7], that efficient attention matrices are probability distributions and are effectively calculable. Thus, they should play an important part in the explanation of attention based model behavior. We offer empirical experiments in support of our method illustrating various properties of efficient attention with various metrics on four datasets.||
|**2024-10-23**|[SFB-net for cardiac segmentation: Bridging the semantic gap with attention](http://arxiv.org/abs/2410.18503)|null|In the past few years, deep learning algorithms have been widely used for cardiac image segmentation. However, most of these architectures rely on convolutions that hardly model long-range dependencies, limiting their ability to extract contextual information. In order to tackle this issue, this article introduces the Swin Filtering Block network (SFB-net) which takes advantage of both conventional and swin transformer layers. The former are used to introduce spatial attention at the bottom of the network, while the latter are applied to focus on high level semantically rich features between the encoder and decoder. An average Dice score of 92.4 was achieved on the ACDC dataset. To the best of our knowledge, this result outperforms any other work on this dataset. The average Dice score of 87.99 obtained on the M&Ms dataset demonstrates that the proposed method generalizes well to data from different vendors and centres.||
|**2024-10-23**|[Value Residual Learning For Alleviating Attention Concentration In Transformers](http://arxiv.org/abs/2410.17897)|**[link](https://github.com/Zcchill/Value-Residual-Learning)**|Transformers can capture long-range dependencies using self-attention, allowing tokens to attend to all others directly. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers. However, this approach is computationally expensive. To address this problem, we propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers. Based on this method, one variant is the Transformer with single layer value (SVFormer), where all layers share the same value embedding from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention concentration problem in deeper layers and enhances representation across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as downstream tasks. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods like GQA and CLA, with performance influenced by sequence length and cumulative learning rate.||
|**2024-10-23**|[Anomaly Resilient Temporal QoS Prediction using Hypergraph Convoluted Transformer Network](http://arxiv.org/abs/2410.17762)|null|Quality-of-Service (QoS) prediction is a critical task in the service lifecycle, enabling precise and adaptive service recommendations by anticipating performance variations over time in response to evolving network uncertainties and user preferences. However, contemporary QoS prediction methods frequently encounter data sparsity and cold-start issues, which hinder accurate QoS predictions and limit the ability to capture diverse user preferences. Additionally, these methods often assume QoS data reliability, neglecting potential credibility issues such as outliers and the presence of greysheep users and services with atypical invocation patterns. Furthermore, traditional approaches fail to leverage diverse features, including domain-specific knowledge and complex higher-order patterns, essential for accurate QoS predictions. In this paper, we introduce a real-time, trust-aware framework for temporal QoS prediction to address the aforementioned challenges, featuring an end-to-end deep architecture called the Hypergraph Convoluted Transformer Network (HCTN). HCTN combines a hypergraph structure with graph convolution over hyper-edges to effectively address high-sparsity issues by capturing complex, high-order correlations. Complementing this, the transformer network utilizes multi-head attention along with parallel 1D convolutional layers and fully connected dense blocks to capture both fine-grained and coarse-grained dynamic patterns. Additionally, our approach includes a sparsity-resilient solution for detecting greysheep users and services, incorporating their unique characteristics to improve prediction accuracy. Trained with a robust loss function resistant to outliers, HCTN demonstrated state-of-the-art performance on the large-scale WSDREAM-2 datasets for response time and throughput.||
|**2024-10-23**|[PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers in a resource-limited Context](http://arxiv.org/abs/2410.17661)|null|Following their success in natural language processing (NLP), there has been a shift towards transformer models in computer vision. While transformers perform well and offer promising multi-tasking performance, due to their high compute requirements, many resource-constrained applications still rely on convolutional or hybrid models that combine the benefits of convolution and attention layers and achieve the best results in the sub 100M parameter range. Simultaneously, task adaptation techniques that allow for the use of one shared transformer backbone for multiple downstream tasks, resulting in great storage savings at negligible cost in performance, have not yet been adopted for hybrid transformers. In this work, we investigate how to achieve the best task-adaptation performance and introduce PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers. We further combine PETAH adaptation with pruning to achieve highly performant and storage friendly models for multi-tasking. In our extensive evaluation on classification and other vision tasks, we demonstrate that our PETAH-adapted hybrid models outperform established task-adaptation techniques for ViTs while requiring fewer parameters and being more efficient on mobile hardware.||
|**2024-10-23**|[Surgical Scene Segmentation by Transformer With Asymmetric Feature Enhancement](http://arxiv.org/abs/2410.17642)|**[link](https://github.com/cyuan-sjtu/vit-asym)**|Surgical scene segmentation is a fundamental task for robotic-assisted laparoscopic surgery understanding. It often contains various anatomical structures and surgical instruments, where similar local textures and fine-grained structures make the segmentation a difficult task. Vision-specific transformer method is a promising way for surgical scene understanding. However, there are still two main challenges. Firstly, the absence of inner-patch information fusion leads to poor segmentation performance. Secondly, the specific characteristics of anatomy and instruments are not specifically modeled. To tackle the above challenges, we propose a novel Transformer-based framework with an Asymmetric Feature Enhancement module (TAFE), which enhances local information and then actively fuses the improved feature pyramid into the embeddings from transformer encoders by a multi-scale interaction attention strategy. The proposed method outperforms the SOTA methods in several different surgical segmentation tasks and additionally proves its ability of fine-grained structure recognition. Code is available at https://github.com/cyuan-sjtu/ViT-asym.||
|**2024-10-22**|[From Attention to Activation: Unravelling the Enigmas of Large Language Models](http://arxiv.org/abs/2410.17174)|null|We study two strange phenomena in auto-regressive Transformers: (1) the dominance of the first token in attention heads; (2) the occurrence of large outlier activations in the hidden states. We find that popular large language models, such as Llama, attend maximally to the first token in 98% of attention heads, a behaviour we attribute to the softmax function. To mitigate this issue, we propose a reformulation of softmax to softmax-1. Furthermore, we identify adaptive optimisers, e.g. Adam, as the primary contributor to the large outlier activations and introduce OrthoAdam, a novel optimiser that utilises orthogonal matrices to transform gradients, to address this issue. Finally, not only do our methods prevent these phenomena from occurring, but additionally, they enable Transformers to sustain their performance when quantised using basic algorithms, something that standard methods are unable to do. In summary, our methods reduce the attention proportion on the first token from 65% to 3.3%, the activation kurtosis in the hidden states from 1657 to 3.1, and the perplexity penalty under 4-bit weight quantisation from 3565 to 0.3.||
|**2024-10-22**|[A Comparison of Baseline Models and a Transformer Network for SOC Prediction in Lithium-Ion Batteries](http://arxiv.org/abs/2410.17049)|null|Accurately predicting the state of charge of Lithium-ion batteries is essential to the performance of battery management systems of electric vehicles. One of the main reasons for the slow global adoption of electric cars is driving range anxiety. The ability of a battery management system to accurately estimate the state of charge can help alleviate this problem. In this paper, a comparison between data-driven state-of-charge estimation methods is conducted. The paper compares different neural network-based models and common regression models for SOC estimation. These models include several ablated transformer networks, a neural network, a lasso regression model, a linear regression model and a decision tree. Results of various experiments conducted on data obtained from natural driving cycles of the BMW i3 battery show that the decision tree outperformed all other models including the more complex transformer network with self-attention and positional encoding.||
|**2024-10-20**|[Advancing Gasoline Consumption Forecasting: A Novel Hybrid Model Integrating Transformers, LSTM, and CNN](http://arxiv.org/abs/2410.16336)|null|Iran, endowed with abundant hydrocarbon resources, plays a crucial role in the global energy landscape. Gasoline, as a critical fuel, significantly supports the nation's transportation sector. Accurate forecasting of gasoline consumption is essential for strategic resource management and environmental planning. This research introduces a novel approach to predicting monthly gasoline consumption using a hybrid Transformer-LSTM-CNN model, which integrates the strengths of Transformer networks, Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNN). This advanced architecture offers a superior alternative to conventional methods such as artificial neural networks and regression models by capturing both short- and long-term dependencies in time series data. By leveraging the self-attention mechanism of Transformers, the temporal memory of LSTMs, and the local pattern detection of CNNs, our hybrid model delivers improved prediction accuracy. Implemented using Python, the model provides precise future gasoline consumption forecasts and evaluates the environmental impact through the analysis of greenhouse gas emissions. This study examines gasoline consumption trends from 2007 to 2021, which rose from 64.5 million liters per day in 2007 to 99.80 million liters per day in 2021. Our proposed model forecasts consumption levels up to 2031, offering a valuable tool for policymakers and energy analysts. The results highlight the superiority of this hybrid model in improving the accuracy of gasoline consumption forecasts, reinforcing the need for advanced machine learning techniques to optimize resource management and mitigate environmental risks in the energy sector.||
|**2024-10-21**|[MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report](http://arxiv.org/abs/2410.16239)|**[link](https://github.com/svthapa/more)**|In this paper, we introduce a novel Multi-Modal Contrastive Pre-training Framework that synergistically combines X-rays, electrocardiograms (ECGs), and radiology/cardiology reports. Our approach leverages transformers to encode these diverse modalities into a unified representation space, aiming to enhance diagnostic accuracy and facilitate comprehensive patient assessments. We utilize LoRA-Peft to significantly reduce trainable parameters in the LLM and incorporate a recent linear attention dropping strategy in the Vision Transformer (ViT) for smoother attention. Furthermore, we provide novel multimodal attention explanations and retrieval for our model. To the best of our knowledge, we are the first to propose an integrated model that combines X-ray, ECG, and Radiology/Cardiology Report with this approach. By utilizing contrastive loss, MoRE effectively aligns modality-specific features into a coherent embedding, which supports various downstream tasks such as zero-shot classification and multimodal retrieval. Employing our proposed methodology, we achieve state-of-the-art (SOTA) on the Mimic-IV, CheXpert, Edema Severity, and PtbXl downstream datasets, surpassing existing multimodal approaches. Our proposed framework shows significant improvements in capturing intricate inter-modal relationships and robustness in medical diagnosis, establishing a framework for future research in multimodal learning in the healthcare sector.||
|**2024-10-21**|[An Explainable Contrastive-based Dilated Convolutional Network with Transformer for Pediatric Pneumonia Detection](http://arxiv.org/abs/2410.16143)|null|Pediatric pneumonia remains a significant global threat, posing a larger mortality risk than any other communicable disease. According to UNICEF, it is a leading cause of mortality in children under five and requires prompt diagnosis. Early diagnosis using chest radiographs is the prevalent standard, but limitations include low radiation levels in unprocessed images and data imbalance issues. This necessitates the development of efficient, computer-aided diagnosis techniques. To this end, we propose a novel EXplainable Contrastive-based Dilated Convolutional Network with Transformer (XCCNet) for pediatric pneumonia detection. XCCNet harnesses the spatial power of dilated convolutions and the global insights from contrastive-based transformers for effective feature refinement. A robust chest X-ray processing module tackles low-intensity radiographs, while adversarial-based data augmentation mitigates the skewed distribution of chest X-rays in the dataset. Furthermore, we actively integrate an explainability approach through feature visualization, directly aligning it with the attention region that pinpoints the presence of pneumonia or normality in radiographs. The efficacy of XCCNet is comprehensively assessed on four publicly available datasets. Extensive performance evaluation demonstrates the superiority of XCCNet compared to state-of-the-art methods.||
|**2024-10-21**|[START: A Generalized State Space Model with Saliency-Driven Token-Aware Transformation](http://arxiv.org/abs/2410.16020)|null|Domain Generalization (DG) aims to enable models to generalize to unseen target domains by learning from multiple source domains. Existing DG methods primarily rely on convolutional neural networks (CNNs), which inherently learn texture biases due to their limited receptive fields, making them prone to overfitting source domains. While some works have introduced transformer-based methods (ViTs) for DG to leverage the global receptive field, these methods incur high computational costs due to the quadratic complexity of self-attention. Recently, advanced state space models (SSMs), represented by Mamba, have shown promising results in supervised learning tasks by achieving linear complexity in sequence length during training and fast RNN-like computation during inference. Inspired by this, we investigate the generalization ability of the Mamba model under domain shifts and find that input-dependent matrices within SSMs could accumulate and amplify domain-specific features, thus hindering model generalization. To address this issue, we propose a novel SSM-based architecture with saliency-based token-aware transformation (namely START), which achieves state-of-the-art (SOTA) performances and offers a competitive alternative to CNNs and ViTs. Our START can selectively perturb and suppress domain-specific features in salient tokens within the input-dependent matrices of SSMs, thus effectively reducing the discrepancy between different domains. Extensive experiments on five benchmarks demonstrate that START outperforms existing SOTA DG methods with efficient linear complexity. Our code is available at https://github.com/lingeringlight/START.||
|**2024-10-21**|[All You Need is an Improving Column: Enhancing Column Generation for Parallel Machine Scheduling via Transformers](http://arxiv.org/abs/2410.15601)|null|We present a neural network-enhanced column generation (CG) approach for a parallel machine scheduling problem. The proposed approach utilizes an encoder-decoder attention model, namely the transformer and pointer architectures, to develop job sequences with negative reduced cost and thus generate columns to add to the master problem. By training the neural network offline and using it in inference mode to predict negative reduced costs columns, we achieve significant computational time savings compared to dynamic programming (DP). Since the exact DP procedure is used to verify that no further columns with negative reduced cost can be identified at termination, the optimality guarantee of the original CG procedure is preserved. For small to medium-sized instances, our approach achieves an average 45% reduction in computation time compared to solving the subproblems with DP. Furthermore, the model generalizes not only to unseen, larger problem instances from the same probability distribution but also to instances from different probability distributions than those presented at training time. For large-sized instances, the proposed approach achieves an 80% improvement in the objective value in under 500 seconds, demonstrating both its scalability and efficiency.||
|**2024-10-21**|[Generalized Probabilistic Attention Mechanism in Transformers](http://arxiv.org/abs/2410.15578)|null|The Transformer architecture has become widely adopted due to its demonstrated success, attributed to the attention mechanism at its core. Despite these successes, the attention mechanism of Transformers is associated with two well-known issues: rank-collapse and gradient vanishing. In this paper, we present a theoretical analysis that it is inherently difficult to address both issues simultaneously in the conventional attention mechanism. To handle these issues, we introduce a novel class of attention mechanism, referred to as generalized probabilistic attention mechanism (GPAM), and its dual-attention implementation within the Transformer architecture. Unlike conventional attention mechanisms, GPAM allows for negative attention scores while preserving a fixed total sum. We provide theoretical evidence that the proposed dual-attention GPAM (daGPAM) effectively mitigates both the rank-collapse and gradient vanishing issues which are difficult to resolve simultaneously with the conventional attention mechanisms. Furthermore, we empirically validate this theoretical evidence, demonstrating the superiority of daGPAM compared to other alternative attention mechanisms that were proposed to address the same issues. Additionally, we demonstrate the practical benefits of GPAM in natural language processing tasks, such as language modeling and neural machine translation.||
|**2024-10-20**|[SEA: State-Exchange Attention for High-Fidelity Physics-Based Transformers](http://arxiv.org/abs/2410.15495)|**[link](https://github.com/parsaesmati/sea)**|Current approaches using sequential networks have shown promise in estimating field variables for dynamical systems, but they are often limited by high rollout errors. The unresolved issue of rollout error accumulation results in unreliable estimations as the network predicts further into the future, with each step's error compounding and leading to an increase in inaccuracy. Here, we introduce the State-Exchange Attention (SEA) module, a novel transformer-based module enabling information exchange between encoded fields through multi-head cross-attention. The cross-field multidirectional information exchange design enables all state variables in the system to exchange information with one another, capturing physical relationships and symmetries between fields. In addition, we incorporate a ViT-like architecture to generate spatially coherent mesh embeddings, further improving the model's ability to capture spatial dependencies in the data. This enhances the model's ability to represent complex interactions between the field variables, resulting in reduced rollout error accumulation. Our results show that the Transformer model integrated with the State-Exchange Attention (SEA) module outperforms competitive baseline models, including the PbGMR-GMUS Transformer-RealNVP and GMR-GMUS Transformer, with a reduction in error of 88% and 91%, respectively, achieving state-of-the-art performance. Furthermore, we demonstrate that the SEA module alone can reduce errors by 97% for state variables that are highly dependent on other states of the system.||
|**2024-10-19**|[EViT-Unet: U-Net Like Efficient Vision Transformer for Medical Image Segmentation on Mobile and Edge Devices](http://arxiv.org/abs/2410.15036)|null|With the rapid development of deep learning, CNN-based U-shaped networks have succeeded in medical image segmentation and are widely applied for various tasks. However, their limitations in capturing global features hinder their performance in complex segmentation tasks. The rise of Vision Transformer (ViT) has effectively compensated for this deficiency of CNNs and promoted the application of ViT-based U-networks in medical image segmentation. However, the high computational demands of ViT make it unsuitable for many medical devices and mobile platforms with limited resources, restricting its deployment on resource-constrained and edge devices. To address this, we propose EViT-UNet, an efficient ViT-based segmentation network that reduces computational complexity while maintaining accuracy, making it ideal for resource-constrained medical devices. EViT-UNet is built on a U-shaped architecture, comprising an encoder, decoder, bottleneck layer, and skip connections, combining convolutional operations with self-attention mechanisms to optimize efficiency. Experimental results demonstrate that EViT-UNet achieves high accuracy in medical image segmentation while significantly reducing computational complexity.||
|**2024-10-18**|[SignAttention: On the Interpretability of Transformer Models for Sign Language Translation](http://arxiv.org/abs/2410.14506)|null|This paper presents the first comprehensive interpretability analysis of a Transformer-based Sign Language Translation (SLT) model, focusing on the translation from video-based Greek Sign Language to glosses and text. Leveraging the Greek Sign Language Dataset, we examine the attention mechanisms within the model to understand how it processes and aligns visual input with sequential glosses. Our analysis reveals that the model pays attention to clusters of frames rather than individual ones, with a diagonal alignment pattern emerging between poses and glosses, which becomes less distinct as the number of glosses increases. We also explore the relative contributions of cross-attention and self-attention at each decoding step, finding that the model initially relies on video frames but shifts its focus to previously predicted tokens as the translation progresses. This work contributes to a deeper understanding of SLT models, paving the way for the development of more transparent and reliable translation systems essential for real-world applications.||
|**2024-10-18**|[Mixed Attention Transformer Enhanced Channel Estimation for Extremely Large-Scale MIMO Systems](http://arxiv.org/abs/2410.14439)|null|Extremely large-scale massive multiple-input multiple-output (XL-MIMO) is one of the key technologies for next-generation wireless communication systems. However, acquiring the accurate high-dimensional channel matrix of XL-MIMO remains a pressing challenge due to the intractable channel property and the high complexity. In this paper, a Mixed Attention Transformer based Channel Estimation Neural Network (MAT-CENet) is developed, which is inspired by the Transformer encoder structure as well as organically integrates the feature map attention and spatial attention mechanisms to better grasp the unique characteristics of the XL-MIMO channel. By incorporating the multi-head attention layer as the core enabler, the insightful feature importance is captured and exploited effectively. A comprehensive complexity analysis for the proposed MAT-CENet is also provided. Simulation results show that MAT-CENet outperforms the state of the art in different propagation scenarios of near-, far- and hybrid-fields.||
|**2024-10-18**|[Rethinking Transformer for Long Contextual Histopathology Whole Slide Image Analysis](http://arxiv.org/abs/2410.14195)|**[link](https://github.com/invoker-ll/long-mil)**|Histopathology Whole Slide Image (WSI) analysis serves as the gold standard for clinical cancer diagnosis in the daily routines of doctors. To develop computer-aided diagnosis models for WSIs, previous methods typically employ Multi-Instance Learning to enable slide-level prediction given only slide-level labels. Among these models, vanilla attention mechanisms without pairwise interactions have traditionally been employed but are unable to model contextual information. More recently, self-attention models have been utilized to address this issue. To alleviate the computational complexity of long sequences in large WSIs, methods like HIPT use region-slicing, and TransMIL employs approximation of full self-attention. Both approaches suffer from suboptimal performance due to the loss of key information. Moreover, their use of absolute positional embedding struggles to effectively handle long contextual dependencies in shape-varying WSIs. In this paper, we first analyze how the low-rank nature of the long-sequence attention matrix constrains the representation ability of WSI modelling. Then, we demonstrate that the rank of the attention matrix can be improved by focusing on local interactions via a local attention mask. Our analysis shows that the local mask aligns with the attention patterns in the lower layers of the Transformer. Furthermore, the local attention mask can be implemented during chunked attention calculation, reducing the quadratic computational complexity to linear with a small local bandwidth. Building on this, we propose a local-global hybrid Transformer for both computational acceleration and local-global information interactions modelling. Our method, Long-contextual MIL (LongMIL), is evaluated through extensive experiments on various WSI tasks to validate its superiority. Our code will be available at github.com/invoker-LL/Long-MIL.||
|**2024-10-18**|[Provable In-context Learning for Mixture of Linear Regressions using Transformers](http://arxiv.org/abs/2410.14183)|null|We theoretically investigate the in-context learning capabilities of transformers in the context of learning mixtures of linear regression models. For the case of two mixtures, we demonstrate the existence of transformers that can achieve an accuracy, relative to the oracle predictor, of order $\tilde{\mathcal{O}}((d/n)^{1/4})$ in the low signal-to-noise ratio (SNR) regime and $\tilde{\mathcal{O}}(\sqrt{d/n})$ in the high SNR regime, where $n$ is the length of the prompt, and $d$ is the dimension of the problem. Additionally, we derive in-context excess risk bounds of order $\mathcal{O}(L/\sqrt{B})$, where $B$ denotes the number of (training) prompts, and $L$ represents the number of attention layers. The order of $L$ depends on whether the SNR is low or high. In the high SNR regime, we extend the results to $K$-component mixture models for finite $K$. Extensive simulations also highlight the advantages of transformers for this task, outperforming other baselines such as the Expectation-Maximization algorithm.||
|**2024-10-17**|[MarineFormer: A Transformer-based Navigation Policy Model for Collision Avoidance in Marine Environment](http://arxiv.org/abs/2410.13973)|null|In this work, we investigate the problem of Unmanned Surface Vehicle (USV) navigation in a dense marine environment with a high-intensity current flow. The complexities arising from static and dynamic obstacles and the disturbance forces caused by current flow render existing navigation protocols inadequate for ensuring safety and avoiding collisions at sea. To learn a safe and efficient robot policy, we propose a novel methodology that leverages attention mechanisms to capture heterogeneous interactions of the agents with the static and moving obstacles and the flow disturbances from the environment in space and time. In particular, we refine a temporal function with MarineFormer, a Transformer navigation policy for spatially variable marine environments, trained end-to-end with reinforcement learning (RL). MarineFormer uses foundational spatio-temporal graph attention with a transformer architecture to process spatial attention and temporal sequences in an environment that simulates a 2D turbulent marine condition. We propose architectural modifications that improve the stability and learning speed of the recurrent models. The flow velocity estimation, which can be derived from flow simulations or sensors, is incorporated into a model-free RL framework to prevent the robot from entering into high-intensity current flow regions including intense vortices, while potentially leveraging the flow to assist in transportation. The investigated 2D marine environment encompasses flow singularities, including vortices, sinks, and sources, representing fundamental planar flow patterns associated with flood or maritime thunderstorms. Our proposed method is trained with a new reward model to deal with static and dynamic obstacles and disturbances from the current flow.||
|**2024-10-17**|[Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs](http://arxiv.org/abs/2410.13835)|null|Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.||
|**2024-10-17**|[Reducing the Transformer Architecture to a Minimum](http://arxiv.org/abs/2410.13732)|null|Transformers are a widespread and successful model architecture, particularly in Natural Language Processing (NLP) and Computer Vision (CV). The essential innovation of this architecture is the Attention Mechanism, which solves the problem of extracting relevant context information from long sequences in NLP and realistic scenes in CV. A classical neural network component, a Multi-Layer Perceptron (MLP), complements the attention mechanism. Its necessity is frequently justified by its capability of modeling nonlinear relationships. However, the attention mechanism itself is nonlinear through its internal use of similarity measures. A possible hypothesis is that this nonlinearity is sufficient for modeling typical application problems. As the MLPs usually contain the most trainable parameters of the whole model, their omission would substantially reduce the parameter set size. Further components can also be reorganized to reduce the number of parameters. Under some conditions, query and key matrices can be collapsed into a single matrix of the same size. The same is true about value and projection matrices, which can also be omitted without eliminating the substance of the attention mechanism. Initially, the similarity measure was defined asymmetrically, with peculiar properties such as that a token is possibly dissimilar to itself. A possible symmetric definition requires only half of the parameters. We have laid the groundwork by testing widespread CV benchmarks: MNIST and CIFAR-10. The tests have shown that simplified transformer architectures (a) without MLP, (b) with collapsed matrices, and (c) symmetric similarity matrices exhibit similar performance as the original architecture, saving up to 90% of parameters without hurting the classification performance.||
|**2024-10-17**|[DiRecNetV2: A Transformer-Enhanced Network for Aerial Disaster Recognition](http://arxiv.org/abs/2410.13663)|null|The integration of Unmanned Aerial Vehicles (UAVs) with artificial intelligence (AI) models for aerial imagery processing in disaster assessment, necessitates models that demonstrate exceptional accuracy, computational efficiency, and real-time processing capabilities. Traditionally, Convolutional Neural Networks (CNNs) demonstrate efficiency in local feature extraction but are limited in their potential for global context interpretation. On the other hand, Vision Transformers (ViTs) show promise for improved global context interpretation through the use of attention mechanisms, although they still remain underinvestigated in UAV-based disaster response applications. Bridging this research gap, we introduce DiRecNetV2, an improved hybrid model that utilizes convolutional and transformer layers. It merges the inductive biases of CNNs for robust feature extraction with the global context understanding of Transformers, maintaining a low computational load ideal for UAV applications. Additionally, we introduce a new, compact multi-label dataset of disasters, to set an initial benchmark for future research, exploring how models trained on single-label data perform in a multi-label test set. The study assesses lightweight CNNs and ViTs on the AIDERSv2 dataset, based on the frames per second (FPS) for efficiency and the weighted F1 scores for classification performance. DiRecNetV2 not only achieves a weighted F1 score of 0.964 on a single-label test set but also demonstrates adaptability, with a score of 0.614 on a complex multi-label test set, while functioning at 176.13 FPS on the Nvidia Orin Jetson device.||
|**2024-10-17**|[360U-Former: HDR Illumination Estimation with Panoramic Adapted Vision Transformers](http://arxiv.org/abs/2410.13566)|null|Recent illumination estimation methods have focused on enhancing the resolution and improving the quality and diversity of the generated textures. However, few have explored tailoring the neural network architecture to the Equirectangular Panorama (ERP) format utilised in image-based lighting. Consequently, high dynamic range image (HDRI) results usually exhibit a seam at the side borders and textures or objects that are warped at the poles. To address this shortcoming we propose a novel architecture, 360U-Former, based on a U-Net style Vision-Transformer which leverages the work of PanoSWIN, an adapted shifted window attention tailored to the ERP format. To the best of our knowledge, this is the first purely Vision-Transformer model used in the field of illumination estimation. We train 360U-Former as a GAN to generate HDRI from a limited field of view low dynamic range image (LDRI). We evaluate our method using current illumination estimation evaluation protocols and datasets, demonstrating that our approach outperforms existing and state-of-the-art methods without the artefacts typically associated with the use of the ERP format.||
|**2024-10-17**|[Precipitation Nowcasting Using Diffusion Transformer with Causal Attention](http://arxiv.org/abs/2410.13314)|null|Short-term precipitation forecasting remains challenging due to the difficulty in capturing long-term spatiotemporal dependencies. Current deep learning methods fall short in establishing effective dependencies between conditions and forecast results, while also lacking interpretability. To address this issue, we propose a Precipitation Nowcasting Using Diffusion Transformer with Causal Attention model. Our model leverages Transformer and combines causal attention mechanisms to establish spatiotemporal queries between conditional information (causes) and forecast results (effects). This design enables the model to effectively capture long-term dependencies, allowing forecast results to maintain strong causal relationships with input conditions over a wide range of time and space. We explore four variants of spatiotemporal information interactions for DTCA, demonstrating that global spatiotemporal labeling interactions yield the best performance. In addition, we introduce a Channel-To-Batch shift operation to further enhance the model's ability to represent complex rainfall dynamics. We conducted experiments on two datasets. Compared to state-of-the-art U-Net-based methods, our approach improved the CSI (Critical Success Index) for predicting heavy precipitation by approximately 15% and 8% on the two datasets, respectively, achieving state-of-the-art performance.||
|**2024-10-17**|[DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis](http://arxiv.org/abs/2410.13288)|null|This paper proposes an improved version of DurIAN-E (DurIAN-E 2), which is also a duration informed attention neural network for expressive and high-fidelity text-to-speech (TTS) synthesis. Similar to the DurIAN-E model, multiple stacked SwishRNN-based Transformer blocks are utilized as linguistic encoders, and Style-Adaptive Instance Normalization (SAIN) layers are also incorporated into frame-level encoders to improve the modeling ability of expressiveness in the proposed DurIAN-E 2. Meanwhile, motivated by other TTS models using generative models such as VITS, the proposed DurIAN-E 2 utilizes variational autoencoders (VAEs) augmented with normalizing flows and a BigVGAN waveform generator with an adversarial training strategy, which further improve the synthesized speech quality and expressiveness. Both objective test and subjective evaluation results prove that the proposed expressive TTS model DurIAN-E 2 can achieve better performance than several state-of-the-art approaches besides DurIAN-E.||
|**2024-10-17**|[An Evolved Universal Transformer Memory](http://arxiv.org/abs/2410.13166)|**[link](https://github.com/sakanaai/evo-memory)**|Prior methods propose to offset the escalating costs of modern foundation models by dropping specific parts of their contexts with hand-designed rules, while attempting to preserve their original performance. We overcome this trade-off with Neural Attention Memory Models (NAMMs), introducing a learned network for memory management that improves both the performance and efficiency of transformers. We evolve NAMMs atop pre-trained transformers to provide different latent contexts focusing on the most relevant information for individual layers and attention heads. NAMMs are universally applicable to any model using self-attention as they condition exclusively on the values in the produced attention matrices. Learning NAMMs on a small set of problems, we achieve substantial performance improvements across multiple long-context benchmarks while cutting the model's input contexts up to a fraction of the original sizes. We show the generality of our conditioning enables zero-shot transfer of NAMMs trained only on language to entirely new transformer architectures even across input modalities, with their benefits carrying over to vision and reinforcement learning.||
|**2024-10-16**|[SWIM: An Attention-Only Model for Speech Quality Assessment Under Subjective Variance](http://arxiv.org/abs/2410.12675)|null|Speech quality is best evaluated by human feedback using mean opinion scores (MOS). However, variance in ratings between listeners can introduce noise in the true quality label of an utterance. Currently, deep learning networks including convolutional, recurrent, and attention-based architectures have been explored for quality estimation. This paper proposes an exclusively attention-based model involving a Swin Transformer for MOS estimation (SWIM). Our network captures local and global dependencies that reflect the acoustic properties of an utterance. To counteract subjective variance in MOS labels, we propose a normal distance-based objective that accounts for standard deviation in each label, and we employ a multistage self-teaching strategy to further improve generalization. Our model is significantly more compact than existing attention-based networks for quality estimation. Finally, our experiments on the Samsung Open Mean Opinion Score (SOMOS) dataset show improvement over existing baseline models when trained from scratch.||
|**2024-10-16**|[ExoTST: Exogenous-Aware Temporal Sequence Transformer for Time Series Prediction](http://arxiv.org/abs/2410.12184)|null|Accurate long-term predictions are the foundations for many machine learning applications and decision-making processes. Traditional time series approaches for prediction often focus on either autoregressive modeling, which relies solely on past observations of the target "endogenous variables", or forward modeling, which considers only current covariate drivers "exogenous variables". However, effectively integrating past endogenous and past exogenous with current exogenous variables remains a significant challenge. In this paper, we propose ExoTST, a novel transformer-based framework that effectively incorporates current exogenous variables alongside past context for improved time series prediction. To integrate exogenous information efficiently, ExoTST leverages the strengths of attention mechanisms and introduces a novel cross-temporal modality fusion module. This module enables the model to jointly learn from both past and current exogenous series, treating them as distinct modalities. By considering these series separately, ExoTST provides robustness and flexibility in handling data uncertainties that arise from the inherent distribution shift between historical and current exogenous variables. Extensive experiments on real-world carbon flux datasets and time series benchmarks demonstrate ExoTST's superior performance compared to state-of-the-art baselines, with improvements of up to 10% in prediction accuracy. Moreover, ExoTST exhibits strong robustness against missing values and noise in exogenous drivers, maintaining consistent performance in real-world situations where these imperfections are common.||
|**2024-10-15**|[MoH: Multi-Head Attention as Mixture-of-Head Attention](http://arxiv.org/abs/2410.11842)|**[link](https://github.com/skyworkai/moh)**|In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.||
|**2024-10-15**|[Light-Weight Fault Tolerant Attention for Large Language Model Training](http://arxiv.org/abs/2410.11720)|null|Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, the training of these models is computationally intensive and susceptible to faults, particularly in the attention mechanism, which is a critical component of transformer-based LLMs. In this paper, we investigate the impact of faults on LLM training, focusing on INF, NaN, and near-INF values in the computation results with systematic fault injection experiments. We observe the propagation patterns of these errors, which can trigger non-trainable states in the model and disrupt training, forcing the procedure to load from checkpoints. To mitigate the impact of these faults, we propose ATTNChecker, the first Algorithm-Based Fault Tolerance (ABFT) technique tailored for the attention mechanism in LLMs. ATTNChecker is designed based on fault propagation patterns of LLMs and incorporates performance optimization to adapt to both system reliability and model vulnerability while providing lightweight protection for fast LLM training. Evaluations on four LLMs show that ATTNChecker incurs on average a 7% overhead on training while detecting and correcting all extreme errors. Compared with the state-of-the-art checkpoint/restore approach, ATTNChecker reduces recovery overhead by up to 49x.||
|**2024-10-15**|[CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction](http://arxiv.org/abs/2410.11428)|null|Convolutional neural networks (CNNs) and vision transformers (ViTs) have become essential in computer vision for local and global feature extraction. However, aggregating these architectures in existing methods often results in inefficiencies. To address this, the CNN-Transformer Aggregation Network (CTA-Net) was developed. CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features. This integration enables efficient processing of detailed local and broader contextual information. CTA-Net introduces the Light Weight Multi-Scale Feature Fusion Multi-Head Self-Attention (LMF-MHSA) module for effective multi-scale feature integration with reduced parameters. Additionally, the Reverse Reconstruction CNN-Variants (RRCV) module enhances the embedding of CNNs within the transformer architecture. Extensive experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance (Top-1 accuracy 86.76%), fewer parameters (20.32M), and greater efficiency (FLOPs 2.83B), making it a highly efficient and lightweight solution for visual tasks on small-scale datasets (fewer than 100,000 samples).||
|**2024-10-15**|[Implementing Derivations of Definite Logic Programs with Self-Attention Networks](http://arxiv.org/abs/2410.11396)|null|In this paper we propose that a restricted version of logical inference can be implemented with self-attention networks. We aim to show that LLMs (Large Language Models) constructed with transformer networks can make logical inferences. We reveal the potential of LLMs by analyzing self-attention networks, which are the main components of transformer networks. Our approach is based not on the semantics of natural languages but on the operations of logical inference. We show that hierarchical constructions of self-attention networks with feed-forward networks (FFNs) can implement top-down derivations for a class of logical formulae. We also show that bottom-up derivations can be implemented for the same class. We believe that our results show that LLMs implicitly have the power of logical inference.||
|**2024-10-15**|[SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection](http://arxiv.org/abs/2410.11358)|null|Multimodal object detection leverages diverse modal information to enhance the accuracy and robustness of detectors. By learning long-term dependencies, Transformers can effectively integrate multimodal features in the feature extraction stage, which greatly improves the performance of multimodal object detection. However, current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at various depth layers of the network, thus limiting the improvements in detection performance. In this paper, we introduce an accurate and efficient object detection method named SeaDATE. Initially, we propose a novel dual attention Feature Fusion (DTF) module that, under Transformer's guidance, integrates local and global information through a dual attention mechanism, strengthening the fusion of modal features from orthogonal perspectives using spatial and channel tokens. Meanwhile, our theoretical analysis and empirical validation demonstrate that the Transformer-guided fusion method, treating images as sequences of pixels for fusion, performs better on the detail information of shallow features than on deep semantic information. To address this, we designed a contrastive learning (CL) module aimed at learning features of multimodal samples, remedying the shortcomings of Transformer-guided fusion in extracting deep semantic features, and effectively utilizing cross-modal information. Extensive experiments and ablation studies on the FLIR, LLVIP, and M3FD datasets have proven our method to be effective, achieving state-of-the-art detection performance.||
|**2024-10-15**|[Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix](http://arxiv.org/abs/2410.11261)|null|Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives, from conversational AI to search and AI assistants. However, their growing capabilities come at the cost of extremely large model sizes, making deployment on edge devices challenging due to memory and computational constraints. This paper introduces a novel approach to LLM weight pruning that directly optimizes for approximating the attention matrix, a core component of transformer architectures. Unlike existing methods that focus on linear approximations, our approach accounts for the non-linear nature of the Softmax attention mechanism. We provide theoretical guarantees for the convergence of our Gradient Descent-based optimization method to a near-optimal pruning mask solution. Our preliminary empirical results demonstrate the effectiveness of this approach in maintaining model performance while significantly reducing computational costs. This work establishes a new theoretical foundation for pruning algorithm design in LLMs, potentially paving the way for more efficient LLM inference on resource-constrained devices.||
|**2024-10-15**|[Rethinking Graph Transformer Architecture Design for Node Classification](http://arxiv.org/abs/2410.11189)|null|Graph Transformer (GT), as a special type of Graph Neural Networks (GNNs), utilizes multi-head attention to facilitate high-order message passing. However, this also imposes several limitations in node classification applications: 1) nodes are susceptible to global noise; 2) self-attention computation cannot scale well to large graphs. In this work, we conduct extensive observational experiments to explore the adaptability of the GT architecture in node classification tasks and draw several conclusions: the current multi-head self-attention module in GT can be completely replaced, while the feed-forward neural network module proves to be valuable. Based on this, we decouple the propagation (P) and transformation (T) of GNNs and explore a powerful GT architecture, named GNNFormer, which is based on the P/T combination message passing and adapted for node classification in both homophilous and heterophilous scenarios. Extensive experiments on 12 benchmark datasets demonstrate that our proposed GT architecture can effectively adapt to node classification tasks without being affected by global noise and computational efficiency limitations.||
|**2024-10-14**|[What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis](http://arxiv.org/abs/2410.10986)|null|The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning -- to the extent that Transformers are often accompanied by adaptive optimizers, layer normalization, learning rate warmup, and more, in comparison to MLPs/CNNs. The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from the other architectures -- grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first entirely derive the Transformer's Hessian and express it in matrix derivatives; (b) we then characterize it in terms of data, weight, and attention moment dependencies; and (c) while doing so further highlight the important structural differences to the Hessian of classical networks. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters. Ultimately, our findings provide a deeper understanding of the Transformer's unique optimization landscape and the challenges it poses.||
|**2024-10-14**|[Hybrid Transformer for Early Alzheimer's Detection: Integration of Handwriting-Based 2D Images and 1D Signal Features](http://arxiv.org/abs/2410.10547)|null|Alzheimer's Disease (AD) is a prevalent neurodegenerative condition where early detection is vital. Handwriting, often affected early in AD, offers a non-invasive and cost-effective way to capture subtle motor changes. State-of-the-art research on handwriting-based AD detection, mostly online, has predominantly relied on manually extracted features, fed as input to shallow machine learning models. Some recent works have proposed deep learning (DL)-based models, either 1D-CNN or 2D-CNN architectures, with performance comparing favorably to handcrafted schemes. These approaches, however, overlook the intrinsic relationship between the 2D spatial patterns of handwriting strokes and their 1D dynamic characteristics, thus limiting their capacity to capture the multimodal nature of handwriting data. Moreover, the application of Transformer models remains basically unexplored. To address these limitations, we propose a novel approach for AD detection, consisting of a learnable multimodal hybrid attention model that simultaneously integrates 2D handwriting images with 1D dynamic handwriting signals. Our model leverages a gated mechanism to combine similarity and difference attention, blending the two modalities and learning robust features by incorporating information at different scales. Our model achieved state-of-the-art performance on the DARWIN dataset, with an F1-score of 90.32% and accuracy of 90.91% in Task 8 ('L' writing), surpassing the previous best by 4.61% and 6.06%, respectively.||
|**2024-10-14**|[Domain-Conditioned Transformer for Fully Test-time Adaptation](http://arxiv.org/abs/2410.10442)|**[link](https://github.com/yushuntang/dct)**|Fully test-time adaptation aims to adapt a network model online based on sequential analysis of input samples during the inference stage. We observe that, when applying a transformer network model to a new domain, the self-attention profiles of image samples in the target domain deviate significantly from those in the source domain, which results in large performance degradation during domain changes. To address this important issue, we propose a new structure for the self-attention modules in the transformer. Specifically, we incorporate three domain-conditioning vectors, called domain conditioners, into the query, key, and value components of the self-attention module. We learn a network to generate these three domain conditioners from the class token at each transformer network layer. We find that, during fully online test-time adaptation, these domain conditioners at each transformer network layer are able to gradually remove the impact of domain shift and largely recover the original self-attention profile. Our extensive experimental results demonstrate that the proposed domain-conditioned transformer significantly improves the online fully test-time domain adaptation performance and outperforms existing state-of-the-art methods by large margins.||
|**2024-10-11**|[AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation](http://arxiv.org/abs/2410.09040)|**[link](https://github.com/ucsc-vlaa/attngcg-attack)**|This paper studies the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks, focusing specifically on the optimization-based Greedy Coordinate Gradient (GCG) strategy. We first observe a positive correlation between the effectiveness of attacks and the internal behaviors of the models. For instance, attacks tend to be less effective when models pay more attention to system prompts designed to ensure LLM safety alignment. Building on this discovery, we introduce an enhanced method that manipulates models' attention scores to facilitate LLM jailbreaking, which we term AttnGCG. Empirically, AttnGCG shows consistent improvements in attack efficacy across diverse LLMs, achieving an average increase of ~7% in the Llama-2 series and ~10% in the Gemma series. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs like GPT-3.5 and GPT-4. Moreover, we note our attention-score visualization is more interpretable, allowing us to gain better insights into how our targeted attention manipulation facilitates more effective jailbreaking. We release the code at https://github.com/UCSC-VLAA/AttnGCG-attack.||
|**2024-10-11**|[Extra Global Attention Designation Using Keyword Detection in Sparse Transformer Architectures](http://arxiv.org/abs/2410.08971)|null|In this paper, we propose an extension to Longformer Encoder-Decoder, a popular sparse transformer architecture. One common challenge with sparse transformers is that they can struggle with encoding of long range context, such as connections between topics discussed at a beginning and end of a document. A method to selectively increase global attention is proposed and demonstrated for abstractive summarization tasks on several benchmark data sets. By prefixing the transcript with additional keywords and encoding global attention on these keywords, improvement in zero-shot, few-shot, and fine-tuned cases is demonstrated for some benchmark data sets.||
|**2024-10-11**|[On-Chip Learning via Transformer In-Context Learning](http://arxiv.org/abs/2410.08711)|null|Autoregressive decoder-only transformers have become key components for scalable sequence processing and generation models. However, the transformer's self-attention mechanism requires transferring prior token projections from the main memory at each time step (token), thus severely limiting their performance on conventional processors. Self-attention can be viewed as a dynamic feed-forward layer, whose matrix is input sequence-dependent similarly to the result of local synaptic plasticity. Using this insight, we present a neuromorphic decoder-only transformer model that utilizes an on-chip plasticity processor to compute self-attention. Interestingly, the training of transformers enables them to "learn" the input context during inference. We demonstrate this in-context learning ability of transformers on the Loihi 2 processor by solving a few-shot classification problem. With this, we emphasize the importance of pretrained models, especially their ability to find simple, local, backpropagation-free learning rules enabling on-chip learning and adaptation in a hardware-friendly manner.||
|**2024-10-11**|[Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation](http://arxiv.org/abs/2410.08626)|null|Recently, symbolic music generation has become a focus of numerous deep learning research efforts. Structure, as an important part of music, contributes to improving its quality, and an increasing number of works have started to study hierarchical structure. In this study, we delve into the multi-level structures within music from macro-level and micro-level hierarchies. At the macro-level hierarchy, we apply a phrase segmentation algorithm to explore how phrases influence the overall development of music, and at the micro-level hierarchy, we design a skeleton notes extraction strategy to explore how skeleton notes within each phrase guide the melody generation. Furthermore, we propose a novel Phrase-level Cross-Attention mechanism to capture the intrinsic relationship between the macro-level and micro-level hierarchies. Moreover, in response to the current lack of research on Chinese-style music, we construct our Small Tunes Dataset: a substantial collection of MIDI files comprising 10088 Small Tunes, a category of traditional Chinese Folk Songs. This dataset serves as the focus of our study. We generate Small Tunes songs utilizing the extracted skeleton notes as conditions, and experimental results indicate that our proposed model, Small Tunes Transformer, outperforms other state-of-the-art models. Besides, we design three novel objective evaluation metrics to evaluate music from both rhythm and melody dimensions.||
|**2024-10-11**|[DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention](http://arxiv.org/abs/2410.08582)|**[link](https://github.com/maclong01/DeBiFormer)**|Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity-adaptive attention, such as in DAT, has yielded strong results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation tasks. The query-aware sparsity attention in BiFormer seeks to focus each query on top-k routed regions. However, during attention calculation, the selected key-value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which optimizes the selection of key-value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness. Code is available at https://github.com/maclong01/DeBiFormer.||
|**2024-10-10**|[Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow](http://arxiv.org/abs/2410.08243)|null|Banking Transaction Flow (BTF) is a sequential data found in a number of banking activities such as marketing, credit risk or banking fraud. It is a multimodal data composed of three modalities: a date, a numerical value and a wording. We propose in this work an application of self-attention mechanism to the processing of BTFs. We trained two general models on a large amount of BTFs in a self-supervised way: one RNN-based model and one Transformer-based model. We proposed a specific tokenization in order to be able to process BTFs. The performance of these two models was evaluated on two banking downstream tasks: a transaction categorization task and a credit risk task. The results show that fine-tuning these two pre-trained models allowed to perform better than the state-of-the-art approaches for both tasks.||
|**2024-10-10**|[Pretraining Graph Transformers with Atom-in-a-Molecule Quantum Properties for Improved ADMET Modeling](http://arxiv.org/abs/2410.08024)|**[link](https://github.com/aidd-msca/GraphQPT)**|We evaluate the impact of pretraining Graph Transformer architectures on atom-level quantum-mechanical features for the modeling of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of drug-like compounds. We compare this pretraining strategy with two others: one based on molecular quantum properties (specifically the HOMO-LUMO gap) and one using a self-supervised atom masking technique. After fine-tuning on Therapeutic Data Commons ADMET datasets, we evaluate the performance improvement in the different models, observing that models pretrained with atomic quantum-mechanical properties in general produce better results. We then analyse the latent representations and observe that the supervised strategies preserve the pretraining information after fine-tuning and that different pretrainings produce different trends in latent expressivity across layers. Furthermore, we find that models pretrained on atomic quantum-mechanical properties capture more low-frequency Laplacian eigenmodes of the input graph via the attention weights and produce better representations of atomic environments within the molecule. Application of the analysis to a much larger non-public dataset for microsomal clearance illustrates the generalizability of the studied indicators. In this case the performances of the models are in accordance with the representation analysis and highlight, especially for the case of masking pretraining and atom-level quantum property pretraining, how model types with similar performance on public benchmarks can have different performances on large-scale pharmaceutical data.||
|**2024-10-11**|[BA-Net: Bridge Attention in Deep Neural Networks](http://arxiv.org/abs/2410.07860)|null|Attention mechanisms, particularly channel attention, have become highly influential in numerous computer vision tasks. Despite their effectiveness, many existing methods primarily focus on optimizing performance through complex attention modules applied at individual convolutional layers, often overlooking the synergistic interactions that can occur across multiple layers. In response to this gap, we introduce bridge attention, a novel approach designed to facilitate more effective integration and information flow between different convolutional layers. Our work extends the original bridge attention model (BAv1) by introducing an adaptive selection operator, which reduces information redundancy and optimizes the overall information exchange. This enhancement results in the development of BAv2, which achieves substantial performance improvements in the ImageNet classification task, obtaining Top-1 accuracies of 80.49% and 81.75% when using ResNet50 and ResNet101 as backbone networks, respectively. These results surpass the retrained baselines by 1.61% and 0.77%, respectively. Furthermore, BAv2 outperforms other existing channel attention techniques, such as the classical SENet101, exceeding its retrained performance by 0.52%. Additionally, integrating BAv2 into advanced convolutional networks and vision transformers has led to significant gains in performance across a wide range of computer vision tasks, underscoring its broad applicability.||
|**2024-10-10**|[Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers](http://arxiv.org/abs/2410.07799)|null|Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. However, softmax-based attention puts transformers' trainability at risk. Even *at initialisation*, the propagation of signals and gradients through the random network can be pathological, resulting in known issues such as (i) vanishing/exploding gradients and (ii) *rank collapse*, i.e. when all tokens converge to a single representation *with depth*. This paper examines signal propagation in *attention-only* transformers from a random matrix perspective, illuminating the origin of such issues, as well as unveiling a new phenomenon: (iii) rank collapse *in width*. Modelling softmax-based attention at initialisation with Random Markov matrices, our theoretical analysis reveals that a *spectral gap* between the two largest singular values of the attention matrix causes (iii), which, in turn, exacerbates (i) and (ii). Building on this insight, we propose a novel, yet simple, practical solution to resolve rank collapse in width by removing the spectral gap. Moreover, we validate our findings and discuss the training benefits of the proposed fix through experiments that also motivate a revision of some of the default parameter scaling. Our attention model accurately describes the standard key-query attention in a single-layer transformer, making this work a significant first step towards a better understanding of the initialisation dynamics in the multi-layer case.||
|**2024-10-10**|[Benign Overfitting in Single-Head Attention](http://arxiv.org/abs/2410.07746)|null|The phenomenon of benign overfitting, where a trained neural network perfectly fits noisy training data but still achieves near-optimal test performance, has been extensively studied in recent years for linear models and fully-connected/convolutional networks. In this work, we study benign overfitting in a single-head softmax attention model, which is the fundamental building block of Transformers. We prove that under appropriate conditions, the model exhibits benign overfitting in a classification setting already after two steps of gradient descent. Moreover, we show conditions where a minimum-norm/maximum-margin interpolator exhibits benign overfitting. We study how the overfitting behavior depends on the signal-to-noise ratio (SNR) of the data distribution, namely, the ratio between norms of signal and noise tokens, and prove that a sufficiently large SNR is both necessary and sufficient for benign overfitting.||
|**2024-10-10**|[Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM](http://arxiv.org/abs/2410.07531)|null|Dropout, a network operator, when enabled is likely to dramatically impact the performance of Flash-Attention, which in turn increases the end-to-end training time of Large Language Models (LLMs). The main contributor to such performance degradation is the Random Number Generation (RNG) phase that is traditionally fused into the Flash-Attention kernel. As RNG and Attention have the same hardware bottlenecks, RNG latency can hardly be hidden within the Attention kernel. We propose overlapping RNG with previous GEMM layers in the network to hide RNG runtime and improve end-to-end performance. RNG and GEMM have distinct resource requirements and hardware bottlenecks, so they can run in parallel without compromising each other's performance. Our fine-grained performance model, cross-validated by silicon results, shows 1.14x speedup on one transformer block (including multi-head attention and feed-forward layers) for Llama2, and up to 1.23x speedup when varying workload sizes, on GH100 GPUs with FP8 precision. Further, we extend our theoretical model to different RNG implementations and hardware architectures, and discuss the widely applicable benefits for overlapping RNG with GEMM layers.||
|**2024-10-09**|[VIRT: Vision Instructed Transformer for Robotic Manipulation](http://arxiv.org/abs/2410.07169)|null|Robotic manipulation, owing to its multi-modal nature, often faces significant training ambiguity, necessitating explicit instructions to clearly delineate the manipulation details in tasks. In this work, we highlight that vision instruction is naturally more comprehensible to recent robotic policies than the commonly adopted text instruction, as these policies are born with some vision understanding ability like human infants. Building on this premise and drawing inspiration from cognitive science, we introduce the robotic imagery paradigm, which realizes large-scale robotic data pre-training without text annotations. Additionally, we propose the robotic gaze strategy that emulates the human eye gaze mechanism, thereby guiding subsequent actions and focusing the attention of the policy on the manipulated object. Leveraging these innovations, we develop VIRT, a fully Transformer-based policy. We design comprehensive tasks using both a physical robot and simulated environments to assess the efficacy of VIRT. The results indicate that VIRT can complete very competitive tasks like "opening the lid of a tightly sealed bottle", and the proposed techniques boost the success rates of the baseline policy on diverse challenging tasks from nearly 0% to more than 65%.||
|**2024-10-09**|[Stanceformer: Target-Aware Transformer for Stance Detection](http://arxiv.org/abs/2410.07083)|null|The task of Stance Detection involves discerning the stance expressed in a text towards a specific subject or target. Prior works have relied on existing transformer models that lack the capability to prioritize targets effectively. Consequently, these models yield similar performance regardless of whether we utilize or disregard target information, undermining the task's significance. To address this challenge, we introduce Stanceformer, a target-aware transformer model that incorporates enhanced attention towards the targets during both training and inference. Specifically, we design a *Target Awareness* matrix that increases the self-attention scores assigned to the targets. We demonstrate the efficacy of Stanceformer with various BERT-based models, including state-of-the-art models and Large Language Models (LLMs), and evaluate its performance across three stance detection datasets, alongside a zero-shot dataset. Our approach Stanceformer not only provides superior performance but also generalizes even to other domains, such as Aspect-based Sentiment Analysis. We make the code publicly available at https://github.com/kgarg8/Stanceformer.||
|**2024-10-09**|[InAttention: Linear Context Scaling for Transformers](http://arxiv.org/abs/2410.07063)|null|VRAM requirements for transformer models scale quadratically with context length due to the self-attention mechanism. In this paper we modify the decoder-only transformer, replacing self-attention with InAttention, which scales linearly with context length during inference by having tokens attend only to initial states. Benchmarking shows that InAttention significantly reduces VRAM usage during inference, enabling handling of long sequences on consumer GPUs. We corroborate that fine-tuning extends context length efficiently, improving performance on long sequences without high training costs. InAttention offers a scalable solution for long-range dependencies in transformer models, paving the way for further optimization.||
|**2024-10-09**|[Dynamic metastability in the self-attention model](http://arxiv.org/abs/2410.06833)|**[link](https://github.com/hugokoubbi/2024-transformers-dotm)**|We consider the self-attention model - an interacting particle system on the unit sphere, which serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23] - although particles collapse to a single cluster in infinite time, they remain trapped near a configuration of several clusters for an exponentially long period of time. By leveraging a gradient flow interpretation of the system, we also connect our result to an overarching framework of slow motion of gradient flows proposed by Otto and Reznikoff [OR07] in the context of coarsening and the Allen-Cahn equation. We finally probe the dynamics beyond the exponentially long period of metastability, and illustrate that, under an appropriate time-rescaling, the energy reaches its global maximum in finite time and has a staircase profile, with trajectories manifesting saddle-to-saddle-like behavior, reminiscent of recent works in the analysis of training dynamics via gradient descent for two-layer neural networks.||
|**2024-10-09**|[Cluster-wise Graph Transformer with Dual-granularity Kernelized Attention](http://arxiv.org/abs/2410.06746)|**[link](https://github.com/lumia-group/cluster-wise-graph-transformer)**|In the realm of graph learning, there is a category of methods that conceptualize graphs as hierarchical structures, utilizing node clustering to capture broader structural information. While generally effective, these methods often rely on a fixed graph coarsening routine, leading to overly homogeneous cluster representations and loss of node-level information. In this paper, we envision the graph as a network of interconnected node sets without compressing each cluster into a single embedding. To enable effective information transfer among these node sets, we propose the Node-to-Cluster Attention (N2C-Attn) mechanism. N2C-Attn incorporates techniques from Multiple Kernel Learning into the kernelized attention framework, effectively capturing information at both node and cluster levels. We then devise an efficient form for N2C-Attn using the cluster-wise message-passing framework, achieving linear time complexity. We further analyze how N2C-Attn combines bi-level feature maps of queries and keys, demonstrating its capability to merge dual-granularity information. The resulting architecture, Cluster-wise Graph Transformer (Cluster-GT), which uses node clusters as tokens and employs our proposed N2C-Attn module, shows superior performance on various graph-level tasks. Code is available at https://github.com/LUMIA-Group/Cluster-wise-Graph-Transformer.||
|**2024-10-07**|[Differential Transformer](http://arxiv.org/abs/2410.05258)|**[link](https://github.com/microsoft/unilm/blob/master/Diff-Transformer/)**|Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.||
|**2024-10-07**|[TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention](http://arxiv.org/abs/2410.05076)|**[link](https://github.com/DerrickYLJ/TidalDecode)**|Large language models (LLMs) have driven significant advancements across diverse natural language processing tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the tokens most relevant for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position-persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token-selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention over the pre-selected tokens. This design enables TidalDecode to substantially reduce the token-selection overhead of sparse attention without sacrificing the quality of generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full-attention methods while reducing LLM decoding latency by up to 2.1x.||
|**2024-10-07**|[On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent](http://arxiv.org/abs/2410.04870)|null|The Adam optimizer is widely used for transformer optimization in practice, which makes understanding the underlying optimization mechanisms an important problem. However, due to Adam's complexity, theoretical analysis of how it optimizes transformers remains a challenging task. Fortunately, Sign Gradient Descent (SignGD) serves as an effective surrogate for Adam. Despite its simplicity, theoretical understanding of how SignGD optimizes transformers still lags behind. In this work, we study how SignGD optimizes a two-layer transformer -- consisting of a softmax attention layer with trainable query-key parameterization followed by a linear layer -- on a linearly separable noisy dataset. We identify four stages in the training dynamics, each exhibiting intriguing behaviors. Based on the training dynamics, we prove the fast convergence but poor generalization of the learned transformer on the noisy dataset. We also show that Adam behaves similarly to SignGD in terms of both optimization and generalization in this setting. Additionally, we find that the poor generalization of SignGD is not solely due to data noise, suggesting that both SignGD and Adam require high-quality data for real-world tasks. Finally, experiments on synthetic and real-world datasets empirically support our theoretical results.||
|**2024-10-07**|[Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering](http://arxiv.org/abs/2410.04801)|null|The goal of this paper is to improve the performance of pretrained Vision Transformer (ViT) models, particularly DINOv2, in the image clustering task without requiring re-training or fine-tuning. As model size increases, high-norm artifact anomalies appear in the patches of multi-head attention. We observe that this anomaly leads to reduced accuracy in zero-shot image clustering. These artifacts are characterized by disproportionately large values in the attention map compared to other patch tokens. To address these artifacts, we propose an approach called Inference-Time Attention Engineering (ITAE), which manipulates the attention function during inference. Specifically, we identify the artifacts by investigating one of the Query-Key-Value (QKV) patches in the multi-head attention and attenuate their corresponding attention values inside the pretrained models. ITAE shows improved clustering accuracy on multiple datasets by exhibiting more expressive features in latent space. Our findings highlight the potential of ITAE as a practical solution for reducing artifacts in pretrained ViT models and improving model performance in clustering tasks without the need for re-training or fine-tuning.||
|**2024-10-07**|[DAPE V2: Process Attention Score as Feature Map for Length Extrapolation](http://arxiv.org/abs/2410.04798)|**[link](https://github.com/chuanyang-zheng/dape)**|The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens, in contrast to earlier feed-forward neural networks. In general, the attention scores are determined simply by the key-query products. However, this work's occasional trial (combining DAPE and NoPE) of including additional MLPs on attention scores without position encoding indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (for neighboring attention scores across different heads) to mimic the processing methods in computer vision. Specifically, the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query and key dot product, and we successfully translate the length extrapolation issue into a well-understood feature map processing problem. The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.||
|**2024-10-07**|[PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners](http://arxiv.org/abs/2410.04733)|**[link](https://github.com/yyyujintang/predformer)**|Spatiotemporal predictive learning methods generally fall into two categories: recurrent-based approaches, which face challenges in parallelization and performance, and recurrent-free methods, which employ convolutional neural networks (CNNs) as encoder-decoder architectures. These methods benefit from strong inductive biases but often at the expense of scalability and generalization. This paper proposes PredFormer, a pure transformer-based framework for spatiotemporal predictive learning. Motivated by the Vision Transformers (ViT) design, PredFormer leverages carefully designed Gated Transformer blocks, following a comprehensive analysis of 3D attention mechanisms, including full-, factorized-, and interleaved- spatial-temporal attention. With its recurrent-free, transformer-based design, PredFormer is both simple and efficient, significantly outperforming previous methods by large margins. Extensive experiments on synthetic and real-world datasets demonstrate that PredFormer achieves state-of-the-art performance. On Moving MNIST, PredFormer achieves a 51.3% reduction in MSE relative to SimVP. For TaxiBJ, the model decreases MSE by 33.1% and boosts FPS from 533 to 2364. Additionally, on WeatherBench, it reduces MSE by 11.1% while enhancing FPS from 196 to 404. These performance gains in both accuracy and efficiency demonstrate PredFormer's potential for real-world applications. The source code will be released at https://github.com/yyyujintang/PredFormer.||
|**2024-10-07**|[Efficient transformer with reinforced position embedding for language models](http://arxiv.org/abs/2410.04731)|null|In this paper, we propose an efficient transformer architecture that uses reinforced positional embedding to obtain superior performance with half the number of encoder-decoder layers. We demonstrate that concatenating positional encoding with trainable token embeddings, normalizing columns in the token embedding matrix, and using the normalized token embedding matrix as the value of the attention layer improve the training and validation loss and the training time in an encoder-decoder Transformer model for a Portuguese-English translation task with 10 epochs or 12 hours of training across 10 trials. Our method, with roughly a threefold parameter reduction compared to the baseline model, yields a mean training loss of 1.21, a mean validation loss of 1.51, and an average training time of 1352.27 seconds per epoch, surpassing the baseline model with the same embedding dimension that employs addition of positional encoding and token embeddings, which achieves a mean training loss of 1.96, a validation loss of 2.18, and an average training time of 4297.79 seconds per epoch. Additionally, we evaluated our proposed architecture and the baseline across 14 diverse translation datasets from TensorFlow. The results indicate that our method consistently achieves lower or comparable training and validation losses, suggesting enhanced learning efficiency.||
|**2024-10-07**|[Low-Rank Continual Pyramid Vision Transformer: Incrementally Segment Whole-Body Organs in CT with Light-Weighted Adaptation](http://arxiv.org/abs/2410.04689)|null|Deep segmentation networks achieve high performance when trained on specific datasets. However, in clinical practice, it is often desirable that pretrained segmentation models can be dynamically extended to enable segmenting new organs without access to previous training datasets or without training from scratch. This would ensure a much more efficient model development and deployment paradigm accounting for the patient privacy and data storage issues. This clinically preferred process can be viewed as a continual semantic segmentation (CSS) problem. Previous CSS works would either experience catastrophic forgetting or lead to unaffordable memory costs as models expand. In this work, we propose a new continual whole-body organ segmentation model with light-weighted low-rank adaptation (LoRA). We first train and freeze a pyramid vision transformer (PVT) base segmentation model on the initial task, then continually add light-weighted trainable LoRA parameters to the frozen model for each new learning task. Through a holistic exploration of the architecture modification, we identify the three most important layers (i.e., patch-embedding, multi-head attention and feed-forward layers) that are critical in adapting to the new segmentation tasks, while retaining the majority of the pretrained parameters fixed. Our proposed model continually segments new organs without catastrophic forgetting and meanwhile maintains a low parameter-increase rate. Continually trained and tested on four datasets covering different body parts of a total of 121 organs, results show that our model achieves high segmentation accuracy, closely reaching the PVT and nnUNet upper bounds, and significantly outperforms other regularization-based CSS methods. When compared to the leading architecture-based CSS method, our model has a substantially lower parameter-increase rate while achieving comparable performance.||
|**2024-10-06**|[DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination](http://arxiv.org/abs/2410.04514)|null|Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. As we know, both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, allowing the model to extract visual information and generate text outputs via attention mechanisms. We find that the attention distribution of the LLM decoder on image tokens is highly consistent with that of the visual encoder, and both distributions tend to focus on particular background tokens rather than the referred objects in the image. We attribute this unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides LLMs to over-emphasize redundant information and generate object hallucination. To address the issue, we propose DAMRO, a novel training-free strategy that **D**ives into the **A**ttention **M**echanism of LVLM to **R**educe **O**bject Hallucination. Specifically, our approach employs the classification token (CLS) of ViT to filter out high-attention outlier tokens scattered in the background and then eliminates their influence during the decoding stage. We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using various benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating the hallucination of LVLMs. The code of our method will be released soon.||
|**2024-10-05**|[Fundamental Limitations on Subquadratic Alternatives to Transformers](http://arxiv.org/abs/2410.04271)|null|The Transformer architecture is widely deployed in many popular and impactful Large Language Models. At its core is the attention mechanism for calculating correlations between pairs of tokens. Performing an attention computation takes quadratic time in the input size, and had become the time bottleneck for transformer operations. In order to circumvent this, researchers have used a variety of approaches, including designing heuristic algorithms for performing attention computations faster, and proposing alternatives to the attention mechanism which can be computed more quickly. For instance, state space models such as Mamba were designed to replace attention with an almost linear time alternative. In this paper, we prove that any such approach cannot perform important tasks that Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm. Thus, any model which can be evaluated in subquadratic time - whether because of subquadratic-time heuristics for attention, faster attention replacements like Mamba, or any other reason - cannot perform this task. In other words, in order to perform tasks that (implicitly or explicitly) involve document similarity, one may as well use Transformer and cannot avoid its quadratic running time.||
|**2024-10-04**|[Linear Transformer Topological Masking with Graph Random Features](http://arxiv.org/abs/2410.03462)|null|When training transformers on graph-structured data, incorporating information about the underlying topology is crucial for good performance. Topological masking, a type of relative position encoding, achieves this by upweighting or downweighting attention depending on the relationship between the query and keys in a graph. In this paper, we propose to parameterise topological masks as a learnable function of a weighted adjacency matrix -- a novel, flexible approach which incorporates a strong structural inductive bias. By approximating this mask with graph random features (for which we prove the first known concentration bounds), we show how this can be made fully compatible with linear attention, preserving $\mathcal{O}(N)$ time and space complexity with respect to the number of input tokens. The fastest previous alternative was $\mathcal{O}(N \log N)$ and only suitable for specific graphs. Our efficient masking algorithms provide strong performance gains for tasks on image and point cloud data, including with $>30$k nodes.||
|**2024-10-04**|[Error Correction Code Transformer: From Non-Unified to Unified](http://arxiv.org/abs/2410.03364)|null|Channel coding is vital for reliable data transmission in modern wireless systems, and its significance will increase with the emergence of sixth-generation (6G) networks, which will need to support various error correction codes. However, traditional decoders were typically designed as fixed hardware circuits tailored to specific decoding algorithms, leading to inefficiencies and limited flexibility. To address these challenges, this paper proposes a unified, code-agnostic Transformer-based decoding architecture capable of handling multiple linear block codes, including Polar, Low-Density Parity-Check (LDPC), and Bose-Chaudhuri-Hocquenghem (BCH), within a single framework. To achieve this, standardized units are employed to harmonize parameters across different code types, while the redesigned unified attention module compresses the structural information of various codewords. Additionally, a sparse mask, derived from the sparsity of the parity-check matrix, is introduced to enhance the model's ability to capture inherent constraints between information and parity-check bits, resulting in improved decoding accuracy and robustness. Extensive experimental results demonstrate that the proposed unified Transformer-based decoder not only outperforms existing methods but also provides a flexible, efficient, and high-performance solution for next-generation wireless communication systems.||
|**2024-10-04**|[Selective Transformer for Hyperspectral Image Classification](http://arxiv.org/abs/2410.03171)|null|Transformer has achieved satisfactory results in the field of hyperspectral image (HSI) classification. However, existing Transformer models face two key challenges when dealing with HSI scenes characterized by diverse land cover types and rich spectral information: (1) fixed receptive field representation overlooks effective contextual information; (2) redundant self-attention feature representation. To address these limitations, we propose a novel Selective Transformer (SFormer) for HSI classification. The SFormer is designed to dynamically select receptive fields for capturing both spatial and spectral contextual information, while mitigating the impact of redundant data by prioritizing the most relevant features. This enables a highly accurate classification of the land covers of the HSI. Specifically, a Kernel Selective Transformer Block (KSTB) is first utilized to dynamically select an appropriate receptive field range to effectively extract spatial-spectral features. Furthermore, to capture the most crucial tokens, a Token Selective Transformer Block (TSTB) is introduced, which selects the most relevant tokens based on the ranking of attention scores for each query. Extensive experiments on four benchmark HSI datasets demonstrate that the proposed SFormer outperforms the state-of-the-art HSI classification models. The codes will be released.||
|**2024-10-04**|[Autoregressive Moving-average Attention Mechanism for Time Series Forecasting](http://arxiv.org/abs/2410.03159)|**[link](https://github.com/ljc-fvnr/arma-attention)**|We propose an Autoregressive (AR) Moving-average (MA) attention structure that can adapt to various linear attention mechanisms, enhancing their ability to capture long-range and local temporal patterns in time series. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that incorporating the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.||
|**2024-10-03**|[Towards Understanding the Universality of Transformers for Next-Token Prediction](http://arxiv.org/abs/2410.03011)|null|Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this in-context autoregressive learning ability remains unclear. In this paper, we take a step towards understanding this phenomenon by studying the approximation ability of Transformers for next-token prediction. Specifically, we explore the capacity of causal Transformers to predict the next token $x_{t+1}$ given an autoregressive sequence $(x_1, \dots, x_t)$ as a prompt, where $ x_{t+1} = f(x_t) $, and $ f $ is a context-dependent function that varies with each sequence. On the theoretical side, we focus on specific instances, namely when $ f $ is linear or when $ (x_t)_{t \geq 1} $ is periodic. We explicitly construct a Transformer (with linear, exponential, or softmax attention) that learns the mapping $f$ in-context through a causal kernel descent method. The causal kernel descent method we propose provably estimates $x_{t+1} $ based solely on past and current observations $ (x_1, \dots, x_t) $, with connections to the Kaczmarz algorithm in Hilbert spaces. We present experimental results that validate our theoretical findings and suggest their applicability to more general mappings $f$ .||
|**2024-10-03**|[Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient](http://arxiv.org/abs/2410.02984)|null|We introduce refined variants of the Local Learning Coefficient (LLC), a measure of model complexity grounded in singular learning theory, to study the development of internal structure in transformer language models during training. By applying these *refined LLCs* (rLLCs) to individual components of a two-layer attention-only transformer, we gain novel insights into the progressive differentiation and specialization of attention heads. Our methodology reveals how attention heads differentiate into distinct functional roles over the course of training, analyzes the types of data these heads specialize to process, and discovers a previously unidentified multigram circuit. These findings demonstrate that rLLCs provide a principled, quantitative toolkit for *developmental interpretability*, which aims to understand models through their evolution across the learning process. More broadly, this work takes a step towards establishing the correspondence between data distributional structure, geometric properties of the loss landscape, learning dynamics, and emergent computational structures in neural networks.||
|**2024-10-03**|[GABIC: Graph-based Attention Block for Image Compression](http://arxiv.org/abs/2410.02981)|**[link](https://github.com/EIDOSLAB/GABIC)**|While standardized codecs like JPEG and HEVC-intra represent the industry standard in image compression, neural Learned Image Compression (LIC) codecs represent a promising alternative. In detail, integrating attention mechanisms from Vision Transformers into LIC models has shown improved compression efficiency. However, extra efficiency often comes at the cost of aggregating redundant features. This work proposes a Graph-based Attention Block for Image Compression (GABIC), a method to reduce feature redundancy based on a k-Nearest Neighbors enhanced attention mechanism. Our experiments show that GABIC outperforms comparable methods, particularly at high bit rates, enhancing compression performance.||
|**2024-10-03**|[Selective Attention Improves Transformer](http://arxiv.org/abs/2410.02703)|null|Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention improves language modeling performance in a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention, than transformers without selective attention at the same validation perplexity.||
|**2024-10-03**|[Deconstructing Recurrence, Attention, and Gating: Investigating the transferability of Transformers and Gated Recurrent Neural Networks in forecasting of dynamical systems](http://arxiv.org/abs/2410.02654)|null|Machine learning architectures, including transformers and recurrent neural networks (RNNs) have revolutionized forecasting in applications ranging from text processing to extreme weather. Notably, advanced network architectures, tuned for applications such as natural language processing, are transferable to other tasks such as spatiotemporal forecasting tasks. However, there is a scarcity of ablation studies to illustrate the key components that enable this forecasting accuracy. The absence of such studies, although explainable due to the associated computational cost, intensifies the belief that these models ought to be considered as black boxes. In this work, we decompose the key architectural components of the most powerful neural architectures, namely gating and recurrence in RNNs, and attention mechanisms in transformers. Then, we synthesize and build novel hybrid architectures from the standard blocks, performing ablation studies to identify which mechanisms are effective for each task. The importance of considering these components as hyper-parameters that can augment the standard architectures is exhibited on various forecasting datasets, from the spatiotemporal chaotic dynamics of the multiscale Lorenz 96 system, the Kuramoto-Sivashinsky equation, as well as standard real world time-series benchmarks. A key finding is that neural gating and attention improves the performance of all standard RNNs in most tasks, while the addition of a notion of recurrence in transformers is detrimental. Furthermore, our study reveals that a novel, sparsely used architecture which integrates Recurrent Highway Networks with neural gating and attention mechanisms emerges as the best performing architecture in high-dimensional spatiotemporal forecasting of dynamical systems.||
|**2024-10-03**|[NestedMorph: Enhancing Deformable Medical Image Registration with Nested Attention Mechanisms](http://arxiv.org/abs/2410.02550)|null|Deformable image registration is crucial for aligning medical images in a non-linear fashion across different modalities, allowing for precise spatial correspondence between varying anatomical structures. This paper presents NestedMorph, a novel network utilizing a Nested Attention Fusion approach to improve intra-subject deformable registration between T1-weighted (T1w) MRI and diffusion MRI (dMRI) data. NestedMorph integrates high-resolution spatial details from an encoder with semantic information from a decoder using a multi-scale framework, enhancing both local and global feature extraction. Our model notably outperforms existing methods, including CNN-based approaches like VoxelMorph, MIDIR, and CycleMorph, as well as Transformer-based models such as TransMorph and ViT-V-Net, and traditional techniques like NiftyReg and SyN. Evaluations on the HCP dataset demonstrate that NestedMorph achieves superior performance across key metrics, including SSIM, HD95, and SDlogJ, with the highest SSIM of 0.89, and the lowest HD95 of 2.5 and SDlogJ of 0.22. These results highlight NestedMorph's ability to capture both local and global image features effectively, leading to superior registration performance. The promising outcomes of this study underscore NestedMorph's potential to significantly advance deformable medical image registration, providing a robust framework for future research and clinical applications. The source code and our implementation are available at: https://bit.ly/3zdVqcg||
|**2024-10-03**|[SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration](http://arxiv.org/abs/2410.02367)|**[link](https://github.com/thu-ml/SageAttention)**|The transformer architecture predominates across various models. As the heart of the transformer, attention has a computational complexity of O(N^2), compared to O(N) for linear transformations. When handling large sequence lengths, attention becomes the primary time-consuming component. Although quantization has proven to be an effective method for accelerating model inference, existing quantization methods primarily focus on optimizing the linear layer. In response, we first analyze the feasibility of quantizing attention in detail. Following that, we propose SageAttention, a highly efficient and accurate quantization method for attention. The OPS (operations per second) of our approach outperforms FlashAttention2 and xformers by about 2.1 times and 2.7 times, respectively. SageAttention also achieves superior accuracy performance over FlashAttention3. Comprehensive experiments confirm that our approach incurs almost no end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation.||
|**2024-10-03**|[Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization](http://arxiv.org/abs/2410.02247)|**[link](https://github.com/chen123CtrlS/LightweightAtt)**|Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive parameterization. In this paper, we investigate two remarkable phenomena observed during the fine-tuning of LLMs, particularly focusing on the attention mechanism: (1) Different Impact: optimizing the $\mathbf{W}_v$ matrix significantly improves performance over optimizing the $\mathbf{W}_k$ matrix. Fine-tuning only the $\mathbf{W}_q$ and $\mathbf{W}_v$ matrices is computationally efficient, delivering results that are comparable to, or even better than, fine-tuning all three matrices $\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$. (2) Efficient Convergence: employing distinct learning rates for these matrices is crucial for optimal performance, with a higher learning rate for the $\mathbf{W}_v$ matrix expediting convergence. However, theoretical analyses of these phenomena are still relatively limited. We present a theoretical analysis of these phenomena from two perspectives: (i) Generalization, where we demonstrate that fine-tuning only $\mathbf{W}_q$ and $\mathbf{W}_v$ improves generalization bounds and enhances memory efficiency; and (ii) Optimization, where we emphasize that the feature learning of the attention mechanism is efficient, particularly when using distinct learning rates for the matrices, which leads to more effective fine-tuning. Building on these insights, we propose a new strategy that improves fine-tuning efficiency in terms of both storage and time. Experimental results on benchmark datasets validate the effectiveness of this approach, supporting our theoretical findings. Our analysis lays the theoretical groundwork for configuring and improving lightweight algorithms in LLM fine-tuning.||
|**2024-10-03**|[HATFormer: Historic Handwritten Arabic Text Recognition with Transformers](http://arxiv.org/abs/2410.02179)|null|Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer's attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.||
|**2024-10-03**|[Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis](http://arxiv.org/abs/2410.02167)|null|Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability of large language models by augmenting the query using multiple examples with multiple intermediate steps. Despite the empirical success, the theoretical understanding of how to train a Transformer to achieve the CoT ability remains less explored. This is primarily due to the technical challenges involved in analyzing the nonconvex optimization of nonlinear attention models. To the best of our knowledge, this work provides the first theoretical study of training Transformers with nonlinear attention to obtain the CoT generalization capability, so that the resulting model can perform inference on unseen tasks when the input is augmented by examples of the new task. We first quantify the training samples and iterations required to train a Transformer model towards CoT ability. We then prove the success of its CoT generalization on unseen tasks with distribution-shifted testing data. Moreover, we theoretically characterize the conditions for an accurate reasoning output by CoT even when the provided reasoning examples contain noises and are not always accurate. In contrast, in-context learning (ICL), which can be viewed as one-step CoT without intermediate steps, may fail to provide an accurate output when CoT does. These theoretical findings are justified through experiments.||
|**2024-10-02**|[Positional Attention: Out-of-Distribution Generalization and Expressivity for Neural Algorithmic Reasoning](http://arxiv.org/abs/2410.01686)|**[link](https://github.com/opallab/positional_attention)**|There has been a growing interest in the ability of neural networks to solve algorithmic tasks, such as arithmetic, summary statistics, and sorting. While state-of-the-art models like Transformers have demonstrated good generalization performance on in-distribution tasks, their out-of-distribution (OOD) performance is poor when trained end-to-end. In this paper, we focus on value generalization, a common instance of OOD generalization where the test distribution has the same input sequence length as the training distribution, but the value ranges in the training and test distributions do not necessarily overlap. To address this issue, we propose that using fixed positional encodings to determine attention weights, referred to as positional attention, enhances empirical OOD performance while maintaining expressivity. We support our claim about expressivity by proving that Transformers with positional attention can effectively simulate parallel algorithms.||
|**2024-10-02**|[On The Adaptation of Unlimiformer for Decoder-Only Transformers](http://arxiv.org/abs/2410.01637)|null|One of the prominent issues stifling the current generation of large language models is their limited context length. Recent proprietary models such as GPT-4 and Claude 2 have introduced longer context lengths, 8k/32k and 100k, respectively; however, despite the efforts in the community, most common models, such as Llama-2, have a context length of 4k or less. Unlimiformer (Bertsch et al., 2023) is a recently popular vector-retrieval augmentation method that offloads cross-attention computations to a kNN index. However, its main limitation is incompatibility with decoder-only transformers out of the box. In this work, we explore practical considerations of adapting Unlimiformer to decoder-only transformers and introduce a series of modifications to overcome this limitation. Moreover, we expand the original experimental setup on summarization to include a new task (i.e., free-form Q&A) and an instruction-tuned model (i.e., a custom 6.7B GPT model). Our results showcase the effectiveness of these modifications on summarization, performing on par with a model with 2x the context length. Moreover, we discuss limitations and future directions for free-form Q&A and instruction-tuned models.||
|**2024-10-02**|[Attention layers provably solve single-location regression](http://arxiv.org/abs/2410.01537)|**[link](https://github.com/pierremarion23/single-location-regression)**|Attention-based models, such as Transformer, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear representations. To address this gap, we introduce the single-location regression task, where only one token in a sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input. To solve this task, we propose a dedicated predictor, which turns out to be a simplified version of a non-linear self-attention layer. We study its theoretical properties, by showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and internal linear structures.||
|**2024-09-30**|[CBAM-SwinT-BL: Small Rail Surface Defect Detection Method Based on Swin Transformer with Block Level CBAM Enhancement](http://arxiv.org/abs/2409.20113)|null|Under high-intensity rail operations, rail tracks endure considerable stresses resulting in various defects such as corrugation and spallings. Failure to effectively detect defects and provide maintenance in time would compromise service reliability and public safety. While advanced models have been developed in recent years, efficiently identifying small-scale rail defects has not yet been studied, especially for categories such as Dirt or Squat on the rail surface. To address this challenge, this study utilizes Swin Transformer (SwinT) as the baseline and incorporates the Convolutional Block Attention Module (CBAM) for enhancement. Our proposed method integrates CBAM successively within the Swin Transformer blocks, resulting in significant performance improvement in rail defect detection, particularly for categories with small instance sizes. The proposed framework is named CBAM-Enhanced Swin Transformer in Block Level (CBAM-SwinT-BL). Experiments and ablation studies have proven the effectiveness of the framework. The proposed framework yields a notable improvement in the accuracy of small-size defects, such as the dirt and dent categories in the RIII dataset, with mAP-50 increasing by +23.0% and +38.3% respectively, and the squat category in the MUET dataset also reaches +13.2% higher than the original model. Compared to the original SwinT, CBAM-SwinT-BL increases overall precision by around +5% on the MUET dataset and +7% on the RIII dataset, reaching 69.1% and 88.1% respectively. Meanwhile, the additional CBAM module merely extends the model training time by an average of +0.04s/iteration, which is acceptable compared to the significant improvement in system performance.||
|**2024-09-30**|[SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers](http://arxiv.org/abs/2409.19850)|null|Over the past few years, vision transformers (ViTs) have consistently demonstrated remarkable performance across various visual recognition tasks. However, attempts to enhance their robustness have yielded limited success, mainly focusing on different training strategies, input patch augmentation, or network structural enhancements. These approaches often involve extensive training and fine-tuning, which are time-consuming and resource-intensive. To tackle these obstacles, we introduce a novel approach named Spatial Autocorrelation Token Analysis (SATA). By harnessing spatial relationships between token features, SATA enhances both the representational capacity and robustness of ViT models. This is achieved through the analysis and grouping of tokens according to their spatial autocorrelation scores prior to their input into the Feed-Forward Network (FFN) block of the self-attention mechanism. Importantly, SATA seamlessly integrates into existing pre-trained ViT baselines without requiring retraining or additional fine-tuning, while concurrently improving efficiency by reducing the computational load of the FFN units. Experimental results show that the baseline ViTs enhanced with SATA not only achieve a new state-of-the-art top-1 accuracy on ImageNet-1K image classification (94.9%) but also establish new state-of-the-art performance across multiple robustness benchmarks, including ImageNet-A (top-1=63.6%), ImageNet-R (top-1=79.2%), and ImageNet-C (mCE=13.6%), all without requiring additional training or fine-tuning of baseline models.||
|**2024-09-29**|[Spiking Transformer with Spatial-Temporal Attention](http://arxiv.org/abs/2409.19764)|null|Spiking Neural Networks (SNNs) present a compelling and energy-efficient alternative to traditional Artificial Neural Networks (ANNs) due to their sparse binary activation. Leveraging the success of the transformer architecture, the spiking transformer architecture is explored to scale up dataset size and performance. However, existing works only consider the spatial self-attention in spiking transformers, neglecting the inherent temporal context across the timesteps. In this work, we introduce Spiking Transformer with Spatial-Temporal Attention (STAtten), a simple and straightforward architecture designed to integrate spatial and temporal information in self-attention with negligible additional computational load. STAtten divides the temporal or token index and calculates the self-attention in a cross-manner to effectively incorporate spatial-temporal information. We first verify our spatial-temporal attention mechanism's ability to capture long-term temporal dependencies using sequential datasets. Moreover, we validate our approach through extensive experiments on varied datasets, including CIFAR10/100, ImageNet, CIFAR10-DVS, and N-Caltech101. Notably, our cross-attention mechanism achieves an accuracy of 78.39% on the ImageNet dataset.||
|**2024-09-29**|[OrientedFormer: An End-to-End Transformer-Based Oriented Object Detector in Remote Sensing Images](http://arxiv.org/abs/2409.19648)|**[link](https://github.com/wokaikaixinxin/OrientedFormer)**|Oriented object detection in remote sensing images is a challenging task because objects are distributed in multiple orientations. Recently, end-to-end transformer-based methods have achieved success by eliminating the need for post-processing operators compared to traditional CNN-based methods. However, directly extending transformers to oriented object detection presents three main issues: 1) objects rotate arbitrarily, necessitating the encoding of angles along with position and size; 2) the geometric relations of oriented objects are lacking in self-attention, due to the absence of interaction between content and positional queries; and 3) oriented objects cause misalignment, mainly between values and positional queries in cross-attention, making accurate classification and localization difficult. In this paper, we propose an end-to-end transformer-based oriented object detector, consisting of three dedicated modules to address these issues. First, Gaussian positional encoding is proposed to encode the angle, position, and size of oriented boxes using Gaussian distributions. Second, Wasserstein self-attention is proposed to introduce geometric relations and facilitate interaction between content and positional queries by utilizing Gaussian Wasserstein distance scores. Third, oriented cross-attention is proposed to align values and positional queries by rotating sampling points around the positional query according to their angles. Experiments on six datasets (DIOR-R, the DOTA series, HRSC2016, and ICDAR2015) show the effectiveness of our approach. Compared with previous end-to-end detectors, OrientedFormer gains 1.16 and 1.21 AP $_{50}$ on DIOR-R and DOTA-v1.0 respectively, while reducing training epochs from 3$\times$ to 1$\times$. The codes are available at https://github.com/wokaikaixinxin/OrientedFormer.||
|**2024-09-28**|[Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization](http://arxiv.org/abs/2409.19345)|null|Transformers have demonstrated great power in the recent development of large foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary changes to the field of vision, achieving significant accomplishments on the experimental side. However, their theoretical capabilities, particularly in terms of generalization when trained to overfit training data, are still not fully understood. To address this gap, this work delves deeply into the benign overfitting perspective of transformers in vision. To this end, we study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model. By developing techniques that address the challenges posed by softmax and the interdependent nature of multiple weights in transformer optimization, we successfully characterized the training dynamics and achieved generalization in post-training. Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model. The theoretical results are further verified by experimental simulation.||
|**2024-09-28**|[Intelligent Fish Detection System with Similarity-Aware Transformer](http://arxiv.org/abs/2409.19323)|**[link](https://github.com/vision4robotics/fishvit)**|Fish detection in water-land transfer has significantly contributed to the fishery. However, manual fish detection by crowds of collaborating workers is inefficient and expensive, and its accuracy is insufficient. To further enhance water-land transfer efficiency, improve detection accuracy, and reduce labor costs, this work designs a new type of lightweight and plug-and-play edge intelligent vision system to automatically conduct fast fish detection with a high-speed camera. Moreover, a novel similarity-aware vision Transformer for fast fish detection (FishViT) is proposed to identify every single fish onboard within a dense group of similar-looking fish. Specifically, a novel similarity-aware multi-level encoder is developed to enhance multi-scale features in parallel, thereby yielding discriminative representations for varying-size fish. Additionally, a new soft-threshold attention mechanism is introduced, which not only effectively eliminates background noise from images but also accurately recognizes both the edge details and overall features of different similar fish. 85 challenging video sequences with high frame rate and high resolution are collected to establish a benchmark from real fish water-land transfer scenarios. Exhaustive evaluation on this challenging benchmark has proved the robustness and effectiveness of FishViT, which runs at over 80 FPS. Real work scenario tests validate the practicality of the proposed method. The code and demo video are available at https://github.com/vision4robotics/FishViT.||
|**2024-09-28**|[Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models](http://arxiv.org/abs/2409.19315)|null|Transformer neural networks, driven by self-attention mechanisms, are core components of foundational and Large Language Models. In generative transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks for long sequences. In this work, we propose a fast and energy-efficient hardware implementation of self-attention using analog in-memory computing based on gain cell memories. Volatile gain cell memories can be efficiently written to store new tokens during sequence generation, while performing analog signed weight multiplications to compute the dot-products required for self-attention. We implement Sliding Window Attention, which keeps memory of a finite set of past steps. A charge-to-pulse converter for array readout eliminates the need for analog-to-digital conversion between self-attention stages. Using a co-designed initialization algorithm to adapt pre-trained weights to gain cell non-idealities, we achieve NLP performance comparable to GPT-2 with minimal training iterations, despite hardware constraints. Our end-to-end hardware design includes digital controls, estimating area, latency, and energy. The system reduces attention latency by up to two orders of magnitude and energy consumption by up to five orders compared to GPUs, marking a significant step toward ultra-fast, low-power sequence generation in Large Language Models.||
|**2024-09-27**|[Feature Estimation of Global Language Processing in EEG Using Attention Maps](http://arxiv.org/abs/2409.19174)|null|Understanding the correlation between EEG features and cognitive tasks is crucial for elucidating brain function. Brain activity synchronizes during speaking and listening tasks. However, it is challenging to estimate task-dependent brain activity characteristics using methods with low spatial but high temporal resolution, such as EEG, rather than methods with high spatial resolution, such as fMRI. This study introduces a novel approach to EEG feature estimation that utilizes the weights of deep learning models to explore this association. We demonstrate that attention maps generated from Vision Transformers and EEGNet effectively identify features that align with findings from prior studies. EEGNet emerged as the most accurate model regarding subject independence and the classification of Listening and Speaking tasks. The application of Mel-Spectrograms with ViTs enhances the resolution of temporal and frequency-related EEG characteristics. Our findings reveal that the characteristics discerned through attention maps vary significantly based on the input data, allowing for tailored feature extraction from EEG signals. By estimating features, our study reinforces known attributes and predicts new ones, potentially offering fresh perspectives in utilizing EEG for medical purposes, such as early disease detection. These techniques will make substantial contributions to cognitive neuroscience.||
|**2024-09-27**|[Cottention: Linear Transformers With Cosine Attention](http://arxiv.org/abs/2409.18747)|**[link](https://github.com/gmongaras/Cottention_Transformer)**|Attention mechanisms, particularly softmax attention, have been instrumental in the success of transformer-based models such as GPT. However, the quadratic memory complexity of softmax attention with respect to sequence length poses significant challenges for processing longer sequences. We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity. By leveraging the properties of cosine similarity and rearranging the attention equation, Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention. We demonstrate that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, allowing for constant memory usage during inference. We evaluate Cottention on both the bidirectional BERT and causal GPT tasks, demonstrating comparable performance to softmax attention while significantly reducing memory requirements. To ensure efficient computation, we develop a custom CUDA kernel for Cottention. Our results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.||
|**2024-09-27**|[Token Caching for Diffusion Transformer Acceleration](http://arxiv.org/abs/2409.18523)|null|Diffusion transformers have gained substantial interest in diffusion generative modeling due to their outstanding performance. However, their high computational cost, arising from the quadratic computational complexity of attention mechanisms and multi-step inference, presents a significant bottleneck. To address this challenge, we propose TokenCache, a novel post-training acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computations among tokens across inference steps. TokenCache specifically addresses three critical questions in the context of diffusion transformers: (1) which tokens should be pruned to eliminate redundancy, (2) which blocks should be targeted for efficient pruning, and (3) at which time steps caching should be applied to balance speed and quality. In response to these challenges, TokenCache introduces a Cache Predictor that assigns importance scores to tokens, enabling selective pruning without compromising model performance. Furthermore, we propose an adaptive block selection strategy to focus on blocks with minimal impact on the network's output, along with a Two-Phase Round-Robin (TPRR) scheduling policy to optimize caching intervals throughout the denoising process. Experimental results across various models demonstrate that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers. Our code will be publicly available.||
|**2024-09-26**|[Decomposable Transformer Point Processes](http://arxiv.org/abs/2409.18158)|null|The standard paradigm for modeling marked point processes is to parameterize the intensity function using an attention-based (Transformer-style) architecture. Despite the flexibility of these methods, their inference is based on the computationally intensive thinning algorithm. In this work, we propose a framework where the advantages of the attention-based architecture are maintained and the limitation of the thinning algorithm is circumvented. The framework depends on modeling the conditional distribution of inter-event times with a mixture of log-normals satisfying a Markov property and the conditional probability mass function for the marks with a Transformer-based architecture. The proposed method attains state-of-the-art performance in predicting the next event of a sequence given its history. The experiments also reveal the efficacy of methods that do not rely on the thinning algorithm during inference over those that do. Finally, we test our method on the challenging long-horizon prediction task and find that it outperforms a baseline developed specifically for tackling this task; importantly, inference requires just a fraction of the time compared to the thinning-based baseline.||
|**2024-09-26**|[Supra-Laplacian Encoding for Transformer on Dynamic Graphs](http://arxiv.org/abs/2409.17986)|null|Fully connected Graph Transformers (GT) have rapidly become prominent in the static graph community as an alternative to Message-Passing models, which suffer from a lack of expressivity, oversquashing, and under-reaching. However, in a dynamic context, by interconnecting all nodes at multiple snapshots with self-attention, GTs lose both structural and temporal information. In this work, we introduce Supra-LAplacian encoding for spatio-temporal TransformErs (SLATE), a new spatio-temporal encoding that leverages the GT architecture while keeping spatio-temporal information. Specifically, we transform Discrete Time Dynamic Graphs into multi-layer graphs and take advantage of the spectral properties of their associated supra-Laplacian matrix. Our second contribution explicitly models nodes' pairwise relationships with a cross-attention mechanism, providing an accurate edge representation for dynamic link prediction. SLATE outperforms numerous state-of-the-art methods based on Message-Passing Graph Neural Networks combined with recurrent models (e.g., LSTM), as well as Dynamic Graph Transformers, on 9 datasets. Code and instructions to reproduce our results will be open-sourced.||
|**2024-09-26**|[Self-supervised Monocular Depth Estimation with Large Kernel Attention](http://arxiv.org/abs/2409.17895)|null|Self-supervised monocular depth estimation has emerged as a promising approach since it does not rely on labeled training data. Most methods combine convolution and Transformer to model long-distance dependencies to estimate depth accurately. However, Transformer treats 2D image features as 1D sequences, and positional encoding only somewhat mitigates the loss of spatial information between different feature blocks; such methods tend to overlook channel features, which limits the performance of depth estimation. In this paper, we propose a self-supervised monocular depth estimation network to recover finer details. Specifically, we propose a decoder based on large-kernel attention, which can model long-distance dependencies without compromising the two-dimensional structure of features while maintaining feature channel adaptivity. In addition, we introduce an up-sampling module to accurately recover the fine details in the depth map. Our method achieves competitive results on the KITTI dataset.||
|**2024-09-26**|[CASPFormer: Trajectory Prediction from BEV Images with Deformable Attention](http://arxiv.org/abs/2409.17790)|null|Motion prediction is an important aspect of Autonomous Driving (AD) and Advanced Driver Assistance Systems (ADAS). Current state-of-the-art motion prediction methods rely on High Definition (HD) maps for capturing the surrounding context of the ego vehicle. Such systems lack scalability in real-world deployment as HD maps are expensive to produce and update in real-time. To overcome this issue, we propose Context Aware Scene Prediction Transformer (CASPFormer), which can perform multi-modal motion prediction from rasterized Bird-Eye-View (BEV) images. Our system can be integrated with any upstream perception module that is capable of generating BEV images. Moreover, CASPFormer directly decodes vectorized trajectories without any postprocessing. Trajectories are decoded recurrently using deformable attention, as it is computationally efficient and provides the network with the ability to focus its attention on the important spatial locations of the BEV images. In addition, we also address the issue of mode collapse for generating multiple scene-consistent trajectories by incorporating learnable mode queries. We evaluate our model on the nuScenes dataset and show that it reaches state-of-the-art performance across multiple metrics.||
|**2024-09-26**|[Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition](http://arxiv.org/abs/2409.17746)|null|Attention-based encoder-decoder, e.g. transformer and its variants, generates the output sequence in an autoregressive (AR) manner. Despite its superior performance, AR model is computationally inefficient as its generation requires as many iterations as the output length. In this paper, we propose Paraformer-v2, an improved version of Paraformer, for fast, accurate, and noise-robust non-autoregressive speech recognition. In Paraformer-v2, we use a CTC module to extract the token embeddings, as the alternative to the continuous integrate-and-fire module in Paraformer. Extensive experiments demonstrate that Paraformer-v2 outperforms Paraformer on multiple datasets, especially on the English datasets (over 14% improvement on WER), and is more robust in noisy environments.||
|**2024-09-26**|[Optimal Memorization Capacity of Transformers](http://arxiv.org/abs/2409.17677)|null|Recent research in the field of machine learning has increasingly focused on the memorization capacity of Transformers, but how efficient they are is not yet well understood. We demonstrate that Transformers can memorize labels with $\tilde{O}(\sqrt{N})$ parameters in a next-token prediction setting for $N$ input sequences of length $n$, which is proved to be optimal up to logarithmic factors. This indicates that Transformers can efficiently perform memorization with little influence from the input length $n$ owing to the benefit of parameter sharing. We also analyze the memorization capacity in the sequence-to-sequence setting, and find that $\tilde{O}(\sqrt{nN})$ parameters are not only sufficient, but also necessary at least for Transformers with hardmax. These results suggest that while self-attention mechanisms can efficiently identify input sequences, the feed-forward network becomes a bottleneck when associating a label to each token.||
|**2024-09-26**|[Benign or Not-Benign Overfitting in Token Selection of Attention Mechanism](http://arxiv.org/abs/2409.17625)|null|Modern over-parameterized neural networks can be trained to fit the training data perfectly while still maintaining a high generalization performance. This "benign overfitting" phenomenon has been studied in a surge of recent theoretical work; however, most of these studies have been limited to linear models or two-layer neural networks. In this work, we analyze benign overfitting in the token selection mechanism of the attention architecture, which characterizes the success of transformer models. We first show the existence of a benign overfitting solution and explain its mechanism in the attention architecture. Next, we discuss whether the model converges to such a solution, raising the difficulties specific to the attention architecture. We then present benign overfitting cases and not-benign overfitting cases by conditioning different scenarios based on the behavior of attention probabilities during training. To the best of our knowledge, this is the first study to characterize benign overfitting for the attention mechanism.||
|**2024-09-26**|[Dynamic Subframe Splitting and Spatio-Temporal Motion Entangled Sparse Attention for RGB-E Tracking](http://arxiv.org/abs/2409.17560)|null|Event-based bionic cameras asynchronously capture dynamic scenes with high temporal resolution and high dynamic range, offering potential for the integration of events and RGB under conditions of illumination degradation and fast motion. Existing RGB-E tracking methods model event characteristics utilising the attention mechanism of the Transformer before integrating both modalities. Nevertheless, these methods involve aggregating the event stream into a single event frame, lacking the utilisation of the temporal information inherent in the event stream. Moreover, the traditional attention mechanism is well-suited for dense semantic features, while the attention mechanism for sparse event features requires rethinking. In this paper, we propose a dynamic event subframe splitting strategy to split the event stream into more fine-grained event clusters, aiming to capture spatio-temporal features that contain motion cues. Based on this, we design an event-based sparse attention mechanism to enhance the interaction of event features in temporal and spatial dimensions. The experimental results indicate that our method outperforms existing state-of-the-art methods on the FE240 and COESOT datasets, providing an effective processing manner for event data.||
|**2024-09-26**|[MASSFormer: Mobility-Aware Spectrum Sensing using Transformer-Driven Tiered Structure](http://arxiv.org/abs/2409.17546)|null|In this paper, we develop a novel mobility-aware transformer-driven tiered structure (MASSFormer) based cooperative spectrum sensing method that effectively models the spatio-temporal dynamics of user movements. Unlike existing methods, our method considers a dynamic scenario involving mobile primary users (PUs) and secondary users (SUs) and addresses the complexities introduced by user mobility. The transformer architecture utilizes an attention mechanism, enabling the proposed method to adeptly model the temporal dynamics of user mobility by effectively capturing long-range dependencies within the input data. The proposed method first computes tokens from the sequence of covariance matrices (CMs) for each SU and processes them in parallel using the SU transformer network to learn the spatio-temporal features at the SU level. Subsequently, the collaborative transformer network learns the group-level PU state from all SU-level feature representations. The attention-based sequence pooling method followed by the transformer encoder adjusts the contributions of all tokens. The main goal of predicting the PU states at both the SU level and the group level is to further improve detection performance. We conducted extensive simulations and compared the detection performance of different SS methods. The proposed method is tested under imperfect reporting channel scenarios to show robustness. The efficacy of our method is validated with the simulation results demonstrating its higher performance compared with existing methods in terms of detection probability, sensing error, and classification accuracy.||
|**2024-09-26**|[NeuroPath: A Neural Pathway Transformer for Joining the Dots of Human Connectomes](http://arxiv.org/abs/2409.17510)|**[link](https://github.com/Chrisa142857/neuro_detour)**|Although modern imaging technologies allow us to study connectivity between two distinct brain regions in-vivo, an in-depth understanding of how anatomical structure supports brain function and how spontaneous functional fluctuations give rise to remarkable cognition is still elusive. Meanwhile, tremendous efforts have been made in the realm of machine learning to establish the nonlinear mapping between neuroimaging data and phenotypic traits. However, the absence of neuroscience insight in the current approaches poses significant challenges in understanding cognitive behavior from transient neural activities. To address this challenge, we put the spotlight on the coupling mechanism of structural connectivity (SC) and functional connectivity (FC) by formulating such network neuroscience question into an expressive graph representation learning problem for high-order topology. Specifically, we introduce the concept of topological detour to characterize how a ubiquitous instance of FC (direct link) is supported by neural pathways (detour) physically wired by SC, which forms a cyclic loop interacted by brain structure and function. In the cliché of machine learning, the multi-hop detour pathway underlying SC-FC coupling allows us to devise a novel multi-head self-attention mechanism within Transformer to capture multi-modal feature representation from paired graphs of SC and FC. Taken together, we propose a biological-inspired deep model, coined as NeuroPath, to find putative connectomic feature representations from the unprecedented amount of neuroimages, which can be plugged into various downstream applications such as task recognition and disease diagnosis. We have evaluated NeuroPath on large-scale public datasets including HCP and UK Biobank under supervised and zero-shot learning, where the state-of-the-art performance by our NeuroPath indicates great potential in network neuroscience.||
|**2024-09-25**|[Non-asymptotic Convergence of Training Transformers for Next-token Prediction](http://arxiv.org/abs/2409.17335)|null|Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data, especially in next-token prediction (NTP) tasks. However, the theoretical understanding of their performance in NTP is limited, with existing studies focusing mainly on asymptotic performance. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer consisting of a self-attention module followed by a feed-forward layer. We first characterize the essential structural properties of training datasets for NTP using a mathematical framework based on partial orders. Then, we design a two-stage training algorithm, where the pre-processing stage for training the feed-forward layer and the main stage for training the attention layer exhibit fast convergence performance. Specifically, both layers converge sub-linearly to the direction of their corresponding max-margin solutions. We also show that the cross-entropy loss enjoys a linear convergence rate. Furthermore, we show that the trained transformer presents non-trivial prediction ability with dataset shift, which sheds light on the remarkable generalization performance of transformers. Our analysis technique involves the development of novel properties on the attention gradient and further in-depth analysis of how these properties contribute to the convergence of the training process. Our experiments further validate our theoretical findings.||
|**2024-09-24**|[MonoFormer: One Transformer for Both Diffusion and Autoregression](http://arxiv.org/abs/2409.16280)|**[link](https://github.com/MonoFormer/MonoFormer)**|Most existing multimodality methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or the same backbone by discretizing the visual data to use autoregression for both text and visual generation. In this paper, we propose to study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) Transformer is successfully applied to diffusion for visual generation, and (ii) transformer training for autoregression and diffusion is very similar, and the difference merely lies in that diffusion uses bidirectional attention mask and autoregression uses causal attention mask. Experimental results show that our approach achieves comparable image generation performance to current state-of-the-art methods as well as maintains the text generation capability. The project is publicly available at https://monoformer.github.io/.||
|**2024-09-24**|[TE-PINN: Quaternion-Based Orientation Estimation using Transformer-Enhanced Physics-Informed Neural Networks](http://arxiv.org/abs/2409.16214)|null|This paper introduces a Transformer-Enhanced Physics-Informed Neural Network (TE-PINN) designed for accurate quaternion-based orientation estimation in high-dynamic environments, particularly within the field of robotics. By integrating transformer networks with physics-informed learning, our approach innovatively captures temporal dependencies in sensor data while enforcing the fundamental physical laws governing rotational motion. TE-PINN leverages a multi-head attention mechanism to handle sequential data from inertial sensors, such as accelerometers and gyroscopes, ensuring temporal consistency. Simultaneously, the model embeds quaternion kinematics and rigid body dynamics into the learning process, aligning the network's predictions with mechanical principles like Euler's laws of motion. The physics-informed loss function incorporates the dynamics of angular velocity and external forces, enhancing the network's ability to generalize in complex scenarios. Our experimental evaluation demonstrates that TE-PINN consistently outperforms traditional methods such as Extended Kalman Filters (EKF) and LSTM-based estimators, particularly in scenarios characterized by high angular velocities and noisy sensor data. The results show a significant reduction in mean quaternion error and improved gyroscope bias estimation compared to the state-of-the-art. An ablation study further isolates the contributions of both the transformer architecture and the physics-informed constraints, highlighting the synergistic effect of both components in improving model performance. The proposed model achieves real-time performance on embedded systems typical of mobile robots, offering a scalable and efficient solution for orientation estimation in autonomous systems.||
|**2024-09-24**|[Self-attention as an attractor network: transient memories without backpropagation](http://arxiv.org/abs/2409.16112)|**[link](https://github.com/francill99/self_attention_attractor_network)**|Transformers are one of the most successful architectures of modern neural networks. At their core there is the so-called attention mechanism, which recently interested the physics community as it can be written as the derivative of an energy function in certain cases: while it is possible to write the cross-attention layer as a modern Hopfield network, the same is not possible for the self-attention, which is used in the GPT architectures and other autoregressive models. In this work we show that it is possible to obtain the self-attention layer as the derivative of local energy terms, which resemble a pseudo-likelihood. We leverage the analogy with pseudo-likelihood to design a recurrent model that can be trained without backpropagation: the dynamics shows transient states that are strongly correlated with both train and test examples. Overall we present a novel framework to interpret self-attention as an attractor network, potentially paving the way for new theoretical approaches inspired from physics to understand transformers.||
|**2024-09-24**|[Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR](http://arxiv.org/abs/2409.15869)|**[link](https://github.com/aiola-lab/whisper-medusa)**|Large transformer-based models have significant potential for speech transcription and translation. Their self-attention mechanisms and parallel processing enable them to capture complex patterns and dependencies in audio sequences. However, this potential comes with challenges, as these large and computationally intensive models lead to slow inference speeds. Various optimization strategies have been proposed to improve performance, including efficient hardware utilization and algorithmic enhancements. In this paper, we introduce Whisper-Medusa, a novel approach designed to enhance processing speed with minimal impact on Word Error Rate (WER). The proposed model extends the OpenAI's Whisper architecture by predicting multiple tokens per iteration, resulting in a 50% reduction in latency. We showcase the effectiveness of Whisper-Medusa across different learning setups and datasets.||
|**2024-09-23**|[SOFI: Multi-Scale Deformable Transformer for Camera Calibration with Enhanced Line Queries](http://arxiv.org/abs/2409.15553)|**[link](https://github.com/sebastianjanampa/sofi)**|Camera calibration consists of estimating camera parameters such as the zenith vanishing point and horizon line. Estimating the camera parameters allows other tasks like 3D rendering, artificial reality effects, and object insertion in an image. Transformer-based models have provided promising results; however, they lack cross-scale interaction. In this work, we introduce *multi-Scale defOrmable transFormer for camera calibratIon with enhanced line queries*, SOFI. SOFI improves the line queries used in CTRL-C and MSCC by using both line content and line geometric features. Moreover, SOFI's line queries allow transformer models to adopt the multi-scale deformable attention mechanism to promote cross-scale interaction between the feature maps produced by the backbone. SOFI outperforms existing methods on the *Google Street View*, *Horizon Line in the Wild*, and *Holicity* datasets while keeping a competitive inference speed.||
|**2024-09-23**|[Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer](http://arxiv.org/abs/2409.15117)|null|Vision-based perception and reasoning is essential for scene understanding in any autonomous system. RGB and depth images are commonly used to capture both the semantic and geometric features of the environment. Developing methods to reliably interpret this data is critical for real-world applications, where noisy measurements are often unavoidable. In this work, we introduce a diffusion-based framework to address the RGB-D semantic segmentation problem. Additionally, we demonstrate that utilizing a Deformable Attention Transformer as the encoder to extract features from depth images effectively captures the characteristics of invalid regions in depth measurements. Our generative framework shows a greater capacity to model the underlying distribution of RGB-D images, achieving robust performance in challenging scenarios with significantly less training time compared to discriminative methods. Experimental results indicate that our approach achieves State-of-the-Art performance on both the NYUv2 and SUN-RGBD datasets overall, and especially on their most challenging images. Our project page will be available at https://diffusionmms.github.io/||
|**2024-09-24**|[Efficiently Dispatching Flash Attention For Partially Filled Attention Masks](http://arxiv.org/abs/2409.15097)|null|Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices. Examples include attention masks designed to reduce the quadratic complexity of attention, sequence packing techniques, and recent innovations like tree masking for fast validation in MEDUSA. Despite the inherent sparsity in these matrices, the state-of-the-art algorithm Flash Attention still processes them with quadratic complexity as though they were dense. In this paper, we introduce Binary Block Masking, a highly efficient modification that enhances Flash Attention by making it mask-aware. We further propose two optimizations: one tailored for masks with contiguous non-zero patterns and another for extremely sparse masks. Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement. The implementation will be publicly released to foster further research and application.||
|**2024-09-23**|[Kriformer: A Novel Spatiotemporal Kriging Approach Based on Graph Transformers](http://arxiv.org/abs/2409.14906)|null|Accurately estimating data in sensor-less areas is crucial for understanding system dynamics, such as traffic state estimation and environmental monitoring. This study addresses challenges posed by sparse sensor deployment and unreliable data by framing the problem as a spatiotemporal kriging task and proposing a novel graph transformer model, Kriformer. This model estimates data at locations without sensors by mining spatial and temporal correlations, even with limited resources. Kriformer utilizes transformer architecture to enhance the model's perceptual range and solve edge information aggregation challenges, capturing spatiotemporal information effectively. A carefully constructed positional encoding module embeds the spatiotemporal features of nodes, while a sophisticated spatiotemporal attention mechanism enhances estimation accuracy. The multi-head spatial interaction attention module captures subtle spatial relationships between observed and unobserved locations. During training, a random masking strategy prompts the model to learn with partial information loss, allowing the spatiotemporal embedding and multi-head attention mechanisms to synergistically capture correlations among locations. Experimental results show that Kriformer excels in representation learning for unobserved locations, validated on two real-world traffic speed datasets, demonstrating its effectiveness in spatiotemporal kriging tasks.||
|**2024-09-23**|[A-VL: Adaptive Attention for Large Vision-Language Models](http://arxiv.org/abs/2409.14846)|null|The Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention techniques can dynamically reduce computational redundancy and thus improve efficiency. Although current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, they are not tailored for LVLMs. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns. This observation inspires us to manage the attention for each modality separately. Specifically, for visual input, we store the cache of potentially useful information but only compute the most critical parts. For language input, we care more about local information. Based on our observation and analysis of vision-language attention patterns, we develop A-VL, a plug-and-play adaptive attention tailored for LVLM inference. Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. Our approach A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.||
|**2024-09-23**|[RoWSFormer: A Robust Watermarking Framework with Swin Transformer for Enhanced Geometric Attack Resilience](http://arxiv.org/abs/2409.14829)|null|In recent years, digital watermarking techniques based on deep learning have been widely studied. To achieve both imperceptibility and robustness of image watermarks, most current methods employ convolutional neural networks to build robust watermarking frameworks. However, despite the success of CNN-based watermarking models, they struggle to achieve robustness against geometric attacks due to the limitations of convolutional neural networks in capturing global and long-range relationships. To address this limitation, we propose a robust watermarking framework based on the Swin Transformer, named RoWSFormer. Specifically, we design the Locally-Channel Enhanced Swin Transformer Block as the core of both the encoder and decoder. This block utilizes the self-attention mechanism to capture global and long-range information, thereby significantly improving adaptation to geometric distortions. Additionally, we construct the Frequency-Enhanced Transformer Block to extract frequency domain information, which further strengthens the robustness of the watermarking framework. Experimental results demonstrate that our RoWSFormer surpasses existing state-of-the-art watermarking methods. For most non-geometric attacks, RoWSFormer improves the PSNR by 3 dB while maintaining the same extraction accuracy. In the case of geometric attacks (such as rotation, scaling, and affine transformations), RoWSFormer achieves over a 6 dB improvement in PSNR, with extraction accuracy exceeding 97%.||
|**2024-09-18**|[On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery](http://arxiv.org/abs/2409.12026)|null|Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor due to the complex and varied underwater environments. Historically, experts have manually interpreted SSS images, relying on conventional machine learning techniques with hand-crafted features. While Convolutional Neural Networks (CNNs) significantly advanced automated classification in this domain, they often fall short when dealing with diverse seafloor textures, such as rocky or ripple sand bottoms, where false positive rates may increase. Recently, Vision Transformers (ViTs) have shown potential in addressing these limitations by utilizing a self-attention mechanism to capture global information in image patches, offering more flexibility in processing spatial hierarchies. This paper rigorously compares the performance of ViT models alongside commonly used CNN architectures, such as ResNet and ConvNext, for binary classification tasks in SSS imagery. The dataset encompasses diverse geographical seafloor types and is balanced between the presence and absence of man-made objects. ViT-based models exhibit superior classification performance across f1-score, precision, recall, and accuracy metrics, although at the cost of greater computational resources. CNNs, with their inductive biases, demonstrate better computational efficiency, making them suitable for deployment in resource-constrained environments like underwater vehicles. Future research directions include exploring self-supervised learning for ViTs and multi-modal fusion to further enhance performance in challenging underwater environments.||
|**2024-09-17**|[A short trajectory is all you need: A transformer-based model for long-time dissipative quantum dynamics](http://arxiv.org/abs/2409.11320)|**[link](https://github.com/kananenka-group/Transformer-spin-boson)**|In this communication we demonstrate that a deep artificial neural network based on a transformer architecture with self-attention layers can predict the long-time population dynamics of a quantum system coupled to a dissipative environment provided that the short-time population dynamics of the system is known. The transformer neural network model developed in this work predicts the long-time dynamics of spin-boson model efficiently and very accurately across different regimes, from weak system-bath coupling to strong coupling non-Markovian regimes. Our model is more accurate than classical forecasting models, such as recurrent neural networks and is comparable to the state-of-the-art models for simulating the dynamics of quantum dissipative systems, based on kernel ridge regression.||
|**2024-09-17**|[Linear Recency Bias During Training Improves Transformers' Fit to Reading Times](http://arxiv.org/abs/2409.11250)|null|Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates with ALiBi show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi's mixture of slopes -- which determine the rate of memory decay in each attention head -- may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.||
|**2024-09-17**|[Contrasformer: A Brain Network Contrastive Transformer for Neurodegenerative Condition Identification](http://arxiv.org/abs/2409.10944)|**[link](https://github.com/angusmonroe/contrasformer)**|Understanding neurological disorder is a fundamental problem in neuroscience, which often requires the analysis of brain networks derived from functional magnetic resonance imaging (fMRI) data. Despite the prevalence of Graph Neural Networks (GNNs) and Graph Transformers in various domains, applying them to brain networks faces challenges. Specifically, the datasets are severely impacted by the noises caused by distribution shifts across sub-populations and the neglect of node identities, both of which obstruct the identification of disease-specific patterns. To tackle these challenges, we propose Contrasformer, a novel contrastive brain network Transformer. It generates a prior-knowledge-enhanced contrast graph to address the distribution shifts across sub-populations by a two-stream attention mechanism. A cross attention with identity embedding highlights the identity of nodes, and three auxiliary losses ensure group consistency. Evaluated on 4 functional brain network datasets over 4 different diseases, Contrasformer outperforms the state-of-the-art methods for brain networks by achieving up to 10.8% improvement in accuracy, which demonstrates its efficacy in neurological disorder identification. Case studies illustrate its interpretability, especially in the context of neuroscience. This paper provides a solution for analyzing brain networks, offering valuable insights into neurological disorders. Our code is available at https://github.com/AngusMonroe/Contrasformer.||
|**2024-09-17**|[Adaptive Large Language Models By Layerwise Attention Shortcuts](http://arxiv.org/abs/2409.10870)|null|Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we propose to challenge this and introduce adaptive computations for LLM-like setups, which allow the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism, thereby introducing computational \textbf{attention shortcuts}. These shortcuts can thus make the architecture depth and context adaptive. We showcase four different datasets, namely acoustic tokens, natural language, and symbolic music, and we achieve superior performance for GPT-like architecture. We give evidence via attention maps that the models learn complex dependencies across layers that are adaptive in context and depth depending on the input tokens.||
|**2024-09-16**|[Recurrent Graph Transformer Network for Multiple Fault Localization in Naval Shipboard Systems](http://arxiv.org/abs/2409.10792)|null|The integration of power electronics building blocks in modern MVDC 12kV Naval ship systems enhances energy management and functionality but also introduces complex fault detection and control challenges. These challenges strain traditional fault diagnostic methods, making it difficult to detect and manage faults across multiple locations while maintaining system stability and performance. This paper proposes a temporal recurrent graph transformer network for fault diagnosis in naval MVDC 12kV shipboard systems. The deep graph neural network uses gated recurrent units to capture temporal features and a multi-head attention mechanism to extract spatial features, enhancing diagnostic accuracy. The approach effectively identifies and evaluates successive multiple faults with high precision. The method is implemented and validated on the MVDC 12kV shipboard system designed by the ESDRC team, incorporating all key components. Results show significant improvements in fault localization accuracy, with a 1-4% increase in performance metrics compared to other machine learning methods.||
|**2024-09-16**|[Self-Attention Limits Working Memory Capacity of Transformer-Based Models](http://arxiv.org/abs/2409.10715)|null|Recent work on Transformer-based large language models (LLMs) has revealed striking limits in their working memory capacity, similar to what has been found in human behavioral studies. Specifically, these models' performance drops significantly on N-back tasks as N increases. However, there is still a lack of mechanistic interpretability as to why this phenomenon would arise. Inspired by the executive attention theory from behavioral sciences, we hypothesize that the self-attention mechanism within Transformer-based models might be responsible for their working memory capacity limits. To test this hypothesis, we train vanilla decoder-only transformers to perform N-back tasks and find that attention scores gradually aggregate to the N-back positions over training, suggesting that the model masters the task by learning a strategy to pay attention to the relationship between the current position and the N-back position. Critically, we find that the total entropy of the attention score matrix increases as N increases, suggesting that the dispersion of attention scores might be the cause of the capacity limit observed in N-back tasks.||
|**2024-09-16**|[Logic Synthesis Optimization with Predictive Self-Supervision via Causal Transformers](http://arxiv.org/abs/2409.10653)|null|Contemporary hardware design benefits from the abstraction provided by high-level logic gates, streamlining the implementation of logic circuits. Logic Synthesis Optimization (LSO) operates at one level of abstraction within the Electronic Design Automation (EDA) workflow, targeting improvements in logic circuits with respect to performance metrics such as size and speed in the final layout. Recent trends in the field show a growing interest in leveraging Machine Learning (ML) for EDA, notably through ML-guided logic synthesis utilizing policy-based Reinforcement Learning (RL) methods. Despite these advancements, existing models face challenges such as overfitting and limited generalization, attributed to constrained public circuits and the expressiveness limitations of graph encoders. To address these hurdles, and tackle data scarcity issues, we introduce LSOformer, a novel approach harnessing Autoregressive transformer models and predictive SSL to predict the trajectory of Quality of Results (QoR). LSOformer integrates cross-attention modules to merge insights from circuit graphs and optimization sequences, thereby enhancing prediction accuracy for QoR metrics. Experimental studies validate the effectiveness of LSOformer, showcasing its superior performance over baseline architectures in QoR prediction tasks, where it achieves improvements of 5.74%, 4.35%, and 17.06% on the EPFL, OABCD, and proprietary circuits datasets, respectively, in the inductive setup.||
|**2024-09-16**|[Garment Attribute Manipulation with Multi-level Attention](http://arxiv.org/abs/2409.10206)|null|In the rapidly evolving field of online fashion shopping, the need for more personalized and interactive image retrieval systems has become paramount. Existing methods often struggle with precisely manipulating specific garment attributes without inadvertently affecting others. To address this challenge, we propose GAMMA (Garment Attribute Manipulation with Multi-level Attention), a novel framework that integrates attribute-disentangled representations with a multi-stage attention-based architecture. GAMMA enables targeted manipulation of fashion image attributes, allowing users to refine their searches with high accuracy. By leveraging a dual-encoder Transformer and memory block, our model achieves state-of-the-art performance on popular datasets like Shopping100k and DeepFashion.||
|**2024-09-14**|[Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens](http://arxiv.org/abs/2409.09513)|null|Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of auto-regressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent's future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model's policy through the interpretable plan visualisations and attention map.||
|**2024-09-14**|[TransformerMPC: Accelerating Model Predictive Control via Transformers](http://arxiv.org/abs/2409.09266)|null|In this paper, we address the problem of reducing the computational burden of Model Predictive Control (MPC) for real-time robotic applications. We propose TransformerMPC, a method that enhances the computational efficiency of MPC algorithms by leveraging the attention mechanism in transformers for both online constraint removal and better warm start initialization. Specifically, TransformerMPC accelerates the computation of optimal control inputs by selecting only the active constraints to be included in the MPC problem, while simultaneously providing a warm start to the optimization process. This approach ensures that the original constraints are satisfied at optimality. TransformerMPC is designed to be seamlessly integrated with any MPC solver, irrespective of its implementation. To guarantee constraint satisfaction after removing inactive constraints, we perform an offline verification to ensure that the optimal control inputs generated by the MPC solver meet all constraints. The effectiveness of TransformerMPC is demonstrated through extensive numerical simulations on complex robotic systems, achieving up to 35x improvement in runtime without any loss in performance.||
|**2024-09-13**|[SGFormer: Single-Layer Graph Transformers with Approximation-Free Linear Complexity](http://arxiv.org/abs/2409.09007)|**[link](https://github.com/qitianwu/sgformer)**|Learning representations on large graphs is a long-standing challenge due to the inter-dependent nature of graph data. Transformers recently have shown promising performance on small graphs thanks to their global attention for capturing all-pair interactions beyond observed structures. Existing approaches tend to inherit the spirit of Transformers in language and vision tasks, and embrace complicated architectures by stacking deep attention-based propagation layers. In this paper, we attempt to evaluate the necessity of adopting multi-layer attentions in Transformers on graphs, which considerably restricts the efficiency. Specifically, we analyze a generic hybrid propagation layer, comprised of all-pair attention and graph-based propagation, and show that multi-layer propagation can be reduced to one-layer propagation, with the same capability for representation learning. It suggests a new technical path for building powerful and efficient Transformers on graphs, particularly through simplifying model architectures without sacrificing expressiveness. As exemplified by this work, we propose a Simplified Single-layer Graph Transformer (SGFormer), whose main component is a single-layer global attention that scales linearly w.r.t. graph sizes and requires no approximation to accommodate all-pair interactions. Empirically, SGFormer successfully scales to the web-scale graph ogbn-papers100M, yielding orders-of-magnitude inference acceleration over peer Transformers on medium-sized graphs, and demonstrates competitiveness with limited labeled data.||
|**2024-09-13**|[Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry](http://arxiv.org/abs/2409.08769)|**[link](https://github.com/ybkurt/vift)**|In recent years, transformer-based architectures have become the de facto standard for sequence modeling in deep learning frameworks. Inspired by these successful examples, we propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry. This study aims to improve pose estimation accuracy by leveraging the attention mechanisms in transformers, which better utilize historical data compared to the recurrent neural network (RNN) based methods seen in recent work. Transformers typically require large-scale data for training. To address this issue, we utilize inductive biases for deep VIO networks. Since latent visual-inertial feature vectors encompass essential information for pose estimation, we employ transformers to refine pose estimates by updating latent vectors temporally. Our study also examines the impact of data imbalance and rotation learning methods in supervised end-to-end learning of visual inertial odometry by utilizing specialized gradients in backpropagation for the elements of SE $(3)$ group. The proposed method is end-to-end trainable and requires only a monocular camera and IMU during inference. Experimental results demonstrate that VIFT increases the accuracy of monocular VIO networks, achieving state-of-the-art results when compared to previous methods on the KITTI dataset. The code will be made available at https://github.com/ybkurt/VIFT.||
|**2024-09-13**|[SkinFormer: Learning Statistical Texture Representation with Transformer for Skin Lesion Segmentation](http://arxiv.org/abs/2409.08652)|**[link](https://github.com/rongtao-xu/skinformer)**|Accurate skin lesion segmentation from dermoscopic images is of great importance for skin cancer diagnosis. However, automatic segmentation of melanoma remains a challenging task because it is difficult to incorporate useful texture representations into the learning process. Texture representations are not only related to the local structural information learned by CNN, but also include the global statistical texture information of the input image. In this paper, we propose a transFormer network (SkinFormer) that efficiently extracts and fuses statistical texture representations for Skin lesion segmentation. Specifically, to quantify the statistical texture of input features, a Kurtosis-guided Statistical Counting Operator is designed. With its help, we propose the Statistical Texture Fusion Transformer and the Statistical Texture Enhance Transformer, which exploit the transformer's global attention mechanism. The former fuses structural texture information and statistical texture information, and the latter enhances the statistical texture of multi-scale features. Extensive experiments on three publicly available skin lesion datasets validate that our SkinFormer outperforms other SOTA methods, and our method achieves a 93.2% Dice score on ISIC 2018. SkinFormer can easily be extended to segment 3D images in the future. Our code is available at https://github.com/Rongtao-Xu/SkinFormer.||
|**2024-09-13**|[VistaFormer: Scalable Vision Transformers for Satellite Image Time Series Segmentation](http://arxiv.org/abs/2409.08461)|**[link](https://github.com/macdonaldezra/VistaFormer)**|We introduce VistaFormer, a lightweight Transformer-based model architecture for the semantic segmentation of remote-sensing images. This model uses a multi-scale Transformer-based encoder with a lightweight decoder that aggregates global and local attention captured in the encoder blocks. VistaFormer uses position-free self-attention layers, which simplify the model architecture and remove the need to interpolate temporal and spatial codes, which can reduce model performance when training and testing image resolutions differ. We investigate simple techniques for filtering noisy input signals like clouds and demonstrate that improved model scalability can be achieved by substituting Multi-Head Self-Attention (MHSA) with Neighbourhood Attention (NA). Experiments on the PASTIS and MTLCC crop-type segmentation benchmarks show that VistaFormer achieves better performance than comparable models and requires only 8% of the floating point operations using MHSA and 11% using NA while also using fewer trainable parameters. VistaFormer with MHSA improves on state-of-the-art mIoU scores by 0.1% on the PASTIS benchmark and 3% on the MTLCC benchmark while VistaFormer with NA improves on the MTLCC benchmark by 3.7%.||
|**2024-09-12**|[SDformer: Efficient End-to-End Transformer for Depth Completion](http://arxiv.org/abs/2409.08159)|**[link](https://github.com/jamesqian11/sdformer-for-depth-completion)**|Depth completion aims to predict dense depth maps with sparse depth measurements from a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks. However, despite their excellent performance, they suffer from a limited receptive field. To overcome the drawbacks of CNNs, a more effective and powerful method has been presented: the Transformer, an adaptive sequence-to-sequence model built on self-attention. However, the standard Transformer's computational cost grows quadratically with input resolution due to the key-query dot product, which makes it poorly suited to depth completion tasks. In this work, we propose a different window-based Transformer architecture for depth completion tasks named Sparse-to-Dense Transformer (SDformer). The network consists of an input module for the depth map and RGB image feature extraction and concatenation, a U-shaped encoder-decoder Transformer for extracting deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features through the input module. Then, instead of calculating self-attention with the whole feature maps, we apply different window sizes to extract the long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to get the enriched depth features and employ a convolution layer to obtain the dense depth map. In practice, the SDformer obtains state-of-the-art results against the CNN-based depth completion models with lower computational load and fewer parameters on the NYU Depth V2 and KITTI DC datasets.||
|**2024-09-12**|[InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bimanual Manipulation](http://arxiv.org/abs/2409.07914)|null|We present InterACT: Inter-dependency aware Action Chunking with Hierarchical Attention Transformers, a novel imitation learning framework for bimanual manipulation that integrates hierarchical attention to capture inter-dependencies between dual-arm joint states and visual inputs. InterACT consists of a Hierarchical Attention Encoder and a Multi-arm Decoder, both designed to enhance information aggregation and coordination. The encoder processes multi-modal inputs through segment-wise and cross-segment attention mechanisms, while the decoder leverages synchronization blocks to refine individual action predictions, providing the counterpart's prediction as context. Our experiments on a variety of simulated and real-world bimanual manipulation tasks demonstrate that InterACT significantly outperforms existing methods. Detailed ablation studies validate the contributions of key components of our work, including the impact of CLS tokens, cross-segment encoders, and synchronization blocks.||
|**2024-09-12**|[Lagrange Duality and Compound Multi-Attention Transformer for Semi-Supervised Medical Image Segmentation](http://arxiv.org/abs/2409.07793)|**[link](https://github.com/lzeeorno/lagrange-duality-and-cmaformer)**|Medical image segmentation, a critical application of semantic segmentation in healthcare, has seen significant advancements through specialized computer vision techniques. While deep learning-based medical image segmentation is essential for assisting in medical diagnosis, the lack of diverse training data causes the long-tail problem. Moreover, most previous hybrid CNN-ViT architectures have limited ability to combine various attentions in different layers of the Convolutional Neural Network. To address these issues, we propose a Lagrange Duality Consistency (LDC) Loss, integrated with Boundary-Aware Contrastive Loss, as the overall training objective for semi-supervised learning to mitigate the long-tail problem. Additionally, we introduce CMAformer, a novel network that synergizes the strengths of ResUNet and Transformer. The cross-attention block in CMAformer effectively integrates spatial attention and channel attention for multi-scale feature fusion. Overall, our results indicate that CMAformer, combined with the feature fusion framework and the new consistency loss, demonstrates strong complementarity in semi-supervised learning ensembles. We achieve state-of-the-art results on multiple public medical image datasets. Example code is available at: https://github.com/lzeeorno/Lagrange-Duality-and-CMAformer.||
|**2024-09-11**|[ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers](http://arxiv.org/abs/2409.07541)|**[link](https://github.com/gsavathrakis/enact)**|Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection. However, they require considerable computational resources due to the quadratic size of the attention weights. In this work, we propose to cluster the transformer input on the basis of its entropy. The reason for this is that the self-information of each pixel (whose sum is the entropy) is likely to be similar among pixels corresponding to the same objects. Clustering reduces the size of data given as input to the transformer and therefore reduces training time and GPU memory usage, while at the same time preserving meaningful information to be passed through the remaining parts of the network. The proposed process is organized in a module called ENACT, which can be plugged into any transformer architecture whose encoder includes a multi-head self-attention computation. We ran extensive experiments using the COCO object detection dataset, and three detection transformers. The obtained results demonstrate that in all tested cases, there is consistent reduction in the required computational resources, while the precision of the detection task is only slightly reduced. The code of the ENACT module will become available at https://github.com/GSavathrakis/ENACT||
|**2024-09-11**|[Gated Slot Attention for Efficient Linear-Time Sequence Modeling](http://arxiv.org/abs/2409.07146)|**[link](https://github.com/sustcsonglin/flash-linear-attention)**|Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.||
|**2024-09-11**|[Enhancing Cross-domain Pre-Trained Decision Transformers with Adaptive Attention](http://arxiv.org/abs/2409.06985)|null|Recently, the pre-training of decision transformers (DT) using a different domain, such as natural language text, has generated significant attention in offline reinforcement learning (Offline RL). Although this cross-domain pre-training approach achieves superior performance compared to training from scratch in environments requiring short-term planning ability, the mechanisms by which pre-training benefits the fine-tuning phase remain unclear. Furthermore, we point out that the cross-domain pre-training approach hinders the extraction of distant information in environments like PointMaze that require long-term planning ability, leading to performance that is much worse than training DT from scratch. This work first analyzes these issues and finds that the Markov Matrix, a component present in pre-trained attention heads, is key to explaining the significant performance disparity of pre-trained models across different planning abilities. Inspired by our analysis, we propose a general method, GPT-DTMA, which equips a pre-trained DT with a Mixture of Attention (MoA) to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, and in long-term environments, it mitigates the negative impact caused by the Markov Matrix, achieving results comparable to those of DT trained from scratch.||
|**2024-09-11**|[Brain-Inspired Stepwise Patch Merging for Vision Transformers](http://arxiv.org/abs/2409.06963)|null|The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain's ability to integrate global and local information for comprehensive visual understanding, we propose a novel technique called Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better. SPM comprises two critical modules: Multi-Scale Aggregation (MSA) and Guided Local Enhancement (GLE). The MSA module integrates multi-scale features to enrich feature representation, while the GLE module focuses on refining local detail extraction, thus achieving an optimal balance between long-range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. These results underscore the efficacy of SPM in enhancing model accuracy and robustness across a wide range of computer vision tasks.||
|**2024-09-10**|[A Practical Gated Recurrent Transformer Network Incorporating Multiple Fusions for Video Denoising](http://arxiv.org/abs/2409.06603)|null|State-of-the-art (SOTA) video denoising methods employ multi-frame simultaneous denoising mechanisms, resulting in significant delays (e.g., 16 frames), making them impractical for real-time cameras. To overcome this limitation, we propose a multi-fusion gated recurrent Transformer network (GRTN) that achieves SOTA denoising performance with only a single-frame delay. Specifically, the spatial denoising module extracts features from the current frame, while the reset gate selects relevant information from the previous frame and fuses it with current frame features via the temporal denoising module. The update gate then further blends this result with the previous frame features, and the reconstruction module integrates it with the current frame. To robustly compute attention for noisy features, we propose a residual simplified Swin Transformer with Euclidean distance (RSSTE) in the spatial and temporal denoising modules. Comparative objective and subjective results show that our GRTN achieves denoising performance comparable to SOTA multi-frame delay networks, with only a single-frame delay.||
|**2024-09-10**|[Lightweight Multiscale Feature Fusion Super-Resolution Network Based on Two-branch Convolution and Transformer](http://arxiv.org/abs/2409.06590)|null|Deep-learning-based single image super-resolution (SISR) currently has two main model families: one based on convolutional neural networks and the other based on the Transformer. The former stacks convolutional layers with different kernel sizes, enabling the model to better extract local features of the image; the latter uses the self-attention mechanism to establish long-distance dependencies between image pixels and thus better extract global features. However, both approaches face their own problems. Based on this, this paper proposes a new lightweight multi-scale feature fusion network model based on complementary two-branch convolutional and Transformer paths, which integrates the respective strengths of the Transformer and convolutional neural networks through a two-branch network architecture to realize the mutual fusion of global and local information. Meanwhile, considering the partial loss of information when low-pixel images pass through a deep neural network, this paper designs a modular connection method of multi-stage feature supplementation that fuses feature maps extracted in the shallow stage of the model with those extracted in the deep stage, to minimize the loss of information beneficial to image restoration and facilitate obtaining a higher-quality restored image. Experimental results show that the proposed model achieves the best image recovery performance when compared with other lightweight models with the same number of parameters.||
|**2024-09-10**|[Knowledge Distillation via Query Selection for Detection Transformer](http://arxiv.org/abs/2409.06443)|null|Transformers have revolutionized the object detection landscape by introducing DETRs, acclaimed for their simplicity and efficacy. Despite their advantages, the substantial size of these models poses significant challenges for practical deployment, particularly in resource-constrained environments. This paper addresses the challenge of compressing DETR by leveraging knowledge distillation, a technique that holds promise for maintaining model performance while reducing size. A critical aspect of DETRs' performance is their reliance on queries to interpret object representations accurately. Traditional distillation methods often focus exclusively on positive queries, identified through bipartite matching, neglecting the rich information present in hard-negative queries. Our visual analysis indicates that hard-negative queries, focusing on foreground elements, are crucial for enhancing distillation outcomes. To this end, we introduce a novel Group Query Selection strategy, which diverges from traditional query selection in DETR distillation by segmenting queries based on their Generalized Intersection over Union (GIoU) with ground truth objects, thereby uncovering valuable hard-negative queries for distillation. Furthermore, we present the Knowledge Distillation via Query Selection for DETR (QSKD) framework, which incorporates Attention-Guided Feature Distillation (AGFD) and Local Alignment Prediction Distillation (LAPD). These components optimize the distillation process by focusing on the most informative aspects of the teacher model's intermediate features and output. Our comprehensive experimental evaluation of the MS-COCO dataset demonstrates the effectiveness of our approach, significantly improving average precision (AP) across various DETR architectures without incurring substantial computational costs. Specifically, the AP of Conditional DETR ResNet-18 increased from 35.8 to 39.9.||
|**2024-09-10**|[AgileIR: Memory-Efficient Group Shifted Windows Attention for Agile Image Restoration](http://arxiv.org/abs/2409.06206)|null|Image Transformers show a magnificent success in Image Restoration tasks. Nevertheless, most transformer-based models are strictly bounded by exorbitant memory occupancy. Our goal is to reduce the memory consumption of the Swin Transformer and at the same time speed up the model during the training process. Thus, we introduce AgileIR, a group shifted attention mechanism along with window attention, which sparsely simplifies the model architecture. We propose Group Shifted Window Attention (GSWA) to decompose Shift Window Multi-head Self Attention (SW-MSA) and Window Multi-head Self Attention (W-MSA) into groups across their attention heads, contributing to shrinking memory usage in back propagation. In addition, we keep shifted window masking and its shifted learnable biases during training, in order to induce the model to interact across windows within the channel. We also re-allocate projection parameters to accelerate attention matrix calculation, which we found causes only a negligible decrease in performance. In experiments, compared with our baseline SwinIR and other efficient quantization models, AgileIR maintains performance at 32.20 dB on the Set5 evaluation dataset, exceeding other tailor-made efficient methods, and saves over 50% memory when a large batch size is employed.||
|**2024-09-09**|[ReL-SAR: Representation Learning for Skeleton Action Recognition with Convolutional Transformers and BYOL](http://arxiv.org/abs/2409.05749)|null|Extracting robust and generalizable features for skeleton action recognition usually requires large amounts of well-annotated data, which labeling and computation costs make extremely challenging. Unsupervised representation learning from unlabeled skeleton data is therefore of great value. This work investigates unsupervised representation learning for skeleton action recognition. To this end, we design a lightweight convolutional transformer framework, named ReL-SAR, which exploits the complementarity of convolutional and attention layers to jointly model spatial and temporal cues in skeleton sequences. We also employ a Selection-Permutation strategy on skeleton joints to obtain more informative descriptions from skeletal data. Finally, we capitalize on Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data. We achieve very competitive results on the limited-size datasets MCAD, IXMAS, JHMDB, and NW-UCLA, demonstrating the effectiveness of our proposed method against state-of-the-art methods in terms of both performance and computational efficiency. To ensure reproducibility and reusability, the source code, including all implementation parameters, is provided at: https://github.com/SafwenNaimi/Representation-Learning-for-Skeleton-Action-Recognition-with-Convolutional-Transformers-and-BYOL||
|**2024-09-09**|[DSDFormer: An Innovative Transformer-Mamba Framework for Robust High-Precision Driver Distraction Identification](http://arxiv.org/abs/2409.05587)|null|Driver distraction remains a leading cause of traffic accidents, posing a critical threat to road safety globally. As intelligent transportation systems evolve, accurate and real-time identification of driver distraction has become essential. However, existing methods struggle to capture both global contextual and fine-grained local features while contending with noisy labels in training datasets. To address these challenges, we propose DSDFormer, a novel framework that integrates the strengths of Transformer and Mamba architectures through a Dual State Domain Attention (DSDA) mechanism, enabling a balance between long-range dependencies and detailed feature extraction for robust driver behavior recognition. Additionally, we introduce Temporal Reasoning Confident Learning (TRCL), an unsupervised approach that refines noisy labels by leveraging spatiotemporal correlations in video sequences. Our model achieves state-of-the-art performance on the AUC-V1, AUC-V2, and 100-Driver datasets and demonstrates real-time processing efficiency on the NVIDIA Jetson AGX Orin platform. Extensive experimental results confirm that DSDFormer and TRCL significantly improve both the accuracy and robustness of driver distraction detection, offering a scalable solution to enhance road safety.||
|**2024-09-10**|[Retrofitting Temporal Graph Neural Networks with Transformer](http://arxiv.org/abs/2409.05477)|**[link](https://github.com/qianghuangwhu/tf-tgn)**|Temporal graph neural networks (TGNNs) outperform regular GNNs by incorporating time information into graph-based operations. However, TGNNs adopt specialized models (e.g., TGN, TGAT, and APAN) and require tailored training frameworks (e.g., TGL and ETC). In this paper, we propose TF-TGN, which uses a Transformer decoder as the backbone model for TGNN to enjoy Transformer's codebase for efficient training. In particular, Transformer achieves tremendous success for language modeling, and thus the community developed high-performance kernels (e.g., flash-attention and memory-efficient attention) and efficient distributed training schemes (e.g., PyTorch FSDP, DeepSpeed, and Megatron-LM). We observe that TGNN resembles language modeling, i.e., the message aggregation operation between chronologically occurring nodes and their temporal neighbors in TGNNs can be structured as sequence modeling. Besides this similarity, we also incorporate a series of algorithm designs including suffix infilling, temporal graph attention with self-loop, and causal masking self-attention to make TF-TGN work. During training, existing systems are slow in transforming the graph topology and conducting graph sampling. As such, we propose methods to parallelize the CSR format conversion and graph sampling. We also adapt the Transformer codebase to train TF-TGN efficiently with multiple GPUs. We experiment with 9 graphs and compare with 2 state-of-the-art TGNN training frameworks. The results show that TF-TGN can accelerate training by over 2.20x while providing comparable or even superior accuracy to existing SOTA TGNNs. TF-TGN is available at https://github.com/qianghuangwhu/TF-TGN.||
|**2024-09-08**|[Low Latency Transformer Inference on FPGAs for Physics Applications with hls4ml](http://arxiv.org/abs/2409.05207)|null|This study presents an efficient implementation of transformer architectures on Field-Programmable Gate Arrays (FPGAs) using hls4ml. We demonstrate the strategy for implementing the multi-head attention, softmax, and normalization layers and evaluate three distinct models. Their deployment on a VU13P FPGA chip achieved a latency of less than 2 µs, demonstrating the potential for real-time applications. hls4ml's compatibility with any TensorFlow-built transformer model further enhances the scalability and applicability of this work. Index Terms: FPGAs, machine learning, transformers, high energy physics, LIGO||
|**2024-09-08**|[MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework](http://arxiv.org/abs/2409.05136)|null|Social media has a significant impact on people's lives. Hate speech on social media has emerged as one of society's most serious issues in recent years. Text and pictures are two forms of multimodal data distributed within articles. Earlier approaches focused primarily on unimodal analysis. Additionally, when performing multimodal analysis, researchers neglect to preserve the distinctive qualities associated with each modality. To address these shortcomings, the present article proposes a scalable architecture for multimodal hate content detection called scalable transformer-based multilevel attention (STMA). This architecture consists of three main parts: a combined attention-based deep learning mechanism, a vision attention-mechanism encoder, and a caption attention-mechanism encoder. To identify hate content, each component uses different attention processes and handles multimodal data uniquely. Evaluations using multiple assessment criteria on three hate speech datasets (Hateful Memes, MultiOff, and MMHS150K) validate the suggested architecture's efficacy. The outcomes demonstrate that the suggested strategy outperforms the baseline approaches on all three datasets.||
|**2024-09-08**|[An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing](http://arxiv.org/abs/2409.04940)|null|The attention mechanism is a key computing kernel of Transformers, calculating pairwise correlations across the entire input sequence. The computational complexity and frequent memory accesses of self-attention place a huge burden on the system, especially as the sequence length increases. This paper presents an analog and digital hybrid processor that accelerates the attention mechanism for transformers in 65nm CMOS technology. We propose an analog computing-in-memory (CIM) core, which prunes ~75% of low-score tokens on average during runtime at ultra-low power and delay. Additionally, a digital processor performs precise computations only for the ~25% unpruned tokens selected by the analog CIM core, preventing accuracy degradation. Measured results show peak energy efficiencies of 14.8 and 1.65 TOPS/W, and peak area efficiencies of 976.6 and 79.4 GOPS/mm $^\mathrm{2}$ in the analog core and the system-on-chip (SoC), respectively.||
|**2024-09-07**|[Efficient Training of Transformers for Molecule Property Prediction on Small-scale Datasets](http://arxiv.org/abs/2409.04909)|null|The blood-brain barrier (BBB) is a protective barrier that separates the brain from the circulatory system, regulating the passage of substances into the central nervous system. Assessing the BBB permeability of potential drugs is crucial for effective drug targeting. However, traditional experimental methods for measuring BBB permeability are challenging and impractical for large-scale screening. Consequently, there is a need to develop computational approaches to predict BBB permeability. This paper proposes a GPS Transformer architecture augmented with a self-attention mechanism, designed to perform well in low-data regimes. The proposed approach achieves state-of-the-art performance on the BBB permeability prediction task using the BBBP dataset, surpassing existing models. With a ROC-AUC of 78.8%, the approach provides a 5.5% improvement over the previous state of the art. We demonstrate that the standard self-attention mechanism coupled with the GPS Transformer performs better than other attention variants coupled with the GPS Transformer.||
|**2024-09-07**|[Cross-attention Inspired Selective State Space Models for Target Sound Extraction](http://arxiv.org/abs/2409.04803)|**[link](https://github.com/WuDH2000/CrossMamba)**|Transformer models, particularly their cross-attention modules, are widely used for feature fusion in target sound extraction, a task that extracts the signal of interest based on given clues. Despite their effectiveness, these approaches are computationally inefficient. Recent advances in state space models, notably the Mamba model, have shown performance comparable to Transformer-based methods across various tasks while significantly reducing computational complexity. However, Mamba's applicability to target sound extraction is limited because it cannot capture dependencies between different sequences the way cross-attention does. In this paper, we propose CrossMamba for target sound extraction, which leverages Mamba's hidden attention mechanism to compute dependencies between the given clues and the audio mixture. Mamba's computation can be divided into queries, keys, and values. We use the clue to generate the query and the audio mixture to derive the keys and values, following the principle of the cross-attention mechanism in Transformers. Experimental results from two representative target sound extraction methods validate the effectiveness of the proposed CrossMamba.||
|**2024-09-06**|[Theory, Analysis, and Best Practices for Sigmoid Self-Attention](http://arxiv.org/abs/2409.04431)|**[link](https://github.com/apple/ml-sigmoid-attention)**|Attention is a key component of the Transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in Transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis of it. Theoretically, we prove that Transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we find that stabilizing large initial attention norms during the early stages of training is a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention, yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which prior attempts at sigmoid attention were unable to fully achieve. Our work unifies prior art and establishes best practices for sigmoid attention as a drop-in replacement for softmax in Transformers.||
|**2024-09-09**|[AttentionX: Exploiting Consensus Discrepancy In Attention from A Distributed Optimization Perspective](http://arxiv.org/abs/2409.04275)|null|In this paper, we extend the standard attention mechanism in Transformers by exploiting consensus discrepancy from a distributed optimization perspective, referred to as AttentionX. Notably, the primal-dual method of multipliers (PDMM) \cite{Zhang16PDMM} is designed to iteratively solve a broad class of distributed optimization problems over a peer-to-peer (P2P) network, where neighboring nodes gradually reach consensus as specified by predefined linear edge constraints in the optimization process. In particular, at each iteration of PDMM, each node in the network first collects information from its neighbors and then performs local information fusion. From a high-level point of view, the $KQ$-softmax-based weighted summation of $V$-representations in attention corresponds to collecting information from neighboring nodes, while the feature processing via the feed-forward network (FFN) in Transformers corresponds to local information fusion. PDMM exploits Lagrangian multipliers to capture the historical consensus discrepancy in the form of residuals of the linear edge constraints, which plays a crucial role in the convergence of the algorithm. Inspired by PDMM, we propose AttentionX to incorporate the consensus discrepancy into the output update expression of the standard attention mechanism. The consensus discrepancy in AttentionX refers to the difference between the weighted summation of $V$-representations and the scaled $V$-representations themselves. Experiments on ViT and nanoGPT show promising performance.||
|**2024-09-05**|[Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers](http://arxiv.org/abs/2409.03621)|**[link](https://github.com/schwartz-lab-NLP/Attend-First-Consolidate-Later)**|In decoder-based LLMs, the representation of a given layer serves two purposes: as input to the next layer during the computation of the current token; and as input to the attention mechanism of future tokens. In this work, we show that the importance of the latter role might be overestimated. To show this, we start by manipulating the representations of previous tokens, e.g., by replacing the hidden states at some layer k with random vectors. Our experiments with four LLMs and four tasks show that this operation often leads to a small to negligible drop in performance. Importantly, this happens if the manipulation occurs in the top part of the model, i.e., k is in the final 30-50% of the layers. In contrast, doing the same manipulation in earlier layers might lead to chance-level performance. We continue by switching the hidden state of certain tokens with hidden states of other tokens from another prompt; e.g., replacing the word "Italy" with "France" in "What is the capital of Italy?". We find that when applying this switch in the top 1/3 of the model, the model ignores it (answering "Rome"). However, if we apply it before, the model conforms to the switch ("Paris"). Our results hint at a two-stage process in transformer-based LLMs: the first part gathers input from previous tokens, while the second mainly processes that information internally.||
|**2024-09-05**|[LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution](http://arxiv.org/abs/2409.03516)|**[link](https://github.com/jwgdmkj/lmlt)**|Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our codes are available at https://github.com/jwgdmkj/LMLT.||
|**2024-09-05**|[Blended Latent Diffusion under Attention Control for Real-World Video Editing](http://arxiv.org/abs/2409.03514)|null|Due to the lack of a fully publicly available text-to-video model, current video editing methods tend to build on pre-trained text-to-image generation models; however, they still face grand challenges in handling the local editing of videos with temporal information. First, although existing methods attempt to focus on local area editing via a pre-defined mask, the preservation of the background outside the region is non-ideal due to the spatially entire generation of each frame. In addition, specially providing a mask is an extra costly undertaking for users, so an autonomous masking strategy integrated into the editing process is desirable. Last but not least, image-level pre-trained models have not learned temporal information across video frames, which is vital for expressing motion and dynamics. In this paper, we propose to adapt an image-level blended latent diffusion model to perform local video editing tasks. Specifically, we leverage DDIM inversion to acquire the latents as background latents instead of randomly noised ones, to better preserve the background information of the input video. We further introduce an autonomous mask manufacturing mechanism derived from cross-attention maps in the diffusion steps. Finally, we enhance the temporal consistency across video frames by transforming the self-attention blocks of the U-Net into temporal-spatial blocks. Through extensive experiments, our proposed approach demonstrates effectiveness across different real-world video editing tasks.||
|**2024-09-05**|[Characterizing Massive Activations of Attention Mechanism in Graph Neural Networks](http://arxiv.org/abs/2409.03463)|**[link](https://github.com/msorbi/gnn-ma)**|Graph Neural Networks (GNNs) have become increasingly popular for effectively modeling data with graph structures. Recently, attention mechanisms have been integrated into GNNs to improve their ability to capture complex patterns. This paper presents the first comprehensive study revealing a critical, unexplored consequence of this integration: the emergence of Massive Activations (MAs) within attention layers. We introduce a novel method for detecting and analyzing MAs, focusing on edge features in different graph transformer architectures. Our study assesses various GNN models using benchmark datasets, including ZINC, TOX21, and PROTEINS. Key contributions include (1) establishing the direct link between attention mechanisms and MAs generation in GNNs, (2) developing a robust definition and detection method for MAs based on activation ratio distributions, (3) introducing the Explicit Bias Term (EBT) as a potential countermeasure and exploring it as an adversarial framework to assess models robustness based on the presence or absence of MAs. Our findings highlight the prevalence and impact of attention-induced MAs across different architectures, such as GraphTransformer, GraphiT, and SAN. The study reveals the complex interplay between attention mechanisms, model architecture, dataset characteristics, and MAs emergence, providing crucial insights for developing more robust and reliable graph models.||
|**2024-09-05**|[LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones](http://arxiv.org/abs/2409.03460)|**[link](https://github.com/altair199797/lowformer)**|Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks. A smart combination of both, architecture-wise and component-wise, is mandatory to excel in the speed-accuracy trade-off. Most publications focus on maximizing accuracy and utilize MACs (multiply accumulate operations) as an efficiency metric. The latter, however, often do not accurately measure how fast a model actually is, due to factors like memory access cost and degree of parallelism. We analyzed common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency, as the combination of the latter two is a better representation of the efficiency of models in real applications. We applied the conclusions taken from that analysis to create a recipe for increasing hardware-efficiency in macro design. Additionally, we introduce a simple slimmed-down version of Multi-Head Self-Attention that aligns with our analysis. We combine both macro and micro design to create a new family of hardware-efficient backbone networks called LowFormer. LowFormer achieves a remarkable speedup in terms of throughput and latency, while achieving similar or better accuracy than current state-of-the-art efficient backbones. In order to prove the generalizability of our hardware-efficient design, we evaluate our method on GPU, mobile GPU and ARM CPU. We further show that the downstream tasks object detection and semantic segmentation profit from our hardware-efficient architecture. Code and models are available at https://github.com/altair199797/LowFormer.||
|**2024-09-05**|[Masked Sensory-Temporal Attention for Sensor Generalization in Quadruped Locomotion](http://arxiv.org/abs/2409.03332)|null|With the rising focus on quadrupeds, a generalized policy capable of handling different robot models and sensory inputs will be highly beneficial. Although several methods have been proposed to address different morphologies, it remains a challenge for learning-based policies to manage various combinations of proprioceptive information. This paper presents Masked Sensory-Temporal Attention (MSTA), a novel transformer-based model with masking for quadruped locomotion. It employs direct sensor-level attention to enhance sensory-temporal understanding and handle different combinations of sensor data, serving as a foundation for incorporating unseen information. This model can effectively understand its states even with a large portion of missing information, and is flexible enough to be deployed on a physical system despite the long input sequence.||
|**2024-09-05**|[Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion](http://arxiv.org/abs/2409.03223)|null|Multi-modality image fusion aims to integrate the merits of images from different sources and render high-quality fusion images. However, existing feature extraction and fusion methods are either constrained by inherent local reduction bias and static parameters during inference (CNN) or limited by quadratic computational complexity (Transformers), and cannot effectively extract and fuse features. To solve this problem, we propose a dual-branch image fusion network called Tmamba. It consists of linear Transformer and Mamba, which has global modeling capabilities while maintaining linear complexity. Due to the difference between the Transformer and Mamba structures, the features extracted by the two branches carry channel and position information respectively. T-M interaction structure is designed between the two branches, using global learnable parameters and convolutional layers to transfer position and channel information respectively. We further propose cross-modal interaction at the attention level to obtain cross-modal attention. Experiments show that our Tmamba achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. Code with checkpoints will be available after the peer-review process.||
|**2024-09-04**|[Probing self-attention in self-supervised speech models for cross-linguistic differences](http://arxiv.org/abs/2409.03115)|null|Speech models have gained traction thanks to increases in accuracy from novel transformer architectures. While this impressive increase in performance across automatic speech recognition (ASR) benchmarks is noteworthy, there is still much that is unknown about the use of attention mechanisms for speech-related tasks. For example, while it is assumed that these models are learning language-independent (i.e., universal) speech representations, there has not yet been an in-depth exploration of what it would mean for the models to be language-independent. In the current paper, we explore this question within the realm of self-attention mechanisms of one small self-supervised speech transformer model (TERA). We find that even with a small model, the attention heads learned are diverse, ranging from almost entirely diagonal to almost entirely global, regardless of the training language. We highlight some notable differences in attention patterns between Turkish and English and demonstrate that the models do learn important phonological information during pretraining. We also present a head ablation study which shows that models across languages primarily rely on diagonal heads to classify phonemes.||
|**2024-09-04**|[Leveraging Interpretability in the Transformer to Automate the Proactive Scaling of Cloud Resources](http://arxiv.org/abs/2409.03103)|null|Modern web services adopt cloud-native principles to leverage the advantages of microservices. To consistently guarantee high quality of service (QoS) under service-level agreements (SLAs), ensure satisfying user experiences, and minimize operational costs, each microservice must be provisioned with the right amount of resources. However, accurately provisioning microservices with adequate resources is complex and depends on many factors, including workload intensity and the complex interconnections among microservices. To address this challenge, we develop a model that captures the relationship between end-to-end latency, front-end-level requests, and resource utilization. We then use the developed model to predict end-to-end latency. Our solution leverages the Temporal Fusion Transformer (TFT), an attention-based architecture equipped with interpretability features. When the prediction results indicate SLA non-compliance, we use the feature importance provided by the TFT as covariates in Kernel Ridge Regression (KRR), with the desired latency as the response variable, to learn the parameters associated with the feature importance. These learned parameters reflect the adjustments required for the features to ensure SLA compliance. We demonstrate the merit of our approach with a microservice-based application and provide a roadmap for deployment.||
|**2024-09-05**|[Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?](http://arxiv.org/abs/2409.02727)|**[link](https://github.com/yixuantt/poolingandattn)**|The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.||
|**2024-09-04**|[UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching](http://arxiv.org/abs/2409.02545)|null|Unlike other vision tasks where Transformer-based approaches are becoming increasingly common, stereo depth estimation is still dominated by convolution-based approaches. This is mainly due to the limited availability of real-world ground truth for stereo matching, which is a limiting factor in improving the performance of Transformer-based stereo approaches. In this paper, we propose UniTT-Stereo, a method to maximize the potential of Transformer-based stereo architectures by unifying self-supervised learning used for pre-training with stereo matching framework based on supervised learning. To be specific, we explore the effectiveness of reconstructing features of masked portions in an input image and at the same time predicting corresponding points in another image from the perspective of locality inductive bias, which is crucial in training models with limited training data. Moreover, to address these challenging tasks of reconstruction-and-prediction, we present a new strategy to vary a masking ratio when training the stereo model with stereo-tailored losses. State-of-the-art performance of UniTT-Stereo is validated on various benchmarks such as ETH3D, KITTI 2012, and KITTI 2015 datasets. Lastly, to investigate the advantages of the proposed approach, we provide a frequency analysis of feature maps and the analysis of locality inductive bias based on attention maps.||
|**2024-09-03**|[F2former: When Fractional Fourier Meets Deep Wiener Deconvolution and Selective Frequency Transformer for Image Deblurring](http://arxiv.org/abs/2409.02056)|null|Recent progress in image deblurring techniques focuses mainly on operating in both frequency and spatial domains using the Fourier transform (FT) properties. However, their performance is limited due to the dependency of FT on stationary signals and its lack of capability to extract spatial-frequency properties. In this paper, we propose a novel approach based on the Fractional Fourier Transform (FRFT), a unified spatial-frequency representation leveraging both spatial and frequency components simultaneously, making it ideal for processing non-stationary signals like images. Specifically, we introduce a Fractional Fourier Transformer (F2former), where we combine the classical fractional Fourier based Wiener deconvolution (F2WD) as well as a multi-branch encoder-decoder transformer based on a new fractional frequency aware transformer block (F2TB). We design F2TB consisting of a fractional frequency aware self-attention (F2SA) to estimate element-wise product attention based on important frequency components and a novel feed-forward network based on frequency division multiplexing (FM-FFN) to refine high and low frequency features separately for efficient latent clear image restoration. Experimental results for the cases of both motion deblurring as well as defocus deblurring show that the performance of our proposed method is superior to other state-of-the-art (SOTA) approaches.||
|**2024-09-03**|[TransDAE: Dual Attention Mechanism in a Hierarchical Transformer for Efficient Medical Image Segmentation](http://arxiv.org/abs/2409.02018)|null|In healthcare, medical image segmentation is crucial for accurate disease diagnosis and the development of effective treatment strategies. Early detection can significantly aid in managing diseases and potentially prevent their progression. Machine learning, particularly deep convolutional neural networks, has emerged as a promising approach to addressing segmentation challenges. Traditional methods like U-Net use encoding blocks for local representation modeling and decoding blocks to uncover semantic relationships. However, these models often struggle with multi-scale objects exhibiting significant variations in texture and shape, and they frequently fail to capture long-range dependencies in the input data. Transformers designed for sequence-to-sequence predictions have been proposed as alternatives, utilizing global self-attention mechanisms. Yet, they can sometimes lack precise localization due to insufficient granular details. To overcome these limitations, we introduce TransDAE: a novel approach that reimagines the self-attention mechanism to include both spatial and channel-wise associations across the entire feature space, while maintaining computational efficiency. Additionally, TransDAE enhances the skip connection pathway with an inter-scale interaction module, promoting feature reuse and improving localization accuracy. Remarkably, TransDAE outperforms existing state-of-the-art methods on the Synaps multi-organ dataset, even without relying on pre-trained weights.||
|**2024-09-03**|[TASL-Net: Tri-Attention Selective Learning Network for Intelligent Diagnosis of Bimodal Ultrasound Video](http://arxiv.org/abs/2409.01557)|null|In the intelligent diagnosis of bimodal (gray-scale and contrast-enhanced) ultrasound videos, medical domain knowledge such as the way sonographers browse videos, the particular areas they emphasize, and the features they pay special attention to, plays a decisive role in facilitating precise diagnosis. Embedding medical knowledge into the deep learning network can not only enhance performance but also boost clinical confidence and reliability of the network. However, it is an intractable challenge to automatically focus on these person- and disease-specific features in videos and to enable networks to encode bimodal information comprehensively and efficiently. This paper proposes a novel Tri-Attention Selective Learning Network (TASL-Net) to tackle this challenge and automatically embed three types of diagnostic attention of sonographers into a mutual transformer framework for intelligent diagnosis of bimodal ultrasound videos. Firstly, a time-intensity-curve-based video selector is designed to mimic the temporal attention of sonographers, thus removing a large amount of redundant information while improving computational efficiency of TASL-Net. Then, to introduce the spatial attention of the sonographers for contrast-enhanced video analysis, we propose the earliest-enhanced position detector based on structural similarity variation, on which the TASL-Net is made to focus on the differences of perfusion variation inside and outside the lesion. Finally, by proposing a mutual encoding strategy that combines convolution and transformer, TASL-Net possesses bimodal attention to structure features on gray-scale videos and to perfusion variations on contrast-enhanced videos. These modules work collaboratively and contribute to superior performance. We conduct a detailed experimental validation of TASL-Net's performance on three datasets, including lung, breast, and liver.||
|**2024-09-02**|[Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement](http://arxiv.org/abs/2409.01352)|**[link](https://github.com/tatban/Spectron)**|Recently, attention-based transformers have become a de facto standard in many deep learning applications including natural language processing, computer vision, signal processing, etc. In this paper, we propose a transformer-based end-to-end model to extract a target speaker's speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives to impose speaker embedding consistency and waveform encoder invertibility, and jointly train both the speaker encoder and speech separator to better capture the speaker conditional embedding. Furthermore, we leverage a multi-scale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that the use of a dual-path transformer in the separator backbone, along with the proposed training paradigm, improves the CNN baseline by $3.12$ dB points. Finally, we compare our approach with recent state-of-the-art methods and show that our model outperforms existing methods by $4.1$ dB points on average without creating additional data dependency.||
|**2024-09-02**|[CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models](http://arxiv.org/abs/2409.01193)|**[link](https://github.com/raytsang123/clibe)**|Backdoors can be injected into NLP models to induce misbehavior when the input text contains a specific feature, known as a trigger, which the attacker secretly selects. Unlike fixed words, phrases, or sentences used in the static text trigger, NLP dynamic backdoor attacks design triggers associated with abstract and latent text features, making them considerably stealthier than traditional static backdoor attacks. However, existing research on NLP backdoor detection primarily focuses on defending against static backdoor attacks, while detecting dynamic backdoors in NLP models remains largely unexplored. This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models. CLIBE injects a "few-shot perturbation" into the suspect Transformer model by crafting optimized weight perturbation in the attention layers to make the perturbed model classify a limited number of reference samples as a target label. Subsequently, CLIBE leverages the generalization ability of this few-shot perturbation to determine whether the original model contains a dynamic backdoor. Extensive evaluation on three advanced NLP dynamic backdoor attacks, two widely-used Transformer frameworks, and four real-world classification tasks strongly validates the effectiveness of CLIBE. We also demonstrate the robustness of CLIBE against various adaptive attacks. Furthermore, we employ CLIBE to scrutinize 49 popular Transformer models on Hugging Face and discover one exhibiting a high probability of containing a dynamic backdoor. We have contacted Hugging Face and provided detailed evidence of this model's backdoor behavior. Moreover, we extend CLIBE to detect backdoor text generation models modified to exhibit toxic behavior. To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without access to trigger input test samples.||
|**2024-09-02**|[Progressive Retinal Image Registration via Global and Local Deformable Transformations](http://arxiv.org/abs/2409.01068)|**[link](https://github.com/lyp-deeplearning/awesome-retinal-registration)**|Retinal image registration plays an important role in the ophthalmological diagnosis process. Since there exist variances in viewing angles and anatomical structures across different retinal images, keypoint-based approaches become the mainstream methods for retinal image registration thanks to their robustness and low latency. These methods typically assume the retinal surfaces are planar, and adopt feature matching to obtain the homography matrix that represents the global transformation between images. Yet, such a planar hypothesis inevitably introduces registration errors since retinal surface is approximately curved. This limitation is more prominent when registering image pairs with significant differences in viewing angles. To address this problem, we propose a hybrid registration framework called HybridRetina, which progressively registers retinal images with global and local deformable transformations. For that, we use a keypoint detector and a deformation network called GAMorph to estimate the global transformation and local deformable transformation, respectively. Specifically, we integrate multi-level pixel relation knowledge to guide the training of GAMorph. Additionally, we utilize an edge attention module that includes the geometric priors of the images, ensuring the deformation field focuses more on the vascular regions of clinical interest. Experiments on two widely-used datasets, FIRE and FLoRI21, show that our proposed HybridRetina significantly outperforms some state-of-the-art methods. The code is available at https://github.com/lyp-deeplearning/awesome-retinal-registration.||
|**2024-09-02**|[Multi-scale Temporal Fusion Transformer for Incomplete Vehicle Trajectory Prediction](http://arxiv.org/abs/2409.00904)|null|Motion prediction plays an essential role in autonomous driving systems, enabling autonomous vehicles to achieve more accurate local-path planning and driving decisions based on predictions of the surrounding vehicles. However, existing methods neglect the potential missing values caused by object occlusion, perception failures, etc., which inevitably degrades the trajectory prediction performance in real traffic scenarios. To address this limitation, we propose a novel end-to-end framework for incomplete vehicle trajectory prediction, named Multi-scale Temporal Fusion Transformer (MTFT), which consists of the Multi-scale Attention Head (MAH) and the Continuity Representation-guided Multi-scale Fusion (CRMF) module. Specifically, the MAH leverages the multi-head attention mechanism to parallelly capture multi-scale motion representation of trajectory from different temporal granularities, thus mitigating the adverse effect of missing values on prediction. Furthermore, the multi-scale motion representation is input into the CRMF module for multi-scale fusion to obtain the robust temporal feature of the vehicle. During the fusion process, the continuity representation of vehicle motion is first extracted across time steps to guide the fusion, ensuring that the resulting temporal feature incorporates both detailed information and the overall trend of vehicle motion, which facilitates the accurate decoding of future trajectory that is consistent with the vehicle's motion trend. We evaluate the proposed model on four datasets derived from highway and urban traffic scenarios. The experimental results demonstrate its superior performance in the incomplete vehicle trajectory prediction task compared with state-of-the-art models, e.g., a comprehensive performance improvement of more than 39% on the HighD dataset.||
|**2024-09-01**|[Attention-Guided Multi-scale Interaction Network for Face Super-Resolution](http://arxiv.org/abs/2409.00591)|null|Recently, CNN and Transformer hybrid networks demonstrated excellent performance in face super-resolution (FSR) tasks. Since hybrid networks contain numerous features at different scales, how to fuse these multi-scale features and promote their complementarity is crucial for enhancing FSR. However, existing hybrid network-based FSR methods ignore this, simply combining the Transformer and CNN. To address this issue, we propose an attention-guided Multi-scale interaction network (AMINet), which contains local and global feature interactions as well as encoder-decoder phase feature interactions. Specifically, we propose a Local and Global Feature Interaction Module (LGFI) to promote the fusion of global features and the local features of different receptive fields extracted by our Residual Depth Feature Extraction Module (RDFE). Additionally, we propose a Selective Kernel Attention Fusion Module (SKAF) to adaptively select fusions of different features within LGFI and the encoder-decoder phases. Our above design allows the free flow of multi-scale features within modules and between the encoder and decoder, which can promote the complementarity of different scale features to enhance FSR. Comprehensive experiments confirm that our method consistently performs well with less computational consumption and faster inference.||
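As a point of orientation for the sigmoid self-attention entry above (arXiv:2409.04431), here is a minimal NumPy sketch contrasting standard softmax attention with an element-wise sigmoid variant. The `-log(n)` bias follows the attention-norm stabilization idea described in that abstract, but its exact form here is an illustrative assumption, not the paper's FLASHSIGMOID implementation:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax over keys."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def sigmoid_attention(Q, K, V):
    """Element-wise sigmoid replaces the softmax. The -log(n) bias
    (n = number of key tokens) is an illustrative stand-in for the
    norm-stabilizing bias discussed in the sigmoid-attention paper."""
    n, d = K.shape
    scores = Q @ K.T / np.sqrt(d) - np.log(n)
    w = 1.0 / (1.0 + np.exp(-scores))
    return w @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
print(softmax_attention(Q, K, V).shape)  # (4, 8)
print(sigmoid_attention(Q, K, V).shape)  # (4, 8)
```

Note that the sigmoid weights are computed independently per key, so rows no longer sum to one; the bias merely keeps the total weight mass comparable to softmax at initialization.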
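The analog/digital hybrid accelerator entry above (arXiv:2409.04940) prunes roughly 75% of low-score tokens in a coarse analog pass and computes exact attention only over the survivors. The following is a rough software analogue of that two-stage idea, assuming a simple total-score ranking rule; the chip's actual selection logic is not described here:

```python
import numpy as np

def pruned_attention(Q, K, V, keep_ratio=0.25):
    """Two-stage attention sketch: rank key tokens by a cheap aggregate
    score, keep only the top `keep_ratio` fraction, then run exact
    softmax attention over the kept tokens (illustrative assumption:
    importance = column sum of the score matrix)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # coarse pass: full scores
    importance = scores.sum(axis=0)               # one score per key token
    n_keep = max(1, int(round(len(importance) * keep_ratio)))
    keep = np.sort(np.argsort(importance)[-n_keep:])  # kept token indices
    pruned = scores[:, keep]                      # precise pass: survivors only
    w = np.exp(pruned - pruned.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V[keep]

rng = np.random.default_rng(1)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
print(pruned_attention(Q, K, V).shape)  # (4, 8)
```

With `keep_ratio=0.25`, only 4 of the 16 key tokens participate in the exact softmax, mirroring the ~25% of unpruned tokens handled by the digital core in the paper.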

(back to top)