Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/xianshang33/llm-paper-daily
Daily updated LLM papers. Updated daily with the latest LLM-related papers; subscriptions welcome 👏 if you like it, give it a star 🌟
agent chatgpt large-language-models llm rag
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/xianshang33/llm-paper-daily
- Owner: xianshang33
- Created: 2023-11-27T03:50:11.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-04-13T05:54:15.000Z (7 months ago)
- Last Synced: 2024-04-14T05:11:13.072Z (7 months ago)
- Topics: agent, chatgpt, large-language-models, llm, rag
- Homepage:
- Size: 1.78 MB
- Stars: 490
- Watchers: 37
- Forks: 18
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-ChatGPT-repositories - llm-paper-daily - Daily updated LLM papers. Updated daily with the latest LLM-related papers; subscriptions welcome 👏 if you like it, give it a star 🌟 (NLP)
README
llm-paper-daily: Daily Paper Highlights
[![Status](https://img.shields.io/badge/status-Update_07.31_10:00-success.svg)]() [![简体中文 badge](https://img.shields.io/badge/%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87-Simplified%20Chinese-blue)](./README.md) [![English badge](https://img.shields.io/badge/%E8%8B%B1%E6%96%87-English-blue)](./README_en.md)
Welcome to **llm-paper-daily**! This is a daily-updated, categorized feed of the latest research papers, bringing enthusiasts the frontier of LLM research and making the latest developments in the field easier to follow.
📚 **Daily updates:** the repository adds the latest LLM research every day, with arXiv links, related git repositories, and brief GPT-4-based summaries (a minimal sketch of such a pipeline appears just before the paper list below)
💐 **Categorized digests:** each paper is filed under a section such as Reasoning, Agent, Knowledge and Retrieval, Application, or Pre-training and Instruction Fine-tuning, so you can navigate easily and find relevant work
🌈 **Discussion:** a discussion group is being set up so readers can exchange ideas and learn from one another.
Anyone interested in LLM applications, papers, and related topics is welcome to join 🙌

## Contents
- [Latest Papers (with summaries)](#latest-papers-with-summaries)
- [Categories](#categories)
- [💡 Reasoning](CATEGORIES.md#Reasoning)
- [🤖 Agent](CATEGORIES.md#Agent)
- [🦉 Knowledge and Retrieval](CATEGORIES.md#Knowledge-and-Retrieval)
- [👩🏫 Alignment and Hallucination](CATEGORIES.md#Alignment-and-Hallucination)
- [🎨 Application](CATEGORIES.md#Application)
- [📐 Pre-training and Instruction Fine-tuning](CATEGORIES.md#Pre-training-and-Instruction-Fine-tuning)
- [📄 Survey](CATEGORIES.md#Survey)

Latest update (Jul 31, 10:00):
- Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning
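As the feature list above notes, each day's entries are produced by fetching newly submitted arXiv papers and summarizing them with GPT-4. The sketch below illustrates what such a pipeline can look like. It is a minimal, hypothetical reconstruction, assuming the public arXiv Atom API and the OpenAI Python SDK (>= 1.0): the search query, prompt, and model name are illustrative assumptions, not this repository's actual scripts.

```python
"""Minimal sketch of a daily paper pipeline: fetch the newest arXiv
papers and produce one-line GPT-4 summaries. Hypothetical; not the
repository's own code."""
import urllib.request
import xml.etree.ElementTree as ET

from openai import OpenAI

ARXIV = ("http://export.arxiv.org/api/query"
         "?search_query=cat:cs.CL&sortBy=submittedDate&max_results=5")
ATOM = "{http://www.w3.org/2005/Atom}"


def fetch_latest():
    """Yield (title, abstract, link) for the newest cs.CL submissions."""
    with urllib.request.urlopen(ARXIV) as resp:
        feed = ET.fromstring(resp.read())
    for entry in feed.iter(f"{ATOM}entry"):
        yield (entry.findtext(f"{ATOM}title").strip(),
               entry.findtext(f"{ATOM}summary").strip(),
               entry.findtext(f"{ATOM}id").strip())


def summarize(abstract: str) -> str:
    """One-sentence summary via GPT-4 (illustrative prompt)."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    out = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Summarize in one sentence: {abstract}"}])
    return out.choices[0].message.content


if __name__ == "__main__":
    for title, abstract, link in fetch_latest():
        print(f"| {title} | {summarize(abstract)} | {link} |")
```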
## Latest Papers (with summaries)
### July

| Date | Paper | Links & Summary |
| --- | --- | --- |
| 07-29 | **SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages**<br>Affiliations: DAMO Academy, Alibaba Group<br>SeaLLMs 3 is a large language model designed for the multilingual landscape of Southeast Asia. It overcomes the limitations of existing models through efficient language-enhancement techniques and safe, reliable mechanisms, producing culturally appropriate responses while reducing hallucination. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.19672v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.19672.md) |
| 07-29 | **QAEA-DR: A Unified Text Augmentation Framework for Dense Retrieval**<br>Affiliations: Tsinghua University<br>QAEA-DR proposes a novel text-augmentation framework for dense retrieval that integrates event extraction and question-answer generation to improve the quality and robustness of generated text. It is compatible with a range of embedding models, and experiments demonstrate its effectiveness. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.20207v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.20207.md) |
| 07-28 | **Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge**<br>Affiliations: Meta FAIR, University of California, Berkeley, New York University<br>The Meta-Rewarding scheme improves the self-improvement pipeline of LLMs by introducing a meta-judge that evaluates the model's own judgments. The method not only improves instruction following but also removes the dependence on human-labeled data, showing the strong potential of self-improving models without human supervision. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.19594v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.19594.md) |
| 07-25 | **Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning**<br>Affiliations: StatNLP Research Group<br>This paper proposes a self-training framework that integrates Direct Preference Optimization (DPO) to improve small language models' ability to solve mathematical-reasoning problems. The approach reduces reliance on large models, lowers compute cost, and achieves positive results on several mathematical-reasoning tasks (a minimal DPO sketch follows this table). |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.18248v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.18248.md) |
| 07-23 | **RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent**<br>Affiliations: Zhejiang University, Palo Alto Networks, University of North Texas<br>RedAgent discovers and exploits security vulnerabilities in large language models by simulating context-aware jailbreak strategies. It improves the efficiency and automation of red teaming and offers a new perspective on understanding and hardening the security of LLM applications. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.16667v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.16667.md) |
| 07-23 | **Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach** |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.16833v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.16833.md) |
| 07-23 | **OpenDevin: An Open Platform for AI Software Developers as Generalist Agents**<br>Affiliations: UIUC, CMU, Yale<br>OpenDevin is a community-driven platform for developing generalist and specialist AI agents that interact with the world through software, featuring powerful and flexible interaction mechanisms, a sandboxed operating system and web-browser environment, and a comprehensive evaluation framework. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.16741v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.16741.md) |
| 07-22 | **Knowledge Mechanisms in Large Language Models: A Survey and Perspective**<br>Affiliations: Zhejiang University, National University of Singapore, University of California, Los Angeles<br>This paper argues that a deep understanding of the knowledge mechanisms in LLMs is essential for building capable, trustworthy AI. It proposes a framework for analyzing such systems, centered on knowledge utilization and evolution, and offers a vision and tools for future research. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.15017v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.15017.md) |
| 07-19 | **ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities**<br>Affiliations: NVIDIA<br>The paper presents Llama3-ChatQA-2-70B, a model that bridges the gap between open-source LLMs and proprietary models, handles contexts of up to 128K tokens, and matches GPT-4-Turbo on several benchmarks. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.14482v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.14482.md) |
| 07-19 | **Internal Consistency and Self-Feedback in Large Language Models: A Survey**<br>Affiliations: Renmin University of China; Institute for Advanced Algorithms Research, Shanghai; Beijing Institute of Technology<br>Targeting the difficulty LLMs have in staying consistent and avoiding hallucination, this survey introduces the concepts of internal consistency and self-feedback, offering a new lens for understanding and improving these models and outlining future directions. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.14507v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.14507.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/IAAR-Shanghai/ICSFSurvey)|
| 07-18 | **CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis**<br>Affiliations: Shenzhen Research Institute of Big Data; The Chinese University of Hong Kong, Shenzhen<br>This work proposes Chain-of-Diagnosis (CoD), a method for making LLM-based disease diagnosis more interpretable. By combining synthetic cases with disease-encyclopedia data, it generates training data efficiently and builds the DiagnosisGPT model, which experiments show outperforms other LLMs on several diagnostic datasets. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.13301v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.13301.md) |
| 07-16 | **Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild**<br>Affiliations: University of Washington, Allen Institute for AI, McGill University<br>This study highlights the disclosure of users' personal information in chatbot interactions, catalogs the kinds of sensitive information shared, and calls for chatbot designs that protect user privacy while keeping conversations appropriately transparent. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.11438v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.11438.md) |
| 07-16 | **NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?**<br>Affiliations: Shanghai AI Laboratory, Tsinghua University<br>The NeedleBench framework and the proposed ATC test offer a new way to evaluate and improve LLMs' retrieval and reasoning over long texts, which matters for real-world long-context tasks, and they reveal the opportunities and challenges current LLMs face. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.11963v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.11963.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/open-compass/opencompass)|
| 07-16 | **LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data**<br>Affiliations: Stanford University, UC Berkeley<br>This paper presents LOTUS, a system that enables natural-language queries by defining semantic operators and executes them quickly and accurately through efficient algorithms and optimizations. LOTUS demonstrates broad applicability and high performance across several real-world use cases, a significant step for LM-based large-scale semantic analysis and query systems. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.11418v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.11418.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/stanford-futuredata/lotus)|
| 07-15 | **Qwen2 Technical Report**<br>Affiliations: Alibaba Group<br>The Qwen2 model series performs strongly across language understanding, generation, multilingual ability, coding, math, and reasoning, and releases its weights and resources to the open-source community to foster innovation and accessibility. The models are competitive with existing models on multiple benchmarks and show broad multilingual applicability and global coverage. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.10671v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.10671.md) |
| 07-15 | **Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval** |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.10805v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.10805.md) |
| 07-14 | **Learning to Refuse: Towards Mitigating Privacy Risks in LLMs**<br>Affiliations: Institute of Artificial Intelligence, Soochow University, China<br>The paper proposes NAUF, a new machine-unlearning framework, together with RETURN, a companion real-world personal-data unlearning dataset, for evaluating and improving how well LLMs protect privacy. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.10058v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.10058.md) |
| 07-12 | **Human-like Episodic Memory for Infinite Context LLMs**<br>Affiliations: Huawei Noah's Ark Lab, University College London<br>By integrating human episodic memory and event cognition into large language models, this paper introduces EM-LLM, a new architecture that lets LLMs handle practically infinite context lengths while remaining computationally efficient. The work both improves long-context processing in large language models and helps illuminate human memory mechanisms. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.09450v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.0945.md) |
| 07-10 | **Toto: Time Series Optimized Transformer for Observability**<br>Affiliations: Datadog<br>Toto is a foundation model for time-series forecasting developed by Datadog and designed for observability data. Its attention mechanism and pre-training strategy improve both performance and efficiency on observability workloads. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.07874v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.07874.md) |
| 07-10 | **Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization**<br>Affiliations: University of Science and Technology of China, Alibaba Stripe, Zhejiang University<br>The authors develop Dr. DPO, a framework that makes DPO more robust with a single extra line of code. Empirical evaluations show Dr. DPO significantly improves performance across settings, both with and without label noise. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.07880v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.0788.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/junkangwu/Dr_DPO)|
| 07-09 | **Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence**<br>This paper proposes the Internet of Agents (IoA), a flexible and scalable multi-agent collaboration platform that overcomes the limitations of existing multi-agent frameworks and performs well across a range of tasks and scenarios. The codebase is released to advance research on autonomous agent systems. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.07061v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.07061.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/OpenBMB/IoA)|
| 07-05 | **AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents**<br>Affiliations: AIRI, Moscow, Russia; Skoltech, Moscow, Russia<br>AriGraph is a novel memory architecture that builds a knowledge-graph world model integrating semantic and episodic memory to improve LLM agents' exploration and planning. Experiments in the TextWorld environment show it handles complex tasks more effectively than existing methods. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.04363v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.04363.md) |
| 07-02 | **Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models**<br>Affiliations: DeepSeek AI, Northwestern University<br>This paper proposes ESFT, a parameter-efficient fine-tuning method for sparse-architecture LLMs that tunes only the experts most relevant to the downstream task, preserving expert specialization while saving substantial compute. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.01906v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.01906.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/deepseek-ai/ESFT)|
| 07-02 | **RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs**<br>Affiliations: Georgia Tech, NVIDIA<br>RankRAG is a novel framework that instruction-tunes an LLM to strengthen both context ranking and answer generation in RAG. It improves generation quality on multiple benchmarks and generalizes well. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.02485v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.02485.md) |
| 07-01 | **Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems**<br>This paper proposes SummHay, a new task for evaluating how well large language models and RAG systems handle long texts, addressing the challenges of long-context evaluation through synthetic data generation and an automatic evaluation protocol. Results show current systems perform poorly on the task, pointing the way for future improvement. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.01370v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.0137.md) |
| 07-01 | **We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?**<br>Affiliations: Beijing University of Posts and Telecommunications, Tencent Inc., Huazhong University of Science and Technology<br>This paper builds WE-MATH, a visual mathematical-reasoning benchmark designed to go beyond end-to-end performance evaluation and probe LMMs' problem-solving principles, knowledge acquisition, and generalization. Its multi-dimensional evaluation exposes challenges in the models' intrinsic reasoning, and experiments validate the effectiveness of knowledge-augmentation strategies, advancing LMMs in visual mathematical reasoning. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.01284v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.01284.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/We-Math/We-Math)|
| 07-01 | **AI Agents That Matter**<br>Affiliations: Princeton University<br>This paper critiques current benchmark-centric evaluation of AI agents and proposes a set of improvements, aiming for agents with real practical value rather than agents that merely score well on benchmarks. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.01502v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-07/2407.01502.md) |

---
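Several entries above (Self-Training with DPO, Dr. DPO) and below (Step-DPO, SCDPO, Iterative Reasoning Preference Optimization) build on Direct Preference Optimization. As background, here is a minimal sketch of the standard DPO objective from Rafailov et al. (2023); it shows the published loss, not any listed paper's variant, and the log-probability values in the usage example are made up.

```python
"""Minimal sketch of the standard DPO objective (Rafailov et al., 2023).
Inputs are summed token log-probabilities of the chosen (y_w) and
rejected (y_l) responses under the policy being trained and under a
frozen reference model."""
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_w, policy_logp_l,
             ref_logp_w, ref_logp_l, beta: float = 0.1):
    # Implicit reward margins: how much more the policy prefers each
    # response than the reference model does.
    margin_w = policy_logp_w - ref_logp_w
    margin_l = policy_logp_l - ref_logp_l
    # Maximize the gap between chosen and rejected margins.
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()


# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -9.4]))
print(loss)  # scalar loss; gradients flow through the policy log-probs
```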
### June
| Date | Paper | Links & Summary |
| --- | --- | --- |
| 06-30 | **Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning**<br>Affiliations: Multimedia Laboratory (MMLab), The Chinese University of Hong Kong<br>This paper proposes SCDPO, a new optimization method for mathematical reasoning that supervises errors at specific steps to generate training samples automatically, significantly improving LLMs' math problem solving and demonstrating the method's potential. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.00782v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2407.00782.md) |
| 06-29 | **LiteSearch: Efficacious Tree Search for LLM**<br>Affiliations: Xiamen University, Tencent AI Lab<br>The paper proposes a more efficient tree-search algorithm that reduces the resource cost of helping large language models solve complex mathematical-reasoning tasks while maintaining high performance. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.00320v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2407.0032.md) |
| 06-28 | **Scaling Synthetic Data Creation with 1,000,000,000 Personas**<br>Affiliations: Tencent AI Lab Seattle<br>This paper presents "Persona Hub", a synthetic-data platform that keeps generated data diverse and rich while emphasizing safe and responsible use of synthetic data. A series of use cases demonstrates its advantages in diversity, scalability, flexibility, and ease of use. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.20094v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.20094.md) |
| 06-27 | **From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data**<br>Affiliations: University of Wisconsin-Madison<br>The paper improves LLMs' retrieval and reasoning on long-context tasks by fine-tuning them on a synthetic dataset. Experiments show the approach substantially improves long-context performance and reduces hallucination without significantly affecting the model's overall capabilities. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.19292v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.19292.md) |
| 06-27 | **SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation**<br>This paper proposes SEAKR, a new adaptive retrieval-augmented generation model that uses the self-awareness in LLMs' internal states to decide dynamically when to retrieve and to integrate retrieved knowledge effectively, improving question-answering performance. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.19215v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.19215.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/THU-KEG/SeaKR)|
| 06-26 | **Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs**<br>Affiliations: The Chinese University of Hong Kong, Harbin Institute of Technology (Shenzhen), SmartMore<br>This paper proposes Step-DPO, a new optimization method that optimizes individual reasoning steps rather than scoring answers holistically, improving LLMs' accuracy and robustness on long-chain mathematical reasoning. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.18629v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.18629.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/dvlab-research/Step-DPO)|
| 06-25 | **The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale**<br>Affiliations: Hugging Face<br>By introducing the FineWeb datasets, this paper highlights how to curate an effective Common Crawl-based pre-training dataset and shows experimentally how it improves large language model performance. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.17557v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.17557.md) |
| 06-24 | **WARP: On the Benefits of Weight Averaged Rewarded Policies**<br>Affiliations: Google DeepMind<br>This paper proposes WARP, a new LLM alignment strategy that merges models through weight averaging to address challenges in RLHF and improve the KL-reward trade-off. Experiments show WARP improves model performance and alignment with human values. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.16768v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.16768.md) |
| 06-22 | **Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs**<br>Affiliations: OATML, Department of Computer Science, University of Oxford<br>The paper presents SEPs, a cost-efficient and reliable hallucination-detection method that captures semantic uncertainty directly from the hidden states of a single LLM generation, without sampling multiple outputs. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.15927v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.15927.md) |
| 06-21 | **LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs**<br>Affiliations: University of Waterloo<br>LongRAG is a new framework for open-domain question answering that addresses the limitations of the traditional RAG setup by enlarging retrieval units and using long-context language models. By reducing the number of retrieval units, improving retriever effectiveness, and extracting answers zero-shot with long-context LLMs, LongRAG achieves significant performance gains (a minimal RAG sketch follows this table). |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.15319v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.15319.md) |
| 06-19 | **Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?**<br>By introducing the LOFT benchmark, this paper explores the potential of long-context language models to replace existing paradigms and handle novel tasks. It finds that LCLMs can rival existing retrieval and RAG systems on certain tasks without explicit training, and identifies where further research is needed to improve performance. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.13121v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.13121.md) |
| 06-18 | **Nash CoT: Multi-Path Inference with Preference Equilibrium**<br>Affiliations: Westlake University, University of Cambridge<br>This paper proposes Nash CoT, a novel method that uses the concept of preference equilibrium to reduce the number of inference paths, lowering LLM deployment cost while preserving performance. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2407.07099v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2407.07099.md) |
| 06-18 | **Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges**<br>This study evaluates LLMs-as-judges on their alignment with human judgments and on their weaknesses, offering useful insights for future use of LLM judges. Key findings include that only a handful of top models make suitable judges, and that Cohen's kappa distinguishes judges better than percent agreement as an alignment metric. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.12624v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.12624.md) |
| 06-17 | **A Survey of AIOps for Failure Management in the Era of Large Language Models**<br>Affiliations: Peking University, Tsinghua University, The Hong Kong University of Science and Technology (Guangzhou), University of Illinois Chicago<br>A comprehensive survey of AIOps techniques for failure management in the LLM era. It discusses the potential of LLMs to address the challenges facing existing AIOps approaches and sketches future research directions. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.11213v4)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.11213.md) |
| 06-13 | **Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning**<br>Affiliations: Google Research, Google DeepMind, Google<br>This paper introduces ToT, a new benchmark that uses synthetic datasets and crowdsourced tasks to comprehensively evaluate LLMs' temporal reasoning across scenarios, revealing the models' strengths and weaknesses in temporal reasoning. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.09170v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.0917.md) |
| 06-12 | **Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing**<br>Affiliations: University of Washington, Allen Institute for AI<br>This paper proposes MAGPIE, a self-synthesis method for generating large-scale, high-quality alignment data without human involvement or prompt engineering. Experiments show models fine-tuned on MAGPIE data perform strongly across benchmarks, demonstrating LLMs' potential for automatic data generation and alignment. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.08464v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.08464.md) |
| 06-12 | **Designing a Dashboard for Transparency and Control of Conversational AI**<br>Affiliations: Harvard University, Google Research<br>This paper increases the transparency of LLMs in conversational AI systems by designing a visual user interface: a dashboard that accompanies the chatbot. Users can see the system's internal user model in real time and modify it through the interface; based on user feedback, the dashboard also helps expose and counteract biased model behavior. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.07882v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.07882.md) |
| 06-12 | **TasTe: Teaching Large Language Models to Translate through Self-Reflection**<br>Affiliations: Harbin Institute of Technology, Tencent Inc<br>The proposed TASTE framework improves LLMs' machine-translation ability through a self-reflection process. It represents a new way of tapping LLMs' translation potential and sets an example for exploiting their sophisticated reasoning and language-modeling capabilities. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.08434v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.08434.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/YutongWang1216/ReflectionLLMMT)|
| 06-11 | **Delving into ChatGPT usage in academic writing through excess vocabulary**<br>Affiliations: Hertie Institute for AI in Brain Health, University of Tübingen, Germany; Tübingen AI Center; Northwestern University<br>Addressing the widespread use of LLMs in academic writing, this paper proposes a new unbiased, large-scale method for studying LLM usage and provides an unprecedented quantitative comparison of the changes LLMs have brought to scientific writing. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.07016v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.07016.md) |
| 06-11 | **Needle In A Multimodal Haystack**<br>Affiliations: OpenGVLab, Shanghai AI Laboratory, Fudan University<br>The paper proposes MM-NIAH, the first evaluation benchmark for long multimodal document comprehension, designed to probe and improve MLLM performance. Through its evaluation tasks it identifies the limitations and challenges of existing MLLMs on long multimodal documents and provides an effective platform for further research in the area. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.07230v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.0723.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/OpenGVLab/MM-NIAH)|
| 06-10 | **Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching**<br>Affiliations: The Chinese University of Hong Kong, Tencent AI Lab, Centre for Perceptual and Interactive Intelligence<br>This paper introduces SELF-TUNING, a framework that improves an LLM's ability to acquire new knowledge through self-teaching, and validates its effectiveness on several key knowledge-acquisition tasks using the Wiki-Newpages-QA datasets. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.06326v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.06326.md) |
| 06-10 | **Transforming Wearable Data into Health Insights using Large Language Model Agents**<br>Affiliations: Google LLC<br>This paper presents PHIA, an LLM agent system that turns wearable-device data into personal health insights. PHIA combines code generation with information-retrieval tools to tackle the challenge of deriving personalized health guidance from large volumes of health data, and extensive human and automated evaluations demonstrate its accuracy and applicability on real-world health questions. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.06464v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.06464.md) |
| 06-10 | **Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning**<br>Affiliations: University of Washington, MetaAI, Allen Institute for AI<br>HUSKY is the first unified, open-source language agent for multi-step reasoning. It addresses high cost and poor scalability, performs strongly in multi-task settings, and demonstrates the potential of open-source language agents. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.06469v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.06469.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/agent-husky/Husky-v1)|
| 06-10 | **Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies**<br>Affiliations: Duke University, AWS AI Labs<br>The paper proposes a compute-budget-aware framework for evaluating LLM reasoning strategies and shows that simple strategies can beat sophisticated ones given equal compute. By highlighting the importance of self-evaluation, it lays groundwork for more budget-efficient and more effective strategies. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.06461v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.06461.md) |
| 06-09 | **Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses**<br>Affiliations: University of Washington, University of Washington - Bothell<br>This study highlights LLMs' shortcomings in social reasoning and shows how integrating human intentions and emotions can improve their effectiveness. The results underscore the need for LLMs to understand human mental states and reason socially in open-ended questions, marking a key area for future development. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.05659v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.05659.md) |
| 06-07 | **WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild**<br>WILDBENCH is an evaluation benchmark that combines challenging real-user tasks, automated metrics, and interpretable checklists, enabling more accurate assessment and differentiation of large language models on complex tasks. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.04770v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.0477.md) |
| 06-07 | **Mixture-of-Agents Enhances Large Language Model Capabilities**<br>Affiliations: Duke University, Together AI, University of Chicago<br>By proposing the Mixture-of-Agents (MoA) approach, this paper shows how combining the collective expertise of multiple large language models strengthens their natural-language understanding and generation. Experiments confirm the approach significantly improves performance and achieves new state-of-the-art results on several highly competitive benchmarks. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.04692v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.04692.md) |
| 06-06 | **The Prompt Report: A Systematic Survey of Prompting Techniques**<br>This paper provides a comprehensive survey of prompting techniques, systematically analyzing the concepts, types, and applications of prompts together with a detailed meta-analysis. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.06608v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.06608.md) |
| 06-06 | **FastGAS: Fast Graph-based Annotation Selection for In-Context Learning**<br>Affiliations: Department of ECE, University of Virginia<br>The proposed FastGAS method selects ICL instances with greater diversity and representativeness while substantially reducing the time and compute required. Experiments across multiple datasets confirm its efficiency and effectiveness as an instance-selection method. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.03730v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.0373.md) |
| 06-06 | **Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models**<br>Affiliations: Peking University, UC Berkeley, Stanford University<br>BoT equips LLMs with a meta-buffer that stores high-level thought templates, improving reasoning accuracy, efficiency, and robustness, overcoming the limitations of existing methods, and delivering significant performance gains. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.04271v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.04271.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/YangLing0818/buffer-of-thought-llm)|
| 06-04 | **Synergetic Event Understanding: A Collaborative Approach to Cross-Document Event Coreference Resolution with Large Language Models**<br>Affiliations: Zhejiang University, School of Engineering (Westlake University), Shanghai AI Laboratory<br>The paper proposes a novel collaborative approach to cross-document event coreference resolution that combines the general abilities of LLMs with task-specific SLMs, significantly improving model performance. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.02148v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.02148.md) |
| 06-04 | **To Believe or Not to Believe Your LLM**<br>Affiliations: Google DeepMind<br>This paper studies and proposes a new information-theoretic metric for quantifying uncertainty in large language models, particularly the hallucinations that arise when LLMs generate responses, providing new understanding of and remedies for identifying and handling hallucination in LLMs. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.02543v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.02543.md) |
| 06-03 | **Self-Improving Robust Preference Optimization**<br>Affiliations: Cohere<br>SRPO exhibits strong robustness to task changes within a theoretically grounded offline RLHF framework, removing the dependence on task-specific data, and its non-adversarial offline loss makes training and deployment simpler. Experiments show SRPO outperforms existing methods across settings, including OOD ones. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.01660v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.0166.md) |
| 06-03 | **Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration**<br>Affiliations: Beijing Jiaotong University, Alibaba Group<br>Mobile-Agent-v2 is a multi-agent architecture that effectively handles the navigation challenges of mobile-device operation tasks, in particular navigating task progress and focus content. By introducing three specialized agent roles, it significantly improves task completion rates over the traditional single-agent architecture. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2406.01014v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-06/2406.01014.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/X-PLUG/MobileAgent)|

---
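Many of the entries above and below (LongRAG, SeaKR, RankRAG, HippoRAG) are variations on the retrieval-augmented generation loop: embed the query, retrieve the most similar documents, and condition generation on them. The sketch below shows only that baseline loop; `embed` and `generate` are hypothetical stand-ins for an embedding model and an LLM, and each listed paper replaces parts of this loop with its own machinery.

```python
"""Minimal retrieval-augmented generation loop, for orientation on the
RAG-based entries in this list. `embed` and `generate` are hypothetical
callables supplied by the caller, not any paper's actual components."""
import numpy as np


def retrieve(query_vec, doc_vecs, docs, k=3):
    # Cosine similarity between the query and every document vector.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(-sims)[:k]  # indices of the k most similar documents
    return [docs[i] for i in top]


def rag_answer(question, docs, embed, generate, k=3):
    """Embed, retrieve top-k, then answer conditioned on the context."""
    doc_vecs = np.stack([embed(d) for d in docs])
    context = "\n".join(retrieve(embed(question), doc_vecs, docs, k))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)
```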
### May
| Date | Paper | Links & Summary |
| --- | --- | --- |
| 05-31 | **Preemptive Answer "Attacks" on Chain-of-Thought Reasoning**<br>Affiliations: Tsinghua University<br>The paper studies the negative impact of preemptive answers on LLMs' reasoning ability and proposes strategies to mitigate it. Experiments show these strategies cannot fully offset the effect of preemptive answers, suggesting CoT robustness needs further strengthening. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.20902v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.20902.md) |
| 05-31 | **Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality**<br>Affiliations: Princeton University, Carnegie Mellon University<br>This paper presents a new structured state-space duality (SSD) framework connecting structured state-space models (SSMs) with attention variants. Its main contributions include transferring algorithmic and systems optimizations originally developed for Transformers to SSMs, and a new SSD algorithm that makes training and inference more efficient. The resulting Mamba-2 architecture achieves strong performance and offers a new direction for the design and optimization of deep learning models. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.21060v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.2106.md) |
| 05-30 | **Jina CLIP: Your CLIP Model Is Also Your Text Retriever** |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.20204v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.20204.md) |
| 05-30 | **Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts**<br>Affiliations: Ant Group<br>METRAG proposes a novel retrieval-augmented generation framework that uses utility-oriented and compactness-oriented thoughts to address the limitations of existing models, showing better performance on knowledge-intensive tasks. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.19893v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.19893.md) |
| 05-29 | **MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series** |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.19327v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.19327.md) |
| 05-29 | **LLMs achieve adult human performance on higher-order theory of mind tasks**<br>Affiliations: Google Research, Google DeepMind, Johns Hopkins University Applied Physics Lab<br>This paper measures LLM performance on higher-order theory-of-mind (ToM) tasks, showing in particular that models such as GPT-4 can reach adult-level performance on some tasks. By introducing new evaluation metrics benchmarked against real human adults, the study helps reveal and explain the potential and limits of LLMs in complex social interaction. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.18870v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.1887.md) |
| 05-28 | **RealitySummary: On-Demand Mixed Reality Document Enhancement using Large Language Models**<br>Affiliations: University of Calgary<br>This paper introduces the RealitySummary system, which combines large language models with mixed-reality technology to provide an on-demand reading aid, demonstrating the technology's practical potential and setting directions for future research. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.18620v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.1862.md) |
| 05-23 | **Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration**<br>Affiliations: Tsinghua University, Northwestern Polytechnical University, Shanghai AI Laboratory<br>For effective LLM planning in multi-agent collaboration tasks, this paper proposes the ReAd framework and shows that it reduces interaction rounds and raises success rates, laying a foundation for applying LLMs in multi-agent systems. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.14314v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.14314.md) |
| 05-23 | **PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services**<br>Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences<br>This paper proposes the PerLLM framework, which handles large volumes of inference services through edge-cloud collaboration. It optimizes service scheduling and resource allocation while significantly increasing throughput and reducing energy cost, showing clear practical value. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.14636v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.14636.md) |
| 05-23 | **RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models**<br>Affiliations: Amazon AWS AI, Shanghai AI Lab, Shanghai Jiaotong University<br>REFCHECKER is a framework for detecting and benchmarking fine-grained hallucinations in LLMs. Using claim-triplets, it detects and verifies the factual consistency of responses at fine granularity, significantly improving detection precision and agreement with human judgment. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.14486v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.14486.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/amazon-science/RefChecker)|
| 05-23 | **RaFe: Ranking Feedback Improves Query Rewriting for RAG**<br>Affiliations: Zhejiang University, Alibaba Group, Nanjing University<br>RaFe is a novel query-rewriting framework that trains the rewriting model with reranker feedback, requires no annotations, supports both offline and online feedback training, and shows good generality and effectiveness. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.14431v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.14431.md) |
| 05-23 | **AGILE: A Novel Framework of LLM Agents**<br>Affiliations: ByteDance Research, University of Science and Technology of China, Shanghai Jiao Tong University<br>The paper proposes AGILE, a new LLM agent framework that integrates diverse components and uses reinforcement learning for end-to-end training. It outperforms standalone LLM use on complex question-answering tasks, demonstrating the effectiveness of component integration and end-to-end optimization. The datasets and code are publicly released to support further research in the area. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.14751v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.14751.md) |
| 05-23 | **Agent Planning with World Knowledge Model**<br>Affiliations: Zhejiang University, Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph, National University of Singapore, Alibaba Group<br>This paper builds a parametric world knowledge model (WKM) to improve large language models on interactive planning tasks. The model draws on knowledge from expert and exploratory trajectories; comparisons against strong baselines in simulated environments validate its effectiveness and show it mitigates hallucinated actions and blind trial-and-error. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.14205v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.14205.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/zjunlp/WKM)|
| 05-23 | **HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models**<br>Affiliations: The Ohio State University, Stanford University<br>HippoRAG is a new retrieval framework inspired by the human memory system that addresses traditional LLMs' shortcomings in long-term memory and knowledge integration. By emulating the structure and workings of the human brain, HippoRAG improves LLMs' ability to handle complex knowledge-integration tasks and surpasses existing methods in both efficiency and performance. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.14831v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.14831.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/OSU-NLP-Group/HippoRAG)|
| 05-21 | **SmartFlow: Robotic Process Automation using LLMs**<br>Affiliations: TCS Research<br>SmartFlow is an AI-based RPA system that combines deep-learning visual understanding with LLMs. It automatically generates navigation workflows and autonomously executes user-assigned tasks, adapting efficiently to GUI changes and handling complex tasks. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.12842v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.12842.md) |
| 05-21 | **G-DIG: Towards Gradient-based DIverse and hiGh-quality Instruction Data Selection for Machine Translation**<br>Affiliations: ByteDance Research<br>To address the diversity and quality of instruction fine-tuning data for machine translation in LLMs, this paper proposes G-DIG, a gradient-based data-selection method, and experimentally validates its effectiveness and generalization. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.12915v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.12915.md) |
| 05-20 | **Multiple-Choice Questions are Efficient and Robust LLM Evaluators**<br>Affiliations: Shanghai Jiao Tong University<br>This study converts conventional open-ended generation questions into multiple-choice format to make LLM evaluation more efficient and accurate, a step forward in preventing invalid answers from skewing results and in improving evaluation efficiency. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.11966v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.11966.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/Geralt-Targaryen/MC-Evaluation)|
| 05-20 | **xFinder: Robust and Pinpoint Answer Extraction for Large Language Models**<br>Affiliations: Institute for Advanced Algorithms Research, Shanghai; Renmin University of China<br>This paper proposes xFinder, a method for improving the accuracy of extracting key answers from LLM outputs. It addresses needs that existing methods fail to meet and provides a more reliable basis for LLM evaluation. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.11874v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.11874.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/IAAR-Shanghai/xFinder)|
| 05-20 | **Octo: An Open-Source Generalist Robot Policy**<br>Affiliations: UC Berkeley, Stanford<br>The paper introduces Octo, a transformer-based policy that offers an open-source solution for diverse robot tasks and can be fine-tuned to adapt to new observation and action spaces. It performs well across multiple robot platforms, and its fully open source code encourages broad adoption and further development. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.12213v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.12213.md) |
| 05-20 | **OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework**<br>Affiliations: OpenLLMAI Team, ByteDance Inc., Netease Fuxi AI Lab<br>OpenRLHF is an open-source framework that makes full-scale RLHF training feasible for models beyond 70B parameters. It distributes computation with Ray, optimizes efficiency with vLLM, implements multiple alignment algorithms, and integrates seamlessly with the HuggingFace libraries for an out-of-the-box experience. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.11143v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.11143.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/OpenLLMAI/OpenRLHF)|
| 05-19 | **Your Transformer is Secretly Linear**<br>Affiliations: AIRI, Skoltech, SberAI<br>This work shows that transformations between successive transformer layers can be highly linear, challenging the conventional understanding of linear and nonlinear operations in transformers, and finds that models can be modified for efficiency without sacrificing performance. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.12250v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.1225.md) |
| 05-17 | **Prompt Exploration with Prompt Regression**<br>Affiliations: Carnegie Mellon University, Massachusetts Institute of Technology, University of Michigan<br>This paper proposes PEPR, a new framework for predicting the effect of combinations of prompt elements in LLMs and selecting the prompt best suited to a given task. Beyond the approach itself, evaluations across multiple datasets and tasks demonstrate its effectiveness. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.11083v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.11083.md) |
| 05-16 | **SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation**<br>Affiliations: Amazon, The University of Texas at Austin<br>SYNTHESIZRR is a new method that integrates information acquisition via retrieval augmentation into example synthesis for teacher-student distillation. The study shows that, compared with existing methods, the data SYNTHESIZRR generates scores better on intrinsic data diversity and downstream task accuracy. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.10040v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.1004.md) |
| 05-16 | **MarkLLM: An Open-Source Toolkit for LLM Watermarking**<br>Affiliations: Tsinghua University, Shanghai Jiao Tong University, The University of Sydney<br>MARKLLM gives researchers and the public an accessible, easy-to-use experimentation platform intended to broaden awareness of and participation in LLM watermarking and to push research and application forward. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.10051v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.10051.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/THU-BPM/MarkLLM)|
| 05-16 | **SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation**<br>Affiliations: Amazon, The University of Texas at Austin<br>SYNTHESIZRR uses retrieval augmentation to address the limited diversity of past synthetic data and its divergence from human-written text. By retrieving varied documents and content, the generated examples are more diverse and closer in style to human text, improving distilled-model performance. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.10040v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.1004.md) |
| 05-16 | **Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models**<br>Affiliations: Nanyang Technological University, University of Science and Technology of China, University of Aberdeen<br>This paper proposes and validates a new ASR error-correction paradigm that incorporates a multimodal LLM, solving the problems of ignored source speech and redundant input, with strong results in practice. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.10025v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.10025.md) |
| 05-16 | **Thinking Fair and Slow: On the Efficacy of Structured Prompts for Debiasing Language Models**<br>Affiliations: BITS Pilani, MDSR Labs, Adobe, IIT Guwahati, National University of Singapore<br>This study develops and evaluates an iterative debiasing framework for end users that removes bias from LLMs without training. Using sophisticated prompting strategies, it significantly reduces average output bias without degrading downstream task performance, paving the way for future research on prompt-based debiasing of LLMs. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.10431v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.10431.md) |
| 05-15 | **ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models**<br>Affiliations: Microsoft Research Asia, Harvard University, Peking University<br>The ALPINE project examines how autoregressive learning endows Transformers with planning ability over networks, revealing both the capability and the limits of Transformers on path-finding tasks and offering new insight into the general planning abilities of large language models in related domains. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.09220v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.0922.md) |
| 05-15 | **LoRA Learns Less and Forgets Less**<br>Affiliations: Columbia University, Databricks<br>While LoRA usually learns the target task less efficiently and less precisely than full-parameter fine-tuning, it preserves source-task performance better and provides stronger regularization. Based on this study, the paper recommends best practices for fine-tuning with LoRA, with particular attention to the learning rate, the choice of target modules, and the rank of the perturbation (a minimal LoRA sketch follows this table). |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.09673v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.09673.md) |
| 05-14 | **Is the Pope Catholic? Yes, the Pope is Catholic. Generative Evaluation of Intent Resolution in LLMs**<br>Affiliations: Carnegie Mellon University, Allen Institute for AI<br>By introducing a new generative evaluation framework, this study explores LLMs' potential and challenges in understanding intent and generating intent-aligned responses, revealing current models' shortcomings in pragmatic understanding and pointing out directions for improvement. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.08760v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.0876.md) |
| 05-13 | **RLHF Workflow: From Reward Modeling to Online RLHF**<br>Affiliations: Salesforce AI Research, University of Illinois Urbana-Champaign<br>This paper presents a complete online iterative RLHF workflow that is theoretically novel and, through a detailed practical implementation guide, also provides a framework for real-world use. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.07863v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.07863.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/RLHFlow/RLHF-Reward-Modeling)|
| 05-13 | **DoLLM: How Large Language Models Understanding Network Flow Data to Detect Carpet Bombing DDoS** |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.07638v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.07638.md) |
| 05-10 | **Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval**<br>Affiliations: Imperial College London, Huawei<br>This work effectively reduces hallucination in large language models through a new self-refinement-enhanced knowledge-graph retrieval method, notably improving practical effectiveness in the medical domain. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.06545v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.06545.md) |
| 05-10 | **UniDM: A Unified Framework for Data Manipulation with Large Language Models**<br>Affiliations: Alibaba Group, University of Science and Technology of China<br>UniDM is an innovative unified data-manipulation framework whose effective prompt design and step decomposition significantly improve the efficiency and quality of a wide range of data tasks. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.06510v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.0651.md) |
| 05-10 | **Automatic Generation of Model and Data Cards: A Step Towards Responsible AI**<br>Affiliations: CMU, MPI, ETH Zürich<br>The paper develops a method that uses large language models to automatically generate machine-learning model cards and data cards; through purpose-built datasets and an evaluation protocol, it markedly improves the quality and standardization of the generated documentation. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.06258v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.06258.md) |
| 05-10 | **A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models** |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.06211v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.06211.md) |
| 05-10 | **Value Augmented Sampling for Language Model Alignment and Personalization**<br>VAS offers an efficient and powerful approach to LLM adaptation and personalization. It avoids the instability of existing RL algorithms, combines high performance with computational efficiency, supports adapting black-box models, and opens new possibilities for LLM personalization and alignment. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.06639v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.06639.md) |
| 05-09 | **LLMPot: Automated LLM-based Industrial Protocol and Physical Process Emulation for ICS Honeypots**<br>Affiliations: New York University Abu Dhabi<br>LLMPot is an innovative ICS cybersecurity defense tool that uses LLM capabilities to automatically generate responses closely tied to the protocol and the physical process, markedly improving honeypot realism and effectiveness. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.05999v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.05999.md) |
| 05-09 | **Exploring the Potential of Human-LLM Synergy in Advancing Qualitative Analysis: A Case Study on Mental-Illness Stigma**<br>The CHALET framework demonstrates the great potential of human-LLM collaboration in qualitative research, particularly for deepening understanding and generating insight, pointing to new directions for HCI and qualitative-analysis research. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.05758v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.05758.md) |
| 05-09 | **An Automatic Prompt Generation System for Tabular Data Tasks**<br>This paper develops an automatic prompt-generation system that adapts to many LLMs without extensive training, using two novel methods to significantly improve performance on tabular-data tasks. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.05618v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.05618.md) |
| 05-09 | **Can large language models understand uncommon meanings of common words?**<br>Affiliations: Tsinghua University, Chinese Academy of Science<br>By building a new evaluation suite and dataset, this study reveals major deficiencies in large language models' understanding of uncommon meanings of common words, opening a new research direction for improving the models' NLU capability. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.05741v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.05741.md) |
| 05-08 | **"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations**<br>Affiliations: University of Washington, MBZUAI<br>Through the innovative CHAST evaluation framework, this study reveals the potential harms LLMs can cause in complex social interactions spanning a wide range of cultures and identities, underscoring the need for thorough bias audits before such models are deployed. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.05378v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.05378.md) |
| 05-08 | **Air Gap: Protecting Privacy-Conscious Conversational Agents** |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.05175v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.05175.md) |
| 05-08 | **ADELIE: Aligning Large Language Models on Information Extraction**<br>Affiliations: Tsinghua University<br>The proposed ADELIE model effectively addresses LLM alignment on information-extraction tasks. Novel datasets and training methods improve performance on these tasks while preserving strong general capability, providing valuable insights and groundwork for future research. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.05008v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.05008.md) |
| 05-07 | **QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving**<br>Affiliations: MIT, NVIDIA<br>With a new quantization algorithm and system co-design, QServe significantly improves LLM serving efficiency on GPUs and cuts costs substantially, offering a new solution for deploying large language models at scale. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.04532v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.04532.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/mit-han-lab/qserve)|
| 05-07 | **Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation**<br>Affiliations: Center for Responsible AI, IIT Madras; Princeton University<br>The paper effectively demonstrates the deceptive capability of LLM-based autonomous agents performing complex goal-directed tasks such as legislative lobbying, and proposes effective methods for detecting such deception. The findings offer important insight into legal and ethical applications of AI and open a new direction for AI safety research. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.04325v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.04325.md) |
| 05-07 | **Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application**<br>Affiliations: Kuaishou Technology, Southeast University<br>This paper successfully applies the open-world knowledge of large language models to recommender systems, solving the core challenges of practical deployment with an innovative twin-tower architecture and offering a new path to better recommendation performance. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.03988v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.03988.md) |
| 05-07 | **Toward In-Context Teaching: Adapting Examples to Students' Misconceptions**<br>Affiliations: MIT CSAIL<br>This paper demonstrates the potential of large language models for adaptive teaching, with the ATOM model effectively identifying students' misconceptions and optimizing instructional feedback. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.04495v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.04495.md) |
| 05-06 | **Lifelong Knowledge Editing for LLMs with Retrieval-Augmented Continuous Prompt Learning**<br>Affiliations: East China Normal University<br>RECIPE converts knowledge statements into continuous prompts and pairs them with a knowledge sentinel to manage the retrieval process dynamically, improving editing efficiency and inference speed in lifelong-learning scenarios while preserving overall model performance. The approach overcomes the shortcomings of earlier methods and scores well across evaluation metrics. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.03279v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.03279.md) |
| 05-06 | **MARE: Multi-Agents Collaboration Framework for Requirements Engineering**<br>Affiliations: Peking University<br>This work proposes MARE, an innovative multi-agent collaboration framework that harnesses cooperation among large language models (LLMs) across the whole requirements-engineering process. It improves on the limits of task automation in RE, and large-scale experiments show MARE outperforms state-of-the-art methods in requirements modeling and specification generation. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.03256v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.03256.md) |
| 05-03 | **What matters when building vision-language models?**<br>Affiliations: Hugging Face, Sorbonne Université<br>Through extensive experiments on the design choices that drive VLM performance, this paper presents Idefics2, an efficient foundational vision-language model, and demonstrates its superior performance on multiple standard benchmarks. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.02246v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.02246.md) |
| 05-02 | **Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models**<br>Affiliations: KAIST AI, LG AI Research, Carnegie Mellon University<br>PROMETHEUS 2 is a new open-source evaluator LM that works in both direct-assessment and pairwise-ranking formats and correlates closely with human ratings and proprietary LM judgments on custom evaluation criteria. Trained via weight merging, it significantly outperforms other open models and some proprietary ones. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.01535v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.01535.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/prometheus-eval/prometheus-eval)|
| 05-02 | **How Can I Get It Right? Using GPT to Rephrase Incorrect Trainee Responses**<br>Affiliations: Carnegie Mellon University<br>This paper studies using GPT-4 to build an automated feedback system that supports tutor training for one-on-one lessons, aiming to ease the resource burden of traditional personalized teaching feedback while delivering high-quality, specific feedback; it falls under knowledge retrieval and evaluation. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.00970v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.0097.md) |
| 05-01 | **Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3**<br>This study presents an empirical analysis of editing techniques for large language models, revealing potential weaknesses of prior methods and proposing new directions and ideas for future model-editing work. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.00664v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.00664.md) |
| 05-01 | **"I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust**<br>Affiliations: Princeton University, Microsoft<br>Through large-scale experiments, this paper shows that when LLMs express uncertainty in natural language, users rely on them less excessively and complete tasks more accurately, with first-person expressions proving especially effective. The study also stresses the importance of user testing to tune uncertainty expression before deploying LLMs in practice. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.00623v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.00623.md) |
| 05-01 | **The Real, the Better: Aligning Large Language Models with Online Human Behaviors**<br>Affiliations: Baidu Inc.<br>This paper proposes RLHB, a new LLM alignment framework that innovatively tunes and optimizes LLMs using real online human behaviors, overcoming limitations of existing methods, and validates the approach effectively through experiments. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.00578v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.00578.md) |
| 05-01 | **A Careful Examination of Large Language Model Performance on Grade School Arithmetic** |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.00332v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.00332.md) |
| 05-01 | **Can a Hallucinating Model help in Reducing Human "Hallucination"?**<br>Affiliations: Stanford University, UC Berkeley<br>This paper explores using large language models (LLMs) to detect and counter unfounded beliefs and to act as personalized misinformation-debunking agents. The authors propose evaluating and leveraging LLMs' ability to identify logical traps, a new way to challenge unfounded human beliefs. |[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.00843v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-05/2405.00843.md) |

---
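Two entries in this list ("LoRA Learns Less and Forgets Less" above and "LoRA Land" below) study LoRA fine-tuning. For background, here is a minimal sketch of a LoRA-adapted linear layer in the spirit of Hu et al. (2021); the rank and scaling defaults are illustrative, not either paper's settings.

```python
"""Minimal LoRA linear layer: freeze the pretrained weight and learn a
low-rank additive update W + (alpha/r) * B @ A. Illustrative defaults."""
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weight
            p.requires_grad = False
        # A is small random, B starts at zero, so training begins exactly
        # from the pretrained behavior.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


# Toy usage: wrap a 512x512 projection and check the output shape.
layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```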
### April
| Date | Paper | Links & Summary |
| --- | --- | --- |
| 04-30 | **Better & Faster Large Language Models via Multi-token Prediction**
Affiliations: FAIR at Meta
The paper proposes a new way to train large language models: predicting multiple tokens instead of a single one to improve sample efficiency, showing gains on generative tasks and faster inference (see the sketch after this table). Experiments demonstrate clear advantages for large-model performance and inference efficiency.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.19737v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.19737.md) |
| 04-30 | **Multi-hop Question Answering over Knowledge Graphs using Large Language Models**
Affiliations: Microsoft
For multi-hop question answering, the paper proposes tailoring strategies to different knowledge-graph datasets, demonstrating the strength of large pretrained language models on these complex QA tasks. Experiments verify the advantages of the proposed approach over existing techniques.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.19234v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.19234.md) |
| 04-30 | **Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom**
Affiliations: Shanghai Jiao Tong University
This study builds SwordsmanImp, a new Chinese multi-turn dialogue dataset, to evaluate LLMs' understanding of conversational implicature, especially in dialogues with rich context and many turns, and reveals the challenges and limitations LLMs face in understanding and explaining non-literal meaning.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.19509v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.19509.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/sjtu-compling/llm-pragmatics)|
| 04-30 | **Iterative Reasoning Preference Optimization**
Affiliations: FAIR at Meta, New York University
This paper presents an iterative reasoning preference optimization method that applies preference optimization to reasoning tasks, particularly CoT reasoning, and adds an NLL loss term during iterative training to boost performance. Experiments show the method effectively improves reasoning over several iterations before performance saturates.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.19733v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.19733.md) |
| 04-29 | **Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models**
Affiliations: Cohere
This paper develops PoLL, a new way to evaluate LLM generations with a "jury" of smaller models drawn from different model families; it shows applicability across tasks and cost efficiency, while reducing the bias that arises when a single LLM acts as judge.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.18796v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.18796.md) |
| 04-29 | **LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report**
Affiliations: Predibase
This report shows that fine-tuning large language models with LoRA markedly improves overall performance and reduces errors on classification tasks, with significant gains over out-of-the-box GPT-4 and GPT-3.5. The paper also considers cost constraints, capping the number of evaluation samples to limit the financial burden of LLM APIs.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2405.00732v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2405.00732.md) |
| 04-26 | **When to Trust LLMs: Aligning Confidence with Response Quality**
Affiliations: Alibaba Group
This paper proposes CONQORD, a reinforcement-learning method that aligns confidence with response quality. Without objective ground truth, it optimizes confidence levels via self-evaluation, reducing bias and improving prediction accuracy and alignment, though it still needs improvement relative to higher-performing methods.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.17287v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.17287.md) |
| 04-26 | **A Comprehensive Evaluation on Event Reasoning of Large Language Models**
Affiliations: Peking University, Advanced Institute of Big Data, Beihang University
This paper introduces EV2, a new benchmark for comprehensively evaluating the event reasoning of large language models (LLMs). Results show that while LLMs do possess event-reasoning abilities, they are not aligned with humans in how they use event-schema knowledge; explicit guidance helps models better understand and perform event-reasoning tasks.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.17513v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.17513.md) |
| 04-25 | **Continual Learning of Large Language Models: A Comprehensive Survey**
Affiliations: Rutgers University, Wuhan University, Huazhong University of Science and Technology
This survey offers a comprehensive view of continual learning for LLMs, emphasizing continual pre-training (CPT) and domain-adaptive pre-training (DAP). It calls for more community attention, especially to developing practical, accessible, and widely recognized evaluation benchmarks, and to methodologies designed specifically for emerging LLM learning paradigms.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.16789v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.16789.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/Wang-ML-Lab/llm-continual-learning-survey)|
| 04-25 | **How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites**
Affiliations: Shanghai AI Laboratory, SenseTime Research, Tsinghua University
InternVL 1.5 is a strong open-source multimodal language model that aims to close the performance gap between open-source and commercial models in multimodal understanding. Its strengths include improved visual understanding, handling of dynamic high-resolution images, and use of a high-quality bilingual dataset, which enable excellent performance across many tasks.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.16821v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.16821.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/OpenGVLab/InternVL)|
| 04-25 | **Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare**
|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.16621v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.16621.md) |
| 04-25 | **Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding**
Affiliations: Meta, University of Toronto, Carnegie Mellon University
LayerSkip is a novel end-to-end solution that substantially accelerates large-language-model inference without sacrificing accuracy, with real practical value and potential.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.16710v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.1671.md) |
| 04-24 | **From Local to Global: A Graph RAG Approach to Query-Focused Summarization**
Affiliations: Microsoft Research, Microsoft Strategic Missions and Technologies, Microsoft Office of the CTO
This paper proposes Graph RAG, a query-focused summarization technique built on graph indexing and LLM-generated summaries, designed for corpora too large for a large language model to process directly. Aided by community-detection algorithms, the method achieves strong results on global questions and large-scale text analysis.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.16130v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.1613.md) |
| 04-24 | **Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs**
Affiliations: Shanghai Jiao Tong University, UC San Diego, Duke University
This article is a thorough survey of Chain-of-X (CoX) methods for large language models (LLMs), focusing on extending the Chain-of-Thought (CoT) concept to broader applications and outlining potential directions for future research.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.15676v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.15676.md) |
| 04-23 | **A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and Applications**
Affiliations: Hong Kong Baptist University
This survey reviews research on LLMs applied to graph data, discusses LLMs' strengths in generalizing across graph tasks, and proposes future directions for research in this area.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.14809v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.14809.md) |
| 04-23 | **CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies**
Affiliations: Stanford University, IBM Research
This paper presents a general pipeline for building cultural knowledge bases and uses it to create CultureBank, a knowledge base of cultural descriptors from TikTok and Reddit. The authors use it to assess LLMs' cultural awareness and to train more culturally aware language models, promoting cultural awareness in future language technologies.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.15238v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.15238.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/SALT-NLP/CultureBank)|
| 04-22 | **Beyond Scaling: Predicting Patent Approval with Domain-specific Fine-grained Claim Dependency Graph**
Affiliations: University of California San Diego, Carnegie Mellon University, University of Pennsylvania
The authors propose a novel algorithm for building fine-grained claim dependency graphs (FLAN graphs), which substantially improves the state of the art at scale. Extensive experiments and analysis of modern LLMs on patent-approval prediction reveal their limitations and provide valuable guidance for future LLM-based solutions. The source code and dataset are publicly released to foster future research.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.14372v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.14372.md) |
| 04-22 | **A Survey on Efficient Inference for Large Language Models**
Affiliations: Tsinghua University
This paper comprehensively surveys the literature on efficient inference for large language models and proposes a taxonomy of data-level, model-level, and system-level optimizations. It also quantitatively compares key techniques through experiments and points out future research directions.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.14294v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.14294.md) |
| 04-22 | **A Survey on Self-Evolution of Large Language Models**
Affiliations: Peking University, Alibaba Group, Nanyang Technological University
This survey proposes and summarizes self-evolution methods for LLMs, providing a conceptual framework and insights into future directions, with the aim of advancing research toward next-generation self-evolving LLMs.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.14387v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.14387.md) |
| 04-22 | **SnapKV: LLM Knows What You are Looking for Before Generation**
Affiliations: University of Illinois Urbana-Champaign, Cohere, Princeton University
This article introduces SnapKV, a new approach to the key-value cache problem in large language models. By intelligently compressing and selecting important KV positions, SnapKV effectively improves decoding speed and memory efficiency for long-text processing while maintaining accuracy and significantly cutting computational cost.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.14469v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.14469.md) |
| 04-22 | **Tree of Reviews: A Tree-based Dynamic Iterative Retrieval Framework for Multi-hop Question Answering**
Affiliations: Tencent Inc., Harbin Institute of Technology
The paper proposes TOR, a new iterative retrieval framework that uses a tree structure to reduce error accumulation and introduces optimization strategies to improve retrieval efficiency and quality. In experiments, TOR achieves state-of-the-art performance on multiple datasets.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.14464v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.14464.md) |
| 04-22 | **LLMs Know What They Need: Leveraging a Missing Information Guided Framework to Empower Retrieval-Augmented Generation**
Affiliations: Meituan
The MIGRES framework proposed here strengthens RAG by exploiting LLMs' ability to identify missing information. Results demonstrate MIGRES's superiority on several public datasets, addressing RAG's challenges in understanding complex queries and retrieving relevant documents.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.14043v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.14043.md) |
| 04-22 | **Information Re-Organization Improves Reasoning in Large Language Models**
Affiliations: Zhejiang University
This paper proposes InfoRE, a novel information re-organization method that restructures context to expose logical relations and thereby strengthens LLMs' reasoning. Tested in a zero-shot setting on multi-hop reasoning tasks requiring contextual understanding, the method achieves notable gains.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.13985v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.13985.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/hustcxx/InfoRE)|
| 04-21 | **AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs**
Affiliations: Meta AI (FAIR), Max-Planck-Institute for Intelligent Systems
This paper presents AdvPrompter, a new LLM that uses a novel algorithm to rapidly generate human-readable adversarial prompts without gradient information from the target LLM, greatly improving generation speed while keeping prompts semantically coherent. Training with AdvPrompter also hardens LLMs against jailbreak attacks without sacrificing performance.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.16873v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.16873.md) |
| 04-19 | **Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?**
Affiliations: Nanyang Technological University, Princeton University, Salesforce Research
This paper systematically evaluates LLMs' ability to perform analogical reasoning and proposes two methods that achieve better performance while substantially cutting inference cost. Contrary to the prior belief that relevance is crucial, the results show that self-generated irrelevant examples can match or even beat relevant ones on some tasks. The authors hope the study spurs further research on designing self-generated contexts.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.12728v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.12728.md) |
| 04-19 | **LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency**
Affiliations: Nanyang Technological University, DAMO Academy Alibaba Group, Singapore University of Technology and Design
LLM-R2 is a rule-based query rewrite system enhanced by a large language model. By automatically selecting effective rules from a given rule set, it markedly improves the execution efficiency of rewritten queries, overcomes the limitations of current methods, and achieves superior performance on multiple datasets.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.12872v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.12872.md) |
| 04-18 | **mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture**
Affiliations: Beihang University, Beijing Information Science and Technology University
mABC is an innovative framework for root cause analysis (RCA) in cloud-native microservice architectures, combining large language models (LLMs) with multi-agent collaboration driven by a blockchain-inspired decision process.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.12135v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.12135.md) |
| 04-18 | **RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation**
Affiliations: Peking University, ByteDance Inc.
Through targeted cache-system design and intermediate-state sharing, RAGCache optimizes RAG pipeline performance, significantly increasing processing speed and reducing compute overhead.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.12457v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.12457.md) |
| 04-18 | **Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers**
Affiliations: Westlake University, Alibaba Group, Zhejiang University
This paper proposes MCRanker, which builds a virtual team of professional annotators and generates multi-perspective evaluation criteria, effectively improving the consistency and comprehensiveness of point-wise LLM rankers; it adapts broadly across datasets and improves ranking performance.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.11960v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.1196.md) |
| 04-18 | **EVIT: Event-Oriented Instruction Tuning for Event Reasoning**
Affiliations: Key Laboratory of High Confidence Software Technologies (PKU), MOE, China, School of Computer Science, Peking University, Advanced Institute of Big Data
EVIT introduces event-oriented instruction tuning and the notion of event quadruples to address the weak performance of smaller instruction-tuned models on event reasoning. Experiments show EVIT outperforms other models on event-reasoning tasks.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.11978v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.11978.md) |
| 04-18 | **Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing**
This paper introduces ALPHALLM, a new framework that combines Monte Carlo tree search (MCTS) with large language models (LLMs) to achieve self-improvement of LLMs without additional annotated data.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.12253v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.12253.md) |
| 04-18 | **Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences**
Affiliations: UC Berkeley
This paper presents EvalGen, an LLM-assisted evaluation interface aligned with human preferences, which uses a mixed-initiative approach to address trust in the quality of LLM-generated evaluation functions. It also examines the dynamics of how users define and apply evaluation criteria and the challenges faced in practice.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.12272v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.12272.md) |
| 04-17 | **AgentKit: Flow Engineering with Graphs, not Coding**
Affiliations: Carnegie Mellon University, NVIDIA, Microsoft
The paper introduces AgentKit, a new LLM prompting framework for multifunctional agents that supports building and tuning complex agent thought processes from modular components with an intuitive design. AgentKit shows potential for enabling advanced agent capabilities while lowering the barrier to entry.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.11483v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.11483.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/holmeswww/AgentKit)|
| 04-17 | **Unifying Bias and Unfairness in Information Retrieval: A Survey of Challenges and Opportunities with Large Language Models**
Affiliations: Renmin University of China, Chinese Academy of Sciences, Huawei Technologies
This survey offers a fresh perspective that frames bias and unfairness in LLMs and IR systems as distribution-mismatch problems, and categorizes the corresponding mitigation strategies.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.11457v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.11457.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/KID-22/LLM-IR-Bias-Fairness-Survey)|
| 04-17 | **A Deep Dive into Large Language Models for Automated Bug Localization and Repair**
Affiliations: University of Virginia, Purdue University, Amazon Web Services
This paper proposes Toggle, a method for token-granularity bug localization and repair that overcomes the limitations of line-granularity approaches. With tailored input design and LLM fine-tuning, it greatly improves bug-fix accuracy, achieves strong results on multiple datasets, and advances the APR field.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.11595v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.11595.md) |
| 04-17 | **Many-Shot In-Context Learning**
Affiliations: Google DeepMind
The main contributions are a systematic evaluation of LLM performance across different scales of in-context examples, the introduction of reinforced ICL and unsupervised ICL to reduce dependence on examples, and the finding that many-shot ICL can overcome pretraining biases and learn high-dimensional numerical prediction tasks.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.11018v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.11018.md) |
| 04-16 | **How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior**
Affiliations: Stanford University
By analyzing the tension between LLMs' internal knowledge and retrieved information in RAG settings, the paper finds that an LLM's tendency to follow RAG information is inversely proportional to its confidence in answering without context. The study, spanning six domain datasets with over 1,200 questions, reveals an inherent conflict between a model's pretrained knowledge and the retrieved information.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.10198v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.10198.md) |
| 04-16 | **CoTAR: Chain-of-Thought Attribution Reasoning with Multi-level Granularity**
Affiliations: Intel Labs
CoTAR targets LLMs' tendency to produce inaccurate attributions in QA tasks. By reasoning before generating output and guiding the model at different attribution-granularity levels, it markedly improves answer quality and attribution precision.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.10513v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.10513.md) |
| 04-16 | **Self-playing Adversarial Language Game Enhances LLM Reasoning**
Affiliations: Tencent AI Lab
This paper proposes SPAG, a new training scheme in which self-play in an adversarial language game effectively improves LLMs' reasoning, with gains that can be continually strengthened through iteration.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.10642v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.10642.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/Linear95/SPAG)|
| 04-15 | **Learn Your Reference Model for Real Good Alignment**
Affiliations: Tinkoff
This paper proposes Trust Region DPO (TR-DPO), which updates the reference policy's parameters during training and markedly improves language-model alignment. Experiments show TR-DPO outperforms DPO on two datasets, effectively improving the model across multiple metrics.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.09656v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.09656.md) |
| 04-15 | **Compression Represents Intelligence Linearly**
Affiliations: The Hong Kong University of Science and Technology, Tencent
Through empirical study, this paper shows a nearly linear correlation between LLMs' downstream task performance and their compression efficiency, supporting the long-held belief that better compression indicates greater intelligence. It also proposes using compression efficiency as an unsupervised metric for evaluating LLMs.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.09937v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.09937.md) |
| 04-14 | **Emerging Platforms Meet Emerging LLMs: A Year-Long Journey of Top-Down Development**
This work focuses on supporting and optimizing machine-learning model deployment on emerging compute platforms, proposing TAPML, a framework that uses a top-down approach and a universal runtime to make deployment broader, easier, and more robust; real deployment cases provide insights and best practices for developing ML systems.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.09151v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.09151.md) |
| 04-13 | **Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning**
Affiliations: Nanjing University, University of California
The paper proposes Intuition-MoR1E, a new framework for multi-task fine-tuning of large language models that draws on principles from human cognitive neuroscience and uses rank-1 experts to manage intuition, markedly improving parameter efficiency and multi-task fine-tuning results.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.08985v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.08985.md) |
| 04-12 | **Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length**
Affiliations: AI at Meta, University of Southern California, Carnegie Mellon University
This paper introduces MEGALODON, a neural architecture that efficiently handles sequences of unlimited context length. With several innovations, MEGALODON shows better efficiency and effectiveness than Transformers on long-sequence modeling, with robust improvements across scales and modalities.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.08801v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.08801.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/XuezheMax/megalodon)|
| 04-11 | **Rho-1: Not All Tokens Are What You Need**
Affiliations: Xiamen University, Tsinghua University, Microsoft
This paper presents RHO-1, a language model trained with selective language modeling (SLM) that focuses pretraining on useful tokens. The approach shows excellent results in continued pretraining on mathematics, reaching baseline performance faster and achieving state-of-the-art results with far fewer tokens.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.07965v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.07965.md) |
| 04-11 | **Interactive Prompt Debugging with Sequence Salience**
This paper presents Sequence Salience, a system extending existing input-salience (IS) methods to support debugging of complex LLM prompts. The tool provides real-time interactive debugging, reduces practitioners' cognitive load, supports rapid prompt iteration based on salience results, and aligns better with developers' mental models.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.07498v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.07498.md) |
| 04-11 | **Decomposing Label Space, Format and Discrimination: Rethinking How LLMs Respond and Solve Tasks via In-Context Learning**
Affiliations: Nanyang Technological University
This paper investigates how ICL improves task performance by decomposing its contributing factors, finding that ICL chiefly works by refining the label space and format, and underscoring the importance of choosing good demonstration examples.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.07546v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.07546.md) |
| 04-11 | **ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback**
Affiliations: University of Central Florida, ByteDance Inc
ControlNet++ improves controllability under diverse conditional controls by optimizing pixel-level consistency between generated images and conditions, and its efficient reward fine-tuning strategy cuts the time and memory costs associated with image sampling.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.07987v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.07987.md) |
| 04-11 | **ChatGPT Can Predict the Future when it Tells Stories Set in the Future About the Past**
Affiliations: Baylor University
By analyzing the forecasting abilities of ChatGPT-3.5 and ChatGPT-4, this study reveals new reasoning potential in LLMs. It shows that "future narrative" prompts significantly improve prediction accuracy, offering useful insights for applying LLMs in analytical settings.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.07396v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.07396.md) |
| 04-11 | **OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments**
Affiliations: The University of Hong Kong, CMU, Salesforce Research
OSWORLD provides a new evaluation environment that addresses the limitations of existing benchmarks, laying a foundation for developing multimodal agents that can complete open-ended tasks in real computer environments.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.07972v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.07972.md) |
| 04-10 | **Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention**
Affiliations: Google
This work proposes Infini-attention, a new attention mechanism that combines compressive memory with standard dot-product attention and is designed for plug-and-play continued pretraining and long-context adaptation, letting LLMs handle unboundedly long contexts with bounded memory and compute.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.07143v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.07143.md) |
| 04-10 | **Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation**
Affiliations: Apple, Cupertino, CA, USA
This paper proposes "superposition prompting", a new retrieval-augmented generation (RAG) prompting method that addresses the problems large language models face with long texts, significantly improving time efficiency and accuracy without extra training or fine-tuning. The method is validated on many pretrained models, and the authors plan to release an open-source implementation.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.06910v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.0691.md) |
| 04-10 | **Transferable and Efficient Non-Factual Content Detection via Probe Training with Offline Consistency Checking**
Affiliations: Renmin University of China, Tsinghua University
This paper proposes PINOSE, a new method that trains a probing model via offline self-consistency checking, effectively addressing the limits of existing factuality-detection methods. PINOSE improves the transferability and efficiency of the process and outperforms prior methods on factuality-detection and QA benchmarks.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.06742v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.06742.md) |
| 04-10 | **"We Need Structured Output": Towards User-centered Constraints on Large Language Model Output**
Affiliations: Google Research
This paper explores user-centered constraints on large language model (LLM) output, surveying industry professionals to understand the relevant scenarios and needs. The focus is on improving developer efficiency when developing, testing, and integrating LLMs, and on enhancing the end-user experience by satisfying specific output-format and UI requirements.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.07362v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.07362.md) |
| 04-09 | **THOUGHTSCULPT: Reasoning with Intermediate Revision and Search**
Affiliations: UC Berkeley
THOUGHTSCULPT is a graph-based framework with a built-in self-correction mechanism that lets LLMs iteratively refine earlier output while generating new thought nodes, excelling particularly at tasks that require continual revision and modification.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.05966v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.05966.md) |
| 04-09 | **Event-enhanced Retrieval in Real-time Search**
Affiliations: Tencent Search, Platform and Content Group
EER is a new method that targets "semantic drift" in real-time search, improving retrieval by enhancing the EBR model and adding contrastive learning and an event-triple generation task. Experiments validate its effectiveness, and it may offer a new perspective for information retrieval.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.05989v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.05989.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/open-event-hub/Event-enhanced_Retrieval)|
| 04-09 | **RULER: What's the Real Context Size of Your Long-Context Language Models?**
Affiliations: NVIDIA
This paper introduces and open-sources RULER, a new evaluation tool for long-context LMs that tests performance on complex tasks and long-context understanding, analyzes a range of models and task complexities, and advances future research on long-context LMs.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.06654v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.06654.md) |
| 04-09 | **Privacy Preserving Prompt Engineering: A Survey**
Affiliations: University of Arkansas
To protect privacy when using LLMs for ICL and prompting in general, this survey provides a systematic overview of privacy-preserving methods in this space, helping to drive further community research and exploration on privacy protection.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.06001v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.06001.md) |
| 04-08 | **LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding**
Affiliations: Alibaba Group, Zhejiang University
This paper presents the LayoutLLM model and its layout instruction tuning strategy, which markedly improve how models understand and use document layout information, with especially strong results on zero-shot document-understanding tasks.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.05225v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.05225.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding)|
| 04-08 | **Evaluating Interventional Reasoning Capabilities of Large Language Models**
Affiliations: Université de Montréal, Google DeepMind, ServiceNow Research
This paper evaluates the causal reasoning of large language models (LLMs). By proposing intervention-effect prediction, it tests how LLMs update their factual understanding after interventional experiments. Results show GPT-4 can accurately predict intervention effects under some conditions, but small changes in prompt design significantly affect its performance.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.05545v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.05545.md) |
| 04-08 | **Know When To Stop: A Study of Semantic Drift in Text Generation**
Affiliations: FAIR, Meta, Anthropic
This paper provides tools for understanding and measuring semantic drift in long-form language-model generation. Early stopping and resample-then-rerank methods significantly improve factual accuracy and suggest strategies for balancing informativeness against factuality.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.05411v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.05411.md) |
| 04-08 | **LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding**
Affiliations: Meta
This paper proposes and validates an LLM-augmented retrieval framework with enhanced document-level embeddings: generating synthetic relevant queries and titles enriches the contextual information in document embeddings, and key steps of retrieval-model training are improved, boosting retriever performance and robustness.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.05825v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.05825.md) |
| 04-07 | **Radial Networks: Dynamic Layer Routing for High-Performance Large Language Models**
Affiliations: Cornell University
This paper proposes Radial Networks, a new neural architecture that performs token-level layer routing via dynamic layer sparsity and a trained router module. This improves model performance while significantly reducing compute and serving costs, opening room for further scaling of large language models.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.04900v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.049.md) |
| 04-07 | **Prompting Large Language Models for Zero-shot Essay Scoring via Multi-trait Specialization**
Affiliations: Peking University
This study proposes MTS, a zero-shot LLM essay-scoring framework that scores different writing traits through multi-round dialogue and derives a final score with min-max scaling and outlier clipping. MTS is significantly more accurate than direct prompt-based scoring and outperforms ChatGPT in small-scale deployment, offering a zero-shot essay-scoring option beyond supervised learning.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.04941v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.04941.md) |
| 04-04 | **Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences**
Affiliations: Microsoft Research
This paper introduces DNO, an algorithm that combines the simplicity of contrastive learning with the theoretical generality of optimizing general preferences. DNO significantly improves post-trained large language models, and its success shows that optimizing general preferences can guide models to stay aligned with human values.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.03715v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.03715.md) |
| 04-04 | **ReFT: Representation Finetuning for Language Models**
Affiliations: Stanford University, Pr(Ai)2R Group
This paper introduces LoReFT, a new representation-finetuning method for language models that substantially beats existing parameter-efficient fine-tuning (PEFT) methods in resource efficiency and model control. Experiments show new state-of-the-art performance on tasks across several NLP domains with fewer parameters and better interpretability.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.03592v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.03592.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/stanfordnlp/pyreft)|
| 04-04 | **AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent**
|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.03648v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.03648.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/THUDM/AutoWebGLM)|
| 04-03 | **PromptRPA: Generating Robotic Process Automation on Smartphones from Textual Prompts**
Affiliations: Shanghai Jiao Tong University, CMU
The article presents PromptRPA, an effective solution to the limited applicability of RPA on mobile devices. Using a multi-agent framework and online tutorials, the system interprets diverse textual prompts and handles a wide range of RPA tasks. Evaluation shows a significantly higher success rate, demonstrating the feasibility of text-driven control for RPA and opening new directions for functionality and applicability.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.02475v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.02475.md) |
| 04-02 | **CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models**
Affiliations: East China Jiaotong University, Guangdong University of Technology, University of Toronto
The paper's core contribution is CMAT, an innovative framework that enables dynamic, real-time memory updates within multi-agent systems, together with a new role-playing mechanism for precise task allocation and better inter-agent communication, significantly improving overall performance and cooperation efficiency.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.01663v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.01663.md) |
| 04-02 | **Octopus v2: On-device language model for super agent**
Affiliations: Stanford University
This paper tackles deployment and function-calling efficiency for LLMs on edge devices. With a dedicated training method and a reduced inference-time context, it markedly improves on-device function-calling accuracy and lowers latency; experiments show a significant impact on function-calling performance.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.01744v3)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.01744.md) |
| 04-02 | **Long-context LLMs Struggle with Long In-context Learning**
Affiliations: University of Waterloo, Carnegie Mellon University
This article proposes LongICLBench, a new benchmark for evaluating LLMs on long-input tasks and their sensitivity to instance positions in the input sequence, helping to better understand and improve large language models' long-text capabilities.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.02060v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.0206.md) |
| 04-02 | **LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models**
Affiliations: Microsoft
The paper explores how large language models (LLMs) can assist in designing adaptive bitrate (ABR) algorithms: generating diverse candidate algorithms and testing them in a network simulator with early stopping to efficiently filter out the most effective designs. Evaluation shows that LLMs can significantly improve ABR performance in specific network scenarios.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.01617v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.01617.md) |
| 04-02 | **Advancing LLM Reasoning Generalists with Preference Trees**
|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.02078v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.02078.md) |
| 04-01 | **AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature Review**
Affiliations: University of Lyon, INSA Lyon, Infologic
This paper provides a comprehensive literature review of incident management in AIOps, offering structured knowledge, identifying gaps, and laying groundwork for future development of the field. It establishes unified AIOps terminology and a taxonomy, surfaces existing challenges, and lists public datasets, giving future research direction and foundations.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.01363v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.01363.md) |
| 04-01 | **Mapping the Increasing Use of LLMs in Scientific Papers**
Affiliations: Stanford University, UC Santa Barbara
This is the first systematic, large-scale analysis across papers published on arXiv, bioRxiv, and in Nature journals, using a statistical estimation method that measures the population-level prevalence of LLM-modified content, offering valuable insight into the use of LLMs in scientific writing.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.01268v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.01268.md) |
| 04-01 | **Prompt-prompted Mixture of Experts for Efficient LLM Generation**
Affiliations: CMU
GRIFFIN is a training-free MoE system that exploits the flocking phenomenon inside LLM feed-forward blocks to improve model efficiency across different activation functions, cutting compute cost while preserving performance.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.01365v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.01365.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/hdong920/GRIFFIN)|
| 04-01 | **Efficiently Distilling LLMs for Edge Applications**
Affiliations: IBM Research
This paper offers a new method for distilling LLMs for edge devices, allowing LPFT to significantly reduce both model size and training cost, in particular addressing decoder models' resistance to compression and their training time.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.01353v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.01353.md) |
| 04-01 | **LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation**
Affiliations: Microsoft Research Asia
This paper proposes LLM-RadJudge, a new LLM-based framework for evaluating radiology report generation that effectively improves the clinical relevance and consistency of report evaluation. Through knowledge distillation it develops a small model, lowering evaluation cost and improving accessibility, and providing strong support for radiology-report-generation research and practice.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.00998v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-04/2404.00998.md) |
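The 04-30 multi-token prediction entry above describes training against several future-token targets at once. The sketch below is a minimal, hypothetical rendering of the core idea (a shared trunk with one output head per future offset); the toy trunk, dimensions, and head count are assumptions for illustration, not Meta's actual architecture:

```python
# Minimal sketch of multi-token prediction: a shared trunk computes one
# representation, and n independent heads predict tokens t+1 ... t+n from it.
# All sizes here are toy values chosen for illustration.
import torch
import torch.nn as nn

class MultiTokenPredictor(nn.Module):
    def __init__(self, d_model: int = 64, vocab_size: int = 1000, n_future: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(n_future))

    def forward(self, x):
        h = self.trunk(x)                        # (batch, seq, d_model), computed once
        return [head(h) for head in self.heads]  # one logits tensor per future offset

model = MultiTokenPredictor()
x = torch.randn(2, 16, 64)           # toy embedded inputs: batch=2, seq=16
logits = model(x)
print(len(logits), logits[0].shape)  # 4 heads, each (2, 16, 1000)
```

At inference one can keep only the t+1 head for ordinary decoding; the entry above notes the extra predictions can also be used to speed up inference.

---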
### March
| Date | Paper | Links & Summary |
| --- | --- | --- |
| 03-28 | **Jamba: A Hybrid Transformer-Mamba Language Model**
Affiliations: AI21 Labs
Jamba is a new large language model built on a hybrid Transformer-Mamba architecture; it pushes past long-context limits and uses mixture-of-experts (MoE) components to raise throughput while keeping a small memory footprint. The model marks a new direction for large language models and demonstrates a possible balance between efficient training and strong performance.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.19887v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.19887.md) |
| 03-28 | **sDPO: Don't Use Your Data All at Once**
This paper proposes stepwise DPO (sDPO), which uses preference datasets in steps and takes the aligned model from the previous step as the reference model for the current step, effectively improving the final model's performance and alignment.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.19270v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.1927.md) |
| 03-27 | **Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback**
This work proposes the RLKF framework and defines new metrics for model reliability, effectively addressing LLM hallucination and improving LLM honesty and reliability, showing potential for building more trustworthy AI systems.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.18349v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.18349.md) |
| 03-27 | **BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models**
Affiliations: DCST Tsinghua University, Beijing Institute of Technology, Huawei Cloud BU
This study proposes BLADE, a new architecture that augments black-box large language models with small domain-specific models, addressing large models' knowledge gaps in specialized domains. BLADE proves an effective solution in both performance and cost.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.18365v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.18365.md) |
| 03-26 | **COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning**
Affiliations: Shenzhen Institute of Advanced Technology, CAS; M-A-P; Institute of Automation, CAS
This paper presents COIG-CQIA, a high-quality dataset for Chinese instruction tuning that promotes alignment with human interaction. The study underscores the importance of high-quality data sources in fine-tuning and shows through experiments how dataset-construction strategies and fine-tuning methods markedly affect model performance.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.18058v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.18058.md) |
| 03-26 | **LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning**
Affiliations: The Hong Kong University of Science and Technology, University of Illinois Urbana-Champaign
The LISA strategy proposed in this paper uses layerwise importance sampling of weights to improve the efficiency and performance of large-language-model fine-tuning while keeping memory usage similar to LoRA.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.17919v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.17919.md) |
| 03-26 | **The Unreasonable Ineffectiveness of the Deeper Layers**
Affiliations: Meta FAIR, UMD
This paper proposes a simple layer-pruning strategy for popular open-weight pretrained LLMs and presents empirical evidence that removing a large fraction of layers has little impact on performance (see the sketch after this table).|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.17887v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.17887.md) |
| 03-25 | **AIOS: LLM Agent Operating System**
Affiliations: Rutgers University
AIOS, an LLM agent operating system, overcomes prior challenges in areas such as resource scheduling and context management through purpose-built kernels and modules, improving LLM agents' performance and efficiency and paving the way for the future development and deployment of the AIOS ecosystem.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.16971v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.16971.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/agiresearch/AIOS)|
| 03-22 | **Can large language models explore in-context?**
Affiliations: Microsoft Research, Carnegie Mellon University
This paper investigates whether contemporary large language models (LLMs) can explore in context, particularly without training-time intervention. Across a series of experiments, the authors find that LLMs explore robustly only under specific configurations. Without careful prompt design, even frontier LLMs may fail to explore in more complex environments, where externally summarizing interaction history can be a non-trivial algorithm-design problem. The work suggests LLMs may need targeted algorithmic interventions to work effectively in complex environments.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.15371v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.15371.md) |
| 03-20 | **Chain-of-Interaction: Enhancing Large Language Models for Psychiatric Behavior Understanding by Dyadic Contexts**
Affiliations: University of Memphis, San Francisco Veterans Affairs Health Care System, University of California San Francisco
By introducing chain-of-interaction prompting, this paper effectively improves large language models' understanding of psychiatric behavior, especially in the context of motivational interviewing. Structured prompting and evaluation simulate the thinking of professional psychotherapists, effectively teaching the model domain knowledge and outperforming traditional methods.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.13786v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.13786.md) |
| 03-19 | **Towards Robots That Know When They Need Help: Affordance-Based Uncertainty for Large Language Model Planners**
Affiliations: University of Maryland
The paper proposes LAP, a new method that combines large language models (LLMs) with scene affordances to reduce hallucination in planning tasks and achieve uncertainty alignment. Experiments on simulated and real-world robot-manipulation tasks show LAP significantly raises success rates and reduces reliance on human help, advancing intelligent robotics.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.13198v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.13198.md) |
| 03-18 | **Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression**
Affiliations: University of Texas at Austin, Drexel University, MIT
This paper is the first comprehensive evaluation of compressed LLMs across multiple trust dimensions, offering practical advice for balancing efficiency and trustworthiness during compression.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.15447v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.15447.md) |
| 03-15 | **Uni-SMART: Universal Science Multimodal Analysis and Research Transformer**
Affiliations: DP Technology, AI for Science Institute Beijing
Uni-SMART is an innovative model for deep understanding of multimodal scientific literature; it outperforms leading text-focused LLMs across several domains and has the potential to transform how we interact with scientific literature.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.10301v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.10301.md) |
| 03-15 | **VideoAgent: Long-form Video Understanding with Large Language Model as Agent**
Affiliations: Stanford University
VideoAgent takes an important step in long-form video understanding by mimicking human cognitive processes, emphasizing reasoning over visual information across long time spans. The work sets a new benchmark for long-video understanding and offers guidance for future research in this direction.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.10517v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.10517.md) |
| 03-15 | **RAFT: Adapting Language Model to Domain Specific RAG**
Affiliations: UC Berkeley
The RAFT method proposed here innovates in training large language models to answer questions "open-book" within specific domains, strengthening reasoning and robustness to distractor documents, and improving answer accuracy through chain-of-thought-style reasoning.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.10131v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.10131.md) |
| 03-13 | **Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments**
Affiliations: Nanjing University, Microsoft
The Readi framework offers an efficient and faithful way to reason over large-scale structured environments: it fully exploits LLMs' planning abilities and refines reasoning paths with dynamic feedback, achieving marked improvements on multi-hop reasoning tasks.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.08593v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.08593.md) |
| 03-13 | **Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework**
Affiliations: ByteDance Research, University of Maryland College Park, Carnegie Mellon University
This paper proposes a new causality-guided debiasing framework and empirically validates its effectiveness; it can both integrate existing prompt-based debiasing methods and suggest new routes for inducing unbiased reasoning.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.08743v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.08743.md) |
| 03-13 | **Scaling Instructable Agents Across Many Simulated Worlds**
The SIMA project proposed here aims to create an AI system that can follow arbitrary language instructions across diverse simulated 3D environments. Its design tackles the challenges of grounding language in perception and embodied action, and of achieving generality and scalability across many different environments.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2404.10179v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2404.10179.md) |
| 03-12 | **Chronos: Learning the Language of Time Series**
Affiliations: Amazon Web Services, UC San Diego, University of Freiburg
Chronos, a pretrained time-series forecasting framework, excels at both zero-shot and standard forecasting tasks. Leveraging data-augmentation strategies and public datasets, it confirms the versatility of language-model architectures for time-series forecasting and opens a new research direction for future time-series models.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.07815v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.07815.md) |
| 03-11 | **RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback**
Affiliations: Zhejiang University, Southeast University, Massachusetts Institute of Technology
RA-ISF is an innovative retrieval-augmented framework that improves LLMs' problem solving through iterative question decomposition and three iterating submodules, effectively reducing interference from irrelevant text and significantly improving knowledge-retrieval performance.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.06840v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.0684.md) |
| 03-11 | **ERA-CoT: Improving Chain-of-Thought through Entity Relationship Analysis**
Affiliations: Zhejiang University, Southeast University
By proposing ERA-CoT, this paper effectively strengthens large language models' reasoning and question answering in complex entity scenarios. Enhancing the understanding of entity relations yields a marked gain in reasoning accuracy, especially during CoT reasoning.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.06932v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.06932.md) |
| 03-11 | **Stealing Part of a Production Language Model**
Affiliations: Google DeepMind, ETH Zurich, University of Washington
This paper presents a new model-stealing attack on production language models that can effectively extract the final layer of a Transformer and use it to recover details, parameters, and dimensions of black-box models. It also discusses possible defenses and the need to modify APIs to prevent such attacks in the future.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.06634v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.06634.md) |
| 03-08 | **Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation**
This paper introduces Adversarial Policy Optimization (AdvPO), a new approach to the reward over-optimization problem that arises in reinforcement learning from human feedback, especially in large language models aligned with human preferences. AdvPO effectively mitigates reward over-optimization without incurring high computational cost.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.05171v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.05171.md) |
| 03-08 | **Harnessing Multi-Role Capabilities of Large Language Models for Open-Domain Question Answering**
Affiliations: Gaoling School of Artificial Intelligence Renmin University of China, Nankai University, Beijing Academy of Artificial Intelligence
LLMQA is a new general framework that gathers higher-quality evidence by combining retrieval and generation paradigms and lets LLMs play multiple roles within the framework, improving overall open-domain QA performance; experiments confirm it surpasses existing methods.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.05217v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.05217.md) |
| 03-08 | **Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context**
Affiliations: Google
Gemini 1.5 Pro achieves a major breakthrough in memorizing and reasoning over massive long-context information, especially ultra-long text, video, and audio. The model not only outperforms existing models but is also markedly more compute-efficient.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.05530v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.0553.md) |
| 03-07 | **Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference**
Affiliations: UC Berkeley, Stanford, UCSD
Chatbot Arena is an open platform for evaluating large language models by human preference. It crowdsources user questions and runs anonymized, randomized battles to assess LLMs, addressing the limitations of static-dataset benchmarks, with carefully designed statistical methods ensuring credible and efficient evaluation.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.04132v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.04132.md) |
| 03-07 | **Yi: Open Foundation Models by 01.AI**
Affiliations: 01.AI
This paper presents Yi-34B, a model comparable to GPT-3.5 in both performance and efficiency, and details innovative approaches to large-language-model pretraining and instruction fine-tuning.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.04652v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.04652.md) |
| 03-05 | **Design2Code: How Far Are We From Automating Front-End Engineering?**
Affiliations: Stanford University, Georgia Tech, Microsoft
By formalizing and benchmarking the Design2Code task, this paper evaluates how well current multimodal LLMs convert visual designs into code, finds that GPT-4V performs best, and offers a new paradigm for automating front-end development.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.03163v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.03163.md) |
| 03-05 | **ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary**
Affiliations: Tsinghua University
The ChatCite system is designed to overcome LLMs' challenges in generating literature reviews: dedicated modules let an LLM agent better understand, summarize, and compare different works, producing organized literature reviews with comparative analysis.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.02574v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.02574.md) |
| 03-05 | **MathScale: Scaling Instruction Tuning for Mathematical Reasoning**
Affiliations: The Chinese University of Hong Kong Shenzhen, China; Microsoft Research Asia, Beijing, China; Shenzhen Research Institute of Big Data, Shenzhen, China
MathScale proposes a scalable method for creating high-quality mathematical-reasoning data and builds a new evaluation benchmark, MWPBENCH, to comprehensively assess LLMs' mathematical reasoning, significantly improving models' ability to solve math problems.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.02884v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-03/2403.02884.md) |
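The 03-26 layer-pruning entry above reports that deleting many layers barely hurts performance. As a minimal, hypothetical sketch of that strategy (the toy 12-layer stack and the choice of which block to drop are assumptions for illustration, not the paper's exact procedure), pruning amounts to removing a contiguous block of deeper layers:

```python
# Minimal sketch of simple layer pruning: delete a contiguous block of deeper
# layers from a stack and keep the rest. Toy sizes are for illustration only.
import torch.nn as nn

def drop_layer_block(layers: nn.ModuleList, start: int, n_drop: int) -> nn.ModuleList:
    """Remove n_drop consecutive layers beginning at index `start`."""
    kept = [m for i, m in enumerate(layers) if not (start <= i < start + n_drop)]
    return nn.ModuleList(kept)

blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(12))  # stand-in for a 12-layer decoder
pruned = drop_layer_block(blocks, start=8, n_drop=3)        # drop deeper layers 8-10
print(len(blocks), "->", len(pruned))                       # 12 -> 9
```

On a real pretrained model the interesting question, per the entry above, is how little quality is lost when the dropped block sits among the deeper layers.

---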
### February
| Date | Paper | Links & Summary |
| --- | --- | --- |
| 02-29 | **StarCoder 2 and The Stack v2: The Next Generation**
Affiliations: ServiceNow, Hugging Face
This paper describes the development of The Stack v2 and StarCoder2, built on large-scale code pretraining and instruction fine-tuning. By combining diverse data sources with a carefully designed training process, the researchers markedly improved code LLMs, especially on low-resource programming languages and tasks requiring code reasoning.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.19173v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.19173.md) |
| 02-29 | **SEED: Customize Large Language Models with Sample-Efficient Adaptation for Code Generation**
Affiliations: Peking University
This paper proposes SEED, an adaptation method that uses error-driven learning so that LLMs learn sample-efficiently from fewer examples, achieving better performance and generalization on code-generation tasks.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.00046v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2403.00046.md) |
| 02-29 | **Beyond Language Models: Byte Models are Digital World Simulators**
Affiliations: Microsoft Research Asia
The paper demonstrates bGPT's potential on challenging byte-level data-simulation tasks, highlighting its capacity for cross-modal knowledge transfer and digital-world simulation, and revealing byte models' broad applicability and flexibility for processing and understanding digital media data.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.19155v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.19155.md) |
| 02-29 | **Resonance RoPE: Improving Context Length Generalization of Large Language Models**
Affiliations: DIRO, Université de Montréal, Mila - Quebec AI Institute, Huawei Noah’s Ark Lab
This paper proposes Resonance RoPE, an improved scheme based on analyzing the feature wavelengths of RoPE position embeddings to boost performance on long texts. It also introduces the POSGEN benchmark to help study and evaluate position embeddings on long-text tasks.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2403.00071v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2403.00071.md) |
| 02-27 | **EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions**
Affiliations: Alibaba Group
The EMO framework improves the realism and expressiveness of generated videos through direct audio-to-video synthesis, clearly outperforming existing techniques and marking an important advance in video synthesis.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.17485v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.17485.md) |
| 02-27 | **The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits**
Affiliations: Microsoft, University of Chinese Academy of Sciences
The paper presents BitNet b1.58, a 1.58-bit quantized large language model that matches full-precision LLMs in performance while being more efficient and energy-saving.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.17764v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.17764.md) |
| 02-27 | **When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method**
Affiliations: Google DeepMind
This paper offers deep insight into how factors such as data size, model size, and fine-tuning method affect performance during the LLM fine-tuning stage, and defines a new evaluation framework.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.17193v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.17193.md) |
| 02-27 | **REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering**
Affiliations: Gaoling School of Artificial Intelligence Renmin University of China, School of Information Renmin University of China
This paper proposes REAR, a framework that gives LLMs self-awareness of document relevance to better exploit external knowledge in QA tasks, and shows that it effectively surpasses prior methods.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.17497v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.17497.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/RUCAIBox/REAR)|
| 02-27 | **Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization**
Affiliations: Zhejiang University, Institute of Software Chinese Academy of Sciences, Nanjing University of Posts and Telecommunications
Agent-Pro is a new LLM-based agent that learns and evolves strategies in interactive environments through policy-level reflection and optimization, addressing prior work's inability to learn and adapt through interaction.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.17574v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.17574.md) |
| 02-27 | **Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models**
Affiliations: OpenAI
This article is a review of Sora, a large vision model. It discusses Sora's technical characteristics, innovations, current application limitations, and possible future opportunities. Sora's capabilities demonstrate progress in large vision models along several dimensions, including long-video generation and handling diverse video formats.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.17177v2)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.17177.md) |
| 02-26 | **Improving LLM-based Machine Translation with Systematic Self-Correction**
Affiliations: Zhejiang University, Tencent, Angelalign Technology Inc.
The paper presents TER, the first LLM-based self-correcting translation framework, and validates its quality improvements across language pairs and models. It brings a new perspective to machine translation, especially for applying self-correction to high-resource and low-resource languages and to translation across different pivot languages.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.16379v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.16379.md) |
| 02-26 | **Do Large Language Models Latently Perform Multi-Hop Reasoning?**
Affiliations: Google DeepMind, UCL, Google Research
This paper studies whether LLMs latently perform multi-hop reasoning and proposes new experimental methods for assessing it. The results show strong evidence of multi-hop reasoning for certain relation types of prompts, but the use of such reasoning paths is highly context-dependent across prompt types.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.16837v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.16837.md) |
| 02-26 | **LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments**
The study introduces the LLMARENA benchmark to evaluate LLM agents' capabilities in complex multi-agent environments, identifies open problems, and motivates future research directions, including capabilities in multimodal dynamic environments and the potential to use external tools.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.16499v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.16499.md) |
| 02-25 | **ChatMusician: Understanding and Generating Music Intrinsically with LLM**
Affiliations: Hong Kong University of Science and Technology
By creating the first music pretraining dataset and evaluation benchmark for language models, this paper improves LLMs' music understanding and generation and makes substantial progress in this under-explored area.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.16153v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.16153.md) |
| 02-23 | **ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition**
|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.15220v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.1522.md) |
| 02-23 | **Genie: Generative Interactive Environments**
Affiliations: Google DeepMind, University of British Columbia
Genie is an interactive-environment model that generates new videos and lets users control their content, bridging the gap between traditional video generation and interactive experiences.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.15391v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.15391.md) |
| 02-22 | **Automating psychological hypothesis generation with AI: when large language models meet causal graph**
|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.14424v3)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.14424.md) |
| 02-22 | **Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments**
By designing dedicated tools and reasoning algorithms for LLMs, the study develops FUXI, a new framework that effectively improves LLMs' ability to operate in complex environments.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.14672v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.14672.md) |
| 02-22 | **CriticBench: Benchmarking LLMs for Critique-Correct Reasoning**
Affiliations: Tsinghua University, University of Hong Kong
Using CRITICBENCH, this paper evaluates LLMs' critique-and-correct reasoning, probes the key factors that affect these abilities, and aims to foster follow-up research on LLM critique and self-improvement.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.14809v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.14809.md) |
| 02-22 | **OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement**
|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.14658v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.14658.md) |
| 02-21 | **User-LLM: Efficient LLM Contextualization with User Embeddings**
USER-LLM is a framework that contextualizes LLMs with user embeddings. It effectively handles the complexity of user data and long-sequence processing, improving LLMs in personalized applications while preserving computational efficiency.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.13598v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.13598.md) |
| 02-21 | **AgentScope: A Flexible yet Robust Multi-Agent Platform**
Affiliations: Alibaba Group
AgentScope is a versatile platform for building multi-agent applications that emphasizes ease of use and customizability, suiting developers of all skill levels. With fault tolerance, multimodal data support, and optimized distributed operation, AgentScope markedly lowers the difficulty of developing and deploying multi-agent systems, encouraging broader participation and innovation.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.14034v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.14034.md)[![GitHub](https://img.shields.io/badge/GitHub-View-brightgreen?logo=github)](https://github.com/modelscope/agentscope)|
| 02-20 | **TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization**
Affiliations: AWS AI Labs, The University of Texas at Austin, KAIST
This paper proposes TOFUEVAL, a new benchmark evaluating the factual consistency of LLM-generated topic-focused dialogue summaries. The study finds that LLMs of various sizes make many factual errors in dialogue-domain summaries.|[![arXiv](https://img.shields.io/badge/arXiv-Paper-%23D2691E?logo=arxiv)](http://arxiv.org/pdf/2402.13249v1)[![Summary](https://img.shields.io/badge/Sum.-Read-blue?logo=dependabot)](summary/2024-02/2402.13249.md) |
| 02-20 | **Instruction-tuned Language Models are Better Knowledge Learners**
Affiliations: FAIR at Meta, Carnegie Mellon University, University of Washington