{"id":47567608,"url":"https://github.com/VoltAgent/awesome-ai-agent-papers","last_synced_at":"2026-04-01T06:00:31.797Z","repository":{"id":337967109,"uuid":"1154407752","full_name":"VoltAgent/awesome-ai-agent-papers","owner":"VoltAgent","description":"A curated collection of AI agent research papers released in 2026, covering agent engineering, memory, evaluation, workflows, and autonomous systems.","archived":false,"fork":false,"pushed_at":"2026-02-12T14:26:20.000Z","size":177,"stargazers_count":37,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-02-12T16:09:26.376Z","etag":null,"topics":["ai-agents","awesome","awesome-list","llm","llm-agents","memory","rag","research-paper"],"latest_commit_sha":null,"homepage":"https://github.com/VoltAgent/voltagent","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VoltAgent.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-10T10:58:31.000Z","updated_at":"2026-02-12T16:03:43.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/VoltAgent/awesome-ai-agent-papers","commit_stats":null,"previous_names":["voltagent/awesome-ai-agent-papers"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/VoltAgent/awesome-ai-agent-papers","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VoltAgent%2Fawesome-ai-agent-papers","tags_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VoltAgent%2Fawesome-ai-agent-papers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VoltAgent%2Fawesome-ai-agent-papers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VoltAgent%2Fawesome-ai-agent-papers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VoltAgent","download_url":"https://codeload.github.com/VoltAgent/awesome-ai-agent-papers/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VoltAgent%2Fawesome-ai-agent-papers/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31268586,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T05:46:55.838Z","status":"ssl_error","status_checked_at":"2026-04-01T05:46:47.827Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","awesome","awesome-list","llm","llm-agents","memory","rag","research-paper"],"created_at":"2026-03-30T11:00:20.657Z","updated_at":"2026-04-01T06:00:31.773Z","avatar_url":"https://github.com/VoltAgent.png","language":null,"funding_links":[],"categories":["📚 学習リソース","🔬 Autonomous Research \u0026 Self-Improving Agents","Reference Implementations","Tools \u0026 Libraries","Community","Related 
Lists","Machine Learning \u0026 AI","Research Papers","Foundation Models, LLMs, and Agents","📚 Related resources"],"sub_categories":["キュレートされたリスト","Related Resources","Adjacent Collections","Voice \u0026 Realtime Agents","Research \u0026 Humanitarian","Benchmark Reality Check (real-world tool use)","AI Agents"],"readme":"\n\n\u003cdiv align=\"center\"\u003e\n\n\u003ca href=\"https://github.com/VoltAgent/voltagent\"\u003e\n\u003cimg width=\"1500\" height=\"500\" alt=\"cover-image\" src=\"https://github.com/user-attachments/assets/23718e60-9ad3-4105-999c-8372713c3fbb\" /\u003e\n\u003c/a\u003e\n\n\u003cbr/\u003e\n\u003cbr/\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003cstrong\u003eHand-picked research papers on the AI agent ecosystem, published in 2026.\n    \u003c/strong\u003e\n    \u003cbr /\u003e\n    \u003cbr /\u003e\n\n\u003c/div\u003e\n\n[![Awesome](https://awesome.re/badge.svg)](https://awesome.re)\n![Papers Count](https://img.shields.io/badge/Research%20Papers-363+-b31b1b)\n![Last Update](https://img.shields.io/github/last-commit/VoltAgent/awesome-ai-agent-papers?label=Last%20update)\n\u003ca href=\"https://github.com/VoltAgent/voltagent\"\u003e\n  \u003cimg alt=\"VoltAgent\" src=\"https://cdn.voltagent.dev/website/logo/logo-2-svg.svg\" height=\"20\" /\u003e\n\u003c/a\u003e\n[![Discord](https://img.shields.io/discord/1361559153780195478.svg?label=\u0026logo=discord\u0026logoColor=ffffff\u0026color=7389D8\u0026labelColor=6A7EC2)](https://s.voltagent.dev/discord)\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003cstrong\u003eMore awesome collections for developers\u003c/strong\u003e\n    \u003cbr /\u003e\n    \u003cbr /\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n[![Agent Skills](https://img.shields.io/static/v1?label=%E2%9A%A1%20Agent\u0026message=Skills%2012k\u0026color=black\u0026style=classic)](https://github.com/VoltAgent/awesome-agent-skills)\n[![Claude Code 
Subagents](https://img.shields.io/github/stars/VoltAgent/awesome-claude-code-subagents?style=classic\u0026label=Claude%20Code%20Subagents\u0026color=D97757\u0026logo=claude\u0026logoColor=D97757)](https://github.com/VoltAgent/awesome-claude-code-subagents)\n[![Codex Subagents][codex-badge]][codex-link]\n[![OpenClaw Skills](https://img.shields.io/github/stars/VoltAgent/awesome-openclaw-skills?style=classic\u0026label=%F0%9F%A6%9E%20OpenClaw%20Skills\u0026color=f53e36)](https://github.com/VoltAgent/awesome-openclaw-skills)\n\n\n\u003c/div\u003e\n\n# Awesome AI Agent Papers\n\nA curated collection of research papers **published in 2026** and sourced from arXiv, covering core topics from the AI agent ecosystem like **multi-agent coordination**, **memory \u0026 RAG**, **tooling**, **evaluation \u0026 observability**, and **security**.\n\nWhether you're an AI engineer building agent systems, a researcher exploring new architectures, or a developer integrating LLM agents into products, these papers help you stay on top of what's actually working, what's breaking, and where the field is heading. Updated weekly from arXiv.\n\n### Why this list exists\n\nHundreds of papers are published on arXiv every week, and a growing number of them touch on AI agents. We go through them all, filter the ones that are directly relevant to the AI agent ecosystem, and categorize them so you don't have to. 
This list only includes papers published from January 2026 onward.\n\n### Table of Contents\n\n- [Multi-Agent](#multi-agent) (51)\n- [Memory \u0026 RAG](#memory--rag) (56)\n- [Eval \u0026 Observability](#eval--observability) (79)\n- [Agent Tooling](#agent-tooling) (95)\n- [AI Agent Security](#ai-agent-security) (82)\n\n\u003cbr\u003e\n\n\u003cdetails open id=\"multi-agent\"\u003e\n\u003csummary\u003e\u003ch3 style=\"display:inline\"\u003eMulti-Agent (51)\u003c/h3\u003e\u003c/summary\u003e\n\n\u003cbr\u003e\n\n| Paper | arXiv ID |\n|---|:---:|\n| **[DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching](https://arxiv.org/pdf/2602.06039v1)** - Investigates dynamically rewiring agent-to-agent connections at each reasoning round via semantic matching instead of fixed communication topologies. | \u003ca href=\"https://arxiv.org/abs/2602.06039v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06039-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[RuleSmith: Multi-Agent LLMs for Automated Game Balancing](https://arxiv.org/pdf/2602.06232v1)** - Explores automated game balancing by combining multi-agent LLM self-play with Bayesian optimization on a civ-style game. | \u003ca href=\"https://arxiv.org/abs/2602.06232v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06232-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction](https://arxiv.org/pdf/2602.06038v1)** - Examines how conformal prediction can filter noisy inter-agent messages to improve multi-robot coordination. 
| \u003ca href=\"https://arxiv.org/abs/2602.06038v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06038-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions](https://arxiv.org/pdf/2602.06008v1)** - Introduces a 110+ task benchmark to evaluate how well multi-agent LLM systems handle buyer-seller negotiation through natural language. | \u003ca href=\"https://arxiv.org/abs/2602.06008v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06008-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Gender Dynamics and Homophily in a Social Network of LLM Agents](https://arxiv.org/pdf/2602.02606v1)** - Analyzes social network formation among 70K+ autonomous LLM agents on Chirper.ai to study emergent group behavior and bias. | \u003ca href=\"https://arxiv.org/abs/2602.02606v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.02606-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ROMA: Recursive Open Meta-Agent Framework for Long-Horizon Multi-Agent Systems](https://arxiv.org/pdf/2602.01848v1)** - Proposes breaking large tasks into subtask trees that run in parallel across multiple agents to handle long-horizon workflows without exceeding context windows. | \u003ca href=\"https://arxiv.org/abs/2602.01848v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.01848-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ORCH: many analyses, one merge — a deterministic multi-agent orchestrator](https://arxiv.org/pdf/2602.01797v1)** - Proposes a deterministic multi-agent orchestrator where multiple LLMs analyze a problem independently and a merge agent selects the best answer without any training. 
| \u003ca href=\"https://arxiv.org/abs/2602.01797v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.01797-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[H-AdminSim: A Multi-Agent Simulator for Realistic Hospital Administrative Workflows](https://arxiv.org/pdf/2602.05407v1)** - Simulates end-to-end hospital administrative workflows with multi-agent LLMs and FHIR integration to test LLM-driven automation in healthcare settings. | \u003ca href=\"https://arxiv.org/abs/2602.05407v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05407-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering](https://arxiv.org/pdf/2602.01465v2)** - Proposes a multi-agent system for autonomous software engineering that assigns specialized agents to roles like coordination, research, implementation, and review. | \u003ca href=\"http://arxiv.org/abs/2602.01465v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.01465-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Multi-Agent Teams Hold Experts Back](https://arxiv.org/pdf/2602.01011v3)** - Examines whether self-organizing LLM agent teams can match or beat their best member's performance across collaborative benchmarks. | \u003ca href=\"http://arxiv.org/abs/2602.01011v3\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.01011-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Evolving Interpretable Constitutions for Multi-Agent Coordination](https://arxiv.org/pdf/2602.00755v1)** - Explores using LLM-driven genetic programming to automatically discover behavioral norms for multi-agent coordination in a survival-pressure grid-world simulation. 
| \u003ca href=\"http://arxiv.org/abs/2602.00755v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.00755-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Scaling Multiagent Systems with Process Rewards](https://arxiv.org/pdf/2601.23228v2)** - Proposes per-action process rewards from AI feedback to improve credit assignment and sample efficiency when fine-tuning multi-agent LLM systems. | \u003ca href=\"http://arxiv.org/abs/2601.23228v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.23228-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[MonoScale: Scaling Multi-Agent System with Monotonic Improvement](https://arxiv.org/pdf/2601.23219v1)** - Proposes a framework for safely growing multi-agent pools by generating familiarization tasks and building routing memory, with guaranteed non-decreasing performance across onboarding rounds. | \u003ca href=\"http://arxiv.org/abs/2601.23219v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.23219-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Task-Aware LLM Council with Adaptive Decision Pathways for Decision Support](https://arxiv.org/pdf/2601.22662v1)** - Proposes a task-adaptive multi-agent framework that routes control to the most suitable LLM at each decision step using semantic matching against each model's success history. | \u003ca href=\"http://arxiv.org/abs/2601.22662v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22662-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous Language Model Assembly](https://arxiv.org/pdf/2601.22623v1)** - Explores using a pool of different LLM agents within MCTS planning to increase rollout diversity and improve multi-step reasoning. 
| \u003ca href=\"http://arxiv.org/abs/2601.22623v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22623-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Learning to Recommend Multi-Agent Subgraphs from Calling Trees](https://arxiv.org/pdf/2601.22209v1)** - Proposes a recommendation framework that uses historical calling trees to select the best agents or agent teams for each subtask in multi-agent orchestration. | \u003ca href=\"http://arxiv.org/abs/2601.22209v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22209-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic](https://arxiv.org/pdf/2601.21972v2)** - Investigates actor-critic reinforcement learning methods for training decentralized LLM agent collaboration across writing, coding, and game-playing tasks. | \u003ca href=\"http://arxiv.org/abs/2601.21972v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21972-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[AgenticSimLaw: A Juvenile Courtroom Multi-Agent Debate Simulation for Explainable High-Stakes Tabular Decision Making](https://arxiv.org/pdf/2601.21936v1)** - Proposes a role-structured multi-agent courtroom debate framework with defined agent roles, interaction protocols, and private reasoning strategies for auditable high-stakes decision-making. | \u003ca href=\"http://arxiv.org/abs/2601.21936v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21936-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems](https://arxiv.org/pdf/2601.21742v1)** - Introduces a reasoning framework that builds peer reliability profiles from interaction history so agents in multi-agent systems learn which peers to trust when uncertain. 
| \u003ca href=\"http://arxiv.org/abs/2601.21742v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21742-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation](https://arxiv.org/pdf/2601.21469v1)** - Explores structured multi-agent debate with three role-based agents and adaptive confidence gating to improve small language model code generation. | \u003ca href=\"http://arxiv.org/abs/2601.21469v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21469-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[CASTER: Context-Aware Strategy for Task Efficient Routing in Multi-Agent Systems](https://arxiv.org/pdf/2601.19793v1)** - Proposes a lightweight router for dynamic model selection in graph-based multi-agent systems that combines semantic embeddings with structural meta-features and self-optimizes through on-policy negative feedback. | \u003ca href=\"http://arxiv.org/abs/2601.19793v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.19793-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Phase Transition for Budgeted Multi-Agent Synergy](https://arxiv.org/pdf/2601.17311v1)** - Develops a theory for predicting when budgeted multi-agent LLM systems improve, saturate, or collapse based on context windows, communication fidelity, and shared-error correlation. | \u003ca href=\"http://arxiv.org/abs/2601.17311v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.17311-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Dynamic Role Assignment for Multi-Agent Debate](https://arxiv.org/pdf/2601.17152v1)** - Proposes a meta-debate framework that dynamically assigns roles in multi-agent systems by matching model capabilities to positions through proposal and peer review stages. 
| \u003ca href=\"http://arxiv.org/abs/2601.17152v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.17152-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Learning to Collaborate: An Orchestrated-Decentralized Framework for Peer-to-Peer LLM Federation](https://arxiv.org/pdf/2601.17133v1)** - Introduces orchestrated decentralized peer-to-peer LLM collaboration that uses contextual bandits to learn optimal matchmaking between heterogeneous agents via secure distillation. | \u003ca href=\"http://arxiv.org/abs/2601.17133v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.17133-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Mixture-of-Models: Unifying Heterogeneous Agents via N-Way Self-Evaluating Deliberation](https://arxiv.org/pdf/2601.16863v1)** - Explores a runtime Mixture-of-Models architecture with a dynamic expertise broker and quadratic voting consensus that enables small model ensembles to match frontier performance. | \u003ca href=\"http://arxiv.org/abs/2601.16863v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.16863-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Multi-Agent Constraint Factorization Reveals Latent Invariant Solution Structure](https://arxiv.org/pdf/2601.15077v1)** - Formalizes through operator theory why multi-agent LLM systems access invariant solutions that a single agent applying all constraints simultaneously cannot reach. | \u003ca href=\"http://arxiv.org/abs/2601.15077v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.15077-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks](https://arxiv.org/pdf/2601.14652v2)** - Proposes a training-time framework that formulates multi-agent orchestration as function-calling reinforcement learning with holistic system-level reasoning and introduces MASBENCH for controlled evaluation. 
| \u003ca href=\"http://arxiv.org/abs/2601.14652v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.14652-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems](https://arxiv.org/pdf/2601.14230v1)** - Proposes a bi-level optimization framework for multi-agent companions that aligns individual personas via RLAIF and optimizes collaborative dialogue through group-level meta-policy rewards. | \u003ca href=\"http://arxiv.org/abs/2601.14230v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.14230-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[If You Want Coherence, Orchestrate a Team of Rivals: Multi-Agent Models of Organizational Intelligence](https://arxiv.org/pdf/2601.14351v1)** - Explores a team-of-rivals multi-agent architecture with specialized roles and a remote code executor that separates reasoning from data execution to maintain clean context windows. | \u003ca href=\"http://arxiv.org/abs/2601.14351v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.14351-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[The Orchestration of Multi-Agent Systems: Architectures, Protocols, and Enterprise Adoption](https://arxiv.org/pdf/2601.13671v1)** - Formalizes a unified architectural framework for orchestrated multi-agent systems integrating MCP for tool access and Agent2Agent protocol for peer coordination, delegation, and policy enforcement. | \u003ca href=\"http://arxiv.org/abs/2601.13671v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.13671-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[MARO: Learning Stronger Reasoning from Social Interaction](https://arxiv.org/pdf/2601.12323v2)** - Proposes Multi-Agent Reward Optimization, a method that decomposes multi-agent social interaction outcomes into per-behavior learning signals to improve LLM reasoning through simulated social environments. 
| \u003ca href=\"http://arxiv.org/abs/2601.12323v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.12323-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding](https://arxiv.org/pdf/2601.11913v1)** - Introduces an LSTM-inspired multi-agent architecture with worker, filter, judge, and manager agents that emulate gated memory mechanisms to control information flow for long-context understanding. | \u003ca href=\"http://arxiv.org/abs/2601.11913v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.11913-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Do We Always Need Query-Level Workflows? Rethinking Agentic Workflow Generation for Multi-Agent Systems](https://arxiv.org/pdf/2601.11147v1)** - Examines whether query-level workflow generation is always necessary in multi-agent systems and proposes a low-cost task-level framework that uses self-prediction with few-shot calibration instead of full execution. | \u003ca href=\"http://arxiv.org/abs/2601.11147v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.11147-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems](https://arxiv.org/pdf/2601.10560v1)** - Proposes a latency-aware multi-agent orchestration framework that explicitly optimizes the critical execution path under parallel execution to reduce end-to-end latency while maintaining task performance. 
| \u003ca href=\"http://arxiv.org/abs/2601.10560v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.10560-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[TopoDIM: One-shot Topology Generation of Diverse Interaction Modes for Multi-Agent Systems](https://arxiv.org/pdf/2601.10120v1)** - Proposes a one-shot topology generation framework with diverse interaction modes that enables decentralized agents to autonomously construct heterogeneous communication topologies without iterative coordination. | \u003ca href=\"http://arxiv.org/abs/2601.10120v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.10120-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Beyond Rule-Based Workflows: An Information-Flow-Orchestrated Multi-Agents Paradigm via A2A Communication from CORAL](https://arxiv.org/pdf/2601.09883v1)** - Replaces predefined multi-agent workflows with a dynamic information-flow orchestrator that coordinates agents through natural-language A2A communication. | \u003ca href=\"https://arxiv.org/abs/2601.09883v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.09883-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[LLM-Based Agentic Systems for Software Engineering: Challenges and Opportunities](https://arxiv.org/pdf/2601.09822v2)** - Reviews LLM-based multi-agent systems across the software development lifecycle, covering frameworks, communication protocols, and orchestration challenges from requirements to debugging. | \u003ca href=\"https://arxiv.org/abs/2601.09822v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.09822-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning](https://arxiv.org/pdf/2601.09667v2)** - Explores injecting structured textual experience into multi-agent deliberation at test time to improve reasoning accuracy without any model tuning. 
| \u003ca href=\"https://arxiv.org/abs/2601.09667v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.09667-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[The End of Reward Engineering: How LLMs Are Redefining Multi-Agent Coordination](https://arxiv.org/pdf/2601.08237v1)** - Argues that LLMs can replace hand-crafted numerical reward functions with language-based objective specifications for multi-agent coordination, drawing on EUREKA and RLVR as evidence. | \u003ca href=\"https://arxiv.org/abs/2601.08237v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.08237-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[A Large-Scale Study on the Development and Issues of Multi-Agent AI Systems](https://arxiv.org/pdf/2601.07136v1)** - Analyzes over 42K commits and 4.7K resolved issues across eight leading multi-agent AI systems (LangChain, CrewAI, AutoGen, etc.) to study development patterns, maintenance practices, and ecosystem maturity. | \u003ca href=\"https://arxiv.org/abs/2601.07136v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.07136-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[StackPlanner: A Centralized Hierarchical Multi-Agent System with Task-Experience Memory Management](https://arxiv.org/pdf/2601.05890v1)** - Proposes a hierarchical multi-agent framework that decouples high-level coordination from subtask execution with active task-level memory control and reinforcement-learning-driven experience reuse. 
| \u003ca href=\"https://arxiv.org/abs/2601.05890v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.05890-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[CTHA: Constrained Temporal Hierarchical Architecture for Stable Multi-Agent LLM Systems](https://arxiv.org/pdf/2601.10738v1)** - Proposes a constrained temporal hierarchical architecture for multi-agent LLM systems that projects inter-layer communication onto structured manifolds with typed message contracts and authority bounds. | \u003ca href=\"https://arxiv.org/abs/2601.10738v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.10738-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[DynaDebate: Breaking Homogeneity in Multi-Agent Debate with Dynamic Path Generation](https://arxiv.org/pdf/2601.05746v1)** - Introduces dynamic path generation for multi-agent debate that allocates diverse solution paths to agents, shifts focus to step-by-step logic critique, and uses a trigger-based verification agent to resolve deadlocks. | \u003ca href=\"https://arxiv.org/abs/2601.05746v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.05746-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Demystifying Multi-Agent Debate: The Role of Confidence and Diversity](https://arxiv.org/pdf/2601.19921v1)** - Investigates how diversity-aware initialization and confidence-modulated updates improve multi-agent debate, connecting findings from human deliberation research to LLM-based debate protocols. | \u003ca href=\"https://arxiv.org/abs/2601.19921v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.19921-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Orchestrating Intelligence: Confidence-Aware Routing for Multi-Agent Collaboration](https://arxiv.org/pdf/2601.04861v2)** - Proposes a multi-agent framework with confidence-aware routing that dynamically selects agent roles and model scales across heterogeneous LLMs based on task complexity. 
| \u003ca href=\"https://arxiv.org/abs/2601.04861v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.04861-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework](https://arxiv.org/pdf/2601.04790v1)** - Analyzes role-based authority bias in multi-agent evaluation frameworks using French and Raven's power-based theory across legitimate, referent, and expert power types. | \u003ca href=\"https://arxiv.org/abs/2601.04790v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.04790-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail](https://arxiv.org/pdf/2601.04748v2)** - Investigates when a single agent with a skill library can replace multi-agent systems, studying scaling limits and phase transitions in skill selection as libraries grow. | \u003ca href=\"https://arxiv.org/abs/2601.04748v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.04748-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ResMAS: Resilience Optimization in LLM-based Multi-Agent Systems](https://arxiv.org/pdf/2601.04694v1)** - Proposes a two-stage framework for enhancing multi-agent system resilience through RL-based topology generation and topology-aware prompt optimization under perturbations. | \u003ca href=\"https://arxiv.org/abs/2601.04694v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.04694-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[TCAndon-Router: Adaptive Reasoning Router for Multi-Agent Collaboration](https://arxiv.org/pdf/2601.04544v1)** - Proposes an adaptive reasoning router for multi-agent systems that generates natural-language reasoning chains before predicting candidate agents, with a collaborative execution pipeline. 
| \u003ca href=\"https://arxiv.org/abs/2601.04544v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.04544-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents](https://arxiv.org/pdf/2601.03846v1)** - Investigates covert communication in LLM multi-agent systems through game-theoretic analysis of implicit coordination signals across different communication regimes. | \u003ca href=\"https://arxiv.org/abs/2601.03846v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.03846-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making](https://arxiv.org/pdf/2601.01522v1)** - Proposes a Bayesian, cost-aware multi-LLM orchestration framework that treats LLMs as approximate likelihood models and aggregates across diverse models for sequential decision-making. | \u003ca href=\"https://arxiv.org/abs/2601.01522v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.01522-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n\n\u003c/details\u003e\n\n\u003cbr\u003e\n\n\u003cdetails open id=\"memory--rag\"\u003e\n\u003csummary\u003e\u003ch3 style=\"display:inline\"\u003eMemory \u0026 RAG (56)\u003c/h3\u003e\u003c/summary\u003e\n\n\u003cbr\u003e\n\n| Paper | arXiv ID |\n|---|:---:|\n| **[BudgetMem: Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory](https://arxiv.org/pdf/2602.06025v1)** - Investigates routing agent memory queries to different processing tiers based on query difficulty to control the cost-accuracy trade-off at runtime. 
| \u003ca href=\"https://arxiv.org/abs/2602.06025v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06025-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Learning to Share: Selective Memory for Efficient Parallel Agentic Systems](https://arxiv.org/pdf/2602.05965v1)** - Proposes a shared memory bank with a learned controller that decides what information is worth passing between parallel agent teams to reduce redundant work. | \u003ca href=\"https://arxiv.org/abs/2602.05965v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05965-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering](https://arxiv.org/pdf/2602.05728v1)** - Explores converting a corpus into atomic QA pairs offline to resolve multi-hop questions with just two LLM calls regardless of hop count. | \u003ca href=\"https://arxiv.org/abs/2602.05728v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05728-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Mitigating Hallucination in Financial Retrieval-Augmented Generation via Fine-Grained Knowledge Verification](https://arxiv.org/pdf/2602.05723v1)** - Examines breaking financial RAG answers into atomic facts and verifying each against retrieved documents using reinforcement learning rewards. | \u003ca href=\"https://arxiv.org/abs/2602.05723v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05723-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Graph-based Agent Memory: Taxonomy, Techniques, and Applications](https://arxiv.org/pdf/2602.05665v1)** - Surveys graph-based memory architectures for agents, covering extraction, storage, retrieval, and how memory evolves over time. 
| \u003ca href=\"https://arxiv.org/abs/2602.05665v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05665-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[AI Agent Systems for Supply Chains: Structured Decision Prompts and Memory Retrieval](https://arxiv.org/pdf/2602.05524v1)** - Proposes a multi-agent system for inventory management that retrieves similar past decisions to adapt ordering across various supply chain scenarios. | \u003ca href=\"https://arxiv.org/abs/2602.05524v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05524-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[SOPRAG: Multi-view Graph Experts Retrieval for Industrial Standard Operating Procedures](https://arxiv.org/pdf/2602.01858v1)** - Explores replacing flat chunk-based RAG with graph experts that understand entity relationships, causality, and process flows for structured documents like SOPs. | \u003ca href=\"https://arxiv.org/abs/2602.01858v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.01858-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ProcMEM: Learning Reusable Procedural Memory from Experience via Non-Parametric PPO for LLM Agents](https://arxiv.org/pdf/2602.01869v1)** - Investigates letting agents save step-by-step procedural skills from past runs and reuse them later without retraining to reduce repeated computation. | \u003ca href=\"https://arxiv.org/abs/2602.01869v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.01869-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Aggregation Queries over Unstructured Text: Benchmark and Agentic Method](https://arxiv.org/pdf/2602.01355v2)** - Proposes an agentic method for aggregation queries over unstructured text that tries to find all matching evidence, breaking the task into disambiguation, filtering, and aggregation stages. 
| \u003ca href=\"http://arxiv.org/abs/2602.01355v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.01355-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking](https://arxiv.org/pdf/2602.00238v1)** - Proposes an agentic RAG framework that uses reflection and memory-based refinement to generate diverse answers for open-ended questions. | \u003ca href=\"http://arxiv.org/abs/2602.00238v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.00238-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[JADE: Bridging the Strategic-Operational Gap in Dynamic Agentic RAG](https://arxiv.org/pdf/2601.21916v1)** - Proposes joint optimization of planning and execution in agentic RAG by modeling the system as a cooperative multi-agent team with a shared backbone and outcome-based rewards. | \u003ca href=\"http://arxiv.org/abs/2601.21916v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21916-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation](https://arxiv.org/pdf/2601.21912v1)** - Proposes process-supervised reinforcement learning for RAG that uses MCTS-based step-level rewards to identify and fix flawed reasoning steps in multi-hop retrieval. | \u003ca href=\"http://arxiv.org/abs/2601.21912v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21912-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory](https://arxiv.org/pdf/2601.21714v1)** - Introduces an episodic memory framework where assistant agents maintain uncompressed memory contexts while a master agent orchestrates global planning, replacing destructive memory compression with context reconstruction. 
| \u003ca href=\"http://arxiv.org/abs/2601.21714v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21714-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ShardMemo: Masked MoE Routing for Sharded Agentic LLM Memory](https://arxiv.org/pdf/2601.21545v1)** - Proposes a tiered memory service for agentic LLM systems that uses masked mixture-of-experts routing to probe only eligible memory shards under a fixed budget. | \u003ca href=\"http://arxiv.org/abs/2601.21545v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21545-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[When should I search more: Adaptive Complex Query Optimization with Reinforcement Learning](https://arxiv.org/pdf/2601.21208v1)** - Explores adaptive query optimization in RAG using reinforcement learning to dynamically decide when to split complex queries into sub-queries and fuse the retrieved results. | \u003ca href=\"http://arxiv.org/abs/2601.21208v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21208-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[A2RAG: Adaptive Agentic Graph Retrieval for Cost-Aware and Reliable Reasoning](https://arxiv.org/pdf/2601.21162v1)** - Introduces an adaptive agentic Graph-RAG framework that verifies evidence sufficiency and progressively escalates retrieval effort, mapping graph signals back to source text to handle extraction loss. | \u003ca href=\"http://arxiv.org/abs/2601.21162v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21162-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[MemCtrl: Using MLLMs as Active Memory Controllers on Embodied Agents](https://arxiv.org/pdf/2601.20831v1)** - Investigates augmenting multimodal LLMs with a trainable memory gate that decides which observations to retain, update, or discard during online embodied agent exploration. 
| \u003ca href=\"http://arxiv.org/abs/2601.20831v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.20831-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[AMA: Adaptive Memory via Multi-Agent Collaboration](https://arxiv.org/pdf/2601.20352v2)** - Proposes a multi-agent memory framework with hierarchical granularity, adaptive query routing, consistency verification, and targeted memory refresh for long-term agent interaction. | \u003ca href=\"http://arxiv.org/abs/2601.20352v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.20352-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering](https://arxiv.org/pdf/2601.19827v2)** - Examines when iterative retrieval-reasoning loops outperform static gold-context RAG in scientific multi-hop QA, diagnosing failure modes across retrieval coverage, hypothesis drift, and stopping calibration. | \u003ca href=\"http://arxiv.org/abs/2601.19827v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.19827-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Dep-Search: Learning Dependency-Aware Reasoning Traces with Persistent Memory](https://arxiv.org/pdf/2601.18771v1)** - Introduces a dependency-aware search framework that uses GRPO reinforcement learning to teach LLMs to decompose questions with dependency relationships and store intermediate results in persistent memory. | \u003ca href=\"http://arxiv.org/abs/2601.18771v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.18771-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[FadeMem: Biologically-Inspired Forgetting for Efficient Agent Memory](https://arxiv.org/pdf/2601.18642v2)** - Proposes a biologically-inspired agent memory architecture with adaptive exponential decay, LLM-guided conflict resolution, and intelligent memory fusion across a dual-layer hierarchy. 
| \u003ca href=\"http://arxiv.org/abs/2601.18642v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.18642-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[FastInsight: Fast and Insightful Retrieval via Fusion Operators for Graph RAG](https://arxiv.org/pdf/2601.18579v1)** - Explores two fusion operators for Graph RAG that combine graph-aware reranking with semantic-topological expansion to improve retrieval accuracy and generation quality. | \u003ca href=\"http://arxiv.org/abs/2601.18579v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.18579-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Less is More for RAG: Information Gain Pruning for Generator-Aligned Reranking and Evidence Selection](https://arxiv.org/pdf/2601.17532v1)** - Proposes a generator-aligned reranking and pruning module for RAG that selects evidence using utility signals and filters weak or harmful passages before context truncation. | \u003ca href=\"http://arxiv.org/abs/2601.17532v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.17532-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering](https://arxiv.org/pdf/2601.16478v1)** - Introduces a step-by-step reasoning reranking agent for RAG that distinguishes semantically similar but logically irrelevant passages in retrieval-augmented question answering. | \u003ca href=\"http://arxiv.org/abs/2601.16478v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.16478-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[SPARC-RAG: Adaptive Sequential-Parallel Scaling with Context Management for Retrieval-Augmented Generation](https://arxiv.org/pdf/2602.00083v1)** - Introduces a multi-agent RAG framework that coordinates sequential and parallel inference-time scaling under unified context management to prevent contamination and improve multi-hop reasoning. 
| \u003ca href=\"http://arxiv.org/abs/2602.00083v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.00083-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Incorporating Q\u0026A Nuggets into Retrieval-Augmented Generation](https://arxiv.org/pdf/2601.13222v1)** - Proposes a nugget-augmented generation system that constructs a bank of Q\u0026A nuggets from retrieved documents to guide extraction, selection, and report generation with citation provenance. | \u003ca href=\"http://arxiv.org/abs/2601.13222v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.13222-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Augmenting Question Answering with A Hybrid RAG Approach](https://arxiv.org/pdf/2601.12658v2)** - Introduces a hybrid RAG architecture combining query augmentation, agentic routing, and structured retrieval that merges vector and graph-based techniques with context unification for question answering. | \u003ca href=\"http://arxiv.org/abs/2601.12658v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.12658-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Utilizing Metadata for Better Retrieval-Augmented Generation](https://arxiv.org/pdf/2601.11863v1)** - Presents a systematic study of metadata-aware retrieval strategies for RAG, comparing prefix, suffix, unified embedding, and late-fusion approaches with field-level ablations on embedding space structure. | \u003ca href=\"http://arxiv.org/abs/2601.11863v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.11863-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Deep GraphRAG: A Balanced Approach to Hierarchical Retrieval and Adaptive Integration](https://arxiv.org/pdf/2601.11144v3)** - Proposes a hierarchical global-to-local retrieval strategy for GraphRAG with beam search-optimized re-ranking and a compact LLM integration module trained via dynamic-weighting reinforcement learning. 
| \u003ca href=\"http://arxiv.org/abs/2601.11144v3\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.11144-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Grounding Agent Memory in Contextual Intent](https://arxiv.org/pdf/2601.10702v1)** - Introduces an agentic memory system that indexes trajectory steps with structured contextual intent cues and retrieves history by intent compatibility to reduce interference in long-horizon goal-oriented interactions. | \u003ca href=\"http://arxiv.org/abs/2601.10702v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.10702-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Structure and Diversity Aware Context Bubble Construction for Enterprise Retrieval Augmented Systems](https://arxiv.org/pdf/2601.10681v1)** - Proposes a structure-informed and diversity-constrained context bubble construction framework for RAG that preserves document structure and balances relevance, coverage, and redundancy under strict token budgets. | \u003ca href=\"http://arxiv.org/abs/2601.10681v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.10681-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Topo-RAG: Topology-aware retrieval for hybrid text-table documents](https://arxiv.org/pdf/2601.10215v1)** - Introduces a dual-architecture RAG framework that routes narrative through dense retrievers and tabular data through a cell-aware late interaction mechanism to preserve spatial relationships in hybrid documents. | \u003ca href=\"http://arxiv.org/abs/2601.10215v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.10215-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Continuum Memory Architectures for Long-Horizon LLM Agents](https://arxiv.org/pdf/2601.09913v1)** - Defines a class of memory systems for long-horizon agents that maintain persistent, temporally chained internal state instead of stateless RAG lookups, specifying the architectural requirements they must satisfy. 
| \u003ca href=\"https://arxiv.org/abs/2601.09913v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.09913-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey](https://arxiv.org/pdf/2602.06052v3)** - Surveys foundation agent memory organized by substrate (internal/external), cognitive mechanism (episodic, semantic, working, procedural), and subject (agent- vs user-centric). | \u003ca href=\"https://arxiv.org/abs/2602.06052v3\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06052-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[The AI Hippocampus: How Far are We From Human Memory?](https://arxiv.org/pdf/2601.09113v1)** - Surveys memory in LLMs and multimodal LLMs across implicit, explicit, and agentic paradigms, covering cross-modal integration and challenges like capacity, alignment, and factual consistency. | \u003ca href=\"https://arxiv.org/abs/2601.09113v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.09113-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[AtomMem: Learnable Dynamic Agentic Memory with Atomic Memory Operation](https://arxiv.org/pdf/2601.08323v2)** - Decomposes memory management into atomic CRUD operations and learns an autonomous policy via SFT + RL to study whether learnable memory outperforms static-workflow methods on long-context tasks. | \u003ca href=\"https://arxiv.org/abs/2601.08323v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.08323-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[OpenDecoder: Open LLM Decoding to Incorporate Document Quality in RAG](https://arxiv.org/pdf/2601.09028v2)** - Feeds explicit document quality signals (relevance score, ranking, QPP) into RAG generation to study whether exposing retrieval metadata makes the model more robust to noisy context. 
| \u003ca href=\"https://arxiv.org/abs/2601.09028v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.09028-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Reliable Graph-RAG for Codebases: AST-Derived Graphs vs LLM-Extracted Knowledge Graphs](https://arxiv.org/pdf/2601.08773v1)** - Benchmarks vector-only, LLM-extracted KG, and AST-derived graph pipelines for code RAG, comparing correctness and indexing cost across deterministic and LLM-based graph construction. | \u003ca href=\"https://arxiv.org/abs/2601.08773v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.08773-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[To Retrieve or To Think? An Agentic Approach for Context Evolution](https://arxiv.org/pdf/2601.08747v2)** - Proposes an agentic RAG framework that dynamically decides whether to retrieve new evidence or reason over existing context at each step, aiming to eliminate redundant retrieval. | \u003ca href=\"https://arxiv.org/abs/2601.08747v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.08747-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Parallel Context-of-Experts Decoding for Retrieval Augmented Generation](https://arxiv.org/pdf/2601.08670v1)** - Proposes a training-free RAG decoding method that treats retrieved documents as isolated \"experts\" and aggregates their logits via retrieval-aware contrastive decoding to recover cross-document reasoning. | \u003ca href=\"https://arxiv.org/abs/2601.08670v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.08670-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[SwiftMem: Fast Agentic Memory via Query-aware Indexing](https://arxiv.org/pdf/2601.08160v1)** - Proposes a query-aware agentic memory system that achieves sub-linear retrieval through temporal and semantic DAG-Tag indexing, with an embedding-tag co-consolidation mechanism to address memory fragmentation. 
| \u003ca href=\"https://arxiv.org/abs/2601.08160v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.08160-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Learning How to Remember: A Meta-Cognitive Management Method for Structured and Transferable Agent Memory](https://arxiv.org/pdf/2601.07470v1)** - Proposes treating memory abstraction as a learnable cognitive skill, training a memory copilot via DPO to determine how memories should be structured, abstracted, and reused across tasks. | \u003ca href=\"https://arxiv.org/abs/2601.07470v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.07470-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Beyond Dialogue Time: Temporal Semantic Memory for Personalized LLM Agents](https://arxiv.org/pdf/2601.07468v1)** - Introduces a temporal semantic memory framework that organizes memories by actual occurrence time rather than dialogue time and consolidates temporally continuous information into durative memory. | \u003ca href=\"https://arxiv.org/abs/2601.07468v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.07468-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Active Context Compression: Autonomous Memory Management in LLM Agents](https://arxiv.org/pdf/2601.07190v1)** - Proposes an agent-centric architecture inspired by Physarum polycephalum where the agent autonomously decides when to consolidate learnings and prune raw interaction history to manage context growth. | \u003ca href=\"https://arxiv.org/abs/2601.07190v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.07190-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Relink: Constructing Query-Driven Evidence Graph On-the-Fly for GraphRAG](https://arxiv.org/pdf/2601.07192v1)** - Proposes a reason-and-construct paradigm for GraphRAG that dynamically builds query-specific evidence graphs by instantiating facts from a latent relation pool and discarding distractor facts. 
| \u003ca href=\"https://arxiv.org/abs/2601.07192v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.07192-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Seeing through the Conflict: Transparent Knowledge Conflict Handling in RAG](https://arxiv.org/pdf/2601.06842v1)** - Introduces a plug-and-play RAG framework that disentangles semantic match from factual consistency and estimates self-answerability to make the conflict-resolution decision process observable and controllable. | \u003ca href=\"https://arxiv.org/abs/2601.06842v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.06842-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[CIRAG: Construction-Integration Retrieval and Adaptive Generation for Multi-hop Question Answering](https://arxiv.org/pdf/2601.06799v1)** - Proposes a construction-integration approach for multi-hop RAG that preserves multiple evidence chains via iterative triple construction and adaptively expands context granularity from triples to full passages. | \u003ca href=\"https://arxiv.org/abs/2601.06799v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.06799-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Amory: Building Coherent Narrative-Driven Agent Memory through Agentic Reasoning](https://arxiv.org/pdf/2601.06282v1)** - Proposes a working memory framework that constructs structured episodic narratives from conversational fragments, consolidates memories with momentum, and semanticizes peripheral facts into semantic memory during offline time. 
| \u003ca href=\"https://arxiv.org/abs/2601.06282v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.06282-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[L-RAG: Balancing Context and Retrieval with Entropy-Based Lazy Loading](https://arxiv.org/pdf/2601.06551v1)** - Proposes an adaptive RAG framework that uses entropy-based gating to bypass vector database retrieval when model uncertainty is low, triggering expensive chunk retrieval only when genuine uncertainty is detected. | \u003ca href=\"https://arxiv.org/abs/2601.06551v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.06551-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[PRISMA: Reinforcement Learning Guided Two-Stage Policy Optimization in Multi-Agent Architecture for Open-Domain Multi-Hop QA](https://arxiv.org/pdf/2601.05465v1)** - Proposes a decoupled multi-agent RAG framework for multi-hop QA with a Plan-Retrieve-Inspect-Solve-Memoize architecture and two-stage GRPO optimization to address retrieval collapse over large corpora. | \u003ca href=\"https://arxiv.org/abs/2601.05465v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.05465-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Controllable Memory Usage: Balancing Anchoring and Innovation in Long-Term Human-Agent Interaction](https://arxiv.org/pdf/2601.05107v1)** - Proposes a framework for user-controllable memory reliance in long-term agent interactions, modeling memory dependence as an explicit and steerable dimension. | \u003ca href=\"https://arxiv.org/abs/2601.05107v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.05107-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Beyond Static Summarization: Proactive Memory Extraction for LLM Agents](https://arxiv.org/pdf/2601.04463v1)** - Proposes proactive memory extraction using self-questioning feedback loops instead of one-off static summarization to recover missing information and correct errors iteratively. 
| \u003ca href=\"https://arxiv.org/abs/2601.04463v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.04463-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents](https://arxiv.org/pdf/2601.03785v2)** - Proposes a hierarchical memory architecture with a Topic Loom that groups consecutive same-topic dialogue turns into coherent memory boxes and links them via long-range event-timeline traces. | \u003ca href=\"https://arxiv.org/abs/2601.03785v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.03785-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[MAGMA: A Multi-Graph based Agentic Memory Architecture](https://arxiv.org/pdf/2601.03236v1)** - Proposes a multi-graph agentic memory architecture that represents memories across orthogonal semantic, temporal, causal, and entity graphs with policy-guided traversal for retrieval. | \u003ca href=\"https://arxiv.org/abs/2601.03236v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.03236-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[HiMeS: Hippocampus-inspired Memory System for Personalized AI Assistants](https://arxiv.org/pdf/2601.06152v1)** - Proposes a hippocampus-inspired memory architecture for AI assistants that fuses RL-trained short-term memory extraction with partitioned long-term memory for personalization. | \u003ca href=\"https://arxiv.org/abs/2601.06152v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.06152-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[SimpleMem: Efficient Lifelong Memory for LLM Agents](https://arxiv.org/pdf/2601.02553v3)** - Proposes a three-stage memory framework built on semantic lossless compression: structured compression, online semantic synthesis, and intent-aware retrieval planning. 
| \u003ca href=\"https://arxiv.org/abs/2601.02553v3\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.02553-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n\n\u003c/details\u003e\n\n\u003cbr\u003e\n\n\u003cdetails id=\"eval--observability\"\u003e\n\u003csummary\u003e\u003ch3 style=\"display:inline\"\u003eEval \u0026 Observability (79)\u003c/h3\u003e\u003c/summary\u003e\n\n\u003cbr\u003e\n\n| Paper | arXiv ID |\n|---|:---:|\n| **[From Features to Actions: Explainability in Traditional and Agentic AI Systems](https://arxiv.org/pdf/2602.06841v1)** - Compares attribution-based explanations with trace-based diagnostics across static and agentic settings to study how explainability methods translate to multi-step agent trajectories. | \u003ca href=\"https://arxiv.org/abs/2602.06841v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06841-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Agentic Uncertainty Reveals Agentic Overconfidence](https://arxiv.org/pdf/2602.06948v1)** - Investigates whether agents can accurately predict their own success rates in agentic tasks. | \u003ca href=\"https://arxiv.org/abs/2602.06948v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06948-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents](https://arxiv.org/pdf/2602.06855v1)** - Introduces 20 research tasks from real ML papers covering idea generation, experiments, and refinement for benchmarking science agents. | \u003ca href=\"https://arxiv.org/abs/2602.06855v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06855-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks](https://arxiv.org/pdf/2602.06486v1)** - Proposes evaluating agent outputs by decomposing responses into individual claims and checking each against expert knowledge. 
| \u003ca href=\"https://arxiv.org/abs/2602.06486v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06486-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Completing Missing Annotation: Multi-Agent Debate for Accurate Relevant Assessment](https://arxiv.org/pdf/2602.06526v1)** - Explores using multi-agent debate to fill missing labels in information retrieval benchmarks. | \u003ca href=\"https://arxiv.org/abs/2602.06526v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06526-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[TrajAD: Trajectory Anomaly Detection for Trustworthy LLM Agents](https://arxiv.org/pdf/2602.06443v1)** - Proposes a specialized verifier that detects and locates errors in agent execution trajectories at runtime to enable precise rollback-and-retry. | \u003ca href=\"https://arxiv.org/abs/2602.06443v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06443-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Emulating Aggregate Human Choice Behavior and Biases with GPT Conversational Agents](https://arxiv.org/pdf/2602.05597v1)** - Examines whether GPT-4/5 agents can reproduce aggregate human cognitive biases in interactive decision-making scenarios. | \u003ca href=\"https://arxiv.org/abs/2602.05597v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05597-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Capture the Flags: Family-Based Evaluation of Agentic LLMs](https://arxiv.org/pdf/2602.05523v1)** - Proposes generating families of equivalent CTF challenges through code transformations to test whether agents truly understand exploits or just memorize patterns. 
| \u003ca href=\"https://arxiv.org/abs/2602.05523v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05523-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[PieArena: Frontier Language Agents Achieve MBA-Level Negotiation](https://arxiv.org/pdf/2602.05302v1)** - Introduces a negotiation benchmark where frontier LLM agents are evaluated against MBA students to reveal cross-model differences in deception, accuracy, and trustworthiness. | \u003ca href=\"https://arxiv.org/abs/2602.05302v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05302-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support](https://arxiv.org/pdf/2602.01885v1)** - Benchmarks how well conversational agents retain and use personal information over long emotional support conversations. | \u003ca href=\"https://arxiv.org/abs/2602.01885v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.01885-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[HumanStudy-Bench: Towards AI Agent Design for Participant Simulation](https://arxiv.org/pdf/2602.00685v1)** - Introduces a benchmark that replays published human-subject experiments with LLM agents to test how well they simulate real participants. | \u003ca href=\"http://arxiv.org/abs/2602.00685v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.00685-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Benchmarking Agents in Insurance Underwriting Environments](https://arxiv.org/pdf/2602.00456v1)** - Proposes an expert-designed multi-turn insurance underwriting benchmark to evaluate agent performance under real-world enterprise conditions with noisy tools and proprietary knowledge. 
| \u003ca href=\"http://arxiv.org/abs/2602.00456v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.00456-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[TriCEGAR: A Trace-Driven Abstraction Mechanism for Agentic AI](https://arxiv.org/pdf/2601.22997v1)** - Proposes automated state abstraction from agent execution traces using predicate trees and counterexample refinement for probabilistic runtime verification of agent behavior. | \u003ca href=\"http://arxiv.org/abs/2601.22997v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22997-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering](https://arxiv.org/pdf/2601.22952v1)** - Compares three LLM agent frameworks (Aider, OpenHands, SWE-agent) on vulnerability false positive filtering to study how agent design and backbone model affect triage performance. | \u003ca href=\"http://arxiv.org/abs/2601.22952v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22952-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Why Are AI Agent Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study](https://arxiv.org/pdf/2602.00164v1)** - Analyzes 8,106 fix-related pull requests from five AI coding agents to catalog the reasons agent-generated contributions are closed without merging. | \u003ca href=\"http://arxiv.org/abs/2602.00164v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.00164-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[JAF: Judge Agent Forest](https://arxiv.org/pdf/2601.22269v1)** - Proposes a judge agent framework that evaluates query-response pairs jointly across a cohort rather than in isolation, using in-context neighborhoods for cross-instance pattern detection. 
| \u003ca href=\"http://arxiv.org/abs/2601.22269v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22269-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis](https://arxiv.org/pdf/2601.22208v1)** - Evaluates LLM reasoning under ReAct and Plan-and-Execute agentic workflows across 48,000 simulated failure scenarios, producing a taxonomy of 16 common reasoning failures. | \u003ca href=\"http://arxiv.org/abs/2601.22208v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22208-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty](https://arxiv.org/pdf/2601.22027v1)** - Introduces a benchmark for evaluating LLM agent consistency, uncertainty handling, and capability awareness in multi-turn tool-using scenarios with incomplete or ambiguous user requests. | \u003ca href=\"http://arxiv.org/abs/2601.22027v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22027-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[More Code, Less Reuse: Investigating Code Quality and Reviewer Sentiment towards AI-generated Pull Requests](https://arxiv.org/pdf/2601.21276v1)** - Examines code quality, maintainability, and reviewer sentiment toward AI-agent-generated pull requests compared to human-authored contributions. | \u003ca href=\"http://arxiv.org/abs/2601.21276v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21276-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[The Quiet Contributions: Insights into AI-Generated Silent Pull Requests](https://arxiv.org/pdf/2601.21102v1)** - Analyzes silent (no-comment) AI-generated pull requests to examine their impact on code complexity, quality issues, and security vulnerabilities. 
| \u003ca href=\"http://arxiv.org/abs/2601.21102v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21102-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Agent Benchmarks Fail Public Sector Requirements](https://arxiv.org/pdf/2601.20617v1)** - Analyzes over 1,300 agent benchmarks against public-sector requirements including process-based evaluation, realism, and domain-specific metrics. | \u003ca href=\"http://arxiv.org/abs/2601.20617v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.20617-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Interpreting Emergent Extreme Events in Multi-Agent Systems](https://arxiv.org/pdf/2601.20538v1)** - Applies Shapley values to attribute emergent extreme events in LLM multi-agent systems to specific agent actions across time, agent, and behavior dimensions. | \u003ca href=\"http://arxiv.org/abs/2601.20538v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.20538-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Who Writes the Docs in SE 3.0? Agent vs. Human Documentation Pull Requests](https://arxiv.org/pdf/2601.20171v1)** - Analyzes AI agent contributions to documentation pull requests and examines how human developers review and intervene in agent-authored documentation changes. | \u003ca href=\"http://arxiv.org/abs/2601.20171v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.20171-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Are We All Using Agents the Same Way? An Empirical Study of Core and Peripheral Developers Use of Coding Agents](https://arxiv.org/pdf/2601.20106v1)** - Examines how core and peripheral developers differ in their use, review, modification, and verification of coding-agent-generated pull requests. 
| \u003ca href=\"http://arxiv.org/abs/2601.20106v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.20106-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle](https://arxiv.org/pdf/2601.20882v1)** - Introduces an end-to-end benchmark with 700+ real-world tasks across build, monitoring, issue resolving, and test generation for evaluating AI agents in full software DevOps workflows. | \u003ca href=\"http://arxiv.org/abs/2601.20882v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.20882-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Toward Architecture-Aware Evaluation Metrics for LLM Agents](https://arxiv.org/pdf/2601.19583v1)** - Proposes an architecture-informed evaluation approach that links agent components like planners, memory, and tool routers to observable behaviors and diagnostic metrics. | \u003ca href=\"http://arxiv.org/abs/2601.19583v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.19583-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Balancing Sustainability And Performance: The Role Of Small-Scale LLMs In Agentic AI Systems](https://arxiv.org/pdf/2601.19311v2)** - Investigates whether smaller-scale language models can reduce energy consumption in multi-agent agentic AI systems without compromising task quality. | \u003ca href=\"http://arxiv.org/abs/2601.19311v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.19311-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Understanding Dominant Themes in Reviewing Agentic AI-authored Code](https://arxiv.org/pdf/2601.19287v1)** - Analyzes 19,450 inline review comments on agent-authored pull requests and derives a taxonomy of 12 review themes to understand how reviewers respond to AI-generated code. 
| \u003ca href=\"http://arxiv.org/abs/2601.19287v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.19287-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Let's Make Every Pull Request Meaningful: An Empirical Analysis of Developer and Agentic Pull Requests](https://arxiv.org/pdf/2601.18749v1)** - Analyzes 40,214 developer and agentic pull requests to compare merge outcomes and identify how submitter attributes and review features differ between human and AI agent contributions. | \u003ca href=\"http://arxiv.org/abs/2601.18749v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.18749-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Automated Structural Testing of LLM-Based Agents: Methods, Framework, and Case Studies](https://arxiv.org/pdf/2601.18827v1)** - Presents structural testing methods for LLM-based agents using OpenTelemetry traces, mocking for reproducible behavior, and automated assertions for component-level verification. | \u003ca href=\"http://arxiv.org/abs/2601.18827v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.18827-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[When AI Agents Touch CI/CD Configurations: Frequency and Success](https://arxiv.org/pdf/2601.17413v1)** - Analyzes how five AI coding agents interact with CI/CD configurations across 8,031 pull requests, examining modification frequency, merge rates, and build success. | \u003ca href=\"http://arxiv.org/abs/2601.17413v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.17413-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Fingerprinting AI Coding Agents on GitHub](https://arxiv.org/pdf/2601.17406v1)** - Identifies behavioral signatures of five AI coding agents from 33,580 pull requests using commit, PR structure, and code features for agent attribution. 
| \u003ca href=\"http://arxiv.org/abs/2601.17406v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.17406-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Interpreting Agentic Systems: Beyond Model Explanations to System-Level Accountability](https://arxiv.org/pdf/2601.17168v1)** - Assesses existing interpretability methods for agentic systems and identifies gaps in explaining temporal dynamics, compounding decisions, and context-dependent behaviors. | \u003ca href=\"http://arxiv.org/abs/2601.17168v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.17168-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[AI builds, We Analyze: An Empirical Study of AI-Generated Build Code Quality](https://arxiv.org/pdf/2601.16839v1)** - Investigates maintainability and security-related build code smells in AI-agent-generated pull requests across 364 identified quality issues. | \u003ca href=\"http://arxiv.org/abs/2601.16839v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.16839-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Will It Survive? Deciphering the Fate of AI-Generated Code in Open Source](https://arxiv.org/pdf/2601.16809v1)** - Examines long-term survival of AI-agent-generated code through survival analysis of 200,000+ code units across 201 open-source projects. | \u003ca href=\"http://arxiv.org/abs/2601.16809v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.16809-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents](https://arxiv.org/pdf/2601.16649v1)** - Develops an oracle counterfactual framework for multi-turn agentic tasks that measures the criticality of individual capabilities like planning and state tracking. 
| \u003ca href=\"http://arxiv.org/abs/2601.16649v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.16649-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems](https://arxiv.org/pdf/2601.16280v1)** - Presents a 12-category error taxonomy and diagnostic framework for evaluating tool-use reliability across open-weight and proprietary LLMs in multi-agent systems on edge hardware. | \u003ca href=\"http://arxiv.org/abs/2601.16280v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.16280-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Agentic Confidence Calibration](https://arxiv.org/pdf/2601.15778v1)** - Introduces the problem of agentic confidence calibration and proposes Holistic Trajectory Calibration, extracting process-level features across an agent's entire trajectory to diagnose failures. | \u003ca href=\"http://arxiv.org/abs/2601.15778v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.15778-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Improving Methodologies for Agentic Evaluations Across Domains: Leakage of Sensitive Information, Fraud and Cybersecurity Threats](https://arxiv.org/pdf/2601.15679v1)** - Examines methodological challenges in evaluating AI agents across sensitive information leakage, fraud, and cybersecurity threats through a multi-national collaborative benchmarking exercise. | \u003ca href=\"http://arxiv.org/abs/2601.15679v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.15679-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[MiRAGE: A Multiagent Framework for Generating Multimodal Multihop Question-Answer Dataset for RAG Evaluation](https://arxiv.org/pdf/2601.15487v1)** - Introduces a multi-agent framework that generates verified, domain-specific, multimodal, multi-hop question-answer datasets for benchmarking retrieval-augmented generation systems. 
| \u003ca href=\"http://arxiv.org/abs/2601.15487v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.15487-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling](https://arxiv.org/pdf/2601.15232v1)** - Analyzes 1,187 bug reports from LLM agent software across seven frameworks to categorize bug types, root causes, and effects, and tests automated bug labeling with a ReAct agent. | \u003ca href=\"http://arxiv.org/abs/2601.15232v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.15232-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution](https://arxiv.org/pdf/2601.15075v2)** - Proposes a hierarchical framework for general agentic attribution that identifies internal factors driving agent actions through temporal likelihood dynamics and perturbation-based analysis. | \u003ca href=\"http://arxiv.org/abs/2601.15075v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.15075-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering](https://arxiv.org/pdf/2601.14470v1)** - Analyzes token consumption patterns across software development lifecycle stages in a multi-agent system to identify where tokens are consumed and which stages drive cost. | \u003ca href=\"http://arxiv.org/abs/2601.14470v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.14470-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[APEX-Agents](https://arxiv.org/pdf/2601.14242v2)** - Introduces a benchmark of 480 long-horizon, cross-application productivity tasks created by investment banking analysts, consultants, and lawyers for evaluating AI agent capabilities in realistic work environments. 
| \u003ca href=\"http://arxiv.org/abs/2601.14242v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.14242-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[CooperBench: Why Coding Agents Cannot be Your Teammates Yet](https://arxiv.org/pdf/2601.13295v2)** - Introduces a benchmark of 600+ collaborative coding tasks to evaluate whether coding agents can coordinate as effective teammates under various coordination structures. | \u003ca href=\"http://arxiv.org/abs/2601.13295v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.13295-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?](https://arxiv.org/pdf/2601.13227v1)** - Investigates how RAG systems can game nugget-based LLM judge evaluations through metric overfitting, demonstrating near-perfect scores when evaluation elements are leaked or predictable. | \u003ca href=\"http://arxiv.org/abs/2601.13227v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.13227-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents](https://arxiv.org/pdf/2601.15322v1)** - Introduces the Determinism-Faithfulness Assurance Harness for measuring trajectory determinism and evidence-conditioned faithfulness in tool-using LLM agents across 74 configurations and 12 models. | \u003ca href=\"http://arxiv.org/abs/2601.15322v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.15322-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems](https://arxiv.org/pdf/2601.11903v1)** - Presents a process-aware and auditable multi-agent evaluation framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under human oversight. 
| \u003ca href=\"http://arxiv.org/abs/2601.11903v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.11903-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces](https://arxiv.org/pdf/2601.11868v1)** - Introduces a curated benchmark of 89 hard tasks in computer terminal environments with unique environments, human-written solutions, and comprehensive tests for evaluating frontier agent capabilities. | \u003ca href=\"http://arxiv.org/abs/2601.11868v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.11868-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue Systems](https://arxiv.org/pdf/2601.11854v2)** - Introduces a benchmark and evaluation framework for agentic task-oriented dialogue systems covering multi-goal coordination, dependency management, memory, adaptability, and proactivity. | \u003ca href=\"http://arxiv.org/abs/2601.11854v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.11854-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[What Do LLM Agents Know About Their World? Task2Quiz](https://arxiv.org/pdf/2601.09503v1)** - Decouples task execution from environment understanding with a deterministic QA paradigm to study whether task success is actually a good proxy for how well agents understand their environment. | \u003ca href=\"https://arxiv.org/abs/2601.09503v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.09503-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments](https://arxiv.org/pdf/2601.09032v1)** - Evaluates frontier models on 150 workplace tasks to identify an empirical hierarchy of agentic capabilities spanning tool use, planning, adaptability, groundedness, and common-sense reasoning. 
| \u003ca href=\"https://arxiv.org/abs/2601.09032v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.09032-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ViDoRe V3: A Comprehensive Evaluation of RAG in Complex Real-World Scenarios](https://arxiv.org/pdf/2601.08620v1)** - Introduces a multimodal RAG benchmark with 26K pages and 3,099 queries in 6 languages to evaluate retrieval across non-textual elements and open-ended queries. | \u003ca href=\"https://arxiv.org/abs/2601.08620v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.08620-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[M3-BENCH: Process-Aware Evaluation of LLM Agents Social Behaviors in Mixed-Motive Games](https://arxiv.org/pdf/2601.08462v1)** - Evaluates LLM agent social behaviors in mixed-motive games using process-aware analysis of both reasoning and communication rather than outcome-only metrics. | \u003ca href=\"https://arxiv.org/abs/2601.08462v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.08462-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents](https://arxiv.org/pdf/2601.19935v1)** - Benchmarks whether agents can proactively use long-term memory to execute tool-based actions, rather than just passively retrieving facts on demand. | \u003ca href=\"https://arxiv.org/abs/2601.19935v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.19935-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms](https://arxiv.org/pdf/2601.07651v2)** - Proposes a formal framework for actively evaluating general-purpose agents across multiple tasks, selecting which tasks and agents to sample next to minimize ranking error over time. 
| \u003ca href=\"https://arxiv.org/abs/2601.07651v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.07651-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[VirtualEnv: A Platform for Embodied AI Research](https://arxiv.org/pdf/2601.07553v2)** - Introduces an Unreal Engine 5 simulation platform for benchmarking LLM-driven agents on embodied tasks including navigation, object manipulation, and multi-agent coordination in procedurally generated environments. | \u003ca href=\"https://arxiv.org/abs/2601.07553v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.07553-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[FROAV: A Framework for RAG Observation and Agent Verification](https://arxiv.org/pdf/2601.07504v1)** - Presents an open-source platform combining visual workflow orchestration with LLM-as-a-Judge evaluation for prototyping and validating RAG-based agent pipelines without infrastructure coding. | \u003ca href=\"https://arxiv.org/abs/2601.07504v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.07504-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Lost in the Noise: How Reasoning Models Fail with Contextual Distractors](https://arxiv.org/pdf/2601.07226v1)** - Benchmarks model robustness across 11 RAG, reasoning, alignment, and tool-use tasks against diverse contextual noise types including random documents, irrelevant histories, and hard negative distractors. | \u003ca href=\"https://arxiv.org/abs/2601.07226v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.07226-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction](https://arxiv.org/pdf/2601.06966v1)** - Introduces a project-oriented memory benchmark with 2,000+ cross-session dialogues across eleven scenarios to evaluate how well agents track evolving goals and dynamic context dependencies. 
| \u003ca href=\"https://arxiv.org/abs/2601.06966v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.06966-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[IDRBench: Interactive Deep Research Benchmark](https://arxiv.org/pdf/2601.06676v1)** - Introduces the first benchmark for interactive deep research combining a modular multi-agent framework with on-demand user interaction, a scalable user simulator, and interaction-aware metrics measuring quality, alignment, and cost. | \u003ca href=\"https://arxiv.org/abs/2601.06676v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.06676-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation](https://arxiv.org/pdf/2601.06328v1)** - Introduces an open-world tool-using environment with 5,571 tools across 204 apps, a task engine for multi-tool workflows with wild constraints, and a state controller that injects failures to stress-test robustness. | \u003ca href=\"https://arxiv.org/abs/2601.06328v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.06328-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents](https://arxiv.org/pdf/2601.05899v1)** - Introduces a tower defense environment for evaluating LLM agent planning and decision-making with low computational demands, multimodal observation, and hallucination assessment support. | \u003ca href=\"https://arxiv.org/abs/2601.05899v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.05899-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents](https://arxiv.org/pdf/2601.05215v2)** - Introduces a user-authored benchmark for memory-aware LLM agents in Minecraft with parametric task templates, machine-checkable validators, and bounded-knowledge evaluation under a no-shortcut policy. 
| \u003ca href=\"https://arxiv.org/abs/2601.05215v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.05215-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Internal Representations as Indicators of Hallucinations in Agent Tool Selection](https://arxiv.org/pdf/2601.05214v1)** - Proposes a framework for detecting tool-calling hallucinations in LLM agents by analyzing internal representations during a single forward pass, targeting incorrect tool selection, parameter errors, and tool bypass. | \u003ca href=\"https://arxiv.org/abs/2601.05214v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.05214-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Agent-as-a-Judge](https://arxiv.org/pdf/2601.05111v1)** - Surveys the evolution from LLM-as-a-Judge to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory for evaluation. | \u003ca href=\"https://arxiv.org/abs/2601.05111v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.05111-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Arabic Prompts with English Tools: A Benchmark](https://arxiv.org/pdf/2601.05101v1)** - Introduces the first benchmark for evaluating tool-calling and agentic capabilities of LLMs in Arabic, measuring functional accuracy and robustness in Arabic agentic workflows. | \u003ca href=\"https://arxiv.org/abs/2601.05101v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.05101-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Effects of Personality Steering on Cooperative Behavior in LLM Agents](https://arxiv.org/pdf/2601.05302v2)** - Examines how Big Five personality steering affects cooperative behavior in LLM agents using repeated Prisoner's Dilemma games across multiple model generations. 
| \u003ca href=\"https://arxiv.org/abs/2601.05302v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.05302-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Analyzing Message-Code Inconsistency in AI Coding Agent-Authored Pull Requests](https://arxiv.org/pdf/2601.04886v2)** - Analyzes message-code inconsistency in pull requests authored by AI coding agents across five agent systems to study trustworthiness of agent-generated PR descriptions. | \u003ca href=\"https://arxiv.org/abs/2601.04886v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.04886-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[GUITester: Enabling GUI Agents for Exploratory Defect Discovery](https://arxiv.org/pdf/2601.04500v1)** - Proposes a multi-agent framework for autonomous exploratory GUI testing that decouples navigation from verification via planning-execution and hierarchical reflection modules. | \u003ca href=\"https://arxiv.org/abs/2601.04500v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.04500-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems](https://arxiv.org/pdf/2601.04170v1)** - Introduces the concept of agent drift and a composite metric framework for quantifying semantic, coordination, and behavioral degradation in multi-agent LLM systems over extended interactions. | \u003ca href=\"https://arxiv.org/abs/2601.04170v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.04170-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?](https://arxiv.org/pdf/2601.02854v1)** - Introduces a unified benchmark for evaluating Multi-Agent Debate methods across multiple domains, modalities, and efficiency metrics including token consumption and inference time. 
| \u003ca href=\"https://arxiv.org/abs/2601.02854v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.02854-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts](https://arxiv.org/pdf/2601.03315v1)** - Documents six recurring failure modes across four end-to-end attempts at autonomous ML research using a pipeline of LLM agents mapped to stages of the scientific workflow. | \u003ca href=\"https://arxiv.org/abs/2601.03315v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.03315-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[LongDA: Benchmarking LLM Agents for Long-Document Data Analysis](https://arxiv.org/pdf/2601.02598v2)** - Introduces a data analysis benchmark for evaluating LLM agents under documentation-intensive analytical workflows requiring long document navigation and multi-step computation. | \u003ca href=\"https://arxiv.org/abs/2601.02598v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.02598-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[The Rise of Agentic Testing: Multi-Agent Systems for Robust Software Quality Assurance](https://arxiv.org/pdf/2601.02454v1)** - Proposes a closed-loop multi-agent testing framework with generation, execution analysis, and review optimization agents for autonomous software test refinement. | \u003ca href=\"https://arxiv.org/abs/2601.02454v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.02454-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents](https://arxiv.org/pdf/2601.02314v1)** - Proposes a causal framework using structural causal models and counterfactual interventions to audit whether reasoning traces in LLM agents are faithful generative drivers or post-hoc rationalizations. 
| \u003ca href=\"https://arxiv.org/abs/2601.02314v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.02314-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions](https://arxiv.org/pdf/2601.06112v1)** - Introduces a benchmark for evaluating agent reliability across consistency, robustness to perturbations, and fault tolerance under chaos-engineering-style tool failure injection. | \u003ca href=\"https://arxiv.org/abs/2601.06112v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.06112-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability](https://arxiv.org/pdf/2601.00481v1)** - Introduces an evaluation suite that standardizes MAS configuration and execution, exports framework-agnostic execution traces, and enables systematic reliability assessment across agent architectures. | \u003ca href=\"https://arxiv.org/abs/2601.00481v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.00481-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Beyond Perfect APIs: WildAGTEval](https://arxiv.org/pdf/2601.00268v1)** - Introduces a benchmark for evaluating LLM agent function-calling under realistic API complexity including noisy outputs, detailed specifications, and runtime challenges. 
| \u003ca href=\"https://arxiv.org/abs/2601.00268v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.00268-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n\n\u003c/details\u003e\n\n\u003cbr\u003e\n\n\u003cdetails id=\"agent-tooling\"\u003e\n\u003csummary\u003e\u003ch3 style=\"display:inline\"\u003eAgent Tooling (95)\u003c/h3\u003e\u003c/summary\u003e\n\n\u003cbr\u003e\n\n| Paper | arXiv ID |\n|---|:---:|\n| **[TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging](https://arxiv.org/pdf/2602.06875v1)** - Proposes a multi-agent observe-analyze-repair loop that uses runtime traces to find and fix bugs in LLM-generated code. | \u003ca href=\"https://arxiv.org/abs/2602.06875v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.06875-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Generative Ontology: When Structured Knowledge Learns to Create](https://arxiv.org/pdf/2602.05636v1)** - Explores constraining LLM generation with executable schemas and multi-agent roles to produce structurally valid yet creative outputs. | \u003ca href=\"https://arxiv.org/abs/2602.05636v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05636-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Structured Context Engineering for File-Native Agentic Systems](https://arxiv.org/pdf/2602.05447v1)** - Tests how context format (YAML, JSON, Markdown) affects agent accuracy across 9,649 experiments in file-native agentic systems. | \u003ca href=\"https://arxiv.org/abs/2602.05447v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05447-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ProAct: Agentic Lookahead in Interactive Environments](https://arxiv.org/pdf/2602.05327v1)** - Explores training agents to think ahead by distilling environment search into causal reasoning chains in interactive environments. 
| \u003ca href=\"https://arxiv.org/abs/2602.05327v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.05327-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Autonomous Question Formation for Large Language Model-Driven AI Systems](https://arxiv.org/pdf/2602.01556v1)** - Investigates teaching agents to ask themselves the right questions before acting to adapt to new situations autonomously. | \u003ca href=\"https://arxiv.org/abs/2602.01556v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.01556-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[From Perception to Action: Spatial AI Agents and World Models](https://arxiv.org/pdf/2602.01644v1)** - Surveys the connection between agentic architectures and spatial tasks like robotics and navigation, covering memory, planning, and world models in embodied agents. | \u003ca href=\"https://arxiv.org/abs/2602.01644v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.01644-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[World Models as an Intermediary between Agents and the Real World](https://arxiv.org/pdf/2602.00785v1)** - Argues for using world models as a bridge between agents and high-cost real-world environments to provide richer learning signals across domains like robotics and ML engineering. | \u003ca href=\"http://arxiv.org/abs/2602.00785v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.00785-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Engineering AI Agents for Clinical Workflows: A Case Study in Architecture, MLOps, and Governance](https://arxiv.org/pdf/2602.00751v1)** - Presents a reference architecture for production AI agents integrating Clean Architecture, event-driven design, per-agent MLOps lifecycles, and human-in-the-loop governance. 
| \u003ca href=\"http://arxiv.org/abs/2602.00751v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.00751-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Autonomous Data Processing using Meta-Agents](https://arxiv.org/pdf/2602.00307v1)** - Proposes a meta-agent framework that builds, runs, and keeps refining data processing pipelines through hierarchical agent orchestration. | \u003ca href=\"http://arxiv.org/abs/2602.00307v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.00307-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering](https://arxiv.org/pdf/2601.22859v2)** - Proposes a multi-agent framework for automatically building executable test environments across ten programming languages using planning-execution-verification with environment reuse. | \u003ca href=\"http://arxiv.org/abs/2601.22859v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22859-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training](https://arxiv.org/pdf/2601.22781v1)** - Proposes an adaptive data generation framework for training mobile GUI agents that matches task difficulty to the agent's current capability level. | \u003ca href=\"http://arxiv.org/abs/2601.22781v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22781-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[AutoRefine: From Trajectories to Reusable Expertise for Continual LLM Agent Refinement](https://arxiv.org/pdf/2601.22758v1)** - Proposes extracting dual-form reusable expertise from agent execution histories — specialized subagents for procedural tasks and skill patterns for static knowledge — with continuous pruning and merging. 
| \u003ca href=\"http://arxiv.org/abs/2601.22758v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22758-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents](https://arxiv.org/pdf/2602.02548v1)** - Proposes modeling GUI agent operations as sequences of learnable tool tokens with semantic anchoring and curriculum-based training instead of coordinate-based visual grounding. | \u003ca href=\"http://arxiv.org/abs/2602.02548v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.02548-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents](https://arxiv.org/pdf/2601.22607v2)** - Proposes a framework combining a self-evolving multi-agent data engine with verifier-based reinforcement learning to train multi-turn interactive tool-using agents. | \u003ca href=\"http://arxiv.org/abs/2601.22607v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22607-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents](https://arxiv.org/pdf/2601.22311v1)** - Investigates why step-wise reasoning struggles with long-horizon planning in LLM agents and proposes future-aware lookahead with reward estimation to let early actions account for delayed outcomes. | \u003ca href=\"http://arxiv.org/abs/2601.22311v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22311-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents](https://arxiv.org/pdf/2601.22129v2)** - Proposes a test-time scaling method for software engineering agents that recycles prior trajectories and branches at critical intermediate steps instead of resampling from scratch. 
| \u003ca href=\"http://arxiv.org/abs/2601.22129v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22129-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Optimizing Agentic Workflows using Meta-tools](https://arxiv.org/pdf/2601.22037v2)** - Proposes bundling recurring sequences of agent tool calls into deterministic meta-tools to skip unnecessary intermediate LLM reasoning steps and cut failures. | \u003ca href=\"http://arxiv.org/abs/2601.22037v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.22037-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[astra-langchain4j: Experiences Combining LLMs and Agent Programming](https://arxiv.org/pdf/2601.21879v1)** - Explores integrating LLM capabilities into the ASTRA agent programming language to study how traditional agent toolkits and modern LLM-based agentic platforms can inform each other. | \u003ca href=\"http://arxiv.org/abs/2601.21879v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21879-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Meta Context Engineering via Agentic Skill Evolution](https://arxiv.org/pdf/2601.21557v2)** - Introduces a bi-level framework where a meta-agent evolves context engineering skills via agentic crossover while a base agent executes them to optimize context as files and code. | \u003ca href=\"http://arxiv.org/abs/2601.21557v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21557-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[DataCross: A Unified Benchmark and Agent Framework for Cross-Modal Heterogeneous Data Analysis](https://arxiv.org/pdf/2601.21403v1)** - Proposes a multi-agent framework and benchmark for cross-modal data analysis that coordinates specialized sub-agents via a divide-and-conquer workflow across structured and unstructured data sources. 
| \u003ca href=\"http://arxiv.org/abs/2601.21403v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21403-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[CovAgent: Overcoming the 30% Curse of Mobile Application Coverage with Agentic AI and Dynamic Instrumentation](https://arxiv.org/pdf/2601.21253v1)** - Explores agentic AI for Android app testing that uses code inspection and dynamic instrumentation to reach activities that standard GUI fuzzers cannot access. | \u003ca href=\"http://arxiv.org/abs/2601.21253v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21253-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[CUA-Skill: Develop Skills for Computer Using Agent](https://arxiv.org/pdf/2601.21123v2)** - Introduces a large-scale computer-using agent skill library with parameterized execution, composition graphs, dynamic retrieval, and memory-aware failure recovery for desktop applications. | \u003ca href=\"http://arxiv.org/abs/2601.21123v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21123-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Textual Equilibrium Propagation for Deep Compound AI Systems](https://arxiv.org/pdf/2601.21064v2)** - Explores local equilibrium propagation for optimizing deep compound AI systems, replacing global textual backpropagation to avoid signal degradation in long-horizon agentic workflows. | \u003ca href=\"http://arxiv.org/abs/2601.21064v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.21064-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Should I Have Expressed a Different Intent? Counterfactual Generation for LLM-Based Autonomous Control](https://arxiv.org/pdf/2601.20090v2)** - Investigates counterfactual reasoning in agentic LLM control scenarios using structural causal models and conformal prediction for formal reliability guarantees. 
| \u003ca href=\"http://arxiv.org/abs/2601.20090v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.20090-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Insight Agents: An LLM-Based Multi-Agent System for Data Insights](https://arxiv.org/pdf/2601.20048v2)** - Introduces a hierarchical multi-agent system with out-of-domain detection and BERT-based agent routing for delivering personalized data insights at production scale. | \u003ca href=\"http://arxiv.org/abs/2601.20048v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.20048-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Agentic Design Patterns: A System-Theoretic Framework](https://arxiv.org/pdf/2601.19752v1)** - Introduces a system-theoretic framework that decomposes agentic AI into five functional subsystems and derives 12 reusable design patterns for building robust agent architectures. | \u003ca href=\"http://arxiv.org/abs/2601.19752v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.19752-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[A Practical Guide to Agentic AI Transition in Organizations](https://arxiv.org/pdf/2602.10122v1)** - Explores a pragmatic framework for transitioning organizational processes to agentic AI, covering domain-driven use case identification, task delegation, and human-in-the-loop operating models. | \u003ca href=\"http://arxiv.org/abs/2602.10122v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2602.10122-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[JitRL: Just-In-Time Reinforcement Learning for Continual Learning in LLM Agents Without Gradient Updates](https://arxiv.org/pdf/2601.18510v1)** - Proposes a training-free continual learning framework for LLM agents that retrieves relevant past experiences and modulates output logits at test time without gradient updates. 
| \u003ca href=\"http://arxiv.org/abs/2601.18510v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.18510-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Think-Augmented Function Calling: Improving LLM Parameter Accuracy Through Embedded Reasoning](https://arxiv.org/pdf/2601.18282v2)** - Proposes embedding explicit reasoning at both function and parameter levels during agent tool calls, with dynamic complexity scoring to trigger granular justification for critical decisions. | \u003ca href=\"http://arxiv.org/abs/2601.18282v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.18282-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents](https://arxiv.org/pdf/2601.18217v1)** - Investigates which RL training environment properties and modeling choices most influence cross-domain generalization for LLM agents deployed beyond their training domains. | \u003ca href=\"http://arxiv.org/abs/2601.18217v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.18217-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation](https://arxiv.org/pdf/2601.17915v2)** - Proposes disaggregating LLM investigation into bounded local evidence mining with deterministic graph traversal and belief propagation for reliable open-ended agent reasoning. | \u003ca href=\"http://arxiv.org/abs/2601.17915v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.17915-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[AI Agent for Reverse-Engineering Legacy Finite-Difference Code](https://arxiv.org/pdf/2601.18381v1)** - Presents a LangGraph-based AI agent framework combining GraphRAG, multi-stage retrieval, and RL-inspired adaptive feedback for reverse-engineering legacy scientific code. 
| \u003ca href=\"http://arxiv.org/abs/2601.18381v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.18381-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[PatchIsland: Orchestration of LLM Agents for Continuous Vulnerability Repair](https://arxiv.org/pdf/2601.17471v1)** - Proposes a continuous vulnerability repair system that orchestrates a diverse LLM agent ensemble with two-phase deduplication for integration with continuous fuzzing pipelines. | \u003ca href=\"http://arxiv.org/abs/2601.17471v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.17471-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[DALIA: Towards a Declarative Agentic Layer for Intelligent Agents in MCP-Based Server Ecosystems](https://arxiv.org/pdf/2601.17435v1)** - Introduces a declarative architectural layer for agentic workflows with formalized capabilities, declarative discovery protocol, and deterministic task graph construction. | \u003ca href=\"http://arxiv.org/abs/2601.17435v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.17435-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents](https://arxiv.org/pdf/2601.16746v2)** - Presents a task-aware context pruning framework for coding agents that trains a lightweight neural skimmer to selectively retain relevant code lines based on explicit goals. | \u003ca href=\"http://arxiv.org/abs/2601.16746v2\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.16746-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[REprompt: Prompt Generation for Intelligent Software Development Guided by Requirements Engineering](https://arxiv.org/pdf/2601.16507v1)** - Proposes a multi-agent prompt optimization framework guided by requirements engineering principles for system and user prompts in agent-based software development. 
| \u003ca href=\"http://arxiv.org/abs/2601.16507v1\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2601.16507-b31b1b.svg\" alt=\"arXiv\" /\u003e\u003c/a\u003e |\n| **[EvoConfig: Self-Evolving Multi-Agent Systems for Efficient Autonomous Environment Configuration](https","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVoltAgent%2Fawesome-ai-agent-papers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FVoltAgent%2Fawesome-ai-agent-papers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVoltAgent%2Fawesome-ai-agent-papers/lists"}