Automated Failure Attribution in LLM Multi-Agent Systems: A New Benchmark and Methods

Introduction

Large language model (LLM) multi-agent systems are increasingly used to tackle complex problems through collaborative effort. However, despite their promise, these systems frequently encounter task failures. When a failure occurs, developers face a daunting question: which agent, at which point in the process, caused the problem? Manually sifting through extensive interaction logs to find the root cause is like searching for a needle in a haystack—time-consuming and labor-intensive. This challenge is especially acute in autonomous multi-agent environments, where long information chains and agent autonomy make debugging nearly impossible without automated assistance.


To address this, researchers from Penn State University and Duke University, in collaboration with teams from Google DeepMind, the University of Washington, Meta, and others, have introduced the novel problem of automated failure attribution. They have built the first benchmark dataset for this task—called Who&When—and developed several attribution methods. Their work has been accepted as a Spotlight presentation at ICML 2025, and the code and dataset are fully open-source.

Background and Challenges

LLM multi-agent systems show immense potential across domains like software development, scientific research, and customer support. Yet they remain fragile. A single agent's error, a misunderstanding between agents, or a mistake in information transmission can derail the entire task. According to the research, current debugging relies on manually sifting through interaction logs to localize the error, an approach that is inefficient and fails to scale as systems grow more complex. The researchers highlight that automated failure attribution is essential for rapid system iteration and optimization.

The Who&When Dataset

The Who&When dataset is the first benchmark specifically designed for automated failure attribution in LLM multi-agent systems. It contains multiple multi-agent interaction logs, each annotated with the responsible agent and the time step where the failure originated. The dataset covers various task types and failure modes, providing a standardized evaluation platform.

Each log is annotated with the "who" (the responsible agent) and the "when" (the decisive error step), together with a natural-language explanation of the failure. The dataset is available on Hugging Face and the code on GitHub.
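An annotated failure case of this kind can be pictured as a log of agent turns plus two ground-truth labels. The schema below is a hypothetical illustration of that structure; the field names are invented here and are not the dataset's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One step in a multi-agent interaction log."""
    step: int
    agent: str    # name of the agent that acted at this step
    content: str  # the agent's output at this step

@dataclass
class FailureRecord:
    """An annotated failure case: the full log plus ground truth."""
    task: str
    log: list[Turn] = field(default_factory=list)
    failure_agent: str = ""  # "who": the agent responsible
    failure_step: int = -1   # "when": the decisive error step

# Toy example: the searcher introduces the decisive error at step 1.
record = FailureRecord(
    task="Find the population of the capital of France.",
    log=[
        Turn(0, "planner", "Search for the capital of France."),
        Turn(1, "searcher", "The capital of France is Lyon."),
        Turn(2, "answerer", "The population of Lyon is about 513,000."),
    ],
    failure_agent="searcher",
    failure_step=1,
)
```

An attribution method receives only the task and the log, and is scored on whether it recovers `failure_agent` and `failure_step`.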

Automated Attribution Methods

The researchers proposed and evaluated several automated approaches for failure attribution. These methods fall into two main categories:

Heuristic Methods

Baseline approaches that use simple rules, such as blaming the agent that produced the first anomalous output or the agent with the most errors in the log. While fast and cheap to run, such rules often misattribute failures, since the first visible anomaly is not necessarily the root cause.

Learning-Based Methods

More sophisticated techniques that train models on annotated logs to recognize failure patterns.

Preliminary results show that learning-based methods significantly outperform heuristics, but the task remains challenging—especially for long interaction chains and subtle errors.
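One minimal way to frame such a learned approach is per-step scoring: featurize each step of the log, score how likely it is to be the decisive error, and attribute the failure to the highest-scoring step. The features and hand-set weights below are illustrative placeholders, not the paper's method; in practice the weights would be learned from Who&When-style labels:

```python
def featurize(step_index, message, log_length):
    """Toy per-step features; a real system would use learned text encodings."""
    return [
        1.0 if "error" in message.lower() else 0.0,  # explicit error mention
        len(message) / 1000.0,                       # message length
        step_index / max(log_length - 1, 1),         # relative position in log
    ]

def score(features, weights):
    """Linear score; `weights` stand in for trained parameters."""
    return sum(f * w for f, w in zip(features, weights))

def attribute(log, weights):
    """Return the (agent, step) with the highest failure score."""
    scored = [
        (score(featurize(i, msg, len(log)), weights), agent, i)
        for i, (agent, msg) in enumerate(log)
    ]
    _, agent, step = max(scored)
    return agent, step

weights = [2.0, 0.5, 0.3]  # hand-set, purely for illustration
log = [
    ("planner", "Search for the report, then summarize it."),
    ("searcher", "Error: the document could not be retrieved."),
    ("summarizer", "Summary unavailable."),
]
print(attribute(log, weights))  # prints: ('searcher', 1)
```

The hard cases the paper points to are exactly those where no step carries a surface-level error signal, which is why subtle failures and long chains remain difficult.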

Results and Key Findings

Experiments on the Who&When benchmark show that even the best-performing methods leave substantial room for improvement, particularly on long interaction chains and subtle errors. The study establishes a baseline for future work and highlights the need for more sophisticated reasoning in automated failure attribution.
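Since each case carries a "who" label and a "when" label, evaluation naturally reduces to two accuracies, one per label. A minimal scorer might look like this (names are illustrative, not the benchmark's official harness):

```python
def attribution_accuracy(predictions, ground_truth):
    """Compute agent-level ("who") and step-level ("when") accuracy.

    predictions / ground_truth: lists of (agent, step) pairs, aligned by index.
    """
    n = len(ground_truth)
    who = sum(p[0] == g[0] for p, g in zip(predictions, ground_truth)) / n
    when = sum(p[1] == g[1] for p, g in zip(predictions, ground_truth)) / n
    return {"who_accuracy": who, "when_accuracy": when}

preds = [("searcher", 1), ("planner", 0), ("coder", 3)]
truth = [("searcher", 2), ("planner", 0), ("reviewer", 3)]
print(attribution_accuracy(preds, truth))
```

Note that the two metrics can disagree: a method may name the right agent while missing the exact step, or vice versa, so reporting both is informative.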

Conclusion and Future Directions

The introduction of automated failure attribution for LLM multi-agent systems is a crucial step toward building more reliable and debuggable AI systems. The Who&When dataset provides a foundation for research, and the proposed methods open new avenues for improvement. Future work could explore stronger attribution methods, richer failure annotations, and techniques that remain accurate on long interaction chains and subtle errors.

As multi-agent systems become more prevalent, tools like these will be essential for developers to maintain and scale their applications. The researchers hope their work encourages the community to tackle this important problem.

Paper: arXiv | Code: GitHub | Dataset: Hugging Face
