Diagnosing Agent Failures in LLM Multi-Agent Systems: A Practical Guide to Automated Failure Attribution

Overview

Large Language Model (LLM) multi-agent systems are gaining traction for tackling complex tasks through collaborative workflows. Yet failures are common even when multiple agents collaborate, and pinpointing exactly which agent failed, and at which step, is notoriously difficult. Manually trawling through thousands of interaction logs is like finding a needle in a haystack, slowing down debugging and optimization.


To solve this, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University, introduced the problem of automated failure attribution. They built the first dedicated benchmark dataset, Who&When, and developed several attribution methods. This work was accepted as a Spotlight presentation at ICML 2025. The code and dataset are fully open-source.

This guide walks you through the core concepts, prerequisites, and practical steps to implement automated failure attribution using the Who&When dataset and proposed techniques. By the end, you'll understand how to programmatically determine which agent caused a failure and when it happened, dramatically reducing manual debugging effort.

Prerequisites

  • Python 3.8+ installed on your system
  • Basic knowledge of LLMs and multi-agent systems (e.g., how agents communicate via structured prompts)
  • Familiarity with Hugging Face datasets and common ML libraries (torch, transformers)
  • Access to a GPU (recommended) for running baseline attribution methods efficiently
  • Git to clone the repository

Step-by-Step Instructions

Step 1: Understand the Who&When Dataset

The Who&When dataset (hosted on Hugging Face) contains multi-agent interaction logs. Each log includes a sequence of agent messages, a ground-truth label of the failing agent (the who), and the temporal step of the failure (the when). Tasks range from reasoning to code generation, with failures caused by single-agent errors, miscommunication, or cascading information breakdowns.
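To make the schema concrete, here is a sketch of what one failure log might look like. The field names follow the dataset description above, but the messages and labels are invented for illustration, not taken from the actual dataset:

```python
# Hypothetical example of a single Who&When-style failure log.
# Field names follow the dataset description; the content is invented.
sample_log = {
    "messages": [
        {"agent_id": 0, "content": "Plan: compute 17 * 24 first.", "timestamp": 0},
        {"agent_id": 1, "content": "17 * 24 = 398", "timestamp": 1},  # arithmetic slip
        {"agent_id": 2, "content": "Final answer: 398", "timestamp": 2},
    ],
    "failure_agent": 1,  # the "who": agent 1 introduced the error
    "failure_step": 1,   # the "when": message index where it happened
}

# The failing message can be recovered directly from the labels:
who = sample_log["failure_agent"]
when = sample_log["failure_step"]
print(sample_log["messages"][when]["agent_id"] == who)  # True
```

Here agent 1's arithmetic slip (17 * 24 is 408, not 398) propagates to the final answer, which is exactly the kind of cascade the dataset labels.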

Step 2: Set Up the Environment

  1. Clone the official repository:
    git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
    cd Agents_Failure_Attribution
  2. Create a virtual environment and install dependencies:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
  3. Download the dataset (if not automatically loaded):
    python download_dataset.py

Step 3: Load and Explore the Data

Use the Hugging Face datasets library to load the dataset:

from datasets import load_dataset

dataset = load_dataset("Kevin355/Who_and_When")
print(dataset)

# View a sample interaction
sample = dataset['train'][0]
print(sample['messages'])  # List of agent utterances
print(sample['failure_agent'])
print(sample['failure_step'])

Each sample has three key fields:

  • messages: list of dicts with agent_id, content, timestamp (or sequential order)
  • failure_agent: integer index of the agent that caused the failure
  • failure_step: integer step (message index) where the failure occurred

Step 4: Implement a Baseline Attribution Method

The paper proposes several methods. We'll implement the simplest: Trace-to-Failure. It works by tracking which agents contributed to the final erroneous output. You'll need to parse the conversation to find the last agent message that directly influenced the failure.

def trace_to_failure(messages, final_error):
    # Heuristic: blame the last agent whose message contains the error text.
    # Walk backwards by index so the returned step is the message's actual
    # position, even if an identical message appears earlier in the log.
    for step in range(len(messages) - 1, -1, -1):
        if final_error in messages[step]['content']:
            return messages[step]['agent_id'], step
    return None, None

More sophisticated methods (e.g., counterfactual reasoning, causal graph) are in the repository under methods/.
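To give a flavor of a more deliberate approach, the sketch below walks the log one message at a time and asks a judge whether the latest message derails the task. This is my own minimal sketch, not the repository's implementation: `judge` is a stand-in callable for an LLM call (e.g. your own prompt around an API client).

```python
def step_by_step_attribution(messages, judge):
    """Walk the conversation; blame the first message the judge flags.

    `judge(history, msg)` is any callable returning True when `msg`
    introduces an error given the prior `history` -- in practice an
    LLM prompted with the task description and the log so far.
    """
    history = []
    for step, msg in enumerate(messages):
        if judge(history, msg):
            return msg["agent_id"], step
        history.append(msg)
    return None, None

# Toy stand-in judge: flags any message containing "398" (a wrong product).
def toy_judge(history, msg):
    return "398" in msg["content"]

messages = [
    {"agent_id": 0, "content": "Plan: compute 17 * 24."},
    {"agent_id": 1, "content": "17 * 24 = 398"},
    {"agent_id": 2, "content": "Final answer: 398"},
]
print(step_by_step_attribution(messages, toy_judge))  # (1, 1)
```

Because it stops at the first flagged message rather than the last, this style of judge can catch the origin of a cascade instead of only its final symptom.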

Step 5: Evaluate Against Ground Truth

Run the baseline on a subset and compute accuracy:

correct_who = 0
correct_when = 0
total = 0

for sample in dataset['test']:
    pred_agent, pred_step = trace_to_failure(sample['messages'], sample['final_error'])
    if pred_agent == sample['failure_agent']:
        correct_who += 1
    if pred_step == sample['failure_step']:
        correct_when += 1
    total += 1

print(f"Who Accuracy: {correct_who/total:.2%}")
print(f"When Accuracy: {correct_when/total:.2%}")
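Since "who" and "when" can each be right independently, it is also worth tracking how often both are correct for the same sample. A small helper for that (my addition, not a repository utility):

```python
def attribution_accuracy(preds, labels):
    """Compute who / when / joint accuracy from (agent, step) pairs."""
    total = len(preds)
    who = sum(pa == la for (pa, _), (la, _) in zip(preds, labels)) / total
    when = sum(ps == ls for (_, ps), (_, ls) in zip(preds, labels)) / total
    both = sum(p == l for p, l in zip(preds, labels)) / total
    return {"who": who, "when": when, "both": both}

preds = [(1, 1), (0, 2), (2, 3)]
labels = [(1, 1), (0, 3), (1, 3)]
print(attribution_accuracy(preds, labels))
# {'who': 0.6666666666666666, 'when': 0.6666666666666666, 'both': 0.3333333333333333}
```

Joint accuracy is typically the hardest number to move, and it is the one that matters for actually landing on the right message in a log.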

Step 6: Visualize Results

Create a confusion matrix for agent attribution and a histogram of step errors. The code includes plotting utilities:

from utils.visualization import plot_confusion_matrix
plot_confusion_matrix(predictions, ground_truth, labels=agent_names)
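If you prefer not to depend on the repository's plotting helper, the underlying confusion counts are easy to compute by hand with the standard library, and any plotting tool can render them afterwards:

```python
from collections import Counter

def confusion_counts(predictions, ground_truth):
    """Count (true_agent, predicted_agent) pairs for a confusion matrix."""
    return Counter(zip(ground_truth, predictions))

preds = [1, 0, 1, 2]
truth = [1, 0, 2, 2]
counts = confusion_counts(preds, truth)
print(counts[(2, 1)])  # 1: agent 2's failure was misattributed to agent 1 once
```

Off-diagonal entries like `(2, 1)` show which agents the attribution method systematically confuses with each other.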

Common Mistakes

  • Ignoring information chains: A failure may propagate from an earlier step; only blaming the last agent leads to false attribution.
  • Assuming single cause: Some failures stem from interaction between multiple agents. The dataset currently labels only one agent per sample, but real systems may have combined faults.
  • Not normalizing timestamps: Ensure message order is consistent. Some logs have non-sequential timestamps; always sort by timestamp before analysis.
  • Overfitting to simple heuristics: The baseline methods may perform well only on simple cases. For robust attribution, use the proposed methods (e.g., causal_graph_attribution) from the repository.
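The timestamp caveat above can be handled with a one-line sort before any attribution pass; the `timestamp` key here follows the field description from Step 3:

```python
def normalize_messages(messages):
    """Return messages sorted by timestamp so step indices are meaningful."""
    return sorted(messages, key=lambda m: m["timestamp"])

out_of_order = [
    {"agent_id": 1, "content": "second", "timestamp": 5},
    {"agent_id": 0, "content": "first", "timestamp": 2},
]
print([m["content"] for m in normalize_messages(out_of_order)])  # ['first', 'second']
```

Since `sorted` is stable, messages that share a timestamp keep their original relative order.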

Summary

Automated failure attribution is a crucial step toward reliable LLM multi-agent systems. By using the Who&When dataset and the open-source tools, you can now systematically identify which agent caused a failure and at what point in the interaction, replacing manual log archaeology with a reproducible, data-driven approach. Start experimenting with the provided code and adapt the attribution methods to your own multi-agent architectures.
