IBM and UC Berkeley Unveil IT-Bench and MAST to Understand AI Agent Failures in Enterprise

IBM Research and UC Berkeley have launched IT-Bench and MAST, two tools for analyzing why artificial intelligence agents deployed in enterprises fail, offering a framework for improving their reliability.

IT-Bench and MAST, Key Tools to Diagnose AI Agent Failures in Enterprise

IBM Research, in collaboration with the University of California at Berkeley, has developed IT-Bench and MAST, two frameworks designed to deeply analyze the causes of failures of artificial intelligence agents in professional IT environments. These tools provide a systematic methodology to identify failures often invisible in automated systems used by companies, a critical issue as digital transformation increasingly relies on AI.

While AI assistants are widely adopted to automate complex tasks in enterprises, their reliability remains a major concern. IT-Bench and MAST allow for scrutinizing the internal mechanisms leading to malfunctions, paving the way for tangible improvements in operational performance.

How These Tools Decode AI Agent Errors

IT-Bench functions as a test bench for AI agents, simulating varied operational scenarios to measure their robustness and adaptability. MAST (Multi-Agent System Failure Taxonomy), for its part, analyzes the internal state of agents during execution, tracking inconsistencies or errors in their decision-making.

This unique combination offers a dual perspective: IT-Bench broadly tests agents under near-real conditions, while MAST dives into the details of their cognitive process. Together, they enable identifying not only when an agent fails but especially why, exposing often complex causes such as interpretation errors, data biases, or flaws in managing internal states.
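The "when" versus "why" distinction can be illustrated with a minimal Python sketch. It does not use the actual IT-Bench or MAST interfaces, which are not described here; every name in it (Step, Scenario, toy_agent, find_inconsistency, run_benchmark) is hypothetical. The benchmark loop reports which scenario fails, and a separate pass over the recorded trace points to the step where the agent's action contradicts its own recorded state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    action: str
    state: Dict[str, str]  # snapshot of the agent's internal state after the action


@dataclass
class Scenario:
    name: str
    alert: str
    expected_final_action: str


Agent = Callable[[str], List[Step]]


def toy_agent(alert: str) -> List[Step]:
    """Hypothetical agent with a deliberate bug: it always restarts 'web'."""
    suspect = "db" if "database" in alert else "web"
    return [
        Step("diagnose", {"suspect": suspect}),
        Step("restart:web", {"suspect": suspect}),
    ]


def find_inconsistency(trace: List[Step]) -> Optional[str]:
    """Return the first step whose action contradicts the recorded state."""
    for i, step in enumerate(trace):
        if step.action.startswith("restart:"):
            target = step.action.split(":", 1)[1]
            suspect = step.state.get("suspect")
            if suspect != target:
                return (f"step {i}: restarted '{target}' while "
                        f"internal state suspected '{suspect}'")
    return None


def run_benchmark(agent: Agent, scenarios: List[Scenario]) -> Dict[str, dict]:
    """Benchmark pass (when it fails) plus trace analysis (why it fails)."""
    report = {}
    for sc in scenarios:
        trace = agent(sc.alert)
        passed = bool(trace) and trace[-1].action == sc.expected_final_action
        why = None if passed else find_inconsistency(trace)
        report[sc.name] = {"passed": passed, "why": why}
    return report


scenarios = [
    Scenario("web-outage", "web latency spike", "restart:web"),
    Scenario("db-outage", "database connection errors", "restart:db"),
]
report = run_benchmark(toy_agent, scenarios)
for name, result in report.items():
    print(name, result)
```

In this toy setup the pass/fail result alone only says that the database scenario failed; the trace analysis adds that the agent had identified the right service and then acted on the wrong one.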

By comparison, traditional approaches mostly rely on tests in controlled environments or general performance metrics, without access to agents' internal states. IT-Bench and MAST thus provide an unprecedented level of diagnosis, essential for ensuring the reliability of AI systems in critical industrial contexts.

An Architecture Designed for Flexibility and Precision

MAST is designed to be model-agnostic, meaning it can adapt to different types of AI agents, whether based on neural networks, symbolic systems, or hybrid architectures. This flexibility facilitates its integration into varied infrastructures, a major asset for companies using solutions from multiple providers.

IT-Bench, for its part, simulates complex IT environments with a wide diversity of situations, including common hazards and errors in real systems. Agents are thus subjected to rigorous stress tests, revealing their limits under realistic conditions.
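As a rough illustration of stress testing under injected hazards, the Python sketch below simulates a service that fails at a configurable rate and compares two hypothetical agents against it. This is an assumption-laden toy, not IT-Bench's implementation: SimulatedService, naive_agent, retrying_agent, and success_rate are invented names, and the only hazard modeled is a timeout.

```python
import random
from typing import Callable


class SimulatedService:
    """Toy environment that raises an injected fault at a fixed rate."""

    def __init__(self, failure_rate: float, seed: int):
        self._rng = random.Random(seed)  # seeded so runs are reproducible
        self.failure_rate = failure_rate

    def call(self) -> str:
        if self._rng.random() < self.failure_rate:
            raise TimeoutError("injected fault")
        return "ok"


def naive_agent(service: SimulatedService) -> str:
    """Hypothetical agent that gives up on the first error."""
    return service.call()


def retrying_agent(service: SimulatedService, attempts: int = 3) -> str:
    """Hypothetical agent that retries a bounded number of times."""
    for _ in range(attempts):
        try:
            return service.call()
        except TimeoutError:
            continue
    raise TimeoutError("all attempts failed")


def success_rate(agent: Callable[[SimulatedService], str],
                 failure_rate: float,
                 trials: int = 1000,
                 seed: int = 0) -> float:
    """Fraction of trials the agent completes despite injected faults."""
    service = SimulatedService(failure_rate, seed)
    successes = 0
    for _ in range(trials):
        try:
            agent(service)
            successes += 1
        except TimeoutError:
            pass
    return successes / trials


for rate in (0.0, 0.3, 0.6):
    print(rate, success_rate(naive_agent, rate), success_rate(retrying_agent, rate))
```

Sweeping the fault rate this way shows where each agent's success rate starts to degrade, which is the kind of limit a stress test is meant to expose.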

This innovative technical approach relies on close collaboration between AI researchers and IT experts, ensuring that the tools meet operational needs while leveraging recent advances in artificial intelligence.

An Immediate Application for IT Teams and AI Developers

IBM and UC Berkeley are aiming IT-Bench and MAST primarily at in-house AI agent development teams and IT managers seeking to improve the resilience of their automated systems. The tools are accessible via an API, which facilitates their integration into development and deployment pipelines.

The provided documentation includes detailed use cases, notably in IT infrastructure management, where agents must handle incidents in real time. The goal is to significantly reduce outages and improve predictive maintenance through a better understanding of failing behaviors.

What Impact for the Enterprise AI Sector?

The rise of AI agents in enterprises raises the major challenge of their reliability on critical systems. Tools like IT-Bench and MAST offer pragmatic responses to this issue by providing a solid benchmark and detailed monitoring of agents' internal states.

In Europe and France, where trust in digital technologies is a strategic issue, these technological advances play a key role in promoting AI adoption in sensitive sectors. They also help better govern automated systems, a subject at the heart of debates on ethics and responsibility in artificial intelligence.

A Promising Breakthrough with Room for Improvement

While IT-Bench and MAST represent a breakthrough in analyzing AI agent failures, their large-scale deployment will still require adjustments. Notably, their extension to multilingual and multi-domain environments has yet to be fully explored. Moreover, integrating these tools into heterogeneous production chains poses technical challenges.

Despite these limitations, this initiative highlights the importance of developing sophisticated diagnostic tools to ensure the robustness of AI agents in enterprises, a prerequisite for their widespread and secure adoption.

This collaboration between IBM Research and UC Berkeley illustrates a strong trend in AI research: moving from simple model development to a holistic approach integrating testing, monitoring, and predictive maintenance, essential to meet industrial requirements.

Source: Hugging Face Blog, IBM Research and UC Berkeley, February 18, 2026.