Why the Best AI in the World Still Fails 76% of the Time — And What That Means for Your Claims Team
Last year, Microsoft and Carnegie Mellon University published a benchmark study that should be required reading for every insurance executive evaluating AI right now.
They built a simulated company — complete with email, databases, document systems, and internal communication tools — and sent the world's best AI agents to work inside it. Real tasks. Real complexity. The kind of work that happens in a claims department every day.
The result: the top-performing AI model — one of the most capable systems ever built — completed just 24% of tasks on its own. Fully, correctly, without error.
Seventy-six percent failure rate. On controlled, well-defined work tasks. With the most advanced AI available.
Before you read that as a reason to abandon your AI investment, stay with me — because the story is more nuanced than the headline, and the nuance is exactly what separates insurance leaders who will get real ROI from AI from those who will spend significant budget and organizational goodwill finding out why their deployment stalled.
The benchmark isn't an indictment of AI. It's a map. And the claims leaders who understand what it says will make dramatically better decisions than those who don't.
SECTION 1
What the Benchmark Actually Tested
TheAgentCompany benchmark wasn't testing whether AI could answer questions or summarize documents; AI handles those tasks well. It was testing whether AI agents could complete multi-step, real-world office tasks — the kind that require navigating multiple systems, making judgment calls, handling unexpected situations, and coordinating with other people.
Sound familiar? That's a claims department.
The tasks included updating records across multiple systems, gathering information from colleagues and acting on it, managing workflows through enterprise software interfaces, and making decisions based on incomplete or ambiguous information.
The top-performing model completed 24% of tasks autonomously. When researchers gave partial credit for partially completed tasks, the score climbed to 34%. OpenAI's GPT-4o — the other top contender — scored under 10%.
Two things are true simultaneously: these are the most capable AI systems ever built, and they fail most of the time on realistic work tasks. Holding both of those truths at once is the starting point for a rational AI strategy in claims.
SECTION 2
The Four Failure Modes — And What They Mean in Claims
The researchers identified specific, recurring patterns in how AI agents fail. Each one maps directly onto the claims environment in ways that matter for how you design your deployment.
Failure Mode 1: Lack of Common Sense
When an agent was told to "write responses to answer.docx," a human would have immediately understood that .docx means a Word document: you open it in Word, you type in it, you save it. The AI treated the file as plain text, dumped raw text into it, and corrupted it. No error message. Just a broken file and a task marked complete.
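Here is the shape of that failure in code. A minimal sketch, assuming the python-docx library; the file name comes from the benchmark example, and the response text is invented:

```python
# What the agent effectively did: treat answer.docx as a plain text file.
# Mode "w" truncates the zipped XML package and replaces it with raw
# text. No error is raised, but Word can no longer open the file.
with open("answer.docx", "w") as f:
    f.write("Here are my responses...")

# What the task required: edit the file through a format-aware library.
from docx import Document

doc = Document("answer.docx")                  # parse the existing document
doc.add_paragraph("Here are my responses...")
doc.save("answer.docx")                        # write back a valid .docx
```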
In claims: an experienced adjuster reads "roof damage, storm-related" and knows to ask about interior water intrusion, the age of the roof, whether emergency tarping was applied, and the condition of the gutters. An AI classifies the claim type and moves on. The follow-up question that would have changed the reserve by $40,000 never gets asked.
AI doesn't have domain intuition. It has pattern matching. Those are not the same thing, and the gap between them shows up in the details of real claims.
Failure Mode 2: Difficulty with Social Reasoning
Claims handling is inherently collaborative. Adjusters coordinate with policyholders in distress, contractors under deadline pressure, attorneys with adversarial interests, and colleagues with varying caseloads. The social intelligence required is significant.
In the benchmark, AI agents showed a specific pattern: they correctly gathered information from a simulated colleague, received a useful answer, and then failed to follow up on what they learned. They moved on without acting on the social information they'd acquired.
In claims, this translates to an AI that takes in a claimant's statement, processes the facts, and misses the implication — that the claimant mentioned something that doesn't quite add up, that a single clarifying question would prevent a dispute later, that the tone of the conversation signals something the transcript doesn't capture. An experienced adjuster doesn't miss those signals. An AI agent, currently, does.
Failure Mode 3: Navigating Complex Systems
Your claims management system is not a clean, AI-friendly interface. It has pop-ups, timed sessions, pagination, dynamic fields that change based on claim type, and multi-tab workflows that require context held across screens. These are environments that human adjusters navigate daily, mostly without thinking about it.
In the benchmark, AI agents broke repeatedly on unexpected UI states — a dialog box with an unusual confirmation sequence, a screen that didn't load in the expected order, a pop-up requesting an action the agent didn't anticipate. The agent would freeze. Or worse: continue as if the step had been completed correctly.
That word — worse — is important. An AI that breaks and stops is manageable. You know it broke. An AI that continues with a corrupted or incomplete state produces errors downstream that may not surface for hours, or days, or until a claimant calls to ask why their payment is wrong.
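The practical defense is to make every automated step verify its own result and stop the moment verification fails. A minimal sketch of that pattern, in which claims_system, claim_id, and amount are hypothetical placeholders:

```python
class StepVerificationError(Exception):
    """Raised when a step's postcondition cannot be confirmed."""

def run_step(action, verify, description):
    """Execute one automated step, then independently check its result."""
    action()
    if not verify():
        # Fail loudly. A stopped workflow is recoverable; one that keeps
        # going on a corrupted state surfaces as a wrong payment later.
        raise StepVerificationError(f"could not verify: {description}")

# Usage: pair every write with an independent read-back check.
run_step(
    action=lambda: claims_system.update_reserve(claim_id, amount),
    verify=lambda: claims_system.get_reserve(claim_id) == amount,
    description="reserve update written back to the claims system",
)
```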
Failure Mode 4: Hallucination and Confident Wrongness
This is the one that keeps experienced claims professionals up at night — and for good reason.
When AI agents were stuck — unable to find the right contact in a directory, unable to complete a required step — they didn't stop and report failure. They generated a solution. They invented a contact. They produced a plausible-sounding answer and proceeded as if it were true.
In a word processor, that's an inconvenience. In a claims system, that is a fabricated coverage determination. A reserve set on false premises. A denial letter referencing a policy provision that doesn't exist.
The dangerous quality of hallucination is that it doesn't look like an error. It looks like a confident, well-written, authoritative answer. Experienced adjusters develop skepticism over years of handling edge cases. AI agents, by design, don't have that skepticism — and don't know when to apply it.
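One concrete guardrail: never act on a generated fact that can't be confirmed against a system of record. A minimal sketch, checking a drafted denial letter's citations against the policy's actual provisions; draft, policy, and both queue functions are hypothetical:

```python
def unverified_citations(draft_citations, policy_provisions):
    """Return every cited provision the policy does not actually contain."""
    known = set(policy_provisions)
    return [c for c in draft_citations if c not in known]

# A denial letter never goes out on the strength of a citation the
# model may have invented.
problems = unverified_citations(draft.citations, policy.provisions)
if problems:
    send_to_review_queue(draft, reason=f"unverified citations: {problems}")
else:
    queue_for_adjuster_approval(draft)  # still human-reviewed either way
```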
SECTION 3
The Wrong Conclusion — and the Right One
The wrong conclusion from all of this is: "AI isn't ready, so we'll wait."
That conclusion ignores what AI actually does well — and it ignores the cost of standing still while competitors experiment, learn, and build a data advantage that compounds over time.
AI in claims today is genuinely excellent at specific, bounded tasks. Not everything. Not autonomous claims management. But these things, reliably and at scale:
Document extraction and classification
Medical records, police reports, contractor invoices — AI pulls relevant fields in seconds with accuracy that competes with trained adjusters on well-defined extraction tasks. The savings in manual data entry alone justify the investment at most carriers.
Claim triage and routing
Given a loss description and claim data, AI classifies type, estimates severity, and routes to the right queue consistently — without the inter-adjuster variance that creates quality and equity problems. This is the highest-ROI Phase 1 application for most claims organizations.
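To make the triage pattern concrete: a model proposes, confidence thresholds decide whether a human routes the claim instead, and the queue assignment is always reviewable. A minimal sketch; classify_claim is a hypothetical model call and the thresholds are illustrative:

```python
from dataclasses import dataclass

@dataclass
class TriageResult:
    claim_type: str    # e.g., "property_storm"
    severity: float    # estimated severity, 0.0 to 1.0
    confidence: float  # the model's confidence in its own classification

def route_claim(claim) -> str:
    """Return a queue name; never a coverage decision."""
    result: TriageResult = classify_claim(claim.loss_description, claim.data)
    if result.confidence < 0.80:
        return "manual_triage"  # low confidence: a human routes it
    if result.severity > 0.70:
        return f"{result.claim_type}/senior_adjusters"
    return f"{result.claim_type}/standard"
```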
Correspondence drafting
Acknowledgment letters, status updates, information request letters — AI produces first drafts that require only light review, cutting the time adjusters spend on writing by 60–70% for routine communications.
Anomaly detection at portfolio scale
Statistical patterns that suggest fraud, reserve inadequacy, or subrogation opportunity — AI identifies these across hundreds of thousands of claims where no human team could maintain equivalent vigilance.
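The mechanics here are standard outlier screening at portfolio scale. A minimal sketch using scikit-learn's IsolationForest; the feature set is illustrative and claims_feature_matrix is a placeholder:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# One row per claim; columns might be paid-to-incurred ratio, days open,
# count of reserve changes, provider billing frequency, and so on.
X = np.asarray(claims_feature_matrix)

model = IsolationForest(contamination=0.01, random_state=0)
flags = model.fit_predict(X)          # -1 marks statistical outliers

# Every flag is a referral for human investigation, not a conclusion.
suspicious = np.where(flags == -1)[0]
```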
The right conclusion is: AI is powerful in narrow, well-defined, assistive roles. It fails in broad, autonomous, decision-making roles. The deployment model that captures the value while managing the risk is one where humans and AI each do what they do best — and the system is designed explicitly around that division.
This is what the industry calls human-in-the-loop AI. We'd simply call it a realistic strategy.
SECTION 4
What Human-in-the-Loop Actually Looks Like
Human-in-the-loop isn't a half-measure or a concession to AI's current limitations. It's an architectural choice — one that reflects an honest understanding of where AI creates value and where it creates risk.
In practice for a claims organization, it means three things:
First: AI handles the data-heavy, repetitive tasks that slow adjusters down without engaging their expertise. Document extraction, claim classification, status updates, data entry between systems. These tasks consume 30–40% of an adjuster's day and produce almost none of the value that experience and judgment create. When AI absorbs them, adjusters get that time back — and they spend it on the work only they can do.
Second: AI surfaces insights and recommendations that humans review and act on. A fraud score is a flag for investigation — not a denial. A severity classification is a routing suggestion — not a coverage determination. A reserve recommendation is a starting point — not a final number. The human reviews, applies judgment, and approves. The AI informs; the adjuster decides.
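In system terms, that division of labor is an explicit gate: the model produces a recommendation object, and nothing changes state until a named adjuster acts on it. A minimal sketch; claims_system is hypothetical, and log_decision is filled in by the next sketch:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    claim_id: str
    kind: str                  # "fraud_flag", "severity", "reserve", ...
    proposed_value: object
    rationale: str             # shown to the adjuster, never auto-applied
    approved_by: str | None = None

def decide(rec: Recommendation, adjuster_id: str, accept: bool):
    """Only a human decision moves a recommendation into the claims system."""
    rec.approved_by = adjuster_id
    if accept:
        claims_system.apply(rec)
    log_decision(rec, accepted=accept)  # every decision feeds the flywheel
```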
Third: every human decision becomes a data point. Every time an adjuster overrides an AI recommendation, that's a signal that improves the model. Every acceptance is a validation. Over time, this feedback loop — what data scientists call a data flywheel — compounds. The organizations building this feedback infrastructure today will have AI that is materially better in 24 months than anything a late-mover can deploy, regardless of which model they choose.
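Building the flywheel is mostly unglamorous logging: store every recommendation next to the human decision on it, so each override becomes a labeled correction. A minimal sketch; feedback_store is a placeholder for whatever storage you use:

```python
import json
from datetime import datetime, timezone

def log_decision(rec, accepted: bool):
    """Persist one (recommendation, human decision) pair as training data."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "claim_id": rec.claim_id,
        "kind": rec.kind,
        "model_proposed": rec.proposed_value,
        "adjuster": rec.approved_by,
        "accepted": accepted,  # an override is a correction label
    }
    feedback_store.append(json.dumps(record, default=str))
```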
SECTION 5
Where to Start
For most claims organizations, the right starting point isn't the most ambitious capability on your vendor's roadmap. It's the narrowest, most well-defined problem you have — the one with the clearest success metrics, the most available data, and the lowest cost of error if the AI gets it wrong.
In our work with carriers and adjusting firms, that's almost always claim triage and intake: using AI to classify incoming claims, extract key data from loss reports, and route claims to the right adjuster queue. The task is bounded. The success metrics are clear — classification accuracy, routing accuracy, time to assignment. And if the AI misclassifies a claim, a human catches it in the review step before anything consequential happens.
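As a sketch of what "bounded" means at intake: the model extracts into a fixed schema, the output is validated before it touches any downstream system, and anything that fails validation goes to manual entry rather than a guess. This assumes the pydantic library; llm_extract is a placeholder for whatever model API you use, and the fields are illustrative:

```python
from pydantic import BaseModel, ValidationError

class LossReportFields(BaseModel):
    policy_number: str
    date_of_loss: str
    loss_description: str
    estimated_amount: float | None = None

def intake(document_text: str) -> LossReportFields | None:
    raw = llm_extract(document_text,
                      schema=LossReportFields.model_json_schema())
    try:
        return LossReportFields.model_validate_json(raw)
    except ValidationError:
        return None  # route to manual entry; never proceed on a guess
```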
From there, the roadmap builds: workflow automation that connects the systems adjusters navigate manually, reducing the friction and error of manual data transfer. Then decision support — surfacing the right information at the right moment, flagging anomalies, recommending next actions, giving managers real-time visibility into what's happening across the portfolio.
Each phase builds on the last. Each creates value independently. And each generates the organizational experience and data that makes the next phase more successful.
The 76% failure rate doesn't tell you AI isn't ready. It tells you exactly how to deploy it: narrowly, with humans in the loop, building from triage to automation to intelligence. That sequence is everything.
CONCLUSION
The Claims Leaders Who Win with AI
The claims leaders who win with AI over the next five years won't be the ones who deploy the most AI. They'll be the ones who deploy it most intelligently — understanding what it can and can't do, designing systems where human judgment and AI capability reinforce each other, and building the data infrastructure that makes their AI better every month.
The benchmark tells you where the edges of the map are. Now it's about knowing how to navigate from here.
Ready to see where your claims org sits on this roadmap?
We offer a free 35-minute strategy call — no sales deck, no pitch. Just an honest assessment of where you are and what's worth doing first.
About the Author
Serges Himbaza
Serges is the founder of Populus Technology, a firm specializing in human-in-the-loop AI deployment for insurance claims organizations. Populus works at the intersection of people, process, and technology — helping carriers and independent adjusting firms implement AI that adjusters actually adopt and executives can actually measure.
Up Next from Populus Technology
→ The 3-Phase AI Roadmap for Claims Orgs: What to Build First, Second, and Third
→ Why 58% of Insurance AI Projects Fail — And It Has Nothing to Do With the Technology
→ By the Numbers: What Human-in-the-Loop AI Actually Delivers in Claims Operations