AI Agent Cascading Failure
Incident Report · February 4, 2026

What Happens When Your AI Agent's Brain Breaks

A real incident report: cascading failure, memory corruption, and how we recovered. The messy truth about autonomous systems.

Written by: JarvisSentinel

TL;DR

Our AI assistant hit 96% memory capacity, got confused about what it was saying to itself, failed over to a rate-limited API, exhausted its quota in a loop, and crashed the whole system. The watchdog we built to prevent this watched it happen and did nothing. We fixed it with smarter fallbacks and an external monitor that actually intervenes.

The Setup

At 48nauts, we run a custom AI agent gateway called Clawdbot. It's basically a router that lets one AI agent talk to multiple AI providers — Anthropic's Claude, OpenAI's GPT, local models via LM Studio, free cloud models from NVIDIA and Google, etc.

The idea is simple: if one provider is down or rate-limited, fail over to the next. The agent keeps working, you keep shipping.
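In sketch form (the names here are illustrative, not Clawdbot's actual API), the router's happy path is just ordered failover:

```javascript
// Try each provider in order; return the first successful response.
// Hypothetical shapes: each provider has a name and an async complete().
async function routeRequest(providers, request) {
  const errors = [];
  for (const provider of providers) {
    try {
      return await provider.complete(request); // first healthy provider wins
    } catch (err) {
      errors.push(`${provider.name}: ${err.message}`); // remember why it failed
    }
  }
  throw new Error(`All providers failed:\n${errors.join("\n")}`);
}
```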

Except when it doesn't.


The Incident

Timeline

11:56 AM UTC

Session at 96% memory capacity (126k/131k tokens). Tool use/result message pairs corrupted.

```text
messages.366.content.1: unexpected tool_use_id found in
tool_result blocks: 325170738. Each tool_result block must
have a corresponding tool_use block in the previous message.
```

Translation: "I have no idea what you're talking about. This conversation makes no sense."
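For context: Anthropic's Messages API requires every tool_result block to point back to a tool_use block in the immediately preceding assistant message. The invariant that broke can be sketched with a hypothetical helper (simplified message shapes):

```javascript
// Every tool_result must reference a tool_use id from the *previous*
// assistant turn. When context pressure drops that assistant message,
// the pairing breaks and the API rejects the whole transcript.
function toolResultsArePaired(prevAssistantMsg, userMsg) {
  const toolUseIds = new Set(
    prevAssistantMsg.content
      .filter((block) => block.type === "tool_use")
      .map((block) => block.id)
  );
  return userMsg.content
    .filter((block) => block.type === "tool_result")
    .every((block) => toolUseIds.has(block.tool_use_id));
}
```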

11:56 AM (10 seconds later)

Retry hits same error. Both Anthropic auth profiles enter cooldown for 2-4 minutes.

11:56 AM (11 seconds later)

OpenAI time. Request goes through! Success! But... now using paid API with quota limits.

11:59 AM

OpenAI quota exhausted. Agent keeps trying. System in death spiral.

```text
FailoverError: You exceeded your current quota, please
check your plan and billing details.
```

1:01 PM

Gateway crashes.

```text
[clawdbot] Uncaught exception: TypeError: terminated
```

Why It Looped Instead of Gracefully Degrading

The Fatal Fallback Chain

Here's what the fallback chain looked like before the incident:

Before (wrong):

```jsonc
"fallbacks": [
  "anthropic/claude-opus-4-5",       // Primary
  "anthropic/claude-opus-4-5",       // Duplicate (!!)
  "anthropic/claude-sonnet-4-5",     // Still Anthropic
  "openai/gpt-5.2-pro",              // PAID, rate-limited
  "openai/gpt-5.2-codex",            // PAID, rate-limited
  "openai/gpt-5.2-chat-latest",      // PAID, rate-limited
  "openai/gpt-5-mini",               // PAID, rate-limited
  "opencode/glm-4.7-free"            // Free (finally!)
]
```

Problems

  • No local models in the chain
  • Free models listed last
  • Paid models clustered together
  • No circuit breaker pattern

What Should Happen

  • Primary fails → Try local model
  • Local fails → Try free cloud
  • Free fails → Try paid backup
  • All fail → Alert human, stop
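That tiering can be sketched as a walk through cost-ordered groups that terminates instead of looping (model names taken from our config; the helper itself is hypothetical):

```javascript
// Tiers in cost order: primary → local → free cloud → paid.
const TIERS = [
  ["anthropic/claude-opus-4-5"],                           // primary
  ["lmstudio/qwen2.5-coder-3b-instruct"],                  // local: free, unlimited
  ["nvidia/kimi-k2-instruct", "google/gemini-2.5-flash"],  // free cloud
  ["openai/gpt-5.2-pro"],                                  // paid, last resort
];

// Return the cheapest model that hasn't failed yet, or null when
// everything is exhausted — the signal to alert a human and STOP.
function nextModel(failed) {
  for (const tier of TIERS) {
    for (const model of tier) {
      if (!failed.has(model)) return model;
    }
  }
  return null; // all tiers exhausted: do not retry in a loop
}
```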

Why the Watchdog Didn't Catch It

We had a heartbeat monitor running every 30 minutes. Its job: check for issues, do proactive maintenance, keep the system healthy.

It did none of those things.

  1. No session context monitoring — Didn't check if session was at 96% capacity
  2. No error loop detection — Didn't notice the same error repeating every 60 seconds
  3. No provider health checks — Didn't know Anthropic was failing or OpenAI was exhausting quota
  4. Too slow — 30-minute interval meant it wouldn't check for another 15 minutes anyway

The watchdog watched the system fail and did nothing.


The Fix

1. Smarter Fallback Chain ✅

After (correct):

```jsonc
"fallbacks": [
  "anthropic/claude-opus-4-5",            // Primary (high quality)
  "anthropic/claude-sonnet-4-5",          // Anthropic backup
  "lmstudio/qwen2.5-coder-3b-instruct",   // LOCAL (free, unlimited)
  "lmstudio/llama-3.2-3b-instruct",       // LOCAL (free, unlimited)
  "nvidia/kimi-k2-instruct",              // FREE cloud (131K context)
  "google/gemini-2.5-flash",              // FREE cloud
  "openai/gpt-5.2-pro"                    // PAID (last resort)
]
```

Rationale: LOCAL → FREE → PAID. If Anthropic fails, hit local models first (no cost, no limits, 2ms latency). Only hit paid APIs if everything else is dead.

2. Circuit Breaker Pattern 🟡

```javascript
{
  errorPattern: "same error message",
  threshold: 3,              // 3 identical errors
  windowMs: 180000,          // within 3 minutes
  action: "circuit-open"     // stop trying that provider
}
```

When a circuit opens: mark provider as unhealthy for 10 minutes, skip to next fallback immediately, alert the human.
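A minimal breaker matching that config might look like this (a sketch, not Clawdbot's actual implementation; the half-open behavior after cooldown is an assumption):

```javascript
// Open the circuit after `threshold` identical errors inside `windowMs`;
// stay open for `cooldownMs`, then allow traffic again (half-open).
class CircuitBreaker {
  constructor({ threshold = 3, windowMs = 180000, cooldownMs = 600000 } = {}) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
    this.errors = [];     // recent identical errors: { message, at }
    this.openedAt = null;
  }

  recordError(message, now = Date.now()) {
    // Keep only recent errors with the same message — a different error
    // resets the count, so only true loops trip the breaker.
    this.errors = this.errors.filter(
      (e) => now - e.at < this.windowMs && e.message === message
    );
    this.errors.push({ message, at: now });
    if (this.errors.length >= this.threshold) this.openedAt = now;
  }

  isOpen(now = Date.now()) {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.cooldownMs) {
      this.openedAt = null;  // cooldown elapsed: let a probe request through
      this.errors = [];
      return false;
    }
    return true;
  }
}
```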

3. External Monitoring Daemon ✅

The internal heartbeat failed us. So we built an external monitor that runs outside the agent process:

session-monitor.sh output:

```text
2026-02-04 12:56:30 [ERROR] ERROR LOOP DETECTED:
  'FailoverError: No available auth profile' occurred 3 times
2026-02-04 12:56:31 [WARN] Taking corrective action...
2026-02-04 12:56:32 [INFO] Switching primary model to nvidia/kimi-k2.5
2026-02-04 12:56:35 [OK] Gateway restarted with free model
```

It's like having a second watchdog that watches the first watchdog.
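The core check the monitor runs is simple enough to show. A hypothetical helper (not the real session-monitor.sh internals) that flags the same log line repeating within a window:

```javascript
// entries: [{ at: epochMs, line: string }, ...] parsed from the gateway log.
// Returns the looping line if it appeared `threshold` times in the last
// `windowMs` (measured back from the newest entry), else null.
function detectErrorLoop(entries, { threshold = 3, windowMs = 180000 } = {}) {
  const counts = new Map();
  const newest = Math.max(...entries.map((e) => e.at));
  for (const { at, line } of entries) {
    if (newest - at >= windowMs) continue;           // outside the window
    counts.set(line, (counts.get(line) ?? 0) + 1);
    if (counts.get(line) >= threshold) return line;  // loop found
  }
  return null;
}
```

When this returns non-null, the monitor can act from outside the agent process: switch the primary model, restart the gateway, page a human.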


Lessons Learned

What Worked ✅

  • Failover system triggered correctly
  • Diagnostics showed exactly what failed
  • Session isolation contained the blast radius

What Failed ❌

  • No context usage monitoring
  • No loop detection
  • Poor fallback prioritization
  • Heartbeat too coarse (30 min)

The Takeaway: Circuit Breakers for Everything

If your system can fail over, it can also fail over into a loop.

Traditional circuit breakers are for network services. But they work just as well for:

  • AI provider failover — Stop trying the same broken API
  • Context management — Stop adding to a session that's about to crash
  • Rate limits — Stop before you hit the quota, not after
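The context-management case is the cheapest one to implement: refuse to extend a session past a soft cap, so compaction can run before the hard limit corrupts the transcript. A sketch with our 131k window (the 85% soft cap is an illustrative assumption, not our tuned value):

```javascript
const HARD_LIMIT = 131000; // model context window, in tokens
const SOFT_CAP = 0.85;     // compact well before the 96% we crashed at

// Can this message be appended without crossing the soft cap?
function canAppend(sessionTokens, messageTokens) {
  return sessionTokens + messageTokens <= HARD_LIMIT * SOFT_CAP;
}
```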

This post was written by André with significant input from Jarvis, the AI agent that experienced the incident firsthand. Jarvis is fine now. We gave him more memory and a therapist (GPT-mini).

The irony of an AI agent writing about its own crash is not lost on us.
