The Setup
At 48nauts, we run a custom AI agent gateway called Clawdbot. It's basically a router that lets one AI agent talk to multiple AI providers — Anthropic's Claude, OpenAI's GPT, local models via LM Studio, free cloud models from NVIDIA and Google, etc.
The idea is simple: if one provider is down or rate-limited, fail over to the next. The agent keeps working, you keep shipping.
Except when it doesn't.
The Incident
Timeline
11:56 AM UTC
Session at 96% of context capacity (126k/131k tokens). Tool use/result message pairs corrupted.
messages.366.content.1: unexpected tool_use_id found in
tool_result blocks: 325170738. Each tool_result block must
have a corresponding tool_use block in the previous message.
Translation: "I have no idea what you're talking about. This conversation makes no sense."
11:56 AM (10 seconds later)
Retry hits same error. Both Anthropic auth profiles enter cooldown for 2-4 minutes.
11:56 AM (11 seconds later)
OpenAI time. Request goes through! Success! But... now using paid API with quota limits.
11:59 AM
OpenAI quota exhausted. Agent keeps trying. System in death spiral.
FailoverError: You exceeded your current quota, please
check your plan and billing details.
1:01 PM
Gateway crashes.
[clawdbot] Uncaught exception: TypeError: terminated
Why It Looped Instead of Gracefully Degrading
The Fatal Fallback Chain
Here's what the fallback chain looked like before the incident:
"fallbacks": [
  "anthropic/claude-opus-4-5",   // Primary
  "anthropic/claude-opus-4-5",   // Duplicate (!!)
  "anthropic/claude-sonnet-4-5", // Still Anthropic
  "openai/gpt-5.2-pro",          // PAID, rate-limited
  "openai/gpt-5.2-codex",        // PAID, rate-limited
  "openai/gpt-5.2-chat-latest",  // PAID, rate-limited
  "openai/gpt-5-mini",           // PAID, rate-limited
  "opencode/glm-4.7-free"        // Free (finally!)
]
Problems
- No local models in the chain
- Free models listed last
- Paid models clustered together
- No circuit breaker pattern
What Should Happen
- Primary fails → Try local model
- Local fails → Try free cloud
- Free fails → Try paid backup
- All fail → Alert human, stop
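The ordering above can be sketched as a small failover loop. This is a minimal TypeScript sketch under stated assumptions, not Clawdbot's actual code: the `Tier` type, `orderChain`, `runWithFailover`, and the `call` and `alert` callbacks are hypothetical names introduced for illustration.

```typescript
type Tier = "primary" | "local" | "free" | "paid";

interface Candidate {
  model: string; // e.g. "anthropic/claude-opus-4-5"
  tier: Tier;    // hypothetical label; the real config has no such field
}

// Lower rank is tried first: primary, then local, then free cloud, then paid.
const TIER_RANK: Record<Tier, number> = { primary: 0, local: 1, free: 2, paid: 3 };

// Drop duplicate entries and sort by tier, keeping the original order within a tier.
function orderChain(chain: Candidate[]): Candidate[] {
  const seen = new Set<string>();
  const unique = chain.filter((c) => {
    if (seen.has(c.model)) return false;
    seen.add(c.model);
    return true;
  });
  return unique
    .map((c, index) => ({ c, index }))
    .sort((a, b) => TIER_RANK[a.c.tier] - TIER_RANK[b.c.tier] || a.index - b.index)
    .map((entry) => entry.c);
}

// Try each candidate once. If all fail, alert a human and stop instead of looping.
async function runWithFailover<T>(
  chain: Candidate[],
  call: (model: string) => Promise<T>,
  alert: (message: string) => void,
): Promise<T> {
  const errors: string[] = [];
  for (const candidate of orderChain(chain)) {
    try {
      return await call(candidate.model);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${candidate.model}: ${reason}`);
    }
  }
  alert(`All providers failed:\n${errors.join("\n")}`);
  throw new Error("All providers failed; stopping instead of retrying");
}
```

The important property is the last two lines: once the chain is exhausted, the loop ends with an alert rather than starting over from the top.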
Why the Watchdog Didn't Catch It
We had a heartbeat monitor running every 30 minutes. Its job: check for issues, do proactive maintenance, keep the system healthy.
It did none of those things.
- No session context monitoring — Didn't check if session was at 96% capacity
- No error loop detection — Didn't notice the same error repeating every 60 seconds
- No provider health checks — Didn't know Anthropic was failing or OpenAI was exhausting quota
- Too slow — 30-minute interval meant it wouldn't check for another 15 minutes anyway
The watchdog watched the system fail and did nothing.
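A check for the first gap might look something like the following. This is a sketch, not the monitor we run: the block shapes are a simplified version of the `tool_use` / `tool_result` content blocks named in the error message, and `checkSession`, `findOrphanedToolResults`, and the 0.85 warning threshold are assumptions made for illustration.

```typescript
// Simplified content blocks; real blocks carry more fields (name, input, content).
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string }
  | { type: "tool_result"; tool_use_id: string };

interface Message {
  role: "user" | "assistant";
  content: Block[];
}

interface SessionHealth {
  usage: number;                 // fraction of the context window in use
  nearLimit: boolean;            // true once usage reaches the warning threshold
  orphanedToolResults: string[]; // tool_use_ids with no matching tool_use
}

// Return the ids of tool_result blocks that have no matching tool_use block
// in the immediately preceding message.
function findOrphanedToolResults(messages: Message[]): string[] {
  const orphans: string[] = [];
  messages.forEach((message, i) => {
    const previous = i > 0 ? messages[i - 1]?.content ?? [] : [];
    const previousIds = new Set<string>();
    for (const block of previous) {
      if (block.type === "tool_use") previousIds.add(block.id);
    }
    for (const block of message.content) {
      if (block.type === "tool_result" && !previousIds.has(block.tool_use_id)) {
        orphans.push(block.tool_use_id);
      }
    }
  });
  return orphans;
}

// warnAt = 0.85 is an arbitrary illustrative threshold, not a measured value.
function checkSession(
  messages: Message[],
  usedTokens: number,
  maxTokens: number,
  warnAt: number = 0.85,
): SessionHealth {
  const usage = usedTokens / maxTokens; // e.g. 126_000 / 131_000 ≈ 0.96
  return {
    usage,
    nearLimit: usage >= warnAt,
    orphanedToolResults: findOrphanedToolResults(messages),
  };
}
```

Run before each request, a check like this would have flagged both conditions from the 11:56 AM entry: a session near its limit and a `tool_result` with no matching `tool_use`.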
The Fix
1. Smarter Fallback Chain ✅
"fallbacks": [
  "anthropic/claude-opus-4-5",          // Primary (high quality)
  "anthropic/claude-sonnet-4-5",        // Anthropic backup
  "lmstudio/qwen2.5-coder-3b-instruct", // LOCAL (free, unlimited)
  "lmstudio/llama-3.2-3b-instruct",     // LOCAL (free, unlimited)
  "nvidia/kimi-k2-instruct",            // FREE cloud (131K context)
  "google/gemini-2.5-flash",            // FREE cloud
  "openai/gpt-5.2-pro"                  // PAID (last resort)
]
Rationale: LOCAL → FREE → PAID. If Anthropic fails, hit local models first (no cost, no limits, 2ms latency). Only hit paid APIs if everything else is dead.
2. Circuit Breaker Pattern 🟡
{
  errorPattern: "same error message",
  threshold: 3,          // 3 identical errors
  windowMs: 180000,      // within 3 minutes
  action: "circuit-open" // stop trying that provider
}
When a circuit opens: mark provider as unhealthy for 10 minutes, skip to next fallback immediately, alert the human.
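One way to turn that config into behavior is a sliding-window breaker keyed by provider and error message. The sketch below is illustrative only: the `CircuitBreaker` class, its method names, and the `cooldownMs` field are assumptions, with the 10-minute cooldown taken from the description above.

```typescript
interface BreakerConfig {
  threshold: number;  // identical errors needed to open the circuit
  windowMs: number;   // window in which those errors must occur
  cooldownMs: number; // how long the provider stays marked unhealthy
}

class CircuitBreaker {
  private readonly config: BreakerConfig;
  private readonly onOpen: (provider: string, error: string) => void;
  // provider -> error message -> timestamps of recent failures
  private readonly failures = new Map<string, Map<string, number[]>>();
  // provider -> time at which the circuit closes again
  private readonly openUntil = new Map<string, number>();

  constructor(
    config: BreakerConfig,
    onOpen: (provider: string, error: string) => void = () => {},
  ) {
    this.config = config;
    this.onOpen = onOpen;
  }

  // True while the provider should be skipped.
  isOpen(provider: string, now: number = Date.now()): boolean {
    const until = this.openUntil.get(provider);
    if (until === undefined) return false;
    if (now >= until) {
      this.openUntil.delete(provider);
      return false;
    }
    return true;
  }

  // Record a failure; returns true if this failure opened the circuit.
  recordFailure(provider: string, error: string, now: number = Date.now()): boolean {
    const byError = this.failures.get(provider) ?? new Map<string, number[]>();
    this.failures.set(provider, byError);
    const recent = (byError.get(error) ?? []).filter(
      (t) => now - t < this.config.windowMs,
    );
    recent.push(now);
    byError.set(error, recent);
    if (recent.length < this.config.threshold) return false;
    this.openUntil.set(provider, now + this.config.cooldownMs);
    this.failures.delete(provider);
    this.onOpen(provider, error); // e.g. alert the human
    return true;
  }

  // A success clears the provider's failure history.
  recordSuccess(provider: string): void {
    this.failures.delete(provider);
  }
}
```

The failover loop would call `isOpen` before each attempt and move straight to the next fallback when it returns true, so a provider that keeps returning the same error stops receiving retries for the length of the cooldown.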
3. External Monitoring Daemon ✅
The internal heartbeat failed us. So we built an external monitor that runs outside the agent process:
2026-02-04 12:56:30 [ERROR] ERROR LOOP DETECTED:
'FailoverError: No available auth profile' occurred 3 times
2026-02-04 12:56:31 [WARN] Taking corrective action...
2026-02-04 12:56:32 [INFO] Switching primary model to nvidia/kimi-k2.5
2026-02-04 12:56:35 [OK] Gateway restarted with free model
It's like having a second watchdog that watches the first watchdog.
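The detection step can be approximated by counting identical error lines in a tail of the gateway log. This is a rough sketch rather than the daemon itself: `detectErrorLoop` and `normalize` are hypothetical names, and the timestamp-prefixed log format is an assumption about what the gateway writes.

```typescript
interface LoopReport {
  error: string; // the repeated error line, without its timestamp
  count: number; // how many times it was seen when the threshold was hit
}

// Strip a leading "YYYY-MM-DD HH:MM:SS " timestamp (an assumed format)
// so that repeats of the same error compare equal.
function normalize(line: string): string {
  return line.replace(/^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+/, "").trim();
}

// Scan recent log lines and report the first error seen `threshold` times.
function detectErrorLoop(lines: string[], threshold: number = 3): LoopReport | null {
  const counts = new Map<string, number>();
  for (const line of lines) {
    if (!/error/i.test(line)) continue; // only consider lines that mention an error
    const key = normalize(line);
    const count = (counts.get(key) ?? 0) + 1;
    counts.set(key, count);
    if (count >= threshold) return { error: key, count };
  }
  return null;
}
```

Because the check only reads log output, it can run in a separate process on its own schedule and keeps working when the gateway itself is wedged.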
Lessons Learned
What Worked ✅
- Failover system triggered correctly
- Diagnostics showed exactly what failed
- Session isolation contained the blast radius
What Failed ❌
- No context usage monitoring
- No loop detection
- Poor fallback prioritization
- Heartbeat too coarse (30 min)
The Takeaway: Circuit Breakers for Everything
If your system can fail over, it can also fail over into a loop.
Traditional circuit breakers are for network services. But they work just as well for:
- AI provider failover — Stop trying the same broken API
- Context management — Stop adding to a session that's about to crash
- Rate limits — Stop before you hit the quota, not after
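The rate-limit case, stopping before the quota rather than after, can be sketched as a budget guard that refuses requests once a reserve would be touched. `QuotaGuard`, its methods, and the 10% reserve are hypothetical choices for illustration; a real provider's quota accounting will differ.

```typescript
class QuotaGuard {
  private readonly limit: number;           // total tokens allowed in the period
  private readonly reserveFraction: number; // share of the limit kept in reserve
  private used = 0;

  constructor(limit: number, reserveFraction: number = 0.1) {
    this.limit = limit;
    this.reserveFraction = reserveFraction;
  }

  // True if spending `tokens` would still leave the reserve untouched.
  canSpend(tokens: number): boolean {
    const budget = this.limit * (1 - this.reserveFraction);
    return this.used + tokens <= budget;
  }

  // Record tokens actually consumed by a completed request.
  record(tokens: number): void {
    this.used += tokens;
  }
}
```

When `canSpend` returns false, the caller can skip that provider and fall through to the next one without ever sending the request that would have produced the quota error.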
This post was written by André with significant input from Jarvis, the AI agent that experienced the incident firsthand. Jarvis is fine now. We gave him more memory and a therapist (GPT-mini).
The irony of an AI agent writing about its own crash is not lost on us.
