2026-03-30 💡 Tips 'n' Tricks

When Claude Code Goes Down: Resilience Patterns for Developers


💡 Claude Code ECONNRESET Wave — 80% of Prompts Failing on March 30

On March 30, Claude Code CLI users began reporting an unusually high rate of ECONNRESET errors — connection resets occurring at the API layer before a response could be received. A tracked GitHub issue against the anthropics/claude-code repository confirmed that roughly 80% of prompts were failing during peak hours, with the error appearing consistently across different network environments and API key accounts, ruling out local connectivity as the cause.

What ECONNRESET means in this context

An ECONNRESET in the Claude Code CLI indicates the TCP connection to Anthropic's API endpoints was forcibly closed by the server before the response was sent. This differs from a timeout (where the server never responds) or a 429 (rate limit) — a reset means the connection was established but then dropped, often under infrastructure pressure or during rolling deployments.
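The reset-versus-timeout distinction is easy to demonstrate locally, with no API involved. The sketch below (standard library only; the helper name is illustrative) uses the SO_LINGER trick to abort a TCP connection with an RST, so the client's blocked read fails with the same ECONNRESET a dropped API connection produces:

```python
import errno, socket, struct, threading

def demo_reset():
    """Show what an ECONNRESET is: the peer aborts an established
    connection with a TCP RST, and the blocked read fails. This is
    distinct from a timeout (no answer) or a 429 (a real HTTP reply)."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    def abort_first_client():
        conn, _ = server.accept()
        # SO_LINGER with a zero timeout makes close() send RST, not FIN.
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
        conn.close()

    t = threading.Thread(target=abort_first_client)
    t.start()
    client = socket.create_connection(server.getsockname())
    try:
        client.recv(1024)          # blocks until the RST arrives
        return None                # (an orderly FIN would land here)
    except ConnectionResetError as e:
        return e.errno             # errno.ECONNRESET
    finally:
        client.close()
        t.join()
        server.close()
```

An orderly shutdown (FIN) would instead make recv() return an empty bytes object, which is why a reset surfaces as an error rather than a normal end of stream.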

How Anthropic resolved it

Check the status page first

When Claude Code starts failing with connection errors, check status.anthropic.com before spending time debugging your local setup. Most widespread failures are visible there within minutes of detection.

claude code outage ECONNRESET API reliability incident

💡 Handling Claude API Failures Gracefully — A Retry Playbook

The March 30 ECONNRESET incident is a useful reminder that any production system built on an external API will eventually face availability events. The question is not whether your Claude integration will encounter errors, but whether it is written to recover cleanly when it does. Here is a practical retry playbook based on Anthropic's error documentation and patterns from the community.

Error categories and how to handle each

Anthropic's documented error codes fall into three buckets. 4xx client errors (400 invalid request, 401 authentication, 403 permission, 404 not found) indicate a problem with the request itself; retrying won't help, so fail fast and fix the call. 429 (rate limit) and 529 (overloaded), along with 5xx server errors, are transient; retry them with exponential backoff. Connection-level failures such as ECONNRESET carry no HTTP status at all; treat them as transient and retry the same way.

A minimal retry wrapper (Python)

import time, random, anthropic

client = anthropic.Anthropic()
# Retry on rate limits (429), transient server errors (5xx), and
# Anthropic's "overloaded" status (529).
RETRYABLE = {429, 500, 502, 503, 529}

def call_with_retry(prompt, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-opus-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
        except anthropic.APIStatusError as e:
            # Non-retryable status codes (4xx client errors) fail fast;
            # so does the final attempt.
            if e.status_code not in RETRYABLE or attempt == max_retries - 1:
                raise
        except (anthropic.APIConnectionError, ConnectionResetError):
            # Connection-level failures, including ECONNRESET, are retryable.
            if attempt == max_retries - 1:
                raise
        # Exponential backoff with ±20% jitter, capped at 60 seconds.
        jitter = delay * (0.8 + random.random() * 0.4)
        time.sleep(jitter)
        delay = min(delay * 2, 60)

Queue long-running tasks

For workflows that run multi-step agent chains, consider putting tasks on a queue (Redis, SQS, or even a simple SQLite table) rather than running them synchronously. If a step fails, the queue persists the state and the next retry picks up where the previous attempt left off — no lost progress.
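A minimal sketch of that idea, assuming a single worker process and using a SQLite table (the schema and function names here are illustrative):

```python
import sqlite3

def make_queue(path=":memory:"):
    # One row per task; 'status' moves from 'pending' to 'done'.
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS tasks (
        id INTEGER PRIMARY KEY, prompt TEXT, status TEXT DEFAULT 'pending')""")
    return db

def enqueue(db, prompt):
    db.execute("INSERT INTO tasks (prompt) VALUES (?)", (prompt,))
    db.commit()

def work_one(db, handler):
    # Claim the oldest pending task; mark it done only after the handler
    # succeeds, so an exception leaves the row pending for the next retry.
    row = db.execute(
        "SELECT id, prompt FROM tasks WHERE status='pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    task_id, prompt = row
    result = handler(prompt)   # e.g. call_with_retry(prompt)
    db.execute("UPDATE tasks SET status='done' WHERE id=?", (task_id,))
    db.commit()
    return result
```

Because a task is marked done only after the handler returns, a raised exception (say, an exhausted retry chain) leaves the row pending, and the next call to work_one picks it up again.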

best practices error handling retry Python API resilience

💡 Monitoring Claude API Health — Tools and Patterns for Teams

Beyond handling individual errors in code, teams running Claude in production benefit from proactive health monitoring — knowing about an Anthropic API event before your on-call rotation starts getting paged by users. Here are three layers of monitoring worth setting up.

Layer 1 — Subscribe to the status page

Anthropic's status page at status.anthropic.com provides RSS and email subscription options. Subscribing means you get notified of incidents and maintenance windows as soon as Anthropic acknowledges them — often before community reports reach critical mass. This is the single highest-ROI monitoring step for most teams.

Layer 2 — Synthetic probes

A lightweight cron job that sends a fixed minimal prompt (e.g., "Reply with the word OK") every 60 seconds and records latency and success rate gives you independent visibility into API health from your own infrastructure. Plot these metrics in your existing observability stack (Grafana, Datadog, CloudWatch). A sudden spike in failure rate or p99 latency is your earliest warning signal.
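A probe harness along these lines is only a few lines of Python; here the actual API call is passed in as a callable, and the function names are illustrative rather than tied to any particular observability stack:

```python
import time

def run_probe(call):
    """Run one synthetic probe; returns (ok, latency_seconds).
    `call` is whatever sends the fixed prompt, e.g. a thin wrapper
    around client.messages.create with "Reply with the word OK"."""
    start = time.monotonic()
    try:
        call()
        return True, time.monotonic() - start
    except Exception:
        return False, time.monotonic() - start

def failure_rate(samples):
    # Fraction of failed probes over a window of (ok, latency) samples.
    if not samples:
        return 0.0
    return sum(1 for ok, _ in samples if not ok) / len(samples)
```

Run it from cron every 60 seconds, ship each (ok, latency) pair to your metrics pipeline, and alert on failure_rate (and p99 latency) over a sliding window.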

Layer 3 — Error rate alerting in application logs

Instrument your call_with_retry wrapper (or equivalent) to emit a metric increment on every retried call and on every exhausted retry chain. Set a Slack or PagerDuty alert when the retry-exhaustion rate exceeds a threshold (e.g., five exhaustions per minute over a 3-minute window). This catches incidents that affect only a subset of your endpoints and might not yet appear on the status page.
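One way to sketch that alert logic, assuming the threshold from the text (five exhaustions per minute sustained over a 3-minute window, i.e. 15 events in 180 seconds) and leaving the Slack/PagerDuty delivery itself out:

```python
import time
from collections import deque

class ExhaustionAlert:
    """Track retry-exhaustion events in a sliding time window and
    report when the count crosses the alert threshold."""
    def __init__(self, threshold=15, window_seconds=180, now=time.monotonic):
        self.threshold = threshold   # 5/min x 3 min = 15 events
        self.window = window_seconds
        self.now = now               # injectable clock, for testing
        self.events = deque()

    def record_exhaustion(self):
        t = self.now()
        self.events.append(t)
        # Drop events that have aged out of the window.
        while self.events and self.events[0] < t - self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold  # True => fire the alert
```

Call record_exhaustion() from your retry wrapper whenever a chain raises after its final attempt; a True return value is the signal to page.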

What the March 30 incident taught us

Teams with all three layers in place knew about the ECONNRESET event within 2–3 minutes of it starting. Teams without monitoring discovered it when users complained, typically 20–40 minutes later. The difference in mean time to awareness is the difference between a barely-noticeable incident and a visible service degradation.

monitoring observability production status page alerts