✅ Understanding Claude API Rate Limits and How to Design Around Them
Every production Claude integration will eventually hit a rate limit. Understanding how Anthropic's rate limiting works — and designing your application to handle it gracefully — is the difference between an integration that scales smoothly and one that produces sporadic failures under load. Anthropic enforces limits on multiple dimensions simultaneously: requests per minute (RPM), tokens per minute (TPM), and, on some tiers, tokens per day (TPD). Hitting any one of these limits produces a 429 Too Many Requests response.
The rate limit dimensions
- Requests per minute (RPM): The raw number of API calls per minute, regardless of size. High-volume, low-token workloads (short classification tasks, keyword extraction) are most likely to hit RPM limits.
- Tokens per minute (TPM): The total input + output tokens consumed per minute. Long-document workloads hit TPM limits even when RPM is under threshold.
- Tokens per day (TPD): A daily ceiling on total token consumption. Relevant for sustained high-volume batch workloads — if you exhaust your daily token budget midday, calls fail for the rest of the day.
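A quick back-of-envelope check shows which per-minute dimension binds first for a given workload. The limit values below are illustrative, not any real tier's numbers:

```python
def binding_limit(rpm_limit: int, tpm_limit: int, avg_tokens_per_request: int) -> str:
    """Return which per-minute limit caps throughput for a given workload."""
    # How many requests per minute the token budget alone would allow:
    max_requests_by_tpm = tpm_limit // avg_tokens_per_request
    return "TPM" if max_requests_by_tpm < rpm_limit else "RPM"

# Short classification prompts: the request count runs out first.
print(binding_limit(rpm_limit=1000, tpm_limit=400_000, avg_tokens_per_request=200))     # RPM
# Long-document prompts: the token budget runs out first.
print(binding_limit(rpm_limit=1000, tpm_limit=400_000, avg_tokens_per_request=20_000))  # TPM
```

This is the arithmetic behind the bullets above: the effective ceiling is the smaller of RPM and TPM divided by your average request size.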
Reading the rate limit headers
Every Anthropic API response includes rate limit headers. Read and log these — they tell you how much headroom remains before the next limit window resets:
anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 847
anthropic-ratelimit-requests-reset: 2025-12-09T14:01:00Z
anthropic-ratelimit-tokens-limit: 80000
anthropic-ratelimit-tokens-remaining: 51200
anthropic-ratelimit-tokens-reset: 2025-12-09T14:01:00Z
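As a sketch, the headers above can be turned into a simple headroom check. The plain dict here stands in for your HTTP client's response headers:

```python
def requests_headroom(headers: dict) -> float:
    """Fraction of the request budget still available in the current window."""
    limit = int(headers["anthropic-ratelimit-requests-limit"])
    remaining = int(headers["anthropic-ratelimit-requests-remaining"])
    return remaining / limit

# Values taken from the example headers above.
headers = {
    "anthropic-ratelimit-requests-limit": "1000",
    "anthropic-ratelimit-requests-remaining": "847",
}
print(f"{requests_headroom(headers):.1%} of request budget left")  # 84.7% of request budget left
```

Logging this fraction on every response gives you the time series you need for the tier upgrade planning discussed below.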
Designing for rate limits
- Request queuing with a token bucket: Implement a client-side token bucket that throttles outgoing requests to stay below your rate limit ceiling. Node's bottleneck library is a good starting point; in Python, pair a rate limiter such as aiolimiter with tenacity for the retry side.
- Exponential backoff on 429: When you receive a 429, wait before retrying. Start at 1 second, double on each retry, add jitter, and cap at 60 seconds. Retrying immediately on a 429 makes the limit worse, not better.
- Tier upgrade planning: Monitor your P95 token consumption and RPM in production. If you consistently run above 80% of your limit ceiling during peak hours, request a tier upgrade before you start dropping requests.
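The backoff bullet above can be sketched as a delay generator using the suggested parameters (1 s base, doubling, full jitter, 60 s cap):

```python
import random

def backoff_delays(max_retries: int = 6, base: float = 1.0, cap: float = 60.0):
    """Yield retry delays: double the ceiling each attempt, full jitter, capped."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so concurrent clients
        # don't retry in lockstep and re-trigger the 429.
        yield random.uniform(0, ceiling)

# Usage sketch (call_claude is a hypothetical wrapper around your API call):
#
# for delay in backoff_delays():
#     response = call_claude(request)
#     if response.status_code != 429:
#         break
#     time.sleep(delay)
```

The jitter matters as much as the doubling: without it, a fleet of workers that hit the limit together will all retry together.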
Tags: rate limits, API, production, 429, token budget, retrospective
✅ Error Handling and Graceful Degradation in Production Claude Applications
Robust Claude integrations treat API errors as expected events, not exceptions. Claude's API returns a predictable set of error codes; building explicit handling for each one — and defining a graceful degradation path when the API is unavailable — is what separates a pilot project from production-grade software. Here is the error handling architecture that holds up in practice.
Error types and how to handle each
400 Bad Request: Your request was malformed — missing required fields, invalid model ID, or oversized context. Do not retry. Log the full request for debugging and fix the construction logic.
401 Unauthorized: Invalid or expired API key. Do not retry. Alert immediately — this means your key has been revoked or was never valid. Check your secrets manager.
403 Forbidden: Your request violates a usage policy (content filtered). Do not retry with the same input. Log the input (appropriately anonymised) and route to a human review queue if this is user-generated content.
429 Too Many Requests: Rate limit hit. Always retry with exponential backoff. Read the retry-after header if present.
500 Internal Server Error / 529 Overloaded: Server-side error or temporary overload. Retry with backoff. Anthropic's API is highly available, but brief 529s occur during high-traffic periods.
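The handling rules above reduce to a small dispatch table. This is a sketch; the set names and return values are illustrative:

```python
RETRY_WITH_BACKOFF = {429, 500, 529}   # transient: back off and retry
DO_NOT_RETRY = {400, 401, 403}         # deterministic: fix the request, not the timing

def dispatch(status: int) -> str:
    """Map an HTTP status to the handling strategy described above."""
    if status in RETRY_WITH_BACKOFF:
        return "retry"
    if status in DO_NOT_RETRY:
        return "fail"
    return "raise"  # unexpected status: surface it loudly rather than guess

print(dispatch(429))  # retry
print(dispatch(401))  # fail
```

Keeping the retry decision in one table makes it trivial to audit that no fatal status is ever retried.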
Graceful degradation patterns
- Cached responses for common queries: For frequently asked questions or stable content (product FAQs, policy summaries), cache Claude's responses with a TTL. A cache hit bypasses the API entirely, insulating the user from any API issues.
- Fallback to a simpler model: If Claude Sonnet 4.5 is returning 529s, fall back to Claude Haiku 4.5 for the duration of the outage. A slightly less capable response is almost always better than an error message to the user.
- Graceful error messages: When all retries are exhausted, show a message that is honest but not alarming: "Our AI assistant is temporarily unavailable. Please try again in a few minutes, or contact support." Never expose raw API error messages to end users.
- Circuit breaker pattern: After N consecutive failures within a time window, open the circuit and stop sending requests to Claude for a cooling-off period. This prevents a cascade of 429s from amplifying during recovery.
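A minimal circuit breaker along these lines might look as follows (thresholds and the half-open behaviour are illustrative choices):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; stay open for `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """True if a request may be sent to the API right now."""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: cooling-off period elapsed, let a probe request through.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        """Report the outcome of each API call."""
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

While the circuit is open, requests skip the API and go straight to the degradation path (cache, fallback model, or the graceful error message above).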
Tags: error handling, graceful degradation, circuit breaker, production, API reliability, retrospective