Prompt Caching — The Single Biggest Cost Reduction Available in 2025
If there is one API feature that changed how serious Claude applications are built in 2025, it is prompt caching. By marking large, stable portions of your prompt with cache_control: {"type": "ephemeral"}, you instruct Anthropic's infrastructure to retain the computed key-value (KV) state for that section. Writing to the cache carries a small premium over the base input price, but every subsequent request that hits the same cached prefix is billed at roughly 10% of the normal input token cost and returns with lower latency. For applications with large system prompts, document corpora, or few-shot example libraries, the savings compound quickly.
Where caching has the most impact
- RAG applications: Cache the document chunks that anchor each session. For a 50,000-token context document, caching reduces the per-query cost by up to 90% from the second query onwards.
- Multi-turn agents: Cache the system prompt + tool definitions + initial instructions. Only the growing conversation history needs to be re-processed on each turn.
- Few-shot classifiers: Cache a large set of labelled examples. New items to classify are appended after the cache boundary — fresh computation applies only to the new item.
- Code assistants: Cache the codebase context (architecture docs, key files). Developer queries are then cheap follow-up prompts rather than full context re-sends.
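For the RAG case above, the document goes into the user turn as its own content block with a cache breakpoint, and the question is appended after it. A minimal sketch of the payload shape (build_rag_messages is a hypothetical helper name, not part of the SDK):

```python
def build_rag_messages(document_text: str, question: str) -> list[dict]:
    """Build a messages payload that caches the document and appends a fresh question.

    The document block carries cache_control, so repeat queries against the
    same document hit the cache; only the question block is newly processed.
    """
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": document_text,
                    "cache_control": {"type": "ephemeral"},  # cached prefix ends here
                },
                {"type": "text", "text": question},  # fresh computation per query
            ],
        }
    ]
```

Pass the result as the messages argument of a normal API call; as long as document_text is byte-identical between calls, every query after the first reads the document from cache.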
A minimal call with the Python SDK, caching the system prompt:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a code review assistant...",
            "cache_control": {"type": "ephemeral"},  # cache this block
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
```
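The same pattern extends to the multi-turn agent case: a breakpoint on the last tool definition caches the tool list, a breakpoint on the system block extends the cache through the system prompt, and the growing conversation history stays outside the cache. A sketch, assuming the API's support for cache_control on tool definitions (build_agent_request is a hypothetical helper name):

```python
def build_agent_request(system_text: str, tools: list[dict],
                        history: list[dict]) -> dict:
    """Assemble a request whose stable prefix (tools + system) is cached.

    Only the messages in `history` are re-processed at full price each turn.
    """
    tools = [dict(t) for t in tools]  # shallow-copy so the caller's list is untouched
    tools[-1] = {**tools[-1], "cache_control": {"type": "ephemeral"}}
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "tools": tools,
        "system": [
            {
                "type": "text",
                "text": system_text,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": history,
    }
```

Unpack the dict into client.messages.create(**build_agent_request(...)); on each turn only the newly appended messages incur full-price input tokens.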
The ephemeral cache has a time-to-live of roughly 5 minutes, refreshed each time the cached prefix is read. For chatbots with sporadic users, requests may arrive too far apart to capture the savings; caching shines in high-throughput, batch-style, or session-based applications.
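Whether a given traffic pattern captures the savings can be estimated with a little arithmetic. A sketch assuming the published ephemeral-cache multipliers (cache writes at about 1.25x the base input price, cache reads at about 0.10x) and that every request lands within the TTL:

```python
def cached_vs_uncached(prefix_tokens: int, n_requests: int,
                       write_mult: float = 1.25,
                       read_mult: float = 0.10) -> tuple[float, float]:
    """Compare the prefix's input-token cost, in base-token units, with and
    without caching.

    With caching, the first request pays the write premium and the remaining
    n_requests - 1 pay the discounted read rate.
    """
    uncached = float(prefix_tokens * n_requests)
    cached = prefix_tokens * (write_mult + read_mult * (n_requests - 1))
    return uncached, cached
```

For the 50,000-token document above, ten queries in one session cost 500,000 uncached prefix tokens versus about 107,500 token-equivalents cached: roughly a 78% saving overall, and 90% on every query after the first. Note that a single, never-repeated request actually costs more with caching, because of the write premium, which is why sporadic traffic does not benefit.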