Prompt Caching in Claude API 2026: Save Up to 90% on Tokens
Prompt caching turns expensive repeated prompts into a near-free reuse — 90% off after the first call.
What is Prompt Caching?
Prompt caching is a Claude API feature that lets you cache long, repeated parts of your prompts — system instructions, documents, examples — so future calls reuse the cached tokens at 90% off the normal input price. It is built for any workflow that reuses the same context across multiple calls.
The simplest way to think about it: imagine you have a 50-page legal document you want Claude to answer questions about. Without caching, every question sends those 50 pages again — paying full price each time. With caching, you send the 50 pages once and reuse them at 10% of the original cost for every later question.
How It Actually Works
Three steps happen behind the scenes:
- First call: Claude reads your prompt. You mark a section with
cache_control. Claude stores those tokens and charges a small write fee (25% more than normal input). - Subsequent calls within TTL: If your prompt's cached prefix matches exactly, Claude reads from cache at 10% of normal price.
- After TTL expires: The next call re-writes the cache (small write fee again) and the cycle continues.
Two TTLs available: 5 minutes (default ephemeral) and 1 hour (opt-in, slightly higher write cost but lasts 12x longer).
Real Pricing Comparison
Example: A chatbot that sends a 10,000-token system prompt with every user message. The user sends 100 messages an hour. Using Claude Sonnet 4.6 (input $3 per million tokens):
| Setup | Cost per Hour | Cost per Month (24/7) |
|---|---|---|
| No caching | $3.00 | $2,160 |
| 5-minute cache | $0.39 | $281 |
| 1-hour cache | $0.33 | $238 |
Savings: 87% to 89%. The math compounds fast for any production AI app.
Code Example: Python & TypeScript
Python (with the official Anthropic SDK):
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are an expert legal analyst. Document:\n\n" + long_document,
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": "Summarise the contract"}]
)
TypeScript:
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
system: [
{
type: "text",
text: "You are an expert legal analyst. Document:\n\n" + longDocument,
cache_control: { type: "ephemeral" }
}
],
messages: [{ role: "user", content: "Summarise the contract" }]
});
That single cache_control field unlocks the 90% discount on every subsequent call within the 5-minute window.
When to Use Caching (and When Not)
Use caching when:
- You have a long system prompt repeated across calls (chatbots, agents)
- You are doing Q&A over the same document multiple times (RAG)
- You have few-shot examples used in every call
- You are running batch jobs that share context
Skip caching when:
- Your prompt is short (under 1024 tokens — minimum cacheable size)
- The context changes every single call
- You only make one or two API calls total
A well-cached production app sees 60% to 85% lower Claude API bills every month.
5 Best Practices
- Cache the right block. Put cache_control on the longest stable part of your prompt — system instructions, large documents, few-shot examples.
- Watch for prompt variations. Even a single character change breaks the cache match. Keep your cached prefix byte-for-byte stable.
- Use 1-hour cache for batch jobs. If you run nightly batches, 1-hour cache saves more than 5-minute. The slightly higher write cost is worth it.
- Monitor cache hit rate. The API response includes
cache_read_input_tokens— if it stays high, your caching is working. - Combine with Batch API. Stack prompt caching with the Batch API discount for 95%+ total savings on async workloads.
For any production AI app sending 1,000+ calls a day, prompt caching pays for itself in the first hour. Add it before optimising anything else.
For more on optimising Claude API costs, see our guides on How to Use Claude with Less Tokens and Claude Haiku vs Sonnet vs Opus.
Need Help Optimising Your Claude API Costs?
At Mayank Digital Labs, we audit and optimise production AI applications for cost — applying prompt caching, batch API, model selection, and context engineering best practices to cut Claude bills by 50% to 90% without losing quality.
No commitment. A 30-minute call to map your top API cost opportunities.
References & Further Reading
Looking for the best information about prompt caching in Claude API? Quick takeaway: if you run any production Claude app, turn on prompt caching today. One cache_control field cuts your bill by 60% to 85% within the hour. There is no reason to skip this one.
Frequently Asked Questions
What is prompt caching in the Claude API?
Prompt caching is a Claude API feature that lets you cache long, repeated parts of your prompt — system instructions, documents, examples — so subsequent calls reuse the cached tokens at 90% off normal price. It is ideal for chatbots, RAG apps, and agents that keep sending the same context.
How much does prompt caching save?
Cache reads cost 10% of the normal input price — a 90% discount. Cache writes cost 25% more than normal input (only on the first call). After one or two calls, savings start. Real-world chatbots and RAG systems typically see 60% to 85% lower monthly bills.
How long does the cache last?
Claude offers a 5-minute ephemeral cache (default) and a 1-hour cache (opt-in, slightly higher write cost). After the TTL expires, the next call writes a fresh cache. Both options have the same 90% read discount.
How do I enable prompt caching?
Add a cache_control block to the message you want cached. Example: {"type":"text","text":"long system prompt...","cache_control":{"type":"ephemeral"}}. The first call pays a small write cost, all later calls within the TTL get the 90% read discount automatically.
When should I NOT use prompt caching?
Skip caching when: your prompt is short (under 1024 tokens), your context changes every call, you only call the API once or twice. Caching adds a small write cost — it pays off only when the same large context is reused multiple times within the 5-minute or 1-hour window.