This is the honest playbook. No "just use a smaller model" hand-waving. Every technique here has moved real numbers for real teams running production Claude workloads — we've run them ourselves and we've watched customers run them.
We'll cover, in rough order of impact:
- Model selection — the 10× leverage most teams leave on the floor
- Prompt caching — Anthropic's built-in 90% discount on repeated context
- Batch API — 50% off for non-realtime work
- Prompt compression — what actually works, what doesn't
- Agentic loop discipline — where agents silently bleed money
- Infrastructure layer — when the bill is still too big after all of the above
Read the whole thing if you're serious about the number. Skip around if you're optimizing one specific thing.
1. Model Selection Is The Lever
Most teams over-spec. They default to Opus or the top Sonnet because "it's the smart one" and never revisit the decision after the first prompt works.
As of April 2026, Anthropic's lineup roughly breaks down like this (per-million-token pricing, rounded):
- Claude Opus 4.x — premium reasoning. $15 in / $75 out per M tokens.
- Claude Sonnet 4.x — the workhorse. $3 in / $15 out per M tokens.
- Claude Haiku 4.x — fast + cheap. $0.80 in / $4 out per M tokens.
Opus is ~5× the cost of Sonnet. Sonnet is ~4× the cost of Haiku. That's a ~20× spread from top to bottom.
Practical heuristic:
- Use Haiku for: classification, extraction, routing, summary, quick formatting, simple code edits, boilerplate generation, high-volume background jobs.
- Use Sonnet for: most chat, RAG answering, non-trivial code generation, tool-using agents, structured data extraction with real logic.
- Use Opus for: multi-step reasoning where getting it wrong is expensive, hard math, deep planning, ambiguous requirements that need to be unwound.
Run the same prompt on all three. Evaluate the outputs side-by-side. You'll often find Sonnet matches Opus for your task, or Haiku matches Sonnet. Each step down is a multiple of savings.
Mistake to avoid: routing everything through the biggest model "just to be safe." That's how a $200/mo workload becomes $4,000/mo.
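The tier math is easy to sanity-check in code. A minimal sketch using the rounded prices from the table above; the 4,000-in / 500-out call size is made up for illustration:

```python
# Per-million-token prices from the table above (rounded, April 2026).
PRICES = {
    "opus":   {"in": 15.00, "out": 75.00},
    "sonnet": {"in": 3.00,  "out": 15.00},
    "haiku":  {"in": 0.80,  "out": 4.00},
}

def cost_usd(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of a single call at a given tier."""
    p = PRICES[tier]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

# A typical RAG call: 4,000 tokens in, 500 tokens out.
for tier in PRICES:
    print(f"{tier}: ${cost_usd(tier, 4_000, 500):.4f}")
# → opus: $0.0975, sonnet: $0.0195, haiku: $0.0052
```

Dropping that call from Opus to Sonnet is a 5× saving; Sonnet to Haiku is another ~3.75×. Multiply by your daily call volume and the routing decision writes its own business case.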
2. Prompt Caching: The 90% Discount Most Teams Ignore
Anthropic's prompt caching feature gives you up to a 90% discount on cached input tokens. If you send the same long system prompt, the same tool definitions, or the same large document on every call, you are paying for it over and over again when you shouldn't be.
What to cache:
- System prompts longer than ~1,000 tokens
- Tool/function definitions (the Claude API tool schemas)
- Large reference documents included in every call
- Few-shot examples at the top of a long prompt
- Your brand voice guide, style guide, or policy document
How it works:
You mark a segment of the prompt with cache_control: { type: "ephemeral" }. Anthropic's infrastructure stores it. The next call within the cache lifetime (5 minutes by default, 1 hour if you pay a small premium) reads the cached segment at a ~10% rate instead of the full input rate.
Concrete numbers. If your system prompt is 5,000 tokens and you call Claude 1,000 times in a session:
- Without caching: 5,000 × 1,000 = 5M tokens × $3/M = $15 just for the repeated system prompt
- With caching: the first call pays full price plus the 25% cache-write premium (~$0.02), and the next 999 calls read the cached prefix at ~10% (~$1.50) = roughly $1.52 total
That's a ~90% reduction on that portion of your traffic for zero product change.
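The arithmetic behind that example, as a quick sketch (Sonnet input pricing, 25% cache-write premium, ~10% cache-read rate):

```python
IN_PRICE = 3.00    # Sonnet input, $/M tokens
WRITE_MULT = 1.25  # cache writes cost a 25% premium over base input
READ_MULT = 0.10   # cache reads cost ~10% of base input

prompt_tokens, calls = 5_000, 1_000

uncached = prompt_tokens * calls * IN_PRICE / 1_000_000
cached = (prompt_tokens * IN_PRICE * WRITE_MULT                     # first call writes
          + prompt_tokens * (calls - 1) * IN_PRICE * READ_MULT) / 1_000_000

print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")  # $15.00 vs $1.52
```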
Gotcha: caching only works for exact prefix matches. If your system prompt ends with "Today is {date}", the date changes every call and invalidates the cache. Move dynamic content to the end. Keep the stable prefix stable.
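In practice the marking looks like this. A sketch of a Messages API request body using the documented cache_control block; the model id, prompt text, and date handling are illustrative, not prescriptive:

```python
from datetime import date

# Long, stable content: policy doc, tool schemas, few-shot examples.
STABLE_SYSTEM = "You are a support assistant for Acme. <long policy document...>"

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": [
            # Stable prefix, byte-identical on every call: cache this.
            {
                "type": "text",
                "text": STABLE_SYSTEM,
                "cache_control": {"type": "ephemeral"},
            },
            # Dynamic content goes AFTER the cached prefix, never inside it.
            {"type": "text", "text": f"Today is {date.today().isoformat()}."},
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

The cache key is the exact prefix up to the cache_control marker, so keeping the date below that marker keeps the prefix stable from call to call.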
3. Batch API: 50% Off For Work That Can Wait
Anthropic's Batch API gives you 50% off on input and output tokens for requests that don't need to finish immediately. Results come back within 24 hours (usually much faster).
Good fits:
- Nightly summarization of user activity
- Bulk classification of support tickets
- Large-document indexing
- Offline evaluation runs
- Generating embeddings-adjacent metadata at scale
- Anything you could describe with the phrase "run it overnight"
Bad fits:
- Anything a user is waiting on
- Agent loops where each step depends on the last
- Real-time chat
If 30% of your volume is batchable, running it through the Batch API alone cuts your total bill by 15%. That's real money for zero architectural effort.
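Here's what "move it to the Batch API" looks like in practice, sketched with the Message Batches request shape. The ticket data, labels, and model id are made up for illustration:

```python
tickets = [
    {"id": "T-1001", "text": "Can't log in after password reset"},
    {"id": "T-1002", "text": "Please cancel my subscription"},
]

# One batch entry per ticket; custom_id lets you match results back later.
batch_requests = [
    {
        "custom_id": t["id"],
        "params": {
            "model": "claude-haiku-4-5",  # cheapest tier: this is classification
            "max_tokens": 10,
            "messages": [{
                "role": "user",
                "content": f"Classify as billing, auth, or other:\n{t['text']}",
            }],
        },
    }
    for t in tickets
]

# Submit with client.messages.batches.create(requests=batch_requests),
# then poll until the batch completes and read results by custom_id.
```

Note the stacking: batched classification on Haiku combines the 50% batch discount with the cheapest tier, so the per-ticket cost is a rounding error.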
4. Prompt Compression: What Works, What Doesn't
The internet is full of bad advice here. Here's what actually moves the number:
What works
- Cut the system prompt ruthlessly. Most system prompts have 2–3× more text than they need. Every sentence costs tokens on every call (unless you're caching — see #2). Remove redundant instructions, ceremonial politeness, and examples of things the model already does well.
- Prefer structured output (JSON schema, tool use) over free-form prose. The model generates fewer output tokens because the format constrains it. Output tokens are 5× more expensive than input tokens at every tier in the lineup above.
- Truncate RAG context to what's actually needed. If your retrieval is returning 10 chunks and the model only uses 2, you're paying for the other 8 on every call.
- Stop re-sending the conversation history for stateless tasks. If a task doesn't need prior turns, don't send them.
What doesn't work (or barely does)
- "Compressed" prompt hacks (emoji-as-shorthand, abbreviation schemes). Token savings are marginal, and model quality usually drops more than the savings justify.
- Asking the model to "be concise." Small effect. A schema or explicit word limit does more.
- Running prompts through another LLM to shrink them. You're now paying two bills.
Output length discipline
Output tokens are the expensive ones. A helpful mental model: writing is pricier than reading.
- If you want a decision, ask for a single word.
- If you want structured data, ask for JSON with only the fields you need.
- If you want a summary, specify a length.
- Stop streaming as soon as you have what you need — don't let the model finish a monologue.
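The output-side arithmetic, sketched with Sonnet's output price from section 1; the response lengths and call volume are illustrative:

```python
OUT_PRICE = 15.00  # Sonnet output, $/M tokens (5x the input price)

def output_cost(tokens_per_call: int, calls: int = 100_000) -> float:
    """Monthly output spend for a given per-call response length."""
    return tokens_per_call * calls * OUT_PRICE / 1_000_000

prose = output_cost(400)  # free-form paragraph answer
tight = output_cost(40)   # same decision as a minimal JSON object

print(f"${prose:.0f}/mo vs ${tight:.0f}/mo")  # $600 vs $60 at 100k calls/month
```

A 10× cut in response length is a 10× cut in the most expensive line of the bill, and a schema or hard word limit is what actually enforces it.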
5. Agentic Loops: Where Money Silently Disappears
If you're running any kind of agent — Claude Code, a ReAct loop, a tool-using assistant — this is probably where most of your bill lives.
Agents iterate. Each iteration sends the full conversation history back to the model, which reads it, produces a tool call, waits for the result, and iterates again. A single "simple" agent task can consume 20–200 round-trips, each one paying for the growing transcript.
The compounding math.
Say your agent's accumulated context grows to 20,000 tokens by iteration 10, 40,000 by iteration 20, 80,000 by iteration 40. If each of those iterations on Sonnet reads the full context:
- Iteration 10: 20k × $3/M = $0.06
- Iteration 20: 40k × $3/M = $0.12
- Iteration 40: 80k × $3/M = $0.24
That's one agent run, and those are single iterations, not the total. Sum the transcript re-reads across a whole 40-iteration run and each run costs several dollars; at 500 runs a day, context regrowth alone can run into thousands of dollars a day for a single user's workflow.
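The per-iteration numbers above, plus the whole-run total, as a sketch. The assumption that context grows roughly 2,000 tokens per iteration is inferred from the example's figures:

```python
IN_PRICE = 3.00  # Sonnet input, $/M tokens

def iteration_cost(context_tokens: int) -> float:
    """Cost of re-reading the accumulated transcript on one iteration."""
    return context_tokens * IN_PRICE / 1_000_000

print(iteration_cost(20_000), iteration_cost(40_000), iteration_cost(80_000))
# → 0.06 0.12 0.24

# Whole run: 40 iterations, context growing ~2,000 tokens each step.
run_total = sum(iteration_cost(2_000 * i) for i in range(1, 41))
print(f"${run_total:.2f} per run")  # ≈ $4.92; the late iterations dominate
```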
What to do:
- Cache aggressively. The static portion of each iteration (system prompt, tool schemas, early conversation) should be cached. That portion then costs ~10% of what it would otherwise on every iteration.
- Trim old tool outputs. If an earlier tool call returned a 2,000-token file dump and the model has already used the relevant part, summarize or drop it from the context on the next iteration.
- Cap iteration depth. Set a hard maximum on loops. If the agent isn't done in 30 iterations, it's probably looping on itself.
- Pick the cheapest model that can do the step. Routing decisions, summaries, and formatting can go to Haiku. Save Sonnet/Opus for the actual reasoning step.
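The "trim old tool outputs" idea can be sketched as a pre-call pass over the transcript. This uses a simplified message shape (a plain role/content list, not Anthropic's exact tool_result block format) and a crude chars/4 token estimate; both are assumptions for illustration:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

def trim_tool_outputs(messages: list[dict], keep_last: int = 2,
                      max_tokens: int = 500) -> list[dict]:
    """Replace large, stale tool outputs with a stub, keeping the
    most recent keep_last tool results intact."""
    tool_idxs = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    stale = set(tool_idxs[:-keep_last]) if keep_last else set(tool_idxs)
    out = []
    for i, m in enumerate(messages):
        if i in stale and approx_tokens(m["content"]) > max_tokens:
            m = {**m, "content": "[tool output elided after use]"}
        out.append(m)
    return out
```

Run something like this before each iteration's API call: the model keeps the recent results it still needs, and the transcript stops compounding.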
Real teams running Claude Code regularly see a 4–8× reduction just from applying these four techniques together.
6. The Infrastructure Layer
You've done everything above. You've tuned models, you're caching, you're running batch jobs overnight, your agent loops are disciplined. The bill is still bigger than you want.
At this point, the remaining savings live in the infrastructure layer — the proxy between your application and Anthropic's API. This is where services like aiusage come in.
The short version: your existing Claude key still points at your existing Anthropic account, but the traffic flows through a proxy that uses proprietary infrastructure to deliver the same output for roughly 1/20 of the per-call Anthropic cost. You change one environment variable. Your code, your prompts, your workflows don't change.
When it makes sense:
- You're already on Claude and not switching.
- You've applied 1–5 above and you're still bigger than you want to be.
- You don't want to maintain infrastructure yourself.
- You like BYOK pricing (you keep your Anthropic account and key; you just pay us a flat credit-pack fee for routing).
When it doesn't:
- You're multi-model shopping. Use a router (OpenRouter) instead.
- You haven't done steps 1–5 yet. Do those first; they're free and they compound with any infrastructure layer.
We built aiusage.ai for exactly this use case. Credit packs start at $10/15 runs, no subscription, credits never expire, and the one-line installer drops into Claude Code or the Anthropic SDK without code changes. Nick — our first paying customer — was running Claude Code 6–8 hours a day and watched his month's Anthropic credits last roughly 5× longer after the install.
A Suggested Order of Operations
If you want a concrete playbook to work through this week:
- Day 1: Audit your model usage. What percentage of calls go to Opus vs. Sonnet vs. Haiku? Re-route the easy stuff down a tier. Ship it.
- Day 2: Turn on prompt caching for your system prompt and tool definitions. Verify you're seeing the cache-hit metric in the API response.
- Day 3: Identify which workloads are batchable. Move them to the Batch API.
- Day 4: Cut your system prompts. Truncate RAG context. Shorten outputs.
- Day 5: Audit your agent loops. Cap iterations. Cache the stable prefix. Trim old tool outputs.
- Week 2: If the bill is still bigger than you want, evaluate an infrastructure layer like aiusage.
A disciplined team can cut their Claude bill by 60–90% in a week by working through this list. We've watched it happen.
Final Thought: Cost Is A Product Surface
One cultural note that matters more than any specific tactic: treat your Anthropic bill like a product metric. Put it on a dashboard. Review it weekly. Assign it an owner. Set a target cost per user, cost per task, cost per feature.
Teams that do this stay cheap forever. Teams that don't ship fast and then discover a $20,000/mo surprise in November.
If you want the fast path — same Claude, 20× cheaper, no code change — start with a $10 pack on aiusage. Otherwise, work the list above. Either way, your invoice next month is going to be smaller.
Drop your Claude bill 20×.
Paste your key at aiusage.ai — takes 60 seconds. BYOK, credit packs from $10, credits never expire.
Get started →
Written by the team at aiusage.ai. We run a BYOK Claude proxy that makes your existing Claude account ~20× cheaper. If that's interesting, read more or grab a $10 pack to try it.