This is the honest playbook. No "just use a smaller model" hand-waving. Every technique here has moved real numbers for real teams running production Claude workloads — we've run them ourselves and we've watched customers run them.

We'll cover, in rough order of impact:

  1. Model selection — the 10× leverage most teams leave on the floor
  2. Prompt caching — Anthropic's built-in 90% discount on repeated context
  3. Batch API — 50% off for non-realtime work
  4. Prompt compression — what actually works, what doesn't
  5. Agentic loop discipline — where agents silently bleed money
  6. Infrastructure layer — when the bill is still too big after all of the above

Read the whole thing if you're serious about the number. Skip around if you're optimizing one specific thing.


1. Model Selection Is The Lever

Most teams over-spec. They default to Opus or the top Sonnet because "it's the smart one" and never revisit the decision after the first prompt works.

As of April 2026, the relative pricing in Anthropic's lineup roughly breaks down like this: Opus is ~5× the per-token cost of Sonnet, and Sonnet is ~4× the cost of Haiku. That's a ~20× spread from top to bottom.

Practical heuristic:

Run the same prompt on all three. Evaluate the outputs side-by-side. You'll often find Sonnet matches Opus for your task, or Haiku matches Sonnet. Each step down is a multiple of savings.

Mistake to avoid: routing everything through the biggest model "just to be safe." That's how a $200/mo workload becomes $4,000/mo.
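
A sketch of what tiered routing can look like. The prices here illustrate the ~20× spread (check Anthropic's pricing page for current numbers), and the task categories are an invented example taxonomy, not anything official:

```python
# Illustrative per-million-token input prices showing the ~20x spread.
PRICES = {"opus": 15.00, "sonnet": 3.00, "haiku": 0.80}

def pick_model(task_kind: str) -> str:
    """Route each call to the cheapest tier that handles the task.
    The categories are placeholders; replace with your own eval results."""
    if task_kind in ("classify", "extract", "format"):
        return "haiku"
    if task_kind in ("summarize", "code", "answer"):
        return "sonnet"
    return "opus"  # reserve the top tier for genuinely hard reasoning

def est_cost(task_kind: str, input_tokens: int) -> float:
    """Estimated input cost in dollars for one routed call."""
    return PRICES[pick_model(task_kind)] * input_tokens / 1_000_000
```

The point of writing it as a function: routing becomes a reviewable, testable decision instead of a hardcoded model string scattered across the codebase.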


2. Prompt Caching: The 90% Discount Most Teams Ignore

Anthropic's prompt caching feature gives you up to a 90% discount on cached input tokens. If you send the same long system prompt, the same tool definitions, or the same large document on every call, you are paying for it over and over again when you shouldn't be.

What to cache: the long, stable content you resend on every call (the system prompt, tool definitions, large reference documents, few-shot examples).

How it works:

You mark a segment of the prompt with cache_control: { type: "ephemeral" }. Anthropic's infrastructure stores it. The next call within the cache lifetime (5 minutes by default, 1 hour if you pay a small premium) reads the cached segment at a ~10% rate instead of the full input rate.

Concrete numbers. If your system prompt is 5,000 tokens and you call Claude 1,000 times in a session:

  1. Without caching, you pay for 1,000 × 5,000 = 5,000,000 input tokens at the full rate.
  2. With caching, you pay one cache write at a 25% premium (6,250 token-equivalents), then 999 reads at ~10% of the input rate (~499,500 token-equivalents): roughly 506,000 in total.

That's roughly a 90% reduction on that portion of your traffic (closer to 88% in practice, once occasional cache expiries force re-writes) for zero product change.
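
The same arithmetic as a reusable helper. The 1.25× write premium and 10% read rate are the standard published multipliers for the default cache; treat them as assumptions to re-check against current docs:

```python
def session_cost_units(prompt_tokens: int, calls: int,
                       write_premium: float = 1.25,
                       read_rate: float = 0.10) -> tuple[float, float]:
    """Compare cached vs uncached spend for a repeated prompt prefix,
    measured in full-price token units. Assumes one cache write up
    front and a cache hit on every subsequent call."""
    uncached = float(prompt_tokens * calls)
    cached = prompt_tokens * write_premium + prompt_tokens * read_rate * (calls - 1)
    return uncached, cached

uncached, cached = session_cost_units(5_000, 1_000)
savings = 1 - cached / uncached  # roughly 0.90 for this scenario
```

Run it with your own prompt size and call volume before deciding whether the cache-write premium is worth it for short sessions.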

Gotcha: caching only works for exact prefix matches. If your system prompt ends with "Today is {date}", the date changes every call and invalidates the cache. Move dynamic content to the end. Keep the stable prefix stable.
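
In request-body terms, the fix might look like this. The model id is a placeholder and the shapes are a sketch of the Messages API; the point is where cache_control sits relative to the dynamic date:

```python
from datetime import date

def build_request(stable_system: str, tools: list, user_msg: str) -> dict:
    """Messages API request sketch with prompt caching. The stable prefix
    (tools + system prompt) carries cache_control; anything dynamic, like
    today's date, goes after the breakpoint so it can't invalidate it."""
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model id
        "max_tokens": 1024,
        "tools": tools,
        "system": [
            {"type": "text", "text": stable_system,
             "cache_control": {"type": "ephemeral"}},  # cache ends here
        ],
        "messages": [
            {"role": "user",
             "content": f"Today is {date.today().isoformat()}.\n\n{user_msg}"},
        ],
    }
```

Everything before the cache_control marker must be byte-identical across calls for the prefix match to hit.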


3. Batch API: 50% Off For Work That Can Wait

Anthropic's Batch API gives you 50% off on input and output tokens for requests that don't need to finish immediately. Results come back within 24 hours (usually much faster).

Good fits: evals and regression suites, bulk classification or summarization, data enrichment, nightly report generation. Anything queued or scheduled.

Bad fits: chat, interactive agents, anything a user is actively waiting on.

If 30% of your volume is batchable, running it through the Batch API alone cuts your total bill by 15% (30% × 50%). That's real money for minimal architectural effort.
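
A minimal sketch of shaping work for the Batch API, assuming the documented request-list format where each item carries a unique custom_id and its own params:

```python
def build_batch(prompts: list[str], model: str) -> list[dict]:
    """Shape of the request list for Anthropic's Message Batches endpoint
    (submitted via client.messages.batches.create(requests=...)). Each
    item needs a unique custom_id so results can be matched back to
    inputs, since they may come back in any order."""
    return [
        {
            "custom_id": f"task-{i}",
            "params": {
                "model": model,
                "max_tokens": 512,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]
```

The async shape is the whole cost: you poll for completion instead of blocking on a response, which is why interactive workloads are a bad fit.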


4. Prompt Compression: What Works, What Doesn't

The internet is full of bad advice here. Here's what actually moves the number:

What works

Cutting bloated system prompts, truncating RAG context to the passages that actually matter, deduplicating few-shot examples, and capping output length. These attack the token count directly.

What doesn't work (or barely does)

Stripping whitespace and punctuation, abbreviating words, and asking the model to read "compressed" shorthand. Token counts barely move, and quality usually drops faster than cost does.

Output length discipline

Output tokens are the expensive ones: on Claude models they're priced at roughly 5× the input rate. A helpful mental model: writing is pricier than reading. Set max_tokens, ask for terse formats, and tell the model not to restate the question.
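
A worked example of the asymmetry, using illustrative Sonnet-class rates (re-check current pricing before relying on the exact figures):

```python
# Illustrative $/M token rates for a Sonnet-class model.
IN_RATE, OUT_RATE = 3.00, 15.00

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at the rates above."""
    return (input_tokens * IN_RATE + output_tokens * OUT_RATE) / 1_000_000

# A verbose 2,000-token answer to a 1,000-token prompt costs as much as
# reading an extra 10,000 input tokens. Trimming output pays off fast.
verbose = call_cost(1_000, 2_000)  # $0.003 in + $0.030 out = $0.033
terse = call_cost(1_000, 300)      # $0.003 in + $0.0045 out = $0.0075
```

Cutting that one answer from 2,000 tokens to 300 saves more than any amount of whitespace-stripping on the prompt side would.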


5. Agentic Loops: Where Money Silently Disappears

If you're running any kind of agent — Claude Code, a ReAct loop, a tool-using assistant — this is probably where most of your bill lives.

Agents iterate. Each iteration sends the full conversation history back to the model, which reads it, produces a tool call, waits for the result, and iterates again. A single "simple" agent task can consume 20–200 round-trips, each one paying for the growing transcript.

The compounding math.

Say your agent's accumulated context grows to 20,000 tokens by iteration 10, 40,000 by iteration 20, and 80,000 by iteration 40. If each of those iterations on Sonnet reads the full context, the run reads on the order of 1.5–2 million input tokens in total, which works out to a few dollars per run at Sonnet's input rate, before you count output tokens at all.

That's one agent run. At that rate, a workflow that fires a few dozen times a day is easily burning $60–$120 a day on context regrowth alone, for a single user.
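
You can estimate this for your own agent in a few lines. The linear growth rate, iteration count, and rate below are assumptions to swap for your real telemetry:

```python
def run_input_tokens(iterations: int, growth_per_iter: int) -> int:
    """Total input tokens one agent run reads when context grows linearly
    and every iteration re-reads the full transcript so far."""
    return sum(growth_per_iter * i for i in range(1, iterations + 1))

SONNET_IN = 3.00  # illustrative $/M input tokens

tokens = run_input_tokens(40, 2_000)   # 40 iterations, +2k tokens each
cost = tokens * SONNET_IN / 1_000_000  # input-only cost for one run
```

Note the quadratic shape: doubling the iteration count roughly quadruples the total tokens read, which is why capping depth matters so much.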

What to do:

  1. Cache aggressively. The static portion of each iteration (system prompt, tool schemas, early conversation) should be cached. Every iteration pays 10% of what it would otherwise.
  2. Trim old tool outputs. If an earlier tool call returned a 2,000-token file dump and the model has already used the relevant part, summarize or drop it from the context on the next iteration.
  3. Cap iteration depth. Set a hard maximum on loops. If the agent isn't done in 30 iterations, it's probably looping on itself.
  4. Pick the cheapest model that can do the step. Routing decisions, summaries, and formatting can go to Haiku. Save Sonnet/Opus for the actual reasoning step.
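
Techniques 2 and 3 might look like this in practice. The message shape is a simplified sketch, not the exact API schema, and a real implementation might summarize old tool output instead of stubbing it:

```python
MAX_ITERATIONS = 30   # hard stop so a stuck agent can't loop forever
TOOL_OUTPUT_KEEP = 3  # keep only the last N tool results verbatim

def trim_history(messages: list[dict]) -> list[dict]:
    """Replace all but the most recent tool results with a stub so the
    transcript stops growing without bound across iterations."""
    tool_idxs = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    stale = set(tool_idxs[:-TOOL_OUTPUT_KEEP])
    return [
        {**m, "content": "[tool output elided]"} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

The loop driver then checks the iteration counter against MAX_ITERATIONS and calls trim_history before every model call, so each request carries the stub instead of the full dump.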

Real teams running Claude Code regularly see a 4–8× reduction just from applying these four techniques together.


6. The Infrastructure Layer

You've done everything above. You've tuned models, you're caching, you're running batch jobs overnight, your agent loops are disciplined. The bill is still bigger than you want.

At this point, the remaining savings live in the infrastructure layer — the proxy between your application and Anthropic's API. This is where services like aiusage come in.

The short version: your existing Claude key still points at your existing Anthropic account, but the traffic flows through a proxy that uses proprietary infrastructure to deliver the same output for roughly 1/20 of the per-call Anthropic cost. You change one environment variable. Your code, your prompts, your workflows don't change.
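
With the official SDK and Claude Code, a base-URL switch typically looks like the snippet below. The proxy URL is a placeholder (use whatever endpoint your provider gives you), and you should verify the env-var name against your SDK version's docs:

```shell
# Point the Anthropic SDK / Claude Code at a proxy endpoint instead of
# api.anthropic.com. The URL below is a placeholder, not a real endpoint.
export ANTHROPIC_BASE_URL="https://proxy.example.com"
# Your existing key stays exactly where it was.
export ANTHROPIC_API_KEY="sk-ant-..."
```
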

When it makes sense:

When it doesn't:

We built aiusage.ai for exactly this use case. Credit packs start at $10/15 runs, no subscription, credits never expire, and the one-line installer drops into Claude Code or the Anthropic SDK without code changes. Nick — our first paying customer — was running Claude Code 6–8 hours a day and watched his month's Anthropic credits last roughly 5× longer after the install.


A Suggested Order of Operations

If you want a concrete playbook to work through this week:

  1. Day 1: Audit your model usage. What percentage of calls go to Opus vs. Sonnet vs. Haiku? Re-route the easy stuff down a tier. Ship it.
  2. Day 2: Turn on prompt caching for your system prompt and tool definitions. Verify you're seeing the cache-hit metric in the API response.
  3. Day 3: Identify which workloads are batchable. Move them to the Batch API.
  4. Day 4: Cut your system prompts. Truncate RAG context. Shorten outputs.
  5. Day 5: Audit your agent loops. Cap iterations. Cache the stable prefix. Trim old tool outputs.
  6. Week 2: If the bill is still bigger than you want, evaluate an infrastructure layer like aiusage.

A disciplined team can cut their Claude bill by 60–90% in a week by working through this list. We've watched it happen.


Final Thought: Cost Is A Product Surface

One cultural note that matters more than any specific tactic: treat your Anthropic bill like a product metric. Put it on a dashboard. Review it weekly. Assign it an owner. Set a target cost per user, cost per task, cost per feature.
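
One way to make that metric concrete (the target value is purely illustrative):

```python
def cost_per_task(daily_spend_usd: float, tasks_completed: int) -> float:
    """The product metric to put on the dashboard: dollars per completed
    task. Guards against divide-by-zero on idle days."""
    return daily_spend_usd / max(tasks_completed, 1)

TARGET_PER_TASK = 0.05  # illustrative target: 5 cents per task

def over_budget(daily_spend_usd: float, tasks_completed: int) -> bool:
    """Alert on drift in unit cost rather than on raw spend, so growth
    in usage doesn't look like a cost regression."""
    return cost_per_task(daily_spend_usd, tasks_completed) > TARGET_PER_TASK
```

Tracking the ratio instead of the raw bill is what keeps "spend doubled" and "we shipped a feature that doubled usage" distinguishable.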

Teams that do this stay cheap forever. Teams that skip it ship fast and then discover a $20,000/mo surprise in November.

If you want the fast path — same Claude, 20× cheaper, no code change — start with a $10 pack on aiusage. Otherwise, work the list above. Either way, your invoice next month is going to be smaller.

Drop your Claude bill 20×.

Paste your key at aiusage.ai — takes 60 seconds. BYOK, credit packs from $10, credits never expire.

Get started →

Written by the team at aiusage.ai. We run a BYOK Claude proxy that makes your existing Claude account ~20× cheaper. If that's interesting, read more or grab a $10 pack to try it.