How to Reduce Your Claude API Bill (7 Tactics, Real Numbers)

June 20, 2026 · Aiapiflow

The fastest way to reduce your Claude API bill: stop using Opus for everything. Route simple tasks to Haiku (12x cheaper than Opus), enable prompt caching (90% off cached tokens), and set max_tokens on every request. Combined, these three changes cut most bills by 50-70% in an afternoon. If you want the biggest single lever — a discounted API gateway saves 80-85% instantly with one URL change, no code refactoring needed.

Here are the 7 tactics, ordered by ROI.

Why Claude bills spike

Three cost components per request:

Input tokens — your prompt + system prompt + conversation history + tool definitions. Sonnet 4.6: $3/M uncached, $0.30/M cached.
Output tokens — what the model writes back. Sonnet 4.6: $15/M. No caching discount. This is where most bills blow up.
Model tier — Opus costs 5x more than Sonnet, 60x more than Haiku. Using Opus for classification is the biggest single waste.

Typical heavy user before optimization: 200K input + 15K output tokens per session, 20 sessions/day = ~$500/month. After the tactics below: same work, $80-130/month.

Tactic 1 — Model routing (save 50-80%)

Most developers default Opus everywhere. The correct split:

Task	Model	Why
Complex reasoning, architecture, planning	Opus 4.8	Worth the cost here
Code writing, refactoring, analysis	Sonnet 4.6	5x cheaper, near-equal on code
Classification, extraction, file reads	Haiku 4.5	60x cheaper than Opus

Routing 60% of requests to Haiku and 35% to Sonnet cuts a $500/month bill to roughly $90-120.

function selectModel(taskType) {
  const simple = ['classify', 'extract', 'summarize', 'read']
  const medium = ['code', 'analyze', 'write']
  if (simple.includes(taskType)) return 'claude-haiku-4-5'   // $0.80/M input
  if (medium.includes(taskType)) return 'claude-sonnet-4-6'  // $3/M input
  return 'claude-opus-4-8'                                   // $15/M — reserve for real reasoning
}

Tactic 2 — Prompt caching (save 85-90% on input)

Anthropic’s prompt caching gives 90% off cached input tokens for 5 minutes. Any stable prefix over 1,024 tokens is a candidate: system prompts, tool definitions, long documents.

const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 500,
  system: [
    {
      type: 'text',
      text: yourLargeSystemPrompt,
      cache_control: { type: 'ephemeral' }
    }
  ],
  messages: [{ role: 'user', content: userMessage }]
})
// First call: full price. Subsequent calls: 90% off the system prompt.

Rule: keep the cached prefix stable within sessions. Put volatile data in user messages, not the system prompt.

Tactic 3 — Set max_tokens on every request

Output tokens cost 5x more than input. An unconstrained response producing 2,000 tokens when you needed 200 costs 10x more.

# Bad — unbounded
client.messages.create(model=model, messages=messages)

# Good — explicit ceiling
client.messages.create(model=model, max_tokens=300, messages=messages)

For classification: max_tokens=50. For code generation: max_tokens=800. For long-form: measure your p95 output length and add 20% buffer.

Tactic 4 — Compress system prompts

Every token in your system prompt gets billed on every request. A 1,500-line CLAUDE.md adds ~6K tokens per turn.

Target: 200-400 tokens. Cut: polite framing, generic best practices the model already knows, outdated decisions. Keep: stack with versions, active conventions (max 10), “never do” list (max 5).

Tactic 5 — Context pruning and /clear discipline

Sessions accumulate stale tool results. After 90 minutes, 60% of context might be irrelevant file reads. Two fixes:

/clear between unrelated tasks. Free. Instant.
Truncate conversation history. Keep last 8-10 turns. Summarize older context instead of sending the full history.

Tactic 6 — Structured output over prose

Tool use and JSON output produce 50-80% fewer output tokens than natural language for extraction tasks. A 200-token prose answer often compresses to 40 tokens as a JSON object.

# Instead of: "The sentiment is positive and the user seems satisfied..."
# Force structured output:
prompt = "Classify sentiment. Return JSON only: sentiment: positive|negative|neutral"
# Response: {"sentiment": "positive"} — 8 tokens vs 200+

Tactic 7 — Use a discounted API gateway (save 80-85%, zero code changes)

All 6 tactics above require code changes. There is a faster lever: API gateways that provide the same Claude models at significantly lower prices. The only change is your base URL.

Aiapiflow provides access to Claude Opus 4.8, Sonnet 4.6, and Haiku 4.5 at up to 85% less than direct Anthropic pricing. Setup:

export ANTHROPIC_BASE_URL=https://aiapiflow.com
export ANTHROPIC_API_KEY=your-aiapiflow-key

Works with Claude Code, Cursor, Cline, any Anthropic SDK. Credits never expire, no subscription.

The math: a $500/month Anthropic bill becomes $75-100/month through a discounted gateway, before applying any of tactics 1-6. Combined, heavy users report reductions to $30-50/month.

What does not save money

Routing everything to Haiku. Quality drops on real code tasks. More turns fixing mistakes costs more than the savings.
Extremely short prompts. Under-specifying forces more back-and-forth. Total cost goes up.
Local open-source models. Llama 70B is 3-4x slower on tool use than Sonnet. Productivity loss exceeds API savings.

Before and after

Metric	Unoptimized	After tactics 1-3	After all 7
Avg input tokens/session	200K	80K (50K cached)	40K
Avg output tokens/session	15K	7K	5K
Monthly cost (20 sessions/day)	~$490	~$95	~$35

Frequently asked questions

How much does Claude API cost per month for a heavy user?

Heavy Claude API use (20+ hours/week, full agent workflows) typically costs $250-500/month at official Anthropic pricing. A discounted gateway like Aiapiflow reduces these by 80-85% immediately.

What is the cheapest Claude model?

Claude Haiku 4.5 is the cheapest at $0.80/M input and $4/M output tokens. It handles classification, extraction, and file reads well. For code generation, Sonnet 4.6 ($3/$15 per M) gives better quality-to-cost. Use Opus ($15/$75) only for complex multi-step reasoning.

Does prompt caching work with Claude Code?

Yes. Claude Code uses prompt caching automatically on stable prefixes including CLAUDE.md and tool definitions. Keep CLAUDE.md stable within a session — editing it mid-task resets the cache.

Can I use a cheaper API without changing my code?

Yes. Set ANTHROPIC_BASE_URL to a discounted gateway endpoint and update your API key. The Anthropic SDK, Claude Code, Cursor, and Cline all respect these environment variables. Aiapiflow offers Claude Opus 4.8 at up to 85% less with no subscription.

Cut your Claude API bill instantly

One URL change. Claude Opus 4.8, Sonnet 4.6, Haiku at up to 85% less than direct pricing. Credits never expire, no subscription.

Create free account → Free account · Pay only when you top up · 5-min setup