How to Reduce Your Claude API Bill (7 Tactics, Real Numbers)

June 20, 2026 · Aiapiflow

The fastest way to reduce your Claude API bill: stop using Opus for everything. Route simple tasks to Haiku (12x cheaper than Opus), enable prompt caching (90% off cached tokens), and set max_tokens on every request. Combined, these three changes cut most bills by 50-70% in an afternoon. If you want the biggest single lever — a discounted API gateway saves 80-85% instantly with one URL change, no code refactoring needed.

Here are the 7 tactics, ordered by ROI.

Why Claude bills spike

Three cost components per request:

Typical heavy user before optimization: 200K input + 15K output tokens per session, 20 sessions/day = ~$500/month. After the tactics below: same work, $80-130/month.

Tactic 1 — Model routing (save 50-80%)

Most developers default Opus everywhere. The correct split:

TaskModelWhy
Complex reasoning, architecture, planningOpus 4.8Worth the cost here
Code writing, refactoring, analysisSonnet 4.65x cheaper, near-equal on code
Classification, extraction, file readsHaiku 4.560x cheaper than Opus

Routing 60% of requests to Haiku and 35% to Sonnet cuts a $500/month bill to roughly $90-120.

function selectModel(taskType) {
  const simple = ['classify', 'extract', 'summarize', 'read']
  const medium = ['code', 'analyze', 'write']
  if (simple.includes(taskType)) return 'claude-haiku-4-5'   // $0.80/M input
  if (medium.includes(taskType)) return 'claude-sonnet-4-6'  // $3/M input
  return 'claude-opus-4-8'                                   // $15/M — reserve for real reasoning
}

Tactic 2 — Prompt caching (save 85-90% on input)

Anthropic’s prompt caching gives 90% off cached input tokens for 5 minutes. Any stable prefix over 1,024 tokens is a candidate: system prompts, tool definitions, long documents.

const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 500,
  system: [
    {
      type: 'text',
      text: yourLargeSystemPrompt,
      cache_control: { type: 'ephemeral' }
    }
  ],
  messages: [{ role: 'user', content: userMessage }]
})
// First call: full price. Subsequent calls: 90% off the system prompt.

Rule: keep the cached prefix stable within sessions. Put volatile data in user messages, not the system prompt.

Tactic 3 — Set max_tokens on every request

Output tokens cost 5x more than input. An unconstrained response producing 2,000 tokens when you needed 200 costs 10x more.

# Bad — unbounded
client.messages.create(model=model, messages=messages)

# Good — explicit ceiling
client.messages.create(model=model, max_tokens=300, messages=messages)

For classification: max_tokens=50. For code generation: max_tokens=800. For long-form: measure your p95 output length and add 20% buffer.

Tactic 4 — Compress system prompts

Every token in your system prompt gets billed on every request. A 1,500-line CLAUDE.md adds ~6K tokens per turn.

Target: 200-400 tokens. Cut: polite framing, generic best practices the model already knows, outdated decisions. Keep: stack with versions, active conventions (max 10), “never do” list (max 5).

Tactic 5 — Context pruning and /clear discipline

Sessions accumulate stale tool results. After 90 minutes, 60% of context might be irrelevant file reads. Two fixes:

Tactic 6 — Structured output over prose

Tool use and JSON output produce 50-80% fewer output tokens than natural language for extraction tasks. A 200-token prose answer often compresses to 40 tokens as a JSON object.

# Instead of: "The sentiment is positive and the user seems satisfied..."
# Force structured output:
prompt = "Classify sentiment. Return JSON only: sentiment: positive|negative|neutral"
# Response: {"sentiment": "positive"} — 8 tokens vs 200+

Tactic 7 — Use a discounted API gateway (save 80-85%, zero code changes)

All 6 tactics above require code changes. There is a faster lever: API gateways that provide the same Claude models at significantly lower prices. The only change is your base URL.

Aiapiflow provides access to Claude Opus 4.8, Sonnet 4.6, and Haiku 4.5 at up to 85% less than direct Anthropic pricing. Setup:

export ANTHROPIC_BASE_URL=https://aiapiflow.com
export ANTHROPIC_API_KEY=your-aiapiflow-key

Works with Claude Code, Cursor, Cline, any Anthropic SDK. Credits never expire, no subscription.

The math: a $500/month Anthropic bill becomes $75-100/month through a discounted gateway, before applying any of tactics 1-6. Combined, heavy users report reductions to $30-50/month.

What does not save money

Before and after

MetricUnoptimizedAfter tactics 1-3After all 7
Avg input tokens/session200K80K (50K cached)40K
Avg output tokens/session15K7K5K
Monthly cost (20 sessions/day)~$490~$95~$35

Frequently asked questions

How much does Claude API cost per month for a heavy user?

Heavy Claude API use (20+ hours/week, full agent workflows) typically costs $250-500/month at official Anthropic pricing. A discounted gateway like Aiapiflow reduces these by 80-85% immediately.

What is the cheapest Claude model?

Claude Haiku 4.5 is the cheapest at $0.80/M input and $4/M output tokens. It handles classification, extraction, and file reads well. For code generation, Sonnet 4.6 ($3/$15 per M) gives better quality-to-cost. Use Opus ($15/$75) only for complex multi-step reasoning.

Does prompt caching work with Claude Code?

Yes. Claude Code uses prompt caching automatically on stable prefixes including CLAUDE.md and tool definitions. Keep CLAUDE.md stable within a session — editing it mid-task resets the cache.

Can I use a cheaper API without changing my code?

Yes. Set ANTHROPIC_BASE_URL to a discounted gateway endpoint and update your API key. The Anthropic SDK, Claude Code, Cursor, and Cline all respect these environment variables. Aiapiflow offers Claude Opus 4.8 at up to 85% less with no subscription.

Cut your Claude API bill instantly

One URL change. Claude Opus 4.8, Sonnet 4.6, Haiku at up to 85% less than direct pricing. Credits never expire, no subscription.

Create free account → Free account · Pay only when you top up · 5-min setup