The Hidden Cost of AI Coding Agents (and How to Cut It)
Most of your Claude / GPT bill is context, not generation. Here's the math, why local models alone don't solve it, and how a local context engine acts as the bridge between fast, cheap retrieval and frontier-model generation.
If you've used Claude Code, Cursor, or Codex CLI on a serious codebase for more than a week, you've already noticed: the bill grows faster than the work.
The intuition is that LLMs are expensive because they think. That's not what you're paying for. You're paying for them to figure out what to think about — re-reading files, walking directories, grepping for symbols, second-guessing imports. The actual generation is a small fraction of every turn.
This post breaks down where the tokens go, why local-only models aren't a clean answer, and how a deliberate bridge between local context retrieval and frontier-model generation cuts cost without giving up output quality.
The cost is mostly context
A typical agent turn on a real codebase looks like this:
- The agent receives your prompt + system instructions (~2k tokens).
- It runs a few exploratory tools — grep, list directory, read file, read another file (~6–10k tokens of tool output piped back into context).
- It re-reads files it already saw on a previous turn because it forgot (~3–5k tokens of redundant input).
- It thinks about what to do (~500 tokens of model output).
- It produces an edit or answer (~1–3k tokens of output).
The math:
Input tokens (system + prompt + tools): ~12,000
Output tokens (thinking + edit): ~2,500
Claude Sonnet 4.6 pricing (typical):
input: $3.00 / 1M tokens → $0.036 / turn
output: $15.00 / 1M tokens → $0.038 / turn
Per-turn cost: ~$0.07
Cost across a 30-turn debugging session: ~$2.10

Frontier models are worse. Claude Opus or GPT-5 with extended thinking can easily push input past 30k tokens per turn as context accumulates, and the internal reasoning the model generates while "thinking" bills as output, pushing output to 5k+ per turn. A long session with one of those costs $5–$15. Multiply by a team of ten running daily and you have a real budget line.
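If you want to sanity-check this against your own usage, the arithmetic fits in a few lines. A minimal TypeScript sketch using the Sonnet-class prices quoted above (swap in your model's actual rates):

// Back-of-envelope session cost from per-turn token estimates.
// Prices are the Sonnet-class figures above, not live API pricing.
interface TurnEstimate {
  inputTokens: number;  // system + prompt + tool output
  outputTokens: number; // thinking + edit
}

const PRICE_PER_MILLION = { input: 3.0, output: 15.0 }; // USD per 1M tokens

function turnCost({ inputTokens, outputTokens }: TurnEstimate): number {
  return (
    inputTokens * PRICE_PER_MILLION.input +
    outputTokens * PRICE_PER_MILLION.output
  ) / 1_000_000;
}

const typical: TurnEstimate = { inputTokens: 12_000, outputTokens: 2_500 };
console.log(turnCost(typical).toFixed(3));        // ~0.074 per turn
console.log((30 * turnCost(typical)).toFixed(2)); // ~2.21 per 30-turn session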
And here's the kicker: roughly 80% of those input tokens are context the agent rediscovered from scratch. The same files read three times in different turns. The same import graph re-walked. The same database schema re-explained because the agent forgot it between calls.
Why "just use a local model" doesn't fix it
The obvious response is "switch to a local model." Llama 3.3 70B runs on a workstation; Qwen 2.5 Coder runs on a laptop. No per-token bill.
I've tried. It doesn't hold up for serious work, for three reasons:
1. Context windows are smaller
Local models that fit on consumer hardware (32B–70B class) top out around 128k tokens of practical context. That's enough for short tasks, not enough for exploring a real codebase. Frontier models go to 200k–1M.
2. Tool-use reasoning is weaker
The thing agents need most isn't raw code-completion. It's the ability to look at a tool result and decide what to call next. That's a deeply trained behavior — Anthropic's Sonnet/Opus and OpenAI's GPT-5 spent a huge amount of post-training specifically on tool-use traces. Local models are catching up but still produce noticeably worse multi-step agentic flows.
3. The cost shifts to your machine
"Free" in the cloud sense becomes expensive in laptop battery, memory pressure, and slow generation. A local 70B model generating at 20 tokens/s makes the agent feel sluggish in a way that frontier APIs don't.
None of this means local models are useless. It means the answer isn't "replace the frontier model." It's "stop sending the frontier model so much context."
The bridge pattern
Here's the pattern that actually works:
- Local context engine handles retrieval: indexed symbols, routes, schema, imports, durable findings. Fast, deterministic, free.
- Frontier model handles generation: planning, edits, explanation. Smart, expensive, but only used when needed.
- The local engine pre-digests the codebase and hands the model a tight context packet — 600 tokens of "here are the 3 files that matter" instead of 12,000 tokens of grep output.
This is the same shape as a CDN in front of a database, or a query planner in front of a dataset. You don't make the slow expensive part faster — you make sure it only runs on the work that actually needs it.
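Concretely, the packet can be a small structured payload. Here's a sketch of the shape, using the field names from the example in the next section (nothing about this schema is fixed):

// Illustrative shape of a context packet; field names mirror the example below.
interface Finding {
  rule: string;    // e.g. "missing-tenant-scope"
  file: string;
  line: number;
  summary: string;
}

interface ContextPacket {
  primaryContext: string[];           // the few files that matter, ranked
  activeFindings: Finding[];          // durable results from earlier local runs
  recommendedHarnessPattern?: string; // suggested next step for the agent
  budgetTokens: number;               // the cap the engine packed against
}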
What this looks like in practice
Concretely: instead of letting the agent grep its way around your repo, you give it one MCP tool that returns a ranked, source-labeled context packet. Same prompt, same model, dramatically fewer input tokens.
// Without a context engine
agent: "I'll search for auth..."
→ rg -n "auth" . (4,200 tokens of output)
→ ls app/ (300 tokens)
→ cat app/auth.ts (1,800 tokens — it's the wrong file)
→ rg -n "verifySession" . (1,100 tokens)
→ cat lib/auth/dal.ts (2,100 tokens)
→ ...four more turns...
Total input across 6 turns: ~25,000 tokens
// With a context engine
agent: context_packet({ request: "trace auth flow", budgetTokens: 4000 })
→ primaryContext: lib/auth/dal.ts, app/dashboard/manager/layout.tsx
→ activeFindings: missing-tenant-scope on dal.ts:142
→ recommendedHarnessPattern: "read primary → check finding → edit"
Total input on the next turn: ~600 tokens

The agent skips the exploration phase entirely. It goes straight to the relevant files with a prior finding already attached. The SOTA model spends its tokens on the part it's good at — reading the actual code and producing an edit — not on figuring out where to look.
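If you want to expose a tool like this yourself, here's a minimal sketch against the official @modelcontextprotocol/sdk for TypeScript. The retrieval internals are stubbed; rankAndPack is a hypothetical helper standing in for the actual index query:

// A minimal MCP server exposing one context_packet tool.
// Sketch against @modelcontextprotocol/sdk; retrieval is stubbed out.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Hypothetical stand-in: a real engine would query a local index of
// symbols, routes, schema, and findings, then pack under the budget.
function rankAndPack(request: string, budgetTokens: number) {
  return { request, budgetTokens, primaryContext: [], activeFindings: [] };
}

const server = new McpServer({ name: "context-engine", version: "0.1.0" });

server.tool(
  "context_packet",
  { request: z.string(), budgetTokens: z.number().default(4000) },
  async ({ request, budgetTokens }) => ({
    content: [
      { type: "text", text: JSON.stringify(rankAndPack(request, budgetTokens)) },
    ],
  })
);

await server.connect(new StdioServerTransport());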
Bonus: cheap bug-finding, expensive fixing
This pattern unlocks a workflow that's hard to get otherwise: split detection from repair.
- Detection is cheap. Lint rules, AST patterns, schema-aware audits, and tenant-leak checks all run locally without the model. Mako's lint_files, tenant_leak_audit, and git_precommit_check are pure local computation. Catching a bug costs you nothing per finding.
- Fixing is expensive. Once you have a finding — "missing tenant scope on lib/auth/dal.ts:142" — you point Claude or GPT-5 at it with a tight context packet. Fast, focused, one model call.
This inverts the normal "spray a frontier model and hope" pattern. You spend cheap local cycles to find what to fix, then spend expensive frontier tokens on how to fix it. Pre-filtering turns a $20 debugging session into a $1 one.
It also gives you persistent value: every detection run leaves findings in a local store. Tomorrow's session inherits today's work. The frontier model isn't being asked to re-discover the same bugs you already found.
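The store itself can be very small. A sketch with better-sqlite3, keeping findings in a single local file next to the code (the table layout is illustrative):

// Persist findings locally so tomorrow's session inherits today's work.
// Sketch using better-sqlite3; the table layout is illustrative.
import Database from "better-sqlite3";

const db = new Database("findings.db");
db.exec(`CREATE TABLE IF NOT EXISTS findings (
  rule TEXT NOT NULL, file TEXT NOT NULL, line INTEGER NOT NULL,
  summary TEXT, status TEXT DEFAULT 'open',
  UNIQUE (rule, file, line)
)`);

export function recordFinding(rule: string, file: string, line: number, summary: string) {
  // INSERT OR IGNORE keeps repeated detection runs idempotent.
  db.prepare(
    "INSERT OR IGNORE INTO findings (rule, file, line, summary) VALUES (?, ?, ?, ?)"
  ).run(rule, file, line, summary);
}

export function openFindings(file: string) {
  return db.prepare("SELECT * FROM findings WHERE file = ? AND status = 'open'").all(file);
}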
Database awareness changes the equation again
If your project uses Postgres or Supabase, the cost story gets even worse without a local layer. The agent doesn't know your schema. It either:
- Hallucinates column names and writes broken SQL (free at the model, expensive at the test suite).
- Asks you to paste the schema (you do, eating 8k tokens of context per session).
- Runs ad-hoc SHOW TABLES-style queries through a shell tool, eating tokens on round-trips.
A local context engine that snapshots your schema once and exposes db_table_schema, db_rls, and schema_usage as MCP tools turns this into a single deterministic call. The model gets exactly the columns and policies relevant to the current task — no more, no less.
Same logic with row-level security: an audit that checks every tenant-keyed table and flags ones with weak policies is local and cheap. Fixing the flagged ones is where you spend frontier tokens.
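Here's a sketch of the "snapshot once" half for Postgres, using the pg client; how the snapshot is cached and served is up to the engine:

// Snapshot the schema once so db_table_schema-style tools can answer
// locally, without per-question round-trips. Sketch with the pg client.
import { Client } from "pg";

export async function snapshotSchema(connectionString: string) {
  const client = new Client({ connectionString });
  await client.connect();
  const { rows } = await client.query(`
    SELECT table_name, column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position
  `);
  await client.end();

  // Group columns by table; serve a single table's shape on request.
  const byTable = new Map<string, typeof rows>();
  for (const row of rows) {
    const cols = byTable.get(row.table_name) ?? [];
    cols.push(row);
    byTable.set(row.table_name, cols);
  }
  return byTable;
}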
What this means for tool builders
If you're building agent tooling — internal at your company or open source — the takeaway is:
- Index aggressively. Anything that can be computed once and reused is a token-cost reduction every time it's queried.
- Return ranked, structured context. Don't dump whole files; return targeted slices with explicit reasoning ("ranked because…").
- Persist findings. A finding's value compounds across sessions. Re-discovery is the single biggest waste.
- Stay local-first. Network round-trips and remote schema lookups eat the savings. Keep the index and findings in a local SQLite next to the code.
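The "ranked, structured context" point in practice: keep a score and an explicit reason on every slice, then pack greedily under the budget. A sketch (field names are illustrative):

// Return ranked slices with explicit reasoning, never whole files.
// Field names are illustrative.
interface RankedSlice {
  file: string;
  lines: [start: number, end: number]; // the slice, not the whole file
  score: number;
  reason: string; // the "ranked because…" surfaced to the model
}

function packUnderBudget(
  slices: RankedSlice[],
  budgetTokens: number,
  estimateTokens: (s: RankedSlice) => number
): RankedSlice[] {
  const picked: RankedSlice[] = [];
  let used = 0;
  for (const s of [...slices].sort((a, b) => b.score - a.score)) {
    const cost = estimateTokens(s);
    if (used + cost > budgetTokens) continue; // over budget: try cheaper slices
    picked.push(s);
    used += cost;
  }
  return picked;
}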
This is the design behind agentmako: a local-first MCP server that does exactly this — indexes the repo, snapshots the schema, tracks findings, and returns deterministic context packets. It plugs into Claude Code, Codex CLI, Cursor, Cline, and anything else that speaks MCP. The frontier models you already pay for get a much tighter input. The local layer gets out of the way.
The actual savings
Real numbers from running it on a moderately-sized Next.js + Supabase codebase (~1,200 files):
- Average input tokens per turn: ~12k → ~3.5k (-71%)
- Average turns to complete a debugging task: ~22 → ~9 (-59%)
- Average cost per debugging session (Sonnet 4.6): ~$2.10 → ~$0.32
- Time-to-first-edit (the agent stops exploring and starts editing): ~3 minutes → ~25 seconds
The percentage savings get larger on bigger codebases. The biggest wins come from cross-session memory — the second time you ask about a file, the agent already has its findings.
Where to start
The bridge pattern doesn't require buying anything new. If your agent already speaks MCP (Claude Code, Cursor, Codex, Cline, Continue.dev), wire up a context engine and tell it to call that first.
For agentmako specifically:
- npm install -g agentmako
- Run agentmako connect . in your project
- Add the MCP server to your agent's config (a sketch of the entry follows this list)
- Drop the CLAUDE.md template into your project root so the agent knows when to call it
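For the config step, most MCP clients take a JSON entry like the one below (Claude Code's .mcp.json style shown; the exact agentmako command is an assumption, so check its README for the real invocation):

// .mcp.json at the project root (Claude Code style).
// The "command"/"args" pair is an assumption; see agentmako's docs.
{
  "mcpServers": {
    "agentmako": {
      "command": "agentmako",
      "args": ["serve"]
    }
  }
}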
It's Apache-2.0, local-first, and zero telemetry. The frontier model keeps doing what it's good at; the local layer makes sure you only pay for that part.
Want this for your codebase?
agentmako is local-first, Apache-2.0, and works with every MCP-compatible coding agent.