
Output Tokens Are the Real Cost of Coding Agents

Most agent-cost discussion focuses on input tokens. The expensive half of the bill is the output tokens your agent burns rediscovering your repo every turn. Here's the math, the mechanism, and why typed MCP tools beat raw grep on both ends.

Most of the agent-cost discussion focuses on input tokens — how long is your prompt, how much context does the model have to read. That's the cheap half of the bill. The expensive half is the output tokens your agent burns rediscovering your repo every turn. The most interesting consequence isn't saving money; it's that the agent reaches the actual problem faster, before context decay sets in.

The framing everyone uses

Pricing pages have trained us to think about input tokens. Anthropic's Claude Sonnet 4.6 is $3 per million input tokens (MTok). OpenAI's GPT-5.5 is $5/MTok input. So the obvious cost-control move is "send less context": prune your system prompt, summarize chat history, use RAG instead of dumping the whole repo.

This is correct as far as it goes. It's just the wrong cost center to optimize first.

The bill you're not looking at

Across these model families, output tokens cost 5–6× input:

  • Sonnet 4.6: $3 in / $15 out per MTok (5×)
  • Opus 4.7: $5 in / $25 out per MTok (5×)
  • GPT-5.5: $5 in / $30 out per MTok (6×) — released Apr 23, 2026 with input and output prices doubled vs. GPT-5
  • GPT-5.5 Pro: $30 in / $180 out per MTok (6×)

So in a session that sends 50K input tokens and generates 10K output tokens, the output costs at least as much as the input: on Sonnet 4.6 each side comes to $0.15, and at a 6× ratio those 10K output tokens cost as much as 60K input tokens. And the gap got wider, not narrower, with the April 2026 model drops: GPT-5.5 doubled prices on both ends, and Opus 4.7 ships with a new tokenizer that can produce up to 35% more tokens for the same input text. Output volume per task is trending up, not down.
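To check the output-versus-input arithmetic yourself, here is a minimal sketch. It uses only the per-MTok prices listed above; the function names are mine and purely illustrative.

```python
# Prices in USD per million tokens (input, output), copied from the list above.
PRICES = {
    "Sonnet 4.6": (3.0, 15.0),
    "Opus 4.7": (5.0, 25.0),
    "GPT-5.5": (5.0, 30.0),
    "GPT-5.5 Pro": (30.0, 180.0),
}


def session_cost(model, input_tokens, output_tokens):
    """Return (input_cost, output_cost) in USD for one session."""
    price_in, price_out = PRICES[model]
    return (
        input_tokens * price_in / 1_000_000,
        output_tokens * price_out / 1_000_000,
    )


def input_equivalent(model, output_tokens):
    """How many input tokens cost the same as this many output tokens."""
    price_in, price_out = PRICES[model]
    return output_tokens * price_out / price_in


if __name__ == "__main__":
    for model in PRICES:
        cost_in, cost_out = session_cost(model, 50_000, 10_000)
        equiv = input_equivalent(model, 10_000)
        print(f"{model}: input ${cost_in:.2f}, output ${cost_out:.2f}, "
              f"10K output costs as much as {equiv:,.0f} input tokens")
```

The point of `input_equivalent` is to put both halves of the bill in the same unit, so a small output count stops looking small.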

Now look at what an agent actually does during a coding task. Here's the typical flow when the agent doesn't know your codebase:

  1. Read the user's request, decide it needs to inspect the repo. Output tokens.
  2. Plan which paths to grep, in plain text reasoning. Output tokens.
  3. Issue 4–8 tool calls (grep, glob, read). Each tool call has framing overhead. Output tokens.
  4. Stream back raw results — usually a few KB per call. Input tokens (cheap).
  5. Reason over the results, pick promising files, summarize structure. Output tokens.
  6. Read those files in full. Input tokens.
  7. Form a hypothesis about what to change. Output tokens.
  8. Finally begin the actual change.

Steps 1 through 7 are entirely about reconstructing context. The model is generating tokens — expensive tokens — not to do the user's task, but to figure out where in the codebase the user's task lives. The senior engineer on that team has the answer in their head: "the auth flow is in src/server/auth/, it touches the sessions table, the relevant tests are in auth-flow.test.ts." The agent regenerates that knowledge from raw text on every single turn.

The same task, with structured context

Now imagine the agent had a single tool that returns this directly:

{
  "task": "fix the broken auth callback route",
  "candidates": [
    {
      "path": "src/server/auth/callback.ts",
      "kind": "route",
      "reason": "matches request keywords + recent diagnostics on this file"
    },
    { "path": "src/server/auth/session.ts", "kind": "support", "reason": "imported by callback.ts" }
  ],
  "facts": [
    { "kind": "table", "name": "sessions", "rls": "enabled" },
    { "kind": "diagnostic", "tool": "tsc", "message": "..." }
  ],
  "tests": ["test/auth-flow.test.ts"]
}

The agent calls one tool. Gets a typed, ranked, deduplicated context packet. The model's "discovery" output is one tool call instead of six, and the model reads structured data instead of 20 KB of grep text.
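On the consuming side, a packet like the one above can be loaded into typed objects so the agent harness acts on it directly. This is a rough sketch, assuming the JSON shape shown above; the class and function names (`Candidate`, `ContextPacket`, `parse_packet`, `files_to_open`) are hypothetical and not part of any real MCP server's API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Candidate:
    path: str
    kind: str
    reason: str


@dataclass
class ContextPacket:
    task: str
    candidates: list = field(default_factory=list)  # ranked Candidate objects
    facts: list = field(default_factory=list)       # heterogeneous dicts, kept as-is
    tests: list = field(default_factory=list)       # test file paths


def parse_packet(raw):
    """Parse the JSON text of a context packet into a ContextPacket."""
    data = json.loads(raw)
    return ContextPacket(
        task=data["task"],
        candidates=[Candidate(**c) for c in data.get("candidates", [])],
        facts=data.get("facts", []),
        tests=data.get("tests", []),
    )


def files_to_open(packet, limit=3):
    """Ranked candidate paths first, then tests, without duplicates."""
    ordered = [c.path for c in packet.candidates] + list(packet.tests)
    unique = []
    for path in ordered:
        if path not in unique:
            unique.append(path)
    return unique[:limit]


# The example packet from the text; the elided diagnostic message stays elided.
RAW = """
{
  "task": "fix the broken auth callback route",
  "candidates": [
    {"path": "src/server/auth/callback.ts", "kind": "route",
     "reason": "matches request keywords + recent diagnostics on this file"},
    {"path": "src/server/auth/session.ts", "kind": "support",
     "reason": "imported by callback.ts"}
  ],
  "facts": [
    {"kind": "table", "name": "sessions", "rls": "enabled"},
    {"kind": "diagnostic", "tool": "tsc", "message": "..."}
  ],
  "tests": ["test/auth-flow.test.ts"]
}
"""

if __name__ == "__main__":
    print(files_to_open(parse_packet(RAW)))
```

Because the candidates arrive already ranked, `files_to_open` needs no scoring logic of its own, which is the whole point of pushing ranking into the tool.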

This isn't hypothetical. It's exactly what an MCP (Model Context Protocol) server can return when it's been told to behave like a senior dev rather than a search engine.

The math, on a real task

I ran the same task — refactor an auth callback route on a ~700-file repo — two ways. Once with an agent that only had grep / glob / read available. Once with an agent that had a structured context_packet MCP tool first.

Metric                          | Grep-walk   | Typed MCP tool
--------------------------------|-------------|---------------
Tool calls before first edit    | ~14         | 2
Cumulative input tokens         | ~38 K       | ~6 K
Output tokens during discovery  | ~8 K        | ~1.2 K
Time-to-first-edit              | ~90 s       | ~15 s
Final answer quality            | comparable  | comparable

The output-token delta is what matters: 8 K vs 1.2 K. On Sonnet 4.6 ($15/MTok out) that's $0.12 vs $0.018. On Opus 4.7 ($25/MTok out) it's $0.20 vs $0.030. On GPT-5.5 ($30/MTok out) it's $0.24 vs $0.036. Almost 7× cheaper on the part of the bill that's expensive — and that 7× ratio is constant across providers because it's about how many output tokens get generated, not the per-token rate. Stack that across 50 tasks a day and the math gets serious for power users.
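The per-provider numbers can be reproduced with a few lines. This sketch uses only the measured output-token counts and output prices quoted above, plus the 50-tasks-a-day figure. The daily number covers discovery output tokens only and is straight arithmetic, not a separate measurement.

```python
# Output prices in USD per million tokens, as quoted above.
OUTPUT_PRICE_PER_MTOK = {
    "Sonnet 4.6": 15.0,
    "Opus 4.7": 25.0,
    "GPT-5.5": 30.0,
}

GREP_WALK_OUTPUT_TOKENS = 8_000    # discovery output, grep-walk run
TYPED_TOOL_OUTPUT_TOKENS = 1_200   # discovery output, typed MCP tool run
TASKS_PER_DAY = 50


def output_cost(tokens, price_per_mtok):
    """USD cost of generating this many output tokens."""
    return tokens * price_per_mtok / 1_000_000


def discovery_report():
    """Rows of (model, grep_cost, typed_cost, ratio, daily_saving)."""
    rows = []
    for model, price in OUTPUT_PRICE_PER_MTOK.items():
        grep_cost = output_cost(GREP_WALK_OUTPUT_TOKENS, price)
        typed_cost = output_cost(TYPED_TOOL_OUTPUT_TOKENS, price)
        ratio = grep_cost / typed_cost
        daily_saving = (grep_cost - typed_cost) * TASKS_PER_DAY
        rows.append((model, grep_cost, typed_cost, ratio, daily_saving))
    return rows


if __name__ == "__main__":
    for model, grep_cost, typed_cost, ratio, daily in discovery_report():
        print(f"{model}: ${grep_cost:.3f} vs ${typed_cost:.3f} "
              f"({ratio:.1f}x), about ${daily:.2f}/day at {TASKS_PER_DAY} tasks")
```

The ratio column comes out the same for every provider because the price cancels out of `grep_cost / typed_cost`.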

But the dollar figure isn't the headline. The headline is time-to-first-edit dropped from 90 seconds to 15. The agent stopped narrating its discovery process and started doing the user's actual task. Quality of decisions tracks quality of context, and quality of context decays as more rediscovery noise accumulates. Shorter discovery is better discovery.

Why typed tools win on output

The reason this works is mechanical:

  1. Tool-call framing has fixed overhead. Every tool call costs ~80–150 output tokens just for the JSON envelope, even if the call body is empty. Six tool calls instead of one means five extra envelopes: roughly 400–750 tokens of framing alone.
  2. Reasoning over raw text is verbose. When the model sees grep output, it generates output tokens summarizing what it found. When it sees a typed object with reason fields and kind annotations, it doesn't need to summarize — it can act.
  3. Models talk to themselves. Plan-then-act prompts produce a lot of "let me think about which files to look at" output. With a typed tool that ranks candidates upfront, the planning output collapses.
  4. Indexing pays once, queries pay zero. A local index does the expensive parsing/tokenizing/symbol-extraction once. Every subsequent query reads from SQLite in milliseconds. Grep does the equivalent work over and over per session.
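The "index once, query many" idea in point 4 can be shown with the standard-library `sqlite3` module. This is a toy sketch under loud assumptions: the two-file "repo", the `symbols` table, and the regex-based symbol extraction are all hypothetical stand-ins, not the schema or parser of any real tool.

```python
import re
import sqlite3

# Hypothetical in-memory "repo" so the sketch runs without touching disk.
FILES = {
    "src/server/auth/callback.ts":
        "export function handleCallback() {}\nexport function readSession() {}",
    "src/server/auth/session.ts":
        "export function createSession() {}",
}

# Crude stand-in for real parsing: matches exported function names only.
SYMBOL_RE = re.compile(r"export function (\w+)")


def build_index(files):
    """Do the expensive pass once: extract symbols and store them in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE symbols (name TEXT, path TEXT)")
    conn.execute("CREATE INDEX idx_symbols_name ON symbols(name)")
    for path, source in files.items():
        rows = [(name, path) for name in SYMBOL_RE.findall(source)]
        conn.executemany("INSERT INTO symbols VALUES (?, ?)", rows)
    conn.commit()
    return conn


def find_symbol(conn, name):
    """Cheap repeated lookup: an indexed query, no re-scan of the files."""
    cur = conn.execute(
        "SELECT path FROM symbols WHERE name = ? ORDER BY path", (name,)
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    conn = build_index(FILES)
    print(find_symbol(conn, "createSession"))
    print(find_symbol(conn, "handleCallback"))
```

Every call to `find_symbol` after the first build touches only the index, whereas a grep-based agent re-reads the source text for each question.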

When this doesn't apply

I'd be lying if I said this is universal. Cases where typed context tools don't help:

  • Tiny repos. Under ~50 files, grep is fast enough and the indexing overhead exceeds the savings.
  • One-shot tasks that don't touch the codebase ("write me a regex for…"). The index isn't relevant.
  • Bad indexing. If your tool returns lower-signal results than grep, you've made it worse with extra steps. Index quality is everything.
  • Agents that ignore the structured tools. This is real — without explicit prompt nudging, some agents will reach for grep out of habit. Skill files / system prompts that teach the agent when to use which tool matter as much as the tool itself.
  • Models that don't reason well over JSON. Not really an issue with frontier models in 2026, but worth noting if you're running smaller open models.

The general principle

Context engineering is moving from "fit more into the window" to "send better-curated context." The cost gradient supports this: prompt caching now handles the input-token problem reasonably well (90% off on cached input), but reasoning models burn far more output tokens per task than chat models did, and providers are raising output prices — GPT-5.5 doubled them in April 2026, and Opus 4.7's tokenizer inflates output volume on top of that. The optimal architecture is one that answers the agent's question in one curated tool call instead of letting it discover the answer through trial and error.

Practically, this means:

  • Build (or use) tools that return facts, not raw documents. A "the auth route is here, it touches these tables, these tests cover it" answer beats a 20 KB grep dump every time.
  • Make your tools return rankings and reasons, not just lists. The agent can short-circuit further exploration if it trusts the ranking.
  • Index once, query many. SQLite is more than fast enough; you don't need a vector store for most code-intelligence questions.
  • Measure output tokens separately from input tokens. If you can't see the cost, you can't optimize it.
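For the last point, separate measurement can be as simple as a ledger keyed by phase. A minimal sketch, assuming you can read input and output token counts from each API response (most provider responses report them in a usage field); the `TokenLedger` class and the phase names are hypothetical, and prompt-cache discounts are ignored.

```python
from collections import defaultdict


class TokenLedger:
    """Tracks input and output tokens separately, per phase of a task."""

    def __init__(self, price_in_per_mtok, price_out_per_mtok):
        self.price_in = price_in_per_mtok
        self.price_out = price_out_per_mtok
        self.tokens = defaultdict(lambda: {"input": 0, "output": 0})

    def record(self, phase, input_tokens, output_tokens):
        """Add one model call's token counts to the given phase."""
        self.tokens[phase]["input"] += input_tokens
        self.tokens[phase]["output"] += output_tokens

    def cost(self, phase):
        """USD cost for a phase, split into input and output."""
        counts = self.tokens[phase]
        return {
            "input": counts["input"] * self.price_in / 1_000_000,
            "output": counts["output"] * self.price_out / 1_000_000,
        }

    def output_share(self):
        """Fraction of total spend that went to output tokens."""
        total_in = sum(self.cost(p)["input"] for p in list(self.tokens))
        total_out = sum(self.cost(p)["output"] for p in list(self.tokens))
        total = total_in + total_out
        return total_out / total if total else 0.0


if __name__ == "__main__":
    # Sonnet 4.6 prices and the grep-walk discovery numbers from the table above.
    ledger = TokenLedger(price_in_per_mtok=3.0, price_out_per_mtok=15.0)
    ledger.record("discovery", input_tokens=38_000, output_tokens=8_000)
    print(ledger.cost("discovery"))
    print(f"output share of spend: {ledger.output_share():.0%}")
```

Even in the grep-walk run, where input tokens outnumber output tokens almost five to one, the output side is slightly more than half of the discovery spend at these prices.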

What I'm using

I built agentmako to do exactly this — a local-first MCP server that indexes a repo into SQLite and exposes typed context tools to coding agents. It's Apache-2.0, runs entirely locally, and wires into Claude Code, Cursor, Cline, Codex, etc. via standard MCP. The frontier model keeps doing what it's good at; the local layer makes sure you only pay for that part.

npm install -g agentmako
agentmako connect .

Then point your MCP client at it:

{
  "mcpServers": {
    "agentmako": { "command": "agentmako", "args": ["mcp"] }
  }
}

Output-token spend is the part of the bill nobody talks about. Once you start tracking it separately, the architecture conversation changes.
