Context window economics: every token has a cost you're ignoring

TL;DR
  • Context windows are budgets with competing line items, not unlimited storage
  • System prompt (5-15%), tools (10-25%), memory (10-20%), conversation (30-50%), output (10-20%)
  • 20 tools x 500 tokens each = 10K tokens before the conversation starts
  • Minify schemas, compress old turns, reduce results — every token saved is a token earned

Developers treat context windows like unlimited storage. “It’s 200K tokens, just throw everything in.” Then they wonder why their agent costs $0.50 per turn and hallucinates tool schemas from page 3 of the context.
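That per-turn cost is just arithmetic: every turn resends the entire accumulated context as input tokens. A quick sketch (the pricing here is an illustrative assumption, not any provider's actual rate):

```python
# Each turn resends the full accumulated context as input tokens.
# Assumed pricing for illustration: $3.00 per million input tokens.
PRICE_PER_MTOK = 3.00
context_tokens = 150_000  # a nearly full 200K window

cost_per_turn = context_tokens / 1_000_000 * PRICE_PER_MTOK
print(f"${cost_per_turn:.2f} per turn")  # → $0.45 per turn
```

At that rate, a 20-turn session costs around $9 in input tokens alone, which is why the budget framing below matters.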

The token budget framework

Think of your context window as a fixed budget with competing line items:

System prompt (5-15%) — your agent’s personality, constraints, and routing rules. Every word here is taxed on every single API call. Ruthlessly compress.

Tool schemas (10-25%) — MCP tool definitions, parameter types, examples. This grows linearly with tool count. 20 tools x 500 tokens each = 10K tokens before the conversation even starts.
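That fixed overhead is easy to quantify:

```python
# Fixed schema overhead, paid on every single call before any conversation
num_tools = 20
tokens_per_schema = 500

overhead = num_tools * tokens_per_schema
print(overhead)                                      # → 10000
print(f"{overhead / 128_000:.0%} of a 128K window")  # → 8% of a 128K window
```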

Memory/RAG (10-20%) — retrieved context, conversation summaries, user preferences. Stale memory is wasted tokens. Evict aggressively.

Conversation (30-50%) — the actual back-and-forth. This is what the user cares about. Protect this allocation.

Output headroom (10-20%) — the model needs room to generate. Starve this and you get truncated responses.

# Token budget monitor (minimal sketch)
from dataclasses import dataclass

@dataclass
class TokenBudget:
    total: int
    system: float
    tools: float
    memory: float
    conversation: float
    output: float

    def check(self, usage: dict[str, int]) -> list[str]:
        # Warn for any segment whose actual share exceeds its allocation
        return [seg for seg, used in usage.items()
                if used / self.total > getattr(self, seg)]

budget = TokenBudget(total=128_000, system=0.10, tools=0.15,
                     memory=0.15, conversation=0.40, output=0.20)
budget.check({"tools": 28_000})  # warns: tools at 22% vs a 15% allocation

The compression playbook

Minify tool schemas (strip examples, use short param names). Summarize old conversation turns. Use SkillReducer to compress tool results. Every token saved in overhead is a token available for the user’s actual task.
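Schema minification can be sketched roughly like this (minify_schema is a hypothetical helper, not a library API): recursively drop description and examples fields, which burn tokens without changing how the tool is called.

```python
import json

def minify_schema(schema: dict) -> dict:
    """Drop token-heavy fields the model doesn't need to call the tool."""
    DROP = {"description", "examples"}
    def walk(node):
        if isinstance(node, dict):
            return {k: walk(v) for k, v in node.items() if k not in DROP}
        if isinstance(node, list):
            return [walk(v) for v in node]
        return node
    return walk(schema)

verbose = {
    "name": "search",
    "description": "Searches the knowledge base for matching documents.",
    "parameters": {
        "query": {"type": "string", "description": "Full-text query",
                  "examples": ["refund policy", "api limits"]},
    },
}
lean = minify_schema(verbose)
print(len(json.dumps(verbose)), "->", len(json.dumps(lean)))
```

In practice you would minify once at startup and cache the result, since schemas rarely change between calls.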

Agent Trace

agent trace · post #3 · 6 steps · 188ms
THINK Conversation at turn 12. Checking token budget. 1ms
TOOL TokenBudget.check(current_usage) 3ms
OBS system=12%, tools=22%, memory=18%, conv=41%, free=7% 1ms
ERR Output headroom below 10%. Triggering conversation compression. 1ms
TOOL SkillReducer.compress(turns[0:8]) 180ms
ACT Freed 8.2K tokens. Output headroom restored to 14%. 2ms
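The compression step in the trace can be sketched as follows (compress_turns and the one-line summarizer are hypothetical stand-ins; the actual SkillReducer presumably produces a richer summary). The idea: replace verbatim old turns with a single summary turn while keeping recent turns intact.

```python
def compress_turns(turns: list[str], keep_last: int, summarize) -> list[str]:
    """Replace all but the last keep_last turns with one summary turn."""
    if len(turns) <= keep_last:
        return turns
    old, recent = turns[:-keep_last], turns[-keep_last:]
    return [summarize(old)] + recent

turns = [f"turn {i}: ..." for i in range(12)]
compact = compress_turns(turns, keep_last=4,
                         summarize=lambda ts: f"[summary of {len(ts)} turns]")
print(len(turns), "->", len(compact))  # → 12 -> 5
```

The freed tokens come from the delta between the verbatim old turns and their summary, which is what restores output headroom in the trace above.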