Developers treat context windows like unlimited storage. “It’s 200K tokens, just throw everything in.” Then they wonder why their agent costs $0.50 per turn and hallucinates tool schemas from page 3 of the context.
The token budget framework
Think of your context window as a fixed budget with competing line items:
System prompt (5-15%) — your agent’s personality, constraints, and routing rules. Every word here is taxed on every single API call. Ruthlessly compress.
Tool schemas (10-25%) — MCP tool definitions, parameter types, examples. This grows linearly with tool count. 20 tools x 500 tokens each = 10K tokens before the conversation even starts.
Memory/RAG (10-20%) — retrieved context, conversation summaries, user preferences. Stale memory is wasted tokens. Evict aggressively.
Conversation (30-50%) — the actual back-and-forth. This is what the user cares about. Protect this allocation.
Output headroom (10-20%) — the model needs room to generate. Starve this and you get truncated responses.
# Token budget monitor
budget = TokenBudget(
    total=128_000,
    system=0.10,
    tools=0.15,
    memory=0.15,
    conversation=0.40,
    output=0.20,
)
budget.check()  # warns if any segment exceeds its allocation
The compression playbook
Minify tool schemas (strip examples, use short param names). Summarize old conversation turns. Use SkillReducer to compress tool results. Every token saved in overhead is a token available for the user’s actual task.
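The schema-minification tactic can be sketched as a recursive pass over JSON-Schema-style tool definitions. The `minify_schema` helper and the set of dropped fields are illustrative assumptions, not any particular framework's API:

```python
import json


def minify_schema(schema: dict) -> dict:
    """Strip token-heavy, non-essential fields from a tool schema.

    Keeps names, types, and required flags; drops examples and
    truncates long descriptions. (Sketch — field names assumed.)
    """
    DROP = {"examples", "example", "title", "default"}
    MAX_DESC = 60  # truncate descriptions rather than delete them

    def walk(node):
        if isinstance(node, dict):
            out = {}
            for key, value in node.items():
                if key in DROP:
                    continue
                if key == "description" and isinstance(value, str):
                    out[key] = value[:MAX_DESC]
                else:
                    out[key] = walk(value)
            return out
        if isinstance(node, list):
            return [walk(item) for item in node]
        return node

    return walk(schema)


# Rough before/after size check via serialized length
tool = {
    "name": "search",
    "description": "Search the product catalog for matching items. " * 10,
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "examples": ["red shoes"]},
        },
        "required": ["query"],
    },
}
before = len(json.dumps(tool))
after = len(json.dumps(minify_schema(tool)))
```

Serialized length is only a proxy for token count, but the ratio tracks well enough to compare schema variants. Run your real tokenizer before committing to a format.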