Developers treat context windows like unlimited storage. “It’s 200K tokens, just throw everything in.” Then they wonder why their agent costs $0.50 per turn and hallucinates tool schemas from page 3 of the context.
The token budget framework
Think of your context window as a fixed budget with competing line items:
System prompt (5-15%) — your agent’s personality, constraints, and routing rules. Every word here is taxed on every single API call. Ruthlessly compress.
Tool schemas (10-25%) — MCP tool definitions, parameter types, examples. This grows linearly with tool count. 20 tools x 500 tokens each = 10K tokens before the conversation even starts.
Memory/RAG (10-20%) — retrieved context, conversation summaries, user preferences. Stale memory is wasted tokens. Evict aggressively.
Conversation (30-50%) — the actual back-and-forth. This is what the user cares about. Protect this allocation.
Output headroom (10-20%) — the model needs room to generate. Starve this and you get truncated responses.
# Token budget monitor
budget = TokenBudget(
    total=128_000,
    system=0.10,
    tools=0.15,
    memory=0.15,
    conversation=0.40,
    output=0.20,
)
budget.check()  # warns if any segment exceeds its allocation
The compression playbook
Minify tool schemas (strip examples, use short param names). Summarize old conversation turns. Use SkillReducer to compress tool results. Every token saved in overhead is a token available for the user’s actual task.
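The schema-minification tactic can be sketched as a recursive pass over JSON-Schema-style tool definitions. The `minify_schema` helper and the set of dropped fields are illustrative assumptions, not any particular framework's API:

```python
import json


def minify_schema(schema: dict) -> dict:
    """Strip token-heavy, non-essential fields from a tool schema.

    Keeps names, types, and required flags; drops examples and
    truncates long descriptions. (Sketch — field names assumed.)
    """
    DROP = {"examples", "example", "title", "default"}
    MAX_DESC = 60  # truncate descriptions rather than delete them

    def walk(node):
        if isinstance(node, dict):
            out = {}
            for key, value in node.items():
                if key in DROP:
                    continue
                if key == "description" and isinstance(value, str):
                    out[key] = value[:MAX_DESC]
                else:
                    out[key] = walk(value)
            return out
        if isinstance(node, list):
            return [walk(item) for item in node]
        return node

    return walk(schema)


# Rough before/after size check via serialized length
tool = {
    "name": "search",
    "description": "Search the product catalog for matching items. " * 10,
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "examples": ["red shoes"]},
        },
        "required": ["query"],
    },
}
before = len(json.dumps(tool))
after = len(json.dumps(minify_schema(tool)))
```

Serialized length is only a proxy for token count, but the ratio tracks well enough to compare schema variants. Run your real tokenizer before committing to a format.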