where the tokens go
The One-Sentence Prompt That Wasn’t
The first confusing part of Codex token usage is that the user interface is a liar by omission.
You type:
fix the flaky test
Then the usage report says something like 300k tokens. That feels insane if you imagine the model received one sentence, thought for a moment, and wrote back a patch.
That is not what happened. Codex is not a single prompt and response. It is a loop:
read -> reason -> tool
-> observe -> edit
-> test -> observe
-> repair -> summarize
Each step can involve another model call. Each model call needs enough context
to decide the next action. That context can include the system prompt,
developer instructions, AGENTS.md, tool schemas, MCP server descriptions,
repo files, terminal output, diffs, test failures, prior trace summaries, the
current user request, and generated output from earlier turns.
The visible prompt is the kickoff. The expensive part is the loop.
A Small Token Lab
The numbers below are fake on purpose. They are for intuition, not pricing. The thing to watch is what happens when a task goes from one model call to twelve, or when one local agent becomes eight parallel agents.
The important shape is not the exact number. It is the multiplication:
Let \(I_j\) be the input tokens in model call \(j\). Then:
$$ T_{\text{task}} = \sum_{j=1}^{J} I_j $$
For parallel work, let \(I_{a,j}\) be the input tokens for agent \(a\) in call \(j\):
$$ T_{\text{fleet}} = \sum_{a=1}^{A}\sum_{j=1}^{J_a} I_{a,j} $$
One small task can become many large model calls. Many small tasks in parallel can become a very large number before anything looks dramatic in the UI.
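The two sums above can be sketched in a few lines of Python. The per-call numbers are invented, like everything else in this lab, and the function names are hypothetical.

```python
def task_input_tokens(call_inputs):
    """Sum input tokens over the model calls of one task (T_task)."""
    return sum(call_inputs)


def fleet_input_tokens(agents):
    """Sum input tokens over every call of every agent (T_fleet)."""
    return sum(task_input_tokens(calls) for calls in agents)


# Invented numbers: context grows by 5k tokens on each later call.
one_call = [20_000]
twelve_calls = [20_000 + 5_000 * j for j in range(12)]
eight_agents = [twelve_calls] * 8

print("one call:    ", task_input_tokens(one_call))
print("twelve calls:", task_input_tokens(twelve_calls))
print("eight agents:", fleet_input_tokens(eight_agents))
```

The growth comes from the number of calls and agents, not from the length of the typed prompt, which never appears in the arithmetic.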
What A Token Is
A token is the unit the model reads and writes. It is not exactly a word. It can be a word, part of a word, punctuation, whitespace, a chunk of code, JSON syntax, or a weird little boundary that only makes sense to the tokenizer.
For example:
"OpenAI is cool."
might be split roughly like:
OpenAI
is
cool
.
Do not treat that as an exact tokenizer result. The exact split depends on the model and tokenizer. The point is simpler: models do not read “words” in the way humans do. They read token IDs.
That matters because code is token-dense. Logs are token-dense. JSON tool schemas are token-dense. A terminal dump with paths, stack traces, quoted strings, ANSI leftovers, timestamps, and repeated lines can be a stupid amount of text for a model to read.
Tokens include:
- your prompt
- hidden or durable instructions
- tool schemas
- file contents
- diffs
- terminal output
- test output
- generated answers
- reasoning tokens, when the model uses them
So “I only typed one sentence” is usually true and irrelevant.
The Billing Buckets
The clean accounting model is:
- input tokens: what the model reads
- cached input tokens: input prefix tokens whose computation was reused
- output tokens: what the model writes
- reasoning tokens: internal generated tokens used by reasoning models, usually counted inside output usage
- tool/service costs: extra charges from built-in tools or services, when those are involved
OpenAI’s pricing page separates input, cached input, and output columns, and its reasoning guide says reasoning tokens are not visible as raw text but still occupy context and are billed as output tokens for reasoning models (pricing, reasoning).
The useful formula is:
Let \(T_i\) be total input, \(T_c\) be cached input, \(T_u\) be uncached input, \(T_o\) be output, and \(C_t\) be any tool or service charges.
$$ T_u = T_i - T_c $$
Then:
$$ \begin{aligned} C ={}& T_u r_i + T_c r_c \\ &+ T_o r_o + C_t \end{aligned} $$
where each \(r\) is the current rate for the model and service tier. I am not hardcoding rates here because they change. The thing worth remembering is more durable: cached input is cheaper input, not nonexistent input.
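As a minimal sketch, the cost formula translates directly into a function. The rates in the example call are made-up placeholders, not real prices, and the function and parameter names are hypothetical.

```python
def request_cost(total_input, cached_input, output,
                 r_in, r_cached, r_out, tool_charges=0.0):
    """Cost = T_u * r_i + T_c * r_c + T_o * r_o + C_t, with per-token rates."""
    if not 0 <= cached_input <= total_input:
        raise ValueError("cached input must be between 0 and total input")
    uncached_input = total_input - cached_input  # T_u = T_i - T_c
    return (uncached_input * r_in
            + cached_input * r_cached
            + output * r_out
            + tool_charges)


# Placeholder rates per token (NOT real pricing).
example = request_cost(
    total_input=100_000, cached_input=60_000, output=5_000,
    r_in=2.0e-6, r_cached=0.2e-6, r_out=8.0e-6,
)
print(f"example cost: {example:.4f}")
```

Setting the cached rate to zero in this sketch would model "cached is free", which is exactly the assumption the formula is meant to rule out.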
Prefill And Decode
LLM inference has two phases that are easy to blur together:
prefill:
process the whole input/context
decode:
generate output tokens one at a time
Prefill is the cost of reading. Decode is the cost of writing.
During prefill, the model processes the prompt through its Transformer layers. For a prompt of length \(n\), it builds internal representations for those \(n\) positions. In an autoregressive decoder model, token \(i\) is allowed to attend only to tokens at positions \(\le i\). That causal structure is what makes next-token generation work.
During decode, the model generates one token, appends it to the sequence, then generates the next token conditioned on everything so far. That sequential dependency is one reason output tokens can be more expensive than input tokens: writing 1,000 tokens is not one big parallel operation in the same way reading a prompt can be.
The short version:
prefill = read the context
decode = write the answer
For Codex, “read the context” is often the painful part because the context is not only the user’s sentence. It is the working state of a software project.
Full Prefill Means The Whole Prompt
Full prefill means the serving system processes the entire prompt from scratch through all model layers before the first generated token comes out.
[system prompt]
[developer instructions]
[AGENTS.md]
[tool schemas]
[repo files]
[terminal logs]
[user request]
|
v
full prefill
|
v
first output token
Inside the attention layers, the model computes query, key, and value vectors. In simplified single-layer notation:
$$ Q = XW_Q $$
$$ K = XW_K $$
$$ V = XW_V $$
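and attention is:
$$ Z = QK^T / \sqrt{d_k} $$
$$ A = \operatorname{softmax}(Z)V $$
In a causal decoder, the entries of \(Z\) above the diagonal are masked out before the softmax, so position \(i\) never attends to later positions.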
Real serving stacks have batching, kernels, memory management, quantization, speculative tricks, and many implementation details. None of that changes the basic mental model: prefill is the model doing the expensive first pass over the input so it can start generating.
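To make the notation concrete, here is a toy single-head, single-layer version of that first pass in plain Python. It is a sketch for intuition only: the tiny matrices are invented, the projections are identity matrices, and nothing here resembles a production kernel.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(X, W_Q, W_K, W_V):
    """Single-head attention where position i sees only positions <= i."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    outputs = []
    for i, q in enumerate(Q):
        # Row i of Z, restricted to the unmasked positions 0..i.
        z_row = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                 for j in range(i + 1)]
        weights = softmax(z_row)
        outputs.append([sum(w * V[j][c] for j, w in enumerate(weights))
                        for c in range(len(V[0]))])
    return outputs


# Toy input: three token embeddings of width 2, identity projections.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = causal_attention(X, IDENTITY, IDENTITY, IDENTITY)
prefix_only = causal_attention(X[:2], IDENTITY, IDENTITY, IDENTITY)
```

Because of the causal restriction, the outputs for the first two positions are the same whether or not the third token is present, which is the property that makes prefix reuse possible.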
KV Cache Is Not Prompt Cache
There are two related caches that people often mash together.
KV Cache Inside One Request
Once a model has processed a prompt, it has key/value tensors for the prompt positions at each layer. During decode, it does not need to recompute the whole prompt for every generated token. It can keep the already-computed keys and values around and only compute the new token’s contribution.
That is the KV cache.
It is intra-request. It helps the model avoid rereading the prompt every time it writes the next token.
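A toy sketch of the idea, assuming a single attention head in a single layer with invented weights: each decode step projects only the new token, appends its key and value to the cache, and attends over everything cached so far. The class and function names are hypothetical.

```python
import math


def project(x, W):
    """Row vector x times matrix W."""
    return [sum(xi * W[i][c] for i, xi in enumerate(x))
            for c in range(len(W[0]))]


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]


class KVCache:
    """Keeps keys and values for tokens that were already processed."""

    def __init__(self, W_Q, W_K, W_V):
        self.W_Q, self.W_K, self.W_V = W_Q, W_K, W_V
        self.keys, self.values = [], []

    def step(self, x):
        """Process one new token; earlier K/V are reused, not recomputed."""
        self.keys.append(project(x, self.W_K))
        self.values.append(project(x, self.W_V))
        return attend(project(x, self.W_Q), self.keys, self.values)


def last_output_without_cache(xs, W_Q, W_K, W_V):
    """Recompute K/V for the whole sequence, then attend from the last token."""
    keys = [project(x, W_K) for x in xs]
    values = [project(x, W_V) for x in xs]
    return attend(project(xs[-1], W_Q), keys, values)


W = [[0.5, -1.0], [2.0, 0.25]]  # invented weights, reused for Q, K and V
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache(W, W, W)
cached_outputs = [cache.step(x) for x in tokens]
fresh_outputs = [last_output_without_cache(tokens[:n], W, W, W)
                 for n in range(1, len(tokens) + 1)]
```

Both paths produce the same outputs. The cached path just does a constant amount of projection work per new token instead of reprojecting the whole sequence.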
Prompt Cache Across Requests
Prompt caching is cross-request. If a later request starts with the exact same prefix, the serving system can reuse cached computation for that prefix. OpenAI’s prompt caching guide describes this as exact prefix matching and says prompt caching works automatically for supported models (prompt caching).
Request 1:
[static prefix:
system + tools + AGENTS.md + main.py]
[dynamic suffix: task A]
-> compute prefix from scratch
Request 2:
[same static prefix:
system + tools + AGENTS.md + main.py]
[dynamic suffix: task B]
-> reuse cached prefix
-> compute only new suffix
The OpenAI guide is explicit that cache hits require exact prefix matches. It also says cached prompt data affects latency and cost, not the final answer: the response is still computed fresh for the request.
That last detail is the whole ballgame.
But How Can Cached main.py Be Reused?
This is the subtle objection:
If main.py is cached, how can the same cached file be useful for 1,000 different questions? Doesn’t the model need to reread it differently for each task?
No. Not in the prefix positions.
In a causal decoder-only Transformer, earlier tokens cannot attend to later tokens. The representation and key/value tensors for a static prefix do not depend on the future user query, because the future user query is not visible to those prefix tokens.
Suppose the prefix is:
[system][tools][AGENTS.md][main.py]
and the suffix is:
[user asks about login]
The cached part is:
K/V for main.py
The fresh per-query part is:
Q vectors for the new request
attention weights:
request <-> main.py
suffix computation
generated answer
For a later token \(t\) in the request, attention into the cached prefix looks roughly like:
$$ \alpha_{t,i} = \operatorname{softmax}_i \left( \frac{q_t k_i^T}{\sqrt{d_k}} \right) $$
where \(k_i\) can come from cached main.py, but \(q_t\) is fresh for the new query. Different query, different \(q_t\). Different \(q_t\), different attention weights.
So:
Query A: "Why does login fail?"
attends mostly to login/auth code
Query B: "Why does --config not work?"
attends mostly to arg parsing code
Query C: "Can this corrupt the local cache?"
attends mostly to cache-writing code
Same cached main.py K/V. Different query vectors. Different attention
weights. Different answer.
Caching does not mean the model has pre-decided what main.py means for every
possible task. It means the reusable prefix tensors are already computed.
Relevance is still computed fresh.
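Here is a toy illustration of that point. The three cached keys and the query vectors are hand-picked for the demo, not derived from any real model; the only thing the sketch shows is that fixed keys plus a different query give different attention weights.

```python
import math


def attention_weights(q, keys):
    """softmax_i(q . k_i / sqrt(d_k)) over a fixed list of keys."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cached keys, one per region of main.py. Computed once.
REGIONS = ["login/auth code", "arg parsing code", "cache-writing code"]
CACHED_KEYS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical query vectors, computed fresh for each request.
QUERIES = {
    "Why does login fail?": [4.0, 0.0, 0.0],
    "Why does --config not work?": [0.0, 4.0, 0.0],
    "Can this corrupt the local cache?": [0.0, 0.0, 4.0],
}


def most_attended(q):
    """Name the region that receives the largest attention weight."""
    weights = attention_weights(q, CACHED_KEYS)
    return REGIONS[weights.index(max(weights))]


for question, q in QUERIES.items():
    print(question, "->", most_attended(q))
```

CACHED_KEYS never changes between questions. Only the query vector does, and that is enough to move the attention.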
Prompt Cache Hits Are Fragile
Prompt caching is prefix reuse, not semantic similarity. “These two prompts basically say the same thing” does not matter if the prefix bytes/tokens are not the same.
Cache-friendly:
stable first:
system prompt
tool schemas
AGENTS.md
durable repo context
dynamic last:
current task
latest terminal logs
current diff
random IDs
timestamps
Cache-unfriendly:
dynamic stuff first
stable stuff later
The annoying parts are the mundane parts: timestamps, random IDs, changing file order, changing tool descriptions, a modified AGENTS.md, different MCP server instructions, or a changed tool schema that sits ahead of otherwise-stable repo context.
The prompt cache is not a vibes cache. It is a prefix cache.
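A small sketch of why ordering matters. Each string below stands in for a whole run of tokens, and the helper names are hypothetical; a real cache compares token prefixes and may add its own rules about minimum length or granularity, which this ignores.

```python
def shared_prefix_len(previous, current):
    """Count leading items that are identical in both prompts."""
    count = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        count += 1
    return count


STABLE = ["system prompt", "tool schemas", "AGENTS.md", "durable repo context"]


def stable_first(dynamic):
    """Cache-friendly layout: stable content, then per-call content."""
    return STABLE + dynamic


def dynamic_first(dynamic):
    """Cache-unfriendly layout: per-call content leads the prompt."""
    return dynamic + STABLE


call_1 = ["task A", "timestamp 1001"]
call_2 = ["task B", "timestamp 1002"]

friendly = shared_prefix_len(stable_first(call_1), stable_first(call_2))
unfriendly = shared_prefix_len(dynamic_first(call_1), dynamic_first(call_2))
print("stable first: ", friendly, "reusable blocks")
print("dynamic first:", unfriendly, "reusable blocks")
```

The two layouts contain exactly the same content. Only the order differs, and with the dynamic part in front the shared prefix drops to nothing.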
Why Codex Eats Tokens
OpenAI’s Codex prompting docs describe Codex as a loop that calls the model and then performs actions indicated by model output, including file reads, edits, and tool calls (Codex prompting). That loop is where the tokens go.
A typical bugfix looks like this:
user gives task
-> model decides what to inspect
-> reads files
-> receives file contents as input tokens
-> edits files
-> receives patch/diff/result
-> runs tests
-> receives test output/logs
-> diagnoses failure
-> edits again
-> runs tests again
-> summarizes proof
The model call after the first test failure is not looking only at the original prompt. It may need the original task, relevant files, the patch it made, the test command, the failing output, and enough prior trace to know why it made the previous decision.
Here is an illustrative one-call input budget:
persistent instructions: 10k tokens
AGENTS.md: 5k
tool schemas/MCP context: 3k
relevant files: 30k
terminal output: 15k
prior trace/summary: 8k
one model call input: 71k tokens
Now multiply:
12 calls in one task:
12 * 71k = 852k input tokens
8 agents in parallel:
8 * 852k = 6.816M input tokens
That is before counting output tokens. It is also before counting the extra reasoning tokens a reasoning model may generate internally.
Prompt caching can make the stable prefix cheaper and faster. It does not make the loop disappear.
The estimator form is simple. Let \(S\) be the stable prefix tokens per call, \(D_j\) be the dynamic context in call \(j\), and \(h_j\) be the fraction of the stable prefix served from cache in that call.
Then for one agent:
$$ T_c = \sum_{j=1}^{J} h_j S $$
and:
$$ T_u = \sum_{j=1}^{J} (S + D_j) - T_c $$
If the cached-input rate is a fraction \(\gamma\) of the normal input rate, the input-side cost weight is:
$$ T_u + \gamma T_c $$
When \(\gamma = 0.1\), cached tokens still count. They just count less in the rate-weighted math.
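A minimal sketch of the estimator, reusing the illustrative 71k budget. Splitting it into an 18k stable prefix (instructions, AGENTS.md, tool schemas) and 53k of dynamic context is my assumption, as is the hit pattern of a miss on the first call and full hits afterwards.

```python
def input_token_estimate(stable, dynamic_per_call, hit_fractions, gamma):
    """Return cached, uncached, and rate-weighted input tokens for one agent."""
    if len(dynamic_per_call) != len(hit_fractions):
        raise ValueError("need one hit fraction per call")
    cached = sum(h * stable for h in hit_fractions)        # T_c
    total = sum(stable + d for d in dynamic_per_call)      # sum of S + D_j
    uncached = total - cached                              # T_u
    return {
        "cached": cached,
        "uncached": uncached,
        "weighted": uncached + gamma * cached,
    }


S = 18_000                   # assumed stable prefix per call
D = [53_000] * 12            # assumed dynamic context in each of 12 calls
H = [0.0] + [1.0] * 11       # assumed: first call misses, the rest hit fully
estimate = input_token_estimate(S, D, H, gamma=0.1)
print(estimate)
```

Under these assumptions caching discounts 198k of the 852k input tokens, and the other 654k are still paid at the full input rate. The dynamic context dominates, which is why trimming logs and diffs matters as much as a good cache hit rate.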
Tools Are Context
Tools feel free because they are buttons from the user’s point of view. They are not free from the model’s point of view.
A tool has a name, description, arguments, schema, safety rules, and sometimes server-level instructions. MCP adds external tools and context; the Codex MCP docs say server instructions can be read and used alongside the server’s tools (Codex MCP).
That metadata is useful. It tells the model what it can do. It is also context.
More tools can mean:
- more input tokens
- more possible actions
- more routing decisions
- more chances for the model to choose the wrong lever
Tools are leverage. Tool metadata is context.
AGENTS.md Is Also Context
AGENTS.md is the right place for durable repository guidance. OpenAI’s AGENTS.md guide describes it as repository guidance Codex reads before doing work, with global and project-level layers (AGENTS.md).
That is good. It means you do not need to repeat:
preserve user changes
run targeted tests
do not use destructive git commands
include residual risks
in every prompt.
But it is still input. A useful AGENTS.md is great. A bloated one is a tax
you pay on many calls.
The right shape is not “tiny at all costs.” The right shape is “dense with rules that actually change behavior.”
A Synthetic Pete-Style Workflow
I am using “Pete-style” here as a shorthand for a public, synthetic power-user workflow. This is not a claim about any specific person’s private setup.
The high-throughput pattern is not one giant prompt. It is a configured system:
- durable repo rules in AGENTS.md
- reusable skills or commands for repeated workflows
- short task prompts
- many parallel agents
- test/eval-driven loops
- strict Git, PR, and CI rules
- short final summaries with proof
A good AGENTS.md excerpt might look like:
Work style:
- terse
- preserve user changes
- no destructive git commands
- run targeted tests before broad tests
- update changelog for user-visible changes
- final response must include files changed,
tests run, and residual risks
PR/CI:
- for "fix ci", inspect failing checks,
patch, rerun, repeat
- for "land", preserve contributor credit
and merge only when green
- never hide failing tests
Then the actual prompt can be tiny:
fix ci on PR 482.
preserve contributor credit.
land when green.
The agent expands that into work:
- inspect git status
- fetch PR
- inspect failing checks
- read logs
- locate likely files
- patch
- run targeted tests
- rerun CI-relevant checks
- review diff
- commit/push
- watch CI
- summarize proof
That workflow is powerful because the human prompt is short and the repo rules are durable. It is expensive because the agent is doing real work: reading, editing, testing, observing, and repairing.
It is not “autocomplete but more expensive.” It is a junior engineer with a very large reading bill and no patience for pretending a failed test passed.
Practical Ways To Waste Fewer Tokens
The goal is not to minimize tokens at all costs. That is how you get cheap bad work. The goal is to spend tokens on useful context and avoid paying repeatedly for noise.
- Put stable content first.
- Keep AGENTS.md useful but not bloated.
- Use skills or task-specific docs for detailed workflows that do not always need to load.
- Avoid dumping huge logs into the prompt. Ask Codex to inspect relevant failures.
- Prefer targeted tests before huge suites.
- Keep prompts short once repo rules are encoded.
- Minimize MCP servers and tool surfaces for the task.
- Reuse stable prefixes when possible.
- Put timestamps, random IDs, latest logs, and current diffs late.
- Ask for concise final summaries.
- Log input_tokens, cached_tokens, output_tokens, reasoning_tokens, and total_tokens.
- Treat cached input as cheaper, not free.
- Use smaller or faster models for routine mechanical tasks when appropriate.
- Use stronger reasoning models for ambiguous, architectural, or high-risk changes.
The blunt version: do not starve the model of the context needed to do the job, but stop feeding it junk just because pasting is easy.
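For the logging item, a sketch of flattening a usage record into one log line. The nested field layout below is an assumption about how a usage object might be shaped, and the sample numbers are invented; check the names against what your client actually returns.

```python
import json


def summarize_usage(usage):
    """Flatten a usage dict into the fields worth logging per model call."""
    input_details = usage.get("input_tokens_details") or {}
    output_details = usage.get("output_tokens_details") or {}
    input_tokens = usage.get("input_tokens", 0)
    cached = input_details.get("cached_tokens", 0)
    return {
        "input_tokens": input_tokens,
        "cached_tokens": cached,
        "uncached_input_tokens": input_tokens - cached,
        "output_tokens": usage.get("output_tokens", 0),
        "reasoning_tokens": output_details.get("reasoning_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
        "cache_hit_ratio": cached / input_tokens if input_tokens else 0.0,
    }


# Invented sample record in the assumed shape.
sample_usage = {
    "input_tokens": 71_000,
    "input_tokens_details": {"cached_tokens": 18_000},
    "output_tokens": 1_200,
    "output_tokens_details": {"reasoning_tokens": 800},
    "total_tokens": 72_200,
}
summary = summarize_usage(sample_usage)
print(json.dumps(summary))
```

Logging one such line per model call, not per task, is what makes the loop visible: the call count and the per-call cache hit ratio are the two numbers the UI tends to hide.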
The Mental Model
Codex token usage is not mysterious once you stop thinking of it as autocomplete.
It is a fleet of model calls reading repo state, tool state, instructions, logs, diffs, and its own prior work. Prompt caching makes the repeated stable parts cheaper. It does not make them disappear.
The bill comes from the loop.
Sources
- OpenAI prompt caching: exact prefix matching, cached input accounting, prompt cache retention, and the distinction between cached prompt computation and fresh response generation.
- OpenAI pricing: current pricing buckets for input, cached input, output, and built-in tools.
- OpenAI reasoning models: reasoning tokens, usage fields, context-window impact, and output-token billing.
- Codex prompting: Codex’s model/tool loop, thread context, gathered file/tool output, and compaction.
- Codex AGENTS.md: durable repository guidance and instruction layering.
- Codex MCP: MCP tools, server instructions, and tool configuration.
- Codex best practices: reusable instructions, verification, testing, review, and workflow guidance.
- Attention Is All You Need: Transformer attention and causal decoder masking.
Sources checked June 8, 2026.