
where the tokens go

Brandon Szeto · June 8, 2026

The One-Sentence Prompt That Wasn’t

The first confusing part of Codex token usage is that the user interface is a liar by omission.

You type:

fix the flaky test

Then the usage report says something like 300k tokens. That feels insane if you imagine the model received one sentence, thought for a moment, and wrote back a patch.

That is not what happened. Codex is not a single prompt and response. It is a loop:

read -> reason -> tool
     -> observe -> edit
     -> test -> observe
     -> repair -> summarize

Each step can involve another model call. Each model call needs enough context to decide the next action. That context can include the system prompt, developer instructions, AGENTS.md, tool schemas, MCP server descriptions, repo files, terminal output, diffs, test failures, prior trace summaries, the current user request, and generated output from earlier turns.

The visible prompt is the kickoff. The expensive part is the loop.

A Small Token Lab

The numbers below are fake on purpose. They are for intuition, not pricing. Move the sliders and watch what happens when a task goes from one model call to twelve, or when one local agent becomes eight parallel agents.

Agent Loop Estimator (interactive widget; default settings shown)

model calls per task: 12
parallel agents: 8
stable prefix / call: 18k
dynamic context / call: 53k
visible output / call: 1.4k
reasoning / call: 2.0k
prefix cache hit rate: 85%
cached/input rate ratio: 0.10x
output/input rate ratio: 4.0x

Readouts at those settings:

One call input: 71k
Total input: 6.8M
Cached input: 1.5M
Generated: 326k

Where the Tokens Land (chart: uncached input, cached input, output + reasoning, with a rate-weighted total)

Same Cached File, Different Query (interactive figure)

prefix: system, tools, AGENTS.md, main.py (K/V cache)
query: login fails
attention shares shown: arg parsing 18%, cache writes 28%

The cached prefix is unchanged. The fresh request changes the query vectors and attention weights.

The important shape is not the exact number. It is the multiplication:

Let $I_j$ be the input tokens in model call $j$. Then:

$$ T_{\text{task}} = \sum_{j=1}^{J} I_j $$

For parallel work, let $I_{a,j}$ be the input tokens for agent $a$ in call $j$:

$$ T_{\text{fleet}} = \sum_{a=1}^{A}\sum_{j=1}^{J_a} I_{a,j} $$

One small task can become many large model calls. Many small tasks in parallel can become a very large number before anything looks dramatic in the UI.
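The two sums above are easy to sanity-check in a few lines. This is a minimal sketch: the agent names and per-call sizes are hypothetical numbers picked for illustration, not measurements.

```python
# Hypothetical per-call input sizes (tokens) for two agents; not measured data.
calls_by_agent = {
    "agent_a": [40_000, 55_000, 71_000],
    "agent_b": [38_000, 62_000],
}


def task_input_tokens(call_inputs):
    """T_task: the sum of I_j over the model calls of one agent."""
    return sum(call_inputs)


def fleet_input_tokens(calls):
    """T_fleet: the sum of I_{a,j} over every agent a and every call j."""
    return sum(task_input_tokens(per_agent) for per_agent in calls.values())


print(task_input_tokens(calls_by_agent["agent_a"]))  # 166000
print(fleet_input_tokens(calls_by_agent))            # 266000
```

Note that agents do not need the same number of calls. The inner sum runs to each agent's own call count.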

What A Token Is

A token is the unit the model reads and writes. It is not exactly a word. It can be a word, part of a word, punctuation, whitespace, a chunk of code, JSON syntax, or a weird little boundary that only makes sense to the tokenizer.

For example:

"OpenAI is cool."

might be split roughly like:

OpenAI
 is
 cool
.

Do not treat that as an exact tokenizer result. The exact split depends on the model and tokenizer. The point is simpler: models do not read “words” in the way humans do. They read token IDs.

That matters because code is token-dense. Logs are token-dense. JSON tool schemas are token-dense. A terminal dump with paths, stack traces, quoted strings, ANSI leftovers, timestamps, and repeated lines can be a stupid amount of text for a model to read.

Tokens include:

your prompt
hidden or durable instructions
tool schemas
file contents
diffs
terminal output
test output
generated answers
reasoning tokens, when the model uses them

So “I only typed one sentence” is usually true and irrelevant.

The Billing Buckets

The clean accounting model is:

input tokens: what the model reads
cached input tokens: input prefix tokens whose computation was reused
output tokens: what the model writes
reasoning tokens: internal generated tokens used by reasoning models, usually counted inside output usage
tool/service costs: extra charges from built-in tools or services, when those are involved

OpenAI’s pricing page separates input, cached input, and output columns, and its reasoning guide says reasoning tokens are not visible as raw text but still occupy context and are billed as output tokens for reasoning models (pricing, reasoning).

The useful formula is:

Let $T_i$ be total input, $T_c$ be cached input, $T_u$ be uncached input, $T_o$ be output, and $C_t$ be any tool or service charges.

$$ T_u = T_i - T_c $$

Then:

$$ \begin{aligned} C ={}& T_u r_i + T_c r_c \\ &+ T_o r_o + C_t \end{aligned} $$

where each $r$ is the current rate for the model and service tier. I am not hardcoding rates here because they change. The thing worth remembering is more durable: cached input is cheaper input, not nonexistent input.
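The cost formula translates directly into a small function. This is a sketch, not a billing tool: the function name and the relative rates in the example (input 1.0, cached 0.10, output 4.0) are illustrative assumptions, and real rates belong on the pricing page.

```python
def rate_weighted_cost(total_input, cached_input, output,
                       r_in, r_cached, r_out, tool_charges=0.0):
    """C = T_u * r_i + T_c * r_c + T_o * r_o + C_t, with T_u = T_i - T_c.

    `output` should already include reasoning tokens, since those are
    counted inside output usage.
    """
    if not 0 <= cached_input <= total_input:
        raise ValueError("cached input must be between 0 and total input")
    uncached_input = total_input - cached_input
    return (uncached_input * r_in
            + cached_input * r_cached
            + output * r_out
            + tool_charges)


# Illustrative relative rates only; these are assumptions, not real prices.
cost = rate_weighted_cost(
    total_input=852_000, cached_input=183_600, output=40_800,
    r_in=1.0, r_cached=0.10, r_out=4.0,
)
print(round(cost))  # 849960, in units of "uncached input tokens"
```

Setting the cached rate to zero in this sketch would be the "cached is free" mistake. The cached term shrinks, but it stays in the sum.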

Prefill And Decode

LLM inference has two phases that are easy to blur together:

prefill:
  process the whole input/context

decode:
  generate output tokens one at a time

Prefill is the cost of reading. Decode is the cost of writing.

During prefill, the model processes the prompt through its Transformer layers. For a prompt of length $n$, it builds internal representations for those $n$ positions. In an autoregressive decoder model, token $i$ is allowed to attend only to tokens at positions $\le i$. That causal structure is what makes next-token generation work.

During decode, the model generates one token, appends it to the sequence, then generates the next token conditioned on everything so far. That sequential dependency is one reason output tokens can be more expensive than input tokens: writing 1,000 tokens is not one big parallel operation in the same way reading a prompt can be.

The short version:

prefill = read the context
decode  = write the answer

For Codex, “read the context” is often the painful part because the context is not only the user’s sentence. It is the working state of a software project.

Full Prefill Means The Whole Prompt

Full prefill means the serving system processes the entire prompt from scratch through all model layers before the first generated token comes out.

[system prompt]
[developer instructions]
[AGENTS.md]
[tool schemas]
[repo files]
[terminal logs]
[user request]
      |
      v
full prefill
      |
      v
first output token

Inside the attention layers, the model computes query, key, and value vectors. In simplified single-layer notation:

$$ Q = XW_Q $$

$$ K = XW_K $$

$$ V = XW_V $$

and attention is:

$$ A = \operatorname{softmax}(Z)V $$

$$ Z = QK^T / \sqrt{d_k} $$

Real serving stacks have batching, kernels, memory management, quantization, speculative tricks, and many implementation details. None of that changes the basic mental model: prefill is the model doing the expensive first pass over the input so it can start generating.
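For readers who prefer code to notation, here is a single-head, single-layer version of those equations in plain Python. It is a toy under stated assumptions: the matrices are tiny hand-picked values, there is no batching or multi-head logic, and the causal mask from the prefill section is applied by only scoring positions up to the current one.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Row t mixes V[0..t] with weights softmax(q_t . k_i / sqrt(d_k))."""
    d_k = len(K[0])
    out = []
    for t, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, K[i])) / math.sqrt(d_k)
                  for i in range(t + 1)]  # causal mask: positions <= t only
        weights = softmax(scores)
        out.append([sum(w * V[i][c] for i, w in enumerate(weights))
                    for c in range(len(V[0]))])
    return out


# Hypothetical toy inputs: 3 token positions, model width 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 0.5]]
W_V = [[2.0, 0.0], [0.0, 2.0]]

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
A = causal_attention(Q, K, V)
print(A[0])  # position 0 can only see itself, so this equals V[0]
```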

KV Cache Is Not Prompt Cache

There are two related caches that people often mash together.

KV Cache Inside One Request

Once a model has processed a prompt, it has key/value tensors for the prompt positions at each layer. During decode, it does not need to recompute the whole prompt for every generated token. It can keep the already-computed keys and values around and only compute the new token’s contribution.

That is the KV cache.

It is intra-request. It helps the model avoid rereading the prompt every time it writes the next token.
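A counting sketch makes the saving concrete. It counts, per layer, how many token positions need fresh key/value computation to produce a given number of output tokens. The function names are mine, and the model is deliberately crude: it ignores batching, kernels, and memory cost.

```python
def positions_without_cache(prompt_len, new_tokens):
    """Recompute K/V for the whole sequence before every generated token."""
    # Before token s (0-based) the sequence holds prompt_len + s positions.
    return sum(prompt_len + s for s in range(new_tokens))


def positions_with_cache(prompt_len, new_tokens):
    """Prefill once, then add K/V for one new position per decode step."""
    if new_tokens == 0:
        return 0
    # The last generated token never has to be fed back in.
    return prompt_len + (new_tokens - 1)


print(positions_without_cache(1_000, 100))  # 104950
print(positions_with_cache(1_000, 100))     # 1099
```

The gap grows with both prompt length and output length, which is why no serious serving stack decodes without a KV cache.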

Prompt Cache Across Requests

Prompt caching is cross-request. If a later request starts with the exact same prefix, the serving system can reuse cached computation for that prefix. OpenAI’s prompt caching guide describes this as exact prefix matching and says prompt caching works automatically for supported models (prompt caching).

Request 1:
  [static prefix:
    system + tools + AGENTS.md + main.py]
  [dynamic suffix: task A]
  -> compute prefix from scratch

Request 2:
  [same static prefix:
    system + tools + AGENTS.md + main.py]
  [dynamic suffix: task B]
  -> reuse cached prefix
  -> compute only new suffix

The OpenAI guide is explicit that cache hits require exact prefix matches. It also says cached prompt data affects latency and cost, not the final answer: the response is still computed fresh for the request.
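The Request 1 / Request 2 picture can be mimicked with a toy prefix cache. This is a sketch of the idea only: the class is hypothetical, it stores labels instead of token IDs and tensors, and it ignores the granularity and retention rules a real serving system has.

```python
class ToyPrefixCache:
    """Remembers which prompt prefixes have already been processed."""

    def __init__(self):
        self._seen = set()

    def store(self, tokens):
        # Record every prefix of a processed prompt (fine for a toy).
        for end in range(1, len(tokens) + 1):
            self._seen.add(tuple(tokens[:end]))

    def cached_prefix_len(self, tokens):
        """Length of the longest exact prefix that was seen before."""
        hit = 0
        for end in range(1, len(tokens) + 1):
            if tuple(tokens[:end]) not in self._seen:
                break
            hit = end
        return hit


static_prefix = ["system", "tools", "AGENTS.md", "main.py"]
request_1 = static_prefix + ["task A"]
request_2 = static_prefix + ["task B"]

cache = ToyPrefixCache()
print(cache.cached_prefix_len(request_1))  # 0: nothing cached yet
cache.store(request_1)
hit = cache.cached_prefix_len(request_2)
print(hit, len(request_2) - hit)           # 4 reused, 1 computed fresh
```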

That last detail is the whole ballgame.

But How Can Cached `main.py` Be Reused?

This is the subtle objection:

If main.py is cached, how can the same cached file be useful for 1,000 different questions? Doesn’t the model need to reread it differently for each task?

No. Not in the prefix positions.

In a causal decoder-only Transformer, earlier tokens cannot attend to later tokens. The representation and key/value tensors for a static prefix do not depend on the future user query, because the future user query is not visible to those prefix tokens.

Suppose the prefix is:

[system][tools][AGENTS.md][main.py]

and the suffix is:

[user asks about login]

The cached part is:

K/V for main.py

The fresh per-query part is:

Q vectors for the new request
attention weights:
  request <-> main.py
suffix computation
generated answer

For a later token $t$ in the request, attention into the cached prefix looks roughly like:

$$ \alpha_{t,i} = \operatorname{softmax}_i \left( \frac{q_t k_i^T}{\sqrt{d_k}} \right) $$

where $k_i$ can come from cached main.py, but $q_t$ is fresh for the new query. Different query, different $q_t$. Different $q_t$, different attention weights.

So:

Query A: "Why does login fail?"
  attends mostly to login/auth code

Query B: "Why does --config not work?"
  attends mostly to arg parsing code

Query C: "Can this corrupt the local cache?"
  attends mostly to cache-writing code

Same cached main.py K/V. Different query vectors. Different attention weights. Different answer.

Caching does not mean the model has pre-decided what main.py means for every possible task. It means the reusable prefix tensors are already computed. Relevance is still computed fresh.
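The same point in code: hold one set of key vectors fixed and change only the query. The vectors here are hand-made and hypothetical, one axis per region of main.py, so the result is rigged to be readable. Real keys and queries are learned and far higher-dimensional.

```python
import math


def attention_weights(q, keys):
    """softmax over i of (q . k_i) / sqrt(d_k)."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# "Cached" keys for three regions of main.py. Computed once, never changed.
regions = ["login/auth code", "arg parsing code", "cache-writing code"]
cached_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Fresh query vectors, one per request (hypothetical values).
queries = {
    "Why does login fail?": [4.0, 0.5, 0.5],
    "Why does --config not work?": [0.5, 4.0, 0.5],
    "Can this corrupt the local cache?": [0.5, 0.5, 4.0],
}


def top_region(q):
    weights = attention_weights(q, cached_keys)
    return regions[weights.index(max(weights))]


for text, q in queries.items():
    print(text, "->", top_region(q))
```

Each query lands on a different region even though `cached_keys` is never touched.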

Prompt Cache Hits Are Fragile

Prompt caching is prefix reuse, not semantic similarity. “These two prompts basically say the same thing” does not matter if the prefix bytes/tokens are not the same.

Cache-friendly:

stable first:
  system prompt
  tool schemas
  AGENTS.md
  durable repo context

dynamic last:
  current task
  latest terminal logs
  current diff
  random IDs
  timestamps

Cache-unfriendly:

dynamic stuff first
stable stuff later

The annoying parts are the mundane parts: timestamps, random IDs, changing file order, changing tool descriptions, a modified AGENTS.md, different MCP server instructions, or a tool schema that changes while sitting ahead of otherwise-stable repo context. Anything that changes early in the prompt invalidates the cached prefix for everything after it.

The prompt cache is not a vibes cache. It is a prefix cache.
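Ordering is easy to test on paper. The sketch below builds two consecutive requests in each layout and measures how much leading content they share. The segment labels and helper names are hypothetical, and each label stands in for a block of tokens.

```python
def shared_prefix_len(a, b):
    """Number of leading items that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


stable = ["system prompt", "tool schemas", "AGENTS.md", "durable repo context"]


def stable_first(timestamp, task):
    return stable + [timestamp, task]


def dynamic_first(timestamp, task):
    return [timestamp, task] + stable


friendly = shared_prefix_len(stable_first("t=1", "task A"),
                             stable_first("t=2", "task B"))
unfriendly = shared_prefix_len(dynamic_first("t=1", "task A"),
                               dynamic_first("t=2", "task B"))
print(friendly, unfriendly)  # 4 0
```

Same content, same total length, and the only thing that changed is where the timestamp sits.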

Why Codex Eats Tokens

OpenAI’s Codex prompting docs describe Codex as a loop that calls the model and then performs actions indicated by model output, including file reads, edits, and tool calls (Codex prompting). That loop is where the tokens go.

A typical bugfix looks like this:

user gives task
  -> model decides what to inspect
  -> reads files
  -> receives file contents as input tokens
  -> edits files
  -> receives patch/diff/result
  -> runs tests
  -> receives test output/logs
  -> diagnoses failure
  -> edits again
  -> runs tests again
  -> summarizes proof

The model call after the first test failure is not looking only at the original prompt. It may need the original task, relevant files, the patch it made, the test command, the failing output, and enough prior trace to know why it made the previous decision.

Here is an illustrative one-call input budget:

persistent instructions:   10k tokens
AGENTS.md:                  5k
tool schemas/MCP context:   3k
relevant files:            30k
terminal output:           15k
prior trace/summary:        8k

one model call input:      71k tokens

Now multiply:

12 calls in one task:
  12 * 71k = 852k input tokens

8 agents in parallel:
  8 * 852k = 6.816M input tokens

That is before counting output tokens. It is also before counting the extra reasoning tokens a reasoning model may generate internally.
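The arithmetic above, as a script you can edit. The budget lines are the illustrative numbers from this section, not measurements.

```python
# Illustrative one-call input budget from the text (tokens).
budget = {
    "persistent instructions": 10_000,
    "AGENTS.md": 5_000,
    "tool schemas/MCP context": 3_000,
    "relevant files": 30_000,
    "terminal output": 15_000,
    "prior trace/summary": 8_000,
}

CALLS_PER_TASK = 12
PARALLEL_AGENTS = 8

per_call = sum(budget.values())
per_task = CALLS_PER_TASK * per_call
fleet = PARALLEL_AGENTS * per_task

print(per_call)  # 71000
print(per_task)  # 852000
print(fleet)     # 6816000
```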

Prompt caching can make the stable prefix cheaper and faster. It does not make the loop disappear.

The estimator form is simple. Let $S$ be the stable prefix tokens per call, $D_j$ be dynamic context in call $j$, and $h_j$ be the cache hit fraction for that call.

Then for one agent:

$$ T_c = \sum_{j=1}^{J} h_j S $$

and:

$$ T_u = \sum_{j=1}^{J} (S + D_j) - T_c $$

If the cached-input rate is a fraction $\gamma$ of the normal input rate, the input-side cost weight is:

$$ T_u + \gamma T_c $$

When $\gamma = 0.1$, cached tokens still count. They just count less in the rate-weighted math.
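Here is the estimator for one agent as a function. It is a sketch with assumed inputs: an 18k stable prefix, 53k of dynamic context on each of 12 calls, an 85% hit fraction on every call, and a cached-rate ratio of 0.1. The function name is mine.

```python
def input_side_weight(stable, dynamic_per_call, hit_fractions, gamma):
    """Return (T_c, T_u, T_u + gamma * T_c) for one agent."""
    if len(dynamic_per_call) != len(hit_fractions):
        raise ValueError("need one hit fraction per call")
    cached = sum(h * stable for h in hit_fractions)        # T_c
    total = sum(stable + d for d in dynamic_per_call)      # sum of S + D_j
    uncached = total - cached                              # T_u
    return cached, uncached, uncached + gamma * cached


calls = 12
cached, uncached, weighted = input_side_weight(
    stable=18_000,
    dynamic_per_call=[53_000] * calls,
    hit_fractions=[0.85] * calls,
    gamma=0.1,
)
print(round(cached), round(uncached), round(weighted))  # 183600 668400 686760
```

With these inputs the raw input total is 852,000 tokens and the rate-weighted figure is about 686,760. Caching helped, but the dynamic context dominates and none of it is cacheable in this model.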

Tools Are Context

Tools feel free because they are buttons from the user’s point of view. They are not free from the model’s point of view.

A tool has a name, description, arguments, schema, safety rules, and sometimes server-level instructions. MCP adds external tools and context; the Codex MCP docs say server instructions can be read and used alongside the server’s tools (Codex MCP).

That metadata is useful. It tells the model what it can do. It is also context.

More tools can mean:

more input tokens
more possible actions
more routing decisions
more chances for the model to choose the wrong lever

Tools are leverage. Tool metadata is context.

`AGENTS.md` Is Also Context

AGENTS.md is the right place for durable repository guidance. OpenAI’s AGENTS.md guide describes it as repository guidance Codex reads before doing work, with global and project-level layers (AGENTS.md).

That is good. It means you do not need to repeat:

preserve user changes
run targeted tests
do not use destructive git commands
include residual risks

in every prompt.

But it is still input. A useful AGENTS.md is great. A bloated one is a tax you pay on many calls.

The right shape is not “tiny at all costs.” The right shape is “dense with rules that actually change behavior.”

A Synthetic Pete-Style Workflow

I am using “Pete-style” here as a shorthand for a public, synthetic power-user workflow. This is not a claim about any specific person’s private setup.

The high-throughput pattern is not one giant prompt. It is a configured system:

durable repo rules in AGENTS.md
reusable skills or commands for repeated workflows
short task prompts
many parallel agents
test/eval-driven loops
strict Git, PR, and CI rules
short final summaries with proof

A good AGENTS.md excerpt might look like:

Work style:
- terse
- preserve user changes
- no destructive git commands
- run targeted tests before broad tests
- update changelog for user-visible changes
- final response must include files changed,
  tests run, and residual risks

PR/CI:
- for "fix ci", inspect failing checks,
  patch, rerun, repeat
- for "land", preserve contributor credit
  and merge only when green
- never hide failing tests

Then the actual prompt can be tiny:

fix ci on PR 482.
preserve contributor credit.
land when green.

The agent expands that into work:

inspect git status
fetch PR
inspect failing checks
read logs
locate likely files
patch
run targeted tests
rerun CI-relevant checks
review diff
commit/push
watch CI
summarize proof

That workflow is powerful because the human prompt is short and the repo rules are durable. It is expensive because the agent is doing real work: reading, editing, testing, observing, and repairing.

It is not “autocomplete but more expensive.” It is a junior engineer with a very large reading bill and no patience for pretending a failed test passed.

Practical Ways To Waste Fewer Tokens

The goal is not to minimize tokens at all costs. That is how you get cheap bad work. The goal is to spend tokens on useful context and avoid paying repeatedly for noise.

Put stable content first.
Keep AGENTS.md useful but not bloated.
Use skills or task-specific docs for detailed workflows that do not always need to load.
Avoid dumping huge logs into the prompt. Ask Codex to inspect relevant failures.
Prefer targeted tests before huge suites.
Keep prompts short once repo rules are encoded.
Minimize MCP servers and tool surfaces for the task.
Reuse stable prefixes when possible.
Put timestamps, random IDs, latest logs, and current diffs late.
Ask for concise final summaries.
Log input_tokens, cached_tokens, output_tokens, reasoning_tokens, and total_tokens.
Treat cached input as cheaper, not free.
Use smaller or faster models for routine mechanical tasks when appropriate.
Use stronger reasoning models for ambiguous, architectural, or high-risk changes.
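The logging item is the easiest one to start with. Below is a minimal accumulator for the five usage fields named in the list. The per-call records are hypothetical flat dictionaries. Real API responses may nest or name these fields differently, so treat the shape as an assumption and adapt it to whatever your client returns.

```python
from collections import Counter

FIELDS = ("input_tokens", "cached_tokens", "output_tokens",
          "reasoning_tokens", "total_tokens")


def add_usage(totals, usage):
    """Add one call's usage record into the running totals."""
    for field in FIELDS:
        totals[field] += usage.get(field, 0)
    return totals


def cache_hit_ratio(totals):
    """Share of input tokens that were served from the prompt cache."""
    if totals["input_tokens"] == 0:
        return 0.0
    return totals["cached_tokens"] / totals["input_tokens"]


# Hypothetical records for two model calls in one task.
records = [
    {"input_tokens": 71_000, "cached_tokens": 0,
     "output_tokens": 3_400, "reasoning_tokens": 2_000, "total_tokens": 74_400},
    {"input_tokens": 71_000, "cached_tokens": 15_300,
     "output_tokens": 3_400, "reasoning_tokens": 2_000, "total_tokens": 74_400},
]

totals = Counter()
for record in records:
    add_usage(totals, record)

print(dict(totals))
print(f"cache hit ratio: {cache_hit_ratio(totals):.1%}")
```

Watching the hit ratio per task is a cheap way to notice when a timestamp or reordered tool list has quietly broken the stable prefix.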

The blunt version: do not starve the model of the context needed to do the job, but stop feeding it junk just because pasting is easy.

The Mental Model

Codex token usage is not mysterious once you stop thinking of it as autocomplete.

It is a fleet of model calls reading repo state, tool state, instructions, logs, diffs, and its own prior work. Prompt caching makes the repeated stable parts cheaper. It does not make them disappear.

The bill comes from the loop.

Sources

OpenAI prompt caching: exact prefix matching, cached input accounting, prompt cache retention, and the distinction between cached prompt computation and fresh response generation.

OpenAI pricing: current pricing buckets for input, cached input, output, and built-in tools.

OpenAI reasoning models: reasoning tokens, usage fields, context-window impact, and output-token billing.

Codex prompting: Codex’s model/tool loop, thread context, gathered file/tool output, and compaction.

Codex AGENTS.md: durable repository guidance and instruction layering.

Codex MCP: MCP tools, server instructions, and tool configuration.

Codex best practices: reusable instructions, verification, testing, review, and workflow guidance.

Attention Is All You Need: Transformer attention and causal decoder masking.

Sources checked June 8, 2026.