Per LLM step, how large are the inputs and outputs — and how do Claude and Codex differ?
Per LLM step, how large are the inputs and outputs — and how do Claude and Codex differ?
Prefix tokens
| Tokens | Avg | P25 | P50 | P90 | P99 |
|---|---|---|---|---|---|
| Claude | 194,361 | 59,949 | 126,180 | 467,082 | 918,111 |
| Codex | 116,623 | 67,456 | 115,584 | 201,600 | 231,040 |
Append tokens
| Tokens | Avg | P25 | P50 | P90 | P99 |
|---|---|---|---|---|---|
| Claude | 8,479 | 384 | 857 | 5,342 | 232,206 |
| Codex | 5,283 | 406 | 886 | 8,310 | 121,009 |
Output tokens
| Tokens | Avg | P25 | P50 | P90 | P99 |
|---|---|---|---|---|---|
| Claude | 690 | 132 | 252 | 1,671 | 6,571 |
| Codex | 415 | 70 | 184 | 939 | 3,508 |
This is the single table feeding tab:token_length_distribution, and it makes the central asymmetry
of the workload concrete: per step, inputs are huge and outputs are tiny. On the prefix side a
median step replays 126k cached tokens for Claude and 116k for Codex — and because Claude’s context
window is longer, its prefix stretches to a p99 of 918k while Codex saturates near 231k. The
append side is two orders of magnitude smaller, a median of just 857 fresh tokens for Claude and
886 for Codex, the only slice a step is actually charged for. Output is smaller still: a median
of 252 tokens for Claude and 184 for Codex, with the p90 under 1.7k and even the p99 staying in the
low thousands. That outputs run so short is counterintuitive but follows from the tool loop — a full
response is split across ~8 tool-call steps, so each individual generation is brief and often just
emits the next tool call’s arguments.
This experiment produces the single combined paper table shared by the Input length distribution
and Output length distribution subsections (tab:token_length_distribution in
src/05_LLMGeneration.tex). For each provider (Claude, Codex), over all LLM steps (rounds), it
reports the avg / p25 / p50 / p90 / p99 of three per-step token counts:
prefix_tokens, the replayed accumulated context.newly_append_tokens, the freshly added uncached input.output_tokens, generated tokens with reasoning included.The prefix/append split is the same decomposition as
llm_generation/prefix_append_distribution, and the output column is the same metric as
llm_generation/output_tokens; this experiment exists only to emit the combined per-provider
.tex table. The other two experiments keep their figures and CDFs.
Method and assumptions:
column IS NOT NULL AND column >= 0), matching the two source experiments.# default merged trace, output next to this README
uv run python artifacts/llm_generation/token_length_distribution/analyze.py
# a specific trace (materialized to a temp DuckDB cache on first use)
uv run python artifacts/llm_generation/token_length_distribution/analyze.py -i trace/sample.jsonl
# a prebuilt DB, into a chosen dir
uv run python artifacts/llm_generation/token_length_distribution/analyze.py --db /tmp/trace.duckdb -o /tmp/out
token_length_distribution.tex — the combined per-provider table; a copy with a provenance
header lives at figure-tex/tab_token_length_distribution.tex in the paper repo.token_length_distribution.md — GFM Markdown mirror of the table, rendered on the web detail page.headline.json — the few headline numbers for the Overview gallery card.The per-provider stats are also printed to stdout.