Token length distribution · SyFI TraceLab

Session

one continuous trace of work, often spanning multiple requests or problems.

Request

one user input through the agent's final response.

Agent step

one model call inside a request.

User-initiated step

an agent step started by user input.

Tool-triggered step

an agent step started by a tool result.

Problem

Per LLM step, how large are the inputs and outputs — and how do Claude and Codex differ?

Tables

Prefix tokens

Tokens	Avg	P25	P50	P90	P99
Claude	194,361	59,949	126,180	467,082	918,111
Codex	116,623	67,456	115,584	201,600	231,040

Append tokens

Tokens	Avg	P25	P50	P90	P99
Claude	8,479	384	857	5,342	232,206
Codex	5,283	406	886	8,310	121,009

Output tokens

Tokens	Avg	P25	P50	P90	P99
Claude	690	132	252	1,671	6,571
Codex	415	70	184	939	3,508

Table 1Per-step prefix, append, and output token lengths (avg / p25 / p50 / p90 / p99), by provider.

This is the single table feeding tab:token_length_distribution, and it makes the central asymmetry of the workload concrete: per step, inputs are huge and outputs are tiny. On the prefix side a median step replays 126k cached tokens for Claude and 116k for Codex — and because Claude’s context window is longer, its prefix stretches to a p99 of 918k while Codex saturates near 231k. The append side is two orders of magnitude smaller, a median of just 857 fresh tokens for Claude and 886 for Codex, the only slice a step is actually charged for. Output is smaller still: a median of 252 tokens for Claude and 184 for Codex, with the p90 under 1.7k and even the p99 staying in the low thousands. That outputs run so short is counterintuitive but follows from the tool loop — a full response is split across ~8 tool-call steps, so each individual generation is brief and often just emits the next tool call’s arguments.

Reference

Experiment overview

This experiment produces the single combined paper table shared by the Input length distribution and Output length distribution subsections (tab:token_length_distribution in src/05_LLMGeneration.tex). For each provider (Claude, Codex), over all LLM steps (rounds), it reports the avg / p25 / p50 / p90 / p99 of three per-step token counts:

Prefix tokens — prefix_tokens, the replayed accumulated context.
Append tokens — newly_append_tokens, the freshly added uncached input.
Output tokens — output_tokens, generated tokens with reasoning included.

The prefix/append split is the same decomposition as llm_generation/prefix_append_distribution, and the output column is the same metric as llm_generation/output_tokens; this experiment exists only to emit the combined per-provider .tex table. The other two experiments keep their figures and CDFs.

Method and assumptions:

Exact, not sampled. DuckDB keeps every row, so the percentiles and means run over the full set of valid rounds (no reservoir sampling).
Per-column filtering. Each token column is independently restricted to non-null, non-negative values (column IS NOT NULL AND column >= 0), matching the two source experiments.
Per step. The unit is one LLM round; there is no session/request aggregation here.

Running it

# default merged trace, output next to this README
uv run python artifacts/llm_generation/token_length_distribution/analyze.py

# a specific trace (materialized to a temp DuckDB cache on first use)
uv run python artifacts/llm_generation/token_length_distribution/analyze.py -i trace/sample.jsonl

# a prebuilt DB, into a chosen dir
uv run python artifacts/llm_generation/token_length_distribution/analyze.py --db /tmp/trace.duckdb -o /tmp/out

Outputs

token_length_distribution.tex — the combined per-provider table; a copy with a provenance header lives at figure-tex/tab_token_length_distribution.tex in the paper repo.
token_length_distribution.md — GFM Markdown mirror of the table, rendered on the web detail page.
headline.json — the few headline numbers for the Overview gallery card.

The per-provider stats are also printed to stdout.