SyFI TraceLab
Trace assistant
Reading the public SYFI pool
357,161 agent steps across Claude & Codex — public, shareable.
Answers run real DuckDB/Python in a sandbox · code is shown
All figures
Session
one continuous trace of work, often spanning multiple requests or problems.
Request
one user input through the agent's final response.
Agent step
one model call inside a request.
User-initiated step
an agent step started by user input.
Tool-triggered step
an agent step started by a tool result.
Problem

Per LLM step, how large are the inputs and outputs — and how do Claude and Codex differ?

Tables

Prefix tokens

TokensAvgP25P50P90P99
Claude194,36159,949126,180467,082918,111
Codex116,62367,456115,584201,600231,040

Append tokens

TokensAvgP25P50P90P99
Claude8,4793848575,342232,206
Codex5,2834068868,310121,009

Output tokens

TokensAvgP25P50P90P99
Claude6901322521,6716,571
Codex415701849393,508
Table 1Per-step prefix, append, and output token lengths (avg / p25 / p50 / p90 / p99), by provider.

This is the single table feeding tab:token_length_distribution, and it makes the central asymmetry of the workload concrete: per step, inputs are huge and outputs are tiny. On the prefix side a median step replays 126k cached tokens for Claude and 116k for Codex — and because Claude’s context window is longer, its prefix stretches to a p99 of 918k while Codex saturates near 231k. The append side is two orders of magnitude smaller, a median of just 857 fresh tokens for Claude and 886 for Codex, the only slice a step is actually charged for. Output is smaller still: a median of 252 tokens for Claude and 184 for Codex, with the p90 under 1.7k and even the p99 staying in the low thousands. That outputs run so short is counterintuitive but follows from the tool loop — a full response is split across ~8 tool-call steps, so each individual generation is brief and often just emits the next tool call’s arguments.

Reference
Experiment overview

This experiment produces the single combined paper table shared by the Input length distribution and Output length distribution subsections (tab:token_length_distribution in src/05_LLMGeneration.tex). For each provider (Claude, Codex), over all LLM steps (rounds), it reports the avg / p25 / p50 / p90 / p99 of three per-step token counts:

  • Prefix tokensprefix_tokens, the replayed accumulated context.
  • Append tokensnewly_append_tokens, the freshly added uncached input.
  • Output tokensoutput_tokens, generated tokens with reasoning included.

The prefix/append split is the same decomposition as llm_generation/prefix_append_distribution, and the output column is the same metric as llm_generation/output_tokens; this experiment exists only to emit the combined per-provider .tex table. The other two experiments keep their figures and CDFs.

Method and assumptions:

  • Exact, not sampled. DuckDB keeps every row, so the percentiles and means run over the full set of valid rounds (no reservoir sampling).
  • Per-column filtering. Each token column is independently restricted to non-null, non-negative values (column IS NOT NULL AND column >= 0), matching the two source experiments.
  • Per step. The unit is one LLM round; there is no session/request aggregation here.
Running it
# default merged trace, output next to this README
uv run python artifacts/llm_generation/token_length_distribution/analyze.py

# a specific trace (materialized to a temp DuckDB cache on first use)
uv run python artifacts/llm_generation/token_length_distribution/analyze.py -i trace/sample.jsonl

# a prebuilt DB, into a chosen dir
uv run python artifacts/llm_generation/token_length_distribution/analyze.py --db /tmp/trace.duckdb -o /tmp/out
Outputs
  • token_length_distribution.tex — the combined per-provider table; a copy with a provenance header lives at figure-tex/tab_token_length_distribution.tex in the paper repo.
  • token_length_distribution.md — GFM Markdown mirror of the table, rendered on the web detail page.
  • headline.json — the few headline numbers for the Overview gallery card.

The per-provider stats are also printed to stdout.

SyFI TraceLab · experiment detail