How many tokens does the model generate per agent step, and how does that distribution differ by provider / model?
For every agent step the trace records output_tokens — the step’s generated-token count. This
experiment plots the distribution of that count, grouped by provider (or model, or
provider:model), on a base-2 log token axis, and writes per-group quantiles.
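
As a rough illustration of what a base-2 log token axis implies for binning, here is a minimal sketch. The function names `log2_bin` and `log2_histogram` are hypothetical, and the dedicated bin for zero-token steps is an assumption (zero has no logarithm, so it needs some special handling); the experiment's actual plotting code may treat zeros differently.

```python
from collections import Counter
from typing import Dict, Iterable


def log2_bin(tokens: int) -> int:
    """Return the base-2 bin index for a token count.

    Bin k covers [2**k, 2**(k+1) - 1]. Zero gets its own bin, -1
    (an assumption made for this sketch, not taken from the experiment).
    """
    if tokens < 0:
        raise ValueError("token counts must be non-negative")
    # int.bit_length() - 1 equals floor(log2(n)) for n >= 1, and -1 for n == 0.
    return tokens.bit_length() - 1


def log2_histogram(values: Iterable[int]) -> Dict[int, int]:
    """Count values per base-2 bin, ordered by bin index."""
    counts = Counter(log2_bin(v) for v in values)
    return dict(sorted(counts.items()))
```

Using integer `bit_length()` rather than `math.log2` avoids floating-point rounding at exact powers of two, so 1024 always lands in bin 10.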
Method and assumptions:
- What counts. Every step whose `output_tokens` is non-null and `>= 0` contributes one value to
  its group (and to the synthetic `all` group). This matches the old loader's `allow_zero` numeric
  rule: zero-output steps are kept, negatives (never observed) are dropped.
- Provider caveat. For Codex, `output_tokens` includes reasoning tokens; for Claude it is the
  message-level output count. The distributions are therefore not strictly like-for-like across
  providers, so read each provider on its own terms.
- Exact, not sampled. The distribution, percentiles, and histogram are computed over every
  observation. The pre-DuckDB loader reservoir-sampled at 200k values per group to bound memory
  while parsing JSON; querying the materialized DuckDB removes that constraint, so the stats are
  now exact. The summary CSV's `sampled` column is therefore always `False` and `sample_count`
  equals the full `count`. (On any trace below the old 200k cap, e.g. `trace/sample.jsonl`, the
  old path was already exact, so the migration is value-for-value identical there.)
- Group fallbacks. Grouping mirrors the old `group_key()` `"<unknown-provider>"` /
  `"<unknown-model>"` fallbacks via SQL `COALESCE`, so missing/empty provider or model values fall
  into an explicit `<unknown-*>` bucket rather than being dropped.
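
The rules above can be sketched end to end as follows. This is an illustration under stated assumptions, not the experiment's actual query: it uses the standard-library `sqlite3` module as a stand-in for DuckDB so it runs without extra dependencies, the table name `steps`, the row values, and the helper names are hypothetical, and the nearest-rank percentile is one possible definition that may differ from the quantile method the experiment uses.

```python
import sqlite3
from collections import defaultdict
from math import ceil

# Filter and fallback rules from the list above, written in portable SQL.
# NULLIF turns an empty provider into NULL so COALESCE catches both cases.
# The table name `steps` is an assumption made for this sketch.
GROUP_SQL = """
SELECT
    COALESCE(NULLIF(provider, ''), '<unknown-provider>') AS grp,
    output_tokens
FROM steps
WHERE output_tokens IS NOT NULL AND output_tokens >= 0
"""


def nearest_rank(sorted_vals, q):
    """Nearest-rank percentile over a sorted, non-empty list (assumed method)."""
    rank = max(1, ceil(q * len(sorted_vals)))
    return sorted_vals[rank - 1]


def summarize(conn):
    """Build exact per-group stats; every kept step also feeds the `all` group."""
    groups = defaultdict(list)
    for grp, tokens in conn.execute(GROUP_SQL):
        groups[grp].append(tokens)
        groups["all"].append(tokens)
    summary = {}
    for grp, vals in groups.items():
        vals.sort()
        summary[grp] = {
            "count": len(vals),
            "sample_count": len(vals),  # no sampling, so equal to count
            "sampled": False,
            "p50": nearest_rank(vals, 0.50),
            "p90": nearest_rank(vals, 0.90),
        }
    return summary


def demo():
    """Run the summary over a tiny in-memory table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE steps (provider TEXT, model TEXT, output_tokens INTEGER)"
    )
    rows = [
        ("codex", "m1", 0),       # zero-output step: kept
        ("codex", "m1", 120),
        ("claude", "m2", 40),
        (None, "m3", 7),          # missing provider: fallback bucket
        ("", "m3", 9),            # empty provider: fallback bucket
        ("claude", "m2", None),   # null count: dropped
        ("claude", "m2", -1),     # negative count: dropped
    ]
    conn.executemany("INSERT INTO steps VALUES (?, ?, ?)", rows)
    return summarize(conn)
```

Calling `demo()` returns one entry per provider plus the `<unknown-provider>` and `all` groups, each with `sampled` set to `False` and `sample_count` equal to `count`. Grouping by model or by `provider:model` would follow the same pattern with the `"<unknown-model>"` fallback applied to the model column.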