How long does a single agent step’s observable model generation take, and how is total generation time distributed across fast vs. slow steps?
The observable generation time of an agent step is the span from its latest input event
(user_message or tool_result) to its last model-output event (reasoning, text, or
tool_call), taken from the step’s ordered timing_events[]. Concretely, per step: take the
earliest model-output timestamp as first_output; of the input timestamps, keep only those
at-or-before first_output (the inputs that could have triggered this output); the span is
(latest output - latest such input), kept only when strictly positive. Steps with no input or no
output events, or with a non-positive span, contribute nothing.
This is a trace-level estimate, not a serving-engine timer. It excludes the preceding human wait and any post-response usage-accounting events, and it can only reflect events the trace actually recorded.
The experiment renders two complementary CDFs over a per-step generation-time threshold T, paneled
by provider on a fine log-spaced duration axis:
- count CDF — fraction of agent steps with generation time
≤ T. - total CDF — fraction of summed generation time contributed by steps with generation time
≤ T, i.e. where the wall time actually goes.
Method and assumptions:
- Exact, not sampled. Every agent step with a positive observable span contributes one value to its provider’s list; the CDFs, percentiles, and summed-time bins are computed over the full set. (The old loader already kept every value here — there was never a reservoir cap on this metric — so the migration is value-for-value identical.)
- Provider grouping mirrors the old loader’s
str(provider) or "<unknown-provider>"fallback, so a missing/empty provider falls into<unknown-provider>. - Engine-independent timestamps. Timestamps are read from the DB as integer epoch-microseconds
(
CAST(epoch_us(timestamp) AS BIGINT)) and rebuilt to naive datetimes in Python, never fetched as a rawTIMESTAMP(native duckdb marshals that to adatetime, duckdb-wasm to a string). A span between two same-timezone datetimes equals the naive-microsecond span exactly, so durations match the pre-DuckDB result bit-for-bit.