For one request, how long until the agent fully finishes responding (including intermediate tool waits), end to end?
Request response time is the span, within one session, from the user message that starts a
request to the agent’s final response before the next user-started request. It is
computed by a stateful single-pass walk over agent steps in ingestion order (round_pk == file order),
keeping current_user_turn_by_session: {session_id -> {provider, start_at, last_output_at}}.
A request is bounded by a small state machine. close_user_turn(session_id) pops the session’s open
request; if it has both a start_at and a last_output_at and the elapsed
dur = (last_output_at − start_at).total_seconds() is strictly positive, that duration is
appended to the "all" list and to the request’s provider bucket. For each step, in order:
- start = the request-start user-message timestamp for the step: among the step’s timing_events, take the earliest model-output (reasoning/text/tool_call) timestamp as first_output, keep the user_message timestamps at-or-before first_output, and take the latest such candidate (None if there is no user message or no output, or none qualifies). If start is not None and the step has a string session_id, close any open request for that session, then open a fresh request {provider, start_at: start, last_output_at: None}.
- resp_end = the step’s last response-end timestamp (the latest reasoning/text/tool_call timestamp). If the session has a string session_id and resp_end is not None and the session has an open request, advance that request’s last_output_at when it is unset or resp_end is strictly later.
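The two per-step values can be sketched as follows. This is a minimal illustration, not the loader's actual code: it assumes timing_events is available as a list of (kind, timestamp) pairs, and the helper names request_start and response_end are hypothetical.

```python
from datetime import datetime, timedelta

# Event kinds that count as model output, per the description above.
OUTPUT_KINDS = {"reasoning", "text", "tool_call"}


def request_start(timing_events):
    """Latest user_message at-or-before the first model output, else None.

    Assumption: timing_events is a list of (kind, timestamp) pairs.
    """
    outputs = [ts for kind, ts in timing_events if kind in OUTPUT_KINDS]
    if not outputs:
        return None  # no model output in this step
    first_output = min(outputs)
    candidates = [
        ts
        for kind, ts in timing_events
        if kind == "user_message" and ts <= first_output
    ]
    return max(candidates) if candidates else None


def response_end(timing_events):
    """Latest model-output timestamp in the step, else None."""
    outputs = [ts for kind, ts in timing_events if kind in OUTPUT_KINDS]
    return max(outputs) if outputs else None


t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [
    ("user_message", t0),                          # stale/resumed message
    ("user_message", t0 + timedelta(seconds=5)),   # the actual trigger
    ("reasoning", t0 + timedelta(seconds=7)),
    ("tool_call", t0 + timedelta(seconds=9)),
    ("user_message", t0 + timedelta(seconds=20)),  # after first output: ignored
]
start = request_start(events)    # t0 + 5s
resp_end = response_end(events)  # t0 + 9s
```

Taking the latest qualifying user message, rather than the earliest, is what keeps a stale message replayed at the top of a step from stretching the measured span.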
After the walk, every still-open request is flushed with close_user_turn in dict-insertion order
(end-of-stream flush), so the final request of each session contributes its response time too.
This is a trace-level estimate, not a serving-engine timer; it reflects only recorded events.
The span includes intermediate tool-triggered generations and observed tool waits within the
request, and excludes the following human wait and post-response usage-accounting events. The
trigger is the latest user_message at or before the first model output in a row, so stale/resumed user
messages embedded earlier in the row are not counted.
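The walk and the end-of-stream flush described above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the production implementation: each step is assumed to arrive as a (session_id, provider, start, resp_end) tuple with start and resp_end already derived, and the function name request_response_times is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def request_response_times(steps):
    """Single-pass walk over steps given in ingestion (file) order.

    Assumption: each step is a (session_id, provider, start, resp_end) tuple.
    Returns a dict of duration lists keyed by "all" and by provider.
    """
    current_user_turn_by_session = {}
    durations = defaultdict(list)

    def close_user_turn(session_id):
        turn = current_user_turn_by_session.pop(session_id, None)
        if turn is None:
            return
        if turn["start_at"] is None or turn["last_output_at"] is None:
            return
        dur = (turn["last_output_at"] - turn["start_at"]).total_seconds()
        if dur > 0:  # only strictly positive spans are recorded
            durations["all"].append(dur)
            durations[turn["provider"]].append(dur)

    for session_id, provider, start, resp_end in steps:
        if not isinstance(session_id, str):
            continue  # both transitions require a string session_id
        if start is not None:
            close_user_turn(session_id)
            current_user_turn_by_session[session_id] = {
                # Fallback expression as quoted in the text.
                "provider": str(provider) or "<unknown-provider>",
                "start_at": start,
                "last_output_at": None,
            }
        turn = current_user_turn_by_session.get(session_id)
        if resp_end is not None and turn is not None:
            if turn["last_output_at"] is None or resp_end > turn["last_output_at"]:
                turn["last_output_at"] = resp_end

    # End-of-stream flush in dict-insertion order.
    for session_id in list(current_user_turn_by_session):
        close_user_turn(session_id)
    return dict(durations)


t0 = datetime(2024, 1, 1)
s = lambda n: t0 + timedelta(seconds=n)
result = request_response_times([
    ("s1", "a", s(0), s(2)),    # request opens in session s1
    ("s2", "", s(1), s(5)),     # empty provider in another session
    ("s1", "a", None, s(10)),   # tool-triggered continuation extends s1
    ("s1", "a", s(60), s(63)),  # new user message closes the first s1 request
])
# result["all"] == [10.0, 4.0, 3.0]
```

In the usage example the first s1 request spans 10 seconds because the continuation step advances last_output_at; the s2 request is flushed before the second s1 request because re-opening s1 moved it to the end of the dict.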
Method and assumptions:
- Exact, not sampled. Every positive request duration contributes one value to its provider’s list (and to all); the percentiles run over the full set. The old loader already kept every value here (there was never a reservoir cap on this metric), so the migration is value-for-value identical.
- File-order state. The walk is over round_pk (ingestion ordinal == file order), reproducing the line-order tie-break the old single-pass JSONL loader relied on for its session state, including the dict-insertion-order end-of-stream flush.
- Provider grouping mirrors the old loader’s str(provider) or "<unknown-provider>" fallback, so a missing/empty provider falls into <unknown-provider>. The provider stored on a request is the one from the step that opened it.
- Engine-independent timestamps. Timestamps are read from the DB as integer epoch-microseconds (CAST(epoch_us(timestamp) AS BIGINT)) and rebuilt to naive datetimes in Python, never fetched as a raw TIMESTAMP (native duckdb marshals that to a datetime, duckdb-wasm to a string). A difference between two same-timezone datetimes equals the naive-microsecond difference exactly, so the durations match the pre-DuckDB result bit-for-bit.
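One way to rebuild naive datetimes from integer epoch-microseconds is shown below. This is a sketch under assumptions: the helper name naive_from_epoch_us is hypothetical, and the integers stand in for values already fetched with the CAST expression above.

```python
from datetime import datetime, timedelta

_EPOCH = datetime(1970, 1, 1)  # naive epoch origin


def naive_from_epoch_us(us: int) -> datetime:
    """Rebuild a naive datetime from integer epoch-microseconds.

    Adding a timedelta uses integer arithmetic only, so no float rounding
    is introduced (unlike dividing by 1e6 and calling fromtimestamp).
    """
    return _EPOCH + timedelta(microseconds=us)


# Hypothetical values standing in for two fetched BIGINT timestamps.
start_at = naive_from_epoch_us(1_700_000_000_000_000)
last_output_at = naive_from_epoch_us(1_700_000_000_123_456)
elapsed = last_output_at - start_at  # timedelta(microseconds=123456)
```

Because both datetimes come from the same integer origin, their difference is the integer microsecond difference, which is the property the bit-for-bit claim relies on.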