Tool latency distribution · SyFI TraceLab

Session

one continuous trace of work, often spanning multiple requests or problems.

Request

one user input through the agent's final response.

Agent step

one model call inside a request.

User-initiated step

an agent step started by user input.

Tool-triggered step

an agent step started by a tool result.

Problem

How long do tool calls take, per tool and per provider, and where does the long tail of tool latency live — for Claude Code vs Codex?

Figures 4

Fig. 2Tool-call count share versus summed-latency share.

Fig. 4Summed tool-latency CDF by provider.

Reference

Experiment overview

Every agent step carries a tools[] list of tool calls, each with a measured latency. This experiment profiles that per-call latency four ways: a per-tool/per-provider box-and-whisker view, a count-vs-latency mass breakdown over coarse latency bins, and two cumulative CDFs (by call count and by summed latency) over the per-call latency threshold.

Method and assumptions:

Effective tool latency = tool_internal_latency_ms if present, else tool_wall_latency_ms (= result_at − emitted_at; the shared trace_db.EFFECTIVE_TOOL_LATENCY_MS_SQL precedence — the legacy latency_ms field is not in the normalized data). Internal timing is the runner-reported duration (Codex wrapper Wall time, Claude durationMs).
Only strictly-positive latencies feed the distribution. A call with no effective latency is counted as missing_latency; a call with a non-positive effective latency is nonpositive_latency — neither enters the boxplots, bins, percentiles, or CDFs (matching the old ToolStats).
MCP tools are merged (figure only). Any tool whose name starts with mcp_ is aliased to a single mcp bucket; the long server-qualified names are individually rare. CSV summaries keep the raw, unaliased names.
Rare tools collapse (figure only). Tools with fewer than --min-tool-calls-for-plot provider-local calls (default 20) fold into one Other (<N calls/tool) box. CSV summaries keep full per-tool detail.
CDFs are additive over calls. Latency totals sum per call — parallel tools are not collapsed into wall-clock time, so the total-latency CDF measures attributed work, not elapsed session time.
Exact, not sampled. Box quartiles, whiskers, percentiles and CDFs are computed over every positive effective latency pulled from SQL. The old per-tool 50k reservoir sampler is gone, so the summary CSV reports sampled=False and sample_count = the full latency_count for every tool (previously the two highest-volume tools, exec_command and Bash, were reservoir-sampled).

Code structure

plot.py is a thin query→shape→plot pipeline over the shared trace DuckDB:

_per_tool_query(plot_name_expr, *, by_provider) — the shared per-tool aggregation: normalizes the tool name (blank/NULL → <unknown-tool>), applies the effective-latency precedence, and emits per bucket the call/latency_count/missing_latency/nonpositive_latency/error_calls counts, latency_sum/min/max, the exact list(eff) of positive latencies, and a first-appearance first_seen ordinal. plot_name_expr selects the raw name (CSV) or the mcp_*→mcp alias (figure); by_provider splits per rounds.provider.
load_tool_stats(con) — global {tool_name: ToolStats} for the summary CSV (raw names, no collapsing), inserted in first-appearance order so the stable sort(key=calls) reproduces the old dict-order tie-break.
load_tool_stats_by_provider(con, *, min_calls) — per-provider stats for the boxplot with the MCP merge in SQL, then the rare-tool collapse in Python; real tools are inserted in alphabetical plot-name order (the old merged-dict order) with Other appended last, pinning equal-call ties.
load_tool_latency_values_by_provider(con) — {provider: [positive latency, …]} feeding both CDFs (exact, no reservoir).
load_tool_latency_bins(con, *, by_provider) — the 8 coarse TOOL_LATENCY_BINS_MS half-open bins (call/error counts + summed latency), global and per-provider.
plot_* / write_* — the four figures and four CSVs.
main() — wires the standard trace_db CLI (--db | -i/--input | -o/--output-dir) and embeds the self-contained PNG sidecar.

The matplotlib 3.9 labels→tick_labels boxplot-kwarg rename is handled by the _BOXPLOT_LABEL_KW shim so the figure also renders under the matplotlib 3.8.x bundled with Pyodide (the web Analyze tab). The data layer (parsing, surrogate keys, schema) lives in artifacts/utils/trace_db.py; see artifacts/utils/DB_SCHEMA.md.

Running it

# default merged trace, output next to this README
uv run python artifacts/tool_calls/tool_latency_distribution/plot.py

# a specific trace (materialized to a temp DuckDB cache on first use)
uv run python artifacts/tool_calls/tool_latency_distribution/plot.py -i trace/sample.jsonl

# a prebuilt DB (run_all.py's build-db step passes this), into a chosen dir
uv run python artifacts/tool_calls/tool_latency_distribution/plot.py --db "$TMPDIR/trace.duckdb" -o "$TMPDIR/out"

Useful flags: --top-tools (max boxes per panel, default 30), --min-tool-calls-for-plot (rare-tool collapse threshold, default 20).

Outputs

Written to -o (default this folder):

tool_latency_by_tool.png — per-provider box/whisker panels of per-call latency by tool.
tool_latency_summary.csv — full per-tool stats (raw names): calls, latency_count, missing_latency, nonpositive_latency, error_calls, mean_ms, min_ms, p50/p90/p99_ms, max_ms, sample_count, sampled, providers.
tool_latency_weighted_bins.png / .csv — tool-call count vs summed-latency share over the 8 coarse latency bins.
tool_latency_count_cdf_by_provider.png / .csv — cumulative tool-call count ≤ a latency threshold, per provider.
tool_total_latency_cdf_by_provider.png / .csv — cumulative summed latency from calls ≤ a threshold, per provider.

Each PNG is self-contained — it embeds this README, the CSVs, and the plotting code. Unpack with uv run python artifacts/utils/png_sidecar.py extract <png>.