Tool category distribution

Session: one continuous trace of work, often spanning multiple requests or problems.

Request: one user input through the agent's final response.

Agent step: one model call inside a request.

User-initiated step: an agent step started by user input.

Tool-triggered step: an agent step started by a tool result.

Problem

When tools are folded into a handful of coarse categories (execute, file write/edit, file read/search, agent/task, web/lookup, …), how are calls and latency distributed across those categories — and how concentrated is the latency long tail in a few slow calls?

Figures (4)

Fig. 1. Tool-call count share by coarse category.

Fig. 2. Summed effective latency by tool category.

Fig. 3. Tool category dashboard with call share and latency quantiles.

Fig. 4. Call share versus latency share across latency bins.

Reference

Experiment overview

Individual tool names are numerous and provider-specific; this experiment groups them into coarse categories that mean the same thing across Claude Code and Codex, then reports how calls and effective latency split across those categories.

Method and assumptions:

One row per call. We count entries in tool_calls (the UNNESTed tools[]), not agent steps.
Two fixed tool→category maps. A 5-category-plus-other map (Execute command, File write/edit, File read/search, Agent/task, Web/remote/lookup, Other) drives the count ring and latency bar; a 7-bucket presentation map (which additionally splits out Planning) drives the dashboard. Both maps are explicit name→category sets ported verbatim; tool_category_tool_map.csv records the realized (category, provider, tool) breakdown for auditing.
Effective tool latency = tool_internal_latency_ms if present, else tool_wall_latency_ms (the legacy latency_ms fallback is not in the normalized schema). Only positive latencies contribute to summed latency and to the percentile/long-tail views; missing and non-positive latencies are counted separately but excluded from the sums.
Long-tail bins. Positive latencies are bucketed into <1s, 1–10s, 10s–1m, >1m to contrast each bucket’s share of calls against its share of total latency.
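
The effective-latency rule and the long-tail bins above can be sketched in a few lines of Python. This is a minimal illustration, not the code in analyze.py: the function names are hypothetical, "present" is taken to mean "not NULL", and the handling of a latency that lands exactly on a bin edge (for example 1000 ms) is an assumption.

```python
from collections import defaultdict

# (label, exclusive upper bound in ms) for every bin except the open-ended last one.
# Assumption: a latency exactly on an edge (e.g. 1000 ms) goes to the higher bin.
BIN_EDGES = [("<1s", 1_000), ("1-10s", 10_000), ("10s-1m", 60_000)]
LAST_BIN = ">1m"
BIN_LABELS = [label for label, _ in BIN_EDGES] + [LAST_BIN]


def effective_latency_ms(internal_ms, wall_ms):
    """Prefer the internal latency; fall back to wall latency when it is missing."""
    return internal_ms if internal_ms is not None else wall_ms


def latency_bin(ms):
    """Return the long-tail bin label for a positive latency in milliseconds."""
    for label, upper in BIN_EDGES:
        if ms < upper:
            return label
    return LAST_BIN


def long_tail_shares(calls):
    """calls: iterable of (internal_ms, wall_ms) pairs, either of which may be None.

    Returns {bin label: (call share, latency share)} over positive latencies only.
    Missing and non-positive latencies are skipped, as the method notes describe.
    """
    calls_by_bin = defaultdict(int)
    latency_by_bin = defaultdict(int)
    for internal_ms, wall_ms in calls:
        ms = effective_latency_ms(internal_ms, wall_ms)
        if ms is None or ms <= 0:
            continue
        label = latency_bin(ms)
        calls_by_bin[label] += 1
        latency_by_bin[label] += ms

    total_calls = sum(calls_by_bin.values())
    total_latency = sum(latency_by_bin.values())
    if total_calls == 0:
        return {}
    return {
        label: (calls_by_bin[label] / total_calls, latency_by_bin[label] / total_latency)
        for label in BIN_LABELS
    }
```

With two sub-second calls and one two-minute call, the >1m bin would hold a third of the calls but nearly all of the summed latency, which is the kind of imbalance the long-tail figure is meant to show.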

Code structure

analyze.py is a query→fold→plot pipeline over the shared trace DuckDB database:

load_tool_aggregates(con) — one GROUP BY (provider, tool_name) over tool_calls ⋈ rounds that returns per-tool calls, error_calls, the valid/missing/non-positive latency-class counts, and summed positive latency. Provider/tool-name normalization (<unknown-provider> / <unknown-tool>) is done in SQL to match the old loader.
load_positive_latency_histogram(con) — (tool_name, latency_ms, count) rows for positive latencies, expanded in Python into the per-category latency lists the percentiles consume.
scan_trace / scan_trace_presentation / scan_trace_long_tail_latency — fold the per-tool aggregates into the coarse categories using the verbatim category_for_tool / presentation_category_for_tool maps (summing is order-independent over the integer-ms latencies).
category_rows / presentation_rows / long_tail_rows + their write_*_csv — shape and emit the four CSVs.
plot_count_ring / plot_latency_bar / plot_dashboard / plot_long_tail_imbalance — the four figures. main() wires the standard trace_db CLI and embeds the PNG sidecar.
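
The fold step in the list above can be pictured roughly as follows. This is a hedged sketch rather than the real scan_trace: the three tool names in the map, the row keys (calls, error_calls, valid_latency_calls, latency_ms_sum), sending unmapped tools to Other, and averaging latency over valid-latency calls are all assumptions made for illustration.

```python
from collections import defaultdict

# Illustrative excerpt only: the real name->category sets live in analyze.py,
# and these three tool names are assumptions chosen for the example.
SKETCH_CATEGORY_BY_TOOL = {
    "Bash": "Execute command",
    "Read": "File read/search",
    "Edit": "File write/edit",
}


def sketch_category_for_tool(tool_name):
    """Stand-in for category_for_tool: unmapped names fall into Other (assumed)."""
    return SKETCH_CATEGORY_BY_TOOL.get(tool_name, "Other")


def fold_by_category(rows):
    """Fold per-tool aggregate rows into per-category totals and derived rates.

    Each row is a dict with hypothetical keys: tool_name, calls, error_calls,
    valid_latency_calls, latency_ms_sum (integer milliseconds).
    """
    totals = defaultdict(
        lambda: {"calls": 0, "error_calls": 0, "valid_latency_calls": 0, "latency_ms_sum": 0}
    )
    for row in rows:
        bucket = totals[sketch_category_for_tool(row["tool_name"])]
        for key in bucket:
            bucket[key] += row[key]  # integer sums, so row order does not matter

    all_calls = sum(bucket["calls"] for bucket in totals.values())
    summary = {}
    for category, bucket in totals.items():
        valid = bucket["valid_latency_calls"]
        summary[category] = {
            **bucket,
            "share": bucket["calls"] / all_calls if all_calls else 0.0,
            "error_rate": bucket["error_calls"] / bucket["calls"] if bucket["calls"] else 0.0,
            # Assumption: the average is taken over calls with a valid positive latency.
            "avg_latency_ms": bucket["latency_ms_sum"] / valid if valid else None,
        }
    return summary
```

Keeping the map as a plain dictionary lookup mirrors the "explicit name→category sets" described in the method notes; a rule-based classifier would be harder to audit against tool_category_tool_map.csv.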

The data layer (parsing, surrogate keys, schema) lives in artifacts/utils/trace_db.py; see artifacts/utils/DB_SCHEMA.md.

Running it

# default merged trace, output next to this README
uv run python artifacts/tool_calls/tool_category_distribution/analyze.py

# a specific trace (materialized to a temp DuckDB cache on first use)
uv run python artifacts/tool_calls/tool_category_distribution/analyze.py -i trace/sample.jsonl

# a prebuilt DB (run_all.py's build-db step passes this), into a chosen dir
uv run python artifacts/tool_calls/tool_category_distribution/analyze.py --db "$TMPDIR/trace.duckdb" -o "$TMPDIR/out"

Outputs

tool_category_count_ring.png — donut of call counts across the 6 coarse categories.
tool_category_latency_bar.png — summed effective latency (hours) per category, with average.
tool_category_dashboard.png — combined donut + category table + latency-quantile strip for the 7-bucket presentation map.
tool_latency_long_tail_imbalance.png — call-share vs latency-share across the <1s … >1m bins.
tool_category_summary.csv — per coarse category: calls, share, error rate, latency-class counts, summed/avg latency.
tool_category_tool_map.csv — the realized (category, provider, tool_name) breakdown.
tool_category_dashboard_summary.csv — per presentation category: calls, share, p25/p50/p90/p99 seconds.
tool_latency_long_tail_imbalance.csv — per latency bin: calls, call share, latency, latency share.
result_analysis.md — generated run log.
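
As a rough guide to how the p25/p50/p90/p99 columns in tool_category_dashboard_summary.csv could be derived from the (latency_ms, count) histogram rows, the sketch below expands the histogram into a flat list and reads off percentiles. The nearest-rank definition used here is an assumption; analyze.py may interpolate or use a different method, and the function names are hypothetical.

```python
def expand_histogram(rows):
    """Expand (latency_ms, count) pairs into a sorted flat list of latencies."""
    values = []
    for latency_ms, count in sorted(rows):
        values.extend([latency_ms] * count)
    return values


def nearest_rank_percentile(sorted_values, pct):
    """Nearest-rank percentile (an assumed definition) of a non-empty sorted list."""
    if not sorted_values:
        raise ValueError("percentile of an empty list is undefined")
    n = len(sorted_values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * n + 99) // 100)
    return sorted_values[rank - 1]


def quantile_summary_seconds(rows):
    """Return p25/p50/p90/p99 in seconds for one category's histogram rows."""
    values = expand_histogram(rows)
    return {f"p{p}": nearest_rank_percentile(values, p) / 1000 for p in (25, 50, 90, 99)}
```

For example, quantile_summary_seconds([(50, 4), (100, 3), (2000, 1), (60000, 2)]) reports a median of 0.1 s while p90 and p99 both sit at 60 s, since the two slowest calls occupy the top ranks.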

The PNGs are self-contained — each embeds this README, the CSVs, and the plotting code. Unpack with uv run python artifacts/utils/png_sidecar.py extract <png>.