Which tools do coding agents actually call, how often, and how often do those calls fail — separately for Claude Code and Codex?
Every agent step in the trace carries a tools[] list of the tool calls the model made in that step.
This experiment counts those calls per (provider, tool) and renders one horizontal-bar panel
per provider, tools ordered by call volume, with a red overlay marking the share that returned an
error.
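As a rough illustration of the counting step, the sketch below tallies calls and errors per (provider, tool) from a handful of made-up steps. Only `tools[]` and `is_error` come from the description above; the `provider` and `name` field names, the sample records, and the `count_calls` helper are assumptions for illustration, not the experiment's actual schema or code.

```python
from collections import Counter

# Hypothetical trace steps. Field names other than tools[] and is_error
# are assumptions made for this sketch.
steps = [
    {"provider": "claude_code", "tools": [
        {"name": "Bash", "is_error": False},
        {"name": "Bash", "is_error": True},
        {"name": "Read", "is_error": False},
    ]},
    {"provider": "codex", "tools": [
        {"name": "shell", "is_error": False},
    ]},
    {"provider": "claude_code", "tools": []},  # a step with no tool calls adds nothing
]


def count_calls(steps):
    """Return (calls, errors) Counters keyed by (provider, tool)."""
    calls, errors = Counter(), Counter()
    for step in steps:
        for call in step["tools"]:  # one count per call, not per step
            key = (step["provider"], call["name"])
            calls[key] += 1
            if call["is_error"]:
                errors[key] += 1
    return calls, errors


calls, errors = count_calls(steps)
for key, n in calls.most_common():
    print(key, "calls:", n, "errors:", errors[key])
```

Keeping errors in a separate counter with the same keys makes the error share per tool a simple division at plot time.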
Method and assumptions:
- One row per call. We count entries in `tool_calls` (the UNNESTed `tools[]`), not agent steps: a step that calls `Bash` three times contributes three.
- MCP tools are merged. Any tool whose name starts with `mcp_` is aliased to a single `mcp` bucket, since the long opaque server-qualified names are individually rare and uninformative in aggregate.
- Rare tools collapse. For the figure only, tools with fewer than `--min-tool-calls-for-plot` provider-local calls (default 20) are summed into one `Other (<N calls/tool)` bar. The CSV keeps full per-tool detail; nothing is dropped from the data, only from the plot.
- Linear, clipped axis. Tool usage is heavily skewed (one or two tools dominate), so each panel clips its x-axis at ~1.05× the second-largest bar and annotates the clipped leader with its true count. This keeps the long tail readable instead of being crushed against a single giant bar.
- Errors are counted as calls where `is_error` is true, drawn as a shorter bar inside the call bar.
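The plot-preparation rules above can be sketched for a single provider as follows. This is a minimal sketch under stated assumptions, not the experiment's actual code: the helper names (`alias_tool`, `prepare_panel`, `clipped_xlim`), the sample counts, and the MCP-style tool names are hypothetical. It also assumes that MCP aliasing happens before the rare-tool threshold is applied and that the `Other` bar is sorted by volume like any other bar.

```python
def alias_tool(name):
    """Fold every MCP tool into one bucket; leave other names untouched."""
    return "mcp" if name.startswith("mcp_") else name


def prepare_panel(tool_counts, min_calls=20):
    """tool_counts maps tool -> (calls, errors) for one provider.

    Returns bars as [(label, (calls, errors)), ...], largest first.
    """
    merged = {}
    for name, (calls, errors) in tool_counts.items():
        key = alias_tool(name)
        prev_calls, prev_errors = merged.get(key, (0, 0))
        merged[key] = (prev_calls + calls, prev_errors + errors)

    # Tools below the threshold are summed into one bar (figure only).
    kept = {t: v for t, v in merged.items() if v[0] >= min_calls}
    rare = [v for v in merged.values() if v[0] < min_calls]
    if rare:
        label = f"Other (<{min_calls} calls/tool)"
        kept[label] = (sum(c for c, _ in rare), sum(e for _, e in rare))

    return sorted(kept.items(), key=lambda kv: kv[1][0], reverse=True)


def clipped_xlim(bars, factor=1.05):
    """Upper x-limit: factor times the second-largest bar (assumed fallback: the largest)."""
    counts = [calls for _, (calls, _) in bars]
    if not counts:
        return 1.0
    return factor * (counts[1] if len(counts) > 1 else counts[0])


# Made-up counts for one provider; the mcp_* names are hypothetical.
sample = {
    "Bash": (900, 45),
    "Read": (300, 3),
    "mcp_github_search": (15, 1),
    "mcp_jira_get": (10, 0),
    "WebFetch": (4, 2),
    "TodoWrite": (7, 0),
}
bars = prepare_panel(sample)
xlim = clipped_xlim(bars)
# Bars longer than the axis get clipped and annotated with their true count.
clipped = [(label, calls) for label, (calls, _) in bars if calls > xlim]
```

In this sample the two MCP tools each fall below the threshold on their own but clear it once merged, which is why the order of aliasing and thresholding matters.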