Codex tool overhead · SyFI TraceLab

Session

one continuous trace of work, often spanning multiple requests or problems.

Request

one user input through the agent's final response.

Agent step

one model call inside a request.

User-initiated step

an agent step started by user input.

Tool-triggered step

an agent step started by a tool result.

Problem

For Codex tool calls, how much wall-clock time sits outside the work the runner actually did — the end-to-end vs. internal latency gap?

Tables

Tool	Calls	E2E	Int.	Res.	Avg	P50/90/99
All timed	253k	418.1h	341.5h	77.8h	1.11s	0.13/0.24/10.0s
`exec_command`	184k	97.8h	33.8h	64.1h	1.26s	0.16/0.27/8.6s
`write_stdin`	61k	314.6h	303.7h	11.7h	0.69s	0.01/0.10/14.9s
`shell_command`	6.1k	5.3h	3.8h	1.8h	1.08s	0.04/0.16/4.0s
`apply_patch`	2.5k	0.5h	0.3h	0.21h	0.30s	0.01/0.41/3.9s

Table 1Codex tool end-to-end vs internal latency and the residual gap, by tool.

Codex’s observed end-to-end tool latency substantially exceeds the runner’s internal execution time, so a non-trivial slice of tool wall-time is not actual command work (the paper’s tab:codex_tool_e2e_internal). Over the 253k both-timed calls — 87.3% of Codex tool calls — E2E sums to 418.1h against 341.5h internal, leaving a 77.8h residual (~18.6% of E2E). exec_command dominates that gap with 64.1h of residual (internal 33.8h out of 97.8h E2E), consistent with shell commands being the ones most likely to stall on permission/auto-approval. The residual is mostly made of many tiny gaps plus a long tail: the average is just 1.11s and P50/P90 stay small (0.13s/0.24s), but P99 reaches 10.0s. write_stdin is the opposite shape — huge E2E (314.6h) that is almost all internal, leaving only 11.7h residual — confirming the overhead concentrates in command launches, not in long-running interactive sessions. Read this residual as an upper bound on client-side waiting around the call (approval, shell startup, scheduling), not a direct approval measurement.

Reference

Experiment overview

Codex traces carry two latency notions per tool call:

End-to-end (wall) — tool_wall_latency_ms, the timestamp span from the model emitting the function call to its output being recorded.
Internal — tool_internal_latency_ms, the runner-reported Wall time: … seconds parsed out of the tool output, i.e. how long the command itself ran.

The experiment quantifies the positive residual gap = max(tool_wall_latency_ms − tool_internal_latency_ms, 0): the slice of end-to-end time the runner did not attribute to executing the command. In the normalized trace this residual is the only signal available for approval / user-wait overhead, because tool inputs, outputs, and explicit approval events are not retained, so it is best read as an upper bound on client-side waiting around the call rather than a direct approval measurement.

Only (provider = 'codex') calls with both timings present, positive wall time, and non-negative internal time enter the residual statistics. The paper float tab:codex_tool_e2e_internal (emitted as codex_tool_e2e_internal.tex / .md) aggregates these into an All timed row plus one row per major execution-like tool (exec_command, write_stdin, shell_command, apply_patch), reporting calls, summed E2E / internal / residual hours, the average residual, and the P50/90/99 residual seconds. The script also writes several CSV breakdowns (per-tool, per-category, residual buckets, direct-human wall time, top-gap examples) and a result_analysis.md narrative.

Running it

# released DuckDB, outputs written next to this README
uv run python artifacts/tool_calls/codex_wall_internal_gap/analyze.py --db trace/syfi_coding_trace.duckdb

# default merged trace
uv run python artifacts/tool_calls/codex_wall_internal_gap/analyze.py

Standard trace_db CLI (--db | -i/--input | -o/--output-dir). Useful flag: --top-gap-examples (rows in the top-gap-examples CSV, default 50).

Outputs

codex_tool_e2e_internal.tex — the paper float tab:codex_tool_e2e_internal: per-tool E2E / internal / positive-residual latency with avg and P50/90/99.
codex_tool_e2e_internal.md — GFM mirror of that table (same numbers, no caption) for the web detail page.
headline.json — the few headline numbers for the Overview gallery card.
result_analysis.md — narrative of the main numbers, coverage, and interpretability limits.
CSVs: codex_tool_timing_coverage.csv, codex_wall_internal_gap_by_tool.csv, codex_wall_internal_gap_by_category.csv, codex_wall_internal_gap_buckets.csv, codex_direct_human_wall_time.csv, codex_top_wall_internal_gap_examples.csv.

Headline numbers (public trace)

Across 253k both-timed Codex calls, end-to-end time is 418.1h vs 341.5h internal — a 77.8h residual gap (~19% of E2E time sits outside command execution).
exec_command carries most of it: 184k calls, 64.1h residual (its internal time is only 33.8h of 97.8h E2E).
Residuals are usually tiny but heavy-tailed: median 0.13s, P90 0.24s, P99 10.0s over all timed calls.