For Codex tool calls, how much wall-clock time sits outside the work the runner actually did — the end-to-end vs. internal latency gap?
For Codex tool calls, how much wall-clock time sits outside the work the runner actually did — the end-to-end vs. internal latency gap?
| Tool | Calls | E2E | Int. | Res. | Avg | P50/90/99 |
|---|---|---|---|---|---|---|
| All timed | 253k | 418.1h | 341.5h | 77.8h | 1.11s | 0.13/0.24/10.0s |
exec_command | 184k | 97.8h | 33.8h | 64.1h | 1.26s | 0.16/0.27/8.6s |
write_stdin | 61k | 314.6h | 303.7h | 11.7h | 0.69s | 0.01/0.10/14.9s |
shell_command | 6.1k | 5.3h | 3.8h | 1.8h | 1.08s | 0.04/0.16/4.0s |
apply_patch | 2.5k | 0.5h | 0.3h | 0.21h | 0.30s | 0.01/0.41/3.9s |
Codex’s observed end-to-end tool latency substantially exceeds the runner’s internal execution
time, so a non-trivial slice of tool wall-time is not actual command work (the paper’s
tab:codex_tool_e2e_internal). Over the 253k both-timed calls — 87.3% of Codex tool calls — E2E
sums to 418.1h against 341.5h internal, leaving a 77.8h residual (~18.6% of E2E). exec_command
dominates that gap with 64.1h of residual (internal 33.8h out of 97.8h E2E), consistent with shell
commands being the ones most likely to stall on permission/auto-approval. The residual is mostly
made of many tiny gaps plus a long tail: the average is just 1.11s and P50/P90 stay small
(0.13s/0.24s), but P99 reaches 10.0s. write_stdin is the opposite shape — huge E2E (314.6h) that
is almost all internal, leaving only 11.7h residual — confirming the overhead concentrates in
command launches, not in long-running interactive sessions. Read this residual as an upper bound on
client-side waiting around the call (approval, shell startup, scheduling), not a direct approval
measurement.
Codex traces carry two latency notions per tool call:
tool_wall_latency_ms, the timestamp span from the model
emitting the function call to its output being recorded.tool_internal_latency_ms, the runner-reported Wall time: … seconds
parsed out of the tool output, i.e. how long the command itself ran.The experiment quantifies the positive residual
gap = max(tool_wall_latency_ms − tool_internal_latency_ms, 0): the slice of
end-to-end time the runner did not attribute to executing the command. In the
normalized trace this residual is the only signal available for approval / user-wait
overhead, because tool inputs, outputs, and explicit approval events are not retained,
so it is best read as an upper bound on client-side waiting around the call rather
than a direct approval measurement.
Only (provider = 'codex') calls with both timings present, positive wall time, and
non-negative internal time enter the residual statistics. The paper float
tab:codex_tool_e2e_internal (emitted as codex_tool_e2e_internal.tex / .md)
aggregates these into an All timed row plus one row per major execution-like tool
(exec_command, write_stdin, shell_command, apply_patch), reporting calls,
summed E2E / internal / residual hours, the average residual, and the P50/90/99 residual
seconds. The script also writes several CSV breakdowns (per-tool, per-category, residual
buckets, direct-human wall time, top-gap examples) and a result_analysis.md narrative.
# released DuckDB, outputs written next to this README
uv run python artifacts/tool_calls/codex_wall_internal_gap/analyze.py --db trace/syfi_coding_trace.duckdb
# default merged trace
uv run python artifacts/tool_calls/codex_wall_internal_gap/analyze.py
Standard trace_db CLI (--db | -i/--input | -o/--output-dir). Useful flag:
--top-gap-examples (rows in the top-gap-examples CSV, default 50).
codex_tool_e2e_internal.tex — the paper float tab:codex_tool_e2e_internal:
per-tool E2E / internal / positive-residual latency with avg and P50/90/99.codex_tool_e2e_internal.md — GFM mirror of that table (same numbers, no caption)
for the web detail page.headline.json — the few headline numbers for the Overview gallery card.result_analysis.md — narrative of the main numbers, coverage, and interpretability
limits.codex_tool_timing_coverage.csv, codex_wall_internal_gap_by_tool.csv,
codex_wall_internal_gap_by_category.csv, codex_wall_internal_gap_buckets.csv,
codex_direct_human_wall_time.csv, codex_top_wall_internal_gap_examples.csv.exec_command carries most of it: 184k calls, 64.1h residual (its internal
time is only 33.8h of 97.8h E2E).