SyFI TraceLab
Trace assistant
Reading the public SYFI pool
357,161 agent steps across Claude & Codex — public, shareable.
Answers run real DuckDB/Python in a sandbox · code is shown
All figures
Session
one continuous trace of work, often spanning multiple requests or problems.
Request
one user input through the agent's final response.
Agent step
one model call inside a request.
User-initiated step
an agent step started by user input.
Tool-triggered step
an agent step started by a tool result.
Problem

For Codex tool calls, how much wall-clock time sits outside the work the runner actually did — the end-to-end vs. internal latency gap?

Tables
ToolCallsE2EInt.Res.AvgP50/90/99
All timed253k418.1h341.5h77.8h1.11s0.13/0.24/10.0s
exec_command184k97.8h33.8h64.1h1.26s0.16/0.27/8.6s
write_stdin61k314.6h303.7h11.7h0.69s0.01/0.10/14.9s
shell_command6.1k5.3h3.8h1.8h1.08s0.04/0.16/4.0s
apply_patch2.5k0.5h0.3h0.21h0.30s0.01/0.41/3.9s
Table 1Codex tool end-to-end vs internal latency and the residual gap, by tool.

Codex’s observed end-to-end tool latency substantially exceeds the runner’s internal execution time, so a non-trivial slice of tool wall-time is not actual command work (the paper’s tab:codex_tool_e2e_internal). Over the 253k both-timed calls — 87.3% of Codex tool calls — E2E sums to 418.1h against 341.5h internal, leaving a 77.8h residual (~18.6% of E2E). exec_command dominates that gap with 64.1h of residual (internal 33.8h out of 97.8h E2E), consistent with shell commands being the ones most likely to stall on permission/auto-approval. The residual is mostly made of many tiny gaps plus a long tail: the average is just 1.11s and P50/P90 stay small (0.13s/0.24s), but P99 reaches 10.0s. write_stdin is the opposite shape — huge E2E (314.6h) that is almost all internal, leaving only 11.7h residual — confirming the overhead concentrates in command launches, not in long-running interactive sessions. Read this residual as an upper bound on client-side waiting around the call (approval, shell startup, scheduling), not a direct approval measurement.

Reference
Experiment overview

Codex traces carry two latency notions per tool call:

  • End-to-end (wall)tool_wall_latency_ms, the timestamp span from the model emitting the function call to its output being recorded.
  • Internaltool_internal_latency_ms, the runner-reported Wall time: … seconds parsed out of the tool output, i.e. how long the command itself ran.

The experiment quantifies the positive residual gap = max(tool_wall_latency_ms − tool_internal_latency_ms, 0): the slice of end-to-end time the runner did not attribute to executing the command. In the normalized trace this residual is the only signal available for approval / user-wait overhead, because tool inputs, outputs, and explicit approval events are not retained, so it is best read as an upper bound on client-side waiting around the call rather than a direct approval measurement.

Only (provider = 'codex') calls with both timings present, positive wall time, and non-negative internal time enter the residual statistics. The paper float tab:codex_tool_e2e_internal (emitted as codex_tool_e2e_internal.tex / .md) aggregates these into an All timed row plus one row per major execution-like tool (exec_command, write_stdin, shell_command, apply_patch), reporting calls, summed E2E / internal / residual hours, the average residual, and the P50/90/99 residual seconds. The script also writes several CSV breakdowns (per-tool, per-category, residual buckets, direct-human wall time, top-gap examples) and a result_analysis.md narrative.

Running it
# released DuckDB, outputs written next to this README
uv run python artifacts/tool_calls/codex_wall_internal_gap/analyze.py --db trace/syfi_coding_trace.duckdb

# default merged trace
uv run python artifacts/tool_calls/codex_wall_internal_gap/analyze.py

Standard trace_db CLI (--db | -i/--input | -o/--output-dir). Useful flag: --top-gap-examples (rows in the top-gap-examples CSV, default 50).

Outputs
  • codex_tool_e2e_internal.tex — the paper float tab:codex_tool_e2e_internal: per-tool E2E / internal / positive-residual latency with avg and P50/90/99.
  • codex_tool_e2e_internal.md — GFM mirror of that table (same numbers, no caption) for the web detail page.
  • headline.json — the few headline numbers for the Overview gallery card.
  • result_analysis.md — narrative of the main numbers, coverage, and interpretability limits.
  • CSVs: codex_tool_timing_coverage.csv, codex_wall_internal_gap_by_tool.csv, codex_wall_internal_gap_by_category.csv, codex_wall_internal_gap_buckets.csv, codex_direct_human_wall_time.csv, codex_top_wall_internal_gap_examples.csv.
Headline numbers (public trace)
  • Across 253k both-timed Codex calls, end-to-end time is 418.1h vs 341.5h internal — a 77.8h residual gap (~19% of E2E time sits outside command execution).
  • exec_command carries most of it: 184k calls, 64.1h residual (its internal time is only 33.8h of 97.8h E2E).
  • Residuals are usually tiny but heavy-tailed: median 0.13s, P90 0.24s, P99 10.0s over all timed calls.
SyFI TraceLab · experiment detail