Cache hit versus idle gap · SyFI TraceLab

Session

one continuous trace of work, often spanning multiple requests or problems.

Request

one user input through the agent's final response.

Agent step

one model call inside a request.

User-initiated step

an agent step started by user input.

Tool-triggered step

an agent step started by a tool result.

Problem

Do low prefix-cache-hit agent steps follow long idle waits or long tool durations — i.e. is a cache miss explained by the cache having been evicted during the preceding gap?

Figures 2

Fig. 1Human idle wait versus prefix-cache hit rate.

Fig. 2Tool-triggered wait time versus prefix-cache hit rate.

Reference

Experiment overview

Two scripts share this folder, both relating an agent step’s prefix-cache hit ratio to the gap that precedes it:

hit_ratio = prefix_tokens / (prefix_tokens + newly_append_tokens)

A step is classified by the type of its first timing event:

a user_message step — the preceding gap is the human idle wait since the previous activity in the same session;
a tool_result step — the preceding gap is the tool duration (result_at - emitted_at) of the leading tool result(s) the step consumes.

cache_hit_idle_gap_analysis.py aggregates, per (scope, trigger), how often low-hit steps sit behind a long gap (the gap/hit summary CSV). plot_user_wait_time_vs_hit_rate.py draws the raw per-step (gap, hit-rate) cloud as provider-paneled scatters. A negative association — low hit rates concentrated at long gaps — is consistent with time-based cache eviction.

Method and key assumptions:

Prefix hit ratio = prefix_tokens / (prefix_tokens + newly_append_tokens); steps with a non-positive token total are dropped.
Step trigger = the event_type of the step’s first timing event (event_index = 1); only user_message and tool_result steps are measured.
Human idle wait (user trigger) = first_activity(current) - last_activity(previous step in the session), kept when ≥ 0. first_activity is the first timing event carrying a timestamp (else the earliest activity timestamp); last_activity is the latest timestamp across the step’s timing events and its tools’ emitted_at/result_at.
Tool duration (tool_result trigger) = the max result_at - emitted_at over the step’s leading tool-result call ids, looked up in a session-scoped map of tools emitted by previous steps (tool call ids are unique within a session). Negative durations are discarded.
Session walk is stateful and order-sensitive: steps are grouped by (provider, session_id) and sorted by (round_index, first_activity_ts). The walk reproduces the pre-DuckDB single-pass loader exactly — tools are “remembered” only after a step is processed, so a tool_result step never sees a tool emitted in its own step.
Scopes in the summary: merged (all steps), plus each provider (claude, codex). The scatter draws one panel per provider.
Exact, not sampled. Every measurable step contributes; percentiles in the summary use the legacy linear-interpolation method, and the scatter plots every point (the old code applied no decimation here). Points are emitted in file order within each session and sessions in first-appearance order, so the scatter renders pixel-identically to the pre-DuckDB figure.
Engine-independent timestamps. Timestamps are read as integer epoch-microseconds (CAST(epoch_us(timestamp) AS BIGINT), likewise for tool emitted_at/result_at) and differenced in Python, never fetched as a raw TIMESTAMP (native duckdb marshals that to a datetime, duckdb-wasm to a string). A difference of two same-timezone instants equals the naive-microsecond difference exactly, so every gap matches the pre-DuckDB result bit-for-bit.
These are trace-level estimates, not engine timers.

Code structure

Both scripts query the shared trace DuckDB (rounds / tool_calls / timing_events) instead of re-parsing the JSONL. The stateful session walk is shared:

cache_hit_idle_gap_analysis.py
- load_rounds_by_session(con) — {(provider, session_id): [RoundData, …]}. One SQL pass each over timing_events (ordered by round_pk, event_index) and tool_calls (ordered by round_pk, tool_index) plus the rounds scalars (ordered by round_pk), assembled per step into a RoundData (trigger event type, first/last activity microseconds, leading tool-result call ids, and the per-step tool durations keyed by call id). Sessions are sorted by (round_index, first_activity_ts), reproducing the old line-order tie-break.
- analyze(con, …) — the stateful walk that, per (scope, trigger), accumulates a GapGroup (round/low-hit/over-idle counts and the gap lists) via update_group.
- percentile / format_pct / format_float / write_summary — the summary math and CSV writer, unchanged from the pre-migration script.
plot_user_wait_time_vs_hit_rate.py
- collect_points(con, trigger=…) — reuses idle.load_rounds_by_session and runs the same walk to emit {provider: ([wait_seconds…], [hit_rate…])} for one trigger.
- plot(...) / count_points(...) / default_output_path(...) / WAIT_TICKS / format_wait_time — the figure layout (log-x wait axis, per-provider panels, low-hit/long-wait guides) unchanged.
- main() — wires the standard trace_db CLI and embeds the self-contained PNG sidecar (this README, the gap CSV, and both scripts).

Both use the standard trace_db.add_db_args surface (--db | -i/--input | -o/--output-dir); the plot script adds --trigger {all,user,tool_result} and a --output single-PNG path (valid only with one concrete trigger). The data layer (parsing, surrogate keys, schema) lives in artifacts/utils/trace_db.py; see artifacts/utils/DB_SCHEMA.md.

Running it

# default merged trace, outputs next to this README
uv run python artifacts/prefix_cache/cache_hit_idle_relationship/cache_hit_idle_gap_analysis.py
uv run python artifacts/prefix_cache/cache_hit_idle_relationship/plot_user_wait_time_vs_hit_rate.py

# a specific trace (materialized to a temp DuckDB cache on first use)
uv run python artifacts/prefix_cache/cache_hit_idle_relationship/cache_hit_idle_gap_analysis.py -i trace/sample.jsonl

# a prebuilt DB (run_all.py's build-db step passes this), into a chosen dir
uv run python artifacts/prefix_cache/cache_hit_idle_relationship/plot_user_wait_time_vs_hit_rate.py --db "$TMPDIR/trace.duckdb" -o "$TMPDIR/out"

# a single scatter
uv run python artifacts/prefix_cache/cache_hit_idle_relationship/plot_user_wait_time_vs_hit_rate.py --trigger user

Outputs

Written to -o (default this folder):

cache_hit_idle_gap_summary.csv — one row per (scope, trigger) for scope ∈ {merged, claude, codex} and trigger ∈ {all, user, tool_result}, with the thresholds (low_hit_threshold, idle_threshold_seconds), the round/low-hit counts, the over-idle shares (low_gt_idle_share, all_gt_idle_share, nonlow_gt_idle_share), and gap percentiles (low_gap_median_s/low_gap_p90_s, all_gap_median_s/all_gap_p90_s).
user_wait_time_vs_hit_rate_scatter.png — user-message wait vs hit rate, Claude/Codex panels.
tool_result_wait_time_vs_hit_rate_scatter.png — tool duration vs hit rate, Claude/Codex panels.

Each scatter PNG is self-contained — it embeds this README, the gap-summary CSV, and both scripts. Unpack with uv run python artifacts/utils/png_sidecar.py extract <png>.