Do low prefix-cache-hit agent steps follow long idle waits or long tool durations — i.e. is a cache miss explained by the cache having been evicted during the preceding gap?
Two scripts share this folder, both relating an agent step’s prefix-cache hit ratio to the gap that precedes it:
hit_ratio = prefix_tokens / (prefix_tokens + newly_append_tokens)
A step is classified by the type of its first timing event:
- a
user_messagestep — the preceding gap is the human idle wait since the previous activity in the same session; - a
tool_resultstep — the preceding gap is the tool duration (result_at - emitted_at) of the leading tool result(s) the step consumes.
cache_hit_idle_gap_analysis.py aggregates, per (scope, trigger), how often low-hit steps sit
behind a long gap (the gap/hit summary CSV). plot_user_wait_time_vs_hit_rate.py draws the raw
per-step (gap, hit-rate) cloud as provider-paneled scatters. A negative association — low hit rates
concentrated at long gaps — is consistent with time-based cache eviction.
Method and key assumptions:
- Prefix hit ratio =
prefix_tokens / (prefix_tokens + newly_append_tokens); steps with a non-positive token total are dropped. - Step trigger = the
event_typeof the step’s first timing event (event_index = 1); onlyuser_messageandtool_resultsteps are measured. - Human idle wait (
usertrigger) =first_activity(current) - last_activity(previous step in the session), kept when≥ 0.first_activityis the first timing event carrying a timestamp (else the earliest activity timestamp);last_activityis the latest timestamp across the step’s timing events and its tools’emitted_at/result_at. - Tool duration (
tool_resulttrigger) = the maxresult_at - emitted_atover the step’s leading tool-result call ids, looked up in a session-scoped map of tools emitted by previous steps (tool call ids are unique within a session). Negative durations are discarded. - Session walk is stateful and order-sensitive: steps are grouped by
(provider, session_id)and sorted by(round_index, first_activity_ts). The walk reproduces the pre-DuckDB single-pass loader exactly — tools are “remembered” only after a step is processed, so atool_resultstep never sees a tool emitted in its own step. - Scopes in the summary:
merged(all steps), plus each provider (claude,codex). The scatter draws one panel per provider. - Exact, not sampled. Every measurable step contributes; percentiles in the summary use the legacy linear-interpolation method, and the scatter plots every point (the old code applied no decimation here). Points are emitted in file order within each session and sessions in first-appearance order, so the scatter renders pixel-identically to the pre-DuckDB figure.
- Engine-independent timestamps. Timestamps are read as integer epoch-microseconds
(
CAST(epoch_us(timestamp) AS BIGINT), likewise for toolemitted_at/result_at) and differenced in Python, never fetched as a rawTIMESTAMP(native duckdb marshals that to adatetime, duckdb-wasm to a string). A difference of two same-timezone instants equals the naive-microsecond difference exactly, so every gap matches the pre-DuckDB result bit-for-bit. - These are trace-level estimates, not engine timers.