Does an agent step’s output token count predict the next step’s newly appended tokens (i.e. is the prior response what reappears as new append), and how does prior output relate to the next step’s cached-prefix gain — does the previous response land on the prefix side or the append side?
The previous assistant response is normally replayed into the next prompt. This experiment asks
where it lands by comparing adjacent agent steps in the same session: `prev.output_tokens` against the
next step’s `newly_append_tokens` (the freshly charged slice) and against the next step’s prefix
gain (`current.prefix_tokens − previous.input_tokens_total`). It tests the replay hypothesis from
`../../../docs/prompt_cache_accounting.md`: the prior response generally shows up in the next step’s
append/cache-write.
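
As a minimal sketch of the two quantities being compared, the snippet below computes the prefix gain for two made-up next steps. The `Step` container and all token counts are hypothetical and chosen only for illustration; the field names follow the ones used in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical per-step record; only the field names come from the text."""
    input_tokens_total: int
    prefix_tokens: int
    newly_append_tokens: int
    output_tokens: int


def prefix_gain(previous: Step, current: Step) -> int:
    """Prefix gain as defined above: current.prefix_tokens - previous.input_tokens_total."""
    return current.prefix_tokens - previous.input_tokens_total


# Illustrative numbers only (not measured data).
prev = Step(input_tokens_total=10_000, prefix_tokens=9_000,
            newly_append_tokens=1_000, output_tokens=3_000)

# Case A: the cached prefix stops at the previous prompt, so the replayed
# response would have to sit in the freshly charged append slice.
cur_append_side = Step(input_tokens_total=13_500, prefix_tokens=10_000,
                       newly_append_tokens=3_500, output_tokens=500)

# Case B: the cached prefix has grown by about prev.output_tokens, so the
# replayed response would sit on the prefix side.
cur_prefix_side = Step(input_tokens_total=13_500, prefix_tokens=13_000,
                       newly_append_tokens=500, output_tokens=500)

gain_a = prefix_gain(prev, cur_append_side)
gain_b = prefix_gain(prev, cur_prefix_side)
print(gain_a, gain_b)
```

In case A the gain is far below `prev.output_tokens` while the append is large enough to hold it; in case B the gain matches `prev.output_tokens`. The experiment asks which of these patterns dominates in real sessions.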
Method and assumptions:
- Adjacent agent steps within a session. Steps are grouped per `(provider, session_id)`, sorted by
  `(round_index, first-event timestamp)`, and each step is paired with the one immediately after it
  in that sorted order. (Unlike `adjusted_prefix_append`, pairs are adjacency-in-order, not a strict
  `round_index` step of 1.)
- A timing gap gates each pair. The gap from the previous step’s last model-output event
  (`reasoning`/`text`/`tool_call`, falling back to its first observed timestamp) to the current
  step’s first observed timestamp must be ≥ 0 and ≤ `--max-gap-seconds` (default 240); this drops
  cross-conversation or stale pairs. Pairs whose `previous` step has non-positive
  `input_tokens_total` or `output_tokens` are skipped.
- Scenarios split pairs by provider/model and by how the next step started (`tool_result` vs
  `user_message`): Claude, gpt-5.5, gpt-5.4, gpt-5.3-codex, gpt-5.2-codex.
- Assignment heuristic. Per pair, with a per-pair tolerance `max(512, 0.10·prev_output)`:
  `prefix_close` (prefix gain ≈ prev output), `prefix_rejects_output` (prefix gain far below prev
  output), `append_can_contain_output`, and `append_side_pair` (= reject ∧ can-contain). The
  per-scenario `prefix_close_pct` / `append_side_pair_pct` drive a `decision` label
  (`prefix_side` / `append_side` / `mixed` / `not_sure`) and a `decision_strength`.
- Output proxy. Output is taken as the step’s raw `output_tokens`. For Codex, output includes
  reasoning, so the visible-output proxy `output_tokens − reasoning_output_tokens` is reported as
  context in the method notes (the carried `prev_reasoning` is recorded per pair), but the plotted
  quantity is `output_tokens`.
- Thresholds (`--min-output-tokens`, default `2000 4000`) drop tiny-output noise; one full set of
  figures + summary is emitted per threshold `N`.
- Stats are exact. Every percentile / correlation in the summary CSV is computed over all pairs in
  the scenario (legacy linear-interpolation percentile, `(n−1)·q`); nothing is sampled for the
  stats.
- Adjacency ordering is file order. The pre-migration JSONL loader grouped rows per
  `(provider, session_id)` in first-appearance (file) order, then stably sorted each session by
  `(round_index, first_timestamp)` so ties kept file order. The shared DuckDB surrogate key
  `ingest_seq` (= `round_pk`) is that file order, so pulling `ORDER BY ingest_seq` and grouping in
  Python reproduces both the per-session row order and the session-visitation order byte-for-byte.
  This matters for the scatter: the per-scenario subsample’s stable sort by `prev_output` keeps the
  pair-append order on ties, and that append order is driven by the session-visitation order.
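
The pairing and gap-gating rules from the method list can be sketched roughly as follows. This is an illustrative reconstruction, not the experiment's actual loader: the `Step` record, the `adjacent_pairs` helper, and the sample rows are hypothetical, and timestamps are assumed to be plain seconds.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """Hypothetical step record; timestamps are assumed to be seconds."""
    provider: str
    session_id: str
    round_index: int
    first_ts: float                  # first observed timestamp
    last_output_ts: Optional[float]  # last reasoning/text/tool_call event, if any
    input_tokens_total: int
    output_tokens: int


def adjacent_pairs(steps, max_gap_seconds=240.0):
    """Pair each step with the next one in its session, subject to the gap gate."""
    # Group in first-appearance order (dicts preserve insertion order).
    sessions = defaultdict(list)
    for step in steps:
        sessions[(step.provider, step.session_id)].append(step)

    pairs = []
    for rows in sessions.values():
        # sorted() is stable, so ties keep their incoming (file) order.
        ordered = sorted(rows, key=lambda s: (s.round_index, s.first_ts))
        for prev, cur in zip(ordered, ordered[1:]):
            # Skip pairs whose previous step has non-positive token counts.
            if prev.input_tokens_total <= 0 or prev.output_tokens <= 0:
                continue
            # Fall back to the first observed timestamp when no output event exists.
            start = prev.last_output_ts if prev.last_output_ts is not None else prev.first_ts
            gap = cur.first_ts - start
            if 0 <= gap <= max_gap_seconds:
                pairs.append((prev, cur))
    return pairs


# Made-up rows: the third step starts long after the second one finished.
steps = [
    Step("claude", "s1", 0, 0.0, 5.0, 1_000, 2_500),
    Step("claude", "s1", 1, 10.0, 12.0, 4_000, 300),
    Step("claude", "s1", 3, 900.0, None, 5_000, 100),
]
default_pairs = adjacent_pairs(steps)
wide_pairs = adjacent_pairs(steps, max_gap_seconds=1_000)
print(len(default_pairs), len(wide_pairs))
```

With the default 240-second gate only the first pair survives; widening the gate also admits the pair that jumps from `round_index` 1 to 3, which shows that pairing follows sorted adjacency rather than a strict `round_index` step of 1.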
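
One possible reading of the per-pair assignment heuristic is sketched below. Only the tolerance formula `max(512, 0.10·prev_output)`, the flag names, and the rule `append_side_pair` = reject ∧ can-contain come from the method list. The exact comparisons behind "≈", "far below", and "can contain" are assumptions made for this example, and the real thresholds may differ.

```python
def tolerance(prev_output: int) -> float:
    """Per-pair tolerance from the method list: max(512, 0.10 * prev_output)."""
    return max(512, 0.10 * prev_output)


def classify_pair(prev_output: int, prefix_gain: int, newly_append: int) -> dict:
    """Return the four per-pair flags under the assumed comparisons."""
    tol = tolerance(prev_output)
    # ASSUMPTION: "prefix gain ~ prev output" means within +/- tol.
    prefix_close = abs(prefix_gain - prev_output) <= tol
    # ASSUMPTION: "far below" means more than tol below prev output.
    prefix_rejects_output = prefix_gain < prev_output - tol
    # ASSUMPTION: the append "can contain" the output if it is at least
    # prev_output minus tol.
    append_can_contain_output = newly_append >= prev_output - tol
    return {
        "prefix_close": prefix_close,
        "prefix_rejects_output": prefix_rejects_output,
        "append_can_contain_output": append_can_contain_output,
        # From the text: append_side_pair = reject AND can-contain.
        "append_side_pair": prefix_rejects_output and append_can_contain_output,
    }


# Illustrative numbers only.
append_like = classify_pair(prev_output=3_000, prefix_gain=0, newly_append=3_500)
prefix_like = classify_pair(prev_output=3_000, prefix_gain=3_100, newly_append=400)
print(append_like["append_side_pair"], prefix_like["prefix_close"])
```

The 512-token floor keeps the tolerance from collapsing for small outputs, while the 10% term lets it scale for large ones: at `prev_output = 2000` the floor applies, and at `prev_output = 10000` the tolerance is 1000.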
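
The linear-interpolation percentile with rank `(n−1)·q` can be written in a few lines. The function below is a sketch of that convention, which matches NumPy's default `linear` method; it is not the project's own helper, and the name `percentile` is chosen here for illustration.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile at fractional rank (n - 1) * q, with 0 <= q <= 1."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    xs = sorted(values)
    pos = (len(xs) - 1) * q          # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    if lo == hi:                     # rank falls exactly on a data point
        return float(xs[lo])
    frac = pos - lo
    # Interpolate between the two neighbouring order statistics.
    return xs[lo] * (1 - frac) + xs[hi] * frac


sample = [40, 10, 30, 20]            # illustrative values, not measured data
median = percentile(sample, 0.5)     # rank 1.5, halfway between 20 and 30
p25 = percentile(sample, 0.25)       # rank 0.75, three quarters from 10 to 20
print(median, p25)
```

Because the rank is computed over every pair in the scenario, the result depends only on the sorted values and not on any sampling step.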