From a consumer perspective, how much extra append-prefill and API cost does a user pay because “thinking” time can turn prefix-cache hits into misses?
| Metric | Claude | Codex | Total |
|---|---|---|---|
| User-initiated steps with predecessor | 16,927 | 17,033 | 33,960 |
| Observed append tokens | 1.19B | 1.15B | 2.34B |
| Append after retained cache | 541.9M | 721.7M | 1.26B |
| Append-token reduction | 648.1M (54.5%) | 423.9M (37.0%) | 1.07B (45.9%) |
| Observed total cost | $22,654 | $17,777 | $40,431 |
| Cost after retained cache | $18,973 | $16,269 | $35,242 |
| Total cost reduction | $3,680 (16.2%) | $1,508 (8.5%) | $5,189 (12.8%) |
| Avg. cost saved per reduced step | $0.263 | $0.116 | $0.192 |

From this consumer perspective, if user-initiated steps retained their prefix cache across human thinking time, append-prefill in the merged trace would drop from 2.34B to 1.26B tokens: a 1.07B-token reduction, or 45.9% of all append tokens. Only user-initiated steps are modified, so the same reduction amounts to 95.1% of the append tokens observed on those steps.

Using pricing.json prices as of 2026-06, the estimated total cost over priced rounds falls from
$40,431 to $35,242, a saving of $5,189 (12.8%). By provider, Claude sees 648.1M fewer append
tokens and $3,680 saved, and Codex sees 423.9M fewer append tokens and $1,508 saved. Dollar
totals cover the 99.1% of rounds that have prices; token reductions include all rounds.
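The cost rows of the table can be re-derived from the observed and retained-cache totals alone. The sketch below does that arithmetic using the rounded dollar figures shown in the table, so Claude comes out at $3,681 rather than $3,680, presumably a rounding difference; COSTS and saving are illustrative names, not part of the analysis script.

```python
# Recompute the table's cost-reduction rows from its observed and retained-cache totals (USD).
COSTS = {
    "claude": (22_654, 18_973),  # (observed, after retained cache)
    "codex": (17_777, 16_269),
}


def saving(observed: float, retained: float) -> tuple[float, float]:
    """Return (dollars saved, percent of observed cost saved)."""
    saved = observed - retained
    return saved, 100.0 * saved / observed


rows = {scope: saving(*pair) for scope, pair in COSTS.items()}
# The merged row is computed from summed dollars, not by averaging the per-provider percentages.
rows["total"] = saving(
    sum(obs for obs, _ in COSTS.values()),
    sum(ret for _, ret in COSTS.values()),
)

for scope, (saved, pct) in rows.items():
    print(f"{scope:>6}: ${saved:,.0f} saved ({pct:.1f}%)")
```

Because each percentage is relative to its own scope's observed cost, the merged 12.8% is a cost-weighted mix of the two providers rather than the simple mean of 16.2% and 8.5%.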

This is an upper-bound savings estimate, not an observed cache metric. We analyze prefix caching
from a consumer perspective: the user-visible cost of “thinking” is the extra fresh-input
prefill billed when a user-initiated step resumes after its prefix cache has expired. For every
user-initiated step S with a predecessor P in the same session, the estimate caps the observed
append tokens at the step’s net context growth, i.e. the input tokens that were not already part
of P’s input:
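
$$
\begin{aligned}
\mathrm{total\_input}(S) &= \mathrm{prefix\_tokens}(S) + \mathrm{newly\_append\_tokens}(S) \\
\mathrm{context\_growth}(S) &= \max\bigl(0,\ \mathrm{total\_input}(S) - \mathrm{total\_input}(P)\bigr) \\
\mathrm{append\_after\_retained\_cache}(S) &= \min\bigl(\mathrm{newly\_append\_tokens}(S),\ \mathrm{context\_growth}(S)\bigr) \\
\mathrm{prefix\_after\_retained\_cache}(S) &= \mathrm{total\_input}(S) - \mathrm{append\_after\_retained\_cache}(S)
\end{aligned}
$$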
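The four per-step formulas translate directly into a small function. This is a minimal sketch: the Step class and retained_cache_split function are hypothetical names chosen for illustration, not the analysis script's own types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Token counts for one round (hypothetical type; fields mirror the formulas)."""
    prefix_tokens: int
    newly_append_tokens: int

    @property
    def total_input(self) -> int:
        return self.prefix_tokens + self.newly_append_tokens


def retained_cache_split(step: Step, predecessor: Step) -> Step:
    """Re-split a user-initiated step as if the predecessor's prefix cache were still warm."""
    context_growth = max(0, step.total_input - predecessor.total_input)
    append_after = min(step.newly_append_tokens, context_growth)
    # Total input is preserved: tokens leaving the append bucket move into the prefix bucket.
    return Step(prefix_tokens=step.total_input - append_after,
                newly_append_tokens=append_after)


# A step that resumed after the cache expired, so almost its whole input was re-prefilled.
prev = Step(prefix_tokens=40_000, newly_append_tokens=2_000)   # total_input = 42,000
cur = Step(prefix_tokens=1_000, newly_append_tokens=44_500)    # total_input = 45,500
print(retained_cache_split(cur, prev))
```

The min(...) leaves a step that already hit the cache unchanged, while a step whose total input is no larger than its predecessor's is treated as fully cached, one of the reasons the result is an upper bound.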
All other steps keep their observed prefix_tokens / newly_append_tokens split, and total input
length and output tokens are unchanged for every step. Any append-token reduction is moved into
prefix tokens and billed at the cache-read price; remaining Claude cache-creation tokens are billed
at the 5-minute cache-write rate. Because the estimate assumes every shifted token can be served
from cache at the cache-read rate, the resulting savings are an upper bound, not a level that any
cache-retention policy is guaranteed to achieve.
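To see how shifting tokens lowers cost, here is a minimal sketch of the repricing rule under a simplified model: every append token is billed at a single rate (real Claude rounds split append into fresh input and cache creation), and the prices are made up for illustration rather than taken from pricing.json. Prices and round_cost are hypothetical names, not the API of artifacts/web_analytics/pricing.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Illustrative USD prices per million tokens; hypothetical, not values from pricing.json."""
    append: float      # fresh-input rate, or the 5-minute cache-write rate for Claude
    cache_read: float  # rate for prefix (cache-hit) tokens
    output: float      # output tokens are identical in both scenarios


def round_cost(prefix: int, append: int, output: int, p: Prices) -> float:
    """USD cost of one round under a simplified three-bucket pricing model."""
    return (prefix * p.cache_read + append * p.append + output * p.output) / 1_000_000


prices = Prices(append=3.75, cache_read=0.30, output=15.00)

# One user-initiated step whose append shrinks from 44,500 to 3,500 tokens under the estimate;
# total input (45,500) and output (800) are the same in both scenarios.
observed = round_cost(prefix=1_000, append=44_500, output=800, p=prices)
retained = round_cost(prefix=42_000, append=3_500, output=800, p=prices)
shifted = 44_500 - 3_500

saving = observed - retained
print(f"observed ${observed:.4f}, retained ${retained:.4f}, saved ${saving:.4f}")
```

In this model the saving per step is the number of shifted tokens times the gap between the append and cache-read rates; output cost cancels out.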

Method and assumptions:

- User-initiated steps are rounds triggered by user_message, matching the trigger convention used by cache_hit_ratio and redundant_prefill.
- The predecessor is chosen as in redundant_prefill / session/total_input_growth: it is the last round seen for the same session_id in round_pk file order. Session-first user steps have no predecessor and remain unchanged.
- Prices come from artifacts/utils/pricing.json through artifacts/web_analytics/pricing.py: append tokens are billed at fresh-input/cache-write rates, prefix tokens at the cache-read rate, and output is unchanged. Unpriced rounds contribute to token counts but are excluded from dollar totals.
- collect(con) streams rounds joined with each round’s first timing event, walks sessions in file order, and accumulates observed vs. retained-cache token/cost totals by scope.
- ScopeAccum stores token totals, user-step coverage, priced-round coverage, and observed / retained-cache cost buckets. It also stores per-reduced-step cost-saved samples and the preceding human idle gaps for the avg / p50 / p90 rows.
- write_summary_csv(...), render_md(...), and write_latex_table(...) emit the raw summary, the web table, and the optional paper table.

Usage:

    # default merged trace (materialized to a temp DuckDB cache on first use)
    uv run python artifacts/prefix_cache/human_idle_cache_counterfactual/analyze.py
    # prebuilt DB, writing outputs to a chosen directory
    uv run python artifacts/prefix_cache/human_idle_cache_counterfactual/analyze.py \
      --db trace/syfi_coding_trace.duckdb -o /tmp/out
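The session walk described above (predecessor = last round for the same session_id in file order, only user_message steps changed, totals kept per scope) can be sketched in plain Python. This is a simplified illustration, assuming an in-memory list instead of DuckDB and tracking token totals only (no pricing or timing join); Round, Totals, accumulate, and the "tool_result" trigger value are hypothetical, not the real collect(con) or ScopeAccum.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Round:
    """Hypothetical flattened round record; the real script reads rounds from DuckDB."""
    session_id: str
    provider: str   # e.g. "claude" or "codex"
    trigger: str    # "user_message" marks a user-initiated step
    prefix_tokens: int
    newly_append_tokens: int

    @property
    def total_input(self) -> int:
        return self.prefix_tokens + self.newly_append_tokens


@dataclass
class Totals:
    """Hypothetical stand-in for a small slice of ScopeAccum (token totals only)."""
    observed_append: int = 0
    retained_append: int = 0
    reduced_steps: int = 0


def accumulate(rounds: Iterable[Round]) -> dict[str, Totals]:
    """Walk rounds in file order; total observed vs. retained-cache append tokens by scope."""
    totals: dict[str, Totals] = defaultdict(Totals)
    last_round: dict[str, Round] = {}  # predecessor = last round seen for the same session_id
    for r in rounds:
        append = r.newly_append_tokens
        pred = last_round.get(r.session_id)
        if r.trigger == "user_message" and pred is not None:
            growth = max(0, r.total_input - pred.total_input)
            retained = min(append, growth)
        else:
            retained = append  # non-user steps and session-first user steps are unchanged
        for scope in ("merged", r.provider):
            t = totals[scope]
            t.observed_append += append
            t.retained_append += retained
            t.reduced_steps += int(retained < append)
        last_round[r.session_id] = r
    return dict(totals)


rounds = [
    Round("s1", "claude", "user_message", 0, 10_000),    # session-first: no predecessor
    Round("s1", "claude", "tool_result", 10_000, 1_500),  # hypothetical non-user trigger
    Round("s1", "claude", "user_message", 0, 12_300),    # cache expired during idle time
]
for scope, t in accumulate(rounds).items():
    print(scope, t)
```

Keying the predecessor map by session_id means interleaved sessions in the file do not interfere with each other.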

Outputs:

- human_idle_cache_counterfactual_summary.csv: raw observed vs. retained-cache token and cost totals per scope (merged, claude, codex).
- human_idle_cache_counterfactual.md: GFM table for the web detail page.
- human_idle_cache_counterfactual.tex: optional LaTeX table.
- headline.json: headline values for the Overview gallery card.
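ScopeAccum keeps per-reduced-step cost-saved samples and preceding human idle gaps for avg / p50 / p90 rows. One way to reduce such samples to those three numbers is sketched below; the linear-interpolation percentile (the same rule as NumPy's default) is an assumption, since the method the script uses is not stated here, and the sample values are made up.

```python
import math
import statistics
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile for 0 <= q <= 100 (NumPy's default rule)."""
    if not values:
        raise ValueError("percentile of empty data")
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def summary_row(samples: Sequence[float]) -> dict[str, float]:
    """avg / p50 / p90 for one metric, e.g. cost saved per reduced step or idle gap."""
    return {
        "avg": statistics.fmean(samples),
        "p50": percentile(samples, 50),
        "p90": percentile(samples, 90),
    }


# Hypothetical per-reduced-step savings in USD (not data from the trace).
saved = [0.02, 0.05, 0.11, 0.19, 0.26, 0.40, 1.30]
print(summary_row(saved))
```

With a skewed sample like this one, a single large saving pulls the mean well above the median, which is why reporting p50 and p90 next to the average is informative.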