Cost of human thinking time

会话

一段连续的工作记录，通常包含多个请求或问题。

请求

从一次用户输入到智能体最终响应的完整过程。

智能体步骤

请求内部的一次模型调用。

用户触发步骤

由用户输入启动的智能体步骤。

工具触发步骤

由工具结果启动的智能体步骤。

问题

From a consumer perspective, how much extra append-prefill and API cost does a user pay because “thinking” time can turn prefix-cache hits into misses?

表格

Metric	Claude	Codex	Total
User-initiated steps with predecessor	16,927	17,033	33,960
Observed append tokens	1.19B	1.15B	2.34B
Append after retained cache	541.9M	721.7M	1.26B
Append-token reduction	648.1M (54.5%)	423.9M (37.0%)	1.07B (45.9%)
Observed total cost	$22,654	$17,777	$40,431
Cost after retained cache	$18,973	$16,269	$35,242
Total cost reduction	$3,680 (16.2%)	$1,508 (8.5%)	$5,189 (12.8%)
Cost saved / reduced step avg	$0.263	$0.116	$0.192

表 1Upper-bound append-token and cost savings from eliminating user-thinking-induced prefix-cache misses.

From this consumer perspective, if user-initiated steps retained their prefix cache across human thinking time, append-prefill would drop from 2.34B to 1.26B tokens in the merged trace: a 1.07B-token reduction, or 45.9% of all append tokens. Because the estimate only changes user-initiated steps, that reduction is 95.1% of observed user-initiated append tokens.

With pricing.json prices as of 2026-06, the estimated final cost falls from $40,431 to $35,242, saving $5,189 (12.8%) over priced rounds. The split is 648.1M fewer append tokens and $3,680 saved for Claude, and 423.9M fewer append tokens and $1,508 saved for Codex. Dollar totals price 99.1% of rounds; token reductions include all rounds.

参考

Experiment overview

This is an upper-bound savings estimate, not an observed cache metric. We analyze prefix caching from a consumer perspective: the user-visible cost of “thinking” is the extra fresh-input prefill billed when a user-initiated step resumes after the prefix cache has expired. For every user-initiated step S with a predecessor P in the same session, the estimate caps observed append at the step’s net context growth:

total_input(S)            = prefix_tokens(S) + newly_append_tokens(S)
context_growth(S)         = max(0, total_input(S) - total_input(P))
append_after_retained_cache(S)  = min(newly_append_tokens(S), context_growth(S))
prefix_after_retained_cache(S)  = total_input(S) - append_after_retained_cache(S)

All other steps keep their observed prefix_tokens / newly_append_tokens split. The total input length and output tokens do not change. Any append-token reduction is moved into prefix tokens and billed at the cache-read price; remaining Claude cache-creation tokens are billed at the 5-minute cache-write rate. Because the estimate assumes all shifted tokens can be served from cache at the cache-read rate, the resulting savings are an upper bound rather than an achievable policy guarantee.

Method and assumptions:

A user-initiated step is a round whose first timing event is user_message, matching the trigger convention used by cache_hit_ratio and redundant_prefill.
Pairing follows redundant_prefill / session/total_input_growth: the predecessor is the last round seen for the same session_id in round_pk file order. Session-first user steps have no predecessor and remain unchanged.
This isolates the consumer-side cost of human thinking time. Tool-result steps and session-first steps are left as observed.
Costs use artifacts/utils/pricing.json through artifacts/web_analytics/pricing.py: append at fresh-input/cache-write rates, prefix at the cache-read rate, output unchanged. Unpriced rounds contribute to token counts but are excluded from dollar totals.

Code structure

collect(con) streams rounds joined with each round’s first timing event, walks sessions in file order, and accumulates observed vs. retained-cache token/cost totals by scope.
ScopeAccum stores token totals, user-step coverage, priced-round coverage, and observed / retained-cache cost buckets. It also stores per-reduced-step cost-saved samples and preceding human idle gaps for avg / p50 / p90 rows.
write_summary_csv(...), render_md(...), and write_latex_table(...) emit the raw summary, web table, and optional paper table.

Running it

# default merged trace (materialized to a temp DuckDB cache on first use)
uv run python artifacts/prefix_cache/human_idle_cache_counterfactual/analyze.py

# a prebuilt DB, into a chosen dir
uv run python artifacts/prefix_cache/human_idle_cache_counterfactual/analyze.py \
  --db trace/syfi_coding_trace.duckdb -o /tmp/out

Outputs

human_idle_cache_counterfactual_summary.csv - raw observed vs. retained-cache token and cost totals per scope (merged, claude, codex).
human_idle_cache_counterfactual.md - GFM table for the web detail page.
human_idle_cache_counterfactual.tex - optional LaTeX table.
headline.json - headline values for the Overview gallery card.