SyFI TraceLab
Reading the public SYFI pool
357,161 agent steps across Claude & Codex — public, shareable.
  • Session: one continuous trace of work, often spanning multiple requests or problems.
  • Request: one user input through the agent's final response.
  • Agent step: one model call inside a request.
  • User-initiated step: an agent step started by user input.
  • Tool-triggered step: an agent step started by a tool result.
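How these terms nest can be sketched in a few lines of Python. This is an illustration only: the field names `session_id` and `trigger`, and the values `"user"` and `"tool"`, are assumptions made for the example and not the trace's actual schema.

```python
from collections import defaultdict

def group_requests(steps):
    """Group agent steps into requests, per session.

    A user-initiated step opens a new request; tool-triggered steps
    extend the request that is currently open in that session.
    """
    sessions = defaultdict(list)  # session_id -> list of requests
    for step in steps:  # steps are assumed to be in chronological order
        requests = sessions[step["session_id"]]
        if step["trigger"] == "user" or not requests:
            requests.append([])  # start a new request
        requests[-1].append(step)
    return dict(sessions)

# Hypothetical steps: two sessions, three requests, five agent steps.
steps = [
    {"session_id": "s1", "trigger": "user"},
    {"session_id": "s1", "trigger": "tool"},
    {"session_id": "s1", "trigger": "tool"},
    {"session_id": "s1", "trigger": "user"},
    {"session_id": "s2", "trigger": "user"},
]
grouped = group_requests(steps)
request_sizes = {sid: [len(r) for r in reqs] for sid, reqs in grouped.items()}
print(request_sizes)
```

Each inner list length is the number of agent steps in one request, so the same step records can be rolled up at any of the three granularities.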
Problem

What does a coding session / request / step cost, and where does the money go?

Computes the USD-cost distribution behind tab:cost_distribution (src/04_SessionContext.tex). For each granularity (per session, per request, per step) and each billed category, the paper table reports the cost as avg / p50 / p90 / p99 plus the category’s share of total spend (the script also prints the underlying token distributions, incl. p25, to stdout):

  • Append tokens (newly_append_tokens): billed at the fresh-input/cache-write rates.
  • Prefix tokens (prefix_tokens): billed at the cache-read rate.
  • Output tokens (output_tokens, reasoning included): billed at the output rate.
  • Total: the sum of the three.
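The split above amounts to three multiplications per step. The sketch below is a minimal illustration, not the project's billing code: the `Rates` class and `step_cost` function are hypothetical names, the rate values are placeholders rather than real list prices, and the fresh-input and cache-write rates are collapsed into a single append rate for simplicity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rates:
    """USD per million tokens (placeholder values, not real list prices)."""
    append: float  # fresh-input / cache-write rate, collapsed into one number here
    prefix: float  # cache-read rate
    output: float  # output rate (reasoning tokens included)

def step_cost(newly_append_tokens, prefix_tokens, output_tokens, rates):
    """Return the USD cost of one agent step, split by billed category."""
    cost = {
        "append": newly_append_tokens * rates.append / 1e6,
        "prefix": prefix_tokens * rates.prefix / 1e6,
        "output": output_tokens * rates.output / 1e6,
    }
    cost["total"] = cost["append"] + cost["prefix"] + cost["output"]
    return cost

# Hypothetical step: a small new input on top of a large cached context.
rates = Rates(append=3.0, prefix=0.3, output=15.0)
example = step_cost(
    newly_append_tokens=2_000,
    prefix_tokens=100_000,
    output_tokens=500,
    rates=rates,
)
print(example)
```

With these placeholder numbers the prefix category is the largest of the three even though its per-token rate is the lowest.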
Tables

Per session

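| Metric | Avg | P50 | P90 | P99 | % cost |
|---|---|---|---|---|---|
| Total | $9.36 | $0.59 | $13.2 | $172 | |
| Append tokens | $2.50 | $0.21 | $3.33 | $40.2 | 26.7% |
| Prefix tokens | $5.77 | $0.17 | $7.55 | $111 | 61.7% |
| Output tokens | $1.09 | $0.16 | $1.92 | $16.9 | 11.6% |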

Per request

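| Metric | Avg | P50 | P90 | P99 | % cost |
|---|---|---|---|---|---|
| Total | $0.97 | $0.32 | $2.35 | $8.94 | |
| Append tokens | $0.25 | $0.03 | $0.68 | $3.22 | 26.7% |
| Prefix tokens | $0.60 | $0.16 | $1.35 | $6.51 | 61.7% |
| Output tokens | $0.11 | $0.03 | $0.29 | $1.08 | 11.6% |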

Per step

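| Metric | Avg | P50 | P90 | P99 | % cost |
|---|---|---|---|---|---|
| Total | $0.11 | $0.07 | $0.20 | $0.67 | |
| Append tokens | $0.03 | $0.00 | $0.03 | $0.62 | 26.7% |
| Prefix tokens | $0.07 | $0.05 | $0.12 | $0.41 | 61.7% |
| Output tokens | $0.01 | $0.01 | $0.03 | $0.12 | 11.6% |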
Table 1. Per-session, per-request, and per-step cost (USD) by category; % cost is each category’s share of total spend.
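A rough sketch of how one such table could be derived from per-unit costs follows. The function names and the toy numbers are invented for illustration, and the linear-interpolation percentile is an assumption; the actual script may use a different percentile definition.

```python
import math

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def summarize(costs_by_category):
    """costs_by_category maps a category to per-unit costs (same unit order)."""
    grand_total = sum(sum(costs) for costs in costs_by_category.values())
    # Per-unit total: add the categories element-wise, unit by unit.
    series = {"total": [sum(unit) for unit in zip(*costs_by_category.values())]}
    series.update(costs_by_category)
    rows = {}
    for name, costs in series.items():
        rows[name] = {
            "avg": sum(costs) / len(costs),
            "p50": percentile(costs, 50),
            "p90": percentile(costs, 90),
            "p99": percentile(costs, 99),
            # Share of total spend; left empty for the total row itself.
            "share": None if name == "total" else sum(costs) / grand_total,
        }
    return rows

# Toy per-session costs in USD (made up, five sessions).
toy = {
    "append": [1.0, 2.0, 3.0, 4.0, 10.0],
    "prefix": [2.0, 4.0, 6.0, 8.0, 40.0],
    "output": [14.0, 1.0, 2.0, 2.0, 1.0],
}
rows = summarize(toy)
for name, row in rows.items():
    print(name, row)
```

Two properties of the tables fall out of this construction. The share depends only on the category sums, so it is the same at every granularity. Averages add up across categories, but percentiles do not: each category's percentile is taken over its own sorted distribution, so the category P50s need not sum to the total's P50.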

For a coding agent the bill is dominated by re-reading context, not by generation (the paper’s tab:cost_distribution). Cached prefix tokens are 59.5% of total spend even though they are billed at roughly a tenth of the fresh-input rate — pure volume, since the accumulating context is replayed on every step — against 29.2% for append/new-input and only 11.2% for output. Output is cheap in aggregate despite its high per-token price because each step emits so few tokens. The absolute costs are modest at the median ($0.61/session, $0.33/request, $0.07/step) but carry a heavy tail: the average session is $9.70 and p99 reaches $178, a few very long sessions driving most of the spend. This inverts the usual intuition that generation is the expensive part.
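The volume argument can be checked with a small simulation. The sketch below assumes a simplified request in which every step adds a fixed amount of new input and output, and everything seen so far is re-read from cache on the next step. The function name, the token counts, and the rates (USD per million tokens, with the cache-read rate set to a tenth of the input rate) are all placeholders, not measurements from the trace.

```python
def request_cost_shares(n_steps, append_per_step, output_per_step,
                        append_rate=3.0, prefix_rate=0.3, output_rate=15.0):
    """Cost shares for one request whose context grows on every step.

    Rates are placeholder USD-per-million-token values; prefix_rate is a
    tenth of append_rate to mimic the cache-read discount.
    """
    context = 0  # tokens already in the conversation before the current step
    tokens = {"append": 0, "prefix": 0, "output": 0}
    for _ in range(n_steps):
        tokens["prefix"] += context          # the whole context is re-read
        tokens["append"] += append_per_step  # new input for this step
        tokens["output"] += output_per_step
        # Assumption: this step's input and output both join the cached context.
        context += append_per_step + output_per_step
    cost = {
        "append": tokens["append"] * append_rate / 1e6,
        "prefix": tokens["prefix"] * prefix_rate / 1e6,
        "output": tokens["output"] * output_rate / 1e6,
    }
    total = sum(cost.values())
    return {name: value / total for name, value in cost.items()}

short = request_cost_shares(n_steps=3, append_per_step=2_000, output_per_step=300)
long = request_cost_shares(n_steps=40, append_per_step=2_000, output_per_step=300)
print("3 steps: ", short)
print("40 steps:", long)
```

Append and output tokens grow linearly with the number of steps, while replayed prefix tokens grow roughly quadratically, so in this toy model the prefix share is small for a short request and exceeds half the cost for a long one despite the discount.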

Reference
Definitions
  • Cost is computed from the single-source price table artifacts/utils/pricing.json via web_analytics/pricing.py: price_for resolves a per-model price by exact name or model family, and round_cost bills append tokens at the input/cache-write rates, prefix tokens at the cache-read rate, and output tokens at the output rate (the same billing the web dashboard uses). Rounds whose model has no price are left unpriced and excluded; 99.1% of rounds are priced (the rest are codex:codex-auto-review / null-model rows). The script prints this coverage.
  • Request: one user turn, delimited by the same turn state machine as human_in_the_loop/user_turn_decomposition (39,202 turns, matching user_turn_response_time and session_internal_counts). Step: one LLM round. Session: one session_id.
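The exact-then-family price lookup and the coverage figure can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of web_analytics/pricing.py: `resolve_price`, the table layout, the model names starting with `example-`, the prices, and the longest-prefix family rule are all hypothetical.

```python
# Hypothetical price table, USD per million tokens.
PRICES = {
    "exact": {
        "example-large-v2": {"input": 3.0, "cache_read": 0.3, "output": 15.0},
    },
    "family": {
        "example-large": {"input": 2.5, "cache_read": 0.25, "output": 12.0},
        "example-small": {"input": 0.5, "cache_read": 0.05, "output": 2.0},
    },
}

def resolve_price(model, prices=PRICES):
    """Return the price entry for a model, or None if it is unpriced."""
    if not model:
        return None  # null-model rows have nothing to look up
    if model in prices["exact"]:
        return prices["exact"][model]
    # Family fallback (assumed rule): longest family name that prefixes the model.
    matches = [family for family in prices["family"] if model.startswith(family)]
    if matches:
        return prices["family"][max(matches, key=len)]
    return None

# One model name per round; None stands for a null-model row.
rounds = ["example-large-v2", "example-small-v1", None,
          "codex-auto-review", "example-large-v3"]
priced = [m for m in rounds if resolve_price(m) is not None]
coverage = len(priced) / len(rounds)
print(f"priced rounds: {len(priced)}/{len(rounds)} ({coverage:.1%})")
```

Unpriced rounds are dropped before any cost statistic is computed, which is why the coverage is worth reporting alongside the results.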
Running it
uv run python artifacts/session/session_cost_distribution/analyze.py -i trace/syfi_coding_trace.jsonl
uv run python artifacts/session/session_cost_distribution/analyze.py            # default merged trace
Outputs
  • session_cost_distribution.tex — the merged single-column cost table (Avg / P50 / P90 / P99 + % cost) for the paper.
  • session_cost_distribution.md — GFM Markdown mirror of the table, rendered on the web detail page.
  • headline.json — the few headline numbers for the Overview gallery card.
  • stdout — merged + per-provider (Claude / Codex) token and cost percentiles, plus the append / prefix / output cost composition.
Headline numbers (public trace, list prices as of 2026-06)
  • Cost composition: prefix/cached 59.5%, append/new-input 29.2%, output 11.2%. Cached input dominates spend despite the ~10× cache-read discount, purely on volume.
  • Avg cost: $9.70 / session, $1.01 / request, $0.11 / step; medians are far lower ($0.61 / $0.33 / $0.074) with a heavy session tail (p99 = $178).

