SyFI TraceLab
357,161 agent steps across Claude & Codex — public, shareable.
Session: one continuous trace of work, often spanning multiple requests or problems.
Request: one user input through the agent's final response.
Agent step: one model call inside a request.
User-initiated step: an agent step started by user input.
Tool-triggered step: an agent step started by a tool result.
Problem

Given how large the cached prefix already is, how many new (uncached) tokens does a step append?

Populates tab:append_by_prefix (src/05_LLMGeneration.tex) — the quantitative companion to the prefix-vs-append scatter (fig:prefill_append_relationship). For Claude and Codex, every agent step is binned by its prefix_tokens, and within each bin we report the distribution of newly_append_tokens: count, avg, p50, p90, p99.

Prefix bins are doubling, in 1024-token units: <1k, 1-2k, 2-4k, 4-8k, 8-16k, 16-32k, 32-64k, 64-128k, 128-256k, >256k. The prefix_tokens / newly_append_tokens accounting is the same one used by prefix_append_distribution and token_length_distribution, so the numbers reconcile.
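The binning and per-bin statistics described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in analyze.py: the helper names (prefix_bin, percentile, summarize) are hypothetical, the bins are assumed to be half-open [lo, hi) so that a prefix of exactly 256k lands in >256k, and percentiles are assumed to use linear interpolation.

```python
import math
from collections import defaultdict

K = 1024  # bins are expressed in 1024-token units
BIN_LABELS = ["<1k", "1-2k", "2-4k", "4-8k", "8-16k", "16-32k",
              "32-64k", "64-128k", "128-256k", ">256k"]


def prefix_bin(prefix_tokens: int) -> str:
    """Map a prefix length to its doubling bin (assumed half-open [lo, hi))."""
    if prefix_tokens < K:
        return BIN_LABELS[0]
    # bit_length of the size in k-units is 1 for [1k, 2k), 2 for [2k, 4k), ...
    idx = (prefix_tokens // K).bit_length()
    return BIN_LABELS[min(idx, len(BIN_LABELS) - 1)]


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted, non-empty list; q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summarize(steps):
    """steps: iterable of (prefix_tokens, newly_append_tokens) pairs for one provider."""
    by_bin = defaultdict(list)
    for prefix, append in steps:
        by_bin[prefix_bin(prefix)].append(append)
    table = {}
    for label in BIN_LABELS:
        vals = sorted(by_bin.get(label, []))
        if not vals:
            table[label] = None  # empty bin, no statistics to report
            continue
        table[label] = {
            "count": len(vals),
            "avg": sum(vals) / len(vals),
            "p50": percentile(vals, 50),
            "p90": percentile(vals, 90),
            "p99": percentile(vals, 99),
        }
    return table
```

Calling summarize once per provider would yield one row per bin, with None standing in for an empty bin.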

Tables

Claude

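| Prefix | Steps | Avg | P50 | P90 | P99 |
|---|---:|---:|---:|---:|---:|
| <1k | 2,937 | 136.1K | 78.4K | 344.3K | 871.9K |
| 1-2k | 0 | -- | -- | -- | -- |
| 2-4k | 2 | 2.7K | 2.7K | 3.8K | 4.0K |
| 4-8k | 530 | 122.0K | 17.2K | 385.3K | 881.3K |
| 8-16k | 4,034 | 38.8K | 3.5K | 108.5K | 549.1K |
| 16-32k | 10,248 | 35.4K | 1.3K | 30.7K | 661.0K |
| 32-64k | 20,919 | 2.8K | 951 | 5.3K | 27.3K |
| 64-128k | 33,571 | 1.6K | 793 | 3.7K | 12.8K |
| 128-256k | 34,840 | 1.4K | 710 | 3.2K | 10.1K |
| >256k | 33,257 | 1.4K | 762 | 3.1K | 8.8K |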

Codex

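| Prefix | Steps | Avg | P50 | P90 | P99 |
|---|---:|---:|---:|---:|---:|
| <1k | 626 | 116.3K | 124.3K | 210.7K | 247.0K |
| 1-2k | 90 | 22.2K | 4.3K | 63.9K | 192.6K |
| 2-4k | 2,108 | 56.4K | 20.8K | 168.7K | 240.8K |
| 4-8k | 3,501 | 60.2K | 25.7K | 172.4K | 220.7K |
| 8-16k | 5,503 | 22.8K | 2.9K | 84.7K | 195.5K |
| 16-32k | 10,470 | 9.6K | 1.9K | 18.8K | 152.2K |
| 32-64k | 29,925 | 3.7K | 954 | 8.3K | 50.4K |
| 64-128k | 72,996 | 2.7K | 796 | 6.1K | 31.3K |
| 128-256k | 91,598 | 2.2K | 771 | 5.3K | 21.0K |
| >256k | 6 | 750 | 900 | 1.1K | 1.1K |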
Table 1. Append-token stats (steps / avg / p50 / p90 / p99) by prefix-length bin, per provider.
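As a quick sanity check, the Steps column of Table 1 can be summed per provider and compared with the 357,161 agent steps quoted for the public pool. The sketch below only transcribes the step counts from the table; the variable and function names are illustrative and not part of the analysis script.

```python
# Step counts per prefix bin, transcribed from Table 1 in bin order
# (<1k, 1-2k, 2-4k, 4-8k, 8-16k, 16-32k, 32-64k, 64-128k, 128-256k, >256k).
CLAUDE_STEPS = [2_937, 0, 2, 530, 4_034, 10_248, 20_919, 33_571, 34_840, 33_257]
CODEX_STEPS = [626, 90, 2_108, 3_501, 5_503, 10_470, 29_925, 72_996, 91_598, 6]

POOL_TOTAL = 357_161  # agent steps quoted for the public pool


def share_from_bin(steps, first_bin_index):
    """Fraction of steps whose prefix falls in bin `first_bin_index` or later."""
    return sum(steps[first_bin_index:]) / sum(steps)


claude_total = sum(CLAUDE_STEPS)
codex_total = sum(CODEX_STEPS)
print(claude_total, codex_total, claude_total + codex_total)  # 140338 216823 357161

# Index 6 is the 32-64k bin, where median appends first drop below 1k.
print(f"Claude steps with prefix >= 32k: {share_from_bin(CLAUDE_STEPS, 6):.1%}")
print(f"Codex steps with prefix >= 32k: {share_from_bin(CODEX_STEPS, 6):.1%}")
```

The two provider totals add up to the pool size, so every agent step lands in exactly one bin.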

The table (tab:append_by_prefix) quantifies the inverse relationship behind the prefix-vs-append scatter: the more a step has already cached, the less it appends. In the smallest prefix bin (<1k: a cache miss or the very first request, where almost nothing is cached) the median append is huge, 78.4K tokens for Claude and 124.3K for Codex, because nearly the whole prompt has to be sent as new. Once the prefix grows past 32k, the median append drops below 1k: Claude goes from 951 (32-64k) to 762 (>256k), and Codex from 954 (32-64k) to 771 (128-256k), with a median of 900 over the 6 Codex steps above 256k. Those steps only stack an incremental tool result or user turn onto an already-cached context. The bins also expose provider structure. Claude’s prefix jumps almost straight to large values (its 1-2k bin is empty and 2-4k holds just 2 steps), reflecting a large system prompt, while Codex effectively caps near its 256k context window, with only 6 steps exceeding it.
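The threshold claim can be checked directly against the P50 column of Table 1. The sketch below transcribes the printed medians, assuming the "K" suffix means thousands (the table does not say whether it is 1000 or 1024), and lists the bins whose median append is under 1,000 tokens. The names are illustrative only.

```python
BIN_LABELS = ["<1k", "1-2k", "2-4k", "4-8k", "8-16k", "16-32k",
              "32-64k", "64-128k", "128-256k", ">256k"]

# Median (p50) append tokens per bin, transcribed from Table 1.
# Assumption: "78.4K" is read as 78,400. None marks Claude's empty 1-2k bin.
CLAUDE_P50 = [78_400, None, 2_700, 17_200, 3_500, 1_300, 951, 793, 710, 762]
CODEX_P50 = [124_300, 4_300, 20_800, 25_700, 2_900, 1_900, 954, 796, 771, 900]


def bins_below(p50s, limit=1_000):
    """Return the labels of bins whose median append is below `limit` tokens."""
    return [label for label, p50 in zip(BIN_LABELS, p50s)
            if p50 is not None and p50 < limit]


def cold_to_warm_ratio(p50s):
    """Ratio of the <1k median to the 32-64k median (index 0 vs index 6)."""
    return p50s[0] / p50s[6]


print("Claude:", bins_below(CLAUDE_P50))
print("Codex: ", bins_below(CODEX_P50))
print(f"Claude cold/warm median ratio: {cold_to_warm_ratio(CLAUDE_P50):.0f}x")
print(f"Codex cold/warm median ratio: {cold_to_warm_ratio(CODEX_P50):.0f}x")
```

For both providers the sub-1k bins are exactly the four bins from 32-64k upward, which is where the ~32k threshold in the text comes from.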

Reference
Running it
uv run python artifacts/llm_generation/append_by_prefix_bin/analyze.py -i trace/syfi_coding_trace.jsonl
uv run python artifacts/llm_generation/append_by_prefix_bin/analyze.py        # default merged trace
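For readers who want to inspect the input before running the script, the following sketch groups (prefix_tokens, newly_append_tokens) pairs by provider from a JSONL stream. It is not the loader used by analyze.py. The field names prefix_tokens and newly_append_tokens come from the description above, but the assumption that each line is one flat agent-step record, and the "provider" key itself, are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def load_steps(lines: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Group (prefix_tokens, newly_append_tokens) pairs by provider.

    Assumes one JSON object per line with a hypothetical "provider" key.
    """
    by_provider = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        by_provider[rec["provider"]].append(
            (int(rec["prefix_tokens"]), int(rec["newly_append_tokens"]))
        )
    return dict(by_provider)


if __name__ == "__main__":
    import sys

    # Usage (path taken from the command above): python load_steps.py trace/syfi_coding_trace.jsonl
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as fh:
            for provider, steps in load_steps(fh).items():
                print(provider, len(steps))
```

Any iterable of lines works as input, so the same function can be exercised on a small in-memory sample before pointing it at the full trace.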
Outputs
  • append_by_prefix_bin.tex — the Claude/Codex table for the paper (empty bins render as --).
  • append_by_prefix_bin.md — GFM Markdown mirror of the table, rendered on the web detail page.
  • headline.json — the few headline numbers for the Overview gallery card.
  • stdout — the same per-provider breakdown in plain text.
Headline numbers (public trace)
  • Append and prefix are inversely related. At <1k prefix (cold start — a cache miss or the first request) the median append is 78k tokens for Claude and 124k for Codex.
  • Once the prefix exceeds ~32k (incremental tool-loop / user steps) the median append drops below 1k for both providers, and the p99 tail shrinks to tens of thousands of tokens (Claude 27.3K to 10.1K, Codex 50.4K to 21.0K across the 32-64k to 128-256k bins).
  • Bins reveal provider structure: Claude’s prefix jumps almost straight to large values (the 1-2k bin is empty, 2-4k has 2 steps) because its system prompt is large; Codex effectively caps near its 256k context (only 6 steps exceed it).

No figures.
