Append by prefix bin · SyFI TraceLab

Reading the public SYFI pool

357,161 agent steps across Claude & Codex — public, shareable.

Session

one continuous trace of work, often spanning multiple requests or problems.

Request

one user input through the agent's final response.

Agent step

one model call inside a request.

User-initiated step

an agent step started by user input.

Tool-triggered step

an agent step started by a tool result.

Problem

Given how large the cached prefix already is, how many new (uncached) tokens does a step append?

This analysis populates tab:append_by_prefix (src/05_LLMGeneration.tex), the quantitative companion to the prefix-vs-append scatter (fig:prefill_append_relationship). For Claude and Codex, every agent step is binned by its prefix_tokens, and within each bin we report the distribution of newly_append_tokens: count, avg, p50, p90, p99.

Prefix bins double in width, with edges at multiples of 1024 tokens: <1k, 1-2k, 2-4k, 4-8k, 8-16k, 16-32k, 32-64k, 64-128k, 128-256k, >256k. The prefix_tokens / newly_append_tokens accounting is the same one used by prefix_append_distribution and token_length_distribution, so the numbers reconcile across the three analyses.
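The binning and per-bin statistics described above can be sketched as follows. This is a minimal illustration, not the code in analyze.py: the half-open bin edges (a prefix of exactly 256k lands in >256k), the nearest-rank percentile, and the helper names `prefix_bin`, `percentile`, and `append_stats` are all assumptions.

```python
from collections import defaultdict

# Bin labels from the text; edges double in 1024-token units.
BIN_LABELS = ["<1k", "1-2k", "2-4k", "4-8k", "8-16k", "16-32k",
              "32-64k", "64-128k", "128-256k", ">256k"]


def prefix_bin(prefix_tokens: int) -> str:
    """Map a prefix length to its doubling bin (assumed half-open [lo, hi) edges)."""
    units = prefix_tokens // 1024           # whole 1024-token units
    if units < 1:
        return BIN_LABELS[0]
    # bit_length() is 1 for 1, 2 for 2-3, 3 for 4-7, ... so it indexes the doubling bins.
    return BIN_LABELS[min(units.bit_length(), len(BIN_LABELS) - 1)]


def percentile(sorted_vals, q: int):
    """Nearest-rank percentile (integer q in 1..100) of a non-empty sorted list."""
    rank = max(1, (q * len(sorted_vals) + 99) // 100)   # integer ceil(q * n / 100)
    return sorted_vals[rank - 1]


def append_stats(steps):
    """steps: iterable of (prefix_tokens, newly_append_tokens) pairs for one provider."""
    by_bin = defaultdict(list)
    for prefix, append in steps:
        by_bin[prefix_bin(prefix)].append(append)

    table = {}
    for label in BIN_LABELS:
        vals = sorted(by_bin.get(label, []))
        if not vals:
            table[label] = None             # empty bin, rendered as a dash in the tables
            continue
        table[label] = {
            "count": len(vals),
            "avg": sum(vals) / len(vals),
            "p50": percentile(vals, 50),
            "p90": percentile(vals, 90),
            "p99": percentile(vals, 99),
        }
    return table
```

Calling `append_stats` once per provider yields one row per bin label, in the same order as the tables below. A different percentile convention (for example linear interpolation) would shift p50/p90/p99 slightly in small bins.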

Tables

Claude

Prefix	Steps	Avg	P50	P90	P99
<1k	2,937	136.1K	78.4K	344.3K	871.9K
1-2k	0	—	—	—	—
2-4k	2	2.7K	2.7K	3.8K	4.0K
4-8k	530	122.0K	17.2K	385.3K	881.3K
8-16k	4,034	38.8K	3.5K	108.5K	549.1K
16-32k	10,248	35.4K	1.3K	30.7K	661.0K
32-64k	20,919	2.8K	951	5.3K	27.3K
64-128k	33,571	1.6K	793	3.7K	12.8K
128-256k	34,840	1.4K	710	3.2K	10.1K
>256k	33,257	1.4K	762	3.1K	8.8K

Codex

Prefix	Steps	Avg	P50	P90	P99
<1k	626	116.3K	124.3K	210.7K	247.0K
1-2k	90	22.2K	4.3K	63.9K	192.6K
2-4k	2,108	56.4K	20.8K	168.7K	240.8K
4-8k	3,501	60.2K	25.7K	172.4K	220.7K
8-16k	5,503	22.8K	2.9K	84.7K	195.5K
16-32k	10,470	9.6K	1.9K	18.8K	152.2K
32-64k	29,925	3.7K	954	8.3K	50.4K
64-128k	72,996	2.7K	796	6.1K	31.3K
128-256k	91,598	2.2K	771	5.3K	21.0K
>256k	6	750	900	1.1K	1.1K

Table 1. Append-token stats (steps / avg / p50 / p90 / p99) by prefix-length bin, per provider.

The table (tab:append_by_prefix) quantifies the inverse relationship behind the prefix-vs-append scatter: broadly, the more a step has already cached, the less it appends. In the smallest prefix bin (<1k: a cache miss or the very first request, where almost nothing is cached) the median append is large, 78.4K tokens for Claude and 124.3K for Codex, because nearly the whole prompt has to be sent as new. The trend is not strictly monotonic in the sparse low-prefix bins (for example, Codex’s median rises from 4.3K at 1-2k to 25.7K at 4-8k), but once the prefix grows past 32k the median append drops below 1k and stays there: Claude 951, 793, 710, 762 across the 32-64k to >256k bins, and Codex 954, 796, 771 across the 32-64k to 128-256k bins (its 6 steps in >256k have a median of 900). Those steps only stack an incremental tool result or user turn onto an already-cached context. The bins also expose provider structure: Claude’s prefix jumps almost straight to large values (its 1-2k bin is empty and 2-4k holds just 2 steps), reflecting a large system prompt, while Codex effectively caps near its 256k context window, with only 6 steps exceeding it.
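As a consistency check on the tables, the Steps columns can be summed and compared with the 357,161 agent steps quoted at the top of the page. The same arithmetic shows what fraction of each provider's steps falls in the 32k-and-above regime. The counts below are copied from the tables; the helper name `share_at_or_above` is illustrative only.

```python
# Step counts per prefix bin, copied from the Steps column of the two tables.
BINS = ["<1k", "1-2k", "2-4k", "4-8k", "8-16k", "16-32k",
        "32-64k", "64-128k", "128-256k", ">256k"]
CLAUDE_STEPS = [2937, 0, 2, 530, 4034, 10248, 20919, 33571, 34840, 33257]
CODEX_STEPS = [626, 90, 2108, 3501, 5503, 10470, 29925, 72996, 91598, 6]


def share_at_or_above(counts, first_bin):
    """Fraction of steps whose prefix bin is first_bin or larger."""
    start = BINS.index(first_bin)
    return sum(counts[start:]) / sum(counts)


total_steps = sum(CLAUDE_STEPS) + sum(CODEX_STEPS)
claude_share = share_at_or_above(CLAUDE_STEPS, "32-64k")
codex_share = share_at_or_above(CODEX_STEPS, "32-64k")

print(f"total steps: {total_steps:,}")
print(f"Claude steps with prefix >= 32k: {claude_share:.1%}")
print(f"Codex steps with prefix >= 32k:  {codex_share:.1%}")
```

The total reconciles with the pool size, and about 87% of Claude steps and about 90% of Codex steps sit in the bins where the median append is below 1k.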

Reference

Running it

uv run python artifacts/llm_generation/append_by_prefix_bin/analyze.py -i trace/syfi_coding_trace.jsonl
uv run python artifacts/llm_generation/append_by_prefix_bin/analyze.py        # default merged trace
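
For readers who want to inspect the input directly, a hedged sketch of pulling the per-step token pairs out of a JSONL trace follows. It assumes one flat JSON object per line; the `provider` key and the function name `load_steps` are hypothetical, and only the prefix_tokens and newly_append_tokens field names come from the text above. The real trace schema may differ.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_steps(path):
    """Return {provider: [(prefix_tokens, newly_append_tokens), ...]} from a JSONL trace."""
    steps = defaultdict(list)
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            # "provider" is a hypothetical key; the two token fields are named in the text.
            steps[row["provider"]].append(
                (int(row["prefix_tokens"]), int(row["newly_append_tokens"]))
            )
    return dict(steps)
```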

Outputs

append_by_prefix_bin.tex — the Claude/Codex table for the paper (empty bins render as --).
append_by_prefix_bin.md — GFM Markdown mirror of the table, rendered on the web detail page.
headline.json — the few headline numbers for the Overview gallery card.
stdout — the same per-provider breakdown in plain text.

Headline numbers (public trace)

Append and prefix are inversely related. At <1k prefix (cold start — a cache miss or the first request) the median append is 78k tokens for Claude and 124k for Codex.
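Once the prefix exceeds ~32k (incremental tool-loop / user steps) the median append drops below 1k for both providers (medians between 710 and 954 tokens), and the p99 tail shrinks as the prefix grows: from 27.3K to 8.8K for Claude across the 32-64k to >256k bins, and from 50.4K to 21.0K for Codex through the 128-256k bin.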
Bins reveal provider structure: Claude’s prefix jumps almost straight to large values (the 1-2k bin is empty, 2-4k has 2 steps) because its system prompt is large; Codex effectively caps near its 256k context (only 6 steps exceed it).

No figures.