Context compactions · SyFI TraceLab

Session

one continuous trace of work, often spanning multiple requests or problems.

Request

one user input through the agent's final response.

Agent step

one model call inside a request.

User-initiated step

an agent step started by user input.

Tool-triggered step

an agent step started by a tool result.

Problem

How many context compactions does a coding session undergo?

A compaction is the behavioral event the paper distinguishes from the plain size buckets in total_input_growth: when the running context nears its limit it is summarized/dropped to a short history and then slowly re-accumulates. We detect it structurally from the per-step total input length (prefix_tokens + newly_append_tokens), ordered by round_pk within each session. A step i is a compaction when all three hold:

Great reduction — total[i-1] - total[i] >= 64k (--min-drop-tokens, defaults to growth.MAJOR_REDUCTION_MIN_TOKENS). Every compaction is therefore also a major reduction; compactions are the strict subset that also satisfy (2) and (3).
Near the context limit — the pre-drop level total[i-1] is at least --near-max-ratio (0.75) of the session’s observed max total input. The drop happens near the session’s peak, not at a small early dip.
Recovers slowly — the context does not rebound to --rebound-ratio (0.75) of the pre-drop level within the next --rebound-steps (3) steps, and at least one step follows. A drop that immediately snaps back is a branch/edit artifact, not a compaction.

Each compaction is attributed to the trigger of step i, using the same user-initiated / tool-initiated split the rest of the paper uses: user_message → user-initiated (an explicit /compact or a new request that forced summarization); tool_result → tool-initiated (auto-compaction mid-loop).

Tables

Metric	Claude	Codex
Sessions	2,676	1,589
Major reductions (≥64k drop)	324	1,306
of which compactions	284 (87.7%)	1,235 (94.6%)
Sessions with ≥1 compaction	120 (4.5%)	292 (18.4%)

Compactions per session

Metric	Claude	Codex
Avg (all sessions)	0.106	0.777
Avg (sessions with ≥1)	2.37	4.23
P90 / P99 (sessions with ≥1)	6 / 12	8 / 34

Trigger

Metric	Claude	Codex
User-initiated	105 (37.0%)	100 (8.1%)
Tool-initiated	179 (63.0%)	1,135 (91.9%)

Table 1Compactions per session and their user- vs tool-initiated trigger split, by provider.

Most large context drops are genuine compactions, and they hit Codex far harder than Claude (the paper’s tab:session_compaction). Of the major (≥64k) reductions, 284/324 qualify for Claude and 1,235/1,306 for Codex — 1,519 of 1,630 combined — so the structural near-limit + slow-recovery test rarely flags a false positive. Compactions are uncommon per session but lopsided by provider: only 4.5% of Claude sessions ever compact versus 18.4% of Codex, and among sessions that do, the mean and tail are larger for Codex (avg 4.23, p99 = 34) than Claude (avg 2.37, p99 = 12). The trigger split mirrors the autonomy finding: Codex compactions are overwhelmingly tool-initiated (91.9%, auto-compaction mid-loop), while Claude’s are more evenly split (63.0% tool / 37.0% user).

Reference

Running it

# pinned public trace
uv run python artifacts/session/session_compaction_counts/analyze.py -i trace/syfi_coding_trace.jsonl

# default merged trace
uv run python artifacts/session/session_compaction_counts/analyze.py

# loosen/tighten the definition
uv run python artifacts/session/session_compaction_counts/analyze.py \
    --near-max-ratio 0.8 --rebound-steps 5

Outputs

session_compaction_counts.tex — the merged summary table (tab:session_compaction).
session_compaction_counts.md — GFM Markdown mirror of the table, rendered on the web detail page.
headline.json — the few headline numbers for the Overview gallery card.
stdout — merged + per-provider (Claude / Codex): total compactions, share of sessions with ≥1, the per-session distribution (avg / p25 / p50 / p90 / p99, over all sessions and over only those with ≥1), and the user-initiated-vs-tool-initiated trigger split.

Headline numbers (public trace, default criteria)

Of 1,630 major reductions (≥64k drops), 1,519 (93.2%) qualify as compactions.
9.7% of the 4,265 sessions undergo at least one compaction.
Overwhelmingly tool-initiated (86.5%, mid-loop) rather than user-initiated.
Far more common in Codex (18.4% of sessions, 1,235 events) than Claude (4.5%, 284).
Among sessions that compact at all, the mean is 3.7 and the tail is long (Codex p99 = 34).

No figures.