Inside a single coding session, how does the context window grow from one agent step to the next: where do the user’s messages land, how much is cheap cached prefix vs. freshly appended input, and where does the agent compact and start over?
Each row in the trace is one agent step. This experiment picks a handful of illustrative sessions and, for each, draws one bar per step: cached/prefix tokens (blue) stacked under newly appended input tokens (orange), with the running total input as a line on top. A thin top strip lays the same steps out on a 5-minute wall-clock timeline so you can see where the agent paused. It is the closest thing in the toolkit to watching a session breathe.
Method and assumptions:
- One step per invocation, ordered within a session by (`round_index`, first-event timestamp, ingestion order). The ingestion-order tie-break is file order, so equal-timestamp steps never reorder.
- Prefix vs. append come straight from the step’s `prefix_tokens`/`newly_append_tokens`; their sum is the full input size for that invocation.
- User-initiated steps (`U1`, `U2`, …) are the steps whose timing events include a visible `user_message`, i.e. where the human actually typed, as opposed to tool-triggered steps.
- Compaction is flagged two ways: an explicit marker (a timing event whose type/source mentions “compact”), or an inferred input drop, where the full input size falls by ≥8k tokens and ≥25% from a base of ≥32k, and stays low for the next few steps (a one-step dip that rebounds is ignored). A prefix-only decrease is deliberately not treated as compaction, because a cache miss can shift tokens from prefix to append without shrinking the real context.
- Generation time per step is measured from the last input event at-or-before the first model output to the last model output — the model’s own “thinking + generating” span, excluding human wait time.
- Session selection is automatic and deterministic: candidates are filtered by step count and user-initiated/tool-triggered mix, then three ranked picks are unioned (a balanced score, a context-heavy score, and a compaction-heavy score) so the gallery shows variety, not six look-alikes. Pin specific sessions with `--session-id`.
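
The ordering and inferred-compaction rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the experiment’s actual code: the `Step` class, the `first_event_ts` and `ingestion_order` field names, and the `order_steps` and `inferred_compactions` helpers are hypothetical. The method says only that the input “stays low for the next few steps”, so the look-ahead of three steps (`hold`) and the rebound threshold used here are guesses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    # Field names other than prefix_tokens / newly_append_tokens / round_index
    # are hypothetical stand-ins for whatever the trace actually stores.
    round_index: int
    first_event_ts: float      # timestamp of the step's first event
    ingestion_order: int       # file order; final tie-break
    prefix_tokens: int
    newly_append_tokens: int

    @property
    def full_input(self) -> int:
        # Full input size for the invocation = cached prefix + new append.
        return self.prefix_tokens + self.newly_append_tokens


def order_steps(steps: List[Step]) -> List[Step]:
    # Sort by (round_index, first-event timestamp, ingestion order).
    return sorted(
        steps,
        key=lambda s: (s.round_index, s.first_event_ts, s.ingestion_order),
    )


def inferred_compactions(
    steps: List[Step],
    min_drop: int = 8_000,     # drop of at least 8k tokens
    min_frac: float = 0.25,    # ... and at least 25% of the base
    min_base: int = 32_000,    # ... from a base of at least 32k
    hold: int = 3,             # ASSUMPTION: "next few steps" = 3
) -> List[int]:
    """Return indices of steps whose full input drops and stays low."""
    sizes = [s.full_input for s in steps]
    hits = []
    for i in range(1, len(sizes)):
        base, cur = sizes[i - 1], sizes[i]
        drop = base - cur
        if base < min_base or drop < min_drop or drop < min_frac * base:
            continue
        # ASSUMPTION: "stays low" means none of the following `hold` steps
        # climbs back above the level that qualified as a drop. A drop on
        # the final step has nothing to rebound and is therefore kept.
        ceiling = base - max(min_drop, min_frac * base)
        if all(x <= ceiling for x in sizes[i + 1 : i + 1 + hold]):
            hits.append(i)
    return hits
```

Because the check uses `full_input` rather than `prefix_tokens`, a cache miss that moves tokens from prefix to append leaves the size unchanged and is not flagged, which matches the prefix-only rule described above.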