Output attribution · SyFI TraceLab

Session

one continuous trace of work, often spanning multiple requests or problems.

Request

one user input through the agent's final response.

Agent step

one model call inside a request.

User-initiated step

an agent step started by user input.

Tool-triggered step

an agent step started by a tool result.

Problem

A prior step’s output must reappear in the next step’s prompt — but where? Folded into the cached prefix (output-cached), or dropped from the cache and re-sent as freshly billed append (output-resend)? This experiment lays out the two accounting schemes, then measures which one each model actually uses.

Figures 5

Fig. 1Two ways a prior step’s output is accounted in the next step: cached as prefix, or re-sent as append (schematic).

Fig. 2Previous output versus next append, min 2000 output tokens.

Fig. 3Ranked previous output and next append, min 2000 output tokens.

Fig. 4Previous output versus next prefix gain, min 2000 output tokens.

Fig. 5Ranked previous output and prefix gain, min 2000 output tokens.

Reference

Experiment overview

The previous assistant response is normally replayed into the next prompt. This experiment asks where it lands by comparing adjacent agent steps in the same session: prev.output_tokens against the next step’s newly_append_tokens (the freshly charged slice) and against the next step’s prefix gain (current.prefix_tokens − previous.input_tokens_total). It tests the replay hypothesis from ../../../docs/prompt_cache_accounting.md: the prior response generally shows up in the next step’s append/cache-write.

Method and assumptions:

Adjacent agent steps within a session. Steps are grouped per (provider, session_id), sorted by (round_index, first-event timestamp), and each step is paired with the one immediately after it in that sorted order. (Unlike adjusted_prefix_append, pairs are adjacency-in-order, not a strict round_index step of 1.)
A timing gap gates each pair. The gap from the previous step’s last model-output event (reasoning/text/tool_call, falling back to its first observed timestamp) to the current step’s first observed timestamp must be ≥ 0 and ≤ --max-gap-seconds (default 240) — this drops cross-conversation or stale pairs. Pairs whose previous step has non-positive input_tokens_total or output_tokens are skipped.
Scenarios split pairs by provider/model and by how the next step started (tool_result vs user_message): Claude, gpt-5.5, gpt-5.4, gpt-5.3-codex, gpt-5.2-codex.
Assignment heuristic. Per pair, with a per-pair tolerance max(512, 0.10·prev_output): prefix_close (prefix gain ≈ prev output), prefix_rejects_output (prefix gain far below prev output), append_can_contain_output, and append_side_pair (= reject ∧ can-contain). The per-scenario prefix_close_pct / append_side_pair_pct drive a decision label (prefix_side / append_side / mixed / not_sure) and a decision_strength.
Output proxy. Output is taken as the step’s raw output_tokens. For Codex, output includes reasoning, so the visible-output proxy output_tokens − reasoning_output_tokens is reported as context in the method notes (the carried prev_reasoning is recorded per pair), but the plotted quantity is output_tokens.
Thresholds (--min-output-tokens, default 2000 4000) drop tiny-output noise; one full set of figures + summary is emitted per threshold N.
Stats are exact. Every percentile / correlation in the summary CSV is computed over all pairs in the scenario (legacy linear-interpolation percentile, (n−1)·q); nothing is sampled for the stats.
Adjacency ordering is file order. The pre-migration JSONL loader grouped rows per (provider, session_id) in first-appearance (file) order, then stably sorted each session by (round_index, first_timestamp) so ties kept file order. The shared DuckDB surrogate key ingest_seq (= round_pk) is that file order, so pulling ORDER BY ingest_seq and grouping in Python reproduces both the per-session row order and the session-visitation order byte-for-byte. This matters for the scatter: the per-scenario subsample’s stable sort by prev_output keeps the pair-append order on ties, and that append order is driven by the session-visitation order.

Code structure

plot.py is a query→shape→plot pipeline over the shared trace DuckDB:

load_pairs(con, *, max_gap_seconds) — pulls the step-level columns ORDER BY ingest_seq and the per-step timing events ORDER BY round_pk, event_index (timestamps as epoch_us integers, rebuilt to naive datetimes for native/wasm-identical marshalling), drops rows with a non-string provider/session_id, a non-integer round_index, or no observed timestamp (the old loader’s validity gate; in the pinned-schema DB these are the NULL rows), groups into rows_by_session preserving file order, stably sorts each session by (round_index, first_timestamp), then walks adjacent pairs applying the gap gate. Returns the list of Pairs.
first_timestamp(...) / last_model_output_timestamp(...) — first observed and last model-output timestamps of a step, over its timing events in event_index order, unchanged in semantics from pre-migration.
The assignment predicates (prefix_close, prefix_rejects_output, append_can_contain_output, append_side_pair, assignment_label) and scenario_groups() — unchanged.
plot_scatter_grid / plot_rank_grid / plot_prefix_gain_scatter_grid / plot_prefix_gain_rank_grid — the four figure families; sampled_pairs(...) is the per-scenario scatter decimation (deterministic rank-stratified np.linspace over a stable prev_output sort).
write_summary(...) — the per-scenario decision/quantile CSV, using the legacy percentile / fmt helpers so values match the pre-migration run exactly.
main() — wires the standard trace_db CLI (--db | -i/--input | -o/--output-dir), keeps the custom --max-gap-seconds / --min-output-tokens / --max-points-per-scenario flags, and embeds the self-contained PNG sidecar.

The data layer (parsing, surrogate keys, schema) lives in artifacts/utils/trace_db.py; see artifacts/utils/DB_SCHEMA.md.

Running it

# default merged trace, output next to this README
uv run python artifacts/llm_generation/output_append_assignment/plot.py

# a specific trace (materialized to a temp DuckDB cache on first use)
uv run python artifacts/llm_generation/output_append_assignment/plot.py -i trace/sample.jsonl

# a prebuilt DB (run_all.py's build-db step passes this), into a chosen dir
uv run python artifacts/llm_generation/output_append_assignment/plot.py --db "$TMPDIR/trace.duckdb" -o "$TMPDIR/out"

Useful flags: --max-gap-seconds (pair gap ceiling, default 240), --min-output-tokens (one figure set per threshold, default 2000 4000), --max-points-per-scenario (scatter/rank subsample per scenario, default 6000).

Outputs

Written to -o (default this folder), one set per --min-output-tokens threshold N:

output_vs_next_append_scatter_min{N}.png — prev-output vs next-append scatter grid.
ranked_output_vs_next_append_min{N}.png — ranked prev-output / next-append grid.
output_vs_prefix_gain_scatter_min{N}.png — prev-output vs next-prefix-gain scatter grid.
ranked_output_vs_prefix_gain_min{N}.png — ranked prev-output / next-prefix-gain grid.
output_append_assignment_summary_min{N}.csv — per-scenario count, decision, decision_strength, the log-output correlations, the prefix_close / prefix_reject / append_can_contain_output / append_side_pair / unassigned percentages, and the median / p10 / p90 token quantiles (over all pairs).

Each PNG is self-contained — it embeds this README, the summary CSVs, and the plotting code (plot.py + shared artifacts/utils/ modules). Unpack with uv run python artifacts/utils/png_sidecar.py extract <png>.