Skip to content

Runner vs OTel Overhead Findings (2026-05)

Status: event-rate boundary-finding update. This document summarizes the overhead follow-up evidence collected so far. It does not commit the generated measurement artifacts. Direct arm deltas are reported only for the matching linux-aarch64-6.8.0-117-generic host class.

Evidence Anchors

Slice Workflow run Arm Samples Result
2 26449999294 Arm C dual capture 20 wall-clock 20 valid, 0 discarded, all health gates clean
3 26454010701 Arm C dual capture 5 RSS 5 valid, 0 discarded, all health gates clean
6 wall 26459699303 Arm B OTel-only 20 wall-clock 20 valid, 0 discarded, same host class
6 RSS 26461726436 Arm B OTel-only 5 RSS 5 valid, 0 discarded, same host class
7 sanity 26463582658 Arm A runner-only 2 wall-clock 2 valid, 0 discarded, kernel layer complete
7 wall 26463798358 Arm A runner-only 20 wall-clock 20 valid, 0 discarded, same host class
7 RSS 26464003194 Arm A runner-only 5 RSS 5 valid, 0 discarded, same host class
8 diagnostic 26472122983 Arm A runner-only 20 wall-clock repeat failed; one sample discarded, partial artifacts not uploaded by the old workflow
8 repeat 26473448298 Arm A runner-only 20 wall-clock repeat 20 valid, 0 discarded, artifact-success gate active
8 phase A 26476490968 Arm A runner-only 20 wall-clock + phase timing 20 valid, 0 discarded, same host class
8 phase C 26476824593 Arm C dual capture 20 wall-clock + phase timing 20 valid, 0 discarded, same host class
9 paired A/C 26479319306 Arm A + Arm C paired 20 adjacent pairs 20 valid per arm, 0 discarded, same job and host class
10 smoke kernel 26508127380 Arm A + Arm C paired 2 adjacent pairs, kernel=low, span=baseline, concurrency=1 2 valid per arm, 0 discarded, clean health gates
10 smoke span/concurrency 26508355816 Arm A + Arm C paired 2 adjacent pairs, kernel=medium, span=low, concurrency=2 2 valid per arm, 0 discarded, clean health gates
11 control 26511405031 Arm A + Arm C paired 5 adjacent pairs, baseline sweep 5 valid per arm, 0 discarded, clean health gates
11 kernel-high 26511787316 Arm A + Arm C paired 5 adjacent pairs, kernel=high, span=baseline, concurrency=1 5 valid per arm, 0 discarded, 100 kernel worker files per arm
11 span-high 26512146963 Arm A + Arm C paired 5 adjacent pairs, kernel=baseline, span=high, concurrency=1 5 valid per arm, 0 discarded, 100 Arm C span events per sample
11 kernel-concurrent 26512515478 Arm A + Arm C paired 5 adjacent pairs, kernel=high, span=baseline, concurrency=4 5 valid per arm, 0 discarded, 100 kernel worker files per arm
11 corner 26512909068 Arm A + Arm C paired 5 adjacent pairs, kernel=high, span=high, concurrency=4, payload=large 5 valid per arm, 0 discarded, 100 kernel worker files and 100 Arm C span events per sample
12 k500 26517696032 Arm A + Arm C paired 5 adjacent pairs plus warm-up, kernel=x500, span=baseline, concurrency=4 5 valid per arm, 0 discarded, 500 kernel worker files per arm
12 k1000 26518158603 Arm A + Arm C paired 5 adjacent pairs plus warm-up, kernel=x1000, span=baseline, concurrency=4 5 valid per arm, 0 discarded, 1000 kernel worker files per arm
12 s500 26518522754 Arm A + Arm C paired 5 adjacent pairs plus warm-up, kernel=baseline, span=x500, concurrency=1 5 valid per arm, 0 discarded, Arm C retained 128/500 span events
12 s1000 26518894002 Arm A + Arm C paired 5 adjacent pairs plus warm-up, kernel=baseline, span=x1000, concurrency=1 5 valid per arm, 0 discarded, Arm C retained 128/1000 span events
12 kc1000 26519398461 Arm A + Arm C paired 5 adjacent pairs plus warm-up, kernel=x1000, span=baseline, concurrency=16 5 valid per arm, 0 discarded, 1000 kernel worker files per arm
12 corner-lite 26520226593 Arm A + Arm C paired 5 adjacent pairs plus warm-up, kernel=x1000, span=x1000, concurrency=8, payload=large 5 valid per arm, 0 discarded, 1000 kernel worker files per arm; Arm C retained 128/1000 span events

Generated artifacts from those runs were inspected as review artifacts only. They are intentionally not committed as benchmark evidence in this slice.

The failed diagnostic repeat is listed to explain the workflow-DX fix, not to replace the successful Slice 7 findings. The follow-up repeat passed with artifacts uploaded and the new artifact-success gate active. Together, the repeats show that the first-sample cgroup failure is not deterministic, while wall-clock decomposition still needs phase timing before it can support an additive claim.

The phase-timing runs are listed as diagnostics, not as replacement baselines. They validate the Slice 8 instrumentation and localize part of the Arm A / Arm C median gap, but Arm A again showed an unhealthy wall-clock tail in that dispatch.

The paired Slice 9 run is also diagnostic. It keeps Arm A and Arm C adjacent in one delegated job to reduce inter-dispatch drift. It does not replace the same-host baselines below, and its unhealthy tails keep wall-clock publication caveats in force.

The Slice 10 smoke runs are workflow and metadata validation only. They show that the event-rate sweep knobs reach the real delegated workload: both arms captured event-rate-sweep/worker-* kernel events, Arm C emitted assay.sweep.* trace metadata when span/event pressure was requested, and Arm A correctly recorded span_event_rate=baseline with target_span_events=0. They are too small to support slope, threshold, or benchmark findings.

The Slice 11 starter matrix is the first event-rate sweep findings slice. It is still small (n=5 adjacent pairs per cell), so it reports health, calibration, and threshold signals only. It does not publish a product benchmark or a new additive wall-clock decomposition.

The Slice 12 boundary-finding sweep is the widened event-rate pass. It ran the predeclared cells sequentially because the workflow concurrency group keeps only one active delegated overhead run at a time. All six completed runs had 5/5 measured samples per arm, 0 discarded samples, clean Runner health gates, and successful warm-up samples. The finding is a fidelity/health boundary result, not a benchmark result.

Same-Host Baselines

All three arms now have clean measurements on the same delegated assay-bpf-runner host class. The runs were dispatched separately, so they characterize a shared host class, not co-temporal variance.

Metric Value Interpretation
Host class linux-aarch64-6.8.0-117-generic Delegated Linux runner machine/OS/kernel boundary
Arm A wall-clock valid samples 20/20 Meets the n >= 20 gate
Arm A wall median 1,859.521 ms Runner archive-only repeat baseline on this host class
Arm A wall p95 2,143.676 ms Tail sample remained within the healthy band
Arm A wall p99 2,459.097 ms Nearest-rank p99 for n=20
Arm A wall p99/median 1.322 Healthy per the v0 < 1.5 tail-ratio band
Arm A RSS valid samples 5/5 Meets the n >= 5 gate
Arm A peak RSS median 116,641,792 bytes Runner archive-only memory baseline
Arm A peak RSS max 116,645,888 bytes No large RSS outlier in the n=5 sample
Arm B wall-clock valid samples 20/20 Meets the n >= 20 gate
Arm B wall median 879.961 ms OTel-only baseline on this host class
Arm B wall p95 924.845 ms Tail sample remained close to median
Arm B wall p99 964.023 ms Nearest-rank p99 for n=20
Arm B wall p99/median 1.096 Healthy per the v0 < 1.5 tail-ratio band
Arm B RSS valid samples 5/5 Meets the n >= 5 gate
Arm B peak RSS median 108,953,600 bytes OTel-only memory baseline
Arm B peak RSS max 110,493,696 bytes No large RSS outlier in the n=5 sample
Arm C wall-clock valid samples 20/20 Meets the n >= 20 gate
Arm C wall median 1,737.838 ms Dual-capture baseline on this host class
Arm C wall p95 2,051.039 ms Tail sample remained close to median
Arm C wall p99 2,070.354 ms Nearest-rank p99 for n=20
Arm C wall p99/median 1.191 Healthy per the v0 < 1.5 tail-ratio band
Arm C RSS valid samples 5/5 Meets the n >= 5 gate
Arm C peak RSS median 116,649,984 bytes Dual-capture memory baseline
Arm C peak RSS max 116,781,056 bytes No large RSS outlier in the n=5 sample
Arm B trace JSON median 3,204 bytes L1 trace footprint baseline
Arm A trace JSON median null Expected: runner-only arm emits no OTel trace JSON
Arm A archive .tar.gz median 1,628 bytes L2 compressed archive footprint baseline without trace export
Arm A archive extracted median 5,639 bytes Review/storage footprint baseline without trace export
Arm C trace JSON median 3,220 bytes L1 trace plus Runner wrapper footprint
Arm C archive .tar.gz median 1,776 bytes L2 compressed archive footprint baseline
Arm C archive extracted median 8,186 bytes Review/storage footprint baseline

The one-byte artifact-size spread between Slice 2 and Slice 3 is expected for freshly generated archives and traces. The useful claim is that the footprint is tiny and stable at this workload scale, not that archive bytes are deterministic across runs.

Same-Host Delta

Because Arm B and Arm C emitted the same host_class, a narrow same-host delta is now valid for this deterministic workload. The runs were not co-temporal, so this is still a host-class baseline comparison, not a product benchmark.

Metric Arm B OTel-only Arm C dual capture Delta
Wall median 879.961 ms 1,737.838 ms +857.878 ms (+97.5%)
Wall p95 924.845 ms 2,051.039 ms +1,126.195 ms (+121.8%)
Wall p99 964.023 ms 2,070.354 ms +1,106.331 ms (+114.8%)
Wall p99/median 1.096 1.191 +0.096
Peak RSS median 108,953,600 bytes 116,649,984 bytes +7,696,384 bytes (+7.1%)
Peak RSS max 110,493,696 bytes 116,781,056 bytes +6,287,360 bytes (+5.7%)
Trace JSON median 3,204 bytes 3,220 bytes +16 bytes

The wall-clock delta is the cost of the current dual-capture path on this host class: Runner archive capture plus the existing OTel trace around the deterministic workload. The Arm A section below records the runner-only decomposition attempt, but that decomposition is not stable enough to turn this delta into an additive wall-clock cost model.

Runner-Only Decomposition

Arm A adds the runner-only comparison point: assay runner-spike with Linux/eBPF archive capture and the deterministic OpenAI Agents fixture, but without OTel trace export. All three arms emitted the same host_class, but the runs were separate dispatches and Arm A uses the fixture-agent path rather than the OTel workload wrapper. Treat this as an experiment-scoped decomposition aid, not as a general additive cost model.

Metric Arm B OTel-only Arm A runner-only Arm C dual capture
Wall median 879.961 ms 1,859.521 ms 1,737.838 ms
Wall p95 924.845 ms 2,143.676 ms 2,051.039 ms
Wall p99 964.023 ms 2,459.097 ms 2,070.354 ms
Wall p99/median 1.096 1.322 1.191
Peak RSS median 108,953,600 bytes 116,641,792 bytes 116,649,984 bytes
Peak RSS max 110,493,696 bytes 116,645,888 bytes 116,781,056 bytes
Trace JSON median 3,204 bytes null 3,220 bytes
Archive .tar.gz median null 1,628 bytes 1,776 bytes
Archive extracted median null 5,639 bytes 8,186 bytes

The decomposition read is:

  • RSS: useful and stable. Arm A and Arm C differ by only 8,192 bytes at the median RSS level (0.007%). The observed same-host RSS increase over Arm B is therefore dominated by Runner capture rather than by adding the OTel trace wrapper around Runner capture.
  • Wall-clock: inconclusive as an additive decomposition. The healthy Arm A repeat is still 121.683 ms slower at the median than Arm C, even though Arm A omits OTel trace export. That is not a meaningful "OTel adds negative overhead" result; it means the runner-only fixture path and dual-capture workload path need phase timing before the current data can be decomposed into additive cost buckets.

Phase-Timing Read

Slice 8 added experiment-scoped phase diagnostics via assay.experiment.runner_phase_timing.v0. Both Arm A and Arm C phase runs used the same linux-aarch64-6.8.0-117-generic host class and produced 20 valid samples with 0 discarded samples.

Metric Arm A runner-only Arm C dual capture Delta A-C
Wall median 1,894.545 ms 1,787.294 ms +107.251 ms
Wall p95 2,542.600 ms 2,013.966 ms +528.634 ms
Wall p99 6,855.941 ms 2,060.190 ms +4,795.751 ms
Wall p99/median 3.619 1.153 Arm A tail unhealthy
Sum of phase medians 1,427.638 ms 1,393.098 ms +34.540 ms
Wall median minus summed phase medians 466.907 ms 394.197 ms +72.711 ms

Median phase breakdown:

Phase Arm A median Arm C median Delta A-C
preflight_ms 0.196 ms 0.144 ms +0.052 ms
cgroup_prepare_ms 0.966 ms 1.085 ms -0.119 ms
monitor_attach_ms 446.885 ms 408.601 ms +38.284 ms
child_spawn_ms 18.020 ms 23.787 ms -5.767 ms
child_runtime_ms 850.928 ms 847.777 ms +3.151 ms
event_flush_ms 107.313 ms 109.086 ms -1.773 ms
archive_write_ms 3.330 ms 2.617 ms +0.713 ms

The summed phase medians explain about 34.540 ms of the 107.251 ms Arm A median wall-clock gap. The largest instrumented contributor is monitor_attach_ms (+38.284 ms for Arm A), but most of the median gap remains outside the current phase buckets as measured by wall median minus summed phase medians (+72.711 ms residual).

That means Slice 8 supports a narrower conclusion than an additive wall-clock decomposition: the Runner-internal phases do not fully explain why the runner-only Arm A path is slower than Arm C at the median. The wall-clock split remains unsuitable for a "Runner archive only + OTel trace export" additive claim.

Paired Residual Read

Slice 9 dispatched arm=paired-a-c in run 26479319306. The harness ran 20 adjacent counterbalanced pairs in one delegated job: odd pairs used Arm A then Arm C, even pairs used Arm C then Arm A. Both arms produced 20 valid samples, 0 discarded samples, ringbuf_drops=0, kernel_layer=complete, and cgroup_correlation=clean.

Metric Arm A runner-only Arm C dual capture Delta A-C
Wall median 1,806.007 ms 1,917.081 ms -111.074 ms
Wall p95 3,500.225 ms 4,337.908 ms -837.682 ms
Wall p99 3,911.765 ms 4,400.113 ms -488.348 ms
Wall p99/median 2.166 2.295 both tails unhealthy
Sum of phase medians 1,443.657 ms 1,499.847 ms -56.190 ms
Wall median minus summed phase medians 362.349 ms 417.234 ms -54.884 ms
Median per-sample phase_residual_ms 368.808 ms 391.284 ms -22.476 ms
Median paired wall delta n/a n/a -176.852 ms; noisy pair spread
Median paired residual delta n/a n/a -26.187 ms; residuals close

Median phase breakdown from the paired run:

Phase Arm A median Arm C median Delta A-C
preflight_ms 0.171 ms 0.157 ms +0.014 ms
cgroup_prepare_ms 1.941 ms 2.080 ms -0.140 ms
monitor_attach_ms 423.719 ms 428.631 ms -4.912 ms
child_spawn_ms 15.442 ms 16.860 ms -1.418 ms
child_runtime_ms 887.384 ms 939.790 ms -52.406 ms
event_flush_ms 112.041 ms 108.897 ms +3.144 ms
archive_write_ms 2.960 ms 3.432 ms -0.471 ms

This paired result changes the wall-clock read: the Slice 8 Arm A slower-than-Arm-C median gap does not reproduce when the arms run as adjacent counterbalanced pairs. In the paired run, Arm A is faster at the median and the per-sample residual medians differ by only 22.476 ms. The result points to inter-dispatch drift and measurement variance as material contributors to the earlier wall-clock anomaly.

The paired run does not justify a new additive wall-clock model: both paired tails are unhealthy (p99/median > 2.0), and the paired wall deltas have a wide spread. It does justify a stopping rule for this arc: wall-clock decomposition is not stable enough at n=20 on this runner to publish as an additive split. RSS remains the clean decomposition signal.

Event-Rate Starter Matrix

Slice 11 ran the predeclared five-cell paired A/C starter matrix. Every cell used repetitions=5, measure_rss=false, build_ebpf=true, and paired adjacent A/C order on the same linux-aarch64-6.8.0-117-generic host class. All cells passed the health gates: 5/5 valid samples per arm, 0 discarded samples, ringbuf_drops=0, kernel_layer=complete, and cgroup_correlation=clean.

Cell Run Arm A wall median Arm C wall median Arm C - Arm A Tail band Calibration
control 26511405031 1,859.312 ms 2,094.345 ms +235.033 ms healthy (1.260 / 1.056) no sweep metadata
kernel-high 26511787316 2,256.657 ms 2,364.349 ms +107.692 ms healthy (1.301 / 1.344) 100 kernel worker files per arm
span-high 26512146963 1,755.104 ms 1,720.886 ms -34.218 ms healthy (1.186 / 1.030) 100 Arm C span events per sample
kernel-concurrent 26512515478 1,972.446 ms 1,855.518 ms -116.928 ms healthy but near warn for Arm C (1.104 / 1.468) 100 kernel worker files per arm
corner 26512909068 1,815.705 ms 2,026.579 ms +210.873 ms healthy (1.140 / 1.166) 100 kernel worker files per arm; 100 Arm C span events per sample

The starter matrix is useful mainly because it did not find a failure boundary at these levels. Even the corner cell stayed clean: no ring-buffer drops, no degraded kernel layer, no cgroup-correlation failure, and a healthy tail ratio. The corner trace size grew to a median 6,636,746 bytes, while the Arm C archive remained small (3,905 bytes compressed median, 93,987 bytes extracted median).

The wall-clock medians still should not be read as a product benchmark. At n=5, paired medians remain sensitive to run-window variance. The publishable Slice 11 result is narrower:

  • the sweep knobs calibrate correctly on real delegated infrastructure;
  • high=100 kernel events produces the expected 100 worker files per sample in both Arm A and Arm C;
  • high=100 span events produces exactly 100 assay.sweep.span_event entries per Arm C trace;
  • no starter cell hit the health or tail failure boundary;
  • larger OTel span payloads scale trace size clearly, but did not destabilize Runner health at this matrix budget.

Event-Rate Boundary Sweep

Slice 12 ran the predeclared widened A/C matrix with repetitions=5, warmup_iterations=1, measure_rss=false, build_ebpf=true, and timeout_seconds=300. Warm-up samples are review artifacts only; all measured cells below had 5 valid samples per arm and 0 discarded samples.

Cell Run Arm A wall median Arm C wall median Arm C - Arm A Tail band Fidelity read
k500 26517696032 1,903.863 ms 2,177.001 ms +273.138 ms Arm A warning (1.724), Arm C healthy (1.098) 500/500 kernel worker files in both arms
k1000 26518158603 2,407.982 ms 2,290.073 ms -117.910 ms Arm A warning (1.604), Arm C healthy (1.304) 1000/1000 kernel worker files in both arms
s500 26518522754 1,761.804 ms 1,780.047 ms +18.244 ms healthy (1.157 / 1.192) Arm C retained 128/500 span events
s1000 26518894002 1,777.862 ms 1,811.485 ms +33.623 ms healthy (1.040 / 1.026) Arm C retained 128/1000 span events
kc1000 26519398461 2,101.253 ms 2,623.885 ms +522.632 ms healthy (1.314 / 1.057) 1000/1000 kernel worker files in both arms, concurrency 16
corner-lite 26520226593 2,283.301 ms 2,787.537 ms +504.236 ms healthy (1.094 / 1.466) 1000/1000 kernel worker files; Arm C retained 128/1000 span events

The kernel-capture branch stayed healthy through the widened cells: x1000 kernel worker files calibrated exactly for both arms, and the kc1000 cell stayed clean even at concurrency 16. The two kernel-only cells put Arm A's tail ratio in the warning band, but no cell crossed the fail boundary (p99/median > 2.0), and no Runner health gate degraded.

The span/event branch hit a fidelity boundary before timing can be interpreted. At both x500 and x1000, the Arm C trace retained exactly 128 assay.sweep.span_event entries per sample. That matches the OpenTelemetry JS SDK default span event count limit in this workload configuration, so the apparent wall-clock and trace-size behavior above 100 span events is not evidence that the system handled 500 or 1000 trace records. It is evidence that the default OTel span limit preserves only 128 events.

Mechanism verification:

  • The OpenTelemetry SDK environment-variable specification defines OTEL_SPAN_EVENT_COUNT_LIMIT as the maximum span-event count, with default 128: https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#span-limits.
  • The OpenTelemetry Trace SDK specification likewise defines EventCountLimit (Default=128) under Span Limits: https://opentelemetry.io/docs/specs/otel/trace/sdk/#span-limits.
  • The checked-in workload declares @opentelemetry/sdk-trace-base@^2.0.0 in workload/package.json, and the v1 findings record that dependency as resolved to 2.7.x at install time. Its BasicTracerProvider setup does not pass a custom spanLimits override, so default SDK limits apply unless the environment overrides them.
  • The retained event indexes prove the cap is on span events, not on the Runner archive or JSON writer. For s500, every Arm C measured trace retained indexes 372..499; for s1000 and corner-lite, every Arm C measured trace retained indexes 872..999. That is the last 128 events, matching the OTel JS behavior of dropping the oldest event once eventCountLimit is reached. This last-N pattern is consistent with FIFO truncation under Span Limits, not with random sampling, first-N truncation, or JSON serialization loss.
  • A local workload repro confirmed the mechanism: with OTEL_SPAN_EVENT_COUNT_LIMIT unset, --sweep-span-events 500 retained 128 events (372..499); with OTEL_SPAN_EVENT_COUNT_LIMIT=1000, the same workload retained all 500 events (0..499). With target 1000 and limit 1000, it retained all 1000 events (0..999).

The corner-lite cell is therefore a mixed result: Runner kernel capture stayed healthy at 1000 worker files with large payloads and concurrency 8, but the OTel span side was already lossy at 128/1000 events. Its median Arm C trace grew to 8,494,070 bytes because the retained events carried 64 KiB payloads, but the cell cannot support a "healthy through x1000 span events" claim.

The Slice 12 boundary result is:

  • Kernel side: healthy through x1000 kernel events and concurrency 16 on this host class, at n=5 and without RSS collection. Artifact inspection found exactly 500/500 unique worker files in k500 and 1000/1000 in k1000, kc1000, and corner-lite, for both Arm A and Arm C, counted from event-rate-sweep/worker-* entries in extracted archive contents.
  • Span side: first widened span cell (s500) is a trace-fidelity boundary under the default OTel JS SDK limits: 128/500 events retained. Arm A remains correctly asymmetric in these paired cells: it has no OTel trace export and records span pressure as baseline / 0.
  • Corner side: combined kernel + span stress is bounded by the same span-fidelity limit, not by Runner health.

That closes the current event-rate arc for the default configuration. A future experiment can deliberately raise OTEL_SPAN_EVENT_COUNT_LIMIT and rerun the span cells, but that would be a new span-limit study, not a continuation of the default-config boundary sweep. That follow-up is tracked separately in issue #1408.

What This Means

  • The delegated measurement harness is usable for all three arms: wall-clock and RSS runs produced all-valid samples on the same host class.
  • The observed Arm C tail ratio is healthy for this deterministic workload on assay-bpf-runner.
  • The observed Arm C median wall-clock is about 2x Arm B on the same host class for this workload, while RSS increases by about 7%.
  • The observed Arm A and Arm C RSS medians are effectively identical at this scale, so the RSS delta versus Arm B is attributable to Runner capture rather than trace JSON export.
  • Arm A's repeat wall-clock tail was healthy, but the later phase-timing run had an unhealthy tail and the runner-only median remained higher than Arm C, so wall-clock decomposition remains a caution, not a benchmark claim.
  • Slice 8 phase timing localizes the largest measured internal phase delta to monitor attach, but the majority of the Arm A / Arm C median gap sits outside the current phase buckets.
  • Slice 9 paired diagnostics show that the Slice 8 Arm A-over-Arm C median gap does not reproduce under adjacent pairing. Wall-clock residuals are close enough, and tails noisy enough, that the wall-clock decomposition should stop rather than spawning another broad rerun.
  • Slice 11 shifts the useful overhead question from "which arm is faster?" to "where do health gates or artifact sizes start to scale?" At the starter matrix levels, the health boundary was not reached.
  • Slice 12 extends that result: Runner kernel capture remained healthy through 1000 worker files and concurrency 16, while widened OTel span pressure hit the default SDK event-retention limit at the first widened span cell.
  • The RSS path works on the delegated Linux runner with GNU /usr/bin/time -v; samples record the RSS tool version and emit peak_rss_bytes into both summary.json and the BMF export.
  • The summary renderer now gives reviewers the same metrics in summary.md and in the GitHub step summary, while summary.json remains canonical.

What This Does Not Mean

  • No product ranking is implied between OpenTelemetry, OpenInference, or Assay-Runner.
  • No model/provider latency claim is made. The workload is deterministic and measurement-scoped.
  • No co-temporal variance claim is made. Arm B and Arm C ran on the same host class at different times, and Arm A was dispatched separately as well.
  • No additive wall-clock decomposition claim is made between "Runner archive only" and "Runner archive plus OTel trace". The phase-timing and paired residual runs show that the median gap is not stable under pairing, and the paired run has unhealthy tails.
  • Slice 12 does not prove that 500 or 1000 OTel span events are safe or cheap under the default workload configuration. The trace retained only 128 span events at those targets, so timing above that point is fidelity-limited.
  • No Trust Card or Trust Basis claim is added. This remains an experiment-scoped measurement follow-up.
  • The generated artifacts remain review artifacts until a later decision explicitly promotes a measurement bundle into committed evidence.

Next Work

The correct publication language is now:

On the linux-aarch64-6.8.0-117-generic delegated runner host class, the current dual-capture path measured roughly +858 ms median wall-clock and +7.7 MB median RSS over OTel-only for this deterministic workload. The result is not co-temporal and does not decompose Runner archive-only cost.

Arm A measurements and phase-timing diagnostics have landed. The safe publication language is now:

On the same delegated host class, Arm A runner-only and Arm C dual-capture had effectively identical median RSS. The RSS decomposition points to Runner capture as the memory-cost source. Wall-clock decomposition remains inconclusive: phase timing explains part of the Arm A / Arm C median gap, mostly around monitor attach, but the majority remains outside the current Runner phase buckets and Arm A's phase run had an unhealthy tail.

Next engineering slice:

Do not add another broad Arm A/C wall-clock rerun for this arc. The paired residual diagnostic has landed and shows that the median gap is not stable enough for an additive wall-clock decomposition at the current measurement budget. Slice 12 has now produced the intended boundary statement: kernel capture stayed healthy through the widened kernel cells, and span widening hit the default OTel event-retention boundary at the first widened span cell.

The only logical follow-up is a new, explicitly scoped span-limit study: configure the OTel SDK span-event limit above the requested target, verify retained event counts first, and then rerun only the span cells needed to answer whether trace export cost scales after the fidelity boundary is removed. Track that as issue #1408, outside the closed default-config overhead arc.