Assay-Runner Phase 1 Acceptance

Status: accepted on delegated Linux hardware; internal evidence note, not a public release announcement
Date: 2026-05-21
Scope: records the Phase 1 outcome for the internal Assay-Runner measured-run spike. It does not imply a public repository split, hosted service, macOS support, live LLM support, or a released product surface.

Phase 1 asked whether Assay can produce verifiable measured-run bundles with low-ambiguity correlation across kernel, policy, and SDK observation layers.

The answer for the delegated Linux/eBPF path is yes.

Acceptance Run

Field	Value
Workflow	`Runner Spike Delegated`
Run	`26211485614`
URL	https://github.com/Rul1an/assay/actions/runs/26211485614
Commit	`56571045`
Branch	`codex/assay-runner-drop-kernel-stream-before-stats`
Mode	`gates=all`
Delegated job duration	`13m11s`
Host class	`assay-bpf-runner` self-hosted Linux runner

The delegated run executed all Phase 1 proof modes in one workflow dispatch:

Gate	Result
`kernel-only`	3 acceptance runs plus three-run determinism passed
`kernel+policy`	3 acceptance runs plus three-run determinism passed
`OpenAI Agents kernel+policy+SDK`	3 acceptance runs plus three-run determinism passed

The OpenAI Agents gate exercised the real @openai/agents SDK runtime path with a deterministic local model provider. It did not make live LLM calls.

The determinism claim is over normalized runner evidence artifacts extracted from the archive. It is not a claim that the raw eBPF object, raw ring-buffer delivery, or every kernel telemetry event is byte-identical across runs.
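A minimal sketch of what a three-run determinism check over normalized artifacts could look like. The directory layout, the file name `correlation-report.json`, and the helper names are assumptions made for illustration; this is not the delegated gate's actual implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def runs_are_byte_stable(run_dirs: list[Path]) -> bool:
    """True when every run has the same file set with identical digests."""
    digests = [digest_tree(run_dir) for run_dir in run_dirs]
    return all(current == digests[0] for current in digests[1:])


def demo() -> bool:
    # Hypothetical layout: one directory of normalized artifacts per run.
    with tempfile.TemporaryDirectory() as tmp:
        run_dirs = []
        for index in range(3):
            run_dir = Path(tmp) / f"run-{index}"
            run_dir.mkdir()
            (run_dir / "correlation-report.json").write_text('{"status": "clean"}\n')
            run_dirs.append(run_dir)
        return runs_are_byte_stable(run_dirs)
```

The comparison covers only the normalized files placed in each run directory, which mirrors the scope of the claim: raw eBPF objects and raw ring-buffer delivery are deliberately not part of the compared set.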

What Was Proven

This run proves the Phase 1 Linux/eBPF delegated path under the bounded spike contract:

the eBPF program loads and attaches on the delegated Linux host
child execution can be placed into a clean measured cgroup before spawn
kernel observation can complete with ringbuf_drops=0
policy decisions can be correlated with kernel-observed side effects
SDK tool-call events from the real OpenAI Agents runtime can be correlated with policy events by tool_call_id
the spike archive can be verified by the existing Assay evidence path
each proof mode can produce byte-stable normalized evidence artifacts across three runs on the delegated host
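The `tool_call_id` correlation in the list above can be sketched as a join between two event streams. The event shape (plain dicts with a `tool_call_id` key, plus a hypothetical `decision` field) and the report keys are assumptions for illustration, not the Assay correlation-report schema.

```python
from collections import defaultdict


def group_by_tool_call_id(events):
    """Group events by their tool_call_id."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["tool_call_id"]].append(event)
    return grouped


def correlate_by_tool_call_id(sdk_events, policy_events):
    """Classify each tool_call_id as matched, ambiguous, or one-sided."""
    sdk_by_id = group_by_tool_call_id(sdk_events)
    policy_by_id = group_by_tool_call_id(policy_events)
    report = {"matched": [], "ambiguous": [], "sdk_only": [], "policy_only": []}
    for call_id in sorted(set(sdk_by_id) | set(policy_by_id)):
        sdk_count = len(sdk_by_id.get(call_id, []))
        policy_count = len(policy_by_id.get(call_id, []))
        if sdk_count == 1 and policy_count == 1:
            report["matched"].append(call_id)  # exactly one event on each side
        elif sdk_count == 0:
            report["policy_only"].append(call_id)
        elif policy_count == 0:
            report["sdk_only"].append(call_id)
        else:
            report["ambiguous"].append(call_id)  # duplicates on either side
    return report
```

In this sketch, a low-ambiguity result is one where every identifier lands in `matched` and the other three lists stay empty.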

Allowed Phase 1 claim:

For the deterministic acceptance fixtures, Assay can produce verifiable
measured-run bundles on a delegated Linux/eBPF host that correlate kernel,
policy, and OpenAI Agents SDK evidence with low ambiguity and complete
observation health.

Forbidden claims remain forbidden:

this does not prove macOS or Windows kernel-grounded attribution
this does not prove live LLM execution or cassette replay
this does not prove arbitrary SDK compatibility beyond the validated @openai/agents fixture
this does not prove production traffic, sustained load, or a runner fleet
this does not prove event-level syscall causality or ordered trace semantics
this does not include a dedicated ring-buffer drop debug mode; follow-up is tracked in https://github.com/Rul1an/assay/issues/1271

Kill Criteria

The Phase 1 spike plan named several ways the track could fail. The delegated acceptance run mechanically refuted each one for the tested Linux path.

Kill criterion	Acceptance result
Clean cgroup correlation cannot be made reliable	Refuted: all delegated acceptance runs reported clean correlation
Ordinary runs produce ring-buffer drops often enough to make complete observation abnormal	Refuted: all delegated acceptance runs in `gates=all` completed with zero drops
`tool_call_id` cannot be carried through the OpenAI Agents path	Refuted: OpenAI Agents SDK and policy correlation passed across three runs
Bundle verification requires a parallel artifact system	Refuted: the spike bundle path verifies through the existing Assay evidence integration
Policy-to-kernel attribution is unstable across repeated runs	Refuted: `kernel+policy` three-run determinism passed

Regression Anchors

The green delegated run depended on several concrete discoveries. These are now regression risks and should remain anchored by code, tests, docs, or delegated gates before refactors touch the runner path.

Discovery	Why it matters	Current anchor
`EVENT_INODE_RESOLVED` is telemetry, not attribution evidence	Including it made bundle output noisy without strengthening the claim	Runner-spike kernel normalizer filter and delegated determinism gate
Loader, locale, runtime, and dependency paths are telemetry, not bundle evidence	Dynamic linker and locale opens dominated kernel volume and were not agent behavior	eBPF/userspace path filters and delegated determinism gate
Ring-buffer consumers must persist across polls	Rebuilding the consumer per poll replayed kernel records and was the final root cause blocking byte-stable delegated determinism	`assay-monitor` listener ownership of persistent ring buffers
Runner session cgroups must start from a valid domain root	Service cgroups can be `domain threaded`; child session cgroups below them can reject process placement	cgroup domain-root resolution and unit coverage
Self-hosted workflow cleanup must precede action download	Deleting `_actions` inside the same job can delete already prepared action repositories	delegated workflow prepare job plus gates job split
SDK metadata must come from installed package metadata	Hardcoded SDK version strings can silently falsify bundle claims after dependency bumps	OpenAI Agents fixture metadata load plus expected-version gate
Acceptance fixtures must be deterministic below output level	Cold/warm fixture differences can change observed syscall surfaces even when files match	fixture pre-seeding/control paths and delegated three-run determinism
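The telemetry-versus-evidence split in the first two rows can be illustrated with a small filter. Only `EVENT_INODE_RESOLVED` comes from this note; the event shape, the `EVENT_OPEN` kind, and the path prefixes are hypothetical placeholders, not the runner's actual filter list.

```python
# Event kinds treated as telemetry only. EVENT_INODE_RESOLVED is named in
# this note; a real list may contain more entries.
TELEMETRY_EVENT_KINDS = frozenset({"EVENT_INODE_RESOLVED"})

# Hypothetical loader and locale prefixes, chosen for illustration only.
TELEMETRY_PATH_PREFIXES = (
    "/lib/",
    "/lib64/",
    "/usr/lib/",
    "/usr/share/locale/",
    "/etc/ld.so.",
)


def is_bundle_evidence(event: dict) -> bool:
    """Return False for events that are telemetry rather than evidence."""
    if event.get("kind") in TELEMETRY_EVENT_KINDS:
        return False
    path = event.get("path", "")
    # str.startswith accepts a tuple and matches any of the prefixes.
    return not path.startswith(TELEMETRY_PATH_PREFIXES)


def filter_evidence(events: list[dict]) -> list[dict]:
    """Keep only events that qualify as bundle evidence, preserving order."""
    return [event for event in events if is_bundle_evidence(event)]
```

Keeping the filter as plain data plus a predicate makes it easy to pin with unit tests, which is one way to anchor the discovery before a refactor touches the normalizer.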

Phase 2 Follow-Up

Phase 2 should consolidate this proof before any repository split or platform expansion:

freeze versioned v0 references for observation-health, capability-surface, and correlation-report: Runner artifact v0 contracts
maintain the telemetry-versus-evidence filter contract in that reference
maintain the delegated runner runbook for provisioning and failure triage: ASSAY-RUNNER-DELEGATED-RUNBOOK-2026-05-21.md
classify when delegated CI is required for runner-impacting changes: Runner CI lane contract
write and maintain the acceptance fixture contract for future SDK fixtures: Runner acceptance fixture v0 contract
define the Assay-Runner boundary and extraction map: Assay-Runner boundary and extraction map

OpenTelemetry mappings, macOS support, live LLM calls, and repository extraction are outside Phase 2A. They should start only after the Linux runner boundary is documented and stable.

Proof Pack

This acceptance note records the workflow run, commit, and pass/fail evidence. Phase 2B hygiene adds a small durable proof manifest: Phase 1 delegated proof pack.

The proof pack preserves workflow metadata, the full workflow log, PASS-line evidence, and v0 golden-shape JSON digests. It does not contain the original runner-*.tar.gz archives: the historical workflow run did not upload GitHub artifacts, and the delegated host wrote the archives to temporary paths that were cleaned up after the job. Future delegated acceptance workflows should upload a first-class proof pack artifact during the run; that workflow follow-up is tracked in https://github.com/Rul1an/assay/issues/1287.

Subsequent Internal Landings

This section is a chronology pointer only. It does not re-open or extend the Phase 1 acceptance claim above, does not promise future work, and does not introduce a new contract. It exists so that a reader returning later can reconstruct the Phase 2 sequence without PR archaeology.

Phase 2A: internal contract freeze

The Phase 2 Follow-Up items above were consolidated into versioned internal references:

Phase 2B: first capability-diff slice

Runner capability-diff Phase 2B plan
Runner capability-diff v0 contract
Reference projection: scripts/ci/assay_runner_capability_diff_validate.py produces status=clean for the accepted S5 fixture against itself.
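The self-comparison property of the reference projection can be illustrated with a toy diff. This sketch does not reproduce `scripts/ci/assay_runner_capability_diff_validate.py`; the surface shape (category name mapped to a list of entries) and the `changed` status value are assumptions, and only `status=clean` is taken from this note.

```python
def capability_diff(baseline: dict[str, list[str]], candidate: dict[str, list[str]]) -> dict:
    """Compare two capability surfaces category by category."""
    added: dict[str, list[str]] = {}
    removed: dict[str, list[str]] = {}
    for category in sorted(set(baseline) | set(candidate)):
        before = set(baseline.get(category, []))
        after = set(candidate.get(category, []))
        if after - before:
            added[category] = sorted(after - before)
        if before - after:
            removed[category] = sorted(before - after)
    # "clean" means no capability was added or removed in any category.
    status = "clean" if not added and not removed else "changed"
    return {"status": status, "added": added, "removed": removed}
```

Diffing a fixture against itself is a useful baseline check because any result other than clean points at nondeterminism in the projection rather than at a real capability change.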

Phase 2B: second runtime fixture (Gemini Python `google-genai`)

The second-runtime fixture line landed against the Phase 2B entry discipline. It introduces a new fixture, not a new contract or a new artifact category.

Runner second runtime Phase 2B plan
Runner second runtime candidate selection
Runner Gemini fixture design
Fixture root: runner-fixtures/gemini-google-genai/

Delegated gates=all proof for the Gemini fixture:

Field	Value
Workflow	`Runner Spike Delegated`
Run	`26280889075`
URL	https://github.com/Rul1an/assay/actions/runs/26280889075
Head SHA	`f917090a4a707eb485f57e82e8ed83ea8a52391e`
Mode	`gates=all` (includes `gemini-google-genai-kernel-policy`)
Proof-pack artifact	`assay-runner-delegated-proof-pack-26280889075`
Host class	`assay-bpf-runner` self-hosted Linux runner
Identity source	cassette-derived `FunctionCall.id`, level-3 stable identity verified by the maintainer probe outcome under `runner-fixtures/gemini-google-genai/probe-results/`

Scope discipline for this landing:

S5's mcp-policy-agent.sh and the OpenAI Agents fixture path were not modified
The v0 artifact contracts, fixture v0 contract, and boundary map were not modified
No narrower delegated gate name was introduced; the Gemini fixture runs under the existing gates=all dispatch
Cross-runtime capability comparison between S5 and Gemini was not introduced

The Gemini delegated run is a second-runtime fixture acceptance only. It does not extend the Phase 1 OpenAI Agents claim above and does not establish a third runtime as eligible.

Phase 2C: cross-runtime diff decision resolved

Runner cross-runtime diff Phase 2C plan
Runner cross-runtime diff Phase 2C decisions (A1+B3+C1)
Decision gate (resolved): https://github.com/Rul1an/assay/issues/1310 (closed by #1312)

The three open interpretation questions (path projection, binding-id comparability, SDK metadata placement) were resolved as A1 + B3 + C1: prefix-only path canonicalization, binding ids out of scope for cross-runtime comparability, SDK metadata as side-band runtime provenance.
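One possible reading of prefix-only path canonicalization (A1) is sketched below, purely as an illustration: no Phase 2C implementation has landed, and the function name, the prefix map, and the `<WORKDIR>` token are all hypothetical.

```python
def canonicalize_path(path: str, prefix_map: dict[str, str]) -> str:
    """Rewrite only a leading directory prefix; the rest of the path is untouched."""
    # Try longer prefixes first so the most specific mapping wins.
    for prefix in sorted(prefix_map, key=len, reverse=True):
        root = prefix.rstrip("/")
        # Match on whole path components, so /tmp/run-a does not match /tmp/run-ab.
        if path == root or path.startswith(root + "/"):
            return prefix_map[prefix] + path[len(root):]
    return path
```

Under this reading, two runtimes that write the same relative file below different working directories would project to the same canonical path, while nothing below the prefix is normalized or reordered.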

Phase 2C has no acceptance run, and no contract slice or implementation work has landed yet. The contract slice PR is now unblocked under A1 + B3 + C1 per step 4 of cross-runtime-diff-plan.md § Suggested Slice Sequence, but it remains a separate, future change.

Open backlog independent of this line

Ring-buffer drop debug follow-up: https://github.com/Rul1an/assay/issues/1271. This must not weaken the ringbuf_drops=0 clean-health bar for any future fixture.
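A minimal sketch of the clean-health bar as a strict predicate. Only the `ringbuf_drops` counter name comes from this note; the dict shape of the health record is an assumption, not the observation-health v0 schema.

```python
def observation_is_clean(health: dict) -> bool:
    """Pass only when the drop counter is present and exactly zero."""
    drops = health.get("ringbuf_drops")
    # A missing or non-integer counter fails; absence is not treated as zero.
    # bool is excluded explicitly because it is a subclass of int in Python.
    if isinstance(drops, bool) or not isinstance(drops, int):
        return False
    return drops == 0
```

Treating a missing counter as a failure keeps a future debug mode from loosening the bar by omission.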