Assay-Runner Phase 1 Acceptance¶
Status: accepted on delegated Linux hardware; internal evidence note, not a public release announcement Date: 2026-05-21 Scope: records the Phase 1 outcome for the internal Assay-Runner measured-run spike. It does not imply a public repository split, hosted service, macOS support, live LLM support, or a released product surface.
Phase 1 asked whether Assay can produce verifiable measured-run bundles with low-ambiguity correlation across kernel, policy, and SDK observation layers.
The answer for the delegated Linux/eBPF path is yes.
Acceptance Run¶
| Field | Value |
|---|---|
| Workflow | Runner Spike Delegated |
| Run | 26211485614 |
| URL | https://github.com/Rul1an/assay/actions/runs/26211485614 |
| Commit | 56571045 |
| Branch | codex/assay-runner-drop-kernel-stream-before-stats |
| Mode | gates=all |
| Delegated job duration | 13m11s |
| Host class | assay-bpf-runner self-hosted Linux runner |
The delegated run executed all Phase 1 proof modes in one workflow dispatch:
| Gate | Result |
|---|---|
kernel-only | 3 acceptance runs plus three-run determinism passed |
kernel+policy | 3 acceptance runs plus three-run determinism passed |
OpenAI Agents kernel+policy+SDK | 3 acceptance runs plus three-run determinism passed |
The OpenAI Agents gate exercised the real @openai/agents SDK runtime path with a deterministic local model provider. It did not make live LLM calls.
The determinism claim is over normalized runner evidence artifacts extracted from the archive. It is not a claim that the raw eBPF object, raw ring-buffer delivery, or every kernel telemetry event is byte-identical across runs.
What Was Proven¶
This run proves the Phase 1 Linux/eBPF delegated path under the bounded spike contract:
- the eBPF program loads and attaches on the delegated Linux host
- child execution can be placed into a clean measured cgroup before spawn
- kernel observation can complete with
ringbuf_drops=0 - policy decisions can be correlated with kernel-observed side effects
- SDK tool-call events from the real OpenAI Agents runtime can be correlated with policy events by
tool_call_id - the spike archive can be verified by the existing Assay evidence path
- each proof mode can produce byte-stable three-run determinism on the delegated host
Allowed Phase 1 claim:
For the deterministic acceptance fixtures, Assay can produce verifiable
measured-run bundles on a delegated Linux/eBPF host that correlate kernel,
policy, and OpenAI Agents SDK evidence with low ambiguity and complete
observation health.
Forbidden claims remain forbidden:
- this does not prove macOS or Windows kernel-grounded attribution
- this does not prove live LLM execution or cassette replay
- this does not prove arbitrary SDK compatibility beyond the validated
@openai/agentsfixture - this does not prove production traffic, sustained load, or a runner fleet
- this does not prove event-level syscall causality or ordered trace semantics
- this does not include a dedicated ring-buffer drop debug mode; follow-up is tracked in https://github.com/Rul1an/assay/issues/1271
Kill Criteria¶
The Phase 1 spike plan named several ways the track could fail. The delegated acceptance run mechanically refuted each one for the tested Linux path.
| Kill criterion | Acceptance result |
|---|---|
| Clean cgroup correlation cannot be made reliable | Refuted: all delegated acceptance runs reported clean correlation |
| Ordinary runs produce ring-buffer drops often enough to make complete observation abnormal | Refuted: all delegated acceptance runs in gates=all completed with zero drops |
tool_call_id cannot be carried through the OpenAI Agents path | Refuted: OpenAI Agents SDK and policy correlation passed across three runs |
| Bundle verification requires a parallel artifact system | Refuted: the spike bundle path verifies through the existing Assay evidence integration |
| Policy-to-kernel attribution is unstable across repeated runs | Refuted: kernel+policy three-run determinism passed |
Regression Anchors¶
The green delegated run depended on several concrete discoveries. These are now regression risks and should remain anchored by code, tests, docs, or delegated gates before refactors touch the runner path.
| Discovery | Why it matters | Current anchor |
|---|---|---|
EVENT_INODE_RESOLVED is telemetry, not attribution evidence | Including it made bundle output noisy without strengthening the claim | Runner-spike kernel normalizer filter and delegated determinism gate |
| Loader, locale, runtime, and dependency paths are telemetry, not bundle evidence | Dynamic linker and locale opens dominated kernel volume and were not agent behavior | eBPF/userspace path filters and delegated determinism gate |
| Ring-buffer consumers must persist across polls | Rebuilding the consumer per poll replayed kernel records and was the final root cause blocking byte-stable delegated determinism | assay-monitor listener ownership of persistent ring buffers |
| Runner session cgroups must start from a valid domain root | Service cgroups can be domain threaded; child session cgroups below them can reject process placement | cgroup domain-root resolution and unit coverage |
| Self-hosted workflow cleanup must precede action download | Deleting _actions inside the same job can delete already prepared action repositories | delegated workflow prepare job plus gates job split |
| SDK metadata must come from installed package metadata | Hardcoded SDK version strings can silently falsify bundle claims after dependency bumps | OpenAI Agents fixture metadata load plus expected-version gate |
| Acceptance fixtures must be deterministic below output level | Cold/warm fixture differences can change observed syscall surfaces even when files match | fixture pre-seeding/control paths and delegated three-run determinism |
Phase 2 Follow-Up¶
Phase 2 should consolidate this proof before any repository split or platform expansion:
- freeze versioned v0 references for
observation-health,capability-surface, andcorrelation-report:Runner artifact v0 contracts - maintain the telemetry-versus-evidence filter contract in that reference
- maintain the delegated runner runbook for provisioning and failure triage:
ASSAY-RUNNER-DELEGATED-RUNBOOK-2026-05-21.md - classify when delegated CI is required for runner-impacting changes:
Runner CI lane contract - write and maintain the acceptance fixture contract for future SDK fixtures:
Runner acceptance fixture v0 contract - define the Assay-Runner boundary and extraction map:
Assay-Runner boundary and extraction map
OpenTelemetry mappings, macOS support, live LLM calls, and repository extraction are outside Phase 2A. They should only start after the Linux runner boundary is documented and stable.
Proof Pack¶
This acceptance note records the workflow run, commit, and pass/fail evidence. Phase 2B hygiene adds a small durable proof manifest: Phase 1 delegated proof pack.
The proof pack preserves workflow metadata, the full workflow log, PASS-line evidence, and v0 golden-shape JSON digests. It does not contain the original runner-*.tar.gz archives because the historical workflow run did not upload GitHub artifacts and the delegated host wrote the archives to temporary paths cleaned after the job. Future delegated acceptance workflows should upload a first-class proof pack artifact during the run; that workflow follow-up is tracked in https://github.com/Rul1an/assay/issues/1287.
Subsequent Internal Landings¶
This section is a chronology pointer only. It does not re-open or extend the Phase 1 acceptance claim above, does not promise future work, and does not introduce a new contract. It exists so that a reader returning later can reconstruct the Phase 2 sequence without PR archaeology.
Phase 2A — internal contract freeze¶
The Phase 2 Follow-Up items above were consolidated into versioned internal references:
- Runner artifact v0 contracts
- Runner acceptance fixture v0 contract
- Runner CI lane contract
- Assay-Runner boundary and extraction map
Phase 2B — first capability-diff slice¶
- Runner capability-diff Phase 2B plan
- Runner capability-diff v0 contract
- Reference projection:
scripts/ci/assay_runner_capability_diff_validate.pyproducesstatus=cleanfor the accepted S5 fixture against itself.
Phase 2B — second runtime fixture (Gemini Python google-genai)¶
The second-runtime fixture line landed against the Phase 2B entry discipline. It introduces a new fixture, not a new contract or a new artifact category.
- Runner second runtime Phase 2B plan
- Runner second runtime candidate selection
- Runner Gemini fixture design
- Fixture root:
runner-fixtures/gemini-google-genai/
Delegated gates=all proof for the Gemini fixture:
| Field | Value |
|---|---|
| Workflow | Runner Spike Delegated |
| Run | 26280889075 |
| URL | https://github.com/Rul1an/assay/actions/runs/26280889075 |
| Head SHA | f917090a4a707eb485f57e82e8ed83ea8a52391e |
| Mode | gates=all (includes gemini-google-genai-kernel-policy) |
| Proof-pack artifact | assay-runner-delegated-proof-pack-26280889075 |
| Host class | assay-bpf-runner self-hosted Linux runner |
| Identity source | cassette-derived FunctionCall.id, level-3 stable identity verified by the maintainer probe outcome under runner-fixtures/gemini-google-genai/probe-results/ |
Scope discipline for this landing:
- S5's
mcp-policy-agent.shand the OpenAI Agents fixture path were not modified - The v0 artifact contracts, fixture v0 contract, and boundary map were not modified
- No narrower delegated gate name was introduced; the Gemini fixture runs under the existing
gates=alldispatch - Cross-runtime capability comparison between S5 and Gemini was not introduced
The Gemini delegated run is a second-runtime fixture acceptance only. It does not extend the Phase 1 OpenAI Agents claim above and does not establish a third runtime as eligible.
Phase 2C — cross-runtime diff decision resolved¶
- Runner cross-runtime diff Phase 2C plan
- Runner cross-runtime diff Phase 2C decisions (A1+B3+C1)
- Decision gate (resolved): https://github.com/Rul1an/assay/issues/1310 (closed by #1312)
The three open interpretation questions (path projection, binding-id comparability, SDK metadata placement) were resolved as A1 + B3 + C1: prefix-only path canonicalization, binding ids out of scope for cross-runtime comparability, SDK metadata as side-band runtime provenance.
Phase 2C has no acceptance run, no contract slice, and no implementation work landed yet. The contract slice PR is now unblocked under A1 + B3 + C1 per step 4 of cross-runtime-diff-plan.md § Suggested Slice Sequence, but it remains a separate, future change.
Open backlog independent of this line¶
- Ring-buffer drop debug follow-up: https://github.com/Rul1an/assay/issues/1271. This must not weaken the
ringbuf_drops=0clean-health bar for any future fixture.