Assay-Runner Second Runtime Phase 2B Plan
Internal Phase 2B planning note. This page defines the entry discipline for adding a second deterministic runtime fixture after the first capability-diff consumer. It is not a runtime selection record, not a dependency bump request, and not a new artifact contract.
The first Phase 2B capability-diff line now has one clean consumer: scripts/ci/assay_runner_capability_diff_validate.py can project assay.runner.capability_diff.v0 from explicit normalized evidence directories. The delegated workflow also uploads a first-class proof pack during the run.
That makes a second runtime fixture eligible to plan, but not yet eligible to implement broadly. The next slice should answer one narrow question:
Can a second offline runtime fixture produce the same v0 normalized runner
artifacts cleanly enough for capability-diff comparison?
Entry Conditions
Do not start implementation until these are true:
- assay.runner.capability_diff.v0 is the active diff contract.
- The reference projection can produce status=clean for the accepted S5 fixture.
- Delegated proof packs are uploaded during Runner Spike Delegated runs.
- The candidate runtime has a deterministic offline fixture path with no live LLM calls and no live secrets.
- The candidate runtime can expose stable tool-call identity, or the PR stops before implementation and opens a correlation-contract decision.
Candidate Requirements
A candidate runtime must satisfy all of these before code is added:
| Requirement | Rule |
|---|---|
| Offline execution | The fixture must run without network model calls, hosted credentials, or mutable external services |
| Stable identity | Each observed tool call must have a stable id that can map into tool_call_id |
| Comparable surface | The first fixture should exercise the same small read-file capability class as S5 |
| Deterministic dependency lock | Runtime dependencies must be pinned or vendored through the existing dependency-review discipline |
| Linux/eBPF fit | The fixture must run on the delegated Linux host under the existing cgroup capture model |
| Small event shape | The fixture should produce one binding first; multi-tool or branching traces are later work |
| Evidence boundary fit | The normalizer must not broaden evidence boundaries to make the runtime look comparable |
If stable identity is absent, do not add order-based or timestamp-based matching in the second runtime PR. That decision belongs in a separate correlation fallback contract, not in fixture plumbing.
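The identity-only rule can be sketched as a correlation helper that fails loudly instead of falling back. This is a minimal illustration, not the project's normalizer: the function name, the exception, and the assumption that events are dicts carrying a top-level tool_call_id field are all hypothetical.

```python
class MissingStableIdentity(ValueError):
    """Raised when an event cannot be correlated by a stable id."""


def _index_by_call_id(events, layer):
    # Hypothetical event shape: a dict with a top-level "tool_call_id".
    by_id = {}
    for event in events:
        call_id = event.get("tool_call_id")
        if not call_id:
            raise MissingStableIdentity(f"{layer} event has no tool_call_id")
        if call_id in by_id:
            raise MissingStableIdentity(
                f"{layer} tool_call_id {call_id!r} is not unique"
            )
        by_id[call_id] = event
    return by_id


def correlate_by_tool_call_id(sdk_events, policy_events):
    """Pair events strictly by id; never by position or timestamp."""
    sdk = _index_by_call_id(sdk_events, "sdk")
    policy = _index_by_call_id(policy_events, "policy")
    matched = sorted(sdk.keys() & policy.keys())
    unmatched = sorted(sdk.keys() ^ policy.keys())
    return matched, unmatched
```

There is deliberately no branch that pairs the n-th SDK event with the n-th policy event. A missing or duplicated id raises, which mirrors the "stop and open a correlation-contract decision" rule rather than hiding the gap inside fixture plumbing.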
Expected Artifact Shape
The first second-runtime fixture should produce the existing normalized runner artifact family. Three-run determinism inherits the fixture v0 contract and compares the same five files byte-for-byte:
- observation-health.json
- capability-surface.json
- correlation-report.json
- layers/sdk.ndjson
- layers/policy.ndjson
These artifacts retain their existing schemas; the second runtime fixture must emit shapes that pass the v0 contracts in artifacts-v0.md without proposing schema extensions.
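As an illustration of what byte-for-byte determinism over these five files means in practice, the following sketch hashes each artifact in every run directory and reports the files that differ from the first run. The helper names and the one-directory-per-run layout are assumptions made for the example; the existing determinism gate may be organized differently.

```python
import hashlib
from pathlib import Path

# The five normalized artifacts named in this plan.
ARTIFACTS = (
    "observation-health.json",
    "capability-surface.json",
    "correlation-report.json",
    "layers/sdk.ndjson",
    "layers/policy.ndjson",
)


def digest_run(run_dir):
    """Return {artifact name: sha256 hex digest} for one run directory."""
    root = Path(run_dir)
    return {
        name: hashlib.sha256((root / name).read_bytes()).hexdigest()
        for name in ARTIFACTS
    }


def nondeterministic_artifacts(run_dirs):
    """List artifacts whose bytes differ from the first run in any later run."""
    if len(run_dirs) < 2:
        raise ValueError("need at least two runs to compare")
    digests = [digest_run(run_dir) for run_dir in run_dirs]
    baseline = digests[0]
    return sorted(
        {
            name
            for other in digests[1:]
            for name in ARTIFACTS
            if other[name] != baseline[name]
        }
    )
```

An empty result for three run directories corresponds to the determinism bar described above. A missing artifact raises FileNotFoundError rather than being skipped, so an incomplete run cannot pass as deterministic.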
The expected clean health bar remains unchanged:
- kernel_layer=complete
- ringbuf_drops=0
- policy_layer=present
- sdk_layer=self_reported unless a separate contract proves corroborated SDK observation
- cgroup_correlation=clean
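The health bar above can be expressed as a small check. This sketch assumes, purely for illustration, that the health values are available as flat top-level keys in a dict loaded from observation-health.json; the actual layout is governed by artifacts-v0.md, and the function name is hypothetical.

```python
# Expected values taken from the clean health bar in this plan.
# Assumption: the keys appear flat and top-level in the loaded JSON.
EXPECTED_CLEAN_HEALTH = {
    "kernel_layer": "complete",
    "ringbuf_drops": 0,
    "policy_layer": "present",
    "sdk_layer": "self_reported",
    "cgroup_correlation": "clean",
}


def health_violations(health):
    """Return one message per field that misses the clean bar."""
    return [
        f"{key}: expected {want!r}, got {health.get(key)!r}"
        for key, want in EXPECTED_CLEAN_HEALTH.items()
        if health.get(key) != want
    ]
```

The check has no tolerance parameter on purpose: a fixture that reports ringbuf_drops=1 is reported as a violation rather than rounded down to clean.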
The first capability diff involving the second runtime should be descriptive only. It may compare the accepted S5 fixture against the new fixture, or compare two runs of the new fixture, but it must not decide whether any difference is acceptable.
Suggested PR Sequence
1. Land this entry plan.
2. Add a candidate-selection note that records the chosen runtime, identity source, offline fixture strategy, dependency lock path, and expected gate.
3. Add the smallest fixture instance and local validators without changing the v0 artifact contracts.
4. Run delegated proof with gates=all for the first fixture PR. A narrower named gate for the second runtime is a separate later change that requires coordinated updates to:
   - ci-lanes.md decision table and required-gate mapping
   - the lane-check classifier in scripts/ci/assay_runner_lane_check.py
   - the Runner Spike Delegated workflow inputs.gates enum
   - the matching scripts/ci/runner-spike-* acceptance scripts

   Do not add a narrower gate as a side effect of the first fixture PR.
5. Add capability-diff golden output for diff(second_runtime, second_runtime). This is the idempotent acceptance check that mirrors the S5 idempotent golden defined by capability-diff-v0.md.
6. Cross-runtime diff examples comparing the second runtime against S5 are out of Phase 2B scope. They require a separate Phase 2C contract review: what does "same capability" mean across runtimes with different tool naming, different SDK event vocabularies, and different binding identity sources? Do not introduce cross-runtime diff in the first second-runtime PR.
Acceptance Criteria For The First Fixture PR
The first implementation PR should satisfy all of these:
- The fixture emits one stable binding id and does not rely on order fallback.
- Three-run determinism covers the same normalized artifact family as S5.
- assay_runner_capability_diff_validate.py can project a clean idempotent diff for the new fixture.
- The delegated proof-pack artifact contains the new runtime archive, selected JSON artifacts, gate log, PASS lines, and manifest entry.
- The PR records a successful Runner Spike Delegated run URL, head SHA, gate, and proof-pack artifact name.
- No pull_request, push, or schedule trigger is added to the delegated self-hosted workflow.
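The trigger criterion lends itself to a mechanical check. The sketch below is a hypothetical helper, not an existing script in the repository: it takes the already-parsed value of a workflow's on key, which GitHub Actions allows to be a single event name, a list of names, or a mapping of names to configuration. YAML loading is left to the caller; note that PyYAML's default loaders read a bare on key as the boolean True, so the caller may need to look the value up under that key.

```python
# Triggers this plan forbids on the delegated self-hosted workflow.
FORBIDDEN_TRIGGERS = frozenset({"pull_request", "push", "schedule"})


def forbidden_triggers(on_value):
    """Return the forbidden event names declared in a workflow's `on` value."""
    if isinstance(on_value, str):
        declared = {on_value}
    elif isinstance(on_value, (list, tuple, dict)):
        # Iterating a dict yields its keys, which are the event names.
        declared = set(on_value)
    else:
        raise TypeError(f"unsupported `on` value: {type(on_value).__name__}")
    return sorted(declared & FORBIDDEN_TRIGGERS)
```

For example, a workflow whose on value is {"workflow_dispatch": {...}} yields an empty list, while ["push", "workflow_dispatch"] yields ["push"].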
Kill Criteria
Stop the line before implementation if any of these are true:
- The runtime cannot expose stable tool-call identity without timing or ordering inference.
- The fixture needs live model calls, live secrets, or mutable hosted state.
- The runtime requires host privileges or services outside the delegated runbook.
- Normalized evidence can only be made clean by weakening ring-buffer, cgroup-correlation, or telemetry-versus-evidence rules.
- Dependency installation is not reproducible enough for three-run byte determinism.
Out Of Phase 2B Scope
The following are intentionally deferred to Phase 2C or later. They are listed here as boundary markers so a future PR cannot quietly absorb them into second-runtime work:
| Item | Why deferred |
|---|---|
| Cross-runtime capability-diff (diff(second_runtime, S5)) | requires a contract for what "same capability" means across runtimes; not a fixture-implementation question. The Phase 2C mini-plan that opens this question is cross-runtime-diff-plan.md. |
| Declared-capability input | new artifact category; needs its own schema and contract slice |
| Call-id-less fallback semantics | tracked as a separate correlation-contract decision; must not be introduced as fixture plumbing |
| macOS or Windows measurement | a separate platform spike with its own kill criteria and CI lane contract |
| OTel or GenAI semantic-convention mapping | external mapping surface; must not be introduced before the Linux runner boundary is stable |
| Repository extraction of a runner candidate | boundary-map readiness criteria still apply; this plan does not move them |
| Multi-tool or branching agent traces | second runtime first proves one binding; multi-binding is a follow-up after the idempotent diff lands |
This list does not promise future work. It only prevents quiet scope creep into the first second-runtime fixture PR.
Non-Goals
This plan does not:
- choose the second runtime
- add runtime dependencies
- add fixture code
- define call-id-less fallback semantics
- add declared-capability inputs
- add macOS or Windows measurement
- add OTel or GenAI semantic-convention mapping
- change branch protection or delegated workflow triggers
Those require separate contracts or implementation PRs after the candidate selection record exists.
Follow-Up After Merge
After this plan lands on main, the next discoverable step is a candidate selection issue separate from this document. The issue compares concrete runtime candidates against the Candidate Requirements table above. It does not propose code; it produces the selection note that step 2 of the Suggested PR Sequence requires.
The ring-buffer drop debug follow-up tracked in https://github.com/Rul1an/assay/issues/1271 remains independent of this line. It must not weaken the ringbuf_drops=0 clean-health bar for any second-runtime fixture.