Assay-Runner Capability-Diff v0 Contract¶

Internal Phase 2B contract. This page defines the first Assay-Runner capability-diff projection over normalized runner artifacts. It is not a runner-emitted archive artifact, not a CLI surface, and not a product release contract.

The v0 capability diff answers one narrow question:

Given two clean normalized runner evidence sets, what observed capability
surface changed?

The diff is descriptive. It reports added, removed, and unchanged normalized capability values. It must not decide whether a change is acceptable; that remains policy, reviewer, or Harness responsibility.

Inputs¶

A v0 diff compares two evidence sets, named base and head. Each set must provide these normalized artifacts:

Artifact	Role
`observation-health.json`	Determines whether the evidence set is clean enough for a clean diff
`capability-surface.json`	Provides the observed capability sets to compare
`correlation-report.json`	Provides stable binding identity through `tool_call_id`

The input artifacts retain their own schema contracts:

Raw kernel telemetry, workflow logs, proof-pack metadata, and normalized layer streams are diagnostic context only. They are not primary v0 diff inputs.

The v0 capability diff is a pure projection over normalized evidence. Workflow run URLs, commit SHAs, and generation timestamps are intentionally not part of this schema. Consumers that need forensic anchoring should pair the diff with a proof-pack manifest, which carries workflow context separately.

Contract Principles¶

Normalized evidence only. The diff consumes normalized artifacts after the runner normalizer has already drawn the evidence boundary.
Surface-level projection. v0 compares set-like values in capability-surface.json. It does not attribute each filesystem path, process, endpoint, or tool value to an individual binding window.
Stable binding identity. Clean v0 diffs require stable tool_call_id values in correlation-report.json.
Health remains strict. ringbuf_drops > 0, incomplete cgroup correlation, or missing SDK/policy layers must not be softened into a clean diff.
Deterministic serialization. Arrays are stable sorted sets unless this contract explicitly says otherwise.
No acceptability judgment. The diff says what changed, not whether the change is allowed for the project.

Schema¶

Schema string:

assay.runner.capability_diff.v0

Fields:

Field	Type	Required	Semantics
`schema`	string	yes	Must equal `assay.runner.capability_diff.v0`
`base_run_id`	string	yes	`run_id` from the base evidence set
`head_run_id`	string	yes	`run_id` from the head evidence set
`status`	enum	yes	`clean`, `partial:health`, `partial:correlation`, `partial:unbound`, or `failed`
`preconditions`	object	yes	Machine-readable checks that determine whether the diff can be clean
`scope`	object	yes	Declares what evidence domain this diff used
`surface`	object	yes	Added, removed, and unchanged capability-surface values by category
`binding_ids`	object	yes	Added, removed, and unchanged tool-call binding ids
`policy_outcomes`	object	yes	Policy decision changes for stable binding ids
`unbound`	object	yes	Evidence buckets that could not be safely compared in v0
`ambiguities`	array[string]	yes	Stable code-prefixed ambiguity strings
`notes`	array[string]	yes	Stable code-prefixed human-readable notes

Preconditions¶

preconditions records why a diff is or is not clean.

Field	Type	Required	Clean value
`base_health_clean`	boolean	yes	`true`
`head_health_clean`	boolean	yes	`true`
`base_correlation_clean`	boolean	yes	`true`
`head_correlation_clean`	boolean	yes	`true`
`stable_tool_call_ids_required`	boolean	yes	`true`
`stable_tool_call_ids_present`	boolean	yes	`true`

A health set is clean only when:

kernel_layer=complete
ringbuf_drops=0
policy_layer=present
sdk_layer=self_reported
cgroup_correlation=clean

sdk_layer=present is reserved for a future corroborated SDK path in the artifact contract. This first capability-diff contract accepts only the currently proven S5 fixture shape: sdk_layer=self_reported.

Scope¶

scope separates preconditions from projection scope.

Field	Type	Required	v0 value
`projection`	string	yes	`surface_set`
`uses_raw_telemetry`	boolean	yes	`false`
`uses_proof_pack`	boolean	yes	`false`
`per_binding_capability_values`	boolean	yes	`false`

per_binding_capability_values=false is load-bearing. v0 correlation proves a stable binding id and kernel-event window, but the current capability-surface artifact is global to the run. Therefore v0 must not claim that an individual path, process, endpoint, or tool value belongs to one specific binding.

Surface Diff¶

surface contains one object per capability-surface.v0 category:

filesystem_paths
network_endpoints
process_execs
mcp_tools
policy_decisions

Each category object has the same fields:

Field	Type	Required	Semantics
`added`	array[string]	yes	Values present in head and absent from base
`removed`	array[string]	yes	Values present in base and absent from head
`unchanged`	array[string]	yes	Values present in both base and head

All arrays serialize in stable lexicographic order.

Policy Decision Consistency¶

surface.policy_decisions reports the changed set of policy decision summaries regardless of binding identity. policy_outcomes.changed reports changed coarse policy outcomes per stable binding id. These views can diverge only when bindings are added or removed.

For unchanged binding ids, the two views must stay consistent. If an unchanged tool_call_id in binding_ids.unchanged has a policy summary change that appears in surface.policy_decisions.added or surface.policy_decisions.removed, that same binding must appear in policy_outcomes.changed. Implementations must verify this consistency before emitting status=clean.

Binding Id Diff¶

binding_ids compares the set of tool_call_id values from clean correlation bindings.

Field	Type	Required	Semantics
`added`	array[string]	yes	Binding ids present in head and absent from base
`removed`	array[string]	yes	Binding ids present in base and absent from head
`unchanged`	array[string]	yes	Binding ids present in both base and head

binding_ids.unchanged reports identity stability only. Policy outcome changes for stable binding ids are tracked separately in policy_outcomes.changed.

Clean v0 does not support order-based fallback. If a binding lacks a stable tool_call_id, the diff is at least partial:correlation.

Policy Outcomes¶

policy_outcomes.changed records changed coarse policy outcomes for unchanged binding ids.

Each entry has:

Field	Type	Required	Semantics
`tool_call_id`	string	yes	Stable binding id whose policy outcome changed
`base`	string or null	yes	Base coarse policy outcome
`head`	string or null	yes	Head coarse policy outcome

Entries serialize by tool_call_id. v0 accepted fixtures use coarse outcomes such as allow and deny; capability-surface policy summaries remain strings such as allow:read_file.

Unbound Evidence¶

unbound uses the same category names as surface, each as a stable array[string].

For a clean v0 diff, every unbound category must be empty. A future input may make per-value unbound evidence explicit. Until then, v0 producers must not invent per-binding path attribution from global capability-surface values. Because all current capability-surface.v0 values are run-global, partial:unbound is reserved for a future per-binding capability artifact. v0 implementations must keep unbound arrays empty; inputs that suggest per-value unbinding without an explicit versioned source should produce status=failed, not an invented partial:unbound projection.

Status Semantics¶

Status	Semantics
`clean`	All preconditions are true, all required artifacts validate, correlation is clean for both sides, and all `unbound` arrays are empty
`partial:health`	At least one evidence set can be parsed but has incomplete health such as ring-buffer drops or incomplete cgroup correlation
`partial:correlation`	Health is sufficient to parse, but at least one correlation report is partial, ambiguous, or lacks stable binding identity
`partial:unbound`	Reserved for a future per-binding capability artifact; v0 producers must not emit this status from run-global capability-surface values
`failed`	Required artifacts are missing, schema strings are unsupported, run ids are internally inconsistent, or deterministic parsing fails

partial:* diffs may be useful for triage, but they are not clean evidence for acceptance. ringbuf_drops > 0 always prevents status=clean.

Idempotence¶

The first read-only validation gate is idempotence:

diff(S5_acceptance, S5_acceptance)

This must produce:

status=clean
empty added and removed arrays for every surface category
unchanged arrays equal to the input capability-surface sets
empty binding_ids.added and binding_ids.removed
binding_ids.unchanged=["tc_runner_policy_001"]
empty policy_outcomes.changed
empty unbound arrays
empty ambiguities

The golden shape for this case is golden/capability-diff-s5-idempotent-v0.json. The read-only validation and reference projection entry point is scripts/ci/assay_runner_capability_diff_validate.py.

Without explicit inputs, the helper projects the diff from the existing S5 golden artifacts and compares it to that frozen output shape. With explicit --base-dir and --head-dir inputs, or with explicit base/head artifact paths, it emits a v0 capability diff over normalized evidence only. Directory inputs must contain observation-health.json, capability-surface.json, and correlation-report.json.

Example:

python3 scripts/ci/assay_runner_capability_diff_validate.py \
  --base-dir /path/to/base-normalized-evidence \
  --head-dir /path/to/head-normalized-evidence \
  --output /tmp/capability-diff.json

Notes Vocabulary¶

notes follows the same code-prefixed convention as the runner artifact contracts. v0 reserves the capability_diff_ prefix. The first golden shape emits capability_diff_idempotent when base and head evidence sets are identical. Implementations must not introduce new note codes without updating this contract.

Non-Goals¶

v0 does not include:

declared-capability input; the future declared-capability contract is a separate Phase 2C+ slice
per-binding path/process/endpoint attribution
call-id-less order fallback
second runtime support; cross-runtime semantics are contracted separately in cross-runtime-diff-v0.md
macOS runner support
raw telemetry diffing
proof-pack ingestion as a required input
OTel or GenAI semantic-convention mapping
acceptability or policy judgment

Implementation Placement¶

The boundary map places capability-diff projection semantics on the Trust Basis / Harness side while Runner delivers clean measured-run input bundles. A future implementation may start as an Assay-side reference checker, but it must not silently move artifact meaning into the runner candidate.