Assay-Runner Product Candidate Memo
Status: internal product-candidate memo, Phase-1-spike gated, not a public roadmap commitment
Date: 2026-05-18
Last updated: 2026-05-21
Scope: records a possible measured-run product track beside Assay-Harness; no external publication, repo split, or build commitment is implied.
This note defines a candidate product track for measured agent execution. It is intentionally separate from Assay-Harness, which remains the recipe, gate, and projection layer above Assay artifacts.
Product Layout
Assay-core = evidence, policy, monitor, Trust Basis primitives
Assay-Harness = recipes, gates, PR projections
Assay-Runner = measured execution, tri-layer bundles, capability diff
Three products, three roles, no overlap. Neither Assay-Harness nor Assay-Runner defines artifact semantics of its own; all artifact semantics live in Assay-core.
Name discipline: Assay-Runner is an internal placeholder only. External name work is deferred until a Phase 1 spike passes. Public names such as Assay Measure or Assay Run Attestation should not be advanced before there is one verified demo bundle and one credible capability diff.
Thesis
The harness should stop asking the agent what it did. It should observe side effects at the syscall boundary, bind evidence under a content-addressed, verifiable bundle, and diff the resulting capability surface. The agent runtime is interchangeable.
The kernel layer is not treated as perfect truth. It is the primary observation source for side effects, with explicit health metadata that says how complete or incomplete that observation was.
Differentiator
| Category, May 2026 | Trust basis | What is missing |
|---|---|---|
| Eval harnesses such as Inspect AI, METR, promptfoo, DeepEval | SDK or score | Observation below the SDK |
| Agent-SDK harnesses such as OpenAI Agents SDK, LangGraph | SDK is source of truth | Kernel-blind, runtime-bound |
| Sandbox-as-harness such as Daytona, E2B, Modal | Container boundary | Inside-the-run observation |
| Gateway harnesses such as Kuadrant MCP, Envoy ext_proc | Proxy visibility | Filesystem, syscall, process view |
| Observability such as Langfuse, Helicone, LangSmith | SDK-reported traces | Verification and capability diff |
| Assay-Runner | Correlated syscall, policy, and SDK evidence | Phase 1 must prove attribution is reliable |
The opening is kernel-grounded, verify-before-diff capability evidence across multiple runtimes.
Integration Claim
The needed substrate already exists in Assay. Runner is correlation and integration, not a new research bet.
| Capability | Source |
|---|---|
| Syscall observation, LSM, tracepoints | crates/assay-monitor, crates/assay-ebpf |
| Tier 1 policy, kernel-oriented | crates/assay-policy |
| Tier 2 policy, userspace and proxy | crates/assay-policy, crates/assay-mcp-server |
| Content-addressed bundles | crates/assay-evidence |
| Verify-before-diff, Trust Basis | crates/assay-evidence |
| Deterministic LLM replay | assay-core::vcr |
| Landlock sandbox | assay sandbox |
| Attack and chaos simulation | crates/assay-sim |
What is missing is the correlation layer (run_id to cgroup to pid to tool_call_id) and the shim layer.
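The run-to-cgroup-to-pid part of that chain can be sketched as below. This is an illustration only: `CorrelationIndex`, `PidLease`, and their methods are hypothetical names, not part of any Assay crate, and the sketch assumes PID ownership is tracked as time intervals so that a recycled PID is not attributed to the wrong run. The `tool_call_id` leg is left out.

```python
from dataclasses import dataclass, field
from typing import List, Dict, Optional


@dataclass
class PidLease:
    """Interval during which a pid belonged to a run's cgroup (hypothetical)."""
    pid: int
    run_id: str
    start_ns: int
    end_ns: Optional[int] = None  # None while the process is still alive


@dataclass
class CorrelationIndex:
    """Maps cgroup -> run_id and pid -> run_id over time; illustrative only."""
    cgroup_to_run: Dict[int, str] = field(default_factory=dict)
    leases: List[PidLease] = field(default_factory=list)

    def bind_cgroup(self, cgroup_id: int, run_id: str) -> None:
        self.cgroup_to_run[cgroup_id] = run_id

    def bind_pid(self, pid: int, cgroup_id: int, start_ns: int) -> None:
        # A pid inherits the run_id of the cgroup it was observed in.
        self.leases.append(PidLease(pid, self.cgroup_to_run[cgroup_id], start_ns))

    def release_pid(self, pid: int, end_ns: int) -> None:
        # Close every open lease for this pid (process exit).
        for lease in self.leases:
            if lease.pid == pid and lease.end_ns is None:
                lease.end_ns = end_ns

    def attribute(self, pid: int, ts_ns: int) -> Optional[str]:
        """Return the run_id owning pid at ts_ns, or None if unknown or ambiguous."""
        owners = {
            lease.run_id
            for lease in self.leases
            if lease.pid == pid
            and lease.start_ns <= ts_ns
            and (lease.end_ns is None or ts_ns <= lease.end_ns)
        }
        return owners.pop() if len(owners) == 1 else None
```

Returning `None` instead of guessing keeps ambiguous events out of the bundle's attributed set, which is the conservative choice for a correlation layer whose reliability is still unproven.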
Bounded Claim
The Runner bundle is observation evidence, not transcript truth.
It may state:
- what the kernel observed in a marked time window bound to `run_id` via cgroup v2
- which policy decisions `assay-mcp-server` or `assay-policy` made in that same window
- which tool-call events the SDK shim emitted in that same window, when a shim other than `none` is used
It must not claim:
- that this is everything the agent did
- that the SDK layer is correct
- that the kernel layer captured every relevant event
- that layer timestamps prove causal ordering
For v1, cross-layer claims are set-based. They are not sequence claims.
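A small sketch of what "set-based, not sequence" could mean in practice, assuming events are flat records. The `Event` type and `window_sets` function are hypothetical. Timestamps are used only to decide window membership and are then discarded, so two runs that touch the same things in a different order yield the same result.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple


class Event(NamedTuple):
    layer: str   # "kernel", "policy", or "sdk"
    ts_ns: int   # used only for window membership, never for ordering
    item: str    # e.g. "fs:/tmp/out/" or "tool:filesystem.write_file"


def window_sets(events: Iterable[Event], start_ns: int, end_ns: int) -> Dict[str, FrozenSet[str]]:
    """Collapse events inside [start_ns, end_ns] into one unordered set per layer."""
    sets = {"kernel": set(), "policy": set(), "sdk": set()}
    for ev in events:
        if start_ns <= ev.ts_ns <= end_ns:
            sets[ev.layer].add(ev.item)
    # Frozen sets make it explicit that no ordering survives.
    return {layer: frozenset(items) for layer, items in sets.items()}
```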
Observation Health
Observation health is a first-class bundle field. Every bundle carries explicit metadata like:
observation_health:
  kernel_layer: complete | partial_ringbuf_drops | absent
  ringbuf_drops: <int>
  policy_layer: present | absent
  sdk_layer: present | self_reported | absent
  cgroup_correlation: clean | partial | failed
A bundle with kernel_layer=absent, such as a macOS run, can still be valid. It is valid but explicitly incomplete. A bundle with ringbuf_drops>0 says kernel visibility was partial, and every capability diff projected from that bundle must show that warning. The diff must never hide how complete its own evidence is.
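One way to model this field is a validated record that produces its own warnings. The sketch below is hypothetical: the class name, the warning strings, and the rule that `ringbuf_drops > 0` cannot coexist with `kernel_layer=complete` are assumptions drawn from the description above, not the actual bundle schema.

```python
from dataclasses import dataclass
from typing import List

# Allowed values mirror the field listing in this memo.
ALLOWED = {
    "kernel_layer": {"complete", "partial_ringbuf_drops", "absent"},
    "policy_layer": {"present", "absent"},
    "sdk_layer": {"present", "self_reported", "absent"},
    "cgroup_correlation": {"clean", "partial", "failed"},
}


@dataclass(frozen=True)
class ObservationHealth:
    kernel_layer: str
    ringbuf_drops: int
    policy_layer: str
    sdk_layer: str
    cgroup_correlation: str

    def __post_init__(self) -> None:
        for name, allowed in ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
        if self.ringbuf_drops < 0:
            raise ValueError("ringbuf_drops must be non-negative")
        # Assumed consistency rule: any drop means kernel visibility was partial.
        if self.ringbuf_drops > 0 and self.kernel_layer == "complete":
            raise ValueError("ringbuf_drops > 0 contradicts kernel_layer=complete")

    def warnings(self) -> List[str]:
        """Warnings that any projection of this bundle would need to display."""
        out = []
        if self.kernel_layer == "absent":
            out.append("kernel layer absent: bundle is valid but incomplete")
        if self.ringbuf_drops > 0:
            out.append(f"kernel visibility partial: {self.ringbuf_drops} ring-buffer drops")
        return out
```

A macOS-style record such as `ObservationHealth("absent", 0, "present", "self_reported", "clean")` constructs without error and reports exactly one warning, which matches the "valid but explicitly incomplete" stance.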
Linux-First Boundary
Full kernel-grounded measurement is Linux only:
- tri-layer bundle
- capability diff over the kernel layer
- eventual production measurement path
Partial or degraded mode is allowed on macOS and Windows:
- SDK plus policy layers where available
- `kernel_layer=absent`
- every output marks the incomplete trust basis explicitly
This is a design boundary, not a vague cross-platform promise.
Risks
| Risk | What it breaks |
|---|---|
| Cross-layer correlation ambiguity | If tool-call to policy to syscall attribution is not reliable, the value proposition fails. |
| Baseline instability | Noisy capability sets make PR diffs meaningless without set summarization and ignore rules. |
| Overhead under load | Bundle writing must stream to disk, not aggregate large event sets in memory. |
| Shim drift | Each major SDK version needs an owned shim contract, not best-effort broad compatibility. |
| Compliance-frame contamination | Event-level signed receipts would drag the product into the wrong category. Bundle-level signing can stay optional. |
The real risk is attribution, not eBPF or bundle writing.
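For the overhead row in the table above, streaming with an incremental digest is one way to keep memory flat while still ending up with a content address. This is a generic sketch, not the `assay-evidence` format: the JSON-lines encoding, the `sha256:` prefix, and the function names are all assumptions.

```python
import hashlib
import io
import json
from typing import BinaryIO, Iterable, Iterator


def stream_events(events: Iterable[dict], out: BinaryIO) -> str:
    """Write events one line at a time and return a digest of the bytes written."""
    digest = hashlib.sha256()
    for event in events:
        # Canonical-ish encoding: sorted keys, no extra whitespace.
        line = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8") + b"\n"
        out.write(line)
        digest.update(line)  # hash incrementally; nothing is buffered in memory
    return "sha256:" + digest.hexdigest()


def synthetic_events(n: int) -> Iterator[dict]:
    """Generator standing in for a live kernel event source."""
    for i in range(n):
        yield {"kind": "open", "path": f"/tmp/file-{i}"}


buffer = io.BytesIO()  # stands in for a file opened with open(path, "wb")
address = stream_events(synthetic_events(1000), buffer)
```

Because the digest is updated per line, peak memory depends on the largest single event rather than on the size of the run.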
Phase 1: Proof Spike
Two shims are required, in order of proof:
- `--agent-shim none`: the epistemic wedge. Runner asks nothing of the SDK.
- `--agent-shim openai-agents`: the adoption wedge. The SDK layer becomes a correlated source, not the source of truth.
The none shim has two explicit submodes:
- `none + kernel-only`: no SDK shim and no Assay policy layer
- `none + kernel+policy`: no SDK shim, but Assay policy/proxy events are present and correlated with kernel-observed side effects
Spike question:
Can Assay produce one verifiable run bundle per shim mode, where every observable layer is correlated to the others with low ambiguity, and observation health is honestly self-reported?
Acceptance Criteria For none
- One run has one explicit `run_id`.
- Kernel-layer events inside cgroup C and window T are grouped into a capability surface: filesystem prefixes, network endpoints, and process execs.
- In `none + kernel+policy`, policy-layer events correlate with kernel-layer events: policy saw tool X, and the kernel saw a congruent syscall set in the same bounded window.
- In `none + kernel-only`, `policy_layer: absent` is explicit.
- `observation_health` is correctly filled. Three consecutive runs of the same scenario produce consistent health metadata.
- `assay evidence verify` succeeds.
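The capability-surface grouping in the criteria above could look roughly like this. The event shapes (`kind`, `path`, `host`, `port`) and the depth-based prefix rule in `fs_prefix` are invented for illustration; the real monitor events and summarization rules may differ.

```python
from pathlib import PurePosixPath
from typing import Dict, FrozenSet, Iterable


def fs_prefix(path: str, depth: int = 3) -> str:
    """Collapse an absolute path to its first `depth` directories (hypothetical rule)."""
    parts = PurePosixPath(path).parts  # e.g. ('/', 'var', 'cache', 'agent-state', 'a.json')
    if len(parts) <= depth + 1:
        return path  # already short enough; keep as observed
    return str(PurePosixPath(*parts[: depth + 1])) + "/"


def capability_surface(events: Iterable[dict]) -> Dict[str, FrozenSet[str]]:
    """Group kernel-layer events into unordered filesystem, network, and process sets."""
    surface = {"filesystem": set(), "network": set(), "process": set()}
    for ev in events:
        if ev["kind"] == "open":
            surface["filesystem"].add(fs_prefix(ev["path"]))
        elif ev["kind"] == "connect":
            surface["network"].add(f'{ev["host"]}:{ev["port"]}')
        elif ev["kind"] == "exec":
            surface["process"].add(ev["path"])
    return {kind: frozenset(items) for kind, items in surface.items()}
```

Collapsing file paths to prefixes is what keeps per-run noise (temp file names, cache entries) from changing the surface between otherwise identical runs.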
Acceptance Criteria For openai-agents
These are in addition to the none criteria:
- SDK-layer `tool_call_id` correlates with a policy-layer decision carrying the same `tool_call_id`.
- The SDK-layer tool-call window contains the kernel syscall set for that tool, set-based only.
- Three consecutive runs produce the same tool to policy to syscall binding.
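These three criteria can be checked mechanically. The sketch below assumes simple dict-shaped records and hypothetical function names; it also assumes `tool_call_id` values differ between runs, so the cross-run comparison is keyed on tool name rather than on the id.

```python
from typing import FrozenSet, List, Optional, Tuple

Binding = FrozenSet[Tuple[str, Optional[str], FrozenSet[str]]]


def bind_run(sdk_calls: List[dict], policy_decisions: List[dict], kernel_events: List[dict]) -> Binding:
    """Bind each SDK tool call to its policy decision and in-window kernel capability set."""
    decisions = {d["tool_call_id"]: d["decision"] for d in policy_decisions}
    rows = set()
    for call in sdk_calls:
        # Set-based containment: which capabilities fall inside the call window.
        caps = frozenset(
            ev["capability"]
            for ev in kernel_events
            if call["start_ns"] <= ev["ts_ns"] <= call["end_ns"]
        )
        # decision is None when no policy record carries the same tool_call_id.
        rows.add((call["tool"], decisions.get(call["tool_call_id"]), caps))
    return frozenset(rows)


def stable(bindings: List[Binding]) -> bool:
    """True when every run produced the same tool -> policy -> capability binding."""
    return all(b == bindings[0] for b in bindings[1:])
```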
Kill Criterion
If the spike is too ambiguous or too noisy, the track stops.
Examples:
- tool to syscall binding succeeds in only 60 percent of runs
- PID recycling or cgroup disturbance makes attribution unstable
- ring-buffer loss makes ordinary runs visibly incomplete
In that case, this memo becomes the documented dead branch. No further product build follows.
Phase 1 Result
Phase 1 passed on delegated Linux/eBPF hardware on 2026-05-21. The acceptance record is ASSAY-RUNNER-PHASE1-ACCEPTANCE-2026-05-21.md.
This result moves the internal track from proof spike to Phase 2A consolidation. It still does not imply a public product launch, repository split, hosted service, macOS support, or live LLM support.
Phase 2: Capability Diff
Only start Phase 2 if Phase 1 passes.
The v1 discipline is set-based summaries only. No ordered traces, no sequence semantics.
First output shape:
New kernel-observed capabilities vs baseline:
  + network api.stripe.com:443
  + filesystem /var/cache/agent-state/
  + process /usr/bin/git
  + mcp_tool filesystem.write_file

Observation health:
  kernel_layer complete
  ringbuf_drops 0
  policy_layer present
  sdk_layer self_reported (openai-agents v0.11.1)
  cgroup_correlation clean
Output rules:
- The health block is mandatory. There is no capability diff without observation health.
- A diff from `kernel_layer=partial_ringbuf_drops` must carry a visible warning.
- Summaries are set-based: touched X, Y, and Z. They never claim "first X, then Y."
- Projection text is descriptive, not normative. It says what changed, not whether the change is acceptable.
- Baseline pinning and project-level ignore rules are required before broad use.
This is where Runner and Harness meet: Runner produces the bundle; Harness diffs and projects it.
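A minimal sketch of a projection that enforces the first two output rules: it refuses to render without a full health block and adds a warning line when kernel evidence is incomplete. `render_diff`, the warning wording, and the choice to also warn on `kernel_layer=absent` are assumptions; baseline pinning and ignore rules are not covered here.

```python
from typing import Set, Tuple

HEALTH_KEYS = ("kernel_layer", "ringbuf_drops", "policy_layer", "sdk_layer", "cgroup_correlation")

Capability = Tuple[str, str]  # (kind, value), e.g. ("network", "api.stripe.com:443")


def render_diff(baseline: Set[Capability], current: Set[Capability], health: dict) -> str:
    """Render a set-based capability diff; the health block is mandatory."""
    missing = [key for key in HEALTH_KEYS if key not in health]
    if missing:
        raise ValueError(f"refusing to diff without observation health: missing {missing}")
    lines = ["New kernel-observed capabilities vs baseline:"]
    # Sorting is for stable output only; it carries no sequence meaning.
    lines += [f"+ {kind} {value}" for kind, value in sorted(current - baseline)]
    if health["kernel_layer"] != "complete" or health["ringbuf_drops"] > 0:
        lines.append(
            "WARNING: kernel evidence incomplete "
            f"(kernel_layer={health['kernel_layer']}, ringbuf_drops={health['ringbuf_drops']})"
        )
    lines.append("Observation health:")
    lines += [f"{key} {health[key]}" for key in HEALTH_KEYS]
    return "\n".join(lines)
```

The text is descriptive only: it lists what is new relative to the baseline and never states whether the change is acceptable.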
Phase 3: Extra Shims
Only after Phase 2 should cross-runtime language become externally claimable. At that point, add at least two shims beyond openai-agents.
Candidates:
- Inspect AI
- LangGraph
- Mastra
Each shim maps into one normalized SDK-layer event schema.
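The mapping into one normalized schema might be organized as a small adapter registry. Everything here is hypothetical: `SdkToolCall` is not the actual schema, and the two payload shapes (`example-a`, `example-b`) are invented stand-ins that do not reflect what any real SDK emits.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class SdkToolCall:
    """Normalized SDK-layer tool-call event (illustrative field set)."""
    shim: str
    tool_call_id: str
    tool: str
    start_ns: int
    end_ns: int


ADAPTERS: Dict[str, Callable[[dict], SdkToolCall]] = {}


def adapter(shim: str):
    """Register a function that converts one shim's raw payload to SdkToolCall."""
    def register(fn):
        ADAPTERS[shim] = fn
        return fn
    return register


@adapter("example-a")
def from_example_a(raw: dict) -> SdkToolCall:
    # Invented payload shape: flat keys, nanosecond timestamps.
    return SdkToolCall("example-a", raw["id"], raw["name"], raw["started_ns"], raw["ended_ns"])


@adapter("example-b")
def from_example_b(raw: dict) -> SdkToolCall:
    # Invented payload shape: nested call object, millisecond timestamps.
    call = raw["call"]
    return SdkToolCall(
        "example-b", call["call_id"], call["tool"],
        raw["t0_ms"] * 1_000_000, raw["t1_ms"] * 1_000_000,
    )


def normalize(shim: str, raw: dict) -> SdkToolCall:
    if shim not in ADAPTERS:
        raise KeyError(f"no adapter registered for shim {shim!r}")
    return ADAPTERS[shim](raw)
```

With this shape, downstream correlation code only ever sees `SdkToolCall`, so adding a shim means adding one adapter rather than touching the bundle or diff logic.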
Phase 4: Continuous Run Measurement
Do not put this in the public story until Phase 1 and Phase 2 are credible.
The long-term shape is the same bundle schema in production, with the same diff mechanism for "today versus yesterday". That is a future bridge, not v0.1 sales material.
Decisions Deferred
- External name. No naming work before the spike produces one demo.
- Repo host. A spike can live inside `Rul1an/assay`; do not create `Rul1an/Assay-Runner` before Phase 1 evidence exists.
- Whether `none` eventually remains pure kernel-only by default or defaults to kernel plus policy when Assay MCP wrapping is available.
Verdict
Ship this internally as a candidate memo, not a roadmap. The product track is interesting only if the Phase 1 attribution spike passes. If attribution is ambiguous, the track should stop cleanly.