ADR-025: Evidence-as-a-Product — Reliability Surfaces, Completeness/Closure, and Portable Verifiability
Status
Accepted (March 2026; I1/I2/I3 rollout slices implemented and closed-loop on main)
Status Sync (2026-02-26)
- I1 closed-loop on `main`: soak/reliability surface + readiness + release-lane enforcement.
- I2 closed-loop on `main`: closure/completeness contracts + release-lane attach/enforce modes + stabilization.
- I3 closed-loop on `main`: OTel bridge contracts + generator/tests + informational nightly lane + release integration.
- Remaining work for ADR-025 is status/governance convergence and future roadmap increments, not missing I1/I2/I3 implementation.
Context
The agent engineering market (2026) has commoditized "eval CI" and "observability". Differentiation now comes from:
- Multi-run reliability as a first-order property (pass^k, stress/fault surfaces).
- Auditability as a sociotechnical system (tamper-evident + governance context, not just logs).
- Security via enforcement points (policy enforcement + evidence emission).
- Standard-first interoperability (OTEL GenAI + MCP semconv).
- Attestation/transparency stacks as evidence substrate (in-toto/DSSE/SCITT/Sigstore/SLSA).
- Compliance hooks (EU AI Act Art 12/19, OWASP Agentic Top 10).
- CI gates as a commodity integration surface (Actions/Evals), not a differentiator.
Assay’s wedge is portable, verifiable “evidence primitives” + policy packs + stability assurance (pass^k) + closure/confidence.
Decision
Track 1 — Reliability Surface (pass^k + faults) as Evidence
We introduce Soak/Surface as the primary simulation product:
- `assay sim soak` executes N runs (seeded), collecting policy outcomes, infra errors, and summary metrics.
- Pass^k Semantics: `pass_all` (AND over k runs) is the strict assurance bar. We also report `pass_rate` and `pass_probability_estimate` (beta posterior or CI95) for statistical confidence.
- Decision Policy: The user defines strictness via `decision_policy: { stop_on_violation: bool, max_failures: u32, min_runs: u32 }`.
Normative: pass^k and drift are first-class outputs. Soak reports must include a decision_policy used to reach the verdict.
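The pass^k outputs described above can be sketched in a few lines of Python. This is a minimal illustration, not Assay's implementation. It assumes that `pass_rate` is `passes / runs` with infra errors counted in the denominator, that any non-pass run prevents `pass_all`, and that the 95% interval uses the Wilson score method. The function names `wilson_ci95` and `summarize_soak` are hypothetical.

```python
import math


def wilson_ci95(passes: int, runs: int) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96, about 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    z = 1.96
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return (max(0.0, center - half), min(1.0, center + half))


def summarize_soak(statuses: list[str]) -> dict:
    """Aggregate per-run statuses ("pass", "fail", "infra_error") into summary metrics."""
    runs = len(statuses)
    if runs == 0:
        raise ValueError("at least one run is required")
    passes = statuses.count("pass")
    # 1-based index of the first failing run, or None if no run failed.
    first_failure_at = next(
        (i for i, s in enumerate(statuses, start=1) if s == "fail"), None
    )
    return {
        "runs": runs,
        "passes": passes,
        "failures": statuses.count("fail"),
        "infra_errors": statuses.count("infra_error"),
        "pass_rate": passes / runs,  # assumption: infra errors stay in the denominator
        "pass_all": passes == runs,  # strict AND over all k runs
        "first_failure_at": first_failure_at,
        "pass_rate_ci95": list(wilson_ci95(passes, runs)),
    }
```

With ten passing runs out of ten, `pass_all` is true, yet the lower bound of the interval is still only about 0.72, which is why the report carries both the strict verdict and the statistical estimate.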
Track 2/3 — Evidence Completeness + Closure Score (audit + replay readiness)
We distinguish between Completeness (Pack-Relative) and Closure (Replay-Relative):
- Completeness ("Did we capture what the Pack needs?"):
  - Defined relative to a specific pack.
  - Signals are defined in a canonical registry (e.g., `policy_decisions`, `tool_calls`).
  - State: `captured` (present), `redacted` (removed but committed with hash/metadata), `unknown` (missing/undetectable).
- Closure ("Can we reconstruct the run?"):
  - Defined relative to replayability.
  - Score is deterministic (0.0-1.0) based on presence of replay-critical signals (inputs, model ID, tool outputs, RNG seeds).
  - Normative: Score must be calculated from the bundle contents alone (no heuristics).
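A deterministic closure score of this kind could look like the sketch below. The ADR names the scoring method `weighted_ratio_v1` in the closure schema but does not spell out the formula, so the ratio used here (captured weight divided by total weight) is an assumption, as are the function name `closure_score`, the signal keys other than `model_identity`, and the example weights. The `confidence` field required by the schema is left out because no rule for it is given.

```python
def closure_score(weights: dict[str, float], captured: set[str]) -> dict:
    """Score = (sum of weights of captured signals) / (sum of all weights).

    `weights` maps each replay-critical signal to a non-negative weight;
    `captured` is the set of signals actually present in the bundle.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Iterate in sorted order so the output is stable across runs.
    signals = sorted(weights)
    got = sum(weights[s] for s in signals if s in captured)
    return {
        "score": round(got / total, 4),
        "captured": [s for s in signals if s in captured],
        "missing": [s for s in signals if s not in captured],
        "scoring": {"method": "weighted_ratio_v1", "weights": dict(weights)},
    }


# Hypothetical weights: the bundle lacks RNG seeds, so replay is not fully closed.
example = closure_score(
    {"inputs": 4, "model_identity": 2, "tool_outputs": 3, "rng_seeds": 1},
    captured={"inputs", "model_identity", "tool_outputs"},
)
```

Because the function reads only the weights and the set of signals found in the bundle, the same bundle always yields the same score, which is the property the normative rule asks for.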
Track 4 — Pack "Required Signals" Registry
To prevent schema drift, we introduce a namespaced, additive field for packs:
- Field: `x-assay.requires_signals` (v0).
- Registry: Packs must select from a canonical list of signal types (e.g., `policy_decisions`, `tool_io_bodies`, `model_identity`, `prompt_lineage`, `human_approvals`).
- This avoids free-text requirements and enables automated completeness checks.
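An automated completeness check against such a registry might be sketched as follows. The registry contents below are only the example signal names mentioned in this ADR, not the actual v0 list, and the function name `completeness_matrix` is hypothetical. The output keys follow the `completeness` object of the closure-v1 schema.

```python
# Assumption: stand-in for the canonical v0 registry, built from the ADR's examples.
REGISTRY_V0 = frozenset({
    "policy_decisions",
    "tool_calls",
    "tool_io_bodies",
    "model_identity",
    "prompt_lineage",
    "human_approvals",
})


def completeness_matrix(
    requires_signals: list[str], captured: set[str], redacted: set[str]
) -> dict:
    """Classify each required signal as captured, redacted, or unknown."""
    unregistered = sorted(set(requires_signals) - REGISTRY_V0)
    if unregistered:
        # Rejecting free-text requirements is what keeps pack schemas from drifting.
        raise ValueError(f"signals not in registry: {unregistered}")
    required = sorted(set(requires_signals))
    return {
        "required": required,
        "captured": [s for s in required if s in captured],
        # A signal present in both sets is reported as captured.
        "redacted": [s for s in required if s in redacted and s not in captured],
        "unknown": [s for s in required if s not in captured and s not in redacted],
    }
```

A pack that requires `policy_decisions`, `tool_io_bodies`, and `model_identity`, evaluated against a bundle that captured the first and redacted the second, would report `model_identity` as `unknown`.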
Track 5 — Standard-first export (OTEL GenAI + MCP)
We support a dual-format approach with a strict versioning policy:
1. Target: GenAI semconv (stable opt-in) + MCP semconv.
2. Policy: Best-effort translation.
3. Transparency: Reports must include a `mapping_loss` section detailing dropped attributes and unknown events.
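The best-effort translation with loss reporting can be illustrated with a small sketch. Everything on the Assay side here is hypothetical: the event shape, the source attribute names, the `model_call` event type, and the `dropped_attributes` / `unknown_events` keys inside `mapping_loss`. The target attribute names are taken from the OpenTelemetry GenAI semantic conventions, and the exact set to emit would depend on the semconv version being targeted.

```python
# Hypothetical mapping from internal attribute names to GenAI semconv attributes.
ATTRIBUTE_MAP = {
    "model_id": "gen_ai.request.model",
    "input_tokens": "gen_ai.usage.input_tokens",
    "output_tokens": "gen_ai.usage.output_tokens",
}
KNOWN_EVENT_TYPES = {"model_call"}  # assumption: event types the bridge understands


def export_events(events: list[dict]) -> dict:
    """Translate events best-effort and record everything that could not be mapped."""
    spans = []
    loss = {"dropped_attributes": [], "unknown_events": []}
    for event in events:
        if event["type"] not in KNOWN_EVENT_TYPES:
            loss["unknown_events"].append(event["type"])
            continue
        attributes = {}
        for key, value in event.get("attributes", {}).items():
            target = ATTRIBUTE_MAP.get(key)
            if target is None:
                # Keep a trace of the loss instead of dropping silently.
                loss["dropped_attributes"].append(f"{event['type']}.{key}")
            else:
                attributes[target] = value
        spans.append({"name": event["type"], "attributes": attributes})
    return {"spans": spans, "mapping_loss": loss}
```

The point of the design is that a consumer can tell a complete export from a lossy one by inspecting `mapping_loss` rather than by diffing against the original bundle.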
Track 6 — Attestation Envelope (Portable Verifiability)
We evolve the Evidence Bundle towards an "attestation bundle" using DSSE envelopes.
- Payload v1:
  - Digests of bundle artifacts (manifest, events).
  - Pack versions + mapping references.
  - Closure report digest.
- Verification: OSS capability for offline verification (providing public key).
- Threat Model: Protects against tampering (integrity) and repudiation (provenance). Does not guarantee confidentiality (payload content).
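As a rough sketch of what the payload and envelope could look like: DSSE signs the pre-authentication encoding (PAE) of the payload type and payload rather than the raw payload, and the envelope carries `payload`, `payloadType`, and `signatures`. The payload type string, the payload field names, and the function names below are hypothetical, and signing is delegated to a caller-supplied function because the ADR does not fix a signature scheme.

```python
import base64
import hashlib
import json
from typing import Callable

# Hypothetical media type for the attestation payload.
PAYLOAD_TYPE = "application/vnd.assay.attestation.v1+json"


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: 'DSSEv1 <len> <type> <len> <payload>'."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)


def build_payload(
    artifacts: dict[str, bytes], packs: dict[str, str], closure_report: bytes
) -> bytes:
    """Serialize digests and pack versions; the field names are illustrative only."""
    body = {
        "artifact_digests": {
            name: "sha256:" + hashlib.sha256(data).hexdigest()
            for name, data in artifacts.items()
        },
        "packs": packs,
        "closure_report_digest": "sha256:" + hashlib.sha256(closure_report).hexdigest(),
    }
    # Sorted keys and fixed separators keep the serialized bytes reproducible.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_envelope(payload: bytes, sign: Callable[[bytes], bytes], keyid: str) -> dict:
    """Wrap the payload in a DSSE envelope; `sign` receives the PAE bytes."""
    signature = sign(pae(PAYLOAD_TYPE, payload))
    return {
        "payload": base64.b64encode(payload).decode("ascii"),
        "payloadType": PAYLOAD_TYPE,
        "signatures": [
            {"keyid": keyid, "sig": base64.b64encode(signature).decode("ascii")}
        ],
    }
```

An offline verifier would decode `payload`, recompute the PAE bytes, and check each signature against the provided public key. Since the payload is only base64-encoded, anyone holding the envelope can read it, which matches the stated threat model.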
Data Contracts (Normative)
1) Soak Report v1 (Normative)
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://assay.dev/schemas/soak-report-v1.schema.json",
  "title": "Assay Soak Report v1",
  "type": "object",
  "additionalProperties": false,
  "required": [
    "schema_version",
    "mode",
    "iterations",
    "seed",
    "time_budget_secs",
    "limits",
    "packs",
    "results"
  ],
  "properties": {
    "schema_version": {
      "type": "string",
      "const": "soak-report-v1"
    },
    "mode": {
      "type": "string",
      "const": "soak"
    },
    "generated_at": {
      "type": "string",
      "format": "date-time"
    },
    "assay_version": {
      "type": "string",
      "minLength": 1
    },
    "suite": {
      "type": "string",
      "minLength": 1,
      "description": "Optional tier/name, e.g. quick/nightly or a named soak profile."
    },
    "iterations": {
      "type": "integer",
      "minimum": 1
    },
    "seed": {
      "type": "integer",
      "minimum": 0
    },
    "time_budget_secs": {
      "type": "integer",
      "minimum": 1
    },
    "limits": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "max_bundle_bytes",
        "max_decode_bytes",
        "max_manifest_bytes",
        "max_events_bytes",
        "max_events",
        "max_line_bytes",
        "max_path_len",
        "max_json_depth"
      ],
      "properties": {
        "max_bundle_bytes": { "type": "integer", "minimum": 1 },
        "max_decode_bytes": { "type": "integer", "minimum": 1 },
        "max_manifest_bytes": { "type": "integer", "minimum": 1 },
        "max_events_bytes": { "type": "integer", "minimum": 1 },
        "max_events": { "type": "integer", "minimum": 1 },
        "max_line_bytes": { "type": "integer", "minimum": 1 },
        "max_path_len": { "type": "integer", "minimum": 1 },
        "max_json_depth": { "type": "integer", "minimum": 1 }
      }
    },
    "packs": {
      "type": "array",
      "minItems": 1,
      "items": { "$ref": "#/$defs/pack_ref" }
    },
    "decision_policy": {
      "type": "object",
      "additionalProperties": false,
      "required": ["pass_on_severity_at_or_above"],
      "properties": {
        "pass_on_severity_at_or_above": {
          "type": "string",
          "enum": ["info", "warning", "error"],
          "description": "Defines what counts as a failing rule severity threshold."
        },
        "stop_on_first_failure": {
          "type": "boolean",
          "default": false
        },
        "max_failures": {
          "type": "integer",
          "minimum": 1,
          "description": "Optional early-stop threshold."
        }
      }
    },
    "results": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "runs",
        "passes",
        "failures",
        "infra_errors",
        "pass_rate",
        "pass_all"
      ],
      "properties": {
        "runs": { "type": "integer", "minimum": 1 },
        "passes": { "type": "integer", "minimum": 0 },
        "failures": { "type": "integer", "minimum": 0 },
        "infra_errors": { "type": "integer", "minimum": 0 },
        "pass_rate": {
          "type": "number",
          "minimum": 0,
          "maximum": 1
        },
        "pass_all": {
          "type": "boolean",
          "description": "True iff all runs passed under the decision policy."
        },
        "first_failure_at": {
          "type": ["integer", "null"],
          "minimum": 1,
          "description": "1-based index of first failing run, or null if none."
        },
        "violations_by_rule": {
          "type": "object",
          "additionalProperties": {
            "type": "integer",
            "minimum": 1
          },
          "description": "Map from canonical rule id (pack@ver:rule) to count of runs where it violated."
        },
        "infra_errors_by_kind": {
          "type": "object",
          "additionalProperties": {
            "type": "integer",
            "minimum": 1
          },
          "description": "Optional breakdown, e.g. time_budget_exceeded, subprocess_failed, io_error."
        },
        "pass_rate_ci95": {
          "type": "array",
          "minItems": 2,
          "maxItems": 2,
          "items": { "type": "number", "minimum": 0, "maximum": 1 },
          "description": "Optional 95% CI for pass_rate; implement as Wilson or Beta posterior interval."
        }
      }
    },
    "runs": {
      "type": "array",
      "items": { "$ref": "#/$defs/run_result" },
      "description": "Optional per-run detail; can be omitted for compact reports."
    }
  },
  "$defs": {
    "pack_ref": {
      "type": "object",
      "additionalProperties": false,
      "required": ["name", "version"],
      "properties": {
        "name": { "type": "string", "minLength": 1 },
        "version": { "type": "string", "minLength": 1 },
        "kind": { "type": "string", "minLength": 1 },
        "digest": { "type": "string", "minLength": 1 },
        "source": {
          "type": "string",
          "description": "Optional URI/path for provenance (built-in, local, url)."
        }
      }
    },
    "run_result": {
      "type": "object",
      "additionalProperties": false,
      "required": ["index", "status", "duration_ms"],
      "properties": {
        "index": { "type": "integer", "minimum": 1 },
        "status": {
          "type": "string",
          "enum": ["pass", "fail", "infra_error"]
        },
        "duration_ms": { "type": "integer", "minimum": 0 },
        "violated_rules": {
          "type": "array",
          "items": { "type": "string", "minLength": 1 },
          "description": "Canonical rule ids (pack@ver:rule) that violated in this run."
        },
        "infra_error_kind": {
          "type": "string",
          "minLength": 1
        },
        "infra_error_message": {
          "type": "string"
        }
      }
    }
  }
}
```
2) Completeness + Closure v1 (Normative)
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://assay.dev/schemas/closure-v1.schema.json",
  "title": "Assay Completeness + Closure v1",
  "type": "object",
  "additionalProperties": false,
  "required": ["schema_version", "completeness", "closure"],
  "properties": {
    "schema_version": { "type": "string", "const": "closure-v1" },
    "generated_at": { "type": "string", "format": "date-time" },
    "bundle_digest": {
      "type": "string",
      "minLength": 1,
      "description": "Optional sha256 (or similar) digest of the evidence bundle for linking."
    },
    "pack_context": {
      "type": "array",
      "items": { "$ref": "#/$defs/pack_ref" },
      "description": "Optional: the packs used to compute required signals."
    },
    "completeness": { "$ref": "#/$defs/completeness" },
    "closure": { "$ref": "#/$defs/closure" }
  },
  "$defs": {
    "pack_ref": {
      "type": "object",
      "additionalProperties": false,
      "required": ["name", "version"],
      "properties": {
        "name": { "type": "string", "minLength": 1 },
        "version": { "type": "string", "minLength": 1 },
        "kind": { "type": "string", "minLength": 1 },
        "digest": { "type": "string", "minLength": 1 }
      }
    },
    "signal": {
      "type": "string",
      "pattern": "^[a-z0-9][a-z0-9_\\.-]*[a-z0-9]$",
      "description": "Canonical signal key. Prefer a registry to avoid drift."
    },
    "completeness": {
      "type": "object",
      "additionalProperties": false,
      "required": ["required", "captured", "redacted", "unknown"],
      "properties": {
        "required": { "type": "array", "items": { "$ref": "#/$defs/signal" } },
        "captured": { "type": "array", "items": { "$ref": "#/$defs/signal" } },
        "redacted": { "type": "array", "items": { "$ref": "#/$defs/signal" } },
        "unknown": { "type": "array", "items": { "$ref": "#/$defs/signal" } },
        "by_signal": {
          "type": "object",
          "additionalProperties": { "$ref": "#/$defs/signal_detail" },
          "description": "Optional per-signal detail (why missing, where expected)."
        }
      }
    },
    "signal_detail": {
      "type": "object",
      "additionalProperties": false,
      "required": ["status"],
      "properties": {
        "status": {
          "type": "string",
          "enum": ["captured", "redacted", "missing", "unknown"]
        },
        "reason": { "type": "string" },
        "evidence_paths": {
          "type": "array",
          "items": { "type": "string", "minLength": 1 },
          "description": "JSON pointer(s) or path hints where the signal should be found."
        },
        "commitment": {
          "type": "object",
          "additionalProperties": false,
          "required": ["alg", "digest"],
          "properties": {
            "alg": { "type": "string", "minLength": 1 },
            "digest": { "type": "string", "minLength": 1 },
            "size_bytes": { "type": "integer", "minimum": 0 }
          },
          "description": "For redacted signals: a verifiable commitment (hash/size) without revealing content."
        }
      }
    },
    "closure": {
      "type": "object",
      "additionalProperties": false,
      "required": ["score", "confidence", "captured", "missing"],
      "properties": {
        "score": { "type": "number", "minimum": 0, "maximum": 1 },
        "confidence": { "type": "string", "enum": ["low", "medium", "high"] },
        "captured": { "type": "array", "items": { "$ref": "#/$defs/signal" } },
        "missing": { "type": "array", "items": { "$ref": "#/$defs/signal" } },
        "uncontrolled_dependencies": {
          "type": "array",
          "items": { "type": "string", "minLength": 1 },
          "description": "Optional: known nondeterministic inputs (network, live tools) that prevent hermetic replay."
        },
        "scoring": {
          "type": "object",
          "additionalProperties": false,
          "required": ["method", "weights"],
          "properties": {
            "method": { "type": "string", "enum": ["weighted_ratio_v1"] },
            "weights": {
              "type": "object",
              "additionalProperties": { "type": "number", "minimum": 0 },
              "description": "Optional: per-signal weights used to compute score."
            }
          },
          "description": "Optional scoring transparency for audits/CI."
        }
      }
    }
  }
}
```
3) Manifest additions
`manifest.json` attributes:
- `x-assay.packs_applied[]`: `{name, version, digest, kind, source_url?}`
- `x-assay.mappings[]`: `{rule, framework, ref}`
UX/DX Requirements (Feb 2026)
- Unified Happy Path:
  - `assay evidence lint <bundle>` (default: lint with `cicd-starter`).
  - `assay sim soak --iterations N --pack <pack> --target <bundle> --report out.json`.
  - Normative: Soak must use the same pack loader/resolution as Lint.
- Explainability:
  - `assay evidence lint --explain closure.score`.
  - `assay evidence lint --explain <pack>:<rule>` (shows missing signals if applicable).
- Machine-Readable Reports: All commands supporting `--report` must output JSON with `schema_version`. Stdout remains a human-readable summary.
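Because every report carries `schema_version`, a downstream CI step can refuse reports it does not understand before reading any results. The sketch below shows one way a consumer might gate on a soak report written with `--report out.json`. The function name and the message wording are hypothetical, and the field names are those of the Soak Report v1 schema.

```python
import json


def gate_on_soak_report(report_text: str) -> tuple[bool, str]:
    """Return (ok, message) for a soak report serialized as JSON text."""
    report = json.loads(report_text)
    version = report.get("schema_version")
    if version != "soak-report-v1":
        # Fail closed on unknown versions instead of guessing at field meanings.
        return False, f"unsupported schema_version: {version!r}"
    results = report["results"]
    if results["pass_all"]:
        return True, f"all {results['runs']} runs passed"
    return False, (
        f"{results['failures']} failures and {results['infra_errors']} infra errors "
        f"in {results['runs']} runs"
    )


# Usage sketch: read the file produced by `--report out.json` and exit non-zero on failure.
# with open("out.json", encoding="utf-8") as handle:
#     ok, message = gate_on_soak_report(handle.read())
# print(message)
# raise SystemExit(0 if ok else 1)
```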
Rollout Plan
Iteration 1 (MVP): Audit Kit Baseline & Soak MVP
- `assay sim soak` + report v1 (with `decision_policy`, `pass_rate`).
- `manifest.json` `x-assay.*` metadata.
- Pack-provided `x-assay.requires_signals` (minimal registry).
Iteration 2: Closure Score & Explainability
- Completeness Matrix (Pack-relative) + Closure Score (Replay-relative).
- `redacted` vs `unknown` definitions in lint reporting.
- Advanced `--explain` (closure gaps).
Iteration 3: Attestation & OTEL
- DSSE envelope generation (opt-in) + offline verify command.
- OTEL export + `mapping_loss` report.
Open-core Boundary
- OSS: Soak MVP, Closure/Completeness v1, Open Packs, OTEL bridge (opt-in), Offline Verification.
- Pro: Signing/Attestation Key Mgmt, Enforcement Gateway, Private/Advanced Packs.
Acceptance Criteria (Gates)
I1:
- `assay sim soak` with `report.json` containing `schema_version` and valid `pass_rate`/`pass_all`.
- Manifest `x-assay` fields populated correctly.
- Packs in OSS repo define `requires_signals` from v0 registry; parser validates this.
I2:
- Completeness matrix calculation is deterministic.
- Closure score logic documented and tested with fixed fixtures.
- `--explain` renders closure gaps.
I3:
- DSSE envelope generation (feature-flagged).
- OTEL export produces valid SemConv + `mapping_loss` section in report.