PR Gate Output Contracts Specification v1¶

Status: Draft Version: 1.0.1-draft Date: 2026-02 ADR: ADR-019: PR Gate 2026 SOTA Related: DX-IMPLEMENTATION-PLAN, SPEC-GitHub-Action-v2.1

1. Overview¶

This specification defines the output contracts for the Assay PR gate: the blessed flow outputs (junit.xml, sarif.json, summary.json), exit and reason code semantics, SARIF constraints for GitHub compatibility, and the requirement that every non-zero exit provides a suggested next step. Implementations of assay ci and assay run (when used as the CI entrypoint) MUST conform to this spec so that CI consumers and the GitHub Action get predictable, machine-readable results.

Design Principles¶

PR-native — Outputs integrate with GitHub (JUnit → test annotations, SARIF → Security tab, Check Run Summary) without custom glue.
Stable and versioned — summary.json carries a schema_version so consumers can detect and adapt to changes.
Machine-readable nuance — Exit codes stay coarse (0/½/3); reason codes in summary.json and console provide stable, fine-grained semantics without breaking exit-code scripts.
Upload-safe — SARIF stays within GitHub limits (size, result count) so upload never fails randomly; every result has at least one location.

2. Blessed Flow Outputs¶

When assay ci (or the equivalent run invoked by the blessed workflow) completes, it MUST produce the following artifacts in the configured output directory (default: .assay/reports or equivalent).

Artifact	Required	Description
`junit.xml`	Yes	JUnit XML format; test cases with `<failure>` for Fail/Error; compatible with GitHub test reporting and JUnit reporter actions.
`sarif.json`	Yes	SARIF 2.1.0; see §5 for location and truncation rules.
`summary.json`	Yes	Machine-readable run summary; see §3 for schema.

Normative: The blessed entrypoint is assay ci. The same three outputs MUST be produced so that one local command reproduces exact CI behaviour.

3. summary.json Schema¶

3.1 Required Top-Level Fields¶

Field	Type	Required	Description
`schema_version`	integer	Yes	Version of this summary schema. MUST be `1` for this spec. Increment when adding or changing fields in a backward-incompatible way.
`reason_code_version`	integer	Yes	Version of the reason code registry. MUST be present. MUST equal `1` in Outputs-v1. Future changes to the reason code set use this version. Consumers MUST branch on `(reason_code_version, reason_code)` for semantics; exit code is coarse transport only. Consumers MUST treat unknown versions as "compat required" (fail closed or fallback parsing).
`exit_code`	integer	Yes	Process exit code: 0 = pass, 1 = test failure, 2 = config/user error, 3 = infra/judge unavailable. See §4.
`reason_code`	string	Yes	Stable machine-readable code when exit_code ≠ 0; e.g. `E_TRACE_NOT_FOUND`, `E_JUDGE_UNAVAILABLE`. See §5. When exit_code is 0, MAY be empty string or a designated success code (e.g. OK); empty is allowed and common.
`message`	string	No	Human-readable one-line description of outcome.
`next_step`	string	No	Single suggested command or hint when exit_code ≠ 0; e.g. "Run: assay doctor --config ...", "See: assay explain ...". See §7.

3.2 Provenance (Artifact Auditability)¶

Every summary.json MUST include a top-level provenance object with the following fields so that gates remain auditable (ADR-019 P0.4).

Field	Type	Required	Description
`assay_version`	string	Yes	Assay CLI version that produced this run (e.g. `"2.12.0"`).
`verify_mode`	string	Yes	`"enabled"` or `"disabled"`. When `"disabled"`, indicates signature verification was turned off (UNSAFE).
`policy_pack_digest`	string	No	Digest of policy/pack used (e.g. `sha256:...`).
`baseline_digest`	string	No	Digest of baseline used for comparison, if applicable.
`trace_digest`	string	No	Digest of trace input, if applicable (optional for privacy/size).
`replay`	boolean	No	`true` when this output was produced by replay from a bundle.
`bundle_digest`	string	No	SHA256 digest of the replay bundle archive used for this run.
`replay_mode`	string	No	Replay mode when `replay=true`: `"offline"` or `"live"`.
`source_run_id`	string	No	Optional original run id carried into replay provenance.

Normative: If the run was executed with --no-verify, verify_mode MUST be "disabled". When replay is used, producers SHOULD set replay=true and include bundle_digest and replay_mode.

3.3 Results Summary (Optional but Recommended)¶

A top-level results object MAY contain:

Field	Type	Required	Description
`passed`	integer	No	Count of tests passed.
`failed`	integer	No	Count of tests failed.
`warned`	integer	No	Count of tests with Warn/Flaky (depends on strict mode).
`skipped`	integer	No	Count of tests skipped (e.g. cache hit).
`total`	integer	No	Total test count.

A top-level performance object MAY contain total_duration_ms (integer, milliseconds). Future versions MAY add slowest_tests, cache_hit_rate, phase_timings (see ADR-019 / DX-IMPLEMENTATION-PLAN). Consumers MUST ignore unknown top-level keys.

3.3.1 Seeds (E7.2 – Replay Determinism)¶

A top-level seeds object (summary.json) and top-level seed_version, order_seed, judge_seed (run.json) SHALL be present for schema stability. On early-exit (e.g. trace not found, config fail), seeds may be null when unknown; seed_version SHALL still be present.

Field	Type	Required	Description
`seed_version`	integer	Yes	Version of the seed schema. MUST be `1` for Outputs-v1. Consumers MUST branch on `seed_version` when interpreting seeds.
`order_seed`	string or null	Yes	Decimal u64 encoded as string to avoid JSON number precision loss; null on early-exit when unknown.
`judge_seed`	string or null	Yes	Decimal u64 encoded as string; MAY be null until judge-level seeding is implemented (E9); consumers MUST handle null.
`sampling_seed`	integer	No	Optional: determinism for telemetry sampling (reserved for future use).

Normative: run.json (extended and minimal) and summary.json SHALL include seed_version; order_seed and judge_seed SHALL be present (string or null). Seeds MUST be encoded as decimal strings (or null) to avoid precision loss in JSON consumers (e.g. JS/TS safe for u64 > 2^53). CLI console SHALL print one line: Seeds: seed_version=1 order_seed=… judge_seed=… so CI job summaries can show them for replay.

3.3.2 Judge Metrics (E7.3)¶

When the run had judge evaluations, a top-level judge_metrics object MAY be present with low-cardinality reliability metrics:

Field	Type	Required	Description
`abstain_rate`	number	No	Fraction of judge evaluations that returned Abstain (uncertain).
`flip_rate`	number	No	Fraction of evaluations where order was swapped and outcome differed. (Implementation may use a proxy: swapped and non-unanimous agreement, when the judge does not record whether the pass/fail verdict would have differed under the other ordering.)
`consensus_rate`	number	No	Fraction of evaluations where all samples agreed.
`unavailable_count`	integer	No	Count of runs where judge was unavailable (infra/transport); not counted toward abstain_rate.

Normative: Judge unavailable (transport/infra) MUST NOT be counted as Abstain; use unavailable_count for that.

Implementation note (unavailable_count): Implementations may use message heuristics (e.g. timeout, 5xx, rate limit, network) on Error-status rows to classify infra failures. Abstain (uncertain verdict) is never counted as unavailable. Prefer standardised reason codes or an explicit infra_class field when available.

Implementation note (flip_rate): The spec defines flip_rate as “order was swapped and outcome differed”. When the judge does not record whether the pass/fail verdict would have differed under the other ordering, implementations may use a heuristic proxy (e.g. swapped and non-unanimous agreement). This proxy does not guarantee that the verdict actually flipped; it indicates order may have affected the outcome. When present, run.json and the CLI console SHALL expose judge metrics so CI can display them.

3.4 Example (Minimal)¶

{
  "schema_version": 1,
  "reason_code_version": 1,
  "exit_code": 0,
  "reason_code": "",
  "provenance": {
    "assay_version": "2.12.0",
    "verify_mode": "enabled"
  },
  "results": {
    "passed": 10,
    "failed": 0,
    "total": 10
  },
  "performance": {
    "total_duration_ms": 1234
  }
}

3.5 Example (Non-Zero with Next Step)¶

{
  "schema_version": 1,
  "reason_code_version": 1,
  "exit_code": 2,
  "reason_code": "E_TRACE_NOT_FOUND",
  "message": "Trace file not found: traces/ci.jsonl",
  "next_step": "Run: assay doctor --config ci-eval.yaml --trace-file traces/ci.jsonl",
  "provenance": {
    "assay_version": "2.12.0",
    "verify_mode": "enabled"
  }
}

4. Exit Code Registry¶

Exit codes are coarse and MUST NOT be redefined in a breaking way. Reason codes (§5) carry the nuance.

Exit Code	Meaning	Typical reason_codes
0	All tests passed	(none)
1	One or more tests failed	(test-level codes)
2	Configuration / user error	E_CFG_PARSE, E_TRACE_NOT_FOUND, E_MISSING_CONFIG, etc.
3	Infra / judge unavailable	E_JUDGE_UNAVAILABLE, E_RATE_LIMIT, E_PROVIDER_5XX, E_TIMEOUT

Normative: Judge failures (rate limit, provider 5xx, timeout) MUST map to exit code 3. Behaviour for security vs quality suites is policy-driven (fail-closed vs degrade/skip) per ADR-003/ADR-004; the exit code alone does not change.

Compatibility: Historically, some documentation used exit 3 for "trace file not found". Under this spec, trace-not-found is exit 2 with reason_code E_TRACE_NOT_FOUND. Implementations MAY support a compatibility mode (e.g. --exit-codes=v1) that preserves the old mapping for a documented deprecation period.

5. Reason Code Registry¶

Reason codes are stable, machine-readable strings. CI and scripts MAY branch on reason_code in summary.json. New codes MUST be added in a backward-compatible way (new string values); existing codes MUST NOT be removed or repurposed without a schema_version bump and migration notes.

5.1 Config / User Error (exit_code 2)¶

Code	Description
E_CFG_PARSE	Config file parse error (YAML/JSON).
E_TRACE_NOT_FOUND	Trace file or path not found.
E_MISSING_CONFIG	Required config file missing.
E_BASELINE_INVALID	Baseline file invalid or missing.
E_POLICY_PARSE	Policy file parse error.
E_REPLAY_MISSING_DEPENDENCY	Replay missing required offline dependency (e.g. uncached judge/cassette input).

5.2 Infra / Judge Unavailable (exit_code 3)¶

Code	Description
E_JUDGE_UNAVAILABLE	Judge service unavailable or returned error.
E_RATE_LIMIT	Judge/provider rate limit hit.
E_PROVIDER_5XX	Judge/provider returned 5xx.
E_TIMEOUT	Judge or dependency timed out.

5.3 Test Failure (exit_code 1)¶

Test-level failures MAY use existing policy/metric codes (e.g. E_ARG_SCHEMA, E_SEQUENCE_VIOLATION) or a generic E_TEST_FAILED. The summary.json reason_code for the run MAY be E_TEST_FAILED when at least one test failed and no single dominant reason is reported.

Normative: When exit_code ≠ 0, summary.json MUST set reason_code to one of the registered values (or a documented extension). Implementations MUST NOT leave reason_code empty when exit_code ≠ 0.

6. SARIF Contract (GitHub Compatibility)¶

SARIF produced for GitHub Code Scanning MUST satisfy the following so that upload-sarif does not reject the file.

6.1 Schema and Version¶

SARIF version MUST be "2.1.0".
Schema URI MUST be the official SARIF 2.1.0 JSON schema.

6.2 Location Requirement¶

Every result MUST have at least one location. If no file/line is available, the producer MUST emit a synthetic location (e.g. URI assay.yaml, policy.yaml, or the config path). GitHub's upload can fail with "expected at least one location" when a result has an empty locations array.

Normative: Contract tests MUST validate that every result in the generated SARIF has locations length ≥ 1.

6.3 Truncation (Size and Result Limits) (E2.3)¶

GitHub enforces limits on SARIF upload (e.g. max size gzipped, max number of results). Producers MUST truncate results when limits would be exceeded, and MUST add a clear indication that results were omitted (e.g. in the run description or a dedicated message: "N results omitted due to GitHub upload limits").
Truncation strategy: keep top N results by severity (e.g. error first, then warning). N and the exact message are implementation-defined but MUST be documented. Truncation MUST be deterministic (same run → same selection); selection order: blocking (Fail/Error) first, then warning-level, then stable sort (e.g. by test_id). Eligibility: only SARIF-eligible results (e.g. Fail, Error, Warn, Flaky, Unstable) count toward the limit. Define eligible_total as the count of SARIF-eligible results before truncation; included as the number of results actually written in the SARIF run; then omitted_count = eligible_total − included.
SARIF run-level metadata when truncated: runs[].properties.assay MUST be present when truncation was applied:
truncated (boolean): true
omitted_count (integer): number of eligible results omitted
summary.json and run.json — nested sarif object: When SARIF was truncated, summary.json and run.json MAY include a top-level sarif object. Schema when present:
sarif (object, optional): present only when truncation occurred (recommended to reduce noise).
sarif.omitted (integer, required when sarif is present): ≥ 1.
Consistency: When both are present, sarif.omitted (in run.json or summary.json) MUST equal runs[0].properties.assay.omitted_count.
Normative: SARIF upload MUST NOT fail due to size or result count; truncation is required when necessary. Consumers MUST treat SARIF as potentially truncated and MUST use summary/run for authoritative counts.

6.4 Severity Mapping¶

Map Assay outcomes to SARIF severity: Fail/Error → "error"; Warn/Flaky → "warning"; Info/other → "note".

7. Next-Step Requirement¶

For every non-zero exit, the implementation MUST provide at least one suggested next step so that users and CI logs know what to do next.

Console: When exiting with exit_code ≠ 0, the process MUST print at least one line that is a concrete command or hint (e.g. "Run: assay doctor ...", "See: assay explain ...", "Fix baseline: assay baseline record ...").
summary.json: The next_step field SHOULD be set when exit_code ≠ 0 (see §3.1). It MAY be the same as or a shortened form of the console message.

Normative: Contract tests MAY verify that for a set of known error conditions (missing config, missing trace, failing test), the output contains a non-empty next_step (in summary.json) and a console line with a suggested command.

8. Conformance¶

Producers: assay ci and any code path that writes summary.json, junit.xml, or sarif.json for the PR gate MUST follow §2–§7.
Consumers: CI workflows and the GitHub Action MAY rely on schema_version, exit_code, reason_code, and next_step as defined above. Unknown summary fields MUST be ignored.
Contract tests: Implementations MUST include tests that (1) validate summary.json schema_version and required fields, (2) validate that every SARIF result has at least one location, (3) optionally validate SARIF against the official 2.1.0 schema and/or a minimal upload-smoke test.

9. Version History¶

schema_version	Date	Changes
1	2026-01	Initial Outputs-v1.
1	2026-02	Clarified/added: Seeds (§3.3.1) + Judge metrics (§3.3.2). Seeds MUST be decimal strings (or null) to avoid JSON precision loss. judge_seed reserved (null) until implemented.
1	2026-02	E2.3: SARIF truncation metadata (§6.3): properties.assay (truncated, omitted_count) in SARIF run; sarif.omitted in summary.json and run.json when truncated. Deterministic truncation order.
1	2026-02	E9c alignment draft: replay provenance keys in `provenance` (`replay`, `bundle_digest`, `replay_mode`, `source_run_id`) and `E_REPLAY_MISSING_DEPENDENCY` reason code.