
PR Gate Output Contracts Specification v1

Status: Draft
Version: 1.0.1-draft
Date: 2026-02
ADR: ADR-019: PR Gate 2026 SOTA
Related: DX-IMPLEMENTATION-PLAN, SPEC-GitHub-Action-v2.1


1. Overview

This specification defines the output contracts for the Assay PR gate: the blessed flow outputs (junit.xml, sarif.json, summary.json), exit and reason code semantics, SARIF constraints for GitHub compatibility, and the requirement that every non-zero exit provides a suggested next step. Implementations of assay ci and assay run (when used as the CI entrypoint) MUST conform to this spec so that CI consumers and the GitHub Action get predictable, machine-readable results.

Design Principles

  • PR-native: Outputs integrate with GitHub (JUnit → test annotations, SARIF → Security tab, Check Run Summary) without custom glue.
  • Stable and versioned: summary.json carries a schema_version so consumers can detect and adapt to changes.
  • Machine-readable nuance: Exit codes stay coarse (0/1/2/3); reason codes in summary.json and console provide stable, fine-grained semantics without breaking exit-code scripts.
  • Upload-safe: SARIF stays within GitHub limits (size, result count) so upload never fails randomly; every result has at least one location.

2. Blessed Flow Outputs

When assay ci (or the equivalent run invoked by the blessed workflow) completes, it MUST produce the following artifacts in the configured output directory (default: .assay/reports or equivalent).

Artifact | Required | Description
junit.xml | Yes | JUnit XML format; test cases with <failure> for Fail/Error; compatible with GitHub test reporting and JUnit reporter actions.
sarif.json | Yes | SARIF 2.1.0; see §6 for location and truncation rules.
summary.json | Yes | Machine-readable run summary; see §3 for schema.

Normative: The blessed entrypoint is assay ci. It MUST produce the same three outputs whether it runs locally or in CI, so that one local command reproduces CI behaviour exactly.


3. summary.json Schema

3.1 Required Top-Level Fields

Field | Type | Required | Description
schema_version | integer | Yes | Version of this summary schema. MUST be 1 for this spec. Increment when adding or changing fields in a backward-incompatible way.
reason_code_version | integer | Yes | Version of the reason code registry. MUST be present. MUST equal 1 in Outputs-v1. Future changes to the reason code set use this version. Consumers MUST branch on (reason_code_version, reason_code) for semantics; exit code is coarse transport only. Consumers MUST treat unknown versions as "compat required" (fail closed or fallback parsing).
exit_code | integer | Yes | Process exit code: 0 = pass, 1 = test failure, 2 = config/user error, 3 = infra/judge unavailable. See §4.
reason_code | string | Yes | Stable machine-readable code when exit_code ≠ 0; e.g. E_TRACE_NOT_FOUND, E_JUDGE_UNAVAILABLE. See §5. When exit_code is 0, MAY be empty string or a designated success code (e.g. OK); empty is allowed and common.
message | string | No | Human-readable one-line description of outcome.
next_step | string | No | Single suggested command or hint when exit_code ≠ 0; e.g. "Run: assay doctor --config ...", "See: assay explain ...". See §7.

3.2 Provenance (Artifact Auditability)

Every summary.json MUST include a top-level provenance object with the following fields so that gates remain auditable (ADR-019 P0.4).

Field | Type | Required | Description
assay_version | string | Yes | Assay CLI version that produced this run (e.g. "2.12.0").
verify_mode | string | Yes | "enabled" or "disabled". When "disabled", indicates signature verification was turned off (UNSAFE).
policy_pack_digest | string | No | Digest of policy/pack used (e.g. sha256:...).
baseline_digest | string | No | Digest of baseline used for comparison, if applicable.
trace_digest | string | No | Digest of trace input, if applicable (optional for privacy/size).
replay | boolean | No | true when this output was produced by replay from a bundle.
bundle_digest | string | No | SHA256 digest of the replay bundle archive used for this run.
replay_mode | string | No | Replay mode when replay=true: "offline" or "live".
source_run_id | string | No | Optional original run id carried into replay provenance.

Normative: If the run was executed with --no-verify, verify_mode MUST be "disabled". When replay is used, producers SHOULD set replay=true and include bundle_digest and replay_mode.

A top-level results object MAY contain:

Field | Type | Required | Description
passed | integer | No | Count of tests passed.
failed | integer | No | Count of tests failed.
warned | integer | No | Count of tests with Warn/Flaky (depends on strict mode).
skipped | integer | No | Count of tests skipped (e.g. cache hit).
total | integer | No | Total test count.

A top-level performance object MAY contain total_duration_ms (integer, milliseconds). Future versions MAY add slowest_tests, cache_hit_rate, phase_timings (see ADR-019 / DX-IMPLEMENTATION-PLAN). Consumers MUST ignore unknown top-level keys.

3.3.1 Seeds (E7.2 – Replay Determinism)

A top-level seeds object (summary.json) and top-level seed_version, order_seed, judge_seed (run.json) SHALL be present for schema stability. On early-exit (e.g. trace not found, config fail), seeds may be null when unknown; seed_version SHALL still be present.

Field | Type | Required | Description
seed_version | integer | Yes | Version of the seed schema. MUST be 1 for Outputs-v1. Consumers MUST branch on seed_version when interpreting seeds.
order_seed | string or null | Yes | Decimal u64 encoded as string to avoid JSON number precision loss; null on early-exit when unknown.
judge_seed | string or null | Yes | Decimal u64 encoded as string; MAY be null until judge-level seeding is implemented (E9); consumers MUST handle null.
sampling_seed | integer | No | Optional: determinism for telemetry sampling (reserved for future use).

Normative: run.json (extended and minimal) and summary.json SHALL include seed_version; order_seed and judge_seed SHALL be present (string or null). Seeds MUST be encoded as decimal strings (or null) to avoid precision loss in JSON consumers (e.g. JS/TS safe for u64 > 2^53). CLI console SHALL print one line: Seeds: seed_version=1 order_seed=… judge_seed=… so CI job summaries can show them for replay.
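The string encoding rule can be illustrated with a minimal Python sketch. The helper names encode_seed and decode_seed are hypothetical and not part of Assay; the sketch only assumes that seeds are u64 values carried as decimal strings or null.

```python
import json

U64_MAX = 2**64 - 1


def encode_seed(seed):
    """Encode a u64 seed (or None) as a decimal string (or None)."""
    if seed is None:
        return None
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise ValueError("seed must be an integer or None")
    if not 0 <= seed <= U64_MAX:
        raise ValueError("seed out of u64 range")
    return str(seed)


def decode_seed(value):
    """Decode a seed field from summary.json or run.json; null stays None."""
    if value is None:
        return None
    if not (isinstance(value, str) and value.isascii() and value.isdigit()):
        raise ValueError("seed must be a decimal string or null")
    seed = int(value)
    if seed > U64_MAX:
        raise ValueError("seed out of u64 range")
    return seed


# Round-trip through JSON without losing precision.
seeds = {
    "seed_version": 1,
    "order_seed": encode_seed(U64_MAX),
    "judge_seed": encode_seed(None),
}
round_tripped = json.loads(json.dumps(seeds))
assert decode_seed(round_tripped["order_seed"]) == U64_MAX
assert decode_seed(round_tripped["judge_seed"]) is None

# Why numbers are avoided: a double cannot distinguish these two neighbours.
assert float(2**53 + 1) == float(2**53)
```

Consumers that parse JSON numbers into doubles would silently collapse adjacent large seeds, which is the failure the decimal-string rule prevents.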

3.3.2 Judge Metrics (E7.3)

When the run had judge evaluations, a top-level judge_metrics object MAY be present with low-cardinality reliability metrics:

Field | Type | Required | Description
abstain_rate | number | No | Fraction of judge evaluations that returned Abstain (uncertain).
flip_rate | number | No | Fraction of evaluations where order was swapped and outcome differed. (Implementation may use a proxy: swapped and non-unanimous agreement, when the judge does not record whether the pass/fail verdict would have differed under the other ordering.)
consensus_rate | number | No | Fraction of evaluations where all samples agreed.
unavailable_count | integer | No | Count of runs where judge was unavailable (infra/transport); not counted toward abstain_rate.

Normative: Judge unavailable (transport/infra) MUST NOT be counted as Abstain; use unavailable_count for that.

Implementation note (unavailable_count): Implementations may use message heuristics (e.g. timeout, 5xx, rate limit, network) on Error-status rows to classify infra failures. Abstain (uncertain verdict) is never counted as unavailable. Prefer standardised reason codes or an explicit infra_class field when available.

Implementation note (flip_rate): The spec defines flip_rate as “order was swapped and outcome differed”. When the judge does not record whether the pass/fail verdict would have differed under the other ordering, implementations may use a heuristic proxy (e.g. swapped and non-unanimous agreement). This proxy does not guarantee that the verdict actually flipped; it indicates order may have affected the outcome. When present, run.json and the CLI console SHALL expose judge metrics so CI can display them.
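As a rough illustration of how these metrics relate, the following Python sketch computes them from a list of evaluation records. The JudgeEval record shape is hypothetical. The sketch also assumes that unavailable evaluations are excluded from the rate denominators, which this spec does not state explicitly; it uses the flip_rate proxy (swapped and non-unanimous) described above.

```python
from dataclasses import dataclass


@dataclass
class JudgeEval:
    """Hypothetical per-evaluation record; not an Assay type."""
    verdict: str            # "pass", "fail", "abstain", or "unavailable"
    unanimous: bool = True  # all samples agreed
    swapped: bool = False   # candidate order was swapped for this evaluation


def judge_metrics(evals):
    """Compute low-cardinality judge metrics from evaluation records."""
    unavailable = sum(1 for e in evals if e.verdict == "unavailable")
    # Assumption: infra failures are excluded from the rate denominators.
    completed = [e for e in evals if e.verdict != "unavailable"]
    metrics = {"unavailable_count": unavailable}
    if completed:
        n = len(completed)
        metrics["abstain_rate"] = sum(e.verdict == "abstain" for e in completed) / n
        metrics["consensus_rate"] = sum(e.unanimous for e in completed) / n
        # Proxy: swapped and non-unanimous; this does not prove the verdict flipped.
        metrics["flip_rate"] = sum(e.swapped and not e.unanimous for e in completed) / n
    return metrics


evals = [
    JudgeEval("pass"),
    JudgeEval("pass"),
    JudgeEval("abstain", unanimous=False, swapped=True),
    JudgeEval("fail"),
    JudgeEval("unavailable"),
]
print(judge_metrics(evals))
```

The unavailable record only increments unavailable_count, so it never inflates abstain_rate.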

3.4 Example (Minimal)

{
  "schema_version": 1,
  "reason_code_version": 1,
  "exit_code": 0,
  "reason_code": "",
  "provenance": {
    "assay_version": "2.12.0",
    "verify_mode": "enabled"
  },
  "results": {
    "passed": 10,
    "failed": 0,
    "total": 10
  },
  "performance": {
    "total_duration_ms": 1234
  }
}

3.5 Example (Non-Zero with Next Step)

{
  "schema_version": 1,
  "reason_code_version": 1,
  "exit_code": 2,
  "reason_code": "E_TRACE_NOT_FOUND",
  "message": "Trace file not found: traces/ci.jsonl",
  "next_step": "Run: assay doctor --config ci-eval.yaml --trace-file traces/ci.jsonl",
  "provenance": {
    "assay_version": "2.12.0",
    "verify_mode": "enabled"
  }
}
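
A contract test for the required fields in §3.1 and §3.2 might look like the following Python sketch. The function name validate_summary is hypothetical; the sketch checks only the fields this spec marks as required and ignores unknown keys.

```python
REQUIRED_TOP_LEVEL = {
    "schema_version": int,
    "reason_code_version": int,
    "exit_code": int,
    "reason_code": str,
}
REQUIRED_PROVENANCE = {"assay_version": str, "verify_mode": str}


def validate_summary(summary):
    """Return a list of contract violations; an empty list means conformant."""
    errors = []
    for field, expected in REQUIRED_TOP_LEVEL.items():
        value = summary.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{field}: missing or not {expected.__name__}")
    if summary.get("schema_version") != 1:
        errors.append("schema_version: expected 1")
    provenance = summary.get("provenance")
    if not isinstance(provenance, dict):
        errors.append("provenance: missing or not an object")
    else:
        for field, expected in REQUIRED_PROVENANCE.items():
            if not isinstance(provenance.get(field), expected):
                errors.append(f"provenance.{field}: missing or not {expected.__name__}")
        if provenance.get("verify_mode") not in ("enabled", "disabled"):
            errors.append("provenance.verify_mode: expected 'enabled' or 'disabled'")
    # Non-zero exits must carry a non-empty reason_code (see §5).
    if summary.get("exit_code") not in (0, None) and not summary.get("reason_code"):
        errors.append("reason_code: must be non-empty when exit_code != 0")
    return errors


minimal = {
    "schema_version": 1,
    "reason_code_version": 1,
    "exit_code": 0,
    "reason_code": "",
    "provenance": {"assay_version": "2.12.0", "verify_mode": "enabled"},
}
assert validate_summary(minimal) == []
```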

4. Exit Code Registry

Exit codes are coarse and MUST NOT be redefined in a breaking way. Reason codes (§5) carry the nuance.

Exit Code | Meaning | Typical reason_codes
0 | All tests passed | (none)
1 | One or more tests failed | (test-level codes)
2 | Configuration / user error | E_CFG_PARSE, E_TRACE_NOT_FOUND, E_MISSING_CONFIG, etc.
3 | Infra / judge unavailable | E_JUDGE_UNAVAILABLE, E_RATE_LIMIT, E_PROVIDER_5XX, E_TIMEOUT

Normative: Judge failures (rate limit, provider 5xx, timeout) MUST map to exit code 3. Behaviour for security vs quality suites is policy-driven (fail-closed vs degrade/skip) per ADR-003/ADR-004; the exit code alone does not change.

Compatibility: Historically, some documentation used exit 3 for "trace file not found". Under this spec, trace-not-found is exit 2 with reason_code E_TRACE_NOT_FOUND. Implementations MAY support a compatibility mode (e.g. --exit-codes=v1) that preserves the old mapping for a documented deprecation period.


5. Reason Code Registry

Reason codes are stable, machine-readable strings. CI and scripts MAY branch on reason_code in summary.json. New codes MUST be added in a backward-compatible way (new string values); existing codes MUST NOT be removed or repurposed without a schema_version bump and migration notes.

5.1 Config / User Error (exit_code 2)

Code | Description
E_CFG_PARSE | Config file parse error (YAML/JSON).
E_TRACE_NOT_FOUND | Trace file or path not found.
E_MISSING_CONFIG | Required config file missing.
E_BASELINE_INVALID | Baseline file invalid or missing.
E_POLICY_PARSE | Policy file parse error.
E_REPLAY_MISSING_DEPENDENCY | Replay missing required offline dependency (e.g. uncached judge/cassette input).

5.2 Infra / Judge Unavailable (exit_code 3)

Code | Description
E_JUDGE_UNAVAILABLE | Judge service unavailable or returned error.
E_RATE_LIMIT | Judge/provider rate limit hit.
E_PROVIDER_5XX | Judge/provider returned 5xx.
E_TIMEOUT | Judge or dependency timed out.

5.3 Test Failure (exit_code 1)

Test-level failures MAY use existing policy/metric codes (e.g. E_ARG_SCHEMA, E_SEQUENCE_VIOLATION) or a generic E_TEST_FAILED. The summary.json reason_code for the run MAY be E_TEST_FAILED when at least one test failed and no single dominant reason is reported.

Normative: When exit_code ≠ 0, summary.json MUST set reason_code to one of the registered values (or a documented extension). Implementations MUST NOT leave reason_code empty when exit_code ≠ 0.
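
A consumer that branches on (reason_code_version, reason_code) and fails closed on unknown versions could be sketched as follows. The returned action labels ("retry", "fix-config", and so on) are hypothetical names chosen for illustration; only the code sets come from the registries above.

```python
KNOWN_REASON_CODE_VERSIONS = {1}

CONFIG_CODES = {
    "E_CFG_PARSE", "E_TRACE_NOT_FOUND", "E_MISSING_CONFIG",
    "E_BASELINE_INVALID", "E_POLICY_PARSE", "E_REPLAY_MISSING_DEPENDENCY",
}
INFRA_CODES = {"E_JUDGE_UNAVAILABLE", "E_RATE_LIMIT", "E_PROVIDER_5XX", "E_TIMEOUT"}

# Coarse fallback for documented extension codes the consumer does not know.
EXIT_CODE_FALLBACK = {1: "test-failure", 2: "fix-config", 3: "retry"}


def classify(summary):
    """Map a parsed summary.json to a hypothetical CI action label."""
    if summary.get("reason_code_version") not in KNOWN_REASON_CODE_VERSIONS:
        # Unknown registry version: do not guess at semantics.
        return "compat-required"
    exit_code = summary.get("exit_code")
    if exit_code == 0:
        return "pass"
    code = summary.get("reason_code", "")
    if code in INFRA_CODES:
        return "retry"
    if code in CONFIG_CODES:
        return "fix-config"
    return EXIT_CODE_FALLBACK.get(exit_code, "unknown")


print(classify({"reason_code_version": 1, "exit_code": 3, "reason_code": "E_RATE_LIMIT"}))
```

The reason code is consulted first and the exit code serves only as a fallback, matching the rule that the exit code is coarse transport.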


6. SARIF Contract (GitHub Compatibility)

SARIF produced for GitHub Code Scanning MUST satisfy the following so that upload-sarif does not reject the file.

6.1 Schema and Version

  • SARIF version MUST be "2.1.0".
  • Schema URI MUST be the official SARIF 2.1.0 JSON schema.

6.2 Location Requirement

  • Every result MUST have at least one location. If no file/line is available, the producer MUST emit a synthetic location (e.g. URI assay.yaml, policy.yaml, or the config path). GitHub's upload can fail with "expected at least one location" when a result has an empty locations array.

Normative: Contract tests MUST validate that every result in the generated SARIF has locations length ≥ 1.
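
The location rule can be checked, and repaired with a synthetic location, along the lines of this Python sketch. The function names are hypothetical, and the fallback URI assay.yaml is just one of the example values mentioned above.

```python
def results_missing_locations(sarif):
    """Return (run_index, result_index) pairs whose locations array is absent or empty."""
    missing = []
    for run_index, run in enumerate(sarif.get("runs", [])):
        for result_index, result in enumerate(run.get("results", [])):
            if not result.get("locations"):
                missing.append((run_index, result_index))
    return missing


def ensure_locations(sarif, fallback_uri="assay.yaml"):
    """Add a synthetic location to results that have none; return how many were patched."""
    patched = 0
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            if not result.get("locations"):
                result["locations"] = [
                    {"physicalLocation": {"artifactLocation": {"uri": fallback_uri}}}
                ]
                patched += 1
    return patched


sarif = {
    "version": "2.1.0",
    "runs": [{"results": [
        {"ruleId": "a", "locations": []},
        {"ruleId": "b", "locations": [
            {"physicalLocation": {"artifactLocation": {"uri": "policy.yaml"}}}
        ]},
        {"ruleId": "c"},
    ]}],
}
assert results_missing_locations(sarif) == [(0, 0), (0, 2)]
ensure_locations(sarif)
assert results_missing_locations(sarif) == []
```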

6.3 Truncation (Size and Result Limits) (E2.3)

  • GitHub enforces limits on SARIF upload (e.g. max size gzipped, max number of results). Producers MUST truncate results when limits would be exceeded, and MUST add a clear indication that results were omitted (e.g. in the run description or a dedicated message: "N results omitted due to GitHub upload limits").
  • Truncation strategy: keep top N results by severity (e.g. error first, then warning). N and the exact message are implementation-defined but MUST be documented. Truncation MUST be deterministic (same run → same selection); selection order: blocking (Fail/Error) first, then warning-level, then stable sort (e.g. by test_id). Eligibility: only SARIF-eligible results (e.g. Fail, Error, Warn, Flaky, Unstable) count toward the limit. Define eligible_total as the count of SARIF-eligible results before truncation; included as the number of results actually written in the SARIF run; then omitted_count = eligible_total − included.
  • SARIF run-level metadata when truncated: runs[].properties.assay MUST be present when truncation was applied, with:
      • truncated (boolean): true
      • omitted_count (integer): number of eligible results omitted
  • summary.json and run.json (nested sarif object): When SARIF was truncated, summary.json and run.json MAY include a top-level sarif object. Schema when present:
      • sarif (object, optional): present only when truncation occurred (recommended to reduce noise).
      • sarif.omitted (integer, required when sarif is present): ≥ 1.
      • Consistency: When both are present, sarif.omitted (in run.json or summary.json) MUST equal runs[0].properties.assay.omitted_count.
  • Normative: SARIF upload MUST NOT fail due to size or result count; truncation is required when necessary. Consumers MUST treat SARIF as potentially truncated and MUST use summary/run for authoritative counts.
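
The deterministic selection described above can be sketched in Python as follows. The record shape (dicts with level and test_id keys) and the limit value are hypothetical; the actual N is implementation-defined. The input is assumed to contain only SARIF-eligible results.

```python
# Lower rank sorts first: blocking results, then warnings, then everything else.
SEVERITY_RANK = {"error": 0, "warning": 1, "note": 2}


def truncate_results(results, limit):
    """Return (included, run_properties) for a list of SARIF-eligible results."""
    ordered = sorted(
        results,
        key=lambda r: (
            SEVERITY_RANK.get(r.get("level"), len(SEVERITY_RANK)),
            r.get("test_id", ""),  # stable tie-break so the same run gives the same selection
        ),
    )
    included = ordered[:limit]
    omitted_count = len(ordered) - len(included)  # eligible_total - included
    properties = {}
    if omitted_count > 0:
        properties = {"assay": {"truncated": True, "omitted_count": omitted_count}}
    return included, properties


results = [
    {"level": "warning", "test_id": "b"},
    {"level": "error", "test_id": "z"},
    {"level": "error", "test_id": "a"},
    {"level": "warning", "test_id": "a"},
]
included, properties = truncate_results(results, limit=3)
assert [(r["level"], r["test_id"]) for r in included] == [
    ("error", "a"), ("error", "z"), ("warning", "a"),
]
assert properties == {"assay": {"truncated": True, "omitted_count": 1}}
```

Because the sort key depends only on severity and test_id, reordering the input does not change which results are kept.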

6.4 Severity Mapping

  • Map Assay outcomes to SARIF severity: Fail/Error → "error"; Warn/Flaky → "warning"; Info/other → "note".
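
As a minimal sketch, the mapping can be written as a lookup with "note" as the default. The outcome strings are assumed to match the names used in this spec; an actual implementation may represent outcomes differently.

```python
# Outcome names are taken from this spec; anything unlisted falls through to "note".
OUTCOME_TO_SARIF_LEVEL = {
    "Fail": "error",
    "Error": "error",
    "Warn": "warning",
    "Flaky": "warning",
}


def sarif_level(outcome):
    """Map an Assay outcome name to a SARIF result level."""
    return OUTCOME_TO_SARIF_LEVEL.get(outcome, "note")


assert sarif_level("Fail") == "error"
assert sarif_level("Flaky") == "warning"
assert sarif_level("Info") == "note"
```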

7. Next-Step Requirement

For every non-zero exit, the implementation MUST provide at least one suggested next step so that users and CI logs know what to do next.

  • Console: When exiting with exit_code ≠ 0, the process MUST print at least one line that is a concrete command or hint (e.g. "Run: assay doctor ...", "See: assay explain ...", "Fix baseline: assay baseline record ...").
  • summary.json: The next_step field SHOULD be set when exit_code ≠ 0 (see §3.1). It MAY be the same as or a shortened form of the console message.

Normative: Contract tests MAY verify that for a set of known error conditions (missing config, missing trace, failing test), the output contains a non-empty next_step (in summary.json) and a console line with a suggested command.
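
Such a contract test could be sketched in Python as below. The hint prefixes are inferred from the examples in this section and are an assumption, not a normative list; the function name is hypothetical.

```python
# Assumption: hint lines start with one of the prefixes used in this section's examples.
HINT_PREFIXES = ("Run:", "See:", "Fix")


def has_next_step(summary, console_output):
    """Check that a non-zero exit carries a next step in summary.json and on the console."""
    if summary.get("exit_code") == 0:
        return True  # the requirement applies only to non-zero exits
    in_summary = bool((summary.get("next_step") or "").strip())
    in_console = any(
        line.strip().startswith(HINT_PREFIXES)
        for line in console_output.splitlines()
    )
    return in_summary and in_console


summary = {
    "exit_code": 2,
    "reason_code": "E_TRACE_NOT_FOUND",
    "next_step": "Run: assay doctor --config ci-eval.yaml --trace-file traces/ci.jsonl",
}
console = (
    "error: Trace file not found: traces/ci.jsonl\n"
    "Run: assay doctor --config ci-eval.yaml --trace-file traces/ci.jsonl\n"
)
assert has_next_step(summary, console)
assert not has_next_step(summary, "error: Trace file not found\n")
```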


8. Conformance

  • Producers: assay ci and any code path that writes summary.json, junit.xml, or sarif.json for the PR gate MUST follow §2–§7.
  • Consumers: CI workflows and the GitHub Action MAY rely on schema_version, exit_code, reason_code, and next_step as defined above. Unknown summary fields MUST be ignored.
  • Contract tests: Implementations MUST include tests that (1) validate summary.json schema_version and required fields, (2) validate that every SARIF result has at least one location, (3) optionally validate SARIF against the official 2.1.0 schema and/or a minimal upload-smoke test.

9. Version History

schema_version | Date | Changes
1 | 2026-01 | Initial Outputs-v1.
1 | 2026-02 | Clarified/added: Seeds (§3.3.1) + Judge metrics (§3.3.2). Seeds MUST be decimal strings (or null) to avoid JSON precision loss. judge_seed reserved (null) until implemented.
1 | 2026-02 | E2.3: SARIF truncation metadata (§6.3): properties.assay (truncated, omitted_count) in SARIF run; sarif.omitted in summary.json and run.json when truncated. Deterministic truncation order.
1 | 2026-02 | E9c alignment draft: replay provenance keys in provenance (replay, bundle_digest, replay_mode, source_run_id) and E_REPLAY_MISSING_DEPENDENCY reason code.

10. References