PLAN — P34 Trust Basis Diff Gate (Q2 2026)

Status: implemented in this slice
Owner: Assay core / CLI
Scope: compare two canonical Trust Basis artifacts, not raw external evidence


1. Why This Exists

P31 made the Promptfoo compiler path real:

Promptfoo assertion component result -> Assay evidence receipt bundle

P33 made that receipt boundary visible to the Trust Basis compiler:

receipt bundle -> trust-basis.json with external_eval_receipt_boundary_visible

P34 adds the next small bridge:

baseline trust-basis.json + candidate trust-basis.json -> claim-level diff

That gives Harness a stable gate foundation without asking Harness to parse Promptfoo JSONL, understand external eval receipt payloads, or re-run Trust Basis classification logic.


2. Boundary

P34 compares compiled Assay artifacts only.

It does not:

  • parse Promptfoo JSONL
  • inspect raw prompts, outputs, expected values, vars, provider payloads, stats, or full rows
  • compare evidence bundles directly
  • infer model correctness or Promptfoo run success
  • add Trust Card rendering changes
  • add Harness baseline/candidate UI

The command is deliberately generic:

assay trust-basis diff baseline.trust-basis.json candidate.trust-basis.json

Promptfoo is only the first motivating receipt lane. The diff layer is about Trust Basis claims, not Promptfoo semantics.


3. Comparison Semantics

P34 v1 accepts canonical Trust Basis JSON produced by assay trust-basis generate.

Claim comparison is keyed by stable claim identity:

claim.id

Duplicate claim IDs in either input are invalid. Without unique claim identity, the command cannot distinguish an actual regression from ambiguous input.
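
A minimal sketch of the uniqueness check, for illustration only. It assumes each claim is a JSON object with an "id" field held in a list; the real Trust Basis layout may differ, and index_claims is a hypothetical helper name, not part of the Assay CLI.

```python
from collections import Counter


def index_claims(claims):
    """Index claims by id, rejecting inputs that repeat a claim id.

    Assumption: `claims` is a list of dicts that each carry an "id" key.
    """
    counts = Counter(claim["id"] for claim in claims)
    duplicates = sorted(cid for cid, n in counts.items() if n > 1)
    if duplicates:
        # Ambiguous input: refuse to compare rather than guess which entry wins.
        raise ValueError("duplicate claim ids: " + ", ".join(duplicates))
    return {claim["id"]: claim for claim in claims}
```

Failing early here keeps an input problem from being reported as a regression.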

Trust Basis claim levels are ordered:

absent < inferred < self_reported < verified

A candidate is a regression when:

  • a baseline claim is missing from the candidate
  • a candidate claim level is lower than the baseline claim level

A candidate is an improvement when:

  • a candidate claim level is higher than the baseline claim level

The diff also reports:

  • added claims
  • removed claims
  • source/boundary/note metadata changes
  • unchanged claim count

P34 v1 gates on claim presence and claim level only. Metadata changes are visible but do not fail by default, because they may reflect spec or compiler evolution rather than a runtime regression.

Added claims, including unknown or newly introduced claim IDs, are not regressions by default.
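
The rules above can be sketched as a small classifier. This is an illustration under stated assumptions, not the Assay implementation: each input is reduced to a mapping from claim id to level string, metadata is ignored, and the function and key names are hypothetical.

```python
# Level order as stated in this plan, lowest to highest.
LEVEL_ORDER = ["absent", "inferred", "self_reported", "verified"]
RANK = {level: i for i, level in enumerate(LEVEL_ORDER)}


def classify(baseline, candidate):
    """Compare two {claim_id: level} mappings (assumed shape)."""
    result = {"regressed": [], "improved": [], "removed": [], "added": [], "unchanged": 0}
    # Sorting the union of ids keeps every output list deterministic.
    for cid in sorted(baseline.keys() | candidate.keys()):
        if cid not in candidate:
            result["removed"].append(cid)
        elif cid not in baseline:
            result["added"].append(cid)
        elif RANK[candidate[cid]] < RANK[baseline[cid]]:
            result["regressed"].append(cid)
        elif RANK[candidate[cid]] > RANK[baseline[cid]]:
            result["improved"].append(cid)
        else:
            result["unchanged"] += 1
    return result


def has_regression(result):
    """A missing or lowered baseline claim counts; added claims do not."""
    return bool(result["regressed"] or result["removed"])
```

Whether the real command also lists removed claims under its regressed entries is not specified here; the sketch keeps the two lists separate and gates on both.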

Machine-readable JSON output uses the stable schema assay.trust-basis.diff.v1 and includes:

  • summary
  • regressed_claims
  • improved_claims
  • removed_claims
  • added_claims
  • metadata_changes
  • unchanged_claim_count

Each diff item carries the matching claim_id, its diff class, and the baseline/candidate fields needed by later Harness or SARIF/JUnit projection. All arrays are sorted deterministically by claim.id.
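
As a rough consumer-side sketch, a later projection could read only the top-level lists and the claim_id of each item. The sample payload below is invented for illustration: item fields beyond claim_id are omitted, "hypothetical_claim" is a made-up ID, and plain string ordering of IDs is an assumption about what "sorted deterministically" means.

```python
import json

GATE_KEYS = ("regressed_claims", "removed_claims")
LIST_KEYS = GATE_KEYS + ("improved_claims", "added_claims", "metadata_changes")


def gate_claim_ids(diff):
    """Claim ids that would trip a presence/level gate, in sorted order."""
    return sorted({item["claim_id"] for key in GATE_KEYS for item in diff.get(key, [])})


def is_sorted_by_claim_id(diff):
    """Check that each list is ordered by claim_id (string order assumed)."""
    for key in LIST_KEYS:
        ids = [item["claim_id"] for item in diff.get(key, [])]
        if ids != sorted(ids):
            return False
    return True


# Hypothetical payload; only the key names come from this plan.
SAMPLE = json.loads("""
{
  "regressed_claims": [{"claim_id": "external_eval_receipt_boundary_visible"}],
  "improved_claims": [],
  "removed_claims": [{"claim_id": "hypothetical_claim"}],
  "added_claims": [],
  "metadata_changes": [],
  "unchanged_claim_count": 3
}
""")
```

A consumer written this way never touches receipt payloads or Promptfoo rows, which is the boundary this plan is trying to preserve.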


4. Gate Posture

The default command reports differences and exits successfully.

Use this mode for local inspection:

assay trust-basis diff baseline.trust-basis.json candidate.trust-basis.json

Use --fail-on-regression when the diff should become a gate:

assay trust-basis diff \
  baseline.trust-basis.json \
  candidate.trust-basis.json \
  --fail-on-regression

This keeps the compiler path and the gate policy separate:

  • Assay core compiles Trust Basis artifacts.
  • assay trust-basis diff compares those artifacts.
  • Harness can later decide how to surface regressions in PR feedback.

Exit code contract:

  • 0 means the comparison completed and no enabled gate failed.
  • 1 means --fail-on-regression was set and a missing/lowered baseline claim was found.
  • Other non-zero exits are reserved for input, parse, or validation failures.
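
A caller such as a CI wrapper might interpret this contract roughly as follows. This is a sketch: the function names and returned labels are hypothetical, and treating exit code 1 without --fail-on-regression as an error is an assumption, since the contract only defines 1 for the gated mode.

```python
import subprocess


def describe_exit(code, fail_on_regression):
    """Map an exit code from `assay trust-basis diff` to a coarse outcome."""
    if code == 0:
        return "ok"  # comparison completed, no enabled gate failed
    if code == 1 and fail_on_regression:
        return "regression"  # missing or lowered baseline claim
    return "error"  # input, parse, or validation failure (assumed bucket)


def run_gate(baseline_path, candidate_path):
    """Run the gated diff and classify the result. Requires `assay` on PATH."""
    proc = subprocess.run(
        ["assay", "trust-basis", "diff", baseline_path, candidate_path,
         "--fail-on-regression"],
    )
    return describe_exit(proc.returncode, fail_on_regression=True)
```

Keeping the "error" bucket separate from "regression" lets a pipeline distinguish a broken input from a real claim-level drop.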

5. Acceptance Criteria

P34 is complete when:

  • assay trust-basis diff accepts two canonical Trust Basis JSON files.
  • text and JSON output are available.
  • claim comparison is keyed by claim.id.
  • JSON output exposes the stable assay.trust-basis.diff.v1 shape.
  • output ordering is deterministic.
  • metadata changes are visible and non-blocking in v1.
  • --fail-on-regression exits with code 1 only for missing/lowered baseline claims.
  • Promptfoo-origin Trust Basis claim improvements and regressions are covered by CLI tests.
  • docs explain that this command compares Trust Basis artifacts, not external eval payloads.

6. Follow-Ups

Future slices may add:

  • Harness baseline/candidate wiring over trust-basis diff JSON output
  • SARIF/JUnit projection for Trust Basis regressions
  • stricter metadata-change policy for release gates
  • multi-artifact comparison summaries

Those should be built on top of this generic diff layer rather than folded into it.