PLAN — P34 Trust Basis Diff Gate (Q2 2026)
- Status: implemented in this slice
- Owner: Assay core / CLI
- Scope: compare two canonical Trust Basis artifacts, not raw external evidence
1. Why This Exists
P31 made the Promptfoo compiler path real:
P33 made that receipt boundary visible to the Trust Basis compiler:
P34 adds the next small bridge:
That gives Harness a stable gate foundation without asking Harness to parse Promptfoo JSONL, understand external eval receipt payloads, or re-run Trust Basis classification logic.
2. Boundary
P34 compares compiled Assay artifacts only.
It does not:
- parse Promptfoo JSONL
- inspect raw prompts, outputs, expected values, vars, provider payloads, stats, or full rows
- compare evidence bundles directly
- infer model correctness or Promptfoo run success
- add Trust Card rendering changes
- add Harness baseline/candidate UI
The command is deliberately generic:
Promptfoo is only the first motivating receipt lane. The diff layer is about Trust Basis claims, not Promptfoo semantics.
3. Comparison Semantics
P34 v1 accepts canonical Trust Basis JSON produced by assay trust-basis generate.
Claim comparison is keyed by stable claim identity (claim.id).
Duplicate claim IDs in either input are invalid. Without unique claim identity, the command cannot distinguish an actual regression from ambiguous input.
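The uniqueness rule can be sketched as a small indexing step. This is a minimal illustration, not Assay's implementation: it assumes each claim is a JSON object whose identity lives in an id field (as claim.id suggests), and the function name index_claims and the use of ValueError are hypothetical.

```python
def index_claims(claims):
    """Index a list of claim objects by id, rejecting duplicate ids.

    Assumption: each claim is a dict with an "id" key.
    """
    by_id = {}
    for claim in claims:
        claim_id = claim["id"]
        if claim_id in by_id:
            # Ambiguous input: refuse to diff rather than guess which claim wins.
            raise ValueError(f"duplicate claim id: {claim_id!r}")
        by_id[claim_id] = claim
    return by_id
```

Rejecting duplicates before any comparison keeps an input problem from being reported as a regression.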
Trust Basis claim levels are ordered:
A candidate is a regression when:
- a baseline claim is missing from the candidate
- a candidate claim level is lower than the baseline claim level
A candidate is an improvement when:
- a candidate claim level is higher than the baseline claim level
The diff also reports:
- added claims
- removed claims
- source/boundary/note metadata changes
- unchanged claim count
P34 v1 gates on claim presence and claim level only. Metadata changes are visible but do not fail by default. They may represent a spec or compiler evolution rather than a runtime regression.
Added claims, including unknown or newly introduced claim IDs, are not regressions by default.
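The presence and level rules above can be sketched as follows. This is an illustration under stated assumptions, not the shipped code: the level names level_a, level_b, and level_c are placeholders for the real Trust Basis levels, the inputs are simplified to a mapping from claim id to level, and the category names are hypothetical. Metadata changes are left out, and how removed claims are distributed between regressed_claims and removed_claims in the real output is not modeled.

```python
def classify_claims(baseline, candidate, level_rank):
    """Classify claims by presence and level.

    baseline / candidate: dict mapping claim id -> level name (simplified input).
    level_rank: dict mapping level name -> integer rank, higher is stronger.
    """
    result = {"removed": [], "lowered": [], "raised": [], "added": [], "unchanged": 0}
    for claim_id in sorted(set(baseline) | set(candidate)):
        if claim_id not in candidate:
            result["removed"].append(claim_id)
        elif claim_id not in baseline:
            result["added"].append(claim_id)
        else:
            before = level_rank[baseline[claim_id]]
            after = level_rank[candidate[claim_id]]
            if after < before:
                result["lowered"].append(claim_id)
            elif after > before:
                result["raised"].append(claim_id)
            else:
                result["unchanged"] += 1
    return result


def is_regression(result):
    """A regression is a missing or lowered baseline claim; added claims never count."""
    return bool(result["removed"] or result["lowered"])


# Placeholder ordering; the real level names are not assumed here.
LEVEL_RANK = {"level_a": 0, "level_b": 1, "level_c": 2}

baseline = {"claim_1": "level_b", "claim_2": "level_c", "claim_3": "level_a"}
candidate = {"claim_1": "level_c", "claim_2": "level_b", "claim_4": "level_a"}
result = classify_claims(baseline, candidate, LEVEL_RANK)
```

In this example claim_1 is raised, claim_2 is lowered, claim_3 is removed, and claim_4 is added, so the candidate counts as a regression because of claim_2 and claim_3.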
Machine-readable JSON output uses the stable schema assay.trust-basis.diff.v1 and includes:
- summary
- regressed_claims
- improved_claims
- removed_claims
- added_claims
- metadata_changes
- unchanged_claim_count
Each diff item carries the matching claim_id, its diff class, and the baseline/candidate fields needed by later Harness or SARIF/JUnit projection. All arrays are sorted deterministically by claim.id.
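Deterministic ordering can be sketched as a sort over each array before serialization. This is a hedged illustration: the array field names and the claim_id key come from the text above, while the helper name canonicalize and the choice to also sort object keys are assumptions, not a description of Assay's serializer.

```python
import json

# Array fields named in the diff schema description above.
ARRAY_FIELDS = (
    "regressed_claims",
    "improved_claims",
    "removed_claims",
    "added_claims",
    "metadata_changes",
)


def canonicalize(diff):
    """Return a JSON string whose arrays are sorted by claim_id.

    Assumption: every diff item is a dict with a "claim_id" key.
    """
    out = dict(diff)
    for field in ARRAY_FIELDS:
        out[field] = sorted(diff.get(field, []), key=lambda item: item["claim_id"])
    # sort_keys is an extra assumption that makes object key order stable too.
    return json.dumps(out, indent=2, sort_keys=True)
```

With this kind of normalization, two runs over the same inputs yield byte-identical output, which keeps downstream snapshot tests and review diffs quiet.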
4. Gate Posture
The default command reports differences and exits successfully.
Use this mode for local inspection:
Use --fail-on-regression when the diff should become a gate:
assay trust-basis diff \
baseline.trust-basis.json \
candidate.trust-basis.json \
--fail-on-regression
This keeps the compiler path and the gate policy separate:
- Assay core compiles Trust Basis artifacts.
- assay trust-basis diff compares those artifacts.
- Harness can later decide how to surface regressions in PR feedback.
Exit code contract:
- 0 means the comparison completed and no enabled gate failed.
- 1 means --fail-on-regression was set and a missing/lowered baseline claim was found.
- Other non-zero exits are reserved for input, parse, or validation failures.
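The first two rows of the exit code contract reduce to a small decision function. This is a sketch with hypothetical names (exit_code, EXIT_OK, EXIT_REGRESSION). Input, parse, and validation failures are not modeled, because the plan only reserves other non-zero codes for them without fixing values.

```python
EXIT_OK = 0          # comparison completed, no enabled gate failed
EXIT_REGRESSION = 1  # --fail-on-regression set and a regression was found


def exit_code(regression_found, fail_on_regression):
    """Map the diff outcome and the gate flag to a process exit code."""
    if fail_on_regression and regression_found:
        return EXIT_REGRESSION
    # Without the flag, regressions are reported but never fail the command.
    return EXIT_OK
```

Keeping the gate decision separate from the diff computation is what lets the same comparison serve both local inspection and CI gating.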
5. Acceptance Criteria
P34 is complete when:
- assay trust-basis diff accepts two canonical Trust Basis JSON files.
- text and JSON output are available.
- claim comparison is keyed by claim.id.
- JSON output exposes the stable assay.trust-basis.diff.v1 shape.
- output ordering is deterministic.
- metadata changes are visible and non-blocking in v1.
- --fail-on-regression exits with code 1 only for missing/lowered baseline claims.
- Promptfoo-origin Trust Basis claim improvements and regressions are covered by CLI tests.
- docs explain that this command compares Trust Basis artifacts, not external eval payloads.
6. Follow-Ups
Future slices may add:
- Harness baseline/candidate wiring over trust-basis diff JSON output
- SARIF/JUnit projection for Trust Basis regressions
- stricter metadata-change policy for release gates
- multi-artifact comparison summaries
Those should stay above this generic diff layer.