PLAN — P28 Promptfoo Assertion GradingResult Evidence

  • Date: 2026-04-24
  • Owner: Evidence / External Interop
  • Status: Planning lane
  • Scope (current repo state): Explore one bounded Promptfoo-adjacent evidence lane built around a single deterministic assertion GradingResult surfaced through Promptfoo's public assertion/eval output path. This plan is for the smallest honest external-consumer surface only. It does not propose broad Promptfoo support, red-team result support, prompt comparison support, provider output import, dataset import, trace import, or Promptfoo platform truth.

1. Why this plan exists

promptfoo/promptfoo is a strong adjacent P28 candidate because it is not an agent runtime or tracing platform. It sits in the eval-as-CI space: declarative tests, assertions, JSON/JSONL export, and CI-friendly pass/fail outputs.

That makes it adjacent to Assay in a slightly different way from the recent span, evaluator, and returned-score lanes.

The risk is also obvious: Promptfoo can easily become a whole eval-run import.

P28 should not do that.

The strongest first Promptfoo wedge is not:

  • a full promptfoo eval results export
  • a red-team scan report
  • a prompt/provider comparison matrix
  • raw model outputs
  • assertion configuration truth
  • Promptfoo's web viewer or platform state

It is:

  • one deterministic assertion
  • one surfaced GradingResult
  • one bounded pass/score/reason result bag

The public Promptfoo docs make this seam plausible. Assertion functions can return a GradingResult, deterministic assertions such as equals are documented, and export docs show JSON/JSONL result surfaces with pass/fail and score information. That is enough to justify discovery.

It is not enough to freeze a contract before capture.

2. What this plan is and is not

This plan is for:

  • one deterministic Promptfoo assertion result
  • one bounded GradingResult-shaped result bag
  • one discovery pass over the public surfaced result shape
  • one small external-consumer artifact reduced from that surfaced result

This plan is not for:

  • full Promptfoo support
  • red-team or vulnerability scan reports
  • prompt comparison truth
  • provider output truth
  • raw prompt, vars, expected, or output payload truth
  • Promptfoo config truth
  • dataset, eval-run, or stats truth
  • model-graded assertion semantics
  • token, cost, latency, or provider telemetry
  • web viewer, cloud, or sharing semantics

3. Hard positioning rule

P28 v1 claims only one bounded Promptfoo deterministic assertion GradingResult as imported external evaluation signal evidence. It does not claim output truth, expected-answer truth, prompt truth, provider truth, Promptfoo config truth, red-team truth, dataset truth, or eval-run truth.

That means:

  • Promptfoo remains the source of the observed assertion result
  • Assay imports only the smallest honest surfaced result shape
  • Assay does not inherit broader eval-run semantics as truth

4. First surface

The first surface should stay on exactly one move:

  • run one public deterministic Promptfoo assertion, preferably equals
  • capture the surfaced assertion GradingResult from the public CLI output or public Node package result path
  • reduce exactly one assertion result object

Not:

  • llm-rubric
  • model-graded assertions
  • red-team plugins
  • full JSON output envelopes
  • JSONL output lines as a whole
  • provider response bodies
  • prompt matrix rows
  • stats summaries
  • config exports

This is intentionally smaller than the broader Promptfoo surface.

The deterministic equals path is the best first Promptfoo surface because it is:

  • public in the deterministic assertion docs
  • independent of model-graded rubrics
  • small enough to validate without importing raw prompt/output payloads
  • close to the pass/fail CI shape users already expect from Promptfoo

5. Canonical v1 artifact thesis

The reduced artifact should stay on a single surfaced deterministic assertion result.

The v1 artifact must be frozen from a captured surfaced Promptfoo assertion result object, not from docs snippets, TypeScript interface snippets, or caller-side expectations. Public docs justify the lane, but the raw surfaced result is the source of truth for fixture freeze.

The v1 artifact models one extracted surfaced assertion result only. It does not model Promptfoo JSON, JSONL, YAML, or XML export schemas, and it does not model full eval result wrappers.

Illustrative v1 shape:

{
  "schema": "promptfoo.assertion-grading-result.export.v1",
  "framework": "promptfoo",
  "surface": "assertion_grading_result",
  "target_kind": "promptfoo_output_assertion",
  "assertion_type": "equals",
  "result": {
    "pass": true,
    "score": 1,
    "reason": "Assertion passed"
  }
}

Optional reviewer support, only if naturally present on the surfaced assertion result:

  • result.reason

Not allowed in v1:

  • raw prompt
  • raw output
  • raw expected
  • raw vars
  • raw assertion config
  • Promptfoo full JSON/YAML/XML export envelopes
  • JSONL output lines as canonical artifacts
  • provider identifiers or response bodies
  • token, cost, latency, or stats objects
  • componentResults
  • namedScores
  • tokensUsed
  • synthetic timestamps
  • synthetic prompt, output, expected, or test identifiers
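The reduction itself can be sketched as a whitelist copy. This is a minimal sketch only: the function name reduce_grading_result is hypothetical, and it assumes the surfaced object exposes pass, score, and reason as top-level keys, which discovery still has to confirm.

```python
from typing import Any, Dict, Mapping


def reduce_grading_result(
    surfaced: Mapping[str, Any], assertion_type: str
) -> Dict[str, Any]:
    """Copy only pass/score/reason from one surfaced result (hypothetical helper)."""
    result: Dict[str, Any] = {
        "pass": surfaced["pass"],    # observed exactly as surfaced
        "score": surfaced["score"],  # observed exactly as surfaced
    }
    reason = surfaced.get("reason")
    # reason is optional reviewer support: carry it only when it is a non-empty string
    if isinstance(reason, str) and reason.strip():
        result["reason"] = reason
    return {
        # the four fixed values below are taken from the illustrative v1 shape
        "schema": "promptfoo.assertion-grading-result.export.v1",
        "framework": "promptfoo",
        "surface": "assertion_grading_result",
        "target_kind": "promptfoo_output_assertion",
        "assertion_type": assertion_type,
        "result": result,
    }
```

Because the copy is a whitelist rather than a blacklist, anything else on the surfaced object never reaches the artifact. No timestamp or identifier is synthesized.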

6. Field boundaries

6.1 target_kind

For v1, the only allowed value is:

  • promptfoo_output_assertion

This names the evaluation level. It does not imply that v1 carries stable prompt identity, output identity, expected-answer identity, provider identity, test-case identity, or run identity.

6.2 No target_id_ref in v1

A single surfaced GradingResult does not naturally guarantee a stable target identifier.

Therefore v1 should not invent one.

Assay must not synthesize a target reference from:

  • Promptfoo testIdx or promptIdx
  • provider IDs
  • prompt text
  • vars
  • hashes of raw outputs or expected values
  • eval-run IDs
  • JSONL line positions
  • internal run bookkeeping

If a future public surfaced assertion result naturally carries a stable assertion anchor, the lane can be revisited. V1 should stay honest and omit it.

6.3 assertion_type

This is the canonical Assay-side name for the observed Promptfoo assertion.

It should stay:

  • required
  • short
  • observed from the surfaced assertion result or adjacent public assertion descriptor
  • reviewer-readable

It must not become:

  • Promptfoo assertion taxonomy truth
  • assertion configuration truth
  • a broader Promptfoo eval ontology

For the first lane, the expected v1 value is:

  • equals

assertion_type should preserve the surfaced assertion type when naturally present. If the surfaced result does not carry it directly, the reducer may use the explicitly invoked deterministic assertion type, but must document that as a minimal reduction choice rather than surfaced-result truth.
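A minimal sketch of that fallback, with the source of the value recorded so the reduction choice stays visible to reviewers. It assumes the surfaced result may carry a nested assertion object with a type key; that shape is an assumption to be confirmed by discovery, and resolve_assertion_type is a hypothetical name.

```python
from typing import Any, Mapping, Tuple


def resolve_assertion_type(
    surfaced: Mapping[str, Any], invoked_type: str
) -> Tuple[str, str]:
    """Return (assertion_type, source); source is 'surfaced' or 'invoked'."""
    descriptor = surfaced.get("assertion")  # assumed location, not confirmed
    if isinstance(descriptor, dict):
        surfaced_type = descriptor.get("type")
        if isinstance(surfaced_type, str) and surfaced_type.strip():
            return surfaced_type, "surfaced"
    # Fall back to the explicitly invoked type; this is a reduction choice,
    # not surfaced-result truth, and should be documented as such.
    return invoked_type, "invoked"
```

The second element of the tuple belongs in the discovery note, not in the v1 artifact.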

6.4 result.pass

This is the core bounded assertion outcome.

For v1 deterministic assertion evidence, it should remain:

  • required
  • boolean
  • observed exactly as surfaced

It must not be treated as:

  • universal correctness truth
  • proof that the model output is true
  • proof that the expected value is correct
  • Promptfoo run success as a whole
  • Promptfoo test-case success as a whole
  • Promptfoo threshold or weighted-score success as a whole

6.5 result.score

This is the numeric score attached to the assertion result.

For first execution, it should remain:

  • required
  • numeric
  • observed exactly as surfaced
  • bounded to the shape proven by discovery

For the first deterministic equals lane, assume binary score semantics by default. Widen only if the surfaced deterministic result demonstrably returns a broader numeric shape on the chosen public path.

The plan should not widen result.score to generic scorer semantics before capture.
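Under the default binary assumption, the score check could look like the sketch below. The helper name is hypothetical, and the accepted set {0, 1} is the plan's default assumption rather than a captured fact.

```python
def is_binary_score(score: object) -> bool:
    """True only for a numeric 0 or 1 (int or float), never for a bool."""
    # bool is a subclass of int in Python, so exclude it explicitly;
    # otherwise a stray True would be accepted as the score 1.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return False
    return score in (0, 1)
```

If capture shows a broader numeric shape on the chosen public path, this check is the one place that would need to widen.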

6.6 result.reason

This is optional reviewer support only.

It must remain:

  • optional
  • never required
  • short
  • bounded
  • non-empty when present
  • derived only from the surfaced assertion result

It must not become:

  • chain-of-thought
  • prompt or output transcript
  • provider error dump
  • model-graded rubric explanation
  • multi-line structured reasoning blob

The reducer may omit reason even when present if it is too long, too rich, or too close to rubric/provider reasoning. Multiline, verbose, structured, or rubric-like reason content should be treated as malformed or dropped for v1.
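One way to apply the drop option is sketched below. The 200-character limit and the leading-bracket heuristic for structured content are illustrative assumptions, not values taken from Promptfoo or from a frozen contract.

```python
from typing import Optional

MAX_REASON_CHARS = 200  # hypothetical bound; the plan only says "short"


def bounded_reason(reason: object) -> Optional[str]:
    """Return a short single-line reason, or None when it should be omitted."""
    if not isinstance(reason, str):
        return None
    text = reason.strip()
    if not text or len(text) > MAX_REASON_CHARS:
        return None  # empty, whitespace-only, or too long
    if "\n" in text or "\r" in text:
        return None  # multi-line content is out of scope for v1
    if text[0] in "{[":
        return None  # crude heuristic for a structured blob
    return text
```

Returning None means the reducer omits result.reason entirely, which keeps the artifact valid without carrying rich reasoning.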

7. Observed vs derived rule

P28 v1 should remain almost entirely observed.

Observed:

  • surfaced assertion type, if naturally present
  • surfaced pass
  • surfaced score
  • surfaced reason, if short and naturally present

Derived:

  • fixed framework = "promptfoo"
  • fixed surface = "assertion_grading_result"
  • fixed target_kind = "promptfoo_output_assertion"
  • using the explicitly invoked assertion type only if the surfaced result does not naturally carry one

The plan must not derive:

  • timestamps
  • target identifiers
  • run identifiers
  • prompt, provider, dataset, or config lineage
  • output/expected hashes as identity
  • pass/fail summaries for the run

result.pass names assertion outcome only. It must not be interpreted as Promptfoo test-case success, threshold success, weighted-score success, or overall run success.

Promptfoo inputs and wrappers are discovery material only:

  • prompt text may be captured for discovery only
  • output may be captured for discovery only
  • expected value may be captured for discovery only
  • vars may be captured for discovery only
  • assertion config may be captured for discovery only
  • full JSON/JSONL export wrappers may be captured only to locate the surfaced assertion result

None of those fields may enter the canonical v1 artifact.

8. Cardinality rule

This lane is for exactly one surfaced deterministic assertion result.

Therefore v1 artifacts should be treated as malformed if they contain:

  • multiple assertion results
  • arrays of results
  • componentResults
  • namedScores
  • full JSON/JSONL/YAML/XML export envelopes
  • prompt/provider/test matrices
  • stats summaries
  • red-team result bundles
  • model-graded rubric outputs
  • provider outputs or response bodies
  • raw prompt, vars, expected, or output values
  • assertion configuration objects

No partial import of larger Promptfoo eval results should be allowed in v1.

V1 must fail closed on larger eval/export wrappers rather than partially importing the "first relevant" assertion result.
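A minimal sketch of the fail-closed behaviour for a candidate handed to the importer. The names componentResults and namedScores come from this plan; the keys results, stats, config, and prompts are assumptions about what a larger export wrapper might carry, and both the helper and the exception are hypothetical.

```python
from typing import Any

# componentResults / namedScores are named in this plan; the rest are
# assumed wrapper keys and should be revisited after discovery.
WRAPPER_KEYS = frozenset(
    {"results", "stats", "config", "prompts", "componentResults", "namedScores"}
)


class NotASingleResult(ValueError):
    """Raised when a candidate is larger than one assertion result."""


def require_single_result(candidate: Any) -> dict:
    """Return the candidate unchanged, or raise; never pick a 'first' result."""
    if isinstance(candidate, list):
        raise NotASingleResult("array of results; refusing to pick one")
    if not isinstance(candidate, dict):
        raise NotASingleResult("candidate is not a JSON object")
    result = candidate.get("result")
    if isinstance(result, list):
        raise NotASingleResult("result is an array")
    scopes = [candidate] + ([result] if isinstance(result, dict) else [])
    for scope in scopes:
        found = sorted(scope.keys() & WRAPPER_KEYS)
        if found:
            raise NotASingleResult(f"wrapper or compound keys present: {found}")
    return candidate
```

There is deliberately no code path that unwraps a list or envelope and continues with its first element.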

9. Discovery gate

P28 should not advance on docs snippets alone. Freeze nothing until one raw surfaced deterministic Promptfoo assertion result is captured from a public Promptfoo path and stored separately from all emitted inputs and wrappers.

Required first proof:

  • run one real deterministic equals assertion through Promptfoo
  • capture the raw prompt/output/expected/config inputs separately as discovery artifacts
  • capture the public surfaced assertion result separately
  • confirm whether the result came from CLI JSON output, JSONL output, or the Node package result path
  • compare emitted inputs, export wrappers, and the surfaced assertion result before freezing any reduced artifact

Keep these separate:

  • emitted Promptfoo config and assertion input
  • provider/model output
  • full Promptfoo export envelope
  • extracted surfaced GradingResult
  • reduced Assay-facing artifact

Do not treat full Promptfoo JSON output as equivalent to the assertion result shape. Promptfoo JSON/YAML/XML exports can include config and redacted environment data, so importing them as v1 evidence would be too broad.

Freeze one surfaced path first: CLI JSON, JSONL, or Node package. If those paths return materially different shapes, freeze per surfaced path rather than pretending there is a single Promptfoo-wide v1 result shape by default.
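The comparison step could feed the discovery note with a small table generator like the sketch below. It only compares top-level keys, the function name is hypothetical, and the three inputs are whatever the capture probe stored separately.

```python
from typing import Any, List, Mapping


def field_presence_rows(
    emitted: Mapping[str, Any],
    wrapper: Mapping[str, Any],
    surfaced: Mapping[str, Any],
) -> List[str]:
    """Build Markdown table rows showing which top-level fields appear where."""

    def mark(source: Mapping[str, Any], field: str) -> str:
        return "yes" if field in source else "no"

    rows = [
        "| field | emitted input | export wrapper | surfaced result |",
        "| --- | --- | --- | --- |",
    ]
    for field in sorted(set(emitted) | set(wrapper) | set(surfaced)):
        rows.append(
            f"| {field} | {mark(emitted, field)} "
            f"| {mark(wrapper, field)} | {mark(surfaced, field)} |"
        )
    return rows
```

Running it once per surfaced path (CLI JSON, JSONL, Node package) makes materially different shapes visible before anything is frozen.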

10. Initial malformed rules

Artifacts should be treated as malformed if they contain:

  • no assertion_type
  • no result
  • no result.pass
  • no result.score
  • non-boolean result.pass
  • non-numeric result.score
  • empty or whitespace-only result.reason
  • raw prompt
  • raw output
  • raw expected
  • raw vars
  • assertion config
  • provider IDs or response bodies
  • full Promptfoo export wrappers
  • JSONL line wrappers
  • stats, latency, cost, or token usage
  • componentResults
  • namedScores
  • red-team or model-graded assertion metadata
  • arrays of assertion results
  • partial imports from larger Promptfoo eval results
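These rules can be approximated with a strict key whitelist plus type checks, as in the sketch below. The allowed key sets are taken from the illustrative v1 shape in this plan, not from a frozen contract, and malformed_reasons is a hypothetical name.

```python
from typing import Any, List

ALLOWED_TOP = {"schema", "framework", "surface", "target_kind", "assertion_type", "result"}
ALLOWED_RESULT = {"pass", "score", "reason"}


def malformed_reasons(artifact: Any) -> List[str]:
    """Return a list of problems; an empty list means no rule was violated."""
    if not isinstance(artifact, dict):
        return ["artifact is not a single JSON object"]
    problems: List[str] = []
    extra = sorted(artifact.keys() - ALLOWED_TOP)
    if extra:
        problems.append(f"unexpected top-level keys: {extra}")
    a_type = artifact.get("assertion_type")
    if not isinstance(a_type, str) or not a_type.strip():
        problems.append("assertion_type is missing or empty")
    result = artifact.get("result")
    if not isinstance(result, dict):
        problems.append("result is missing or not an object")
        return problems
    extra = sorted(result.keys() - ALLOWED_RESULT)
    if extra:
        problems.append(f"unexpected result keys: {extra}")
    if not isinstance(result.get("pass"), bool):
        problems.append("result.pass is missing or not a boolean")
    score = result.get("score")
    # bool is an int subclass in Python, so reject it as a score
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        problems.append("result.score is missing or not numeric")
    if "reason" in result:
        reason = result["reason"]
        if not isinstance(reason, str) or not reason.strip():
            problems.append("result.reason is empty or not a string")
    return problems
```

The whitelist covers the forbidden-content bullets by construction: raw prompt, vars, provider fields, stats, componentResults, and namedScores all surface as unexpected keys.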

11. Repository deliverables for first execution

If discovery validates the surface, the first concrete P28 lane should include:

  • a formal example directory
  • one live discovery note with emitted vs surfaced field presence
  • one small mapper
  • valid, failure, and malformed fixtures
  • generated placeholder NDJSON outputs for the valid and failure cases

Suggested layout:

examples/
  promptfoo-assertion-grading-result-evidence/
    README.md
    map_to_assay.py
    capture_probe.mjs
    discovery/
      FIELD_PRESENCE.md
    fixtures/
      valid.promptfoo.json
      failure.promptfoo.json
      malformed.promptfoo.json
      valid.assay.ndjson
      failure.assay.ndjson

12. Outward strategy

Promptfoo has issues enabled and discussions disabled. The repo norm is mostly direct technical issues with concrete examples.

P28 should not open with a broad integration ask.

If the sample lands, outreach should be a compact issue that asks one narrow question:

Is the surfaced deterministic assertion GradingResult the right minimal public result boundary for external evidence consumers, or should consumers anchor to a different JSON/Node result surface?

Keep the tone warm, concise, and concrete. Do not ask Promptfoo maintainers to validate Assay's broader evidence model.

13. Success criteria

This plan succeeds when:

  • Assay has one credible eval-as-CI adjacent surface that is smaller than Promptfoo eval-run truth
  • the lane stays on a single deterministic assertion result
  • the reduced artifact remains smaller than prompt/output/expected/config payloads
  • discovery proves the surfaced shape before any contract freeze
  • malformed rules prevent wrapper drift into Promptfoo platform or run truth

14. Final judgment

P28 should be a Promptfoo deterministic assertion GradingResult lane: one surfaced pass/score/reason result, and nothing broader.