PLAN — P27 AutoEvals ExactMatch Score Evidence¶
- Date: 2026-04-24
- Owner: Evidence / External Interop
- Status: Planning lane
- Scope (current repo state): Explore one bounded AutoEvals-adjacent evidence lane built around a single deterministic
ExactMatchscore object returned through the public AutoEvals scorer API. This plan is for the smallest honest external-consumer surface only. It does not propose broad AutoEvals support, Braintrust experiment logging support, LLM judge support, RAG scorer support, JSON/list scorer support, or model-provider truth.
1. Why this plan exists¶
braintrustdata/autoevals is a strong non-LangChain P27 candidate because it is an active evaluator package with public Python and TypeScript surfaces, open issues, and a product boundary adjacent to Braintrust without requiring Braintrust logging.
That matters because the recent lanes have leaned into LangChain-adjacent evaluation surfaces. P27 should test the same small-result discipline in a different evaluation community.
The strongest first AutoEvals wedge is not:
- Factuality or other LLM-as-a-judge scorers
- RAGAS-style RAG scorers
- JSON or list aggregate scorers
- Braintrust
Eval(...)/ experiment logging - custom scorer authoring as a whole
It is:
- one deterministic
ExactMatchscorer - one returned score object
- one bounded result bag
The public scorer reference describes ExactMatch as a binary scorer that checks exact equality, with a score range of 0 or 1. The AutoEvals README also shows the core custom scorer return shape as a Score with:
namescore- optional metadata in broader scorer families
That is the kind of small returned result Assay should prefer over richer judge, retrieval, JSON, list, or experiment surfaces.
2. What this plan is and is not¶
This plan is for:
- one deterministic
ExactMatchscore object - one bounded result bag
- one discovery pass over the public scorer call and returned object
- one small external-consumer artifact reduced from that returned score
This plan is not for:
- full AutoEvals support
- LLM judge scorer results
- Braintrust experiment or logging wrappers
- RAG scorer contexts
- JSON scorer result trees
- list scorer bundles
- custom scorer framework support
- model-provider or prompt truth
- raw
output,expected, orinputtruth
3. Hard positioning rule¶
P27 v1 claims only one bounded AutoEvals ExactMatch score object as imported external evaluation signal evidence. It does not claim output truth, expected answer truth, Braintrust truth, dataset truth, scorer-family truth, model truth, or prompt truth.
That means:
- AutoEvals remains the source of the observed score
- Assay imports only the smallest honest returned score surface
- Assay does not inherit broader experiment or logging semantics as truth
4. Recommended surface¶
The first surface should stay on exactly one move:
- call the public deterministic
ExactMatchscorer through the Python or TypeScript AutoEvals API - reduce exactly one returned score object
Not:
FactualityClosedQASummary- RAG scorers
- JSON scorers
- list scorers
- Braintrust
Eval(...) - model-provider-backed scorers
- raw output/expected payload export
This is intentionally smaller than the broader AutoEvals surface.
The ExactMatch path is the best first AutoEvals surface because it is:
- deterministic
- public in the scorer reference
- independent of LLM clients, prompts, model configuration, and Braintrust experiment logging
- small enough to review without importing the compared payloads
5. Canonical v1 artifact thesis¶
The reduced artifact should stay on a single returned ExactMatch score object.
The v1 artifact must be frozen from a captured returned scorer object, not from README examples or caller-side expectations. The public docs are enough to justify the lane, but the raw returned result is the source of truth for fixture freeze.
Illustrative v1 shape:
{
"schema": "autoevals.exactmatch-score.export.v1",
"framework": "autoevals",
"surface": "exactmatch_score",
"target_kind": "output_expected_pair",
"scorer_name": "ExactMatch",
"result": {
"score": 1
}
}
Optional reviewer support, only if naturally present on the returned score object:
result.metadata_ref
Not allowed in v1:
- raw
output - raw
expected - raw
input - inline metadata bags
- Braintrust experiment or span wrappers
- scorer configuration blobs
- prompt, model, rubric, context, or provider metadata
- synthetic timestamps
- synthetic output or expected identifiers
6. Field boundaries¶
6.1 target_kind¶
For v1, the only allowed value is:
output_expected_pair
This names the comparison level. It does not imply that v1 carries stable output identity, expected-answer identity, dataset row identity, or run identity.
6.2 No target_id_ref in v1¶
The returned ExactMatch score object does not naturally carry a stable target identifier.
Therefore v1 should not invent one.
Assay must not synthesize a target reference from:
- caller-side harness state
- dataset row identity
- hashes of raw outputs or expected values
- Braintrust wrappers
- internal run bookkeeping
If a future public returned result naturally carries a stable comparison anchor, the lane can be revisited. V1 should stay honest and omit it.
6.3 scorer_name¶
This is the canonical Assay-side name for the observed AutoEvals scorer.
It should stay:
- required
- short
- observed or directly implied by the returned score object and scorer call
- reviewer-readable
It must not become:
- a taxonomy import
- scorer configuration truth
- a broader AutoEvals scorer ontology
For the first lane, the expected v1 name is:
ExactMatch
Discovery must confirm the actual returned field names before fixture freeze. If the returned object naturally uses name rather than a class-style scorer name, the reducer should preserve that observed value instead of forcing the illustrative value above.
6.4 result.score¶
This is the core bounded evaluation signal.
For v1 ExactMatch, it should remain:
- required
- numeric
- exactly
0or1 - observed exactly as returned
It must not be treated as:
- universal evaluator truth
- ranking truth
- normalized cross-scorer semantics
- proof that either compared value is correct
The score only reports AutoEvals' exact comparison result for the supplied output/expected pair.
6.5 result.metadata_ref¶
Inline metadata should not be part of v1.
If the returned score object includes metadata and discovery proves a tiny stable subset is genuinely necessary for review, P27 should prefer:
metadata_ref
over importing the raw metadata object.
For first execution, raw inline metadata should be malformed unless discovery proves otherwise. This keeps LLM-judge rationales, RAG context, provider payloads, and Braintrust logging details out of the first lane.
7. Observed vs derived rule¶
P27 v1 should remain almost entirely observed.
Observed:
- returned scorer name or equivalent score name
- returned
score - returned metadata presence only as discovery information
Derived:
- renaming an observed score name into canonical
scorer_name - minimal field normalization required to freeze the artifact
- fixed
target_kind = "output_expected_pair"to name the comparison level
The plan must not derive:
- timestamps
- target identifiers
- dataset or run lineage
- scorer-family truth
- output/expected hashes as identity
Scorer inputs are discovery material only:
outputmay be captured for discovery onlyexpectedmay be captured for discovery onlyinputmay be captured for discovery only if the public call requires or naturally accepts it- raw compared payloads must never enter the canonical v1 artifact
- their only role is to prove that the returned score is genuinely smaller than the evaluated payloads
8. Cardinality rule¶
This lane is for exactly one returned score object.
Therefore v1 artifacts should be malformed if they contain:
- multiple score objects
- score arrays
- JSON scorer result trees
- list scorer bundles
- Braintrust experiment wrappers
- dataset row bundles
- full output/expected-plus-score payloads
- scorer configuration fields
- model, prompt, rubric, provider, or context metadata
No partial import of larger evaluation bundles should be allowed in v1.
V1 must fail closed on larger scorer, dataset, or experiment wrappers rather than partially importing the "first relevant" score.
9. Discovery gate¶
P27 should not advance on docs snippets alone. Freeze nothing until one raw ExactMatch return object is captured from the public scorer call and stored separately from all caller inputs.
Required first proof:
- call one real
ExactMatchscorer through the public AutoEvals API - capture raw
outputandexpectedseparately as discovery artifacts - capture the raw returned score object as its own discovery artifact
- compare the input boundary to the returned-score boundary before freezing any reduced artifact
Keep raw inputs and raw returned score separate. Do not treat scorer inputs as part of the returned public result shape.
If Python and TypeScript return materially different score shapes, the lane should freeze per language first rather than pretending there is a single cross-language v1 artifact by default.
10. Initial malformed rules¶
Artifacts should be malformed if they contain:
- no
scorer_name - no
result - a non-numeric
result.score - a
result.scoreother than0or1 - raw output
- raw expected
- raw input
- inline metadata bags
- dataset or experiment identifiers
- Braintrust wrapper fields
- scorer configuration fields
- prompt, model, rubric, provider, or context metadata
- JSON/list scorer aggregate outputs
- arrays of score objects
- partial imports from larger Braintrust or AutoEvals wrappers
11. Repository deliverables for first execution¶
If discovery validates the surface, the first concrete P27 lane should include:
- a formal example directory
- one live discovery note with input vs returned field presence
- one small mapper
- valid, failure, and malformed fixtures
- generated placeholder NDJSON outputs for valid cases
Suggested layout:
examples/
autoevals-exactmatch-evidence/
README.md
map_to_assay.py
capture_probe.py
discovery/
FIELD_PRESENCE.md
fixtures/
valid.autoevals.json
failure.autoevals.json
malformed.autoevals.json
valid.assay.ndjson
failure.assay.ndjson
12. Success criteria¶
This plan succeeds when:
- Assay has one credible non-LangChain evaluator surface that is smaller than AutoEvals or Braintrust evaluation truth
- the lane stays on a single returned
ExactMatchscore object - the reduced artifact remains smaller than output/expected payloads or experiment wrappers
- discovery proves the returned shape before any contract freeze
13. Final judgment¶
P27 should be an AutoEvals ExactMatch lane: one returned deterministic output/expected comparison score, and nothing broader.