PLAN — P26 AgentEvals Trajectory Strict-Match Result Signal Evidence¶
- Date: 2026-04-23
- Owner: Evidence / External Interop
- Status: Planning lane
- Scope (current repo state): Explore one bounded AgentEvals-adjacent evidence lane built around a single deterministic trajectory strict-match result returned through the public AgentEvals evaluator API. This plan is for the smallest honest external-consumer seam only. It does not propose broad AgentEvals support, LangSmith evaluation-run support, graph-trajectory support, LLM-as-judge support, dataset support, or LangChain runtime truth.
1. Why this plan exists¶
langchain-ai/agentevals is a strong adjacent candidate because it publicly positions itself around agent trajectory evaluation and exposes small evaluator-return surfaces directly in its README.
That matters because Assay does not need "LangChain evals" as a platform.
It needs the smallest honest external-consumer seam that:
- already exists in named public docs
- is reviewable without importing full trajectory truth
- stays smaller than LangSmith runs or broader evaluator workflow semantics
The strongest first wedge is not a full eval run and not an LLM-judge result.
It is:
- one deterministic trajectory strict-match evaluator
- one returned result object
- one bounded result bag
The public strict-match example is the key reason to start here. The docs show a direct returned object shaped like:
keyscorecomment
with trajectory_strict_match as the evaluator key in the strict-match path.
That is exactly the kind of small, named, returned signal Assay should prefer.
2. What this plan is and is not¶
This plan is for:
- one deterministic trajectory strict-match result
- one bounded result bag
- one discovery pass over the public evaluator call and returned object
- one small external-consumer artifact reduced from that returned result
This plan is not for:
- full AgentEvals support
- LLM-as-judge trajectory results
- graph trajectory evaluators
- LangSmith experiment or
evaluate(...)result wrappers - dataset truth
- raw
outputsorreference_outputstruth - evaluator prompt or model-config truth
- LangChain or LangGraph runtime truth
3. Hard positioning rule¶
P26 v1 claims only one bounded AgentEvals trajectory strict-match result as imported external evaluation signal evidence. It does not claim trajectory truth, reference truth, evaluator prompt truth, LangSmith truth, dataset truth, or LangChain runtime truth.
That means:
- AgentEvals remains the source of the observed result
- Assay imports only the smallest honest returned result surface
- Assay does not inherit broader eval-run semantics as truth
4. Recommended seam¶
The first seam should stay on exactly one move:
- call the public deterministic trajectory match path through
create_trajectory_match_evaluator(trajectory_match_mode="strict")orcreateTrajectoryMatchEvaluator({ trajectoryMatchMode: "strict" }) - reduce exactly one returned result object
Not:
create_trajectory_llm_as_judge(...)- graph trajectory evaluators
- LangSmith
evaluate(...)envelopes - dataset-backed experiment rows
- full trajectory payload export
This is intentionally smaller than the broader AgentEvals surface.
The strict-match path is the best first seam because it is:
- deterministic
- public in the README
- already shown as returning a small object
- smaller than the LLM-as-judge path, which already widens into prompt, model, and free-text reasoning semantics
5. Canonical v1 artifact thesis¶
The reduced artifact should stay on a single returned strict-match result.
The v1 artifact must be frozen from a captured returned evaluator object, not from README examples or caller-side expectations. The public docs are enough to justify the lane, but the raw returned result is the source of truth for fixture freeze.
Illustrative v1 shape:
{
"schema": "agentevals.trajectory-strict-match.export.v1",
"framework": "agentevals",
"surface": "trajectory_strict_match_result",
"target_kind": "trajectory",
"evaluator_key": "trajectory_strict_match",
"result": {
"score": false
}
}
Optional reviewer support, only if naturally present on the returned result:
result.comment
Not allowed in v1:
- raw
outputs - raw
reference_outputs - LangSmith run or experiment wrappers
- dataset identifiers
- prompt or model metadata
- evaluator configuration blobs
- synthetic timestamps
- synthetic trajectory identifiers
6. Field boundaries¶
6.1 target_kind¶
For v1, the only allowed value is:
trajectory
This keeps the lane on one trajectory-evaluation result rather than wider session, thread, or graph semantics.
target_kind = "trajectory" names the evaluation level only. It does not imply that v1 carries a stable target identity.
6.2 No target_id_ref in v1¶
The returned strict-match result does not naturally carry a stable target identifier.
Therefore v1 should not invent one.
Assay must not synthesize a target reference from:
- caller-side harness state
- dataset row identity
- LangSmith wrappers
- hashes of full trajectories
- internal run bookkeeping
If a future public returned result naturally carries a stable trajectory anchor, the lane can be revisited. V1 should stay honest and omit it.
6.3 evaluator_key¶
This is the canonical Assay-side name for the returned AgentEvals key.
It should stay:
- required
- short
- observed
- reviewer-readable
It must not become:
- a taxonomy import
- evaluator configuration truth
- a broader LangChain evaluation ontology
For the strict-match-first lane, the expected v1 key is:
trajectory_strict_match
6.4 result.score¶
This is the core bounded evaluation signal.
For v1 strict-match, it should remain:
- required
- boolean
- observed exactly as returned
It must not be treated as:
- universal evaluator truth
- ranking truth
- normalized cross-evaluator semantics
6.5 result.comment¶
This is optional reviewer support only.
It must remain:
- optional
- bounded
- short when present
comment is never required. If present, it may be omitted during reduction if it is too long, too rich, multiline, structured, or otherwise broader than the small returned-result evidence surface.
It must not become:
- chain-of-thought import
- raw reasoning transcript
- prompt or rubric payload
- embedded trajectory content dump
- structured reasoning blob
Empty or whitespace-only comments should be omitted or treated as malformed.
7. Observed vs derived rule¶
P26 v1 should remain almost entirely observed.
Observed:
- returned
key - returned
score - returned
commentwhen naturally present and non-empty
Derived:
- renaming returned
keyinto canonicalevaluator_key - minimal field normalization required to freeze the artifact
The plan must not derive:
- timestamps
- trajectory identifiers
- dataset or run lineage
- evaluator-mode truth beyond what is already explicit in the returned key
Evaluator inputs are discovery material only:
outputsmay be captured for discovery onlyreference_outputsmay be captured for discovery only- raw trajectory payloads must never enter the canonical v1 artifact
- their only role is to prove that the returned result is genuinely smaller than the evaluated payloads
8. Cardinality rule¶
This lane is for exactly one returned evaluation result object.
Therefore v1 artifacts should be malformed if they contain:
- multiple evaluation results
- result arrays
- batch evaluator wrappers
- LangSmith experiment result envelopes
- dataset row bundles
- full trajectory-plus-result payloads
- evaluator configuration fields beyond the returned key
- trajectory match mode fields
- model or prompt metadata
No partial import of larger evaluation bundles should be allowed in v1.
V1 must fail closed on larger evaluation, dataset, or experiment wrappers rather than partially importing the "first relevant" result.
9. Discovery gate¶
P26 should not advance on docs snippets alone. Freeze nothing until one raw strict-match return object is captured from the public evaluator call and stored separately from all caller inputs.
Required first proof:
- call one real strict-match evaluator through the public AgentEvals API
- capture raw input
outputsandreference_outputsseparately as discovery artifacts - capture the raw returned result object as its own discovery artifact
- compare the input boundary to the returned-result boundary before freezing any reduced artifact
Keep raw inputs and raw returned result separate. Do not treat the evaluator inputs as part of the returned public result shape.
If the observed returned shape differs materially across Python and TypeScript, the lane should freeze per language first rather than pretending there is a single cross-language v1 artifact by default.
10. Initial malformed rules¶
Artifacts should be malformed if they contain:
- no
evaluator_key - no
result - a non-boolean
result.score - empty or whitespace-only
result.comment - raw trajectory payloads
- raw reference trajectory payloads
- dataset or experiment identifiers
- LangSmith wrapper fields
- evaluator configuration fields
- trajectory match mode fields
- prompt, model, or rubric metadata
- arrays of evaluation results
- partial imports from larger LangSmith or LangChain evaluation wrappers
11. Repository deliverables for first execution¶
If discovery validates the seam, the first concrete P26 lane should include:
- a formal example directory
- one live discovery note with input vs returned field presence
- one small mapper
- valid, failure, and malformed fixtures
- generated placeholder NDJSON outputs for valid cases
Suggested layout:
examples/
agentevals-trajectory-strict-match-evidence/
README.md
map_to_assay.py
capture_probe.py
discovery/
FIELD_PRESENCE.md
fixtures/
valid.agentevals.json
failure.agentevals.json
malformed.agentevals.json
valid.assay.ndjson
failure.assay.ndjson
12. Success criteria¶
This plan succeeds when:
- Assay has one credible AgentEvals-adjacent seam that is smaller than AgentEvals or LangSmith evaluation truth
- the lane stays on a single returned strict-match result
- the reduced artifact remains smaller than trajectory payloads or eval-run wrappers
- discovery proves the returned shape before any contract freeze
13. Final judgment¶
P26 should be a strict-match-first AgentEvals lane: one returned deterministic trajectory match result, and nothing broader.