PLAN — P10 Agno Accuracy Eval Evidence Interop (2026 Q2)¶

Date: 2026-04-08
Owner: Evidence / Product
Status: Planning
Scope (this PR): Define the next Agno interop lane after the current framework, protocol, runtime-accounting, and eval-report wave. No sample implementation, no outward post, no contract freeze in this slice.

1. Why this plan exists¶

After the current wave, the next lane should still pass the same three tests:

the upstream project already exposes one bounded surface,
Assay can consume that surface without inheriting upstream semantics as truth,
the repo has a natural maintainer channel for one small sample-backed question.

agno-agi/agno currently fits that pattern well:

the repo is large, active, and visibly growing
Discussions are enabled, with active Q&A, Ideas, and Show and tell
the docs separate Evals from Tracing clearly enough that we can choose one seam instead of mixing both

That makes Agno a strong next candidate, but only if Assay keeps the first slice on a small eval-result artifact and does not drift into another trace-first or observability-first pitch.

This is not a trace-export plan.

This is not an AgentOS platform-export plan.

This is a plan for a bounded eval-result seam derived from Agno accuracy evals.

2. Hard positioning rule¶

This lane must not overclaim what the sample actually observes.

Normative framing:

This sample targets the smallest honest eval-result surface exposed by Agno, not a tracing export, AgentOS platform API, or runtime truth surface.

That means:

Agno is the upstream context, not the truth source
AccuracyEval / AccuracyResult are eval-result surfaces, not trace surfaces
Assay stays an external evidence consumer, not a judge of evaluator correctness, runtime correctness, or tracing correctness

3. Why not trace-first or Agent-as-Judge first¶

Agno publicly documents both Tracing and Evals, but those surfaces do not make equally good first seams.

Tracing is documented as the broader OpenTelemetry-style observability layer. Choosing it first would make this lane look too similar to:

Microsoft Agent Framework trace export
OpenAI Agents TraceProcessor
LangGraph task stream

AgentAsJudgeEval is also not the best first seam. It is legitimate, but more semantically loaded than the basic accuracy path because it centers evaluator judgment configuration from the start.

The cleaner first wedge is:

one artifact derived from the documented AccuracyEval / AccuracyResult path
bounded scores
bounded average score
minimal optional references only if the chosen sample shape needs them

This keeps the lane clearly different from trace-first lanes while still staying anchored in an official Agno surface.

4. Recommended v1 seam¶

Use one frozen serialized artifact derived from the documented AccuracyEval / AccuracyResult surface as the first external-consumer seam.

This seam is:

eval-first
reviewable
smaller than tracing
smaller than AgentOS eval-run APIs
lighter than AgentAsJudgeEval
directly aligned with the public Accuracy docs

This is intentionally not:

tracing export
OpenTelemetry export
AgentOS /eval-runs API export
AgentAsJudgeEval as the first seam
performance or reliability evals as the first seam

Important framing rule:

The sample uses a frozen serialized artifact derived from the documented AccuracyEval / AccuracyResult surface, not a claim that Agno already guarantees a fixed wire-export contract.

5. v1 artifact contract¶

5.1 Required fields¶

The first sample should require:

schema
framework
surface
eval_type
eval_name
timestamp
outcome
num_iterations
scores
avg_score

5.2 Optional fields¶

The first sample may include:

threshold
input_label
expected_output_ref
guidelines_ref
agent_ref

5.3 Important field boundaries¶

`scores`¶

This field is required in the frozen sample shape.

It should stay small and bounded:

integer-valued scores in v1
no raw evaluator reasoning payload
no full prompt or output bodies

This requirement belongs to the sample shape, not to an upstream claim that Agno guarantees a universal serialized scores contract.

`avg_score`¶

This field is required in the frozen sample shape but remains upstream eval semantics only.

It must not be promoted into:

quality truth
evaluator truth
runtime truth

`threshold`¶

This field is optional in v1.

It should only appear if the chosen frozen sample shape carries it explicitly. Its presence must not imply that Agno already guarantees a fixed serialized export contract for that field.

References¶

The optional reference fields must stay bounded:

small label
opaque id
short reference string

Not allowed in v1:

full expected output payload
full guidelines payload
full agent config
trace payload

6. Assay-side meaning¶

The sample may only claim bounded eval-result observation.

Assay must not treat as truth:

evaluator correctness
runtime correctness
pass/fail semantics beyond the observed upstream artifact
trace correctness
AgentOS platform state

Common anti-overclaim sentence:

We are not asking Assay to inherit Agno eval judgments, evaluator semantics, runtime semantics, or tracing semantics as truth.

7. Concrete repo deliverable¶

If this plan is accepted, the next implementation PR should add:

examples/agno-accuracy-evidence/README.md
examples/agno-accuracy-evidence/requirements.txt only if the generator truly needs it
examples/agno-accuracy-evidence/generate_synthetic_result.py only if a clean local generator is viable
examples/agno-accuracy-evidence/map_to_assay.py
examples/agno-accuracy-evidence/fixtures/valid.agno.json
examples/agno-accuracy-evidence/fixtures/failure.agno.json
examples/agno-accuracy-evidence/fixtures/malformed.agno.json
examples/agno-accuracy-evidence/fixtures/valid.assay.ndjson
examples/agno-accuracy-evidence/fixtures/failure.assay.ndjson

Fixture boundary notes:

v1 fixtures may omit every optional reference field
v1 fixtures must not embed trace payloads
v1 fixtures should keep the export shape obviously artifact-first rather than dashboard-first or platform-first

8. Generator policy¶

The implementation should prefer a real local generator only if it stays small and deterministic.

8.1 Preferred path¶

Preferred:

a local generator that exercises the documented AccuracyEval flow
no tracing dependency
no AgentOS dependency
no hidden credential requirement
no runtime setup heavy enough to overshadow the sample

8.2 Hard fallback rule¶

If a real local generator would require:

provider credentials
non-deterministic remote evaluation behavior
model/runtime setup heavy enough to turn the sample into a hosted eval demo

then the sample must fall back to a docs-backed frozen artifact shape.

The sample must not become a half-working hosted eval demo.

9. README boundary requirements¶

The eventual sample README must say:

this is not a production Assay↔Agno adapter
this does not freeze a new Assay Evidence Contract event type
this does not treat scores, average score, or outcomes as Assay truth
this does not turn tracing into the first seam
this does not claim a fixed upstream wire-export contract

10. Outward channel strategy¶

If the sample lands and the surrounding outbound queue is quiet enough, the first outward move should be one small Discussion in agno-agi/agno.

Best-fit category candidate:

Q&A

Why Q&A instead of Show and tell:

the question is about the smallest honest seam
the repo already uses Q&A for focused technical boundary questions
Show and tell is more likely to read as project showcase or promotion

The outward question should stay narrow:

If an external evidence consumer wants the smallest honest Agno eval-result surface, is an artifact derived from AccuracyEval / AccuracyResult roughly the right first seam, or is there a thinner result surface you'd rather point them at?

11. Sequencing rule¶

This lane should not begin outward outreach until the newest lanes have had time to breathe.

That means:

the pydantic-ai sample and issue should already be out
the mcp-agent discussion should be allowed to sit without another nudge
no LangGraph retry or UCP outward move should happen at the same time

Implementation planning can start now, but outward Agno posting should still follow the one-lane-at-a-time discipline.

12. ToolAuditHook follow-up candidate¶

On 2026-05-09, agno-agi/agno PR #7782 added a proposed ToolAuditHook plus a minimal JSONL fixture in response to the public audit hook discussion.

That creates a separate candidate seam from the original P10 accuracy-eval lane:

source: ToolAuditHook JSONL records
anchor fields: tool_name, subject, status, duration_ms
privacy boundary: explicit arguments_redacted: true and result_redacted: true markers when raw values are disabled
fixture value: downstream consumers can test against a repo-local JSONL surface without asking Agno to own a signed receipt spec

Treat this as watchlist / probe-only until the upstream PR is merged.

If it merges, the right next step is still small:

import 3-5 checked-in JSONL rows
map them into bounded tool-audit receipts
preserve subject as a review anchor
treat redaction markers as evidence-boundary facts
claim only that an audit record was observed, bundled, and verified

This must not become:

a public fourth receipt family by default
an Agno-specific Harness branch
a claim that Agno tool calls were safe or policy-correct
a broader tracing or AgentOS import lane

The current public families remain Promptfoo, OpenFeature, and CycloneDX ML-BOM. ToolAuditHook is only a future importer-lane candidate unless a later release decision explicitly promotes it.

13. Non-goals¶

building another trace-first sample
opening on AgentAsJudgeEval instead of the simpler accuracy path
consuming AgentOS eval-run APIs as the first seam
importing evaluator truth or runtime truth
treating threshold as a guaranteed upstream serialized field

PLAN — P10 Agno Accuracy Eval Evidence Interop (2026 Q2)¶

1. Why this plan exists¶

2. Hard positioning rule¶

3. Why not trace-first or Agent-as-Judge first¶

4. Recommended v1 seam¶

5. v1 artifact contract¶

5.1 Required fields¶

5.2 Optional fields¶

5.3 Important field boundaries¶

scores¶

avg_score¶

threshold¶

References¶

6. Assay-side meaning¶

7. Concrete repo deliverable¶

8. Generator policy¶

8.1 Preferred path¶

8.2 Hard fallback rule¶

9. README boundary requirements¶

10. Outward channel strategy¶

11. Sequencing rule¶

12. ToolAuditHook follow-up candidate¶

13. Non-goals¶

References¶

`scores`¶

`avg_score`¶

`threshold`¶