PLAN — P24 Phoenix Span Annotation Evaluation-Signal Evidence Interop (2026 Q2)¶

Date: 2026-04-22
Owner: Evidence / Product
Status: Planning lane
Scope (current repo state): Define one bounded Phoenix-adjacent lane centered on a single span annotation as an external evidence signal. This plan does not propose broad Phoenix support, trace export support, dataset or experiment support, prompt management support, or platform-level observability import.

1. Why `P24` should exist¶

Arize-ai/phoenix is now one of the clearest active adjacencies for Assay:

it is public and actively maintained,
it explicitly spans tracing, evaluation, datasets, and experiments,
and it already names small feedback surfaces instead of only platform-wide dashboards.

That matters because Assay does not need Phoenix as a platform.

It needs the smallest honest external-consumer seam that:

already exists in named public docs,
is small enough to review without importing Phoenix platform truth,
and is likely to stay meaningful even if the broader product surface keeps evolving.

The strongest first wedge is not a whole experiment or a whole trace.

It is:

one bounded span annotation,
on one named span target,
with one small result bag,
and only the minimal provenance needed to treat it as imported external evidence.

2. Why Phoenix and why now¶

Phoenix publicly positions itself as:

tracing,
evaluation,
datasets,
experiments,
playground,
and prompt management.

That is exactly why the first Assay wedge must be narrower than "Phoenix interop."

Phoenix's own annotation documentation already exposes a smaller public shape:

annotations can target spans, documents, traces, and sessions,
every annotation has an entity id plus a name,
and the result surface is explicitly bounded to one or more of label, score, and explanation, with optional annotator_kind, identifier, and metadata.

The Python client examples then make the seam concrete through:

client.spans.add_span_annotation(...)
client.traces.add_trace_annotation(...)
related document/session annotation helpers

That is exactly the kind of public, named, smaller-than-platform surface Assay should prefer.

3. Why this seam and not broader Phoenix surfaces¶

3.1 Why span annotations first¶

Phoenix offers four annotation targets:

span
document
trace
session

The smallest honest v1 wedge is the span annotation because it is:

entity-scoped,
feedback-shaped,
already public in docs,
and smaller than end-to-end trace or session judgments.

It also avoids importing document payloads and conversation/session semantics.

3.2 Why not full trace truth¶

Phoenix tracing is important and also far too broad as the first wedge.

A trace-first lane would immediately widen into:

full span trees,
prompt/completion payloads,
token/cost/runtime context,
and broader observability semantics.

That is the wrong first move.

3.3 Why not experiments first¶

Phoenix experiments are useful, but the first public experiment surface already comes wrapped in:

datasets,
tasks,
evaluators,
and aggregate experiment results.

That is productively interesting and too broad for v1.

For Assay, experiments are a runner-up lane, not the first lane.

3.4 Why not datasets, playground, or prompt management¶

Those surfaces are all real and all broader than the first evidence seam.

They widen immediately into:

content truth,
version-management truth,
replay semantics,
or prompt-management semantics.

That is not the first external evidence wedge.

4. Hard positioning rule¶

This lane must stay smaller than the upstream ecosystem name.

Normative framing:

P24 v1 claims only one bounded Phoenix span annotation artifact as an imported external evaluation signal. It does not claim trace truth, experiment truth, evaluator truth, dataset truth, prompt truth, or Phoenix platform truth.

That means:

Phoenix remains the source system, not Assay truth
the annotation is imported as external evidence, not as platform judgment we inherit wholesale
Assay stays smaller than Phoenix itself

Common anti-overclaim sentence:

We are not asking Assay to inherit Phoenix trace, experiment, evaluator, or dashboard semantics as truth.

5. Proposed v1 seam¶

The first honest seam is:

one span_id-scoped annotation
one name
one bounded result bag using only the fields Phoenix already names publicly
one optional annotator_kind
one optional identifier
exactly one annotation object per artifact
no broad trace or experiment wrapper

In Phoenix terms, the public shape we are leaning on is:

entity id: span_id
name
one or more of label, score, explanation
optional annotator_kind
optional identifier
optional metadata

For Assay, the imported artifact should stay smaller than that full permissive shape when necessary.

That means:

no batch annotation arrays
no multi-span wrapper
no "annotations export" bundle in one v1 artifact

6. Canonical artifact boundaries¶

The Assay-side reduced artifact for this lane should stay on:

Required:

schema
framework
surface
entity_kind
entity_id_ref
annotation_name
result
timestamp

Optional:

annotator_kind
identifier
metadata_ref

Recommended fixed values for v1:

framework = "phoenix"
surface = "span_annotation"
entity_kind = "span"

6.1 `entity_id_ref`¶

This is the bounded imported reference to the annotated span.

It must remain:

opaque
non-resolving
entity-scoped

It must not become:

a full span payload
a trace tree
a platform lookup URL

6.2 `annotation_name`¶

This should preserve the upstream annotation label/name directly when possible.

It must not be widened into:

evaluator definitions,
taxonomy truth,
or a Phoenix-global ontology claim.

6.3 `result`¶

This should preserve only the bounded result fields that are actually present.

For v1, that means:

label when present
score when present
explanation when present and bounded

For v1, at least one of label or score must be present.

explanation remains optional and bounded reviewer context, not a primary carrier of the seam.

Whitespace-only or effectively empty explanations should be:

omitted during reduction,
or treated as malformed if the upstream payload is trying to rely on them as the only result content.

The mapper must not invent absent result fields.

6.4 `annotator_kind`¶

This is allowed only when naturally present upstream.

It is a useful bounded provenance field and should remain on the named Phoenix surface:

HUMAN
LLM
CODE

It is observed provenance only.

It must not be interpreted as:

evaluator-quality truth
runtime-quality truth
or a guarantee about how reliable the annotation process was

It must not be widened into evaluator-runtime semantics.

6.5 `identifier`¶

This is allowed only as a bounded upsert/reference aid when naturally present.

Its presence may reflect upstream overwrite/upsert behavior.

It must not be treated as:

a durable global identity guarantee,
a cross-system reconciliation contract,
or cross-run deduplication truth

Its absence must not be read as evidence that the annotation has no upstream natural grouping or overwrite behavior.

6.6 `metadata_ref`¶

Phoenix annotations allow optional metadata.

For Assay v1, raw metadata should not be imported by default.

If metadata matters at all, keep it as:

omitted,
or a bounded metadata_ref / hash / reviewer aid

unless live proof shows a tiny stable subset is honestly necessary.

Raw annotation metadata inline is malformed for v1 unless a later discovery pass proves that a tiny stable subset is honestly necessary.

7. What is explicitly out of scope¶

This lane is not for:

multiple annotations in one artifact
batch annotation wrappers
multiple span targets in one artifact
full trace export
span trees
experiment runs
dataset example payloads
prompt/completion bodies
Phoenix dashboard URLs
full annotation metadata objects
session or document annotation truth

The first lane should stay on one span annotation artifact only.

Any v1 artifact containing:

multiple annotations
multiple span targets
or a batch/export wrapper

should be malformed rather than partially imported.

8. Why this pick wins over the current shortlist¶

The current adjacent shortlist was:

Phoenix
LangWatch
LangChain AgentEvals / agentevals-dev
OpenHands SDK / benchmarks
AutoGen
browser-use
smolagents

Phoenix wins the first P24 slot because it combines:

a large active upstream,
explicit evaluation vocabulary,
explicit annotation vocabulary,
and a public surface that is already smaller than experiments or tracing.

That is better than:

LangWatch for v1, because LangWatch still pulls more naturally toward trace/run wrappers than a single tiny feedback object
AgentEvals for v1, because those repos are very clean but less likely to produce high-value upstream seam clarification than Phoenix
OpenHands, AutoGen, browser-use, and smolagents, because those all create stronger drift pressure toward runtime/session/task truth

Runner-up order after Phoenix:

LangWatch evaluation item result lane
LangChain AgentEvals trajectory result lane
OpenHands benchmark case result lane

9. Discovery gate before any sample claim¶

P24 should not freeze a sample from docs alone.

Required discovery order:

create one real Phoenix span annotation through the public client or API
retrieve that annotation again through the public get/list path when possible, so created and retrieved payloads can be compared
keep raw created payload and raw retrieved payload separate from the reduced Assay artifact
build a field presence/absence table
reduce to the smallest honest artifact
only then freeze fixtures and mapper behavior

The lane should fail closed if live proof shows the public payload is thinner or differently shaped than the first frozen sample.

10. Outward strategy¶

The outward thread should stay in our normal shape:

sample-first
narrow question
no integration pitch
no platform-support framing

The core upstream question should be:

If an external evidence consumer wants the smallest honest Phoenix feedback surface, is one reduced artifact derived from a span annotation roughly the right seam, or is there a thinner public annotation/result surface you would rather point them at?

That keeps the discussion on:

one public object,
one boundary question,
and no broader Phoenix-import claim.

11. Implementation phases¶

Phase A — Discovery¶

Deliverables:

one real span annotation capture
raw created payload freeze
raw retrieved payload freeze when available
field presence table
observed-vs-derived notes

Acceptance:

no field in the reduced artifact lacks live backing
raw and reduced shapes are kept separate
create-path and retrieve-path differences are explicitly noted when present

Phase B — Sample lane¶

Deliverables:

examples/phoenix-span-annotation-evidence/
mapper
valid / failure / malformed fixtures
generated placeholder Assay outputs

Acceptance:

valid maps
failure maps
malformed fails fast on metadata drift, batch drift, and multi-annotation drift

Phase C — Outward check¶

Deliverables:

one maintainer-facing issue/discussion
one tiny sample link
one sharply bounded seam question

Acceptance:

no broad Phoenix support claim leaks in
the lane remains annotation-first

12. Success criteria¶

This plan succeeds when:

Assay has one honest Phoenix-adjacent sample lane
the lane stays on one span annotation artifact
the sample does not import trace or experiment truth
the outward question asks only whether span annotation is the right minimal seam

13. Final judgment¶

P24 should be a Phoenix annotation-first lane, not a Phoenix platform lane.

The right first wedge is:

one bounded span annotation,
one small result bag,
one span-scoped reference,
and nothing broader.

PLAN — P24 Phoenix Span Annotation Evaluation-Signal Evidence Interop (2026 Q2)¶

1. Why P24 should exist¶

2. Why Phoenix and why now¶

3. Why this seam and not broader Phoenix surfaces¶

3.1 Why span annotations first¶

3.2 Why not full trace truth¶

3.3 Why not experiments first¶

3.4 Why not datasets, playground, or prompt management¶

4. Hard positioning rule¶

5. Proposed v1 seam¶

6. Canonical artifact boundaries¶

6.1 entity_id_ref¶

6.2 annotation_name¶

6.3 result¶

6.4 annotator_kind¶

6.5 identifier¶

6.6 metadata_ref¶

7. What is explicitly out of scope¶

8. Why this pick wins over the current shortlist¶

9. Discovery gate before any sample claim¶

10. Outward strategy¶

11. Implementation phases¶

Phase A — Discovery¶

Phase B — Sample lane¶

Phase C — Outward check¶

12. Success criteria¶

13. Final judgment¶

1. Why `P24` should exist¶

6.1 `entity_id_ref`¶

6.2 `annotation_name`¶

6.3 `result`¶

6.4 `annotator_kind`¶

6.5 `identifier`¶

6.6 `metadata_ref`¶