PLAN — P13 Langfuse Experiment Result Evidence Interop (2026 Q2)¶

Date: 2026-04-08
Owner: Evidence / Product
Status: Planning
Scope (this PR): Define the next platform-adjacent upstream interop lane after Browser Use and Visa TAP. No sample implementation, no outward Discussion, and no contract freeze in this slice.

1. Why this plan exists¶

After the current framework, protocol, runtime-accounting, eval-report, adjacent output/history, and TAP verification wave, the next lane should still pass the same three tests:

the upstream project already exposes one bounded surface,
Assay can consume that surface without inheriting upstream platform semantics as truth,
the repo has a natural maintainer channel for one small sample-backed boundary question.

langfuse/langfuse now fits that pattern well enough to justify a formal plan:

the repo is large, active, and visibly current
GitHub Discussions are enabled and actively used
the docs expose a clear evaluation stack through datasets, experiments, and scores
the platform is explicitly API-first and already frames data exports as part of the product story

This lane is especially relevant at the bleeding edge in 2026 because Langfuse is evolving in two directions at once:

observations-first observability, which makes trace surfaces broader and more operationally central
experiments-first evaluation workflows, which make offline, bounded eval result surfaces easier to isolate from live production tracing

That combination creates a clear planning rule:

do not open Langfuse as another trace-export lane
do not pitch Assay as a competing observability platform
do start from the smallest experiment-result seam that Langfuse already documents

This is not a trace export plan.

This is not a dashboard or metrics export plan.

This is not a prompt-management plan.

This is a plan for a bounded experiment-result surface derived from the documented Langfuse evaluation path.

2. Why Langfuse is now the next best candidate¶

As of 2026-04-08, the current active and just-opened lanes already cover:

framework trace surfaces
eval-report surfaces
runtime-accounting surfaces
adjacent output/history surfaces
commerce / verification-first protocol surfaces

That changes what “best next lane” means.

The next lane should now:

stay GitHub-native
open a different seam from the work already live
still have strong momentum and a real maintainer channel
be current enough to reflect where the ecosystem is actually heading

Langfuse is the strongest fit for that next move because:

its evaluation stack is now clearly first-class, not a side feature
its docs and SDK references already expose experiments, datasets, and custom scores as explicit product surfaces
its Discussions categories include a natural answerable Support lane
it is strategically important without forcing us straight into a browser, payment, or telemetry-heavy surface

Why this is not Mastra first:

Mastra remains a good candidate
but the repo has no Discussions
and its issue channel is more bug and triage heavy right now

Why this is not x402 first:

x402 remains technically interesting
but the repo currently has no Issues and no Discussions
and the public channel shape is too weak for the small sample-backed question style that is working best for us

Why this is not OpenLIT first:

OpenLIT remains a valid OTel-native special case
but it is smaller, and the lane would be much more observability-colored than the current Langfuse opportunity

3. Hard positioning rule¶

This lane must not overclaim what the sample actually observes.

Normative framing:

This sample targets the smallest honest Langfuse experiment-result surface, not a trace export, dashboard export, metrics export, or production observability truth surface.

That means:

Langfuse is the upstream platform context, not the truth source
experiment results are bounded eval artifacts, not production truth
Assay stays an external evidence consumer, not a scorer, dashboard, or trace authority

4. Why not trace-first¶

Trace-first would be the wrong first wedge here.

Why:

Langfuse is publicly and product-wise strongest as an observability platform
trace-first would read immediately as platform-on-platform
the v4 observations-first direction makes live trace surfaces broader, not smaller
Langfuse also supports live evaluators on production traces, which would make it too easy to blur runtime monitoring semantics into Assay evidence truth

The safer first seam is one step away from production observability:

dataset-backed
experiment-backed
offline or reviewable
smaller than traces
smaller than dashboards
smaller than metrics rollups

This is the same best-practice move we used in earlier lanes:

choose the smallest documented surface that already exists
avoid the broadest operational surface just because it is prominent

5. Why experiment-result-first and not score-only-first¶

Langfuse also exposes scores as a first-class concept:

numeric, boolean, and categorical values
custom scores via SDK or API
scores attached across broader observability and evaluation workflows

That is real, but score-only is still not the best v1 seam.

Why not score-only first:

scores can belong to traces, observations, or experiments
score-only export loses the bounded experiment context that makes the seam reviewable
score-only makes it easier to slide back into production-evaluation or observability semantics

The cleaner first wedge is:

one artifact derived from run_experiment()
one bounded ExperimentItemResult-style output
one bounded list of Evaluation-style results
optional trace_ref only if it is naturally present in the chosen sample shape

That still keeps Langfuse tied to scores and evaluation, but it does so inside the smallest experiment-result frame instead of the broadest scoring frame.

6. Recommended v1 seam¶

Use one frozen serialized artifact derived from the documented Langfuse experiment result path as the first external-consumer seam.

The intended upstream anchors are:

run_experiment()
ExperimentResult
ExperimentItemResult
bounded Evaluation results

The frozen sample shape should stay at the item result level, not the full run export and not a whole trace tree.

The first artifact should therefore center on:

one experiment name
one dataset identity and one bounded dataset version reference
one item reference
one short output representation
one bounded list of evaluations
no aggregate layer unless it is naturally present and still small
optional opaque run or trace references only if needed

Important framing rule:

The sample uses a frozen serialized artifact derived from the documented Langfuse experiment-result surface, not a claim that Langfuse already guarantees one fixed wire-export contract for external evidence consumers.

7. v1 artifact contract¶

7.1 Required fields¶

The first sample should require:

schema
platform
surface
experiment_name
dataset_name
dataset_version_ref
item_ref
timestamp
output_ref
evaluations

These required fields belong to the frozen sample artifact shape.

They must be described as:

sample-level reductions derived from documented Langfuse experiment results
not an upstream guarantee that Langfuse already ships one canonical serialized export object with these exact field names

dataset_version_ref remains required in this v1 plan because it keeps the artifact reviewable in dataset/experiment context, but that is still a frozen sample-shape choice, not a claim that every smallest honest Langfuse export must carry that exact field in that exact form.

7.2 Optional fields¶

The first sample may include:

run_ref
trace_ref
experiment_description_ref
metadata_ref
aggregate_scores

7.3 Important field boundaries¶

`dataset_version_ref`¶

This field is required because dataset versioning is part of the documented Langfuse experiments story.

In v1, it must stay a bounded version pointer only:

timestamp
short version label
short normalized version reference

Not allowed in v1:

full dataset export
full dataset item payload replay
dataset schema dump

This field belongs to the frozen sample shape, not to an upstream claim that Langfuse guarantees one universal serialized dataset-version contract for external consumers.

`item_ref`¶

This field is required and must stay an opaque dataset-item reference only.

Allowed:

item id
short item label
short opaque reference

Not allowed in v1:

full dataset input body as the primary seam
full expected-output body as the primary seam
large source-trace payloads

`output_ref`¶

This field is required in the frozen sample shape.

It should be framed as a short frozen representation derived from the experiment item output, not necessarily the full upstream output body.

It should also be described explicitly as a sample-level reduction of the documented experiment item output surface, not as an upstream-guaranteed serialized export field.

It remains upstream output semantics only.

It must not become:

model-quality truth
prompt-quality truth
business-outcome truth

The sample should prefer:

short string output
short bounded object
short opaque output label

Not:

full prompt transcript
full chain-of-thought style payload
large generated body

`evaluations`¶

This field is required in the frozen sample shape.

It should be framed as a bounded list derived from documented Evaluation results, not as a claim that Langfuse already guarantees one universal serialized eval-result wire contract.

Each evaluation entry should stay very small:

name
value
optional short data_type
optional short classifier or label only if the chosen sample shape naturally carries one

Not allowed in v1:

evaluator prompts
evaluator reasoning transcripts
large free-text judge explanations
raw dashboard rollups

This field is where Langfuse’s eval-result seam is real, but it must still stay a bounded sample reduction rather than a platform export claim.

`trace_ref`¶

This field is optional in v1.

The sample should prefer omitting it unless it is naturally present in the chosen experiment-result shape.

If present, it must remain:

an opaque reference
a short id
a bounded link target

It must not turn the lane back into a trace-first plan.

`aggregate_scores`¶

This field is optional in v1.

The sample remains complete without any aggregate score field.

The v1 sample should prefer omitting it unless it is naturally present in the chosen experiment-result shape and still stays very small.

If included, it must stay bounded:

one short summary object
one short scalar or small list

It must not become:

dashboard export
metrics export
production health truth

Optional references¶

The optional reference fields must stay bounded:

short label
opaque id
short normalized reference

Not allowed in v1:

full trace trees
prompt registry payloads
full metadata blobs
rich dashboard configuration

8. Assay-side meaning¶

The sample may only claim bounded experiment-result observation.

Assay must not treat as truth:

production trace truth
dashboard truth
metrics truth
prompt-quality truth
model-quality truth beyond the observed eval artifact
user-feedback truth beyond the observed score or label
Langfuse platform semantics as a whole

Common anti-overclaim sentence:

We are not asking Assay to inherit Langfuse trace semantics, dashboard semantics, metrics semantics, or evaluation semantics as truth.

9. Concrete repo deliverable¶

If this plan is accepted, the next implementation PR should add:

examples/langfuse-experiment-evidence/README.md
examples/langfuse-experiment-evidence/requirements.txt only if a tiny local generator truly needs it
examples/langfuse-experiment-evidence/generate_synthetic_result.py only if a small local or self-hosted generator is viable
examples/langfuse-experiment-evidence/map_to_assay.py
examples/langfuse-experiment-evidence/fixtures/valid.langfuse.json
examples/langfuse-experiment-evidence/fixtures/failure.langfuse.json
examples/langfuse-experiment-evidence/fixtures/malformed.langfuse.json
examples/langfuse-experiment-evidence/fixtures/valid.assay.ndjson
examples/langfuse-experiment-evidence/fixtures/failure.assay.ndjson

Fixture boundary notes:

v1 fixtures may omit every optional reference field
v1 fixtures must keep trace_ref optional and preferably absent
v1 fixtures must not include full trace trees or dashboard exports
v1 fixtures should keep the export shape obviously experiment-result-first rather than trace-first

10. Generator policy¶

The implementation should prefer a real local generator only if it stays small and deterministic.

Langfuse is more promising than some earlier lanes because:

the platform is open source and self-hostable
experiments can run against local or hosted datasets
the experiment runner is now an official SDK concept

Even so, the first sample should still stay strict about setup cost.

Practical expectation for v1:

start from a docs-backed frozen artifact shape by default
only switch to a real local generator if the setup turns out to be surprisingly small and does not overshadow the seam

10.1 Preferred path¶

Preferred:

a small SDK-driven experiment path
one local or tiny self-hosted Langfuse setup only if it is truly lightweight
one bounded evaluator
no production-trace dependency
no dashboard dependency
no prompt-management dependency

10.2 Hard fallback rule¶

If a real local generator would require:

a full Langfuse stack bootstrap heavy enough to dominate the seam
cloud credentials
project or environment setup that outweighs the sample itself
more observability plumbing than experiment-result logic

then the sample should fall back to a docs-backed frozen experiment-result artifact shape.

That fallback is acceptable here because the goal is the smallest honest external-consumer seam, not a full Langfuse platform tutorial.

11. Valid, failure, malformed corpus¶

The first sample should follow the established corpus pattern.

11.1 Valid¶

One strong experiment item result with:

short output_ref
one or more bounded evaluations
a bounded dataset version reference

11.2 Failure¶

One lower-scoring experiment item result with:

short output_ref
one or more bounded evaluations showing a lower or negative result
the same bounded experiment-result shape

This is not a platform failure or infrastructure failure artifact.

It is only a weaker experiment-result artifact represented in the repo corpus under the established failure naming pattern.

11.3 Malformed¶

One malformed artifact that fails fast, for example:

missing evaluations
missing dataset_version_ref
evaluations not a list
unsupported evaluation entry shape

12. Outward strategy¶

Do not open an outward Langfuse Discussion until the sample is on main.

After that:

one small GitHub Discussion
category: Support
one link
one boundary question
no broad platform pitch
no trace pitch
no dashboard pitch

Why Support:

it is the answerable category
the question is about the smallest honest seam
it is the best pragmatic fit for one technical boundary question even if it is not a perfect semantic match for a planning-style interop post
that reads more like a technical boundary question than Share your Work or Ideas

13. Sequencing rule¶

The active Browser Use and Visa TAP lanes are now live.

Meaning:

formalize P13 now
let the fresh Browser Use and TAP outward threads breathe unless a maintainer responds
if no hot follow-up needs attention, implement the P13 sample next
open the Langfuse Support Discussion only after the sample lands on main

14. Non-goals¶

This plan does not:

define a Langfuse trace export contract
define a Langfuse dashboard export contract
define a metrics export contract
define a prompt-management export contract
define Langfuse evals or scores as Assay truth
define a platform-to-platform synchronization story

PLAN — P13 Langfuse Experiment Result Evidence Interop (2026 Q2)¶

1. Why this plan exists¶

2. Why Langfuse is now the next best candidate¶

3. Hard positioning rule¶

4. Why not trace-first¶

5. Why experiment-result-first and not score-only-first¶

6. Recommended v1 seam¶

7. v1 artifact contract¶

7.1 Required fields¶

7.2 Optional fields¶

7.3 Important field boundaries¶

dataset_version_ref¶

item_ref¶

output_ref¶

evaluations¶

trace_ref¶

aggregate_scores¶

Optional references¶

8. Assay-side meaning¶

9. Concrete repo deliverable¶

10. Generator policy¶

10.1 Preferred path¶

10.2 Hard fallback rule¶

11. Valid, failure, malformed corpus¶

11.1 Valid¶

11.2 Failure¶

11.3 Malformed¶

12. Outward strategy¶

13. Sequencing rule¶

14. Non-goals¶

References¶

`dataset_version_ref`¶

`item_ref`¶

`output_ref`¶

`evaluations`¶

`trace_ref`¶

`aggregate_scores`¶