OpenTelemetry & Langfuse¶
Assay does not try to replace your observability stack.
Use Langfuse, OTel collectors, or your existing tracing pipeline for live visibility. Use Assay when you want to turn those traces into:
- deterministic replay input
- policy gates in CI
- tamper-evident evidence bundles for audit handoff
The Flow¶
Agent framework -> OTel / Langfuse -> JSONL export -> assay trace ingest-otel -> Assay replay + evidence
1. Export OpenTelemetry JSONL¶
Assay's OTel ingest path expects OpenTelemetry-style JSONL spans aligned with GenAI semantic conventions.
At minimum, emit:
gen_ai.promptgen_ai.tool.namegen_ai.tool.argsgen_ai.completion
If your stack already sends spans to Langfuse, keep doing that. Assay can consume the same exported trace data as a downstream governance step.
2. Ingest Into Assay¶
assay trace ingest-otel \
--input otel-export.jsonl \
--db .eval/eval.db \
--out-trace traces/otel.v2.jsonl
What this gives you:
- a normalized Assay trace dataset in SQLite for downstream processing
- an optional replay trace file for deterministic CI runs
3. Gate and Replay¶
assay ci \
--config eval.yaml \
--db .eval/eval.db \
--trace-file traces/otel.v2.jsonl \
--replay-strict
--replay-strict keeps the run offline and deterministic. If a prompt is missing from the trace file, the run fails instead of calling a live model.
4. Export Evidence¶
assay evidence export --profile profile.yaml --out evidence.tar.gz
assay evidence verify evidence.tar.gz
Now you have both:
- observability in your existing stack
- a replayable, verifiable evidence artifact in Assay
Assay-Native Metric Spans¶
When you run Assay directly, the runner now emits one assay.eval.metric span per metric evaluation. That span is additive to any upstream agent traces you already collect and gives you a stable place to inspect policy latency and result state.
Fields recorded on assay.eval.metric:
assay.eval.test_idassay.eval.metric.nameassay.eval.response.cachedassay.eval.metric.scoreassay.eval.metric.passedassay.eval.metric.unstableassay.eval.metric.duration_mserrorerror.message
This span intentionally records evaluation metadata, not raw prompts or tool arguments, so it stays useful for latency triage without widening the observability payload surface.
Langfuse Positioning¶
Langfuse is great for tracing, prompts, and production observability. Assay sits next to it:
- Langfuse answers: "What happened in production?"
- Assay answers: "Was this tool call allowed, reproducible, and audit-ready?"