
OpenTelemetry & Langfuse

Assay does not try to replace your observability stack.

Use Langfuse, OTel collectors, or your existing tracing pipeline for live visibility. Use Assay when you want to turn those traces into:

  • deterministic replay input
  • policy gates in CI
  • tamper-evident evidence bundles for audit handoff

The Flow

Agent framework -> OTel / Langfuse -> JSONL export -> assay trace ingest-otel -> Assay replay + evidence

1. Export OpenTelemetry JSONL

Assay's OTel ingest path expects spans as OpenTelemetry-style JSONL (one JSON object per line) that follow the GenAI semantic conventions.

At minimum, emit these span attributes:

  • gen_ai.prompt
  • gen_ai.tool.name
  • gen_ai.tool.args
  • gen_ai.completion
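As a rough preflight check before ingesting, the sketch below scans exported JSONL for the four attributes listed above. The span envelope used here (a top-level "name" plus a flat "attributes" map) is an assumption for illustration, not Assay's documented input schema, and treating all four keys as required on every span is a simplification. The helper name missing_keys is hypothetical.

```python
import json

# The four attribute keys named in the list above.
REQUIRED_KEYS = (
    "gen_ai.prompt",
    "gen_ai.tool.name",
    "gen_ai.tool.args",
    "gen_ai.completion",
)


def missing_keys(jsonl_text):
    """Return {line_number: [missing keys]} for spans that lack a required key."""
    problems = {}
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines in the export
        span = json.loads(line)
        # Assumption: attributes live in a flat map under "attributes".
        attrs = span.get("attributes", {})
        missing = [key for key in REQUIRED_KEYS if key not in attrs]
        if missing:
            problems[lineno] = missing
    return problems


# Hypothetical span for illustration; values are made up.
example_span = {
    "name": "agent.step",
    "attributes": {
        "gen_ai.prompt": "What is the weather in Paris?",
        "gen_ai.tool.name": "get_weather",
        "gen_ai.tool.args": json.dumps({"city": "Paris"}),
        "gen_ai.completion": "It is sunny in Paris.",
    },
}
example_line = json.dumps(example_span)
```

Running missing_keys over the contents of your export and getting an empty dict back means every span carries the minimum attribute set under these assumptions.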

If your stack already sends spans to Langfuse, keep doing that. Assay can consume the same exported trace data as a downstream governance step.

2. Ingest Into Assay

assay trace ingest-otel \
  --input otel-export.jsonl \
  --db .eval/eval.db \
  --out-trace traces/otel.v2.jsonl

What this gives you:

  • a normalized Assay trace dataset in SQLite for downstream processing
  • an optional replay trace file (the --out-trace path) for deterministic CI runs

3. Gate and Replay

assay ci \
  --config eval.yaml \
  --db .eval/eval.db \
  --trace-file traces/otel.v2.jsonl \
  --replay-strict

--replay-strict keeps the run offline and deterministic. If a prompt is missing from the trace file, the run fails instead of calling a live model.
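The strict-replay behavior described above can be pictured with a small sketch. This is not Assay's implementation: the exact-string prompt lookup, the ReplayMiss exception, and the replay_response function are all hypothetical names chosen to illustrate the "fail instead of calling a live model" rule.

```python
class ReplayMiss(Exception):
    """Raised when strict replay finds no recorded response for a prompt."""


def replay_response(prompt, recorded, strict=True, live_call=None):
    """Return the recorded response for a prompt.

    recorded:  dict mapping prompt text to the recorded completion
               (assumption: exact-string matching).
    strict:    when True, a missing prompt is an error and no live call is made.
    live_call: fallback callable used only when strict is False.
    """
    if prompt in recorded:
        return recorded[prompt]
    if strict:
        # Mirrors the documented behavior: the run fails rather than going online.
        raise ReplayMiss(f"no recorded response for prompt: {prompt!r}")
    if live_call is None:
        raise ValueError("non-strict replay needs a live_call fallback")
    return live_call(prompt)
```

The point of the strict branch is that a gap in the trace file surfaces as a hard failure in CI, so a passing run is guaranteed to have used only recorded data.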

4. Export Evidence

assay evidence export --profile profile.yaml --out evidence.tar.gz
assay evidence verify evidence.tar.gz

Now you have both:

  • observability in your existing stack
  • a replayable, verifiable evidence artifact in Assay
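If you want to chain steps 2 through 4 in one script, a minimal sketch follows. It uses only the commands and flags shown above; the function names pipeline_commands and run_pipeline are hypothetical, and the sketch assumes the assay binary is on PATH and signals failure with a non-zero exit code.

```python
import subprocess


def pipeline_commands(otel_export, db, trace_file, config, profile, bundle):
    """Build the argv lists for ingest, gate/replay, evidence export, and verify."""
    return [
        ["assay", "trace", "ingest-otel",
         "--input", otel_export, "--db", db, "--out-trace", trace_file],
        ["assay", "ci",
         "--config", config, "--db", db,
         "--trace-file", trace_file, "--replay-strict"],
        ["assay", "evidence", "export", "--profile", profile, "--out", bundle],
        ["assay", "evidence", "verify", bundle],
    ]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; check=True stops at the first failing step."""
    for argv in commands:
        runner(argv, check=True)


# Example call with the paths used in this guide (not executed here):
# run_pipeline(pipeline_commands(
#     "otel-export.jsonl", ".eval/eval.db", "traces/otel.v2.jsonl",
#     "eval.yaml", "profile.yaml", "evidence.tar.gz"))
```

Because check=True raises subprocess.CalledProcessError on a non-zero exit, a failed gate prevents the evidence export from running.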

Assay-Native Metric Spans

When you run Assay directly, the runner now emits one assay.eval.metric span per metric evaluation. That span is additive to any upstream agent traces you already collect and gives you a stable place to inspect policy latency and result state.

Fields recorded on assay.eval.metric:

  • assay.eval.test_id
  • assay.eval.metric.name
  • assay.eval.response.cached
  • assay.eval.metric.score
  • assay.eval.metric.passed
  • assay.eval.metric.unstable
  • assay.eval.metric.duration_ms
  • error
  • error.message

This span intentionally records evaluation metadata only, not raw prompts or tool arguments. It stays useful for latency triage without exposing additional payload data through your observability pipeline.
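One way to use these fields for triage is sketched below. It assumes the spans have already been exported as dicts with a "name" and a flat "attributes" map, that passed and unstable are booleans, and that duration_ms is numeric; none of that is confirmed by this page. The triage function and the 500 ms default threshold are hypothetical.

```python
def triage(spans, max_duration_ms=500.0):
    """Return (test_id, metric_name, reasons) for failed, unstable, or slow evaluations."""
    flagged = []
    for span in spans:
        if span.get("name") != "assay.eval.metric":
            continue  # ignore upstream agent spans
        attrs = span.get("attributes", {})
        reasons = []
        if attrs.get("assay.eval.metric.passed") is False:
            reasons.append("failed")
        if attrs.get("assay.eval.metric.unstable"):
            reasons.append("unstable")
        if attrs.get("assay.eval.metric.duration_ms", 0) > max_duration_ms:
            reasons.append("slow")
        if reasons:
            flagged.append((
                attrs.get("assay.eval.test_id"),
                attrs.get("assay.eval.metric.name"),
                reasons,
            ))
    return flagged
```

Since the span carries no prompt or tool-argument data, a report built this way can be shared more widely than the underlying traces.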

Langfuse Positioning

Langfuse is great for tracing, prompt management, and production observability. Assay sits next to it:

  • Langfuse answers: "What happened in production?"
  • Assay answers: "Was this tool call allowed, reproducible, and audit-ready?"

See Also