Testing Agents with Assay
Assay provides first-class support for testing AI agents, including function calling, tool-use sequences, and multi-step reasoning.
Overview
Testing agents is harder than testing simple RAG pipelines because:
1. Non-determinism: Agents may take different paths (tool calls) to reach the same result.
2. Side-effects: Running agents live (with tools) in CI is slow, expensive, and risky.
3. Complexity: You need to assert on the intermediate steps (did it call the search tool?), not just the final answer.

Assay solves this with:
* OpenTelemetry Ingestion: Record traces from your actual agent framework (LangChain, AutoGen, custom).
* Dual-Mode Replay: Use recorded traces to "replay" the agent's execution without live LLM calls, while verifying assertions against the structured execution graph (Episodes, Steps, Tool Calls).
* Behavioral Assertions: Built-in assertions for tool usage, sequence enforcement, and more.
Real-World Use Cases (2025)
Assay is designed for the challenges of modern AI engineering:
1. "Compliance-First" Agents (FinTech/Health)
- Context: Autonomous agents performing sensitive actions (e.g., "block card", "change limit").
- Problem: Non-determinism in CI is unacceptable for auditors. You need reproducible evidence that the agent never calls unauthorized tools.
- Solution: Deterministic Replay + Tool Assertions.
- Value: Checks strict protocol adherence in CI without live LLM calls, which enables true "unit testing" for autonomous agents.
2. High-Velocity RAG Pipelines (Cost-Effective CI)
- Context: Teams shipping daily updates to prompts and retrieval logic.
- Problem: Running full regression suites with GPT-4o for every commit is too slow and expensive.
- Solution: Offline Replay Mode (--replay-strict).
- Value: Developers can test the full flow locally and in CI with no LLM cost and millisecond latency.
3. Model Migration & Validation (The "Exit Strategy")
- Context: Migrating from expensive hosted models to specialized, smaller, or on-premise models.
- Problem: Verifying that the new model is "good enough" without manual review.
- Solution: Baseline Regression Testing (assay ci --baseline).
- Value: Use existing traces as a baseline to flag semantic deviations in the new model.
1. Instrumentation (OpenTelemetry)
Assay ingests traces that follow the OpenTelemetry (OTel) GenAI Semantic Conventions. Most Python/JS frameworks support OTel export.

Ensure your traces include:
* gen_ai.prompt in the span attributes (for the model call).
* gen_ai.tool.name and gen_ai.tool.args for tool calls.
* gen_ai.completion for the final response.
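Before ingesting, it can help to sanity-check that a recorded file actually carries these attributes. The sketch below is not part of Assay. It assumes each line of the JSONL file is a span object with a flat "attributes" dict, which may differ from what your exporter writes; the function names are hypothetical.

```python
import json


def load_spans(path):
    """Read one JSON span object per non-empty line (assumed file layout)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def check_trace(spans):
    """Return a list of human-readable problems; empty means the trace looks usable."""
    problems = []
    # Assumption: attributes live in a flat dict under the "attributes" key.
    attrs = [span.get("attributes", {}) for span in spans]

    if not any("gen_ai.prompt" in a for a in attrs):
        problems.append("no span carries gen_ai.prompt")
    if not any("gen_ai.completion" in a for a in attrs):
        problems.append("no span carries gen_ai.completion")

    # A tool-call span should carry both the tool name and its arguments.
    for index, a in enumerate(attrs):
        if ("gen_ai.tool.name" in a) != ("gen_ai.tool.args" in a):
            problems.append(f"span {index}: tool name and args must appear together")
    return problems


if __name__ == "__main__":
    sample = [
        {"attributes": {"gen_ai.prompt": "Find the latest release notes"}},
        {"attributes": {"gen_ai.tool.name": "web_search", "gen_ai.tool.args": "{}"}},
        {"attributes": {"gen_ai.completion": "Here is a summary."}},
    ]
    print(check_trace(sample))
```

Running a check like this on otel_trace.jsonl (via load_spans) before ingestion surfaces instrumentation gaps early, rather than as confusing assertion failures later.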
2. Ingestion & Replay
To enable fast, deterministic CI, we use a "Dual Output" strategy:
1. Ingest to DB: For deep structural assertions (SQL-backed).
2. Emit Trace File: For replay capability (mocking the LLM).
Workflow
- Record: Run your agent (locally or in staging) to generate an otel_trace.jsonl file.
- Ingest: Use assay trace ingest-otel to convert this into Assay's format.
# Ingest OTel spans -> SQLite DB (assertions) + Replay File (LLM mock)
assay trace ingest-otel \
--input otel_trace.jsonl \
--db .eval/eval.db \
--suite my-agent-suite \
--out-trace otel.v2.jsonl
- Run Gate: Run assay ci using the generated replay file.
# Run assertions using the captured trace data
assay ci \
--config eval.yaml \
--db .eval/eval.db \
--trace-file otel.v2.jsonl \
--replay-strict
--replay-strict: Ensures NO live LLM calls are made. If a prompt is not found in the trace file, the test fails.
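The strict-replay behavior described above can be pictured as a lookup table with no fallback. The following is a conceptual sketch only, not Assay's implementation: the class names are hypothetical, and the non-strict fallback to a live call is an assumption added for contrast.

```python
class ReplayMiss(LookupError):
    """Raised when strict replay has no recorded completion for a prompt."""


class ReplayClient:
    """Hypothetical stand-in for an LLM client backed by recorded traces."""

    def __init__(self, recorded, strict=True, live_call=None):
        self._recorded = dict(recorded)  # prompt -> recorded completion
        self._strict = strict
        self._live_call = live_call      # only consulted when not strict (assumption)

    def complete(self, prompt):
        if prompt in self._recorded:
            return self._recorded[prompt]
        if self._strict or self._live_call is None:
            # Strict mode: a missing prompt fails the test instead of calling a model.
            raise ReplayMiss(f"prompt not found in trace file: {prompt!r}")
        return self._live_call(prompt)


if __name__ == "__main__":
    client = ReplayClient({"What is 2 + 2?": "4"})
    print(client.complete("What is 2 + 2?"))
    try:
        client.complete("An unrecorded prompt")
    except ReplayMiss as exc:
        print("test would fail:", exc)
```

Failing on a miss, rather than silently calling a live model, is what keeps CI runs deterministic and free of LLM cost.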
Deterministic Replay (Precedence Rules)
To handle "noisy" traces where multiple model calls or tools might occur, Assay v0.4.0+ uses strict precedence rules to determine exactly what prompt/output to use for the replay:

Prompt Extraction:
1. EpisodeStart: If the trace provides an input at start, it wins.
2. Model Step: The first step with kind="model" determines the prompt (First Wins).
3. Fallback: If no model step is found, the first step with gen_ai.prompt is used.

Output Extraction:
1. EpisodeEnd (Root Span): If the Root Span contains gen_ai.completion, this takes absolute precedence. This allows the Agent's "Final Answer" to override intermediate tool outputs.
2. Last Step: Otherwise, the last step's completion is used (Last Wins).
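The precedence rules above can be sketched in a few lines. This is an illustration, not Assay's code: the Step and Episode classes and their field names are hypothetical. It also makes two interpretive assumptions: a model step without a prompt is skipped, and "last step" means the last step that carries a completion.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    kind: str                         # e.g. "model" or "tool"
    prompt: Optional[str] = None      # value of gen_ai.prompt, if present
    completion: Optional[str] = None  # value of gen_ai.completion, if present


@dataclass
class Episode:
    steps: List[Step] = field(default_factory=list)
    start_input: Optional[str] = None      # input supplied at EpisodeStart
    root_completion: Optional[str] = None  # gen_ai.completion on the root span


def extract_prompt(episode):
    # 1. EpisodeStart input wins.
    if episode.start_input is not None:
        return episode.start_input
    # 2. First model step with a prompt (First Wins).
    for step in episode.steps:
        if step.kind == "model" and step.prompt is not None:
            return step.prompt
    # 3. Fallback: first step of any kind that carries a prompt.
    for step in episode.steps:
        if step.prompt is not None:
            return step.prompt
    return None


def extract_output(episode):
    # 1. Root-span completion takes absolute precedence.
    if episode.root_completion is not None:
        return episode.root_completion
    # 2. Otherwise the last completion seen (Last Wins).
    for step in reversed(episode.steps):
        if step.completion is not None:
            return step.completion
    return None
```

With a tool step followed by two model steps, extract_prompt returns the first model step's prompt and extract_output returns the second model step's completion, unless the episode start or root span supplies values of its own.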
3. Defining Assertions
Use eval.yaml to define behavioral gates for your agent.
Example Configuration
version: 1
suite: my-agent-suite
model: gpt-4
policies:
  agent_policy:
    assertions:
      # 1. Must use a specific tool
      - type: trace_must_call_tool
        tool_name: web_search
        min_calls: 1
      # 2. Must NOT use a forbidden tool
      - type: trace_must_call_tool
        tool_name: delete_database
        max_calls: 0
      # 3. Enforce a specific sequence of actions
      - type: trace_tool_sequence
        sequence:
          - web_search
          - summarize_results
        mode: loose  # allow other steps in between
Supported Assertions
- trace_must_call_tool: Verify tool usage counts (min/max).
- trace_tool_sequence: Verify order of operations (exact or loose modes).
- trace_no_tool_errors: Ensure no tool calls resulted in errors.
- trace_max_steps: Limit the number of steps (prevent infinite loops).
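To make the semantics of the count and sequence assertions concrete, here is a small sketch of how they could be evaluated over an ordered list of tool names. It is not Assay's implementation: the function names are hypothetical, loose mode is modeled as an in-order subsequence match, and treating exact mode as "the calls equal the sequence, with nothing in between" is an assumption. The fetch_page tool in the demo is also made up.

```python
def count_in_bounds(tool_calls, tool_name, min_calls=0, max_calls=None):
    """Check that tool_name was called between min_calls and max_calls times."""
    count = list(tool_calls).count(tool_name)
    return count >= min_calls and (max_calls is None or count <= max_calls)


def sequence_matches(tool_calls, sequence, mode="loose"):
    """Check the order of tool calls against an expected sequence."""
    if mode == "exact":
        # Assumption: exact means no extra tool calls at all.
        return list(tool_calls) == list(sequence)
    if mode == "loose":
        # Subsequence test: each expected name must appear after the previous
        # match; membership tests on the iterator consume it as they search.
        remaining = iter(tool_calls)
        return all(name in remaining for name in sequence)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    calls = ["web_search", "fetch_page", "summarize_results"]
    print(count_in_bounds(calls, "web_search", min_calls=1))
    print(count_in_bounds(calls, "delete_database", max_calls=0))
    print(sequence_matches(calls, ["web_search", "summarize_results"], "loose"))
```

Note that a forbidden tool is expressed as a count assertion with max_calls set to 0, which mirrors the delete_database entry in the example configuration.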
4. CI Integration
Check examples/agent-function-calling/ for a complete, runnable example including:
* run.sh: End-to-end script.
* eval.yaml: Complete configuration.
* otel_trace.jsonl: Sample OTel data.