Replay Engine¶
The replay engine is the core of Assay's zero-flake testing — deterministic re-execution without calling LLMs or tools.
What is Replay?¶
Replay means re-executing an agent session using recorded behavior instead of live API calls:
Traditional Test:
Prompt → LLM API → Tool Calls → Validation
(slow, expensive, flaky)
Assay Replay:
Trace → Replay Engine → Validation
(instant, free, deterministic)
The replay engine reads a trace file and simulates the agent's execution, validating each step against your policies.
How It Works¶
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Trace     │ ──► │    Replay    │ ──► │   Metrics    │
│  (recorded)  │     │    Engine    │     │  (validate)  │
└──────────────┘     └──────────────┘     └──────────────┘
                            │
                            ▼
                     ┌──────────────┐
                     │   Results    │
                     │  Pass/Fail   │
                     └──────────────┘
- Load Trace — Read the recorded session (.jsonl file)
- Simulate Execution — Process each tool call in order
- Validate — Check arguments, sequences, blocklists
- Report — Output pass/fail with detailed violations
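Concretely, each line of a trace is one JSON event. A minimal sketch of two tool-call records, assuming only the fields this page references elsewhere (type, tool, and the required arguments):

```json
{"type": "tool_call", "tool": "get_customer", "arguments": {"id": "123"}}
{"type": "tool_call", "tool": "update_customer", "arguments": {"id": "123", "email": "new@example.com"}}
```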
Replay Modes¶
Strict Mode¶
Fail on any violation. Use for CI gates.
In strict mode:

- Any policy violation fails the entire test
- Exit code is 1 if any test fails
- Ideal for blocking PRs with regressions
Lenient Mode¶
Report violations but don't fail. Use for auditing.
In lenient mode:

- Violations are logged but don't fail the run
- Exit code is 0 even with violations
- Ideal for migration and baseline analysis
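Selecting a mode from the command line, as a hedged sketch: --strict is the flag shown in the CI example at the end of this page, and treating lenient behavior as the default when --strict is omitted is an assumption, not something documented here.

```bash
# CI gate: any policy violation fails the run (exit code 1)
assay run --config mcp-eval.yaml --strict

# Audit pass: violations are reported but the exit code stays 0
# (assumes lenient is the default when --strict is omitted)
assay run --config mcp-eval.yaml
```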
Determinism Guarantees¶
Assay guarantees identical results on every run:
| Factor | Assay's Approach |
|---|---|
| Random seeds | Fixed per trace |
| Timestamps | Normalized from trace |
| External calls | Mocked from trace data |
| Ordering | Preserved from recording |
This means:

- ✅ Same trace + same policies = same result, always
- ✅ No network variance
- ✅ No model variance
- ✅ No timing variance
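A quick way to see the guarantee in action is to replay the same trace twice and diff the reports. A sketch, assuming the --output flag from the CI example writes the report to stdout:

```bash
# Two runs over the same trace and policies should be byte-identical
assay run --config mcp-eval.yaml --output sarif > run1.sarif
assay run --config mcp-eval.yaml --output sarif > run2.sarif
diff run1.sarif run2.sarif && echo "deterministic"
```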
Replay vs. Live Execution¶
| Aspect | Replay | Live Execution |
|---|---|---|
| Speed | 1-10 ms | 1-30 seconds |
| Cost | $0.00 | $0.01-$1.00 |
| Determinism | 100% | 80-95% |
| Network | Not required | Required |
| Isolation | Complete | Shared state risks |
When to Use Replay¶
- CI/CD gates — Every PR gets tested
- Regression testing — Catch breaking changes
- Debugging — Reproduce production incidents
- Baseline comparison — A vs. B testing
When to Use Live¶
- Development — Exploring new features
- E2E testing — Full integration validation
- Model evaluation — Comparing LLM versions
Running Replay¶
Basic Replay¶
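The simplest invocation points assay run at a config file, which is assumed to name the trace(s) to replay:

```bash
# Replay using the traces referenced by the config
assay run --config mcp-eval.yaml
```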
Specify Trace File¶
# Run against a specific trace
assay run --config mcp-eval.yaml --trace-file traces/production-incident.jsonl
Multiple Traces¶
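This page doesn't show a dedicated batch flag, so a plain shell loop over --trace-file is a safe sketch:

```bash
# Replay every trace in a directory, stopping on the first failure
for trace in traces/*.jsonl; do
  assay run --config mcp-eval.yaml --trace-file "$trace" || exit 1
done
```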
In-Memory Database¶
For CI, skip disk writes:
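```bash
# Keep the results database in memory
# (same --db :memory: flag as the CI example below)
assay run --config mcp-eval.yaml --db :memory:
```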
Replay with Debugging¶
Verbose Output¶
assay run --config mcp-eval.yaml --verbose
# Output:
# [TRACE] Loading trace: traces/golden.jsonl
# [TRACE] Found 47 tool calls
# [REPLAY] Call 1: get_customer(id="123")
# [VALIDATE] args_valid: ✅ PASS
# [REPLAY] Call 2: update_customer(id="123", email="new@example.com")
# [VALIDATE] args_valid: ✅ PASS
# ...
Step-by-Step¶
assay replay --trace traces/golden.jsonl --step
# Interactive mode:
# > [1/47] get_customer(id="123") — Press Enter to continue
# > [2/47] update_customer(...) — Press Enter to continue
Export Replay Log¶
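No dedicated export flag appears on this page, so this sketch reuses the --output flag from the CI example and assumes the report is written to stdout:

```bash
# Capture the machine-readable report for later inspection
# (sarif is the only output format shown on this page;
# the redirect target is an assumption)
assay run --config mcp-eval.yaml --output sarif > replay-log.sarif
```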
Replay Isolation¶
Each replay is isolated:
- No side effects — Tools aren't actually called
- No shared state — Each run starts fresh
- No external dependencies — Works offline
This makes replay ideal for:

- Parallel test execution
- CI runners with no network
- Air-gapped environments
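Isolation is what makes parallel replay safe. As a sketch, a generic xargs fan-out works (the worker count is arbitrary):

```bash
# Replay four traces at a time; no shared state means no
# cross-run interference
find traces -name '*.jsonl' -print0 |
  xargs -0 -P 4 -I {} assay run --config mcp-eval.yaml --trace-file {}
```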
Error Handling¶
Trace Not Found¶
Error: Trace file not found: traces/missing.jsonl
Suggestion: Run 'assay import' first or check the path
Invalid Trace Format¶
Error: Invalid trace format at line 15
{"type":"tool_call","tool":"get_customer"}
^
Missing required field: 'arguments'
Suggestion: Validate trace with 'assay validate --trace <file>'
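You can also pre-screen a trace yourself. A sketch using jq to find tool_call records missing the required arguments field (field names taken from the error above):

```bash
# Print any tool_call record that lacks 'arguments'
jq -c 'select(.type == "tool_call" and (has("arguments") | not))' traces/golden.jsonl
```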
Policy Mismatch¶
Warning: Tool 'new_feature' in trace not found in policy
The trace contains calls to 'new_feature', but no policy defines it.
Options:
1. Add 'new_feature' to your policy file
2. Use --ignore-unknown-tools to skip validation
3. Use --strict to fail on unknown tools
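The last two options map directly to flags already shown on this page:

```bash
# Skip validation for tools the policy doesn't define
assay run --config mcp-eval.yaml --ignore-unknown-tools

# Or fail fast when the trace references undefined tools
assay run --config mcp-eval.yaml --strict
```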
Performance¶
Replay is fast because it:
- Skips network — No HTTP calls
- Skips LLM inference — No model computation
- Uses compiled validators — Rust-native JSON Schema
- Caches fingerprints — Skip unchanged traces
Typical performance:
| Trace Size | Replay Time |
|---|---|
| 10 calls | ~1 ms |
| 100 calls | ~5 ms |
| 1000 calls | ~30 ms |
CI Integration¶
GitHub Actions¶
- name: Run Assay Tests
  run: |
    assay run \
      --config mcp-eval.yaml \
      --trace-file traces/golden.jsonl \
      --strict \
      --output sarif \
      --db :memory:
Exit Codes¶
| Code | Meaning |
|---|---|
| 0 | All tests passed |
| 1 | One or more tests failed |
| 2 | Configuration error |
| 3 | Trace file error |
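Scripts that need to distinguish these failure classes can branch on the exit status; a minimal sketch:

```bash
assay run --config mcp-eval.yaml --strict
case $? in
  0) echo "all tests passed" ;;
  1) echo "one or more tests failed" >&2; exit 1 ;;
  2) echo "configuration error" >&2; exit 2 ;;
  3) echo "trace file error" >&2; exit 3 ;;
esac
```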