Replay Engine¶
The replay engine is the core of Assay's zero-flake testing — deterministic re-execution without calling LLMs or tools.
What is Replay?¶
Replay means re-executing an agent session using recorded behavior instead of live API calls:
Traditional Test:
Prompt → LLM API → Tool Calls → Validation
(slow, expensive, flaky)
Assay Replay:
Trace → Replay Engine → Validation
(instant, free, deterministic)
The replay engine reads a trace file and simulates the agent's execution, validating each step against your policies.
How It Works¶
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Trace     │ ──► │    Replay    │ ──► │   Metrics    │
│  (recorded)  │     │    Engine    │     │  (validate)  │
└──────────────┘     └──────────────┘     └──────────────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐
                                          │   Results    │
                                          │  Pass/Fail   │
                                          └──────────────┘
- Load Trace — Read the recorded session (`.jsonl` file)
- Simulate Execution — Process each tool call in order
- Validate — Check arguments, sequences, blocklists
- Report — Output pass/fail with detailed violations
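The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Assay's implementation: the `type`, `tool`, and `arguments` fields follow the trace line shown under Invalid Trace Format, while the policy shape (`blocklist`, `required_args`) and all function names are hypothetical.

```python
import json


def load_trace(text):
    """Step 1: parse a JSONL trace, one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def replay(events, policy):
    """Steps 2-3: walk recorded tool calls in order and collect violations.

    No tool is invoked; only the recorded call is inspected.
    """
    violations = []
    for step, event in enumerate(events, start=1):
        if event.get("type") != "tool_call":
            continue
        tool = event["tool"]
        args = event.get("arguments", {})
        if tool in policy.get("blocklist", []):
            violations.append((step, tool, "tool is blocklisted"))
        for name in policy.get("required_args", {}).get(tool, []):
            if name not in args:
                violations.append((step, tool, f"missing argument '{name}'"))
    return violations


def report(violations):
    """Step 4: reduce the collected violations to pass/fail."""
    return "pass" if not violations else "fail"


# Hypothetical policy and trace, for illustration only.
POLICY = {
    "blocklist": ["delete_account"],
    "required_args": {"get_customer": ["id"]},
}
TRACE = "\n".join([
    json.dumps({"type": "tool_call", "tool": "get_customer", "arguments": {"id": "c-1"}}),
    json.dumps({"type": "tool_call", "tool": "delete_account", "arguments": {}}),
])

violations = replay(load_trace(TRACE), POLICY)
print(report(violations), violations)
# fail [(2, 'delete_account', 'tool is blocklisted')]
```

Keeping the violation list separate from the final verdict mirrors the split between validating and reporting: the same list can feed a detailed report or a single exit code.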
Replay Modes¶
Strict Mode¶
Fail on any violation. Use for CI gates.
In strict mode:
- Any policy violation fails the entire test
- Exit code is 1 if any test fails
- Ideal for blocking PRs with regressions
Non-Strict Mode¶
Report violations but don't fail. Use for auditing.
Without `--strict`:
- Warn/flaky outcomes do not fail the process
- Exit code remains 0 unless blocking failures occur
- Useful for migration and exploratory audits
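The difference between the two modes comes down to which outcomes count as blocking. The sketch below is one plausible reading of the rules above, not Assay's actual logic: the outcome names (`pass`, `warn`, `flaky`, `fail`) and the assumption that strict mode promotes `warn` and `flaky` to failures are illustrative.

```python
def exit_code(outcomes, strict):
    """Return 1 if any outcome is blocking in the chosen mode, else 0."""
    blocking = {"fail"}
    if strict:
        # Assumption: strict mode treats warn/flaky outcomes as failures too.
        blocking |= {"warn", "flaky"}
    return 1 if any(outcome in blocking for outcome in outcomes) else 0


outcomes = ["pass", "warn", "pass"]
print(exit_code(outcomes, strict=False))  # 0
print(exit_code(outcomes, strict=True))   # 1
```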
Determinism Guarantees¶
For a given trace and set of policies, Assay produces identical results on every run by pinning each source of nondeterminism:
| Factor | Assay's Approach |
|---|---|
| Random seeds | Fixed per trace |
| Timestamps | Normalized from trace |
| External calls | Mocked from trace data |
| Ordering | Preserved from recording |
This means:
- ✅ Same trace + same policies = same result, always
- ✅ No network variance
- ✅ No model variance
- ✅ No timing variance
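To make "mocked from trace data" and "ordering preserved" concrete, here is a small Python sketch of a tool stand-in that serves recorded results in order. It is an illustration only: the `result` field and the `RecordedTools` class are hypothetical and do not describe Assay's trace schema or internals.

```python
import hashlib
import json


class RecordedTools:
    """Serves recorded results in recorded order instead of calling real tools."""

    def __init__(self, events):
        self._events = list(events)
        self._cursor = 0

    def call(self, tool, arguments):
        if self._cursor >= len(self._events):
            raise LookupError("trace exhausted: no recorded call left")
        event = self._events[self._cursor]
        if event["tool"] != tool or event["arguments"] != arguments:
            raise LookupError(
                f"step {self._cursor + 1}: trace recorded {event['tool']!r}, "
                f"agent asked for {tool!r}"
            )
        self._cursor += 1
        return event["result"]


def replay_digest(events):
    """Replay every recorded call and hash the outputs."""
    tools = RecordedTools(events)
    outputs = [tools.call(e["tool"], e["arguments"]) for e in events]
    payload = json.dumps(outputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical recorded events, including an assumed "result" field.
EVENTS = [
    {"tool": "get_customer", "arguments": {"id": "c-1"}, "result": {"name": "Ada"}},
    {"tool": "get_orders", "arguments": {"customer": "c-1"}, "result": [101, 102]},
]

digests = {replay_digest(EVENTS) for _ in range(3)}
print(len(digests))  # 1: three runs, one distinct digest
```

Because nothing in the loop touches the network, a clock, or a random source, the digest depends only on the trace contents.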
Replay vs. Live Execution¶
| Aspect | Replay | Live Execution |
|---|---|---|
| Speed | ~1-30 ms per trace | 1-30 seconds |
| Cost | $0.00 | $0.01-$1.00 |
| Determinism | 100% | 80-95% |
| Network | Not required | Required |
| Isolation | Complete | Shared state risks |
When to Use Replay¶
- CI/CD gates — Every PR gets tested
- Regression testing — Catch breaking changes
- Debugging — Reproduce production incidents
- Baseline comparison — A vs. B testing
When to Use Live¶
- Development — Exploring new features
- E2E testing — Full integration validation
- Model evaluation — Comparing LLM versions
Running Replay¶
Basic Replay¶
Specify Trace File¶
# Run against a specific trace
assay run --config eval.yaml --trace-file traces/production-incident.jsonl
Multiple Traces¶
# Run multiple traces by iterating files
for trace in traces/*.jsonl; do
  assay run --config eval.yaml --trace-file "$trace" --strict || exit $?
done
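The shell loop above stops at the first failing trace. If you would rather run every trace and collect all failures, a Python wrapper along these lines is one option. It is a sketch: `run_traces` and `fake_runner` are hypothetical helpers, and the only Assay flags used are the ones shown in the loop above. The dry run at the bottom substitutes a stand-in runner so the example works without Assay installed.

```python
import subprocess
import tempfile
from pathlib import Path
from types import SimpleNamespace


def run_traces(trace_dir, config="eval.yaml", runner=subprocess.run):
    """Run every *.jsonl trace in strict mode.

    Returns {file name: exit code} for the traces that did not exit with 0.
    """
    failures = {}
    for trace in sorted(Path(trace_dir).glob("*.jsonl")):
        cmd = ["assay", "run", "--config", config,
               "--trace-file", str(trace), "--strict"]
        code = runner(cmd).returncode
        if code != 0:
            failures[trace.name] = code
    return failures


def fake_runner(cmd):
    """Stand-in for subprocess.run: fails any trace whose file name contains 'bad'."""
    trace_path = cmd[cmd.index("--trace-file") + 1]
    return SimpleNamespace(returncode=1 if "bad" in Path(trace_path).name else 0)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("good.jsonl", "bad.jsonl"):
        (Path(tmp) / name).write_text("")
    failures = run_traces(tmp, runner=fake_runner)

print(failures)  # {'bad.jsonl': 1}
```

With the default `runner`, the function shells out to `assay` for each file, so the exit codes it collects are the ones documented under Exit Codes.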
In-Memory Database¶
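For CI, skip disk writes by keeping the database in memory with `--db :memory:`, the flag used in the GitHub Actions example under CI Integration.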
Replay with Debugging¶
Detailed Explanation¶
assay explain --trace traces/golden.jsonl --policy policy.yaml --verbose
# Output:
# Step 1: get_customer(...)
# Verdict: Allowed
# Rules: args_valid, sequence_valid
# ...
Bundle Replay¶
# Replay from an immutable replay bundle (offline by default)
assay replay --bundle .assay/bundles/run-123.tar.gz
Export Explain Report¶
Replay Isolation¶
Each replay is isolated:
- No side effects — Tools aren't actually called
- No shared state — Each run starts fresh
- No external dependencies — Works offline
This makes replay ideal for:
- Parallel test execution
- CI runners with no network
- Air-gapped environments
Error Handling¶
Trace Not Found¶
Error: Trace file not found: traces/missing.jsonl
Suggestion: Run 'assay import' first or check the path
Invalid Trace Format¶
Error: Invalid trace format at line 15
{"type":"tool_call","tool":"get_customer"}
^
Missing required field: 'arguments'
Suggestion: Validate trace with 'assay trace verify --trace <file> --config eval.yaml'
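A format check of this kind can be approximated in Python when you want to pre-screen traces yourself. The sketch below is not what `assay trace verify` does internally. It assumes that every `tool_call` event needs `tool` and `arguments` fields (only `arguments` is confirmed by the error above), and it reports 1-based line numbers.

```python
import json

# Assumption: fields every tool_call event must carry.
REQUIRED_TOOL_CALL_FIELDS = ("tool", "arguments")


def find_format_errors(text):
    """Return a list of (line number, message) for malformed trace lines."""
    errors = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            errors.append((lineno, "invalid JSON"))
            continue
        if not isinstance(event, dict):
            errors.append((lineno, "expected a JSON object"))
            continue
        if event.get("type") == "tool_call":
            for field in REQUIRED_TOOL_CALL_FIELDS:
                if field not in event:
                    errors.append((lineno, f"Missing required field: '{field}'"))
    return errors


SAMPLE = "\n".join([
    '{"type":"tool_call","tool":"get_customer","arguments":{"id":"c-1"}}',
    '{"type":"tool_call","tool":"get_customer"}',
    '{not json',
])

errors = find_format_errors(SAMPLE)
print(errors)
# [(2, "Missing required field: 'arguments'"), (3, 'invalid JSON')]
```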
Policy Mismatch¶
Warning: Tool 'new_feature' in trace not found in policy
The trace contains calls to 'new_feature', but no policy defines it.
Options:
1. Add 'new_feature' to your policy file
2. Re-run with an updated policy file
3. Validate config and trace coverage with `assay trace verify`
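The coverage check behind this warning amounts to a set difference between the tools a trace calls and the tools a policy defines. A minimal Python sketch, assuming the policy can be reduced to a collection of tool names (the real policy file format is not shown here, and `uncovered_tools` is a hypothetical helper):

```python
def uncovered_tools(events, policy_tools):
    """Return tools called in the trace but absent from the policy, in first-seen order."""
    known = set(policy_tools)
    missing = []
    for event in events:
        if event.get("type") != "tool_call":
            continue
        tool = event["tool"]
        if tool not in known and tool not in missing:
            missing.append(tool)
    return missing


events = [
    {"type": "tool_call", "tool": "get_customer", "arguments": {"id": "c-1"}},
    {"type": "tool_call", "tool": "new_feature", "arguments": {}},
    {"type": "tool_call", "tool": "new_feature", "arguments": {}},
]

print(uncovered_tools(events, policy_tools=["get_customer"]))  # ['new_feature']
```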
Performance¶
Replay is fast because it:
- Skips network — No HTTP calls
- Skips LLM inference — No model computation
- Uses compiled validators — Rust-native JSON Schema
- Caches fingerprints — Skip unchanged traces
Typical performance:
| Trace Size | Replay Time |
|---|---|
| 10 calls | ~1 ms |
| 100 calls | ~5 ms |
| 1000 calls | ~30 ms |
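The fingerprint caching listed above can be pictured as a content hash over the inputs that determine a replay result. The following Python sketch is an assumption about how such a cache might work, not a description of Assay's cache: it hashes the trace and policy bytes with SHA-256 and re-runs the replay only when that hash has not been seen before. `fingerprint` and `ReplayCache` are hypothetical names.

```python
import hashlib


def fingerprint(trace_bytes, policy_bytes):
    """Hash both inputs; a change to either one yields a new fingerprint."""
    digest = hashlib.sha256()
    for part in (trace_bytes, policy_bytes):
        # Length prefix keeps the boundary between the two inputs unambiguous.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


class ReplayCache:
    """Remembers results by fingerprint so unchanged inputs are not replayed again."""

    def __init__(self):
        self._results = {}
        self.replays = 0  # how many times the replay function actually ran

    def get_or_replay(self, trace_bytes, policy_bytes, replay_fn):
        key = fingerprint(trace_bytes, policy_bytes)
        if key not in self._results:
            self.replays += 1
            self._results[key] = replay_fn(trace_bytes, policy_bytes)
        return self._results[key]


cache = ReplayCache()
trace = b'{"type":"tool_call","tool":"get_customer","arguments":{}}\n'

cache.get_or_replay(trace, b"policy v1", lambda t, p: "pass")
cache.get_or_replay(trace, b"policy v1", lambda t, p: "pass")  # cache hit
cache.get_or_replay(trace, b"policy v2", lambda t, p: "pass")  # policy changed

print(cache.replays)  # 2
```

Including the policy in the fingerprint matters: a cached pass for an old policy says nothing about whether the same trace passes a stricter one.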
CI Integration¶
GitHub Actions¶
- name: Run Assay Tests
  run: |
    assay ci \
      --config eval.yaml \
      --trace-file traces/golden.jsonl \
      --strict \
      --sarif .assay/reports/sarif.json \
      --junit .assay/reports/junit.xml \
      --db :memory:
Exit Codes¶
| Code | Meaning |
|---|---|
| 0 | All tests passed |
| 1 | One or more tests failed |
| 2 | Configuration/input error |
| 3 | Infrastructure/judge/provider error |
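Scripts that wrap Assay can branch on these codes. The Python sketch below copies the meanings from the table. The retry rule is an assumption made for illustration (only infrastructure-type errors are treated as transient), and both helper names are hypothetical.

```python
EXIT_MEANINGS = {
    0: "All tests passed",
    1: "One or more tests failed",
    2: "Configuration/input error",
    3: "Infrastructure/judge/provider error",
}


def describe_exit(code):
    """Translate an exit code into the meaning listed in the table."""
    return EXIT_MEANINGS.get(code, f"Unrecognized exit code {code}")


def is_retryable(code):
    # Assumption: test failures (1) and config errors (2) will not fix themselves,
    # while infrastructure errors (3) may be transient.
    return code == 3


for code in (0, 1, 2, 3):
    print(code, describe_exit(code), "retry" if is_retryable(code) else "no retry")
```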