Trace-Driven Debugging¶
Reproduce and diagnose production failures using recorded traces.
The Problem¶
When an AI agent fails in production:
- Logs are incomplete — Missing context, truncated output
- Reproduction is hard — "It worked when I tried it"
- LLM is non-deterministic — Can't recreate the exact failure
- Time pressure — Users waiting, SLA ticking
The Solution¶
Assay enables deterministic replay of production incidents:
- Capture — Record the failing session
- Import — Convert to Assay trace format
- Replay — Step through exactly what happened
- Fix — Update policy or agent, verify fix works
Workflow¶
1. Get the Incident Trace¶
When a user reports an issue, ask for their session log:
# From MCP Inspector export
assay import --format inspector user_session.json
# Output:
# Imported 23 tool calls from user_session.json
# Created: traces/incident-2025-12-27.jsonl
2. Reproduce the Failure¶
assay run --config eval.yaml --trace-file traces/incident-2025-12-27.jsonl
# Output:
# ❌ FAIL: args_valid
# Tool: apply_discount (call #15)
# Violation: percent=75 exceeds max(30)
Now you know: - Which tool failed - What argument was wrong - Exactly when in the session it happened
3. Inspect in Detail¶
assay explain \
--trace traces/incident-2025-12-27.jsonl \
--policy policy.yaml \
--verbose
# Output:
# Step 15: apply_discount
# Verdict: Blocked
# Rule: args_valid
# Reason: percent=75 exceeds max(30)
Root cause found: The calculate_discount tool suggested 75%, but the business rule caps at 30%.
4. Fix and Verify¶
Option A: Fix the agent — Cap the discount before applying:
suggested = calculate_discount(total)
capped = min(suggested["suggested_percent"], 30)
apply_discount(percent=capped, ...)
Option B: Fix the policy — If 75% is actually valid:
Verify:
assay run --config eval.yaml --trace-file traces/incident-2025-12-27.jsonl
# Output:
# ✅ All tests passed
Debugging Commands¶
Focus on blocked steps¶
Full rule evaluation¶
Test updated policy behavior¶
Real Example: Customer Service Bot¶
Incident Report¶
"The bot promised a 75% discount but then said it couldn't apply it. The customer is upset."
Investigation¶
# Import the session
assay import --format inspector support-case-4521.json
# Find the problem
assay run --config eval.yaml --trace-file traces/support-case-4521.jsonl
# Explain the exact failing step and rule
assay explain --trace traces/support-case-4521.jsonl --policy policy.yaml --blocked-only
Output:
[14] calculate_discount
Input: {"customer_tier": "platinum", "order_total": 500}
Output: {"percent": 75, "reason": "Platinum member 3x points"}
[15] apply_discount ← FAILURE
Input: {"percent": 75}
Policy violation: percent exceeds max(30)
Root Cause¶
The calculate_discount tool returned 75% for platinum members, but apply_discount has a hard cap of 30% from the fraud prevention policy.
Fix¶
Updated the discount calculation to respect the cap:
def calculate_discount(customer_tier, order_total):
base_discount = get_tier_discount(customer_tier)
return min(base_discount, MAX_DISCOUNT) # Added cap
Verification¶
# Re-run with fix
assay run --config eval.yaml --trace-file traces/support-case-4521.jsonl
# ✅ All tests passed
Building a Failure Library¶
Over time, build a collection of failure traces:
traces/
├── golden/
│ └── happy-path.jsonl
├── failures/
│ ├── discount-overflow.jsonl
│ ├── missing-auth.jsonl
│ ├── blocked-tool-called.jsonl
│ └── sequence-violation.jsonl
└── edge-cases/
├── empty-cart.jsonl
└── unicode-input.jsonl
Run all as regression tests:
for trace in traces/failures/*.jsonl traces/edge-cases/*.jsonl; do
assay run --config eval.yaml --trace-file "$trace" --strict || exit $?
done
Tips¶
1. Capture Early¶
Set up logging to capture all sessions, not just failures:
2. Redact Sensitive Data¶
Assay currently has no dedicated anonymize subcommand. Redact before sharing (example with jq):
jq 'walk(if type == "object" then with_entries(if (.key|test("token|password|secret";"i")) then .value="***" else . end) else . end)' \
incident.jsonl > safe-incident.jsonl
3. Add to Test Suite¶
After fixing a bug, add the trace to CI:
cp traces/incident-2025-12-27.jsonl traces/regression/discount-cap.jsonl
git add traces/regression/discount-cap.jsonl
git commit -m "Add regression test for discount cap bug"
4. Time-Box Investigation¶
With Assay, debugging should take minutes, not hours:
- 5 min — Import and run initial test
- 10 min — Step through replay, identify root cause
- 15 min — Implement and verify fix