Skip to content

Trace-Driven Debugging

Reproduce and diagnose production failures using recorded traces.


The Problem

When an AI agent fails in production:

  • Logs are incomplete — Missing context, truncated output
  • Reproduction is hard — "It worked when I tried it"
  • LLM is non-deterministic — Can't recreate the exact failure
  • Time pressure — Users waiting, SLA ticking

The Solution

Assay enables deterministic replay of production incidents:

  1. Capture — Record the failing session
  2. Import — Convert to Assay trace format
  3. Replay — Step through exactly what happened
  4. Fix — Update policy or agent, verify fix works

Workflow

1. Get the Incident Trace

When a user reports an issue, ask for their session log:

# From MCP Inspector export
assay import --format inspector user_session.json

# Output:
# Imported 23 tool calls from user_session.json
# Created: traces/incident-2025-12-27.jsonl

2. Reproduce the Failure

assay run --config eval.yaml --trace-file traces/incident-2025-12-27.jsonl

# Output:
# ❌ FAIL: args_valid
#    Tool: apply_discount (call #15)
#    Violation: percent=75 exceeds max(30)

Now you know: - Which tool failed - What argument was wrong - Exactly when in the session it happened

3. Inspect in Detail

assay explain \
  --trace traces/incident-2025-12-27.jsonl \
  --policy policy.yaml \
  --verbose

# Output:
# Step 15: apply_discount
# Verdict: Blocked
# Rule: args_valid
# Reason: percent=75 exceeds max(30)

Root cause found: The calculate_discount tool suggested 75%, but the business rule caps at 30%.

4. Fix and Verify

Option A: Fix the agent — Cap the discount before applying:

suggested = calculate_discount(total)
capped = min(suggested["suggested_percent"], 30)
apply_discount(percent=capped, ...)

Option B: Fix the policy — If 75% is actually valid:

# policies/discounts.yaml
tools:
  apply_discount:
    arguments:
      percent:
        max: 75  # Updated from 30

Verify:

assay run --config eval.yaml --trace-file traces/incident-2025-12-27.jsonl

# Output:
# ✅ All tests passed

Debugging Commands

Focus on blocked steps

assay explain --trace traces/incident.jsonl --policy policy.yaml --blocked-only

Full rule evaluation

assay explain --trace traces/incident.jsonl --policy policy.yaml --verbose

Test updated policy behavior

assay explain --trace traces/incident.jsonl --policy policies/new-rules.yaml

Real Example: Customer Service Bot

Incident Report

"The bot promised a 75% discount but then said it couldn't apply it. The customer is upset."

Investigation

# Import the session
assay import --format inspector support-case-4521.json

# Find the problem
assay run --config eval.yaml --trace-file traces/support-case-4521.jsonl

# Explain the exact failing step and rule
assay explain --trace traces/support-case-4521.jsonl --policy policy.yaml --blocked-only

Output:

[14] calculate_discount
     Input: {"customer_tier": "platinum", "order_total": 500}
     Output: {"percent": 75, "reason": "Platinum member 3x points"}

[15] apply_discount  ← FAILURE
     Input: {"percent": 75}
     Policy violation: percent exceeds max(30)

Root Cause

The calculate_discount tool returned 75% for platinum members, but apply_discount has a hard cap of 30% from the fraud prevention policy.

Fix

Updated the discount calculation to respect the cap:

def calculate_discount(customer_tier, order_total):
    base_discount = get_tier_discount(customer_tier)
    return min(base_discount, MAX_DISCOUNT)  # Added cap

Verification

# Re-run with fix
assay run --config eval.yaml --trace-file traces/support-case-4521.jsonl

# ✅ All tests passed

Building a Failure Library

Over time, build a collection of failure traces:

traces/
├── golden/
│   └── happy-path.jsonl
├── failures/
│   ├── discount-overflow.jsonl
│   ├── missing-auth.jsonl
│   ├── blocked-tool-called.jsonl
│   └── sequence-violation.jsonl
└── edge-cases/
    ├── empty-cart.jsonl
    └── unicode-input.jsonl

Run all as regression tests:

for trace in traces/failures/*.jsonl traces/edge-cases/*.jsonl; do
  assay run --config eval.yaml --trace-file "$trace" --strict || exit $?
done

Tips

1. Capture Early

Set up logging to capture all sessions, not just failures:

# Log all MCP sessions
session.export_to_file(f"logs/{session_id}.json")

2. Redact Sensitive Data

Assay currently has no dedicated anonymize subcommand. Redact before sharing (example with jq):

jq 'walk(if type == "object" then with_entries(if (.key|test("token|password|secret";"i")) then .value="***" else . end) else . end)' \
  incident.jsonl > safe-incident.jsonl

3. Add to Test Suite

After fixing a bug, add the trace to CI:

cp traces/incident-2025-12-27.jsonl traces/regression/discount-cap.jsonl
git add traces/regression/discount-cap.jsonl
git commit -m "Add regression test for discount cap bug"

4. Time-Box Investigation

With Assay, debugging should take minutes, not hours:

  1. 5 min — Import and run initial test
  2. 10 min — Step through replay, identify root cause
  3. 15 min — Implement and verify fix

See Also