Traces¶
Traces are recorded agent sessions — the "golden" behavior you test against.
What is a Trace?¶
A trace is a normalized log of every tool call your AI agent made during a session:
- Which tools were called
- What arguments were passed
- What results were returned
- In what order
Traces are the foundation of Assay's deterministic testing. Instead of calling your LLM again (slow, expensive, flaky), Assay replays the recorded trace and validates it against your policies.
Trace Format¶
Assay uses a line-delimited JSON format (.jsonl):
{"type":"tool_call","id":"call_001","tool":"get_customer","arguments":{"id":"cust_123"},"timestamp":"2025-12-27T10:00:00Z"}
{"type":"tool_result","id":"call_001","result":{"name":"Alice","email":"alice@example.com"},"timestamp":"2025-12-27T10:00:01Z"}
{"type":"tool_call","id":"call_002","tool":"update_customer","arguments":{"id":"cust_123","email":"alice@newdomain.com"},"timestamp":"2025-12-27T10:00:02Z"}
{"type":"tool_result","id":"call_002","result":{"success":true},"timestamp":"2025-12-27T10:00:03Z"}
Each line is a self-contained event:
| Field | Description |
|---|---|
type | tool_call or tool_result |
id | Links call to result |
tool | Tool name (for calls) |
arguments | Tool arguments (for calls) |
result | Tool response (for results) |
timestamp | When the event occurred |
Creating Traces¶
From MCP Inspector¶
Export your session from MCP Inspector, then import:
This creates: - traces/session-YYYY-MM-DD.jsonl — The normalized trace - mcp-eval.yaml — Test configuration - policies/default.yaml — Policy template
From Other Formats¶
# Raw JSON-RPC messages
assay import --format jsonrpc messages.json
# LangChain traces (coming soon)
assay import --format langchain run.json
# LlamaIndex traces (coming soon)
assay import --format llamaindex trace.json
Manual Creation¶
For testing, you can create traces manually:
cat > traces/test.jsonl << 'EOF'
{"type":"tool_call","id":"1","tool":"get_customer","arguments":{"id":"123"}}
{"type":"tool_result","id":"1","result":{"name":"Test User"}}
EOF
Trace Storage¶
Traces are stored in the .assay/ directory:
your-project/
├── .assay/
│ ├── store.db # SQLite database (cache, metadata)
│ └── traces/ # Trace files
│ ├── session-001.jsonl
│ └── session-002.jsonl
├── traces/ # Your golden traces (commit these)
│ └── golden.jsonl
└── mcp-eval.yaml
Best practice: Keep "golden" traces in a traces/ folder at your repo root and commit them to Git. These are your baseline for regression testing.
Trace Fingerprinting¶
Assay computes a fingerprint (hash) of each trace to detect changes:
If the underlying trace changes, the cache invalidates and tests re-run. This ensures you're always testing against the current baseline.
Working with Traces¶
Inspect a Trace¶
# List all tools in a trace
assay inspect --trace traces/golden.jsonl --tools
# Output:
# Tools found:
# - get_customer (5 calls)
# - update_customer (2 calls)
# - send_email (1 call)
Validate a Trace¶
# Check trace format is valid
assay validate --trace traces/golden.jsonl
# Output:
# ✅ Trace valid: 8 events, 4 tool calls
Compare Traces¶
# Diff two traces
assay diff --baseline traces/v1.jsonl --candidate traces/v2.jsonl
# Output:
# + Added: delete_customer (1 call)
# - Removed: verify_identity (was 1 call)
# ~ Changed: update_customer arguments differ
Trace Best Practices¶
1. Use Descriptive Names¶
traces/
├── golden-customer-flow.jsonl # ✅ Clear purpose
├── edge-case-empty-cart.jsonl # ✅ Specific scenario
└── test1.jsonl # ❌ Unclear
2. Version Your Traces¶
When agent behavior changes intentionally, create new traces:
# Old baseline
traces/v1-customer-flow.jsonl
# New baseline after feature addition
traces/v2-customer-flow.jsonl
3. Keep Traces Small¶
Large traces slow down testing. Record only what's needed:
- Good: 10-50 tool calls covering critical paths
- Avoid: 1000+ calls from a full day's logs
4. Commit Golden Traces¶
Your "golden" traces should be in version control:
Trace vs. Live Testing¶
| Aspect | Trace Replay | Live LLM Call |
|---|---|---|
| Speed | 3ms | 3+ seconds |
| Cost | $0.00 | \(0.01-\)1.00 |
| Determinism | 100% | ~80-95% |
| Network | Not required | Required |
| Use case | CI/CD, regression | Exploration, new features |
Use traces for: CI gates, regression testing, debugging production issues.
Use live calls for: Developing new features, exploring model behavior.