Traces and Evidence¶

Traces are recorded agent sessions. Evidence bundles are verifiable, tamper-evident packages of those traces for audit and compliance.

Traces¶

A trace is a normalized log of every tool call your agent made:

Which tools were called
What arguments were passed
What results were returned
In what order

Traces enable deterministic testing. Replay recorded behavior instead of calling your LLM again.

Evidence Bundles¶

An evidence bundle is a tamper-evident package containing:

Trace data (CloudEvents v1.0 format)
Metadata (run ID, timestamps, tool manifest)
Content-addressed ID (SHA-256)
Optional signatures (Ed25519, mandate signatures)

# Create bundle
assay evidence export --profile assay-profile.yaml --out bundle.tar.gz

# Verify integrity
assay evidence verify bundle.tar.gz

# Lint for issues
assay evidence lint bundle.tar.gz --format sarif

# Lint with compliance pack
assay evidence lint --pack eu-ai-act-baseline bundle.tar.gz

# Compare bundles
assay evidence diff baseline.tar.gz current.tar.gz

Bundle ID¶

Each bundle has a content-addressed ID:

sha256:a3f2b1c4d5e6f7890...

Any modification changes the ID. Tamper-evident by design.

BYOS Storage¶

Push bundles to your own S3-compatible storage:

assay evidence push bundle.tar.gz --store s3://bucket/evidence
assay evidence pull --bundle-id sha256:abc... --store s3://bucket/evidence
assay evidence list --store s3://bucket/evidence

Supported: AWS S3, Backblaze B2, Cloudflare R2, MinIO, Azure Blob, GCS.

Trace Format¶

Assay uses a line-delimited JSON format (.jsonl):

{"type":"tool_call","id":"call_001","tool":"get_customer","arguments":{"id":"cust_123"},"timestamp":"2025-12-27T10:00:00Z"}
{"type":"tool_result","id":"call_001","result":{"name":"Alice","email":"alice@example.com"},"timestamp":"2025-12-27T10:00:01Z"}
{"type":"tool_call","id":"call_002","tool":"update_customer","arguments":{"id":"cust_123","email":"alice@newdomain.com"},"timestamp":"2025-12-27T10:00:02Z"}
{"type":"tool_result","id":"call_002","result":{"success":true},"timestamp":"2025-12-27T10:00:03Z"}

Each line is a self-contained event:

Field	Description
`type`	`tool_call` or `tool_result`
`id`	Links call to result
`tool`	Tool name (for calls)
`arguments`	Tool arguments (for calls)
`result`	Tool response (for results)
`timestamp`	When the event occurred

Creating Traces¶

From MCP Inspector¶

Export your session from MCP Inspector, then import:

assay import --format inspector session.json --out-trace traces/session.jsonl

This creates: - traces/session.jsonl — The normalized trace

If you use --init, the current implementation still scaffolds legacy mcp-eval.yaml.

From Other Formats¶

# Raw JSON-RPC messages
assay import --format jsonrpc messages.json

Manual Creation¶

For testing, you can create traces manually:

cat > traces/test.jsonl << 'EOF'
{"type":"tool_call","id":"1","tool":"get_customer","arguments":{"id":"123"}}
{"type":"tool_result","id":"1","result":{"name":"Test User"}}
EOF

Trace Storage¶

Traces are stored in the .assay/ directory:

your-project/
├── .assay/
│   ├── store.db          # SQLite database (cache, metadata)
│   └── traces/           # Trace files
│       ├── session-001.jsonl
│       └── session-002.jsonl
├── traces/               # Your golden traces (commit these)
│   └── golden.jsonl
└── eval.yaml

Best practice: Keep "golden" traces in a traces/ folder at your repo root and commit them to Git. These are your baseline for regression testing.

Trace Fingerprinting¶

Assay computes a fingerprint (hash) of each trace to detect changes:

Trace: traces/golden.jsonl
Fingerprint: sha256:a3f2b1c4d5e6...

If the underlying trace changes, the cache invalidates and tests re-run. This ensures you're always testing against the current baseline.

Working with Traces¶

Inspect a Trace¶

# List all tools in a trace
awk -F'"' '/"tool"/ {print $4}' traces/golden.jsonl | sort | uniq -c

# Output:
#   5 get_customer
#   2 update_customer
#   1 send_email

Validate a Trace¶

# Check trace format is valid
assay trace verify --trace traces/golden.jsonl --config eval.yaml

# Output:
# ✅ Trace verifies against config coverage

Compare Traces¶

# Diff two traces
diff -u traces/v1.jsonl traces/v2.jsonl

# Output:
# + Added: delete_customer (1 call)
# - Removed: verify_identity (was 1 call)
# ~ Changed: update_customer arguments differ

Trace Best Practices¶

1. Use Descriptive Names¶

traces/
├── golden-customer-flow.jsonl      # ✅ Clear purpose
├── edge-case-empty-cart.jsonl      # ✅ Specific scenario
└── test1.jsonl                     # ❌ Unclear

2. Version Your Traces¶

When agent behavior changes intentionally, create new traces:

# Old baseline
traces/v1-customer-flow.jsonl

# New baseline after feature addition
traces/v2-customer-flow.jsonl

3. Keep Traces Small¶

Large traces slow down testing. Record only what's needed:

Good: 10-50 tool calls covering critical paths
Avoid: 1000+ calls from a full day's logs

4. Commit Golden Traces¶

Your "golden" traces should be in version control:

git add traces/golden.jsonl
git commit -m "Add golden trace for customer workflow"

Trace vs. Live Testing¶

Aspect	Trace Replay	Live LLM Call
Speed	3ms	3+ seconds
Cost	$0.00	$0.01-$1.00
Determinism	100%	~80-95%
Network	Not required	Required
Use case	CI/CD, regression	Exploration, new features

Use traces for: CI gates, regression testing, debugging production issues.

Use live calls for: Developing new features, exploring model behavior.