Core Concepts¶
Understand the building blocks of Assay.
Overview¶
Assay is built on four core concepts:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Traces │ ──► │ Policies │ ──► │ Metrics │ ──► │ Replay │
│ (record) │ │ (define) │ │ (validate) │ │ (execute) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
- Traces — Recorded agent behavior (the "what happened")
- Policies — Validation rules (the "what's correct")
- Metrics — Validation functions (the "how to check")
- Replay — Deterministic execution (the "how to test")
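To make the split between these concepts concrete, here is a minimal Python sketch. It is not Assay's implementation: the trace fields (`tool`, `args`), the policy shape, and the function bodies are assumptions made for illustration. Only the metric names `args_valid` and `sequence_valid` come from this page.

```python
import json

# Hypothetical JSONL trace: one recorded tool call per line.
# The field names "tool" and "args" are assumptions, not Assay's schema.
TRACE_JSONL = """\
{"tool": "lookup_customer", "args": {"customer_id": "c-42"}}
{"tool": "issue_refund", "args": {"customer_id": "c-42", "amount": 25}}
"""

# Hypothetical policy: required argument names per tool.
POLICY = {
    "lookup_customer": {"required": ["customer_id"]},
    "issue_refund": {"required": ["customer_id", "amount"]},
}


def load_trace(text):
    """Parse JSONL text into a list of tool-call dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def args_valid(trace, policy):
    """Return a list of violations; an empty list means the check passed."""
    violations = []
    for i, call in enumerate(trace):
        rule = policy.get(call["tool"], {})
        for name in rule.get("required", []):
            if name not in call["args"]:
                violations.append(f"call {i}: {call['tool']} is missing '{name}'")
    return violations


def sequence_valid(trace, before, after):
    """Check that every `after` call is preceded by at least one `before` call."""
    seen_before = False
    for call in trace:
        if call["tool"] == before:
            seen_before = True
        elif call["tool"] == after and not seen_before:
            return False
    return True


trace = load_trace(TRACE_JSONL)
print(args_valid(trace, POLICY))  # []
print(sequence_valid(trace, "lookup_customer", "issue_refund"))  # True
```

Both functions depend only on their inputs, so calling them again on the same trace and policy gives the same answer. That is the property the Replay step relies on.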
The Testing Flow¶
graph TD
A[Agent Session] -->|Export| B[MCP Inspector]
B -->|Import| C[Trace File]
C --> D[Replay Engine]
E[Policy Files] --> D
D --> F{Metrics}
F -->|args_valid| G[Check Arguments]
F -->|sequence_valid| H[Check Order]
F -->|tool_blocklist| I[Check Blocklist]
G --> J[Results]
H --> J
I --> J
J -->|SARIF/JUnit| K[CI Report]
Concepts in Depth¶
- Traces
  Recorded agent sessions in a normalized format. The "golden" behavior you test against.
  - What is a trace?
  - Trace format (JSONL)
  - Creating and managing traces
  - Fingerprinting
- Policies
  Rules that define "correct" behavior for tool arguments.
  - Policy structure
  - Constraint types
  - Built-in formats
  - Real-world examples
- Metrics
  Pure functions that validate agent behavior.
  - args_valid
  - sequence_valid
  - tool_blocklist
  - Why deterministic?
- Replay Engine
  Deterministic re-execution without calling LLMs or tools.
  - How replay works
  - Strict vs. lenient mode
  - Determinism guarantees
  - Performance
- Cache & Fingerprints
  Intelligent caching to skip redundant work.
  - How caching works
  - Fingerprint computation
  - Cache invalidation
  - CI best practices
- Mandates
  Cryptographic proof of user authorization for AI agent actions.
  - What is a mandate?
  - Intent vs transaction
  - Revocation and expiry
  - Evidence output
- Pack Registry
  Secure, reproducible fetching of compliance packs from remote registries.
  - Resolution order (local → bundled → registry → BYOS)
  - Canonical digests (JCS + SHA-256)
  - DSSE signature verification
  - No-TOFU trust model
  - Lockfile for CI reproducibility
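The idea behind a canonical digest can be sketched in a few lines of Python. This is an approximation for illustration only. `json.dumps` with sorted keys and compact separators matches JCS (RFC 8785) output for simple data such as ASCII strings and small integers, but it does not implement the full JCS rules for number formatting or for ordering non-ASCII keys. The `canonical_digest` name and the `sha256:` prefix are assumptions, not Assay's format.

```python
import hashlib
import json


def canonical_digest(document):
    """Hash a JSON-compatible value so that key order does not affect the result."""
    canonical = json.dumps(
        document,
        sort_keys=True,          # stable key order
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep non-ASCII characters instead of \u escapes
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two logically identical documents with different key order (hypothetical data).
a = {"name": "example-pack", "version": 1}
b = {"version": 1, "name": "example-pack"}
c = {"name": "example-pack", "version": 2}

print(canonical_digest(a) == canonical_digest(b))  # True
print(canonical_digest(a) == canonical_digest(c))  # False
```

Recording a digest like this in a lockfile is one common way to detect that fetched content has changed between runs.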
Quick Reference¶
| Concept | Purpose | Key Files / Commands |
|---|---|---|
| Traces | Record behavior | traces/*.jsonl |
| Policies | Define rules | policies/*.yaml |
| Metrics | Validate | Built into Assay |
| Replay | Execute | assay run |
| Cache | Optimize | .assay/store.db |
| Mandates | User authorization | audit.ndjson, decisions.ndjson |
| Pack Registry | Fetch compliance packs | assay.packs.lock, ~/.assay/cache/packs/ |
How They Work Together¶
Example: Customer Service Agent¶
1. Record a session → Creates a trace
2. Define policies → What's valid?
3. Configure metrics → What to check?
# eval.yaml
tests:
  - id: args_valid
    metric: args_valid
    policy: policies/customer.yaml
  - id: no_admin
    metric: tool_blocklist
    blocklist: [admin_*]
4. Run replay → Execute tests
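As an illustration of what the `no_admin` test in step 3 could check, the sketch below applies a blocklist pattern to the tool names in a trace. It assumes that `admin_*` is a shell-style glob and uses Python's `fnmatch` module for matching. The tool names and the `tool_blocklist` function body are hypothetical and not Assay's implementation.

```python
from fnmatch import fnmatchcase


def tool_blocklist(tool_names, blocklist):
    """Return the tool names that match any blocklist pattern."""
    return [
        name
        for name in tool_names
        if any(fnmatchcase(name, pattern) for pattern in blocklist)
    ]


# Tool names as they might appear, in order, in a recorded session (hypothetical).
called = ["lookup_customer", "admin_delete_user", "issue_refund"]

blocked = tool_blocklist(called, ["admin_*"])
print(blocked)                            # ['admin_delete_user']
print("PASS" if not blocked else "FAIL")  # FAIL
```

Returning the matching names instead of a bare boolean keeps the failure actionable, since the report can say exactly which call violated the rule.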
Key Principles¶
1. Determinism¶
Given the same trace and policy files, every Assay test produces the same result on every run. No network variance, no model variance, no timing variance.
2. Speed¶
Tests run in milliseconds, not minutes. This enables running tests on every PR without blocking developers.
3. Local-First¶
Everything runs on localhost. No data leaves your network. Works in air-gapped environments.
4. Developer Experience¶
Assay provides clear error messages, actionable suggestions, and standard output formats (SARIF, JUnit).