Core Concepts¶
Understand the building blocks of Assay.
Overview¶
Assay is built on four core concepts:
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Traces    │ ──► │  Policies   │ ──► │   Metrics   │ ──► │   Replay    │
│  (record)   │     │  (define)   │     │ (validate)  │     │  (execute)  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
```
- Traces — Recorded agent behavior (the "what happened")
- Policies — Validation rules (the "what's correct")
- Metrics — Validation functions (the "how to check")
- Replay — Deterministic execution (the "how to test")
The Testing Flow¶
```mermaid
graph TD
    A[Agent Session] -->|Export| B[MCP Inspector]
    B -->|Import| C[Trace File]
    C --> D[Replay Engine]
    E[Policy Files] --> D
    D --> F{Metrics}
    F -->|args_valid| G[Check Arguments]
    F -->|sequence_valid| H[Check Order]
    F -->|tool_blocklist| I[Check Blocklist]
    G --> J[Results]
    H --> J
    I --> J
    J -->|SARIF/JUnit| K[CI Report]
```

Concepts in Depth¶
- **Traces**: Recorded agent sessions in a normalized format; the "golden" behavior you test against.
    - What is a trace?
    - Trace format (JSONL)
    - Creating and managing traces
    - Fingerprinting
- **Policies**: Rules that define "correct" behavior for tool arguments.
    - Policy structure
    - Constraint types
    - Built-in formats
    - Real-world examples
- **Metrics**: Pure functions that validate agent behavior (see the sketch after this list).
    - args_valid
    - sequence_valid
    - tool_blocklist
    - Why deterministic?
- **Replay Engine**: Deterministic re-execution without calling LLMs or tools.
    - How replay works
    - Strict vs. lenient mode
    - Determinism guarantees
    - Performance
- **Cache & Fingerprints**: Intelligent caching to skip redundant work (sketched after the Quick Reference table below).
    - How caching works
    - Fingerprint computation
    - Cache invalidation
    - CI best practices
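To make the "pure functions" idea concrete, here is a minimal sketch of a blocklist-style check. It is illustrative only: the event shape, function name, and signature are assumptions for this page, not Assay's actual API.

```python
# Illustrative sketch of a metric as a pure function (not Assay's actual code).
# The event shape ({"tool": ..., "args": ...}) and the signature are assumptions.
from fnmatch import fnmatch


def tool_blocklist(events: list[dict], blocklist: list[str]) -> list[str]:
    """Return one violation message per tool call that matches a blocked pattern.

    Pure and deterministic: the output depends only on the inputs, so the same
    trace and the same policy always produce the same verdict.
    """
    violations = []
    for i, event in enumerate(events):
        tool = event.get("tool", "")
        for pattern in blocklist:
            if fnmatch(tool, pattern):
                violations.append(f"event {i}: '{tool}' matches blocked pattern '{pattern}'")
    return violations


# Example: the second call trips the [admin_*] blocklist used in the config below.
events = [
    {"tool": "lookup_order", "args": {"order_id": "A-1"}},
    {"tool": "admin_delete_user", "args": {"user_id": "42"}},
]
print(tool_blocklist(events, ["admin_*"]))
```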
Quick Reference¶
| Concept | Purpose | Key Files |
|---|---|---|
| Traces | Record behavior | `traces/*.jsonl` |
| Policies | Define rules | `policies/*.yaml` |
| Metrics | Validate | Built into Assay |
| Replay | Execute | `assay run` |
| Cache | Optimize | `.assay/store.db` |
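The cache row above is what lets repeated runs skip redundant work. As a rough illustration of the idea (an assumed scheme, not necessarily how Assay computes fingerprints or lays out `.assay/store.db`), hashing every input that can affect a test's result yields a key that changes exactly when the result might change:

```python
# Fingerprint-based caching, sketched with assumed inputs and storage.
# If none of the inputs changed, the digest is the same and the cached
# verdict can be reused; editing any file invalidates the entry automatically.
import hashlib
from pathlib import Path


def fingerprint(*paths: str) -> str:
    """Hash the contents of every file that can influence the test result."""
    digest = hashlib.sha256()
    for path in paths:
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


cache: dict[str, list[str]] = {}  # stand-in for a persistent store like .assay/store.db


def run_cached(trace: str, policy: str, config: str, run_test) -> list[str]:
    key = fingerprint(trace, policy, config)
    if key not in cache:          # miss: some input changed, run the test for real
        cache[key] = run_test()
    return cache[key]             # hit: skip the redundant run
```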
How They Work Together¶
Example: Customer Service Agent¶
1. Record a session → Creates a trace
2. Define policies → What's valid?
3. Configure metrics → What to check?

    ```yaml
    # mcp-eval.yaml
    tests:
      - id: args_valid
        metric: args_valid
        policy: policies/customer.yaml
      - id: no_admin
        metric: tool_blocklist
        blocklist: [admin_*]
    ```

4. Run replay → Execute tests
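Step 4 is where replay ties these pieces together. The sketch below shows only the general shape of that loop, with an assumed JSONL trace layout and assumed names; the point is that replay reads the recorded trace from disk and applies pure metric functions to it, never calling an LLM or a live tool:

```python
# Conceptual replay loop (a sketch of the idea, not Assay's engine).
import json


def load_trace(path: str) -> list[dict]:
    """Assumed trace layout: one recorded event per JSONL line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def replay(trace_path: str, metrics: dict) -> dict:
    """Apply each configured metric to the recorded events, entirely offline."""
    events = load_trace(trace_path)
    # Each metric is a pure function of (events, settings) -> list of violations,
    # so re-running the same trace against the same config gives the same result.
    return {name: check(events, settings) for name, (check, settings) in metrics.items()}


# Usage with a metric like the blocklist sketch shown earlier:
# results = replay("traces/session.jsonl", {"no_admin": (tool_blocklist, ["admin_*"])})
```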
Key Principles¶
1. Determinism¶
Every Assay test produces the same result on every run. No network variance, no model variance, no timing variance.
2. Speed¶
Tests run in milliseconds, not minutes. This enables running tests on every PR without blocking developers.
3. Local-First¶
Everything runs on localhost. No data leaves your network. Works in air-gapped environments.
4. Developer Experience¶
Clear error messages, actionable suggestions, standard output formats (SARIF, JUnit).