
Core Concepts

Understand the building blocks of Assay.


Overview

Assay is built on four core concepts:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Traces    │ ──► │  Policies   │ ──► │   Metrics   │ ──► │   Replay    │
│  (record)   │     │  (define)   │     │  (validate) │     │  (execute)  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
  1. Traces — Recorded agent behavior (the "what happened")
  2. Policies — Validation rules (the "what's correct")
  3. Metrics — Validation functions (the "how to check")
  4. Replay — Deterministic execution (the "how to test")

The Testing Flow

graph TD
    A[Agent Session] -->|Export| B[MCP Inspector]
    B -->|Import| C[Trace File]
    C --> D[Replay Engine]
    E[Policy Files] --> D
    D --> F{Metrics}
    F -->|args_valid| G[Check Arguments]
    F -->|sequence_valid| H[Check Order]
    F -->|tool_blocklist| I[Check Blocklist]
    G --> J[Results]
    H --> J
    I --> J
    J -->|SARIF/JUnit| K[CI Report]

Concepts in Depth

  • Traces


    Recorded agent sessions in a normalized format. The "golden" behavior you test against.

    • What is a trace?
    • Trace format (JSONL)
    • Creating and managing traces
    • Fingerprinting


  • Policies


    Rules that define "correct" behavior for tool arguments.

    • Policy structure
    • Constraint types
    • Built-in formats
    • Real-world examples


  • Metrics


    Pure functions that validate agent behavior.

    • args_valid
    • sequence_valid
    • tool_blocklist
    • Why deterministic?


  • Replay Engine


    Deterministic re-execution without calling LLMs or tools.

    • How replay works
    • Strict vs. lenient mode
    • Determinism guarantees
    • Performance


  • Cache & Fingerprints


    Intelligent caching to skip redundant work.

    • How caching works
    • Fingerprint computation
    • Cache invalidation
    • CI best practices


  • Mandates


    Cryptographic proof of user authorization for AI agent actions.

    • What is a mandate?
    • Intent vs. transaction
    • Revocation and expiry
    • Evidence output


  • Pack Registry


    Secure, reproducible fetching of compliance packs from remote registries.

    • Resolution order (local → bundled → registry → BYOS)
    • Canonical digests (JCS + SHA-256)
    • DSSE signature verification
    • No-TOFU trust model
    • Lockfile for CI reproducibility

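The canonical-digest idea above can be sketched in a few lines. This is only an approximation: real JCS (RFC 8785) also prescribes exact number serialization, so compact, key-sorted JSON matches it only for simple payloads, and the function below is illustrative rather than Assay's implementation.

```python
# Rough approximation of a JCS + SHA-256 canonical digest.
# RFC 8785 (JCS) additionally normalizes number formatting;
# sorted-keys compact JSON only matches it for simple payloads.
import hashlib
import json

def canonical_digest(obj):
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()

# Key order does not affect the digest:
a = canonical_digest({"name": "pci-dss", "version": "1.2.0"})
b = canonical_digest({"version": "1.2.0", "name": "pci-dss"})
print(a == b)  # True
```

Because the digest depends only on the canonical bytes, the same pack content always resolves to the same digest, which is what makes a lockfile reproducible in CI.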


Quick Reference

Concept         Purpose                   Key Files
Traces          Record behavior           traces/*.jsonl
Policies        Define rules              policies/*.yaml
Metrics         Validate                  Built into Assay
Replay          Execute                   assay run
Cache           Optimize                  .assay/store.db
Mandates        User authorization        audit.ndjson, decisions.ndjson
Pack Registry   Fetch compliance packs    assay.packs.lock, ~/.assay/cache/packs/

How They Work Together

Example: Customer Service Agent

1. Record a session → Creates a trace

assay import --format inspector session.json
# Creates: traces/session.jsonl
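A trace file is line-delimited JSON, one event per line. The field names below are illustrative only, not Assay's actual schema; they show the kind of information a recorded tool call carries:

```json
{"type": "tool_call", "tool": "apply_discount", "arguments": {"percent": 15}}
```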

2. Define policies → What's valid?

# policies/customer.yaml
tools:
  apply_discount:
    arguments:
      percent: { type: number, max: 30 }
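Conceptually, an args_valid-style check is a pure function of (recorded call, policy). The sketch below is illustrative, not Assay's implementation; the dict shapes mirror the YAML above.

```python
# Illustrative sketch of an args_valid-style check -- not Assay's
# actual implementation. A policy maps tool names to argument
# constraints; the metric is a pure function of (call, policy).

def args_valid(call, policy):
    """Return a list of constraint violations for one recorded tool call."""
    violations = []
    rules = policy.get("tools", {}).get(call["tool"], {}).get("arguments", {})
    for name, rule in rules.items():
        value = call["arguments"].get(name)
        if rule.get("type") == "number" and not isinstance(value, (int, float)):
            violations.append(f"{name}: expected number, got {type(value).__name__}")
        elif "max" in rule and value is not None and value > rule["max"]:
            violations.append(f"{name}: {value} exceeds max {rule['max']}")
    return violations

policy = {"tools": {"apply_discount": {"arguments": {"percent": {"type": "number", "max": 30}}}}}
print(args_valid({"tool": "apply_discount", "arguments": {"percent": 15}}, policy))  # []
print(args_valid({"tool": "apply_discount", "arguments": {"percent": 45}}, policy))  # ['percent: 45 exceeds max 30']
```

Because the check reads only the trace and the policy, it needs no network and produces the same result on every run.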

3. Configure metrics → What to check?

# eval.yaml
tests:
  - id: args_valid
    metric: args_valid
    policy: policies/customer.yaml
  - id: no_admin
    metric: tool_blocklist
    blocklist: [admin_*]
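A tool_blocklist-style check can be sketched as glob matching over the tool names in a trace. This is an illustration, not Assay's implementation; it assumes patterns like admin_* behave as shell-style globs.

```python
# Illustrative sketch of a tool_blocklist-style check -- not Assay's
# actual implementation. Patterns like "admin_*" are treated as
# shell-style globs against each tool name seen in the trace.
from fnmatch import fnmatch

def blocked_calls(tool_names, blocklist):
    """Return the tool names that match any blocklist pattern."""
    return [t for t in tool_names
            if any(fnmatch(t, pattern) for pattern in blocklist)]

trace_tools = ["search_orders", "admin_delete_user", "apply_discount"]
print(blocked_calls(trace_tools, ["admin_*"]))  # ['admin_delete_user']
```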

4. Run replay → Execute tests

assay run --config eval.yaml --strict
# Result: Pass/Fail in 3ms
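The whole run above is conceptually a fold over recorded events, with no network or model calls, which is why it finishes in milliseconds. A minimal sketch (illustrative only, not Assay's replay engine):

```python
# Conceptual sketch of a replay pass: apply every configured metric
# to every recorded event and aggregate failures. Illustrative only,
# not Assay's replay engine.
def replay(events, metrics):
    failures = []
    for event in events:
        for name, check in metrics.items():
            problems = check(event)
            failures.extend(f"{name}: {p}" for p in problems)
    return {"passed": not failures, "failures": failures}

events = [{"tool": "apply_discount", "arguments": {"percent": 45}}]
metrics = {
    "args_valid": lambda e: (["percent exceeds max 30"]
                             if e["arguments"].get("percent", 0) > 30 else []),
}
print(replay(events, metrics))
# {'passed': False, 'failures': ['args_valid: percent exceeds max 30']}
```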

Key Principles

1. Determinism

Every Assay test produces the same result on every run. No network variance, no model variance, no timing variance.

2. Speed

Tests run in milliseconds, not minutes. This enables running tests on every PR without blocking developers.

3. Local-First

Everything runs on localhost. No data leaves your network. Works in air-gapped environments.

4. Developer Experience

Clear error messages, actionable suggestions, standard output formats (SARIF, JUnit).


See Also