CI Regression Gate¶
Catch breaking changes before they hit production.
The Problem¶
Traditional AI agent tests are:
- Slow: 30 seconds to 3 minutes per test (LLM API calls)
- Expensive: $0.10–$1.00 per test run
- Flaky: 5-20% random failure rate (network, model variance)
This leads to:

- Developers ignoring test failures ("it's probably flaky")
- PRs merging without proper validation
- Bugs reaching production
The Solution¶
Assay's CI gate provides:
- 3ms tests — Replay traces, don't call APIs
- $0 cost — No API charges
- 0% flakiness — Deterministic replay
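The replay approach behind these numbers can be sketched in a few lines: a test reads recorded tool calls from a trace file and asserts over them, never touching the network. The function and field names below are illustrative, not Assay's API:

```python
import json

def load_trace(path):
    """Read recorded tool calls from a JSONL trace file (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def assert_no_blocked_tools(calls, blocked):
    """Example check: fail if any recorded call used a blocked tool."""
    for call in calls:
        assert call["tool"] not in blocked, f"blocked tool called: {call['tool']}"

# Replaying a recorded trace is just reading a file: no network, no model,
# so checks run in milliseconds and produce the same result every time.
```

Because the trace is a fixed artifact checked into git, the same input always produces the same verdict, which is where the 0% flake rate comes from.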
Setup¶
1. Record a Golden Trace¶
```shell
# Export from MCP Inspector (or your agent framework)
assay import --format mcp-inspector session.json --init
```
This creates:

- `traces/session.jsonl` — Your baseline behavior
- `mcp-eval.yaml` — Test configuration
- `policies/default.yaml` — Validation rules
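A trace file is newline-delimited JSON, one recorded event per line. The records below are illustrative only; the actual Assay trace schema may differ:

```json
{"type": "tool_call", "tool": "authenticate", "args": {"user_id": "u_123"}}
{"type": "tool_call", "tool": "get_data", "args": {"customer": "acme"}}
```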
2. Add to CI¶
```yaml
# .github/workflows/agent-tests.yml
name: Agent Quality Gate

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  assay:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Assay
        run: cargo install assay

      - name: Run Tests
        run: |
          assay run \
            --config mcp-eval.yaml \
            --trace-file traces/golden.jsonl \
            --strict \
            --output sarif \
            --db :memory:

      - name: Upload Results
        uses: github/codeql-action/upload-sarif@v2
        if: always()
        with:
          sarif_file: .assay/reports/results.sarif
```
3. Configure Policies¶
```yaml
# mcp-eval.yaml
version: "1"
suite: agent-regression

tests:
  # Validate all tool arguments
  - id: args_valid
    metric: args_valid
    policy: policies/business-rules.yaml

  # Enforce required sequences
  - id: auth_flow
    metric: sequence_valid
    rules:
      - type: require
        tool: authenticate
      - type: before
        first: authenticate
        then: [get_data, update_data]

  # Block dangerous tools
  - id: safety
    metric: tool_blocklist
    blocklist:
      - delete_*
      - admin_*
      - debug_*

output:
  format: [sarif, junit]
  directory: .assay/reports
```
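As a mental model, the `require` and `before` rules above amount to a single walk over the ordered list of tool calls. This is an illustrative sketch, not Assay's implementation; the rule schema simply mirrors the YAML above:

```python
def check_sequence(calls, rules):
    """Check 'require' and 'before' rules against an ordered list of tool names.

    Returns a list of human-readable violations (empty list means the trace passes).
    """
    violations = []
    for rule in rules:
        if rule["type"] == "require":
            # The tool must appear somewhere in the trace.
            if rule["tool"] not in calls:
                violations.append(f"required tool never called: {rule['tool']}")
        elif rule["type"] == "before":
            # Every listed tool must come after the first call to `first`.
            first = rule["first"]
            first_idx = calls.index(first) if first in calls else None
            for then in rule["then"]:
                if then in calls and (first_idx is None or calls.index(then) < first_idx):
                    violations.append(f"{then} called before {first}")
    return violations
```

Running this against a replayed trace is pure list processing, which is why sequence checks stay in the millisecond range.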
Results¶
Metrics¶
| Metric | Before | After |
|---|---|---|
| Test duration | 3-5 min | 50ms |
| Cost per PR | $2-5 | $0 |
| Flake rate | 10-20% | 0% |
| Developer trust | Low | High |
What Gets Caught¶
Argument Violations¶
```text
❌ PR Check Failed: args_valid

Tool: apply_discount
Argument: percent = 50
Violation: Value exceeds maximum (max: 30)
File: prompts/discount-handler.yaml:15
```
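Conceptually, an `args_valid` check compares each recorded argument against policy bounds. A minimal sketch, assuming a hypothetical per-tool constraint table (not Assay's policy format):

```python
def check_args(call, constraints):
    """Compare a recorded tool call's arguments against numeric bounds.

    `constraints` maps tool name -> argument name -> {"max": ...}; the schema
    is hypothetical, for illustration only.
    """
    violations = []
    for name, value in call["args"].items():
        bounds = constraints.get(call["tool"], {}).get(name)
        if bounds and "max" in bounds and value > bounds["max"]:
            violations.append(
                f"{call['tool']}.{name}={value} exceeds max {bounds['max']}"
            )
    return violations
```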
Sequence Violations¶
```text
❌ PR Check Failed: sequence_valid

Rule: auth_before_data
Expected: authenticate before get_customer
Actual: get_customer called without prior authenticate
File: agents/customer-service.py:42
```
Blocklist Violations¶
```text
❌ PR Check Failed: tool_blocklist

Blocked tool called: admin_delete
This tool is not allowed in production agents.
File: agents/admin-handler.py:88
```
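Patterns like `delete_*` are glob-style wildcards. A minimal sketch of such matching using Python's `fnmatch` (Assay's exact matching semantics may differ):

```python
from fnmatch import fnmatch

def is_blocked(tool, blocklist):
    """Return True if the tool name matches any glob-style blocklist pattern."""
    return any(fnmatch(tool, pattern) for pattern in blocklist)
```

With the blocklist from the config above, `is_blocked("admin_delete", ["delete_*", "admin_*", "debug_*"])` matches on the `admin_*` pattern.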
GitHub Integration¶
SARIF Annotations¶
SARIF output creates inline annotations on your PR, so each violation appears next to the file and line that caused it.
Status Checks¶
The job appears as a required check on the PR, alongside your other status checks.
Best Practices¶
1. Run on Every PR¶
Trigger the workflow on `pull_request` (as in the example above) so every change is validated before merge.
2. Block Merges on Failure¶
In GitHub: Settings → Branches → Branch protection rules

- ✅ Require status checks to pass
- ✅ Require "Agent Quality Gate" to pass
3. Keep Tests Fast¶
Use focused trace files and the in-memory database (`--db :memory:`) so the gate stays in the millisecond range.
4. Separate Fast and Slow Tests¶
```yaml
jobs:
  fast-tests:
    # Assay (milliseconds, free)
    steps:
      - uses: Rul1an/assay-action@v1

  slow-tests:
    needs: fast-tests  # Only if fast tests pass
    # Real LLM tests (minutes, paid)
    steps:
      - run: pytest tests/integration
```
Troubleshooting¶
Tests Pass Locally, Fail in CI¶
Check for environment differences:

- Same Assay version?
- Same trace file (check git)?
- Same policy files?
False Positives¶
If tests fail incorrectly:
- Check the violation — Is it a real issue or policy misconfiguration?
- Update policy — Loosen constraints if too strict
- Update trace — Re-record if agent behavior changed intentionally
Slow CI Jobs¶
If jobs take too long:
```shell
# Use in-memory mode
assay run --db :memory:

# Skip large traces
--trace-file traces/focused-test.jsonl  # Not the 1000-call log
```