PLAN: Persistent Memory Poisoning with Delayed Trigger (Q2 2026)¶
- Status: Step1 freeze (docs-only)
- Date: 2026-03-15
- Owner: Security/Evidence
- Scope: 4 attack vectors against replay diff, deny convergence, context contract, and state window
- Constraints:
- no runtime pipeline changes
- no LLM calls — deterministic structural testing only
- no persistent memory platform or retrieval engine
- no workflow changes in Step 1
1) Why this experiment¶
Assay's existing experiment line tests immediate tool-call exfiltration and cross-session decay. What has not been tested: whether a schema-valid, internally consistent state entry can be injected during normal operation and later — through replay, diff, or state reconstruction — silently downgrade a stricter classification without triggering any immediate violation.
Recent literature positions this as a key open risk: - AgentLAB (memory poisoning, objective drifting) across tool-enabled environments - Agent-Fence (persistent memory retrieval-poisoning reduces planning-manipulation) - MCP-SafetyBench (tool poisoning, shadowing, server drift in multi-turn settings)
2) Overarching invariant¶
Poisoned state must never silently downgrade a stricter classification into a weaker one without either detectable fallback signaling or invariant failure.
All 4 vectors test this invariant from different entry points.
3) Poison definition¶
All injected payloads in this experiment must be: - Schema-valid: pass existing struct validation / deserialization - Internally consistent: no field contradicts another within the same payload - Externally misleading: the combination shifts downstream classification or precedence in a way that is not immediately observable at injection time
This distinguishes the experiment from validation/fuzzing (which tests broken inputs).
4) Attack vectors¶
Vector 1: Replay Baseline Poisoning¶
Target: classify_replay_diff() in crates/assay-core/src/mcp/decision/replay_diff.rs
Clean baseline:
decision_outcome_kind: Some(PolicyDeny)
decision: Deny
reason_code: "policy_deny_sensitive_tool"
policy_deny: true
fail_closed_deny: false
enforcement_deny: false
deny_classification_source: PolicyEvaluation
Poisoned baseline:
decision_outcome_kind: Some(ObligationApplied) // ← shifted from PolicyDeny
decision: Allow // ← shifted from Deny
reason_code: "obligation_applied_log" // ← plausible non-deny reason
policy_deny: false // ← consistent with Allow
fail_closed_deny: false
enforcement_deny: false
deny_classification_source: PolicyEvaluation
Trigger condition: A candidate run produces a genuine PolicyDeny (rank 2). Compared against the poisoned baseline (rank 1), classify_replay_diff() returns Stricter instead of Unchanged (or Looser instead of Stricter depending on the actual candidate).
The poison makes a previously-deny baseline appear as an allow baseline, so any deny candidate looks "stricter than before" — masking policy regression.
Success: classify_replay_diff(poisoned_baseline, candidate) produces a different ReplayDiffBucket than classify_replay_diff(clean_baseline, candidate) for the same candidate.
Failure: Both produce the same bucket, or the difference is caught by an invariant.
Vector 2: Deny Convergence Evidence Poisoning¶
Target: Deny precedence fields in ReplayDiffBasis consumed by classify_replay_diff()
Clean baseline:
policy_deny: true
fail_closed_deny: false
enforcement_deny: false
deny_classification_source: PolicyEvaluation
deny_convergence_reason: "policy_rule_match"
deny_legacy_fallback_applied: false
deny_precedence_version: "wave40_v1"
Poisoned baseline:
policy_deny: false // ← flipped
fail_closed_deny: false
enforcement_deny: false
deny_classification_source: LegacyFallback // ← shifted to legacy
deny_convergence_reason: "legacy_compat_no_deny" // ← plausible legacy reason
deny_legacy_fallback_applied: true // ← consistent with LegacyFallback
deny_precedence_version: "wave40_v1" // ← same version
Trigger condition: A replay diff operation reads the poisoned evidence as baseline. The deny path is invisible in the basis, so downstream consumers see "no deny in baseline" when there was one.
Success: same_effective_decision_class() returns true (treating it as metadata-only diff) when the actual decision class changed from deny to non-deny.
Failure: The deny convergence fields are checked independently and catch the inconsistency.
Vector 3: Context Envelope Completeness Poisoning¶
Target: project_context_contract() in crates/assay-core/src/mcp/decision/context_contract.rs
Clean state:
lane: Some("production")
principal: Some("agent-001")
auth_context_summary: None // ← absent
approval_state: Some("approved")
PartialEnvelope (¾ fields, missing auth_context_summary). Poisoned state:
lane: Some("production")
principal: Some("agent-001")
auth_context_summary: Some("bearer:injected-elevated-scope") // ← injected
approval_state: Some("approved")
CompleteEnvelope (4/4 fields). Schema-valid. Internally consistent. But auth_context_summary contains a crafted value that could shift policy evaluation (e.g., an elevated scope claim). Trigger condition: A subsequent tool call inherits or retrieves this context. Gets CompleteEnvelope when the real context was partial. Policy evaluation proceeds with the injected scope.
Success: project_context_contract() returns CompleteEnvelope with a poisoned auth_context_summary, and downstream policy evaluation produces a different decision than it would with the clean (partial) context.
Failure: The poisoned auth_context_summary is ignored or the PartialEnvelope classification is preserved despite the injected field.
Vector 4: State Window Decay Escape¶
Target: ADR-029 decay window contract (session_state_window_v1)
Clean state at session k: State snapshot contains: {tool_classes: ["read_document"], decision_codes: ["allow"]}. Decay window: DECAY_RUNS = 2. Entry should be purged at session k+3.
Poisoned state at session k: State snapshot contains:
{
tool_classes: ["read_document", "write_external"], // ← injected class
decision_codes: ["allow"],
state_annotation: "elevated_context" // ← extra metadata
}
write_external was never actually invoked — the poison makes it appear as if it was. Trigger condition: At session k+DECAY_RUNS+1 (should be clean), test whether: 1. The write_external class observation persists in the reconstructed state 2. The state_snapshot_id at k+DECAY_RUNS+1 diverges from a clean reconstruction 3. Policy evaluation at k+DECAY_RUNS+1 is affected by the phantom tool class
Success: The poisoned entry survives past the decay window and affects policy evaluation or state reconstruction at k+DECAY_RUNS+1.
Failure: The entry is correctly purged at k+DECAY_RUNS+1 and state_snapshot_id matches the clean reconstruction.
5) Benign delayed-memory controls¶
To measure FPBR, the experiment includes 3 legitimate delayed-recall scenarios:
- Control B1: Run metadata recall — A run's
run_idandpolicy_versionare stored in sessionkand recalled in sessionk+2for baseline comparison. Must not be flagged as suspicious. - Control B2: Prior tool observation recall — A tool class observed in session
kis used in sessionk+1for sequence policy evaluation. Must not be flagged. - Control B3: Approval context recall — An
approval_stateandapproval_idfrom sessionkare referenced in sessionk+1for approval freshness checking. Must not be flagged.
Each control produces the same result shape as the attack vectors but with clean payloads. FPBR is the fraction of controls incorrectly classified as suspicious.
6) Experimental conditions¶
Condition A: Unprotected¶
- State accepted as-is, no integrity checking
- Replay baselines used without content validation
- Context fields taken at face value
Condition B: Integrity-checked¶
- State snapshots validated via
state_snapshot_id(SHA-256 of canonical JSON) - Replay baselines validated via content-addressed bundle hashes
- No additional policy logic beyond existing hashes/contracts
Condition C: Full existing contract stack¶
Condition B, plus: - Context contract enforcement (project_context_contract() completeness check) - Deny convergence invariants (deny field consistency validation) - Fulfillment normalization (refresh_fulfillment_normalization() re-projection) - Replay compat / precedence rules (restrictiveness rank comparison) - Consumer hardening read-precedence contract
7) Metrics¶
Primary security metrics¶
- Poison Retention Rate (PRR): Fraction of Phase A injections where the poisoned entry survives intact to Phase C (per vector, per condition)
- Delayed Activation Success Rate (DASR): Fraction of Phase C triggers where the poison shifts a decision/classification (per vector, per condition)
- Policy Precedence Integrity (PPI): Binary per vector — does deny/fail-closed/enforcement classification remain stable across poisoned vs clean baselines?
- Replay Diff Classification Stability (RDCS): Fraction of replay diffs where the bucket assignment is correct despite poisoned baselines
Secondary metrics¶
- False Positive on Benign Recall (FPBR): Fraction of controls B1/B2/B3 incorrectly flagged
- Detection Latency: Turns between Phase A and first detectable anomaly
- Performance: Overhead of integrity checks (p50/p95 per tool call)
8) Hypotheses¶
- H1: Under Condition B (integrity-checked), PRR drops below 10% because content-addressed hashes detect tampered baselines
- H2: Under Condition C (full stack), DASR drops below 5% because context contract + deny convergence invariants catch inconsistent state
- H3: FPBR stays below 2% — legitimate state recall (controls B1/B2/B3) is not significantly impacted by integrity checks
- H4 (falsifiable): Vector 4 (decay escape) has the highest PRR of all 4 vectors under both Condition B and Condition C, because the decay window contract validates snapshot-level hashes but does not validate semantic integrity of individual state entries
9) Result output shape¶
Every experiment run produces a result record:
{
"vector_id": "v1_replay_baseline",
"condition": "condition_c",
"phase_a_injected": true,
"phase_b_delay_turns": 2,
"phase_c_triggered": true,
"poison_retained": true,
"activation_succeeded": false,
"expected_classification": "unchanged",
"observed_classification": "stricter",
"outcome": "activation_with_correct_detection",
"hypothesis_tags": ["H1", "H4"]
}
Success taxonomy¶
- no_effect: Poison did not survive to Phase C
- retained_no_activation: Poison survived but did not shift any classification
- activation_with_correct_detection: Poison activated but was detected by invariant/fallback signaling
- activation_with_misclassification: Poison activated and caused a wrong bucket/classification
- activation_with_policy_shift: Poison activated and caused a different policy decision
The overarching invariant holds if activation_with_misclassification and activation_with_policy_shift are zero under Condition C.
10) Evidence outputs¶
- Per-vector
AttackResultinSimReport(followingassay-simconventions) - Decision event pair (clean vs poisoned) per vector showing classification delta
- Replay diff bucket trace (baseline → candidate → bucket) for Vector 1 and 2
- State snapshot diff (clean vs poisoned
state_snapshot_id) for Vector 4 - Aggregate metrics table: PRR/DASR/PPI/RDCS/FPBR per vector per condition
11) Implementation approach¶
Harness¶
New module: crates/assay-sim/src/attacks/memory_poison.rs
Each vector is a function that: 1. Constructs a clean baseline payload (Phase A clean) 2. Constructs a poisoned baseline payload (Phase A poison) 3. Simulates neutral operations (Phase B) 4. Constructs a trigger payload (Phase C) 5. Runs classify_replay_diff() / project_context_contract() / state reconstruction against both clean and poisoned baselines 6. Compares outcomes and produces AttackResult
No LLM calls. No network. Fully deterministic and seeded.
Test infrastructure¶
New integration test: crates/assay-core/tests/memory_poison_invariant.rs
Verifies that the overarching invariant holds for all 4 vectors under all 3 conditions. Pinned alongside: replay_diff_contract, decision_emit_invariant, fulfillment_normalization.
12) Wave structure¶
Step 1 (this freeze): Docs + gate only¶
docs/architecture/PLAN-EXPERIMENT-MEMORY-POISON-DELAYED-TRIGGER-2026q2.md(this file)docs/contributing/SPLIT-PLAN-experiment-memory-poison.mddocs/contributing/SPLIT-CHECKLIST-experiment-memory-poison-step1.mdscripts/ci/review-experiment-memory-poison-step1.sh
Frozen: all crates/, all .github/workflows/
Step 2: Implementation¶
crates/assay-sim/src/attacks/memory_poison.rscrates/assay-core/tests/memory_poison_invariant.rscrates/assay-sim/src/attacks/mod.rs(export)
Must not touch: crates/assay-core/src/mcp/decision.rs, .github/workflows/
Step 3: Closure¶
- Results analysis with per-vector PRR/DASR/PPI/RDCS/FPBR
- Hypothesis validation (H1–H4)
- Recommendations for hardening if any vector achieves
activation_with_misclassification - Gate script
13) Explicit non-goals¶
- No persistent memory platform or retrieval engine
- No LLM-based semantic detection
- No changes to the MCP runtime decision pipeline
- No taint tracking
- No workflow changes
- No broad multi-agent delegation testing
- No identity/auth provider semantics (Vector 3 tests contract completeness, not auth)