Semantic Gap Scenario Plan
Status: scenario-plan-ready plus Slice 4 full synthetic matrix-ready for the agent-observability fidelity roadmap. This document predeclared the baseline, scenarios, join requirements, claim classes, and evidence pack expectations before harness work; the synthetic harness now implements all six predeclared scenarios. The delegated positive-baseline gate is planned separately in delegated-baseline-plan.md.
Last updated: 2026-05-28
Goal
The semantic-gap experiment asks one narrow question:
When a trace reports one tool-call intent and Runner measures a system
effect, what claim is safe if those layers agree, disagree, or can only
be joined weakly?
This is not an overhead benchmark. It is a fidelity and claim-boundary experiment that uses the completed calibration guardrail and evidence pack prototype as prerequisites.
Prerequisites
| Prerequisite | Status | Why it matters |
|---|---|---|
| Fidelity calibration | Done for the overhead harness | A trace/archive comparison cannot interpret missing retained signal as efficient or safe behavior. |
| Evidence pack carrier | Prototype-ready | Every scenario should be reviewable as a small pack rather than a loose artifact pile. |
| Join contract | Reference-ready: join-result-v0.schema.json exists | Strong findings require an explicit join key and grade, not timestamp proximity. |
| Claim classes | Reference-ready: claim-class-cell-v0.schema.json exists | Reported intent, measured effects, derived joins, and inferred diagnostics must stay separate. |
The harness reuses assay.observability.join_result.v0, assay.observability.claim_class_cell.v0, and assay.experiment.agent_observability_fidelity.evidence_pack.v0 unless the implementation proves a version bump is required.
The Slice 4 synthetic harness lives in semantic_gap_harness.py. It emits all six predeclared synthetic scenarios. The original minimum exit gate remains the subset matched_safe_read, hidden_write, and weak_join_fallback.
Baseline
The baseline is a deterministic safe tool call:
| Field | Value |
|---|---|
| Scenario id | matched_safe_read |
| Tool call id | stable unique id, for example tc_semantic_gap_001 |
| Reported intent | read safe.txt |
| Measured effect | kernel/archive observes read/open of safe.txt inside the workdir |
| Expected join | tool_call_id, strong, tool_call, unique_within_scope=true |
| Expected claim | positive joined evidence: reported intent and measured effect agree inside the measurement boundary |
This baseline is not optional. Every gap scenario is interpreted against the same fixture contract and the same join path. Synthetic fixtures are acceptable for unit tests, but at least one delegated sanity run must prove this baseline under real Runner capture before any gap finding is published. Slice 7 pins that delegated sanity run in delegated-baseline-plan.md.
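
The expected baseline join can be checked mechanically. The sketch below assumes a join-result row is available as a plain dict. The names `join_key` and `join_grade` follow the enum names this plan mentions and `unique_within_scope` comes from the baseline table, while `join_scope` and the helper `is_baseline_join` are hypothetical and should be checked against join-result-v0.schema.json.

```python
from typing import Any, Mapping

# Expected baseline join, taken from the baseline table. The field name
# "join_scope" is an assumption; confirm it against join-result-v0.schema.json.
EXPECTED_BASELINE_JOIN = {
    "join_key": "tool_call_id",
    "join_grade": "strong",
    "join_scope": "tool_call",
    "unique_within_scope": True,
}


def is_baseline_join(row: Mapping[str, Any]) -> bool:
    """Return True only if every expected baseline field matches exactly."""
    return all(row.get(key) == value for key, value in EXPECTED_BASELINE_JOIN.items())


good = dict(EXPECTED_BASELINE_JOIN, scenario_id="matched_safe_read")
weak = dict(good, join_key="timestamp_or_order", join_grade="diagnostic")
print(is_baseline_join(good), is_baseline_join(weak))
```

A check of this shape would reject a timestamp-only row even when the reported and measured targets happen to match, which is the behavior the baseline contract asks for.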
Scenario Matrix
| ID | Role | Reported trace intent | Measured system effect | Join requirement | Expected safe claim |
|---|---|---|---|---|---|
| matched_safe_read | baseline | tool reports reading safe.txt | archive observes read/open of safe.txt | unique tool_call_id | strong positive join |
| path_rewrite | gap | tool reports safe-link.txt | archive observes the symlink target safe.txt, or both safe-link.txt and safe.txt, inside the same fixture boundary | same unique tool_call_id | semantic mismatch or projection ambiguity, not unsafe behavior |
| hidden_write | gap | tool reports read-only action | archive observes create/write of side-effect.txt in workdir | same unique tool_call_id | reported intent under-describes measured side effect |
| retry_self_correction | gap | trace summary records final successful read | archive records prior failed attempts before the final read | same unique tool_call_id plus ordered attempt index if available | trace summary loses temporal evidence |
| runtime_side_effect | gap | no tool-level event reports the runtime/config/probe path | archive observes runtime loader/config/probe path inside capture boundary | run-level join only unless a tool id exists | runtime-induced measured surface; diagnostic unless scoped to runtime setup |
| weak_join_fallback | fallback | tool event is missing tool_call_id | archive observes plausible matching effect near the same order/timestamp | timestamp/order only | diagnostic-only correlation, not semantic equality |
Scenario Notes
- path_rewrite uses one canonical rewrite pattern: the fixture creates safe-link.txt -> safe.txt, the trace reports safe-link.txt, and the measured archive is expected to observe the resolved target safe.txt or both paths, depending on kernel event shape. Both paths must remain inside the scenario workdir. This is a representation/projection gap, not automatically a policy failure.
- hidden_write is the sharpest same-tool-call divergence. It needs a clean Runner health gate and a unique tool-call join before it can support a strong joined-evidence claim.
- retry_self_correction should keep prior failed attempts visible even when the final trace span reports success. The point is temporal loss, not whether retry behavior is good or bad.
- runtime_side_effect is intentionally not framed as agent intent. It tests whether Assay can separate tool effects from runtime/framework effects. Runtime events emitted before the first tool-call event are run-scope only by definition. Runtime events near a tool call by timestamp/order alone must use the existing timestamp_or_order join key with diagnostic grade and may add ambiguous_proximity only as a freeform note, not as a new join_grade or join_key enum value. They must not be upgraded to a strong tool-call join.
- weak_join_fallback exists to prove the negative case: plausible timing is useful for investigation but must not become a strong claim.
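
The path_rewrite fixture can be illustrated with a small, self-contained sketch. It assumes a POSIX-like platform where unprivileged symlinks are allowed; the helper names `build_path_rewrite_fixture` and `inside` are hypothetical and are not taken from semantic_gap_harness.py.

```python
import os
import tempfile
from pathlib import Path


def build_path_rewrite_fixture(workdir: Path) -> tuple[Path, Path]:
    """Create safe.txt and a symlink safe-link.txt -> safe.txt in workdir."""
    target = workdir / "safe.txt"
    target.write_text("safe\n")
    link = workdir / "safe-link.txt"
    os.symlink(target.name, link)  # relative link, so it stays inside workdir
    return link, target


def inside(workdir: Path, path: Path) -> bool:
    """True if the fully resolved path lies inside the resolved workdir."""
    return path.resolve().is_relative_to(workdir.resolve())


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    reported, target = build_path_rewrite_fixture(workdir)
    reported_name = reported.name            # what the trace would report
    resolved_name = reported.resolve().name  # what a resolved measurement would show
    both_inside = inside(workdir, reported) and inside(workdir, target)

print(reported_name, "->", resolved_name, "inside workdir:", both_inside)
```

The reported name and the resolved name differ while both stay inside the workdir, which is why the scenario is treated as a projection gap rather than a boundary violation.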
Required Outputs
The harness slice should produce one output directory per scenario with stable names:

    semantic-gap-runs/<scenario-id>/
      join-result.json
      claim-class-cells.json
      evidence-pack/
        manifest.json
        summary.md
        redaction-manifest.json
        artifacts/...

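
A harness could verify this layout before a scenario is reviewed. The following is a minimal sketch, assuming that manifest.json, summary.md, redaction-manifest.json, and artifacts/ all sit inside evidence-pack/; the helper name `missing_outputs` is hypothetical.

```python
import tempfile
from pathlib import Path

# Relative paths from the planned layout. Nesting the manifest, summary,
# redaction manifest, and artifacts under evidence-pack/ is an assumption.
REQUIRED_FILES = (
    "join-result.json",
    "claim-class-cells.json",
    "evidence-pack/manifest.json",
    "evidence-pack/summary.md",
    "evidence-pack/redaction-manifest.json",
)
REQUIRED_DIRS = ("evidence-pack/artifacts",)


def missing_outputs(scenario_dir: Path) -> list[str]:
    """List required files and directories absent from one scenario directory."""
    missing = [p for p in REQUIRED_FILES if not (scenario_dir / p).is_file()]
    missing += [p for p in REQUIRED_DIRS if not (scenario_dir / p).is_dir()]
    return missing


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "semantic-gap-runs" / "matched_safe_read"
    (run_dir / "evidence-pack" / "artifacts").mkdir(parents=True)
    (run_dir / "join-result.json").write_text("{}")
    report = missing_outputs(run_dir)

print(report)
```

In this usage only join-result.json and the artifacts directory exist, so the report lists the four remaining files.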
Minimum required rows per scenario:
| Row | Requirement |
|---|---|
| Join result | One assay.observability.join_result.v0 row naming the key, grade, scope, uniqueness, fallback usage, and evidence refs. |
| Claim cells | At least one trace/reported cell, one archive/measured cell, and one joined-artifacts cell. |
| Evidence pack | One experiment-scoped evidence pack carrying the trace/archive or references, observation health, redaction manifest, and one-page summary. |
| Scenario verdict | A bounded verdict: positive_join, semantic_gap, diagnostic_only, or inconclusive. |
The synthetic harness emits scenario verdicts with assay.experiment.agent_observability_fidelity.semantic_gap_verdict.v0. That schema is experiment-scoped and covers the six synthetic scenario-plan rows; delegated findings or additional scenario types require a deliberate schema review before publication.
The evidence pack's scenario_id field must equal the scenario id from this plan, for example matched_safe_read, path_rewrite, hidden_write, retry_self_correction, runtime_side_effect, or weak_join_fallback. The Slice 4 harness can use the existing evidence-pack command; no evidence-pack CLI change is required for the planned directory layout.

    python3 docs/experiments/agent-observability-fidelity-2026-05/evidence_pack.py create \
      --out-dir semantic-gap-runs/<scenario-id>/evidence-pack

Evidence-pack claim_class should map verdicts conservatively:
| Scenario verdict | Evidence-pack claim_class |
|---|---|
| positive_join | positive_join |
| semantic_gap | semantic_gap |
| diagnostic_only | diagnostic |
| inconclusive | diagnostic |
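
The mapping is small enough to express as a lookup that fails loudly on anything outside the four bounded verdicts. This is a sketch only; the names `VERDICT_TO_CLAIM_CLASS` and `claim_class_for` are hypothetical.

```python
# Conservative verdict-to-claim_class mapping, copied from the table above.
VERDICT_TO_CLAIM_CLASS = {
    "positive_join": "positive_join",
    "semantic_gap": "semantic_gap",
    "diagnostic_only": "diagnostic",
    "inconclusive": "diagnostic",
}


def claim_class_for(verdict: str) -> str:
    """Map a scenario verdict to an evidence-pack claim_class, rejecting unknowns."""
    try:
        return VERDICT_TO_CLAIM_CLASS[verdict]
    except KeyError:
        raise ValueError(f"unknown scenario verdict: {verdict!r}") from None


print(claim_class_for("inconclusive"))
```

Raising on unknown verdicts, rather than defaulting to a claim class, keeps a new verdict from silently acquiring a claim strength it was never reviewed for.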
Claim Rules
| Condition | Maximum safe claim |
|---|---|
| Unique tool_call_id, clean Runner health, and matching reported/measured target | positive_join |
| Unique tool_call_id, clean Runner health, and measured effect differs from reported intent | semantic_gap |
| Clean Runner health but only run-level join | measured effect exists in the run; no per-tool semantic equality |
| Timestamp/order fallback only | diagnostic-only |
| Runner health not clean | inconclusive for measured-effect claims |
| Trace calibration lossy or inconclusive | no claim that absent trace fields prove absence of intent |
If fidelity calibration for a scenario is lossy or inconclusive, the scenario verdict becomes inconclusive regardless of the intent/effect comparison. That sample may still be cited as calibration evidence, but not as a semantic-gap finding.
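
The precedence among these rules can be sketched as a single decision function. Several points are assumptions rather than part of the plan: the `Observation` field names, the `"run"` label for a run-level join, mapping the run-level case to diagnostic_only, and treating a non-unique tool_call_id as diagnostic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    calibration_ok: bool       # fidelity calibration is neither lossy nor inconclusive
    runner_health_clean: bool
    join_key: str              # "tool_call_id", "run" (assumed label), or "timestamp_or_order"
    unique_within_scope: bool
    targets_match: bool        # reported intent agrees with measured effect


def max_safe_verdict(obs: Observation) -> str:
    """Return the strongest bounded verdict the claim rules allow."""
    # Lossy or inconclusive calibration overrides the intent/effect comparison.
    if not obs.calibration_ok:
        return "inconclusive"
    # Without clean Runner health, measured-effect claims are inconclusive.
    if not obs.runner_health_clean:
        return "inconclusive"
    if obs.join_key == "tool_call_id" and obs.unique_within_scope:
        return "positive_join" if obs.targets_match else "semantic_gap"
    # Run-level and timestamp/order joins never support per-tool equality.
    # Mapping both to diagnostic_only is an assumption of this sketch.
    return "diagnostic_only"


hidden_write = Observation(True, True, "tool_call_id", True, False)
print(max_safe_verdict(hidden_write))
```

Checking calibration and health before the join keeps a matching target from being reported as positive_join when the measurement itself cannot be trusted.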
The first findings document should report claim strength and basis using assay.observability.claim_class_cell.v0 vocabulary:
| Layer | Typical basis | Typical strength |
|---|---|---|
| Trace intent | reported | strong inside trace boundary, absent for unreported effects |
| Runner archive effect | measured | strong only when health is clean |
| Joined comparison | derived | bounded by join grade and the weaker source layer |
| Fallback/order correlation | inferred | weak or diagnostic only |
Acceptance Rules
- Do not dispatch or publish delegated measurements from the synthetic harness.
- Every scenario must have a role: baseline, gap, or fallback.
- Strong semantic-gap findings require a unique same-scenario tool_call_id; timestamp/order fallback remains diagnostic.
- Every measured-effect claim must state Runner health and evidence refs.
- Every trace absence claim must state trace retention/calibration status. Missing trace fields do not prove missing behavior.
- Each scenario evidence pack must carry the non-claim that packaging evidence does not strengthen the underlying join or calibration grades.
- Redaction must remain explicit even for synthetic fixtures.
- Mismatches are divergence evidence, not proof of malicious behavior, policy failure, or root cause.
Non-Claims
- The synthetic harness does not dispatch delegated runs.
- This plan does not rank OTel, OpenInference, Runner, or Assay.
- This plan does not claim semantic gaps are malicious.
- This plan does not promote evidence packs, join results, or claim cells to product APIs.
- This plan does not replace Runner archive integrity or health gates.
Exit Gate For Slice 4
Slice 4's MVP synthetic harness is ready when it can show, using synthetic fixtures first, that:
- matched_safe_read emits a strong tool_call_id join and a positive_join evidence pack.
- hidden_write emits a strong join but a semantic_gap verdict.
- weak_join_fallback emits only a diagnostic join and cannot be rendered as semantic equality.
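
The three exit-gate conditions can be expressed as a table-driven check. This is a sketch under stated assumptions: the result dict shape and the name `exit_gate_failures` are hypothetical, and the diagnostic_only verdict for weak_join_fallback is inferred from the claim rules rather than stated in the gate itself.

```python
# Expected (join_grade, verdict) per MVP scenario. The diagnostic_only verdict
# for weak_join_fallback is inferred from the claim rules, not stated verbatim.
EXIT_GATE = {
    "matched_safe_read": ("strong", "positive_join"),
    "hidden_write": ("strong", "semantic_gap"),
    "weak_join_fallback": ("diagnostic", "diagnostic_only"),
}


def exit_gate_failures(results: dict[str, dict[str, str]]) -> list[str]:
    """Return one message per MVP scenario that is missing or off-contract."""
    failures = []
    for scenario_id, (grade, verdict) in EXIT_GATE.items():
        got = results.get(scenario_id)
        if got is None:
            failures.append(f"{scenario_id}: missing result")
        elif (got.get("join_grade"), got.get("verdict")) != (grade, verdict):
            failures.append(f"{scenario_id}: expected {grade}/{verdict}")
    return failures


results = {
    "matched_safe_read": {"join_grade": "strong", "verdict": "positive_join"},
    "hidden_write": {"join_grade": "strong", "verdict": "semantic_gap"},
    # A fallback wrongly promoted to a strong join should fail the gate.
    "weak_join_fallback": {"join_grade": "strong", "verdict": "positive_join"},
}
print(exit_gate_failures(results))
```

Here the promoted fallback row is the only failure, which mirrors the negative case the gate is meant to protect.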
Those three cases are the minimum useful harness. Having proved that shape, the current synthetic matrix also implements the remaining predeclared rows: path_rewrite, retry_self_correction, and runtime_side_effect. The harness should not publish delegated findings until all predeclared scenarios have either run under the delegated gate or been explicitly scoped out.
The three MVP cases may all be synthetic-fixture-only at harness gate time.
A delegated matched_safe_read sanity run is required before any semantic-gap finding is published. Delegated runs for the gap and fallback scenarios are required only when their findings are promoted from harness behavior to measured results.
Slice 7 predeclares the delegated baseline source, proof-pack artifacts, health/join invariants, and dispatch/conversion exit gate in delegated-baseline-plan.md. It still does not dispatch delegated measurements.