Delegated Semantic-Gap Baseline Plan¶
Status: delegated-baseline-smoke-verified for Slice 7 of the agent-observability fidelity roadmap. This document pinned the one delegated
matched_safe_readsanity run required before any semantic-gap finding is promoted beyond synthetic harness behavior. The successful smoke record is inruns/slice7-delegated-baseline/summary.md. It does not publish gap findings and does not open the optional OTel span-limit study.Last updated: 2026-05-28
Goal¶
The delegated baseline asks one narrow question:
When the synthetic matched_safe_read baseline is repeated through real
Runner capture, can the same tool-call id join reported tool intent,
SDK events, policy evidence, and measured filesystem effects without
weakening the claim class?
This is a publication gate, not a new semantic-gap experiment. A clean baseline run lets later semantic-gap findings say "the positive join path works on delegated Runner capture" before interpreting divergent scenarios. A failed or inconclusive baseline does not prove a semantic gap; it means the delegated join path is not yet healthy enough to cite.
Prerequisites¶
| Prerequisite | Status | Why it matters |
|---|---|---|
| Full synthetic semantic-gap matrix | Done: semantic_gap_harness.py emits all six scenarios | The delegated gate should validate the positive baseline path before gap rows are promoted. |
| Evidence-pack carrier | Prototype-ready: evidence_pack.py | The delegated baseline must be reviewable as a bounded pack or a stable proof-pack reference, not loose files. |
| Runner delegated proof pack | Existing workflow: .github/workflows/runner-spike-delegated.yml | The workflow already runs on the Linux/eBPF host and uploads retained archives, selected JSON, gate logs, and manifest metadata. |
| OpenAI Agents kernel+policy gate | Existing gate: openai-agents-kernel-policy | This gate already exercises a deterministic OpenAI Agents tool call, SDK events, policy evidence, kernel capture, and stable tool_call_id. |
| Join and claim-class contracts | Reference-ready | The output should reuse assay.observability.join_result.v0 and assay.observability.claim_class_cell.v0, not invent a delegated-only join vocabulary. |
Baseline Source¶
Slice 7 should use the existing delegated workflow rather than adding a new runner lane:
.github/workflows/runner-spike-delegated.yml
inputs.gates = openai-agents-kernel-policy
inputs.build_ebpf = true
The selected gate runs scripts/ci/runner-spike-openai-agents-kernel-policy-three-run-determinism.sh, which in turn invokes the acceptance script three times. Each run captures:
- a Runner archive tarball under the delegated proof pack payload;
observation-health.json;capability-surface.json;correlation-report.json;layers/sdk.ndjson;layers/policy.ndjson;- the gate log and pass lines.
The fixture's stable tool-call id is tc_runner_policy_001. The fixture reports a deterministic read_file tool call against openai-agents-input.txt, and the Runner archive must observe the same workdir-bounded file read/open effect.
Scenario Scope¶
The first delegated baseline is deliberately one scenario: matched_safe_read. hidden_write is the sharpest next delegated gap candidate, but it should not be added to this slice. It can only be promoted after the positive baseline is clean or after an inconclusive baseline has been understood and fixed.
Smoke Outcome¶
The follow-up dispatched runner-spike-delegated.yml on branch codex/agent-fidelity-delegated-baseline-smoke with gates=openai-agents-kernel-policy and build_ebpf=true. Run 26571739019 passed all three deterministic OpenAI Agents kernel+policy runs and uploaded proof pack assay-runner-delegated-proof-pack-26571739019 (artifact 7264883391, retained until 2026-08-26).
The review record runs/slice7-delegated-baseline/summary.md points to the proof-pack reference, a strong tool_call_id join result, claim-class cells, a positive_join scenario verdict, and a redaction manifest. No tarball payload is committed to git; the proof pack remains the retained GitHub Actions artifact.
Required Output¶
A delegated baseline follow-up should produce one stable review directory or evidence pack with this minimum shape:
| Artifact | Required | Notes |
|---|---|---|
| Delegated proof-pack manifest | Yes | From assay-runner-delegated-proof-pack-<run_id>/manifest.json. |
| Runner archive reference or copied tarball | Yes | The archive from payload/gates/openai-agents-kernel-policy/run-1/ is enough for the first baseline; runs 2 and 3 support determinism. |
| Extracted selected JSON | Yes | observation-health.json, capability-surface.json, and correlation-report.json. |
| SDK events | Yes | layers/sdk.ndjson, preserving tool_call_started and tool_call_completed for tc_runner_policy_001. |
| Policy events | Yes | layers/policy.ndjson, preserving the policy-side evidence for the same fixture path. |
| Join result | Yes | assay.observability.join_result.v0 row with join_key=tool_call_id, join_grade=strong, and unique_within_scope=true. |
| Claim-class cells | Yes | Reported intent, measured effect, and joined evidence rows using assay.observability.claim_class_cell.v0. |
| Scenario verdict | Yes | assay.experiment.agent_observability_fidelity.semantic_gap_verdict.v0 with scenario_id=matched_safe_read, verdict=positive_join, and evidence-pack claim class positive_join. |
| Redaction manifest | Yes | Required even when no redaction is applied. |
| Summary | Yes | One-page Markdown with run URL, commit, gate, health verdict, join verdict, and non-claims. |
The delegated proof pack itself is operational evidence. It should not be reclassified as a Runner archive schema or product evidence-pack format merely because the baseline plan consumes it.
Acceptance Rules¶
- Dispatch exactly the
openai-agents-kernel-policydelegated gate for the first delegated baseline. Do not useallunless a separate PR explicitly needs additional gates. - The delegated workflow must pass, including all three deterministic OpenAI Agents kernel+policy runs.
- The proof pack must include the
openai-agents-kernel-policygate with statuspassed, at least one Runner archive tarball, selected JSON, and gate log pass lines. observation-health.jsonmust be clean enough for citation:kernel_layercomplete, cgroup correlation clean, and no Runner health failure that would downgrade the join.- SDK events must include exactly one started/completed tool-call pair for
tc_runner_policy_001and toolread_file. - The measured archive must include a workdir-bounded read/open effect for
openai-agents-input.txt. A target path outside the delegated workdir fails the baseline. - The join result must be strong by
tool_call_id; timestamp/order may appear only as supporting context and may not upgrade the claim. - If any required artifact is missing, the baseline outcome is
inconclusive, notsemantic_gap. - If the delegated baseline is inconclusive, do not publish delegated gap findings and do not run delegated gap scenarios until the baseline failure is understood.
- The first delegated baseline may cite only the positive join path. It does not promote
hidden_write,path_rewrite,retry_self_correction,runtime_side_effect, orweak_join_fallbackto measured findings.
Follow-Up Dispatch/Conversion Gate¶
A follow-up implementation/dispatch pass is complete when it:
- either adds a small converter that turns the delegated proof pack into the required
matched_safe_readreview directory, or documents that the existing evidence-pack prototype can carry the proof-pack references without code changes. Any converter must reuse the evidence-pack, join-result, and claim-class vocabularies rather than introducing a new artifact family; - dispatches
runner-spike-delegated.ymlwithgates=openai-agents-kernel-policyandbuild_ebpf=true; - records the workflow run URL and proof-pack artifact name;
- validates the join, health, SDK, policy, and measured-effect invariants above; and
- updates the roadmap from
delegated-baseline-plan-readytodelegated-baseline-smoke-verifiedonly if the run passes.
If the run fails, the follow-up should record an inconclusive baseline result and stop. It should not silently widen to more delegated scenarios.
Non-Claims¶
- This plan does not dispatch the delegated baseline.
- This plan does not publish semantic-gap findings.
- This plan does not claim delegated gap scenarios are ready.
- This plan does not promote evidence packs, semantic-gap verdicts, or join rows to product APIs.
- This plan does not rank Runner, OTel, OpenInference, or the OpenAI Agents SDK.
- This plan does not open the optional OTel span-limit characterization issue.