Experiment Namespace Governance
Status: reference guidance for Assay experiment artifacts. This document does not define a product API. It keeps experiment-scoped schemas, cross-arc fields, and promotion decisions consistent before the agent-observability fidelity roadmap adds more artifacts. For the broader plan-to-harness-to-summary lifecycle, see arc-lifecycle-guide.md.
Problem
Assay experiments now emit several useful but local artifact families:
- overhead samples, summaries, paired sequences, phase timings, and event-rate sweep cells;
- cross-runtime drift reports and fixtures;
- observability join and claim-class reference rows;
- active fidelity-calibration sidecars and planned evidence-pack artifacts.
Without a naming and promotion rule, each new slice can create a locally reasonable schema that is hard to compare across arcs later. This doc sets the default rule before adding more fidelity and evidence-pack surfaces.
Naming Convention
Use schema strings in this shape:
assay.experiment.<arc_slug>.<artifact_slug>.v<N>
Examples:
assay.experiment.agent_observability_fidelity.calibration.v0
assay.experiment.agent_observability_fidelity.evidence_pack.v0
assay.experiment.runner_vs_otel_overhead.event_rate_sweep.v0
Existing pre-governance schemas such as assay.experiment.overhead_sample.v0 remain valid. Do not rename historical artifacts just to fit the new convention. New artifacts should use the arc/artifact split unless a narrower existing family already owns the shape.
Rules:
- Arc slug first. The arc names the evidence boundary, not the implementation module.
- Artifact slug second. The artifact names what the file contains: sample, summary, calibration, evidence_pack, paired_sequence, phase_timing, etc.
- Version only on shape changes. Additive optional fields may stay within the same version when old artifacts still validate. Changed meaning or required fields need a new version.
- No product namespace by accident. assay.experiment.* artifacts remain local evidence until promoted explicitly.
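The naming rules above can be checked mechanically. The Python sketch below is illustrative only: ExperimentSchema and parse_experiment_schema are hypothetical names, and the assumption that slugs are lowercase snake_case is inferred from the examples, not stated as a rule.

```python
import re
from typing import NamedTuple, Optional

# Assumption: slugs are lowercase snake_case and the version is "v" plus an integer.
_SCHEMA_RE = re.compile(
    r"^assay\.experiment\."
    r"(?P<arc>[a-z][a-z0-9_]*)\."
    r"(?P<artifact>[a-z][a-z0-9_]*)\."
    r"v(?P<version>\d+)$"
)


class ExperimentSchema(NamedTuple):
    arc: str
    artifact: str
    version: int


def parse_experiment_schema(schema: str) -> Optional[ExperimentSchema]:
    """Split a schema string into arc, artifact, and version.

    Returns None when the string does not use the arc/artifact split.
    """
    match = _SCHEMA_RE.match(schema)
    if match is None:
        return None
    return ExperimentSchema(match["arc"], match["artifact"], int(match["version"]))
```

Pre-governance strings such as assay.experiment.overhead_sample.v0 return None here. That only means they predate the arc/artifact split; it does not make them invalid.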
Cross-Arc Fields
Prefer repeating a small common field set in each experiment schema over creating a shared assay.experiment.common.v0 too early. Duplication is acceptable while the fields are still proving themselves.
Recommended common fields:
| Field | Meaning |
|---|---|
| schema | Schema string for the artifact. |
| experiment | Human-readable experiment slug. |
| assay_commit | Source commit used to produce the artifact. |
| started_at | ISO-8601 timestamp for the sample/run. |
| host_class | Host/OS/kernel boundary for measurement claims. |
| workflow_run_url | GitHub Actions run URL when produced by delegated workflow. |
| tool_versions | Tool/runtime versions relevant to the artifact. |
| calibration_status | clean, lossy, inconclusive, or not_applicable when the artifact interprets requested-vs-observed signals. |
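Because the common fields are repeated per schema and not shared, a small lint helper can keep the copies from drifting. The following is a minimal sketch, assuming artifacts are loaded as plain dictionaries; check_common_fields and ALWAYS_EXPECTED are hypothetical names. It treats workflow_run_url and calibration_status as conditional, as the table describes, and only validates them when present.

```python
from __future__ import annotations

from datetime import datetime

# Fields from the table that this sketch expects on every artifact.
ALWAYS_EXPECTED = (
    "schema",
    "experiment",
    "assay_commit",
    "started_at",
    "host_class",
    "tool_versions",
)
CALIBRATION_STATUSES = {"clean", "lossy", "inconclusive", "not_applicable"}


def check_common_fields(artifact: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = [f"missing field: {name}" for name in ALWAYS_EXPECTED if name not in artifact]

    started_at = artifact.get("started_at")
    if started_at is not None:
        try:
            # fromisoformat accepts only a subset of ISO-8601, so a trailing
            # "Z" is normalized for older Python versions.
            datetime.fromisoformat(str(started_at).replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"started_at is not an ISO-8601 timestamp: {started_at!r}")

    status = artifact.get("calibration_status")
    if status is not None and status not in CALIBRATION_STATUSES:
        problems.append(f"unknown calibration_status: {status!r}")

    return problems
```

The helper reports problems and does not raise, which suits advisory checks on fields that are recommended and not mandatory.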
If three independent arcs need the same nested object with the same semantics, open a promotion PR to define a shared reference shape under assay.observability.* or another explicit namespace. Do not add a shared schema as a convenience before it has multiple consumers.
Promotion Criteria
An experiment artifact may be promoted out of assay.experiment.* only when at least one of these triggers exists:
- A production or CLI feature consumes it directly.
- Two or more experiment arcs independently need the same shape.
- A public reference doc or paper needs the shape as a stable citation target.
- External interoperability requires a stable contract.
Promotion targets:
| Target namespace | Use when |
|---|---|
| assay.runner.* | The shape is part of Runner archive, projection, or report contracts. |
| assay.observability.* | The shape interprets or joins traces, archives, receipts, and external evidence. |
| assay.receipt.* or receipt-family docs | The shape becomes a bounded imported evidence receipt. |
Promotion requires:
- A reference page naming the new stability promise.
- A migration note for the experiment shape that motivated the promotion.
- At least one validation fixture or golden file.
- A non-claims section stating what the promoted shape does not prove.
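The trigger and requirement lists combine as "at least one trigger, and every requirement". A minimal sketch of that gate follows, as a review aid and not an enforced check; PromotionProposal, its field names, and promotion_blockers are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PromotionProposal:
    # Triggers: at least one must hold.
    consumed_by_product_or_cli: bool = False
    independent_arcs_needing_shape: int = 0
    needed_as_citation_target: bool = False
    needed_for_external_interop: bool = False
    # Requirements: all must hold.
    has_reference_page: bool = False
    has_migration_note: bool = False
    validation_fixture_count: int = 0
    has_non_claims_section: bool = False


def promotion_blockers(p: PromotionProposal) -> list[str]:
    """List what still blocks promotion; an empty list means the gate passes."""
    blockers = []
    has_trigger = (
        p.consumed_by_product_or_cli
        or p.independent_arcs_needing_shape >= 2
        or p.needed_as_citation_target
        or p.needed_for_external_interop
    )
    if not has_trigger:
        blockers.append("no promotion trigger applies")
    if not p.has_reference_page:
        blockers.append("missing reference page naming the stability promise")
    if not p.has_migration_note:
        blockers.append("missing migration note for the motivating experiment shape")
    if p.validation_fixture_count < 1:
        blockers.append("missing validation fixture or golden file")
    if not p.has_non_claims_section:
        blockers.append("missing non-claims section")
    return blockers
```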
Fidelity Calibration Shapes
Calibration artifacts should include method metadata. An observed count without its counting method is not reproducible.
Recommended nested shape:
{
  "schema": "assay.experiment.agent_observability_fidelity.calibration.v0",
  "kind": "sample",
  "calibration_status": "lossy",
  "fidelity_verdict": {
    "runner_capture": "clean",
    "otel_capture": "lossy",
    "overall": "lossy"
  },
  "kernel_events": {
    "target": 1000,
    "observed": 1000,
    "method": "kernel_ndjson_path_match_count",
    "agreement": "match"
  },
  "span_events": {
    "target": 500,
    "observed": 128,
    "method": "otel_trace_json_events_count",
    "agreement": "clipped",
    "effective_limit": 128,
    "effective_limit_source": "default"
  }
}
fidelity_verdict is the review-facing rollup. The per-measurement objects are the reproducibility layer. Keep both: a reviewer should see the verdict quickly, while an auditor can still see how every count was produced.
Vocabulary Alignment
The calibration shape uses two vocabulary levels. Per-measurement agreement uses match, clipped, drift, failed, or not_applicable. The per-layer fidelity_verdict and top-level calibration_status use clean, lossy, inconclusive, or not_applicable. Agreement rolls up to status as follows: match -> clean, clipped -> lossy, drift or failed -> inconclusive, and not_applicable -> not_applicable. Layer statuses roll up to the overall status by worst case: not_applicable < clean < lossy < inconclusive.
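The two-level rollup described above is small enough to state as code. This is a sketch of the stated mapping and worst-case ordering; the function names are hypothetical, and returning not_applicable for an empty list of layers is an assumption the text does not cover.

```python
from __future__ import annotations

# Per-measurement agreement -> per-layer status, as stated in the text.
AGREEMENT_TO_STATUS = {
    "match": "clean",
    "clipped": "lossy",
    "drift": "inconclusive",
    "failed": "inconclusive",
    "not_applicable": "not_applicable",
}

# Worst-case ordering, least severe first.
STATUS_SEVERITY = ["not_applicable", "clean", "lossy", "inconclusive"]


def status_for(agreement: str) -> str:
    """Map one agreement value to its layer status; unknown values raise KeyError."""
    return AGREEMENT_TO_STATUS[agreement]


def overall_status(layer_statuses: list[str]) -> str:
    """Roll layer statuses up by worst case; unknown statuses raise ValueError."""
    if not layer_statuses:
        # Assumption: no layers means there is nothing to calibrate.
        return "not_applicable"
    return max(layer_statuses, key=STATUS_SEVERITY.index)
```

For the calibration example, the kernel_events agreement match and the span_events agreement clipped map to clean and lossy, which roll up to the top-level lossy status.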
Allowed agreement values:
| Value | Meaning |
|---|---|
| match | Observed count matches the requested target. |
| clipped | Observed count is lower because a known limit applied. |
| drift | Observed count differs without a known clipping explanation. |
| failed | Counting failed. |
| not_applicable | The layer does not apply for this arm or artifact. |
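One way to derive an agreement value from the per-measurement fields is sketched below. The decision rules are an interpretation of the table, not a specified algorithm. In particular, the sketch assumes a missing target means not_applicable, a missing observed count means failed, and a count is clipped only when it lands exactly on a known limit that is below the target. classify_agreement is a hypothetical name.

```python
from typing import Optional


def classify_agreement(
    target: Optional[int],
    observed: Optional[int],
    effective_limit: Optional[int] = None,
) -> str:
    """Classify one measurement as match, clipped, drift, failed, or not_applicable."""
    if target is None:
        # Assumption: no requested target means the layer does not apply.
        return "not_applicable"
    if observed is None:
        # Assumption: a missing observed count means counting failed.
        return "failed"
    if observed == target:
        return "match"
    if (
        effective_limit is not None
        and effective_limit < target
        and observed == effective_limit
    ):
        # The shortfall is fully explained by a known limit.
        return "clipped"
    return "drift"
```

With the calibration example, a target of 500, an observed count of 128, and an effective limit of 128 classify as clipped; the same counts without a known limit classify as drift.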
Allowed method values should be documented next to the schema that uses them. Initial methods:
| Method | Meaning |
|---|---|
| archive_contents_worker_files_count | Count unique event-rate-sweep/worker-* files in extracted archive contents. |
| kernel_ndjson_path_match_count | Count matching kernel events in layers/kernel.ndjson. |
| otel_trace_json_events_count | Count retained OTel span events in trace JSON. |
| fixture_side_log_count | Count fixture-emitted records from an explicit side log. |
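To make the idea of a documented counting method concrete, here is a hedged sketch of what a path-match count over NDJSON could look like. It does not describe the actual kernel.ndjson record layout: the path field name, the glob-style match, and the function name count_matching_ndjson_events are all assumptions made for illustration.

```python
import json
from fnmatch import fnmatchcase
from typing import Iterable, Optional


def count_matching_ndjson_events(
    lines: Iterable[str],
    path_glob: str,
    path_field: str = "path",  # Assumption: the real field name may differ.
) -> Optional[int]:
    """Count NDJSON records whose path field matches a glob.

    Returns None when any non-blank line is not valid JSON, so the caller
    can record the measurement as failed instead of reporting a partial count.
    """
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return None
        if isinstance(record, dict) and fnmatchcase(str(record.get(path_field, "")), path_glob):
            count += 1
    return count
```

Returning None on a parse error keeps a failed count distinct from a genuine count of zero, which matters when the result is later classified as failed or drift.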
Evidence Pack Minimum
The first evidence-pack prototype should keep the mandatory set small:
| Required | Artifact |
|---|---|
| Yes | One-page Markdown summary. |
| Yes | Runner archive or verified archive reference. |
| Yes | Trace JSON or trace reference when a trace layer exists. |
| Yes | Observation health summary. |
| Yes | Redaction manifest, even if it says no redaction was applied. |
| Nice-to-have v1 | Expanded manifest/provenance table. |
| Nice-to-have v1 | Derived measured-effects summary. |
The pack must not strengthen a claim beyond the underlying calibration and join grades. It is a carrier for evidence, not a new source of truth.
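The mandatory set can be expressed as a completeness check. The sketch below assumes a pack is summarized as a set of carried artifact names; those names and missing_required_artifacts are hypothetical and are not the keys of the actual pack manifest.

```python
from __future__ import annotations


def missing_required_artifacts(carried: set[str], has_trace_layer: bool) -> list[str]:
    """Return required artifact names that the pack does not carry.

    "runner_archive" and "trace" each stand for either the artifact itself
    or a verified reference to it.
    """
    required = ["summary_markdown", "runner_archive"]
    if has_trace_layer:
        # The trace is only required when a trace layer exists.
        required.append("trace")
    required += ["observation_health_summary", "redaction_manifest"]
    return [name for name in required if name not in carried]
```

A pack with no redaction still needs redaction_manifest in the carried set, since the manifest is required even when it only records that nothing was redacted.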
The v0 prototype lives under docs/experiments/agent-observability-fidelity-2026-05/ and uses:
| Schema | Role |
|---|---|
| assay.experiment.agent_observability_fidelity.evidence_pack.v0 | Pack manifest with scenario, claim class, carried artifacts, health, reproduction, and non-claims. |
| assay.experiment.agent_observability_fidelity.redaction_manifest.v0 | Required redaction record, even when no redaction was applied. |
Keep this prototype in assay.experiment.* until a real CLI or artifact-exchange consumer needs a stable product surface.
Semantic-Gap Verdicts
The Slice 4 synthetic harness adds one narrow experiment-scoped verdict shape:
| Schema | Role |
|---|---|
| assay.experiment.agent_observability_fidelity.semantic_gap_verdict.v0 | Bounded verdict for the six synthetic scenario-plan rows: positive_join, semantic_gap, diagnostic_only, or inconclusive. |
This verdict summarizes existing join and claim-class rows. It does not replace assay.observability.join_result.v0, does not promote semantic gap findings to a product API, and does not support delegated findings until the delegated baseline gate is run.
Synthetic fixture payloads emitted by this harness also stay under the same experiment namespace:
| Schema | Role |
|---|---|
| assay.experiment.agent_observability_fidelity.synthetic_trace.v0 | Synthetic trace fixture used by the local semantic-gap harness. |
| assay.experiment.agent_observability_fidelity.synthetic_runner_archive.v0 | Synthetic Runner-archive fixture used by the local semantic-gap harness. |
These fixture payloads are intentionally schema-string-only in v0. They are not delegated capture artifacts, not Runner archive contracts, and not a new assay.synthetic.* namespace.
Interop Mapping Rows
The Slice 5 interop plan reserves one experiment-scoped row family for coverage, joinability, and claim-strength mappings between OTel GenAI, OpenInference, Runner measured effects, and Assay observability vocabulary:
| Schema | Role |
|---|---|
| assay.experiment.agent_observability_fidelity.interop_coverage_cell.v0 | Slice 6 row for observation-profile coverage, row-level joinability, claim strength, join key, evidence layer, source snapshot, and bounded mapping notes. |
The schema sidecar is active in the Slice 6 harness PR. The schema stays experiment-scoped and must not promote interop mappings to assay.observability.*.
Interop rows must stay coverage-focused:
- use assay.observability.claim_class_cell.v0 vocabulary for claim_strength and claim_basis;
- use assay.observability.join_result.v0 vocabulary for join keys;
- keep joinability as a row-level summary and not a replacement for assay.observability.join_result.v0;
- record source snapshots for OTel GenAI and OpenInference because both vocabularies are moving;
- treat absent or partial mappings as valid findings, not product rankings.
Evidence packs are not required for every matrix cell. Single-run scenario outputs should use evidence packs when the claim depends on a portable trace/archive bundle; multi-row synthetic matrix cells may use a stable directory layout when the claim is coverage-shape behavior rather than delegated run evidence.
Artifact Family Inventory
Before adding a new artifact family, check ../artifact-families-inventory.md. If the family is still proposed, describe it as a working term. Do not call it a product line, receipt family, or canonical artifact until a promotion PR says so.
Non-Claims
- This document does not promote any existing experiment schema.
- This document does not require historical schema renames.
- This document does not define a universal Assay evidence-pack format.
- This document does not make calibration or join rows product APIs.