AG8
AG8 is the baseline chained protocol: steps run in one continuous, stateful session and downstream steps build on earlier outputs, so errors can propagate. It's designed for regression testing and diagnosing core failure modes (chain collapse, evidence misuse, citation integrity failures).
Mode
Chained (stateful)
Steps
8 total: seven deterministic steps (d*) plus one judged step (j*)
Outputs
Run artifacts + per-step scores
Protocol sketch
d1 → d2 → d3 → d4 → d5 → d6 → j7 → d8
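To make the chaining concrete, here is a minimal Python sketch of a stateful runner. Only the step order comes from the sketch above; the `StepFn` signature, `run_chain`, and the toy `make_echo_step` helper are hypothetical names introduced for illustration and are not taken from the actual runner.

```python
from typing import Callable, Dict, List

# Step order taken from the protocol sketch above.
STEP_ORDER: List[str] = ["d1", "d2", "d3", "d4", "d5", "d6", "j7", "d8"]

# Hypothetical step signature: a step reads the outputs accumulated so far
# and returns its own output.
StepFn = Callable[[Dict[str, str]], str]


def run_chain(steps: Dict[str, StepFn]) -> Dict[str, str]:
    """Run steps in order over one shared state, so later steps see earlier outputs."""
    state: Dict[str, str] = {}
    for step_id in STEP_ORDER:
        # Pass a copy so a step can only extend the history via its return value.
        state[step_id] = steps[step_id](dict(state))
    return state


def make_echo_step(step_id: str) -> StepFn:
    """Toy step (assumption): reports which earlier steps were visible to it."""
    def step(prior: Dict[str, str]) -> str:
        return f"{step_id} saw [{','.join(prior)}]"
    return step


outputs = run_chain({s: make_echo_step(s) for s in STEP_ORDER})
print(outputs["d3"])  # d3 saw [d1,d2]
```

Because every step receives the accumulated state, a bad output at an early step is visible to everything downstream, which is how errors propagate in a chained run.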
Payload admissions
p1 (anchor) admitted early; p2 (authorities) admitted later for open-book synthesis.
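Staged admission can be pictured as a small schedule. The sketch below assumes the admission points listed in the step breakdown table (p1 at d1, p2 at j7); the `ADMIT_AT` mapping and the `admitted_by` helper are hypothetical names, and the active run spec remains the authority on when each payload is actually admitted.

```python
from typing import Dict, List

STEP_ORDER: List[str] = ["d1", "d2", "d3", "d4", "d5", "d6", "j7", "d8"]

# Assumption: admission points as listed in the step breakdown table.
ADMIT_AT: Dict[str, str] = {"d1": "p1", "j7": "p2"}


def admitted_by(step_id: str) -> List[str]:
    """Payloads admitted into the run by the time step_id executes."""
    idx = STEP_ORDER.index(step_id)
    return [ADMIT_AT[s] for s in STEP_ORDER[: idx + 1] if s in ADMIT_AT]


print(admitted_by("d4"))  # ['p1']
print(admitted_by("d8"))  # ['p1', 'p2']
```

A schedule like this makes it easy to check evidence-respecting behavior: any reference to p2 material before its admission step would be a sign the model is not relying on admitted evidence.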
Integrity
The final step (d8) acts as an integrity check. A single fabricated authority is enough to mark the run's integrity status as failed.
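The all-or-nothing policy can be sketched in a few lines of Python. This assumes, purely for illustration, that a "fabricated" authority is any cited authority that is not in the set of admitted authorities; the source does not define the detection rule, and `integrity_status` and the sample authority names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def integrity_status(cited: Iterable[str], admitted: Set[str]) -> Tuple[str, List[str]]:
    """Return ('pass' | 'fail', fabricated citations).

    Assumption: a citation counts as fabricated when it is absent from the
    admitted set. One fabricated citation is enough to fail the run.
    """
    fabricated = [c for c in cited if c not in admitted]
    return ("fail" if fabricated else "pass", fabricated)


admitted = {"Authority A", "Authority B"}  # hypothetical admitted authorities
print(integrity_status(["Authority A"], admitted))
# ('pass', [])
print(integrity_status(["Authority A", "Authority Z"], admitted))
# ('fail', ['Authority Z'])
```

Returning the offending citations alongside the status keeps the failure diagnosable rather than a bare flag.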
Tip: the exact step prompts/contracts are defined by the active run spec.
Step breakdown
| Step | Purpose | Scoring | Payload |
|---|---|---|---|
| d1 | Anchor / known authority grounding | Deterministic | p1 |
| d2 | Citation network retrieval | Deterministic | — |
| d3 | Validate authority status | Deterministic | — |
| d4 | Extract facts / posture | Deterministic | — |
| d5 | Distinguish and reconcile authorities | Deterministic | — |
| d6 | Draft synthesis for review | Deterministic | — |
| j7 | Synthesis quality (rubric) | Isolated judge | p2 |
| d8 | Citation integrity / hard constraints | Deterministic | — |
What it measures
- Chained reliability (does the model maintain state across steps?)
- Evidence-respecting behavior under staged admissions
- Citation integrity (fabricated authority penalties)
- Diagnosable failure modes via per-step outputs and scores
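As one way to use per-step scores for diagnosis, the sketch below locates the first step whose score falls under a threshold, a rough proxy for where a chain starts to collapse. The score range (0 to 1), the 0.5 threshold, the sample scores, and the `first_failing_step` helper are all assumptions made for illustration, not part of the AG8 scoring policy.

```python
from typing import Dict, List, Optional

STEP_ORDER: List[str] = ["d1", "d2", "d3", "d4", "d5", "d6", "j7", "d8"]


def first_failing_step(scores: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the earliest step scoring below threshold, or None if all pass.

    Assumption: a step with no recorded score is treated as failing.
    """
    for step_id in STEP_ORDER:
        if scores.get(step_id, 0.0) < threshold:
            return step_id
    return None


# Hypothetical scores for one run: the chain degrades from d3 onward.
scores = {"d1": 1.0, "d2": 1.0, "d3": 0.2, "d4": 0.4,
          "d5": 0.3, "d6": 0.5, "j7": 0.4, "d8": 1.0}
print(first_failing_step(scores))  # d3
```

In a chained protocol the earliest failing step is usually the most informative one, since lower scores downstream may simply reflect the propagated error.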
How it works
Methodology
Runner semantics, artifacts, scoring, and integrity policies.