AG8

AG8 is the baseline chained protocol: steps run in one continuous, stateful session and downstream steps build on earlier outputs, so errors can propagate. It's designed for regression testing and diagnosing core failure modes (chain collapse, evidence misuse, citation integrity failures).
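To make the error-propagation point concrete, here is a minimal sketch of a chained run, not the actual runner. `run_chain` and `toy_step` are hypothetical names, and the sketch treats j7 like any other step; how the isolated judge is really wired is left to the run spec. The toy step stands in for a model call and marks its output with `!` when it, or anything before it, went wrong.

```python
from typing import Callable, Dict, List

# Step order as listed in the protocol sketch.
STEP_ORDER: List[str] = ["d1", "d2", "d3", "d4", "d5", "d6", "j7", "d8"]


def run_chain(
    step_fn: Callable[[str, Dict[str, str]], str],
    order: List[str] = STEP_ORDER,
) -> Dict[str, str]:
    """Run steps in order; each step sees every earlier output (hypothetical runner)."""
    outputs: Dict[str, str] = {}
    for step_id in order:
        # Pass a copy so a step cannot mutate the history it is given.
        outputs[step_id] = step_fn(step_id, dict(outputs))
    return outputs


def toy_step(step_id: str, history: Dict[str, str]) -> str:
    """Stand-in for a model call: an error at d3 taints every later output."""
    tainted = step_id == "d3" or any(v.endswith("!") for v in history.values())
    return step_id + ("!" if tainted else "")


result = run_chain(toy_step)
```

In this toy run, d1 and d2 come out clean, while d3 and every step after it carry the `!` marker, which is the kind of chain collapse the per-step outputs are meant to expose.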

Mode: Chained (stateful)
Steps: 8 total (seven deterministic d* steps plus one judged j* step)
Outputs: Run artifacts + per-step scores

Protocol sketch

d1 → d2 → d3 → d4 → d5 → d6 → j7 → d8
Payload admissions: p1 (anchor) is admitted early; p2 (authorities) is admitted later for open-book synthesis.
Integrity: The final step acts as an integrity check. A single fabricated authority is enough to mark the run as an integrity failure.
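The integrity rule can be sketched as a set-membership check. This is an illustration under a loud assumption: "fabricated" is approximated here as "cited but not present in the admitted payloads", which may differ from the actual integrity policy. The function names and the `"pass"` / `"fail"` labels are hypothetical.

```python
from typing import Iterable, List, Set


def fabricated_authorities(cited: Iterable[str], admitted: Set[str]) -> List[str]:
    """Return cited authorities missing from the admitted set, in first-citation order."""
    seen: Set[str] = set()
    missing: List[str] = []
    for authority in cited:
        if authority not in admitted and authority not in seen:
            seen.add(authority)
            missing.append(authority)
    return missing


def integrity_status(cited: Iterable[str], admitted: Set[str]) -> str:
    """One fabricated authority is enough to fail the run (assumed labels)."""
    return "fail" if fabricated_authorities(cited, admitted) else "pass"
```

For example, `integrity_status(["A", "X"], {"A", "B"})` fails because `"X"` was never admitted, no matter how many other citations are valid.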

Tip: the exact step prompts/contracts are defined by the active run spec.

Step breakdown

Step  Purpose                                 Scoring         Payload
d1    Anchor / known authority grounding      Deterministic   p1
d2    Citation network retrieval              Deterministic
d3    Validate authority status               Deterministic
d4    Extract facts / posture                 Deterministic
d5    Distinguish and reconcile authorities   Deterministic
d6    Draft synthesis for review              Deterministic
j7    Synthesis quality (rubric)              Isolated judge  p2
d8    Citation integrity / hard constraints   Deterministic
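The step table can also be written down as data, which makes the staged admissions easy to query. The sketch below is illustrative only: the `Step` class, the `admits` field, and `visible_payloads` are hypothetical, and it assumes that a payload stays visible to every later step once admitted, which the active run spec may define differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    step_id: str
    purpose: str
    scoring: str
    admits: Optional[str] = None  # payload listed for this step in the table, if any


STEPS: List[Step] = [
    Step("d1", "Anchor / known authority grounding", "Deterministic", "p1"),
    Step("d2", "Citation network retrieval", "Deterministic"),
    Step("d3", "Validate authority status", "Deterministic"),
    Step("d4", "Extract facts / posture", "Deterministic"),
    Step("d5", "Distinguish and reconcile authorities", "Deterministic"),
    Step("d6", "Draft synthesis for review", "Deterministic"),
    Step("j7", "Synthesis quality (rubric)", "Isolated judge", "p2"),
    Step("d8", "Citation integrity / hard constraints", "Deterministic"),
]


def visible_payloads(step_id: str, steps: List[Step] = STEPS) -> List[str]:
    """Payloads admitted at or before step_id (assumes admissions persist)."""
    visible: List[str] = []
    for step in steps:
        if step.admits:
            visible.append(step.admits)
        if step.step_id == step_id:
            return visible
    raise KeyError(step_id)
```

Under that assumption, steps d1 through d6 see only p1, while j7 and d8 see both p1 and p2.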

What it measures

  • Chained reliability (does the model maintain state across steps?)
  • Evidence-respecting behavior under staged admissions
  • Citation integrity (fabricated authority penalties)
  • Diagnosable failure modes via per-step outputs and scores