BENCHMARK ACTIVE [v3.0]
2026.01.01
Benchmark Result Summary
82.4% PEAK ACCURACY.
Legal-10 is the frontier benchmark for multi-step agentic planning in common law. We verify each chain's logic gates before scoring knowledge.
Foundation Models Tested
24
Leader: GPT-4o
91%
Technical Baseline
S1–S8 Design
AG8, Legal-10's first chained/agentic legal benchmark, evaluates intermediate research skills through open-book synthesis and then enforces a deterministic citation-integrity gate over U.S. Reports citations. Citations are extracted deterministically from SCOTUS opinion text, and their relevance is triangulated against Shepard's treatment labels, which serve as a universal, human-curated oracle.
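A minimal sketch of how such a two-layer check could work (layer one: deterministic extraction; layer two: the integrity gate). The names `extract_citations` and `integrity_gate` and the in-memory stand-in for the Shepard's oracle are hypothetical, not AG8's actual implementation:

```python
import re

# U.S. Reports citation pattern, e.g. "347 U.S. 483" -> (volume, page).
US_REPORTS = re.compile(r"\b(\d{1,3})\s+U\.\s*S\.\s+(\d{1,4})\b")

def extract_citations(text: str) -> list[tuple[int, int]]:
    """Layer 1: deterministically extract (volume, page) U.S. Reports cites."""
    return [(int(vol), int(page)) for vol, page in US_REPORTS.findall(text)]

# Toy stand-in for the Shepard's oracle: one treatment label per citation.
# (Illustrative entries only; real labels come from curated Shepard's data.)
SHEPARDS_ORACLE = {
    (347, 483): "followed",   # Brown v. Board of Education
    (198, 45): "overruled",   # Lochner v. New York
}

NEGATIVE_TREATMENTS = {"overruled", "superseded"}

def integrity_gate(answer: str) -> bool:
    """Layer 2: pass only if every cite resolves in the oracle and carries no
    negative treatment; a chain with no verifiable citations fails outright."""
    cites = extract_citations(answer)
    if not cites:
        return False
    return all(
        SHEPARDS_ORACLE.get(c) not in (None, *NEGATIVE_TREATMENTS)
        for c in cites
    )

if __name__ == "__main__":
    good = "Segregation violates equal protection. Brown, 347 U.S. 483 (1954)."
    bad = "Freedom of contract controls. Lochner, 198 U.S. 45 (1905)."
    print(integrity_gate(good))  # True
    print(integrity_gate(bad))   # False: cited authority is overruled
```

Because extraction is a fixed regex rather than model-graded judging, the gate is fully reproducible: the same answer passes or fails identically on every run.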
First Chained Legal Benchmark
Deterministic Reference Pack
Shepard's as Relevance Oracle
Chain-Faithful Evaluation
Citation Integrity Gate
Selection Manifest as Contract
Two-Layer Architecture
Top Performance
FULL_LOGS ->

| Model_ID | Chain_Acc | Integrity_Gate |
|---|---|---|
| Syncing with evaluation server... | — | — |
Latest_Syntheses
"The transition from atomic prompts to autonomous legal reasoning requires a new standard of observability."