Empirical Final Design v4.0
Chapters IV & V: Evidence & Benchmark Design
IV. Evidence that Quantization Impacts Legal Skills
This section maps existing evidence to each skill, establishing that quantization plausibly degrades the underlying capabilities each skill requires.
4.1 Summary: The Full Skill Surface Is At Risk
| Skill | Mechanism Evidence | Risk Level |
|---|---|---|
| S3: Known Authority | Long-context degradation | High |
| S4: Unknown Authority | Reasoning + retrieval | High |
| S5: Validating Authority | Temporal reasoning | Medium |
| S6: Fact Extraction | Long-context retrieval | High |
| S7: Distinguishing Cases | Multi-step reasoning | High |
| S8: Synthesizing Results | Integration + accuracy | High |
| PR: Prof. Responsibility | Fabrication resistance | Very High |
4.2.1 Research Planning
Mechanism: Research planning requires decomposing complex queries into subtasks—exactly the multi-hop reasoning that Li et al. show degrades by up to 4× under quantization.
| Study | Finding |
|---|---|
| ACBench (Dong et al., 2025) | 4-bit quantization creates a critical divergence between apparent competence and actual reliability, with real-world accuracy drops of 10-15%. |
| Liu et al. (2025) | Lower bit-width quantization introduces task-difficulty-dependent accuracy risks; explicitly evaluates KV cache / activation quantization as well as weights. |
| IntactKV (2024) | Mechanism support that KV cache quantization can be a failure point for workflow state maintenance. |
4.2.2 Strategic Stopping
Mechanism: Strategic stopping requires calibrated confidence—knowing when you have enough. ECE studies show quantized models become overconfident, failing to recognize their own uncertainty.
| Study | Finding |
|---|---|
| Zhong et al., 2025 | Quantized LLMs are consistently worse calibrated than their full-precision counterparts, with worse calibration in 85% of reported measurements (41 of 48 test conditions). |
4.2.3 Finding Known Authority
Mechanism: Known authority retrieval requires precise matching across long contexts. Embedding quantization collapses semantic distances; generator quantization corrupts attention to specific passages.
| Study | Finding |
|---|---|
| Mekala et al. (2025) | 8-bit is roughly preserved (small average drop), but 4-bit methods can produce very large losses (up to ~59%), especially for long-context inputs. |
| LegalBench-RAG (2024) | Legal-domain benchmark that isolates retrieval quality, establishing that legal retrieval is hard even before quantization is applied. |
4.2.4 Finding Unknown Authority
Mechanism: Finding unknown authority requires decomposing fact patterns into legal issues. Li et al. show quantization degrades multi-hop reasoning by up to 4× on complex tasks.
| Study | Finding |
|---|---|
| Li et al. | Low-bit quantization introduces up to 32.39% accuracy degradation (avg. 11.31%) on complex math reasoning, with degradation specifically in numerical computation and planning capabilities. |
| Liu et al., 2025 | Lower bit-widths introduce significant accuracy risks across DeepSeek-R1, LLaMA, and Qwen. |
| Yazan, Verberne & Situmeang (2024) | In RAG pipelines, quantization may not impair RAG when the base LLM performs well, but smaller models show high sensitivity to context length and setup. |
4.2.5 Validating Authority
Mechanism: Validation requires temporal reasoning (when was this overruled?) and status classification (still good law?). These are exactly the fine-grained distinctions that outlier weight clipping destroys.
| Study | Finding |
|---|---|
| Liu et al. (2025) | While W8A8/W4A16 can be lossless, lower bit-widths introduce significant accuracy risks, with task difficulty a critical determinant—placing authority-validation in the high-risk regime. |
| MixKVQ (Zhang et al. 2025) | Existing low-bit KV-cache quantization can exhibit severe performance degradation on complex reasoning tasks; fixed-precision methods at very low bit-widths struggle with outlier channels. |
| TimeBench | GPT-4 achieves only 66.4% accuracy on tasks requiring implicit temporal relationships. Accuracy varies from 40.25% to 92% depending on how temporal facts are organized. |
4.2.6 Fact Extraction
Mechanism: Fact extraction from contracts is long-context retrieval. The 59% accuracy collapse on NIAH-none (correctly identifying when information is absent) directly implicates document review reliability.
| Study | Finding |
|---|---|
| Mekala et al. (2025) | Up to 59% degradation on long-context extraction tasks at 4-bit quantization. S6's failure mode is particularly insidious because extracted "holdings" may be linguistically plausible while being substantively fabricated. |
4.2.7 Distinguishing Cases
Mechanism: Case distinction requires tracking multiple factors simultaneously and identifying material differences. This is multi-step reasoning—the capacity most vulnerable to quantization.
| Study | Finding |
|---|---|
| Dahl et al. (2024) | Models "cannot reliably detect when they are hallucinating" and fail to correct users' incorrect legal assumptions. Combined with Li et al.'s 32.39% reasoning degradation under quantization, this points to high unreliability in case distinction. |
| Liu et al., 2025 | Low-bit regimes create accuracy risks on hard reasoning tasks (the cognitive substrate for distinction). |
4.2.8 Synthesizing Results
Mechanism: Synthesis requires integrating multiple sources while maintaining coherence. CLERC explicitly reports that strong models can produce highly rated analyses while hallucinating—"good writing ≠ truthful authority."
| Study | Finding |
|---|---|
| LegalEval-Q (Li & Wu, 2025) | Measures clarity/coherence/terminology quality; reports that quantization has negligible impact on writing-quality metrics, supporting the "fluency preserved while truth degrades" account. |
| Lewis et al. (2020) | Canonical RAG citation establishing retrieval + generation as a distinct paradigm; provenance/updating are core motivations. |
4.2.9 Professional Responsibility
Mechanism: Citation integrity and fabrication resistance depend on precise parametric memory. Quantization clips the outlier weights encoding rare-but-accurate associations.
| Study | Finding |
|---|---|
| Q-Misalign (Dong et al., 2025) | Safety alignment is not preserved by quantization but is contingent upon precision—vulnerabilities can remain dormant, making pre-deployment safety audits unreliable for detecting post-quantization failure modes. |
| Li et al. (2024) | 4-bit quantization significantly weakens fabrication resistance. |
| Dahl et al. (2024) | LLMs hallucinate legal authority at alarming rates (69-88%) on verifiable legal queries. |
V. The Legal-7 Benchmark: Building the Benchmark
5.1 Research Execution as Job Performance
Section II identified the cognitive skills that constitute competent legal work. But these sources converge on something more fundamental than a checklist: Research Execution—the integrated professional competency of completing a legal research task from question to answer.
Shultz & Zedeck's empirical study confirms this framing. Their 26 "Lawyering Effectiveness Factors" are job performance measures—derived from asking lawyers, judges, and clients: "If you were looking for a lawyer for an important matter, what qualities would cause you to choose that attorney?"
AALL Principle IV operationalizes this directly: "A successful legal researcher applies information effectively to resolve a specific issue or need."
5.2 The Legal-7 Chain
The Legal-7 (L7) agentic benchmark operationalizes Research Execution as a seven-step dependent chain:
| Step | Name | Modality | Task | Ground Truth |
|---|---|---|---|---|
| S1 | Known Authority | RAG | Resolve known citation to correct authority | SCDB citation lookup |
| S2 | Unknown Authority | RAG | Retrieve relevant law from fact pattern | shepards_data.csv |
| S3 | Validate Authority | RAG | Determine if authority remains good law | scotus_overruled_db.csv |
| S4 | Fact Extraction | RAG | Extract disposition, holding, outcome from opinion | SCDB metadata + opinion text |
| S5 | Distinguish Cases | RAG + CB | Decide if precedent applies or can be distinguished | shepards.agree field |
| S6 | IRAC Synthesis | RAG | Write IRAC-structured legal analysis | MEE rubric + chain grounding |
| S7 | Citation Integrity | CB | Verify no fabricated citations in S6 output | fake_cases.csv + SCDB |
The chain maps to the IRAC framework that governs U.S. common law analysis:
- Rule Phase (S1–S3): Identify, retrieve, and validate legal authority
- Application Phase (S4–S5): Extract facts and apply precedent through distinction
- Conclusion Phase (S6–S7): Synthesize analysis and verify citation integrity
5.3 Why S6 Validates the Chain
S6 is administered closed-book: the model cannot return to the sources. It must synthesize an IRAC memo from what it gathered in S1–S5.
This design reflects AALL Principle IV's standard: applying gathered information to resolve an issue.
| If Step Fails... | Cascade Effect |
|---|---|
| S1 (Known Authority) | Wrong case → all downstream analysis corrupted |
| S2 (Unknown Authority) | Missing precedent → incomplete rule statement |
| S3 (Validate Authority) | Citing bad law → S6 argument fails |
| S4 (Fact Extraction) | Wrong facts → S5 distinction invalid |
| S5 (Distinguish) | Wrong application → S6 conclusion unsupported |
| S6 (IRAC Synthesis) | Poor reasoning → chain fails at capstone |
| S7 (Citation Integrity) | Fabrication detected → S6 voided, chain fails |
A model achieving 90% accuracy on each skill will, under independence assumptions, complete only 0.9⁷ ≈ 48% of full chains successfully. This multiplicative penalty reflects the reality of legal practice: a single failed step can invalidate the entire work product.
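The multiplicative penalty can be sketched directly; the function below is illustrative, not part of the benchmark harness:

```python
import math

def chain_completion_prob(step_accuracies):
    """Probability of completing the full chain, assuming step
    outcomes are independent: the product of per-step accuracies."""
    return math.prod(step_accuracies)

# Seven steps, each at 90% accuracy -> roughly 48% of chains complete.
p = chain_completion_prob([0.9] * 7)
print(round(p, 3))  # 0.478
```

Even modest per-step degradation compounds: dropping each step to 80% accuracy leaves only about 21% of chains intact.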
5.4 S7 as Professional Responsibility Gate
S7 operationalizes Shultz & Zedeck's Factor 21: "Integrity & Honesty—has core values and beliefs; acts with integrity and honesty."
Under Model Rule 3.3(a)(1), attorneys may not "make a false statement of fact or law to a tribunal." A brief citing fabricated cases is not merely imperfect; it is professionally worthless and potentially sanctionable.
L7 mirrors this: if S7 detects any fabricated citation in the S6 output, the entire S6 score is voided—set to zero regardless of reasoning quality.
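A minimal sketch of the voiding gate, assuming a simple set-membership citation check (function and argument names here are hypothetical, not the benchmark's actual API):

```python
def apply_s7_gate(s6_score, cited_cases, known_citations):
    """Void the S6 synthesis score if any cited case is absent from
    the known-citation set (i.e., fabricated). A single fabrication
    zeroes the score regardless of reasoning quality."""
    fabricated = [c for c in cited_cases if c not in known_citations]
    if fabricated:
        return 0.0, fabricated
    return s6_score, []

# One fabricated citation voids an otherwise strong memo.
score, fakes = apply_s7_gate(
    s6_score=0.92,
    cited_cases=["347 U.S. 483", "999 U.S. 999"],
    known_citations={"347 U.S. 483", "358 U.S. 1"},
)
print(score, fakes)  # 0.0 ['999 U.S. 999']
```

The gate is deliberately all-or-nothing, mirroring Rule 3.3's binary standard: a brief is either free of fabricated authority or it is sanctionable.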
5.5 S5 Dual-Modality: The Reasoning Bridge
S5 (Distinguish Cases) occupies a unique position in the chain. It is the point where retrieval must transform into reasoning. The model must:
- Understand the holding of a precedent case
- Understand the facts of the current case
- Determine whether the precedent applies or can be distinguished
To isolate the reasoning component, L7 tests S5 in two modalities:
S5-RAG (Primary): Both case texts available. Tests whether the model can distinguish cases with full information.
S5-CB (Diagnostic): Only the S4-extracted holding available; no citing case text. Tests whether the model can reason from the rule alone.
The gap between S5-RAG and S5-CB is the Fluency-Reasoning Divergence measurement:
| S5-RAG | S5-CB | Interpretation |
|---|---|---|
| High | High | Model reasons well |
| High | Low | Model copies, doesn't reason (surface fluency vs reasoning divergence signature) |
| Low | Low | Model cannot perform the task |
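The interpretation table above can be computed as a simple gap between modality scores; the accuracy threshold below is illustrative, standing in for whatever cutoff the benchmark adopts:

```python
def fluency_reasoning_divergence(rag_acc, cb_acc, high=0.7):
    """Classify the S5 RAG/CB pattern. The RAG-CB gap is the
    divergence measurement; labels follow the interpretation table."""
    gap = rag_acc - cb_acc
    if rag_acc >= high and cb_acc >= high:
        label = "reasons well"
    elif rag_acc >= high:
        label = "copies, doesn't reason"
    else:
        label = "cannot perform the task"
    return gap, label

gap, label = fluency_reasoning_divergence(0.85, 0.40)
print(label)  # copies, doesn't reason
```

A large positive gap is the divergence signature: the model performs well only when it can lean on the retrieved text.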
5.6 Grading Architecture
L7 grades six of its seven steps objectively:
| Step | Grading Method | Ground Truth Source |
|---|---|---|
| S1 | Exact match | SCDB citation |
| S2 | MRR / Hit@k | Shepard's precedent relationships |
| S3 | Exact match | scotus_overruled_db |
| S4 | Exact match (disposition, party) | SCDB metadata |
| S5 | Exact match | shepards.agree field |
| S6 | Hybrid (50% objective, 50% LLM-as-Judge) | Chain grounding + MEE rubric |
| S7 | Deterministic | Citation existence check |
Only S6 requires rubric-based evaluation. This architecture minimizes LLM-as-judge circularity: most of the benchmark is deterministic.
5.7 Task Structure: From Case to Chain
Each L7 chain instance begins with a Supreme Court case pair drawn from the Shepard's citation network.
| Element | Source | Example |
|---|---|---|
| Cited Case | scdb_sample.csv | Brown v. Board of Education, 347 U.S. 483 (1954) |
| Citing Case | scotus_shepards_sample.csv | Cooper v. Aaron, 358 U.S. 1 (1958) |
| Shepard's Signal | shepards field | "followed" |
| Doctrinal Agreement | agree field | True (citing case follows precedent) |
| Overrule Status | scotus_overruled_db.csv | None (not overruled) |
| Opinion Text | majority opinion field | Full text of majority opinion |
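One way to represent a chain instance is as a plain record; the field names below mirror the table but are illustrative, not the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChainInstance:
    """One L7 chain instance built from a Shepard's case pair."""
    cited_case: str              # from scdb_sample.csv
    citing_case: str             # from scotus_shepards_sample.csv
    shepards_signal: str         # e.g., "followed"
    agree: bool                  # doctrinal agreement flag
    overruled_by: Optional[str]  # from scotus_overruled_db.csv; None if good law
    opinion_text: str            # full text of the majority opinion

brown_cooper = ChainInstance(
    cited_case="Brown v. Board of Education, 347 U.S. 483 (1954)",
    citing_case="Cooper v. Aaron, 358 U.S. 1 (1958)",
    shepards_signal="followed",
    agree=True,
    overruled_by=None,
    opinion_text="...",
)
```

Each record carries everything the seven steps consume: S1–S4 draw on the cited case and its opinion text, S5 on the agree field, and S7 on the citation strings.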
5.8 Scoring Summary
| Step | Ground Truth | Scoring Method |
|---|---|---|
| S1 | SCDB metadata | Exact match |
| S2 | Shepard's citing_case_us_cite | MRR, hit@10 |
| S3 | scotus_overruled_db | Binary match on is_overruled |
| S4 | SCDB caseDisposition, partyWinning | Closed enum exact match |
| S5 | Shepard's agree field | Binary match |
| S6 | MEE rubric + chain grounding | Hybrid (50% objective, 50% rubric) |
| S7 | fake_cases.csv + SCDB | Deterministic lookup |
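The S2 retrieval metrics (MRR and hit@k) can be sketched as follows; the ranked lists in the usage example are illustrative:

```python
def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank: average over queries of 1/rank of the
    first relevant citation retrieved (0 if none is retrieved)."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def hit_at_k(ranked_lists, relevant_sets, k=10):
    """Fraction of queries with at least one relevant citation
    among the top-k retrieved results."""
    hits = sum(
        1 for ranking, relevant in zip(ranked_lists, relevant_sets)
        if any(doc in relevant for doc in ranking[:k])
    )
    return hits / len(ranked_lists)

# Two queries: the first finds its precedent at rank 2, the second misses.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"q"}]
print(mrr(ranked, relevant))       # 0.25
print(hit_at_k(ranked, relevant))  # 0.5
```

Both metrics are fully deterministic given the Shepard's ground truth, which is what keeps S2 out of the LLM-as-judge pipeline.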
5.9 What L7 Detects That Parallel Benchmarks Cannot
L7 detects three failure modes invisible to parallel evaluation:
Cascade failures. A model that hallucinates at S1 corrupts all downstream steps. Parallel scoring treats S1 as one task among many; L7 propagates the error through the chain.
Surface fluency vs reasoning divergence signature. The S5 RAG-CB gap directly measures whether the model is reasoning or copying. No parallel benchmark isolates this.
Professional responsibility failures. S7 voiding enforces the binary reality of citation integrity—a synthesized memo is either citable or it is not.
For quantization testing, these properties are essential. We hypothesize that compression degrades reasoning while preserving fluency. L7's chained architecture, dual-modality S5, and hard-gate S7 are designed to make this degradation visible.