Empirical Final Design v4.0

Chapters IV & V: Evidence & Benchmark Design


IV. Evidence that Quantization Impacts Legal Skills

This section maps existing evidence to each skill, establishing that quantization plausibly degrades the underlying capabilities each skill requires.

4.1 Summary: The Full Skill Surface Is At Risk

Skill | Mechanism | Risk Level
S3: Known Authority | Long-context degradation | High
S4: Unknown Authority | Reasoning + retrieval | High
S5: Validating Authority | Temporal reasoning | Medium
S6: Fact Extraction | Long-context retrieval | High
S7: Distinguishing Cases | Multi-step reasoning | High
S8: Synthesizing Results | Integration + accuracy | High
PR: Prof. Responsibility | Fabrication resistance | Very High

4.2.1 Research Planning

Mechanism: Research planning requires decomposing complex queries into subtasks—exactly the multi-hop reasoning that Li et al. showed degrades up to 4× under quantization.

Study | Finding
ACBench (Dong et al., 2025) | 4-bit quantization creates a critical divergence between apparent competence and actual reliability, with real-world accuracy drops of 10-15%.
Liu et al. (2025) | Lower bit-width quantization introduces task-difficulty-dependent accuracy risks; explicitly evaluates KV-cache and activation quantization as well as weights.
IntactKV (2024) | Mechanistic support that KV-cache quantization can be a failure point for workflow state maintenance.

4.2.2 Strategic Stopping

Mechanism: Strategic stopping requires calibrated confidence—knowing when you have enough. ECE studies show quantized models become overconfident, failing to recognize their own uncertainty.

Study | Finding
Zhong et al. (2025) | Quantized LLMs are consistently worse-calibrated than their full-precision counterparts, with calibration errors in 85% of reported measurements (41 of 48 test conditions).
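Calibration error of the kind reported in such studies is typically quantified as Expected Calibration Error (ECE). A minimal sketch of binned ECE follows; the binning scheme and example values are illustrative, not drawn from the study.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - confidence| over confidence bins."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Collect predictions whose confidence falls in this bin (last bin includes 1.0).
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += len(in_bin) / total * abs(acc - avg_conf)
    return ece

# Overconfident model: 90% stated confidence, 25% actual accuracy -> large ECE
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0]))
```

An overconfident quantized model in this framework is one whose per-bin accuracy sits well below its stated confidence, which is exactly the failure mode that undermines strategic stopping.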

4.2.3 Finding Known Authority

Mechanism: Known authority retrieval requires precise matching across long contexts. Embedding quantization collapses semantic distances; generator quantization corrupts attention to specific passages.

Study | Finding
Mekala et al. (2025) | 8-bit quantization is roughly lossless (small average drop), but 4-bit methods can produce very large losses (up to ~59%), especially on long-context inputs.
LegalBench-RAG (2024) | Legal-domain benchmark that isolates retrieval quality, establishing that legal retrieval is hard even before quantization is applied.

4.2.4 Finding Unknown Authority

Mechanism: Finding unknown authority requires decomposing fact patterns into legal issues. Li et al. show quantization degrades multi-hop reasoning by up to 4× on complex tasks.

Study | Finding
Li et al. | Low-bit quantization introduces up to 32.39% accuracy degradation (avg. 11.31%) on complex math reasoning, with degradation concentrated in numerical computation and planning capabilities.
Liu et al. (2025) | Lower bit-widths introduce significant accuracy risks across DeepSeek-R1, LLaMA, and Qwen.
Yazan, Verberne & Situmeang (2024) | In RAG pipelines, quantization may not impair performance when the base LLM performs well, but smaller models show high sensitivity to context length and setup.

4.2.5 Validating Authority

Mechanism: Validation requires temporal reasoning (when was this overruled?) and status classification (still good law?). These are exactly the fine-grained distinctions that outlier weight clipping destroys.

Study | Finding
Liu et al. (2025) | While W8A8/W4A16 can be lossless, lower bit-widths introduce significant accuracy risks, with task difficulty a critical determinant, placing authority validation in the high-risk regime.
MixKVQ (Zhang et al., 2025) | Existing low-bit KV-cache quantization can exhibit severe performance degradation on complex reasoning tasks; fixed-precision methods at very low bit-widths struggle with outlier channels.
TimeBench | GPT-4 achieves only 66.4% accuracy on tasks requiring implicit temporal relationships; accuracy varies from 40.25% to 92% depending on how temporal facts are organized.

4.2.6 Fact Extraction

Mechanism: Fact extraction from contracts is long-context retrieval. The 59% accuracy collapse on NIAH-none (correctly identifying when information is absent) directly implicates document review reliability.

Study | Finding
Mekala et al. (2025) | Up to 59% degradation on long-context extraction tasks at 4-bit quantization. S6's failure mode is particularly insidious because extracted "holdings" may be linguistically plausible while being substantively fabricated.

4.2.7 Distinguishing Cases

Mechanism: Case distinction requires tracking multiple factors simultaneously and identifying material differences. This is multi-step reasoning—the capacity most vulnerable to quantization.

Study | Finding
Dahl et al. (2024) | Models "cannot reliably detect when they are hallucinating" and fail to correct users' incorrect legal assumptions; combined with Li et al.'s 32.39% reasoning degradation under quantization, this points to high unreliability.
Liu et al. (2025) | Low-bit regimes create accuracy risks on hard reasoning tasks (the cognitive substrate for case distinction).

4.2.8 Synthesizing Results

Mechanism: Synthesis requires integrating multiple sources while maintaining coherence. CLERC explicitly reports that strong models can produce highly rated analyses while hallucinating—"good writing ≠ truthful authority."

Study | Finding
LegalEval-Q (Li & Wu, 2025) | Measures clarity, coherence, and terminology quality; reports that quantization has negligible impact on writing-quality metrics, supporting the "fluency preserved while truth degrades" account.
Lewis et al. (2020) | Canonical RAG citation establishing retrieval + generation as a distinct paradigm; provenance and updating are core motivations.

4.2.9 Professional Responsibility

Mechanism: Citation integrity and fabrication resistance depend on precise parametric memory. Quantization clips the outlier weights encoding rare-but-accurate associations.

Study | Finding
Q-Misalign (Dong et al., 2025) | Safety alignment is not preserved by quantization but is contingent on precision; vulnerabilities can remain dormant, making pre-deployment safety audits unreliable for detecting post-quantization failure modes.
Li et al. (2024) | 4-bit quantization significantly weakens fabrication resistance.
Dahl et al. (2024) | LLMs hallucinate legal authority at alarming rates (69-88%) on verifiable legal queries.

V. The Legal-7 Benchmark: Building the Benchmark

5.1 Research Execution as Job Performance

Section II identified the cognitive skills that constitute competent legal work. But these sources converge on something more fundamental than a checklist: Research Execution—the integrated professional competency of completing a legal research task from question to answer.

Shultz & Zedeck's empirical study confirms this framing. Their 26 "Lawyering Effectiveness Factors" are job performance measures—derived from asking lawyers, judges, and clients: "If you were looking for a lawyer for an important matter, what qualities would cause you to choose that attorney?"

AALL Principle IV operationalizes this directly: "A successful legal researcher applies information effectively to resolve a specific issue or need."

5.2 The Legal-7 Chain

The Legal-7 (L7) agentic benchmark operationalizes Research Execution as a seven-step dependent chain:

Step | Name | Modality | Task | Ground Truth
S1 | Known Authority | RAG | Resolve known citation to correct authority | SCDB citation lookup
S2 | Unknown Authority | RAG | Retrieve relevant law from fact pattern | shepards_data.csv
S3 | Validate Authority | RAG | Determine if authority remains good law | scotus_overruled_db.csv
S4 | Fact Extraction | RAG | Extract disposition, holding, outcome from opinion | SCDB metadata + opinion text
S5 | Distinguish Cases | RAG + CB | Decide if precedent applies or can be distinguished | shepards.agree field
S6 | IRAC Synthesis | CB | Write IRAC-structured legal analysis | MEE rubric + chain grounding
S7 | Citation Integrity | CB | Verify no fabricated citations in S6 output | fake_cases.csv + SCDB
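The dependent structure can be sketched as a short-circuiting pipeline: each step consumes the chain state produced by its predecessors, and a failed step halts everything downstream. The runner below is an illustrative assumption about the harness, not the actual implementation; step names follow the table.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]      # takes chain state, returns updated state
    passed: Callable[[dict], bool]   # grades the step against ground truth

def run_chain(steps, state) -> tuple[dict, Optional[str]]:
    """Execute steps in order; the first failed step halts the chain (cascade)."""
    for step in steps:
        state = step.run(state)
        if not step.passed(state):
            return state, step.name  # chain fails here; later steps never run
    return state, None

# Toy chain: S1 succeeds, S3 fails on validation, so S4 never executes.
steps = [
    Step("S1", lambda s: {**s, "s1": "ok"}, lambda s: True),
    Step("S3", lambda s: {**s, "s3": "bad law"}, lambda s: False),
    Step("S4", lambda s: {**s, "s4": "ok"}, lambda s: True),
]
state, failed_at = run_chain(steps, {})
print(failed_at)  # S3
```

The key design property is that downstream steps see only the state the chain actually produced, so an early error propagates rather than being scored in isolation.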

The chain maps onto the IRAC framework that governs U.S. common law analysis.

5.3 Why S6 Validates the Chain

S6 is administered closed-book: the model cannot return to the sources. It must synthesize an IRAC memo from what it gathered in S1–S5.

This design reflects AALL Principle IV's standard: applying gathered information to resolve an issue.

If Step Fails... | Cascade Effect
S1 (Known Authority) | Wrong case → all downstream analysis corrupted
S2 (Unknown Authority) | Missing precedent → incomplete rule statement
S3 (Validate Authority) | Citing bad law → S6 argument fails
S4 (Fact Extraction) | Wrong facts → S5 distinction invalid
S5 (Distinguish) | Wrong application → S6 conclusion unsupported
S6 (IRAC Synthesis) | Poor reasoning → chain fails at capstone
S7 (Citation Integrity) | Fabrication detected → S6 voided, chain fails

A model achieving 90% accuracy on each skill will, under independence assumptions, complete only 0.9^7 ≈ 48% of full chains successfully. This multiplicative penalty reflects the dependent structure of real legal work.
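The multiplicative penalty can be checked directly. This is a toy calculation under the stated independence assumption:

```python
def chain_success_prob(step_accuracies):
    """Probability of completing a dependent chain, assuming independent steps."""
    p = 1.0
    for acc in step_accuracies:
        p *= acc  # every step must succeed for the chain to succeed
    return p

# Seven steps at 90% each: roughly 48% of full chains succeed
print(round(chain_success_prob([0.9] * 7), 3))  # 0.478
```

The same calculation shows why even modest per-step degradation compounds: seven steps at 80% each complete only about 21% of chains.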

5.4 S7 as Professional Responsibility Gate

S7 operationalizes Shultz & Zedeck's Factor 21: "Integrity & Honesty—has core values and beliefs; acts with integrity and honesty."

Under Model Rule 3.3(a)(1), attorneys may not "make a false statement of fact or law to a tribunal." A brief citing fabricated cases is not merely imperfect; it is professionally worthless and potentially sanctionable.

L7 mirrors this: if S7 detects any fabricated citation in the S6 output, the entire S6 score is voided—set to zero regardless of reasoning quality.
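The voiding rule is a hard gate rather than a weighted penalty. A minimal sketch follows; the function and variable names are illustrative assumptions, with the real harness presumably checking citations against fake_cases.csv and the SCDB.

```python
def gate_s6_score(s6_score, cited_cases, known_real_cases):
    """S7 gate: void the S6 score if any cited case is not a known real case."""
    fabricated = [c for c in cited_cases if c not in known_real_cases]
    if fabricated:
        return 0.0, fabricated  # entire memo voided, regardless of quality
    return s6_score, []

# Hypothetical example: one real citation, one fabricated one
known = {"347 U.S. 483", "358 U.S. 1"}
score, fakes = gate_s6_score(0.92, ["347 U.S. 483", "999 U.S. 999"], known)
print(score, fakes)  # 0.0 ['999 U.S. 999']
```

Making the gate binary mirrors Model Rule 3.3(a)(1): a memo with any fabricated authority is not partially credited, it is worthless.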

5.5 S5 Dual-Modality: The Reasoning Bridge

S5 (Distinguish Cases) occupies a unique position in the chain. It is the point where retrieval must transform into reasoning: the model must compare the fact patterns of the cited and citing cases and decide whether any differences are material.

To isolate the reasoning component, L7 tests S5 in two modalities:

S5-RAG (Primary): Both case texts available. Tests whether the model can distinguish cases with full information.

S5-CB (Diagnostic): Only the S4-extracted holding available; no citing case text. Tests whether the model can reason from the rule alone.

The gap between S5-RAG and S5-CB is the Fluency-Reasoning Divergence measurement:

S5-RAG | S5-CB | Interpretation
High | High | Model reasons well
High | Low | Model copies, doesn't reason (surface fluency vs reasoning divergence signature)
Low | Low | Model cannot perform the task
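The table above can be read as a simple decision rule over the two accuracies. The threshold below is an arbitrary placeholder, not a value from the benchmark specification:

```python
def classify_divergence(rag_acc, cb_acc, high=0.7):
    """Interpret the S5 RAG vs. closed-book gap per the divergence table."""
    if rag_acc >= high and cb_acc >= high:
        return "reasons well"
    if rag_acc >= high and cb_acc < high:
        # Surface fluency vs reasoning divergence signature
        return "copies, doesn't reason"
    return "cannot perform the task"

print(classify_divergence(0.85, 0.40))  # copies, doesn't reason
```

The quantity of interest for quantization testing is the gap rag_acc - cb_acc: if compression widens it, the model is increasingly leaning on retrieved text rather than on the rule.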

5.6 Grading Architecture

Six of L7's seven steps are graded fully objectively:

Step | Grading Method | Ground Truth Source
S1 | Exact match | SCDB citation
S2 | MRR / Hit@k | Shepard's precedent relationships
S3 | Exact match | scotus_overruled_db
S4 | Exact match (disposition, party) | SCDB metadata
S5 | Exact match | shepards.agree field
S6 | Hybrid (50% objective, 50% LLM-as-Judge) | Chain grounding + MEE rubric
S7 | Deterministic | Citation existence check

Only S6 requires rubric-based evaluation. This architecture minimizes LLM-as-judge circularity: most of the benchmark is deterministic.
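S2's MRR and Hit@k are standard ranking metrics; a minimal sketch, with the example data hypothetical:

```python
def mrr(ranked_lists, relevant):
    """Mean reciprocal rank of the first relevant item in each ranked list."""
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        for i, item in enumerate(ranking, start=1):
            if item in rel:
                total += 1.0 / i  # reciprocal rank of first hit
                break             # misses contribute zero
    return total / len(ranked_lists)

def hit_at_k(ranked_lists, relevant, k=10):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(1 for ranking, rel in zip(ranked_lists, relevant)
               if any(item in rel for item in ranking[:k]))
    return hits / len(ranked_lists)

# Two toy queries: first finds its precedent at rank 2, second misses entirely
rankings = [["A", "B", "C"], ["X", "Y", "Z"]]
gold = [{"B"}, {"Q"}]
print(mrr(rankings, gold), hit_at_k(rankings, gold, k=3))  # 0.25 0.5
```

Both metrics are fully deterministic given the Shepard's ground truth, which is what keeps S2 out of the LLM-as-judge loop.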

5.7 Task Structure: From Case to Chain

Each L7 chain instance begins with a Supreme Court case pair drawn from the Shepard's citation network.

Element | Source | Example
Cited Case | scdb_sample.csv | Brown v. Board of Education, 347 U.S. 483 (1954)
Citing Case | scotus_shepards_sample.csv | Cooper v. Aaron, 358 U.S. 1 (1958)
Shepard's Signal | shepards field | "followed"
Doctrinal Agreement | agree field | True (citing case follows precedent)
Overrule Status | scotus_overruled_db.csv | None (not overruled)
Opinion Text | majority opinion field | Full text of majority opinion

5.8 Scoring Summary

Step | Ground Truth | Scoring Method
S1 | SCDB metadata | Exact match
S2 | Shepard's citing_case_us_cite | MRR, Hit@10
S3 | scotus_overruled_db | Binary match on is_overruled
S4 | SCDB caseDisposition, partyWinning | Closed-enum exact match
S5 | Shepard's agree field | Binary match
S6 | MEE rubric + chain grounding | Hybrid (50% objective, 50% rubric)
S7 | fake_cases.csv + SCDB | Deterministic lookup

5.9 What L7 Detects That Parallel Benchmarks Cannot

L7 detects three failure modes invisible to parallel evaluation:

Cascade failures. A model that hallucinates at S1 corrupts all downstream steps. Parallel scoring treats S1 as one task among many; L7 propagates the error through the chain.

Surface fluency vs reasoning divergence signature. The S5 RAG-CB gap directly measures whether the model is reasoning or copying. No parallel benchmark isolates this.

Professional responsibility failures. S7 voiding enforces the binary reality of citation integrity—a synthesized memo is either citable or it is not.

For quantization testing, these properties are essential. We hypothesize that compression degrades reasoning while preserving fluency. L7's chained architecture, dual-modality S5, and hard-gate S7 are designed to make this degradation visible.