Empirical Final Design v4.0

Chapters IV & V: Evidence & Benchmark Design


IV. Evidence that Quantization Impacts Legal Skills

This section maps existing evidence to each skill, establishing that quantization plausibly degrades the underlying capabilities each skill requires.

4.1 Summary: The Full Skill Surface Is At Risk

Skill | Mechanism | Risk Level
S3: Known Authority | Long-context degradation | High
S4: Unknown Authority | Reasoning + retrieval | High
S5: Validating Authority | Temporal reasoning | Medium
S6: Fact Extraction | Long-context retrieval | High
S7: Distinguishing Cases | Multi-step reasoning | High
S8: Synthesizing Results | Integration + accuracy | High
PR: Prof. Responsibility | Fabrication resistance | Very High

4.2.1 Research Planning

Mechanism: Research planning requires decomposing complex queries into subtasks—exactly the multi-hop reasoning that Li et al. showed degrades up to 4× under quantization.

Study | Finding
ACBench (Dong et al., 2025) | 4-bit quantization creates a critical divergence between apparent competence and actual reliability, with real-world accuracy drops of 10-15%.
Liu et al. (2025) | Lower bit-width quantization introduces task-difficulty-dependent accuracy risks; explicitly evaluates KV-cache and activation quantization as well as weights.
IntactKV (2024) | Mechanistic support that KV-cache quantization can be a failure point for workflow state maintenance.

4.2.2 Strategic Stopping

Mechanism: Strategic stopping requires calibrated confidence—knowing when you have enough. ECE studies show quantized models become overconfident, failing to recognize their own uncertainty.

Study | Finding
Zhong et al. (2025) | Quantized LLMs are consistently worse-calibrated than their full-precision counterparts, with calibration errors in 85% of reported measurements (41 of 48 test conditions).
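Calibration error of the kind reported in such studies is typically quantified as Expected Calibration Error (ECE). A minimal sketch of binned ECE follows; the binning scheme and example values are illustrative, not drawn from the study.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - confidence| over confidence bins."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Collect predictions whose confidence falls in this bin (last bin includes 1.0).
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += len(in_bin) / total * abs(acc - avg_conf)
    return ece

# Overconfident model: 90% stated confidence, 25% actual accuracy -> large ECE
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0]))
```

An overconfident quantized model in this framework is one whose per-bin accuracy sits well below its stated confidence, which is exactly the failure mode that undermines strategic stopping.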

4.2.3 Finding Known Authority

Mechanism: Known authority retrieval requires precise matching across long contexts. Embedding quantization collapses semantic distances; generator quantization corrupts attention to specific passages.

Study | Finding
Mekala et al. (2025) | 8-bit quantization is roughly lossless (small average drop), but 4-bit methods can produce very large losses (up to ~59%), especially on long-context inputs.
LegalBench-RAG (2024) | Legal-domain benchmark that isolates retrieval quality, establishing that legal retrieval is hard even before quantization is applied.

4.2.4 Finding Unknown Authority

Mechanism: Finding unknown authority requires decomposing fact patterns into legal issues. Li et al. show quantization degrades multi-hop reasoning by up to 4× on complex tasks.

Study | Finding
Li et al. | Low-bit quantization introduces up to 32.39% accuracy degradation (avg. 11.31%) on complex math reasoning, with degradation concentrated in numerical computation and planning capabilities.
Liu et al. (2025) | Lower bit-widths introduce significant accuracy risks across DeepSeek-R1, LLaMA, and Qwen.
Yazan, Verberne & Situmeang (2024) | In RAG pipelines, quantization may not impair performance when the base LLM performs well, but smaller models show high sensitivity to context length and setup.

4.2.5 Validating Authority

Mechanism: Validation requires temporal reasoning (when was this overruled?) and status classification (still good law?). These are exactly the fine-grained distinctions that outlier weight clipping destroys.

Study | Finding
Liu et al. (2025) | While W8A8/W4A16 can be lossless, lower bit-widths introduce significant accuracy risks, with task difficulty a critical determinant, placing authority validation in the high-risk regime.
MixKVQ (Zhang et al., 2025) | Existing low-bit KV-cache quantization can exhibit severe performance degradation on complex reasoning tasks; fixed-precision methods at very low bit-widths struggle with outlier channels.
TimeBench | GPT-4 achieves only 66.4% accuracy on tasks requiring implicit temporal relationships; accuracy varies from 40.25% to 92% depending on how temporal facts are organized.

4.2.6 Fact Extraction

Mechanism: Fact extraction from contracts is long-context retrieval. The 59% accuracy collapse on NIAH-none (correctly identifying when information is absent) directly implicates document review reliability.

Study | Finding
Mekala et al. (2025) | Up to 59% degradation on long-context extraction tasks at 4-bit quantization. S6's failure mode is particularly insidious because extracted "holdings" may be linguistically plausible while being substantively fabricated.

4.2.7 Distinguishing Cases

Mechanism: Case distinction requires tracking multiple factors simultaneously and identifying material differences. This is multi-step reasoning—the capacity most vulnerable to quantization.

Study | Finding
Dahl et al. (2024) | Models "cannot reliably detect when they are hallucinating" and fail to correct users' incorrect legal assumptions; combined with Li et al.'s 32.39% reasoning degradation under quantization, this points to high unreliability.
Liu et al. (2025) | Low-bit regimes create accuracy risks on hard reasoning tasks (the cognitive substrate for case distinction).

4.2.8 Synthesizing Results

Mechanism: Synthesis requires integrating multiple sources while maintaining coherence. CLERC explicitly reports that strong models can produce highly rated analyses while hallucinating—"good writing ≠ truthful authority."

Study | Finding
LegalEval-Q (Li & Wu, 2025) | Measures clarity, coherence, and terminology quality; reports that quantization has negligible impact on writing-quality metrics, supporting the "fluency preserved while truth degrades" account.
Lewis et al. (2020) | Canonical RAG citation establishing retrieval + generation as a distinct paradigm; provenance and updating are core motivations.

4.2.9 Professional Responsibility

Mechanism: Citation integrity and fabrication resistance depend on precise parametric memory. Quantization clips the outlier weights encoding rare-but-accurate associations.

Study | Finding
Q-Misalign (Dong et al., 2025) | Safety alignment is not preserved by quantization but is contingent on precision; vulnerabilities can remain dormant, making pre-deployment safety audits unreliable for detecting post-quantization failure modes.
Li et al. (2024) | 4-bit quantization significantly weakens fabrication resistance.
Dahl et al. (2024) | LLMs hallucinate legal authority at alarming rates (69-88%) on verifiable legal queries.

V. The Legal-7 Benchmark: Building the Benchmark

5.1 Research Execution as Job Performance

Section II identified the cognitive skills that constitute competent legal work. But these sources converge on something more fundamental than a checklist: Research Execution—the integrated professional competency of completing a legal research task from question to answer.

Shultz & Zedeck's empirical study confirms this framing. Their 26 "Lawyering Effectiveness Factors" are job performance measures—derived from asking lawyers, judges, and clients: "If you were looking for a lawyer for an important matter, what qualities would cause you to choose that attorney?"

AALL Principle IV operationalizes this directly: "A successful legal researcher applies information effectively to resolve a specific issue or need."

5.2 The Legal-7 Chain

The Legal-7 (L7) agentic benchmark operationalizes Research Execution as a seven-step dependent chain:

Step | Name | Modality | Task | Ground Truth
S1 | Known Authority | RAG | Resolve known citation to correct authority | SCDB citation lookup
S2 | Unknown Authority | RAG | Retrieve relevant law from fact pattern | shepards_data.csv
S3 | Validate Authority | RAG | Determine if authority remains good law | scotus_overruled_db.csv
S4 | Fact Extraction | RAG | Extract disposition, holding, outcome from opinion | SCDB metadata + opinion text
S5 | Distinguish Cases | RAG + CB | Decide if precedent applies or can be distinguished | shepards.agree field
S6 | IRAC Synthesis | CB | Write IRAC-structured legal analysis | MEE rubric + chain grounding
S7 | Citation Integrity | CB | Verify no fabricated citations in S6 output | fake_cases.csv + SCDB
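The dependent structure can be sketched as a short-circuiting pipeline: each step consumes the chain state produced by its predecessors, and a failed step halts everything downstream. The runner below is an illustrative assumption about the harness, not the actual implementation; step names follow the table.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]      # takes chain state, returns updated state
    passed: Callable[[dict], bool]   # grades the step against ground truth

def run_chain(steps, state) -> tuple[dict, Optional[str]]:
    """Execute steps in order; the first failed step halts the chain (cascade)."""
    for step in steps:
        state = step.run(state)
        if not step.passed(state):
            return state, step.name  # chain fails here; later steps never run
    return state, None

# Toy chain: S1 succeeds, S3 fails on validation, so S4 never executes.
steps = [
    Step("S1", lambda s: {**s, "s1": "ok"}, lambda s: True),
    Step("S3", lambda s: {**s, "s3": "bad law"}, lambda s: False),
    Step("S4", lambda s: {**s, "s4": "ok"}, lambda s: True),
]
state, failed_at = run_chain(steps, {})
print(failed_at)  # S3
```

The key design property is that downstream steps see only the state the chain actually produced, so an early error propagates rather than being scored in isolation.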

The chain maps onto the IRAC framework that governs U.S. common law analysis.

5.3 Why S6 Validates the Chain

S6 is administered closed-book: the model cannot return to the sources. It must synthesize an IRAC memo from what it gathered in S1–S5.

This design reflects AALL Principle IV's standard: applying gathered information to resolve an issue.

If Step Fails... | Cascade Effect
S1 (Known Authority) | Wrong case → all downstream analysis corrupted
S2 (Unknown Authority) | Missing precedent → incomplete rule statement
S3 (Validate Authority) | Citing bad law → S6 argument fails
S4 (Fact Extraction) | Wrong facts → S5 distinction invalid
S5 (Distinguish) | Wrong application → S6 conclusion unsupported
S6 (IRAC Synthesis) | Poor reasoning → chain fails at capstone
S7 (Citation Integrity) | Fabrication detected → S6 voided, chain fails

A model achieving 90% accuracy on each skill will, under independence assumptions, complete only 0.9^7 ≈ 48% of full chains successfully. This multiplicative penalty reflects the dependent structure of real legal work.
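The multiplicative penalty can be checked directly. This is a toy calculation under the stated independence assumption:

```python
def chain_success_prob(step_accuracies):
    """Probability of completing a dependent chain, assuming independent steps."""
    p = 1.0
    for acc in step_accuracies:
        p *= acc  # every step must succeed for the chain to succeed
    return p

# Seven steps at 90% each: roughly 48% of full chains succeed
print(round(chain_success_prob([0.9] * 7), 3))  # 0.478
```

The same calculation shows why even modest per-step degradation compounds: seven steps at 80% each complete only about 21% of chains.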

5.4 S7 as Professional Responsibility Gate

S7 operationalizes Shultz & Zedeck's Factor 21: "Integrity & Honesty—has core values and beliefs; acts with integrity and honesty."

Under Model Rule 3.3(a)(1), attorneys may not "make a false statement of fact or law to a tribunal." A brief citing fabricated cases is not merely imperfect; it is professionally worthless and potentially sanctionable.

L7 mirrors this: if S7 detects any fabricated citation in the S6 output, the entire S6 score is voided—set to zero regardless of reasoning quality.
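The voiding rule is a hard gate rather than a weighted penalty. A minimal sketch follows; the function and variable names are illustrative assumptions, with the real harness presumably checking citations against fake_cases.csv and the SCDB.

```python
def gate_s6_score(s6_score, cited_cases, known_real_cases):
    """S7 gate: void the S6 score if any cited case is not a known real case."""
    fabricated = [c for c in cited_cases if c not in known_real_cases]
    if fabricated:
        return 0.0, fabricated  # entire memo voided, regardless of quality
    return s6_score, []

# Hypothetical example: one real citation, one fabricated one
known = {"347 U.S. 483", "358 U.S. 1"}
score, fakes = gate_s6_score(0.92, ["347 U.S. 483", "999 U.S. 999"], known)
print(score, fakes)  # 0.0 ['999 U.S. 999']
```

Making the gate binary mirrors Model Rule 3.3(a)(1): a memo with any fabricated authority is not partially credited, it is worthless.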

5.5 S5 Dual-Modality: The Reasoning Bridge

S5 (Distinguish Cases) occupies a unique position in the chain. It is the point where retrieval must transform into reasoning: the model must compare the fact patterns of the cited and citing cases and decide whether any differences are material.

To isolate the reasoning component, L7 tests S5 in two modalities:

S5-RAG (Primary): Both case texts available. Tests whether the model can distinguish cases with full information.

S5-CB (Diagnostic): Only the S4-extracted holding available; no citing case text. Tests whether the model can reason from the rule alone.

The gap between S5-RAG and S5-CB is the Fluency-Reasoning Divergence measurement:

S5-RAG | S5-CB | Interpretation
High | High | Model reasons well
High | Low | Model copies, doesn't reason (surface fluency vs reasoning divergence signature)
Low | Low | Model cannot perform the task
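The table above can be read as a simple decision rule over the two accuracies. The threshold below is an arbitrary placeholder, not a value from the benchmark specification:

```python
def classify_divergence(rag_acc, cb_acc, high=0.7):
    """Interpret the S5 RAG vs. closed-book gap per the divergence table."""
    if rag_acc >= high and cb_acc >= high:
        return "reasons well"
    if rag_acc >= high and cb_acc < high:
        # Surface fluency vs reasoning divergence signature
        return "copies, doesn't reason"
    return "cannot perform the task"

print(classify_divergence(0.85, 0.40))  # copies, doesn't reason
```

The quantity of interest for quantization testing is the gap rag_acc - cb_acc: if compression widens it, the model is increasingly leaning on retrieved text rather than on the rule.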

5.6 Grading Architecture

Six of L7's seven steps are graded fully objectively:

Step | Grading Method | Ground Truth Source
S1 | Exact match | SCDB citation
S2 | MRR / Hit@k | Shepard's precedent relationships
S3 | Exact match | scotus_overruled_db
S4 | Exact match (disposition, party) | SCDB metadata
S5 | Exact match | shepards.agree field
S6 | Hybrid (50% objective, 50% LLM-as-Judge) | Chain grounding + MEE rubric
S7 | Deterministic | Citation existence check

Only S6 requires rubric-based evaluation. This architecture minimizes LLM-as-judge circularity: most of the benchmark is deterministic.
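S2's MRR and Hit@k are standard ranking metrics; a minimal sketch, with the example data hypothetical:

```python
def mrr(ranked_lists, relevant):
    """Mean reciprocal rank of the first relevant item in each ranked list."""
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        for i, item in enumerate(ranking, start=1):
            if item in rel:
                total += 1.0 / i  # reciprocal rank of first hit
                break             # misses contribute zero
    return total / len(ranked_lists)

def hit_at_k(ranked_lists, relevant, k=10):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(1 for ranking, rel in zip(ranked_lists, relevant)
               if any(item in rel for item in ranking[:k]))
    return hits / len(ranked_lists)

# Two toy queries: first finds its precedent at rank 2, second misses entirely
rankings = [["A", "B", "C"], ["X", "Y", "Z"]]
gold = [{"B"}, {"Q"}]
print(mrr(rankings, gold), hit_at_k(rankings, gold, k=3))  # 0.25 0.5
```

Both metrics are fully deterministic given the Shepard's ground truth, which is what keeps S2 out of the LLM-as-judge loop.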

5.7 Task Structure: From Case to Chain

Each L7 chain instance begins with a Supreme Court case pair drawn from the Shepard's citation network.

Element | Source | Example
Cited Case | scdb_sample.csv | Brown v. Board of Education, 347 U.S. 483 (1954)
Citing Case | scotus_shepards_sample.csv | Cooper v. Aaron, 358 U.S. 1 (1958)
Shepard's Signal | shepards field | "followed"
Doctrinal Agreement | agree field | True (citing case follows precedent)
Overrule Status | scotus_overruled_db.csv | None (not overruled)
Opinion Text | majority opinion field | Full text of majority opinion

5.8 Scoring Summary

Step | Ground Truth | Scoring Method
S1 | SCDB metadata | Exact match
S2 | Shepard's citing_case_us_cite | MRR, Hit@10
S3 | scotus_overruled_db | Binary match on is_overruled
S4 | SCDB caseDisposition, partyWinning | Closed-enum exact match
S5 | Shepard's agree field | Binary match
S6 | MEE rubric + chain grounding | Hybrid (50% objective, 50% rubric)
S7 | fake_cases.csv + SCDB | Deterministic lookup

5.9 What L7 Detects That Parallel Benchmarks Cannot

L7 detects three failure modes invisible to parallel evaluation:

Cascade failures. A model that hallucinates at S1 corrupts all downstream steps. Parallel scoring treats S1 as one task among many; L7 propagates the error through the chain.

Surface fluency vs reasoning divergence signature. The S5 RAG-CB gap directly measures whether the model is reasoning or copying. No parallel benchmark isolates this.

Professional responsibility failures. S7 voiding enforces the binary reality of citation integrity—a synthesized memo is either citable or it is not.

For quantization testing, these properties are essential. We hypothesize that compression degrades reasoning while preserving fluency. L7's chained architecture, dual-modality S5, and hard-gate S7 are designed to make this degradation visible.