Empirical Final Design (Long Report Test)
This page stress-tests report typography (tables, lists, code blocks, and quotes) and the right-side Contents TOC (scrolling and active-section highlighting).
Summary
We suspect aggressive quantization can create a “silent defect zone” where fluency remains stable while multi-step legal reasoning degrades. This report format is designed to make those failures observable with consistent structure.
Key metrics
| Metric | Definition | Why it matters |
|---|---|---|
| Chain completion | Percent of instances that pass all steps and gates | Captures compounding error across steps |
| Citation integrity | Percent of responses without fabricated authorities | Hard constraint for legal work product |
| Surface fluency vs reasoning proxy | Drop in “validate + synthesize” performance under compression | Separates reasoning fragility from surface fluency |
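The first two definitions in the table can be sketched in a few lines. This is a minimal illustration, not the report's actual scoring code: the `Instance` record shape and both function names are hypothetical, and it assumes each instance records one pass/fail flag per step or gate plus a count of fabricated authorities.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """One benchmark instance (hypothetical record shape)."""
    step_passed: List[bool]     # one flag per step or gate in the chain
    fabricated_citations: int   # authorities that could not be verified


def chain_completion(instances: List[Instance]) -> float:
    """Percent of instances in which every step and gate passed."""
    if not instances:
        return 0.0
    done = sum(1 for inst in instances if all(inst.step_passed))
    return 100.0 * done / len(instances)


def citation_integrity(instances: List[Instance]) -> float:
    """Percent of instances with no fabricated authorities."""
    if not instances:
        return 0.0
    clean = sum(1 for inst in instances if inst.fabricated_citations == 0)
    return 100.0 * clean / len(instances)


# Made-up sample data, for illustration only.
sample = [
    Instance([True, True, True], 0),
    Instance([True, False, True], 0),   # fails one step
    Instance([True, True, True], 1),    # one fabricated authority
    Instance([True, True, True], 0),
]
print(chain_completion(sample))     # 75.0
print(citation_integrity(sample))   # 75.0
```

Note that a single failed step removes the whole instance from the chain-completion numerator, which is what makes the metric sensitive to compounding error.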
Background
In production, vendors routinely trade precision for throughput. The user sees faster responses; the risk is a selective drop in reasoning capacity. A benchmark must be chain-aware to detect that.
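A small arithmetic sketch shows why chain-awareness matters. It assumes, purely for illustration, that steps fail independently with the same per-step accuracy. Real chains need not behave this way, and the function name is hypothetical.

```python
def expected_chain_completion(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps pass, assuming independent, equally accurate steps."""
    return per_step_accuracy ** num_steps


# A modest per-step drop compounds over a five-step chain.
for p in (0.99, 0.95, 0.90):
    print(p, round(expected_chain_completion(p, 5), 3))
```

Under these assumptions, a per-step accuracy of 0.95 yields roughly 0.774 for a five-step chain, so a per-step score that looks healthy can hide a much larger end-to-end loss.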
Design principle
Don’t grade the plan; grade the memo. Planning is implicit in whether synthesis succeeds.
Method
The report structure below is intentionally repetitive: each section has the same density and spacing so that visual rhythm stays stable while you scroll, and the TOC remains useful.
Inputs
- Run id, timestamp, and model configuration
- Dataset slice and coverage constraints
- Decoding parameters and determinism settings
Outputs
- Per-step metrics and chain score
- Gate outcomes (pass/fail) with rationale
- Artifact bundle hashes for auditability
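One way to produce artifact bundle hashes is sketched below using the standard-library `hashlib`. The function names, the choice of SHA-256, and the JSON manifest layout are assumptions for illustration; the report does not specify a hashing scheme.

```python
import hashlib
import json
from typing import Dict


def bundle_hashes(artifacts: Dict[str, bytes]) -> Dict[str, str]:
    """Map each artifact name to the SHA-256 hex digest of its bytes."""
    return {
        name: hashlib.sha256(data).hexdigest()
        for name, data in sorted(artifacts.items())
    }


def manifest_digest(hashes: Dict[str, str]) -> str:
    """Single digest over the per-artifact hashes, independent of dict order."""
    canonical = json.dumps(hashes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical artifact names and contents.
artifacts = {
    "metrics.json": b'{"chain_completion": 75.0}',
    "responses.jsonl": b'{"id": 1}\n',
}
hashes = bundle_hashes(artifacts)
print(manifest_digest(hashes))
```

Serializing the manifest with sorted keys keeps the top-level digest stable across runs, so an auditor can compare one value before checking individual files.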
Implementation notes
This block checks code styling in report mode: inline code and fenced blocks.
run_id = "l10_demo_2025_01_01"
model = {"name": "example", "precision": "4-bit", "ptq": "awq"}
thresholds = {"citation_integrity": "strict"}
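A gate that consumes such a threshold could look like the sketch below. It assumes that "strict" means zero fabricated authorities are tolerated; that reading, the `GateOutcome` type, and the `citation_gate` function are hypothetical and not part of the configuration above.

```python
from dataclasses import dataclass


@dataclass
class GateOutcome:
    gate: str
    passed: bool
    rationale: str


def citation_gate(fabricated: int, total: int, mode: str = "strict") -> GateOutcome:
    """Pass/fail gate with a rationale string (assumes 'strict' = zero tolerance)."""
    if mode != "strict":
        raise ValueError(f"unsupported mode in this sketch: {mode!r}")
    passed = fabricated == 0
    rationale = (
        f"all {total} cited authorities verified"
        if passed
        else f"{fabricated} of {total} cited authorities could not be verified"
    )
    return GateOutcome("citation_integrity", passed, rationale)


print(citation_gate(fabricated=0, total=6))
print(citation_gate(fabricated=2, total=6))
```

Returning the rationale alongside the boolean keeps each gate outcome self-explanatory when it is rendered in a report.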
Findings
The sections below are intentionally long to test scroll behavior and TOC highlighting over many headings.
F1: Surface fluency stability
We often observe stable grammar, formatting, and confident tone even when legal validation quality drops. This mismatch is exactly why a report layout needs strong structure and navigability.
F2: Reasoning fragility under compression
Under aggressive post-training quantization (PTQ), rare but critical weight directions can be the first to be clipped. In legal tasks, those directions are often activated during validation, distinguishing, and synthesis, where multi-factor tests and exception handling matter.
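The clipping effect can be illustrated with a toy example. This is plain round-to-nearest symmetric quantization on made-up weights, not AWQ or any specific PTQ method; the function name and the `clip` parameter are assumptions for the sketch.

```python
from typing import List, Optional


def quantize_symmetric(weights: List[float], bits: int = 4,
                       clip: Optional[float] = None) -> List[float]:
    """Round-to-nearest symmetric quantization; values beyond +/-clip saturate."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    limit = clip if clip is not None else max(abs(w) for w in weights)
    if limit == 0:
        return [0.0 for _ in weights]
    scale = limit / qmax
    out = []
    for w in weights:
        q = round(w / scale)
        q = max(-qmax, min(qmax, q))            # saturate at the clip range
        out.append(q * scale)                   # dequantized value
    return out


# Made-up weights: four small values and one rare, large one.
weights = [0.02, -0.05, 0.03, 0.01, 0.90]

tight = quantize_symmetric(weights, bits=4, clip=0.07)
wide = quantize_symmetric(weights, bits=4)

print(tight)   # small weights preserved, the 0.90 outlier saturates near 0.07
print(wide)    # outlier preserved, small weights collapse to 0.0
```

With a tight clip range, the common small weights survive and the rare large one is truncated. With a wide range, the opposite happens. Either way, a 4-bit grid forces a choice about which weights lose information.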
Section stress test A
Scroll-rhythm test paragraph. Consistent spacing helps you notice anomalies in charts and tables rather than in inconsistent typography.
A.1
Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test.
A.2
Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test.
Section stress test B
Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test.
B.1
Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test.
B.2
Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test.
Section stress test C
Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test.
C.1
Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test.
C.2
Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test.
Section stress test D
Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test.
D.1
Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test.
D.2
Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test. Dense paragraph test.
Appendix
Quick checklist for report UI:
- TOC stays visible and scrollable
- Active section highlight updates smoothly
- Tables remain readable in light and dark modes
- Code blocks don’t overflow horizontally