Sample Report

This is a placeholder report page showing the intended reading experience: larger serif text and an “On this page” table of contents (TOC) on the left.

Summary

Report summary goes here: run ID, model, dataset slice, and headline metrics (void rate, per-step accuracy, etc.).

Method

Describe the evaluation method used for this report. Keep this consistent across reports so comparisons are easy.

Inputs

Run identifier + timestamp
Model + backend configuration
Dataset slice + coverage constraints
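One way to keep these inputs consistent across reports is to record them in a single small structure. The sketch below is illustrative only: the class name `ReportInputs`, every field name, and the example values are assumptions, not something this page defines.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ReportInputs:
    """Hypothetical container for the inputs listed above (all names are assumptions)."""
    run_id: str
    timestamp: datetime
    model: str
    backend_config: dict = field(default_factory=dict)
    dataset_slice: str = ""
    coverage_constraints: dict = field(default_factory=dict)

    def to_header(self) -> dict:
        """Return a plain dict with the timestamp as an ISO 8601 string."""
        header = asdict(self)
        header["timestamp"] = self.timestamp.isoformat()
        return header


# Placeholder values for illustration; they do not describe a real run.
example_inputs = ReportInputs(
    run_id="example-run",
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    model="example-model",
    backend_config={"temperature": 0.0},
    dataset_slice="example-slice",
    coverage_constraints={"min_examples_per_step": 10},
)
print(example_inputs.to_header())
```

Freezing the dataclass keeps the recorded inputs from being changed after a report is generated, which makes comparisons between reports easier to trust.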

Outputs

Chain-level metrics
Per-step metrics
Failure modes + examples
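As a rough illustration of how per-step and chain-level metrics might be derived from graded results, here is a minimal sketch. It assumes each graded record is a `(chain_id, step, correct)` tuple and that a chain counts as correct only when every graded step is correct; both are assumptions for the example. Void rate is not computed here because this page does not define it.

```python
from collections import defaultdict


def per_step_accuracy(records):
    """Return {step: fraction correct} from (chain_id, step, correct) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for _chain_id, step, correct in records:
        totals[step] += 1
        hits[step] += int(bool(correct))
    return {step: hits[step] / totals[step] for step in sorted(totals)}


def chain_accuracy(records):
    """Return the fraction of chains in which every graded step is correct."""
    chain_ok = {}
    for chain_id, _step, correct in records:
        chain_ok[chain_id] = chain_ok.get(chain_id, True) and bool(correct)
    if not chain_ok:
        return 0.0  # no chains graded
    return sum(chain_ok.values()) / len(chain_ok)


# Made-up records for illustration only.
example_records = [
    ("c1", "S1", True),
    ("c1", "S2", False),
    ("c2", "S1", True),
    ("c2", "S2", True),
]
print(per_step_accuracy(example_records))  # {'S1': 1.0, 'S2': 0.5}
print(chain_accuracy(example_records))     # 0.5
```

Reporting both views side by side helps show whether a low chain-level score comes from one weak step or from errors spread across several steps.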

Findings

Main narrative findings. Use short paragraphs and clear headings so the TOC remains useful.

Step performance

Discuss where accuracy drops and why. Link to supporting tables/charts if needed.

Failure modes

Highlight common error clusters (e.g., S4 disposition confusion, S6 incoherent synthesis, citation integrity issues).

Appendix

Include additional details (definitions, thresholds, or per-step rubric notes).