Sample Report
This is a placeholder report page showing the intended reading experience: larger serif text and an “On this page” TOC on the left.
Summary
Report summary goes here: run ID, model, dataset slice, and headline metrics (void rate, per-step accuracy, etc.).
Method
Describe the evaluation method used for this report. Keep this consistent across reports so comparisons are easy.
Inputs
- Run identifier + timestamp
- Model + backend configuration
- Dataset slice + coverage constraints
Outputs
- Chain-level metrics
- Per-step metrics
- Failure modes + examples
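One way to keep the inputs and outputs above consistent across reports is to capture them in a small typed record. The following is a minimal sketch, not a prescribed schema: the class names, field names, and the `missing_sections` helper are all hypothetical and only mirror the bullet lists above.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ReportInputs:
    """Hypothetical record for the 'Inputs' list."""
    run_id: str                      # run identifier
    timestamp: datetime              # when the run was executed
    model: str                       # model name
    backend_config: dict             # backend configuration (free-form)
    dataset_slice: str               # dataset slice label
    coverage_constraints: list = field(default_factory=list)


@dataclass
class ReportOutputs:
    """Hypothetical record for the 'Outputs' list."""
    chain_metrics: dict              # metric name -> value for the whole chain
    step_metrics: dict               # step label -> {metric name -> value}
    failure_modes: dict              # failure mode name -> list of example ids


@dataclass
class Report:
    inputs: ReportInputs
    outputs: ReportOutputs

    def missing_sections(self) -> list:
        """Return the names of output sections that are still empty."""
        sections = {
            "chain_metrics": self.outputs.chain_metrics,
            "step_metrics": self.outputs.step_metrics,
            "failure_modes": self.outputs.failure_modes,
        }
        return [name for name, value in sections.items() if not value]
```

A check like `missing_sections()` could run before a report is published, so that a page never ships with an empty metrics block.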
Findings
Main narrative findings. Use short paragraphs and clear headings so the TOC remains useful.
Step performance
Discuss where accuracy drops and why. Link to supporting tables/charts if needed.
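To locate where accuracy drops, it can help to compute per-step accuracy and the largest step-to-step decrease directly from the raw outcomes. The sketch below assumes each step maps to a list of pass/fail booleans and that the dictionary is ordered by step. The function names and the step labels in the usage note are illustrative, not part of any existing tooling.

```python
from typing import Dict, List, Optional, Tuple


def per_step_accuracy(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of passing outcomes per step; steps with no outcomes are skipped."""
    return {
        step: sum(outcomes) / len(outcomes)
        for step, outcomes in results.items()
        if outcomes
    }


def largest_drop(accuracy: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return (step, drop) for the biggest decrease relative to the previous step.

    Assumes the dictionary preserves step order. Returns None when there are
    fewer than two steps to compare.
    """
    steps = list(accuracy)
    if len(steps) < 2:
        return None
    drops = [
        (steps[i], accuracy[steps[i - 1]] - accuracy[steps[i]])
        for i in range(1, len(steps))
    ]
    return max(drops, key=lambda item: item[1])
```

For example, calling `largest_drop(per_step_accuracy(results))` on a dictionary keyed by labels such as `"S1"`, `"S2"`, `"S3"` returns the step whose accuracy fell the most compared with the step before it, which is a natural starting point for the discussion in this section.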
Failure modes
Highlight common error clusters (e.g., S4 disposition confusion, S6 incoherent synthesis, citation integrity issues).
Appendix
Include additional details (definitions, thresholds, or per-step rubric notes).