Models Evaluated
Top Chain Score
Median Score (50th percentile)
Std Deviation (score spread)

Key Findings

Performance Distributions

Statistical analysis of model performance across evaluation metrics

Chain Score Distribution

p < 0.01

Box plot showing quartiles, median, and outliers
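
The box plot statistics named above (quartiles, median, outliers) can be recomputed from the downloadable results. The following is a minimal sketch, not the benchmark's own analysis code: the function name `box_plot_summary`, the inclusive quantile method, and the 1.5 × IQR outlier rule are assumptions, and the example scores are invented for illustration.

```python
import statistics


def box_plot_summary(scores):
    """Return quartiles, median, and outliers for a list of chain scores.

    Assumption: outliers are points beyond 1.5 * IQR from the quartiles
    (the common Tukey rule); the page does not state which rule it uses.
    """
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [s for s in scores if s < low_fence or s > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical scores, invented for illustration; not benchmark results.
example_scores = [10, 20, 30, 40, 50, 60, 70, 80, 200]
summary = box_plot_summary(example_scores)
```

With these made-up scores, the only value flagged as an outlier is 200, since it lies above the upper fence of 130.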

Top 3 Skill Profiles

S1-S8 normalized

Skill Accuracy vs Chain Completion

Demonstrates the compounding effect of per-step errors on chain completion

Correlation: r = —
R²:
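
To illustrate why per-step errors compound, here is a minimal sketch under two simplifying assumptions that the page does not state: every step has the same accuracy, and step outcomes are independent. Under those assumptions the chance of completing a chain is the per-step accuracy raised to the number of steps. The function name `chain_completion` is hypothetical; the default of 7 steps follows the 7-step pipeline described on this page.

```python
def chain_completion(step_accuracy, steps=7):
    """Estimate the probability of completing a chain without any error.

    Assumptions (illustrative only): uniform per-step accuracy and
    independent steps, so the probabilities multiply.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return step_accuracy ** steps


# Print a small table of per-step accuracy vs. estimated chain completion.
for accuracy in (0.99, 0.95, 0.90, 0.80):
    print(f"{accuracy:.2f} per step -> {chain_completion(accuracy):.3f} per chain")
```

Under these assumptions, a model that is right 90% of the time on each step completes fewer than half of its 7-step chains (about 0.48), which is the kind of gap the scatter plot is meant to expose.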

Dataset Explorer

Browse, filter, and download the full benchmark results

Dataset Version
Last Updated
Total Instances

Model Rankings

Rank Model Chain Score

Sequential Testing

7-step pipeline reveals fragile reasoning patterns that parallel benchmarks miss

Citation Integrity

A single fabricated citation voids the entire chain: integrity is binary, not gradual

Full Reproducibility

Download complete dataset, methodology, and Langfuse traces for verification
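
The binary citation-integrity rule described above can be sketched as a gate on the chain score. This is an illustration only: the function name `gated_chain_score`, the per-step boolean flags, and the choice of 0.0 as the voided score are assumptions, not the benchmark's published scoring code.

```python
def gated_chain_score(raw_score, fabrication_flags):
    """Return the chain score after applying a binary integrity gate.

    fabrication_flags holds one boolean per step, True if that step
    contained a fabricated citation. Assumption: a voided chain scores 0.0.
    """
    if any(fabrication_flags):
        return 0.0  # one fabrication anywhere voids the whole chain
    return raw_score


# Hypothetical example: a strong chain with one fabricated citation at step 4.
flags = [False, False, False, True, False, False, False]
voided = gated_chain_score(0.92, flags)
clean = gated_chain_score(0.92, [False] * 7)
```

The gate gives no partial credit: `voided` is 0.0 even though six of seven steps were clean, while `clean` keeps its raw score.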

Want to see per-skill performance without error propagation?

View L10 Atomic Results