Models Evaluated: —
Top Chain Score: —
Median Score: — (50th percentile)
Std Deviation: — (score spread)
Key Findings
Performance Distributions
Statistical analysis of model performance across evaluation metrics
Chain Score Distribution
p < 0.01
Box plot showing quartiles, median, and outliers
Top 3 Skill Profiles
S1-S8 normalized
Skill Accuracy vs Chain Completion
Demonstrates compounding effect of per-step errors
Correlation: r = —
R²: —
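The compounding effect named in the chart description can be illustrated with a minimal sketch. It assumes that steps succeed independently and that a chain completes only if every step succeeds; both are simplifying assumptions, not the benchmark's documented scoring. The function name `chain_completion` and the per-step accuracies are hypothetical illustration values, not benchmark results.

```python
import math


def chain_completion(step_accuracies):
    """Probability that every step succeeds, ASSUMING independent steps.

    This is an illustrative model, not the benchmark's scoring code.
    """
    return math.prod(step_accuracies)


if __name__ == "__main__":
    # Hypothetical per-step accuracies applied uniformly to a 7-step chain.
    for p in (0.99, 0.95, 0.90, 0.80):
        rate = chain_completion([p] * 7)
        print(f"per-step accuracy {p:.2f} -> chain completion {rate:.3f}")
```

Under these assumptions, a per-step accuracy of 0.90 leaves fewer than half of all 7-step chains intact (0.9^7 ≈ 0.478), which is why small per-skill differences can separate models sharply on a chained score.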
Dataset Explorer
Browse, filter, and download the full benchmark results
Dataset Version
—
Last Updated
—
Total Instances
—
Download Dataset
Model Rankings
| Rank | Model | Chain Score |
|---|---|---|
Sequential Testing
7-step pipeline reveals fragile reasoning patterns that parallel benchmarks miss
Citation Integrity
A single fabrication voids the entire chain: integrity is binary, not gradual
Full Reproducibility
Download the complete dataset, methodology, and Langfuse traces for verification
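The binary integrity rule described under Citation Integrity could be sketched as a gate applied before any averaging. Everything here is hypothetical: the `StepResult` type, its fields, and the choice of the mean as the aggregate are assumptions made for illustration, since the page does not specify how non-voided chains are scored.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepResult:
    score: float      # hypothetical per-step score in [0, 1]
    fabricated: bool  # True if the step contains a fabricated citation


def chain_score(steps):
    """Return 0.0 if any step fabricates; otherwise aggregate step scores.

    The mean is an ASSUMED aggregate, used only to make the sketch concrete.
    """
    if not steps:
        return 0.0
    # Integrity is binary: one fabrication voids the whole chain.
    if any(step.fabricated for step in steps):
        return 0.0
    return mean(step.score for step in steps)


if __name__ == "__main__":
    clean = [StepResult(1.0, False) for _ in range(7)]
    tainted = clean[:6] + [StepResult(1.0, True)]
    print(chain_score(clean))    # every step clean
    print(chain_score(tainted))  # one fabricated citation voids the chain
```

The point of gating first is that a chain with six perfect steps and one fabricated citation scores the same as a chain that fails everywhere, rather than receiving partial credit.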
Want to see per-skill performance without error propagation?
View L10 Atomic Results