L10 Agentic - Benchmarks

Models Evaluated

—

Top Chain Score

—

Median Score

—

50th percentile

Std Deviation

—

Score spread

Key Findings

Statistical analysis of model performance across evaluation metrics

p < 0.01

Box plot showing quartiles, median, and outliers

S1-S8 normalized

Demonstrates compounding effect of per-step errors
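A minimal sketch of the compounding effect, assuming each step succeeds independently with the same accuracy (an illustrative simplification, not the benchmark's documented scoring method). The function name and the accuracy values are hypothetical; the step count of 7 follows the 7-step pipeline mentioned on this page.

```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step of a sequential chain succeeds.

    Assumes steps are independent and share one accuracy value
    (an assumption made for illustration only).
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Hypothetical per-step accuracies, chained over 7 steps.
    for accuracy in (0.99, 0.95, 0.90):
        chained = chain_success_probability(accuracy, 7)
        print(f"{accuracy:.2f} per step -> {chained:.3f} over 7 steps")
```

Under these assumptions, a small per-step error rate shrinks the end-to-end success rate multiplicatively, which is why a chained score can sit well below any single-step score.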

Correlation: r = —

R²: —
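For readers verifying the two statistics above, here is a sketch of how a Pearson correlation r and the matching R² can be computed. It assumes a simple linear fit between two score columns, in which case R² equals r squared; which columns the page actually correlates is not stated, so the inputs here are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def r_squared(xs, ys):
    """R² of a simple one-variable linear fit, which equals r squared."""
    return pearson_r(xs, ys) ** 2
```

The identity R² = r² holds only for a one-predictor least-squares fit with an intercept; a different model would need its own R² calculation.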

Browse, filter, and download the full benchmark results

Dataset Version

—

Last Updated

—

Total Instances

—


Rank	Model	Chain Score	S8 Transitive	Trend	Provider

7-step pipeline reveals fragile reasoning patterns that parallel benchmarks miss

A single fabrication voids the entire chain: integrity is binary, not gradual
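The binary integrity rule can be sketched as a gate applied before any aggregation. This is an illustration under stated assumptions: the function name, the per-step fabrication flags, and the use of a product to combine step scores are all hypothetical and not confirmed by this page; only the rule that one fabrication voids the chain comes from the text above.

```python
import math


def chain_score(step_scores, fabricated_flags):
    """Score a chain with a binary integrity gate.

    step_scores: per-step scores in [0, 1].
    fabricated_flags: one bool per step, True if that step fabricated content.
    The product aggregation is an assumption made for illustration.
    """
    if len(step_scores) != len(fabricated_flags):
        raise ValueError("need one fabrication flag per step")
    # Integrity is all-or-nothing: any fabrication voids the whole chain.
    if any(fabricated_flags):
        return 0.0
    return math.prod(step_scores)
```

With this gate, a chain with six perfect steps and one fabricated step scores the same as a chain that failed outright, which is the sense in which integrity is binary rather than gradual.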

Download complete dataset, methodology, and Langfuse traces for verification
