Evaluation Observability

Full tracing via Langfuse ensures every model call is recorded, traceable, and auditable—enabling transparent, reproducible evaluation.

Traces (CHAIN): full run context for each evaluation run.
Spans (STEP): per-step detail within a run.
Generations (JUDGE): individual LLM judge calls.
What Gets Traced

Every model call is recorded with full context: prompts, responses, latency, status, and sub-scores. When LLM-as-judge is used for S6 Phase 3, the judge's reasoning is captured separately.

[ TRACE HIERARCHY ]
Chain Trace
├─ S1 Span (status, latency)
├─ S4 Span
├─ S6 Span
└─ Judge Generation
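The hierarchy above can be sketched as plain data. This is an illustrative model only, not the Langfuse SDK or its actual schema; names like `ChainTrace`, `failed_steps`, and the field set (`status`, `latency_ms`, `sub_scores`) are assumptions chosen to mirror the diagram and the "What Gets Traced" list.

```python
# Illustrative sketch of the trace hierarchy: a chain trace containing
# per-step spans and a judge generation. NOT the Langfuse SDK; the
# class and field names here are assumptions for demonstration.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str                  # evaluation step, e.g. "S1"
    status: str                # "ok" or "error"
    latency_ms: int
    sub_scores: dict = field(default_factory=dict)

@dataclass
class Generation:
    name: str                  # e.g. "S6-judge"
    prompt: str                # exact prompt that produced the score
    response: str              # judge's reasoning, captured separately

@dataclass
class ChainTrace:
    name: str
    spans: list = field(default_factory=list)
    generations: list = field(default_factory=list)

    def failed_steps(self):
        # Debugging use: find exactly which step(s) failed in this run.
        return [s.name for s in self.spans if s.status != "ok"]

# Build one run matching the diagram: S1, S4, S6 spans + judge call.
trace = ChainTrace(name="eval-run-001")
trace.spans.append(Span("S1", "ok", 420))
trace.spans.append(Span("S4", "error", 130))
trace.spans.append(Span("S6", "ok", 980, {"phase3": 0.8}))
trace.generations.append(
    Generation("S6-judge", prompt="Score the answer...", response="Reasoning...")
)

print(trace.failed_steps())  # -> ['S4']
```

With the full trace in hand, a poor score can be walked back to the specific failing span, and the judge generation preserves the exact prompt and reasoning behind each sub-score.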
Why Observability Matters

Debugging: When a model scores poorly, trace exactly which step failed.
Audit trail: Every score has provenance.
Reproducibility: Full traces enable independent verification.

Transparency Principle

Evaluation methodology should be verifiable. With full tracing, anyone can inspect exactly what prompt produced a given score and confirm our methodology works as documented.