Evaluation Observability
Full tracing via Langfuse records every model call so it is traceable and auditable, making evaluation transparent and reproducible.
Each trace captures the call's full context: prompt, response, latency, status, and sub-scores. When an LLM-as-judge is used for S6 Phase 3, the judge's reasoning is captured separately.
Debugging: When a model scores poorly, the trace pinpoints exactly which step failed.
Audit trail: Every score has provenance.
Reproducibility: Full traces enable independent verification.
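The three properties above can be exercised even with plain dictionaries. The structures and step names here are hypothetical, a sketch of how a score's provenance points back to the trace that produced it:

```python
# Hypothetical per-step traces for one evaluation run.
traces = {
    "retrieve": {"status": "ok", "latency_ms": 120.0},
    "generate": {"status": "error", "latency_ms": 30000.0},  # timed-out call
    "judge":    {"status": "ok", "latency_ms": 950.0},
}

def first_failed_step(traces):
    """Debugging: find which step failed for a poorly scored run."""
    for step, t in traces.items():
        if t["status"] != "ok":
            return step
    return None

# Audit trail: the score carries a pointer to its originating trace.
score = {"name": "faithfulness", "value": 0.0, "trace": "generate"}

failed = first_failed_step(traces)
print(failed)                     # → generate
print(score["trace"] == failed)   # → True: the low score traces back to the failed step
```

Because every score records its provenance, an independent reviewer can replay the same lookup and reach the same conclusion, which is what makes the evaluation reproducible.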
Evaluation methodology should be verifiable. With full tracing, anyone can inspect exactly which prompt produced a given score and confirm that the methodology works as documented.