Evaluation Observability
Full tracing via Langfuse records every model call so it is traceable and auditable, making evaluation transparent and reproducible.
Each trace captures the call's full context: prompt, response, latency, status, and sub-scores. When an LLM-as-judge is used for S6 Phase 3, the judge's reasoning is captured separately.
Debugging: When a model scores poorly, the trace pinpoints exactly which step failed.
Audit trail: Every score has provenance.
Reproducibility: Full traces enable independent verification.
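The three properties above can be exercised even with plain dictionaries. The structures and step names here are hypothetical, a sketch of how a score's provenance points back to the trace that produced it:

```python
# Hypothetical per-step traces for one evaluation run.
traces = {
    "retrieve": {"status": "ok", "latency_ms": 120.0},
    "generate": {"status": "error", "latency_ms": 30000.0},  # timed-out call
    "judge":    {"status": "ok", "latency_ms": 950.0},
}

def first_failed_step(traces):
    """Debugging: find which step failed for a poorly scored run."""
    for step, t in traces.items():
        if t["status"] != "ok":
            return step
    return None

# Audit trail: the score carries a pointer to its originating trace.
score = {"name": "faithfulness", "value": 0.0, "trace": "generate"}

failed = first_failed_step(traces)
print(failed)                     # → generate
print(score["trace"] == failed)   # → True: the low score traces back to the failed step
```

Because every score records its provenance, an independent reviewer can replay the same lookup and reach the same conclusion, which is what makes the evaluation reproducible.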
Evaluation methodology should be verifiable. With full tracing, anyone can inspect exactly which prompt produced a given score and confirm that the methodology works as documented.