Most regulated R&D teams focus on picking the right model. The more important question is whether you can tell that the model is working. This article offers a practical framework for building evaluation-first AI systems.