Why evaluation discipline matters more than model choice in regulated R&D

Hugo Evers · 7 min read

Most regulated R&D teams ask the wrong first question.

They ask: which model should we use?

The more important question is: can we actually tell if it’s working?

Without a credible answer to the second question, the first one is irrelevant.

The evaluation gap in regulated environments

In a typical ML project outside life sciences, evaluation is messy but recoverable. You build something, observe behaviour, and iterate. The cost of a wrong model is a bad quarterly metric.

In regulated R&D, the calculus is different:

  • A model influencing a clinical decision or readout needs to be defensible, not just accurate on a held-out set.
  • A data source with privacy constraints limits what you can put in an evaluation harness.
  • Traceability requirements (EU AI Act, GxP) mean “it performed well in our notebooks” is not a valid audit trail.

The gap shows up as teams that can’t answer basic questions about their system’s failure modes, so they either over-trust it or don’t deploy it at all.

Both outcomes are expensive.

What evaluation discipline actually means

It doesn’t mean perfection. It means structured uncertainty.

A good evaluation setup answers four questions:

  1. Baseline: what does “better than nothing” actually look like for this decision?
  2. Failure modes: under what conditions does this system get it wrong, and how wrong?
  3. Boundary: where does confidence drop enough that a human should take over?
  4. Drift: will I know when the system’s behaviour changes in production?

None of these require a perfect model. They require intentional instrumentation before you start building.

The practical consequence for R&D sprints

When I work with a team on a 2-week Uncertainty Reduction Sprint, I spend the first day defining the decision interface, not the model. This means:

  • What is the team actually deciding?
  • What would a correct answer look like?
  • What data exists to test against?

Only after that conversation is it worth picking up a notebook.

This sounds obvious. It almost never happens in practice, because there’s pressure to show output quickly. The result is a proliferation of models that nobody trusts and nobody deploys.

A simple evaluation hierarchy for regulated R&D

If you’re not sure where to start, use this as a checklist:

Level 1: Baseline comparison

Before anything else: what does rule-based or simple threshold logic give you? Establish this first. If your ML model doesn’t beat it, you have a data or framing problem, not a model problem.
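
As a sketch of Level 1, here is what establishing the baseline can look like in practice. Everything below is illustrative: the threshold value, the sample data, and the function names are assumptions, not a real system.

```python
# Sketch: establish a rule-based baseline before evaluating any ML model.
# The threshold (0.5) and the labelled samples are hypothetical.

def rule_based_baseline(measurement: float, threshold: float = 0.5) -> bool:
    """Simple threshold logic standing in for the pre-ML decision rule."""
    return measurement >= threshold

def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical labelled slice: (measurement, ground-truth label) pairs.
samples = [(0.2, False), (0.7, True), (0.9, True), (0.4, False), (0.6, False)]

baseline_preds = [rule_based_baseline(x) for x, _ in samples]
labels = [y for _, y in samples]

print(f"baseline accuracy: {accuracy(baseline_preds, labels):.2f}")
# Any ML model must clear this number before it earns further investment.
```

The point is not the threshold itself; it is having a documented number that every subsequent model must beat on the same labelled slice.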

Level 2: Held-out evaluation with human-labelled ground truth

Not just a test split. Ideally, clinical or domain-expert labels on a realistic slice of data, with documented labelling criteria.

Level 3: Failure mode mapping

Explicitly test the cases where you expect the model to fail. Document what happens. This is your “known unknowns” register, critical for EU AI Act high-risk traceability.

Level 4: Production monitoring

At minimum: distribution-shift detection on inputs, output-distribution monitoring, and a mechanism to flag and route uncertain predictions to a human reviewer.
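
Both pieces of Level 4 can start small. The sketch below uses the population stability index (one common drift statistic, chosen here as an example) plus confidence-based routing; the 0.2 PSI and 0.8 confidence thresholds are illustrative, not standards:

```python
# Minimal sketch of Level 4 instrumentation: population stability index (PSI)
# for input drift, plus confidence-based routing to a human reviewer.
# All thresholds here are illustrative assumptions.

import math

def psi(reference, current, bins=10):
    """Population stability index between two samples of a numeric feature."""
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / bins or 1.0

    def frac(sample, i):
        left, right = lo + i * width, lo + (i + 1) * width
        if i == bins - 1:  # make the last bin right-inclusive
            count = sum(1 for x in sample if left <= x <= right)
        else:
            count = sum(1 for x in sample if left <= x < right)
        return max(count / len(sample), 1e-6)  # avoid log(0) on empty bins

    return sum(
        (frac(current, i) - frac(reference, i))
        * math.log(frac(current, i) / frac(reference, i))
        for i in range(bins)
    )

def route(prediction: str, confidence: float, threshold: float = 0.8) -> str:
    """Send low-confidence outputs to a human reviewer instead of auto-acting."""
    return prediction if confidence >= threshold else "HUMAN_REVIEW"
```

In practice you would run `psi` on a schedule against a frozen reference window and alert above an agreed cut-off; the routing function is the human-in-the-loop boundary from question 3 of the evaluation setup, made concrete.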

Most projects in regulated R&D only have Level 1 (if that). Getting to Level 2 before deployment is a defensible minimum. Level 3 is required for anything high-risk under the EU AI Act.

What this means for your next project

If your team is currently debating GPT-4o vs a fine-tuned open model, pause that conversation and answer these questions first:

  • Do we have a labelled evaluation set that reflects real-world conditions?
  • Have we defined what failure looks like for this system?
  • Is there a human-in-the-loop mechanism for low-confidence outputs?

If the answers are unclear, the model debate is premature. Spend a week on evaluation design first. The model choice will become obvious, or the project scope will clarify substantially.


If this resonates with a problem you’re sitting on: Book a 30-min R&D triage. I’ll identify the actual blocker (usually not the model) and outline a realistic path forward.

Related: The 2-week Uncertainty Reduction Sprint: how I structure the first two weeks of an engagement to get here systematically.