Biotech R&D data platform
System design, cloud data workflows, and annotation infrastructure for a biotech instrumentation company running large-scale experiments. The bottleneck was upstream: data flows were inconsistent, annotation was manual and not reproducible, and there was no shared lineage across studies.
Confidential client: instrumentation company (NL)
Milestone: Operational, pipelines running
The blocker
Symptom
Researchers couldn't reliably share or reuse data across studies. Annotation was done ad hoc, with no consistent schema or reproducibility.
Root cause
No shared data platform with lineage tracking; each team had local conventions. Annotation tooling wasn't integrated with the data pipeline.
Why it persisted
Previous attempts to standardize had failed due to tooling friction; researchers optimized for their own workflows, not shared infrastructure.
What was built
System-level. What it actually is: inputs, outputs, users.
- Cloud data workflow: standardized ingestion, storage, and access patterns across experiment types, replacing ad hoc local solutions.
- Annotation platform: integrated annotation tooling with schema versioning and lineage tracking, so annotation decisions are traceable back to source data.
- Dataset registry: centralized metadata index enabling cross-study discovery and reproducibility (see the sketch after this list).
- Access control layer: role-based access patterns for different data sensitivity levels.
- Interfaces: inputs are raw experiment outputs across instrument types; outputs are annotated, versioned datasets; users are ML engineers and research scientists.
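To make the registry and lineage idea concrete, here is a minimal Python sketch of what a registry entry and in-memory index might look like. Field and function names (dataset_id, parent_ids, schema_version, register) are illustrative assumptions, not the client's actual schema.

```python
# Hypothetical sketch of a dataset registry entry with lineage and access metadata.
# Names are illustrative; a real registry would persist entries, not keep them in memory.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List

@dataclass
class RegistryEntry:
    dataset_id: str                 # stable identifier used for cross-study discovery
    schema_version: str             # version of the annotation schema applied
    parent_ids: List[str]           # upstream datasets / raw instrument runs (lineage)
    created_at: datetime
    access_level: str               # e.g. "internal" or "restricted"; drives role-based access
    metadata: Dict[str, str] = field(default_factory=dict)

def register(index: Dict[str, RegistryEntry], entry: RegistryEntry) -> None:
    """Add an entry to the index, refusing duplicate identifiers."""
    if entry.dataset_id in index:
        raise ValueError(f"dataset {entry.dataset_id} already registered")
    index[entry.dataset_id] = entry

# Usage: trace an annotated dataset back to its source run.
index: Dict[str, RegistryEntry] = {}
register(index, RegistryEntry("run-001-raw", "n/a", [], datetime.now(timezone.utc), "internal"))
register(index, RegistryEntry("study-A-labels-v2", "schema-1.3", ["run-001-raw"],
                              datetime.now(timezone.utc), "restricted"))
print(index["study-A-labels-v2"].parent_ids)  # -> ['run-001-raw']
```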
Architecture diagram
How we evaluated it
What "working" meant: baselines, metrics, guardrails, failure modes.
Definition of working
Annotation throughput increased; cross-study reuse rate measurable; pipeline runs reproducibly from ingestion through to labeled dataset.
Metrics tracked
- Annotation throughput vs. manual baseline
- Pipeline reproducibility: same input produces same output across runs (see the check sketched below)
- Dataset reuse rate: downstream consumers per dataset
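The reproducibility metric is easiest to picture as a check like the one below, a sketch that assumes pipeline outputs can be serialized to JSON and compared by content hash; the label_step function is a hypothetical stand-in for an ingestion-to-labels stage, not the client's pipeline code.

```python
# Illustrative reproducibility check: run the same step twice on identical input
# and compare content hashes of the outputs.
import hashlib
import json
from typing import Any, Callable, Dict

def content_hash(record: Dict[str, Any]) -> str:
    """Stable hash of a pipeline output; keys are sorted so ordering cannot differ."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def is_reproducible(step: Callable[[Dict[str, Any]], Dict[str, Any]],
                    raw_input: Dict[str, Any]) -> bool:
    """True if two independent runs over the same input produce identical output."""
    return content_hash(step(raw_input)) == content_hash(step(raw_input))

# Hypothetical deterministic annotation step, used only for the example.
def label_step(raw: Dict[str, Any]) -> Dict[str, Any]:
    return {"sample": raw["sample"], "label": "positive" if raw["signal"] > 0.5 else "negative"}

print(is_reproducible(label_step, {"sample": "s1", "signal": 0.73}))  # -> True
```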
Failure modes checked
- Schema drift: experiment types changing over time and breaking downstream consumers (see the guard sketched below)
- Annotation inconsistency: disagreement between annotators on edge cases
- Access control gaps: unexpected data exposure across teams
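A schema-drift guard can be as simple as comparing incoming records against a versioned field contract. The sketch below assumes a flat field-name/type contract per schema version; SCHEMAS and drift_report are illustrative names, and the platform's real validation was more involved.

```python
# Sketch of a schema-drift check against a versioned field contract.
from typing import Any, Dict, List

# Hypothetical contract for one schema version.
SCHEMAS: Dict[str, Dict[str, type]] = {
    "schema-1.3": {"sample": str, "signal": float, "instrument": str},
}

def drift_report(record: Dict[str, Any], schema_version: str) -> List[str]:
    """Return a list of drift findings; an empty list means the record conforms."""
    expected = SCHEMAS[schema_version]
    findings: List[str] = []
    for name, typ in expected.items():
        if name not in record:
            findings.append(f"missing field: {name}")
        elif not isinstance(record[name], typ):
            findings.append(f"type changed: {name} is {type(record[name]).__name__}, expected {typ.__name__}")
    for name in record:
        if name not in expected:
            findings.append(f"unexpected field: {name}")
    return findings

print(drift_report({"sample": "s1", "signal": "0.73", "batch": 4}, "schema-1.3"))
# -> ['type changed: signal is str, expected float', 'missing field: instrument', 'unexpected field: batch']
```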
Milestone: Operational, pipelines running
Cloud data platform operational. Annotation at scale enabled. Cross-study sharing with lineage tracking active. Details available under NDA.
Why it was hard
Constraints that shaped every decision.
Scale
experiment data volumes required careful storage and processing design; naive approaches fell over at production scale.
Data governance
multiple sensitivity levels across studies; access control couldn't be an afterthought.
User workflows
researchers had strong existing habits; the new system had to integrate rather than replace to achieve adoption.
Reproducibility requirement
scientific context demanded that every annotation decision be traceable and re-runnable, not just 'good enough'.
What comes next
If continuing: next hypotheses, next system increment, next risk gate.
1. Active learning integration: use model uncertainty to prioritize which data to annotate next, reducing annotation cost per quality point (see the sketch after this list).
2. Cross-study feature registry: build shared feature definitions so ML experiments can reuse validated feature pipelines rather than rebuilding them.
3. Compute cost monitoring: add visibility into which pipeline stages consume the most compute so researchers can trade off quality vs. speed explicitly.
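For the active-learning item above, a least-confidence sampler is one plausible starting point. The sketch below assumes the model exposes per-sample class probabilities; least_confidence and prioritise are illustrative names, not an existing implementation.

```python
# Minimal sketch of uncertainty-based annotation prioritisation (least-confidence sampling).
from typing import Dict, List, Tuple

def least_confidence(probs: List[float]) -> float:
    """Higher score = model is less sure = annotate sooner."""
    return 1.0 - max(probs)

def prioritise(predictions: Dict[str, List[float]], budget: int) -> List[str]:
    """Pick the `budget` unlabeled samples the model is least confident about."""
    scored: List[Tuple[float, str]] = sorted(
        ((least_confidence(p), sample_id) for sample_id, p in predictions.items()),
        reverse=True,
    )
    return [sample_id for _, sample_id in scored[:budget]]

# Hypothetical model outputs: class probabilities per unlabeled sample.
preds = {"s1": [0.98, 0.02], "s2": [0.55, 0.45], "s3": [0.70, 0.30]}
print(prioritise(preds, budget=2))  # -> ['s2', 's3']
```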
Built with EU traceability + oversight expectations in mind.