Marketplace optimization under constraints
A bespoke modelling and optimization system: decision policy, evaluation harness, and deployment pathway, built end-to-end for a production marketplace at scale. The constraints were real: a fixed budget, real-time pressure, and delayed feedback made naive approaches fail fast.
Named: Aimwel (a joint venture of DPG Media and Randstad)
The blocker
Symptom
Allocation decisions were made by rule of thumb. Performance improved when monitored closely, degraded when not.
Root cause
No principled policy that could handle non-stationarity, delayed reward, and exploration-exploitation trade-offs simultaneously.
Why it persisted
Off-the-shelf RL frameworks assumed stable environments and required more exploration than the business constraints allowed; custom evaluation was missing, so no one could tell when things broke.
What was built
System-level. What it actually is: inputs, outputs, users.
- Decision policy: bespoke optimization model designed for the specific constraint structure (budget, feedback delay, bid granularity).
- Evaluation harness: offline evaluation framework to estimate policy quality before deployment, critical when online experimentation is expensive (a sketch follows this list).
- Monitoring layer: live metrics to detect drift and policy degradation in production.
- Deployment pathway: model versioning, rollback mechanism, shadow-mode testing.
- Interfaces: inputs: marketplace signals, historical outcomes, budget constraints; outputs: per-item allocation recommendations; users: platform backend + ops team.
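How the evaluation harness scores a candidate policy offline is not spelled out above; one standard technique for this setting is importance-weighted estimation from logged decisions. A minimal sketch of a self-normalized inverse propensity scoring (SNIPS) estimator, assuming the logging policy's action propensities were recorded (all names and numbers are illustrative, not the production method):

```python
import numpy as np

def snips_estimate(rewards, logged_propensities, target_propensities):
    """Self-normalized inverse propensity scoring (SNIPS).

    Estimates the value of a candidate policy from decisions logged
    under a different (logging) policy, without deploying it."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_propensities) / np.asarray(logged_propensities)
    # Self-normalization trades a little bias for much lower variance
    # when the importance weights are skewed.
    return float(np.sum(weights * rewards) / np.sum(weights))

# Toy usage: the candidate policy upweights the decisions that paid off.
rewards  = [1.0, 0.0, 1.0, 0.0]
logged_p = [0.5, 0.5, 0.5, 0.5]   # propensities under the logging policy
target_p = [0.8, 0.2, 0.8, 0.2]   # propensities under the candidate policy
print(snips_estimate(rewards, logged_p, target_p))  # 0.8 vs. naive mean 0.5
```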
Architecture diagram
How we evaluated it
What "working" meant: baselines, metrics, guardrails, failure modes.
Definition of working
Policy outperforms baseline allocation on primary outcome metric (defined upfront) at fixed spend, not just in simulation.
Metrics tracked
- Primary: outcome metric per unit spend vs. baseline
- Secondary: decision coverage and exploration rate
- Guardrails: spend bounds, rollback triggers, anomaly thresholds (see the sketch after this list)
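The guardrails above are named but not specified; the sketch below shows the general shape such checks tend to take: a hard spend bound, an efficiency floor, and an anomaly threshold that flips a rollback flag. Field names and thresholds are hypothetical, not the production values:

```python
from dataclasses import dataclass

@dataclass
class Guardrails:
    """Illustrative guardrail config; real thresholds are business-specific."""
    max_spend: float              # hard upper bound on spend per window
    min_outcome_per_spend: float  # efficiency floor vs. baseline
    max_anomaly_score: float      # e.g., z-score of live metric vs. history

def should_rollback(spend, outcome, anomaly_score, g: Guardrails) -> bool:
    """Return True if any guardrail is breached and the policy
    should be rolled back to the baseline allocator."""
    if spend > g.max_spend:
        return True
    if spend > 0 and outcome / spend < g.min_outcome_per_spend:
        return True
    if anomaly_score > g.max_anomaly_score:
        return True
    return False

g = Guardrails(max_spend=10_000.0, min_outcome_per_spend=0.02, max_anomaly_score=4.0)
print(should_rollback(spend=9_500.0, outcome=250.0, anomaly_score=1.2, g=g))  # False
print(should_rollback(spend=9_500.0, outcome=100.0, anomaly_score=1.2, g=g))  # True
```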
Failure modes checked
- Non-stationarity: environment shifts that invalidate offline evaluation
- Feedback delay: decisions with rewards arriving >24h later (one handling pattern is sketched after this list)
- Exploration penalty: policy too cautious or too exploratory in edge segments
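Feedback delay in particular forces learning and evaluation to join decisions to outcomes that arrive much later, possibly out of order. A minimal sketch of one common pattern, a keyed decision buffer that emits completed (decision, reward) pairs once the delayed reward lands (names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class DecisionLog:
    """Buffers decisions until their (delayed) rewards arrive,
    then emits completed (context, reward) pairs for training."""
    pending: dict = field(default_factory=dict)   # decision_id -> context
    completed: list = field(default_factory=list)

    def record_decision(self, decision_id, context):
        self.pending[decision_id] = context

    def record_reward(self, decision_id, reward):
        # Rewards may arrive >24h later, possibly out of order.
        context = self.pending.pop(decision_id, None)
        if context is not None:
            self.completed.append((context, reward))

log = DecisionLog()
log.record_decision("d1", {"segment": "A", "bid": 0.4})
log.record_decision("d2", {"segment": "B", "bid": 0.7})
log.record_reward("d2", reward=1.0)   # arrives first despite being decided later
log.record_reward("d1", reward=0.0)
print(log.completed)
```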
Outcome
+32%
outcome at fixed budget
Sustained improvement in production over the baseline allocation strategy, within hard budget constraints, with monitoring in place and rollback ready.
Why it was hard
Constraints that shaped every decision.
Non-stationarity
the environment shifted regularly (seasonality, supply/demand fluctuations, competitor behavior), making historical data partially misleading; a drift-detection sketch follows this section.
Delayed feedback
rewards arrived hours or days after decisions, requiring offline evaluation that was itself uncertain.
Deployment safety
the business could not afford large-scale exploration; the policy had to be asymmetrically conservative in unfamiliar regions (one way to achieve this is sketched after this section).
Integration complexity
the system had to fit into an existing data platform and decision pipeline without rearchitecting the whole stack.
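The monitoring layer's drift detection is not described in detail; a common statistic for flagging the kind of non-stationarity described above is the population stability index (PSI) between a reference window and the live window. A minimal sketch, using the conventional but illustrative alert threshold of roughly 0.2:

```python
import numpy as np

def psi(reference, live, bins=10, eps=1e-6):
    """Population stability index between two samples of one signal.
    Larger values indicate a bigger distribution shift; a common rule
    of thumb treats PSI > 0.2 as actionable drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    live_frac = np.histogram(live, bins=edges)[0] / len(live) + eps
    return float(np.sum((ref_frac - live_frac) * np.log(ref_frac / live_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)      # e.g., last month's bid landscape
live_same = rng.normal(0.0, 1.0, 5000)
live_shifted = rng.normal(0.8, 1.3, 5000)   # supply/demand shift
print(psi(reference, live_same))     # small: no alert
print(psi(reference, live_shifted))  # large: would trigger review
```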
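Likewise, the mechanism behind being "asymmetrically conservative in unfamiliar regions" is not disclosed; one textbook way to get that behavior is to score segments by a lower confidence bound on their estimated value, so sparsely observed segments are discounted automatically. A sketch under that assumption (the pessimism constant is illustrative):

```python
import math

def lcb_score(mean_reward, n_observations, pessimism=1.0):
    """Lower-confidence-bound value of a segment: the less data we
    have, the more the estimate is discounted, so unfamiliar regions
    receive conservative allocations by construction."""
    if n_observations == 0:
        return float("-inf")  # never allocate blindly
    return mean_reward - pessimism * math.sqrt(1.0 / n_observations)

# A well-observed mediocre segment outranks a barely observed "great" one.
print(lcb_score(mean_reward=0.30, n_observations=10_000))  # ~0.29
print(lcb_score(mean_reward=0.60, n_observations=4))       # 0.10
```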
What comes next
If continuing: next hypotheses, next system increment, next risk gate.
1. Multi-objective optimization: incorporate secondary metrics (e.g., advertiser satisfaction) into the policy objective directly rather than as guardrails.
2. Adaptive evaluation: improve offline estimators to reduce the gap between simulated and live policy quality, especially under distributional shift.
3. Contextual bandit extension: move from per-segment policies toward a fully contextual approach once the evaluation harness can support it reliably (see the sketch after this list).
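For item 3, LinUCB is one textbook contextual bandit that such an extension could start from; the sketch below shows only the shape of the idea, not the production design (arms, features, and dimensions are made up):

```python
import numpy as np

class LinUCB:
    """Textbook LinUCB contextual bandit: per-arm ridge regression
    plus an upper-confidence exploration bonus."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T y

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy run: arm 1 pays off when the first context feature is high.
rng = np.random.default_rng(1)
policy = LinUCB(n_arms=2, dim=3)
for _ in range(500):
    x = rng.random(3)
    arm = policy.choose(x)
    reward = float(rng.random() < (x[0] if arm == 1 else 0.3))
    policy.update(arm, x, reward)
print(policy.choose(np.array([0.9, 0.1, 0.1])))  # learned preference: arm 1
```

As the roadmap itself notes, a policy like this would only go live once the evaluation harness can vet it reliably offline.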