1. Problem
1.1 Real-world optimization setting
Many practical optimization problems require simultaneous optimization of multiple conflicting objectives, for example:
- maximize accuracy and minimize inference time
- minimize error and minimize model size
- engineering design trade-offs (performance vs cost)
Formally:
$$ \min_{x \in \mathcal{X}} f(x) = (f_1(x), \dots, f_m(x)) $$
where the objectives are black-box, evaluations are expensive, and analytical gradients are unavailable.
1.2 Challenges with standard Bayesian Optimization (GP-based)
Traditional multiobjective BO typically relies on Gaussian Processes (GPs), but they struggle with:
- Complex search spaces
  - categorical variables, integer variables, conditional (tree-structured) parameters
- Scalability
  - exact GP inference scales cubically with the number of observations and becomes expensive in medium/high dimensions
- Parallel evaluation
  - batch GP-BO requires complex coordination, and synchronization overhead is high
1.3 Why TPE is attractive (but insufficient)
Tree-structured Parzen Estimator (TPE):
- models parameter distributions instead of the objective function
- naturally handles:
- conditional parameters
- categorical / integer variables
- scales well to many trials
- easy to parallelize
Limitation:
TPE is single-objective only and cannot directly handle Pareto optimization.
1.4 Goal of the paper
Design a multiobjective Bayesian optimization algorithm that:
- handles tree-structured, mixed-type search spaces
- works with limited evaluation budgets
- approximates the Pareto front efficiently
- supports asynchronous parallelization
2. Method (MOTPE)
2.1 Core idea
MOTPE extends TPE by:
- redefining “good” vs “bad” observations using Pareto dominance
- preserving TPE’s density-ratio optimization principle
- replacing Expected Improvement with Expected Hypervolume Improvement
2.2 Review: TPE (single objective)
TPE models: $p(x \mid y)$
using two densities:
- $l(x)$: density of good configurations
- $g(x)$: density of bad configurations
Split rule (single objective):
- good if $y < y^*$, where $y^*$ is the top-γ quantile of observed values
- bad otherwise
Acquisition (equivalent to EI):
$$ \arg\max_x \frac{l(x)}{g(x)} $$
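The single-objective TPE loop can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `parzen`, `tpe_suggest`, the fixed bandwidth, and the candidate count are all choices made here for brevity.

```python
import math
import random

def parzen(samples, bandwidth=0.1):
    """1-D Parzen estimator: a mixture of Gaussian kernels, one per sample."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) / norm
                   for s in samples) / len(samples)
    return density

def tpe_suggest(history, gamma=0.25, n_candidates=24, bandwidth=0.1, seed=0):
    """history: list of (x, y) pairs. Returns the candidate maximizing l(x)/g(x)."""
    rng = random.Random(seed)
    ranked = sorted(history, key=lambda xy: xy[1])    # best objective first
    n_good = max(1, math.ceil(gamma * len(ranked)))   # top-gamma split
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]] or good     # guard tiny histories
    l, g = parzen(good, bandwidth), parzen(bad, bandwidth)
    # Sample candidates from l(x), then rank them by the density ratio.
    cands = [rng.gauss(rng.choice(good), bandwidth) for _ in range(n_candidates)]
    return max(cands, key=lambda x: l(x) / max(g(x), 1e-300))
```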
2.3 Key extension to multiobjective setting
Observation classification
For multiobjective outcomes $y \in \mathbb{R}^m$:
- Good observations: nondominated points, or points incomparable with a reference Pareto set $Y^*$
- Bad observations: points dominated by $Y^*$
This replaces scalar thresholding with Pareto-based ranking.
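The Pareto-based split can be sketched as follows (minimization assumed; `dominates` and `split_by_dominance` are illustrative names, not from the paper):

```python
def dominates(a, b):
    """True iff objective vector a Pareto-dominates b (minimization):
    no worse in every objective and strictly better in at least one."""
    return (all(ai <= bi for ai, bi in zip(a, b))
            and any(ai < bi for ai, bi in zip(a, b)))

def split_by_dominance(outcomes, reference):
    """'Bad' iff dominated by some point of the reference set; otherwise
    (nondominated or incomparable) the outcome counts as 'good'."""
    good, bad = [], []
    for y in outcomes:
        (bad if any(dominates(r, y) for r in reference) else good).append(y)
    return good, bad
```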
2.4 Defining the “good” set
Procedure:
- Perform nondominated sorting
- Add whole Pareto fronts greedily until the γ proportion of observations is reached
- If a front overflows:
- select points maximizing hypervolume contribution
This ensures quality (near Pareto front) and diversity (spread along front).
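A minimal sketch of this procedure, assuming minimization. The hypervolume-contribution trim on an overflowing front is simplified to plain truncation here to keep the sketch short; all names are illustrative.

```python
import math

def dominates(a, b):
    """Pareto dominance for minimization."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def nondominated_sort(points):
    """Partition points into successive Pareto fronts (front 0 = nondominated)."""
    remaining, fronts = list(points), []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q != p)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts

def select_good(points, gamma):
    """Add whole fronts greedily until ceil(gamma * n) points are collected.
    The paper trims an overflowing front by hypervolume contribution;
    plain truncation stands in for that step here."""
    target = math.ceil(gamma * len(points))
    chosen = []
    for front in nondominated_sort(points):
        take = min(len(front), target - len(chosen))
        chosen.extend(front[:take])
        if len(chosen) == target:
            break
    return chosen
```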
2.5 Density modeling
For each parameter $x_i$, construct:
- $l(x_i)$ from good observations
- $g(x_i)$ from bad observations
Density estimation:
- Parzen estimators
- Gaussian kernels (real/integer)
- histograms (categorical)
Important difference from TPE:
- good samples are weighted by hypervolume contribution
- bad samples use uniform weight
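The two ingredients can be sketched together, assuming two objectives for the hypervolume part (`weighted_parzen` and `hv_contributions_2d` are illustrative helpers, not the paper's code):

```python
import math

def weighted_parzen(samples, weights, bandwidth=0.1):
    """1-D Parzen estimator with per-sample weights (e.g. hypervolume
    contributions for good samples; uniform weights for bad ones)."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    total = sum(weights)
    def density(x):
        return sum(w * math.exp(-0.5 * ((x - s) / bandwidth) ** 2) / norm
                   for s, w in zip(samples, weights)) / total
    return density

def hv_contributions_2d(front, ref):
    """Exclusive hypervolume contribution of each point on a 2-objective
    Pareto front (minimization), w.r.t. a reference point `ref` that is
    worse than every front point in both objectives."""
    pts = sorted(front)   # ascending f1 implies descending f2 on a front
    contrib = []
    for i, (f1, f2) in enumerate(pts):
        right = pts[i + 1][0] if i + 1 < len(pts) else ref[0]
        upper = pts[i - 1][1] if i > 0 else ref[1]
        contrib.append((right - f1) * (upper - f2))
    return pts, contrib
```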
2.6 Acquisition function (key result)
MOTPE uses Expected Hypervolume Improvement (EHVI).
The paper shows:
$$ \text{EHVI}(x_i) \propto \left(\gamma + (1-\gamma)\frac{g(x_i)}{l(x_i)}\right)^{-1} $$
Therefore:
$$ \arg\max_{x_i} \text{EHVI}(x_i) \Longleftrightarrow \arg\max_{x_i} \frac{l(x_i)}{g(x_i)} $$
✅ Same optimization rule as TPE
✅ No explicit modeling of $p(y \mid x)$ needed
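The equivalence follows from a monotonicity argument: writing $r = l(x_i)/g(x_i)$,
$$ \left(\gamma + (1-\gamma)\frac{g(x_i)}{l(x_i)}\right)^{-1} = \frac{r}{\gamma r + (1-\gamma)}, $$
which is strictly increasing in $r$ for $\gamma \in (0, 1)$, so maximizing EHVI selects the same $x_i$ as maximizing the density ratio.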
2.7 Sampling procedure
- sample each parameter independently
- traverse the tree-structured space
- sample candidates from $l(x_i)$
- select those with highest $l/g$ ratio
- evaluate full configuration
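The traversal can be sketched on a toy conditional space. Everything here is hypothetical: the space, `sample_config`, and the `l_over_g` callback standing in for the fitted density ratio; candidates are drawn uniformly for brevity rather than from $l(x_i)$.

```python
import random

# Hypothetical tree-structured space: the choice of optimizer gates which
# child parameters exist at all.
SPACE = {
    "optimizer": ["sgd", "adam"],
    "sgd": {"momentum": (0.0, 1.0)},
    "adam": {"beta2": (0.9, 0.9999)},
}

def sample_config(l_over_g, rng, n_candidates=8):
    """Traverse the tree top-down; for each active parameter, draw candidates
    and keep the one scoring highest under the density ratio.
    l_over_g(name, value) is a stand-in for the real fitted l/g."""
    config = {"optimizer": rng.choice(SPACE["optimizer"])}
    for name, (lo, hi) in SPACE[config["optimizer"]].items():
        cands = [rng.uniform(lo, hi) for _ in range(n_candidates)]
        config[name] = max(cands, key=lambda v: l_over_g(name, v))
    return config
```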
2.8 Parallelization
MOTPE supports asynchronous parallel execution:
- no batch synchronization
- unfinished evaluations ignored
- new candidates sampled stochastically
Result:
- large wall-clock speedups
- small loss in sample efficiency
- very practical for real systems
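The asynchronous pattern itself is independent of the sampler and can be sketched with Python's standard library; a toy objective and a generic `suggest` callback stand in for MOTPE here.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def evaluate(x):
    """Stand-in for an expensive black-box objective."""
    return (x - 0.3) ** 2

def async_optimize(suggest, n_trials=12, n_workers=4):
    """Whenever any evaluation finishes, record it and immediately launch a
    new trial -- no batch barrier; running trials are simply absent from the
    history passed to `suggest`."""
    history, pending, submitted = [], set(), 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while submitted < min(n_workers, n_trials):
            x = suggest(history)
            pending.add(pool.submit(lambda v=x: (v, evaluate(v))))
            submitted += 1
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                history.append(fut.result())
                if submitted < n_trials:
                    x = suggest(history)
                    pending.add(pool.submit(lambda v=x: (v, evaluate(v))))
                    submitted += 1
    return history
```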
3. One-line takeaway
MOTPE extends TPE to multiobjective optimization by replacing scalar “good/bad” splits with Pareto-based splits and using hypervolume-aware density modeling—while keeping the same simple $l(x)/g(x)$ decision rule.