1. Problem
1.1 Real-world optimization setting
Many practical optimization problems require simultaneous optimization of multiple conflicting objectives, for example:
- maximize accuracy and minimize inference time
- minimize error and minimize model size
- engineering design trade-offs (performance vs cost)
Formally:
$$ \min_{x \in \mathcal{X}} f(x) = (f_1(x), \dots, f_m(x)) $$
where the objectives are black-box, evaluations are expensive, and analytical gradients are unavailable.
1.2 Challenges with standard Bayesian Optimization (GP-based)
Traditional multiobjective BO typically relies on Gaussian Processes (GPs), but they struggle with:
- Complex search spaces
  - categorical variables, integer variables, conditional (tree-structured) parameters
- Scalability
  - exact GP inference scales cubically with the number of observations and becomes expensive in medium/high dimensions
- Parallel evaluation
  - batch GP-BO requires complex coordination, and synchronization overhead is high
1.3 Why TPE is attractive (but insufficient)
Tree-structured Parzen Estimator (TPE):
- models parameter distributions instead of the objective function
- naturally handles:
- conditional parameters
- categorical / integer variables
- scales well to many trials
- easy to parallelize
Limitation:
TPE is single-objective only and cannot directly handle Pareto optimization.
1.4 Goal of the paper
Design a multiobjective Bayesian optimization algorithm that:
- handles tree-structured, mixed-type search spaces
- works with limited evaluation budgets
- approximates the Pareto front efficiently
- supports asynchronous parallelization
2. Method (MOTPE)
2.1 Core idea
MOTPE extends TPE by:
- redefining “good” vs “bad” observations using Pareto dominance
- preserving TPE’s density-ratio optimization principle
- replacing Expected Improvement with Expected Hypervolume Improvement
2.2 Review: TPE (single objective)
TPE models: $p(x \mid y)$
using two densities:
- $l(x)$: density of good configurations
- $g(x)$: density of bad configurations
Split rule (single objective):
- good if $y < y^*$, where $y^*$ is the top-γ quantile of observed values
- bad otherwise
Acquisition (equivalent to EI):
$$ \arg\max_x \frac{l(x)}{g(x)} $$
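The single-objective TPE loop can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `parzen`, `tpe_suggest`, the fixed bandwidth, and the candidate count are all choices made here for brevity.

```python
import math
import random

def parzen(samples, bandwidth=0.1):
    """1-D Parzen estimator: a mixture of Gaussian kernels, one per sample."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) / norm
                   for s in samples) / len(samples)
    return density

def tpe_suggest(history, gamma=0.25, n_candidates=24, bandwidth=0.1, seed=0):
    """history: list of (x, y) pairs. Returns the candidate maximizing l(x)/g(x)."""
    rng = random.Random(seed)
    ranked = sorted(history, key=lambda xy: xy[1])    # best objective first
    n_good = max(1, math.ceil(gamma * len(ranked)))   # top-gamma split
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]] or good     # guard tiny histories
    l, g = parzen(good, bandwidth), parzen(bad, bandwidth)
    # Sample candidates from l(x), then rank them by the density ratio.
    cands = [rng.gauss(rng.choice(good), bandwidth) for _ in range(n_candidates)]
    return max(cands, key=lambda x: l(x) / max(g(x), 1e-300))
```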
2.3 Key extension to multiobjective setting
Observation classification
For multiobjective outcomes $y \in \mathbb{R}^m$:
- Good observations: nondominated points, or points incomparable with a reference Pareto set $Y^*$
- Bad observations: points dominated by $Y^*$
This replaces scalar thresholding with Pareto-based ranking.
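The Pareto-based split can be sketched as follows (minimization assumed; `dominates` and `split_by_dominance` are illustrative names, not from the paper):

```python
def dominates(a, b):
    """True iff objective vector a Pareto-dominates b (minimization):
    no worse in every objective and strictly better in at least one."""
    return (all(ai <= bi for ai, bi in zip(a, b))
            and any(ai < bi for ai, bi in zip(a, b)))

def split_by_dominance(outcomes, reference):
    """'Bad' iff dominated by some point of the reference set; otherwise
    (nondominated or incomparable) the outcome counts as 'good'."""
    good, bad = [], []
    for y in outcomes:
        (bad if any(dominates(r, y) for r in reference) else good).append(y)
    return good, bad
```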
2.4 Defining the “good” set
Procedure:
- Perform nondominated sorting
- Add whole Pareto fronts greedily until the γ proportion of observations is reached
- If a front overflows:
- select points maximizing hypervolume contribution
This ensures quality (near Pareto front) and diversity (spread along front).
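A minimal sketch of this procedure, assuming minimization. The hypervolume-contribution trim on an overflowing front is simplified to plain truncation here to keep the sketch short; all names are illustrative.

```python
import math

def dominates(a, b):
    """Pareto dominance for minimization."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def nondominated_sort(points):
    """Partition points into successive Pareto fronts (front 0 = nondominated)."""
    remaining, fronts = list(points), []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q != p)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts

def select_good(points, gamma):
    """Add whole fronts greedily until ceil(gamma * n) points are collected.
    The paper trims an overflowing front by hypervolume contribution;
    plain truncation stands in for that step here."""
    target = math.ceil(gamma * len(points))
    chosen = []
    for front in nondominated_sort(points):
        take = min(len(front), target - len(chosen))
        chosen.extend(front[:take])
        if len(chosen) == target:
            break
    return chosen
```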
2.5 Density modeling
For each parameter $x_i$, construct:
- $l(x_i)$ from good observations
- $g(x_i)$ from bad observations
Density estimation:
- Parzen estimators
- Gaussian kernels (real/integer)
- histograms (categorical)
Important difference from TPE:
- good samples are weighted by hypervolume contribution
- bad samples use uniform weight
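The two ingredients can be sketched together, assuming two objectives for the hypervolume part (`weighted_parzen` and `hv_contributions_2d` are illustrative helpers, not the paper's code):

```python
import math

def weighted_parzen(samples, weights, bandwidth=0.1):
    """1-D Parzen estimator with per-sample weights (e.g. hypervolume
    contributions for good samples; uniform weights for bad ones)."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    total = sum(weights)
    def density(x):
        return sum(w * math.exp(-0.5 * ((x - s) / bandwidth) ** 2) / norm
                   for s, w in zip(samples, weights)) / total
    return density

def hv_contributions_2d(front, ref):
    """Exclusive hypervolume contribution of each point on a 2-objective
    Pareto front (minimization), w.r.t. a reference point `ref` that is
    worse than every front point in both objectives."""
    pts = sorted(front)   # ascending f1 implies descending f2 on a front
    contrib = []
    for i, (f1, f2) in enumerate(pts):
        right = pts[i + 1][0] if i + 1 < len(pts) else ref[0]
        upper = pts[i - 1][1] if i > 0 else ref[1]
        contrib.append((right - f1) * (upper - f2))
    return pts, contrib
```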
2.6 Acquisition function (key result)
MOTPE uses Expected Hypervolume Improvement (EHVI).
The paper shows:
$$ \text{EHVI}(x_i) \propto \left(\gamma + (1-\gamma)\frac{g(x_i)}{l(x_i)}\right)^{-1} $$
Therefore:
$$ \arg\max_{x_i} \text{EHVI}(x_i) \Longleftrightarrow \arg\max_{x_i} \frac{l(x_i)}{g(x_i)} $$
✅ Same optimization rule as TPE
✅ No explicit modeling of $p(y \mid x)$ needed
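The equivalence follows from a monotonicity argument: writing $r = l(x_i)/g(x_i)$,
$$ \left(\gamma + (1-\gamma)\frac{g(x_i)}{l(x_i)}\right)^{-1} = \frac{r}{\gamma r + (1-\gamma)}, $$
which is strictly increasing in $r$ for $\gamma \in (0, 1)$, so maximizing EHVI selects the same $x_i$ as maximizing the density ratio.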
2.7 Sampling procedure
- sample each parameter independently
- traverse the tree-structured space
- sample candidates from $l(x_i)$
- select those with highest $l/g$ ratio
- evaluate full configuration
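The traversal can be sketched on a toy conditional space. Everything here is hypothetical: the space, `sample_config`, and the `l_over_g` callback standing in for the fitted density ratio; candidates are drawn uniformly for brevity rather than from $l(x_i)$.

```python
import random

# Hypothetical tree-structured space: the choice of optimizer gates which
# child parameters exist at all.
SPACE = {
    "optimizer": ["sgd", "adam"],
    "sgd": {"momentum": (0.0, 1.0)},
    "adam": {"beta2": (0.9, 0.9999)},
}

def sample_config(l_over_g, rng, n_candidates=8):
    """Traverse the tree top-down; for each active parameter, draw candidates
    and keep the one scoring highest under the density ratio.
    l_over_g(name, value) is a stand-in for the real fitted l/g."""
    config = {"optimizer": rng.choice(SPACE["optimizer"])}
    for name, (lo, hi) in SPACE[config["optimizer"]].items():
        cands = [rng.uniform(lo, hi) for _ in range(n_candidates)]
        config[name] = max(cands, key=lambda v: l_over_g(name, v))
    return config
```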
2.8 Parallelization
MOTPE supports asynchronous parallel execution:
- no batch synchronization
- unfinished evaluations ignored
- new candidates sampled stochastically
Result:
- large wall-clock speedups
- small loss in sample efficiency
- very practical for real systems
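The asynchronous pattern itself is independent of the sampler and can be sketched with Python's standard library; a toy objective and a generic `suggest` callback stand in for MOTPE here.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def evaluate(x):
    """Stand-in for an expensive black-box objective."""
    return (x - 0.3) ** 2

def async_optimize(suggest, n_trials=12, n_workers=4):
    """Whenever any evaluation finishes, record it and immediately launch a
    new trial -- no batch barrier; running trials are simply absent from the
    history passed to `suggest`."""
    history, pending, submitted = [], set(), 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while submitted < min(n_workers, n_trials):
            x = suggest(history)
            pending.add(pool.submit(lambda v=x: (v, evaluate(v))))
            submitted += 1
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                history.append(fut.result())
                if submitted < n_trials:
                    x = suggest(history)
                    pending.add(pool.submit(lambda v=x: (v, evaluate(v))))
                    submitted += 1
    return history
```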
3. One-line takeaway
MOTPE extends TPE to multiobjective optimization by replacing scalar “good/bad” splits with Pareto-based splits and using hypervolume-aware density modeling—while keeping the same simple $l(x)/g(x)$ decision rule.