Introduction
In "generalization" the size of the parameter space should be smaller than the data space. This is exactly what compression means, and it's at the cornerstone of modern machine learning. But this premise inherently presupposes the manifold hypothesis—meaning the high dimensional data lies on a lower dimension and hence can be described by much smaller parameters. If the model is not expressive enough, reducing its uncertainty with respect to the data can be done in two different ways.
In the first case, the model takes a more inclusive approach via "mean behavior": it averages over the data instances it observes. For example, if the model cannot capture a multimodal data distribution, it settles on the middle point, the average. The obvious problem is that if those data samples are not very similar, the model ends up "confabulating" a middle state that does not even exist!
On the other hand, the model can take a more exclusive approach via "mode behavior": it models only the most frequent data points and ignores everything else. The problem here is that it simply misses the other modes of the data, just because they do not occur as often!
This epistemic dichotomy in generalization arises when the model wants to stay simple. The issue stems from the two ways of decreasing the "distance" between the model $q_{\theta}$ and the data distribution $p$.
Forward KL Divergence: Mean-Seeking Behavior
In the first case, we are looking to minimize the forward KL divergence:
$$D_{\mathrm{KL}}(p \| q_\theta)=\mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q_\theta(x)}\right]=\sum_{x} p(x) \log\frac{p(x)}{q_\theta(x)}$$
Note that in this setting, whenever $p(x)=0$, the corresponding values of $q_\theta(x)$ are effectively ignored, because the weights in the objective come entirely from $p(x)$. In other words, if $p(x)$ is extremely small, it does not matter how large $q_\theta(x)$ becomes—the model receives essentially no gradient signal there. During training, regions where $p(x)$ is small contribute almost nothing, so $q_\theta(x)$ is neither penalized nor updated in those areas.
This explains why the model may converge to a mean value across multiple modes, even when no actual data points lie near that region. This is the hallmark of mean-seeking behavior.
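To make this concrete, here is a short worked derivation for the special case where the approximating family is a single Gaussian, $q_\theta = \mathcal{N}(\mu, \sigma^2)$ (the same restricted family used in the visualization below). Writing out the objective, only the cross-entropy term depends on the parameters:
$$D_{\mathrm{KL}}(p \| q_\theta) = \mathbb{E}_{x \sim p}[\log p(x)] + \frac{1}{2}\log(2\pi\sigma^2) + \frac{\mathbb{E}_{x \sim p}\left[(x-\mu)^2\right]}{2\sigma^2}$$
Setting the derivatives with respect to $\mu$ and $\sigma^2$ to zero gives moment matching:
$$\mu^* = \mathbb{E}_{x \sim p}[x], \qquad (\sigma^*)^2 = \operatorname{Var}_{x \sim p}(x)$$
So for a symmetric bimodal $p$ with modes at $\pm a$, the optimal single Gaussian is centered at $\mu^* = 0$: exactly the midpoint, where $p$ may have almost no mass.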
Reverse KL Divergence: Mode-Seeking Behavior
In the second case, we are looking to minimize the reverse KL divergence:
$$D_{\mathrm{KL}}(q_\theta \| p)=\mathbb{E}_{x \sim q_\theta}\left[\log \frac{q_\theta(x)}{p(x)}\right]=\sum_{x} q_\theta(x) \log\frac{q_\theta(x)}{p(x)}$$

Now we have the opposite problem: if the model puts zero probability $q_\theta(x) = 0$ where there is data, $p(x) \neq 0$, it is not punished by the training objective. In that case it is better for the model to fit only the portion of the data that appears most often and ignore the rest. This is the mode-seeking behavior.
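The asymmetry between the two objectives is visible directly in the limiting behavior of the two integrands (the standard "zero-forcing" versus "zero-avoiding" contrast). For a point $x$ with $p(x) > 0$:
$$\lim_{q_\theta(x) \to 0} q_\theta(x)\log\frac{q_\theta(x)}{p(x)} = 0, \qquad \lim_{q_\theta(x) \to 0} p(x)\log\frac{p(x)}{q_\theta(x)} = +\infty$$
Reverse KL therefore pays nothing for dropping a mode of $p$, while forward KL pays an unbounded price for the same omission, which is precisely why the former collapses onto a single mode and the latter spreads itself across all of them.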
Interactive Visualization
The interactive visualization below demonstrates the fundamental difference between forward and reverse KL divergence optimization. You can adjust the mode separation and animation speed to see how the approximating distributions behave differently.
[Interactive visualization: KL Divergence Optimization]
In the visualization above, the target (data) distribution $p(x)$ (the cyan line) is a Gaussian mixture model with two components. The approximating distribution $q_\theta(x)$ is constrained to be a single Gaussian $\mathcal{N}(\mu, \sigma^2)$.
Forward KL divergence $D_{\mathrm{KL}}(p \| q_\theta)$ minimization yields mean-seeking behavior (red dashed line): because every region where $p$ has mass but $q_\theta$ does not incurs a penalty, the single Gaussian spreads out and ends up centered between the two modes. Reverse KL divergence $D_{\mathrm{KL}}(q_\theta \| p)$ minimization exhibits mode-seeking behavior (blue dotted line), concentrating $q_\theta$ on a single mode to avoid placing mass where $p$ has low probability. This asymmetry has profound implications for variational inference and generative modeling.
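For readers who prefer to poke at this numerically, the sketch below reproduces the qualitative behavior of the visualization under some illustrative assumptions (an equal-weight mixture with modes at $\pm 3$ and unit component standard deviation, a discretized support, and a coarse grid search; none of these are the exact parameters of the widget):

```python
import numpy as np

# Discretized support; the range and resolution are illustrative choices.
x = np.linspace(-12.0, 12.0, 4001)
dx = x[1] - x[0]
eps = 1e-300  # guards log(0) where a density underflows

def gaussian(xs, mu, sigma):
    return np.exp(-0.5 * ((xs - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Target p(x): equal-weight mixture with modes at -3 and +3 (assumed for illustration).
p = 0.5 * gaussian(x, -3.0, 1.0) + 0.5 * gaussian(x, 3.0, 1.0)
p /= p.sum() * dx  # renormalize on the grid

def forward_kl(mu, sigma):
    """D_KL(p || q): the weights in the sum come from p."""
    q = gaussian(x, mu, sigma) + eps
    return np.sum(p * (np.log(p + eps) - np.log(q))) * dx

def reverse_kl(mu, sigma):
    """D_KL(q || p): the weights in the sum come from q."""
    q = gaussian(x, mu, sigma) + eps
    return np.sum(q * (np.log(q) - np.log(p + eps))) * dx

# Coarse grid search over the parameters of the single Gaussian q.
mus = np.linspace(-6.0, 6.0, 121)
sigmas = np.linspace(0.5, 4.0, 36)
fwd = min((forward_kl(m, s), m, s) for m in mus for s in sigmas)
rev = min((reverse_kl(m, s), m, s) for m in mus for s in sigmas)

print(f"forward KL minimizer: mu = {fwd[1]:.2f}, sigma = {fwd[2]:.2f}")  # mu near 0 (between the modes)
print(f"reverse KL minimizer: mu = {rev[1]:.2f}, sigma = {rev[2]:.2f}")  # mu near -3 or +3 (one mode)
```

On this setup the forward KL minimizer should land near $\mu = 0$ with an inflated variance (moment matching), while the reverse KL minimizer should sit on one of the two modes with a much narrower variance.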
The Philosophical Implications
This highlights a fundamental and uneasy truth about modeling: in order to represent the world, we must make a leap beyond what we have actually observed. Without some form of structural assumption (such as the manifold assumption in machine learning), we are forced to assign non-zero probability to situations that have never occurred in our data. This reflects a philosophical problem that goes back to Hume's problem of induction, namely that there is no purely logical way to justify predictions about unobserved cases based solely on past experience. Any such prediction requires an imaginative or assumptive step beyond the evidence.
If, on the other hand, we avoid this leap and refuse to assign probability anywhere we have not seen data, we end up explaining everything solely by reference to the most frequent or typical observations, basically ignoring rarer data points, in an attempt to compress all knowledge into a single pattern.
There is, of course, a third possibility: we could simply memorize every data point without attempting any form of generalization or compression. This "solution" ignores generalization entirely and with it, the ability to predict or understand new scenarios.
Conclusion
The dichotomy between mean-seeking and mode-seeking behaviors reveals a deep epistemic tension in machine learning and statistical inference. This is not merely a technical detail about optimization objectives—it reflects fundamental questions about how we can and should generalize from finite observations to make predictions about an uncertain world.
The choice between forward and reverse KL divergence is ultimately a choice about what kinds of errors we are willing to tolerate: Do we prefer a model that covers all possibilities but may hallucinate impossible states (mean-seeking)? Or do we prefer a model that stays conservative, focusing only on high-probability regions while potentially missing important rare events (mode-seeking)?
Neither approach is universally superior. The appropriate choice depends on the application, the cost of different types of errors, and our prior beliefs about the structure of the world. What remains constant is the inescapable truth that compression requires assumption, and all generalization involves a leap of faith beyond the data we have observed.