Wherein I discuss at length what puts the ‘mean’ in mean field theory, and why we minimise free energy to calculate it. Some more general comments on free energy are given, because I can’t help myself. The proofs (about the effective field in the middle, and about variational inference at the bottom) appear new, so that’s nice.

Mean Field Theory in Statistical Physics

The gift that continues to give is evidently this set of notes I wrote recently on mean field theory, where I discuss a little about the formal, variational basis for mean field approximation. In that paper, I hint at a nice physical motivation for mean field theory—if we simplify a system to something that has certain effective dynamics, then interaction terms tend to go away, which has the physical interpretation of suppressing fluctuations around a mean. It was pointed out to me that I never prove this in the paper, only suggesting that there is a relationship between simple non-interacting pictures and actual convergence to mean field dynamics. That is, I never explicitly show what puts the ‘mean’ in mean field theory. As the addendum to the paper indicated, I’ll talk a little about this here, in this blog post.

As I said in a previous blog post, one way of modelling the dynamics of the Ising model is by mean field theory (MFT), which looks at the ‘average’ dynamics of a system. The way we do MFT is by approximating the system with a simpler system that is related in some way to the original (specifically, we use a trial Hamiltonian whose free energy bounds the true free energy from above, via the Bogoliubov inequality). The typical way in which this is done, and in some sense the most meaningful, is by replacing the system with a ‘mean field’ and ignoring fluctuations in the dynamics of the smaller items constituting the system. MFT is thus a theory of mean fields, one which ignores the fluctuations around a mean (consider the transport relation \(\delta A(t) = A(t) - \langle A \rangle\) for the dynamics of an observable \(A\) in the presence of fluctuations; when we ignore fluctuations, the term \(\delta A(t)\) vanishes and we have \(A(t) = \langle A \rangle\)).

There are ostensibly four steps to finding a mean field model formally. First, we pick a simpler trial Hamiltonian \(\hat{H}_0\). The second is to find the trial free energy, and then note that the true free energy is bounded above by the trial free energy plus a correction term, in a control parameter \(\lambda\) which is implicitly dependent on \(\hat{H}_0\). This is the Bogoliubov inequality. The third step is to make sure the approximation is actually any good: if the corrective term is large, then we should try something a little more complex for \(\hat{H}_0\). The fourth step is to minimise the variational free energy as it depends on \(\lambda\). The rationale behind this step is that, since \(F_V(\lambda)\) is everywhere at least as large as \(F\), its minimum is the place where it is closest to \(F\). Since we minimise in the argument \(\lambda\), this is the value of \(\lambda\) at which the correction term is smallest. If \(\hat{H}_0\) is initially a good approximation of \(\hat{H}\), then this minimum distance will itself be quite small. Simultaneously, minimising \(F_V\) ensures the statistics of \(F\) and \(F_V\) match, such that whatever dynamics we calculate from \(F_V\) provide a semi-reliable approximation of the true system, depending on how close \(\min_\lambda F_V(\lambda)\) actually gets to \(F\).
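To make the four steps concrete, here is a minimal numerical sketch for the Ising model discussed below. Everything in it is a hypothetical choice for illustration: a coupling \(J\), a coordination number \(z\), zero external field, the sign convention \(\hat{H}_0 = -\lambda \sum_i s_i\) for the trial Hamiltonian, and a crude grid search standing in for proper minimisation.

```python
import math

# Hypothetical parameters: coupling J, coordination number z, inverse temperature beta.
J, z, beta = 1.0, 4.0, 1.0

def m0(lam):
    """Mean spin under the trial (non-interacting) ensemble: <s>_0 = tanh(beta * lam)."""
    return math.tanh(beta * lam)

def F_V(lam):
    """Variational free energy per site, F_0 + <H - H_0>_0, for trial field lam.

    F_0 = -(1/beta) ln(2 cosh(beta*lam)), <H>_0 = -(J z / 2) m^2, <H_0>_0 = -lam * m.
    """
    m = m0(lam)
    F0 = -math.log(2.0 * math.cosh(beta * lam)) / beta
    return F0 - 0.5 * J * z * m * m + lam * m

# Step four: minimise F_V over the control parameter lam (crude grid search;
# we only scan lam >= 0 since F_V is even in lam when there is no external field).
grid = [i * 1e-3 for i in range(5001)]
lam_star = min(grid, key=F_V)
m_star = m0(lam_star)

# At the minimum, the effective field satisfies the self-consistency lam = J*z*m,
# i.e. the magnetisation solves m = tanh(beta*J*z*m).
print(lam_star, m_star, J * z * m_star)
```

Up to grid resolution, the minimiser lands on the self-consistent effective field \(\lambda = Jzm\): step four is doing exactly what the prose says, choosing the \(\lambda\) at which the correction term is smallest.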

At this point, we should note something potentially hazardous: minimising \(F_V\) does not minimise \(F\). In fact, \(F\) is concave in \(\lambda\) (interpolating \(\hat{H}(\lambda) = \hat{H}_0 + \lambda \Delta \hat{H}\)), so the graph of \(F\) lies below its tangent line at \(\lambda = 0\), giving \(F \leq F_V = F_0 + \lambda \langle \Delta \hat{H} \rangle_0\). That is, these two functions talk past each other in a sense; minimising \(F_V\) in \(\lambda\) simply shrinks a thing that is at least as big as \(F\). It’s a slightly different operating principle, but an instructive mental picture is something like \(x \leq x^2+1\), whose right-hand side’s minimum has nothing to do with the minimum of the LHS. Because the Bogoliubov inequality is written the way it is, minimising \(F_V\) narrows the distance of \(F_V\) from \(F\).

To summarise the above two paragraphs, minimising \(F_V\) gets us close to \(F\) by defining the mean field as something useful—what should the effective field \(\lambda\) look like for \(F_V\) to look as much like \(F\) as possible, such that their statistics generally match?

In the Ising model, the canonical choice of \(\hat{H}_0\) is \(m\sum_i s_i\). This seems in conflict with our earlier approach, since it replaces interactions between spins but has nothing to say of means per se. And yet, I go on to say that suppressing fluctuations is nothing but a shortcut, which achieves a MFT by the same means as the formal procedure—neglecting those terms produces a simpler trial Hamiltonian, for which the free energy is greater than the real one, and which we minimise with respect to \(m\). Likewise, [Sak22] is thin on physical motivation for MFT, but (I’d like to think) has plenty of mathematical motivation for MFT and physical motivation for magnetisation. Upon reflection, this is a weakness of the paper. Can’t always be perfect, but here I’ll try to bring them together.

In any case—the magnetisation function is reduced to the minimiser of the variational free energy functional. We choose a simpler Hamiltonian in a control parameter, and choose the value of that parameter which minimises the free energy of the trial Hamiltonian. The trial \(\hat{H}\) assumes we can model the Ising model as a set of independent spins experiencing an effective field, rather than individual site-wise interactions. This is reasonable a priori: if one is surrounded by a mosh of spins, and those spins are themselves surrounded by a mosh of spins, eventually everyone converges to a sort of collective mosh pit, which can be understood as any individual being situated in an effective push and pull. What that effective field must actually look like in order to approximate the true \(\hat{H}\)—especially such that their free energy functionals in terms of spin configurations have similar values—is the element of variational minimisation that happens in MFT. For those details, I’m happy to suggest my paper. To understand why that simpler effective field has anything to do with convergence to the mean, let’s look at the Bogoliubov inequality.

In the specific example of the Ising model, the correction term in the Bogoliubov inequality (see equation two in the mentioned paper) vanishes when

\[\sum_{ij} s_i s_j = m \sum_i s_i.\]

For this proof sketch, let’s assume the non-interacting Hamiltonian is a good approximation of the system. It follows that either \(s_i\) or \(s_j\) is equal to \(m\). Without loss of generality, take \(s_j = m\). Independently of that, expand

\[\sum_{ij} s_i s_j = \sum_{ij} (\langle s_i \rangle + \delta s_i)(\langle s_j \rangle + \delta s_j),\]

using the transport relation above. Assume fluctuations vanish for \(s_j\). Then, we have

\[\sum_{ij} s_i s_j = \sum_{ij} s_i \langle s_j \rangle.\]

Clearly, if we ask that the effective field models the original Hamiltonian, then fluctuations going to zero and \(m = \langle s_j \rangle\) are together sufficient for \(F_V = F\). In other words, if all the spins around a given \(s_i\) centre around a mean of \(m\), then the trial Hamiltonian with decoupled spins is a good model. Hence, MFT can be read as producing a statistical field theory—a field theory with probabilistic degrees of freedom—where every such variable regresses to some mean. As to what that mean is, the function for \(m\) is the magnetisation function, which can be found by minimising \(F_V\). Let’s also take a moment to note that, as an order parameter, it makes some sense that \(m\) should be a shared mean over all spins. For more on that, I can again point the reader to later sections of the paper mentioned above.
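As a quick illustration of the magnetisation function just mentioned: for a hypothetical nearest-neighbour coupling \(J\) and coordination number \(z\) (both assumptions of this sketch, not of the argument above), minimising \(F_V\) yields the self-consistency condition \(m = \tanh(\beta J z m)\), which we can solve by fixed-point iteration:

```python
import math

def magnetisation(beta, J=1.0, z=4.0, tol=1e-12, max_iter=10_000):
    """Solve the mean-field self-consistency m = tanh(beta * J * z * m)
    by fixed-point iteration, seeded on the non-negative branch."""
    m = 0.5
    for _ in range(max_iter):
        m_new = math.tanh(beta * J * z * m)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

# Ordered phase (beta*J*z = 4 > 1): spins regress to a nonzero shared mean.
print(magnetisation(beta=1.0))
# Disordered phase (beta*J*z = 0.8 < 1): the only fixed point is m = 0.
print(magnetisation(beta=0.2))
```

Below the critical temperature the shared mean is nonzero; above it, the iteration collapses to \(m = 0\).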

Remark. In the above, we assume that the variational free energy is already a good match for the true system, which simply bypasses variational free energy minimisation and reaches the same conclusion—i.e., the consequence of minimising \(F_V\) in MFT is that the system behaves like a mean field under ideal circumstances.

We could have begun from spins rather than trial Hamiltonians, if we had so chosen. Assuming non-interacting spins implies \(\langle s_i s_j\rangle = \langle s_i \rangle \langle s_j \rangle\). If fluctuations vanish for at least one variable, then this is guaranteed to be true, by linearity of the expectation. The non-interacting spins approach leads naturally to a Hamiltonian which looks like

\[\langle s_j \rangle \sum_i s_i\]

by essentially the same argument as above. If the true \(\hat{H}\) is equal to this Hamiltonian, then the corrective term is minimised, and \(F_V \approx F\) automatically.
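The factorisation \(\langle s_i s_j\rangle = \langle s_i \rangle \langle s_j \rangle\) for non-interacting spins is also easy to check empirically. Here is a small Monte Carlo sketch; the sample size and the common mean \(m = 0.6\) are arbitrary choices for illustration:

```python
import random

random.seed(0)

m = 0.6                    # common mean of every spin (arbitrary choice)
p_up = (1.0 + m) / 2.0     # P(s = +1) so that E[s] = m for a +/-1 spin

def spin():
    """Draw one independent +/-1 spin with mean m."""
    return 1 if random.random() < p_up else -1

n = 200_000
pair_mean = sum(spin() * spin() for _ in range(n)) / n  # estimate of <s_i s_j>

# For independent spins the correlation factorises: <s_i s_j> = <s_i><s_j> = m**2.
print(pair_mean, m * m)
```

The estimated pair average sits within sampling error of \(m^2\), as independence demands.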

Finally, some further notes on the minimisation of variational free energy: the minimisation of free energy is about the stability of a configuration, and so it is tempting to say that minimising variational free energy tells us how something behaves. There are three fairly independent ideas at hand. We have that critical points (as in phase transitions) are the minima of thermodynamical free energy \(F\); variational free energy is an upper bound on thermodynamical free energy; and, if we assume the true system is a simple system which matches some trial Hamiltonian, this minimises their difference non-variationally.

  • On the variational side, we can think of work and energy relationships only in an abstract theory space, where the minimum free energy model—the energetically favourable, stable model—is the model with no work left to be extracted from it, or, the model which has already been fully optimised. That said, if the system is simple enough that \(F = F_V\), which in general is only possible when the Hamiltonian needn’t be perturbed, then minimising variational free energy in \(\lambda\) gives one back \(F\) (the term \(\Delta \hat{H}\) vanishes). Otherwise, we choose a Hamiltonian such that true and trial \(\hat{H}\) are close to each other, assume this is a faithful representation of the dynamics by some mean field argument, and then minimise \(F_V\) to find the best dynamical rule for configurations of the mean field.

  • We also mentioned that the function for \(m\) which arises from finding magnetisation can be read as telling us how the system behaves in the non-interacting spins limit; that is, assuming the full free energy function is the original system and that MFT tells us what the ‘right’ simplification of the system is given non-interacting spins, the \(m\) we produce is like the field that arises when passing to this limit. Now, \(m\) has been chosen a priori as an order parameter, so figuring out what such an order parameter looks like naturally tells us how magnetisation looks. Connections to Landau theory are incidental at best! One could say it reassures us that our model actually exhibits a phase transition, given we have a critical point of \(F_V(\lambda)\). But once again, the crucial idea is that the bound cannot reach across the arguments—minimising in \(m\) brings \(F_V\) close to \(F\). When all is said and done, choosing \(m\) as a control parameter is contingent on its eligibility to satisfy all of these criteria, as a useful simplification that still captures useful dynamics. So, if our model ‘magnetises’ (n.b. scare quotes) like \(m\), we assume the actual system will too—at least, approximately so.
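The point that the bound cannot reach across the arguments—that minimising \(F_V\) only squeezes it down toward \(F\) from above—can be seen numerically on a system small enough to solve exactly. Below is a sketch for a ten-site Ising ring; the sizes and parameters are hypothetical, and the exact free energy comes from brute-force enumeration:

```python
import math
from itertools import product

# Hypothetical small system: an N-site Ising ring (coordination number z = 2).
N, J, beta = 10, 1.0, 1.0
z = 2.0

# Exact free energy per site, by enumerating all 2^N spin configurations.
Z = 0.0
for s in product((-1, 1), repeat=N):
    energy = -J * sum(s[i] * s[(i + 1) % N] for i in range(N))
    Z += math.exp(-beta * energy)
f_exact = -math.log(Z) / (beta * N)

# Mean-field variational free energy per site, minimised over the effective field.
def f_V(lam):
    m = math.tanh(beta * lam)
    return -math.log(2.0 * math.cosh(beta * lam)) / beta - 0.5 * J * z * m * m + lam * m

f_mft = min(f_V(i * 1e-3) for i in range(4001))

print(f_exact, f_mft)
```

The exact per-site free energy sits strictly below the minimised variational one, exactly as the Bogoliubov inequality demands: the minimum of \(F_V\) is the closest the bound gets, not the value of \(F\) itself.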

Mean Field Theory in Statistical Inference

I’ll provide some further remarks, this time on Bayesian methods and variational inference. The Bogoliubov inequality appears in variational inference (even if it’s not initially apparent, everything in statistical mechanics is—you’ve guessed it—statistical). We know that the free energy functional is of critical importance in physics and statistics. Thermodynamical relationships follow from \(F\). The partition function \(Z\), which contains everything one needs to know about the system (it gives the probability measure over the entire state space), is \(e^{-\beta F}\). The gradient of the free energy generates dynamics for probabilistic systems, like the Fokker-Planck equation and weight updating in deep learning models. Energy-based learning frameworks suggest that most traditional inference algorithms minimise free energy. The free energy principle suggests that any system can be understood as exhibiting states that minimise free energy. At the very least, \(F\) generates models of the dynamics of a physical system, depending on the ontological commitments one makes. This is the reason why, for most any sort of system, we want to approximate \(F\) as closely as possible.

Many such ideas in turn rely on variational inference, which is prescribed in part by the Bogoliubov inequality. For the simple case of a trial density \(q(x)\) (indexed with a zero in the previous section and a \(q\) in this section) and a true density \(p(x)\) over states of a field, \(x\), the KL divergence of these quantities is literally a Helmholtz free energy, with the entropy written in microscopic terms and the internal energy being the mean of the true Hamiltonian under the trial density (this should make sense based on the previous discussion about how optimisation is like doing work). Suppose both \(q\) and \(p\) are exponential distributions, in terms of \(\hat{H}_q\) and \(\hat{H}\), respectively. In particular, let \(q = Z_q^{-1} e^{-\beta \hat{H}_q}\) (and likewise for \(p\)).
Putting everything into the KL divergence and simplifying it, we get

\[D_{KL}(q \, \| \, p) = \beta \langle \hat{H} - \hat{H}_q \rangle_q + \ln\left\{\frac{Z}{Z_q}\right\}.\]

Since \(D_{KL} \geq 0\), in turn we have

\[- \ln\{Z\} \leq - \ln\{Z_q\} + \beta \langle \hat{H} - \hat{H}_q \rangle_q,\]

which is the Bogoliubov inequality (proof in [Sak22, Section IIA], with an unfortunate but inconsequential misprint in the sign of one term and the direction of one bound at the top of page seven in the arXiv version. These counteract each other such that the expression as a whole is correct, which becomes apparent in the line immediately following). Procedurally, variational inference is typically a bit more complicated, in that we have a slightly harder problem. It is often the case that the true distribution is unknown or too complicated to be accessible even in principle—say, a difficult marginal distribution which is the posterior of some Bayesian estimation problem. In that case, we can’t minimise the distance away from something we don’t know. Let’s rewrite the Bogoliubov inequality using our new probability densities, to get

\[-\ln\{Z_{\mid y}\} \leq -\ln\{Z_q(\lambda)\} + \langle \hat{H}(x \mid y) - \hat{H}_q(x ; \lambda) \rangle_q \iff D_{KL}( q(x;\lambda) \,\|\, p(x \mid y) ) \geq 0,\]

where we have also set \(\beta = 1\) and introduced some parameter \(\lambda\) to the trial distribution \(q\) (not a notational accident; this is indeed a control parameter for our trial Hamiltonian). Just as in mean field theory, this KL divergence is the variational free energy of \(q\) as compared to \(p\), which is greater than (or equal to) the free energy of \(p\) in general. Again, though, this is usually not tractable. So, let any probability density \(p = Z^{-1} \tilde p\) where \(\tilde p\) is of the form \(e^{-\hat{H}}\). Note now that \(\hat{H} = -\ln\{\tilde p\} = -\ln\{Z p\}\). We do the following: using this assumption, Bayes’ theorem to expand the posterior, and linearity of the expectation throughout, we can rewrite the Bogoliubov inequality for \(q\) away from \(p(x \mid y)\) as

\[\begin{align*} -\ln\{Z_{\mid y}\} &\leq - \ln\{Z_q\} + \langle -\ln\{ Z_{\mid y} p(x \mid y) \} \rangle_q + \langle \ln\{Z_q q_\lambda(x) \} \rangle_q \\ -\ln\{Z_{\mid y}\} &\leq - \ln\{Z_q\} + \left\langle - \ln\{Z_{\mid y}\} - \ln\left\{ \frac{p(x,y)}{p(y)} \right\} \right\rangle_q + \langle \ln\{Z_q\} + \ln\{q_\lambda(x) \} \rangle_q\\ 0 &\leq \left\langle - \ln\left\{ \frac{p(x,y)}{p(y)} \right\} \right\rangle_q + \langle \ln\{q_\lambda(x) \} \rangle_q\\ 0 &\leq D_{KL}( q_\lambda(x) \,\|\, p(x,y) ) + \ln\{p(y)\}. \end{align*}\]

Et voilà, the ELBO is secretly the Bogoliubov inequality. (Though this derivation seems circuitous, putting the constant \(Z\) back in mirrors taking it out in our initial derivation that the KL divergence produces the Bogoliubov inequality.) In particular, now, the distance between \(q_\lambda(x)\) and \(p(x \mid y)\) is controlled by not just \(\lambda\), but \(\ln\{p(y)\}\), which one could read as scoring the difference between our trial distribution \(q\)’s performance for the ‘true’ distribution \(p(x,y)\) and the actual true distribution \(p(x \mid y)\). That is to say, we have the identity

\[D_{KL}\left( q_\lambda(x) \,\bigg\|\, \frac{p(x, y)}{p(y)} \right) = D_{KL}( q_\lambda(x) \,\|\, p(x, y) ) + \ln\{ p(y) \},\]

which implies the full relation

\[\ln\{Z_{\mid y}\} - \ln\{Z_q\} + \beta \langle \hat{H}_{\mid y} - \hat{H}_q \rangle_q = \ln\{Z_{,y} \} - \ln\{Z_q\} + \beta \langle \hat{H}_{,y} - \hat{H}_q \rangle_q + \ln\{p(y)\} \geq 0,\]

such that globally minimising the variational free energy occurs precisely when we have a particular optimal value of \(\lambda\), i.e., \(q_\lambda(x) \approx p(x,y)\), and a minimum for \(-\ln\{p(y)\}\). In particular, since \(\ln\{p(y)\}\) is negative by assumption (\(p(y)\) is also Gibbs), there is a further error to our approximation induced by matching to the wrong density, i.e., we have

\[-\ln\{Z_{,y} \} \leq - \ln\{Z_q\} + \beta \langle \hat{H}_{,y} - \hat{H}_q \rangle_q + \ln\{p(y)\} \leq - \ln\{Z_q\} + \beta \langle \hat{H}_{,y} - \hat{H}_q \rangle_q.\]
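All of these inequalities are easy to sanity-check on a finite state space, where partition functions are just sums. The sketch below uses arbitrary made-up energies for the true and trial Hamiltonians, and verifies both the identity \(D_{KL}(q \,\|\, p) = \beta \langle \hat{H} - \hat{H}_q \rangle_q + \ln\{Z/Z_q\}\) and the Bogoliubov inequality itself:

```python
import math

beta = 1.0
# Arbitrary made-up energies for a five-state system: true H and trial H_q.
H  = [0.0, 1.3, 2.1, 0.7, 3.0]
Hq = [0.0, 1.0, 2.0, 1.0, 2.5]

Z  = sum(math.exp(-beta * e) for e in H)    # true partition function
Zq = sum(math.exp(-beta * e) for e in Hq)   # trial partition function
p  = [math.exp(-beta * e) / Z for e in H]
q  = [math.exp(-beta * e) / Zq for e in Hq]

# Correction term <H - H_q>_q: the mean energy mismatch under the trial density.
corr = sum(qi * (e - eq) for qi, e, eq in zip(q, H, Hq))

kl_formula = beta * corr + math.log(Z / Zq)                       # beta<H-H_q>_q + ln(Z/Z_q)
kl_direct  = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))  # sum_x q ln(q/p)

# The two agree, and D_KL >= 0 is exactly -ln Z <= -ln Z_q + beta <H - H_q>_q.
print(kl_direct, kl_formula)
```

Nothing about the argument depends on these particular energies; any pair of Gibbs densities on a common state space will do.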

If you wish to look at it a different way, the Bogoliubov inequality arises because there is a distance between the trial and true distributions which is not necessarily zero, and minimising the KL divergence between these two things makes \(F_V\) approach \(F\) for some critical value of \(\lambda\). Taking the fact that the KL divergence of \(q_\lambda(x)\) and \(p(x\mid y)\) is equivalent to that of \(q_\lambda(x)\) and \(p(x,y)\) plus the log term, minimising the variational free energy of \(q\) and \(p(x \mid y)\) is equivalent to minimising either (ideally both, as in EM) terms. Two nice resources drawing those ideas together are this blog and this paper.
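Likewise, the identity relating the posterior and joint KL divergences can be verified directly on a tiny discrete model; the numbers in the joint table below are arbitrary illustrations:

```python
import math

# Arbitrary joint table p(x, y) for a fixed observed y; latent x in {0, 1, 2}.
p_joint = {0: 0.10, 1: 0.25, 2: 0.15}
p_y = sum(p_joint.values())                          # p(y) = 0.5
p_post = {x: v / p_y for x, v in p_joint.items()}    # p(x | y) by Bayes

q = {0: 0.3, 1: 0.4, 2: 0.3}                         # a trial density q_lambda(x)

def kl(a, b):
    """sum_x a(x) ln(a(x)/b(x)); b need not be normalised."""
    return sum(a[x] * math.log(a[x] / b[x]) for x in a)

lhs = kl(q, p_post)                      # D_KL(q || p(x|y))
rhs = kl(q, p_joint) + math.log(p_y)     # D_KL(q || p(x,y)) + ln p(y)

print(lhs, rhs)
```

The two sides coincide to machine precision, so minimising the divergence from the intractable posterior and minimising the divergence from the joint (plus the constant \(\ln\{p(y)\}\)) are the same optimisation, which is the whole point of the ELBO.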

Conclusion

To recap, we went over

  • why it is we minimise \(F_V\) in MFT,
  • how this is equivalent to approximating the system with a field of means,
  • what minimising \(F_V\) actually means, and
  • what it tells us about other types of approximations in statistics.

Mean field theory is one of my favourite things in statistical physics, partially because it takes something that should be ugly—heuristic, even mindless approximation of something that we feel like ‘should’ work—and makes it into something rather elegant, with important things to say about physics and statistics itself. Hope you’re coming round to it as well.