High Energy Physics and Energy-Based Learning
Last week, I released a set of results proving some identities from gauge theory in the context of maximum entropy (preprint arXiv:2203.08119). In a more detailed sketch than the linked tweet, but staying less formal than the linked paper, I’ll comment a little on the ‘hidden stuff’ in that paper—this deconstruction should help readers read between the lines.1
I have previously taken the position that inference is secretly a really high-tech way of doing analysis (i.e., the mathematical study of functions and their outputs). This is not strictly a new idea, but it has been foundational to my research: the view that statistical inference might be of broad interest to pure mathematicians, and that describing inference correctly and completely is a mathematical problem with a mathematical solution, was part of the key to starting a viable research programme in mathematical complex systems theory. In any case, this suggests that statistical inference is reducible to a big integral for a really big stochastic dynamical system describing one’s data; by learning the relationships between data points, or between space-time inputs and data outputs, inference effectively solves a process generating that data, without knowing either the process a priori or the solution to that process. That is, somehow, model selection and model solution coincide, such that we don’t really need to do analysis.
In this paper, I take a slightly new2 position which follows from this observation: we don’t actually solve anything when we do inference. In fact, we don’t care about the underlying equation at all, and moreover, we don’t need to. Inference works so well precisely because, in order to find the underlying process, it stops caring about the underlying process. Instead, it studies the shape of the data, and figures out what the process ought to look like that way. This is similar in spirit to the way that humans form intuitions about the dynamics of the world around us from previous experience of what is and isn’t possible, bypassing the need to actually calculate the action-minimising trajectories of the bodies surrounding us. Here’s how we might make that mathematically precise:
We start with a foundational model for statistical inference, which is called either constrained maximum entropy (by statistical physicists) or energy-based learning (EBL, by statistical theorists). There’s a lot of work showing the ubiquity of constrained maximum entropy and EBL-type formalisms, so I won’t go into that here, but they really are foundational. It’s also slightly non-trivial that maximum entropy under appropriate constraints reproduces energy-based learning frameworks, but that’s true as well (see here for some demonstrations).
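As a reminder of why that reproduction works (a standard variational sketch, not the paper’s derivation verbatim), maximise the entropy of \(p\) subject to a fixed expected value of a constraint function \(J(x)\) and to normalisation:

\[\mathcal{L}[p] = -\int p(x) \ln p(x)\, dx - \lambda \left( \int J(x)\, p(x)\, dx - c \right) - \mu \left( \int p(x)\, dx - 1 \right).\]Setting the functional derivative \(-\ln p(x) - 1 - \lambda J(x) - \mu\) to zero gives \(p(x) \propto \exp\{ -\lambda J(x) \}\): exactly the Gibbs-type density that energy-based learning parameterises, with the constraint function playing the role of the energy. (Below, the multiplier \(\lambda\) is absorbed into \(J\).)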
The other thing we should begin with is not statistical physics, but high energy physics (HEP). This is the physics of particles, fields, and strings, whose energy scales are much higher than those of everyday solids and liquids (what we call ‘condensed matter’). If it isn’t statistical, then what’s the relationship? A key to the whole story is that one part of HEP, gauge field theory, really lends itself to the geometrisation of dynamical systems. A gauge potential is a choice of gauge at every point in space-time, which is just some quantity that affects the dynamics of a particle with a gauge symmetry (an intuitive overview of gauge theory is way beyond the scope of this post, but I might revisit it later; for now, think of every particle’s state as having some context, like an arbitrary set of coordinates in which that state is expressed). My favourite discussion of this geometric relationship is in the opening to the famous paper by Baez and Schreiber, Higher Gauge Theory, and in the slightly more accessible An Invitation to Higher Gauge Theory by Baez and Huerta. There, Baez and coauthors rely on an interpretation of something called parallel transport as a geometric encoding of how a particle interacts with a gauge potential—how its state transforms—as it moves along a space-time trajectory. It wasn’t a new idea at the time, but it really comes to life in those papers.
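To make ‘parallel transport of a state along a trajectory’ a little more tangible, here is a toy numerical sketch (mine, not anything from those papers; the gauge potential \(A\) and the path are made up for illustration). For a particle with a U(1) gauge symmetry, transporting its phase along a one-dimensional path just accumulates the factor \(\exp\{-i \int A\, dx\}\): the state is carried along and transformed by the potential, rather than solved for.

```python
import numpy as np

# Made-up gauge potential A(x) along a 1D path (purely illustrative).
def A(x):
    return np.sin(x)

# Discretise the path from x = 0 to x = pi.
xs = np.linspace(0.0, np.pi, 2001)

# Trapezoidal estimate of the line integral of A along the path.
line_integral = np.sum(0.5 * (A(xs[1:]) + A(xs[:-1])) * np.diff(xs))

# Parallel transport of the particle's internal state (a U(1) phase):
# the state picks up the path-dependent factor exp(-i * integral of A dx).
psi_start = 1.0 + 0.0j
psi_end = psi_start * np.exp(-1j * line_integral)

print("transported state:", psi_end)
```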
So we have some useful ingredients: a mathematical and physical theory, gauge theory, which tells us how dynamics in a space express geometric features of that space, and a connection to the mathematics and physics of inference via maximum entropy. Suppose we can bring these things together. Beyond the obvious utility of answering the question above, there’s something conceptually satisfying about uniting the two most general theories available to modern physics (though, depending on who you ask, that’s secondary to actually solving the problem). Indeed, we can bring them together, in a really useful way.
Imagine we have a dynamical system sampling from or generating the state space we are interested in. This system produces the data an inferential process tries to characterise. By placing constraints on different states (data points, say), we have both 1) a potential function for dynamics on the manifold we call the state space, and 2) an expression of preferences for different states in the space. This idea of preferences gives us a weighting on the probability of observing a given state, one which decreases (exponentially, it turns out) as the constraint on that state grows. Indeed, this is exactly what (one reading of) maximum entropy allows us to say. The first, more dynamical perspective is where we start from; it turns out we can recover the second view from the first, which is the whole point (and in some sense the main result) of the paper.
We just mentioned there is a principled way of speaking about dynamics under potentials with some geometric meaning. The core of the construction, at the intuitive level, is as follows: critically, we look at the flow of probability through the state space. This is predicated on some process flowing on the state space—as it visits states, it maps them to a probability. So, the flow of probability through the state space ought to be determined by the potential function on the state space. Moreover, this coupled flow is what we would call a lift—a function associating a flow in some base space to a simultaneous flow in some other space. There is already a special sort of lift that obeys a constraining potential function, in that these lifts are the solutions to a Newtonian system of ODEs relating the dynamics of the curve exactly to the potential function—these are called horizontal lifts. The image of a horizontal lift is given by parallel transport. In parallel transport, the potential we are interested in is termed a connection.
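Before going further, here is a minimal numerical sketch of a horizontal lift (my own toy, with a made-up constraint function; not code from the paper). We take a base curve \(x(t)\) in a two-dimensional state space, treat the constraint \(J(x)\) as the potential, and transport a fibre value \(p\) along the curve using the first-order rule \(\dot{p} = -\nabla J(x(t)) \cdot \dot{x}(t)\, p\). The resulting lift is pinned down entirely by the potential along the path; nothing is solved in the analytic sense.

```python
import numpy as np

# Made-up constraint (potential) on a 2D state space, and its gradient.
def J(x):
    return 0.5 * np.dot(x, x)

def grad_J(x):
    return x

# A base curve x(t) in the state space (any smooth path will do).
ts = np.linspace(0.0, 1.0, 10001)
curve = np.stack([ts, ts ** 2], axis=1)

# Horizontal lift: transport p along the curve with dp = -<grad J, dx> p.
p = 1.0
for i in range(len(ts) - 1):
    dx = curve[i + 1] - curve[i]
    p *= np.exp(-np.dot(grad_J(curve[i]), dx))

# Because the connection comes from a potential, the lifted value depends
# only on J at the endpoints of the base curve.
expected = np.exp(-(J(curve[-1]) - J(curve[0])))
print(p, expected)  # agree up to discretisation error
```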
In more detail: when the potential can be read as a deformation of the space, geodesic curves on that deformed geometry can be thought of as being given by parallel transport. Suppose we have an ensemble of states and want to lift those states into some other space encoding some extra information, like a space of scalars attaching a probability to each state. What that lift looks like now depends on the potential. When we do inference and try to learn probabilities over data, we are trying to find what that lift is—i.e., what \(p(x)\) is. Equating this with a simple flow in a potential has a nice interpretation: the assignment of probabilities to states is not the solution of a Fokker-Planck equation, but the parallel transport of points \(p(x)\), a quantity interacting with the potential on states. This view implicitly takes \((x, p(x))\) to be a surface containing probabilistic points (that is, real scalar numbers, each associated to a state). Parallel transport means we describe the density as being made up of lines along some surface, whose shapes are determined by the constraints on the states underlying the surface. This in turn means a simple equation for a flow in a potential is sufficient to determine the shape of the density. No integrals necessary. So, incorporating into our lifting process the interaction between flows in the probability space over states and the potential gives a nice conceptual explanation of the effectiveness of maximum entropy, and should be possible using parallel transport.
The equation for parallel transport takes the form of a Newtonian relation for geodesic flows,
\[\nabla \nabla p(x) = - \nabla (\nabla J(x) p(x)), \tag{1}\label{1}\]yielding the second-order spatial change in a point \(p(x)\) under a potential function \(\nabla J(x) p(x)\). Note that I mean this a bit loosely; this Newtonian relation is usually for the temporal change in the point, and the potential isn’t quite what we want. Nonetheless, it’s close enough that it gives us a reason to keep investigating. By the gradient theorem, when we integrate \eqref{1} (ignoring some ultimately unimportant technical details) we get the system of ODEs for parallel transport in \(\mathbb{R}^n\) under a potential \(J(x)\),
\[\nabla p(x) = - \nabla J(x) p(x).\]The solution to this equation is obviously \(\exp\{ -J(x) \}\), up to normalisation. Attentive readers might recognise \eqref{1} as the stationary Fokker-Planck equation; indeed, \(\exp\{ - J(x) \}\) is the solution to that too. The equivalence is more principled still—the parallel transport equation rearranges to
\[- \frac{\nabla p(x)}{p(x)} - \nabla J(x) = 0,\]which integrates to the Euler-Lagrange equation for the maximisation of entropy subject to a constraint function, \(-\ln\{p(x)\} - J(x) = 0\). The entropy-maximising density is famously the stationary solution to the Fokker-Planck equation (see work by Otto, Villani, and others). So, the purely analytic perspective confirms the general idea. Establishing as much is the content of Theorem 1.
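As a quick numerical sanity check of both claims (a sketch with a made-up one-dimensional constraint, not anything from the paper), the density \(p(x) \propto \exp\{-J(x)\}\) should satisfy the parallel transport equation \(\nabla p = -\nabla J\, p\) and the stationary Fokker-Planck equation \eqref{1} simultaneously:

```python
import numpy as np

# Made-up constraint function on a 1D state space.
def J(x):
    return 0.25 * x ** 4 - 0.5 * x ** 2

xs = np.linspace(-3.0, 3.0, 4001)
p = np.exp(-J(xs))                     # candidate (unnormalised) density

dJ = np.gradient(J(xs), xs)            # numerical gradient of the constraint
dp = np.gradient(p, xs)                # numerical gradient of the density

# Parallel transport equation: grad p + grad J * p = 0.
transport_residual = dp + dJ * p

# Stationary Fokker-Planck equation: grad grad p + grad(grad J * p) = 0.
fp_residual = np.gradient(dp, xs) + np.gradient(dJ * p, xs)

print(np.max(np.abs(transport_residual)))  # ~0, up to finite-difference error
print(np.max(np.abs(fp_residual)))         # ~0, up to finite-difference error
```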
The point of the paper is rather geometric, though, in that we want to look at parallel transport as a geometric feature of a dynamical theory. In the sense that parallel transport flows encode the geometry of a state space in which something flows, this is certainly possible. A lift is the image of a kind of generalised function, which takes a base or input space and maps points therein to a different space ‘expanding’ the base space. When we say expansion, we mean that we associate a copy of the target space to every point of the input space, and some kind of internal function (called a section) lifts each point on the base into a point in the target space over that point. This is a well-studied construction; we call it a fibre bundle. It’s like a bundle of fibres over an input space, since every copy of the target space sticks out like the fibres of a hairbrush. The choice of section, and the lifts it generates, is what we identify with inference, in that inference fashions a probability density over states. Instead of writing down that function in an explicit form, we can focus on the shape of the lifts as it is induced by the potential function. This requires a tonne of geometric formalism, but the end result is precisely what we wanted: that parallel transport happens at maximum entropy is a formal expression of the fact that the constraints change the geometry of horizontal lifts in the probabilistic space (which we identify as a fibre bundle with a copy of \(\mathbb{R}\) at each state; technically we mean \(\mathbb{R}_{\geq 0}\), but this positivity condition can be hard-coded as a constraint, which preserves the nice geometric features of real line bundles).
The point of using gauge theory is that there are actually two bundles here, and one affects the dynamics in the other: there is a probabilistic space we have identified as a real line bundle, which is where probability densities live as lifts under a section from the state space to the bundle. But the shape of that density is affected by a choice of constraints, which come from their own bundle. The coupling between these two bundles is called the associated bundle construction, which we can give explicitly once we construct the bundle the constraints come from.
A large portion of these results is now purely technical material: constructing the bundles we need, and showing that the constraints induce a potential function (a connection) changing the evolution of lifts, is pretty easy once we identify the relevant properties of bundles, connections, and constraints. Since probability maps into a line bundle, there exists a covariantly constant section, which means that parallel transport actually does have a totally horizontal solution. There’s something to be said post hoc for the fact that classical probability only makes sense in terms of real scalars, and bundles containing real scalars make parallel transport work the way we want. The horizontal paths with respect to the connection—these are composed of parallel-transported points and, identically, are paths obeying a potential—turn out to be loci of equiprobable states, arising from the level sets of the constraint function. This comes from pulling back these level sets to the state space, and then lifting those into the probabilistic space to produce a probability from a constraint on states. In physics, the object taking this extra information back down to the state space is what we call a gauge field; in mathematics, it’s called a pullback connection or a local connection form. Establishing all of this pulls everything together, and we are able to get the main result in Theorem 3.
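To see the level-set statement in action (again a toy of mine, with a made-up constraint on a two-dimensional state space): the local connection form is built from \(\nabla J\), which has no component along a level set of \(J\), so transporting \(p\) along a path lying inside a level set changes nothing; the lift stays flat, and every state on the path is assigned the same probability.

```python
import numpy as np

# Made-up constraint on a 2D state space; its level sets are circles.
def grad_J(x):
    return x   # gradient of J(x) = 0.5 * <x, x>

# A closed path lying inside the level set J = 0.5 (the unit circle).
ts = np.linspace(0.0, 2.0 * np.pi, 5001)
path = np.stack([np.cos(ts), np.sin(ts)], axis=1)

# Transport p along the level-set path: dp = -<grad J, dx> p at each step
# (midpoint rule for the connection along each segment).
p = 1.0
for i in range(len(ts) - 1):
    dx = path[i + 1] - path[i]
    mid = 0.5 * (path[i] + path[i + 1])
    p *= np.exp(-np.dot(grad_J(mid), dx))

print(p)  # stays at 1: the horizontal path is a locus of equiprobable states
```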
The utility of these results is that they allow us to conceptualise inference as a horizontal lift. This is ultimately Jaynes’ original idea: that heat flow (in this case, constrained heat flow) performs inference. Flows in the state space are constrained by the shape of the data, so by exploring the state space (prosaically: flowing over the data) there is a very precise sense in which one learns a description of the shape of that data. Here, that flow occurs at maximum entropy, by finding parallel transport lines; the description we learn is a probabilistic one about how likely or preferable a state is. This has the very concrete interpretation that the solubility of the Fokker-Planck equation in the stationary regime corresponds to the existence of parallel transport, an interesting analysis-type result. The point of these results is to offer a principled mechanism for how constrained maximum entropy does what it does, and thus, why energy-based learning algorithms do so well on problems that are analytically challenging.
1. See Thurston, On Proof and Progress in Mathematics, page seven. By ‘hidden stuff,’ I mean the unstated motivations for various things in the paper, or the ‘point’ of any given structure in the paper—whether a technical choice like the particular statement of a proof, or a stylistic choice like the order of exposition, etc. By reading between the lines, I (and Thurston) mean hacking an easy path through the paper, translating the information-dense encoding of the language of formal mathematics into something a little easier to read, with less extra adornment and more intuition for what’s essential. With a bit of cheek, we can say we’re doing parallel transport over the paper: getting a feel for its shape, rather than computing its solution. However we describe it, this is an important part of making a paper a bit more palatable. ↩
2. The results I will discuss here actually date back to a talk I gave just a bit more than a year ago, presenting an idea I had had when studying the nature of solutions to diffusion under the path integral as a kind of fibre integral. The idea at the time was that there ought to be some sort of geometric intuition for how statistical inference (like the path integral) cobbles together the right solution to diffusion almost always, despite not being analytic in nature. The result described here is one facet of the results discussed during that talk, although it is a slightly simpler construction. ↩