Activation Functions and Models of Magnets
As stated on the home page of this website, I’m going to start writing the occasional blog post about my work. I have resisted the temptation for a while, but I think it would be nice to post about specific things (e.g. papers) every once in a while; since publishing in pure mathematics really does happen only once in a while, it shouldn’t be too much to keep on top of. This inaugural post concerns a paper that I preprinted some time ago, which discusses the activation function in neural networks. The work in question can be found here: Formalising the Use of the Activation Function in Neural Inference.
As someone who does occasional work on statistical mechanics, I am quite familiar with the Ising model, and with universality. Universality refers to the fact that certain classes of phase transition, which are nothing but changes in the qualitative character of a system (such as changing from a solid to a liquid to a gas), share key dynamical features. In this sense, a model that exhibits a phase transition in a given universality class can faithfully stand in for other systems in the same class, even if they are microscopically quite different. Universality is explained well in David Tong’s notes on statistical field theory, and in John Cardy’s Scaling and Renormalisation in Statistical Physics. There are some interesting references in this tweet too, and this response to that tweet discusses the same thing that is covered in chapter three of Tong’s notes.
In the course of working on something else, I came across an interesting case of the Ising universality class serving as a kind of toy model of a neural network. The paper discusses exactly this, and I read it as motivating the activation function ML people are familiar with beyond ‘it works.’ This sits squarely within my broader research mission of formulating theories of complex systems, like ML or neuroscience, axiomatically, so I thought it was worth writing up. In that sense, hopefully it goes beyond the curiosity of ‘a neural net looks a bit like an Ising model; isn’t that clever.’
The way a simple neurone might be claimed to work is by the following essential anatomy of an action potential: the resting potential sits slightly below ionic equilibrium, so there is a gradient driving positive ions across the cell membrane into the cell. The neurone keeps the membrane potential low by pumping these ions back out as they diffuse in. When the voltage of the neural cell rises to a critical point, sensors in the neurone ‘un-gate’ the closed channels that prohibit mass diffusion and shut down the pumps that get rid of these ions, and positive ions flood the cell, carrying the potential past even the equilibrium point. This is the characteristic spike in membrane potential that we read as an action potential, and, forgoing some dynamical details about how exactly this model would work, it is sufficient to describe an action potential. We’ll hold on to this thought for now, and lay out the necessary physics; then we’ll revisit this model.
The Ising model, on the other hand, is a model of a magnet in terms of its electronic spins. It is a lattice of atoms and their electrons, the latter of which possess a quantity called ‘spin’ from quantum mechanics. It isn’t important what spin is, and it probably doesn’t really have an intuitive explanation, other than to say that it points ‘up’ or ‘down.’ When all spins are aligned in one such direction, the metal has a magnetic moment from this alignment. The Ising model exhibits a phase transition: when the metal is heated, the energy added by the heat causes these electrons to start flipping their spins, just like molecules in a solid vibrating at high speeds when heated. This disorders the configuration of the lattice, and zeroes the magnetic moment \(m\). The converse also holds: when the magnet is cooled once more, past a critical point \(m\) goes to one again. This is like a liquid solidifying below zero centigrade.
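For reference, and in my own notation rather than necessarily the paper’s, the standard nearest-neighbour Ising Hamiltonian is
\[
H = -J \sum_{\langle i, j \rangle} s_i s_j - h \sum_i s_i, \qquad s_i \in \{-1, +1\},
\]
where \(J\) is the coupling between neighbouring spins and \(h\) is an external magnetic field, and the magnetic moment per spin is
\[
m = \frac{1}{N} \sum_{i=1}^{N} s_i.
\]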
One way of modelling the dynamics of the Ising model is by mean field theory (MFT), which looks at the ‘average’ dynamics of a system. The way we do MFT is by approximating the system with a simpler system that is related in some way to the original (specifically, we use a trial Hamiltonian whose associated variational free energy bounds the true free energy from above, by the Bogoliubov inequality). The typical way in which this is done, and in some sense the most meaningful, is by replacing the system with a ‘mean field’ and ignoring fluctuations in the dynamics of the smaller constituents of the system. MFT is thus a theory of mean fields, ignoring the fluctuations around a mean (consider the decomposition \(\delta A(t) = A(t) - \langle A \rangle\) for the dynamics of an observable \(A\) in the presence of fluctuations; when we ignore fluctuations and the term \(\delta A(t)\) vanishes, we have \(A(t) = \langle A \rangle\)). It is also a multiscale modelling method, if these fluctuations are due to the dynamics of smaller things in a large system, and the mean field is representative of the dynamics of the larger things. In this sense MFT is not just a clever approximation (not in my view, anyway) but has some extra ontological importance as a method of understanding a system’s dynamics at multiple scales.
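To spell out the inequality I keep referring to (in standard textbook notation, which may differ from the paper’s): for a trial Hamiltonian \(H_0\) with free energy \(F_0\) and ensemble average \(\langle \cdot \rangle_0\), the Gibbs–Bogoliubov inequality states
\[
F \le F_0 + \langle H - H_0 \rangle_0,
\]
so minimising the right-hand side over the parameters of \(H_0\) gives the best mean field approximation available within that family of trial Hamiltonians.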
In Section IIC we do exactly this, deriving a mean field model of the Ising model. It’s nothing new; I just go through it for the purposes of the paper. I actually use a short cut that neglects fluctuations directly, rather than performing the full formal derivation. This is, in fact, equivalent to the formal procedure: neglecting fluctuations produces a simpler trial Hamiltonian whose variational free energy is never less than the real one, and we then minimise that variational free energy with respect to a control parameter, exactly as the Bogoliubov inequality suggests. Again, it’s nothing but a short cut.
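As a quick sanity check of that shortcut, here is a minimal numerical sketch (my own, not code from the paper) that solves the standard mean field self-consistency condition \(m = \tanh\big((Jzm + h)/T\big)\) (in units where \(k_B = 1\), with \(z\) the coordination number of the lattice) by fixed-point iteration, and sweeps the temperature through the mean field critical point \(T_c = Jz\):

```python
import numpy as np

def mean_field_m(T, J=1.0, z=4, h=0.0, tol=1e-10, max_iter=10_000):
    """Solve m = tanh((J*z*m + h) / T) by fixed-point iteration."""
    m = 1.0  # start from the fully ordered state to pick out the positive branch
    for _ in range(max_iter):
        m_new = np.tanh((J * z * m + h) / T)
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m

# Sweep the temperature through the mean field critical point T_c = J * z = 4.
for T in np.linspace(1.0, 8.0, 15):
    m = mean_field_m(T)
    print(f"T = {T:4.2f}   m = {m:+.4f}")
```

Below \(T_c\) the iteration settles on a non-zero \(m\); above it, \(m\) collapses to zero, which is the phase transition described above.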
Revisiting the model we discussed earlier: in Sections IIA and IIB we discuss the similarities between the Ising lattice and the network of channels in the cell membrane, with a spin equivalent to a channel containing a diffusing ion (\(+1\)) or not (\(-1\)). We consider the Ising model in a hot environment with an occasional quench applied, modelling an external stimulus that cools the lattice. In the neural case, this is a cell body in a disordered environment (some channels closed and some open) which is occasionally quenched and cooled below the critical temperature. When this cooling happens, all the channels open, the analogous Ising model is positively magnetised, and as the magnetic moment goes to one, we have an action potential. When the quench is removed, the neurone ‘heats up’ again, and the channels close just as the spins become disordered again.
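To visualise that story, here is a toy time trace (again my own sketch, with made-up temperatures and time constants rather than the paper’s parameterisation): the lattice sits above the critical temperature, a stimulus holds it below \(T_c\) for a short window, and then it relaxes back, with the mean field magnetisation tracked throughout.

```python
import numpy as np

def mean_field_m(T, J=1.0, z=4, n_iter=2000):
    """Equilibrium mean field magnetisation at temperature T (positive branch)."""
    m = 1e-3  # small positive seed so the ordered solution is found below T_c
    for _ in range(n_iter):
        m = np.tanh((J * z * m) / T)
    return m

T_hot, T_quench = 6.0, 1.0                      # ambient and quenched temperatures (T_c = 4)
quench_start, quench_end, tau = 2.0, 3.0, 1.0   # stimulus window and re-heating time constant

for t in np.linspace(0.0, 8.0, 17):
    if t < quench_start:
        T = T_hot                    # disordered environment, channels half open
    elif t < quench_end:
        T = T_quench                 # stimulus holds the lattice below T_c
    else:
        # after the stimulus, relax exponentially back up to the hot environment
        T = T_hot + (T_quench - T_hot) * np.exp(-(t - quench_end) / tau)
    m = mean_field_m(T)
    print(f"t = {t:3.1f}   T = {T:4.2f}   m = {m:.3f}   {'*' * int(20 * m)}")
```

The magnetisation jumps to (nearly) one while the lattice is held cold and collapses back to zero as it reheats, which is the spike-like behaviour being mapped onto the action potential.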
The key consequence of this phase transition is the following: if we claim a neurone is in the Ising universality class, and the indicated magnetisation function (Figure 1 in the paper) arises from a mean field model of the Ising model, then the typical sigmoidal activation function is a mean field model of a real neurone. Correspondingly, if we think of real neurones as the thing to aspire to in building artificial neural networks (and human brains are the best learners that we know of), then this explains why the sigmoidal class of activation functions has been so successful.
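One way to make that concrete, as an illustration of my own rather than a figure from the paper, is to fix a temperature above the critical point and solve the same self-consistency condition as a function of the applied field \(h\); the resulting input–output curve is sigmoidal in the field, much like the \(\tanh\) activation used in neural networks.

```python
import numpy as np

def mean_field_m(h, T, J=1.0, z=4, n_iter=5000):
    """Self-consistent m = tanh((J*z*m + h) / T) at fixed temperature and field."""
    m = np.tanh(h / T)  # start from the non-interacting guess
    for _ in range(n_iter):
        m = np.tanh((J * z * m + h) / T)
    return m

T = 6.0  # above the mean field critical temperature T_c = J * z = 4
for h in np.linspace(-5.0, 5.0, 11):
    m = mean_field_m(h, T)
    print(f"h = {h:+4.1f}   m(h) = {m:+.3f}   tanh(h / T) = {np.tanh(h / T):+.3f}")
```

The interacting curve is steeper than the bare \(\tanh(h/T)\), but both saturate the same way: the lattice takes an input (the field) and returns a bounded, sigmoidal output (the magnetisation), which is exactly the shape of a standard activation function.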
If we then consider the quench in more detail, and look at it dynamically, we can parameterise the function for \(m\) in terms of time and look at how ‘hard’ the quench is, and how long we take to heat up. If we assume a linear relationship between \(t\) and the heat flowing back in, then the time spent magnetised is akin to a firing rate (i.e., spikes per second multiplied by the number of seconds gives us the number of spikes we have in the time we were allowed to spike). Analysing this gives us back the ReLU function and other functions with similar shapes.
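Here is a toy version of that argument (mine, with an assumed instantaneous quench and a linear re-heating rate, not necessarily the paper’s exact parameterisation): a stronger input cools the lattice further below the critical temperature, the temperature then climbs back linearly, and the ‘firing rate’ is read off as the fraction of an observation window spent below \(T_c\). The response is zero until the input is strong enough to cross \(T_c\), then grows linearly, which is precisely the rectified, ReLU-like shape.

```python
import numpy as np

def firing_rate(x, T_ambient=6.0, T_c=4.0, heating_rate=1.0, window=10.0):
    """Fraction of an observation window spent below T_c after a quench of depth x.

    The input x is assumed to cool the lattice instantaneously from T_ambient
    to T_ambient - x, after which the temperature climbs back linearly.
    """
    T_start = T_ambient - x                 # temperature right after the quench
    if T_start >= T_c:
        return 0.0                          # never ordered, so no spikes at all
    t_below = (T_c - T_start) / heating_rate
    return min(t_below, window) / window    # clip to the observation window

for x in np.linspace(0.0, 6.0, 13):
    rate = firing_rate(x)
    print(f"input x = {x:3.1f}   rate = {rate:.3f}   {'#' * int(40 * rate)}")
```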
This paper is important to me for three reasons. For one thing, as stated previously, it formalises the question of why sigmoidal activation functions are needed to classify things, in a more motivated way than ‘it doesn’t work without them.’ For another, it shows that if we consider the conventional artificial neural network to be a mean field model of a real neural network, then this approximation of spiking dynamics necessarily looks as though it has a sigmoidal or rectified activation function. For a third and final reason, it sheds light on other interesting questions about ANNs by way of statistical mechanical analogy, which are examined in the discussion.
This has been a bit long, so hopefully it has some utility as a reading guide. What I will likely do from here onwards, if only to make this useful for myself, is deposit similar reading guides and summaries on this blog. Ideally, readers will also find them useful, and others will end up enjoying my work too. Do let me know if this was the case for you; I’m always a fan of public outreach (read: not screaming into the void).